====== Align your reads to a reference genome ====== ===== Reference genome index ===== You'll need a reference genome index to align your reads in FASTQ files. There are prebuilt indexes on the ABI server at the following directory: ''/mnt/db/genomes'' If you don't find the genome of your interest, you can build the index yourself, by following the steps in the following page [TOWIKI]. ===== Choosing the alignment program ===== There are several programs designed for the job. The choice depends on your task. A general rule of thumb: * If you have WGS or WES data, use ''bwa mem''. It is accurate for SNP and INDEL calling. * If you Short reads or ChIP-seq, use ''Bowtie2''. You can also use it if you prefer the alignment to be faster, but at the expense of accuracy in variant calling. * For RNA-seq, use ''STAR'', as it is splicing-aware. ===== Example script for bwa-mem and paired-end reads ===== ==== Daughter script ==== align() { local fastq_1="$1" local fastq_2="$2" local output_dir="$3" local reference="$4" local threads="$5" # Extract the filename without extension filename=$(basename "$fastq_1" | sed -E 's/(_1)?\.(fastq|fq)\.gz//') echo "Aligning $filename using BWA mem with $threads threads..." # Perform alignment with specified number of threads bwa mem -t "$threads" "$reference" <(zcat "$fastq_1") <(zcat "$fastq_2") > "$output_dir"/"$filename".sam # Convert SAM to BAM samtools view -@ "$threads" -bS "$output_dir"/"$filename".sam > "$output_dir"/"$filename".bam # Sort the BAM file samtools sort -@ "$threads" -o "$output_dir"/"$filename".sorted.bam "$output_dir"/"$filename".bam # Index the sorted BAM file samtools index "$output_dir"/"$filename".sorted.bam # Clean up intermediate files rm "$output_dir"/"$filename".sam "$output_dir"/"$filename".bam echo "Finished processing $filename" } # Input arguments input_dir=$1 output_dir=$2 reference=$3 threads=$4 # Validate input if [[ -z "$input_dir" || -z "$output_dir" || -z "$reference" || -z "$threads" ]]; then echo "Usage: $0 " exit 1 fi # Loop through files sequentially, matching both fastq.gz and fq.gz for fastq_1 in "$input_dir"/*_1.{fastq,fq}.gz; do if [[ -f "$fastq_1" ]]; then fastq_2="${fastq_1/_1․/_2․}" if [[ -f "$fastq_2" ]]; then align "$fastq_1" "$fastq_2" "$output_dir" "$reference" "$threads" else echo "Missing pair for $fastq_1" fi fi done ==== Parent script ==== #!/bin/bash #SBATCH --mem=40gb #SBATCH --cpus-per-task=8 #SBATCH --output=../log/align_00.log #SBATCH --job-name=align_mydata # Parameters (example, modify as needed) input_dir=fastq output_dir=bam reference="/mnt/db/genomes/homo_sapiens/GRCh38.p14/bwa_mem_0.7.17-r1188/GCF_000001405.40_GRCh38.p14_genomic.fna" threads=10 src/align.sh $input_dir $output_dir $reference $threads Note, the reference genome FASTA file (.fna or .fa) should be located in the same directory where the bwa indexes are located (.amb, .ann, .bwt, .pac, .sa).