Align your reads to a reference genome

Align your reads to a reference genome

Reference genome index

You'll need a reference genome index to align your reads in FASTQ files. There are prebuilt indexes on the ABI server at the following directory:

/mnt/db/genomes

If you don't find the genome of your interest, you can build the index yourself, by following the steps in the following page [TOWIKI].

Choosing the alignment program

There are several programs designed for the job. The choice depends on your task. A general rule of thumb:

If you have WGS or WES data, use bwa mem. It is accurate for SNP and INDEL calling.
If you Short reads or ChIP-seq, use Bowtie2. You can also use it if you prefer the alignment to be faster, but at the expense of accuracy in variant calling.
For RNA-seq, use STAR, as it is splicing-aware.

Example script for bwa-mem and paired-end reads

Daughter script

align() {
    local fastq_1="$1"
    local fastq_2="$2"
    local output_dir="$3"
    local reference="$4"
    local threads="$5"

    # Extract the filename without extension
    filename=$(basename "$fastq_1" | sed -E 's/(_1)?\.(fastq|fq)\.gz//')

    echo "Aligning $filename using BWA mem with $threads threads..."

    # Perform alignment with specified number of threads
    bwa mem -t "$threads" "$reference" <(zcat "$fastq_1") <(zcat "$fastq_2") > "$output_dir"/"$filename".sam

    # Convert SAM to BAM
    samtools view -@ "$threads" -bS "$output_dir"/"$filename".sam > "$output_dir"/"$filename".bam

    # Sort the BAM file
    samtools sort -@ "$threads" -o "$output_dir"/"$filename".sorted.bam "$output_dir"/"$filename".bam

    # Index the sorted BAM file
    samtools index "$output_dir"/"$filename".sorted.bam

    # Clean up intermediate files
    rm "$output_dir"/"$filename".sam "$output_dir"/"$filename".bam

    echo "Finished processing $filename"
}

# Input arguments
input_dir=$1
output_dir=$2
reference=$3
threads=$4

# Validate input
if [[ -z "$input_dir" || -z "$output_dir" || -z "$reference" || -z "$threads" ]]; then
    echo "Usage: $0 <input_dir> <output_dir> <reference> <threads>"
    exit 1
fi

# Loop through files sequentially, matching both fastq.gz and fq.gz
for fastq_1 in "$input_dir"/*_1.{fastq,fq}.gz; do
    if [[ -f "$fastq_1" ]]; then
        fastq_2="${fastq_1/_1․/_2․}"
        if [[ -f "$fastq_2" ]]; then
            align "$fastq_1" "$fastq_2" "$output_dir" "$reference" "$threads"
        else
            echo "Missing pair for $fastq_1"
        fi
    fi
done

Parent script

#!/bin/bash
#SBATCH --mem=40gb
#SBATCH --cpus-per-task=8
#SBATCH --output=../log/align_00.log  
#SBATCH --job-name=align_mydata

# Parameters (example, modify as needed)

input_dir=fastq
output_dir=bam
reference="/mnt/db/genomes/homo_sapiens/GRCh38.p14/bwa_mem_0.7.17-r1188/GCF_000001405.40_GRCh38.p14_genomic.fna"
threads=10

src/align.sh $input_dir $output_dir $reference $threads

Note, the reference genome FASTA file (.fna or .fa) should be located in the same directory where the bwa indexes are located (.amb, .ann, .bwt, .pac, .sa).