Align your reads to a reference genome

Align your reads to a reference genome

Reference genome index

You'll need a reference genome index to align your reads in FASTQ files. There are prebuilt indexes on the ABI server at the following directory:

/mnt/db/genomes

If you don't find the genome of your interest, you can build the index yourself, by following the steps in the following page [TOWIKI].

Choosing the alignment program

There are several programs designed for the job. The choice depends on your task. A general rule of thumb:

If you have WGS or WES data, use bwa mem. It is accurate for SNP and INDEL calling.
If you Short reads or ChIP-seq, use Bowtie2. You can also use it if you prefer the alignment to be faster, but at the expense of accuracy in variant calling.
For RNA-seq, use STAR, as it is splicing-aware.

Example script for bwa-mem and paired-end reads

Daughter script

align() {
    local fastq_1="$1"
    local fastq_2="$2"
    local output_dir="$3"
    local reference="$4"
    local threads="$5"

    # Extract the filename without extension
    filename=$(basename "$fastq_1" | sed -E 's/(_1)?\.(fastq|fq)\.gz//')

    echo "Aligning $filename using BWA mem with $threads threads..."

    # Perform alignment with specified number of threads
    bwa mem -t "$threads" "$reference" <(zcat "$fastq_1") <(zcat "$fastq_2") > "$output_dir"/"$filename".sam

    # Convert SAM to BAM
    samtools view -@ "$threads" -bS "$output_dir"/"$filename".sam > "$output_dir"/"$filename".bam

    # Sort the BAM file
    samtools sort -@ "$threads" -o "$output_dir"/"$filename".sorted.bam "$output_dir"/"$filename".bam

    # Index the sorted BAM file
    samtools index "$output_dir"/"$filename".sorted.bam

    # Clean up intermediate files
    rm "$output_dir"/"$filename".sam "$output_dir"/"$filename".bam

    echo "Finished processing $filename"
}

# Input arguments
input_dir=$1
output_dir=$2
reference=$3
threads=$4

# Validate input
if [[ -z "$input_dir" || -z "$output_dir" || -z "$reference" || -z "$threads" ]]; then
    echo "Usage: $0 <input_dir> <output_dir> <reference> <threads>"
    exit 1
fi

# Loop through files sequentially, matching both fastq.gz and fq.gz
for fastq_1 in "$input_dir"/*_1.{fastq,fq}.gz; do
    if [[ -f "$fastq_1" ]]; then
        fastq_2="${fastq_1/_1․/_2․}"
        if [[ -f "$fastq_2" ]]; then
            align "$fastq_1" "$fastq_2" "$output_dir" "$reference" "$threads"
        else
            echo "Missing pair for $fastq_1"
        fi
    fi
done

Parent script

#!/bin/bash
#SBATCH --mem=40gb
#SBATCH --cpus-per-task=8
#SBATCH --output=../log/align_00.log  
#SBATCH --job-name=align_mydata

# Parameters (example, modify as needed)

input_dir=fastq
output_dir=bam
reference="/mnt/db/genomes/homo_sapiens/GRCh38.p14/bwa_mem_0.7.17-r1188/GCF_000001405.40_GRCh38.p14_genomic.fna"
threads=10

src/align.sh $input_dir $output_dir $reference $threads

Note, the reference genome FASTA file (.fna or .fa) should be located in the same directory where the bwa indexes are located (.amb, .ann, .bwt, .pac, .sa).

ABI Knowledge Base

Table of Contents