scripts:alignment
Table of Contents
Align your reads to a reference genome
Reference genome index
You'll need a reference genome index to align your reads in FASTQ files. There are prebuilt indexes on the ABI server at the following directory:
/mnt/db/genomes
If you don't find the genome of your interest, you can build the index yourself, by following the steps in the following page [TOWIKI].
Choosing the alignment program
There are several programs designed for the job. The choice depends on your task. A general rule of thumb:
- If you have WGS or WES data, use
bwa mem. It is accurate for SNP and INDEL calling. - If you Short reads or ChIP-seq, use
Bowtie2. You can also use it if you prefer the alignment to be faster, but at the expense of accuracy in variant calling. - For RNA-seq, use
STAR, as it is splicing-aware.
Example script for bwa-mem and paired-end reads
Daughter script
align() {
local fastq_1="$1"
local fastq_2="$2"
local output_dir="$3"
local reference="$4"
local threads="$5"
# Extract the filename without extension
filename=$(basename "$fastq_1" | sed -E 's/(_1)?\.(fastq|fq)\.gz//')
echo "Aligning $filename using BWA mem with $threads threads..."
# Perform alignment with specified number of threads
bwa mem -t "$threads" "$reference" <(zcat "$fastq_1") <(zcat "$fastq_2") > "$output_dir"/"$filename".sam
# Convert SAM to BAM
samtools view -@ "$threads" -bS "$output_dir"/"$filename".sam > "$output_dir"/"$filename".bam
# Sort the BAM file
samtools sort -@ "$threads" -o "$output_dir"/"$filename".sorted.bam "$output_dir"/"$filename".bam
# Index the sorted BAM file
samtools index "$output_dir"/"$filename".sorted.bam
# Clean up intermediate files
rm "$output_dir"/"$filename".sam "$output_dir"/"$filename".bam
echo "Finished processing $filename"
}
# Input arguments
input_dir=$1
output_dir=$2
reference=$3
threads=$4
# Validate input
if [[ -z "$input_dir" || -z "$output_dir" || -z "$reference" || -z "$threads" ]]; then
echo "Usage: $0 <input_dir> <output_dir> <reference> <threads>"
exit 1
fi
# Loop through files sequentially, matching both fastq.gz and fq.gz
for fastq_1 in "$input_dir"/*_1.{fastq,fq}.gz; do
if [[ -f "$fastq_1" ]]; then
fastq_2="${fastq_1/_1․/_2․}"
if [[ -f "$fastq_2" ]]; then
align "$fastq_1" "$fastq_2" "$output_dir" "$reference" "$threads"
else
echo "Missing pair for $fastq_1"
fi
fi
done
Parent script
#!/bin/bash #SBATCH --mem=40gb #SBATCH --cpus-per-task=8 #SBATCH --output=../log/align_00.log #SBATCH --job-name=align_mydata # Parameters (example, modify as needed) input_dir=fastq output_dir=bam reference="/mnt/db/genomes/homo_sapiens/GRCh38.p14/bwa_mem_0.7.17-r1188/GCF_000001405.40_GRCh38.p14_genomic.fna" threads=10 src/align.sh $input_dir $output_dir $reference $threads
Note, the reference genome FASTA file (.fna or .fa) should be located in the same directory where the bwa indexes are located (.amb, .ann, .bwt, .pac, .sa).
scripts/alignment.txt · Last modified: by 37.26.174.181
