You'll need a reference genome index to align your reads in FASTQ files. There are prebuilt indexes on the ABI server at the following directory:
/mnt/db/genomes
If you don't find the genome of your interest, you can build the index yourself, by following the steps in the following page [TOWIKI].
There are several programs designed for the job. The choice depends on your task. A general rule of thumb:
bwa mem. It is accurate for SNP and INDEL calling.Bowtie2. You can also use it if you prefer the alignment to be faster, but at the expense of accuracy in variant calling.STAR, as it is splicing-aware.
align() {
local fastq_1="$1"
local fastq_2="$2"
local output_dir="$3"
local reference="$4"
local threads="$5"
# Extract the filename without extension
filename=$(basename "$fastq_1" | sed -E 's/(_1)?\.(fastq|fq)\.gz//')
echo "Aligning $filename using BWA mem with $threads threads..."
# Perform alignment with specified number of threads
bwa mem -t "$threads" "$reference" <(zcat "$fastq_1") <(zcat "$fastq_2") > "$output_dir"/"$filename".sam
# Convert SAM to BAM
samtools view -@ "$threads" -bS "$output_dir"/"$filename".sam > "$output_dir"/"$filename".bam
# Sort the BAM file
samtools sort -@ "$threads" -o "$output_dir"/"$filename".sorted.bam "$output_dir"/"$filename".bam
# Index the sorted BAM file
samtools index "$output_dir"/"$filename".sorted.bam
# Clean up intermediate files
rm "$output_dir"/"$filename".sam "$output_dir"/"$filename".bam
echo "Finished processing $filename"
}
# Input arguments
input_dir=$1
output_dir=$2
reference=$3
threads=$4
# Validate input
if [[ -z "$input_dir" || -z "$output_dir" || -z "$reference" || -z "$threads" ]]; then
echo "Usage: $0 <input_dir> <output_dir> <reference> <threads>"
exit 1
fi
# Loop through files sequentially, matching both fastq.gz and fq.gz
for fastq_1 in "$input_dir"/*_1.{fastq,fq}.gz; do
if [[ -f "$fastq_1" ]]; then
fastq_2="${fastq_1/_1․/_2․}"
if [[ -f "$fastq_2" ]]; then
align "$fastq_1" "$fastq_2" "$output_dir" "$reference" "$threads"
else
echo "Missing pair for $fastq_1"
fi
fi
done
#!/bin/bash #SBATCH --mem=40gb #SBATCH --cpus-per-task=8 #SBATCH --output=../log/align_00.log #SBATCH --job-name=align_mydata # Parameters (example, modify as needed) input_dir=fastq output_dir=bam reference="/mnt/db/genomes/homo_sapiens/GRCh38.p14/bwa_mem_0.7.17-r1188/GCF_000001405.40_GRCh38.p14_genomic.fna" threads=10 src/align.sh $input_dir $output_dir $reference $threads
Note, the reference genome FASTA file (.fna or .fa) should be located in the same directory where the bwa indexes are located (.amb, .ann, .bwt, .pac, .sa).