Bioinformatics Pipelines

This section documents end-to-end bioinformatics workflows used at ABI. Each pipeline page describes a complete workflow from raw data to results, including the tools, scripts, and parameters used.

For individual reusable scripts, see the Scripts section.

Common Pipelines

The typical NGS analysis workflow follows these steps:

Raw FASTQ  -->  QC  -->  Trimming  -->  Alignment  -->  Post-processing  -->  Variant Calling / Analysis

Each step is documented in detail:

Pipeline Steps

  1. Download data – download FASTQ files from the sequencing facility or public databases. Tools: wget, sra-tools. Guide: Download FASTQ
  2. Quality control – assess read quality before and after trimming. Tools: FastQC, MultiQC, fastp. Guide: Quality Control
  3. Adapter & quality trimming – remove adapters and low-quality bases. Tools: fastp, cutadapt. Guide: Trimming
  4. Alignment – map reads to a reference genome. Tools: BWA mem, Bowtie2, STAR. Guide: Alignment
  5. Post-alignment processing – sort, index, and mark duplicates. Tools: samtools, Picard. *TODO: create page*
  6. Variant calling – call SNPs and indels. Tools: GATK HaplotypeCaller, bcftools. *TODO: create page*
  7. Variant filtering & annotation – filter and annotate variants. Tools: GATK, SnpEff, VEP. *TODO: create page*
  8. Downstream analysis – statistical analysis and visualization. Tools: R, Python. *TODO: project-specific*
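Until the dedicated page exists, post-alignment processing (step 5) typically looks something like the sketch below. File names are placeholders; adjust thread counts to your allocation.

```shell
# Sketch of step 5 (post-alignment processing); sample.bam etc. are placeholder names.

# Sort the alignment by coordinate
samtools sort -@ 4 -o sample.sorted.bam sample.bam

# Mark PCR/optical duplicates with Picard
picard MarkDuplicates \
    I=sample.sorted.bam \
    O=sample.markdup.bam \
    M=sample.markdup.metrics.txt

# Index the final BAM for downstream tools
samtools index sample.markdup.bam
```

The sorted, duplicate-marked, indexed BAM is what variant callers such as GATK HaplotypeCaller expect as input.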

Workflow by Data Type

Different types of sequencing data require different pipelines:

Whole Genome Sequencing (WGS) / Whole Exome Sequencing (WES)

FASTQ --> FastQC --> fastp --> BWA mem --> samtools sort --> Mark Duplicates --> GATK HaplotypeCaller --> Filter --> Annotate
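Run concretely, the flow above might look like the following sketch. The reference path, sample name, read-group values, directory layout, and thread counts are all placeholders to adapt per project.

```shell
#!/bin/bash
set -euo pipefail

# Placeholder inputs: adjust reference and sample name to your project.
REF=ref/genome.fa
SAMPLE=sample01

# QC + adapter/quality trimming
fastqc fastq/${SAMPLE}_R1.fastq.gz fastq/${SAMPLE}_R2.fastq.gz
fastp -i fastq/${SAMPLE}_R1.fastq.gz -I fastq/${SAMPLE}_R2.fastq.gz \
      -o trimmed/${SAMPLE}_R1.fastq.gz -O trimmed/${SAMPLE}_R2.fastq.gz

# Alignment with BWA mem, piped straight into a coordinate sort
bwa mem -t 8 -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA" "$REF" \
    trimmed/${SAMPLE}_R1.fastq.gz trimmed/${SAMPLE}_R2.fastq.gz |
    samtools sort -@ 8 -o bam/${SAMPLE}.sorted.bam -

# Mark duplicates and index
gatk MarkDuplicates -I bam/${SAMPLE}.sorted.bam -O bam/${SAMPLE}.markdup.bam \
    -M bam/${SAMPLE}.markdup.metrics.txt
samtools index bam/${SAMPLE}.markdup.bam

# Variant calling (per-sample GVCF mode)
gatk HaplotypeCaller -R "$REF" -I bam/${SAMPLE}.markdup.bam \
    -O vcf/${SAMPLE}.g.vcf.gz -ERC GVCF
```

Filtering and annotation (GATK, SnpEff, VEP) follow from the GVCF and are deferred to their own pages.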

Relevant guides:

  • Download FASTQ
  • Quality Control
  • Trimming
  • Alignment

RNA-seq

FASTQ --> FastQC --> fastp --> STAR --> featureCounts/HTSeq --> DESeq2/edgeR

*TODO: Create RNA-seq specific pipeline pages when needed.*
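In the meantime, a hedged sketch of the RNA-seq flow for one sample (index path, annotation file, and sample names are placeholders):

```shell
# Sketch of the RNA-seq flow; star_index/, gencode.gtf, s1, etc. are placeholders.
fastp -i fastq/s1_R1.fastq.gz -I fastq/s1_R2.fastq.gz \
      -o trimmed/s1_R1.fastq.gz -O trimmed/s1_R2.fastq.gz

# Splice-aware alignment with STAR
STAR --runThreadN 8 --genomeDir star_index \
     --readFilesIn trimmed/s1_R1.fastq.gz trimmed/s1_R2.fastq.gz \
     --readFilesCommand zcat \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix bam/s1.

# Gene-level counts for DESeq2/edgeR
# (--countReadPairs requires subread >= 2.0.2; older versions use -p alone)
featureCounts -T 8 -p --countReadPairs -a gencode.gtf \
    -o counts/s1.counts.txt bam/s1.Aligned.sortedByCoord.out.bam
```

The resulting count matrix is the input to DESeq2 or edgeR in R.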

Metagenomics / Microbiome

FASTQ --> FastQC --> fastp --> Kraken2/MetaPhlAn --> Diversity analysis

*TODO: Create metagenomics pipeline pages when needed.*
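As a placeholder until those pages exist, taxonomic profiling with Kraken2 after trimming might look like this sketch (database path and sample names are placeholders):

```shell
# Sketch of taxonomic profiling with Kraken2; kraken2_db/ is a placeholder path.
kraken2 --db kraken2_db --threads 8 --paired \
    --report reports/s1.kreport --output kraken/s1.kraken \
    trimmed/s1_R1.fastq.gz trimmed/s1_R2.fastq.gz
```

The per-sample reports then feed into diversity analysis (e.g., in R or Python).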


Script Organization

ABI uses a parent/daughter script pattern for Slurm jobs:

  • Daughter script – A reusable function or tool wrapper (e.g., src/fastqc.sh, src/align.sh). Takes parameters like input/output directories.
  • Parent script – A Slurm job script that sets parameters and calls the daughter script. Contains #SBATCH directives.

Example:

project/
  src/
    fastqc.sh          # Daughter: runs FastQC
    align.sh           # Daughter: runs BWA mem
  fastqc_00.sh          # Parent: Slurm job calling src/fastqc.sh
  align_00.sh           # Parent: Slurm job calling src/align.sh
  log/                  # Job output logs
  fastq/                # Input FASTQ files
  bam/                  # Output BAM files
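As an illustrative pair for the FastQC step (the contents here are hypothetical; real scripts will carry project-specific paths and resources):

```shell
# --- src/fastqc.sh (daughter: reusable FastQC wrapper; hypothetical contents) ---
#!/bin/bash
set -euo pipefail
INDIR=$1     # input directory of FASTQ files
OUTDIR=$2    # output directory for FastQC reports
mkdir -p "$OUTDIR"
fastqc -o "$OUTDIR" "$INDIR"/*.fastq.gz

# --- fastqc_00.sh (parent: Slurm job that sets parameters and calls the daughter) ---
#!/bin/bash
#SBATCH --job-name=fastqc_00
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --output=log/%x_%j.out
bash src/fastqc.sh fastq qc
```

The parent is what you submit (sbatch fastqc_00.sh); the daughter never contains #SBATCH directives, so it can be copied between projects unchanged.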

This approach allows you to:

  • Reuse daughter scripts across projects
  • Keep Slurm parameters separate from tool logic
  • Track each run via its parent script and log file

See Running Jobs on Slurm for more on this pattern.


Tips

  • Create a log/ directory before submitting jobs.
  • Use one parent script per run – name them descriptively (e.g., align_sample01.sh, align_sample02.sh) or use job arrays.
  • Document your parameters – add comments in parent scripts noting why you chose specific settings.
  • Check QC at every step – run FastQC/MultiQC after trimming and after alignment.
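If you go the job-array route, a minimal parent-script sketch might look like this (the samples.txt file, array range, and paths are placeholders):

```shell
#!/bin/bash
#SBATCH --job-name=align_array
#SBATCH --array=1-24              # one task per sample (placeholder range)
#SBATCH --output=log/%x_%a.out    # %a = array task ID
set -euo pipefail

# samples.txt lists one sample name per line (hypothetical file)
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
bash src/align.sh "fastq/${SAMPLE}" "bam/${SAMPLE}"
```

One submission then fans out across all samples, and each task's log is still traceable via its array index.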

See Also