====== Bioinformatics Pipelines ======
This section documents end-to-end bioinformatics workflows used at ABI. Each pipeline page describes a complete workflow from raw data to results, including the tools, scripts, and parameters used.
For individual reusable scripts, see the [[scripts:start|Scripts]] section.
===== Common Pipelines =====
The typical NGS analysis workflow follows these steps:
Raw FASTQ --> QC --> Trimming --> Alignment --> Post-processing --> Variant Calling / Analysis
Each step is documented in detail:
===== Pipeline Steps =====
^ Step ^ Description ^ Tools ^ Guide ^
| 1. Download data | Download FASTQ files from sequencing facility or public databases | wget, sra-tools | [[scripts:download_fastq|Download FASTQ]] |
| 2. Quality control | Assess read quality before and after trimming | FastQC, MultiQC, fastp | [[scripts:qc|Quality Control]] |
| 3. Adapter & quality trimming | Remove adapters and low-quality bases | fastp, cutadapt | [[scripts:adapter_and_quality_trimming|Trimming]] |
| 4. Alignment | Map reads to a reference genome | BWA mem, Bowtie2, STAR | [[scripts:alignment|Alignment]] |
| 5. Post-alignment processing | Sort, index, mark duplicates | samtools, Picard | *TODO: create page* |
| 6. Variant calling | Call SNPs and indels | GATK HaplotypeCaller, bcftools | *TODO: create page* |
| 7. Variant filtering & annotation | Filter and annotate variants | GATK, SnpEff, VEP | *TODO: create page* |
| 8. Downstream analysis | Statistical analysis, visualization | R, Python | *TODO: project-specific* |
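The first two steps in the table can be sketched as a small dry-run script. The accession ''SRR000001'' and the directory names are placeholders, and the commands are built as strings and printed (rather than executed) so the sketch can be reviewed anywhere:

```shell
#!/bin/bash
# Dry-run sketch of steps 1-2 (download + QC).
# SRR000001 and the directory names are placeholders -- adjust per project.
SRR=SRR000001
FASTQ_DIR=fastq
QC_DIR=qc

# Step 1: fetch reads from SRA (sra-tools).
DOWNLOAD_CMD="fasterq-dump --split-files --outdir $FASTQ_DIR $SRR"

# Step 2: per-file QC reports, then an aggregate MultiQC report.
QC_CMD="fastqc --outdir $QC_DIR $FASTQ_DIR/${SRR}_1.fastq $FASTQ_DIR/${SRR}_2.fastq"
MULTIQC_CMD="multiqc --outdir $QC_DIR $QC_DIR"

echo "$DOWNLOAD_CMD"
echo "$QC_CMD"
echo "$MULTIQC_CMD"
```

Printing the commands first makes it easy to paste them into a parent Slurm script once the paths are confirmed.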
----
===== Workflow by Data Type =====
Different types of sequencing data require different pipelines:
==== Whole Genome Sequencing (WGS) / Whole Exome Sequencing (WES) ====
FASTQ --> FastQC --> fastp --> BWA mem --> samtools sort --> Mark Duplicates --> GATK HaplotypeCaller --> Filter --> Annotate
Relevant guides:
* [[scripts:qc|QC]] --> [[scripts:adapter_and_quality_trimming|Trimming]] --> [[scripts:alignment|Alignment (BWA)]]
* *TODO: Add variant calling and annotation guides*
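The WGS/WES flow above can be sketched command-by-command. The sample name, reference path, and thread count are placeholder assumptions, and the commands are built as strings and printed so the sketch runs without the tools installed:

```shell
#!/bin/bash
# Dry-run sketch of the WGS/WES pipeline. SAMPLE, REF, and THREADS are
# placeholders; the reference is assumed to be BWA-indexed already.
SAMPLE=sample01
REF=ref/genome.fa
THREADS=8

# Trim -> align+sort -> mark duplicates -> call variants.
TRIM_CMD="fastp -i fastq/${SAMPLE}_R1.fastq.gz -I fastq/${SAMPLE}_R2.fastq.gz -o trim/${SAMPLE}_R1.fastq.gz -O trim/${SAMPLE}_R2.fastq.gz"
ALIGN_CMD="bwa mem -t $THREADS $REF trim/${SAMPLE}_R1.fastq.gz trim/${SAMPLE}_R2.fastq.gz | samtools sort -o bam/${SAMPLE}.bam -"
DEDUP_CMD="gatk MarkDuplicates -I bam/${SAMPLE}.bam -O bam/${SAMPLE}.dedup.bam -M bam/${SAMPLE}.dup_metrics.txt"
CALL_CMD="gatk HaplotypeCaller -R $REF -I bam/${SAMPLE}.dedup.bam -O vcf/${SAMPLE}.vcf.gz"

printf '%s\n' "$TRIM_CMD" "$ALIGN_CMD" "$DEDUP_CMD" "$CALL_CMD"
```

In practice each step would live in its own daughter script (see //Script Organization// below) rather than one monolithic file.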
==== RNA-seq ====
FASTQ --> FastQC --> fastp --> STAR --> featureCounts/HTSeq --> DESeq2/edgeR
*TODO: Create RNA-seq specific pipeline pages when needed.*
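As a placeholder until those pages exist, the RNA-seq flow can be sketched the same dry-run way. The STAR index and GTF annotation paths are assumptions; counting here uses featureCounts, and the resulting count matrix would go to DESeq2/edgeR in R:

```shell
#!/bin/bash
# Dry-run sketch of the RNA-seq flow. INDEX and GTF are placeholders;
# a STAR genome index is assumed to be pre-built.
SAMPLE=sample01
INDEX=ref/star_index
GTF=ref/annotation.gtf
THREADS=8

ALIGN_CMD="STAR --runThreadN $THREADS --genomeDir $INDEX --readFilesIn trim/${SAMPLE}_R1.fastq.gz trim/${SAMPLE}_R2.fastq.gz --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outFileNamePrefix bam/${SAMPLE}."
COUNT_CMD="featureCounts -T $THREADS -p -a $GTF -o counts/${SAMPLE}.counts.txt bam/${SAMPLE}.Aligned.sortedByCoord.out.bam"

printf '%s\n' "$ALIGN_CMD" "$COUNT_CMD"
```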
==== Metagenomics / Microbiome ====
FASTQ --> FastQC --> fastp --> Kraken2/MetaPhlAn --> Diversity analysis
*TODO: Create metagenomics pipeline pages when needed.*
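Likewise, the classification step can be sketched as follows; the Kraken2 database path is a placeholder assumption (a pre-built database is required), and the command is printed rather than run:

```shell
#!/bin/bash
# Dry-run sketch of taxonomic classification with Kraken2.
# DB is a placeholder for a pre-built Kraken2 database.
SAMPLE=sample01
DB=ref/kraken2_db
THREADS=8

CLASSIFY_CMD="kraken2 --db $DB --threads $THREADS --paired --report kraken/${SAMPLE}.report --output kraken/${SAMPLE}.kraken trim/${SAMPLE}_R1.fastq.gz trim/${SAMPLE}_R2.fastq.gz"

echo "$CLASSIFY_CMD"
```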
----
===== Script Organization =====
ABI uses a **parent/daughter script pattern** for Slurm jobs:
  * **Daughter script** -- A reusable function or tool wrapper (e.g., ''src/fastqc.sh'', ''src/align.sh'') that takes parameters such as input/output directories. Contains no scheduler-specific settings.
* **Parent script** -- A Slurm job script that sets parameters and calls the daughter script. Contains ''#SBATCH'' directives.
Example:
project/
src/
fastqc.sh # Daughter: runs FastQC
align.sh # Daughter: runs BWA mem
fastqc_00.sh # Parent: Slurm job calling src/fastqc.sh
align_00.sh # Parent: Slurm job calling src/align.sh
log/ # Job output logs
fastq/ # Input FASTQ files
bam/ # Output BAM files
This approach allows you to:
* Reuse daughter scripts across projects
* Keep Slurm parameters separate from tool logic
* Track each run via its parent script and log file
See [[scripts:run_job_on_slurm|Running Jobs on Slurm]] for more on this pattern.
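A minimal, self-contained illustration of the pattern, following the names in the example tree (''src/fastqc.sh'', ''fastqc_00.sh''); ''echo'' stands in for the real FastQC call so the sketch runs anywhere, and the Slurm resource values are placeholder assumptions:

```shell
#!/bin/bash
# Parent/daughter sketch. `echo` replaces the real FastQC call so this
# runs without FastQC or Slurm installed.
mkdir -p src log

# Daughter script: reusable wrapper taking input dir, output dir, threads.
cat > src/fastqc.sh <<'EOF'
#!/bin/bash
set -euo pipefail
IN_DIR=$1
OUT_DIR=$2
THREADS=${3:-1}
echo "fastqc --threads $THREADS --outdir $OUT_DIR $IN_DIR"
EOF

# Parent script: Slurm directives plus the parameters for one run.
cat > fastqc_00.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=fastqc_00
#SBATCH --output=log/%x_%j.out
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00
bash src/fastqc.sh fastq qc 4
EOF

# On the cluster: sbatch fastqc_00.sh
# Run directly here to show the command the daughter would launch:
RESULT=$(bash fastqc_00.sh)
echo "$RESULT"
```

Because all ''#SBATCH'' directives live in the parent, the daughter can be reused unchanged from any project or submitted with different resources.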
----
===== Tips =====
  * **Create a ''log/'' directory** before submitting jobs -- Slurm does not create missing directories, and a job whose ''--output'' path cannot be written will fail.
* **Use one parent script per run** -- name them descriptively (e.g., ''align_sample01.sh'', ''align_sample02.sh'') or use [[software:slurm#job_arrays|job arrays]].
* **Document your parameters** -- add comments in parent scripts noting why you chose specific settings.
* **Check QC at every step** -- run FastQC/MultiQC after trimming and after alignment.
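For the job-array variant of the "one parent per run" tip, a hypothetical array parent can pick its sample from a list using ''SLURM_ARRAY_TASK_ID''. The file names and the stand-in ''echo'' are assumptions; submitting for real is ''sbatch align_array.sh'':

```shell
#!/bin/bash
# Hypothetical job-array parent: one Slurm task per line of samples.txt.
mkdir -p src log
printf 'sample01\nsample02\n' > samples.txt

cat > align_array.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=align
#SBATCH --output=log/%x_%A_%a.out
#SBATCH --array=1-2
# Slurm sets SLURM_ARRAY_TASK_ID; use it to pick this task's sample.
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
echo "bash src/align.sh $SAMPLE"   # echo stands in for the real call
EOF

# On the cluster: sbatch align_array.sh
# Simulate array task 2 locally:
RESULT=$(SLURM_ARRAY_TASK_ID=2 bash align_array.sh)
echo "$RESULT"
```

Keep the ''--array'' range in sync with the number of lines in ''samples.txt''.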
----
===== See Also =====
* [[scripts:start|Scripts]] -- Individual reusable scripts
* [[software:slurm|Using Slurm]] -- Job submission and management
* [[databases:start|Databases & Reference Data]] -- Reference genomes and indexes
* [[projects:start|Projects]] -- Project-specific documentation