====== Bioinformatics Pipelines ======

This section documents end-to-end bioinformatics workflows used at ABI. Each pipeline page describes a complete workflow from raw data to results, including the tools, scripts, and parameters used.

For individual reusable scripts, see the [[scripts:start|Scripts]] section.

===== Common Pipelines =====

The typical NGS analysis workflow follows these steps:

<code>
Raw FASTQ  -->  QC  -->  Trimming  -->  Alignment  -->  Post-processing  -->  Variant Calling / Analysis
</code>

Each step is documented in detail:

===== Pipeline Steps =====

----

===== Workflow by Data Type =====

Different types of sequencing data require different pipelines:

==== Whole Genome Sequencing (WGS) / Whole Exome Sequencing (WES) ====

<code>
FASTQ --> FastQC --> fastp --> BWA mem --> samtools sort --> Mark Duplicates --> GATK HaplotypeCaller --> Filter --> Annotate
</code>

Relevant guides:
  * [[scripts:qc|QC]] --> [[scripts:adapter_and_quality_trimming|Trimming]] --> [[scripts:alignment|Alignment (BWA)]]
  * //TODO: Add variant calling and annotation guides.//
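
The WGS/WES chain above can be sketched as one shell function. This is a minimal sketch under stated assumptions, not ABI's production script: the sample name, the directory names (''qc/'', ''trimmed/'', ''vcf/''), the reference path, the read group values, and the thread count are all hypothetical, and GATK's ''MarkDuplicates'' is assumed for the "Mark Duplicates" step. The filter and annotate steps are left out because this page does not name the tools used for them. By default the function only prints the commands (''DRY_RUN=1'').

```shell
#!/bin/bash
# Sketch of the WGS/WES chain for one paired-end sample.
# All paths, the sample name, and the thread count are hypothetical.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print commands; 0 = also execute them

# Print a command line and, unless DRY_RUN=1, execute it with pipefail on.
run() {
    printf '+ %s\n' "$1"
    if [ "$DRY_RUN" != "1" ]; then
        bash -o pipefail -c "$1"
    fi
}

# Usage: wgs_pipeline <sample> <reference.fa> [threads]
wgs_pipeline() {
    local sample="$1" ref="$2" threads="${3:-8}"
    local r1="fastq/${sample}_R1.fastq.gz" r2="fastq/${sample}_R2.fastq.gz"
    local t1="trimmed/${sample}_R1.fastq.gz" t2="trimmed/${sample}_R2.fastq.gz"
    # GATK needs a read group on every read; these tag values are placeholders.
    local rg="@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA"

    run "mkdir -p qc trimmed bam vcf"
    # QC on the raw reads
    run "fastqc -t $threads -o qc $r1 $r2"
    # Adapter and quality trimming
    run "fastp -i $r1 -I $r2 -o $t1 -O $t2 -w $threads -j qc/${sample}.fastp.json -h qc/${sample}.fastp.html"
    # Alignment piped straight into coordinate sorting
    run "bwa mem -t $threads -R '$rg' $ref $t1 $t2 | samtools sort -@ $threads -o bam/${sample}.sorted.bam -"
    # Duplicate marking (assumed tool: GATK MarkDuplicates), then indexing
    run "gatk MarkDuplicates -I bam/${sample}.sorted.bam -O bam/${sample}.dedup.bam -M bam/${sample}.dup_metrics.txt"
    run "samtools index bam/${sample}.dedup.bam"
    # Variant calling
    run "gatk HaplotypeCaller -R $ref -I bam/${sample}.dedup.bam -O vcf/${sample}.vcf.gz"
}

# Example call: prints the seven commands for a hypothetical sample.
wgs_pipeline sample01 ref/genome.fa 8
```

Running the script with ''DRY_RUN=0'' executes the commands instead of only printing them. That assumes the reference has already been indexed for BWA (''bwa index'') and for GATK (''samtools faidx'' plus a sequence dictionary).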

==== RNA-seq ====

<code>
FASTQ --> FastQC --> fastp --> STAR --> featureCounts/HTSeq --> DESeq2/edgeR
</code>

//TODO: Create RNA-seq specific pipeline pages when needed.//

==== Metagenomics / Microbiome ====

<code>
FASTQ --> FastQC --> fastp --> Kraken2/MetaPhlAn --> Diversity analysis
</code>

//TODO: Create metagenomics pipeline pages when needed.//

----

===== Script Organization =====

ABI uses a **parent/daughter script pattern** for Slurm jobs:

  * **Daughter script** -- A reusable function or tool wrapper (e.g., ''src/fastqc.sh'', ''src/align.sh''). Takes parameters like input/output directories.
  * **Parent script** -- A Slurm job script that sets parameters and calls the daughter script. Contains ''#SBATCH'' directives.

Example:

<code>
project/
  src/
    fastqc.sh          # Daughter: runs FastQC
    align.sh            # Daughter: runs BWA mem
  fastqc_00.sh          # Parent: Slurm job calling src/fastqc.sh
  align_00.sh           # Parent: Slurm job calling src/align.sh
  log/                  # Job output logs
  fastq/                # Input FASTQ files
  bam/                  # Output BAM files
</code>

This approach allows you to:
  * Reuse daughter scripts across projects
  * Keep Slurm parameters separate from tool logic
  * Track each run via its parent script and log file

See [[scripts:run_job_on_slurm|Running Jobs on Slurm]] for more on this pattern.
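
As an illustration of the pattern, the following sketch scaffolds one parent/daughter pair for FastQC. It is an assumption-laden example rather than the scripts ABI actually uses: the ''qc'' output directory, the argument order of ''src/fastqc.sh'', and every resource value in the ''#SBATCH'' lines are hypothetical.

```shell
#!/bin/bash
# Scaffold a minimal parent/daughter pair for FastQC.
# The "qc" output directory and all resource values are hypothetical.
set -eu

PROJECT_DIR="${PROJECT_DIR:-project}"
mkdir -p "$PROJECT_DIR/src" "$PROJECT_DIR/log" "$PROJECT_DIR/fastq"

# Daughter: tool logic only, reusable across projects.
cat > "$PROJECT_DIR/src/fastqc.sh" <<'EOF'
#!/bin/bash
# Daughter: runs FastQC on every *.fastq.gz file in a directory.
# Usage: fastqc.sh <input_dir> <output_dir> [threads]
set -euo pipefail
in_dir="$1"
out_dir="$2"
threads="${3:-1}"
mkdir -p "$out_dir"
fastqc -t "$threads" -o "$out_dir" "$in_dir"/*.fastq.gz
EOF

# Parent: Slurm directives plus the parameters for this particular run.
cat > "$PROJECT_DIR/fastqc_00.sh" <<'EOF'
#!/bin/bash
#SBATCH --job-name=fastqc_00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00
#SBATCH --output=log/fastqc_00_%j.out
# Parent: sets parameters, then calls the daughter script.
# Submit from the project directory so the relative paths resolve.
set -euo pipefail
bash src/fastqc.sh fastq qc "${SLURM_CPUS_PER_TASK:-1}"
EOF

chmod +x "$PROJECT_DIR/src/fastqc.sh" "$PROJECT_DIR/fastqc_00.sh"
echo "Submit with: cd $PROJECT_DIR && sbatch fastqc_00.sh"
```

A second run with different settings would get its own parent (for example ''fastqc_01.sh'') that calls the same daughter, so the parent script and its log file together record what was run.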

----

===== Tips =====

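  * **Create a ''log/'' directory** before submitting jobs -- Slurm does not create a missing output directory, so a job whose ''--output'' path points into a nonexistent ''log/'' fails without writing a log.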
  * **Use one parent script per run** -- name them descriptively (e.g., ''align_sample01.sh'', ''align_sample02.sh'') or use [[software:slurm#job_arrays|job arrays]].
  * **Document your parameters** -- add comments in parent scripts noting why you chose specific settings.
  * **Check QC at every step** -- run FastQC/MultiQC after trimming and after alignment.
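
The job array option mentioned above could look like the sketch below: a single parent script in which each array task picks its sample from a list file. The file name ''samples.txt'', the array range, the resource values, and the arguments passed to ''src/align.sh'' are assumptions for illustration; the real daughter script may expect different parameters.

```shell
#!/bin/bash
#SBATCH --job-name=align
#SBATCH --array=1-2
#SBATCH --cpus-per-task=8
#SBATCH --output=log/align_%A_%a.out
# Parent using a job array: one task per line of samples.txt (hypothetical file).
# Adjust --array to the number of lines in that file.
set -eu

# Print line N (1-based) of a sample list file.
# Usage: sample_for_task <list_file> <N>
sample_for_task() {
    sed -n "${2}p" "$1"
}

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; outside Slurm nothing runs.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    sample="$(sample_for_task samples.txt "$SLURM_ARRAY_TASK_ID")"
    # Hypothetical daughter interface: R1, R2, output BAM.
    bash src/align.sh "fastq/${sample}_R1.fastq.gz" "fastq/${sample}_R2.fastq.gz" "bam/${sample}.bam"
fi
```

In the ''--output'' pattern, ''%A'' expands to the array's job ID and ''%a'' to the task index, so every task writes its own log file. Run ''mkdir -p log'' before ''sbatch'', since the script itself cannot create the directory its own log is written to.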

----

===== See Also =====

  * [[scripts:start|Scripts]] -- Individual reusable scripts
  * [[software:slurm|Using Slurm]] -- Job submission and management
  * [[databases:start|Databases & Reference Data]] -- Reference genomes and indexes
  * [[projects:start|Projects]] -- Project-specific documentation
