====== Bioinformatics Pipelines ======
This section documents end-to-end bioinformatics workflows used at ABI. Each pipeline page describes a complete workflow from raw data to results, including the tools, scripts, and parameters used.
For individual reusable scripts, see the [[scripts:start|Scripts]] section.
===== Common Pipelines =====
The typical NGS analysis workflow follows these steps:
Raw FASTQ --> QC --> Trimming --> Alignment --> Post-processing --> Variant Calling / Analysis
Each step is documented in detail:
===== Pipeline Steps =====
^ Step ^ Description ^ Tools ^ Guide ^
| 1. Download data | Download FASTQ files from sequencing facility or public databases | wget, sra-tools | [[scripts:download_fastq|Download FASTQ]] |
| 2. Quality control | Assess read quality before and after trimming | FastQC, MultiQC, fastp | [[scripts:qc|Quality Control]] |
| 3. Adapter & quality trimming | Remove adapters and low-quality bases | fastp, cutadapt | [[scripts:adapter_and_quality_trimming|Trimming]] |
| 4. Alignment | Map reads to a reference genome | BWA mem, Bowtie2, STAR | [[scripts:alignment|Alignment]] |
| 5. Post-alignment processing | Sort, index, mark duplicates | samtools, Picard | *TODO: create page* |
| 6. Variant calling | Call SNPs and indels | GATK HaplotypeCaller, bcftools | *TODO: create page* |
| 7. Variant filtering & annotation | Filter and annotate variants | GATK, SnpEff, VEP | *TODO: create page* |
| 8. Downstream analysis | Statistical analysis, visualization | R, Python | *TODO: project-specific* |
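The first two steps in the table can be sketched as a small dry-run script. The accession ''SRR000001'' and the directory names are placeholders, and the commands are built as strings and printed (rather than executed) so the sketch can be reviewed anywhere:

```shell
#!/bin/bash
# Dry-run sketch of steps 1-2 (download + QC).
# SRR000001 and the directory names are placeholders -- adjust per project.
SRR=SRR000001
FASTQ_DIR=fastq
QC_DIR=qc

# Step 1: fetch reads from SRA (sra-tools).
DOWNLOAD_CMD="fasterq-dump --split-files --outdir $FASTQ_DIR $SRR"

# Step 2: per-file QC reports, then an aggregate MultiQC report.
QC_CMD="fastqc --outdir $QC_DIR $FASTQ_DIR/${SRR}_1.fastq $FASTQ_DIR/${SRR}_2.fastq"
MULTIQC_CMD="multiqc --outdir $QC_DIR $QC_DIR"

echo "$DOWNLOAD_CMD"
echo "$QC_CMD"
echo "$MULTIQC_CMD"
```

Printing the commands first makes it easy to paste them into a parent Slurm script once the paths are confirmed.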
----
===== Workflow by Data Type =====
Different types of sequencing data require different pipelines:
==== Whole Genome Sequencing (WGS) / Whole Exome Sequencing (WES) ====
FASTQ --> FastQC --> fastp --> BWA mem --> samtools sort --> Mark Duplicates --> GATK HaplotypeCaller --> Filter --> Annotate
Relevant guides:
* [[scripts:qc|QC]] --> [[scripts:adapter_and_quality_trimming|Trimming]] --> [[scripts:alignment|Alignment (BWA)]]
* *TODO: Add variant calling and annotation guides*
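The WGS/WES flow above can be sketched command-by-command. The sample name, reference path, and thread count are placeholder assumptions, and the commands are built as strings and printed so the sketch runs without the tools installed:

```shell
#!/bin/bash
# Dry-run sketch of the WGS/WES pipeline. SAMPLE, REF, and THREADS are
# placeholders; the reference is assumed to be BWA-indexed already.
SAMPLE=sample01
REF=ref/genome.fa
THREADS=8

# Trim -> align+sort -> mark duplicates -> call variants.
TRIM_CMD="fastp -i fastq/${SAMPLE}_R1.fastq.gz -I fastq/${SAMPLE}_R2.fastq.gz -o trim/${SAMPLE}_R1.fastq.gz -O trim/${SAMPLE}_R2.fastq.gz"
ALIGN_CMD="bwa mem -t $THREADS $REF trim/${SAMPLE}_R1.fastq.gz trim/${SAMPLE}_R2.fastq.gz | samtools sort -o bam/${SAMPLE}.bam -"
DEDUP_CMD="gatk MarkDuplicates -I bam/${SAMPLE}.bam -O bam/${SAMPLE}.dedup.bam -M bam/${SAMPLE}.dup_metrics.txt"
CALL_CMD="gatk HaplotypeCaller -R $REF -I bam/${SAMPLE}.dedup.bam -O vcf/${SAMPLE}.vcf.gz"

printf '%s\n' "$TRIM_CMD" "$ALIGN_CMD" "$DEDUP_CMD" "$CALL_CMD"
```

In practice each step would live in its own daughter script (see //Script Organization// below) rather than one monolithic file.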
==== RNA-seq ====
FASTQ --> FastQC --> fastp --> STAR --> featureCounts/HTSeq --> DESeq2/edgeR
*TODO: Create RNA-seq specific pipeline pages when needed.*
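As a placeholder until those pages exist, the RNA-seq flow can be sketched the same dry-run way. The STAR index and GTF annotation paths are assumptions; counting here uses featureCounts, and the resulting count matrix would go to DESeq2/edgeR in R:

```shell
#!/bin/bash
# Dry-run sketch of the RNA-seq flow. INDEX and GTF are placeholders;
# a STAR genome index is assumed to be pre-built.
SAMPLE=sample01
INDEX=ref/star_index
GTF=ref/annotation.gtf
THREADS=8

ALIGN_CMD="STAR --runThreadN $THREADS --genomeDir $INDEX --readFilesIn trim/${SAMPLE}_R1.fastq.gz trim/${SAMPLE}_R2.fastq.gz --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outFileNamePrefix bam/${SAMPLE}."
COUNT_CMD="featureCounts -T $THREADS -p -a $GTF -o counts/${SAMPLE}.counts.txt bam/${SAMPLE}.Aligned.sortedByCoord.out.bam"

printf '%s\n' "$ALIGN_CMD" "$COUNT_CMD"
```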
==== Metagenomics / Microbiome ====
FASTQ --> FastQC --> fastp --> Kraken2/MetaPhlAn --> Diversity analysis
*TODO: Create metagenomics pipeline pages when needed.*
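Likewise, the classification step can be sketched as follows; the Kraken2 database path is a placeholder assumption (a pre-built database is required), and the command is printed rather than run:

```shell
#!/bin/bash
# Dry-run sketch of taxonomic classification with Kraken2.
# DB is a placeholder for a pre-built Kraken2 database.
SAMPLE=sample01
DB=ref/kraken2_db
THREADS=8

CLASSIFY_CMD="kraken2 --db $DB --threads $THREADS --paired --report kraken/${SAMPLE}.report --output kraken/${SAMPLE}.kraken trim/${SAMPLE}_R1.fastq.gz trim/${SAMPLE}_R2.fastq.gz"

echo "$CLASSIFY_CMD"
```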
----
===== Script Organization =====
ABI uses a **parent/daughter script pattern** for Slurm jobs:
  * **Daughter script** -- A reusable function or tool wrapper (e.g., ''src/fastqc.sh'', ''src/align.sh'') that takes parameters such as input/output directories. Contains no scheduler-specific settings.
* **Parent script** -- A Slurm job script that sets parameters and calls the daughter script. Contains ''#SBATCH'' directives.
Example:
project/
src/
fastqc.sh # Daughter: runs FastQC
align.sh # Daughter: runs BWA mem
fastqc_00.sh # Parent: Slurm job calling src/fastqc.sh
align_00.sh # Parent: Slurm job calling src/align.sh
log/ # Job output logs
fastq/ # Input FASTQ files
bam/ # Output BAM files
This approach allows you to:
* Reuse daughter scripts across projects
* Keep Slurm parameters separate from tool logic
* Track each run via its parent script and log file
See [[scripts:run_job_on_slurm|Running Jobs on Slurm]] for more on this pattern.
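A minimal, self-contained illustration of the pattern, following the names in the example tree (''src/fastqc.sh'', ''fastqc_00.sh''); ''echo'' stands in for the real FastQC call so the sketch runs anywhere, and the Slurm resource values are placeholder assumptions:

```shell
#!/bin/bash
# Parent/daughter sketch. `echo` replaces the real FastQC call so this
# runs without FastQC or Slurm installed.
mkdir -p src log

# Daughter script: reusable wrapper taking input dir, output dir, threads.
cat > src/fastqc.sh <<'EOF'
#!/bin/bash
set -euo pipefail
IN_DIR=$1
OUT_DIR=$2
THREADS=${3:-1}
echo "fastqc --threads $THREADS --outdir $OUT_DIR $IN_DIR"
EOF

# Parent script: Slurm directives plus the parameters for one run.
cat > fastqc_00.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=fastqc_00
#SBATCH --output=log/%x_%j.out
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00
bash src/fastqc.sh fastq qc 4
EOF

# On the cluster: sbatch fastqc_00.sh
# Run directly here to show the command the daughter would launch:
RESULT=$(bash fastqc_00.sh)
echo "$RESULT"
```

Because all ''#SBATCH'' directives live in the parent, the daughter can be reused unchanged from any project or submitted with different resources.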
----
===== Tips =====
  * **Create a ''log/'' directory** before submitting jobs -- Slurm does not create missing directories, and a job whose ''--output'' path cannot be written will fail.
* **Use one parent script per run** -- name them descriptively (e.g., ''align_sample01.sh'', ''align_sample02.sh'') or use [[software:slurm#job_arrays|job arrays]].
* **Document your parameters** -- add comments in parent scripts noting why you chose specific settings.
* **Check QC at every step** -- run FastQC/MultiQC after trimming and after alignment.
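For the job-array variant of the "one parent per run" tip, a hypothetical array parent can pick its sample from a list using ''SLURM_ARRAY_TASK_ID''. The file names and the stand-in ''echo'' are assumptions; submitting for real is ''sbatch align_array.sh'':

```shell
#!/bin/bash
# Hypothetical job-array parent: one Slurm task per line of samples.txt.
mkdir -p src log
printf 'sample01\nsample02\n' > samples.txt

cat > align_array.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=align
#SBATCH --output=log/%x_%A_%a.out
#SBATCH --array=1-2
# Slurm sets SLURM_ARRAY_TASK_ID; use it to pick this task's sample.
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
echo "bash src/align.sh $SAMPLE"   # echo stands in for the real call
EOF

# On the cluster: sbatch align_array.sh
# Simulate array task 2 locally:
RESULT=$(SLURM_ARRAY_TASK_ID=2 bash align_array.sh)
echo "$RESULT"
```

Keep the ''--array'' range in sync with the number of lines in ''samples.txt''.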
----
===== See Also =====
* [[scripts:start|Scripts]] -- Individual reusable scripts
* [[software:slurm|Using Slurm]] -- Job submission and management
* [[databases:start|Databases & Reference Data]] -- Reference genomes and indexes
* [[projects:start|Projects]] -- Project-specific documentation