====== Bioinformatics Pipelines ======

This section documents end-to-end bioinformatics workflows used at ABI. Each pipeline page describes a complete workflow from raw data to results, including the tools, scripts, and parameters used. For individual reusable scripts, see the [[scripts:start|Scripts]] section.

===== Common Pipelines =====

The typical NGS analysis workflow follows these steps:

  Raw FASTQ --> QC --> Trimming --> Alignment --> Post-processing --> Variant Calling / Analysis

Each step is documented in detail:

===== Pipeline Steps =====

^ Step ^ Description ^ Tools ^ Guide ^
| 1. Download data | Download FASTQ files from sequencing facility or public databases | wget, sra-tools | [[scripts:download_fastq|Download FASTQ]] |
| 2. Quality control | Assess read quality before and after trimming | FastQC, MultiQC, fastp | [[scripts:qc|Quality Control]] |
| 3. Adapter & quality trimming | Remove adapters and low-quality bases | fastp, cutadapt | [[scripts:adapter_and_quality_trimming|Trimming]] |
| 4. Alignment | Map reads to a reference genome | BWA mem, Bowtie2, STAR | [[scripts:alignment|Alignment]] |
| 5. Post-alignment processing | Sort, index, mark duplicates | samtools, Picard | *TODO: create page* |
| 6. Variant calling | Call SNPs and indels | GATK HaplotypeCaller, bcftools | *TODO: create page* |
| 7. Variant filtering & annotation | Filter and annotate variants | GATK, SnpEff, VEP | *TODO: create page* |
| 8. Downstream analysis | Statistical analysis, visualization | R, Python | *TODO: project-specific* |

----

===== Workflow by Data Type =====

Different types of sequencing data require different pipelines:

==== Whole Genome Sequencing (WGS) / Whole Exome Sequencing (WES) ====

  FASTQ --> FastQC --> fastp --> BWA mem --> samtools sort --> Mark Duplicates --> GATK HaplotypeCaller --> Filter --> Annotate

Relevant guides:
  * [[scripts:qc|QC]] --> [[scripts:adapter_and_quality_trimming|Trimming]] --> [[scripts:alignment|Alignment (BWA)]]
  * *TODO: Add variant calling and annotation guides*

==== RNA-seq ====

  FASTQ --> FastQC --> fastp --> STAR --> featureCounts/HTSeq --> DESeq2/edgeR

*TODO: Create RNA-seq specific pipeline pages when needed.*

==== Metagenomics / Microbiome ====

  FASTQ --> FastQC --> fastp --> Kraken2/MetaPhlAn --> Diversity analysis

*TODO: Create metagenomics pipeline pages when needed.*

----

===== Script Organization =====

ABI uses a **parent/daughter script pattern** for Slurm jobs:

  * **Daughter script** -- A reusable function or tool wrapper (e.g., ''src/fastqc.sh'', ''src/align.sh''). Takes parameters like input/output directories.
  * **Parent script** -- A Slurm job script that sets parameters and calls the daughter script. Contains ''#SBATCH'' directives.

Example:

  project/
    src/
      fastqc.sh    # Daughter: runs FastQC
      align.sh     # Daughter: runs BWA mem
    fastqc_00.sh   # Parent: Slurm job calling src/fastqc.sh
    align_00.sh    # Parent: Slurm job calling src/align.sh
    log/           # Job output logs
    fastq/         # Input FASTQ files
    bam/           # Output BAM files

This approach allows you to:
  * Reuse daughter scripts across projects
  * Keep Slurm parameters separate from tool logic
  * Track each run via its parent script and log file

See [[scripts:run_job_on_slurm|Running Jobs on Slurm]] for more on this pattern.

----

===== Tips =====

  * **Create a ''log/'' directory** before submitting jobs.
  * **Use one parent script per run** -- name them descriptively (e.g., ''align_sample01.sh'', ''align_sample02.sh'') or use [[software:slurm#job_arrays|job arrays]].
  * **Document your parameters** -- add comments in parent scripts noting why you chose specific settings.
  * **Check QC at every step** -- run FastQC/MultiQC after trimming and after alignment.

----

===== See Also =====

  * [[scripts:start|Scripts]] -- Individual reusable scripts
  * [[software:slurm|Using Slurm]] -- Job submission and management
  * [[databases:start|Databases & Reference Data]] -- Reference genomes and indexes
  * [[projects:start|Projects]] -- Project-specific documentation
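As a concrete sketch of the daughter half of the parent/daughter pattern, here is what a ''src/fastqc.sh'' like the one in the example layout might contain. The function name ''run_fastqc'', its argument order, and the default thread count are illustrative assumptions, not a fixed ABI convention:

```shell
#!/usr/bin/env bash
# src/fastqc.sh -- daughter script: reusable FastQC wrapper.
# Hypothetical sketch: the function name and defaults are illustrative.

run_fastqc () {
    local in_dir="$1"        # directory containing *.fastq.gz files
    local out_dir="$2"       # where FastQC reports are written
    local threads="${3:-2}"  # optional thread count, default 2

    mkdir -p "$out_dir"
    fastqc --threads "$threads" --outdir "$out_dir" "$in_dir"/*.fastq.gz
}
```

A parent script would source this file and call, for example, ''run_fastqc fastq qc''; because the daughter takes its directories as parameters, the same file can be reused unchanged in other projects.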
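Similarly, a sketch of what the ''src/align.sh'' daughter from the example layout might look like for the WGS/WES pipeline: BWA mem piped straight into ''samtools sort'' so no intermediate SAM file is written. The function name ''align_pair'', the placeholder read group, and the defaults are illustrative assumptions:

```shell
#!/usr/bin/env bash
# src/align.sh -- daughter script: paired-end BWA mem alignment.
# Hypothetical sketch: function name, read group, and defaults are illustrative.

align_pair () {
    local ref="$1"            # reference FASTA (indexed with bwa index)
    local fq1="$2" fq2="$3"   # paired-end FASTQ files
    local bam="$4"            # output coordinate-sorted BAM
    local threads="${5:-4}"

    mkdir -p "$(dirname "$bam")"
    # The read group (-R) is required later by GATK; ID/SM values are placeholders.
    bwa mem -t "$threads" -R '@RG\tID:sample\tSM:sample' "$ref" "$fq1" "$fq2" \
        | samtools sort -@ "$threads" -o "$bam" -
    samtools index "$bam"
}
```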
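And the parent half: a Slurm job script such as ''fastqc_00.sh'' that holds the ''#SBATCH'' directives and run-specific parameters. The resource values are placeholders to adjust per job, and it assumes the daughter exposes a ''run_fastqc <in_dir> <out_dir> [threads]'' function (an illustrative name):

```shell
#!/usr/bin/env bash
#SBATCH --job-name=fastqc_00
#SBATCH --output=log/%x_%j.out   # requires log/ to exist before submission
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00

# Parent script: parameters live here, tool logic lives in the daughter.
source src/fastqc.sh
run_fastqc fastq qc "$SLURM_CPUS_PER_TASK"
```

Submitted with ''sbatch fastqc_00.sh'', the run is then traceable via this parent script plus its log file in ''log/''.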
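For the one-parent-per-run tip, a job-array variant of a parent script can replace many near-identical copies. This sketch assumes a ''samples.txt'' file listing one sample name per line and a ''<sample>_R1/_R2'' FASTQ naming scheme; the ''align_pair'' function name is illustrative, standing in for whatever ''src/align.sh'' defines:

```shell
#!/usr/bin/env bash
#SBATCH --job-name=align
#SBATCH --array=1-2                 # one task per line of samples.txt
#SBATCH --output=log/%x_%A_%a.out   # %A = array job ID, %a = task index
#SBATCH --cpus-per-task=8
#SBATCH --time=12:00:00

# Hypothetical layout: samples.txt lists one sample name per line.
sample=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)

source src/align.sh   # daughter from the example layout
# Illustrative call; the function name and argument order are assumptions.
align_pair ref/genome.fa \
    "fastq/${sample}_R1.fastq.gz" "fastq/${sample}_R2.fastq.gz" \
    "bam/${sample}.bam" "$SLURM_CPUS_PER_TASK"
```

Each array task writes its own log file, which keeps per-sample runs traceable without hand-copying parent scripts.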