User Tools

Site Tools


scripts:adapter_and_quality_trimming

This is an old revision of the document!


How to perform quality control and adapter trimming of your reads with cutadapt and fastp

What to trim

Adapters

If your FASTQC report shows adapter contamination, remove adapters. There are three common adapters, frequently used in NGS:

  • Illumina universal adapter (most common): AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
  • Illumina Nextera adapter: CTGTCTCTTATACACATCT
  • TruSeq adapter: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

Adapters are usually found at the 3' end of your reads. In rare cases, you'll also need to remove barcodes or adapters from the 5' end.

Low quality bases

The 3' ends of the reads usually have lower quality (FASQC Report section: “Per base sequence quality”). If those are too low, you may want to trim all the reads by a certain number of bases from the 3' end.

The 5' end may also have issues. You can check the “Per base sequence content” section of FASTQC. If the bases are not uniformly distributed at the 5' end (should be close to a straight line for each base), then there may be barcodes or quality issues. You can choose to remove the first few bases until the distribution becomes uniform.

Short reads

After adapter and quality trimming, some reads may end up being too short. You can remove short read entirely from the FASTQ file. As a rule of thumb, reads shorter than 20 bp should be removed.

Example script with Cutadapt

Daughter script

This script takes the adapters, number of bases to trim, and number of threads for parallel processing as input. It operates in single-end and paired-end mode. Please note, it is important to treat your paired-end files in parallel mode, otherwise, some reads may be removed from one file, and stay in the other, which will lead to problems during alignment.

The script will search for Illumina Universal Adapter. You can modify the script to select from another two available in it (just set the one you choose as the adapter) or you can add yours.

cutadapt.sh

# ------------------------------------------
# SCRIPT TO CUT ADAPTERS AND TRIM LOW-QUALITY READS
# Processes FASTQ files using `cutadapt` to:
#   1. Remove adapter sequences.
#   2. Trim low-quality bases (Q < 20) from ends of reads.
#   3. Discard reads shorter than 20 bp after trimming.
#   4. Works for both single-end and paired-end reads.
# ------------------------------------------

# Input parameters
input_dir="$1"
output_dir="$2"
trim_5p="$3"      # Number of bases to trim from the 5' end
trim_3p="$4"      # Number of bases to trim from the 3' end
threads="$5"      # Number of threads
paired="$6"       # Set to "true" for paired-end processing

# ------------------------------------------
# DEFAULT ADAPTER SETS
# ------------------------------------------
# Illumina universal adapter (most common)
illumina_universal_adapter="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
# Illumina Nextera adapter
nextera_adapter="CTGTCTCTTATACACATCT"
# TruSeq adapter
truseq_adapter="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Use the Illumina universal adapter by default
adapter="$illumina_universal_adapter"

# ------------------------------------------
# VALIDATION
# ------------------------------------------
if [[ -z "$input_dir" || -z "$output_dir" || -z "$trim_5p" || -z "$trim_3p" || -z "$threads" ]]; then
    echo "Usage: sbatch cutadapt.sh <input_dir> <output_dir> <trim_5p> <trim_3p> <threads> [paired]"
    exit 1
fi

# Create output directory if it doesn't exist
mkdir -p "$output_dir"

# Log the command execution
echo "Script initiated with command: $0 $@"
echo "Using default adapter: $adapter"
echo "Trimming $trim_5p bases from the 5' end and $trim_3p bases from the 3' end"

# ------------------------------------------
# PROCESS FILES WITH CUTADAPT
# ------------------------------------------

if [[ "$paired" == "true" ]]; then
    # ---------- PAIRED-END MODE ----------
    echo "Running in paired-end mode..."

    for file1 in "$input_dir"/*_1.{fastq,fq}.gz "$input_dir"/*.R1.{fastq,fq}.gz; do
        if [[ -f "$file1" ]]; then
            file2="${file1/_1/_2}"
            file2="${file2/R1/R2}"

            if [[ -f "$file2" ]]; then
                filename=$(basename "$file1" | sed 's/_1\.fastq\.gz//;s/_1\.fq\.gz//;s/\.R1\.fastq\.gz//;s/\.R1\.fq\.gz//')

                echo "Processing paired-end files: $file1 and $file2 using $threads threads..."

                # Run Cutadapt for paired-end
                cutadapt -j "$threads" -q 20 -m 20 \
                    -a "$adapter" -A "$adapter" \
                    -u "$trim_5p" -U "$trim_5p" \
                    -u "-$trim_3p" -U "-$trim_3p" \
                    -o "$output_dir/${filename}_1.fq.gz" -p "$output_dir/${filename}_2.fq.gz" \
                    "$file1" "$file2"

                echo "Finished processing paired-end files: $filename"
            else
                echo "Paired file not found for $file1"
            fi
        fi
    done
else
    # ---------- SINGLE-END MODE ----------
    echo "Running in single-end mode..."

    for file in "$input_dir"/*.{fastq,fq}.gz; do
        if [[ -f "$file" ]]; then
            filename=$(basename "$file" | sed 's/\.fastq\.gz/\.fq.gz/;s/\.fq\.gz/\.fq.gz/')
            echo "Processing single-end file: $filename using $threads threads..."

            # Run Cutadapt for single-end
            cutadapt -j "$threads" -q 20 -m 20 \
                -a "$adapter" \
                -u "$trim_5p" \
                -u "-$trim_3p" \
                -o "$output_dir/$filename" "$file"

            echo "Finished processing single-end file: $filename"
        fi
    done
fi

# Completion message
echo "All files processed. Output saved in $output_dir."

Parent script

cutadapt_00.sh

#!/bin/bash
#SBATCH --mem=32gb
#SBATCH --cpus-per-task=10
#SBATCH --job-name=cutadapt_00
#SBATCH --output=log/cutadapt_00%j.log  # %j will be replaced with the job ID

# Parameters (example, modify as needed)
input_dir=fq_renamed
output_dir=fq_trimmed
trim_5p=10      # Number of bases to trim from the 5' end
trim_3p=0      # Number of bases to trim from the 3' end
threads=10      # Number of threads
paired=true       # Set to "true" for paired-end processing

src/cutadapt.sh $input_dir $output_dir $trim_5p $trim_3p $threads $paired

Example script with Fastp

Daughter script

This script takes the adapters, number of bases to trim, and number of threads for parallel processing as input and generates fastp reports (html, json). It operates in single-end and paired-end mode. Please note, it is important to treat your paired-end files in parallel mode, otherwise, some reads may be removed from one file, and stay in the other, which will lead to problems during alignment.

The script will search for Illumina Universal Adapter. You can modify the script to select from another two available in it (just set the one you choose as the adapter) or you can add yours. It also trims polyG and polyX tails, low complexity reads. There is also —-detect_adapter_pe option, which can automatically detect adapters for paired end reads (just remove #).

fastp.sh


# ------------------------------------------
# SCRIPT TO CUT ADAPTERS AND TRIM LOW-QUALITY READS
# Processes FASTQ files using `fastp` to:
#   1. Remove adapter sequences.
#   2. Trim low-quality bases (Q < 20) from ends of reads.
#   3. Discard reads shorter than 20 bp after trimming.
#   4. Works for both single-end and paired-end reads.
# ------------------------------------------

# Input parameters
input_dir="$1"
output_dir="$2"
trim_5p="$3"      # Number of bases to trim from the 5' end
trim_3p="$4"      # Number of bases to trim from the 3' end
threads="$5"      # Number of threads
paired="$6"       # Set to "true" for paired-end processing

# ------------------------------------------
# DEFAULT ADAPTER SETS
# ------------------------------------------
# Illumina universal adapter (most common)
illumina_universal_adapter="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
# Illumina Nextera adapter
nextera_adapter="CTGTCTCTTATACACATCT"
# TruSeq adapter
truseq_adapter="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Use the Illumina universal adapter by default
adapter="$illumina_universal_adapter"

# ------------------------------------------
# VALIDATION
# ------------------------------------------
if [[ -z "$input_dir" || -z "$output_dir" || -z "$trim_5p" || -z "$trim_3p" || -z "$threads" ]]; then
    echo "Usage: sbatch fastp_trimming.sh <input_dir> <output_dir> <trim_5p> <trim_3p> <threads> [paired]"
    exit 1
fi

# Create output directory if it doesn't exist
mkdir -p "$output_dir"

# Log the command execution
echo "Script initiated with command: $0 $@"
echo "Using default adapter: $adapter"
echo "Trimming $trim_5p bases from the 5' end and $trim_3p bases from the 3' end"

# ------------------------------------------
# PROCESS FILES WITH FASTP
# ------------------------------------------

if [[ "$paired" == "true" ]]; then
    # ---------- PAIRED-END MODE ----------
    echo "Running in paired-end mode..."

    for file1 in "$input_dir"/*_1.{fastq,fq}.gz "$input_dir"/*.R1.{fastq,fq}.gz; do
        if [[ -f "$file1" ]]; then
            file2="${file1/_1/_2}"
            file2="${file2/R1/R2}"

            if [[ -f "$file2" ]]; then
                filename=$(basename "$file1" | sed 's/_1\.fastq\.gz//;s/_1\.fq\.gz//;s/\.R1\.fastq\.gz//;s/\.R1\.fq\.gz//')

                echo "Processing paired-end files: $file1 and $file2 using $threads threads..."

                # Run fastp for paired-end
                fastp --in1 "$file1" --in2 "$file2" \
                      --out1 "$output_dir/${filename}_1.fq.gz" \
                      --out2 "$output_dir/${filename}_2.fq.gz" \
                      --trim_poly_g --trim_poly_x \
                      --low_complexity_filter \
                      --qualified_quality_phred 20 \
                      --length_required 20 \
                      --thread "$threads" \
                      --trim_front1 "$trim_5p" --trim_tail1 "$trim_3p" \
                      --trim_front2 "$trim_5p" --trim_tail2 "$trim_3p" \
                      --adapter_sequence "$adapter" \
                      #--detect_adapter_pe \
                      --html "$output_dir/${filename}_fastp.html" \
                      --json "$output_dir/${filename}_fastp.json"

                echo "Finished processing paired-end files: $filename"
            else
                echo "Paired file not found for $file1"
            fi
        fi
    done
else
    # ---------- SINGLE-END MODE ----------
    echo "Running in single-end mode..."

    for file in "$input_dir"/*.{fastq,fq}.gz; do
        if [[ -f "$file" ]]; then
            filename=$(basename "$file" | sed 's/\.fastq\.gz/\.fq.gz/;s/\.fq\.gz/\.fq.gz/')

            echo "Processing single-end file: $filename using $threads threads..."

            # Run fastp for single-end
            fastp --in1 "$file" \
                  --out1 "$output_dir/$filename" \
                  --trim_poly_g --trim_poly_x \
                  --low_complexity_filter \
                  --qualified_quality_phred 20 \
                  --length_required 20 \
                  --thread "$threads" \
                  --trim_front1 "$trim_5p" --trim_tail1 "$trim_3p" \
                  --adapter_sequence "$adapter" \
                  --html "$output_dir/${filename}_fastp.html" \
                  --json "$output_dir/${filename}_fastp.json"

            echo "Finished processing single-end file: $filename"
        fi
    done
fi

# Completion message
echo "All files processed. Output saved in $output_dir."

Parent script

fastp_00.sh

#!/bin/bash
#SBATCH --mem=32gb
#SBATCH --cpus-per-task=10
#SBATCH --job-name=fastp_00
#SBATCH --output=log/fastp_00%j.log  # %j will be replaced with the job ID

# Parameters (example, modify as needed)
input_dir=fq_renamed
output_dir=fq_trimmed
trim_5p=10	# Number of bases to trim from the 5' end
trim_3p=0	# Number of bases to trim from the 3' end
threads=10	# Number of threads
paired=true	# Set to "true" for paired-end processing

src/fastp.sh $input_dir $output_dir $trim_5p $trim_3p $threads $paired
scripts/adapter_and_quality_trimming.1742898637.txt.gz · Last modified: by 37.26.174.181 · Currently locked by: 37.26.174.181

Donate Powered by PHP Valid HTML5 Valid CSS Driven by DokuWiki