This is an old revision of the document!
Table of Contents
How to perform quality and adapter trimming of your reads with cutadapt and fastp
What to trim
Adapters
If your FASTQC report shows adapter contamination, remove adapters. There are three common adapters, frequently used in NGS:
- Illumina universal adapter (most common):
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC - Illumina Nextera adapter:
CTGTCTCTTATACACATCT - TruSeq adapter:
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Adapters are usually found at the 3' end of your reads. In rare cases, you'll also need to remove barcodes or adapters from the 5' end.
Low quality bases
The 3' ends of the reads usually have lower quality (FASQC Report section: “Per base sequence quality”). If those are too low, you may want to trim all the reads by a certain number of bases from the 3' end.
The 5' end may also have issues. You can check the “Per base sequence content” section of FASTQC. If the bases are not uniformly distributed at the 5' end (should be close to a straight line for each base), then there may be barcodes or quality issues. You can choose to remove the first few bases until the distribution becomes uniform.
Short reads
After adapter and quality trimming, some reads may end up being too short. You can remove short read entirely from the FASTQ file. As a rule of thumb, reads shorter than 20 bp should be removed.
Example script with Cutadapt
Daughter script
This script takes the adapters, number of bases to trim, and number of threads for parallel processing as input. It operates in single-end and paired-end mode. Please note, it is important to treat your paired-end files in parallel mode, otherwise, some reads may be removed from one file, and stay in the other, which will lead to problems during alignment.
The script will search for Illumina Universal Adapter. You can modify the script to select from another two available in it (just set the one you choose as the adapter) or you can add yours.
cutadapt.sh
# ------------------------------------------
# SCRIPT TO CUT ADAPTERS AND TRIM LOW-QUALITY READS
# Processes FASTQ files using `cutadapt` to:
# 1. Remove adapter sequences.
# 2. Trim low-quality bases (Q < 20) from ends of reads.
# 3. Discard reads shorter than 20 bp after trimming.
# 4. Works for both single-end and paired-end reads.
# ------------------------------------------
# Input parameters
input_dir="$1"
output_dir="$2"
trim_5p="$3" # Number of bases to trim from the 5' end
trim_3p="$4" # Number of bases to trim from the 3' end
threads="$5" # Number of threads
paired="$6" # Set to "true" for paired-end processing
# ------------------------------------------
# DEFAULT ADAPTER SETS
# ------------------------------------------
# Illumina universal adapter (most common)
illumina_universal_adapter="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
# Illumina Nextera adapter
nextera_adapter="CTGTCTCTTATACACATCT"
# TruSeq adapter
truseq_adapter="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
# Use the Illumina universal adapter by default
adapter="$illumina_universal_adapter"
# ------------------------------------------
# VALIDATION
# ------------------------------------------
if [[ -z "$input_dir" || -z "$output_dir" || -z "$trim_5p" || -z "$trim_3p" || -z "$threads" ]]; then
echo "Usage: sbatch cutadapt.sh <input_dir> <output_dir> <trim_5p> <trim_3p> <threads> [paired]"
exit 1
fi
# Create output directory if it doesn't exist
mkdir -p "$output_dir"
# Log the command execution
echo "Script initiated with command: $0 $@"
echo "Using default adapter: $adapter"
echo "Trimming $trim_5p bases from the 5' end and $trim_3p bases from the 3' end"
# ------------------------------------------
# PROCESS FILES WITH CUTADAPT
# ------------------------------------------
if [[ "$paired" == "true" ]]; then
# ---------- PAIRED-END MODE ----------
echo "Running in paired-end mode..."
for file1 in "$input_dir"/*_1.{fastq,fq}.gz "$input_dir"/*.R1.{fastq,fq}.gz; do
if [[ -f "$file1" ]]; then
file2="${file1/_1/_2}"
file2="${file2/R1/R2}"
if [[ -f "$file2" ]]; then
filename=$(basename "$file1" | sed 's/_1\.fastq\.gz//;s/_1\.fq\.gz//;s/\.R1\.fastq\.gz//;s/\.R1\.fq\.gz//')
echo "Processing paired-end files: $file1 and $file2 using $threads threads..."
# Run Cutadapt for paired-end
cutadapt -j "$threads" -q 20 -m 20 \
-a "$adapter" -A "$adapter" \
-u "$trim_5p" -U "$trim_5p" \
-u "-$trim_3p" -U "-$trim_3p" \
-o "$output_dir/${filename}_1.fq.gz" -p "$output_dir/${filename}_2.fq.gz" \
"$file1" "$file2"
echo "Finished processing paired-end files: $filename"
else
echo "Paired file not found for $file1"
fi
fi
done
else
# ---------- SINGLE-END MODE ----------
echo "Running in single-end mode..."
for file in "$input_dir"/*.{fastq,fq}.gz; do
if [[ -f "$file" ]]; then
filename=$(basename "$file" | sed 's/\.fastq\.gz/\.fq.gz/;s/\.fq\.gz/\.fq.gz/')
echo "Processing single-end file: $filename using $threads threads..."
# Run Cutadapt for single-end
cutadapt -j "$threads" -q 20 -m 20 \
-a "$adapter" \
-u "$trim_5p" \
-u "-$trim_3p" \
-o "$output_dir/$filename" "$file"
echo "Finished processing single-end file: $filename"
fi
done
fi
# Completion message
echo "All files processed. Output saved in $output_dir."
Parent script
cutadapt_00.sh
#!/bin/bash #SBATCH --mem=32gb #SBATCH --cpus-per-task=10 #SBATCH --job-name=cutadapt_00 #SBATCH --output=log/cutadapt_00%j.log # %j will be replaced with the job ID # Parameters (example, modify as needed) input_dir=fq_renamed output_dir=fq_trimmed trim_5p=10 # Number of bases to trim from the 5' end trim_3p=0 # Number of bases to trim from the 3' end threads=10 # Number of threads paired=true # Set to "true" for paired-end processing src/cutadapt.sh $input_dir $output_dir $trim_5p $trim_3p $threads $paired
Example script with Fastp
Daughter script
This script takes the adapters, number of bases to trim, and number of threads for parallel processing as input and generates fastp reports (html, json). It operates in single-end and paired-end mode. Please note, it is important to treat your paired-end files in parallel mode, otherwise, some reads may be removed from one file, and stay in the other, which will lead to problems during alignment.
The script will search for Illumina Universal Adapter. You can modify the script to select from another two available in it (just set the one you choose as the adapter) or you can add yours. It also trims polyG and polyX tails, low complexity reads.
fastp.sh
# ------------------------------------------
# SCRIPT TO CUT ADAPTERS AND TRIM LOW-QUALITY READS
# Processes FASTQ files using `fastp` to:
# 1. Remove adapter sequences.
# 2. Trim low-quality bases (Q < 20) from ends of reads.
# 3. Discard reads shorter than 20 bp after trimming.
# 4. Works for both single-end and paired-end reads.
# ------------------------------------------
# Input parameters
input_dir="$1"
output_dir="$2"
trim_5p="$3" # Number of bases to trim from the 5' end
trim_3p="$4" # Number of bases to trim from the 3' end
threads="$5" # Number of threads
paired="$6" # Set to "true" for paired-end processing
# ------------------------------------------
# DEFAULT ADAPTER SETS
# ------------------------------------------
# Illumina universal adapter (most common)
illumina_universal_adapter="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
# Illumina Nextera adapter
nextera_adapter="CTGTCTCTTATACACATCT"
# TruSeq adapter
truseq_adapter="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
# Use the Illumina universal adapter by default
adapter="$illumina_universal_adapter"
# ------------------------------------------
# VALIDATION
# ------------------------------------------
if [[ -z "$input_dir" || -z "$output_dir" || -z "$trim_5p" || -z "$trim_3p" || -z "$threads" ]]; then
echo "Usage: sbatch fastp_trimming.sh <input_dir> <output_dir> <trim_5p> <trim_3p> <threads> [paired]"
exit 1
fi
# Create output directory if it doesn't exist
mkdir -p "$output_dir"
# Log the command execution
echo "Script initiated with command: $0 $@"
echo "Using default adapter: $adapter"
echo "Trimming $trim_5p bases from the 5' end and $trim_3p bases from the 3' end"
# ------------------------------------------
# PROCESS FILES WITH FASTP
# ------------------------------------------
if [[ "$paired" == "true" ]]; then
# ---------- PAIRED-END MODE ----------
echo "Running in paired-end mode..."
for file1 in "$input_dir"/*_1.{fastq,fq}.gz "$input_dir"/*.R1.{fastq,fq}.gz; do
if [[ -f "$file1" ]]; then
file2="${file1/_1/_2}"
file2="${file2/R1/R2}"
if [[ -f "$file2" ]]; then
filename=$(basename "$file1" | sed 's/_1\.fastq\.gz//;s/_1\.fq\.gz//;s/\.R1\.fastq\.gz//;s/\.R1\.fq\.gz//')
echo "Processing paired-end files: $file1 and $file2 using $threads threads..."
# Run fastp for paired-end
fastp --in1 "$file1" --in2 "$file2" \
--out1 "$output_dir/${filename}_1.fq.gz" \
--out2 "$output_dir/${filename}_2.fq.gz" \
--trim_poly_g --trim_poly_x \
--low_complexity_filter \
--qualified_quality_phred 20 \
--length_required 20 \
--thread "$threads" \
--trim_front1 "$trim_5p" --trim_tail1 "$trim_3p" \
--trim_front2 "$trim_5p" --trim_tail2 "$trim_3p" \
--adapter_sequence "$adapter" \
--html "$output_dir/${filename}_fastp.html" \
--json "$output_dir/${filename}_fastp.json"
echo "Finished processing paired-end files: $filename"
else
echo "Paired file not found for $file1"
fi
fi
done
else
# ---------- SINGLE-END MODE ----------
echo "Running in single-end mode..."
for file in "$input_dir"/*.{fastq,fq}.gz; do
if [[ -f "$file" ]]; then
filename=$(basename "$file" | sed 's/\.fastq\.gz/\.fq.gz/;s/\.fq\.gz/\.fq.gz/')
echo "Processing single-end file: $filename using $threads threads..."
# Run fastp for single-end
fastp --in1 "$file" \
--out1 "$output_dir/$filename" \
--trim_poly_g --trim_poly_x \
--low_complexity_filter \
--qualified_quality_phred 20 \
--length_required 20 \
--thread "$threads" \
--trim_front1 "$trim_5p" --trim_tail1 "$trim_3p" \
--adapter_sequence "$adapter" \
--html "$output_dir/${filename}_fastp.html" \
--json "$output_dir/${filename}_fastp.json"
echo "Finished processing single-end file: $filename"
fi
done
fi
# Completion message
echo "All files processed. Output saved in $output_dir."
Parent script
fastp_00.sh
#!/bin/bash #SBATCH --mem=32gb #SBATCH --cpus-per-task=10 #SBATCH --job-name=fastp_00 #SBATCH --output=log/fastp_00%j.log # %j will be replaced with the job ID # Parameters (example, modify as needed) input_dir=fq_renamed output_dir=fq_trimmed trim_5p=10 # Number of bases to trim from the 5' end trim_3p=0 # Number of bases to trim from the 3' end threads=10 # Number of threads paired=true # Set to "true" for paired-end processing src/fastp.sh $input_dir $output_dir $trim_5p $trim_3p $threads $paired
