✎ scripts:adapter_and_quality_trimming [ABI Knowledge Base]

You've loaded an old revision of the document! If you save it, you will create a new version with this data.

====== How to perform quality and adapter trimming of your reads ======

===== What to trim =====

**Adapters**

If your FASTQC report shows adapter contamination, remove adapters. There are three common adapters, frequently used in NGS:

* Illumina universal adapter (most common): ''AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC''
  * Illumina Nextera adapter: ''CTGTCTCTTATACACATCT''
  * TruSeq adapter: ''AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT''

Adapters are usually found at the 3' end of your reads. In rare cases, you'll also need to remove barcodes or adapters from the 5' end.

**Low quality bases**

The 3' ends of the reads usually have lower quality (FASQC Report section: "Per base sequence quality"). If those are too low, you may want to trim all the reads by a certain number of bases from the 3' end.

The 5' end may also have issues. You can check the "Per base sequence content" section of FASTQC. If the bases are not uniformly distributed at the 5' end (should be close to a straight line for each base), then there may be barcodes or quality issues. You can choose to remove the first few bases until the distribution becomes uniform.

**Short reads**

After adapter and quality trimming, some reads may end up being too short. You can remove short read entirely from the FASTQ file. As a rule of thumb, reads shorter than 20 bp should be removed.

===== Example script with Cutadapt =====

==== Daughter script ====

This script takes the adapters, number of bases to trim, and number of threads for parallel processing as input. It operates in single-end and paired-end mode. Please note, it is important to treat your paired-end files in parallel mode, otherwise, some reads may be removed from one file, and stay in the other, which will lead to problems during alignment.

The script will search for **Illumina Universal Adapter**. You can modify the script to select from another two available in it (just set the one you choose as the ''adapter'') or you can add yours.

**cutadapt.sh**

<code>

# ------------------------------------------
# SCRIPT TO CUT ADAPTERS AND TRIM LOW-QUALITY READS
# Processes FASTQ files using `cutadapt` to:
#   1. Remove adapter sequences.
#   2. Trim low-quality bases (Q < 20) from ends of reads.
#   3. Discard reads shorter than 20 bp after trimming.
#   4. Works for both single-end and paired-end reads.
# ------------------------------------------

# Input parameters
input_dir="$1"
output_dir="$2"
trim_5p="$3"      # Number of bases to trim from the 5' end
trim_3p="$4"      # Number of bases to trim from the 3' end
threads="$5"      # Number of threads
paired="$6"       # Set to "true" for paired-end processing

# ------------------------------------------
# DEFAULT ADAPTER SETS
# ------------------------------------------
# Illumina universal adapter (most common)
illumina_universal_adapter="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
# Illumina Nextera adapter
nextera_adapter="CTGTCTCTTATACACATCT"
# TruSeq adapter
truseq_adapter="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Use the Illumina universal adapter by default
adapter="$illumina_universal_adapter"

# ------------------------------------------
# VALIDATION
# ------------------------------------------
if [[ -z "$input_dir" || -z "$output_dir" || -z "$trim_5p" || -z "$trim_3p" || -z "$threads" ]]; then
    echo "Usage: sbatch cutadapt.sh <input_dir> <output_dir> <trim_5p> <trim_3p> <threads> [paired]"
    exit 1
fi

# Create output directory if it doesn't exist
mkdir -p "$output_dir"

# Log the command execution
echo "Script initiated with command: $0 $@"
echo "Using default adapter: $adapter"
echo "Trimming $trim_5p bases from the 5' end and $trim_3p bases from the 3' end"

# ------------------------------------------
# PROCESS FILES WITH CUTADAPT
# ------------------------------------------

if [[ "$paired" == "true" ]]; then
    # ---------- PAIRED-END MODE ----------
    echo "Running in paired-end mode..."

for file1 in "$input_dir"/*_1.{fastq,fq}.gz "$input_dir"/*.R1.{fastq,fq}.gz; do
        if [[ -f "$file1" ]]; then
            file2="${file1/_1/_2}"
            file2="${file2/R1/R2}"

if [[ -f "$file2" ]]; then
                filename=$(basename "$file1" | sed 's/_1\.fastq\.gz//;s/_1\.fq\.gz//;s/\.R1\.fastq\.gz//;s/\.R1\.fq\.gz//')

echo "Processing paired-end files: $file1 and $file2 using $threads threads..."

# Run Cutadapt for paired-end
                cutadapt -j "$threads" -q 20 -m 20 \
                    -a "$adapter" -A "$adapter" \
                    -u "$trim_5p" -U "$trim_5p" \
                    -u "-$trim_3p" -U "-$trim_3p" \
                    -o "$output_dir/${filename}_1.fq.gz" -p "$output_dir/${filename}_2.fq.gz" \
                    "$file1" "$file2"

echo "Finished processing paired-end files: $filename"
            else
                echo "Paired file not found for $file1"
            fi
        fi
    done
else
    # ---------- SINGLE-END MODE ----------
    echo "Running in single-end mode..."

for file in "$input_dir"/*.{fastq,fq}.gz; do
        if [[ -f "$file" ]]; then
            filename=$(basename "$file" | sed 's/\.fastq\.gz/\.fq.gz/;s/\.fq\.gz/\.fq.gz/')
            echo "Processing single-end file: $filename using $threads threads..."

# Run Cutadapt for single-end
            cutadapt -j "$threads" -q 20 -m 20 \
                -a "$adapter" \
                -u "$trim_5p" \
                -u "-$trim_3p" \
                -o "$output_dir/$filename" "$file"

echo "Finished processing single-end file: $filename"
        fi
    done
fi

# Completion message
echo "All files processed. Output saved in $output_dir."

</code>

==== Parent script ====

**cutadapt_00.sh**

<code>
#!/bin/bash
#SBATCH --mem=32gb
#SBATCH --cpus-per-task=10
#SBATCH --job-name=cutadapt_00
#SBATCH --output=log/cutadapt_00%j.log  # %j will be replaced with the job ID

# Parameters (example, modify as needed)
input_dir=fq_renamed
output_dir=fq_trimmed
trim_5p=10      # Number of bases to trim from the 5' end
trim_3p=0      # Number of bases to trim from the 3' end
threads=10      # Number of threads
paired=true       # Set to "true" for paired-end processing

src/cutadapt.sh $input_dir $output_dir $trim_5p $trim_3p $threads $paired
</code>

Edit summary