You've loaded an old revision of the document! If you save it, you will create a new version with this data. Media Files====== How to perform quality and adapter trimming of your reads ====== ===== What to trim ===== **Adapters** If your FASTQC report shows adapter contamination, remove adapters. There are three common adapters, frequently used in NGS: * Illumina universal adapter (most common): ''AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'' * Illumina Nextera adapter: ''CTGTCTCTTATACACATCT'' * TruSeq adapter: ''AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT'' Adapters are usually found at the 3' end of your reads. In rare cases, you'll also need to remove barcodes or adapters from the 5' end. **Low quality bases** The 3' ends of the reads usually have lower quality (FASQC Report section: "Per base sequence quality"). If those are too low, you may want to trim all the reads by a certain number of bases from the 3' end. The 5' end may also have issues. You can check the "Per base sequence content" section of FASTQC. If the bases are not uniformly distributed at the 5' end (should be close to a straight line for each base), then there may be barcodes or quality issues. You can choose to remove the first few bases until the distribution becomes uniform. **Short reads** After adapter and quality trimming, some reads may end up being too short. You can remove short read entirely from the FASTQ file. As a rule of thumb, reads shorter than 20 bp should be removed. ===== Example script with Cutadapt ===== ==== Daughter script ==== This script takes the adapters, number of bases to trim, and number of threads for parallel processing as input. It operates in single-end and paired-end mode. Please note, it is important to treat your paired-end files in parallel mode, otherwise, some reads may be removed from one file, and stay in the other, which will lead to problems during alignment. The script will search for **Illumina Universal Adapter**. You can modify the script to select from another two available in it (just set the one you choose as the ''adapter'') or you can add yours. **cutadapt.sh** <code> # ------------------------------------------ # SCRIPT TO CUT ADAPTERS AND TRIM LOW-QUALITY READS # Processes FASTQ files using `cutadapt` to: # 1. Remove adapter sequences. # 2. Trim low-quality bases (Q < 20) from ends of reads. # 3. Discard reads shorter than 20 bp after trimming. # 4. Works for both single-end and paired-end reads. # ------------------------------------------ # Input parameters input_dir="$1" output_dir="$2" trim_5p="$3" # Number of bases to trim from the 5' end trim_3p="$4" # Number of bases to trim from the 3' end threads="$5" # Number of threads paired="$6" # Set to "true" for paired-end processing # ------------------------------------------ # DEFAULT ADAPTER SETS # ------------------------------------------ # Illumina universal adapter (most common) illumina_universal_adapter="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" # Illumina Nextera adapter nextera_adapter="CTGTCTCTTATACACATCT" # TruSeq adapter truseq_adapter="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Use the Illumina universal adapter by default adapter="$illumina_universal_adapter" # ------------------------------------------ # VALIDATION # ------------------------------------------ if [[ -z "$input_dir" || -z "$output_dir" || -z "$trim_5p" || -z "$trim_3p" || -z "$threads" ]]; then echo "Usage: sbatch cutadapt.sh <input_dir> <output_dir> <trim_5p> <trim_3p> <threads> [paired]" exit 1 fi # Create output directory if it doesn't exist mkdir -p "$output_dir" # Log the command execution echo "Script initiated with command: $0 $@" echo "Using default adapter: $adapter" echo "Trimming $trim_5p bases from the 5' end and $trim_3p bases from the 3' end" # ------------------------------------------ # PROCESS FILES WITH CUTADAPT # ------------------------------------------ if [[ "$paired" == "true" ]]; then # ---------- PAIRED-END MODE ---------- echo "Running in paired-end mode..." for file1 in "$input_dir"/*_1.{fastq,fq}.gz "$input_dir"/*.R1.{fastq,fq}.gz; do if [[ -f "$file1" ]]; then file2="${file1/_1/_2}" file2="${file2/R1/R2}" if [[ -f "$file2" ]]; then filename=$(basename "$file1" | sed 's/_1\.fastq\.gz//;s/_1\.fq\.gz//;s/\.R1\.fastq\.gz//;s/\.R1\.fq\.gz//') echo "Processing paired-end files: $file1 and $file2 using $threads threads..." # Run Cutadapt for paired-end cutadapt -j "$threads" -q 20 -m 20 \ -a "$adapter" -A "$adapter" \ -u "$trim_5p" -U "$trim_5p" \ -u "-$trim_3p" -U "-$trim_3p" \ -o "$output_dir/${filename}_1.fq.gz" -p "$output_dir/${filename}_2.fq.gz" \ "$file1" "$file2" echo "Finished processing paired-end files: $filename" else echo "Paired file not found for $file1" fi fi done else # ---------- SINGLE-END MODE ---------- echo "Running in single-end mode..." for file in "$input_dir"/*.{fastq,fq}.gz; do if [[ -f "$file" ]]; then filename=$(basename "$file" | sed 's/\.fastq\.gz/\.fq.gz/;s/\.fq\.gz/\.fq.gz/') echo "Processing single-end file: $filename using $threads threads..." # Run Cutadapt for single-end cutadapt -j "$threads" -q 20 -m 20 \ -a "$adapter" \ -u "$trim_5p" \ -u "-$trim_3p" \ -o "$output_dir/$filename" "$file" echo "Finished processing single-end file: $filename" fi done fi # Completion message echo "All files processed. Output saved in $output_dir." </code> ==== Parent script ==== **cutadapt_00.sh** <code> #!/bin/bash #SBATCH --mem=32gb #SBATCH --cpus-per-task=10 #SBATCH --job-name=cutadapt_00 #SBATCH --output=log/cutadapt_00%j.log # %j will be replaced with the job ID # Parameters (example, modify as needed) input_dir=fq_renamed output_dir=fq_trimmed trim_5p=10 # Number of bases to trim from the 5' end trim_3p=0 # Number of bases to trim from the 3' end threads=10 # Number of threads paired=true # Set to "true" for paired-end processing src/cutadapt.sh $input_dir $output_dir $trim_5p $trim_3p $threads $paired </code>SavePreviewCancel Edit summary