Download FASTQ from SRA

An easy intro

The easiest way to download FASTQ files from SRA is with the fastq-dump command from the SRA Toolkit:

fastq-dump --gzip --split-3 SRR[accession ID]

Options:

 ''--gzip'' compresses the downloaded FASTQ files (strongly recommended to save space)
 ''--split-3'' splits your files into _1 and _2 files for paired-end sequencing and stores singleton reads (those without a mate) in a separate file. If it's single-end sequencing, you can omit this option.
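For a hypothetical paired-end run (the accession below is a placeholder), the command and the files it typically produces look like this:

fastq-dump --gzip --split-3 SRR0000001
# Typical result for paired-end data:
#   SRR0000001_1.fastq.gz   forward reads
#   SRR0000001_2.fastq.gz   reverse reads
#   SRR0000001.fastq.gz     singletons without a mate (created only if any exist)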

In practice, however, you will want to put this command inside a script so you can download multiple files via Slurm. It is also advisable to make the script a bit more elaborate to handle possible connection issues during the download. Below is an example script that downloads the files for a given list of accessions, retrying each download several times in case of failure.

Script with download retry attempts

Daughter script

Name: fq_download.sh

#!/bin/bash
# SCRIPT FOR DOWNLOADING SRA FILES USING THE NCBI SRA TOOLKIT
# NOTE: Run this script from the directory where the "log" and "meta" directories are located
#
# PURPOSE:
#   This script reads SRA accession IDs from a given file (one per line)
#   and downloads each corresponding SRA file using fastq-dump. 
#   The script outputs the SRA accession IDs that failed to be downloaded in a .txt file (one per line)  
#
# PARAMETERS:
#   1: OUTPUT DIRECTORY   - Directory where downloaded files will be stored.
#   2: ACCESSION FILE     - Text file containing SRA accession IDs (one per line).
#
# SAMPLE USAGE:
#   sbatch src/fq_download.sh <output_directory> <accession_file>
#
# IMPORTANT:
#   - This script downloads files using fastq-dump, gzips the output and splits paired-end
#     reads into _1/_2 files (single-end reads are left in a single file).
#   - Ensure that the SRA Toolkit is installed and available.

# Check for required parameters
if [ "$#" -ne 2 ]; then
    echo "Usage: $0 <output_directory> <accession_file>"
    exit 1
fi

# Parameters
outdir="$1"
accession_file="$2"

# Ensure the accession file exists
if [ ! -f "$accession_file" ]; then
    echo "Error: Accession file '$accession_file' does not exist."
    exit 1
fi

# Create the output directory if it doesn't exist
mkdir -p "$outdir"

# Define the meta directory and create it if it doesn't exist.
meta_dir="meta"
mkdir -p "$meta_dir"

# Create (or empty) the failed downloads file in the meta directory
failed_file="$meta_dir/failed_sra.txt"
> "$failed_file"  # Truncate or create the file

# Define the log file with a fallback if SLURM_JOB_ID is not set
mkdir -p log  # Ensure the log directory exists before writing to it
log_file="log/download_sra_retry${SLURM_JOB_ID:-manual}.log"
echo "Command: $0 $@" > "$log_file"
echo "Job started on: $(date)" >> "$log_file"

# Function to download an SRA accession using fastq-dump with --split-3
download_sra() {
    local acc="$1"
    echo "Downloading accession: $acc" >> "$log_file"
    fastq-dump --gzip --split-3 "$acc" -O "$outdir"
    if [ "$?" -ne 0 ]; then
        echo "Error: fastq-dump failed for accession: $acc" >> "$log_file"
        return 1
    else
        echo "Successfully downloaded: $acc" >> "$log_file"
        return 0
    fi
}

# Function to download with retries
download_with_retry() {
    local acc="$1"
    local max_retries=10
    local attempt=1
    while [ $attempt -le $max_retries ]; do
        echo "Attempt $attempt for $acc" >> "$log_file"
        if download_sra "$acc"; then
            return 0  # Success
        fi
        ((attempt++))
    done
    echo "Failed all $max_retries attempts for $acc" >> "$log_file"
    # Append the failed accession to the failed_sra.txt file, one per line
    echo "$acc" >> "$failed_file"
    return 1
}

# Export the functions and variables for use in GNU Parallel
export -f download_sra download_with_retry
export outdir
export log_file
export failed_file

# Process all accessions in parallel (ignore lines starting with #)
accessions=$(grep -v '^#' "$accession_file")
if [ -z "$accessions" ]; then
    echo "Error: No valid accessions found in '$accession_file'." >> "$log_file"
    exit 1
fi

echo "Processing accessions in parallel..." >> "$log_file"
parallel -j 20 download_with_retry ::: $accessions

# Append an extra newline to the failed downloads file
echo "" >> "$failed_file"

# Check overall exit status and log the result
if [ "$?" -eq 0 ]; then
    echo "All accessions processed successfully." >> "$log_file"
else
    echo "One or more accessions encountered errors." >> "$log_file"
fi

echo "Job completed on: $(date)" >> "$log_file"

Parent script (for Slurm)

To run the script with your list of files, create a text file with one SRA accession per line (see the example at the end of this section), and create a separate parent script as follows:

#!/bin/bash
#SBATCH --mem=10gb
#SBATCH --cpus-per-task=10
#SBATCH --job-name=dwnld_fq
#SBATCH --output=log/download_fq_%j.log  # %j will be replaced with the job ID

# Parameters (example, modify as needed)
outdir="fq"  
accession_file="meta/sra_accessions.txt"

bash src/fq_download.sh "$outdir" "$accession_file"

Make sure the daughter script's path is correct.
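Putting it all together, a run might look like the sketch below; the parent script name (src/parent_download_fq.sh) and the accession IDs are placeholders:

mkdir -p log meta src fq

# Example accession list: one ID per line, lines starting with # are skipped
cat > meta/sra_accessions.txt <<'EOF'
# samples from project X
SRR0000001
SRR0000002
EOF

# Submit the parent script to Slurm (the log directory must exist for #SBATCH --output)
sbatch src/parent_download_fq.sh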

Why is it advisable to use parent and daughter scripts?

It is good practice to log everything you do; it helps with troubleshooting later. If you have a single script and pass the parameters on the fly, you will not be able to trace those parameters back from the log files. In other words, you'll have no record of which input produced which log.

With the parent/daughter setup, you can give each download attempt its own named parent script, and the log file will carry the same name. And you won't have to copy-paste the whole code into each parent script: just call the daughter script, and that's all!
