Download FASTQ from SRA
An easy intro
The easiest way to download FASTQ files from SRA is the fastq-dump command from the SRA Toolkit:
fastq-dump --gzip --split-3 SRR[accession ID]
Options:
''--gzip'' compresses the downloaded FASTQ files (strongly recommended to save space).
''--split-3'' splits paired-end data into _1 and _2 files and stores singleton reads (those without a mate) in a separate file. For single-end data you can omit this option.
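To see which of these cases a finished download falls into, you can look at the file names it left behind. The sketch below assumes the usual naming with ''--gzip --split-3'' (accession_1.fastq.gz and accession_2.fastq.gz for pairs, accession.fastq.gz for single-end or singleton reads). The function name detect_layout is made up for this example and is not part of the SRA Toolkit.

```shell
#!/bin/bash
# Sketch: report which output layout a fastq-dump --gzip --split-3 run left behind.
# detect_layout is a hypothetical helper, not part of the SRA Toolkit.
detect_layout() {
    local dir="$1" acc="$2"
    if [ -f "$dir/${acc}_1.fastq.gz" ] && [ -f "$dir/${acc}_2.fastq.gz" ]; then
        # Both mates present (a singleton file may exist as well)
        echo "paired"
    elif [ -f "$dir/${acc}.fastq.gz" ]; then
        # Only the unsplit file exists
        echo "single"
    else
        echo "missing"
    fi
}

# Usage (directory and accession are placeholders):
#   detect_layout fq_original SRR[accession ID]
```

The paired check comes first on purpose: a paired-end run with singleton reads also has an unsplit file, and it should still be reported as paired.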
In practice you will usually want to put this command inside a script so that you can download multiple files via slurm. It is also worth making the script a bit more robust to handle connection issues during download. Below is an example script that downloads the files for a list of accessions and retries each download several times if it fails.
Script with download retry attempts
Daughter script
Name: fq_download.sh
#!/bin/bash
# SCRIPT FOR DOWNLOADING SRA FILES USING THE NCBI SRA TOOLKIT
# NOTE: Run this script from the directory where the "log" directory is located
#
# PURPOSE:
# This script reads SRA accession IDs from a given file (one per line)
# and downloads each corresponding SRA file using fastq-dump.
#
# PARAMETERS:
# 1: OUTPUT DIRECTORY - Directory where downloaded files will be stored.
# 2: ACCESSION FILE - Text file containing SRA accession IDs (one per line).
#
# SAMPLE USAGE:
# sbatch src/fq_download.sh <output_directory> <accession_file>
#
# IMPORTANT:
# - This script downloads files using fastq-dump, gzips them, and splits paired-end reads (single-end reads are left unsplit).
# - Ensure that the SRA Toolkit is installed and available.
# Check for required parameters
if [ "$#" -ne 2 ]; then
echo "Usage: $0 <output_directory> <accession_file>"
exit 1
fi
# Parameters
outdir="$1"
accession_file="$2"
# Ensure the accession file exists
if [ ! -f "$accession_file" ]; then
echo "Error: Accession file '$accession_file' does not exist."
exit 1
fi
# Create the output directory if it doesn't exist
mkdir -p "$outdir"
# Function to download an SRA accession using fastq-dump (gzip-compressed, paired-end reads split)
download_sra() {
local acc="$1"
echo "Downloading accession: $acc"
if fastq-dump --gzip --split-3 "$acc" -O "$outdir"; then
echo "Successfully downloaded: $acc"
return 0
else
echo "Error: fastq-dump failed for accession: $acc"
return 1
fi
}
# Function to download with retries
download_with_retry() {
local acc="$1"
local max_retries=10
local attempt=1
while [ $attempt -le $max_retries ]; do
echo "Attempt $attempt for $acc"
if download_sra "$acc"; then
return 0 # Success
fi
((attempt++))
done
echo "Failed all $max_retries attempts for $acc"
return 1
}
# Export the functions and variables for use in GNU Parallel
export -f download_sra download_with_retry
export outdir
# Process all accessions in parallel
accessions=$(grep -v '^#' "$accession_file")
if [ -z "$accessions" ]; then
echo "Error: No valid accessions found in '$accession_file'."
exit 1
fi
echo "Processing accessions in parallel..."
parallel -j 20 download_with_retry ::: $accessions
# Check overall exit status and log the result
if [ "$?" -eq 0 ]; then
echo "All accessions processed successfully."
else
echo "One or more accessions encountered errors."
fi
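The retry loop above starts the next attempt immediately after a failure. If the failures come from short network outages, a pause between attempts may help. Below is a minimal sketch of a generic retry helper with a delay. The name retry_with_delay and its arguments are assumptions made for this example; they are not part of the script above.

```shell
#!/bin/bash
# Sketch: retry any command, pausing between attempts.
# retry_with_delay is a hypothetical helper.
# Arguments: <max_retries> <delay_seconds> <command> [args...]
retry_with_delay() {
    local max_retries="$1" delay="$2"
    shift 2
    local attempt=1
    while [ "$attempt" -le "$max_retries" ]; do
        if "$@"; then
            return 0 # Success
        fi
        echo "Attempt $attempt of $max_retries failed: $*" >&2
        # Wait before the next attempt, but not after the last one
        if [ "$attempt" -lt "$max_retries" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Usage inside the daughter script (30 seconds is an arbitrary choice):
#   retry_with_delay 10 30 download_sra "$acc"
```

If you call it from GNU Parallel, export it with export -f like the other functions.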
Parent script (for slurm)
To run the script on your own list of files, create a text file with one SRA accession per line, then create a separate parent script as follows:
#!/bin/bash
#SBATCH --mem=10gb
#SBATCH --cpus-per-task=30
#SBATCH --job-name=dwnld_fq
#SBATCH --output=log/fq_download_00.log
# Main log file name (the same file is used as logfile below)
# Parameters
outdir="fq_original"
accession_file="meta/sra_accessions.txt"
logfile="log/fq_download_00.log"
# Start logging
echo "Started at: $(date)" >> "$logfile"
echo "Running fq_download.sh with:" >> "$logfile"
echo "Output dir: $outdir" >> "$logfile"
echo "Accession file: $accession_file" >> "$logfile"
# Call the daughter script and redirect both stdout and stderr to the same log
src/fq_download.sh "$outdir" "$accession_file" >> "$logfile" 2>&1
# Log end time
echo "Finished at: $(date)" >> "$logfile"
Make sure the daughter script's path is correct.
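A quick check of the accession file before submitting can save a failed job. The sketch below prints every line that does not look like a run accession. It assumes run accessions have the form SRR, ERR or DRR followed by digits, and the function name invalid_accessions is made up for this example.

```shell
#!/bin/bash
# Sketch: list lines of an accession file that do not look like run accessions.
# Assumption: run accessions are SRR, ERR or DRR followed by digits only.
# invalid_accessions is a hypothetical helper.
invalid_accessions() {
    local accession_file="$1"
    # Drop comment lines and blank lines, then print whatever fails the pattern
    grep -v '^#' "$accession_file" \
        | grep -v '^[[:space:]]*$' \
        | grep -Ev '^[SED]RR[0-9]+$'
}

# Usage: no output means every line passed the check
#   invalid_accessions meta/sra_accessions.txt
```

Because the parent script calls src/fq_download.sh directly, the daughter script also needs the execute permission; [ -x src/fq_download.sh ] tests for that.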
Why is it advisable to use parent and daughter scripts?
It is good practice to log everything you do, because the logs help with troubleshooting later. If you have a single script and pass the parameters on the command line, those parameters are not recorded in the log files, so you cannot tell which inputs produced a given log.
With this setup, each download attempt gets its own parent script and a log file with the same name. You also avoid copying the whole code into every parent script: each one just calls the daughter script.
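One way to keep the parent script name and the log name in sync is to derive one from the other. This is a minimal sketch; the function name log_path_for is made up for this example, and the log/ directory follows the layout used above.

```shell
#!/bin/bash
# Sketch: build the log file path from a parent script's name so the two always match.
# log_path_for is a hypothetical helper.
log_path_for() {
    local script_path="$1"
    # Strip the directory and the .sh suffix, then place the log under log/
    echo "log/$(basename "$script_path" .sh).log"
}

# Usage (the script name is a placeholder):
#   logfile="$(log_path_for src/fq_download_01.sh)"
```

Pass the script name explicitly rather than relying on $0: inside a job submitted with sbatch, $0 may point to a copy of the script made by slurm, not to the original file name.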
Download files from EGA
With this script you can download files from the EGA database using pyega3 and place them directly in the output directory.
You need to set some inputs:
- Credentials json file
- Number of connections per download
- List of file IDs that are to be downloaded
- Output directory
- Format (file extension) of the files to be downloaded
Make sure cpus-per-task equals the number of files to be downloaded multiplied by the number of connections.
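The value can be computed from the ID list instead of counted by hand. The sketch below multiplies the number of non-empty lines in the list by the number of connections. The function name required_cpus is made up for this example.

```shell
#!/bin/bash
# Sketch: cpus-per-task = (number of non-empty ID lines) x (connections per file).
# required_cpus is a hypothetical helper.
required_cpus() {
    local id_list="$1" connections="$2"
    local n_files
    # grep -c counts the lines that contain at least one non-whitespace character
    n_files=$(grep -c '[^[:space:]]' "$id_list")
    echo $((n_files * connections))
}

# Usage: prints the number to put into #SBATCH --cpus-per-task
#   required_cpus meta/test.txt 1
```

For example, a list with 3 file IDs and 2 connections per file gives 6.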
Here is an example of a credentials json file:
{
"username": "name.surname@abi.am",
"password": "your_password"
}
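Since this file holds a password in plain text, it is sensible to make it readable only by you, especially on a shared cluster. A minimal sketch follows; the function name protect_credentials is made up for this example.

```shell
#!/bin/bash
# Sketch: make a credentials file readable and writable by its owner only.
# protect_credentials is a hypothetical helper.
protect_credentials() {
    local credentials_file="$1"
    chmod 600 "$credentials_file"
}

# Usage:
#   protect_credentials meta/credentials.json
```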
The script
#!/bin/bash
#SBATCH --mem=10gb
#SBATCH --cpus-per-task=1
#SBATCH --job-name=dwnld_ega
#SBATCH --output=log/dwnld_ega.log
# Set cpus-per-task to (number of files to be downloaded) x (number of connections)
# Set common variables
CREDENTIALS_FILE="meta/credentials.json"
CONNECTIONS=1
# Define the path to the text file containing the file IDs
FILE_ID_LIST="meta/test.txt"
# Define the output directory
FILE_OUTPUT_DIR="output_dir"
# Define file format
file_format=".bam"
# --- Step 1: Create directories if they don't exist ---
echo "Creating necessary directories..."
mkdir -p "$FILE_OUTPUT_DIR" meta/md5sum log
# --- Step 3: Download files, move, and clean up temporary folders ---
echo "Starting downloads for files from $FILE_ID_LIST..."
# Check if the ID list file exists
if [ ! -f "$FILE_ID_LIST" ]; then
echo "Error: ID list file not found at $FILE_ID_LIST"
exit 1
fi
while IFS= read -r file_id; do
if [ -z "$file_id" ]; then
continue
fi
echo "Downloading file with ID: $file_id"
pyega3 -c "$CONNECTIONS" -cf "$CREDENTIALS_FILE" fetch "$file_id" --output-dir "$FILE_OUTPUT_DIR" &
done < "$FILE_ID_LIST"
wait
# Move files to the final location and remove temporary folders
echo "Moving downloaded files and cleaning up..."
while IFS= read -r file_id; do
if [ -z "$file_id" ]; then
continue
fi
mv "$FILE_OUTPUT_DIR/$file_id"/*"$file_format" "$FILE_OUTPUT_DIR/"
rm -r "$FILE_OUTPUT_DIR/$file_id"
done < "$FILE_ID_LIST"
# --- Step 4: Perform md5sum on all final files ---
echo "Performing md5sum on all downloaded files..."
# Loop through all files in the output directory
for file in "$FILE_OUTPUT_DIR"/*"$file_format"; do
if [ -f "$file" ]; then # Check if the file exists
filename=$(basename "$file")
md5sum "$file" > "meta/md5sum/${filename}.md5"
echo "Generated md5sum for $file"
fi
done
# --- Step 5: Remove unnecessary log files created by pyega3 ---
if [ -f pyega3_output.log ]; then
rm pyega3_output.log
echo "pyega3_output.log removed."
fi
echo "Script finished."
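The .md5 files written in Step 4 can be used later to confirm that the downloaded files have not changed. Below is a minimal sketch; the function name verify_md5s is made up for this example. It assumes md5sum from GNU coreutils, as used in the script above.

```shell
#!/bin/bash
# Sketch: re-check every checksum file in a directory.
# verify_md5s is a hypothetical helper.
# Run it from the directory the download script was run in, because each
# .md5 file stores the file path exactly as it was passed to md5sum.
verify_md5s() {
    local md5_dir="$1"
    local status=0
    local md5_file
    for md5_file in "$md5_dir"/*.md5; do
        # Skip the unexpanded pattern when the directory has no .md5 files
        [ -f "$md5_file" ] || continue
        md5sum -c "$md5_file" || status=1
    done
    return "$status"
}

# Usage:
#   verify_md5s meta/md5sum
```

These checksums record the state of the files right after download, so this check detects later corruption or modification rather than errors during the download itself.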
