Differences

This shows you the differences between two versions of the page.

--- scripts:download_fastq [2025/04/11 06:52] – 37.26.174.181
+++ scripts:download_fastq [2026/05/03 18:41] (current) – external edit 127.0.0.1
@@ Line 7: / Line 7: @@
 The easiest way to download FASTQ files from SRA is with the fastq-dump command from the SRA Toolkit:
-''fastq-dump --gzip --split3 SRR[accession ID]''
+''fastq-dump --gzip --split-3 SRR[accession ID]''
 Options:
@@ Line 67: / Line 67: @@
     local acc="$1"
     echo "Downloading accession: $acc"
-    if fastq-dump --gzip --split-files "$acc" -O "$outdir"; then
+    if fastq-dump --gzip --split-3 "$acc" -O "$outdir"; then
         echo "Successfully downloaded: $acc"
         return 0
@@ Line 152: / Line 152: @@
 This way, each download attempt gets its own named parent script and a log file with the same name. You don't have to copy and paste the whole code into each parent script; just call the daughter script.
+====== Download files from EGA ======
+With this script you can download files from the EGA database using pyega3 and have them placed directly in the output directory.
+You need to set some inputs:
+  * Credentials JSON file
+  * Number of connections
+  * List of file IDs to be downloaded
+  * Output directory
+  * Format (file extension) of the files to be downloaded
+Make sure you set cpus-per-task to the number of files to be downloaded multiplied by the number of connections.
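As a rough sketch of that rule, the value can be computed from the ID list before submitting the job. The helper name `suggest_cpus` is hypothetical and not part of the original script; it assumes one file ID per line and ignores empty lines.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not part of the original script):
# suggest a --cpus-per-task value as
#   (number of non-empty lines in the ID list) x (number of connections)
suggest_cpus() {
  local id_list="$1"
  local connections="$2"
  local n_files
  # grep -c . counts the lines that contain at least one character
  n_files=$(grep -c . "$id_list")
  echo $(( n_files * connections ))
}
```

For example, `suggest_cpus meta/test.txt 1` prints the number of IDs listed in `meta/test.txt`. The result can then be written into the `#SBATCH --cpus-per-task` line or passed to `sbatch` on the command line.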
+Here is an example of a credentials json file:
+<code>
+{
+    "username": "name.surname@abi.am",
+    "password": "your_password"
+}
+</code>
+==The script==
+<code>
+#!/bin/bash
+#SBATCH --mem=10gb
+#SBATCH --cpus-per-task=1
+#SBATCH --job-name=dwnld_ega
+#SBATCH --output=log/dwnld_ega.log
+# Set cpus-per-task to (number of files to be downloaded) x (number of connections)
+# Note: the log directory must already exist when the job is submitted
+# Set common variables
+CREDENTIALS_FILE="meta/credentials.json"
+CONNECTIONS=1
+# Define the paths to the text files containing the file IDs
+FILE_ID_LIST="meta/test.txt"
+# Define output directories
+FILE_OUTPUT_DIR="output_dir"
+# Define file format
+file_format=".bam"
+# --- Step 1: Create directories if they don't exist ---
+echo "Creating necessary directories..."
+mkdir -p "$FILE_OUTPUT_DIR" meta/md5sum log
+# --- Step 2: Download files ---
+echo "Starting downloads for files from $FILE_ID_LIST..."
+# Check that the ID list file exists
+if [ ! -f "$FILE_ID_LIST" ]; then
+  echo "Error: ID list file not found at $FILE_ID_LIST"
+  exit 1
+fi
+while IFS= read -r file_id; do
+  if [ -z "$file_id" ]; then
+    continue
+  fi
+  echo "Downloading file with ID: $file_id"
+  # Start each download in the background so that the files download in parallel
+  pyega3 -c "$CONNECTIONS" -cf "$CREDENTIALS_FILE" fetch "$file_id" --output-dir "$FILE_OUTPUT_DIR" &
+done < "$FILE_ID_LIST"
+# Wait for all background downloads to finish
+wait
+# --- Step 3: Move files to the final location and remove temporary folders ---
+echo "Moving downloaded files and cleaning up..."
+while IFS= read -r file_id; do
+  if [ -z "$file_id" ]; then
+    continue
+  fi
+  # Remove the temporary folder only if the move succeeded
+  mv "$FILE_OUTPUT_DIR/$file_id"/*"$file_format" "$FILE_OUTPUT_DIR/" &&
+    rm -r "$FILE_OUTPUT_DIR/$file_id"
+done < "$FILE_ID_LIST"
+# --- Step 4: Perform md5sum on all final files ---
+echo "Performing md5sum on all downloaded files..."
+# Loop through all downloaded files in the output directory
+for file in "$FILE_OUTPUT_DIR"/*"$file_format"; do
+  if [ -f "$file" ]; then # Check if the file exists
+    filename=$(basename "$file")
+    md5sum "$file" > "meta/md5sum/${filename}.md5"
+    echo "Generated md5sum for $file"
+  fi
+done
+# --- Step 5: Remove unnecessary log files created by pyega3 ---
+if [ -f pyega3_output.log ]; then
+  rm pyega3_output.log
+  echo "pyega3_output.log removed."
+fi
+echo "Script finished."
+</code>
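The checksum files written in Step 4 can later be used to confirm that the downloaded files have not changed, for example after copying them to another location. The following is a minimal sketch, assuming GNU `md5sum` is available; the function name `verify_md5_dir` is hypothetical and not part of the original script. Because each `.md5` file stores the file path relative to the directory where the download script ran, the check should be run from that same directory.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not part of the original script):
# re-check every checksum file in a directory such as meta/md5sum.
# Returns 0 if all files match, 1 if at least one check fails.
verify_md5_dir() {
  local md5_dir="$1"
  local failed=0
  local md5_file
  for md5_file in "$md5_dir"/*.md5; do
    # Skip the unexpanded pattern when the directory holds no .md5 files
    [ -f "$md5_file" ] || continue
    # --quiet suppresses the per-file "OK" lines printed by md5sum
    if md5sum -c --quiet "$md5_file"; then
      echo "OK: $md5_file"
    else
      echo "MISMATCH: $md5_file"
      failed=1
    fi
  done
  return "$failed"
}
```

With the directory layout used by the script, this would be called as `verify_md5_dir meta/md5sum`.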