====== Using Slurm ======
Slurm (Simple Linux Utility for Resource Management) is the job scheduler on ABI's cluster. All compute jobs must be submitted through Slurm -- **do not run heavy computation on the login nodes** (''ssh-01'', ''ssh-02'').
===== ABI Partitions =====
^ Partition ^ Nodes ^ CPUs ^ Memory ^ Purpose ^
| ''compute'' (default) | thin-01, thin-02, thick-01 | 64/node | 384G-768G | General purpose -- use this for most jobs |
| ''thin'' | thin-01, thin-02 | 64/node | ~384G each | Explicit thin-node targeting |
| ''thick'' | thick-01 | 64 | ~768G | Memory-intensive jobs (e.g., pilon, large assemblies) |
| ''download'' | dl-01, dl-02 | 2/node | ~8G each | Data downloads only (not for computation) |
* The ''compute'' partition is the **default**. If you omit ''%%--partition%%'', your job goes here.
* Use ''%%--partition=thick%%'' explicitly when you need >384G of RAM.
* Use ''%%--partition=download%%'' only for data download tasks.
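For example, a memory-hungry job that needs the thick node might start with directives like these (the 500 GB figure is illustrative, not a recommendation):
<code bash>
#!/bin/bash
#SBATCH --partition=thick   # only thick-01 has ~768G of RAM
#SBATCH --mem=500gb         # more than a ~384G thin node can provide
</code>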
===== Quick Reference =====
^ Command ^ Purpose ^ Example ^
| ''sbatch'' | Submit a batch job | ''sbatch my_job.sh'' |
| ''squeue'' | View the job queue | ''squeue --me'' |
| ''scancel'' | Cancel a job | ''scancel 12345'' |
| ''sinfo'' | View partitions & node status | ''sinfo'' |
| ''sacct'' | View completed job info | ''sacct -j 12345'' |
| ''srun'' | Run an interactive command | ''srun --pty bash'' |
| ''salloc'' | Allocate resources interactively | ''salloc --mem=4G'' |
----
===== Submitting a Batch Job =====
A batch job is a shell script with special ''#SBATCH'' directives that tell Slurm what resources you need.
=== Minimal example ===
Create a file ''my_job.sh'':
<code bash>
#!/bin/bash
#SBATCH --mem=10gb             # Memory required
#SBATCH --cpus-per-task=4      # Number of CPU cores
#SBATCH --output=slurm-%j.log  # Log file (%j = job ID)

echo "Job started at $(date)"
echo "Running on node: $(hostname)"
echo "Using $SLURM_CPUS_PER_TASK CPUs"

# Your commands here
your_command --threads $SLURM_CPUS_PER_TASK input.fastq -o output/

echo "Job finished at $(date)"
</code>
Submit it:
<code bash>
sbatch my_job.sh
</code>
=== Common SBATCH directives ===
^ Directive ^ Purpose ^ Example ^
| ''%%--mem%%'' | Total memory for the job | ''%%--mem=10gb%%'' |
| ''%%--cpus-per-task%%'' | Number of CPU cores | ''%%--cpus-per-task=4%%'' |
| ''%%--output%%'' | Standard output log file | ''%%--output=slurm-%j.log%%'' |
| ''%%--error%%'' | Standard error log file | ''%%--error=slurm-%j.err%%'' |
| ''%%--job-name%%'' | Name shown in squeue | ''%%--job-name=alignment%%'' |
| ''%%--time%%'' | Maximum wall time | ''%%--time=24:00:00%%'' |
| ''%%--partition%%'' | Which partition to use | ''%%--partition=thick%%'' |
| ''%%--mail-type%%'' | Email notifications | ''%%--mail-type=BEGIN,END,FAIL%%'' |
| ''%%--mail-user%%'' | Email address | ''%%--mail-user=you@abi.am%%'' |
| ''%%--array%%'' | Submit a job array | ''%%--array=1-10%%'' |
=== Full example with best practices ===
<code bash>
#!/bin/bash
#SBATCH --job-name=align_sample01
#SBATCH --mem=40gb
#SBATCH --cpus-per-task=8
#SBATCH --time=48:00:00
#SBATCH --output=log/align_sample01_%j.log
#SBATCH --error=log/align_sample01_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your_email@abi.am

# Print job info for debugging
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $(hostname)"
echo "Start: $(date)"
echo "Directory: $(pwd)"

# Use the SLURM variable for thread count (keeps it consistent)
THREADS=$SLURM_CPUS_PER_TASK

# Create output directory
mkdir -p bam/

# Run alignment
bwa mem -t $THREADS \
    /mnt/nas1/db/genomes/homo_sapiens/GRCh38.p14/bwa_mem_0.7.17-r1188/GCF_000001405.40_GRCh38.p14_genomic.fna \
    fastq/sample01_1.fq.gz \
    fastq/sample01_2.fq.gz \
    | samtools sort -@ $THREADS -o bam/sample01.sorted.bam -

samtools index bam/sample01.sorted.bam

echo "Finished: $(date)"
</code>
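One habit worth adding to job scripts (a general bash recommendation, not something Slurm requires) is a fail-fast prologue, so a failed step aborts the job instead of letting later commands run on missing output:
<code bash>
#!/bin/bash
# Exit on any error (-e), treat unset variables as errors (-u),
# and fail a pipeline if any stage in it fails (pipefail).
set -euo pipefail
echo "fail-fast enabled"
</code>
Without ''pipefail'', a pipeline such as the ''bwa mem | samtools sort'' step above reports only the exit status of the last command, so an alignment failure could go unnoticed.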
----
===== Monitoring Jobs =====
=== View the queue ===
<code bash>
# View all jobs
squeue

# View only your jobs
squeue --me

# Detailed formatting (recommended -- add this as an alias)
squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"
</code>
Job state codes:
^ Code ^ Meaning ^
| ''PD'' | Pending (waiting for resources) |
| ''R'' | Running |
| ''CG'' | Completing |
| ''CD'' | Completed |
| ''F'' | Failed |
| ''CA'' | Cancelled |
| ''TO'' | Timed out |
=== View completed job details ===
<code bash>
# Basic accounting
sacct -j 12345

# Detailed resource usage
sacct -j 12345 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,MaxVMSize,NCPUS
</code>
=== Cancel a job ===
<code bash>
# Cancel a specific job
scancel 12345

# Cancel all your jobs
scancel -u $USER

# Cancel all your pending jobs
scancel -u $USER --state=PENDING
</code>
----
===== Interactive Sessions =====
Sometimes you need to work interactively on a compute node (e.g., for testing, debugging, or running tools that require interaction).
<code bash>
# Start an interactive bash session on a compute node
srun --pty --mem=4gb --cpus-per-task=2 bash

# With a specific partition and time limit
srun --pty --mem=8gb --cpus-per-task=4 --time=2:00:00 --partition=thin bash
</code>
Once the session starts, you will be on a compute node and can run commands directly. Type ''exit'' to end the session.
**Important:** Interactive sessions consume resources just like batch jobs. End them when you are done.
----
===== Job Arrays =====
Job arrays let you submit many similar jobs with a single command. This is useful for processing multiple samples with the same script.
=== Example: Process 10 samples ===
<code bash>
#!/bin/bash
#SBATCH --job-name=qc_array
#SBATCH --mem=4gb
#SBATCH --cpus-per-task=2
#SBATCH --output=log/qc_%A_%a.log
#SBATCH --array=1-10
# %A = array master job ID, %a = array task ID

# Read the sample name from a file (one sample per line)
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)

echo "Processing sample: $SAMPLE (task $SLURM_ARRAY_TASK_ID)"
fastqc -o fastqc/ fastq/${SAMPLE}_1.fq.gz fastq/${SAMPLE}_2.fq.gz
</code>
Where ''samples.txt'' contains:
<code>
sample01
sample02
sample03
...
sample10
</code>
Submit:
<code bash>
sbatch qc_array.sh
</code>
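To see what the ''sed'' line in the script does, you can simulate a single array task locally (this writes a throwaway ''samples.txt'' in the current directory):
<code bash>
# Build a small sample sheet, one sample per line
printf 'sample01\nsample02\nsample03\n' > samples.txt

# Pretend we are array task 2: sed -n "2p" prints line 2 only
SLURM_ARRAY_TASK_ID=2
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
echo "$SAMPLE"    # sample02
</code>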
=== Controlling array parallelism ===
Limit the number of simultaneous tasks with ''%N'':
<code bash>
#SBATCH --array=1-100%10   # Run 100 tasks, but only 10 at a time
</code>
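Rather than hardcoding the range, you can derive it from the sample sheet at submit time (a sketch; it assumes ''samples.txt'' has one sample per line with a trailing newline, and the ''%2'' concurrency limit is illustrative):
<code bash>
# Count the samples and build the --array argument
printf 'sample01\nsample02\nsample03\n' > samples.txt
N=$(wc -l < samples.txt)
N=$((N))               # normalize any whitespace in wc output
ARRAY="1-${N}%2"
echo "$ARRAY"          # 1-3%2

# Submit with:  sbatch --array="$ARRAY" qc_array.sh
</code>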
----
===== Choosing Resources =====
Requesting the right amount of resources is important:
* **Too little** -- your job crashes or gets killed by Slurm.
* **Too much** -- your job waits longer in the queue, and you waste cluster resources.
=== Memory ===
Guidelines for common bioinformatics tasks:
^ Task ^ Typical Memory ^
| FastQC | 2-4 GB |
| fastp trimming | 4-8 GB |
| BWA mem alignment | 10-40 GB (depends on genome size) |
| GATK HaplotypeCaller | 8-16 GB |
| samtools sort | 4-10 GB |
| //TODO: add more based on your workloads// | |
If you are unsure, start with a moderate amount and check the actual usage after the job completes:
<code bash>
sacct -j 12345 --format=JobID,MaxRSS,Elapsed
</code>
''MaxRSS'' shows the peak memory usage. Adjust your next job accordingly.
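''MaxRSS'' is typically reported in kibibytes with a ''K'' suffix (e.g. ''10485760K''), which takes a moment to read as gigabytes. A quick conversion (the example value is made up):
<code bash>
# Example MaxRSS value as sacct prints it
MAXRSS=10485760K

# Strip the K suffix and convert KiB -> GiB
GB=$(echo "${MAXRSS%K}" | awk '{printf "%.1f", $1/1024/1024}')
echo "${GB} GiB"    # 10.0 GiB
</code>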
=== CPUs ===
* Most bioinformatics tools support a ''--threads'' or ''-t'' parameter. Set ''%%--cpus-per-task%%'' to match.
* Not all tools benefit from many threads. Common sweet spots are 4-8 threads.
* Use the ''$SLURM_CPUS_PER_TASK'' variable in your script to keep the thread count consistent.
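When a script may also run outside Slurm (e.g. while testing on your own machine), ''$SLURM_CPUS_PER_TASK'' is unset; a shell default keeps the script working in both cases (a suggestion, not something Slurm requires):
<code bash>
# Fall back to 1 thread when not running under Slurm
THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "Using $THREADS threads"
</code>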
=== Time ===
* If you do not set ''%%--time%%'', Slurm uses the partition's default limit.
* If your job exceeds the time limit, it will be killed.
* Check the partition limits with ''sinfo -l''.
----
===== Tips and Best Practices =====
* **Always create a ''log/'' directory** before submitting jobs that write to ''log/''.
* **Use ''$SLURM_CPUS_PER_TASK''** instead of hardcoding thread counts.
* **Name your jobs** with ''%%--job-name%%'' so you can identify them in ''squeue''.
* **Use one script per task** -- if running the same tool with different parameters, create separate scripts (e.g., ''align_sample01.sh'', ''align_sample02.sh'') or use job arrays.
* **Check job output** -- always review the log file after a job finishes.
* **Be a good neighbor** -- do not request more resources than you need.
=== Useful alias for squeue ===
Add this to your ''~/.bashrc'' for nicer job listing:
<code bash>
alias sq='squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"'
</code>
Then just type ''sq'' to see formatted output:
<code>
JOBID PARTITION NAME       USER    ST TIME     NODES NODELIST(REASON) CPU MIN_MEMORY
2313  compute   computel   anahit  R  1:53:18  1     thin-01          20  35G
2293  compute   kneaddata  nelli   R  11:12:15 1     thin-01          20  30G
2299  compute   glasso_j1  davith  R  11:12:15 1     thin-01          8   60G
2282  compute   run_som.sh melina  R  11:12:16 1     thin-01          8   50G
2309  compute   plot_cover mherk   PD 0:00     1     (Resources)      1   0
2121  thick     pilon      nate    PD 0:00     1     (Nodes requi..   4   512G
</code>
----
===== Troubleshooting =====
^ Problem ^ Likely Cause ^ Solution ^
| Job stays in ''PD'' state | Not enough free resources | Wait, or reduce resource request |
| Job immediately fails | Script error or bad path | Check the log file for error messages |
| ''slurmstepd: error: Exceeded job memory limit'' | Requested too little memory | Increase ''%%--mem%%'' |
| ''CANCELLED AT ... DUE TO TIME LIMIT'' | Job took longer than ''%%--time%%'' | Increase the time limit |
| ''error: Batch job submission failed: Invalid partition'' | Wrong partition name | Valid partitions: ''compute'', ''thin'', ''thick'', ''download''. Check with ''sinfo'' |
----
===== Further Reading =====
* [[https://slurm.schedmd.com/documentation.html|Official Slurm Documentation]]
* [[https://slurm.schedmd.com/sbatch.html|sbatch Reference]]
* [[https://slurm.schedmd.com/squeue.html|squeue Reference]]
* [[getting_started:cluster_basics|ABI Cluster Basics]]
* [[pipelines:start|Pipelines]] -- Ready-to-use workflows that use Slurm