Using Slurm
Slurm (Simple Linux Utility for Resource Management) is the job scheduler on ABI's cluster. All compute jobs must be submitted through Slurm – do not run heavy computation on the login nodes (ssh-01, ssh-02).
ABI Partitions
| Partition | Nodes | CPUs | Memory | Purpose |
|---|---|---|---|---|
| compute (default) | thin-01, thin-02, thick-01 | 64/node | 384G-768G | General purpose – use this for most jobs |
| thin | thin-01, thin-02 | 64/node | ~384G each | Explicit thin-node targeting |
| thick | thick-01 | 64 | ~768G | Memory-intensive jobs (e.g., pilon, large assemblies) |
| download | dl-01, dl-02 | 2/node | ~8G each | Data downloads only (not for computation) |
- The compute partition is the default. If you omit --partition, your job goes here.
- Use --partition=thick explicitly when you need >384G of RAM.
- Use --partition=download only for data download tasks.
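The partition rules above can be sketched as a small helper. This is only an illustration: the function name choose_partition is hypothetical, and the 384 GB cutoff is taken from the partition table (thin nodes have roughly 384G, so requests close to that limit may still need the thick node).

```shell
#!/bin/bash
# Pick a partition from the memory a job needs, given in whole GB.
# choose_partition is a hypothetical helper; the cutoff mirrors the table above.
choose_partition() {
    local mem_gb=$1
    if [ "$mem_gb" -gt 384 ]; then
        echo "thick"      # only thick-01 has more than ~384G
    else
        echo "compute"    # default partition, any general-purpose node
    fi
}

# Example: print the sbatch options for a job that needs 500 GB.
MEM_GB=500
echo "--partition=$(choose_partition "$MEM_GB") --mem=${MEM_GB}gb"
```

The printed options can be pasted into an sbatch command line or turned into #SBATCH directives.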
Quick Reference
| Command | Purpose | Example |
|---|---|---|
| sbatch | Submit a batch job | sbatch my_job.sh |
| squeue | View the job queue | squeue --me |
| scancel | Cancel a job | scancel 12345 |
| sinfo | View partitions & node status | sinfo |
| sacct | View completed job info | sacct -j 12345 |
| srun | Run an interactive command | srun --pty bash |
| salloc | Allocate resources interactively | salloc --mem=4G |
Submitting a Batch Job
A batch job is a shell script with special #SBATCH directives that tell Slurm what resources you need.
Minimal example
Create a file my_job.sh:
    #!/bin/bash
    #SBATCH --mem=10gb               # Memory required
    #SBATCH --cpus-per-task=4        # Number of CPU cores
    #SBATCH --output=slurm-%j.log    # Log file (%j = job ID)

    echo "Job started at $(date)"
    echo "Running on node: $(hostname)"
    echo "Using $SLURM_CPUS_PER_TASK CPUs"

    # Your commands here
    your_command --threads $SLURM_CPUS_PER_TASK input.fastq -o output/

    echo "Job finished at $(date)"
Submit it:
sbatch my_job.sh
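sbatch confirms a submission with a line of the form "Submitted batch job 12345". If you want the job ID in a variable (for example, to pass it to sacct later), a minimal sketch is shown here. The helper name extract_job_id is hypothetical; sbatch --parsable, which prints only the job ID, is an alternative.

```shell
#!/bin/bash
# Pull the job ID out of sbatch's confirmation line.
# extract_job_id is a hypothetical helper that reads the line on stdin.
extract_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# On the cluster (commented out so the sketch also runs without Slurm):
#   JOBID=$(sbatch my_job.sh | extract_job_id)
#   sacct -j "$JOBID"

# Demonstration with a canned confirmation line:
echo "Submitted batch job 12345" | extract_job_id
```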
Common SBATCH directives
| Directive | Purpose | Example |
|---|---|---|
| --mem | Total memory for the job | --mem=10gb |
| --cpus-per-task | Number of CPU cores | --cpus-per-task=4 |
| --output | Standard output log file | --output=slurm-%j.log |
| --error | Standard error log file | --error=slurm-%j.err |
| --job-name | Name shown in squeue | --job-name=alignment |
| --time | Maximum wall time | --time=24:00:00 |
| --partition | Which partition to use | --partition=thick |
| --mail-type | Email notifications | --mail-type=BEGIN,END,FAIL |
| --mail-user | Email address | --mail-user=you@abi.am |
| --array | Submit a job array | --array=1-10 |
Full example with best practices
    #!/bin/bash
    #SBATCH --job-name=align_sample01
    #SBATCH --mem=40gb
    #SBATCH --cpus-per-task=8
    #SBATCH --time=48:00:00
    #SBATCH --output=log/align_sample01_%j.log
    #SBATCH --error=log/align_sample01_%j.err
    #SBATCH --mail-type=END,FAIL
    #SBATCH --mail-user=your_email@abi.am

    # Print job info for debugging
    echo "Job ID: $SLURM_JOB_ID"
    echo "Node: $(hostname)"
    echo "Start: $(date)"
    echo "Directory: $(pwd)"

    # Use the SLURM variable for thread count (keeps it consistent)
    THREADS=$SLURM_CPUS_PER_TASK

    # Create output directory
    mkdir -p bam/

    # Run alignment
    bwa mem -t $THREADS \
        /mnt/nas1/db/genomes/homo_sapiens/GRCh38.p14/bwa_mem_0.7.17-r1188/GCF_000001405.40_GRCh38.p14_genomic.fna \
        fastq/sample01_1.fq.gz \
        fastq/sample01_2.fq.gz \
        | samtools sort -@ $THREADS -o bam/sample01.sorted.bam -

    samtools index bam/sample01.sorted.bam

    echo "Finished: $(date)"
Monitoring Jobs
View the queue
    # View all jobs
    squeue

    # View only your jobs
    squeue --me

    # Detailed formatting (recommended -- add this as an alias)
    squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"
Job state codes:
| Code | Meaning |
|---|---|
| PD | Pending (waiting for resources) |
| R | Running |
| CG | Completing |
| CD | Completed |
| F | Failed |
| CA | Cancelled |
| TO | Timed out |
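When you have many jobs in the queue, a per-state tally is often easier to read than the full listing. The sketch below counts the state codes from the table above. The helper name count_states is hypothetical, and it expects one state code per line on stdin, which squeue can produce with --noheader and --format=%t.

```shell
#!/bin/bash
# Count how many jobs are in each state.
# count_states is a hypothetical helper: one state code per input line.
count_states() {
    awk 'NF { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# On the cluster (commented out so the sketch also runs without Slurm):
#   squeue --me --noheader --format=%t | count_states

# Demonstration with canned state codes:
printf 'R\nR\nPD\nR\n' | count_states
```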
View completed job details
    # Basic accounting
    sacct -j <jobid>

    # Detailed resource usage
    sacct -j <jobid> --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,MaxVMSize,NCPUS
Cancel a job
    # Cancel a specific job
    scancel <jobid>

    # Cancel all your jobs
    scancel -u $USER

    # Cancel all your pending jobs
    scancel -u $USER --state=PENDING
Interactive Sessions
Sometimes you need to work interactively on a compute node (e.g., for testing, debugging, or running tools that require interaction).
    # Start an interactive bash session on a compute node
    srun --pty --mem=4gb --cpus-per-task=2 bash

    # With a specific partition and time limit
    srun --pty --mem=8gb --cpus-per-task=4 --time=2:00:00 --partition=thin bash
Once the session starts, you will be on a compute node and can run commands directly. Type exit to end the session.
Important: Interactive sessions consume resources just like batch jobs. End them when you are done.
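It is easy to lose track of whether a terminal is on a login node or inside an interactive session. Slurm sets SLURM_JOB_ID inside a job, so a quick check can look like the sketch below (the helper name in_slurm_job is hypothetical).

```shell
#!/bin/bash
# Return success when the current shell is running inside a Slurm job.
# in_slurm_job is a hypothetical helper; it relies on SLURM_JOB_ID being set by Slurm.
in_slurm_job() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_slurm_job; then
    echo "Inside Slurm job $SLURM_JOB_ID"
else
    echo "Not inside a Slurm job; use srun or sbatch for heavy work" >&2
fi
```

A guard like this at the top of a heavy script helps avoid running it on ssh-01 or ssh-02 by accident.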
Job Arrays
Job arrays let you submit many similar jobs with a single command. This is useful for processing multiple samples with the same script.
Example: Process 10 samples
    #!/bin/bash
    #SBATCH --job-name=qc_array
    #SBATCH --mem=4gb
    #SBATCH --cpus-per-task=2
    #SBATCH --output=log/qc_%A_%a.log
    #SBATCH --array=1-10

    # %A = array master job ID, %a = array task ID

    # Read the sample name from a file (one sample per line)
    SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)

    echo "Processing sample: $SAMPLE (task $SLURM_ARRAY_TASK_ID)"

    # FastQC does not create its output directory, so make sure it exists
    mkdir -p fastqc/

    fastqc -o fastqc/ fastq/${SAMPLE}_1.fq.gz fastq/${SAMPLE}_2.fq.gz
Where samples.txt contains:
    sample01
    sample02
    sample03
    ...
    sample10
Submit:
sbatch qc_array.sh
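Writing samples.txt by hand is error-prone for larger projects. A minimal sketch for generating it from the FASTQ directory follows. It assumes read-1 files are named <sample>_1.fq.gz, as in the example script, and the helper name list_samples is hypothetical.

```shell
#!/bin/bash
# Print one sample name per line for every <sample>_1.fq.gz in a directory.
# list_samples is a hypothetical helper; the file naming is an assumption.
list_samples() {
    local fastq_dir=$1
    local f name
    for f in "$fastq_dir"/*_1.fq.gz; do
        [ -e "$f" ] || continue      # nothing matched: skip the literal pattern
        name=${f##*/}                # strip the directory part
        echo "${name%_1.fq.gz}"      # strip the read-1 suffix
    done
}

# Typical use before submitting the array:
#   list_samples fastq > samples.txt
```

The number of lines in samples.txt should match the upper bound given to --array.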
Controlling array parallelism
Limit the number of simultaneous tasks with %N:
#SBATCH --array=1-100%10 # Run 100 tasks, but only 10 at a time
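The array range can also be passed on the sbatch command line, where it takes precedence over the #SBATCH directive in the script. The sketch below builds a 1-N%M specification from a sample list; the helper name array_spec is hypothetical.

```shell
#!/bin/bash
# Build an --array value covering every non-empty line of a sample list,
# limited to a given number of simultaneous tasks.
# array_spec is a hypothetical helper.
array_spec() {
    local list_file=$1
    local max_parallel=$2
    local n
    n=$(grep -c . "$list_file")      # count non-empty lines
    echo "1-${n}%${max_parallel}"
}

# On the cluster:
#   sbatch --array="$(array_spec samples.txt 10)" qc_array.sh
```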
Choosing Resources
Requesting the right amount of resources is important:
- Too little – your job crashes or gets killed by Slurm.
- Too much – your job waits longer in the queue, and you waste cluster resources.
Memory
Guidelines for common bioinformatics tasks:
| Task | Typical Memory |
|---|---|
| FastQC | 2-4 GB |
| fastp trimming | 4-8 GB |
| BWA mem alignment | 10-40 GB (depends on genome size) |
| GATK HaplotypeCaller | 8-16 GB |
| samtools sort | 4-10 GB |
| *TODO: add more based on your workloads* | |
If you are unsure, start with a moderate amount and check the actual usage after the job completes:
sacct -j <jobid> --format=JobID,MaxRSS,Elapsed
MaxRSS shows the peak memory usage. Adjust your next job accordingly.
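As a rough sketch of how a MaxRSS reading might be turned into the next --mem request: the helper names maxrss_to_gb and suggest_mem_gb are hypothetical, and the 20% headroom is an assumption, not a site policy. The sketch also assumes sacct prints MaxRSS with a K, M, G, or T suffix (for example 3532456K) and treats a bare number as bytes.

```shell
#!/bin/bash
# Convert a sacct MaxRSS value (e.g. 3532456K, 512M, 2G) to GB with two decimals.
# Hypothetical helper; a value without a suffix is assumed to be in bytes.
maxrss_to_gb() {
    awk -v v="$1" 'BEGIN {
        unit = substr(v, length(v), 1)
        num = v + 0                       # awk keeps the leading numeric part
        if (unit == "K")      gb = num / 1024 / 1024
        else if (unit == "M") gb = num / 1024
        else if (unit == "G") gb = num
        else if (unit == "T") gb = num * 1024
        else                  gb = num / 1024 / 1024 / 1024
        printf "%.2f\n", gb
    }'
}

# Suggest a --mem value in whole GB: peak usage plus 20% headroom, rounded up.
# The 20% margin is an assumption; adjust it to how variable your inputs are.
suggest_mem_gb() {
    awk -v gb="$(maxrss_to_gb "$1")" 'BEGIN {
        m = gb * 1.2
        r = int(m)
        if (r < m) r++
        if (r < 1) r = 1
        print r
    }'
}

echo "Peak: $(maxrss_to_gb 3532456K) GB, suggested --mem=$(suggest_mem_gb 3532456K)gb"
```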
CPUs
- Most bioinformatics tools support a --threads or -t parameter. Set --cpus-per-task to match.
- Not all tools benefit from many threads. Common sweet spots are 4-8 threads.
- Use the $SLURM_CPUS_PER_TASK variable in your script to keep the thread count consistent.
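One caveat: SLURM_CPUS_PER_TASK is only set when --cpus-per-task was requested, so a script run without it (or tested outside Slurm) sees an empty value. A minimal sketch of a fallback, using a hypothetical helper named thread_count and an assumed default of 1 thread:

```shell
#!/bin/bash
# Print the thread count to use: the Slurm allocation if present, otherwise 1.
# thread_count is a hypothetical helper; the default of 1 is an assumption.
thread_count() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

THREADS=$(thread_count)
echo "Using $THREADS threads"
```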
Time
- If you do not set --time, Slurm uses the partition's default limit.
- If your job exceeds the time limit, it will be killed.
- Check the partition limits with sinfo -l.
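For limits longer than a day, --time also accepts the days-hours:minutes:seconds form, which is easier to read than a large hour count. A small sketch that converts a duration in seconds into that form (the helper name to_slurm_time is hypothetical):

```shell
#!/bin/bash
# Convert a duration in seconds to Slurm's D-HH:MM:SS time format.
# to_slurm_time is a hypothetical helper.
to_slurm_time() {
    local total=$1
    local days=$(( total / 86400 ))
    local hours=$(( (total % 86400) / 3600 ))
    local mins=$(( (total % 3600) / 60 ))
    local secs=$(( total % 60 ))
    printf '%d-%02d:%02d:%02d\n' "$days" "$hours" "$mins" "$secs"
}

# 25 hours expressed as a --time value:
echo "--time=$(to_slurm_time 90000)"
```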
Tips and Best Practices
- Always create a log/ directory before submitting jobs that write to log/.
- Use $SLURM_CPUS_PER_TASK instead of hardcoding thread counts.
- Name your jobs with --job-name so you can identify them in squeue.
- Use one script per task – if running the same tool with different parameters, create separate scripts (e.g., align_sample01.sh, align_sample02.sh) or use job arrays.
- Check job output – always review the log file after a job finishes.
- Be a good neighbor – do not request more resources than you need.
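The first tip can be automated. The sketch below reads the --output and --error directives from a job script and creates their directories before submission. The helper name make_log_dirs is hypothetical, and the sketch only handles the long --output=PATH and --error=PATH forms with no filename patterns in the directory part.

```shell
#!/bin/bash
# Create the directories named in a job script's --output/--error directives.
# make_log_dirs is a hypothetical helper; paths are resolved relative to the
# current directory, which is where sbatch would be run from.
make_log_dirs() {
    local script=$1
    sed -n \
        -e 's/^#SBATCH[[:space:]]\{1,\}--output=\([^[:space:]]*\).*/\1/p' \
        -e 's/^#SBATCH[[:space:]]\{1,\}--error=\([^[:space:]]*\).*/\1/p' \
        "$script" |
    while read -r path; do
        mkdir -p "$(dirname "$path")"
    done
}

# Typical use:
#   make_log_dirs my_job.sh && sbatch my_job.sh
```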
Useful alias for squeue
Add this to your ~/.bashrc for nicer job listing:
alias sq='squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"'
Then just type sq to see formatted output:
    JOBID  PARTITION  NAME        USER    ST  TIME      NODES  NODELIST(REASON)  CPU  MIN_MEMORY
    2313   compute    computel    anahit  R   1:53:18   1      thin-01           20   35G
    2293   compute    kneaddata   nelli   R   11:12:15  1      thin-01           20   30G
    2299   compute    glasso_j1   davith  R   11:12:15  1      thin-01           8    60G
    2282   compute    run_som.sh  melina  R   11:12:16  1      thin-01           8    50G
    2309   compute    plot_cover  mherk   PD  0:00      1      (Resources)       1    0
    2121   thick      pilon       nate    PD  0:00      1      (Nodes requi..    4    512G
Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| Job stays in PD state | Not enough free resources | Wait, or reduce resource request |
| Job immediately fails | Script error or bad path | Check the log file for error messages |
| slurmstepd: error: Exceeded job memory limit | Requested too little memory | Increase --mem |
| CANCELLED AT … DUE TO TIME LIMIT | Job took longer than --time | Increase the time limit |
| error: Batch job submission failed: Invalid partition | Wrong partition name | Valid partitions: compute, thin, thick, download. Check with sinfo |
Further Reading
- Pipelines – Ready-to-use workflows that use Slurm
