Slurm (Simple Linux Utility for Resource Management) is the job scheduler on ABI's cluster. All compute jobs must be submitted through Slurm; do not run heavy computation on the login nodes (ssh-01, ssh-02).
| Partition | Nodes | CPUs | Memory | Purpose |
|---|---|---|---|---|
| compute (default) | thin-01, thin-02, thick-01 | 64/node | 384G-768G | General purpose; use this for most jobs |
| thin | thin-01, thin-02 | 64/node | ~384G each | Explicit thin-node targeting |
| thick | thick-01 | 64 | ~768G | Memory-intensive jobs (e.g., pilon, large assemblies) |
| download | dl-01, dl-02 | 2/node | ~8G each | Data downloads only (not for computation) |
- The `compute` partition is the default. If you omit `--partition`, your job goes here.
- Pass `--partition=thick` explicitly when you need >384G of RAM.
- Use `--partition=download` only for data download tasks.

| Command | Purpose | Example |
|---|---|---|
| sbatch | Submit a batch job | sbatch my_job.sh |
| squeue | View the job queue | squeue --me |
| scancel | Cancel a job | scancel 12345 |
| sinfo | View partitions & node status | sinfo |
| sacct | View completed job info | sacct -j 12345 |
| srun | Run an interactive command | srun --pty bash |
| salloc | Allocate resources interactively | salloc --mem=4G |
A batch job is a shell script with special #SBATCH directives that tell Slurm what resources you need.
Create a file my_job.sh:
```bash
#!/bin/bash
#SBATCH --mem=10gb              # Memory required
#SBATCH --cpus-per-task=4       # Number of CPU cores
#SBATCH --output=slurm-%j.log   # Log file (%j = job ID)

echo "Job started at $(date)"
echo "Running on node: $(hostname)"
echo "Using $SLURM_CPUS_PER_TASK CPUs"

# Your commands here
your_command --threads $SLURM_CPUS_PER_TASK input.fastq -o output/

echo "Job finished at $(date)"
```
Submit it:
```bash
sbatch my_job.sh
```
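Because `#SBATCH` lines are ordinary comments to bash, you can dry-run a script's logic locally before submitting. Slurm-provided variables such as `$SLURM_CPUS_PER_TASK` are unset outside a job, so give them a fallback while testing. A minimal sketch (`test_job.sh` is a hypothetical example name):

```shell
# Write a tiny job script; the #SBATCH lines are plain comments to bash
cat > test_job.sh <<'EOF'
#!/bin/bash
#SBATCH --mem=1gb
#SBATCH --cpus-per-task=2
# Fall back to 2 threads when not running under Slurm
THREADS=${SLURM_CPUS_PER_TASK:-2}
echo "Would run with $THREADS threads"
EOF

# Dry-run locally: bash ignores the #SBATCH comments entirely
bash test_job.sh
```

Under Slurm, `$SLURM_CPUS_PER_TASK` is set and the fallback is never used, so the same script works in both contexts.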
| Directive | Purpose | Example |
|---|---|---|
| --mem | Total memory for the job | --mem=10gb |
| --cpus-per-task | Number of CPU cores | --cpus-per-task=4 |
| --output | Standard output log file | --output=slurm-%j.log |
| --error | Standard error log file | --error=slurm-%j.err |
| --job-name | Name shown in squeue | --job-name=alignment |
| --time | Maximum wall time | --time=24:00:00 |
| --partition | Which partition to use | --partition=thick |
| --mail-type | Email notifications | --mail-type=BEGIN,END,FAIL |
| --mail-user | Email address | --mail-user=you@abi.am |
| --array | Submit a job array | --array=1-10 |
```bash
#!/bin/bash
#SBATCH --job-name=align_sample01
#SBATCH --mem=40gb
#SBATCH --cpus-per-task=8
#SBATCH --time=48:00:00
#SBATCH --output=log/align_sample01_%j.log
#SBATCH --error=log/align_sample01_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your_email@abi.am

# Print job info for debugging
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $(hostname)"
echo "Start: $(date)"
echo "Directory: $(pwd)"

# Use the SLURM variable for thread count (keeps it consistent)
THREADS=$SLURM_CPUS_PER_TASK

# Create output directory
mkdir -p bam/

# Run alignment
bwa mem -t $THREADS \
    /mnt/nas1/db/genomes/homo_sapiens/GRCh38.p14/bwa_mem_0.7.17-r1188/GCF_000001405.40_GRCh38.p14_genomic.fna \
    fastq/sample01_1.fq.gz \
    fastq/sample01_2.fq.gz \
    | samtools sort -@ $THREADS -o bam/sample01.sorted.bam -

samtools index bam/sample01.sorted.bam

echo "Finished: $(date)"
```
```bash
# View all jobs
squeue

# View only your jobs
squeue --me

# Detailed formatting (recommended -- add this as an alias)
squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"
```
Job state codes:
| Code | Meaning |
|---|---|
| PD | Pending (waiting for resources) |
| R | Running |
| CG | Completing |
| CD | Completed |
| F | Failed |
| CA | Cancelled |
| TO | Timed out |
```bash
# Basic accounting
sacct -j <jobid>

# Detailed resource usage
sacct -j <jobid> --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,MaxVMSize,NCPUS
```
```bash
# Cancel a specific job
scancel <jobid>

# Cancel all your jobs
scancel -u $USER

# Cancel all your pending jobs
scancel -u $USER --state=PENDING
```
Sometimes you need to work interactively on a compute node (e.g., for testing, debugging, or running tools that require interaction).
```bash
# Start an interactive bash session on a compute node
srun --pty --mem=4gb --cpus-per-task=2 bash

# With a specific partition and time limit
srun --pty --mem=8gb --cpus-per-task=4 --time=2:00:00 --partition=thin bash
```
Once the session starts, you will be on a compute node and can run commands directly. Type exit to end the session.
Important: Interactive sessions consume resources just like batch jobs. End them when you are done.
Job arrays let you submit many similar jobs with a single command. This is useful for processing multiple samples with the same script.
```bash
#!/bin/bash
#SBATCH --job-name=qc_array
#SBATCH --mem=4gb
#SBATCH --cpus-per-task=2
#SBATCH --output=log/qc_%A_%a.log
#SBATCH --array=1-10

# %A = array master job ID, %a = array task ID

# Read the sample name from a file (one sample per line)
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)

echo "Processing sample: $SAMPLE (task $SLURM_ARRAY_TASK_ID)"
fastqc -o fastqc/ fastq/${SAMPLE}_1.fq.gz fastq/${SAMPLE}_2.fq.gz
```
Where samples.txt contains:
```
sample01
sample02
sample03
...
sample10
```
Submit:
```bash
sbatch qc_array.sh
```
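If your read files follow a consistent naming scheme, you can generate samples.txt instead of writing it by hand. A sketch, assuming reads are named `<sample>_1.fq.gz` as in the script above (the `mkdir`/`touch` lines only create demo files for illustration):

```shell
# Demo input files (your real fastq/ directory would already exist)
mkdir -p fastq
touch fastq/sample01_1.fq.gz fastq/sample01_2.fq.gz
touch fastq/sample02_1.fq.gz fastq/sample02_2.fq.gz

# Derive sample names from the read-1 files (assumes <sample>_1.fq.gz naming)
ls fastq/*_1.fq.gz | sed 's|.*/||; s/_1\.fq\.gz$//' > samples.txt
cat samples.txt
```

If your files use a different suffix (e.g., `_R1.fastq.gz`), adjust the second `sed` expression accordingly.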
Limit the number of simultaneous tasks with %N:
```bash
#SBATCH --array=1-100%10    # Run 100 tasks, but only 10 at a time
```
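The `sed -n "${SLURM_ARRAY_TASK_ID}p"` line in the array script simply prints line N of samples.txt. You can check the task-to-sample mapping locally, without Slurm, by setting the variable yourself:

```shell
# Simulate the array-task-to-sample mapping outside Slurm
printf 'sample01\nsample02\nsample03\n' > samples.txt

SLURM_ARRAY_TASK_ID=2
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
echo "$SAMPLE"    # sample02
```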
Requesting the right amount of resources is important: request too little and your job fails (out of memory or timed out); request too much and it waits longer in the queue while idle resources are blocked for other users.
Guidelines for common bioinformatics tasks:
| Task | Typical Memory |
|---|---|
| FastQC | 2-4 GB |
| fastp trimming | 4-8 GB |
| BWA mem alignment | 10-40 GB (depends on genome size) |
| GATK HaplotypeCaller | 8-16 GB |
| samtools sort | 4-10 GB |
| *TODO: add more based on your workloads* | |
If you are unsure, start with a moderate amount and check the actual usage after the job completes:
sacct -j <jobid> --format=JobID,MaxRSS,Elapsed
MaxRSS shows the peak memory usage. Adjust your next job accordingly.
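sacct typically reports MaxRSS in kilobytes with a `K` suffix. A small awk sketch (assuming that `K` suffix; the value shown is made up for illustration) converts it to gigabytes so you can pick a sensible `--mem` for the next run:

```shell
# Convert a MaxRSS value such as "35123456K" to gigabytes (assumes K suffix)
maxrss="35123456K"
echo "$maxrss" | awk '{ sub(/K$/, ""); printf "%.1f GB\n", $1 / 1024 / 1024 }'
# prints "33.5 GB"
```

Add a safety margin (roughly 20-30%) on top of the converted value rather than requesting the peak exactly.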
- Most tools accept a `--threads` or `-t` parameter. Set `--cpus-per-task` to match, and use the `$SLURM_CPUS_PER_TASK` variable in your script instead of hardcoding thread counts, so the two stay consistent.
- If you omit `--time`, Slurm uses the partition's default limit. Check limits with `sinfo -l`.
- Create the `log/` directory before submitting jobs that write to `log/`.
- Give jobs a descriptive `--job-name` so you can identify them in `squeue`.
- For multiple samples, either write one script per sample (`align_sample01.sh`, `align_sample02.sh`) or use job arrays.
Add this to your ~/.bashrc for nicer job listing:
```bash
alias sq='squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"'
```
Then just type sq to see formatted output:
```
 JOBID  PARTITION        NAME    USER  ST      TIME  NODES  NODELIST(REASON)  CPU  MIN_MEMORY
  2313    compute    computel  anahit   R   1:53:18      1           thin-01   20         35G
  2293    compute   kneaddata   nelli   R  11:12:15      1           thin-01   20         30G
  2299    compute   glasso_j1  davith   R  11:12:15      1           thin-01    8         60G
  2282    compute  run_som.sh  melina   R  11:12:16      1           thin-01    8         50G
  2309    compute  plot_cover   mherk  PD      0:00      1       (Resources)    1           0
  2121      thick       pilon    nate  PD      0:00      1    (Nodes requi..    4        512G
```
| Problem | Likely Cause | Solution |
|---|---|---|
| Job stays in `PD` state | Not enough free resources | Wait, or reduce resource request |
| Job immediately fails | Script error or bad path | Check the log file for error messages |
| `slurmstepd: error: Exceeded job memory limit` | Requested too little memory | Increase `--mem` |
| `CANCELLED AT … DUE TO TIME LIMIT` | Job took longer than `--time` | Increase the time limit |
| `error: Batch job submission failed: Invalid partition` | Wrong partition name | Valid partitions: compute, thin, thick, download. Check with `sinfo` |