Using Slurm

Slurm (Simple Linux Utility for Resource Management) is the job scheduler on ABI's cluster. All compute jobs must be submitted through Slurm – do not run heavy computation on the login nodes (ssh-01, ssh-02).

ABI Partitions

| Partition | Nodes | CPUs | Memory | Purpose |
| --- | --- | --- | --- | --- |
| compute (default) | thin-01, thin-02, thick-01 | 64/node | 384G-768G | General purpose – use this for most jobs |
| thin | thin-01, thin-02 | 64/node | ~384G each | Explicit thin-node targeting |
| thick | thick-01 | 64 | ~768G | Memory-intensive jobs (e.g., pilon, large assemblies) |
| download | dl-01, dl-02 | 2/node | ~8G each | Data downloads only (not for computation) |
  • The compute partition is the default. If you omit --partition, your job goes here.
  • Use --partition=thick explicitly when you need >384G of RAM.
  • Use --partition=download only for data download tasks.
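As a sketch, the opening of a job script for a large assembly that needs the thick node might look like this (the memory figure is illustrative, not a recommendation):

```shell
#!/bin/bash
#SBATCH --partition=thick        # thick-01 is the only node with more than 384G RAM
#SBATCH --mem=500gb              # illustrative value; anything over 384G needs thick
#SBATCH --cpus-per-task=8

msg="running a large-memory job"
echo "$msg"
```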

Quick Reference

| Command | Purpose | Example |
| --- | --- | --- |
| sbatch | Submit a batch job | sbatch my_job.sh |
| squeue | View the job queue | squeue --me |
| scancel | Cancel a job | scancel 12345 |
| sinfo | View partitions & node status | sinfo |
| sacct | View completed job info | sacct -j 12345 |
| srun | Run an interactive command | srun --pty bash |
| salloc | Allocate resources interactively | salloc --mem=4G |

Submitting a Batch Job

A batch job is a shell script with special #SBATCH directives that tell Slurm what resources you need.

Minimal example

Create a file my_job.sh:

#!/bin/bash
#SBATCH --mem=10gb                  # Memory required
#SBATCH --cpus-per-task=4           # Number of CPU cores
#SBATCH --output=slurm-%j.log      # Log file (%j = job ID)
 
echo "Job started at $(date)"
echo "Running on node: $(hostname)"
echo "Using $SLURM_CPUS_PER_TASK CPUs"
 
# Your commands here
your_command --threads $SLURM_CPUS_PER_TASK input.fastq -o output/
 
echo "Job finished at $(date)"

Submit it:

sbatch my_job.sh
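On success, sbatch prints a line of the form Submitted batch job <id>. If you want the job ID in a script (to pass to squeue, sacct, or scancel later), you can parse it out. The snippet below simulates the output so the parsing is visible; with a real submission you would use $(sbatch my_job.sh | awk '{print $4}'), or sbatch --parsable, which prints just the number:

```shell
# Simulated sbatch output (a real run would be: sbatch my_job.sh)
submit_msg="Submitted batch job 12345"

# The job ID is the fourth whitespace-separated field
jobid=$(echo "$submit_msg" | awk '{print $4}')
echo "captured job ID: $jobid"
```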

Common SBATCH directives

| Directive | Purpose | Example |
| --- | --- | --- |
| --mem | Total memory for the job | --mem=10gb |
| --cpus-per-task | Number of CPU cores | --cpus-per-task=4 |
| --output | Standard output log file | --output=slurm-%j.log |
| --error | Standard error log file | --error=slurm-%j.err |
| --job-name | Name shown in squeue | --job-name=alignment |
| --time | Maximum wall time | --time=24:00:00 |
| --partition | Which partition to use | --partition=thick |
| --mail-type | Email notifications | --mail-type=BEGIN,END,FAIL |
| --mail-user | Email address | --mail-user=you@abi.am |
| --array | Submit a job array | --array=1-10 |

Full example with best practices

#!/bin/bash
#SBATCH --job-name=align_sample01
#SBATCH --mem=40gb
#SBATCH --cpus-per-task=8
#SBATCH --time=48:00:00
#SBATCH --output=log/align_sample01_%j.log
#SBATCH --error=log/align_sample01_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your_email@abi.am
 
# Print job info for debugging
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $(hostname)"
echo "Start: $(date)"
echo "Directory: $(pwd)"
 
# Use the SLURM variable for thread count (keeps it consistent)
THREADS=$SLURM_CPUS_PER_TASK
 
# Create output directory
mkdir -p bam/
 
# Run alignment
bwa mem -t $THREADS \
    /mnt/nas1/db/genomes/homo_sapiens/GRCh38.p14/bwa_mem_0.7.17-r1188/GCF_000001405.40_GRCh38.p14_genomic.fna \
    fastq/sample01_1.fq.gz \
    fastq/sample01_2.fq.gz \
    | samtools sort -@ $THREADS -o bam/sample01.sorted.bam -
 
samtools index bam/sample01.sorted.bam
 
echo "Finished: $(date)"
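One optional hardening step, not shown in the script above: by default, a pipeline like bwa mem | samtools sort reports only the exit status of the last command, so a bwa failure can go unnoticed as long as samtools exits cleanly. Adding set -o pipefail (often together with set -eu) near the top of the script makes such failures visible. A minimal demonstration of the difference:

```shell
# Without pipefail, the pipeline's status is that of the last command,
# so a failure in the first stage (think: bwa) is masked when the second succeeds
if false | true; then
    status_default=0      # this branch is taken: the failure was masked
else
    status_default=1
fi

# With pipefail, any failing stage makes the whole pipeline fail
set -o pipefail
if false | true; then
    status_pipefail=0
else
    status_pipefail=1     # this branch is taken: the failure is now visible
fi

echo "default: $status_default, pipefail: $status_pipefail"
```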

Monitoring Jobs

View the queue

# View all jobs
squeue
 
# View only your jobs
squeue --me
 
# Detailed formatting (recommended -- add this as an alias)
squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"

Job state codes:

| Code | Meaning |
| --- | --- |
| PD | Pending (waiting for resources) |
| R | Running |
| CG | Completing |
| CD | Completed |
| F | Failed |
| CA | Cancelled |
| TO | Timed out |

View completed job details

# Basic accounting
sacct -j <jobid>
 
# Detailed resource usage
sacct -j <jobid> --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,MaxVMSize,NCPUS

Cancel a job

# Cancel a specific job
scancel <jobid>
 
# Cancel all your jobs
scancel -u $USER
 
# Cancel all your pending jobs
scancel -u $USER --state=PENDING

Interactive Sessions

Sometimes you need to work interactively on a compute node (e.g., for testing, debugging, or running tools that require interaction).

# Start an interactive bash session on a compute node
srun --pty --mem=4gb --cpus-per-task=2 bash
 
# With a specific partition and time limit
srun --pty --mem=8gb --cpus-per-task=4 --time=2:00:00 --partition=thin bash

Once the session starts, you will be on a compute node and can run commands directly. Type exit to end the session.

Important: Interactive sessions consume resources just like batch jobs. End them when you are done.
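Once inside a session, you can confirm what kind of node you landed on from its hostname; the prefixes below come from the partition table at the top of this page:

```shell
# Node-name prefixes from the partition table above
node="thin-01"          # in a real session you would use: node=$(hostname)

case "$node" in
    thin-*|thick-*) node_type="compute node" ;;
    dl-*)           node_type="download node" ;;
    ssh-*)          node_type="login node -- do not run heavy jobs here" ;;
    *)              node_type="unknown" ;;
esac
echo "$node is a $node_type"
```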


Job Arrays

Job arrays let you submit many similar jobs with a single command. This is useful for processing multiple samples with the same script.

Example: Process 10 samples

#!/bin/bash
#SBATCH --job-name=qc_array
#SBATCH --mem=4gb
#SBATCH --cpus-per-task=2
#SBATCH --output=log/qc_%A_%a.log
#SBATCH --array=1-10
 
# %A = array master job ID, %a = array task ID
 
# Read the sample name from a file (one sample per line)
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
 
echo "Processing sample: $SAMPLE (task $SLURM_ARRAY_TASK_ID)"
 
fastqc -o fastqc/ fastq/${SAMPLE}_1.fq.gz fastq/${SAMPLE}_2.fq.gz

Where samples.txt contains:

sample01
sample02
sample03
...
sample10

Submit:

sbatch qc_array.sh
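Before submitting, you can check the sed lookup locally by substituting a literal task ID; this sketch builds a small stand-in for samples.txt under /tmp:

```shell
# Build a small stand-in for samples.txt
printf 'sample01\nsample02\nsample03\n' > /tmp/samples_demo.txt

# Simulate the second array task
SLURM_ARRAY_TASK_ID=2
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" /tmp/samples_demo.txt)
echo "task $SLURM_ARRAY_TASK_ID -> $SAMPLE"
```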

Controlling array parallelism

Limit the number of simultaneous tasks with %N:

#SBATCH --array=1-100%10   # Run 100 tasks, but only 10 at a time
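A related sanity check: the --array range should match the number of lines in samples.txt, or some tasks will read empty sample names. A quick way to derive the range (again using a stand-in file for illustration):

```shell
# Stand-in for a real samples.txt
printf 'sample01\nsample02\nsample03\n' > /tmp/samples_check.txt

# Arithmetic expansion strips any padding that wc prints on some systems
n=$(( $(wc -l < /tmp/samples_check.txt) ))
echo "samples.txt has $n lines -> use --array=1-$n"
```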

Choosing Resources

Requesting the right amount of resources is important:

  • Too little – your job crashes or gets killed by Slurm.
  • Too much – your job waits longer in the queue, and you waste cluster resources.

Memory

Guidelines for common bioinformatics tasks:

| Task | Typical Memory |
| --- | --- |
| FastQC | 2-4 GB |
| fastp trimming | 4-8 GB |
| BWA mem alignment | 10-40 GB (depends on genome size) |
| GATK HaplotypeCaller | 8-16 GB |
| samtools sort | 4-10 GB |

*TODO: add more based on your workloads*

If you are unsure, start with a moderate amount and check the actual usage after the job completes:

sacct -j <jobid> --format=JobID,MaxRSS,Elapsed

MaxRSS shows the peak memory usage. Adjust your next job accordingly.
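sacct usually reports MaxRSS in kilobytes with a K suffix (e.g. 34567892K). A small illustrative conversion to gigabytes, to help size the next --mem request (the example value and the "+4G headroom" rule of thumb are assumptions, not site policy):

```shell
maxrss="34567892K"      # example value in the form sacct prints

# Strip the K suffix and convert KB -> GB (integer division is enough here)
kb=${maxrss%K}
gb=$(( kb / 1024 / 1024 ))
echo "peak usage was about ${gb}G -- request a bit more, e.g. --mem=$(( gb + 4 ))gb"
```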

CPUs

  • Most bioinformatics tools support a --threads or -t parameter. Set --cpus-per-task to match.
  • Not all tools benefit from many threads. Common sweet spots are 4-8 threads.
  • Use the $SLURM_CPUS_PER_TASK variable in your script to keep the thread count consistent.
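A small defensive pattern on top of the last point: default the thread count to 1 when the script is run outside Slurm (where $SLURM_CPUS_PER_TASK is unset), using bash parameter expansion:

```shell
# Outside Slurm this variable is unset; inside a job, Slurm sets it
unset SLURM_CPUS_PER_TASK

THREADS=${SLURM_CPUS_PER_TASK:-1}   # fall back to 1 thread when unset
echo "using $THREADS threads"
```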

Time

  • If you do not set --time, Slurm uses the partition's default limit.
  • If your job exceeds the time limit, it will be killed.
  • Check the partition limits with sinfo -l.

Tips and Best Practices

  • Always create a log/ directory before submitting jobs that write to log/.
  • Use $SLURM_CPUS_PER_TASK instead of hardcoding thread counts.
  • Name your jobs with --job-name so you can identify them in squeue.
  • Use one script per task – if running the same tool with different parameters, create separate scripts (e.g., align_sample01.sh, align_sample02.sh) or use job arrays.
  • Check job output – always review the log file after a job finishes.
  • Be a good neighbor – do not request more resources than you need.

Useful alias for squeue

Add this to your ~/.bashrc for nicer job listing:

alias sq='squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"'

Then just type sq to see formatted output:

 JOBID  PARTITION        NAME    USER  ST      TIME  NODES  NODELIST(REASON)  CPU  MIN_MEMORY
  2313    compute    computel  anahit   R   1:53:18      1           thin-01   20         35G
  2293    compute   kneaddata   nelli   R  11:12:15      1           thin-01   20         30G
  2299    compute   glasso_j1  davith   R  11:12:15      1           thin-01    8         60G
  2282    compute  run_som.sh  melina   R  11:12:16      1           thin-01    8         50G
  2309    compute  plot_cover   mherk  PD      0:00      1       (Resources)    1          0
  2121      thick       pilon    nate  PD      0:00      1    (Nodes requi..    4        512G

Troubleshooting

| Problem | Likely Cause | Solution |
| --- | --- | --- |
| Job stays in PD state | Not enough free resources | Wait, or reduce your resource request |
| Job immediately fails | Script error or bad path | Check the log file for error messages |
| slurmstepd: error: Exceeded job memory limit | Requested too little memory | Increase --mem |
| CANCELLED AT … DUE TO TIME LIMIT | Job took longer than --time | Increase the time limit |
| error: Batch job submission failed: Invalid partition | Wrong partition name | Valid partitions: compute, thin, thick, download. Check with sinfo |

Further Reading