Using Slurm

Slurm (Simple Linux Utility for Resource Management) is the job scheduler on ABI's cluster. All compute jobs must be submitted through Slurm – do not run heavy computation on the login nodes (ssh-01, ssh-02).

ABI Partitions

Partition	Nodes	CPUs	Memory	Purpose
`compute` (default)	thin-01, thin-02, thick-01	64/node	384G-768G	General purpose – use this for most jobs
`thin`	thin-01, thin-02	64/node	~384G each	Explicit thin-node targeting
`thick`	thick-01	64	~768G	Memory-intensive jobs (e.g., pilon, large assemblies)
`download`	dl-01, dl-02	2/node	~8G each	Data downloads only (not for computation)

The compute partition is the default. If you omit --partition, your job goes here.
Use --partition=thick explicitly when you need >384G of RAM.
Use --partition=download only for data download tasks.
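The partition rules above can be folded into a small helper. This is a minimal sketch: the function name pick_partition is hypothetical, and the 384 GB cutoff is copied from the table above rather than queried from Slurm.

```shell
# Hypothetical helper: choose a partition from the memory a job needs (in GB).
# The 384 GB cutoff mirrors the thin-node memory listed in the partition table.
pick_partition() {
    local mem_gb=$1
    if [ "$mem_gb" -gt 384 ]; then
        echo "thick"
    else
        echo "compute"
    fi
}

# Usage (on the cluster):
#   sbatch --partition="$(pick_partition 500)" --mem=500G my_job.sh
```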

Quick Reference

Command	Purpose	Example
`sbatch`	Submit a batch job	`sbatch my_job.sh`
`squeue`	View the job queue	`squeue --me`
`scancel`	Cancel a job	`scancel 12345`
`sinfo`	View partitions & node status	`sinfo`
`sacct`	View completed job info	`sacct -j 12345`
`srun`	Run an interactive command	`srun --pty bash`
`salloc`	Allocate resources interactively	`salloc --mem=4G`

Submitting a Batch Job

A batch job is a shell script with special #SBATCH directives that tell Slurm what resources you need.

Minimal example

Create a file my_job.sh:

#!/bin/bash
#SBATCH --mem=10gb                  # Memory required
#SBATCH --cpus-per-task=4           # Number of CPU cores
#SBATCH --output=slurm-%j.log      # Log file (%j = job ID)
 
echo "Job started at $(date)"
echo "Running on node: $(hostname)"
echo "Using $SLURM_CPUS_PER_TASK CPUs"
 
# Your commands here
your_command --threads $SLURM_CPUS_PER_TASK input.fastq -o output/
 
echo "Job finished at $(date)"

Submit it:

sbatch my_job.sh
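If you want to reuse the job ID in later commands (for example with sacct or scancel), sbatch can print it in a machine-readable form. The sketch below assumes a hypothetical wrapper name, submit_and_report; only the --parsable option itself comes from sbatch.

```shell
# Submit a script and print the job ID that Slurm assigned.
# --parsable makes sbatch print "jobid" or "jobid;cluster" instead of a sentence.
submit_and_report() {
    local script=$1 jobid
    jobid=$(sbatch --parsable "$script") || return 1
    jobid=${jobid%%;*}            # drop the optional ";cluster" suffix
    echo "Submitted $script as job $jobid"
}

# Usage (on the cluster):
#   submit_and_report my_job.sh
```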

Common SBATCH directives

Directive	Purpose	Example
`--mem`	Total memory for the job	`--mem=10gb`
`--cpus-per-task`	Number of CPU cores	`--cpus-per-task=4`
`--output`	Standard output log file	`--output=slurm-%j.log`
`--error`	Standard error log file	`--error=slurm-%j.err`
`--job-name`	Name shown in squeue	`--job-name=alignment`
`--time`	Maximum wall time	`--time=24:00:00`
`--partition`	Which partition to use	`--partition=thick`
`--mail-type`	Email notifications	`--mail-type=BEGIN,END,FAIL`
`--mail-user`	Email address	`--mail-user=you@abi.am`
`--array`	Submit a job array	`--array=1-10`

Full example with best practices

#!/bin/bash
#SBATCH --job-name=align_sample01
#SBATCH --mem=40gb
#SBATCH --cpus-per-task=8
#SBATCH --time=48:00:00
#SBATCH --output=log/align_sample01_%j.log
#SBATCH --error=log/align_sample01_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your_email@abi.am
 
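# Stop at the first error, including a failure inside a pipeline,
# so Slurm reports the job as FAILED instead of COMPLETED
set -eo pipefail
 
# Print job info for debugging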
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $(hostname)"
echo "Start: $(date)"
echo "Directory: $(pwd)"
 
# Use the SLURM variable for thread count (keeps it consistent)
THREADS=$SLURM_CPUS_PER_TASK
 
# Create output directory
mkdir -p bam/
 
# Run alignment
bwa mem -t $THREADS \
    /mnt/nas1/db/genomes/homo_sapiens/GRCh38.p14/bwa_mem_0.7.17-r1188/GCF_000001405.40_GRCh38.p14_genomic.fna \
    fastq/sample01_1.fq.gz \
    fastq/sample01_2.fq.gz \
    | samtools sort -@ $THREADS -o bam/sample01.sorted.bam -
 
samtools index bam/sample01.sorted.bam
 
echo "Finished: $(date)"

Monitoring Jobs

View the queue

# View all jobs
squeue
 
# View only your jobs
squeue --me
 
# Detailed formatting (recommended -- add this as an alias)
squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"

Job state codes:

Code	Meaning
`PD`	Pending (waiting for resources)
`R`	Running
`CG`	Completing
`CD`	Completed
`F`	Failed
`CA`	Cancelled
`TO`	Timed out
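To get a quick count of your jobs per state code, the squeue output can be piped through sort and uniq. This is a sketch; the function name count_my_jobs_by_state is hypothetical.

```shell
# Summarise your own jobs by state code (PD, R, ...).
# -h suppresses the header line; -o "%t" prints only the compact state code.
count_my_jobs_by_state() {
    squeue --me -h -o "%t" | sort | uniq -c
}

# Usage (on the cluster):
#   count_my_jobs_by_state
```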

View completed job details

# Basic accounting
sacct -j <jobid>
 
# Detailed resource usage
sacct -j <jobid> --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,MaxVMSize,NCPUS

Cancel a job

# Cancel a specific job
scancel <jobid>
 
# Cancel all your jobs
scancel -u $USER
 
# Cancel all your pending jobs
scancel -u $USER --state=PENDING

Interactive Sessions

Sometimes you need to work interactively on a compute node (e.g., for testing, debugging, or running tools that require interaction).

# Start an interactive bash session on a compute node
srun --pty --mem=4gb --cpus-per-task=2 bash
 
# With a specific partition and time limit
srun --pty --mem=8gb --cpus-per-task=4 --time=2:00:00 --partition=thin bash

Once the session starts, you will be on a compute node and can run commands directly. Type exit to end the session.

Important: Interactive sessions consume resources just like batch jobs. End them when you are done.
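A script can check whether it is running inside a Slurm allocation before starting heavy work, which helps avoid accidental runs on the login nodes. The sketch below relies on the SLURM_JOB_ID variable that Slurm sets inside jobs; the function names are hypothetical.

```shell
# Return success if the current shell is running inside a Slurm job.
# Slurm sets SLURM_JOB_ID for batch jobs and for srun/salloc sessions.
in_slurm_job() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

# Print a hint and fail when called outside a Slurm job.
require_slurm_job() {
    if ! in_slurm_job; then
        echo "Not inside a Slurm job (host: $(hostname)); use srun or sbatch." >&2
        return 1
    fi
}

# Usage at the top of a heavy script:
#   require_slurm_job || exit 1
```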

Job Arrays

Job arrays let you submit many similar jobs with a single command. This is useful for processing multiple samples with the same script.

Example: Process 10 samples

#!/bin/bash
#SBATCH --job-name=qc_array
#SBATCH --mem=4gb
#SBATCH --cpus-per-task=2
#SBATCH --output=log/qc_%A_%a.log
#SBATCH --array=1-10
 
# %A = array master job ID, %a = array task ID
 
# Read the sample name from a file (one sample per line)
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
 
echo "Processing sample: $SAMPLE (task $SLURM_ARRAY_TASK_ID)"
 
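# FastQC does not create its output directory, so make sure it exists
mkdir -p fastqc/
 
fastqc --threads $SLURM_CPUS_PER_TASK -o fastqc/ fastq/${SAMPLE}_1.fq.gz fastq/${SAMPLE}_2.fq.gz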

Where samples.txt contains:

sample01
sample02
sample03
...
sample10

Submit:

sbatch qc_array.sh

Controlling array parallelism

Limit the number of simultaneous tasks with %N:

#SBATCH --array=1-100%10   # Run 100 tasks, but only 10 at a time
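Instead of hardcoding the array range, it can be derived from the sample list so that the range always matches the number of lines that sed will index. This is a sketch with a hypothetical helper name, array_spec; an --array value given on the sbatch command line takes precedence over the #SBATCH line in the script.

```shell
# Hypothetical helper: build an --array value covering every line of a
# sample list, with at most max_parallel tasks running at once.
array_spec() {
    local samples_file=$1 max_parallel=$2 n
    # Count lines the same way sed numbers them (blank lines included)
    n=$(awk 'END { print NR }' "$samples_file") || return 1
    if [ "$n" -lt 1 ]; then
        echo "no samples found in $samples_file" >&2
        return 1
    fi
    echo "1-${n}%${max_parallel}"
}

# Usage (on the cluster):
#   sbatch --array="$(array_spec samples.txt 10)" qc_array.sh
```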

Choosing Resources

Requesting the right amount of resources is important:

Too little – your job crashes or gets killed by Slurm.
Too much – your job waits longer in the queue, and you waste cluster resources.

Memory

Guidelines for common bioinformatics tasks:

Task	Typical Memory
FastQC	2-4 GB
fastp trimming	4-8 GB
BWA mem alignment	10-40 GB (depends on genome size)
GATK HaplotypeCaller	8-16 GB
samtools sort	4-10 GB
TODO: add more based on your workloads

If you are unsure, start with a moderate amount and check the actual usage after the job completes:

sacct -j <jobid> --format=JobID,MaxRSS,Elapsed

MaxRSS shows the peak memory usage. Adjust your next job accordingly.
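One way to turn a MaxRSS reading into the next --mem request is to add some headroom and round up. The sketch below uses a hypothetical helper, suggest_mem, and an arbitrary 20% margin; it assumes the value carries a K, M, or G suffix. Note that sacct reports MaxRSS on the job step lines (such as <jobid>.batch), not on the main job line.

```shell
# Hypothetical helper: turn a MaxRSS value from sacct (e.g. "5000000K")
# into a suggested --mem value in whole GB with 20% headroom.
suggest_mem() {
    awk -v rss="$1" 'BEGIN {
        unit = substr(rss, length(rss))              # last character: K, M, or G
        value = substr(rss, 1, length(rss) - 1) + 0  # numeric part
        if (unit == "K")      gb = value / 1024 / 1024
        else if (unit == "M") gb = value / 1024
        else if (unit == "G") gb = value
        else { print "unrecognised unit: " rss > "/dev/stderr"; exit 1 }
        padded = gb * 1.2                            # 20% headroom (assumption)
        whole = int(padded)
        if (whole < padded) whole++                  # round up to a whole GB
        if (whole < 1) whole = 1
        printf "%dG\n", whole
    }'
}

# Usage:
#   suggest_mem 5000000K    # about 4.8 GB used, suggests 6G
```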

CPUs

Most bioinformatics tools support a --threads or -t parameter. Set --cpus-per-task to match.
Not all tools benefit from many threads. Common sweet spots are 4-8 threads.
Use the $SLURM_CPUS_PER_TASK variable in your script to keep the thread count consistent.
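SLURM_CPUS_PER_TASK is only set when --cpus-per-task was requested, so a script that is also tested outside Slurm may want a fallback. A minimal sketch, with a hypothetical function name and a default of 1 thread chosen as an assumption:

```shell
# Print the thread count to use: the Slurm allocation if there is one,
# otherwise 1 (e.g. when the script is run outside Slurm for a quick test).
thread_count() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Usage inside a job script:
#   THREADS=$(thread_count)
#   your_command --threads "$THREADS" input.fastq -o output/
```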

Time

If you do not set --time, Slurm uses the partition's default limit.
If your job exceeds the time limit, it will be killed.
Check the partition limits with sinfo -l.
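A reasonable --time value can be estimated from the Elapsed column of a similar finished job. The sketch below uses hypothetical helper names and an arbitrary 50% safety margin; it assumes Elapsed is in the form [DD-]HH:MM:SS. A plain number passed to --time is interpreted as minutes.

```shell
# Hypothetical helper: convert a Slurm duration ([DD-]HH:MM:SS) to seconds.
elapsed_to_seconds() {
    local t=$1 days=0 h m s
    if [[ $t == *-* ]]; then
        days=${t%%-*}     # part before the dash is the day count
        t=${t#*-}         # remainder is HH:MM:SS
    fi
    IFS=: read -r h m s <<< "$t"
    # 10# forces base 10 so that "08" and "09" are not parsed as octal
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Hypothetical helper: suggest a --time value in minutes with a 50% margin.
suggest_time_minutes() {
    local secs
    secs=$(elapsed_to_seconds "$1") || return 1
    echo $(( (secs * 3 / 2 + 59) / 60 ))   # add margin, round up to minutes
}

# Usage (on the cluster):
#   sbatch --time="$(suggest_time_minutes 11:12:15)" my_job.sh
```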

Tips and Best Practices

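Always create a log/ directory before submitting jobs that write to log/. Slurm does not create missing directories for --output or --error, and the job fails without leaving a log file.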
Use $SLURM_CPUS_PER_TASK instead of hardcoding thread counts.
Name your jobs with --job-name so you can identify them in squeue.
Use one script per task – if running the same tool with different parameters, create separate scripts (e.g., align_sample01.sh, align_sample02.sh) or use job arrays.
Check job output – always review the log file after a job finishes.
Be a good neighbor – do not request more resources than you need.
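The first tip can be automated with a thin wrapper around sbatch. This is only a sketch: the name sbatch_with_logdir is hypothetical, and it assumes your scripts write their logs to log/ relative to the directory you submit from.

```shell
# Hypothetical wrapper: make sure log/ exists, then hand all arguments to sbatch.
sbatch_with_logdir() {
    mkdir -p log || return 1
    sbatch "$@"
}

# Usage (on the cluster):
#   sbatch_with_logdir align_sample01.sh
```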

Useful alias for squeue

Add this to your ~/.bashrc for nicer job listing:

alias sq='squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"'

Then just type sq to see formatted output:

 JOBID  PARTITION       NAME            USER         ST       TIME      NODES             NODELIST(REASON) CPU MIN_MEMORY
  2313    compute  computel           anahit          R    1:53:18          1              thin-01  20        35G
  2293    compute  kneaddata           nelli          R   11:12:15          1              thin-01  20        30G
  2299    compute  glasso_j1          davith          R   11:12:15          1              thin-01   8        60G
  2282    compute  run_som.sh         melina          R   11:12:16          1              thin-01   8        50G
  2309    compute  plot_cover          mherk         PD       0:00          1          (Resources)   1         0
  2121      thick  pilon                nate         PD       0:00          1       (Nodes requi..   4       512G

Troubleshooting

Problem	Likely Cause	Solution
Job stays in `PD` state	Not enough free resources	Wait, or reduce resource request
Job immediately fails	Script error or bad path	Check the log file for error messages
`slurmstepd: error: Exceeded job memory limit`	Requested too little memory	Increase `--mem`
`CANCELLED AT … DUE TO TIME LIMIT`	Job took longer than `--time`	Increase the time limit
`error: Batch job submission failed: Invalid partition`	Wrong partition name	Valid partitions: `compute`, `thin`, `thick`, `download`. Check with `sinfo`
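When a job stays in PD, squeue can report the scheduler's reason directly (for example Resources or Priority). A minimal sketch with a hypothetical function name, job_reason:

```shell
# Print the reason a job is in its current state.
# -h suppresses the header; %r is the squeue format field for the reason.
job_reason() {
    squeue -j "$1" -h -o "%r"
}

# Usage (on the cluster):
#   job_reason 12345
```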
