====== Using Slurm ======

Slurm (Simple Linux Utility for Resource Management) is the job scheduler on ABI's cluster. All compute jobs must be submitted through Slurm -- **do not run heavy computation on the login nodes** (''ssh-01'', ''ssh-02'').

===== ABI Partitions =====

^ Partition ^ Nodes ^ CPUs ^ Memory ^ Purpose ^
| ''compute'' (default) | thin-01, thin-02, thick-01 | 64/node | 384G-768G | General purpose -- use this for most jobs |
| ''thin'' | thin-01, thin-02 | 64/node | ~384G each | Explicit thin-node targeting |
| ''thick'' | thick-01 | 64 | ~768G | Memory-intensive jobs (e.g., pilon, large assemblies) |
| ''download'' | dl-01, dl-02 | 2/node | ~8G each | Data downloads only (not for computation) |

  * The ''compute'' partition is the **default**. If you omit ''%%--partition%%'', your job goes here.
  * Use ''%%--partition=thick%%'' explicitly when you need >384G of RAM.
  * Use ''%%--partition=download%%'' only for data download tasks.

===== Quick Reference =====

^ Command ^ Purpose ^ Example ^
| ''sbatch'' | Submit a batch job | ''sbatch my_job.sh'' |
| ''squeue'' | View the job queue | ''squeue --me'' |
| ''scancel'' | Cancel a job | ''scancel 12345'' |
| ''sinfo'' | View partitions & node status | ''sinfo'' |
| ''sacct'' | View completed job info | ''sacct -j 12345'' |
| ''srun'' | Run an interactive command | ''srun --pty bash'' |
| ''salloc'' | Allocate resources interactively | ''salloc --mem=4G'' |

----

===== Submitting a Batch Job =====

A batch job is a shell script with special ''#SBATCH'' directives that tell Slurm what resources you need.
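Choosing a partition follows mechanically from the memory you plan to request. As a sketch, a small shell helper (hypothetical, not an ABI-provided script) encoding the rule from the partition table: anything over 384G goes to ''thick'', everything else to the default ''compute'':

```shell
#!/bin/bash
# Hypothetical helper -- not an ABI-provided script. Encodes the rule from
# the partition table: jobs needing >384G of RAM must target the thick node.
suggest_partition() {
    local mem_gb=$1
    if [ "$mem_gb" -gt 384 ]; then
        echo "thick"
    else
        echo "compute"
    fi
}

suggest_partition 40    # prints: compute
suggest_partition 512   # prints: thick
```

You could then submit with, e.g., ''sbatch --partition=$(suggest_partition 512) my_job.sh''.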
=== Minimal example ===

Create a file ''my_job.sh'':

<code bash>
#!/bin/bash
#SBATCH --mem=10gb                 # Memory required
#SBATCH --cpus-per-task=4          # Number of CPU cores
#SBATCH --output=slurm-%j.log      # Log file (%j = job ID)

echo "Job started at $(date)"
echo "Running on node: $(hostname)"
echo "Using $SLURM_CPUS_PER_TASK CPUs"

# Your commands here
your_command --threads $SLURM_CPUS_PER_TASK input.fastq -o output/

echo "Job finished at $(date)"
</code>

Submit it:

<code bash>
sbatch my_job.sh
</code>

=== Common SBATCH directives ===

^ Directive ^ Purpose ^ Example ^
| ''%%--mem%%'' | Total memory for the job | ''%%--mem=10gb%%'' |
| ''%%--cpus-per-task%%'' | Number of CPU cores | ''%%--cpus-per-task=4%%'' |
| ''%%--output%%'' | Standard output log file | ''%%--output=slurm-%j.log%%'' |
| ''%%--error%%'' | Standard error log file | ''%%--error=slurm-%j.err%%'' |
| ''%%--job-name%%'' | Name shown in squeue | ''%%--job-name=alignment%%'' |
| ''%%--time%%'' | Maximum wall time | ''%%--time=24:00:00%%'' |
| ''%%--partition%%'' | Which partition to use | ''%%--partition=thick%%'' |
| ''%%--mail-type%%'' | Email notifications | ''%%--mail-type=BEGIN,END,FAIL%%'' |
| ''%%--mail-user%%'' | Email address | ''%%--mail-user=you@abi.am%%'' |
| ''%%--array%%'' | Submit a job array | ''%%--array=1-10%%'' |

=== Full example with best practices ===

<code bash>
#!/bin/bash
#SBATCH --job-name=align_sample01
#SBATCH --mem=40gb
#SBATCH --cpus-per-task=8
#SBATCH --time=48:00:00
#SBATCH --output=log/align_sample01_%j.log
#SBATCH --error=log/align_sample01_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your_email@abi.am

# Print job info for debugging
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $(hostname)"
echo "Start: $(date)"
echo "Directory: $(pwd)"

# Use the SLURM variable for thread count (keeps it consistent)
THREADS=$SLURM_CPUS_PER_TASK

# Create output directory
mkdir -p bam/

# Run alignment
bwa mem -t $THREADS \
    /mnt/nas1/db/genomes/homo_sapiens/GRCh38.p14/bwa_mem_0.7.17-r1188/GCF_000001405.40_GRCh38.p14_genomic.fna \
    fastq/sample01_1.fq.gz \
    fastq/sample01_2.fq.gz \
    | samtools sort -@ $THREADS -o bam/sample01.sorted.bam -

samtools index bam/sample01.sorted.bam

echo "Finished: $(date)"
</code>

----

===== Monitoring Jobs =====

=== View the queue ===

<code bash>
# View all jobs
squeue

# View only your jobs
squeue --me

# Detailed formatting (recommended -- add this as an alias)
squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"
</code>

Job state codes:

^ Code ^ Meaning ^
| ''PD'' | Pending (waiting for resources) |
| ''R'' | Running |
| ''CG'' | Completing |
| ''CD'' | Completed |
| ''F'' | Failed |
| ''CA'' | Cancelled |
| ''TO'' | Timed out |

=== View completed job details ===

<code bash>
# Basic accounting
sacct -j <jobid>

# Detailed resource usage
sacct -j <jobid> --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,MaxVMSize,NCPUS
</code>

=== Cancel a job ===

<code bash>
# Cancel a specific job
scancel <jobid>

# Cancel all your jobs
scancel -u $USER

# Cancel all your pending jobs
scancel -u $USER --state=PENDING
</code>

----

===== Interactive Sessions =====

Sometimes you need to work interactively on a compute node (e.g., for testing, debugging, or running tools that require interaction).

<code bash>
# Start an interactive bash session on a compute node
srun --pty --mem=4gb --cpus-per-task=2 bash

# With a specific partition and time limit
srun --pty --mem=8gb --cpus-per-task=4 --time=2:00:00 --partition=thin bash
</code>

Once the session starts, you will be on a compute node and can run commands directly. Type ''exit'' to end the session.

**Important:** Interactive sessions consume resources just like batch jobs. End them when you are done.

----

===== Job Arrays =====

Job arrays let you submit many similar jobs with a single command. This is useful for processing multiple samples with the same script.
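A common pattern is to keep sample names in a text file, one per line, and have each array task pick out its own line with ''sed''. That lookup can be dry-run locally before submitting, simulating the task ID that Slurm would set:

```shell
#!/bin/bash
# Dry run of the per-task sample lookup, outside Slurm.
printf 'sample01\nsample02\nsample03\n' > samples.txt

SLURM_ARRAY_TASK_ID=2                 # simulate what Slurm sets for array task 2
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
echo "$SAMPLE"                        # prints: sample02
```

''sed -n "Np"'' prints only line N, so task N always processes the Nth sample in the file.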
=== Example: Process 10 samples ===

<code bash>
#!/bin/bash
#SBATCH --job-name=qc_array
#SBATCH --mem=4gb
#SBATCH --cpus-per-task=2
#SBATCH --output=log/qc_%A_%a.log
#SBATCH --array=1-10

# %A = array master job ID, %a = array task ID

# Read the sample name from a file (one sample per line)
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)

echo "Processing sample: $SAMPLE (task $SLURM_ARRAY_TASK_ID)"

fastqc -o fastqc/ fastq/${SAMPLE}_1.fq.gz fastq/${SAMPLE}_2.fq.gz
</code>

Where ''samples.txt'' contains:

<code>
sample01
sample02
sample03
...
sample10
</code>

Submit:

<code bash>
sbatch qc_array.sh
</code>

=== Controlling array parallelism ===

Limit the number of simultaneous tasks with ''%N'':

<code bash>
#SBATCH --array=1-100%10    # Run 100 tasks, but only 10 at a time
</code>

----

===== Choosing Resources =====

Requesting the right amount of resources is important:

  * **Too little** -- your job crashes or gets killed by Slurm.
  * **Too much** -- your job waits longer in the queue, and you waste cluster resources.

=== Memory ===

Guidelines for common bioinformatics tasks:

^ Task ^ Typical Memory ^
| FastQC | 2-4 GB |
| fastp trimming | 4-8 GB |
| BWA mem alignment | 10-40 GB (depends on genome size) |
| GATK HaplotypeCaller | 8-16 GB |
| samtools sort | 4-10 GB |
| //TODO: add more based on your workloads// | |

If you are unsure, start with a moderate amount and check the actual usage after the job completes:

<code bash>
sacct -j <jobid> --format=JobID,MaxRSS,Elapsed
</code>

''MaxRSS'' shows the peak memory usage. Adjust your next job accordingly.

=== CPUs ===

  * Most bioinformatics tools support a ''--threads'' or ''-t'' parameter. Set ''%%--cpus-per-task%%'' to match.
  * Not all tools benefit from many threads. Common sweet spots are 4-8 threads.
  * Use the ''$SLURM_CPUS_PER_TASK'' variable in your script to keep the thread count consistent.

=== Time ===

  * If you do not set ''%%--time%%'', Slurm uses the partition's default limit.
  * If your job exceeds the time limit, it will be killed.
  * Check the partition limits with ''sinfo -l''.
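To turn the ''MaxRSS'' value reported by ''sacct'' (e.g. ''35420K'') into a ''%%--mem%%'' request for the next run, a small conversion helper can be handy. This is a sketch, not part of Slurm, and assumes the usual K/M/G suffixes:

```shell
#!/bin/bash
# Sketch of a MaxRSS-to-gigabytes converter (hypothetical helper, not part
# of Slurm). Rounds up so the suggested --mem is never below actual usage.
maxrss_to_gb() {
    local val=$1
    local num=${val%[KMG]}        # numeric part
    local unit=${val#"$num"}      # unit suffix: K, M, or G
    case $unit in
        K) echo $(( (num + 1048575) / 1048576 )) ;;
        M) echo $(( (num + 1023) / 1024 )) ;;
        G) echo "$num" ;;
        *) echo "unrecognized MaxRSS unit: $val" >&2; return 1 ;;
    esac
}

maxrss_to_gb 35420K    # prints: 1
maxrss_to_gb 12250M    # prints: 12
```

Add a safety margin (say 20-30%) on top of the converted value when setting ''%%--mem%%''.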
----

===== Tips and Best Practices =====

  * **Always create a ''log/'' directory** before submitting jobs that write to ''log/''.
  * **Use ''$SLURM_CPUS_PER_TASK''** instead of hardcoding thread counts.
  * **Name your jobs** with ''%%--job-name%%'' so you can identify them in ''squeue''.
  * **Use one script per task** -- if running the same tool with different parameters, create separate scripts (e.g., ''align_sample01.sh'', ''align_sample02.sh'') or use job arrays.
  * **Check job output** -- always review the log file after a job finishes.
  * **Be a good neighbor** -- do not request more resources than you need.

=== Useful alias for squeue ===

Add this to your ''~/.bashrc'' for nicer job listing:

<code bash>
alias sq='squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"'
</code>

Then just type ''sq'' to see formatted output:

<code>
JOBID PARTITION NAME       USER   ST TIME     NODES NODELIST(REASON) CPU MIN_MEMORY
2313  compute   computel   anahit R  1:53:18  1     thin-01          20  35G
2293  compute   kneaddata  nelli  R  11:12:15 1     thin-01          20  30G
2299  compute   glasso_j1  davith R  11:12:15 1     thin-01          8   60G
2282  compute   run_som.sh melina R  11:12:16 1     thin-01          8   50G
2309  compute   plot_cover mherk  PD 0:00     1     (Resources)      1   0
2121  thick     pilon      nate   PD 0:00     1     (Nodes requi..   4   512G
</code>

----

===== Troubleshooting =====

^ Problem ^ Likely Cause ^ Solution ^
| Job stays in ''PD'' state | Not enough free resources | Wait, or reduce resource request |
| Job immediately fails | Script error or bad path | Check the log file for error messages |
| ''slurmstepd: error: Exceeded job memory limit'' | Requested too little memory | Increase ''%%--mem%%'' |
| ''CANCELLED AT ... DUE TO TIME LIMIT'' | Job took longer than ''%%--time%%'' | Increase the time limit |
| ''error: Batch job submission failed: Invalid partition'' | Wrong partition name | Valid partitions: ''compute'', ''thin'', ''thick'', ''download''; check with ''sinfo'' |

----

===== Further Reading =====

  * [[https://slurm.schedmd.com/documentation.html|Official Slurm Documentation]]
  * [[https://slurm.schedmd.com/sbatch.html|sbatch Reference]]
  * [[https://slurm.schedmd.com/squeue.html|squeue Reference]]
  * [[getting_started:cluster_basics|ABI Cluster Basics]]
  * [[pipelines:start|Pipelines]] -- Ready-to-use workflows that use Slurm