====== Using Slurm ======

Slurm (Simple Linux Utility for Resource Management) is the job scheduler on ABI's cluster. All compute jobs must be submitted through Slurm -- **do not run heavy computation on the login nodes** (''ssh-01'', ''ssh-02'').

===== ABI Partitions =====

^ Partition ^ Nodes ^ CPUs ^ Memory ^ Purpose ^
| ''compute'' (default) | thin-01, thin-02, thick-01 | 64/node | 384G-768G | General purpose -- use this for most jobs |
| ''thin'' | thin-01, thin-02 | 64/node | ~384G each | Explicit thin-node targeting |
| ''thick'' | thick-01 | 64 | ~768G | Memory-intensive jobs (e.g., pilon, large assemblies) |
| ''download'' | dl-01, dl-02 | 2/node | ~8G each | Data downloads only (not for computation) |

  * The ''compute'' partition is the **default**. If you omit ''%%--partition%%'', your job goes here.
  * Use ''%%--partition=thick%%'' explicitly when you need >384G of RAM.
  * Use ''%%--partition=download%%'' only for data download tasks.

===== Quick Reference =====

^ Command ^ Purpose ^ Example ^
| ''sbatch'' | Submit a batch job | ''sbatch my_job.sh'' |
| ''squeue'' | View the job queue | ''squeue --me'' |
| ''scancel'' | Cancel a job | ''scancel 12345'' |
| ''sinfo'' | View partitions & node status | ''sinfo'' |
| ''sacct'' | View completed job info | ''sacct -j 12345'' |
| ''srun'' | Run an interactive command | ''srun --pty bash'' |
| ''salloc'' | Allocate resources interactively | ''salloc --mem=4G'' |

----

===== Submitting a Batch Job =====

A batch job is a shell script with special ''#SBATCH'' directives that tell Slurm what resources you need.
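Choosing a partition follows mechanically from the memory you plan to request. As a sketch, a small shell helper (hypothetical, not an ABI-provided script) encoding the rule from the partition table: anything over 384G goes to ''thick'', everything else to the default ''compute'':

```shell
#!/bin/bash
# Hypothetical helper -- not an ABI-provided script. Encodes the rule from
# the partition table: jobs needing >384G of RAM must target the thick node.
suggest_partition() {
    local mem_gb=$1
    if [ "$mem_gb" -gt 384 ]; then
        echo "thick"
    else
        echo "compute"
    fi
}

suggest_partition 40    # prints: compute
suggest_partition 512   # prints: thick
```

You could then submit with, e.g., ''sbatch --partition=$(suggest_partition 512) my_job.sh''.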
=== Minimal example ===

Create a file ''my_job.sh'':

<code bash>
#!/bin/bash
#SBATCH --mem=10gb                 # Memory required
#SBATCH --cpus-per-task=4          # Number of CPU cores
#SBATCH --output=slurm-%j.log      # Log file (%j = job ID)

echo "Job started at $(date)"
echo "Running on node: $(hostname)"
echo "Using $SLURM_CPUS_PER_TASK CPUs"

# Your commands here
your_command --threads $SLURM_CPUS_PER_TASK input.fastq -o output/

echo "Job finished at $(date)"
</code>

Submit it:

<code bash>
sbatch my_job.sh
</code>

=== Common SBATCH directives ===

^ Directive ^ Purpose ^ Example ^
| ''%%--mem%%'' | Total memory for the job | ''%%--mem=10gb%%'' |
| ''%%--cpus-per-task%%'' | Number of CPU cores | ''%%--cpus-per-task=4%%'' |
| ''%%--output%%'' | Standard output log file | ''%%--output=slurm-%j.log%%'' |
| ''%%--error%%'' | Standard error log file | ''%%--error=slurm-%j.err%%'' |
| ''%%--job-name%%'' | Name shown in squeue | ''%%--job-name=alignment%%'' |
| ''%%--time%%'' | Maximum wall time | ''%%--time=24:00:00%%'' |
| ''%%--partition%%'' | Which partition to use | ''%%--partition=thick%%'' |
| ''%%--mail-type%%'' | Email notifications | ''%%--mail-type=BEGIN,END,FAIL%%'' |
| ''%%--mail-user%%'' | Email address | ''%%--mail-user=you@abi.am%%'' |
| ''%%--array%%'' | Submit a job array | ''%%--array=1-10%%'' |

=== Full example with best practices ===

<code bash>
#!/bin/bash
#SBATCH --job-name=align_sample01
#SBATCH --mem=40gb
#SBATCH --cpus-per-task=8
#SBATCH --time=48:00:00
#SBATCH --output=log/align_sample01_%j.log
#SBATCH --error=log/align_sample01_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your_email@abi.am

# Print job info for debugging
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $(hostname)"
echo "Start: $(date)"
echo "Directory: $(pwd)"

# Use the SLURM variable for thread count (keeps it consistent)
THREADS=$SLURM_CPUS_PER_TASK

# Create output directory
mkdir -p bam/

# Run alignment
bwa mem -t $THREADS \
    /mnt/nas1/db/genomes/homo_sapiens/GRCh38.p14/bwa_mem_0.7.17-r1188/GCF_000001405.40_GRCh38.p14_genomic.fna \
    fastq/sample01_1.fq.gz \
    fastq/sample01_2.fq.gz \
    | samtools sort -@ $THREADS -o bam/sample01.sorted.bam -

samtools index bam/sample01.sorted.bam

echo "Finished: $(date)"
</code>

----

===== Monitoring Jobs =====

=== View the queue ===

<code bash>
# View all jobs
squeue

# View only your jobs
squeue --me

# Detailed formatting (recommended -- add this as an alias)
squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"
</code>

Job state codes:

^ Code ^ Meaning ^
| ''PD'' | Pending (waiting for resources) |
| ''R'' | Running |
| ''CG'' | Completing |
| ''CD'' | Completed |
| ''F'' | Failed |
| ''CA'' | Cancelled |
| ''TO'' | Timed out |

=== View completed job details ===

<code bash>
# Basic accounting
sacct -j <jobid>

# Detailed resource usage
sacct -j <jobid> --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,MaxVMSize,NCPUS
</code>

=== Cancel a job ===

<code bash>
# Cancel a specific job
scancel <jobid>

# Cancel all your jobs
scancel -u $USER

# Cancel all your pending jobs
scancel -u $USER --state=PENDING
</code>

----

===== Interactive Sessions =====

Sometimes you need to work interactively on a compute node (e.g., for testing, debugging, or running tools that require interaction).

<code bash>
# Start an interactive bash session on a compute node
srun --pty --mem=4gb --cpus-per-task=2 bash

# With a specific partition and time limit
srun --pty --mem=8gb --cpus-per-task=4 --time=2:00:00 --partition=thin bash
</code>

Once the session starts, you will be on a compute node and can run commands directly. Type ''exit'' to end the session.

**Important:** Interactive sessions consume resources just like batch jobs. End them when you are done.

----

===== Job Arrays =====

Job arrays let you submit many similar jobs with a single command. This is useful for processing multiple samples with the same script.
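A common pattern is to keep sample names in a text file, one per line, and have each array task pick out its own line with ''sed''. That lookup can be dry-run locally before submitting, simulating the task ID that Slurm would set:

```shell
#!/bin/bash
# Dry run of the per-task sample lookup, outside Slurm.
printf 'sample01\nsample02\nsample03\n' > samples.txt

SLURM_ARRAY_TASK_ID=2                 # simulate what Slurm sets for array task 2
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
echo "$SAMPLE"                        # prints: sample02
```

''sed -n "Np"'' prints only line N, so task N always processes the Nth sample in the file.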
=== Example: Process 10 samples ===

<code bash>
#!/bin/bash
#SBATCH --job-name=qc_array
#SBATCH --mem=4gb
#SBATCH --cpus-per-task=2
#SBATCH --output=log/qc_%A_%a.log
#SBATCH --array=1-10

# %A = array master job ID, %a = array task ID

# Read the sample name from a file (one sample per line)
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)

echo "Processing sample: $SAMPLE (task $SLURM_ARRAY_TASK_ID)"

fastqc -o fastqc/ fastq/${SAMPLE}_1.fq.gz fastq/${SAMPLE}_2.fq.gz
</code>

Where ''samples.txt'' contains:

<code>
sample01
sample02
sample03
...
sample10
</code>

Submit:

<code bash>
sbatch qc_array.sh
</code>

=== Controlling array parallelism ===

Limit the number of simultaneous tasks with ''%N'':

<code bash>
#SBATCH --array=1-100%10    # Run 100 tasks, but only 10 at a time
</code>

----

===== Choosing Resources =====

Requesting the right amount of resources is important:

  * **Too little** -- your job crashes or gets killed by Slurm.
  * **Too much** -- your job waits longer in the queue, and you waste cluster resources.

=== Memory ===

Guidelines for common bioinformatics tasks:

^ Task ^ Typical Memory ^
| FastQC | 2-4 GB |
| fastp trimming | 4-8 GB |
| BWA mem alignment | 10-40 GB (depends on genome size) |
| GATK HaplotypeCaller | 8-16 GB |
| samtools sort | 4-10 GB |
| //TODO: add more based on your workloads// | |

If you are unsure, start with a moderate amount and check the actual usage after the job completes:

<code bash>
sacct -j <jobid> --format=JobID,MaxRSS,Elapsed
</code>

''MaxRSS'' shows the peak memory usage. Adjust your next job accordingly.

=== CPUs ===

  * Most bioinformatics tools support a ''--threads'' or ''-t'' parameter. Set ''%%--cpus-per-task%%'' to match.
  * Not all tools benefit from many threads. Common sweet spots are 4-8 threads.
  * Use the ''$SLURM_CPUS_PER_TASK'' variable in your script to keep the thread count consistent.

=== Time ===

  * If you do not set ''%%--time%%'', Slurm uses the partition's default limit.
  * If your job exceeds the time limit, it will be killed.
  * Check the partition limits with ''sinfo -l''.
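To turn the ''MaxRSS'' value reported by ''sacct'' (e.g. ''35420K'') into a ''%%--mem%%'' request for the next run, a small conversion helper can be handy. This is a sketch, not part of Slurm, and assumes the usual K/M/G suffixes:

```shell
#!/bin/bash
# Sketch of a MaxRSS-to-gigabytes converter (hypothetical helper, not part
# of Slurm). Rounds up so the suggested --mem is never below actual usage.
maxrss_to_gb() {
    local val=$1
    local num=${val%[KMG]}        # numeric part
    local unit=${val#"$num"}      # unit suffix: K, M, or G
    case $unit in
        K) echo $(( (num + 1048575) / 1048576 )) ;;
        M) echo $(( (num + 1023) / 1024 )) ;;
        G) echo "$num" ;;
        *) echo "unrecognized MaxRSS unit: $val" >&2; return 1 ;;
    esac
}

maxrss_to_gb 35420K    # prints: 1
maxrss_to_gb 12250M    # prints: 12
```

Add a safety margin (say 20-30%) on top of the converted value when setting ''%%--mem%%''.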
----

===== Tips and Best Practices =====

  * **Always create a ''log/'' directory** before submitting jobs that write to ''log/''.
  * **Use ''$SLURM_CPUS_PER_TASK''** instead of hardcoding thread counts.
  * **Name your jobs** with ''%%--job-name%%'' so you can identify them in ''squeue''.
  * **Use one script per task** -- if running the same tool with different parameters, create separate scripts (e.g., ''align_sample01.sh'', ''align_sample02.sh'') or use job arrays.
  * **Check job output** -- always review the log file after a job finishes.
  * **Be a good neighbor** -- do not request more resources than you need.

=== Useful alias for squeue ===

Add this to your ''~/.bashrc'' for nicer job listing:

<code bash>
alias sq='squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"'
</code>

Then just type ''sq'' to see formatted output:

<code>
JOBID PARTITION NAME       USER   ST TIME     NODES NODELIST(REASON) CPU MIN_MEMORY
2313  compute   computel   anahit R  1:53:18  1     thin-01          20  35G
2293  compute   kneaddata  nelli  R  11:12:15 1     thin-01          20  30G
2299  compute   glasso_j1  davith R  11:12:15 1     thin-01          8   60G
2282  compute   run_som.sh melina R  11:12:16 1     thin-01          8   50G
2309  compute   plot_cover mherk  PD 0:00     1     (Resources)      1   0
2121  thick     pilon      nate   PD 0:00     1     (Nodes requi..   4   512G
</code>

----

===== Troubleshooting =====

^ Problem ^ Likely Cause ^ Solution ^
| Job stays in ''PD'' state | Not enough free resources | Wait, or reduce resource request |
| Job immediately fails | Script error or bad path | Check the log file for error messages |
| ''slurmstepd: error: Exceeded job memory limit'' | Requested too little memory | Increase ''%%--mem%%'' |
| ''CANCELLED AT ... DUE TO TIME LIMIT'' | Job took longer than ''%%--time%%'' | Increase the time limit |
| ''error: Batch job submission failed: Invalid partition'' | Wrong partition name | Valid partitions: ''compute'', ''thin'', ''thick'', ''download''; check with ''sinfo'' |

----

===== Further Reading =====

  * [[https://slurm.schedmd.com/documentation.html|Official Slurm Documentation]]
  * [[https://slurm.schedmd.com/sbatch.html|sbatch Reference]]
  * [[https://slurm.schedmd.com/squeue.html|squeue Reference]]
  * [[getting_started:cluster_basics|ABI Cluster Basics]]
  * [[pipelines:start|Pipelines]] -- Ready-to-use workflows that use Slurm