
Cluster Basics

This page describes ABI's computing infrastructure at a level suitable for researchers. For detailed system administration documentation, see Infrastructure.

What is an HPC Cluster?

A High-Performance Computing (HPC) cluster is a collection of interconnected computers (called nodes) that work together to run computationally intensive tasks. Instead of running everything on your laptop, you submit jobs to the cluster, which distributes them across available resources.

Key concepts:

Term          Meaning
Node          A single server/computer in the cluster
Login node    The server you SSH into. Used for file management and job submission – not for heavy computation
Compute node  Servers dedicated to running jobs. Jobs are dispatched here by Slurm
Partition     A group of nodes with shared properties (e.g., memory size, GPU availability). Also called a “queue”
Job           A task you submit to run on a compute node
Slurm         The job scheduler that manages the queue and assigns resources

ABI Cluster Overview

Component            Details
Login node(s)        ssh.abi.am (resolves to VMs ssh-01 and ssh-02)
Compute nodes        thin-01 (64C/384G), thin-02 (64C/384G), thick-01 (64C/768G)
Download nodes       dl-01 (2C/8G), dl-02 (2C/8G)
Total compute vCPUs  192
Total compute RAM    1536G
Scheduler            Slurm (controller runs on a separate VM)
Virtualization       All nodes are bhyve VMs running on a FreeBSD physical host

Partitions

Partitions define groups of compute resources. When you submit a job, you can specify which partition to use.

Partition  Nodes                       CPUs         Total Memory  Default?  Purpose
compute    thin-01, thin-02, thick-01  64 per node  384G-768G     Yes       General-purpose computation (default partition)
thin       thin-01, thin-02            64 per node  ~384G each    No        Jobs that fit in standard memory
thick      thick-01                    64           ~768G         No        Memory-intensive jobs (e.g., large genome assembly, pilon)
download   dl-01, dl-02                2 per node   ~8G each      No        Data download tasks only (not for computation)

Notes:

  • The compute partition is the default. If you do not specify --partition, your job goes here.
  • Use thick explicitly when you need more than ~384G of RAM (e.g., --partition=thick --mem=512G).
  • Use download only for downloading data (e.g., SRA downloads). These nodes have minimal CPU and memory.
  • Nodes may appear in multiple partitions (e.g., thick-01 is in both compute and thick).
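For example, a job that needs more memory than the thin nodes offer can request the thick partition in its batch script header. A minimal sketch (the job name and resource values are illustrative, not prescribed):

```shell
#!/bin/bash
#SBATCH --job-name=assembly      # illustrative name
#SBATCH --partition=thick        # needs more than ~384G, so thick-01 is required
#SBATCH --cpus-per-task=16
#SBATCH --mem=512G
#SBATCH --time=24:00:00
#SBATCH --output=assembly_%j.log

# ... your commands here ...
```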

To see current partition and node status:

sinfo

For a detailed view including memory and CPU allocation:

sinfo -N -o "%.10N %.10P %.5a %.4c %.20m %.20F %.10e"

Current cluster state (for reference):

NODELIST  PARTITION     CPUS  MEMORY     PURPOSE
dl-01     download       2     ~8G       Data downloads only
dl-02     download       2     ~8G       Data downloads only
thick-01  compute/thick  64   ~768G      High-memory computation
thin-01   compute/thin   64   ~384G      General computation
thin-02   compute/thin   64   ~384G      General computation

Storage

ABI has several storage areas. Understanding them is important for organizing your work and avoiding issues.

Storage is served from two ZFS-based NAS servers over NFS. ZFS provides transparent compression, so you do not need to manually compress old files – the filesystem handles it automatically. Home directories and selected projects are backed up to a separate server using ZFS send/recv.

Path                      Purpose                                 Served from      Quota                             Notes
/mnt/home/<user>          Home directory – configs, scripts       mustafar (nas1)  ~12G per user                     Keep this small; use project/user dirs for data
/mnt/nas0/user/<user>     Personal user workspace                 geonosis (nas0)  ~100G per user                    For personal datasets, experiments, conda envs
/mnt/nas0/proj/<project>  Project data (some projects)            geonosis (nas0)  Per-project                       *TODO: clarify which projects are on nas0 vs nas1*
/mnt/nas1/proj/<project>  Project data (most projects)            mustafar (nas1)  Per-project (typically 14-25 TB)  Shared with all project members
/mnt/nas1/db/             Shared databases and reference genomes  mustafar (nas1)  ~32 TB total                      Read-only for users. See Databases

Example current usage:

/mnt/home/<user>           ~12G quota    (personal configs, scripts)
/mnt/nas0/user/<user>     ~100G quota    (personal workspace)
/mnt/nas1/proj/armwgs      ~25 TB        (Armenian WGS project)
/mnt/nas1/proj/cfdna       ~14 TB        (cfDNA project)
/mnt/nas1/db/              ~32 TB        (reference genomes, indexes)

Best practices

  • Do not store large data in your home directory. Home has a ~12G quota. Use /mnt/nas0/user/<user> for personal data or /mnt/nas1/proj/<project> for project data.
  • Do not run jobs from your home directory if they produce many output files. Use project space.
  • You do not need to compress old files. The storage uses ZFS with transparent compression – it is handled automatically at the filesystem level.
  • Clean up temporary and intermediate files you no longer need to free up quota for others.
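To check how much of your quota you are using, the standard du and df commands work on the NFS mounts. The paths below follow the table above; <project> is a placeholder for your actual project name:

```shell
# Total size of your personal workspace
du -sh /mnt/nas0/user/$USER

# Usage and free space on a project share
df -h /mnt/nas1/proj/<project>
```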

How Jobs Work

You (laptop) --SSH--> Login Node --sbatch--> Slurm Scheduler --> Compute Node(s)
  1. You connect to the login node via SSH.
  2. You write a job script and submit it with sbatch.
  3. Slurm puts your job in the queue.
  4. When resources are available, Slurm starts your job on a compute node.
  5. Output is written to a log file you specified.
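The steps above can be sketched as a minimal batch script (names and values are illustrative):

```shell
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=compute
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=00:05:00
#SBATCH --output=hello_%j.log   # %j expands to the job ID

# The job body is ordinary shell; this just reports which node ran it
echo "Running on $(hostname)"
```

Submit it with sbatch hello.sh, watch it with squeue, and read hello_<jobid>.log when it finishes.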

Important rules:

  • Do not run heavy computation on the login node. It is shared by all users for file management and job submission.
  • Always request the resources you need (CPU, memory, time) in your Slurm script.
  • If you need an interactive session (e.g., for debugging), use srun or salloc (see Interactive Sessions).
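For debugging, an interactive shell on a compute node can be requested in one line (resource values are illustrative):

```shell
srun --partition=compute --cpus-per-task=4 --mem=16G --time=2:00:00 --pty bash
```

When the session starts you are on a compute node; typing exit returns you to the login node and releases the resources.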

Quick Slurm Commands

Command           Purpose
sbatch script.sh  Submit a batch job
squeue            View all jobs in the queue (see recommended format below)
squeue --me       View only your jobs
scancel <jobid>   Cancel a job
sinfo             View partition and node status
sacct -j <jobid>  View job accounting info after completion
srun --pty bash   Start an interactive session

The default squeue output is hard to read. We recommend this format:

squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"

Example output:

 JOBID  PARTITION        NAME    USER  ST      TIME  NODES  NODELIST(REASON)  CPU  MIN_MEMORY
  2313    compute    computel  anahit   R   1:53:18      1           thin-01   20         35G
  2293    compute   kneaddata   nelli   R  11:12:15      1           thin-01   20         30G
  2299    compute   glasso_j1  davith   R  11:12:15      1           thin-01    8         60G
  2282    compute  run_som.sh  melina   R  11:12:16      1           thin-01    8         50G
  2309    compute  plot_cover   mherk  PD      0:00      1       (Resources)    1           0
  2121      thick       pilon    nate  PD      0:00      1    (Nodes requi..    4        512G

You can add this as an alias in your ~/.bashrc for convenience:

alias sq='squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"'

For a full guide, see Using Slurm.

Environment and Software

All commonly used bioinformatics tools are installed globally on the cluster. There is no module system – tools are available directly by name:

# Check if a tool is available
which bwa
bwa --version
 
which samtools
samtools --version

See Software for a list of available tools.

If you need software that is not installed globally, you can install it locally using Conda.

Important: When using Conda, do not let it add itself to your ~/.bashrc. This slows down every login for you and can cause issues on login nodes. Instead, activate Conda manually when you need it. See the Conda Guide for details.
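One way to keep Conda out of your login shell is a small helper function in ~/.bashrc that you call only when needed. A sketch, assuming a Miniconda install under your home directory (the path is an assumption – adjust it to where your Conda actually lives):

```shell
# Conda is not activated at login; run 'load_conda' only when you need it
load_conda() {
    # assumed install location – change if your conda lives elsewhere
    source "$HOME/miniconda3/etc/profile.d/conda.sh"
    conda activate
}
```

Until you call load_conda, your login shell starts fast and your PATH stays clean.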

Network & Connectivity

  • The cluster is accessible via SSH at ssh.abi.am from anywhere on the internet.
  • No VPN is required for regular access.
  • Project leaders may be required by IT to set up two-factor authentication (2FA) on SSH. IT will inform you if this applies to you.
  • For slow connection troubleshooting, see the Support page.
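An entry in your ~/.ssh/config saves retyping the hostname (replace <user> with your ABI username):

```
Host abi
    HostName ssh.abi.am
    User <user>
```

After that, ssh abi connects you to the login node.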

Next Steps