Cluster Basics
This page describes ABI's computing infrastructure at a level suitable for researchers. For detailed system administration documentation, see Infrastructure.
What is an HPC Cluster?
A High-Performance Computing (HPC) cluster is a collection of interconnected computers (called nodes) that work together to run computationally intensive tasks. Instead of running everything on your laptop, you submit jobs to the cluster, which distributes them across available resources.
Key concepts:
| Term | Meaning |
|---|---|
| Node | A single server/computer in the cluster |
| Login node | The server you SSH into. Used for file management and job submission – not for heavy computation |
| Compute node | Servers dedicated to running jobs. Jobs are dispatched here by Slurm |
| Partition | A group of nodes with shared properties (e.g., memory size, GPU availability). Also called a “queue” |
| Job | A task you submit to run on a compute node |
| Slurm | The job scheduler that manages the queue and assigns resources |
ABI Cluster Overview
| Component | Details |
|---|---|
| Login node(s) | ssh.abi.am (resolves to VMs ssh-01 and ssh-02) |
| Compute nodes | thin-01 (64C/384G), thin-02 (64C/384G), thick-01 (64C/768G) |
| Download nodes | dl-01 (2C/8G), dl-02 (2C/8G) |
| Total compute vCPUs | 192 |
| Total compute RAM | 1536G |
| Scheduler | Slurm (controller runs on a separate VM) |
| Virtualization | All nodes are bhyve VMs running on a FreeBSD physical host |
Partitions
Partitions define groups of compute resources. When you submit a job, you can specify which partition to use.
| Partition | Nodes | CPUs | Total Memory | Default? | Purpose |
|---|---|---|---|---|---|
| compute | thin-01, thin-02, thick-01 | 64 per node | 384G-768G | Yes | General-purpose computation (default partition) |
| thin | thin-01, thin-02 | 64 per node | ~384G each | No | Jobs that fit in standard memory |
| thick | thick-01 | 64 | ~768G | No | Memory-intensive jobs (e.g., large genome assembly, pilon) |
| download | dl-01, dl-02 | 2 per node | ~8G each | No | Data download tasks only (not for computation) |
Notes:
- The `compute` partition is the default. If you do not specify `--partition`, your job goes there.
- Use `thick` explicitly when you need more than ~384G of RAM (e.g., `--partition=thick --mem=512G`).
- Use `download` only for downloading data (e.g., SRA downloads). These nodes have minimal CPU and memory.
- Nodes may appear in multiple partitions (e.g., `thick-01` is in both `compute` and `thick`).
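As a sketch, a batch script targeting the high-memory partition might look like the following. The job name, resource values, and the pilon command are illustrative placeholders, not a prescribed setup:

```shell
#!/bin/bash
#SBATCH --job-name=assembly        # illustrative job name
#SBATCH --partition=thick          # request the high-memory node (thick-01)
#SBATCH --cpus-per-task=16         # adjust to your tool's thread count
#SBATCH --mem=512G                 # more than ~384G, so thin nodes cannot run it
#SBATCH --output=assembly_%j.log   # %j expands to the Slurm job ID

# Example memory-hungry step (placeholder command):
# pilon --genome ref.fa --frags aln.bam --output polished
```

Submitting this with `sbatch` places the job on `thick-01` once 512G of memory is free there.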
To see current partition and node status:
sinfo
For a detailed view including memory and CPU allocation:
sinfo -N -o "%.10N %.10P %.5a %.4c %.20m %.20F %.10e"
Current cluster state (for reference):
NODELIST   PARTITION      CPUS  MEMORY  PURPOSE
dl-01      download       2     ~8G     Data downloads only
dl-02      download       2     ~8G     Data downloads only
thick-01   compute/thick  64    ~768G   High-memory computation
thin-01    compute/thin   64    ~384G   General computation
thin-02    compute/thin   64    ~384G   General computation
Storage
ABI has several storage areas. Understanding them is important for organizing your work and avoiding issues.
Storage is served from two ZFS-based NAS servers over NFS. ZFS provides transparent compression, so you do not need to manually compress old files – the filesystem handles it automatically. Home directories and selected projects are backed up to a separate server using ZFS send/recv.
| Path | Purpose | Served from | Quota | Notes |
|---|---|---|---|---|
/mnt/home/<user> | Home directory – configs, scripts | mustafar (nas1) | ~12G per user | Keep this small; use project/user dirs for data |
/mnt/nas0/user/<user> | Personal user workspace | geonosis (nas0) | ~100G per user | For personal datasets, experiments, conda envs |
/mnt/nas0/proj/<project> | Project data (some projects) | geonosis (nas0) | Per-project | *TODO: clarify which projects are on nas0 vs nas1* |
/mnt/nas1/proj/<project> | Project data (most projects) | mustafar (nas1) | Per-project (typically 14-25 TB) | Shared with all project members |
/mnt/nas1/db/ | Shared databases and reference genomes | mustafar (nas1) | ~32 TB total | Read-only for users. See Databases |
Example current usage:
/mnt/home/<user>        ~12G quota    (personal configs, scripts)
/mnt/nas0/user/<user>   ~100G quota   (personal workspace)
/mnt/nas1/proj/armwgs   ~25 TB        (Armenian WGS project)
/mnt/nas1/proj/cfdna    ~14 TB        (cfDNA project)
/mnt/nas1/db/           ~32 TB        (reference genomes, indexes)
Best practices
- Do not store large data in your home directory. Home has a ~12G quota. Use `/mnt/nas0/user/<user>` for personal data or `/mnt/nas1/proj/<project>` for project data.
- Do not run jobs from your home directory if they produce many output files. Use project space.
- You do not need to compress old files. The storage uses ZFS with transparent compression; it is handled automatically at the filesystem level.
- Clean up temporary and intermediate files you no longer need, to free quota for others.
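Before cleaning up, it helps to see what is actually taking space. A quick sketch using standard tools (the path assumes the personal workspace layout from the table above; `$USER` expands to your username):

```shell
# List the 20 largest files/directories under your personal workspace,
# largest first, so you know what to clean up or move to project space.
du -ah "/mnt/nas0/user/$USER" 2>/dev/null | sort -rh | head -20
```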
How Jobs Work
You (laptop) --SSH--> Login Node --sbatch--> Slurm Scheduler --> Compute Node(s)
- You connect to the login node via SSH.
- You write a job script and submit it with `sbatch`.
- Slurm puts your job in the queue.
- When resources are available, Slurm starts your job on a compute node.
- Output is written to a log file you specified.
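The steps above can be sketched as a minimal job script (the name, resource values, and time limit are placeholders):

```shell
#!/bin/bash
#SBATCH --job-name=hello           # illustrative name
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=00:05:00            # wall-time limit
#SBATCH --output=hello_%j.log      # output log; %j is the job ID

# The actual work, which Slurm runs on a compute node:
echo "Running on $(hostname)"
```

Submit it with `sbatch hello.sh`, watch it with `squeue --me`, and read `hello_<jobid>.log` when it finishes.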
Important rules:
- Do not run heavy computation on the login node. It is shared by all users for file management and job submission.
- Always request the resources you need (CPU, memory, time) in your Slurm script.
- If you need an interactive session (e.g., for debugging), use `srun` or `salloc` (see Interactive Sessions).
Quick Slurm Commands
| Command | Purpose |
|---|---|
| sbatch script.sh | Submit a batch job |
| squeue | View all jobs in the queue (see recommended format below) |
| squeue --me | View only your jobs |
| scancel <jobid> | Cancel a job |
| sinfo | View partition and node status |
| sacct -j <jobid> | View job accounting info after completion |
| srun --pty bash | Start an interactive session |
Recommended squeue format
The default squeue output is hard to read. We recommend this format:
squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"
Example output:
 JOBID  PARTITION       NAME            USER         ST       TIME      NODES  NODELIST(REASON)     CPU  MIN_MEMORY
  2313    compute   computel          anahit          R    1:53:18          1  thin-01               20         35G
  2293    compute  kneaddata           nelli          R   11:12:15          1  thin-01               20         30G
  2299    compute  glasso_j1          davith          R   11:12:15          1  thin-01                8         60G
  2282    compute run_som.sh          melina          R   11:12:16          1  thin-01                8         50G
  2309    compute plot_cover           mherk         PD       0:00          1  (Resources)            1          0
  2121      thick      pilon            nate         PD       0:00          1  (Nodes requi..         4        512G
You can add this as an alias in your ~/.bashrc for convenience:
alias sq='squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"'
For a full guide, see Using Slurm.
Environment and Software
All commonly used bioinformatics tools are installed globally on the cluster. There is no module system – tools are available directly by name:
# Check if a tool is available
which bwa
bwa --version
which samtools
samtools --version
See Software for a list of available tools.
If you need software that is not installed globally, you can install it locally using Conda.
Important: When using Conda, do not let it add itself to your ~/.bashrc. This slows down every login and can cause issues on login nodes. Instead, activate Conda manually when you need it. See the Conda Guide for details.
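A minimal sketch of manual activation (the path assumes a default Miniconda install in your home directory, and `myenv` is a placeholder environment name; adjust both to your setup):

```shell
# One-off activation for the current shell only; nothing is added to ~/.bashrc.
source "$HOME/miniconda3/etc/profile.d/conda.sh"   # assumed install location
conda activate myenv                               # placeholder environment name
```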
Network & Connectivity
- The cluster is accessible via SSH at `ssh.abi.am` from anywhere on the internet.
- No VPN is required for regular access.
- Project leaders may be required by IT to set up two-factor authentication (2FA) on SSH. IT will inform you if this applies to you.
- For slow connection troubleshooting, see the Support page.
Next Steps
- Slurm Guide – Full job submission reference with examples.
- Available Software – What tools are installed and how to use them.
- Pipelines – Ready-to-use bioinformatics workflows.
- Databases & Reference Data – Reference genomes available on the server.
