====== Cluster Basics ======
This page describes ABI's computing infrastructure at a level suitable for researchers. For detailed system administration documentation, see [[infra:start|Infrastructure]].
===== What is an HPC Cluster? =====
A High-Performance Computing (HPC) cluster is a collection of interconnected computers (called **nodes**) that work together to run computationally intensive tasks. Instead of running everything on your laptop, you submit **jobs** to the cluster, which distributes them across available resources.
Key concepts:
^ Term ^ Meaning ^
| **Node** | A single server/computer in the cluster |
| **Login node** | The server you SSH into. Used for file management and job submission -- **not** for heavy computation |
| **Compute node** | Servers dedicated to running jobs. Jobs are dispatched here by Slurm |
| **Partition** | A group of nodes with shared properties (e.g., memory size, GPU availability). Also called a "queue" |
| **Job** | A task you submit to run on a compute node |
| **Slurm** | The job scheduler that manages the queue and assigns resources |
===== ABI Cluster Overview =====
^ Component ^ Details ^
| Login node(s) | ''ssh.abi.am'' (resolves to VMs ''ssh-01'' and ''ssh-02'') |
| Compute nodes | ''thin-01'' (64C/384G), ''thin-02'' (64C/384G), ''thick-01'' (64C/768G) |
| Download nodes | ''dl-01'' (2C/8G), ''dl-02'' (2C/8G) |
| Total compute vCPUs | 192 |
| Total compute RAM | 1536G |
| Scheduler | Slurm (controller runs on a separate VM) |
| Virtualization | All nodes are bhyve VMs running on a FreeBSD physical host |
===== Partitions =====
Partitions define groups of compute resources. When you submit a job, you can specify which partition to use.
^ Partition ^ Nodes ^ CPUs ^ Total Memory ^ Default? ^ Purpose ^
| ''compute'' | thin-01, thin-02, thick-01 | 64 per node | 384G-768G | Yes | General purpose computation (default partition) |
| ''thin'' | thin-01, thin-02 | 64 per node | ~384G each | No | Jobs that fit in standard memory |
| ''thick'' | thick-01 | 64 | ~768G | No | Memory-intensive jobs (e.g., large genome assembly, pilon) |
| ''download'' | dl-01, dl-02 | 2 per node | ~8G each | No | Data download tasks only (not for computation) |
**Notes:**
* The ''compute'' partition is the **default**. If you do not specify ''%%--partition%%'', your job goes here.
* Use ''thick'' explicitly when you need more than ~384G of RAM (e.g., ''%%--partition=thick --mem=512G%%'').
* Use ''download'' only for downloading data (e.g., SRA downloads). These nodes have minimal CPU and memory.
* Nodes may appear in multiple partitions (e.g., ''thick-01'' is in both ''compute'' and ''thick'').
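For example, a memory-hungry job can be steered to the high-memory node at submission time. This is an illustration; the script name and resource figures are placeholders to adapt:

```shell
# Hypothetical submission: request the thick partition and 512G of RAM
sbatch --partition=thick --mem=512G --cpus-per-task=16 assembly_job.sh
```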
To see current partition and node status:
<code bash>
sinfo
</code>
For a detailed view including memory and CPU allocation:
<code bash>
sinfo -N -o "%.10N %.10P %.5a %.4c %.20m %.20F %.10e"
</code>
Current cluster state (for reference):
<code>
NODELIST  PARTITION      CPUS  MEMORY  PURPOSE
dl-01     download       2     ~8G     Data downloads only
dl-02     download       2     ~8G     Data downloads only
thick-01  compute/thick  64    ~768G   High-memory computation
thin-01   compute/thin   64    ~384G   General computation
thin-02   compute/thin   64    ~384G   General computation
</code>
===== Storage =====
ABI has several storage areas. Understanding them helps you organize your work and avoid quota or performance problems.
Storage is served from **two ZFS-based NAS servers** over NFS. ZFS provides **transparent compression**, so you do not need to manually compress old files -- the filesystem handles it automatically. Home directories and selected projects are backed up to a separate server using ZFS send/recv.
^ Path ^ Purpose ^ Served from ^ Quota ^ Notes ^
| ''/mnt/home/'' | Home directory -- configs, scripts | mustafar (nas1) | ~12G per user | Keep this small; use project/user dirs for data |
| ''/mnt/nas0/user/'' | Personal user workspace | geonosis (nas0) | ~100G per user | For personal datasets, experiments, conda envs |
| ''/mnt/nas0/proj/'' | Project data (some projects) | geonosis (nas0) | Per-project | //TODO: clarify which projects are on nas0 vs nas1// |
| ''/mnt/nas1/proj/'' | Project data (most projects) | mustafar (nas1) | Per-project (typically 14-25 TB) | Shared with all project members |
| ''/mnt/nas1/db/'' | Shared databases and reference genomes | mustafar (nas1) | ~32 TB total | Read-only for users. See [[databases:start|Databases]] |
**Example current usage:**
<code>
/mnt/home/              ~12G quota   (personal configs, scripts)
/mnt/nas0/user/         ~100G quota  (personal workspace)
/mnt/nas1/proj/armwgs   ~25 TB       (Armenian WGS project)
/mnt/nas1/proj/cfdna    ~14 TB       (cfDNA project)
/mnt/nas1/db/           ~32 TB       (reference genomes, indexes)
</code>
=== Best practices ===
* **Do not store large data in your home directory.** Home has a ~12G quota. Use ''/mnt/nas0/user/'' for personal data or ''/mnt/nas1/proj/'' for project data.
* **Do not run jobs from your home directory** if they produce many output files. Use project space.
* **You do not need to compress old files.** The storage uses ZFS with transparent compression -- it is handled automatically at the filesystem level.
* **Clean up** temporary and intermediate files you no longer need to free up quota for others.
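To see where your quota is going, the standard ''du'' and ''df'' tools work on all mounts. The paths below assume your personal workspace lives under ''/mnt/nas0/user/'':

```shell
# Total size of your personal workspace (adjust the path to your own area)
du -sh /mnt/nas0/user/$USER

# Used and free space on the storage mounts
df -h /mnt/nas0 /mnt/nas1
```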
===== How Jobs Work =====
<code>
You (laptop) --SSH--> Login Node --sbatch--> Slurm Scheduler --> Compute Node(s)
</code>
- You connect to the **login node** via SSH.
- You write a job script and submit it with ''sbatch''.
- **Slurm** puts your job in the queue.
- When resources are available, Slurm starts your job on a **compute node**.
- Output is written to a log file you specified.
**Important rules:**
* **Do not run heavy computation on the login node.** It is shared by all users for file management and job submission.
* Always request the resources you need (CPU, memory, time) in your Slurm script.
* If you need an interactive session (e.g., for debugging), use ''srun'' or ''salloc'' (see [[software:slurm#interactive_sessions|Interactive Sessions]]).
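Putting the steps together, a minimal batch script looks like the following sketch (job name, resource requests, and log path are placeholders to adapt):

```shell
#!/bin/bash
#SBATCH --job-name=example        # name shown in squeue
#SBATCH --partition=compute       # default partition; stated here for clarity
#SBATCH --cpus-per-task=8         # CPUs to reserve
#SBATCH --mem=32G                 # memory to reserve
#SBATCH --time=12:00:00           # wall-time limit
#SBATCH --output=example_%j.log   # %j expands to the job ID

# The commands below run on the assigned compute node
echo "Running on $(hostname) with $SLURM_CPUS_PER_TASK CPUs"
```

Submit it with ''sbatch example.sh'', then track it with ''squeue --me''.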
===== Quick Slurm Commands =====
^ Command ^ Purpose ^
| ''sbatch script.sh'' | Submit a batch job |
| ''squeue'' | View all jobs in the queue (see recommended format below) |
| ''squeue --me'' | View only your jobs |
| ''scancel <jobid>'' | Cancel a job |
| ''sinfo'' | View partition and node status |
| ''sacct -j <jobid>'' | View job accounting info after completion |
| ''srun --pty bash'' | Start an interactive session |
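For example, to check how long a finished job ran and how much memory it actually used (''12345'' is a placeholder job ID):

```shell
# Elapsed time and peak memory (MaxRSS) of a completed job
sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS,State
```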
=== Recommended squeue format ===
The default ''squeue'' output is hard to read. We recommend this format:
<code bash>
squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"
</code>
Example output:
<code>
JOBID  PARTITION  NAME        USER    ST  TIME      NODES  NODELIST(REASON)  CPU  MIN_MEMORY
2313   compute    computel    anahit  R   1:53:18   1      thin-01           20   35G
2293   compute    kneaddata   nelli   R   11:12:15  1      thin-01           20   30G
2299   compute    glasso_j1   davith  R   11:12:15  1      thin-01           8    60G
2282   compute    run_som.sh  melina  R   11:12:16  1      thin-01           8    50G
2309   compute    plot_cover  mherk   PD  0:00      1      (Resources)       1    0
2121   thick      pilon       nate    PD  0:00      1      (Nodes requi..    4    512G
</code>
You can add this as an alias in your ''~/.bashrc'' for convenience:
<code bash>
alias sq='squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"'
</code>
For a full guide, see **[[software:slurm|Using Slurm]]**.
===== Environment and Software =====
All commonly used bioinformatics tools are **installed globally** on the cluster. There is no module system -- tools are available directly by name:
<code bash>
# Check whether a tool is available
which bwa
bwa --version
which samtools
samtools --version
</code>
See **[[software:start|Software]]** for a list of available tools.
If you need software that is not installed globally, you can install it locally using **[[software:conda|Conda]]**.
> **Important:** When using Conda, do **not** let it add itself to your ''~/.bashrc''. This slows down every login for you and can cause issues on login nodes. Instead, activate Conda manually when you need it. See the [[software:conda|Conda Guide]] for details.
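One way to keep logins fast is a small helper function in your ''~/.bashrc'' that initializes Conda only when you ask for it. This is a sketch: ''~/miniconda3'' is an assumed install location, so adjust the path to your setup.

```shell
# Initialize Conda on demand instead of at every login.
# The install path ~/miniconda3 is an assumption -- adjust as needed.
conda_on() {
    source "$HOME/miniconda3/etc/profile.d/conda.sh"
    conda activate "${1:-base}"
}
```

Run ''conda_on myenv'' when you need an environment; until then, nothing Conda-related executes at login.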
===== Network & Connectivity =====
* The cluster is accessible via SSH at **''ssh.abi.am''** from anywhere on the internet.
* No VPN is required for regular access.
* **Project leaders** may be required by IT to set up **two-factor authentication (2FA)** on SSH. IT will inform you if this applies to you.
* For slow connection troubleshooting, see the [[support:#troubleshooting|Support]] page.
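To save typing, you can add a host alias to the ''~/.ssh/config'' on your own machine (the username is a placeholder):

```
Host abi
    HostName ssh.abi.am
    User your_username
```

After that, ''ssh abi'' is enough to connect.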
===== Next Steps =====
* **[[software:slurm|Slurm Guide]]** -- Full job submission reference with examples.
* **[[software:start|Available Software]]** -- What tools are installed and how to use them.
* **[[pipelines:start|Pipelines]]** -- Ready-to-use bioinformatics workflows.
* **[[databases:start|Databases & Reference Data]]** -- Reference genomes available on the server.