====== Cluster Basics ======
This page describes ABI's computing infrastructure at a level suitable for researchers. For detailed system administration documentation, see [[infra:start|Infrastructure]].
===== What is an HPC Cluster? =====
A High-Performance Computing (HPC) cluster is a collection of interconnected computers (called **nodes**) that work together to run computationally intensive tasks. Instead of running everything on your laptop, you submit **jobs** to the cluster, which distributes them across available resources.
Key concepts:
^ Term ^ Meaning ^
| **Node** | A single server/computer in the cluster |
| **Login node** | The server you SSH into. Used for file management and job submission -- **not** for heavy computation |
| **Compute node** | Servers dedicated to running jobs. Jobs are dispatched here by Slurm |
| **Partition** | A group of nodes with shared properties (e.g., memory size, GPU availability). Also called a "queue" |
| **Job** | A task you submit to run on a compute node |
| **Slurm** | The job scheduler that manages the queue and assigns resources |
===== ABI Cluster Overview =====
^ Component ^ Details ^
| Login node(s) | ''ssh.abi.am'' (resolves to VMs ''ssh-01'' and ''ssh-02'') |
| Compute nodes | ''thin-01'' (64C/384G), ''thin-02'' (64C/384G), ''thick-01'' (64C/768G) |
| Download nodes | ''dl-01'' (2C/8G), ''dl-02'' (2C/8G) |
| Total compute vCPUs | 192 |
| Total compute RAM | 1536G |
| Scheduler | Slurm (controller runs on a separate VM) |
| Virtualization | All nodes are bhyve VMs running on a FreeBSD physical host |
===== Partitions =====
Partitions define groups of compute resources. When you submit a job, you can specify which partition to use.
^ Partition ^ Nodes ^ CPUs ^ Total Memory ^ Default? ^ Purpose ^
| ''compute'' | thin-01, thin-02, thick-01 | 64 per node | 384G-768G | Yes | General purpose computation (default partition) |
| ''thin'' | thin-01, thin-02 | 64 per node | ~384G each | No | Jobs that fit in standard memory |
| ''thick'' | thick-01 | 64 | ~768G | No | Memory-intensive jobs (e.g., large genome assembly, pilon) |
| ''download'' | dl-01, dl-02 | 2 per node | ~8G each | No | Data download tasks only (not for computation) |
**Notes:**
* The ''compute'' partition is the **default**. If you do not specify ''%%--partition%%'', your job goes here.
* Use ''thick'' explicitly when you need more than ~384G of RAM (e.g., ''%%--partition=thick --mem=512G%%'').
* Use ''download'' only for downloading data (e.g., SRA downloads). These nodes have minimal CPU and memory.
* Nodes may appear in multiple partitions (e.g., ''thick-01'' is in both ''compute'' and ''thick'').
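For example, a memory-hungry job can be steered to the high-memory node at submission time. This is an illustration; the script name and resource figures are placeholders to adapt:

```shell
# Hypothetical submission: request the thick partition and 512G of RAM
sbatch --partition=thick --mem=512G --cpus-per-task=16 assembly_job.sh
```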
To see current partition and node status:
<code bash>
sinfo
</code>
For a detailed view including memory and CPU allocation:
<code bash>
sinfo -N -o "%.10N %.10P %.5a %.4c %.20m %.20F %.10e"
</code>
Current cluster state (for reference):
<code>
NODELIST  PARTITION      CPUS  MEMORY  PURPOSE
dl-01     download       2     ~8G     Data downloads only
dl-02     download       2     ~8G     Data downloads only
thick-01  compute/thick  64    ~768G   High-memory computation
thin-01   compute/thin   64    ~384G   General computation
thin-02   compute/thin   64    ~384G   General computation
</code>
===== Storage =====
ABI has several storage areas. Understanding them helps you organize your work and avoid quota or performance problems.
Storage is served from **two ZFS-based NAS servers** over NFS. ZFS provides **transparent compression**, so you do not need to manually compress old files -- the filesystem handles it automatically. Home directories and selected projects are backed up to a separate server using ZFS send/recv.
^ Path ^ Purpose ^ Served from ^ Quota ^ Notes ^
| ''/mnt/home/'' | Home directory -- configs, scripts | mustafar (nas1) | ~12G per user | Keep this small; use project/user dirs for data |
| ''/mnt/nas0/user/'' | Personal user workspace | geonosis (nas0) | ~100G per user | For personal datasets, experiments, conda envs |
| ''/mnt/nas0/proj/'' | Project data (some projects) | geonosis (nas0) | Per-project | //TODO: clarify which projects are on nas0 vs nas1// |
| ''/mnt/nas1/proj/'' | Project data (most projects) | mustafar (nas1) | Per-project (typically 14-25 TB) | Shared with all project members |
| ''/mnt/nas1/db/'' | Shared databases and reference genomes | mustafar (nas1) | ~32 TB total | Read-only for users. See [[databases:start|Databases]] |
**Example current usage:**
<code>
/mnt/home/              ~12G quota   (personal configs, scripts)
/mnt/nas0/user/         ~100G quota  (personal workspace)
/mnt/nas1/proj/armwgs   ~25 TB       (Armenian WGS project)
/mnt/nas1/proj/cfdna    ~14 TB       (cfDNA project)
/mnt/nas1/db/           ~32 TB       (reference genomes, indexes)
</code>
=== Best practices ===
* **Do not store large data in your home directory.** Home has a ~12G quota. Use ''/mnt/nas0/user/'' for personal data or ''/mnt/nas1/proj/'' for project data.
* **Do not run jobs from your home directory** if they produce many output files. Use project space.
* **You do not need to compress old files.** The storage uses ZFS with transparent compression -- it is handled automatically at the filesystem level.
* **Clean up** temporary and intermediate files you no longer need to free up quota for others.
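To see where your quota is going, the standard ''du'' and ''df'' tools work on all mounts. The paths below assume your personal workspace lives under ''/mnt/nas0/user/'':

```shell
# Total size of your personal workspace (adjust the path to your own area)
du -sh /mnt/nas0/user/$USER

# Used and free space on the storage mounts
df -h /mnt/nas0 /mnt/nas1
```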
===== How Jobs Work =====
<code>
You (laptop) --SSH--> Login Node --sbatch--> Slurm Scheduler --> Compute Node(s)
</code>
- You connect to the **login node** via SSH.
- You write a job script and submit it with ''sbatch''.
- **Slurm** puts your job in the queue.
- When resources are available, Slurm starts your job on a **compute node**.
- Output is written to a log file you specified.
**Important rules:**
* **Do not run heavy computation on the login node.** It is shared by all users for file management and job submission.
* Always request the resources you need (CPU, memory, time) in your Slurm script.
* If you need an interactive session (e.g., for debugging), use ''srun'' or ''salloc'' (see [[software:slurm#interactive_sessions|Interactive Sessions]]).
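Putting the steps together, a minimal batch script looks like the following sketch (job name, resource requests, and log path are placeholders to adapt):

```shell
#!/bin/bash
#SBATCH --job-name=example        # name shown in squeue
#SBATCH --partition=compute       # default partition; stated here for clarity
#SBATCH --cpus-per-task=8         # CPUs to reserve
#SBATCH --mem=32G                 # memory to reserve
#SBATCH --time=12:00:00           # wall-time limit
#SBATCH --output=example_%j.log   # %j expands to the job ID

# The commands below run on the assigned compute node
echo "Running on $(hostname) with $SLURM_CPUS_PER_TASK CPUs"
```

Submit it with ''sbatch example.sh'', then track it with ''squeue --me''.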
===== Quick Slurm Commands =====
^ Command ^ Purpose ^
| ''sbatch script.sh'' | Submit a batch job |
| ''squeue'' | View all jobs in the queue (see recommended format below) |
| ''squeue --me'' | View only your jobs |
| ''scancel <jobid>'' | Cancel a job |
| ''sinfo'' | View partition and node status |
| ''sacct -j <jobid>'' | View job accounting info after completion |
| ''srun --pty bash'' | Start an interactive session |
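For example, to check how long a finished job ran and how much memory it actually used (''12345'' is a placeholder job ID):

```shell
# Elapsed time and peak memory (MaxRSS) of a completed job
sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS,State
```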
=== Recommended squeue format ===
The default ''squeue'' output is hard to read. We recommend this format:
<code bash>
squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"
</code>
Example output:
<code>
JOBID  PARTITION  NAME        USER    ST  TIME      NODES  NODELIST(REASON)  CPU  MIN_MEMORY
2313   compute    computel    anahit  R   1:53:18   1      thin-01           20   35G
2293   compute    kneaddata   nelli   R   11:12:15  1      thin-01           20   30G
2299   compute    glasso_j1   davith  R   11:12:15  1      thin-01           8    60G
2282   compute    run_som.sh  melina  R   11:12:16  1      thin-01           8    50G
2309   compute    plot_cover  mherk   PD  0:00      1      (Resources)       1    0
2121   thick      pilon       nate    PD  0:00      1      (Nodes requi..    4    512G
</code>
You can add this as an alias in your ''~/.bashrc'' for convenience:
<code bash>
alias sq='squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"'
</code>
For a full guide, see **[[software:slurm|Using Slurm]]**.
===== Environment and Software =====
All commonly used bioinformatics tools are **installed globally** on the cluster. There is no module system -- tools are available directly by name:
<code bash>
# Check whether a tool is available
which bwa
bwa --version
which samtools
samtools --version
</code>
See **[[software:start|Software]]** for a list of available tools.
If you need software that is not installed globally, you can install it locally using **[[software:conda|Conda]]**.
> **Important:** When using Conda, do **not** let it add itself to your ''~/.bashrc''. This slows down every login for you and can cause issues on login nodes. Instead, activate Conda manually when you need it. See the [[software:conda|Conda Guide]] for details.
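One way to keep logins fast is a small helper function in your ''~/.bashrc'' that initializes Conda only when you ask for it. This is a sketch: ''~/miniconda3'' is an assumed install location, so adjust the path to your setup.

```shell
# Initialize Conda on demand instead of at every login.
# The install path ~/miniconda3 is an assumption -- adjust as needed.
conda_on() {
    source "$HOME/miniconda3/etc/profile.d/conda.sh"
    conda activate "${1:-base}"
}
```

Run ''conda_on myenv'' when you need an environment; until then, nothing Conda-related executes at login.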
===== Network & Connectivity =====
* The cluster is accessible via SSH at **''ssh.abi.am''** from anywhere on the internet.
* No VPN is required for regular access.
* **Project leaders** may be required by IT to set up **two-factor authentication (2FA)** on SSH. IT will inform you if this applies to you.
* For slow connection troubleshooting, see the [[support:#troubleshooting|Support]] page.
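To save typing, you can add a host alias to the ''~/.ssh/config'' on your own machine (the username is a placeholder):

```
Host abi
    HostName ssh.abi.am
    User your_username
```

After that, ''ssh abi'' is enough to connect.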
===== Next Steps =====
* **[[software:slurm|Slurm Guide]]** -- Full job submission reference with examples.
* **[[software:start|Available Software]]** -- What tools are installed and how to use them.
* **[[pipelines:start|Pipelines]]** -- Ready-to-use bioinformatics workflows.
* **[[databases:start|Databases & Reference Data]]** -- Reference genomes available on the server.