====== Cluster Basics ======

This page describes ABI's computing infrastructure at a level suitable for researchers. For detailed system administration documentation, see [[infra:start|Infrastructure]].

===== What is an HPC Cluster? =====

A High-Performance Computing (HPC) cluster is a collection of interconnected computers (called **nodes**) that work together to run computationally intensive tasks. Instead of running everything on your laptop, you submit **jobs** to the cluster, which distributes them across available resources.

Key concepts:

^ Term ^ Meaning ^
| **Node** | A single server/computer in the cluster |
| **Login node** | The server you SSH into. Used for file management and job submission -- **not** for heavy computation |
| **Compute node** | Servers dedicated to running jobs. Jobs are dispatched here by Slurm |
| **Partition** | A group of nodes with shared properties (e.g., memory size, GPU availability). Also called a "queue" |
| **Job** | A task you submit to run on a compute node |
| **Slurm** | The job scheduler that manages the queue and assigns resources |

===== ABI Cluster Overview =====

^ Component ^ Details ^
| Login node(s) | ''ssh.abi.am'' (resolves to VMs ''ssh-01'' and ''ssh-02'') |
| Compute nodes | ''thin-01'' (64C/384G), ''thin-02'' (64C/384G), ''thick-01'' (64C/768G) |
| Download nodes | ''dl-01'' (2C/8G), ''dl-02'' (2C/8G) |
| Total compute vCPUs | 192 |
| Total compute RAM | 1536G |
| Scheduler | Slurm (controller runs on a separate VM) |
| Virtualization | All nodes are bhyve VMs running on a FreeBSD physical host |

===== Partitions =====

Partitions define groups of compute resources. When you submit a job, you can specify which partition to use.

^ Partition ^ Nodes ^ CPUs ^ Total Memory ^ Default? ^ Purpose ^
| ''compute'' | thin-01, thin-02, thick-01 | 64 per node | 384G-768G | Yes | General-purpose computation (default partition) |
| ''thin'' | thin-01, thin-02 | 64 per node | ~384G each | No | Jobs that fit in standard memory |
| ''thick'' | thick-01 | 64 | ~768G | No | Memory-intensive jobs (e.g., large genome assembly, pilon) |
| ''download'' | dl-01, dl-02 | 2 per node | ~8G each | No | Data download tasks only (not for computation) |

**Notes:**

  * The ''compute'' partition is the **default**. If you do not specify ''%%--partition%%'', your job goes here.
  * Use ''thick'' explicitly when you need more than ~384G of RAM (e.g., ''%%--partition=thick --mem=512G%%'').
  * Use ''download'' only for downloading data (e.g., SRA downloads). These nodes have minimal CPU and memory.
  * Nodes may appear in multiple partitions (e.g., ''thick-01'' is in both ''compute'' and ''thick'').

To see current partition and node status:

<code bash>
sinfo
</code>

For a detailed view including memory and CPU allocation:

<code bash>
sinfo -N -o "%.10N %.10P %.5a %.4c %.20m %.20F %.10e"
</code>

Current cluster state (for reference):

<code>
NODELIST   PARTITION      CPUS  MEMORY  PURPOSE
dl-01      download       2     ~8G     Data downloads only
dl-02      download       2     ~8G     Data downloads only
thick-01   compute/thick  64    ~768G   High-memory computation
thin-01    compute/thin   64    ~384G   General computation
thin-02    compute/thin   64    ~384G   General computation
</code>

===== Storage =====

ABI has several storage areas. Understanding them is important for organizing your work and avoiding quota issues.

Storage is served from **two ZFS-based NAS servers** over NFS. ZFS provides **transparent compression**, so you do not need to manually compress old files -- the filesystem handles it automatically. Home directories and selected projects are backed up to a separate server using ZFS send/recv.
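To see how much of these areas you are currently using, standard tools are enough. A quick sketch, assuming GNU/BSD coreutils; the ''/mnt'' paths exist only on ABI cluster nodes:

<code bash>
# Total size of your home directory (the home quota is small, about 12G)
du -sh "$HOME"

# Free space on the NAS mounts (errors are silenced off-cluster)
df -h /mnt/nas0 /mnt/nas1 2>/dev/null || true
</code>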
^ Path ^ Purpose ^ Served from ^ Quota ^ Notes ^
| ''/mnt/home/<user>'' | Home directory -- configs, scripts | mustafar (nas1) | ~12G per user | Keep this small; use project/user dirs for data |
| ''/mnt/nas0/user/<user>'' | Personal user workspace | geonosis (nas0) | ~100G per user | For personal datasets, experiments, conda envs |
| ''/mnt/nas0/proj/<project>'' | Project data (some projects) | geonosis (nas0) | Per-project | *TODO: clarify which projects are on nas0 vs nas1* |
| ''/mnt/nas1/proj/<project>'' | Project data (most projects) | mustafar (nas1) | Per-project (typically 14-25 TB) | Shared with all project members |
| ''/mnt/nas1/db/'' | Shared databases and reference genomes | mustafar (nas1) | ~32 TB total | Read-only for users. See [[databases:start|Databases]] |

**Example current usage:**

<code>
/mnt/home/<user>       ~12G quota    (personal configs, scripts)
/mnt/nas0/user/<user>  ~100G quota   (personal workspace)
/mnt/nas1/proj/armwgs  ~25 TB        (Armenian WGS project)
/mnt/nas1/proj/cfdna   ~14 TB        (cfDNA project)
/mnt/nas1/db/          ~32 TB        (reference genomes, indexes)
</code>

=== Best practices ===

  * **Do not store large data in your home directory.** Home has a ~12G quota. Use ''/mnt/nas0/user/<user>'' for personal data or ''/mnt/nas1/proj/<project>'' for project data.
  * **Do not run jobs from your home directory** if they produce many output files. Use project space.
  * **You do not need to compress old files.** The storage uses ZFS with transparent compression -- it is handled automatically at the filesystem level.
  * **Clean up** temporary and intermediate files you no longer need to free up quota for others.

===== How Jobs Work =====

<code>
You (laptop) --SSH--> Login Node --sbatch--> Slurm Scheduler --> Compute Node(s)
</code>

  - You connect to the **login node** via SSH.
  - You write a job script and submit it with ''sbatch''.
  - **Slurm** puts your job in the queue.
  - When resources are available, Slurm starts your job on a **compute node**.
  - Output is written to a log file you specified.

**Important rules:**

  * **Do not run heavy computation on the login node.** It is shared by all users for file management and job submission.
  * Always request the resources you need (CPU, memory, time) in your Slurm script.
  * If you need an interactive session (e.g., for debugging), use ''srun'' or ''salloc'' (see [[software:slurm#interactive_sessions|Interactive Sessions]]).

===== Quick Slurm Commands =====

^ Command ^ Purpose ^
| ''sbatch script.sh'' | Submit a batch job |
| ''squeue'' | View all jobs in the queue (see recommended format below) |
| ''squeue --me'' | View only your jobs |
| ''scancel <jobid>'' | Cancel a job |
| ''sinfo'' | View partition and node status |
| ''sacct -j <jobid>'' | View job accounting info after completion |
| ''srun --pty bash'' | Start an interactive session |

=== Recommended squeue format ===

The default ''squeue'' output is hard to read. We recommend this format:

<code bash>
squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"
</code>

Example output:

<code>
 JOBID  PARTITION        NAME    USER  ST      TIME  NODES  NODELIST(REASON)  CPU  MIN_MEMORY
  2313    compute    computel  anahit   R   1:53:18      1           thin-01   20         35G
  2293    compute   kneaddata   nelli   R  11:12:15      1           thin-01   20         30G
  2299    compute   glasso_j1  davith   R  11:12:15      1           thin-01    8         60G
  2282    compute  run_som.sh  melina   R  11:12:16      1           thin-01    8         50G
  2309    compute  plot_cover   mherk  PD      0:00      1       (Resources)    1           0
  2121      thick       pilon    nate  PD      0:00      1    (Nodes requi..    4        512G
</code>

You can add this as an alias in your ''~/.bashrc'' for convenience:

<code bash>
alias sq='squeue -o "%.6i %.10P %.10j %.15u %.10t %.10M %.10D %.20R %.3C %.10m"'
</code>

For a full guide, see **[[software:slurm|Using Slurm]]**.

===== Environment and Software =====

All commonly used bioinformatics tools are **installed globally** on the cluster.
There is no module system -- tools are available directly by name:

<code bash>
# Check if a tool is available
which bwa
bwa 2>&1 | head -n 4    # bwa has no --version flag; its usage header shows the version
which samtools
samtools --version
</code>

See **[[software:start|Software]]** for a list of available tools.

If you need software that is not installed globally, you can install it locally using **[[software:conda|Conda]]**.

> **Important:** When using Conda, do **not** let it add itself to your ''~/.bashrc''. This slows down every login for you and can cause issues on login nodes. Instead, activate Conda manually when you need it. See the [[software:conda|Conda Guide]] for details.

===== Network & Connectivity =====

  * The cluster is accessible via SSH at **''ssh.abi.am''** from anywhere on the internet.
  * No VPN is required for regular access.
  * **Project leaders** may be required by IT to set up **two-factor authentication (2FA)** for SSH. IT will inform you if this applies to you.
  * For slow-connection troubleshooting, see the [[support#troubleshooting|Support]] page.

===== Next Steps =====

  * **[[software:slurm|Slurm Guide]]** -- Full job submission reference with examples.
  * **[[software:start|Available Software]]** -- What tools are installed and how to use them.
  * **[[pipelines:start|Pipelines]]** -- Ready-to-use bioinformatics workflows.
  * **[[databases:start|Databases & Reference Data]]** -- Reference genomes available on the server.
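As a starting point before diving into the Slurm guide, here is a minimal batch script tying the pieces on this page together. The job name, resource numbers, and the payload command are illustrative only; adapt them to your workload:

<code bash>
#!/bin/bash
#SBATCH --job-name=example        # name shown in squeue
#SBATCH --partition=compute       # default partition; use thick for >384G jobs
#SBATCH --cpus-per-task=4         # CPUs to reserve
#SBATCH --mem=16G                 # memory to reserve
#SBATCH --time=02:00:00           # wall-clock limit (HH:MM:SS)
#SBATCH --output=%x_%j.log        # log file: jobname_jobid.log

# Illustrative payload -- replace with your actual commands
echo "Running on $(hostname) with ${SLURM_CPUS_PER_TASK:-1} CPUs"
</code>

Save it as, say, ''example.sh'', submit it from project space with ''sbatch example.sh'', and watch it with ''squeue --me''.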