====== Databases & Reference Data ======

ABI maintains a collection of reference genomes, indexes, and shared databases for bioinformatics analyses. They are kept on shared storage and are **read-only** for regular users.

===== Location =====

All shared databases are located under:

<code>
/mnt/nas1/db/
</code>

This volume is served from the NAS (''nas1:/znas1/abi/collections/db'') and has approximately 32 TB of total space (~1.8 TB currently used).
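
Before starting a long job, it can help to confirm that the volume is mounted and readable. The following is a minimal sketch, not part of any ABI tooling; the function name ''check_db_root'' is hypothetical.

<code bash>
# Hypothetical helper: verify that a database root exists and is readable,
# then report the size and usage of the filesystem that backs it.
check_db_root() {
    local root="$1"
    if [ ! -d "$root" ] || [ ! -r "$root" ]; then
        echo "ERROR: $root is missing or not readable (is the NAS mounted?)" >&2
        return 1
    fi
    df -h "$root"
}
</code>

A typical call would be ''check_db_root /mnt/nas1/db/''. The function returns a non-zero status when the directory is unavailable, so a job script can stop early.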

===== Reference Genomes =====

Pre-built reference genomes and their indexes are stored at:

<code>
/mnt/nas1/db/genomes/
</code>

=== Available Genomes ===

> TODO: Fill in the table below with the actual contents of /mnt/nas1/db/genomes/. Run ''ls /mnt/nas1/db/genomes/'' to get the full list.

^ Organism ^ Assembly ^ Path ^ Indexes Available ^
| Human | GRCh38.p14 | ''/mnt/nas1/db/genomes/homo_sapiens/GRCh38.p14/'' | BWA, //TODO: others?// |
| //TODO// | //TODO// | //TODO// | //TODO// |
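
To see which assemblies are present, the genome directory can be listed directly. The sketch below assumes the ''<organism>/<assembly>/'' layout seen in the human entry above; the function name ''list_assemblies'' is hypothetical.

<code bash>
# Hypothetical helper: print "organism<TAB>assembly" for every
# <root>/<organism>/<assembly>/ directory under the given root.
list_assemblies() {
    local root="${1%/}"
    local dir
    for dir in "$root"/*/*/; do
        [ -d "$dir" ] || continue      # skip when the glob matches nothing
        dir="${dir%/}"
        printf '%s\t%s\n' "$(basename "$(dirname "$dir")")" "$(basename "$dir")"
    done
}
</code>

For example, ''list_assemblies /mnt/nas1/db/genomes/'' prints one line per organism and assembly pair.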

=== BWA Indexes ===

BWA indexes for the human reference genome are located at:

<code>
/mnt/nas1/db/genomes/homo_sapiens/GRCh38.p14/bwa_mem_0.7.17-r1188/
</code>

This directory contains:
  * ''GCF_000001405.40_GRCh38.p14_genomic.fna'' -- the reference FASTA
  * ''.amb'', ''.ann'', ''.bwt'', ''.pac'', ''.sa'' -- BWA index files

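**Important:** Pass the full path of the FASTA file (''.fna''/''.fa'') as the reference. BWA uses that path as the prefix for locating its index files, so the index files must stay in the same directory as the FASTA and keep its base name.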

Usage example:

<code bash>
REF="/mnt/nas1/db/genomes/homo_sapiens/GRCh38.p14/bwa_mem_0.7.17-r1188/GCF_000001405.40_GRCh38.p14_genomic.fna"
# Align paired-end reads with 8 threads and write a coordinate-sorted BAM
bwa mem -t 8 "$REF" reads_1.fq.gz reads_2.fq.gz | samtools sort -o aligned.sorted.bam -
</code>
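
BWA finds its index by appending the extensions listed above to the reference path, so a quick pre-flight check can catch a wrong path before an alignment job is queued. This is a minimal sketch; the function name ''check_bwa_index'' is hypothetical.

<code bash>
# Hypothetical helper: confirm that every BWA index file exists for a reference.
# $1 is the FASTA path, which BWA also uses as the index prefix.
check_bwa_index() {
    local ref="$1" ext missing=0
    for ext in amb ann bwt pac sa; do
        if [ ! -s "$ref.$ext" ]; then
            echo "Missing or empty index file: $ref.$ext" >&2
            missing=1
        fi
    done
    return "$missing"
}
</code>

One way to use it is ''check_bwa_index "$REF" && bwa mem -t 8 "$REF" reads_1.fq.gz reads_2.fq.gz'', which runs the alignment only when all five index files are present and non-empty.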

=== Building Your Own Index ===

If you need an index for a genome not listed above, you can build it yourself:

<code bash>
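# BWA index (writes reference.fasta.amb, .ann, .bwt, .pac, .sa next to the FASTA)
bwa index reference.fasta

# FASTA index (writes reference.fasta.fai; used by samtools and many variant callers)
samtools faidx reference.fasta

# STAR index (for RNA-seq); create the output directory first
mkdir -p star_index
STAR --runMode genomeGenerate --genomeDir star_index/ --genomeFastaFiles reference.fasta --sjdbGTFfile annotations.gtf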
</code>
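
The shared BWA index directory is named after the tool and the version that built it (''bwa_mem_0.7.17-r1188''). A personal index can follow the same convention. The sketch below assumes that ''bwa'' prints a ''Version:'' line in its usage message, as the 0.7.x releases do; the function name ''bwa_index_dir_name'' is hypothetical.

<code bash>
# Hypothetical helper: read bwa's usage text on stdin and print a directory
# name in the style of the shared collection, e.g. bwa_mem_0.7.17-r1188.
bwa_index_dir_name() {
    local version
    version="$(awk '$1 == "Version:" { print $2; exit }')"
    if [ -z "$version" ]; then
        echo "ERROR: no 'Version:' line found in input" >&2
        return 1
    fi
    printf 'bwa_mem_%s\n' "$version"
}

# Example (commented out so that the snippet has no side effects):
#   outdir="$(bwa 2>&1 | bwa_index_dir_name)" && mkdir -p "$outdir"
#   bwa index -p "$outdir/reference.fasta" reference.fasta
</code>

With ''-p'', ''bwa index'' writes only the index files under the given prefix. Copy or symlink the FASTA into the same directory if downstream tools expect to find it there.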

**Request addition:** If you think a genome or index should be added to the shared collection, email **[[mailto:it-support@abi.am|it-support@abi.am]]** with:
  * Organism and assembly version
  * Download source (e.g., NCBI, Ensembl, UCSC)
  * Which indexes you need (BWA, Bowtie2, STAR, etc.)

----

===== Other Shared Databases =====

> TODO: List any other shared databases available at ABI. Examples might include:

^ Database ^ Description ^ Path ^
| //TODO: e.g., BLAST NT/NR// | //NCBI nucleotide/protein databases// | //TODO:// ''/mnt/nas1/db/blast/'' |
| //TODO: e.g., Kraken2 DB// | //Taxonomic classification database// | //TODO:// ''/mnt/nas1/db/kraken2/'' |
| //TODO: e.g., dbSNP// | //Known human variants// | //TODO// |
| //TODO// | //Add more as needed// | //TODO// |

----

===== Directory Structure =====

> TODO: Run ''ls -la /mnt/nas1/db/'' on the server and paste the top-level structure here. Example:

<code>
/mnt/nas1/db/
  genomes/
    homo_sapiens/
      GRCh38.p14/
        bwa_mem_0.7.17-r1188/
    TODO: other organisms
  TODO: other database directories
</code>

----

===== Best Practices =====

  * **Do not copy reference data to your home or project directory.** Use the shared paths directly to save disk space.
  * **Always use absolute paths** to reference data in your scripts, so they work from any directory.
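  * **Check the version** of the reference genome and index before starting an analysis. Mixing versions between steps (for example, aligning against one assembly and calling variants against another) leads to mismatched contig names or coordinates and to tool errors.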
  * **Document which reference you used** in your project notes for reproducibility.
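
The last two points can be combined in a small logging step at the top of an analysis script. This is a minimal sketch under stated assumptions: the function name ''record_reference'' and the tab-separated log format are hypothetical, not an ABI convention.

<code bash>
# Hypothetical helper: append a timestamp, the reference path, and its size in
# bytes to a notes file. Relative paths are rejected.
record_reference() {
    local ref="$1" notes="$2"
    case "$ref" in
        /*) ;;
        *) echo "ERROR: use an absolute path: $ref" >&2; return 1 ;;
    esac
    [ -r "$ref" ] || { echo "ERROR: cannot read $ref" >&2; return 1; }
    printf '%s\treference=%s\tbytes=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$ref" "$(wc -c < "$ref" | tr -d ' ')" >> "$notes"
}
</code>

For example, ''record_reference "$REF" project_notes.tsv'' adds one line per run. The file size is only a cheap sanity check; a checksum would be a stronger record if the extra read time is acceptable.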

----

===== See Also =====

  * [[pipelines:start|Pipelines]] -- workflows that use these reference data
  * [[scripts:alignment|Alignment Scripts]] -- examples using BWA with the shared references
  * [[software:start|Software]] -- tools for working with genomic data