Databases & Reference Data

ABI maintains a collection of reference genomes, indexes, and shared databases for use in bioinformatics analyses. These are stored on shared storage and are read-only for regular users.

Location

All shared databases are located under:

/mnt/nas1/db/

This volume is served from the NAS (nas1:/znas1/abi/collections/db) and has approximately 32 TB of total space (~1.8 TB currently used).
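
To confirm that the volume is mounted on the machine you are working on, and to see current usage, a small check like the following may help. This is only a sketch: the function name check_db_mount is hypothetical, and the default path is the shared location given above.

```shell
# Hypothetical helper: report whether a database directory is reachable
# and how full the underlying volume is.
check_db_mount() {
    db_root="${1:-/mnt/nas1/db}"   # default: shared path from this page
    if [ ! -d "$db_root" ]; then
        echo "not available: $db_root" >&2
        return 1
    fi
    # -h prints human-readable sizes; -P keeps each filesystem on one line
    df -hP "$db_root"
}

# Example: check_db_mount            (uses the default path)
# Example: check_db_mount /some/other/mount
```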

Reference Genomes

Pre-built reference genomes and their indexes are stored at:

/mnt/nas1/db/genomes/

Available Genomes

TODO: Fill in the table below with the actual contents of /mnt/nas1/db/genomes/. Run ls /mnt/nas1/db/genomes/ to get the full list.

Organism	Assembly	Path	Indexes Available
Human	GRCh38.p14	`/mnt/nas1/db/genomes/homo_sapiens/GRCh38.p14/`	BWA, TODO: others?
TODO	TODO	TODO	TODO

BWA Indexes

BWA indexes for the human reference genome are located at:

/mnt/nas1/db/genomes/homo_sapiens/GRCh38.p14/bwa_mem_0.7.17-r1188/

This directory contains:

GCF_000001405.40_GRCh38.p14_genomic.fna – the reference FASTA
.amb, .ann, .bwt, .pac, .sa – BWA index files

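Important: BWA treats the FASTA path as the index prefix, so the index files must sit in the same directory as the FASTA and share its full file name (for example, GCF_000001405.40_GRCh38.p14_genomic.fna.bwt).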
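
Before launching a long alignment, it can be worth checking that the index set is complete. The sketch below assumes the index files are named after the FASTA (reference.fna.amb, reference.fna.ann, and so on), which is what bwa index produces by default. The function name check_bwa_index is hypothetical.

```shell
# Hypothetical helper: verify that all five BWA index files exist
# (and are non-empty) alongside a reference FASTA.
check_bwa_index() {
    ref="$1"
    missing=0
    for ext in amb ann bwt pac sa; do
        if [ ! -s "${ref}.${ext}" ]; then
            echo "missing or empty: ${ref}.${ext}" >&2
            missing=1
        fi
    done
    # Returns 0 when every index file is present, 1 otherwise
    return "$missing"
}

# Example call with the shared human reference (path from this page):
# check_bwa_index /mnt/nas1/db/genomes/homo_sapiens/GRCh38.p14/bwa_mem_0.7.17-r1188/GCF_000001405.40_GRCh38.p14_genomic.fna
```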

Usage example:

REF="/mnt/nas1/db/genomes/homo_sapiens/GRCh38.p14/bwa_mem_0.7.17-r1188/GCF_000001405.40_GRCh38.p14_genomic.fna"
# Align paired-end reads with 8 threads and write a coordinate-sorted BAM
bwa mem -t 8 "$REF" reads_1.fq.gz reads_2.fq.gz | samtools sort -o aligned.sorted.bam -

Building Your Own Index

If you need an index for a genome not listed above, you can build it yourself in a directory you have write access to (the shared collection is read-only):

# BWA index (writes .amb, .ann, .bwt, .pac, .sa next to the FASTA)
bwa index reference.fasta

# FASTA index (.fai), used by samtools and many downstream tools for region lookups
samtools faidx reference.fasta

# STAR index (for RNA-seq)
STAR --runMode genomeGenerate --genomeDir star_index/ --genomeFastaFiles reference.fasta --sjdbGTFfile annotations.gtf

Request addition: If you think a genome or index should be added to the shared collection, email it-support@abi.am with:

Organism and assembly version
Download source (e.g., NCBI, Ensembl, UCSC)
Which indexes you need (BWA, Bowtie2, STAR, etc.)

Other Shared Databases

TODO: List any other shared databases available at ABI. Examples might include:

Database	Description	Path
TODO: e.g., BLAST NT/NR	NCBI nucleotide/protein databases	TODO: /mnt/nas1/db/blast/
TODO: e.g., Kraken2 DB	Taxonomic classification database	TODO: /mnt/nas1/db/kraken2/
TODO: e.g., dbSNP	Known human variants	TODO
TODO	Add more as needed	TODO

Directory Structure

TODO: Run ls -la /mnt/nas1/db/ on the server and paste the top-level structure here. Example:

/mnt/nas1/db/
  genomes/
    homo_sapiens/
      GRCh38.p14/
        bwa_mem_0.7.17-r1188/
    TODO: other organisms
  TODO: other database directories

Best Practices

Do not copy reference data to your home or project directory. Use the shared paths directly to save disk space.
Always use absolute paths to reference data in your scripts, so they work from any directory.
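Check the version of the reference genome and index before starting an analysis. Mixing versions (for example, aligning against one assembly and annotating with another) produces mismatched coordinates or outright tool errors.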
Document which reference you used in your project notes for reproducibility.
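
One possible way to document the reference is to append its path and checksum to a notes file at the start of each analysis. This is a minimal sketch, assuming GNU coreutils (sha256sum) is available. The function name record_reference and the default file name reference_used.txt are hypothetical, not an ABI convention.

```shell
# Hypothetical helper: append the reference path, its SHA-256 checksum,
# and a UTC timestamp to a notes file.
record_reference() {
    ref_path="$1"
    notes_file="${2:-reference_used.txt}"   # illustrative default name
    if [ ! -r "$ref_path" ]; then
        echo "cannot read: $ref_path" >&2
        return 1
    fi
    {
        echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
        echo "reference: $ref_path"
        echo "sha256: $(sha256sum "$ref_path" | cut -d' ' -f1)"
    } >> "$notes_file"
}

# Example: record_reference "$REF" project_notes.txt
```

Checksumming a full human genome FASTA takes a little while, so it is probably best done once per project rather than once per sample.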
