The document discusses biological databases, which store large datasets of biological data and make them available. It describes the aims of such databases: to store and communicate this data and to make it available to scientists in a computer-readable format. The document classifies databases as primary, composite, or secondary. It provides details on the formats, availability, terminology, and examples of primary nucleotide sequence databases such as GenBank, EMBL, and DDBJ. Derived databases and tools for sequence retrieval are also summarized.
3. Aims:
• The need to store and communicate large datasets has grown.
• To make biological data available to scientists.
• To make biological data available in computer-readable form.
• To enhance availability.
7. Availability:
• Publicly available, no restrictions
• Available, but with copyright
• Accessible, but not downloadable
• Academic, but not freely available
• Proprietary, commercial; possibly free for academics
8. Terminology:
• LOCUS
– size of sequence (in base pairs)
– nature of molecule (e.g. DNA or RNA)
– topology (linear or circular)
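• DEFINITION: brief description of the gene
• ACCESSION: unique identifier for the record in this (and some other) databases
• VERSION: accession number with a version suffix that increases when the sequence is revised; may also list a synonymous ID (the GI number)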
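As a minimal sketch of how the LOCUS fields above can be read by a program, the function below splits a nucleotide LOCUS line on whitespace and picks out the size, molecule type, and topology. The record name `EXAMPLE01` and the whole example line are hypothetical, and the names `Locus` and `parse_locus` are introduced here for illustration only. Real flat files are column-oriented, so a production parser would likely be stricter.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str
    length_bp: int
    molecule: str
    topology: str


def parse_locus(line: str) -> Locus:
    """Split a nucleotide LOCUS line into the fields listed above."""
    tokens = line.split()
    if len(tokens) < 5 or tokens[0] != "LOCUS" or "bp" not in tokens:
        raise ValueError("not a nucleotide LOCUS line")
    bp = tokens.index("bp")      # the sequence length sits just before the "bp" unit
    after = tokens[bp + 1:]      # molecule type, then (usually) topology
    molecule = after[0] if after else "unknown"
    topology = "unknown"
    if len(after) > 1 and after[1] in ("linear", "circular"):
        topology = after[1]
    return Locus(tokens[1], int(tokens[bp - 1]), molecule, topology)


# Hypothetical line for illustration only, not a real database entry.
example = "LOCUS       EXAMPLE01               5028 bp    DNA     linear   PLN 01-JAN-2000"
print(parse_locus(example))
```

Splitting on whitespace keeps the sketch short; it assumes the record name contains no spaces and that the length is given in base pairs.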
9. Terminology:
• KEYWORDS: list of terms related to the entry; can be used for keyword searching for related data
• SOURCE: common name of the relevant organism
• ORGANISM: complete scientific name, with taxonomic classification
10. Terminology:
• REFERENCE: credits the author(s) who initially determined the sequence; includes subsections:
– AUTHORS
– TITLE
– JOURNAL
– PUBMED
• COMMENT: free-formatted text that doesn’t fit in another category
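The keywords described on the terminology slides can be collected into a dictionary with a few lines of code. The sketch below assumes the usual flat-file layout, in which the keyword occupies the first 12 columns and continuation lines leave those columns blank. The record in `SAMPLE` is entirely hypothetical (made-up accession, authors, and title), and `parse_header` is a name introduced here, not part of any library.

```python
from typing import Dict, List

# Hypothetical record for illustration only, not a real database entry.
SAMPLE = """\
LOCUS       EXAMPLE01               5028 bp    DNA     linear   PLN 01-JAN-2000
DEFINITION  Hypothetical example record.
ACCESSION   EXAMPLE01
VERSION     EXAMPLE01.1
KEYWORDS    .
SOURCE      baker's yeast
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Doe,J. and Roe,R.
  TITLE     A placeholder title that
            wraps onto a second line
  JOURNAL   Unpublished
"""


def parse_header(text: str) -> Dict[str, List[str]]:
    """Map each keyword (or sub-keyword) to the list of values seen for it."""
    fields: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        key = line[:12].strip()      # keyword columns
        value = line[12:].strip()    # everything after them
        if key:
            current = key
            fields.setdefault(current, []).append(value)
        elif current is not None:
            # Blank keyword columns mean the previous value continues.
            fields[current][-1] += " " + value
    return fields


fields = parse_header(SAMPLE)
print(fields["ACCESSION"], fields["TITLE"])
```

A keyword maps to a list because entries such as REFERENCE can occur more than once in a record. This sketch does not group AUTHORS, TITLE, and JOURNAL under their parent REFERENCE.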
12. GenBank
• An annotated collection of all publicly available nucleotide and protein sequences
• Set up in 1979 at LANL (Los Alamos).
• Maintained since 1992 by NCBI (Bethesda).
• http://www.ncbi.nlm.nih.gov
14. EMBL Nucleotide Sequence Database
• An annotated collection of all publicly available nucleotide and protein sequences
• Created in 1980 at the European Molecular Biology Laboratory in Heidelberg.
• Maintained since 1994 by EBI (Cambridge).
• http://www.ebi.ac.uk/embl.html
16. DDBJ: DNA Data Bank of Japan
• An annotated collection of all publicly available nucleotide and protein sequences
• Started in 1984 at the National Institute of Genetics (NIG) in Mishima.
• Still maintained at this institute by a team led by Takashi Gojobori.
• http://www.ddbj.nig.ac.jp
18. Derived databases
• CUTG: Codon usage tabulated from GenBank (http://www.kazusa.or.jp/codon/)
• Genetic Codes: Deviations from the standard genetic code in various organisms and organelles (http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c)
• TIGR Gene Indices: Organism-specific databases of EST and gene sequences (http://www.tigr.org/tdb/tgi.shtml)
• UniGene: Unified clusters of ESTs and full-length mRNA sequences (http://www.ncbi.nlm.nih.gov/UniGene/)
• ASAP: Alternatively spliced isoforms (http://www.bioinformatics.ucla.edu/ASAP)
• Intronerator: Introns and alternative splicing in C. elegans and C. briggsae (http://www.cse.ucsc.edu/~kent/intronerator/)
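To illustrate what "derived" means for a database such as CUTG, the sketch below tabulates codon usage from a single coding sequence. It is an illustration of the idea, not CUTG's actual pipeline. The function names and the toy sequence are made up here, and reporting frequencies per thousand codons is an assumption about a convenient output unit.

```python
from collections import Counter
from typing import Dict


def codon_usage(cds: str) -> Counter:
    """Count codons in a coding sequence, reading in frame from position 0."""
    seq = cds.upper().replace("U", "T")   # accept RNA as well as DNA
    usable = len(seq) - len(seq) % 3      # ignore a trailing partial codon
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def per_thousand(counts: Counter) -> Dict[str, float]:
    """Convert raw codon counts to frequencies per 1000 codons."""
    total = sum(counts.values())
    return {codon: 1000 * n / total for codon, n in counts.items()}


# Toy sequence for illustration: Met, Ala, Ala, stop.
counts = codon_usage("ATGGCTGCTTAA")
print(counts)
print(per_thousand(counts))
```

A derived database would run this kind of tabulation over every annotated coding region of an organism in the primary database and store the summed counts.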
30. Sequence Retrieval Tools
• Various tools to get sequences of interest from databases
• Entrez at NCBI http://www.ncbi.nlm.nih.gov/Entrez
• SRS for EMBL and other DBs http://srs.ebi.ac.uk
• Fetch in the GCG package
• Seqret in EMBOSS
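Retrieval can also be scripted. The sketch below builds a request URL for NCBI's E-utilities `efetch` service, the programmatic counterpart of the Entrez web interface. The slides only mention the Entrez website, so the endpoint is an addition here. The accession `EXAMPLE01` is hypothetical, the helper names are made up, and `fetch_record` needs network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# E-utilities endpoint (an assumption added here; not given in the slides).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str, db: str = "nuccore", rettype: str = "gb") -> str:
    """Build an efetch URL asking for one record as plain text."""
    query = urlencode({
        "db": db,             # nucleotide collection
        "id": accession,
        "rettype": rettype,   # "gb" for the flat file, "fasta" for sequence only
        "retmode": "text",
    })
    return f"{EFETCH}?{query}"


def fetch_record(accession: str, rettype: str = "gb") -> str:
    """Download the record; requires network access."""
    with urlopen(efetch_url(accession, rettype=rettype), timeout=30) as response:
        return response.read().decode("utf-8")


# Hypothetical accession; replace with a real one before calling fetch_record.
print(efetch_url("EXAMPLE01"))
```

Keeping URL construction separate from the download makes the request easy to inspect before any traffic is sent.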