Disentangling the origin of chemical differences using GHOST
Gen bank (genetic sequence databank)
1. GenBank (Genetic Sequence Databank)
Introduction:
GenBank® is the genetic sequence database at the National Center for
Biotechnology Information (NCBI).
It was established in the year 1982 and now maintained by the National Center
for Biotechnology (NCBI).
DNA sequences can be submitted to GenBank using several different methods.
It contains publicly available nucleotide sequences for more than 240 000 named
organisms, obtained primarily through submissions from individual laboratories
and batch submissions from large-scale sequencing projects.
It has a flat file structure that is an ASCII text file, readable & downloadable by
both humans and computers.
There are two main ways of making batch sequence submissions to GenBank:
NCBI’s Barcode Submission Tool (BarSTool) and Sequin.
2. Entry data contains information on:
1. The sequence;
2. Accession numbers;
3. The scientific and gene names;
4. Taxonomy/phylogenetic classification of the source organism;
5. A feature that identifies coding regions;
6. References to published literature;
7. Transcription units
8. Mutation sites.
GenBank flat file Format
1. The LOCUS field: It consists of five different subfields, namely:
1a Locus Name (e.g. HSHFE) - It is a tag for grouping similar sequences.
The first two or three letters usually designate the organism.
3. In this case HS stands for Homo sapiens. The last several characters are associated with another
group designation, such as gene product. In this example, the last three digits represent the gene
symbol, HFE.
1b Sequence Length (12146 bp) – It is the total number of nucleotide base pairs (or amino acid
residues) in the sequence record.
1c Molecule Type (e.g. DNA) - Type of molecule that was sequenced.
1d GenBank Division (PRI) - GenBank has different divisions.
In this example, PRI stands for primate sequences.
Other divisions include ROD (rodent sequences), MAM (other mammal sequences), PLN (plant,
fungal, and algal sequences), & BCT (bacterial sequences).
2. 1e Modification Date (23-July-1999) - Date of most recent modification made to the record.
DEFINITION: – It is a brief description of the sequence.
The description may include source organism name, gene or protein name, or designation as
untranscribed or untranslated sequences (e.g., a promoter region).
For sequences containing a coding region (CDS), the definition field may also contain a
“completeness” qualifier such as "complete CDS" or "exon 1."
3. ACCESSION (Z92910): – It is a unique identifier assigned to a complete sequence record.
This number never changes, even if the record is modified.
4. VERSION (Z92910.1) – It is an identification number assigned to a single, specific sequence in
the database.
This number is in the format “accession.version.”
If any changes are made to the sequence data, the version part of the number will increase by one.
E.g. U12345.1 becomes U12345.2.
5. Gene Identifier (GI) (1890179) - Also a sequence identification number.
Whenever a sequence is changed, the version number is increased and a new GI is assigned.
6. KEYWORDS (haemochromatosis; HFE gene) – A “keyword” can be “any word or phrase used
to describe the sequence”.
7. SOURCE (human) - Usually contains an abbreviated or common name of the source organism.
4. 8. ORGANISM (Homo sapiens) - The scientific name (usually genus & species)
9. REFERENCE – It is a citation of publications by sequence authors that supports information
presented in the sequence record.
Several references may be included in one record.
References are automatically sorted from the oldest to the newest.
Cited publications are searchable by author, article or publication title, journal title, or MEDLINE
unique identifier (UID).
10. . The FEATURES Table:
5. 11. BASE COUNT & ORIGIN:
BASECOUNT - Base Count gives the total number of adenine (A), cytosine (C), guanine (G), and thymine
(T) bases in the sequence.
12. ORIGIN - Origin contains the sequence data, which begins on the line immediately below the field
title.
6. //
Locus name helps in group entries with similar sequences. The first 3 characters denotes the organism, the
fourth and fifth characters gives other group designations, such as gene product and the last character is a
series of sequential integers.
Sequence Length contains number of nucleotide base pairs (or amino acid residues) in the sequence
record.
Molecule Type shows the type of sequenced molecule.
Genbank Division shows the GenBank division to which a record belongs and is indicated by a three letter
abbreviation.
1. PRI - primate sequences
2. ROD - rodent sequences
3. MAM - other mammalian sequences
4. VRT - other vertebrate sequences
5. INV - invertebrate sequences
6. PLN - plant, fungal, and algal sequences
7. BCT - bacterial sequences
8. VRL - viral sequences
9. PHG - bacteriophage sequences
10. SYN - synthetic sequences
11. UNA - unannotated sequences
12. EST - EST sequences (expressed sequence tags)
13. PAT - patent sequences
14. STS - STS sequences (sequence tagged sites)
15. GSS - GSS sequences (genome survey sequences)
16. HTG - HTG sequences (high-throughput genomic seq)
17. HTC - unfinished high-throughput cDNA sequencing
18. ENV - environmental sampling sequences
Modification Date shows the last date of modification.
Definition is a brief description of sequence that includes information such as source organism, gene
name/protein name, or some description of the sequence's function.
Accession number indicates the unique identifier for a sequence record.
7. Records from the RefSeq
NT_123456 constructed genomic contigs
NM_123456 mRNAs
NP_123456 proteins
NC_123456 chromosomes
Version shows a nucleotide sequence identification number that represents a single, specific sequence in
the GenBank database.
GI "GenInfo Identifier" is a sequence identification number for the nucleotide sequence.
Keywords describes word or phrase of the sequence.
Source indicates free-format information including an abbreviated form of the organism name, sometimes
followed by a molecule type.
Organism describes the formal scientific name for the source organism and its lineage.
Reference includes publications by the authors of the sequence that discuss the data reported in the record.
Authors contains List of authors in the order in which they appear in the cited article.
Entrez Search Field: Author [AUTH]
Title represents the title of the published work or tentative title of an unpublished word.
Entrez Search Field: Text Word [WORD]
Journal: MEDLINE abbreviation of the journal name.
Entrez Search Field: Journal Name [JOUR]
Pubmed: PubMed Identifier (PMID)
Features shows information about genes and gene products, as well as regions of biological significance
reported in the sequence.
Source is a mandatory feature in each record that summarizes the length of the sequence, scientific name
of the source organism, and Taxon ID number. Can also include other information such as map location,
strain, clone, tissue type, etc., if provided by submitter.
Taxon is a stable unique identification number for the taxon of the source organism.
CDS (Coding sequence) represents region of nucleotides that corresponds with the sequence of amino
acids in a protein.