2. WHAT YOU NEED TO LEARN:
What is a database and what are the features of
an ideal database?
What are the relationships/differences between
primary and derived sequence databases?
Why is data integration useful?
3. WHAT ARE DATABASES?
Structured collection of data/information.
Consists of basic units called records or entries.
Each record consists of fields, which hold pre-defined
data related to the record.
For example, a protein database would have
protein entries as records and protein properties
as fields (e.g., name of protein, length, amino-acid
sequence)
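The record/field idea above can be sketched in a few lines of Python. This is only an illustration: the class name `ProteinRecord`, the helper `find_by_name`, and the two toy entries are invented here and do not come from any real database.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    """One record (entry); each attribute is a field."""
    name: str
    sequence: str  # amino-acid sequence in one-letter codes (toy data)

    @property
    def length(self) -> int:
        # Derived field: number of residues in the sequence.
        return len(self.sequence)


# A tiny "database": a structured collection of records.
database = [
    ProteinRecord(name="toy kinase", sequence="MKTAYIAKQR"),
    ProteinRecord(name="toy protease", sequence="MSLLNV"),
]


def find_by_name(records, text):
    """Return the records whose name field contains the given text."""
    return [r for r in records if text.lower() in r.name.lower()]


hits = find_by_name(database, "kinase")
print(hits[0].name, hits[0].length)
```

Because every record has the same pre-defined fields, a search only has to know which field to look in.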
4. THE "PERFECT" DATABASE
Comprehensive, but easy to search.
Annotated, but not “too annotated”.
A simple, easy to understand structure.
Cross-referenced.
Minimum redundancy.
Easy retrieval of data.
5. Bioinformatics sequence databases
# Can broadly be divided into 2 classes:
primary databases
secondary databases
# Primary databases contain original biological data such as:
DNA sequence, or protein structure information from experiments such as
crystallography. Examples: GenBank, TrEMBL
# Secondary databases attempt to add value to the primary databases and
make them more useful for certain specialist applications,
for example PROSITE, the database of common structural or functional motifs
found in proteins.
6. THE NATIONAL CENTER FOR
BIOTECHNOLOGY INFORMATION
Created in 1988 as a part of the
National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Bethesda, MD
8. THE CENTRAL DOGMA & BIOLOGICAL DATA
Protein structures
-Experiments
-Models (homologues)
Literature information
Original DNA Sequences
(Genomes)
Protein Sequences
-Inferred
-Direct sequencing
Expressed DNA sequence
( = mRNA Sequences
= cDNA sequences)
Expressed Sequence Tag
(ESTs)
9. NCBI DATABASES AND SERVICES
GenBank primary sequence database
Free public access to biomedical literature
PubMed free Medline (3 million searches per day)
PubMed Central full text online access
Entrez integrated molecular and literature databases
10. TYPES OF MOLECULAR DATABASES
Primary Databases
Original submissions by experimentalists
Content controlled by the submitter
Examples: GenBank, Trace, SRA, SNP, GEO
Derivative /Secondary Databases
Derived from primary data
Content controlled by third party (NCBI)
Examples: NCBI Protein, RefSeq, TPA, RefSNP, GEO datasets,
UniGene, HomoloGene, Structure, Conserved Domain
11. PRIMARY VS. SECONDARY SEQUENCE
DATABASES
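[Diagram: data flow between primary and secondary sequence databases]
Primary: sequencing centers and labs submit sequences (e.g., TATAGCCG) to GenBank
GenBank records: updated ONLY by submitters
Secondary: algorithms (UniGene), curators (RefSeq), genome assembly
Derived records: updated continually by NCBI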
12. SEQUENCE DATABASES AT NCBI
Primary
GenBank: NCBI's primary sequence database
Trace Archive: reads from capillary sequencers
Sequence Read Archive: next generation data
Derivative
GenPept (GenBank translations)
Outside Protein (UniProt—Swiss-Prot, PDB)
NCBI Reference Sequences (RefSeq)
13. GENBANK - PRIMARY SEQUENCE DB
Nucleotide only sequence database
Archival in nature
Historical
Reflective of submitter point of view (subjective)
Redundant
Data
Direct submissions (traditional records)
Batch submissions
FTP accounts (genome data)
14. GENBANK - PRIMARY SEQUENCE DB (2)
Three collaborating databases
1. GenBank
2. DNA Database of Japan (DDBJ)
3. European Molecular Biology Laboratory (EMBL) Database
15. TRADITIONAL GENBANK RECORD
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession
•Stable
•Reportable
•Universal
Version
Tracks changes in sequence
GI number
NCBI internal use
well annotated
the sequence is the data
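As a rough illustration of how the accession, version, and GI number fit together, the VERSION line shown on this slide can be split into its parts. The function name `parse_version_line` and the regular expression are assumptions made for this sketch; they are not part of any NCBI tool.

```python
import re

# Pattern for lines such as "VERSION U07418.1 GI:466461".
# The GI part is treated as optional in this sketch.
VERSION_RE = re.compile(
    r"^VERSION\s+(?P<accession>\w+)\.(?P<version>\d+)(?:\s+GI:(?P<gi>\d+))?\s*$"
)


def parse_version_line(line):
    """Split a VERSION line into accession, version number, and GI."""
    match = VERSION_RE.match(line)
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    return {
        "accession": match.group("accession"),    # stable identifier
        "version": int(match.group("version")),   # tracks sequence changes
        "gi": match.group("gi"),                  # None when absent
    }


info = parse_version_line("VERSION U07418.1 GI:466461")
print(info)
```

The accession stays the same across updates, while the number after the dot changes whenever the sequence itself changes.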
20. REFSEQ BENEFITS
Non-redundancy
Updates to reflect current sequence data and biology
Data validation
Format consistency
Distinct accession series
Stewardship by NCBI staff and collaborators
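The "distinct accession series" point can be illustrated with a small lookup. The slide does not list the prefixes; the partial table below reflects the commonly documented RefSeq prefixes, and the function name `describe_accession` and the accession `NM_000000` are placeholders invented for this sketch.

```python
# Partial table of RefSeq accession prefixes (not exhaustive).
REFSEQ_PREFIXES = {
    "NC_": "genomic (complete molecule, e.g. a chromosome)",
    "NG_": "genomic region",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "predicted mRNA (model)",
    "XR_": "predicted non-coding RNA (model)",
    "XP_": "predicted protein (model)",
}


def describe_accession(accession):
    """Say whether an accession looks like RefSeq, and of which type."""
    prefix = accession[:3].upper()
    if prefix in REFSEQ_PREFIXES:
        return "RefSeq " + REFSEQ_PREFIXES[prefix]
    return "not a RefSeq-style accession (for example, a primary GenBank record)"


print(describe_accession("NM_000000"))  # placeholder accession
print(describe_accession("U07418"))     # accession from the GenBank slide
```

The underscore in the third position is what makes a RefSeq accession easy to tell apart from a primary GenBank accession at a glance.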
23. ENTREZ: A DISCOVERY SYSTEM
[Diagram: Entrez databases and the links between them]
Databases: Gene, Taxonomy, PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure
Links: Word weight, VAST, BLAST, Phylogeny, Hard Link, BLink, Domains
Neighbors: Related Sequences, Related Structures
Pre-computed and pre-compiled data.
• A potential “gold mine” of undiscovered relationships.
• Used less than expected.
24. GLOBAL QUERY: ALL NCBI DATABASES
The Entrez system: 38 (and counting) integrated databases
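Entrez can also be queried from a script through NCBI's E-utilities. The sketch below only builds an ESearch URL and does not send it; it searches one database at a time, and the helper name `esearch_url` and the example query are assumptions chosen for illustration.

```python
from urllib.parse import urlencode

# Base address of the public E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for one Entrez database.

    db     -- Entrez database name, e.g. "protein" or "pubmed"
    term   -- query text, using the same syntax as the web search box
    retmax -- maximum number of identifiers to return
    """
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


url = esearch_url("protein", "insulin AND human[Organism]")
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return a list of matching record identifiers; that step is left out here so the sketch runs without network access.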
25. TRADITIONAL METHOD: THE LINKS MENU
DNA Sequence
Nucleotide – Protein Link
Related Proteins
Protein – Structure Link
3-D Structure
26. THE PROBLEM
Rapidly growing databases with complex and changing
relationships
Rapidly changing interfaces to match the above
Result
Many people don't know:
Where to begin
Where to click on a Web page
Why it might be useful to click there
31. ADVANTAGES OF DATA INTEGRATION
More relevant inter-related information in one place
Makes it easier to find additional relevant information related to your
initial query
Potentially find information indirectly linked, but relevant to your
subject of interest
uncover non-obvious genetic features that explain phenotype or
disease
Easier to build a 'story' based on multiple pieces of biological
evidence
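A minimal sketch of what integration buys you: starting from one nucleotide record, cross-references are followed to the protein and then to the structure, and everything is collected in one place. The three dictionaries and all identifiers (`NUC1`, `PROT1`, `STR1`) are toy data invented for this example.

```python
# Toy stand-ins for three separate databases, linked by cross-references.
nucleotide_db = {
    "NUC1": {"description": "toy gene, mRNA", "protein_id": "PROT1"},
    "NUC2": {"description": "toy non-coding RNA"},
}
protein_db = {
    "PROT1": {"description": "toy protein", "structure_id": "STR1"},
}
structure_db = {
    "STR1": {"description": "toy protein, crystal structure"},
}


def integrated_view(nucleotide_id):
    """Follow cross-references from a nucleotide record as far as they go."""
    nucleotide = nucleotide_db[nucleotide_id]
    view = {"nucleotide": nucleotide["description"]}

    protein = protein_db.get(nucleotide.get("protein_id"))
    if protein is not None:
        view["protein"] = protein["description"]
        structure = structure_db.get(protein.get("structure_id"))
        if structure is not None:
            view["structure"] = structure["description"]
    return view


print(integrated_view("NUC1"))
print(integrated_view("NUC2"))
```

Without the cross-references, the same result would need three separate searches and a manual match between them.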
32. Remember: when reporting on a bioinformatics analysis, it is very important to
state which release of the sequence databases was used.
• Because of their enormous size, the databases are now broken up into
sections (divisions) to ease management.
• Most of these divisions are organised on a taxonomic basis (prokaryotes,
plants, fungi, rodents, mammals, etc.).
• These divisions are useful in that they make it easier to search only in the
relevant part of the database.
• The user manual for each database clearly describes its structure.
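The benefit of divisions can be sketched as an optional filter on a search. The record identifiers and descriptions are toy data; the division codes `BCT` (bacteria), `PLN` (plants and fungi), and `ROD` (rodents) follow GenBank's three-letter convention, and the function name `search` is an assumption of this example.

```python
# Toy records, each tagged with the division it belongs to.
records = [
    {"id": "TOY001", "division": "BCT", "description": "toy heat shock protein gene"},
    {"id": "TOY002", "division": "PLN", "description": "toy heat shock protein gene"},
    {"id": "TOY003", "division": "ROD", "description": "toy insulin gene"},
]


def search(records, keyword, division=None):
    """Return ids of matching records, optionally within one division only."""
    hits = []
    for record in records:
        if division is not None and record["division"] != division:
            continue  # skip everything outside the requested division
        if keyword.lower() in record["description"].lower():
            hits.append(record["id"])
    return hits


print(search(records, "heat shock"))                  # whole database
print(search(records, "heat shock", division="PLN"))  # plants and fungi only
```

Restricting the search to one division reduces both the work done and the number of irrelevant hits.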
33. Protein Databases
a. GenPept
• GenBank Gene Products Data Bank (GenPept) is a protein database produced by the
National Center for Biotechnology Information (NCBI).
• The entries in this database are derived from translations of the open reading frames
in GenBank, DDBJ and EMBL.
• It contains the same annotations present in the nucleotide records.
• The entries in this database lack additional annotation and do not include proteins
derived from amino acid sequencing.
• A single protein can be expected to appear in multiple records,
i.e., the database is redundant.
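To make the translation step and the resulting redundancy concrete, here is a small sketch using the standard genetic code. The two nucleotide records (`TOY_A`, `TOY_B`) are invented; they differ only in synonymous codons, so they translate to the same protein and would yield two protein records for one sequence.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Two toy nucleotide records that differ only in synonymous codons.
nucleotide_records = {"TOY_A": "ATGGCCAAATAA", "TOY_B": "ATGGCTAAGTAA"}

protein_records = {rid: translate(seq) for rid, seq in nucleotide_records.items()}
print(protein_records)
```

A translation-only database keeps one protein record per nucleotide record, which is why identical proteins show up more than once.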
34. b. RefSeq
# The aim of the Reference Sequence (RefSeq) database is to provide a comprehensive,
integrated, non-redundant sequence set at the genomic, transcript (including splice
variants), and protein levels for major organisms.
# RefSeq records represent the current best view of genomes and their transcript and/or
protein products.
# However, the majority of the entries in RefSeq are automatically generated with minimal
manual intervention.
# As a non-redundant database, it nevertheless offers a significant advantage for database searches.
# The RefSeq collection is substantially based on the sequence records from GenBank, EMBL
and DDBJ, but it differs in that each RefSeq record includes attribution to the original
sequence data and is not a piece of primary research data in itself.
35. c. TrEMBL
# Translated EMBL (TrEMBL) is the European counterpart of the American GenPept and RefSeq.
# The TrEMBL database, maintained by EBI, contains the translations of all coding sequences
(CDS) present in EMBL/DDBJ/GenBank that are not yet integrated into SWISS-PROT.
# TrEMBL is a computer-annotated protein database that serves as a kind of halfway house to
SWISS-PROT.
# As a supplementary database to SWISS-PROT, TrEMBL serves to accommodate the
influx of protein sequences and make these sequences available as fast as possible without
compromising the quality standards of SWISS-PROT.
36. # Each TrEMBL entry is assigned a SWISS-PROT-type accession number that will stay with
it when the sequence is finally manually checked and accepted into SWISS-PROT.
# To simplify curation, TrEMBL follows the SWISS-PROT format and conventions as closely as
possible.
# Bear in mind, however, that because TrEMBL entries are generated
automatically, the quality of these entries is not guaranteed.
37. Universal curated databases
a. PIR-PSD (Protein information resource- protein sequence database)
• The PIR-International Protein Sequence Database (PIR-PSD) was created through the
collaboration of the Protein Information Resource (PIR) with the Munich Information
Centre for Protein Sequences (MIPS) and the Japan International Protein Information
Database (JIPID).
• The primary sources of PIR-PSD are sequences from GenBank/EMBL/DDBJ translations,
published literature and direct submissions to PIR-International.
• PIR-PSD maintains a set of integrated protein sequence databases as shown below:
38. [Figure not reproduced: the PIR-PSD set of integrated protein sequence databases]
39. b. SWISS-PROT Database
# SWISS-PROT, the leading universal curated protein sequence database, was established in
1986 and is maintained collaboratively by the Department of Medical Biochemistry of the
University of Geneva (Switzerland) and EBI.
# The database contains high-quality annotated data, and the annotation for each entry
includes a description of:
# function(s) of the protein,
# post-translational modification(s),
# domains and sites,
# secondary and quaternary structure,
# similarities to other proteins,
# disease(s) associated with defects in the protein,
# tissues in which the protein is found,
# pathways in which the protein is involved,
# sequence conflicts and variants.
40. # As a non-redundant database, SWISS-PROT tries to maintain minimal redundancy:
all reports for a given protein are merged into a single entry.
# The feature table (FT) indicates any conflicts between the various sequencing reports
for the corresponding entry.
# The entries in SWISS-PROT are produced from translations of sequences in EMBL, extracted
from the literature or submitted directly by researchers.
# To build the annotation, SWISS-PROT curators review not only the publications referenced by
the author, but also related articles, in order to periodically update the annotations of the
families or groups of proteins.
# The added annotation is stored mainly in the description (DE) and gene (GN) lines, the
comment (CC) lines, the feature table (FT) lines and the keyword (KW) lines.
# SWISS-PROT adds value by providing links to over 30 different databases, including
databases of nucleic acid and protein sequences, protein families, etc.
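The two-letter line codes mentioned above make a flat-file entry easy to take apart. The sketch below groups the lines of a made-up, heavily simplified entry by line code; it assumes the code sits in the first two columns and the content starts at column 6, and the entry text and the function name `group_by_line_code` are invented for illustration rather than copied from a real record.

```python
from collections import defaultdict

# A made-up, simplified entry: two-letter line code, three spaces, then content.
toy_entry = """\
ID   TOY_HUMAN
DE   Hypothetical toy protein.
GN   TOY1.
CC   -!- FUNCTION: Placeholder comment for illustration.
KW   Hypothetical; Toy.
FT   CONFLICT      5      5       A -> V.
"""


def group_by_line_code(text):
    """Collect the content of each line under its two-letter line code."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2:
            continue  # ignore blank or malformed lines
        code, content = line[:2], line[5:].strip()
        grouped[code].append(content)
    return dict(grouped)


annotation = group_by_line_code(toy_entry)
for code in ("DE", "GN", "CC", "FT", "KW"):
    print(code, annotation[code])
```

A real parser would need to handle continuation lines and the sequence block, which this sketch leaves out.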
41. c. The UniProt Knowledgebase (UniProt)
# Since December 2003, the SWISS-PROT, PIR-PSD and TrEMBL protein databases have united
their activities to form the Universal Protein Knowledgebase (UniProt) consortium.
# Building on these solid foundations, UniProt aims to provide biologists with a central,
comprehensive and high-quality protein database with efficient and clear access mechanisms.
# UniProt comprises three database layers:
1. UniParc
2. UniProtKB
3. UniRef
42. 1. UniParc
# The UniProt Archive (UniParc) is the most comprehensive non-redundant protein sequence
repository available.
# UniParc is designed to capture all publicly available protein sequence data from the
databases DDBJ, EMBL, GenBank, SWISS-PROT, TrEMBL, PIR-PSD, Ensembl, IPI (International
Protein Index), PDB, RefSeq, FlyBase, WormBase and the European, United States and Japanese
Patent Offices.
# As a result, performing a sequence search against UniParc is equivalent to performing
the same search against all databases cross-referenced by UniParc.
# To avoid redundancy, UniParc assigns each unique sequence a unique UniParc identifier.
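The non-redundancy idea can be sketched as follows: each distinct sequence is stored once, given one identifier, and annotated with cross-references to every source record that carries it. The identifier format (`ARC0001`), the source record ids, and the sequences are invented for this example and are not UniParc's actual scheme.

```python
def build_archive(source_records):
    """Store each distinct sequence once, with cross-references to its sources."""
    archive = {}  # sequence -> {"id": archive id, "sources": [(db, record id), ...]}
    for source_db, source_id, sequence in source_records:
        entry = archive.get(sequence)
        if entry is None:
            # First time this sequence is seen: assign a new identifier.
            entry = {"id": f"ARC{len(archive) + 1:04d}", "sources": []}
            archive[sequence] = entry
        entry["sources"].append((source_db, source_id))
    return archive


# Toy input: the same sequence appears in two different source databases.
source_records = [
    ("TrEMBL", "T1", "MAKLV"),
    ("PIR-PSD", "P1", "MAKLV"),
    ("RefSeq", "R1", "MSTQE"),
]

archive = build_archive(source_records)
for sequence, entry in archive.items():
    print(entry["id"], sequence, entry["sources"])
```

Searching the archive once covers all three source databases, because every hit carries the list of records it came from.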
43. Genome databases
# A second major source of primary data is the various genome projects,
a large number of which are underway.
# A representative sample of these projects is shown in the table below.
# Much of the information from these projects can be found in the EMBL nucleotide
sequence database.