INTRODUCTION TO
BIOLOGICAL DATABASES
SARFARAZ HUSSAIN
Department of Bioinformatics
& Biotechnology GCU-
Faisalabad
NOTE: Most slides derived from NCBI’s field guide
/ sarfraz1412@gmail.com
WHAT YOU NEED TO LEARN:
 What is a database and what are the features of
an ideal db?
 What are the relationships/differences between
primary and derived sequence databases?
 What are the benefits of RefSeq?
 Why is data integration useful?
THE NATIONAL CENTER FOR
BIOTECHNOLOGY INFORMATION
Created in 1988 as a part of the
National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Bethesda, MD
WEB ACCESS: WWW.NCBI.NLM.NIH.GOV
New Homepage
Common footer
New pages!
WHAT ARE DATABASES?
 Structured collection of information.
 Consists of basic units called records or entries.
 Each record consists of fields, which hold predefined
data related to the record.
 For example, a protein database would have
protein entries as records and protein properties
as fields (e.g., name of protein, length, amino acid
sequence).
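The record/field idea above can be sketched in a few lines of Python. This is only an illustration: the class name `ProteinRecord` and the two entries are invented placeholders, not data from any real database.

```python
from dataclasses import dataclass, fields


@dataclass
class ProteinRecord:
    """One database record; each attribute is a field."""
    name: str
    length: int
    sequence: str


# A toy "database": a structured collection of records.
# Both entries are made-up placeholders, not real proteins.
database = [
    ProteinRecord(name="example protein A", length=5, sequence="MKTAY"),
    ProteinRecord(name="example protein B", length=4, sequence="MSFV"),
]

# The predefined fields that every record carries
field_names = [f.name for f in fields(ProteinRecord)]

# A simple query on one field: records at least 5 residues long
long_records = [r for r in database if r.length >= 5]
```

Because every record has the same predefined fields, a query can address a field by name instead of searching free text.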
THE ‘PERFECT’ DATABASE
 Comprehensive, but easy to search.
 Annotated, but not “too annotated”.
 A simple, easy-to-understand structure.
 Cross-referenced.
 Minimum redundancy.
 Easy retrieval of data.
THE CENTRAL DOGMA & BIOLOGICAL
DATA
Protein structures
-Experiments
-Models (homologues)
Literature information
Original DNA Sequences
(Genomes)
Protein Sequences
-Inferred
-Direct sequencing
Expressed DNA sequences
( = mRNA Sequences
= cDNA sequences)
Expressed Sequence Tags
(ESTs)
NCBI DATABASES AND SERVICES
 GenBank: primary sequence database
 Free public access to biomedical literature
 PubMed: free Medline (3 million searches per day)
 PubMed Central: full-text online access
 Entrez: integrated molecular and literature databases
TYPES OF MOLECULAR DATABASES
 Primary Databases
 Original submissions by experimentalists
 Content controlled by the submitter
 Examples: GenBank, Trace, SRA, SNP, GEO
 Derivative Databases
 Derived from primary data
 Content controlled by third party (NCBI)
 Examples: NCBI Protein, RefSeq, TPA, RefSNP, GEO datasets,
UniGene, HomoloGene, Structure, Conserved Domain
 PubMed is a free search engine that primarily
accesses the MEDLINE database of references
and abstracts on life sciences and biomedical
topics. The United States National Library of
Medicine (NLM) at the National Institutes of
Health maintains the database as part of the
Entrez system of information retrieval.
 To search PubMed, select the PubMed option from
the drop-down menu and type your topic of
interest in the search box.
 The search returns a list of matching research
papers, from which you can select the specific
papers for your study.
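The slides describe searching through the web page. The same kind of search can also be expressed as a URL for NCBI's E-utilities `esearch` service. The sketch below only builds the URL and does not contact NCBI; the helper name `pubmed_search_url` and the default of 20 results are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(topic: str, max_results: int = 20) -> str:
    """Build (but do not send) a PubMed search URL for a free-text topic.

    `pubmed_search_url` is a hypothetical helper, not part of any NCBI library.
    """
    params = {"db": "pubmed", "term": topic, "retmax": max_results}
    return f"{ESEARCH}?{urlencode(params)}"


url = pubmed_search_url("colon cancer")
```

Opening the resulting URL returns the matching PubMed identifiers, which plays the role of the result list shown in the browser.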
PRIMARY VS. DERIVATIVE SEQUENCE
DATABASES
[Figure: data flow between primary and derivative databases.
Sequencing centers and labs submit sequences to GenBank, which is
updated ONLY by submitters. Algorithms (UniGene), curators (RefSeq),
and genome assembly derive records from GenBank data; these are
updated continually by NCBI.]
SEQUENCE DATABASES AT NCBI
 Primary
 GenBank: NCBI’s primary sequence database
 Trace Archive: reads from capillary sequencers
 Sequence Read Archive: next generation data
 Derivative
 GenPept (GenBank translations)
 Outside Protein (UniProt—Swiss-Prot, PDB)
 NCBI Reference Sequences (RefSeq)
GENBANK - PRIMARY SEQUENCE DB
 Nucleotide-only sequence database
 Archival in nature
 Historical
 Reflective of submitter point of view (subjective)
 Redundant
 Data
 Direct submissions (traditional records)
 Batch submissions
 FTP accounts (genome data)
GENBANK - PRIMARY SEQUENCE DB (2)
 Three collaborating databases
1. GenBank
2. DNA Database of Japan (DDBJ)
3. European Molecular Biology Laboratory (EMBL) Database
TRADITIONAL GENBANK RECORD
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession
•Stable
•Reportable
•Universal
Version
Tracks changes in sequence
GI number
NCBI internal use
well annotated
the sequence is the data
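As a small illustration of how the stable accession and the version relate, the VERSION value on this slide can be split into its two parts. This is a sketch only; `SeqVersion` and `parse_version` are names invented here.

```python
from typing import NamedTuple


class SeqVersion(NamedTuple):
    accession: str  # stable identifier, unchanged across updates
    version: int    # changes when the sequence itself changes


def parse_version(identifier: str) -> SeqVersion:
    """Split an 'accession.version' string such as 'U07418.1'."""
    accession, sep, version = identifier.partition(".")
    if not sep or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return SeqVersion(accession, int(version))


parsed = parse_version("U07418.1")
```

Citing the bare accession points at the record in general, while accession plus version pins down one exact sequence.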
DERIVATIVE SEQUENCE
DATABASES
FEATURES             Location/Qualifiers
     source          1..2484
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="3"
                     /map="3p22-p23"
     gene            1..2484
                     /gene="MLH1"
     CDS             22..2292
                     /gene="MLH1"
                     /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
                     Number P14242), S. cerevisiae MLH1 (GenBank Accession
                     Number U07187), E. coli MUTL (Swiss-Prot Accession Number
                     P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
                     Number P14161) and Streptococcus pneumoniae (Swiss-Prot
                     Accession Number P14160)"
                     /codon_start=1
                     /product="DNA mismatch repair protein homolog"
                     /protein_id="AAC50285.1"
                     /db_xref="GI:463989"
                     /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
                     TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
                     ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
                     TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS
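The CDS location in this record (22..2292) is enough to work out how long the translated protein should be. The sketch below does that arithmetic, assuming a simple `start..end` location and that the last codon of the CDS is the stop codon; `cds_protein_length` is a hypothetical helper.

```python
def cds_protein_length(location: str) -> int:
    """Number of amino acids encoded by a simple 'start..end' CDS location.

    Assumes 1-based inclusive coordinates, a complete reading frame, and
    that the final codon is a stop codon (not part of the protein).
    """
    start_text, end_text = location.split("..")
    start, end = int(start_text), int(end_text)
    n_bases = end - start + 1          # inclusive coordinates
    if n_bases % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    n_codons = n_bases // 3
    return n_codons - 1                # drop the stop codon


residues = cds_protein_length("22..2292")
```

For this record the span is 2271 bases, or 757 codons, which gives 756 residues under the stated assumption.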
GENPEPT: GENBANK CDS
TRANSLATIONS
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
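The FASTA definition line on this slide packs several identifiers into one pipe-separated string. A minimal parser for that specific `gi|number|db|accession|description` layout might look like this; `parse_defline` is an invented name, and other header layouts are not handled.

```python
def parse_defline(defline: str) -> dict:
    """Parse a '>gi|<number>|<db>|<accession>| <description>' header line."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    parts = defline[1:].split("|", 4)
    if len(parts) != 5 or parts[0] != "gi":
        raise ValueError("unexpected definition line layout")
    _, gi, db, accession, description = parts
    return {
        "gi": int(gi),
        "db": db,                      # 'gb' marks a GenBank-derived record
        "accession": accession,
        "description": description.strip(),
    }


record = parse_defline(">gi|463989|gb|AAC50285.1| DNA mismatch repair prote...")
```

The GI and protein accession recovered here match the `/db_xref` and `/protein_id` qualifiers of the CDS feature, which is how a GenPept entry ties back to its GenBank record.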
REFSEQ: DERIVATIVE SEQUENCE DATABASE
 Curated transcripts and proteins
 Model transcripts and proteins
 Assembled Genomic Regions
 Chromosome records
 Human genome
 microbial
 organelle
ftp://ftp.ncbi.nih.gov/refseq/release/
SELECTED REFSEQ ACCESSION
NUMBERS
NM_  Curated mRNA
NP_  Curated Protein
NR_  Curated non-coding RNA
XM_  Predicted mRNA
XP_  Predicted Protein
XR_  Predicted non-coding RNA
NG_  Reference Genomic Sequence
NC_  Microbial replicons, organelle genomes
AC_  Alternate assemblies
NT_  Contig
NW_  WGS Supercontig
GENBANK TO REFSEQ
REFSEQS: ANNOTATION REAGENTS
Genomic DNA (NC, NT, NW)
Model mRNA (XM)
(XR)
Curated mRNA (NM)
(NR)
Model protein (XP)
Curated Protein (NP)
Scanning....
= ?
GenBank
Sequences
RefSeq
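Because RefSeq uses a distinct accession series, the two-letter prefix alone says what kind of record an accession points to. The lookup below covers only the prefixes named on this slide. The function name is an assumption, and `NM_000000` in the usage line is a made-up placeholder rather than a real accession.

```python
# Prefix -> record type, limited to the prefixes shown on the slide
REFSEQ_PREFIXES = {
    "NC": "genomic DNA",
    "NT": "genomic DNA",
    "NW": "genomic DNA",
    "NM": "curated mRNA",
    "NR": "curated non-coding RNA",
    "NP": "curated protein",
    "XM": "model mRNA",
    "XR": "model non-coding RNA",
    "XP": "model protein",
}


def classify_refseq(accession: str) -> str:
    """Return the record type implied by a RefSeq-style prefix."""
    prefix, sep, _ = accession.partition("_")
    if not sep or prefix not in REFSEQ_PREFIXES:
        return "not a RefSeq prefix from this slide"
    return REFSEQ_PREFIXES[prefix]


kind = classify_refseq("NM_000000")      # placeholder accession
other = classify_refseq("U07418")        # GenBank accession, no underscore
```

The underscore is what separates RefSeq accessions from GenBank ones such as U07418, so the check on the separator is enough to tell the two series apart.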
REFSEQ BENEFITS
 Non-redundancy  
 Updates to reflect current sequence data and biology
 Data validation
 Format consistency
 Distinct accession series
 Stewardship by NCBI staff and collaborators
OTHER DERIVATIVE DATABASES
 Expressed Sequences
 dbSNP
 Structure
 Gene
 and more…
ENTREZ
FINDING RELEVANT
INFORMATION IN NCBI
DATABASES
ENTREZ: A DISCOVERY SYSTEM
Gene
Taxonomy
PubMed
abstracts
Nucleotide
sequences
Protein
sequences
3-D
Structure
Word weight
VAST
BLAST
Phylogeny
Hard Link
Neighbors
Related Sequences
Neighbors
Related Sequences
BLink
Domains
Neighbors
Related Structures
Pre-computed and pre-compiled data.
•A potential “gold mine” of undiscovered
relationships.
•Used less than expected.
GLOBAL QUERY: ALL NCBI DATABASES
The Entrez system: 38 (and counting) integrated databases
TRADITIONAL METHOD: THE LINKS MENU
DNA Sequence
Nucleotide – Protein Link
Related Proteins
Protein – Structure Link
3-D Structure
THE PROBLEM
 Rapidly growing databases with complex and changing
relationships
 Rapidly changing interfaces to match the above
Result
 Many people don’t know:
 Where to begin
 Where to click on a Web page
 Why it might be useful to click there
GLOBAL NCBI (ENTREZ) SEARCH
colon cancer
GLOBAL ENTREZ SEARCH
RESULTS
ENTREZ TIP: START SEARCHES IN
GENE
Other Entrez DBs
HomoloGene
Entrez
Protein
Gene
UniGene
BLink
HomoloGene:
Gene Neighbors
PRECISE RESULTS
MLH1[Gene Name] AND Human[Organism]
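The precise query on this slide follows a simple pattern: each search value is tagged with a field name in square brackets, and the clauses are joined with AND. A sketch of building such a query, and of encoding it for an E-utilities `esearch` URL against the Gene database, is shown below. The helper `entrez_term` is hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode


def entrez_term(clauses):
    """Join (value, field) pairs into a fielded Entrez query string."""
    return " AND ".join(f"{value}[{field}]" for value, field in clauses)


term = entrez_term([("MLH1", "Gene Name"), ("Human", "Organism")])

# URL-encode the query for the ESearch service (db=gene); not sent here
url = (
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
    + urlencode({"db": "gene", "term": term})
)
```

Restricting each value to a field is what keeps the result list short: "Human" matches only the organism field instead of any text in the record.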
MLH1 GENE RECORD
MLH1:LINKS TO SEQUENCE
GENEVIEW: HUMAN MLH1 VARIATIONS
ATPase domain
‘TAKE HOME MESSAGE’ ADVANTAGES
OF DATA INTEGRATION
 More relevant inter-related information in one
place
 Makes it easier to find additional relevant
information related to your initial query
 Potentially find information indirectly linked, but
relevant to your subject of interest
 Uncover non-obvious genetic features that explain
phenotype or disease
 Easier to build a ‘story’ based on multiple pieces
of biological evidence


Editor's Notes

  • #5 NCBI homepage. The logo takes you back to the home page. "About NCBI" provides an introduction to NCBI and contains basic information on genetics and bioinformatics.
  • #10 Primary databases serve as repositories of sequences submitted by experimentalists (GenBank). Derivative databases are sources of edited/curated sequences (RefSeq: reference sequences; UniGene: genes compared to genetic loci on genomes).
  • #14 ~11,000 sequences are submitted per day.
  • #21 Goal: a nonredundant set of genes/proteins for each organism represented. Model: comes from analysis of the genomic content of an organism's assembly (reannotation of microbial genomes, for example).
  • #22 Two-letter prefix, underscore, numeric portion. NT = contig assemblies produced by NCBI; NW = supercontig assemblies from WGS.