INTRODUCTION TO
BIOLOGICAL DATABASES
M. Alroy Mascrenghe
DATABASES
• Sequence information is stored in databases so that it can be searched and manipulated easily
• The major databases (next slide) are hosted at different sites
• They exchange data daily, so they stay up to date and in sync with one another
• Primary databases hold the sequence data itself
WHAT ARE DATABASES?
• A structured collection of information.
• Consists of basic units called records or entries.
• Each record consists of fields, which hold predefined data related to the record.
• For example, a protein database would have protein entries as records and protein properties as fields (e.g., name of protein, length, amino-acid sequence).
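The record/field idea above can be sketched in a few lines of Python. This is only an illustration: the class name `ProteinRecord` and the two toy entries are made up for this example and do not come from any real database.

```python
from dataclasses import dataclass, fields


@dataclass
class ProteinRecord:
    """One database record; each attribute is one field."""
    name: str
    length: int
    sequence: str


# A tiny in-memory "database": a list of records (toy values, not real entries).
database = [
    ProteinRecord(name="toy protein A", length=5, sequence="MKTAY"),
    ProteinRecord(name="toy protein B", length=3, sequence="MGL"),
]

# The field names are fixed in advance by the record definition.
field_names = [f.name for f in fields(ProteinRecord)]

# Searching the database means filtering records on a field value.
long_records = [r for r in database if r.length > 4]
```

Real sequence databases add indexing, cross-references and annotation on top of this, but the record-and-field structure is the same.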
THE ‘PERFECT’ DATABASE
• Comprehensive, but easy to search.
• Annotated, but not “too annotated”.
• A simple, easy to understand structure.
• Cross-referenced.
• Minimum redundancy.
• Easy retrieval of data.
TYPES OF MOLECULAR DATABASES
• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
– Examples: GenBank, Trace, SRA, SNP, GEO
• Derivative Databases
– Derived from primary data
– Content controlled by third party (NCBI)
– Examples: NCBI Protein, RefSeq, TPA, RefSNP, GEO datasets, UniGene, HomoloGene, Structure, Conserved Domain
PRIMARY VS. DERIVATIVE SEQUENCE
DATABASES
[Figure: sequencing centers and labs submit sequences to GenBank, which is updated ONLY by submitters. Algorithms (UniGene) and curators and genome assembly (RefSeq) derive new records from GenBank data; these derivative databases are updated continually by NCBI.]
MAJOR PRIMARY DB
Nucleic acid:
• EMBL (Europe)
• GenBank (USA)
• DDBJ (Japan)
Protein:
• PIR (Protein Information Resource)
• MIPS
• SWISS-PROT (University of Geneva, now with EBI)
• TrEMBL (a supplement to SWISS-PROT)
• NRL-3D
International Nucleotide Sequence Database
• IAM: International Advisory Meeting
• ICM: International Collaborative Meeting
• EMBL: European Molecular Biology Laboratory
• EBI: European Bioinformatics Institute
• DDBJ: DNA Data Bank of Japan
• CIB: Center for Information Biology and DNA Data Bank of Japan
• NIG: National Institute of Genetics
• NCBI: National Center for Biotechnology Information
• NLM: National Library of Medicine
Protein Databases
Protein Information Resource (PIR) http://pir.georgetown.edu/
In 1988, the Protein Information Resource (PIR) established a
cooperative effort with the Munich Information Center for Protein
Sequences (MIPS) and the Japan International Protein
Information Database (JIPID) to produce the PIR-International
Protein Sequence Database (PIR-PSD), a comprehensive,
non-redundant, expertly annotated, fully classified and extensively
cross-referenced protein sequence database in the public domain.
SWISS-PROT http://www.ebi.ac.uk/swissprot/
The SWISS-PROT Protein Knowledgebase is an annotated
protein sequence database established in 1986. It is
maintained collaboratively by the Swiss Institute of
Bioinformatics (SIB) and the European Bioinformatics
Institute (EBI).
NCBI DATABASES
THE NATIONAL CENTER FOR
BIOTECHNOLOGY INFORMATION
Created in 1988 as a part of the
National Library of Medicine at NIH, in Bethesda, MD
– Establish public databases
– Conduct research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
WEB ACCESS: WWW.NCBI.NLM.NIH.GOV
NCBI DATABASES AND SERVICES
• GenBank: primary sequence database
• Free public access to biomedical literature
– PubMed: free MEDLINE access (3 million searches per day)
– PubMed Central: full-text online access
• Entrez: integrated molecular and literature databases
SEQUENCE DATABASES AT NCBI
• Primary
– GenBank: NCBI’s primary sequence database
– Trace Archive: reads from capillary sequencers
– Sequence Read Archive: next-generation sequencing data
• Derivative
– GenPept (GenBank translations)
– Outside Protein (UniProt/Swiss-Prot, PDB)
– NCBI Reference Sequences (RefSeq)
GENBANK - PRIMARY SEQUENCE DB
• Nucleotide-only sequence database
• Archival in nature
– Historical
– Reflects the submitter's point of view (subjective)
– Redundant
• Data
– Direct submissions (traditional records)
– Batch submissions
– FTP accounts (genome data)
GENBANK - PRIMARY SEQUENCE DB (2)
• Three collaborating databases
1. GenBank
2. DNA Data Bank of Japan (DDBJ)
3. European Molecular Biology Laboratory (EMBL) Database
The International Nucleotide Sequence
Database Collaboration (INSDC)
EMBL: European Bioinformatics Institute (EBI), http://www.ebi.ac.uk
GenBank: National Center for Biotechnology Information (NCBI), http://www.ncbi.nlm.nih.gov/
DDBJ: National Institute of Genetics (NIG), http://www.ddbj.nig.ac.jp/
TRADITIONAL GENBANK RECORD
ACCESSION U07418
VERSION U07418.1 GI:466461
• Accession: stable, reportable, universal
• Version: tracks changes in the sequence
• GI number: for NCBI internal use
• The record is well annotated, but the sequence is the data
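As a small illustration of how the accession and version relate, the VERSION value shown above (U07418.1) can be split into its two parts. The function name `split_version` is hypothetical and chosen only for this sketch.

```python
def split_version(version_id: str) -> tuple[str, int]:
    """Split an 'accession.version' identifier into its two parts."""
    accession, _, version = version_id.partition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {version_id!r}")
    return accession, int(version)


# The accession stays stable; only the number after the dot changes
# when the sequence itself is revised.
accession, version = split_version("U07418.1")
```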
GENBANK
• First example: prokaryotic gene
• Point your browser to:
http://www.ncbi.nlm.nih.gov/entrez
• Choose Nucleotide from the Search pull-down menu
• In the For box, type X01714 and click Go
• Click the link labeled X01714
• Use “Send To Text” if you want to save the file
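The same record can also be requested programmatically. The slides only describe the web interface, so the sketch below is an assumption on top of them: it builds a request URL for NCBI's E-utilities `efetch` service, and the helper name `genbank_url` is made up for this example. It only constructs the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service (not mentioned in the slides).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_url(accession: str, rettype: str = "gb") -> str:
    """Build an efetch URL for a nucleotide record in plain-text form."""
    params = {
        "db": "nuccore",      # the Nucleotide database
        "id": accession,
        "rettype": rettype,   # "gb" for the GenBank flat file, "fasta" for FASTA
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


url = genbank_url("X01714")
```

Opening `url` in a browser, or passing it to `urllib.request.urlopen`, should return the flat-file text for the record used in this example.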
GENBANK FIELDS
• LOCUS
– size of sequence (in base pairs)
– nature of molecule (e.g. DNA or RNA)
– topology (linear or circular)
• DEFINITION: brief description of gene
• ACCESSION: unique identifier for the record, shared
across the collaborating databases
• VERSION: accession number plus a version suffix that
increases whenever the sequence changes (e.g. U07418.1)
GENBANK FIELDS
• KEYWORDS: list of terms related to the entry; can be used for keyword searching for related data
• SOURCE: common name of the relevant organism
• ORGANISM: full scientific name, with taxonomic classification
– Note that ORGANISM is indented under SOURCE; this indicates that ORGANISM is a subordinate term, or subsection, of SOURCE
GENBANK FIELDS
• REFERENCE: credits author(s) who initially
determined the sequence; includes subsections:
– AUTHORS
– TITLE
– JOURNAL
– PUBMED
• COMMENT: free-formatted text that doesn’t fit in
another category
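The header fields described on these slides can be read with a short parser. This is a minimal sketch, not a full GenBank parser: it assumes that keywords occupy the first 12 columns of each line, that indented keywords belong to the preceding top-level keyword, and that lines with a blank keyword column continue the previous field. The sample record is invented for illustration.

```python
def parse_header(text: str) -> dict[str, str]:
    """Map each keyword (e.g. 'SOURCE', 'SOURCE/ORGANISM') to its text."""
    record: dict[str, str] = {}
    parent = key = None
    for line in text.splitlines():
        if not line.strip():
            continue
        label, value = line[:12], line[12:].strip()
        if not label.strip():          # blank keyword column: continuation line
            if key is not None:
                record[key] += " " + value
        elif label[0] != " ":          # keyword starts in column 1: top level
            parent = key = label.strip()
            record[key] = value
        else:                          # indented keyword: subordinate to parent
            key = f"{parent}/{label.strip()}"
            record[key] = value
    return record


# Hypothetical toy record, laid out like a GenBank header.
SAMPLE = """\
LOCUS       TOY0001      18 bp    DNA     linear
DEFINITION  Toy record for illustration only.
ACCESSION   TOY0001
SOURCE      toy organism
  ORGANISM  Toy organism
            Unclassified.
"""

record = parse_header(SAMPLE)
```

Here `record["SOURCE/ORGANISM"]` holds the organism name followed by its classification line, which mirrors the way ORGANISM is nested under SOURCE.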
GENBANK FIELDS: FEATURES: CDS
• Gives the coordinates of the coding sequence, from the first nucleotide of the start codon (ATG) to the last nucleotide of the stop codon (TAA in this example)
• Several lines follow, listing the protein product, the reading frame to use, the genetic code to apply and several IDs for the protein sequence
• The /translation section gives the computer translation of the coding sequence into an amino acid sequence
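To make the idea of a computer translation concrete, here is a minimal sketch of translating a coding sequence. It assumes the standard genetic code and a sequence that starts in frame at the start codon; the function name `translate_cds` and the toy input are illustrative only.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_cds(dna: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":          # stop codon: end of the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy coding sequence: ATG (Met), GCC (Ala), TAA (stop).
protein = translate_cds("ATGGCCTAA")
```

A record that names a different genetic code would need a different table, which is why the CDS feature states the code to apply.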
LAST SECTION: SEQUENCE ITSELF
• This is the most important section in terms of
analysis using other tools
• Can isolate just this section and save the file, as
follows:
– Choose FASTA from the Display pull-down menu (top of
page)
– Choose Text in the Send To pull-down menu
– Use File/Save As to save the file
• use “Text” as file type
• give the file a name that you’ll know to associate with this
particular sequence
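For reference, a FASTA file is just a header line beginning with ">" followed by the sequence on wrapped lines. The sketch below writes that layout; the function name `to_fasta`, the 60-character line width and the toy sequence are assumptions made for this example.

```python
def to_fasta(identifier: str, description: str, sequence: str, width: int = 60) -> str:
    """Format one sequence as FASTA text with fixed-width sequence lines."""
    header = f">{identifier} {description}".rstrip()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *lines]) + "\n"


# Toy sequence of 80 bases; a real record would supply its own sequence.
fasta_text = to_fasta("toy1", "illustrative sequence", "ACGT" * 20)

# Saving under a recognisable name, as the slide suggests:
# with open("toy1.fasta", "w") as handle:
#     handle.write(fasta_text)
```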
EXAMPLE 2: EUKARYOTIC MRNA
• Can obtain this example by searching Nucleotide
database for U90223
• Similar to the prokaryote example, because an mRNA is a
continuous coding sequence for a protein; the record
describes a transcript, not genomic DNA
• Notes on example:
– KEYWORDS field is empty: this is an example of an
incomplete annotation
– remember, you’re looking at a primary database!
– FEATURES field contains some new terms:
• sig_peptide: location of the signal peptide, here a mitochondrial targeting sequence
• mat_peptide: exact boundaries of mature peptide
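Feature locations such as those for sig_peptide and mat_peptide are written as 1-based, inclusive positions, which differs from Python's 0-based slicing. The sketch below shows the conversion; the function name `extract`, the toy sequence and the coordinates are invented and are not taken from U90223.

```python
def extract(sequence: str, start: int, end: int) -> str:
    """Return the bases at 1-based, inclusive positions start..end."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("coordinates outside the sequence")
    return sequence[start - 1:end]


# Toy transcript with made-up feature boundaries.
toy = "ATGAAACCCGGGTTTTAA"
signal_region = extract(toy, 1, 6)    # bases 1 to 6
mature_region = extract(toy, 7, 15)   # bases 7 to 15
```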
EXAMPLE 3: EUKARYOTIC GENE
• Can obtain this record by searching Nucleotide for
AF018430
• General information:
– LOCUS: same info as previous examples – note the
locus name is different from the accession number this
time
– DEFINITION: specifies exon; remember, protein-coding
regions in eukaryotes are not contiguous as in
prokaryotes
– SEGMENT: indicates this is the second of 4; you’d need
all 4 to reconstruct the mRNA that codes for the protein
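The point about needing every segment can be sketched as follows. This is a simplified illustration with invented data: it assumes each segment has already been trimmed to its exon, whereas real segment records also contain flanking sequence. The function name `join_segments` is hypothetical.

```python
def join_segments(segments: dict[int, str], total: int) -> str:
    """Concatenate segments 1..total in order; fail if any is missing."""
    missing = [n for n in range(1, total + 1) if n not in segments]
    if missing:
        raise ValueError(f"missing segment(s): {missing}")
    return "".join(segments[n] for n in range(1, total + 1))


# Toy exon sequences keyed by segment number (not the real AF018430 data).
toy_exons = {2: "GCC", 1: "ATG", 4: "TAA", 3: "AAA"}
mrna = join_segments(toy_exons, total=4)
```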
EUKARYOTIC GENE: FEATURES SECTION
• source subsection includes a /map qualifier, which indicates:
– chromosome (15)
– arm (q means long arm)
– cytogenetic band (q21.1)
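A map location can be split into the three parts listed above with a regular expression. The sketch assumes the value is written in the compact form "15q21.1" (chromosome, arm, band); the pattern and the function name `parse_map` are illustrative and do not cover ranges or other notations.

```python
import re

# chromosome (1-2 digits, X or Y), arm (p = short, q = long), then the band.
MAP_PATTERN = re.compile(
    r"^(?P<chromosome>[0-9]{1,2}|X|Y)(?P<arm>[pq])(?P<band>[0-9]+(?:\.[0-9]+)?)$"
)


def parse_map(location: str) -> dict[str, str]:
    """Split a cytogenetic location into chromosome, arm and band."""
    match = MAP_PATTERN.match(location)
    if match is None:
        raise ValueError(f"unrecognised map location: {location!r}")
    return match.groupdict()


parts = parse_map("15q21.1")
```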
‘TAKE HOME MESSAGE’ ADVANTAGES OF
DATA INTEGRATION
• More relevant, inter-related information in one place
• Makes it easier to find additional information related to your initial query
• Potentially find information that is indirectly linked, but relevant to your subject of interest
– Uncover non-obvious genetic features that explain a phenotype or disease
• Easier to build a ‘story’ based on multiple pieces of biological evidence

Introduction to Bioinformatics and DatabasesDay1.ppt
