Biomolecular databases
Bioinformatics
Jacques van Helden
FORMER ADDRESS (1999-2011)
Université Libre de Bruxelles, Belgique
Bioinformatique des Génomes et des Réseaux (BiGRe lab)
http://www.bigre.ulb.ac.be/
NEW ADDRESS (since Nov 1st, 2011)
Jacques.van-Helden@univ-amu.fr
Université d’Aix-Marseille, France
Lab. Technological Advances for Genomics and Clinics
(TAGC, INSERM Unit U1090)
http://tagc.univ-mrs.fr/
Contents
 Examples of biological databases
 Nucleic sequences: Genbank, EMBL, and DDBJ
 Protein sequences: UniProt
 The Gene Ontology (GO) project
 Issues and perspectives for biological databases
Examples of biomolecular databases
Biomolecular Databases
Jacques.van.Helden@ulb.ac.be
Université Libre de Bruxelles, Belgique
Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe)
http://www.bigre.ulb.ac.be/
Examples of biomolecular databases
 Sequence and structure databases
 Protein sequences (UniProt)
 DNA sequences (EMBL, Genbank, DDBJ)
 3D structures (PDB)
 Structural motifs (CATH)
 Sequence motifs (PROSITE, PRODOM)
 Genome sequences and annotations
 Genome-specific databases (SGD, FlyBase, AceDB, PlasmoDB, …)
 Multiple genomes (Integr8, NCBI, KEGG, TIGR, …)
 Molecular functions
 Transcriptional regulation (TRANSFAC, RegulonDB, InteractDB)
 Enzymatic catalysis (Expasy, LIGAND/KEGG, BRENDA)
 Transport (YTPdb)
 Biological processes
 Metabolic pathways (EcoCyc, LIGAND/KEGG, Biocatalysis/biodegradation)
 Signal transduction pathways (CSNdb, Transpath)
 Protein-protein interactions (DIP, BIND, MINT)
 Gene networks (GeneNet, FlyNets)
Databases of databases
 There are hundreds of databases related to molecular biology and biochemistry.
New databases are created every year.
 Every year, the first issue of Nucleic Acids Research is dedicated to biological
databases
 http://nar.oupjournals.org/
 2011 Issue: http://nar.oxfordjournals.org/content/39/suppl_1
 The same journal maintains a database of databases: the Molecular Biology
Database Collection
 http://www.oxfordjournals.org/nar/database/c/
 Some bioinformatics centres maintain multiple databases, with cross-links
between them. The SRS server at EBI holds an impressive collection of
databases.
 http://srs.ebi.ac.uk/
Nucleic sequence databases:
GenBank, EMBL, and DDBJ
Okubo et al. (2006) NAR 34: D6-D9
Nucleic sequence databases
 Before publishing an article dealing with a sequence, scientific journals require
the sequence to be deposited in a reference database.
 There are 3 main repositories for nucleic acid sequences.
 Sequences deposited in any of these 3 databases are automatically
synchronized with the other 2.
Adapted from Didier Gonze
The sequencing pace
 Nucleic sequences
 GenBank (April 2011) http://www.ncbi.nlm.nih.gov/genbank/
• 126,551,501,141 bases in 135,440,924 sequence records in the
traditional GenBank divisions
• 191,401,393,188 bases in 62,715,288 sequence records in the
Whole Genome Shotgun (WGS) division
 Entire genomes
 GOLD Release V.2 (Oct 2011) contains ~2000 completely sequenced
genomes.
 http://www.genomesonline.org/gold_statistics.htm
 Protein sequences
 Mostly obtained by translating putative genes found in nucleic
sequences (almost no direct protein sequencing).
 UniProtKB/TrEMBL (2011) contains 17 million protein sequences.
 http://www.ebi.ac.uk/swissprot/sptr_stats/index.html
Size of the nucleotide database
EMBL Nucleotide Sequence Database: Release Notes - Release 113 September 2012
http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html
Class                                      entries       nucleotides
--------------------------------------------------------------------
CON:Constructed                          7,236,371   359,112,791,043
EST:Expressed Sequence Tag              73,715,376    40,997,082,803
GSS:Genome Sequence Scan                34,528,104    21,985,922,905
HTC:High Throughput CDNA sequencing        491,770       594,229,662
HTG:High Throughput Genome sequencing      152,599    25,159,746,658
PAT:Patents                             24,364,832    12,117,896,594
STD:Standard                            13,920,617    37,665,112,606
STS:Sequence Tagged Site                 1,322,570       636,037,867
TSA:Transcriptome Shotgun Assembly       8,085,693     5,663,938,279
WGS:Whole Genome Shotgun                88,288,431   305,661,696,545
                                       -----------   ---------------
Total                                  252,106,363   450,481,663,919

Division                                   entries       nucleotides
--------------------------------------------------------------------
ENV:Environmental Samples               30,908,230    14,420,391,278
FUN:Fungi                                6,522,586    11,614,472,226
HUM:Human                               32,094,500    38,072,362,804
INV:Invertebrates                       31,907,138    52,527,673,643
MAM:Other Mammals                       40,012,731   145,678,620,711
MUS:Mus musculus                        11,745,671    19,701,637,499
PHG:Bacteriophage                            8,511        85,549,111
PLN:Plants                              52,428,994    55,570,452,118
PRO:Prokaryotes                          2,808,489    28,807,572,238
ROD:Rodents                              6,554,012    33,326,106,733
SYN:Synthetic                            4,045,013       782,174,055
TGN:Transgenic                             285,307       849,743,891
UNC:Unclassified                         8,617,225     4,957,442,673
VRL:Viruses                              1,358,528     1,518,575,082
VRT:Other Vertebrates                   22,809,428    42,568,889,857
                                       -----------   ---------------
Total                                  252,106,363   450,481,663,919
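The totals of the class table can be cross-checked with a few lines of Python. The sketch below copies the figures from the release 113 table above; the names CLASSES and column_sums are made up for this illustration. With these figures, the entry total equals the sum over all ten classes, while the nucleotide total equals the sum over all classes except CON. This is an observation about the numbers, not a statement taken from the release notes.

```python
# Class table of EMBL release 113, copied from the slide: code -> (entries, nucleotides).
CLASSES = {
    "CON": (7_236_371, 359_112_791_043),
    "EST": (73_715_376, 40_997_082_803),
    "GSS": (34_528_104, 21_985_922_905),
    "HTC": (491_770, 594_229_662),
    "HTG": (152_599, 25_159_746_658),
    "PAT": (24_364_832, 12_117_896_594),
    "STD": (13_920_617, 37_665_112_606),
    "STS": (1_322_570, 636_037_867),
    "TSA": (8_085_693, 5_663_938_279),
    "WGS": (88_288_431, 305_661_696_545),
}
TOTAL_ENTRIES = 252_106_363
TOTAL_NUCLEOTIDES = 450_481_663_919


def column_sums(table, skip=()):
    """Sum entries and nucleotides over all classes not listed in `skip`."""
    kept = [counts for code, counts in table.items() if code not in skip]
    return sum(e for e, _ in kept), sum(n for _, n in kept)


entries_all, nucleotides_all = column_sums(CLASSES)
_, nucleotides_without_con = column_sums(CLASSES, skip=("CON",))

print("entries match total:", entries_all == TOTAL_ENTRIES)
print("nucleotides (all classes) match total:", nucleotides_all == TOTAL_NUCLEOTIDES)
print("nucleotides (without CON) match total:", nucleotides_without_con == TOTAL_NUCLEOTIDES)
```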
Genbank (NCBI - USA)
http://www.ncbi.nlm.nih.gov/Genbank/
The EMBL Nucleotide Sequence Database (EBI - UK)
http://www.ebi.ac.uk/embl/
DDBJ - DNA Data Bank of Japan
http://www.ddbj.nig.ac.jp/
Database   URL   Sequences   Bases (without shotgun)   Bases (including shotgun)   Organisms
DDBJ http://www.ddbj.nig.ac.jp/ 2.0E+06 1.7E+09
EMBL
GenBank
http://www.ebi.ac.uk/embl/
http://www.ncbi.nlm.nih.gov/ 4.6E+07 5.1E+10
1.0E+11
1.0E+11
2.0E+05
2.1E+05
Size of the nucleic sequence databases
 Summary of database contents for the 3 main databases of nucleic sequences.
 Source: NAR database issue January 2006.
UniProt : protein sequences
and functional annotations
UniProt - the Universal Protein Resource
http://www.uniprot.org/
 Database content (Sept 2012)
 UniProtKB:
• 24,532,088 entries
• Translation of EMBL coding sequences
(non-redundant with Swiss-Prot)
 UniProtKB/Swiss-Prot section (reviewed):
• 537,505 entries
• annotation by experts
• high information content
• many references to the literature
• good reliability of the information
 The rest (about 98% of the entries, given the counts above)
• Automatic annotation by sequence similarity.
 Features
 The most comprehensive protein database in
the world.
 A huge team: >100 annotators + developers.
 Annotation by experts: annotators are
specialized for different types of proteins or
organisms.
 Recognized worldwide as an essential resource.
 References
 Bairoch et al. The SWISS-PROT protein
sequence data bank. Nucleic Acids Res (1991)
vol. 19 Suppl pp. 2247-9
 The UniProt Consortium. The Universal Protein
Resource (UniProt) 2009. Nucleic Acids Res
(2008). Database Issue.
Number of entries (polypeptides) in Swiss-Prot
http://www.expasy.org/sprot/relnotes/relstat.html
Taxonomic distribution of the sequences
Within Eukaryotes
UniProt example - Human Pax-6 protein
 Header: name and synonyms
 Manual annotation by specialists
 Structured annotation: keywords and Gene Ontology terms
 Protein interactions; alternative products
 Detailed description of regions, variations, and secondary structure
 Peptide sequence
 References to original publications
 Cross-references to many databases (fragment shown)
3D Structure of macromolecules
PDB - The Protein Data Bank
http://www.rcsb.org/pdb/
Genome browsers
EnsEMBL Genome Browser (Sanger Institute + EBI)
http://www.ensembl.org/
UCSC Genome Browser (University of California, Santa Cruz - USA)
http://genome.ucsc.edu/
 Human gene Pax6 aligned with vertebrate genomes
 Drosophila gene eyeless (homolog of Pax6) aligned with insect genomes
 Drosophila 120 kb chromosomal region covering the Achaete-Scute Complex
ECR Browser
http://ecrbrowser.dcode.org/
EnsEMBL - Example: Drosophila gene Pax6
http://www.ensembl.org/
Comparative genomics
Integr8 - access to complete genomes and proteomes
http://www.ebi.ac.uk/integr8/
Integr8 - genome summaries
http://www.ebi.ac.uk/integr8/
Integr8 - clusters of orthologous genes (COGs)
http://www.ebi.ac.uk/integr8/
Integr8 - clusters of paralogous genes
http://www.ebi.ac.uk/integr8/
Databases of protein domains
Prosite - protein domains, families and functional sites
http://www.expasy.ch/prosite/
Prosite - aligned sequences and logo
http://www.expasy.ch/prosite/
 Some of the sequences that were used to build the Prosite profile for
the Zn(2)-C6 fungal-type DNA-binding domain
(ZN2_CY6_FUNGAL_2, PS50048).
 The Sequence Logo (below) indicates the level of conservation
of each residue in each column of the alignment.
 Note the 6 cysteines, characteristic of this domain.
Prosite - Example of profile matrix
http://www.expasy.ch/prosite/
Prosite - Example of sequence logo
http://www.expasy.ch/prosite/
Prosite - Example of domain signature
http://www.expasy.ch/prosite/
 The domain signature is a string-based pattern representing the residues that
are characteristic of a domain.
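A string-based signature of this kind can be scanned against a sequence by translating it into a regular expression. The sketch below is a minimal illustration, assuming the usual PROSITE pattern syntax (elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for forbidden residues, "(n)" or "(n,m)" for repeats). The pattern and the test sequence are simplified examples written for this illustration, not the curated PROSITE entry for the Zn(2)-C6 domain, and the function name prosite_to_regex is hypothetical.

```python
import re

# One pattern element: optional "<" anchor, a residue specification,
# an optional repeat count "(n)" or "(n,m)", and an optional ">" anchor.
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{"):           # forbidden residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


# Illustrative six-cysteine signature (an assumption, not the PROSITE entry).
SIGNATURE = "C-x(2)-C-x(6)-C-x(5,12)-C-x(2)-C-x(6,8)-C"
REGEX = prosite_to_regex(SIGNATURE)

# Toy sequence containing six suitably spaced cysteines.
SEQUENCE = "AACDICRLKKLKCSKEKPKCAKCLKNNWECRR"
HIT = re.search(REGEX, SEQUENCE)
print(REGEX)
print(HIT.span() if HIT else None)
```

A regular expression only reports whether the signature matches. Unlike a profile matrix, it gives no score for near-misses.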
PFAM (Sanger Institute - UK) http://pfam.sanger.ac.uk/
Protein families represented by multiple sequence alignments and hidden Markov models (HMMs)
CATH - Protein Structure Classification
http://www.cathdb.info/
 CATH is a hierarchical classification of
protein domain structures, which clusters
proteins at four major levels:
 Class (C),
 Architecture (A),
 Topology (T),
 Homologous superfamily (H).
 The boundaries and assignments for
each protein domain are determined
using a combination of automated and
manual procedures which include
computational techniques, empirical and
statistical evidence, literature review and
expert analysis.
 References
 Orengo et al. The CATH Database
provides insights into protein structure/
function relationships. Nucleic Acids Res
(1999) vol. 27 (1) pp. 275-9
 Cuff et al. The CATH classification
revisited--architectures reviewed and new
ways to characterize structural divergence
in superfamilies. Nucleic Acids Res (2008)
pp.
InterPro (EBI - UK)
http://www.ebi.ac.uk/interpro/
InterPro (EBI - UK)
Antennapedia-like Homeobox (entry IPR001827)
The Gene Ontology (GO) database
Ontology definition
 Ontology: the part of metaphysics concerned with being as being, independently of
its particular determinations
Translated from: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1993
The "bio-ontologies"
 Answer to the problem of inconsistencies in the annotations
 Controlled vocabulary
 Hierarchical classification between the terms of the controlled vocabulary
 E.g.: The Gene Ontology
 molecular function ontology
 process ontology
 cellular component ontology
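The two ingredients listed above, a controlled vocabulary and a hierarchical classification of its terms, can be sketched as a small directed acyclic graph. The example below is a simplified, hand-written excerpt inspired by the process ontology. The term names and "is a" links are assumptions chosen for illustration and are not copied from a Gene Ontology release. Note that a term may have more than one parent, so the structure is a graph rather than a tree.

```python
# Controlled vocabulary with "is a" links: term -> list of parent terms.
# Simplified illustration; not an exact copy of the Gene Ontology.
IS_A = {
    "methionine biosynthetic process": [
        "methionine metabolic process",
        "sulfur amino acid biosynthetic process",
    ],
    "methionine metabolic process": ["sulfur amino acid metabolic process"],
    "sulfur amino acid biosynthetic process": [
        "sulfur amino acid metabolic process",
        "amino acid biosynthetic process",
    ],
    "sulfur amino acid metabolic process": ["amino acid metabolic process"],
    "amino acid biosynthetic process": ["amino acid metabolic process"],
    "amino acid metabolic process": ["biological process"],
    "biological process": [],
}


def ancestors(term, is_a):
    """Return every term reachable from `term` by following "is a" links."""
    found = set()
    stack = list(is_a[term])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(is_a[parent])
    return found


for name in sorted(ancestors("methionine biosynthetic process", IS_A)):
    print(name)
```

Because annotations use terms from this vocabulary, a query for a general term can also retrieve gene products annotated with any of its more specific descendants.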
Gene ontology: processes
Gene ontology: molecular functions
Gene ontology: cellular components
Gene Ontology Database
http://www.geneontology.org/
Example: methionine biosynthetic process
Status of GO annotations (NAR DB issue 2006)
 Term definitions
 Biological process terms: 9,805
 Molecular function terms: 7,076
 Cellular component terms: 1,574
 Sequence Ontology terms: 963
 Genomes with annotation: 30
 Excludes annotations from UniProt, which represent 261 annotated proteomes.
 Annotated gene products
 Total: 1,618,739
 Electronic only: 1,460,632
 Manually curated: 158,107
QuickGO (http://www.ebi.ac.uk/QuickGO/)
 Web site
http://www.ebi.ac.uk/QuickGO/
 A user-friendly Web interface to
the Gene Ontology.
 Graphical display of the
hierarchical relationships
between terms.
 Convenient browsing between
classes.
Remarks on "bio-ontologies"
 Improvement compared to free text
 controlled vocabulary (choice among synonyms)
 hierarchical relationships between the concepts
 Nothing to do with the philosophical concept of ontology
 A "bio-ontology" is usually nothing more than a taxonomic classification of
the terms of a controlled vocabulary
 Multiple possible classification criteria
 e.g. compartment subtypes (the plasma membrane is a membrane)
 e.g. compartment locations (the nucleus is inside the cytoplasm, which is inside the plasma
membrane)
 To be useful, should remain purpose-based
 each biologist might wish to define his/her own classification based on his/her
needs and scope of interest
 impossible to define a unifying standard for all biologists
 No representation of molecular interactions
 relationships between objects are only hierarchical, not horizontal or cyclic
 e.g. does not describe which genes are the targets of a given transcription
factor
What is biological function ?
 A general definition
 Function: the characteristic action or role of an element or organ within a whole
(often contrasted with structure)
 Translated from: Le Petit Robert - dictionnaire alphabétique et
analogique de la langue française. 1982.
 Function and gene ontology
 Understanding function requires linking a molecular activity
to the context in which it takes place (process).
 Multifunctionality
• The same activity can play different roles in different processes.
 Example: the scute gene in Drosophila melanogaster encodes a transcription factor
(activity) involved in sex determination, determination of neural precursors,
and Malpighian tubules (3 processes).
• A single protein can have multiple activities in a given process.
 Example: the PutA protein in Escherichia coli contains 2 enzymatic
domains (enzymatic activities) + a DNA-binding domain (DNA-binding
transcription factor) -> 3 molecular activities in the same process (proline
utilization).
Small compounds, reactions
and metabolic pathways
LIGAND - Small compounds and metabolic reactions
KEGG - Kyoto Encyclopedia of Genes and Genomes
EcoCyc, BioCyc and MetaCyc - Metabolic pathways
Protein interaction networks
and transduction pathways
Microarray databases
Human genome resources
HapMap
http://www.hapmap.org/
 The International HapMap
Project is a multi-country effort to
identify and catalog genetic
similarities and differences in
human beings.
 Associations between genetic
variations (SNPs, ...) and
diseases + response to
pharmaceuticals.
Issues for
biomolecular databases
Issues for biological databases
 Dealing with biological complexity
 Data content
 Coverage
 Information content
 Data quality
 Data structure
 Consistency
 Query capabilities
 Interfaces
 User interfaces
 Programmatic interfaces
 Annotation
 Funding
Towards biological complexity
 The main databases currently available each focus on one type of molecular
entity: nucleic sequences, proteins, compounds, …
 This type of organization is very convenient as long as the information to be
represented is simple (e.g. DNA sequences, structures of small molecules and
macromolecules).
 It becomes more difficult if we want to represent
 the interactions between biological objects,
 the integration of various elements in a biological process (metabolic pathways, protein
interaction networks, regulatory networks, …)
 complex concepts such as "biological function"
Data content
 Scope of the database
 types of biological objects represented
 Number of entries
 coverage of the current knowledge
 Information content
 Level of detail in the description of the biological objects
 References to the source of information
Data quality
 Data consistency
 always use the same name to indicate the same object
 (this seems trivial, but it is unfortunately still not always the case)
 even better: define an ID for each object, and allow it to be retrieved by any of its
synonyms
 avoid spelling mistakes
 Data structure
 distinct fields for distinct attributes of the biological objects
 Reliability
 Evidence? Level of confidence?
 Assignment of function by similarity
• recursive process -> propagation of errors
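The recommendation to give each object one identifier and to make it retrievable by any synonym can be sketched as a lookup index. The sketch below is a minimal illustration. The identifiers (OBJ:0001, OBJ:0002), the synonym lists and the function names are made up for the example and do not come from any real database.

```python
# Hypothetical records: one stable ID per object, plus a name and synonyms.
RECORDS = {
    "OBJ:0001": {"name": "Pax6", "synonyms": ["PAX-6", "AN2", "oculorhombin"]},
    "OBJ:0002": {"name": "eyeless", "synonyms": ["ey"]},
}


def normalize(label: str) -> str:
    """Ignore case, spaces and hyphens so that spelling variants collapse."""
    return label.strip().lower().replace("-", "").replace(" ", "")


def build_index(records):
    """Map every normalized name or synonym to the ID of its object."""
    index = {}
    for object_id, record in records.items():
        for label in [record["name"], *record["synonyms"]]:
            key = normalize(label)
            # The same label pointing to two objects is a consistency error.
            if index.get(key, object_id) != object_id:
                raise ValueError(f"ambiguous label {label!r}")
            index[key] = object_id
    return index


def resolve(label, index):
    """Return the object ID for a name or synonym, or None if unknown."""
    return index.get(normalize(label))


INDEX = build_index(RECORDS)
print(resolve("pax-6", INDEX))
print(resolve("Oculorhombin", INDEX))
print(resolve("unknown gene", INDEX))
```

Building the index also acts as a consistency check: a label shared by two different objects is reported instead of being silently overwritten.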
Query capabilities
 Browsing (click and read)
 Simple search
 select records with some constraints
 More elaborate search
 select specific fields of some records with constraints on some fields (~SQL
SELECT)
 Complex querying
 ability to return an answer that results from a "live" computation, and was not part
of any record of the database
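The query levels listed above can be illustrated with an in-memory SQLite table, using Python's standard sqlite3 module. The sketch below loads a few rows of the EMBL division table (release 113). The table name, column names and thresholds are assumptions chosen for the example. The last query returns a value computed at query time (mean number of nucleotides per entry) that is not stored in any record.

```python
import sqlite3

# A few rows of the EMBL division table: code, name, entries, nucleotides.
ROWS = [
    ("HUM", "Human", 32_094_500, 38_072_362_804),
    ("INV", "Invertebrates", 31_907_138, 52_527_673_643),
    ("MAM", "Other Mammals", 40_012_731, 145_678_620_711),
    ("PHG", "Bacteriophage", 8_511, 85_549_111),
    ("PLN", "Plants", 52_428_994, 55_570_452_118),
    ("VRL", "Viruses", 1_358_528, 1_518_575_082),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE division ("
    "code TEXT PRIMARY KEY, name TEXT, entries INTEGER, nucleotides INTEGER)"
)
conn.executemany("INSERT INTO division VALUES (?, ?, ?, ?)", ROWS)

# Simple search: select whole records that satisfy a constraint.
simple = conn.execute(
    "SELECT * FROM division WHERE code = ?", ("HUM",)
).fetchall()

# More elaborate search: specific fields, constraint on another field.
elaborate = conn.execute(
    "SELECT code, entries FROM division "
    "WHERE nucleotides > ? ORDER BY nucleotides DESC",
    (50_000_000_000,),
).fetchall()

# Complex query: a value computed on the fly (integer division in SQLite).
computed = conn.execute(
    "SELECT code, nucleotides / entries AS mean_length "
    "FROM division ORDER BY mean_length DESC LIMIT 1"
).fetchone()

print(simple)
print(elaborate)
print(computed)
conn.close()
```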
Interfaces
 User interfaces
 user-friendly
 convenient browsing
 intuitive query forms
 visualization (graphical output)
 Programmatic interfaces
 communication with external programs:
• other databases (concept of distributed database)
• analysis tools
Annotation
 Problem
 The flow of available data is increasing exponentially
 Strategies
 internal curators
 selected external experts
 public submission
 computer-based extraction of information from biological texts
Funding
 Public funding
 Problem: easier to obtain public funds for creating a new database than for
maintaining or expanding existing resources
 Private funding
 Industrial companies are
• ready to invest in good data and good query capabilities
• interested in academic expertise
 Solutions
 All users pay (per query, for example)
• Note: academic users are themselves funded by public money
 Hybrid solution
• access is free for academic users, not for companies
• companies can buy the whole database and install it in-house
(+ add their own private data)
• the academia-industry interface is often handled by a spin-off company

02.databases slides

  • 1.
    Biomolecular databases Bioinformatics Jacques vanHeldenFORMER ADDRESS (1999-2011) Université Libre de Bruxelles, Belgique Bioinformatique des Génomes et des Réseaux (BiGRe lab) http://www.bigre.ulb.ac.be/ NEW ADDRESS (since Nov 1st,2011) Jacques.van-Helden@univ-amu.fr Université d’Aix-Marseille, France Lab. Technological Advances for Genomics and Clinics (TAGC, INSERM Unit U1090) http://tagc.univ-mrs.fr/ B!GRe Bioinformatique des Génomes et Réseaux !"#$%&'&()#*' *,-*%#". /&0("%&1)#. *%, #')%)# ! "#$ Inserm U1090
  • 2.
    Contents  Examples ofbiological databases  Nucleic sequences: Genbank, EMBL, and DDBJ  Protein sequences: UniProt  The Gene Ontology (GO) project  Issues and perspectives for biological databases
  • 3.
    Examples of biomoleculardatabases Biomolecular Databases Jacques.van.Helden@ulb.ac.be Université Libre de Bruxelles, Belgique Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe) http://www.bigre.ulb.ac.be/
  • 4.
    Examples of biomoleculardatabases  Sequence and structure databases  Protein sequences (UniProt)  DNA sequences (EMBL, Genbank, DDBJ)  3D structures (PDB)  Structural motifs (CATH)  Sequence motifs (PROSITE, PRODOM)  Genome sequences and annotations  Genome-specific databases (SGD, FlyBase, AceDB, PlasmoDB, …)  Multiple genomes (Integr8, NCBI, KEGG, TIGR, …)  Molecular functions  Transcriptional regulation (TRANSFAC, RegulonDB, InteractDB)  Enzymatic catalysis (Expasy, LIGAND/KEGG, BRENDA)  Transport (YTPdb)  Biological processes  Metabolic pathways (EcoCyc, LIGAND/KEGG, Biocatalysis/biodegradation)  Signal transduction pathways (CSNdb, Transpath)  Protein-protein interactions (DIP, BIND, MINT)  Gene networks (GeneNet, FlyNets)
  • 5.
    Databases of databases There are hundreds of databases related to molecular biology and biochemistry. New databases are created every year.  Every year, the first issue of Nucleic Acids Research is dedicated to biological databases  http://nar.oupjournals.org/  2011 Issue: http://nar.oxfordjournals.org/content/39/suppl_1  The same journal maintains a database of databases: the Molecular Biology Database Collection  http://www.oxfordjournals.org/nar/database/c/  Some bioinformatics centres maintain multiple database, with cross-links between them. The SRS server at EBI holds an impressive collection of databases.  http://srs.ebi.ac.uk/
  • 6.
    Nucleic sequence databases: GenBank,EMBL, and DDBJ Biomolecular Databases Jacques.van.Helden@ulb.ac.be Université Libre de Bruxelles, Belgique Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe) http://www.bigre.ulb.ac.be/
  • 7.
    Okubo et al.(2006) NAR 34: D6-D9 Nucleic sequence databases  To publish an article dealing with a sequence, scientific journals impose to have previously deposited this sequence in a reference database.  There are 3 main repositories for nucleic acid sequences.  Sequences deposited in any of these 3 databases are automatically synchronized in the 2 other ones.
  • 8.
    Adapted from DidierGonze The sequencing pace  Nucleic sequences  Genbank (April 2011) http://www.ncbi.nlm.nih.gov/genbank/ • 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions • 191,401,393,188 bases in 62,715,288 sequence records in the Whole Genome Ssequencing  Entire genomes  GOLD Release V.2 (Oct 2011) contains ~2000 completely sequenced genomes.  http://www.genomesonline.org/gold_statistics.htm  Protein sequences  Essentially obtained by translation of putative genes in nucleic sequences (almost no direct protein sequencing).  UniProtKB/TrEMBL (2011) contains 17 millions of protein sequences.  http://www.ebi.ac.uk/swissprot/sptr_stats/index.html
  • 9.
    Size of thenucleotide database EMBL Nucleotide Sequence Database: Release Notes - Release 113 September 2012 http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html Class entries nucleotides ------------------------------------------------------------------ CON:Constructed 7,236,371 359,112,791,043 EST:Expressed Sequence Tag 73,715,376 40,997,082,803 GSS:Genome Sequence Scan 34,528,104 21,985,922,905 HTC:High Throughput CDNA sequencing 491,770 594,229,662 HTG:High Throughput Genome sequencing 152,599 25,159,746,658 PAT:Patents 24,364,832 12,117,896,594 STD:Standard 13,920,617 37,665,112,606 STS:Sequence Tagged Site 1,322,570 636,037,867 TSA:Transcriptome Shotgun Assembly 8,085,693 5,663,938,279 WGS:Whole Genome Shotgun Total 88,288,431 ----------- 252,106,363 305,661,696,545 --------------- 450,481,663,919 Division entries nucleotides ------------------------------------------------------------------ ENV:Environmental Samples 30,908,230 14,420,391,278 FUN:Fungi 6,522,586 11,614,472,226 HUM:Human 32,094,500 38,072,362,804 INV:Invertebrates 31,907,138 52,527,673,643 MAM:Other Mammals 40,012,731 145,678,620,711 MUS:Mus musculus 11,745,671 19,701,637,499 PHG:Bacteriophage 8,511 85,549,111 PLN:Plants 52,428,994 55,570,452,118 PRO:Prokaryotes 2,808,489 28,807,572,238 ROD:Rodents 6,554,012 33,326,106,733 SYN:Synthetic 4,045,013 782,174,055 TGN:Transgenic 285,307 849,743,891 UNC:Unclassified 8,617,225 4,957,442,673 VRL:Viruses 1,358,528 1,518,575,082 VRT:Other Vertebrates Total 22,809,428 ----------- 252,106,363 42,568,889,857 --------------- 450,481,663,919
  • 10.
    Genbank (NCBI -USA) http://www.ncbi.nlm.nih.gov/Genbank/
  • 11.
    The EMBL NucleotideSequence Database (EBI - UK) http://www.ebi.ac.uk/embl/
  • 12.
    DDBJ - DNAData Bank of Japan http://www.ddbj.nig.ac.jp/
  • 13.
    URL Sequences Bases (without shotgun) bases (including shotgun) Organisms DDBJhttp://www.ddbj.nig.ac.jp/ 2.0E+06 1.7E+09 EMBL GenBank http://www.ebi.ac.uk/embl/ http://www.ncbi.nlm.nih.gov/ 4.6E+07 5.1E+10 1.0E+11 1.0E+11 2.0E+05 2.1E+05 Size of the nucleic sequence databases  Summary of database contents for the 3 main databases of nucleic sequences.  Source: NAR database issue January 2006.
  • 14.
    UniProt : proteinsequences and functional annotations Biomolecular Databases Jacques.van.Helden@ulb.ac.be Université Libre de Bruxelles, Belgique Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe) http://www.bigre.ulb.ac.be/
  • 15.
    UniProt - theUniversal Protein Resource http://www.uniprot.org/  Database content (Sept 2012)  UniProtKB: • 24,532,088 entries • Translation of EMBL coding sequences (non-redundant with Swiss-Prot)  UniProtKB/Swiss-Prot section (reviewed): • 537,505 entries • annotation by experts • high information content • many references to the literature • good reliability of the information  The rest (90% of the entries) • Automatic annotation by sequence similarity.  Features  The most comprehensive protein database in the world.  A huge team: >100 annotators + developers.  Annotation by experts: annotators are specialized for different types of proteins or organisms.  World-wide recognized as an essential resource.  References  Bairoch et al. The SWISS-PROT protein sequence data bank. Nucleic Acids Res (1991) vol. 19 Suppl pp. 2247-9  The UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res (2008). Database Issue. Number of entries (polypeptides) in Swiss-Prot http://www.expasy.org/sprot/relnotes/relstat.html Taxonomic distribution of the sequences Within Eukaryotes
  • 16.
    UniProt example -Human Pax-6 protein Header : name and synonyms
  • 17.
    UniProt example -Human Pax-6 protein Human-based annotation by specialists
  • 18.
    UniProt example -Human Pax-6 protein Structured annotation : keywords and Gene Ontology terms
  • 19.
    UniProt example -Human Pax-6 protein Protein interactions; Alternative products
  • 20.
    UniProt example -Human Pax-6 protein Detailed description of regions, variations, and secondary structure
  • 21.
    UniProt example -Human Pax-6 protein Peptidic sequence
  • 22.
    UniProt example -Human Pax-6 protein References to original publications
  • 23.
    UniProt example -Human Pax-6 protein Cross-references to many databases (fragment shown)
  • 24.
    3D Structure ofmacromolecules Jacques.van.Helden@ulb.ac.be Université Libre de Bruxelles, Belgique Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe) http://www.bigre.ulb.ac.be/
  • 25.
    PDB - TheProtein Data Bank http://www.rcsb.org/pdb/
  • 26.
    Genome browsers Jacques.van.Helden@ulb.ac.be Université Librede Bruxelles, Belgique Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe) http://www.bigre.ulb.ac.be/
  • 27.
    EnsEMBL Genome Browser(Sanger Institute + EBI) http://www.ensembl.org/
  • 28.
    UCSC Genome Browser(University California Santa Cruz - USA) http://genome.ucsc.edu/ Human gene Pax6 aligned with Vertebrate genomes
  • 29.
    UCSC Genome Browser(University California Santa Cruz - USA) http://genome.ucsc.edu/ Drosophila gene eyeless (homolog to Pax6) aligned with Insect genomes
  • 30.
    UCSC Genome Browser(University California Santa Cruz - USA) http://genome.ucsc.edu/ Drosophila 120kb chromosomal region covering the Achaete-Scute Complex
  • 31.
  • 32.
    EnsEMBL - Example:Drosophila gene Pax6 http://www.ensembl.org/
  • 33.
    Comparative genomics Jacques.van.Helden@ulb.ac.be Université Librede Bruxelles, Belgique Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe) http://www.bigre.ulb.ac.be/
  • 34.
    Integr8 - accessto complete genomes and proteomes http://www.ebi.ac.uk/integr8/
  • 35.
    Integr8 - genomesummaries http://www.ebi.ac.uk/integr8/
  • 36.
    Integr8 - clustersof orthologous genes (COGs) http://www.ebi.ac.uk/integr8/
  • 37.
    Integr8 - clustersof paralogous genes http://www.ebi.ac.uk/integr8/
  • 38.
    Databases of proteindomains Jacques.van.Helden@ulb.ac.be Université Libre de Bruxelles, Belgique Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe) http://www.bigre.ulb.ac.be/
  • 39.
    Prosite - proteindomains, families and functional sites http://www.expasy.ch/prosite/
  • 40.
    Prosite - alignedsequences and logo http://www.expasy.ch/prosite/  Some of the sequences that were used to built the Prosite profile for the Zn(2)-C6 fungal-type DNA- binding domain (ZN2_CY6_FUNGAL_2, PS50048).  The Sequence Logo (below) indicates the level of conservation of each residue in each column of the alignment.  Note the 6 cysteines, characteristic of this domain.
  • 41.
    Prosite - Exampleof profile matrix http://www.expasy.ch/prosite/
  • 42.
    Prosite - Exampleof sequence logo http://www.expasy.ch/prosite/
  • 43.
    Prosite - Exampleof domain signature http://www.expasy.ch/prosite/  The domain signature is a string-based pattern representing the residues that are characteristic of a domain.
  • 44.
    PFAM (Sanger Institute- UK) http://pfam.sanger.ac.uk/ Protein families represented by multiple sequence alignments and hidden Markov models (HMMs)
  • 45.
    CATH - ProteinStructure Classification http://www.cathdb.info/  CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels:  Class (C),  Architecture (A),  Topology (T)  Homologous superfamily (H).  The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures which include computational techniques, empirical and statistical evidence, literature review and expert analysis.  References  Orengo et al. The CATH Database provides insights into protein structure/ function relationships. Nucleic Acids Res (1999) vol. 27 (1) pp. 275-9  Cuff et al. The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res (2008) pp.
  • 46.
    CATH - ProteinStructure Classification http://www.cathdb.info/
  • 47.
    InterPro (EBI -UK) http://www.ebi.ac.uk/interpro/
  • 48.
    InterPro (EBI -UK) Antennapedia-like Homeobox (entry IPR001827)
  • 49.
    The Gene Ontology(GO) database Biomolecular Databases Jacques.van.Helden@ulb.ac.be Université Libre de Bruxelles, Belgique Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe) http://www.bigre.ulb.ac.be/
  • 50.
    Ontology definition  Ontology: the part of metaphysics that deals with being as being, independently of its particular determinations.  Translated from the French definition in Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1993
  • 51.
    The "bio-ontologies"  Answer to the problem of inconsistencies in the annotations  Controlled vocabulary  Hierarchical classification of the terms of the controlled vocabulary  E.g.: The Gene Ontology  molecular function ontology  process ontology  cellular component ontology
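    A hierarchical classification of controlled terms can be stored as a mapping from each term to its parent terms. The sketch below is a minimal illustration: the term names and the parent links are a hand-written toy hierarchy, loosely inspired by the process ontology and not copied from the actual Gene Ontology, and `ancestors` is a hypothetical helper name.

```python
# Toy hierarchy (assumption): each term maps to its more general parent terms.
# A term may have several parents, so the structure is a graph, not a tree.
TOY_PARENTS = {
    "methionine biosynthetic process": [
        "amino acid biosynthetic process",
        "methionine metabolic process",
    ],
    "amino acid biosynthetic process": ["biosynthetic process"],
    "methionine metabolic process": ["metabolic process"],
    "biosynthetic process": ["metabolic process"],
    "metabolic process": ["biological process"],
}

def ancestors(term, parents):
    """Return every term that is more general than `term`."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found

general_terms = ancestors("methionine biosynthetic process", TOY_PARENTS)
for name in sorted(general_terms):
    print(name)
```

    With such a structure, a gene annotated with a specific term can also be retrieved by a query on any of the more general terms.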
  • 56.
    Gene Ontology Database (http://www.geneontology.org/)  Example: methionine biosynthetic process
  • 57.
    Status of GO annotations (NAR DB issue 2006)
      Term definitions
        Biological process terms: 9,805
        Molecular function terms: 7,076
        Cellular component terms: 1,574
        Sequence Ontology terms: 963
      Genomes with annotation: 30
        Excludes annotations from UniProt, which represent 261 annotated proteomes.
      Annotated gene products
        Total: 1,618,739
        Electronic only: 1,460,632
        Manually curated: 158,107
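    As a quick check of these figures, the short calculation below verifies that the two annotation categories add up to the total, and derives the share of manually curated gene products. The numbers are those reported on the slide; the variable names are only illustrative.

```python
# Annotated gene products (NAR DB issue 2006, as reported above).
total = 1_618_739
electronic_only = 1_460_632
manually_curated = 158_107

# The two categories are expected to partition the total.
consistent = (electronic_only + manually_curated == total)

manual_share = manually_curated / total
electronic_share = electronic_only / total

print(f"consistent: {consistent}")
print(f"manually curated: {manual_share:.1%}")
print(f"electronic only:  {electronic_share:.1%}")
```

    Roughly one annotated gene product in ten had been manually curated at that time.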
  • 58.
    QuickGO (http://www.ebi.ac.uk/QuickGO/)  Website http://www.ebi.ac.uk/QuickGO/  A user-friendly Web interface to the Gene Ontology.  Graphical display of the hierarchical relationships between terms.  Convenient browsing between classes.
  • 59.
    Remarks on "bio-ontologies"  Improvement compared to free text  controlled vocabulary (choice among synonyms)  hierarchical relationships between the concepts  Nothing to do with the philosophical concept of ontology  A "bio-ontology" is usually nothing more than a taxonomic classification of the terms of a controlled vocabulary  Multiple possible classification criteria  e.g. compartment subtypes (the plasma membrane is a membrane)  e.g. compartment locations (the nucleus is inside the cytoplasm, which is inside the plasma membrane)  To be useful, should remain purpose-based  each biologist might wish to define his/her own classification based on his/her needs and scope of interest  impossible to define a unifying standard for all biologists  No representation of molecular interactions  relationships between objects are only hierarchical, not horizontal or cyclic  e.g. does not describe which genes are the targets of a given transcription factor
  • 60.
    What is biological function?  A general definition  Function: characteristic action or role of an element or organ within a whole (often opposed to structure). Translated from the French definition in Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1982.  Function and gene ontology  Understanding function requires establishing the link between a molecular activity and the context in which it takes place (process).  Multifunctionality • The same activity can play different roles in different processes.  Example: scute gene in Drosophila melanogaster: a transcription factor (activity) involved in sex determination, determination of neural precursors and Malpighian tubules (3 processes). • Multiple activities of the same protein in a given process  Example: the proline utilization protein PutA in Escherichia coli contains 2 enzymatic domains (enzymatic activities) + a DNA-binding domain (DNA-binding transcription factor) -> 3 molecular activities in the same process (proline utilization).
  • 61.
    Small compounds, reactions and metabolic pathways
  • 62.
    LIGAND - Small compounds and metabolic reactions
  • 63.
    KEGG - Kyoto Encyclopedia of Genes and Genomes
  • 64.
    EcoCyc, BioCyc and MetaCyc - Metabolic pathways
  • 65.
    Protein interaction networks and transduction pathways
  • 66.
    Microarray databases
  • 67.
    Human genome resources
  • 68.
    HapMap http://www.hapmap.org/  The International HapMap Project is a multi-country effort to identify and catalog genetic similarities and differences in human beings.  Associations between genetic variations (SNPs, ...) and diseases, as well as responses to pharmaceuticals.
  • 69.
    Issues for biomolecular databases
  • 70.
    Issues for biological databases  Dealing with biological complexity  Data content  Coverage  Information content  Data quality  Data structure  Consistency  Query capabilities  Interfaces  User interfaces  Programmatic interfaces  Annotation  Funding
  • 71.
    Towards biological complexity  The main databases currently available each focus on one type of molecular entity: nucleic sequences, proteins, compounds, …  This type of organization is very convenient as long as the information to be represented is simple (e.g. DNA sequences, structures of small molecules and macromolecules).  It becomes more difficult if we want to represent  the interactions between biological objects,  the integration of various elements in a biological process (metabolic pathways, protein interaction networks, regulatory networks, …)  complex concepts such as "biological function"
  • 72.
    Data content  Scopeof the database  types of biological objects represented  Number of entries  coverage of the current knowledge  Information content  Level of detail in the description of the biological objects  References to the source of information
  • 73.
    Data quality  DataConsistency  always use the same name to indicate the same object  (this seems trivial, but its is unfortunately still not always the case)  event better: define an ID for each objects, and allow to retrieve it by any of its synonyms  spelling mistakes  Data Structuration  distinct fields for distinct attributes of the biological objects  Reliability  Evidences ? Level of confidence ?  Assignation of function by similarity • recursive process  propagation of errors
  • 74.
    Query capabilities  Browsing(click and read)  Simple search  select records with some constraints  More elaborate search  select specific fields of some records with constraints on some fields (~SQL SELECT)  Complex querying  ability to return an answer that results from a "live" computation, and was not part of any record of the dabatase
  • 75.
    Interfaces  User interfaces user-friendly  convenient browsing  intuitive query forms  visualization (graphical output)  Programmatic interfaces  communication with external programs: • other databases (concept of distributed database) • analysis tools
  • 76.
    Annotation  Problem  Theflow of available data is increasing exponentially  Strategies  internal curators  selected external experts  public submission  computer-based extraction of information from biological texts
  • 77.
    Funding  Public funding Problem: easier to obtain public funds for creating a new database than for maintaining or expanding existing resources  Private funding  Industrial companies are • ready to invest in good data and good query capabilities • interested by academic expertise  Solutions  All users pay (per query for example) • Note: academic users are anyway funded by public funds  Hybrid solution • access is free for academic users, not for companies • companies can buy the whole database an install it in-house (+ add their own private data) • academia-industry interface is often ensured by a spinoff company