Basic bioinformatics concepts, databases and tools
Module 4
Beyond the sequences

Dr. Joachim Jacob
http://www.bits.vib.be

Updated Nov 2011
http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod4-intro_H1_2011_otherRelevantData.pdf
Module 4 broadens our view
To understand life, we need not only
sequences, but many other concepts
      
          Bioinformatics also covers storing and analyzing
             −   Gene information: variations, isoforms,...
             −   Expression data
             −   3D protein structure data
             −   Interaction data
             −   Pathways and networks


                     “Storing all relevant biological data”
Schematic view II
GeneA                sequence     annotations – gene expr – pathway – struct,...

GeneB                sequence     annotations – gene expr – pathway – struct,...

GeneC                sequence     annotations – gene expr – pathway – struct,...


[Diagram labels: primary database, other sequence databases, analysis, results, additional information sources]
The indispensable databases
      
          Gene Ontology – structured vocabulary for gene annotation
          KEGG – biochemical pathways
          PDB – structures of proteins
          IntAct – interaction data
          dbSNP – database of genomic variation
          Expression sources – microarray data
Gene Ontology structures the way we
communicate about life




Different words for the same process: gene translation, protein production, protein synthesis



                                            http://www.arabidopsis.org/help/tutorials/go1.jsp
  http://www.geneontology.org/teaching_resources/tutorials/2005-09_BiB-journal-tutorial_jlomax
Gene Ontology structures life
               http://www.geneontology.org/
               Agreement on standardized keywords (often referred to as
                 'controlled vocabularies'), describing all natural processes in a
                 hierarchical way (ontology).
               Keywords are assigned to genes based on different kinds of evidence
               Keywords are ordered in a hierarchical, tree-like structure (a
                 'directed acyclic graph')
               Three GO 'trees' exist, describing:
                                 "Biological Process"
                                 "Cellular Component"
                                 "Molecular Function"
A gene can be given
different GO terms

 Example, cytochrome c:

     molecular function: oxidoreductase activity,

     biological process: oxidative phosphorylation and
 induction of cell death,

     cellular component: mitochondrial matrix and
 mitochondrial inner membrane.

 In each tree, the terms are organised in a directed acyclic
 graph: a network of parent and child terms (the nodes)
 connected by relationships (the edges). Unlike in a strict
 tree, a term can have more than one parent.
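
 A minimal sketch of the directed acyclic graph idea, assuming a toy set of terms. The term names and parent links below are simplified for illustration and are not copied from the real GO files. It shows a term with two parents and how all ancestors can be collected by walking upwards.

```python
# Toy GO-like DAG: each term maps to its parent terms (illustrative only,
# not taken from the real Gene Ontology files).
PARENTS = {
    "cellular component": [],
    "membrane": ["cellular component"],
    "organelle inner membrane": ["membrane"],
    "mitochondrial membrane": ["membrane"],
    # A child with two parents: this is what makes it a DAG, not a tree.
    "mitochondrial inner membrane": [
        "organelle inner membrane",
        "mitochondrial membrane",
    ],
}


def ancestors(term, parents=PARENTS):
    """Return the set of all terms reachable by following parent links."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found


if __name__ == "__main__":
    for name in sorted(ancestors("mitochondrial inner membrane")):
        print(name)
```

 Because "membrane" is reachable along two paths, the set keeps it only once.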
Different evidence codes can assign a
degree of confidence to the assignment
         http://www.geneontology.org/GO.evidence.shtml

         Evidence codes can be grouped into:
             Experimental (e.g. IDA – inferred from direct assay)
             Computational analysis
             Author statement
             Curator statement
             Inferred from electronic annotation (IEA)
         Each annotation also has a reference, if available
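
         The grouping above can be used to filter annotations by how they were obtained. The sketch below is only an illustration: the gene and term names are hypothetical, and the code table is a partial selection that should be checked against the GO evidence code documentation.

```python
# Evidence code -> group. Group names follow the slide; the list of codes is
# a partial selection (an assumption of this sketch, not a complete table).
EVIDENCE_GROUPS = {
    "EXP": "experimental", "IDA": "experimental", "IPI": "experimental",
    "IMP": "experimental", "IGI": "experimental", "IEP": "experimental",
    "ISS": "computational", "RCA": "computational",
    "TAS": "author statement", "NAS": "author statement",
    "IC": "curator statement", "ND": "curator statement",
    "IEA": "electronic",
}


def keep_groups(annotations, wanted):
    """Keep (gene, term, code) tuples whose evidence group is in `wanted`."""
    return [a for a in annotations if EVIDENCE_GROUPS.get(a[2]) in wanted]


# Hypothetical annotations, made up for this example.
ANNOTATIONS = [
    ("geneA", "oxidoreductase activity", "IDA"),
    ("geneB", "oxidoreductase activity", "IEA"),
    ("geneC", "oxidative phosphorylation", "TAS"),
]

if __name__ == "__main__":
    print(keep_groups(ANNOTATIONS, {"experimental"}))
```

         Dropping the electronic (IEA) annotations is a common way to restrict an analysis to manually reviewed evidence, at the cost of coverage.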
Gene Ontology structures all genes
according to their biological significance
         The GO structure and its terms can be explored with a browser
           called AmiGO.
         QuickGO from EBI offers some nice visualisations
         There is an excellent GO wiki for all your questions
GO can be used to retrieve all gene
(products) related to one specific term
         You can search broadly, e.g. an AmiGO search for 'diabetes'
           returns the matching GO terms
         http://amigo.geneontology.org/
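
         Retrieving all gene products for a term usually means also collecting the genes annotated to its more specific child terms. The sketch below illustrates this with hypothetical terms and genes; it does not query AmiGO.

```python
# Toy hierarchy (term -> child terms) and toy annotations (gene -> terms).
# All names are hypothetical and only serve to show the lookup.
CHILDREN = {
    "response to stimulus": ["response to hormone", "response to nutrient"],
    "response to hormone": ["response to insulin"],
    "response to nutrient": [],
    "response to insulin": [],
}
GENE_TERMS = {
    "geneA": {"response to insulin"},
    "geneB": {"response to nutrient"},
    "geneC": {"response to stimulus"},
}


def descendants(term):
    """Collect every term below `term` in the hierarchy."""
    found = set()
    stack = list(CHILDREN[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(CHILDREN[current])
    return found


def genes_for(term):
    """Genes annotated to `term` itself or to any of its descendants."""
    terms = {term} | descendants(term)
    return sorted(g for g, annotated in GENE_TERMS.items() if annotated & terms)


if __name__ == "__main__":
    print(genes_for("response to hormone"))
```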
GO is also useful to analyze and compare
different gene lists
          Many tools that work with GO are listed on the GO website.




                                http://www.geneontology.org/GO.tools.shtml
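
          Comparing a gene list against GO often comes down to asking whether a term is over-represented in the list. The slide does not name a statistical test; the sketch below assumes a one-sided hypergeometric test, which is one commonly used option, and the counts in the usage line are made up.

```python
from math import comb


def enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) for a hypergeometric draw.

    population: number of genes in the background set
    annotated:  background genes carrying the GO term
    sample:     size of the gene list under study
    hits:       genes in the list carrying the GO term
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 40 of 1000 background genes carry the term,
    # and 8 of the 50 genes in our list do.
    print(enrichment_p(1000, 40, 50, 8))
```

          When many terms are tested at once, the resulting p-values need a multiple-testing correction, which this sketch leaves out.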
Some things to know about GO
         For analyses, one can make use of reduced GO sets,
           the so-called GO slims
                –   GO slims are cut-down subsets of GO terms that keep
                    the broader, biologically more relevant categories
                    (available per species)
                –   GO ontologies can be downloaded in .obo
                    format.
         Not all information is captured by GO; the rest needs to be
           retrieved from other databases
                Metabolic pathways: KEGG, …
                Phenotype/diseases
                       •   Mapping files exist, e.g. kegg2go
                              http://www.geneontology.org/GO.slims.shtml
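
         A minimal sketch of reading the .obo format, assuming the usual layout of [Term] stanzas with "id:", "name:" and "is_a:" lines. The sample text is hand-written for this example; check the identifiers and links against a real download before relying on them.

```python
# Hand-written sample in OBO style (an assumption of this sketch, not an
# excerpt from a downloaded file).
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: GO:0008150
name: biological_process
namespace: biological_process

[Term]
id: GO:0008152
name: metabolic process
namespace: biological_process
is_a: GO:0008150 ! biological_process
"""


def parse_obo_terms(text):
    """Return {term_id: {"name": ..., "namespace": ..., "is_a": [...]}}."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            current = {"is_a": []} if line == "[Term]" else None
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            value = value.split(" ! ")[0]  # drop the trailing comment
            if key == "id":
                terms[value] = current
            elif key == "is_a":
                current["is_a"].append(value)
            else:
                current[key] = value
    return terms


if __name__ == "__main__":
    for term_id, fields in parse_obo_terms(SAMPLE_OBO).items():
        print(term_id, fields["name"], fields["is_a"])
```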
Biological pathways databases organise
genes by molecular reactions
        Three important databases on biological pathways
            KEGG: http://www.kegg.jp/
            Reactome (EBI): http://www.reactome.org/
            MetaCyc: http://metacyc.org
Proteins with enzymatic function receive
an Enzyme Commission (EC) number
        http://www.chem.qmul.ac.uk/iubmb/enzyme/
        EC 6   Ligases
        EC 5   Isomerases
        EC 4   Lyases
        EC 3   Hydrolases
        EC 2   Transferases
        EC 1   Oxidoreductases
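
        The first field of an EC number gives the enzyme class listed above. A small sketch, using the six classes from this slide; the helper name is made up for illustration.

```python
# The six top-level classes listed on the slide.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def ec_class(ec_number):
    """Return the top-level class name for an EC number such as 'EC 1.1.1.1'."""
    digits = ec_number.replace("EC", "").strip()
    first = int(digits.split(".")[0])
    return EC_CLASSES[first]


if __name__ == "__main__":
    # EC 1.1.1.1 is alcohol dehydrogenase.
    print(ec_class("EC 1.1.1.1"))
```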
IntAct database contains interaction
information of proteins
         http://www.ebi.ac.uk/intact
         Three types of interactions are stored
                Protein-protein
                Protein-DNA
                Protein-small molecule
IntAct database represents all
interactions as binary: caution!
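
One reason for caution with binary representations is that a complex of several proteins has to be expanded into pairs, and the pairs do not all stand for direct contacts. The sketch below contrasts two common expansion schemes, "spoke" and "matrix". Which scheme a given database applies is not stated on the slide, and the protein names are hypothetical.

```python
from itertools import combinations


def spoke_expansion(bait, preys):
    """Pair the bait with every co-purified protein."""
    return [(bait, prey) for prey in preys]


def matrix_expansion(members):
    """Pair every member of the complex with every other member."""
    return list(combinations(members, 2))


if __name__ == "__main__":
    # Hypothetical pull-down: bait A co-purifies with B, C and D.
    print(spoke_expansion("A", ["B", "C", "D"]))
    print(matrix_expansion(["A", "B", "C", "D"]))
```

The same experiment yields three pairs under the spoke model and six under the matrix model, so the number of "interactions" depends on the representation.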
Interaction networks can be analysed on
your computer using Cytoscape




                    Cytoscape training material on the BITS website
PDB hosts 3-dimensional
structural data on molecules

         PDB = Protein DataBank
             http://www.pdb.org/pdb/home/home.do
         Only structures resolved through NMR and X-ray
           (or other accurate techniques)
         
             Proteins
         
             DNA
         
             RNA
         
             Ligands

         Understanding PDB data: tutorial
PDB files can be read by a lot of different
  tools to display the structure
                       Every entry in PDB has its own PDB accession
                         number (four characters: a digit followed by
                         three letters or digits)
                       The PDB file contains the 3D coordinates of every
                         single atom in the structure, together with the
                         variability of that position (the last two numeric
                         columns: occupancy and temperature factor)




http://www.bits.vib.be/index.php?option=com_content&view=article&id=17203817:protein-structure-
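
A minimal sketch of reading one ATOM record from a PDB file. The format uses fixed columns, so fields are taken by position rather than by splitting on spaces. The sample record is built with a format string to keep the columns exact, and its values are invented for illustration.

```python
# One ATOM record, assembled field by field so each value sits in its fixed
# column. The atom, residue and numbers are invented for this example.
SAMPLE_ATOM = (
    "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
    "{:>8.3f}{:>8.3f}{:>8.3f}{:>6.2f}{:>6.2f}"
).format("ATOM", 1, " N", "", "MET", "A", 1, "", 27.340, 24.430, 2.614, 1.00, 9.67)


def parse_atom(line):
    """Extract the main fields of an ATOM record by column position."""
    return {
        "serial": int(line[6:11]),
        "atom": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),  # temperature factor
    }


if __name__ == "__main__":
    print(SAMPLE_ATOM)
    print(parse_atom(SAMPLE_ATOM))
```

For real work, an existing structure parser is the safer choice; the point here is only where the coordinates and the last two numeric columns sit.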
PDB files can be read by a lot of different
tools to display the structure
         Tools to visualize (and some to analyze
           structures) (see BITS wiki)




                      http://www.bits.vib.be/wiki/index.php/Protein_structure
Finding a structure for your protein
  sequence means searching for similarity
               Homology modeling
               Similarity on sequence level, projected onto a structure
                    BLAST your query against the PDB database with cblast, or at ExPASy
                    PSI-BLAST can detect sequences with similar structures
                     (twilight zone!)
                    If still no success: 3D-Jury (a meta approach, including fold
                     recognition and local structure prediction)
               Similarity on structural level: aligning structures
                    VAST (structure)
                    DALI (Distance mAtrix aLIgnment)

                                             BITS training on protein structure analysis
                http://www.ii.uib.no/~slars/bioinfocourse/PDFs/structpred_tutorial.pdf
Tools at EBI                           http://consurf.tau.ac.il/pe/protexpl/psbiores.htm
Structural information is used to classify
proteins
             Database cross-references in PDB entry
                 SCOP
             Groups proteins based on evolutionary relationships, domain
               architecture and structural information.
                 CATH
             Manually curated classification of protein domains

                                           http://scop.mrc-lmb.cam.ac.uk/scop/
                                                        http://www.cathdb.info/
dbSNP is a public-domain archive for
simple genetic polymorphisms
      
          Single Nucleotide Polymorphism database (NCBI)
          Each dbSNP entry has a code rsxx (RefSNP) or ssxx
          (submitted SNP)
          The archive covers:
              single-base nucleotide substitutions (also known as
              single nucleotide polymorphisms or SNPs),
              small-scale multi-base deletions or insertions (also
              called deletion insertion polymorphisms or DIPs),
              retroposable element insertions and microsatellite
              repeat variations (also called short tandem repeats or
              STRs).
          Synchronized with new genome builds
Expression data can be sequence-based
or hybridisation-based
      Sequence-based (ESTs, RNA-seq, SAGE)
            Digital gene expression/northern
      Microarray databases (hybridisation-based):
            GEO: Gene Expression Omnibus (NCBI)
             −   Platform: GPLxxxxxxx
             −   Experiment (series): GSExxxxxx (= several samples)
             −   Sample: GSMxxxxxxxx
             −   Some experiments are curated: GDSxxxxx (online
                 analysis possible)
        
            ArrayExpress (EBI)
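
      The GEO accession prefixes listed above tell you what kind of record you are looking at. A small sketch of that lookup; the function name and the accession numbers in the usage lines are made up for illustration.

```python
import re

# Prefix -> record type, as listed on the slide.
GEO_TYPES = {
    "GPL": "platform",
    "GSE": "experiment (series)",
    "GSM": "sample",
    "GDS": "curated dataset",
}


def geo_type(accession):
    """Return the record type for a GEO accession, or None if not recognised."""
    match = re.fullmatch(r"(GPL|GSE|GSM|GDS)(\d+)", accession.strip().upper())
    return GEO_TYPES[match.group(1)] if match else None


if __name__ == "__main__":
    # Hypothetical accession numbers.
    for acc in ["GSE1234", "GSM56789", "GPL96", "XYZ1"]:
        print(acc, "->", geo_type(acc))
```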
Example of expression data at GEO
Example at ArrayExpress
Entrez interconnects the databases at
NCBI for easy querying
        
            UniGene : sequences grouped by gene
        
            PopSet : sequence alignments for population
            studies and phylogeny
        
            Structure : 3D structures (PDB)
        
            Genome : genomic maps of chromosomes and
            plasmids
        
            UniSTS (Sequence Tagged Sites)
        
            PubMed : literature abstracts (MEDLINE,…)
        
            OMIM (Online Mendelian Inheritance in Man) :
            literature reviews

            MeSH (Medical Subject Headings) : keywords
        
            Taxonomy
Finding relevant data
Summary of the most important links to
discover everything you need ...
             Protein data
               InterPro (heavily integrated with EBI resources)
               http://www.interpro.org

             Gene data
               Entrez at NCBI : 'Entrez Gene'
               http://www.ncbi.nlm.nih.gov/Entrez/
               EB-eye Search at EBI : excellent for cross-species searches
               http://www.ebi.ac.uk/ebisearch/
Hold back your horses!

            Phew, where do I place this all?
Bioinformatics is all about different data,
as versatile as life itself
            Due to the strong cross-references between
              different databases, new databases and
              relevant info are rapidly integrated in existing
              databases.
            You can discover them by taking time to read the
              entries.
New tools are emerging every day to
enable you to browse all data sources...
         BioGPS, all in one window!
Integrative resources are increasingly
being organised on a species basis
        
            EMAGE: database of in situ gene expression in mouse
            OMIM: database of diseases in man
            Websites providing an interface to integrate all
            this data are increasingly important
            Often organized on a species basis
             −   TAIR
             −   FlyBase
             −   WormBase
Organizing biological
information by species

                     By species, why?
  There is one biological information resource which stays
           more or less unchanged per species ...

BITS: Overview of important biological databases beyond sequences


Editor's Notes

  • #11 'translation', whereas another uses the phrase 'protein synthesis',
  • #12 'translation', whereas another uses the phrase 'protein synthesis',
  • #13 'translation', whereas another uses the phrase 'protein synthesis',
  • #14 GO hierarchy can be downloaded (obo format) GO Slim: selection of categories
  • #15 GO hierarchy can be downloaded (obo format) GO Slim: selection of categories
  • #22 Different types: Ribbon Cartoon Ball and stick Space filling
  • #23 Different types: Ribbon Cartoon Ball and stick Space filling