BITS: Overview of important biological databases beyond sequences
 


Module 4 Other relevant biological data sources beyond sequences


Part of training session "Basic Bioinformatics concepts, databases and tools" - http://www.bits.vib.be/training

Usage Rights

CC Attribution-ShareAlike License

  • 'translation', whereas another uses the phrase 'protein synthesis',
  • The GO hierarchy can be downloaded (OBO format). GO Slim: a selection of categories
  • Different types: ribbon, cartoon, ball and stick, space filling

Presentation Transcript

  • Basic bioinformatics concepts, databases and tools. Module 4: Beyond the sequences. Dr. Joachim Jacob, http://www.bits.vib.be. Updated Nov 2011. http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod4-intro_H1_2011_otherRelevantData.pdf
  • Module 4 broadens our view
  • To understand life, we need not only sequences but many other concepts
    • Bioinformatics also means storing and analyzing:
      − Gene information: variations, isoforms, ...
      − Expression data
      − 3D protein structure data
      − Interaction data
      − Pathways and networks
    • "Storing all relevant biological data"
  • Schematic view II (figure): GeneA, GeneB and GeneC each carry sequence annotations, gene expression, pathway, structure, ... Analysis of the primary database and other sequence databases, combined with additional information sources, yields results.
  • The indispensable databases
    • Gene Ontology: structuring
    • KEGG: biochemical pathways
    • PDB: structure of proteins
    • IntAct: interaction data
    • dbSNP: database of genomic variation
    • Expression sources: microarray data
  • Gene Ontology structures the way we communicate about life
    • Gene translation, protein production, protein synthesis
    • http://www.arabidopsis.org/help/tutorials/go1.jsp
    • http://www.geneontology.org/teaching_resources/tutorials/2005-09_BiB-journal-tutorial_jlomax
  • Gene Ontology structures life
    • http://www.geneontology.org/
    • Agreement on standardized keywords (often referred to as controlled vocabularies), describing all natural processes in a hierarchical way (ontology).
    • Keywords are assigned to genes based on different kinds of evidence.
    • Keywords are ordered in a hierarchical, tree-like structure (directed acyclic graphs).
    • Three GO trees exist, describing "Biological Process", "Cellular Component" and "Molecular Function".
    • http://www.arabidopsis.org/help/tutorials/go1.jsp
    • http://www.geneontology.org/teaching_resources/tutorials/2005-09_BiB-journal-tutorial_jlomax
  • A gene can be given different GO terms
    • Example, cytochrome c:
      − Molecular function: oxidoreductase activity
      − Biological process: oxidative phosphorylation and induction of cell death
      − Cellular component: mitochondrial matrix and mitochondrial inner membrane
    • In each tree, the terms are organised in a directed acyclic graph: a network with parent and child terms as nodes and the relationships between them as edges.
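The parent/child structure of a directed acyclic graph can be sketched as a small graph traversal. This is a minimal illustration only: the five terms and their parent links are a toy example chosen for readability, not an extract of the real GO graph, and `ancestors` is a hypothetical helper name.

```python
from collections import deque

# Toy DAG: child term -> list of parent terms (illustrative, not real GO data).
# Note that one term has two parents, which a plain tree could not express.
PARENTS = {
    "biological_process": [],
    "metabolic process": ["biological_process"],
    "cellular process": ["biological_process"],
    "cellular metabolic process": ["metabolic process", "cellular process"],
    "oxidative phosphorylation": ["cellular metabolic process"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    queue = deque(parents[term])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents[current])
    return seen


result = ancestors("oxidative phosphorylation")
```

A gene annotated with a specific term is implicitly annotated with all of that term's ancestors, which is why searching a broad term also retrieves genes annotated to its descendants.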
  • Different evidence codes assign a degree of confidence to the annotation
    • http://www.geneontology.org/GO.evidence.shtml
    • Evidence codes can be grouped by:
      − Experimental (e.g. IDA: inferred from direct assay)
      − Computational analysis
      − Author statement
      − Curator statement
      − Inferred from electronic annotation (IEA)
    • If available, each annotation also has a reference
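Grouping evidence codes makes it easy to keep only the annotations you trust. The sketch below lists just a handful of codes, and the gene names, term names and the helper `filter_by_group` are invented for the example.

```python
# Grouping of a few GO evidence codes, following the categories listed above.
# Only a subset of the existing codes is included here.
EVIDENCE_GROUPS = {
    "IDA": "experimental",   # inferred from direct assay
    "IMP": "experimental",   # inferred from mutant phenotype
    "ISS": "computational",  # inferred from sequence or structural similarity
    "TAS": "author",         # traceable author statement
    "IC": "curator",         # inferred by curator
    "IEA": "electronic",     # inferred from electronic annotation
}


def filter_by_group(annotations, allowed_groups):
    """Keep (gene, term, code) annotations whose code is in an allowed group."""
    return [
        ann for ann in annotations
        if EVIDENCE_GROUPS.get(ann[2]) in allowed_groups
    ]


# Hypothetical annotations, invented for this example.
annotations = [
    ("geneA", "term X", "IDA"),
    ("geneB", "term X", "IEA"),
    ("geneC", "term Y", "TAS"),
]
trusted = filter_by_group(annotations, {"experimental", "author"})
```

Here `trusted` keeps the IDA and TAS annotations and drops the purely electronic one.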
  • Gene Ontology structures all genes according to their biological significance
    • The GO structure and its terms can be browsed with a browser called AmiGO.
    • QuickGO from EBI offers some nice visualisations.
    • There is an excellent GO wiki for all your questions.
  • GO can be used to retrieve all genes (gene products) related to one specific term
    • You can search broadly, e.g. an AmiGO search for "Diabetes" leads to the following GO term
    • http://amigo.geneontology.org/
  • GO is also useful to analyze and compare different gene lists
    • Many GO tools are listed on the GO website: http://www.geneontology.org/GO.tools.shtml
  • Some things to know about GO
    • For analyses, one can use reduced GO sets, the so-called GO slims
      − GO slims are subsets containing the biologically more relevant GO terms (available per species)
      − GO ontologies can be downloaded in .obo format
    • Not all information is captured by GO; some needs to be retrieved from other databases
      − Metabolic pathways: KEGG, ...
      − Phenotypes/diseases
      − Mapping files exist, e.g. kegg2go
    • http://www.geneontology.org/GO.slims.shtml
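A downloaded .obo file is plain text organised in stanzas, so it can be read with a few lines of code. The sketch below handles only the `id`, `name` and `is_a` tags of `[Term]` stanzas; the `EX:` identifiers and term names in the sample text are made up, and `parse_obo_terms` is a hypothetical helper, not part of any GO library.

```python
# Illustrative OBO-style snippet; the EX: identifiers and names are made up.
OBO_TEXT = """\
format-version: 1.2

[Term]
id: EX:0000001
name: root process

[Term]
id: EX:0000002
name: child process
is_a: EX:0000001 ! root process
"""


def parse_obo_terms(text):
    """Parse [Term] stanzas into {id: {"name": str, "is_a": [parent ids]}}."""
    terms = {}
    current = None
    in_term = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            in_term = line == "[Term]"
            current = None
        elif in_term and ": " in line:
            tag, value = line.split(": ", 1)
            if tag == "id":
                current = {"name": "", "is_a": []}
                terms[value] = current
            elif current is not None and tag == "name":
                current["name"] = value
            elif current is not None and tag == "is_a":
                # Drop the trailing "! human-readable name" comment.
                current["is_a"].append(value.split("!")[0].strip())
    return terms


terms = parse_obo_terms(OBO_TEXT)
```

The resulting dictionary maps each term to its parents, which is the input a DAG traversal needs. For real analyses a maintained OBO parser is the safer choice, since the full format has many more tags.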
  • Biological pathway databases organise genes by molecular reactions
    • 3 important databases on biological pathways:
      − http://www.kegg.jp/
      − http://www.reactome.org/ (EBI)
      − http://metacyc.org
  • Proteins with enzymatic function receive an Enzyme Commission (EC) number
    • http://www.chem.qmul.ac.uk/iubmb/enzyme/
    • EC 1 Oxidoreductases
    • EC 2 Transferases
    • EC 3 Hydrolases
    • EC 4 Lyases
    • EC 5 Isomerases
    • EC 6 Ligases
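The first digit of an EC number gives the top-level enzyme class, so the class can be read off directly. This minimal sketch covers only the six classes listed above; `ec_class` is a hypothetical helper name, and any other leading digit raises a `KeyError`.

```python
# Top-level EC classes as listed on the slide.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def ec_class(ec_number):
    """Return the top-level class name for a string such as 'EC 3.4.21.4'."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    first_digit = int(text.split(".")[0])
    return EC_CLASSES[first_digit]


example = ec_class("EC 3.4.21.4")
```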
  • The IntAct database contains interaction information on proteins
    • http://www.ebi.ac.uk/intact
    • Three types of interactions are stored:
      − Protein-protein
      − Protein-DNA
      − Protein-small molecule
  • The IntAct database represents all interactions as binary: caution!
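Why the caution? When an experiment reports a complex of several proteins, turning it into binary pairs requires a modelling choice, and different choices give different networks. The slide does not say how the pairs are derived, so the sketch below simply contrasts two commonly used expansion models (spoke and matrix) on a made-up complex with bait "A" and preys "B", "C", "D".

```python
from itertools import combinations


def spoke_expansion(bait, preys):
    """Spoke model: pair the bait with each prey, and nothing else."""
    return [(bait, prey) for prey in preys]


def matrix_expansion(members):
    """Matrix model: pair every member of the complex with every other."""
    return list(combinations(members, 2))


# Hypothetical pull-down result, invented for this example.
bait, preys = "A", ["B", "C", "D"]
spoke = spoke_expansion(bait, preys)        # 3 binary interactions
matrix = matrix_expansion([bait] + preys)   # 6 binary interactions
```

Neither list proves direct physical contact between any two proteins: a pair such as ("B", "C") appears only under the matrix model, and even the spoke pairs may be indirect.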
  • Interaction networks can be analysed on your computer using Cytoscape
    • Cytoscape training material is available on the BITS website
  • PDB hosts 3-dimensional structural data on molecules
    • PDB = Protein Data Bank, http://www.pdb.org/pdb/home/home.do
    • Only structures resolved through NMR and X-ray (or other accurate techniques)
      − Proteins
      − DNA
      − RNA
      − Ligands
    • Understanding PDB data: tutorial
  • PDB files can be read by many different tools to display the structure
    • Every entry in PDB has its own PDB accession number (four characters: one digit followed by three letters or digits)
    • The PDB file contains the 3D coordinates of every single atom in the structure, together with the variability of that position (the last two numeric columns: occupancy and temperature factor)
    • http://www.bits.vib.be/index.php?option=com_content&view=article&id=17203817:protein-structure-
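Atom records in the classic PDB text format use fixed columns, so coordinates are extracted by position rather than by splitting on whitespace. The sketch below builds one `ATOM` line with a small helper (so the columns are guaranteed to line up) and parses it back; both helper names and the atom values are invented for illustration.

```python
def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z, occ, b):
    """Build a fixed-column ATOM record (blank altLoc and insertion code)."""
    return (
        "ATOM  {:5d} {:<4s} {:3s} {:1s}{:4d}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
        .format(serial, name, res_name, chain, res_seq, x, y, z, occ, b)
    )


def parse_atom_line(line):
    """Extract the main fields of an ATOM record by column position."""
    return {
        "atom": line[12:16].strip(),       # columns 13-16
        "residue": line[17:20].strip(),    # columns 18-20
        "chain": line[21],                 # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_factor": float(line[60:66]),    # columns 61-66
    }


# Hypothetical atom, invented for this example.
line = make_atom_line(2, " CA ", "MET", "A", 1, 11.5, -3.25, 7.0, 1.0, 15.75)
atom = parse_atom_line(line)
```

Splitting on whitespace would fail on real files whenever two wide numbers touch, which is why the column positions matter. Established structure libraries handle the many other record types.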
  • PDB files can be read by many different tools to display the structure
    • Tools to visualize (and some to analyze) structures: see the BITS wiki
    • http://www.bits.vib.be/wiki/index.php/Protein_structure
  • To find a structure for your protein sequence is to search for similarity
    • Homology modeling: similarity on sequence level projected onto a structure
      − BLAST your query against the PDB database with cblast, or at ExPASy
      − PSI-BLAST can detect sequences with similar structures (twilight zone!)
      − If still no success: 3D-Jury (a meta approach, including fold recognition and local structure prediction)
    • Similarity on structural level: aligning structures
      − VAST (structure)
      − DALI (Distance mAtrix aLIgnment)
    • BITS training on protein structure analysis
    • http://www.ii.uib.no/~slars/bioinfocourse/PDFs/structpred_tutorial.pdf
    • Tools at EBI
    • http://consurf.tau.ac.il/pe/protexpl/psbiores.htm
  • Structural information is used to classify proteins
    • Database cross-references in a PDB entry:
      − SCOP: groups proteins based on evolutionary, domain architecture and structural information. http://scop.mrc-lmb.cam.ac.uk/scop/
      − CATH: manually curated classification of protein domains. http://www.cathdb.info/
  • dbSNP is a public-domain archive for simple genetic polymorphisms
    • Single Nucleotide Polymorphism database (NCBI)
    • Each dbSNP entry has a code rsxx (RefSNP) or ssxx (submitted SNP)
    • Types of variation stored:
      − Single-base nucleotide substitutions (also known as single nucleotide polymorphisms or SNPs)
      − Small-scale multi-base deletions or insertions (also called deletion insertion polymorphisms or DIPs)
      − Retroposable element insertions and microsatellite repeat variations (also called short tandem repeats or STRs)
    • Synchronized with new genome builds
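The rs/ss prefix tells you whether an identifier is a reference SNP cluster or an individual submission. A minimal sketch of telling them apart follows; `classify_dbsnp_id` is a hypothetical helper and the identifier numbers used with it are arbitrary examples.

```python
import re

# "rs" or "ss" followed by digits, case-insensitive.
_DBSNP_ID = re.compile(r"^(rs|ss)(\d+)$", re.IGNORECASE)


def classify_dbsnp_id(identifier):
    """Return (kind, number) for an rs/ss identifier, or None if not one."""
    match = _DBSNP_ID.match(identifier.strip())
    if match is None:
        return None
    kind = "RefSNP" if match.group(1).lower() == "rs" else "submitted SNP"
    return kind, int(match.group(2))


example = classify_dbsnp_id("rs12345")
```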
  • Expression data can be sequence-based or hybridisation-based
    • Sequence-based (ESTs, RNA-seq, SAGE)
      − Digital gene expression/northern
    • Microarray databases (hybridisation-based):
      − GEO: Gene Expression Omnibus (NCBI)
        Platform: GPLxxxxxxx
        Experiment: GSExxxxxx (= several samples)
        Sample: GSMxxxxxxxx
        Some experiments are curated: GDSxxxxx (online analysis possible)
      − ArrayExpress (EBI)
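The GEO accession prefix identifies the kind of record, which is handy when sorting a mixed list of identifiers. The sketch below maps the four prefixes listed above to a record type; `geo_record_type` is a hypothetical helper and the accession numbers used with it are arbitrary.

```python
import re

# Record types by accession prefix, as listed on the slide.
GEO_TYPES = {
    "GPL": "platform",
    "GSE": "experiment (series of samples)",
    "GSM": "sample",
    "GDS": "curated dataset",
}

_GEO_ACCESSION = re.compile(r"^(GPL|GSE|GSM|GDS)(\d+)$")


def geo_record_type(accession):
    """Return the record type for a GEO accession, or None if unrecognised."""
    match = _GEO_ACCESSION.match(accession.strip().upper())
    return GEO_TYPES[match.group(1)] if match else None


example = geo_record_type("GSE1234")
```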
  • Example of expression data at GEO
  • Example at ArrayExpress
  • Entrez interconnects the databases at NCBI for easy querying
    • UniGene: sequences grouped by gene
    • PopSet: sequence alignments for population studies and phylogeny
    • Structure: 3D structures (PDB)
    • Genome: genomic maps of chromosomes and plasmids
    • UniSTS (Sequence Tagged Sites)
    • PubMed: literature abstracts (MEDLINE, ...)
    • OMIM (Online Mendelian Inheritance in Man): literature reviews
    • MeSH (Medical Subject Headings): keywords
    • Taxonomy
  • Finding relevant data
  • Summary of the most important links to discover everything you need ...
    • Protein data
      − InterPro (heavily integrated with EBI resources): http://www.interpro.org
    • Gene data
      − Entrez at NCBI, Entrez Gene: http://www.ncbi.nlm.nih.gov/Entrez/
      − EB-eye search at EBI, excellent for cross-species queries: http://www.ebi.ac.uk/ebisearch/
  • Hold your horses! Phew, where do I put all this?
  • Bioinformatics is all about different data, as versatile as life itself
    • Because databases are strongly cross-referenced, new databases and relevant information are rapidly integrated into existing ones.
    • You can discover them by taking the time to read the entries.
  • New tools are emerging every day to enable you to browse all data sources...
    • BioGPS, all in one window!
  • Integrative resources are increasingly being organised on a species basis
    • EMAGE: database of in situ gene expression in mouse
    • OMIM: database of diseases in man
    • Websites providing an interface to integrate all this data are increasingly important
    • Often organized on a species basis:
      − TAIR
      − FlyBase
      − WormBase
  • Organizing biological data by species
    • By species, why? There is one biological information resource which stays more or less unchanged per species ...