Ondex: Data integration and visualisation

Catherine Canevet – Ondex: Data integration and visualisation

Ondex (http://ondex.org/) is a data integration platform which enables data from diverse biological data sets to be linked, integrated and visualised through graph analysis techniques. This talk describes its functionalities and a few application cases.

Published in: Technology
  • Light pink – Increased virulence; Light blue – Reduced virulence; Light green – Loss of pathogenicity; Yellow – Unaffected pathogenicity; Star – animal; Circle – plant
  • Virtual KO score is based on 3 other scores: "extension" gives the number of paths that would be extended if a concept was added; "deletion" gives the number of paths that would be deleted if this concept was deleted; "nochange" gives the number of paths that would not be shortened/extended if this concept was deleted
  • IntAct: 4625 protein interactions (data derived from literature curation or direct user submissions). TAIR (The Arabidopsis Information Resource): 1143 interactions; genome sequence, gene structure, gene product information, metabolism, gene expression, DNA and seed stocks, genome maps, genetic and physical markers, publications. BioGRID (General Repository for Interaction Datasets): collections of protein and genetic interactions from major model organism species; 1223 interactions for Arabidopsis derived from high-throughput studies and conventional focused studies
  • ATTED II (Arabidopsis thaliana trans-factor and cis-element prediction database): provides co-regulated gene relationships in Arabidopsis to estimate gene functions; gives the Pearson correlation coefficients of co-expressed genes in Arabidopsis calculated from available microarray data. NCBI PSI-BLAST: identify similarities between our reference set of proteins; matching against the Arabidopsis subset of UniProt. Co-occurrence of protein names: 25,900 Medline abstracts related to Arabidopsis thaliana; integrated Lucene-based mapping method
  • Solid biomass (in the form of plants and trees) can be converted into liquid fuels (such as ethanol, methanol, and biodiesel). The challenge lies in efficient conversion, creating more energy than the input required to produce it. Increase biomass yield. Develop means to support systematic analysis of QTL regions and prioritise genes for experimental analyses. Identify genes controlling biomass production in willow
  • QTL are genomic regions that assign variations observed in a phenotype to a region on the genetic map. Biomass traits: branching, height, leaf number etc. Going from willow to poplar to Arabidopsis and other species
  • Reduced hypothesis space from 100 potential candidates to 3 hot candidates. Next steps: cloning and transformation for experimental validation.

    1. 1. Ondex – Data integration and visualisation<br />Catherine Canevet<br />Rothamsted Research<br />London Biogeeks – May Tech Meet<br />
    2. 2. Rothamsted Research<br />North Wyke<br /><ul><li>Largest agricultural and crop science research institute in UK
    3. 3. Almost certainly the oldest in the world (started in 1843)
    4. 4. 350 Scientific staff
    5. 5. Open weekend May 22nd-23rd 11am-5pm</li></ul>www.rothamsted.ac.uk/openweekend/<br />
    6. 6. Outline<br /><ul><li>Data integration in Systems Biology
    7. 7. Data integration in Ondex
    8. 8. Data visualisation in Ondex and application cases</li></li></ul><li>Outline<br /><ul><li>Data integration in Systems Biology
    9. 9. Data integration in Ondex
    10. 10. Data visualisation in Ondex and application cases</li></li></ul><li>Opportunities and challenges for modelling<br /><ul><li>Large amounts of multi-’omics data
    11. 11. Genomics, transcriptomics, proteomics, metabolomics, …
    12. 12. The biological systems span multiple levels of biological organisation
    13. 13. Non-trivial to integrate the data</li></ul> 2 main challenges<br />
    14. 14. Syntactic integration challenge<br />Over 1000 databases freely available to public<br />Over 60 million sequences in GenBank<br />Over 870 complete genomes and many ongoing projects<br />Over 17 million citations in PubMed<br />PubMed growth by 600,000 publications each year<br />Integration of Life Science data sources is essential for Systems Biology research<br />http://www.ncbi.nlm.nih.gov/Database<br />
    15. 15. Ear<br />Semantic Integration challenge<br />Same concept different names<br />Synonyms<br />Same name different concepts<br />Homographs<br />
    16. 16. Outline<br /><ul><li>Data integration in Systems Biology
    17. 17. Data integration in Ondex
    18. 18. Data visualisation in Ondex and application cases</li></li></ul><li>Everything is a network<br />
    19. 19. Concepts and relations (1/2)<br />interact<br />Cell<br />Protein – Protein interaction network (PPI)<br />Cellular location of proteins<br />Protein<br />Protein<br />e.g. Network of Concepts and Relations<br />RelationType<br />interact<br />located in<br />ConceptClass<br />ConceptClass<br />Protein<br />CelComp<br />Protein<br />Protein<br />Properties: compound name, protein sequence, protein structure, cellular component, KM-value, pH optimum … <br />Ontology of Concept Classes, Relation Types and additional Properties<br />
    20. 20. Reaction<br />Reaction<br />produced by<br />consumed by<br />consumed by<br />produced by<br />Metabolite<br />Metabolite<br />Metabolite<br />Concepts and relations (2/2)<br />Transformation to binary graph<br />Properties: compound name, protein sequence, protein structure, cellular component, KM-value, pH optimum … <br />Concepts:<br />Relations:<br />
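
The "Concepts and relations" slides describe a graph of typed concepts joined by typed relations, with a reaction turned into a concept of its own so that every relation stays binary. A minimal Python sketch of that idea follows. The class names, field names and the example reaction are illustrative assumptions, not the Ondex API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A node in the graph; concept_class is e.g. 'Protein' or 'Reaction'."""
    id: str
    concept_class: str


@dataclass(frozen=True)
class Relation:
    """A directed, typed, binary edge between two concepts."""
    source: Concept
    relation_type: str
    target: Concept


def reaction_to_binary(reaction_id, substrates, products):
    """Represent one n-ary reaction as binary relations.

    The reaction itself becomes a concept, and each metabolite is linked
    to it with a 'consumed by' or 'produced by' relation.
    """
    reaction = Concept(reaction_id, "Reaction")
    relations = []
    for name in substrates:
        relations.append(Relation(Concept(name, "Metabolite"), "consumed by", reaction))
    for name in products:
        relations.append(Relation(Concept(name, "Metabolite"), "produced by", reaction))
    return reaction, relations


# Illustrative data only (hypothetical identifiers).
reaction, relations = reaction_to_binary(
    "R1", ["glucose", "ATP"], ["glucose-6-phosphate", "ADP"]
)
for r in relations:
    print(f"{r.source.id} --{r.relation_type}--> {r.target.id}")
```

Keeping relations binary means filters and layouts only ever have to handle one edge shape, at the cost of one extra node per reaction.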
    21. 21. Data integration in Ondex<br />Data Integration<br />Data Input<br />Graph of concepts and relations <br />Biological Databases<br />Import<br />Ontologies & Free Text<br />Data alignment<br /><ul><li> Concept mapping
    22. 22. Sequence analysis
    23. 23. Text mining</li></ul>Experimental Data<br />
    24. 24. Importing data into Ondex<br />What databases to import<br />What format these are in<br />Ondex parsers already written<br />Generic<br />OBO, PSI-MI, SBML, Tab-delimited, Fasta<br />Database-specific<br />Aracyc, AtRegNet, BioCyc, BioGRID, Brenda, Drastic, EcoCyc, GO, GOA, Gramene, Grassius, KEGG, Medline, MetaCyc, Oglycbase, OMIM, PDB, Pfam, SGD, TAIR, TIGR, Transfac, Transpath, UniProt, WGS, WordNet<br />
    25. 25. Example of resulting graph<br />Has similar sequence<br />Target sequence<br />Binds to, has similar sequence<br />Repressed by, regulated by, activated by<br />Member is part of<br />Gene<br />Protein<br />Encoded by<br />Is_a<br />Member is part of<br />Is_a<br />Transcription factor<br />Is_a<br />Member is part of<br />Enzyme<br />Protein complex<br />Is_a<br />catalyses<br />Catalysing class<br />Member is part of<br />Reaction<br />Member is part of<br />EC<br />Is_a<br />Pathway<br />
    26. 26. Ondex Data Integration Scheme<br />Treatments from DRASTIC<br />Graph alignment<br />Pathways from KEGG<br />Data input & transformation<br />Data integration<br />Visualisation<br />Clients/Tools<br />Heterogeneous <br />data sources<br />Ondex graph warehouse<br />Integration<br />Methods<br />Ondex<br />Visualization <br />Tool Kit<br />UniProt<br />Accession<br />Generalized Object Data Model<br />Database Layer<br />Parser<br />Name based<br />Web Client<br />AraCyc<br />Parser<br />Transitive<br />Taverna<br />KEGG<br />Blast<br />Parser<br />ProteinFamily<br />Transfac<br />Data Exchange<br />Parser<br />Pfam2GO<br />OXL/RDF<br />Microarray<br />Lucene<br />Parser<br />Web Service<br />
    27. 27. Semantic Integration by Graph Alignment<br />Create relations between equivalent entries from different data sources<br />Identified by mapping methods<br />Concept accessions (UniProt ID)<br />Concept name (gene name), synonyms<br />Sequence methods<br />Graph neighbourhood<br />Text mining<br />
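
The first mapping method on the graph alignment slide, matching by concept accession, can be sketched as follows: entries from different data sources that share an accession get an equivalence relation. This is a minimal sketch under stated assumptions; the function name, the tuple layout and the example identifiers are hypothetical and do not reflect how Ondex implements its mappers.

```python
from collections import defaultdict


def map_by_accession(entries):
    """Link entries from different sources that share an accession.

    entries: iterable of (source, concept_id, accessions) tuples.
    Returns a sorted list of pairs ((source, id), (source, id)), one per
    equivalence found between two different data sources.
    """
    by_accession = defaultdict(list)
    for source, concept_id, accessions in entries:
        for acc in accessions:
            by_accession[acc].append((source, concept_id))

    pairs = set()
    for members in by_accession.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                a, b = members[i], members[j]
                if a[0] != b[0]:  # only align across data sources
                    pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)


# Hypothetical entries; the accessions are placeholders.
entries = [
    ("KEGG", "kegg:1", {"P12345"}),
    ("AraCyc", "aracyc:7", {"P12345", "AT1G01010"}),
    ("TAIR", "tair:3", {"AT1G01010"}),
    ("KEGG", "kegg:2", {"Q99999"}),
]
equivalences = map_by_accession(entries)
for left, right in equivalences:
    print(left, "<-equivalent->", right)
```

Name-based, sequence-based and text-mining mappers would follow the same pattern with a different notion of "shared key", usually with a score attached instead of an exact match.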
    28. 28. Outline<br /><ul><li>Data integration in Systems Biology
    29. 29. Data integration in Ondex
    30. 30. Data visualisation in Ondex and application cases</li></li></ul><li>Ondex network – more than a visual dump of a database …<br /><ul><li>Addressed issue of interrelatedness of databases
    31. 31. Complexity of interactions
    32. 32. PPI, co-expression, </li></ul> co-citation, …<br /><ul><li>Bring together data, exploit graph structure
    33. 33. Candidate gene prioritisation and pathway discovery</li></ul> Use Ondex tools (filters, annotators, layouts …)<br />
    34. 34. Filters<br />Integrating different datasets <br /> large resulting graph<br />Need to narrow down<br />Select meaningful areas of the graph<br />Example in Ondex<br />protein-protein interaction network<br />
    35. 35. Filters in Ondex<br />Protein protein interactions measured using quantitative techniques<br /><ul><li> Relations on graph have confidence values (confidence)
    36. 36. Threshold filter</li></li></ul><li>Application case 1: Predicting fungal pathogenicity genes<br />Reference database of virulence and pathogenicity genes validated by gene disruption experiments <br />Literature mining<br />http://www.phi-base.org/<br />Sequence comparison – orthology and gene cluster analysis <br />
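
The threshold filter named on the "Filters in Ondex" slide keeps only relations whose confidence value passes a cut-off. A minimal sketch of that behaviour is shown here, assuming relations are plain dictionaries with a `confidence` key; the key names and the example values are assumptions made for illustration.

```python
def threshold_filter(relations, min_confidence):
    """Keep relations with confidence >= min_confidence.

    Returns the surviving relations and the sorted list of concepts that
    are still connected by at least one of them.
    """
    kept = [r for r in relations if r["confidence"] >= min_confidence]
    concepts = sorted({c for r in kept for c in (r["source"], r["target"])})
    return kept, concepts


# Made-up protein-protein interactions with made-up confidence values.
ppi = [
    {"source": "P1", "target": "P2", "confidence": 0.91},
    {"source": "P2", "target": "P5", "confidence": 0.42},
    {"source": "P3", "target": "P4", "confidence": 0.77},
]
kept, concepts = threshold_filter(ppi, min_confidence=0.7)
print(len(kept), "relations kept; concepts:", concepts)
```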
    37. 37. http://www.phi-base.org/<br /><ul><li>List of “hot” target genes curated from literature
    38. 38. Loss of pathogenicity
    39. 39. Reduced virulence
    40. 40. Only genes validated by gene disruption experiments</li></li></ul><li>ONDEX<br />Fusarium graminearum<br />(microscopic fungus)<br />genome<br />FASTA<br />OXL<br />tab file<br />OXL<br />clusters of orthologs and paralogs between entries of PHI-base and Fusarium graminearum<br />tab separated text file of clusters loaded in Excel<br />Inparanoid<br />mapping<br />Ondex front-end<br />Integrated phenotype and comparative genome information<br />
    41. 41. Integrated phenotype and comparative genome information<br />
    42. 42. Annotators (1/3)<br /><ul><li>Visualise concepts and relations using their attributes/properties
    43. 43. Colour
    44. 44. Shape
    45. 45. Size</li></li></ul><li>Andrew Beacham<br />Mixed phenotype cluster<br />Reduced<br />virulence<br />Shape Legend<br />Star: animal<br />Circle: plant<br />Red square: Fusarium<br />Uninfected<br />Wild-type <br /> PH-1<br />FGSG_9908<br />pkar PH-1<br />Protein kinase A- regulatory subunit <br />
    46. 46. Annotators (2/3)<br />Virtual Knock-out<br />Annotator to see how important a single concept is to all possible paths contained in a network <br />Ondex resizes the concepts based on this score<br />Scale Concept by Value <br />Pie charts<br />Up/down regulation is indicated in red/green<br />
    47. 47. AraCyc<br />ONDEX<br />Application case 2: Mapping microarray expression data to integrated pathways<br />Parser<br />tab file<br />Arabidopsis C/N uptake<br />OXL<br />tab file<br />Jan Taubert<br />Accession based<br />Mapping<br />using TAIR IDs<br />Ondex Interactive exploration<br />Enriched spreadsheet, e.g. AraCyc pathways<br />
    49. 49. Annotators (3/3)<br /><ul><li>Run network statistics such as:
    50. 50. Connectivity
    51. 51. Centrality
    52. 52. Clustering
    53. 53. Network diameter</li></ul> Add annotation to the graph<br />
    54. 54. Application case 3: Arabidopsis PPI network<br />Artem Lysenko<br />IntAct<br />TAIR<br />BioGRID<br /> Mapping the 3 databases based on TAIR accessions<br />
    55. 55. Adding 3 sources of evidence<br />co-expression<br />sequence similarity<br />co-occurrence in scientific literature<br /> facilitate the identification of functionally related groups of proteins<br />
    56. 56. Added attributes to nodes/edges<br />Network stats<br />Betweenness centrality (BWC)<br /> How influential (bridge)<br />Degree centrality (DC)<br /> Hub likeness<br />Markov Clustering<br />Identifies strongly connected groups of proteins in the network<br />
    57. 57. Ondex annotator to visualise calculated centrality measures<br /><ul><li>Identify ‘influential’ nodes with low degree and high betweenness
    58. 58. Degree centrality represented by node size
    59. 59. Betweenness centrality represented by node colour</li></ul>Artem Lysenko<br />
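
The two measures used in this application case, degree centrality and betweenness centrality, can be computed with a short standard-library sketch. It uses Brandes' algorithm for unweighted, undirected graphs, which is an assumption about method: the slides do not say how Ondex computes these scores. The toy network is invented to show a node with low degree but high betweenness.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_centrality(adj):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


def betweenness_centrality(adj):
    """Unnormalised betweenness via Brandes' algorithm (BFS per source)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from s
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:          # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:                      # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: b / 2 for v, b in bc.items()}


# Toy network: two triangles joined through a single bridge node X.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "X"),
         ("X", "D"), ("D", "E"), ("D", "F"), ("E", "F")]
adj = build_adjacency(edges)
dc = degree_centrality(adj)
bc = betweenness_centrality(adj)
for node in sorted(adj):
    print(node, round(dc[node], 2), bc[node])
```

In this toy graph X has only two neighbours yet sits on every shortest path between the two triangles, which is the "influential node with low degree and high betweenness" pattern the slide describes.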
    60. 60. Filters, annotators and layouts<br />Combination of these three types of tools in Ondex<br /> a more complex application case …<br />
    61. 61. Application case 4: Bioenergy Project<br />Use bioinformatics to support phenotype-genotype research in bioenergy crops<br />Given a phenotypic variant is it possible to pin down the relevant genes? <br />Develop tools to support systematic analysis of QTL regions to pin down relevant genes<br />Identify genes implicated in biomass production in willow<br />Prioritise genes for experimental validation<br />Keywan Hassani-Pak<br />Biofuel Conversion Process<br />http://www.jgi.doe.gov/education/bioenergy/bioenergy_1.html<br />
    62. 62. QTL and Genomic Data<br />QTL<br />Willow genome is not sequenced yet<br />QTL may encompass many potential candidates, perhaps hundreds<br />Poplar is the first tree with fully sequenced genome<br />19 Chromosomes, 45778 predicted genes<br />4x larger than Arabidopsis genome<br />Not much known about the function of the genes<br />
    63. 63. Linking genes to data sources<br />Linked<br />References<br />model<br />e.g. Poplar, Arabidopsis<br />Willow<br />Pathways<br />Plant Hormones<br />QTL Map<br />Orthologous<br />Markers<br />Physical <br />map<br />Expression Patterns<br />Genes<br />Gene Function<br />List of candidate genes linked to biological processes<br />
    64. 64. Relevant Data Sources<br />Release 15.10<br />Poplar Gene Prediction v2.0 (Jan 2010)<br />All plants: 739,396 proteins<br />Reviewed: 28,404 proteins (3.84%)<br />PoplarCyc 1.0: 285 pathways, 3434 enzymes, 1363 compounds (Oct 2009)<br />Pfam 24.0: 11,912 protein families (Oct 2009)<br />Poplar Transcription Factors<br /> - DPTF: 2,576 putative TF (March 2007)<br /> - PlnTFDB: 2,901 putative TF (July 2009)<br />29,365 GO terms (Jan 2010)<br />Poplar/ Willow QTL<br /> - work in progress<br /> - preliminary dataset available <br />Only loading referenced publications<br />~15,000 articles<br />
    65. 65. Unique Knowledge Base for Poplar<br />Proteins annotated with functional information and publications<br />Based on Comparative genomics and<br />Protein family analysis<br />Genes, QTLs enriched with positional information<br />Data integration was done in Ondex<br />
    66. 66. Ondex Genomics Layout<br />Genomic Layout displays chromosomes, genes and QTLs<br />Chromosomal regions and QTLs can be selected<br />
    67. 67. Ondex Genomics Filter<br />Genes of interest<br />Enriched protein annotation network<br />
    68. 68. Phenotypic Information in Literature<br />HMMer: 650581 – HLH<br />E-Value: 3.4E-7<br />Score: 30.0<br />BLAST 217086 – LAX<br />E-Value: 8.3E-17<br />Score: 80.88<br />BLAST 217086 – BHLH63<br />E-Value: 8.3E-9<br />Score: 54.3<br />PMID:13130077<br />“LAX and SPA: major regulators of shoot branching in rice.”<br />Poplar protein 217086<br />We identified two remote homologs, in rice (LAX) and in Arabidopsis (BHLH63), <br />as well as one protein domain, HLH<br />The literature linked to the LAX homolog indicates that it is a major regulator of shoot branching<br /> Hypothesis generation<br />
    69. 69. Dynamic process<br />Data analysis<br />Data Integration<br />Data Input<br />Graph of concepts and relations <br />Biological Databases<br />Parse<br />Export<br />Ontologies & Free Text<br />Data alignment<br /><ul><li> Concept mapping
    70. 70. Sequence analysis
    71. 71. Text mining</li></ul>Experimental Data<br />Hypothesis<br />New experiments<br />
    72. 72. Acknowledgements<br />Newcastle members: <br /><ul><li> Simon Cockell
    73. 73. James Dewar
    74. 74. Eva Holstein
    75. 75. Katherine James
    76. 76. Philip Lord
    77. 77. David Lydall
    78. 78. Matthew Pocock
    79. 79. Jochen Weile
    80. 80. Darren Wilkinson
    81. 81. Anil Wipat</li></ul>Rothamsted members:<br /><ul><li> Catherine Canevet
    82. 82. Keywan Hassani-Pak
    83. 83. Stephen Hanley
    84. 84. Matthew Hindle
    85. 85. Angela Karp
    86. 86. Shao Chih Kuo
    87. 87. Artem Lysenko
    88. 88. Chris Rawlings
    89. 89. Mansoor Saqi
    90. 90. Andrea Splendiani
    91. 91. Jan Taubert </li></ul>Manchester members:<br /><ul><li>Sophia Ananiadou
    92. 92. Paul Dobson
    93. 93. Paul Fisher
    94. 94. Carole Goble
    95. 95. Gina Levow
    96. 96. Pedro Mendes
    97. 97. Raheel Nawaz
    98. 98. Georgina Moulton
    99. 99. Robert Stevens
    100. 100. David Withers
    101. 101. Katy Wolstencroft</li></ul>Biological collaborators:<br /><ul><li>Kim Hammond-Kosack
    102. 102. Martin Urban
    103. 103. Dimah Habash
    104. 104. David Wild
    105. 105. Katherine Denby
    106. 106. Roxane Legaie</li></ul>Former members:<br /><ul><li>Jacob Köhler
    107. 107. Rainer Winnenburg</li></ul>Edinburgh members:<br /><ul><li>Igor Goryanin
    108. 108. Andrew Millar
    109. 109. Luna De Ferrari</li></li></ul><li>