Ondex: Data integration and visualisation

1. Ondex – Data integration and visualisation. Catherine Canevet, Rothamsted Research. London Biogeeks – May Tech Meet

3. Rothamsted Research: almost certainly the oldest agricultural research station in the world (started in 1843)

4. 350 Scientific staff

5. Open weekend: May 22nd-23rd, 11am-5pm. www.rothamsted.ac.uk/openweekend/

7. Data integration in Ondex

11. Genomics, transcriptomics, proteomics, metabolomics, …

12. The biological systems span multiple levels of biological organisation

13. Non-trivial to integrate the data: 2 main challenges

14. Syntactic integration challenge. Over 1000 databases freely available to the public; over 60 million sequences in GenBank; over 870 complete genomes and many ongoing projects; over 17 million citations in PubMed; PubMed grows by 600,000 publications each year. Integration of Life Science data sources is essential for Systems Biology research. http://www.ncbi.nlm.nih.gov/Database

15. Ear: Semantic integration challenge. Same concept, different names: synonyms. Same name, different concepts: homographs.

19. Concepts and relations (1/2). Network of concepts and relations, e.g. a protein-protein interaction (PPI) network (Protein interact Protein) and the cellular location of proteins (Protein located in CelComp). ConceptClass: Protein, CelComp. RelationType: interact, located in. Properties: compound name, protein sequence, protein structure, cellular component, KM-value, pH optimum, … Ontology of concept classes, relation types and additional properties.
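The concept-and-relation model on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Ondex API: the class names `Concept`, `Relation` and `Graph`, the method names and the identifiers `P1`, `P2` and `nucleus` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A node, typed by a concept class such as Protein or CelComp."""
    id: str
    concept_class: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relation:
    """A typed, directed edge between two concepts."""
    source: str
    target: str
    relation_type: str


class Graph:
    """Hypothetical in-memory graph of concepts and relations."""

    def __init__(self):
        self.concepts = {}
        self.relations = []

    def add_concept(self, id, concept_class, **properties):
        self.concepts[id] = Concept(id, concept_class, properties)

    def add_relation(self, source, target, relation_type):
        # Relations may only connect concepts that already exist.
        if source not in self.concepts or target not in self.concepts:
            raise KeyError("both ends of a relation must be existing concepts")
        self.relations.append(Relation(source, target, relation_type))

    def targets(self, source, relation_type):
        """Ids reached from `source` through relations of the given type."""
        return [r.target for r in self.relations
                if r.source == source and r.relation_type == relation_type]


# Made-up example data mirroring the slide: two proteins and a cellular component.
g = Graph()
g.add_concept("P1", "Protein", protein_sequence="MKT")
g.add_concept("P2", "Protein")
g.add_concept("nucleus", "CelComp")
g.add_relation("P1", "P2", "interact")
g.add_relation("P1", "nucleus", "located in")
```

Keeping the concept class and relation type as plain labels means the same structure can hold a PPI network and protein locations side by side.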

20. Concepts and relations (2/2). Transformation to binary graph. Concepts: Reaction, Metabolite. Relations: produced by, consumed by. Properties: compound name, protein sequence, protein structure, cellular component, KM-value, pH optimum, …
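As a rough sketch of the transformation to a binary graph: a reaction with several substrates and products becomes a Reaction concept of its own, linked to each Metabolite by a two-ended relation. The function name and the reaction id `R1` are hypothetical; the relation labels are the ones on the slide.

```python
def to_binary_relations(reaction_id, substrates, products):
    """Turn one multi-way reaction into (metabolite, relation, reaction) triples."""
    relations = [(m, "consumed by", reaction_id) for m in substrates]
    relations += [(m, "produced by", reaction_id) for m in products]
    return relations


# Illustrative reaction: glucose + ATP -> glucose 6-phosphate + ADP
edges = to_binary_relations(
    "R1",
    substrates=["glucose", "ATP"],
    products=["glucose 6-phosphate", "ADP"],
)
for edge in edges:
    print(edge)
```

Every triple has exactly two ends, so the result fits the same concept/relation model as any other part of the graph.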

22. Sequence analysis

23. Text mining; Experimental data

24. Importing data into Ondex. What databases to import? What format are these in? Ondex parsers already written. Generic: OBO, PSI-MI, SBML, tab-delimited, Fasta. Database-specific: Aracyc, AtRegNet, BioCyc, BioGRID, Brenda, Drastic, EcoCyc, GO, GOA, Gramene, Grassius, KEGG, Medline, MetaCyc, Oglycbase, OMIM, PDB, Pfam, SGD, TAIR, TIGR, Transfac, Transpath, UniProt, WGS, WordNet

25. Example of resulting graph (figure). Node and edge labels shown: Gene, Protein, Transcription factor, Enzyme, Protein complex, Target sequence, Reaction, EC, Pathway, Catalysing class; has similar sequence; binds to; repressed by, regulated by, activated by; encoded by; is_a; member is part of; catalyses.

26. Ondex data integration scheme (figure). Data input and transformation: heterogeneous data sources (UniProt, AraCyc, KEGG, Transfac, Microarray; e.g. treatments from DRASTIC, pathways from KEGG), each read by a parser. Data integration: Ondex graph warehouse (generalized object data model, database layer, Lucene) with graph alignment and integration methods (accession, name based, transitive, Blast, protein family, Pfam2GO). Visualisation and clients/tools: Ondex Visualization Tool Kit, web client, Taverna, data exchange (OXL/RDF), web service.

27. Semantic integration by graph alignment. Create relations between equivalent entries from different data sources, identified by mapping methods: concept accessions (UniProt ID); concept name (gene name), synonyms; sequence methods; graph neighbourhood; text mining.
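Accession-based mapping is the simplest of these methods, and a minimal sketch could look as follows. It assumes each concept is a plain dictionary with `id`, `source`, `concept_class` and `accessions` keys; that layout, the function name and the example ids and accessions are all hypothetical and do not reflect the actual Ondex implementation.

```python
from collections import defaultdict
from itertools import combinations


def accession_mapping(concepts):
    """Pair concepts of the same class from different sources that share an accession."""
    index = defaultdict(list)
    for concept in concepts:
        for accession in concept["accessions"]:
            index[(concept["concept_class"], accession)].append(concept)

    pairs = set()
    for group in index.values():
        for a, b in combinations(group, 2):
            if a["source"] != b["source"]:  # only align across data sources
                pairs.add(tuple(sorted((a["id"], b["id"]))))
    return sorted(pairs)


# Made-up example: two entries share the accession P12345, the third does not.
concepts = [
    {"id": "kegg:1", "source": "KEGG", "concept_class": "Protein",
     "accessions": {"P12345"}},
    {"id": "uniprot:1", "source": "UniProt", "concept_class": "Protein",
     "accessions": {"P12345", "Q00001"}},
    {"id": "aracyc:2", "source": "AraCyc", "concept_class": "Protein",
     "accessions": {"Q99999"}},
]
equivalent = accession_mapping(concepts)
```

Each returned pair would become one new relation between equivalent entries; name-based or sequence-based methods would need fuzzier matching than this exact lookup.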

31. Complexity of interactions

33. Candidate gene prioritisation and pathway discovery: use Ondex tools (filters, annotators, layouts …)

34. Filters. Integrating different datasets → large resulting graph. Need to narrow down: select meaningful areas of the graph. Example in Ondex: protein-protein interaction network.

38. Loss of pathogenicity

39. Reduced virulence

41. Integrated phenotype and comparative genome information

43. Colour

44. Shape

46. Annotators (2/3). Virtual Knock-out: annotator to see how important a single concept is to all possible paths contained in a network; Ondex resizes the concepts based on this score. Scale Concept by Value: pie charts; up/down regulation is indicated in red/green.

47. Application case 2: Mapping microarray expression data to integrated pathways (Jan Taubert). Figure labels: AraCyc (parser); Arabidopsis C/N uptake (tab file); accession-based mapping using TAIR IDs; ONDEX; OXL; Ondex interactive exploration; enriched spreadsheet (tab file), e.g. AraCyc pathways.

50. Connectivity

51. Centrality

52. Clustering

53. Network diameter. Add annotation to the graph.

54. Application case 3: Arabidopsis PPI network (Artem Lysenko). IntAct, TAIR, BioGRID → mapping the 3 databases based on TAIR accessions.

55. Adding 3 sources of evidence: co-expression, sequence similarity, co-occurrence in scientific literature → facilitates the identification of functionally related groups of proteins.

56. Added attributes to nodes/edges. Network stats: betweenness centrality (BWC) → how influential (bridge); degree centrality (DC) → hub likeness. Markov Clustering: identifies strongly connected groups of proteins in the network.
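The two centrality scores can be illustrated with a small, self-contained sketch. It assumes an undirected, unweighted network stored as a dictionary mapping each node to the set of its neighbours, and uses Brandes' algorithm for betweenness. This is not the code Ondex uses; the function names and the toy network are hypothetical, and Markov Clustering is not covered here.

```python
from collections import deque


def degree_centrality(adj):
    """Number of neighbours per node (hub likeness)."""
    return {v: len(neighbours) for v, neighbours in adj.items()}


def betweenness_centrality(adj):
    """Unnormalised betweenness (bridge likeness) via Brandes' algorithm."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back towards s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both ends in an undirected graph.
    return {v: score / 2 for v, score in bc.items()}


# Toy network: protein H is a hub bridging three otherwise unconnected proteins.
network = {"H": {"a", "b", "c"}, "a": {"H"}, "b": {"H"}, "c": {"H"}}
dc = degree_centrality(network)
bwc = betweenness_centrality(network)
```

In the toy network the hub has the highest score on both measures, which is the kind of value that can then be stored as a node attribute and mapped to node size or colour.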

58. Degree centrality represented by node size

59. Betweenness centrality represented by node colour (Artem Lysenko)

60. Filters, annotators and layouts. Combination of these three types of tools in Ondex → a more complex application case …

61. Application case 4: Bioenergy project (Keywan Hassani-Pak). Use bioinformatics to support phenotype-genotype research in bioenergy crops. Given a phenotypic variant, is it possible to pin down the relevant genes? Develop tools to support systematic analysis of QTL regions to pin down relevant genes. Identify genes implicated in biomass production in willow. Prioritise genes for experimental validation. Biofuel Conversion Process: http://www.jgi.doe.gov/education/bioenergy/bioenergy_1.html

62. QTL and genomic data. QTL: the willow genome is not sequenced yet; a QTL may encompass many potential candidates, perhaps hundreds. Poplar is the first tree with a fully sequenced genome: 19 chromosomes, 45778 predicted genes, 4x larger than the Arabidopsis genome. Not much is known about the function of the genes.
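Collecting the candidate genes that fall inside a QTL is, at its core, an interval-overlap query on chromosome coordinates. The sketch below shows that idea only; the `Gene` and `QTL` types, the function name and all ids and positions are made up for illustration and are not taken from the project's data.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    id: str
    chromosome: int
    start: int
    end: int


class QTL(NamedTuple):
    name: str
    chromosome: int
    start: int
    end: int


def genes_in_qtl(genes, qtl):
    """Ids of genes on the QTL's chromosome that overlap the QTL interval."""
    return [
        g.id for g in genes
        if g.chromosome == qtl.chromosome
        and g.start <= qtl.end
        and g.end >= qtl.start
    ]


# Made-up coordinates for illustration only.
genes = [
    Gene("gene_A", 1, 1_000, 2_000),
    Gene("gene_B", 1, 4_500, 6_000),   # partly overlaps the QTL
    Gene("gene_C", 1, 9_000, 9_500),   # same chromosome, outside the QTL
    Gene("gene_D", 2, 1_500, 2_500),   # different chromosome
]
qtl = QTL("biomass_qtl", 1, 500, 5_000)
candidates = genes_in_qtl(genes, qtl)
```

A real QTL could return hundreds of genes from such a query, which is why the remaining steps link each candidate to functional annotation before prioritising.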

63. Linking genes to data sources (figure). Labels shown: Willow; model, e.g. Poplar, Arabidopsis; Linked References; Pathways; Plant Hormones; QTL Map; Orthologous Markers; Physical map; Expression Patterns; Genes; Gene Function. Result: list of candidate genes linked to biological processes.

64. Relevant data sources. Release 15.10; Poplar Gene Prediction v2.0 (Jan 2010); all plants: 739,396 proteins, reviewed: 28,404 proteins (3.84%). PoplarCyc 1.0: 285 pathways, 3434 enzymes, 1363 compounds (Oct 2009). Pfam 24.0: 11,912 protein families (Oct 2009). Poplar transcription factors: DPTF, 2,576 putative TF (March 2007); PlnTFDB, 2,901 putative TF (July 2009). 29,365 GO terms (Jan 2010). Poplar/Willow QTL: work in progress, preliminary dataset available. Only loading referenced publications: ~15,000 articles.

65. Unique knowledge base for Poplar. Proteins annotated with functional information and publications, based on comparative genomics and protein family analysis. Genes and QTLs enriched with positional information. Data integration was done in Ondex.

66. Ondex genomics layout. The genomic layout displays chromosomes, genes and QTLs. Chromosomal regions and QTLs can be selected.

67. Ondex genomics filter. Genes of interest; enriched protein annotation network.

68. Phenotypic information in literature. Poplar protein 217086. HMMer: 650581 – HLH, E-value 3.4E-7, score 30.0. BLAST: 217086 – LAX, E-value 8.3E-17, score 80.88. BLAST: 217086 – BHLH63, E-value 8.3E-9, score 54.3. PMID:13130077 “LAX and SPA: major regulators of shoot branching in rice.” We identified two remote homologs, in rice (LAX) and in Arabidopsis (BHLH63), as well as one protein domain (HLH). The literature linked to the LAX homolog provides evidence that it is a major regulator of shoot branching → hypothesis generation.

70. Sequence analysis

71. Text mining; Experimental data; Hypothesis; New experiments

73. James Dewar

74. Eva Holstein

75. Katherine James

76. Philip Lord

77. David Lydall

78. Matthew Pocock

79. Jochen Weile

80. Darren Wilkinson

82. Keywan Hassani-Pak

83. Stephen Hanley

84. Matthew Hindle

85. Angela Karp

86. Shao Chih Kuo

87. Artem Lysenko

88. Chris Rawlings

89. Mansoor Saqi

90. Andrea Splendiani

92. Paul Dobson

93. Paul Fisher

94. Carole Goble

95. Gina Levow

96. Pedro Mendes

97. Raheel Nawaz

98. Georgina Moulton

99. Robert Stevens

100. David Withers

102. Martin Urban

103. Dimah Habash

104. David Wild

105. Katherine Denby

108. Andrew Millar
