Web Science 2.0 - in silico science

The same story as usual, but with a bit more context (why it is absolutely necessary to move science in this direction). Presented to the University of Potsdam, Germany, and the University of New Brunswick, Canada, in December 2012.


  1. Web Science 2.0: Conducting in silico research in the Web, from hypothesis to publication. Mark Wilkinson, Isaac Peral Senior Researcher in Biological Informatics, Centro de Biotecnología y Genómica de Plantas, UPM, Madrid, Spain; Adjunct Professor of Medical Genetics, University of British Columbia, Vancouver, BC, Canada.
  2. Context: Multiple recent surveys of high-throughput biology reveal that upwards of 50% of published studies are not reproducible (Baggerly, 2009; Ioannidis, 2009).
  3. Context: “The most common errors are simple, the most simple errors are common” (Baggerly, 2009).
  4. Context: These errors pass peer review. The researcher is unaware of the error. The process that led to the error is not recorded. Therefore it cannot be detected during peer review.
  5. Context: Discovery of such errors has resulted in retractions and even shut-down clinical trials.
  6. Context: In March 2012, the US Institute of Medicine said “Enough is enough!”
  7. Context: Institute of Medicine recommendations for the conduct of high-throughput research:
      (1) Rigorously described, annotated, and followed data management procedures.
      (2) “Lock down” the computational analysis pipeline once it has been selected.
      (3) Publish the workflow in a formal manner, together with the full starting and result datasets.
      Source: Evolution of Translational Omics: Lessons Learned and the Path Forward. The Institute of Medicine of the National Academies, Report Brief, March 2012.
  8. Achieving these recommendations requires integration of existing technologies and invention of new ones.
  9. Context: “While it took 2,300 years after the first report of angina for the condition to be commonly taught in medical curricula, modern discoveries are being disseminated at an increasingly rapid pace.” The Healthcare Singularity and the Age of Semantic Medicine, Michael Gillam, et al., in The Fourth Paradigm: Data-Intensive Scientific Discovery, Tony Hey (Editor), 2009. Slide adapted with permission from Joanne Luciano, presentation at Health Web Science Workshop 2012, Evanston, IL, USA, June 22, 2012.
  10. “The Singularity”: The X-intercept is where, the moment a discovery is made, it is immediately put into practice (not only medical practice, but any research endeavour...). The Healthcare Singularity and the Age of Semantic Medicine, Michael Gillam, et al., in The Fourth Paradigm: Data-Intensive Scientific Discovery, Tony Hey (Editor), 2009. Slide borrowed with permission from Joanne Luciano, presentation at Health Web Science Workshop 2012, Evanston, IL, USA, June 22, 2012.
  11. The technology required to achieve this does not yet exist.
  12. You Are Here: Scientific research would have to be conducted within a medium that immediately interpreted and disseminated the results...
  13. You Are Here: ...in a form that immediately (actively!) affected the research of others...
  14. You Are Here: ...without requiring them to be aware of these new discoveries.
  15. I’d like to show you how close we now are to this vision and how we got there.
  16. Web Science 2.0
  17. We wanted to duplicate a real, peer-reviewed bioinformatics analysis simply by building a model in the Web describing what the answer (if one existed) would look like...
  18. ...the machine had to make every other decision on its own.
  19. Brief Digression: “in” the Web??
  20. How we use the Web today
  21. By clicking here you cause this incredibly powerful computational tool called The Web to retrieve a chunk of text and images that can only be understood by a human...
  22. The Web is not a pigeon!
  23. To achieve this vision, we must learn how to do research IN the Web, not OVER the Web.
  24. Resume Speed
  25. We wanted to duplicate a real, peer-reviewed bioinformatics analysis simply by building a model in the Web describing what the answer (if one existed) would look like...
  26. ...the machine had to make every other decision on its own.
  27. This is the study we chose:
  28. Gordon, P.M.K., Soliman, M.A., Bose, P., Trinh, Q., Sensen, C.W., Riabowol, K.: Interspecies data mining to predict novel ING-protein interactions in human. BMC Genomics. 9, 426 (2008).
  29. Original Study, Simplified: Using what is known about interactions in fly and yeast, predict new interactions with your human protein of interest.
  30. “Pseudo-code” Abstracted Workflow:
      Given a protein P in Species X
        Find proteins similar to P in Species Y
        Retrieve interactors in Species Y
        Sequence-compare Y-interactors with Species X genome
        (1) Keep only those with a homologue in X
        Find proteins similar to P in Species Z
        Retrieve interactors in Species Z
        Sequence-compare Z-interactors with (1)
        Putative interactors in Species X
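The abstracted workflow above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the lookup tables, protein names, and function names are all made up, and real similarity search and sequence comparison are replaced by dictionary lookups. It is not the pipeline used in the original study or in the system shown in these slides.

```python
from typing import Dict, Set, Tuple

# Toy lookup tables standing in for similarity-search and interaction services.
# Every identifier below is invented for illustration.
SIMILAR: Dict[Tuple[str, str], Set[str]] = {
    ("P_human", "yeast"): {"P_yeast"},
    ("P_human", "fly"): {"P_fly"},
}
INTERACTORS: Dict[str, Set[str]] = {
    "P_yeast": {"A_yeast", "B_yeast"},
    "P_fly": {"A_fly", "C_fly"},
}
# Homologue in Species X (here "human") of a model-organism protein.
HOMOLOGUE_IN_X: Dict[str, str] = {
    "A_yeast": "A_human",
    "B_yeast": "B_human",
    "A_fly": "A_human",
    # C_fly deliberately has no homologue in this toy data set.
}


def candidates_via(protein: str, species: str) -> Set[str]:
    """Species-X homologues of interactors of proteins similar to `protein`."""
    found: Set[str] = set()
    for similar in SIMILAR.get((protein, species), set()):
        for interactor in INTERACTORS.get(similar, set()):
            homologue = HOMOLOGUE_IN_X.get(interactor)
            if homologue is not None:  # keep only those with a homologue in X
                found.add(homologue)
    return found


def putative_interactors(protein: str, species_y: str, species_z: str) -> Set[str]:
    """Keep only candidates supported by both model organisms."""
    return candidates_via(protein, species_y) & candidates_via(protein, species_z)


print(putative_interactors("P_human", "yeast", "fly"))
```

In this toy data only `A_human` is supported by both yeast and fly, so it is the single putative interactor; the set intersection plays the role of the final comparison against set (1).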
  31. Modeling the answer... OWL: the Web Ontology Language (OWL) is the language approved by the W3C for representing knowledge in the Web.
  32. Modeling the answer... Note that every word in this diagram is, in reality, a URL (because it is OWL). The model of the answer is published in The Web and borrows ideas from other models published in The Web.
  33. Modeling the answer... ProbableInteractor is homologous to (Potential Interactor from ModelOrganism1…) and (Potential Interactor from ModelOrganism2…). Probable Interactor is defined in OWL as a subclass of Potential Interactor that requires homologous pairs of interacting proteins to exist in both comparator model organisms. (Effectively, an intersection.)
  34. Publish our OWL model of a Probable Interactor in the Web
  35. Running the Web Science Experiment: In a local data file, provide the protein we are interested in and the two species we wish to use in our comparison:
      taxon:9606 a i:OrganismOfInterest .      # human
      uniprot:Q9UK53 a i:ProteinOfInterest .   # ING1
      taxon:4932 a i:ModelOrganism1 .          # yeast
      taxon:7227 a i:ModelOrganism2 .          # fly
  36. The tricky bit is... In the abstract, the search for homology is “generic”: ANY protein, ANY model system. But when the machine does the experiment, it must use specific resources, because the answer requires information from two declared species:
      taxon:4932 a i:ModelOrganism1 .          # yeast
      taxon:7227 a i:ModelOrganism2 .          # fly
  37. This is the question we ask (the query language here is SPARQL):
      PREFIX i: <http://sadiframework.org/ontologies/InteractingProteins.owl#>
      SELECT ?protein
      FROM <file:/local/workflow.input.n3>
      WHERE {
        ?protein a i:ProbableInteractor .
      }
      Annotation on the query: the reference (URL) to our OWL model of the answer.
  38. Our system then derives (and executes) the following workflow automatically. These are different Web services! ...selected at run-time based on the same model.
  39. There are four very cool things about what you just saw...
  40. There are four very cool things about what you just saw... The system was able to create a workflow based on an OWL model (ontology).
  41. There are four very cool things about what you just saw... The system was able to create a COMPUTATIONAL workflow based on a BIOLOGICAL model.
  42. There are four very cool things about what you just saw... The workflow it created (i.e. the services chosen) differed depending on context.
  43. There are four very cool things about what you just saw... Tool selection was guided by the encoded knowledge of domain experts worldwide.
  44. We got the answer “simply” by designing a model of the answer!
  45. How did we do that?
  46. A “Smart” Biomedical Resource Representation System
  47. A Web application that answers SPARQL-DL queries. Query answering enhanced by SADI.
  48. Demo #1
  49. Imagine a “virtual database”: all of the data from all databases + the result of every conceivable analysis.
  50. How can we query that database?
  51. What is the phenotype of every allele of the Antirrhinum majus DEFICIENS gene?
      SELECT ?allele ?image ?desc
      WHERE {
        locus:DEF genetics:hasVariant ?allele .
        ?allele info:visualizedByImage ?image .
        ?image info:hasDescription ?desc
      }
  52. What is the phenotype of every allele of the Antirrhinum majus DEFICIENS gene?
      SELECT ?allele ?image ?desc
      WHERE {
        locus:DEF genetics:hasVariant ?allele .
        ?allele info:visualizedByImage ?image .
        ?image info:hasDescription ?desc
      }
      Note that there is no “FROM” clause! We don’t tell it where it should get the information. The machine has to figure that out by itself...
  53. Enter that query into SHARE
  54. Click “Submit”...
  55. ...and in a few seconds you get your answer.
  56. The query results are live hyperlinks to the respective database or images.
  57. Neither SADI nor SHARE knows anything about plant biology or genetics.
  58. What pathways does UniProt protein P47989 belong to?
      PREFIX pred: <http://sadiframework.org/ontologies/predicates.owl#>
      PREFIX ont: <http://ontology.dumontierlab.com/>
      PREFIX uniprot: <http://lsrn.org/UniProt:>
      SELECT ?gene ?pathway
      WHERE {
        uniprot:P47989 pred:isEncodedBy ?gene .
        ?gene ont:isParticipantIn ?pathway .
      }
  60. What pathways does UniProt protein P47989 belong to?
      PREFIX pred: <http://sadiframework.org/ontologies/predicates.owl#>
      PREFIX ont: <http://ontology.dumontierlab.com/>
      PREFIX uniprot: <http://lsrn.org/UniProt:>
      SELECT ?gene ?pathway
      WHERE {
        uniprot:P47989 pred:isEncodedBy ?gene .
        ?gene ont:isParticipantIn ?pathway .
      }
      Note again that there is no “FROM” clause… I have not told SHARE where to look for the answer; I am simply asking my question.
  61. Enter that query into SHARE
  62. Two different providers of gene information (KEGG and NCBI) were found and accessed. Two different providers of pathway information (KEGG and GO) were found and accessed.
  63. The results are all links to the original data.
  64. Neither SADI nor SHARE knows anything about proteins or biochemical pathways.
  65. Recap of what we just saw: We posed, and answered, ~complex multi-database queries WITHOUT A DATA WAREHOUSE.
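One way to picture how a query with no “FROM” clause can still be answered is a registry that maps each predicate in the query to the services able to supply it. The sketch below is a loose illustration of that idea only. The registry, the service functions, and the gene and pathway identifiers are hypothetical, and nothing here reflects the actual SADI or SHARE interfaces.

```python
from typing import Callable, Dict, List, Set, Tuple

Service = Callable[[str], Set[str]]


# Hypothetical services; each returns values for one predicate of one subject.
def gene_provider_a(protein: str) -> Set[str]:
    return {"gene:G1"} if protein == "uniprot:P47989" else set()


def gene_provider_b(protein: str) -> Set[str]:
    return {"gene:G1"} if protein == "uniprot:P47989" else set()


def pathway_provider_a(gene: str) -> Set[str]:
    return {"pathway:PW1"} if gene == "gene:G1" else set()


def pathway_provider_b(gene: str) -> Set[str]:
    return {"pathway:PW2"} if gene == "gene:G1" else set()


# Assumed registry: predicate name -> services that can attach that predicate.
REGISTRY: Dict[str, List[Service]] = {
    "pred:isEncodedBy": [gene_provider_a, gene_provider_b],
    "ont:isParticipantIn": [pathway_provider_a, pathway_provider_b],
}


def resolve(subject: str, predicate: str) -> Set[str]:
    """Ask every registered provider of `predicate` and merge their answers."""
    values: Set[str] = set()
    for service in REGISTRY.get(predicate, []):
        values |= service(subject)
    return values


def pathways_for(protein: str) -> List[Tuple[str, str]]:
    """Answer the two-clause query by chaining predicate lookups."""
    rows: List[Tuple[str, str]] = []
    for gene in sorted(resolve(protein, "pred:isEncodedBy")):
        for pathway in sorted(resolve(gene, "ont:isParticipantIn")):
            rows.append((gene, pathway))
    return rows


print(pathways_for("uniprot:P47989"))
```

The caller never names a data source: providers are found by predicate at query time, and answers from several providers are merged, which mirrors the two gene providers and two pathway providers mentioned in the demo.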
  66. Demo #2: An example from the clinical domain
  67. Show me the latest Blood Urea Nitrogen and Creatinine levels of patients who appear to be rejecting their transplants
      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX patient: <http://sadiframework.org/ontologies/patients.owl#>
      PREFIX l: <http://sadiframework.org/ontologies/predicates.owl#>
      SELECT ?patient ?bun ?creat
      FROM <http://sadiframework.org/ontologies/patients.rdf>
      WHERE {
        ?patient rdf:type patient:LikelyRejecter .
        ?patient l:latestBUN ?bun .
        ?patient l:latestCreatinine ?creat .
      }
  68. Show me the latest Blood Urea Nitrogen (BUN) and Creatinine levels of patients who appear to be rejecting their transplants
      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX patient: <http://sadiframework.org/ontologies/patients.owl#>
      PREFIX l: <http://sadiframework.org/ontologies/predicates.owl#>
      SELECT ?patient ?bun ?creat
      FROM <http://sadiframework.org/ontologies/patients.rdf>
      WHERE {
        ?patient rdf:type patient:LikelyRejecter .
        ?patient l:latestBUN ?bun .
        ?patient l:latestCreatinine ?creat .
      }
  69. Likely Rejecter: A patient who has creatinine levels that are increasing over time (Mark D. Wilkinson’s definition)
  70. Likely Rejecter: …but there is no “likely rejecter” column or table in our database… only blood chemistry measurements at various time-points
  71. Likely Rejecter: So the data required to answer this question DOESN’T EXIST!
  72. ?
  73. Enter that query into SHARE
  74. Now… Two “magical” events occur…
  75. The machine decides by itself that it needs to do a Linear Regression analysis on the blood creatinine measurements in order to answer your question
  76. The machine decides by itself how and where that analysis can be done, and does it automatically!
  77. http://www.impactlab.net/2009/03/22/improve-your-brain-power/
  78. The SHARE system utilizes SADI to discover analytical services on the Web that do linear regression analysis, and sends the data to be analysed
  79. VOILA!
  80. Neither SADI nor SHARE knows anything about blood chemistry or mathematics
  81. So how does the machine know what to do??
  82. Ontologies
  83. Ontologies explicitly define the kinds of things that (can) exist… and what those things are “like”, i.e. what properties they have (color, weight, shape, texture, temperature, “state”) and what relationships they have to one another (inside-of, adjacent-to, part-of, binds-to, controls, inhibits, degrades, etc.)
  84. So we create ontologies about biology and health. We* publish them on the Web. (* We… or anybody! Anybody can publish an ontology!)
  85. My definition of a Likely Rejecter is encoded in a machine-readable document written in the OWL ontology language. Basically: “the regression line over creatinine measurements should have an increasing slope”
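The “increasing slope” rule can be made concrete with an ordinary least-squares fit. The sketch below is an assumption-laden illustration rather than the service that SHARE actually calls: the patient identifiers, the measurement values, and the units (days, mg/dL) are invented, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (time, creatinine level); units are assumed


def regression_slope(points: List[Point]) -> float:
    """Slope of the ordinary least-squares line through the points."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all measurements share the same time point")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def is_likely_rejecter(points: List[Point]) -> bool:
    """Apply the slide's rule: the regression line has an increasing slope."""
    return regression_slope(points) > 0


# Invented measurements: (day, creatinine in mg/dL).
PATIENTS: Dict[str, List[Point]] = {
    "patient:1": [(0, 1.0), (30, 1.4), (60, 1.9)],
    "patient:2": [(0, 1.2), (30, 1.1), (60, 1.0)],
}

likely = [pid for pid, series in PATIENTS.items() if is_likely_rejecter(series)]
print(likely)
```

Here the class membership is computed on demand from raw measurements, which is why no “likely rejecter” column needs to exist in the database.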
  86. Our ontology refers to other ontologies (possibly published by other people) to learn about what the properties of “regression models” are, e.g. that regression models have slopes and intercepts, and that slopes and intercepts have decimal values
  87. SHARE examines the query, looks on the Web for ontologies that describe the problem it is trying to solve, and “reads” them, then uses that “knowledge” to figure out which data sources and analytical tools it needs to answer the query
  88. The way SHARE “interprets” data varies depending on the context of the query (i.e. which ontologies it reads: Mine? Yours?) and on what part of the query it is trying to answer at any given moment (which ontological concept is relevant to that clause)
  89. Data exhibits “late binding”
  90. Late binding: the “purpose and meaning” of the data is not determined until the moment it is required, a.k.a. the “semantics” of the data
  91. Benefit of late binding: Data is amenable to constant re-interpretation
  92. Example? Blood Creatinine measurements were not dictated to be (only) Blood Creatinine measurements
  93. Example? The data had the “qualities/properties” that allowed one machine to interpret that they were Blood Creatinine measurements (e.g. to determine which patients were rejecting)
  94. Example? But the data also had the “qualities/properties” that allowed another machine to interpret them as simple X/Y coordinate data (e.g. the Linear Regression calculation tool)
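The late-binding example can be mimicked in plain Python: the same stored records are read as creatinine measurements by one consumer and as bare X/Y pairs by another. This is only an analogy under stated assumptions. The `Observation` record, its field names, and the sample values are invented, and the real system makes this decision from ontologies rather than from hard-coded functions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Observation:
    """One stored measurement; nothing here fixes how it must be used."""
    patient: str
    analyte: str   # illustrative label, e.g. "creatinine"
    day: float     # assumed unit: days since transplant
    value: float   # assumed unit: mg/dL


def clinical_view(observations: List[Observation], patient: str) -> List[Observation]:
    """Clinical reading: one patient's creatinine measurements in time order."""
    picked = [o for o in observations
              if o.patient == patient and o.analyte == "creatinine"]
    return sorted(picked, key=lambda o: o.day)


def xy_view(observations: List[Observation]) -> List[Tuple[float, float]]:
    """Numeric reading: plain (x, y) pairs with no notion of blood chemistry."""
    return [(o.day, o.value) for o in observations]


DATA = [
    Observation("patient:1", "creatinine", 30.0, 1.4),
    Observation("patient:1", "creatinine", 0.0, 1.0),
    Observation("patient:1", "bun", 0.0, 14.0),
]

series = clinical_view(DATA, "patient:1")  # interpreted as clinical data
points = xy_view(series)                   # re-interpreted as coordinates
print(points)
```

Neither view changes the stored data; the interpretation is chosen by the consumer at the moment of use.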
  95. Benefit of late binding: Data is amenable to constant re-interpretation
  96. http://www.flickr.com/people/faernworks/
  97. And that brings us to...
  98. Gordon, P.M.K., Soliman, M.A., Bose, P., Trinh, Q., Sensen, C.W., Riabowol, K.: Interspecies data mining to predict novel ING-protein interactions in human. BMC Genomics. 9, 426 (2008).
  99. We built a model of the proposed answer
  100. Our system converted the model into the experiment
  101. The analytical tools chosen for that experiment changed depending on context, even though the biological model driving their selection was the same
  102. i.e. the published model is re-usable
  103. i.e. the published model is re-usable in different contexts... by different researchers
  104. and because the model IS the experiment, the published EXPERIMENT is re-usable!! Simply point the same query at your own dataset...
  105. The publication is an executable document!
  106. Every component of the model, every component of the input data, and every component of the output data is a URL. Therefore the question, the experiment, and the answer are immediately published IN the Web
  107. Every component of the model, every component of the input data, and every component of the output data is a URL. The answer, and the knowledge derived from it, is immediately available to search engines and, moreover, can affect the outcome of other Web Science experiments
  108. You Are Now Here!!!
  109. Final thoughts
  110. An experiment... based on a hypothesis
  111. An experiment... based on a hypothesis, now modeled in OWL
  112. Does this OWL Class represent the Hypothesis? I think it does!
  113. We modeled the answer... ...but the answer was hypothetical
  114. Change the way we think of “hypotheses”
  115. In Web Science 2.0: Model what the world would “look like” if your hypothesis were true. Then ask “is there any data that fits that model?”
  116. Like the blind men examining an elephant: seemingly different aspects of research, when viewed from the perspective of Web Science, become the same “thing”: The Model
  117. Our vision of Web Science 2.0 (diagram labels): Hypothesis; Query; Workflow; Ontology; Result; Materials & Methods. These can be automatically derived through provenance information during workflow execution
  118. Please join us! SADI and SHARE are open-source projects: http://sadiframework.org
  119. My New Home!
  120. University of British Columbia: Luke McCarthy – Lead Dev. Edward Kawas Everything... SADI Service auto-generator Benjamin VanderValk Ian Wood SHARE & SADI & Experimental modeling & Experimental modeling project myHeath Button Soroush Samadian Cardiovascular data modeling and queries
  121. C-BRASS collaborators at other sites: U of New Brunswick, Carleton University. Dr. Chris Baker, Dr. Michel Dumontier, Alexandre Riazanov, Marc-Alexandre Nolin, Leonid Chepelev, Steve Etlinger, Nichaella Kieth, Jose Cruz
  122. Microsoft Research
