Technologies, methods and challenges to data sharing and aggrigation

697 views

Published on

Presentation to the "Pest Practices in Personalized Medicine" (B2PM) Conference, Vancouver, BC, Canada. March, 2011.

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
697
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
5
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Technologies, methods and challenges to data sharing and aggrigation

  1. 1. Technologies, methods and challengesto effective public data sharing and aggregation<br />Mark D Wilkinson<br />Medical Genetics, UBC<br />PI Bioinformatics<br />Heart + Lung Institute at St. Paul’s Hospital<br />markw@illuminae.com<br />http://wilkinsonlab.ca<br />
  2. 2. Thanks in advance(very few of these ideas are my own!)<br />Paul Gordon – Sun Center of Excellence, U Calgary<br />Carole Goble – University of Manchester<br />Charles Petrie – Stanford University<br />
  3. 3. We’ve come a long way!!<br />
  4. 4. In the beginning...<br />
  5. 5. Link Integration<br />
  6. 6.
  7. 7. Integration of What?<br />
  8. 8. Specially-formatted Files<br />EMBL Format<br />
  9. 9. Specially-formatted Files<br />GenBankFormat<br />
  10. 10. Specially-formatted Files<br />FASTA Format<br />
  11. 11. Specially-formatted Files<br />GCG Format<br />
  12. 12. At least 20 different formats for <br /> representing DNA sequences...<br />Lord et al. 2004<br />
  13. 13. Many formats contained a wide variety of related, but different, information<br />DNA sequence<br />Sequence features<br />Translation<br />Date/time/method<br />Publication cross-references<br />...<br />...<br />
  14. 14. Each file format required it’s own parser...<br />
  15. 15. ...and that problem wasn’t limited to DNA...<br />
  16. 16. XML<br />
  17. 17. What did XML do for us?<br />“...advent of XML meant that we didn’t <br />have to write our own parsers anymore...”<br />Individual data elements in a file can be automatically located and extracted<br />
  18. 18. Predictable way to represent data<br />Makes it easier for machines to encode/extract<br />
  19. 19. EMBL Record for BRCA1<br />In XML<br />
  20. 20. GenBank Record for BRCA1<br />In XML<br />
  21. 21. So now we can share and <br />aggregate data!<br />
  22. 22. So now we can share and <br />aggregate data!<br />
  23. 23. EMBL Record for BRCA1<br />In XML<br />GenBank Record for BRCA1<br />In XML<br />
  24. 24. So now we can share and <br />aggregate data!<br />...because it isn’t (just) a parsing problem...<br />Various resources have various data models<br />
  25. 25. So... Let’s find a way to describe the data models!<br />
  26. 26. XML Schema<br />
  27. 27. XML Schema<br />There will be an element called “qualifier”<br />It will have an attribute called “name”<br />The content of that attribute will betext<br />There will be a child element called “value”<br />The content of that child element will be free-text<br />XML Schema<br />There will be an element called “GBQualifier”<br />There will be a child element called “GBQualifier_name”<br />The content of that child element will be free-text<br />There will be a child element called “GBQualifier_value”<br />The content of that child element will be free-text<br />
  28. 28. So now we can share and <br />aggregate data!<br />
  29. 29. So now we can share and <br />aggregate data!<br />
  30. 30. What did XML Schema do for us?<br />“...XML Schema (among other things) allowed <br />us to ~automate the creation of (in-memory) <br />Structures which could hold the given <br />XML-formatted data...”<br />
  31. 31. Does not solve the integration or aggregation problem<br />
  32. 32. XML Schema<br />There will be an element called “qualifier”<br />It will have an attribute called “name”<br />The content of that attribute will betext<br />There will be a child element called “value”<br />The content of that child element will be free-text<br />XML Schema<br />There will be an element called “GBQualifier”<br />There will be a child element called “GBQualifier_name”<br />The content of that child element will be free-text<br />There will be a child element called “GBQualifier_value”<br />The content of that child element will be free-text<br />
  33. 33. XML Schema<br />There will be an element called “qualifier”<br />It will have an attribute called “name”<br />The content of that attribute will betext<br />There will be a child element called “value”<br />The content of that child element will be free-text<br />Because the “meaning” of eachelement is implicit, we resort to<br />“Schema Mapping” <br />to integrate the data<br />XML Schema<br />There will be an element called “GBQualifier”<br />There will be a child element called “GBQualifier_name”<br />The content of that child element will be free-text<br />There will be a child element called “GBQualifier_value”<br />The content of that child element will be free-text<br />
  34. 34. XML Schema<br />There will be an element called “qualifier”<br />It will have an attribute called “name”<br />The content of that attribute will betext<br />There will be a child element called “value”<br />The content of that child element will be free-text<br />XML Schema<br />There will be an element called “GBQualifier”<br />There will be a child element called “GBQualifier_name”<br />The content of that child element will be free-text<br />There will be a child element called “GBQualifier_value”<br />The content of that child element will be free-text<br />
  35. 35. XML Schema<br />There will be an element called “qualifier”<br />It will have an attribute called “name”<br />The content of that attribute will betext<br />There will be a child element called “value”<br />The content of that child element will be free-text<br />XML Schema<br />There will be an element called “GBQualifier”<br />There will be a child element called “GBQualifier_name”<br />The content of that child element will be free-text<br />There will be a child element called “GBQualifier_value”<br />The content of that child element will be free-text<br />
  36. 36. XML Schema<br />There will be an element called “qualifier”<br />It will have an attribute called “name”<br />The content of that attribute will betext<br />There will be a child element called “value”<br />The content of that child element will be free-text<br />Automatically<br />Map Schema, thenwe can integrate data!<br />XML Schema<br />There will be an element called “GBQualifier”<br />There will be a child element called “GBQualifier_name”<br />The content of that child element will be free-text<br />There will be a child element called “GBQualifier_value”<br />The content of that child element will be free-text<br />
  37. 37. XML Schema<br />There will be an element called “qualifier”<br />It will have an attribute called “name”<br />The content of that attribute will betext<br />There will be a child element called “value”<br />The content of that child element will be free-text<br />XML Schema<br />There will be an element called “GBQualifier”<br />There will be a child element called “GBQualifier_name”<br />The content of that child element will be free-text<br />There will be a child element called “GBQualifier_value”<br />The content of that child element will be free-text<br />
  38. 38. Nevertheless...<br />
  39. 39. Web Services<br />“Service Oriented Architectures”<br />WSDL <br />(and many other 4-letter words)<br />
  40. 40. Web Services & SOA’s<br />Allow you to expose software <br />(e.g. a database, analytical tool, or service)<br />on the Web<br />so that others can use it(in their own analytical pipelines)<br />
  41. 41. Excellent!!<br />
  42. 42. But...<br />
  43. 43. XML Schema<br />
  44. 44. XML Schema<br />There will be an element called “qualifier”<br />It will have an attribute called “name”<br />The content of that attribute will betext<br />There will be a child attribute called “value”<br />The content of that child attribute will be free-text<br />XML Schema<br />There will be an element called “GBQualifier”<br />There will be a child attribute called “GBQualifier_name”<br />The content of that child attribute will be free-text<br />There will be a child attribute called “GBQualifier_value”<br />The content of that child attribute will be free-text<br />
  45. 45. “The phrase ‘practical Web Services’ is not intrinsically an oxymoron, but [I] argue that there are <br /> few in existence.”<br />
  46. 46. Why?<br />
  47. 47. Because this problem is so disruptivethat there is little point in building “public” Web Services... <br />They are simply too difficult to integrate with other “public” Web Services.<br />-- adapted from Petrie, SWSIP 2009<br />XML Schema<br />There will be an element called “qualifier”<br />It will have an attribute called “name”<br />The content of that attribute will betext<br />There will be a child attribute called “value”<br />The content of that child attribute will be free-text<br />XML Schema<br />There will be an element called “GBQualifier”<br />There will be a child attribute called “GBQualifier_name”<br />The content of that child attribute will be free-text<br />There will be a child attribute called “GBQualifier_value”<br />The content of that child attribute will be free-text<br />
  48. 48. ...and that’s pretty muchwhere the world is right now...<br />
  49. 49. But there is hope!<br />
  50. 50. “Linked Data” movement<br />Resource Description Framework<br />“RDF”<br />Two new technologies & communities<br />The “Semantic Web” movement<br />Web Ontology Language<br />“OWL” (+ RDF)<br />
  51. 51. What does RDF do for us?<br />“...RDF replaces XML Schema, because RDF says that there is only one data model...”<br />
  52. 52. What does OWL do for us?<br />XML Schema<br />There will be an element called “qualifier”<br />It will have an attribute called “name”<br />The content of that attribute will betext<br />There will be a child attribute called “value”<br />The content of that child attribute will be free-text<br />“...the semantics are no longer implicit in the data model...”<br />XML Schema<br />There will be an element called “GBQualifier”<br />There will be a child attribute called “GBQualifier_name”<br />The content of that child attribute will be free-text<br />There will be a child attribute called “GBQualifier_value”<br />The content of that child attribute will be free-text<br />
  53. 53. So what?<br />
  54. 54. The Semantic Web <br />Gives us the opportunity to re-thinkhow we build our health data infrastructures<br />
  55. 55. The Semantic Web <br />isn’t <br />“yet another layer of technology”<br />
  56. 56. The Semantic Web <br />changes<br />the way we write software<br />
  57. 57. The Semantic Web <br />What to do & How to do it<br />is no longer encoded in your software<br />
  58. 58. The Semantic Web <br />What to do & How to do it<br />is part of the data<br />
  59. 59. The Semantic Web <br />What to do & How to do itis part of a shared, expert understanding<br />
  60. 60. The Semantic Web <br />What to do & How to do it<br />IS* PERSONAL!<br />* can be...<br />
  61. 61. One piece of software<br />Any question... Any answer<br />
  62. 62. Let me demonstrate what I mean<br />
  63. 63. Semantic Automated Discovery and Integrationhttp://sadiframework.org<br />MicrosoftResearch<br />Founding partner<br />
  64. 64. Semantic Health And Research Environment(a Semantic Web question answerer...)<br />
  65. 65. Example #1<br />Show me the latest Blood Urea Nitrogen and Creatinine levelsof patients who appear to be rejecting their transplants<br />SELECT ?patient ?bun ?creat<br />FROM <http://sadiframework.org/ontologies/patients.rdf><br />WHERE {<br /> ?patientrdf:typepatient:LikelyRejecter .<br /> ?patient l:latestBUN ?bun . <br /> ?patient l:latestCreatinine ?creat . <br />}<br />
  66. 66. Likely Rejecter:<br />A patient who has creatinine levelsthat are increasing over time<br />- - Wilkinson “MD”<br />
  67. 67. Likely Rejecter:<br />…but there is no “likely rejecter” column or table in our database… <br />
  68. 68. Likely Rejecter:<br />Our database contains various blood chemistry measurementsat various time-points<br />
  69. 69. SHARE determines<br />by itselfthe need to do a Linear Regression analysis over Creatinine blood chemistry measurements<br />Semantics<br />
  70. 70. SHARE determines<br />by itselfhow and wherethat analysiscan be doneand does it<br />Semantics<br />
  71. 71. The SHARE system utilizes Semantics (via SADI) to discover and access analytical services on the Web that do linear regression analysis<br />
  72. 72. VOILA!<br />
  73. 73. Neither SADI nor SHAREknow anything aboutblood chemistry, or mathematics<br />
  74. 74. Example #2<br />From a (contrived) integrated dataset, <br />retrieve the blood pressure measurements<br />SELECT ?output ?unit ?value <br />FROM <http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl> WHERE { <br /> ?output rdf:typesbp:BloodPressure . <br /> ?output local:hasCanonicalAttribute ?pr .<br /> ?pr sio:SIO_000221 ?unit .<br /> ?pr sio:SIO_000300 ?value .<br />}<br />
  75. 75. This should be extremely straightforward...<br />
  76. 76. ...except for one problem...<br />
  77. 77. <owl:NamedIndividualrdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance1"><br /> <rdf:typerdf:resource="&galen;SystolicBloodPressure"/><br /> <resource:SIO_000300>0.137</resource:SIO_000300><br /> <resource:SIO_000221 rdf:resource="&ucum;unit/pressure/meter-of-mercury-column"/><br /> </owl:NamedIndividual><br /> <owl:NamedIndividualrdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance2"><br /> <rdf:typerdf:resource="&galen;SystolicBloodPressure"/><br /> <resource:SIO_000300>12.45</resource:SIO_000300><br /> <resource:SIO_000221 rdf:resource="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#centi-meter-of-mercury-column"/><br /> </owl:NamedIndividual><br /> <owl:NamedIndividualrdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance3"><br /> <rdf:typerdf:resource="&galen;SystolicBloodPressure"/><br /> <resource:SIO_000300>5.3</resource:SIO_000300><br /> <resource:SIO_000221 rdf:resource="&ucum;unit/pressure/inch-of-mercury-column"/><br /> </owl:NamedIndividual><br />
  78. 78. Spot the problem?<br />
  79. 79. Let me help...<br />
  80. 80. <owl:NamedIndividualrdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance1"><br /> <rdf:typerdf:resource="&galen;SystolicBloodPressure"/><br /> <resource:SIO_000300>0.137</resource:SIO_000300><br /> <resource:SIO_000221 rdf:resource="&ucum;unit/pressure/meter-of-mercury-column"/><br /> </owl:NamedIndividual><br /> <owl:NamedIndividualrdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance2"><br /> <rdf:typerdf:resource="&galen;SystolicBloodPressure"/><br /> <resource:SIO_000300>12.45</resource:SIO_000300><br /> <resource:SIO_000221 rdf:resource=“&ucum;unit/framingham/sbpfeb.owl#centi-meter-of-mercury-column"/><br /> </owl:NamedIndividual><br /> <owl:NamedIndividualrdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance3"><br /> <rdf:typerdf:resource="&galen;SystolicBloodPressure"/><br /> <resource:SIO_000300>5.3</resource:SIO_000300><br /> <resource:SIO_000221 rdf:resource="&ucum;unit/pressure/inch-of-mercury-column"/><br /> </owl:NamedIndividual><br />
  81. 81. Example #2<br />From a (contrived) integrated dataset, <br />retrieve the blood pressure measurements<br />Semantics<br />SELECT ?output ?unit ?value <br />FROM <http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl> WHERE { <br /> ?output rdf:typesbp:BloodPressure . <br /> ?output local:hasCanonicalAttribute ?pr .<br /> ?pr sio:SIO_000221 ?unit .<br /> ?pr sio:SIO_000300 ?value .<br />}<br />My semantic definition of “Blood Pressure”includes the units that I want...<br />
  82. 82. Semantics<br />This is enough to trigger SHARE to automatically discover<br />an online unit-conversion service...<br />
  83. 83.
  84. 84. Neither SADI nor SHAREknow anything aboutunits or unit conversions<br />
  85. 85. Many of the challenges <br />to data aggregation and sharingnow have solutions that work!<br />
  86. 86. What, in my opinion, is the greatest remaining challenge?<br />
  87. 87. To a biologist...<br /> ...“data mining” means “this data is mine!”<br />
  88. 88. The challenge to us all<br />Move from Data Mine-ing<br />To Data Ours-ing<br />-- Len Silverston, 2007<br />
  89. 89. XML<br />We’ve come a long way!!<br />XML Schema<br />There will be an element called “qualifier”<br />It will have an attribute called “name”<br />The content of that attribute will betext<br />There will be a child attribute called “value”<br />The content of that child attribute will be free-text<br />XML Schema<br />There will be an element called “GBQualifier”<br />There will be a child attribute called “GBQualifier_name”<br />The content of that child attribute will be free-text<br />There will be a child attribute called “GBQualifier_value”<br />The content of that child attribute will be free-text<br />
  90. 90. TEAM:<br />Luke McCarthyBenjamin Vandervalk<br />SoroushSamadian<br />Microsoft Research<br />

×