Bio2RDF and Beyond!

The Bio2RDF project aims to transform silos of bioinformatics data into a distributed platform for biological knowledge discovery. Initial work focused on building a public database of open linked data with web-resolvable identifiers that provides information about named entities. This involved a syntactic normalization to convert open data represented in a variety of formats (flat file, tab-delimited, XML, web services) to RDF-based linked data with normalized names (HTTP URIs) and basic typing from source databases. Bio2RDF entities also make reference to other open linked data networks (e.g. DBpedia), thus facilitating traversal across information spaces. However, a significant problem arises when attempting more sophisticated knowledge discovery approaches such as question answering or symbolic data mining: each source represents knowledge in a fundamentally different manner, so one must know the underlying data model and reconcile the artefactual differences when they arise. In this talk, we describe our data integration strategy, which makes use of both syntactic and semantic normalization to consistently marshal knowledge to a common data model while leveraging explicit logic-based mappings with community ontologies to further enhance the biological knowledgescope.
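As a rough illustration of the syntactic normalization step described in the abstract, the sketch below turns tab-delimited rows into N-Triples lines with HTTP URI names and basic typing. The function name, the two-column input layout, and the use of rdfs:label for names are assumptions made for this example, not the project's actual conversion code.

```python
import csv
import io

BASE = "http://bio2rdf.org/"  # URI prefix used in the slides
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"  # assumed choice of naming predicate


def rows_to_ntriples(tsv_text, namespace, type_name):
    """Convert tab-delimited (identifier, name) rows into N-Triples lines."""
    lines = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed rows
        identifier, name = row
        subject = f"<{BASE}{namespace}:{identifier}>"
        # Basic typing taken from the source database
        lines.append(f"{subject} <{RDF_TYPE}> <{BASE}{namespace}:{type_name}> .")
        # Escape backslashes and quotes so the literal stays valid N-Triples
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{subject} <{RDFS_LABEL}> "{escaped}" .')
    return lines


triples = rows_to_ntriples("P05067\tAmyloid precursor protein\n", "uniprot", "Protein")
```

Each input row yields one typing statement and one naming statement, so every record from every source ends up in the same subject, predicate, object shape.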

Published in: Health & Medicine

  1. 1. Bio2RDF and Beyond! Large Scale, Distributed Biological Knowledge Discovery. EBI : 14-01-10. Michel Dumontier, Ph.D., Associate Professor of Bioinformatics, Carleton University (Department of Biology; School of Computer Science; Institute of Biochemistry; Ottawa Institute of Systems Biology; Ottawa-Carleton Institute of Biomedical Engineering)
  2. 2. Web-based Knowledge Discovery: a very painful process (Carole Goble, ISWC 2005)
  3. 3. Syntactic Web… It takes a lot of digging to get answers
  4. 4. Portals provide structured information and give better results
  5. 5. We need to expose the deep web. Surface web: 167 terabytes. Deep web: 91,000 terabytes (545-to-one).
  6. 6. Data silos: not made for sharing
  7. 7. How do we integrate these resources?
  8. 8. We want to simultaneously query the 1000+ biological databases
  9. 9. The Semantic Web is a web of knowledge. It is about standards for publishing, sharing and querying knowledge drawn from diverse sources. It enables the answering of sophisticated questions.
  10. 10. A growing web of linked data
  11. 11. Life Science Data Contributors: HCLS (LODD), Neurocommons, Bio2RDF
  12. 12. Resource Description Framework (RDF): allows one to talk about anything. A Uniform Resource Identifier (URI) can be used as an entity name, and Bio2RDF specifies the naming convention. http://bio2rdf.org/uniprot:P05067 is a name for Amyloid precursor protein (short form uniprot:P05067); http://bio2rdf.org/omim:104300 is a name for Alzheimer disease (short form omim:104300).
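The naming convention on this slide can be sketched as a pair of small helpers that convert between the short form (namespace:identifier) and the full Bio2RDF URI. The function names are hypothetical; only the http://bio2rdf.org/ prefix and the two example names come from the slide.

```python
BIO2RDF_BASE = "http://bio2rdf.org/"


def to_uri(short_name):
    """Expand a short name such as 'uniprot:P05067' into its Bio2RDF URI."""
    namespace, sep, identifier = short_name.partition(":")
    if not (namespace and sep and identifier):
        raise ValueError(f"expected 'namespace:identifier', got {short_name!r}")
    return f"{BIO2RDF_BASE}{namespace}:{identifier}"


def to_short_name(uri):
    """Strip the Bio2RDF prefix from a URI, returning 'namespace:identifier'."""
    if not uri.startswith(BIO2RDF_BASE):
        raise ValueError(f"not a Bio2RDF URI: {uri!r}")
    return uri[len(BIO2RDF_BASE):]
```

Because every dataset is named through the same pattern, two sources that mention the same record end up using the same string for it.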
  13. 13. Resource Description Framework (RDF): allows one to express statements. An RDF statement consists of a subject (a resource identified by a URI), a predicate (a resource identified by a URI), and an object (a resource or a literal). Example: uniprot:P05067 is a uniprot:Protein.
  16. 16. Multi-source data integration depends on consistent naming. [Diagram: UniProt contributes "uniprot:P05067 is a uniprot:Protein" and a "has name" statement; Gene Ontology contributes "uniprot:P05067 located in go:Membrane"; iRefIndex contributes "uniprot:P05067 interacts with uniprot:P05067". Because all three sources use the same name for the protein, their statements combine into a unified view.]
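A minimal sketch of why consistent naming matters: if each source is held as a set of (subject, predicate, object) tuples, a unified view is just a union indexed by subject. The triples are the ones shown on the slide; the variable and function names are assumptions for illustration.

```python
from collections import defaultdict

# One set of (subject, predicate, object) statements per source, as on the slide
uniprot = {("uniprot:P05067", "is a", "uniprot:Protein")}
gene_ontology = {("uniprot:P05067", "located in", "go:Membrane")}
irefindex = {("uniprot:P05067", "interacts with", "uniprot:P05067")}


def unified_view(*sources):
    """Union the triple sets and index the statements by subject name."""
    view = defaultdict(set)
    for source in sources:
        for subject, predicate, obj in source:
            view[subject].add((predicate, obj))
    return dict(view)


view = unified_view(uniprot, gene_ontology, irefindex)
```

The merge only works because all three sources spell the subject identically; with source-specific identifiers the result would contain three unrelated entries instead of one.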
  17. 17. Building statements creates knowledge. [Diagram: uniprot:P05067 (label "Amyloid precursor protein", is a Protein) is involved in omim:104300 (label "Alzheimer Disease", is a Disease).]
  18. 18. RDF has multiple representations
      RDF/XML:
        <?xml version="1.0"?>
        <!DOCTYPE rdf:RDF [ <!ENTITY u "http://bio2rdf.org/uniprot:"> ]>
        <rdf:RDF
            xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
            xmlns:u="http://bio2rdf.org/uniprot:">
          <rdf:Description rdf:about="&u;Q16665">
            <rdf:type rdf:resource="&u;Protein"/>
          </rdf:Description>
        </rdf:RDF>
      RDF/N3:
        @prefix u: <http://bio2rdf.org/uniprot:> .
        u:Q16665 a u:Protein .
  19. 19. Bio2RDF is a framework to create and provision linked data networks. Francois Belleau, Laval University; Marc-Alexandre Nolin, Laval University; Peter Ansell, Queensland University of Technology; Michel Dumontier, Carleton University
  20. 20. Bio2RDF’s RDFized data fits together
  21. 21. Bio2RDF now serving over 5 / 15 billion triples of linked biological data
  22. 22. Bio2RDF linked data
  23. 23. Bioinformatics Discovery Registry. SharedName initiative to provide stable URI patterns for data records; we added the relationship between entities and records. Directory Service: ~1700 datasets and dozens of resolvers. Discovery Service: the registry links entities to data records, their formats (RDF/XML, HTML, etc.) and provider (Bio2RDF, UniProt). Redirection Service: automatic redirection to the data provider's document.
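One way to picture the discovery and redirection services is a lookup table that links each entity to its data records, their formats, and their providers. Everything in this sketch is hypothetical: the Record fields, the function names, and the example.org URLs are placeholders, not the registry's real schema or addresses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    entity: str    # entity name, e.g. "uniprot:P05067"
    fmt: str       # record format, e.g. "RDF/XML" or "HTML"
    provider: str  # who serves the record, e.g. "Bio2RDF"
    url: str       # where the record lives (placeholder URLs below)


# Hypothetical registry content; the example.org URLs are placeholders.
REGISTRY = [
    Record("uniprot:P05067", "RDF/XML", "Bio2RDF", "http://example.org/bio2rdf/P05067.rdf"),
    Record("uniprot:P05067", "HTML", "UniProt", "http://example.org/uniprot/P05067.html"),
]


def discover(entity):
    """Discovery: list every registered record that describes the entity."""
    return [record for record in REGISTRY if record.entity == entity]


def redirect(entity, fmt):
    """Redirection: return the provider URL for the requested format, or None."""
    for record in discover(entity):
        if record.fmt == fmt:
            return record.url
    return None
```

Keeping the entity separate from its records is what lets one name resolve to several documents, one per format and provider.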
  24. 24. Something you can look up or search for, with rich descriptions
  25. 25. Bio2RDF: Raw Data!
  26. 26. SPARQL is the new cool kid on the query block (SQL, SPARQL)
  27. 27. Bio2RDF’s describe service uses SPARQL. For the URI http://bio2rdf.org/ns:id:
        CONSTRUCT {
          ?s ?p ?o .
        }
        WHERE {
          ?s ?p ?o .
          FILTER(?s = <http://bio2rdf.org/ns:id>) .
        }
      Sent to http://ns.bio2rdf.org/sparql?query=...
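The describe request can be sketched as plain URL construction. This example reads the "ns" in the slide's endpoint as a placeholder for a namespace such as uniprot; that reading and the helper names are assumptions, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def describe_query(uri):
    """Return the slide's CONSTRUCT query for one entity URI."""
    return (
        "CONSTRUCT { ?s ?p ?o . } "
        "WHERE { ?s ?p ?o . FILTER(?s = <" + uri + ">) . }"
    )


def describe_url(namespace, identifier):
    """Build the request URL; the per-namespace host is an assumed pattern."""
    uri = f"http://bio2rdf.org/{namespace}:{identifier}"
    endpoint = f"http://{namespace}.bio2rdf.org/sparql"
    return endpoint + "?" + urlencode({"query": describe_query(uri)})


url = describe_url("uniprot", "P05067")
# Decoding the query string recovers the original SPARQL text.
decoded = parse_qs(urlparse(url).query)["query"][0]
```

urlencode percent-encodes the braces, angle brackets, and question marks, which is why the query can travel safely as a single query parameter.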
  28. 28. Bio2RDF’s search service uses SPARQL: http://bio2rdf.org/search/hexokinase [Diagram labels: bio2rdf.org, kegg, gene, uniprot]
  29. 29. Bio2RDF: Scalable, Decentralized Data Provision; Globally Mirrored and Point Provision; Customizable Query Resolution
  30. 30. Customizable Configuration (in N3): Single Query, Single Provider
  31. 31. Query Resolution
  33. 33. 700,000 queries in November 2009
  34. 34. Yay for data! But how do we discover more than what was in the data?
  35. 35. Ontology as Strategy
  36. 36. Reasoning and Inference through Semantics. [Diagram: a knowledge base combines a fact (uniprot:P05067 is a uniprot:Protein) with an ontology (uniprot:Protein is a chebi:PolyatomicEntity); a third "is a" edge links uniprot:P05067 to chebi:PolyatomicEntity.]
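The inference on this slide can be approximated by following "is a" links from a fact through the ontology until nothing new is found. This is a toy sketch of that one reasoning pattern, with assumed function and variable names; it is not how an OWL reasoner is implemented.

```python
def infer_types(facts, ontology):
    """Extend (instance, class) facts along (subclass, superclass) links."""
    inferred = set(facts)
    changed = True
    while changed:  # repeat until no new statement can be added
        changed = False
        for instance, cls in list(inferred):
            for subclass, superclass in ontology:
                if subclass == cls and (instance, superclass) not in inferred:
                    inferred.add((instance, superclass))
                    changed = True
    return inferred


facts = {("uniprot:P05067", "uniprot:Protein")}
ontology = {("uniprot:Protein", "chebi:PolyatomicEntity")}
knowledge_base = infer_types(facts, ontology)
```

The statement that uniprot:P05067 is a chebi:PolyatomicEntity appears in neither input; it exists only because the fact and the ontology were combined.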
  37. 37. The Web Ontology Language (OWL) has explicit semantics and can therefore be used to capture knowledge in a machine-understandable way
  38. 38. Over 170 bio-ontologies
  39. 39. From linked data to linked knowledge through syntactic and semantic normalization.
  40. 40. Multiple Ways To Represent Knowledge: three ways to model the relationship between a protein and the volume it occupies.
  41. 41. Web-based Knowledge Discovery: some of our queries need services
  42. 42. The Holy Grail: Align the promoters of all serine/threonine kinases involved exclusively in the regulation of cell sorting during wound healing in blood vessels. Retrieve and align 2000 nt 5' from every serine/threonine kinase in Mus musculus expressed exclusively in the tunica [I | M | A] whose expression increases 5X or more within 5 hours of wounding but is not activated during the normal development of blood vessels, and is <40% similar in the active site to kinases known to be involved in cell-cycle regulation in any other species.
  43. 43. Semantic Automated Discovery and Integration (SADI), http://sadiframework.org. Mark Wilkinson, UBC; Michel Dumontier, Carleton University; Christopher Baker, UNB
  44. 44. SADI: description-oriented service matching based on registered predicates
  46. 46. What pathways does UniProt protein P47989 belong to?
        PREFIX pred: <http://sadiframework.org/ontologies/predicates.owl#>
        PREFIX ont: <http://ontology.dumontierlab.com/>
        PREFIX uniprot: <http://lsrn.org/UniProt:>
        SELECT ?gene ?pathway
        WHERE {
          uniprot:P47989 pred:isEncodedBy ?gene .
          ?gene ont:isParticipantIn ?pathway .
        }
  47. 47. SADI: describe the input and output using OWL-DL classes; the subject of the input and output must be the same; web services correspond to predicates; BioCatalogue is used to register SADI-compliant services; simplified migration path for existing web services (Java, Perl).
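The idea that web services correspond to predicates can be sketched with a registry that maps each predicate to a function, where every function returns statements about the same subject it was given. The two predicates come from the pathway query in the slides; the lookup tables, the example: identifiers, and the function names are invented placeholders standing in for real service calls.

```python
# Placeholder data: the gene and pathway identifiers are invented for illustration.
def is_encoded_by(subject):
    lookup = {"uniprot:P47989": ["example:gene1"]}
    # Output statements keep the input subject
    return [(subject, "pred:isEncodedBy", gene) for gene in lookup.get(subject, [])]


def is_participant_in(subject):
    lookup = {"example:gene1": ["example:pathwayA", "example:pathwayB"]}
    return [(subject, "ont:isParticipantIn", p) for p in lookup.get(subject, [])]


# Registry: one service per registered predicate
SERVICES = {
    "pred:isEncodedBy": is_encoded_by,
    "ont:isParticipantIn": is_participant_in,
}


def follow(subject, predicate):
    """Find the service registered for the predicate and return its objects."""
    service = SERVICES[predicate]
    return [obj for _, _, obj in service(subject)]


def pathways_for(protein):
    """Answer the two-pattern query by chaining one service call per pattern."""
    results = []
    for gene in follow(protein, "pred:isEncodedBy"):
        for pathway in follow(gene, "ont:isParticipantIn"):
            results.append((gene, pathway))
    return results
```

Calling pathways_for("uniprot:P47989") walks the same two triple patterns as the SPARQL query, with each pattern resolved by whichever service is registered for its predicate.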
  52. 52. Build a knowledge base from a series of questions
  53. 53. You want to join the knowledge web
  54. 54. Share your data
  55. 55. Build semantic web services
  56. 56. Get to where you want to be … faster!
  57. 57. Next Steps: Service and Data Buildout; Formal Partnerships; Applications
  58. 58. dumontierlab.com, michel_dumontier@carleton.ca
  60. 60. We’re interested in Personalized Medicine: the ability to offer the right drug to the right patient for the right disease at the right time with the right dosage. Genetic and metabolic data will allow drugs to be tailored to patient subgroups.
  61. 61. PharmGKB is an emerging resource for pharmacogenomics. Strengths (+): role of genes, gene variants, and drugs; pharmacokinetics; pharmacodynamics; clinical outcomes; links to publications. Limitations (-): natural language descriptions; variant details in publications.
  62. 62. Pharmacogenomics of Depression Knowledge Base (PDKB): contains statements from 11/40 relevant publications involving 45 genes / gene variants, 57 drugs annotated with 19 classes of antidepressants, 45 drug treatments, 47 drug-gene interactions, 29 clinical outcomes, 10 drug-induced side-effects, and 8 gene-disease interactions.
  63. 63. Querying the PDKB (Protégé 4, FaCT++, DL Query Tab). Query: nortriptyline-induced side effects for ABCB1 gene variants:
        ‘side effect’ that
          ‘is realized by’ some
            (‘drug treatment’ that
              ‘involves’ some ‘nortriptyline’ and
              ‘involves’ some (‘variant of’ some ‘ABCB1’))
      Result: postural hypotension is a side effect of nortriptyline treatment of depression for individuals presenting the 3435C>T genotype.
