Knowledge Discovery using an Integrated
               Semantic Web




                                  Michel Dumontier

      Department of Biology, School of Computer Science, Institute of Biochemistry
                         Ottawa Institute for Systems Biology
                 Ottawa-Carleton Institute for Biomedical Engineering
                                  Carleton University
                                    Ottawa, Canada

       Chair, W3C Semantic Web for Health Care and Life Sciences Interest Group
1                                                                                    BH2012
2   BH2012
3   BH2012
uncovering a sufficient amount of evidence to support/refute
             a hypothesis is becoming increasingly difficult
                             it requires a lot of digging around




4                                                          BH2012
continuous growth in research literature




    Source:http://www.nlm.nih.gov/bsd/stats/cit_added.html




5                                                            BH2012
growing amount of biomedical data




6                                       BH2012
increasingly complex software & interfaces
         to predict, compare and evaluate




7                                          BH2012
ultimately, we answer questions by building
              sophisticated workflows




8                                             BH2012
What if we could just pose a hypothesis
     and have a system automatically use
9
    available data, ontologies and services?
                                         BH2012
HyQue

     HyQue is the Hypothesis query and evaluation system
     • A platform for knowledge discovery
     • Facilitates hypothesis formulation and evaluation
     • Leverages Semantic Web technologies to provide access to
       facts, expert knowledge and web services
     • Conforms to a simplified event-based model
     • Supports evaluation against positive and negative findings
     • Transparent and reproducible evidence prioritization
     • Provenance of across all elements of hypothesis testing
        – trace a hypothesis to its evaluation, including the data and rules used

       Evaluating scientific hypotheses using the SPARQL Inferencing Notation. Extended Semantic Web Conference
       (ESWC 2012). Heraklion, Crete. May 27-31, 2012.
       HyQue: evaluating hypotheses using Semantic Web technologies. J Biomed Semantics. 2011 May 17;2 Suppl 2:S3.
10                                                                                                                   BH2012
HyQue Architecture




                                     Ontologies




                          Services




11                                                BH2012
Event-based data model

     HyQue events denote a phenomenon involving two
     objects: ‘agent’ and ‘target’ . In addition, we can specify the
     location of this event (e.g. located in nucleus, or under
     some genetic background)
                                        supported events
     Event                          1. protein-protein binding
      ‘has agent’ agent             2. protein-nucleic acid binding
      ‘has target’ target           3. molecular activation
      ‘is located in’ location
                                    4. molecular inhibition
                                    5. gene induction
      ‘is negated’ boolean
                                    6. gene repression
                                    7. transport




12                                                                    BH2012
HyQue domain rules CALCULATE a quantitative
           measure of evidence for an event
     ‘induce’ rule (maximum score: 5):
        – Is event negated?                               GO:0010628
            • If yes, subtract 2
        – Is event of type ‘induce’?                               CHEBI:36080
            • If yes, add 1; if no, subtract 1
        – Is agent of type ‘protein’ or ‘RNA’?
            • If yes, add 1; if type ‘gene’, subtract 1
        – Is target of type ‘gene’?                                    SO:0000236
            • If yes, add 1; if no, subtract 1
        – Does agent have known ‘transcription factor activity’?
            • If yes, add 1                                            GO:0003700
        – Is event located in the ‘nucleus’?
            • If yes, add 1; if no, subtract 1
                                                               GO:0005634

13                                                                               BH2012
Customization of rules/data sources will generate
          different evidence-based evaluations




14                                                  BH2012
The Semantic Web
     is the new global web of knowledge
         It involves standards for publishing, sharing and querying
                           facts, expert knowledge and services

                                  It is a scalable approach to the
                           discovery of independently formulated
                                        and distributed knowledge




15                                                              BH2012
something you can search,
        lookup, link to, check
     consistency of, and query for



16                                   BH2012
An ever expanding web of linked data




17    “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/” BH2012
Bio2RDF provides a
      simple convention
     and infrastructure to
      provide linked data
     for the life sciences




18                           BH2012
linked data for the life sciences

                             An Open Source Project for the Provision of
                       Scalable, Decentralized Data with Global Mirroring
                                     and Customizable Query Resolution
     http://bio2rdf.org/ns:id                     Laval University, Carleton University,
                                                  Queensland University of Technology




19                                                                            BH2012
provides billions of interconnections




20                                           BH2012
Towards universally-accepted identifiers




21                                              BH2012
http://identifiers.org/taxonomy/$id
22                                         BH2012
(coming soon)




23                   BH2012
engaging the BioPAX community to
            adopt identifiers.org Andrea Splendiani
Pathwaycommons (level 2; download)
<bp:unificationXref rdf:ID="CPATH-LOCAL-653">
 <bp:ID rdf:datatype="xsd:string">9606</bp:ID>
 <bp:DB rdf:datatype="xsd:string">NCBI_TAXONOMY</bp:DB>
</bp:unificationXref>

Pathwaycommons (level 3; web service)
<bp:UnificationXref rdf:about="urn:biopax:UnificationXref:REACTOME+DATABASE+ID_109276">
 <bp:id rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">109276</bp:id>
 <bp:db rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">Reactome Database ID</bp:db>
</bp:UnificationXref>

Biomodels (level 3)
<bp:UnificationXref rdf:about="http://identifiers.org/obo.go/GO:0004889">
 <bp:id rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">GO:0004889</bp:id>
 <bp:db rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">Gene Ontology</bp:db>
</bp:UnificationXref>


24                                                                                       BH2012
More sophisticated OWL-based Data Integration,
      Consistency Checking and Discovery
                                                                                                           Robert Hoehndorf
  • Checking the consistency of semantic annotations [1]
     – Formalized semantic annotations in SBML models as OWL axioms.
       Automated reasoning uncovered inconsistencies in 16 models.
               • e.g. alpha-D-glucose phosphate is not the required ATP in an ATP-dependent
                 reaction (GO + ChEBI + disjoint + closure axioms)
  • Finding significant biomedical associations [2] (initiated at BH11)
     – found significant associations between genes, drugs, diseases and
       pathways using Drugbank, PharmGKB, CTD, PID across categories
       of drugs (ChEBI, ATC, MeSH) and diseases (DO, MeSH)
         – 22,653 pathway-disease type associations (6304 over; 16,349 under)
               • carcinosarcoma (DOID:4236) and (HIV RT) Zidovudine Pathway
                 (PharmGKB:PA165859361)
         – 13,826 pathway-chemical type associations (12,564 over; 1262 under)
               • drug clopidogrel (CHEBI:37941) with Endothelin signaling pathway
                 (PharmGKB:PA164728163) -> (smooth muscle mitogenesis)
                                                                                                  http://pharmgkb-owl.googlecode.com
1. Integrating systems biology models and biomedical ontologies. BMC Systems Biology. 2011. 5 : 124
2. Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics. Bioinformatics. 2012. in press
25                                                                                                                       BH2012
Personal Health Lens
                                                                   Mark Wilkinson
                                                                      Chris Baker
     Observation: Patients often look up new/alternative drugs to treat their
     condition or alleviate side effects.

     Opportunity: A patient-centric health care application that identifies
     contraindications for drugs mentioned on web pages using the patient’s
     own health data

     Components:
     • RDFized patient data
     • Bio2RDF semantically annotated data
     • SADI semantic web services to process the page and retrieve data
     • SHARE automatic workflow composition



26                                                                         BH2012
27   BH2012
Matthias Samwald

 We are developing a simple, cheap and ubiquitous solutions for
       anchoring pharmacogenomics in medical practice


 Curated and unified
 set of essential 385+
     markers, 50+
 pharmacogenes and
  rulesystem unified
       under one
 standardized model:
 The Medicine Safety
         Code




                              W3C Task Force: Clinical Decision
                              Support for Personalized Medicine
28                                                                BH2012
Unified
   OWL
 Ontology

(inferencing
      ,
consistency
  checking,
  mapping)




29             BH2012
At this Biohackathon

     •   refine Bio2RDF RDFization Guide
     •   complete Dataset Description (BH11)
     •   dataspace statistics & visualization
     •   SPARQL-based Enrichment Analysis
     •   ontology-based Similarity Networks
         – see Rob’s email




30                                              BH2012
dumontierlab.com
     michel_dumontier@carleton.ca
                              Website: http://dumontierlab.com
         Presentations: http://slideshare.com/micheldumontier




31                                                          BH2012

Knowledge Discovery using an Integrated Semantic Web

  • 1.
    Knowledge Discovery usingan Integrated Semantic Web Michel Dumontier Department of Biology, School of Computer Science, Institute of Biochemistry Ottawa Institute for Systems Biology Ottawa-Carleton Institute for Biomedical Engineering Carleton University Ottawa, Canada Chair, W3C Semantic Web for Health Care and Life Sciences Interest Group 1 BH2012
  • 2.
    2 BH2012
  • 3.
    3 BH2012
  • 4.
    uncovering a sufficientamount of evidence to support/refute a hypothesis is becoming increasingly difficult it requires a lot of digging around 4 BH2012
  • 5.
    continuous growth inresearch literature Source:http://www.nlm.nih.gov/bsd/stats/cit_added.html 5 BH2012
  • 6.
    growing amount ofbiomedical data 6 BH2012
  • 7.
    increasingly complex software& interfaces to predict, compare and evaluate 7 BH2012
  • 8.
    ultimately, we answerquestions by building sophisticated workflows 8 BH2012
  • 9.
    What if wecould just pose a hypothesis and have a system automatically use 9 available data, ontologies and services? BH2012
  • 10.
    HyQue HyQue is the Hypothesis query and evaluation system • A platform for knowledge discovery • Facilitates hypothesis formulation and evaluation • Leverages Semantic Web technologies to provide access to facts, expert knowledge and web services • Conforms to a simplified event-based model • Supports evaluation against positive and negative findings • Transparent and reproducible evidence prioritization • Provenance of across all elements of hypothesis testing – trace a hypothesis to its evaluation, including the data and rules used Evaluating scientific hypotheses using the SPARQL Inferencing Notation. Extended Semantic Web Conference (ESWC 2012). Heraklion, Crete. May 27-31, 2012. HyQue: evaluating hypotheses using Semantic Web technologies. J Biomed Semantics. 2011 May 17;2 Suppl 2:S3. 10 BH2012
  • 11.
    HyQue Architecture Ontologies Services 11 BH2012
  • 12.
    Event-based data model HyQue events denote a phenomenon involving two objects: ‘agent’ and ‘target’ . In addition, we can specify the location of this event (e.g. located in nucleus, or under some genetic background) supported events Event 1. protein-protein binding ‘has agent’ agent 2. protein-nucleic acid binding ‘has target’ target 3. molecular activation ‘is located in’ location 4. molecular inhibition 5. gene induction ‘is negated’ boolean 6. gene repression 7. transport 12 BH2012
  • 13.
    HyQue domain rulesCALCULATE a quantitative measure of evidence for an event ‘induce’ rule (maximum score: 5): – Is event negated? GO:0010628 • If yes, subtract 2 – Is event of type ‘induce’? CHEBI:36080 • If yes, add 1; if no, subtract 1 – Is agent of type ‘protein’ or ‘RNA’? • If yes, add 1; if type ‘gene’, subtract 1 – Is target of type ‘gene’? SO:0000236 • If yes, add 1; if no, subtract 1 – Does agent have known ‘transcription factor activity’? • If yes, add 1 GO:0003700 – Is event located in the ‘nucleus’? • If yes, add 1; if no, subtract 1 GO:0005634 13 BH2012
  • 14.
    Customization of rules/datasources will generate different evidence-based evaluations 14 BH2012
  • 15.
    The Semantic Web is the new global web of knowledge It involves standards for publishing, sharing and querying facts, expert knowledge and services It is a scalable approach to the discovery of independently formulated and distributed knowledge 15 BH2012
  • 16.
    something you cansearch, lookup, link to, check consistency of, and query for 16 BH2012
  • 17.
    An ever expandingweb of linked data 17 “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/” BH2012
  • 18.
    Bio2RDF provides a simple convention and infrastructure to provide linked data for the life sciences 18 BH2012
  • 19.
    linked data forthe life sciences An Open Source Project for the Provision of Scalable, Decentralized Data with Global Mirroring and Customizable Query Resolution http://bio2rdf.org/ns:id Laval University, Carleton University, Queensland University of Technology 19 BH2012
  • 20.
    provides billions ofinterconnections 20 BH2012
  • 21.
  • 22.
  • 23.
  • 24.
    engaging the BioPAXcommunity to adopt identifiers.org Andrea Splendiani Pathwaycommons (level 2; download) <bp:unificationXref rdf:ID="CPATH-LOCAL-653"> <bp:ID rdf:datatype="xsd:string">9606</bp:ID> <bp:DB rdf:datatype="xsd:string">NCBI_TAXONOMY</bp:DB> </bp:unificationXref> Pathwaycommons (level 3; web service) <bp:UnificationXref rdf:about="urn:biopax:UnificationXref:REACTOME+DATABASE+ID_109276"> <bp:id rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">109276</bp:id> <bp:db rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">Reactome Database ID</bp:db> </bp:UnificationXref> Biomodels (level 3) <bp:UnificationXref rdf:about="http://identifiers.org/obo.go/GO:0004889"> <bp:id rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">GO:0004889</bp:id> <bp:db rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">Gene Ontology</bp:db> </bp:UnificationXref> 24 BH2012
  • 25.
    More sophisticated OWL-basedData Integration, Consistency Checking and Discovery Robert Hoehndorf • Checking the consistency of semantic annotations [1] – Formalized semantic annotations in SBML models as OWL axioms. Automated reasoning uncovered inconsistencies in 16 models. • e.g. alpha-D-glucose phosphate is not the required ATP in an ATP-dependent reaction (GO + ChEBI + disjoint + closure axioms) • Finding significant biomedical associations [2] (initiated at BH11) – found significant associations between genes, drugs, diseases and pathways using Drugbank, PharmGKB, CTD, PID across categories of drugs (ChEBI, ATC, MeSH) and diseases (DO, MeSH) – 22,653 pathway-disease type associations (6304 over; 16,349 under) • carcinosarcoma (DOID:4236) and (HIV RT) Zidovudine Pathway (PharmGKB:PA165859361) – 13,826 pathway-chemical type associations (12,564 over; 1262 under) • drug clopidogrel (CHEBI:37941) with Endothelin signaling pathway (PharmGKB:PA164728163) -> (smooth muscle mitogenesis) http://pharmgkb-owl.googlecode.com 1. Integrating systems biology models and biomedical ontologies. BMC Systems Biology. 2011. 5 : 124 2. Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics. Bioinformatics. 2012. in press 25 BH2012
  • 26.
    Personal Health Lens Mark Wilkinson Chris Baker Observation: Patients often look up new/alternative drugs to treat their condition or alleviate side effects. Opportunity: A patient-centric health care application that identifies contraindications for drugs mentioned on web pages using the patient’s own health data Components: • RDFized patient data • Bio2RDF semantically annotated data • SADI semantic web services to process the page and retrieve data • SHARE automatic workflow composition 26 BH2012
  • 27.
    27 BH2012
  • 28.
    Matthias Samwald Weare developing a simple, cheap and ubiquitous solutions for anchoring pharmacogenomics in medical practice Curated and unified set of essential 385+ markers, 50+ pharmacogenes and rulesystem unified under one standardized model: The Medicine Safety Code W3C Task Force: Clinical Decision Support for Personalized Medicine 28 BH2012
  • 29.
    Unified OWL Ontology (inferencing , consistency checking, mapping) 29 BH2012
  • 30.
    At this Biohackathon • refine Bio2RDF RDFization Guide • complete Dataset Description (BH11) • dataspace statistics & visualization • SPARQL-based Enrichment Analysis • ontology-based Similarity Networks – see Rob’s email 30 BH2012
  • 31.
    dumontierlab.com michel_dumontier@carleton.ca Website: http://dumontierlab.com Presentations: http://slideshare.com/micheldumontier 31 BH2012