Role of Semantic Web in Health Informatics


Tutorial presented at 2012 ACM SIGHIT International Health Informatics Symposium (IHI 2012), January 28-30, 2012.

This tutorial weaves together three themes and the associated topics:

[1] The role of biomedical ontologies
[2] Key Semantic Web technologies with focus on Semantic provenance and integration
[3] In-practice tools and real-world use cases built to serve the needs of sleep medicine researchers, cardiologists involved in clinical practice, and researchers working on vaccine development for human pathogens.

  • RDF: Triple structure
  • Review types of heterogeneity and why we need to reconcile data heterogeneity. Uniform Resource Locator (URL): a network location, used as an identifier for resources on the Web. A URL is a specific type of URI; a URI can be used to refer to anything. IRI: in addition to the ASCII character set, it can contain characters from the Universal Character Set (RFC 3987).
  • RDF uses XML Schema datatypes
  • Allows creation of an abstract representation of domain
  • Review types of heterogeneity. Why we need to reconcile data heterogeneity
  • Better yet, for those who are not originally involved in developing the data resource
  • Web based, anywhere anytime, not requiring knowledge of data structure, data elements, how they are stored
  • Under the hood
  • Simple interface: select a query representing cases, select a query representing controls, then “explore”!
  • Walking through an example to illustrate VISAGE's case-control features, using the Cleveland Family Study.
  • Green means matched; visually inspect the pie chart.
  • Try to provide 1:5 matching; not enough controls; color indicator.
  • Controls coming from both the Cleveland Family Study and the Sleep Heart Health Study give a sufficient number of matched controls. This also illustrates the power of data federation in Physio-MIMI: combining different data sources.
  • Can accommodate differences, but prevents the same term from changing meaning from one paragraph to the next in the same paper.
  • Let's take another example, from the NIH-funded collaborative project led by our lab along with the CTEGD at UGA to identify vaccine targets for the human pathogen T.cruzi that causes… Specifically, we consider the experiment protocol used to create new strains of the parasite by knocking out specific genes, in order to identify the function of each gene, among other things. Proposal writing.
  • In terms of modeling: a formal representation is needed to support consistent interpretation (and machine processing of large volumes of data), and it must be expressive enough to closely reflect the domain-specific details, i.e. the domain semantics. Provenance queries have many characteristics, which I will cover later, that need to be supported by the query infrastructure. Provenance queries over scientific data are characterized by high expression complexity and data complexity (I will discuss these aspects in detail). Let's discuss the issue of provenance modeling first.
  • We can extend the Provenir ontology to model domain-specific provenance. For example, we extended the Provenir ontology to create the Parasite Experiment ontology (PEO), which models the gene knockout and strain creation protocols. We have also extended Provenir to create the Trident ontology, which represents oceanography-specific provenance information (discussed later in the evaluation section). PEO has an expressivity of ALCHQ(D) because of the use of qualified restrictions on properties (for example, cell cloning has a drug-selected sample as input value and a cloned sample as output value) and of datatype properties such as researcher notes.
  • The first type of query is the “standard” provenance query. The second type is a complete reverse view: define constraints on provenance information to retrieve the datasets that satisfy those constraints (provenance context).
  • The query operators are formally defined, implemented in Prolog (to validate their functional semantics), and mapped to SPARQL, the RDF query language.
  • Query composer: maps the functional semantics of each query operator to SPARQL syntax; in other words, it creates a SPARQL query pattern that conforms to the required behavior of the query operator. One of the challenges we faced in implementing the query composer is the use of highly nested OPTIONAL patterns. Not all applications collect the comprehensive provenance information that the query operators have been defined to operate on; hence, if a query operator is translated to a basic graph pattern in SPARQL, the query returns no results even though partial information is present. OPTIONAL allows us to sidestep this, but we had to map the nesting of the OPTIONAL patterns to reflect the structure of the Provenir ontology schema. For example, given a process, the link to an agent requires a top-level OPTIONAL, but the OPTIONAL that retrieves the spatial parameters associated with the agent needs to be nested within that top-level OPTIONAL. I discussed the transitive closure function earlier; it is implemented using the existing SPARQL ASK query form, and I will discuss the experiment results that justify the use of ASK. The query optimizer uses a materialized-view-based approach to significantly improve query performance. The entities in the materialized view are indexed in a B+ tree, and in response to a query the optimizer looks up this index to see whether the query can be answered using the materialized view or needs to be sent to the DB. The query optimizer also includes a module for “view selection”, that is, deciding whether to materialize a query result or not; I will expand on this in a few slides.

    1. 1. Role of Semantic Web in Health Informatics
       Tutorial at 2012 ACM SIGHIT International Health Informatics Symposium (IHI 2012), January 28-30, 2012
       Satya S. Sahoo, GQ Zhang: Division of Medical Informatics, Case Western Reserve University
       Amit Sheth: Kno.e.sis Center, Wright State University
    2. 2. Outline
       • Semantic Web
         o Introductory Overview
       • Clinical Research
         o Physio-MIMI
       • Bench Research and Provenance
         o Semantic Problem Solving Environment for T.cruzi
       • Clinical Practice
         o Active Semantic Electronic Medical Record
    3. 3. Semantic Web
    4. 4. Landscape of Health Informatics
       [Figure labels: Patient Care, Personalized Medicine, Drug Development, Clinical Research, Bench Research, Clinical Practice, Privacy, Cost]
       * Images from
    5. 5. Challenges
       • Information Integration: Reconcile heterogeneity
         o Syntactic Heterogeneity: DOB vs. Date of Birth
         o Structural Heterogeneity: Street + Apt + City vs. Address
         o Semantic Heterogeneity: Age vs. Age at time of surgery vs. Age at time of admission
       • Humans can (often) accurately interpret, but extremely difficult for machines
         o Role for Metadata/Contextual Information/Semantics
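The first two kinds of heterogeneity on the Challenges slide can be illustrated with a minimal Python sketch. The field names, the `CANONICAL_FIELDS` mapping, and the sample records are hypothetical and only serve to make the DOB and Address examples concrete; they are not taken from the tutorial's systems.

```python
# Hypothetical mapping from source-specific labels to one canonical field name.
# This addresses syntactic heterogeneity ("DOB" vs. "Date of Birth").
CANONICAL_FIELDS = {
    "dob": "date_of_birth",
    "date of birth": "date_of_birth",
}

# Parts that some sources store separately and others store as one Address.
ADDRESS_PARTS = ("street", "apt", "city")


def normalize(record):
    """Return a copy of `record` with canonical field names and one address field."""
    out = {}
    for key, value in record.items():
        label = key.strip().lower()
        out[CANONICAL_FIELDS.get(label, label)] = value
    # Structural heterogeneity: Street + Apt + City vs. a single Address.
    if "address" not in out and any(part in out for part in ADDRESS_PARTS):
        parts = [out.pop(part) for part in ADDRESS_PARTS if part in out]
        out["address"] = ", ".join(parts)
    return out


# Two hypothetical sources describing the same person differently.
site_a = {"DOB": "1950-03-14", "Street": "12 Elm St", "Apt": "4B", "City": "Cleveland"}
site_b = {"Date of Birth": "1950-03-14", "Address": "12 Elm St, 4B, Cleveland"}

print(normalize(site_a) == normalize(site_b))
```

Renaming and restructuring of this kind does not help with semantic heterogeneity: "Age" and "Age at time of surgery" can only be reconciled if the context of each value is recorded, which is the role the slide assigns to metadata and semantics.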
    6. 6. Semantic Web
       • Web of Linked Data
       • Introduced by Berners-Lee et al. as the next step for the Web of Documents
       • Allow “machine understanding” of data
       • Create “common” models of domains using a formal language: ontologies
       [Figure: Semantic Web Layer Cake] Layer cake image source:
    7. 7. Resource Description Framework
       [Figure labels: Company: IBM; Location: Armonk, New York, United States; Zurich, Switzerland]
       • Resource Description Framework: recommended by W3C for metadata modeling [RDF]
       • A standard common modeling framework: usable by humans and machine understandable
    8. 8. RDF: Triple Structure, IRI, Namespace
       [Figure: IBM, "Headquarters located in", Armonk, New York, United States]
       • RDF Triple
         o Subject: The resource that the triple is about
         o Predicate: The property of the subject that is described by the triple
         o Object: The value of the property
       • Web Addressable Resource: Uniform Resource Locator (URL), Uniform Resource Identifier (URI), Internationalized Resource Identifier (IRI)
       • Qualified Namespace: as xsd:
         o xsd:string instead of
    9. 9. RDF Representation
       • Two types of property values in a triple
         o Web resource: IBM, "Headquarters located in", Armonk, New York, United States
         o Typed literal: IBM, "Has total employees", "430000"^^xsd:integer
       • The graph model of RDF (node-arc-node) is the primary representation model
       • Secondary notations: Triple notation
         o companyExample:IBM companyExample:has-Total-Employee "430000"^^xsd:integer .
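As a rough illustration of the triple structure and the two kinds of property values, the sketch below builds two triples about IBM and prints them in an N-Triples-like form. The namespace IRI `http://example.org/company#` and the property names are assumptions made for this example; the slides do not give the actual IRIs.

```python
from collections import namedtuple

# A triple is just (subject, predicate, object).
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

# Hypothetical namespace for the company example; XSD is the standard
# XML Schema datatype namespace.
EX = "http://example.org/company#"
XSD = "http://www.w3.org/2001/XMLSchema#"


def iri(local_name, namespace=EX):
    """Expand a local name into a full IRI in angle brackets."""
    return "<" + namespace + local_name + ">"


def typed_literal(value, datatype):
    """Build a typed literal such as "430000"^^<...#integer>."""
    return '"' + str(value) + '"^^' + iri(datatype, XSD)


def to_ntriples(triple):
    """Serialize one triple as a single N-Triples-style line."""
    return " ".join(triple) + " ."


triples = [
    # Object is a Web resource.
    Triple(iri("IBM"), iri("hasHeadquarterLocation"), iri("Armonk")),
    # Object is a typed literal.
    Triple(iri("IBM"), iri("hasTotalEmployees"), typed_literal(430000, "integer")),
]

for t in triples:
    print(to_ntriples(t))
```

The integer is written without a thousands separator because the lexical form of `xsd:integer` allows only digits and an optional sign.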
    10. 10. RDF Schema
        [Figure: instance level: IBM, "Headquarters located in", Armonk, New York, United States; Oracle, "Headquarters located in", Redwood Shores, California, United States. Schema level: Company, "Headquarters located in", Geographical Location]
        • RDF Schema: Vocabulary for describing groups of resources [RDFS]
    11. 11. RDF Schema
        • Property domain (rdfs:domain) and range (rdfs:range)
          o Domain: Company; property: Headquarters located in; Range: Geographical Location
        • Class Hierarchy/Taxonomy: rdfs:subClassOf
          o (Parent) Class: Company
          o SubClasses: Computer Technology Company, Banking Company, Insurance Company
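The effect of `rdfs:domain`, `rdfs:range`, and `rdfs:subClassOf` can be sketched as simple type inference. This is a simplified illustration in plain Python rather than an RDFS reasoner, and the identifiers (`headquartersLocatedIn`, `ComputerTechnologyCompany`, and so on) are hypothetical spellings of the labels on the slide.

```python
def superclasses(cls, sub_class_of):
    """Return all direct and indirect superclasses (rdfs:subClassOf is transitive)."""
    found = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in sub_class_of.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def infer_types(triples, domains, ranges, sub_class_of):
    """Infer class membership from property domains/ranges and the class hierarchy."""
    types = {}
    for s, p, o in triples:
        if p in domains:   # the subject belongs to the property's domain class
            types.setdefault(s, set()).add(domains[p])
        if p in ranges:    # the object belongs to the property's range class
            types.setdefault(o, set()).add(ranges[p])
        if p == "type":    # explicitly asserted class membership
            types.setdefault(s, set()).add(o)
    for classes in types.values():
        for cls in list(classes):
            classes |= superclasses(cls, sub_class_of)
    return types


sub_class_of = {
    "ComputerTechnologyCompany": {"Company"},
    "BankingCompany": {"Company"},
    "InsuranceCompany": {"Company"},
}
domains = {"headquartersLocatedIn": "Company"}
ranges = {"headquartersLocatedIn": "GeographicalLocation"}
triples = [
    ("IBM", "headquartersLocatedIn", "Armonk"),
    ("Oracle", "type", "ComputerTechnologyCompany"),
]

print(infer_types(triples, domains, ranges, sub_class_of))
```

Here IBM is inferred to be a Company only because it is the subject of a property whose domain is Company, and Oracle is inferred to be a Company through the subclass link.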
    12. 12. Ontology: A Working Definition
        • Ontologies are shared conceptualizations of a domain represented in a formal language*
        • Ontologies in health informatics:
          o Common representation model: facilitate interoperability, integration across different projects, and enforce consistent use of terminology
          o Closely reflect domain-specific details (domain semantics) essential to answer end-user queries
          o Support reasoning to discover implicit knowledge
        * Paraphrased from Gruber, 1993
    13. 13. OWL2 Web Ontology Language
        • A language for modeling ontologies [OWL]
        • OWL2 is declarative
        • An OWL2 ontology (schema) consists of:
          o Entities: Company, Person
          o Axioms: Company employs Person
          o Expressions: A Person Employed by a Company = CompanyEmployee
        • Reasoning: Draw a conclusion given certain constraints are satisfied
          o RDF(S) Entailment
          o OWL2 Entailment
    14. 14. OWL2 Constructs
        • Class Disjointness: An instance of class A cannot be an instance of class B
        • Complex Classes: Combining multiple classes with set theory operators:
          o Union: Parent = ObjectUnionOf(:Mother :Father)
          o Logical negation: UnemployedPerson = ObjectComplementOf(:EmployedPerson)
          o Intersection: Mother = ObjectIntersectionOf(:Parent :Woman)
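The set-theoretic reading of these class constructors can be tried out with Python sets over a small, closed domain. This is only an analogy: the individuals below are hypothetical, and OWL 2 itself makes an open-world assumption, so a reasoner would not conclude that someone is unemployed merely because no employment is recorded.

```python
# Hypothetical finite domain of individuals.
people = {"alice", "bob", "carol", "dave"}

mother = {"alice"}
father = {"bob"}
woman = {"alice", "carol"}
employed = {"alice", "dave"}

# Union: Parent = ObjectUnionOf(:Mother :Father)
parent = mother | father

# Logical negation: complement of EmployedPerson, taken relative to the domain.
unemployed = people - employed

# Intersection: Mother = ObjectIntersectionOf(:Parent :Woman)
mother_by_definition = parent & woman

# Class disjointness: no individual is both a Mother and a Father.
mother_father_disjoint = mother.isdisjoint(father)

print(sorted(parent), sorted(unemployed), sorted(mother_by_definition), mother_father_disjoint)
```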
    15. 15. OWL2 Constructs
        • Property restrictions: defined over a property
        • Existential Quantification:
          o Parent = ObjectSomeValuesFrom(:hasChild :Person)
          o To capture incomplete knowledge
        • Universal Quantification:
          o US President = ObjectAllValuesFrom(:hasBirthPlace :UnitedStates)
        • Cardinality Restriction
    16. 16. SPARQL: Querying Semantic Web Data
        • A SPARQL query pattern is composed of triples
        • Triples correspond to the RDF triple structure, but have a variable at:
          o Subject: ?company ex:hasHeadquaterLocation ex:NewYork .
          o Predicate: ex:IBM ?whatislocatedin ex:NewYork .
          o Object: ex:IBM ex:hasHeadquaterLocation ?location .
        • The result of a SPARQL query is a list of values; the values can replace the variables in the query pattern
    17. 17. SPARQL: Query Patterns
        • An example query pattern
          PREFIX ex: <>
          SELECT ?company ?location WHERE { ?company ex:hasHeadquaterLocation ?location . }
        • Query Result (multiple matches)
          company                location
          IBM                    NewYork
          Oracle                 RedwoodCity
          MicrosoftCorporation   Bellevue
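How a single triple pattern produces variable bindings can be sketched in a few lines of Python. This is a toy matcher for illustration, not a SPARQL engine; the data mirrors the query result on the slide, and the function name `match_pattern` is an assumption of this sketch.

```python
def match_pattern(pattern, triples):
    """Return one binding dict per triple that matches a single triple pattern.

    Terms starting with "?" are variables; all other terms must match exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable that occurs twice must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


triples = [
    ("ex:IBM", "ex:hasHeadquaterLocation", "ex:NewYork"),
    ("ex:Oracle", "ex:hasHeadquaterLocation", "ex:RedwoodCity"),
    ("ex:MicrosoftCorporation", "ex:hasHeadquaterLocation", "ex:Bellevue"),
]

# SELECT-style use: one row of bindings per matching triple.
rows = match_pattern(("?company", "ex:hasHeadquaterLocation", "?location"), triples)
for row in rows:
    print(row["?company"], row["?location"])

# ASK-style use: only whether at least one match exists.
print(bool(match_pattern(("ex:IBM", "ex:hasHeadquaterLocation", "ex:NewYork"), triples)))
```

A real query processor joins the bindings of several patterns; the sketch covers only the single-pattern case shown on the slide.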
    18. 18. SPARQL: Query Forms
        • SELECT: Returns the values bound to the variables
        • CONSTRUCT: Returns an RDF graph
        • DESCRIBE: Returns a description (RDF graph) of a resource (e.g. IBM)
          o The content of the RDF graph is determined by the SPARQL query processor
        • ASK: Returns a Boolean
          o True
          o False
    19. 19. Semantic Web + Clinical Research Informatics = Physio-MIMI
    20. 20. Physio-MIMI Overview
        • Physio-MIMI: Multi-Modality, Multi-Resource Environment for Physiological and Clinical Research
        • NCRR-funded, multi-CTSA-site project (RFP 08-001) for providing informatics tools to clinical investigators and clinical research teams at and across CTSA institutions to enhance the collection, management and sharing of data
        • Collaboration among Case Western, U Michigan, Marshfield Clinic and U Wisconsin Madison
        • Use Sleep Medicine as an exemplar, but also generalizable
        • Two year duration: Dec 2008 to Dec 2010
    21. 21. Features of Physio-MIMI
        • Federated data integration environment
          – Linking existing data resources without a centralized data repository
        • Query interface directly usable by clinical researchers
          – Minimize the role of the data-access middleman
        • Secure and policy-compliant data access
          – Fine-grained access control, dual SSL, auditing
        • Tools for curating PSGs
        [Figure labels: Data Integration Framework, Physio-MIMI, SHHS Portal]
    22. 22. Data Access, Secondary Use
    23. 23. Measure not by the size of the database, but by the number of secondary studies it supported
    24. 24. Query Interface: driven by access
        • Visual Aggregator and Explorer (VISAGE)
        • Federated, Web-based
        • Driven by Domain Ontology (SDO)
        • PhysioMap to connect autonomous data sources
        [Figure labels: Clinical Investigator, Data Analyst, Data Manager, Database]
        • GQ Zhang et al. VISAGE: A Query Interface for Clinical Research, Proceedings of the 2010 AMIA Clinical Research Informatics Summit, San Francisco, March 12-13, pp. 76-80, 2010
    25. 25. Physio-MIMI Components
        [Figure labels: users: Sleep Researcher, Domain Expert, Informatician; META SERVER: Query Builder, VISAGE, Query Manager, Query Explorer, DB-Ontology Mapper; DATA SERVER: Institutional Databases, each behind an Institutional Firewall]
    26. 26. VISAGE screenshot
    27. 27. Components of VISAGE
    28. 28. Case Control Study Design
        • Case-control is a common study design
        • Used for epidemiological studies involving two cohorts, one representing the cases and the second representing the controls
        • Adjusting matching ratio to improve statistical power
    29. 29. Example (CFS)
        • Suppose we are interested in the question of whether sleep parameters (EEG) differ by obesity in age- and race-matched males
        • Case: adult 55-75, male, BMI 35-50 (obese)
        • Control: adult 55-75, male, BMI 20-30 (non-obese)
        • Matching 1:2 on race (minimize race as a factor initially)
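The idea of 1:2 matching on race can be sketched with a simple greedy procedure. The records, identifiers, and the `match_controls` function are hypothetical, the cohorts are assumed to be already filtered by the age, sex, and BMI criteria above, and nothing here is claimed to be how VISAGE performs matching.

```python
from collections import defaultdict


def match_controls(cases, controls, key, ratio):
    """Greedy 1:ratio matching of controls to cases on one attribute.

    Returns (matches, unmatched): a dict from case id to its matched control
    ids, and the list of case ids that could not get `ratio` controls.
    """
    pool = defaultdict(list)
    for control in controls:
        pool[control[key]].append(control["id"])
    matches = {}
    unmatched = []
    for case in cases:
        available = pool[case[key]]
        if len(available) >= ratio:
            # Each control is used at most once.
            matches[case["id"]] = [available.pop(0) for _ in range(ratio)]
        else:
            unmatched.append(case["id"])
    return matches, unmatched


# Hypothetical, already-filtered cohorts (race coded as "A" or "B").
cases = [{"id": "case1", "race": "A"}, {"id": "case2", "race": "B"}]
controls_one_source = [
    {"id": "c1", "race": "A"},
    {"id": "c2", "race": "A"},
    {"id": "c3", "race": "B"},
]
# Federating a second data source adds more candidate controls.
controls_two_sources = controls_one_source + [{"id": "c4", "race": "B"}]

print(match_controls(cases, controls_one_source, "race", 2))
print(match_controls(cases, controls_two_sources, "race", 2))
```

With one source, the second case cannot be matched 1:2; adding controls from a second source fills the gap, which is the same effect the later slides show when SHHS controls are added to CFS.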
    30. 30. Adult 55-75, male, BMI 35-50
    31. 31. Adult 55-75, male, BMI 20-30
    32. 32. Set up 1:2 Matching
    33. 33. 1:2 Matching Result [figure labels: Control, Matched Case]
    34. 34. 1:5 Matching?
    35. 35. 1:5 Matching: CFS + SHHS. Modify Control to Include TWO data sources
    36. 36. Sleep Domain Ontology (SDO)
        • Standardize terminology and semantics (define variations) [RO]
        • Facilitate definition of data elements
        • Valuable for data collection, data curation
        • Data integration
        • Data sharing and access
        • Take advantage of progress in related areas (e.g. Gene Ontology)
        • Improving data quality: provenance, reproducibility
    37. 37. Sleep Domain Ontology (SDO)
    38. 38. Sleep Domain Ontology (SDO)
    39. 39. VISAGE Query Builder showing a data query on Parkinsonian Disorders and REM sleep behavior disorder with race demographics
    40. 40. Semantic Web + Provenance + Bench Research = T.cruzi Semantic Problem Solving Environment
    41. 41. Semantic Problem Solving Environment for T.cruzi
    42. 42. Provenance in Scientific Experiments New Parasite Strains
    43. 43. Provenance in Scientific Experiments
        [Figure: gene knockout and strain creation workflow; labels in order: Gene Name, Sequence Extraction, 3' & 5' Region, Drug Resistant Plasmid, Plasmid Construction, Knockout Construct Plasmid, T.cruzi sample, Transfection, Transfected Sample, Drug Selection, Selected Sample, Cell Cloning, Cloned Sample]
    44. 44. Provenance in Scientific Experiments
        • Provenance, from the French word “provenir”, describes the lineage or history of a data entity
        • For Verification and Validation of Data Integrity, Process Quality, and Trust
        • Semantic Provenance Framework addresses three aspects [Prov]
          o Provenance Modeling
          o Provenance Query Infrastructure
          o Scalable Provenance System
        [Figure: the same gene knockout and strain creation workflow as on the previous slide]
    45. 45. Domain-specific Provenance ontology
        [Figure: PROVENIR ONTOLOGY classes (agent, data, process, parameter, data_collection, spatial_parameter, temporal_parameter, domain_parameter) extended through is_a links by PARASITE EXPERIMENT ONTOLOGY classes (transfection_machine, location, drug_selection, sample, strain_creation_protocol, transfection, cell_cloning, transfection_buffer, Tcruzi_sample); properties shown: has_agent, has_parameter, has_temporal_parameter, has_input_value, subPropertyOf; Time:DateTimeDescription]
        • Total Number of Classes: 118
        • DL Expressivity: ALCHQ(D)
    46. 46. Provenance Query Classification
        Classified Provenance Queries into Three Categories
        • Type 1: Querying for Provenance Metadata
          o Example: Which gene was used to create the cloned sample with ID = 66?
        • Type 2: Querying for Specific Data Set
          o Example: Find all knockout construct plasmids created by researcher Michelle using “Hygromycin” drug resistant plasmid between April 25, 2008 and August 15, 2008
        • Type 3: Operations on Provenance Metadata
          o Example: Were the two cloned samples 65 and 46 prepared under similar conditions? Compare the associated provenance information
    47. 47. Provenance Query Operators
        Four Query Operators, based on Query Classification
        • provenance(): Closure operation, returns the complete set of provenance metadata for the input data entity
        • provenance_context(): Given a set of constraints defined on provenance, retrieves the datasets that satisfy the constraints
        • provenance_compare(): Adapts the RDF graph equivalence definition
        • provenance_merge(): Two sets of provenance information are combined using the RDF graph merge
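The closure behavior of the `provenance()` operator can be approximated by a graph traversal. The sketch below is a simplified stand-in, not the authors' SPARQL-based implementation: the `DERIVED_FROM` map and its node names are hypothetical, loosely following the knockout workflow figure, and the merge is modeled as a plain set union.

```python
# Hypothetical lineage edges: each item maps to what it was directly derived from.
DERIVED_FROM = {
    "cloned_sample": ["cell_cloning"],
    "cell_cloning": ["selected_sample"],
    "selected_sample": ["drug_selection"],
    "drug_selection": ["transfected_sample"],
    "transfected_sample": ["transfection"],
    "transfection": ["knockout_construct_plasmid", "tcruzi_sample"],
    "knockout_construct_plasmid": ["plasmid_construction"],
    "plasmid_construction": ["drug_resistant_plasmid", "3_and_5_prime_region"],
    "3_and_5_prime_region": ["sequence_extraction"],
    "sequence_extraction": ["gene_name"],
}


def provenance(entity, derived_from):
    """Return everything `entity` transitively depends on (transitive closure)."""
    lineage = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        for source in derived_from.get(current, ()):
            if source not in lineage:
                lineage.add(source)
                stack.append(source)
    return lineage


def provenance_merge(first, second):
    """Combine two sets of provenance information (modeled here as set union)."""
    return first | second


lineage = provenance("cloned_sample", DERIVED_FROM)
print(sorted(lineage))
```

A Type 1 question such as "which gene was used to create this cloned sample" then amounts to looking for the gene node inside the returned lineage.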
    48. 48. Answering Provenance Queries using provenance () Operator
    49. 49. Implementation: Provenance Query Engine
        • Three modules:
          o Query Composer
          o Transitive closure
          o Query Optimizer
        • Deployable over an RDF store with support for reasoning
        [Figure labels: QUERY OPTIMIZER, TRANSITIVE CLOSURE]
    50. 50. Application in T.cruzi SPSE Project
        • Provenance tracking for gene knockout, strain creation, proteomics, microarray experiments
        • Part of the Parasite Knowledge Repository [BKR]
    51. 51. W3C Provenance Working Group
        • Define a “provenance interchange language for publishing and accessing provenance”
        • Three working drafts:
          o PROV-Data Model: A conceptual model for provenance representation
          o PROV-Ontology: An OWL ontology for provenance representation
          o PROV-Access and Query: A framework to query and retrieve provenance on the Web
    52. 52. Semantic Web + Clinical Practice Informatics = Active Semantic Electronic Medical Record (ASEMR)
    53. 53. Semantic Web application in use
        In daily use at Athens Heart Center
        – 28 person staff
          • Interventional Cardiologists
          • Electrophysiology Cardiologists
        – Deployed since January 2006
        – 40-60 patients seen daily
        – 3000+ active patients
        – Serves a population of 250,000 people
    54. 54. Information Overload in Clinical Practice
        • New drugs added to market
          – Adds interactions with current drugs
          – Changes possible procedures to treat an illness
        • Insurance Coverages Change
          – Insurance may pay for drug X but not drug Y even though drug X and Y are equivalent
          – Patient may need a certain diagnosis before some expensive tests are run
        • Physicians need a system to keep track of the ever-changing landscape
    55. 55. System throughout the practice
    56. 56. System throughout the practice
    57. 57. System throughout the practice
    58. 58. System throughout the practice
    59. 59. Active Semantic Document (ASD)
        A document (typically in XML) with the following features:
        • Semantic annotations
          – Linking entities found in a document to ontology
          – Linking terms to a specialized lexicon [TR]
        • Actionable information
          – Rules over semantic annotations
          – Violated rules can modify the appearance of the document (Show an alert)
    60. 60. Active Semantic Patient Record
        • An application of ASD
        • Three Ontologies
          – Practice: Information about practice such as patient/physician data
          – Drug: Information about drugs, interaction, formularies, etc.
          – ICD/CPT: Describes the relationships between CPT and ICD codes
        • Medical Records in XML created from database
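The notion of "rules over semantic annotations" that trigger an alert can be illustrated with a minimal Python sketch. The annotation format, the drug names, and the `INTERACTIONS` table are all hypothetical; the deployed system described in the tutorial expresses its rules in RDQL over the ontologies, which this sketch does not attempt to reproduce.

```python
from itertools import combinations

# Hypothetical knowledge: unordered pairs of drug concepts that interact.
INTERACTIONS = {frozenset({"drug_x", "drug_y"})}


def check_document(annotations, interactions):
    """Return one alert per pair of annotated drugs that is known to interact."""
    drugs = sorted({a["concept"] for a in annotations if a["type"] == "drug"})
    alerts = []
    for first, second in combinations(drugs, 2):
        if frozenset({first, second}) in interactions:
            alerts.append("ALERT: " + first + " interacts with " + second)
    return alerts


# Hypothetical annotations linking text spans in a patient record to concepts.
annotations = [
    {"text": "Drug X 10 mg", "type": "drug", "concept": "drug_x"},
    {"text": "Drug Y 5 mg", "type": "drug", "concept": "drug_y"},
    {"text": "chest pain", "type": "diagnosis", "concept": "angina"},
]

print(check_document(annotations, INTERACTIONS))
```

In an active document, a non-empty alert list would change how the record is displayed, for example by highlighting the two prescriptions.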
    61. 61. Practice Ontology Hierarchy (showing is-a relationships)
        [Figure: classes under owl:thing: facility, ancillary, insurance_carrier, insurance, insurance_plan, insurance_policy, ambulatory_episode, encounter, event, person, patient, practitioner]
    62. 62. Drug Ontology Hierarchy (showing is-a relationships)
        [Figure: class hierarchy rooted at owl:thing, covering formulary, indication, interaction, monograph, non-drug reactant, and prescription drug classes (brand-name and generic, individual and composite) together with their property classes]
    63. 63. Drug Ontology showing neighborhood of PrescriptionDrug concept
    64. 64. Part of Procedure/Diagnosis/ICD9/CPT Ontology
        [Figure labels: procedure, diagnosis, specificity, maps_to_diagnosis, maps_to_procedure]
    65. 65. Semantic Technologies in Use
        • Semantic Web: OWL, RDF/RDQL, Jena
          – OWL (constraints useful for data consistency), RDF
          – Rules are expressed as RDQL
          – REST Based Web Services: from server side
        • Web 2.0: client makes AJAX calls to ontology, also auto complete
        Problem:
        • Jena main memory: large memory footprint, future scalability challenge
        • Using Jena’s persistent model (MySQL) is noticeably slower
    66. 66. Architecture & Technology
    67. 67. Benefits: Athens Heart Center Practice Growth
        [Chart: appointments per month (y-axis 400 to 1400), one series per year: 2003, 2004, 2005, 2006]
    68. 68. Chart Completion before the preliminary deployment of the ASEMR
        [Chart: charts per Month/Year (y-axis 0 to 600), 2004 to 2005; series: Same Day, Back Log]
    69. 69. Chart Completion after the preliminary deployment of the ASEMR
        [Chart: charts per Month/Year (y-axis 0 to 700), Sept 05, Nov 05, Jan 06, Mar 06; series: Same Day, Back Log]
    70. 70. Benefits of current system
        • Error prevention (drug interactions, allergy)
          – Patient care
          – Insurance
        • Decision Support (formulary, billing)
          – Patient satisfaction
          – Reimbursement
        • Efficiency/time
          – Real-time chart completion
          – “Semantic” and automated linking with billing
    71. 71. Demo
        On-line demo of Active Semantic Electronic Medical Record deployed and in use at Athens Heart Center
    72. 72. Challenges, Opportunities, and Future Direction
    73. 73. Conclusions
        Benefits of SW in Health Informatics:
        • RDF a “universal” data model; application-purpose agnostic (clinical care vs research)
        • Integration “ready,” supporting distributed query out of the box
        • Semantic interoperability addressed at root level
        • Better support of user interfaces for data capture, data query, data integration
        • Scalability demonstrated
    74. 74. Challenges and Future Directions
        • Design and implementation of health information systems with RDF as primary data store from the ground up
        • User-friendly graphical query interface on top of SPARQL
        • Managing Protected Health Information (PHI), e.g. data encryption “at rest” for RDF store
        • From retrospective annotation of data (with ontology) to prospective annotation of data: ontology-driven data capture with annotation happening at the point of primary source (eliminating the need to annotate data retrospectively)
        • Let ontology drive “everything”
    75. 75. References
        • [RDF] Manola F, Miller E (Eds.). RDF Primer. 2004; Available from:
        • [RDFS] Brickley D, Guha RV. RDF Schema. 2004; Available from:
        • [OWL] Hitzler P, Krötzsch M, Parsia B, Patel-Schneider PF, Rudolph S. OWL 2 Web Ontology Language Primer. W3C; 2009.
        • [Physio-MIMI]:
        • [ASEMR] Sheth AP, Agrawal S, Lathem J, Oldham N, Wingate H, Yadav P, Gallagher K. Active Semantic Electronic Medical Record. In: 5th International Semantic Web Conference, Athens, GA, USA; 2006.
        • [BioRDF] BioRDF subgroup: Health Care and Life Sciences interest group. Available:
        • [TR] Ruttenberg A, et al. Advancing translational research with the Semantic Web. BMC Bioinformatics, in press; 2007.
    76. 76. References 2
        • [Visage] Zhang GQ, et al. VISAGE: A Query Interface for Clinical Research. Proceedings of the 2010 AMIA Clinical Research Informatics Summit, San Francisco, March 12-13, pp. 76-80; 2010.
        • [Prov] Sahoo SS, Nguyen V, Bodenreider O, Parikh P, Minning T, Sheth AP. A unified framework for managing provenance information in translational research. BMC Bioinformatics 2011, 12:461.
        • [RO] Smith B, Ceusters W, Klagges B, Kohler J, Kumar A, Lomax J, Mungall C, Neuhaus F, Rector AL, Rosse C. Relations in biomedical ontologies. Genome Biol 2005, 6(5):R46.
        • [BKR] Bodenreider O, Rindflesch TC. Advanced library services: Developing a biomedical knowledge repository to support advanced information management applications. Bethesda, Maryland: Lister Hill National Center for Biomedical Communications, National Library of Medicine; 2006.
        • T.cruzi project web site:
    77. 77. Acknowledgements
        • Collaborators:
          o Susan Redline, Remo Mueller, and other members of the Physio-MIMI team
          o Rick Tarleton, Todd Minning, Priti Parikh and other members of the T.cruzi SPSE team
          o Dr. S. Agrawal and other members at the Athens Heart Center, GA
        • NIH Support: UL1-RR024989, UL1-RR024989-05S, NCRR-94681DBS78, NS076965, and 1R01HL087795