http://metacognition.info/presentations/SW-usecases-outcomes-research.ppt




  Semantic Web use cases
  in outcomes research
  Experiences from building a patient repository and
  developing standards



                                                            Chimezie Ogbuji
                                                  Metacognition Inc. (Owner)
Outline
• Me
• Semantic Web and Semantic Web technologies
  • RDF, GRDDL, OWL, RIF, and SPARQL
• Cleveland Clinic Semantic DB project
  •   Content repository
  •   Data collection workflow
  •   Quality and outcomes reporting
  •   Cohort identification
• Use of the system
Me and the Semantic Web
• I’ve been developing software using standards of the Semantic
  Web since 2001
  • Worked on a startup that developed an XML & RDF content
    repository
• Began working on Cleveland Clinic SemanticDB project in 2003
• Began working in the World Wide Web Consortium (W3C),
  developing the SPARQL and GRDDL standards in 2007 and
  2006, respectively
• I contribute to and maintain several open source software
  projects related to Semantic Web technologies:
  • RDFLib (https://code.google.com/p/rdflib/)
  • FuXi (https://code.google.com/p/fuxi/)
  • Akamu (https://code.google.com/p/akamu/)
The Semantic Web
• The Semantic Web
  • What is it? Like asking “What is the Matrix?”
  • A vision of how the existing WWW can be extended such that
    machines can interpret the meaning of data involved in protocol
    interactions
  • A vision of Tim Berners-Lee, inventor of the World Wide Web and
    founder of the World Wide Web Consortium (W3C)
• Semantic Web technologies / standards
  • Layers of W3C standards (“Layer cake”)
  • A technological roadmap that attempts to realize this vision
  • The technologies are well-suited to addressing many enterprise
    software architecture challenges
http://www.w3.org/2007/Talks/0130-sb-W3CTechSemWeb/
http://www.bnode.org/blog/2009/07/08/the-semantic-web-not-a-piece-of-cake
“Focus” standards
•   Resource Description Framework
•   Gleaning Resource Descriptions from Dialects of Languages
•   SPARQL Protocol And RDF Query Language
•   Web Ontology Language
RDF
• A framework for representing information on the WWW.
• Motivation
  • machine-interpretable metadata about web resources
  • mashup of application data
  • automated processing of web information by software agents
• Graph data model (directed, labeled graph)




• Nodes and links are labeled with URIs
• Some nodes are not labeled (Blank nodes)
• Links are called RDF sentences or triples
                                   http://www.w3.org/TR/rdf-concepts/
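The graph model above can be sketched in a few lines of plain Python. This is only an illustration, not RDFLib or any other library API; the `URI`, `BlankNode`, and `Triple` classes and the `http://example.org/` vocabulary are assumptions made up for the example.

```python
from typing import NamedTuple


class URI(str):
    """A node or link label that is a URI."""


class BlankNode:
    """An unlabeled node; it is identified by object identity, not by a name."""


class Triple(NamedTuple):
    """One directed, labeled link: subject --predicate--> object."""
    subject: object
    predicate: URI
    object: object


EX = "http://example.org/"  # hypothetical namespace

patient = URI(EX + "patient/1")
operation = BlankNode()  # an operation we describe but do not name

# An RDF graph is a set of triples.
graph = {
    Triple(patient, URI(EX + "underwent"), operation),
    Triple(operation, URI(EX + "procedureType"), URI(EX + "CABG")),
}


def objects(graph, subject, predicate):
    """Follow the links labeled `predicate` that leave `subject`."""
    return [t.object for t in graph
            if t.subject == subject and t.predicate == predicate]
```

Calling `objects(graph, patient, URI(EX + "underwent"))` returns the blank node, which can then be used as the subject of a second lookup.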
GRDDL
    • A protocol for sowing semantics in structured (XML) web
      content for harvest
    • Vast amount of latent semantics
      in web documents
    • Web content today is
      primarily built for human
      consumption




http://www.w3.org/TR/grddl/
Faithful Rendition
“By specifying a GRDDL transformation, the author of a document states that
the transformation will provide a faithful rendition in RDF of information (or
some portion of the information) expressed through the XML dialect used in
the source document.”
• Licenses an interpretation of an XML document that is
  certified by the author
[Figure: XHTML / XML instance documents yield RDF through an (embedded)
transform; an XML namespace yields RDF through a namespace transform]
Architectural value
• XML is well-suited for messaging, data collection, and
  structural validation
• RDF is well-suited for expressive logical assertions, querying,
  and inference.
• RDF graphs can be created, updated, deleted, etc. (managed)
  using a particular XML vocabulary
  • vocabulary can be specific to a particular purpose
• GRDDL facilitates mutually-beneficial use of XML and RDF
  processing and representation
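A rough sketch of the XML-to-RDF step, assuming a made-up patient record vocabulary. GRDDL transformations are typically written in XSLT and referenced from the document or its namespace; the Python function below is only a stand-in for such a transform, and every element name, attribute, and URI in it is hypothetical.

```python
import xml.etree.ElementTree as ET

EX = "http://example.org/"  # hypothetical namespace

# A purpose-specific XML vocabulary used for data collection.
RECORD_XML = """
<patientRecord id="record/42">
  <patient id="patient/7"/>
  <operation id="operation/9" type="CABG"/>
</patientRecord>
"""


def transform(xml_text):
    """Return the (subject, predicate, object) triples stated by one record."""
    root = ET.fromstring(xml_text)
    record = EX + root.get("id")
    triples = {(record, "rdf:type", EX + "PatientRecord")}
    patient = root.find("patient")
    if patient is not None:
        triples.add((record, EX + "about", EX + patient.get("id")))
    for op in root.findall("operation"):
        op_uri = EX + op.get("id")
        triples.add((record, EX + "contains", op_uri))
        triples.add((op_uri, EX + "procedureType", EX + op.get("type")))
    return triples
```

The XML document stays the unit of editing and validation, while the set returned by `transform(RECORD_XML)` is what a query or inference engine would see.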
SPARQL
   • The query language for RDF content
   • It operates over an RDF dataset
      • comprising zero or more named RDF graphs (each named by a URI)
        and a single default RDF graph without a name
   • Operationally and structurally similar to SQL
   • Many implementations (including the ones we used) build on
     existing relational database management systems
      • translate SPARQL queries into SQL queries




Elliott et al. A complete translation from SPARQL into efficient SQL. 2009
                                             http://www.w3.org/TR/sparql11-query/
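To illustrate the SPARQL-to-SQL idea, here is a minimal sketch that turns a basic graph pattern into one SQL self-join over a single triples table in SQLite. It is not the algorithm from the cited paper and not the translation used in SemanticDB; the `triples` table layout, the `translate` helper, and the `ex:` terms are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:op/1", "rdf:type", "ex:Operation"),
        ("ex:op/1", "ex:procedureType", "ex:CABG"),
        ("ex:op/2", "rdf:type", "ex:Operation"),
        ("ex:op/2", "ex:procedureType", "ex:PCI"),
    ],
)


def translate(patterns, select):
    """Translate triple patterns into SQL; terms starting with '?' are variables."""
    columns = ("subject", "predicate", "object")
    bindings = {}  # variable -> first table column it was bound to
    where, params = [], []
    for i, pattern in enumerate(patterns):
        for column, term in zip(columns, pattern):
            ref = f"t{i}.{column}"
            if not term.startswith("?"):
                where.append(f"{ref} = ?")  # constant: compare to a parameter
                params.append(term)
            elif term in bindings:
                where.append(f"{ref} = {bindings[term]}")  # shared variable: join
            else:
                bindings[term] = ref
    tables = ", ".join(f"triples AS t{i}" for i in range(len(patterns)))
    sql = f"SELECT DISTINCT {bindings[select]} FROM {tables}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


# SELECT ?op WHERE { ?op rdf:type ex:Operation . ?op ex:procedureType ex:CABG }
PATTERNS = [
    ("?op", "rdf:type", "ex:Operation"),
    ("?op", "ex:procedureType", "ex:CABG"),
]
```

Running `conn.execute(*translate(PATTERNS, "?op")).fetchall()` evaluates the pattern inside the relational engine, which is where the structural similarity to SQL pays off.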
OWL
• Language for describing and constraining the semantics of an
  RDF vocabulary
• Such constraints (often hierarchical) are called ontologies
• An ontology specifies a conceptualization of a particular
  domain as categories, relationships between them, and
  constraints on both
• By defining an OWL document for the terms in an RDF
  graph, additional RDF sentences can be inferred
• Additionally, an RDF graph can be determined to be consistent
  or inconsistent with respect to the ontology
• Both tasks can be performed by a logical reasoning engine
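Both tasks can be sketched with a toy forward-chaining loop. This covers only subclass reasoning and class disjointness, a very small fragment of what an OWL reasoner does, and it is not FuXi's API; the class names and the `infer` and `is_consistent` helpers are hypothetical.

```python
TYPE, SUBCLASS, DISJOINT = "rdf:type", "rdfs:subClassOf", "owl:disjointWith"

# A tiny ontology: a class hierarchy plus one disjointness constraint.
ontology = {
    ("ex:CABG", SUBCLASS, "ex:CardiacSurgery"),
    ("ex:CardiacSurgery", SUBCLASS, "ex:Operation"),
    ("ex:Operation", DISJOINT, "ex:Patient"),
}
data = {("ex:op/1", TYPE, "ex:CABG")}


def infer(graph):
    """Apply the subclass rules repeatedly until no new triples appear."""
    closure = set(graph)
    while True:
        new = set()
        for s, p, o in closure:
            if p not in (TYPE, SUBCLASS):
                continue
            for c, q, d in closure:
                if q == SUBCLASS and c == o:
                    new.add((s, p, d))  # inherit the type or the superclass
        if new <= closure:
            return closure
        closure |= new


def is_consistent(graph):
    """False if any resource is typed with two disjoint classes."""
    closure = infer(graph)
    types = {}
    for s, p, o in closure:
        if p == TYPE:
            types.setdefault(s, set()).add(o)
    for a, p, b in closure:
        if p == DISJOINT:
            if any(a in classes and b in classes for classes in types.values()):
                return False
    return True
```

Here `infer(ontology | data)` adds the sentence that `ex:op/1` is an `ex:Operation`, and asserting that the same resource is also an `ex:Patient` makes the graph inconsistent.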
Semantic Database (SDB)
    • Cleveland Clinic’s Heart and Vascular Institute (HVI)
    • Challenges:
       • fragmented gathering and storing of clinical research data
       • compartmentalization of medical science and practice
       • clinical knowledge is often expressed in ambiguous, idiosyncratic
         terminology
       • problematic for longitudinal patient data that can feasibly span
         multiple, geographically separated sources and disciplines
    • Longitudinal patient record:
       • patient records from different times, providers, and sites of care
         that are linked to form a lifelong view of a patient’s health care
         experience
Institute of Medicine. The computer-based patient record: an essential technology for
health care. 1997
                   http://www.w3.org/2001/sw/sweo/public/UseCases/ClevelandClinic/
Project goals
• Create a framework for context-free data management
• Usable for any domain with nothing (or little) assumed about
  the domain
• Expert-provided, domain-specific knowledge is used to control
  most aspects of
  •   Data entry
  •   Storage
  •   Display
  •   Retrieval
  •   Formatting for external systems
Components
   • Content repository
      • supports data collection, document management, and knowledge
        representation for use in managing longitudinal clinical data
      • manages patient record documents as XML and converts them to
        RDF graphs for downstream semantic processing
   • Data collection workflow management
      • process of transcribing details of a heart procedure from the EHR
        into a registry
      • RDF used as the state machine of a workflow engine


Pierce et al. SemanticDB: A Semantic Web Infrastructure for Clinical Research and
Quality Reporting. 2012
Ogbuji. A Role for Semantic Web Technologies in Patient Record Data Collection.
2009
Workflow State as RDF Dataset
• Each task is an XML document in a content repository
• Mirrored into a named RDF graph that shares a web location
  (the name) with the document
• A SPARQL query is dispatched against the workflow dataset to
  find tasks in particular states or assigned to particular people
• Applications interact with task information and fetch:
  • JSON and XML representations (for client-side web applications)
  • XHTML documents that render as faceted views of a collection of
    tasks
  • faceted view includes links to subsequent stages in workflow and
    into other web applications on server
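A minimal sketch of workflow state held as a dataset, assuming one named graph per task and a made-up `ex:state` / `ex:assignedTo` vocabulary. The dictionary and the `find_tasks` and `advance` functions stand in for an RDF store and its SPARQL queries; none of these names come from SemanticDB.

```python
T1 = "http://example.org/tasks/1"  # hypothetical task document URLs
T2 = "http://example.org/tasks/2"
T3 = "http://example.org/tasks/3"

# Graph name (the task document's web location) -> triples mirrored from it.
dataset = {
    T1: {(T1, "ex:state", "ex:AwaitingAbstraction"),
         (T1, "ex:assignedTo", "ex:nurse/alice")},
    T2: {(T2, "ex:state", "ex:AwaitingReview"),
         (T2, "ex:assignedTo", "ex:nurse/alice")},
    T3: {(T3, "ex:state", "ex:AwaitingAbstraction"),
         (T3, "ex:assignedTo", "ex:nurse/bob")},
}


def find_tasks(dataset, state=None, assignee=None):
    """Return the names of task graphs matching the given state and/or assignee."""
    matches = []
    for name, graph in dataset.items():
        if state is not None and (name, "ex:state", state) not in graph:
            continue
        if assignee is not None and (name, "ex:assignedTo", assignee) not in graph:
            continue
        matches.append(name)
    return sorted(matches)


def advance(dataset, task, new_state):
    """Move a task to a new state by replacing its state triple."""
    graph = dataset[task]
    graph.difference_update({t for t in graph if t[1] == "ex:state"})
    graph.add((task, "ex:state", new_state))
```

Because the graph name is also the document's location, the result of `find_tasks` is directly a list of links that a faceted view could render.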
Reporting challenges
   • Reporting places a heavy burden on institutions to produce
     data in specific formats with precise definitions
   • Definitions vary across reports
      • makes it difficult to use the same source data for all reports
   • Institutions are typically forced to manually abstract the data
     for each report
   • This is done separately to conform to the requirements for
     each report




Pierce et al. SemanticDB: A Semantic Web Infrastructure for Clinical Research and
Quality Reporting. 2012
Components: reporting
   • Quality and outcomes reporting
      • generate outcomes reports both for internal and external
        consumption
      • internal reports were generated monthly and external reports
        were generated quarterly
      • quarterly reports submitted to Society of Thoracic Surgeons (STS)
        Adult Cardiac Surgery National Database and American College of
        Cardiology (ACC) CathPCI Database
      • submissions are required for certification




Pierce et al. SemanticDB: A Semantic Web Infrastructure for Clinical Research and
Quality Reporting. 2012
Cohort identification
  • SPARQL and RDF datasets are well-suited as infrastructure for
    a longitudinal patient record data warehouse
  • HVI software development team partnered with Cycorp to
    build a cohort identification interface called the Semantic
    Research Assistant (SRA)
  • Based on the Cyc inference engine
     • a powerful reasoning system and knowledge base with built-in
       capability for natural language (NL) processing, forward-chaining
       inference and backward-chaining inference.
     • incorporates Cyc's NL processing to permit a user to compose a
       cohort selection query by typing an English sentence or sentence
       fragment

Lenat et al. Harnessing Cyc to Answer Clinical Researchers' Ad Hoc Queries. 2010.
RDF dataset warehouse
• CycL to SPARQL
  • domain-specific medical ontologies in conjunction with the Cyc
    general ontology are used to convert the NL query into a formal
    representation and then into SPARQL queries.
  • SPARQL queries are submitted to the SemanticDB RDF store for
    execution
• Cleveland Clinic’s registry of 200,000 patient records
  comprises an RDF graph of roughly 80 million RDF assertions
Dataset topology
• An RDF dataset with no default graph and one named graph
  per patient record (a patient record graph)
• Beyond identifying the cohort, most subsequent query
  processing happens within a single patient record graph
• In our vocabulary, there are instances of
  PatientRecord, Operation, Patient, MedicalEvent,
  HospitalEpisode, etc.
• PatientRecord resources share a URI with their containing
  graph
• GRAPH operator can be used to optimize the search space
• Optimal for the following cohort querying paradigm
   • Constraints in the first part of the query are cross-graph; those
      in the second part are intra-graph
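The two-phase paradigm can be sketched over a toy dataset with one named graph per patient record. The dictionary stands in for the RDF store, the SPARQL in the comment is only an indicative shape for such a query, and the `ex:` vocabulary and both helper functions are assumptions for the example.

```python
# Indicative SPARQL shape (hypothetical vocabulary):
#   SELECT ?record ?date WHERE {
#     GRAPH ?record { ?op a ex:Operation ;
#                         ex:procedureType ex:CABG ;
#                         ex:date ?date }
#   }

# Graph name -> triples; no default graph. Each PatientRecord resource
# shares its URI with the graph that contains it.
dataset = {
    "ex:record/1": {
        ("ex:record/1", "rdf:type", "ex:PatientRecord"),
        ("ex:op/1", "rdf:type", "ex:Operation"),
        ("ex:op/1", "ex:procedureType", "ex:CABG"),
        ("ex:op/1", "ex:date", "2010-03-01"),
    },
    "ex:record/2": {
        ("ex:record/2", "rdf:type", "ex:PatientRecord"),
        ("ex:op/2", "rdf:type", "ex:Operation"),
        ("ex:op/2", "ex:procedureType", "ex:PCI"),
        ("ex:op/2", "ex:date", "2010-05-17"),
    },
}


def identify_cohort(dataset, procedure):
    """Cross-graph phase: which patient record graphs mention the procedure?"""
    return sorted(
        name for name, graph in dataset.items()
        if any(p == "ex:procedureType" and o == procedure for _, p, o in graph)
    )


def operation_dates(graph):
    """Intra-graph phase: all remaining work touches one record graph only."""
    ops = {s for s, p, o in graph if p == "rdf:type" and o == "ex:Operation"}
    return sorted(o for s, p, o in graph if s in ops and p == "ex:date")


cohort = identify_cohort(dataset, "ex:CABG")
report = {record: operation_dates(dataset[record]) for record in cohort}
```

Once the cohort is fixed, each later lookup is scoped to a single small graph, which is the search-space reduction the GRAPH operator provides.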
Use of system
• From 2009 through June of 2011
  • over 200 clinical investigations utilized SemanticDB to identify
    study cohorts and retrieve appropriate data for analysis
  • studies ranged from relatively simple feasibility assessments to
    extremely complex investigations of time-related events and
    competing risks of the patient experiencing a certain outcome
    after treatment
  • previously, cohort identification and data export queries for
    studies would have been performed by a skilled database
    administrator (DBA) interpreting instructions from domain experts
  • Using SemanticDB and the SRA, a non-technical domain expert
    performed most of the queries
