2. Outline
• Me
• Semantic Web and Semantic Web technologies
• RDF, GRDDL, OWL, RIF, and SPARQL
• Cleveland Clinic Semantic DB project
• Content repository
• Data collection workflow
• Quality and outcomes reporting
• Cohort identification
• Use of the system
3. Me and the Semantic Web
• I’ve been developing software using standards of the Semantic
Web since 2001
• Worked on a startup that developed an XML & RDF content
repository
• Began working on Cleveland Clinic SemanticDB project in 2003
• Began working in the World Wide Web Consortium (W3C),
developing the SPARQL and GRDDL standards in 2007 and
2006, respectively
• I contribute to and maintain several open source software
projects related to Semantic Web technologies:
• RDFLib (https://code.google.com/p/rdflib/)
• FuXi (https://code.google.com/p/fuxi/)
• Akamu (https://code.google.com/p/akamu/)
4. The Semantic Web
• The Semantic Web
• What is it? Like asking “What is the Matrix?”
• A vision of how the existing WWW can be extended such that
machines can interpret the meaning of data involved in protocol
interactions
• A vision of the founder of the World Wide Web Consortium (W3C)
and inventor of the World Wide Web (Tim Berners-Lee)
• Semantic Web technologies / standards
• Layers of W3C standards (“Layer cake”)
• A technological roadmap that attempts to realize this vision
• The technologies are well-suited to addressing many enterprise
software architecture challenges
7. “Focus” standards
• Resource Description Framework
• Gleaning Resource Descriptions from Dialects of Language
• SPARQL Protocol And RDF Query Language
• Web Ontology Language
8. RDF
• A framework for representing information on the WWW.
• Motivation
• machine-interpretable metadata about web resources
• mashup of application data
• automated processing of web information by software agents
• Graph data model (directed, labeled graph)
• Nodes and links are labeled with URIs
• Some nodes are not labeled (Blank nodes)
• Each link, together with the two nodes it connects, is called an
RDF sentence or triple
http://www.w3.org/TR/rdf-concepts/
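The graph data model above can be sketched with plain Python tuples. This is only an illustration and does not use an RDF library; the `http://example.org/` namespace, the `BlankNode` class, and the `objects` helper are hypothetical names introduced here.

```python
from itertools import count

_ids = count()


class BlankNode:
    """An unlabeled node: it has identity within a graph but no URI."""

    def __init__(self):
        self.local_id = next(_ids)

    def __repr__(self):
        return f"_:b{self.local_id}"


EX = "http://example.org/"  # hypothetical namespace

# A graph is a set of (subject, predicate, object) triples.
address = BlankNode()
graph = {
    (EX + "alice", EX + "knows", EX + "bob"),
    (EX + "alice", EX + "address", address),
    (address, EX + "city", "Cleveland"),
}


def objects(graph, subject, predicate):
    """Follow the links labeled `predicate` that leave `subject`."""
    return {o for s, p, o in graph if s == subject and p == predicate}


# Walk two links: alice -> (blank node) -> city
cities = {
    city
    for node in objects(graph, EX + "alice", EX + "address")
    for city in objects(graph, node, EX + "city")
}
print(cities)
```

The blank node can be reached by following links but cannot be named from outside the graph, which is the practical difference from a URI-labeled node.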
9. GRDDL
• A protocol for sowing semantics in structured (XML) web
content for harvest
• Vast amount of latent semantics
in web documents
• Web content today is
primarily built for human
consumption
http://www.w3.org/TR/grddl/
10. Faithful Rendition
“By specifying a GRDDL transformation, the author of a document states that
the transformation will provide a faithful rendition in RDF of information (or
some portion of the information) expressed through the XML dialect used in
the source document.”
• Licenses an interpretation of an XML document that is
certified by the author
[Diagram: XHTML / XML (instances) yield RDF through an (embedded) transform; an XML namespace yields RDF through a namespace transform]
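A rough sketch of the idea of a transformation that renders an XML dialect as RDF. GRDDL transformations are typically written in XSLT; here a small Python function stands in for one so the example stays self-contained. The XML dialect, the `http://example.org/vocab#` vocabulary, and the `transform` function are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

EX = "http://example.org/vocab#"  # hypothetical RDF vocabulary

# Hypothetical XML dialect for a procedure record.
SOURCE = """
<procedure id="http://example.org/proc/1">
  <surgeon>Dr. Smith</surgeon>
  <date>2010-05-01</date>
</procedure>
"""


def transform(xml_text):
    """Stand-in for the transformation a GRDDL-aware agent would apply:
    map each child element of the record to one triple about the record."""
    root = ET.fromstring(xml_text)
    subject = root.get("id")
    return {(subject, EX + child.tag, child.text.strip()) for child in root}


triples = transform(SOURCE)
for triple in sorted(triples):
    print(triple)
```

The XML document remains the managed artifact; the triples are derived from it on demand.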
11. Architectural value
• XML is well-suited for messaging, data collection, and
structural validation
• RDF is well-suited for expressive logical assertions, querying,
and inference.
• RDF graphs can be created, updated, deleted, etc. (managed)
using a particular XML vocabulary
• vocabulary can be specific to a particular purpose
• GRDDL facilitates mutually-beneficial use of XML and RDF
processing and representation
12. SPARQL
• The query language for RDF content
• It operates over an RDF dataset
• composed of named RDF graphs (each named by a URI) and a single
RDF graph without a name (the default graph)
• Operationally and structurally similar to SQL
• Many implementations (including the ones we used) build on
existing relational database management systems
• translate SPARQL queries into SQL queries
Elliott et al. A complete translation from SPARQL into efficient SQL. 2009
http://www.w3.org/TR/sparql11-query/
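To make the SQL analogy concrete, the sketch below stores a small RDF dataset in a single SQLite table and hand-translates one SPARQL graph pattern into a self-join. It is a simplified illustration, not the translation algorithm from the cited paper and not the schema of any particular store; the table layout and the `ex:` names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per triple; `g` is NULL for the unnamed (default) graph.
conn.execute("CREATE TABLE quads (g TEXT, s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO quads VALUES (?, ?, ?, ?)",
    [
        ("urn:g1", "ex:op1", "rdf:type", "ex:Operation"),
        ("urn:g1", "ex:op1", "ex:surgeon", "ex:smith"),
        ("urn:g2", "ex:op2", "rdf:type", "ex:Operation"),
        ("urn:g2", "ex:op2", "ex:surgeon", "ex:jones"),
        (None, "ex:smith", "ex:name", "Dr. Smith"),
    ],
)

# SPARQL being mimicked:
#   SELECT ?g ?op WHERE {
#     GRAPH ?g { ?op rdf:type ex:Operation .
#                ?op ex:surgeon ex:smith }
#   }
# Each triple pattern becomes one alias of the table; a shared variable
# (?op, ?g) becomes a join condition between the aliases.
SQL = """
SELECT t1.g, t1.s
FROM quads AS t1
JOIN quads AS t2 ON t2.s = t1.s AND t2.g = t1.g
WHERE t1.p = 'rdf:type'   AND t1.o = 'ex:Operation'
  AND t2.p = 'ex:surgeon' AND t2.o = 'ex:smith'
"""
rows = conn.execute(SQL).fetchall()
print(rows)
```

Because `NULL = NULL` is not true in SQL, rows from the unnamed graph never satisfy the join on `g`, which matches the SPARQL rule that `GRAPH ?g` ranges over named graphs only.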
13. OWL
• Language for describing and constraining the semantics of an
RDF vocabulary
• Such constraints (often hierarchical) are called ontologies
• An ontology specifies a conceptualization of a particular
domain as categories, relationships between them, and
constraints on both
• By defining an OWL document for the terms in an RDF
graph, additional RDF sentences can be inferred
• Additionally, an RDF graph can be determined to be consistent
or inconsistent with respect to the ontology
• Both tasks can be performed by a logical reasoning engine
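Both reasoning tasks can be illustrated with a toy forward-chaining loop. This is a minimal sketch covering only subclass inference and disjointness checking, far short of a real OWL reasoner; the class names and the `infer` and `is_consistent` helpers are hypothetical.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"
DISJOINT = "owl:disjointWith"

# Hypothetical ontology: a small class hierarchy plus one constraint.
ontology = {
    ("ex:CABG", SUBCLASS, "ex:CardiacSurgery"),
    ("ex:CardiacSurgery", SUBCLASS, "ex:Operation"),
    ("ex:Operation", DISJOINT, "ex:Patient"),
}
data = {("ex:op1", TYPE, "ex:CABG")}


def infer(ontology, data):
    """Apply one rule until nothing new is derived:
    (x type C) and (C subClassOf D)  =>  (x type D)."""
    facts = set(data)
    changed = True
    while changed:
        changed = False
        for x, p, c in list(facts):
            if p != TYPE:
                continue
            for c2, q, d in ontology:
                if q == SUBCLASS and c2 == c and (x, TYPE, d) not in facts:
                    facts.add((x, TYPE, d))
                    changed = True
    return facts


def is_consistent(ontology, facts):
    """Inconsistent if any individual belongs to two disjoint classes."""
    types = {(x, c) for x, p, c in facts if p == TYPE}
    return not any(
        (x, a) in types and (x, b) in types
        for a, q, b in ontology
        if q == DISJOINT
        for x, _ in types
    )


facts = infer(ontology, data)
bad_facts = infer(ontology, data | {("ex:op1", TYPE, "ex:Patient")})
print(("ex:op1", TYPE, "ex:Operation") in facts)
print(is_consistent(ontology, facts))
print(is_consistent(ontology, bad_facts))
```

The first check shows an inferred sentence that was never asserted; the last shows a graph that contradicts the ontology once inference has been applied.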
14. Semantic Database (SDB)
• Cleveland Clinic’s Heart and Vascular Institute (HVI)
• Challenges:
• fragmented gathering and storing of clinical research data
• compartmentalization of medical science and practice
• clinical knowledge is often expressed in ambiguous, idiosyncratic
terminology
• problematic for longitudinal patient data that can feasibly span
multiple, geographically separated sources and disciplines
• Longitudinal patient record:
• patient records from different times, providers, and sites of care
that are linked to form a lifelong view of a patient’s health care
experience
Institute of Medicine. The computer-based patient record: an essential technology for
health care. 1997
http://www.w3.org/2001/sw/sweo/public/UseCases/ClevelandClinic/
15. Project goals
• Create a framework for context-free data management
• Usable for any domain with nothing (or little) assumed about
the domain
• Expert-provided, domain-specific knowledge is used to control
most aspects of:
• Data entry
• Storage
• Display
• Retrieval
• Formatting for external systems
16. Components
• Content repository
• supports data collection, document management, and knowledge
representation for use in managing longitudinal clinical data
• manages patient record documents as XML and converts them to
RDF graphs for downstream semantic processing
• Data collection workflow management
• process of transcribing details of a heart procedure from the EHR
into a registry
• RDF used as the state machine of a workflow engine
Pierce et al. SemanticDB: A Semantic Web Infrastructure for Clinical Research and
Quality Reporting. 2012
Ogbuji. A Role for Semantic Web Technologies in Patient Record Data Collection.
2009
17. Workflow State as RDF Dataset
• Each task is an XML document in a content repository
• Mirrored into a named RDF graph that shares a web location
(the name) with the document
• (SPARQL) query is dispatched against a workflow dataset to
find tasks in particular states or assigned to particular people
• Applications interact with task information and fetch:
• JSON and XML representations (for client-side web applications)
• XHTML documents that render as faceted views of a collection of
tasks
• faceted view includes links to subsequent stages in workflow and
into other web applications on server
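A minimal sketch of querying workflow state held as one named graph per task. A plain dictionary stands in for the RDF dataset and a Python function stands in for the SPARQL query; the task URIs, the `wf:` property names, and `find_tasks` are assumptions made for illustration.

```python
import json

# Workflow dataset: graph name (the task document's web location) -> triples.
# All URIs and property names here are hypothetical.
dataset = {
    "http://example.org/tasks/1": {
        ("http://example.org/tasks/1", "wf:state", "wf:Pending"),
        ("http://example.org/tasks/1", "wf:assignedTo", "ex:nurse_a"),
    },
    "http://example.org/tasks/2": {
        ("http://example.org/tasks/2", "wf:state", "wf:Complete"),
        ("http://example.org/tasks/2", "wf:assignedTo", "ex:nurse_a"),
    },
    "http://example.org/tasks/3": {
        ("http://example.org/tasks/3", "wf:state", "wf:Pending"),
        ("http://example.org/tasks/3", "wf:assignedTo", "ex:nurse_b"),
    },
}


def find_tasks(dataset, state=None, assignee=None):
    """Return the names of task graphs matching the state and/or assignee."""
    matches = []
    for name, graph in dataset.items():
        if state is not None and (name, "wf:state", state) not in graph:
            continue
        if assignee is not None and (name, "wf:assignedTo", assignee) not in graph:
            continue
        matches.append(name)
    return sorted(matches)


pending = find_tasks(dataset, state="wf:Pending", assignee="ex:nurse_a")
# A JSON representation of the result, as a client-side application might fetch.
print(json.dumps({"tasks": pending}, indent=2))
```

Because each graph name is also the web location of the task document, the query result doubles as a list of links to the documents themselves.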
18.
19. Reporting challenges
• Reporting places a heavy burden on institutions to produce
data in specific formats with precise definitions
• Definitions vary across reports
• makes it difficult to use the same source data for all reports
• Institutions are typically forced to manually abstract the data
for each report
• This is done separately to conform to the requirements for
each report
Pierce et al. SemanticDB: A Semantic Web Infrastructure for Clinical Research and
Quality Reporting. 2012
20. Components: reporting
• Quality and outcomes reporting
• generate outcomes reports both for internal and external
consumption
• internal reports are generated monthly and external reports are
generated quarterly
• quarterly reports submitted to Society of Thoracic Surgeons (STS)
Adult Cardiac Surgery National Database and American College of
Cardiology (ACC) CathPCI Database
• submissions are required for certification
Pierce et al. SemanticDB: A Semantic Web Infrastructure for Clinical Research and
Quality Reporting. 2012
21.
22. Cohort identification
• SPARQL and RDF datasets are well-suited as infrastructure for
a longitudinal patient record data warehouse
• HVI software development team partnered with Cycorp to
build a cohort identification interface called the Semantic
Research Assistant (SRA)
• Based on the Cyc inference engine
• a powerful reasoning system and knowledge base with built-in
capability for natural language (NL) processing, forward-chaining
inference, and backward-chaining inference
• incorporates Cyc's NL processing to permit a user to compose a
cohort selection query by typing an English sentence or sentence
fragment
Lenat et al. Harnessing Cyc to Answer Clinical Researchers' Ad Hoc Queries. 2010.
23.
24. RDF dataset warehouse
• CycL to SPARQL
• domain-specific medical ontologies in conjunction with the Cyc
general ontology are used to convert the NL query into a formal
representation and then into SPARQL queries.
• SPARQL queries are submitted to the SemanticDB RDF store for
execution
• Cleveland Clinic’s registry of 200,000 patient records
comprises an RDF graph of roughly 80 million RDF assertions
25. Dataset topology
• An RDF dataset with no default graph and one named graph
per patient record (a patient record graph)
• Beyond identifying the cohort, most subsequent query
processing happens within a single patient record graph
• In our vocabulary, there are instances of
PatientRecord, Operation, Patient, MedicalEvent,
HospitalEpisode, etc.
• PatientRecord resources share a URI with their containing
graph
26.
• The SPARQL GRAPH operator can be used to restrict the search space
• Optimal for the following cohort querying paradigm
• Constraints in the first part of the query are cross-graph; those in
the second part are intra-graph
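The two-phase paradigm can be sketched as follows, with a dictionary of named graphs standing in for the RDF dataset. This is an illustration of the idea only, not the SemanticDB implementation; the record URIs, the `ex:year` property, and both helper functions are hypothetical.

```python
# One named graph per patient record; the graph name is also the
# PatientRecord URI. All data and property names are hypothetical.
dataset = {
    "urn:record:1": {
        ("urn:record:1", "rdf:type", "ex:PatientRecord"),
        ("urn:op:1", "rdf:type", "ex:Operation"),
        ("urn:op:1", "ex:year", 2009),
    },
    "urn:record:2": {
        ("urn:record:2", "rdf:type", "ex:PatientRecord"),
        ("urn:ev:2", "rdf:type", "ex:MedicalEvent"),
    },
    "urn:record:3": {
        ("urn:record:3", "rdf:type", "ex:PatientRecord"),
        ("urn:op:3", "rdf:type", "ex:Operation"),
        ("urn:op:3", "ex:year", 2010),
    },
}


def cohort(dataset, wanted_type):
    """Phase 1 (cross-graph): find every record graph that contains
    a resource of the wanted type."""
    return sorted(
        name
        for name, graph in dataset.items()
        if any(p == "rdf:type" and o == wanted_type for _, p, o in graph)
    )


def values_in_record(dataset, record, predicate):
    """Phase 2 (intra-graph): look only inside one patient record graph,
    as a SPARQL GRAPH clause with a bound graph name would."""
    return sorted(o for _, p, o in dataset[record] if p == predicate)


records = cohort(dataset, "ex:Operation")
years = {r: values_in_record(dataset, r, "ex:year") for r in records}
print(years)
```

Once the cohort is fixed, each follow-up lookup touches a single record graph rather than the whole dataset, which is where the reduction in search space comes from.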
27. Use of system
• From 2009 through June of 2011
• over 200 clinical investigations utilized SemanticDB to identify
study cohorts and retrieve appropriate data for analysis
• studies ranged from relatively simple feasibility assessments to
extremely complex investigations of time-related events and
competing risks of the patient experiencing a certain outcome
after treatment
• previously, cohort identification and data export queries for studies
would have been performed by a skilled database administrator
(DBA) interpreting instructions from domain experts
• Using SemanticDB and the SRA, a non-technical domain expert
performed most of the queries