Health Datapalooza IV: June 3rd-4th, 2013
Open Government Data
Moderator:
George Thomas, Enterprise Architect, Office of the Chief Information Officer (CIO), U.S. Department of Health & Human Services
Speakers:
John Erickson, Director of Web Science Operations, Tetherless World Constellation, Rensselaer Polytechnic Institute
James P. McCusker, Ph.D. Student, Dept. of Computer Science, Rensselaer Polytechnic Institute
Mark Musen, Professor, Stanford University and Principal Investigator, National Center for Biomedical Ontology
Natasha Noy, Senior Research Scientist, Stanford University and Executive Committee Member, National Center for Biomedical Ontology
Michael Pendleton, Linked Open Data Manager, US Environmental Protection Agency
The session will open with an overview of trends affecting open data sharing, including ‘broad data’ challenges that emerge when application developers have millions of open government datasets available. We will explore issues of web-scale data discovery, rapid and potentially ad hoc integration, visualization, and analysis of partially modeled datasets as well as issues arising from combining different data use policies. We will present emerging solution standards and transitioning academic technologies, including innovative work conducted by the ‘Watson’ research group at Rensselaer Polytechnic Institute on using Watson as a ‘data advisor’. Panelists will synthesize session topics including optimal steps toward an open health knowledge graph facilitating ‘data liquidity’ (as defined by the ability to easily combine and refine data from disparate publishers). Panelists will discuss enabling the implementation of effective ‘lifting schemes’ by leveraging ‘collaboration without coordination’ processes to produce efficient data access techniques that drive innovative new application development tools, products, and services.
Health Datapalooza 2013: Open Government Data - Natasha Noy
1. Healthdata.gov Metadata: Lifting Schemes and Controlled Vocabularies
Mark Musen
Natasha Noy
National Center for Biomedical Ontology
Stanford Center for Biomedical Informatics Research
Stanford University
2. National Center for Biomedical Ontology
• We create and maintain a library of biomedical ontologies and terminologies.
• We build tools and Web services to enable the use of ontologies and terminologies.
• We collaborate with scientific communities that develop and use ontologies and terminologies in biomedicine.
3. Controlled Vocabularies in Healthcare
• Healthcare Common Procedure Coding System (HCPCS)
• Current Procedural Terminology (CPT)
• International Classification of Diseases (ICD)
• RxNorm
• The National Drug File - Reference Terminology (NDF-RT)
5. National Center for Biomedical Ontology
• Provides key technology for
– Describing medical terminologies
– Accessing information about terms in medical terminologies
– Using mappings across terminologies
– Annotating content with terms from medical terminologies
7. Mappings Between Terminologies
• Available through a REST API and SPARQL endpoint
• Example: Term mappings from HCPCS to CPT
[Screenshot: a source term mapped to CPT, lexically and through UMLS CUI, together with the mapping metadata]
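As a rough illustration of reaching term mappings over REST, the sketch below only builds a request URL and makes no network call. The base URL and the `/ontologies/{acronym}/classes/{iri}/mappings` path are assumptions modeled on the public BioPortal REST service and should be checked against its current documentation; the HCPCS term IRI and the API key are placeholders.

```python
from urllib.parse import quote, urlencode

# Assumed base URL of the BioPortal REST service; verify against the API docs.
BIOPORTAL_API = "https://data.bioontology.org"


def mappings_url(ontology: str, class_iri: str, api_key: str) -> str:
    """Build a URL that asks for the mappings of one term in one terminology."""
    # The class IRI is sent as a single path segment, so every reserved
    # character (":" and "/") must be percent-encoded.
    encoded_iri = quote(class_iri, safe="")
    query = urlencode({"apikey": api_key})
    return f"{BIOPORTAL_API}/ontologies/{ontology}/classes/{encoded_iri}/mappings?{query}"


if __name__ == "__main__":
    # Hypothetical HCPCS term IRI and placeholder key, for illustration only.
    print(mappings_url("HCPCS",
                       "http://purl.bioontology.org/ontology/HCPCS/A0021",
                       "YOUR_API_KEY"))
```

Each mapping returned by such a call would then carry its own metadata (for example, whether it was produced lexically or through a shared UMLS CUI), which is what the slide's screenshot highlights.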
10. Protégé: Editing Ontologies and Terminologies
• An open-source ontology editor
• Has more than 200,000 registered users
• Works in a web browser
• Supports collaboration
• Built on Semantic Web standards
• Has dozens of plugins developed by our user community
13. The Anatomy of a Dataset
• Basic metadata about the dataset: published by HHS, describes Part B National Summary Data File, covers the period from Jan 1, 2000 to Dec 31, 2000, ...
• Metadata about the structure of the dataset: allowed services, charges, and payments for each HCPCS code and modifier
• Content of the dataset: charge of $4,966 for radiology procedure
Focus of the Metadata Challenge
14. The NCBO Response to the Healthdata.gov Metadata Challenge
• Protégé to create dataset descriptions
• Custom scripts to extract metadata
• BioPortal to provide terminologies
• SPARQL endpoint to access the resulting knowledge graph
15. NCBO Solution: Outline
• Basic metadata about the dataset
• Metadata about the structure of the dataset
• Anchoring values in controlled terminologies
http://healthdata.bioontology.org
19. Linking Healthdata Datasets to Other Metadata
Find all reports authored by the Department of Health and Human Services and its agencies annually
20. Ask it in SPARQL
Find all reports authored by the Department of Health and Human Services and its agencies annually
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX org: <http://www.w3.org/ns/org#>
SELECT DISTINCT ?title WHERE {
  ?ds dc:accrualPeriodicity ?per .
  ?per rdfs:label "Annual" .
  ?ds dc:creator ?ag .
  ?ag org:subOrganizationOf
      dbpedia:United_States_Department_of_Health_and_Human_Services .
  ?ds dc:title ?title
} GROUP BY ?title
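A query like this is typically sent to an endpoint using the SPARQL 1.1 Protocol. The sketch below prepares such a request and reads titles out of a SPARQL JSON results document. The endpoint address is hypothetical (the slides do not give the exact SPARQL URL), and the sample response is invented purely to show the result shape.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint address; the slides do not state the real one.
ENDPOINT = "http://example.org/sparql"

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX org: <http://www.w3.org/ns/org#>
SELECT DISTINCT ?title WHERE {
  ?ds dc:accrualPeriodicity ?per .
  ?per rdfs:label "Annual" .
  ?ds dc:creator ?ag .
  ?ag org:subOrganizationOf
      dbpedia:United_States_Department_of_Health_and_Human_Services .
  ?ds dc:title ?title
}
"""


def build_request(endpoint: str, query: str) -> Request:
    """Prepare a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def titles(results_json: str) -> list:
    """Pull the ?title bindings out of a SPARQL JSON results document."""
    doc = json.loads(results_json)
    return [b["title"]["value"] for b in doc["results"]["bindings"] if "title" in b]


if __name__ == "__main__":
    request = build_request(ENDPOINT, QUERY)
    print(request.get_header("Accept"))
    # Invented sample response, not real data from the endpoint.
    sample = ('{"head": {"vars": ["title"]}, "results": {"bindings": '
              '[{"title": {"type": "literal", "value": "Example annual report"}}]}}')
    print(titles(sample))
```

Passing the prepared request to `urllib.request.urlopen` would execute it against a live endpoint; that step is left out here because the endpoint is a placeholder.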
22. NCBO Solution: Outline
• Basic metadata about the dataset
• Metadata about the structure of the dataset
• Anchoring values in controlled terminologies
http://healthdata.bioontology.org
23. There Is No Free Lunch
• Modeling metadata enables us to understand
how the content is related
• Necessary if we want to integrate data
24. Dataset Descriptions Are Buried in Text Files
• Part B National Summary Data File
• Map HCPCS code and HCPCS modifier to the amount charged for allowed services, allowed charges, and payment
• RDF Data Cube Vocabulary
25. Modeling Dimensions and Measures of the Data
Part B National Summary data (rdf:type: qb:DataSet, dcat:Dataset)
  qb:structure -> DSD (rdf:type: qb:DataStructureDefinition)
    qb:component -> HCPCS/CPT Payments (rdf:type: qb:ComponentSpecification)
      qb:dimension -> HCPCS or CPT Code (rdf:type: qb:CodedProperty, qb:DimensionProperty)
        qb:codeList -> HCPCS codes (rdf:type: skos:ConceptScheme)
        qb:codeList -> CPT codes (rdf:type: skos:ConceptScheme)
      qb:dimension -> HCPCS or CPT modifier (rdf:type: qb:CodedProperty, qb:DimensionProperty)
        qb:codeList -> HCPCS modifiers (rdf:type: skos:ConceptScheme)
        qb:codeList -> CPT modifiers (rdf:type: skos:ConceptScheme)
      qb:measure -> Number of services (rdf:type: qb:MeasureProperty)
      qb:measure -> Allowed charges (rdf:type: qb:MeasureProperty)
      qb:measure -> Payment for services (rdf:type: qb:MeasureProperty)
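To make the shape of this model concrete, here is a minimal sketch that holds the diagram as plain subject-predicate-object triples and walks it from the dataset down to its dimensions, measures, and code lists. All `ex:` identifiers are hypothetical names invented for the example; only the `qb:` and `rdf:` terms come from the slide. It mirrors the slide's layout, with one component specification carrying all dimensions and measures.

```python
# Each triple is (subject, predicate, object). Names under "ex:" are hypothetical.
TRIPLES = [
    ("ex:partB", "rdf:type", "qb:DataSet"),
    ("ex:partB", "qb:structure", "ex:dsd"),
    ("ex:dsd", "rdf:type", "qb:DataStructureDefinition"),
    ("ex:dsd", "qb:component", "ex:payments"),
    ("ex:payments", "qb:dimension", "ex:code"),
    ("ex:payments", "qb:dimension", "ex:modifier"),
    ("ex:payments", "qb:measure", "ex:services"),
    ("ex:payments", "qb:measure", "ex:allowedCharges"),
    ("ex:payments", "qb:measure", "ex:payment"),
    ("ex:code", "qb:codeList", "ex:hcpcsCodes"),
    ("ex:code", "qb:codeList", "ex:cptCodes"),
    ("ex:modifier", "qb:codeList", "ex:hcpcsModifiers"),
    ("ex:modifier", "qb:codeList", "ex:cptModifiers"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return all objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def describe(dataset, triples=TRIPLES):
    """Follow dataset -> structure -> component and collect its parts."""
    summary = {"dimensions": {}, "measures": []}
    for dsd in objects(dataset, "qb:structure", triples):
        for component in objects(dsd, "qb:component", triples):
            for dim in objects(component, "qb:dimension", triples):
                # Each dimension is anchored in one or more code lists.
                summary["dimensions"][dim] = objects(dim, "qb:codeList", triples)
            summary["measures"].extend(objects(component, "qb:measure", triples))
    return summary


if __name__ == "__main__":
    print(describe("ex:partB"))
```

In practice the same structure would be stored in a triple store and traversed with SPARQL; the list-of-tuples form is used here only to keep the example free of dependencies.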
29. NCBO Solution: Outline
• Basic metadata about the dataset
• Metadata about the structure of the dataset
• Anchoring values in controlled terminologies
http://healthdata.bioontology.org
30. Controlled Vocabularies in Healthcare
• Healthcare Common Procedure Coding System (HCPCS)
• Current Procedural Terminology (CPT)
• International Classification of Diseases (ICD)
• RxNorm
• The National Drug File - Reference Terminology (NDF-RT)
31. NCBO BioPortal
• Uniform access to 330 public ontologies
and terminologies in biomedicine
• Web interface
• REST API
• Search across all ontologies
• Resolvable URIs for each term
• Mappings between terms in different
ontologies
33. BioPortal as a Terminology Service
• Resolvable Uniform Resource Identifiers (URIs)
• Mappings to other terminologies
– Including metadata
– Including multiple mappings from a variety of sources
• Regular terminology updates from primary
sources
• Ability to define and upload new value sets
– We defined several value sets for the Metadata
Challenge
– Linked them to other terminologies in BioPortal
35. NCBO Solution: Outline
• Basic metadata about the dataset
• Metadata about the structure of the dataset
• Anchoring in controlled terminologies
http://healthdata.bioontology.org
36. Recipe for Linking Government Data
• Describe the information about each dataset
– Specify provenance
– Use standard representation schemes
– Use consensus vocabularies
• Describe the metadata about the content of the dataset
– Involve domain experts who understand the structure of the data
– Use consensus vocabularies (e.g., the W3C RDF Data Cube Vocabulary)
• Anchor the values in controlled vocabularies and datasets
– Use existing terminologies
– Define and publish value sets if none exist
– Map the value sets to standard terminologies
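The last step of the recipe, anchoring raw values in a controlled vocabulary, can be sketched as a simple lookup that turns codes found in a dataset into term URIs and reports the ones that have no match. The URI prefix and the set of known codes below are hypothetical; in practice the known terms would come from a terminology service such as BioPortal.

```python
# Hypothetical URI prefix for a published value set; not a real namespace.
HCPCS_BASE = "http://example.org/valueset/HCPCS/"


def anchor_values(raw_codes, known_codes, base=HCPCS_BASE):
    """Map raw code strings to term URIs.

    Returns (anchored, unmapped): a dict of code -> URI for codes found in
    known_codes, and a list of normalized codes that could not be anchored.
    """
    anchored, unmapped = {}, []
    for raw in raw_codes:
        code = raw.strip().upper()  # normalize whitespace and letter case
        if code in known_codes:
            anchored[code] = base + code
        elif code not in unmapped:
            unmapped.append(code)
    return anchored, unmapped


if __name__ == "__main__":
    # Invented codes used only to exercise the function.
    found, missing = anchor_values([" a0021", "99213", "ZZZZZ"], {"A0021", "99213"})
    print(found)
    print(missing)
```

Codes that end up in the unmapped list are candidates for the "define and publish value sets if none exist" step.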
37. What We Have Learned from the Challenge
• Our entry demonstrated the feasibility of modeling metadata and linking it to standard vocabularies
– Uses a top-down approach
– Enables “deep” integration
– Extracts knowledge from data
• We need tools and scripts that are specific to the task of dataset description