Health Datapalooza 2013: Open Government Data - Natasha Noy
Health Datapalooza IV: June 3rd-4th, 2013
Open Government Data
Moderator:
George Thomas, Enterprise Architect, Office of the Chief Information Officer (CIO), U.S. Department of Health & Human Services
Speakers:
John Erickson, Director of Web Science Operations, Tetherless World Constellation, Rensselaer Polytechnic Institute
James P. McCusker, Ph.D. Student, Dept. of Computer Science, Rensselaer Polytechnic Institute
Mark Musen, Professor, Stanford University and Principal Investigator, National Center for Biomedical Ontology
Natasha Noy, Senior Research Scientist, Stanford University and Executive Committee Member, National Center for Biomedical Ontology
Michael Pendleton, Linked Open Data Manager, US Environmental Protection Agency

The session will open with an overview of trends affecting open data sharing, including ‘broad data’ challenges that emerge when application developers have millions of open government datasets available. We will explore issues of web-scale data discovery, rapid and potentially ad hoc integration, visualization, and analysis of partially modeled datasets as well as issues arising from combining different data use policies. We will present emerging solution standards and transitioning academic technologies, including innovative work conducted by the ‘Watson’ research group at Rensselaer Polytechnic Institute on using Watson as a ‘data advisor’. Panelists will synthesize session topics including optimal steps toward an open health knowledge graph facilitating ‘data liquidity’ (as defined by the ability to easily combine and refine data from disparate publishers). Panelists will discuss enabling the implementation of effective ‘lifting schemes’ by leveraging ‘collaboration without coordination’ processes to produce efficient data access techniques that drive innovative new application development tools, products, and services.


    1. 1. Healthdata.gov Metadata: Lifting Schemes and Controlled Vocabularies
        Mark Musen, Natasha Noy
        National Center for Biomedical Ontology
        Stanford Center for Biomedical Informatics Research
        Stanford University
    2. 2. National Center for Biomedical Ontology
        • We create and maintain a library of biomedical ontologies and terminologies.
        • We build tools and Web services to enable the use of ontologies and terminologies.
        • We collaborate with scientific communities that develop and use ontologies and terminologies in biomedicine.
    3. 3. Controlled Vocabularies in Healthcare
        • Healthcare Common Procedure Coding System (HCPCS)
        • Current Procedural Terminology (CPT)
        • International Classification of Diseases (ICD)
        • RxNorm
        • The National Drug File - Reference Terminology (NDF-RT)
    4. 4. National Center for Biomedical Ontology
        • Provides key technology for
            – Describing medical terminologies
            – Accessing information about terms in medical terminologies
            – Using mappings across terminologies
            – Annotating content with terms from medical terminologies
    5. 5. BioPortal stores Big Data about Big Data
    6. 6. Mappings Between Terminologies
        • Available through a REST API and SPARQL endpoint
        • Example: Term mappings from HCPCS to CPT
        Screenshot callouts: source term; mapped to CPT (lexically and through UMLS CUI); mapping metadata
    7. 7. Protégé: Editing Ontologies and Terminologies
        • An open-source ontology editor
        • Has more than 200,000 registered users
        • Works in a web browser
        • Supports collaboration
        • Built on Semantic Web standards
        • Has dozens of plugins developed by our user community
    8. 8. Health Data Platform Metadata Challenge
    9. 9. Each Dataset Is (Largely) a Silo
    10. 10. The Anatomy of a Dataset
        • Basic metadata about the dataset: published by HHS, describes Part B National Summary Data File, covers the period from Jan 1, 2000 to Dec 31, 2000, ...
        • Metadata about the structure of the dataset: allowed services, charges, and payments for each HCPCS code and modifier
        • Content of the dataset: charge of $4,966 for radiology procedure
        Callout: Focus of the Metadata Challenge
    11. 11. The NCBO Response to the Healthdata.gov Metadata Challenge
        • Protégé to create dataset descriptions
        • Custom scripts to extract metadata
        • BioPortal to provide terminologies
        • SPARQL endpoint to access the resulting knowledge graph
    12. 12. NCBO Solution: Outline
        • Basic metadata about the dataset
        • Metadata about the structure of the dataset
        • Anchoring values in controlled terminologies
        http://healthdata.bioontology.org
    13. 13. Bring Healthdata.gov Datasets into the Linked Open Data Cloud
    14. 14. Step 1: Basic Metadata About the Dataset
    15. 15. Links to Additional Metadata and Vocabularies
    16. 16. Linking Healthdata Datasets to Other Metadata
        Find all reports authored by the Department of Health and Human Services and its agencies annually
    17. 17. Ask it in SPARQL
        Find all reports authored by the Department of Health and Human Services and its agencies annually

            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            PREFIX dc: <http://purl.org/dc/terms/>
            PREFIX dbpedia: <http://dbpedia.org/resource/>
            PREFIX org: <http://www.w3.org/ns/org#>

            SELECT DISTINCT ?title WHERE {
              ?ds dc:accrualPeriodicity ?per .
              ?per rdfs:label "Annual" .
              ?ds dc:creator ?ag .
              ?ag org:subOrganizationOf dbpedia:United_States_Department_of_Health_and_Human_Services .
              ?ds dc:title ?title
            } GROUP BY ?title
    18. 18. (Repeats the SPARQL query from slide 17 without the question text.)
    19. 19. NCBO Solution: Outline
        • Basic metadata about the dataset
        • Metadata about the structure of the dataset
        • Anchoring values in controlled terminologies
        http://healthdata.bioontology.org
    20. 20. There Is No Free Lunch
        • Modeling metadata enables us to understand how the content is related
        • Necessary if we want to integrate data
    21. 21. Dataset Descriptions Are Buried in Text Files
        • Part B National Summary Data File
        • Map HCPCS code and HCPCS modifier to the amount charged for allowed services, allowed charges, and payment
        • RDF Data Cube Vocabulary
    22. 22. Modeling Dimensions and Measures of the Data
        • PartB National Summary data (rdf:type qb:DataSet, dcat:Dataset)
            – qb:structure → DSD (rdf:type qb:DataStructureDefinition)
            – qb:component → HCPCS/CPT Payments (rdf:type qb:ComponentSpecification)
        • Dimensions (qb:dimension; rdf:type qb:CodedProperty, qb:DimensionProperty)
            – HCPCS or CPT Code
            – HCPCS or CPT modifier
        • Measures (qb:measure; rdf:type qb:MeasureProperty)
            – Number of services
            – Allowed charges
            – Payment for services
        • Code lists (qb:codeList; rdf:type skos:ConceptScheme)
            – HCPCS codes
            – HCPCS modifiers
            – CPT codes
            – CPT modifiers
    23. 23. (Same diagram as slide 22, with two callouts.)
        • Structured dataset definition: ~20 triples
        • Dataset content: ~10,000 to 1,000,000 rows
    24. 24. Find datasets that map HCPCS and CPT codes to number of services for each code

            PREFIX dcat: <http://www.w3.org/ns/dcat#>
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            PREFIX qb: <http://purl.org/linked-data/cube#>
            PREFIX bp: <http://purl.bioontology.org/ontology/>
            PREFIX dsd: <http://purl.bioontology.org/healthdata/dsd/>

            SELECT DISTINCT ?title WHERE {
              ?dataset a dcat:Dataset .
              ?dataset qb:structure ?dsd .
              ?dsd qb:component ?cmp .
              ?cmp qb:dimension ?dim .
              ?dataset rdfs:label ?title .
              ?dim qb:codeList bp:CPT .
              ?cmp qb:measure dsd:number-of-services .
            }
    25. 25. (Repeats the SPARQL query from slide 24 without the question text.)
    26. 26. NCBO Solution: Outline
        • Basic metadata about the dataset
        • Metadata about the structure of the dataset
        • Anchoring values in controlled terminologies
        http://healthdata.bioontology.org
    27. 27. Controlled Vocabularies in Healthcare
        • Healthcare Common Procedure Coding System (HCPCS)
        • Current Procedural Terminology (CPT)
        • International Classification of Diseases (ICD)
        • RxNorm
        • The National Drug File - Reference Terminology (NDF-RT)
    28. 28. NCBO BioPortal
        • Uniform access to 330 public ontologies and terminologies in biomedicine
        • Web interface
        • REST API
        • Search across all ontologies
        • Resolvable URIs for each term
        • Mappings between terms in different ontologies
    29. 29. From Codes to Ontology Terms
    30. 30. BioPortal as a Terminology Service
        • Resolvable Uniform Resource Identifiers (URIs)
        • Mappings to other terminologies
            – Including metadata
            – Including multiple mappings from a variety of sources
        • Regular terminology updates from primary sources
        • Ability to define and upload new value sets
            – We defined several value sets for the Metadata Challenge
            – Linked them to other terminologies in BioPortal
    31. 31. http://purl.bioontology.org/ontology/CPT/70010
    32. 32. NCBO Solution: Outline
        • Basic metadata about the dataset
        • Metadata about the structure of the dataset
        • Anchoring in controlled terminologies
        http://healthdata.bioontology.org
    33. 33. Recipe for Linking Government Data
        • Describe the information about each dataset
            – Specify provenance
            – Use standard representation schemes
            – Use consensus vocabularies
        • Describe the metadata about the content of the dataset
            – Involve domain experts who understand the structure of the data
            – Use consensus vocabularies (e.g., W3C RDF Cube)
        • Anchor the values in controlled vocabularies and datasets
            – Use existing terminologies
            – Define and publish value sets if none exist
            – Map the value sets to standard terminologies
    34. 34. What We Have Learned from the Challenge
        • Our entry demonstrated feasibility of modeling metadata and linking them to standard vocabularies
            – Uses top-down approach
            – Enables “deep” integration
            – Extracts knowledge from data
        • We need tools and scripts that are specific to the task of dataset description
    35. 35. NCBO Solution: Outline
        • Basic metadata about the dataset
        • Metadata about the structure of the dataset
        • Anchoring in controlled terminologies
        http://healthdata.bioontology.org
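
The structure model on slide 22 and the query on slide 24 can be illustrated with a small in-memory sketch. This is not the NCBO implementation: it uses plain Python tuples instead of an RDF store, the `http://example.org/healthdata/` identifiers and the dataset label are hypothetical, and `term_uri` only assumes that term URIs follow the pattern shown on slide 31. The namespace URIs are copied from the PREFIX lines on slide 24.

```python
# Namespaces copied from the PREFIX lines on slide 24 (plus the standard RDF namespace).
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DCAT = "http://www.w3.org/ns/dcat#"
QB = "http://purl.org/linked-data/cube#"
BP = "http://purl.bioontology.org/ontology/"
DSD = "http://purl.bioontology.org/healthdata/dsd/"


def term_uri(terminology, code):
    """Build a term URI following the pattern shown on slide 31."""
    return f"{BP}{terminology}/{code}"


# Hypothetical identifiers, invented for this illustration only.
EX = "http://example.org/healthdata/"
dataset = EX + "partb-national-summary"
structure = EX + "partb-dsd"
component = EX + "hcpcs-cpt-payments"
dimension = EX + "hcpcs-or-cpt-code"

# A handful of (subject, predicate, object) triples mirroring slide 22.
TRIPLES = {
    (dataset, RDF + "type", DCAT + "Dataset"),
    (dataset, RDFS + "label", "Part B National Summary Data File"),
    (dataset, QB + "structure", structure),
    (structure, QB + "component", component),
    (component, QB + "dimension", dimension),
    (component, QB + "measure", DSD + "number-of-services"),
    (dimension, QB + "codeList", BP + "CPT"),
}


def objects(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is present."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def datasets_with(triples, code_list, measure):
    """Mimic the slide 24 query: labels of datasets whose component has
    the given measure and a dimension that uses the given code list."""
    titles = set()
    for s, p, o in triples:
        if p != RDF + "type" or o != DCAT + "Dataset":
            continue
        for dsd_node in objects(triples, s, QB + "structure"):
            for comp in objects(triples, dsd_node, QB + "component"):
                has_measure = measure in objects(triples, comp, QB + "measure")
                has_code_list = any(
                    code_list in objects(triples, dim, QB + "codeList")
                    for dim in objects(triples, comp, QB + "dimension")
                )
                if has_measure and has_code_list:
                    titles |= objects(triples, s, RDFS + "label")
    return sorted(titles)
```

Calling `datasets_with(TRIPLES, BP + "CPT", DSD + "number-of-services")` walks the same dataset, structure, component, dimension and code-list path that the SPARQL graph pattern on slide 24 describes.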

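The slides mention a SPARQL endpoint but do not show how a client would call it. As a minimal sketch, the standard SPARQL Protocol lets a client send a query as the `query` parameter of an HTTP GET request. The endpoint address below is a placeholder assumption (the slides give only the site http://healthdata.bioontology.org, not an endpoint path), and the query is the one from slide 17 with the property written as `org:subOrganizationOf`. No request is sent here; the code only builds the URL.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder endpoint: an assumption for illustration, not a documented address.
ENDPOINT = "http://example.org/sparql"

# Query from slide 17 (annual reports by HHS and its agencies).
QUERY = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX org: <http://www.w3.org/ns/org#>

SELECT DISTINCT ?title WHERE {
  ?ds dc:accrualPeriodicity ?per .
  ?per rdfs:label "Annual" .
  ?ds dc:creator ?ag .
  ?ag org:subOrganizationOf dbpedia:United_States_Department_of_Health_and_Human_Services .
  ?ds dc:title ?title
} GROUP BY ?title
"""


def build_request_url(endpoint, query):
    """Encode the query as the 'query' parameter of a GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


def extract_query(url):
    """Decode the 'query' parameter back out of a request URL."""
    return parse_qs(urlsplit(url).query)["query"][0]


url = build_request_url(ENDPOINT, QUERY)
```

The resulting `url` could be passed to any HTTP client; `extract_query(url)` is included only to show that the encoding round-trips the query text unchanged.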