
W3C HCLS Dataset Description Guidelines



Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. This document describes a consensus among participating stakeholders in the Health Care and the Life Sciences domain on the description of datasets using the Resource Description Framework (RDF). This specification meets key functional requirements, reuses existing vocabularies to the extent that it is possible, and addresses elements of data description, versioning, provenance, discovery, exchange, query, and retrieval.



  1. Describing Scientific Datasets: The HCLS Community Profile. Michel Dumontier, Ph.D., Associate Professor of Medicine (Biomedical Informatics), Stanford University
  2. World Wide Web Consortium (W3C) • The W3C is the main international standards organization for the World Wide Web. • Its more than 400 member organizations work together to develop standards for the World Wide Web. @micheldumontier::CEDAR:Jan 2015
  3. The Semantic Web is the new global web of knowledge. It involves standards for publishing, sharing and querying facts, expert knowledge and services. It is a scalable approach to the discovery of independently formulated and distributed knowledge.
  4. Resource Description Framework • It’s a language to represent knowledge – Logic-based formalism -> automated reasoning – Graph-like properties -> data analysis • Good for – Describing things in terms of type, attributes, relations – Integrating data from different sources – Sharing the data (W3C standard) – Reusing what is available, developing what you need, and contributing back to the web of data.
  5. drugbank:DB00586 rdf:type drugbank_vocabulary:Drug; drugbank:DB00586 rdfs:label "Diclofenac [drugbank:DB00586]"; drugbank:DB00586 drugbank_vocabulary:targets drugbank:290; drugbank:290 rdf:type drugbank_vocabulary:Target; drugbank:290 rdfs:label "Prostaglandin G/H synthase 2 [drugbank_target:290]". PREFIX rdf: <> PREFIX rdfs: PREFIX drugbank: <> PREFIX drugbank_vocabulary: <>
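The graph on this slide can be read as a list of subject-predicate-object statements. The following is a minimal sketch in plain Python, assuming that tuples of prefixed names can stand in for full IRIs (the prefix IRIs are not given in the transcript, so nothing is expanded here). The helpers `objects` and `target_labels` are hypothetical names introduced for illustration and are not part of any RDF library.

```python
# Each statement is a (subject, predicate, object) tuple of prefixed names or label strings.
TRIPLES = [
    ("drugbank:DB00586", "rdf:type", "drugbank_vocabulary:Drug"),
    ("drugbank:DB00586", "rdfs:label", "Diclofenac [drugbank:DB00586]"),
    ("drugbank:DB00586", "drugbank_vocabulary:targets", "drugbank:290"),
    ("drugbank:290", "rdf:type", "drugbank_vocabulary:Target"),
    ("drugbank:290", "rdfs:label", "Prostaglandin G/H synthase 2 [drugbank_target:290]"),
]


def objects(triples, subject, predicate):
    """Return every object asserted for the given subject and predicate (hypothetical helper)."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def target_labels(triples, drug):
    """Follow drugbank_vocabulary:targets from a drug and collect the labels of its targets."""
    labels = []
    for target in objects(triples, drug, "drugbank_vocabulary:targets"):
        labels.extend(objects(triples, target, "rdfs:label"))
    return labels


if __name__ == "__main__":
    print(target_labels(TRIPLES, "drugbank:DB00586"))
```

A real application would more likely load such statements with an RDF library and query them with SPARQL; the tuples here only make the graph structure explicit.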
  6. The linked data network expands with every reference. DrugBank: drugbank:DB00586 rdf:type drugbank_vocabulary:Drug; drugbank:DB00586 rdfs:label "diclofenac [drugbank:DB00586]". PharmGKB: pharmgkb:PA449293 rdf:type pharmgkb_vocabulary:Drug; pharmgkb:PA449293 rdfs:label "diclofenac [pharmgkb:PA449293]"; pharmgkb:PA449293 pharmgkb_vocabulary:x-drugbank drugbank:DB00586.
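To illustrate how a single cross-reference connects two datasets, here is a small sketch in plain Python. It assumes each dataset is a set of (subject, predicate, object) tuples using the prefixed names from the slide, with the reading that the PharmGKB record points to the DrugBank record via `pharmgkb_vocabulary:x-drugbank`. The function `linked_statements` is a hypothetical helper, not an API of any existing tool.

```python
# Two independently published datasets, each a set of (subject, predicate, object) tuples.
DRUGBANK = {
    ("drugbank:DB00586", "rdf:type", "drugbank_vocabulary:Drug"),
    ("drugbank:DB00586", "rdfs:label", "diclofenac [drugbank:DB00586]"),
}
PHARMGKB = {
    ("pharmgkb:PA449293", "rdf:type", "pharmgkb_vocabulary:Drug"),
    ("pharmgkb:PA449293", "rdfs:label", "diclofenac [pharmgkb:PA449293]"),
    ("pharmgkb:PA449293", "pharmgkb_vocabulary:x-drugbank", "drugbank:DB00586"),
}


def linked_statements(graph, subject, link):
    """Return, sorted, all statements about resources that `subject` points to via `link`."""
    targets = {o for s, p, o in graph if s == subject and p == link}
    return sorted(t for t in graph if t[0] in targets)


# Merging the graphs is a plain set union because both use the same identifier for the drug.
MERGED = DRUGBANK | PHARMGKB

if __name__ == "__main__":
    for statement in linked_statements(MERGED, "pharmgkb:PA449293", "pharmgkb_vocabulary:x-drugbank"):
        print(statement)
```

On PharmGKB alone the link leads nowhere; once the DrugBank statements are merged in, the same lookup returns the DrugBank type and label.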
  7. We are building a massive network of linked open data. “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.”
  8. Linked Data for the Life Sciences • Free and open source • Leverages Semantic Web standards • 10B+ interlinked statements from 30+ conventional and high value datasets • Partnerships with EBI, SIB, NCBI, DBCLS, NCBO, OpenPHACTS, and many others • chemicals/drugs/formulations; genomes/genes/proteins, domains; interactions, complexes & pathways; animal models and phenotypes; disease, genetic markers, treatments; terminologies & publications. Alison Callahan, Jose Cruz-Toledo, Peter Ansell, Michel Dumontier: Bio2RDF Release 2: Improved Coverage, Interoperability and Provenance of Life Science Linked Data. ESWC 2013: 200-212
  9. Semantic Web for Health Care and Life Sciences Interest Group (HCLS) • Mission: to develop, advocate for, and support the use of Semantic Web technologies across health care, life sciences, clinical research and translational medicine. • Since 2001. 86 members from 29 organizations. • Chairs: Michel Dumontier and Charlie Mead • Objectives: – Develop high level and architectural vocabularies. – Implement proof-of-concept demonstrations and industry-ready code. – Document guidelines to accelerate the adoption of the technology. – Disseminate information about the group's work at government, industry, academic events and by participating in community initiatives.
  10. Challenge: Working with Web Data • Datasets often have inadequate descriptions, so we don’t know what they are about or how they were constructed. • Datasets change over time, but often don’t come with versioning information. • Datasets may have been constructed using other data, but it’s not clear which version of that data was used or whether it was modified. • Data may be available in a variety of formats. • There may be multiple copies of data from different providers, but it’s unclear if they are exact copies or derivatives.
  11. Data registries aren’t in sync –,,, etc. – May cover only some data elements, i.e. be incomplete – May be out-of-date, with no easy way to exchange data descriptions – May contain conflicting information, with unclear sources.
  12. No single vocabulary provides all key metadata fields
  13. Key Use Cases 1. Dataset Identification, Description, Licensing and Provenance 2. Dataset Discovery (via Catalog) 3. Exchange of Dataset Descriptions 4. Dataset Linking 5. Content Summary 6. Monitoring of Dataset Changes
  14. Objective • Develop a guidance note for reusing existing vocabularies to describe datasets with RDF – Mandatory, recommended, optional descriptors – Identifiers – Versioning – Attribution – Provenance – Content summarization • Recommend vocabulary-linked attributes and value sets • Provide reference editor and validation
  15. Dublin Core Metadata Initiative • Widely used • Broadly applicable – Documents – Datasets ✗ Generic terms ✗ Not comprehensive ✗ No required properties. “Date: A point or period of time associated with an event in the lifecycle of the resource.”
  16. DCAT: Data Catalog ✓ Separates Dataset and Distribution ✗ No versioning ✗ No prescribed properties
  17. VoID: Vocabulary of Interlinked Datasets • Metadata carried with data – Directly embedded: void:inDataset ✗ No versioning ✗ No checklist of requisite fields ✗ Only for RDF data
  18. We compiled a list of metadata fields used across the community, and then surveyed over 20 vocabularies to see if they provided relevant metadata elements or value sets, to produce a big spreadsheet that maps metadata needs to existing vocabularies.
  21. Dataset: “A collection of data, available for access or download in one or more formats” – DCAT
  22. Included Vocabularies
  23. Three Component Metadata Model: description - version - distribution
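One way to picture the three components is as nested records: a description that groups versions, each of which has one or more distributions. The sketch below is illustrative only. The class names, the `ex:` identifiers, and the choice of linking properties (`dct:isVersionOf`, `pav:version`, `dcat:distribution`, `dct:format`) are assumptions made for this example; the Note itself is the authority on which properties apply at each level.

```python
from dataclasses import dataclass, field


@dataclass
class Distribution:
    """A concrete file or endpoint for one version (hypothetical class)."""
    iri: str
    media_format: str


@dataclass
class Version:
    """A specific release of the dataset (hypothetical class)."""
    iri: str
    version: str
    distributions: list = field(default_factory=list)


@dataclass
class Description:
    """Version-independent description of the dataset (hypothetical class)."""
    iri: str
    title: str
    versions: list = field(default_factory=list)


def to_triples(desc):
    """Flatten the nested records into (subject, predicate, object) tuples."""
    triples = [(desc.iri, "dct:title", desc.title)]
    for v in desc.versions:
        triples.append((v.iri, "dct:isVersionOf", desc.iri))
        triples.append((v.iri, "pav:version", v.version))
        for d in v.distributions:
            triples.append((v.iri, "dcat:distribution", d.iri))
            triples.append((d.iri, "dct:format", d.media_format))
    return triples


# Made-up example data; the ex: names do not refer to a real dataset.
EXAMPLE = Description(
    "ex:dataset",
    "Example dataset",
    [Version("ex:dataset-v1", "1.0", [Distribution("ex:dataset-v1.ttl", "text/turtle")])],
)

if __name__ == "__main__":
    for triple in to_triples(EXAMPLE):
        print(triple)
```

Keeping the levels separate means a new release adds a version and its distributions without touching the version-independent description.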
  24. Example of Use
  25. 61 metadata elements
  26. Metadata element, description, and example of use
  27. Metadata Specification: constrained property:value pairs. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].
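A specification written with RFC 2119 key words lends itself to a simple conformance check: a missing MUST property is an error, a missing SHOULD property is a warning, and MAY properties are never reported. The sketch below assumes a hypothetical profile; the property-to-level assignments in `PROFILE` are illustrative and are not copied from the Note, and `check` is not the actual HCLS validator.

```python
# Illustrative requirement levels only; consult the Note for the real assignments.
PROFILE = {
    "dct:title": "MUST",
    "dct:description": "MUST",
    "dct:license": "SHOULD",
    "dcat:keyword": "MAY",
}


def check(description, profile=PROFILE):
    """Return (errors, warnings) for a description given as {property: [values]}.

    errors   -> MUST properties that are absent or have no values
    warnings -> SHOULD properties that are absent or have no values
    """
    missing = [p for p in profile if not description.get(p)]
    errors = sorted(p for p in missing if profile[p] == "MUST")
    warnings = sorted(p for p in missing if profile[p] == "SHOULD")
    return errors, warnings


if __name__ == "__main__":
    errors, warnings = check({"dct:title": ["Example dataset"]})
    print("errors:", errors)
    print("warnings:", warnings)
```

The sketch does not cover "MUST NOT" or "SHOULD NOT" constraints, which would need a second pass over the properties that are present.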
  28. Description • Identifiers • Title • Description • Homepage • License • Language • Keywords • Concepts and vocabularies used • Standards • Publication
  29. Attribution • Simple Model – Individuals are related to roles using specific properties, e.g. dct:creator, pav:createdBy, pav:curatedBy • Expandable Model – Individuals are related to roles and dates by an associated object – PROV, ViVo
  30. Provenance and Change • Version number • Source • Provenance: retrieved from, derived from, created with • Frequency of change
  31. Availability • Format • Download URL • Landing page • SPARQL endpoint
  32. RDF Dataset Statistics. Basic Statistics • # of triples • # of typed entities • # of distinct subjects • # of distinct predicates • # of distinct objects • # of classes • # of literals. Enhanced Statistics • Classes + # • Properties + triples • Subject Types + # Property + triples • Object Types + # Property + triples • Literals + # Property + triples • Dataset-Dataset links
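The basic statistics listed on this slide can be computed with a few set comprehensions. This is a minimal sketch in plain Python under stated assumptions: triples are (subject, predicate, object) tuples, literals are marked with a hypothetical `Lit` wrapper class, and each counter follows one plausible reading of the slide (for example, "typed entities" is taken to mean distinct subjects that have an `rdf:type`). The Note defines the exact queries to use.

```python
class Lit(str):
    """Marks a string as an RDF literal rather than a resource name (hypothetical wrapper)."""


def basic_statistics(triples):
    """Compute basic counts over a collection of (subject, predicate, object) tuples."""
    triples = set(triples)  # duplicate statements count once
    return {
        "triples": len(triples),
        "typed_entities": len({s for s, p, _ in triples if p == "rdf:type"}),
        "distinct_subjects": len({s for s, _, _ in triples}),
        "distinct_predicates": len({p for _, p, _ in triples}),
        "distinct_objects": len({o for _, _, o in triples}),
        "classes": len({o for _, p, o in triples if p == "rdf:type"}),
        "literals": len({o for _, _, o in triples if isinstance(o, Lit)}),
    }


# Sample data reusing the prefixed names shown earlier in the deck.
SAMPLE = [
    ("drugbank:DB00586", "rdf:type", "drugbank_vocabulary:Drug"),
    ("drugbank:DB00586", "rdfs:label", Lit("Diclofenac [drugbank:DB00586]")),
    ("drugbank:DB00586", "drugbank_vocabulary:targets", "drugbank:290"),
    ("drugbank:290", "rdf:type", "drugbank_vocabulary:Target"),
    ("drugbank:290", "rdfs:label", Lit("Prostaglandin G/H synthase 2 [drugbank_target:290]")),
]

if __name__ == "__main__":
    for name, value in basic_statistics(SAMPLE).items():
        print(name, value)
```

For a dataset behind a SPARQL endpoint the same counts would normally be obtained with aggregate queries rather than by loading every triple into memory.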
  33. Application scenarios
  34. VoID Editor
  35. Validator. New version using ShEx in development
  36. Towards Semantic Interoperability
  37. Website: Presentations: HCLS: Mailing list: Editors’ Draft: W3C Interest Group Note: Special thanks to Alasdair Gray, Scott Marshall, Joachim Baran. Thanks to all other contributors to the HCLS note.