
Dataset description using the W3C HCLS standard

This talk was presented at the BioCaddie (http://biocaddie.org/) workshop at the Force15 conference (https://www.force11.org/meetings/force2015) on changing the future of scholarly communication. The goal was to increase awareness of why a Semantic Web-compliant standard is needed for describing data, where current standards fall short, and how this emerging standard, which extends prior efforts, can aid data discovery and integration. This work is being led by Michel Dumontier, Alasdair Gray, Joachim Baran, and M. Scott Marshall; participants and end-user testers are welcome, see: http://tiny.cc/hcls-datadesc-ed

Published in: Technology

  1. Describing Datasets with the W3C HCLS standard
     Melissa Haendel, Michel Dumontier
  2. World Wide Web Consortium (W3C)
     • The W3C is the main international standards organization for the World Wide Web
     • The W3C is made up of over 400 member organizations for the purpose of working together in the development of standards for the World Wide Web
     • W3C has sophisticated development and community validation procedures for standards development
  3. The Semantic Web is the new global web of knowledge
     • It involves standards for publishing, sharing, and querying facts, expert knowledge and services
     • It is a scalable approach to the discovery of independently formulated and distributed knowledge
     • Cyganiak and Jentzsch. http://lod-cloud.net/
  4. Resource Description Framework
     • Language to represent knowledge
     • Logic-based formalism -> automated reasoning
     • Graph-like properties -> data analysis
     • Good for:
       – Describing in terms of type, attributes, relations
       – Integrating data from different sources
       – Sharing the data (W3C standard)
       – Reusing what is available, developing what you need, and contributing back to the web of data
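The triple model behind the RDF slide can be illustrated with a minimal, library-free sketch. It only assumes that a statement is a (subject, predicate, object) triple and that integrating two sources amounts to taking the union of their triples. The `example.org` IRIs and the helper names are hypothetical placeholders, not part of the talk.

```python
# Minimal sketch of RDF's triple model, using only the standard library.
# The example.org IRIs and the helper names are hypothetical.

DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCTYPES_DATASET = "http://purl.org/dc/dcmitype/Dataset"
DEMO = "http://example.org/dataset/demo"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples expects."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal (only backslash and quote are escaped here)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# Two independently produced descriptions of the same dataset.
source_a = {
    (iri(DEMO), iri(RDF_TYPE), iri(DCTYPES_DATASET)),
    (iri(DEMO), iri(DCT + "title"), literal("Demo dataset")),
}
source_b = {
    (iri(DEMO), iri(DCT + "title"), literal("Demo dataset")),
    (iri(DEMO), iri(DCT + "publisher"), iri("http://example.org/org/lab")),
}

# Integration is a set union: the shared title statement is kept once.
merged = source_a | source_b


def to_ntriples(triples):
    """Serialize triples one per line in N-Triples style."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in sorted(triples))


print(to_ntriples(merged))
```

Because both sources use the same IRI for the dataset, their statements line up without any schema mapping, which is the integration property the slide points to.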
  5. Challenge: Working with Web Data
     • Datasets often have inadequate descriptions, so we don’t know what they are about or how they were constructed
     • Datasets change over time, but often don’t come with versioning information
     • Datasets may have been constructed using other data, but it’s not clear which version of that data was used or whether it was modified
     • Data may be available in a variety of formats
     • There may be multiple copies of data from different providers, but it’s unclear if they are exact copies or derivatives
     • The version of the standard or vocabulary used is often not indicated
     • Data registries are not synchronized and can contain conflicting information
  6. Key Use Cases for HCLS Dataset description
     1. Dataset Identification, Description, Licensing and Provenance
     2. Dataset Discovery (via Catalog)
     3. Exchange of Dataset Descriptions
     4. Dataset Linking
     5. Content Summary
     6. Monitoring of Dataset Changes
  7. Objectives
     • Develop a guidance note for reusing existing vocabularies to describe datasets with RDF
       – Mandatory, recommended, optional descriptors
       – Identifiers
       – Versioning
       – Attribution
       – Provenance
       – Content summarization
     • Recommend vocabulary-linked attributes and value sets
     • Provide reference editor and validation
  8. We compiled a list of metadata fields used across the community and then surveyed over 20 vocabularies to see if they provided relevant metadata elements or value sets, producing a big spreadsheet that maps metadata needs to existing vocabularies
  9. Dublin Core Metadata Initiative
     • Widely used
     • Broadly applicable
       – Documents
       – Datasets
     ✗ Generic terms
     ✗ Not comprehensive
     ✗ No required properties
     “Date: A point or period of time associated with an event in the lifecycle of the resource.”
  10. DCAT: Data Catalog
     • Separates Dataset and Distribution
     ✗ No versioning
     ✗ No prescribed properties
  11. No single vocabulary provides all key metadata fields
  12. http://tiny.cc/hcls-datadesc
  13. Included Vocabularies
  14. Three Component Metadata Model: description – version – distribution
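One way to picture the three-component model is as three linked resources: an unversioned description, a specific version of it, and a concrete distribution of that version. The sketch below assumes the links are expressed with `dct:isVersionOf`, `pav:version` and `dcat:distribution`; these properties exist in their vocabularies, but treating them as the ones the note prescribes is an assumption to check against the note itself. The IRI layout and the function name are hypothetical.

```python
# Sketch of the description / version / distribution levels as linked triples.
# The IRI layout and the choice of linking properties are assumptions.

DCT = "http://purl.org/dc/terms/"
PAV = "http://purl.org/pav/"
DCAT = "http://www.w3.org/ns/dcat#"


def three_level_description(base, title, version, fmt):
    """Return triples linking the summary, version and distribution levels."""
    summary = base                           # unversioned description
    versioned = f"{base}/{version}"          # one specific release
    distribution = f"{versioned}/{fmt}"      # one concrete file or service
    return [
        (summary, DCT + "title", title),
        (versioned, DCT + "isVersionOf", summary),
        (versioned, PAV + "version", version),
        (versioned, DCAT + "distribution", distribution),
        (distribution, DCT + "format", fmt),
    ]


triples = three_level_description(
    "http://example.org/dataset/demo", "Demo dataset", "1.0", "ttl"
)
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

Keeping the levels as separate resources means a new release only adds a new version node and its distributions, while the summary-level description stays unchanged.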
  15. Description
     • Identifiers
     • Title
     • Description
     • Homepage
     • License
     • Language
     • Keywords
     • Concepts and vocabularies used
     • Standards
     • Publication
  16. Attribution
     • Simple Model
       – Individuals are related to roles using specific properties, e.g. dct:creator, pav:createdBy, pav:curatedBy
     • Expandable Model
       – Individuals are related to roles and dates via an associated object
       – PROV, VIVO-ISF
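The difference between the two attribution models can be sketched as follows. The simple model is a single triple using a role-specific property such as `pav:createdBy`. For the expandable model, the sketch assumes PROV-O's qualified attribution pattern (`prov:qualifiedAttribution`, `prov:agent`, `prov:hadRole`) and attaches the date with `dct:date`; that date property, the blank node label and the `example.org` IRIs are illustrative choices, not taken from the talk.

```python
# Sketch contrasting the simple and expandable attribution models.
# The blank node label, the dct:date choice and example.org IRIs are assumptions.

PAV = "http://purl.org/pav/"
PROV = "http://www.w3.org/ns/prov#"
DCT = "http://purl.org/dc/terms/"


def simple_attribution(dataset, person, prop=PAV + "createdBy"):
    """Simple model: the property itself carries the role."""
    return [(dataset, prop, person)]


def expandable_attribution(dataset, person, role, date, node="_:attribution1"):
    """Expandable model: an intermediate node holds agent, role and date."""
    return [
        (dataset, PROV + "qualifiedAttribution", node),
        (node, PROV + "agent", person),
        (node, PROV + "hadRole", role),
        (node, DCT + "date", date),
    ]


dataset = "http://example.org/dataset/demo"
person = "http://example.org/person/alice"

simple = simple_attribution(dataset, person)
expanded = expandable_attribution(
    dataset, person, "http://example.org/role/curator", "2015-01-12"
)
print(simple)
print(expanded)
```

The simple form is compact but can only say who filled a role; the intermediate node costs three extra triples and in return gives a place to hang dates or other details about the contribution.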
  17. Provenance and Change
     • Version number
     • Source
     • Provenance: retrieved from, derived from, created with
     • Frequency of change
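The provenance items on this slide map naturally onto existing vocabulary terms. The sketch below assumes `pav:version`, `dct:source`, `pav:retrievedFrom`, `pav:derivedFrom`, `pav:createdWith` and `dct:accrualPeriodicity`; the terms exist in PAV and Dublin Core, but the mapping from slide wording to property is an assumption, and the function name and `example.org` values are hypothetical.

```python
# Sketch mapping the slide's provenance items to vocabulary properties.
# The mapping and all example.org values are assumptions for illustration.

PAV = "http://purl.org/pav/"
DCT = "http://purl.org/dc/terms/"

PROVENANCE_PROPERTIES = {
    "source": DCT + "source",
    "retrieved from": PAV + "retrievedFrom",
    "derived from": PAV + "derivedFrom",
    "created with": PAV + "createdWith",
}


def provenance_triples(dataset, version, relations, frequency=None):
    """Build version, provenance and change-frequency triples for a dataset."""
    triples = [(dataset, PAV + "version", version)]
    for relation, target in relations:
        # A KeyError here means the relation is not one of the four above.
        triples.append((dataset, PROVENANCE_PROPERTIES[relation], target))
    if frequency is not None:
        triples.append((dataset, DCT + "accrualPeriodicity", frequency))
    return triples


prov = provenance_triples(
    "http://example.org/dataset/demo/1.0",
    "1.0",
    [
        ("derived from", "http://example.org/dataset/upstream/3.2"),
        ("created with", "http://example.org/tool/converter"),
    ],
    frequency="http://example.org/frequency/monthly",
)
for triple in prov:
    print(triple)
```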
  18. Availability
     • Format
     • Download URL
     • Landing page
     • SPARQL endpoint
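The availability fields could be captured in a small record and rendered as Turtle, as in the sketch below. It assumes `dct:format`, `dcat:downloadURL`, `dcat:landingPage` and `void:sparqlEndpoint` as the properties; whether each belongs on the distribution rather than on another level should be checked against the note. The `Distribution` class and the `example.org` URLs are hypothetical.

```python
# Sketch rendering the availability fields of one distribution as Turtle.
# The class, the property placement and the example.org URLs are assumptions.
from dataclasses import dataclass
from typing import Optional

PREFIXES = {
    "dct": "http://purl.org/dc/terms/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "void": "http://rdfs.org/ns/void#",
}


@dataclass
class Distribution:
    iri: str
    media_type: str
    download_url: Optional[str] = None
    landing_page: Optional[str] = None
    sparql_endpoint: Optional[str] = None

    def to_turtle(self) -> str:
        """Serialize the fields that are set; unset fields are omitted."""
        statements = [f'dct:format "{self.media_type}"']
        if self.download_url:
            statements.append(f"dcat:downloadURL <{self.download_url}>")
        if self.landing_page:
            statements.append(f"dcat:landingPage <{self.landing_page}>")
        if self.sparql_endpoint:
            statements.append(f"void:sparqlEndpoint <{self.sparql_endpoint}>")
        prefix_block = "\n".join(
            f"@prefix {name}: <{ns}> ." for name, ns in PREFIXES.items()
        )
        body = " ;\n    ".join(statements)
        return f"{prefix_block}\n\n<{self.iri}> a dcat:Distribution ;\n    {body} .\n"


ttl = Distribution(
    iri="http://example.org/dataset/demo/1.0/ttl",
    media_type="text/turtle",
    download_url="http://example.org/downloads/demo-1.0.ttl",
    landing_page="http://example.org/demo",
).to_turtle()
print(ttl)
```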
  19. Tools to create the metadata
     • VoID Editor
  20. Tools to validate the metadata
     • New version using ShEx in development
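To give a feel for what validating a description against mandatory, recommended and optional descriptors involves, here is a minimal sketch. It is not the validator shown in the talk and it does not use ShEx; the rule table, field names and requirement levels are hypothetical and do not reproduce the note's actual requirements.

```python
# Sketch of a requirement-level check over a dataset description.
# REQUIREMENTS is a hypothetical rule table, not the note's actual rules.

REQUIREMENTS = {
    "title": "MUST",
    "description": "MUST",
    "license": "SHOULD",
    "keywords": "MAY",
}


def validate(description):
    """Report missing mandatory fields as errors and recommended ones as warnings."""
    report = {"errors": [], "warnings": []}
    for field, level in REQUIREMENTS.items():
        if description.get(field):
            continue  # present and non-empty
        if level == "MUST":
            report["errors"].append(f"missing mandatory field: {field}")
        elif level == "SHOULD":
            report["warnings"].append(f"missing recommended field: {field}")
        # Optional (MAY) fields are never reported.
    return report


report = validate({"title": "Demo dataset", "keywords": ["demo"]})
print(report)
```

A shape language such as ShEx expresses the same kind of constraint declaratively against the RDF graph itself, rather than against a flattened dictionary as done here.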
  21. Links and acknowledgements
     • HCLS: http://www.w3.org/blog/hcls/
     • Mailing list: http://lists.w3.org/Archives/Public/public-semweb-lifesci/
     • Editors’ Draft: http://tiny.cc/hcls-datadesc-ed
     • W3C Interest Group Note: http://tiny.cc/hcls-datadesc
     • Special thanks to Alasdair Gray, Scott Marshall, Joachim Baran
     • Thanks to all other contributors to the HCLS note