Successfully reported this slideshow.
Your SlideShare is downloading. ×

The HCLS Community Profile: Describing Datasets, Versions, and Distributions

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Loading in …3
×

Check these out next

1 of 17 Ad

The HCLS Community Profile: Describing Datasets, Versions, and Distributions

Download to read offline

Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets.

The goal of this presentation is to give an overview of the HCLS Community Profile and explain how it extends and builds upon other approaches.

Presentation given at SDSVoc (https://www.w3.org/2016/11/sdsvoc/)

Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets.

The goal of this presentation is to give an overview of the HCLS Community Profile and explain how it extends and builds upon other approaches.

Presentation given at SDSVoc (https://www.w3.org/2016/11/sdsvoc/)

Advertisement
Advertisement

More Related Content

Slideshows for you (20)

Similar to The HCLS Community Profile: Describing Datasets, Versions, and Distributions (20)

Advertisement

More from Alasdair Gray (18)

Advertisement

The HCLS Community Profile: Describing Datasets, Versions, and Distributions

  1. 1. The HCLS Community Profile: Describing Datasets, Versions, and Distributions Alasdair J G Gray Heriot-Watt University www.macs.hw.ac.uk/~ajg33 A.J.G.Gray@hw.ac.uk @gray_alasdair Michel Dumontier Stanford University M. Scott Marshall MAASTRO Clinic 30/11/2016 @gray_alasdair www.macs.hw.ac.uk/~ajg33 1
  2. 2. Open PHACTS Example
  3. 3. Data Cache (Triple Store) Semantic Workflow Engine Linked Data API (RDF/XML, TTL, JSON) Domain Specific Services Identity Resolution Service Identifier Management Service “Adenosine receptor 2a” EC2.43.4 CS4532 P12374 CorePlatform ChEMBL- RDF ChEMBL v13 Chem2 Bio2RDF SD v13 v12 v2 or v8 Which ChEMBL version? @gray_alasdair www.macs.hw.ac.uk/~ajg33 3
  4. 4. Open PHACTS Example
  5. 5. OPS Example
  6. 6. Data Cache (Triple Store) Semantic Workflow Engine Linked Data API (RDF/XML, TTL, JSON) Domain Specific Services Identity Resolution Service Identifier Management Service “Adenosine receptor 2a” EC2.43.4 CS4532 P12374 CorePlatform ChEMBL- RDF ChEMBL v13 Chem2 Bio2RDF SD v13 v12 v2 or v8 Open PHACTS Discovery PlatformHistoric Use Case ~January 2012 Open PHACTS v2.1 ChEMBL 20 http://tiny.cc/ops-datasets Which ChEMBL version? @gray_alasdair www.macs.hw.ac.uk/~ajg33 6
  7. 7. Challenges • Datasets available – In many versions over time – In different formats – From many mirrors/registries • Datasets build on each other • Files do not carry metadata • Registries – Can be out-of-date – Can contain conflicting information 30/11/2016 @gray_alasdair www.macs.hw.ac.uk/~ajg33 7 Scientists require data provenance!
  8. 8. Dublin Core Metadata Initiative  Widely used  Broadly applicable – Documents – Datasets ✗Generic terms ✗Not comprehensive ✗No required properties 30/11/2016 @gray_alasdair www.macs.hw.ac.uk/~ajg33 8 “Date: A point or period of time associated with an event in the lifecycle of the resource.”
  9. 9. 9 @gray_alasdair www.macs.hw.ac.uk/~ajg33  Metadata carried with data – Directly embedded: void:inDataset ✗No versioning ✗No checklist of requisite fields ✗Only for RDF data VoID: Vocabulary of Interlinked Datasets 30/11/2016
  10. 10. DCAT: Data Catalog  Separates Dataset and Distribution ✗No versioning ✗No prescribed properties 30/11/2016 @gray_alasdair www.macs.hw.ac.uk/~ajg33 10
  11. 11. W3C HCLS Group
  12. 12. HCLS Dataset Descriptions 61 Metadata properties from 18 vocabularies 5 Modules: Core, Identifiers, Provenance, Distributions, Stats
  13. 13. Prescribed Usage Element Property Value Summary Level Version Level Distribution Level Core Metadata Type declaration rdf:type dctypes:Dataset MUST MUST SHOULD Type declaration rdf:type void:Dataset or dcat:Distribution MUST NOT MUST NOT MUST Title dct:title rdf:langString MUST MUST MUST Alternative titles dct:alternative rdf:langString MAY MAY MAY Description dct:description rdf:langString MUST MUST MUST … … … … … …
  14. 14. 30/11/2016 @gray_alasdair www.macs.hw.ac.uk/~ajg33 15 ChEMBL: Summary Level
  15. 15. Requires Tooling Creation Validation 30/11/2016 @gray_alasdair www.macs.hw.ac.uk/~ajg33 17
  16. 16. Implementations RDF Platform More coming…
  17. 17. HCLS Dataset Descriptions https://www.w3.org/TR/hcls-dataset/ Dumontier M, Gray AJG, Marshall MS, et al. (2016) The health care and life sciences community profile for dataset descriptions. PeerJ 4:e2331 https://doi.org/10.7717/peerj.2331 A.J.G.Gray@hw.ac.uk @gray_alasdair

Editor's Notes

  • OPS Explorer screenshot returning compound information
    Open PHACTS as a use case; data integration platform
    One of many use cases gathered in HCLSIG
  • Behind the scenes!
    ChemSpider: EBI SDF file
    ChEMBL 13
    Data Cache: Chem2Bio2RDF ChEMBL RDF
    File downloaded May 2011
    Chem2Bio2RDF metadata webpages: ChEMBL 8
    File contents: ChEMBL 2
    Mapping Server: Kasabi ChEMBL RDF file
    ChEMBL 12
  • OPS Explorer screenshot returning compound information
    Open PHACTS as a use case; data integration platform
    One of many use cases gathered in HCLSIG
  • Key element is provenance of where the data has come from
    Enabled by having detailed descriptions of the sources
  • Behind the scenes!
    ChemSpider: EBI SDF file
    ChEMBL 13
    Data Cache: Chem2Bio2RDF ChEMBL RDF
    File downloaded May 2011
    Chem2Bio2RDF metadata webpages: ChEMBL 8
    File contents: ChEMBL 2
    Mapping Server: Kasabi ChEMBL RDF file
    ChEMBL 12
  • We reuse several properties
  • Large community buy in
    - 27 authors
    – Major data providers EBI, RIKEN, SIB
    Weekly telcons, collaborative editing
    2-3 year process

    Wide range of use cases
  • Summary level: time unchanging information, e.g. name, description, publisher
    Version level: version specific information, e.g. version number, creator, etc
    Distribution level: file specific information, e.g. file location and format, number of triples

    Reuse vocabularies: DCTerms, DCAT, VoID, FOAF, …
    Prescribed properties: MUST, SHOULD, MAY, MUST NOT for each level
  • 61 properties from 18 vocabularies
    Minimised number of MUST/SHOULD to those for interoperability
    MAYs are recommended terms
  • 21 Properties
    4 MUST
    4 SHOULD
    13 May
  • Summary level: time unchanging information, e.g. name, description, publisher
    Version level: version specific information, e.g. version number, creator, etc
    Distribution level: file specific information, e.g. file location and format, number of triples

    Acknowledge W3C HCLS IG

×