The HCLS Community Profile: Describing Datasets, Versions, and Distributions
Alasdair J G Gray
Heriot-Watt University
www.macs.hw.ac.uk/~ajg33
A.J.G.Gray@hw.ac.uk
@gray_alasdair
Michel Dumontier
Stanford University
M. Scott Marshall
MAASTRO Clinic
30/11/2016
Open PHACTS Example
[Diagram: Core Platform containing a Data Cache (Triple Store), Semantic Workflow Engine, Linked Data API (RDF/XML, TTL, JSON), Domain Specific Services, Identity Resolution Service, and Identifier Management Service. Example terms: “Adenosine receptor 2a”, EC2.43.4, CS4532, P12374. Data source labels: ChEMBL-RDF, ChEMBL v13, Chem2Bio2RDF, SD v13, v12, v2 or v8.]
Which ChEMBL version?
Open PHACTS Example
[Diagram: the Open PHACTS Core Platform architecture from the previous Open PHACTS slide, labelled "Open PHACTS Discovery Platform".]
Historic Use Case
~January 2012
Open PHACTS v2.1
ChEMBL 20
http://tiny.cc/ops-datasets
Which ChEMBL version?
Challenges
• Datasets available
– In many versions over time
– In different formats
– From many mirrors/registries
• Datasets build on each other
• Files do not carry metadata
• Registries
– Can be out-of-date
– Can contain conflicting information
Scientists require data provenance!
Dublin Core Metadata Initiative
✓ Widely used
✓ Broadly applicable
– Documents
– Datasets
✗ Generic terms
✗ Not comprehensive
✗ No required properties
“Date: A point or period of time associated with an event in the lifecycle of the resource.”
VoID: Vocabulary of Interlinked Datasets
✓ Metadata carried with data
– Directly embedded: void:inDataset
✗ No versioning
✗ No checklist of requisite fields
✗ Only for RDF data
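As a rough illustration of the "directly embedded" point, the sketch below links a data document to its dataset description with void:inDataset. Plain tuples stand in for RDF triples, and both example.org URIs are hypothetical placeholders, not real dataset identifiers.

```python
# Minimal sketch: plain tuples stand in for RDF triples; no RDF library is assumed.
VOID = "http://rdfs.org/ns/void#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical URIs, used only for illustration.
DATASET = "http://example.org/dataset/chembl"
DOCUMENT = "http://example.org/data/compound-file"

triples = [
    (DATASET, RDF_TYPE, VOID + "Dataset"),
    # The data document itself points back at the dataset that describes it.
    (DOCUMENT, VOID + "inDataset", DATASET),
]


def datasets_of(document, triples):
    """Return the datasets a document declares itself part of via void:inDataset."""
    return [o for s, p, o in triples if s == document and p == VOID + "inDataset"]


print(datasets_of(DOCUMENT, triples))
```

Because the link travels inside the data file, a consumer can find the description without consulting a separate registry. Nothing in this link says which version of the dataset the file belongs to.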
DCAT: Data Catalog
✓ Separates Dataset and Distribution
✗ No versioning
✗ No prescribed properties
W3C HCLS Group
HCLS Dataset Descriptions
61 Metadata properties from 18 vocabularies
5 Modules: Core, Identifiers, Provenance, Distributions, Stats
Prescribed Usage
Element             Property         Value                              Summary Level  Version Level  Distribution Level
Core Metadata
Type declaration    rdf:type         dctypes:Dataset                    MUST           MUST           SHOULD
Type declaration    rdf:type         void:Dataset or dcat:Distribution  MUST NOT       MUST NOT       MUST
Title               dct:title        rdf:langString                     MUST           MUST           MUST
Alternative titles  dct:alternative  rdf:langString                     MAY            MAY            MAY
Description         dct:description  rdf:langString                     MUST           MUST           MUST
…                   …                …                                  …              …              …
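A prescribed-usage table like the one above lends itself to automated checking. The sketch below is a minimal, hypothetical checker covering only the three literal-valued rows shown (title, alternative titles, description). The flat dictionary standing in for an RDF description, and the names `RULES` and `check`, are assumptions made for illustration and are not part of the profile or of any existing validator.

```python
# Requirement levels transcribed from three rows of the table above.
# The type-declaration rows are left out to keep the sketch short.
LEVELS = ("summary", "version", "distribution")

RULES = {
    "dct:title":       {"summary": "MUST", "version": "MUST", "distribution": "MUST"},
    "dct:alternative": {"summary": "MAY",  "version": "MAY",  "distribution": "MAY"},
    "dct:description": {"summary": "MUST", "version": "MUST", "distribution": "MUST"},
}


def check(description, level):
    """Return a list of problems for a description at the given level.

    `description` maps prefixed property names to values; this flat dict is
    a stand-in for a real RDF graph.
    """
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    problems = []
    for prop, rule in RULES.items():
        requirement = rule[level]
        present = prop in description
        if requirement == "MUST" and not present:
            problems.append(f"missing required property {prop}")
        elif requirement == "MUST NOT" and present:
            # No rule in this subset uses MUST NOT; the branch shows how it would be handled.
            problems.append(f"forbidden property {prop}")
    return problems


# Hypothetical summary-level description that lacks dct:description.
example = {"dct:title": "ChEMBL"}
print(check(example, "summary"))
```

MAY and SHOULD properties are never reported as errors here; a fuller tool might surface missing SHOULD properties as warnings.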
ChEMBL: Summary Level
Requires Tooling
• Creation
• Validation
Implementations
RDF Platform
More coming…
HCLS Dataset Descriptions
https://www.w3.org/TR/hcls-dataset/
Dumontier M, Gray AJG, Marshall MS, et al. (2016) The health care and life sciences community profile for dataset descriptions. PeerJ 4:e2331. https://doi.org/10.7717/peerj.2331
A.J.G.Gray@hw.ac.uk @gray_alasdair


Editor's Notes

  • #3 OPS Explorer screenshot returning compound information. Open PHACTS as a use case: a data integration platform. One of many use cases gathered in HCLSIG.
  • #4 Behind the scenes! ChemSpider: EBI SDF file, ChEMBL 13. Data Cache: Chem2Bio2RDF ChEMBL RDF; file downloaded May 2011; Chem2Bio2RDF metadata webpages: ChEMBL 8; file contents: ChEMBL 2. Mapping Server: Kasabi ChEMBL RDF file, ChEMBL 12.
  • #5 OPS Explorer screenshot returning compound information. Open PHACTS as a use case: a data integration platform. One of many use cases gathered in HCLSIG.
  • #6 Key element is the provenance of where the data has come from, enabled by having detailed descriptions of the sources.
  • #7 Behind the scenes! ChemSpider: EBI SDF file, ChEMBL 13. Data Cache: Chem2Bio2RDF ChEMBL RDF; file downloaded May 2011; Chem2Bio2RDF metadata webpages: ChEMBL 8; file contents: ChEMBL 2. Mapping Server: Kasabi ChEMBL RDF file, ChEMBL 12.
  • #11 We reuse several properties
  • #12 Large community buy-in: 27 authors, including major data providers (EBI, RIKEN, SIB). Weekly telcons, collaborative editing. 2-3 year process. Wide range of use cases.
  • #14 Summary level: time-unchanging information, e.g. name, description, publisher. Version level: version-specific information, e.g. version number, creator. Distribution level: file-specific information, e.g. file location and format, number of triples. Reused vocabularies: DCTerms, DCAT, VoID, FOAF, … Prescribed properties: MUST, SHOULD, MAY, MUST NOT for each level.
  • #15 61 properties from 18 vocabularies. The number of MUST/SHOULD properties is minimised to those needed for interoperability. MAYs are recommended terms.
  • #16 21 properties: 4 MUST, 4 SHOULD, 13 MAY.
  • #20 Summary level: time-unchanging information, e.g. name, description, publisher. Version level: version-specific information, e.g. version number, creator. Distribution level: file-specific information, e.g. file location and format, number of triples. Acknowledge W3C HCLS IG.
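
The three levels described in these notes can be sketched as linked records, which is what allows a file to answer "Which ChEMBL version?". The class and field names below are illustrative assumptions drawn from the examples in the notes; they are not the property names the profile prescribes, and all values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Summary:
    """Time-unchanging information about a dataset."""
    name: str
    description: str
    publisher: str


@dataclass
class Version:
    """Version-specific information, linked to its summary."""
    summary: Summary
    version_number: str
    creator: str


@dataclass
class Distribution:
    """File-specific information, linked to the version it serialises."""
    version: Version
    file_location: str
    file_format: str
    triples: Optional[int] = None  # only meaningful for RDF files


def which_version(dist):
    """Walk from a file up to its version and summary descriptions."""
    return f"{dist.version.summary.name} {dist.version.version_number}"


# Hypothetical values, for illustration only.
summary = Summary("ChEMBL", "Example description", "Example publisher")
release = Version(summary, "20", "Example creator")
rdf_file = Distribution(release, "http://example.org/chembl-20.ttl", "text/turtle")
print(which_version(rdf_file))
```

Keeping the version on its own record means a new release adds one version record and its files without touching the summary.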