Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high-quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine-readable descriptions of versioned datasets. The goal of this tutorial is to explain elements of the HCLS community profile and to enable users to craft and validate descriptions for datasets of interest.
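To make the profile concrete, here is a minimal sketch in Python with rdflib of a summary-level description using the vocabularies the profile draws on; the dataset IRI and all values are hypothetical, and a conformant description would carry more elements.

    # A minimal, illustrative HCLS-style dataset description built with
    # rdflib. The IRI and literal values are hypothetical.
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DCAT, DCTERMS, RDF

    g = Graph()
    ds = URIRef("http://example.org/dataset/chem")  # hypothetical IRI

    g.add((ds, RDF.type, DCAT.Dataset))
    g.add((ds, DCTERMS.title, Literal("Example Chemistry Dataset", lang="en")))
    g.add((ds, DCTERMS.description, Literal("An illustrative description.", lang="en")))
    g.add((ds, DCTERMS.publisher, URIRef("http://example.org/org/lab")))
    g.add((ds, DCTERMS.license, URIRef("https://creativecommons.org/licenses/by/4.0/")))

    print(g.serialize(format="turtle"))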
Supporting Dataset Descriptions in the Life Sciences – Alasdair Gray
Machine processable descriptions of datasets can help make data more FAIR; that is Findable, Accessible, Interoperable, and Reusable. However, there are a variety of metadata profiles for describing datasets, some specific to the life sciences and others more generic in their focus. Each profile has its own set of properties and requirements as to which must be provided and which are more optional. Developing a dataset description for a given dataset to conform to a specific metadata profile is a challenging process.
In this talk, I will give an overview of some of the dataset description specifications that are available. I will discuss the difficulties in writing a dataset description that conforms to a profile and the tooling that I've developed to support dataset publishers in creating metadata descriptions and validating them against a chosen specification.
Seminar talk given at the EBI on 5 April 2017
The HCLS Community Profile: Describing Datasets, Versions, and Distributions – Alasdair Gray
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high-quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine-readable descriptions of versioned datasets.
The goal of this presentation is to give an overview of the HCLS Community Profile and explain how it extends and builds upon other approaches.
Presentation given at SDSVoc (https://www.w3.org/2016/11/sdsvoc/)
An Identifier Scheme for the Digitising Scotland Project – Alasdair Gray
The Digitising Scotland project is having the vital records of Scotland transcribed from images of the original handwritten civil registers. Linking the resulting dataset of 24 million vital records covering the lives of 18 million people is a major challenge requiring improved record linkage techniques. Discussions within the multidisciplinary, widely distributed Digitising Scotland project team have been hampered by the teams in each of the institutions using their own identification scheme. To enable fruitful discussions within the Digitising Scotland team, we required a mechanism for uniquely identifying each individual represented on the certificates. From the identifier it should be possible to determine the type of certificate and the role each person played. We have devised a protocol to generate a unique identifier for any individual on a certificate, without using a computer, by exploiting the National Records of Scotland's registration districts. Importantly, the approach does not rely on the handwritten content of the certificates, which reduces the risk of the content being misread resulting in an incorrect identifier. The resulting identifier scheme has improved the internal discussions within the project. This paper discusses the rationale behind the chosen identifier scheme, and presents the format of the different identifiers. The work reported in the paper was supported by the British ESRC under grants ES/K00574X/1 (Digitising Scotland) and ES/L007487/1 (Administrative Data Research Center - Scotland).
Validata: A tool for testing profile conformance – Alasdair Gray
Validata (http://hw-swel.github.io/Validata/) is an online web application for validating a dataset description expressed in RDF against a community profile expressed as a Shape Expression (ShEx). Additionally, it provides an API for programmatic access to the validator. Validata can be used with multiple community-agreed standards, e.g. DCAT, the HCLS community profile, or the Open PHACTS guidelines, and there are currently deployments to support each of these. Validata can be easily repurposed for different deployments by providing it with a new ShEx schema (a sketch of the idea follows this entry). The Validata code is available from GitHub (https://github.com/HW-SWeL/Validata).
Presentation given at SDSVoc https://www.w3.org/2016/11/sdsvoc
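To illustrate the kind of check Validata performs, the sketch below validates a tiny description against a ShEx shape in Python using the third-party pyshex package. Treat it as an assumption-laden illustration: Validata itself is a web application, and the shape, data, and IRIs here are invented for the example.

    # Illustrative ShEx conformance check (assumes the pyshex package and
    # its ShExEvaluator API; Validata is a web app, this mirrors the idea).
    from pyshex import ShExEvaluator

    SHEX = """
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    <http://example.org/shapes/Dataset> {
      dct:title xsd:string ;   # exactly one title
      dct:license IRI          # exactly one licence IRI
    }
    """

    DATA = """
    @prefix dct: <http://purl.org/dc/terms/> .
    <http://example.org/dataset/chem>
        dct:title "Example Chemistry Dataset" ;
        dct:license <https://creativecommons.org/licenses/by/4.0/> .
    """

    for r in ShExEvaluator(rdf=DATA, schema=SHEX,
                           focus="http://example.org/dataset/chem",
                           start="http://example.org/shapes/Dataset").evaluate():
        print(r.focus, "conforms" if r.result else "does not conform")

Swapping in a different ShEx schema is all that is needed to target another profile, which is the same repurposing step Validata supports.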
Our speaker, Joshua Mandel, will provide a lightning tour of Fast Healthcare Interoperability Resources (FHIR), an emerging clinical data standard, with a focus on its resource-oriented approach, and a discussion of how FHIR intersects with the Semantic Web. We'll look at how FHIR represents links between entities; how FHIR represents concepts from standards-based vocabularies; and how a set of FHIR instance data can be represented in RDF.
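For a flavour of what FHIR instance data looks like in RDF, here is a minimal Turtle fragment parsed with rdflib in Python. The patient IRI is hypothetical and the property style follows the FHIR RDF draft, so exact names may differ between FHIR releases; treat it as a sketch rather than canonical FHIR RDF.

    # A sketch of FHIR instance data in RDF (Turtle), parsed with rdflib.
    from rdflib import Graph

    TTL = """
    @prefix fhir: <http://hl7.org/fhir/> .

    <http://example.org/Patient/pat1>
        a fhir:Patient ;
        fhir:nodeRole fhir:treeRoot ;
        fhir:Patient.gender [ fhir:value "female" ] .
    """

    g = Graph().parse(data=TTL, format="turtle")
    print(g.serialize(format="turtle"))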
In our series on The Yosemite Project, we explore RDF as a data standard for health data. In this presentation, we will discuss with Claude Nanjo, a Software Architect at Cognitive Medical Systems, ways to expose clinical knowledge as OWL and RDF resources on the Web in order to promote greater convergence in the representation of health knowledge in the longer term. We will also explore how one might rally and coordinate the community to seed the Web with a core set of high-value resources and technologies that could greatly enhance health interoperability.
Yosemite Project - Part 3 - Transformations for Integrating VA data with FHIR... – DATAVERSITY
In our series on The Yosemite Project, we explore RDF as a data standard for health data. In this installment, we will hear from Rafael Richards, Physician Informatician, Office of Informatics and Analytics in the Veterans Health Administration (VHA), about “Transformations for Integrating VA data with FHIR in RDF.”
The VistA EHR has its own data model and vocabularies for representing healthcare data. This webinar describes how SPARQL Inferencing Notation (SPIN) can be used to translate VistA data to the representation used by FHIR, an emerging interchange standard.
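SPIN represents rules as SPARQL under the hood, so the flavour of such a translation can be sketched as a plain SPARQL CONSTRUCT executed with rdflib in Python. The vista: terms below are hypothetical placeholders and the fhir: property mirrors the FHIR RDF draft style; none of this is the webinar's actual mapping.

    # Sketch of a VistA-to-FHIR style translation as a SPARQL CONSTRUCT
    # (SPIN rules are SPARQL templates underneath). Vocabulary terms are
    # illustrative placeholders only.
    from rdflib import Graph

    src = Graph().parse(data="""
    @prefix vista: <http://example.org/vista#> .
    <http://example.org/pt/42> vista:sex "F" .
    """, format="turtle")

    result = src.query("""
    PREFIX vista: <http://example.org/vista#>
    PREFIX fhir:  <http://hl7.org/fhir/>
    CONSTRUCT { ?p a fhir:Patient ;
                   fhir:Patient.gender [ fhir:value "female" ] }
    WHERE     { ?p vista:sex "F" }
    """)

    for triple in result:
        print(triple)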
WEBINAR: The Yosemite Project: An RDF Roadmap for Healthcare Information Inte... – DATAVERSITY
Interoperability of electronic healthcare information remains an enormous challenge in spite of 100+ available healthcare information standards. This webinar explains the Yosemite Project, whose mission is to achieve semantic interoperability of all structured healthcare information through RDF as a common semantic foundation. It explains the rationale and technical strategy of the Yosemite Project, and describes how RDF and related standards address a two-pronged strategy for semantic interoperability: facilitating collaborative standards convergence whenever possible, and crowd-sourced data translations when necessary.
The DataTags System: Sharing Sensitive Data with Confidence – Merce Crosas
This talk was part of a session at the Research Data Alliance (RDA) 8th Plenary on Privacy Implications of Research Data Sets, during International Data Week 2016:
https://rd-alliance.org/rda-8th-plenary-joint-meeting-ig-domain-repositories-wg-rdaniso-privacy-implications-research-data
Slides in Merce Crosas site:
http://scholar.harvard.edu/mercecrosas/presentations/datatags-system-sharing-sensitive-data-confidence
Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing p... – GigaScience, BGI Hong Kong
Jesse Xiao at the Data Publishing session at CODATA2017: Updates to the GigaDB open access data publishing platform. Wednesday 11th October in St Petersburg, Russia
Metadata as Linked Data for Research Data Repositories – andrea huang
“Every man has his own cosmology and who can say that his own is right,” as Einstein put it. The same holds when we come to data semantics: one dataset may be interpreted differently by different data creators, curators, and re-users. How, then, do we build a better research data repository?
We start from the point made by Willis, C., Greenberg, J., & White, H. (2012) that metadata for research data increases access to and reuse of the data. Stanford, Harvard, and Cornell believe the use of linked data technologies is a promising method for gathering contextual information about research resources.
Looking for tools that can meet the urgent need for innovative solutions that provide feature-rich services for data publishing, such as visualization, validation, and reuse in different applications by research repositories (Assante et al., 2016), we chose CKAN (Comprehensive Knowledge Archive Network) as our starting point: a major solution that makes linked metadata available, citable, and validated.
Original file: http://m.odw.tw/u/odw/m/metadata-as-linked-data-for-research-data-repositories/
Wikidata: Verifiable, Linked Open Knowledge That Anyone Can Edit – Dario Taraborelli
Slides for my September 23 talk on Wikidata and WikiCite – NIH Frontiers in Data Science lecture series.
Persistent URL: https://dx.doi.org/10.6084/m9.figshare.3850821
Linked Data for improved organization of research data – Samuel Lampa
Slides for a talk at a Farmbio BioScience Seminar, May 18, 2018 (http://farmbio.uu.se), introducing Linked Data as a way to manage research data that better keeps track of provenance, makes its semantics more explicit, and makes it more easily integrated with other data and consumed by others, both humans and machines.
Sharing data with lightweight data standards, such as schema.org and bioschemas. The Knetminer case, an application for the agrifood domain and molecular biology.
Presented at Open Data Sicilia (#ODS2021)
HRGRN: enabling graph search and integrative analysis of Arabidopsis signalin... – Araport
The biological networks controlling plant signal transduction, metabolism and gene regulation are composed not only of genes, RNAs, proteins and compounds but also of the complicated interactions among them. Yet, even in the most thoroughly studied model plant, Arabidopsis thaliana, the knowledge regarding these interactions is scattered throughout the literature and various public databases. Thus, new scientific discovery by exploring these complex and heterogeneous data remains a challenging task for biologists.
We developed a graph-search empowered platform named HRGRN to search known and, more importantly, discover novel relationships among genes in Arabidopsis biological networks. HRGRN includes over 51,000 “nodes” that represent very large sets of genes, proteins, small RNAs, and compounds, and approximately 150,000 “edges” classified into nine types of interactions (interactions between proteins, compounds and proteins, transcription factors (TFs) and their downstream target genes, small RNAs and their target genes, kinases and downstream target genes, transporters and substrates, substrate/product compounds and enzymes, as well as gene pairs with similar expression patterns to provide deep insight into gene-gene relationships) to comprehensively model and represent the complex interactions between nodes.
HRGRN allows users to discover novel interactions between genes and/or pathways, and to build sub-networks from user-specified seed nodes by searching the comprehensive collections of interactions stored in its back-end graph databases using graph traversal algorithms. The HRGRN database is freely available at http://plantgrn.noble.org/hrgrn/. Currently, we are collaborating with the Araport team to develop REST-like web services and provide HRGRN's graph search functions to the Araport system.
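The seed-node traversal idea is generic; the Python sketch below grows a sub-network from seed nodes over a toy, hypothetical interaction list. It illustrates the technique only, not HRGRN's actual back-end or algorithms.

    # Generic illustration: build a sub-network from seed nodes by
    # breadth-first traversal over a toy typed-interaction graph.
    from collections import deque

    EDGES = {  # hypothetical typed interactions
        "TF1":   [("geneA", "TF-target"), ("geneB", "TF-target")],
        "geneA": [("miR1", "smallRNA-target")],
        "geneB": [],
        "miR1":  [],
    }

    def subnetwork(seeds, max_depth=2):
        seen = set(seeds)
        found = []
        queue = deque((s, 0) for s in seeds)
        while queue:
            node, depth = queue.popleft()
            if depth == max_depth:
                continue
            for neighbour, kind in EDGES.get(node, []):
                found.append((node, kind, neighbour))
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, depth + 1))
        return found

    print(subnetwork(["TF1"]))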
A single interface for accessing life sciences (LS) data is a natural step towards mastering the data deluge in this domain. Data in the LS require integration, and current integrative solutions increasingly rely on the federation of queries over distributed resources. We introduce a federated query processing system named “BioFed”, customised for the LS Linked Open Data (LS-LOD) cloud. BioFed federates SPARQL queries over more than 130 public SPARQL endpoints.
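Federation of this kind builds on the SPARQL 1.1 SERVICE clause; the Python sketch below (using SPARQLWrapper) shows the mechanism. The endpoints are real public services but the query is illustrative, support for SERVICE varies by endpoint, and BioFed's own source selection and optimisation are far more sophisticated.

    # Illustrative federated SPARQL query: one endpoint executes the
    # query, a SERVICE clause pulls matching data from another.
    from SPARQLWrapper import SPARQLWrapper, JSON

    QUERY = """
    SELECT ?s WHERE {
      SERVICE <https://sparql.uniprot.org/sparql> {
        ?s a <http://purl.uniprot.org/core/Protein> .
      }
    } LIMIT 5
    """

    endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
    endpoint.setQuery(QUERY)
    endpoint.setReturnFormat(JSON)
    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["s"]["value"])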
Open Source Tools Facilitating Sharing/Protecting Privacy: Dataverse and Data... – Merce Crosas
Presentation for the NFAIS Webinar series: Open Data Fostering Open Science: Meeting Researchers' Needs
http://www.nfais.org/index.php?option=com_mc&view=mc&mcid=72&eventId=508850&orgId=nfais
This webinar explains the service, covers what publishers need to participate and answers any questions you may have. This webinar was held on November 10, 2015.
Dataset Descriptions in Open PHACTS and HCLS – Alasdair Gray
This presentation gives an overview of the dataset description specification developed in the Open PHACTS project (http://www.openphacts.org/). The creation of the specification was driven by a real need within the project to track the datasets used.
Details of the dataset metadata captured and the vocabularies used to model this metadata are given together with the tools developed to enable the specification's uptake.
Over the course of the last 12 months, the W3C Healthcare and Life Science Interest Group have been developing a community profile for dataset descriptions. This has drawn on the ideas developed in the Open PHACTS specification. A brief overview of the forthcoming community profile is given in the presentation.
This presentation was given to the Network Data Exchange project http://www.ndexbio.org/ on 2 April 2014.
Re-using Media on the Web: Media fragment re-mixing and playout – MediaMixerCommunity
A number of novel application ideas will be introduced based on the media fragment creation, specification and rights management technologies. Semantic search and retrieval allows us to organize sets of fragments by topical or conceptual relevance. These fragment sets can then be played out in a non-linear fashion to create a new media re-mix. We look at a server-client implementation supporting Media Fragments, before allowing the participants to take the sets of media they have selected and create their own re-mix.
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud – Ontotext
This webinar will break the roadblocks that prevent many from reaping the benefits of heavyweight Semantic Technology in small scale projects. We will show you how to build Semantic Search & Analytics proof of concepts by using managed services in the Cloud.
These are the slides I used for a talk about a Semantic Web use case. Not everyone knows what exactly the Semantic Web is about, so I created a set of slides explaining it in a simple and correct way. The use-case slides have been removed from this publicly available version. An animated version is here: goo.gl/qKoF6k. Contact me for sources!
Introduction to Apache Any23. Any23 is a library, a web service, and a command line tool, written in Java, that extracts structured RDF data from a variety of Web documents and markup formats.
Any23 is an Apache Software Foundation top level project.
20th Feb 2020 json-ld-rdf-im-proposal.pdf – Michal Miklas
originally published as part of the Egeria project:
https://github.com/odpi/data-governance/blob/master/work-groups/semantic/documents/20th%20Feb%202020%20json-ld-rdf-im-proposal.pptx
used in this article:
https://www.linkedin.com/pulse/rdfowl-best-enough-any-model-michal-miklas
Flexible metadata schemes for research data repositories - CLARIN Conference'21 – vty
The development of the Common Framework in Dataverse and the CMDI use case. Building an AI/ML-based workflow for predicting and linking concepts from external controlled vocabularies to CMDI metadata values.
Flexible metadata schemes for research data repositories - Clarin Conference... – Vyacheslav Tykhonov
The development of the Common Framework in Dataverse and the CMDI use case. Building an AI/ML-based workflow for predicting and linking concepts from external controlled vocabularies to CMDI metadata values.
Introduction to metadata cleansing using SPARQL update queriesEuropean Commission
What is it about?
-How to transform your metadata using simple SPARQL Update queries (see the sketch after this list);
-How to conform to the ADMS-AP to get your interoperability solutions ready to be shared on Joinup;
-The main types of errors that you could face when uploading metadata of interoperability solutions on Joinup.
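As a small illustration of such a cleansing step, the Python sketch below runs a SPARQL Update DELETE/INSERT with rdflib. The renamed property is a hypothetical example, not one of Joinup's actual ADMS-AP fixes.

    # Metadata cleansing via SPARQL Update: rename a non-standard title
    # property to dct:title. ex:label is a hypothetical cleansing target.
    from rdflib import Graph

    g = Graph().parse(data="""
    @prefix ex: <http://example.org/ns#> .
    <http://example.org/solution/1> ex:label "An interoperability solution" .
    """, format="turtle")

    g.update("""
    PREFIX ex:  <http://example.org/ns#>
    PREFIX dct: <http://purl.org/dc/terms/>
    DELETE { ?s ex:label ?v }
    INSERT { ?s dct:title ?v }
    WHERE  { ?s ex:label ?v }
    """)

    print(g.serialize(format="turtle"))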
This presentation addresses the main issues of Linked Data and scalability. In particular, it gives details on approaches and technologies for clustering, distributing, sharing, and caching data. Furthermore, it addresses the means for publishing data through cloud deployment and the relationship between Big Data and Linked Data, exploring how some of the solutions can be transferred to the context of Linked Data.
HDL - Towards A Harmonized Dataset Model for Open Data Portals – Ahmad Assaf
The Open Data movement triggered an unprecedented amount of data published in a wide range of domains. Governments and corporations around the world are encouraged to publish, share, use and integrate Open Data. There are many areas where one can see the added value of Open Data, from transparency and self-empowerment to improving efficiency, effectiveness and decision making. This growing amount of data requires rich metadata in order to reach its full potential. This metadata enables dataset discovery, understanding, integration and maintenance. Data portals, which are considered to be datasets' access points, offer metadata represented in different and heterogeneous models. In this paper, we first conduct a unique and comprehensive survey of seven metadata models: CKAN, DKAN, Public Open Data, Socrata, VoID, DCAT and Schema.org. Next, we propose a Harmonized Dataset modeL (HDL) based on this survey. We describe use cases that show the benefits of providing rich metadata to enable dataset discovery, search and spam detection.
How to describe a dataset. Interoperability issues – Valeria Pesce
Presented by Valeria Pesce during the pre-meeting of the Agricultural Data Interoperability Interest Group (IGAD) of the Research Data Alliance (RDA), held on 21 and 22 September 2015 in Paris at INRA.
Doctoral Examination at the Karlsruhe Institute of Technology (08.07.2016) – Dr.-Ing. Thomas Hartmann
In this thesis, a validation framework is introduced that enables the consistent execution of RDF-based constraint languages on RDF data and the formulation of constraints of any type. The framework reduces the representation of constraints to the absolute minimum, is based on formal logics, consists of a small lightweight vocabulary, ensures consistency of validation results, and enables constraint transformations for each constraint type across RDF-based constraint languages.
Similar to Tutorial: Describing Datasets with the Health Care and Life Sciences Community Profile
Using a Jupyter Notebook to perform a reproducible scientific analysis over s... – Alasdair Gray
In recent years there has been a reproducibility crisis in science. Computational notebooks, such as Jupyter, have been touted as one solution to this problem. However, when executing analyses over live SPARQL endpoints, we get different answers depending upon when the analysis in the notebook was executed. In this paper, we identify some of the issues discovered in trying to develop a reproducible analysis over a collection of biomedical data sources and suggest some best practice to overcome these issues.
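One practice that follows from this is recording, inside the notebook itself, which dataset versions answered a query and when. A Python sketch of the idea follows; the endpoint URL is hypothetical and the PAV property queried is illustrative, since endpoints expose different metadata.

    # Record the provenance of a live SPARQL endpoint before analysing:
    # fetch dataset version metadata and timestamp the run.
    from datetime import datetime, timezone
    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("https://sparql.example.org/sparql")  # hypothetical
    endpoint.setQuery("""
    PREFIX pav: <http://purl.org/pav/>
    SELECT ?dataset ?version WHERE { ?dataset pav:version ?version } LIMIT 10
    """)
    endpoint.setReturnFormat(JSON)

    run_at = datetime.now(timezone.utc).isoformat()
    versions = [(b["dataset"]["value"], b["version"]["value"])
                for b in endpoint.query().convert()["results"]["bindings"]]
    print("Executed at", run_at, "against", versions)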
Bioschemas Community: Developing profiles over Schema.org to make life scienc... – Alasdair Gray
The Bioschemas community (http://bioschemas.org) is a loose collaboration formed by a wide range of life science resource providers and informaticians. The community is developing profiles over Schema.org to enable life science resources such as data about a specific protein, sample, or training event, to be more discoverable on the web. While the content of well-known resources such as Uniprot (for protein data) are easily discoverable, there is a long tail of specialist resources that would benefit from embedding Schema.org markup in a standardised approach.
The community have developed twelve profiles for specific types of life science resources (http://bioschemas.org/specifications/), with another six at an early draft stage. For each profile, a set of use cases have been identified. These typically focus on search, but several facilitate lightweight data exchange to support data aggregators such as Identifiers.org, FAIRsharing.org, and BioSamples. The next stage of the development of a profile consists of mapping the terms used in the use cases to existing properties in Schema.org and domain ontologies. The properties are then prioritised in order to support the use cases, with a minimal set of about six properties identified, along with a larger set of recommended and optional properties. For each property, an expected cardinality is defined and where appropriate, object values are specified from controlled vocabularies. Before a profile is finalised, it must first be demonstrated that resources can deploy the markup.
In this talk, we will outline the progress that has been made by the Bioschemas Community in a single year through three hackathon events. We will discuss the processes followed by the Bioschemas Community to foster collaboration, and highlight the benefits and drawbacks of using open Google documents and spreadsheets to support the community develop the profiles. We will conclude by summarising future opportunities and directions for the community.
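To show the shape of such markup, here is a minimal Schema.org JSON-LD snippet parsed with rdflib in Python. The type and properties are common Schema.org terms chosen for illustration, not an official Bioschemas profile listing, and the record IRI is hypothetical.

    # Minimal, illustrative Schema.org JSON-LD of the kind Bioschemas
    # profiles constrain. Requires rdflib >= 6, which bundles json-ld.
    import json
    from rdflib import Graph

    doc = json.dumps({
        "@context": {"@vocab": "https://schema.org/"},
        "@type": "Dataset",
        "@id": "http://example.org/record/P12345",
        "name": "Example protein record",
        "description": "Illustrative markup for web discoverability.",
        "url": "http://example.org/record/P12345",
    })

    g = Graph().parse(data=doc, format="json-ld")
    print(len(g), "triples extracted")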
Presentation given at the Open PHACTS project symposium.
The slides give an overview of the data in version 2.0 of the Open PHACTS drug discovery platform and the challenges that have been faced in the Open PHACTS project to reach this stage.
This presentation was prepared for my faculty Christmas conference.
Abstract: For the last 11 months I have been working on a top secret project with a world renowned Scandinavian industry partner. We are now moving into the exciting operational phase of this project. I have been granted an early lifting of the embargo that has stopped me talking about this work up until now. I will talk about the data science behind this big data project and how semantic web technology has enabled the delivery of Project X.
Data Integration in a Big Data Context: An Open PHACTS Case Study – Alasdair Gray
Keynote presentation at the EU Ambient Assisted Living Forum workshop The Crusade for Big Data in the AAL Domain.
The presentation explores the Open PHACTS project and how it overcame various Big Data challenges.
Data is being generated all around us – from our smart phones tracking our movement through a city to the city itself sensing various properties and reacting to various conditions. However, to maximise the potential from all this data, it needs to be combined and coerced into models that enable analysis and interpretation. In this talk I will give an overview of the techniques that I have developed for data integration: integrating streams of sensor data with background contextual data and supporting multiple interpretations of linking data together. At the end of the talk I will overview the work I will be conducting in the Administrative Data Research Centre for Scotland.
Many areas of scientific discovery rely on combining data from multiple data sources. However, there are many challenges in linking data. This presentation highlights these challenges in the context of using Linked Data for environmental and social science databases.
Scientific lenses to support multiple views over linked chemistry data – Alasdair Gray
When are two entries about a small molecule in different datasets the same? If they have the same drug name, chemical structure, or some other criteria? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data with no way of varying the notion of equivalence to be applied.
In this paper, we present an approach to enable applications to choose the equivalence criteria to apply between datasets. Thus, supporting multiple dynamic views over the Linked Data. For chemical data, we show that multiple sets of links can be automatically generated according to different equivalence criteria and published with semantic descriptions capturing their context and interpretation. This approach has been applied within a large scale public-private data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses which vary the equivalence rules to be applied based on the context and interpretation of the links.
Scientific Lenses over Linked Data: An approach to support multiple integrate... – Alasdair Gray
When are two entries about a concept in different datasets the same? If they have the same name, properties, or some other criteria? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data with no way of varying the notion of equivalence to be applied.
In this presentation, I will introduce Scientific lenses, an approach that enables applications to vary the equivalence conditions between linked datasets. They have been deployed in the Open PHACTS Discovery Platform – a large scale data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses which vary the equivalence rules to be applied based on the context and interpretation of the links.
Describing Scientific Datasets: The HCLS Community Profile – Alasdair Gray
Big Data presents an exciting opportunity to pursue large-scale analyses over collections of data in order to uncover valuable insights across a myriad of fields and disciplines. Yet, as more and more data is made available, researchers are finding it increasingly difficult to discover and reuse these data. One problem is that data are insufficiently described to understand what they are or how they were produced. A second issue is that no single vocabulary provides all key metadata fields required to support basic scientific use cases. A third issue is that data catalogs and data repositories all use different metadata standards, if they use any standard at all, and this prevents easy search and aggregation of data. Therefore, we need a community profile to indicate what are the essential metadata, and the manner in which we can express it.
The W3C Health Care and Life Sciences Interest Group have developed such a community profile that defines the required properties to provide high-quality dataset descriptions that support finding, understanding, and reusing scientific data, i.e. making the data FAIR (Findable, Accessible, Interoperable and Re-usable – http://datafairport.org). The specification reuses many notions and vocabulary terms from Dublin Core, DCAT and VoID, with provenance and versioning information being provided by PROV-O and PAV. The community profile is based around a three tier model; the summary description captures catalogue style metadata about the dataset, each version of the dataset is described separately as are the various distribution formats of these versions. The resulting community profile is generic and applicable to a wide variety of scientific data.
Tools are being developed to help with the creation and validation of these descriptions. Several datasets including those from Bio2RDF, EBI and IntegBio are already moving to release descriptions conforming to the community profile.
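A compact illustration of that three-tier model in Turtle, parsed with rdflib in Python; the IRIs and values are hypothetical, and a real conformant description would carry many more properties.

    # The HCLS three-tier model: summary, version, distribution.
    from rdflib import Graph

    TTL = """
    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix pav:  <http://purl.org/pav/> .

    <http://example.org/chembl>                    # summary level
        dct:title "Example ChEMBL-like dataset" .

    <http://example.org/chembl/20>                 # version level
        dct:isVersionOf <http://example.org/chembl> ;
        pav:version "20" ;
        dcat:distribution <http://example.org/chembl/20.ttl> .

    <http://example.org/chembl/20.ttl>             # distribution level
        dcat:downloadURL <http://example.org/data/chembl20.ttl> ;
        dct:format "text/turtle" .
    """

    g = Graph().parse(data=TTL, format="turtle")
    print(len(g), "triples")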
SensorBench is a benchmark suite for wireless sensor networks. The design of wireless sensor network systems sits within a multi-dimensional design space, where it can be difficult to understand the implications of specific decisions and to identify optimal solutions. SensorBench enables the systematic analysis and comparison of different techniques and platforms, enabling both development and user communities to make well informed choices. The benchmark identifies key variables and performance metrics, and specifies experiments that explore how different types of task perform under different metrics for the controlled variables. The benchmark is demonstrated by its application on representative platforms.
Full details of the benchmark are available from http://dl.acm.org/citation.cfm?id=2618252 (DOI: 10.1145/2618243.2618252)
What are the research and technical challenges of linked data that are relevant to data science?
This presentation introduces the ideas of linked data using the BBC sport web site as an example. It then identifies several research challenges that remain to be addressed.
Computing Identity Co-Reference Across Drug Discovery Datasets – Alasdair Gray
This paper presents the rules used within the Open PHACTS (http://www.openphacts.org) Identity Management Service to compute co-reference chains across multiple datasets. The web of (linked) data has encouraged a proliferation of identifiers for the concepts captured in datasets, with each dataset using its own identifiers. A key data integration challenge is linking the co-referent identifiers, i.e. identifying and linking the equivalent concept in every dataset. Exacerbating this challenge, the datasets model the data differently, so when is one representation truly the same as another? Finally, different users have their own task- and domain-specific notions of equivalence that are driven by their operational knowledge. Consumers of the data need to be able to choose the notion of operational equivalence to be applied for the context of their application. We highlight the challenges of automatically computing co-reference and the need for capturing the context of the equivalence. This context is then used to control the co-reference computation. Ultimately, the context will enable data consumers to decide which co-references to include in their applications.
Incorporating Commercial and Private Data into an Open Linked Data Platform f... – Alasdair Gray
The Open PHACTS Discovery Platform aims to provide an integrated information space to advance pharmacological research in the area of drug discovery. Effective drug discovery requires comprehensive data coverage, i.e. integrating all available sources of pharmacology data. While many relevant data sources are available on the linked open data cloud, their content needs to be combined with that of commercial datasets and the licensing of these commercial datasets respected when providing access to the data. Additionally, pharmaceutical companies have built up their own extensive private data collections that they require to be included in their pharmacological dataspace. In this paper we discuss the challenges of incorporating private and commercial data into a linked dataspace: focusing on the modelling of these datasets and their interlinking. We also present the graph-based access control mechanism that ensures commercial and private datasets are only available to authorized users.
http://link.springer.com/chapter/10.1007/978-3-642-41338-4_5
Including Co-Referent URIs in a SPARQL Query – Alasdair Gray
Linked data relies on instance level links between potentially differing representations of concepts in multiple datasets. However, in large complex domains, such as pharmacology, the inter-relationship of data instances needs to consider the context (e.g. task, role) of the user and the assumptions they want to apply to the data. Such context is not taken into account in most linked data integration procedures. In this paper we argue that dataset links should be stored in a stand-off fashion, thus enabling different assumptions to be applied to the data links during query execution. We present the infrastructure developed for the Open PHACTS Discovery Platform to enable this and show through evaluation that the incurred performance cost is below the threshold of user perception.
http://ceur-ws.org/Vol-1034/BrenninkmeijerEtAl_COLD2013.pdf
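The stand-off idea can be mimicked in plain SPARQL by expanding over the link set at query time; the Python sketch below walks owl:sameAs links in both directions with a property path. The data and IRIs are hypothetical, and the platform's lens mechanism is considerably richer.

    # Stand-off link expansion at query time: follow owl:sameAs chains
    # in either direction rather than assuming pre-merged data.
    from rdflib import Graph

    g = Graph().parse(data="""
    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix ex:  <http://example.org/> .
    ex:dsA_aspirin owl:sameAs ex:dsB_aspirin .
    ex:dsB_aspirin ex:hasTarget ex:COX1 .
    """, format="turtle")

    rows = g.query("""
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX ex:  <http://example.org/>
    SELECT ?target WHERE {
      ex:dsA_aspirin (owl:sameAs|^owl:sameAs)* ?coref .
      ?coref ex:hasTarget ?target .
    }
    """)
    for row in rows:
        print(row.target)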
Alice: "What version of ChEMBL are we using?"
Bob: "Er…let me check. It's going to take a while, I'll get back to you."
This simple question took us the best part of a month to resolve and involved several individuals. Knowing the provenance of your data is essential, especially when using large complex systems that process multiple datasets.
The underlying issues of this simple question motivated us to improve the provenance data in the Open PHACTS project. We developed a guideline for dataset descriptions where the metadata is carried with the data. In this talk I will highlight the challenges we faced and give an overview of our metadata guidelines.
Presentation given to the W3C Semantic Web for Health Care and Life Sciences Interest Group on 14 January 2013.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... – Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Adjusting primitives for graph : SHORT REPORT / NOTES – Subhajit Sahu
Graph algorithms, like PageRank, typically operate over Compressed Sparse Row (CSR), an adjacency-list based graph representation. The notes below compare map and reduce primitives under different execution modes:
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... – John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... – pchutichetpong
M Capital Group (“MCG”) expects to see demand grow and supply evolve, facilitated through institutional investment rotation out of offices amid work from home (“WFH”), while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Tutorial: Describing Datasets with the Health Care and Life Sciences Community Profile
1. Describing Datasets with the Health Care and Life Sciences Community Profile
4 December 2016
Alasdair Gray
Department of Computer Science
Heriot-Watt University
www.macs.hw.ac.uk/~ajg33/
A.J.G.Gray@hw.ac.uk
@gray_alasdair
Michel Dumontier
Stanford University
M. Scott Marshall
Netherlands Cancer Institute
2. Materials released under CC-BY License
You are free to:
• Share — copy and redistribute the material in any medium or format
• Adapt — remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
• Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
3. Outline
• Dataset Descriptions: Overview, Use Cases, Requirements, Implementations, Existing Vocabularies
• HCLS Community Profile: Overview, Example, Modules, Hands On (develop your own description)
• Validation: Overview, Demonstration, Hands On (validate your description)
• RDF Dataset Statistics: Overview, SPARQL Queries
4. W3C HCLS Group
Dumontier M, Gray AJG, Marshall MS, et al. (2016) The health care and life sciences community profile for dataset descriptions. PeerJ 4:e2331. https://doi.org/10.7717/peerj.2331
5. Use Cases
Overview of a couple of examples
6. FAIR Data Principles
Findable
• Global persistent identifier
• Rich metadata
• Store metadata in registries
Accessible
• Resolvable identifiers
• Metadata persists
• Machine and human access
Interoperable
• Open data format
• Modelled with FAIR compliant vocabularies
• Reference external data
Reusable
• Rich metadata
• Clear license
• Provenance
10. Which version of ChEMBL?
[Architecture diagram: domain-specific services sit on a Linked Data API (RDF/XML, TTL, JSON) over a Core Platform comprising a Data Cache (triple store), Semantic Workflow Engine, Identity Resolution Service, and Identifier Management Service. Example lookups: “Adenosine receptor 2a”, EC2.43.4, CS4532, P12374. Loaded datasets: ChEMBL-RDF v13, ChEMBL v13, Chem2Bio2RDF v12, SD v2 or v8.]
11. Which version of ChEMBL?
[Same architecture diagram, identified as the Open PHACTS Discovery Platform; historic use case, ~January 2012. Open PHACTS v2.1 loads ChEMBL 20; dataset descriptions at http://tiny.cc/ops-datasets]
12. Challenges
• Datasets available
  • In many versions over time
  • In different formats
  • From many mirrors/registries
• Datasets build on each other
• Files do not carry metadata
• Registries
  • Can be out-of-date
  • Can contain conflicting information
Scientists require data provenance!
14. Metadata Requirements
• What is the dataset about?
• Name and description
• Content themes
• Who produced the dataset?
• Publisher details
• Content author details
• When was the dataset created?
• Date of creation
• Date of publication
• Versioning
• Where can I get the dataset?
• Licence and rights
• Download URL
• SPARQL endpoint
• Web service
• How was the dataset created?
• Experimental methodology
• Post-processing
• Why was the dataset created?
• Motivation
15. HCLS Additional Requirements
Standard metadata requirements plus:
1. Resolvable identifiers for metadata
2. Descriptions of data identifiers
3. Data provenance
4. Data statistics
18. Dublin Core Metadata Initiative
Widely used
Broadly applicable
• Documents
• Datasets
✗ Generic terms
✗ Not comprehensive
✗ No required properties
“Date: A point or period of time associated with an event in the lifecycle of the resource.”
19. VoID: Vocabulary of Interlinked Datasets
Metadata carried with data
• Directly embedded: void:inDataset
✗ No versioning
✗ No checklist of requisite fields
✗ Only for RDF data
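As a sketch of that embedding mechanism (the resource and dataset IRIs are hypothetical), a resource published as RDF can point back at the dataset it belongs to:

  @prefix void: <http://rdfs.org/ns/void#> .

  # The resource declares which dataset its description belongs to
  <http://example.org/compound/123>
      void:inDataset <http://example.org/dataset/chembl> .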
20. DCAT: Data Catalog
Separates Dataset and Distribution
✗ No versioning
✗ No prescribed properties
Fixed with DCAT-AP, but DCAT-AP doesn’t meet the use case needs
31. Core Metadata (Title & description)
Element | Property | Value | Summary Level | Version Level | Distribution Level
Type declaration | rdf:type | dctypes:Dataset | MUST | MUST | SHOULD
Type declaration | rdf:type | void:Dataset or dcat:Distribution | MUST NOT | MUST NOT | MUST
Title | dct:title | rdf:langString | MUST | MUST | MUST
Alternative titles | dct:alternative | rdf:langString | MAY | MAY | MAY
Description | dct:description | rdf:langString | MUST | MUST | MUST
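To make the table concrete, here is a minimal Turtle sketch of the type, title, and description elements; all IRIs under http://example.org/ are hypothetical placeholders, not official ChEMBL identifiers:

  @prefix dct:     <http://purl.org/dc/terms/> .
  @prefix dctypes: <http://purl.org/dc/dcmitype/> .
  @prefix dcat:    <http://www.w3.org/ns/dcat#> .
  @prefix :        <http://example.org/dataset/> .

  # Summary level: typed only as dctypes:Dataset
  :chembl a dctypes:Dataset ;
      dct:title "ChEMBL"@en ;
      dct:description "A database of bioactive drug-like small molecules."@en .

  # Distribution level: additionally typed as a dcat:Distribution
  :chembl20-ttl a dctypes:Dataset, dcat:Distribution ;
      dct:title "ChEMBL 20 RDF distribution"@en ;
      dct:description "Turtle serialisation of ChEMBL version 20."@en .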
32. Core Metadata (Dates & contributors)
Element | Property | Value | Summary Level | Version Level | Distribution Level
Date created | dct:created | rdfs:Literal (ISO 8601 date/time string, typed with the appropriate XML Schema datatype) | MUST NOT | SHOULD | SHOULD
Other dates | pav:createdOn or pav:authoredOn or pav:curatedOn | xsd:dateTime, xsd:date, xsd:gYearMonth, or xsd:gYear | MUST NOT | MAY | MAY
Creators | dct:creator | IRI | MUST NOT | MUST | MUST
Contributors | dct:contributor or pav:createdBy or pav:authoredBy or pav:curatedBy | IRI | MUST NOT | MAY | MAY
Date of issue | dct:issued | rdfs:Literal (ISO 8601 date/time string, typed with the appropriate XML Schema datatype) | MUST NOT | SHOULD | SHOULD
33. Core Metadata (Publisher and licence)
Element | Property | Value | Summary Level | Version Level | Distribution Level
Publisher | dct:publisher | IRI | MUST | MUST | MUST
HTML page | foaf:page | IRI | SHOULD | SHOULD | SHOULD
Logo | schemaorg:logo | IRI | SHOULD | SHOULD | SHOULD
License | dct:license | IRI | MAY | SHOULD | MUST
Rights | dct:rights | rdf:langString | MAY | MAY | MAY
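A version-level Turtle sketch covering the two tables above; the dates, agent IRIs, and licence IRI are illustrative assumptions:

  @prefix dct:  <http://purl.org/dc/terms/> .
  @prefix foaf: <http://xmlns.com/foaf/0.1/> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
  @prefix :     <http://example.org/dataset/> .

  # Version level: dates, agents, publisher, and licence
  :chembl20 dct:created "2015-02-03"^^xsd:date ;
      dct:issued "2015-02-09"^^xsd:date ;
      dct:creator <http://example.org/agent/chembl-team> ;
      dct:publisher <http://example.org/agent/ebi> ;
      foaf:page <http://example.org/chembl/about.html> ;
      dct:license <http://example.org/licences/cc-by-sa-3.0> .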
34. Core Metadata (Content description)
Element | Property | Value | Summary Level | Version Level | Distribution Level
Keywords | dcat:keyword | xsd:string | MAY | MAY | MAY
Language | dct:language | http://lexvo.org/id/iso639-3/{tag} | MUST NOT | SHOULD | SHOULD
References | dct:references | IRI | MAY | MAY | MAY
Concept descriptors | dcat:theme | IRI of type skos:Concept | MAY | MAY | MAY
Vocabulary used | void:vocabulary | IRI | MUST NOT | MUST NOT | SHOULD
Standards used | dct:conformsTo | IRI | MUST NOT | MAY | SHOULD
Citations | cito:citesAsAuthority | IRI | MAY | MAY | MAY
Related material | rdfs:seeAlso | IRI | MAY | MAY | MAY
Partitions | dct:hasPart | IRI | MAY | MAY | MUST NOT
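Continuing the sketch at the version level (the keyword strings, concept IRI, and article IRI are again placeholders):

  @prefix dct:  <http://purl.org/dc/terms/> .
  @prefix dcat: <http://www.w3.org/ns/dcat#> .
  @prefix :     <http://example.org/dataset/> .

  # Version level: keywords, language, theme, and references
  :chembl20 dcat:keyword "chemistry", "bioactivity" ;
      dct:language <http://lexvo.org/id/iso639-3/eng> ;
      dcat:theme <http://example.org/concept/drug-discovery> ;  # IRI of a skos:Concept
      dct:references <http://example.org/article/chembl-2015> .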
35. Identifiers
Element | Property | Value | Summary Level | Version Level | Distribution Level
Preferred prefix | idot:preferredPrefix | xsd:string | MAY | MAY | MAY
Alternate prefix | idot:alternatePrefix | xsd:string | MAY | MAY | MAY
Identifier pattern | idot:identifierPattern | xsd:string | MUST NOT | MUST NOT | MAY
URI pattern | void:uriRegexPattern | xsd:string | MUST NOT | MUST NOT | MAY
File access pattern | idot:accessPattern | idot:AccessPattern | MUST NOT | MUST NOT | MAY
Example identifier | idot:exampleIdentifier | xsd:string | MUST NOT | MUST NOT | SHOULD
Example resource | void:exampleResource | IRI | MUST NOT | MUST NOT | SHOULD
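A distribution-level sketch of the identifier elements, assuming the identifiers.org idot namespace and hypothetical example.org URI patterns:

  @prefix idot: <http://identifiers.org/idot/> .
  @prefix void: <http://rdfs.org/ns/void#> .
  @prefix :     <http://example.org/dataset/> .

  # Distribution level: the identifier scheme used within the data
  :chembl20-ttl idot:preferredPrefix "chembl" ;
      idot:identifierPattern "CHEMBL\\d+" ;
      void:uriRegexPattern "^http://example\\.org/chembl/CHEMBL\\d+$" ;
      idot:exampleIdentifier "CHEMBL25" ;
      void:exampleResource <http://example.org/chembl/CHEMBL25> .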
36. Provenance and Change (Versioning)
Element | Property | Value | Summary Level | Version Level | Distribution Level
Version identifier | pav:version | xsd:string | MUST NOT | MUST | SHOULD
Version linking | dct:isVersionOf | IRI | MUST NOT | MUST | MUST NOT
Version linking | pav:previousVersion | IRI | MUST NOT | SHOULD | SHOULD
Version linking | pav:hasCurrentVersion | IRI | MAY | MUST NOT | MUST NOT
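A versioning sketch linking a summary, its current version, and the preceding version (all IRIs hypothetical):

  @prefix pav: <http://purl.org/pav/> .
  @prefix dct: <http://purl.org/dc/terms/> .
  @prefix :    <http://example.org/dataset/> .

  # Version level: version number and links between versions
  :chembl20 pav:version "20" ;
      dct:isVersionOf :chembl ;
      pav:previousVersion :chembl19 .

  # Summary level: pointer to the latest version
  :chembl pav:hasCurrentVersion :chembl20 .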
37. Provenance and Change
Element | Property | Value | Summary Level | Version Level | Distribution Level
Data source provenance | dct:source or pav:retrievedFrom or prov:wasDerivedFrom | IRI | MUST NOT | SHOULD | SHOULD
Item listing | sio:has-data-item | IRI | MUST NOT | MUST NOT | MAY
Creation tool | pav:createdWith | IRI | MUST NOT | SHOULD | SHOULD
Update frequency | dct:accrualPeriodicity | IRI of type dctypes:Frequency | SHOULD | MUST NOT | MUST NOT
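A provenance sketch; the source and tool IRIs are placeholders, and the frequency term is assumed to come from the Dublin Core frequency vocabulary:

  @prefix pav: <http://purl.org/pav/> .
  @prefix dct: <http://purl.org/dc/terms/> .
  @prefix :    <http://example.org/dataset/> .

  # Version level: where the data came from and what created it
  :chembl20 pav:retrievedFrom <http://example.org/source/chembl_20_release> ;
      pav:createdWith <http://example.org/tool/rdf-export-pipeline> .

  # Summary level: how often new versions are expected
  :chembl dct:accrualPeriodicity <http://purl.org/cld/freq/quarterly> .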
38. Availability and Distributions
Element | Property | Value | Summary Level | Version Level | Distribution Level
Distribution description | dcat:distribution | IRI of Distribution Level description | MUST NOT | SHOULD | MUST NOT
File format | dct:format | IRI or xsd:string | MUST NOT | MUST NOT | MUST
File directory | dcat:accessURL | IRI | MAY | MAY | MAY
File URL | dcat:downloadURL | IRI | MUST NOT | MUST NOT | SHOULD
Byte size | dcat:byteSize | xsd:decimal | MUST NOT | MUST NOT | SHOULD
RDF File URL | void:dataDump | IRI | MUST NOT | MUST NOT | SHOULD
SPARQL endpoint | void:sparqlEndpoint | IRI | SHOULD | SHOULD NOT | SHOULD NOT
Documentation | dcat:landingPage | IRI | MUST NOT | MAY | MAY
Linkset | void:subset | IRI | MUST NOT | MUST NOT | SHOULD
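A distribution sketch; the download URL and byte size are placeholders:

  @prefix dcat: <http://www.w3.org/ns/dcat#> .
  @prefix dct:  <http://purl.org/dc/terms/> .
  @prefix void: <http://rdfs.org/ns/void#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
  @prefix :     <http://example.org/dataset/> .

  # Version level: link from the version to its distribution
  :chembl20 dcat:distribution :chembl20-ttl .

  # Distribution level: format and access details
  :chembl20-ttl dct:format "text/turtle" ;
      dcat:downloadURL <http://example.org/download/chembl_20.ttl.gz> ;
      void:dataDump <http://example.org/download/chembl_20.ttl.gz> ;
      dcat:byteSize "2048000000"^^xsd:decimal .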
39. Statistics (RDF Core)
Element | Property | Value | Summary Level | Version Level | Distribution Level
# of triples | void:triples | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of typed entities | void:entities | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of subjects | void:distinctSubjects | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of properties | void:properties | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of objects | void:distinctObjects | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of classes | void:classPartition | IRI | MUST NOT | MUST NOT | SHOULD
# of literals | void:classPartition | IRI | MUST NOT | MUST NOT | SHOULD
# of RDF graphs | void:classPartition | IRI | MUST NOT | MUST NOT | SHOULD
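A statistics sketch; every count below is a placeholder, and the class-count encoding via a class partition is one plausible reading of the profile's partition pattern:

  @prefix void: <http://rdfs.org/ns/void#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
  @prefix :     <http://example.org/dataset/> .

  # Distribution level: core statistics (placeholder counts)
  :chembl20-ttl void:triples "100000000"^^xsd:integer ;
      void:entities "10000000"^^xsd:integer ;
      void:distinctSubjects "12000000"^^xsd:integer ;
      void:properties "150"^^xsd:integer ;
      void:distinctObjects "40000000"^^xsd:integer ;
      # number of distinct classes, expressed as a class partition
      void:classPartition [ void:class rdfs:Class ;
                            void:entities "100"^^xsd:integer ] .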
40. Statistics (RDF Complete)
Element | Property | Value | Summary Level | Version Level | Distribution Level
Class frequency | void:classPartition | IRI | MUST NOT | MUST NOT | MAY
Property frequency | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
Property and subject types | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
Property and object types | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
Property and literals | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
Property, subject and object types | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
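A frequency sketch for the complete statistics (the class IRI and counts are placeholders):

  @prefix void: <http://rdfs.org/ns/void#> .
  @prefix dct:  <http://purl.org/dc/terms/> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
  @prefix :     <http://example.org/dataset/> .

  # Distribution level: per-class and per-property frequencies
  :chembl20-ttl
      void:classPartition [ void:class <http://example.org/vocab/Compound> ;
                            void:entities "150000"^^xsd:integer ] ;
      void:propertyPartition [ void:property dct:title ;
                               void:triples "150000"^^xsd:integer ] .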
41. Hands on
Create your own dataset description
If you don’t have your own, pick one of the following two
44. Validation Service
Andrew Beveridge, Jacob Baungard Hansen, Johnny Val, Leif Gehrmann, Roisin Farmer, Sunil Khutan, Tomas Robertson
45. Example Constraint
• Shape: a Dataset Summary
  • MUST be declared to be of type dctype:Dataset
  • MUST have a dcterms:title as a language typed string
  • MUST NOT have a dcterms:created date (dates are associated with versions in HCLS)
[Diagram: a <Dataset> node with an rdf:langString title; the dcterms:created arc is marked ✗]
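A sketch of how that constraint might be written in the compact ShEx syntax; the shape name is hypothetical:

  PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX dct:     <http://purl.org/dc/terms/>
  PREFIX dctypes: <http://purl.org/dc/dcmitype/>

  <DatasetSummaryShape> {
    rdf:type [dctypes:Dataset] ;  # MUST be typed dctypes:Dataset
    dct:title rdf:langString + ;  # MUST have a language-tagged title
    dct:created . {0}             # MUST NOT carry a creation date
  }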
55. http://www.w3.org/2015/03/ShExValidata/
• Try the ChEMBL example
• Play with Options
  • Add resources
  • Switch between open/closed shapes
  • Change requirement level
• Check your own work
• Try descriptions from
  • Riken Metadatabase: http://metadb.riken.jp/archives/HCLSProfile/HCLSProfile_MetaDB_ALL.ttl
  • PHI-Base: https://www.dropbox.com/s/tghogwagn962tmi/Metadata.ttl?dl=0
57. Statistics (RDF Core)
Element | Property | Value | Summary Level | Version Level | Distribution Level
# of triples | void:triples | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of typed entities | void:entities | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of subjects | void:distinctSubjects | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of properties | void:properties | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of objects | void:distinctObjects | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of classes | void:classPartition | IRI | MUST NOT | MUST NOT | SHOULD
# of literals | void:classPartition | IRI | MUST NOT | MUST NOT | SHOULD
# of RDF graphs | void:classPartition | IRI | MUST NOT | MUST NOT | SHOULD
58. Statistics (RDF Complete)
Element | Property | Value | Summary Level | Version Level | Distribution Level
Class frequency | void:classPartition | IRI | MUST NOT | MUST NOT | MAY
Property frequency | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
Property and subject types | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
Property and object types | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
Property and literals | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
Property, subject and object types | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
59. Why provide rich dataset descriptions?
• Support
  • Endpoint exploration
  • Query writing
• Eliminates expensive exploratory queries
62. Generate with SPARQL queries
• Number of triples
• Subject-types related to Object-types
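Two SPARQL sketches of the kinds of queries the slide alludes to; the variable names are my own:

  # Count the triples in the dataset
  SELECT (COUNT(*) AS ?triples)
  WHERE { ?s ?p ?o }

  # Relate subject types to object types through each property
  SELECT ?sType ?p ?oType (COUNT(*) AS ?links)
  WHERE {
    ?s ?p ?o .
    ?s a ?sType .
    ?o a ?oType .
  }
  GROUP BY ?sType ?p ?oType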
64. FAIR Data Principles
Findable
• Global persistent identifier
• Rich metadata
• Store metadata in registries
Accessible
• Resolvable identifiers
• Metadata persists
• Machine and human access
Interoperable
• Open data format
• Modelled with FAIR compliant vocabularies
• Reference external data
Reusable
• Rich metadata
• Clear license
• Provenance
Summary level: time unchanging information, e.g. name, description, publisher
Version level: version specific information, e.g. version number, creator, etc
Distribution level: file specific information, e.g. file location and format, number of triples
Reuse vocabularies: DCTerms, DCAT, VoID, FOAF, …
Prescribed properties: MUST, SHOULD, MAY, MUST NOT for each level
21 properties: 4 MUST, 4 SHOULD, 13 MAY
Link into data publishing pipeline via API
Not tied to HCLS, only a motivation
No existing tool meets these needs
Constraints form a graph pattern that data must comply with.
How do we validate that our example data conforms to a certain shape? Express the expected shape as ShEx. Toy example; what about for real?
ShEx:
Concise notation, regex based
W3C SHACL was not stable when this work was done; ShEx provides similar shape-based validation with extra features
Step through the validation process
Extended ShEx to allow arbitrary hierarchies
Toy example; what about for real?
The ShEx validator has other dependencies too:
minimist: command-line argument parsing
promise: callbacks
pegjs: parser generator
mocha: test-driven development
Acknowledge W3C HCLS IG