Supporting Dataset Descriptions
in the Life Sciences
Alasdair J G Gray
Heriot-Watt University
www.macs.hw.ac.uk/~ajg33
A.J.G.Gray@hw.ac.uk
@gray_alasdair
FAIR Data Principles
Findable
• Global persistent identifier
• Rich metadata
• Store metadata in
registries
Accessible
• Resolvable identifiers
• Metadata persists
• Machine and human
access
Interoperable
• Open data format
• Modelled with FAIR
compliant vocabularies
• Reference external data
Reusable
• Rich metadata
• Clear license
• Provenance
5 April 2017 @gray_alasdair www.macs.hw.ac.uk/~ajg33 2
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship.
Nature Scientific Data 3, 1–15 (2016). DOI: 10.1038/sdata.2016.18
Degrees of FAIRness
Open PHACTS Explorer
Which ChEMBL version?
[Architecture diagram: Open PHACTS Core Platform, with components labelled Linked Data API (RDF/XML, TTL, JSON), Semantic Workflow Engine, Data Cache (Triple Store), Identity Resolution Service, Identifier Management Service and Domain Specific Services. Example inputs: “Adenosine receptor 2a”, EC2.43.4, CS4532, P12374. Data source labels: ChEMBL-RDF, ChEMBL v13, Chem2Bio2RDF, SD. Version labels: v13, v12, v2 or v8.]
Historic Use Case
~January 2012
Open PHACTS v2.1
ChEMBL 20
http://tiny.cc/ops-datasets
Open PHACTS Provenance
Open PHACTS FAIR Data
Data Reuse Challenges
• Datasets available
– In many versions over time
– In different formats
– From many mirrors/registries
• Datasets build on each other
• Files do not carry metadata
• Registries
– Can be out-of-date
– Can contain conflicting information
Scientists
require data
provenance!
Goal: To be FAIR
Findable
Accessible
Interoperable
Reusable
Open PHACTS Dataset
Description Guidelines
Challenging for Publishers:
• Datasets are complex
• Evolve over time
• Another publishing burden
• Requires RDF knowledge
• Descriptions are complex
• Metadata precision
Tooling support required!
Open PHACTS
Dataset Description Model
Open PHACTS Dataset
Description Guidelines
Help me describe my data!
Thanks for converting my data to RDF, can you help me make it findable by creating a VoID dataset description?
Here are the guidelines, just write the terms in a text document.
Dataset description → Metadata → Boring
No! Use the Open PHACTS VoID Editor
Characters reproduced from Piled Higher and Deeper by Jorge Cham, http://phdcomics.com
Open PHACTS VoID Editor
Open PHACTS Validator
(Some) Life Sciences
Metadata Specifications
[Chart: metadata specifications positioned by depth against reach; surviving labels: "model", "HCLS DataDesc"]
Bioschemas
Schema.org for biology
Minimum properties for
• Finding data
• Presenting search results
Structured data markup for web pages
Without markup
<div>
  <h1>Classic potato salad</h1>
  <div>
    Nutrition facts:
    <span>144 kcal</span>,
  </div>
  Ingredients:
  - <span>800g small new potato</span>
  - <span>3 shallot</span>
  . . .
Structure a reader infers from the page: Recipe, Title, Nutrition, Calories, Ingredients
Structured data markup for web pages
With markup
<div itemscope itemtype="http://schema.org/Recipe">
  <h1 itemprop="name">Classic potato salad</h1>
  <div itemprop="nutrition" itemscope
       itemtype="http://schema.org/NutritionInformation">
    Nutrition facts:
    <span itemprop="calories">144 kcal</span>,
  </div>
  Ingredients:
  - <span itemprop="recipeIngredient">800g small new potato</span>
  - <span itemprop="recipeIngredient">3 shallot</span>
  . . .
Markup formats: RDFa, JSON-LD, Microdata
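The recipe statements in the Microdata example can also be carried as JSON-LD, one of the formats named on the slide. The following is a minimal sketch in Python, assuming only the values shown in the example; the function names `recipe_jsonld` and `as_script_tag` are hypothetical and introduced here for illustration.

```python
import json


def recipe_jsonld() -> dict:
    """Build the potato salad recipe as a schema.org JSON-LD object."""
    return {
        "@context": "http://schema.org",
        "@type": "Recipe",
        "name": "Classic potato salad",
        "nutrition": {
            "@type": "NutritionInformation",
            "calories": "144 kcal",
        },
        # Repeated itemprop values in Microdata become a JSON array.
        "recipeIngredient": ["800g small new potato", "3 shallot"],
    }


def as_script_tag(doc: dict) -> str:
    """Wrap a JSON-LD document in the script element used to embed it in HTML."""
    body = json.dumps(doc, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    print(as_script_tag(recipe_jsonld()))
```

Unlike Microdata, the JSON-LD block sits apart from the visible HTML, so the page structure does not have to change.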
Minimum information
Controlled vocabularies
Cardinality
Data model
New properties
The ELIXIR Implementation Study
Work packages shown: 2. Datasets; 5. Plant Phenotypes; 7. Bioschemas registry; 8. Validation
Leads shown: Susanna-Assunta Sansone, Rafa Jimenez, ???, Alasdair Gray
Phases:
1. Planning: March-April 2017
2. Agreement: May-June 2017
3. Adoption: July-Oct 2017
4. Application: Nov-Feb 2018
W3C HCLS Group
Dumontier, M. et al. The health
care and life sciences community
profile for dataset descriptions.
PeerJ 4, e2331 (2016).
DOI:10.7717/peerj.2331
Use Case Requirements
Standard metadata requirements plus:
1. Resolvable identifiers for metadata
2. Descriptions of data identifiers
3. Data provenance
4. Data statistics
HCLS Dataset Descriptions
61 Metadata properties from 18 vocabularies
5 Modules: Core, Identifiers, Provenance, Distributions, Stats
Prescribed Usage
Element            | Property        | Value                             | Summary Level | Version Level | Distribution Level
Core Metadata
Type declaration   | rdf:type        | dctypes:Dataset                   | MUST          | MUST          | SHOULD
Type declaration   | rdf:type        | void:Dataset or dcat:Distribution | MUST NOT      | MUST NOT      | MUST
Title              | dct:title       | rdf:langString                    | MUST          | MUST          | MUST
Alternative titles | dct:alternative | rdf:langString                    | MAY           | MAY           | MAY
Description        | dct:description | rdf:langString                    | MUST          | MUST          | MUST
…                  | …               | …                                 | …             | …             | …
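The prescribed-usage table lends itself to a machine-readable form. The following is a minimal sketch in Python, covering only the five rows shown on the slide; the names `PRESCRIBED`, `requirement` and `rows_with` are hypothetical and are not part of the HCLS profile.

```python
from typing import Dict, List, Tuple

# Column order of the three description levels in the table.
LEVELS = ("summary", "version", "distribution")

# Rows copied from the slide: (property, value) -> requirement per level.
PRESCRIBED: Dict[Tuple[str, str], Tuple[str, str, str]] = {
    ("rdf:type", "dctypes:Dataset"): ("MUST", "MUST", "SHOULD"),
    ("rdf:type", "void:Dataset or dcat:Distribution"): ("MUST NOT", "MUST NOT", "MUST"),
    ("dct:title", "rdf:langString"): ("MUST", "MUST", "MUST"),
    ("dct:alternative", "rdf:langString"): ("MAY", "MAY", "MAY"),
    ("dct:description", "rdf:langString"): ("MUST", "MUST", "MUST"),
}


def requirement(prop: str, value: str, level: str) -> str:
    """Look up the requirement keyword for one table cell."""
    return PRESCRIBED[(prop, value)][LEVELS.index(level)]


def rows_with(level: str, keyword: str) -> List[Tuple[str, str]]:
    """List the (property, value) rows carrying a given keyword at a level."""
    column = LEVELS.index(level)
    return [key for key, reqs in PRESCRIBED.items() if reqs[column] == keyword]


if __name__ == "__main__":
    print(requirement("rdf:type", "dctypes:Dataset", "distribution"))
    print(rows_with("summary", "MUST NOT"))
```

Keeping the table as data rather than prose is one way a creation or validation tool could be configured per profile instead of hard-coded.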
ChEMBL: Summary Level
Implementations
RDF Platform
More coming…
Layered Descriptions
Minimal dataset description
More detailed description
Dataset
Sketch of content
HCLS DataDesc
[Screenshot, partially legible: "Supported by the NIH grant 1U24 AI117966-01 to UCSD"; "Susanna-Assunta Sansone, Alejandra Gonzalez-Beltran, Phil…"; "Oxford e-Research Centre, University of Oxford"]
Configurable Tooling
Creation Validation
Constraint Languages
                     | ShEx                | SHACL             | JSON Schema
Status               | W3C Draft CG Report | W3C Working Draft | IETF Internet-Draft v5
Notation             | Concise notation    | Extended SPARQL   | JSON
Data model           | RDF                 | RDF               | JSON (JSON-LD?)
Open/closed          | Supported           | Supported         | Closed
Result format        | Defined Defined
Constraint types supported
• Domain             | ✓                   | ✓                 | ✓
• Values             | ✓                   | ✓                 | ✓
• Cardinality        | ✓                   | ✓                 | ✓
• Vocabulary         | ✓                   | ✓                 | ✗
• Recursion          | ✓                   | ✗                 | ✗
• Conformance levels | Extension           | Fixed             | ✗
Example Constraint
• Shape
• A Dataset
– MUST be declared to be of type dctypes:Dataset
– MUST have a dcterms:title as a language-typed string
– MUST NOT have a dcterms:created date (dates are associated with versions in HCLS)
[Diagram: graph pattern for <Dataset>; surviving labels: rdf:langString, ".", ✗]
Example Validation
• Shape
• Data
[Diagram: the <Dataset> graph pattern (labels rdf:langString, ".", ✗) shown alongside example data]
Shape Expressions (ShEx)
Shape
<Dataset> {
  rdf:type (dctypes:Dataset),
  dct:title rdf:langString,
  dct:alternative rdf:langString+,
  !dct:created .
}
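To make the meaning of the `<Dataset>` shape concrete, the following is a minimal sketch in Python of what checking a node against it involves. It is not how ShEx-validator works internally: triples are plain tuples, `LangString`, `Constraint`, `DATASET_SHAPE` and `violations` are hypothetical names, and `+` and `!` are modelled as simple minimum and maximum counts.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class LangString(NamedTuple):
    """Stand-in for an rdf:langString literal."""
    text: str
    lang: str


class Constraint(NamedTuple):
    predicate: str
    check: Callable[[object], bool]  # test applied to each object value
    min_count: int
    max_count: Optional[int]  # None means unbounded


def is_lang_string(value: object) -> bool:
    return isinstance(value, LangString)


# The <Dataset> shape from the slide, one Constraint per line of the shape.
DATASET_SHAPE = [
    Constraint("rdf:type", lambda v: v == "dctypes:Dataset", 1, 1),
    Constraint("dct:title", is_lang_string, 1, 1),
    Constraint("dct:alternative", is_lang_string, 1, None),  # '+': one or more
    Constraint("dct:created", lambda v: True, 0, 0),         # '!': none allowed
]

Triple = Tuple[str, str, object]


def violations(node: str, triples: List[Triple], shape: List[Constraint]) -> List[str]:
    """Return a message for every constraint the node fails to satisfy."""
    problems = []
    for c in shape:
        values = [o for s, p, o in triples if s == node and p == c.predicate]
        bad = [v for v in values if not c.check(v)]
        if bad:
            problems.append(f"{c.predicate}: unexpected value {bad[0]!r}")
        if len(values) < c.min_count:
            problems.append(f"{c.predicate}: expected at least {c.min_count}, found {len(values)}")
        elif c.max_count is not None and len(values) > c.max_count:
            problems.append(f"{c.predicate}: expected at most {c.max_count}, found {len(values)}")
    return problems


if __name__ == "__main__":
    data = [
        (":chembl", "rdf:type", "dctypes:Dataset"),
        (":chembl", "dct:title", LangString("ChEMBL", "en")),
        (":chembl", "dct:alternative", LangString("ChEMBLdb", "en")),
        (":chembl", "dct:created", "2017-04-05"),
    ]
    for message in violations(":chembl", data, DATASET_SHAPE):
        print(message)
```

Every failure here has the same weight, which is the limitation the requirement-level extension addresses.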
ShEx: Validation
<Dataset> {
  rdf:type (dctypes:Dataset),
  dct:title rdf:langString,
  dct:alternative rdf:langString+,
  !dct:created .
}
Example data
Validator can’t warn of missing property
Requirement Levels
Shape
<Dataset> {
  `MUST` rdf:type (dctypes:Dataset),
  `MUST` dct:title rdf:langString,
  `MAY` dct:alternative rdf:langString+,
  `MUST` !dct:created .
}
Validator can warn of missing property
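Requirement levels let a validator grade its findings rather than simply pass or fail. The following is a minimal sketch in Python of that idea, not Validata's implementation: the names `Rule`, `RULES`, `SEVERITY` and `report` are hypothetical, and the mapping of MUST, SHOULD and MAY to error, warning and info is an assumption made for illustration.

```python
from typing import Dict, List, NamedTuple, Set


class Rule(NamedTuple):
    level: str               # "MUST", "SHOULD" or "MAY"
    predicate: str
    forbidden: bool = False  # True for a negated constraint such as !dct:created


# The annotated <Dataset> shape from the slide, reduced to presence checks.
RULES = [
    Rule("MUST", "rdf:type"),
    Rule("MUST", "dct:title"),
    Rule("MAY", "dct:alternative"),
    Rule("MUST", "dct:created", forbidden=True),
]

# Assumed mapping from requirement level to report severity.
SEVERITY = {"MUST": "error", "SHOULD": "warning", "MAY": "info"}


def report(present: Set[str], rules: List[Rule]) -> Dict[str, List[str]]:
    """Group findings by severity, given the predicates present on a node."""
    findings: Dict[str, List[str]] = {"error": [], "warning": [], "info": []}
    for rule in rules:
        found = rule.predicate in present
        if rule.forbidden and found:
            findings[SEVERITY[rule.level]].append(f"{rule.predicate} must not be present")
        elif not rule.forbidden and not found:
            findings[SEVERITY[rule.level]].append(f"{rule.predicate} is missing")
    return findings


if __name__ == "__main__":
    print(report({"rdf:type", "dct:title"}, RULES))
```

With the levels attached, a missing `dct:alternative` is reported without making the description invalid, whereas a missing `dct:title` still counts as an error.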
Implementation
Validata
• Web app front end
• JavaScript + HTML
• Relies on ShEx-validator
– Validates documents
– Returns report
https://github.com/HW-SWeL/Validata
ShEx-validator
• Validation system
• Validation API
• JavaScript
– Node.js engine
• Reuses
– n3: RDF Library
– ShExParser
https://github.com/HW-SWeL/ShEx-validator
http://hw-swel.github.io/Validata/
VALIDATA DEMO
(Some) Life Sciences
Metadata Specifications
[Chart: metadata specifications positioned by depth against reach; surviving labels: "model", "HCLS DataDesc"]
Findable
Accessible
Interoperable
Reusable
Configurable Tooling
Creation Validation
Acknowledgements
BioSchemas
• Carole Goble
• Rafael Jimenez
FAIR Data
• FAIRdom project
• Jun Zhao
Open PHACTS
• Christian Brenninkmeijer
• Lefteris Tatakis
• Andra Waagmeester
Validata (MEng 2015)
• Andrew Beveridge
• Jacob Baungard Hansen
• Johnny Val
• Leif Gehrmann
• Roisin Farmer
• Sunil Khutan
• Tomas Robertson
• Eric Prud’hommeaux
Questions
Validata https://github.com/HW-SWeL/Validata
• RDF constraint validation tool
– Configurable to any profile
• Shape Expression (ShEx) constraints
Dumontier, M. et al. The health care and life sciences community
profile for dataset descriptions. PeerJ 4, e2331 (2016).
DOI:10.7717/peerj.2331
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific
data management and stewardship. Nature Scientific Data 3, 1–15
(2016). DOI: 10.1038/sdata.2016.18
www.macs.hw.ac.uk/~ajg33/
A.J.G.Gray@hw.ac.uk
@gray_alasdair


Editor's Notes

  • #4 FAIR data, restricted access: Beacons
  • #5 OPS Explorer screenshot returning compound information. Open PHACTS as a use case; data integration platform. One of many use cases gathered in HCLSIG.
  • #6 Behind the scenes! ChemSpider: EBI SDF file, ChEMBL 13. Data Cache: Chem2Bio2RDF ChEMBL RDF file, downloaded May 2011; Chem2Bio2RDF metadata webpages: ChEMBL 8; file contents: ChEMBL 2. Mapping Server: Kasabi ChEMBL RDF file, ChEMBL 12.
  • #7 OPS Explorer screenshot returning compound information. Open PHACTS as a use case; data integration platform. One of many use cases gathered in HCLSIG.
  • #8 Key element is provenance of where the data has come from. Enabled by having detailed descriptions of the sources.
  • #11 Project approach to create FAIR descriptions. Essential for provenance tracking. Hard for publishers, so revised/simplified. Tooling: eliminates the need for RDF knowledge and guideline knowledge; supports metadata precision.
  • #12 Key aspect: two-level model, dataset and distribution. Based largely on DCAT and VoID. Complexity from RDF data derived from non-RDF data. Linksets between datasets: need to track versions.
  • #13 Second version of the guidelines: easier to use. Properties grouped by resource type and then requirement level, with repetition in the document. Still hard to write accurate descriptions.
  • #15 User-centric design. Thematic questions. No knowledge of RDF required; can look under the hood.
  • #16 Displays partial RDF. Full RDF available for download at end of VoID Editor screens. ~15 minutes to write a description; use as boilerplate.
  • #17 Validation. Questions and validation hard-coded to OPS standard.
  • #18 Validation. OWL closed-world semantics. Can be repurposed. Difficult to maintain.
  • #19 Incomplete list. Take with a pinch of salt.
  • #21 Schema.org aims to make webpages understandable to machines. This page has no markup (not FAIR).
  • #22 Page does have structure which can be exploited, but it is still guessing. Each site has its own structure, i.e. no guarantee there will be divs or spans where needed.
  • #23 Schema.org: standard vocabulary. Multiple representations. Page structure not important. Focus on findability.
  • #24 Allows improved search results and display. Better understanding improves precision. Rich snippet generation. Data input into knowledge graph.
  • #25 Bioschemas going beyond schema.org. Additional properties being proposed. Minimal information model. Cardinality of properties. Requires validation.
  • #26 Learned from previous experience. Delivering multiple specifications, feeding into schema.org. Within a year. Rapid prototyping.
  • #27 Incomplete list. Take with a pinch of salt.
  • #28 Large community buy-in: 27 authors; major data providers EBI, RIKEN, SIB. Weekly telcons, collaborative editing. 3+ year process. Wide range of use cases.
  • #29 Not HCLS specific.
  • #30 Summary level: time-unchanging information, e.g. name, description, publisher. Version level: version-specific information, e.g. version number, creator, etc. Distribution level: file-specific information, e.g. file location and format, number of triples. Reuse vocabularies: DCTerms, DCAT, VoID, FOAF, … Prescribed properties: MUST, SHOULD, MAY, MUST NOT for each level.
  • #31 61 properties from 18 vocabularies. Minimised number of MUST/SHOULD to those for interoperability. MAYs are recommended terms.
  • #32 21 properties: 4 MUST, 4 SHOULD, 13 MAY.
  • #35 Which should you use? Can you use more than one?
  • #36 BioSchemas with HCLS
  • #37 VoID Editor: hard-coded. OPS Validator: configurable, but hard.
  • #38 RDForms looks promising; not currently driven by configuration file. User experience could be improved. Validata
  • #40 Constraints form a graph pattern that data must comply with.
  • #41 How do we validate that our example data conforms to a certain shape? Express expected shape as ShEx. Toy example; what about for real?
  • #42 How do we validate that our example data conforms to a certain shape? Express expected shape as ShEx. Toy example; what about for real?
  • #43 How do we validate that our example data conforms to a certain shape? Express expected shape as ShEx. Toy example; what about for real?
  • #44 ShEx: concise notation, regex based. W3C SHACL not stable when work done. ShEx is an implementation of SHACL with extra features.
  • #45 Step through validation process.
  • #46 Extended ShEx to allow arbitrary hierarchies. Toy example; what about for real?
  • #47 ShEx-validator has other dependencies too. Minimist: arguments parser. Promise: callbacks. Pegjs: parser generator. Mocha: test-driven development.
  • #49 Different dataset descriptions, developed for different reasons. Aim to make data FAIR.
  • #50 Require tooling. Tooling must be configurable. Can standards publish constraint schema?