Describing Scientific Datasets: The HCLS Community Profile
Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
alasdairjggray.co.uk
@gray_alasdair
Michel Dumontier
Stanford University
M. Scott Marshall
MAASTRO Clinic
[Diagram: platform architecture. Components: Data Cache (Triple Store), Semantic Workflow Engine, Linked Data API (RDF/XML, TTL, JSON), Domain Specific Services, Identity Resolution Service, Identifier Management Service, Core Platform. Example identifiers: “Adenosine receptor 2a”, EC2.43.4, CS4532, P12374. Source labels: ChEMBL-RDF, ChEMBL v13, Chem2Bio2RDF, SD, v13, v12, v2 or v8.]
Open PHACTS Discovery Platform
Historic Use Case
~January 2012
Open PHACTS v1.3
ChEMBL 16
http://tiny.cc/ops-datasets
Challenges
• Datasets available
  • In many versions over time
  • In different formats
  • From many mirrors/registries
• Datasets build on each other
• Files do not carry metadata
• Registries
  • Can be out-of-date
  • Can contain conflicting information
25 September 2014 EUON - HCLS Dataset Description 5
Scientists require data provenance!
Dublin Core Metadata Initiative
• Widely used
• Broadly applicable
  • Documents
  • Datasets
✗ Generic terms
✗ Not comprehensive
✗ No required properties
“Date: A point or period of time associated with an event in the lifecycle of the resource.”
VoID: Vocabulary of Interlinked Datasets
• Metadata carried with data
  • Directly embedded: void:inDataset
✗ No versioning
✗ No checklist of requisite fields
✗ Only for RDF data
DCAT: Data Catalog
• Separates Dataset and Distribution
✗ No versioning
✗ No prescribed properties
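A minimal Python sketch of the Dataset/Distribution separation, assuming hypothetical example.org URIs and plain string formatting rather than an RDF library. The terms dcat:Dataset, dcat:Distribution, dcat:distribution, dcat:downloadURL, dct:title and dct:format come from the DCAT and DCTerms vocabularies; the function name and its arguments are illustrative only.

```python
def describe(dataset_uri, title, distributions):
    """Render one dataset and its distributions as Turtle text.

    `distributions` maps a distribution URI to a (download URL, format) pair
    and is assumed to be non-empty. Literal escaping is omitted for brevity.
    """
    dist_refs = ", ".join(f"<{uri}>" for uri in distributions)
    lines = [
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
        "@prefix dct: <http://purl.org/dc/terms/> .",
        "",
        # The dataset is the abstract resource: what the data is about.
        f"<{dataset_uri}> a dcat:Dataset ;",
        f'    dct:title "{title}" ;',
        f"    dcat:distribution {dist_refs} .",
    ]
    for uri, (download_url, file_format) in distributions.items():
        # Each distribution is one concrete file or endpoint for that dataset.
        lines += [
            "",
            f"<{uri}> a dcat:Distribution ;",
            f"    dcat:downloadURL <{download_url}> ;",
            f'    dct:format "{file_format}" .',
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe(
        "http://example.org/dataset",          # hypothetical URIs throughout
        "Example dataset",
        {
            "http://example.org/dataset/ttl": ("http://example.org/files/data.ttl", "text/turtle"),
            "http://example.org/dataset/sdf": ("http://example.org/files/data.sdf", "chemical/x-mdl-sdfile"),
        },
    ))
```

The same dataset can list several distributions, which covers the "different formats" challenge; nothing here records which version of the data each file holds.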
W3C HCLS Group
HCLS Dataset Descriptions
VoID Editor
Validator
New version using ShEx in development
Future Vision
• Provide rich and accurate provenance trail of data
• Write once, use many times
• Automatic pipeline from description file to registries
• FAIR Data
Thank you
Editors’ Draft: http://tiny.cc/hcls-datadesc-ed
W3C Interest Group Note: http://tiny.cc/hcls-datadesc
Acknowledgements to W3C HCLS Group
www.alasdairjggray.co.uk
A.J.G.Gray@hw.ac.uk
@gray_alasdair


Editor's Notes

  • #3 Open PHACTS as a use case: a data integration platform. Explorer screenshot returning compound information.
  • #4 The key element is the provenance of where the data has come from, enabled by having detailed descriptions of the sources.
  • #5, #6 Behind the scenes!
      ChemSpider: EBI SDF file, ChEMBL 13
      Data Cache: Chem2Bio2RDF ChEMBL RDF file, downloaded May 2011
        Chem2Bio2RDF metadata webpages: ChEMBL 8
        File contents: ChEMBL 2
      Mapping Server: Kasabi ChEMBL RDF file, ChEMBL 12
  • #10 We reuse several properties
  • #11 Large community buy-in, including EBI. Builds on the OPS document: checklist and guidance notes! Wide range of use cases. Should be finalised by end of May; not the final URL.
  • #12
      Summary level: time-unchanging information, e.g. name, description, publisher
      Version level: version-specific information, e.g. version number, creator, etc.
      Distribution level: file-specific information, e.g. file location and format, number of triples
      Reuse vocabularies: DCTerms, DCAT, VoID, FOAF, …
      Prescribed properties: MUST, SHOULD, MAY, MUST NOT for each level
  • #14 Dataset description creator. Generates an outline description through a web form. Allows you to see the generated content.
  • #15 Given a dataset description, does it conform to the OPS guidelines? Generates error (red) and warning (orange) reports: error for MUST properties, warning for SHOULD properties, information for MAY properties.
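
The checking described in notes #12 and #15 can be sketched in Python roughly as follows. This is not the actual validator: the rule table below is an illustrative assumption built from the example properties named in the notes, not the profile's real property table, and the function and variable names are hypothetical.

```python
MUST, SHOULD, MAY, MUST_NOT = "MUST", "SHOULD", "MAY", "MUST NOT"

# Illustrative rules only (an assumption): property -> requirement, per level.
RULES = {
    "summary": {
        "dct:title": MUST,
        "dct:description": MUST,
        "dct:publisher": MUST,
        "pav:version": MUST_NOT,   # version numbers belong at version level
    },
    "version": {
        "dct:title": MUST,
        "pav:version": MUST,
        "dct:creator": SHOULD,
        "dct:issued": MAY,
    },
    "distribution": {
        "dcat:downloadURL": MUST,
        "dct:format": MUST,
        "void:triples": SHOULD,
    },
}

# Report severity for a missing property, as in note #15.
SEVERITY = {MUST: "error", SHOULD: "warning", MAY: "info"}


def validate(level, description):
    """Return (severity, message) reports for one description level.

    `description` is a dict mapping property names to values.
    """
    reports = []
    for prop, requirement in RULES[level].items():
        present = prop in description
        if requirement == MUST_NOT:
            if present:
                reports.append(("error", f"{prop} MUST NOT appear at {level} level"))
        elif not present:
            reports.append(
                (SEVERITY[requirement], f"{prop} {requirement} be provided at {level} level")
            )
    return reports


if __name__ == "__main__":
    # A summary-level description that wrongly carries a version number
    # and omits the publisher.
    summary = {
        "dct:title": "ChEMBL",
        "dct:description": "example description",
        "pav:version": "16",
    }
    for severity, message in validate("summary", summary):
        print(f"[{severity}] {message}")
```

Keeping the rules in a table separate from the checking loop means the same loop serves all three levels, and a change to the guidelines only touches the table.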