Dataset Catalogs as a Foundation for FAIR* Data

Tom Plasterer, PhD
Research & Development Information (RDI); US Cross-Science Director
16 May 2017
* Findable, Accessible, Interoperable and Reusable

AstraZeneca FAIR Data Enablers
AZ-Insight and Nanopublications
Integrative Informatics Data Catalog
Differential Privacy
URI Policy
clinical trials, competitive intelligence, translational science

FAIR Data: Data Stewardship Survey
13 Questions, now managed by Cambridge Healthtech Institute

What controlled vocabularies and/or ontologies do you use for structuring and annotating your data and models?
[Bar chart; value axis from 0 to 90. Categories:]
• BAO - BioAssay Ontology
• BioPax - Biological Pathways Ontology
• CL - Cell Type ontology
• CHEBI - Chemical Entities of Biological…
• DO/HDO - Human Disease Ontology
• GO - Gene Ontology
• HPO - Human Phenotype Ontology
• EFO - Experimental Factors Ontology
• FMA - Foundational Model of Anatomy
• ICD9/10 – International Classification…
• MedDRA – Medical Dictionary for…
• MESH - Medical Subject
• OBI - Ontology for Biomedical…
• ORDO - Orphanet Rare Disease…
• RxNorm
• SIO - Semantic Science
• SnoMed - Systematized Nomenclature…
• UBERON - Uber Anatomy Ontology
• Other
• All Others
[Bar values as extracted, in order: 31.6, 21.1, 26.3, 26.3, 36.8, 68.4, 42.1, 21.1, 15.8, 52.6, 78.9, 63.2, 15.8, 15.8, 15.8, 10.5, 52.6, 15.8, 31.6, 0]

Survey Insights
Data Stewardship is important for the business
Data Stewardship is challenging
Metadata may no longer be considered proprietary
Best Practices for use (Governance) are not well understood
There is a consensus emerging around vocabulary standards

What do MedImmune researchers want the ability to do?
• Gain a greater understanding of the biology of the molecular mechanisms of diseases
• Use the human as a model organism to a greater degree
• Discover how the microbiome is involved with human pathogenesis
• Understand the molecular mechanisms of drug failures
• Use patient-level clinical data to identify subphenotypes of diseases
Integrative Informatics: A hybrid approach to integrating data for Drug Discovery, @Mathew Woodwark; Pharma 2020: March 28, 2018

Can MedImmune researchers do these things today?
• Currently, data exists in file shares, on laptops, in eLN, in silos of managed systems and in unknown places
• The level of data integration is immature and fragmented
• Using systems biology approaches requires considerable time and effort
• Bioinformatics groups become a bottleneck to analyzing data
• Research scientists are not empowered to use information and knowledge to answer complex questions
Integrative Informatics: A hybrid approach to integrating data for Drug Discovery, @Mathew Woodwark; Pharma 2020: March 28, 2018

Data and Datasets…
Dublin Core Type (DCT): Dataset
• A dataset is information encoded in a defined structure (for example, lists, tables, and databases), intended to be useful for direct machine processing
Data Catalog Vocabulary (DCAT): Dataset
• A collection of data, published or curated by a single source, and available for access or download in one or more formats
Vocabulary of Interlinked Datasets (VoID): Dataset
• A set of RDF triples that are published, maintained or aggregated by a single provider
dcat:Dataset rdfs:subClassOf dct:Dataset
void:Dataset rdfs:subClassOf dct:Dataset
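The practical effect of the two rdfs:subClassOf links is that anything typed as a DCAT or VoID dataset can also be treated as the more general Dublin Core dataset. The following is a minimal sketch of that inference in plain Python, assuming both links point at dct:Dataset; the dictionary and the function name are illustrative only and do not come from any RDF library.

```python
# Prefixed names are handled as plain strings; no RDF library is required.
# Assumption: both subclass links on the slide point at dct:Dataset.
SUBCLASS_OF = {
    "dcat:Dataset": {"dct:Dataset"},
    "void:Dataset": {"dct:Dataset"},
}


def is_subclass_of(cls: str, ancestor: str) -> bool:
    """Return True if cls equals ancestor or reaches it via rdfs:subClassOf."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        # Follow every declared superclass of the current class.
        stack.extend(SUBCLASS_OF.get(current, ()))
    return False


print(is_subclass_of("void:Dataset", "dct:Dataset"))
print(is_subclass_of("dct:Dataset", "void:Dataset"))
```

A triple store with RDFS reasoning would normally perform this step; the sketch only shows what the subclass links imply.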

Dataset Catalogs: Findability Starts Here
A Dataset Catalog is a collection of Dataset Records
• Catalogs are needed to support FAIR (Findable) data
• Catalogs can and should support Enterprise MDM strategies
• Consumers can be internal or external
Dataset Catalogs are needed so data consumers can find Datasets
• Dataset records need sufficient metadata to support discoverability
• Dataset terms are NOT the data instance
Dataset Catalogs surface dataset provenance and enable data access
Dataset Catalogs can provide datasets for multiple consumption patterns
• Analytics readiness and fit
• ‘Walking’ across information models

Dataset Catalogs: Find me Datasets about:
• Projects
• Study
• Indication/Disease
• Technology
• Targets
• Cohort
• Dates
• Agent
• Therapeutic Area
• Drugs
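A "find me datasets about" query can be pictured as a faceted lookup over catalog records. The sketch below is a minimal illustration under stated assumptions: the DatasetRecord class, the facet names and the example records are all hypothetical and are not taken from the AZ catalog.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """A catalog record: metadata about a dataset, not the data itself."""
    title: str
    # Maps a facet name (e.g. "disease") to the set of tagged values.
    facets: dict = field(default_factory=dict)


def find_datasets(catalog, **wanted):
    """Return the records whose facets contain every requested value."""
    return [
        rec for rec in catalog
        if all(value in rec.facets.get(facet, set())
               for facet, value in wanted.items())
    ]


# Hypothetical records, for illustration only.
catalog = [
    DatasetRecord("Example dataset A", {"disease": {"asthma"}, "technology": {"RNA-seq"}}),
    DatasetRecord("Example dataset B", {"disease": {"asthma"}, "technology": {"proteomics"}}),
    DatasetRecord("Example dataset C", {"disease": {"COPD"}, "technology": {"RNA-seq"}}),
]

hits = find_datasets(catalog, disease="asthma", technology="RNA-seq")
print([rec.title for rec in hits])
```

Each keyword argument narrows the result, so adding a facet such as a therapeutic area or project only requires tagging the records with it.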

Best Practices:
Data on the Web, Vocabulary of Interlinked Datasets
Dataset Descriptions for the Open Pharmacological Space
http://www.openphacts.org/specs/2012/WD-datadesc-20121019/
Data on the Web Best Practices
https://www.w3.org/TR/dwbp/

Findable: Metadata, documentation, identifiers
FAIRness Metrics (early draft)
• Addresses some but not all sub-principles.
• Nothing about how you can actually find the resource.
• Understanding the content is not specified in this principle.
Data FAIRNESS metrics, @MichelDumontier; Linked Data CoP: May 19, 2017

The Backbone: A DCAT conformant Data Catalog
https://www.w3.org/TR/hcls-dataset/
https://www.w3.org/TR/vocab-dcat/#vocabulary-overview
Semantic tagging of datasets with concepts from taxonomies:
• provides context
• multi-dimensional & flexible
• effective for discoverability
• light-weight semantics
[Diagram: catalog and dataset description model]
Nodes: dcat:Catalog, skos:ConceptScheme, skos:Concept, dctypes:Dataset (summary), dctypes:Dataset (version), dcat:Distribution (dctypes:Dataset), void:Dataset
Relations: dcat:dataset, dcat:theme, dcat:themeTaxonomy, dcat:distribution, dct:isVersionOf, pav:previousVersion, pav:hasCurrentVersion, dct:hasPart
dctypes:Dataset (summary):
• dct:title
• dct:publisher <foaf:Agent>
• foaf:page
• void:sparqlEndpoint
• dct:accrualPeriodicity
• dcat:keyword
dctypes:Dataset (version):
• dct:title
• pav:version
• dct:creator <foaf:Agent>
• dct:created
• dct:source
dcat:Distribution (dctypes:Dataset):
• dct:creator <foaf:Agent>
• dct:license
• dct:format
• pav:retrievedFrom
• dct:created
• pav:createdWith
• dcat:accessURL
• dcat:downloadURL
void:Dataset:
• dct:title
• dct:description
• void:vocabulary
• dct:conformsTo
• void:exampleResource
• …other void properties
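The summary, version and distribution levels can be sketched as a handful of triples. This is a minimal illustration and not the AZ catalog implementation: the ex: identifiers, the describe_dataset function and the example URL are hypothetical, and the direction of each link (pav:hasCurrentVersion from summary to version, dct:isVersionOf back to the summary, dcat:distribution from version to distribution) follows the layout described in the W3C HCLS dataset description note cited on this slide.

```python
def describe_dataset(name, title, version, download_url):
    """Return (subject, predicate, object) triples linking the three levels."""
    summary = f"ex:{name}"                      # hypothetical identifiers
    versioned = f"ex:{name}-{version}"
    distribution = f"ex:{name}-{version}-rdf"
    return [
        # Summary level: stable, version-independent description.
        (summary, "rdf:type", "dctypes:Dataset"),
        (summary, "dct:title", f'"{title}"'),
        (summary, "pav:hasCurrentVersion", versioned),
        # Version level: one specific release of the dataset.
        (versioned, "rdf:type", "dctypes:Dataset"),
        (versioned, "dct:isVersionOf", summary),
        (versioned, "pav:version", f'"{version}"'),
        (versioned, "dcat:distribution", distribution),
        # Distribution level: a concrete, downloadable form of that release.
        (distribution, "rdf:type", "dcat:Distribution"),
        (distribution, "dcat:downloadURL", f"<{download_url}>"),
    ]


triples = describe_dataset("demo", "Demo dataset", "v1", "http://example.org/demo.ttl")
for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

Keeping the three levels as separate subjects lets a catalog answer both "what is this dataset" and "which file do I download for the current version" without mixing the two.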

Metadata Model Stack for the AZ Data Catalog
AZ DataCatalog: ontology and instances for catalogs, datasets, distributions (could be further modularized later)
W3C and Metadata Standards:
• DCAT, VoID, DCTerms, PAV, DCMI
• RDF/S, OWL, SKOS/SKOS-XL
Reference Master Data:
• AZ Taxonomies
• internal: bdm-tech, core bdm
• external: uniprot, umls, sio, chembl…

Flexible Vocabularies and Mapping Services
Public (Extended):
• Indication/Disease
• Drugs
• Targets
• Technology
Internal, Organizational:
• Therapeutic Area (business unit)
• Project
• Cohort
Mapping Services & APIs:
• Clinical Study
• Agent
Other:
• Dates

Validation and SHACL
Classes: az:ANYDataset, az:BDMDataset, dct:Dataset (rdfs:subClassOf)
az:BDMDataset (Node Shape)
    sh:targetClass az:BDMDataset
    sh:and
        dctypes:Dataset (Node Shape)
            sh:targetClass dctypes:Dataset
            sh:property
                dct:title (Property Shape)
                    sh:path dct:title
                    sh:datatype xsd:string
                    sh:minCount 1
                    sh:maxCount 1
                …
        az:BDMDatasetExtension (Shape)
            sh:property
                az:theme (Shape)
                    sh:path dcat:theme
                    sh:class bdm:Technology
                    sh:minCount 1
                …
For BDM dataset: at least one technology MUST be specified
Use of SHACL for Data Catalogs & Dataset Types, @Heiner Oberkampf; <internal talk> April 18, 2018
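The two rules shown here (exactly one string title, and at least one technology theme) can be illustrated with a small hand-written check. This sketch is not a SHACL engine and does not evaluate shapes; it only mirrors the stated rules in plain Python. The validate_bdm_dataset function, the record layout and the example terms are hypothetical.

```python
def validate_bdm_dataset(record, technology_terms):
    """Return a list of violation messages; an empty list means the record conforms."""
    violations = []

    # Rule 1: dct:title must occur exactly once and be a string.
    titles = record.get("dct:title", [])
    if len(titles) != 1:
        violations.append(f"dct:title: expected exactly 1 value, found {len(titles)}")
    elif not isinstance(titles[0], str):
        violations.append("dct:title: value must be a string")

    # Rule 2: at least one dcat:theme must be a known technology term.
    themes = record.get("dcat:theme", [])
    if not any(theme in technology_terms for theme in themes):
        violations.append("dcat:theme: at least one bdm:Technology term is required")

    return violations


# Hypothetical vocabulary and records, for illustration only.
technologies = {"bdm:RNASeq", "bdm:Proteomics"}
good = {"dct:title": ["Demo dataset"], "dcat:theme": ["bdm:RNASeq", "ex:Asthma"]}
bad = {"dct:title": [], "dcat:theme": ["ex:Asthma"]}

print(validate_bdm_dataset(good, technologies))
print(validate_bdm_dataset(bad, technologies))
```

In practice the shapes would be written in SHACL and run by a validator, which keeps the rules declarative and lets them travel with the catalog model.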

Data Discoverability: Multi-phase Filtering
• Phase 1: Data Catalog Filter (Dataset Catalog Descriptions)
• Phase 2: Experiment Metadata Filter (e.g., clinical trial with > 50 participants)
• Phase 3: Ad hoc Analyses Filtering (Statistical Filtering)
Outbound to Data Analytics, Data Science Tools
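The three phases can be read as successive filters, each using more detailed information than the last. The sketch below is a minimal illustration under stated assumptions: the function names, the record fields and the example data are hypothetical, and the phase 3 test (a simple mean threshold) merely stands in for whatever statistical filter an analysis would apply.

```python
def phase1_catalog(catalog, theme):
    """Phase 1: keep datasets whose catalog record carries the theme."""
    return [d for d in catalog if theme in d["themes"]]


def phase2_experiment(datasets, min_participants):
    """Phase 2: keep datasets whose experiment metadata exceeds a threshold."""
    return [d for d in datasets if d["participants"] > min_participants]


def phase3_ad_hoc(datasets, predicate):
    """Phase 3: apply an analysis-specific test to the instance data."""
    return [d for d in datasets if predicate(d["values"])]


# Hypothetical records mixing catalog, experiment and instance-level fields.
catalog = [
    {"id": "ds1", "themes": {"clinical trial"}, "participants": 120, "values": [0.9, 1.1, 1.4]},
    {"id": "ds2", "themes": {"clinical trial"}, "participants": 30, "values": [2.0, 2.2]},
    {"id": "ds3", "themes": {"in vitro"}, "participants": 0, "values": [0.1]},
    {"id": "ds4", "themes": {"clinical trial"}, "participants": 80, "values": [0.2, 0.3]},
]

selected = phase3_ad_hoc(
    phase2_experiment(phase1_catalog(catalog, "clinical trial"), 50),
    lambda values: sum(values) / len(values) > 1.0,
)
print([d["id"] for d in selected])
```

Running the cheap catalog filter first means the expensive instance-level test only touches datasets that already match the shallow question.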

Example: Graph Model
[Graph diagram. Named Graphs legend: catalog, BDM, inferred]
Nodes:
• azds:cp1071
• “CP1071 RDF Dataset”
• CHEMBL1743039
• P15509
• RIA
• Project Instance
• v2
• kqsp092
Classes: dctypes:Dataset, bdm:Project, core:Project, owl:NamedIndividual
Edge labels: rdf:type, dcterms:title, dcat:theme, core:hasDrug, core:hasProject, core:hasTarget, core:hasTherapeuticArea, pav:hasCurrentVersion, pav:createdBy, bdm:createdBy?

Dataset Catalogs: Take-aways
DCTERMS, DCAT, VoID are nearly sufficient
• Extend for local needs
Public Domain Ontologies should be reused
• Consensus is emerging around best practices and cross-mapping
Use Multi-Phase Filtering for Shallow & Deep Questions
• Balance what belongs in a catalog record vs. instance data
Lots of Activity to Learn and Shape Best Practices
• Didn’t reinvent the wheel

Thanks
Key Influencers
David Wood
Tim Berners-Lee
Lee Harland
Jane Lomax
James Malone
Dean Allemang
Barend Mons
Carole Goble
Bernadette Hyland
Bob Stanley
Eric Little
Juan Sequeda
Michel Dumontier
John Wilbanks
Hans Constandt
Filip Pattyn
Dan Crowther
Tim Hoctor
Ian Harrow
AZ/MedImmune Linked Data Community
David Fenstermacher
Mathew Woodwark
Rajan Desai
Nic Sinibaldi
Chia-Chien Chiang
Kerstin Forsberg
Ola Engkvist
Ian Dix
Ted Slater
Martin Romacker
Eric Neumann
Jeff Saltzman
Kathy Reinold
Nirmal Keshava
Bryan Takasaki
