BioPharma and the broader research community face the challenge of simply finding the appropriate internal and external datasets for downstream analytics, knowledge generation and collaboration. With datasets as the core asset, we wanted to promote both human and machine exploitability, using web-centric data cataloguing principles as described in the W3C Data on the Web Best Practices. To do so, we adopted DCAT (Data Catalog Vocabulary) and VoID (Vocabulary of Interlinked Datasets) for both RDF and non-RDF datasets at summary, version and distribution levels. Further, we’ve described datasets using a limited set of well-vetted public vocabularies, focused on cross-omics analytes and clinical features of the catalogued datasets.
Dataset Catalogs as a Foundation for FAIR* Data
1. Dataset Catalogs as a Foundation for FAIR* Data
Tom Plasterer, PhD
Research & Development Information (RDI); US Cross-Science Director
16 May 2017
* Findable, Accessible, Interoperable and Reusable
2. AstraZeneca FAIR Data Enablers
AZ-Insight and Nanopublications
Integrative Informatics Data Catalog
Differential Privacy
URI Policy
clinical trials, competitive intelligence, translational science
3. FAIR Data: Data Stewardship Survey
13 Questions, now managed by Cambridge Healthtech Institute
4. What controlled vocabularies and/or ontologies do you use for structuring and annotating your data and models?
[Bar chart: percentage of respondents, axis 0 to 90, per vocabulary]
Vocabularies: BAO - BioAssay Ontology; BioPax - Biological Pathways Ontology; CL - Cell Type ontology; CHEBI - Chemical Entities of Biological…; DO/HDO - Human Disease Ontology; GO - Gene Ontology; HPO - Human Phenotype Ontology; EFO - Experimental Factors Ontology; FMA - Foundational Model of Anatomy; ICD9/10 - International Classification…; MedDRA - Medical Dictionary for…; MESH - Medical Subject; OBI - Ontology for Biomedical…; ORDO - Orphanet Rare Disease…; RxNorm; SIO - Semantic Science; SnoMed - Systematized Nomenclature…; UBERON - Uber Anatomy Ontology; Other; All Others
Data labels (in extracted order): 31.6, 21.1, 26.3, 26.3, 36.8, 68.4, 42.1, 21.1, 15.8, 52.6, 78.9, 63.2, 15.8, 15.8, 15.8, 10.5, 52.6, 15.8, 31.6, 0
5. Survey Insights
• Data Stewardship is important for the business
• Data Stewardship is challenging
• Metadata may no longer be considered proprietary
• Best Practices for use (Governance) are not well understood
• There is a consensus emerging around vocabulary standards
6. What do MedI Researchers want the ability to do?
• Gain a greater understanding of the biology of the molecular mechanisms of diseases
• Use the human as a model organism to a greater degree
• Discover how the microbiome is involved with human pathogenesis
• Understand molecular mechanisms of drug failures
• Use patient-level clinical data to identify subphenotypes of diseases
Integrative Informatics: A hybrid approach to integrating data for Drug Discovery
@Mathew Woodwark; Pharma 2020: March 28, 2018
7. Can MedImmune researchers do these things today?
• Currently, data exists in file shares, on laptops, eLN, in silos of managed systems and unknown places
• The level of data integration is immature and fragmented
• Using systems biology approaches requires considerable time and effort
• Bioinformatics groups become a bottleneck to analyzing data
• Research scientists are not empowered to use information and knowledge to answer complex questions
Integrative Informatics: A hybrid approach to integrating data for Drug Discovery
@Mathew Woodwark; Pharma 2020: March 28, 2018
8. Data and Datasets…
Dublin-Core-Type (DCT): Dataset
• A dataset is information encoded in a defined structure (for example, lists, tables, and databases), intended to be useful for direct machine processing
Data-Catalog (DCAT): Dataset
• A collection of data, published or curated by a single source, and available for access or download in one or more formats
Vocabulary of Interlinked Datasets (VoID): Dataset
• A set of RDF triples that are published, maintained or aggregated by a single provider.
[Diagram: dct:Dataset, dcat:Dataset and void:Dataset, linked by two rdfs:subClassOf relations]
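To make the class relationships concrete, here is a minimal sketch in plain Python (no RDF library). It assumes that both dcat:Dataset and void:Dataset are declared as subclasses of dct:Dataset; the flattened diagram does not show which class each rdfs:subClassOf arrow points to, so this layout is an assumption. The helper names `superclasses` and `is_a` are hypothetical.

```python
SUBCLASS_OF = "rdfs:subClassOf"

# Assumed layout of the two subClassOf arrows from the slide.
TRIPLES = {
    ("dcat:Dataset", SUBCLASS_OF, "dct:Dataset"),
    ("void:Dataset", SUBCLASS_OF, "dct:Dataset"),
}


def superclasses(cls, triples):
    """Collect every class reachable from cls by following rdfs:subClassOf."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if subject == current and predicate == SUBCLASS_OF and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


def is_a(cls, ancestor, triples=TRIPLES):
    """True if cls is the ancestor itself or one of its (transitive) subclasses."""
    return cls == ancestor or ancestor in superclasses(cls, triples)
```

Under this assumption, a record typed as void:Dataset can be treated as a dct:Dataset by any consumer that only understands the more general class, which is the practical point of reusing the three vocabularies together.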
9. Dataset Catalogs: Findability Starts Here
A Dataset Catalog is a collection of Dataset Records
• Catalogs are needed to support FAIR (Findable) data
• Catalogs can and should support Enterprise MDM strategies
• Consumers can be internal or external
Dataset Catalogs are needed so data consumers can find Datasets
• Dataset records need sufficient metadata to support discoverability
• Dataset terms are NOT the data instance
Dataset Catalogs surface dataset provenance and enable data access
Dataset Catalogs can provide datasets for multiple consumption patterns
• Analytics readiness and fit
• ‘Walking’ across information models
10. Dataset Catalogs: Find me Datasets about:
• Projects
• Study
• Indication/Disease
• Technology
• Targets
• Cohort
• Dates
• Agent
• Therapeutic Area
• Drugs
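The "find me datasets about" facets above can be sketched as a simple faceted lookup over catalog records. This is only an illustration in plain Python: the `DatasetRecord` class, the `find_datasets` function, the facet keys and all example values are hypothetical and are not taken from the AZ Data Catalog.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """A catalog record: descriptive metadata only, not the data instance."""
    identifier: str
    title: str
    facets: dict = field(default_factory=dict)  # facet name -> set of values


def find_datasets(catalog, **criteria):
    """Return records whose facets contain every requested facet value."""
    hits = []
    for record in catalog:
        if all(value in record.facets.get(facet, set())
               for facet, value in criteria.items()):
            hits.append(record)
    return hits


# Hypothetical records for illustration.
CATALOG = [
    DatasetRecord("ds-1", "Example expression study",
                  {"indication": {"asthma"}, "technology": {"RNA-seq"}}),
    DatasetRecord("ds-2", "Example proteomics study",
                  {"indication": {"asthma"}, "technology": {"proteomics"}}),
    DatasetRecord("ds-3", "Example oncology study",
                  {"indication": {"lung cancer"}, "technology": {"RNA-seq"}}),
]

asthma_rnaseq = find_datasets(CATALOG, indication="asthma", technology="RNA-seq")
```

Here `asthma_rnaseq` holds only the first record, since it is the only one matching both facets. Each added facet narrows the result, which mirrors how a consumer would combine disease, technology, target and the other facets when searching the catalog.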
11. Best Practices: Data on the Web, Vocabulary of Interlinked Datasets
Dataset Descriptions for the Open Pharmacological Space
http://www.openphacts.org/specs/2012/WD-datadesc-20121019/
Data on the Web Best Practices
https://www.w3.org/TR/dwbp/
12. Findable: Metadata, documentation, identifiers
FAIRness Metrics (early draft)
• Addresses some but not all sub-principles.
• Nothing about how you can actually find the resource.
• Understanding the content is not specified in this principle.
Data FAIRNESS metrics
@MichelDumontier; Linked Data CoP: May 19, 2017
14. Metadata Model Stack for the AZ Data Catalog
[Layer diagram; labels as extracted]
AZ DataCatalog: ontology and instances for catalogs, datasets, distributions (could be further modularized later)
W3C and Metadata Standards: DCAT, VoID, DCTerms, DCMI, PAV, RDF/S, OWL, SKOS/SKOS-XL
Reference Master Data: AZ Taxonomies, bdm-tech, core bdm, internal, external, uniprot, umls, sio, chembl…
16. Flexible Vocabularies and Mapping Services
Public (Extended):
• Indication/Disease
• Drugs
• Targets
• Technology
Internal, Organizational:
• Therapeutic Area (business unit)
• Project
• Cohort
Mapping Services & APIs:
• Clinical Study
• Agent
Other:
• Dates
17. Validation and SHACL
az:ANYDataset
az:BDMDataset
dct:Dataset
rdfs:subClassOf
az:BDMDataset (Node Shape)
sh:targetClass az:BDMDataset
sh:and
dctypes:dataset (Node Shape)
sh:targetClass dctypes:Dataset
sh:property
az:BDMDatasetExtension (Shape)
sh:property
dct:title (Property Shape)
sh:path dct:title
sh:datatype xsd:string
sh:minCount: 1
sh:maxCount: 1
…
az:theme (Shape)
sh:path dcat:theme
sh:class bdm:Technology
sh:minCount: 1
…
For BDM dataset: at least one technology MUST be specified
Use of SHACL for Data Catalogs & Dataset Types
@Heiner Oberkampf; <internal talk> April 18, 2018
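The constraints in the shapes above can be illustrated with a small hand-rolled check. This is a sketch in plain Python, not a SHACL engine: it mirrors only the constraints shown on the slide (exactly one string dct:title; at least one dcat:theme, each of class bdm:Technology). The function name, the dictionary-based record layout and the example technology terms are assumptions made for illustration.

```python
def validate_bdm_dataset(record, technology_terms):
    """Return a list of violation messages; an empty list means the record conforms."""
    violations = []

    # dct:title: sh:datatype xsd:string, sh:minCount 1, sh:maxCount 1
    titles = record.get("dct:title", [])
    if len(titles) != 1:
        violations.append(f"dct:title: expected exactly 1 value, found {len(titles)}")
    if any(not isinstance(title, str) for title in titles):
        violations.append("dct:title: every value must be a string")

    # dcat:theme: sh:class bdm:Technology, sh:minCount 1
    themes = record.get("dcat:theme", [])
    if len(themes) < 1:
        violations.append("dcat:theme: at least one technology must be specified")
    for theme in themes:
        if theme not in technology_terms:
            violations.append(f"dcat:theme: {theme} is not a bdm:Technology")

    return violations


# Hypothetical instances of bdm:Technology and two example records.
TECHNOLOGY_TERMS = {"bdm:RNASeq", "bdm:MassSpectrometry"}
conforming = {"dct:title": ["Example study"], "dcat:theme": ["bdm:RNASeq"]}
incomplete = {"dct:title": [], "dcat:theme": []}

conforming_report = validate_bdm_dataset(conforming, TECHNOLOGY_TERMS)
incomplete_report = validate_bdm_dataset(incomplete, TECHNOLOGY_TERMS)
```

The first record produces no violations; the second produces two, one for the missing title and one for the missing technology. Note that in SHACL `sh:class` applies to every value of the path, so this sketch also flags any theme that is not a known technology term, in addition to enforcing the minimum count.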
18. Data Discoverability: Multi-phase Filtering
Phase 1: Data Catalog Filter
Phase 2: Experiment Metadata Filter
Phase 3: Ad hoc Analyses Filtering
Outbound to Data Analytics
Data Science Tools
Statistical Filtering, e.g., clinical trial with > 50 participants
Dataset Catalog Descriptions
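The three phases can be sketched as successive filters, each working on the output of the previous one. This is a hypothetical illustration in plain Python: the record fields, the function names and the example values are invented, and placing the "more than 50 participants" test in phase 2 is an assumption about how the slide's example maps onto the phases.

```python
def catalog_filter(records, technology):
    """Phase 1: shallow filter on catalog-level descriptions."""
    return [r for r in records if technology in r["technologies"]]


def experiment_metadata_filter(records, min_participants):
    """Phase 2: filter on experiment metadata, e.g. trials with > N participants."""
    return [r for r in records if r["participants"] > min_participants]


def ad_hoc_filter(records, predicate):
    """Phase 3: arbitrary analysis-specific predicate supplied by the analyst."""
    return [r for r in records if predicate(r)]


# Hypothetical catalog entries for illustration.
CATALOG = [
    {"id": "ds-1", "technologies": {"RNA-seq"}, "participants": 120, "responders": 40},
    {"id": "ds-2", "technologies": {"RNA-seq"}, "participants": 30, "responders": 10},
    {"id": "ds-3", "technologies": {"proteomics"}, "participants": 200, "responders": 90},
    {"id": "ds-4", "technologies": {"RNA-seq"}, "participants": 80, "responders": 5},
]

phase1 = catalog_filter(CATALOG, "RNA-seq")
phase2 = experiment_metadata_filter(phase1, 50)
phase3 = ad_hoc_filter(phase2, lambda r: r["responders"] / r["participants"] >= 0.25)
selected = [r["id"] for r in phase3]
```

With these example records, phase 1 keeps three datasets, phase 2 keeps two, and phase 3 leaves only "ds-1" to pass on to analytics. Cheap catalog-level checks run first so that the deeper, more expensive questions are asked of as few datasets as possible.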
21. Dataset Catalogs: Take-aways
DCTERMS, DCAT, VoID are nearly sufficient
• Extend for local needs
Public Domain Ontologies should be reused
• Consensus is emerging around best practices and cross-mapping
Use Multi-Phase Filtering for Shallow & Deep Questions
• Balance what belongs in a catalog record vs. instance data
Lots of Activity to Learn and Shape Best Practices
• Didn’t reinvent the wheel
R&D | RDI
22. Thanks
Key Influencers
David Wood
Tim Berners-Lee
Lee Harland
Jane Lomax
James Malone
Dean Allemang
Barend Mons
Carole Goble
Bernadette Hyland
Bob Stanley
Eric Little
Juan Sequeda
Michel Dumontier
John Wilbanks
Hans Constandt
Filip Pattyn
Dan Crowther
Tim Hoctor
Ian Harrow
AZ/MedImmune Linked Data Community
David Fenstermacher
Mathew Woodwark
Rajan Desai
Nic Sinibaldi
Chia-Chien Chiang
Kerstin Forsberg
Ola Engkvist
Ian Dix
Ted Slater
Martin Romacker
Eric Neumann
Jeff Saltzman
Kathy Reinold
Nirmal Keshava
Bryan Takasaki