Multiple groups in the life sciences community have started their journey towards data FAIR-ification by implementing Data Catalogs, a clear first step towards Finding your data. While in many cases the approaches are quite similar, in both origin and intent, differing implementations could end up hampering interoperability and reuse. The Pistoia Alliance and the Linked Data Community of Practice hosted a panel discussion looking at three implementations and their downstream goals:
[1] Pharma cross-omics data catalogs,
[2] Clinical data catalogs
[3] Bioschemas for dataset discoverability on the inter/intranet
Pistoia Alliance: Harmonizing FAIR Data Catalog Approaches webinar
1. November 29, 2018
Harmonizing FAIR Data Catalog Approaches
A Pistoia Alliance Debates Webinar
Moderated by Martin Romacker – Roche
Panelists Tom Plasterer - AstraZeneca
Eric Little - OSTHUS
Kees van Bochove - The Hyve
Rafael Jimenez - Elixir
3. Poll Question 1: How familiar are you with FAIR principles and
metrics?
A. Unfamiliar with FAIR principles
B. Some familiarity with FAIR principles
C. Familiar with FAIR principles
D. Some expertise with FAIR principles and metrics
E. Highly expert with FAIR principles and metrics
4. Poll Question 2: What is the maturity level of your organization
with respect to implementation of FAIR?
A. Don’t know
B. No plans or interest in FAIR implementation
C. Thinking about FAIR implementation
D. Several projects have FAIR implementation
E. Systematic FAIR implementation across whole organization
6. Harmonizing FAIR Data Catalog Approaches –
Introduction
Martin Romacker
Data and Information Architect
Pharma Research and Early Development Informatics (pREDi)
Roche Innovation Center Basel
Pistoia Alliance Webinar, 29th November 2018
7. Data Catalogue & FAIR Principles
High Quality Data Driving Insights
• Digital Transformation leading to a data-driven Industry
• Data are more and more perceived as an asset
9. Findability and FAIR Data Catalogs
Tom Plasterer, PhD
Science & Enabling Units IT (S&EUIT), Semantic Technology Lead
29 Nov 2018
Pistoia/Linked Data Community of Practice Webinar Series
Eric Little, PhD
Chief Data Officer, Osthus
10. Starting Point: Questions and Iterative Data Model
core:Study
core:Project
core:Target
core:Subject
core:Drug
core:Indication core:TherapeuticArea
core:BiologicalSample
core:Measurement core:Technology core:Visit
bdm:Cohort
core:hasSubject
core:hasProject
core:hasDrug
core:hasIndication
bdm:hasArm
bdm:participatesIn
core:hasTA
core:hasTarget
core:hasMeasurement
core:hasSample
core:hasVisit
core:measuredBy
How do you find datasets described with this model when you have hundreds or thousands?
11. Dataset Catalogs: Findability Starts Here
A Dataset Catalog is a collection of Dataset Records
• Catalogs are needed to support FAIR (Findable) data
• Catalogs can and should support Enterprise MDM strategies
• Consumers can be internal or external
Dataset Catalogs are needed so data consumers can find Datasets
• Dataset records need sufficient metadata to support discoverability
• Dataset terms are NOT the data instance
Dataset Catalogs surface dataset provenance and enable data access
Dataset Catalogs can provide datasets for multiple consumption patterns
• Analytics readiness and fit
• ‘Walking’ across information models
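The idea that a catalog holds dataset records (metadata about data, not the data itself) can be sketched in a few lines of Python. This is a minimal illustration only: the class names, field names, identifiers and URLs are hypothetical and are not taken from any of the catalogs presented in the webinar.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """A catalog entry: metadata describing a dataset, not the data instance."""
    identifier: str
    title: str
    access_url: str            # where the data can be requested or downloaded
    publisher: str = ""        # simple provenance: who published or curated it
    keywords: set = field(default_factory=set)


@dataclass
class DatasetCatalog:
    """A collection of dataset records that consumers can search."""
    records: list = field(default_factory=list)

    def add(self, record: DatasetRecord) -> None:
        self.records.append(record)

    def find(self, *terms: str) -> list:
        """Return records whose keywords contain every search term (case-insensitive)."""
        wanted = {t.lower() for t in terms}
        return [r for r in self.records
                if wanted <= {k.lower() for k in r.keywords}]


# Hypothetical example records.
catalog = DatasetCatalog()
catalog.add(DatasetRecord("ds-001", "Example RNA-seq study",
                          "https://example.org/ds-001", "Example Lab",
                          {"Oncology", "RNA-seq"}))
catalog.add(DatasetRecord("ds-002", "Example proteomics study",
                          "https://example.org/ds-002", "Example Lab",
                          {"Oncology", "Proteomics"}))

matches = [r.identifier for r in catalog.find("oncology", "rna-seq")]
print(matches)
```

The search touches only the record metadata; the access URL is what leads a consumer from the record to the data.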
12. Best Practices: Data on the Web, Vocabulary of Interlinked Datasets
Dataset Descriptions for the Open Pharmacological Space
http://www.openphacts.org/specs/2012/WD-datadesc-20121019/
Data on the Web Best Practices
https://www.w3.org/TR/dwbp/
13. Dataset Catalogs: Find me Datasets about:
Projects
Study
Indication/Disease
Technology
Targets
Cohort
Dates
Agent
Therapeutic Area
Drugs
14. Data and Datasets…
Dublin-Core-Type (DCT): Dataset
• A dataset is information encoded in a defined structure (for example, lists, tables, and databases), intended to be useful for direct machine processing
Data-Catalog (DCAT): Dataset
• A collection of data, published or curated by a single source, and available for access or download in one or more formats
Vocabulary of Interlinked Datasets (VoID): Dataset
• A set of RDF triples that are published, maintained or aggregated by a single provider.
[Class diagram: dct:Dataset, dcat:Dataset and void:Dataset, linked by two rdfs:subClassOf relations]
16. Metadata Model Stack for the AZ Data Catalog
DCAT
VoID
DCTerms
RDF/S, OWL, SKOS/SKOS-XL
AZ Taxonomies
PAV
AZ DataCatalog
ontology and instances for catalogs, datasets, distributions
(could be further modularized later)
DCMI
bdm-tech
core bdm
internal external
uniprot
umls
sio
chembl…
W3C and Metadata Standards
Reference Master Data
17. Data Discoverability: Multi-phase Filtering
Phase 1: Data Catalog Filter
Phase 2: Experiment Metadata Filter
Phase 3: Ad hoc Analyses Filtering
Outbound to Data Analytics / Data Science Tools
Statistical Filtering, e.g., clinical trial with > 50 participants
Dataset Catalog Descriptions
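The three filtering phases can be read as successive narrowing steps over dataset records. The sketch below is one hedged interpretation, not the AstraZeneca implementation: the record fields, the function names and the assignment of the "> 50 participants" criterion to the second phase are all assumptions made for illustration.

```python
# Hypothetical catalog records; all field names and values are illustrative.
records = [
    {"id": "ds-001", "indication": "asthma", "participants": 120, "technology": "RNA-seq"},
    {"id": "ds-002", "indication": "asthma", "participants": 30, "technology": "RNA-seq"},
    {"id": "ds-003", "indication": "asthma", "participants": 75, "technology": "proteomics"},
    {"id": "ds-004", "indication": "COPD", "participants": 200, "technology": "RNA-seq"},
]


def catalog_filter(items, indication):
    """Phase 1: shallow filter on catalog-level descriptions."""
    return [r for r in items if r["indication"] == indication]


def experiment_filter(items, min_participants):
    """Phase 2: filter on experiment metadata (assumed home of the > 50 criterion)."""
    return [r for r in items if r["participants"] > min_participants]


def ad_hoc_filter(items, predicate):
    """Phase 3: arbitrary analysis-specific filtering supplied by the analyst."""
    return [r for r in items if predicate(r)]


phase1 = catalog_filter(records, indication="asthma")
phase2 = experiment_filter(phase1, min_participants=50)
phase3 = ad_hoc_filter(phase2, lambda r: r["technology"] == "RNA-seq")

print([r["id"] for r in phase3])
```

Keeping the first phase limited to catalog-level fields keeps catalog records small, while deeper questions are answered against experiment metadata or the data itself.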
21. How FAIR is Your Data with a Catalog?
FAIR metrics: https://tinyurl.com/FAIRMetrics-ALL
22. A Data-Centric World Allows Effective Use of Data
Data Science Layer
machine learning, text analytics, NLP, clustering, matching, classification
Lightweight Semantic Integration Layer: FAIR Catalog
enablers: reference master data mgt., metadata mgt., semantic indexing, linking, governance, APIs
Linked Open Data & Open APIs, Semantic Graph DB, Operational DBs, Unstructured Documents, Semi-structured Data, Instrument Data, …
Analytics Tools: simulations, statistics
Visualization: dashboards, exploration, search, …
Reporting: regulatory, internal, external
23. Dataset Catalogs: Take-aways
R&D | RDI
DCTERMS, DCAT, VoID are nearly sufficient
• Extend for local needs
Public Domain Ontologies should be reused
• Consensus is emerging around best practices and cross-mapping
Use Multi-Phase Filtering for Shallow & Deep Questions
• Balance what belongs in a catalog record vs. instance data
Lots of Activity to Learn and Shape Best Practices
• Didn’t reinvent the wheel
24. Thanks
Key Influencers
David Wood
Tim Berners-Lee
Lee Harland
Jane Lomax
James Malone
Dean Allemang
Barend Mons
Carole Goble
Bernadette Hyland
Bob Stanley
Eric Little
Juan Sequeda
Michel Dumontier
John Wilbanks
Hans Constandt
Filip Pattyn
Dan Crowther
Tim Hoctor
Ian Harrow
AZ/MedImmune Linked
Data Community
David Fenstermacher
Mathew Woodwark
Rajan Desai
Nic Sinibaldi
Chia-Chien Chiang
Kerstin Forsberg
Ola Engkvist
Ian Dix
Colin Wood
Ted Slater
Martin Romacker
Eric Neumann
Jeff Saltzman
Kathy Reinold
Nirmal Keshava
Bryan Takasaki
25. Clinical Data Catalogs
3 case studies
2018, NOVEMBER 29th
Kees van Bochove, CEO & Founder, The Hyve
@keesvanbochove
26. Clinical data catalog case studies
● Cohort study / registry / biobank: Netherlands Twin Register
● Academic medical center: Institut Curie
● Pharma company: Cataloguing the Data Lake
27. Teams
Research Data Management
● FAIR / Data Governance consultancy
● Fairspace (meta)data management
Cancer Genomics
● Cancer data warehouse: cBioPortal
● Knowledge base: Open Targets
Data Warehousing
● Data warehouses: tranSMART, i2b2
● Cohort selection: Glowing Bear
● Request Portals: Podium
Real World Data
● Real world evidence: OMOP/OHDSI
● Wearables platform: RADAR-BASE
● Data catalogues: CKAN, DataVerse
29. Case 1: Netherlands Twin Register
1. Data Catalogue: Low barrier of entry, variable level metadata
30. Case 1: Netherlands Twin Register
2. Podium Request Portal: Request access to data (or samples)
31. Case 1: Netherlands Twin Register
3. Glowing Bear: Data selection tool for data manager (or researcher)
32. Case 2: Institut Curie: project goals
● Provide foundations for a comprehensive knowledge system to make data accessible
● Foster a strong extended research network within the organization
● Translated to 4 processes / propositions:
  ● Understand which data is available in the institute
  ● Collaborate easily on data analysis
  ● Comply with regulations (GDPR) and best practices (e.g. FAIR)
  ● Manage the full data lifecycle
33. Institut Curie – Data Office – Vision on FAIR Data
By: X. Fernandez and J. Guerien
[Architecture diagram. Elements: Research Teams A, B and C; Virtual Research Desktop; Governance Layer; Semantic Layer; Curie Data Resource; Hospital Information System (HIS). Data shown: clinical annotations, consent, sequencing data (WGS, WES, targeted), RNAseq, images, each with metadata.]
FAIRSPACE
- Apache Jena
- FS Protocol
- Kubernetes
- Apps a.o. cBioPortal
34. FAIRSPACE: Collaborative science in 4 steps
● Add your data to Fairspace
● Automate it!
● Share and publish your data
● Annotate your data with FAIR metadata
Metadata
Governance
Audit Logs
35. Case 3: Unlocking the enterprise data lake
● Strategic programme underway to make preclinical, clinical and real world data assets available across R&D
● Most files and legacy data warehouses moved to AWS data lake
● Computational pipelines, data analysis solutions etc. available
● Need for indexing and searching data assets on study, indication, compound, site etc.
36. Unlock the data lake with cataloguing
Data Lake
Data Source Adapters: S3, GCP buckets, FS mounts, Azure blobs etc.
File & Collection Metadata: DCAT, DATS etc.
Logging & Auditing
Lightweight Metadata Layer (Kubernetes microservices)
Data Governance: Access control, Request flow
FAIRSPACE
Federation
Data Sources: Dataverse, C/DKAN etc. via OAI-PMH
Semantic search (e.g. Disqover) via SPARQL
Analytics (e.g. SAS, Jupyter, Matlab etc.)
Cohort & Patient Finders (e.g. Glowing Bear)
37. Conclusions
● Understand the driving use case for the catalog
  ● Filtering, browse and search: which criteria would users search on?
  ● Result sets: do your users / customers want to locate datasets, files, patients, samples etc?
● Strategy to populate the catalogue
  ● Favour automation, but some manual work and terminology management will be needed
● Some further reading: https://peerj.com/preprints/27151/
  ● Testing the FAIR metrics on data catalogs, comparing CKAN, Dataverse, Invenio, by The Hyve team
● Let me know if you are interested in beta testing Fairspace: kees@thehyve.nl
38. www.elixir-europe.org
Bioschemas for dataset discoverability
on the inter/intranet
Harmonizing FAIR Data Catalog Approaches
Rafael C Jimenez
Chief Data Architect
November 29, 2018
The European Open Science Cloud for
Research pilot project is funded by the
European Commission, DG Research &
Innovation under contract no. 739563
ELIXIR-EXCELERATE is funded by the European
Commission within the Research Infrastructures
programme of Horizon 2020, grant agreement
number 676559.
46. Bioschemas
• New life sciences schema.org types (e.g. protein)
• Life sciences profiles for existing types (e.g. dataset)
Types (10)
• Proposed to Schema.org
• More descriptive
Profiles (22)
• Guidelines applying constraints to existing Schema.org types
• Managed by Bioschemas and specific for life sciences
• Minimum properties and best practices for finding and accessing data
• Focused on few types and well defined relationships
DataCatalog, Datasets, DataRecord
47. Bioschemas
• Use case driven
• Finding data
• Presenting search results
• Metadata exchange
• Minimum information guidelines
• Link to domain ontologies
• Examples and documentation
48. Tools
• Buzzbang
Open-source components for software to find, crawl and use
Bioschemas markup, and for humans to search it.
• GoCrawlIt
Minimal crawler and extractor of microdata and JSON-LD metadata.
• GoWeb
Application to help publish Bioschemas profiles on the Bioschemas
website.
• Validata
A web application for validating Bioschemas markup against the
specifications.
• Markup Builder
A web application for prototyping markup against the Bioschemas
profiles
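To make the crawling and extraction step concrete, the sketch below pulls JSON-LD blocks out of an HTML page using only the Python standard library. It is not the implementation of any of the tools listed above; the class name, the sample page and its URL are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed markup rather than abort the crawl


# Hypothetical page carrying schema.org Dataset markup.
page = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example dataset", "url": "https://example.org/dataset/1"}
</script>
</head><body><p>Landing page</p></body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(page)
print(extractor.items[0]["@type"], extractor.items[0]["name"])
```

A real crawler would also need to fetch pages, follow sitemaps and handle microdata, which this sketch leaves out.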
49. [Diagram: Bioschemas providers and consumers. Labels: data resources, tools, training (each with Bioschemas); search engines, metadata registries, data aggregators.]
50. Use case: data exchange
Metadata exchange
MarRef -> BioSamples
MarRef <- BioSamples
Without an API
https://github.com/EBIBioSamples/bioschemas_marref_demo/blob/master/Summary.md
51. [Diagram: Organization “A” private data and Organization “B” private data, each with data producers, data aggregators, Bioschemas providers and consumers; metadata flows between the two organizations to search, compare, link and exchange.]
• Different data models and interfaces
• Minimum common metadata agreements on a selection of data types
52. Remarks
• Bioschemas is focused on findability, not data modeling
• It can be useful for some basic metadata exchange (MarRef example)
• It is a complement, not a replacement for more tailored ways to share metadata
• It is mainly used embedded in HTML, but it can be exposed via APIs or JSON files
• For data providers it is a simple way to expose metadata
• It is not just for Google, it is also helping our metadata catalogues to automatically
ingest metadata
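As a hedged illustration of exposing metadata embedded in HTML, the sketch below builds schema.org Dataset markup as JSON-LD and wraps it in a script tag. The function name, the example values and the choice of properties are assumptions; the minimum properties actually required are defined by the Bioschemas Dataset profile, which should be consulted before publishing.

```python
import json


def dataset_markup(name, description, url, keywords, catalog_name):
    """Return an HTML <script> block with schema.org Dataset JSON-LD.

    The properties used (name, description, url, keywords,
    includedInDataCatalog) exist in schema.org; the selection here is
    illustrative and is not the Bioschemas minimum set.
    """
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": keywords,
        "includedInDataCatalog": {"@type": "DataCatalog", "name": catalog_name},
    }
    body = json.dumps(doc, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical dataset landing page metadata.
snippet = dataset_markup(
    name="Example marine metagenome samples",
    description="Illustrative record used to show the markup shape.",
    url="https://example.org/datasets/42",
    keywords=["metagenomics", "marine"],
    catalog_name="Example Data Catalog",
)
print(snippet)
```

The same dictionary can be served from an API or written to a JSON file, so one metadata source can feed both search engines and internal catalogues.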
53. Thanks for your attention!
200+ People
8 Events (2018)
10 Types
23 Working groups
22 Profiles
55 Live deploys
6M+ Pages
55. Upcoming Webinars
Data Quality in support of AI/ML: Pistoia Alliance CoE for AI Webinar Series
Date/Time: 10 Dec 2018: 4pm - 5pm GMT, 11 am ET/5pm CET
Panel members include:
Terry Stouch, Science for Solutions, Isabella Feierberg, AstraZeneca,
Jamie Powers, Cambridge Semantics,
Sirarat Sarntivijai, ELIXIR EU, Jabe Wilson, Elsevier
Knowledge Graphs for Pharma: A perspective from the PhUSE Project
'Clinical Trials Data as RDF'
Date/Time: January 24th, 2019 11am ET/4pm GMT/5pm CET
Speaker: Tim Williams (UCB and PhUSE)