Pistoia Alliance: Harmonizing FAIR Data Catalog Approaches webinar

November 29, 2018
Harmonizing FAIR Data Catalog Approaches
A Pistoia Alliance Debates Webinar
Moderated by Martin Romacker – Roche
Panelists Tom Plasterer - AstraZeneca
Eric Little - OSTHUS
Kees van Bochove - The Hyve
Rafael Jimenez - Elixir

This webinar is being recorded

Poll Question 1: How familiar are you with FAIR principles and
metrics?
A. Unfamiliar with FAIR principles
B. Some familiarity with FAIR principles
C. Familiar with FAIR principles
D. Some expertise with FAIR principles and metrics
E. Highly expert with FAIR principles and metrics

Poll Question 2: What is the maturity level of your organization
with respect to implementation of FAIR?
A. Don’t know
B. No plans or interest in FAIR implementation
C. Thinking about FAIR implementation
D. Several projects have FAIR implementation
E. Systematic FAIR implementation across whole organization

©PistoiaAlliance
Eric Little
Chief Data Officer
OSTHUS
Martin Romacker
Principal Scientist
Roche
Tom Plasterer
US Cross-Science Director,
R&D Information
AstraZeneca
Rafael Jimenez
Chief Data Architect
ELIXIR
Kees van Bochove
CEO
The Hyve

Harmonizing FAIR Data Catalog Approaches –
Introduction
Martin Romacker
Data and Information Architect
Pharma Research and Early Development Informatics (pREDi)
Roche Innovation Center Basel
Pistoia Alliance Webinar, 29th November 2018

Data Catalogue & FAIR Principles
High Quality Data Driving Insights
• Digital Transformation leading to a data-driven industry
• Data are more and more perceived as an asset

Doing now what patients need next

Findability and FAIR Data Catalogs
Tom Plasterer, PhD
Science & Enabling Units IT (S&EUIT), Semantic Technology Lead
29 Nov 2018
Pistoia/Linked Data Community of Practice Webinar Series
Eric Little, PhD
Chief Data Officer, Osthus

Starting Point: Questions and Iterative Data Model
core:Study
core:Project
core:Target
core:Subject
core:Drug
core:Indication
core:TherapeuticArea
core:BiologicalSample
core:Measurement
core:Technology
core:Visit
bdm:Cohort
core:hasSubject
core:hasProject
core:hasDrug
core:hasIndication
bdm:hasArm
bdm:participatesIn
core:hasTA
core:hasTarget
core:hasMeasurement
core:hasSample
core:hasVisit
core:measuredBy
How do you find datasets described with this model when you have hundreds or thousands?

Dataset Catalog is a collection of Dataset Records
• Catalogs are needed to support FAIR (Findable) data
• Catalogs can and should support Enterprise MDM strategies
• Consumers can be internal or external
Dataset Catalogs are needed so data consumers can find Datasets
• Dataset records need sufficient metadata to support discoverability
• Dataset terms are NOT the data instance
Dataset Catalogs surface dataset provenance and enable data access
Dataset Catalogs can provide datasets for multiple consumption patterns
• Analytics readiness and fit
• ‘Walking’ across information models
Dataset Catalogs: Findability Starts Here
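The distinction above between a dataset record (metadata) and the data instance itself can be illustrated with a minimal Python sketch. The class names, fields, and example URLs are hypothetical and are not taken from any catalog shown in the webinar; the record only points at the data through an access URL.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Catalog entry: metadata about a dataset, not the data itself."""
    title: str
    publisher: str
    keywords: set = field(default_factory=set)
    access_url: str = ""  # pointer to where the data instance lives


@dataclass
class DatasetCatalog:
    """A catalog is a collection of dataset records."""
    records: list = field(default_factory=list)

    def add(self, record):
        self.records.append(record)

    def find(self, keyword):
        """Return records tagged with the keyword (case-insensitive)."""
        wanted = keyword.lower()
        return [r for r in self.records
                if wanted in {k.lower() for k in r.keywords}]


# Hypothetical records for illustration only.
demo_catalog = DatasetCatalog()
demo_catalog.add(DatasetRecord("Study A biomarkers", "Team X",
                               {"oncology", "RNAseq"},
                               "https://example.org/data/a"))
demo_catalog.add(DatasetRecord("Study B imaging", "Team Y",
                               {"Oncology", "MRI"},
                               "https://example.org/data/b"))

hits = demo_catalog.find("oncology")
print([r.title for r in hits])
```

Findability here depends entirely on how well the records are tagged, which is why the slide asks for sufficient metadata on each record.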

Best Practices:
Data on the Web, Vocabulary of Interlinked Datasets
Dataset Descriptions for the Open Pharmacological Space
http://www.openphacts.org/specs/2012/WD-datadesc-20121019/
Data on the Web Best Practices
https://www.w3.org/TR/dwbp/

Dataset Catalogs: Find me Datasets about:
Projects
Study
Indication/Disease
Technology
Targets
Cohort
Dates
Agent
Therapeutic Area
Drugs

Dublin-Core-Type (DCT): Dataset
• A dataset is information encoded in a defined structure (for example,
lists, tables, and databases), intended to be useful for direct
machine processing
Data-Catalog (DCAT): Dataset
• A collection of data, published or curated by a single source, and
available for access or download in one or more formats
Vocabulary of Interlinked Datasets (VoID): Dataset
• A set of RDF triples that are published, maintained or aggregated by
a single provider.
Data and Datasets…
dct:Dataset
dcat:Dataset
void:Dataset
rdfs:subClassOf
rdfs:subClassOf

The Backbone: A DCAT conformant Data Catalog
https://www.w3.org/TR/hcls-dataset/
https://www.w3.org/TR/vocab-dcat/#vocabulary-overview
Semantic tagging of datasets with
concepts from taxonomies:
• provides context
• multi-dimensional & flexible
• effective for discoverability
• light-weight semantics
skos:Concept
dcat:Catalog
skos:ConceptScheme
dctypes:Dataset (summary)
dct:title
dct:publisher <foaf:Agent>
foaf:page
void:sparqlEndpoint
dct:accrualPeriodicity
dcat:keyword
dcat:dataset
dcat:theme
dctypes:Dataset (version)
dcat:Distribution
(dctypes:Dataset)
void:vocabulary
dct:conformsTo
void:exampleResource
…other void properties
dcat:distribution
dcat:themeTaxonomy
dct:isVersionOf
pav:previousVersion
dct:hasPart
pav:hasCurrentVersion
dct:hasPart
dct:title
pav:version
dct:creator <foaf:Agent>
dct:created
dct:source
dct:creator <foaf:Agent>
dct:license
dct:format
pav:retrievedFrom
dct:created
pav:createdWith
dcat:accessURL
dcat:downloadURL
void:Dataset
dct:title
dct:description
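The three description levels on this slide (summary, version, distribution) can be sketched as a small JSON-LD document built in Python. The property names come from the slide. The `https://example.org/catalog/` namespace, the identifiers, and the literal values are hypothetical. This sketches the general pattern and is not the AZ catalog's actual content.

```python
import json

# Well-known namespace IRIs for the vocabularies named on the slide.
CONTEXT = {
    "dct": "http://purl.org/dc/terms/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dctypes": "http://purl.org/dc/dcmitype/",
    "pav": "http://purl.org/pav/",
}

BASE = "https://example.org/catalog/"  # hypothetical namespace

# Summary level: version-independent description of the dataset.
summary = {
    "@id": BASE + "dataset/demo",
    "@type": "dctypes:Dataset",
    "dct:title": "Demo dataset",
    "dct:publisher": {"@id": BASE + "agent/demo-team"},
    "dcat:theme": [{"@id": BASE + "taxonomy/oncology"}],
    "pav:hasCurrentVersion": {"@id": BASE + "dataset/demo/v2"},
}

# Version level: one specific release, linked back to the summary.
version = {
    "@id": BASE + "dataset/demo/v2",
    "@type": "dctypes:Dataset",
    "dct:title": "Demo dataset, version 2",
    "dct:isVersionOf": {"@id": summary["@id"]},
    "pav:version": "2",
    "pav:previousVersion": {"@id": BASE + "dataset/demo/v1"},
    "dcat:distribution": {"@id": BASE + "dataset/demo/v2/csv"},
}

# Distribution level: a concrete file or endpoint for that version.
distribution = {
    "@id": BASE + "dataset/demo/v2/csv",
    "@type": ["dcat:Distribution", "dctypes:Dataset"],
    "dct:format": "text/csv",
    "dcat:downloadURL": {"@id": BASE + "files/demo-v2.csv"},
}

catalog_doc = {
    "@context": CONTEXT,
    "@graph": [
        {"@id": BASE, "@type": "dcat:Catalog",
         "dcat:dataset": {"@id": summary["@id"]}},
        summary,
        version,
        distribution,
    ],
}

print(json.dumps(catalog_doc, indent=2))
```

Keeping the three levels as separate nodes means a new release only adds a version and distribution node, while the summary record and its taxonomy tags stay stable.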

Metadata Model Stack for the AZ Data Catalog
DCAT
VoID
DCTerms
RDF/S, OWL, SKOS/SKOS-XL
AZ Taxonomies
PAV
AZ DataCatalog
ontology and instances for catalogs, datasets, distributions
(could be further modularized later)
DCMI
bdm-tech
core bdm
internal external
uniprot
umls
sio
chembl…
W3C and
Metadata
Standards
Reference
Master Data

Data Discoverability: Multi-phase Filtering
Data Catalog Filter
Phase 1
Experiment Metadata Filter
Phase 2
Ad hoc Analyses Filtering
Phase 3
Outbound
to Data Analytics
Data Science
Tools
Statistical
Filtering
e.g., clinical trial with > 50
participants
Dataset
Catalog
Descriptions

Validation and SHACL
az:ANYDataset
az:BDMDataset
dct:Dataset
rdfs:subClassOf
az:BDMDataset (Node Shape)
sh:targetClass az:BDMDataset
sh:and
dctypes:Dataset (Node Shape)
sh:targetClass dctypes:Dataset
sh:property
az:BDMDatasetExtension (Shape)
sh:property
dct:title (Property Shape)
sh:path dct:title
sh:datatype xsd:string
sh:minCount 1
sh:maxCount 1
…
az:theme (Shape)
sh:path dcat:theme
sh:class bdm:Technology
sh:minCount 1
…
For BDM dataset: at least one technology MUST be specified
Use of SHACL for Data Catalogs & Dataset
Types
@Heiner Oberkampf;
<internal talk> April 18, 2018

How FAIR is Your Data with a Catalog?
FAIR metrics: https://tinyurl.com/FAIRMetrics-ALL

Data Science Layer
machine learning, text analytics, NLP, clustering, matching, classification
Lightweight Semantic Integration Layer: FAIR Catalog
enablers: reference master data mgt., metadata mgt., semantic indexing, linking, governance, APIs
A Data-Centric World Allows Data to Be Used Effectively
Linked Open Data
& Open APIs
Semantic
Graph DB
Operational DBs
…
Unstructured
Documents
Analytics Tools
simulations
statistics
Visualization
dashboards
exploration
search
…
Semi-structured
Data
Instrument
Data
Reporting
regulatory
internal
external

R&D | RDI
DCTERMS, DCAT, VoID are nearly sufficient
• Extend for local needs
Public Domain Ontologies should be reused
• Consensus is emerging around best practices and cross-mapping
Use Multi-Phase Filtering for Shallow & Deep Questions
• Balance what belongs in a catalog record vs. instance data
Lots of Activity to Learn and Shape Best Practices
• Didn’t reinvent a wheel
Dataset Catalogs: Take-aways

Thanks
Key Influencers
David Wood
Tim Berners-Lee
Lee Harland
Jane Lomax
James Malone
Dean Allemang
Barend Mons
Carole Goble
Bernadette Hyland
Bob Stanley
Eric Little
Juan Sequeda
Michel Dumontier
John Wilbanks
Hans Constandt
Filip Pattyn
Dan Crowther
Tim Hoctor
Ian Harrow
AZ/MedImmune Linked
Data Community
David Fenstermacher
Mathew Woodwark
Rajan Desai
Nic Sinibaldi
Chia-Chien Chiang
Kerstin Forsberg
Ola Engkvist
Ian Dix
Colin Wood
Ted Slater
Martin Romacker
Eric Neumann
Jeff Saltzman
Kathy Reinold
Nirmal Keshava
Bryan Takasaki

Clinical Data Catalogs
3 case studies
2018, NOVEMBER 29th
Kees van Bochove, CEO & Founder, The Hyve
@keesvanbochove

● Cohort study / registry / biobank: Netherlands Twin Registry
● Academic medical center: Institut Curie
● Pharma company: Cataloguing the Data Lake
Clinical data catalog case studies

Teams
Research Data Management
● FAIR / Data Governance consultancy
● Fairspace (meta)data management
Cancer Genomics
● Cancer data warehouse: cBioPortal
● Knowledge base: Open Targets
Data Warehousing
● Data warehouses: tranSMART, i2b2
● Cohort selection: Glowing Bear
● Request Portals: Podium
Real World Data
● Real world evidence: OMOP/OHDSI
● Wearables platform: RADAR-BASE
● Data catalogues: CKAN, DataVerse

Case 1: Netherlands Twin Register

1. Data Catalogue: Low barrier of entry, variable level
metadata

2. Podium Request Portal: Request access to data (or samples)

3. Glowing Bear: Data selection tool for data manager (or researcher)

● Provide foundations for a comprehensive knowledge system to
make data accessible
● Foster a strong extended research network within the organization
● Translated to 4 processes / propositions:
● Understand which data is available in the institute
● Collaborate easily on data analysis
● Comply with regulations (GDPR) and best practices (e.g. FAIR)
● Manage the full data lifecycle
Case 2: Institut Curie: project goals

Research
Team A
SEMANTIC LAYER
GOVERNANCE LAYER
VIRTUAL RESEARCH DESKTOP
HOSPITAL INFORMATION SYSTEM (HIS)
CURIE DATA RESOURCE
A C
Research
Team B
AD E
Research
Team C
FBAB
Clinical annotations,
Consent
metadata metadata metadata
Sequencing data (WGS, WES, targeted)
RNAseq
Images
Institut Curie – Data Office – Vision on FAIR Data
By: X. Fernandez and J.
Guerien
FAIRSPACE
- Apache Jena
- FS Protocol
- Kubernetes
- Apps a.o.
cBioPortal

FAIRSPACE
Add your data to Fairspace
Automate it!
Share and publish your data
Annotate your data with FAIR metadata
01
02
03
04
Collaborative science in 4 steps
Metadata
Governance
Audit Logs

● Strategic programme underway to make preclinical, clinical and
real world data assets available across R&D
● Most files and legacy data warehouses moved to AWS data lake
● Computational pipelines, data analysis solutions etc. available
● Need for indexing and searching data assets on study, indication,
compound, site etc.
Case 3: Unlocking the enterprise data lake

Unlock the data lake with cataloguing
Data Lake
Data Source
Adapters
S3, GCP buckets, FS
mounts, Azure blobs etc.
File & Collection
Metadata
DCAT, DATS etc.
Logging & Auditing
Lightweight Metadata Layer
(Kubernetes microservices)
Data Governance
Access control
Request flow
FAIRSPACE Federation
Data Sources
Dataverse, C/DKAN
etc. via OAI-PMH
Semantic search
(e.g. Disqover) via
SPARQL
Analytics (e.g. SAS,
Jupyter, Matlab etc.)
Cohort & Patient
Finders (e.g.
Glowing Bear)

● Understand the driving use case for the catalog
● Filtering, browse and search: which criteria would users search on?
● Result sets: do your users / customers want to locate datasets, files, patients, samples etc?
● Strategy to populate the catalogue
● Favour automation, but some manual work and terminology management will be needed
● Some further reading: https://peerj.com/preprints/27151/
● Testing the FAIR metrics on data catalogs, comparing CKAN, Dataverse, Invenio, by The Hyve team
● Let me know if you are interested in beta testing Fairspace: kees@thehyve.nl
Conclusions

www.elixir-europe.org
Bioschemas for dataset discoverability
on the inter/intranet
Harmonizing FAIR Data Catalog Approaches
Rafael C Jimenez
Chief Data Architect
November 29, 2018
The European Open Science Cloud for
Research pilot project is funded by the
European Commission, DG Research &
Innovation under contract no. 739563
ELIXIR-EXCELERATE is funded by the European
Commission within the Research Infrastructures
programme of Horizon 2020, grant agreement
number 676559.

<div itemscope itemtype="http://schema.org/Recipe">
<h1 itemprop="name">Classic potato salad</h1>
<div itemprop="nutrition" itemscope
itemtype="http://schema.org/NutritionInformation">
Nutrition facts:
<span itemprop="calories">144 kcal</span>,
</div>
Ingredients:
- <span itemprop="recipeIngredient">800g small new potato</span>
- <span itemprop="recipeIngredient">3 shallot</span>
. . .
RDFa
JSON-LD
Microdata
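For comparison with the Microdata fragment above, the same Recipe description can be expressed as JSON-LD, one of the three syntaxes listed. The Python sketch below builds it from the properties visible on the slide only; the ingredients elided on the slide are left out, and the variable names are hypothetical.

```python
import json

# Same schema.org properties as the Microdata example on the slide.
recipe = {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Classic potato salad",
    "nutrition": {
        "@type": "NutritionInformation",
        "calories": "144 kcal",
    },
    "recipeIngredient": ["800g small new potato", "3 shallot"],
}

# JSON-LD is typically embedded in a page inside a script tag of this type.
script_tag = ('<script type="application/ld+json">\n'
              + json.dumps(recipe, indent=2)
              + "\n</script>")

print(script_tag)
```

Unlike Microdata, the JSON-LD block is separate from the visible HTML, so the markup can be generated from a metadata record without touching the page layout.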

http://schema.org/docs/full.html
More than 600 types

Bioschemas
Schema.org for life sciences
Collection of specifications
Community initiative

Bioschemas
• New life sciences schema.org types (eg. protein)
• Life sciences profiles for existing types (eg. dataset)
Types
• Proposed to Schema.org
• More descriptive
• Guidelines applying constraints to
existing Schema.org types
• Managed by Bioschemas and
specific for life sciences
• Minimum properties and best
practices for finding and accessing
data
• Focused on few types and well
defined relationships
Profiles
22 profiles, 10 types
DataCatalog Datasets DataRecord

Bioschemas
• Use case driven
• Finding data
• Presenting search results
• Metadata exchange
• Minimum information guidelines
• Link to domain ontologies
• Examples and documentation

Tools
• Buzzbang
Open-source components for software to find, crawl and use
Bioschemas markup, and for humans to search it.
• GoCrawlIt
Minimal crawler and extractor of microdata and JSON-LD metadata.
• GoWeb
Application to help publishing Bioschemas profiles on the Bioschemas
website.
• Validata
A web application for validating Bioschemas markup against the
specifications.
• Markup Builder
A web application for prototyping markup against the Bioschemas
profiles

Data resources Tools Training
Bioschemas Bioschemas Bioschemas
Search
engines
Metadata
Registries
Data
Aggregators
Bioschemas Bioschemas
Providers
Consumers

MarRef -> BioSamples
MarRef <- BioSamples
Without an API
https://github.com/EBIBioSamples/bioschemas_marref_demo/blob/master/Summary.md
Use case: data exchange
Metadata exchange

Data producers
Bioschemas
Data
Aggregators
Bioschemas
Providers
Consumers
Organization “A” private data
Data producers
Bioschemas
Data
Aggregators
Bioschemas
Providers
Consumers
Organization “B” private data
Search
Compare
Link
Exchange
Metadata
Metadata
Metadata
Metadata
• Different data models and interfaces
• Minimum common metadata agreements on a selection of
data types

Remarks
• Bioschemas is focused on findability, not data modeling
• It can be useful for some basic metadata exchange (MarRef example)
• It is a complement, not a replacement for more tailored ways to share metadata
• It is mainly used embedded in HTML, but it can be exposed via APIs or JSON files
• For data providers it is a simple way to expose metadata
• It is not just for Google, it is also helping our metadata catalogues to automatically
ingest metadata

Thanks for your attention!
200+ People
8 Events (2018)
10 Types
23 Working groups
22 Profiles
55 Live deploys
6M+ Pages

Audience Q&A
Please use the Question function in GoToWebinar

Upcoming Webinars
Data Quality in support of AI/ML: Pistoia Alliance CoE for AI Webinar Series
Date/Time: 10 Dec 2018: 4pm - 5pm GMT, 11 am ET/5pm CET
Panel members include:
Terry Stouch, Science for Solutions, Isabella Feierberg, AstraZeneca,
Jamie Powers, Cambridge Semantics,
Sirarat Sarntivijai, ELIXIR EU, Jabe Wilson, Elsevier
Knowledge Graphs for Pharma: A perspective from the PhUSE Project
'Clinical Trials Data as RDF'
Date/Time: January 24th, 2019 11am ET/4pm GMT/5pm CET
Speaker: Tim Williams (UCB and PhUSE)
