At the Bio Data World conference in Basel in December 2019, Kees van Bochove, Founder of The Hyve, gave a talk on the reuse of pharma R&D data and on strategies for operationalizing FAIR data at scale.
Bio Data World - The promise of FAIR data lakes - The Hyve - 20191204
1. Kees van Bochove, Founder, The Hyve
Reuse of R&D data and the promise of FAIR data lakes
@keesvanbochove
BioDataWorld
Basel, 5 Dec 2019
2. Outline
1. FAIR Data is about people & change
2. The data lake is a passing phase
3. Forget about AI. Data & UX matter.
3. The Hyve
We advance biology and medical research…
… by building and serving thriving open source communities.
Services
Professional support for open source software in biomedical informatics
➢Software development
➢Data engineering
➢Consultancy
➢Hosting / SLAs
Core values
Share
Reuse
Specialize
Office Locations
Utrecht, The Netherlands
Cambridge, MA, United States
Customer Segments
Pharma
Life Sciences
Healthcare
Fast-growing
Started in 2012
40+ people by now
4. FAIR Data is about people & embracing change
Statement #1
@keesvanbochove @TheHyveNL
5. The roots of FAIR
► Public-private partnership to advance:
  ► Open Science
  ► Sustainability & reuse of data
► Workshop in Leiden in 2014
► Towards a Modular Blueprint ‘Floor-plan’ of a safe and fair Data Stewardship, Trading and Routing environment, provisionally called the Data FAIRPORT
https://www.lorentzcenter.nl/lc/web/2014/602/info.php3?wsid=602
6. FAIR Workshop at The Hyve in Utrecht, 2018
http://blog.thehyve.nl/blog/highlights-from-pistoia-alliances-fair-workshop
https://www.sciencedirect.com/science/article/pii/S1359644618303039
7. 15 FAIR principles for (meta)data
http://www.nature.com/articles/sdata201618
Accessible:
A1. standardized protocol
A1.1. open, free and universally implementable
A1.2. authentication and authorization
A2. metadata stay accessible
Reusable:
R1. attributes
R1.1. license
R1.2. provenance
R1.3. community standards
Interoperable:
I1. language for knowledge representation
I2. vocabularies that follow FAIR principles
I3. qualified references to other (meta)data
Findable:
F1. persistent identifier
F2. metadata
F3. metadata - data link
F4. registered or indexed
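A few of the principles listed above can be turned into a simple completeness check on a dataset's metadata record. The following is a minimal sketch, not an official FAIR validator: the `DatasetRecord` fields, the `fair_gaps` helper and the example DOI are all hypothetical, and the A1 check (an http/https URL) is only a crude stand-in for "retrievable via a standardized protocol".

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    identifier: str = ""                          # F1: persistent identifier
    metadata: dict = field(default_factory=dict)  # F2: descriptive metadata
    access_url: str = ""                          # A1: standardized protocol
    license: str = ""                             # R1.1: usage license
    provenance: str = ""                          # R1.2: provenance


def fair_gaps(record):
    """Return the labels of the checked principles that the record fails."""
    checks = {
        "F1": bool(record.identifier),
        "F2": bool(record.metadata),
        # Assumption: an http(s) URL is treated as a standardized protocol.
        "A1": record.access_url.startswith(("http://", "https://")),
        "R1.1": bool(record.license),
        "R1.2": bool(record.provenance),
    }
    return [label for label, ok in checks.items() if not ok]


# Usage: a record with only an identifier and some metadata.
draft = DatasetRecord(
    identifier="https://doi.org/10.1234/example",  # made-up DOI
    metadata={"title": "Example assay results"},
)
print(fair_gaps(draft))
```

Such a check only covers presence of values; whether a license is appropriate or a vocabulary itself follows FAIR principles (I2) still needs human review.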
8. The fundamental change behind FAIR
Data management (scope: project) → Data stewardship (scope: organization)
10. Why resilience to change matters
● Domain changes and focus shifts: new data types, new applications, new scientific paradigms etc.
● Organizational changes: M&A, re-orgs, people moving roles etc.
● Technology changes: new software and hardware platforms, analysis methods, automation, ML/AI etc.
11. FAIR Data is about people & embracing change
Statement #1
● From data management to data stewardship
● Implies cultural, process and technical change
● Data strategy should be resilient to change
12. The data lake is a passing phase
Statement #2
17. The modern (?) monolith
Ingest
Self-service
Pipelines
Analytics
Enterprise Data Lake
Ingestion Team, Data Engineering Team, Unification Team, Search Team, Platform API Team, Analytics Team
Architectural division
Axis of change
18. Decentralized data management
● IRI / identifier schemes
● Metadata standards
● Provenance standards
CDO
Data Federation
Oncology, Neuroscience, Development, ClinOps, HCS, Omics platforms, Data science, Preclinical, ADME/Tox, Biomarker dev., RWD, Epidemiology
● Catalog function
● Data standards
● Entities / data sets
Publish
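One way to read this slide: teams keep their data where it is and publish only metadata entries to a shared catalog, while the central function owns the identifier scheme and the metadata standard. The sketch below illustrates that split under stated assumptions. The `Catalog` class, the IRI pattern on `data.example.org` and the required field names are all hypothetical and not taken from the talk or from any product.

```python
import re

# Assumed central standards (hypothetical): one IRI scheme, one minimal
# set of metadata fields that every published entry must carry.
IRI_PATTERN = re.compile(r"^https://data\.example\.org/[a-z-]+/[A-Za-z0-9_-]+$")
REQUIRED_FIELDS = {"title", "owner_team", "provenance"}


class Catalog:
    """Central catalog that stores metadata only; data stays with the teams."""

    def __init__(self):
        self.entries = {}

    def publish(self, iri, metadata):
        """Register a data set if it follows the central standards."""
        if not IRI_PATTERN.match(iri):
            raise ValueError(f"identifier does not follow the IRI scheme: {iri}")
        missing = REQUIRED_FIELDS - set(metadata)
        if missing:
            raise ValueError(f"missing metadata fields: {sorted(missing)}")
        self.entries[iri] = dict(metadata)

    def find(self, team):
        """Return the IRIs of all data sets published by one team, sorted."""
        return sorted(
            iri for iri, meta in self.entries.items()
            if meta["owner_team"] == team
        )


# Usage: an oncology team publishes a data set it continues to host itself.
catalog = Catalog()
catalog.publish(
    "https://data.example.org/oncology/study-001",
    {"title": "Study 001", "owner_team": "oncology", "provenance": "assay export"},
)
print(catalog.find("oncology"))
```

The design choice here is that the central function rejects entries that break the standards but never ingests the data itself, which keeps it from becoming the bottleneck the talk warns about.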
19. Advantages of a decentralized FAIR approach
● More resilient to change: no dependency on large central functions
● Allows for an iterative data strategy operationalization (no ‘big bang’ data lake delivery needed, FAIRification can start today and locally)
● No need to shuffle people around to start a big data lake project: embed informatics and data experts directly in the research and development teams
● Centralize only standardization functions and decentralize the rest: empower teams to do their own data science and informatics
● Embrace usage of external data and collaborations: no need to ‘ingest first’ via a central function, but use & link directly
20. The data lake is a passing phase
Statement #2
● Centralization is a potential bottleneck and a barrier for change
● The solution is in decentralization of storage, applications etc.
● Standards management and data federation as central functions
23. Teams at The Hyve: open source communities
Research Data Management
● FAIR Data Governance consultancy
● Fairspace (meta)data management
Genomics
● Cancer data portal: cBioPortal
● Knowledge base: Open Targets
Health Data Networks
● Data warehouses: tranSMART, i2b2
● Cohort selection: Glowing Bear
● Request Portals: Podium
Real World Data
● Real world evidence: OMOP/OHDSI
● Wearables platform: RADAR-BASE
24. FAIR Services at The Hyve
● Semantic modelling: creating (meta)data models that allow traversal of linked data
● Data conformance: choose the right data standard for specific problems, align with community standards to maximize benefits from the open science communities and precompetitive collaborations
● Data landscape: create an understanding of existing applications and data sources in the company and readiness for FAIR
● FAIRification: get started with FAIRifying datasets, defining metadata, appropriate standards, provenance etc.
● Data catalog: build a collaborative environment around a data catalog (e.g. using Fairspace)
25. Example: OMOP CDM v5 for RWE/RWD
● Observational healthcare data
● Fields defined per domain
● Standardized Vocabularies
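The two ideas on this slide, fields defined per domain and standardized vocabularies, can be illustrated with a small mapping step. This is a sketch only: the field names are a subset of the OMOP CDM v5 condition_occurrence table, but the source codes, the concept IDs in `SOURCE_TO_CONCEPT` and the helper function are placeholders invented for the example. Real mappings come from the OMOP Standardized Vocabularies. Concept ID 0 follows the OMOP convention for a source value with no standard concept.

```python
from datetime import date

# Placeholder mapping from local source codes to standard concept IDs.
# Both the codes and the IDs are made up for this sketch.
SOURCE_TO_CONCEPT = {
    "LOCAL-DX-A": 1001,
    "LOCAL-DX-B": 1002,
}

UNMAPPED_CONCEPT_ID = 0  # OMOP convention for "no matching concept"


def to_condition_occurrence(person_id, source_code, start_date):
    """Build a row with a subset of condition_occurrence fields."""
    return {
        "person_id": person_id,
        "condition_concept_id": SOURCE_TO_CONCEPT.get(
            source_code, UNMAPPED_CONCEPT_ID
        ),
        "condition_start_date": start_date,
        # The original code is always kept, so nothing is lost in mapping.
        "condition_source_value": source_code,
    }


# Usage: one mapped and one unmapped diagnosis for the same person.
mapped = to_condition_occurrence(42, "LOCAL-DX-A", date(2019, 12, 5))
unmapped = to_condition_occurrence(42, "LOCAL-DX-Z", date(2019, 12, 5))
print(mapped["condition_concept_id"], unmapped["condition_concept_id"])
```

Keeping the source value next to the standard concept is what lets analyses run on the standardized vocabulary while the original coding stays auditable.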
26. cBioPortal: hard to resist value proposition
● 4000+ citations in literature
● ~20k+ unique users per month
● Local instances deployed in many pharma companies and cancer centers
28. Forget about AI. Data & UX matter.
Statement #3
● Decision making by HiPPO (highest-paid person's opinion) instead of by algorithms
● Make AI developers happy with relevant FAIR data
● Strong semantics are key, and standards can help (e.g. OMOP, CDISC)
● Investments in UX are costly & you should capitalize on them (e.g. OpenTargets)
● It’s a great time to build knowledge graphs!
30. We advance biology and medical sciences by building and serving thriving open source communities