Bio Data World - The promise of FAIR data lakes - The Hyve - 20191204

Kees van Bochove, Founder, The Hyve
Reuse of R&D data and the
promise of FAIR data lakes
@keesvanbochove
BioDataWorld
Basel, 5 Dec 2019

Outline
1. FAIR Data is about people & change
2. The data lake is a passing phase
3. Forget about AI. Data & UX matter.

The Hyve
We advance biology and medical research…
… by building and serving thriving open source communities.
Services
Professional support for
open source software in
biomedical informatics
➢Software development
➢Data engineering
➢Consultancy
➢Hosting / SLAs
Core values
Share
Reuse
Specialize
Office Locations
Utrecht, The Netherlands
Cambridge, MA, United States
Customer Segments
Pharma
Life Sciences
Healthcare
Fast-growing
Started in 2012
40+ people by now

FAIR Data is
about people &
embracing change
Statement #1
@keesvanbochove @TheHyveNL

The roots of FAIR
►Public-private partnership to advance:
  ►Open Science
  ►Sustainability & reuse of data
►Workshop in Leiden in 2014
►Towards a Modular Blueprint ‘Floor-plan’ of a safe and fair Data Stewardship, Trading and Routing environment, provisionally called the Data FAIRPORT
https://www.lorentzcenter.nl/lc/web/2014/602/info.php3?wsid=602

FAIR Workshop at The Hyve in Utrecht, 2018
http://blog.thehyve.nl/blog/highlights-from-pistoia-alliances-fair-workshop
https://www.sciencedirect.com/science/article/pii/S1359644618303039

15 FAIR principles for (meta)data
http://www.nature.com/articles/sdata201618
Findable:
F1. persistent identifier
F2. metadata
F3. metadata - data link
F4. registered or indexed
Accessible:
A1. standardized protocol
A1.1. open, free and universally implementable
A1.2. authentication and authorization
A2. metadata stay accessible
Interoperable:
I1. language for knowledge representation
I2. vocabularies that follow FAIR principles
I3. qualified references to other (meta)data
Reusable:
R1. attributes
R1.1. license
R1.2. provenance
R1.3. community standards
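A few of these principles can be checked mechanically on a metadata record. The sketch below is only an illustration: the `DatasetMetadata` class, its field names and the accepted identifier prefixes are assumptions made for this example, not part of the FAIR principles or of any tool mentioned in this talk.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record; field names are illustrative only."""
    identifier: str = ""                               # F1: persistent identifier
    description: dict = field(default_factory=dict)    # F2: descriptive metadata
    access_protocol: str = ""                          # A1: standardized protocol
    license: str = ""                                  # R1.1: usage license
    provenance: str = ""                               # R1.2: where the data came from


def missing_fair_fields(md: DatasetMetadata) -> list:
    """Return the labels of the checks that this record does not pass."""
    checks = {
        # Assumption: an http(s) IRI or a DOI counts as a persistent identifier.
        "F1 persistent identifier": md.identifier.startswith(("http://", "https://", "doi:")),
        "F2 metadata": bool(md.description),
        "A1 standardized protocol": bool(md.access_protocol),
        "R1.1 license": bool(md.license),
        "R1.2 provenance": bool(md.provenance),
    }
    return [label for label, ok in checks.items() if not ok]


record = DatasetMetadata(
    identifier="https://example.org/dataset/42",
    description={"title": "Example cohort"},
    access_protocol="https",
    provenance="exported from a hypothetical LIMS",
)
print(missing_fair_fields(record))
```

Such a check only covers what can be detected automatically; most of the principles (community standards, qualified references) still need human judgement.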

The fundamental change behind FAIR
Data management (scope: project) → Data stewardship (scope: organization)

Why resilience to change matters
● Domain changes and focus shifts: new data types,
new applications, new scientific paradigms etc.
● Organizational changes: M&A, re-orgs, people
moving roles etc.
● Technology changes: new software and hardware
platforms, analysis methods, automation, ML/AI etc.

FAIR Data is
about people &
embracing change
Statement #1
● From data management
to data stewardship
● Implies cultural, process
and technical change
● Data strategy should be
resilient to change

The data lake is a
passing phase
Statement #2

Genomics England Research Environment
[Diagram labels: NHS Trusts, Airlock, Research Community]

MedMij Personal Health Apps

The classical monolith
[Diagram: multiple ETL feeds → Enterprise Data Warehouse → Business Intelligence / Analytics]

The modern (?) monolith
Enterprise Data Lake
● Ingest
● Self-service
● Pipelines
● Analytics
Architectural division (teams):
● Ingestion Team
● Data Engineering Team
● Unification Team
● Search Team
● Platform API Team
● Analytics Team
Axis of change

Decentralized data management
● IRI / identifier schemes
● Metadata standards
● Provenance standards
CDO
Data Federation
Oncology
Neuroscience
Development
ClinOps
HCS
Omics platforms
Data science
Preclinical
ADME/Tox
Biomarker dev.
RWD
Epidemiology
● Catalog function
● Data standards
● Entities / data sets
Publish
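The publish step on this slide can be sketched as a central catalog that holds only metadata, while the data stays with the team that owns it. Everything in the sketch is hypothetical: the `Catalog` and `DatasetEntry` names, the `https://data.example.org/` IRI prefix and the storage locations are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """Metadata a domain team publishes; the data itself stays where it is."""
    iri: str        # identifier following the centrally agreed IRI scheme
    domain: str     # owning team, e.g. "Oncology"
    title: str
    location: str   # where the team hosts the data (not copied centrally)


class Catalog:
    """Central function: enforces the identifier scheme and indexes entries."""

    def __init__(self, iri_prefix: str):
        self.iri_prefix = iri_prefix
        self._entries = {}

    def publish(self, entry: DatasetEntry) -> None:
        # The only central rule in this sketch: IRIs must follow the scheme.
        if not entry.iri.startswith(self.iri_prefix):
            raise ValueError(f"IRI does not follow the scheme: {entry.iri}")
        self._entries[entry.iri] = entry

    def find(self, domain: str) -> list:
        """Return all entries published by one domain, sorted by IRI."""
        matches = (e for e in self._entries.values() if e.domain == domain)
        return sorted(matches, key=lambda e: e.iri)


catalog = Catalog("https://data.example.org/")
catalog.publish(DatasetEntry(
    iri="https://data.example.org/oncology/ds1",
    domain="Oncology",
    title="Tumor panel",
    location="s3://oncology-team-bucket/ds1",
))
catalog.publish(DatasetEntry(
    iri="https://data.example.org/rwd/claims",
    domain="RWD",
    title="Claims extract",
    location="postgres://rwd-team-host/claims",
))
print([e.title for e in catalog.find("Oncology")])
```

The design choice being illustrated is that the central function validates and indexes, but never ingests: removing or reorganizing one team does not affect the others.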

Advantages of a decentralized FAIR approach
● More resilient to change: no dependency on large central functions
● Allows for an iterative data strategy operationalization (no ‘big bang’ data lake delivery needed, FAIRification can start today and locally)
● No need to shuffle people around to start a big data lake project: embed informatics and data experts directly in the research and development teams
● Centralize only standardization functions, decentralize the rest: empower teams to do their own data science and informatics
● Embrace usage of external data and collaborations: no need to ‘ingest first’ via a central function, but use & link directly

The data lake is a
passing phase
Statement #2
● Centralization is a
potential bottleneck and
a barrier to change
● The solution is in
decentralization of
storage, applications etc.
● Standards management
and data federation as
central functions

Forget about AI.
Data & UX matter.
Statement #3

Teams at The Hyve: open source communities
Research Data Management
● FAIR Data Governance consultancy
● Fairspace (meta)data management
Genomics
● Cancer data portal: cBioPortal
● Knowledge base: Open Targets
Health Data Networks
● Data warehouses: tranSMART, i2b2
● Cohort selection: Glowing Bear
● Request Portals: Podium
Real World Data
● Real world evidence: OMOP/OHDSI
● Wearables platform: RADAR-BASE

FAIR Services at The Hyve
● Semantic modelling: creating (meta)data models that allow traversal of
linked data
● Data conformance: choose the right data standard for specific problems,
align with community standards to maximize benefits from the open
science communities and precompetitive collaborations
● Data landscape: create an understanding of existing applications and
data sources in the company and readiness for FAIR
● FAIRification: get started with FAIRifying datasets, defining metadata,
appropriate standards, provenance etc.
● Data catalog: build a collaborative environment around a data catalog (e.g.
using Fairspace)
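To make "traversal of linked data" concrete, here is a minimal sketch that stores statements as subject-predicate-object triples and follows links outward from one resource. The `ex:` identifiers and predicate names are invented for illustration; a real model would use IRIs from community vocabularies and typically an RDF library or triple store rather than a Python list.

```python
from collections import deque

# Hypothetical triples: (subject, predicate, object)
TRIPLES = [
    ("ex:study1", "ex:hasSample", "ex:sample1"),
    ("ex:sample1", "ex:derivedFrom", "ex:patient1"),
    ("ex:sample1", "ex:measuredBy", "ex:assay1"),
    ("ex:study2", "ex:hasSample", "ex:sample2"),
]


def reachable(start, triples):
    """Breadth-first traversal: every resource reachable from `start`."""
    # Index outgoing links per subject so each hop is a dictionary lookup.
    outgoing = {}
    for subject, _predicate, obj in triples:
        outgoing.setdefault(subject, []).append(obj)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in outgoing.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


print(sorted(reachable("ex:study1", TRIPLES)))
```

The traversal only works because every dataset uses the same identifiers for the same things, which is what the semantic modelling service is meant to secure.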

Example: OMOP CDM v5 for RWE/RWD
● Observational
healthcare
data
● Fields defined
per domain
● Standardized
Vocabularies
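As a rough illustration of what standardized vocabularies mean in practice, the sketch below maps free-form source gender values onto standard concept IDs while building a row shaped like the OMOP PERSON table. The function name and the source values are hypothetical; the concept IDs (8507 for male, 8532 for female, 0 for no matching concept) are quoted from memory of the OMOP standardized vocabularies and should be verified against the current OHDSI vocabulary release before any real use.

```python
# Assumed mapping from source values to OMOP standard gender concepts.
GENDER_CONCEPTS = {
    "m": 8507, "male": 8507,      # MALE
    "f": 8532, "female": 8532,    # FEMALE
}
NO_MATCHING_CONCEPT = 0  # OMOP convention for values that cannot be mapped


def to_person_row(person_id, source_gender, year_of_birth):
    """Build a minimal PERSON-like row, keeping the original source value."""
    key = source_gender.strip().lower()
    return {
        "person_id": person_id,
        "gender_concept_id": GENDER_CONCEPTS.get(key, NO_MATCHING_CONCEPT),
        "year_of_birth": year_of_birth,
        "gender_source_value": source_gender,  # preserved for provenance
    }


rows = [
    to_person_row(1, "Female", 1970),
    to_person_row(2, " M ", 1985),
    to_person_row(3, "unknown", 1990),
]
for row in rows:
    print(row)
```

Keeping the source value next to the standard concept means the mapping can be audited or redone when vocabularies change.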

cBioPortal: hard-to-resist value proposition
● 4000+ citations in literature
● 20k+ unique users per month
● Local instances deployed in many pharma companies and cancer centers

Open Targets
● Integration of 20+ key public data sources for target discovery

Forget about AI.
Data & UX matter.
Statement #3
● Decision making by HIPPO
instead of by algorithms
● Make AI developers happy
with relevant FAIR data
● Strong semantics are key,
and standards can help
(e.g. OMOP, CDISC)
● Investments in UX are costly
& you should capitalize on
them (e.g. Open Targets)
● It’s a great time to build
knowledge graphs!

We advance biology and medical
sciences by building and serving
thriving open source communities
