Managing Biomedical Data and Metadata in Large Scale Collaborations

November 28, 2018

Copyright ©2018. All Rights Reserved. Confidential Databiology Ltd.
Why Metadata?
▪ What is Metadata?
− Content
− Context
− Process
▪ Metadata is not always derived from the artefact directly, but is obtained from multiple sources
▪ Metadata semantics are key to unlocking findability, provenance and usability of data artefacts

Why Collaboration?
▪ Data continues to accumulate at an exponential rate
− Multiple efforts are capturing anything conceivable
− The line between study data and non-study data is blurring
▪ Demand for data continues to grow
− Everyone hungers for high-quality, consented biomedical datasets
− Regulation such as GDPR points to the need for large-scale consent management capability
▪ Generating and storing all data in-house no longer makes business sense

Status Quo
▪ Data is produced in silos
− Specialized systems: clinical, prescriptions, lab, imaging, sequencing, sensors, etc.
▪ No single warehouse of everything for everyone
− For the foreseeable future there will always be some (largish) degree of federation
− No single data science platform can cater to everyone
▪ No single view on the data
− No use case needs all the data
− Each use case needs a unique combination of data

Obstacles to Collaboration
▪ Working with data
− Data Access
o Non-local data
o Data islands
o Multi-disciplinary
− Data Preparation
o Data normalization
o Grunt work for data scientists
▪ Working together – sharing vs collaborating
− Involvement of different organizations
− Differing methods of processing
▪ Regulation, contracts and audit

The Common Approaches: Aggregation
▪ Aggregation: Central data warehouse with a corresponding API layer for querying very large data sets quickly
▪ Common Challenges
− The line between data and metadata is blurred
− Scalability
− Cost
− Access controls

The Common Approaches: Standardization
▪ Standardization: Common Data Models and APIs to obtain information from different custodians
− Many standards
− They are all in flux
− Big effort to implement and to maintain
− Coverage
[Figure labels: Analytics Coverage, Standards Coverage]

The Common Approaches: Federation
▪ Federation: Building on aggregation and standardization, query multiple data custodians and deliver aggregate answers
− Standardizing queries
− Authentication / Authorization
− Normalization
− Performance
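
The federation approach on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Custodian` class, its `count_matching` method, the `federated_count` helper and the sample records are all hypothetical and do not represent any real platform API. The sketch shows one standardized query being sent to several custodians, authorization being checked at each custodian, and only aggregate counts being returned.

```python
from dataclasses import dataclass, field


@dataclass
class Custodian:
    """Hypothetical data custodian that keeps its records local."""
    name: str
    records: list = field(default_factory=list)
    authorized_users: set = field(default_factory=set)

    def count_matching(self, user: str, query: dict) -> int:
        # Authentication / authorization is enforced by each custodian.
        if user not in self.authorized_users:
            raise PermissionError(f"{user} may not query {self.name}")
        # Only an aggregate count leaves the custodian, never the records.
        return sum(
            all(rec.get(key) == value for key, value in query.items())
            for rec in self.records
        )


def federated_count(custodians, user, query):
    """Send one standardized query to every custodian and sum the answers."""
    per_custodian = {}
    for custodian in custodians:
        try:
            per_custodian[custodian.name] = custodian.count_matching(user, query)
        except PermissionError:
            per_custodian[custodian.name] = None  # access denied, no answer
    total = sum(n for n in per_custodian.values() if n is not None)
    return total, per_custodian


# Made-up sample data for illustration only.
site_a = Custodian("site_a", [{"dx": "T2D", "sex": "F"}, {"dx": "T2D", "sex": "M"}], {"alice"})
site_b = Custodian("site_b", [{"dx": "T2D", "sex": "F"}], {"alice"})
site_c = Custodian("site_c", [{"dx": "T2D", "sex": "F"}], set())

total, detail = federated_count([site_a, site_b, site_c], "alice", {"dx": "T2D", "sex": "F"})
```

Returning the per-custodian breakdown alongside the total makes it visible when a custodian declined to answer, which matters when interpreting an aggregate.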

Metadata and Conway’s Law
“Organizations which design systems (in the broad sense) ... are constrained to produce designs which are copies of the communication structures of these organizations.”
Conway’s Law, Melvin Conway, Datamation, 1968

Metadata – Description of Data Artefacts
▪ One person's metadata is another person's data
▪ Collaborate and establish the broadest consensus for a given data type
− Minimum viable standard metadata model across custodians
− Further enriched with contextual data specialized per study
− Requirements:
o Handling the presence of unexpected data as well as the absence of expected data
o Propagation of change and impact on provenance
▪ Data model needs to be accommodating: ideally standardized summary data with ad hoc extensions by interest
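
A minimal sketch of such an accommodating model, assuming a hypothetical three-field core (`sample_id`, `data_type`, `custodian`) that is not taken from any published standard. Expected fields that are absent are reported as issues rather than causing a failure, and unexpected fields are preserved as ad hoc extensions rather than being dropped.

```python
# Hypothetical minimum viable core model: field name -> expected type.
CORE_FIELDS = {"sample_id": str, "data_type": str, "custodian": str}


def normalize(raw: dict) -> dict:
    """Split a raw metadata record into core fields, extensions and issues."""
    core, issues = {}, []
    for name, expected_type in CORE_FIELDS.items():
        if name not in raw:
            # Absence of expected data is recorded, not fatal.
            issues.append(f"missing expected field: {name}")
        elif not isinstance(raw[name], expected_type):
            issues.append(f"wrong type for field: {name}")
        else:
            core[name] = raw[name]
    # Presence of unexpected data: keep it as a study-specific extension.
    extensions = {k: v for k, v in raw.items() if k not in CORE_FIELDS}
    return {"core": core, "extensions": extensions, "issues": issues}


record = normalize({"sample_id": "S1", "data_type": "WGS", "tissue": "blood"})
```

Here `record["issues"]` lists the missing `custodian` field, while `tissue` survives under `record["extensions"]`.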

Metadata Aggregation Lifecycle
▪ Stages: Extract, Translate, Validate, Annotate, Store, Index, Project
− Extract: Any combination of tools to extract data from one or many sources (File Systems, Files, Databases, APIs)
− Translate: Prepare extracted native data fields for processing by DBE
− Validate: Validate metadata inputs against type constraints
− Annotate: Process data fields marked for annotation with ontology providers
− Store: Store validated and annotated data in DBE database
− Index: Index stored data in DBE search index
− Project: Projection of outputs directly into analysis frameworks or via API
▪ Components shown in the diagram: Data Sources, Importers, DBE Core Platform, Data Consumers
▪ Diagram labels: Distributed, Centralized
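
The seven stages can be illustrated with a toy pipeline. This is a sketch under stated assumptions, not the DBE implementation: the schema, the field renames, the in-memory `store` and `index`, and the ontology lookup table with its placeholder term ID `EX:0001` are all hypothetical stand-ins.

```python
ONTOLOGY = {"blood": "EX:0001"}       # placeholder for an ontology provider
SCHEMA = {"id": str, "tissue": str}   # hypothetical type constraints
RENAMES = {"ID": "id", "Tissue": "tissue"}


def extract(sources):
    # Extract: pull native records from any number of sources.
    return [rec for source in sources for rec in source]


def translate(rec):
    # Translate: map native field names onto the names the platform expects.
    return {RENAMES.get(k, k): v for k, v in rec.items()}


def validate(rec):
    # Validate: reject records that violate the type constraints.
    for name, expected_type in SCHEMA.items():
        if not isinstance(rec.get(name), expected_type):
            raise ValueError(f"invalid field {name!r} in {rec}")
    return rec


def annotate(rec):
    # Annotate: attach an ontology term to the field marked for annotation.
    return {**rec, "tissue_term": ONTOLOGY.get(rec["tissue"].lower())}


def run_pipeline(sources):
    store, index = {}, {}
    for raw in extract(sources):
        rec = annotate(validate(translate(raw)))
        store[rec["id"]] = rec                                       # Store
        index.setdefault(rec["tissue_term"], []).append(rec["id"])   # Index
    return store, index


def project(store, index, term):
    # Project: hand matching records to a consumer (framework or API).
    return [store[record_id] for record_id in index.get(term, [])]


store, index = run_pipeline([
    [{"ID": "S1", "Tissue": "Blood"}],   # source 1
    [{"ID": "S2", "Tissue": "blood"}],   # source 2
])
blood_samples = project(store, index, "EX:0001")
```

Because annotation normalizes both spellings of the tissue to one term, the index lets a consumer retrieve both samples with a single lookup.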

Metadata Federation Lifecycle
▪ Components shown in the diagram: Portal API, Authentication, Query Builder, Query Federator, Data Basket, HL7 FHIR API, Workspaces, Cohort Management, Federation Backends, GA4GH Beacon API
▪ Importers and DBE Core Platform: Extract, Translate, Validate, Annotate, Store, Index, Project

Data as a function of other data
“Rien ne se perd, rien ne se crée, tout se transforme” (“Nothing is lost, nothing is created, everything is transformed”)
Antoine-Laurent de Lavoisier
▪ Metadata describes not only the content of an artefact, but also the function that created or transformed it
▪ Every data artefact is the result of one or more functions
− User
− Application Stack, Configuration, Version
− Infrastructure
− Data Dependencies
− Projections
o Inputs or Source
o Outputs (Data)
▪ Essential for provenance, reproducibility and consent operations
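
One way to picture "data as a function of other data" is a provenance record captured whenever a function produces an artefact. The sketch below is an assumption-laden illustration: the `Provenance` class, its field names, and the `run_with_provenance` and `concat` helpers are hypothetical, and SHA-256 content hashes stand in for artefact identifiers.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def digest(data: bytes) -> str:
    """Content hash used here to identify an artefact."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Provenance:
    user: str
    application: str
    version: str
    configuration: dict
    infrastructure: str
    input_digests: tuple   # data dependencies
    output_digest: str

    def fingerprint(self) -> str:
        # Stable hash of the record; equal fingerprints mean the same
        # function, settings and inputs produced the same output.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return digest(payload)

    def derived_from(self, artefact: bytes) -> bool:
        # Supports consent operations: was this input used?
        return digest(artefact) in self.input_digests


def run_with_provenance(func, inputs, *, user, application, version,
                        configuration, infrastructure):
    """Run func on the inputs and record what produced the output."""
    output = func(*inputs, **configuration)
    record = Provenance(
        user=user, application=application, version=version,
        configuration=dict(configuration), infrastructure=infrastructure,
        input_digests=tuple(digest(i) for i in inputs),
        output_digest=digest(output),
    )
    return output, record


def concat(a: bytes, b: bytes, sep: str = "") -> bytes:
    # Toy transformation standing in for a real analysis step.
    return a + sep.encode() + b


context = dict(user="alice", application="concat", version="1.0",
               configuration={"sep": "\n"}, infrastructure="laptop")
output, record = run_with_provenance(concat, [b"x", b"y"], **context)
_, rerun = run_with_provenance(concat, [b"x", b"y"], **context)
```

Matching fingerprints between `record` and `rerun` indicate a reproducible step, and `record.derived_from(...)` can identify outputs affected when consent for an input is withdrawn.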

Do You Have Any Questions?

Contact us or follow us online!
contactus@databiology.com
twitter.com/databiology
linkedin.com/company/databiology
databiology.com

Databiology Ltd.
Magdalen Centre
The Oxford Science Park
Oxford, OX4 4GA
United Kingdom
+44-1865-784426

Databiology Inc.
201 Spear Street, Suite 1100
San Francisco, CA 94105
USA
+1-415-426-3592

Databiology Hong Kong Ltd.
Unit E, 6/F Golden Sun Centre
59-67 Bonham Street West
Sheung Wan, Hong Kong
Hong Kong (SAR)
+852-8193-4005
