Managing Biomedical Data and Metadata in Large Scale Collaborations
November 28, 2018
Copyright ©2018. All Rights Reserved. Confidential Databiology Ltd.
Why Metadata?
▪ What is metadata?
− Content
− Context
− Process
▪ Metadata is not always derived directly from the artefact; it is often obtained from multiple sources
▪ Metadata semantics are key to unlocking the findability, provenance and usability of data artefacts
Why Collaboration?
▪ Data continues to accumulate at an exponential rate
− Multiple efforts are capturing anything conceivable
− The line between study data and non-study data is blurring
▪ Demand for data continues to grow
− Everyone hungers for high-quality, consented biomedical datasets
− Regulation such as GDPR points to a need for large-scale consent management capability
▪ Generating and storing all data in-house no longer makes business sense
Status Quo
▪ Data is produced in silos
− Specialized systems: clinical, prescriptions, lab, imaging, sequencing, sensors, etc.
▪ There is no single warehouse of everything for everyone
− For the foreseeable future there will always be some (largish) degree of federation
− No single data science platform can cater to everyone
▪ There is no single view of the data
− No use case needs all the data
− Each use case needs a unique combination of data
Obstacles to Collaboration
▪ Working with data
− Data access
o Non-local data
o Data islands
o Multi-disciplinary
− Data preparation
o Data normalization
o The data scientist "grunt work" challenge
▪ Working together – sharing vs collaborating
− Involvement of different organizations
− Differing methods of processing
▪ Regulation, contracts and audit
The Common Approaches: Aggregation
▪ Aggregation: a central data warehouse with a corresponding API layer for querying very large data sets quickly
▪ Common challenges
− The line between data and metadata is blurred
− Scalability
− Cost
− Access controls
The Common Approaches: Standardization
▪ Standardization: common data models and APIs to obtain information from different custodians
▪ Common challenges
− Many standards
− They are all in flux
− Big effort to implement and to maintain
− Coverage
[Figure labels: Analytics Coverage; Standards Coverage]
The Common Approaches: Federation
▪ Federation: building on aggregation and standardization, query multiple data custodians and deliver aggregate answers
▪ Common challenges
− Standardizing queries
− Authentication / authorization
− Normalization
− Performance
Metadata and Conway’s Law
“Organizations which design systems (in the broad sense) ... are constrained to produce designs which are copies of the communication structures of these organizations.”
Conway’s Law, Melvin Conway, Datamation, 1968
Metadata – Description of Data Artefacts
▪ One person's metadata is another person's data
▪ Collaborate and establish the broadest consensus for a given data type
− Minimum viable standard metadata model across custodians
− Further enriched with contextual data specialized per study
− Requirements:
o Handling the presence of unexpected data as well as the absence of expected data
o Propagation of change and its impact on provenance
▪ The data model needs to be accommodating: ideally standardized summary data with ad hoc extensions by interest
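The two requirements on this slide (tolerating unexpected fields and reporting missing expected ones) can be illustrated with a minimal Python sketch. It is not Databiology's data model: the core field names in `CORE_FIELDS`, the `MetadataRecord` and `IngestReport` types, and the `ingest` function are all hypothetical, chosen only to show a small agreed core with ad hoc extensions kept alongside it.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical "minimum viable" core agreed across custodians (illustrative names only).
CORE_FIELDS = {"sample_id": str, "data_type": str, "custodian": str}


@dataclass
class MetadataRecord:
    core: dict[str, Any]                                       # standardized fields
    extensions: dict[str, Any] = field(default_factory=dict)   # per-study context


@dataclass
class IngestReport:
    record: MetadataRecord
    missing: list[str]     # expected core fields that were absent
    mistyped: list[str]    # core fields present with the wrong type
    unexpected: list[str]  # fields outside the core, kept as extensions


def ingest(raw: dict[str, Any]) -> IngestReport:
    """Split raw metadata into core and extensions without rejecting the record."""
    core: dict[str, Any] = {}
    missing: list[str] = []
    mistyped: list[str] = []
    for name, expected_type in CORE_FIELDS.items():
        if name not in raw:
            missing.append(name)
        elif not isinstance(raw[name], expected_type):
            mistyped.append(name)
        else:
            core[name] = raw[name]
    # Anything outside the core is preserved rather than dropped.
    extensions = {k: v for k, v in raw.items() if k not in CORE_FIELDS}
    return IngestReport(MetadataRecord(core, extensions), missing, mistyped, sorted(extensions))
```

Returning a report instead of raising lets each custodian decide whether a gap blocks ingestion or is merely recorded, which is one possible reading of "accommodating".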
Metadata Aggregation Lifecycle
Stages: Extract, Translate, Validate, Annotate, Store, Index, Project
▪ Extract: any combination of tools to extract data from one or many sources
− File systems
− Files
− Databases
− APIs
▪ Translate: prepare extracted native data fields for processing by DBE
▪ Validate: validate metadata inputs against type constraints
▪ Annotate: process data fields marked for annotation with ontology providers
▪ Store: store validated and annotated data in the DBE database
▪ Index: index stored data in the DBE search index
▪ Project: project outputs directly into analysis frameworks or via API
[Diagram labels: Data Sources; Importers; DBE Core Platform; Data Consumers; Distributed; Centralized]
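The seven lifecycle stages can be read as a simple pipeline. The Python sketch below is only an illustration under stated assumptions: it uses in-memory dictionaries in place of the DBE database and search index, and the type constraints, the `ONTOLOGY` lookup and the term identifier `EXAMPLE:0001` are hypothetical placeholders, not real DBE or ontology-provider interfaces.

```python
from typing import Any, Iterable

Record = dict[str, Any]

# Hypothetical constraints and ontology lookup (placeholders, not real provider data).
TYPE_CONSTRAINTS = {"sample_id": str, "tissue": str}
ONTOLOGY = {"liver": "EXAMPLE:0001"}


def extract(sources: Iterable[Iterable[Record]]) -> list[Record]:
    """Extract: pull rows from any number of sources."""
    return [dict(row) for source in sources for row in source]


def translate(record: Record) -> Record:
    """Translate: normalize native field names for downstream processing."""
    return {key.strip().lower(): value for key, value in record.items()}


def validate(record: Record) -> Record:
    """Validate: check inputs against type constraints."""
    for name, expected in TYPE_CONSTRAINTS.items():
        if not isinstance(record.get(name), expected):
            raise ValueError(f"{name!r} must be of type {expected.__name__}")
    return record


def annotate(record: Record) -> Record:
    """Annotate: attach an ontology term to the field marked for annotation."""
    return {**record, "tissue_term": ONTOLOGY.get(record["tissue"])}


class Platform:
    """In-memory stand-in for the store, the index and the projection step."""

    def __init__(self) -> None:
        self.store: dict[str, Record] = {}
        self.index: dict[Any, set[str]] = {}

    def ingest(self, sources: Iterable[Iterable[Record]]) -> None:
        for raw in extract(sources):
            record = annotate(validate(translate(raw)))
            self.store[record["sample_id"]] = record                  # Store
            self.index.setdefault(record["tissue_term"], set()).add(  # Index
                record["sample_id"]
            )

    def project(self, term: str) -> list[Record]:
        """Project: hand matching records to a consumer."""
        return [self.store[sid] for sid in sorted(self.index.get(term, ()))]
```

Keeping each stage as a small function mirrors the split on the slide between distributed importers and a centralized core, although where that boundary falls in the real platform is not stated in the deck.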
Metadata Federation Lifecycle
[Diagram labels: Portal API; Authentication; Query Builder; Query Federator; Data Basket; HL7 FHIR API; Workspaces; Cohort Management; Importers; DBE Core Platform; Extract, Translate, Validate, Annotate, Store, Index, Project; Federation Backends; GA4GH Beacon API]
Data as a function of other data
“Nothing is lost, nothing is created, everything is transformed” (“Rien ne se perd, rien ne se crée, tout se transforme”)
Antoine-Laurent de Lavoisier
▪ Metadata describes not only the content of an artefact, but also the function that created or transformed it
▪ Every data artefact is the result of one or more functions
− User
− Application Stack, Configuration, Version
− Infrastructure
− Data Dependencies
− Projections
o Inputs or Source
o Outputs (Data)
▪ Essential for provenance, reproducibility and consent operations
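Treating an artefact as the output of a function suggests recording that function next to the data. The Python sketch below is a hypothetical illustration, not the platform's provenance model: the `Provenance` fields follow the list on this slide, while the hashing scheme and the `downstream` helper are assumptions. It also assumes functions are deterministic, so that the same function applied to the same inputs yields the same identifier.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Provenance:
    user: str
    application: str
    version: str
    configuration: tuple   # (key, value) pairs
    infrastructure: str
    inputs: tuple          # IDs of the artefacts this one was derived from

    def artefact_id(self) -> str:
        """Derive a stable ID from the function description (illustrative scheme)."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


def downstream(source_id: str, catalogue: dict[str, Provenance]) -> set[str]:
    """All artefacts derived, directly or transitively, from source_id."""
    affected: set[str] = set()
    frontier = [source_id]
    while frontier:
        current = frontier.pop()
        for artefact_id, prov in catalogue.items():
            if current in prov.inputs and artefact_id not in affected:
                affected.add(artefact_id)
                frontier.append(artefact_id)
    return affected
```

A consent withdrawal against a source artefact could then be resolved by calling `downstream` on its ID to list every derived artefact that needs review.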
Do You Have Any Questions?
Contact us or follow us online!
twitter.com/databiology
linkedin.com/company/databiology
databiology.com
Databiology Ltd.
Magdalen Centre
The Oxford Science Park
Oxford, OX4 4GA
United Kingdom
+44-1865-784426
contactus@databiology.com
Databiology Inc.
201 Spear Street, Suite 1100
San Francisco, CA 94105
USA
+1-415-426-3592
contactus@databiology.com
Databiology Hong Kong Ltd.
Unit E, 6/F Golden Sun Centre
59-67 Bonham Street West
Sheung Wan, Hong Kong
Hong Kong (SAR)
+852-8193-4005
contactus@databiology.com
