Lightweight, Practical, Cross-domain Metadata
March 2020: Bio-IT World West
Chris Dwan (chris@dwan.org)
https://dwan.org @fdmts
Conclusions
Organizing data is a human practice, not a technology choice
– There is no free lunch
Start simple, with free technologies, and quick wins
– NoSQL databases with headers and checksum
– Plan to invest in infrastructure about 18 months into the journey
– Don’t start with the whole genomes
Good policy makes simple practice
– Make data somebody’s job.
– Enterprise data management has much to teach us
Geek Cred: My First Petabyte, 2008
NIH circa 2008
“Gene expression data …
… are meaningful only in the
context of a detailed description
of the conditions under which
they were generated …
… including the particular state
of the living system under study
…
… and the perturbations to which
it has been subjected.”
State of the art: 2001
Most metadata field names and
their values are not standardized
or controlled.
Even simple binary or numeric
fields are often populated with
inadequate values of different
data types.
By clustering metadata field
names, we discovered there are
often many distinct ways to
represent the same aspect of a
sample.
State of the art: 2019
2016: Data Quality Matters
Ask a computational biologist /
data scientist what fraction of their
time is spent fighting data quality,
formatting, and similar issues.
State of the Art: 2020
A miscommunication between
the wet lab and the
bioinformatics group resulted
in an “embarrassing
miscommunication” …
… to the press.
Genomic Data Production in Context
Genomic data production @ Broad
I did research computing at
Broad from 2014 - 2017
ExAC / gnomAD: A powerful example
ExAC (the Exome Aggregation
Consortium) and gnomAD (the
Genome Aggregation Database)
represent vast amounts of work to
harmonize both phenotype and
consent
The IT Services Perspective
Filesystem Metadata
– File attributes: Size, format, creation and modification times
– Permissions: Ownership, Access Control Lists
– Access / usage patterns: Which files are accessed, and by whom
– Compressibility / Deduplication: Lies Optimistic projections by vendors
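The attributes in the list above are already tracked by the filesystem and can be read without any new infrastructure. The following is a minimal sketch, assuming a POSIX-style filesystem; the function names `filesystem_metadata` and `walk_metadata` are hypothetical, and the file suffix is used only as a crude stand-in for "format".

```python
import os
import stat
import time
from pathlib import Path


def _iso(timestamp):
    """Format a POSIX timestamp as an ISO-8601 string in UTC."""
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(timestamp))


def filesystem_metadata(path):
    """Return the metadata the filesystem already holds for one file."""
    p = Path(path)
    st = p.stat()
    return {
        "path": str(p.resolve()),
        "size_bytes": st.st_size,
        "suffix": p.suffix,                 # crude proxy for file format
        "modified": _iso(st.st_mtime),
        "accessed": _iso(st.st_atime),      # may be stale if atime updates are disabled
        "owner_uid": st.st_uid,             # numeric owner id (0 on Windows)
        "mode": stat.filemode(st.st_mode),  # permission string such as "-rw-r--r--"
    }


def walk_metadata(root):
    """Yield one metadata record per file under a directory tree."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield filesystem_metadata(os.path.join(dirpath, name))
```

Creation time is deliberately left out: on most Unix systems `st_ctime` records the last metadata change, not file creation.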
One function of a research computing team is to bridge the gap between data
storage (usable capacity as provided by enterprise IT) and data services
(semantically usable data)
The filesystem / directory tree is your
default metadata database
It is what your team is using today
Any proposal less functional than
descriptive filenames will fail
If you have four groups working on a compiler, you’ll
get a four-pass compiler
Eric S Raymond, The New Hacker’s Dictionary, 1996
Most primary data files (in
bioinformatics) include valuable
metadata, usually in the header.
These can be quite verbose.
Bioinformatics as a discipline is
filled with duplication.
“hg19” here is identical to “build
37” from the previous slide.
Many file headers include the
command line and parameters that
were used to generate the file.
This is the default method for storing
experimental “provenance”
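Scraping those headers can be sketched in a few lines. This is a sketch under stated assumptions, not a parser: it handles only text formats (optionally gzip-compressed) whose header lines lead the file and start with `#` (as in VCF) or `@` (as in SAM). Binary formats such as BAM need their own tools. The function names `scrape_header` and `header_fields` are hypothetical.

```python
import gzip


def scrape_header(path, prefixes=("#", "@"), max_lines=10000):
    """Collect the leading lines of a text file that start with a header prefix."""
    opener = gzip.open if str(path).endswith(".gz") else open
    header = []
    with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            # Stop at the first non-header line, or when the cap is reached.
            if not line.startswith(prefixes) or len(header) >= max_lines:
                break
            header.append(line.rstrip("\n"))
    return header


def header_fields(header):
    """Pull simple '##key=value' lines into a dict of lists (repeated keys are kept)."""
    fields = {}
    for line in header:
        if line.startswith("##") and "=" in line:
            key, value = line[2:].split("=", 1)
            fields.setdefault(key, []).append(value)
    return fields
```

Storing the raw header lines alongside any extracted fields keeps the original text available when the extraction turns out to be too naive.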
2002: Parsers are a pretty awful solution
Container technology (Docker / Singularity)
revolutionized software deployment.
Instead of installers and configurators, we ship a
whole operating system, with the app pre-installed.
We do not have a similar solution for experimental
metadata.
We have not found a way to package and ship
domain experts and researchers.
We don’t have containers for metadata
NoSQL is a delightful prototyping tool
• NoSQL-style document stores (MongoDB, PostgreSQL with JSON
columns, …) do not require a fully defined schema.
• You gain flexibility at the cost of consistency and
possibly performance.
• This makes them ideal for prototyping
• “Plan to throw one away; you will, anyhow.”
Fred Brooks, 1975
The Mythical Man-Month
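The flexibility-versus-consistency trade can be illustrated without installing anything. The sketch below uses a plain list of dictionaries as a stand-in for a document store; it is not the API of MongoDB or any other product, and the records, paths, and the `find` helper are all hypothetical.

```python
# Three "documents" with no shared schema: each one carries whatever
# fields were available when it was written.
documents = [
    {"_id": "f1", "path": "/data/run1/a.vcf", "reference": "hg19"},
    {"_id": "f2", "path": "/data/run2/b.vcf", "ref": "build 37", "lab": "smith"},
    {"_id": "f3", "path": "/data/run3/c.vcf", "reference": "hg19", "instrument": "seq-7"},
]


def find(docs, **criteria):
    """Return documents whose fields equal every given key/value pair."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


# Flexibility: nothing had to be declared before storing "lab" or "instrument".
matches = find(documents, reference="hg19")

# Cost: document f2 spells the field "ref" and the value "build 37",
# so it is silently missed by the query above.
missed = [d["_id"] for d in documents if d not in matches]
```

That silent miss is the price of skipping the schema, and it is acceptable for a prototype that is planned to be thrown away.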
What goes in the NoSQL?
Unique key to identify the file
The path to where it is stored
A checksum (to find duplicates later)
The header (scrape and store wholesale)
Whatever else the lab said was important
– Perhaps column headers from that spreadsheet they use…
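The list above maps directly onto one record per file. A minimal sketch follows, assuming SHA-256 as the checksum and a random UUID as the unique key (the slides name neither); the field names and the `make_record` helper are hypothetical.

```python
import hashlib
import uuid
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_record(path, header=None, **lab_fields):
    """Build one metadata document for one file."""
    p = Path(path).resolve()
    return {
        "_id": str(uuid.uuid4()),      # unique key to identify the file
        "path": str(p),                # where it is stored
        "sha256": sha256_of(p),        # checksum, to find duplicates later
        "header": list(header or []),  # header lines, stored wholesale
        "lab": dict(lab_fields),       # whatever else the lab said was important
    }
```

A call such as `make_record("a.vcf", header=lines, sample="S1", tissue="liver")` keeps the lab's own column names as they are, which postpones any argument about standard vocabularies until there is data to argue over.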
Capture metadata at the time of creation
• Metadata needs to be captured at the point of data creation
• This amounts to putting more work on staff who are likely
already overburdened and time conscious
• Very little of the benefit of rigorous data processes will be felt in
the lab (at least at first)
The Data Tzar
Data Tzar: Clearly empowered. Title sparks curiosity.
Data Janitor: Data are trash. Low prestige job.
Data Monkey: Disrespectful, vaguely racist.
Chief Data Officer: They mostly seem to work on licensing.
The Data Tzar: Day 1
Engage with data generators
– Tools to make their lives easier: dashboards, alerting systems, backups, routine
analysis, QC checks
– Go “breadth first” across the enterprise
– Do not start with the whole genomes
Sneakily harvest metadata
Necessary resources (day 1):
– 1 – 2 early career bioinformatics programmers
– Access to an infrastructure engineer
– A modest budget on your cloud provider of choice
The Data Tzar: First Year
Do a lot of favors, build a lot of Shiny apps
Convene working groups around specific types of data
Create crosscutting dashboards for leadership
Make friends with the heads of information security and compliance.
Prepare a budget proposal
FAIR Data (within the enterprise)
Findable
• NoSQL database of metadata and checksums
• It’s plenty for a good long time.
Accessible
• Federated identity management
• Architecture of S3 buckets and production
“roles”
Interoperable
• “It’s much easier to go FAR than to go FAIR”
Reusable
• Data standards, ontologies, strong policy
framework, including electronic consents for
human subjects data.
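For the "Findable" item above, a store of paths and checksums already answers one useful question: which files are copies of each other. A minimal sketch, assuming each record is a dictionary with hypothetical `path` and `sha256` fields:

```python
from collections import defaultdict


def find_duplicates(records):
    """Group file paths by checksum and keep only checksums seen more than once."""
    by_checksum = defaultdict(list)
    for rec in records:
        by_checksum[rec["sha256"]].append(rec["path"])
    return {c: paths for c, paths in by_checksum.items() if len(paths) > 1}


# Hypothetical records; real checksums would be full hex digests.
records = [
    {"path": "/lab_a/run1/sample.bam", "sha256": "aaa"},
    {"path": "/lab_b/copy_of_sample.bam", "sha256": "aaa"},
    {"path": "/lab_a/run2/other.bam", "sha256": "bbb"},
]
duplicates = find_duplicates(records)
```

Matching checksums identify byte-identical files only; the same data re-compressed or re-sorted will not match.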
Incredible opportunities
here, and rapidly
developing data silos
The Clinical Data Ecosystem
There is an incredible wealth of
data available to support both
clinical care and research
Unfortunately, it is carved up and
isolated.
The phrase I hear most frequently
from hospital CIOs: “No Upside”
Patient Journals
Consumer products
Longitudinal Data from
other providers …
Electronic
Medical Records
Possibility of a self-normal
(N of 1) over time
Diagnostic
Imaging
Natural language processing
has strong potential
Clinical Notes
Innovations in the basics of
clinical observation
Hospital Telemetry
Pressure to avoid incidental
findings prevent bias
Primary Lab Data
Appropriate Use and Consent
“We should be up front with participants that we can’t protect
their privacy completely, and we should ensure that the most
appropriate legislation is in place to protect participants from
being exploited in any way.”
- Eric Schadt, CEO, Sema4
Policies and Governance
Appropriate usage
Human readable document: Expectations of privacy and
standards of behavior.
Data Classification
Governance document: Defines the major categories of data
(corporate sensitive, clinical, …) and standards for handling of
each.
Written Information Security Policy (WISP)
Technical document: Defines how systems must be configured to
protect sensitive data and operations.
Vendor Qualification
Business SOP to establish practices around how vendor access
and systems should be managed.
Conclusions
Organizing data is a human practice, not a technology choice
– There is no free lunch
Start simple, with free technologies, and quick wins
– NoSQL databases with headers and checksum
– Plan to invest in infrastructure about 18 months into the journey
– Don’t start with the whole genomes
Good policy makes simple practice
– Make data somebody’s job.
– Enterprise data management has much to teach us
Thank you!
Chris Dwan (chris@dwan.org)
https://dwan.org @fdmts

filosofia boliviana introducción jsjdjd.pptx
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
 
Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
 
Structures and textures of metamorphic rocks
Structures and textures of metamorphic rocksStructures and textures of metamorphic rocks
Structures and textures of metamorphic rocks
 
Structural Classification Of Protein (SCOP)
Structural Classification Of Protein  (SCOP)Structural Classification Of Protein  (SCOP)
Structural Classification Of Protein (SCOP)
 
insect taxonomy importance systematics and classification
insect taxonomy importance systematics and classificationinsect taxonomy importance systematics and classification
insect taxonomy importance systematics and classification
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
 
RNA INTERFERENCE: UNRAVELING GENETIC SILENCING
RNA INTERFERENCE: UNRAVELING GENETIC SILENCINGRNA INTERFERENCE: UNRAVELING GENETIC SILENCING
RNA INTERFERENCE: UNRAVELING GENETIC SILENCING
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 
Comparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebratesComparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebrates
 

No Free Lunch: Metadata in the life sciences

  • 1. Lightweight, Practical, Cross-domain Metadata March 2020: Bio-IT World West Chris Dwan (chris@dwan.org) https://dwan.org @fdmts
  • 2. Conclusions Organizing data is a human practice, not a technology choice – There is no free lunch Start simple, with free technologies, and quick wins – NoSQL databases with headers and checksum – Plan to invest in infrastructure about 18 months into the journey – Don’t start with the whole genomes Good policy makes simple practice – Make data somebody’s job. – Enterprise data management has much to teach us
  • 3. Geek Cred: My First Petabyte, 2008
  • 5. State of the art: 2001 “Gene expression data … are meaningful only in the context of a detailed description of the conditions under which they were generated … including the particular state of the living system under study … and the perturbations to which it has been subjected”
  • 6. State of the art: 2019 Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered there are often many distinct ways to represent the same aspect of a sample.
  • 7. 2016: Data Quality Matters Ask a computational biologist / data scientist what fraction of their time is spent fighting data quality, formatting, and similar issues.
  • 8. State of the Art: 2020 A miscommunication between the wet lab and the bioinformatics group resulted in an “embarrassing miscommunication” … … to the press.
  • 9. Genomic Data Production in Context: Genomic data production @ Broad. I did research computing at Broad from 2014 - 2017
  • 10. ExAC / gnomAD: A powerful example ExAC (the Exome Aggregation Consortium) and gnomAD (the Genome Aggregation Database) represent vast amounts of work to harmonize both phenotype and consent
  • 11. The IT Services Perspective Filesystem Metadata – File attributes: Size, format, creation and modification times – Permissions: Ownership, Access Control Lists – Access / usage patterns: Which files are accessed, and by whom – Compressibility / Deduplication: Lies Optimistic projections by vendors One function of a research computing team is to bridge the gap between data storage (usable capacity as provided by enterprise IT) and data services (semantically usable data)
  • 12. The filesystem / directory tree is your default metadata database It is what your team is using today Any proposal less functional than descriptive filenames will fail
  • 13. If you have four groups working on a compiler, you’ll get a four-pass compiler Eric S Raymond, The New Hacker’s Dictionary, 1996
  • 14. Most primary data files (in bioinformatics) include valuable metadata, usually in the header. These can be quite verbose.
  • 15. Bioinformatics as a discipline is filled with duplication. “hg19” here is identical to “build 37” from the previous slide.
  • 16. Many file headers include the command line and parameters that were used to generate the file. This is the default method for storing experimental “provenance”
  • 17. 2002: Parsers are a pretty awful solution
  • 18. We don’t have containers for metadata Container technology (Docker / Singularity) revolutionized software deployment. Instead of installers and configurators, we ship a whole operating system, with the app pre-installed. We do not have a similar solution for experimental metadata. We have not found a way to package and ship domain experts and researchers.
  • 19. NoSQL is a delightful prototyping tool • NoSQL-style document stores (MongoDB, PostgreSQL with JSONB columns, …) do not require a fully defined schema. • You gain flexibility at the cost of consistency and possibly performance. • This makes them ideal for prototyping • “Plan to throw one away; you will, anyhow.” Fred Brooks, The Mythical Man-Month, 1975
  • 20. What goes in the NoSQL? Unique key to identify the file The path to where it is stored A checksum (to find duplicates later) The header (scrape and store wholesale) Whatever else the lab said was important – Perhaps column headers from that spreadsheet they use…
  • 21. Capture metadata at the time of creation • Metadata needs to be captured at the point of data creation • This amounts to putting more work on staff who are likely already overburdened and time conscious • Very little of the benefit of rigorous data processes will be felt in the lab (at least at first)
  • 22. The Data Tzar. Data Tzar: clearly empowered; title sparks curiosity. Data Janitor: data are trash; low-prestige job. Data Monkey: disrespectful, vaguely racist. Chief Data Officer: they mostly seem to work on licensing.
  • 23. The Data Tzar: Day 1 Engage with data generators – Tools to make their lives easier: dashboards, alerting systems, backups, routine analysis, QC checks – Go “breadth first” across the enterprise – Do not start with the whole genomes Sneakily harvest metadata Necessary resources (day 1): – 1 – 2 early career bioinformatics programmers – Access to an infrastructure engineer – A modest budget on your cloud provider of choice
  • 24. The Data Tzar: First Year Do a lot of favors, build a lot of Shiny apps Convene working groups around specific types of data Create crosscutting dashboards for leadership Make friends with the heads of information security and compliance. Prepare a budget proposal
  • 25. FAIR Data (within the enterprise) Findable • NoSQL database of metadata and checksums • It’s plenty for a good long time. Accessible • Federated identity management • Architecture of S3 buckets and production “roles” Interoperable • “It’s much easier to go FAR than to go FAIR” Reusable • Data standards, ontologies, strong policy framework, including electronic consents for human subjects data.
  • 26. The Clinical Data Ecosystem There is an incredible wealth of data available to support both clinical care and research. Unfortunately, it is carved up and isolated. The phrase I hear most frequently from hospital CIOs: “No Upside”. Incredible opportunities here, and rapidly developing data silos; Patient Journals; Consumer products; Longitudinal Data from other providers …; Electronic Medical Records; Possibility of a self-normal (N of 1) over time; Diagnostic Imaging; Natural language processing has strong potential; Clinical Notes; Innovations in the basics of clinical observation; Hospital Telemetry; Pressure to avoid incidental findings prevent bias; Primary Lab Data
  • 27. Appropriate Use and Consent “We should be up front with participants that we can’t protect their privacy completely, and we should ensure that the most appropriate legislation is in place to protect participants from being exploited in any way.” - Eric Schadt, CEO, Sema4
  • 28. Policies and Governance. Appropriate usage: human readable document; expectations of privacy and standards of behavior. Data Classification: governance document; defines the major categories of data (corporate sensitive, clinical, …) and standards for handling of each. Written Information Security Policy (WISP): technical document; defines how systems must be configured to protect sensitive data and operations. Vendor Qualification: business SOP to establish practices around how vendor access and systems should be managed.
  • 29. Conclusions Organizing data is a human practice, not a technology choice – There is no free lunch Start simple, with free technologies, and quick wins – NoSQL databases with headers and checksum – Plan to invest in infrastructure about 18 months into the journey – Don’t start with the whole genomes Good policy makes simple practice – Make data somebody’s job. – Enterprise data management has much to teach us
  • 30. Thank you! Chris Dwan (chris@dwan.org) https://dwan.org @fdmts
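
Slide 20 ("What goes in the NoSQL?") lists a unique key, the storage path, a checksum, the header scraped wholesale, and whatever else the lab said was important. A minimal sketch of such a record, assuming plain Python dictionaries that could later be loaded into any document store, might look like the following. The function names, the field names, the choice of SHA-256 as the checksum, and the rule that header lines start with "#" or "@" are all illustrative assumptions, not something the slides specify.

```python
import hashlib
import uuid
from pathlib import Path

# Assumption: header lines begin with '#' (VCF-style) or '@' (SAM-style).
HEADER_PREFIXES = ("#", "@")


def scrape_header(path, max_lines=1000):
    """Collect the leading header lines verbatim; nothing is parsed."""
    header = []
    with open(path, "r", errors="replace") as handle:
        for count, line in enumerate(handle):
            if count >= max_lines or not line.startswith(HEADER_PREFIXES):
                break
            header.append(line.rstrip("\n"))
    return header


def sha256_of(path, chunk_size=1 << 20):
    """Checksum the file in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_record(path, lab_fields=None):
    """Build one schema-free metadata record for a data file."""
    path = Path(path)
    return {
        "_id": str(uuid.uuid4()),            # unique key for the file
        "path": str(path.resolve()),         # where it is stored
        "checksum": sha256_of(path),         # to find duplicates later
        "header": scrape_header(path),       # scraped and stored wholesale
        "lab_fields": dict(lab_fields or {}),  # whatever the lab said matters
    }


def find_duplicates(records):
    """Group paths by checksum and keep only checksums seen more than once."""
    by_checksum = {}
    for record in records:
        by_checksum.setdefault(record["checksum"], []).append(record["path"])
    return {c: paths for c, paths in by_checksum.items() if len(paths) > 1}
```

Storing the header as an uninterpreted list of lines follows the slides' advice to avoid parsers at the prototyping stage; compressed or binary files would simply yield an empty header under this sketch and would need their own handling.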