Lightweight, Practical, Cross-domain Metadata
March 2020: Bio-IT World West
Chris Dwan (chris@dwan.org)
https://dwan.org @fdmts
Conclusions
Organizing data is a human practice, not a technology choice
– There is no free lunch
Start simple, with free technologies, and quick wins
– NoSQL databases with headers and checksum
– Plan to invest in infrastructure about 18 months into the journey
– Don’t start with the whole genomes
Good policy makes simple practice
– Make data somebody’s job.
– Enterprise data management has much to teach us
Geek Cred: My First Petabyte, 2008
NIH circa 2008
“Gene expression data …
… are meaningful only in the
context of a detailed description
of the conditions under which
they were generated …
… including the particular state
of the living system under study
…
… and the perturbations to which
it has been subjected”
State of the art: 2001
Most metadata field names and
their values are not standardized
or controlled.
Even simple binary or numeric
fields are often populated with
inadequate values of different
data types.
By clustering metadata field
names, we discovered there are
often many distinct ways to
represent the same aspect of a
sample.
State of the art: 2019
2016: Data Quality Matters
Ask a computational biologist /
data scientist what fraction of their
time is spent fighting data quality,
formatting, and similar issues.
State of the Art: 2020
A miscommunication between
the wet lab and the
bioinformatics group resulted
in an “embarrassing
miscommunication” …
… to the press.
Genomic Data Production in Context
Genomic data production @ Broad
I did research computing at
Broad from 2014 - 2017
ExAC / gnomAD: A powerful example
ExAC (the Exome Aggregation
Consortium) and gnomAD (the
Genome Aggregation Database)
represent vast amounts of work to
harmonize both phenotype and
consent
The IT Services Perspective
Filesystem Metadata
– File attributes: Size, format, creation and modification times
– Permissions: Ownership, Access Control Lists
– Access / usage patterns: Which files are accessed, and by whom
– Compressibility / Deduplication: Lies (that is, optimistic projections by vendors)
One function of a research computing team is to bridge the gap between data
storage (usable capacity as provided by enterprise IT) and data services
(semantically usable data)
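The filesystem attributes listed above are already available without any new tooling. As a minimal sketch (the function name `filesystem_metadata` and the field names are assumptions for illustration, not part of the talk), Python's standard library can collect them for a single file:

```python
import os
import stat
from datetime import datetime, timezone


def filesystem_metadata(path):
    """Return the attributes the filesystem already tracks for one file."""
    info = os.stat(path)
    return {
        "path": os.path.abspath(path),
        "size_bytes": info.st_size,
        # The extension is only a hint about the format, not a guarantee.
        "extension": os.path.splitext(path)[1].lower(),
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
        # Numeric owner id; mapping it to a user name is platform specific.
        "owner_uid": info.st_uid,
        # Human-readable mode string such as "-rw-r--r--".
        "permissions": stat.filemode(info.st_mode),
    }
```

Note what is missing: nothing here says what the file means, which is the gap between data storage and data services described above.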
The filesystem / directory tree is your
default metadata database
It is what your team is using today
Any proposal less functional than
descriptive filenames will fail
If you have four groups working on a compiler, you’ll
get a four-pass compiler
Eric S Raymond, The New Hacker’s Dictionary, 1996
Most primary data files (in
bioinformatics) include valuable
metadata, usually in the header.
These can be quite verbose.
Bioinformatics as a discipline is
filled with duplication.
“hg19” here is identical to “build
37” from the previous slide.
Many file headers include the
command line and parameters that
were used to generate the file.
This is the default method for storing
experimental “provenance”
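As one illustration of provenance living in a header, SAM-format alignment files record each processing step on an `@PG` line, with the program name in the `PN` tag and the command line in the `CL` tag. The sketch below assumes that format; the function name, the example header, and the tool name `example_aligner` are hypothetical.

```python
def provenance_from_sam_header(header_text):
    """Collect program name and command line from SAM-style @PG lines."""
    steps = []
    for line in header_text.splitlines():
        if not line.startswith("@PG"):
            continue
        # After the record type, fields are tab-separated TAG:VALUE pairs.
        fields = dict(
            field.split(":", 1)
            for field in line.split("\t")[1:]
            if ":" in field
        )
        steps.append(
            {
                "program": fields.get("PN") or fields.get("ID"),
                "command_line": fields.get("CL"),
            }
        )
    return steps


# Hypothetical header, made up for this example.
EXAMPLE_HEADER = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chrExample\tLN:1000\n"
    "@PG\tID:aligner\tPN:example_aligner"
    "\tCL:example_aligner --ref ref.fa reads.fq\n"
)

if __name__ == "__main__":
    for step in provenance_from_sam_header(EXAMPLE_HEADER):
        print(step)
```

Every file format needs its own version of this function, which is the maintenance burden of relying on parsers.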
2002: Parsers are a pretty awful solution
Container technology (Docker / Singularity)
revolutionized software deployment.
Instead of installers and configurators, we ship a
whole operating system, with the app pre-installed.
We do not have a similar solution for experimental
metadata.
We have not found a way to package and ship
domain experts and researchers.
We don’t have containers for metadata
NoSQL is a delightful prototyping tool
• NoSQL and document-style databases (MongoDB, PostgreSQL with JSON columns, …) do
not require a fully defined schema.
• You gain flexibility at the cost of consistency and
possibly performance.
• This makes them ideal for prototyping
• “Plan to throw one away; you will, anyhow.”
Fred Brooks, 1975
The Mythical Man-Month
What goes in the NoSQL?
Unique key to identify the file
The path to where it is stored
A checksum (to find duplicates later)
The header (scrape and store wholesale)
Whatever else the lab said was important
– Perhaps column headers from that spreadsheet they use…
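The list above maps naturally onto one schemaless document per file. The following is a rough sketch rather than a prescribed design: the field names (`_id`, `path`, `sha256`, `header`, `lab`), the choice of SHA-256, and the assumption that headers are leading text lines starting with `#` or `@` are all illustrative. Compressed or binary formats would need a format-specific reader.

```python
import hashlib
import os
import uuid


def scrape_header(path, prefixes=("#", "@"), max_lines=10000):
    """Return the leading lines that start with a header prefix, verbatim."""
    header = []
    with open(path, "r", errors="replace") as handle:
        for count, line in enumerate(handle):
            if count >= max_lines or not line.startswith(prefixes):
                break
            header.append(line.rstrip("\n"))
    return header


def sha256_of(path, chunk_size=1 << 20):
    """Checksum the file in chunks so large files never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_record(path, lab_fields=None):
    """Assemble one JSON-serializable metadata document for a file."""
    return {
        "_id": str(uuid.uuid4()),          # unique key to identify the file
        "path": os.path.abspath(path),     # where it is stored
        "sha256": sha256_of(path),         # to find duplicates later
        "header": scrape_header(path),     # stored wholesale, not parsed
        "lab": dict(lab_fields or {}),     # whatever the lab said matters
    }
```

The result is a plain dictionary, so it can be written as JSON to a file or inserted into whichever document store the team is prototyping with.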
Capture metadata at the time of creation
• Metadata needs to be captured at the point of data creation
• This amounts to putting more work on staff who are likely
already overburdened and time conscious
• Very little of the benefit of rigorous data processes will be felt in
the lab (at least at first)
The Data Tzar
Data Tzar: Clearly empowered. Title sparks curiosity.
Data Janitor: Data are trash. Low prestige job.
Data Monkey: Disrespectful, vaguely racist.
Chief Data Officer: They mostly seem to work on licensing.
The Data Tzar: Day 1
Engage with data generators
– Tools to make their lives easier - Dashboards, alerting systems, backups, routine
analysis, QC checks
– Go “breadth first” across the enterprise
– Do not start with the whole genomes
Sneakily harvest metadata
Necessary resources (day 1):
– 1 – 2 early career bioinformatics programmers
– Access to an infrastructure engineer
– A modest budget on your cloud provider of choice
The Data Tzar: First Year
Do a lot of favors, build a lot of Shiny apps
Convene working groups around specific types of data
Create crosscutting dashboards for leadership
Make friends with the heads of information security and compliance.
Prepare a budget proposal
FAIR Data (within the enterprise)
Findable
• NoSQL database of metadata and checksums
• It’s plenty for a good long time.
Accessible
• Federated identity management
• Architecture of S3 buckets and production
“roles”
Interoperable
• “It’s much easier to go FAR than to go FAIR”
Reusable
• Data standards, ontologies, strong policy
framework, including electronic consents for
human subjects data.
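To make the Findable point concrete, here is a small sketch of what stored checksums allow, assuming metadata records are plain dictionaries with hypothetical `path` and `sha256` fields. The records and paths below are made up for illustration.

```python
from collections import defaultdict


def duplicates_by_checksum(records):
    """Map each checksum seen more than once to the sorted paths sharing it."""
    paths_by_sum = defaultdict(list)
    for record in records:
        paths_by_sum[record["sha256"]].append(record["path"])
    return {
        checksum: sorted(paths)
        for checksum, paths in paths_by_sum.items()
        if len(paths) > 1
    }


# Hypothetical records; real checksums would be full hex digests.
RECORDS = [
    {"path": "/data/run1/sample.vcf", "sha256": "aaa"},
    {"path": "/archive/sample_copy.vcf", "sha256": "aaa"},
    {"path": "/data/run2/other.vcf", "sha256": "bbb"},
]

if __name__ == "__main__":
    for checksum, paths in duplicates_by_checksum(RECORDS).items():
        print(checksum, paths)
```

The same grouping can be expressed as a query in most document stores; the in-memory version only shows the idea.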
Incredible opportunities
here, and rapidly
developing data silos
The Clinical Data Ecosystem
There is an incredible wealth of
data available to support both
clinical care and research
Unfortunately, it is carved up and
isolated.
The phrase I hear most frequently
from hospital CIOs: “No Upside”
Patient Journals
Consumer products
Longitudinal Data from
other providers …
Electronic
Medical Records
Possibility of a self-normal
(N of 1) over time
Diagnostic
Imaging
Natural language processing
has strong potential
Clinical Notes
Innovations in the basics of
clinical observation
Hospital Telemetry
Pressure to avoid incidental
findings prevent bias
Primary Lab Data
Appropriate Use and Consent
“We should be up front with participants that we can’t protect
their privacy completely, and we should ensure that the most
appropriate legislation is in place to protect participants from
being exploited in any way.”
- Eric Schadt, CEO, Sema4
Policies and Governance
Appropriate usage
Human readable document: Expectations of privacy and
standards of behavior.
Data Classification
Governance document: Defines the major categories of data
(corporate sensitive, clinical, …) and standards for handling of
each.
Written Information Security Policy (WISP)
Technical document: Defines how systems must be configured to
protect sensitive data and operations.
Vendor Qualification
Business SOP to establish practices around how vendor access
and systems should be managed.
Conclusions
Organizing data is a human practice, not a technology choice
– There is no free lunch
Start simple, with free technologies, and quick wins
– NoSQL databases with headers and checksum
– Plan to invest in infrastructure about 18 months into the journey
– Don’t start with the whole genomes
Good policy makes simple practice
– Make data somebody’s job.
– Enterprise data management has much to teach us
Thank you!
Chris Dwan (chris@dwan.org)
https://dwan.org @fdmts

No Free Lunch: Metadata in the life sciences
