This presentation covers common challenges and offers suggestions to support the work of creating flexible, interoperable data systems for the life sciences.
2. Conclusions
Organizing data is a human practice, not a technology choice
– There is no free lunch
Start simple, with free technologies, and quick wins
– NoSQL databases with headers and checksums
– Plan to invest in infrastructure about 18 months into the journey
– Don’t start with the whole genomes
Good policy makes simple practice
– Make data somebody’s job.
– Enterprise data management has much to teach us
3. Geek Cred: My First Petabyte, 2008
5. “Gene expression data … are meaningful only in the
context of a detailed description of the conditions
under which they were generated … including the
particular state of the living system under study …
and the perturbations to which it has been subjected”
State of the art: 2001
6. Most metadata field names and
their values are not standardized
or controlled.
Even simple binary or numeric
fields are often populated with
inadequate values of different
data types.
By clustering metadata field
names, we discovered there are
often many distinct ways to
represent the same aspect of a
sample.
State of the art: 2019
7. 2016: Data Quality Matters
Ask a computational biologist /
data scientist what fraction of their
time is spent fighting data quality,
formatting, and similar issues.
8. State of the Art: 2020
A miscommunication between
the wet lab and the
bioinformatics group resulted
in an “embarrassing
miscommunication” …
… to the press.
9. Genomic data production @ Broad
I did research computing at
Broad from 2014 - 2017
10. ExAC / gnomAD: A powerful example
ExAC (the Exome Aggregation
Consortium) and gnomAD (the
Genome Aggregation Database)
represent vast amounts of work to
harmonize both phenotype and
consent
11. The IT Services Perspective
Filesystem Metadata
– File attributes: Size, format, creation and modification times
– Permissions: Ownership, Access Control Lists
– Access / usage patterns: Which files are accessed, and by whom
– Compressibility / Deduplication: Optimistic projections by vendors (i.e., lies)
One function of a research computing team is to bridge the gap between data
storage (usable capacity as provided by enterprise IT) and data services
(semantically usable data)
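The filesystem metadata above comes for free with every file. A minimal sketch of harvesting it, assuming a POSIX-style tree (field names here are illustrative, not from the talk):

```python
# Sketch: harvest "free" filesystem metadata for every file under a root.
import os
import time

def harvest(root):
    """Walk a directory tree and record the metadata the filesystem already keeps."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            records.append({
                "path": path,
                "size_bytes": st.st_size,
                # Modification time, normalized to an ISO-8601 UTC string
                "modified": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(st.st_mtime)),
                "owner_uid": st.st_uid,
                "mode": oct(st.st_mode & 0o777),
            })
    return records
```

Even this much, loaded into a database, already answers "what do we have, how big is it, and who touched it last" questions that enterprise IT cannot.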
12. The filesystem / directory tree is your
default metadata database
It is what your team is using today
Any proposal less functional than
descriptive filenames will fail
13. If you have four groups working on a compiler, you’ll
get a four-pass compiler
Eric S Raymond, The New Hacker’s Dictionary, 1996
14. Most primary data files (in
bioinformatics) include valuable
metadata, usually in the header.
These can be quite verbose.
15. Bioinformatics as a discipline is
filled with duplication.
“hg19” here is identical to “build
37” from the previous slide.
16. Many file headers include the
command line and parameters that
were used to generate the file.
This is the default method for storing
experimental “provenance”
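Scraping that default provenance is mechanical. A sketch of pulling the `##` metadata lines from a VCF-style header; the header text below is invented for illustration, and real headers vary widely:

```python
# Sketch: pull provenance lines out of a VCF-style header.
# The HEADER string is invented for illustration.
HEADER = """\
##fileformat=VCFv4.2
##reference=hg19
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller -R hg19.fa -I sample.bam">
#CHROM\tPOS\tID\tREF\tALT
"""

def scrape_header(text):
    """Return all '##' metadata lines, keyed by the field name before the first '='."""
    meta = {}
    for line in text.splitlines():
        if line.startswith("##") and "=" in line:
            key, _, value = line[2:].partition("=")
            meta[key] = value
    return meta
```

Storing these lines wholesale, rather than parsing them deeply, is usually enough to answer "which tool, which reference, which parameters" questions later.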
18. Container technology (Docker / Singularity)
revolutionized software deployment.
Instead of installers and configurators, we ship a
whole operating system, with the app pre-installed.
We do not have a similar solution for experimental
metadata.
We have not found a way to package and ship
domain experts and researchers.
We don’t have containers for metadata
19. NoSQL is a delightful prototyping tool
• NoSQL-style databases (MongoDB, or PostgreSQL with
its JSONB columns, …) do not require a fully defined schema.
• You gain flexibility at the cost of consistency and
possibly performance.
• This makes them ideal for prototyping
• “Plan to throw one away; you will, anyhow.”
Fred Brooks, 1975
The Mythical Man-Month
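In that throwaway spirit, a schemaless document store can be faked with nothing but the standard library. This sketch uses sqlite3 purely as a stand-in for MongoDB; every name is illustrative:

```python
# Sketch: a throwaway "NoSQL" document store, using sqlite3 as a
# stand-in for MongoDB. Documents are JSON text; no schema is enforced.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, body TEXT)")

def put(doc_id, doc):
    """Store any JSON-serializable document under an id."""
    db.execute("INSERT OR REPLACE INTO docs VALUES (?, ?)",
               (doc_id, json.dumps(doc)))

def get(doc_id):
    """Fetch a document back as a dict, or None if absent."""
    row = db.execute("SELECT body FROM docs WHERE id = ?", (doc_id,)).fetchone()
    return json.loads(row[0]) if row else None
```

Notice that two documents with completely different fields coexist happily, which is exactly the flexibility (and the consistency risk) the slide describes.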
20. What goes in the NoSQL?
Unique key to identify the file
The path to where it is stored
A checksum (to find duplicates later)
The header (scrape and store wholesale)
Whatever else the lab said was important
– Perhaps column headers from that spreadsheet they use…
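The list above maps directly onto one record per file. A minimal sketch, with illustrative field names; the checksum is what makes duplicate-finding possible later:

```python
# Sketch of one metadata record per file, following the list above.
# Field names are illustrative.
import hashlib
import uuid

def make_record(path, data: bytes, header: str, extra: dict):
    return {
        "id": str(uuid.uuid4()),                      # unique key for the file
        "path": path,                                 # where it is stored
        "sha256": hashlib.sha256(data).hexdigest(),   # to find duplicates later
        "header": header,                             # scraped wholesale
        **extra,                                      # whatever the lab said mattered
    }

def find_duplicates(records):
    """Group paths whose file contents hash identically."""
    by_sum = {}
    for r in records:
        by_sum.setdefault(r["sha256"], []).append(r["path"])
    return {s: paths for s, paths in by_sum.items() if len(paths) > 1}
```

Given how much duplication bioinformatics tolerates, the checksum column alone often pays for the whole exercise.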
21. Capture metadata at the time of creation
• Metadata needs to be captured at the point of data creation
• This amounts to putting more work on staff who are likely
already overburdened and time conscious
• Very little of the benefit of rigorous data processes will be felt in
the lab (at least at first)
22. The Data Tzar: choosing a title
Data Tzar – Clearly empowered; the title sparks curiosity
Data Janitor – Implies data are trash; low-prestige job
Data Monkey – Disrespectful, vaguely racist
Chief Data Officer – They mostly seem to work on licensing
23. The Data Tzar: Day 1
Engage with data generators
– Tools to make their lives easier: dashboards, alerting systems, backups, routine
analysis, QC checks
– Go “breadth first” across the enterprise
– Do not start with the whole genomes
Sneakily harvest metadata
Necessary resources (day 1):
– 1 – 2 early career bioinformatics programmers
– Access to an infrastructure engineer
– A modest budget on your cloud provider of choice
24. The Data Tzar: First Year
Do a lot of favors, build a lot of Shiny apps
Convene working groups around specific types of data
Create crosscutting dashboards for leadership
Make friends with the heads of information security and compliance.
Prepare a budget proposal
25. FAIR Data (within the enterprise)
Findable
• NoSQL database of metadata and checksums
• It’s plenty for a good long time.
Accessible
• Federated identity management
• Architecture of S3 buckets and production
“roles”
Interoperable
• “It’s much easier to go FAR than to go FAIR”
Reusable
• Data standards, ontologies, strong policy
framework, including electronic consents for
human subjects data.
26. The Clinical Data Ecosystem
Incredible opportunities here, and rapidly developing data silos
There is an incredible wealth of data available to support both
clinical care and research. Unfortunately, it is carved up and
isolated. The phrase I hear most frequently from hospital CIOs:
“No upside.”
Data sources (from the diagram):
– Electronic Medical Records: patient journals, consumer
products, longitudinal data from other providers …
– Diagnostic Imaging: possibility of a self-normal
(N of 1) over time
– Clinical Notes: natural language processing has strong potential
– Hospital Telemetry: innovations in the basics of
clinical observation
– Primary Lab Data: pressure to avoid incidental
findings, prevent bias
27. Appropriate Use and Consent
“We should be up front with participants that we can’t protect
their privacy completely, and we should ensure that the most
appropriate legislation is in place to protect participants from
being exploited in any way.”
- Eric Schadt, CEO, Sema4
28. Policies and Governance
Appropriate usage
Human readable document: Expectations of privacy and
standards of behavior.
Data Classification
Governance document: Defines the major categories of data
(corporate sensitive, clinical, …) and standards for handling of
each.
Written Information Security Policy (WISP)
Technical document: Defines how systems must be configured to
protect sensitive data and operations.
Vendor Qualification
Business SOP to establish practices around how vendor access
and systems should be managed.
29. Conclusions
Organizing data is a human practice, not a technology choice
– There is no free lunch
Start simple, with free technologies, and quick wins
– NoSQL databases with headers and checksums
– Plan to invest in infrastructure about 18 months into the journey
– Don’t start with the whole genomes
Good policy makes simple practice
– Make data somebody’s job.
– Enterprise data management has much to teach us