- Data challenges are growing in terms of volume, variety, velocity and quality. There is no single solution and real-world solutions will be hybrid.
- Metadata management is a huge challenge; even basic metadata is beyond most small organizations. Federated systems are needed to transform medicine.
- The document discusses challenges with data management across various domains including life sciences, healthcare, genomics, machine learning, artificial intelligence, and personal data. It emphasizes the importance of data visibility, quality, and integration across siloed systems.
2. Take-home messages
Data challenges are large and growing
– Not just volume
– Also variety, velocity, quality
There is no one single perfect solution
– Requirements are diverse
– Real world solutions will be hybrid
Metadata management is a huge challenge
– Even the basics are beyond most small organizations
– We need federated systems to transform medicine
4. Geek Cred: My First Petabyte, 2008
My first Petabyte: 2008
9. Genomic Data Production in Context
Genomic data production @ Broad
I did research computing at Broad from 2014 - 2017
10. Geek Cred: My First Petabyte, 2008
My first Exabyte: 2014
11. Data: The new oil*
Data Base: Structure, queries
Data Warehouse: All the data in one place. Limited integration.
Data Mart: Serve up warehoused data to users (Shiny counts)
Big Data: Volume, Variety, Velocity
Data Lake: Data warehouse, but designed for in-situ analytics
Data Ocean: A data lake, for the cromulently embiggened!
Data Commons: When the benefits of sharing data outweigh the competitive instinct to hoard it
Data Biosphere: A data commons, but for the cool kids
An immature tyrant flycatcher. Needs a data mart, because it doesn’t know R or Linux yet.
Hype-o-meter Impact-o-meter
15. Primary Data Production
Data are produced on instruments …
Sequencer / Mass Spec / …
Analysis Systems
High Performance Storage
… Transformed and distilled …
… Delivered to downstream processes …
… And archived for various purposes (FDA, HIPAA, Intellectual property, …).
Customer facing storage
Durable, cost effective storage
I recommend an ‘archive first’ approach,
EMR / ELN / LIMS / LIS
Metadata management is still a massive challenge
Lab_Sample_tracker.xls
Filename_as_metadata_for_eric_v2
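One small step away from filename-as-metadata is a sidecar record written next to each data file. What follows is a minimal sketch in Python, not something from the talk: the `write_sidecar` helper, the `.meta.json` suffix, and the example field `sample_id` are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def write_sidecar(data_path: Path, **fields: str) -> Path:
    """Write <file>.meta.json next to a data file and return the sidecar path.

    Hypothetical helper: the field names passed in are up to the caller.
    Reads the whole file into memory, so stream in chunks for large data.
    """
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    record = {
        "file": data_path.name,
        "sha256": digest,                       # lets a reader detect silent edits
        "size_bytes": data_path.stat().st_size,
        **fields,                               # e.g. sample_id="S1"
    }
    sidecar = data_path.with_name(data_path.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True))
    return sidecar
```

The point of the checksum is that the metadata stays tied to the bytes it describes, even if someone renames the file to `_v3`.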
17. Quality Matters
Ask a computational biologist / data scientist what fraction of their time is spent fighting data quality, formatting, and similar issues.
Multiply that by an entire industry
They deserve better.
18. Machine Learning (ML)
Algorithms that optimize and tune based on large amounts of data
These have been around for a very long time (KNN and Linear Regression are totally ML).
Algorithm innovations (deep neural nets), plus ubiquitous big data, plus improvements in computing, storage, network, and software.
Killer apps everywhere in image recognition, natural language processing, clustering, categorization
Hype-o-meter Impact-o-meter
A ‘swan pink yellow’ columbine flower. Identifying objects in images is machine work now.
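To make the "KNN is totally ML" point concrete, here is a minimal k-nearest-neighbors classifier in plain Python. It is a sketch for illustration only; the function name `knn_predict` and the toy points and labels are invented.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; points are equal-length tuples.
    """
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up toy data: two small clusters in 2D.
train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
         ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.1, 0.9), "b")]

print(knn_predict(train, (0.15, 0.15)))  # prints: a
print(knn_predict(train, (0.95, 1.0)))   # prints: b
```

There is no training step at all here: the "model" is the data, which is part of why this family of methods has been around for so long.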
19. Data for Analytics / ML / AI
A large and growing set of data is curated…
Commercial / outsource labs
Public or licensed datasets
In-house labs
Curation
Analysis Systems
High Performance Storage
… and mined for insights.
Analyst
21. Data for analytics
A large and growing set of data is curated…
Commercial / outsource labs
Public or licensed datasets
In-house labs
Curation
Analysis Systems
High Performance Storage
… and mined for insights.
Insights take both short and long paths back into the system
Analyst
Durable, cost effective storage
• What does “backup” mean, exactly?
• How do we capture provenance without massive duplication?
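One possible angle on the provenance question is to store each distinct blob once, keyed by its content hash, and let provenance records point at hashes instead of holding copies. The sketch below is an in-memory toy under that assumption; the `ContentStore` class, its methods, and the record fields are hypothetical, not a description of any particular product.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ContentStore:
    blobs: dict = field(default_factory=dict)        # sha256 hex -> bytes
    provenance: list = field(default_factory=list)   # derivation records

    def put(self, data: bytes) -> str:
        """Store data once per distinct content and return its hash key."""
        key = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(key, data)
        return key

    def record(self, inputs, output: str, step: str) -> None:
        """Note that `output` was derived from `inputs` by `step`."""
        self.provenance.append(
            {"inputs": list(inputs), "output": output, "step": step}
        )

    def lineage(self, key: str) -> list:
        """Return the step names that led to `key`, most recent first."""
        steps, frontier, seen = [], [key], set()
        while frontier:
            current = frontier.pop()
            if current in seen:
                continue  # guard against cycles such as identity steps
            seen.add(current)
            for rec in self.provenance:
                if rec["output"] == current:
                    steps.append(rec["step"])
                    frontier.extend(rec["inputs"])
        return steps


store = ContentStore()
raw = store.put(b"raw reads")
raw_again = store.put(b"raw reads")   # same content, no second copy
aligned = store.put(b"aligned reads")
store.record([raw], aligned, "align")
```

In this toy, a "backup" of the provenance is just a copy of the small record list; the large blobs are only ever stored once per distinct content.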
22. Artificial Intelligence (AI)
Distinguished (for me) by autonomous behavior and clever-looking behavior in the face of unanticipated situations.
No requirement that “intelligent” mean “like a human.”
Machine learning algorithms are a great (but not the only) way to create AI systems.
Beware “bread machine AI.”
Hype-o-meter Impact-o-meter
Getting there!
My cat shows surprising intelligence despite having a brain the size of a walnut
24. The Clinical Data Ecosystem
There is an incredible wealth of data available to support both clinical care and research
Unfortunately, it is carved up and isolated
There are both good and bad reasons for this
Patient Journals
Consumer products
Longitudinal Data from other providers …
Electronic Medical Records
Diagnostic Imaging
Clinical Notes
Hospital Telemetry
Primary Lab Data
Incredible opportunities here, and rapidly developing data silos
Possibility of a self-normal (N of 1) over time
Natural language processing has strong potential
Innovations in the basics of clinical observation
Pressure to avoid incidental findings prevent bias
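The self-normal (N of 1) idea on this slide can be sketched as comparing a new measurement to a person's own history instead of a population range. This is a minimal illustration only; the function name `self_normal_z` and the resting heart rate numbers are invented, and a real system would need far more care.

```python
from statistics import mean, stdev


def self_normal_z(history, new_value):
    """Return how many standard deviations `new_value` sits from this
    person's own baseline, computed from their earlier measurements."""
    sigma = stdev(history)  # sample standard deviation; needs >= 2 points
    if sigma == 0:
        raise ValueError("history has no variation; z-score is undefined")
    return (new_value - mean(history)) / sigma


# Invented example: one person's resting heart rate over five visits.
history = [60, 62, 61, 59, 58]
z = self_normal_z(history, 68)
print(f"z = {z:.2f}")
```

A reading of 68 is unremarkable against a population range, but it stands out clearly against this individual's own record, which is the appeal of longitudinal data.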
25. Personal Data Impacts Behavior
I use a commercial service that combines labwork with wearable data
They provide insights and coaching
I have, personally, found this transformational in how I approach my health.
30. Why are we here?
Social Mission
• Improved health outcomes
• Quality-adjusted life-years
• Increased therapeutic effectiveness
• Reduced barriers to access
Scientific / Business Goals
• Publications / Patents / Druggable leads
• Accelerated innovation cycle
• Reduced time to market
Technology / Infrastructure
• Speeds & Feeds
• Improved performance on benchmarks
• Lower cost per unit
• Infrastructure agility
32. Maslow’s Hierarchy of Needs
Creativity, Purpose
Confidence, achievement
Friendship, connectedness, belonging
Safety, physical and economic stability
Air, food, shelter, sleep
Wireless Internet, Fully charged battery
If you lack this (the lower levels), you don’t get to engage here (the upper levels)
33. IT Hierarchy of Needs
“Thought Partner”
Automation and compliance
Productivity and Security, Applications, disaster preparedness
Files, formats, naming conventions, access controls
Phones, Projectors, Internet, Email, Chat
Power, Building Access, Laptops, Wifi, Identity
If you lack this (the lower levels), you don’t get to engage here (the upper levels)
34. Data Visibility Saves Money
Private Data Holdings
Public Data
Backups
…
Private copy of public data
$$ !!
Lack of data visibility leads to increased costs and engineering challenges.
It is depressingly common to see multiple representations of the same data, all being archived together.
BAM BCL FASTQ
This is also a metadata challenge
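A first pass at data visibility can be as simple as hashing files and grouping identical content. The sketch below assumes a plain directory tree; the function names `file_digest` and `find_duplicates` are hypothetical. It only catches byte-identical copies, such as a private copy of a public dataset. Recognizing that a BAM and a FASTQ hold the same underlying reads is exactly the part that needs metadata.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Return groups of files under `root` whose contents are identical."""
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

On real holdings it would be cheaper to group by file size first and hash only the collisions; that optimization is left out to keep the sketch short.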
35. Challenge Architecture: The data DMZ
• An architecture to support data creation, delivery, and use
• … for seamless collaboration between organizations …
• … without sacrificing security, appropriate usage, or privacy …
• … and that delivers on the potential of modern analytic capabilities.
36. Blockchain
“The clown car of our industry in 2018”
• Distributed ledger: trustworthy data / records without a central authority.
• Self-executing contracts: Shared, trustworthy code to operate on that data.
• Initial Coin Offerings: massively accelerated (and deregulated) way to set monetary value on a data ecosystem.
Amazing possibilities in permission / consent management.
When I make snarky comments on LinkedIn, people ask if they can invest.
Hype-o-meter Impact-o-meter
The angel weeps because there are some really compelling use cases for blockchain, but the hype is deafening.
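The tamper-evidence part of a ledger can be shown in a few lines: each entry carries the hash of the one before it, so editing an old record breaks every later link. This sketch covers only that hash chain, with no distribution, consensus, or contracts. The function names and the consent-record payloads are invented for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, payload: dict) -> str:
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()


def add_entry(chain: list, payload: dict) -> dict:
    """Append a payload, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"prev": prev_hash, "payload": payload,
             "hash": _entry_hash(prev_hash, payload)}
    chain.append(entry)
    return entry


def verify(chain: list) -> bool:
    """Recompute every link; any edited payload or broken link fails."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


# Invented consent records, in the spirit of permission management.
chain = []
add_entry(chain, {"subject": "p1", "consent": "research"})
add_entry(chain, {"subject": "p1", "consent": "revoked"})
print(verify(chain))  # prints: True
```

A single party holding this list could still rewrite the whole chain; the "without a central authority" property comes from many parties holding and checking copies, which this sketch does not attempt.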
37. Take-home messages
Data challenges are large and growing
– Not just volume
– Also variety, velocity, quality
There is no one single perfect solution
– Requirements are diverse
– Real world solutions will be hybrid
Metadata management is a huge challenge
– Even the basics are beyond most small organizations
– We need federated systems in order to transform medicine.