Presented at the European Respiratory Society, Berlin, October 2017. A high-level talk to a mixed audience of clinicians and scientists on the difficulties of biomedical analysis, covering practical, statistical and data issues.
The 4-headed beast
● The 4 heads
○ Acquisition
○ Storage
○ Analysis
○ Sharing
● Big data 4Vs
○ Velocity
○ Volume
○ Variety
○ Veracity
The problems of biomedical data
Many ...
● Types
● Formats
● Silos
● Gaps
● Interactions
Difficult analysis
● The curse of dimensionality
● Multiple hypothesis testing & false discovery (see the sketch after this list)
● Batch effects
● Life history
● Biased sampling
● Need for integrative analysis
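One of these problems lends itself to a concrete illustration: when thousands of hypotheses are tested at once, a plain p < 0.05 cut-off guarantees false discoveries. Below is a minimal sketch, assuming Python with NumPy, of the Benjamini-Hochberg false-discovery procedure run on simulated p-values; the function name and the simulated data are illustrative only, not part of the talk.

import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of tests kept after Benjamini-Hochberg false-discovery control."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    # Largest k with p_(k) <= (k / m) * alpha; keep the k smallest p-values
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    keep = np.zeros(m, dtype=bool)
    if below.any():
        k = np.where(below)[0].max()
        keep[order[: k + 1]] = True
    return keep

# 10,000 purely null tests: naive thresholding "discovers" roughly 500 of them, BH almost none
rng = np.random.default_rng(0)
p = rng.uniform(size=10_000)
print("p < 0.05:", (p < 0.05).sum(), "| BH-significant:", benjamini_hochberg(p).sum())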
Practical issues
● Unstructured data
● Managing big data
● Security
● Legal & privacy
Future medicine
A mix of promise & peril
● More data
○ Genomic medicine
○ Other “omic” medicine
○ Wearables
○ EHR & digital health
● P4 medicine
○ Stratification
○ Analysis at the bedside
○ Patient participation
● Translational medicine
○ Leveraging health data for research
Scientific data doubles every 18 months
A new paper is published every 30 seconds
Most papers are never cited or even read
No new principle will declare itself from below a heap of facts. (Peter Medawar)
Standards
● Clinical descriptions
● Measurements:
○ blood pressure
○ white cells
● Cross-study
Yes!
● Allows combining & comparing studies (see the sketch after this slide)
● CDISC
● HPO
But!
● A lot of work
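As a small illustration of why shared standards let studies be combined, here is a sketch, assuming Python with pandas, that harmonises two hypothetical studies recording the same measurement under different local column names; all names and values are invented for the example.

import pandas as pd

# Two hypothetical studies recording systolic blood pressure under different local names
study_a = pd.DataFrame({"patient_id": ["A1", "A2"], "sys_bp_mmHg": [128, 141]})
study_b = pd.DataFrame({"subject": ["B7", "B9"], "SBP": [119, 150]})

# One agreed standard vocabulary for columns makes the tables directly comparable
standard = {
    "patient_id": "subject_id", "subject": "subject_id",
    "sys_bp_mmHg": "systolic_bp_mmHg", "SBP": "systolic_bp_mmHg",
}
combined = pd.concat(
    [study_a.rename(columns=standard), study_b.rename(columns=standard)],
    ignore_index=True,
)
print(combined)

Real standards such as CDISC or HPO do this at far larger scale, which is exactly why they are a lot of work.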
Data formats & storage
Yes!
● Plain text
● Open formats
● Structured formats
● Advantages:
○ Human & machine readable
○ Unambiguous
○ WYSIWYG
● Examples:
○ Open bio formats
○ CSV, TSV (see the sketch after this slide)
No!
● Homebrew formats
● Proprietary / closed formats
● Binary formats
● Excel
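To make the "Yes!" column concrete, here is a sketch, assuming Python's standard csv module, of writing and re-reading a plain-text TSV file; the file name, column names and values are hypothetical. The point is that an open, structured text format stays readable by both humans and machines with no special software.

import csv

# A small table written as plain-text, tab-separated values (hypothetical measurements)
rows = [
    {"subject_id": "S01", "fev1_litres": "3.1", "visit": "baseline"},
    {"subject_id": "S02", "fev1_litres": "2.4", "visit": "baseline"},
]

with open("measurements.tsv", "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=list(rows[0]), delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)

# The same file is trivially machine-readable again, and still legible in any text editor
with open("measurements.tsv", newline="") as handle:
    for record in csv.DictReader(handle, delimiter="\t"):
        print(record["subject_id"], record["fev1_litres"])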
Workflow systems & notebooks
● Analysis as:
○ An executable recipe
○ A document or commentary
● Many candidates:
○ Workflows:
■ Snakemake (see the sketch after this list)
■ Nextflow
■ CWL etc.
○ Computational notebooks:
■ Jupyter / IPython
■ RMarkdown
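To show what "analysis as an executable recipe" looks like in practice, here is a minimal sketch in Snakemake's Python-based rule syntax; the file names, sample names and shell commands are invented for illustration and are not from the talk.

# Snakefile: each rule declares its inputs, outputs and the command linking them,
# so the whole analysis becomes a reproducible, re-runnable recipe
rule all:
    input:
        "results/summary.tsv"

rule count_reads:
    input:
        "data/{sample}.fastq"
    output:
        "counts/{sample}.txt"
    shell:
        "wc -l {input} > {output}"

rule summarise:
    input:
        expand("counts/{sample}.txt", sample=["patient1", "patient2"])
    output:
        "results/summary.tsv"
    shell:
        "cat {input} > {output}"

Running snakemake rebuilds only the outputs that are missing or older than their inputs, which is what makes the recipe worth sharing alongside the data.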
Deep learning / machine learning
How do you know a biologist is using deep learning in their research?
Don't worry, they'll tell you.
● “Just” optimization and search techniques
● Takes a set of features and produces a model that performs a classification or a regression
● A series of layers that assemble features into higher-level features (see the sketch after this list)
● Several high-quality toolkits
● Some need for specialised hardware (GPU)
● Interpretability
● Ground truths
● Needs lots of data
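Since the slide describes deep learning as layers that assemble features into higher-level features, here is a minimal sketch, assuming Python with TensorFlow/Keras and random placeholder data, of a small layered classifier; nothing here comes from the talk itself, and real work would need far more data and proper validation.

import numpy as np
import tensorflow as tf

# Placeholder data: 200 samples with 50 features each, and binary labels
X = np.random.rand(200, 50).astype("float32")
y = np.random.randint(0, 2, size=200)

# Each Dense layer combines the previous layer's features into higher-level ones
model = tf.keras.Sequential([
    tf.keras.Input(shape=(50,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # binary classification output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))

The interpretability and ground-truth caveats on the slide still apply: a model can learn nothing real from random labels like these, which is why lots of well-labelled data matters.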
Batch effects
● Technical sources of variation
○ Reagents
○ Technician
○ Platform
○ ...
● Solutions:
○ Plot the data (e.g. principal components coloured by batch; see the sketch after this list)
○ Don't batch
○ ComBat etc. (but loss of information)
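The "plot the data" advice can be made concrete: a quick principal-component plot coloured by batch often reveals a batch effect before any modelling. Below is a sketch assuming Python with NumPy, scikit-learn and matplotlib, using simulated data with an artificial batch shift; everything in it is illustrative.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Simulated expression-like matrix: 40 samples x 200 features, two processing batches,
# with a deliberate shift added to the second batch
rng = np.random.default_rng(1)
batch = np.repeat([0, 1], 20)
data = rng.normal(size=(40, 200)) + batch[:, None] * 1.5

# If samples separate by batch rather than by biology in the first components,
# a batch effect is likely present and needs handling before interpretation
coords = PCA(n_components=2).fit_transform(data)
for b, colour in [(0, "tab:blue"), (1, "tab:orange")]:
    plt.scatter(coords[batch == b, 0], coords[batch == b, 1], color=colour, label=f"batch {b}")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()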
Omnigenics
What if every gene affected
every other gene?
● Boyle, Li & Pritchard 2017
● FOAF / six degrees of separation effect
● Implicated genes are a few drivers and an enormous number of “related” loci
● Context?
Conclusion
Taming the 4-headed beast
Acquiring: interpret EHR
Storing: data formats & systems
Analysing: statistics, correct for batch effects, integrative analysis, deep learning
Sharing: standards, data formats, workflow systems