Big Data & ML for Clinical Data

Big Data & Machine Learning for
Clinical Data
Paul Agapow <p.agapow@imperial.ac.uk>
Data Science Institute, Imperial College London

 Biomedical science is now data
science
 I was a biochemist, immunologist,
and then an infectious disease
bioinformatician
 I’m now a “biomedical data
scientist”
 I will be a Health Informatics
Director at AstraZeneca
About me & these lectures
WikiMedia Commons

 We increasingly use & need:
 Lots of complex data
 Real world evidence (outside RCTs)
 Computational tools
 Statistical analysis
 Complex interactions
 Precision medicine: prediction &
(sub)typing
 Also:
 Cheap
 Successful in other domains
 But lots of hype and jargon
Biomedical science is now data science
WikiMedia Commons

 The world is increasingly
“datafied” – we make more and
bigger datasets
 Devices
 Routine collection
 Aggregation & integration
 Big Data is “too big” for
conventional approaches
Part 1: Big Data
WikiMedia Commons

 “Quantity has a quality of its
own”
 Often free
 Real
 Rich, deep, interactions
 Needed for ML and other
assumption-light approaches
Why Big Data?
By Ender005 - Own work, CC BY-SA 4.0,
https://commons.wikimedia.org/w/index.php?curid=49888192

 Many diseases with the same clinical presentation have different
molecular phenotypes
 Several overlapping terms
 stratified: separate patients into groups for treatment
 precision:
 tailor treatment to individual
 improved targeted therapies with fewer side effects
 “Right medication, right dose, right patient, right time, right route”
 Also personalised, P4 …
 E.g. asthma
Why Big Data? Precision medicine

 Volume
 Velocity
 Variety
 Veracity
 Value
The 3 / 4 / 5 Vs of Big Data
By MuhammadAbuHijleh - Own work, CC BY-SA 4.0,

 Limits labile to technological
progress
 Memory
 Compute
 Data schema
 Solutions: distributed & parallel
computation, new high-end
databases
The problem with volume: tools & platforms
WikiMedia Commons

 Multiple hypothesis testing
and false discovery
 Bias: a sample is not the
population
 The Past is not the Present
 Observation without
understanding
 The curse of dimensionality
 Privacy
 Some ML-specific issues
The problem with volume: methodology
From KDNuggets
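To make the multiple-testing point concrete, here is a minimal sketch of one common correction, the Benjamini-Hochberg procedure for controlling the false discovery rate. The slide does not name a specific method, so this choice is an assumption, and the p-values are invented for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where a hypothesis is rejected
    while controlling the false discovery rate at `alpha`."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Illustrative p-values only (not from any real study).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
naive = [p < 0.05 for p in p_values]
corrected = benjamini_hochberg(p_values, alpha=0.05)
print(sum(naive), "tests pass an uncorrected 0.05 threshold")
print(sum(corrected), "tests survive the correction")
```

With many variables tested at once, the uncorrected threshold lets through results that the corrected one does not, which is the false-discovery risk the slide warns about.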

 Many, many types of data
 How do we use multiple types?
 Which type do we use?
 Disease is systemic
 Interactions
 Evidence
 Solutions: integrated analysis,
independent analysis with
validation
The problem with variety
Wu, Sanin, Wang (2016) Clinical Applications and Systems
Biomedicine

 Much biodata is uncertain
 Noise
 Mistakes
 People lie
 A sample is not a population
 Incompatible systems
 Most analyses are not reproducible
 Solutions: imputation, standards,
cross-validation etc.
The problem with veracity
By Khaydock - Own work, CC BY-SA 3.0,

 How do we
 Re-use data
 Compare data
 Store data from multiple sources
 Even know what data is
 FAIR, OHDSI / OMOP, HPO
 Even just metadata helps for
cataloguing
 But: multiple & incomplete
standards, translation, complexity
Solution: Standards & ontologies
WikiMedia Commons

 Much data cannot leave its
home institution
 Hospitals
 Registries
 Insurance companies
 Governance is hard & slow
 So take the analysis to the data
 Data looks the same but may
be internally different
Solution: Federated analysis
International Collaboration for Autism Registry Epidemiology
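"Taking the analysis to the data" can be sketched as follows: each institution computes aggregate statistics locally, and only those aggregates travel to a coordinator. This is a toy illustration under stated assumptions, not any particular federated platform; the function names `local_summary` and `pooled_mean_sd` and the site data are hypothetical.

```python
import math


def local_summary(values):
    """Runs inside one institution: returns aggregates, never the records."""
    return {
        "n": len(values),
        "total": sum(values),
        "total_sq": sum(v * v for v in values),
    }


def pooled_mean_sd(summaries):
    """Runs at the coordinator: combines the per-site aggregates."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    total_sq = sum(s["total_sq"] for s in summaries)
    mean = total / n
    # Sample variance from sums; adequate for a sketch, but numerically
    # fragile for very large values.
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, math.sqrt(variance)


# Hypothetical measurements held at three separate sites.
site_a = [5.1, 6.0, 5.5]
site_b = [7.2, 6.8]
site_c = [5.9, 6.1, 6.3, 6.0]

summaries = [local_summary(site) for site in (site_a, site_b, site_c)]
mean, sd = pooled_mean_sd(summaries)
print(f"pooled mean {mean:.3f}, pooled sd {sd:.3f}")
```

The result matches what a single pooled analysis would give, but only if every site encodes the variable the same way, which is the "looks the same but may be internally different" caveat.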

 In a vast sea of biodata, how do you
discover anything? How do you avoid
cherry-picking?
 Solutions:
 Distinguish discovery from
exploration
 Non-parametric methods (e.g.
machine learning)
 Some problems don’t have a single
solution but many (e.g. prediction)
The problem with it all: discoverability
EnterpriseKnowledge.com

 Write analyses as recipes
 Snakemake, Nextflow, Flowr
 Use recreatable computational
systems
 Docker
 “Your biggest collaborator is
you, six months ago”
 But: it’s work
Solution: Reproducibility
From RevolutionR

 Big Data is “too big” for current conventional tools & practices
 But it’s ideal for solving many biomedical problems
 There are problems with valid discovery and just handling the data
 Standards, distributed databases and analysis and
Summary: Big Data

 “a field of Artificial Intelligence”
 “(the science of) getting computers to learn and act like humans do”
 “getting computers to act without being explicitly programmed”
 “computer systems that automatically improve with experience”
 “neural networks”
 “using statistical techniques to give computer systems the ability to
learn”
Part 2: Machine Learning

In practice:
 broadly-defined set of
algorithms that recognise &
generalise patterns in data
 “non-parametric” or
assumption-light
 may require training over
initial dataset
What is Machine Learning?
By Chire - Own work, Public Domain,

 Enough data
 Enough compute
 Technical progress
 Need 'good enough'
solutions
 Prediction & forecasting
 Categorization
 Pattern recognition
 Early, startling success
Why now?
Ray Kurzweil The Singularity is Near

How is ML different to stats?
              Statistical      Machine learning
Assumptions   strong           weak
Data          small            large
Optimize by   fitting          training
Solutions     “the best”       “good enough”
Hypothesis    proof            exploration
Test          p-values etc.    validation

In practice:
 a field of scientific research
 machine learning
 neural networks
 deep learning
 more of an objective than a methodology
 computational systems that duplicate / emulate / replace human effort
What is Artificial Intelligence?

• Many methods
• Broadly split into:
• Unsupervised: finds structure within data
• e.g. (most) clustering, self-organised maps, principal component
analysis
• Supervised: trained using labelled examples
• e.g. regression, decision trees, naive bayes, neural networks
• Categories can blur
• e.g. k-means, nearest neighbour?
• Which is better?
What are ML methods?

• (Train a model from data)
• This model encapsulates or generalizes the data
• (Validate the model against test data)
• This model transforms features into labels
• Continuous outputs (e.g. real numbers) are regressions
• Discrete outputs (e.g. categories) are classifications
ML terms & process
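The train / validate cycle above can be sketched with a deliberately simple classifier. The nearest-centroid model, the synthetic two-feature data and the "case" / "control" labels are all assumptions chosen for illustration, not a method named in the slides.

```python
import random


def train_nearest_centroid(features, labels):
    """'Training': summarise each class by the mean of its feature vectors."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(model, x):
    """Transform a feature vector into a label: the closest class centroid."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda y: sq_dist(model[y], x))


# Synthetic records: two features per patient, discrete label (classification).
rng = random.Random(0)
data = []
for _ in range(100):
    data.append(([rng.gauss(0, 1), rng.gauss(0, 1)], "control"))
    data.append(([rng.gauss(4, 1), rng.gauss(4, 1)], "case"))
rng.shuffle(data)

# Hold out data the model never sees during training.
train, test = data[:150], data[150:]
model = train_nearest_centroid([x for x, _ in train], [y for _, y in train])
accuracy = sum(predict(model, x) == y for x, y in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The model here is just two centroids, which is the sense in which it "encapsulates or generalizes" 150 training records.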

• Take gene expression profiles from patients and cluster to:
• See genes with similar expression profiles
• Similar patients
• Train a model on radiographs with tumours labelled, use to diagnose
unlabelled images
• Find patients with similar symptoms & signs (computational
phenotypes) in EHR
• Train on histories of patients to forecast their future condition
• Find out how terms in a medical corpus relate to each other
Examples of ML

Unsupervised learning: clustering

 What does ‘similar’ mean? How
do we measure it?
 Which features & how weighted?
 Noise & overlapping clusters
 Non-numeric, non-ordered data
 What shapes can clusters be?
 How many clusters? When do we
stop?
 …
Clustering isn’t simple
By Chire - Own work, CC BY-SA 3.0,

Varies but:
 Start with record-feature matrix
 Normalise data
 (“Supervised”: select number of
clusters)
 Run algorithm
 Validate
Clustering process
WikiMedia Commons
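The steps above can be sketched with k-means, one of many possible clustering algorithms. The record-feature matrix (age in years, a biomarker level) is invented, and z-score normalisation is an assumed choice; the slides do not prescribe either.

```python
import random


def zscore_columns(matrix):
    """Normalise each feature (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*matrix))
    means = [sum(c) / len(c) for c in columns]
    # `or 1.0` guards against dividing by zero for a constant column.
    sds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
           for c, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in matrix]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(rows, k, n_iter=50, seed=0):
    """Plain k-means: alternate assigning rows to the nearest centroid
    and moving each centroid to the mean of its members."""
    rng = random.Random(seed)
    centroids = rng.sample(rows, k)
    assignment = [0] * len(rows)
    for _ in range(n_iter):
        assignment = [min(range(k), key=lambda c: sq_dist(row, centroids[c]))
                      for row in rows]
        for c in range(k):
            members = [row for row, a in zip(rows, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def within_cluster_sse(rows, assignment, centroids):
    """Internal validation: sum of squared distances to assigned centroids."""
    return sum(sq_dist(row, centroids[a]) for row, a in zip(rows, assignment))


# Hypothetical record-feature matrix: [age in years, biomarker level].
records = [[25, 1.0], [30, 1.2], [28, 0.9], [65, 3.1], [70, 2.9], [68, 3.3]]
rows = zscore_columns(records)
assignment, centroids = k_means(rows, k=2)
print(assignment, round(within_cluster_sse(rows, assignment, centroids), 3))
```

Without the normalisation step, age would dominate the distance simply because its numbers are larger, which is one reason "which features & how weighted?" matters.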

 A cluster partitioning is a hypothesis
 How do we assess? Validate:
 External: compare against external label or data
 e.g. accuracy, entropy
 Internal: goodness of clustering
 e.g. sum squared errors, cluster cohesion & separation,
silhouette
 Relative: against another clustering scheme
 e.g. is this better with 3 or 4 clusters
Validating clusters

Average over each point:
1. Calculate the average distance to all
other members of its cluster, a
2. For each other cluster, calculate the
average distance to every member.
The minimum of these is b
3. The silhouette width is (b−a) /
max(a,b), the higher the better
Validating clusters: silhouette width
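The three steps above translate fairly directly into code. This is a minimal sketch assuming Euclidean distance, at least two clusters and no duplicate points; scoring a single-member cluster as 0 is a common convention rather than something stated on the slide.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def silhouette_width(points, labels):
    """Mean silhouette width over all points."""
    widths = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Step 1: average distance to the other members of its own cluster.
        same = [euclidean(p, q)
                for j, (q, lab) in enumerate(zip(points, labels))
                if lab == own and j != i]
        if not same:
            widths.append(0.0)  # convention for a single-member cluster
            continue
        a = sum(same) / len(same)
        # Step 2: average distance to each other cluster; keep the minimum.
        b = min(
            sum(euclidean(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != own
        )
        # Step 3: the silhouette width for this point.
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


# Four toy points forming two obvious pairs.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_width(points, [0, 0, 1, 1])  # pairs kept together
bad = silhouette_width(points, [0, 1, 0, 1])   # pairs split apart
print(round(good, 3), round(bad, 3))
```

The sensible partition scores close to 1 and the split one scores below 0, so the same function can compare, say, a 3-cluster against a 4-cluster solution.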

What if there are sub-clusters or
structure?
• Use hierarchical clustering
• Use homogeneity or
completeness metrics to
compare
Nesting & hierarchies

• Complex, heterogeneous
disease
• Many attempts at clustering
• Use transcriptomic &
proteomic data
• Validate with clinical
• 4 clusters with characteristic
genes & clinical behaviour
Example: asthma

 a.k.a. deep learning, (artificial)
neural networks, “AI”
 A series of layers of nodes, each of
which transforms the previous layer.
 Training sets weights on
transformations
 Capable of learning representations
Supervised learning: deep networks
WikiMedia Commons
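"Layers of nodes, each transforming the previous layer" can be sketched as a forward pass through a tiny two-layer network. The weights below are hand-picked purely for illustration; in a real network, training would set them.

```python
import math


def dense_layer(inputs, weights, biases, activation):
    """One layer: each node takes a weighted sum of the previous layer's
    outputs, adds a bias, and applies a non-linear activation."""
    return [activation(sum(w * x for w, x in zip(node_weights, inputs)) + b)
            for node_weights, b in zip(weights, biases)]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hand-picked (not trained) weights, chosen so the hidden layer represents
# "how much the two inputs differ".
hidden_w = [[1.0, -1.0], [-1.0, 1.0]]
hidden_b = [0.0, 0.0]
output_w = [[1.0, 1.0]]
output_b = [-0.5]


def forward(x):
    hidden = dense_layer(x, hidden_w, hidden_b, relu)          # layer 1
    return dense_layer(hidden, output_w, output_b, sigmoid)[0]  # layer 2


for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, round(forward(x), 3))
```

Neither input alone determines the output; the hidden layer builds an intermediate quantity that the output layer then uses, a very small case of learning a representation.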

 There’s little information in an
individual pixel (gene, data point …)
 But individual data points make up
more complete entities
 Each layer takes the layer below and
creates higher-level entities
(representations) from it.
 The system “recognises” higher-
level features that can appear
anywhere in the data.
What’s a representation?
WikiMedia Commons

 Radiologists are overwhelmed
 Want to catch errors &
double-check
 Train ANN over medical
imagery with tumour labelled
 Accuracy similar to humans
Example: diagnosis from medical imagery
From Nvidia

• The model is right but learns
the wrong thing (from our
point of view)
• Solutions:
• Interpreting models
• Better (more examined) data
Problem: useless solutions
Ribeiro et al. (2016) Why Should I Trust You?

 Reversing the model & asking “why”
 What features are important
 Mechanistic insight
 But many ML models are tangled & horribly complex
 And ML community often uninterested
 Solutions:
 Choose an interpretable model
 Software that explores feature space (LIME, Lift, IML)
Problem: interpretability

• Bias (systematic error) vs. Variance
(random error)
• Want a model that captures the
regularities in training data AND
generalizes to unseen data.
• This is impossible
• Solutions:
• Use a variety of data
• Feature selection
• Regularization
Problem: how do models get it wrong?
From KDNuggets
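As one illustration of regularization, here is a sketch of ridge (L2-penalised) regression for a single feature with no intercept, where the solution has a simple closed form. The data and penalty values are invented, and ridge is only one of several regularization schemes.

```python
def ridge_slope(xs, ys, penalty):
    """Slope w minimising sum((y - w*x)**2) + penalty * w**2.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + penalty).
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


# Illustrative data only.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

for penalty in (0.0, 1.0, 10.0):
    print(penalty, round(ridge_slope(xs, ys, penalty), 3))
```

A larger penalty pulls the slope towards zero: the model fits the training data less closely (more bias) in exchange for being less sensitive to noise in it (less variance).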

• What do we want from our ML
models?
• Power / accuracy
• Insight
• Error tolerance
• e.g. drug discovery vs drug safety
Problem: how good do models have to be?
After Harel

• Much (most) data has few positives
• Results in an imbalanced model
• Solutions:
• Over- and under-sampling
• Pre-train with poor data
• Ensemble methods
Problem: imbalanced data & lack of data
DataScience.com
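Over-sampling, the first solution listed above, can be sketched as random duplication of minority-class records. The helper name `random_oversample` and the 95 / 5 class split are hypothetical.

```python
import random


def random_oversample(records, labels, seed=0):
    """Duplicate randomly chosen records from the smaller classes until
    every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for record, label in zip(records, labels):
        by_class.setdefault(label, []).append(record)
    target = max(len(members) for members in by_class.values())

    out_records, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to make up the shortfall.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for record in members + extra:
            out_records.append(record)
            out_labels.append(label)
    return out_records, out_labels


# Hypothetical imbalanced dataset: 95 negatives, 5 positives.
records = list(range(100))
labels = ["negative"] * 95 + ["positive"] * 5

balanced_records, balanced_labels = random_oversample(records, labels)
print(balanced_labels.count("negative"), balanced_labels.count("positive"))
```

Resampling should be applied to the training split only, after the test data has been set aside; otherwise copies of the same record can land on both sides and inflate the measured performance.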

 Machine learning uses large amounts of data with few assumptions to
make models that generalise that data
 This is useful for situations where we don’t have an explicit model and
just need ‘a’ solution.
 But this means we need to examine our data and validate our
solutions
 A ‘bad’ solution can be useful, depending on what you want to
achieve.
Summary: Machine Learning
