Machine learning,
health data & the limits
of knowledge
Paul Agapow
ONC R&D ML&AI AstraZeneca
<paul.agapow@astrazeneca.com>
2021/3/10
Disclosure
• Does not reflect official AZ positions or projects
• No conflicts of interest
About me
• Have been a: health informatician, data scientist, bioinformatician, database administrator, epi-informaticist, software developer, data manager, consultant, molecular geneticist, evolutionary scientist, biochemist, phylogeneticist, immunologist, programmer …
• At:
• Oncology R&D ML&AI / RWE @AZ
• Data Science Institute @ICL
• Centre for Infection @HPA (UK)
• Universities, industry, government …
Using this paper as a jumping-off point
• Shakhovska, Izonin & Melnykova (2021) "The Hierarchical Classifier for COVID-19 Resistance Evaluation", Data 6(1):6
• https://doi.org/10.3390/data6010006
• https://www.mdpi.com/2306-5729/6/1/6/htm
• How to analyse for patterns in COVID data when the observational data is diverse & complex
Data is a saviour & a curse
• Data & analytics have saved us several times in the current crisis
• But too much data can create problems
• And data is not information
RWE: real-world evidence
• Electronic Health Records
• Registries
• Claims databases
• Repurposed trial data
• Defined:
• Anything that isn't an RCT (randomised controlled trial)
• Observational data
• Anything we have to consider the context & sourcing of?
• Why use it?
• Cheap
• Ethical
• Accesses scales & types of data & situations that are otherwise unavailable
But all (RWE) data is biased
• What population does it come from?
• Where was it collected?
• Who did they look for?
• What are those people's habits and histories?
• What are the definitions used?
• E.g. "severe asthma" or "PDL1 expression"
• What are the diagnostic devices?
• What's common medical practice there?
• What causes data to be included / excluded?
• E.g. surveys, visits
• Are inclusions / exclusions at random?
• What incidental correlations?
• Choice of features
The COVID publication: is it good data?
• Do we know where it came from?
• Do we know who is in it?
• Is there missing data?
• “maybe” COVID?
• Are the populations comparable?
• Are antibody levels comparable?
• Different test kits?
• Imbalanced classes? (see the sketch below)
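None of these checks needs exotic tooling. A minimal pandas sketch, assuming a hypothetical covid_cohort.csv with covid_status, collection_site and age columns (stand-ins, not the paper's actual schema):

```python
import pandas as pd

# Hypothetical file and column names -- adjust to the real dataset
df = pd.read_csv("covid_cohort.csv")

# Where did the data come from, and who is in it?
print(df["collection_site"].value_counts(dropna=False))
print(df.groupby("collection_site")["age"].describe())

# Missing data, including ambiguous "maybe COVID" labels
print(df.isna().mean().sort_values(ascending=False))
print(df["covid_status"].value_counts(dropna=False))

# Class imbalance: a lopsided split makes raw accuracy misleading
print(df["covid_status"].value_counts(normalize=True))
```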
How do we analyse RWE correctly?
• Patients are complex:
• Co-morbidities
• Lifestyle, prior history, exposure
• Demographics, genetics, epigenome, microbiome …
• Disease is complex:
• Affects different body subsystems
• Health data is complex:
• Sparse, irregular
• A product of a healthcare system …
• Underlying models unclear
• Many opportunities for confounders & noise
Is ML the best approach for RWE analysis?
[Figure: a continuum of approaches, from statistical modelling to Machine Learning / AI]
• Statistical modelling: clean & controlled data, clear assumptions, explicit models
• Machine Learning / AI: messy, larger data, few assumptions, no explicit model, trained from data
But what are the pitfalls of using ML on health data?
• Need more (labelled) data
• Bias – how was the data sourced?
• Needs to be handled carefully
• May require specialised computation & skills
• Some problems difficult to adapt to ML
• Interpretability – data never lies, but what is it telling us?
Clustering: how simple algorithms can actually be very complex
• Idea of clustering is simple: but what does it actually do?
• Every dataset has clusters, even random noise (see the sketch below)
• Do clusters reflect the underlying reality?
• Are the clusters revealed valid and/or robust?
• Are the clusters of groups you are interested in?
• A cluster is not the truth, it's a hypothesis
(The paper is modestly convincing about these points)
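To make the "clusters in noise" point concrete, a minimal scikit-learn sketch (illustrative, not from the paper): k-means run on uniform random points still returns a tidy partition and a respectable-looking score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))  # pure uniform noise: no real structure

# k-means will happily partition it into 4 "clusters" anyway
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.labels_[:20])                  # every point gets a cluster
print(silhouette_score(X, km.labels_))  # a modest positive score, despite there being nothing to find
```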
The COVID publication: is it good methodology?
• Many different methods but:
• What's the concordance? (see the sketch below)
• What use are 6-7 methods?
• Ensemble them?
• Where's the validation?
• What's the question?
• How many people are actually infected with COVID? or
• Can we build a model to calculate this?
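A minimal sketch of the concordance question (synthetic data, not the paper's): compare two classifiers' predictions with Cohen's kappa.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the paper's data
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pred_lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)
pred_rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)

# Kappa near 1 means the two methods mostly agree --
# and reporting both then adds little beyond one of them
print(cohen_kappa_score(pred_lr, pred_rf))
```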
What makes a good machine-learning approach?
• Be clear what it is predicting
• It should be reproducible
• It should be validated:
• Internally: performance, convergence, loss, sensitivity, robustness, …
• Externally: against another dataset (see the sketch below)
• Almost any ML method can:
• Do (slightly) better than humans
• Get better than 50%
• If it is "better", compared to what?
How do the systems in the paper measure up?
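A minimal sketch of those checks (scikit-learn, synthetic data; the held-out split here is only a stand-in for a genuinely independent external cohort):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1300, n_features=10, random_state=0)
# In real work the external set is an independent dataset,
# not a slice of the same one
X_dev, X_ext, y_dev, y_ext = train_test_split(X, y, test_size=300, random_state=0)

model = LogisticRegression(max_iter=1000)

# Internal validation: cross-validated performance, not one lucky split
print(cross_val_score(model, X_dev, y_dev, cv=5, scoring="roc_auc").mean())

# "Better than what?": compare against a trivial baseline
dummy = DummyClassifier(strategy="most_frequent").fit(X_dev, y_dev)
print(dummy.score(X_ext, y_ext))

# External validation: score on data the model never saw
model.fit(X_dev, y_dev)
print(roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1]))
```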
How do we know what a system is doing?
• Interpretability is non-negotiable
• AI models can only be built for data that you have
• Biased data gives rise to biased models
• A model may not be doing what we think it is
• Toolkits like SHAP & LIME make interpretability easier and comparable (see the sketch below)
(The paper used very interpretable systems)
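For instance, a minimal SHAP sketch on a synthetic tree model (not the paper's models; the return format of shap_values varies across SHAP versions):

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# TreeExplainer gives per-feature, per-sample attributions for tree models
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # format varies by SHAP version

# Global view: which features drive the model's predictions overall
shap.summary_plot(shap_values, X)
```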
How could this have been done better?
• What question are we trying to answer?
• "What's the actual level of infected people in the population?"
• In what time period or setting?
• What's actionable?
• What data can we get?
• What data can we get for validation?
• We don't need 6-7 different methods, just one good one
• Be clear about how good the results are
Summary
• RWE may be a broad and over-reaching category
• But it underlines the complexity & biases of health data
• ML may be the best approach for analysing RWE
• However, its power and flexibility introduce other problems:
• Data “bias”
• Validation
• Interpretability
• ML “findings” are almost always just hypotheses
• Healthcare analytics should not be about analytics but about biology
Final thought
• If you are driven by science and passionate about improving lives, why not work at AstraZeneca?
• Example jobs – please visit our careers website
• Principal Data Scientist - https://careers.astrazeneca.com/job/gaithersburg/principal-data-scientist/7684/14833674
• Associate Director Imaging & AI - Imaging & Data Analytics - https://careers.astrazeneca.com/job/gothenburg/associate-director-imaging-and-ai-imaging-and-data-analytics/7684/14469379
• Data Sciences & AI Graduate Programme – UK - https://careers.astrazeneca.com/data-sciences-and-ai-graduate-programme
