Lecture for Imperial College London's MSc in Health Data Analytics, critiquing a recent paper on COVID diagnosis and moving out to talk about good practices (& limits) in ML and model building
3. About me
• Have been a: health informatician, data scientist, bioinformatician, database administrator, epi-informaticist, software dev, data manager, consultant, molecular geneticist, evolutionary scientist, biochemist, phylogeneticist, immunologist, programmer …
• At:
• Oncology R&D ML&AI / RWE @AZ
• Data Science Institute @ICL
• Centre for Infection @HPA (UK)
• Universities, industry, government …
4. Using this paper as a jumping-off point
• The Hierarchical Classifier for COVID-19 Resistance Evaluation (2021) Shakhovska, Izonin & Melnykova, Data v6:6
• https://doi.org/10.3390/data6010006
• https://www.mdpi.com/2306-5729/6/1/6/htm
• How to analyse for patterns in COVID data when the observational data is diverse & complex
5. Data is a saviour & a curse
• Data & analytics have saved us several times in the current crisis
• But too much data can create problems
• And data is not information
6. RWE: real world evidence
• Electronic Health Records
• Registries
• Claims databases
• Repurposed trial data
• Defined:
• Anything that isn’t an RCT (randomised controlled trial)
• Observational data
• Anything we have to consider the context & sourcing of?
• Why?
• Cheap
• Ethical
• Accesses scales & types of data & situations that are otherwise unavailable
7. But all (RWE) data is biased
What population does it come from?
• Where was it collected?
• Who did they look for?
• What are those people’s habits and histories?
What are the definitions used?
• “severe asthma” or “PDL1 expression”
• What are the diagnostic devices?
• What’s common medical practice there?
What causes data to be included / excluded?
• E.g. surveys, visits
• Are inclusion / exclusion at random?
• What incidental correlations?
• Choice of features
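To make the "are inclusion / exclusion at random?" question concrete, here is a minimal simulation sketch. Every rate in it (10% infected, 60% vs 5% symptomatic) is invented for illustration and is not taken from the paper. It compares prevalence estimated from a symptomatic-only dataset with prevalence from a simple random sample of the same size.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10% infected; infected people are far more
# likely to show symptoms. All rates are invented for illustration.
population = []
for _ in range(100_000):
    infected = random.random() < 0.10
    symptomatic = random.random() < (0.60 if infected else 0.05)
    population.append((infected, symptomatic))

def prevalence(people):
    """Fraction of the given people who are infected."""
    return sum(infected for infected, _ in people) / len(people)

# Inclusion NOT at random: only symptomatic people are tested and recorded.
tested = [p for p in population if p[1]]
# Inclusion at random: a simple random sample of the same size.
random_sample = random.sample(population, len(tested))

print(f"true prevalence:       {prevalence(population):.3f}")
print(f"symptomatic-only data: {prevalence(tested):.3f}")
print(f"random sample:         {prevalence(random_sample):.3f}")
```

Under these assumptions the symptomatic-only dataset overstates prevalence several-fold, while the random sample stays close to the true value.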
8. The COVID publication: is it good data?
• Do we know where it came from?
• Do we know who is in it?
• Is there missing data?
• “maybe” COVID?
• Are the populations comparable?
• Are antibody levels comparable?
• Different test kits?
• Imbalanced classes?
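Why imbalanced classes matter can be shown with a small sketch. The labels below are invented (95 negatives, 5 positives) and the "model" is a deliberately useless one that always predicts the majority class. Accuracy looks good, while sensitivity and balanced accuracy show the problem.

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = positive."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

def summarise(y_true, y_pred):
    """Accuracy alongside metrics that are not fooled by class imbalance."""
    tp, fp, tn, fn = confusion(y_true, y_pred)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Invented labels: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
always_negative = [0] * 100

result = summarise(y_true, always_negative)
print(result)
```

Here accuracy is 0.95 even though no positive case is ever found (sensitivity 0.0, balanced accuracy 0.5).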
The data
9. How do we analyse RWE correctly?
• Patients are complex:
• Co-morbidities
• Lifestyle, prior history, exposure
• Demographics, genetics, epigenome, microbiome …
• Disease is complex:
• Affects different body subsystems
• Health data is complex:
• Sparse, irregular
• A product of a healthcare system …
• Underlying models unclear
• Many opportunities for confounders & noise
10. Is ML the best approach for RWE analysis?
A continuum of approaches:
• Statistical modelling: clear assumptions, explicit models, clean & controlled data
• …
• Machine Learning / AI: few assumptions, no model, trained from data, messy data, larger data
11. But what are the pitfalls of using ML on health data?
• Need more (labelled) data
• Bias – how was the data sourced?
• Needs to be handled carefully
• May require specialised computation & skills
• Some problems difficult to adapt to ML
• Interpretability – data never lies, but what is it telling us?
12. Clustering: how simple algorithms can actually be very complex
• Idea of clustering is simple: but what does it actually do?
• Every dataset has clusters, even random noise
• Do clusters reflect the underlying reality?
• Are the clusters revealed valid and/or robust?
• Are the clusters the groups you are interested in?
• A cluster isn’t the truth, it’s a hypothesis
(The paper is modestly convincing about these points)
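The point that every dataset has clusters, even random noise, can be sketched with a plain k-means (Lloyd's algorithm) written from scratch. This is an illustration only, not the method used in the paper. The data is uniform random noise with no structure, and the seeds and `k=3` are arbitrary choices.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means on 2-D points; returns a cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k random data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2)
            for x, y in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels

# Pure noise: 300 points drawn uniformly from the unit square.
rng = random.Random(42)
noise = [(rng.random(), rng.random()) for _ in range(300)]

# Different initialisations can carve the same noise up differently.
for seed in (0, 1):
    labels = kmeans(noise, k=3, seed=seed)
    sizes = [labels.count(c) for c in range(3)]
    print(f"seed {seed}: cluster sizes {sizes}")
```

The algorithm returns a labelling whether or not any real groups exist, so stability across seeds, resamples and methods is something to check rather than assume.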
13. The COVID publication: is it good methodology?
• Many different methods but:
• What’s the concordance?
• What use is 6-7 methods?
• Ensemble them?
• Where’s the validation?
• What’s the question?
• How many people are actually infected with COVID? or
• Can we build a model to calculate this?
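One way to put a number on "what's the concordance?" is Cohen's kappa, which measures agreement between two labellings after correcting for agreement expected by chance. The sketch below uses invented outputs from two hypothetical methods. It assumes both methods emit the same set of class labels, so for raw cluster IDs the labels would first need to be matched.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two labellings, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick class c independently, summed over c.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both methods put everything in one identical class
    return (observed - expected) / (1 - expected)

# Invented outputs of two hypothetical methods on the same 10 samples.
method_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
method_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]

kappa = cohen_kappa(method_1, method_2)
print(f"kappa = {kappa:.2f}")
```

In this toy case the methods agree on 8 of 10 samples, but half of that agreement is expected by chance, giving a kappa of 0.60.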
The data
14. What makes a good machine-learning approach?
• Be clear what it is predicting
• It should be reproducible
• It should be validated:
• Internally: performance, convergence, loss, sensitivity, robustness, …
• Externally: against another dataset
• Almost any ML method can
• Do (slightly) better than humans
• Get better than 50%
• If it is “better”, compared to what?
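The "compared to what?" question can be sketched as internal validation against an explicit baseline. The example below is an assumption-laden toy: the data is synthetic (80 controls, 20 cases, one feature), the nearest-centroid "model" is a stand-in for any classifier, and 5 folds is an arbitrary choice. Each fold reports model accuracy next to the accuracy of always predicting the majority class.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(xs, ys):
    """Nearest-centroid 'model': the mean feature value of each class."""
    return {c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c)
            for c in set(ys)}

def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def cross_validate(xs, ys, k=5):
    """Return (model accuracy, majority-baseline accuracy) for each fold."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        train_x, train_y = [xs[i] for i in train], [ys[i] for i in train]
        model = fit_centroids(train_x, train_y)
        majority = Counter(train_y).most_common(1)[0][0]
        model_acc = sum(predict(model, xs[i]) == ys[i] for i in fold) / len(fold)
        base_acc = sum(majority == ys[i] for i in fold) / len(fold)
        scores.append((model_acc, base_acc))
    return scores

# Synthetic data: 80 controls around 0.0 and 20 cases around 2.0.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(80)] + [rng.gauss(2.0, 1.0) for _ in range(20)]
ys = [0] * 80 + [1] * 20

scores = cross_validate(xs, ys)
for model_acc, base_acc in scores:
    print(f"model {model_acc:.2f}  vs  majority baseline {base_acc:.2f}")
```

With an 80/20 class split the baseline averages 0.80, so "better than 50%" is not a meaningful bar here.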
How do the systems in the paper measure up?
15. How do we know what a system is doing?
• Interpretability is non-negotiable
• AI models can only be built for data that you have
• Biased data gives rise to biased models
• A model may not be doing what we think it is
• Toolkits like SHAP & LIME make interpretability easy and comparable
(Paper used very interpretable systems)
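As a dependency-free illustration of checking what a model is actually using, here is a sketch of permutation importance. This is a simpler, swapped-in technique and not SHAP or LIME. The data, the feature meanings (`antibody_level`, `noise`) and the rule-based `model` are all hypothetical stand-ins.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows the model labels correctly."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, n_features, seed=0):
    """Drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the label
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(model, shuffled, labels))
    return drops

# Toy data: each row is (antibody_level, noise); only the first drives the label.
rng = random.Random(7)
rows = [(rng.random(), rng.random()) for _ in range(500)]
labels = [int(a > 0.5) for a, _ in rows]

def model(row):
    """Hypothetical stand-in for a trained classifier."""
    return int(row[0] > 0.5)

drops = permutation_importance(model, rows, labels, n_features=2)
print(f"importance of antibody_level: {drops[0]:.2f}")
print(f"importance of noise:          {drops[1]:.2f}")
```

A large drop for a feature that should be irrelevant (or no drop for one that should matter) is a sign that the model may not be doing what we think it is.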
16. How could this have been done better?
• What question are we trying to solve?
• “What’s the actual level of infected people in the population”?
• In what time period or setting?
• What’s actionable?
• What data can we get?
• What data can we get for validation?
• We don’t need 6-7 different methods, just 1 good one
• Be clear about “how good” the results are
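One way to be clear about "how good" a result is: report an interval, not just a point estimate. The sketch below computes a percentile bootstrap confidence interval for accuracy. The test result (42 of 50 correct) is invented, and the 2000 resamples and 95% level are arbitrary choices.

```python
import random

def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy.

    `correct` is a list of 1/0 flags: was each test prediction right?
    """
    rng = random.Random(seed)
    n = len(correct)
    # Resample the test set with replacement and recompute accuracy each time.
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Invented result: 42 of 50 test predictions correct.
correct = [1] * 42 + [0] * 8
point = sum(correct) / len(correct)
low, high = bootstrap_ci(correct)
print(f"accuracy {point:.2f} (95% bootstrap CI {low:.2f} to {high:.2f})")
```

On a test set this small the interval is wide, which is exactly the information a bare "84% accuracy" hides.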
17. Summary
• RWE may be a broad and over-reaching category
• But it underlines the complexity & biases of health data
• ML may be the best approach for analysing RWE
• However, its power and flexibility introduce other problems
• Data “bias”
• Validation
• Interpretability
• ML “findings” are almost always just hypotheses
• Healthcare analytics should not be about analytics but about biology
18. Final thought
• If you are driven by science and passionate about improving lives, why not work at AstraZeneca?
• Example jobs – please visit our careers website
• Principal Data Scientist - https://careers.astrazeneca.com/job/gaithersburg/principal-data-scientist/7684/14833674
• Associate Director Imaging & AI - Imaging & Data Analytics - https://careers.astrazeneca.com/job/gothenburg/associate-director-imaging-and-ai-imaging-and-data-analytics/7684/14469379
• Data Sciences & AI Graduate Programme – UK - https://careers.astrazeneca.com/data-sciences-and-ai-graduate-programme