Str-AI-ght to heaven? Pitfalls for clinical decision support based on AI

Ben Van Calster
Department Development and Regeneration and EPI-centre, KU Leuven
Department Biomedical Data Sciences, LUMC Leiden
Research Ethics Committee, UZ Leuven
ben.vancalster@kuleuven.be; @BenVanCalster
ISUOG World Congress, 16 October 2021

Disclaimer
• Talk last year: “a plea for good methodology”
• This talk builds on that, in the context of AI and machine learning
• There is a lot of hype surrounding AI/ML. It may have potential, but we had better get real!
https://lawtomated.com/enough-with-the-a-i-hype-and-why/
Lawtomated

Do not celebrate too early…
Copyright Bas Czerwinski / Getty Images
Julian Alaphilippe, Liège-Bastogne-Liège (Oct. 4th, 2020)
Real winner: Primož Roglič

Deep learning on medical images
Topol. Nat Med 2019;25:44-56. Zhu et al. Front Neurol 2019;10:869.
Titano et al. Nat Med 2018;24:1337-41; Nam et al. Radiology 2019;290:218-28; Ehteshami Bejnordi et al. JAMA 2017;318:2199-210;
Esteva et al. Nature 2017;542:115-8; De Fauw et al. Nat Med 2018;24:1342-50; Raman et al. Eye 2019;33:97-109.

Machine Learning for ‘EHR’ data
Rajkomar et al. Npj Digit Med 2018;1:18.
Rose. JAMA Netw Open 2018;1:e181404.

Reason for popularity?
“Very complex machine learning algorithms are highly flexible,
and hence find relationships we could not see before.
Therefore we make better predictions and better decisions.”
→ Guaranteed success!
Right?

Pitfalls for “predictive analytics”
 1. Poor methodology
 2. Lack of evidence
 3. Considerable heterogeneity
 4. (Financial) conflicts of interest
 5. Actual implementation in clinical practice

1. Methodology matters, not impact factors
Altman DG. BMJ 1994;308:283-284.
Van Calster et al, J Clin Epidemiol, in press.
Altman. BMJ 1994.
Our own frustration paper. JCE 2021.

‘Predictive analytics’: covid-19
Wynants et al. BMJ 2020;369:m1328.
The review found more than 1 paper a day (!)
Results not trustworthy for 97% of the 231 models
Median sample size: 338
Non-representative sample: 42%
Representativity unclear: 25%
Data analysis problematic: 94%
No model validation at all: 22%
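
Why small samples combined with no validation are a problem can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not an analysis from Wynants et al.: the predictors are pure noise, the development sample size of 338 merely echoes the median reported above, and the function names, the number of candidate predictors (50), and the validation size (5000) are hypothetical choices. Selecting the "best" of many noise predictors on a small sample tends to give an apparent AUC above 0.5 that disappears in new data.

```python
import random


def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U); tied scores get average ranks."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(lab for _, lab in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def simulate(n_dev=338, n_val=5000, n_candidates=50, seed=1):
    """Pick the best of several noise predictors; compare apparent and new-data AUC."""
    rng = random.Random(seed)
    # Outcome is a coin flip, unrelated to any predictor (assumption).
    y_dev = [int(rng.random() < 0.5) for _ in range(n_dev)]
    # Apparent AUC: the best candidate, judged on the development data itself.
    apparent = max(
        auc([rng.gauss(0, 1) for _ in range(n_dev)], y_dev)
        for _ in range(n_candidates)
    )
    # The selected predictor is still noise, so in new data it is noise again.
    y_val = [int(rng.random() < 0.5) for _ in range(n_val)]
    x_val = [rng.gauss(0, 1) for _ in range(n_val)]
    return apparent, auc(x_val, y_val)


apparent, validated = simulate()
print(f"apparent AUC on development data: {apparent:.2f}")
print(f"AUC of the selected predictor on new data: {validated:.2f}")
```

The gap between the two numbers is optimism caused by selection alone. Real modelling pipelines are more elaborate, but the same mechanism applies whenever performance is only reported on the data used to build the model.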

Predictive analytics for covid-19
Wynants et al. BMJ 2020;369:m1328
Deep learning models for covid-19 diagnosis using CT or X-ray (RX)
- No discussion of target population or setting
- Control group (without covid-19):
 Images from pediatric population
 Images from a different country
 Images from different time periods
 Barely defined, e.g. ‘healthy persons’
- Images from online repository, without further information
- Often no demographic description at all (not even age or sex!)

Covid-19 deep learning: deep failure!
Roberts et al. Nat Mach Intell 2021;3:199-217.

Public covid-19 X-ray (RX) datasets
Santa Cruz et al. Med Image Analysis 2021;74:102225.

Complex algorithms are data hungry
So you dream of having a Porsche?
If you cannot (or don’t want to) pay for it, you may get this...
This also holds for predictive analytics. A fancier model? More expensive.
Currency: GOOD data.

Measurement and data quality
Missing values: the tricky importance of the invisible
Measurement: timing and procedure matter
Outcome: quality labels are key (see e.g. deep learning on medical images)
Beam & Kohane. JAMA 2018;319:1317-1318.

2. Wanted: evidence
• Kleinrouweler (AJOG 2016): 263 models in obstetrics
• Only 23 of these (9%) had been externally validated…
Other examples of model overload:
• 1060 models predicting outcomes after CVD (1990-2015) (Wessler et al, 2017)
• 363 models predicting CVD (Damen et al, 2016)
• 231 models related to Covid-19 (Wynants et al, 2020), and counting!
• 116 models to diagnose ovarian malignancy (Kaijser et al, 2014)
Wessler et al. Diagn Progn Res 2017;1:20. Damen et al. BMJ 2016;353:i2416. Wynants et al. BMJ 2020;369:m1328.
Kleinrouweler et al. AJOG 2016;214:79-90. Kaijser et al. Hum Reprod Update 2014;20:229-62.

Smartphone apps for skin lesions
Freeman et al. BMJ 2020;368:m127
• 9 validation studies covering 6 apps
• 1132 lesions in total (average 126 per study)
• Methodological quality was poor
o Selective inclusions (non-representative)
o Images were taken and selected by clinicians
o Lots of unusable images
Scarce and poor evidence

Radiology AI
Van Leeuwen et al. Eur Radiol 2021;31:3797-3804
• 64/100: no evidence
• 18/100: evidence of diagnostic performance
• 18/100: evidence of potential impact
• Half of the studies were independent, the other half had conflicts of interest

3. Expect (a lot of) heterogeneity
• Changes in care over time
• Differences in care between healthcare systems
• Differences in populations between practices/hospitals/regions
• Differences in hardware, software, and measurement procedures
• Differences in performance between patient subgroups (cf. fairness)
Futoma et al. Lancet Digit Health 2020;2:e489-e492.
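
One simple consequence of differences in populations can be worked out with arithmetic. The sketch below is illustrative only: the sensitivity and specificity of 0.90 and the three prevalences are hypothetical values, and the function name is made up for this example. It also assumes that sensitivity and specificity stay fixed across settings, which is itself a simplification, since they can vary between populations too.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity, and prevalence via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same hypothetical test, three hypothetical settings that differ only in prevalence.
for prevalence in (0.01, 0.10, 0.40):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.2f}, NPV {npv:.2f}")
```

With these inputs the positive predictive value moves from roughly 0.08 at 1% prevalence to 0.50 at 10% and about 0.86 at 40%, so the same tool can mean very different things in different practices.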

https://www.unite.ai/andrew-ng-criticizes-the-culture-of-overfitting-in-machine-learning/.
https://www.youtube.com/watch?v=Gbnep6RJinQ

Procedural heterogeneity
Agniel et al. BMJ 2018;360:k1479.

Hardware/software
Badgeley et al. npj Digit Med 2019;2:31.
Deep learning was better at predicting scanner model and brand (AUC ≥ 0.98) than at predicting hip fracture (AUC 0.78)

Where do DL datasets come from anyway?
Kaushal et al. JAMA 2020;324:1212-1213.

Implications?
Van Calster et al. BMC Med 2019;17:230.
THERE IS NO SUCH THING AS A ‘VALIDATED’ MODEL

DL research (Sep 2021)
Perkonigg et al. Nat Comm 2021;12:5678.

4. Proprietary datasets and models
Van Calster et al. JAMIA 2019;26:1651-1654.
https://hai.stanford.edu/news/flying-dark-hospital-ai-tools-arent-well-documented.
Not necessarily bad in principle: financial resources are needed
But it may hamper openness, availability, independent validation
COVID review: companies often did not respond, yet claimed that the model had been used on thousands of patients

Google’s Dermatology Assist (CE label)
https://www.bbc.com/news/technology-57157566.
May 18th, 2021

Google’s Dermatology Assist (CE label)
https://www.statnews.com/2021/06/02/machine-learning-ai-methodology-research-flaws/.
Roxana Daneshjou (Stanford):
- No evaluation on external dataset.
- Insufficient variation in skin types.
- Outcome rarely based on biopsy.
- “I haven't seen data that makes me feel comfortable with putting this in the hands of patients or physicians.”

External validation of EPIC sepsis model
Wong et al. JAMA Intern Med 2021;181:1065-1070.
Model: penalized logistic regression with 80 variables
Data: 3 healthcare organizations, 2013-2015
AUC according to internal documentation: 0.78-0.83
Validation: 1 academic center, 2018-2019
AUC 0.63; calibration poor (predicted risks far too high)
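
What "risks way too high" looks like in numbers can be sketched with a basic calibration check. The data below are simulated and are not the data from Wong et al.: the predicted risks are drawn at random and the true risk is assumed, arbitrarily, to be half the predicted risk. The function names are hypothetical. The check compares the mean predicted risk with the observed event rate, overall (observed/expected ratio) and within risk groups.

```python
import random


def observed_expected(pred_risks, outcomes):
    """Return (observed event rate, mean predicted risk, O/E ratio)."""
    observed = sum(outcomes) / len(outcomes)
    expected = sum(pred_risks) / len(pred_risks)
    return observed, expected, observed / expected


def grouped_calibration(pred_risks, outcomes, n_groups=5):
    """Mean predicted risk and observed event rate per group of increasing risk."""
    ordered = sorted(zip(pred_risks, outcomes))
    size = len(ordered) // n_groups
    rows = []
    for g in range(n_groups):
        # The last group absorbs any remainder.
        end = (g + 1) * size if g < n_groups - 1 else len(ordered)
        chunk = ordered[g * size:end]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        event_rate = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, event_rate))
    return rows


# Simulated validation set (assumption): true risk is half the predicted risk.
rng = random.Random(0)
pred = [rng.uniform(0.02, 0.60) for _ in range(2000)]
y = [int(rng.random() < 0.5 * p) for p in pred]

obs, exp, oe = observed_expected(pred, y)
print(f"observed {obs:.3f}, expected {exp:.3f}, O/E {oe:.2f}")
for mean_pred, event_rate in grouped_calibration(pred, y):
    print(f"mean predicted {mean_pred:.2f}  observed {event_rate:.2f}")
```

An O/E ratio well below 1 means the model overestimates risk on average. A model can rank patients reasonably well and still be miscalibrated like this, which is why discrimination (AUC) and calibration are reported separately.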

5. Actual implementation
Logistical/practical issues in fitting the model into the clinical workflow
Psychological issues regarding model use by healthcare staff
Medicolegal: Who is responsible when prediction is wrong?
https://www.statnews.com/2020/03/09/can-you-sue-artificial-intelligence-algorithm-for-malpractice/
Panch et al. npj Digit Med 2019;2:77.

Lack of evidence revisited: impact?
Clinical impact studies: scarce, difficult
Clinical decision support is a complex intervention (Kappen et al, 2018)
Endpoints of impact studies?
- Process-related: ‘easy’, but intermediate
- Long-term patient outcomes: difficult, lower effect sizes expected
Kappen et al. Diagn Progn Res 2018.

So, does medical AI ‘work’?
We still often don’t know!
Trust jeopardized by
- poor methodology
- lack of evidence
- lack of openness.
It may have potential if done well and evidence is gathered.
The AI community / academia often shoots itself in the foot, which is a pity.
Academia: wrong incentives (publish or perish)!
Companies: financial conflicts of interest!

That’s (not) all folks…
https://www.technologyreview.com/2019/06/06/239031/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/.
https://spectrum.ieee.org/deep-learning-computational-cost
Thompson et al. IEEE Spectrum 2021.
Hao. MIT Technology Review 2019.
