1. Wirewalking over Two Medical AI Chasms
Results and Open Problems in Making "Valid AI" Also Useful in Medical Practice
Prof. Ing. Federico Cabitza, PhD
Università degli Studi di Milano-Bicocca, Milan, Italy
IRCCS Orthopedic Institute Galeazzi, Milan, Italy
12. [Diagram: decision, recording, training, advice]
You can fail to achieve ecological validity even with high (statistical) accuracy, and you can have ecological validity without high accuracy!
26. 1st reason: The replicability problem (or bloody external validity!)
COVID-19 positives and negatives, white blood cell count and lymphocytes, first wave (before and after May).
[Chart: accuracy decreases over time as concept drift sets in]
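The drop sketched in the chart is easy to reproduce. Below is a minimal, self-contained sketch of concept drift on synthetic data: a classifier trained on "before May" cases loses accuracy once the feature distribution shifts. The features (WBC, lymphocytes), the shift size, and all numbers are illustrative assumptions, not the study's data.

```python
# Minimal sketch of concept drift: a classifier trained on "before May" data
# loses accuracy on "after May" data whose feature distribution has shifted.
# All data is synthetic; the features and shift size are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_wave(n, wbc_shift=0.0):
    y = rng.integers(0, 2, n)                             # 0 = negative, 1 = positive
    wbc = rng.normal(7.0 - 1.5 * y + wbc_shift, 1.0, n)   # positives: lower WBC
    lymph = rng.normal(2.0 - 0.8 * y, 0.4, n)             # positives: lymphopenia
    return np.column_stack([wbc, lymph]), y

X_before, y_before = make_wave(1000)                  # first wave, before May
X_after, y_after = make_wave(1000, wbc_shift=1.5)     # after May: WBC distribution drifted

model = LogisticRegression().fit(X_before, y_before)
print("accuracy before drift:", accuracy_score(y_before, model.predict(X_before)))
print("accuracy after drift: ", accuracy_score(y_after, model.predict(X_after)))
```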
28. 1st reason: The replicability problem (or bloody external validity!)
POTENTIAL ROBUSTNESS DIAGRAM: the lower the correlation, the better…
Each circle is an ML model in cross-validation.
HSR: 5/20, IOG: 3/20
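A hedged sketch of one way to read such a diagram, assuming it pairs each cross-validated model's accuracy with the similarity between its test fold and its training data: if the correlation between the two is low, performance does not hinge on the test data resembling the training data. The similarity proxy below (negative centroid distance) is an illustrative choice, not the measure used in the talk.

```python
# Sketch: per-fold accuracy vs. train/test similarity across cross-validation.
# A low correlation suggests performance does not depend on the test fold
# resembling the training data. The similarity proxy is an assumption.
import numpy as np
from scipy.stats import pearsonr
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=600, n_features=5, random_state=0)
accs, sims = [], []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    accs.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    # similarity proxy: closeness of the test-fold centroid to the training centroid
    sims.append(-np.linalg.norm(X[test_idx].mean(axis=0) - X[train_idx].mean(axis=0)))

r, _ = pearsonr(sims, accs)
print(f"correlation(similarity, accuracy) = {r:+.2f}  (the lower, the better)")
```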
29. 2nd reason: The problem of ground truth reliability
[Sidebar recap: 1st reason, the replicability problem (or external validation); potential robustness diagram, the lower the value, the better]
31. 2nd reason: The problem of ground truth reliability
HOW MANY RATERS TO INVOLVE TO GET A DECENT GROUND TRUTH?
[Chart: no. of raters vs. average rater error rate]
32. 2nd reason: The problem of ground truth reliability
HOW MANY RATERS TO INVOLVE TO GET A DECENT GROUND TRUTH?
If they're good (90% accuracy each)… 7!
[Chart: no. of raters vs. average rater error rate]
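As a back-of-the-envelope companion to the chart, here is a minimal sketch that computes the probability that a majority vote of independent raters assigns the wrong binary label. Independence and simple majority voting are simplifying assumptions; the nomogram in the talk may rest on a different model, so its rater counts need not coincide with these.

```python
# Probability that a majority of independent raters is wrong on a binary label,
# under the simplifying assumptions of independence and simple majority voting.
from math import comb

def majority_error(n_raters: int, error_rate: float) -> float:
    """P(majority of n_raters is wrong), each rater wrong independently."""
    k_min = n_raters // 2 + 1   # number of wrong raters needed to flip the majority
    return sum(comb(n_raters, k) * error_rate**k * (1 - error_rate)**(n_raters - k)
               for k in range(k_min, n_raters + 1))

for n in (1, 3, 5, 7):
    print(f"{n} raters at 10% individual error -> "
          f"labelling error {majority_error(n, 0.10):.3%}")
```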
33. 2nd reason: The problem of ground truth reliability
NUMBER OF RATERS TO INVOLVE TO GET A LABELLING ERROR UNDER 5%
A NOMOGRAM TO UNDERSTAND THE PROBLEM! (or kappa, or alpha, …)
39. 3rd reason: The problem of meaning
A NEW UTILITY METRIC
Intuitively, decision support is useful if the number of times it is right in detecting a health problem is greater than the number of times it is wrong, i.e. if its true positive rate exceeds its false positive rate.
40. 3rd reason: The problem of meaning
A NEW UTILITY METRIC
…and if:
- it is optimized to avoid the most impactful kinds of error (at class level);
- it helps you when you need it most (i.e., on the most difficult and rarest cases);
- it does not take guesses (and it is calibrated in the guesses it does take).
41. But how?
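A minimal sketch of what such a utility score could look like, assuming (as the slides suggest) case-level weights that reflect what clinicians care about, class-level error costs, and the possibility of abstaining. This is an illustration in the spirit of the cited CD-MAKE 2021 metric, not its actual definition; all names, weights, and costs are hypothetical.

```python
# Illustrative case-wise utility, NOT the actual metric from Cabitza & Campagner
# (CD-MAKE 2021). Assumptions: clinician-assigned case weights (relevance,
# complexity, rarity), class-level error costs, and the option to abstain.
import numpy as np

def case_wise_utility(y_true, y_pred, weights, fn_cost=2.0, fp_cost=1.0):
    """y_pred may contain None for abstentions, which earn neither reward nor penalty."""
    total = 0.0
    for t, p, w in zip(y_true, y_pred, weights):
        if p is None:            # abstention: better than a wrong guess
            continue
        if p == t:
            total += w           # weighted reward for a correct call
        else:
            total -= w * (fn_cost if t == 1 else fp_cost)   # impact-aware penalty
    return total / float(np.sum(weights))   # normalized; higher is better, 1.0 is best

y_true  = [1, 1, 0, 0, 1]
y_pred  = [1, None, 0, 1, 0]            # one abstention, one FP, one FN
weights = [3.0, 1.0, 1.0, 1.0, 2.0]     # rare/complex cases weigh more
print(f"utility = {case_wise_utility(y_true, y_pred, weights):+.3f}")
```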
45. So why should accuracy not tell us much? 3 reasons.
1. It tells us nothing about… robustness: do not just avoid overfitting, pursue external validation!
46. http://prova-meta-validation.herokuapp.com/
47. http://psicorrespondence.pythonanywhere.com/
48. So why should accuracy not tell us much? 3 reasons. It tells us nothing about…
1. …robustness: do not just avoid overfitting, pursue external validation!
2. …data reliability
49. https://reliability-test.herokuapp.com/
50. So why should accuracy not tell us much? 3 reasons. It tells us nothing about…
1. …robustness: do not just avoid overfitting, pursue external validation!
2. …data reliability
3. …the model's utility
52. By looking at how it performs with respect to what matters most…
53. That is: robustness as a performance characteristic, assessed on multiple aspects in light of statistical significance and similarity to the training data.
EXTERNAL PERFORMANCE DIAGRAM
54. 1. Consider the similarity between the training set and the test set (or external validation set).
2. Consider whether the test set has sufficient numerosity (100%).
3. Consider a balanced "discriminative performance" metric.
4. Consider a utility metric.
5. Consider a calibration metric.
6. Combine them into a robustness metric (a sketch of one possible combination follows this list).
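A hedged sketch of the final step, combining the slide's ingredients into one robustness number. The ingredients come from the slide; the weighting scheme, the n_min threshold, and the choice to count good results on a less similar external set as stronger evidence are illustrative assumptions, not the talk's actual formula.

```python
# Illustrative combination of the slide's ingredients into a robustness score.
# The weighting scheme and thresholds are assumptions, not the talk's formula.
import numpy as np

def robustness_score(similarity, balanced_acc, utility, calibration,
                     n_test, n_min=100):
    """All metric inputs in [0, 1], higher is better (e.g. calibration = 1 - ECE)."""
    adequacy = min(1.0, n_test / n_min)              # sample-size adequacy, capped at 100%
    performance = float(np.mean([balanced_acc, utility, calibration]))
    evidence = adequacy * (1.0 - 0.5 * similarity)   # dissimilar test sets weigh more
    return performance * evidence

score = robustness_score(similarity=0.6, balanced_acc=0.85,
                         utility=0.70, calibration=0.90, n_test=150)
print(f"robustness = {score:.3f}")
```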
56. ✓ The model performs better with Italian data
✓ But similarity is moderate, not high
✓ The model is also good on the Spanish dataset
✓ The model is moderately robust
57. There is also a 4th reason why we should not talk about accuracy in the case of medical AI.
58. Medical decision-making is multifactorial, not black/white.
AI is a tool, not an agent.
59. So we should not want accurate models, but rather (potentially) useful ones.
Two take-home messages before you "trust" any accuracy score…
60. 1st: evaluate the similarity between the training dataset and the validation set.
61. 2nd: weigh your data by what clinicians really care about (relevance, complexity, rarity, …).
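For the 1st take-home message, here is a minimal sketch of one common way to quantify train/validation similarity: train a "domain classifier" to tell the two sets apart; the closer its AUC is to chance, the more similar the sets. The AUC-to-similarity mapping below is an illustrative assumption, not the specific measure used in the talk.

```python
# Domain-classifier sketch of train/validation similarity: if a classifier can
# easily tell training rows from validation rows, the two sets are dissimilar.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def dataset_similarity(X_train, X_val) -> float:
    """Returns ~1.0 for indistinguishable sets, ~0.0 for perfectly separable ones."""
    X = np.vstack([X_train, X_val])
    domain = np.r_[np.zeros(len(X_train)), np.ones(len(X_val))]   # set membership label
    auc = cross_val_score(RandomForestClassifier(random_state=0),
                          X, domain, cv=5, scoring="roc_auc").mean()
    return 2.0 * (1.0 - max(auc, 0.5))   # AUC 0.5 -> similarity 1.0; AUC 1.0 -> 0.0

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(300, 4))
X_val   = rng.normal(0.5, 1.0, size=(300, 4))   # a shifted validation set
print(f"similarity = {dataset_similarity(X_train, X_val):.2f}")
```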
62. To discover more, please refer to the IJMEDI checklist for the assessment of medical AI.
65. References
Cabitza, F., & Zeitoun, J. D. (2019). The proof of the pudding: in praise of a culture of real-world validation for medical artificial intelligence. Annals of Translational Medicine, 7(8).
Cabitza, F., Campagner, A., & Balsano, C. (2020). Bridging the "last mile" gap between AI implementation and operation: "data awareness" that matters. Annals of Translational Medicine, 8(7).
Cabitza, F., Campagner, A., & Sconfienza, L. M. (2020). As if sand were stone. New concepts and metrics to probe the ground on which to build trustable AI. BMC Medical Informatics and Decision Making, 20(1), 1-21.
Cabitza, F., Locoro, A., Alderighi, C., Rasoini, R., Compagnone, D., & Berjano, P. (2019). The elephant in the record: on the multiplicity of data recording work. Health Informatics Journal, 25(3), 475-490.
Cabitza, F., Campagner, A., Albano, D., Aliprandi, A., Bruno, A., Chianca, V., ... & Sconfienza, L. M. (2020). The elephant in the machine: Proposing a new metric of data reliability and its application to a medical case to assess classification reliability. Applied Sciences, 10(11), 4014.
Cabitza, F., & Campagner, A. (2021). The need to separate the wheat from the chaff. International Journal of Medical Informatics. https://doi.org/10.1016/j.ijmedinf.2021.104510
Cabitza, F., & Campagner, A. (2021). Decisions are not all equal. Introducing a utility metric based on the case-wise raters' perceptions. Proceedings of CD-MAKE 2021.
Cabitza, F., Campagner, A., ... & Carobene, A. (2020). Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests. Clinical Chemistry and Laboratory Medicine (CCLM), 1.