Cabitza biostec-2022

Federico Cabitza
Associate Professor at the Università degli Studi di Milano-Bicocca
Wirewalking over Two Medical AI Chasms
Results and Open Problems in Making "Valid AI" Also Useful in Medical Practice
Prof. Ing. Federico Cabitza, PhD
Università degli Studi di Milano-Bicocca, Milan, Italy
IRCCS Orthopedic Institute Galeazzi, Milan, Italy
[Figure: the Decision, Recording, Training, Advice loop]
You can fail to achieve ecological validity even with high (statistical) accuracy, and you can have ecological validity without high accuracy!
Chasm of Human Trust
Chasm of Machine Experience
[Figure: the Decision, Recording, Training, Advice loop]
The last mile for medical AI
Processes of trust building (e.g., validation, explanation, …)
Processes of data production (e.g., reliability, vetting for bias, …)
"With each passing day, I place less value on accuracy"*
* «Chaque jour j'attache moins de prix à l'intelligence» ("Every day I attach less value to intelligence"), Marcel Proust, 1908
Accuracy > 95%!!
That don’t impress me much!
So why doesn’t accuracy tell me much? 3 reasons
12/2016
1st reason: the replicability problem (or bloody external validity!)
[Figure: Covid-19 positives and negatives, white blood cell count and lymphocytes, first wave (before and after May). Labels: concept drift, accuracy decrease, time.]
[Figure: potential robustness diagram. Each circle is an ML model in cross-validation; the lower the correlation, the better. Points labelled HSR 5/20 and IOG 3/20.]
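The accuracy decrease over time that the first reason attributes to concept drift can be monitored by scoring a model separately on each time window. The following is a minimal sketch, assuming predictions are stored as (period, true label, predicted label) tuples; the function name, the record layout, and the toy data are illustrative assumptions and do not come from the talk.

```python
from collections import defaultdict


def accuracy_by_period(records):
    """Return {period: accuracy} for records of (period, y_true, y_pred)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for period, y_true, y_pred in records:
        totals[period] += 1
        hits[period] += int(y_true == y_pred)
    # Sort periods so that a downward trend is easy to spot.
    return {period: hits[period] / totals[period] for period in sorted(totals)}


# Toy data (invented): the model is right on all early cases, on half the later ones.
toy_records = [
    ("2020-03", 1, 1),
    ("2020-03", 0, 0),
    ("2020-06", 1, 0),
    ("2020-06", 0, 0),
]
print(accuracy_by_period(toy_records))
```

A single pooled accuracy over all four toy cases would be 0.75 and would hide the gap between the two periods.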
2nd reason: the problem of the ground truth reliability
2nd reason: the problem of the ground truth reliability
How many raters to involve to get a decent ground truth?
[Figure: number of raters vs. average rater error rate]
If they’re good (acc 90%)… 7!
Number of raters to involve to get a labelling error under 5%
A nomogram to understand the problem! (or kappa, or alpha, …)
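To get a feel for why the number of raters matters, one can compute the error of a majority-vote label under a deliberately simple model. This sketch assumes binary labels, raters who err independently with the same error rate, and an odd panel size so that ties cannot occur. These assumptions and the function names are mine, not the author's; the numbers it yields need not match the nomogram on the slides, which may rest on different assumptions.

```python
from math import comb


def majority_error(n_raters, rater_error):
    """Probability that a strict majority of independent raters is wrong."""
    if n_raters % 2 == 0:
        raise ValueError("use an odd number of raters to avoid ties")
    k_min = n_raters // 2 + 1  # smallest number of wrong votes that wins
    return sum(
        comb(n_raters, k) * rater_error**k * (1 - rater_error) ** (n_raters - k)
        for k in range(k_min, n_raters + 1)
    )


def raters_needed(rater_error, target=0.05, max_raters=99):
    """Smallest odd panel whose majority-vote error is below target, else None."""
    for n in range(1, max_raters + 1, 2):
        if majority_error(n, rater_error) < target:
            return n
    return None


for err in (0.10, 0.20, 0.30):
    print(err, raters_needed(err))
```

Real raters tend to make correlated mistakes on the hard cases, so this idealized figure is best read as a lower bound on the panel size.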
3rd reason: the problem of meaning
A new utility metric
Intuitively, decision support is useful if the number of times it is right in detecting a health problem is greater than the number of times it is wrong.
[Figure: true positive rate vs. false positive rate]
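The intuition of being right more often than wrong when flagging a health problem can be made concrete by comparing the true positive rate with the false positive rate. This is a minimal sketch, assuming binary labels coded as 1 (problem present) and 0 (absent); the function name and toy data are illustrative. The difference between the two rates is commonly known as Youden's J; it is shown here only as a baseline and is not the utility metric proposed in the talk.

```python
def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), fp / (fp + tn)


# Toy data (invented): 4 positive cases, 6 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tpr, fpr = tpr_fpr(y_true, y_pred)
print(tpr, fpr, tpr - fpr)  # a positive difference means TPR exceeds FPR
```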
And if:
It is optimized to avoid the most impactful kinds of error (at class level)
It helps you when you need it most (i.e., most difficult/rarest cases)
It doesn’t take guesses (and it’s calibrated in its guesses).
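Whether a model is calibrated in its guesses can be checked with standard scores that compare predicted probabilities with observed outcomes. The sketch below computes the Brier score and a simple expected calibration error with equal-width bins; the bin count, function names, and toy data are assumptions for illustration, and the talk does not say which calibration metric the author uses.

```python
def brier_score(y_true, y_prob):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=5):
    """Weighted mean gap between observed frequency and mean confidence per bin."""
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((t, p))
    total = len(y_true)
    ece = 0.0
    for cases in bins:
        if not cases:
            continue
        observed = sum(t for t, _ in cases) / len(cases)
        predicted = sum(p for _, p in cases) / len(cases)
        ece += len(cases) / total * abs(observed - predicted)
    return ece


# Toy data (invented).
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.1, 0.8, 0.7, 0.3]
print(brier_score(y_true, y_prob))
print(expected_calibration_error(y_true, y_prob))
```

Lower is better for both scores; a model that is always certain and always wrong reaches the maximum of 1.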
But how?
[Figure: true positive rate vs. false positive rate]
So why should accuracy not tell us much? 3 reasons
It tells us nothing about…
1. …robustness. Not only avoid overfitting, but pursue external validation!
   http://prova-meta-validation.herokuapp.com/
   http://psicorrespondence.pythonanywhere.com/
2. …data reliability
   https://reliability-test.herokuapp.com/
3. …the model utility
By looking at how it performs with respect to what matters most…
That is…
Robustness as a performance characteristic assessed on multiple aspects in light of statistical significance and similarity to training data
[Figure: external performance diagram]
Consider similarity between training set and test set (or external validation set)
Consider a balanced «discriminative performance» metric
Consider a utility metric
Consider a calibration metric
Combine them into a robustness metric
Consider whether the test set is large enough (100%)
Diagnostic tool from routine blood test (CBC)
https://covid19-bloodtests-ml.herokuapp.com
✓ The model performs better with Italian data
✓ But similarity is moderate, not high
✓ The model is also good with the Spanish dataset
✓ The model is moderately robust
There is also a 4th reason why we should not talk about accuracy in the case of medical AI
Medical decision making is multifactorial and not black/white.
AI is a tool and not an agent.
So, we should not want accurate models, but rather (potentially) useful ones.
Two take-home messages to “trust” any accuracy score…
1st: evaluate the similarity between training datasets and validation set.
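A rough first look at how similar a validation set is to the training data can be taken feature by feature. The sketch below uses the standardized mean difference, a generic statistic chosen here for simplicity; it is not the similarity measure used in the talk, and the feature names and values are invented toy data.

```python
from math import inf, sqrt
from statistics import mean, variance


def standardized_mean_difference(a, b):
    """Absolute gap between means, in units of the pooled standard deviation.

    Each sample needs at least two values for the variance to be defined.
    """
    pooled_sd = sqrt((variance(a) + variance(b)) / 2)
    gap = abs(mean(a) - mean(b))
    if pooled_sd == 0:
        return 0.0 if gap == 0 else inf
    return gap / pooled_sd


def compare_datasets(train, valid):
    """Map each feature present in both datasets to its mean difference."""
    return {
        name: standardized_mean_difference(train[name], valid[name])
        for name in train
        if name in valid
    }


# Toy data (invented): one feature matches, the other has shifted.
train = {"wbc": [5.0, 6.0, 7.0], "lymphocytes": [1.0, 2.0, 3.0]}
valid = {"wbc": [5.0, 6.0, 7.0], "lymphocytes": [3.0, 4.0, 5.0]}
print(compare_datasets(train, valid))
```

Values near zero suggest the two sets look alike on that feature; a large value is a warning that an accuracy score obtained on one set may not carry over to the other. Per-feature checks ignore interactions between features, so they can only flag obvious shifts.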
2nd: weigh your data for what clinicians really care about (relevance, complexity, rarity, …).
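Weighing cases by what clinicians care about can be illustrated with a case-weighted accuracy, where each case counts in proportion to a weight. This is a minimal sketch: the weights here are hypothetical numbers standing in for relevance, complexity, or rarity, and the function is a generic weighted score, not the utility metric from the author's CD-MAKE 2021 paper.

```python
def weighted_accuracy(y_true, y_pred, weights):
    """Share of total case weight carried by correctly classified cases."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct / total


# Toy data (invented): the single error falls on the most heavily weighted case.
y_true = [1, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 1]
weights = [3, 1, 1, 1, 2]  # hypothetical clinician-assigned importance

plain = weighted_accuracy(y_true, y_pred, [1] * len(y_true))
weighted = weighted_accuracy(y_true, y_pred, weights)
print(plain, weighted)
```

With equal weights the toy model scores 0.8; once the missed case is given three times the weight of a routine one, the score drops to 0.625.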
To discover more, please refer to the IJMEDI checklist for assessment of medical AI
Thank you!
federico.cabitza @ unimib.it
@cabitzaf
References
Cabitza, F., & Zeitoun, J. D. (2019). The proof of the pudding: in praise of a culture of real-world validation for medical artificial intelligence. Annals of Translational Medicine, 7(8).
Cabitza, F., Campagner, A., & Balsano, C. (2020). Bridging the “last mile” gap between AI implementation and operation: “data awareness” that matters. Annals of Translational Medicine, 8(7).
Cabitza, F., Campagner, A., & Sconfienza, L. M. (2020). As if sand were stone. New concepts and metrics to probe the ground on which to build trustable AI. BMC Medical Informatics and Decision Making, 20(1), 1-21.
Cabitza, F., Locoro, A., Alderighi, C., Rasoini, R., Compagnone, D., & Berjano, P. (2019). The elephant in the record: on the multiplicity of data recording work. Health Informatics Journal, 25(3), 475-490.
Cabitza, F., Campagner, A., Albano, D., Aliprandi, A., Bruno, A., Chianca, V., ... & Sconfienza, L. M. (2020). The elephant in the machine: Proposing a new metric of data reliability and its application to a medical case to assess classification reliability. Applied Sciences, 10(11), 4014.
Cabitza, F., & Campagner, A. (2021). The need to separate the wheat from the chaff. International Journal of Medical Informatics. https://doi.org/10.1016/j.ijmedinf.2021.104510
Cabitza, F., & Campagner, A. (2021). Decisions are not all equal. Introducing a utility metric based on the case-wise raters' perceptions. Proceedings of CD-MAKE 2021.
Cabitza, F., Campagner, A., ... & Carobene, A. (2020). Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests. Clinical Chemistry and Laboratory Medicine (CCLM), 1
