When AI is used for prediction, diagnosis and treatment of medical conditions, reality is often replaced by hype. Its limitations should be known. This is a review of AI failures and challenges in healthcare, showing why algorithms are unlikely to replace physicians in the near future.
Dark side of AI in Healthcare
Yulia Sereda,
Research Consultant (Alliance for Public Health, USAID Healthcare Reform Support Project)
and lecturer at Kyiv School of Economics
Data challenges
in healthcare
Data comes in all shapes and sizes; the nature of the input
keeps changing and causes confusion, for example:
ALL = acute lymphoblastic leukemia
ALL = shorthand for allergy
Integration of data sources is crucial but difficult
(structured EHR + images / signals + narrative)
Data pre-processing requires clinical knowledge
Ethical and liability challenges (GDPR, GCP, HIPAA)
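The "ALL" ambiguity above is one reason pre-processing needs clinical knowledge. A minimal sketch, assuming a hand-written cue list (the cue words and the `expand_all` helper are hypothetical and not taken from any clinical NLP library), of resolving the abbreviation from the surrounding note text and declining to guess when the context is unclear:

```python
import re

# Hypothetical cue words; a real system would need clinician-curated vocabularies.
CUES = {
    "acute lymphoblastic leukemia": {"leukemia", "blasts", "chemotherapy", "oncology"},
    "allergy": {"penicillin", "rash", "hives", "allergic"},
}

def expand_all(note: str) -> str:
    """Return the likely meaning of 'ALL' in a note, or 'ambiguous'."""
    words = set(re.findall(r"[a-z]+", note.lower()))
    # Count how many cue words of each meaning appear in the note.
    scores = {meaning: len(words & cues) for meaning, cues in CUES.items()}
    best = max(scores.values())
    winners = [meaning for meaning, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return "ambiguous"  # no evidence, or a tie: leave it for a human
    return winners[0]
```

Returning "ambiguous" rather than a best guess is a deliberate choice in this sketch: a wrong expansion in a medical record is worse than an unresolved one.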
Reasons for AI failure in healthcare
Biased data
Risk of manipulation
Obscured logic
Watson for Oncology
Biased data in personalized medicine
Stat+ investigation: https://www.statnews.com/wp-content/uploads/2018/09/IBMs-Watson-recommended-unsafe-and-incorrect-cancer-treatments-STAT.pdf
“Physicians like it. Physicians
have said to me, if I took it
away now, I’d have a revolt”
D. DiSanzo,
general manager of IBM Watson Health, 06.2017
“This product is a
piece of s—. We
can’t use it for
most cases”
Oncologist at Jupiter Medical
Center, quoted in IBM internal
document, 06.2017
“Synthetic” cases instead of real EHR
Failure to digest written case-records and notes
Oncology treatment guidelines may change monthly and vary
across countries!
Pneumonia-screening
CNNs
Hospital system–specific biases
A cross-sectional design was used to train and evaluate
pneumonia screening CNNs on 158,323 chest X-rays from
3 medical facilities with extreme differences in pneumonia
prevalence (Zech et al., 2018).
“Better internal than external performance in 3 out of 5
natural comparisons”.
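One way such an internal/external gap can arise is a model that learns which hospital an image came from rather than signs of disease. A toy sketch with hypothetical counts (not the study's data or method): a "model" that scores every patient purely by site looks useful on pooled data when prevalence differs sharply between sites, yet is no better than chance within either site.

```python
def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical (site, has_pneumonia) records: site A has 30% prevalence, site B 1%.
records = [("A", 1)] * 30 + [("A", 0)] * 70 + [("B", 1)] * 1 + [("B", 0)] * 99

def site_only_score(site):
    """A shortcut 'model' that ignores the image and looks only at the site."""
    return 1.0 if site == "A" else 0.0

def auc_for(rows):
    labels = [y for _, y in rows]
    scores = [site_only_score(site) for site, _ in rows]
    return auc(labels, scores)

pooled_auc = auc_for(records)
auc_site_a = auc_for([r for r in records if r[0] == "A"])
auc_site_b = auc_for([r for r in records if r[0] == "B"])
```

With these assumed counts the pooled AUC is about 0.78, while the AUC within each site is exactly 0.5, so the apparent skill would not carry over to a new hospital.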
Smartphone Applications
for Melanoma Detection
Performance manipulation
A review of available apps for the detection of melanoma (Wolf et al., 2013):
“3 of the 4 applications evaluated do not involve a
physician at any point in the evaluation. Even the
best-performing among these 3 applications classified 18 of
60 melanomas (30%) as benign”
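The quoted figures translate directly into a miss rate and a sensitivity. A small worked calculation using only the two numbers from the quote (the function name is chosen here for illustration):

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Share of actual melanomas that the app flagged as melanoma."""
    return true_positives / (true_positives + false_negatives)

melanomas_total = 60
missed_as_benign = 18  # melanomas the app classified as benign
detected = melanomas_total - missed_as_benign

miss_rate = missed_as_benign / melanomas_total       # 18 / 60 = 0.30
best_app_sensitivity = sensitivity(detected, missed_as_benign)  # 42 / 60 = 0.70
```

In other words, even the best of the three fully automated apps caught 42 of 60 melanomas, a sensitivity of 70%.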
AI for genetic screening
Performance manipulation
Collecting genetic, personal
and behavioral information
from customers without proper
informed consent. Customers’
data is sold.
Limited ability to predict risk:
10% of disease risk is based
on genetics (PLoS One, 2016)
Results depend on the genetic
datasets available to the
company
Cannot replace clinical tests and
genetic counseling
Failure of Google Flu Trends
in surveillance
Obscured logic for outbreak prediction
Tended to over- or
underestimate the real
epidemiological burden
Missed the peak of the 2013
flu season by 140%
Challenges:
• Flu-like symptoms indicate
many diseases
• Public resonance vs. real
outbreak
• Linear approach
Reinforcement learning for
sepsis treatment policy
Not all problems can be solved by fancier algorithms
Evaluation of policies on the administration of vasopressors and IV fluids
(intravenous fluids) to patients with sepsis, based on historical data
(Gottesman et al, 2019).
Challenges:
Dimensionality reduction introduced confounding bias
Not enough decisions: we cannot evaluate things we have not tried
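The "not enough decisions" point can be made concrete with importance sampling, a standard off-policy evaluation technique (used here for illustration; it is not claimed to be the exact estimator in the cited paper). A minimal sketch, assuming a deterministic new policy and hypothetical logged data: a patient trajectory contributes to the estimate only if clinicians happened to take every action the new policy would have taken.

```python
def trajectory_weight(steps):
    """Importance weight of one logged trajectory.

    Each step is (clinician_action, clinician_prob, policy_action), where
    clinician_prob is the estimated probability that clinicians chose
    clinician_action. The evaluated policy is deterministic, so its
    probability of the logged action is 1 if the actions match, else 0.
    """
    weight = 1.0
    for clinician_action, clinician_prob, policy_action in steps:
        policy_prob = 1.0 if clinician_action == policy_action else 0.0
        weight *= policy_prob / clinician_prob
    return weight

def effective_sample_size(weights):
    """(sum of weights)^2 / (sum of squared weights); 0 if nothing matches."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return total ** 2 / sum(w ** 2 for w in weights)

# Hypothetical two-step trajectories for four patients.
logged = [
    [("fluids", 0.5, "fluids"), ("vasopressor", 0.5, "vasopressor")],
    [("fluids", 0.5, "fluids"), ("fluids", 0.5, "vasopressor")],
    [("vasopressor", 0.5, "fluids"), ("vasopressor", 0.5, "vasopressor")],
    [("vasopressor", 0.5, "fluids"), ("fluids", 0.5, "vasopressor")],
]
weights = [trajectory_weight(t) for t in logged]
ess = effective_sample_size(weights)
```

In this toy log only the first patient followed the new policy at both steps, so four recorded patients shrink to an effective sample size of one. Longer trajectories make a full match even less likely.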
Good practice
Transparency in the community
Most of the highest-valued healthcare start-ups
(valued at over $1 billion) have a limited or
non-existent impact in the publicly
available scientific literature
(Cristea et al, 2019).
Understanding limitations
Performing more modest tasks which
could still be of tremendous use in
healthcare
Explaining the model & limitations
Using explainers for “black
boxes” (LIME, DALEX, SHAP),
confidence scoring
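Confidence scoring can be as simple as letting the model abstain. A minimal sketch, assuming the model outputs a reasonably calibrated probability of disease (the function name, the 0.9 threshold and the returned labels are hypothetical choices for illustration):

```python
def predict_or_defer(prob_disease: float, threshold: float = 0.9) -> str:
    """Return a label only when the model is confident; otherwise defer."""
    # Confidence is the probability of whichever class the model favours.
    confidence = max(prob_disease, 1.0 - prob_disease)
    if confidence < threshold:
        return "refer to physician"
    return "disease" if prob_disease >= 0.5 else "no disease"
```

For example, `predict_or_defer(0.97)` returns "disease", while `predict_or_defer(0.6)` returns "refer to physician". The threshold trades automation against safety and would need to be set with clinical input.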