1. Bias in COVID-19 models
Learning Machine Learning
Universidad del Rosario, 15/07/2021
Laure Wynants PhD
Maastricht University, Department of Epidemiology
KU Leuven, Department of Development and Regeneration, EPI-Centre
laure.wynants@maastrichtuniversity.nl
@laure_wynants
9. Some more terminology
– you may want to take a screenshot
Statistics / Epi → Machine learning
Prediction → Supervised learning
Outcome variable, dependent variable → Target
Gold standard → Ground truth
Predictor, covariate, independent variable → Feature
Fitting → Learning
Parameter → Weights
Development / validation → Training / test
Sensitivity → Recall
Positive predictive value → Precision
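These are not merely analogous terms but identical quantities. A minimal sketch with made-up labels, using scikit-learn, showing that sensitivity equals recall and positive predictive value equals precision:

```python
# Made-up labels purely to illustrate the terminology mapping.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]  # gold standard / ground truth
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # model classifications

# Sensitivity (stats/epi) = recall (ML): true positives / all true cases.
print("sensitivity = recall =", recall_score(y_true, y_pred))   # 0.8

# Positive predictive value (stats/epi) = precision (ML):
# true positives / all positive predictions.
print("PPV = precision =", precision_score(y_true, y_pred))     # 0.8
```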
11. Why bother?
“As of today, we have deployed the system in 16 hospitals, and it is
performing over 1,300 screenings per day”
medRxiv preprint only, 23 March 2020, doi.org/10.1101/2020.03.19.20039354
16. Characteristics of reviewed models II
114 out of 236 models (48%) were available in a format for use in clinical practice.
17. Commonly included predictors
Diagnostic models:
• Vital signs (fever)
• Flu-like signs and symptoms
• Age
• Electrolytes
• Image features
Prognostic models:
• Age
• Comorbidities
• Vital signs
• Image features
• Sex
19. Performance: AUC
• General population models: 0.71 to ≥0.99
• Diagnostic models: 0.65 to ≥0.99
• Diagnostic severity models: 0.80 to ≥0.99
• Diagnostic imaging models: 0.70 to ≥0.99
• Prognostic models: 0.54 to ≥0.99
(prediction horizon varies from 1 to 37 days, if reported)
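For orientation: the AUC is the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case (0.5 = chance, 1.0 = perfect discrimination). A minimal sketch with simulated data showing how the statistic is computed; the ranges above are the reported values, not recomputed here:

```python
# Simulated outcome and noisy risk predictions, purely to show the computation.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)        # outcome labels (case / non-case)
risk = 0.3 * y + rng.random(200)        # predictions that track y imperfectly
print("AUC:", roc_auc_score(y, risk))   # roughly 0.75 for this setup
```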
20. How often can we trust the estimated
predictive performance?
• 187 / 236 models
• 121 / 236 models
• 4 / 236 models
21. Characteristics of reviewed models
Median (IQR)
Sample size: 344 (134 to 748)
Number of events: 70 (37 to 160)
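To see why these medians matter for what follows: under the rough (and much-debated) heuristic of at least 10 outcome events per candidate predictor parameter, the median study could support only a handful of predictors. A back-of-the-envelope sketch:

```python
# Heuristic only: 10 events per parameter is a rule of thumb, not a law;
# modern sample-size guidance for prediction models is more nuanced.
median_events = 70          # median number of events in the reviewed studies
events_per_parameter = 10
print("max candidate parameters ~", median_events // events_per_parameter)  # 7
```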
26. Why care about bias?
• A good model could improve care and reduce costs
• Help allocate scarce resources
27. Poor models can make things worse
• Inaccurate predictions -> harmful decisions (Van Calster & Vickers, Med Dec Mak, 2015)
• ICU scores during H1N1 pandemic (Enfield, Chest, 2011)
30. What is bias anyway?
According to epidemiologists:
“an error in the conception and design of a study – or in the collection, analysis, interpretation, reporting, publication, or review of data – leading to results or conclusions that are systematically (as opposed to randomly) different from truth”
Porta M, ed. A Dictionary of Epidemiology. 6th Edition. Oxford: Oxford University Press, 2014.
32. Risk of bias in prediction models
“We define risk of bias to occur when
shortcomings in study design, conduct, or
analysis could lead to systematically
distorted estimates of a model’s
predictive performance.”
PROBAST
33. The numbers are only as good as the process
producing them
Signalling questions in 4 domains: Participants, Predictors, Outcome, Analysis
38. Issues
The set of images from patients with covid-19 stems from a different source than the set of images from patients without covid-19:
1. Non-covid images not representative of typical patients suspected of having covid-19
• Metadata (e.g. age, comorbidities such as pre-existing chronic lung disease)?
• Alternative diagnoses in the target population include pathology such as heart failure or pulmonary embolism, …
• Predictive performance (AUC, PPV (precision), NPV, calibration) depends on patient case-mix
2. Sets differ systematically in many respects -> spurious correlations -> inflated performance
• Geographical location, time period (pre- or post-12/2019), type of machine, settings of the imaging procedure, image preparation/preprocessing
3. Frankenstein datasets (see the sketch below)
• Combinations of existing databases of images
• The same images are often included more than once
• Train and test sets are no longer independent
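One practical safeguard (a common approach, not something prescribed in the talk): hash the raw image bytes before splitting, so that identical files pulled in from several source databases cannot end up in both the training and test sets. The folder path and file extension below are hypothetical placeholders:

```python
# Deduplicate a combined image collection by content hash before splitting.
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """Content hash of a file; byte-identical copies share a digest."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

seen = {}            # digest -> first path seen with that content
unique_images = []
for path in Path("combined_dataset").rglob("*.png"):  # hypothetical folder
    digest = file_digest(path)
    if digest in seen:
        print(f"duplicate: {path} == {seen[digest]}")  # same image, two sources
    else:
        seen[digest] = path
        unique_images.append(path)

# Split into train/test only AFTER deduplication, and ideally by patient ID,
# since one patient can contribute several distinct images.
```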
40. Predictors
1. Were predictors defined and assessed in a similar way for all participants?
2. Were predictor assessments made without knowledge of outcome data?
3. Are all predictors available at the time the model is intended to be used?
42. Problem: predict mortality due to covid-19
• Comparable?
• Measured once vs measured throughout the hospital stay?
• Actionable for doctors?
• Are we predicting death or are we diagnosing it (the patient is already dead/dying)?
45. Outcome
1. Was the outcome determined appropriately?
2. Was a pre-specified or standard outcome definition used?
3. Were predictors excluded from the outcome definition?
4. Was the outcome defined and determined in a similar way for all participants?
5. Was the outcome determined without knowledge of predictor information?
6. Was the time interval between predictor assessment and outcome determination appropriate?
47. arXiv:2003.07347v3
Problem: identify people at risk in the general population
Is it appropriate to predict covid-19 hospitalization risk
without data on covid-19 hospitalizations?
49. Analysis
1. Were there a reasonable number of participants with the outcome?
2. Were continuous and categorical predictors handled appropriately?
3. Were all enrolled participants included in the analysis?
4. Were participants with missing data handled appropriately?
5. Was selection of predictors based on univariable analysis avoided?
6. Were complexities in the data (e.g. censoring, competing risks, sampling of control participants) accounted for appropriately?
7. Were relevant model performance measures evaluated appropriately?
8. Were model overfitting and optimism in model performance accounted for?
9. Do predictors and their assigned weights in the final model correspond to the results from the reported multivariable analysis?
50. Problem: predict covid-19 mortality
DOI: 10.1093/cid/ciaa538
Very little data to learn from -> risk of overfitting
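Question 8 of the Analysis checklist asks about exactly this. A minimal simulated sketch of one standard remedy, Harrell-style bootstrap optimism correction, assuming a scikit-learn logistic regression stands in for the model; the whole fitting procedure is repeated in each bootstrap sample:

```python
# Bootstrap optimism correction on simulated data: small n, mostly noise
# predictors, so the apparent AUC overstates true performance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # only one real predictor

model = LogisticRegression(max_iter=1000).fit(X, y)
apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])

optimism = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)              # bootstrap resample
    m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
    orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
    optimism.append(boot - orig)                  # performance lost out-of-sample

print("apparent AUC:           ", round(apparent, 3))
print("optimism-corrected AUC: ", round(apparent - np.mean(optimism), 3))
```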
51. Handling of missing data for training data: not reported
Excluding patients with missing data leads to biased results when the analyzed individuals are a selective subgroup from the original sample
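A minimal simulated sketch of the alternative to excluding patients: impute the missing values so every row stays in the analysis. Single imputation with scikit-learn's IterativeImputer is shown for brevity; proper multiple imputation (e.g. MICE, with several imputed datasets and pooled estimates) is preferable in practice:

```python
# Impute rather than drop: all 200 simulated patients remain in the analysis.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
mask = rng.random(X.shape) < 0.15    # ~15% of values missing at random
X_missing = X.copy()
X_missing[mask] = np.nan

X_imputed = IterativeImputer(random_state=0).fit_transform(X_missing)
print("rows kept:", len(X_imputed), "of", len(X))   # 200 of 200
```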
53. Analysis
(Signalling questions repeated from slide 49.)
54. Problem: predict covid-19 mortality
DOI: 10.1093/cid/ciaa538
• Some associations may be spurious, and predictors may no longer be important after you take others into account
• Predictors known from previous research to be important may not reach statistical significance (for example, due to small sample size)
• Some predictors are important only after adjustment for other predictors (see the simulated sketch below)
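A small simulation makes the last point concrete: below, x2 genuinely predicts the outcome, but a correlated predictor masks its univariable association, so univariable screening would discard it. A linear outcome is used purely for simplicity:

```python
# Simulated suppression: x2 matters, but only after adjusting for x1.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + 0.5 * rng.normal(size=n)   # x2 correlated with x1
y = x1 - x2 + rng.normal(size=n)           # both matter, with opposite signs

uni = sm.OLS(y, sm.add_constant(x2)).fit()
multi = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print("univariable p-value for x2:  ", round(uni.pvalues[1], 3))    # weak
print("multivariable p-value for x2:", round(multi.pvalues[1], 3))  # strong
```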
55. Problem: predict covid-19 mortality
DOI: 10.1093/cid/ciaa538
• How far ahead are we predicting? Not everyone is followed up for the same amount of time (16 hours vs > 1 month)
• Excludes over half of the patients!
• Survival analysis uses the available information on all patients and is more appropriate for this type of data (see the sketch below)
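A minimal simulated sketch of the survival-analysis alternative, using the lifelines package: patients with short follow-up are censored rather than discarded, so every patient contributes to the estimate. The data and the 40-day follow-up window are made up:

```python
# Cox regression with censoring: no patient is excluded for short follow-up.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(4)
n = 300
df = pd.DataFrame({"age": rng.normal(65, 10, n)})
# Hypothetical survival times, shorter on average for older patients.
df["time_days"] = rng.exponential(30 * np.exp(-(df["age"] - 65) / 20))
df["died"] = (df["time_days"] < 40).astype(int)  # death observed in follow-up
df.loc[df["died"] == 0, "time_days"] = 40        # censored at end of follow-up

cph = CoxPHFitter()
cph.fit(df, duration_col="time_days", event_col="died")
cph.print_summary()   # hazard ratio for age, estimated from ALL patients
```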
58. Analysis
(Signalling questions repeated from slide 49.)
59. Problem: predict covid-19 mortality
DOI: 10.1093/cid/ciaa538
• Very little data for testing
• Calibration is not assessed
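Calibration asks whether predicted risks match observed event rates, something discrimination (AUC) alone cannot reveal. A minimal simulated sketch using scikit-learn's calibration_curve, with predictions deliberately inflated so the miscalibration is visible:

```python
# Group patients by predicted risk and compare predicted vs observed rates.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(5)
true_risk = rng.random(2000)
y = rng.binomial(1, true_risk)           # outcomes drawn from the true risks
pred = np.clip(1.5 * true_risk, 0, 1)    # deliberately overestimates risk

observed, predicted = calibration_curve(y, pred, n_bins=10)
for o, e in zip(observed, predicted):
    print(f"predicted {e:.2f}  observed {o:.2f}")   # observed < predicted
```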
61. Conclusion
• Despite reports of impressive predictive performance, much
of the growing body of literature on prediction research for
covid-19 is of low quality.
• Don't trust good reported performance alone – study design, analysis, and validation matter!
• Prediction is not just a methodological exercise to get the
best performance on your dataset. You need to be able to
trust the predictions for real patients.
62. If it’s not reported, it’s unclear to everyone but yourself
22 items deemed essential for transparent reporting of a prediction model study (the TRIPOD statement)