1. Calibration of risk prediction models:
decision making with the lights on or off?
Ben Van Calster
KU Leuven (B), LUMC (NL)
ISCB Krakow, August 27th 2020
2. TG6: Evaluating diagnostic tests and prediction models
• Chairs:
o Ewout Steyerberg (Leiden LUMC)
o Ben Van Calster (KU Leuven)
• Members (alphabetically):
o Patrick Bossuyt (AMC Amsterdam)
o Tom Boyles (U Witwatersrand, Johannesburg; clinician member)
o Gary Collins (U Oxford)
o Kathleen Kerr (U Washington, Seattle)
o Petra Macaskill (U Sydney)
o David McLernon (Aberdeen)
o Carl Moons (UMC Utrecht)
o Maarten van Smeden (UMC Utrecht)
o Andrew Vickers (MSKCC, New York)
o Laure Wynants (U Maastricht)
5. Risk prediction or binary prediction?
Risk is most interpretable, acknowledges imperfect prediction,
can be combined with other information, and allows decision thresholds to vary.
If you predict risk, you can assess the accuracy of the estimates (calibration).
Binary predictions easily hide potential miscalibration.
7. The Achilles heel of predictive analytics
Systematically wrong risk estimates can distort decision-making
o Risk overestimated: can lead to many unnecessary interventions
o Risk underestimated: can lead to withholding many important interventions
Calibration is often not assessed during model validation.
So for many models, it is not known how accurate the risks are in a specific
setting. In that case, you are in fact using a model with the lights off.
8. The Achilles heel of predictive analytics
But if AUC is high, the ranking of patients into lower vs higher risk must be very
good?
→ Good relative performance does not imply good absolute performance!
Using binary predictions only (e.g. treat vs don't treat), you are not avoiding the
problem. I think you aggravate it by pretending to avoid the problem.
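The point that ranking and calibration are separate properties can be shown with a small simulation (a hypothetical sketch, not from the talk): any strictly increasing distortion of the risks leaves the AUC untouched while wrecking mean calibration.

```python
import numpy as np

def auc(p, y):
    """Concordance: probability that an event receives a higher estimated
    risk than a non-event (ties count half)."""
    pos, neg = p[y == 1], p[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

rng = np.random.default_rng(1)
p_true = rng.uniform(0.05, 0.95, 2000)   # perfectly calibrated risks
y = rng.binomial(1, p_true)              # outcomes drawn from those risks
p_shrunk = p_true ** 3                   # monotone distortion: same ranking

print(auc(p_true, y) == auc(p_shrunk, y))  # discrimination is identical
print(y.mean(), p_shrunk.mean())           # but average risk is far too low
```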
9.
Published online in
J Ultrasound Med
on Aug 11 2020
Objective: develop risk model for first trimester miscarriage in very early pregnancies
- Retrospective data, single institution.
- 590 pregnancies, 345 miscarried; 9 parameters studied.
- Most important predictor (hCG rise) missing in 79%.
- No validation at all.
"It might appear to be a weakness of our study that the first trimester loss rate was
considerably higher than the rates found by other investigators (48% vs 10-30%). The rate
is high because of the high prevalence of pregnancy risk factors in our population."
Web-calculator given that allows risk estimation. I cannot support that.
10. How can risks be inaccurate?
• Methodological issues at model development or validation
o Overfitting, leading to overly extreme risk estimates on new data
"in small datasets, it is reasonable for a model not to be developed at all"
o Heterogeneity of measurement error between settings (Luijken et al, Stat Med 2019)
• Variables and characteristics unrelated to model development
o Patient characteristics and outcome incidence/prevalence vary greatly between settings
o Patient populations change over time within setting ("drift")
o So there is "heterogeneity across time and place"
11. Levels of calibration
1. Mean calibration / calibration-in-the-large
2. Weak calibration
3. Moderate calibration
4. Strong calibration
Work motivated by a very nice and thought-provoking paper by Werner Vach (JCE 2013;66:1296-1301)
12. 1. Mean calibration
The average estimated risk is accurate
Compare average risk with outcome prevalence/incidence
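As a minimal sketch (function name mine), mean calibration is just the comparison of the observed event rate with the average estimated risk, often reported as an O/E ratio:

```python
import numpy as np

def oe_ratio(p_hat, y):
    """Observed/expected ratio: observed event rate divided by the average
    estimated risk. 1 = mean-calibrated; >1 = risks too low on average."""
    return float(np.mean(y) / np.mean(p_hat))
```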
13. 2. Weak calibration
On average, the model does not overestimate or underestimate risk, and
does not give too extreme or too modest risks
"Logistic recalibration" framework, with L the logit of the estimated risk:
Evaluate calibration intercept a (with the slope fixed at 1): log( P(Y=1) / P(Y=0) ) = a + L
a < 0 means overestimation, a > 0 means underestimation
Evaluate calibration slope b: log( P(Y=1) / P(Y=0) ) = a + bL
b < 1 means too extreme risks, b > 1 means too modest risks
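These two recalibration fits can be sketched in a few lines (a numpy-only illustration with my own function names; in practice one would fit the same models with standard logistic regression software):

```python
import numpy as np

def fit_logistic(X, y, offset=None, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson,
    with an optional fixed offset added to the linear predictor."""
    offset = np.zeros(len(y)) if offset is None else offset
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta + offset)))
        w = p * (1.0 - p)
        beta += np.linalg.solve((X * w[:, None]).T @ X, X.T @ (y - p))
    return beta

def calibration_intercept_slope(p_hat, y):
    """Calibration intercept a (slope fixed at 1 via the offset) and slope b."""
    L = np.log(p_hat / (1.0 - p_hat))        # logit of estimated risks
    ones = np.ones((len(y), 1))
    a = fit_logistic(ones, y, offset=L)[0]   # logit(P) = a + L
    _, b = fit_logistic(np.column_stack([np.ones(len(y)), L]), y)  # a + b*L
    return a, b
```

With systematically overestimated risks, a comes out negative; if the risks are merely shifted on the logit scale, b stays close to 1.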
14. 3. Moderate calibration
Observed proportions of events correspond to estimated risks
Construct a flexible calibration curve based on log( P(Y=1) / P(Y=0) ) = a + s(L).
s(.) is usually a loess fit, but can also be based on splines.
This is preferable at external validation, but sufficient N needed.
Intercept and slope are nice summaries, but reduce calibration to 2 numbers (weak).
The slope is usually sufficient for internal validation (using bootstrapping or cross-validation), but the intercept or plotting a curve can sometimes be defended as well.
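A loess or spline fit is the usual choice; as a crude stand-in (my own simplification, not the method from the talk), a sliding-window average of the observed outcomes against the sorted estimated risks already gives a flexible curve:

```python
import numpy as np

def flexible_calibration_curve(p_hat, y, window_frac=0.1):
    """Crude flexible calibration curve: local observed event proportion
    (sliding-window mean) versus sorted estimated risks. A stand-in for
    the loess/spline fits used in practice; edges are unreliable."""
    order = np.argsort(p_hat)
    p_sorted = p_hat[order]
    y_sorted = y[order].astype(float)
    k = max(1, int(window_frac * len(y)))
    obs = np.convolve(y_sorted, np.ones(k) / k, mode="same")
    return p_sorted, obs
```

For a well-calibrated model the smoothed observed proportions track the diagonal; systematic deviations reveal the kind of miscalibration the intercept and slope can only summarize.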
16. Example curves with low N
Verhoeven et al. Ultrasound Obstet Gynecol 2009;34:316-321.
240 cases, 27 events (Caesarean delivery)
"Calibration of the model on the right was not as good
as the calibration of the model on the left"
17. 4. Strong calibration
Observed proportions of events correspond to estimated risks for each
covariate pattern
Hard to assess (unless the model has only a few dichotomous predictors)
This is clinically desirable but utopian. The model needs to be fully correct.
A diagonal calibration curve (i.e. moderate) does not imply strong calibration.
We have shown that moderate calibration cannot lead to harmful decisions (in the
framework of decision curve analysis).
19. Multinomial outcomes?
1. Calibration intercepts and slopes can be calculated for multinomial logistic
regression (with reference category K) by extending the approach for binary outcomes to
log( P(Y = k) / P(Y = K) ) = a_k + sum_{j=1}^{K-1} b_{k,j} L_j
2. Flexible calibration curves can be obtained by using vector splines s(.):
log( P(Y = k) / P(Y = K) ) = a_k + sum_{j=1}^{K-1} s_{k,j}( L_j )
This can be extended to risk models for ordinal outcomes, and to risk models
based on e.g. machine learning algorithms
26. Cox models
What you can do depends on the information you have (next to the validation
dataset)
In my view, level 2 is what is needed for clinical application. It is also what
TRIPOD recommends (Moons et al, Ann Intern Med 2015).
Level: available information about the model
Level 1: only model coefficients (very common)
Level 2: coefficients + cumulative baseline hazard at t1, H0(t1)
Level 3: original dataset
27. Cox models
If H0(t1) is available, flexible adaptive hazard regression can be used to
generate a flexible calibration curve at time t1:
log h(t) = s( log(-log(1 - P_t1)), t ), with
P_t1 = 1 - exp( -H0(t1) * exp(β'X) )
Can also be used for other time-to-event models.
See Austin, Harrell, van Klaveren (Stat Med 2020).
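The level-2 information is exactly what is needed to compute each patient's estimated risk at t1 in the validation data (a minimal sketch; variable names mine):

```python
import numpy as np

def cox_risk_at_t1(X, beta, H0_t1):
    """Estimated risk of an event by time t1 from a Cox model, using only
    the coefficients and the cumulative baseline hazard H0(t1):
    P_t1 = 1 - exp(-H0(t1) * exp(beta' X))."""
    lp = X @ beta                       # linear predictor beta' X
    return 1.0 - np.exp(-H0_t1 * np.exp(lp))
```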
29. 3 myths about risk thresholds
1. Risk groups are more useful than continuous risk estimates
→ Clinically actionable groups (that have consensus) can make sense, but
this remains rough for decision making at the individual level
2. You can ask your statistician to get you the threshold
→ Depends on clinical context, you need reasonable information on
misclassification costs
3. The threshold is a part of the model
→ Different preferences, different healthcare systems
These 3 issues are obviously related to each other.
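The link between misclassification costs and the threshold is a standard decision-analytic identity (my illustration, not a slide from the talk): treating at risk threshold t implies a false positive is weighed at odds t/(1-t) relative to a false negative.

```python
def risk_threshold(cost_fp, cost_fn):
    """Decision-analytic risk threshold implied by misclassification costs:
    treat when the estimated risk exceeds C_FP / (C_FP + C_FN)."""
    return cost_fp / (cost_fp + cost_fn)

# If a missed case is 9 times as bad as an unnecessary intervention,
# the implied threshold is a 10% risk:
print(risk_threshold(1, 9))
```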
30. Further plans TG6
Practical guidance on validation of risk models for time-to-event outcomes
Practical guidance on validation of risk models accounting for competing risks
Simple paper (level 1) with advice for prediction model development
Multicenter diagnostic test evaluations: guidance on design and analysis
Hands-on tutorial of tools to assess calibration for different outcomes
31. โMedicine is a science of uncertainty and an art of probabilityโ
William Osler