1. Calibration of risk prediction models:
decision making with the lights on or off?
Ben Van Calster
KU Leuven (B), LUMC (NL)
ISCB Krakow, August 27th 2020
2. TG6: Evaluating diagnostic tests and prediction models
• Chairs:
o Ewout Steyerberg (Leiden LUMC)
o Ben Van Calster (KU Leuven)
• Members (alphabetically):
o Patrick Bossuyt (AMC Amsterdam)
o Tom Boyles (U Witwatersrand, Johannesburg; clinician member)
o Gary Collins (U Oxford)
o Kathleen Kerr (U Washington, Seattle)
o Petra Macaskill (U Sydney)
o David McLernon (Aberdeen)
o Carl Moons (UMC Utrecht)
o Maarten van Smeden (UMC Utrecht)
o Andrew Vickers (MSKCC, New York)
o Laure Wynants (U Maastricht)
5. Risk prediction or binary prediction?
Risk is most interpretable, acknowledges imperfect prediction,
can be combined with other information, and allows decision thresholds to vary.
If you predict risk, you can assess the accuracy of the estimates (calibration).
Binary predictions easily hide potential miscalibration.
7. The Achilles heel of predictive analytics
Systematically wrong risk estimates can distort decision-making
o Risk overestimated: can lead to many unnecessary interventions
o Risk underestimated: can lead to withholding many important interventions
Calibration is often not assessed during model validation.
So for many models, it is not known how accurate the risks are in a specific
setting. In that case, you are in fact using a model with the lights off.
8. The Achilles heel of predictive analytics
But if AUC is high, the ranking of patients into lower vs higher risk must be very
good?
→ Good relative performance does not imply good absolute performance!
Using binary predictions only (e.g. treat vs don't treat), you are not avoiding the
problem. I think you aggravate it by pretending to avoid the problem.
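The point that ranking and calibration are separate properties can be shown with a small simulation (a hypothetical sketch, not from the talk): any strictly increasing distortion of the risks leaves the AUC untouched while wrecking mean calibration.

```python
import numpy as np

def auc(p, y):
    """Concordance: probability that an event receives a higher estimated
    risk than a non-event (ties count half)."""
    pos, neg = p[y == 1], p[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

rng = np.random.default_rng(1)
p_true = rng.uniform(0.05, 0.95, 2000)   # perfectly calibrated risks
y = rng.binomial(1, p_true)              # outcomes drawn from those risks
p_shrunk = p_true ** 3                   # monotone distortion: same ranking

print(auc(p_true, y) == auc(p_shrunk, y))  # discrimination is identical
print(y.mean(), p_shrunk.mean())           # but average risk is far too low
```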
9.
Published online in
J Ultrasound Med
on Aug 11 2020
Objective: develop risk model for first trimester miscarriage in very early pregnancies
- Retrospective data, single institution.
- 590 pregnancies, 345 miscarried; 9 parameters studied.
- Most important predictor (hCG rise) missing in 79%.
- No validation at all.
"It might appear to be a weakness of our study that the first trimester loss rate was
considerably higher than the rates found by other investigators (48% vs 10-30%). The rate
is high because of the high prevalence of pregnancy risk factors in our population."
Web-calculator given that allows risk estimation. I cannot support that.
10. How can risks be inaccurate?
• Methodological issues at model development or validation
o Overfitting, leading to overly extreme risk estimates on new data
"in small datasets, it is reasonable for a model not to be developed at all"
o Heterogeneity of measurement error between settings (Luijken et al, Stat Med 2019)
• Variables and characteristics unrelated to model development
o Patient characteristics and outcome incidence/prevalence vary greatly between settings
o Patient populations change over time within setting ("drift")
o So there is "heterogeneity across time and place"
11. Levels of calibration
1. Mean calibration / calibration-in-the-large
2. Weak calibration
3. Moderate calibration
4. Strong calibration
Work motivated by a very nice and thought-provoking paper by Werner Vach (JCE 2013;66:1296-1301)
12. 1. Mean calibration
The average estimated risk is accurate
Compare average risk with outcome prevalence/incidence
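As a minimal sketch (function name mine), mean calibration is just the comparison of the observed event rate with the average estimated risk, often reported as an O/E ratio:

```python
import numpy as np

def oe_ratio(p_hat, y):
    """Observed/expected ratio: observed event rate divided by the average
    estimated risk. 1 = mean-calibrated; >1 = risks too low on average."""
    return float(np.mean(y) / np.mean(p_hat))
```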
13. 2. Weak calibration
On average, the model does not overestimate or underestimate risk, and
does not give too extreme or too modest risks
"Logistic recalibration" framework, with L the logit of the estimated risk:
Evaluate calibration intercept a (with the slope fixed at 1): log( P(Y=1) / P(Y=0) ) = a + L
a < 0 means overestimation, a > 0 means underestimation
Evaluate calibration slope b: log( P(Y=1) / P(Y=0) ) = a + bL
b < 1 means too extreme risks, b > 1 means too modest risks
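These two recalibration fits can be sketched in a few lines (a numpy-only illustration with my own function names; in practice one would fit the same models with standard logistic regression software):

```python
import numpy as np

def fit_logistic(X, y, offset=None, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson,
    with an optional fixed offset added to the linear predictor."""
    offset = np.zeros(len(y)) if offset is None else offset
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta + offset)))
        w = p * (1.0 - p)
        beta += np.linalg.solve((X * w[:, None]).T @ X, X.T @ (y - p))
    return beta

def calibration_intercept_slope(p_hat, y):
    """Calibration intercept a (slope fixed at 1 via the offset) and slope b."""
    L = np.log(p_hat / (1.0 - p_hat))        # logit of estimated risks
    ones = np.ones((len(y), 1))
    a = fit_logistic(ones, y, offset=L)[0]   # logit(P) = a + L
    _, b = fit_logistic(np.column_stack([np.ones(len(y)), L]), y)  # a + b*L
    return a, b
```

With systematically overestimated risks, a comes out negative; if the risks are merely shifted on the logit scale, b stays close to 1.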
14. 3. Moderate calibration
Observed proportions of events correspond to estimated risks
Construct a flexible calibration curve based on log( P(Y=1) / P(Y=0) ) = a + s(L).
s(.) is usually a loess fit, but can also be based on splines.
This is preferable at external validation, but sufficient N needed.
Intercept and slope are nice summaries, but reduce calibration to 2 numbers (weak).
The slope is usually sufficient for internal validation (using bootstrapping or cross-validation), but the intercept or plotting a curve can sometimes be defended as well.
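A loess or spline fit is the usual choice; as a crude stand-in (my own simplification, not the method from the talk), a sliding-window average of the observed outcomes against the sorted estimated risks already gives a flexible curve:

```python
import numpy as np

def flexible_calibration_curve(p_hat, y, window_frac=0.1):
    """Crude flexible calibration curve: local observed event proportion
    (sliding-window mean) versus sorted estimated risks. A stand-in for
    the loess/spline fits used in practice; edges are unreliable."""
    order = np.argsort(p_hat)
    p_sorted = p_hat[order]
    y_sorted = y[order].astype(float)
    k = max(1, int(window_frac * len(y)))
    obs = np.convolve(y_sorted, np.ones(k) / k, mode="same")
    return p_sorted, obs
```

For a well-calibrated model the smoothed observed proportions track the diagonal; systematic deviations reveal the kind of miscalibration the intercept and slope can only summarize.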
16. Example curves with low N
Verhoeven et al. Ultrasound Obstet Gynecol 2009;34:316-321.
240 cases, 27 events (Caesarean delivery)
"Calibration of the model on the right was not as good
as the calibration of the model on the left"
17. 4. Strong calibration
Observed proportions of events correspond to estimated risks for each
covariate pattern
Hard to assess (unless the model has only a few dichotomous predictors)
This is clinically desirable but utopian. The model needs to be fully correct.
A diagonal calibration curve (i.e. moderate) does not imply strong calibration.
We have shown that moderate calibration cannot lead to harmful decisions (in the
framework of decision curve analysis).
19. Multinomial outcomes?
1. Calibration intercepts and slopes can be calculated for multinomial logistic
regression (with reference category K) by extending the approach for binary outcomes to
log( P(Y = k) / P(Y = K) ) = a_k + sum_{j=1}^{K-1} b_{k,j} L_j
2. Flexible calibration curves can be obtained by using vector splines s(.):
log( P(Y = k) / P(Y = K) ) = a_k + sum_{j=1}^{K-1} s_{k,j}( L_j )
This can be extended to risk models for ordinal outcomes, and to risk models
based on e.g. machine learning algorithms
26. Cox models
What you can do depends on the information you have (next to the validation
dataset)
In my view, level 2 is what is needed for clinical application. It is also what
TRIPOD recommends (Moons et al, Ann Intern Med 2015).
Level: available information about the model
Level 1: only model coefficients (very common)
Level 2: coefficients + cumulative baseline hazard at t1, H0(t1)
Level 3: original dataset
27. Cox models
If H0(t1) is available, flexible adaptive hazard regression can be used to
generate a flexible calibration curve at time t1:
log h(t) = s( log(-log(1 - P_t1)), t ), with
P_t1 = 1 - exp( -H0(t1) * exp(β'X) )
Can also be used for other time-to-event models.
See Austin, Harrell, van Klaveren (Stat Med 2020).
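The level-2 information is exactly what is needed to compute each patient's estimated risk at t1 in the validation data (a minimal sketch; variable names mine):

```python
import numpy as np

def cox_risk_at_t1(X, beta, H0_t1):
    """Estimated risk of an event by time t1 from a Cox model, using only
    the coefficients and the cumulative baseline hazard H0(t1):
    P_t1 = 1 - exp(-H0(t1) * exp(beta' X))."""
    lp = X @ beta                       # linear predictor beta' X
    return 1.0 - np.exp(-H0_t1 * np.exp(lp))
```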
29. 3 myths about risk thresholds
1. Risk groups are more useful than continuous risk estimates
→ Clinically actionable groups (that have consensus) can make sense, but
this remains rough for decision making at the individual level
2. You can ask your statistician to get you the threshold
→ Depends on clinical context, you need reasonable information on
misclassification costs
3. The threshold is a part of the model
→ Different preferences, different healthcare systems
These 3 issues are obviously related to each other.
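The link between misclassification costs and the threshold is a standard decision-analytic identity (my illustration, not a slide from the talk): treating at risk threshold t implies a false positive is weighed at odds t/(1-t) relative to a false negative.

```python
def risk_threshold(cost_fp, cost_fn):
    """Decision-analytic risk threshold implied by misclassification costs:
    treat when the estimated risk exceeds C_FP / (C_FP + C_FN)."""
    return cost_fp / (cost_fp + cost_fn)

# If a missed case is 9 times as bad as an unnecessary intervention,
# the implied threshold is a 10% risk:
print(risk_threshold(1, 9))
```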
30. Further plans TG6
Practical guidance on validation of risk models for time-to-event outcomes
Practical guidance on validation of risk models accounting for competing risks
Simple paper (level 1) with advice for prediction model development
Multicenter diagnostic test evaluations: guidance on design and analysis
Hands-on tutorial of tools to assess calibration for different outcomes
31. โMedicine is a science of uncertainty and an art of probabilityโ
William Osler