This document summarizes the key steps in building a risk prediction model:
1. Design the study and collect the data, typically via a prospective cohort study.
2. Choose the statistical model, the outcome, and candidate predictors based on clinical knowledge.
3. Perform initial data analysis, including descriptive statistics and an assessment of the candidate predictors.
4. Specify and estimate the prediction model, addressing issues such as handling continuous predictors and missing data.
5. Evaluate the model's performance using measures such as discrimination and calibration, and perform internal validation to account for overoptimism.
6. Present the final model following reporting guidelines such as TRIPOD.
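The estimation and evaluation steps (4-5) can be sketched in a few lines. The cohort below is synthetic, and the single-predictor model, learning rate, and iteration count are illustrative assumptions, not a recommended recipe:

```python
import math, random

random.seed(1)

# Synthetic cohort (an illustrative assumption, not real data):
# one continuous predictor x, binary outcome y with true log-odds -1 + 1.5*x.
data = []
for _ in range(500):
    x = random.gauss(0.0, 1.0)
    p_true = 1.0 / (1.0 + math.exp(-(-1.0 + 1.5 * x)))
    data.append((x, 1 if random.random() < p_true else 0))

# Step 4 (sketch): estimate intercept b0 and coefficient b1 by
# gradient ascent on the logistic log-likelihood.
b0, b1 = 0.0, 0.0
for _ in range(3000):
    g0 = g1 = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        g0 += y - p
        g1 += (y - p) * x
    b0 += 0.1 * g0 / len(data)
    b1 += 0.1 * g1 / len(data)

# Step 5 (sketch): discrimination as the C-statistic -- the probability
# that a randomly chosen case gets a higher predicted risk than a non-case.
cases = [b0 + b1 * x for x, y in data if y == 1]
controls = [b0 + b1 * x for x, y in data if y == 0]
pairs = sum(1.0 if c > d else 0.5 if c == d else 0.0
            for c in cases for d in controls)
c_statistic = pairs / (len(cases) * len(controls))
```

A full workflow would add calibration (e.g., a calibration slope) and bootstrap internal validation to estimate optimism; this sketch shows only the fitting and discrimination pieces.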
Development and evaluation of prediction models: pitfalls and solutions (Maarten van Smeden)
Slides for the statistics in practice session of the Biometrisches Kolloquium (organized by the Deutsche Region der Internationalen Biometrischen Gesellschaft), 18 March 2021
Improving predictions: Lasso, Ridge and Stein's paradox (Maarten van Smeden)
Slides of masterclass "Improving predictions: Lasso, Ridge and Stein's paradox" at the (Dutch) National Institute for Public Health and the Environment (RIVM)
Developing and validating statistical models for clinical prediction and prog... (Evangelos Kritsotakis)
Talk on clinical prediction models presented at the Joint Seminar Series in Translational and Clinical Medicine organised by the University of Crete Medical School, the Institute of Molecular Biology and Biotechnology of the Foundation for Research and Technology Hellas (IMBB-FORTH), and the University of Crete Research Center (UCRC), Heraklion [online], Greece, April 7, 2021.
Prediction, Big Data, and AI: Steyerberg, Basel Nov 1, 2019 (Ewout Steyerberg)
Title: "Clinical prediction models in the age of artificial intelligence and big data", presented at the Basel Biometrics Society seminar, Nov 1, 2019, Basel, by Ewout Steyerberg, with substantial input from Maarten van Smeden and Ben van Calster
How to combine results from randomised clinical trials on the additive scale with real world data to provide predictions on the clinically relevant scale for individual patients
Improving epidemiological research: avoiding the statistical paradoxes and fa... (Maarten van Smeden)
Keynote at Norwegian Epidemiological Association conference, October 26 2022. Discussing absence of evidence fallacy, Table 2 fallacy, Winner's curse and Stein's paradox.
The history of p-values is covered to try to shed light on a mystery: why did Student and Fisher agree numerically but disagree on interpretation?
Presentation on similarities and differences between statistical and machine learning research fields for the @UM_MiCHAMP Big Data & AI in Health Seminar Series; October 21, 2022
Dichotomania and other challenges for the collaborating biostatistician (Laure Wynants)
Conference presentation at ISCB 41 in the session "Biostatistical inference in practice: moving beyond false dichotomies".
A comment in Nature, signed by over 800 researchers, called for the scientific community to "retire statistical significance". The responses included a call to halt the use of the term "statistically significant", and changes in journals' author guidelines. The leading discourse among statisticians is that inadequate statistical training of clinical researchers and publishing practices are to blame for the misuse of statistical testing. In this presentation, we search our collective conscience by reviewing ethical guidelines for statisticians in light of the p-value crisis, examine what this implies for us when conducting analyses in collaborative work and teaching, and ask whether the ATOM (accept uncertainty; be thoughtful, open and modest) principles can guide us.
Unfortunately, some have interpreted Numbers Needed to Treat as indicating the proportion of patients on whom the treatment has had a causal effect. This interpretation is very rarely, if ever, necessarily correct. It is certainly inappropriate if based on a responder dichotomy. I shall illustrate the problem using simple causal models.
One also sometimes encounters the claim that the extent to which two distributions of outcomes overlap from a clinical trial indicates how many patients benefit. This is also false and can be traced to a similar causal confusion.
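A small worked example of the fallacy described above, with invented trial figures: an NNT of 10 does not mean one patient in ten was causally helped, because the same marginal rates are compatible with very different individual-level effects:

```python
# Illustrative numbers only (assumptions, not from any real trial).
control_rate = 0.40        # event rate without treatment
treated_rate = 0.30        # event rate with treatment
arr = control_rate - treated_rate   # absolute risk reduction
nnt = 1.0 / arr                     # NNT = 1/ARR = 10

# The tempting misreading: "treatment causally helped 1 in 10 patients".
# But identical margins arise under different individual-level stories:
# Scenario A: treatment prevents the event in 10% of patients, harms none.
# Scenario B: treatment prevents the event in 25% of patients but *causes*
#             it in 15% who would otherwise have been event-free.
benefit_b, harm_b = 0.25, 0.15
implied_treated_rate = control_rate - benefit_b + harm_b
assert abs(implied_treated_rate - treated_rate) < 1e-9  # margins identical
```

In scenario B, 25% benefit and 15% are harmed, yet the trial margins (and hence the NNT) are exactly the same as in scenario A; only the marginal rates, not the proportion of responders, are identified by the trial.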
A comment in Nature, signed by over 800 researchers, called on scientists to rise up against statistical significance. This was followed by a special issue of The American Statistician aimed at halting the use of the term "statistically significant", and new guidelines for statistical reporting in the New England Journal of Medicine. These slides discuss the broader context of the "p-value crisis" and alternatives for communicating conclusions after statistical analyses.
Target audience: Medical researchers; Scientists involved in conducting or interpreting analyses and communicating the results of scientific research, as well as readers of scientific publications.
Learning objectives:
To understand the context of the reproducibility crisis in medical research.
To learn about problems with p-values and alternatives to report findings.
To understand how (not) to interpret significant and insignificant findings.
To learn how to communicate research findings in a modest, thoughtful, and transparent way.
How to establish and evaluate clinical prediction models (Statswork)
A clinical prediction model can be used in various clinical contexts, including screening for asymptomatic illness, forecasting future events such as disease, and assisting doctors in decision-making and health education. Despite the positive effects of clinical prediction models on practice, prediction modelling is a difficult process that necessitates meticulous statistical analysis and sound clinical judgement. Statswork offers statistical services tailored to the requirements of its customers. When you order statistical services at Statswork, we promise on-time delivery, outstanding customer support, and high-quality subject-matter experts.
Read More With Us: https://bit.ly/3dxn32c
Why Statswork?
Plagiarism Free | Unlimited Support | Prompt Turnaround Times | Subject Matter Expertise | Experienced Bio-statisticians & Statisticians | Statistics across Methodologies | Wide Range of Tools & Technologies Supports | Tutoring Services | 24/7 Email Support | Recommended by Universities
Contact Us:
Website: www.statswork.com
Email: info@statswork.com
United Kingdom: 44-1143520021
India: 91-4448137070
WhatsApp: 91-8754446690
Nursing research is research that provides evidence used to support nursing practices. Nursing, as an evidence-based area of practice, has been developing since the time of Florence Nightingale to the present day; many nurses now work as researchers based in universities as well as in the healthcare setting.
An excellent article that uses predictive and optimization methods to reduce hospital readmissions.
Another great article, "Reducing hospital readmissions by integrating empirical prediction with resource optimization" (Helm, Alaeddini, Stauffer, Bretthauer, and Skolarus, 2016), describes how machine learning models were used to determine root causes and produce individualized estimates of readmission risk. Post-discharge monitoring schedules and workplans were then optimized to adapt to changes in each patient's health state.
Due to advancements in data acquisition and storage technologies, different disciplines have attained the ability not only to accumulate a wide variety of data but also to monitor observations over longer time periods. In many real-world applications, the primary objective of monitoring these observations is to estimate when a particular event of interest will occur in the future. One of the major difficulties in handling such problems is the presence of censoring, i.e., the event of interest is unobserved for some instances, either because of limited follow-up time or loss to follow-up. Due to censoring, standard statistical and machine learning based predictive models cannot readily be applied to analyze the data. An important subfield of statistics called survival analysis provides mechanisms to handle such censored data problems. In addition to the presence of censoring, such time-to-event data also present several other research challenges, such as instance/feature correlations, high dimensionality, temporal dependencies, and difficulty in acquiring sufficient event data in a reasonable amount of time. To tackle such practical concerns, the data mining and machine learning communities have started to develop more sophisticated and effective algorithms that either complement or compete with the traditional statistical methods in survival analysis. In spite of the importance of this problem and its relevance to real-world applications, research on this topic is scattered across various disciplines. In this tutorial, we will provide a comprehensive and structured overview of both statistical and machine learning based survival analysis methods along with different applications. We will also discuss the commonly used evaluation metrics and other related topics. The material will be coherently organized and presented to help the audience get a clear picture of both the fundamentals and the state-of-the-art techniques.
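The censoring problem described above can be made concrete with the Kaplan-Meier estimator, the classic nonparametric way to estimate a survival function from right-censored data. The follow-up times and censoring flags below are invented for illustration:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve.

    times[i]  = follow-up time for subject i
    events[i] = 1 if the event was observed at that time, 0 if censored
    Returns a list of (event_time, survival_probability) pairs.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv, curve = 1.0, []
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = n_at_t = 0
        # Aggregate all subjects sharing this time (events and censorings).
        while i < len(order) and times[order[i]] == t:
            deaths += events[order[i]]
            n_at_t += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk   # KM product-limit step
            curve.append((t, surv))
        at_risk -= n_at_t                  # censored subjects leave the risk set
    return curve

# Illustrative data: 0 = censored (e.g., lost to follow-up).
times  = [2, 3, 3, 5, 6, 7, 9, 9]
events = [1, 1, 0, 1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
```

Note how the censored subjects at times 3, 6, and 9 contribute to the risk set up to their censoring time but never count as deaths, which is exactly why naive methods that drop or misclassify them give biased estimates.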
Review Article
Endocrinol Metab 2016;31:38-44
http://dx.doi.org/10.3803/EnM.2016.31.1.38
pISSN 2093-596X · eISSN 2093-5978
How to Establish Clinical Prediction Models
Yong-ho Lee1, Heejung Bang2, Dae Jung Kim3
1Department of Internal Medicine, Yonsei University College of Medicine, Seoul, Korea; 2Division of Biostatistics, Department of Public Health Sciences, University of California Davis School of Medicine, Davis, CA, USA; 3Department of Endocrinology and Metabolism, Ajou University School of Medicine, Suwon, Korea
A clinical prediction model can be applied to several challenging clinical scenarios: screening high-risk individuals for asymptomatic disease, predicting future events such as disease or death, and assisting medical decision-making and health education. Despite the impact of clinical prediction models on practice, prediction modeling is a complex process requiring careful statistical analyses and sound clinical judgement. Although there is no definite consensus on the best methodology for model development and validation, a few recommendations and checklists have been proposed. In this review, we summarize five steps for developing and validating a clinical prediction model: preparation for establishing clinical prediction models; dataset selection; handling variables; model generation; and model evaluation and validation. We also review several studies that detail methods for developing clinical prediction models with comparable examples from real practice. After model development and vigorous validation in relevant settings, possibly with evaluation of utility/usability and fine-tuning, good models can be ready for use in practice. We anticipate that this framework will revitalize the use of predictive or prognostic research in endocrinology, leading to active applications in real clinical practice.
Keywords: Clinical prediction model; Development; Validation; Clinical usefulness
INTRODUCTION
Hippocrates emphasized prognosis as a principal component of medicine [1]. Nevertheless, current medical investigation mostly focuses on etiological and therapeutic research, rather than prognostic methods such as the development of clinical prediction models. Numerous studies have investigated whether a single variable (e.g., biomarkers or novel clinicobiochemical parameters) can predict or is associated with certain outcomes, whereas establishing clinical prediction models by incorporating multiple variables is rather complicated, as it requires a multi-step and multivariable/multifactorial approach to design and analysis [1].
Clinical prediction models can inform patients and their physicians or other healthcare providers of the patient's probability of having or developing a certain disease and help them with associated decision-making (e.g., facilitating patient-doctor communication based on more objective information). Ap-
Received: 9 January 2016, Revised: 14 ...
Theory and Practice of Integrating Machine Learning and Conventional Statisti... (University of Malaya)
The practice of medical decision making is changing rapidly with the development of innovative computing technologies. The growing interest in data analysis, in line with advances in data science, raises the question of whether machine learning can be integrated with conventional statistics in health research. To help address this knowledge gap, this talk focuses on the conceptual integration of conventional statistics and machine learning, with a direction towards health research. The similarities and differences between the two are compared using mathematical concepts and algorithms. The comparison indicates that conventional statistics are the fundamental basis of machine learning: the black-box algorithms are derived from basic mathematics, but are more advanced in terms of automated analysis, handling big data, and providing interactive visualizations. While the two methods differ in nature, they are conceptually similar. The evidence presented here concludes that conventional statistics and machine learning are best integrated to develop automated data analysis tools. Health researchers may explore machine learning as a potential tool to enhance conventional statistics in data analytics, with added reliable validation measures.
Department of Health Informatics
Health Information Management Program
BINF 5520 Health Analytics
Agenda
Understanding the Need for Preoperative Risk Assessment
Applying a “Bedside” Model of Open Heart Risk Assessment
Implementing the “Bedside” Model in a Second Hospital
Open Heart Risk Assessment Today: The Society for Thoracic Surgery (STS) Model
Implications for Health Analytics
Understanding the Need for Preoperative Risk Assessment and Stratification: The New York Experience
NYS Among First to Implement Cardiac Risk Model
Model Based on Earlier Work in New Jersey
Model Applied to All non-Federal Hospitals in NYS
Model Compared Both Hospitals and Providers
Model Calculates a Risk Adjusted Mortality Rate (RAMR)
Model Equalizes Results Based on a Hypothetical Statewide Case Mix
Health.ny.gov/statistics/diseases/cardiovascular/heart_disease/docs/2011-2013_adult_cardiac_surgery.pdf
Understanding the Need for Preoperative Risk Assessment and Stratification: The New York Experience
NYS Department of Health Report Summarizes:
Creation of RAMR Model
Data Collection Methods
Case Mix Assumptions
Description of Patient Population
Discussion of Critical Metrics
Impact on Quality Improvement
Understanding the Need for Preoperative Risk Assessment and Stratification: The New York Experience
Table 1 compares both Observed and Risk-Adjusted Mortality Rates for Isolated CABG Surgery in NYS for 2013 discharges.
RAMR=Risk Adjusted Mortality Rate: the Provider’s Mortality Rate if the Provider’s case mix was identical to a hypothetical statewide case mix.
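The RAMR construction described above is ordinary indirect standardization: the provider's observed-to-expected mortality ratio, rescaled to the statewide rate. A minimal sketch, with made-up provider numbers (the statewide rate, death counts, and expected deaths below are assumptions, not figures from the NYS report):

```python
# Indirect standardization sketch of a Risk-Adjusted Mortality Rate.
# All numbers are illustrative assumptions.
statewide_rate = 0.021     # statewide isolated-CABG mortality rate
observed_deaths = 6        # deaths actually seen at this provider
expected_deaths = 9.5      # sum of model-predicted risks over its case mix

# RAMR = (observed / expected) * statewide rate: the provider's rate
# if its case mix matched the hypothetical statewide case mix.
ramr = (observed_deaths / expected_deaths) * statewide_rate
# Here ramr is below the statewide rate, i.e. better than expected.
```

Fewer observed deaths than the risk model expected pulls the RAMR below the statewide rate; more pushes it above, which is what makes provider-to-provider comparison fair across different case mixes.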
Understanding the Need for Preoperative Risk Assessment and Stratification: The New York Experience
Table 6 presents the data by both Hospital and Provider.
Care was taken to collapse data when insufficient individual performance metrics were available.
This report was publicly available via the NYS Department of Health website and can be found at the link below.
How did Cardiac Surgeons begin considering these issues?
These efforts actually started in the mid-1980s at a hospital in New Jersey.
Health.ny.gov/statistics/diseases/cardiovascular/heart_disease/docs/2011-2013_adult_cardiac_surgery.pdf
Developing and Implementing a "Bedside Estimation of Risk" Model of Open Heart Risk Stratification
This work, which was begun in the mid-1980s, discussed the need for the development of a clinical model which helps surgeons when discussing Open Heart Risk with patients.
The authors conclusively demonstrate the need for a “bedside scoring system” which facilitates provider-patient dialogue.
Many of the subsequent risk models were, in some part, based on this work.
Implementing the “Bedside” Model in a Second Hospital
The Canadian authors implement the model ...
A comprehensive study on disease risk predictions in machine learning (IJECEIAES)
Over recent years, multiple disease risk prediction models have been developed. These models use various patient characteristics to estimate the probability of outcomes over a certain period of time and hold the potential to improve decision making and individualize care. Discovering hidden patterns and interactions in medical databases, alongside ongoing evaluation of disease prediction models, has become crucial; relying on traditional clinical findings alone requires many trials, which complicates disease prediction. A comprehensive study of the different strategies used to predict disease is presented in this paper. Applying these techniques to healthcare data can improve risk prediction models and identify the patients who would benefit most from disease management programs, reducing hospital readmissions and healthcare costs, although the results of these endeavors have been mixed.
Prediction of the risk of developing heart disease using logistic regression (IJECEIAES)
Heart disease (HD) accounts for more deaths every year than other illnesses. The World Health Organization (WHO) estimated 17.9 million deaths caused by heart disease in 2016, representing 31% of all deaths worldwide. Three-quarters of these deaths occur in low- and middle-income nations. Machine learning (ML), with its precision in pattern recognition and classification, has proved effective in complementing decision-making and risk prediction from the huge volume of HD data created by the healthcare sector. Thus, this study aims to develop a logistic regression model (LRM) for predicting the risk of getting HD within ten years. The study explores different methodologies for improving the performance of the base LRM in predicting whether a person gets HD after ten years or not. The results demonstrate the capability of the LRM to predict the risk of getting HD after ten years. The LRM achieves 97.35% accuracy with recursive feature elimination and random under-sampling. This implies that the LRM can play an important role in precautionary methods to avoid the risk of HD.
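The random under-sampling step mentioned above can be sketched simply: keep every minority-class (HD) record and draw an equal-sized random subset of the majority class before fitting the model. The labels and counts below are illustrative assumptions, not the study's actual data or pipeline:

```python
import random

random.seed(0)

# Imbalanced toy dataset: 50 HD cases ("pos") vs 450 non-cases ("neg").
records = [("pos", i) for i in range(50)] + [("neg", i) for i in range(450)]

pos = [r for r in records if r[0] == "pos"]
neg = [r for r in records if r[0] == "neg"]

# Random under-sampling: keep all minority cases, sample the majority
# class down to the same size, then shuffle before model fitting.
balanced = pos + random.sample(neg, len(pos))
random.shuffle(balanced)
```

Under-sampling discards majority-class information, so accuracy on the balanced set is not comparable to accuracy on the original imbalance; that trade-off is one reason results like the 97.35% above should be read with the sampling scheme in mind.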
The ASA President's Task Force Statement on Statistical Significance and Replic... (jemille6)
Yoav Benjamini's slides "The ASA President's Task Force Statement on Statistical Significance and Replicability" for the Special Session of the (remote) Phil Stat Forum: "Statistical Significance Test Anxiety" on 11 January 2022
Because everyone matters.
IBM Health and Social Programs Summit, October 2014
Stephen Morgan
Senior Vice President and Chief Medical Officer
Carilion Clinic
Jianying Hu
Research Staff Member and Manager of Healthcare Analytics Research
IBM
Paul Grundy
Global Director of Healthcare Transformation
IBM
The absence of a gold standard: a measurement error problem (Maarten van Smeden)
Talk about gold standard problems and solutions in medicine and epidemiology. Invited by the department of infectious disease epidemiology, University Medical Center Utrecht
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a... (Ana Luísa Pinho)
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and their capacity to enable complex behavior composed of discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization.
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
Seminar on U.V. Spectroscopy by Samir Panda
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that measures the amount of light absorbed by the analyte.
This presentation explores a brief idea about the structural and functional attributes of nucleotides, the structure and function of genetic materials along with the impact of UV rays and pH upon them.
What are greenhouse gases and how many gases affect the Earth? (moosaasad1975)
What are greenhouse gases, how do they affect the Earth and its environment, and what is the future of the Earth's climate and weather?
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN (Sérgio Sacani)
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
Cancer cell metabolism: special reference to the lactate pathway (AADYARAJPANDEY1)
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other molecules to get the energy they need to function.
Energy is stored in the bonds of glucose and when glucose is broken down, much of that energy is released.
Cell utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two molecules of a smaller chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Krebs cycle. The Krebs cycle allows cells to "burn" the pyruvates made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos).
It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis - Kreb's - oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
IN CANCER CELL:
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
introduction to WARBERG PHENOMENA:
WARBURG EFFECT Usually, cancer cells are highly glycolytic (glucose addiction) and take up more glucose than do normal cells from outside.
Otto Heinrich Warburg (; 8 October 1883 – 1 August 1970) In 1931 was awarded the Nobel Prize in Physiology for his "discovery of the nature and mode of action of the respiratory enzyme.
WARNBURG EFFECT : cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg made the observation that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
insect taxonomy importance systematics and classification
Introduction to prediction modelling - Berlin 2018 - Part II
1. Advanced Epidemiologic Methods
causal research and prediction modelling
Prediction modelling topics 5 - 7
Maarten van Smeden
LUMC, Department of Clinical Epidemiology
20-24 August 2018
Maarten van Smeden (LUMC) Risk prediction model building 20-24 August 2018
2. Outline
1 Introduction to prediction modelling
2 Example: predicting systolic blood pressure
3 Risk and probability
4 Risk prediction modelling: rationale and context
5 Risk prediction model building
6 Overfitting
7 External validation and updating
4. TRIPOD statement
TRIPOD, Ann Int Med, 2016, doi: 10.7326/M14-0697 and 10.7326/M14-0698
5. Steps of model development
• Research design and data collection
• Choice of statistical model, outcome and (candidate) predictors
• Initial data analysis
• Descriptive analysis
• Model specification and estimation
• Evaluation of performance and internal validation
• Presentation
6. Steps of model development (recap): research design and data collection
7. Research design: aims
• Point of intended use of the risk model
- Primary care (paper/computer/app)?
- Secondary care (bedside)?
- Low resource setting?
• Complexity
- Number of predictors?
- Transparency of calculation?
- Should it be fast?
8. Research design: design of data collection
• Diagnostic risk prediction: cross-sectional design (e.g. consecutive patients): measurement of predictors at baseline + reference standard ("gold standard" is often a misnomer)
• Prognostic risk prediction: (prospective) cohort study: measurement of predictors at baseline + follow-up until the event occurs (time horizon)
Figure: Moons, Ann Int Med, 2016, doi: 10.7326/M14-0698
Alternative data collection designs:
• Randomized trial: typically small, large treatment effects, strict eligibility criteria
• Routine care data: often suffering from data quality issues (misclassifications, missing data)
• Case-control study: generally unsuitable for risk prediction
9. Steps of model development (recap): choice of statistical model, outcome and (candidate) predictors
10. Possible outcomes
Types of outcomes
• Death (e.g. 10-day in-hospital mortality)
• Hospital readmission (e.g. 1 year after a CVD event)
• Developing a disease (e.g. 10-year risk of type 2 diabetes)
• Bleeding risk (thrombosis)
• Complications after surgery
• Response to treatment
Considerations
• A relevant time horizon for the risk is essential
• Broad composite outcomes are not informative
• Outcome misclassification can strongly influence risk predictions
11. Possible candidate predictors
General advice: use clinical knowledge and (systematic) reviews to identify predictors that are plausibly related to the outcome of interest
Type of predictors
• Demographics (age, sex, SES)
• Patient history (previous disease)
• Physical examination (may be subjective)
• Diagnostic tests (imaging, ECG)
• Biomarkers
• Disease characteristics (diagnosis, severity)
• Therapies received
• Physical functioning
• . . .
Include?
• Unique contribution to prediction
• Cost of measurement
• Speed of measurement
• Invasiveness of measurement
• Availability in clinical practice
• Measurement objectivity
• Measurement quality
• Model parsimony
• . . .
12. Choice of statistical model
Outcome | Regression model | Example
Continuous | linear (OLS) | blood pressure at discharge
Binary (death/alive) | binary logistic | EuroSCORE: 30-day mortality after cardiac surgery
Survival (time to event) | Cox model | Framingham risk score: 10-year cardiovascular disease
Categorical | multinomial logistic | Operative delivery (spontaneous, instrumental, caesarean section)
Note: many alternative regression models exist for similar outcomes (e.g. weighted linear, probit, Weibull, proportional odds)
Machine learning methods and artificial intelligence: so far shown to give little advantage, or to perform worse, compared with regression-based risk prediction (more about this tomorrow)
EuroSCORE: 10.1016/S0195-668X(02)00799-6; Framingham: 10.1161/CIRCULATIONAHA.107.699579; Operative delivery: 10.1111/j.1471-0528.2012.03334.x
13. Steps of model development (recap): initial data analysis and descriptive analysis
14. Initial data analysis and descriptive analysis
Risk model for venous thromboembolism in postpartum women: Abdul Sultan, BMJ, 2016, doi:10.1136/bmj.i6253
15. Selecting predictors on univariable associations
• The association between one particular predictor and the outcome is a univariable association ⇒ informative at the initial data analysis and descriptive analysis step
Univariable selection:
• Uses a p-value criterion (p < .05) on the univariable relation between each predictor and the outcome to decide which predictors enter the prediction model
• Is commonly used for selecting predictors
• Is inappropriate: it rejects important predictors
• Is inappropriate: it selects unimportant predictors
• Would only work for completely uncorrelated predictor variables, which they never are
Bottom line: don't use univariable selection to select or reject predictors
Read more: Sun, JCE, 1996, doi: 10.1016/0895-4356(96)00025-X
16. Missing data
Discussed extensively on day 2.
Missing data often pose a non-ignorable problem for prediction models, requiring extra steps and effort when developing and validating the model. There is, however, consensus on how to deal with particular forms of missing data (e.g. multiple imputation by chained equations when MAR, sensitivity analyses when MNAR). Missing data should be prevented as much as possible.
Read more: Vergouwe, JCE, 2010, doi: 10.1016/j.jclinepi.2009.03.017
17. Steps of model development (recap): model specification and estimation
18. Model specification
f(X) → linear predictor (lp)
Simplest case: lp = β0 + β1x1 + . . . + βPxP (only "main effects")
Linear regression: Y = lp + ε
Logistic regression: ln{Pr(Y = 1)/(1 − Pr(Y = 1))} = lp, so Pr(Y = 1) = 1/(1 + exp{−lp})
Cox regression: h(t) = h0(t) exp(lp)
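As a sketch of how the logistic model above turns a linear predictor into a risk (the intercept and coefficients below are made-up illustration values, not from any real model):

```python
import math

def predicted_risk(intercept, coefficients, predictor_values):
    """Predicted probability from a logistic model: Pr(Y = 1) = 1 / (1 + exp(-lp))."""
    lp = intercept + sum(b * x for b, x in zip(coefficients, predictor_values))
    return 1.0 / (1.0 + math.exp(-lp))

# hypothetical model: lp = -5 + 0.05 * age + 0.8 * smoker
risk = predicted_risk(-5.0, [0.05, 0.8], [60, 1])  # lp = -1.2, risk about 0.23
```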
19. Continuous predictors
• Many predictors are measured on a continuous scale
- Age
- Systolic/diastolic blood pressure
- HDL/LDL
- Biomarkers
- . . .
• Decision required on how to include continuous predictors in the modelling
• Allow for nonlinearity
- Polynomials (e.g. quadratic)
- Spline functions
- Fractional polynomials
Read more: Collins, Stat Med, 2016, doi: 10.1002/sim.6986
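A minimal sketch of the first-degree fractional polynomial idea: each candidate power gives one transformed version of a (positive) continuous predictor, and the best-fitting transformation is kept. The powers below are the conventional FP1 candidate set; fp_transform is a hypothetical helper name:

```python
import math

FP1_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)  # conventional FP1 candidate powers

def fp_transform(x, p):
    """Fractional-polynomial transformation of a positive predictor:
    x**p, with p = 0 conventionally meaning log(x)."""
    return math.log(x) if p == 0 else x ** p

# candidate transformations of age = 50 for a first-degree fractional polynomial
candidates = [fp_transform(50.0, p) for p in FP1_POWERS]
```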
21. Dichotomania
Dichotomania is an obsessive compulsive disorder to which medical advisors in particular are prone [. . .]. Show a medical advisor some continuous measurements and he or she immediately wonders: Hmm, how can I make these clinically meaningful? Where can I cut them in two? What ludicrous side conditions can I impose on this?
Stephen Senn
Quote source: Senn, http://www.senns.demon.co.uk/Geep.htm
Dichotomising predictors is unfortunately very common in prediction modelling
• Example: create a new predictor coded 0 if age < 50 years ('young') and 1 if age ≥ 50 years ('old')
• Throws away precious information for risk prediction
• Unrealistic: it assumes those immediately above and below the cut point have different risks
• Reduces the predictive accuracy of the model
Avoid dichotomising predictors!
22. Dichotomania
Source: Royston, Stat Med, 2006, doi: 10.1002/sim.2331
23. Steps of model development (recap): evaluation of performance and internal validation
24. Model predictive performance
Source: Steyerberg, Epidemiology, 2010, doi: 10.1097/EDE.0b013e3181c30fb2
29. Discrimination
• Sensitivity/specificity trade-off
• An arbitrary choice of threshold → many possible sensitivity/specificity pairs
• All pairs in one graph: the ROC curve
• Area under the ROC curve: the probability that a random individual with the event has a higher predicted probability than a random individual without the event
• Area under the ROC curve: the c-statistic (for logistic regression) takes on values between 0.5 (no better than a coin flip) and 1.0 (perfect discrimination)
Read more: Sedgwick, BMJ, 2015, doi: 10.1136/bmj.h2464
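The c-statistic described above can be computed directly as a concordance probability over all event/non-event pairs (a sketch; real software handles large samples more efficiently):

```python
def c_statistic(risks_events, risks_nonevents):
    """Probability that a randomly chosen individual with the event has a higher
    predicted risk than a randomly chosen individual without (ties count 1/2)."""
    pairs = 0
    concordant = 0.0
    for r_event in risks_events:
        for r_nonevent in risks_nonevents:
            pairs += 1
            if r_event > r_nonevent:
                concordant += 1.0
            elif r_event == r_nonevent:
                concordant += 0.5
    return concordant / pairs

# perfect discrimination: every event has a higher predicted risk than every non-event
c_statistic([0.8, 0.9], [0.1, 0.2])  # 1.0
```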
31. Discrimination and calibration
• Discrimination: the extent to which predicted risks differentiate between positive and negative outcomes
• Calibration: the extent to which the estimated risks are valid
• Discrimination is usually treated as the no. 1 performance measure
- Risk models are typically compared on discriminative performance, not on calibration
- A risk prediction model with no discriminative performance is uninformative
- A risk prediction model that is poorly calibrated is misleading
Van Calster, JCE, 2016, doi: 10.1016/j.jclinepi.2015.12.005
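One simple calibration summary, sketched below: the difference between the observed event rate and the mean predicted risk (calibration-in-the-large in its crudest form; a full assessment would also examine the calibration slope and a calibration plot):

```python
def calibration_in_the_large(predicted_risks, outcomes):
    """Observed event rate minus mean predicted risk; close to 0 for a model
    that is calibrated 'in the large' (overall, ignoring the slope)."""
    observed_rate = sum(outcomes) / len(outcomes)
    mean_predicted = sum(predicted_risks) / len(predicted_risks)
    return observed_rate - mean_predicted
```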
32. Overoptimism
Overoptimism
Predictive performance evaluations are too optimistic when estimated on the same data in which the risk prediction model was developed. This is therefore called the apparent performance of the model.
• Optimism can be large, especially in small datasets and with a large number of predictors
• To get a better estimate of the predictive performance:
- Internal validation (same data sample)
- External validation (other data sample, discussed in tomorrow's lecture)
33. Internal validation
• Evaluate the performance of the risk prediction model on data from the same population from which the model was developed
• Say that we start with one dataset with all data available: the original data
• Option 1: splitting the original data
- One portion to develop ('training set'); one portion to evaluate ('test set')
- Non-random vs random split
- Generates 1 test of performance
• Option 2: resampling from the original data
- Cross-validation
- Bootstrapping
- Generates a distribution of performances
• General advice: avoid splitting (option 1) because
- It is inefficient → especially when the original data are small
- It usually leads to a too small test set
Read more: Steyerberg, JCE, 2001, doi: 10.1016/S0895-4356(01)00341-9
34. Bootstrapping
Steps:
• Randomly select individuals from the original data until a dataset of the same size is obtained (called the bootstrap sample)
• Each time an individual is selected, they are put back into the original dataset; individuals may therefore be selected more than once in each bootstrap sample
• Repeat this process many times, say 500, to obtain 500 bootstrap samples
• Repeat the model development process (incl. non-linear effects, variable selection) on each of the bootstrap samples
• Calculate the predictive performance of the developed models on the original data
• Take the average over these samples to get an optimism-corrected estimate of the performance of the model in the original sample
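The steps above can be sketched as a generic optimism-correction loop (assuming user-supplied develop and performance functions; subtracting bootstrap-minus-original performance is the usual Harrell-style correction):

```python
import random

def bootstrap_corrected(data, develop, performance, n_boot=500, seed=1):
    """Optimism-corrected performance: apparent performance minus the average of
    (performance in the bootstrap sample - performance of that model on the
    original data) over all bootstrap samples."""
    rng = random.Random(seed)
    apparent = performance(develop(data), data)
    optimism = 0.0
    for _ in range(n_boot):
        boot = [rng.choice(data) for _ in data]  # same size, drawn with replacement
        model = develop(boot)
        optimism += performance(model, boot) - performance(model, data)
    return apparent - optimism / n_boot
```

With develop fitting the full model-building strategy and performance returning e.g. the c-statistic, the corrected value will typically fall below the apparent one.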
35. Steps of model development (recap): presentation
36. Presentation
• Make sure that information about all estimated regression parameters is provided, including the intercept
• Consider: adding a nomogram, developing a score chart or app
• Follow the reporting guideline TRIPOD
TRIPOD, Ann Int Med, 2016, doi: 10.7326/M14-0697 and 10.7326/M14-0698
37. Report all estimated parameters
39. Outline (recap): 6 Overfitting
Maarten van Smeden (LUMC) Overfitting 20-24 August 2018
40. Overfitting
Curse of all statistical modelling1
What you see is not what you get2
When a model is fitted that is too complex, that is, it has too many free parameters to estimate for the amount of information in the data, the worth of the model (e.g., R2) will be exaggerated and future observed values will not agree with predicted values3
Idiosyncrasies in the data are fitted rather than generalizable patterns. A model may hence not be applicable to new patients, even when the setting of application is very similar to the development setting4
1 van Houwelingen, Stat Med, 2000, PMID: 11122504; 2 Babyak, Psychosomatic Medicine, 2004, PMID: 15184705; 3 Harrell, 2001, Springer, ISBN 978-1-4757-3462-1; 4 Steyerberg, 2009, Springer, ISBN 978-0-387-77244-8
41. Overfitting poem
Wherry, Personnel Psychology, 1975, doi: 10.1111/j.1744-6570.1975.tb00387.x
43. Overfitting causes and consequences
Steyerberg, 2009, Springer, ISBN 978-0-387-77244-8.
44. Overfitting: typical calibration plot
• Low probabilities are predicted too low, high probabilities are predicted too high
46. Calibration development data: not insightful
Bell, BMJ, 2015, doi: 10.1136/bmj.h5639
47. How to avoid overfitting?
• Be conservative when selecting/removing predictor variables
• Avoid stepwise selection and forward selection
• When using backward elimination, use conservative p-values (e.g. p = 0.10 or 0.20)
• Apply shrinkage methods
• Ensure an adequate sample size
48. Automated (stepwise) variable selection
• Selection unstable: selection and order of entry often overinterpreted
• Limited power to detect true effects: predictive ability suffers, underfitting
• Risk of false-positive associations: multiple testing, overfitting
• Inference biased: P-values exaggerated; standard errors too small
• Estimated coefficients biased: testimation
Figure: Steyerberg, JCE, 2018, doi: 10.1016/j.jclinepi.2017.11.013; Read more: Heinze, Biometrical J, 2018, doi: 10.1002/bimj.201700067
49. 1956: Stein's paradox
Stein, 1956: http://www.dtic.mil/dtic/tr/fulltext/u2/1028390.pdf
50. 1956: Stein's paradox
In words (rather simplified): when one has three or more units (say, individuals), and for each unit one can calculate an average score (say, average blood pressure), then the best guess of future observations (blood pressure) for each unit is NOT its average score
51. 1961: James-Stein estimator: the next Berkeley Symposium
James, 1961: https://projecteuclid.org/euclid.bsmsp/1200512173
52. 1977: Baseball example
Efron, Scientific American, 1977, www.jstor.org/stable/24954030
53. Lessons from Stein's paradox
• Stein's paradox is among the most surprising (and initially doubted) phenomena in statistics
• After the James-Stein estimator many other shrinkage estimators were developed. Now a large family: shrinkage estimators reduce prediction variance to an extent that outweighs the bias that is introduced (bias/variance trade-off)
Bias, variance and prediction error
Expected prediction error = irreducible error + bias² + variance
Friedman et al. (2001). The Elements of Statistical Learning. Vol. 1. New York: Springer.
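A sketch of the positive-part James-Stein idea for the setting just described: k observed means are shrunk toward their grand mean, assuming a known common variance sigma2 of each mean (an illustration, not the estimator in its full generality):

```python
def james_stein(means, sigma2):
    """Positive-part James-Stein-type estimator: shrink k observed means toward
    their grand mean; the closer the means lie together relative to sigma2,
    the stronger the shrinkage."""
    k = len(means)
    grand = sum(means) / k
    spread = sum((m - grand) ** 2 for m in means)
    shrink = max(0.0, 1.0 - (k - 3) * sigma2 / spread)  # shrinkage factor in [0, 1]
    return [grand + shrink * (m - grand) for m in means]

# similar means get shrunk a lot; widely spread means only a little
james_stein([0.0, 2.0, 4.0, 6.0], sigma2=4.0)  # shrinkage factor 0.8 here
```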
64. Was I just lucky?
No: a 5% reduction in MSPE just by using a shrinkage estimator (Van Houwelingen and le Cessie's heuristic shrinkage factor)
67. Shrinkage estimators
Popular shrinkage approaches for prediction modeling:
• Bootstrap
• Heuristic formula
• Firth's correction
• Ridge regression
• LASSO regression
• Bayesian prediction modeling
• Note: shrinkage is in general particularly beneficial for calibration of the risk prediction
model and less so for its discrimination
Further reading: Pavlou, BMJ, 2015, doi: 10.1136/bmj.h3868; van Smeden, SMMR, 2018, doi: 10.1177/0962280218784726
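Among the approaches above, ridge regression has a simple closed form for the linear model, sketched here with NumPy (lam is the tuning parameter, normally chosen by cross-validation; an intercept term is omitted for brevity):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge (L2-shrunken) linear regression without intercept:
    beta = (X'X + lam * I)^(-1) X'y.  lam = 0 gives ordinary least squares;
    larger lam shrinks the coefficients toward zero."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])
ridge(X, y, 0.0)   # OLS: coefficient 1.0
ridge(X, y, 14.0)  # shrunken: coefficient 0.5
```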
68. Sample size
• Sample size is an important factor driving the performance of risk prediction models
• There is no consensus on what counts as an adequate sample size
• General principles for an adequate sample size:
- The effective sample size is driven by the number of observations in the group with or without the predicted outcome, whichever is smallest, per convention called "events"
- EPV: the number of events divided by the number of candidate predictors is a common ratio to describe model parsimony vs effective sample size
- EPV < 10 is the "danger zone": avoid
- An EPV much larger than 10 is often needed for a prediction model that gives precise risk estimates
Further reading: van Smeden, SMMR, 2018, doi: 10.1177/0962280218784726
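The EPV calculation itself is simple arithmetic, sketched here; the numbers in the example are made up:

```python
def events_per_variable(n_group1, n_group2, n_candidate_predictors):
    """EPV: the effective sample size (the smaller of the two outcome groups,
    per convention called 'events') divided by the number of candidate predictors."""
    return min(n_group1, n_group2) / n_candidate_predictors

# e.g. 50 events among 1000 patients, with 10 candidate predictors:
events_per_variable(50, 950, 10)  # EPV = 5.0, inside the 'danger zone' (< 10)
```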
69. Sample size and shrinkage
The benefit of regression shrinkage depends on:
• Sample size
• Correlations between predictor variables
• Sparsity of the outcome and predictor variables
• The irreducible error component
• Type of outcome (continuous, binary, count, time-to-event, ...)
• Number of candidate predictor variables
• Non-linear/interaction effects
• Weak/strong predictor balance
How to know that there is no need for shrinkage at some sample size?
Advice: always apply shrinkage regardless of sample size and compare to the non-shrunken model. Very large differences may indicate a variety of non-identified issues that may need fixing → contact a statistician
70. Outline (recap): 7 External validation and updating
Maarten van Smeden (LUMC) External validation and updating 20-24 August 2018
71. Prediction model landscape
• > 110 models for prostate cancer (Shariat 2008)
• > 100 models for traumatic brain injury (Perel 2006)
• 83 models for stroke (Counsell 2001)
• 54 models for breast cancer (Altman 2009)
• 43 models for type 2 diabetes (Collins 2011; Dieren 2012)
• 31 models for osteoporotic fracture (Steurer 2011)
• 29 models in reproductive medicine (Leushuis 2009)
• 26 models for hospital readmission (Kansagara 2011)
• > 25 models for length of stay in cardiac surgery (Ettema 2010)
• > 350 models for cardiovascular disease outcomes (Damen 2016)
• What if your model becomes number 300-something?
• What about the clinical benefit/utility of number 300-something?
Courtesy of KGM Moons and GS Collins for this overview
72. Before developing yet another model, know that:
• For most diseases / outcomes risk prediction models have already been developed
→ Only a few are externally validated or updated
→ Even fewer are disseminated and used in clinical practice
• Use your data for external validation of models already developed!
73. External validation
• Study of the predictive performance of the risk prediction model in data of new subjects that were not used to develop it
- Different time period ('temporal')
- Different areas/centres ('geographical')
- Ideally by independent investigators
• The larger the difference between development and validation data (e.g. in case-mix: the distributions of predictors and outcome), the stronger the test of whether the model will be useful in (as yet) untested populations
• External validation is the strongest test of a prediction model
Collins, BMJ, 2012, doi: 10.1136/bmj.e3186
74. External validation is not
• It is not repeating the model development steps
• Whether the same predictors, regression coefficients and predictive performance would be found in new data is not in question
• It is not re-estimating a previously developed model
• Updating regression coefficients is sometimes done when the performance at external validation is unsatisfactory. This can be viewed as model updating (model revision) and calls for new external validation
75. What to expect at external validation
• Decreased predictive performance compared to development is expected
• Many possible causes:
- Overfitting of the model at development
- Different type of patients (case mix)
- Different outcome occurrence
- Differences in care over time
- Differences in treatments
- Improvement in measurements over time (e.g. previous CTs less accurate than spiral CT for PE detection)
- . . .
• When the predictive performance is judged too low → consider model updating
76. Model updating
• Recalibration in the large: re-estimate the intercept
• Recalibration: re-estimate the intercept + an additional factor that multiplies all coefficients by the same amount (the calibration slope)
Table from Vergouwe, Stat Med, 2017, doi: 10.1002/sim.7179
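Recalibration in the large can be sketched as a one-parameter logistic fit: the original linear predictor enters as an offset and only a new intercept is estimated (a plain Newton-Raphson sketch; real analyses would use standard glm software with an offset term):

```python
import math

def recalibrate_intercept(linear_predictors, outcomes, n_iter=25):
    """Recalibration in the large: re-estimate only the intercept, keeping the
    original linear predictor as an offset (calibration slope fixed at 1)."""
    a = 0.0
    for _ in range(n_iter):
        p = [1.0 / (1.0 + math.exp(-(a + lp))) for lp in linear_predictors]
        gradient = sum(y - pi for y, pi in zip(outcomes, p))
        hessian = sum(pi * (1.0 - pi) for pi in p)
        a += gradient / hessian  # Newton-Raphson step
    return a
```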
77. Sample size for external validation
Vergouwe, JCE, 2005, doi: 10.1016/j.jclinepi.2004.06.017; Collins, Stat Med, 2015, doi: 10.1002/sim.6787
79. Advanced Epidemiologic Methods
causal research and prediction modelling
Final remarks
Maarten van Smeden
LUMC, Department of Clinical Epidemiology
20-24 August 2018
Maarten van Smeden (LUMC) Final remarks 20-24 August 2018
81. Machine learning
Beam, JAMA, 2018, doi: 10.1001/jama.2017.18391
82. Machine learning
Shah, JAMA, 2018, doi: 10.1001/jama.2018.5602
83. Machine learning
Shah, JAMA, 2018, doi: 10.1001/jama.2018.5602
84. Machine learning
source: blog Frank Harrell, http://www.fharrell.com/post/stat-ml/
85. Final remarks
• Prediction models can take many forms, but in medicine the interest is often in calculating the risk of a health state currently being present (diagnostic) or developing in the future (prognostic)
• Risk prediction models are tools that aim to support medical decision making, not replace physicians
• Many prediction models have been developed already → make sure you review the earlier models in the field before deciding to build your own
• Calibration is essential for accurate risk prediction. Miscalibrated models misinform and may cause patients harm
86. Acknowledgment
The materials (slides) used in this course were inspired by materials that belong to Prof dr Gary Collins.