Introduction to prediction modelling - Berlin 2018 - Part I

Advanced Epidemiologic Methods
causal research and prediction modelling
Prediction modelling topics 1-4
Maarten van Smeden
LUMC, Department of Clinical Epidemiology
20-24 August 2018
Maarten van Smeden (LUMC) Introduction to prediction modelling 20-24 August 2018

Outline
1 Introduction to prediction modelling
2 Example: predicting systolic blood pressure
3 Risk and probability
4 Risk prediction modelling: rationale and context
5 Risk prediction model building
6 Overfitting
7 External validation and updating

About
• Statistician by training
• PhD (2016): diagnostic research in the absence of a gold standard
• Post-doc department of Biostatistics (University Medical Center Utrecht)
• Currently: senior researcher department of Clinical Epidemiology (Leiden University Medical
Center)

Types of prediction research
• Prevalence/incidence studies
- Occurrence of health outcomes within/across a geographical area or over time
- Average risk of having/experiencing a health outcome

Prevalence study
Beasley, The Lancet, 1998. doi: 10.1016/S0140-6736(97)07302-9

Incidence study
Adabag, JAMA, 2008. doi: 10.1001/jama.2008.553

• Predictor finding studies
- Identifying factors associated with a health outcome

Predictor finding study
Letellier, BJC, 2017. doi: 10.1038/bjc.2017.352

• Stratified medicine
- Identify biomarkers that predict response to a treatment

Stratified medicine
Bass, JCEM, 2010, doi:10.1210/jc.2010-0947

Topic of today and tomorrow
• Prediction models
- Modelling combinations of factors to predict a health outcome for individual patients

In the doctor's office
The relevant questions to ask?
"What is wrong with this patient?" ⇒ Diagnosis
"What happens to this patient without/after treatment X?" ⇒ Prognosis/therapy

The patient
• 52-year-old man
• Endurance cyclist
• Swollen calf for 10 days
• "Calf feels hot"
• Previously documented DVT
• Elbow surgery 5 weeks ago
Deep venous thrombosis likely?

Clinical prediction example 1: Apgar
Apgar, JAMA, 1958. doi: 10.1001/jama.1958.03000150027007

Clinical prediction example 1: Apgar
Casey, NEJM, 2001, doi: 10.1056/NEJM200102153440701

Clinical prediction example 2: Framingham risk score
10-year CVD risk
D’Agostino, Circulation, 2008. doi: 10.1161/CIRCULATIONAHA.107.699579

Clinical prediction example 3: SCORE
10-year fatal CVD risk
Conroy, European Heart Journal, 2003. doi: 10.1016/S0195-668X(03)00114-3

Clinical prediction example 4: Lymph node metastasis

2 Example: predicting systolic blood pressure

Example: predicting systolic blood pressure (sbp) at discharge
• Patients hospitalized for heart failure
• Goal is to develop a model predicting systolic blood
pressure at discharge
• This example is inspired by a paper of Austin and
Steyerberg; data (N = 7,000) are simulated
Austin, J Clin Epi, 2015, doi: 10.1016/j.jclinepi.2014.12.014

Data dictionary
label          explanation
admission sbp  systolic blood pressure at admission (in mm Hg)
age            age at hospitalization (in years)
female         gender
hypertension   presence of hypertension
ischemichd     ischemic heart failure
LVEFlow        left ventricular ejection fraction < 20%
LVEFmedium     left ventricular ejection fraction > 20%, < 40%
angiotensin¹   angiotensin converting enzyme inhibitors
betablock¹     beta-blockers
ccantagon¹     calcium channel antagonists
digoxin¹       digoxin
diuretic¹      diuretic
vasodilator¹   vasodilator
discharge sbp  systolic blood pressure at discharge (in mm Hg)
Coding: 0 = no, 1 = yes
¹ during hospital stay

A note on data collection
Rubbish in = Rubbish out
A descriptive analysis tells only part of the data’s story
Wynants, BJOG, 2017, doi:10.1111/1471-0528.14170

Examples of rubbish data
Outcome measurements
• Irrelevant time horizons (e.g. too long or short follow-up times)
• Broad composite outcomes
• Outcomes measured with large error/misclassifications
Predictor variables
• That are too expensive for use in practice
• That are unduly invasive
• That are unavailable at the point where prediction is needed (follow-up data)

Descriptive analyses
Automated data summary using the R package summarytools (version 0.8.7) with the command view(dfSummary(Data))
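
The automated summary above is produced in R. As a rough analogue, the sketch below computes a few per-variable descriptives in plain Python. The function name summarize, the underscore variable names, and the toy records are assumptions made for illustration; they are not the course data and do not reproduce the summarytools output.

```python
import statistics

def summarize(rows, columns):
    """Return n, number missing, mean, SD, min and max for each column."""
    summary = {}
    for col in columns:
        # Treat None as a missing value
        values = [r[col] for r in rows if r.get(col) is not None]
        summary[col] = {
            "n": len(values),
            "missing": len(rows) - len(values),
            "mean": statistics.mean(values),
            "sd": statistics.stdev(values) if len(values) > 1 else float("nan"),
            "min": min(values),
            "max": max(values),
        }
    return summary

# Hypothetical toy records (not the simulated course data)
rows = [
    {"age": 74, "admission_sbp": 162, "female": 0},
    {"age": 68, "admission_sbp": 140, "female": 1},
    {"age": 81, "admission_sbp": None, "female": 1},
    {"age": 59, "admission_sbp": 128, "female": 0},
]

summary = summarize(rows, ["age", "admission_sbp", "female"])
for col, stats in summary.items():
    print(col, stats)
```

A summary like this belongs to data screening: it helps spot missing values and implausible ranges, and is not meant for selecting predictors.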

Initial data analysis
Is:
• cleaning: finding/resolving inconsistencies
• screening: description of data properties
• documentation of steps
• preparation
Isn’t:
• for selection of predictors
• for selection of subgroups
• for developing prediction models
• always fun
Figure: Huebner, JTCS, 2016, doi: 10.1016/j.jtcvs.2015.09.085

Statistical model
Multivariable linear regression
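\[
\begin{aligned}
\text{discharge sbp}_i ={}& \beta_0 + \beta_1\,\text{admission sbp}_i + \beta_2\,\text{age}_i + \beta_3\,\text{female}_i + \beta_4\,\text{hypertension}_i \\
&+ \beta_5\,\text{ischemichd}_i + \beta_6\,\text{LVEFlow}_i + \beta_7\,\text{LVEFmedium}_i + \beta_8\,\text{angiotensin}_i \\
&+ \beta_9\,\text{betablock}_i + \beta_{10}\,\text{ccantagon}_i + \beta_{11}\,\text{digoxin}_i + \beta_{12}\,\text{diuretic}_i + \beta_{13}\,\text{vasodilator}_i \\
&+ \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \qquad i = 1, \ldots, 7000.
\end{aligned}
\]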
Meaning: a linear multivariable regression model will be fitted to (i.e. forced on) the systolic blood
pressure data
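
To make "fitting" concrete, here is a minimal sketch of ordinary least squares in plain Python. It is not the analysis behind these slides: the data are freshly simulated, only two predictors are used, and the true coefficients, the residual SD, and the helper names solve and fit_ols are assumptions chosen for illustration.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_ols(X, y):
    """Least-squares estimates via the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    XtX = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    Xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return solve(XtX, Xty)

random.seed(2018)
true_beta = [82.0, 0.25, 5.0]  # assumed: intercept, admission sbp, hypertension
sigma = 16.0                   # assumed residual standard deviation

X, y = [], []
for _ in range(7000):
    admission_sbp = random.gauss(145, 25)
    hypertension = 1 if random.random() < 0.5 else 0
    row = [1.0, admission_sbp, hypertension]  # leading 1.0 is the intercept column
    X.append(row)
    y.append(sum(b * v for b, v in zip(true_beta, row)) + random.gauss(0, sigma))

beta_hat = fit_ols(X, y)

# Residuals and residual standard error (n minus number of coefficients in the denominator)
residuals = [yi - sum(b * v for b, v in zip(beta_hat, row)) for row, yi in zip(X, y)]
resid_se = (sum(e * e for e in residuals) / (len(y) - len(beta_hat))) ** 0.5

print([round(b, 3) for b in beta_hat], round(resid_se, 2))
```

With 7,000 simulated observations the estimates should land close to the assumed true values, and the residual standard error close to the assumed sigma.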

Some terminology
Data set
• discharge sbp is the outcome variable, also known as: dependent variable, target variable,
response variable, predicted variable,. . .
• admission sbp, . . ., vasodilator are the predictor variables, also known as: independent
variables, predictors, features, explanatory variables, input variables, risk factors,. . .
• Together, the 7,000 observations on the outcome and predictor variables make up the
development data set, also known as: derivation data, training data,. . .
Model
• β0, . . . , β13 are the regression coefficients, β0 is the intercept
• once the regression coefficients are estimated (i.e. values are calculated for them) from the
development data set we usually give them a "hat": β̂0, . . . , β̂13
• β̂0 + β̂1 admission sbpi + . . . + β̂13 vasodilatori is the linear predictor for individual i
• εi is the residual for individual i

Model output for SBP at discharge
                     β̂ (95% CI)
Intercept            82.340 (78.791, 85.889)
admission sbp        0.244 (0.227, 0.262)
age                  0.067 (0.031, 0.103)
female               1.158 (0.382, 1.935)
hypertension         5.395 (4.574, 6.217)
ischemichd           0.191 (−0.686, 1.068)
LVEFlow              −8.246 (−9.783, −6.708)
LVEFmedium           −0.130 (−1.093, 0.832)
angiotensin          −1.528 (−2.867, −0.188)
betablock            −0.055 (−0.832, 0.721)
ccantagon            2.786 (1.965, 3.607)
digoxin              −0.296 (−1.096, 0.505)
diuretic             −1.076 (−2.808, 0.655)
vasodilator          4.285 (2.496, 6.073)
Observations         7,000
R²                   0.237
Adjusted R²          0.235
Residual Std. Error  15.949 (df = 6986)
F Statistic          166.604 (df = 13; 6986)

Apparent prediction error

New patient admitted
               β̂        New patient’s data
Intercept      82.340
admission sbp  0.244    162
age            0.067    74
female         1.158    0
hypertension   5.395    1
ischemichd     0.191    1
LVEFlow        −8.246   0
LVEFmedium     −0.130   1
angiotensin    −1.528   0
betablock      −0.055   0
ccantagon      2.786    0
digoxin        −0.296   0
diuretic       −1.076   0
vasodilator    4.285    0
Prediction of discharge sbp at admission for new patient:
132.3 = 82.340 + 0.244 × 162 + 0.067 × 74 + 5.395 + 0.191 − 0.130
Prediction with a margin of error of ±30, is that right?
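
The prediction for the new patient can be reproduced by multiplying each estimated coefficient by the patient's value from the table and summing. The sketch below does this in Python. The dictionary keys with underscores and the function name linear_predictor are assumptions. The margin is only a rough approximation, taking 1.96 times the reported residual standard error and ignoring uncertainty in the estimated coefficients.

```python
# Estimated coefficients from the model output table
coefficients = {
    "intercept": 82.340, "admission_sbp": 0.244, "age": 0.067, "female": 1.158,
    "hypertension": 5.395, "ischemichd": 0.191, "LVEFlow": -8.246,
    "LVEFmedium": -0.130, "angiotensin": -1.528, "betablock": -0.055,
    "ccantagon": 2.786, "digoxin": -0.296, "diuretic": -1.076, "vasodilator": 4.285,
}

# New patient's data from the table
new_patient = {
    "admission_sbp": 162, "age": 74, "female": 0, "hypertension": 1,
    "ischemichd": 1, "LVEFlow": 0, "LVEFmedium": 1, "angiotensin": 0,
    "betablock": 0, "ccantagon": 0, "digoxin": 0, "diuretic": 0, "vasodilator": 0,
}

def linear_predictor(coefs, patient):
    """Intercept plus the sum of coefficient times predictor value."""
    return coefs["intercept"] + sum(coefs[name] * value for name, value in patient.items())

prediction = linear_predictor(coefficients, new_patient)

residual_se = 15.949          # residual standard error reported in the model output
margin = 1.96 * residual_se   # rough 95% margin; ignores coefficient uncertainty

print(round(prediction, 1), round(margin, 1))  # 132.3 31.3
```

The rough margin of about ±31 mm Hg is in line with the ±30 quoted on the slide, and it shows how wide the uncertainty around an individual prediction is even with 7,000 observations.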

3 Risk and probability

Let’s talk probability

Prediction is about probability
Prediction is usually about probability (risk) of something that is yet unknown

Diagnostic test
Numbers are made up and do not reflect the true accuracy of CRP
Target disease: pneumonia
Accuracy CRP: 95%
Sensitivity CRP: Pr(CRP+|pneumonia+) = 95%
Specificity CRP: Pr(CRP-|pneumonia-) = 95%
Probability that patient has pneumonia?

Bayesville
https://youtu.be/otdaJPVQIgg
Video shown with permission. By Harvard Prof. Joseph Blitzstein, part of the edX MOOC "Introduction to Probability".
Highly recommended: https://www.edx.org/course/introduction-to-probability-0

Bayesville
        Dis+   Dis-
Test+     19     99
Test-      1  1,881
Not relevant for prediction
Accuracy: (19+1,881)/(19+1,881+1+99) = 0.95 (95%)
Sensitivity: Pr(Test+|Disease+) = (19)/(19+1) = 0.95 (95%)
Specificity: Pr(Test-|Disease-) = (1,881)/(99+1,881) = 0.95 (95%)
Relevant for prediction
Probability of disease: (1+19)/(19+1,881+1+99) = 0.01 (1%)
Positive predictive value: Pr(Disease+|Test+) = (19)/(19+99) = 0.16 (16%)
Negative predictive value: Pr(Disease-|Test-) = (1,881)/(1+1,881) = 0.999 (99.9%)
Recommended further reading: Moons, Epidemiology, 1996, PMID: 9116087
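
The quantities for the Bayesville 2x2 table can be recomputed from its four cell counts. The sketch below is one way to do so in Python; the function name two_by_two_metrics and the argument names tp, fp, fn, tn (true/false positives and negatives) are assumptions introduced here. Specificity is taken as true negatives divided by all non-diseased individuals.

```python
def two_by_two_metrics(tp, fp, fn, tn):
    """Accuracy measures and predictive values from a 2x2 diagnostic table."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # Pr(Test+ | Disease+)
        "specificity": tn / (tn + fp),   # Pr(Test- | Disease-)
        "prevalence": (tp + fn) / total, # Pr(Disease+)
        "ppv": tp / (tp + fp),           # Pr(Disease+ | Test+)
        "npv": tn / (tn + fn),           # Pr(Disease- | Test-)
    }

# Cell counts from the Bayesville table
metrics = two_by_two_metrics(tp=19, fp=99, fn=1, tn=1881)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity condition on the (unknown) disease status, whereas the predictive values condition on the observed test result, which is what is available at the moment of prediction.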

Bayes’ theorem
Reverend Thomas Bayes (1701 - 1761)
Theorem: \[ \Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)} \]
Pr(A|B) and Pr(B|A) are mathematically related but they are surely not the same (most often
this theorem isn’t needed for computation)
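
As an illustration of the theorem, the sketch below computes Pr(D+|T+) for a test with 95% sensitivity and 95% specificity at several values of the pre-test probability of disease. The function name posterior_probability and the prevalences 0.10 and 0.50 are assumptions added for illustration; 0.01 is the Bayesville prevalence.

```python
def posterior_probability(sensitivity, specificity, prevalence):
    """Pr(D+ | T+) via Bayes' theorem.

    Numerator: Pr(T+ | D+) Pr(D+).
    Denominator: Pr(T+) = Pr(T+ | D+) Pr(D+) + Pr(T+ | D-) Pr(D-).
    """
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

results = {}
for prevalence in (0.01, 0.10, 0.50):
    results[prevalence] = posterior_probability(0.95, 0.95, prevalence)
    print(f"prevalence {prevalence:.2f} -> Pr(D+|T+) = {results[prevalence]:.3f}")
```

The same test gives very different post-test probabilities depending on the pre-test probability, which is why Pr(T+|D+) cannot be read as Pr(D+|T+).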

Diagnostic test as a risk prediction model for disease
• A diagnostic test can be viewed as an approach to "update" the probability of a disease:
Pr(D+) → Pr(D+|T)
• When the positive and negative predictive values (PPV/NPV) are known, the probability of
disease after testing can be calculated
• Using a diagnostic test with known PPV/NPV can be viewed as using a risk
prediction model with a single predictor

Conditional probability
• What is conditioned on (after the "|" sign) is important for the interpretation of a probability,
usually with notation Pr(outcome|·) (where Pr is sometimes simply P)
- Mixing up conditionals is quite common (sensitivity/specificity, p-values)
• All probabilities are conditional
- Some things are given without saying (e.g. probability is about human individuals),
others less so (e.g. prediction in primary vs secondary care)
- Things that are constant (e.g. setting) do not enter in notation
- There is no such thing as "the probability": context is everything
• Conditional probabilities are at the core of prediction modeling
- Perfect or near-perfect prediction models are suspect
- Proving that a probability model generates a wrong prediction is inherently difficult
- Prediction modeling is about finding the right variables (not too few and not too
many) to condition on to generate probability predictions in future individuals

4 Risk prediction modelling: rationale and context

Risk prediction in medicine
• Risk prediction research tends to investigate the relationship between a baseline health
profile and some (undiagnosed or future) health outcome
• Risk = probability

Risk model categories
Risk prediction models can be broadly categorized into:
• Diagnostic: estimate the risk of a target disease being currently present vs not present
- Given age, sex, loss in weight, difficulty swallowing, . . . , then what is the probability of
having undiagnosed lung cancer?
• Prognostic: estimate the risk of a certain disease or health state over a certain time period
- Given age, sex, BMI, cholesterol, . . . , then what is the probability of developing
cardiovascular disease over the next 10 years?

Why do we need risk prediction in medicine?
Model based risk estimates are used, among other reasons, to:
• Support and communicate about (preventive) treatment decisions
• Communicate with patients and their families about their risk to develop disease (lifestyle
changes, such as diet and exercise)
• Decide on further diagnostic testing for a certain disease (risk too high to rule out, but too
low to rule in)

Why use risk prediction?
"It is very difficult to predict - especially the future"
Niels Bohr
• Diseases have multiple causes / symptoms, presentations and courses
• It is difficult to make risk predictions with multiple factors playing a role. In a risk prediction
model these factors do not get equal weight
sources: Groopman, book: how doctors think, 1995, isbn: 9780547053646; Balogh, improving diagnosis in healthcare, 2015, doi: 10.17226/21794

Why use risk prediction?
• Support clinical knowledge and intuition
- Attempts at replacing clinicians so far generally unsuccessful
• Goals of risk prediction models (broadly):
- To generate accurate and valid predictions of risk
- Ultimately improve medical decision making and patient outcomes

Risk prediction vs causal inference
Broad categorization of some traditional differences between prediction and causal inference
                          Risk prediction (today, tomorrow)   Causal inference (days 1-3)
Terminology "X"           candidate predictor                 exposure/confounder/collider/...
Traditional focus         predictive performance              causal exposure-outcome effect
                          overfitting                         unmeasured confounding
Useful new setting?       transportability                    generalizability
Causal direction          predictor may be cause of outcome   important
Correlation vs causation  not important                       important
                          traditionally no DAGs               DAGs helpful
Missing data              important                           important
Measurement error         important                           important
Medical treatment         take into account                   take into account
Define baseline (T0)      important                           important
Output                    tool                                knowledge
Risk prediction: if it predicts, it predicts

When is a risk prediction model ready for use?
Risk model phases before implementation:
• Model derivation/development
• External validation: evaluating the performance of the model
• Model updating / recalibration: updating the model for different settings
• Model impact: evaluating whether the model changes clinician decision-making and improves
patient outcomes and cost-effectiveness
