Introduction to prediction modelling - Berlin 2018 - Part I
1. Advanced Epidemiologic Methods
causal research and prediction modelling
Prediction modelling topics 1-4
Maarten van Smeden
LUMC, Department of Clinical Epidemiology
20-24 August 2018
Maarten van Smeden (LUMC) Introduction to prediction modelling 20-24 August 2018
2. Outline
1 Introduction to prediction modelling
2 Example: predicting systolic blood pressure
3 Risk and probability
4 Risk prediction modelling: rationale and context
5 Risk prediction model building
6 Overfitting
7 External validation and updating
3. About
• Statistician by training
• PhD (2016): diagnostic research in the absence of a gold standard
• Post-doc department of Biostatistics (University Medical Center Utrecht)
• Currently: senior researcher department of Clinical Epidemiology (Leiden University Medical Center)
5. Types of prediction research
• Prevalence/incidence studies
- Occurrence of health outcomes within/across a geographical area or over time
- Average risk of having/experiencing a health outcome
6. Prevalence study
Beasley, The Lancet, 1998. doi: 10.1016/S0140-6736(97)07302-9
7. Incidence study
Adabag, JAMA, 2008. doi: 10.1001/jama.2008.553
8. Types of prediction research
• Prevalence/incidence studies
- Occurrence of health outcomes within/across a geographical area or over time
- Average risk of having/experiencing a health outcome
• Predictor finding studies
- Identifying factors associated with a health outcome
9. Predictor finding study
Letellier, BJC, 2017. doi: 10.1038/bjc.2017.352
10. Types of prediction research
• Prevalence/incidence studies
- Occurrence of health outcomes within/across a geographical area or over time
- Average risk of having/experiencing a health outcome
• Predictor finding studies
- Identifying factors associated with a health outcome
• Stratified medicine
- Identify biomarkers that predict response to a treatment
11. Stratified medicine
Bass, JCEM, 2010. doi: 10.1210/jc.2010-0947
12. Types of prediction research
• Prevalence/incidence studies
- Occurrence of health outcomes within/across a geographical area or over time
- Average risk of having/experiencing a health outcome
• Predictor finding studies
- Identifying factors associated with a health outcome
• Stratified medicine
- Identify biomarkers that predict response to a treatment
Topic of today and tomorrow
• Prediction models
- Modelling combinations of factors to predict a health outcome for individual patients
14. In the doctor’s office
The relevant questions to ask?
“What is wrong with this patient?”
“What happens to this patient without/after treatment X?”
15. In the doctor’s office
⇒ Diagnosis
⇒ Prognosis/therapy
17. In the doctor’s office
The patient
• 52-year-old man
• Endurance cyclist
• Swollen calf for 10 days
• “Calf feels hot”
• Previously documented DVT
• Elbow surgery 5 weeks ago
Deep venous thrombosis likely?
18. Clinical prediction example 1: Apgar
Apgar, JAMA, 1958. doi: 10.1001/jama.1958.03000150027007
19. Clinical prediction example 1: Apgar
Casey, NEJM, 2001, doi: 10.1056/NEJM200102153440701
20. Clinical prediction example 2:...
21. Clinical prediction example 2: Framingham risk score
10-year CVD risk
To online calculator
D’Agostino, Circulation, 2008. doi: 10.1161/CIRCULATIONAHA.107.699579
22. Clinical prediction example 3: SCORE
10-year fatal CVD risk
Conroy, European Heart Journal, 2003. doi: 10.1016/S0195-668X(03)00114-3
23. Clinical prediction example 4: Lymph node metastasis
27. Example: predicting systolic blood pressure (sbp) at discharge
• Patients hospitalized for heart failure
• Goal is to develop a model predicting systolic blood pressure at discharge
• This example is inspired by a paper by Austin and Steyerberg; data (N = 7,000) are simulated
Austin, J Clin Epi, 2015, doi: 10.1016/j.jclinepi.2014.12.014
28. Data dictionary
label           explanation
admission sbp   systolic blood pressure at admission (in mm Hg)
age             age at hospitalization (in years)
female          gender
hypertension    presence of hypertension
ischemichd      ischemic heart failure
LVEFlow         left ventricular ejection fraction < 20%
LVEFmedium      left ventricular ejection fraction > 20%, < 40%
angiotensin¹    angiotensin converting enzyme inhibitors
betablock¹      beta-blockers
ccantagon¹      calcium channel antagonists
digoxin¹        digoxin
diuretic¹       diuretic
vasodilator¹    vasodilator
discharge sbp   systolic blood pressure at discharge (in mm Hg)
Coding: 0 = no, 1 = yes
¹ during hospital stay
29. A note on data collection
Rubbish in = Rubbish out
A descriptive analysis tells only part of the data’s story
Wynants, BJOG, 2017, doi:10.1111/1471-0528.14170
30. Examples of rubbish data
Outcome measurements
• Irrelevant time horizons (e.g. too long or short follow-up times)
• Broad composite outcomes
• Outcomes measured with large error/misclassifications
Predictor variables
• That are too expensive for use in practice
• That are unduly invasive
• That are unavailable at the point where prediction is needed (follow-up data)
32. Descriptive analyses
Automated data summary using the R library summarytools (version 0.8.7) with the command view(dfSummary(Data))
35. Initial data analysis
Is:
• cleaning: finding/resolving inconsistencies
• screening: description of data properties
• documentation of steps
• preparation
Isn’t:
• for selection of predictors
• for selection of subgroups
• for developing prediction models
• always fun
Figure: Huebner, JTCS, 2016, doi: 10.1016/j.jtcvs.2015.09.085
36. Statistical model
Multivariable linear regression
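\[
\begin{aligned}
\text{discharge sbp}_i = {} & \beta_0 + \beta_1 \text{admission sbp}_i + \beta_2 \text{age}_i + \beta_3 \text{female}_i + \beta_4 \text{hypertension}_i \\
& + \beta_5 \text{ischemichd}_i + \beta_6 \text{LVEFlow}_i + \beta_7 \text{LVEFmedium}_i + \beta_8 \text{angiotensin}_i \\
& + \beta_9 \text{betablock}_i + \beta_{10} \text{ccantagon}_i + \beta_{11} \text{digoxin}_i + \beta_{12} \text{diuretic}_i + \beta_{13} \text{vasodilator}_i \\
& + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \qquad i = 1, \ldots, 7000.
\end{aligned}
\]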
Meaning: a multivariable linear regression model will be fitted to (i.e. forced on) the systolic blood pressure data
37. Some terminology
Data set
• discharge sbp is the outcome variable, also known as: dependent variable, target variable, response variable, predicted variable, ...
• admission sbp, ..., vasodilator are the predictor variables, also known as: independent variables, predictors, features, explanatory variables, input variables, risk factors, ...
• Together, the 7,000 observations on the outcome and predictor variables make up the development data set, also known as: derivation data, training data, ...
Model
• $\beta_0, \ldots, \beta_{13}$ are the regression coefficients; $\beta_0$ is the intercept
• once the regression coefficients are estimated from the development data set (i.e. a value is calculated for each of them), we usually give them a “hat”: $\hat{\beta}_0, \ldots, \hat{\beta}_{13}$
• $\hat{\beta}_0 + \hat{\beta}_1 \text{admission sbp}_i + \ldots + \hat{\beta}_{13} \text{vasodilator}_i$ is the linear predictor for individual i
• $\varepsilon_i$ is the residual for individual i
40. New patient admitted
Predictor       Estimate (β̂)   New patient’s data
Intercept        82.340
admission sbp     0.244         162
age               0.067          74
female            1.158           0
hypertension      5.395           1
ischemichd        0.191           1
LVEFlow          −8.246           0
LVEFmedium       −0.130           1
angiotensin      −1.528           0
betablock        −0.055           0
ccantagon         2.786           0
digoxin          −0.296           0
diuretic         −1.076           0
vasodilator       4.285           0
Prediction of discharge sbp at admission for new patient:
132.3 = 82.340 + 0.244 × 162 + 0.067 × 74 + 5.395 + 0.191 − 0.130
Prediction with a margin of error of ±30, is that right?
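The prediction for the new patient can be reproduced with a few lines of Python. This is a minimal sketch: the coefficient estimates and patient values are copied from the slide, while the dictionary keys with underscores and the function name predict_discharge_sbp are hypothetical names chosen for illustration.

```python
# Estimated regression coefficients, copied from the slide.
# The key names (with underscores) are assumptions for this sketch.
coefficients = {
    "admission_sbp": 0.244,
    "age": 0.067,
    "female": 1.158,
    "hypertension": 5.395,
    "ischemichd": 0.191,
    "LVEFlow": -8.246,
    "LVEFmedium": -0.130,
    "angiotensin": -1.528,
    "betablock": -0.055,
    "ccantagon": 2.786,
    "digoxin": -0.296,
    "diuretic": -1.076,
    "vasodilator": 4.285,
}
intercept = 82.340

# The new patient's predictor values, copied from the slide.
new_patient = {
    "admission_sbp": 162,
    "age": 74,
    "female": 0,
    "hypertension": 1,
    "ischemichd": 1,
    "LVEFlow": 0,
    "LVEFmedium": 1,
    "angiotensin": 0,
    "betablock": 0,
    "ccantagon": 0,
    "digoxin": 0,
    "diuretic": 0,
    "vasodilator": 0,
}


def predict_discharge_sbp(intercept, coefficients, patient):
    """Linear predictor: intercept plus the sum of coefficient times value."""
    return intercept + sum(coefficients[name] * patient[name] for name in coefficients)


prediction = predict_discharge_sbp(intercept, coefficients, new_patient)
print(f"Predicted discharge sbp: {prediction:.1f} mm Hg")
```

Predictors coded 0 drop out of the sum, which is why the hand calculation on the slide only lists the intercept and five terms.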
43. Prediction is about probability
Prediction is usually about probability (risk) of something that is yet unknown
46. Diagnostic test
Numbers are made up; do not reflect true accuracy of CRP
Target disease: pneumonia
Accuracy CRP: 95%
Sensitivity CRP: Pr(CRP+|pneumonia+) = 95%
Specificity CRP: Pr(CRP-|pneumonia-) = 95%
Probability that patient has pneumonia?
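The question cannot be answered from sensitivity and specificity alone: the probability of disease before testing is also needed. The sketch below illustrates this for a positive CRP result, using the made-up 95% sensitivity and specificity from the slide. The three prevalence values and the function name post_test_probability are assumptions chosen for illustration only.

```python
def post_test_probability(sensitivity, specificity, prevalence):
    """Probability of disease given a positive test result."""
    # Share of all patients who are diseased and test positive.
    true_positive = sensitivity * prevalence
    # Share of all patients who are not diseased but still test positive.
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)


# Hypothetical pre-test probabilities of pneumonia (not from the slides).
for prevalence in (0.01, 0.10, 0.50):
    probability = post_test_probability(0.95, 0.95, prevalence)
    print(f"prevalence {prevalence:.2f} -> Pr(pneumonia+|CRP+) = {probability:.2f}")
```

With the same test, the probability of pneumonia after a positive result ranges from roughly 16% to 95% across these assumed prevalences.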
47. Bayesville
https://youtu.be/otdaJPVQIgg
Video shown with permission. By Harvard Prof. Joseph Blitzstein, part of the edX MOOC “Introduction to Probability”.
Highly recommended: https://www.edx.org/course/introduction-to-probability-0
49. Bayesville
         Dis+    Dis-
Test+      19      99
Test-       1   1,881
Not relevant for prediction
Accuracy: (19+1,881)/(19+1,881+1+99) = 0.95 (95%)
Sensitivity: Pr(Test+|Disease+) = (19)/(19+1) = 0.95 (95%)
Specificity: Pr(Test-|Disease-) = (1,881)/(99+1,881) = 0.95 (95%)
Relevant for prediction
Probability of disease: (1+19)/(19+1,881+1+99) = 0.01 (1%)
Positive predictive value: Pr(Disease+|Test+) = (19)/(19+99) = 0.16 (16%)
Negative predictive value: Pr(Disease-|Test-) = (1,881)/(1+1,881) = 0.999 (99.9%)
Recommended further reading: Moons, Epidemiology, 1996, PMID: 9116087
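All six quantities follow directly from the four cell counts of the Bayesville table. A minimal Python sketch, with variable names (tp, fp, fn, tn) that are assumptions of this example rather than notation from the slides:

```python
# Cell counts of the Bayesville 2x2 table.
tp = 19    # Test+, Dis+
fp = 99    # Test+, Dis-
fn = 1     # Test-, Dis+
tn = 1881  # Test-, Dis-
total = tp + fp + fn + tn

# Conditioned on disease status: not directly relevant for prediction.
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # Pr(Test+ | Disease+)
specificity = tn / (tn + fp)   # Pr(Test- | Disease-)

# Conditioned on the test result: relevant for prediction.
prevalence = (tp + fn) / total  # probability of disease before testing
ppv = tp / (tp + fp)            # Pr(Disease+ | Test+)
npv = tn / (tn + fn)            # Pr(Disease- | Test-)

for name, value in [
    ("accuracy", accuracy),
    ("sensitivity", sensitivity),
    ("specificity", specificity),
    ("prevalence", prevalence),
    ("PPV", ppv),
    ("NPV", npv),
]:
    print(f"{name}: {value:.3f}")
```

The only difference between sensitivity and the positive predictive value is the denominator: the column total (all diseased) versus the row total (all test positives).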
50. Bayes’ theorem
Reverend Thomas Bayes (1701 - 1761)
Theorem:
\[
\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}
\]
Pr(A|B) and Pr(B|A) are mathematically related but they are surely not the same (most often
this theorem isn’t needed for computation)
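As a worked illustration, the theorem can be applied to the Bayesville numbers (sensitivity 0.95, specificity 0.95, probability of disease 0.01), with Pr(B) expanded by the law of total probability. The symbols D and T for disease and test status are shorthand assumed here:

```latex
\[
\Pr(D{+} \mid T{+})
  = \frac{\Pr(T{+} \mid D{+})\,\Pr(D{+})}
         {\Pr(T{+} \mid D{+})\,\Pr(D{+}) + \Pr(T{+} \mid D{-})\,\Pr(D{-})}
  = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99}
  = \frac{0.0095}{0.0590}
  \approx 0.16
\]
```

This matches the positive predictive value of 19/(19+99) obtained by counting in the table.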
51. Diagnostic test as a risk prediction model for disease
• A diagnostic test can be viewed as an approach to “update” the probability of a disease: Pr(D+) → Pr(D+|T)
• When the positive and negative predictive values (PPV/NPV) are known, the probability of disease after testing can be calculated
• Using a diagnostic test with known PPV/NPV can be viewed as using a risk prediction model with a single predictor
52. Conditional probability
• What is conditioned on (behind the “|” sign) is important for the interpretation of a probability, usually with notation Pr(outcome|·) (where Pr is sometimes simply P)
- Mixing up conditionals is quite common (sensitivity/specificity, p-values)
Some analogies:
• Pr(death|shot by handgun) vs Pr(shot by handgun|death)
• Pr(death|bitten by shark) vs Pr(bitten by shark|death)
• Pr(female|currently pregnant) vs Pr(currently pregnant|female)
• Pr(female|breast cancer) vs Pr(breast cancer|female)
• Pr(being pope|catholic) vs Pr(catholic|being pope)
• Pr(being US president|US citizen) vs Pr(US citizen|being US president)
• Pr(having sex|STD) vs Pr(STD|having sex)
• Pr(wet street|rain) vs Pr(rain|wet street)
source: https://twitter.com/MaartenvSmeden/status/1028630739162726400
53. Conditional probability
• What is conditioned on (behind the “|” sign) is important for the interpretation of a probability, usually with notation Pr(outcome|·) (where Pr is sometimes simply P)
- Mixing up conditionals is quite common (sensitivity/specificity, p-values)
• All probabilities are conditional
- Some things go without saying (e.g. probability is about human individuals), others less so (e.g. prediction in primary vs secondary care)
- Things that are constant (e.g. setting) do not enter in notation
- There is no such thing as “the probability”: context is everything
• Conditional probabilities are at the core of prediction modeling
- Perfect or near-perfect prediction models are suspect
- Proving that a probability model generates a wrong prediction is inherently difficult
- Prediction modeling is about finding the right variables (not too few and not too many) to condition on to generate probability predictions in future individuals
55. Risk prediction in medicine
• Risk prediction research tends to investigate the relationship between a baseline health
profile and some (undiagnosed or future) health outcome
• Risk = probability
56. Risk model categories
Risk prediction models can be broadly categorized into:
• Diagnostic: estimate the risk of a target disease being currently present vs not present
- Given age, sex, loss in weight, difficulty swallowing, ..., what is the probability of having undiagnosed lung cancer?
• Prognostic: estimate the risk of developing a certain disease or health state over a certain time period
- Given age, sex, BMI, cholesterol, ..., what is the probability of developing cardiovascular disease over the next 10 years?
57. Why do we need risk prediction in medicine?
Model based risk estimates are used, among other reasons, to:
• Support and communicate about (preventive) treatment decisions
• Communicate with patients and their families about their risk of developing disease (lifestyle changes, such as diet and exercise)
• Decide on further diagnostic testing for a certain disease (risk too high to rule out, but too
low to rule in)
58. Why use risk prediction?
"It is very difficult to predict - especially the future"
Niels Bohr
• Diseases have multiple causes, symptoms, presentations and courses
• It is difficult to make risk predictions with multiple factors playing a role. In a risk prediction
model these factors do not get equal weight
sources: Groopman, book: how doctors think, 1995, isbn: 9780547053646; Balogh, improving diagnosis in healthcare, 2015, doi: 10.17226/21794
59. Why use risk prediction?
• Support clinical knowledge and intuition
- Attempts at replacing clinicians have so far been generally unsuccessful
• Goals of risk prediction models (broadly):
- To generate accurate and valid predictions of risk
- Ultimately improve medical decision making and patient outcomes
60. Risk prediction vs causal inference
Broad categorization of some traditional differences between prediction and causal inference
                           Risk prediction (today, tomorrow)   Causal inference (days 1-3)
Terminology “X”            candidate predictor                 exposure/confounder/collider/...
Traditional focus          predictive performance              causal exposure-outcome effect
                           overfitting                         unmeasured confounding
Useful new setting?        transportability                    generalizability
Causal direction           predictor may be cause of outcome   important
Correlation vs causation   not important                       important
                           traditionally no DAGs               DAGs helpful
Missing data               important                           important
Measurement error          important                           important
Medical treatment          take into account                   take into account
Define baseline (T0)       important                           important
Output                     tool                                knowledge
Risk prediction: if it predicts, it predicts
61. When is a risk prediction model ready for use?
Risk model phases before implementation:
• Model derivation/development
• External validation: evaluating the performance of the model
• Model updating / recalibration: updating the model for different settings
• Model impact: evaluating whether the model changes clinician decision-making and improves patient outcomes and cost-effectiveness