
Introduction to prediction modelling - Berlin 2018 - Part I

Lecture slides, subtopic: prediction modelling (part 1 of 2)

Published in: Science

  1. Advanced Epidemiologic Methods: causal research and prediction modelling
     Prediction modelling, topics 1-4
     Maarten van Smeden, LUMC, Department of Clinical Epidemiology
     20-24 August 2018
     Maarten van Smeden (LUMC) Introduction to prediction modelling 20-24 August 2018
  2. Outline
     1 Introduction to prediction modelling
     2 Example: predicting systolic blood pressure
     3 Risk and probability
     4 Risk prediction modelling: rationale and context
     5 Risk prediction model building
     6 Overfitting
     7 External validation and updating
  3. About
     • Statistician by training
     • PhD (2016): diagnostic research in the absence of a gold standard
     • Post-doc, department of Biostatistics (University Medical Center Utrecht)
     • Currently: senior researcher, department of Clinical Epidemiology (Leiden University Medical Center)
  4. Outline (as on slide 2). Section 1: Introduction to prediction modelling
  5. Types of prediction research
     • Prevalence/incidence studies
       - Occurrence of health outcomes within/across a geographical area or over time
       - Average risk of having/experiencing a health outcome
  6. Prevalence study
     Beasley, The Lancet, 1998. doi: 10.1016/S0140-6736(97)07302-9
  7. Incidence study
     Adabag, JAMA, 2008. doi: 10.1001/jama.2008.553
  8. Types of prediction research
     • Prevalence/incidence studies
       - Occurrence of health outcomes within/across a geographical area or over time
       - Average risk of having/experiencing a health outcome
     • Predictor finding studies
       - Identifying factors associated with a health outcome
  9. Predictor finding study
     Letellier, BJC, 2017. doi: 10.1038/bjc.2017.352
  10. Types of prediction research
      • Prevalence/incidence studies
        - Occurrence of health outcomes within/across a geographical area or over time
        - Average risk of having/experiencing a health outcome
      • Predictor finding studies
        - Identifying factors associated with a health outcome
      • Stratified medicine
        - Identifying biomarkers that predict response to a treatment
  11. Stratified medicine
      Bass, JCEM, 2010. doi: 10.1210/jc.2010-0947
  12. Types of prediction research
      • Prevalence/incidence studies
        - Occurrence of health outcomes within/across a geographical area or over time
        - Average risk of having/experiencing a health outcome
      • Predictor finding studies
        - Identifying factors associated with a health outcome
      • Stratified medicine
        - Identifying biomarkers that predict response to a treatment
      Topic of today and tomorrow
      • Prediction models
        - Modelling combinations of factors to predict a health outcome for individual patients
  13. [slide without text]
  14. In the doctor’s office
      The relevant questions to ask?
      "What is wrong with this patient?"
      "What happens to this patient without/after treatment X?"
  15. In the doctor’s office
      ⇒ Diagnosis
      ⇒ Prognosis/therapy
  16. In the doctor’s office
      The patient
      • 52-year-old man
      • Endurance cyclist
      • Swollen calf for 10 days
      • "Calf feels hot"
      • Previously documented DVT
      • Elbow surgery 5 weeks ago
  17. In the doctor’s office
      The patient
      • 52-year-old man
      • Endurance cyclist
      • Swollen calf for 10 days
      • "Calf feels hot"
      • Previously documented DVT
      • Elbow surgery 5 weeks ago
      Deep venous thrombosis likely?
  18. Clinical prediction example 1: Apgar
      Apgar, JAMA, 1958. doi: 10.1001/jama.1958.03000150027007
  19. Clinical prediction example 1: Apgar
      Casey, NEJM, 2001. doi: 10.1056/NEJM200102153440701
  20. Clinical prediction example 2: ...
  21. Clinical prediction example 2: Framingham risk score
      10-year CVD risk
      To online calculator
      D’Agostino, Circulation, 2008. doi: 10.1161/CIRCULATIONAHA.107.699579
  22. Clinical prediction example 3: SCORE
      10-year fatal CVD risk
      Conroy, European Heart Journal, 2003. doi: 10.1016/S0195-668X(03)00114-3
  23. Clinical prediction example 4: Lymph node metastasis
  24. Clinical prediction example 4: Lymph node metastasis
  25. Clinical prediction example 4: Lymph node metastasis
  26. Outline (as on slide 2). Section 2: Example: predicting systolic blood pressure
  27. Example: predicting systolic blood pressure (sbp) at discharge
      • Patients hospitalized for heart failure
      • Goal is to develop a model predicting systolic blood pressure at discharge
      • This example is inspired by a paper by Austin and Steyerberg; the data (N = 7,000) are simulated
      Austin, J Clin Epi, 2015. doi: 10.1016/j.jclinepi.2014.12.014
  28. Data dictionary
      label            explanation
      admission sbp    systolic blood pressure at admission (in mm Hg)
      age              age at hospitalization (in years)
      female           gender
      hypertension     presence of hypertension
      ischemichd       ischemic heart failure
      LVEFlow          left ventricular ejection fraction < 20%
      LVEFmedium       left ventricular ejection fraction > 20%, < 40%
      angiotensin [1]  angiotensin converting enzyme inhibitors
      betablock [1]    beta-blockers
      ccantagon [1]    calcium channel antagonists
      digoxin [1]      digoxin
      diuretic [1]     diuretic
      vasodilator [1]  vasodilator
      discharge sbp    systolic blood pressure at discharge (in mm Hg)
      Coding of binary variables: 0 = no, 1 = yes
      [1] during hospital stay
  29. A note on data collection
      Rubbish in = rubbish out
      A descriptive analysis tells only part of the data’s story
      Wynants, BJOG, 2017. doi: 10.1111/1471-0528.14170
  30. Examples of rubbish data
      Outcome measurements
      • Irrelevant time horizons (e.g. follow-up times that are too long or too short)
      • Broad composite outcomes
      • Outcomes measured with large error/misclassification
      Predictor variables
      • That are too expensive for use in practice
      • That are unduly invasive
      • That are unavailable at the point where the prediction is needed (follow-up data)
  31. Descriptive analyses
  32. Descriptive analyses
      Automated data summary using the R library summarytools (version 0.8.7) with the command view(dfSummary(Data))
  33. Descriptive analyses
  34. Descriptive analyses
  35. Initial data analysis
      Is:
      • cleaning: finding/resolving inconsistencies
      • screening: description of data properties
      • documentation of steps
      • preparation
      Isn’t:
      • for selection of predictors
      • for selection of subgroups
      • for developing prediction models
      • always fun
      Figure: Huebner, JTCS, 2016. doi: 10.1016/j.jtcvs.2015.09.085
  36. Statistical model
      Multivariable linear regression
      $$
      \begin{aligned}
      \text{discharge sbp}_i = {} & \beta_0 + \beta_1\,\text{admission sbp}_i + \beta_2\,\text{age}_i + \beta_3\,\text{female}_i + \beta_4\,\text{hypertension}_i \\
      & + \beta_5\,\text{ischemichd}_i + \beta_6\,\text{LVEFlow}_i + \beta_7\,\text{LVEFmedium}_i + \beta_8\,\text{angiotensin}_i \\
      & + \beta_9\,\text{betablock}_i + \beta_{10}\,\text{ccantagon}_i + \beta_{11}\,\text{digoxin}_i + \beta_{12}\,\text{diuretic}_i \\
      & + \beta_{13}\,\text{vasodilator}_i + \varepsilon_i, \\
      & \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \dots, 7000.
      \end{aligned}
      $$
      Meaning: a linear multivariable regression model will be fitted to (i.e. forced on) the systolic blood pressure data
  37. Some terminology
      Data set
      • discharge sbp is the outcome variable, also known as: dependent variable, target variable, response variable, predicted variable, ...
      • admission sbp, ..., vasodilator are the predictor variables, also known as: independent variables, predictors, features, explanatory variables, input variables, risk factors, ...
      • Together, the 7,000 observations on the outcome and predictor variables make up the development data set, also known as: derivation data, training data, ...
      Model
      • $\beta_0, \dots, \beta_{13}$ are the regression coefficients; $\beta_0$ is the intercept
      • once the regression coefficients are estimated (i.e. a value has been calculated for them) from the development data set, we usually give them a "hat": $\hat\beta_0, \dots, \hat\beta_{13}$
      • $\hat\beta_0 + \hat\beta_1\,\text{admission sbp}_i + \dots + \hat\beta_{13}\,\text{vasodilator}_i$ is the linear predictor for individual $i$
      • $\varepsilon_i$ is the residual for individual $i$
  38. Model output for SBP at discharge
                           β̂ (95% CI)
      Intercept            82.340 (78.791, 85.889)
      admission sbp         0.244 (0.227, 0.262)
      age                   0.067 (0.031, 0.103)
      female                1.158 (0.382, 1.935)
      hypertension          5.395 (4.574, 6.217)
      ischemichd            0.191 (−0.686, 1.068)
      LVEFlow              −8.246 (−9.783, −6.708)
      LVEFmedium           −0.130 (−1.093, 0.832)
      angiotensin          −1.528 (−2.867, −0.188)
      betablock            −0.055 (−0.832, 0.721)
      ccantagon             2.786 (1.965, 3.607)
      digoxin              −0.296 (−1.096, 0.505)
      diuretic             −1.076 (−2.808, 0.655)
      vasodilator           4.285 (2.496, 6.073)
      Observations         7,000
      R²                   0.237
      Adjusted R²          0.235
      Residual Std. Error  15.949 (df = 6986)
      F Statistic          166.604 (df = 13; 6986)
  39. Apparent prediction error
  40. New patient admitted
                      β̂        New patient’s data
      Intercept       82.340
      admission sbp    0.244   162
      age              0.067   74
      female           1.158   0
      hypertension     5.395   1
      ischemichd       0.191   1
      LVEFlow         −8.246   0
      LVEFmedium      −0.130   1
      angiotensin     −1.528   0
      betablock       −0.055   0
      ccantagon        2.786   0
      digoxin         −0.296   0
      diuretic        −1.076   0
      vasodilator      4.285   0
      Prediction of discharge sbp at admission for the new patient:
      132.3 = 82.340 + 0.244 × 162 + 0.067 × 74 + 5.395 + 0.191 − 0.130
      Prediction with a margin of error of ±30, is that right?
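The prediction for the new patient is the linear predictor evaluated at that patient's values. The following is a minimal Python sketch, assuming the rounded coefficients from the model output slide and the patient values listed in the table (admission sbp 162, age 74). The dictionary keys and the `linear_predictor` helper are illustrative names, not part of the lecture. The margin is approximated as about 1.96 residual standard errors, which is an assumption that ignores the uncertainty in the estimated coefficients.

```python
# Rounded coefficient estimates from the model output slide.
COEFFICIENTS = {
    "intercept": 82.340,
    "admission_sbp": 0.244,
    "age": 0.067,
    "female": 1.158,
    "hypertension": 5.395,
    "ischemichd": 0.191,
    "LVEFlow": -8.246,
    "LVEFmedium": -0.130,
    "angiotensin": -1.528,
    "betablock": -0.055,
    "ccantagon": 2.786,
    "digoxin": -0.296,
    "diuretic": -1.076,
    "vasodilator": 4.285,
}

# New patient's data as listed in the table on the slide.
NEW_PATIENT = {
    "admission_sbp": 162, "age": 74, "female": 0, "hypertension": 1,
    "ischemichd": 1, "LVEFlow": 0, "LVEFmedium": 1, "angiotensin": 0,
    "betablock": 0, "ccantagon": 0, "digoxin": 0, "diuretic": 0,
    "vasodilator": 0,
}

RESIDUAL_SE = 15.949  # residual standard error from the model output


def linear_predictor(coefs, patient):
    """Intercept plus the sum of coefficient times predictor value."""
    total = coefs["intercept"]
    for name, value in patient.items():
        total += coefs[name] * value
    return total


predicted_sbp = linear_predictor(COEFFICIENTS, NEW_PATIENT)
# Rough margin only: ignores uncertainty in the estimated coefficients.
margin = 1.96 * RESIDUAL_SE

print(f"Predicted discharge sbp: {predicted_sbp:.1f} mm Hg")
print(f"Rough 95% range: {predicted_sbp - margin:.1f} to {predicted_sbp + margin:.1f} mm Hg")
```

With a residual standard error of about 16 mm Hg, a margin of roughly ±30 mm Hg around the point prediction is at least plausible as a first approximation.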
  41. Outline (as on slide 2). Section 3: Risk and probability
  42. Let’s talk probability
  43. Prediction is about probability
      Prediction is usually about the probability (risk) of something that is as yet unknown
  44. Prediction is about probability
      Prediction is usually about the probability (risk) of something that is as yet unknown
  45. Prediction is about probability
      Prediction is usually about the probability (risk) of something that is as yet unknown
  46. Diagnostic test
      Numbers are made up; they do not reflect the true accuracy of CRP
      Target disease: pneumonia
      Accuracy CRP: 95%
      Sensitivity CRP: Pr(CRP+|pneumonia+) = 95%
      Specificity CRP: Pr(CRP-|pneumonia-) = 95%
      Probability that the patient has pneumonia?
  47. Bayesville
      https://youtu.be/otdaJPVQIgg
      Video shown with permission. By Harvard Prof Joseph Blitzstein, part of the edX MOOC "Introduction to Probability". Highly recommended: https://www.edx.org/course/introduction-to-probability-0
  48. Bayesville
               Dis+   Dis-
      Test+    19     99
      Test-    1      1,881
      Accuracy: (19+1,881)/(19+1,881+1+99) = 0.95 (95%)
      Sensitivity: Pr(Test+|Disease+) = (19)/(19+1) = 0.95 (95%)
      Specificity: Pr(Test-|Disease-) = (1,881)/(99+1,881) = 0.95 (95%)
      Probability of disease: (1+19)/(19+1,881+1+99) = 0.01 (1%)
      Positive predictive value: Pr(Disease+|Test+) = (19)/(19+99) = 0.16 (16%)
      Negative predictive value: Pr(Disease-|Test-) = (1,881)/(1+1,881) = 0.999 (99.9%)
  49. Bayesville
               Dis+   Dis-
      Test+    19     99
      Test-    1      1,881
      Not relevant for prediction
      Accuracy: (19+1,881)/(19+1,881+1+99) = 0.95 (95%)
      Sensitivity: Pr(Test+|Disease+) = (19)/(19+1) = 0.95 (95%)
      Specificity: Pr(Test-|Disease-) = (1,881)/(99+1,881) = 0.95 (95%)
      Relevant for prediction
      Probability of disease: (1+19)/(19+1,881+1+99) = 0.01 (1%)
      Positive predictive value: Pr(Disease+|Test+) = (19)/(19+99) = 0.16 (16%)
      Negative predictive value: Pr(Disease-|Test-) = (1,881)/(1+1,881) = 0.999 (99.9%)
      Recommended further reading: Moons, Epidemiology, 1996, PMID: 9116087
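The Bayesville measures can be recomputed directly from the four cell counts of the 2×2 table. The sketch below uses the counts from the slide; the variable names are illustrative. Note that the specificity has true negatives (1,881) in the numerator.

```python
# Cell counts from the Bayesville 2x2 table.
TP, FP = 19, 99      # test positive: diseased, not diseased
FN, TN = 1, 1881     # test negative: diseased, not diseased

total = TP + FP + FN + TN

# Measures that condition on true disease status
# (labelled "not relevant for prediction" on the slide).
accuracy = (TP + TN) / total
sensitivity = TP / (TP + FN)   # Pr(Test+ | Disease+)
specificity = TN / (TN + FP)   # Pr(Test- | Disease-)

# Measures that condition on what is known at the time of prediction.
prevalence = (TP + FN) / total  # Pr(Disease+) before testing
ppv = TP / (TP + FP)            # Pr(Disease+ | Test+)
npv = TN / (TN + FN)            # Pr(Disease- | Test-)

for name, value in [
    ("accuracy", accuracy), ("sensitivity", sensitivity),
    ("specificity", specificity), ("prevalence", prevalence),
    ("PPV", ppv), ("NPV", npv),
]:
    print(f"{name:12s} {value:.4f}")
```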
  50. Bayes’ theorem
      Reverend Thomas Bayes (1701 - 1761)
      Theorem:
      $$
      \Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}
      $$
      Pr(A|B) and Pr(B|A) are mathematically related, but they are surely not the same
      (most often this theorem isn’t needed for computation)
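As an illustration of how Pr(A|B) and Pr(B|A) can differ, the sketch below applies Bayes' theorem to a test with 95% sensitivity and 95% specificity (the made-up figures from the diagnostic test slide). The prevalence values other than 1% are assumptions chosen only to show how strongly the positive predictive value depends on the prior probability of disease; the function names are illustrative.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' theorem: Pr(A|B) = Pr(B|A) * Pr(A) / Pr(B)."""
    return p_b_given_a * p_a / p_b


SENSITIVITY = 0.95  # Pr(Test+ | Disease+), made-up value from the slides
SPECIFICITY = 0.95  # Pr(Test- | Disease-), made-up value from the slides


def positive_predictive_value(prevalence):
    """Pr(Disease+ | Test+) for a given prior probability of disease."""
    # Law of total probability: Pr(Test+) over diseased and non-diseased.
    p_test_pos = (SENSITIVITY * prevalence
                  + (1 - SPECIFICITY) * (1 - prevalence))
    return bayes(SENSITIVITY, prevalence, p_test_pos)


# 1% is the Bayesville prevalence; 10% and 50% are illustrative assumptions.
for prevalence in (0.01, 0.10, 0.50):
    ppv = positive_predictive_value(prevalence)
    print(f"prevalence {prevalence:.2f} -> Pr(Disease+ | Test+) = {ppv:.3f}")
```

The sensitivity Pr(Test+|Disease+) stays at 0.95 throughout, while Pr(Disease+|Test+) moves with the prevalence.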
  51. Diagnostic test as a risk prediction model for disease
      • A diagnostic test can be viewed as an approach to "update" the probability of a disease: Pr(D+) → Pr(D+|T)
      • When the positive and negative predictive values (PPV/NPV) are known, the probability of disease after testing can be calculated
      • Using a diagnostic test with known PPV/NPV can be viewed as using a risk prediction model with a single predictor
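Read this way, a test with known PPV and NPV is a risk model with one binary predictor: it returns PPV after a positive result and 1 − NPV after a negative one. A minimal sketch, assuming the Bayesville counts as the source of the PPV, NPV and prior probability; `predicted_risk` is a hypothetical helper name.

```python
# Values derived from the Bayesville 2x2 table (19, 99, 1, 1,881).
PRIOR = (19 + 1) / 2000   # Pr(D+) before testing
PPV = 19 / (19 + 99)      # Pr(D+ | T+)
NPV = 1881 / (1 + 1881)   # Pr(D- | T-)


def predicted_risk(test_positive):
    """Single-predictor risk model: probability of disease given the test result."""
    if test_positive:
        return PPV
    return 1.0 - NPV  # Pr(D+ | T-)


print(f"Before testing:       Pr(D+)      = {PRIOR:.4f}")
print(f"After positive test:  Pr(D+ | T+) = {predicted_risk(True):.4f}")
print(f"After negative test:  Pr(D+ | T-) = {predicted_risk(False):.4f}")
```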
  52. Conditional probability
      • What is conditioned on (behind the "|" sign) is important for the interpretation of a probability, usually with notation Pr(outcome|·) (where Pr is sometimes simply P)
        - Mixing up conditionals is quite common (sensitivity/specificity, p-values)
      Some analogies:
      • Pr(death|shot by handgun) vs Pr(shot by handgun|death)
      • Pr(death|bitten by shark) vs Pr(bitten by shark|death)
      • Pr(female|currently pregnant) vs Pr(currently pregnant|female)
      • Pr(female|breast cancer) vs Pr(breast cancer|female)
      • Pr(being pope|catholic) vs Pr(catholic|being pope)
      • Pr(being US president|US citizen) vs Pr(US citizen|being US president)
      • Pr(having sex|STD) vs Pr(STD|having sex)
      • Pr(wet street|rain) vs Pr(rain|wet street)
      source: https://twitter.com/MaartenvSmeden/status/1028630739162726400
  53. Conditional probability
      • What is conditioned on (behind the "|" sign) is important for the interpretation of a probability, usually with notation Pr(outcome|·) (where Pr is sometimes simply P)
        - Mixing up conditionals is quite common (sensitivity/specificity, p-values)
      • All probabilities are conditional
        - Some things are given without saying (e.g. probability is about human individuals), others less so (e.g. prediction in primary vs secondary care)
        - Things that are constant (e.g. setting) do not enter into the notation
        - There is no such thing as "the probability": context is everything
      • Conditional probabilities are at the core of prediction modelling
        - Perfect or near-perfect prediction models are suspect
        - Proving that a probability model generates a wrong prediction is inherently difficult
        - Prediction modelling is about finding the right variables (not too few and not too many) to condition on to generate probability predictions in future individuals
  54. Outline (as on slide 2). Section 4: Risk prediction modelling: rationale and context
  55. Risk prediction in medicine
      • Risk prediction research tends to investigate the relationship between a baseline health profile and some (undiagnosed or future) health outcome
      • Risk = probability
  56. Risk model categories
      Risk prediction models can be broadly categorized into:
      • Diagnostic: estimate the risk of a target disease being currently present vs not present
        - Given age, sex, weight loss, difficulty swallowing, ..., what is the probability of having undiagnosed lung cancer?
      • Prognostic: estimate the risk of a certain disease or health state over a certain time period
        - Given age, sex, BMI, cholesterol, ..., what is the probability of developing cardiovascular disease over the next 10 years?
  57. Why do we need risk prediction in medicine?
      Model-based risk estimates are used, among other reasons, to:
      • Support and communicate about (preventive) treatment decisions
      • Communicate with patients and their families about their risk of developing disease (lifestyle changes, such as diet and exercise)
      • Decide on further diagnostic testing for a certain disease (risk too high to rule out, but too low to rule in)
  58. Why use risk prediction?
      "It is very difficult to predict - especially the future" Niels Bohr
      • Diseases have multiple causes / symptoms, presentations and courses
      • It is difficult to make risk predictions with multiple factors playing a role. In a risk prediction model these factors do not get equal weight
      sources: Groopman, book: How Doctors Think, 1995, isbn: 9780547053646; Balogh, Improving Diagnosis in Health Care, 2015, doi: 10.17226/21794
  59. Why use risk prediction?
      • Support clinical knowledge and intuition
        - Attempts at replacing clinicians have so far been generally unsuccessful
      • Goals of risk prediction models (broadly):
        - To generate accurate and valid predictions of risk
        - Ultimately, to improve medical decision making and patient outcomes
  60. Risk prediction vs causal inference
      Broad categorization of some traditional differences between prediction and causal inference
                                Risk prediction (today, tomorrow)   Causal inference (days 1-3)
      Terminology "X"           candidate predictor                 exposure/confounder/collider/...
      Traditional focus         predictive performance              causal exposure-outcome effect
                                overfitting                         unmeasured confounding
      Useful in new setting?    transportability                    generalizability
      Causal direction          predictor may be cause of outcome   important
      Correlation vs causation  not important                       important
                                traditionally no DAGs               DAGs helpful
      Missing data              important                           important
      Measurement error         important                           important
      Medical treatment         take into account                   take into account
      Define baseline (T0)      important                           important
      Output                    tool                                knowledge
      Risk prediction: if it predicts, it predicts
  61. When is a risk prediction model ready for use?
      Risk model phases before implementation:
      • Model derivation/development
      • External validation: evaluating the performance of the model
      • Model updating / recalibration: updating the model for different settings
      • Model impact: evaluating whether the model changes clinician decision-making and improves patient outcomes and cost-effectiveness
