
Sample size for binary logistic prediction models: Beyond events per variable criteria

Presentation for conference MEMTAB 2018

Published in: Science

  1. Sample size for binary logistic prediction models: Beyond events per variable criteria. Maarten van Smeden, PhD, Senior researcher, Leiden University Medical Center. MEMTAB 2018, Utrecht, July 3
  2. Slides available at: https://www.slideshare.net/MaartenvanSmeden/presentations MEMTAB, Utrecht, July 3 2018. Sample size prediction modeling literature (2018)
  3. Events per variable (EPV)
  4. Events per variable (EPV)
  5. Events per variable (EPV): critique
     • Flimsy supporting evidence for the 10 EPV rule [1]
     • A 50 EPV rule is more realistic with traditional variable selection techniques [2]
     • 5 EPV is sufficient to reduce (average) overfitting after “modern” shrinkage [3]
     • EPV is only part of the sample size story [4]
     [1] van Smeden et al., BMC MRM, 2016, doi: 10.1186/s12874-016-0267-3
     [2] Steyerberg et al., Stat Med, 2000, doi: 10.1002/(SICI)1097-0258(20000430)19:8<1059::AID-SIM412>3.0.CO;2-0
     [3] Pavlou et al., Stat Med, 2016, doi: 10.1002/sim.6782
     [4] Ogundimu et al., JCE, 2016, doi: 10.1016/j.jclinepi.2016.02.031
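To make the rules of thumb above concrete, here is a minimal sketch of how an EPV target translates into a total sample size. It assumes the usual convention that "events" means the smaller outcome group and that EPV is events divided by the number of candidate predictors; the function names are hypothetical and not taken from the slides.

```python
import math


def events_per_variable(n_events: int, n_nonevents: int, n_candidate_predictors: int) -> float:
    """EPV: size of the smaller outcome group per candidate predictor (assumed convention)."""
    if n_candidate_predictors <= 0:
        raise ValueError("need at least one candidate predictor")
    return min(n_events, n_nonevents) / n_candidate_predictors


def sample_size_for_epv(target_epv: float, n_candidate_predictors: int, events_fraction: float) -> int:
    """Total N so that the expected number of events reaches target_epv per predictor."""
    if not 0.0 < events_fraction <= 0.5:
        raise ValueError("events_fraction must be in (0, 0.5]")
    required_events = target_epv * n_candidate_predictors
    return math.ceil(required_events / events_fraction)
```

With 8 candidate predictors and an events fraction of 1/4, the 5, 10 and 50 EPV rules correspond to 160, 320 and 1,600 participants respectively, which shows how far apart the competing rules are in practice.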
  6. EPV forgets about the intercept?
  7. New sample size criterion: rMSPE. Root Mean Squared Prediction Error (rMSPE): standard deviation of the out-of-sample probability prediction error. Rationale: since clinical prediction is about probability estimation, a sample size criterion should be based on the allowable error in these estimates
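As a minimal sketch of the quantity being proposed, rMSPE can be computed as the square root of the mean squared difference between "true" and estimated probabilities. The "true" probabilities are only available in a simulation, where the data-generating model is known; the function name and example values below are illustrative assumptions.

```python
import math


def rmspe(true_probs, estimated_probs):
    """Root mean squared prediction error between true and estimated probabilities."""
    if len(true_probs) != len(estimated_probs) or not true_probs:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - q) ** 2 for p, q in zip(true_probs, estimated_probs)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only: four individuals in a hypothetical validation set.
example = rmspe([0.10, 0.20, 0.50, 0.80], [0.15, 0.10, 0.55, 0.70])
```

In this toy example the estimated probabilities are off by 0.05 or 0.10, giving an rMSPE of about 0.079 on the probability scale.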
  10. *Coverage property is not guaranteed: it assumes errors are IID normal
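One way to read this footnote: if prediction errors were IID normal with mean zero (an added assumption) and standard deviation equal to the rMSPE, about 95% of out-of-sample errors would fall within roughly ±1.96 × rMSPE. The sketch below converts between an rMSPE and such a coverage region; the function names are hypothetical, and the result is only as good as the normality assumption.

```python
from statistics import NormalDist


def coverage_half_width(rmspe: float, coverage: float = 0.95) -> float:
    """Half-width of the error region covering `coverage` of errors, assuming N(0, rmspe^2)."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return z * rmspe


def rmspe_for_allowable_error(allowable_error: float, coverage: float = 0.95) -> float:
    """Largest rMSPE keeping `coverage` of errors within +/- allowable_error (same assumption)."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return allowable_error / z
```

For instance, an rMSPE of 0.05 corresponds to a 95% region of about ±0.098, and tolerating errors of at most ±0.10 for 95% of individuals requires an rMSPE of about 0.051.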
  12. Unfortunately, there is no closed-form solution for out-of-sample rMSPE
  15. Simulation study
     • 4,032 simulation conditions (factorial design). Simulation factors: EPV (3 to 50), number of candidate predictors (4 to 12), events fraction (1/16 to 1/2), area under the ROC curve (0.65 to 0.85), distribution and correlation of predictors, number of noise variables
     • 5,000 replications per condition, i.e. more than 20 million simulation runs
     • Each run: generate a pair of derivation and validation data sets (the validation set large, with 5,000 expected events), then develop and validate various logistic prediction models
     • Focus here: maximum likelihood logistic regression
     • Simulation meta-models: fit linear (ridge) regression models to predict the simulation outcome (rMSPE) from the simulation factors
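A single simulation run of the kind described above might look roughly like the following sketch. It is not the authors' code: it assumes independent standard normal predictors, arbitrary true coefficients, a fixed validation size rather than 5,000 expected events, and a hand-written Newton-Raphson fit for maximum likelihood logistic regression; all function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def simulate(n, beta, rng):
    """Draw n rows (intercept + independent N(0, 1) predictors) and binary outcomes."""
    rows, ys = [], []
    for _ in range(n):
        x = [1.0] + [rng.gauss(0.0, 1.0) for _ in beta[1:]]
        p = sigmoid(sum(b * v for b, v in zip(beta, x)))
        rows.append(x)
        ys.append(1 if rng.random() < p else 0)
    return rows, ys


def fit_logistic(rows, ys, iterations=25):
    """Maximum likelihood logistic regression via Newton-Raphson (no shrinkage)."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, y in zip(rows, ys):
            p = sigmoid(sum(b * v for b, v in zip(beta, x)))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (y - p) * x[i]
                for j in range(k):
                    hess[i][j] += w * x[i] * x[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-8:
            break
    return beta


def one_run(n_derivation, n_validation, true_beta, seed):
    """Develop a model on derivation data; return its rMSPE on validation data."""
    rng = random.Random(seed)
    dev_rows, dev_ys = simulate(n_derivation, true_beta, rng)
    val_rows, _ = simulate(n_validation, true_beta, rng)
    est_beta = fit_logistic(dev_rows, dev_ys)
    total = 0.0
    for x in val_rows:
        true_p = sigmoid(sum(b * v for b, v in zip(true_beta, x)))
        est_p = sigmoid(sum(b * v for b, v in zip(est_beta, x)))
        total += (true_p - est_p) ** 2
    return math.sqrt(total / n_validation)
```

Calling `one_run(200, 5000, [-1.0, 0.5, 0.5, 0.5, 0.5], seed=1)` gives the rMSPE for one derivation sample of 200; a full study would repeat this over many seeds and over a grid of conditions. The sketch does not guard against separation, which can make the Newton-Raphson step fail in very small samples.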
  16. Simulation meta-models for rMSPE • Meta-model with 3 (of 7) factors (N, events fraction and number of candidate predictors): R² = 0.992 • Meta-model with only EPV as factor: R² = 0.432 • https://mvansmeden.shinyapps.io/BeyondEPV/
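The slides do not give the meta-model formula, so the following is only a sketch of the general idea: a closed-form ridge regression that predicts a simulation outcome from N, events fraction and number of candidate predictors. The log-log form, the penalty value, and especially the coefficients used to generate the data are synthetic placeholders chosen for illustration, not results from the study; all names are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def ridge_fit(rows, ys, penalty):
    """Closed-form ridge: solve (X'X + penalty * I) b = X'y, intercept left unpenalised."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    for i in range(1, k):
        xtx[i][i] += penalty
    return solve(xtx, xty)


def predict(coefs, row):
    """Linear prediction: intercept plus weighted sum of the factors."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# SYNTHETIC placeholder data: made-up coefficients on an arbitrary grid of conditions.
placeholder_coefs = [-0.5, -0.5, 0.25, 0.5]
rows, ys = [], []
for n in (100, 400, 1600, 6400):
    for events_fraction in (1 / 16, 1 / 4, 1 / 2):
        for n_predictors in (4, 8, 12):
            row = [math.log(n), math.log(events_fraction), math.log(n_predictors)]
            rows.append(row)
            ys.append(predict(placeholder_coefs, row))  # stands in for log(rMSPE)

unpenalised = ridge_fit(rows, ys, penalty=0.0)
shrunken = ridge_fit(rows, ys, penalty=5.0)
```

With a zero penalty the fit recovers the placeholder coefficients, and a positive penalty shrinks the slopes toward zero. In a real meta-model the outcome would be the rMSPE observed in each simulation condition, and the fitted model could then be inverted to find the N that achieves a target rMSPE.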
  18. In press. Thanks to Richard Riley for commenting on an early draft
  19. Final remarks
     • Prediction models developed with 10 EPV can produce wildly inaccurate probability estimates
     • The new sample size criterion, based on rMSPE, could be accurately approximated from predictable data characteristics
     • Validation, analytical work, and extensions still need to be done
     • Our new sample size calculation shiny-app is “Beta”; it can be used to approximate rMSPE for settings that stay close to our simulation design (article in press)
     • One sample size criterion probably isn’t always enough. Notably, settings with a low events fraction may combine a low rMSPE with a high need for shrinkage
  20. Final remarks: binary logistic regression sample size recommendations
     1. Think about the allowable probability prediction error (e.g. in terms of 95% coverage regions)
     2. If you can, run a realistic simulation study
     3. If you can’t do 2, use our shiny-app with caution to calculate the minimal sample size
  21. https://mvansmeden.shinyapps.io/BeyondEPV/
  24. Logistic prediction models. Schmidt et al., Schizophr Bull, 2017, doi: 10.1093/schbul/sbw098; Damen et al., BMJ, 2016, doi: 10.1136/bmj.i2416; Collins et al., BMC MRM, 2014, doi: 10.1186/1471-2288-14-40; Collins et al., BMC Med, 2011, doi: 10.1186/1741-7015-9-103; Bouwmeester et al., PLoS Med, 2012, doi: 10.1371/journal.pmed.1001221.
  25. New sample size criterion: use the expected root Mean Squared Prediction Error (rMSPE), $\mathrm{rMSPE} = \sqrt{E\left[(\pi_i - \hat{\pi}_i)^2\right]}$. Interpretation: standard deviation of the expected out-of-sample probability prediction error. Here $\pi_i$ are the unobservable “true” probabilities that would have been obtained had the prediction model been derived with the correct functional form and an infinite sample size; $\hat{\pi}_i$ are the estimated probabilities from the derived model in a large external set of similar individuals (“out-of-sample”).
  26. Difference between the estimated probability from a prediction model applied in a large-sample validation study and the “true” probability that would be obtained if the same model had been derived from an infinitely large sample
