Slides of masterclass "Improving predictions: Lasso, Ridge and Stein's paradox" at the (Dutch) National Institute for Public Health and the Environment (RIVM)
1. Improving predictions: Ridge, Lasso and Stein’s paradox
RIVM Epi masterclass (22/3/18)
Maarten van Smeden
Post-doc clinical epidemiology/medical statistics, Leiden University Medical Center
4. This slide deck available:
https://www.slideshare.net/MaartenvanSmeden
5. Diagnostic / prognostic prediction
Clinical prediction models
•Diagnostic prediction: probability of disease D = d in patient i?
•Prognostic prediction: probability of developing health outcome Y = y within
(or up to) T years in patient i?
8. Rise of prediction models
•>110 models for prostate cancer (Shariat 2008)
•>100 models for Traumatic Brain Injury (Perel 2006)
•83 models for stroke (Counsell 2001)
•54 models for breast cancer (Altman 2009)
•43 models for type 2 diabetes (Collins 2011; Dieren 2012)
•31 models for osteoporotic fracture (Steurer 2011)
•29 models in reproductive medicine (Leushuis 2009)
•26 models for hospital readmission (Kansagara 2011)
•>25 models for length of stay in cardiac surgery (Ettema 2010)
•>350 models for CVD outcomes (Damen 2016)
The overview was created and first presented by Prof. KGM Moons (Julius Center, UMC Utrecht)
10. This talk
Key message
Regression shrinkage strategies, such as Ridge and Lasso, have the ability to
dramatically improve predictive performance of prediction models
Outline
•What is wrong with traditional prediction model development strategies?
•What is Ridge and Lasso?
•Some thoughts on when to consider Ridge/Lasso.
11. Setting
•Development data: with subjects (i = 1, . . . , N) for which an outcome is
observed (y: the outcome to predict), and P predictor variables (X: explanatory
variables to make a prediction of y)
•(External) validation data: with subjects that were not part of the
development data but have the same outcome and predictor variables observed.
Perhaps subjects from a different geographical area
•The goal is to develop a prediction model with as high as possible predictive
performance in validation (out-of-sample performance); performance in the
development sample is not directly relevant
•I’ll focus on the linear model for illustrative reasons
•N >> P
13. Linear model: OLS regression
Linear regression model
y = f(X) + ε, with ε ∼ N(0, σ²)
•With linear main effects only: ˆf(X) = ˆβ₀ + ˆβ₁x₁ + ˆβ₂x₂ + . . . + ˆβₚxₚ
•Find ˆβ that minimizes the (in-sample) squared prediction error: Σᵢ (yᵢ − ˆf(xᵢ))²
•Closed-form solution: ˆβ = (XᵀX)⁻¹Xᵀy
Question
Is ˆf(.) the best estimator to predict for future individuals?
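The closed-form OLS solution above can be computed directly. A minimal numpy sketch (the data and coefficient values are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 3                                   # N >> P, as in the setting above
X = rng.normal(size=(N, P))
beta = np.array([1.0, 0.5, -0.5])               # illustrative true coefficients
y = X @ beta + rng.normal(size=N)               # y = f(X) + noise

X1 = np.column_stack([np.ones(N), X])           # prepend an intercept column
# Solve the normal equations (X'X) b = X'y instead of forming the inverse
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)
y_pred = X1 @ beta_hat                          # in-sample predictions
```

Using `np.linalg.solve` on the normal equations is numerically preferable to explicitly inverting XᵀX, though both express the same (XᵀX)⁻¹Xᵀy formula.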
15. 1955: Stein’s paradox
Stein’s paradox in words (rather simplified)
When one has three or more units (say, individuals), and for each unit one can
calculate an average score (say, average blood pressure), then the best guess of
future observations (blood pressure) for each unit is NOT its average score.
16. 1961: James-Stein estimator: the next Berkeley Symposium
James and Stein. Estimation with quadratic loss. Proceedings of the fourth Berkeley symposium on mathematical
statistics and probability. Vol. 1. 1961.
17. 1977: Baseball example
Efron and Morris (1977). Stein’s paradox in statistics. Scientific American, 236 (5): 119-127.
18. Lessons from Stein’s paradox
•Probably among the most surprising (and initially doubted) phenomena in
statistics
•Now a large “family”: shrinkage estimators reduce prediction variance to an
extent that typically outweighs the bias that is introduced
•Bias/variance trade-off principle has motivated many statistical developments
Bias, variance and prediction error1
Expected prediction error = irreducible error + bias² + variance
¹ Friedman et al. (2001). The elements of statistical learning. Vol. 1. New York: Springer series.
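The dominance result behind Stein's paradox can be checked by simulation. A minimal sketch of the (positive-part) James-Stein estimator, shrinking per-unit observations toward zero — the true means, noise level, and repetition count are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 10                       # k >= 3 units; Stein's paradox needs at least 3
theta = rng.normal(size=k)   # unknown true means (e.g. true blood pressures)
sigma2 = 1.0                 # known observation variance

mle_err, js_err = 0.0, 0.0
for _ in range(2000):
    z = theta + rng.normal(scale=np.sqrt(sigma2), size=k)   # one noisy score per unit
    # positive-part James-Stein shrinkage factor: 1 - (k-2)σ²/||z||², floored at 0
    shrink = max(0.0, 1.0 - (k - 2) * sigma2 / np.sum(z ** 2))
    js = shrink * z                                         # shrink all scores toward 0
    mle_err += np.sum((z - theta) ** 2)                     # per-unit score as estimate
    js_err += np.sum((js - theta) ** 2)                     # shrunken estimate
```

Averaged over repetitions, the total squared error of the shrunken estimates is smaller than that of the raw per-unit scores — the variance reduction outweighs the introduced bias, exactly the trade-off stated above.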
31. Not just lucky
•5% reduction in MSPE just by using a shrinkage estimator
•Van Houwelingen and le Cessie’s heuristic shrinkage factor
32. Heuristic argument for shrinkage
[Figure: calibration plot of observed vs. predicted values; the fitted model line deviates from the ideal 45° line]
Typical calibration plot: “overfitting”
34. Overfitting
"Idiosyncrasies in the data are fitted rather than generalizable
patterns. A model may hence not be applicable to new patients,
even when the setting of application is very similar to the
development setting."
Steyerberg (2009). Clinical Prediction Models.
35. Ridge regression
Objective
minimize Σᵢ (yᵢ − ˆf(xᵢ))² + λ Σₚ₌₁ᴾ ˆβₚ²
•Note: λ = 0 corresponds to the OLS solution
•Closed-form solution: ˆβ = (XᵀX + λIₚ)⁻¹Xᵀy, where Iₚ is a P-dimensional
identity matrix
•In most software programs X is standardized and y centered for estimation
(output is mostly transformed back to original scale)
The challenge of ridge regression
finding a good value for the "tuning parameter": λ.
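The ridge closed form differs from OLS only by the λIₚ term added to XᵀX. A minimal numpy sketch (data, λ value, and coefficients are illustrative assumptions), including the standardization step the slide mentions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 50, 10
X = rng.normal(size=(N, P))
beta = np.zeros(P)
beta[:3] = [1.0, -1.0, 0.5]                     # a few true effects, rest zero
y = X @ beta + rng.normal(size=N)

# Standardize X and center y, as most software does before fitting
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()

lam = 5.0                                       # illustrative tuning parameter
# Ridge: (X'X + λI)^{-1} X'y — solved via the linear system, not an explicit inverse
ridge = np.linalg.solve(Xs.T @ Xs + lam * np.eye(P), Xs.T @ yc)
ols = np.linalg.solve(Xs.T @ Xs, Xs.T @ yc)     # λ = 0 recovers OLS
```

For any λ > 0 the ridge coefficient vector has a smaller L2 norm than the OLS one — that is the shrinkage.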
38. K-fold cross-validation to find “optimal” λ
•Usually K = 10 or K = 5
•Partition the dataset into K non-overlapping sub-datasets of equal size
(disjoint subsets)
•Fit statistical model on all but 1 of the subsets (training set), and evaluate
performance of the model in the left-out subset (test set)
•Fit and evaluate K times
44. The argument to use Ridge/Lasso
Key message
Regression shrinkage strategies, such as Ridge and Lasso, have the ability to
dramatically improve predictive performance of prediction models
45. Some arguments against Ridge/Lasso
•Interpretation of regression coefficient
•Shrinkage not needed due to sufficient sample size (e.g. based on rule of
thumb)
•Cross-validation can lead to unstable estimation of the λ parameter
•Difficult to implement
46. Interpretation of regression coefficients
•Shrinkage estimators such as Ridge and Lasso introduce bias in (‘shrink’) the
regression coefficient by design
•Most software programs do not provide standard errors and confidence intervals
for Ridge/Lasso regression coefficients
•Interpretation of coefficients is not / should not be the goal of a prediction
model
Note
Popular approaches to develop prediction models yield biased regression
coefficients and provide uninterpretable confidence intervals
48. Parameters may need shrinkage to become unbiased
Available at: https://www.slideshare.net/MaartenvanSmeden
49. Some arguments against Ridge/Lasso
•Interpretation of regression coefficient
•Shrinkage not needed due to sufficient sample size
•Cross-validation can lead to unstable estimation of the λ parameter
•Difficult to implement
50. Sufficient sample size?
Benefit of regression shrinkage depends on:
•Sample size
•Correlations between predictor variables
•Sparsity of outcome and predictor variables
•The irreducible error component
•Type of outcome (continuous, binary, count, time-to-event,. . . )
•Number of candidate predictor variables
•Non-linear/interaction effects
•Weak/strong predictor balance
How to know that there is no need for shrinkage at some sample size?
51. Is a rule of thumb a rule of dumb1?
1
direct quote from tweet by prof Stephen Senn:
https://twitter.com/stephensenn/status/936213710770753536
52. Some arguments against Ridge/Lasso
•Interpretation of regression coefficient
•Shrinkage not needed due to sufficient sample size (e.g. based on rule of
thumb)
•Cross-validation can lead to unstable estimation of the λ parameter
•Difficult to implement
53. Estimating Ridge/Lasso
•“Programming” Ridge/Lasso regression isn’t hard with user friendly software
such as the glmnet package in R
•Getting it right might be a bit tougher than traditional approaches. It’s all
about the tuning parameter (λ)
•K-fold cross-validation makes arbitrary partitions of data which may make
estimating the tuning parameter unstable (there are some suggestions to
circumvent the problems). Note: this is not a flaw of cross-validation: it means
that there is probably insufficient data to estimate how much shrinkage is really
needed!
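glmnet is an R package; to make the Lasso mechanics concrete outside R, here is a minimal coordinate-descent sketch in Python (the soft-thresholding update is the standard Lasso algorithm; the data, λ, and iteration count are illustrative assumptions):

```python
import numpy as np

def soft_threshold(z, g):
    """S(z, g) = sign(z) * max(|z| - g, 0): the Lasso shrinkage operator."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via coordinate descent; objective: (1/2N)·RSS + λ·||b||_1."""
    N, P = X.shape
    b = np.zeros(P)
    col_ss = (X ** 2).sum(axis=0) / N
    for _ in range(n_iter):
        for j in range(P):
            r = y - X @ b + X[:, j] * b[j]      # partial residual excluding j
            rho = X[:, j] @ r / N
            b[j] = soft_threshold(rho, lam) / col_ss[j]
    return b

rng = np.random.default_rng(4)
N, P = 100, 8
X = rng.normal(size=(N, P))
beta = np.zeros(P)
beta[0] = 2.0                                   # one strong predictor, rest noise
y = X @ beta + rng.normal(size=N)
b = lasso_cd(X, y, lam=0.5)
```

Unlike Ridge, the soft-threshold sets weak coefficients exactly to zero, which is why the Lasso doubles as a variable-selection method.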
54. Closing remarks
•Shrinkage is highly recommended when developing a prediction model (e.g. see
the TRIPOD reporting guideline)
•Software and methodological developments have made Lasso and Ridge
regression relatively easy to implement and computationally fast
•The cross-validation procedure can provide insights about possible overfitting
(much like propensity score analysis can provide information about balance)
•Consider the Lasso instead of traditional backward/forward selection strategies