Slides of masterclass "Improving predictions: Lasso, Ridge and Stein's paradox" at the (Dutch) National Institute for Public Health and the Environment (RIVM)
1. Improving predictions: Ridge, Lasso and Stein’s paradox
RIVM Epi masterclass (22/3/18)
Maarten van Smeden
Post-doc clinical epidemiology/medical statistics, Leiden University Medical Center
4. This slide deck available:
https://www.slideshare.net/MaartenvanSmeden
5. Diagnostic / prognostic prediction
Clinical prediction models
•Diagnostic prediction: probability of disease D = d in patient i?
•Prognostic prediction: probability of developing health outcome Y = y within
(or up to) T years in patient i?
8. Rise of prediction models
•>110 models for prostate cancer (Shariat 2008)
•>100 models for Traumatic Brain Injury (Perel 2006)
•83 models for stroke (Counsell 2001)
•54 models for breast cancer (Altman 2009)
•43 models for type 2 diabetes (Collins 2011; Dieren 2012)
•31 models for osteoporotic fracture (Steurer 2011)
•29 models in reproductive medicine (Leushuis 2009)
•26 models for hospital readmission (Kansagara 2011)
•>25 models for length of stay in cardiac surgery (Ettema 2010)
•>350 models for CVD outcomes (Damen 2016)
The overview was created and first presented by Prof. KGM Moons (Julius Center, UMC Utrecht)
10. This talk
Key message
Regression shrinkage strategies, such as Ridge and Lasso, have the ability to
dramatically improve predictive performance of prediction models
Outline
•What is wrong with traditional prediction model development strategies?
•What is Ridge and Lasso?
•Some thoughts on when to consider Ridge/Lasso.
11. Setting
•Development data: with subjects (i = 1, . . . , N) for which an outcome is
observed (y: the outcome to predict), and P predictor variables (X: explanatory
variables to make a prediction of y)
•(External) validation data: with subjects that were not part of the
development data but have the same outcome and predictor variables observed.
Perhaps subjects from a different geographical area
•The goal is to develop a prediction model with as high as possible predictive
performance in validation (out-of-sample performance); performance in the
development sample is not directly relevant
•I’ll focus on the linear model for illustrative reasons
•N >> P
13. Linear model: OLS regression
Linear regression model
y = f(X) + ε, with ε ∼ N(0, σ²)
•With linear main effects only: ˆf(X) = ˆβ₀ + ˆβ₁x₁ + ˆβ₂x₂ + . . . + ˆβₚxₚ
•Find ˆβ that minimizes the (in-sample) squared prediction error: Σᵢ (yᵢ − ˆf(xᵢ))²
•Closed-form solution: ˆβ = (XᵀX)⁻¹Xᵀy
Question
Is ˆf(.) the best estimator to predict for future individuals?
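The closed-form OLS solution above can be computed directly. A minimal numpy sketch (the data and coefficient values are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 3                                   # N >> P, as in the setting above
X = rng.normal(size=(N, P))
beta = np.array([1.0, 0.5, -0.5])               # illustrative true coefficients
y = X @ beta + rng.normal(size=N)               # y = f(X) + noise

X1 = np.column_stack([np.ones(N), X])           # prepend an intercept column
# Solve the normal equations (X'X) b = X'y instead of forming the inverse
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)
y_pred = X1 @ beta_hat                          # in-sample predictions
```

Using `np.linalg.solve` on the normal equations is numerically preferable to explicitly inverting XᵀX, though both express the same (XᵀX)⁻¹Xᵀy formula.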
15. 1955: Stein’s paradox
Stein’s paradox in words (rather simplified)
When one has three or more units (say, individuals), and for each unit one can
calculate an average score (say, average blood pressure), then the best guess of
future observations (blood pressure) for each unit is NOT its average score.
16. 1961: James-Stein estimator: the next Berkeley Symposium
James and Stein. Estimation with quadratic loss. Proceedings of the fourth Berkeley symposium on mathematical
statistics and probability. Vol. 1. 1961.
17. 1977: Baseball example
Efron and Morris (1977). Stein’s paradox in statistics. Scientific American, 236 (5): 119-127.
18. Lessons from Stein’s paradox
•Probably among the most surprising (and initially doubted) phenomena in
statistics
•Now a large “family”: shrinkage estimators reduce prediction variance to an
extent that typically outweighs the bias that is introduced
•Bias/variance trade-off principle has motivated many statistical developments
Bias, variance and prediction error1
Expected prediction error = irreducible error + bias² + variance
¹ Friedman et al. (2001). The elements of statistical learning. Vol. 1. New York: Springer series.
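The dominance result behind Stein's paradox can be checked by simulation. A minimal sketch of the (positive-part) James-Stein estimator, shrinking per-unit observations toward zero — the true means, noise level, and repetition count are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 10                       # k >= 3 units; Stein's paradox needs at least 3
theta = rng.normal(size=k)   # unknown true means (e.g. true blood pressures)
sigma2 = 1.0                 # known observation variance

mle_err, js_err = 0.0, 0.0
for _ in range(2000):
    z = theta + rng.normal(scale=np.sqrt(sigma2), size=k)   # one noisy score per unit
    # positive-part James-Stein shrinkage factor: 1 - (k-2)σ²/||z||², floored at 0
    shrink = max(0.0, 1.0 - (k - 2) * sigma2 / np.sum(z ** 2))
    js = shrink * z                                         # shrink all scores toward 0
    mle_err += np.sum((z - theta) ** 2)                     # per-unit score as estimate
    js_err += np.sum((js - theta) ** 2)                     # shrunken estimate
```

Averaged over repetitions, the total squared error of the shrunken estimates is smaller than that of the raw per-unit scores — the variance reduction outweighs the introduced bias, exactly the trade-off stated above.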
31. Not just lucky
•5% reduction in MSPE just by using a shrinkage estimator
•Van Houwelingen and le Cessie’s heuristic shrinkage factor
32. Heuristic argument for shrinkage
[Figure: calibration plot of observed vs. predicted values; the fitted model line deviates from the ideal 45° line]
Typical calibration plot: “overfitting”
34. Overfitting
"Idiosyncrasies in the data are fitted rather than generalizable
patterns. A model may hence not be applicable to new patients,
even when the setting of application is very similar to the
development setting."
Steyerberg (2009). Clinical Prediction Models.
35. Ridge regression
Objective
minimize Σᵢ (yᵢ − ˆf(xᵢ))² + λ Σₚ₌₁ᴾ ˆβₚ²
•Note: λ = 0 corresponds to the OLS solution
•Closed-form solution: ˆβ = (XᵀX + λIₚ)⁻¹Xᵀy, where Iₚ is a P-dimensional
identity matrix
•In most software programs X is standardized and y centered for estimation
(output is mostly transformed back to original scale)
The challenge of ridge regression
finding a good value for the "tuning parameter": λ.
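The ridge closed form differs from OLS only by the λIₚ term added to XᵀX. A minimal numpy sketch (data, λ value, and coefficients are illustrative assumptions), including the standardization step the slide mentions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 50, 10
X = rng.normal(size=(N, P))
beta = np.zeros(P)
beta[:3] = [1.0, -1.0, 0.5]                     # a few true effects, rest zero
y = X @ beta + rng.normal(size=N)

# Standardize X and center y, as most software does before fitting
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()

lam = 5.0                                       # illustrative tuning parameter
# Ridge: (X'X + λI)^{-1} X'y — solved via the linear system, not an explicit inverse
ridge = np.linalg.solve(Xs.T @ Xs + lam * np.eye(P), Xs.T @ yc)
ols = np.linalg.solve(Xs.T @ Xs, Xs.T @ yc)     # λ = 0 recovers OLS
```

For any λ > 0 the ridge coefficient vector has a smaller L2 norm than the OLS one — that is the shrinkage.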
38. K-fold cross-validation to find “optimal” λ
•Usually K = 10 or K = 5
•Partition the dataset into K non-overlapping sub-datasets of equal size
(disjoint subsets)
•Fit statistical model on all but 1 of the subsets (training set), and evaluate
performance of the model in the left-out subset (test set)
•Fit and evaluate K times
44. The argument to use Ridge/Lasso
Key message
Regression shrinkage strategies, such as Ridge and Lasso, have the ability to
dramatically improve predictive performance of prediction models
45. Some arguments against Ridge/Lasso
•Interpretation of regression coefficient
•Shrinkage not needed due to sufficient sample size (e.g. based on rule of
thumb)
•Cross-validation can lead to unstable estimation of the λ parameter
•Difficult to implement
46. Interpretation of regression coefficients
•Shrinkage estimators such as Ridge and Lasso introduce bias in (‘shrink’) the
regression coefficient by design
•Most software programs do not provide standard errors and confidence intervals
for Ridge/Lasso regression coefficients
•Interpretation of coefficients is not / should not be the goal of a prediction
model
Note
Popular approaches to develop prediction models yield biased regression
coefficients and provide uninterpretable confidence intervals
48. Parameters may need shrinkage to become unbiased
Available at: https://www.slideshare.net/MaartenvanSmeden
49. Some arguments against Ridge/Lasso
•Interpretation of regression coefficient
•Shrinkage not needed due to sufficient sample size
•Cross-validation can lead to unstable estimation of the λ parameter
•Difficult to implement
50. Sufficient sample size?
Benefit of regression shrinkage depends on:
•Sample size
•Correlations between predictor variables
•Sparsity of outcome and predictor variables
•The irreducible error component
•Type of outcome (continuous, binary, count, time-to-event,. . . )
•Number of candidate predictor variables
•Non-linear/interaction effects
•Weak/strong predictor balance
How to know that there is no need for shrinkage at some sample size?
51. Is a rule of thumb a rule of dumb1?
1
direct quote from tweet by prof Stephen Senn:
https://twitter.com/stephensenn/status/936213710770753536
52. Some arguments against Ridge/Lasso
•Interpretation of regression coefficient
•Shrinkage not needed due to sufficient sample size (e.g. based on rule of
thumb)
•Cross-validation can lead to unstable estimation of the λ parameter
•Difficult to implement
53. Estimating Ridge/Lasso
•“Programming” Ridge/Lasso regression isn’t hard with user friendly software
such as the glmnet package in R
•Getting it right might be a bit tougher than traditional approaches. It’s all
about the tuning parameter (λ)
•K-fold cross-validation makes arbitrary partitions of data which may make
estimating the tuning parameter unstable (there are some suggestions to
circumvent the problems). Note: this is not a flaw of cross-validation: it means
that there is probably insufficient data to estimate how much shrinkage is really
needed!
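glmnet is an R package; to make the Lasso mechanics concrete outside R, here is a minimal coordinate-descent sketch in Python (the soft-thresholding update is the standard Lasso algorithm; the data, λ, and iteration count are illustrative assumptions):

```python
import numpy as np

def soft_threshold(z, g):
    """S(z, g) = sign(z) * max(|z| - g, 0): the Lasso shrinkage operator."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via coordinate descent; objective: (1/2N)·RSS + λ·||b||_1."""
    N, P = X.shape
    b = np.zeros(P)
    col_ss = (X ** 2).sum(axis=0) / N
    for _ in range(n_iter):
        for j in range(P):
            r = y - X @ b + X[:, j] * b[j]      # partial residual excluding j
            rho = X[:, j] @ r / N
            b[j] = soft_threshold(rho, lam) / col_ss[j]
    return b

rng = np.random.default_rng(4)
N, P = 100, 8
X = rng.normal(size=(N, P))
beta = np.zeros(P)
beta[0] = 2.0                                   # one strong predictor, rest noise
y = X @ beta + rng.normal(size=N)
b = lasso_cd(X, y, lam=0.5)
```

Unlike Ridge, the soft-threshold sets weak coefficients exactly to zero, which is why the Lasso doubles as a variable-selection method.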
54. Closing remarks
•Shrinkage is highly recommended when developing a prediction model (e.g. see
the TRIPOD reporting guideline)
•Software and methodological developments have made Lasso and Ridge
regression relatively easy to implement and computationally fast
•The cross-validation procedure can provide insights about possible overfitting
(much like propensity score analysis can provide information about balance)
•Consider the Lasso instead of traditional backward/forward selection strategies