Visual Explanation of
Ridge Regression and LASSO
Kazuki Yoshida
2017-11-03
OLS Problem
▶ The ordinary least squares (OLS) problem can be described as the following optimization.
\arg\min_{\beta} \left[ \sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ji} \Big)^2 \right]
▶ That is, we try to find the coefficient vector β that minimizes the squared errors, i.e., the squared distance between the outcome Y_i and the predicted outcome \beta_0 + \sum_{j=1}^{p} \beta_j X_{ji} (a sketch in R follows below).
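For concreteness, a minimal R sketch of this minimization (the data and variable names are illustrative assumptions, not from the slides); lm() computes exactly the OLS solution.

set.seed(1)                             # illustrative data, not the slides'
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 0.5 * x1 + 0.1 * x2 + rnorm(100)  # true beta1 = 0.5, beta2 = 0.1
fit_ols <- lm(y ~ x1 + x2)              # minimizes the sum of squared errors
coef(fit_ols)                           # beta0 (intercept), beta1, beta2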
Advantages of OLS
▶ Provided the true relationship between the response and the predictors is approximately linear, the OLS estimates will have low bias. If the number of observations n is much larger than the number of variables p, the OLS estimates also tend to have low variance, and hence will perform well on test observations.
Why OLS May Fail
▶ In settings with many explanatory variables, unless n is extremely large, because of sampling variability the estimates \{\hat{\beta}_j\} tend to be much larger in magnitude than the true values \{\beta_j\} (overfitting).
▶ This tendency is exacerbated when we keep only
statistically significant variables in a model (typical
stepwise selection). [Agresti, 2015]
▶ If p > n, there is no longer a unique OLS solution.
[James et al., 2017]
Penalized Regression Rationale
▶ Shrinkage (regularization) toward 0 tends to move \{\hat{\beta}_j\} closer to \{\beta_j\}. [Agresti, 2015]
▶ By shrinking the estimated coefficients, we can often
substantially reduce the variance at the cost of a
negligible increase in bias, substantially improving the
accuracy of prediction for future observations.
[James et al., 2017]
Bias-Variance Tradeoff
[Figure: the bias-variance tradeoff] [Kutner et al., 2004]
Penalized Regression Definition
Ordinary Least Squares

\arg\min_{\beta} \left[ \sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ji} \Big)^2 \right]

Ridge Regression

\arg\min_{\beta} \left[ \sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ji} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^2 \right]

LASSO

\arg\min_{\beta} \left[ \sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ji} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right]
Penalized Regression Definition
\arg\min_{\beta} \left[ \sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ji} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \right]

▶ Both ridge regression and LASSO add a penalty term for having "large" coefficients.
▶ Ridge regression: large with respect to the squared L2 norm (Euclidean distance), q = 2.
▶ LASSO: large with respect to the L1 norm (Manhattan distance), q = 1.
▶ Both have a tuning parameter λ, which decides how important the penalty is in relation to the squared error term (see the sketch below).
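As a hedged sketch, the glmnet package in R [Friedman et al., 2010] fits both penalized problems, parameterized via alpha (0 for ridge, 1 for LASSO). Note that glmnet scales the squared-error term by 1/(2n), so its lambda is not numerically identical to the λ in the formula above. Continuing with x1, x2, y from the OLS sketch:

library(glmnet)
X <- cbind(x1, x2)                    # glmnet needs a predictor matrix
fit_ridge <- glmnet(X, y, alpha = 0)  # alpha = 0: ridge penalty (q = 2)
fit_lasso <- glmnet(X, y, alpha = 1)  # alpha = 1: LASSO penalty (q = 1)
coef(fit_ridge, s = 1)                # coefficients at lambda = 1
coef(fit_lasso, s = 1)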
Ridge Regression vs OLS
▶ Ridge regression estimates tend to be stable in the sense
that they are usually little affected by small changes in
the data on which the fitted regression is based. In
contrast, ordinary least squares estimates may be highly
unstable when the predictor variables are highly multicollinear.
▶ Predictions of new observations made from ridge
estimated regression functions tend to be more precise
than predictions made from OLS regression functions
when the predictor variables are correlated and the new
observations follow the same multicollinearity pattern.
[Kutner et al., 2004]
▶ Unlike OLS, ridge regression will produce a different set of coefficient estimates \hat{\beta}^R_\lambda for each value of λ. Selecting a good value for λ is critical. [James et al., 2017]
LASSO vs Stepwise Model Selection
▶ Because of the nature of the constraint, making the
penalty sufficiently large will cause some of the
coefficients to be exactly zero.
▶ Thus, the lasso does a kind of continuous model selection
[Hastie et al., 2016].
▶ On the other hand, the stepwise variable selection
methods are discrete in nature because variables are
either retained fully or discarded fully.
LASSO and Coefficients
▶ Regarding the coefficients themselves, the LASSO shrinkage causes the estimates of the non-zero coefficients to be biased towards zero, and in general they are not consistent.
▶ One approach for reducing this bias is to run the LASSO to identify the set of non-zero coefficients, and then fit an unrestricted linear model to the selected set of features (see the sketch below). This is not always feasible if the selected set is large.
▶ Alternatively, one can use the LASSO again on the selected set of variables (relaxed LASSO). [Hastie et al., 2016] (p91)
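A hedged sketch of the two-stage refit described above, continuing with X and y from the earlier sketches. (Recent glmnet versions also offer relax = TRUE for the relaxed LASSO; the manual refit is shown here.)

cvfit <- cv.glmnet(X, y, alpha = 1)                # LASSO with CV-chosen lambda
b <- as.vector(coef(cvfit, s = "lambda.min"))[-1]  # coefficients, intercept dropped
selected <- which(b != 0)                          # indices of non-zero coefficients
Xsel <- X[, selected, drop = FALSE]
refit <- lm(y ~ Xsel)                              # unrestricted OLS on the selected set
coef(refit)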
Ridge Regression vs LASSO
▶ A disadvantage of ridge regression is that it requires a
separate strategy for finding a parsimonious model,
because all explanatory variables remain in the model.
▶ When p is large but only a few \{\beta_j\} are practically different from 0, the LASSO tends to perform better, because many \{\hat{\beta}_j\} may equal 0.
▶ When \{\beta_j\} do not vary dramatically in substantive size, ridge regression tends to perform better. [Agresti, 2015]
▶ Neither ridge regression nor the LASSO will universally dominate the other. [James et al., 2017]
▶ Not performing variable selection may not be a problem for prediction accuracy, but it can create a challenge in model interpretation in settings in which the number of variables p is quite large. The LASSO yields sparse models, that is, models that involve only a subset of the variables, which are generally much easier to interpret.
Selecting the Tuning Parameter
▶ A penalty that is too large can prevent the penalized
model from capturing the main signal in the data, while
too small a value can lead to overfitting to the noise in
the data.
▶ Cross-validation provides a simple way to tackle the
problem of choosing a good tuning parameter λ. We
choose a grid of λ values, and compute the
cross-validation prediction error for each value of λ. We
then select the tuning parameter value for which the
cross-validation error is smallest. Finally, the model is
re-fit using all of the available observations and the selected value of the tuning parameter (a sketch follows below).
[James et al., 2017] (p227)
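cv.glmnet in R implements exactly this recipe (a λ grid, k-fold cross-validation, then a final fit on all observations); a sketch, again with X and y from the earlier sketches:

cvfit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)  # 10-fold CV over a lambda grid
cvfit$lambda.min                                  # lambda with the smallest CV error
coef(cvfit, s = "lambda.min")                     # coefficients from the full-data fit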
Elastic Net
\arg\min_{\beta} \left[ \sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ji} \Big)^2 + \lambda \sum_{j=1}^{p} \big( \alpha |\beta_j|^2 + (1 - \alpha) |\beta_j| \big) \right]
▶ The elastic-net selects variables like the LASSO, and
shrinks together the coefficients of correlated predictors
like ridge.
▶ The LASSO does not handle highly correlated variables very well; the coefficient paths tend to be erratic and can sometimes show wild behavior. If there are two identical variables, the coefficients are not uniquely estimated with the L1 penalty, although identical half-sized coefficients are estimated with the L2 penalty (see the glmnet sketch below).
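One caution for a glmnet sketch: glmnet defines the mixing parameter the opposite way from the formula above (its alpha weights the L1 term, so alpha = 1 is the pure LASSO and alpha = 0 is pure ridge). An equal mix, with X and y as before:

fit_enet <- cv.glmnet(X, y, alpha = 0.5)  # 50/50 mix of L1 and squared-L2 penalties
coef(fit_enet, s = "lambda.min")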
Set Up
\begin{bmatrix} X_{1i} \\ X_{2i} \end{bmatrix} \sim \mathrm{MVN}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \right)

E[Y_i] = \beta_1 X_{1i} + \beta_2 X_{2i}

Y_i \sim N(\beta_1 X_{1i} + \beta_2 X_{2i}, \sigma^2)

\rho = 0.7, \quad \beta_1 = 0.5, \quad \beta_2 = 0.1, \quad \sigma^2 = 1
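A sketch reproducing this set-up in R (the random seed is an assumption; the slides do not state one, so the draws will not match the table on the next slide exactly):

library(MASS)                                         # for mvrnorm()
set.seed(1)                                           # assumed seed
rho <- 0.7; beta1 <- 0.5; beta2 <- 0.1; sigma2 <- 1
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)          # correlation matrix of (X1, X2)
Xsim  <- mvrnorm(n = 10, mu = c(0, 0), Sigma = Sigma) # 10 bivariate normal draws
Ysim  <- beta1 * Xsim[, 1] + beta2 * Xsim[, 2] +
  rnorm(10, sd = sqrt(sigma2))                        # Yi ~ N(mean, sigma2)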
Data
Ten data points were generated.
Y X1 X2
[1,] -0.1031846 0.2975677 0.9011822
[2,] 0.3685920 -0.8119366 -1.3164309
[3,] 1.2895888 0.5110124 -0.8779760
[4,] -0.9877068 -0.5490504 -0.2468024
[5,] 1.0606268 0.8946819 0.7958214
[6,] 3.6955798 2.6184162 3.5490718
[7,] 0.2591585 -0.9085681 -1.1293642
[8,] -0.4325165 -0.8695810 -0.5797419
[9,] -0.5336736 -0.1324647 0.4957759
[10,] 1.5956205 -0.3602429 -0.1464311
OLS Objective Function
[Figure: perspective plot of the OLS objective function over beta1 (truth 0.5) and beta2 (truth 0.1); OLS minimum at (1.40, −0.30).]
Penalty Functions
[Figure: perspective plots of the penalty functions over (beta1, beta2): ridge penalty (left) and LASSO penalty (right).]
Constructing Penalized Objective Functions
[Figure: the OLS objective surface (minimum at (1.40, −0.30)) plus the ridge penalty surface gives the ridge objective; the same OLS surface plus the LASSO penalty surface gives the LASSO objective.]
Penalized Objective Functions
[Figure: penalized objective surfaces at lambda = 10.0 over beta1 (truth 0.5) and beta2 (truth 0.1). Ridge minimum at (0.40, 0.20); LASSO minimum at (0.60, 0.00), with beta2 set exactly to zero.]
Bibliography I
[Agresti, 2015] Agresti, A. (2015).
Foundations of Linear and Generalized Linear Models.
Wiley, Hoboken, NJ, 1st edition.
[Friedman et al., 2010] Friedman, J., Hastie, T., and Tibshirani, R. (2010).
Regularization Paths for Generalized Linear Models via Coordinate Descent.
J Stat Softw, 33(1):1–22.
[Hastie et al., 2016] Hastie, T., Tibshirani, R., and Friedman, J. (2016).
The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
Springer, New York, NY, 2nd edition.
[James et al., 2017] James, G., Witten, D., Hastie, T., and Tibshirani, R. (2017).
An Introduction to Statistical Learning: With Applications in R.
Springer, New York, 1st edition, corrected 7th printing.
[Kutner et al., 2004] Kutner, M. H., Neter, J., Nachtsheim, C. J., and Li, W. (2004).
Applied Linear Statistical Models.
McGraw-Hill Education, Boston, MA, 5th international edition.