Regularization and Variable Selection via the Elastic Net
STA315H5S: Advanced Statistical Learning - Winter 2021
Lim, Kyuson Hasaan Soomro
Department of Mathematical and Computational Sciences
University of Toronto
March 30th, 2021
Outline
1 Introduction/Motivation
2 Naive Elastic Net
3 Elastic Net
4 Simulations
5 Conclusion
p ≫ n and Grouped Variable Selection Problem
In various problems, there are more predictors than the number of observations
(p ≫ n), and there are groups of variables among which the pairwise correlations are
very high.
For example, a typical microarray data set has thousands of predictors (genes) and
fewer than 100 observations. Moreover, genes sharing the same biological pathway
can have high correlations between them - we could consider such genes as forming
a group.
An ideal model in such a scenario would do the following:
Automatically eliminate the trivial predictors and select the best subset of predictors (i.e.,
automatic variable selection).
Grouped Selection: Once one predictor is selected within a highly correlated group, then
the whole group is automatically selected.
LASSO and Ridge Regression
The LASSO is a penalized least squares method imposing an L1-penalty on the
regression coefficients:
β̂ = arg min_β ||y − Xβ||² + λ1||β||1
The LASSO does both continuous shrinkage and automatic variable selection at the
same time due to the L1-penalty.
Ridge regression minimizes the residual sum of squares (RSS) plus an L2-penalty on
the regression coefficients:
β̂ = arg min_β ||y − Xβ||² + λ2||β||²
Ridge regression is a continuous shrinkage method that achieves good prediction
performance via the bias-variance trade-off; however, it does not perform variable
selection.
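As a rough illustration of the contrast, here is a minimal sketch (the synthetic data and penalty values are illustrative assumptions, not taken from the slides) fitting both penalties with scikit-learn: the LASSO sets some coefficients exactly to zero, while ridge shrinks all of them without zeroing any. Note that scikit-learn's Lasso scales the squared-error term by 1/(2n), so its alpha is only proportional to λ1.

```python
# Minimal sketch (illustrative data and penalties): LASSO zeroes out coefficients,
# ridge only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0, 0, 0])
y = X @ beta_true + rng.standard_normal(n)

lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty: several coefficients exactly 0
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty: all coefficients shrunk, none exactly 0

print("LASSO non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))
print("Ridge non-zero coefficients:", int(np.sum(ridge.coef_ != 0)))
```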
Limitations of LASSO and Ridge Regression
LASSO and Ridge perform well in various scenarios, however there are certain
limitations:
Due to the nature of the convex optimization problem, the LASSO can select at most n
variables before it saturates in the p ≫ n case.
The LASSO cannot perform grouped selection - when there is a group of highly correlated
variables, the LASSO selects only one variable from the group and does not care which one
is selected.
When n > p and the predictors are highly correlated, the prediction performance of the
LASSO is dominated by Ridge regression.
The Elastic Net overcomes these limitations by simultaneously performing automatic
variable selection and continuous shrinkage, while selecting groups of correlated
predictors.
Naive Elastic Net - Definition
Given a data set with n observations and p predictors: let y = (y1, . . . , yn)ᵀ be the
response and X = (x1|. . . |xp) be the design matrix, with y centered and X
standardized. The Naive Elastic Net optimization problem for non-negative λ1, λ2 is:
β̂ = arg min_β {|y − Xβ|² + λ2|β|² + λ1|β|1}
λ1 → 0 (ridge regression), λ2 → 0 (Lasso).
Alternatively, let α = λ2/(λ1 + λ2). Then, the optimization problem is equivalent to:
β̂ = arg min_β |y − Xβ|², subject to (1 − α)|β|1 + α|β|² ≤ t, for some t
α = 1 (ridge regression), α = 0 (Lasso). When α ∈ [0, 1), the Naive Elastic Net
enjoys the characteristics of both Ridge Regression and the Lasso.
The contour plot on the next slide shows that the elastic net penalty has singularities at the
vertices (like the LASSO) and strictly convex edges (unlike the LASSO); the strength of the
convexity varies with α.
Naive Elastic Net - Definition
Figure: contours in the (β0, β1) plane of the ridge penalty, the LASSO penalty, and the
elastic net penalty with α = 0.5.
Naive Elastic Net - Solution
We can solve the naive elastic net problem efficiently, since minimizing
L(λ1, λ2, β) = |y − Xβ|² + λ2|β|² + λ1|β|1 is equivalent to a LASSO-type
optimization problem.
X* = (1 + λ2)^(−1/2) [ X ; √λ2 I ]   (an (n + p) × p matrix),   y* = [ y ; 0 ]   (an (n + p) × 1 vector),
where [ A ; B ] stacks A on top of B.
Let γ = λ1/√(1 + λ2) and β* = √(1 + λ2) β. Then, the naive elastic net criterion can be
rearranged as
L(γ, β) = L(γ, β*) = |y* − X*β*|² + γ|β*|1
Let β̂* = arg min_β* L(γ, β*); then
β̂ = β̂*/√(1 + λ2)
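The reduction can be checked numerically. The sketch below (synthetic data; λ1 and λ2 chosen arbitrarily) builds the augmented (X*, y*) and solves the equivalent LASSO problem with scikit-learn; because scikit-learn's Lasso minimizes (1/(2m))||y − Xw||² + α||w||1, the penalty γ is passed as α = γ/(2m) with m = n + p.

```python
# Sketch: naive elastic net via an equivalent LASSO on augmented data
# (synthetic data; lam1 and lam2 are illustrative).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 30, 8
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)            # standardize predictors
y = X @ np.array([3, 1.5, 0, 0, 2, 0, 0, 0]) + rng.standard_normal(n)
y = y - y.mean()                                    # center the response

lam1, lam2 = 1.0, 1.0
X_star = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1 + lam2)   # (n+p) x p
y_star = np.concatenate([y, np.zeros(p)])                                # (n+p) x 1
gamma = lam1 / np.sqrt(1 + lam2)

m = n + p
beta_star = Lasso(alpha=gamma / (2 * m), fit_intercept=False,
                  max_iter=100000).fit(X_star, y_star).coef_
beta_naive = beta_star / np.sqrt(1 + lam2)          # naive elastic net estimate
print(np.round(beta_naive, 3))
```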
Naive Elastic Net - Solution
The sample size in the augmented problem is n + p and X∗ has rank p, which means
that the naive elastic net can potentially select all p predictors in all situations.
This important property overcomes one of the limitations of the LASSO, which can
select at most n variables before it saturates in the p ≫ n case.
Therefore, by transforming the naive elastic net problem into an equivalent LASSO
problem on augmented data, the naive elastic net can perform automatic variable
selection in a fashion similar to the LASSO, along with the potential to select all p
predictors.
Naive Elastic Net - The Grouping Effect
Consider the generic penalization method:
β̂ = arg min_β |y − Xβ|² + λJ(β),   where J(β) > 0 for β ≠ 0
Lemma: Assume that xi = xj for some i, j ∈ {1, ..., p}.
(i) If J(·) is strictly convex, then β̂i = β̂j for all λ > 0.
(ii) If J(β) = |β|1, then β̂i β̂j ≥ 0, and for any s ∈ [0, 1] the vector β̂* with
β̂*_k = β̂k if k ≠ i and k ≠ j,   β̂*_i = s(β̂i + β̂j),   β̂*_j = (1 − s)(β̂i + β̂j)
is another minimizer.
Strict convexity guarantees the grouping effect in the extreme situation above.
The Naive Elastic Net penalty is strictly convex, whereas the LASSO penalty is not
strictly convex, so the LASSO solution need not be unique.
Naive Elastic Net - The Grouping Effect
Theorem 1: Given data (y, X) and parameters (λ1, λ2), the response y is centered
and the predictors X are standardized. Let β̂ be the naive elastic net estimate.
Suppose β̂i(λ1, λ2), β̂j(λ1, λ2) > 0 for i, j ∈ {1, ..., p}. Define
D_{λ1,λ2}(i, j) = (1/|y|1) |β̂i(λ1, λ2) − β̂j(λ1, λ2)|.   Then
D_{λ1,λ2}(i, j) ≤ (1/λ2) √(2(1 − ρ)),   where ρ = xiᵀxj (the sample correlation).
D_{λ1,λ2}(i, j) describes the difference between the coefficient paths of predictors i and j.
Notice that as ρ → 1, the difference between β̂i and β̂j converges to 0.
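A small numerical sketch of the grouping effect (synthetic data, illustrative penalty levels; scikit-learn's ElasticNet uses the alpha/l1_ratio parameterization of the naive penalty rather than (λ1, λ2)): with two nearly identical predictors, the combined L1/L2 penalty assigns them nearly equal coefficients, whereas the pure LASSO tends to keep only one.

```python
# Sketch: grouping effect with two nearly identical predictors.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(2)
n = 100
z = rng.standard_normal(n)
x1 = z + 0.01 * rng.standard_normal(n)
x2 = z + 0.01 * rng.standard_normal(n)     # x2 is almost a copy of x1
x3 = rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])
y = 3 * z + x3 + rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mixed L1/L2 penalty
print("LASSO:      ", np.round(lasso.coef_, 2))  # typically one of x1, x2 is dropped
print("Elastic net:", np.round(enet.coef_, 2))   # x1 and x2 get nearly equal weights
```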
Limitations of the Naive Elastic Net
While the Naive Elastic Net can select more than n predictors in the p ≫ n case and
perform grouped selection, empirical evidence shows that it performs unsatisfactorily
unless it is very close to either the LASSO or Ridge.
The Naive Elastic Net incurs a double amount of shrinkage, which introduces extra,
unnecessary bias without further reducing variance, compared with LASSO or Ridge
shrinkage alone.
Two-stage procedure: For each fixed λ2, we first find the Ridge regression coefficients,
and then we do the LASSO-type shrinkage along the LASSO coefficient solution paths.
The Elastic Net overcomes double shrinkage and improves the prediction
performance of the Naive Elastic Net.
The Elastic Net Estimate
Given data (y, X), penalty parameters (λ1, λ2) and augmented data (y∗, X∗), the naive
elastic net solves a LASSO-type problem
β̂* = arg min_β* |y* − X*β*|² + (λ1/√(1 + λ2)) |β*|1
The elastic net (corrected) estimate of β is defined by
β̂(elastic net) = √(1 + λ2) β̂*
Recall that β̂(naive elastic net) = {1/√(1 + λ2)} β̂*, and thus
β̂(elastic net) = (1 + λ2) β̂(naive elastic net)
The Elastic Net coefficient is a rescaled Naive Elastic Net coefficient.
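Continuing the augmented-data sketch from the naive elastic net slide (reusing beta_naive, beta_star, and lam2 defined there), the correction is a one-line rescaling:

```python
# Corrected elastic net estimate: rescale the naive estimate by (1 + lam2).
beta_enet = (1 + lam2) * beta_naive
# equivalently: beta_enet = np.sqrt(1 + lam2) * beta_star
```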
The Elastic Net Estimate
As defined above, β̂(elastic net) = (1 + λ2) β̂(naive elastic net); this rescaling undoes the
two stages of shrinkage (ridge and LASSO) incurred by the naive elastic net penalty.
For the sample correlation matrix Σ̂, the shrunken matrix
Σ̂_{λ2} = (1/(1 + λ2)) Σ̂ + (λ2/(1 + λ2)) I
pulls the correlation matrix of the predictors toward the identity.
In terms of Σ̂_{λ2}, the ridge coefficients can be written as
β̂(ridge) = (1/(1 + λ2)) Σ̂_{λ2}^{−1} Xᵀy
Theorem 2. Given data (y, X), β̂(elastic net) solves the explicit optimization problem
β̂(elastic net) = arg min_β βᵀ ((XᵀX + λ2 I)/(1 + λ2)) β − 2yᵀXβ + λ1|β|1
Thus the elastic net replaces Σ̂ = XᵀX in the LASSO criterion with the shrunken Σ̂_{λ2};
for an orthogonal design (XᵀX = I), Σ̂_{λ2} = I and the criterion reduces to the LASSO criterion.
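A tiny sketch (illustrative correlation values) of the decorrelation effect of Σ̂_{λ2}: off-diagonal correlations are scaled down by 1/(1 + λ2) while the diagonal stays at 1.

```python
# Sketch: Sigma_hat_lam2 shrinks the sample correlation matrix toward the identity.
import numpy as np

Sigma_hat = np.array([[1.0, 0.9, 0.3],
                      [0.9, 1.0, 0.2],
                      [0.3, 0.2, 1.0]])   # an illustrative sample correlation matrix
lam2 = 1.0
Sigma_lam2 = Sigma_hat / (1 + lam2) + (lam2 / (1 + lam2)) * np.eye(3)
print(Sigma_lam2)   # off-diagonals halved (0.9 -> 0.45, 0.3 -> 0.15); diagonal stays 1
```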
Elastic Net Computation - LARS-EN Algorithm
For a fixed λ2, the elastic net solution path is piecewise linear in λ1, so it can be traced
efficiently, just as the LARS (Least Angle Regression) algorithm traces the LASSO path.
The idea of the LARS algorithm is to compute the entire LASSO solution path at roughly
the computational cost of a single OLS fit.
For each fixed λ2, the LARS-EN algorithm likewise computes the entire elastic net solution
path, indexed by λ1 (equivalently by s, or by the step number k), at the cost of a single
OLS fit.
At the kth step, an efficient update or downdate of the Cholesky factorization of
G_{A_k} = X*ᵀ_{A_k} X*_{A_k} = (1/(1 + λ2)) (Xᵀ_{A_k} X_{A_k} + λ2 I)
is maintained for the active (non-zero) coefficients, so the augmented matrix X* never has
to be formed explicitly.
The LARS-EN algorithm sequentially updates the elastic net fits. Empirically, results on real
data and in simulations show that early stopping of LARS-EN yields the best results.
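The paper's LARS-EN implementation (with its Cholesky updating) is not sketched here, but for a fixed λ2 the same path can be obtained, less efficiently, by running the LASSO variant of LARS on the augmented data from the earlier reduction. A sketch with synthetic data:

```python
# Sketch (not the paper's LARS-EN code): trace the elastic net path for a fixed lam2
# by running LARS (LASSO modification) on the augmented data (X*, y*).
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(3)
n, p, lam2 = 30, 8, 1.0
X = rng.standard_normal((n, p))
y = X @ np.array([3, 1.5, 0, 0, 2, 0, 0, 0]) + rng.standard_normal(n)

X_star = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1 + lam2)
y_star = np.concatenate([y - y.mean(), np.zeros(p)])

alphas, active, coefs_star = lars_path(X_star, y_star, method="lasso")
coefs_enet = np.sqrt(1 + lam2) * coefs_star   # (1 + lam2) * (coefs_star / sqrt(1 + lam2))
print(coefs_enet.shape)                       # (p, number of breakpoints along the path)
```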
Simulation
The simulated data come from the true model: y = Xβ + σε, ε ∼ N(0, 1).
Each simulated data set is divided into a training set, a validation set, and a test set.
Models were fitted on the training set only, and the validation data were used to select
the tuning parameters.
The test error (the mean-squared error) was computed on the test set.
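A sketch of this protocol (the data generator and tuning grid below are illustrative stand-ins, not the exact settings of the paper): fit on the training set, choose the tuning parameters by validation-set MSE, then report test-set MSE.

```python
# Sketch of the train/validate/test protocol with an elastic net fit.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
p = 8
beta = np.array([3, 1.5, 0, 0, 2, 0, 0, 0])

def simulate(n):
    X = rng.standard_normal((n, p))
    return X, X @ beta + 3 * rng.standard_normal(n)

(X_tr, y_tr), (X_va, y_va), (X_te, y_te) = simulate(20), simulate(20), simulate(200)

best = None
for alpha in [0.01, 0.1, 0.5, 1.0]:            # illustrative tuning grid
    for l1_ratio in [0.1, 0.5, 0.9]:
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=100000).fit(X_tr, y_tr)
        val_mse = np.mean((model.predict(X_va) - y_va) ** 2)
        if best is None or val_mse < best[0]:
            best = (val_mse, model)

print("test MSE:", np.mean((best[1].predict(X_te) - y_te) ** 2))
```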
Simulation Example 1 and 2
Simulation example 1: 50 data sets were simulated consisting of 20/20/200
observations and 8 predictors:
β = (3, 1.5, 0, 0, 2, 0, 0, 0), σ = 3 and cov(xi, xj) = (0.5)^|i−j| for all i, j = 1, ..., 8.
Simulation example 2: Same as example 1, except βj = 0.85 for all j.
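A sketch of the data generation for example 1 (the exact generator in the paper may differ in details such as seeding):

```python
# Sketch: one training set as in simulation example 1
# (8 predictors, pairwise correlation 0.5**|i-j|, sigma = 3).
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 8
beta = np.array([3, 1.5, 0, 0, 2, 0, 0, 0])
cov = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

X = rng.multivariate_normal(np.zeros(p), cov, size=n)
y = X @ beta + 3 * rng.standard_normal(n)
```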
Simulation Example 3
Simulation example 3: 50 data sets were simulated consisting of 100/100/400
observations and 40 predictors:
β = (0, ..., 0, 2, ..., 2, 0, ..., 0, 2, ..., 2), where each block has length 10, σ = 15, and
cor(xi, xj) = 0.5 for all i ≠ j, i, j = 1, ..., 40.
Simulation Example 4
Simulation example 4: 50 data sets were simulated consisting of 50/50/400
observations and 40 predictors:
β = (3, ..., 3, 0, ..., 0), with 15 threes followed by 25 zeros, and σ = 15.
xi = Z1 + ε^x_i, Z1 ∼ N(0, 1), i = 1, ..., 5.
xi = Z2 + ε^x_i, Z2 ∼ N(0, 1), i = 6, ..., 10.
xi = Z3 + ε^x_i, Z3 ∼ N(0, 1), i = 11, ..., 15.
xi i.i.d. ∼ N(0, 1), i = 16, ..., 40,
ε^x_i i.i.d. ∼ N(0, 0.01), i = 1, ..., 15.
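A sketch of the grouped design in example 4 (variable names are my own; the generator follows the displayed equations):

```python
# Sketch: one training set as in simulation example 4
# (three groups of five nearly identical predictors + 25 pure-noise predictors).
import numpy as np

rng = np.random.default_rng(6)
n = 50
Z = rng.standard_normal((n, 3))                                         # Z1, Z2, Z3
grouped = np.repeat(Z, 5, axis=1) + 0.1 * rng.standard_normal((n, 15))  # sd 0.1 => var 0.01
noise = rng.standard_normal((n, 25))                                    # x16, ..., x40
X = np.hstack([grouped, noise])                                         # 40 predictors
beta = np.concatenate([np.full(15, 3.0), np.zeros(25)])
y = X @ beta + 15 * rng.standard_normal(n)
```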
Simulated Examples - Median MSE
Method Ex.1 Ex.2 Ex.3 Ex.4
Ridge 4.49 (0.46) 2.84 (0.27) 39.5 (1.80) 64.5 (4.78)
Lasso 3.06 (0.31) 3.87 (0.38) 65.0 (2.82) 46.6 (3.96)
Elastic Net 2.51 (0.29) 3.16 (0.27) 56.6 (1.75) 34.5 (1.64)
Naive Elastic Net 5.70 (0.41) 2.73 (0.23) 41.0 (2.13) 45.9 (3.72)
Table: Median MSE (standard errors in parentheses) for the four methods over the 50 simulated data sets of each example
Elastic Net is more accurate than the LASSO in all four examples, even when the
LASSO is significantly more accurate than Ridge regression.
The Naive Elastic Net performs very poorly in Example 1, where it has the highest
mean-squared error. In Examples 2 and 3 it behaves very similarly to Ridge regression, and
in Example 4 it behaves similarly to the LASSO.
Simulated Examples - Median MSE
Box plots compare the overall prediction performance of the LASSO, Ridge, the Elastic
Net, and the Naive Elastic Net across the four examples.
Simulated Examples - Variable Selection
Method Ex. 1 Ex. 2 Ex. 3 Ex. 4
Lasso 5 6 24 11
Elastic Net 6 7 27 16
Table: Median number of non-zero coefficients
Elastic Net selects more predictors than the LASSO due to the grouping effect.
Elastic Net behaves like the ideal model in Example 4, where grouped selection is
needed.
Therefore, the Elastic Net has the additional ability to perform grouped variable
selection, which makes it a better variable selection method than the LASSO.
Conclusion
The LASSO can select at most n predictors in the p ≫ n case and cannot perform
grouped selection. Furthermore, Ridge regression usually has better prediction
performance than the LASSO when there are high correlations between predictors in
the n > p case.
The Elastic Net can produce a sparse model with good prediction accuracy, while
selecting group(s) of strongly correlated predictors. It can also potentially select all p
predictors in all situations.
A new algorithm called LARS-EN can be used for computing elastic net regularization
paths efficiently, similar to the LARS algorithm for LASSO.
The Elastic Net has two tuning parameters rather than the LASSO's one; they can be
selected using a training and validation set.
Simulation results indicate that the Elastic Net dominates the LASSO, especially
under collinearity.
References
Hui Zou and Trevor Hastie (2005). Regularization and Variable Selection via the Elastic Net.
Journal of the Royal Statistical Society, Series B, 67(2), 301-320.