1. Regularization and Variable Selection via the Elastic Net
STA315H5S: Advanced Statistical Learning - Winter 2021
Kyuson Lim, Hasaan Soomro
Department of Mathematical and Computational Sciences
University of Toronto
March 30th, 2021
Hasaan, Lim 1 / 24
3. p ≫ n and Grouped Variable Selection Problem
In various problems, there are more predictors than observations (p ≫ n), and there are
groups of variables among which the pairwise correlations are very high.
For example, a typical microarray data set has thousands of predictors (genes) and
fewer than 100 observations. Moreover, genes sharing the same biological pathway
can have high correlations between them - we could consider such genes as forming
a group.
An ideal model in such a scenario would do the following:
Automatically eliminate the trivial predictors and select the best subset of predictors (i.e.
automatic variable selection).
Grouped Selection: Once one predictor is selected within a highly correlated group, then
the whole group is automatically selected.
4. LASSO and Ridge Regression
The LASSO is a penalized least squares method imposing an L1-penalty on the
regression coefficients:
β̂ = arg min_β ||y − Xβ||² + λ1||β||1
The LASSO does both continuous shrinkage and automatic variable selection at the
same time due to the L1-penalty.
Ridge regression minimizes the residual sum of squares (RSS) subject to the
L2-norm:
β̂ = arg min_β ||y − Xβ||² + λ2||β||²
Ridge regression is a continuous shrinkage method that achieves good prediction
performance via the bias-variance trade-off; however, it does not perform variable
selection.
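The contrast between the two penalties is easy to see numerically. A minimal sketch (hypothetical simulated data; scikit-learn's `Lasso` and `Ridge` assumed available, with an arbitrary penalty level): the L1 penalty produces exact zeros, while the L2 penalty only shrinks coefficients toward zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
beta = np.array([3.0, 1.5, 0, 0, 2.0] + [0] * 5)  # sparse true coefficients
y = X @ beta + rng.standard_normal(n)

# L1 penalty: continuous shrinkage plus exact zeros (variable selection)
lasso = Lasso(alpha=0.5).fit(X, y)
# L2 penalty: shrinkage only -- coefficients approach but never hit zero
ridge = Ridge(alpha=0.5).fit(X, y)

print("lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```

The lasso fit sets several of the irrelevant coefficients exactly to zero; the ridge fit keeps all ten non-zero.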
5. Limitations of LASSO and Ridge Regression
LASSO and Ridge perform well in various scenarios, however there are certain
limitations:
Due to the nature of the convex optimization problem, LASSO can select at most n
variables before saturation in the p ≫ n case.
LASSO cannot perform grouped selection - When there is a group of highly correlated
variables, LASSO selects only one variable from the group and does not care which one
is selected.
When n > p and the predictors are highly correlated, the prediction performance of
LASSO is dominated by Ridge regression.
The Elastic Net overcomes these limitations by simultaneously performing automatic
variable selection and continuous shrinkage, while selecting groups of correlated
predictors.
6. Naive Elastic Net - Definition
Given a data set with n observations and p predictors, let y = (y1, . . . , yn)ᵀ be the
response and X = (x1| . . . |xp) the design matrix, with y centered and X
standardized. The Naive Elastic Net optimization problem for non-negative λ1, λ2 is:
β̂ = arg min_β {|y − Xβ|² + λ2|β|² + λ1|β|1}
λ1 → 0 (ridge regression), λ2 → 0 (Lasso).
Alternatively, let α = λ2/(λ1 + λ2). Then the optimization problem is equivalent to:

β̂ = arg min_β |y − Xβ|², subject to (1 − α)|β|1 + α|β|² ≤ t, for some t
α = 1 gives ridge regression and α = 0 gives the Lasso. For α ∈ (0, 1), the Naive
Elastic Net enjoys the characteristics of both Ridge Regression and the Lasso: the
contour plot of the penalty has singularities at the vertices (which produce sparsity)
and strictly convex edges, with the strength of convexity varying with α.
8. Naive Elastic Net - Solution
We can develop a method to solve the naive elastic net problem efficiently, since
minimizing L(λ1, λ2, β) = |y − Xβ|² + λ2|β|² + λ1|β|1 is equivalent to a LASSO-type
optimization problem.
Define the augmented data X* (of dimension (n+p)×p) and y* (of dimension (n+p)×1):

X* = (1 + λ2)^(−1/2) [ X ; √λ2 I ],   y* = [ y ; 0 ]

Let γ = λ1/√(1 + λ2) and β* = √(1 + λ2) β. Then the naive elastic net criterion can be
rearranged as

L(γ, β) = L(γ, β*) = |y* − X*β*|² + γ|β*|1

Let β̂* = arg min_{β*} L(γ, β*); then

β̂ = β̂* / √(1 + λ2)
9. Naive Elastic Net - Solution
The sample size in the augmented problem is n + p and X∗ has rank p, which means
that the naive elastic net can potentially select all p predictors in all situations.
This important property overcomes one of the limitations of the Lasso, which can
select at most n variables before saturation in the p ≫ n case.
Therefore, by transforming the naive elastic net problem into an equivalent LASSO
problem on augmented data, the naive elastic net performs automatic variable
selection in a fashion similar to the LASSO, with the added potential to select all p
predictors.
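The rank property is easy to verify numerically. A minimal sketch (arbitrary simulated data with λ2 = 1): even with p > n, the augmented design X* has full column rank p, because the √λ2·I block alone already has rank p.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 30          # p > n: ordinary lasso would saturate at n variables
lam2 = 1.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Augmented design: stack sqrt(lam2)*I under X, then scale by (1+lam2)^(-1/2)
X_star = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1 + lam2)
y_star = np.concatenate([y, np.zeros(p)])

print("X* shape:", X_star.shape)
print("rank of X*:", np.linalg.matrix_rank(X_star))
```

The rank comes out to p = 30 even though the original design only had n = 10 rows.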
10. Naive Elastic Net - The Grouping Effect
Consider the generic penalization method:
β̂ = arg min_β |y − Xβ|² + λJ(β), where J(β) > 0 for β ≠ 0
Lemma: Assume xi = xj for some i, j ∈ {1, ..., p}.
(i) If J(·) is strictly convex, then β̂i = β̂j for all λ > 0.
(ii) If J(β) = |β|1, then β̂i β̂j ≥ 0, and for any s ∈ [0, 1] another minimizer β̂* is given by
    β̂*k = β̂k if k ≠ i and k ≠ j,
    β̂*i = s · (β̂i + β̂j),
    β̂*j = (1 − s) · (β̂i + β̂j).
Strict convexity guarantees the grouping effect in the extreme situation above.
The Naive Elastic Net penalty is strictly convex, whereas the LASSO penalty is not, so
the LASSO solution need not be unique.
11. Naive Elastic Net - The Grouping Effect
Theorem 1: Given data (y, X) and parameters (λ1, λ2), with the response y centered
and the predictors X standardized, let β̂(λ1, λ2) be the naive elastic net estimate.
Suppose β̂i(λ1, λ2) β̂j(λ1, λ2) > 0 for i, j ∈ {1, ..., p}. Define

Dλ1,λ2(i, j) = (1/|y|1) |β̂i(λ1, λ2) − β̂j(λ1, λ2)|.

Then

Dλ1,λ2(i, j) ≤ (1/λ2) √{2(1 − ρ)}, where ρ = xiᵀxj (the sample correlation).

Dλ1,λ2(i, j) describes the difference between the coefficient paths of predictors i and j.
Notice that as ρ → 1, the difference between β̂i and β̂j converges to 0.
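The grouping effect can be illustrated with a toy example (hypothetical data; scikit-learn's `ElasticNet` is used, whose `alpha`/`l1_ratio` parametrization differs from the paper's (λ1, λ2) but mixes the same two penalties): two nearly identical predictors receive nearly identical elastic net coefficients.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(2)
n = 100
z = rng.standard_normal(n)
# Two almost-identical predictors (sample correlation near 1) plus noise columns
x1 = z + 0.01 * rng.standard_normal(n)
x2 = z + 0.01 * rng.standard_normal(n)
X = np.column_stack([x1, x2, rng.standard_normal((n, 3))])
y = 3 * z + 0.5 * rng.standard_normal(n)

enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Grouping effect: the strictly convex L2 part pulls the correlated pair together
print("elastic net gap:", abs(enet.coef_[0] - enet.coef_[1]))
print("lasso gap:      ", abs(lasso.coef_[0] - lasso.coef_[1]))
```

The elastic net splits the signal almost evenly between the correlated pair, consistent with the ρ → 1 limit in Theorem 1, while the lasso tends to put its weight on whichever of the two it happens to pick.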
12. Limitations of the Naive Elastic Net
While the Naive Elastic Net can select more than n predictors in the p ≫ n case and
perform grouped selection, empirical evidence shows that it performs unsatisfactorily
unless it is close to the LASSO or Ridge.
The Naive Elastic Net incurs double shrinkage, which introduces extra unnecessary
bias without reducing variance, compared with LASSO or Ridge shrinkage alone.
Two-stage procedure: For each fixed λ2, we first find the Ridge regression coefficients,
and then we do the LASSO-type shrinkage along the LASSO coefficient solution paths.
The Elastic Net overcomes double shrinkage and improves the prediction
performance of the Naive Elastic Net.
13. The Elastic Net Estimate
Given data (y, X), penalty parameters (λ1, λ2) and augmented data (y∗, X∗), the naive
elastic net solves a LASSO-type problem
β̂* = arg min_{β*} |y* − X*β*|² + (λ1/√(1 + λ2)) |β*|1
The elastic net (corrected) estimate of β is defined by

β̂(elastic net) = √(1 + λ2) β̂*

Recall that β̂(naive elastic net) = β̂*/√(1 + λ2), and thus

β̂(elastic net) = (1 + λ2) β̂(naive elastic net)

The Elastic Net coefficient is a rescaled Naive Elastic Net coefficient.
14. The Elastic Net Estimate
Previously, β̂(elastic net) = (1 + λ2) β̂(naive elastic net) was defined to undo the
double shrinkage (ridge followed by LASSO) imposed by the naive elastic net penalty.
For the sample correlation matrix Σ̂, the rescaling corresponds to replacing Σ̂ by

Σ̂λ2 = (1/(1 + λ2)) Σ̂ + (λ2/(1 + λ2)) I,

which shrinks the correlations among the predictors toward zero.
In this notation, the ridge regression coefficients are

β̂(ridge) = (1/(1 + λ2)) Σ̂λ2⁻¹ Xᵀy
Theorem 2. Given data (y, X) and (λ1, λ2), the elastic net estimates β̂(enet) are given by

β̂(enet) = arg min_β { βᵀ ((XᵀX + λ2 I)/(1 + λ2)) β − 2yᵀXβ + λ1|β|1 },

as an explicit optimization problem. For an orthogonal design (XᵀX = I), the shrunken
matrix (XᵀX + λ2 I)/(1 + λ2) reduces to I.
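A one-line numerical check of this decorrelation (toy 2×2 correlation matrix; λ2 = 2 chosen arbitrarily): the off-diagonal correlation shrinks by the factor 1/(1 + λ2) while the diagonal stays at 1.

```python
import numpy as np

lam2 = 2.0
# A sample correlation matrix with a highly correlated pair of predictors
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])

# The elastic net implicitly replaces Sigma by a shrunken version
Sigma_lam2 = Sigma / (1 + lam2) + lam2 / (1 + lam2) * np.eye(2)

print(Sigma_lam2)   # off-diagonal 0.9 shrinks to 0.9/3 = 0.3; diagonal stays 1
```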
15. Elastic Net Computation - LARS-EN Algorithm
The LARS (Least Angle Regression) algorithm exploits the fact that the LASSO
solution path grows piecewise linearly in a predictable manner, and computes the
whole LASSO path at the computational cost of a single OLS fit.
For each fixed λ2, the LARS-EN algorithm uses a single OLS fit to solve for the whole
elastic net solution path in λ1 (equivalently, indexed by s or by the step number k).
At step k, an efficient update or downdate of the Cholesky factorization of

GAk = X*ᵀAk X*Ak = (1/(1 + λ2)) (XᵀAk XAk + λ2 I)

is maintained for the active set Ak of non-zero coefficients, so that all required
quantities are computed without explicitly forming X*.
The LARS-EN algorithm sequentially updates the elastic net fits. Empirical evidence
from real data and simulations shows that stopping LARS-EN early gives near-optimal
results.
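scikit-learn computes elastic net paths by coordinate descent rather than LARS-EN, but the resulting solution-path object is analogous; a minimal sketch (hypothetical simulated data) of tracing the whole path in the L1 penalty for a fixed L1/L2 mix:

```python
import numpy as np
from sklearn.linear_model import enet_path

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 8))
y = X @ np.array([3, 1.5, 0, 0, 2, 0, 0, 0.0]) + rng.standard_normal(50)

# Entire solution path over a grid of L1 penalty levels;
# l1_ratio plays the role of the fixed lambda2 mixing in LARS-EN
alphas, coefs, _ = enet_path(X, y, l1_ratio=0.5, n_alphas=20)

print("penalty grid:", alphas.shape)    # 20 penalty values
print("coef paths:  ", coefs.shape)     # one row of 20 values per predictor
```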
16. Simulation
The simulated data come from the true model: y = Xβ + σε, ε ∼ N(0, 1).
Each simulated data set is divided into a training set, a validation set, and a test set.
Models were fitted on the training set only, and the validation data were used to select
the tuning parameters.
The test error (the mean-squared error) was computed on the test set.
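The protocol above can be sketched in a few lines (hypothetical data and an arbitrary small grid of tuning values; scikit-learn's `ElasticNet` stands in for the fitted models): fit on the training set, pick the tuning parameter on the validation set, and report MSE on the test set.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)
n_tr, n_val, n_te, p = 20, 20, 200, 8
beta = np.array([3, 1.5, 0, 0, 2, 0, 0, 0.0])
X = rng.standard_normal((n_tr + n_val + n_te, p))
y = X @ beta + 3.0 * rng.standard_normal(len(X))

# Training / validation / test split
X_tr, X_val, X_te = np.split(X, [n_tr, n_tr + n_val])
y_tr, y_val, y_te = np.split(y, [n_tr, n_tr + n_val])

# Fit on training data only; select the tuning parameter on the validation set
best = min(
    (ElasticNet(alpha=a, l1_ratio=0.5).fit(X_tr, y_tr) for a in [0.01, 0.1, 1.0]),
    key=lambda m: mean_squared_error(y_val, m.predict(X_val)),
)

# Report the test error on the held-out test set
test_mse = mean_squared_error(y_te, best.predict(X_te))
print("test MSE:", round(test_mse, 2))
```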
17. Simulation Example 1 and 2
Simulation example 1: 50 data sets were simulated, each with 20/20/200
(training/validation/test) observations and 8 predictors:
β = (3, 1.5, 0, 0, 2, 0, 0, 0), σ = 3 and cov(xi, xj) = (0.5)|i−j| for all i, j = 1, ..., 8.
Simulation example 2: Same as example 1, except βj = 0.85 for all j.
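Example 1's design is straightforward to reproduce; a sketch of one simulated training set (numpy only; seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(315)
n, p, sigma = 20, 8, 3.0
beta = np.array([3, 1.5, 0, 0, 2, 0, 0, 0], dtype=float)

# AR(1)-style design covariance: cov(x_i, x_j) = 0.5^|i-j|
idx = np.arange(p)
cov = 0.5 ** np.abs(idx[:, None] - idx[None, :])

X = rng.multivariate_normal(np.zeros(p), cov, size=n)
y = X @ beta + sigma * rng.standard_normal(n)

print(X.shape, y.shape)
```

For example 2 the same script applies with `beta` replaced by `np.full(8, 0.85)`.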
18. Simulation Example 3
Simulation example 3: 50 data sets were simulated consisting of 100/100/400
observations and 40 predictors:
β = (0, ..., 0, 2, ..., 2, 0, ..., 0, 2, ..., 2), with each of the four blocks of length 10,
σ = 15 and cor(xi, xj) = 0.5 for all i, j = 1, ..., 40.
19. Simulation Example 4
Simulation example 4: 50 data sets were simulated consisting of 50/50/400
observations and 40 predictors:
β = (3, ..., 3, 0, ..., 0), with 15 threes followed by 25 zeros, and σ = 15.
xi = Z1 + ε^x_i, Z1 ∼ N(0, 1), i = 1, ..., 5
xi = Z2 + ε^x_i, Z2 ∼ N(0, 1), i = 6, ..., 10
xi = Z3 + ε^x_i, Z3 ∼ N(0, 1), i = 11, ..., 15
xi i.i.d. ∼ N(0, 1), i = 16, ..., 40
ε^x_i i.i.d. ∼ N(0, 0.01), i = 1, ..., 15
20. Simulated Examples - Median MSE
Method Ex.1 Ex.2 Ex.3 Ex.4
Ridge 4.49 (0.46) 2.84 (0.27) 39.5 (1.80) 64.5 (4.78)
Lasso 3.06 (0.31) 3.87 (0.38) 65.0 (2.82) 46.6 (3.96)
Elastic Net 2.51 (0.29) 3.16 (0.27) 56.6 (1.75) 34.5 (1.64)
Naive Elastic Net 5.70 (0.41) 2.73 (0.23) 41.0 (2.13) 45.9 (3.72)
Table: Median MSE for the simulated examples and 4 methods on 50 simulations
Elastic Net is more accurate than the LASSO in all four examples, even when the
LASSO is significantly more accurate than Ridge regression.
The Naive Elastic Net performs very poorly in Example 1, with the highest median
mean-squared error. In Examples 2 and 3 it behaves very similarly to Ridge
regression, and in Example 4 it behaves similarly to the LASSO.
21. Simulated Examples - Median MSE
The box plots compare the overall prediction performance of the LASSO, Ridge,
Elastic Net and Naive Elastic Net across the four examples.
22. Simulated Examples - Variable Selection
Method Ex. 1 Ex. 2 Ex. 3 Ex. 4
Lasso 5 6 24 11
Elastic Net 6 7 27 16
Table: Median number of non-zero coefficients
Elastic Net selects more predictors than the LASSO due to the grouping effect.
Elastic Net behaves like the ideal model in Example 4, where grouped selection is
needed.
Therefore, the Elastic Net has the additional ability to perform grouped variable
selection, which makes it a better variable selection method than the LASSO.
23. Conclusion
The LASSO can select at most n predictors in the p ≫ n case and cannot perform
grouped selection. Furthermore, Ridge regression usually has better prediction
performance than the LASSO when there are high correlations between predictors in
the n > p case.
The Elastic Net can produce a sparse model with good prediction accuracy, while
selecting group(s) of strongly correlated predictors. It can also potentially select all p
predictors in all situations.
A new algorithm called LARS-EN can be used for computing elastic net regularization
paths efficiently, similar to the LARS algorithm for LASSO.
The Elastic Net has two tuning parameters as opposed to one tuning parameter like
the LASSO, which can be selected using a training and validation set.
Simulation results indicate that the Elastic Net dominates the LASSO, especially
under collinearity.