# 3 Regularization & Penalized Regression

Arthur Charpentier (Université du Québec à Montréal)
Machine Learning & Econometrics
SIDE Summer School - July 2019
@freakonometrics - freakonometrics.hypotheses.org

## Linear Model and Variable Selection

Let $s$ denote a subset of $\{0, 1, \cdots, p\}$, with cardinality $|s|$.
$X_s$ is the matrix with columns $x_j$ where $j \in s$.
Consider the model $Y = X_s\beta_s + \eta$, so that $\widehat{\beta}_s = (X_s^\top X_s)^{-1} X_s^\top y$.
In general, $\widehat{\beta}_s \neq (\widehat{\beta})_s$.
$R^2$ is usually not a good measure since $R^2(s) \leq R^2(t)$ when $s \subset t$.
Some use the adjusted $R^2$, $\bar{R}^2(s) = 1 - \dfrac{n-1}{n-|s|}\left(1 - R^2(s)\right)$.
The mean squared error is
$$\text{mse}(s) = \mathbb{E}\left(\|X\beta - X_s\widehat{\beta}_s\|^2\right) = \mathbb{E}\left(\text{RSS}(s)\right) - n\sigma^2 + 2|s|\sigma^2$$
Define Mallows' $C_p$ as
$$C_p(s) = \frac{\text{RSS}(s)}{\sigma^2} - n + 2|s|$$
Rule of thumb: the model with variables $s$ is valid if $C_p(s) \leq |s|$.

## Linear Model and Variable Selection

In a linear model,
$$\log L(\beta, \sigma^2) = -\frac{n}{2}\log \sigma^2 - \frac{n}{2}\log(2\pi) - \frac{1}{2\sigma^2}\|y - X\beta\|^2$$
and
$$\log L(\widehat{\beta}_s, \widehat{\sigma}_s^2) = -\frac{n}{2}\log\frac{\text{RSS}(s)}{n} - \frac{n}{2}\left[1 + \log(2\pi)\right]$$
It is necessary to penalize overly complex models.
Akaike's AIC: $\ \text{AIC}(s) = \dfrac{n}{2}\log\dfrac{\text{RSS}(s)}{n} + \dfrac{n}{2}\left[1 + \log(2\pi)\right] + 2|s|$
Schwarz's BIC: $\ \text{BIC}(s) = \dfrac{n}{2}\log\dfrac{\text{RSS}(s)}{n} + \dfrac{n}{2}\left[1 + \log(2\pi)\right] + |s|\log n$
An exhaustive search over all $2^{p+1}$ models is too complicated.
Stepwise procedures, forward or backward, are not very stable nor satisfactory.

## Linear Model and Variable Selection

For variable selection, use the classical stats::step function, or leaps::regsubsets
for best subset, forward stepwise and backward stepwise selection.
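
A minimal sketch of both approaches on simulated data (the data, variable names and seed are illustrative, not from the slides):

```r
library(leaps)                     # for regsubsets()
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("x", 1:5)))
y <- 1 + 2 * X[, 1] - X[, 3] + rnorm(n)
df <- data.frame(y, X)

# stepwise selection (AIC by default; k = log(n) would give BIC)
full <- lm(y ~ ., data = df)
step_aic <- step(full, direction = "backward", trace = 0)

# best subset selection, up to 5 variables
best <- regsubsets(y ~ ., data = df, nvmax = 5)
summary(best)$bic                  # BIC of the best model of each size
```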

For leave-one-out cross validation, we can write
$$\frac{1}{n}\sum_{i=1}^n \left(y_i - \widehat{y}_{(i)}\right)^2 = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \widehat{y}_i}{1 - H_{i,i}}\right)^2$$
Heuristically, $(y_i - \widehat{y}_i)^2$ underestimates the true prediction error.
The underestimation is high if the correlation between $y_i$ and $\widehat{y}_i$ is high.
One can use $\text{Cov}[\widehat{y}, y]$, e.g. in Mallows' $C_p$,
$$C_p = \frac{1}{n}\sum_{i=1}^n (y_i - \widehat{y}_i)^2 + \frac{2}{n}\sigma^2 p, \quad\text{where } p = \frac{1}{\sigma^2}\text{trace}\left[\text{Cov}(\widehat{y}, y)\right].$$
With Gaussian errors, AIC and Mallows' $C_p$ are asymptotically equivalent.
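
A quick numerical check of the leave-one-out identity above, a sketch on simulated data (the variable names are illustrative):

```r
set.seed(42)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

# left-hand side: refit n times, leaving one observation out each time
loo <- sapply(1:n, function(i) {
  f <- lm(y ~ x, subset = -i)
  y[i] - predict(f, newdata = data.frame(x = x[i]))
})
mean(loo^2)

# right-hand side: one fit, residuals rescaled by the hat matrix diagonal
h <- hatvalues(fit)
mean((residuals(fit) / (1 - h))^2)   # same value
```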

## Penalized Inference and Shrinkage

Consider a parametric model, with true (unknown) parameter $\theta$; then
$$\text{mse}(\widehat{\theta}) = \mathbb{E}\left[(\widehat{\theta} - \theta)^2\right] = \underbrace{\mathbb{E}\left[(\widehat{\theta} - \mathbb{E}[\widehat{\theta}])^2\right]}_{\text{variance}} + \underbrace{\left(\mathbb{E}[\widehat{\theta}] - \theta\right)^2}_{\text{bias}^2}$$
One can think of a shrinkage of an unbiased estimator.
Let $\widetilde{\theta}$ denote an unbiased estimator of $\theta$. Then
$$\widehat{\theta} = \frac{\theta^2}{\theta^2 + \text{mse}(\widetilde{\theta})}\cdot\widetilde{\theta} = \widetilde{\theta} - \underbrace{\frac{\text{mse}(\widetilde{\theta})}{\theta^2 + \text{mse}(\widetilde{\theta})}\cdot\widetilde{\theta}}_{\text{penalty}}$$
satisfies $\text{mse}(\widehat{\theta}) \leq \text{mse}(\widetilde{\theta})$.
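
A small simulation illustrating the inequality, a sketch under an assumed Gaussian setting (not taken from the slides): shrinking the sample mean towards 0 with the oracle factor $\theta^2/(\theta^2 + \sigma^2/n)$ reduces the mse.

```r
set.seed(123)
theta <- 1; sigma <- 2; n <- 10; nsim <- 10000

unbiased <- replicate(nsim, mean(rnorm(n, theta, sigma)))   # the unbiased estimator
mse_unbiased <- mean((unbiased - theta)^2)                  # ~ sigma^2 / n

shrink <- theta^2 / (theta^2 + sigma^2 / n)                 # oracle shrinkage factor
mse_shrunk <- mean((shrink * unbiased - theta)^2)

c(mse_unbiased, mse_shrunk)   # the shrunk estimator has the smaller mse
```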

## Normalization: Euclidean $\ell_2$ vs. Mahalanobis

We want to penalize complicated models:
if $\beta_k$ is "too small", we prefer to have $\beta_k = 0$.

Instead of $d(x, y) = \sqrt{(x - y)^\top (x - y)}$,
use $d_\Sigma(x, y) = \sqrt{(x - y)^\top \Sigma^{-1} (x - y)}$.

[Figure: contour plots in $(\beta_1, \beta_2)$, Euclidean vs. Mahalanobis geometry.]

## Linear Regression Shortcoming

Least squares estimator: $\widehat{\beta} = (X^\top X)^{-1} X^\top y$
Unbiased estimator: $\mathbb{E}[\widehat{\beta}] = \beta$
Variance: $\text{Var}[\widehat{\beta}] = \sigma^2 (X^\top X)^{-1}$,
which can be (extremely) large when $\det[X^\top X] \sim 0$.
$$X = \begin{pmatrix} 1 & -1 & 2 \\ 1 & 0 & 1 \\ 1 & 2 & -1 \\ 1 & 1 & 0 \end{pmatrix}, \quad
X^\top X = \begin{pmatrix} 4 & 2 & 2 \\ 2 & 6 & -4 \\ 2 & -4 & 6 \end{pmatrix}, \quad
X^\top X + I = \begin{pmatrix} 5 & 2 & 2 \\ 2 & 7 & -4 \\ 2 & -4 & 7 \end{pmatrix}$$
with eigenvalues $\{10, 6, 0\}$ and $\{11, 7, 1\}$ respectively.
Ad-hoc strategy: use $X^\top X + \lambda I$.
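
A quick check of this example in R:

```r
X <- matrix(c(1, -1,  2,
              1,  0,  1,
              1,  2, -1,
              1,  1,  0), nrow = 4, byrow = TRUE)

XtX <- crossprod(X)            # t(X) %*% X
eigen(XtX)$values              # 10, 6, 0  (singular)
eigen(XtX + diag(3))$values    # 11, 7, 1  (invertible)
```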

## Ridge Regression

... like least squares, but it shrinks estimated coefficients towards 0.
$$\widehat{\beta}_\lambda^{\text{ridge}} = \underset{\beta}{\text{argmin}}\left\{\sum_{i=1}^n (y_i - x_i^\top\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2\right\}$$
$$\widehat{\beta}_\lambda^{\text{ridge}} = \underset{\beta}{\text{argmin}}\Big\{\underbrace{\|y - X\beta\|_2^2}_{=\text{criteria}} + \underbrace{\lambda\|\beta\|_2^2}_{=\text{penalty}}\Big\}$$
$\lambda \geq 0$ is a tuning parameter.

## Ridge Regression

van Wieringen (2018, Lecture Notes on Ridge Regression)

Ridge Estimator (OLS)
$$\widehat{\beta}_\lambda^{\text{ridge}} = \underset{\beta}{\text{argmin}}\left\{\sum_{i=1}^n (y_i - x_i^\top\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2\right\}$$
Ridge Estimator (GLM)
$$\widehat{\beta}_\lambda^{\text{ridge}} = \underset{\beta}{\text{argmin}}\left\{-\sum_{i=1}^n \log f\big(y_i\,|\,\mu_i = g^{-1}(x_i^\top\beta)\big) + \frac{\lambda}{2}\sum_{j=1}^p \beta_j^2\right\}$$

## Ridge Regression

$$\widehat{\beta}_\lambda^{\text{ridge}} = \underset{\beta}{\text{argmin}}\left\{\|y - (\beta_0 + X\beta)\|_2^2 + \lambda\|\beta\|_2^2\right\}$$
can be seen as a constrained optimization problem
$$\widehat{\beta}_\lambda^{\text{ridge}} = \underset{\|\beta\|_2^2 \leq h_\lambda}{\text{argmin}}\left\{\|y - (\beta_0 + X\beta)\|_2^2\right\}$$
Explicit solution
$$\widehat{\beta}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$$
If $\lambda \to 0$, $\widehat{\beta}_0^{\text{ridge}} = \widehat{\beta}^{\text{ols}}$.
If $\lambda \to \infty$, $\widehat{\beta}_\infty^{\text{ridge}} = 0$.

[Figure: contours of the least-squares objective in $(\beta_1, \beta_2)$ with the $\ell_2$ constraint, for two values of the radius.]
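
A minimal sketch of the explicit solution on simulated data (the design, coefficients and values of $\lambda$ are illustrative):

```r
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(2, -1, 0) + rnorm(n)

ridge_solve <- function(X, y, lambda) {
  # explicit ridge solution (X'X + lambda I)^{-1} X'y
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

ridge_solve(X, y, lambda = 0)     # equals the OLS estimate
ridge_solve(X, y, lambda = 10)    # coefficients shrunk towards 0
ridge_solve(X, y, lambda = 1e6)   # essentially 0
```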

## Ridge Regression

This penalty can be seen as rather unfair if the components of $x$ are not expressed on the same scale
• center: $\bar{x}_j = 0$, then $\widehat{\beta}_0 = \bar{y}$
• scale: $x_j^\top x_j = 1$
Then compute
$$\widehat{\beta}_\lambda^{\text{ridge}} = \underset{\beta}{\text{argmin}}\Big\{\underbrace{\|y - X\beta\|_2^2}_{=\text{loss}} + \underbrace{\lambda\|\beta\|_2^2}_{=\text{penalty}}\Big\}$$
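
A sketch of this standardization step in R (simulated data; note that glmnet standardizes internally by default through its standardize argument, so this is mostly needed when using the explicit formula):

```r
set.seed(1)
X <- matrix(rnorm(100 * 3), 100, 3)
y <- X %*% c(2, -1, 0) + rnorm(100)

# center and scale each column so that mean(x_j) = 0 and x_j' x_j = 1
standardize <- function(x) (x - mean(x)) / sqrt(sum((x - mean(x))^2))
Xs <- apply(X, 2, standardize)
yc <- y - mean(y)                        # centered response: the intercept is mean(y)
round(colMeans(Xs), 10); colSums(Xs^2)   # 0 and 1, respectively
```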

## Ridge Regression

Observe that if $x_{j_1} \perp x_{j_2}$, then
$$\widehat{\beta}_\lambda^{\text{ridge}} = [1 + \lambda]^{-1}\,\widehat{\beta}^{\text{ols}},$$
which explains the relationship with shrinkage.
But generally, it is not the case...

Smaller mse: there exists $\lambda$ such that $\text{mse}[\widehat{\beta}_\lambda^{\text{ridge}}] \leq \text{mse}[\widehat{\beta}^{\text{ols}}]$.

## Ridge Regression

$$L_\lambda(\beta) = \sum_{i=1}^n (y_i - \beta_0 - x_i^\top\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2$$
$$\frac{\partial L_\lambda(\beta)}{\partial \beta} = -2X^\top y + 2(X^\top X + \lambda I)\beta$$
$$\frac{\partial^2 L_\lambda(\beta)}{\partial \beta\,\partial \beta^\top} = 2(X^\top X + \lambda I)$$
where $X^\top X$ is a positive semi-definite matrix and $\lambda I$ is a positive definite matrix, and
$$\widehat{\beta}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$$

## The Bayesian Interpretation

From a Bayesian perspective,
$$\underbrace{\mathbb{P}[\theta|y]}_{\text{posterior}} \propto \underbrace{\mathbb{P}[y|\theta]}_{\text{likelihood}} \cdot \underbrace{\mathbb{P}[\theta]}_{\text{prior}}$$
i.e.
$$\log \mathbb{P}[\theta|y] = \underbrace{\log \mathbb{P}[y|\theta]}_{\text{log likelihood}} + \underbrace{\log \mathbb{P}[\theta]}_{\text{penalty}}$$
If $\beta$ has a prior $\mathcal{N}(0, \tau^2 I)$ distribution, then its posterior distribution has mean
$$\mathbb{E}[\beta|y, X] = \left(X^\top X + \frac{\sigma^2}{\tau^2} I\right)^{-1} X^\top y.$$
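
A numerical sketch of this correspondence on simulated data ($\sigma^2$ and $\tau^2$ are chosen arbitrarily): the ridge estimate with $\lambda = \sigma^2/\tau^2$ coincides with the Gaussian posterior mean.

```r
set.seed(2)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(1, 0, -2) + rnorm(n)          # noise variance sigma^2 = 1
sigma2 <- 1; tau2 <- 0.5

# posterior mean from the Gaussian conjugate formulas
Sigma_post <- solve(crossprod(X) / sigma2 + diag(p) / tau2)
mu_post    <- Sigma_post %*% crossprod(X, y) / sigma2

# ridge estimate with lambda = sigma^2 / tau^2
beta_ridge <- solve(crossprod(X) + (sigma2 / tau2) * diag(p), crossprod(X, y))

all.equal(as.numeric(mu_post), as.numeric(beta_ridge))   # TRUE
```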

## Properties of the Ridge Estimator

$$\widehat{\beta}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$$
$$\mathbb{E}[\widehat{\beta}_\lambda] = X^\top X(\lambda I + X^\top X)^{-1}\beta,$$
i.e. $\mathbb{E}[\widehat{\beta}_\lambda] \neq \beta$.
Observe that $\mathbb{E}[\widehat{\beta}_\lambda] \to 0$ as $\lambda \to \infty$.

Ridge & Shrinkage: assume that $X$ is an orthogonal design matrix, i.e. $X^\top X = I$; then
$$\widehat{\beta}_\lambda = (1 + \lambda)^{-1}\,\widehat{\beta}^{\text{ols}}.$$

## Properties of the Ridge Estimator

Set $W_\lambda = (I + \lambda[X^\top X]^{-1})^{-1}$. One can prove that
$$W_\lambda\,\widehat{\beta}^{\text{ols}} = \widehat{\beta}_\lambda.$$
Thus,
$$\text{Var}[\widehat{\beta}_\lambda] = W_\lambda\,\text{Var}[\widehat{\beta}^{\text{ols}}]\,W_\lambda^\top$$
and
$$\text{Var}[\widehat{\beta}_\lambda] = \sigma^2 (X^\top X + \lambda I)^{-1} X^\top X\,[(X^\top X + \lambda I)^{-1}]^\top.$$
Observe that
$$\text{Var}[\widehat{\beta}^{\text{ols}}] - \text{Var}[\widehat{\beta}_\lambda] = \sigma^2\,W_\lambda\left[2\lambda(X^\top X)^{-2} + \lambda^2(X^\top X)^{-3}\right]W_\lambda^\top \geq 0.$$

## Properties of the Ridge Estimator

Hence, the confidence ellipsoid of the ridge estimator is indeed smaller than that of the OLS.
If $X$ is an orthogonal design matrix,
$$\text{Var}[\widehat{\beta}_\lambda] = \sigma^2(1 + \lambda)^{-2} I.$$
$$\text{mse}[\widehat{\beta}_\lambda] = \sigma^2\,\text{trace}\big(W_\lambda(X^\top X)^{-1}W_\lambda^\top\big) + \beta^\top(W_\lambda - I)^\top(W_\lambda - I)\beta.$$
If $X$ is an orthogonal design matrix,
$$\text{mse}[\widehat{\beta}_\lambda] = \frac{p\sigma^2}{(1 + \lambda)^2} + \frac{\lambda^2}{(1 + \lambda)^2}\beta^\top\beta.$$

[Figure: confidence ellipsoids in $(\beta_1, \beta_2)$ for increasing values of $\lambda$.]

## Properties of the Ridge Estimator

$$\text{mse}[\widehat{\beta}_\lambda] = \frac{p\sigma^2}{(1 + \lambda)^2} + \frac{\lambda^2}{(1 + \lambda)^2}\beta^\top\beta$$
is minimal for
$$\lambda = \frac{p\sigma^2}{\beta^\top\beta}.$$
Note that there exists $\lambda > 0$ such that $\text{mse}[\widehat{\beta}_\lambda] < \text{mse}[\widehat{\beta}_0] = \text{mse}[\widehat{\beta}^{\text{ols}}]$.
Ridge regression is obtained using glmnet::glmnet(..., alpha = 0) - and glmnet::cv.glmnet for cross validation.
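
A minimal sketch with glmnet on simulated data (the data and the choices below are illustrative):

```r
library(glmnet)
set.seed(3)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(3, -2, rep(0, p - 2)) + rnorm(n)

# ridge path: alpha = 0
ridge_path <- glmnet(X, y, alpha = 0)
plot(ridge_path, xvar = "lambda")          # coefficients shrink smoothly towards 0

# choose lambda by 10-fold cross validation
cv_ridge <- cv.glmnet(X, y, alpha = 0, nfolds = 10)
cv_ridge$lambda.min                        # lambda minimizing the CV error
coef(cv_ridge, s = "lambda.min")
```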

## SVD decomposition

For any $m \times n$ matrix $A$, there are orthogonal matrices $U$ ($m \times m$), $V$ ($n \times n$)
and a "diagonal" matrix $\Sigma$ ($m \times n$) such that $A = U\Sigma V^\top$, or $AV = U\Sigma$.
Hence, there exists a special orthonormal set of vectors (i.e. the columns of $V$)
that is mapped by the matrix $A$ into an orthonormal set of vectors (i.e. the columns of $U$).
Let $r = \text{rank}(A)$; then $A = \sum_{i=1}^r \sigma_i u_i v_i^\top$ (called the dyadic decomposition of $A$).
Observe that it can be used to compute (e.g.) the Frobenius norm of $A$,
$$\|A\| = \sqrt{\sum_{i,j} a_{i,j}^2} = \sqrt{\sigma_1^2 + \cdots + \sigma_{\min\{m,n\}}^2}.$$
Further, $A^\top A = V\Sigma^\top\Sigma V^\top$ while $AA^\top = U\Sigma\Sigma^\top U^\top$.
Hence, the $\sigma_i^2$'s are related to the eigenvalues of $A^\top A$ and $AA^\top$, and $u_i$, $v_i$ are the associated eigenvectors.
Golub & Reinsch (1970, Singular Value Decomposition and Least Squares Solutions)

## SVD decomposition

Consider the singular value decomposition of $X$, $X = UDV^\top$. Then
$$\widehat{\beta}^{\text{ols}} = V D^{-2} D\, U^\top y$$
$$\widehat{\beta}_\lambda = V (D^2 + \lambda I)^{-1} D\, U^\top y$$
Observe that
$$D_{i,i}^{-1} \geq \frac{D_{i,i}}{D_{i,i}^2 + \lambda},$$
hence the ridge penalty shrinks singular values.
Set now $R = UD$ (an $n \times n$ matrix), so that $X = RV^\top$,
$$\widehat{\beta}_\lambda = V (R^\top R + \lambda I)^{-1} R^\top y$$
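
A quick check of the SVD form against the direct formula, on a simulated design (names and values are illustrative):

```r
set.seed(4)
n <- 50; p <- 3; lambda <- 2
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(1, -1, 0.5) + rnorm(n)

# direct formula
beta_direct <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))

# SVD form: X = U D V', beta_lambda = V (D^2 + lambda I)^{-1} D U' y
s <- svd(X)
beta_svd <- s$v %*% (diag(s$d / (s$d^2 + lambda)) %*% crossprod(s$u, y))

all.equal(as.numeric(beta_direct), as.numeric(beta_svd))   # TRUE
```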

## Hat matrix and Degrees of Freedom

Recall that $\widehat{Y} = HY$ with
$$H = X(X^\top X)^{-1} X^\top$$
Similarly,
$$H_\lambda = X(X^\top X + \lambda I)^{-1} X^\top$$
$$\text{trace}[H_\lambda] = \sum_{j=1}^p \frac{d_{j,j}^2}{d_{j,j}^2 + \lambda} \to 0, \quad\text{as } \lambda \to \infty.$$

## Sparsity Issues

In several applications, $k$ can be (very) large, but a lot of features are just noise:
$\beta_j = 0$ for many $j$'s. Let $s$ denote the number of relevant features, with $s \ll k$,
cf. Hastie, Tibshirani & Wainwright (2015, Statistical Learning with Sparsity),
$$s = \text{card}\{\mathcal{S}\} \quad\text{where } \mathcal{S} = \{j;\ \beta_j \neq 0\}.$$
The model is now $y = X_\mathcal{S}^\top \beta_\mathcal{S} + \varepsilon$, where $X_\mathcal{S}^\top X_\mathcal{S}$ is a full-rank matrix.

## Going further on sparsity issues

The Ridge regression problem was to solve
$$\widehat{\beta} = \underset{\beta\in\{\|\beta\|_2 \leq s\}}{\text{argmin}}\left\{\|Y - X^\top\beta\|_2^2\right\}$$
Define $\|a\|_0 = \sum 1(|a_i| > 0)$.
Here $\dim(\beta) = k$ but $\|\beta\|_0 = s$.
We wish we could solve
$$\widehat{\beta} = \underset{\beta\in\{\|\beta\|_0 = s\}}{\text{argmin}}\left\{\|Y - X^\top\beta\|_2^2\right\}$$
Problem: it is usually not possible to go through all possible constraints, since $\binom{k}{s}$ sets of coefficients should be considered here (with $k$ (very) large).

[Figure: contours of the least-squares objective in $(\beta_1, \beta_2)$ with the constraint region.]

## Going further on sparsity issues

In a convex problem, solve the dual problem; e.g. in the Ridge regression, the primal problem is
$$\min_{\beta\in\{\|\beta\|_2 \leq s\}}\left\{\|Y - X^\top\beta\|_2^2\right\}$$
and the dual problem is
$$\min_{\beta\in\{\|Y - X^\top\beta\|_2 \leq t\}}\left\{\|\beta\|_2^2\right\}$$

[Figure: contours in $(\beta_1, \beta_2)$ of the two formulations, with the constraint region and the optimum marked.]

## Going further on sparsity issues

Idea: solve the dual problem
$$\widehat{\beta} = \underset{\beta\in\{\|Y - X^\top\beta\|_2 \leq h\}}{\text{argmin}}\left\{\|\beta\|_0\right\}$$
where we might convexify the $\ell_0$ "norm", $\|\cdot\|_0$.

## Going further on sparsity issues

On $[-1, +1]^k$, the convex hull of $\|\beta\|_0$ is $\|\beta\|_1$.
On $[-a, +a]^k$, the convex hull of $\|\beta\|_0$ is $a^{-1}\|\beta\|_1$.
Hence, why not solve
$$\widehat{\beta} = \underset{\beta;\ \|\beta\|_1 \leq \tilde{s}}{\text{argmin}}\left\{\|Y - X^\top\beta\|_2^2\right\}$$
which is equivalent (Kuhn-Tucker theorem) to the Lagrangian optimization problem
$$\widehat{\beta} = \underset{\beta}{\text{argmin}}\left\{\|Y - X^\top\beta\|_2^2 + \lambda\|\beta\|_1\right\}$$

## lasso: Least Absolute Shrinkage and Selection Operator

lasso Estimator (OLS)
$$\widehat{\beta}_\lambda^{\text{lasso}} = \underset{\beta}{\text{argmin}}\left\{\sum_{i=1}^n (y_i - x_i^\top\beta)^2 + \lambda\sum_{j=1}^p |\beta_j|\right\}$$
lasso Estimator (GLM)
$$\widehat{\beta}_\lambda^{\text{lasso}} = \underset{\beta}{\text{argmin}}\left\{-\sum_{i=1}^n \log f\big(y_i\,|\,\mu_i = g^{-1}(x_i^\top\beta)\big) + \frac{\lambda}{2}\sum_{j=1}^p |\beta_j|\right\}$$

## lasso Regression

No explicit solution...
If $\lambda \to 0$, $\widehat{\beta}_0^{\text{lasso}} = \widehat{\beta}^{\text{ols}}$.
If $\lambda \to \infty$, $\widehat{\beta}_\infty^{\text{lasso}} = 0$.

[Figure: contours of the least-squares objective in $(\beta_1, \beta_2)$ with the $\ell_1$ constraint, for two values of $\lambda$.]

## lasso Regression

For some $\lambda$, there are $k$'s such that $\widehat{\beta}_{k,\lambda}^{\text{lasso}} = 0$.
Further, $\lambda \mapsto \widehat{\beta}_{k,\lambda}^{\text{lasso}}$ is piecewise linear.

[Figure: contours of the least-squares objective in $(\beta_1, \beta_2)$ with the $\ell_1$ constraint, for two values of $\lambda$.]

## lasso Regression

In the orthogonal case, $X^\top X = I$,
$$\widehat{\beta}_{k,\lambda}^{\text{lasso}} = \text{sign}\big(\widehat{\beta}_k^{\text{ols}}\big)\left(|\widehat{\beta}_k^{\text{ols}}| - \frac{\lambda}{2}\right)_+$$
i.e. the LASSO estimate is related to the soft-threshold function...
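
A sketch of the soft-threshold function and of this closed form, on a simulated orthonormal design (names and values are illustrative):

```r
# soft-thresholding operator: sign(z) * (|z| - gamma)_+
soft <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

set.seed(5)
n <- 100; p <- 4; lambda <- 1
X <- qr.Q(qr(matrix(rnorm(n * p), n, p)))     # orthonormal columns: t(X) %*% X = I
y <- X %*% c(2, -1, 0.1, 0) + rnorm(n, sd = 0.5)

beta_ols <- crossprod(X, y)                   # OLS when t(X) %*% X = I
beta_lasso <- soft(beta_ols, lambda / 2)      # closed-form lasso in this case
cbind(beta_ols, beta_lasso)                   # small coefficients are set exactly to 0
```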

## Optimal lasso Penalty

Use cross validation, e.g. $K$-fold:
$$\widehat{\beta}_{(-k)}(\lambda) = \underset{\beta}{\text{argmin}}\left\{\sum_{i\notin I_k}\left[y_i - x_i^\top\beta\right]^2 + \lambda\|\beta\|_1\right\}$$
then compute the sum of squared errors on the held-out fold,
$$Q_k(\lambda) = \sum_{i\in I_k}\left[y_i - x_i^\top\widehat{\beta}_{(-k)}(\lambda)\right]^2$$
and finally solve
$$\lambda^\star = \underset{\lambda}{\text{argmin}}\ Q(\lambda), \quad\text{where } Q(\lambda) = \frac{1}{K}\sum_k Q_k(\lambda).$$

## Optimal lasso Penalty

Note that this might overfit, so Hastie, Tibshirani & Friedman (2009, Elements of
Statistical Learning) suggest the largest $\lambda$ such that
$$Q(\lambda) \leq Q(\lambda^\star) + \text{se}[\lambda^\star], \quad\text{with } \text{se}[\lambda]^2 = \frac{1}{K^2}\sum_{k=1}^K \left[Q_k(\lambda) - Q(\lambda)\right]^2.$$
lasso regression is obtained using glmnet::glmnet(..., alpha = 1) - and glmnet::cv.glmnet for cross validation.
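
A sketch of this "one-standard-error" rule with cv.glmnet, which reports both choices (simulated data; names are illustrative):

```r
library(glmnet)
set.seed(6)
n <- 200; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(3, -2, 1.5, rep(0, p - 3)) + rnorm(n)

cv_lasso <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
cv_lasso$lambda.min                # lambda minimizing the CV error
cv_lasso$lambda.1se                # largest lambda within one standard error of the minimum
coef(cv_lasso, s = "lambda.1se")   # sparser model than with lambda.min
```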

## LASSO and Ridge, with R

```r
library(glmnet)
chicago <- read.table("http://freakonometrics.free.fr/chicago.txt", header = TRUE, sep = ";")
standardize <- function(x) {(x - mean(x)) / sd(x)}
z0 <- standardize(chicago[, 1])
z1 <- standardize(chicago[, 3])
z2 <- standardize(chicago[, 4])
ridge   <- glmnet(cbind(z1, z2), z0, alpha = 0,  intercept = FALSE, lambda = 1)
lasso   <- glmnet(cbind(z1, z2), z0, alpha = 1,  intercept = FALSE, lambda = 1)
elastic <- glmnet(cbind(z1, z2), z0, alpha = .5, intercept = FALSE, lambda = 1)
```

Elastic net: $\lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2$

## lasso and lar (Least-Angle Regression)

lasso estimation can be seen as an adaptation of the LAR procedure.

Least Angle Regression
(i) set a (small) step size $\eta > 0$
(ii) start with initial residual $\varepsilon = y$, and $\beta = 0$
(iii) find the predictor $x_j$ with the highest correlation with $\varepsilon$
(iv) update $\beta_j \leftarrow \beta_j + \delta_j$, where $\delta_j = \eta\cdot\text{sign}[\varepsilon^\top x_j]$
(v) set $\varepsilon \leftarrow \varepsilon - \delta_j x_j$ and go to (iii)
see Efron et al. (2004, Least Angle Regression)
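
A minimal sketch of this loop in R, as a direct transcription of steps (i)-(v) on simulated data (the step size and number of iterations are arbitrary):

```r
set.seed(7)
n <- 200; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))          # standardized predictors
y <- X %*% c(4, -3, 0, 0, 0) + rnorm(n)

eta  <- 0.01                                    # small step size
beta <- rep(0, p)
eps  <- as.numeric(y)                           # initial residual

for (it in 1:2000) {
  j <- which.max(abs(cor(eps, X)))              # most correlated predictor
  delta <- eta * sign(sum(eps * X[, j]))        # step in the sign of the correlation
  beta[j] <- beta[j] + delta
  eps <- eps - delta * X[, j]                   # update the residual
}
round(beta, 2)                                  # approaches the least-squares fit as iterations increase
```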

## Going further, $\ell_0$, $\ell_1$ and $\ell_2$ penalty

Define
$$\|a\|_0 = \sum_{i=1}^d 1(a_i \neq 0), \quad \|a\|_1 = \sum_{i=1}^d |a_i| \quad\text{and}\quad \|a\|_2 = \Big(\sum_{i=1}^d a_i^2\Big)^{1/2}, \quad\text{for } a \in \mathbb{R}^d.$$

| constrained optimization | penalized optimization | |
|---|---|---|
| $\text{argmin}_{\beta;\,\|\beta\|_0 \leq s}\ \sum_{i=1}^n \ell(y_i, \beta_0 + x^\top\beta)$ | $\text{argmin}_{\beta,\lambda}\ \sum_{i=1}^n \ell(y_i, \beta_0 + x^\top\beta) + \lambda\|\beta\|_0$ | $(\ell_0)$ |
| $\text{argmin}_{\beta;\,\|\beta\|_1 \leq s}\ \sum_{i=1}^n \ell(y_i, \beta_0 + x^\top\beta)$ | $\text{argmin}_{\beta,\lambda}\ \sum_{i=1}^n \ell(y_i, \beta_0 + x^\top\beta) + \lambda\|\beta\|_1$ | $(\ell_1)$ |
| $\text{argmin}_{\beta;\,\|\beta\|_2 \leq s}\ \sum_{i=1}^n \ell(y_i, \beta_0 + x^\top\beta)$ | $\text{argmin}_{\beta,\lambda}\ \sum_{i=1}^n \ell(y_i, \beta_0 + x^\top\beta) + \lambda\|\beta\|_2$ | $(\ell_2)$ |

Assume that $\ell$ is the quadratic norm.

## Going further, $\ell_0$, $\ell_1$ and $\ell_2$ penalty

The two problems $(\ell_2)$ are equivalent: $\forall(\beta^\star, s^\star)$ solution of the left problem, $\exists\lambda^\star$ such that $(\beta^\star, \lambda^\star)$ is a solution of the right problem. And conversely.
The two problems $(\ell_1)$ are equivalent: $\forall(\beta^\star, s^\star)$ solution of the left problem, $\exists\lambda^\star$ such that $(\beta^\star, \lambda^\star)$ is a solution of the right problem. And conversely. Nevertheless, even if there is a theoretical equivalence, there might be numerical issues since the solution is not necessarily unique.
The two problems $(\ell_0)$ are not equivalent: if $(\beta^\star, \lambda^\star)$ is a solution of the right problem, $\exists s^\star$ such that $\beta^\star$ is a solution of the left problem. But the converse is not true.
More generally, consider an $\ell_p$ norm,
• sparsity is obtained when $p \leq 1$
• convexity is obtained when $p \geq 1$

## Going further, $\ell_0$, $\ell_1$ and $\ell_2$ penalty

Foster & George (1994, The Risk Inflation Criterion for Multiple Regression) tried to
solve directly the penalized problem of $(\ell_0)$.
But it is a complex combinatorial problem in high dimension (Natarajan (1995,
Sparse Approximate Solutions to Linear Systems) proved that it is an NP-hard problem).
One can prove that if $\lambda \sim \sigma^2 \log(p)$, then
$$\mathbb{E}\left(\left[x^\top\widehat{\beta} - x^\top\beta_0\right]^2\right) \leq \underbrace{\mathbb{E}\left(\left[x_\mathcal{S}^\top\widehat{\beta}_\mathcal{S} - x^\top\beta_0\right]^2\right)}_{=\sigma^2\#\mathcal{S}}\cdot\big(4\log p + 2 + o(1)\big).$$
In that case,
$$\widehat{\beta}_{\lambda,j}^{\text{sub}} = \begin{cases} 0 & \text{if } j \notin \mathcal{S}_\lambda(\widehat{\beta}) \\ \widehat{\beta}_j^{\text{ols}} & \text{if } j \in \mathcal{S}_\lambda(\widehat{\beta}), \end{cases}$$
where $\mathcal{S}_\lambda(\widehat{\beta})$ is the set of non-null values in the solutions of $(\ell_0)$.

## Going further, $\ell_0$, $\ell_1$ and $\ell_2$ penalty

If $\ell$ is no longer the quadratic norm but $\ell_1$, problem $(\ell_1)$ is not always strictly
convex, and the optimum is not always unique (e.g. if $X^\top X$ is singular).
But in the quadratic case, $\ell$ is strictly convex, and at least $X\widehat{\beta}$ is unique.
Further, note that solutions are necessarily coherent (signs of coefficients): it is
not possible to have $\widehat{\beta}_j < 0$ for one solution and $\widehat{\beta}_j > 0$ for another one.
In many cases, problem $(\ell_1)$ yields a corner-type solution, which can be seen as a
"best subset" solution - like in $(\ell_0)$.

## Going further, $\ell_0$, $\ell_1$ and $\ell_2$ penalty

Consider a simple regression $y_i = x_i\beta + \varepsilon$, with an $\ell_1$-penalty and an $\ell_2$-loss function.
$(\ell_1)$ becomes
$$\min\left\{y^\top y - 2y^\top x\beta + \beta x^\top x\beta + 2\lambda|\beta|\right\}$$
The first-order condition can be written
$$-2y^\top x + 2x^\top x\beta \pm 2\lambda = 0$$
(the sign in $\pm$ being the sign of $\beta$). Assume that the least-squares estimate ($\lambda = 0$) is
(strictly) positive, i.e. $y^\top x > 0$. If $\lambda$ is not too large, $\widehat{\beta}$ and $\widehat{\beta}^{\text{ols}}$ have the same sign, and
$$-2y^\top x + 2x^\top x\beta + 2\lambda = 0,$$
with solution $\widehat{\beta}_\lambda^{\text{lasso}} = \dfrac{y^\top x - \lambda}{x^\top x}$.

## Going further, $\ell_0$, $\ell_1$ and $\ell_2$ penalty

Increase $\lambda$ so that $\widehat{\beta}_\lambda = 0$.
If we increase it slightly more, $\widehat{\beta}_\lambda$ cannot become negative, because the sign in the first-order condition would change, and we should solve
$$-2y^\top x + 2x^\top x\beta - 2\lambda = 0,$$
whose solution would be $\widehat{\beta}_\lambda^{\text{lasso}} = \dfrac{y^\top x + \lambda}{x^\top x}$. But that solution is positive (we assumed that $y^\top x > 0$), while we should have $\widehat{\beta}_\lambda < 0$.
Thus, at some point $\widehat{\beta}_\lambda = 0$, which is a corner solution.
In higher dimension, see Tibshirani & Wasserman (2016, A Closer Look at Sparse Regression) or Candès & Plan (2009, Near-ideal Model Selection by $\ell_1$ Minimization).
With some additional technical assumptions, the lasso estimator is "sparsistent", in the sense that the support of $\widehat{\beta}_\lambda^{\text{lasso}}$ is the same as that of $\beta$.

## Going further, $\ell_0$, $\ell_1$ and $\ell_2$ penalty

Thus, lasso can be used for variable selection - see Hastie et al. (2001, The
Elements of Statistical Learning).
Generally, $\widehat{\beta}_\lambda^{\text{lasso}}$ is a biased estimator, but its variance can be small enough to
yield a smaller mean squared error than the OLS estimate.
With orthonormal covariates, one can prove that
$$\widehat{\beta}_{\lambda,j}^{\text{sub}} = \widehat{\beta}_j^{\text{ols}}\,1_{|\widehat{\beta}_j^{\text{ols}}| > b}, \quad
\widehat{\beta}_{\lambda,j}^{\text{ridge}} = \frac{\widehat{\beta}_j^{\text{ols}}}{1 + \lambda} \quad\text{and}\quad
\widehat{\beta}_{\lambda,j}^{\text{lasso}} = \text{sign}[\widehat{\beta}_j^{\text{ols}}]\cdot\big(|\widehat{\beta}_j^{\text{ols}}| - \lambda\big)_+.$$
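
A small sketch comparing the three rules as functions of the OLS coefficient (the threshold values are arbitrary):

```r
b <- 1; lambda <- 1
hard  <- function(z) z * (abs(z) > b)                    # best-subset / hard threshold
ridge <- function(z) z / (1 + lambda)                    # proportional shrinkage
soft  <- function(z) sign(z) * pmax(abs(z) - lambda, 0)  # lasso / soft threshold

z <- seq(-3, 3, by = 0.01)
matplot(z, cbind(hard(z), ridge(z), soft(z)), type = "l", lty = 1,
        xlab = expression(hat(beta)^{ols}), ylab = "thresholded estimate")
legend("topleft", c("subset", "ridge", "lasso"), col = 1:3, lty = 1)
```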

## lasso for Autoregressive Time Series

Consider some AR(p) autoregressive time series,
$$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_{p-1} X_{t-p+1} + \phi_p X_{t-p} + \varepsilon_t,$$
for some white noise $(\varepsilon_t)$, with a causal-type representation. Write $y = x^\top\phi + \varepsilon$.
The lasso estimator $\widehat{\phi}$ is a minimizer of
$$\frac{1}{2T}\|y - x^\top\phi\|^2 + \lambda\sum_{i=1}^p \lambda_i|\phi_i|,$$
for some tuning parameters $(\lambda, \lambda_1, \cdots, \lambda_p)$.
See Nardi & Rinaldo (2011, Autoregressive Process Modeling via the Lasso Procedure).
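
A minimal sketch with glmnet, fitting a lasso on the lagged design matrix of a simulated AR(2) series (the fitted order is illustrative, and a single tuning parameter $\lambda$ is used instead of lag-specific weights):

```r
library(glmnet)
set.seed(8)
T <- 500
x <- as.numeric(arima.sim(model = list(ar = c(0.5, -0.3)), n = T))

p <- 6                                   # fit an AR(p) with p larger than the true order
Z <- embed(x, p + 1)                     # columns: x_t, x_{t-1}, ..., x_{t-p}
y <- Z[, 1]
X <- Z[, -1]

cv_ar <- cv.glmnet(X, y, alpha = 1, intercept = FALSE)
coef(cv_ar, s = "lambda.min")            # only the first lags should be clearly non-zero
```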

## lasso and Non-Linearities

Consider knots $k_1, \cdots, k_m$; we want a function $m$ which is a cubic polynomial
between every pair of knots, continuous at each knot, and with continuous first
and second derivatives at each knot.
We can write $m$ as
$$m(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4(x - k_1)_+^3 + \cdots + \beta_{m+3}(x - k_m)_+^3$$
One strategy is the following
• fix the number of knots $m$ ($m < n$)
• find the natural cubic spline $\widehat{m}$ which minimizes $\sum_{i=1}^n (y_i - m(x_i))^2$
• then choose $m$ by cross validation
An alternative is to use a penalty-based approach (Ridge type) to avoid overfitting
(since with $m = n$, the residual sum of squares is zero).

## GAM, splines and Ridge regression

Consider a univariate nonlinear regression problem, so that $\mathbb{E}[Y|X = x] = m(x)$.
Given a sample $\{(y_1, x_1), \cdots, (y_n, x_n)\}$, consider the following penalized problem
$$\widehat{m} = \underset{m\in\mathcal{C}^2}{\text{argmin}}\left\{\sum_{i=1}^n (y_i - m(x_i))^2 + \lambda\int_{\mathbb{R}} m''(x)^2\,dx\right\}$$
with the residual sum of squares on the left, and a penalty for the roughness of the function.
The solution is a natural cubic spline with knots at the unique values of $x$ (see
Eubank (1999, Nonparametric Regression and Spline Smoothing)).
Consider some spline basis $\{h_1, \cdots, h_n\}$, and let $m(x) = \sum_{i=1}^n \beta_i h_i(x)$.
Let $H$ and $\Omega$ be the $n \times n$ matrices $H_{i,j} = h_j(x_i)$ and $\Omega_{i,j} = \int_{\mathbb{R}} h_i''(x)\,h_j''(x)\,dx$.

## GAM, splines and Ridge regression

Then the objective function can be written
$$(y - H\beta)^\top(y - H\beta) + \lambda\beta^\top\Omega\beta$$
Recognize here a generalized Ridge regression, with solution
$$\widehat{\beta}_\lambda = \left(H^\top H + \lambda\Omega\right)^{-1} H^\top y.$$
Note that the predicted values are linear functions of the observed values, since
$$\widehat{y} = H\left(H^\top H + \lambda\Omega\right)^{-1} H^\top y = S_\lambda y,$$
with degrees of freedom $\text{trace}(S_\lambda)$.
One can obtain the so-called Reinsch form by considering the singular value decomposition of $H = UDV^\top$.
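
A minimal sketch of this penalized fit with stats::smooth.spline, which performs this kind of roughness-penalized regression and reports its equivalent degrees of freedom (simulated data; names are illustrative):

```r
set.seed(9)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

fit <- smooth.spline(x, y)        # penalty level chosen by (generalized) cross validation
fit$df                            # equivalent degrees of freedom of the smoother
fit$lambda                        # selected smoothing parameter

plot(x, y, col = "grey")
lines(predict(fit, x), lwd = 2)   # fitted smoothing spline
```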

## GAM, splines and Ridge regression

Here $U$ is orthogonal since $H$ is square ($n \times n$), and $D$ is here invertible. Then
$$S_\lambda = (I + \lambda U^\top D^{-1} V^\top \Omega V D^{-1} U)^{-1} = (I + \lambda K)^{-1}$$
where $K$ is a positive semidefinite matrix, $K = B\Delta B^\top$, and the columns of $B$ are
known as the Demmler-Reinsch basis.
In that (orthonormal) basis, $S_\lambda$ is a diagonal matrix,
$$S_\lambda = B\left(I + \lambda\Delta\right)^{-1} B^\top$$
Observe that $S_\lambda B_k = \dfrac{1}{1 + \lambda\Delta_{k,k}} B_k$.
Here again, the eigenvalues are shrinkage coefficients of the basis vectors.
With more covariates, consider an additive problem
$$(\widehat{m}_1, \cdots, \widehat{m}_p) = \underset{m_1,\cdots,m_p\in\mathcal{C}^2}{\text{argmin}}\left\{\sum_{i=1}^n\Big(y_i - \sum_{j=1}^p m_j(x_{i,j})\Big)^2 + \lambda\sum_{j=1}^p\int_{\mathbb{R}} m_j''(x)^2\,dx\right\}$$

## GAM, splines and Ridge regression

which can be written
$$\min\left\{\Big(y - \sum_{j=1}^p H_j\beta_j\Big)^\top\Big(y - \sum_{j=1}^p H_j\beta_j\Big) + \lambda\sum_{j=1}^p \beta_j^\top\Omega_j\beta_j\right\}$$
where each matrix $H_j$ is a Demmler-Reinsch basis for variable $x_j$.

Chouldechova & Hastie (2015, Generalized Additive Model Selection):
assume that the mean function for the $j$th variable is $m_j(x) = \alpha_j x + m_j(x)^\top\beta_j$.
One can write
$$\min\ \Big(y - \alpha_0 - \sum_{j=1}^p \alpha_j x_j - \sum_{j=1}^p H_j\beta_j\Big)^\top\Big(y - \alpha_0 - \sum_{j=1}^p \alpha_j x_j - \sum_{j=1}^p H_j\beta_j\Big)$$
$$+\ \lambda\sum_{j=1}^p\Big(\gamma|\alpha_j| + (1 - \gamma)\|\beta_j\|_{\Omega_j}\Big) + \psi_1\,\beta_1^\top\Omega_1\beta_1 + \cdots + \psi_p\,\beta_p^\top\Omega_p\beta_p$$
where $\|\beta_j\|_{\Omega_j} = \sqrt{\beta_j^\top\Omega_j\beta_j}$.

## GAM, splines and Ridge regression

The second term is the selection penalty, with a mixture of $\ell_1$ and $\ell_2$ (type) norm-based penalties.
The third term is the end-of-path penalty (GAM type when $\lambda = 0$).
For each predictor $x_j$, there are three possibilities
• zero, $\alpha_j = 0$ and $\beta_j = 0$
• linear, $\alpha_j \neq 0$ and $\beta_j = 0$
• nonlinear, $\beta_j \neq 0$

[Figure: fitted smooth functions for variables 1-3; paths of the linear and non-linear components as functions of $\lambda$; cross-validated mean-squared error against $\log(\lambda)$; and the selected fits $f(v_1)$, $f(v_2)$, $f(v_3)$.]

## Coordinate Descent

LASSO Coordinate Descent Algorithm
1. Set $\beta^{(0)} = \widehat{\beta}$
2. For $k = 1, \cdots$, for $j = 1, \cdots, p$
   (i) compute $R_j = x_j^\top\big(y - X_{-j}\,\beta^{(k-1)}_{(-j)}\big)$
   (ii) set $\beta_{k,j} = R_j\cdot\Big(1 - \dfrac{\lambda}{2|R_j|}\Big)_+$
3. The final estimate $\beta^{(\kappa)}$ is $\widehat{\beta}_\lambda$
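
A minimal sketch of this loop in R, assuming the columns of $X$ are rescaled so that $x_j^\top x_j = 1$ (simulated data; the number of sweeps is arbitrary):

```r
set.seed(10)
n <- 200; p <- 5; lambda <- 5
X <- apply(matrix(rnorm(n * p), n, p), 2, function(x) x / sqrt(sum(x^2)))  # x_j' x_j = 1
y <- X %*% c(30, -20, 0, 0, 10) + rnorm(n)

beta <- rep(0, p)                                         # starting value
for (k in 1:100) {                                        # full sweeps over the coordinates
  for (j in 1:p) {
    Rj <- sum(X[, j] * (y - X[, -j] %*% beta[-j]))        # partial-residual correlation
    beta[j] <- Rj * max(1 - lambda / (2 * abs(Rj)), 0)    # soft-thresholding update
  }
}
round(beta, 2)                                            # the zero coefficients are set exactly to 0
```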

## From LASSO to Dantzig Selection

Candès & Tao (2007, The Dantzig Selector: Statistical Estimation When p Is Much Larger Than n) defined
$$\widehat{\beta}_\lambda^{\text{dantzig}} \in \underset{\beta\in\mathbb{R}^p}{\text{argmin}}\left\{\|\beta\|_1\right\} \quad\text{s.t. } \|X^\top(y - X\beta)\|_\infty \leq \lambda$$

## From LASSO to Adaptive Lasso

Zou (2006, The Adaptive Lasso)
$$\widehat{\beta}_\lambda^{\text{a-lasso}} \in \underset{\beta\in\mathbb{R}^p}{\text{argmin}}\left\{\|y - X\beta\|_2^2 + \lambda\sum_{j=1}^p \frac{|\beta_j|}{|\widehat{\beta}_{\lambda,j}^{\gamma\text{-lasso}}|}\right\}$$
where $\widehat{\beta}_\lambda^{\gamma\text{-lasso}} = \Pi_{X_{s(\lambda)}}\, y$, and $s(\lambda)$ is the set of non-null components of $\widehat{\beta}_\lambda^{\text{lasso}}$.
See the libraries lqa or lassogrp.
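
The same idea can also be sketched with glmnet, whose penalty.factor argument applies coefficient-specific weights (here the weights are built from an initial lasso fit; this weighting scheme is one common choice, not the slide's exact definition):

```r
library(glmnet)
set.seed(11)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(3, -2, rep(0, p - 2)) + rnorm(n)

# step 1: initial lasso fit to build the weights
init <- cv.glmnet(X, y, alpha = 1)
b0 <- as.numeric(coef(init, s = "lambda.min"))[-1]      # drop the intercept
w <- 1 / pmax(abs(b0), 1e-4)                            # large weight on (near-)zero coefficients

# step 2: weighted (adaptive) lasso
ada <- cv.glmnet(X, y, alpha = 1, penalty.factor = w)
coef(ada, s = "lambda.min")
```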

## From LASSO to Group Lasso

Assume that the variables $x \in \mathbb{R}^p$ can be grouped in $L$ subgroups, $x = (x_1, \cdots, x_L)$, where $\dim[x_l] = p_l$.
Yuan & Lin (2007, Model Selection and Estimation in the Gaussian Graphical Model) defined, for some positive definite $n_l \times n_l$ matrices $K_l$,
$$\widehat{\beta}_\lambda^{\text{g-lasso}} \in \underset{\beta\in\mathbb{R}^p}{\text{argmin}}\left\{\|y - X\beta\|_2^2 + \lambda\sum_{l=1}^L \sqrt{\beta_l^\top K_l\beta_l}\right\}$$
or, if $K_l = p_l I$,
$$\widehat{\beta}_\lambda^{\text{g-lasso}} \in \underset{\beta\in\mathbb{R}^p}{\text{argmin}}\left\{\|y - X\beta\|_2^2 + \lambda\sum_{l=1}^L \sqrt{p_l}\,\|\beta_l\|_2\right\}$$
See the library gglasso.

## From LASSO to Sparse-Group Lasso

Assume that the variables $x \in \mathbb{R}^p$ can be grouped in $L$ subgroups, $x = (x_1, \cdots, x_L)$, where $\dim[x_l] = p_l$.
Simon et al. (2013, A Sparse-Group LASSO) defined, for some positive definite $n_l \times n_l$ matrices $K_l$,
$$\widehat{\beta}_{\lambda,\mu}^{\text{sg-lasso}} \in \underset{\beta\in\mathbb{R}^p}{\text{argmin}}\left\{\|y - X\beta\|_2^2 + \lambda\sum_{l=1}^L \sqrt{\beta_l^\top K_l\beta_l} + \mu\|\beta\|_1\right\}$$
See the library SGL.