Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Advanced Econometrics #3: Model & Variable Selection*
A. Charpentier (Université de Rennes 1)
Université de Rennes 1,
Graduate Course, 2017.
@freakonometrics 1
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
“Great plot.
Now need to find the theory that explains it”
Deville (2017) http://twitter.com
@freakonometrics 2
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Preliminary Results: Numerical Optimization
Problem: $x^\star \in \operatorname{argmin}\{f(x);\ x \in \mathbb{R}^d\}$
Gradient descent: $x_{k+1} = x_k - \eta\,\nabla f(x_k)$, starting from some $x_0$
Problem: $x^\star \in \operatorname{argmin}\{f(x);\ x \in \mathcal{X} \subset \mathbb{R}^d\}$
Projected descent: $x_{k+1} = \Pi_{\mathcal{X}}\big(x_k - \eta\,\nabla f(x_k)\big)$, starting from some $x_0$
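As an illustration, here is a minimal R sketch of both schemes on an assumed toy objective $f(x) = \|x\|^2$, with projection onto the unit ball (function names, step size and iteration count are illustrative, not from the slides):

descent <- function(grad, x0, eta = 0.1, n_iter = 100, proj = identity) {
  x <- x0
  for (k in 1:n_iter) x <- proj(x - eta * grad(x))  # proj = identity gives plain gradient descent
  x
}
grad_f <- function(x) 2 * x                          # gradient of f(x) = ||x||^2
proj_ball <- function(x) if (sum(x^2) > 1) x / sqrt(sum(x^2)) else x
descent(grad_f, x0 = c(3, -2))                       # unconstrained minimum, (0, 0)
descent(grad_f, x0 = c(3, -2), proj = proj_ball)     # projected descent on the unit ball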
A constrained problem is said to be convex if
$$\begin{cases} \min\{f(x)\} & \text{with } f \text{ convex}\\ \text{s.t. } g_i(x) = 0,\ \forall i = 1,\cdots,n & \text{with } g_i \text{ linear}\\ \phantom{\text{s.t. }} h_i(x) \leq 0,\ \forall i = 1,\cdots,m & \text{with } h_i \text{ convex}\end{cases}$$
Lagrangian:
$$\mathcal{L}(x, \lambda, \mu) = f(x) + \sum_{i=1}^n \lambda_i g_i(x) + \sum_{i=1}^m \mu_i h_i(x)$$
where $x$ are primal variables and $(\lambda, \mu)$ are dual variables.
Remark $\mathcal{L}$ is an affine function in $(\lambda, \mu)$
@freakonometrics 3
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Preliminary Results: Numerical Optimization
Karush–Kuhn–Tucker conditions: a convex problem has a solution $x^\star$ if and only if there exist $(\lambda^\star, \mu^\star)$ such that the following conditions hold
• stationarity: $\nabla_x \mathcal{L}(x, \lambda, \mu) = 0$ at $(x^\star, \lambda^\star, \mu^\star)$
• primal admissibility: $g_i(x^\star) = 0$ and $h_i(x^\star) \leq 0$, $\forall i$
• dual admissibility: $\mu^\star \geq 0$
Let $\underline{\mathcal{L}}$ denote the associated dual function, $\underline{\mathcal{L}}(\lambda, \mu) = \min_x \{\mathcal{L}(x, \lambda, \mu)\}$.
$\underline{\mathcal{L}}$ is a concave function in $(\lambda, \mu)$ and the dual problem is $\max_{\lambda, \mu} \{\underline{\mathcal{L}}(\lambda, \mu)\}$.
@freakonometrics 4
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
References
Motivation
Banerjee, A., Chandrasekhar, A.G., Duflo, E. & Jackson, M.O. (2016). Gossip: Identifying Central Individuals in a Social Network.
References
Belloni, A. & Chernozhukov, V. (2009). Least squares after model selection in high-dimensional sparse models.
Hastie, T., Tibshirani, R. & Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press.
@freakonometrics 5
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Preamble
Assume that $y = m(x) + \varepsilon$, where $\varepsilon$ is some idiosyncratic, unpredictable noise.
The error $\mathbb{E}[(y - \widehat{m}(x))^2]$ is the sum of three terms
• variance of the estimator: $\mathbb{E}\big[(\widehat{m}(x) - \mathbb{E}[\widehat{m}(x)])^2\big]$
• bias$^2$ of the estimator: $\big(\mathbb{E}[\widehat{m}(x)] - m(x)\big)^2$
• variance of the noise: $\mathbb{E}[(y - m(x))^2]$
(the latter exists even with a ‘perfect’ model).
@freakonometrics 6
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Preamble
Consider a parametric model, with true (unknown) parameter $\theta$; then
$$\operatorname{mse}(\widehat{\theta}) = \mathbb{E}\big[(\widehat{\theta} - \theta)^2\big] = \underbrace{\mathbb{E}\big[(\widehat{\theta} - \mathbb{E}[\widehat{\theta}])^2\big]}_{\text{variance}} + \underbrace{\big(\mathbb{E}[\widehat{\theta}] - \theta\big)^2}_{\text{bias}^2}$$
Let $\widetilde{\theta}$ denote an unbiased estimator of $\theta$. Then
$$\widehat{\theta} = \frac{\theta^2}{\theta^2 + \operatorname{mse}(\widetilde{\theta})} \cdot \widetilde{\theta} = \widetilde{\theta} - \underbrace{\frac{\operatorname{mse}(\widetilde{\theta})}{\theta^2 + \operatorname{mse}(\widetilde{\theta})} \cdot \widetilde{\theta}}_{\text{penalty}}$$
satisfies $\operatorname{mse}(\widehat{\theta}) \leq \operatorname{mse}(\widetilde{\theta})$.
@freakonometrics 7
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Occam’s Razor
The “law of parsimony”, “lex parsimoniæ”
Penalize overly complex models
@freakonometrics 8
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
James & Stein Estimator
Let $X \sim \mathcal{N}(\mu, \sigma^2 I)$. We want to estimate $\mu$.
$$\widehat{\mu}^{\text{mle}} = \overline{X}_n \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n} I\right).$$
From James & Stein (1961) Estimation with quadratic loss,
$$\widehat{\mu}^{\text{JS}} = \left(1 - \frac{(d-2)\sigma^2}{n\,\|\overline{y}\|^2}\right)\overline{y}$$
where $\|\cdot\|$ is the Euclidean norm.
One can prove that if $d \geq 3$,
$$\mathbb{E}\big[\|\widehat{\mu}^{\text{JS}} - \mu\|^2\big] < \mathbb{E}\big[\|\widehat{\mu}^{\text{mle}} - \mu\|^2\big]$$
Samworth (2015) Stein's paradox: “one should use the price of tea in China to obtain a better estimate of the chance of rain in Melbourne”.
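A small simulation makes the risk improvement visible; this is only a sketch, with assumed values $d = 10$, $\sigma^2 = 1$, $n = 1$ and $\mu = (1, \cdots, 1)$:

set.seed(1)
d <- 10; mu <- rep(1, d); B <- 5000
risk_mle <- risk_js <- numeric(B)
for (b in 1:B) {
  y <- rnorm(d, mean = mu, sd = 1)           # one observation per coordinate (n = 1)
  risk_mle[b] <- sum((y - mu)^2)
  mu_js <- (1 - (d - 2) / sum(y^2)) * y      # James-Stein shrinkage towards the origin
  risk_js[b] <- sum((mu_js - mu)^2)
}
c(mle = mean(risk_mle), js = mean(risk_js))  # the James-Stein risk is smaller when d >= 3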
@freakonometrics 9
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
James & Stein Estimator
Heuristic: accept a biased estimator in order to decrease the variance.
See Efron (2010) Large-Scale Inference
@freakonometrics 10
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Motivation: Avoiding Overfit
Generalization: the model should perform well on new data (and not only on the training data).
@freakonometrics 11
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Reducing Dimension with PCA
Use principal components to reduce dimension (on centered and scaled variables): we want $d$ vectors $z_1, \cdots, z_d$ such that
the first component is $z_1 = X\omega_1$ where
$$\omega_1 = \operatorname{argmax}_{\|\omega\| = 1}\ \|X\omega\|^2 = \operatorname{argmax}_{\|\omega\| = 1}\ \omega^{\mathsf{T}}X^{\mathsf{T}}X\omega$$
and the second component is $z_2 = X\omega_2$ where
$$\omega_2 = \operatorname{argmax}_{\|\omega\| = 1}\ \|X^{(1)}\omega\|^2$$
[Figures: log mortality rates against age, and the first two PC scores, with years 1914–1919 and 1940–1944 highlighted]
with $X^{(1)} = X - \underbrace{X\omega_1}_{z_1}\omega_1^{\mathsf{T}}$.
@freakonometrics 12
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Reducing Dimension with PCA
A regression on (the $d$) principal components, $y = z^{\mathsf{T}}\beta + \eta$, could be an interesting idea; unfortunately, principal components have no reason to be correlated with $y$. The first component was $z_1 = X\omega_1$ where
$$\omega_1 = \operatorname{argmax}_{\|\omega\| = 1}\ \|X\omega\|^2 = \operatorname{argmax}_{\|\omega\| = 1}\ \omega^{\mathsf{T}}X^{\mathsf{T}}X\omega$$
It is an unsupervised technique.
Instead, use partial least squares, introduced in Wold (1966) Estimation of Principal Components and Related Models by Iterative Least Squares. The first component is $z_1 = X\omega_1$ where
$$\omega_1 = \operatorname{argmax}_{\|\omega\| = 1}\ \{\langle y, X\omega\rangle\} = \operatorname{argmax}_{\|\omega\| = 1}\ \omega^{\mathsf{T}}X^{\mathsf{T}}yy^{\mathsf{T}}X\omega$$
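A possible comparison of the two approaches in R, using the pls package (assumed available); the data are simulated for illustration only:

library(pls)
set.seed(1)
n <- 100; X <- matrix(rnorm(n * 5), n, 5); y <- X[, 1] - 2 * X[, 2] + rnorm(n)
df <- data.frame(y = y, X = I(X))
fit_pcr <- pcr(y ~ X, data = df, ncomp = 3, scale = TRUE, validation = "CV")   # components maximize the variance of X only
fit_pls <- plsr(y ~ X, data = df, ncomp = 3, scale = TRUE, validation = "CV")  # components also track the covariance with y
summary(fit_pcr)
summary(fit_pls)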
@freakonometrics 13
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Terminology
Consider a dataset $\{y_i, x_i\}$, assumed to be generated from $(Y, X)$, from an unknown distribution $P$.
Let $m_0(\cdot)$ be the “true” model. Assume that $y_i = m_0(x_i) + \varepsilon_i$.
In a regression context (quadratic loss function), the risk associated to $m$ is
$$R(m) = \mathbb{E}_P\big[(Y - m(X))^2\big]$$
An optimal model $m^\star$ within a class $\mathcal{M}$ satisfies
$$R(m^\star) = \inf_{m \in \mathcal{M}} R(m)$$
Such a model $m^\star$ is usually called an oracle.
Observe that $m^\star(x) = \mathbb{E}[Y|X = x]$ is the solution of $R(m^\star) = \inf_{m \in \mathcal{M}} R(m)$ where $\mathcal{M}$ is the set of measurable functions.
@freakonometrics 14
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
The empirical risk is
$$\widehat{R}_n(m) = \frac{1}{n}\sum_{i=1}^n \big(y_i - m(x_i)\big)^2$$
For instance, $m$ can be a linear predictor, $m(x) = \beta_0 + x^{\mathsf{T}}\beta$, where $\theta = (\beta_0, \beta)$ should be estimated (trained).
$\mathbb{E}\big[\widehat{R}_n(m)\big] = \mathbb{E}\big[(m(X) - Y)^2\big]$ can be expressed as
$$\underbrace{\mathbb{E}\big[(m(X) - \mathbb{E}[m(X)|X])^2\big]}_{\text{variance of } m} + \underbrace{\mathbb{E}\big[(\mathbb{E}[m(X)|X] - \underbrace{\mathbb{E}[Y|X]}_{m_0(X)})^2\big]}_{\text{bias of } m} + \underbrace{\mathbb{E}\big[(Y - \underbrace{\mathbb{E}[Y|X]}_{m_0(X)})^2\big]}_{\text{variance of the noise}}$$
The third term is the risk of the “optimal” estimator $m^\star$, and it cannot be decreased.
@freakonometrics 15
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Mallows Penalty and Model Complexity
Consider a linear predictor (see #1), i.e. $\widehat{y} = \widehat{m}(x) = Ay$.
Assume that $y = m_0(x) + \varepsilon$, with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}[\varepsilon] = \sigma^2 I$.
Let $\|\cdot\|$ denote the Euclidean norm.
Empirical risk: $\widehat{R}_n(m) = \frac{1}{n}\|y - \widehat{m}(x)\|^2$
Vapnik's risk: $\mathbb{E}\big[\widehat{R}_n(m)\big] = \frac{1}{n}\|m_0(x) - \widehat{m}(x)\|^2 + \frac{1}{n}\mathbb{E}\big[\|y - m_0(x)\|^2\big]$, with $m_0(x) = \mathbb{E}[Y|X = x]$.
Observe that
$$n\,\mathbb{E}\big[\widehat{R}_n(\widehat{m})\big] = \mathbb{E}\big[\|y - \widehat{m}(x)\|^2\big] = \|(I - A)m_0\|^2 + \sigma^2\|I - A\|^2$$
while
$$\mathbb{E}\big[\|m_0(x) - \widehat{m}(x)\|^2\big] = \underbrace{\|(I - A)m_0\|^2}_{\text{bias}} + \underbrace{\sigma^2\|A\|^2}_{\text{variance}}$$
@freakonometrics 16
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Mallows Penalty and Model Complexity
One can obtain
$$\mathbb{E}\big[R_n(\widehat{m})\big] = \mathbb{E}\big[\widehat{R}_n(\widehat{m})\big] + 2\,\frac{\sigma^2}{n}\,\operatorname{trace}(A).$$
If $\operatorname{trace}(A) \geq 0$, the empirical risk underestimates the true risk of the estimator.
The number of degrees of freedom of the (linear) predictor is related to $\operatorname{trace}(A)$;
$2\,\frac{\sigma^2}{n}\,\operatorname{trace}(A)$ is called Mallows' penalty $C_L$.
If $A$ is a projection matrix, $\operatorname{trace}(A)$ is the dimension of the projection space, $p$, and we obtain Mallows' $C_P$, $2\,\frac{\sigma^2}{n}\,p$.
Remark: Mallows (1973) Some Comments on Cp introduced this penalty while focusing on the $R^2$.
@freakonometrics 17
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Penalty and Likelihood
$C_P$ is associated to a quadratic risk;
an alternative is to use a distance on the (conditional) distribution of $Y$, namely the Kullback–Leibler distance
discrete case: $D_{KL}(P\|Q) = \sum_i P(i)\log\dfrac{P(i)}{Q(i)}$
continuous case: $D_{KL}(P\|Q) = \displaystyle\int_{-\infty}^{\infty} p(x)\log\dfrac{p(x)}{q(x)}\,dx$
Let $f$ denote the true (unknown) density, and $f_\theta$ some parametric distribution; then
$$D_{KL}(f\|f_\theta) = \int_{-\infty}^{\infty} f(x)\log\frac{f(x)}{f_\theta(x)}\,dx = \underbrace{\int f(x)\log[f(x)]\,dx}_{\text{relative information}} - \int f(x)\log[f_\theta(x)]\,dx$$
Hence
$$\text{minimize}\ \{D_{KL}(f\|f_\theta)\} \longleftrightarrow \text{maximize}\ \mathbb{E}\big[\log f_\theta(X)\big]$$
@freakonometrics 18
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Penalty and Likelihood
Akaike (1974) A new look at the statistical model identification observed that for $n$ large enough
$$\mathbb{E}\big[\log f_\theta(X)\big] \sim \log[L(\widehat{\theta})] - \dim(\theta)$$
Thus
$$AIC = -2\log L(\widehat{\theta}) + 2\dim(\theta)$$
Example: in a (Gaussian) linear model, $y_i = \beta_0 + x_i^{\mathsf{T}}\beta + \varepsilon_i$,
$$AIC = n\log\left(\frac{1}{n}\sum_{i=1}^n \widehat{\varepsilon}_i^{\,2}\right) + 2[\dim(\beta) + 2]$$
@freakonometrics 19
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Penalty and Likelihood
Remark: this is valid for large samples (rule of thumb: $n/\dim(\theta) > 40$); otherwise use a corrected AIC,
$$AIC_c = AIC + \underbrace{\frac{2k(k+1)}{n - k - 1}}_{\text{bias correction}} \quad\text{where } k = \dim(\theta),$$
see Sugiura (1978) Further analysis of the data by Akaike's information criterion and the finite corrections, second order AIC.
Using a Bayesian interpretation, Schwarz (1978) Estimating the dimension of a model obtained
$$BIC = -2\log L(\widehat{\theta}) + \log(n)\dim(\theta).$$
Observe that the criteria considered are of the form
$$\text{criterion} = -\text{function}\big(L(\widehat{\theta})\big) + \text{penalty}\big(\text{complexity}\big)$$
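In R, both criteria are directly available for fitted models; a quick sketch on simulated data (variable names are illustrative):

set.seed(1)
df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
m1 <- lm(y ~ x1, data = df)
m2 <- lm(y ~ x1 + x2, data = df)
AIC(m1, m2)   # -2 log L + 2 dim(theta)
BIC(m1, m2)   # -2 log L + log(n) dim(theta)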
@freakonometrics 20
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Estimation of the Risk
Consider a naive bootstrap procedure, based on a bootstrap sample $S_b = \{(y_i^{(b)}, x_i^{(b)})\}$.
The plug-in estimator of the empirical risk is
$$\widehat{R}_n(\widehat{m}^{(b)}) = \frac{1}{n}\sum_{i=1}^n \big(y_i - \widehat{m}^{(b)}(x_i)\big)^2$$
and then
$$\widehat{R}_n = \frac{1}{B}\sum_{b=1}^B \widehat{R}_n(\widehat{m}^{(b)}) = \frac{1}{B}\sum_{b=1}^B \frac{1}{n}\sum_{i=1}^n \big(y_i - \widehat{m}^{(b)}(x_i)\big)^2$$
@freakonometrics 21
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Estimation of the Risk
One might improve this estimate using an out-of-bag procedure
$$\widehat{R}_n = \frac{1}{n}\sum_{i=1}^n \frac{1}{\#B_i}\sum_{b \in B_i}\big(y_i - \widehat{m}^{(b)}(x_i)\big)^2$$
where $B_i$ is the set of all bootstrap samples that do not contain $(y_i, x_i)$.
Remark: $\mathbb{P}\big((y_i, x_i) \notin S_b\big) = \left(1 - \frac{1}{n}\right)^n \sim e^{-1} \approx 36.78\%$.
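A minimal sketch of the out-of-bag estimate, assuming a cubic polynomial regression on simulated data (all names and sizes are illustrative):

set.seed(1)
n <- 100; x <- runif(n); y <- sin(2 * pi * x) + rnorm(n, sd = .3)
B <- 200
oob_err <- replicate(n, numeric(0), simplify = FALSE)
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)
  fit <- lm(y ~ poly(x, 3), data = data.frame(x = x[idx], y = y[idx]))
  out <- setdiff(1:n, idx)                     # observations not in this bootstrap sample
  pred <- predict(fit, newdata = data.frame(x = x[out]))
  for (j in seq_along(out))
    oob_err[[out[j]]] <- c(oob_err[[out[j]]], (y[out[j]] - pred[j])^2)
}
mean(sapply(oob_err, mean), na.rm = TRUE)      # out-of-bag estimate of the risk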
@freakonometrics 22
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Linear Regression Shortcoming
Least Squares Estimator: $\widehat{\beta} = (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}y$
Unbiased Estimator: $\mathbb{E}[\widehat{\beta}] = \beta$
Variance: $\operatorname{Var}[\widehat{\beta}] = \sigma^2(X^{\mathsf{T}}X)^{-1}$,
which can be (extremely) large when $\det[X^{\mathsf{T}}X] \sim 0$.
$$X = \begin{pmatrix} 1 & -1 & 2\\ 1 & 0 & 1\\ 1 & 2 & -1\\ 1 & 1 & 0 \end{pmatrix}\quad\text{then}\quad X^{\mathsf{T}}X = \begin{pmatrix} 4 & 2 & 2\\ 2 & 6 & -4\\ 2 & -4 & 6 \end{pmatrix}\quad\text{while}\quad X^{\mathsf{T}}X + I = \begin{pmatrix} 5 & 2 & 2\\ 2 & 7 & -4\\ 2 & -4 & 7 \end{pmatrix}$$
eigenvalues: $\{10, 6, 0\}$ and $\{11, 7, 1\}$
Ad-hoc strategy: use $X^{\mathsf{T}}X + \lambda I$
@freakonometrics 23
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Linear Regression Shortcoming
Evolution of $(\beta_1, \beta_2) \mapsto \sum_{i=1}^n [y_i - (\beta_1 x_{1,i} + \beta_2 x_{2,i})]^2$ when $\operatorname{cor}(X_1, X_2) = r \in [0, 1]$, on top.
Below, Ridge regression
$$(\beta_1, \beta_2) \mapsto \sum_{i=1}^n [y_i - (\beta_1 x_{1,i} + \beta_2 x_{2,i})]^2 + \lambda(\beta_1^2 + \beta_2^2)$$
where $\lambda \in [0, \infty)$, when $\operatorname{cor}(X_1, X_2) \sim 1$ (collinearity).
@freakonometrics 24
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Normalization: Euclidean $\ell_2$ vs. Mahalanobis
We want to penalize complicated models:
if $\beta_k$ is “too small”, we prefer to have $\beta_k = 0$.
Instead of $d(x, y) = \sqrt{(x - y)^{\mathsf{T}}(x - y)}$, use $d_\Sigma(x, y) = \sqrt{(x - y)^{\mathsf{T}}\Sigma^{-1}(x - y)}$.
@freakonometrics 25
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
... like least squares, but it shrinks estimated coefficients towards 0.
$$\widehat{\beta}_\lambda^{\text{ridge}} = \operatorname{argmin}\left\{\sum_{i=1}^n (y_i - x_i^{\mathsf{T}}\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2\right\}$$
$$\widehat{\beta}_\lambda^{\text{ridge}} = \operatorname{argmin}\Big\{\underbrace{\|y - X\beta\|_2^2}_{=\text{criterion}} + \underbrace{\lambda\|\beta\|_2^2}_{=\text{penalty}}\Big\}$$
$\lambda \geq 0$ is a tuning parameter.
The constant is usually unpenalized. The true equation is
$$\widehat{\beta}_\lambda^{\text{ridge}} = \operatorname{argmin}\Big\{\underbrace{\|y - (\beta_0 + X\beta)\|_2^2}_{=\text{criterion}} + \underbrace{\lambda\|\beta\|_2^2}_{=\text{penalty}}\Big\}$$
@freakonometrics 26
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
$$\widehat{\beta}_\lambda^{\text{ridge}} = \operatorname{argmin}\Big\{\|y - (\beta_0 + X\beta)\|_2^2 + \lambda\|\beta\|_2^2\Big\}$$
can be seen as a constrained optimization problem
$$\widehat{\beta}_\lambda^{\text{ridge}} = \operatorname{argmin}_{\|\beta\|_2^2 \leq h_\lambda}\Big\{\|y - (\beta_0 + X\beta)\|_2^2\Big\}$$
Explicit solution
$$\widehat{\beta}_\lambda = (X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}y$$
If $\lambda \to 0$, $\widehat{\beta}_0^{\text{ridge}} = \widehat{\beta}^{\text{ols}}$; if $\lambda \to \infty$, $\widehat{\beta}_\infty^{\text{ridge}} = 0$.
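A direct sketch of the explicit solution in R (assuming centred and scaled columns and no intercept; the data are simulated for illustration):

ridge_beta <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))  # (X'X + lambda I)^{-1} X'y
}
set.seed(1)
X <- scale(matrix(rnorm(200), 50, 4)); y <- X %*% c(1, -1, 0, 0) + rnorm(50)
cbind(ols   = as.vector(ridge_beta(X, y, 0)),
      ridge = as.vector(ridge_beta(X, y, 10)))                   # ridge coefficients are shrunk towards 0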
@freakonometrics 27
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
This penalty can be seen as rather unfair if components of $x$ are not expressed on the same scale
• center: $\overline{x}_j = 0$, then $\widehat{\beta}_0 = \overline{y}$
• scale: $x_j^{\mathsf{T}}x_j = 1$
Then compute
$$\widehat{\beta}_\lambda^{\text{ridge}} = \operatorname{argmin}\Big\{\underbrace{\|y - X\beta\|_2^2}_{=\text{loss}} + \underbrace{\lambda\|\beta\|_2^2}_{=\text{penalty}}\Big\}$$
@freakonometrics 28
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
Observe that if $x_{j_1} \perp x_{j_2}$, then
$$\widehat{\beta}_\lambda^{\text{ridge}} = [1 + \lambda]^{-1}\widehat{\beta}^{\text{ols}},$$
which explains the relationship with shrinkage.
But generally, it is not the case...
Theorem There exists $\lambda$ such that $\operatorname{mse}[\widehat{\beta}_\lambda^{\text{ridge}}] \leq \operatorname{mse}[\widehat{\beta}^{\text{ols}}]$
@freakonometrics 29
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
$$\mathcal{L}_\lambda(\beta) = \sum_{i=1}^n (y_i - \beta_0 - x_i^{\mathsf{T}}\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2$$
$$\frac{\partial \mathcal{L}_\lambda(\beta)}{\partial \beta} = -2X^{\mathsf{T}}y + 2(X^{\mathsf{T}}X + \lambda I)\beta$$
$$\frac{\partial^2 \mathcal{L}_\lambda(\beta)}{\partial \beta\,\partial \beta^{\mathsf{T}}} = 2(X^{\mathsf{T}}X + \lambda I)$$
where $X^{\mathsf{T}}X$ is a positive semi-definite matrix and $\lambda I$ is a positive definite matrix, and
$$\widehat{\beta}_\lambda = (X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}y$$
@freakonometrics 30
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
The Bayesian Interpretation
From a Bayesian perspective,
$$\underbrace{\mathbb{P}[\theta|y]}_{\text{posterior}} \propto \underbrace{\mathbb{P}[y|\theta]}_{\text{likelihood}} \cdot \underbrace{\mathbb{P}[\theta]}_{\text{prior}}$$
i.e.
$$\log \mathbb{P}[\theta|y] = \underbrace{\log \mathbb{P}[y|\theta]}_{\text{log-likelihood}} + \underbrace{\log \mathbb{P}[\theta]}_{\text{penalty}}$$
If $\beta$ has a prior $\mathcal{N}(0, \tau^2 I)$ distribution, then its posterior distribution has mean
$$\mathbb{E}[\beta|y, X] = \left(X^{\mathsf{T}}X + \frac{\sigma^2}{\tau^2}I\right)^{-1}X^{\mathsf{T}}y.$$
@freakonometrics 31
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of the Ridge Estimator
$$\widehat{\beta}_\lambda = (X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}y$$
$$\mathbb{E}[\widehat{\beta}_\lambda] = X^{\mathsf{T}}X(\lambda I + X^{\mathsf{T}}X)^{-1}\beta,$$
i.e. $\mathbb{E}[\widehat{\beta}_\lambda] \neq \beta$: the ridge estimator is biased.
Observe that $\mathbb{E}[\widehat{\beta}_\lambda] \to 0$ as $\lambda \to \infty$.
Assume that $X$ is an orthogonal design matrix, i.e. $X^{\mathsf{T}}X = I$; then
$$\widehat{\beta}_\lambda = (1 + \lambda)^{-1}\widehat{\beta}^{\text{ols}}.$$
@freakonometrics 32
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of the Ridge Estimator
Set $W_\lambda = (I + \lambda[X^{\mathsf{T}}X]^{-1})^{-1}$. One can prove that
$$W_\lambda\,\widehat{\beta}^{\text{ols}} = \widehat{\beta}_\lambda.$$
Thus,
$$\operatorname{Var}[\widehat{\beta}_\lambda] = W_\lambda \operatorname{Var}[\widehat{\beta}^{\text{ols}}]\,W_\lambda^{\mathsf{T}}$$
and
$$\operatorname{Var}[\widehat{\beta}_\lambda] = \sigma^2(X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}X[(X^{\mathsf{T}}X + \lambda I)^{-1}]^{\mathsf{T}}.$$
Observe that
$$\operatorname{Var}[\widehat{\beta}^{\text{ols}}] - \operatorname{Var}[\widehat{\beta}_\lambda] = \sigma^2 W_\lambda\big[2\lambda(X^{\mathsf{T}}X)^{-2} + \lambda^2(X^{\mathsf{T}}X)^{-3}\big]W_\lambda^{\mathsf{T}} \geq 0.$$
@freakonometrics 33
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of the Ridge Estimator
Hence, the confidence ellipsoid of the ridge estimator is indeed smaller than that of the OLS.
If $X$ is an orthogonal design matrix,
$$\operatorname{Var}[\widehat{\beta}_\lambda] = \sigma^2(1 + \lambda)^{-2}I.$$
$$\operatorname{mse}[\widehat{\beta}_\lambda] = \sigma^2\operatorname{trace}\big(W_\lambda(X^{\mathsf{T}}X)^{-1}W_\lambda^{\mathsf{T}}\big) + \beta^{\mathsf{T}}(W_\lambda - I)^{\mathsf{T}}(W_\lambda - I)\beta.$$
If $X$ is an orthogonal design matrix,
$$\operatorname{mse}[\widehat{\beta}_\lambda] = \frac{p\sigma^2}{(1 + \lambda)^2} + \frac{\lambda^2}{(1 + \lambda)^2}\beta^{\mathsf{T}}\beta$$
@freakonometrics 34
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
$$\operatorname{mse}[\widehat{\beta}_\lambda] = \frac{p\sigma^2}{(1 + \lambda)^2} + \frac{\lambda^2}{(1 + \lambda)^2}\beta^{\mathsf{T}}\beta$$
is minimal for
$$\lambda^\star = \frac{p\sigma^2}{\beta^{\mathsf{T}}\beta}$$
Note that there exists $\lambda > 0$ such that $\operatorname{mse}[\widehat{\beta}_\lambda] < \operatorname{mse}[\widehat{\beta}_0] = \operatorname{mse}[\widehat{\beta}^{\text{ols}}]$.
@freakonometrics 35
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
SVD decomposition
Consider the singular value decomposition $X = UDV^{\mathsf{T}}$. Then
$$\widehat{\beta}^{\text{ols}} = V D^{-2}D\,U^{\mathsf{T}}y$$
$$\widehat{\beta}_\lambda = V(D^2 + \lambda I)^{-1}D\,U^{\mathsf{T}}y$$
Observe that
$$D_{i,i}^{-1} \geq \frac{D_{i,i}}{D_{i,i}^2 + \lambda};$$
hence, the ridge penalty shrinks singular values.
Set now $R = UD$ (an $n \times n$ matrix), so that $X = RV^{\mathsf{T}}$; then
$$\widehat{\beta}_\lambda = V(R^{\mathsf{T}}R + \lambda I)^{-1}R^{\mathsf{T}}y$$
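The same estimator can be computed through the SVD; a sketch reusing the ridge_beta() function and the toy X, y from the earlier ridge sketch:

ridge_svd <- function(X, y, lambda) {
  s <- svd(X)
  d <- s$d
  s$v %*% ((d / (d^2 + lambda)) * crossprod(s$u, y))  # V (D^2 + lambda I)^{-1} D U'y
}
all.equal(as.vector(ridge_svd(X, y, 10)),
          as.vector(ridge_beta(X, y, 10)))            # both give the same ridge estimate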
@freakonometrics 36
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Hat matrix and Degrees of Freedom
Recall that $\widehat{Y} = HY$ with
$$H = X(X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}$$
Similarly,
$$H_\lambda = X(X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}$$
$$\operatorname{trace}[H_\lambda] = \sum_{j=1}^p \frac{d_{j,j}^2}{d_{j,j}^2 + \lambda} \to 0, \text{ as } \lambda \to \infty.$$
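This trace is the effective number of degrees of freedom of the ridge smoother; a sketch, computed from the singular values of the same toy X:

df_ridge <- function(X, lambda) {
  d <- svd(X)$d
  sum(d^2 / (d^2 + lambda))                           # trace of the ridge hat matrix
}
sapply(c(0, 1, 10, 100), function(l) df_ridge(X, l))  # decreases from p towards 0 as lambda grows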
@freakonometrics 37
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Sparsity Issues
In several applications, $k$ can be (very) large, but a lot of features are just noise: $\beta_j = 0$ for many $j$'s. Let $s$ denote the number of relevant features, with $s \ll k$; cf. Hastie, Tibshirani & Wainwright (2015) Statistical Learning with Sparsity,
$$s = \operatorname{card}\{\mathcal{S}\}\quad\text{where}\quad \mathcal{S} = \{j;\ \beta_j \neq 0\}$$
The model is now $y = X_{\mathcal{S}}^{\mathsf{T}}\beta_{\mathcal{S}} + \varepsilon$, where $X_{\mathcal{S}}^{\mathsf{T}}X_{\mathcal{S}}$ is a full-rank matrix.
@freakonometrics 38
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
The Ridge regression problem was to solve
$$\widehat{\beta} = \operatorname{argmin}_{\beta \in \{\|\beta\|_2 \leq s\}}\{\|Y - X^{\mathsf{T}}\beta\|_2^2\}$$
Define $\|a\|_0 = \sum_i \mathbf{1}(|a_i| > 0)$.
Here $\dim(\beta) = k$ but $\|\beta\|_0 = s$.
We wish we could solve
$$\widehat{\beta} = \operatorname{argmin}_{\beta \in \{\|\beta\|_0 = s\}}\{\|Y - X^{\mathsf{T}}\beta\|_2^2\}$$
Problem: it is usually not possible to consider all possible constraints, since $\binom{k}{s}$ sets of coefficients would have to be compared here (with $k$ (very) large).
@freakonometrics 39
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
In a convex problem, solve the dual problem;
e.g. in the Ridge regression, the primal problem is
$$\min_{\beta \in \{\|\beta\|_2 \leq s\}}\{\|Y - X^{\mathsf{T}}\beta\|_2^2\}$$
and the dual problem is
$$\min_{\beta \in \{\|Y - X^{\mathsf{T}}\beta\|_2 \leq t\}}\{\|\beta\|_2^2\}$$
@freakonometrics 40
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
Idea: solve the dual problem
$$\widehat{\beta} = \operatorname{argmin}_{\beta \in \{\|Y - X^{\mathsf{T}}\beta\|_2 \leq h\}}\{\|\beta\|_0\}$$
where we might convexify the $\ell_0$ norm, $\|\cdot\|_0$.
@freakonometrics 41
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
On $[-1, +1]^k$, the convex hull of $\|\beta\|_0$ is $\|\beta\|_1$.
On $[-a, +a]^k$, the convex hull of $\|\beta\|_0$ is $a^{-1}\|\beta\|_1$.
Hence, why not solve
$$\widehat{\beta} = \operatorname{argmin}_{\beta;\ \|\beta\|_1 \leq \tilde{s}}\{\|Y - X^{\mathsf{T}}\beta\|^2\}$$
which is equivalent (Kuhn–Tucker theorem) to the Lagrangian optimization problem
$$\widehat{\beta} = \operatorname{argmin}\{\|Y - X^{\mathsf{T}}\beta\|_2^2 + \lambda\|\beta\|_1\}$$
@freakonometrics 42
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO: Least Absolute Shrinkage and Selection Operator
$$\widehat{\beta} \in \operatorname{argmin}\{\|Y - X^{\mathsf{T}}\beta\|_2^2 + \lambda\|\beta\|_1\}$$
is a convex problem (several algorithms exist), but not strictly convex (no unicity of the minimum). Nevertheless, predictions $\widehat{y} = x^{\mathsf{T}}\widehat{\beta}$ are unique.
Algorithms: MM (majorize–minimize) and coordinate descent; see Hunter & Lange (2003) A Tutorial on MM Algorithms.
@freakonometrics 43
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression
No explicit solution...
If $\lambda \to 0$, $\widehat{\beta}_0^{\text{lasso}} = \widehat{\beta}^{\text{ols}}$; if $\lambda \to \infty$, $\widehat{\beta}_\infty^{\text{lasso}} = 0$.
@freakonometrics 44
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression
For some $\lambda$, there are $k$'s such that $\widehat{\beta}_{k,\lambda}^{\text{lasso}} = 0$.
Further, $\lambda \mapsto \widehat{\beta}_{k,\lambda}^{\text{lasso}}$ is piecewise linear.
@freakonometrics 45
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression
In the orthogonal case, $X^{\mathsf{T}}X = I$,
$$\widehat{\beta}_{k,\lambda}^{\text{lasso}} = \operatorname{sign}(\widehat{\beta}_k^{\text{ols}})\left(|\widehat{\beta}_k^{\text{ols}}| - \frac{\lambda}{2}\right)_+,$$
i.e. the LASSO estimate is related to the soft-threshold function...
@freakonometrics 46
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimal LASSO Penalty
Use cross-validation, e.g. K-fold,
$$\widehat{\beta}_{(-k)}(\lambda) = \operatorname{argmin}\left\{\sum_{i \notin I_k}[y_i - x_i^{\mathsf{T}}\beta]^2 + \lambda\|\beta\|_1\right\}$$
then compute the sum of squared errors,
$$Q_k(\lambda) = \sum_{i \in I_k}[y_i - x_i^{\mathsf{T}}\widehat{\beta}_{(-k)}(\lambda)]^2$$
and finally solve
$$\lambda^\star = \operatorname{argmin}\left\{\overline{Q}(\lambda) = \frac{1}{K}\sum_k Q_k(\lambda)\right\}$$
Note that this might overfit, so Hastie, Tibshirani & Friedman (2009) Elements of Statistical Learning suggest the largest $\lambda$ such that
$$\overline{Q}(\lambda) \leq \overline{Q}(\lambda^\star) + \operatorname{se}[\lambda^\star]\quad\text{with}\quad \operatorname{se}[\lambda]^2 = \frac{1}{K^2}\sum_{k=1}^K [Q_k(\lambda) - \overline{Q}(\lambda)]^2$$
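With glmnet, this K-fold strategy and the one-standard-error rule are built in; a sketch on simulated data:

library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10); y <- X[, 1:2] %*% c(2, -1) + rnorm(100)
cv <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
c(lambda.min = cv$lambda.min, lambda.1se = cv$lambda.1se)
coef(cv, s = "lambda.1se")   # sparser model selected with the one-standard-error rule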
@freakonometrics 47
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO and Ridge, with R
> library(glmnet)
> chicago <- read.table("http://freakonometrics.free.fr/chicago.txt", header = TRUE, sep = ";")
> standardize <- function(x) (x - mean(x)) / sd(x)
> z0 <- standardize(chicago[, 1])
> z1 <- standardize(chicago[, 3])
> z2 <- standardize(chicago[, 4])
> ridge <- glmnet(cbind(z1, z2), z0, alpha = 0, intercept = FALSE, lambda = 1)
> lasso <- glmnet(cbind(z1, z2), z0, alpha = 1, intercept = FALSE, lambda = 1)
> elastic <- glmnet(cbind(z1, z2), z0, alpha = .5, intercept = FALSE, lambda = 1)
Elastic net: $\lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2$
@freakonometrics 48
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression, Smoothing and Overfit
LASSO can be used to avoid overfit.
@freakonometrics 49
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge vs. LASSO
Consider simulated data (output on the right).
With orthogonal variables, the shrinkage operators are shown below.
[Figures: $\widehat{\beta}$(ridge) and $\widehat{\beta}$(lasso) as functions of $\widehat{\beta}$, and coefficient paths against the $\ell_1$ norm]
@freakonometrics 50
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
First idea: given some initial guess $\beta_{(0)}$, $|\beta| \sim |\beta_{(0)}| + \frac{1}{2|\beta_{(0)}|}(\beta^2 - \beta_{(0)}^2)$.
The LASSO estimate can then be derived from iterated Ridge estimates:
$$\|y - X\beta_{(k+1)}\|_2^2 + \lambda\|\beta_{(k+1)}\|_1 \sim \|y - X\beta_{(k+1)}\|_2^2 + \frac{\lambda}{2}\sum_j \frac{1}{|\beta_{j,(k)}|}[\beta_{j,(k+1)}]^2,$$
which is a weighted ridge penalty function.
Thus,
$$\beta_{(k+1)} = \big(X^{\mathsf{T}}X + \lambda\Delta_{(k)}\big)^{-1}X^{\mathsf{T}}y$$
where $\Delta_{(k)} = \operatorname{diag}[|\beta_{j,(k)}|^{-1}]$. Then $\beta_{(k)} \to \widehat{\beta}^{\text{lasso}}$ as $k \to \infty$.
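A heuristic sketch of this iteratively reweighted ridge scheme (the small constant eps avoids division by zero; data and lambda are illustrative):

lasso_irls <- function(X, y, lambda, n_iter = 100, eps = 1e-8) {
  p <- ncol(X)
  beta <- rep(1, p)
  for (k in 1:n_iter) {
    Delta <- diag(1 / pmax(abs(beta), eps), nrow = p)   # weights |beta_{j,(k)}|^{-1}
    beta <- as.vector(solve(crossprod(X) + lambda * Delta, crossprod(X, y)))
  }
  beta
}
set.seed(1)
X <- scale(matrix(rnorm(100 * 5), 100, 5)); y <- X %*% c(2, -1, 0, 0, 0) + rnorm(100)
round(lasso_irls(X, y, lambda = 50), 3)   # the last three coefficients shrink towards zero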
@freakonometrics 51
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of LASSO Estimate
From this iterative technique,
$$\widehat{\beta}_\lambda^{\text{lasso}} \sim \big(X^{\mathsf{T}}X + \lambda\Delta\big)^{-1}X^{\mathsf{T}}y$$
where $\Delta = \operatorname{diag}[|\widehat{\beta}_{j,\lambda}^{\text{lasso}}|^{-1}]$ if $\widehat{\beta}_{j,\lambda}^{\text{lasso}} \neq 0$, and $0$ otherwise.
Thus,
$$\mathbb{E}[\widehat{\beta}_\lambda^{\text{lasso}}] \sim \big(X^{\mathsf{T}}X + \lambda\Delta\big)^{-1}X^{\mathsf{T}}X\beta$$
and
$$\operatorname{Var}[\widehat{\beta}_\lambda^{\text{lasso}}] \sim \sigma^2\big(X^{\mathsf{T}}X + \lambda\Delta\big)^{-1}X^{\mathsf{T}}X\big(X^{\mathsf{T}}X + \lambda\Delta\big)^{-1}$$
@freakonometrics 52
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
Consider here a simplified problem: $\min_{a\in\mathbb{R}}\left\{\frac{1}{2}(a - b)^2 + \lambda|a|\right\}$ with $\lambda > 0$; call the objective $g(a)$.
Observe that $g'(0) = -b \pm \lambda$. Then
• if $|b| \leq \lambda$, then $a^\star = 0$
• if $b \geq \lambda$, then $a^\star = b - \lambda$
• if $b \leq -\lambda$, then $a^\star = b + \lambda$
$$a^\star = \operatorname{argmin}_{a\in\mathbb{R}}\left\{\frac{1}{2}(a - b)^2 + \lambda|a|\right\} = S_\lambda(b) = \operatorname{sign}(b)\cdot(|b| - \lambda)_+,$$
also called the soft-thresholding operator.
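In R the operator is one line; a small sketch (reused in the sketches further below):

soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
soft_threshold(c(-3, -0.5, 0.2, 2), lambda = 1)   # -2  0  0  1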
@freakonometrics 53
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
Definition For any convex function $h$, define the proximal operator of $h$,
$$\operatorname{proximal}_h(y) = \operatorname{argmin}_{x\in\mathbb{R}^d}\left\{\frac{1}{2}\|x - y\|_2^2 + h(x)\right\}$$
Note that
$$\operatorname{proximal}_{\lambda\|\cdot\|_2^2}(y) = \frac{1}{1 + \lambda}\,y \qquad \text{(shrinkage operator)}$$
$$\operatorname{proximal}_{\lambda\|\cdot\|_1}(y) = S_\lambda(y) = \operatorname{sign}(y)\cdot(|y| - \lambda)_+$$
@freakonometrics 54
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
We want to solve here
$$\widehat{\theta} \in \operatorname{argmin}_{\theta\in\mathbb{R}^d}\Big\{\underbrace{\frac{1}{n}\|y - m_\theta(x)\|_2^2}_{f(\theta)} + \underbrace{\lambda\,\text{penalty}(\theta)}_{g(\theta)}\Big\},$$
where $f$ is convex and smooth, and $g$ is convex, but not smooth...
1. Focus on $f$: descent lemma, $\forall \theta, \theta'$,
$$f(\theta) \leq f(\theta') + \langle\nabla f(\theta'), \theta - \theta'\rangle + \frac{t}{2}\|\theta - \theta'\|_2^2$$
Consider a gradient descent sequence $\theta_k$, i.e. $\theta_{k+1} = \theta_k - t^{-1}\nabla f(\theta_k)$; then
$$f(\theta) \leq \underbrace{f(\theta_k) + \langle\nabla f(\theta_k), \theta - \theta_k\rangle + \frac{t}{2}\|\theta - \theta_k\|_2^2}_{\varphi(\theta):\ \theta_{k+1} = \operatorname{argmin}\{\varphi(\theta)\}}$$
@freakonometrics 55
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
2. Add the function $g$:
$$f(\theta) + g(\theta) \leq \underbrace{f(\theta_k) + \langle\nabla f(\theta_k), \theta - \theta_k\rangle + \frac{t}{2}\|\theta - \theta_k\|_2^2 + g(\theta)}_{\psi(\theta)}$$
And one can prove that
$$\theta_{k+1} = \operatorname{argmin}_{\theta\in\mathbb{R}^d}\{\psi(\theta)\} = \operatorname{proximal}_{g/t}\big(\theta_k - t^{-1}\nabla f(\theta_k)\big),$$
hence the name proximal gradient descent algorithm, since
$$\operatorname{argmin}\{\psi(\theta)\} = \operatorname{argmin}\left\{\frac{t}{2}\Big\|\theta - \big(\theta_k - t^{-1}\nabla f(\theta_k)\big)\Big\|_2^2 + g(\theta)\right\}$$
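A minimal proximal-gradient (ISTA) sketch for the LASSO, with $f(\beta) = \|y - X\beta\|_2^2/(2n)$ and $g(\beta) = \lambda\|\beta\|_1$, reusing soft_threshold() from the earlier sketch (step size taken from the Lipschitz constant; data and lambda are illustrative):

ista_lasso <- function(X, y, lambda, n_iter = 500) {
  n <- nrow(X)
  t <- max(eigen(crossprod(X) / n, only.values = TRUE)$values)  # Lipschitz constant of grad f
  beta <- rep(0, ncol(X))
  for (k in 1:n_iter) {
    grad <- as.vector(crossprod(X, X %*% beta - y)) / n
    beta <- soft_threshold(beta - grad / t, lambda / t)         # proximal step on g/t
  }
  beta
}
round(ista_lasso(X, y, lambda = 0.5), 3)   # on the toy X, y used in the iterated-ridge sketch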
@freakonometrics 56
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Coordinate-wise minimization
Consider some convex differentiable function $f: \mathbb{R}^k \to \mathbb{R}$.
Consider $x^\star \in \mathbb{R}^k$ obtained by minimizing along each coordinate axis, i.e.
$$f(x_1^\star, \cdots, x_{i-1}^\star, x_i, x_{i+1}^\star, \cdots, x_k^\star) \geq f(x_1^\star, \cdots, x_{i-1}^\star, x_i^\star, x_{i+1}^\star, \cdots, x_k^\star)$$
for all $i$. Is $x^\star$ a global minimizer, i.e. $f(x) \geq f(x^\star)$, $\forall x \in \mathbb{R}^k$?
Yes, if $f$ is convex and differentiable:
$$\nabla f(x)\big|_{x = x^\star} = \left(\frac{\partial f(x)}{\partial x_1}, \cdots, \frac{\partial f(x)}{\partial x_k}\right) = 0$$
There might be a problem if $f$ is not differentiable (except in each axis direction).
If $f(x) = g(x) + \sum_{i=1}^k h_i(x_i)$ with $g$ convex and differentiable, the answer is still yes, since
$$f(x) - f(x^\star) \geq \nabla g(x^\star)^{\mathsf{T}}(x - x^\star) + \sum_i [h_i(x_i) - h_i(x_i^\star)]$$
@freakonometrics 57
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Coordinate-wise minimization
$$f(x) - f(x^\star) \geq \sum_i \underbrace{\big[\nabla_i g(x^\star)(x_i - x_i^\star) + h_i(x_i) - h_i(x_i^\star)\big]}_{\geq 0} \geq 0$$
Thus, for functions $f(x) = g(x) + \sum_{i=1}^k h_i(x_i)$ we can use coordinate descent to find a minimizer, i.e. at step $j$
$$x_1^{(j)} \in \operatorname{argmin}_{x_1} f(x_1, x_2^{(j-1)}, x_3^{(j-1)}, \cdots, x_k^{(j-1)})$$
$$x_2^{(j)} \in \operatorname{argmin}_{x_2} f(x_1^{(j)}, x_2, x_3^{(j-1)}, \cdots, x_k^{(j-1)})$$
$$x_3^{(j)} \in \operatorname{argmin}_{x_3} f(x_1^{(j)}, x_2^{(j)}, x_3, \cdots, x_k^{(j-1)})$$
Tseng (2001) Convergence of Block Coordinate Descent Method: if $f$ is continuous, then $x^{\infty}$ is a minimizer of $f$.
@freakonometrics 58
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application in Linear Regression
Let $f(x) = \frac{1}{2}\|y - Ax\|^2$, with $y \in \mathbb{R}^n$ and $A \in \mathcal{M}_{n\times k}$. Let $A = [A_1, \cdots, A_k]$.
Let us minimize in direction $i$. Let $x_{-i}$ denote the vector in $\mathbb{R}^{k-1}$ without $x_i$. Here
$$0 = \frac{\partial f(x)}{\partial x_i} = A_i^{\mathsf{T}}[Ax - y] = A_i^{\mathsf{T}}[A_i x_i + A_{-i}x_{-i} - y];$$
thus, the optimal value is here
$$x_i^\star = \frac{A_i^{\mathsf{T}}[y - A_{-i}x_{-i}]}{A_i^{\mathsf{T}}A_i}$$
@freakonometrics 59
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application to LASSO
Let $f(x) = \frac{1}{2}\|y - Ax\|^2 + \lambda\|x\|_1$, so that the non-differentiable part is separable, since $\|x\|_1 = \sum_{i=1}^k |x_i|$.
Let us minimize in direction $i$. Let $x_{-i}$ denote the vector in $\mathbb{R}^{k-1}$ without $x_i$. Here
$$0 = \frac{\partial f(x)}{\partial x_i} = A_i^{\mathsf{T}}[A_i x_i + A_{-i}x_{-i} - y] + \lambda s_i$$
where $s_i \in \partial|x_i|$. Thus, the solution is obtained by soft-thresholding,
$$x_i^\star = S_{\lambda/\|A_i\|^2}\left(\frac{A_i^{\mathsf{T}}[y - A_{-i}x_{-i}]}{A_i^{\mathsf{T}}A_i}\right)$$
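A sketch of cyclic coordinate descent for this objective ($\frac{1}{2}\|y - Ax\|^2 + \lambda\|x\|_1$), reusing soft_threshold() and the toy data from the previous sketches (all choices are illustrative):

cd_lasso <- function(A, y, lambda, n_sweep = 100) {
  k <- ncol(A)
  x <- rep(0, k)
  for (s in 1:n_sweep) {
    for (i in 1:k) {
      r_i  <- y - A[, -i, drop = FALSE] %*% x[-i]   # partial residual, leaving out coordinate i
      x[i] <- soft_threshold(sum(A[, i] * r_i) / sum(A[, i]^2),
                             lambda / sum(A[, i]^2))
    }
  }
  x
}
round(cd_lasso(X, y, lambda = 50), 3)   # the irrelevant coefficients are set exactly to zero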
@freakonometrics 60
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Convergence rate for LASSO
Let $f(x) = g(x) + \lambda\|x\|_1$ with
• $g$ convex, $\nabla g$ Lipschitz with constant $L > 0$, and $\operatorname{Id} - \nabla g / L$ monotone increasing in each component
• there exists $z$ such that, componentwise, either $z \geq S_\lambda(z - \nabla g(z))$ or $z \leq S_\lambda(z - \nabla g(z))$
Saha & Tewari (2010) On the finite time convergence of cyclic coordinate descent methods proved that a coordinate descent starting from $z$ satisfies
$$f(x^{(j)}) - f(x^\star) \leq \frac{L\|z - x^\star\|^2}{2j}$$
@freakonometrics 61
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Graphical Lasso and Covariance Estimation
We want to estimate an (unknown) covariance matrix $\Sigma$, or $\Sigma^{-1}$.
An estimate for $\Sigma^{-1}$ is $\widehat{\Theta}$, solution of
$$\widehat{\Theta} \in \operatorname{argmin}_{\Theta\in\mathcal{M}_{k\times k}}\big\{-\log[\det(\Theta)] + \operatorname{trace}[S\Theta] + \lambda\|\Theta\|_1\big\}\quad\text{where } S = \frac{X^{\mathsf{T}}X}{n}$$
and where $\|\Theta\|_1 = \sum |\Theta_{i,j}|$.
See van Wieringen (2016) Undirected network reconstruction from high-dimensional data and https://github.com/kaizhang/glasso
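A possible illustration with the glasso package (assumed installed); the data are simulated, and rho plays the role of $\lambda$:

library(glasso)
set.seed(1)
X <- matrix(rnorm(200 * 5), 200, 5)
S <- crossprod(scale(X, scale = FALSE)) / nrow(X)   # empirical covariance S = X'X / n
fit <- glasso(S, rho = 0.1)                         # l1-penalized precision matrix estimation
round(fit$wi, 2)                                    # estimated sparse Theta = Sigma^{-1}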
@freakonometrics 62
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application to Network Simplification
Can be applied to networks, to spot ‘significant’ connections...
Source: http://khughitt.github.io/graphical-lasso/
@freakonometrics 63
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Extension of Penalization Techniques
In a more general context, we want to solve
$$\widehat{\theta} \in \operatorname{argmin}_{\theta\in\mathbb{R}^d}\left\{\frac{1}{n}\sum_{i=1}^n \ell\big(y_i, m_\theta(x_i)\big) + \lambda\cdot\text{penalty}(\theta)\right\}.$$
@freakonometrics 64
More Related Content

What's hot

What's hot (20)

Slides ensae 11bis
Slides ensae 11bisSlides ensae 11bis
Slides ensae 11bis
 
Inequalities #3
Inequalities #3Inequalities #3
Inequalities #3
 
Inequality, slides #2
Inequality, slides #2Inequality, slides #2
Inequality, slides #2
 
Slides ensae 9
Slides ensae 9Slides ensae 9
Slides ensae 9
 
Slides erasmus
Slides erasmusSlides erasmus
Slides erasmus
 
Multiattribute utility copula
Multiattribute utility copulaMultiattribute utility copula
Multiattribute utility copula
 
Proba stats-r1-2017
Proba stats-r1-2017Proba stats-r1-2017
Proba stats-r1-2017
 
Slides ineq-4
Slides ineq-4Slides ineq-4
Slides ineq-4
 
Quantile and Expectile Regression
Quantile and Expectile RegressionQuantile and Expectile Regression
Quantile and Expectile Regression
 
Inequality #4
Inequality #4Inequality #4
Inequality #4
 
Slides erm-cea-ia
Slides erm-cea-iaSlides erm-cea-ia
Slides erm-cea-ia
 
Slides barcelona Machine Learning
Slides barcelona Machine LearningSlides barcelona Machine Learning
Slides barcelona Machine Learning
 
Slides amsterdam-2013
Slides amsterdam-2013Slides amsterdam-2013
Slides amsterdam-2013
 
Slides econometrics-2018-graduate-2
Slides econometrics-2018-graduate-2Slides econometrics-2018-graduate-2
Slides econometrics-2018-graduate-2
 
Slides toulouse
Slides toulouseSlides toulouse
Slides toulouse
 
Inequalities #2
Inequalities #2Inequalities #2
Inequalities #2
 
Slides risk-rennes
Slides risk-rennesSlides risk-rennes
Slides risk-rennes
 
Slides simplexe
Slides simplexeSlides simplexe
Slides simplexe
 
Slides Bank England
Slides Bank EnglandSlides Bank England
Slides Bank England
 
Slides astin
Slides astinSlides astin
Slides astin
 

Viewers also liked

Soutenance julie viard_partie_1
Soutenance julie viard_partie_1Soutenance julie viard_partie_1
Soutenance julie viard_partie_1
Arthur Charpentier
 
Julien slides - séminaire quantact
Julien slides - séminaire quantactJulien slides - séminaire quantact
Julien slides - séminaire quantact
Arthur Charpentier
 

Viewers also liked (20)

Slides econometrics-2017-graduate-2
Slides econometrics-2017-graduate-2Slides econometrics-2017-graduate-2
Slides econometrics-2017-graduate-2
 
Econometrics, PhD Course, #1 Nonlinearities
Econometrics, PhD Course, #1 NonlinearitiesEconometrics, PhD Course, #1 Nonlinearities
Econometrics, PhD Course, #1 Nonlinearities
 
Soutenance julie viard_partie_1
Soutenance julie viard_partie_1Soutenance julie viard_partie_1
Soutenance julie viard_partie_1
 
Lg ph d_slides_vfinal
Lg ph d_slides_vfinalLg ph d_slides_vfinal
Lg ph d_slides_vfinal
 
HdR
HdRHdR
HdR
 
Slides act6420-e2014-ts-1
Slides act6420-e2014-ts-1Slides act6420-e2014-ts-1
Slides act6420-e2014-ts-1
 
Slides inequality 2017
Slides inequality 2017Slides inequality 2017
Slides inequality 2017
 
So a webinar-2013-2
So a webinar-2013-2So a webinar-2013-2
So a webinar-2013-2
 
Slides arbres-ubuntu
Slides arbres-ubuntuSlides arbres-ubuntu
Slides arbres-ubuntu
 
Julien slides - séminaire quantact
Julien slides - séminaire quantactJulien slides - séminaire quantact
Julien slides - séminaire quantact
 
Slides 2040-6
Slides 2040-6Slides 2040-6
Slides 2040-6
 
Slides 2040-4
Slides 2040-4Slides 2040-4
Slides 2040-4
 
Slides guanauato
Slides guanauatoSlides guanauato
Slides guanauato
 
Slides ensae-2016-4
Slides ensae-2016-4Slides ensae-2016-4
Slides ensae-2016-4
 
Slides ensae-2016-8
Slides ensae-2016-8Slides ensae-2016-8
Slides ensae-2016-8
 
Slides ensae-2016-9
Slides ensae-2016-9Slides ensae-2016-9
Slides ensae-2016-9
 
Slides ensae-2016-5
Slides ensae-2016-5Slides ensae-2016-5
Slides ensae-2016-5
 
Slides ensae-2016-7
Slides ensae-2016-7Slides ensae-2016-7
Slides ensae-2016-7
 
Slides ensae-2016-6
Slides ensae-2016-6Slides ensae-2016-6
Slides ensae-2016-6
 
Slides ensae-2016-10
Slides ensae-2016-10Slides ensae-2016-10
Slides ensae-2016-10
 

Similar to Econometrics 2017-graduate-3

Similar to Econometrics 2017-graduate-3 (20)

Slides ub-3
Slides ub-3Slides ub-3
Slides ub-3
 
Varese italie #2
Varese italie #2Varese italie #2
Varese italie #2
 
Lausanne 2019 #1
Lausanne 2019 #1Lausanne 2019 #1
Lausanne 2019 #1
 
Slides ub-2
Slides ub-2Slides ub-2
Slides ub-2
 
Side 2019 #9
Side 2019 #9Side 2019 #9
Side 2019 #9
 
Slides econometrics-2018-graduate-3
Slides econometrics-2018-graduate-3Slides econometrics-2018-graduate-3
Slides econometrics-2018-graduate-3
 
Slides econometrics-2018-graduate-4
Slides econometrics-2018-graduate-4Slides econometrics-2018-graduate-4
Slides econometrics-2018-graduate-4
 
Side 2019, part 2
Side 2019, part 2Side 2019, part 2
Side 2019, part 2
 
Statistical Physics Assignment Help
Statistical Physics Assignment HelpStatistical Physics Assignment Help
Statistical Physics Assignment Help
 
Introduction to Evidential Neural Networks
Introduction to Evidential Neural NetworksIntroduction to Evidential Neural Networks
Introduction to Evidential Neural Networks
 
Accelerating Metropolis Hastings with Lightweight Inference Compilation
Accelerating Metropolis Hastings with Lightweight Inference CompilationAccelerating Metropolis Hastings with Lightweight Inference Compilation
Accelerating Metropolis Hastings with Lightweight Inference Compilation
 
Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...
 
An investigation of inference of the generalized extreme value distribution b...
An investigation of inference of the generalized extreme value distribution b...An investigation of inference of the generalized extreme value distribution b...
An investigation of inference of the generalized extreme value distribution b...
 
Lesson 26
Lesson 26Lesson 26
Lesson 26
 
AI Lesson 26
AI Lesson 26AI Lesson 26
AI Lesson 26
 
Multivriada ppt ms
Multivriada   ppt msMultivriada   ppt ms
Multivriada ppt ms
 
QMC: Operator Splitting Workshop, Proximal Algorithms in Probability Spaces -...
QMC: Operator Splitting Workshop, Proximal Algorithms in Probability Spaces -...QMC: Operator Splitting Workshop, Proximal Algorithms in Probability Spaces -...
QMC: Operator Splitting Workshop, Proximal Algorithms in Probability Spaces -...
 
Metodo gauss_newton.pdf
Metodo gauss_newton.pdfMetodo gauss_newton.pdf
Metodo gauss_newton.pdf
 
ABC with data cloning for MLE in state space models
ABC with data cloning for MLE in state space modelsABC with data cloning for MLE in state space models
ABC with data cloning for MLE in state space models
 
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive ModelingHands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
 

More from Arthur Charpentier

More from Arthur Charpentier (20)

Family History and Life Insurance
Family History and Life InsuranceFamily History and Life Insurance
Family History and Life Insurance
 
ACT6100 introduction
ACT6100 introductionACT6100 introduction
ACT6100 introduction
 
Family History and Life Insurance (UConn actuarial seminar)
Family History and Life Insurance (UConn actuarial seminar)Family History and Life Insurance (UConn actuarial seminar)
Family History and Life Insurance (UConn actuarial seminar)
 
Control epidemics
Control epidemics Control epidemics
Control epidemics
 
STT5100 Automne 2020, introduction
STT5100 Automne 2020, introductionSTT5100 Automne 2020, introduction
STT5100 Automne 2020, introduction
 
Family History and Life Insurance
Family History and Life InsuranceFamily History and Life Insurance
Family History and Life Insurance
 
Machine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & InsuranceMachine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & Insurance
 
Reinforcement Learning in Economics and Finance
Reinforcement Learning in Economics and FinanceReinforcement Learning in Economics and Finance
Reinforcement Learning in Economics and Finance
 
Optimal Control and COVID-19
Optimal Control and COVID-19Optimal Control and COVID-19
Optimal Control and COVID-19
 
Slides OICA 2020
Slides OICA 2020Slides OICA 2020
Slides OICA 2020
 
Lausanne 2019 #3
Lausanne 2019 #3Lausanne 2019 #3
Lausanne 2019 #3
 
Lausanne 2019 #4
Lausanne 2019 #4Lausanne 2019 #4
Lausanne 2019 #4
 
Lausanne 2019 #2
Lausanne 2019 #2Lausanne 2019 #2
Lausanne 2019 #2
 
Side 2019 #10
Side 2019 #10Side 2019 #10
Side 2019 #10
 
Side 2019 #11
Side 2019 #11Side 2019 #11
Side 2019 #11
 
Side 2019 #12
Side 2019 #12Side 2019 #12
Side 2019 #12
 
Side 2019 #8
Side 2019 #8Side 2019 #8
Side 2019 #8
 
Side 2019 #7
Side 2019 #7Side 2019 #7
Side 2019 #7
 
Side 2019 #6
Side 2019 #6Side 2019 #6
Side 2019 #6
 
Side 2019 #5
Side 2019 #5Side 2019 #5
Side 2019 #5
 

Recently uploaded

Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
negromaestrong
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
MateoGardella
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
SanaAli374401
 

Recently uploaded (20)

Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 

Econometrics 2017-graduate-3

  • 1. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Advanced Econometrics #3: Model & Variable Selection* A. Charpentier (Université de Rennes 1) Université de Rennes 1, Graduate Course, 2017. @freakonometrics 1
  • 2. Arthur CHARPENTIER, Advanced Econometrics Graduate Course “Great plot. Now need to find the theory that explains it” Deville (2017) http://twitter.com @freakonometrics 2
  • 3. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Preliminary Results: Numerical Optimization Problem : x ∈ argmin{f(x); x ∈ Rd } Gradient descent : xk+1 = xk − η f(xk) starting from some x0 Problem : x ∈ argmin{f(x); x ∈ X ⊂ Rd } Projected descent : xk+1 = ΠX xk − η f(xk) starting from some x0 A constrained problem is said to be convex if    min{f(x)} with f convex s.t. gi(x) = 0, ∀i = 1, · · · , n with gi linear hi(x) ≤ 0, ∀i = 1, · · · , m with hi convex Lagrangian : L(x, λ, µ) = f(x) + n i=1 λigi(x) + m i=1 µihi(x) where x are primal variables and (λ, µ) are dual variables. Remark L is an affine function in (λ, µ) @freakonometrics 3
  • 4. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Preliminary Results: Numerical Optimization Karush–Kuhn–Tucker conditions : a convex problem has a solution x if and only if there are (λ , µ ) such that the following condition hold • stationarity : xL(x, λ, µ) = 0 at (x , λ , µ ) • primal admissibility : gi(x ) = 0 and hi(x ) ≤ 0, ∀i • dual admissibility : µ ≥ 0 Let L denote the associated dual function L(λ, µ) = min x {L(x, λ, µ)} L is a convex function in (λ, µ) and the dual problem is max λ,µ {L(λ, µ)}. @freakonometrics 4
  • 5. Arthur CHARPENTIER, Advanced Econometrics Graduate Course References Motivation Banerjee, A., Chandrasekhar, A.G., Duflo, E. & Jackson, M.O. (2016). Gossip: Identifying Central Individuals in a Social Networks. References Belloni, A. & Chernozhukov, V. 2009. Least squares after model selection in high-dimensional sparse models. Hastie, T., Tibshirani, R. & Wainwright, M. 2015 Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press. @freakonometrics 5
  • 6. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Preambule Assume that y = m(x) + ε, where ε is some idosyncatic impredictible noise. The error E[(y − m(x))2 ] is the sume of three terms • variance of the estimator : E[(y − m(x))2 ] • bias2 of the estimator : [m(x − m(x)]2 • variance of the noise : E[(y − m(x))2 ] (the latter exists, even with a ‘perfect’ model). @freakonometrics 6
  • 7. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Preambule Consider a parametric model, with true (unkown) parameter θ, then mse(ˆθ) = E (ˆθ − θ)2 = E (ˆθ − E ˆθ )2 variance + E (E ˆθ − θ)2 bias2 Let θ denote an unbiased estimator of θ. Then ˆθ = θ2 θ2 + mse(θ) · θ = θ − mse(θ) θ2 + mse(θ) · θ penalty satisfies mse(ˆθ) ≤ mse(θ). @freakonometrics 7 −2 −1 0 1 2 3 4 0.00.20.40.60.8 variance
  • 8. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Occam’s Razor The “law of parsimony”, “lex parsimoniæ” Penalize too complex models @freakonometrics 8
  • 9. Arthur CHARPENTIER, Advanced Econometrics Graduate Course James & Stein Estimator Let X ∼ N(µ, σ2 I). We want to estimate µ. µmle = Xn ∼ N µ, σ2 n I . From James & Stein (1961) Estimation with quadratic loss µJS = 1 − (d − 2)σ2 n y 2 y where · is the Euclidean norm. One can prove that if d ≥ 3, E µJS − µ 2 < E µmle − µ 2 Samworth (2015) Stein’s paradox, “one should use the price of tea in China to obtain a better estimate of the chance of rain in Melbourne”. @freakonometrics 9
  • 10. Arthur CHARPENTIER, Advanced Econometrics Graduate Course James & Stein Estimator Heuristics : consider a biased estimator, to decrease the variance. See Efron (2010) Large-Scale Inference @freakonometrics 10
  • 11. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Motivation: Avoiding Overfit Generalization : the model should perform well on new data (and not only on the training ones). q q q q q q q q q q q q q 2 4 6 8 10 12 0.00.20.40.6 q q q q q q q q q q q q q @freakonometrics 11 q q q q q q q q q q q q q q q q q q q q q q 0.0 0.2 0.4 0.6 0.8 1.0 −1.5−1.0−0.50.00.51.01.5 q q q q q q q q q q q q q q q q q q q q q q 0.0 0.2 0.4 0.6 0.8 1.0 −1.5−1.0−0.50.00.51.01.5 q q q q q q q q q qq
  • 12. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Reducing Dimension with PCA Use principal components to reduce dimension (on centered and scaled variables): we want d vectors z1, · · · , zd such that First Compoment is z1 = Xω1 where ω1 = argmax ω =1 X · ω 2 = argmax ω =1 ωT XT Xω Second Compoment is z2 = Xω2 where ω2 = argmax ω =1 X (1) · ω 2 0 20 40 60 80 −8−6−4−2 Age LogMortalityRate −10 −5 0 5 10 15 −101234 PC score 1 PCscore2 q q qq q q qqq q qqq qq q qqq q q q qq q q q q q q qqq q q q q q q qq q qq q q qq q q q q q q qq q qqq q q q q q q q q q q q q q q qq q q qqq q qqq qq q qqq q q q q q q q q q q q q q q q q q q q qq q q q q qq q q qq q q qqqq q q q q q q q q q qqq q qqq qq q qq q q q q qq q q q q q q q q q q q q q 1914 1915 1916 1917 1918 1919 1940 1942 1943 1944 0 20 40 60 80 −10−8−6−4−2 Age LogMortalityRate −10 −5 0 5 10 15 −10123 PC score 1 PCscore2 qq q q q q q qq qq qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq qq q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q qq qq q q q q q q q q q q q q q q q q q q q q q qq q q q q q q qq q q q q q q q q q qq q q q q q q q q q q qq qq q q q q q qq qqqq q qqqq q q q q q q with X (1) = X − Xω1 z1 ωT 1 . @freakonometrics 12
  • 13. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Reducing Dimension with PCA A regression on (the d) principal components, y = zT β + η could be an interesting idea, unfortunatley, principal components have no reason to be correlated with y. First compoment was z1 = Xω1 where ω1 = argmax ω =1 X · ω 2 = argmax ω =1 ωT XT Xω It is a non-supervised technique. Instead, use partial least squares, introduced in Wold (1966) Estimation of Principal Components and Related Models by Iterative Least squares. First compoment is z1 = Xω1 where ω1 = argmax ω =1 { y, X · ω } = argmax ω =1 ωT XT yyT Xω @freakonometrics 13
  • 14. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Terminology Consider a dataset {yi, xi}, assumed to be generated from Y, X, from an unknown distribution P. Let m0(·) be the “true” model. Assume that yi = m0(xi) + εi. In a regression context (quadratic loss function function), the risk associated to m is R(m) = EP Y − m(X) 2 An optimal model m within a class M satisfies R(m ) = inf m∈M R(m) Such a model m is usually called oracle. Observe that m (x) = E[Y |X = x] is the solution of R(m ) = inf m∈M R(m) where M is the set of measurable functions @freakonometrics 14
  • 15. Arthur CHARPENTIER, Advanced Econometrics Graduate Course The empirical risk is Rn(m) = 1 n n i=1 yi − m(xi) 2 For instance, m can be a linear predictor, m(x) = β0 + xT β, where θ = (β0, β) should estimated (trained). E Rn(m) = E (m(X) − Y )2 can be expressed as E (m(X) − E[m(X)|X])2 variance of m + E E[m(X)|X] − E[Y |X] m0(X) 2 bias of m + E Y − E[Y |X] m0(X) )2 variance of the noise The third term is the risk of the “optimal” estimator m, that cannot be decreased. @freakonometrics 15
  • 16. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Mallows Penalty and Model Complexity Consider a linear predictor (see #1), i.e. y = m(x) = Ay. Assume that y = m0(x) + ε, with E[ε] = 0 and Var[ε] = σ2 I. Let · denote the Euclidean norm Empirical risk : Rn(m) = 1 n y − m(x) 2 Vapnik’s risk : E[Rn(m)] = 1 n m0(x − m(x) 2 + 1 n E y − m0(x 2 with m0(x = E[Y |X = x]. Observe that nE Rn(m) = E y − m(x) 2 = (I − A)m0 2 + σ2 I − A 2 while = E m0(x) − m(x) 2 = 2 (I − A)m0 bias + σ2 A 2 variance @freakonometrics 16
  • 17. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Mallows Penalty and Model Complexity One can obtain E Rn(m) = E Rn(m) + 2 σ2 n trace(A). If trace(A) ≥ 0 the empirical risk underestimate the true risk of the estimator. The number of degrees of freedom of the (linear) predictor is related to trace(A) 2 σ2 n trace(A) is called Mallow’s penalty CL. If A is a projection matrix, trace(A) is the dimension of the projection space, p, then we obtain Mallow’s CP , 2 σ2 n p. Remark : Mallows (1973) Some Comments on Cp introduced this penalty while focusing on the R2 . @freakonometrics 17
  • 18. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Penalty and Likelihood CP is associated to a quadratic risk an alternative is to use a distance on the (conditional) distribution of Y , namely Kullback-Leibler distance discrete case: DKL(P Q) = i P(i) log P(i) Q(i) continuous case : DKL(P Q) = ∞ −∞ p(x) log p(x) q(x) dxDKL(P Q) = ∞ −∞ p(x) log p(x) q(x) dx Let f denote the true (unknown) density, and fθ some parametric distribution, DKL(f fθ) = ∞ −∞ f(x) log f(x) fθ(x) dx= f(x) log[f(x)] dx− f(x) log[fθ(x)] dx relative information Hence minimize {DKL(f fθ)} ←→ maximize E log[fθ(X)] @freakonometrics 18
  • 19. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Penalty and Likelihood Akaike (1974) A new look at the statistical model identification observe that for n large enough E log[fθ(X)] ∼ log[L(θ)] − dim(θ) Thus AIC = −2 log L(θ) + 2dim(θ) Example : in a (Gaussian) linear model, yi = β0 + xT i β + εi AIC = n log 1 n n i=1 εi + 2[dim(β) + 2] @freakonometrics 19
  • 20. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Penalty and Likelihood Remark : this is valid for large sample (rule of thumb n/dim(θ) > 40), otherwise use a corrected AIC AICc = AIC + 2k(k + 1) n − k − 1 bias correction where k = dim(θ) see Sugiura (1978) Further analysis of the data by Akaike’s information criterion and the finite corrections second order AIC. Using a Bayesian interpretation, Schwarz (1978) Estimating the dimension of a model obtained BIC = −2 log L(θ) + log(n)dim(θ). Observe that the criteria considered is criteria = −function L(θ) + penality complexity @freakonometrics 20
  • 21. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Estimation of the Risk Consider a naive bootstrap procedure, based on a bootstrap sample Sb = {(y (b) i , x (b) i )}. The plug-in estimator of the empirical risk is Rn(m(b) ) = 1 n n i=1 yi − m(b) (xi) 2 and then Rn = 1 B B b=1 Rn(m(b) ) = 1 B B b=1 1 n n i=1 yi − m(b) (xi) 2 @freakonometrics 21
  • 22. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Estimation of the Risk One might improve this estimate using a out-of-bag procedure Rn = 1 n n i=1 1 #Bi b∈Bi yi − m(b) (xi) 2 where Bi is the set of all boostrap sample that contain (yi, xi). Remark: P ((yi, xi) /∈ Sb) = 1 − 1 n n ∼ e−1 = 36, 78%. @freakonometrics 22
  • 23. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Linear Regression Shortcoming Least Squares Estimator β = (XT X)−1 XT y Unbiased Estimator E[β] = β Variance Var[β] = σ2 (XT X)−1 which can be (extremely) large when det[(XT X)] ∼ 0. X =        1 −1 2 1 0 1 1 2 −1 1 1 0        then XT X =     4 2 2 2 6 −4 2 −4 6     while XT X+I =     5 2 2 2 7 −4 2 −4 7     eigenvalues : {10, 6, 0} {11, 7, 1} Ad-hoc strategy: use XT X + λI @freakonometrics 23
  • 24. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Linear Regression Shortcoming Evolution of (β1, β2) → n i=1 [yi − (β1x1,i + β2x2,i)]2 when cor(X1, X2) = r ∈ [0, 1], on top. Below, Ridge regression (β1, β2) → n i=1 [yi − (β1x1,i + β2x2,i)]2 +λ(β2 1 + β2 2) where λ ∈ [0, ∞), below, when cor(X1, X2) ∼ 1 (colinearity). @freakonometrics 24 −2 −1 0 1 2 3 4 −3−2−10123 β1 β2 500 1000 1500 2000 2000 2500 2500 2500 2500 3000 3000 3000 3000 3500 q −2 −1 0 1 2 3 4 −3−2−10123 β1 β2 1000 1000 2000 2000 3000 3000 4000 4000 5000 5000 6000 6000 7000
  • 25. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Normalization : Euclidean 2 vs. Mahalonobis We want to penalize complicated models : if βk is “too small”, we prefer to have βk = 0. 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 Instead of d(x, y) = (x − y)T (x − y) use dΣ(x, y) = (x − y)TΣ−1 (x − y) beta1 beta2 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 25
  • 26. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Ridge Regression ... like the least square, but it shrinks estimated coefficients towards 0. β ridge λ = argmin    n i=1 (yi − xT i β)2 + λ p j=1 β2 j    β ridge λ = argmin    y − Xβ 2 2 =criteria + λ β 2 2 =penalty    λ ≥ 0 is a tuning parameter. The constant is usually unpenalized. The true equation is β ridge λ = argmin    y − (β0 + Xβ) 2 2 =criteria + λ β 2 2 =penalty    @freakonometrics 26
  • 27. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Ridge Regression β ridge λ = argmin y − (β0 + Xβ) 2 2 + λ β 2 2 can be seen as a constrained optimization problem β ridge λ = argmin β 2 2 ≤hλ y − (β0 + Xβ) 2 2 Explicit solution βλ = (XT X + λI)−1 XT y If λ → 0, β ridge 0 = β ols If λ → ∞, β ridge ∞ = 0. beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 30 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 27
  • 28. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
This penalty can be seen as rather unfair if the components of $x$ are not expressed on the same scale:
• center: $\bar{x}_j = 0$, then $\widehat\beta_0 = \bar{y}$
• scale: $x_j^Tx_j = 1$
Then compute
$$\widehat\beta^{\text{ridge}}_\lambda = \operatorname{argmin}\Big\{\underbrace{\|y - X\beta\|_2^2}_{\text{loss}} + \underbrace{\lambda\|\beta\|_2^2}_{\text{penalty}}\Big\}.$$
[Figure: constrained least-squares contours before and after standardization.]
@freakonometrics 28
  • 29. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
Observe that if $x_{j_1} \perp x_{j_2}$, then $\widehat\beta^{\text{ridge}}_\lambda = [1+\lambda]^{-1}\widehat\beta^{\text{ols}}$, which explains the relationship with shrinkage. But in general this is not the case.
Theorem. There exists $\lambda$ such that $\text{mse}[\widehat\beta^{\text{ridge}}_\lambda] \le \text{mse}[\widehat\beta^{\text{ols}}]$.
@freakonometrics 29
  • 30. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
$$L_\lambda(\beta) = \sum_{i=1}^n (y_i - \beta_0 - x_i^T\beta)^2 + \lambda\sum_{j=1}^p\beta_j^2$$
$$\frac{\partial L_\lambda(\beta)}{\partial\beta} = -2X^Ty + 2(X^TX+\lambda I)\beta
\qquad
\frac{\partial^2 L_\lambda(\beta)}{\partial\beta\,\partial\beta^T} = 2(X^TX+\lambda I)$$
where $X^TX$ is a positive semi-definite matrix and $\lambda I$ is a positive definite matrix, and
$$\widehat\beta_\lambda = (X^TX+\lambda I)^{-1}X^Ty.$$
@freakonometrics 30
  • 31. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
The Bayesian Interpretation
From a Bayesian perspective,
$$\underbrace{\mathbb{P}[\theta|y]}_{\text{posterior}} \propto \underbrace{\mathbb{P}[y|\theta]}_{\text{likelihood}}\cdot\underbrace{\mathbb{P}[\theta]}_{\text{prior}}$$
i.e., up to an additive constant,
$$\log \mathbb{P}[\theta|y] = \underbrace{\log \mathbb{P}[y|\theta]}_{\text{log-likelihood}} + \underbrace{\log \mathbb{P}[\theta]}_{\text{penalty}}.$$
If $\beta$ has a $\mathcal{N}(0,\tau^2 I)$ prior distribution, then its posterior distribution has mean
$$\mathbb{E}[\beta|y,X] = \left(X^TX + \frac{\sigma^2}{\tau^2}I\right)^{-1}X^Ty.$$
@freakonometrics 31
  • 32. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of the Ridge Estimator
$$\widehat\beta_\lambda = (X^TX+\lambda I)^{-1}X^Ty$$
$$\mathbb{E}[\widehat\beta_\lambda] = X^TX(\lambda I + X^TX)^{-1}\beta \ne \beta,$$
i.e. the ridge estimator is biased. Observe that $\mathbb{E}[\widehat\beta_\lambda]\to 0$ as $\lambda\to\infty$.
Assume that $X$ is an orthogonal design matrix, i.e. $X^TX = I$; then $\widehat\beta_\lambda = (1+\lambda)^{-1}\widehat\beta^{\text{ols}}$.
@freakonometrics 32
  • 33. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of the Ridge Estimator
Set $W_\lambda = (I + \lambda[X^TX]^{-1})^{-1}$. One can prove that $W_\lambda\widehat\beta^{\text{ols}} = \widehat\beta_\lambda$. Thus,
$$\text{Var}[\widehat\beta_\lambda] = W_\lambda\,\text{Var}[\widehat\beta^{\text{ols}}]\,W_\lambda^T
= \sigma^2(X^TX+\lambda I)^{-1}X^TX\big[(X^TX+\lambda I)^{-1}\big]^T.$$
Observe that
$$\text{Var}[\widehat\beta^{\text{ols}}] - \text{Var}[\widehat\beta_\lambda] = \sigma^2 W_\lambda\big[2\lambda(X^TX)^{-2} + \lambda^2(X^TX)^{-3}\big]W_\lambda^T \ge 0.$$
@freakonometrics 33
  • 34. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of the Ridge Estimator
Hence, the confidence ellipsoid of the ridge estimator is indeed smaller than that of the OLS estimator. If $X$ is an orthogonal design matrix, $\text{Var}[\widehat\beta_\lambda] = \sigma^2(1+\lambda)^{-2}I$.
$$\text{mse}[\widehat\beta_\lambda] = \sigma^2\,\text{trace}\big(W_\lambda(X^TX)^{-1}W_\lambda^T\big) + \beta^T(W_\lambda - I)^T(W_\lambda - I)\beta.$$
If $X$ is an orthogonal design matrix,
$$\text{mse}[\widehat\beta_\lambda] = \frac{p\sigma^2}{(1+\lambda)^2} + \frac{\lambda^2}{(1+\lambda)^2}\beta^T\beta.$$
[Figure: confidence ellipsoids of the OLS and ridge estimators in the $(\beta_1,\beta_2)$ plane.]
@freakonometrics 34
  • 35. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
$$\text{mse}[\widehat\beta_\lambda] = \frac{p\sigma^2}{(1+\lambda)^2} + \frac{\lambda^2}{(1+\lambda)^2}\beta^T\beta
\quad\text{is minimal for}\quad \lambda^\star = \frac{p\sigma^2}{\beta^T\beta}.$$
Note that there exists $\lambda > 0$ such that $\text{mse}[\widehat\beta_\lambda] < \text{mse}[\widehat\beta_0] = \text{mse}[\widehat\beta^{\text{ols}}]$.
@freakonometrics 35
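A quick numerical check of the orthogonal-design formula in R (p, sigma and beta are arbitrary illustrative values):

p <- 3; sigma <- 1; beta <- c(1, 2, -1)
mse_ridge <- function(l) (p * sigma^2 + l^2 * sum(beta^2)) / (1 + l)^2

optimize(mse_ridge, c(0, 10))$minimum   # numerical minimizer
p * sigma^2 / sum(beta^2)               # theoretical lambda* = p sigma^2 / (beta' beta) = 0.5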
  • 36. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
SVD decomposition
Consider the singular value decomposition $X = UDV^T$. Then
$$\widehat\beta^{\text{ols}} = VD^{-2}DU^Ty = VD^{-1}U^Ty,
\qquad
\widehat\beta_\lambda = V(D^2+\lambda I)^{-1}DU^Ty.$$
Observe that
$$D_{i,i}^{-1} \ge \frac{D_{i,i}}{D_{i,i}^2+\lambda},$$
hence the ridge penalty shrinks the singular values. Set now $R = UD$ (an $n\times n$ matrix), so that $X = RV^T$ and
$$\widehat\beta_\lambda = V(R^TR+\lambda I)^{-1}R^Ty.$$
@freakonometrics 36
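A sketch of the SVD route in R, checking that it matches the direct formula (small simulated design, names illustrative):

set.seed(1)
n <- 50; p <- 3; lambda <- 2
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

sv <- svd(X)                                   # X = U D V'
beta_svd <- sv$v %*% (diag(sv$d / (sv$d^2 + lambda)) %*% crossprod(sv$u, y))
beta_dir <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
all.equal(beta_svd, beta_dir)                  # same ridge estimate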
  • 37. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Hat matrix and Degrees of Freedom
Recall that $\widehat{Y} = HY$ with $H = X(X^TX)^{-1}X^T$. Similarly, for the ridge estimator,
$$H_\lambda = X(X^TX + \lambda I)^{-1}X^T,
\qquad
\text{trace}[H_\lambda] = \sum_{j=1}^p \frac{d_{j,j}^2}{d_{j,j}^2+\lambda} \to 0 \text{ as } \lambda\to\infty.$$
@freakonometrics 37
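The effective degrees of freedom can be computed directly from the singular values; a small sketch in R on a simulated design:

set.seed(1)
X <- matrix(rnorm(50 * 3), 50, 3)
d <- svd(X)$d

df_ridge <- function(lambda) sum(d^2 / (d^2 + lambda))
sapply(c(0, 1, 10, 100, 1e4), df_ridge)   # decreases from p = 3 towards 0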
  • 38. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Sparsity Issues
In several applications, $k$ can be (very) large, but a lot of features are just noise: $\beta_j = 0$ for many $j$'s. Let $s$ denote the number of relevant features, with $s \ll k$, cf. Hastie, Tibshirani & Wainwright (2015) Statistical Learning with Sparsity,
$$s = \text{card}\{\mathcal{S}\} \quad\text{where}\quad \mathcal{S} = \{j;\ \beta_j \ne 0\}.$$
The model is now $y = X_{\mathcal{S}}^T\beta_{\mathcal{S}} + \varepsilon$, where $X_{\mathcal{S}}^TX_{\mathcal{S}}$ is a full-rank matrix.
@freakonometrics 38
  • 39. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
The Ridge regression problem was to solve
$$\widehat\beta = \operatorname*{argmin}_{\beta\in\{\|\beta\|_2\le s\}}\{\|Y - X^T\beta\|_2^2\}.$$
Define $\|a\|_0 = \sum_i \mathbf{1}(|a_i| > 0)$. Here $\dim(\beta) = k$ but $\|\beta\|_0 = s$. We wish we could solve
$$\widehat\beta = \operatorname*{argmin}_{\beta\in\{\|\beta\|_0 = s\}}\{\|Y - X^T\beta\|_2^2\}.$$
Problem: it is usually not possible to go through all possible constraints, since $\binom{k}{s}$ subsets of coefficients should be considered here (with $k$ (very) large).
[Figure: least-squares contours with the $\ell_2$ constraint.]
@freakonometrics 39
  • 40. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
In a convex problem, one can solve the dual problem; e.g. in the Ridge regression, the primal problem is
$$\min_{\beta\in\{\|\beta\|_2\le s\}}\{\|Y - X^T\beta\|_2^2\}$$
and the dual problem is
$$\min_{\beta\in\{\|Y - X^T\beta\|_2\le t\}}\{\|\beta\|_2^2\}.$$
[Figure: contours of the least-squares criterion with the two constraint sets.]
@freakonometrics 40
  • 41. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
Idea: solve the dual problem
$$\widehat\beta = \operatorname*{argmin}_{\beta\in\{\|Y-X^T\beta\|_2\le h\}}\{\|\beta\|_0\}$$
where we might convexify the $\ell_0$ "norm", $\|\cdot\|_0$.
@freakonometrics 41
  • 42. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
On $[-1,+1]^k$, the convex hull of $\|\beta\|_0$ is $\|\beta\|_1$; on $[-a,+a]^k$, it is $a^{-1}\|\beta\|_1$. Hence, why not solve
$$\widehat\beta = \operatorname*{argmin}_{\beta;\,\|\beta\|_1\le \tilde{s}}\{\|Y - X^T\beta\|_2\}$$
which is equivalent (Kuhn-Tucker theorem) to the Lagrangian optimization problem
$$\widehat\beta = \operatorname{argmin}\{\|Y - X^T\beta\|_2^2 + \lambda\|\beta\|_1\}.$$
@freakonometrics 42
  • 43. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO: Least Absolute Shrinkage and Selection Operator
$$\widehat\beta \in \operatorname{argmin}\{\|Y - X^T\beta\|_2^2 + \lambda\|\beta\|_1\}$$
is a convex problem (several algorithms exist), but not strictly convex (no uniqueness of the minimizer). Nevertheless, predictions $\widehat{y} = x^T\widehat\beta$ are unique.
Algorithms: MM (majorize/minimize), coordinate descent; see Hunter & Lange (2003) A Tutorial on MM Algorithms.
@freakonometrics 43
  • 44. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression
No explicit solution in general.
If $\lambda \to 0$, $\widehat\beta^{\text{lasso}}_0 = \widehat\beta^{\text{ols}}$; if $\lambda\to\infty$, $\widehat\beta^{\text{lasso}}_\infty = 0$.
[Figure: least-squares contours with the $\ell_1$ constraint region.]
@freakonometrics 44
  • 45. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression
For some $\lambda$, there are $k$'s such that $\widehat\beta^{\text{lasso}}_{k,\lambda} = 0$. Further, $\lambda\mapsto\widehat\beta^{\text{lasso}}_{k,\lambda}$ is piecewise linear.
[Figure: least-squares contours with the $\ell_1$ constraint for two values of the radius.]
@freakonometrics 45
  • 46. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression
In the orthogonal case, $X^TX = I$,
$$\widehat\beta^{\text{lasso}}_{k,\lambda} = \text{sign}\big(\widehat\beta^{\text{ols}}_k\big)\left(\big|\widehat\beta^{\text{ols}}_k\big| - \frac{\lambda}{2}\right)_+,$$
i.e. the LASSO estimate is related to the soft-thresholding function.
@freakonometrics 46
  • 47. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimal LASSO Penalty
Use cross-validation, e.g. K-fold: fit on all folds but $\mathcal{I}_k$,
$$\widehat\beta_{(-k)}(\lambda) = \operatorname{argmin}\Big\{\sum_{i\notin\mathcal{I}_k}[y_i - x_i^T\beta]^2 + \lambda\|\beta\|_1\Big\},$$
then compute the sum of squared errors on the held-out fold,
$$Q_k(\lambda) = \sum_{i\in\mathcal{I}_k}[y_i - x_i^T\widehat\beta_{(-k)}(\lambda)]^2,$$
and finally solve
$$\lambda^\star = \operatorname{argmin}\Big\{\overline{Q}(\lambda) = \frac{1}{K}\sum_k Q_k(\lambda)\Big\}.$$
Note that this might overfit, so Hastie, Tibshirani & Friedman (2009) Elements of Statistical Learning suggest the largest $\lambda$ such that
$$\overline{Q}(\lambda) \le \overline{Q}(\lambda^\star) + \text{se}[\lambda^\star]
\quad\text{with}\quad
\text{se}[\lambda]^2 = \frac{1}{K^2}\sum_{k=1}^K\big[Q_k(\lambda) - \overline{Q}(\lambda)\big]^2.$$
@freakonometrics 47
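In practice this is what cv.glmnet does; a sketch on simulated data (the design, sparsity pattern and fold count are illustrative), where lambda.min is the CV minimizer and lambda.1se implements the one-standard-error rule mentioned above:

library(glmnet)
set.seed(1)
n <- 200; k <- 50
X <- matrix(rnorm(n * k), n, k)
beta <- c(2, -1, 1.5, rep(0, k - 3))          # sparse truth
y <- as.vector(X %*% beta + rnorm(n))

cv <- cv.glmnet(X, y, alpha = 1, nfolds = 10) # LASSO (alpha = 1), 10-fold CV
c(lambda.min = cv$lambda.min, lambda.1se = cv$lambda.1se)
coef(cv, s = "lambda.1se")                    # coefficients at the 1-se lambda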
  • 48. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO and Ridge, with R
> library(glmnet)
> chicago <- read.table("http://freakonometrics.free.fr/chicago.txt", header = TRUE, sep = ";")
> standardize <- function(x) {(x - mean(x)) / sd(x)}
> z0 <- standardize(chicago[, 1])
> z1 <- standardize(chicago[, 3])
> z2 <- standardize(chicago[, 4])
> ridge   <- glmnet(cbind(z1, z2), z0, alpha = 0,  intercept = FALSE, lambda = 1)
> lasso   <- glmnet(cbind(z1, z2), z0, alpha = 1,  intercept = FALSE, lambda = 1)
> elastic <- glmnet(cbind(z1, z2), z0, alpha = .5, intercept = FALSE, lambda = 1)
Elastic net penalty: $\lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2$.
[Figure: constraint regions of the ridge, LASSO and elastic net penalties.]
@freakonometrics 48
  • 49. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression, Smoothing and Overfit
LASSO can be used to avoid overfitting.
[Figure: simulated data points illustrating smoothing with a LASSO penalty.]
@freakonometrics 49
  • 50. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge vs. LASSO
Consider simulated data (output on the right). With orthogonal variables, the shrinkage operators are as shown: ridge shrinks $\widehat\beta$ proportionally, while LASSO soft-thresholds it.
[Figure: shrinkage maps $\widehat\beta\mapsto\widehat\beta^{(\text{ridge})}$ and $\widehat\beta\mapsto\widehat\beta^{(\text{lasso})}$, and the corresponding coefficient paths as a function of the $\ell_1$ norm.]
@freakonometrics 50
  • 51. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
First idea: given some initial guess $\beta_{(0)}$, use the quadratic approximation
$$|\beta| \sim |\beta_{(0)}| + \frac{1}{2|\beta_{(0)}|}\big(\beta^2 - \beta_{(0)}^2\big).$$
The LASSO estimate can then be derived from iterated Ridge estimates,
$$\|y - X\beta_{(k+1)}\|_2^2 + \lambda\|\beta_{(k+1)}\|_1 \sim \|y - X\beta_{(k+1)}\|_2^2 + \frac{\lambda}{2}\sum_j \frac{1}{|\beta_{j,(k)}|}\,[\beta_{j,(k+1)}]^2,$$
which is a weighted ridge penalty function. Thus,
$$\beta_{(k+1)} = \big(X^TX + \lambda\Delta_{(k)}\big)^{-1}X^Ty
\quad\text{where}\quad
\Delta_{(k)} = \text{diag}\big[|\beta_{j,(k)}|^{-1}\big],$$
and $\beta_{(k)}\to\widehat\beta^{\text{lasso}}$ as $k\to\infty$.
@freakonometrics 51
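A sketch of this iterated-ridge heuristic in R (purely illustrative; a small constant eps is added in the denominator to avoid dividing by zero, which is not part of the slide's formula):

lasso_irls <- function(X, y, lambda, n_iter = 200, eps = 1e-8) {
  beta <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))  # ridge start
  for (k in 1:n_iter) {
    Delta <- diag(1 / pmax(abs(as.vector(beta)), eps))                   # diag(|beta_j|^{-1})
    beta  <- solve(crossprod(X) + lambda * Delta, crossprod(X, y))
  }
  beta
}

set.seed(1)
X <- scale(matrix(rnorm(100 * 5), 100, 5))
y <- as.vector(X %*% c(3, 0, 0, -2, 0) + rnorm(100))
round(lasso_irls(X, y, lambda = 50), 3)   # near-zero entries mimic LASSO selection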
  • 52. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of LASSO Estimate
From this iterative technique,
$$\widehat\beta^{\text{lasso}}_\lambda \sim \big(X^TX + \lambda\Delta\big)^{-1}X^Ty
\quad\text{where}\quad
\Delta = \text{diag}\big[|\widehat\beta^{\text{lasso}}_{j,\lambda}|^{-1}\big] \text{ if } \widehat\beta^{\text{lasso}}_{j,\lambda}\ne 0,\ 0 \text{ otherwise.}$$
Thus,
$$\mathbb{E}[\widehat\beta^{\text{lasso}}_\lambda] \sim \big(X^TX + \lambda\Delta\big)^{-1}X^TX\beta$$
and
$$\text{Var}[\widehat\beta^{\text{lasso}}_\lambda] \sim \sigma^2\big(X^TX + \lambda\Delta\big)^{-1}X^TX\big[\big(X^TX + \lambda\Delta\big)^{-1}\big]^T.$$
@freakonometrics 52
  • 53. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
Consider here a simplified problem,
$$\min_{a\in\mathbb{R}}\ \underbrace{\frac{1}{2}(a-b)^2 + \lambda|a|}_{g(a)}, \qquad \lambda > 0.$$
Observe that at zero, $g'(0^\pm) = -b \pm \lambda$. Then
• if $|b| \le \lambda$, then $a^\star = 0$
• if $b \ge \lambda$, then $a^\star = b - \lambda$
• if $b \le -\lambda$, then $a^\star = b + \lambda$
i.e.
$$a^\star = \operatorname*{argmin}_{a\in\mathbb{R}}\Big\{\frac{1}{2}(a-b)^2 + \lambda|a|\Big\} = S_\lambda(b) = \text{sign}(b)\cdot(|b|-\lambda)_+,$$
also called the soft-thresholding operator.
@freakonometrics 53
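The operator itself is one line of R; a quick numerical check against optimize (the value of b and lambda are arbitrary):

soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

b <- 2.3; lambda <- 1
g <- function(a) 0.5 * (a - b)^2 + lambda * abs(a)
c(closed_form = soft_threshold(b, lambda),
  numerical   = optimize(g, c(-10, 10))$minimum)   # both approximately 1.3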
  • 54. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
Definition. For any convex function $h$, define the proximal operator of $h$,
$$\text{proximal}_h(y) = \operatorname*{argmin}_{x\in\mathbb{R}^d}\Big\{\frac{1}{2}\|x-y\|_2^2 + h(x)\Big\}.$$
Note that
$$\text{proximal}_{\frac{\lambda}{2}\|\cdot\|_2^2}(y) = \frac{1}{1+\lambda}\,y \quad\text{(shrinkage operator)},
\qquad
\text{proximal}_{\lambda\|\cdot\|_1}(y) = S_\lambda(y) = \text{sign}(y)\cdot(|y|-\lambda)_+.$$
@freakonometrics 54
  • 55. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
We want to solve here
$$\widehat\theta \in \operatorname*{argmin}_{\theta\in\mathbb{R}^d}\Big\{\underbrace{\frac{1}{n}\|y - m_\theta(x)\|_2^2}_{f(\theta)} + \underbrace{\lambda\,\text{penalty}(\theta)}_{g(\theta)}\Big\},$$
where $f$ is convex and smooth, and $g$ is convex, but not smooth...
1. Focus on $f$: descent lemma, $\forall\theta,\theta'$,
$$f(\theta) \le f(\theta') + \langle\nabla f(\theta'),\theta-\theta'\rangle + \frac{t}{2}\|\theta-\theta'\|_2^2.$$
Consider a gradient-descent sequence $\theta_k$, i.e. $\theta_{k+1} = \theta_k - t^{-1}\nabla f(\theta_k)$; then
$$f(\theta) \le \underbrace{\varphi(\theta)}_{\theta_{k+1}=\operatorname{argmin}\{\varphi(\theta)\}}
= f(\theta_k) + \langle\nabla f(\theta_k),\theta-\theta_k\rangle + \frac{t}{2}\|\theta-\theta_k\|_2^2.$$
@freakonometrics 55
  • 56. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
2. Add the function $g$:
$$f(\theta) + g(\theta) \le \psi(\theta) = f(\theta_k) + \langle\nabla f(\theta_k),\theta-\theta_k\rangle + \frac{t}{2}\|\theta-\theta_k\|_2^2 + g(\theta).$$
And one can prove that
$$\theta_{k+1} = \operatorname*{argmin}_{\theta\in\mathbb{R}^d}\{\psi(\theta)\} = \text{proximal}_{g/t}\big(\theta_k - t^{-1}\nabla f(\theta_k)\big),$$
the so-called proximal gradient descent algorithm, since
$$\operatorname{argmin}\{\psi(\theta)\} = \operatorname{argmin}\Big\{\frac{t}{2}\big\|\theta - \big(\theta_k - t^{-1}\nabla f(\theta_k)\big)\big\|_2^2 + g(\theta)\Big\}.$$
@freakonometrics 56
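For the LASSO, $f(\beta) = \frac{1}{n}\|y - X\beta\|_2^2$ and $g(\beta) = \lambda\|\beta\|_1$, so the proximal step is a soft-threshold. A minimal sketch of the resulting proximal gradient algorithm in R (the step size, data and number of iterations are illustrative choices, not taken from the slides):

prox_grad_lasso <- function(X, y, lambda, n_iter = 500) {
  n <- nrow(X); p <- ncol(X)
  t_inv <- 1 / (2 * max(eigen(crossprod(X) / n)$values))   # step size 1/t, t = Lipschitz constant of grad f
  soft  <- function(z, thr) sign(z) * pmax(abs(z) - thr, 0)
  beta  <- rep(0, p)
  for (k in 1:n_iter) {
    grad <- -2 * crossprod(X, y - X %*% beta) / n          # gradient of f
    beta <- soft(beta - t_inv * grad, t_inv * lambda)      # proximal step on g (threshold lambda/t)
  }
  as.vector(beta)
}

set.seed(1)
X <- scale(matrix(rnorm(200 * 10), 200, 10))
y <- as.vector(X %*% c(2, -3, rep(0, 8)) + rnorm(200))
round(prox_grad_lasso(X, y, lambda = 0.5), 3)              # sparse estimate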
  • 57. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Coordinate-wise minimization
Consider some convex differentiable function $f:\mathbb{R}^k\to\mathbb{R}$. Suppose $x^\star\in\mathbb{R}^k$ is obtained by minimizing along each coordinate axis, i.e.
$$f(x_1^\star,\cdots,x_{i-1}^\star, x_i, x_{i+1}^\star,\cdots,x_k^\star) \ge f(x_1^\star,\cdots,x_{i-1}^\star, x_i^\star, x_{i+1}^\star,\cdots,x_k^\star) \quad\text{for all } i.$$
Is $x^\star$ a global minimizer, i.e. $f(x)\ge f(x^\star)$, $\forall x\in\mathbb{R}^k$?
Yes, if $f$ is convex and differentiable:
$$\nabla f(x)\big|_{x=x^\star} = \left(\frac{\partial f(x^\star)}{\partial x_1},\cdots,\frac{\partial f(x^\star)}{\partial x_k}\right) = 0.$$
There might be a problem if $f$ is not differentiable (except in each axis direction). If $f(x) = g(x) + \sum_{i=1}^k h_i(x_i)$ with $g$ convex and differentiable, the answer is still yes, since
$$f(x) - f(x^\star) \ge \nabla g(x^\star)^T(x - x^\star) + \sum_i\big[h_i(x_i) - h_i(x_i^\star)\big].$$
@freakonometrics 57
  • 58. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Coordinate-wise minimization
$$f(x) - f(x^\star) \ge \sum_i\Big[\underbrace{\nabla_i g(x^\star)(x_i - x_i^\star) + h_i(x_i) - h_i(x_i^\star)}_{\ge 0}\Big] \ge 0.$$
Thus, for functions of the form $f(x) = g(x) + \sum_{i=1}^k h_i(x_i)$ we can use coordinate descent to find a minimizer, i.e. at step $j$
$$x_1^{(j)} \in \operatorname*{argmin}_{x_1} f(x_1, x_2^{(j-1)}, x_3^{(j-1)},\cdots, x_k^{(j-1)})$$
$$x_2^{(j)} \in \operatorname*{argmin}_{x_2} f(x_1^{(j)}, x_2, x_3^{(j-1)},\cdots, x_k^{(j-1)})$$
$$x_3^{(j)} \in \operatorname*{argmin}_{x_3} f(x_1^{(j)}, x_2^{(j)}, x_3,\cdots, x_k^{(j-1)})$$
Tseng (2001) Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization: if $f$ is continuous, then $x^\infty$ is a minimizer of $f$.
@freakonometrics 58
  • 59. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application in Linear Regression
Let $f(x) = \frac{1}{2}\|y - Ax\|^2$, with $y\in\mathbb{R}^n$ and $A\in\mathcal{M}_{n\times k}$. Let $A = [A_1,\cdots,A_k]$. Let us minimize in direction $i$, and let $x_{-i}$ denote the vector in $\mathbb{R}^{k-1}$ without $x_i$. Here
$$0 = \frac{\partial f(x)}{\partial x_i} = A_i^T[Ax - y] = A_i^T[A_ix_i + A_{-i}x_{-i} - y],$$
thus the optimal value is
$$x_i = \frac{A_i^T[y - A_{-i}x_{-i}]}{A_i^TA_i}.$$
@freakonometrics 59
  • 60. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application to LASSO
Let $f(x) = \frac{1}{2}\|y - Ax\|^2 + \lambda\|x\|_1$, so that the non-differentiable part is separable, since $\|x\|_1 = \sum_{i=1}^k|x_i|$. Let us minimize in direction $i$, and let $x_{-i}$ denote the vector in $\mathbb{R}^{k-1}$ without $x_i$. Here
$$0 = \frac{\partial f(x)}{\partial x_i} = A_i^T[A_ix_i + A_{-i}x_{-i} - y] + \lambda s_i \quad\text{where } s_i\in\partial|x_i|.$$
Thus the solution is obtained by soft-thresholding,
$$x_i = S_{\lambda/\|A_i\|^2}\left(\frac{A_i^T[y - A_{-i}x_{-i}]}{A_i^TA_i}\right).$$
@freakonometrics 60
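A sketch of the resulting cyclic coordinate descent for the LASSO in R (illustrative: convergence is simply fixed at a number of sweeps rather than checked, and the data and penalty are arbitrary):

cd_lasso <- function(A, y, lambda, n_sweeps = 100) {
  k <- ncol(A)
  x <- rep(0, k)
  soft <- function(z, thr) sign(z) * pmax(abs(z) - thr, 0)
  for (s in 1:n_sweeps) {
    for (i in 1:k) {
      r_i  <- y - A[, -i, drop = FALSE] %*% x[-i]     # partial residual, without coordinate i
      a_ii <- sum(A[, i]^2)
      x[i] <- soft(sum(A[, i] * r_i) / a_ii, lambda / a_ii)
    }
  }
  x
}

set.seed(1)
A <- scale(matrix(rnorm(200 * 8), 200, 8))
y <- as.vector(A %*% c(1.5, 0, 0, -2, rep(0, 4)) + rnorm(200))
round(cd_lasso(A, y, lambda = 20), 3)    # small coefficients are set exactly to zero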
  • 61. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Convergence rate for LASSO
Let $f(x) = g(x) + \lambda\|x\|_1$ with
• $g$ convex, $\nabla g$ Lipschitz with constant $L>0$, and $\text{Id} - \nabla g/L$ monotone increasing in each component
• there exists $z$ such that, componentwise, either $z \ge S_\lambda(z - \nabla g(z))$ or $z \le S_\lambda(z - \nabla g(z))$
Saha & Tewari (2010) On the finite time convergence of cyclic coordinate descent methods proved that a coordinate descent sequence started from $z$ satisfies
$$f(x^{(j)}) - f(x^\star) \le \frac{L\|z - x^\star\|^2}{2j}.$$
@freakonometrics 61
  • 62. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Graphical Lasso and Covariance Estimation
We want to estimate an (unknown) covariance matrix $\Sigma$, or $\Sigma^{-1}$. An estimate of $\Sigma^{-1}$ is the solution $\widehat\Theta$ of
$$\widehat\Theta \in \operatorname*{argmin}_{\Theta\in\mathcal{M}_{k\times k}}\big\{-\log[\det(\Theta)] + \text{trace}[S\Theta] + \lambda\|\Theta\|_1\big\}$$
where $S = \dfrac{X^TX}{n}$ and $\|\Theta\|_1 = \sum|\Theta_{i,j}|$.
See van Wieringen (2016) Undirected network reconstruction from high-dimensional data and https://github.com/kaizhang/glasso
@freakonometrics 62
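A small sketch using the glasso package from CRAN (assuming its standard glasso(s, rho) interface, where the returned wi component is the estimated inverse covariance; the sparse precision matrix and penalty below are illustrative):

library(glasso)
library(MASS)

set.seed(1)
Theta <- diag(4); Theta[1, 2] <- Theta[2, 1] <- 0.4   # sparse precision matrix
Sigma <- solve(Theta)
X <- mvrnorm(500, mu = rep(0, 4), Sigma = Sigma)

S   <- crossprod(scale(X, scale = FALSE)) / nrow(X)   # empirical covariance
fit <- glasso(S, rho = 0.1)                           # penalized precision estimate
round(fit$wi, 2)                                      # estimated Theta = Sigma^{-1}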
  • 63. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application to Network Simplification
Can be applied to networks, to spot 'significant' connections...
Source: http://khughitt.github.io/graphical-lasso/
@freakonometrics 63
  • 64. Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Extension of Penalization Techniques
In a more general context, we want to solve
$$\widehat\theta \in \operatorname*{argmin}_{\theta\in\mathbb{R}^d}\Big\{\frac{1}{n}\sum_{i=1}^n \ell(y_i, m_\theta(x_i)) + \lambda\cdot\text{penalty}(\theta)\Big\}.$$
@freakonometrics 64