Varese, Italy, seminar
1. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria
Econometrics & “Machine Learning”*
A. Charpentier (Université de Rennes)
joint work with E. Flachaire (AMSE) & A. Ly (Université Paris-Est)
https://hal.archives-ouvertes.fr/hal-01568851/
Università degli studi dell’Insubria
Seminar, May 2018.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 1
Econometrics & “Machine Learning”
Varian (2014, Big Data: New Tricks for Econometrics)
Econometrics in a “Big Data” Context
Here data means y ∈ Rⁿ and X an n × p matrix.
n large means "asymptotic" theorems can be invoked (n → ∞).
Portnoy (1988, Asymptotic Behavior of Likelihood Methods for Exponential Families when the Number of Parameters Tends to Infinity) proved that the MLE is asymptotically Gaussian as long as p²/n → 0.
There might be high-dimensional issues if p > √n.
Nevertheless, there might be sparsity issues in high dimension (see Hastie,
Tibshirani & Wainwright (2015, Statistical Learning with Sparsity)) : a sparse
statistical model has only a small number of nonzero parameters or weights
The first-order condition Xᵀ(Xβ − y) = 0 is usually solved via a QR decomposition (which can be computationally intensive).
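As an illustration (not from the slides), a minimal numpy sketch of solving the least-squares first-order condition through a QR decomposition rather than the normal equations; the simulated data and coefficient values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# X = QR, so the normal equations X'X b = X'y reduce to the triangular system R b = Q'y
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# Compare with the direct normal-equation solution (X'X)^(-1) X'y
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_qr, beta_ne))  # True
```

The QR route avoids forming XᵀX, whose condition number is the square of that of X.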
Econometrics in a “Big Data” Context
Arthur Hoerl in 1982, looking back on Hoerl & Kennard (1970) on ridge regression.
Back on the history of the “regression”
Galton (1869, Hereditary Genius; 1886, Regression towards mediocrity in hereditary stature) and Pearson & Lee (1896, On Telegony in Man; 1903, On the Laws of Inheritance in Man) studied the genetic transmission of characteristics, e.g. height.
On average the child of tall parents is taller than other children, but shorter than his parents.
"I have called this peculiarity by the name of regression", Francis Galton, 1886.
Economics and Statistics
Working (1925)
What Do Statistical "Demand Curves" Show?
Tinbergen (1939)
Statistical Testing of Business Cycle Theories
Econometrics and the Probabilistic Model
Haavelmo (1944, The Probability Approach in Econometrics)
Data (yᵢ, xᵢ) are seen as realizations of (iid) random variables (Y, X) on some probability space (Ω, F, P).
Mathematical Statistics
Vapnik (1998, Statistical Learning Theory)
Mathematical Statistics
Consider observations {y₁, · · · , yₙ} from iid random variables Yᵢ ∼ F_θ (with "density" f_θ).
The likelihood is
L(θ; y) = ∏ᵢ₌₁ⁿ f_θ(yᵢ)
The maximum likelihood estimate is
θ̂ᵐˡᵉ ∈ argmax_{θ∈Θ} {L(θ; y)}
Fisher (1912) On an absolute criterion for fitting frequency curves
Mathematical Statistics
Under standard assumptions (identification of the model, compactness, continuity and dominance), the maximum likelihood estimator is consistent, θ̂ᵐˡᵉ →ᴾ θ. With additional assumptions, it can be shown that the maximum likelihood estimator converges to a normal distribution,
√n (θ̂ᵐˡᵉ − θ) →ᴸ N(0, I⁻¹)
where I is the Fisher information matrix (i.e. θ̂ᵐˡᵉ is asymptotically efficient).
E.g. if Y ∼ N(θ, σ²), log L ∝ −∑ᵢ₌₁ⁿ ((yᵢ − θ)/σ)², and θ̂ᵐˡᵉ = ȳ (see also the method of moments).
max log L ⟺ min ∑ᵢ₌₁ⁿ εᵢ², i.e. least squares.
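The equivalence between the Gaussian MLE and least squares can be checked numerically; this is an illustrative sketch on simulated data, with a crude grid search standing in for a proper optimizer:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.0, size=500)

# Negative Gaussian log-likelihood in theta (sigma known): up to constants, sum((y - theta)^2)
def neg_log_lik(theta):
    return np.sum((y - theta) ** 2)

# Grid search over candidate values of theta
grid = np.linspace(0.0, 4.0, 4001)
theta_mle = grid[np.argmin([neg_log_lik(t) for t in grid])]

# The MLE coincides with the least-squares solution, i.e. the sample mean
print(abs(theta_mle - y.mean()) < 1e-2)  # True
```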
Data and Conditional Distribution
Cook & Weisberg (1999, Applied Regression Including Computing and Graphics)
From available data, describe the conditional distribution of Y given X.
Conditional Distributions and Likelihood
We had a sample {y₁, · · · , yₙ}, with Y from a N(θ, σ²) distribution.
The natural extension, if we have a sample {(y₁, x₁), · · · , (yₙ, xₙ)}, is to assume that Y | X = x has a N(θₓ, σ²) distribution.
The standard linear model is obtained when θₓ is linear, θₓ = β₀ + xᵀβ. Hence
E[Y | X = x] = xᵀβ and Var[Y | X = x] = σ² (homoskedasticity)
(if we center the variables, or if we add the constant to x).
The MLE of β is
β̂ᵐˡᵉ = argmin { ∑ᵢ₌₁ⁿ (yᵢ − xᵢᵀβ)² } = (XᵀX)⁻¹Xᵀy.
From a Gaussian to a Bernoulli (Logistic) Regression, and GLMs
Consider y ∈ {0, 1}, so that Y has a Bernoulli distribution. The logarithm of the odds is linear,
log ( P[Y = 1 | X = x] / P[Y = 0 | X = x] ) = β₀ + xᵀβ,
i.e.
E[Y | X = x] = e^{β₀+xᵀβ} / (1 + e^{β₀+xᵀβ}) = H(β₀ + xᵀβ), where H(·) = exp(·) / (1 + exp(·)),
with likelihood
L(β; y, X) = ∏ᵢ₌₁ⁿ ( e^{xᵢᵀβ} / (1 + e^{xᵢᵀβ}) )^{yᵢ} · ( 1 / (1 + e^{xᵢᵀβ}) )^{1−yᵢ}
and set β̂ᵐˡᵉ = argmax_{β∈Rᵖ} {L(β; y, X)} (solved numerically).
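A sketch of one standard numerical solution, Newton-Raphson on the Bernoulli log-likelihood (equivalently, iteratively reweighted least squares), on simulated data; the coefficient values and iteration count below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
beta_true = np.array([-0.5, 1.5])
prob = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, prob)

# Newton-Raphson: beta <- beta + H^(-1) score, with H the observed information
beta = np.zeros(2)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
    grad = X.T @ (y - mu)                   # score vector
    W = mu * (1.0 - mu)                     # working weights
    hess = X.T @ (X * W[:, None])           # information matrix
    beta = beta + np.linalg.solve(hess, grad)

print(np.abs(beta - beta_true) < 0.35)      # estimates close to the true coefficients
```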
A Brief Excursion to Nonparametric Econometrics
Two strategies are usually considered to compute m(x) = E[Y | X = x]:
• consider a local approximation (in the neighborhood of x), e.g. k-nearest neighbors, kernel regression, lowess
• consider a functional decomposition of the function m in some natural basis, e.g. splines
For a kernel regression, as in Nadaraya (1964, On Estimating Regression) and Watson (1964, Smooth Regression Analysis), consider some kernel function K and some bandwidth h,
m̂ₕ(x) = sₓᵀy = ∑ᵢ₌₁ⁿ sₓ,ᵢ yᵢ where sₓ,ᵢ = Kₕ(x − xᵢ) / ( Kₕ(x − x₁) + · · · + Kₕ(x − xₙ) )
(in a univariate problem).
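The Nadaraya-Watson estimator can be coded in a few lines; this sketch assumes a Gaussian kernel and a hand-picked bandwidth, on simulated data:

```python
import numpy as np

def nadaraya_watson(x0, x, y, h):
    """Nadaraya-Watson estimate of E[Y | X = x0] with a Gaussian kernel and bandwidth h."""
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)  # unnormalized kernel weights K_h(x0 - x_i)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=500)
y = np.sin(x) + 0.2 * rng.normal(size=500)

m_hat = nadaraya_watson(1.0, x, y, h=0.3)
print(abs(m_hat - np.sin(1.0)) < 0.2)  # close to the true regression value sin(1)
```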
A Brief Excursion to Nonparametric Econometrics
Recall (see Simonoff (1996, Smoothing Methods in Statistics)) that, using asymptotic approximations,
bias[m̂ₕ(x)] = E[m̂ₕ(x)] − m(x) ∼ h² ( (C₁/2) m″(x) + C₂ m′(x) f′(x)/f(x) )
Var[m̂ₕ(x)] ∼ (C₃ / nh) · σ²(x)/f(x)
for some constants C₁, C₂ and C₃.
Set mse[m̂ₕ(x)] = bias²[m̂ₕ(x)] + Var[m̂ₕ(x)] and consider the integrated version
mise[m̂ₕ] = ∫ mse[m̂ₕ(x)] dF(x)
A Brief Excursion to Nonparametric Econometrics
mise[m̂ₕ] ∼ (h⁴/4) ( ∫ x²k(x)dx )² ∫ ( m″(x) + 2m′(x) f′(x)/f(x) )² dx  [bias²]
+ (σ²/nh) ∫ k²(x)dx · ∫ dx/f(x)  [variance]
Thus, the optimal value is h⋆ = O(n^{−1/5}) (see Silverman's rule of thumb), obtained using asymptotic theory and approximations.
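Silverman's rule of thumb, mentioned above, is easy to state in code; the constants 0.9 and 1.34 follow the usual textbook version of the rule, which is an assumption of this sketch:

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's rule of thumb: h = 0.9 * min(sd, IQR/1.34) * n^(-1/5)."""
    n = len(x)
    sd = np.std(x, ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    return 0.9 * min(sd, iqr / 1.34) * n ** (-1 / 5)

rng = np.random.default_rng(4)
x = rng.normal(size=1000)
h = silverman_bandwidth(x)
print(0.1 < h < 0.4)  # for standard normal data with n = 1000, roughly 0.9 * 1000^(-1/5) ~ 0.23
```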
A Brief Excursion to Nonparametric Econometrics
Choice of meta-parameter h : trade-off bias / variance.
Model (and Variable) Choice
Suppose that the true model is yᵢ = x₁,ᵢβ₁ + x₂,ᵢβ₂ + εᵢ, but we estimate the model on x₁ (only): yᵢ = x₁,ᵢb₁ + ηᵢ. Then
b₁ = (X₁ᵀX₁)⁻¹X₁ᵀy
   = (X₁ᵀX₁)⁻¹X₁ᵀ[X₁β₁ + X₂β₂ + ε]
   = (X₁ᵀX₁)⁻¹X₁ᵀX₁β₁ + (X₁ᵀX₁)⁻¹X₁ᵀX₂β₂ + (X₁ᵀX₁)⁻¹X₁ᵀε
   = β₁ + (X₁ᵀX₁)⁻¹X₁ᵀX₂β₂ [= β₁₂] + (X₁ᵀX₁)⁻¹X₁ᵀε [= ν]
i.e. E(b₁) = β₁ + β₁₂. If X₁ᵀX₂ = 0 (X₁ ⊥ X₂), then E(b₁) = β₁.
Conversely, assume that the true model is yᵢ = x₁,ᵢβ₁ + εᵢ but we estimate it with (unnecessary) variables x₂: yᵢ = x₁,ᵢb₁ + x₂,ᵢb₂ + ηᵢ. Here the estimation is unbiased, E(b₁) = β₁, but the estimator is not efficient...
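The omitted-variable bias above can be reproduced by simulation; the coefficient values and the correlation between the two covariates below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50000
x2 = rng.normal(size=n)
x1 = 0.7 * x2 + rng.normal(size=n)      # x1 correlated with the omitted x2
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

# Short regression (x2 omitted): b1 is biased, E(b1) = beta1 + beta12
b1_short = (x1 @ y) / (x1 @ x1)

# Long regression (x2 included): unbiased for beta1 = 1
X = np.column_stack([x1, x2])
b_long = np.linalg.solve(X.T @ X, X.T @ y)

print(abs(b_long[0] - 1.0) < 0.05, b1_short > 1.5)
```

Here the bias term β₁₂ = β₂ Cov(x₁, x₂)/Var(x₁) ≈ 2 · 0.7/1.49 ≈ 0.94, so the short-regression coefficient lands near 1.94 instead of 1.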
Model (and Variable) Choice, pluralitas non est ponenda sine necessitate
From the variance decomposition
(1/n) ∑ᵢ₌₁ⁿ (yᵢ − ȳ)² [total variance] = (1/n) ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)² [residual variance] + (1/n) ∑ᵢ₌₁ⁿ (ŷᵢ − ȳ)² [explained variance]
define the R² as
R² = ( ∑ᵢ₌₁ⁿ (yᵢ − ȳ)² − ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)² ) / ∑ᵢ₌₁ⁿ (yᵢ − ȳ)²
or better, the adjusted R²,
R̄² = 1 − (1 − R²)(n − 1)/(n − p) = R² − (1 − R²)(p − 1)/(n − p) [penalty].
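A direct transcription of the two formulas, checked on simulated data (the design below is illustrative):

```python
import numpy as np

def r_squared(y, y_hat, p):
    """R^2 and adjusted R^2 for a fit with p parameters (including the intercept)."""
    n = len(y)
    ss_tot = np.sum((y - y.mean()) ** 2)
    ss_res = np.sum((y - y_hat) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p)
    return r2, r2_adj

rng = np.random.default_rng(6)
n = 50
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
r2, r2_adj = r_squared(y, X @ beta, p=2)
print(r2_adj < r2)  # the adjustment always penalizes (whenever R^2 < 1 and p > 1)
```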
Model (and Variable) Choice, pluralitas non est ponenda sine necessitate
One can also consider the log-likelihood, or some penalized version,
AIC = −2 log L + 2·p [penalty] or BIC = −2 log L + log(n)·p [penalty].
The objective is min{−log L(β)}, and quality is measured via −2 log L(β̂) + 2p.
Goodhart’s law “When a measure becomes a target, it ceases to be a good measure”
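A sketch of AIC for a Gaussian linear model (with σ² profiled out, an assumption of this implementation), comparing an intercept-only fit against the model that includes the relevant covariate:

```python
import numpy as np

def gaussian_aic(y, y_hat, p):
    """AIC = -2 log L + 2p for a Gaussian linear model, sigma^2 profiled out."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    log_lik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return -2 * log_lik + 2 * p

def ols_fit(X, y):
    """Fitted values from an OLS regression of y on X."""
    return X @ np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(12)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X0 = np.ones((n, 1))                    # intercept only
X1 = np.column_stack([np.ones(n), x])   # intercept + x

aic0 = gaussian_aic(y, ols_fit(X0, y), p=2)  # intercept + sigma^2
aic1 = gaussian_aic(y, ols_fit(X1, y), p=3)
print(aic1 < aic0)  # the model including x wins despite the extra penalty
```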
Model (and Variable) Choice, pluralitas non est ponenda sine necessitate
An alternative approach to penalization: consider a linear predictor m ∈ M,
M = { m : m(x) = sₕ(x)ᵀy } where S = (s(x₁), · · · , s(xₙ))ᵀ is the smoothing matrix.
Suppose that the true model is y = m₀(x) + ε with E[ε] = 0 and Var[ε] = σ²I, so that m₀(x) = E[Y | X = x]. The quadratic risk R(m̂) = E[(Y − m̂(X))²] is
E[(Y − m₀(X))²] [error] + E[(m₀(X) − E[m̂(X)])²] [bias²] + E[(E[m̂(X)] − m̂(X))²] [variance].
Consider the empirical risk R̂ₙ(m̂) = (1/n) ∑ᵢ₌₁ⁿ (yᵢ − m̂(xᵢ))²; then one can prove that
R(m̂) = E[R̂ₙ(m̂)] + (2σ²/n) trace(S),
see Mallows' Cp, from Mallows (1973, Some Comments on Cp).
Introduction: “Machine Learning” ?
Friedman (1997, Data Mining and Statistics: What's the Connection?)
Introduction: “Machine Learning” ?
Machine learning (initially) did not need any probabilistic framework.
Consider observations (yᵢ, xᵢ), a loss function ℓ and some class of models M. The goal is to solve
m̂ ∈ argmin_{m∈M} { ∑ᵢ₌₁ⁿ ℓ(yᵢ, m(xᵢ)) }
Watt et al. (2016, Machine Learning Refined: Foundations, Algorithms, and Applications)
Probabilistic Foundations and “Statistical Learning”
Vapnik (2000, The Nature of Statistical Learning Theory)
Probabilistic Foundations and “Statistical Learning”
Valiant (1984, A Theory of the Learnable)
Probabilistic Foundations and “Statistical Learning”
Consider a classification problem, y ∈ {−1, +1}. The "true" model is m₀ (the target); consider some model m.
Let P denote the (unknown) distribution of the X's. The error of m can be written
R_{P,m₀}(m) = P[m(X) ≠ m₀(X)] = P({x : m(x) ≠ m₀(x)}) where X ∼ P.
Naturally, we can assume that the observations xᵢ were drawn from P. The empirical risk is
R̂(m) = (1/n) ∑ᵢ₌₁ⁿ 1(m(xᵢ) ≠ yᵢ).
It is not possible to get a perfect model, so we should seek a Probably Approximately Correct model: given ε, we want to find m such that R_{P,m₀}(m) ≤ ε with probability 1 − δ.
More precisely, we want an algorithm that might lead us to a candidate m.
Probabilistic Foundations and “Statistical Learning”
Suppose that M contains a finite number of models. For any ε, δ, P and m₀, if we have enough observations (n ≥ ε⁻¹ log[δ⁻¹|M|]), and if
m̂ ∈ argmin_{m∈M} { (1/n) ∑ᵢ₌₁ⁿ 1(m(xᵢ) ≠ yᵢ) }
then with probability higher than 1 − δ, R_{P,m₀}(m̂) ≤ ε.
Here n_M(ε, δ) = ε⁻¹ log[δ⁻¹|M|] is called the complexity, and M is PAC-learnable.
If M is not finite, the problem is more complicated; it is necessary to define a dimension d of M (the so-called VC dimension) that will be a substitute for |M|.
Penalization and Over/Under-Fit
The risk of m cannot be estimated on the data used to fit m.
Alternatives are cross-validation and bootstrap
The objective function and loss functions
Consider a loss function ℓ : Y × Y → R₊ and set m̂ ∈ argmin_{m∈M} { ∑ᵢ₌₁ⁿ ℓ(yᵢ, m(xᵢ)) }.
A classical loss function is the quadratic one, ℓ₂(u, v) = (u − v)². Recall that
ȳ = argmin_{m∈R} { ∑ᵢ₌₁ⁿ ℓ₂(yᵢ, m) }, or (for the continuous version)
E(Y) = argmin_{m∈R} { ‖Y − m‖₂² } = argmin_{m∈R} { E[ℓ₂(Y, m)] }.
One can also consider the least-absolute-value one, ℓ₁(u, v) = |u − v|. Recall that
median[y] = argmin_{m∈R} { ∑ᵢ₌₁ⁿ ℓ₁(yᵢ, m) }. The optimisation problem
m̂ = argmin_{m∈M} { ∑ᵢ₌₁ⁿ |yᵢ − m(xᵢ)| }
is well known in robust econometrics.
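Both facts, that the mean minimizes the quadratic loss and the median the absolute loss, can be checked numerically on a skewed sample (grid search is used here only for transparency):

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.exponential(size=2001)  # skewed sample, so mean != median

grid = np.linspace(0, 5, 5001)
l2_loss = [np.sum((y - m) ** 2) for m in grid]   # quadratic loss
l1_loss = [np.sum(np.abs(y - m)) for m in grid]  # absolute loss

m2 = grid[np.argmin(l2_loss)]
m1 = grid[np.argmin(l1_loss)]
print(abs(m2 - y.mean()) < 1e-2, abs(m1 - np.median(y)) < 1e-2)
```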
The objective function and loss functions
More generally, since ℓ₁(y, m) = |(y − m)(1/2 − 1_{y≤m})| (up to a factor 2), the loss can be generalized to any probability level τ ∈ (0, 1):
m̂_τ = argmin_{m∈M₀} { ∑ᵢ₌₁ⁿ ℓ_τ^q(yᵢ, m(xᵢ)) } with ℓ_τ^q(x, y) = (x − y)(τ − 1_{x≤y})
is quantile regression, see Koenker (2005, Quantile Regression). One can also consider expectiles, with loss function ℓ_τ^e(x, y) = (x − y)² · |τ − 1_{x≤y}|, see Aigner et al. (1977, Formulation and Estimation of Stochastic Frontier Production Function Models).
Gneiting (2011, Making and Evaluating Point Forecasts) introduced the concept of elicitable statistics: T is elicitable if there exists ℓ : R × R → R₊ such that
T(Y) = argmin_{x∈R} { ∫_R ℓ(x, y) dF(y) } = argmin_{x∈R} { E[ℓ(x, Y)] } where Y ∼ F.
Thus the mean, the median and any quantile are elicitable.
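The pinball loss ℓ_τ^q above does recover quantiles; here is a small grid-search check on simulated data:

```python
import numpy as np

def pinball_loss(y, m, tau):
    """Quantile (pinball) loss: sum of (y - m) * (tau - 1{y <= m})."""
    return np.sum((y - m) * (tau - (y <= m)))

rng = np.random.default_rng(8)
y = rng.normal(size=5000)

tau = 0.9
grid = np.linspace(-3, 3, 6001)
m_tau = grid[np.argmin([pinball_loss(y, m, tau) for m in grid])]

print(abs(m_tau - np.quantile(y, tau)) < 0.05)  # close to the empirical 90% quantile
```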
Weak and iterative learning (Gradient Boosting)
We want to solve m̂ ∈ argmin_{m∈M} { ∑ᵢ₌₁ⁿ ℓ(yᵢ, m(xᵢ)) } over a set of possible predictors M.
In a classical optimization problem x⋆ ∈ argmin_{x∈Rᵈ} {f(x)}, we consider a gradient descent to approximate x⋆, based on the first-order expansion
f(xₖ) ∼ f(xₖ₋₁) + (xₖ − xₖ₋₁)ᵀ ∇f(xₖ₋₁), with step xₖ − xₖ₋₁ = −αₖ ∇f(xₖ₋₁).
Here, it is some sort of dual problem, and a functional gradient descent can be considered,
mₖ(x) ∼ mₖ₋₁(x) + (mₖ − mₖ₋₁)(x), where the update mₖ − mₖ₋₁ plays the role of a step of size βₖ in the direction of the (functional) gradient of the loss at mₖ₋₁.
Weak and iterative learning (Gradient Boosting)
m̂⁽ᵏ⁾ = m̂⁽ᵏ⁻¹⁾ + argmin_{h∈H} { ∑ᵢ₌₁ⁿ ℓ( yᵢ − m̂⁽ᵏ⁻¹⁾(xᵢ) [= εₖ,ᵢ], h(xᵢ) ) }
where
m̂⁽ᵏ⁾(·) = m̂₁(·) [∼ y] + m̂₂(·) [∼ ε₁] + m̂₃(·) [∼ ε₂] + · · · + m̂ₖ(·) [∼ εₖ₋₁] = m̂⁽ᵏ⁻¹⁾(·) + m̂ₖ(·).
Equivalently, start with m̂⁽⁰⁾ = argmin_{m∈R} { ∑ᵢ₌₁ⁿ ℓ(yᵢ, m) }, and solve iteratively
m̂⁽ᵏ⁾ = m̂⁽ᵏ⁻¹⁾ + argmin_{h∈H} { ∑ᵢ₌₁ⁿ ℓ( yᵢ, m̂⁽ᵏ⁻¹⁾(xᵢ) + h(xᵢ) ) }
Penalization and Over/Under-Fit
Consider some penalized objective function: given λ ≥ 0,
(β̂₀,λ, β̂λ) = argmin { ∑ᵢ₌₁ⁿ ℓ(yᵢ, β₀ + xᵀβ) + λ‖β‖ }
with centered and scaled variables, for some penalty ‖β‖.
Note that there is some Bayesian interpretation,
P[θ|y] [a posteriori] ∝ P[y|θ] [likelihood] · P[θ] [a priori]
i.e. log P[θ|y] = log P[y|θ] [log-likelihood] + log P[θ] [penalization].
Going further, ℓ₀, ℓ₁ and ℓ₂ penalty
Define
‖a‖₀ = ∑ᵢ₌₁ᵈ 1(aᵢ ≠ 0), ‖a‖₁ = ∑ᵢ₌₁ᵈ |aᵢ| and ‖a‖₂ = ( ∑ᵢ₌₁ᵈ aᵢ² )^{1/2}, for a ∈ Rᵈ.
constrained optimization vs. penalized optimization:
(ℓ₀) argmin_{β; ‖β‖₀ ≤ s} ∑ᵢ₌₁ⁿ ℓ(yᵢ, β₀ + xᵀβ)  |  argmin_{β,λ} ∑ᵢ₌₁ⁿ ℓ(yᵢ, β₀ + xᵀβ) + λ‖β‖₀
(ℓ₁) argmin_{β; ‖β‖₁ ≤ s} ∑ᵢ₌₁ⁿ ℓ(yᵢ, β₀ + xᵀβ)  |  argmin_{β,λ} ∑ᵢ₌₁ⁿ ℓ(yᵢ, β₀ + xᵀβ) + λ‖β‖₁
(ℓ₂) argmin_{β; ‖β‖₂ ≤ s} ∑ᵢ₌₁ⁿ ℓ(yᵢ, β₀ + xᵀβ)  |  argmin_{β,λ} ∑ᵢ₌₁ⁿ ℓ(yᵢ, β₀ + xᵀβ) + λ‖β‖₂
Assume that ℓ is the quadratic norm.
Going further, ℓ₀, ℓ₁ and ℓ₂ penalty
The two problems (ℓ₂) are equivalent: for any (β⋆, s⋆) solution of the constrained problem, there exists λ⋆ such that (β⋆, λ⋆) is a solution of the penalized problem, and conversely.
The two problems (ℓ₁) are equivalent: for any (β⋆, s⋆) solution of the constrained problem, there exists λ⋆ such that (β⋆, λ⋆) is a solution of the penalized problem, and conversely. Nevertheless, despite this theoretical equivalence, there might be numerical issues since the solution is not necessarily unique.
The two problems (ℓ₀) are not equivalent: if (β⋆, λ⋆) is a solution of the penalized problem, there exists s⋆ such that β⋆ is a solution of the constrained problem; but the converse is not true.
More generally, for an ℓₚ norm,
• sparsity is obtained when p ≤ 1
• convexity is obtained when p ≥ 1
Using the ℓ₀ penalty
Foster & George (1994, The Risk Inflation Criterion for Multiple Regression) tried to solve directly the penalized problem (ℓ₀).
But it is a complex combinatorial problem in high dimension (Natarajan (1995, Sparse Approximate Solutions to Linear Systems) proved that it is NP-hard).
One can prove that if λ ∼ σ² log(p), then
E( [xᵀβ̂ − xᵀβ₀]² ) ≤ E( [x_Sᵀβ̂_S − xᵀβ₀]² ) [= σ²#S] · ( 4 log p + 2 + o(1) ).
In that case
β̂^{sub}_{λ,j} = 0 if j ∉ S_λ(β̂), and β̂^{sub}_{λ,j} = β̂^{ols}_j if j ∈ S_λ(β̂),
where S_λ(β̂) is the set of non-null values in solutions of (ℓ₀).
Using the ℓ₂ penalty
With the ℓ₂-norm, (ℓ₂) is the ridge regression
β̂^{ridge}_λ = (XᵀX + λI)⁻¹Xᵀy = (XᵀX + λI)⁻¹(XᵀX)β̂^{ols},
see Tikhonov (1943, On the Stability of Inverse Problems). Hence
bias[β̂^{ridge}_λ] = −λ[XᵀX + λI]⁻¹β, Var[β̂^{ridge}_λ] = σ²[XᵀX + λI]⁻¹XᵀX[XᵀX + λI]⁻¹.
With orthogonal variables (i.e. XᵀX = I), we get
bias[β̂^{ridge}_λ] = −(λ/(1 + λ))β and Var[β̂^{ridge}_λ] = (σ²/(1 + λ)²)I = Var[β̂^{ols}]/(1 + λ)².
Note that Var[β̂^{ridge}_λ] < Var[β̂^{ols}]. Further, mse[β̂^{ridge}_λ] is minimal when λ = pσ²/βᵀβ.
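A small numerical check of the ridge estimator (XᵀX + λI)⁻¹Xᵀy, and of the fact that it shrinks the coefficients relative to OLS; the simulated design and the value λ = 10 are arbitrary:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimator (X'X + lam I)^(-1) X'y; lam = 0 gives OLS."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(10)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta = np.ones(p)
y = X @ beta + rng.normal(size=n)

b_ols = ridge(X, y, 0.0)
b_ridge = ridge(X, y, 10.0)

# The penalty shrinks the coefficient vector towards zero
print(np.linalg.norm(b_ridge) < np.linalg.norm(b_ols))  # True
```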
Using the ℓ₁ penalty
If ‖·‖ is no longer the quadratic norm but ℓ₁, problem (ℓ₁) is not always strictly convex, and the optimum is not always unique (e.g. if XᵀX is singular).
But with the quadratic loss, the objective is convex, and at least Xβ̂ is unique.
Further, note that solutions are necessarily coherent (in the signs of the coefficients): it is not possible to have β̂ⱼ < 0 for one solution and β̂ⱼ > 0 for another one.
In many cases, problem (ℓ₁) yields a corner-type solution, which can be seen as a "best subset" solution, as in (ℓ₀).
Using the ℓ₁ penalty
Consider a simple regression yᵢ = xᵢβ + ε, with an ℓ₁ penalty and an ℓ₂ loss function. (ℓ₁) becomes
min { yᵀy − 2yᵀxβ + βxᵀxβ + 2λ|β| }
The first-order condition can be written −2yᵀx + 2xᵀxβ ± 2λ = 0 (the sign of ± being the sign of β). Assume that yᵀx > 0; then the solution is
β̂^{lasso}_λ = max{ (yᵀx − λ)/(xᵀx), 0 } (we get a corner solution when λ is large).
In higher dimension, see Tibshirani & Wasserman (2016, A Closer Look at Sparse Regression) or Candès & Plan (2009, Near-ideal Model Selection by ℓ₁ Minimization).
Under some additional technical assumptions, the lasso estimator is "sparsistent", in the sense that the support of β̂^{lasso}_λ is the same as that of β.
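The closed-form solution above (valid under the assumption yᵀx > 0) can be checked directly; the corner solution appears for large λ:

```python
import numpy as np

def lasso_univariate(x, y, lam):
    """Closed-form univariate lasso, assuming y'x > 0: max((y'x - lam)/(x'x), 0)."""
    return max((y @ x - lam) / (x @ x), 0.0)

rng = np.random.default_rng(11)
x = rng.normal(size=100)
y = 0.5 * x + 0.1 * rng.normal(size=100)

# Small penalty: shrunken but nonzero; large penalty: exactly zero (corner solution)
print(lasso_univariate(x, y, lam=1.0) > 0, lasso_univariate(x, y, lam=1000.0) == 0.0)
```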
Going further, ℓ₀, ℓ₁ and ℓ₂ penalty
Thus, lasso can be used for variable selection (see Hastie et al. (2001, The Elements of Statistical Learning)).
With orthonormal covariates, one can prove that
β̂^{sub}_{λ,j} = β̂^{ols}_j · 1_{|β̂^{sub}_{λ,j}| > b}, β̂^{ridge}_{λ,j} = β̂^{ols}_j / (1 + λ) and β̂^{lasso}_{λ,j} = sign[β̂^{ols}_j] · (|β̂^{ols}_j| − λ)₊.
Econometric Modeling
Data {(yᵢ, xᵢ)}, for i = 1, · · · , n, with xᵢ ∈ X ⊂ Rᵖ and yᵢ ∈ Y.
A model is a mapping m : X → Y
- regression, Y = R (but also Y = N)
- classification, Y = {0, 1}, {−1, +1} (binary, or more)
Classification models are based on two steps,
• a score function, s(x) = P(Y = 1 | X = x) ∈ [0, 1]
• a classifier, s(x) → ŷ ∈ {0, 1}.
Classification Trees
To split {N} into two {N_L, N_R}, consider
I(N_L, N_R) = ∑_{x∈{L,R}} (n_x/n) I(N_x)
e.g. the Gini index (used originally in CART, see Breiman et al. (1984, Classification and Regression Trees))
gini(N_L, N_R) = −∑_{x∈{L,R}} (n_x/n) ∑_{y∈{0,1}} (n_{x,y}/n_x)(1 − n_{x,y}/n_x)
and the cross-entropy (used in C4.5 and C5.0)
entropy(N_L, N_R) = −∑_{x∈{L,R}} (n_x/n) ∑_{y∈{0,1}} (n_{x,y}/n_x) log(n_{x,y}/n_x)
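A sketch of the two impurity measures for a split, written here with the usual positive sign convention (the leading minus on the slide only turns maximization into minimization):

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node from class counts: sum of p(1 - p)."""
    p = np.asarray(counts) / np.sum(counts)
    return np.sum(p * (1 - p))

def entropy(counts):
    """Cross-entropy impurity of a node: -sum of p log p (with 0 log 0 := 0)."""
    p = np.asarray(counts) / np.sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def split_impurity(left_counts, right_counts, impurity):
    """Weighted impurity (n_L/n) I(N_L) + (n_R/n) I(N_R) of a split {N_L, N_R}."""
    n_l, n_r = np.sum(left_counts), np.sum(right_counts)
    n = n_l + n_r
    return (n_l / n) * impurity(left_counts) + (n_r / n) * impurity(right_counts)

# A pure split has zero impurity under both criteria
print(split_impurity([10, 0], [0, 10], gini) == 0.0,
      split_impurity([10, 0], [0, 10], entropy) == 0.0)
```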
Model Selection & ROC Curves
Given a scoring function m̂(·), with m̂(x) = Ê[Y | X = x], and a threshold s ∈ (0, 1), set
Ŷ⁽ˢ⁾ = 1[m̂(x) > s] = { 1 if m̂(x) > s ; 0 if m̂(x) ≤ s }
Define the confusion matrix N = [N_{u,v}]:
             Y = 0       Y = 1
Ŷₛ = 0       TNₛ         FNₛ        TNₛ + FNₛ
Ŷₛ = 1       FPₛ         TPₛ        FPₛ + TPₛ
             TNₛ + FPₛ   FNₛ + TPₛ  n
The ROC curve is
ROCₛ = ( FPₛ/(FPₛ + TNₛ), TPₛ/(TPₛ + FNₛ) ) with s ∈ (0, 1)
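One point of the ROC curve can be computed from the confusion matrix exactly as defined above; the toy labels and scores below are made up:

```python
import numpy as np

def roc_point(y, score, s):
    """One ROC point (FPR, TPR) = (FP/(FP+TN), TP/(TP+FN)) for threshold s."""
    y_hat = (score > s).astype(int)
    tp = np.sum((y_hat == 1) & (y == 1))
    fp = np.sum((y_hat == 1) & (y == 0))
    fn = np.sum((y_hat == 0) & (y == 1))
    tn = np.sum((y_hat == 0) & (y == 0))
    return fp / (fp + tn), tp / (tp + fn)

y = np.array([0, 0, 1, 1])
score = np.array([0.1, 0.4, 0.35, 0.8])

print(roc_point(y, score, s=0.5))  # (0.0, 0.5): no false positive, one of two positives caught
```

Sweeping s over (0, 1) and collecting these points traces out the whole ROC curve.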
Machine Learning Tools versus Econometric Techniques
Car seats sales data, used in James et al. (2014, An Introduction to Statistical Learning).
AUC:
logit          0.9544
bagging        0.8973
random forest  0.9050
boosting       0.9313
[ROC curves (sensitivity vs. specificity) for the four models; two marked points on the logit curve: threshold 0.285 at (specificity 0.860, sensitivity 0.927) and threshold 0.500 at (0.907, 0.860)]
see Section 5.1 in C. et al. (2018, Econométrie et Machine Learning)
Machine Learning Tools versus Econometric Techniques
Caravan sales data, used in James et al. (2014, An Introduction to Statistical Learning).
AUC:
logit          0.7372
bagging        0.7198
random forest  0.7154
boosting       0.7691
[ROC curves (sensitivity vs. specificity) for the four models]
see Section 5.2 in C. et al. (2018, Econométrie et Machine Learning)
Machine Learning Tools versus Econometric Techniques
Credit database, used in Nisbet et al. (2001, Handbook of Statistical Analysis and
Data Mining)
Conclusion
Breiman (2001, Statistical Modeling: The Two Cultures)
Econometrics and Machine Learning are focusing on the same problem, but with different cultures:
- the probabilistic interpretation of econometric models (unfortunately sometimes misleading, e.g. p-values) can deal with non-i.i.d. data (time series, panel, etc.)
- machine learning is about predictive modeling and generalization with algorithmic tools, based on bootstrap (sampling and sub-sampling), cross-validation, variable selection, nonlinearities, cross effects, etc.