Econometrics & “Machine Learning”*
A. Charpentier (Université de Rennes)
joint work with E. Flachaire (AMSE) & A. Ly (Université Paris-Est)
https://hal.archives-ouvertes.fr/hal-01568851/
Università degli studi dell’Insubria
Seminar, May 2018.
@freakonometrics, freakonometrics.hypotheses.org
Econometrics & “Machine Learning”
Varian (2014, Big Data: New Tricks for Econometrics)
Econometrics in a “Big Data” Context
Here data means $y \in \mathbb{R}^n$ and $X$ an $n \times p$ matrix.
Large $n$ means that "asymptotic" theorems can be invoked ($n \to \infty$).
Portnoy (1988, Asymptotic Behavior of Likelihood Methods for Exponential Families when the Number of Parameters Tends to Infinity) proved that the maximum likelihood estimator is asymptotically Gaussian as long as $p^2/n \to 0$.
There might be high-dimensional issues if $p > \sqrt{n}$.
Nevertheless, there might be sparsity issues in high dimension (see Hastie, Tibshirani & Wainwright (2015, Statistical Learning with Sparsity)): a sparse statistical model has only a small number of nonzero parameters or weights.
Solving the first-order condition $X^{\mathsf{T}}(X\beta - y) = 0$ is based on the QR decomposition (which can be computationally intensive).
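As an illustration, a minimal sketch (in Python, with simulated data; the variable names and sizes are mine, not from the talk) comparing the normal equations with the QR-based solution of the least-squares problem:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 1000, 5
    X = rng.normal(size=(n, p))
    beta = np.arange(1, p + 1, dtype=float)
    y = X @ beta + rng.normal(size=n)

    # Normal equations: solve X'X b = X'y (squares the condition number of X)
    beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

    # QR decomposition: X = QR, then solve R b = Q'y (numerically safer)
    Q, R = np.linalg.qr(X)
    beta_qr = np.linalg.solve(R, Q.T @ y)

    print(np.allclose(beta_ne, beta_qr))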
Econometrics in a “Big Data” Context
Arthur Hoerl, in 1982, looking back at Hoerl & Kennard (1970) and Ridge regression.
Back to the history of "regression"
Galton (1870, Hereditary Genius; 1886, Regression towards Mediocrity in Hereditary Stature) and Pearson & Lee (1896, On Telegony in Man; 1903, On the Laws of Inheritance in Man) studied the genetic transmission of characteristics, e.g. height.
On average, the child of tall parents is taller than other children, but shorter than his parents.
"I have called this peculiarity by the name of regression", Francis Galton, 1886.
Economics and Statistics
Working (1927)
What Do Statistical "Demand Curves" Show?
Tinbergen (1939)
Statistical Testing of Business Cycle Theories
Econometrics and the Probabilistic Model
Haavelmo (1944, The Probability Approach in Econometrics)
Data $(y_i, x_i)$ are seen as realizations of (i.i.d.) random variables $(Y, X)$ on some probability space $(\Omega, \mathcal{F}, \mathbb{P})$.
Mathematical Statistics
Vapnik (1998, Statistical Learning Theory)
Mathematical Statistics
Consider observations $\{y_1, \cdots, y_n\}$ from i.i.d. random variables $Y_i \sim F_\theta$ (with "density" $f_\theta$).
The likelihood is
$$\mathcal{L}(\theta; y) = \prod_{i=1}^n f_\theta(y_i)$$
and the maximum likelihood estimate is
$$\widehat{\theta}^{\,\text{mle}} \in \underset{\theta \in \Theta}{\text{argmax}}\ \{\mathcal{L}(\theta; y)\}.$$
Fisher (1912, On an Absolute Criterion for Fitting Frequency Curves)
Mathematical Statistics
Under standard assumptions (identification of the model, compactness, continuity and dominance), the maximum likelihood estimator is consistent, $\widehat{\theta}^{\,\text{mle}} \overset{\mathbb{P}}{\longrightarrow} \theta$. With additional assumptions, it can be shown that the maximum likelihood estimator converges to a normal distribution,
$$\sqrt{n}\,\big(\widehat{\theta}^{\,\text{mle}} - \theta\big) \overset{\mathcal{L}}{\longrightarrow} \mathcal{N}(0, I^{-1}),$$
where $I$ is the Fisher information matrix (i.e. $\widehat{\theta}^{\,\text{mle}}$ is asymptotically efficient).
E.g. if $Y \sim \mathcal{N}(\theta, \sigma^2)$, $\log \mathcal{L} \propto -\sum_{i=1}^n \left(\frac{y_i - \theta}{\sigma}\right)^2$, and $\widehat{\theta}^{\,\text{mle}} = \overline{y}$ (see also the method of moments).
$$\max\ \log \mathcal{L} \iff \min \sum_{i=1}^n \varepsilon_i^2 = \text{least squares}$$
Data and Conditional Distribution
Cook & Weisberg (1999, Applied Regression Including Computing and Graphics)
From available data, describe the conditional distribution of Y given X.
Conditional Distributions and Likelihood
We had a sample $\{y_1, \cdots, y_n\}$, with $Y$ from a $\mathcal{N}(\theta, \sigma^2)$ distribution.
The natural extension, if we have a sample $\{(y_1, x_1), \cdots, (y_n, x_n)\}$, is to assume that $Y\,|\,X = x$ has a $\mathcal{N}(\theta_x, \sigma^2)$ distribution.
The standard linear model is obtained when $\theta_x$ is linear, $\theta_x = \beta_0 + x^{\mathsf{T}}\beta$. Hence $\mathbb{E}[Y|X=x] = x^{\mathsf{T}}\beta$ and $\text{Var}[Y|X=x] = \sigma^2$ (homoskedasticity), if we center the variables or add the constant in $x$.
The MLE of $\beta$ is
$$\widehat{\beta}^{\,\text{mle}} = \underset{\beta}{\text{argmin}} \left\{\sum_{i=1}^n (y_i - x_i^{\mathsf{T}}\beta)^2\right\} = (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}y.$$
From a Gaussian to a Bernoulli (Logistic) Regression, and GLMs
Consider $y \in \{0,1\}$, so that $Y$ has a Bernoulli distribution. The logarithm of the odds is linear,
$$\log \frac{\mathbb{P}[Y=1|X=x]}{\mathbb{P}[Y=0|X=x]} = \beta_0 + x^{\mathsf{T}}\beta,$$
i.e.
$$\mathbb{E}[Y|X=x] = \frac{e^{\beta_0 + x^{\mathsf{T}}\beta}}{1 + e^{\beta_0 + x^{\mathsf{T}}\beta}} = H(\beta_0 + x^{\mathsf{T}}\beta), \quad\text{where } H(\cdot) = \frac{\exp(\cdot)}{1+\exp(\cdot)},$$
with likelihood
$$\mathcal{L}(\beta; y, X) = \prod_{i=1}^n \left(\frac{e^{x_i^{\mathsf{T}}\beta}}{1 + e^{x_i^{\mathsf{T}}\beta}}\right)^{y_i} \left(\frac{1}{1 + e^{x_i^{\mathsf{T}}\beta}}\right)^{1-y_i},$$
and set $\widehat{\beta}^{\,\text{mle}} = \underset{\beta \in \mathbb{R}^p}{\text{argmax}}\ \{\mathcal{L}(\beta; y, X)\}$ (solved numerically).
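A minimal sketch of this numerical maximization (Python with scipy, on simulated data; the names and the BFGS choice are mine):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    n = 500
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # constant included
    beta_true = np.array([-0.5, 1.0, -2.0])
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

    def neg_log_lik(beta):
        eta = X @ beta
        # log L = sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]
        return -(y @ eta - np.logaddexp(0, eta).sum())

    beta_mle = minimize(neg_log_lik, x0=np.zeros(3), method="BFGS").x
    print(beta_mle)   # should be close to beta_true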
A Brief Excursion to Nonparametric Econometrics
Two strategies are usually considered to estimate $m(x) = \mathbb{E}[Y|X=x]$:
• consider a local approximation (in the neighborhood of $x$), e.g. $k$-nearest neighbors, kernel regression, lowess;
• consider a functional decomposition of the function $m$ in some natural basis, e.g. splines.
For a kernel regression, as in Nadaraya (1964) and Watson (1964), consider some kernel function $K$ and some bandwidth $h$,
$$\widehat{m}_h(x) = s_x^{\mathsf{T}} y = \sum_{i=1}^n s_{x,i}\, y_i \quad\text{where}\quad s_{x,i} = \frac{K_h(x - x_i)}{K_h(x - x_1) + \cdots + K_h(x - x_n)}$$
(in a univariate problem).
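A short Python sketch of this estimator, with a Gaussian kernel and simulated data (the function and sample are mine, purely for illustration):

    import numpy as np

    def nadaraya_watson(x0, x, y, h):
        """Nadaraya-Watson estimate of E[Y|X=x0] with a Gaussian kernel."""
        w = np.exp(-0.5 * ((x0 - x) / h) ** 2)   # K_h(x0 - x_i), up to a constant
        return np.sum(w * y) / np.sum(w)

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 6, 200)
    y = np.sin(x) + rng.normal(scale=0.3, size=200)

    grid = np.linspace(0, 6, 61)
    m_hat = np.array([nadaraya_watson(x0, x, y, h=0.3) for x0 in grid])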
A Brief Excursion to Nonparametric Econometrics
Recall (see Simonoff (1996)) that, using asymptotic approximations,
$$\text{bias}[\widehat{m}_h(x)] = \mathbb{E}[\widehat{m}_h(x)] - m(x) \sim h^2 \left(\frac{C_1}{2}\, m''(x) + C_2\, m'(x)\, \frac{f'(x)}{f(x)}\right)$$
and
$$\text{Var}[\widehat{m}_h(x)] \sim \frac{C_3}{nh}\, \frac{\sigma^2(x)}{f(x)}$$
for some constants $C_1$, $C_2$ and $C_3$.
Set $\text{mse}[\widehat{m}_h(x)] = \text{bias}^2[\widehat{m}_h(x)] + \text{Var}[\widehat{m}_h(x)]$ and consider the integrated version
$$\text{mise}[\widehat{m}_h] = \int \text{mse}[\widehat{m}_h(x)]\, dF(x).$$
A Brief Excursion to Nonparametric Econometrics
$$\text{mise}[\widehat{m}_h] \sim \underbrace{\frac{h^4}{4} \left(\int x^2 k(x)\,dx\right)^2 \int \left(m''(x) + 2 m'(x)\,\frac{f'(x)}{f(x)}\right)^2 dx}_{\text{bias}^2} + \underbrace{\frac{\sigma^2}{nh} \int k^2(x)\,dx \cdot \int \frac{dx}{f(x)}}_{\text{variance}}.$$
Thus, the optimal value is $h^\star = O(n^{-1/5})$ (see Silverman's rule of thumb), obtained using asymptotic theory and approximations.
A Brief Excursion to Nonparametric Econometrics
Choice of the meta-parameter $h$: a bias / variance trade-off.
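One common way to pick $h$ in practice is leave-one-out cross-validation; a minimal sketch (reusing a Gaussian-kernel Nadaraya-Watson estimator on simulated data; the grid of bandwidths and sample sizes are arbitrary):

    import numpy as np

    def nw(x0, x, y, h):
        # Nadaraya-Watson estimate with a Gaussian kernel (as above)
        w = np.exp(-0.5 * ((x0 - x) / h) ** 2)
        return np.sum(w * y) / np.sum(w)

    def loo_cv(h, x, y):
        # leave-one-out cross-validated squared error for bandwidth h
        return np.mean([(y[i] - nw(x[i], np.delete(x, i), np.delete(y, i), h)) ** 2
                        for i in range(len(x))])

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 6, 150)
    y = np.sin(x) + rng.normal(scale=0.3, size=150)

    hs = np.linspace(0.1, 2.0, 20)
    h_star = hs[int(np.argmin([loo_cv(h, x, y) for h in hs]))]
    # small h: low bias, high variance; large h: high bias, low variance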
Model (and Variable) Choice
Suppose that the true model is $y_i = x_{1,i}\beta_1 + x_{2,i}\beta_2 + \varepsilon_i$, but we estimate the model on $x_1$ (only), $y_i = x_{1,i} b_1 + \eta_i$. Then
$$b_1 = (X_1^{\mathsf{T}}X_1)^{-1}X_1^{\mathsf{T}}Y = (X_1^{\mathsf{T}}X_1)^{-1}X_1^{\mathsf{T}}[X_1\beta_1 + X_2\beta_2 + \varepsilon] = \beta_1 + \underbrace{(X_1^{\mathsf{T}}X_1)^{-1}X_1^{\mathsf{T}}X_2\beta_2}_{\beta_{12}} + \underbrace{(X_1^{\mathsf{T}}X_1)^{-1}X_1^{\mathsf{T}}\varepsilon}_{\nu},$$
i.e. $\mathbb{E}(b_1) = \beta_1 + \beta_{12}$. If $X_1^{\mathsf{T}}X_2 = 0$ ($X_1 \perp X_2$), then $\mathbb{E}(b_1) = \beta_1$.
Conversely, assume that the true model is $y_i = x_{1,i}\beta_1 + \varepsilon_i$ but we estimate it with the (unnecessary) variables $x_2$, $y_i = x_{1,i} b_1 + x_{2,i} b_2 + \eta_i$. Here the estimation is unbiased, $\mathbb{E}(b_1) = \beta_1$, but the estimator is not efficient...
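A quick simulation of the omitted-variable bias (Python; the coefficients and the correlation between the regressors are chosen arbitrarily for the illustration):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 5000
    x2 = rng.normal(size=n)
    x1 = 0.7 * x2 + rng.normal(size=n)          # x1 and x2 are correlated
    y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

    # short regression (x1 only): biased, E(b1) = beta1 + beta12
    b_short = np.linalg.lstsq(x1.reshape(-1, 1), y, rcond=None)[0]

    # long regression (x1 and x2): unbiased
    b_long = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)[0]

    print(b_short)   # noticeably above 2, since x2 is omitted
    print(b_long)    # close to (2, 3)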
Model (and Variable) Choice, pluralitas non est ponenda sine necessitate
From the variance decomposition
$$\underbrace{\frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2}_{\text{total variance}} = \underbrace{\frac{1}{n}\sum_{i=1}^n (y_i - \widehat{y}_i)^2}_{\text{residual variance}} + \underbrace{\frac{1}{n}\sum_{i=1}^n (\widehat{y}_i - \bar{y})^2}_{\text{explained variance}},$$
define the $R^2$ as
$$R^2 = \frac{\sum_{i=1}^n (y_i - \bar{y})^2 - \sum_{i=1}^n (y_i - \widehat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$$
or, better, the adjusted $R^2$,
$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p} = R^2 - \underbrace{(1 - R^2)\,\frac{p-1}{n-p}}_{\text{penalty}}.$$
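A small helper computing both quantities (Python; the toy fit with np.polyfit is only there to exercise the function):

    import numpy as np

    def r_squared(y, y_hat, p):
        """R^2 and adjusted R^2 for a fit with p parameters (intercept included)."""
        n = len(y)
        ss_res = np.sum((y - y_hat) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        r2 = 1 - ss_res / ss_tot
        return r2, 1 - (1 - r2) * (n - 1) / (n - p)

    rng = np.random.default_rng(5)
    x = rng.normal(size=100)
    y = 1 + 2 * x + rng.normal(size=100)
    y_hat = np.polyval(np.polyfit(x, y, 1), x)
    print(r_squared(y, y_hat, p=2))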
Model (and Variable) Choice, pluralitas non est ponenda sine necessitate
One can also consider the log-likelihood, or some penalized version,
$$AIC = -\log \mathcal{L} + \underbrace{2 \cdot p}_{\text{penalty}} \quad\text{or}\quad BIC = -\log \mathcal{L} + \underbrace{\log(n) \cdot p}_{\text{penalty}}.$$
The objective is $\min\{-\log \mathcal{L}(\beta)\}$, and quality is measured via $-\log \mathcal{L}(\beta) + 2p$.
Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."
Model (and Variable) Choice, pluralitas non est ponenda sine necessitate
An alternative approach to penalization: consider linear predictors $m \in \mathcal{M}$,
$$\mathcal{M} = \big\{m : m(x) = s_h(x)^{\mathsf{T}} y\big\}, \quad\text{where } S = (s(x_1), \cdots, s(x_n))^{\mathsf{T}} \text{ is the smoothing matrix.}$$
Suppose that the true model is $y = m_0(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\text{Var}[\varepsilon] = \sigma^2 I$, so that $m_0(x) = \mathbb{E}[Y|X=x]$. The quadratic risk $\mathcal{R}(m) = \mathbb{E}\big[(Y - m(X))^2\big]$ is
$$\underbrace{\mathbb{E}\big[(Y - m_0(X))^2\big]}_{\text{error}} + \underbrace{\mathbb{E}\big[(m_0(X) - \mathbb{E}[m(X)])^2\big]}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\mathbb{E}[m(X)] - m(X))^2\big]}_{\text{variance}}.$$
Consider the empirical risk $\mathcal{R}_n(m) = \frac{1}{n}\sum_{i=1}^n (y_i - m(x_i))^2$; then one can prove that
$$\mathcal{R}(m) = \mathbb{E}\big[\mathcal{R}_n(m)\big] + \frac{2\sigma^2}{n}\,\text{trace}(S),$$
see Mallows' $C_p$, from Mallows (1973, Some Comments on Cp).
Introduction: “Machine Learning” ?
Friedman (1997, Data Mining and Statistics: What's the Connection?)
Introduction: “Machine Learning” ?
Machine learning (initially) did not need any probabilistic framework.
Consider observations $(y_i, x_i)$, a loss function $\ell$ and some class of models $\mathcal{M}$.
The goal is to solve
$$\widehat{m} \in \underset{m \in \mathcal{M}}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, m(x_i))\right\}.$$
Watt et al. (2016, Machine Learning Refined: Foundations, Algorithms and Applications)
Probabilistic Foundations and “Statistical Learning”
Vapnik (2000, The Nature of Statistical Learning Theory)
Probabilistic Foundations and “Statistical Learning”
Valiant (1984, A Theory of the Learnable)
Probabilistic Foundations and “Statistical Learning”
Consider a classification problem, $y \in \{-1, +1\}$. The "true" model is $m_0$ (the target), and consider some model $m$.
Let $\mathbb{P}$ denote the (unknown) distribution of the $X$'s. The error of $m$ can be written
$$R_{\mathbb{P},m_0}(m) = \mathbb{P}\big[m(X) \neq m_0(X)\big] = \mathbb{P}\big[\{x : m(x) \neq m_0(x)\}\big], \quad\text{where } X \sim \mathbb{P}.$$
Naturally, we can assume that the observations $x_i$ were drawn from $\mathbb{P}$. The empirical risk is
$$\widehat{R}(m) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\big(m(x_i) \neq y_i\big).$$
Since it is not possible to get a perfect model, we should seek a Probably Approximately Correct (PAC) model: given $\varepsilon$, we want to find $m^\star$ such that $R_{\mathbb{P},m_0}(m^\star) \le \varepsilon$ with probability $1 - \delta$.
More precisely, we want an algorithm that might lead us to such a candidate $m^\star$.
Probabilistic Foundations and “Statistical Learning”
Suppose that $\mathcal{M}$ contains a finite number of models. For any $\varepsilon$, $\delta$, $\mathbb{P}$ and $m_0$, if we have enough observations ($n \ge \varepsilon^{-1}\log[\delta^{-1}|\mathcal{M}|]$), and if
$$m^\star \in \underset{m \in \mathcal{M}}{\text{argmin}} \left\{\frac{1}{n}\sum_{i=1}^n \mathbf{1}\big(m(x_i) \neq y_i\big)\right\},$$
then, with probability higher than $1 - \delta$, $R_{\mathbb{P},m_0}(m^\star) \le \varepsilon$.
Here $n_{\mathcal{M}}(\varepsilon, \delta) = \varepsilon^{-1}\log[\delta^{-1}|\mathcal{M}|]$ is called the (sample) complexity, and $\mathcal{M}$ is PAC-learnable.
If $\mathcal{M}$ is not finite, the problem is more complicated: it is necessary to define a dimension $d$ of $\mathcal{M}$, the so-called VC-dimension, that will be a substitute for $|\mathcal{M}|$.
Penalization and Over/Under-Fit
The risk of $\widehat{m}$ cannot be estimated on the data used to fit $\widehat{m}$.
Alternatives are cross-validation and the bootstrap, as sketched below.
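A minimal illustration of both ideas (Python with scikit-learn, on simulated data; the model and sample sizes are arbitrary):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(7)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=200)

    # 5-fold cross-validation: the risk is estimated on data not used for fitting
    cv_errors = []
    for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        fit = LinearRegression().fit(X[train], y[train])
        cv_errors.append(mean_squared_error(y[test], fit.predict(X[test])))

    # bootstrap: refit on resampled data, evaluate on the out-of-bag observations
    boot_errors = []
    for _ in range(200):
        idx = rng.integers(0, len(y), len(y))
        oob = np.setdiff1d(np.arange(len(y)), idx)
        fit = LinearRegression().fit(X[idx], y[idx])
        boot_errors.append(mean_squared_error(y[oob], fit.predict(X[oob])))

    print(np.mean(cv_errors), np.mean(boot_errors))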
The objective function and loss functions
Consider a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ and set $\widehat{m} \in \underset{m \in \mathcal{M}}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, m(x_i))\right\}$.
A classical loss function is the quadratic one, $\ell_2(u,v) = (u-v)^2$. Recall that
$$\bar{y} = \underset{m \in \mathbb{R}}{\text{argmin}} \left\{\sum_{i=1}^n \ell_2(y_i, m)\right\}, \quad\text{or (for the continuous version)}\quad \mathbb{E}(Y) = \underset{m \in \mathbb{R}}{\text{argmin}} \big\{\|Y - m\|_2^2\big\} = \underset{m \in \mathbb{R}}{\text{argmin}} \big\{\mathbb{E}\big[\ell_2(Y, m)\big]\big\}.$$
One can also consider the least-absolute-value loss, $\ell_1(u,v) = |u-v|$. Recall that $\text{median}[y] = \underset{m \in \mathbb{R}}{\text{argmin}} \left\{\sum_{i=1}^n \ell_1(y_i, m)\right\}$. The optimisation problem
$$\widehat{m} = \underset{m \in \mathcal{M}}{\text{argmin}} \left\{\sum_{i=1}^n |y_i - m(x_i)|\right\}$$
is well known in robust econometrics.
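A quick numerical check that the quadratic loss points to the mean and the absolute loss to the median (Python, on a skewed simulated sample):

    import numpy as np

    rng = np.random.default_rng(8)
    y = rng.exponential(scale=2.0, size=1000)     # a skewed sample

    grid = np.linspace(0, 10, 2001)
    l2_risk = [np.mean((y - m) ** 2) for m in grid]      # quadratic loss
    l1_risk = [np.mean(np.abs(y - m)) for m in grid]     # absolute loss

    print(grid[int(np.argmin(l2_risk))], y.mean())       # ~ the sample mean
    print(grid[int(np.argmin(l1_risk))], np.median(y))   # ~ the sample median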
The objective function and loss functions
More generally, since $\ell_1(y, m) = |(y - m)(1/2 - \mathbf{1}_{y \le m})|$, this can be generalized to any probability level $\tau \in (0,1)$:
$$\widehat{m}_\tau = \underset{m \in \mathcal{M}_0}{\text{argmin}} \left\{\sum_{i=1}^n \ell_\tau^{q}(y_i, m(x_i))\right\} \quad\text{with}\quad \ell_\tau^{q}(x, y) = (x - y)(\tau - \mathbf{1}_{x \le y})$$
is the quantile regression, see Koenker (2005, Quantile Regression). One can also consider expectiles, with loss function $\ell_\tau^{e}(x, y) = (x - y)^2 \cdot |\tau - \mathbf{1}_{x \le y}|$, see Aigner et al. (1977, Formulation and Estimation of Stochastic Frontier Production Function Models).
Gneiting (2011, Making and Evaluating Point Forecasts) introduced the concept of elicitable statistics: $T$ is elicitable if there exists $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ such that
$$T(Y) = \underset{x \in \mathbb{R}}{\text{argmin}} \left\{\int_{\mathbb{R}} \ell(x, y)\, dF(y)\right\} = \underset{x \in \mathbb{R}}{\text{argmin}} \big\{\mathbb{E}\big[\ell(x, Y)\big]\big\} \quad\text{where } Y \overset{\mathcal{L}}{\sim} F.$$
Thus the mean, the median and any quantile are elicitable.
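A sketch of a linear quantile fit obtained by minimizing the pinball loss directly (Python with scipy; the data, the level $\tau = 0.9$ and the Nelder-Mead choice are mine):

    import numpy as np
    from scipy.optimize import minimize

    def pinball(tau, y, m):
        # quantile ("pinball") loss, averaged over the sample
        u = y - m
        return np.mean(u * (tau - (u < 0)))

    rng = np.random.default_rng(9)
    x = rng.uniform(0, 1, 500)
    y = 1 + 2 * x + rng.gamma(2.0, 1.0, 500)      # asymmetric noise

    tau = 0.9
    beta_tau = minimize(lambda b: pinball(tau, y, b[0] + b[1] * x),
                        x0=np.zeros(2), method="Nelder-Mead").x
    print(beta_tau)   # a linear fit of the conditional 90% quantile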
Weak and iterative learning (Gradient Boosting)
We want to solve $\widehat{m} \in \underset{m \in \mathcal{M}}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, m(x_i))\right\}$ on a set of possible predictors $\mathcal{M}$.
In a classical optimization problem, $x^\star \in \underset{x \in \mathbb{R}^d}{\text{argmin}} \{f(x)\}$, we consider a gradient descent to approximate $x^\star$,
$$\underbrace{f(x_k)}_{\langle f, x_k \rangle} \sim \underbrace{f(x_{k-1})}_{\langle f, x_{k-1} \rangle} + \underbrace{(x_k - x_{k-1})}_{\alpha_k}\, \underbrace{\nabla f(x_{k-1})}_{\langle \nabla f, x_{k-1} \rangle}.$$
Here, it is some sort of dual problem, $m^\star \in \underset{m \in \mathcal{M}}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, m(x_i))\right\}$, and a functional gradient descent can be considered,
$$\underbrace{m_k(x)}_{\langle m_k, x \rangle} \sim \underbrace{m_{k-1}(x)}_{\langle m_{k-1}, x \rangle} + \underbrace{(m_k - m_{k-1})}_{\beta_k}\, \underbrace{\nabla m_{k-1}(x)}_{\langle \nabla m_{k-1}, x \rangle}.$$
Weak and iterative learning (Gradient Boosting)
$$m^{(k)} = m^{(k-1)} + \underset{h \in \mathcal{H}}{\text{argmin}} \left\{\sum_{i=1}^n \ell\big(\underbrace{y_i - m^{(k-1)}(x_i)}_{\varepsilon_{k,i}},\, h(x_i)\big)\right\}$$
where
$$m^{(k)}(\cdot) = \underbrace{m_1(\cdot)}_{\sim y} + \underbrace{m_2(\cdot)}_{\sim \varepsilon_1} + \underbrace{m_3(\cdot)}_{\sim \varepsilon_2} + \cdots + \underbrace{m_k(\cdot)}_{\sim \varepsilon_{k-1}} = m^{(k-1)}(\cdot) + m_k(\cdot).$$
Equivalently, start with $m^{(0)} = \underset{m \in \mathbb{R}}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, m)\right\}$, and solve iteratively
$$m^{(k)} = m^{(k-1)} + \underset{h \in \mathcal{H}}{\text{argmin}} \left\{\sum_{i=1}^n \ell\big(y_i,\, m^{(k-1)}(x_i) + h(x_i)\big)\right\}.$$
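A minimal boosting sketch with the quadratic loss, where each weak learner is a regression stump fitted to the current residuals (Python with scikit-learn, simulated data; the shrinkage parameter nu is an addition of mine, not on the slide):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(10)
    x = rng.uniform(0, 6, 300).reshape(-1, 1)
    y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=300)

    # m^(0): the constant minimizing the quadratic loss, i.e. the mean
    prediction = np.full_like(y, y.mean())
    learners, nu = [], 0.1               # nu = shrinkage (not on the slide)

    for k in range(100):
        residuals = y - prediction                    # epsilon_{k,i}
        stump = DecisionTreeRegressor(max_depth=1).fit(x, residuals)
        learners.append(stump)
        prediction += nu * stump.predict(x)           # m^(k) = m^(k-1) + nu * m_k

    def m_hat(x_new):
        return y.mean() + nu * sum(t.predict(x_new) for t in learners)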
Weak and iterative learning (Gradient Boosting)
If $\mathcal{H}$ is a set of differentiable functions,
$$m^{(k)} = m^{(k-1)} - \gamma_k \sum_{i=1}^n \nabla_{m^{(k-1)}}\,\ell\big(y_i, m^{(k-1)}(x_i)\big),$$
where
$$\gamma_k = \underset{\gamma}{\text{argmin}} \left\{\sum_{i=1}^n \ell\Big(y_i,\, m^{(k-1)}(x_i) - \gamma\, \nabla_{m^{(k-1)}}\,\ell\big(y_i, m^{(k-1)}(x_i)\big)\Big)\right\}.$$
[Figure: two scatter plots of the same simulated sample (x from 0 to 6, y from -1.5 to 1.5), accompanying the gradient-boosting slides.]
Penalization and Over/Under-Fit
Consider some penalized objective function: given $\lambda \ge 0$,
$$(\widehat{\beta}_{0,\lambda}, \widehat{\beta}_\lambda) = \underset{\beta_0, \beta}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, \beta_0 + x^{\mathsf{T}}\beta) + \lambda\, \|\beta\|\right\}$$
with centered and scaled variables, for some penalty $\|\beta\|$.
Note that there is some Bayesian interpretation,
$$\underbrace{\mathbb{P}[\theta|y]}_{\text{a posteriori}} \propto \underbrace{\mathbb{P}[y|\theta]}_{\text{likelihood}} \cdot \underbrace{\mathbb{P}[\theta]}_{\text{a priori}}, \quad\text{i.e.}\quad \log \mathbb{P}[\theta|y] = \underbrace{\log \mathbb{P}[y|\theta]}_{\text{log-likelihood}} + \underbrace{\log \mathbb{P}[\theta]}_{\text{penalization}}.$$
Going further: $\ell_0$, $\ell_1$ and $\ell_2$ penalties
Define
$$\|a\|_{\ell_0} = \sum_{i=1}^d \mathbf{1}(a_i \neq 0), \quad \|a\|_{\ell_1} = \sum_{i=1}^d |a_i| \quad\text{and}\quad \|a\|_{\ell_2} = \left(\sum_{i=1}^d a_i^2\right)^{1/2}, \quad\text{for } a \in \mathbb{R}^d.$$
For each norm, the constrained and the penalized formulations of the optimization problem are
($\ell_0$): $\underset{\beta;\, \|\beta\|_{\ell_0} \le s}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, \beta_0 + x^{\mathsf{T}}\beta)\right\}$ and $\underset{\beta, \lambda}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, \beta_0 + x^{\mathsf{T}}\beta) + \lambda\|\beta\|_{\ell_0}\right\}$
($\ell_1$): $\underset{\beta;\, \|\beta\|_{\ell_1} \le s}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, \beta_0 + x^{\mathsf{T}}\beta)\right\}$ and $\underset{\beta, \lambda}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, \beta_0 + x^{\mathsf{T}}\beta) + \lambda\|\beta\|_{\ell_1}\right\}$
($\ell_2$): $\underset{\beta;\, \|\beta\|_{\ell_2} \le s}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, \beta_0 + x^{\mathsf{T}}\beta)\right\}$ and $\underset{\beta, \lambda}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, \beta_0 + x^{\mathsf{T}}\beta) + \lambda\|\beta\|_{\ell_2}\right\}$
Assume that $\ell$ is the quadratic norm.
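A quick illustration of the penalized versions with scikit-learn (simulated sparse data; note that scikit-learn calls the penalty weight $\lambda$ "alpha", and the alpha values below are arbitrary):

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge, Lasso

    rng = np.random.default_rng(11)
    n, p = 200, 10
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[:3] = [3.0, -2.0, 1.5]                  # a sparse "true" beta
    y = X @ beta + rng.normal(size=n)

    ols = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=10.0).fit(X, y)    # l2 penalty: shrinks all coefficients
    lasso = Lasso(alpha=0.2).fit(X, y)     # l1 penalty: sets some coefficients to 0

    print(np.round(ols.coef_, 2))
    print(np.round(ridge.coef_, 2))
    print(np.round(lasso.coef_, 2), "non-zero:", int(np.sum(lasso.coef_ != 0)))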
Going further: $\ell_0$, $\ell_1$ and $\ell_2$ penalties
The two problems ($\ell_2$) are equivalent: for any solution $(\beta^\star, s^\star)$ of the constrained problem, there exists $\lambda^\star$ such that $(\beta^\star, \lambda^\star)$ is a solution of the penalized problem, and conversely.
The two problems ($\ell_1$) are equivalent: for any solution $(\beta^\star, s^\star)$ of the constrained problem, there exists $\lambda^\star$ such that $(\beta^\star, \lambda^\star)$ is a solution of the penalized problem, and conversely. Nevertheless, even if there is a theoretical equivalence, there might be numerical issues, since the solution is not necessarily unique.
The two problems ($\ell_0$) are not equivalent: if $(\beta^\star, \lambda^\star)$ is a solution of the penalized problem, there exists $s^\star$ such that $\beta^\star$ is a solution of the constrained problem. But the converse is not true.
More generally, consider an $\ell_p$ norm:
• sparsity is obtained when $p \le 1$
• convexity is obtained when $p \ge 1$
Using the $\ell_0$ penalty
Foster & George (1994, The Risk Inflation Criterion for Multiple Regression) tried to solve directly the penalized problem ($\ell_0$).
But it is a complex combinatorial problem in high dimension (Natarajan (1995, Sparse Approximate Solutions to Linear Systems) proved that it is an NP-hard problem).
One can prove that if $\lambda \sim \sigma^2 \log(p)$, then
$$\mathbb{E}\big[(x^{\mathsf{T}}\widehat{\beta} - x^{\mathsf{T}}\beta_0)^2\big] \le \underbrace{\mathbb{E}\big[(x_{\mathcal{S}}^{\mathsf{T}}\widehat{\beta}_{\mathcal{S}} - x^{\mathsf{T}}\beta_0)^2\big]}_{= \sigma^2 \#\mathcal{S}} \cdot \big(4\log p + 2 + o(1)\big).$$
In that case
$$\widehat{\beta}_{\lambda,j}^{\,\text{sub}} = \begin{cases} 0 & \text{if } j \notin \mathcal{S}_\lambda(\widehat{\beta}) \\ \widehat{\beta}_j^{\,\text{ols}} & \text{if } j \in \mathcal{S}_\lambda(\widehat{\beta}), \end{cases}$$
where $\mathcal{S}_\lambda(\widehat{\beta})$ is the set of non-null coordinates of the solutions of ($\ell_0$).
Using the $\ell_2$ penalty
With the $\ell_2$ norm, ($\ell_2$) is the Ridge regression,
$$\widehat{\beta}_\lambda^{\,\text{ridge}} = (X^{\mathsf{T}}X + \lambda I)^{-1} X^{\mathsf{T}} y = (X^{\mathsf{T}}X + \lambda I)^{-1} (X^{\mathsf{T}}X)\, \widehat{\beta}^{\,\text{ols}},$$
see Tikhonov (1943, On the Stability of Inverse Problems). Hence
$$\text{bias}[\widehat{\beta}_\lambda^{\,\text{ridge}}] = -\lambda\,[X^{\mathsf{T}}X + \lambda I]^{-1}\, \widehat{\beta}^{\,\text{ols}}, \qquad \text{Var}[\widehat{\beta}_\lambda^{\,\text{ridge}}] = \sigma^2\,[X^{\mathsf{T}}X + \lambda I]^{-1} X^{\mathsf{T}}X\, [X^{\mathsf{T}}X + \lambda I]^{-1}.$$
With orthogonal variables (i.e. $X^{\mathsf{T}}X = I$), we get
$$\text{bias}[\widehat{\beta}_\lambda^{\,\text{ridge}}] = -\frac{\lambda}{1+\lambda}\, \widehat{\beta}^{\,\text{ols}} \quad\text{and}\quad \text{Var}[\widehat{\beta}_\lambda^{\,\text{ridge}}] = \frac{\sigma^2}{(1+\lambda)^2}\, I = \frac{\text{Var}[\widehat{\beta}^{\,\text{ols}}]}{(1+\lambda)^2}.$$
Note that $\text{Var}[\widehat{\beta}_\lambda^{\,\text{ridge}}] < \text{Var}[\widehat{\beta}^{\,\text{ols}}]$. Further, $\text{mse}[\widehat{\beta}_\lambda^{\,\text{ridge}}]$ is minimal when $\lambda = p\sigma^2 / \beta^{\mathsf{T}}\beta$.
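A numerical check of the orthogonal case (Python; orthonormal columns are obtained via a QR decomposition of a random matrix, so $X^{\mathsf{T}}X = I$ by construction):

    import numpy as np

    rng = np.random.default_rng(12)
    n, p, lam = 500, 4, 5.0
    X, _ = np.linalg.qr(rng.normal(size=(n, p)))   # orthonormal columns: X'X = I
    beta = np.array([1.0, -2.0, 3.0, 0.5])
    y = X @ beta + rng.normal(size=n)

    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    print(np.allclose(beta_ridge, beta_ols / (1 + lam)))   # True: pure shrinkage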
Using the $\ell_1$ penalty
If the penalty is no longer the quadratic norm but $\ell_1$, problem ($\ell_1$) is not always strictly convex, and the optimum is not always unique (e.g. if $X^{\mathsf{T}}X$ is singular).
But in the quadratic-loss case, the objective is strictly convex in $X\beta$, and at least the fit $X\widehat{\beta}$ is unique.
Further, note that solutions are necessarily coherent (in the signs of the coefficients): it is not possible to have $\widehat{\beta}_j < 0$ for one solution and $\widehat{\beta}_j > 0$ for another one.
In many cases, problem ($\ell_1$) yields a corner-type solution, which can be seen as a "best subset" solution, like in ($\ell_0$).
Using the $\ell_1$ penalty
Consider a simple regression $y_i = x_i\beta + \varepsilon_i$, with an $\ell_1$ penalty and an $\ell_2$ loss function. ($\ell_1$) becomes
$$\min\big\{ y^{\mathsf{T}}y - 2 y^{\mathsf{T}}x\beta + \beta x^{\mathsf{T}}x \beta + 2\lambda|\beta| \big\}.$$
The first-order condition can be written $-2 y^{\mathsf{T}}x + 2 x^{\mathsf{T}}x\beta \pm 2\lambda = 0$ (the sign in $\pm$ being the sign of $\beta$). Assume that $y^{\mathsf{T}}x > 0$; then the solution is
$$\widehat{\beta}_\lambda^{\,\text{lasso}} = \max\left\{\frac{y^{\mathsf{T}}x - \lambda}{x^{\mathsf{T}}x},\, 0\right\}$$
(we get a corner solution when $\lambda$ is large).
In higher dimension, see Tibshirani & Wasserman (2016, A Closer Look at Sparse Regression) or Candès & Plan (2009, Near-Ideal Model Selection by $\ell_1$ Minimization). Under some additional technical assumptions, the lasso estimator is "sparsistent", in the sense that the support of $\widehat{\beta}_\lambda^{\,\text{lasso}}$ is the same as that of $\beta$.
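The one-dimensional solution above is a soft-thresholding rule; a small check (Python, simulated data, with the sign handled explicitly so that the formula also covers $y^{\mathsf{T}}x < 0$):

    import numpy as np

    rng = np.random.default_rng(13)
    n = 200
    x = rng.normal(size=n)
    y = 1.5 * x + rng.normal(size=n)

    def lasso_1d(x, y, lam):
        # closed form of  min  y'y - 2 y'x b + b x'x b + 2 lam |b|
        yx = y @ x
        return np.sign(yx) * max((abs(yx) - lam) / (x @ x), 0.0)

    for lam in (0.0, 50.0, 200.0, 1000.0):
        print(lam, round(lasso_1d(x, y, lam), 3))   # shrinks to 0 as lam grows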
Going further: $\ell_0$, $\ell_1$ and $\ell_2$ penalties
Thus, the lasso can be used for variable selection (see Hastie et al. (2001, The Elements of Statistical Learning)).
With orthonormal covariates, one can prove that
$$\widehat{\beta}_{\lambda,j}^{\,\text{sub}} = \widehat{\beta}_j^{\,\text{ols}}\, \mathbf{1}_{|\widehat{\beta}_{\lambda,j}^{\,\text{sub}}| > b}, \qquad \widehat{\beta}_{\lambda,j}^{\,\text{ridge}} = \frac{\widehat{\beta}_j^{\,\text{ols}}}{1+\lambda} \qquad\text{and}\qquad \widehat{\beta}_{\lambda,j}^{\,\text{lasso}} = \text{sign}\big[\widehat{\beta}_j^{\,\text{ols}}\big] \cdot \big(|\widehat{\beta}_j^{\,\text{ols}}| - \lambda\big)_+.$$
Econometric Modeling
Data $\{(y_i, x_i)\}$, for $i = 1, \cdots, n$, with $x_i \in \mathcal{X} \subset \mathbb{R}^p$ and $y_i \in \mathcal{Y}$.
A model is a mapping $m : \mathcal{X} \to \mathcal{Y}$
- regression, $\mathcal{Y} = \mathbb{R}$ (but also $\mathcal{Y} = \mathbb{N}$)
- classification, $\mathcal{Y} = \{0,1\}$, $\{-1,+1\}$, $\{\bullet, \bullet\}$ (binary, or more)
Classification models are based on two steps:
• a score function, $s(x) = \mathbb{P}(Y = 1 | X = x) \in [0,1]$
• a classifier $s(x) \to \widehat{y} \in \{0,1\}$.
Classification Trees
To split a node $\{N\}$ into two nodes $\{N_L, N_R\}$, consider an impurity criterion
$$I(N_L, N_R) = \sum_{x \in \{L, R\}} \frac{n_x}{n}\, I(N_x),$$
e.g. the Gini index (used originally in CART, see Breiman et al. (1984, Classification and Regression Trees)),
$$\text{gini}(N_L, N_R) = -\sum_{x \in \{L, R\}} \frac{n_x}{n} \sum_{y \in \{0,1\}} \frac{n_{x,y}}{n_x}\left(1 - \frac{n_{x,y}}{n_x}\right),$$
and the cross-entropy (used in C4.5 and C5.0),
$$\text{entropy}(N_L, N_R) = -\sum_{x \in \{L, R\}} \frac{n_x}{n} \sum_{y \in \{0,1\}} \frac{n_{x,y}}{n_x} \log \frac{n_{x,y}}{n_x}.$$
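A direct transcription of the two split criteria (Python; the class counts in the example split are made up, and the sign convention follows the formulas above):

    import numpy as np

    def node_props(counts):
        counts = np.asarray(counts, dtype=float)
        return counts.sum(), counts / counts.sum()

    def gini_split(left_counts, right_counts):
        # gini(N_L, N_R) = - sum_x (n_x/n) sum_y p_{x,y} (1 - p_{x,y})
        (nL, pL), (nR, pR) = node_props(left_counts), node_props(right_counts)
        n = nL + nR
        return -sum(nx / n * np.sum(p * (1 - p)) for nx, p in [(nL, pL), (nR, pR)])

    def entropy_split(left_counts, right_counts):
        # cross-entropy version, with the convention 0 * log(0) = 0
        (nL, pL), (nR, pR) = node_props(left_counts), node_props(right_counts)
        n = nL + nR
        return -sum(nx / n * np.sum(p[p > 0] * np.log(p[p > 0]))
                    for nx, p in [(nL, pL), (nR, pR)])

    # class counts {0, 1} in the two candidate child nodes
    print(gini_split([40, 10], [5, 45]), entropy_split([40, 10], [5, 45]))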
Model Selection & ROC Curves
Given a scoring function $m(\cdot)$, with $m(x) = \mathbb{E}[Y|X=x]$, and a threshold $s \in (0,1)$, set
$$\widehat{Y}^{(s)} = \mathbf{1}[m(x) > s] = \begin{cases} 1 & \text{if } m(x) > s \\ 0 & \text{if } m(x) \le s \end{cases}$$
Define the confusion matrix as $N = [N_{u,v}]$:

                 Y = 0       Y = 1
    Ŷs = 0       TNs         FNs         TNs + FNs
    Ŷs = 1       FPs         TPs         FPs + TPs
                 TNs + FPs   FNs + TPs   n

The ROC curve is
$$\text{ROC}_s = \left(\frac{\text{FP}_s}{\text{FP}_s + \text{TN}_s},\; \frac{\text{TP}_s}{\text{TP}_s + \text{FN}_s}\right) \quad\text{with } s \in (0,1).$$
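A small sketch computing ROC points by sweeping the threshold (Python, on a simulated score; the AUC approximation via the trapezoidal rule is mine and ignores the two end points of the curve):

    import numpy as np

    def roc_points(y, score, thresholds):
        """(false positive rate, true positive rate) for each threshold s."""
        pts = []
        for s in thresholds:
            y_hat = (score > s).astype(int)
            tp = np.sum((y_hat == 1) & (y == 1))
            fp = np.sum((y_hat == 1) & (y == 0))
            tn = np.sum((y_hat == 0) & (y == 0))
            fn = np.sum((y_hat == 0) & (y == 1))
            pts.append((fp / (fp + tn), tp / (tp + fn)))
        return np.array(pts)

    rng = np.random.default_rng(14)
    y = rng.binomial(1, 0.4, 1000)
    score = np.clip(0.4 * y + rng.normal(0.3, 0.2, 1000), 0, 1)  # imperfect score

    curve = roc_points(y, score, np.linspace(0.01, 0.99, 99))
    auc = np.trapz(curve[::-1, 1], curve[::-1, 0])
    print(round(auc, 3))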
Machine Learning Tools versus Econometric Techniques
Car seats sales data, used in James et al. (2014, An Introduction to Statistical Learning).

    model           AUC
    logit           0.9544
    bagging         0.8973
    random forest   0.9050
    boosting        0.9313

[Figure: ROC curves (sensitivity against specificity) for the four models, with two thresholds highlighted, 0.285 at (0.860, 0.927) and 0.500 at (0.907, 0.860).]

see Section 5.1 in C. et al. (2018, Econométrie et Machine Learning)
Machine Learning Tools versus Econometric Techniques
Caravan sales data, used in James et al. (2014, An Introduction to Statistical Learning).

    model           AUC
    logit           0.7372
    bagging         0.7198
    random forest   0.7154
    boosting        0.7691

[Figure: ROC curves (sensitivity against specificity) for the four models.]

see Section 5.2 in C. et al. (2018, Econométrie et Machine Learning)
Machine Learning Tools versus Econometric Techniques
Credit database, used in Nisbet et al. (2001, Handbook of Statistical Analysis and Data Mining)
Conclusion
Breiman (2001, Statistical Modeling: The Two Cultures)
Econometrics and Machine Learning are focusing on the same problem, but with different cultures:
- the probabilistic interpretation of econometric models (unfortunately sometimes misleading, e.g. the p-value) can deal with non-i.i.d. data (time series, panel, etc.)
- machine learning is about predictive modeling and generalization, with algorithmic tools based on the bootstrap (sampling and sub-sampling), cross-validation, variable selection, nonlinearities, cross effects, etc.