Econometrics & “Machine Learning”*
A. Charpentier (Université de Rennes)
joint work with E. Flachaire (AMSE) & A. Ly (Université Paris-Est)
https://hal.archives-ouvertes.fr/hal-01568851/
Università degli studi dell’Insubria
Seminar, May 2018.
Econometrics & “Machine Learning”
Varian (2014, Big Data: New Tricks for Econometrics)
Econometrics in a “Big Data” Context
Here data means y ∈ Rⁿ and X an n × p matrix.
n large means “asymptotic” theorems can be invoked (n → ∞).
Portnoy (1988, Asymptotic Behavior of Likelihood Methods for Exponential Families when the Number of Parameters Tends to Infinity) proved that the MLE is asymptotically Gaussian as long as p²/n → 0.
There might be high-dimensional issues if p > √n.
Nevertheless, there might be sparsity issues in high dimension (see Hastie, Tibshirani & Wainwright (2015, Statistical Learning with Sparsity)): a sparse statistical model has only a small number of nonzero parameters or weights.
The first-order condition Xᵀ(Xβ − y) = 0 is solved using a QR decomposition (which can be computationally intensive).
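As a minimal illustration of this first-order condition (a sketch on arbitrary simulated data, not any particular application), one can compare the explicit normal-equation solution with a QR-based solve, the approach used by most statistical software:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 5
X = rng.normal(size=(n, p))
y = X @ np.arange(1, p + 1) + rng.normal(size=n)

# Normal equations: solve X'X b = X'y (cheap, but fragile if X'X is ill-conditioned)
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# QR decomposition: X = QR, then solve R b = Q'y (numerically more stable)
Q, R = np.linalg.qr(X)
b_qr = np.linalg.solve(R, Q.T @ y)

print(np.allclose(b_normal, b_qr))  # both satisfy X'(Xb - y) = 0
```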
Econometrics in a “Big Data” Context
Arthur Hoerl in 1982, looking back at Hoerl & Kennard (1970) and Ridge regression.
A look back at the history of “regression”
Galton (1870, Hereditary Genius; 1886, Regression towards mediocrity in hereditary stature) and Pearson & Lee (1896, On Telegony in Man; 1903, On the Laws of Inheritance in Man) studied the genetic transmission of characteristics, e.g. height.
On average, the child of tall parents is taller than other children, but shorter than his parents.
“I have called this peculiarity by the name of regression”, Francis Galton, 1886.
Economics and Statistics
Working (1925)
What Do Statistical "Demand Curves" Show?
Tinbergen (1939)
Statistical Testing of Business Cycle Theories
Econometrics and the Probabilistic Model
Haavelmo (1944, The Probability Approach in Econometrics)
Data (yᵢ, xᵢ) are seen as realizations of (iid) random variables (Y, X) on some probability space (Ω, F, P).
Mathematical Statistics
Vapnik (1998, Statistical Learning Theory)
Mathematical Statistics
Consider observations {y₁, · · · , yₙ} from iid random variables Yᵢ ∼ F_θ (with “density” f_θ). The likelihood is
$$L(\theta; y) = \prod_{i=1}^n f_\theta(y_i)$$
and the maximum likelihood estimate is
$$\hat\theta^{\text{mle}} \in \underset{\theta\in\Theta}{\arg\max}\,\{L(\theta; y)\}$$
Fisher (1912, On an absolute criterion for fitting frequency curves)
Mathematical Statistics
Under standard assumptions (identification of the model, compactness, continuity and dominance), the maximum likelihood estimator is consistent, $\hat\theta^{\text{mle}} \overset{P}{\longrightarrow} \theta$. With additional assumptions, it can be shown that the maximum likelihood estimator converges to a normal distribution,
$$\sqrt{n}\,\big(\hat\theta^{\text{mle}} - \theta\big) \overset{L}{\longrightarrow} N(0, I^{-1})$$
where I is the Fisher information matrix (i.e. $\hat\theta^{\text{mle}}$ is asymptotically efficient).
E.g. if Y ∼ N(θ, σ²), $\log L \propto -\sum_{i=1}^n \Big(\frac{y_i - \theta}{\sigma}\Big)^2$, and $\hat\theta^{\text{mle}} = \bar y$ (see also the method of moments).
$$\max\,\log L \iff \min\,\sum_{i=1}^n \varepsilon_i^2 = \text{least squares}$$
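A quick numerical check of the Gaussian example (a sketch on simulated data, with σ treated as known): maximizing the log-likelihood is the same as minimizing the sum of squares, and both return the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
y = rng.normal(loc=2.5, scale=1.0, size=500)

# negative Gaussian log-likelihood in theta (sigma fixed), up to an additive constant
neg_loglik = lambda theta: 0.5 * np.sum((y - theta) ** 2)

theta_mle = minimize_scalar(neg_loglik).x
print(theta_mle, y.mean())  # both close to 2.5: max log L <=> least squares
```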
Data and Conditional Distribution
Cook & Weisberg (1999, Applied Regression Including Computing and Graphics)
From available data, describe the conditional distribution of Y given X.
Conditional Distributions and Likelihood
We had a sample {y₁, · · · , yₙ}, with Y from a N(θ, σ²) distribution.
The natural extension, if we have a sample {(y₁, x₁), · · · , (yₙ, xₙ)}, is to assume that Y|X = x has a N(θₓ, σ²) distribution.
The standard linear model is obtained when θₓ is linear, θₓ = β₀ + xᵀβ. Hence E[Y|X = x] = xᵀβ and Var[Y|X = x] = σ² (homoskedasticity)
(if we center the variables, or if we add the constant in x).
The MLE of β is
$$\hat\beta^{\text{mle}} = \underset{\beta}{\operatorname{argmin}}\Big\{\sum_{i=1}^n (y_i - x_i^\top\beta)^2\Big\} = (X^\top X)^{-1} X^\top y.$$
From a Gaussian to a Bernoulli (Logistic) Regression, and GLMs
Consider y ∈ {0, 1}, so that Y has a Bernoulli distribution. The logarithm of the odds is linear,
$$\log\frac{P[Y = 1|X = x]}{P[Y = 0|X = x]} = \beta_0 + x^\top\beta,$$
i.e.
$$E[Y|X = x] = \frac{e^{\beta_0 + x^\top\beta}}{1 + e^{\beta_0 + x^\top\beta}} = H(\beta_0 + x^\top\beta), \quad\text{where } H(\cdot) = \frac{\exp(\cdot)}{1 + \exp(\cdot)},$$
with likelihood
$$L(\beta; y, X) = \prod_{i=1}^n \left(\frac{e^{x_i^\top\beta}}{1 + e^{x_i^\top\beta}}\right)^{y_i}\left(\frac{1}{1 + e^{x_i^\top\beta}}\right)^{1 - y_i}$$
and set $\hat\beta^{\text{mle}} = \underset{\beta\in\mathbb{R}^p}{\arg\max}\,\{L(\beta; y, X)\}$ (solved numerically).
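Since there is no closed form, β̂ has to be obtained numerically. A minimal sketch on simulated data (generic BFGS on the negative log-likelihood, rather than the Fisher scoring / IRLS used by standard GLM routines):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # constant included in x
beta_true = np.array([-0.5, 1.0, -2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

def neg_loglik(beta):
    eta = X @ beta
    # -log L(beta; y, X) for the Bernoulli / logistic model
    return -np.sum(y * eta - np.log1p(np.exp(eta)))

beta_mle = minimize(neg_loglik, x0=np.zeros(3), method="BFGS").x
print(beta_mle)  # close to beta_true
```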
A Brief Excursion to Nonparametric Econometrics
Two strategies are usually considered to compute m(x) = E[Y |X = x]
• consider a local approximation (in the neighborhood of x), e.g. k-nearest
neighbor, kernel regression, lowess
• consider a functional decomposition of function m in some natural basis, e.g.
splines
For a kernel regression, as in Nadaraya (1964) and Watson (1964), consider some kernel function K and some bandwidth h,
$$\hat m_h(x) = s_x^\top y = \sum_{i=1}^n s_{x,i}\, y_i \quad\text{where}\quad s_{x,i} = \frac{K_h(x - x_i)}{K_h(x - x_1) + \cdots + K_h(x - x_n)}$$
(in a univariate problem).
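A minimal Nadaraya–Watson sketch with a Gaussian kernel (simulated univariate data; the bandwidth h is fixed arbitrarily here):

```python
import numpy as np

def nw_estimator(x0, x, y, h):
    """Nadaraya-Watson estimate of E[Y|X=x0] with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)   # K_h(x0 - x_i), up to a constant
    s = w / w.sum()                          # weights s_{x,i}
    return s @ y

rng = np.random.default_rng(3)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)
print(nw_estimator(2.0, x, y, h=0.4), np.sin(2.0))  # estimate vs. true regression function
```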
A Brief Excursion to Nonparametric Econometrics
Recall (see Simonoff (1996)) that, using asymptotic approximations,
$$\text{bias}[\hat m_h(x)] = E[\hat m_h(x)] - m(x) \sim h^2\left(\frac{C_1}{2}\, m''(x) + C_2\, m'(x)\,\frac{f'(x)}{f(x)}\right)$$
$$\text{Var}[\hat m_h(x)] \sim \frac{C_3}{nh}\,\frac{\sigma(x)}{f(x)}$$
for some constants C₁, C₂ and C₃.
Set mse[m̂ₕ(x)] = bias²[m̂ₕ(x)] + Var[m̂ₕ(x)] and consider the integrated version
$$\text{mise}[\hat m_h] = \int \text{mse}[\hat m_h(x)]\, dF(x)$$
A Brief Excursion to Nonparametric Econometrics
$$\text{mise}[\hat m_h] \sim \underbrace{\frac{h^4}{4}\left(\int x^2 k(x)\,dx\right)^2 \int\left(m''(x) + 2\,m'(x)\,\frac{f'(x)}{f(x)}\right)^2 dx}_{\text{bias}^2} \;+\; \underbrace{\frac{\sigma^2}{nh}\int k^2(x)\,dx \cdot \int\frac{dx}{f(x)}}_{\text{variance}}$$
Thus, the optimal value is h⋆ = O(n^{−1/5}) (see Silverman’s rule of thumb), obtained using asymptotic theory and approximations.
A Brief Excursion to Nonparametric Econometrics
Choice of meta-parameter h : trade-off bias / variance.
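One common way of navigating this trade-off in practice is to choose h by leave-one-out cross-validation; a self-contained sketch on simulated data (the bandwidth grid is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)

def nw(x0, xs, ys, h):
    # Nadaraya-Watson estimate with a Gaussian kernel
    w = np.exp(-0.5 * ((x0 - xs) / h) ** 2)
    return (w / w.sum()) @ ys

def loo_cv(h):
    """Leave-one-out squared error of the kernel fit for bandwidth h."""
    idx = np.arange(len(x))
    return np.mean([(y[i] - nw(x[i], x[idx != i], y[idx != i], h)) ** 2 for i in idx])

bandwidths = np.linspace(0.05, 1.5, 30)
h_star = bandwidths[int(np.argmin([loo_cv(h) for h in bandwidths]))]
print(h_star)  # too small h: high variance; too large h: high bias
```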
Model (and Variable) Choice
Suppose that the true model is yᵢ = x₁,ᵢβ₁ + x₂,ᵢβ₂ + εᵢ, but we estimate the model on x₁ (only): yᵢ = x₁,ᵢb₁ + ηᵢ. Then
$$\hat b_1 = (X_1^\top X_1)^{-1} X_1^\top y = (X_1^\top X_1)^{-1} X_1^\top [X_1\beta_1 + X_2\beta_2 + \varepsilon] = \beta_1 + \underbrace{(X_1^\top X_1)^{-1} X_1^\top X_2\beta_2}_{\beta_{12}} + \underbrace{(X_1^\top X_1)^{-1} X_1^\top \varepsilon}_{\nu}$$
i.e. E(b̂₁) = β₁ + β₁₂. If X₁ᵀX₂ = 0 (X₁ ⊥ X₂), then E(b̂₁) = β₁.
Conversely, assume that the true model is yᵢ = x₁,ᵢβ₁ + εᵢ but we estimate on (unnecessary) variables x₂ as well: yᵢ = x₁,ᵢb₁ + x₂,ᵢb₂ + ηᵢ. Here the estimator is unbiased, E(b̂₁) = β₁, but it is not efficient...
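A small simulation makes the two cases visible (hypothetical data, with correlated regressors so that β₁₂ ≠ 0):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)          # correlated with x1, so X1'X2 != 0
y = 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)

X_short = x1.reshape(-1, 1)                  # omitting x2: biased estimate of beta1
X_long = np.column_stack([x1, x2])           # including x2: unbiased (less efficient if x2 useless)

b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]
b_long = np.linalg.lstsq(X_long, y, rcond=None)[0]
print(b_short[0])   # about 2.7 = beta1 + beta12, with beta12 = 0.7 * beta2
print(b_long[0])    # about 2.0
```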
Model (and Variable) Choice, pluralitas non est ponenda sine necessitate
From the variance decomposition
$$\underbrace{\frac{1}{n}\sum_{i=1}^n (y_i - \bar y)^2}_{\text{total variance}} = \underbrace{\frac{1}{n}\sum_{i=1}^n (y_i - \hat y_i)^2}_{\text{residual variance}} + \underbrace{\frac{1}{n}\sum_{i=1}^n (\hat y_i - \bar y)^2}_{\text{explained variance}}$$
define the R² as
$$R^2 = \frac{\sum_{i=1}^n (y_i - \bar y)^2 - \sum_{i=1}^n (y_i - \hat y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2}$$
or better, the adjusted R²,
$$\bar R^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p} = R^2 - \underbrace{(1 - R^2)\,\frac{p - 1}{n - p}}_{\text{penalty}}.$$
Model (and Variable) Choice, pluralitas non est ponenda sine necessitate
One can also consider the log-likelihood, or some penalized version,
$$AIC = -\log L + \underbrace{2\cdot p}_{\text{penalty}} \quad\text{or}\quad BIC = -\log L + \underbrace{\log(n)\cdot p}_{\text{penalty}}.$$
The objective is min{− log L(β)}, and quality is measured via − log L(β̂) + 2p.
Goodhart’s law “When a measure becomes a target, it ceases to be a good measure”
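To make the penalized criteria above concrete, a sketch computing R², adjusted R², AIC and BIC for a Gaussian linear model on simulated data (constants are dropped from the log-likelihood, so only differences between candidate models are meaningful; the usual textbook convention multiplies log L by 2):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
sigma2 = resid @ resid / n
loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)   # Gaussian log-likelihood at the MLE

r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)
aic = -loglik + 2 * p            # penalized (negative) log-likelihood, as on the slide
bic = -loglik + np.log(n) * p
print(r2, adj_r2, aic, bic)
```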
Model (and Variable) Choice, pluralitas non est ponenda sine necessitate
An alternative approach to penalization: consider a linear predictor m ∈ M,
$$\mathcal{M} = \big\{m : m(x) = s_h(x)^\top y\big\} \quad\text{where}\quad S = (s(x_1), \cdots, s(x_n))^\top \text{ is the smoothing matrix.}$$
Suppose that the true model is y = m₀(x) + ε with E[ε] = 0 and Var[ε] = σ²𝕀, so that m₀(x) = E[Y|X = x]. The quadratic risk R(m̂) = E[(Y − m̂(X))²] is
$$\underbrace{E\big[(Y - m_0(X))^2\big]}_{\text{error}} + \underbrace{E\big[(m_0(X) - E[\hat m(X)])^2\big]}_{\text{bias}^2} + \underbrace{E\big[(E[\hat m(X)] - \hat m(X))^2\big]}_{\text{variance}}.$$
Consider the empirical risk $\hat R_n(\hat m) = \frac{1}{n}\sum_{i=1}^n (y_i - \hat m(x_i))^2$; then one can prove that
$$R(\hat m) = E\big[\hat R_n(\hat m)\big] + \frac{2\sigma^2}{n}\,\text{trace}(S),$$
see Mallows’ Cp, from Mallows (1973, Some Comments on Cp).
Introduction: “Machine Learning” ?
Friedman (1997, Data Mining and Statistics: What’s the Connection?)
Introduction: “Machine Learning” ?
Machine learning (initially) did not need any probabilistic framework.
Consider observations (yᵢ, xᵢ), a loss function ℓ and some class of models M. The goal is to solve
$$\hat m \in \underset{m\in\mathcal{M}}{\operatorname{argmin}}\Big\{\sum_{i=1}^n \ell(y_i, m(x_i))\Big\}$$
Watt et al. (2016, Machine Learning Refined: Foundations, Algorithms, and Applications)
Probabilistic Foundations and “Statistical Learning”
Vapnik (2000, The Nature of Statistical Learning Theory)
Probabilistic Foundations and “Statistical Learning”
Valiant (1984, A Theory of the Learnable)
Probabilistic Foundations and “Statistical Learning”
Consider a classification problem, y ∈ {−1, +1}. The “true” model is m₀ (the target), and consider some candidate model m.
Let P denote the (unknown) distribution of the X’s. The error of m can be written
$$R_{P,m_0}(m) = P\big[m(X) \neq m_0(X)\big] = P\big[\{x : m(x) \neq m_0(x)\}\big] \quad\text{where } X \sim P.$$
Naturally, we can assume that the observations xᵢ were drawn from P. The empirical risk is
$$\hat R(m) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\big(m(x_i) \neq y_i\big).$$
It is not possible to get a perfect model, so we should seek a Probably Approximately Correct (PAC) model. Given ε, we want to find m such that $R_{P,m_0}(m) \leq \varepsilon$ with probability 1 − δ.
More precisely, we want an algorithm that might lead us to such a candidate m.
Probabilistic Foundations and “Statistical Learning”
Suppose that M contains a finite number of models. For any ε, δ, P and m₀, if we have enough observations ($n \geq \varepsilon^{-1}\log[\delta^{-1}|\mathcal{M}|]$), and if
$$\hat m \in \underset{m\in\mathcal{M}}{\operatorname{argmin}}\Big\{\frac{1}{n}\sum_{i=1}^n \mathbf{1}\big(m(x_i) \neq y_i\big)\Big\}$$
then, with probability higher than 1 − δ, $R_{P,m_0}(\hat m) \leq \varepsilon$.
Here $n_{\mathcal{M}}(\varepsilon, \delta) = \varepsilon^{-1}\log[\delta^{-1}|\mathcal{M}|]$ is called the complexity, and M is PAC-learnable.
If M is not finite, the problem is more complicated: it is necessary to define a dimension d of M (the so-called VC-dimension) that will be a substitute for |M|.
Penalization and Over/Under-Fit
The risk of m̂ cannot be estimated on the data used to fit m̂.
Alternatives are cross-validation and the bootstrap.
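A minimal k-fold cross-validation sketch (here scoring the degree of a polynomial fit on simulated data; the same loop applies to any meta-parameter, e.g. a bandwidth or a penalty λ):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-2, 2, size=300)
y = np.sin(x) + rng.normal(scale=0.3, size=300)

def kfold_mse(degree, k=5):
    """k-fold cross-validated MSE of a polynomial fit of the given degree."""
    idx = rng.permutation(len(x))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coefs = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((y[fold] - np.polyval(coefs, x[fold])) ** 2))
    return np.mean(errs)

for d in [1, 3, 5, 10, 15]:
    print(d, kfold_mse(d))   # risk estimated out-of-sample, not on the fitting data
```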
The objective function and loss functions
Consider a loss function ℓ : Y × Y → R₊ and set $\hat m \in \underset{m\in\mathcal{M}}{\operatorname{argmin}}\big\{\sum_{i=1}^n \ell(y_i, m(x_i))\big\}$.
A classical loss function is the quadratic one, ℓ₂(u, v) = (u − v)². Recall that $\bar y = \underset{m\in\mathbb{R}}{\operatorname{argmin}}\big\{\sum_{i=1}^n \ell_2(y_i, m)\big\}$, or (for the continuous version) $E(Y) = \underset{m\in\mathbb{R}}{\operatorname{argmin}}\big\{\|Y - m\|_2^2\big\} = \underset{m\in\mathbb{R}}{\operatorname{argmin}}\big\{E[\ell_2(Y, m)]\big\}$.
One can also consider the least-absolute-value one, ℓ₁(u, v) = |u − v|. Recall that $\text{median}[y] = \underset{m\in\mathbb{R}}{\operatorname{argmin}}\big\{\sum_{i=1}^n \ell_1(y_i, m)\big\}$. The optimisation problem
$$\hat m = \underset{m\in\mathcal{M}}{\operatorname{argmin}}\Big\{\sum_{i=1}^n |y_i - m(x_i)|\Big\}$$
is well known in robust econometrics.
The objective function and loss functions
More generally, since ℓ₁(y, m) = |(y − m)(1/2 − 1_{y≤m})|, this can be generalized to any probability level τ ∈ (0, 1):
$$\hat m_\tau = \underset{m\in\mathcal{M}_0}{\operatorname{argmin}}\Big\{\sum_{i=1}^n \ell_\tau^{q}(y_i, m(x_i))\Big\} \quad\text{with}\quad \ell_\tau^{q}(x, y) = (x - y)\big(\tau - \mathbf{1}_{x\leq y}\big)$$
is the quantile regression, see Koenker (2005, Quantile Regression). One can also consider expectiles, with loss function $\ell_\tau^{e}(x, y) = (x - y)^2\cdot\big|\tau - \mathbf{1}_{x\leq y}\big|$, see Aigner et al. (1977, Formulation and estimation of stochastic frontier production function models).
Gneiting (2011, Making and Evaluating Point Forecasts) introduced the concept of elicitable statistics: T is elicitable if there exists ℓ : R × R → R₊ such that
$$T(Y) = \underset{x\in\mathbb{R}}{\operatorname{argmin}}\Big\{\int_{\mathbb{R}} \ell(x, y)\, dF(y)\Big\} = \underset{x\in\mathbb{R}}{\operatorname{argmin}}\big\{E[\ell(x, Y)]\big\} \quad\text{where } Y \overset{L}{\sim} F.$$
Thus the mean, the median and all quantiles are elicitable.
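A quick check that the quantile is elicitable under the pinball loss ℓ^q_τ (a sketch on a simulated sample; the grid search is only for transparency, any optimizer would do):

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.lognormal(mean=0.0, sigma=0.8, size=5000)
tau = 0.9

def pinball(m):
    """Average pinball / quantile loss at level tau, evaluated at candidate m."""
    u = y - m
    return np.mean(u * (tau - (u <= 0)))

grid = np.linspace(y.min(), y.max(), 2000)
m_hat = grid[int(np.argmin([pinball(m) for m in grid]))]
print(m_hat, np.quantile(y, tau))   # both close to the 90% quantile
```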
Weak and iterative learning (Gradient Boosting)
We want to solve $\hat m \in \underset{m\in\mathcal{M}}{\operatorname{argmin}}\big\{\sum_{i=1}^n \ell(y_i, m(x_i))\big\}$ on a set of possible predictors M.
In a classical optimization problem, $x^\star \in \underset{x\in\mathbb{R}^d}{\operatorname{argmin}}\{f(x)\}$, we consider a gradient descent to approximate x⋆,
$$f(x_k) \sim f(x_{k-1}) + \underbrace{(x_k - x_{k-1})}_{\alpha_k}\,\nabla f(x_{k-1}).$$
Here, it is some sort of dual problem, $\hat m \in \underset{m\in\mathcal{M}}{\operatorname{argmin}}\{\cdots\}$, and a functional gradient descent can be considered,
$$m_k(x) \sim m_{k-1}(x) + \underbrace{(m_k - m_{k-1})}_{\beta_k}(x).$$
Weak and iterative learning (Gradient Boosting)
$$m^{(k)} = m^{(k-1)} + \underset{h\in\mathcal{H}}{\operatorname{argmin}}\Big\{\sum_{i=1}^n \ell\big(\underbrace{y_i - m^{(k-1)}(x_i)}_{\varepsilon_{k,i}},\, h(x_i)\big)\Big\}$$
where
$$m^{(k)}(\cdot) = \underbrace{m_1(\cdot)}_{\sim y} + \underbrace{m_2(\cdot)}_{\sim \varepsilon_1} + \underbrace{m_3(\cdot)}_{\sim \varepsilon_2} + \cdots + \underbrace{m_k(\cdot)}_{\sim \varepsilon_{k-1}} = m^{(k-1)}(\cdot) + m_k(\cdot).$$
Equivalently, start with $m^{(0)} = \underset{m\in\mathbb{R}}{\operatorname{argmin}}\big\{\sum_{i=1}^n \ell(y_i, m)\big\}$, and solve iteratively
$$m^{(k)} = m^{(k-1)} + \underset{h\in\mathcal{H}}{\operatorname{argmin}}\Big\{\sum_{i=1}^n \ell\big(y_i,\, m^{(k-1)}(x_i) + h(x_i)\big)\Big\}$$
Weak and iterative learning (Gradient Boosting)
If H is a set of differentiable functions,
$$m^{(k)} = m^{(k-1)} - \gamma_k \sum_{i=1}^n \nabla_{m^{(k-1)}}\ell\big(y_i, m^{(k-1)}(x_i)\big),$$
where
$$\gamma_k = \underset{\gamma}{\operatorname{argmin}}\Big\{\sum_{i=1}^n \ell\Big(y_i,\; m^{(k-1)}(x_i) - \gamma\,\nabla_{m^{(k-1)}}\ell\big(y_i, m^{(k-1)}(x_i)\big)\Big)\Big\}.$$
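For the quadratic loss the functional gradient is just the residual, so the generic loop above reduces to repeatedly fitting a weak learner to residuals. A minimal sketch with regression stumps (depth-1 trees) from scikit-learn, assuming a fixed shrinkage ν instead of the line search for γₖ:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(8)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

nu, K = 0.1, 200                      # shrinkage and number of boosting steps
pred = np.full_like(y, y.mean())      # m^(0): the constant minimizing the squared loss
learners = []
for _ in range(K):
    resid = y - pred                  # epsilon_k: (negative) gradient of the squared loss
    stump = DecisionTreeRegressor(max_depth=1).fit(X, resid)
    pred += nu * stump.predict(X)     # m^(k) = m^(k-1) + nu * h_k
    learners.append(stump)

print(np.mean((y - pred) ** 2))       # in-sample fit improves with K (beware overfitting)
```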
[Figure: two scatter-plot panels, x from 0 to 6, y from −1.5 to 1.5.]
Penalization and Over/Under-Fit
Consider some penalized objective function: given λ ≥ 0,
$$(\hat\beta_{0,\lambda}, \hat\beta_\lambda) = \underset{\beta_0,\beta}{\operatorname{argmin}}\Big\{\sum_{i=1}^n \ell(y_i, \beta_0 + x_i^\top\beta) + \lambda\,\|\beta\|\Big\}$$
with centered and scaled variables, for some penalty ‖β‖.
Note that there is some Bayesian interpretation,
$$\underbrace{P[\theta|y]}_{\text{a posteriori}} \propto \underbrace{P[y|\theta]}_{\text{likelihood}} \cdot \underbrace{P[\theta]}_{\text{a priori}}, \quad\text{i.e.}\quad \log P[\theta|y] = \underbrace{\log P[y|\theta]}_{\text{log-likelihood}} + \underbrace{\log P[\theta]}_{\text{penalization}}.$$
Going further, ℓ0, ℓ1 and ℓ2 penalty
Define
$$\|a\|_{\ell_0} = \sum_{i=1}^d \mathbf{1}(a_i \neq 0), \qquad \|a\|_{\ell_1} = \sum_{i=1}^d |a_i| \qquad\text{and}\qquad \|a\|_{\ell_2} = \Big(\sum_{i=1}^d a_i^2\Big)^{1/2}, \quad\text{for } a \in \mathbb{R}^d.$$
Constrained versus penalized optimization:
(ℓ0)  $\underset{\beta;\,\|\beta\|_{\ell_0}\leq s}{\operatorname{argmin}}\sum_{i=1}^n \ell(y_i, \beta_0 + x_i^\top\beta)$   versus   $\underset{\beta}{\operatorname{argmin}}\sum_{i=1}^n \ell(y_i, \beta_0 + x_i^\top\beta) + \lambda\|\beta\|_{\ell_0}$
(ℓ1)  $\underset{\beta;\,\|\beta\|_{\ell_1}\leq s}{\operatorname{argmin}}\sum_{i=1}^n \ell(y_i, \beta_0 + x_i^\top\beta)$   versus   $\underset{\beta}{\operatorname{argmin}}\sum_{i=1}^n \ell(y_i, \beta_0 + x_i^\top\beta) + \lambda\|\beta\|_{\ell_1}$
(ℓ2)  $\underset{\beta;\,\|\beta\|_{\ell_2}\leq s}{\operatorname{argmin}}\sum_{i=1}^n \ell(y_i, \beta_0 + x_i^\top\beta)$   versus   $\underset{\beta}{\operatorname{argmin}}\sum_{i=1}^n \ell(y_i, \beta_0 + x_i^\top\beta) + \lambda\|\beta\|_{\ell_2}$
Assume that ℓ is the quadratic norm.
Going further, ℓ0, ℓ1 and ℓ2 penalty
The two problems (ℓ2) are equivalent: for any (β⋆, s⋆) solution of the left (constrained) problem, there exists λ⋆ such that (β⋆, λ⋆) is a solution of the right (penalized) problem, and conversely.
The two problems (ℓ1) are equivalent: for any (β⋆, s⋆) solution of the left problem, there exists λ⋆ such that (β⋆, λ⋆) is a solution of the right problem, and conversely. Nevertheless, even if there is a theoretical equivalence, there might be numerical issues since the solution is not necessarily unique.
The two problems (ℓ0) are not equivalent: if (β⋆, λ⋆) is a solution of the right problem, then there exists s⋆ such that β⋆ is a solution of the left problem, but the converse is not true.
More generally, consider an ℓp norm:
• sparsity is obtained when p ≤ 1
• convexity is obtained when p ≥ 1
Using the ℓ0 penalty
Foster & George (1994, The Risk Inflation Criterion for Multiple Regression) tried to solve directly the penalized problem (ℓ0).
But it is a complex combinatorial problem in high dimension (Natarajan (1995, Sparse Approximate Solutions to Linear Systems) proved that it is NP-hard).
One can prove that if λ ∼ σ² log(p), then
$$E\big[(x^\top\hat\beta - x^\top\beta_0)^2\big] \leq \underbrace{E\big[(x_S^\top\hat\beta_S - x^\top\beta_0)^2\big]}_{=\sigma^2\#S}\cdot\big(4\log p + 2 + o(1)\big).$$
In that case
$$\hat\beta^{\text{sub}}_{\lambda,j} = \begin{cases} 0 & \text{if } j \notin S_\lambda(\hat\beta)\\ \hat\beta^{\text{ols}}_j & \text{if } j \in S_\lambda(\hat\beta),\end{cases}$$
where $S_\lambda(\hat\beta)$ is the set of non-null values in the solutions of (ℓ0).
Using the ℓ2 penalty
With the ℓ2-norm, (ℓ2) is the Ridge regression,
$$\hat\beta^{\text{ridge}}_\lambda = (X^\top X + \lambda\mathbb{I})^{-1} X^\top y = (X^\top X + \lambda\mathbb{I})^{-1}(X^\top X)\,\hat\beta^{\text{ols}},$$
see Tikhonov (1943, On the stability of inverse problems). Hence
$$\text{bias}[\hat\beta^{\text{ridge}}_\lambda] = -\lambda[X^\top X + \lambda\mathbb{I}]^{-1}\hat\beta^{\text{ols}}, \qquad \text{Var}[\hat\beta^{\text{ridge}}_\lambda] = \sigma^2[X^\top X + \lambda\mathbb{I}]^{-1}X^\top X\,[X^\top X + \lambda\mathbb{I}]^{-1}.$$
With orthogonal variables (i.e. XᵀX = 𝕀), we get
$$\text{bias}[\hat\beta^{\text{ridge}}_\lambda] = -\frac{\lambda}{1 + \lambda}\,\hat\beta^{\text{ols}} \quad\text{and}\quad \text{Var}[\hat\beta^{\text{ridge}}_\lambda] = \frac{\sigma^2}{(1 + \lambda)^2}\,\mathbb{I} = \frac{\text{Var}[\hat\beta^{\text{ols}}]}{(1 + \lambda)^2}.$$
Note that Var[β̂^ridge_λ] < Var[β̂^ols]. Further, mse[β̂^ridge_λ] is minimal when λ⋆ = pσ²/βᵀβ.
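A sketch of the closed-form Ridge estimator on simulated, standardized covariates, to see the shrinkage relative to OLS as λ grows:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 200, 5
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)        # centered and scaled covariates
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.0]) + rng.normal(size=n)
y = y - y.mean()

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
for lam in [0.0, 10.0, 100.0, 1000.0]:
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(lam, np.round(beta_ridge, 3))          # coefficients shrink toward 0 as lambda grows
```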
Using the ℓ1 penalty
If ℓ is no longer the quadratic norm but the ℓ1 norm, problem (ℓ1) is not always strictly convex, and the optimum is not always unique (e.g. if XᵀX is singular).
But in the quadratic-loss case, the problem is strictly convex in Xβ, and at least Xβ̂ is unique.
Further, note that solutions are necessarily coherent (signs of coefficients): it is not possible to have β̂ⱼ < 0 for one solution and β̂ⱼ > 0 for another one.
In many cases, problem (ℓ1) yields a corner-type solution, which can be seen as a “best subset” solution, as in (ℓ0).
Using the ℓ1 penalty
Consider a simple regression yᵢ = xᵢβ + ε, with an ℓ1 penalty and an ℓ2 loss function. Problem (ℓ1) becomes
$$\min\big\{y^\top y - 2y^\top x\beta + \beta x^\top x\beta + 2\lambda|\beta|\big\}.$$
The first-order condition can be written −2yᵀx + 2xᵀxβ ± 2λ = 0 (the sign in ± being the sign of β). Assume that yᵀx > 0; then the solution is
$$\hat\beta^{\text{lasso}}_\lambda = \max\Big\{\frac{y^\top x - \lambda}{x^\top x},\, 0\Big\}$$
(we get a corner solution when λ is large).
In higher dimension, see Tibshirani & Wasserman (2016, A Closer Look at Sparse Regression) or Candès & Plan (2009, Near-ideal model selection by ℓ1 minimization).
Under some additional technical assumptions, the lasso estimator is “sparsistent”, in the sense that the support of $\hat\beta^{\text{lasso}}_\lambda$ is the same as that of β.
Going further, ℓ0, ℓ1 and ℓ2 penalty
Thus, the lasso can be used for variable selection (see Hastie et al. (2001, The Elements of Statistical Learning)).
With orthonormal covariates, one can prove that
$$\hat\beta^{\text{sub}}_{\lambda,j} = \hat\beta^{\text{ols}}_j\,\mathbf{1}_{|\hat\beta^{\text{sub}}_{\lambda,j}| > b}, \qquad \hat\beta^{\text{ridge}}_{\lambda,j} = \frac{\hat\beta^{\text{ols}}_j}{1 + \lambda} \qquad\text{and}\qquad \hat\beta^{\text{lasso}}_{\lambda,j} = \text{sign}\big[\hat\beta^{\text{ols}}_j\big]\cdot\big(|\hat\beta^{\text{ols}}_j| - \lambda\big)_+.$$
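These three maps of β̂^ols (hard thresholding, proportional shrinkage, soft thresholding) are easy to inspect numerically; a sketch with an arbitrary threshold b and penalty λ:

```python
import numpy as np

b_ols = np.linspace(-3, 3, 13)     # a grid of OLS coefficients
lam, b = 1.0, 1.0                  # arbitrary penalty and threshold

subset = b_ols * (np.abs(b_ols) > b)                          # hard thresholding (best subset)
ridge = b_ols / (1 + lam)                                     # proportional shrinkage
lasso = np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0)   # soft thresholding

for row in zip(b_ols, subset, ridge, lasso):
    print(np.round(row, 2))
```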
Econometric Modeling
Data {(yᵢ, xᵢ)}, for i = 1, · · · , n, with xᵢ ∈ X ⊂ Rᵖ and yᵢ ∈ Y.
A model is a mapping m : X → Y
- regression, Y = R (but also Y = N)
- classification, Y = {0, 1}, {−1, +1}, {•, •} (binary, or more)
Classification models are based on two steps,
• a score function, s(x) = P(Y = 1|X = x) ∈ [0, 1]
• a classifier s(x) → ŷ ∈ {0, 1}.
Classification Trees
To split a node {N} into two nodes {N_L, N_R}, consider
$$I(N_L, N_R) = \sum_{x\in\{L,R\}} \frac{n_x}{n}\, I(N_x),$$
e.g. the Gini index (used originally in CART, see Breiman et al. (1984, Classification and Regression Trees)),
$$\text{gini}(N_L, N_R) = -\sum_{x\in\{L,R\}} \frac{n_x}{n}\sum_{y\in\{0,1\}} \frac{n_{x,y}}{n_x}\Big(1 - \frac{n_{x,y}}{n_x}\Big)$$
and the cross-entropy (used in C4.5 and C5.0),
$$\text{entropy}(N_L, N_R) = -\sum_{x\in\{L,R\}} \frac{n_x}{n}\sum_{y\in\{0,1\}} \frac{n_{x,y}}{n_x}\log\frac{n_{x,y}}{n_x}$$
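A sketch of these impurity computations for a single candidate split (toy class counts, written here with the usual positive sign convention, to be minimized over splits; in CART the best split is searched over all variables and cut-points):

```python
import numpy as np

def gini(counts):
    """Gini impurity of one node; counts = number of observations per class."""
    p = np.asarray(counts) / np.sum(counts)
    return np.sum(p * (1 - p))

def entropy(counts):
    p = np.asarray(counts) / np.sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def split_impurity(left_counts, right_counts, impurity=gini):
    """Weighted impurity I(N_L, N_R) = sum_x (n_x / n) I(N_x)."""
    n_l, n_r = np.sum(left_counts), np.sum(right_counts)
    n = n_l + n_r
    return (n_l / n) * impurity(left_counts) + (n_r / n) * impurity(right_counts)

print(split_impurity([40, 10], [5, 45], gini))      # a fairly pure split: low impurity
print(split_impurity([25, 25], [25, 25], entropy))  # a useless split: maximal impurity
```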
Model Selection & ROC Curves
Given a scoring function m(·), with m(x) = E[Y|X = x], and a threshold s ∈ (0, 1), set
$$\hat Y^{(s)} = \mathbf{1}[\hat m(x) > s] = \begin{cases} 1 & \text{if } \hat m(x) > s\\ 0 & \text{if } \hat m(x) \leq s\end{cases}$$
Define the confusion matrix as N = [N_{u,v}]:

             Y = 0        Y = 1
Ŷs = 0       TNs          FNs          TNs + FNs
Ŷs = 1       FPs          TPs          FPs + TPs
             TNs + FPs    FNs + TPs    n

The ROC curve is
$$\text{ROC}_s = \left(\frac{\text{FP}_s}{\text{FP}_s + \text{TN}_s},\ \frac{\text{TP}_s}{\text{TP}_s + \text{FN}_s}\right) \quad\text{with } s \in (0, 1).$$
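A sketch computing the confusion matrix counts and the corresponding ROC point for one threshold s (simulated scores; looping s over (0, 1) traces the whole curve):

```python
import numpy as np

rng = np.random.default_rng(10)
y = rng.binomial(1, 0.4, size=1000)
score = np.clip(0.4 * y + rng.normal(0.3, 0.2, size=1000), 0, 1)  # a noisy score m(x)

def roc_point(s):
    """False positive rate and true positive rate for threshold s."""
    y_hat = (score > s).astype(int)
    tp = np.sum((y_hat == 1) & (y == 1))
    fp = np.sum((y_hat == 1) & (y == 0))
    fn = np.sum((y_hat == 0) & (y == 1))
    tn = np.sum((y_hat == 0) & (y == 0))
    return fp / (fp + tn), tp / (tp + fn)

for s in [0.25, 0.5, 0.75]:
    print(s, roc_point(s))
```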
Machine Learning Tools versus Econometric Techniques
Car seats sales data, used in James et al. (2014, An Introduction to Statistical Learning).
AUC: logit 0.9544, bagging 0.8973, random forest 0.9050, boosting 0.9313.
[Figure: ROC curves (sensitivity vs. specificity) for logit, bagging, random forest and boosting; two marked points: 0.285 (0.860, 0.927) and 0.500 (0.907, 0.860).]
see Section 5.1 in C. et al. (2018, Econométrie et Machine Learning)
Machine Learning Tools versus Econometric Techniques
Caravan sales data, used in James et al. (2014, An Introduction to Statistical Learning).
AUC: logit 0.7372, bagging 0.7198, random forest 0.7154, boosting 0.7691.
[Figure: ROC curves (sensitivity vs. specificity) for logit, bagging, random forest and boosting.]
see Section 5.2 in C. et al. (2018, Econométrie et Machine Learning)
Machine Learning Tools versus Econometric Techniques
Credit database, used in Nisbet et al. (2001, Handbook of Statistical Analysis and
Data Mining)
Conclusion
Breiman (2001, Statistical Modeling: The Two Cultures)
Econometrics and Machine Learning are focusing on the same problem, but with different cultures:
- the probabilistic interpretation of econometric models (unfortunately sometimes misleading, e.g. p-values) can deal with non-i.i.d. data (time series, panel, etc.)
- machine learning is about predictive modeling and generalization with algorithmic tools, based on bootstrap (sampling and sub-sampling), cross-validation, variable selection, nonlinearities, cross effects, etc.
@freakonometrics freakonometrics freakonometrics.hypotheses.org 48

More Related Content

What's hot (20)

Reinforcement Learning in Economics and Finance
Reinforcement Learning in Economics and FinanceReinforcement Learning in Economics and Finance
Reinforcement Learning in Economics and Finance
 
Side 2019 #10
Side 2019 #10Side 2019 #10
Side 2019 #10
 
Side 2019, part 2
Side 2019, part 2Side 2019, part 2
Side 2019, part 2
 
Slides ub-7
Slides ub-7Slides ub-7
Slides ub-7
 
Econometrics, PhD Course, #1 Nonlinearities
Econometrics, PhD Course, #1 NonlinearitiesEconometrics, PhD Course, #1 Nonlinearities
Econometrics, PhD Course, #1 Nonlinearities
 
Side 2019 #8
Side 2019 #8Side 2019 #8
Side 2019 #8
 
Mutualisation et Segmentation
Mutualisation et SegmentationMutualisation et Segmentation
Mutualisation et Segmentation
 
Side 2019 #7
Side 2019 #7Side 2019 #7
Side 2019 #7
 
Slides econometrics-2017-graduate-2
Slides econometrics-2017-graduate-2Slides econometrics-2017-graduate-2
Slides econometrics-2017-graduate-2
 
Machine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & InsuranceMachine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & Insurance
 
Side 2019 #4
Side 2019 #4Side 2019 #4
Side 2019 #4
 
Slides econometrics-2018-graduate-3
Slides econometrics-2018-graduate-3Slides econometrics-2018-graduate-3
Slides econometrics-2018-graduate-3
 
Side 2019 #3
Side 2019 #3Side 2019 #3
Side 2019 #3
 
Side 2019 #12
Side 2019 #12Side 2019 #12
Side 2019 #12
 
transformations and nonparametric inference
transformations and nonparametric inferencetransformations and nonparametric inference
transformations and nonparametric inference
 
Slides econometrics-2018-graduate-2
Slides econometrics-2018-graduate-2Slides econometrics-2018-graduate-2
Slides econometrics-2018-graduate-2
 
Machine Learning for Actuaries
Machine Learning for ActuariesMachine Learning for Actuaries
Machine Learning for Actuaries
 
Lausanne 2019 #4
Lausanne 2019 #4Lausanne 2019 #4
Lausanne 2019 #4
 
Side 2019 #5
Side 2019 #5Side 2019 #5
Side 2019 #5
 
Slides ub-2
Slides ub-2Slides ub-2
Slides ub-2
 

Similar to Varese italie seminar

Predictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataPredictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataArthur Charpentier
 
Information topology, Deep Network generalization and Consciousness quantific...
Information topology, Deep Network generalization and Consciousness quantific...Information topology, Deep Network generalization and Consciousness quantific...
Information topology, Deep Network generalization and Consciousness quantific...Pierre BAUDOT
 
Partitions and entropies intel third draft
Partitions and entropies intel third draftPartitions and entropies intel third draft
Partitions and entropies intel third draftAnna Movsheva
 
Generative models : VAE and GAN
Generative models : VAE and GANGenerative models : VAE and GAN
Generative models : VAE and GANSEMINARGROOT
 
Metaheuristic Algorithms: A Critical Analysis
Metaheuristic Algorithms: A Critical AnalysisMetaheuristic Algorithms: A Critical Analysis
Metaheuristic Algorithms: A Critical AnalysisXin-She Yang
 
20130928 automated theorem_proving_harrison
20130928 automated theorem_proving_harrison20130928 automated theorem_proving_harrison
20130928 automated theorem_proving_harrisonComputer Science Club
 
Nature-Inspired Optimization Algorithms
Nature-Inspired Optimization Algorithms Nature-Inspired Optimization Algorithms
Nature-Inspired Optimization Algorithms Xin-She Yang
 
WSC 2011, advanced tutorial on simulation in Statistics
WSC 2011, advanced tutorial on simulation in StatisticsWSC 2011, advanced tutorial on simulation in Statistics
WSC 2011, advanced tutorial on simulation in StatisticsChristian Robert
 
Statistics (1): estimation, Chapter 1: Models
Statistics (1): estimation, Chapter 1: ModelsStatistics (1): estimation, Chapter 1: Models
Statistics (1): estimation, Chapter 1: ModelsChristian Robert
 

Similar to Varese italie seminar (20)

Slides econ-lm
Slides econ-lmSlides econ-lm
Slides econ-lm
 
Slides econ-lm
Slides econ-lmSlides econ-lm
Slides econ-lm
 
Slides ACTINFO 2016
Slides ACTINFO 2016Slides ACTINFO 2016
Slides ACTINFO 2016
 
Slides ub-3
Slides ub-3Slides ub-3
Slides ub-3
 
Slides ub-1
Slides ub-1Slides ub-1
Slides ub-1
 
Predictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataPredictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big data
 
Slides Bank England
Slides Bank EnglandSlides Bank England
Slides Bank England
 
Classification
ClassificationClassification
Classification
 
Information topology, Deep Network generalization and Consciousness quantific...
Information topology, Deep Network generalization and Consciousness quantific...Information topology, Deep Network generalization and Consciousness quantific...
Information topology, Deep Network generalization and Consciousness quantific...
 
Partitions and entropies intel third draft
Partitions and entropies intel third draftPartitions and entropies intel third draft
Partitions and entropies intel third draft
 
Generative models : VAE and GAN
Generative models : VAE and GANGenerative models : VAE and GAN
Generative models : VAE and GAN
 
Econometrics 2017-graduate-3
Econometrics 2017-graduate-3Econometrics 2017-graduate-3
Econometrics 2017-graduate-3
 
Metaheuristic Algorithms: A Critical Analysis
Metaheuristic Algorithms: A Critical AnalysisMetaheuristic Algorithms: A Critical Analysis
Metaheuristic Algorithms: A Critical Analysis
 
Rencontres Mutualistes
Rencontres MutualistesRencontres Mutualistes
Rencontres Mutualistes
 
20130928 automated theorem_proving_harrison
20130928 automated theorem_proving_harrison20130928 automated theorem_proving_harrison
20130928 automated theorem_proving_harrison
 
Nature-Inspired Optimization Algorithms
Nature-Inspired Optimization Algorithms Nature-Inspired Optimization Algorithms
Nature-Inspired Optimization Algorithms
 
WSC 2011, advanced tutorial on simulation in Statistics
WSC 2011, advanced tutorial on simulation in StatisticsWSC 2011, advanced tutorial on simulation in Statistics
WSC 2011, advanced tutorial on simulation in Statistics
 
Proba stats-r1-2017
Proba stats-r1-2017Proba stats-r1-2017
Proba stats-r1-2017
 
Statistics (1): estimation, Chapter 1: Models
Statistics (1): estimation, Chapter 1: ModelsStatistics (1): estimation, Chapter 1: Models
Statistics (1): estimation, Chapter 1: Models
 
Nested loop
Nested loopNested loop
Nested loop
 

More from Arthur Charpentier

More from Arthur Charpentier (14)

Family History and Life Insurance
Family History and Life InsuranceFamily History and Life Insurance
Family History and Life Insurance
 
ACT6100 introduction
ACT6100 introductionACT6100 introduction
ACT6100 introduction
 
Family History and Life Insurance (UConn actuarial seminar)
Family History and Life Insurance (UConn actuarial seminar)Family History and Life Insurance (UConn actuarial seminar)
Family History and Life Insurance (UConn actuarial seminar)
 
Control epidemics
Control epidemics Control epidemics
Control epidemics
 
STT5100 Automne 2020, introduction
STT5100 Automne 2020, introductionSTT5100 Automne 2020, introduction
STT5100 Automne 2020, introduction
 
Family History and Life Insurance
Family History and Life InsuranceFamily History and Life Insurance
Family History and Life Insurance
 
Optimal Control and COVID-19
Optimal Control and COVID-19Optimal Control and COVID-19
Optimal Control and COVID-19
 
Slides OICA 2020
Slides OICA 2020Slides OICA 2020
Slides OICA 2020
 
Lausanne 2019 #3
Lausanne 2019 #3Lausanne 2019 #3
Lausanne 2019 #3
 
Lausanne 2019 #2
Lausanne 2019 #2Lausanne 2019 #2
Lausanne 2019 #2
 
Side 2019 #11
Side 2019 #11Side 2019 #11
Side 2019 #11
 
Pareto Models, Slides EQUINEQ
Pareto Models, Slides EQUINEQPareto Models, Slides EQUINEQ
Pareto Models, Slides EQUINEQ
 
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive ModelingHands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
 
Econ. Seminar Uqam
Econ. Seminar UqamEcon. Seminar Uqam
Econ. Seminar Uqam
 

Recently uploaded

OAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptx
OAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptxOAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptx
OAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptxhiddenlevers
 
20240417-Calibre-April-2024-Investor-Presentation.pdf
20240417-Calibre-April-2024-Investor-Presentation.pdf20240417-Calibre-April-2024-Investor-Presentation.pdf
20240417-Calibre-April-2024-Investor-Presentation.pdfAdnet Communications
 
Instant Issue Debit Cards - School Designs
Instant Issue Debit Cards - School DesignsInstant Issue Debit Cards - School Designs
Instant Issue Debit Cards - School Designsegoetzinger
 
Log your LOA pain with Pension Lab's brilliant campaign
Log your LOA pain with Pension Lab's brilliant campaignLog your LOA pain with Pension Lab's brilliant campaign
Log your LOA pain with Pension Lab's brilliant campaignHenry Tapper
 
Best VIP Call Girls Noida Sector 18 Call Me: 8448380779
Best VIP Call Girls Noida Sector 18 Call Me: 8448380779Best VIP Call Girls Noida Sector 18 Call Me: 8448380779
Best VIP Call Girls Noida Sector 18 Call Me: 8448380779Delhi Call girls
 
Interimreport1 January–31 March2024 Elo Mutual Pension Insurance Company
Interimreport1 January–31 March2024 Elo Mutual Pension Insurance CompanyInterimreport1 January–31 March2024 Elo Mutual Pension Insurance Company
Interimreport1 January–31 March2024 Elo Mutual Pension Insurance CompanyTyöeläkeyhtiö Elo
 
The Economic History of the U.S. Lecture 20.pdf
The Economic History of the U.S. Lecture 20.pdfThe Economic History of the U.S. Lecture 20.pdf
The Economic History of the U.S. Lecture 20.pdfGale Pooley
 
Solution Manual for Financial Accounting, 11th Edition by Robert Libby, Patri...
Solution Manual for Financial Accounting, 11th Edition by Robert Libby, Patri...Solution Manual for Financial Accounting, 11th Edition by Robert Libby, Patri...
Solution Manual for Financial Accounting, 11th Edition by Robert Libby, Patri...ssifa0344
 
Booking open Available Pune Call Girls Shivane 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Shivane  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Shivane  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Shivane 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
The Economic History of the U.S. Lecture 21.pdf
The Economic History of the U.S. Lecture 21.pdfThe Economic History of the U.S. Lecture 21.pdf
The Economic History of the U.S. Lecture 21.pdfGale Pooley
 
Bladex Earnings Call Presentation 1Q2024
Bladex Earnings Call Presentation 1Q2024Bladex Earnings Call Presentation 1Q2024
Bladex Earnings Call Presentation 1Q2024Bladex
 
Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...
Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...
Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...shivangimorya083
 
The Economic History of the U.S. Lecture 22.pdf
The Economic History of the U.S. Lecture 22.pdfThe Economic History of the U.S. Lecture 22.pdf
The Economic History of the U.S. Lecture 22.pdfGale Pooley
 
The Economic History of the U.S. Lecture 17.pdf
The Economic History of the U.S. Lecture 17.pdfThe Economic History of the U.S. Lecture 17.pdf
The Economic History of the U.S. Lecture 17.pdfGale Pooley
 
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptxFinTech Belgium
 
06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf
06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf
06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdfFinTech Belgium
 
VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With Room...
VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With Room...VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With Room...
VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With Room...Suhani Kapoor
 
(ANIKA) Budhwar Peth Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANIKA) Budhwar Peth Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANIKA) Budhwar Peth Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANIKA) Budhwar Peth Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Dividend Policy and Dividend Decision Theories.pptx
Dividend Policy and Dividend Decision Theories.pptxDividend Policy and Dividend Decision Theories.pptx
Dividend Policy and Dividend Decision Theories.pptxanshikagoel52
 
Call US 📞 9892124323 ✅ Kurla Call Girls In Kurla ( Mumbai ) secure service
Call US 📞 9892124323 ✅ Kurla Call Girls In Kurla ( Mumbai ) secure serviceCall US 📞 9892124323 ✅ Kurla Call Girls In Kurla ( Mumbai ) secure service
Call US 📞 9892124323 ✅ Kurla Call Girls In Kurla ( Mumbai ) secure servicePooja Nehwal
 

Recently uploaded (20)

OAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptx
OAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptxOAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptx
OAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptx
 
20240417-Calibre-April-2024-Investor-Presentation.pdf
20240417-Calibre-April-2024-Investor-Presentation.pdf20240417-Calibre-April-2024-Investor-Presentation.pdf
20240417-Calibre-April-2024-Investor-Presentation.pdf
 
Instant Issue Debit Cards - School Designs
Instant Issue Debit Cards - School DesignsInstant Issue Debit Cards - School Designs
Instant Issue Debit Cards - School Designs
 
Log your LOA pain with Pension Lab's brilliant campaign
Log your LOA pain with Pension Lab's brilliant campaignLog your LOA pain with Pension Lab's brilliant campaign
Log your LOA pain with Pension Lab's brilliant campaign
 
Best VIP Call Girls Noida Sector 18 Call Me: 8448380779
Best VIP Call Girls Noida Sector 18 Call Me: 8448380779Best VIP Call Girls Noida Sector 18 Call Me: 8448380779
Best VIP Call Girls Noida Sector 18 Call Me: 8448380779
 
Interimreport1 January–31 March2024 Elo Mutual Pension Insurance Company
Interimreport1 January–31 March2024 Elo Mutual Pension Insurance CompanyInterimreport1 January–31 March2024 Elo Mutual Pension Insurance Company
Interimreport1 January–31 March2024 Elo Mutual Pension Insurance Company
 
The Economic History of the U.S. Lecture 20.pdf
The Economic History of the U.S. Lecture 20.pdfThe Economic History of the U.S. Lecture 20.pdf
The Economic History of the U.S. Lecture 20.pdf
 
Solution Manual for Financial Accounting, 11th Edition by Robert Libby, Patri...
Solution Manual for Financial Accounting, 11th Edition by Robert Libby, Patri...Solution Manual for Financial Accounting, 11th Edition by Robert Libby, Patri...
Solution Manual for Financial Accounting, 11th Edition by Robert Libby, Patri...
 
Booking open Available Pune Call Girls Shivane 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Shivane  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Shivane  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Shivane 6297143586 Call Hot Indian Gi...
 
The Economic History of the U.S. Lecture 21.pdf
The Economic History of the U.S. Lecture 21.pdfThe Economic History of the U.S. Lecture 21.pdf
The Economic History of the U.S. Lecture 21.pdf
 
Bladex Earnings Call Presentation 1Q2024
Bladex Earnings Call Presentation 1Q2024Bladex Earnings Call Presentation 1Q2024
Bladex Earnings Call Presentation 1Q2024
 
Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...
Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...
Russian Call Girls In Gtb Nagar (Delhi) 9711199012 💋✔💕😘 Naughty Call Girls Se...
 
The Economic History of the U.S. Lecture 22.pdf
The Economic History of the U.S. Lecture 22.pdfThe Economic History of the U.S. Lecture 22.pdf
The Economic History of the U.S. Lecture 22.pdf
 
The Economic History of the U.S. Lecture 17.pdf
The Economic History of the U.S. Lecture 17.pdfThe Economic History of the U.S. Lecture 17.pdf
The Economic History of the U.S. Lecture 17.pdf
 
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
 
06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf
06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf
06_Joeri Van Speybroek_Dell_MeetupDora&Cybersecurity.pdf
 
VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With Room...
VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With Room...VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With Room...
VIP Call Girls LB Nagar ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With Room...
 
(ANIKA) Budhwar Peth Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANIKA) Budhwar Peth Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANIKA) Budhwar Peth Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANIKA) Budhwar Peth Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Dividend Policy and Dividend Decision Theories.pptx
Dividend Policy and Dividend Decision Theories.pptxDividend Policy and Dividend Decision Theories.pptx
Dividend Policy and Dividend Decision Theories.pptx
 
Call US 📞 9892124323 ✅ Kurla Call Girls In Kurla ( Mumbai ) secure service
Call US 📞 9892124323 ✅ Kurla Call Girls In Kurla ( Mumbai ) secure serviceCall US 📞 9892124323 ✅ Kurla Call Girls In Kurla ( Mumbai ) secure service
Call US 📞 9892124323 ✅ Kurla Call Girls In Kurla ( Mumbai ) secure service
 

Varese italie seminar

  • 1. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria Econometrics & “Machine Learning”* A. Charpentier (Université de Rennes) joint work with E. Flachaire (AMSE) & A. Ly (Université Paris-Est) https://hal.archives-ouvertes.fr/hal-01568851/ Università degli studi dell’Insubria Seminar, May 2018. @freakonometrics freakonometrics freakonometrics.hypotheses.org 1
  • 2. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria Econometrics & “Machine Learning” Varian (2014, The Probabilistic Approach in Econometrics) @freakonometrics freakonometrics freakonometrics.hypotheses.org 2
  • 3. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria Econometrics in a “Big Data” Context Here data means y ∈ Rn and X a n × p matrix. n large means “ asymptotic” theorems can be invoked (n → ∞) Portnoy (1988, Asymptotic Behavior of Likelihood Methods for Exponential Families when the Number of Parameters Tends to Infinity) proved that MLE estimator is asymptotically Gaussian as long as p2 /n → 0. There might be High-dimensional issues if p > √ n. Nevertheless, there might be sparsity issues in high dimension (see Hastie, Tibshirani & Wainwright (2015, Statistical Learning with Sparsity)) : a sparse statistical model has only a small number of nonzero parameters or weights First order condition XT (Xβ − y) = 0 is based on QR decomposition (can be computationaly intensive). @freakonometrics freakonometrics freakonometrics.hypotheses.org 3
  • 4. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria Econometrics in a “Big Data” Context Arthur Hoerl in 1982, back on Hoerl & Kennard (1970) on Ridge regression. @freakonometrics freakonometrics freakonometrics.hypotheses.org 4
  • 5. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria Back on the history of the “regression” Galton (1870, Heriditary Genius, 1886, Regression to- wards mediocrity in hereditary stature) and Pearson & Lee (1896, On Telegony in Man, 1903 On the Laws of Inheritance in Man) studied genetic transmission of characterisitcs, e.g. the heigth. On average the child of tall parents is taller than other children, but less than his parents. “I have called this peculiarity by the name of regres- sion”, Francis Galton, 1886. @freakonometrics freakonometrics freakonometrics.hypotheses.org 5
  • 6. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria Economics and Statistics Working (1925) What Do Statistical "Demand Curves" Show? Tinbergen (1939) Statistical Testing of Business Cycle Theories @freakonometrics freakonometrics freakonometrics.hypotheses.org 6
  • 7. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria Econometrics and the Probabilistic Model Haavelno (1944, The Probabilistic Approach in Econometrics) Data (yi, xi) are seen as realizations of (iid) random varibles (Y, X) on some probabilistic space (Ω, F, P). @freakonometrics freakonometrics freakonometrics.hypotheses.org 7
  • 8. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria Mathematical Statistics Vapnik (1998, Statistical Learning Theory) @freakonometrics freakonometrics freakonometrics.hypotheses.org 8
  • 9. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria Mathematical Statistics Consider observations {y1, · · · , yn} from iid random variables Yi ∼ Fθ (with “density” fθ). Likelihood is L(θ ; y) → n i=1 fθ(yi) Maximum likelihood estimate is θ mle ∈ arg max θ∈Θ {L(θ; y)} Fisher (1912) On an absolute criterion for fitting frequency curves @freakonometrics freakonometrics freakonometrics.hypotheses.org 9
  • 10. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria Mathematical Statistics Under standard assumptions (Identification of the model, Compactness, Continuity and Dominance), the maximum likelihood estimator is consistent θ mle P −→ θ. With additional assumptions, it can be shown that the maximum likelihood estimator converges to a normal distribution √ n θ mle − θ L −→N(0, I−1 ) where I is Fisher information matrix (i.e. θ mle is asymptotically efficient). Eg. if Y ∼ N(θ, σ2 ), log L ∝ − n i=1 yi − θ σ 2 , and θmle = y (see also method of moments). max log L ⇐⇒ min n i=1 ε2 i = least squares @freakonometrics freakonometrics freakonometrics.hypotheses.org 10
  • 11. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria Data and Conditional Distribution Cook & Weisberg (1999, Applied Regression Including Computing and Graphics) From available data, describe the conditional distribution of Y given X. @freakonometrics freakonometrics freakonometrics.hypotheses.org 11
  • 12. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria Conditional Distributions and Likelihood We had a sample {y1, · · · , yn}, with Y from a N(θ, σ2 ) distribution. The natural extension if we had a sample {(y1, x1), · · · , (yn, xn)} is to assume that Y |X = x has a N(θx, σ2 ) distribution. The standard linear model is obtained when θx is linear, θx = β0 + xT β. Hence E[Y |X = x] = xT β and Var[Y |X = x] = σ2 (homoskedasticity). (if we center variables or if we add the constant in x). The MLE of β is β mle = argmin n i=1 (yi − xi T β)2 = (XT X)−1 XT y. @freakonometrics freakonometrics freakonometrics.hypotheses.org 12
  • 13. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria From a Gaussian to a Bernoulli (Logistic) Regression, and GLMs Consider y ∈ {0, 1}, so that Y has a Bernoulli distribution. The logarithm of the odds is linear, log P[Y = 1|X = x] P[Y = 1|X = x] = β0 + xT β, i.e. E[Y |X = x] = eβ0+xT β 1 + eβ0+xTβ = H(β0 + xT β), où H(·) = exp(·) 1 + exp(·) , with likelihood L(β; y, X) = n i=1 exT i β 1 + exT i β yi 1 1 + exT i β 1−yi and set β mle = arg max β∈Rp {L(β; y, X)} (solved numerically). @freakonometrics freakonometrics freakonometrics.hypotheses.org 13
  • 14. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria A Brief Excursion to Nonparametric Econometrics Two strategies are usually considered to compute m(x) = E[Y |X = x] • consider a local approximation (in the neighborhood of x), e.g. k-nearest neighbor, kernel regression, lowess • consider a functional decomposition of function m in some natural basis, e.g. splines For a kernel regression, as in Nadaraya (1964, ) and Watson (1964, ), consider some kernel function K and some bandwidth h mh(x) = sx T y = n i=1 sx,iyi where sx,i = Kh(x − xi) Kh(x − x1) + · · · + Kh(x − xn) . (in a univariate problem). @freakonometrics freakonometrics freakonometrics.hypotheses.org 14
  • 15. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria A Brief Excursion to Nonparametric Econometrics Recall - see Simonoff (1996, ) that using asymtptotic approximations bias[mh(x)] = E[mh(x)] − m(x) ∼ h2 C1 2 m (x) + C2m (x) f (x) f(x) Var[mh(x)] ∼ C3 nh σ(x) f(x) for some constants C1, C2 and C3. Set mse[mh(x)] = bias2 [mh(x)] + Var[mh(x)] and consider the integrated version mise[mh] = mse[mh(x)]dF(x) @freakonometrics freakonometrics freakonometrics.hypotheses.org 15
  • 16. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria A Brief Excursion to Nonparametric Econometrics mise[mh] ∼ bias2 h4 4 x2 k(x)dx 2 m (x) + 2m (x) f (x) f(x) 2 dx + variance σ2 nh k2 (x)dx · dx f(x) , Thus, the optimal value is h = O(n−1/5 ) (see Silverman’s rule of thumb). Obtained using asymptotic theory and approximations. @freakonometrics freakonometrics freakonometrics.hypotheses.org 16
  • 17. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria A Brief Excursion to Nonparametric Econometrics Choice of meta-parameter h : trade-off bias / variance. @freakonometrics freakonometrics freakonometrics.hypotheses.org 17
  • 18. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria Model (and Variable) Choice Suppose that the true model is yi = x1,iβ1 + x2,iβ2 + εi, but we estimate the model on x1 (only) yi = X1,ib1 + ηi. b1 = (XT 1 X1)−1 XT 1 Y = (XT 1 X1)−1 XT 1 [X1,iβ1 + X2,iβ2 + ε] = (XT 1 X1)−1 XT 1 X1β1 + (XT 1 X1)−1 XT 1 X2β2 + (XT 1 X1)−1 XT 1 ε = β1 + (X1X1)−1 XT 1 X2β2 β12 + (XT 1 X1)−1 XT 1 εi νi i.e. E(b1) = β1 + β12. If XT 1 X2 = 0 (X1 ⊥ X2), E(b1) = β1 Conversely, assume that the true model is yi = x1,iβ1 + εi but we estimated on (unnecessary) variables X2 yi = x1,ib1 + x2,ib2 + ηi. Here estimation is unbiased E(b1) = β1, but the estimator is not efficient... @freakonometrics freakonometrics freakonometrics.hypotheses.org 18
  • 19. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria Model (and Variable) Choice, pluralitas non est ponenda sine necessitate From the variance decomposition 1 n n i=1 (yi − ¯y)2 total variance = 1 n n i=1 (yi − yi)2 residual variance + 1 n n i=1 (yi − ¯y)2 explained variance Define the R2 as R2 = n i=1(yi − ¯y)2 − n i=1(yi − yi)2 n i=1(yi − ¯y)2 or better, the adjusted R2 ¯R2 = 1 − (1 − R2 ) n − 1 n − p = R2 − (1 − R2 ) p − 1 n − p penalty . @freakonometrics freakonometrics freakonometrics.hypotheses.org 19
  • 20. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria Model (and Variable) Choice, pluralitas non est ponenda sine necessitate One can also consider the log-likelihood, or some penalized version AIC = − log L − 2 · p penalty or BIC = − log L − log(n) · p penalty . Objective is min{− log L(β)}, and quality is measured via − log L(β) − 2p. Goodhart’s law “When a measure becomes a target, it ceases to be a good measure” @freakonometrics freakonometrics freakonometrics.hypotheses.org 20
  • 21. Arthur CHARPENTIER, Econometrics & “Machine Learning”, May 2018, Università degli studi dell’Insubria Model (and Variable) Choice, pluralitas non est ponenda sine necessitate Alternative approach on penalization : consider a linear predictor m ∈ M M = m : m(x) = sh(x)T y where S = (s(x1), · · · , s(xn))T smoothing matrix Suppose that the true model is y = m0(x) + ε with E[ε] = 0 and Var[ε] = σ2 I, so that m0(x) = E[Y |X = x]. Quadratic risk R(m) = E (Y − m(X))2 is E (Y − m0(X))2 error + E (m0(X) − E[m(X)])2 bias2 + E (E[m(X)] − m(X))2 variance . Consider empirical risk, Rn(m) = 1 n n i=1 (yi − m(xi))2 , then one can prove that R(m) = E Rn(m) + 2σ2 n trace(S), see Mallow’s Cp from Mallow (1973, Some Comments on Cp) @freakonometrics freakonometrics freakonometrics.hypotheses.org 21
• 22. Introduction: "Machine Learning"?
Friedman (1997, Data Mining and Statistics: What's the Connection?)
• 23. Introduction: "Machine Learning"?
Machine learning (initially) did not need any probabilistic framework. Consider observations $(y_i, \boldsymbol{x}_i)$, a loss function $\ell$, and some class of models $\mathcal{M}$. The goal is to solve
$$m^\star \in \underset{m\in\mathcal{M}}{\text{argmin}}\left\{\sum_{i=1}^n \ell(y_i, m(\boldsymbol{x}_i))\right\}.$$
Watt et al. (2016, Machine Learning Refined: Foundations, Algorithms, and Applications)
• 24. Probabilistic Foundations and "Statistical Learning"
Vapnik (2000, The Nature of Statistical Learning Theory)
• 25. Probabilistic Foundations and "Statistical Learning"
Valiant (1984, A Theory of the Learnable)
• 26. Probabilistic Foundations and "Statistical Learning"
Consider a classification problem, $y \in \{-1,+1\}$. The "true" model is $m_0$ (the target), and consider some model $m$. Let $P$ denote the (unknown) distribution of the $X$'s. The error of $m$ can be written
$$\mathcal{R}_{P,m_0}(m) = P\big[m(X) \neq m_0(X)\big] = P\big\{\boldsymbol{x} : m(\boldsymbol{x}) \neq m_0(\boldsymbol{x})\big\}, \quad\text{where } X \sim P.$$
Naturally, we can assume that the observations $\boldsymbol{x}_i$ were drawn from $P$. The empirical risk is $\widehat{\mathcal{R}}(m) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(m(\boldsymbol{x}_i) \neq y_i)$.
Since it is not possible to get a perfect model, we should seek a Probably Approximately Correct (PAC) one: given $\varepsilon > 0$ and $\delta > 0$, we want to find $m$ such that $\mathcal{R}_{P,m_0}(m) \leq \varepsilon$ with probability at least $1-\delta$. More precisely, we want an algorithm that might lead us to such a candidate $m$.
• 27. Probabilistic Foundations and "Statistical Learning"
Suppose that $\mathcal{M}$ contains a finite number of models. For any $\varepsilon$, $\delta$, $P$ and $m_0$, if we have enough observations ($n \geq \varepsilon^{-1}\log[\delta^{-1}|\mathcal{M}|]$) and if
$$m^\star \in \underset{m\in\mathcal{M}}{\text{argmin}}\left\{\frac{1}{n}\sum_{i=1}^n \mathbf{1}(m(\boldsymbol{x}_i) \neq y_i)\right\},$$
then with probability higher than $1-\delta$, $\mathcal{R}_{P,m_0}(m^\star) \leq \varepsilon$. Here $n_{\mathcal{M}}(\varepsilon,\delta) = \varepsilon^{-1}\log[\delta^{-1}|\mathcal{M}|]$ is called the sample complexity, and $\mathcal{M}$ is PAC-learnable. If $\mathcal{M}$ is not finite, the problem is more complicated: it is necessary to define a dimension $d$ of $\mathcal{M}$, the so-called VC-dimension, which will be a substitute for $|\mathcal{M}|$.
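A tiny numerical illustration (added here, not on the slide): with 1,000 candidate models, $\varepsilon = 5\%$ and $\delta = 1\%$, the bound above asks for roughly 230 observations. The helper below is a hypothetical sketch of that computation.

```python
import numpy as np

def pac_sample_size(eps, delta, n_models):
    """Number of observations ensuring risk <= eps with probability >= 1 - delta
    for empirical risk minimization over a finite class of models."""
    return int(np.ceil(np.log(n_models / delta) / eps))

# e.g. 1000 candidate models, eps = 5%, delta = 1%
print(pac_sample_size(0.05, 0.01, 1000))   # about 231 observations
```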
• 28. Penalization and Over/Under-Fit
The risk of $\widehat{m}$ cannot be estimated on the same data that were used to fit $\widehat{m}$; alternatives are cross-validation and the bootstrap.
• 29. The objective function and loss functions
Consider a loss function $\ell : \mathcal{Y}\times\mathcal{Y} \to \mathbb{R}_+$ and set
$$m^\star \in \underset{m\in\mathcal{M}}{\text{argmin}}\left\{\sum_{i=1}^n \ell(y_i, m(\boldsymbol{x}_i))\right\}.$$
A classical loss function is the quadratic one, $\ell_2(u,v) = (u-v)^2$. Recall that $\bar{y} = \underset{m\in\mathbb{R}}{\text{argmin}}\left\{\sum_{i=1}^n \ell_2(y_i, m)\right\}$, or (for the continuous version) $\mathbb{E}(Y) = \underset{m\in\mathbb{R}}{\text{argmin}}\left\{\|Y-m\|_2^2\right\} = \underset{m\in\mathbb{R}}{\text{argmin}}\left\{\mathbb{E}[\ell_2(Y,m)]\right\}$.
One can also consider the least-absolute-value one, $\ell_1(u,v) = |u-v|$. Recall that $\text{median}[\boldsymbol{y}] = \underset{m\in\mathbb{R}}{\text{argmin}}\left\{\sum_{i=1}^n \ell_1(y_i, m)\right\}$. The optimisation problem $m^\star = \underset{m\in\mathcal{M}}{\text{argmin}}\left\{\sum_{i=1}^n |y_i - m(\boldsymbol{x}_i)|\right\}$ is well known in robust econometrics.
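As a quick check of these two argmin identities (a sketch added here, not on the slides; the toy data are made up), minimizing the quadratic loss over a grid of constants recovers the mean, and minimizing the absolute loss recovers the median.

```python
import numpy as np

y = np.array([1.0, 2.0, 2.5, 4.0, 10.0])
grid = np.linspace(0, 12, 12001)          # candidate constants m

# quadratic loss is minimized at the mean, absolute loss at the median
m_quad = grid[np.argmin([np.sum((y - m) ** 2) for m in grid])]
m_abs = grid[np.argmin([np.sum(np.abs(y - m)) for m in grid])]
print(m_quad, y.mean())      # both about 3.9
print(m_abs, np.median(y))   # both about 2.5
```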
• 30. The objective function and loss functions
More generally, since $\ell_1(y,m) = |(y-m)(1/2 - \mathbf{1}_{y\leq m})|$, the loss can be generalized to any probability level $\tau\in(0,1)$:
$$\widehat{m}_\tau = \underset{m\in\mathcal{M}_0}{\text{argmin}}\left\{\sum_{i=1}^n \ell^q_\tau(y_i, m(\boldsymbol{x}_i))\right\} \quad\text{with } \ell^q_\tau(x,y) = (x-y)(\tau - \mathbf{1}_{x\leq y}),$$
which is quantile regression, see Koenker (2005, Quantile Regression). One can also consider expectiles, with loss function $\ell^e_\tau(x,y) = (x-y)^2\cdot|\tau - \mathbf{1}_{x\leq y}|$, see Aigner et al. (1977, Formulation and Estimation of Stochastic Frontier Production Function Models).
Gneiting (2011, Making and Evaluating Point Forecasts) introduced the concept of elicitable statistics: $T$ is elicitable if there exists $\ell : \mathbb{R}\times\mathbb{R}\to\mathbb{R}_+$ such that
$$T(Y) = \underset{x\in\mathbb{R}}{\text{argmin}}\left\{\int_{\mathbb{R}} \ell(x,y)\,dF(y)\right\} = \underset{x\in\mathbb{R}}{\text{argmin}}\left\{\mathbb{E}[\ell(x,Y)]\right\}, \quad\text{where } Y \overset{\mathcal{L}}{\sim} F.$$
Thus, the mean, the median and any quantile are elicitable.
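To illustrate the quantile (pinball) loss (a sketch added here, not in the slides; the exponential sample and the grid search are assumptions), minimizing $\sum_i \ell^q_\tau(y_i, m)$ over constants $m$ recovers the empirical $\tau$-quantile.

```python
import numpy as np

def pinball(y, m, tau):
    """Quantile (pinball) loss: (y - m)(tau - 1{y <= m}), summed over the sample."""
    return np.sum((y - m) * (tau - (y <= m)))

rng = np.random.default_rng(0)
y = rng.exponential(size=5000)
grid = np.linspace(0, 10, 5001)

tau = 0.9
m_tau = grid[np.argmin([pinball(y, m, tau) for m in grid])]
print(m_tau, np.quantile(y, tau))   # both close to -log(0.1), about 2.30
```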
• 31. Weak and iterative learning (Gradient Boosting)
We want to solve
$$m^\star \in \underset{m\in\mathcal{M}}{\text{argmin}}\left\{\sum_{i=1}^n \ell(y_i, m(\boldsymbol{x}_i))\right\}$$
over a set of possible predictors $\mathcal{M}$. In a classical optimization problem, $x^\star \in \underset{x\in\mathbb{R}^d}{\text{argmin}}\{f(x)\}$, we use gradient descent to approximate $x^\star$, based on the expansion
$$f(x_k) \sim f(x_{k-1}) + (x_k - x_{k-1})\,\underbrace{\alpha_k \nabla f(x_{k-1})}_{\text{step in } x,\ f \text{ fixed}}.$$
Here it is some sort of dual problem, $m^\star \in \underset{m\in\mathcal{M}}{\text{argmin}}\{\cdots\}$, and a functional gradient descent can be considered,
$$m_k(\boldsymbol{x}) \sim m_{k-1}(\boldsymbol{x}) + \underbrace{(m_k - m_{k-1})\,\beta_k}_{\text{step in } m,\ \boldsymbol{x} \text{ fixed}},$$
where the update is now on the function $m$ while the point $\boldsymbol{x}$ is held fixed.
• 32. Weak and iterative learning (Gradient Boosting)
$$m^{(k)} = m^{(k-1)} + \underset{h\in\mathcal{H}}{\text{argmin}}\left\{\sum_{i=1}^n \ell\big(\underbrace{y_i - m^{(k-1)}(\boldsymbol{x}_i)}_{\varepsilon_{k,i}},\, h(\boldsymbol{x}_i)\big)\right\},$$
where
$$m^{(k)}(\cdot) = \underbrace{m_1(\cdot)}_{\sim \boldsymbol{y}} + \underbrace{m_2(\cdot)}_{\sim \boldsymbol{\varepsilon}_1} + \underbrace{m_3(\cdot)}_{\sim \boldsymbol{\varepsilon}_2} + \cdots + \underbrace{m_k(\cdot)}_{\sim \boldsymbol{\varepsilon}_{k-1}} = m^{(k-1)}(\cdot) + m_k(\cdot).$$
Equivalently, start with $m^{(0)} = \underset{m\in\mathbb{R}}{\text{argmin}}\left\{\sum_{i=1}^n \ell(y_i, m)\right\}$ and solve iteratively
$$m^{(k)} = m^{(k-1)} + \underset{h\in\mathcal{H}}{\text{argmin}}\left\{\sum_{i=1}^n \ell\big(y_i,\, m^{(k-1)}(\boldsymbol{x}_i) + h(\boldsymbol{x}_i)\big)\right\}.$$
• 33. Weak and iterative learning (Gradient Boosting)
If $\mathcal{H}$ is a set of differentiable functions,
$$m^{(k)} = m^{(k-1)} - \gamma_k \sum_{i=1}^n \nabla_{m^{(k-1)}}\,\ell\big(y_i, m^{(k-1)}(\boldsymbol{x}_i)\big),$$
where
$$\gamma_k = \underset{\gamma}{\text{argmin}}\left\{\sum_{i=1}^n \ell\Big(y_i,\, m^{(k-1)}(\boldsymbol{x}_i) - \gamma\,\nabla_{m^{(k-1)}}\,\ell\big(y_i, m^{(k-1)}(\boldsymbol{x}_i)\big)\Big)\right\}.$$
[Figure: two scatterplot panels, with x in [0, 2π] and y in [-1.5, 1.5], illustrating successive fits.]
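For the quadratic loss, the negative gradient is simply the residual, so each weak learner is fitted to the current residuals. Below is a minimal boosting sketch (added here, not from the slides) using shallow regression trees from scikit-learn; the simulated data, the number of rounds and the shrinkage factor nu are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 300
x = rng.uniform(0, 2 * np.pi, (n, 1))
y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=n)

def boost(x, y, n_rounds=100, nu=0.1, depth=2):
    """Gradient boosting with squared loss: each weak learner fits the residuals,
    i.e. the negative gradient of the loss at the current fit."""
    pred = np.full(len(y), y.mean())     # m^(0): the constant minimizing the loss
    learners = []
    for _ in range(n_rounds):
        resid = y - pred                 # epsilon_k, the negative gradient
        tree = DecisionTreeRegressor(max_depth=depth).fit(x, resid)
        pred += nu * tree.predict(x)     # small step nu plays the role of gamma_k
        learners.append(tree)            # m^(k) is the sum of the weak learners
    return pred

pred = boost(x, y)
print(np.mean((y - pred) ** 2))          # in-sample quadratic risk
```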
• 34. Penalization and Over/Under-Fit
Consider some penalized objective function: given $\lambda \geq 0$,
$$(\widehat{\beta}_{0,\lambda}, \widehat{\boldsymbol{\beta}}_\lambda) = \text{argmin}\left\{\sum_{i=1}^n \ell(y_i, \beta_0 + \boldsymbol{x}^{\mathsf{T}}\boldsymbol{\beta}) + \lambda\,\|\boldsymbol{\beta}\|\right\},$$
with centered and scaled variables, for some penalty $\|\boldsymbol{\beta}\|$. Note that there is a Bayesian interpretation,
$$\underbrace{\mathbb{P}[\theta|\boldsymbol{y}]}_{\text{a posteriori}} \propto \underbrace{\mathbb{P}[\boldsymbol{y}|\theta]}_{\text{likelihood}} \cdot \underbrace{\mathbb{P}[\theta]}_{\text{a priori}}, \quad\text{i.e. (up to an additive constant)}\quad \log \mathbb{P}[\theta|\boldsymbol{y}] = \underbrace{\log \mathbb{P}[\boldsymbol{y}|\theta]}_{\text{log-likelihood}} + \underbrace{\log \mathbb{P}[\theta]}_{\text{penalization}}.$$
• 35. Going further: ℓ0, ℓ1 and ℓ2 penalties
Define, for $\boldsymbol{a}\in\mathbb{R}^d$, $\|\boldsymbol{a}\|_{\ell_0} = \sum_{i=1}^d \mathbf{1}(a_i \neq 0)$, $\|\boldsymbol{a}\|_{\ell_1} = \sum_{i=1}^d |a_i|$ and $\|\boldsymbol{a}\|_{\ell_2} = \big(\sum_{i=1}^d a_i^2\big)^{1/2}$. Constrained versus penalized optimization:
(ℓ0): $\underset{\boldsymbol{\beta};\,\|\boldsymbol{\beta}\|_{\ell_0}\leq s}{\text{argmin}}\sum_{i=1}^n \ell(y_i, \beta_0 + \boldsymbol{x}^{\mathsf{T}}\boldsymbol{\beta})$  vs.  $\underset{\boldsymbol{\beta},\lambda}{\text{argmin}}\left\{\sum_{i=1}^n \ell(y_i, \beta_0 + \boldsymbol{x}^{\mathsf{T}}\boldsymbol{\beta}) + \lambda\|\boldsymbol{\beta}\|_{\ell_0}\right\}$
(ℓ1): $\underset{\boldsymbol{\beta};\,\|\boldsymbol{\beta}\|_{\ell_1}\leq s}{\text{argmin}}\sum_{i=1}^n \ell(y_i, \beta_0 + \boldsymbol{x}^{\mathsf{T}}\boldsymbol{\beta})$  vs.  $\underset{\boldsymbol{\beta},\lambda}{\text{argmin}}\left\{\sum_{i=1}^n \ell(y_i, \beta_0 + \boldsymbol{x}^{\mathsf{T}}\boldsymbol{\beta}) + \lambda\|\boldsymbol{\beta}\|_{\ell_1}\right\}$
(ℓ2): $\underset{\boldsymbol{\beta};\,\|\boldsymbol{\beta}\|_{\ell_2}\leq s}{\text{argmin}}\sum_{i=1}^n \ell(y_i, \beta_0 + \boldsymbol{x}^{\mathsf{T}}\boldsymbol{\beta})$  vs.  $\underset{\boldsymbol{\beta},\lambda}{\text{argmin}}\left\{\sum_{i=1}^n \ell(y_i, \beta_0 + \boldsymbol{x}^{\mathsf{T}}\boldsymbol{\beta}) + \lambda\|\boldsymbol{\beta}\|_{\ell_2}\right\}$
Assume in what follows that $\ell$ is the quadratic loss.
• 36. Going further: ℓ0, ℓ1 and ℓ2 penalties
The two problems (ℓ2) are equivalent: for any $(\beta^\star, s^\star)$ solution of the constrained (left) problem, there exists $\lambda^\star$ such that $(\beta^\star, \lambda^\star)$ is a solution of the penalized (right) problem, and conversely. The two problems (ℓ1) are equivalent in the same sense. Nevertheless, even with this theoretical equivalence, there might be numerical issues, since the solution is not necessarily unique.
The two problems (ℓ0) are not equivalent: if $(\beta^\star, \lambda^\star)$ is a solution of the penalized problem, then there exists $s^\star$ such that $\beta^\star$ is a solution of the constrained one, but the converse is not true.
More generally, for an ℓp norm,
- sparsity is obtained when $p \leq 1$,
- convexity is obtained when $p \geq 1$.
• 37. Using the ℓ0 penalty
Foster & George (1994, The Risk Inflation Criterion for Multiple Regression) tried to solve the penalized problem (ℓ0) directly, but it is a complex combinatorial problem in high dimension (Natarajan (1995, Sparse Approximate Solutions to Linear Systems) proved that it is NP-hard). One can prove that if $\lambda \sim \sigma^2\log(p)$, then
$$\mathbb{E}\big([\boldsymbol{x}^{\mathsf{T}}\widehat{\beta} - \boldsymbol{x}^{\mathsf{T}}\beta_0]^2\big) \leq \underbrace{\mathbb{E}\big([\boldsymbol{x}_S^{\mathsf{T}}\widehat{\beta}_S - \boldsymbol{x}^{\mathsf{T}}\beta_0]^2\big)}_{=\sigma^2\#S}\cdot\big(4\log p + 2 + o(1)\big).$$
In that case,
$$\widehat{\beta}^{\text{sub}}_{\lambda,j} = \begin{cases} 0 & \text{if } j \notin S_\lambda(\widehat{\beta}),\\ \widehat{\beta}^{\text{ols}}_j & \text{if } j \in S_\lambda(\widehat{\beta}),\end{cases}$$
where $S_\lambda(\widehat{\beta})$ is the set of non-null values in the solution of (ℓ0).
• 38. Using the ℓ2 penalty
With the ℓ2 norm, (ℓ2) is the Ridge regression,
$$\widehat{\boldsymbol{\beta}}^{\text{ridge}}_\lambda = (X^{\mathsf{T}}X + \lambda\mathbb{I})^{-1}X^{\mathsf{T}}\boldsymbol{y} = (X^{\mathsf{T}}X + \lambda\mathbb{I})^{-1}(X^{\mathsf{T}}X)\,\widehat{\boldsymbol{\beta}}^{\text{ols}},$$
see Tikhonov (1943, On the Stability of Inverse Problems). Hence
$$\text{bias}[\widehat{\boldsymbol{\beta}}^{\text{ridge}}_\lambda] = -\lambda[X^{\mathsf{T}}X + \lambda\mathbb{I}]^{-1}\boldsymbol{\beta}, \qquad \mathrm{Var}[\widehat{\boldsymbol{\beta}}^{\text{ridge}}_\lambda] = \sigma^2[X^{\mathsf{T}}X + \lambda\mathbb{I}]^{-1}X^{\mathsf{T}}X[X^{\mathsf{T}}X + \lambda\mathbb{I}]^{-1}.$$
With orthonormal variables (i.e. $X^{\mathsf{T}}X = \mathbb{I}$), we get
$$\text{bias}[\widehat{\boldsymbol{\beta}}^{\text{ridge}}_\lambda] = -\frac{\lambda}{1+\lambda}\,\boldsymbol{\beta} \quad\text{and}\quad \mathrm{Var}[\widehat{\boldsymbol{\beta}}^{\text{ridge}}_\lambda] = \frac{\sigma^2}{(1+\lambda)^2}\,\mathbb{I} = \frac{\mathrm{Var}[\widehat{\boldsymbol{\beta}}^{\text{ols}}]}{(1+\lambda)^2}.$$
Note that $\mathrm{Var}[\widehat{\boldsymbol{\beta}}^{\text{ridge}}_\lambda] < \mathrm{Var}[\widehat{\boldsymbol{\beta}}^{\text{ols}}]$. Further, $\text{mse}[\widehat{\boldsymbol{\beta}}^{\text{ridge}}_\lambda]$ is minimal when $\lambda^\star = p\sigma^2/\boldsymbol{\beta}^{\mathsf{T}}\boldsymbol{\beta}$.
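The closed form above translates directly into code; the sketch below (added here, not part of the slides) computes $(X^{\mathsf{T}}X + \lambda\mathbb{I})^{-1}X^{\mathsf{T}}\boldsymbol{y}$ on simulated data and shows the coefficients shrinking towards zero as $\lambda$ grows. The design and the true coefficients are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta + rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimator (X'X + lambda I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in (0.0, 1.0, 10.0, 100.0):   # lam = 0 gives plain OLS
    b = ridge(X, y, lam)
    print(lam, np.round(b, 3), round(float(np.linalg.norm(b)), 3))  # the norm shrinks with lambda
```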
• 39. Using the ℓ1 penalty
With the ℓ1 norm, problem (ℓ1) is not always strictly convex, and the optimum is not always unique (e.g. if $X^{\mathsf{T}}X$ is singular). But with the quadratic loss, the objective is strictly convex in $X\boldsymbol{\beta}$, so at least $X\widehat{\boldsymbol{\beta}}$ is unique. Further, note that solutions are necessarily coherent in the signs of the coefficients: it is not possible to have $\widehat{\beta}_j < 0$ for one solution and $\widehat{\beta}_j > 0$ for another one.
In many cases, problem (ℓ1) yields a corner-type solution, which can be seen as a "best subset" solution, as in (ℓ0).
• 40. Using the ℓ1 penalty
Consider a simple regression $y_i = x_i\beta + \varepsilon_i$, with an ℓ1 penalty and an ℓ2 loss function. Problem (ℓ1) becomes
$$\min\big\{\boldsymbol{y}^{\mathsf{T}}\boldsymbol{y} - 2\boldsymbol{y}^{\mathsf{T}}\boldsymbol{x}\beta + \beta\boldsymbol{x}^{\mathsf{T}}\boldsymbol{x}\beta + 2\lambda|\beta|\big\}.$$
The first-order condition can be written $-2\boldsymbol{y}^{\mathsf{T}}\boldsymbol{x} + 2\boldsymbol{x}^{\mathsf{T}}\boldsymbol{x}\beta \pm 2\lambda = 0$ (the sign in $\pm$ being the sign of $\beta$). Assume that $\boldsymbol{y}^{\mathsf{T}}\boldsymbol{x} > 0$; then the solution is
$$\widehat{\beta}^{\text{lasso}}_\lambda = \max\left\{\frac{\boldsymbol{y}^{\mathsf{T}}\boldsymbol{x} - \lambda}{\boldsymbol{x}^{\mathsf{T}}\boldsymbol{x}},\, 0\right\}$$
(we get a corner solution when $\lambda$ is large). In higher dimension, see Tibshirani & Wasserman (2016, A Closer Look at Sparse Regression) or Candès & Plan (2009, Near-ideal Model Selection by ℓ1 Minimization). Under some additional technical assumptions, the lasso estimator is "sparsistent", in the sense that the support of $\widehat{\boldsymbol{\beta}}^{\text{lasso}}_\lambda$ is the same as that of $\boldsymbol{\beta}$.
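A quick numerical check of this closed form (a sketch added here, not in the slides; the simulated single-regressor data and the grid search are illustrative assumptions): the analytic solution and a brute-force minimization of the penalized objective coincide.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)      # single regressor, no intercept

lam = 10.0
# closed form, valid when y'x > 0; the solution hits the corner 0 when lambda >= y'x
beta_closed = max((y @ x - lam) / (x @ x), 0.0)

# brute-force check on a grid of the penalized objective ||y - x b||^2 + 2 lambda |b|
grid = np.linspace(-1.0, 1.0, 20001)
obj = [np.sum((y - x * b) ** 2) + 2 * lam * abs(b) for b in grid]
print(beta_closed, grid[np.argmin(obj)])   # the two values agree up to the grid step
```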
• 41. Going further: ℓ0, ℓ1 and ℓ2 penalties
Thus, the lasso can be used for variable selection (see Hastie et al. (2001, The Elements of Statistical Learning)). With orthonormal covariates, one can prove that
$$\widehat{\beta}^{\text{sub}}_{\lambda,j} = \widehat{\beta}^{\text{ols}}_j\,\mathbf{1}_{|\widehat{\beta}^{\text{ols}}_j| > b}, \qquad \widehat{\beta}^{\text{ridge}}_{\lambda,j} = \frac{\widehat{\beta}^{\text{ols}}_j}{1+\lambda} \qquad\text{and}\qquad \widehat{\beta}^{\text{lasso}}_{\lambda,j} = \text{sign}\big[\widehat{\beta}^{\text{ols}}_j\big]\cdot\big(|\widehat{\beta}^{\text{ols}}_j| - \lambda\big)_+.$$
• 42. Econometric Modeling
Data $\{(y_i, \boldsymbol{x}_i)\}$, for $i = 1,\cdots,n$, with $\boldsymbol{x}_i \in \mathcal{X} \subset \mathbb{R}^p$ and $y_i \in \mathcal{Y}$. A model is a mapping $m : \mathcal{X} \to \mathcal{Y}$:
- regression, $\mathcal{Y} = \mathbb{R}$ (but also $\mathcal{Y} = \mathbb{N}$);
- classification, $\mathcal{Y} = \{0,1\}$, $\{-1,+1\}$, $\{\bullet, \bullet\}$ (binary, or more).
Classification models are based on two steps:
- a score function, $s(\boldsymbol{x}) = \mathbb{P}(Y=1|X=\boldsymbol{x}) \in [0,1]$;
- a classifier, $s(\boldsymbol{x}) \to \widehat{y} \in \{0,1\}$.
[Figure: scatterplot on the unit square illustrating scores and classified points.]
• 43. Classification Trees
To split a node $N$ into two, $\{N_L, N_R\}$, consider
$$I(N_L, N_R) = \sum_{x\in\{L,R\}} \frac{n_x}{n}\, I(N_x),$$
e.g. the Gini index (used originally in CART, see Breiman et al. (1984, Classification and Regression Trees)),
$$\text{gini}(N_L, N_R) = -\sum_{x\in\{L,R\}} \frac{n_x}{n} \sum_{y\in\{0,1\}} \frac{n_{x,y}}{n_x}\left(1 - \frac{n_{x,y}}{n_x}\right),$$
and the cross-entropy (used in C4.5 and C5.0),
$$\text{entropy}(N_L, N_R) = -\sum_{x\in\{L,R\}} \frac{n_x}{n} \sum_{y\in\{0,1\}} \frac{n_{x,y}}{n_x} \log\frac{n_{x,y}}{n_x}.$$
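These two split criteria are straightforward to code; the sketch below (added here, not part of the slides) follows the sign conventions of the formulas above, so the Gini criterion is non-positive and the cross-entropy non-negative, both reaching 0 for pure nodes. The example labels are made up.

```python
import numpy as np

def split_impurity(y_left, y_right, measure="gini"):
    """Weighted impurity of a split {N_L, N_R}, for labels in {0, 1},
    following the Gini and cross-entropy formulas on the slide."""
    n = len(y_left) + len(y_right)
    total = 0.0
    for y_node in (np.asarray(y_left), np.asarray(y_right)):
        n_x = len(y_node)
        for cls in (0, 1):
            p = np.mean(y_node == cls)          # n_{x,y} / n_x
            if measure == "gini":
                total -= (n_x / n) * p * (1 - p)
            elif p > 0:                          # entropy, with p log p taken as 0 at p = 0
                total -= (n_x / n) * p * np.log(p)
    return total

print(split_impurity([0, 0, 0, 1], [1, 1, 1, 0], "gini"))      # -0.375
print(split_impurity([0, 0, 0, 1], [1, 1, 1, 0], "entropy"))   # about 0.562
```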
• 44. Model Selection & ROC Curves
Given a scoring function $\widehat{m}(\cdot)$, with $\widehat{m}(\boldsymbol{x}) = \widehat{\mathbb{E}}[Y|X=\boldsymbol{x}]$, and a threshold $s\in(0,1)$, set
$$\widehat{Y}^{(s)} = \mathbf{1}[\widehat{m}(\boldsymbol{x}) > s] = \begin{cases} 1 & \text{if } \widehat{m}(\boldsymbol{x}) > s,\\ 0 & \text{if } \widehat{m}(\boldsymbol{x}) \leq s.\end{cases}$$
Define the confusion matrix as $N = [N_{u,v}]$:

                 | Y = 0     | Y = 1     |
  Ŷs = 0         | TNs       | FNs       | TNs + FNs
  Ŷs = 1         | FPs       | TPs       | FPs + TPs
                 | TNs + FPs | FNs + TPs | n

The ROC curve is
$$\text{ROC}_s = \left(\frac{\text{FP}_s}{\text{FP}_s + \text{TN}_s},\, \frac{\text{TP}_s}{\text{TP}_s + \text{FN}_s}\right), \quad s\in(0,1).$$
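The sketch below (added here, not from the slides) computes the confusion counts and the corresponding ROC point (false positive rate, true positive rate) at a few thresholds, on simulated scores; the score construction is an illustrative assumption.

```python
import numpy as np

def roc_point(score, y, s):
    """False positive rate and true positive rate at threshold s."""
    y_hat = (score > s).astype(int)
    tp = np.sum((y_hat == 1) & (y == 1))
    fp = np.sum((y_hat == 1) & (y == 0))
    tn = np.sum((y_hat == 0) & (y == 0))
    fn = np.sum((y_hat == 0) & (y == 1))
    return fp / (fp + tn), tp / (tp + fn)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
score = np.clip(0.7 * y + 0.3 * rng.uniform(size=1000), 0, 1)  # an informative score
for s in (0.25, 0.5, 0.75):
    print(s, roc_point(score, y, s))
```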
• 45. Machine Learning Tools versus Econometric Techniques
Car seats sales data, used in James et al. (2014, An Introduction to Statistical Learning).
AUC: logit 0.9544, bagging 0.8973, random forest 0.9050, boosting 0.9313.
[Figure: ROC curves (sensitivity vs. specificity) for logit, bagging, random forest and boosting, with two thresholds highlighted: 0.285 at (0.860, 0.927) and 0.500 at (0.907, 0.860).]
See Section 5.1 in C. et al. (2018, Econométrie et Machine Learning).
• 46. Machine Learning Tools versus Econometric Techniques
Caravan sales data, used in James et al. (2014, An Introduction to Statistical Learning).
AUC: logit 0.7372, bagging 0.7198, random forest 0.7154, boosting 0.7691.
[Figure: ROC curves (sensitivity vs. specificity) for logit, bagging, random forest and boosting.]
See Section 5.2 in C. et al. (2018, Econométrie et Machine Learning).
• 47. Machine Learning Tools versus Econometric Techniques
Credit database, used in Nisbet et al. (2001, Handbook of Statistical Analysis and Data Mining).
• 48. Conclusion
Breiman (2001, The Two Cultures). Econometrics and Machine Learning focus on the same problem, but with different cultures:
- econometrics relies on a probabilistic interpretation of the models (unfortunately sometimes misleading, e.g. the p-value) and can deal with non-i.i.d. data (time series, panel, etc.);
- machine learning is about predictive modeling and generalization with algorithmic tools, based on the bootstrap (sampling and sub-sampling), cross-validation, variable selection, nonlinearities, cross effects, etc.