Slides econ-lm
1. Arthur Charpentier, Econometrics & Machine Learning - 2017
Econometrics & Machine Learning
A. Charpentier, E. Flachaire & A. Ly
https://arxiv.org/abs/1708.06992
@freakonometrics freakonometrics freakonometrics.hypotheses.org 1
2. Arthur Charpentier, Econometrics & Machine Learning - 2017
Probabilistic Foundations of Econometrics
see Haavelmo (1944)
see Duo (1997)
3. Arthur Charpentier, Econometrics & Machine Learning - 2017
Probabilistic Foundations of Econometrics
There is a probability space $(\Omega,\mathcal{F},\mathbb{P})$ such that the data $\{y_i,\boldsymbol{x}_i\}$ can be seen as realizations of random variables $\{Y_i,\boldsymbol{X}_i\}$.
Consider a conditional distribution for $Y\mid X$, e.g.
$$(Y\mid X=\boldsymbol{x})\overset{\mathcal{L}}{\sim}\mathcal{N}\big(\mu(\boldsymbol{x}),\sigma^{2}\big)\quad\text{where }\mu(\boldsymbol{x})=\beta_0+\boldsymbol{x}^{\top}\beta,\ \beta\in\mathbb{R}^{p},$$
for the linear model, with some extension when the distribution is in the exponential family (GLM).
Then use maximum likelihood for inference, to derive confidence intervals (quantification of uncertainty), etc.
Importance of unbiased estimators (the Cramér-Rao lower bound for the variance).
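As a rough illustration of this workflow (not from the slides; simulated data and the statsmodels library are assumed), the Gaussian linear model is fitted by maximum likelihood, which coincides with OLS here, and confidence intervals quantify the uncertainty on $\beta$:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))                          # covariates x_i
beta = np.array([1.0, -2.0, 0.5])                    # "true" beta (simulation only)
y = 0.3 + X @ beta + rng.normal(scale=1.0, size=n)   # y_i | x_i ~ N(beta0 + x'beta, sigma^2)

# Maximum likelihood = OLS for the Gaussian linear model
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.params)                   # estimates of (beta0, beta)
print(model.conf_int(alpha=0.05))     # 95% confidence intervals
```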
4. Arthur Charpentier, Econometrics & Machine Learning - 2017
Loss Functions
Gneiting (2009), in a statistical context: a statistic $T$ is said to be elicitable if there exists a loss $\ell:\mathbb{R}\times\mathbb{R}\to\mathbb{R}_{+}$ such that
$$T(Y)=\underset{x\in\mathbb{R}}{\text{argmin}}\int_{\mathbb{R}}\ell(x,y)\,dF(y)=\underset{x\in\mathbb{R}}{\text{argmin}}\ \mathbb{E}\big[\ell(x,Y)\big]\quad\text{where }Y\overset{\mathcal{L}}{\sim}F$$
(e.g. the mean with the quadratic loss $\ell_{2}$, or the median with the absolute loss $\ell_{1}$).
In machine learning, we want to solve
$$m^{\star}(\cdot)=\underset{m\in\mathcal{M}}{\text{argmin}}\ \sum_{i=1}^{n}\ell\big(y_i,m(\boldsymbol{x}_i)\big)$$
using optimization techniques, in a not-too-complex space $\mathcal{M}$.
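A quick numerical sanity check of elicitability (a sketch with simulated data, assuming numpy and scipy are available): minimizing the empirical quadratic risk over a constant recovers the sample mean, while the absolute-loss risk recovers the median.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=1000)   # a skewed sample, so mean != median

# argmin_x E[(x - Y)^2] -> mean ;  argmin_x E[|x - Y|] -> median
x_l2 = minimize_scalar(lambda x: np.mean((x - y) ** 2)).x
x_l1 = minimize_scalar(lambda x: np.mean(np.abs(x - y))).x

print(x_l2, y.mean())        # both close to the sample mean
print(x_l1, np.median(y))    # both close to the sample median
```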
5. Arthur Charpentier, Econometrics & Machine Learning - 2017
Overfit?
Underfit: true model $y_i=\beta_0+\boldsymbol{x}_{1,i}^{\top}\beta_1+\boldsymbol{x}_{2,i}^{\top}\beta_2+\varepsilon_i$ vs. fitted model $y_i=b_0+\boldsymbol{x}_{1,i}^{\top}b_1+\eta_i$.
$$b_1=(\boldsymbol{X}_1^{\top}\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^{\top}\boldsymbol{y}=\beta_1+\underbrace{(\boldsymbol{X}_1^{\top}\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^{\top}\boldsymbol{X}_2\beta_2}_{\beta_{12}}+\underbrace{(\boldsymbol{X}_1^{\top}\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^{\top}\boldsymbol{\varepsilon}}_{\nu}$$
i.e. $\mathbb{E}[b_1]=\beta_1+\beta_{12}\neq\beta_1$, unless $\boldsymbol{X}_1\perp\boldsymbol{X}_2$ (see the Frisch-Waugh theorem).
Overfit: true model $y_i=\beta_0+\boldsymbol{x}_{1,i}^{\top}\beta_1+\varepsilon_i$ vs. fitted model $y_i=b_0+\boldsymbol{x}_{1,i}^{\top}b_1+\boldsymbol{x}_{2,i}^{\top}b_2+\eta_i$.
In that case $\mathbb{E}[b_1]=\beta_1$, but $b_1$ is no longer efficient.
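Both situations can be checked by simulation (a sketch with hypothetical data-generating processes, numpy only): omitting a correlated $X_2$ biases $b_1$, while adding an irrelevant $X_2$ keeps $b_1$ unbiased but inflates its variance.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_sim = 200, 2000
b_omit, b_extra, b_right = [], [], []

for _ in range(n_sim):
    x1 = rng.normal(size=n)
    x2 = 0.7 * x1 + rng.normal(size=n)        # correlated regressors
    ones = np.ones(n)

    # Underfit: data generated with x1 and x2, but x2 is omitted -> bias on b1
    y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)
    b_omit.append(np.linalg.lstsq(np.column_stack([ones, x1]), y, rcond=None)[0][1])

    # Overfit: data generated with x1 only, but x2 is also fitted -> unbiased, less efficient
    y = 1.0 + 2.0 * x1 + rng.normal(size=n)
    b_extra.append(np.linalg.lstsq(np.column_stack([ones, x1, x2]), y, rcond=None)[0][1])
    b_right.append(np.linalg.lstsq(np.column_stack([ones, x1]), y, rcond=None)[0][1])

print("omitted x2 : mean(b1) =", np.mean(b_omit))                      # biased away from 2
print("extra x2   : mean(b1) =", np.mean(b_extra), " sd =", np.std(b_extra))  # unbiased, larger sd
print("correct fit: sd(b1)   =", np.std(b_right))
```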
6. Arthur Charpentier, Econometrics & Machine Learning - 2017
Occam’s Razor and Parsimony
Importance of a penalty, in order to avoid too complex a model (overfitting).
Consider some linear predictor, $\widehat{\boldsymbol{y}}=\boldsymbol{H}\boldsymbol{y}$ where $\boldsymbol{H}=\boldsymbol{X}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}\boldsymbol{X}^{\top}$, or more generally $\widehat{\boldsymbol{y}}=\boldsymbol{S}\boldsymbol{y}$ for some smoothing matrix $\boldsymbol{S}$.
In econometrics, consider some ex-post penalty for model selection,
$$\text{AIC}=-2\log\mathcal{L}+2p=\text{deviance}+2p$$
where $p$ is the dimension of the model (i.e. $\text{trace}(\boldsymbol{S})$ in a more general setting).
In machine learning, the penalty is added directly in the objective function, see Ridge or LASSO regression:
$$(\widehat{\beta}_{0,\lambda},\widehat{\beta}_{\lambda})=\underset{(\beta_0,\beta)}{\text{argmin}}\ \sum_{i=1}^{n}\ell\big(y_i,\beta_0+\boldsymbol{x}_i^{\top}\beta\big)+\lambda\,\|\beta\|$$
(the $\ell_2$ norm for Ridge, the $\ell_1$ norm for LASSO).
Goodhart’s law: “when a measure becomes a target, it ceases to be a good measure”.
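A sketch of this penalized objective with scikit-learn (illustrative data; the tuning parameter is called alpha there and plays the role of $\lambda$): the $\ell_2$ penalty (Ridge) shrinks coefficients, while the $\ell_1$ penalty (LASSO) sets many of them exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(3)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = [2.0, -1.0, 0.5]   # sparse "true" coefficients
y = X @ beta + rng.normal(size=n)

# Penalty added directly to the objective; alpha controls model complexity
ridge = Ridge(alpha=1.0).fit(X, y)    # l2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)    # l1 penalty: many coefficients exactly zero

print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))       # mostly zeros outside the first three
```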
7. Arthur Charpentier, Econometrics & Machine Learning - 2017
Boosting, or learning from previous errors
Construct a sequence of models
$$m^{(k)}(\boldsymbol{x})=m^{(k-1)}(\boldsymbol{x})+\alpha\cdot f(\boldsymbol{x})$$
where
$$f=\underset{f\in\mathcal{W}}{\text{argmin}}\ \sum_{i=1}^{n}\ell\big(y_i-m^{(k-1)}(\boldsymbol{x}_i),f(\boldsymbol{x}_i)\big)$$
for some set of weak learners $\mathcal{W}$.
Problem: where to stop to avoid overfit...
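A minimal $\ell_2$-boosting sketch (shallow regression trees as the weak learners, a hypothetical shrinkage $\alpha$, simulated data): each step fits $f$ to the residuals $y_i-m^{(k-1)}(\boldsymbol{x}_i)$, and the error on a held-out sample hints at where to stop.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, size=300)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=300)
x_tr, y_tr, x_te, y_te = x[::2], y[::2], x[1::2], y[1::2]

alpha = 0.1
pred_tr, pred_te = np.zeros(len(y_tr)), np.zeros(len(y_te))
test_error = []
for k in range(200):
    # weak learner fitted to the residuals of the current model m^(k-1)
    f = DecisionTreeRegressor(max_depth=2).fit(x_tr, y_tr - pred_tr)
    pred_tr += alpha * f.predict(x_tr)    # m^(k) = m^(k-1) + alpha * f
    pred_te += alpha * f.predict(x_te)
    test_error.append(np.mean((y_te - pred_te) ** 2))

print("best number of boosting steps on held-out data:", int(np.argmin(test_error)))
```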
[Figure: scatter plot (x-axis roughly 0 to 10, y-axis roughly −2 to 1); no caption or axis labels recoverable.]