
Slides econ-lm

Econometrics and Machine Learning


  1. Econometrics & Machine Learning, A. Charpentier, E. Flachaire & A. Ly (2017), https://arxiv.org/abs/1708.06992, @freakonometrics, freakonometrics.hypotheses.org
  2. Probabilistic Foundations of Econometrics: see Haavelmo (1944) and Duo (1997).
  3. Probabilistic Foundations of Econometrics. There is a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ such that the data $\{y_i, \boldsymbol{x}_i\}$ can be seen as realizations of random variables $\{Y_i, \boldsymbol{X}_i\}$. Consider a conditional distribution for $Y\mid\boldsymbol{X}$, e.g. $(Y\mid\boldsymbol{X}=\boldsymbol{x}) \sim \mathcal{N}(\mu(\boldsymbol{x}), \sigma^2)$ with $\mu(\boldsymbol{x}) = \beta_0 + \boldsymbol{x}^{\mathsf{T}}\boldsymbol{\beta}$ and $\boldsymbol{\beta} \in \mathbb{R}^p$ for the linear model, with extensions to the exponential family (GLM). Then use maximum likelihood for inference, to derive confidence intervals (quantification of uncertainty), etc. Importance of unbiased estimators (Cramér-Rao lower bound for the variance). (A numerical sketch follows the slide list.)
  4. Loss Functions. Following Gneiting (2009), in a statistical context a statistic $T$ is said to be elicitable if there is a loss $\ell : \mathbb{R}\times\mathbb{R} \to \mathbb{R}_+$ such that $T(Y) = \underset{x\in\mathbb{R}}{\text{argmin}} \int_{\mathbb{R}} \ell(x,y)\,dF(y) = \underset{x\in\mathbb{R}}{\text{argmin}}\ \mathbb{E}[\ell(x,Y)]$ where $Y \sim F$ (e.g. the mean with the quadratic loss $\ell_2$, or the median with the absolute loss $\ell_1$). In machine learning, we want to solve $m^{\star} = \underset{m\in\mathcal{M}}{\text{argmin}} \left\{ \sum_{i=1}^n \ell\big(y_i, m(\boldsymbol{x}_i)\big) \right\}$ using optimization techniques, in a not-too-complex space $\mathcal{M}$. (See the elicitability sketch after the slide list.)
  5. Overfit? Underfit: true model $y_i = \beta_0 + \boldsymbol{x}_1^{\mathsf{T}}\boldsymbol{\beta}_1 + \boldsymbol{x}_2^{\mathsf{T}}\boldsymbol{\beta}_2 + \varepsilon_i$ vs. fitted model $y_i = b_0 + \boldsymbol{x}_1^{\mathsf{T}}\boldsymbol{b}_1 + \eta_i$. Then $\boldsymbol{b}_1 = (\boldsymbol{X}_1^{\mathsf{T}}\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^{\mathsf{T}}\boldsymbol{y} = \boldsymbol{\beta}_1 + \underbrace{(\boldsymbol{X}_1^{\mathsf{T}}\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^{\mathsf{T}}\boldsymbol{X}_2\boldsymbol{\beta}_2}_{\boldsymbol{\beta}_{12}} + \underbrace{(\boldsymbol{X}_1^{\mathsf{T}}\boldsymbol{X}_1)^{-1}\boldsymbol{X}_1^{\mathsf{T}}\boldsymbol{\varepsilon}}_{\boldsymbol{\nu}}$, i.e. $\mathbb{E}[\boldsymbol{b}_1] = \boldsymbol{\beta}_1 + \boldsymbol{\beta}_{12} \neq \boldsymbol{\beta}_1$, unless $\boldsymbol{X}_1 \perp \boldsymbol{X}_2$ (see the Frisch-Waugh theorem). Overfit: true model $y_i = \beta_0 + \boldsymbol{x}_1^{\mathsf{T}}\boldsymbol{\beta}_1 + \varepsilon_i$ vs. fitted model $y_i = b_0 + \boldsymbol{x}_1^{\mathsf{T}}\boldsymbol{b}_1 + \boldsymbol{x}_2^{\mathsf{T}}\boldsymbol{b}_2 + \eta_i$. In that case $\mathbb{E}[\boldsymbol{b}_1] = \boldsymbol{\beta}_1$, but the estimator is no longer efficient. (A simulation of the omitted-variable bias follows the slide list.)
  6. Occam's Razor and Parsimony. Importance of a penalty in order to avoid a too-complex model (overfit). Consider some linear predictor $\widehat{\boldsymbol{y}} = \boldsymbol{H}\boldsymbol{y}$ where $\boldsymbol{H} = \boldsymbol{X}(\boldsymbol{X}^{\mathsf{T}}\boldsymbol{X})^{-1}\boldsymbol{X}^{\mathsf{T}}$, or more generally $\widehat{\boldsymbol{y}} = \boldsymbol{S}\boldsymbol{y}$ for some smoothing matrix $\boldsymbol{S}$. In econometrics, consider some ex-post penalty for model selection, $\text{AIC} = -2\log\mathcal{L} + 2p = \text{deviance} + 2p$, where $p$ is the dimension (i.e. $\text{trace}(\boldsymbol{S})$ in a more general setting). In machine learning, the penalty is added in the objective function, see Ridge or LASSO regression: $(\widehat{\beta}_{0,\lambda}, \widehat{\boldsymbol{\beta}}_{\lambda}) = \underset{(\beta_0,\boldsymbol{\beta})}{\text{argmin}} \left\{ \sum_{i=1}^n \ell\big(y_i, \beta_0 + \boldsymbol{x}_i^{\mathsf{T}}\boldsymbol{\beta}\big) + \lambda\,\|\boldsymbol{\beta}\| \right\}$. Goodhart's law: "when a measure becomes a target, it ceases to be a good measure". (A ridge sketch follows the slide list.)
  7. Boosting, or learning from previous errors. Construct a sequence of models $m^{(k)}(\boldsymbol{x}) = m^{(k-1)}(\boldsymbol{x}) + \alpha \cdot f^{\star}(\boldsymbol{x})$ where $f^{\star} = \underset{f\in\mathcal{W}}{\text{argmin}} \left\{ \sum_{i=1}^n \ell\big(y_i - m^{(k-1)}(\boldsymbol{x}_i), f(\boldsymbol{x}_i)\big) \right\}$ for some set of weak learners $\mathcal{W}$. Problem: where to stop to avoid overfitting (see the boosting sketch after the slide list). [Figure: scatterplot of simulated data, x axis 0 to 10, y axis roughly -2 to 1.]
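
A minimal numpy sketch of the Gaussian linear model described on slide 3, on simulated data (the sample size, coefficients and variable names below are my own assumptions, not from the slides): maximum likelihood for $(Y\mid\boldsymbol{X}=\boldsymbol{x}) \sim \mathcal{N}(\beta_0 + \boldsymbol{x}^{\mathsf{T}}\boldsymbol{\beta}, \sigma^2)$ reduces to least squares, and the information matrix gives the usual confidence intervals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate (y_i, x_i) from the Gaussian linear model of slide 3:
# (Y | X = x) ~ N(beta0 + x'beta, sigma^2)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
beta0_true, sigma = 0.3, 1.0
y = beta0_true + X @ beta_true + rng.normal(scale=sigma, size=n)

# For a Gaussian linear model, maximum likelihood = ordinary least squares
X1 = np.column_stack([np.ones(n), X])          # add the intercept column
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
resid = y - X1 @ beta_hat
sigma2_mle = resid @ resid / n                 # (biased) ML estimate of sigma^2

# Asymptotic standard errors from the Fisher information, then a 95% CI
cov_hat = sigma2_mle * np.linalg.inv(X1.T @ X1)
se = np.sqrt(np.diag(cov_hat))
print("beta_hat:", np.round(beta_hat, 3))
print("95% CI for beta_1:", np.round(beta_hat[1] - 1.96 * se[1], 3),
      np.round(beta_hat[1] + 1.96 * se[1], 3))
```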
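A small illustration of elicitability as stated on slide 4, again on simulated data of my own choosing: minimizing the empirical $\ell_2$ risk over a grid recovers the sample mean, while minimizing the $\ell_1$ risk recovers the sample median.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=2000)      # a skewed sample, Y ~ F

# Empirical risk x -> mean_i l(x, y_i) for quadratic and absolute losses
grid = np.linspace(0.0, 6.0, 601)
risk_l2 = ((grid[:, None] - y[None, :]) ** 2).mean(axis=1)
risk_l1 = np.abs(grid[:, None] - y[None, :]).mean(axis=1)

# The minimizers match the statistics each loss elicits
print("argmin l2 risk:", grid[np.argmin(risk_l2)], " sample mean:", y.mean())
print("argmin l1 risk:", grid[np.argmin(risk_l1)], " sample median:", np.median(y))
```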
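The underfitting case of slide 5 can be checked by simulation. The design below (correlated $x_1$ and $x_2$, with coefficients 1 and 2) is an assumption of mine, chosen so that the omitted-variable bias $\boldsymbol{\beta}_{12}$ is easy to compute by hand.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_sim = 500, 1000
beta1, beta2 = 1.0, 2.0
b1_short = np.empty(n_sim)

for s in range(n_sim):
    # x1 and x2 are correlated, so dropping x2 biases the estimate of beta1
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    X1 = np.column_stack([np.ones(n), x1])      # "underfitted" model without x2
    b1_short[s] = np.linalg.lstsq(X1, y, rcond=None)[0][1]

# E[b1] = beta1 + beta12; here the regression of x2 on x1 has slope 0.6,
# so beta12 = 0.6 * beta2 and E[b1] is about 1 + 0.6 * 2 = 2.2
print("true beta1:", beta1, " mean of b1 in the short regression:", b1_short.mean())
```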
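Slide 6 mentions both the ex-post penalty (AIC with $\text{trace}(\boldsymbol{S})$ as effective dimension) and the penalized objective (Ridge/LASSO). The sketch below uses the ridge ($\ell_2$) penalty, for which the smoothing matrix has a closed form; LASSO would need a dedicated solver. The data-generating process is invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta = np.concatenate([np.array([2.0, -1.0]), np.zeros(p - 2)])  # mostly noise columns
y = X @ beta + rng.normal(size=n)

for lam in [0.0, 1.0, 10.0, 100.0]:
    # Ridge estimator: argmin_b sum_i (y_i - x_i'b)^2 + lam * ||b||_2^2
    b = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    # Smoothing matrix S = X (X'X + lam I)^{-1} X'; trace(S) = effective dimension
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    edf = np.trace(S)
    rss = np.sum((y - X @ b) ** 2)
    # Gaussian AIC (up to an additive constant), with trace(S) playing the role of p
    aic = n * np.log(rss / n) + 2 * edf
    print(f"lambda={lam:6.1f}  trace(S)={edf:5.2f}  AIC={aic:8.2f}")
```

Increasing $\lambda$ shrinks $\text{trace}(\boldsymbol{S})$, so the penalty in the objective and the ex-post complexity term are two views of the same trade-off.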
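Finally, a sketch of $\ell_2$-boosting with regression stumps in the spirit of slide 7. The stump learner, the step size $\alpha = 0.3$ and the train/hold-out split are my own choices, used only to watch the training error keep decreasing while the hold-out error eventually stops improving, which is the "where to stop" problem.

```python
import numpy as np

rng = np.random.default_rng(4)

def fit_stump(x, r):
    """Weak learner: a depth-1 split minimizing squared error on the residuals r."""
    best = None
    for thr in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left = x <= thr
        pred = np.where(left, r[left].mean(), r[~left].mean())
        sse = np.sum((r - pred) ** 2)
        if best is None or sse < best[0]:
            best = (sse, thr, r[left].mean(), r[~left].mean())
    return best[1:]

def stump_predict(x, stump):
    thr, left_value, right_value = stump
    return np.where(x <= thr, left_value, right_value)

# Simulated 1-d regression problem, with a hold-out set to monitor overfitting
x = rng.uniform(0, 10, 300)
y = np.sin(x) + rng.normal(scale=0.3, size=300)
tr, te = np.arange(200), np.arange(200, 300)

alpha, m_tr, m_te = 0.3, np.zeros(200), np.zeros(100)
for k in range(1, 201):
    stump = fit_stump(x[tr], y[tr] - m_tr)        # fit the current residuals
    m_tr += alpha * stump_predict(x[tr], stump)   # m^(k) = m^(k-1) + alpha * f*
    m_te += alpha * stump_predict(x[te], stump)
    if k % 50 == 0:
        print(k, "train MSE", round(np.mean((y[tr] - m_tr) ** 2), 3),
              "test MSE", round(np.mean((y[te] - m_te) ** 2), 3))
```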
