Econometrics & “Machine Learning”*
A. Charpentier (Université de Rennes)
joint work with E. Flachaire (AMSE) & A. Ly (Université Paris-Est)
https://hal.archives-ouvertes.fr/hal-01568851/
Università degli studi dell’Insubria
Seminar, May 2018.
@freakonometrics, freakonometrics.hypotheses.org
Econometrics & “Machine Learning”
Varian (2014, Big Data: New Tricks for Econometrics)
Econometrics in a “Big Data” Context
Here data means $y \in \mathbb{R}^n$ and $X$ an $n \times p$ matrix.
Large $n$ means that "asymptotic" theorems can be invoked ($n \to \infty$).
Portnoy (1988, Asymptotic Behavior of Likelihood Methods for Exponential Families when the Number of Parameters Tends to Infinity) proved that the maximum likelihood estimator is asymptotically Gaussian as long as $p^2/n \to 0$.
There might be high-dimensional issues if $p > \sqrt{n}$.
Nevertheless, there might be sparsity issues in high dimension (see Hastie, Tibshirani & Wainwright (2015, Statistical Learning with Sparsity)): a sparse statistical model has only a small number of nonzero parameters or weights.
Solving the first-order condition $X^{\mathsf{T}}(X\beta - y) = 0$ is based on the QR decomposition (which can be computationally intensive).
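As an illustration, a minimal sketch (in Python, with simulated data; the variable names and sizes are mine, not from the talk) comparing the normal equations with the QR-based solution of the least-squares problem:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 1000, 5
    X = rng.normal(size=(n, p))
    beta = np.arange(1, p + 1, dtype=float)
    y = X @ beta + rng.normal(size=n)

    # Normal equations: solve X'X b = X'y (squares the condition number of X)
    beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

    # QR decomposition: X = QR, then solve R b = Q'y (numerically safer)
    Q, R = np.linalg.qr(X)
    beta_qr = np.linalg.solve(R, Q.T @ y)

    print(np.allclose(beta_ne, beta_qr))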
Econometrics in a “Big Data” Context
Arthur Hoerl, in 1982, looking back at Hoerl & Kennard (1970) and Ridge regression.
Back to the history of "regression"
Galton (1870, Hereditary Genius; 1886, Regression towards Mediocrity in Hereditary Stature) and Pearson & Lee (1896, On Telegony in Man; 1903, On the Laws of Inheritance in Man) studied the genetic transmission of characteristics, e.g. height.
On average, the child of tall parents is taller than other children, but shorter than his parents.
"I have called this peculiarity by the name of regression", Francis Galton, 1886.
Economics and Statistics
Working (1927)
What Do Statistical "Demand Curves" Show?
Tinbergen (1939)
Statistical Testing of Business Cycle Theories
Econometrics and the Probabilistic Model
Haavelmo (1944, The Probability Approach in Econometrics)
Data $(y_i, x_i)$ are seen as realizations of (i.i.d.) random variables $(Y, X)$ on some probability space $(\Omega, \mathcal{F}, \mathbb{P})$.
Mathematical Statistics
Vapnik (1998, Statistical Learning Theory)
Mathematical Statistics
Consider observations $\{y_1, \cdots, y_n\}$ from i.i.d. random variables $Y_i \sim F_\theta$ (with "density" $f_\theta$).
The likelihood is
$$\mathcal{L}(\theta; y) = \prod_{i=1}^n f_\theta(y_i)$$
and the maximum likelihood estimate is
$$\widehat{\theta}^{\,\text{mle}} \in \underset{\theta \in \Theta}{\text{argmax}}\ \{\mathcal{L}(\theta; y)\}.$$
Fisher (1912, On an Absolute Criterion for Fitting Frequency Curves)
Mathematical Statistics
Under standard assumptions (identification of the model, compactness, continuity and dominance), the maximum likelihood estimator is consistent, $\widehat{\theta}^{\,\text{mle}} \overset{\mathbb{P}}{\longrightarrow} \theta$. With additional assumptions, it can be shown that the maximum likelihood estimator converges to a normal distribution,
$$\sqrt{n}\,\big(\widehat{\theta}^{\,\text{mle}} - \theta\big) \overset{\mathcal{L}}{\longrightarrow} \mathcal{N}(0, I^{-1}),$$
where $I$ is the Fisher information matrix (i.e. $\widehat{\theta}^{\,\text{mle}}$ is asymptotically efficient).
E.g. if $Y \sim \mathcal{N}(\theta, \sigma^2)$, $\log \mathcal{L} \propto -\sum_{i=1}^n \left(\frac{y_i - \theta}{\sigma}\right)^2$, and $\widehat{\theta}^{\,\text{mle}} = \overline{y}$ (see also the method of moments).
$$\max\ \log \mathcal{L} \iff \min \sum_{i=1}^n \varepsilon_i^2 = \text{least squares}$$
Data and Conditional Distribution
Cook & Weisberg (1999, Applied Regression Including Computing and Graphics)
From available data, describe the conditional distribution of Y given X.
Conditional Distributions and Likelihood
We had a sample $\{y_1, \cdots, y_n\}$, with $Y$ from a $\mathcal{N}(\theta, \sigma^2)$ distribution.
The natural extension, if we have a sample $\{(y_1, x_1), \cdots, (y_n, x_n)\}$, is to assume that $Y\,|\,X = x$ has a $\mathcal{N}(\theta_x, \sigma^2)$ distribution.
The standard linear model is obtained when $\theta_x$ is linear, $\theta_x = \beta_0 + x^{\mathsf{T}}\beta$. Hence $\mathbb{E}[Y|X=x] = x^{\mathsf{T}}\beta$ and $\text{Var}[Y|X=x] = \sigma^2$ (homoskedasticity), if we center the variables or add the constant in $x$.
The MLE of $\beta$ is
$$\widehat{\beta}^{\,\text{mle}} = \underset{\beta}{\text{argmin}} \left\{\sum_{i=1}^n (y_i - x_i^{\mathsf{T}}\beta)^2\right\} = (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}y.$$
From a Gaussian to a Bernoulli (Logistic) Regression, and GLMs
Consider $y \in \{0,1\}$, so that $Y$ has a Bernoulli distribution. The logarithm of the odds is linear,
$$\log \frac{\mathbb{P}[Y=1|X=x]}{\mathbb{P}[Y=0|X=x]} = \beta_0 + x^{\mathsf{T}}\beta,$$
i.e.
$$\mathbb{E}[Y|X=x] = \frac{e^{\beta_0 + x^{\mathsf{T}}\beta}}{1 + e^{\beta_0 + x^{\mathsf{T}}\beta}} = H(\beta_0 + x^{\mathsf{T}}\beta), \quad\text{where } H(\cdot) = \frac{\exp(\cdot)}{1+\exp(\cdot)},$$
with likelihood
$$\mathcal{L}(\beta; y, X) = \prod_{i=1}^n \left(\frac{e^{x_i^{\mathsf{T}}\beta}}{1 + e^{x_i^{\mathsf{T}}\beta}}\right)^{y_i} \left(\frac{1}{1 + e^{x_i^{\mathsf{T}}\beta}}\right)^{1-y_i},$$
and set $\widehat{\beta}^{\,\text{mle}} = \underset{\beta \in \mathbb{R}^p}{\text{argmax}}\ \{\mathcal{L}(\beta; y, X)\}$ (solved numerically).
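A minimal sketch of this numerical maximization (Python with scipy, on simulated data; the names and the BFGS choice are mine):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    n = 500
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # constant included
    beta_true = np.array([-0.5, 1.0, -2.0])
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

    def neg_log_lik(beta):
        eta = X @ beta
        # log L = sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]
        return -(y @ eta - np.logaddexp(0, eta).sum())

    beta_mle = minimize(neg_log_lik, x0=np.zeros(3), method="BFGS").x
    print(beta_mle)   # should be close to beta_true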
A Brief Excursion to Nonparametric Econometrics
Two strategies are usually considered to estimate $m(x) = \mathbb{E}[Y|X=x]$:
• consider a local approximation (in the neighborhood of $x$), e.g. $k$-nearest neighbors, kernel regression, lowess;
• consider a functional decomposition of the function $m$ in some natural basis, e.g. splines.
For a kernel regression, as in Nadaraya (1964) and Watson (1964), consider some kernel function $K$ and some bandwidth $h$,
$$\widehat{m}_h(x) = s_x^{\mathsf{T}} y = \sum_{i=1}^n s_{x,i}\, y_i \quad\text{where}\quad s_{x,i} = \frac{K_h(x - x_i)}{K_h(x - x_1) + \cdots + K_h(x - x_n)}$$
(in a univariate problem).
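A short Python sketch of this estimator, with a Gaussian kernel and simulated data (the function and sample are mine, purely for illustration):

    import numpy as np

    def nadaraya_watson(x0, x, y, h):
        """Nadaraya-Watson estimate of E[Y|X=x0] with a Gaussian kernel."""
        w = np.exp(-0.5 * ((x0 - x) / h) ** 2)   # K_h(x0 - x_i), up to a constant
        return np.sum(w * y) / np.sum(w)

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 6, 200)
    y = np.sin(x) + rng.normal(scale=0.3, size=200)

    grid = np.linspace(0, 6, 61)
    m_hat = np.array([nadaraya_watson(x0, x, y, h=0.3) for x0 in grid])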
A Brief Excursion to Nonparametric Econometrics
Recall (see Simonoff (1996)) that, using asymptotic approximations,
$$\text{bias}[\widehat{m}_h(x)] = \mathbb{E}[\widehat{m}_h(x)] - m(x) \sim h^2 \left(\frac{C_1}{2}\, m''(x) + C_2\, m'(x)\, \frac{f'(x)}{f(x)}\right)$$
and
$$\text{Var}[\widehat{m}_h(x)] \sim \frac{C_3}{nh}\, \frac{\sigma^2(x)}{f(x)}$$
for some constants $C_1$, $C_2$ and $C_3$.
Set $\text{mse}[\widehat{m}_h(x)] = \text{bias}^2[\widehat{m}_h(x)] + \text{Var}[\widehat{m}_h(x)]$ and consider the integrated version
$$\text{mise}[\widehat{m}_h] = \int \text{mse}[\widehat{m}_h(x)]\, dF(x).$$
A Brief Excursion to Nonparametric Econometrics
$$\text{mise}[\widehat{m}_h] \sim \underbrace{\frac{h^4}{4} \left(\int x^2 k(x)\,dx\right)^2 \int \left(m''(x) + 2 m'(x)\,\frac{f'(x)}{f(x)}\right)^2 dx}_{\text{bias}^2} + \underbrace{\frac{\sigma^2}{nh} \int k^2(x)\,dx \cdot \int \frac{dx}{f(x)}}_{\text{variance}}.$$
Thus, the optimal value is $h^\star = O(n^{-1/5})$ (see Silverman's rule of thumb), obtained using asymptotic theory and approximations.
A Brief Excursion to Nonparametric Econometrics
Choice of the meta-parameter $h$: a bias / variance trade-off.
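One common way to pick $h$ in practice is leave-one-out cross-validation; a minimal sketch (reusing a Gaussian-kernel Nadaraya-Watson estimator on simulated data; the grid of bandwidths and sample sizes are arbitrary):

    import numpy as np

    def nw(x0, x, y, h):
        # Nadaraya-Watson estimate with a Gaussian kernel (as above)
        w = np.exp(-0.5 * ((x0 - x) / h) ** 2)
        return np.sum(w * y) / np.sum(w)

    def loo_cv(h, x, y):
        # leave-one-out cross-validated squared error for bandwidth h
        return np.mean([(y[i] - nw(x[i], np.delete(x, i), np.delete(y, i), h)) ** 2
                        for i in range(len(x))])

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 6, 150)
    y = np.sin(x) + rng.normal(scale=0.3, size=150)

    hs = np.linspace(0.1, 2.0, 20)
    h_star = hs[int(np.argmin([loo_cv(h, x, y) for h in hs]))]
    # small h: low bias, high variance; large h: high bias, low variance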
Model (and Variable) Choice
Suppose that the true model is $y_i = x_{1,i}\beta_1 + x_{2,i}\beta_2 + \varepsilon_i$, but we estimate the model on $x_1$ (only), $y_i = x_{1,i} b_1 + \eta_i$. Then
$$b_1 = (X_1^{\mathsf{T}}X_1)^{-1}X_1^{\mathsf{T}}Y = (X_1^{\mathsf{T}}X_1)^{-1}X_1^{\mathsf{T}}[X_1\beta_1 + X_2\beta_2 + \varepsilon] = \beta_1 + \underbrace{(X_1^{\mathsf{T}}X_1)^{-1}X_1^{\mathsf{T}}X_2\beta_2}_{\beta_{12}} + \underbrace{(X_1^{\mathsf{T}}X_1)^{-1}X_1^{\mathsf{T}}\varepsilon}_{\nu},$$
i.e. $\mathbb{E}(b_1) = \beta_1 + \beta_{12}$. If $X_1^{\mathsf{T}}X_2 = 0$ ($X_1 \perp X_2$), then $\mathbb{E}(b_1) = \beta_1$.
Conversely, assume that the true model is $y_i = x_{1,i}\beta_1 + \varepsilon_i$ but we estimate it with the (unnecessary) variables $x_2$, $y_i = x_{1,i} b_1 + x_{2,i} b_2 + \eta_i$. Here the estimation is unbiased, $\mathbb{E}(b_1) = \beta_1$, but the estimator is not efficient...
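A quick simulation of the omitted-variable bias (Python; the coefficients and the correlation between the regressors are chosen arbitrarily for the illustration):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 5000
    x2 = rng.normal(size=n)
    x1 = 0.7 * x2 + rng.normal(size=n)          # x1 and x2 are correlated
    y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

    # short regression (x1 only): biased, E(b1) = beta1 + beta12
    b_short = np.linalg.lstsq(x1.reshape(-1, 1), y, rcond=None)[0]

    # long regression (x1 and x2): unbiased
    b_long = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)[0]

    print(b_short)   # noticeably above 2, since x2 is omitted
    print(b_long)    # close to (2, 3)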
Model (and Variable) Choice, pluralitas non est ponenda sine necessitate
From the variance decomposition
$$\underbrace{\frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2}_{\text{total variance}} = \underbrace{\frac{1}{n}\sum_{i=1}^n (y_i - \widehat{y}_i)^2}_{\text{residual variance}} + \underbrace{\frac{1}{n}\sum_{i=1}^n (\widehat{y}_i - \bar{y})^2}_{\text{explained variance}},$$
define the $R^2$ as
$$R^2 = \frac{\sum_{i=1}^n (y_i - \bar{y})^2 - \sum_{i=1}^n (y_i - \widehat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$$
or, better, the adjusted $R^2$,
$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p} = R^2 - \underbrace{(1 - R^2)\,\frac{p-1}{n-p}}_{\text{penalty}}.$$
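A small helper computing both quantities (Python; the toy fit with np.polyfit is only there to exercise the function):

    import numpy as np

    def r_squared(y, y_hat, p):
        """R^2 and adjusted R^2 for a fit with p parameters (intercept included)."""
        n = len(y)
        ss_res = np.sum((y - y_hat) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        r2 = 1 - ss_res / ss_tot
        return r2, 1 - (1 - r2) * (n - 1) / (n - p)

    rng = np.random.default_rng(5)
    x = rng.normal(size=100)
    y = 1 + 2 * x + rng.normal(size=100)
    y_hat = np.polyval(np.polyfit(x, y, 1), x)
    print(r_squared(y, y_hat, p=2))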
Model (and Variable) Choice, pluralitas non est ponenda sine necessitate
One can also consider the log-likelihood, or some penalized version,
$$AIC = -\log \mathcal{L} + \underbrace{2 \cdot p}_{\text{penalty}} \quad\text{or}\quad BIC = -\log \mathcal{L} + \underbrace{\log(n) \cdot p}_{\text{penalty}}.$$
The objective is $\min\{-\log \mathcal{L}(\beta)\}$, and quality is measured via $-\log \mathcal{L}(\beta) + 2p$.
Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."
Model (and Variable) Choice, pluralitas non est ponenda sine necessitate
An alternative approach to penalization: consider linear predictors $m \in \mathcal{M}$,
$$\mathcal{M} = \big\{m : m(x) = s_h(x)^{\mathsf{T}} y\big\}, \quad\text{where } S = (s(x_1), \cdots, s(x_n))^{\mathsf{T}} \text{ is the smoothing matrix.}$$
Suppose that the true model is $y = m_0(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\text{Var}[\varepsilon] = \sigma^2 I$, so that $m_0(x) = \mathbb{E}[Y|X=x]$. The quadratic risk $\mathcal{R}(m) = \mathbb{E}\big[(Y - m(X))^2\big]$ is
$$\underbrace{\mathbb{E}\big[(Y - m_0(X))^2\big]}_{\text{error}} + \underbrace{\mathbb{E}\big[(m_0(X) - \mathbb{E}[m(X)])^2\big]}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\mathbb{E}[m(X)] - m(X))^2\big]}_{\text{variance}}.$$
Consider the empirical risk $\mathcal{R}_n(m) = \frac{1}{n}\sum_{i=1}^n (y_i - m(x_i))^2$; then one can prove that
$$\mathcal{R}(m) = \mathbb{E}\big[\mathcal{R}_n(m)\big] + \frac{2\sigma^2}{n}\,\text{trace}(S),$$
see Mallows' $C_p$, from Mallows (1973, Some Comments on Cp).
Introduction: “Machine Learning” ?
Friedman (1997, Data Mining and Statistics: What's the Connection?)
Introduction: “Machine Learning” ?
Machine learning (initially) did not need any probabilistic framework.
Consider observations $(y_i, x_i)$, a loss function $\ell$ and some class of models $\mathcal{M}$.
The goal is to solve
$$\widehat{m} \in \underset{m \in \mathcal{M}}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, m(x_i))\right\}.$$
Watt et al. (2016, Machine Learning Refined: Foundations, Algorithms and Applications)
Probabilistic Foundations and “Statistical Learning”
Vapnik (2000, The Nature of Statistical Learning Theory)
Probabilistic Foundations and “Statistical Learning”
Valiant (1984, A Theory of the Learnable)
Probabilistic Foundations and “Statistical Learning”
Consider a classification problem, $y \in \{-1, +1\}$. The "true" model is $m_0$ (the target), and consider some model $m$.
Let $\mathbb{P}$ denote the (unknown) distribution of the $X$'s. The error of $m$ can be written
$$R_{\mathbb{P},m_0}(m) = \mathbb{P}\big[m(X) \neq m_0(X)\big] = \mathbb{P}\big[\{x : m(x) \neq m_0(x)\}\big], \quad\text{where } X \sim \mathbb{P}.$$
Naturally, we can assume that the observations $x_i$ were drawn from $\mathbb{P}$. The empirical risk is
$$\widehat{R}(m) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\big(m(x_i) \neq y_i\big).$$
Since it is not possible to get a perfect model, we should seek a Probably Approximately Correct (PAC) model: given $\varepsilon$, we want to find $m^\star$ such that $R_{\mathbb{P},m_0}(m^\star) \le \varepsilon$ with probability $1 - \delta$.
More precisely, we want an algorithm that might lead us to such a candidate $m^\star$.
Probabilistic Foundations and “Statistical Learning”
Suppose that $\mathcal{M}$ contains a finite number of models. For any $\varepsilon$, $\delta$, $\mathbb{P}$ and $m_0$, if we have enough observations ($n \ge \varepsilon^{-1}\log[\delta^{-1}|\mathcal{M}|]$), and if
$$m^\star \in \underset{m \in \mathcal{M}}{\text{argmin}} \left\{\frac{1}{n}\sum_{i=1}^n \mathbf{1}\big(m(x_i) \neq y_i\big)\right\},$$
then, with probability higher than $1 - \delta$, $R_{\mathbb{P},m_0}(m^\star) \le \varepsilon$.
Here $n_{\mathcal{M}}(\varepsilon, \delta) = \varepsilon^{-1}\log[\delta^{-1}|\mathcal{M}|]$ is called the (sample) complexity, and $\mathcal{M}$ is PAC-learnable.
If $\mathcal{M}$ is not finite, the problem is more complicated: it is necessary to define a dimension $d$ of $\mathcal{M}$, the so-called VC-dimension, that will be a substitute for $|\mathcal{M}|$.
Penalization and Over/Under-Fit
The risk of $\widehat{m}$ cannot be estimated on the data used to fit $\widehat{m}$.
Alternatives are cross-validation and the bootstrap, as sketched below.
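A minimal illustration of both ideas (Python with scikit-learn, on simulated data; the model and sample sizes are arbitrary):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(7)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=200)

    # 5-fold cross-validation: the risk is estimated on data not used for fitting
    cv_errors = []
    for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        fit = LinearRegression().fit(X[train], y[train])
        cv_errors.append(mean_squared_error(y[test], fit.predict(X[test])))

    # bootstrap: refit on resampled data, evaluate on the out-of-bag observations
    boot_errors = []
    for _ in range(200):
        idx = rng.integers(0, len(y), len(y))
        oob = np.setdiff1d(np.arange(len(y)), idx)
        fit = LinearRegression().fit(X[idx], y[idx])
        boot_errors.append(mean_squared_error(y[oob], fit.predict(X[oob])))

    print(np.mean(cv_errors), np.mean(boot_errors))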
The objective function and loss functions
Consider a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ and set $\widehat{m} \in \underset{m \in \mathcal{M}}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, m(x_i))\right\}$.
A classical loss function is the quadratic one, $\ell_2(u,v) = (u-v)^2$. Recall that
$$\bar{y} = \underset{m \in \mathbb{R}}{\text{argmin}} \left\{\sum_{i=1}^n \ell_2(y_i, m)\right\}, \quad\text{or (for the continuous version)}\quad \mathbb{E}(Y) = \underset{m \in \mathbb{R}}{\text{argmin}} \big\{\|Y - m\|_2^2\big\} = \underset{m \in \mathbb{R}}{\text{argmin}} \big\{\mathbb{E}\big[\ell_2(Y, m)\big]\big\}.$$
One can also consider the least-absolute-value loss, $\ell_1(u,v) = |u-v|$. Recall that $\text{median}[y] = \underset{m \in \mathbb{R}}{\text{argmin}} \left\{\sum_{i=1}^n \ell_1(y_i, m)\right\}$. The optimisation problem
$$\widehat{m} = \underset{m \in \mathcal{M}}{\text{argmin}} \left\{\sum_{i=1}^n |y_i - m(x_i)|\right\}$$
is well known in robust econometrics.
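A quick numerical check that the quadratic loss points to the mean and the absolute loss to the median (Python, on a skewed simulated sample):

    import numpy as np

    rng = np.random.default_rng(8)
    y = rng.exponential(scale=2.0, size=1000)     # a skewed sample

    grid = np.linspace(0, 10, 2001)
    l2_risk = [np.mean((y - m) ** 2) for m in grid]      # quadratic loss
    l1_risk = [np.mean(np.abs(y - m)) for m in grid]     # absolute loss

    print(grid[int(np.argmin(l2_risk))], y.mean())       # ~ the sample mean
    print(grid[int(np.argmin(l1_risk))], np.median(y))   # ~ the sample median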
The objective function and loss functions
More generally, since $\ell_1(y, m) = |(y - m)(1/2 - \mathbf{1}_{y \le m})|$, this can be generalized to any probability level $\tau \in (0,1)$:
$$\widehat{m}_\tau = \underset{m \in \mathcal{M}_0}{\text{argmin}} \left\{\sum_{i=1}^n \ell_\tau^{q}(y_i, m(x_i))\right\} \quad\text{with}\quad \ell_\tau^{q}(x, y) = (x - y)(\tau - \mathbf{1}_{x \le y})$$
is the quantile regression, see Koenker (2005, Quantile Regression). One can also consider expectiles, with loss function $\ell_\tau^{e}(x, y) = (x - y)^2 \cdot |\tau - \mathbf{1}_{x \le y}|$, see Aigner et al. (1977, Formulation and Estimation of Stochastic Frontier Production Function Models).
Gneiting (2011, Making and Evaluating Point Forecasts) introduced the concept of elicitable statistics: $T$ is elicitable if there exists $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ such that
$$T(Y) = \underset{x \in \mathbb{R}}{\text{argmin}} \left\{\int_{\mathbb{R}} \ell(x, y)\, dF(y)\right\} = \underset{x \in \mathbb{R}}{\text{argmin}} \big\{\mathbb{E}\big[\ell(x, Y)\big]\big\} \quad\text{where } Y \overset{\mathcal{L}}{\sim} F.$$
Thus the mean, the median and any quantile are elicitable.
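A sketch of a linear quantile fit obtained by minimizing the pinball loss directly (Python with scipy; the data, the level $\tau = 0.9$ and the Nelder-Mead choice are mine):

    import numpy as np
    from scipy.optimize import minimize

    def pinball(tau, y, m):
        # quantile ("pinball") loss, averaged over the sample
        u = y - m
        return np.mean(u * (tau - (u < 0)))

    rng = np.random.default_rng(9)
    x = rng.uniform(0, 1, 500)
    y = 1 + 2 * x + rng.gamma(2.0, 1.0, 500)      # asymmetric noise

    tau = 0.9
    beta_tau = minimize(lambda b: pinball(tau, y, b[0] + b[1] * x),
                        x0=np.zeros(2), method="Nelder-Mead").x
    print(beta_tau)   # a linear fit of the conditional 90% quantile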
Weak and iterative learning (Gradient Boosting)
We want to solve $\widehat{m} \in \underset{m \in \mathcal{M}}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, m(x_i))\right\}$ on a set of possible predictors $\mathcal{M}$.
In a classical optimization problem, $x^\star \in \underset{x \in \mathbb{R}^d}{\text{argmin}} \{f(x)\}$, we consider a gradient descent to approximate $x^\star$,
$$\underbrace{f(x_k)}_{\langle f, x_k \rangle} \sim \underbrace{f(x_{k-1})}_{\langle f, x_{k-1} \rangle} + \underbrace{(x_k - x_{k-1})}_{\alpha_k}\, \underbrace{\nabla f(x_{k-1})}_{\langle \nabla f, x_{k-1} \rangle}.$$
Here, it is some sort of dual problem, $m^\star \in \underset{m \in \mathcal{M}}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, m(x_i))\right\}$, and a functional gradient descent can be considered,
$$\underbrace{m_k(x)}_{\langle m_k, x \rangle} \sim \underbrace{m_{k-1}(x)}_{\langle m_{k-1}, x \rangle} + \underbrace{(m_k - m_{k-1})}_{\beta_k}\, \underbrace{\nabla m_{k-1}(x)}_{\langle \nabla m_{k-1}, x \rangle}.$$
Weak and iterative learning (Gradient Boosting)
$$m^{(k)} = m^{(k-1)} + \underset{h \in \mathcal{H}}{\text{argmin}} \left\{\sum_{i=1}^n \ell\big(\underbrace{y_i - m^{(k-1)}(x_i)}_{\varepsilon_{k,i}},\, h(x_i)\big)\right\}$$
where
$$m^{(k)}(\cdot) = \underbrace{m_1(\cdot)}_{\sim y} + \underbrace{m_2(\cdot)}_{\sim \varepsilon_1} + \underbrace{m_3(\cdot)}_{\sim \varepsilon_2} + \cdots + \underbrace{m_k(\cdot)}_{\sim \varepsilon_{k-1}} = m^{(k-1)}(\cdot) + m_k(\cdot).$$
Equivalently, start with $m^{(0)} = \underset{m \in \mathbb{R}}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, m)\right\}$, and solve iteratively
$$m^{(k)} = m^{(k-1)} + \underset{h \in \mathcal{H}}{\text{argmin}} \left\{\sum_{i=1}^n \ell\big(y_i,\, m^{(k-1)}(x_i) + h(x_i)\big)\right\}.$$
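A minimal boosting sketch with the quadratic loss, where each weak learner is a regression stump fitted to the current residuals (Python with scikit-learn, simulated data; the shrinkage parameter nu is an addition of mine, not on the slide):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(10)
    x = rng.uniform(0, 6, 300).reshape(-1, 1)
    y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=300)

    # m^(0): the constant minimizing the quadratic loss, i.e. the mean
    prediction = np.full_like(y, y.mean())
    learners, nu = [], 0.1               # nu = shrinkage (not on the slide)

    for k in range(100):
        residuals = y - prediction                    # epsilon_{k,i}
        stump = DecisionTreeRegressor(max_depth=1).fit(x, residuals)
        learners.append(stump)
        prediction += nu * stump.predict(x)           # m^(k) = m^(k-1) + nu * m_k

    def m_hat(x_new):
        return y.mean() + nu * sum(t.predict(x_new) for t in learners)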
Weak and iterative learning (Gradient Boosting)
If $\mathcal{H}$ is a set of differentiable functions,
$$m^{(k)} = m^{(k-1)} - \gamma_k \sum_{i=1}^n \nabla_{m^{(k-1)}}\,\ell\big(y_i, m^{(k-1)}(x_i)\big),$$
where
$$\gamma_k = \underset{\gamma}{\text{argmin}} \left\{\sum_{i=1}^n \ell\Big(y_i,\, m^{(k-1)}(x_i) - \gamma\, \nabla_{m^{(k-1)}}\,\ell\big(y_i, m^{(k-1)}(x_i)\big)\Big)\right\}.$$
[Figure: two scatter plots of the same simulated sample (x from 0 to 6, y from -1.5 to 1.5), accompanying the gradient-boosting slides.]
Penalization and Over/Under-Fit
Consider some penalized objective function: given $\lambda \ge 0$,
$$(\widehat{\beta}_{0,\lambda}, \widehat{\beta}_\lambda) = \underset{\beta_0, \beta}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, \beta_0 + x^{\mathsf{T}}\beta) + \lambda\, \|\beta\|\right\}$$
with centered and scaled variables, for some penalty $\|\beta\|$.
Note that there is some Bayesian interpretation,
$$\underbrace{\mathbb{P}[\theta|y]}_{\text{a posteriori}} \propto \underbrace{\mathbb{P}[y|\theta]}_{\text{likelihood}} \cdot \underbrace{\mathbb{P}[\theta]}_{\text{a priori}}, \quad\text{i.e.}\quad \log \mathbb{P}[\theta|y] = \underbrace{\log \mathbb{P}[y|\theta]}_{\text{log-likelihood}} + \underbrace{\log \mathbb{P}[\theta]}_{\text{penalization}}.$$
Going further: $\ell_0$, $\ell_1$ and $\ell_2$ penalties
Define
$$\|a\|_{\ell_0} = \sum_{i=1}^d \mathbf{1}(a_i \neq 0), \quad \|a\|_{\ell_1} = \sum_{i=1}^d |a_i| \quad\text{and}\quad \|a\|_{\ell_2} = \left(\sum_{i=1}^d a_i^2\right)^{1/2}, \quad\text{for } a \in \mathbb{R}^d.$$
For each norm, the constrained and the penalized formulations of the optimization problem are
($\ell_0$): $\underset{\beta;\, \|\beta\|_{\ell_0} \le s}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, \beta_0 + x^{\mathsf{T}}\beta)\right\}$ and $\underset{\beta, \lambda}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, \beta_0 + x^{\mathsf{T}}\beta) + \lambda\|\beta\|_{\ell_0}\right\}$
($\ell_1$): $\underset{\beta;\, \|\beta\|_{\ell_1} \le s}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, \beta_0 + x^{\mathsf{T}}\beta)\right\}$ and $\underset{\beta, \lambda}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, \beta_0 + x^{\mathsf{T}}\beta) + \lambda\|\beta\|_{\ell_1}\right\}$
($\ell_2$): $\underset{\beta;\, \|\beta\|_{\ell_2} \le s}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, \beta_0 + x^{\mathsf{T}}\beta)\right\}$ and $\underset{\beta, \lambda}{\text{argmin}} \left\{\sum_{i=1}^n \ell(y_i, \beta_0 + x^{\mathsf{T}}\beta) + \lambda\|\beta\|_{\ell_2}\right\}$
Assume that $\ell$ is the quadratic norm.
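A quick illustration of the penalized versions with scikit-learn (simulated sparse data; note that scikit-learn calls the penalty weight $\lambda$ "alpha", and the alpha values below are arbitrary):

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge, Lasso

    rng = np.random.default_rng(11)
    n, p = 200, 10
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[:3] = [3.0, -2.0, 1.5]                  # a sparse "true" beta
    y = X @ beta + rng.normal(size=n)

    ols = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=10.0).fit(X, y)    # l2 penalty: shrinks all coefficients
    lasso = Lasso(alpha=0.2).fit(X, y)     # l1 penalty: sets some coefficients to 0

    print(np.round(ols.coef_, 2))
    print(np.round(ridge.coef_, 2))
    print(np.round(lasso.coef_, 2), "non-zero:", int(np.sum(lasso.coef_ != 0)))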
Going further: $\ell_0$, $\ell_1$ and $\ell_2$ penalties
The two problems ($\ell_2$) are equivalent: for any solution $(\beta^\star, s^\star)$ of the constrained problem, there exists $\lambda^\star$ such that $(\beta^\star, \lambda^\star)$ is a solution of the penalized problem, and conversely.
The two problems ($\ell_1$) are equivalent: for any solution $(\beta^\star, s^\star)$ of the constrained problem, there exists $\lambda^\star$ such that $(\beta^\star, \lambda^\star)$ is a solution of the penalized problem, and conversely. Nevertheless, even if there is a theoretical equivalence, there might be numerical issues, since the solution is not necessarily unique.
The two problems ($\ell_0$) are not equivalent: if $(\beta^\star, \lambda^\star)$ is a solution of the penalized problem, there exists $s^\star$ such that $\beta^\star$ is a solution of the constrained problem. But the converse is not true.
More generally, consider an $\ell_p$ norm:
• sparsity is obtained when $p \le 1$
• convexity is obtained when $p \ge 1$
Using the $\ell_0$ penalty
Foster & George (1994, The Risk Inflation Criterion for Multiple Regression) tried to solve directly the penalized problem ($\ell_0$).
But it is a complex combinatorial problem in high dimension (Natarajan (1995, Sparse Approximate Solutions to Linear Systems) proved that it is an NP-hard problem).
One can prove that if $\lambda \sim \sigma^2 \log(p)$, then
$$\mathbb{E}\big[(x^{\mathsf{T}}\widehat{\beta} - x^{\mathsf{T}}\beta_0)^2\big] \le \underbrace{\mathbb{E}\big[(x_{\mathcal{S}}^{\mathsf{T}}\widehat{\beta}_{\mathcal{S}} - x^{\mathsf{T}}\beta_0)^2\big]}_{= \sigma^2 \#\mathcal{S}} \cdot \big(4\log p + 2 + o(1)\big).$$
In that case
$$\widehat{\beta}_{\lambda,j}^{\,\text{sub}} = \begin{cases} 0 & \text{if } j \notin \mathcal{S}_\lambda(\widehat{\beta}) \\ \widehat{\beta}_j^{\,\text{ols}} & \text{if } j \in \mathcal{S}_\lambda(\widehat{\beta}), \end{cases}$$
where $\mathcal{S}_\lambda(\widehat{\beta})$ is the set of non-null coordinates of the solutions of ($\ell_0$).
Using the $\ell_2$ penalty
With the $\ell_2$ norm, ($\ell_2$) is the Ridge regression,
$$\widehat{\beta}_\lambda^{\,\text{ridge}} = (X^{\mathsf{T}}X + \lambda I)^{-1} X^{\mathsf{T}} y = (X^{\mathsf{T}}X + \lambda I)^{-1} (X^{\mathsf{T}}X)\, \widehat{\beta}^{\,\text{ols}},$$
see Tikhonov (1943, On the Stability of Inverse Problems). Hence
$$\text{bias}[\widehat{\beta}_\lambda^{\,\text{ridge}}] = -\lambda\,[X^{\mathsf{T}}X + \lambda I]^{-1}\, \widehat{\beta}^{\,\text{ols}}, \qquad \text{Var}[\widehat{\beta}_\lambda^{\,\text{ridge}}] = \sigma^2\,[X^{\mathsf{T}}X + \lambda I]^{-1} X^{\mathsf{T}}X\, [X^{\mathsf{T}}X + \lambda I]^{-1}.$$
With orthogonal variables (i.e. $X^{\mathsf{T}}X = I$), we get
$$\text{bias}[\widehat{\beta}_\lambda^{\,\text{ridge}}] = -\frac{\lambda}{1+\lambda}\, \widehat{\beta}^{\,\text{ols}} \quad\text{and}\quad \text{Var}[\widehat{\beta}_\lambda^{\,\text{ridge}}] = \frac{\sigma^2}{(1+\lambda)^2}\, I = \frac{\text{Var}[\widehat{\beta}^{\,\text{ols}}]}{(1+\lambda)^2}.$$
Note that $\text{Var}[\widehat{\beta}_\lambda^{\,\text{ridge}}] < \text{Var}[\widehat{\beta}^{\,\text{ols}}]$. Further, $\text{mse}[\widehat{\beta}_\lambda^{\,\text{ridge}}]$ is minimal when $\lambda = p\sigma^2 / \beta^{\mathsf{T}}\beta$.
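A numerical check of the orthogonal case (Python; orthonormal columns are obtained via a QR decomposition of a random matrix, so $X^{\mathsf{T}}X = I$ by construction):

    import numpy as np

    rng = np.random.default_rng(12)
    n, p, lam = 500, 4, 5.0
    X, _ = np.linalg.qr(rng.normal(size=(n, p)))   # orthonormal columns: X'X = I
    beta = np.array([1.0, -2.0, 3.0, 0.5])
    y = X @ beta + rng.normal(size=n)

    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    print(np.allclose(beta_ridge, beta_ols / (1 + lam)))   # True: pure shrinkage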
Using the $\ell_1$ penalty
If the penalty is no longer the quadratic norm but $\ell_1$, problem ($\ell_1$) is not always strictly convex, and the optimum is not always unique (e.g. if $X^{\mathsf{T}}X$ is singular).
But in the quadratic-loss case, the objective is strictly convex in $X\beta$, and at least the fit $X\widehat{\beta}$ is unique.
Further, note that solutions are necessarily coherent (in the signs of the coefficients): it is not possible to have $\widehat{\beta}_j < 0$ for one solution and $\widehat{\beta}_j > 0$ for another one.
In many cases, problem ($\ell_1$) yields a corner-type solution, which can be seen as a "best subset" solution, like in ($\ell_0$).
Using the $\ell_1$ penalty
Consider a simple regression $y_i = x_i\beta + \varepsilon_i$, with an $\ell_1$ penalty and an $\ell_2$ loss function. ($\ell_1$) becomes
$$\min\big\{ y^{\mathsf{T}}y - 2 y^{\mathsf{T}}x\beta + \beta x^{\mathsf{T}}x \beta + 2\lambda|\beta| \big\}.$$
The first-order condition can be written $-2 y^{\mathsf{T}}x + 2 x^{\mathsf{T}}x\beta \pm 2\lambda = 0$ (the sign in $\pm$ being the sign of $\beta$). Assume that $y^{\mathsf{T}}x > 0$; then the solution is
$$\widehat{\beta}_\lambda^{\,\text{lasso}} = \max\left\{\frac{y^{\mathsf{T}}x - \lambda}{x^{\mathsf{T}}x},\, 0\right\}$$
(we get a corner solution when $\lambda$ is large).
In higher dimension, see Tibshirani & Wasserman (2016, A Closer Look at Sparse Regression) or Candès & Plan (2009, Near-Ideal Model Selection by $\ell_1$ Minimization). Under some additional technical assumptions, the lasso estimator is "sparsistent", in the sense that the support of $\widehat{\beta}_\lambda^{\,\text{lasso}}$ is the same as that of $\beta$.
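The one-dimensional solution above is a soft-thresholding rule; a small check (Python, simulated data, with the sign handled explicitly so that the formula also covers $y^{\mathsf{T}}x < 0$):

    import numpy as np

    rng = np.random.default_rng(13)
    n = 200
    x = rng.normal(size=n)
    y = 1.5 * x + rng.normal(size=n)

    def lasso_1d(x, y, lam):
        # closed form of  min  y'y - 2 y'x b + b x'x b + 2 lam |b|
        yx = y @ x
        return np.sign(yx) * max((abs(yx) - lam) / (x @ x), 0.0)

    for lam in (0.0, 50.0, 200.0, 1000.0):
        print(lam, round(lasso_1d(x, y, lam), 3))   # shrinks to 0 as lam grows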
Going further: $\ell_0$, $\ell_1$ and $\ell_2$ penalties
Thus, the lasso can be used for variable selection (see Hastie et al. (2001, The Elements of Statistical Learning)).
With orthonormal covariates, one can prove that
$$\widehat{\beta}_{\lambda,j}^{\,\text{sub}} = \widehat{\beta}_j^{\,\text{ols}}\, \mathbf{1}_{|\widehat{\beta}_{\lambda,j}^{\,\text{sub}}| > b}, \qquad \widehat{\beta}_{\lambda,j}^{\,\text{ridge}} = \frac{\widehat{\beta}_j^{\,\text{ols}}}{1+\lambda} \qquad\text{and}\qquad \widehat{\beta}_{\lambda,j}^{\,\text{lasso}} = \text{sign}\big[\widehat{\beta}_j^{\,\text{ols}}\big] \cdot \big(|\widehat{\beta}_j^{\,\text{ols}}| - \lambda\big)_+.$$
Econometric Modeling
Data $\{(y_i, x_i)\}$, for $i = 1, \cdots, n$, with $x_i \in \mathcal{X} \subset \mathbb{R}^p$ and $y_i \in \mathcal{Y}$.
A model is a mapping $m : \mathcal{X} \to \mathcal{Y}$
- regression, $\mathcal{Y} = \mathbb{R}$ (but also $\mathcal{Y} = \mathbb{N}$)
- classification, $\mathcal{Y} = \{0,1\}$, $\{-1,+1\}$, $\{\bullet, \bullet\}$ (binary, or more)
Classification models are based on two steps:
• a score function, $s(x) = \mathbb{P}(Y = 1 | X = x) \in [0,1]$
• a classifier $s(x) \to \widehat{y} \in \{0,1\}$.
Classification Trees
To split a node $\{N\}$ into two nodes $\{N_L, N_R\}$, consider an impurity criterion
$$I(N_L, N_R) = \sum_{x \in \{L, R\}} \frac{n_x}{n}\, I(N_x),$$
e.g. the Gini index (used originally in CART, see Breiman et al. (1984, Classification and Regression Trees)),
$$\text{gini}(N_L, N_R) = -\sum_{x \in \{L, R\}} \frac{n_x}{n} \sum_{y \in \{0,1\}} \frac{n_{x,y}}{n_x}\left(1 - \frac{n_{x,y}}{n_x}\right),$$
and the cross-entropy (used in C4.5 and C5.0),
$$\text{entropy}(N_L, N_R) = -\sum_{x \in \{L, R\}} \frac{n_x}{n} \sum_{y \in \{0,1\}} \frac{n_{x,y}}{n_x} \log \frac{n_{x,y}}{n_x}.$$
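A direct transcription of the two split criteria (Python; the class counts in the example split are made up, and the sign convention follows the formulas above):

    import numpy as np

    def node_props(counts):
        counts = np.asarray(counts, dtype=float)
        return counts.sum(), counts / counts.sum()

    def gini_split(left_counts, right_counts):
        # gini(N_L, N_R) = - sum_x (n_x/n) sum_y p_{x,y} (1 - p_{x,y})
        (nL, pL), (nR, pR) = node_props(left_counts), node_props(right_counts)
        n = nL + nR
        return -sum(nx / n * np.sum(p * (1 - p)) for nx, p in [(nL, pL), (nR, pR)])

    def entropy_split(left_counts, right_counts):
        # cross-entropy version, with the convention 0 * log(0) = 0
        (nL, pL), (nR, pR) = node_props(left_counts), node_props(right_counts)
        n = nL + nR
        return -sum(nx / n * np.sum(p[p > 0] * np.log(p[p > 0]))
                    for nx, p in [(nL, pL), (nR, pR)])

    # class counts {0, 1} in the two candidate child nodes
    print(gini_split([40, 10], [5, 45]), entropy_split([40, 10], [5, 45]))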
Model Selection & ROC Curves
Given a scoring function $m(\cdot)$, with $m(x) = \mathbb{E}[Y|X=x]$, and a threshold $s \in (0,1)$, set
$$\widehat{Y}^{(s)} = \mathbf{1}[m(x) > s] = \begin{cases} 1 & \text{if } m(x) > s \\ 0 & \text{if } m(x) \le s \end{cases}$$
Define the confusion matrix as $N = [N_{u,v}]$:

                 Y = 0       Y = 1
    Ŷs = 0       TNs         FNs         TNs + FNs
    Ŷs = 1       FPs         TPs         FPs + TPs
                 TNs + FPs   FNs + TPs   n

The ROC curve is
$$\text{ROC}_s = \left(\frac{\text{FP}_s}{\text{FP}_s + \text{TN}_s},\; \frac{\text{TP}_s}{\text{TP}_s + \text{FN}_s}\right) \quad\text{with } s \in (0,1).$$
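A small sketch computing ROC points by sweeping the threshold (Python, on a simulated score; the AUC approximation via the trapezoidal rule is mine and ignores the two end points of the curve):

    import numpy as np

    def roc_points(y, score, thresholds):
        """(false positive rate, true positive rate) for each threshold s."""
        pts = []
        for s in thresholds:
            y_hat = (score > s).astype(int)
            tp = np.sum((y_hat == 1) & (y == 1))
            fp = np.sum((y_hat == 1) & (y == 0))
            tn = np.sum((y_hat == 0) & (y == 0))
            fn = np.sum((y_hat == 0) & (y == 1))
            pts.append((fp / (fp + tn), tp / (tp + fn)))
        return np.array(pts)

    rng = np.random.default_rng(14)
    y = rng.binomial(1, 0.4, 1000)
    score = np.clip(0.4 * y + rng.normal(0.3, 0.2, 1000), 0, 1)  # imperfect score

    curve = roc_points(y, score, np.linspace(0.01, 0.99, 99))
    auc = np.trapz(curve[::-1, 1], curve[::-1, 0])
    print(round(auc, 3))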
Machine Learning Tools versus Econometric Techniques
Car seats sales data, used in James et al. (2014, An Introduction to Statistical Learning).

    model           AUC
    logit           0.9544
    bagging         0.8973
    random forest   0.9050
    boosting        0.9313

[Figure: ROC curves (sensitivity against specificity) for the four models, with two thresholds highlighted, 0.285 at (0.860, 0.927) and 0.500 at (0.907, 0.860).]

see Section 5.1 in C. et al. (2018, Econométrie et Machine Learning)
Machine Learning Tools versus Econometric Techniques
Caravan sales data, used in James et al. (2014, An Introduction to Statistical Learning).

    model           AUC
    logit           0.7372
    bagging         0.7198
    random forest   0.7154
    boosting        0.7691

[Figure: ROC curves (sensitivity against specificity) for the four models.]

see Section 5.2 in C. et al. (2018, Econométrie et Machine Learning)
Machine Learning Tools versus Econometric Techniques
Credit database, used in Nisbet et al. (2001, Handbook of Statistical Analysis and Data Mining)
Conclusion
Breiman (2001, Statistical Modeling: The Two Cultures)
Econometrics and Machine Learning are focusing on the same problem, but with different cultures:
- the probabilistic interpretation of econometric models (unfortunately sometimes misleading, e.g. the p-value) can deal with non-i.i.d. data (time series, panel, etc.)
- machine learning is about predictive modeling and generalization, with algorithmic tools based on the bootstrap (sampling and sub-sampling), cross-validation, variable selection, nonlinearities, cross effects, etc.