Essential econometrics for data scientists

Essential economics for data scientists
Benjamin S. Skrainka
February 10, 2016

Overview
Economics studies allocation of resources under scarcity. Many of these
tools are useful for data scientists:
Econometric methods adapt classical statistics for applied problems
Causal inference
Experimental design
Regression analysis
Often require only small or ‘medium’ data
Goal of talk: understand causal relationships and their magnitude

Theory I won’t discuss
Economic theory I won’t discuss:
Understanding individual behavior
Understanding firm behavior
Strategic questions: products, pricing, auctions, platforms, incentives,
M&A, new products
Demand estimation & forecasting
Structural modeling

Applied tools I won’t discuss
Econometric tools I won’t discuss:
Structural vs. reduced form
Bayesian vs. frequentist
Counter-factual & welfare analysis
Forecasting

Objectives
Today’s goals:
List differences between the economic & machine learning approaches
Know when to use econometrics or machine learning
Survey alternative types of experiments
Overview of how to estimate causal effects using regression analysis

Agenda
Today’s agenda
1. Econometrics or machine learning?
2. Establishing causality
3. When A/B tests fail...
4. Causal regression analysis

References (1/2)
A few references:
Angrist, Joshua D., and Jörn-Steffen Pischke. Mostly Harmless
Econometrics: An Empiricist’s Companion. Princeton University Press,
2008.
Angrist, Joshua D., and Jörn-Steffen Pischke. Mastering ’Metrics: The
Path from Cause to Effect. Princeton University Press, 2014.
Breiman, Leo. “Statistical Modeling: The Two Cultures (with comments
and a rejoinder by the author).” Statistical Science 16.3 (2001):
199-231.
Cameron, A. Colin, and Pravin K. Trivedi. Microeconometrics:
Methods and Applications. Cambridge University Press, 2005.
Card, David, and Alan B. Krueger. “Minimum Wages and Employment:
A Case Study of the Fast-Food Industry in New Jersey and
Pennsylvania.” The American Economic Review 84.4 (1994): 772-793.

References (2/2)
A few more:
Imbens, Guido W., and Donald B. Rubin. Causal Inference for Statistics,
Social, and Biomedical Sciences. Cambridge University Press, 2015.
LaLonde, Robert J. “Evaluating the Econometric Evaluations of Training
Programs with Experimental Data.” The American Economic Review
(1986): 604-620.
Pearl, Judea. Causality. Cambridge University Press, 2009.
Wooldridge, Jeffrey M. Econometric Analysis of Cross Section and Panel
Data. MIT Press, 2010.

Econometrics or machine learning?

Econometrics vs. machine learning
|              | Econometrics                                       | Machine learning                                              |
|--------------|----------------------------------------------------|---------------------------------------------------------------|
| Approach     | Statistical: data generating process (DGP)         | Algorithmic model, DGP unknown                                |
| Driver       | Theory                                             | Fitting the data                                              |
| Focus        | Hypothesis testing & interpretability              | Predictive accuracy                                           |
| Model choice | Parameter significance & in-sample goodness of fit | Cross-validation of predictive accuracy on partitions of data |
| Strength     | Understand causal relationships & behavior         | Prediction                                                    |
See Breiman (2001) and Matt Bogard’s blog

Establishing causality

How economists think about data
Data has a data generating process (DGP):
Dependent and independent variables are stochastic
A structure is the statistical & functional relationship that determines
the observed outcomes
Two structures are observationally equivalent if they produce the same
process
A structure is identified only if a unique structure can cause the process
Consequently:
Parameter estimates are random
Can perform inference on them...
...if the inverse problem is well-posed, i.e., the model is identified

Variation in data
What is the nature of the variation in the data?
Identify exogenous vs. endogenous sources of variation
Exogenous:
Variable is determined outside the model
Example: cost, weather, draft lottery number, parental income
Endogenous:
Variable is determined inside the model
Caused by $E[x \cdot \epsilon] \neq 0$ or a causal loop between y and x
Example: crime & policing; price & demand, product characteristics
⇒ Mishandling endogenous features/variables almost always causes biased
estimates

Experimental vs. observational data
In the Rubin Causal Model, experimental data satisfies:
1. Individualistic: whether I am assigned to treatment doesn’t affect
whether you are
2. Probabilistic: non-zero probability of assignment to each treatment
3. Unconfoundedness: outcome doesn’t affect probability of assignment
4. Known, random assignment rule
If assumption 4 is violated, your data is observational
Also need Stable Unit Treatment Value Assumption (SUTVA)

Causality and ceteris paribus
Measuring causality depends on ceteris paribus:
Ceteris paribus means “all else being equal”
I.e., compare apples to apples by conditioning on everything other than
the variable under analysis
E.g., data should be as good as randomly assigned
In the terminology of Wooldridge (2010):
Want to understand how w affects y
Must condition on other confounding influences c, or correlation between
w and c will bias results:
E[y|w, c]

Model Interpretation
Interpretation based on:
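Partial effects:
Continuous w: $\beta_w = \dfrac{\partial E[y \mid w, c]}{\partial w}$
Discrete w: $\beta_w = \Delta_w E[y \mid w, c]$
Elasticities:
$$\eta = \frac{w}{E[y \mid w, c]} \cdot \frac{\partial E[y \mid w, c]}{\partial w} = \frac{\partial \log E[y \mid w, c]}{\partial \log w}$$
Captures dimensionless change
E.g., market power:
$$\frac{\text{price} - \text{marginal cost}}{\text{price}} = -\frac{1}{\eta_{\text{Demand}}}$$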
Structural models permit more sophisticated counter-factual analysis
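
A minimal Python sketch of the elasticity definition, assuming a hypothetical constant-elasticity demand curve and simulated data (the value of true_eta, the price range, and the noise level are illustrative choices, not from the talk). In a log-log regression the slope on log price estimates η, and the market power formula then gives the implied markup.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
true_eta = -2.0  # assumed demand elasticity (illustrative)

# Simulated data: log q = 3 + eta * log p + noise
price = rng.uniform(1.0, 10.0, size=n)
log_q = 3.0 + true_eta * np.log(price) + rng.normal(0.0, 0.1, size=n)

# Regress log quantity on a constant and log price; the slope is the elasticity.
X = np.column_stack([np.ones(n), np.log(price)])
coef, *_ = np.linalg.lstsq(X, log_q, rcond=None)
eta_hat = coef[1]

# Markup implied by the market power formula: (p - mc) / p = -1 / eta.
lerner = -1.0 / eta_hat
print(f"estimated elasticity: {eta_hat:.3f}, implied markup: {lerner:.3f}")
```

Because the elasticity is dimensionless, rescaling price or quantity (say, dollars to cents) shifts only the intercept and leaves eta_hat unchanged.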

Establishing Causality
It is very difficult to establish causality:
Provide strong evidence of causality:
Well-designed experimentation
Careful statistical analysis
Show method controls for sources of possible bias
Randomization is the gold standard
For observational data, must show controls make (regression) analysis
‘as good as randomly assigned’
See LaLonde (1986) or Angrist & Pischke (2008, 2014)

Example 1: selection bias (1/2)
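Suppose we measure the impact, $\hat{\gamma}$, of a policy intervention (e.g., advertising):
$$Y_i(W_i) = \mu + \gamma \cdot W_i + \epsilon_i$$
Try differencing averages, but we only observe outcomes conditional on
treatment status $W_i$:
$$\text{observed effect} = \mathrm{AVG}_n[Y_i(1) \mid W_i = 1] - \mathrm{AVG}_n[Y_i(0) \mid W_i = 0]$$
But:
$$\begin{aligned}
\text{observed effect} ={} & \underbrace{\mathrm{AVG}_n[Y_i(1) \mid W_i = 1] - \mathrm{AVG}_n[Y_i(1) \mid W_i = 0]}_{\text{selection}} \\
& + \underbrace{\mathrm{AVG}_n[Y_i(1) \mid W_i = 0] - \mathrm{AVG}_n[Y_i(0) \mid W_i = 0]}_{\text{direct effect}}
\end{aligned}$$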

Example 1: selection bias (2/2)
Selection bias is everywhere in behavioral data:
Randomize to eliminate selection!
In the absence of randomization:
Model selection process
Choose sensible functional form & distributions
Use a control function
Condition on suitable controls to compare groups which are as good as
randomly assigned
See Wooldridge (2010)
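
A small simulation can make the decomposition concrete. This is a sketch under stated assumptions: the outcome model is the constant-effect model from Example 1, and the values of mu and gamma, the opt-in rule, and the sample size are hypothetical. Units with a high error term are more likely to opt in, so the difference in averages overstates γ, while coin-flip assignment recovers it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
mu, gamma = 1.0, 2.0  # hypothetical baseline and true effect

eps = rng.normal(size=n)
y0 = mu + eps          # potential outcome Y_i(0)
y1 = mu + gamma + eps  # potential outcome Y_i(1)

# Self-selection: units with a high eps are more likely to opt in.
w_self = (eps + rng.normal(size=n) > 0).astype(int)
# Randomization: a coin flip that ignores eps.
w_rand = rng.integers(0, 2, size=n)

def diff_in_means(w):
    """Observed effect: AVG[Y(1) | W=1] - AVG[Y(0) | W=0]."""
    return y1[w == 1].mean() - y0[w == 0].mean()

naive = diff_in_means(w_self)
experimental = diff_in_means(w_rand)

# The two terms of the decomposition, computed for the self-selected sample.
gap_between_groups = y1[w_self == 1].mean() - y1[w_self == 0].mean()
effect_on_untreated = y1[w_self == 0].mean() - y0[w_self == 0].mean()

print(f"naive: {naive:.3f}, randomized: {experimental:.3f}")
print(f"gap between groups: {gap_between_groups:.3f}, "
      f"effect on untreated: {effect_on_untreated:.3f}")
```

The gap between groups compares different units under the same treatment, so it reflects who selects into treatment rather than what the treatment does. Only a simulation can compute it, because in real data Y_i(1) is never observed for untreated units.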

Endogeneity
Consider a regression model:
$$y_i = x_i \beta + \epsilon_i$$
If weak exogeneity, $E[x_i \cdot \epsilon_i] = 0$, holds (among other assumptions) ⇒
$\hat{\beta}$ is unbiased
Endogeneity occurs if $y_i$ and $x_i$ are codetermined:
Weak exogeneity fails, $E[x_i \cdot \epsilon_i] \neq 0$
$\hat{\beta}$ is biased

Types of endogeneity
There are several types of endogeneity:
Simultaneity
Omitted variable bias (OVB)
Selection bias
Measurement error
If any are present, $E[x_i \cdot \epsilon_i] \neq 0$ and your estimates are biased

Common endogenous variables
Endogeneity is everywhere:
Price & demand
Product characteristics
Wages and schooling (or anything affected by ability)
Labor-force participation
Policing & crime

Example 2: OLS & endogeneity
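Quick review of OLS in one dimension:
$$y_i = \beta_0 + \beta_1 \cdot x_i + \epsilon_i$$
$$\mathrm{Cov}(y_i, x_i) = \beta_1 \cdot \mathrm{Var}(x_i) + \mathrm{Cov}(\epsilon_i, x_i)$$
$$\hat{\beta}_1 = \beta_1 + \frac{\mathrm{Cov}(\epsilon_i, x_i)}{\mathrm{Var}(x_i)}$$
⇒ $\hat{\beta}_1$ is unbiased ⇔ $\mathrm{Cov}(\epsilon_i, x_i) = 0$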

Omitted variable bias (OVB)
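OVB is a common problem:
$$y_i = \beta \cdot x_i + \alpha \cdot z_i + \epsilon_i$$
But $z_i$ is omitted from the regression
Then the effective error term is $u_i = \epsilon_i + \alpha \cdot z_i$ and $E[u_i \cdot x_i] \neq 0$
Estimates are biased:
$$\hat{\beta} = \frac{\mathrm{Cov}(y_i, x_i)}{\mathrm{Var}(x_i)} = \beta + \alpha \cdot \frac{\mathrm{Cov}(z_i, x_i)}{\mathrm{Var}(x_i)}$$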
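
The size of the omitted variable bias can be checked numerically. The following is a sketch on simulated data; the coefficients beta and alpha and the strength of the link between x and z are hypothetical. The regression that omits z should land near β + α · Cov(z, x) / Var(x), while the regression that includes z should land near β.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
beta, alpha = 1.0, 0.5  # hypothetical true coefficients

z = rng.normal(size=n)                 # the variable that will be omitted
x = 0.8 * z + rng.normal(size=n)       # x is correlated with z
y = beta * x + alpha * z + rng.normal(size=n)

def cov(a, b):
    """Sample covariance (divides by n)."""
    return np.mean((a - a.mean()) * (b - b.mean()))

# Short regression: y on x only, so z ends up in the error term.
beta_short = cov(y, x) / cov(x, x)

# What the omitted-variable formula predicts for the short regression.
predicted = beta + alpha * cov(z, x) / cov(x, x)

# Long regression: y on x and z, which removes the bias.
X_long = np.column_stack([x, z])
beta_long = np.linalg.lstsq(X_long, y, rcond=None)[0][0]

print(f"short: {beta_short:.3f}, predicted: {predicted:.3f}, long: {beta_long:.3f}")
```

The sign of the bias follows the sign of α · Cov(z, x): here both are positive, so the short regression overstates β.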

Individual heterogeneity
Handling individual heterogeneity is a key triumph of modern econometrics:
Unobserved effects from individuals and firms affect behavior
Failure to model them can cause biased or inefficient estimates
Often, can use panel data models to control for an unobserved effect:
Panel data is data where we observe individuals over time
Can exploit this to eliminate temporal and individual biases:
$$y_{it} = \alpha_i + \delta_t + x_{it} \beta_i + u_{it}$$

Assembling a dataset
Economists think hard about the relationship between features and
outcomes, and what is lurking in the error term:
Think about the error term:
Endogeneity?
Omitted variables?
Valid instruments?
Seek out supplementary data which could explain behavior . . . or proxy
for missing features
Choose a functional form to control for ignorance yet measure what
matters
Create a panel to control for individual heterogeneity

Feature engineering à la economics
To handle these problems, economists add supplementary data:
Instrumental variables
Proxy variables
Dummy variables for individuals
Or, use clever tricks:
Panel data
Regression discontinuity design
Difference-in-differences

Natural experiments

Overview
Sometimes A/B testing is not possible:
Impossible to run the experiment
Look for natural randomization devices which provide ‘as good as
random assignment to treatment’:
Birthdays
Draft lottery numbers
Access decisions
Natural randomization can eliminate selection
Always check for balance!
Requires some cleverness...
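
One common way to check for balance is the standardized mean difference of each pre-treatment covariate between the two groups. The sketch below uses simulated data; the helper name standardized_mean_difference, the covariate, and both assignment rules are hypothetical, and the frequently quoted cutoff of roughly 0.1 is a rule of thumb rather than something from the talk.

```python
import numpy as np

def standardized_mean_difference(x, treated):
    """Difference in group means divided by the pooled standard deviation."""
    x = np.asarray(x, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    x1, x0 = x[treated], x[~treated]
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2.0)
    return (x1.mean() - x0.mean()) / pooled_sd

rng = np.random.default_rng(3)
n = 20_000
age = rng.normal(40, 10, size=n)  # hypothetical pre-treatment covariate

# As-good-as-random assignment: unrelated to age.
random_assign = rng.random(n) < 0.5
# Selected assignment: older units are more likely to be treated.
selected = age + rng.normal(0, 10, size=n) > 45

smd_random = standardized_mean_difference(age, random_assign)
smd_selected = standardized_mean_difference(age, selected)
print(f"random: {smd_random:.3f}, selected: {smd_selected:.3f}")
```

A large value on any covariate suggests the natural randomization device is related to unit characteristics, so treatment and control groups are not comparable without further controls.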

A natural experiment
Often ‘nature’ provides natural randomization which is as good as
experimental randomization.
Example: needed to measure lift of experiential marketing campaign:
No experimental design
Ten treatment units
Matching estimators failed
But, short list had 50 sites...
Assume (and test) that access is as good as random ⇒ have valid treatment & control groups!

Causal regression analysis

Causal regression analysis
To establish causality, must condition on other variables so ceteris paribus
applies:
Want to estimate E[y|w, c], where:
w is the factor of interest
c are other factors, correlated with w which could confound analysis
Need good measures of y, w, and c
Beware of variables determined by equilibrium:
Must deal with simultaneous equation modeling
Must avoid endogenous variables
E.g., crime vs. policing

Some common regression tools
Econometricians have developed many methods to assess causal
relationships:
Regression discontinuity design (RDD)
Difference-in-differences (DID)
Instrumental variables: use exogenous variables to instrument an
endogenous variable
Panel data
Other methods include:
Matching estimators
Censoring & truncation
Discrete choice
Discrete/continuous choice

Regression discontinuity design
Can exploit policy or laws which cause a ‘natural’ treatment effect to
establish causality:
Example: impact of drinking age on mortality
People on either side of 21 get different treatment but are essentially
identical
$W_i$ is a discontinuous function of age
age is known as the running variable
$$\text{mortality}_i = \alpha_0 + \alpha_1 \cdot \text{age} + \gamma \cdot W_i(\text{age}) + \epsilon_i$$
See Angrist & Pischke (2014)
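
The regression above can be sketched on simulated data. Every number below is hypothetical (the jump gamma, the slope, the noise, and the age window are not real mortality figures), and the example assumes the relationship is linear with the same slope on both sides of the cutoff. Age is centered at the cutoff, which changes the intercept but not the estimate of γ.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
cutoff, gamma = 21.0, 8.0  # hypothetical cutoff and jump at the cutoff

age = rng.uniform(19.0, 23.0, size=n)   # running variable
w = (age >= cutoff).astype(float)       # treatment switches on at the cutoff
mortality = (90.0 - 1.5 * (age - cutoff) + gamma * w
             + rng.normal(0.0, 5.0, size=n))

# Regress the outcome on a constant, centered age, and the treatment dummy.
X = np.column_stack([np.ones(n), age - cutoff, w])
alpha0_hat, alpha1_hat, gamma_hat = np.linalg.lstsq(X, mortality, rcond=None)[0]
print(f"estimated jump at the cutoff: {gamma_hat:.2f}")
```

In practice the linear, common-slope form is itself an assumption to test, for example by narrowing the window around the cutoff or letting the slope differ on each side.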

Difference-in-differences (DID) (1/2)
DID is useful when you have individual effects and common time trends:
Must observe data over at least two periods
Eliminates bias from individual and time effects
E.g., impact of minimum wage on employment; see Card & Krueger (1994)
$$y_{it} = \alpha_i + \delta_t + \gamma \cdot W_i \times \mathrm{PERIOD}_t$$
$$\hat{\gamma} = (y_{NJ,1} - y_{NJ,0}) - (y_{PA,1} - y_{PA,0})$$
$$\hat{\gamma} = (\Delta\delta_t + \gamma) - (\Delta\delta_t)$$

DID regression (2/2)
Can estimate DID using regression:
$$y_{it} = \alpha_i + \delta_t + \beta \cdot W_i + \gamma \cdot W_i \times \mathrm{PERIOD}_t + \epsilon_{it}$$
$W_i$ is treatment status
$\mathrm{PERIOD}_t \in \{0, 1\}$ for periods 0 and 1, respectively
$\gamma$ is the treatment effect
Add additional covariates to control as needed
Verify common trends holds (e.g., plot data vs. time)
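
A sketch of both routes to the DID estimate on a simulated two-period panel. The effect gamma, the group gap, and the common trend are hypothetical, and the regression is a simplification that replaces the unit effects α_i with a single treated-group dummy. With two groups and two periods, the interaction coefficient equals the difference of the four cell means.

```python
import numpy as np

rng = np.random.default_rng(5)
n_units = 2_000
gamma = -0.5  # hypothetical treatment effect

# Long format: each unit appears twice, period 0 then period 1.
treated = np.repeat(rng.integers(0, 2, size=n_units), 2)   # W_i
period = np.tile([0, 1], n_units)                          # PERIOD_t
# Unit effects; treated units also start from a different level.
alpha_i = np.repeat(rng.normal(0.0, 1.0, size=n_units), 2) + 2.0 * treated
y = (alpha_i + 1.0 * period + gamma * treated * period
     + rng.normal(0.0, 0.5, size=2 * n_units))

def cell_mean(w, t):
    """Average outcome for group w in period t."""
    return y[(treated == w) & (period == t)].mean()

# Route 1: difference of before/after changes across the two groups.
did_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

# Route 2: regression with a group dummy, a period dummy, and their interaction.
X = np.column_stack([np.ones_like(y), treated, period, treated * period])
did_reg = np.linalg.lstsq(X, y, rcond=None)[0][3]

print(f"difference of means: {did_means:.3f}, regression: {did_reg:.3f}")
```

The regression route is the one that extends naturally to extra covariates and more periods. The simulation builds in a common trend by construction; with real data that has to be checked.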

Instrumental variables
An instrumental variable, z, provides a way to correct for endogeneity:
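Assumptions:
$E[\epsilon_i \cdot z_i] = 0$ (exogeneity)
$E[x_i \cdot z_i] \neq 0$ (relevance)
Use z in regression:
$$\mathrm{Cov}(y_i, z_i) = \beta_1 \cdot \mathrm{Cov}(x_i, z_i) + \mathrm{Cov}(\epsilon_i, z_i)$$
$$\hat{\beta}_1^{IV} = \beta_1 + \frac{\mathrm{Cov}(\epsilon_i, z_i)}{\mathrm{Cov}(x_i, z_i)}$$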
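
The covariance-ratio form of the IV estimator can be sketched on simulated data. All coefficients below are hypothetical: an unobserved confounder u drives both x and the error term, which biases OLS, while z moves x but is independent of the error.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
beta1 = 1.5  # hypothetical true coefficient

z = rng.normal(size=n)                  # instrument: exogenous and relevant
u = rng.normal(size=n)                  # unobserved confounder
x = 0.7 * z + u + rng.normal(size=n)    # endogenous regressor
eps = 2.0 * u + rng.normal(size=n)      # error term, correlated with x via u
y = 0.5 + beta1 * x + eps

def cov(a, b):
    """Sample covariance (divides by n)."""
    return np.mean((a - a.mean()) * (b - b.mean()))

beta_ols = cov(y, x) / cov(x, x)   # biased because Cov(eps, x) != 0
beta_iv = cov(y, z) / cov(x, z)    # consistent because Cov(eps, z) = 0

print(f"OLS: {beta_ols:.3f}, IV: {beta_iv:.3f}, truth: {beta1}")
```

The denominator Cov(x, z) shows why relevance matters: if the instrument barely moves x, small violations of exogeneity or plain sampling noise are divided by a number near zero.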

Panel data
Panel data is a powerful tool to eliminate sources of bias:
$$y_{it} = \alpha_i + \delta_t + x_{it} \beta_i + \epsilon_{it}$$
Panel data consists of individuals observed over time, i.e., $x_{it}$
Has time series and cross-section properties
Can eliminate individual & time effects:
Within estimator ($\ddot{x}_{it} \leftarrow x_{it} - \bar{x}_i$)
First differences (FD)
Can also handle serial correlation of $\{\epsilon_t\}$
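
A sketch comparing pooled OLS with the within and first-difference estimators on a simulated balanced panel. The assumptions are stated loudly: a single common slope beta (rather than unit-specific slopes), hypothetical parameter values, and a regressor that is correlated with the individual effect. The within transformation here removes period means as well as unit means so that both α_i and δ_t drop out.

```python
import numpy as np

rng = np.random.default_rng(7)
n_units, n_periods, beta = 1_000, 5, 2.0  # hypothetical panel and common slope

alpha = rng.normal(size=(n_units, 1))     # individual effects alpha_i
delta = rng.normal(size=(1, n_periods))   # time effects delta_t
x = alpha + rng.normal(size=(n_units, n_periods))  # x correlated with alpha_i
y = alpha + delta + beta * x + rng.normal(size=(n_units, n_periods))

def slope(xv, yv):
    """OLS slope of yv on xv after removing the overall means."""
    xc = xv - xv.mean()
    yc = yv - yv.mean()
    return float((xc * yc).sum() / (xc * xc).sum())

def two_way_demean(v):
    """Subtract unit means and period means (balanced panel)."""
    return v - v.mean(axis=1, keepdims=True) - v.mean(axis=0, keepdims=True) + v.mean()

# Pooled OLS ignores alpha_i, which sits in the error and is correlated with x.
beta_pooled = slope(x, y)

# Within estimator: demeaning removes alpha_i and delta_t.
beta_within = slope(two_way_demean(x), two_way_demean(y))

# First differences remove alpha_i; period demeaning removes the change in delta_t.
dx, dy = np.diff(x, axis=1), np.diff(y, axis=1)
beta_fd = slope(dx - dx.mean(axis=0, keepdims=True), dy - dy.mean(axis=0, keepdims=True))

print(f"pooled: {beta_pooled:.3f}, within: {beta_within:.3f}, FD: {beta_fd:.3f}")
```

Both transformations identify the slope only from variation within a unit over time, so a regressor that never changes for a unit drops out along with α_i.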

Least squares dummy variable regression (LSDV)
Often panel data is equivalent to LSDV:
Occurs when the individual effect can be modeled using dummy variables
for each individual
Frisch-Waugh decomposition:
Powerful dimension reduction
Use when you need to control for an unobserved effect without estimating it
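
The equivalence can be verified numerically. This sketch uses a small simulated panel with individual effects only; the sizes, the common slope beta, and the helper demean_by_unit are hypothetical. The slope from a regression with one dummy per individual matches the slope from regressing unit-demeaned y on unit-demeaned x, which is the Frisch-Waugh result applied to the dummies.

```python
import numpy as np

rng = np.random.default_rng(8)
n_units, n_periods, beta = 50, 4, 2.0  # hypothetical small panel

unit = np.repeat(np.arange(n_units), n_periods)   # unit index for each row
alpha = rng.normal(size=n_units)                  # individual effects
x = alpha[unit] + rng.normal(size=unit.size)      # x correlated with alpha_i
y = alpha[unit] + beta * x + rng.normal(size=unit.size)

# LSDV: regress y on x plus one dummy column per individual.
dummies = (unit[:, None] == np.arange(n_units)[None, :]).astype(float)
beta_lsdv = np.linalg.lstsq(np.column_stack([x, dummies]), y, rcond=None)[0][0]

def demean_by_unit(v):
    """Subtract each individual's own mean from its observations."""
    means = np.bincount(unit, weights=v) / np.bincount(unit)
    return v - means[unit]

# Within estimator: no dummies, only a one-column regression on demeaned data.
x_w, y_w = demean_by_unit(x), demean_by_unit(y)
beta_within = (x_w @ y_w) / (x_w @ x_w)

print(f"LSDV: {beta_lsdv:.6f}, within: {beta_within:.6f}")
```

The LSDV design matrix here has 51 columns while the within regression has one, which is the dimension reduction that matters when the number of individuals is large.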

Conclusion
To establish a causal relationship:
Must furnish evidence of correctness
Experiment is gold standard
In absence of random assignment to treatment:
Natural experiments can provide as good as random assignment to
treatment
Regression analysis can be causal if you condition on confounding
influences and control for endogeneity
