Essential economics for data scientists
Benjamin S. Skrainka
February 10, 2016
Benjamin S. Skrainka Essential economics for data scientists February 10, 2016 1 / 40
Overview
Economics studies allocation of resources under scarcity. Many of these
tools are useful for data scientists:
Econometric methods adapt classical statistics for applied problems
Causal inference
Experimental design
Regression analysis
Often require only small or ‘medium’ data
Goal of talk: understand causal relationships (and their magnitude)
Theory I won’t discuss
Economic theory I won’t discuss:
Understanding individual behavior
Understanding firm behavior
Strategic questions: products, pricing, auctions, platforms, incentives,
M&A, new products
Demand estimation & forecasting
Structural modeling
Applied tools I won’t discuss
Econometric tools I won’t discuss:
Structural vs. reduced form
Bayesian vs. frequentist
Counter-factual & welfare analysis
Forecasting
Objectives
Today’s goals:
List differences between economic & machine learning approach
Know when to use econometrics or machine learning
Survey alternative types of experiments
Overview of how to estimate causal effects using regression analysis
Agenda
Today’s agenda
1 Econometrics or machine learning?
2 Establishing causality
3 When A/B tests fail. . .
4 Causal regression analysis
References (1/2)
A few references:
Angrist, Joshua D., and Jörn-Steffen Pischke. Mostly harmless
econometrics: An empiricist’s companion. Princeton University Press,
2008.
Angrist, Joshua D., and Jörn-Steffen Pischke. Mastering ’metrics: The
path from cause to effect. Princeton University Press, 2014.
Breiman, Leo. “Statistical modeling: The two cultures (with comments
and a rejoinder by the author).” Statistical Science 16.3 (2001):
199-231.
Cameron, A. Colin, and Pravin K. Trivedi. Microeconometrics:
Methods and applications. Cambridge University Press, 2005.
Card, David, and Alan B. Krueger. “Minimum Wages and Employment:
A Case Study of the Fast-Food Industry in New Jersey and
Pennsylvania.” The American Economic Review 84.4 (1994): 772-793.
References (2/2)
A few more:
Imbens, Guido W., and Donald B. Rubin. Causal inference in statistics,
social, and biomedical sciences. Cambridge University Press, 2015.
LaLonde, Robert J. “Evaluating the econometric evaluations of training
programs with experimental data.” The American Economic Review
(1986): 604-620.
Pearl, Judea. Causality. Cambridge University Press, 2009.
Wooldridge, Jeffrey M. Econometric analysis of cross section and panel
data. MIT Press, 2010.
Econometrics or machine learning?
Econometrics vs. machine learning
             | Econometrics | Machine learning
Approach     | statistical: data generating process | algorithmic model, DGP unknown
Driver       | theory | fitting the data
Focus        | hypothesis testing & interpretability | predictive accuracy
Model choice | parameter significance & in-sample goodness of fit | cross-validation of predictive accuracy on partitions of data
Strength     | understand causal relationships & behavior | prediction
See Breiman (2001) and Matt Bogard’s blog
Establishing causality
How economists think about data
Data has a data generating process (DGP):
Dependent and independent variables are stochastic
A structure is the statistical & functional relationship that determines
the observed outcomes
Two structures are observationally equivalent if they produce the same
process
A structure is identified only if no other structure can generate the same process
Consequently:
Parameter estimates are random
Can perform inference on them. . .
. . . if the inverse problem is well-posed, i.e., the model is identified
Variation in data
What is the nature of the variation in the data?
Identify exogenous vs. endogenous sources of variation
Exogenous:
Variable is determined outside the model
Example: cost, weather, draft lottery number, parental income
Endogenous:
Variable is determined inside the model
Caused by $E[x \cdot \varepsilon] \neq 0$ or a causal loop between y and x
Example: crime & policing; price & demand, product characteristics
⇒ Mishandling endogenous features/variables almost always causes biased
estimates
Experimental vs. observational data
In the Rubin Causal Model, experimental data satisfies:
1 Individualistic: whether I am assigned to treatment doesn’t affect
whether you are
2 Probabilistic: non-zero probability of assignment to each treatment
3 Unconfoundedness: potential outcomes don’t affect probability of assignment
4 Known, random assignment rule
If condition 4 is violated, your data are observational
Also need Stable Unit Treatment Value Assumption (SUTVA)
Causality and ceteris paribus
Measuring causality depends on ceteris paribus:
Ceteris paribus means “all else being equal”
I.e., compare apples to apples by conditioning on everything other than
the variable under analysis
E.g., data should be as good as randomly assigned
In the terminology of Wooldridge (2010):
Want to understand how w affects y
Must condition on other confounding influences or correlation between
w and c will bias results:
E[y|w, c]
Model Interpretation
Interpretation based on:
Partial effects:
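Continuous w: $\beta_w = \dfrac{\partial E[y \mid w, c]}{\partial w}$
Discrete w: $\beta_w = \Delta_w E[y \mid w, c]$
Elasticities:
$$\eta = \frac{w}{E[y \mid w, c]} \cdot \frac{\partial E[y \mid w, c]}{\partial w} = \frac{\partial \log E[y \mid w, c]}{\partial \log w}$$
Captures dimensionless change
E.g., market power:
$$\frac{\text{price} - \text{marginal cost}}{\text{price}} = -\frac{1}{\eta_{\text{Demand}}}$$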
Structural models permit more sophisticated counter-factual analysis
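The elasticity formula can be illustrated with a minimal sketch on simulated data. It assumes a constant-elasticity demand curve with exogenous price variation; the data, parameter values, and variable names are all hypothetical and not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical constant-elasticity demand: log q = a + eta * log p + noise
true_eta = -2.0
n = 5_000
log_p = rng.uniform(0.0, 1.0, size=n)   # price variation assumed exogenous
log_q = 3.0 + true_eta * log_p + rng.normal(0.0, 0.1, size=n)

# OLS of log q on log p; with noise independent of price, the slope
# equals the elasticity d log E[q | p] / d log p
X = np.column_stack([np.ones(n), log_p])
coef, *_ = np.linalg.lstsq(X, log_q, rcond=None)
eta_hat = coef[1]

# Market power implied by the slide's formula: (p - mc) / p = -1 / eta
markup_share = -1.0 / eta_hat
print(f"elasticity ~ {eta_hat:.2f}, implied (p - mc)/p ~ {markup_share:.2f}")
```

With real data, price is usually endogenous, so this regression would need an instrument before the slope could be read as a demand elasticity.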
Establishing Causality
It is very difficult to establish causality:
Provide strong evidence of causality:
Well-designed experimentation
Careful statistical analysis
Show method controls for sources of possible bias
Randomization is the gold standard
For observational data, must show controls make (regression) analysis
‘as good as randomly assigned’
See LaLonde (1986) or Angrist & Pischke (2008, 2014)
Example 1: selection bias (1/2)
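Suppose we measure the impact of a policy intervention (e.g., advertising), $\hat{\gamma}$:
$$Y_i(W_i) = \mu + \gamma \cdot W_i + \varepsilon_i$$
Try differencing averages, but we only observe outcomes conditional on treatment status $W_i$:
$$\text{observed effect} = \mathrm{AVG}_n[Y_i(1) \mid W_i = 1] - \mathrm{AVG}_n[Y_i(0) \mid W_i = 0]$$
But, adding and subtracting the unobserved counterfactual $\mathrm{AVG}_n[Y_i(0) \mid W_i = 1]$:
$$\text{observed effect} = \underbrace{\mathrm{AVG}_n[Y_i(1) \mid W_i = 1] - \mathrm{AVG}_n[Y_i(0) \mid W_i = 1]}_{\text{direct effect}} + \underbrace{\mathrm{AVG}_n[Y_i(0) \mid W_i = 1] - \mathrm{AVG}_n[Y_i(0) \mid W_i = 0]}_{\text{selection}}$$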
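The split of the naive difference in means into a causal part and a selection part can be checked on simulated data, where both potential outcomes are known. This is a sketch under assumed parameter values and an assumed self-selection rule (units with a high error term opt in); none of it comes from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
mu, gamma = 1.0, 0.5                  # hypothetical baseline and true effect

eps = rng.normal(size=n)
# Assumed self-selection: units with high eps are more likely to be treated
w = (eps + rng.normal(size=n) > 0).astype(int)

y0 = mu + eps                         # potential outcome without treatment
y1 = mu + gamma + eps                 # potential outcome with treatment
y = np.where(w == 1, y1, y0)          # only one outcome is observed per unit

observed = y[w == 1].mean() - y[w == 0].mean()
direct = y1[w == 1].mean() - y0[w == 1].mean()       # effect on the treated
selection = y0[w == 1].mean() - y0[w == 0].mean()    # baseline gap between groups

print(f"observed={observed:.3f} direct={direct:.3f} selection={selection:.3f}")
```

Here the naive comparison overstates the true effect because treated units would have had higher outcomes even without treatment.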
Example 1: selection bias (2/2)
Selection bias is everywhere in behavioral data:
Randomize to eliminate selection!
In the absence of randomization:
Model selection process
Choose sensible functional form & distributions
Use a control function
Condition on suitable controls to compare groups which are as good as
randomly assigned
See Wooldridge (2010)
Endogeneity
Consider a regression model:
$$y_i = x_i \beta + \varepsilon_i$$
If weak exogeneity, $E[x_i \cdot \varepsilon_i] = 0$, holds (among other assumptions) ⇒ $\hat{\beta}$ is unbiased
Endogeneity occurs if $y_i$ and $x_i$ are codetermined:
Weak exogeneity fails, $E[x_i \cdot \varepsilon_i] \neq 0$
$\hat{\beta}$ is biased
Types of endogeneity
There are several types of endogeneity:
Simultaneity
Omitted variable bias (OVB)
Selection bias
Measurement error
If any are present, $E[x_i \cdot \varepsilon_i] \neq 0$ and your estimates are biased
Common endogenous variables
Endogeneity is everywhere:
Price & demand
Product characteristics
Wages and schooling (or anything affected by ability)
Labor-force participation
Policing & crime
Example 2: OLS & endogeneity
Quick review of OLS in one dimension:
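$$y_i = \beta_0 + \beta_1 \cdot x_i + \varepsilon_i$$
$$\mathrm{Cov}(y_i, x_i) = \beta_1 \cdot \mathrm{Var}(x_i) + \mathrm{Cov}(\varepsilon_i, x_i)$$
$$\hat{\beta}_1 = \beta_1 + \frac{\mathrm{Cov}(\varepsilon_i, x_i)}{\mathrm{Var}(x_i)}$$
⇒ $\hat{\beta}_1$ is unbiased ⇔ $\mathrm{Cov}(\varepsilon_i, x_i) = 0$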
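The bias term in the one-dimensional OLS formula can be verified numerically. The sketch below simulates a regressor that is correlated with the error; the coefficients and the strength of that correlation are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
beta0, beta1 = 1.0, 2.0               # hypothetical true parameters

eps = rng.normal(size=n)
x = 0.8 * eps + rng.normal(size=n)    # x depends on eps, so x is endogenous
y = beta0 + beta1 * x + eps

cov_yx = np.cov(y, x)[0, 1]
cov_ex = np.cov(eps, x)[0, 1]
var_x = np.var(x, ddof=1)

beta1_hat = cov_yx / var_x            # one-dimensional OLS slope
bias = cov_ex / var_x                 # the term Cov(eps, x) / Var(x)

print(f"beta1_hat={beta1_hat:.3f} = beta1 ({beta1}) + bias ({bias:.3f})")
```

In practice the error term is unobserved, so the bias cannot be computed directly; the simulation only shows that the formula holds.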
Omitted variable bias (OVB)
OVB is a common problem:
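$$y_i = \beta \cdot x_i + \alpha \cdot z_i + \varepsilon_i$$
But, $z_i$ is omitted from the regression
Then, the effective error term is $u_i = \varepsilon_i + \alpha \cdot z_i$ and $E[u_i \cdot x_i] \neq 0$ whenever $z_i$ and $x_i$ are correlated
Estimates are biased:
$$\hat{\beta} = \frac{\mathrm{Cov}(y_i, x_i)}{\mathrm{Var}(x_i)} = \beta + \alpha \cdot \frac{\mathrm{Cov}(z_i, x_i)}{\mathrm{Var}(x_i)} + \frac{\mathrm{Cov}(\varepsilon_i, x_i)}{\mathrm{Var}(x_i)}$$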
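A small simulation makes the omitted variable bias concrete: the regression that drops $z_i$ is off by roughly $\alpha \cdot \mathrm{Cov}(z_i, x_i) / \mathrm{Var}(x_i)$, while the regression that includes $z_i$ recovers $\beta$. The parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
beta, alpha = 1.0, 2.0                # hypothetical true parameters

z = rng.normal(size=n)                # the variable that will be omitted
x = 0.5 * z + rng.normal(size=n)      # x is correlated with z
eps = rng.normal(size=n)
y = beta * x + alpha * z + eps

var_x = np.var(x, ddof=1)
beta_short = np.cov(y, x)[0, 1] / var_x                 # regression omitting z
predicted = beta + alpha * np.cov(z, x)[0, 1] / var_x   # beta plus the OVB term

# Regression that includes z (no intercept needed: all variables have mean zero)
coef, *_ = np.linalg.lstsq(np.column_stack([x, z]), y, rcond=None)
beta_long = coef[0]

print(f"short={beta_short:.3f} predicted={predicted:.3f} long={beta_long:.3f}")
```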
Individual heterogeneity
Handling individual heterogeneity is a key triumph of modern econometrics:
Unobserved effects from individuals and firms affect behavior
Failure to model them can cause biased or inefficient estimates
Often, can use panel data models to control for an unobserved effect:
Panel data is data where we observe individuals over time
Can exploit this to eliminate temporal and individual biases:
$$y_{it} = \alpha_i + \delta_t + x_{it} \beta_i + u_{it}$$
Assembling a dataset
Economists think hard about the relationship between features and
outcomes, and what is lurking in the error term:
Think about the error term:
Endogeneity?
Omitted variables?
Valid instruments?
Seek out supplementary data which could explain behavior . . . or proxy
for missing features
Choose a functional form to control for ignorance yet measure what
matters
Create a panel to control for individual heterogeneity
Feature engineering à la economics
To handle these problems, economists add supplementary data:
Instrumental variables
Proxy variables
Dummy variables for individuals
Or, use clever tricks:
Panel data
Regression discontinuity design
Difference-in-differences
Natural experiments
Overview
Sometimes A/B testing is not possible:
Impossible to run the experiment
Look for natural randomization devices which provide ‘as good as random’ assignment to treatment:
Birthdays
Draft lottery numbers
Access decisions
Natural randomization can eliminate selection
Always check for balance!
Requires some cleverness. . .
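One common way to check balance is to compare covariate means across groups using a standardized mean difference. The sketch below is an illustration, not the talk's method: the helper name, the simulated covariate, and both assignment rules are hypothetical.

```python
import numpy as np

def standardized_mean_difference(x_treated, x_control):
    """Difference in means scaled by the pooled standard deviation."""
    x_treated = np.asarray(x_treated, dtype=float)
    x_control = np.asarray(x_control, dtype=float)
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2.0)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

rng = np.random.default_rng(4)
age = rng.normal(40, 10, size=10_000)         # hypothetical pre-treatment covariate

# Case 1: assignment is as good as random, so the covariate should balance
assigned = rng.integers(0, 2, size=10_000).astype(bool)
smd_random = standardized_mean_difference(age[assigned], age[~assigned])

# Case 2: older units select into treatment, so the covariate is unbalanced
selected = age + rng.normal(0, 10, size=10_000) > 45
smd_selected = standardized_mean_difference(age[selected], age[~selected])

print(f"random: {smd_random:.3f}, self-selected: {smd_selected:.3f}")
```

A standardized difference near zero on every pre-treatment covariate is consistent with as-good-as-random assignment, though it cannot rule out imbalance on unobserved variables.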
A natural experiment
Often ‘nature’ provides natural randomization which is as good as
experimental randomization.
Example: needed to measure lift of experiential marketing campaign:
No experimental design
Ten treatment units
Matching estimators failed
But, short list had 50 sites. . .
Assume (and test) that access is as good as random ⇒ have valid treatment & control groups!
Causal regression analysis
Causal regression analysis
To establish causality, must condition on other variables so ceteris paribus
applies:
Want to estimate E[y|w, c], where:
w is the factor of interest
c are other factors, correlated with w which could confound analysis
Need good measures of y, w, and c
Beware of variables determined by equilibrium:
Must deal with simultaneous equation modeling
Must avoid endogenous variables
E.g., crime vs. policing
Some common regression tools
Econometricians have developed many methods to assess causal
relationships:
Regression discontinuity design (RDD):
Difference-in-differences (DID):
Instrumental variables: use exogenous variables to instrument an
endogenous variable
Panel data:
Other methods include:
Matching estimators
Censoring & truncation
Discrete choice
Discrete/continuous choice
Regression discontinuity design
Can exploit policy or laws which cause a ‘natural’ treatment effect to
establish causality:
Example: impact of drinking age on mortality
People on either side of 21 get different treatment but are essentially
identical
$W_i$ is a discontinuous function of age
age is known as the running variable
$$\text{mortality}_i = \alpha_0 + \alpha_1 \cdot \text{age}_i + \gamma \cdot W_i(\text{age}_i) + \varepsilon_i$$
See Angrist & Pischke (2014)
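The regression discontinuity regression can be sketched on simulated data with a jump at the age-21 cutoff. All numbers below (baseline level, slope, jump size, noise, age window) are invented for illustration and are not estimates from the cited study.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
cutoff = 21.0

age = rng.uniform(19.0, 23.0, size=n)          # running variable
w = (age >= cutoff).astype(float)              # treatment switches on at the cutoff

alpha0, alpha1, gamma = 90.0, -1.0, 8.0        # hypothetical level, trend, and jump
mortality = alpha0 + alpha1 * (age - cutoff) + gamma * w + rng.normal(0, 5, size=n)

# Regress the outcome on a linear trend in age plus the treatment dummy.
# Centering age at the cutoff only changes the intercept, not the jump.
X = np.column_stack([np.ones(n), age - cutoff, w])
coef, *_ = np.linalg.lstsq(X, mortality, rcond=None)
gamma_hat = coef[2]

print(f"estimated jump at the cutoff: {gamma_hat:.2f}")
```

This sketch assumes the same linear trend on both sides of the cutoff; applied work often allows the slope to differ across the cutoff and restricts the sample to a narrow window around it.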
Difference-in-differences (DID) (1/2)
DID useful when you have individual effects and common time trends:
Must observe data over at least two periods
Eliminate bias from individual and time effects
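E.g., impact of minimum wage on employment; see Card & Krueger (1994)
$$y_{it} = \alpha_i + \delta_t + \gamma \cdot W_i \times \text{PERIOD}_t$$
$$\hat{\gamma} = (\bar{y}_{NJ,1} - \bar{y}_{NJ,0}) - (\bar{y}_{PA,1} - \bar{y}_{PA,0})$$
$$\hat{\gamma} = (\Delta\delta_t + \gamma) - (\Delta\delta_t)$$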
DID regression (2/2)
Can estimate DID using regression:
$$y_{it} = \alpha_i + \delta_t + \beta \cdot W_i + \gamma \cdot W_i \times \text{PERIOD}_t + \varepsilon_{it}$$
$W_i$ is treatment status
$\text{PERIOD}_t \in \{0, 1\}$ for periods 0 and 1, respectively
γ is the treatment effect
Add additional covariates to control as needed
Verify common trends holds (e.g., plot data vs. time)
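A two-group, two-period simulation shows that the difference of differences in group means and the regression coefficient on $W_i \times \text{PERIOD}_t$ coincide. As a simplification, the regression below uses a group dummy rather than a separate effect for every unit; the data and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n_units = 2_000
gamma = 0.5                                       # hypothetical treatment effect

w = rng.integers(0, 2, size=n_units)              # treatment-group indicator
alpha = rng.normal(size=n_units) + 1.0 * w        # unit effects differ by group
delta = np.array([0.0, 0.7])                      # common time effects

y = np.empty((n_units, 2))
for t in (0, 1):
    y[:, t] = alpha + delta[t] + gamma * w * t + rng.normal(0, 0.5, size=n_units)

# Difference of differences in group means
did_means = (y[w == 1, 1].mean() - y[w == 1, 0].mean()) \
          - (y[w == 0, 1].mean() - y[w == 0, 0].mean())

# Same estimate from a regression on W, PERIOD, and W x PERIOD
W = np.repeat(w, 2).astype(float)                 # matches row-major y.ravel()
P = np.tile([0.0, 1.0], n_units)
X = np.column_stack([np.ones(2 * n_units), W, P, W * P])
coef, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
did_reg = coef[3]

print(f"means: {did_means:.3f}, regression: {did_reg:.3f}")
```

The group-level gap in `alpha` and the common shock in `delta` both difference out, which is why the estimate stays close to `gamma` despite non-random group membership.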
Instrumental variables
An instrumental variable, z, provides a way to correct for endogeneity:
Assumptions:
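$E[\varepsilon_i \cdot z_i] = 0$ (exogeneity)
$E[x_i \cdot z_i] \neq 0$ (relevance)
Use z in regression:
$$\mathrm{Cov}(y_i, z_i) = \beta_1 \cdot \mathrm{Cov}(x_i, z_i) + \mathrm{Cov}(\varepsilon_i, z_i)$$
$$\hat{\beta}_1^{IV} = \beta_1 + \frac{\mathrm{Cov}(\varepsilon_i, z_i)}{\mathrm{Cov}(x_i, z_i)}$$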
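The covariance-ratio form of the IV estimator can be compared with OLS on simulated data. In this sketch the instrument is constructed to satisfy both assumptions by design; the coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
beta0, beta1 = 1.0, 2.0                   # hypothetical true parameters

z = rng.normal(size=n)                    # instrument: independent of eps by construction
eps = rng.normal(size=n)
x = 1.0 * z + 0.8 * eps + rng.normal(size=n)   # x is endogenous but moved by z
y = beta0 + beta1 * x + eps

beta_ols = np.cov(y, x)[0, 1] / np.var(x, ddof=1)   # biased by Cov(eps, x) / Var(x)
beta_iv = np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1]   # Cov(y, z) / Cov(x, z)

print(f"OLS: {beta_ols:.3f}, IV: {beta_iv:.3f}, truth: {beta1}")
```

With real data the exogeneity assumption cannot be tested directly in the just-identified case, so the instrument's validity has to be argued from how it is generated.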
Panel data
Panel data is a powerful tool to eliminate sources of bias:
$$y_{it} = \alpha_i + \delta_t + x_{it} \beta_i + \varepsilon_{it}$$
Panel data consists of individuals observed over time, i.e., $x_{it}$
Has time series and cross-section properties
Can eliminate individual & time effects:
Within estimator ($\ddot{x}_{it} \leftarrow x_{it} - \bar{x}_i$)
First differences (FD)
Can also handle serial correlation of $\{\varepsilon_t\}$
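The within and first-difference transformations can be sketched on a simulated panel. To keep the example short it assumes a common slope for all individuals and no time effects, which is simpler than the model on the slide; all values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(8)
n_units, n_periods = 500, 6
beta = 1.5                                            # hypothetical common slope

alpha = rng.normal(size=(n_units, 1))                 # unobserved individual effect
x = 0.9 * alpha + rng.normal(size=(n_units, n_periods))   # x correlated with alpha
y = alpha + beta * x + rng.normal(0, 0.5, size=(n_units, n_periods))

# Pooled OLS ignores alpha_i and is biased
beta_pooled = np.cov(y.ravel(), x.ravel())[0, 1] / np.var(x.ravel(), ddof=1)

# Within estimator: subtract each individual's time average
x_w = x - x.mean(axis=1, keepdims=True)
y_w = y - y.mean(axis=1, keepdims=True)
beta_within = (x_w * y_w).sum() / (x_w ** 2).sum()

# First differences: subtract the previous period's value
dx, dy = np.diff(x, axis=1), np.diff(y, axis=1)
beta_fd = (dx * dy).sum() / (dx ** 2).sum()

print(f"pooled={beta_pooled:.3f} within={beta_within:.3f} fd={beta_fd:.3f}")
```

Both transformations remove `alpha` because it is constant over time for each individual.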
Least squares dummy variable regression (LSDV)
Often panel data is equivalent to LSDV:
Occurs when individual effect could be modeled using dummy variables
for each individual
Frisch-Waugh decomposition:
Powerful dimension reduction
Use when you need to control for unobserved effect without estimating it
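The equivalence between LSDV and the within estimator, which follows from the Frisch-Waugh decomposition, can be confirmed numerically. The small panel below is simulated with hypothetical values and a single regressor.

```python
import numpy as np

rng = np.random.default_rng(9)
n_units, n_periods = 50, 5

alpha = rng.normal(size=n_units)                      # individual effects
unit = np.repeat(np.arange(n_units), n_periods)       # unit index for each row
x = alpha[unit] + rng.normal(size=unit.size)
y = alpha[unit] + 2.0 * x + rng.normal(size=unit.size)

# LSDV: regress y on x plus one dummy variable per individual
D = (unit[:, None] == np.arange(n_units)[None, :]).astype(float)
coef, *_ = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)
beta_lsdv = coef[0]

# Within estimator: demean x and y by individual, then regress without dummies
x_bar = np.bincount(unit, weights=x) / n_periods
y_bar = np.bincount(unit, weights=y) / n_periods
x_w, y_w = x - x_bar[unit], y - y_bar[unit]
beta_within = (x_w @ y_w) / (x_w @ x_w)

print(f"LSDV: {beta_lsdv:.6f}, within: {beta_within:.6f}")
```

The demeaned regression estimates one parameter instead of 51, which is the dimension reduction the slide refers to; the gap grows with the number of individuals.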
Conclusion
To establish a causal relationship:
Must furnish evidence of correctness
Experiment is gold standard
In absence of random assignment to treatment:
Natural experiments can provide as good as random assignment to
treatment
Regression analysis can be causal if you condition on confounding
influences and control for endogeneity

University of New South Wales degree offer diploma Transcript
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 

Essential econometrics for data scientists

  • 1. Essential economics for data scientists. Benjamin S. Skrainka, February 10, 2016
  • 2. Overview
      • Economics studies the allocation of resources under scarcity. Many of its tools are useful for data scientists:
      • Econometric methods adapt classical statistics for applied problems
      • Causal inference
      • Experimental design
      • Regression analysis
      • Often, these methods require only small or ‘medium’ data
      • Goal of talk: understand the (magnitude of) causal relationships
  • 3. Theory I won’t discuss
      • Understanding individual behavior
      • Understanding firm behavior
      • Strategic questions: products, pricing, auctions, platforms, incentives, M&A, new products
      • Estimating demand & forecasting
      • Structural modeling
  • 4. Applied tools I won’t discuss
      • Structural vs. reduced form
      • Bayesian vs. frequentist
      • Counter-factual & welfare analysis
      • Forecasting
  • 5. Objectives. Today’s goals:
      • List differences between the economic & machine learning approaches
      • Know when to use econometrics or machine learning
      • Survey alternative types of experiments
      • Overview of how to estimate causal effects using regression analysis
  • 6. Agenda
      1 Econometrics or machine learning?
      2 Establishing causality
      3 When A/B tests fail...
      4 Causal regression analysis
  • 7. References (1/2)
      • Angrist, Joshua D., and Jörn-Steffen Pischke. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press, 2008.
      • Angrist, Joshua D., and Jörn-Steffen Pischke. Mastering ’Metrics: The Path from Cause to Effect. Princeton University Press, 2014.
      • Breiman, Leo. “Statistical modeling: The two cultures (with comments and a rejoinder by the author).” Statistical Science 16.3 (2001): 199-231.
      • Cameron, A. Colin, and Pravin K. Trivedi. Microeconometrics: Methods and Applications. Cambridge University Press, 2005.
      • Card, David, and Alan B. Krueger. “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania.” The American Economic Review 84.4 (1994): 772-793.
  • 8. References (2/2)
      • Imbens, Guido W., and Donald B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
      • LaLonde, Robert J. “Evaluating the econometric evaluations of training programs with experimental data.” The American Economic Review (1986): 604-620.
      • Pearl, Judea. Causality. Cambridge University Press, 2009.
      • Wooldridge, Jeffrey M. Econometric Analysis of Cross Section and Panel Data. MIT Press, 2010.
  • 9. Econometrics or machine learning?
  • 10. Econometrics vs. machine learning
      • Approach. Econometrics: statistical, with a data generating process (DGP). Machine learning: algorithmic model, DGP unknown
      • Driver. Econometrics: theory. Machine learning: fitting the data
      • Focus. Econometrics: hypothesis testing & interpretability. Machine learning: predictive accuracy
      • Model choice. Econometrics: parameter significance & in-sample goodness of fit. Machine learning: cross-validation of predictive accuracy on partitions of data
      • Strength. Econometrics: understanding causal relationships & behavior. Machine learning: prediction
      • See Breiman (2001) and Matt Bogard’s blog
  • 11. Establishing causality
  • 12. How economists think about data
      • Data has a data generating process (DGP):
      • Dependent and independent variables are stochastic
      • A structure is the statistical & functional relationship that determines the observed outcomes
      • Two structures are observationally equivalent if they produce the same process
      • A structure is identified only if a unique structure can cause the process
      • Consequently: parameter estimates are random, and we can perform inference on them if the inverse problem is well-posed, i.e., the model is identified
  • 13. Variation in data
      • What is the nature of the variation in the data? Identify exogenous vs. endogenous sources of variation
      • Exogenous: the variable is determined outside the model. Examples: cost, weather, draft lottery number, parental income
      • Endogenous: the variable is determined inside the model. Caused by $E[x \cdot \varepsilon] \neq 0$ or a causal loop between $y$ and $x$. Examples: crime & policing; price & demand; product characteristics
      • ⇒ Mishandling endogenous features/variables almost always causes biased estimates
  • 14. Experimental vs. observational data
      • In the Rubin Causal Model, experimental data satisfies:
      1 Individualistic: whether I am assigned to treatment doesn’t affect whether you are
      2 Probabilistic: non-zero probability of assignment to each treatment
      3 Unconfoundedness: the outcome doesn’t affect the probability of assignment
      4 Known, random assignment rule
      • If 4 is violated, your data is observational
      • Also need the Stable Unit Treatment Value Assumption (SUTVA)
  • 15. Causality and ceteris paribus
      • Measuring causality depends on ceteris paribus, which means “all else being equal”
      • I.e., compare apples to apples by conditioning on everything other than the variable under analysis
      • E.g., data should be as good as randomly assigned
      • In the terminology of Wooldridge (2010): we want to understand how $w$ affects $y$
      • Must condition on other confounding influences $c$, or correlation between $w$ and $c$ will bias results: estimate $E[y \mid w, c]$
  • 16. Model interpretation
      • Partial effects, continuous $w$: $\beta_w = \partial E[y \mid w, c] / \partial w$
      • Partial effects, discrete $w$: $\beta_w = \Delta_w E[y \mid w, c]$
      • Elasticities: $\eta = \dfrac{w}{E[y \mid w, c]} \cdot \dfrac{\partial E[y \mid w, c]}{\partial w} = \dfrac{\partial \log E[y \mid w, c]}{\partial \log w}$
      • An elasticity captures a dimensionless change
      • E.g., market power: $\dfrac{\text{price} - \text{marginal cost}}{\text{price}} = -\dfrac{1}{\eta_{\text{Demand}}}$
      • Structural models permit more sophisticated counter-factual analysis
  • 17. Establishing causality
      • It is very difficult to establish causality. Provide strong evidence of causality through:
      • Well-designed experimentation
      • Careful statistical analysis
      • Showing that the method controls for sources of possible bias
      • Randomization is the gold standard
      • For observational data, must show that controls make the (regression) analysis ‘as good as randomly assigned’
      • See LaLonde (1986) or Angrist & Pischke (2008, 2014)
  • 18. Example 1: selection bias (1/2)
      • Suppose we measure the impact $\hat{\gamma}$ of a policy intervention (e.g., advertising): $Y_i(W_i) = \mu + \gamma \cdot W_i + \varepsilon_i$
      • Try differencing averages, but we only observe outcomes conditional on treatment status $W_i$:
        $\text{observed effect} = \mathrm{AVG}_n[Y_i(1) \mid W_i = 1] - \mathrm{AVG}_n[Y_i(0) \mid W_i = 0]$
      • But, adding and subtracting $\mathrm{AVG}_n[Y_i(1) \mid W_i = 0]$:
        $\text{observed effect} = \underbrace{\mathrm{AVG}_n[Y_i(1) \mid W_i = 1] - \mathrm{AVG}_n[Y_i(1) \mid W_i = 0]}_{\text{selection}} + \underbrace{\mathrm{AVG}_n[Y_i(1) \mid W_i = 0] - \mathrm{AVG}_n[Y_i(0) \mid W_i = 0]}_{\text{direct effect}}$
      • The first bracket compares the same potential outcome across the two groups, so under the model above it equals the gap in average $\varepsilon_i$ between treated and untreated units (selection); the second bracket equals $\gamma$ (direct effect)
  • 19. Example 1: selection bias (2/2)
      • Selection bias is everywhere in behavioral data: randomize to eliminate selection!
      • In the absence of randomization:
      • Model the selection process
      • Choose a sensible functional form & distributions
      • Use a control function
      • Condition on suitable controls to compare groups which are as good as randomly assigned
      • See Wooldridge (2010)
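The selection-bias decomposition in Example 1 can be checked numerically. The following is a minimal simulation sketch, not the author's code: the parameter values, the self-selection rule, and all variable names are assumptions chosen for illustration. Because the data are simulated, both potential outcomes are visible for every unit, which is never the case with real data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
mu, gamma = 1.0, 0.5            # hypothetical baseline and true treatment effect

eps = rng.normal(size=n)        # unobserved heterogeneity, epsilon_i
# Assumed self-selection rule: units with high eps are more likely to be treated.
w = (eps + rng.normal(size=n) > 0).astype(int)

y0 = mu + eps                   # potential outcome Y_i(0)
y1 = mu + gamma + eps           # potential outcome Y_i(1)

# Naive comparison of treated and untreated averages.
observed = y1[w == 1].mean() - y0[w == 0].mean()
# Same potential outcome, different groups: pure selection.
selection = y1[w == 1].mean() - y1[w == 0].mean()
# Same group, different potential outcomes: the direct effect.
direct = y1[w == 0].mean() - y0[w == 0].mean()

# Random assignment breaks the link between eps and treatment.
w_rand = rng.integers(0, 2, size=n)
randomized = y1[w_rand == 1].mean() - y0[w_rand == 0].mean()

print(f"observed={observed:.3f} selection={selection:.3f} direct={direct:.3f}")
print(f"randomized difference in means={randomized:.3f} (true gamma={gamma})")
```

Under these assumptions the naive difference should overstate the true effect by the selection term, while the randomized comparison should land close to the true value of gamma.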
  • 20. Endogeneity
      • Consider a regression model: $y_i = x_i \beta + \varepsilon_i$
      • If weak exogeneity, $E[x_i \cdot \varepsilon_i] = 0$, holds (among other assumptions) ⇒ $\hat{\beta}$ is unbiased
      • Endogeneity occurs if $y_i$ and $x_i$ are codetermined: weak exogeneity fails, $E[x_i \cdot \varepsilon_i] \neq 0$, and $\hat{\beta}$ is biased
  • 21. Types of endogeneity
      • Simultaneity
      • Omitted variable bias (OVB)
      • Selection bias
      • Measurement error
      • If any are present, $E[x_i \cdot \varepsilon_i] \neq 0$ and your estimates are biased
  • 22. Common endogenous variables
      • Endogeneity is everywhere:
      • Price & demand
      • Product characteristics
      • Wages and schooling (or anything affected by ability)
      • Labor-force participation
      • Policing & crime
  • 23. Example 2: OLS & endogeneity
      • Quick review of OLS in one dimension: $y_i = \beta_0 + \beta_1 \cdot x_i + \varepsilon_i$
      • $\mathrm{Cov}(y_i, x_i) = \beta_1 \cdot \mathrm{Var}(x_i) + \mathrm{Cov}(\varepsilon_i, x_i)$
      • $\hat{\beta}_1 = \dfrac{\mathrm{Cov}(y_i, x_i)}{\mathrm{Var}(x_i)} = \beta_1 + \dfrac{\mathrm{Cov}(\varepsilon_i, x_i)}{\mathrm{Var}(x_i)}$
      • ⇒ $\hat{\beta}_1$ is unbiased ⇔ $\mathrm{Cov}(\varepsilon_i, x_i) = 0$
  • 24. Omitted variable bias (OVB)
      • OVB is a common problem: $y_i = \beta \cdot x_i + \alpha \cdot z_i + \varepsilon_i$
      • But $z_i$ is omitted from the regression
      • Then the effective error term is $u_i = \varepsilon_i + \alpha \cdot z_i$ and $E[u_i \cdot x_i] \neq 0$ whenever $z_i$ is correlated with $x_i$
      • Estimates are biased: $\hat{\beta} = \dfrac{\mathrm{Cov}(y_i, x_i)}{\mathrm{Var}(x_i)} = \beta + \alpha \cdot \dfrac{\mathrm{Cov}(z_i, x_i)}{\mathrm{Var}(x_i)}$
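The omitted variable bias formula can be illustrated with a small simulation. This is a sketch under assumed parameter values (the coefficients, the strength of the correlation between x and z, and the helper `cov` are all hypothetical): the 'short' regression of y on x alone is compared with the 'long' regression that also includes z.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta, alpha = 2.0, 1.5               # hypothetical true coefficients

z = rng.normal(size=n)               # the variable we will omit
x = 0.8 * z + rng.normal(size=n)     # x is correlated with z by construction
eps = rng.normal(size=n)
y = beta * x + alpha * z + eps

def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.mean((a - a.mean()) * (b - b.mean()))

# Short regression: y on x only, so z ends up in the error term.
beta_short = cov(y, x) / cov(x, x)
# Bias predicted by the OVB formula: alpha * Cov(z, x) / Var(x).
bias_formula = alpha * cov(z, x) / cov(x, x)

# Long regression: y on x and z recovers beta.
X = np.column_stack([x, z])
beta_long = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"short: {beta_short:.3f}, beta + predicted bias: {beta + bias_formula:.3f}")
print(f"long:  {beta_long[0]:.3f} (true beta = {beta})")
```

The short-regression slope should match the true coefficient plus the predicted bias up to sampling noise, and including z should bring the estimate back near the true value.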
  • 25. Individual heterogeneity
      • Handling individual heterogeneity is a key triumph of modern econometrics:
      • Unobserved effects from individuals and firms affect behavior
      • Failing to model them can cause biased or inefficient estimates
      • Often, panel data models can control for an unobserved effect
      • Panel data is data in which we observe individuals over time
      • Can exploit this to eliminate temporal and individual biases: $y_{it} = \alpha_i + \delta_t + x_{it}\beta_i + u_{it}$
  • 26. Assembling a dataset
      • Economists think hard about the relationship between features and outcomes, and about what is lurking in the error term:
      • Think about the error term: Endogeneity? Omitted variables? Valid instruments?
      • Seek out supplementary data which could explain behavior, or proxy for missing features
      • Choose a functional form to control for ignorance yet measure what matters
      • Create a panel to control for individual heterogeneity
  • 27. Feature engineering à la economics
      • To handle these problems, economists add supplementary data: instrumental variables, proxy variables, dummy variables for individuals
      • Or use clever tricks: panel data, regression discontinuity design, difference-in-differences
  • 28. Natural experiments
  • 29. Overview
      • Sometimes A/B testing is not possible: it may be impossible to run the experiment
      • Look for natural randomization devices which provide ‘as good as random’ assignment to treatment:
      • Birthdays
      • Draft lottery numbers
      • Access decisions
      • Natural randomization can eliminate selection
      • Always check for balance!
      • Requires some cleverness...
  • 30. A natural experiment
      • Often ‘nature’ provides natural randomization which is as good as experimental randomization
      • Example: needed to measure the lift of an experiential marketing campaign:
      • No experimental design
      • Ten treatment units
      • Matching estimators failed
      • But the short list had 50 sites...
      • Assume (and test) that access is as good as random ⇒ have valid treatment & control groups!
  • 31. Causal regression analysis
  • 32. Causal regression analysis
      • To establish causality, must condition on other variables so that ceteris paribus applies
      • Want to estimate $E[y \mid w, c]$, where $w$ is the factor of interest and $c$ are other factors, correlated with $w$, which could confound the analysis
      • Need good measures of $y$, $w$, and $c$
      • Beware of variables determined by equilibrium: must deal with simultaneous equation modeling and must avoid endogenous variables. E.g., crime vs. policing
  • 33. Some common regression tools
      • Econometricians have developed many methods to assess causal relationships:
      • Regression discontinuity design (RDD)
      • Difference-in-differences (DID)
      • Instrumental variables: use exogenous variables to instrument an endogenous variable
      • Panel data
      • Other methods include: matching estimators, censoring & truncation, discrete choice, discrete/continuous choice
  • 34. Regression discontinuity design
      • Can exploit policies or laws which cause a ‘natural’ treatment effect to establish causality
      • Example: impact of the drinking age on mortality
      • People on either side of 21 get different treatment but are essentially identical
      • $W_i$ is a discontinuous function of age
      • Age is known as the running variable
      • $\text{mortality}_i = \alpha_0 + \alpha_1 \cdot \text{age}_i + \gamma \cdot W_i(\text{age}_i) + \varepsilon_i$
      • See Angrist & Pischke (2014)
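The regression discontinuity specification on the slide above can be sketched with simulated data. Everything numeric here is a made-up assumption for illustration (the age window, the coefficients, and the noise level are hypothetical and are not real mortality figures); the point is only that ordinary least squares on a constant, the running variable, and the treatment indicator recovers the jump at the cutoff when the linear form is correct.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
alpha0, alpha1, gamma = 90.0, -1.0, 8.0   # hypothetical coefficients

age = rng.uniform(19.0, 23.0, size=n)     # running variable near the cutoff
w = (age >= 21.0).astype(float)           # W_i(age): discontinuous at 21
mortality = alpha0 + alpha1 * age + gamma * w + rng.normal(scale=5.0, size=n)

# OLS of mortality on [1, age, W]; the coefficient on W is the jump at 21.
X = np.column_stack([np.ones(n), age, w])
coef = np.linalg.lstsq(X, mortality, rcond=None)[0]
gamma_hat = coef[2]

print(f"estimated jump at the cutoff: {gamma_hat:.2f} (true value {gamma})")
```

In practice the functional form on either side of the cutoff is unknown, so applied work typically restricts the sample to a narrow window around the cutoff and checks sensitivity to that choice; this sketch skips those steps.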
  • 35. Difference-in-differences (DID) (1/2)
      • DID is useful when you have individual effects and common time trends:
      • Must observe data over at least two periods
      • Eliminates bias from individual and time effects
      • E.g., impact of the minimum wage on employment; see Card & Krueger (1994)
      • $y_{it} = \alpha_i + \delta_t + \gamma \cdot W_i \times \text{PERIOD}_t$
      • $\hat{\gamma} = (\bar{y}_{NJ,1} - \bar{y}_{NJ,0}) - (\bar{y}_{PA,1} - \bar{y}_{PA,0})$
      • $\hat{\gamma} = (\Delta\delta_t + \gamma) - (\Delta\delta_t)$
  • 36. DID regression (2/2)
      • Can estimate DID using regression: $y_{it} = \alpha_i + \delta_t + \beta \cdot W_i + \gamma \cdot W_i \times \text{PERIOD}_t + \varepsilon_{it}$
      • $W_i$ is treatment status
      • $\text{PERIOD}_t \in \{0, 1\}$ for periods 0 and 1, respectively
      • $\gamma$ is the treatment effect
      • Add additional covariates to control as needed
      • Verify that common trends holds (e.g., plot data vs. time)
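The two ways of computing the DID estimate, differencing group means and running a regression with an interaction term, can be compared on simulated data. This is a simplified sketch: it assumes a two-group, two-period setup with a group-level effect in place of individual effects, and all numeric values and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
gamma = 0.7                                # hypothetical treatment effect

w = rng.integers(0, 2, size=n)             # W_i: 1 = treated group, 0 = control
period = rng.integers(0, 2, size=n)        # PERIOD_t: 0 = before, 1 = after
# Group effect (0.5), common time trend (0.3), and the treatment effect.
y = (1.0 + 0.5 * w + 0.3 * period + gamma * w * period
     + rng.normal(scale=0.5, size=n))

def cell_mean(group, t):
    """Average outcome for one group in one period."""
    return y[(w == group) & (period == t)].mean()

# Difference of differences of the four cell means.
did_means = ((cell_mean(1, 1) - cell_mean(1, 0))
             - (cell_mean(0, 1) - cell_mean(0, 0)))

# Regression version: the coefficient on W x PERIOD is the DID estimate.
X = np.column_stack([np.ones(n), w, period, w * period])
did_ols = np.linalg.lstsq(X, y, rcond=None)[0][3]

print(f"DID from means: {did_means:.4f}, DID from OLS: {did_ols:.4f}")
```

Because this regression is saturated in group and period, the two numbers should agree up to floating-point error; the regression form becomes useful once additional covariates are added.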
  • 37. Instrumental variables
      • An instrumental variable, $z$, provides a way to correct for endogeneity
      • Assumptions: $E[\varepsilon_i \cdot z_i] = 0$ (exogeneity) and $E[x_i \cdot z_i] \neq 0$ (relevance)
      • Use $z$ in the regression: $\mathrm{Cov}(y_i, z_i) = \beta_1 \cdot \mathrm{Cov}(x_i, z_i) + \mathrm{Cov}(\varepsilon_i, z_i)$
      • $\hat{\beta}_1^{IV} = \dfrac{\mathrm{Cov}(y_i, z_i)}{\mathrm{Cov}(x_i, z_i)} = \beta_1 + \dfrac{\mathrm{Cov}(\varepsilon_i, z_i)}{\mathrm{Cov}(x_i, z_i)}$
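The covariance-ratio form of the instrumental variables estimator can be illustrated with a short simulation. This is a sketch under assumed values: the data generating process, the unobserved confounder `u`, and the helper `cov` are hypothetical constructions that make x endogenous while keeping z a valid instrument.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta1 = 1.0                                # hypothetical true slope

z = rng.normal(size=n)                     # instrument: affects x, not the error
u = rng.normal(size=n)                     # unobserved confounder
x = 0.7 * z + u + rng.normal(size=n)       # x depends on both z and u
eps = u + rng.normal(size=n)               # error term contains u => x endogenous
y = 0.5 + beta1 * x + eps

def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.mean((a - a.mean()) * (b - b.mean()))

beta_ols = cov(y, x) / cov(x, x)           # biased because Cov(eps, x) != 0
beta_iv = cov(y, z) / cov(x, z)            # consistent because Cov(eps, z) = 0

print(f"OLS: {beta_ols:.3f}, IV: {beta_iv:.3f}, true beta1: {beta1}")
```

Under these assumptions OLS should be biased upward while the IV ratio lands near the true slope. The estimator divides by Cov(x, z), so a weak instrument (a covariance near zero) makes it very noisy.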
  • 38. Panel data
      • Panel data is a powerful tool to eliminate sources of bias: $y_{it} = \alpha_i + \delta_t + x_{it}\beta_i + \varepsilon_{it}$
      • Panel data consists of individuals observed over time, i.e., $x_{it}$
      • Has time series and cross-section properties
      • Can eliminate individual & time effects with the within estimator ($\ddot{x}_{it} \leftarrow x_{it} - \bar{x}_i$) or first differences (FD)
      • Can also handle serial correlation of $\{\varepsilon_t\}$
  • 39. Least squares dummy variable regression (LSDV)
      • Often a panel data estimator is equivalent to LSDV
      • This occurs when the individual effect could be modeled using a dummy variable for each individual
      • Frisch-Waugh decomposition: powerful dimension reduction
      • Use when you need to control for an unobserved effect without estimating it
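The equivalence between the within estimator and LSDV can be checked directly. The sketch below makes simplifying assumptions that are not on the slides: a single regressor, a common slope across individuals, no time effects, and a balanced panel; the sizes, coefficients, and helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ind, n_t = 50, 8                          # individuals and periods (balanced)
beta = 1.5                                  # hypothetical common slope

ids = np.repeat(np.arange(n_ind), n_t)      # individual id for each observation
alpha = rng.normal(scale=2.0, size=n_ind)   # unobserved individual effects
x = alpha[ids] + rng.normal(size=n_ind * n_t)   # x is correlated with alpha_i
y = alpha[ids] + beta * x + rng.normal(scale=0.5, size=n_ind * n_t)

def demean(v):
    """Subtract each individual's time average (the within transformation)."""
    means = np.bincount(ids, weights=v) / np.bincount(ids)
    return v - means[ids]

# Within estimator: OLS on individually demeaned data.
x_dd, y_dd = demean(x), demean(y)
beta_within = (x_dd @ y_dd) / (x_dd @ x_dd)

# LSDV: OLS on x plus one dummy column per individual.
dummies = (ids[:, None] == np.arange(n_ind)[None, :]).astype(float)
X = np.column_stack([x, dummies])
beta_lsdv = np.linalg.lstsq(X, y, rcond=None)[0][0]

# Pooled OLS ignores alpha_i and is biased when x is correlated with it.
xc, yc = x - x.mean(), y - y.mean()
beta_pooled = (xc @ yc) / (xc @ xc)

print(f"within={beta_within:.4f} lsdv={beta_lsdv:.4f} pooled={beta_pooled:.4f}")
```

The within and LSDV slopes should coincide up to floating-point error, which is the Frisch-Waugh result: demeaning partials out the dummies without estimating all 50 of them. The pooled estimate should be visibly biased in this setup.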
  • 40. Conclusion
      • To establish a causal relationship, you must furnish evidence of correctness
      • Experiment is the gold standard
      • In the absence of random assignment to treatment:
      • Natural experiments can provide as good as random assignment to treatment
      • Regression analysis can be causal if you condition on confounding influences and control for endogeneity