Identifying causal effects is an integral part of scientific inquiry, spanning questions as varied as understanding behavior in online systems, the effects of social policies, and risk factors for disease. In the absence of a randomized experiment, however, traditional methods such as matching or instrumental variables fail to provide robust estimates because they depend on strong assumptions that are never tested. My research shows that many of these assumptions are, in fact, testable. This leads to a data mining framework for causal inference from observational data: instead of relying on untestable assumptions, we develop tests for valid, experiment-like data (a "natural" experiment) and estimate causal effects only from the subsets of data that pass those tests. I present two such methods. The first uses auxiliary data from large-scale systems to automate the search for natural experiments. Applying it to estimate the additional activity caused by Amazon's recommendation system, I find over 20,000 natural experiments, an order of magnitude more than in past work. These experiments indicate that less than half of the click-throughs typically attributed to the recommendation system are causal; the rest would have happened anyway. The second is a general Bayesian test for validating natural experiments in any dataset. Applying it, I find that a majority of the natural experiments used in recent studies from a premier economics journal are likely invalid. More generally, the proposed framework offers a viable way of doing causal inference on large-scale datasets with minimal assumptions.
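
To make the "test, then estimate" idea concrete, the following is a minimal Python sketch of the framework's general shape, not the actual methods from the two papers: each candidate natural experiment is screened with a validity test (here, a placeholder independence check against a hypothetical confound proxy), and effects are estimated only on the candidates that pass. All names (`candidates`, `treatment`, `outcome`, `confound_proxy`) and the specific test are illustrative assumptions.

```python
# Illustrative sketch (not the author's implementation): a generic
# "test, then estimate" loop. Each candidate natural experiment
# supplies a treatment assignment, an outcome, and a proxy variable
# that should be independent of treatment if the experiment is valid.
import numpy as np
from scipy import stats

def is_valid_natural_experiment(treatment, confound_proxy, alpha=0.05):
    """Crude validity check: keep a candidate only if we fail to reject
    independence between treatment variation and a confound proxy
    (a stand-in for the framework's validity tests)."""
    _, p_value = stats.pearsonr(treatment, confound_proxy)
    return p_value > alpha

def estimate_effect(treatment, outcome):
    """Simple difference-in-means estimate on a passing subset."""
    return outcome[treatment == 1].mean() - outcome[treatment == 0].mean()

def effects_from_natural_experiments(candidates):
    """Estimate causal effects only from candidates that pass the test."""
    return np.array([
        estimate_effect(c["treatment"], c["outcome"])
        for c in candidates
        if is_valid_natural_experiment(c["treatment"], c["confound_proxy"])
    ])

# Toy usage with synthetic data.
rng = np.random.default_rng(0)
candidates = []
for _ in range(100):
    treatment = rng.integers(0, 2, size=200)
    confound_proxy = rng.normal(size=200)            # independent of treatment
    outcome = 0.5 * treatment + rng.normal(size=200) # true effect = 0.5
    candidates.append({"treatment": treatment,
                       "confound_proxy": confound_proxy,
                       "outcome": outcome})

effects = effects_from_natural_experiments(candidates)
print(f"{len(effects)} candidates passed; mean estimated effect = {effects.mean():.2f}")
```

In the actual methods, the validity check would be replaced by the auxiliary-data search or the Bayesian test described above; the sketch only illustrates the shared structure of filtering for natural experiments before estimation.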