Causal data mining: Identifying causal effects at scale
AMIT SHARMA, Postdoctoral Researcher, Microsoft Research New York
http://www.amitsharma.in | @amt_shrma
Identifying causal effects is an integral part of scientific inquiry, spanning a wide range of questions such as understanding behavior in online systems, the effects of social policies, or the risk factors for diseases. In the absence of a randomized experiment, however, traditional methods such as matching or instrumental variables fail to provide robust estimates because they depend on strong assumptions that are never tested.
My research shows that many of these strong assumptions are testable. This leads to a data mining framework for causal inference from observed data: instead of relying on untestable assumptions, we develop tests for valid experiment-like data---a "natural" experiment---and estimate causal effects only from subsets of data that pass those tests. Two such methods are presented. The first utilizes auxiliary data from large-scale systems to automate the search for natural experiments. Applying it to estimate the additional activity caused by Amazon's recommendation system, I find over 20,000 natural experiments, an order of magnitude more than in past work. These experiments indicate that less than half of the click-throughs typically attributed to the recommendation system are causal; the rest would have happened anyway. The second method proposes a general Bayesian test that can be used for validating natural experiments in any dataset. For instance, I find that a majority of the natural experiments used in recent studies in a premier economics journal are likely invalid. More generally, the proposed framework presents a viable way of doing causal inference in large-scale datasets with minimal assumptions.

Published in: Science

Causal data mining: Identifying causal effects at scale

  1. Causal data mining: Identifying causal effects at scale. AMIT SHARMA, Postdoctoral Researcher, Microsoft Research New York. http://www.amitsharma.in @amt_shrma
  4. Distinguishing between personal preference and homophily in online activity feeds, Sharma and Cosley (2016); Studying and modeling the effect of social explanations in recommender systems, Sharma and Cosley (2013). Amit and Dan like this.
  5. Also: Averaging Gone Wrong: Using Time-Aware Analyses to Better Understand Behavior, Barbosa, Cosley, Sharma, Cesar (2016); Auditing search engines for differential satisfaction across demographics, Mehrotra, Anderson, Diaz, Sharma, Wallach (2016).
  6. Jake and Duncan like this.
  8. Cause (X = x) → Outcome (Y), with Unobserved Confounders (U) affecting both: y = f(x, u), x = g(u).
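The two equations above are all it takes to see why correlation misleads here. A minimal simulation sketch (not from the slides; the linear forms of f and g and all coefficients are illustrative assumptions) shows that when an unobserved u drives both x and y, the naive regression of y on x does not recover the causal effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed structural equations: x = g(u), y = f(x, u), with a true causal effect of 2.0.
u = rng.normal(size=n)                      # unobserved confounder
x = 1.5 * u + rng.normal(size=n)            # x = g(u) + noise
y = 2.0 * x + 3.0 * u + rng.normal(size=n)  # y = f(x, u) + noise

# Naive estimate: slope of y on x (what a correlation-based analysis would report).
naive = np.cov(x, y)[0, 1] / np.var(x)
print(f"naive slope ~ {naive:.2f}  vs  true causal effect = 2.00")
# The naive slope is inflated (~3.4 here) because u moves x and y together.
```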
  14. Data mining for causal inference.
  15. Part 0: Traditional causal inference using a natural experiment.
  16. 1854: London was suffering a devastating cholera outbreak.
  17. Causal question: what is causing cholera? Air-borne hypothesis: it spreads through the air ("miasma"). Water-borne hypothesis: it spreads through contaminated water.
  18. Two candidate diagrams: Polluted Air → Cholera Diagnosis, or Contaminated Water → Cholera Diagnosis, with Neighborhood as a possible common cause.
  19. Enter John Snow. He found higher cholera deaths near a particular water pump, but this could be merely correlational.
  20. New idea: two major water companies served London, one drawing its water upstream and the other downstream.
  21. No difference in neighborhoods, yet an 8-fold increase in cholera among customers of the downstream company (Southwark & Vauxhall) relative to the upstream one (Lambeth).
  22. This led to a change in belief about cholera's cause.
  24. Diagram: Water Company → Contaminated Water → Cholera Diagnosis, with Neighborhood affecting both.
  25. Contaminated Water (X) → Cholera Diagnosis (Y); other factors, e.g. neighborhood (U), confound both; Water Company (Z) affects X and satisfies As-If-Random and Exclusion.
  26. In general: Cause (X) → Outcome (Y), with Unobserved Confounders (U) and a new variable (Z) that is As-If-Random and satisfies Exclusion.
  27. Cause (X) → Outcome (Y), Unobserved Confounders (U), Randomized Assignment (Z).
  28. Cause (X) → Outcome (Y), Unobserved Confounders (U), Instrumental Variable (Z).
  29. Such that As-If-Random: Z ⊥ U, and Exclusion: Z ⊥ Y | X, U.
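Slides 28-29 state what an instrument buys you. The deck stops at the conditions, but a common way to turn them into an estimate (not shown in the slides) is the Wald estimator for a binary instrument: the effect of X on Y is the ratio of the instrument's effect on Y to its effect on X. A sketch with simulated data; all variable names, coefficients, and the threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

z = rng.integers(0, 2, size=n)          # as-if-random instrument (e.g., water company)
u = rng.normal(size=n)                  # unobserved confounder (e.g., neighborhood)
# X depends on both the instrument and the confounder.
x = (0.8 * z + 0.5 * u + rng.normal(size=n) > 0.6).astype(float)
# Y depends on X and the confounder, but NOT directly on Z (exclusion).
y = 2.0 * x + 1.5 * u + rng.normal(size=n)

naive = np.cov(x, y)[0, 1] / np.var(x)  # confounded estimate
wald = (y[z == 1].mean() - y[z == 0].mean()) / (x[z == 1].mean() - x[z == 0].mean())
print(f"naive: {naive:.2f}   Wald IV estimate: {wald:.2f}   (true effect 2.0)")
```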
  32. Part I: Split-door criterion for causal identification.
  33-36. Diagram, built up over a few slides: Cause → (Primary) Outcome, with Unobserved Confounders affecting both, and an Auxiliary Outcome that is also driven by the Unobserved Confounders.
  42. Example: Demand for The Road → Visits to The Road → Rec. visits to No Country for Old Men ← Demand for No Country for Old Men.
  43. Observed activity from the recommender splits into a causal part and a convenience part: of all page visits, how much activity would remain without the recommender?
  45. Treatment (A) vs. Control (B).
  46. Instrument: Demand for Cormac McCarthy → Visits to The Road → Rec. visits to No Country for Old Men.
  48. All visits to a recommended product = recommender visits + direct visits (search visits and direct browsing). Auxiliary outcome: direct visits act as a proxy for unobserved demand.
  49. Demand for focal product (U_X) → Visits to focal product (X) → Rec. visits (Y_R); Demand for rec. product (U_Y) → Rec. visits (Y_R) and Direct visits (Y_D).
  52. Split-door criterion: X ⊥ Y_D, in the diagram Demand for focal product (U_X) → Visits to focal product (X) → Rec. visits (Y_R), with Demand for rec. product (U_Y) → Y_R and Direct visits (Y_D).
  53. In general: Unobserved variables (U_X) → Cause (X) → Outcome (Y_R); Unobserved variables (U_Y) → Outcome (Y_R) and Auxiliary Outcome (Y_D).
  54. Linear version: y_R = f(x, u) = γ2·u + ρ·x + ε_yR;   x = g(u) = γ1·u + ε_x;   y_D = h(u) = γ3·u + ε_yD.
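A simulation sketch of slide 54's linear model (the coefficient values are illustrative assumptions, not numbers from the talk): when γ1 ≠ 0 the shared demand u drives both x and y_R, the naive estimate of ρ is biased, and x is then correlated with y_D; when γ1 ≈ 0, x is independent of y_D and the naive estimate recovers ρ. That equivalence is what the split-door criterion checks.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
rho, g2, g3 = 0.5, 1.0, 1.0   # true causal CTR and demand loadings (assumed values)

def simulate(g1):
    """Generate (x, y_R, y_D) from the linear model on slide 54 for a given γ1."""
    u = rng.normal(size=n)
    x = g1 * u + rng.normal(size=n)
    y_r = g2 * u + rho * x + rng.normal(size=n)
    y_d = g3 * u + rng.normal(size=n)
    return x, y_r, y_d

for g1 in (1.0, 0.0):   # confounded window vs. split-door window
    x, y_r, y_d = simulate(g1)
    rho_hat = np.cov(x, y_r)[0, 1] / np.var(x)   # naive slope of y_R on x
    corr_xd = np.corrcoef(x, y_d)[0, 1]          # testable proxy for X independent of Y_D
    print(f"γ1={g1}:  rho_hat={rho_hat:.2f} (true {rho}),  corr(x, y_D)={corr_xd:.2f}")
```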
  55. An independence test is used to find natural experiments. Only assumption: the auxiliary outcome is affected by the causes of the primary outcome.
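One way to operationalize slide 55 is to scan time windows (slide 61 uses t = 15 days) for each product pair and keep only those where visits X and direct visits Y_D look independent. The sketch below uses a simple Spearman-correlation screen as the independence test; the actual split-door algorithm may use a different or stricter test, and the data layout, column names, and threshold are assumptions.

```python
import pandas as pd
from scipy.stats import spearmanr

def find_natural_experiments(daily: pd.DataFrame, window="15D", p_threshold=0.95):
    """Return (pair, window_start) keys where X appears independent of Y_D.

    `daily` is assumed to have columns: date (datetime), pair (focal, recommended
    product), x (visits to focal product), y_r (rec. click-throughs), y_d (direct visits).
    """
    passing = []
    for (pair, start), g in daily.set_index("date").groupby(
            ["pair", pd.Grouper(freq=window)]):
        if len(g) < 10 or g["x"].nunique() < 2 or g["y_d"].nunique() < 2:
            continue                   # not enough variation to test
        _, pval = spearmanr(g["x"], g["y_d"])
        if pval > p_threshold:         # conservatively treat only clear non-association as independence
            passing.append((pair, start))
    return passing
```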
  56. Side by side: the IV setup (Instrumental Variable → Treatment → Outcome, with Unobserved Confounders) and the split-door setup (Treatment → Outcome, with the Unobserved Confounders also driving an Auxiliary Outcome).
  58. Recreating the sequence of visits from log data:
      Timestamp            | URL
      2014-01-20 09:04:10  | http://www.amazon.com/s/ref=nb_sb_noss_1?field-keywords=Cormac%20McCarthy
      2014-01-20 09:04:15  | http://www.amazon.com/dp/0812984250/ref=sr_1_2
      2014-01-20 09:05:01  | http://www.amazon.com/dp/1573225797/ref=pd_sim_b_1
  59. The same log, annotated:
      2014-01-20 09:04:10  | .../s/ref=nb_sb_noss_1?field-keywords=Cormac%20McCarthy  | user searches for Cormac McCarthy
      2014-01-20 09:04:15  | .../dp/0812984250/ref=sr_1_2                             | user clicks on the second search result
      2014-01-20 09:05:01  | .../dp/1573225797/ref=pd_sim_b_1                         | user clicks on the first recommendation
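The annotations in slide 59 come from the `ref=` tag in each Amazon URL (`sr_*` for search results, `pd_sim_*` for "similar items" recommendations, `/s/` for a search query). A sketch of how one might classify each logged visit; any tag-to-channel mapping beyond the examples shown on the slide is an assumption:

```python
import re

def classify_visit(url: str) -> str:
    """Label a logged URL as a search query, search click, recommendation click,
    or direct/other visit, based on the ref= tag visible in the log examples."""
    if "/s/" in url or "field-keywords=" in url:
        return "search_query"
    m = re.search(r"ref=([A-Za-z0-9_]+)", url)
    ref = m.group(1) if m else ""
    if ref.startswith("sr_"):
        return "search_click"          # arrived from a search-results page
    if ref.startswith("pd_sim"):
        return "recommendation_click"  # arrived from a 'similar items' recommendation
    return "direct_or_other"           # direct browsing or an unrecognized channel

log = [
    "http://www.amazon.com/s/ref=nb_sb_noss_1?field-keywords=Cormac%20McCarthy",
    "http://www.amazon.com/dp/0812984250/ref=sr_1_2",
    "http://www.amazon.com/dp/1573225797/ref=pd_sim_b_1",
]
for url in log:
    print(classify_visit(url), "<-", url)
```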
  61. t = 15 days.
  62. Compute ρ = Y_R / X.
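Putting slides 55, 61 and 62 together: within each 15-day window that passes the independence screen, the causal click-through estimate is the ratio of recommendation click-throughs to focal-product visits. A sketch continuing from `find_natural_experiments` above; pooling the passing windows into one ratio is an assumed aggregation choice:

```python
import pandas as pd

def causal_ctr(daily: pd.DataFrame, passing, window="15D") -> float:
    """Estimate rho = Y_R / X pooled over the windows that passed the split-door test."""
    grouped = daily.set_index("date").groupby(["pair", pd.Grouper(freq=window)])
    x_total, yr_total = 0.0, 0.0
    for key in passing:
        g = grouped.get_group(key)
        x_total += g["x"].sum()
        yr_total += g["y_r"].sum()
    return yr_total / x_total if x_total else float("nan")

# Usage: rho_hat = causal_ctr(daily, find_natural_experiments(daily))
```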
  70. The lower CTR may be due to the holiday season.
  73. Comparison of natural-experiment sources:
      Oprah shocks [Carmi et al.]  | 133 shocks                  | restricted to books
      Split-door criterion         | 12,000 natural experiments  | representative of the overall product distribution
  75. Part 2: A general Bayesian test for natural experiments in any dataset.
  76. The IV diagram again: Cause (X), Outcome (Y), Unobserved Confounders (U), Instrumental Variable (Z). As-If-Random? Exclusion?
  77. The proposed IV model shown alongside alternative DAGs over (X, Y, U, Z) that could also have generated the data.
  78. The candidate valid-IV model: Z → X → Y, with U affecting both X and Y.
  81. Both valid and invalid IV models can generate this data distribution, so validity cannot be proved outright; we can attain a weaker notion: probable sufficiency.
  84. Data is likely to be generated from a Valid-IV model if ValidityRatio ≫ 1.
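Slide 84 uses ValidityRatio without spelling out a definition; read together with slide 87 (numerator: Valid-IV, denominator: Invalid-IV), the natural reading, and the one assumed in the sketch after slide 87 below, is a Bayes-factor-style ratio of model evidences:

```latex
\[
\mathrm{ValidityRatio}
  \;=\;
  \frac{P(\mathcal{D} \mid \text{Valid-IV model})}
       {P(\mathcal{D} \mid \text{Invalid-IV model})},
\qquad \mathcal{D} = \text{observed data on } (Z, X, Y).
\]
```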
  86. The four possible response functions of Y to X: y = 0 (f_a), y = x (f_b), y = ¬x (f_c), y = 1 (f_d).
  87. Denominator (Invalid-IV): a closed-form solution, derived from properties of the Dirichlet and hyperdirichlet distributions (Laplace transform). Numerator (Valid-IV): no closed-form solution exists; approximated with Monte Carlo methods (Annealed Importance Sampling).
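A simplified sketch of the test for binary Z, X, Y. It assumes Dirichlet priors and scores the Invalid-IV side with the standard Dirichlet-multinomial evidence and the Valid-IV side (compliance types for X given Z, the response functions f_a to f_d for Y given X) by plain Monte Carlo, rather than the closed form and annealed importance sampling used in the talk; the counts are made up.

```python
import numpy as np
from scipy.special import gammaln

# Hypothetical counts n[z, x, y] for a binary instrument Z, cause X, outcome Y.
n = np.array([[[200, 50], [30, 70]],    # z = 0
              [[80, 40], [60, 180]]])   # z = 1

def log_evidence_invalid(n, alpha=1.0):
    """Dirichlet-multinomial evidence for the unconstrained model:
    P(x, y | z) is a free 4-category multinomial for each value of z."""
    total = 0.0
    for z in (0, 1):
        c = n[z].ravel()
        a = np.full(4, alpha)
        total += (gammaln(a.sum()) - gammaln(a.sum() + c.sum())
                  + np.sum(gammaln(a + c) - gammaln(a)))
    return total

def log_evidence_valid(n, alpha=1.0, draws=5000, seed=0):
    """Monte Carlo evidence for the Valid-IV model: X is a deterministic function of Z
    given a compliance type, Y of X given a response type (f_a=0, f_b=x, f_c=~x, f_d=1)."""
    rng = np.random.default_rng(seed)
    x_of_z = [lambda z: 0, lambda z: z, lambda z: 1 - z, lambda z: 1]   # compliance types
    y_of_x = [lambda x: 0, lambda x: x, lambda x: 1 - x, lambda x: 1]   # response types
    logliks = np.empty(draws)
    for d in range(draws):
        theta = rng.dirichlet(np.full(16, alpha)).reshape(4, 4)  # joint over type pairs
        ll = 0.0
        for z in (0, 1):
            for x in (0, 1):
                for y in (0, 1):
                    p = sum(theta[i, j] for i in range(4) for j in range(4)
                            if x_of_z[i](z) == x and y_of_x[j](x) == y)
                    ll += n[z, x, y] * np.log(max(p, 1e-300))
        logliks[d] = ll
    m = logliks.max()
    return m + np.log(np.mean(np.exp(logliks - m)))   # log-mean-exp

log_ratio = log_evidence_valid(n) - log_evidence_invalid(n)
print(f"log ValidityRatio ~ {log_ratio:.2f}  (>> 0 means a Valid-IV model is the likelier generator)")
```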
  89. Studies from the American Economic Review and their validity ratios:
      Effect of Mexican immigration on crime in the United States (2015)    | 0.07
      Effect of subsidy manipulation on Medicare premiums (2015)            | 1.02
      Effect of credit supply on housing prices (2015)                      | 0.01
      Effect of Chinese import competition on local labor markets (2013)    | 0.3
      Effect of rural electrification on employment in South Africa (2011)  | 3.6
      Expt: National Job Training Partnership Act (JTPA) Study (2002)       | 3.4
  93. Causal algorithms; Warm Start (choosing experiments); Online + Offline.
  94. P(X, y): y = kX + ε.
  95. References (http://www.amitsharma.in):
      1. Hofman, Sharma, and Watts (2017). Prediction and explanation in social systems. Science, 355(6324).
      2. Sharma (2016). Necessary and probably sufficient test for finding instrumental variables. Working paper.
      3. Sharma, Hofman, and Watts (2016). Split-door criterion for causal identification: An algorithm for finding natural experiments. Under review at Annals of Applied Statistics (AOAS).
      4. Sharma, Hofman, and Watts (2015). Estimating the causal impact of recommendation systems from observational data. In Proceedings of the 16th ACM Conference on Economics and Computation.