My research shows that many of the strong assumptions that causal inference relies on are testable. This leads to a data mining framework for causal inference from observed data: instead of relying on untestable assumptions, we develop tests for valid experiment-like data (a "natural" experiment) and estimate causal effects only from subsets of data that pass those tests. Two such methods are presented. The first uses auxiliary data from large-scale systems to automate the search for natural experiments. Applying it to estimate the additional activity caused by Amazon's recommendation system, I find over 20,000 natural experiments, an order of magnitude more than those in past work. These experiments indicate that less than half of the click-throughs typically attributed to the recommendation system are causal; the rest would have happened anyway. The second method is a general Bayesian test that can be used to validate natural experiments in any dataset. For instance, I find that a majority of the natural experiments used in recent studies in a premier economics journal are likely invalid. More generally, the proposed framework presents a viable way of doing causal inference in large-scale datasets with minimal assumptions.

Published in:
Science


- 1. Causal data mining: Identifying causal effects at scale. Amit Sharma, Postdoctoral Researcher, Microsoft Research New York. http://www.amitsharma.in @amt_shrma
- 4. Distinguishing between personal preference and homophily in online activity feeds. Sharma and Cosley (2016). Studying and modeling the effect of social explanations in recommender systems. Sharma and Cosley (2013). Amit and Dan like this.
- 5. Distinguishing between personal preference and homophily in online activity feeds. Sharma and Cosley (2016). Studying and modeling the effect of social explanations in recommender systems. Sharma and Cosley (2013). Amit and Dan like this. Averaging Gone Wrong: Using Time-Aware Analyses to Better Understand Behavior. Barbosa, Cosley, Sharma, Cesar (2016). Auditing search engines for differential satisfaction across demographics. Mehrotra, Anderson, Diaz, Sharma, Wallach (2016).
- 6. Jake and Duncan like this
- 8. Cause (X = x), Outcome (Y), Unobserved Confounders (U). $y = f(x, u)$, $x = g(u)$
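
The structural equations on this slide can be illustrated with a small simulation. This is a minimal sketch, not taken from the talk: the linear forms chosen for f and g, the coefficient 2.0, and the sample size are all assumptions made for illustration. It sets the true effect of x on y to zero and shows that a naive regression still reports a clearly nonzero slope, because u drives both variables.

```python
import random

def simulate_confounded(n=20000, true_effect=0.0, seed=0):
    """Draw (x, y) pairs where an unobserved u drives both x and y."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                               # unobserved confounder
        x = u + rng.gauss(0, 1)                           # x = g(u), assumed linear
        y = true_effect * x + 2.0 * u + rng.gauss(0, 1)   # y = f(x, u), assumed linear
        xs.append(x)
        ys.append(y)
    return xs, ys

def ols_slope(xs, ys):
    """Slope of the least-squares line of y on x."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

xs, ys = simulate_confounded(true_effect=0.0)
naive_slope = ols_slope(xs, ys)
print(f"true effect: 0.0, naive regression slope: {naive_slope:.2f}")
```

Under these assumed coefficients the population slope is cov(x, y) / var(x) = 2 / 2 = 1, so the naive estimate lands near 1 even though x has no effect on y.
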
- 14. Data mining for causal inference
- 15. Part 0: Traditional causal inference using a natural experiment
- 16. 1854: London was having a devastating cholera outbreak
- 17. Causal question: What is causing cholera? Air-borne: spreads through air (“miasma”). Water-borne: spreads through contaminated water.
- 18. Polluted Air, Cholera Diagnosis; Contaminated Water, Cholera Diagnosis; Neighborhood
- 19. Enter John Snow. He found higher cholera deaths near a water pump, but this could be merely correlational.
- 20. New idea: London had two major water companies, one upstream and one downstream.
- 21. No difference in neighborhood, yet an 8-fold increase in cholera with the downstream company. (S&V and Lambeth)
- 22. Led to a change in belief about cholera’s cause.
- 24. Contaminated Water, Cholera Diagnosis, Neighborhood, Water Company
- 25. Contaminated Water (X), Cholera Diagnosis (Y), Other factors [e.g. neighborhood] (U), Water Company (Z). As-If-Random, Exclusion
- 26. Cause (X), Outcome (Y), Unobserved Confounders (U), New variable (Z). As-If-Random, Exclusion
- 27. Cause (X), Outcome (Y), Unobserved Confounders (U), Randomized Assignment (Z)
- 28. Cause (X), Outcome (Y), Unobserved Confounders (U), Instrumental Variable (Z)
- 29. Such that: As-If-Random; Exclusion: $Z \perp\!\!\!\perp Y \mid X, U$
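
To make the two instrument conditions concrete, here is a hedged sketch of the classic ratio (Wald) estimator on simulated data. Nothing in it comes from the talk: the linear model, the coefficients, and the reading of as-if-random as "Z is drawn independently of U" are assumptions for illustration. Z stands in for the water company, X for contaminated water, and Y for the outcome.

```python
import random

def covariance(a, b):
    """Population covariance of two equal-length sequences."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

def simulate_iv(n=50000, true_effect=1.5, seed=1):
    """Simulate an instrument z that is independent of u and affects y only through x."""
    rng = random.Random(seed)
    zs, xs, ys = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                  # unobserved confounder
        z = rng.choice([0.0, 1.0])           # as-if-random: drawn independently of u
        x = 0.8 * z + u + rng.gauss(0, 1)    # z shifts the cause x
        y = true_effect * x + 2.0 * u + rng.gauss(0, 1)  # exclusion: no direct z term
        zs.append(z)
        xs.append(x)
        ys.append(y)
    return zs, xs, ys

zs, xs, ys = simulate_iv()
naive = covariance(xs, ys) / covariance(xs, xs)        # biased by u
iv_estimate = covariance(zs, ys) / covariance(zs, xs)  # ratio (Wald) estimator
print(f"naive: {naive:.2f}, IV: {iv_estimate:.2f}, true effect: 1.5")
```

In this setup the naive slope overshoots the true effect because u pushes x and y in the same direction, while the ratio of covariances with z recovers a value close to the true effect. If either condition were violated in the simulation (for example, by adding a direct z term to y), the IV estimate would be biased as well.
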
- 32. Part 1: Split-door criterion for causal identification
- 33. Cause, Outcome, Unobserved Confounders, Auxiliary Outcome
- 34. Cause, Primary Outcome, Unobserved Confounders, Auxiliary Outcome
- 35. Cause, Outcome, Unobserved Confounders, Auxiliary Outcome
- 36. Cause, Outcome, Unobserved Confounders, Auxiliary Outcome
- 42. Demand for The Road, Visits to The Road, Rec. visits to No Country for Old Men, Demand for No Country for Old Men
- 43. Causal Convenience OBSERVED ACTIVITY FROM RECOMMENDER All page visits ? ACTIVITY WITHOUT RECOMMENDER
- 45. Treatment (A), Control (B)
- 46. Instrument, Demand for Cormac McCarthy, Visits to The Road, Rec. visits to No Country for Old Men
- 48. All visits to a recommended product: Recommender visits, Direct visits, Search visits, Direct browsing. Auxiliary outcome: Proxy for unobserved demand
- 49. Demand for focal product ($U_X$), Visits to focal product ($X$), Rec. visits ($Y_R$), Direct visits ($Y_D$), Demand for rec. product ($U_Y$)
- 52. Criterion: $X \perp\!\!\!\perp Y_D$. Demand for focal product ($U_X$), Visits to focal product ($X$), Rec. visits ($Y_R$), Direct visits ($Y_D$), Demand for rec. product ($U_Y$)
- 53. Unobserved variables ($U_X$), Cause ($X$), Outcome ($Y_R$), Auxiliary Outcome ($Y_D$), Unobserved variables ($U_Y$)
- 54. $y_r = f(x, u) = \gamma_2 u + \rho x + \epsilon_{y_r}$; $x = g(u) = \gamma_1 u + \epsilon_x$; $y_d = h(u) = \gamma_3 u + \epsilon_{y_d}$
- 55. Independence test used to find natural experiments. Only assumption: the auxiliary outcome is affected by the causes of the primary outcome.
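
The idea of keeping only the data that passes an independence test can be sketched on simulated windows drawn from the linear model on the preceding slide. This is an illustrative sketch under stated assumptions, not the procedure used in the talk: the Fisher z-test for zero correlation, the window length of 200 points, the coefficient values, and the use of a regression slope as the estimate of ρ are all choices made here for illustration. A window with γ1 = 0 is one where the unobserved u does not drive x.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    return sxy / sxx

def simulate_window(rng, gamma1, n=200, rho=0.3, gamma2=1.0, gamma3=1.0):
    """One window from the linear model; gamma1 = 0 means u does not affect x."""
    x, y_r, y_d = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)
        xi = gamma1 * u + rng.gauss(0, 1)
        x.append(xi)
        y_r.append(gamma2 * u + rho * xi + rng.gauss(0, 1))
        y_d.append(gamma3 * u + rng.gauss(0, 1))
    return x, y_r, y_d

def passes_split_door(x, y_d, z_crit=1.96):
    """Keep the window if a Fisher z-test does not reject corr(x, y_d) = 0."""
    z = math.atanh(pearson(x, y_d)) * math.sqrt(len(x) - 3)
    return abs(z) < z_crit

rng = random.Random(7)
kept, rejected = [], []
for i in range(200):
    gamma1 = 0.0 if i % 2 == 0 else 1.0   # every other window is confounded
    x, y_r, y_d = simulate_window(rng, gamma1)
    estimate = slope(x, y_r)
    (kept if passes_split_door(x, y_d) else rejected).append(estimate)

mean_kept = sum(kept) / len(kept)
mean_rejected = sum(rejected) / len(rejected)
print(f"kept {len(kept)} windows, mean estimate {mean_kept:.2f} (true rho = 0.3)")
print(f"rejected {len(rejected)} windows, mean estimate {mean_rejected:.2f}")
```

Windows that pass the test give estimates close to the assumed ρ, while the rejected windows are inflated by confounding. One caveat of this sketch is that failing to reject independence is weaker than establishing it, so short windows with little data can pass by chance.
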
- 56. Treatment, Outcome, Unobserved Confounders, Instrumental Variable; Treatment, Outcome, Unobserved Confounders, Auxiliary Outcome
- 58. Recreating sequence of visits: Log data (Timestamp, URL)
  - 2014-01-20 09:04:10, http://www.amazon.com/s/ref=nb_sb_noss_1?field-keywords=Cormac%20McCarthy
  - 2014-01-20 09:04:15, http://www.amazon.com/dp/0812984250/ref=sr_1_2
  - 2014-01-20 09:05:01, http://www.amazon.com/dp/1573225797/ref=pd_sim_b_1
- 59. Recreating sequence of visits: Log data (Timestamp, URL)
  - 2014-01-20 09:04:10, http://www.amazon.com/s/ref=nb_sb_noss_1?field-keywords=Cormac%20McCarthy: User searches for Cormac McCarthy
  - 2014-01-20 09:04:15, http://www.amazon.com/dp/0812984250/ref=sr_1_2: User clicks on the second search result
  - 2014-01-20 09:05:01, http://www.amazon.com/dp/1573225797/ref=pd_sim_b_1: User clicks on the first recommendation
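
A rough sketch of how such log lines might be turned into visit types follows. The three URLs and their interpretations come from the slide; the function name `classify`, the event labels, and the rule of matching on the `sr_` and `pd_sim` prefixes of the `ref` tag are assumptions that generalize from these three examples and may not hold for other URLs.

```python
import re
from urllib.parse import urlparse, parse_qs

LOG = [
    ("2014-01-20 09:04:10",
     "http://www.amazon.com/s/ref=nb_sb_noss_1?field-keywords=Cormac%20McCarthy"),
    ("2014-01-20 09:04:15", "http://www.amazon.com/dp/0812984250/ref=sr_1_2"),
    ("2014-01-20 09:05:01", "http://www.amazon.com/dp/1573225797/ref=pd_sim_b_1"),
]

def classify(url):
    """Return (event_type, detail) for one URL, using assumed prefix rules."""
    parsed = urlparse(url)
    path = parsed.path
    ref_match = re.search(r"/ref=([^/?]+)", path)
    ref = ref_match.group(1) if ref_match else ""
    product_match = re.search(r"/dp/([A-Z0-9]{10})", path)
    product = product_match.group(1) if product_match else None

    if path.startswith("/s/"):
        # Search results page: the query string carries the search keywords.
        keywords = parse_qs(parsed.query).get("field-keywords", [""])[0]
        return ("search", keywords)
    if ref.startswith("sr_"):
        return ("search_click", product)           # arrived from a search result
    if ref.startswith("pd_sim"):
        return ("recommendation_click", product)   # arrived from a recommendation
    return ("direct", product)                     # no recognized referral tag

events = [classify(url) for _, url in LOG]
for (timestamp, _), event in zip(LOG, events):
    print(timestamp, event)
```

Counting `recommendation_click` events per product would give the recommender visits, and the remaining product visits would give the direct visits used as the auxiliary outcome.
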
- 61. $t = 15$ days
- 62. Compute $\rho = Y_R / X$
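
As a worked example of the ratio on this slide, the sketch below computes ρ for a single 15-day window. The daily counts are hypothetical numbers invented for illustration, and taking the ratio of window totals (rather than, say, averaging daily ratios) is an assumption about how the ratio is aggregated.

```python
WINDOW_DAYS = 15  # t = 15 days, as on the slide

def click_through_estimate(x_daily, y_r_daily):
    """Estimate rho as total recommender visits over total focal-product visits."""
    if len(x_daily) != WINDOW_DAYS or len(y_r_daily) != WINDOW_DAYS:
        raise ValueError("expected one count per day in the window")
    total_x = sum(x_daily)
    if total_x == 0:
        raise ValueError("no visits to the focal product in this window")
    return sum(y_r_daily) / total_x

# Hypothetical daily counts: visits to the focal product (X) and
# recommendation click-throughs to the recommended product (Y_R).
x_daily = [120, 135, 110, 400, 380, 150, 125, 130, 118, 122, 140, 133, 127, 119, 121]
y_r_daily = [6, 7, 5, 21, 19, 8, 6, 7, 6, 6, 7, 7, 6, 6, 6]

rho = click_through_estimate(x_daily, y_r_daily)
print(f"estimated rho: {rho:.3f}")
```
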
- 70. Lower CTR may be due to the holiday season
- 73. Oprah [Carmi et al.]: 133 shocks, restricted to books. Split-door criterion: 12,000 natural experiments, representative of overall product distribution.
- 75. Part 2: A general Bayesian test for natural experiments in any dataset
- 76. Cause (X), Outcome (Y), Unobserved Confounders (U), Instrumental Variable (Z). As-If-Random? Exclusion?
- 77. Cause (X), Outcome (Y), Unobserved Confounders (U), I.V. (Z); (X) (Y) (U) (Z); (X) (Y) (U) (Z); (X) (Y) (U) (Z)
- 78. Cause (X), Outcome (Y), Unobserved Confounders (U), I.V. (Z)
- 81. Both Valid-IV and Invalid-IV models can generate this data distribution. We can attain a weaker notion: probable sufficiency.
- 83. (X) (Y) (U) (Z); (X) (Y) (U) (Z); (X) (Y) (U) (Z); (X) (Y) (U) (Z)
- 84. Data is likely to be generated from a Valid-IV model if ValidityRatio ≫ 1
- 86. $y = \begin{cases} 0 & (f_a) \\ x & (f_b) \\ \sim x & (f_c) \\ 1 & (f_d) \end{cases}$
- 87. Denominator (Invalid-IV): derived a closed-form solution, using properties of Dirichlet and hyper-Dirichlet distributions (Laplace transform). Numerator (Valid-IV): no closed-form solution exists; approximated with Monte Carlo methods (Annealed Importance Sampling).
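
One way to read the numerator and denominator on this slide is as marginal likelihoods of the observed data D under the two model classes. The expression below is a sketch under that reading, not a formula quoted from the talk: the symbols D, θ, and M are introduced here for illustration, and it assumes equal prior weight on the two model classes so that the ratio of posteriors reduces to a ratio of marginal likelihoods.

```latex
\[
\text{ValidityRatio}
  = \frac{P(D \mid \text{Valid-IV})}{P(D \mid \text{Invalid-IV})},
\qquad
P(D \mid M) = \int P(D \mid \theta)\, P(\theta \mid M)\, d\theta .
\]
```

Under this reading, the integral for the Invalid-IV class is the one with a closed form, while the integral for the Valid-IV class is the one approximated by sampling.
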
- 89. Studies from American Economic Review, with Validity Ratio:
  - Effect of Mexican immigration on crime in United States (2015): 0.07
  - Effect of subsidy manipulation on Medicare premiums (2015): 1.02
  - Effect of credit supply on housing prices (2015): 0.01
  - Effect of Chinese import competition on local labor markets (2013): 0.3
  - Effect of rural electrification on employment in South Africa (2011): 3.6
  - Expt: National Job Training Partnership Act (JTPA) Study (2002): 3.4
- 93. Causal algorithms, Warm Start (choosing expts.), Online+Offline
- 94. $P(X, y)$: $y = k(X) + \epsilon$
- 95. http://www.amitsharma.in
  1. Hofman, Sharma, and Watts (2017). Prediction and explanation in social systems. Science, 355(6324).
  2. Sharma (2016). Necessary and probably sufficient test for finding instrumental variables. Working paper.
  3. Sharma, Hofman, and Watts (2016). Split-door criterion for causal identification: An algorithm for finding natural experiments. Under review at Annals of Applied Statistics (AOAS).
  4. Sharma, Hofman, and Watts (2015). Estimating the causal impact of recommendation systems from observational data. In Proceedings of the 16th ACM Conference on Economics and Computation.
