Successfully reported this slideshow.
Upcoming SlideShare
×

# Modeling Social Data, Lecture 12: Causality & Experiments, Part 2

304 views

Published on

http://modelingsocialdata.org

Published in: Education
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

### Modeling Social Data, Lecture 12: Causality & Experiments, Part 2

1. 1. Randomized & natural experiments MODELING SOCIAL DATA JAKE HOFMAN COLUMBIA UNIVERSITY
2. 2. Example: Hospitalization on health What’s wrong with estimating this model from observational data? Health tomorrow Hospital visit today Effect? Arrow means “X causes Y”
3. 3. Confounds The effect and cause might be confounded by a common cause, and be changing together as a result Health tomorrow Hospital visit today Effect? Health today Dashed circle means “unobserved”
4. 4. Confounds If we only get to observe them changing together, we can’t estimate the effect of hospitalization changing alone Health tomorrow Hospital visit today Effect? Health today
5. 5. “To find out what happens when you change something, it is necessary to change it.” -GEORGE BOX
6. 6. Hospital visit today Random assignment Random assignment determines the treatment independent of any confounds Health tomorrowEffect? Health today Coin flip Double lines mean “intervention”
7. 7. Counterfactuals To isolate the causal effect, we have to change one and only one thing (hospital visits), and compare outcomes + vs (what happened) Reality (what would have happened) Counterfactual
8. 8. Counterfactuals We never get to observe what would have happened if we did something else, so we have to estimate it + vs (what happened) Reality (what would have happened) Counterfactual
9. 9. Random assignment We can use randomization to create two groups that differ only in which treatment they receive, restoring symmetry + World 1 World 2 Heads Tails
10. 10. Random assignment We can use randomization to create two groups that differ only in which treatment they receive, restoring symmetry + World 1 World 2 Heads Tails
11. 11. Random assignment We can use randomization to create two groups that differ only in which treatment they receive, restoring symmetry + World 1 World 2
12. 12. Random assignment Dunning (2012)
13. 13. Problems Random assignment is the “gold standard” for causal inference, but can be misleading under certain circumstances ◦ Small sample sizes ◦ Researcher degrees of freedom ◦ Publication bias ◦ P-hacking
14. 14. http://bit.ly/fdrtree
15. 15. Electronic copy available at: https://ssrn.com/abstract=1850704 designed to demonstrate something false: that certain songs can change listeners’ age. Everything reported here actually happened.1 Study 1:musical contrast and subjective age In Study 1, we investigated whether listening to a children’s song induces an age contrast, making people feel older. In exchange for payment, 30 University of Pennsylvania under- graduates sat at computer terminals, donned headphones, and were randomly assigned to listen to either a control song (“Kalimba,” an instrumental song by Mr. Scruff that comes free with the Windows 7 operating system) or a children’s song (“Hot Potato,” performed by The Wiggles). After listening to part of the song, participants com- pleted an ostensibly unrelated survey: They answered the question “How old do you feel right now?” by choosing among five options (very young, young, neither young nor old, old, and very old). They also reported their father’s age, allowing us to control for variation in baseline age across participants. An analysis of covariance (ANCOVA) revealed the pre- dicted effect: People felt older after listening to “Hot Potato” tematic analysis of how researcher degrees of freedom influ- ence statistical significance. Impatient readers can consult Table 3. “How Bad Can It Be?” Simulations Simulationsof common researcher degreesof freedom We used computer simulations of experimental data to esti- mate how researcher degrees of freedom influence the proba- bility of a false-positive result. These simulations assessed the impact of four common degrees of freedom: flexibility in (a) choosing among dependent variables, (b) choosing sample size, (c) using covariates, and (d) reporting subsets of experi- mental conditions. We also investigated various combinations of these degrees of freedom. We generated random samples with each observation inde- pendently drawn from a normal distribution, performed sets of analyses on each sample, and observed how often at least one of the resulting p values in each sample was below standard significance levels. For example, imagine a researcher who collects two dependent variables, say liking and willingness to by guest on November 20, 2011pss.sagepub.comDownloaded from 1360 Simmonset al. of ambiguous information and remarkably adept at reaching justifiable conclusions that mesh with their desires (Babcock & Loewenstein, 1997; Dawson, Gilovich, & Regan, 2002; Gilovich, 1983; Hastorf & Cantril, 1954; Kunda, 1990; Zuck- erman, 1979). This literature suggests that when we as researchers face ambiguous analytic decisions, we will tend to conclude, with convincing self-justification, that the appropri- ate decisions are those that result in statistical significance (p ≤ .05). Ambiguity is rampant in empirical research. As an exam- ple, consider a very simple decision faced by researchers ana- lyzing reaction times: how to treat outliers. In a perusal of roughly 30 Psychological Science articles, we discovered con- siderable inconsistency in, and hence considerable ambiguity about, this decision. Most (but not all) researchers excluded some responses for being too fast, but what constituted “too fast” varied enormously: the fastest 2.5%, or faster than 2 stan- dard deviations from the mean, or faster than 100 or 150 or 200 or 300 ms. Similarly, what constituted “too slow” varied enormously: the slowest 2.5% or 10%, or 2 or 2.5 or 3 stan- dard deviations slower than the mean, or 1.5 standard devia- tions slower from that condition’s mean, or slower than 1,000 or 1,200 or 1,500 or 2,000 or 3,000 or 5,000 ms. None of these decisions is necessarily incorrect, but that fact makes any of them justifiable and hence potential fodder for self-serving (adjusted M = 2.54 years) than after listening to the control song (adjusted M = 2.06 years), F(1, 27) = 5.06, p = .033. In Study 2, we sought to conceptually replicate and extend Study 1. Having demonstrated that listening to a children’s song makes people feel older, Study 2 investigated whether listening to a song about older age makes people actually younger. Study 2:musical contrast and chronological rejuvenation Using the same method as in Study 1, we asked 20 University of Pennsylvania undergraduates to listen to either “When I’m Sixty-Four” by The Beatles or “Kalimba.” Then, in an ostensi- bly unrelated task, they indicated their birth date (mm/dd/ yyyy) and their father’s age. We used father’s age to control for variation in baseline age across participants. An ANCOVA revealed the predicted effect: According to their birth dates, people were nearly a year-and-a-half younger after listening to “When I’m Sixty-Four” (adjusted M = 20.1 years) rather than to “Kalimba” (adjusted M = 21.5 years), F(1, 17) = 4.92, p = .040. Discussion
16. 16. Psychological Science 22(11) 1359–1366 © TheAuthor(s) 2011 Reprintsand permission: sagepub.com/journalsPermissions.nav DOI:10.1177/0956797611417632 http://pss.sagepub.com False-Positive Psychology:Undisclosed Flexibility in Data Collection and Analysis AllowsPresentingAnything asSignificant Joseph P. Simmons1 , Leif D. Nelson2 , and Uri Simonsohn1 1 TheWharton School, University of Pennsylvania, and 2 Haas School of Business, University of California,Berkeley Abstract In thisarticle,weaccomplish two things.First,weshow that despite empirical psychologists’ nominal endorsement of alow rate of false-positive findings (£ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis. Second, we suggest a simple, low-cost, and straightforwardly effective disclosure-based solution to this problem. The solution involves six concrete requirements for authors and four guidelines for reviewers,all of which impose aminimal bur den on the publication process. Keywords General Article
17. 17. False-Positive Psychology 1361 Table 1. Likelihood of ObtainingaFalse-Positive Result Significance level Researcher degrees of freedom p < .1 p < .05 p < .01 Situation A:two dependent variables (r = .50) 17.8% 9.5% 2.2% Situation B: addition of 10 more observations per cell 14.5% 7.7% 1.6% Situation C: controllingfor gender or interaction of gender with treatment 21.6% 11.7% 2.7% Situation D: dropping(or not dropping) one of three conditions 23.2% 12.6% 2.8% Combine Situations A and B 26.0% 14.4% 3.3% Combine Situations A,B,and C 50.9% 30.9% 8.4% Combine Situations A,B,C,and D 81.5% 60.7% 21.5% Note: The table reportsthe percentage of 15,000 simulated samplesin which at least one of a set of analyseswassignificant. Observationswere drawn independently from anormal distribu- tion.Baseline isatwo-condition design with 20 observationsper cell.Resultsfor SituationA were obtained by conductingthree t tests,one on each of two dependent variablesand athird on the average of these two variables.Resultsfor Situation B were obtained by conductingone t test after collecting20 observationsper cell and another after collectingan additional 10 observationsper cell.Resultsfor Situation C were obtained by conductinga t test,an analysisof covariance with a
18. 18. Caveats / limitations Random assignment is the “gold standard” for causal inference, but it has some limitations: ◦ Randomization often isn’t feasible and/or ethical ◦ Experiments are costly in terms of time and money ◦ It’s difficult to create convincing parallel worlds ◦ Inevitably people deviate from their random assignments Anyone can flip a coin, but it’s difficult to create convincing parallel worlds
19. 19. Natural experiments
20. 20. Natural experiments Sometimes we get lucky and nature effectively runs experiments for us, e.g.: ◦ As-if random: People are randomly exposed to water sources ◦ Instrumental variables: A lottery influences military service ◦ Discontinuities: Star ratings get arbitrarily rounded ◦ Difference in differences: Minimum wage changes in just one state
21. 21. Natural experiments Sometimes we get lucky and nature effectively runs experiments for us, e.g.: ◦ As-if random: People are randomly exposed to water sources ◦ Instrumental variables: A lottery influences military service ◦ Discontinuities: Star ratings get arbitrarily rounded ◦ Difference in differences: Minimum wage changes in just one state Experiments happen all the time, we just have to notice them
22. 22. As-if random Idea: Nature randomly assigns conditions Example: People are randomly exposed to water sources (Snow, 1854) http://bit.ly/johnsnowmap
23. 23. Instrumental variables Idea: An instrument independently shifts the distribution of a treatment Example: A lottery influences military service (Angrist, 1990) Military service Future earningsEffect? Confounds Lottery
24. 24. Dunning (2012)
25. 25. Figure 4: Average Revenue around Discontinuous Changes in Rating Notes: Each restaurant’s log revenue is de-meaned to normalize a restaurant’s average log revenue to zero. Normalized log revenues are then averaged within bins based on how far the restaurant’s rating is from a rounding threshold in that quarter. The graph plots average log revenue as a function of how far the rating is from a rounding threshold. All points with a positive (negative) distance from a discontinuity are rounded up (down). Regression discontinuities Idea: Things change around an arbitrarily chosen threshold Example: Star ratings get arbitrarily rounded (Luca, 2011) http://bit.ly/yelpstars
26. 26. Difference in differences Idea: Compare differences after a sudden change with trends in a control group Example: Minimum wage changes in just one state (Card & Krueger, 1994) http://stats.stackexchange.com/a/125266
27. 27. Natural experiments: Caveats Natural experiments are great, but: ◦ Good natural experiments are hard to find ◦ They rely on many (untestable) assumptions ◦ The treated population may not be the one of interest
28. 28. Natural experiments: Caveats Natural experiments are great, but: ◦ Good natural experiments are hard to find ◦ They rely on many (untestable) assumptions ◦ The treated population may not be the one of interest Sometimes we can use additional data + algorithms to automatically find natural experiments
29. 29. Discovering natural experiments
30. 30. Example: How much traffic to recommender systems cause? (Sharma, Hofman & Watts, 2015)
31. 31. Observational estimates Naively, up to 30% of pageviews come through recommendations
32. 32. Observational estimates Naively, up to 30% of pageviews come through recommendations
33. 33. Observational estimates Naively, up to 30% of pageviews come through recommendations But this is almost surely an overestimate of the effect
34. 34. Typical browsing WITH RECOMMENDATIONS
35. 35. Typical browsing: Search
36. 36. Typical browsing: Focal product
37. 37. Typical browsing: Recommended product
38. 38. Counterfactual browsing WITHOUT RECOMMENDATIONS
39. 39. Counterfactual browsing: Search
40. 40. Counterfactual browsing: Focal product
41. 41. Counterfactual browsing: Search again
42. 42. Counterfactual browsing
43. 43. Confound: Correlated demand Some views would have happened anyway due to correlated demand We call these convenience clicks Direct views of hats Referred click- throughs to gloves Effect? Demand for winter items
44. 44. Ideally we would run an A/B test where randomly selected people see recommendations Ideal experiment World 1 World 2
45. 45. Natural experiment Instead, we can exploits sudden shocks in traffic to focal products as an instrumental variable Direct views of focal product Referred click- throughsEffect? Demand External shock
46. 46. Usual approach Think hard for a source of random variation that only directly affects focal product (e.g., author wins award)
47. 47. Usual approach Think hard for a source of random variation that only directly affects focal product (e.g., author wins award) Problem: Impossible to rule out side effects
48. 48. New approach: Automatically discovering natural experiments Look for products that receive shocks in direct traffic, while their recommendations do not
49. 49. New approach: Automatically discovering natural experiments Look for products that receive shocks in direct traffic, while their recommendations do not
50. 50. New approach: Automatically discovering natural experiments Applying this method to 23 million pageviews in the Bing Toolbar logs, we find over 4,000 such natural experiments
51. 51. New approach: Automatically discovering natural experiments Causal click-through rate is just the marginal gain in clicks during the shock Causal effect = Change in recommendation clicks / Size of shock
52. 52. Causal click-through rate is just the marginal gain in clicks during the shock Causal effect = Change in rec clicks / Size of shock Results
53. 53. Although recommendation click- throughs account for a large fraction of traffic, at least 75% of this activity would likely occur in the absence of recommendations* Results *With lots of caveats
54. 54. Closing thoughts Large-scale observational data is useful for building predictive models of a static world
55. 55. Closing thoughts But without appropriate random variation, it’s hard to predict what happens when you change something in the world
56. 56. Closing thoughts Randomized experiments are like custom-made datasets to answer a specific question
57. 57. Closing thoughts Additional data + algorithms can help us discover and analyze these examples in the wild