Correlation causality

causality vs. correlation.

  1. Correlation vs. Causality: Nature & Design of Experiments (D3M)
  2. Dangerous Question
      • A Dangerous Question: Does Internet Advertising Work at All?
      • Did eBay Just Prove That Paid Search Ads Don't Work?
      • Original Paper
  3. The causal effect of an intervention is the difference in outcomes with and without the intervention (outcome with, minus outcome without). Example: “The Mozart Effect”.
  4. “Simpson’s paradox”: Do firefighters present at a fire create higher fire damage?
      • Rationale: There is a strong positive correlation between the number of firefighters present at a fire and the amount of fire damage.
      • Missing variable?
      • When you factor in the missing variable, you get a different relationship.
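The firefighter puzzle can be illustrated with a small simulation. This is only a sketch on made-up data: the variable `fire_size` stands in for the missing variable, and every coefficient below is an assumption chosen for illustration, not an estimate from any real study.

```r
set.seed(42)                                   # reproducible simulated data
n <- 2000
fire_size    <- runif(n, min = 1, max = 10)    # hypothetical missing variable
firefighters <- 2 * fire_size + rnorm(n)       # bigger fires draw more crews
# By construction, each extra firefighter REDUCES damage by 1 unit
damage       <- 10 * fire_size - 1 * firefighters + rnorm(n)

raw_cor  <- cor(firefighters, damage)          # naive correlation
naive    <- coef(lm(damage ~ firefighters))[["firefighters"]]
adjusted <- coef(lm(damage ~ firefighters + fire_size))[["firefighters"]]

c(raw_cor = raw_cor, naive_slope = naive, adjusted_slope = adjusted)
```

In this setup the raw correlation and the naive slope are positive, while the slope on `firefighters` turns negative once `fire_size` enters the regression, which is the "different relationship" the slide refers to.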
  5. Arm Exercise and Longevity: A study found that the average life expectancy of famous orchestra conductors was 73.4 years, significantly higher than the life expectancy for males, 68.5 years. This was thought to be due to arm exercise.
  6. Correlation vs. Causation
      • Correlation between education and income
      • Correlation between money raised and election outcomes
      • Facebook use & low grades
      • Drinking and long life
  7. [Image-only slide; no text in the transcript]
  8. Low-Fat Diet and Cancer
      • Low fat diet breast cancer hope (BBC, May 2005)
      • Breast cancer link to high fat foods (The Scotsman, July 2003)
      • Low-Fat Diet May Control Prostate Cancer (Health News, August 2005)
      • Low-fat diet, not wine, fights heart disease in France (CNN, May 1999)
      • High-Fat Meal May Raise Risk Of Blood Clotting -- Increasing Heart Attack And Stroke Risk (American Heart Association, November 1997)
  9. National study finds no effect from reducing total dietary fat
      The study, a project of the National Institutes of Health, took eight years, cost $415 million, and involved nearly 49,000 older women, 40 percent of whom were assigned to a diet that kept their intake of calories from fat significantly below that of the other 60 percent. Researchers had expected to confirm what earlier studies and conventional medical wisdom had long suggested: that consuming less fat is good for your health. They found no difference between the two groups in risk of breast cancer, colon cancer, heart disease or stroke.
      The results from the largest-ever clinical trial of a low-fat diet are reported in three papers in the February 8 edition of the Journal of the American Medical Association.
      http://www.nih.gov/news/pr/feb2006/nhlbi-07.htm
  10. Important Policy Implications
      Sir Francis Galton:
      • Belief: talent was based on heredity alone
      • Evidence: strong positive correlation between the talent of parents and offspring (e.g., judges had children who were judges)
      • Policy goal: limit reproduction of the less talented or ill
      Anthropogenic Global Warming? – CO2 Anyone?
  11. Nature & Design of Experiments
  12. Notion of Random Sampling
      • Selection of a subset of elements from the population on which the research will be based
      • Contrast to a census: measure the entire population
      • Examples: More sugar in coffee? More salt in soup? Blood test.
  13. Examining the Data Source
      Investigations of Passive Smoking Harm: Relationship between Article Conclusions & Author Affiliations. Number (%) of reviews:

      | Article Conclusion | Tobacco-Affiliated Authors (n=31) | Non-Tobacco-Affiliated Authors (n=75) |
      |---|---|---|
      | Passive smoking harmful | 2 (6%) | 65 (87%) |
      | Passive smoking not harmful | 29 (94%) | 10 (13%) |

      Significance: P<.001. What test?
      Source: Barnes, Deborah E. 1998. Why review articles on the health effects of passive smoking reach different conclusions. JAMA. 279(19): 1566-1570.
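The slide asks which test produces the reported P<.001 but does not say. One plausible candidate, offered here as an assumption and not as the paper's actual method, is a chi-square test of independence on the 2×2 table of counts, with Fisher's exact test as a cross-check. The counts are taken from the slide; the object names are illustrative.

```r
# Review counts from the slide: rows = conclusion, columns = author affiliation
reviews <- matrix(
  c( 2, 65,
    29, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Conclusion  = c("Harmful", "Not harmful"),
    Affiliation = c("Tobacco", "Non-tobacco")
  )
)

# chisq.test() applies Yates' continuity correction to 2x2 tables by default
chi_out   <- chisq.test(reviews)
exact_out <- fisher.test(reviews)   # exact alternative for small tables

chi_out$p.value
exact_out$p.value
```

Both p-values fall well below .001, which is at least consistent with the significance level shown on the slide.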
  14. Election Projections
      • A famous case of what can go wrong when using a biased sample is found in the 1936 US presidential election polls.
      • The Literary Digest held a poll that forecast that Alfred M. Landon would defeat Franklin Delano Roosevelt by 57% to 43%.
      • Sample: its own subscribers, plus mailing lists of registered car owners and telephone users.
      • George Gallup, using a much smaller sample (300,000 rather than 2,000,000), predicted Roosevelt would win, and he was right.
      • What went wrong with the Literary Digest poll?

      The election of 1948:

      | Candidate | Crossley | Gallup | Roper | The Results |
      |---|---|---|---|---|
      | Truman | 45 | 44 | 38 | 50 |
      | Dewey | 50 | 50 | 53 | 45 |
  15. Sampling…

      | Poll | Date | Sample | MoE | Obama (D) | McCain (R) | Spread |
      |---|---|---|---|---|---|---|
      | Final Results | -- | -- | -- | 52.9 | 45.6 | Obama +7.3 |
      | RCP Average | 10/29 - 11/03 | -- | -- | 52.1 | 44.5 | Obama +7.6 |
      | Marist | 11/03 - 11/03 | 804 LV | 4.0 | 52 | 43 | Obama +9 |
      | Battleground (Lake)* | 11/02 - 11/03 | 800 LV | 3.5 | 52 | 47 | Obama +5 |
      | Battleground (Tarrance)* | 11/02 - 11/03 | 800 LV | 3.5 | 50 | 48 | Obama +2 |
      | Rasmussen Reports | 11/01 - 11/03 | 3000 LV | 2.0 | 52 | 46 | Obama +6 |
      | Reuters/C-SPAN/Zogby | 11/01 - 11/03 | 1201 LV | 2.9 | 54 | 43 | Obama +11 |
      | IBD/TIPP | 11/01 - 11/03 | 981 LV | 3.2 | 52 | 44 | Obama +8 |
      | FOX News | 11/01 - 11/02 | 971 LV | 3.0 | 50 | 43 | Obama +7 |
      | NBC News/Wall St. Jrnl | 11/01 - 11/02 | 1011 LV | 3.1 | 51 | 43 | Obama +8 |
      | Gallup | 10/31 - 11/02 | 2472 LV | 2.0 | 55 | 44 | Obama +11 |
      | Diageo/Hotline | 10/31 - 11/02 | 887 LV | 3.3 | 50 | 45 | Obama +5 |
      | CBS News | 10/31 - 11/02 | 714 LV | -- | 51 | 42 | Obama +9 |
      | ABC News/Wash Post | 10/30 - 11/02 | 2470 LV | 2.5 | 53 | 44 | Obama +9 |
      | Ipsos/McClatchy | 10/30 - 11/02 | 760 LV | 3.6 | 53 | 46 | Obama +7 |
      | CNN/Opinion Research | 10/30 - 11/01 | 714 LV | 3.5 | 53 | 46 | Obama +7 |
      | Pew Research | 10/29 - 11/01 | 2587 LV | 2.0 | 52 | 46 | Obama +6 |
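The MoE column can be roughly approximated with the textbook margin-of-error formula for a proportion. The sketch below assumes a simple random sample, a 95% confidence level, and the worst case p = 0.5. The helper `moe()` is a hypothetical name, and none of the pollsters' actual methods are known from the slide.

```r
# Textbook margin of error for a sample proportion (simple random sampling assumed)
moe <- function(n, p = 0.5, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)   # about 1.96 for a 95% confidence level
  z * sqrt(p * (1 - p) / n)
}

# Sample sizes taken from three rows of the table
sizes <- c(Battleground = 800, Zogby = 1201, Gallup = 2472)
round(100 * moe(sizes), 1)         # margins of error in percentage points
```

For n = 800 and n = 2472 this rounds to the 3.5 and 2.0 shown in the table. Other rows differ somewhat (Marist lists 4.0 for 804 likely voters), presumably because published margins may reflect weighting or other adjustments.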
  16. Experimental Design: Basics (Two-group Before-After Design)
      • Experimental group (randomly assigned): O1 (“before” outcome), then X (treatment), then O2 (“after” outcome)
      • Control group (randomly assigned): O3 (“before” outcome), then O4 (“after” outcome), with no treatment
      • Causal effect of X = (O2 – O1) – (O4 – O3)
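The before-after formula is a difference-in-differences calculation. A minimal sketch with made-up group averages (the four numbers below are assumptions for illustration only) shows the arithmetic, and shows that the same quantity appears as the interaction coefficient in a regression on the four cell means.

```r
# Hypothetical average outcomes, chosen only for illustration
o1 <- 10   # experimental group, before treatment
o2 <- 16   # experimental group, after treatment
o3 <- 11   # control group, before
o4 <- 13   # control group, after

# Change in the experimental group minus change in the control group
causal_effect <- (o2 - o1) - (o4 - o3)

# Same number as the interaction term in a regression on the four cell means
cells <- data.frame(
  outcome = c(o1, o2, o3, o4),
  treated = c(1, 1, 0, 0),
  after   = c(0, 1, 0, 1)
)
did <- coef(lm(outcome ~ treated * after, data = cells))[["treated:after"]]

c(causal_effect = causal_effect, regression_did = did)
```

Subtracting the control group's change removes whatever would have happened over time without the treatment, which is why random assignment to the two groups matters.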
  17. Example: Gneezy & Rustichini (2000, JLS)
      • Setting: A study of day-care centers in Israel. The day-care centers operate between 7:30 and 16:00. Before the study there was no fine if parents came late to pick up their children.
      • Treatments: Control (only record late parents) and treatment (late arrivals recorded for the first 4 weeks, then a fine of 10 NIS for late pick-up; fine removed in the 17th week).
      • Subjects: The study was carried out on 10 day-care centers in Israel (centers 1-6 in the test group and centers 7-10 in the control group), with between 28 and 37 children in each center.
  18. Example: Gneezy & Rustichini (2000, JLS): What Happened?
  19. Natural Experiments: Impact of telephones on the price of fish in Kerala (India)
  20. Natural Experiments: Organ Donation Rates. Why is there a difference?
  21. Long History of Online Experimentation
  22. Controlled Store Test
  23. Association Between Variables
      • Once we have done the experiment, we want to see whether our intervention had an impact.
      • Which statistical test to use depends on how the outcome variable is measured.
  24. Question of Interest
      • Association between two or more variables: “Is there a relation between variable X and variable Y?”
      • Is voting behavior related to an individual’s education level?
      • Do sales increase when we put a full-page ad in the NY Times?
      • What is the relationship between sales and price charged?
      Most of data analysis is finding patterns/relationships between variables.
  25. Association Between Variables
      • Both variables nominal: cross tabs (Chi-square test)
      • One continuous, one nominal: mean comparison (t-test, ANOVA)
      • Many variables: regression
  26. Relationship b/w two variables
      Variable 1 is nominal:
      • Voting: 1=Democrat, 2=Republican
      Variable 2 is nominal:
      • Education: 1=High school, 2=Some college, 3=College
      • Brand Preference: 1=National Brand, 2=Generic
      • Income: 1=Income < 25K, 2=25K to 50K, 3=Over 50K
      Is there a relationship b/w variables 1 & 2? Both nominal: do a cross-tab & Chi-square test.
  27. Cross-tab Example: Bing It On (D3M)
  28. Bing It On
      Context: Microsoft's "Bing It On" campaign purports to show that users prefer the company's search engine to Google's in a majority of blind tests. Recently, Ian Ayres (faculty at Yale Law) ran a blind test at BingItOn.com with 1,000 people recruited through Amazon's Mechanical Turk. The paper concludes that Bing's claims are misleading and are based on search words provided by the company. This in turn warrants legal scrutiny under the Lanham Act on false advertising (you can find the unpublished working paper on his web page).
      Data: In the file “Bing_it_on.csv” you are provided the data used in this study (it may be useful to visit the "Bing It On" web page to understand the experiment). There are approximately 900 participants in the experiment, who were randomly assigned to one of 3 groups based on which search words to use (variable: “Search Type”):
          1: Popular searches (based on the most popular Google search words of 2012)
          2: Bing-suggested search words
          3: User-generated search words
      The key variable of interest is “Preference”, coded as 1=Bing wins, 2=Tie, and 3=Google wins. The data also contain an additional variable “Gender” (1=Male, 2=Female) that you can ignore.
      Objective: Analyze the relationship between “Search Type” and “Preference”.
  29. χ²-test for Association
      Use the χ²-test:

      $$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

      where o_ij = observed count in cell (i,j), e_ij = expected count in cell (i,j) under no association, r = number of rows in the table, and c = number of columns.
      • The test statistic has a χ²-distribution with (r-1)*(c-1) degrees of freedom.
      • The null hypothesis is no association.
      • Reject the null hypothesis when the test statistic is “large”: larger than the critical value, or equivalently when the p-value is small.
  30. Chi-square Test in R

          # Set your working directory and load the data
          setwd("C:/Users/vsingh.NYC-STERN/Dropbox/teaching/2014/Fall/Assignments/Assignment 1")

          # Read the data into a data frame named "bing"
          bing <- read.csv("bing_it_on1.csv", header=TRUE, sep=",")

          # CrossTable() comes from the gmodels package
          library(gmodels)
          CrossTable(bing$Search_Type, bing$Preference, chisq=TRUE, format="SPSS")

      Output:

             Cell Contents
          |-------------------------|
          |                   Count |
          | Chi-square contribution |
          |             Row Percent |
          |          Column Percent |
          |           Total Percent |
          |-------------------------|

          Total Observations in Table:  985

                               | bing$Preference
              bing$Search_Type |   Bing Wins | Google Wins |         Tie |   Row Total |
          ---------------------|-------------|-------------|-------------|-------------|
                Bing Suggested |         159 |         157 |          18 |         334 |
                               |       4.025 |       2.407 |       0.348 |             |
                               |     47.605% |     47.006% |      5.389% |     33.909% |
                               |     39.750% |     29.962% |     29.508% |             |
                               |     16.142% |     15.939% |      1.827% |             |
          ---------------------|-------------|-------------|-------------|-------------|
              Popular Searches |         129 |         184 |          19 |         332 |
                               |       0.251 |       0.309 |       0.118 |             |
                               |     38.855% |     55.422% |      5.723% |     33.706% |
                               |     32.250% |     35.115% |     31.148% |             |
                               |     13.096% |     18.680% |      1.929% |             |
          ---------------------|-------------|-------------|-------------|-------------|
          Self-selected Search |         112 |         183 |          24 |         319 |
                               |       2.376 |       1.042 |       0.912 |             |
                               |     35.110% |     57.367% |      7.524% |     32.386% |
                               |     28.000% |     34.924% |     39.344% |             |
                               |     11.371% |     18.579% |      2.437% |             |
          ---------------------|-------------|-------------|-------------|-------------|
                  Column Total |         400 |         524 |          61 |         985 |
                               |     40.609% |     53.198% |      6.193% |             |
          ---------------------|-------------|-------------|-------------|-------------|

          Statistics for All Table Factors

          Pearson's Chi-squared test
          ------------------------------------------------------------
          Chi^2 =  11.78902     d.f. =  4     p =  0.01899112
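The reported statistic can be reproduced by hand from the nine cell counts in the output above, without the CSV file or the gmodels package. This sketch uses base R only, and the object names are illustrative. It computes the expected counts under no association, sums the cell contributions, and compares the result with base R's `chisq.test()`.

```r
# Observed counts copied from the cross-tab output
counts <- matrix(
  c(159, 157, 18,
    129, 184, 19,
    112, 183, 24),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Search_Type = c("Bing Suggested", "Popular Searches", "Self-selected Search"),
    Preference  = c("Bing Wins", "Google Wins", "Tie")
  )
)

# Expected count in cell (i,j) under no association: row total * column total / N
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Sum of (observed - expected)^2 / expected over all cells
chi_sq  <- sum((counts - expected)^2 / expected)
df      <- (nrow(counts) - 1) * (ncol(counts) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Base R equivalent (no continuity correction is applied beyond 2x2 tables)
builtin <- chisq.test(counts)

c(chi_sq = chi_sq, df = df, p_value = p_value)
```

The individual terms of the sum are the "Chi-square contribution" rows in the output, for example 4.025 for the Bing Suggested / Bing Wins cell.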
  31. MEAN COMPARISON: t-test, ANOVA, Regression. Most of data analysis is finding patterns/relationships between variables.
  32. Association Between Variables
      • Both variables nominal: cross tabs (Chi-square test)
      • One continuous, one nominal: mean comparison (t-test, ANOVA)
      • Many variables: regression
  33. Example: Impact of Southwest (t-test, ANOVA, Regression)
  34. Context
  35. [Image-only slide; no text in the transcript]
  36. Impact of Southwest Airlines on Price
      • Objective: What is the impact of Southwest's presence on average prices?
      • Approach:
          – Compute the average fares with and without Southwest
          – T-test
          – ANOVA
          – Regression
  37. Our Data
  38. T-test (Student t-test)
      History: The t-statistic was introduced in 1908 by William Sealy Gosset, a chemist working for the Guinness brewery in Dublin, Ireland ("Student" was his pen name). Gosset had been hired due to Claude Guinness's policy of recruiting the best graduates from Oxford and Cambridge to apply biochemistry and statistics to Guinness' industrial processes. Gosset devised the t-test as a way to cheaply monitor the quality of stout. He published the test in Biometrika in 1908, but was forced to use a pen name by his employer, who regarded the fact that they were using statistics as a trade secret.
  39. T-test Output. Impact of Southwest: $142
  40. The Lady Tasting Tea: Experimental Design & ANOVA
  41. History of Experimentation
      Galileo (1564-1642) reportedly dropped balls of various masses from the Leaning Tower of Pisa.
      • How many balls did he drop?
      • How many times did he repeat the comparison?
      • What were his independent and dependent variables?
      • How did he measure the time to impact?
      Experimental design was haphazard prior to the 1920s.
  42. Ronald Aylmer Fisher (1890-1962)
      • Considered to be the father of modern statistics.
      • Poor eyesight; did a lot of math in his head without paper or pencil.
      • In 1919, he began working as a statistician at an Agricultural Experiment Station in the United Kingdom.
      • Charming but had a terrible temper (and a big ego).
      • Smoked a pipe & argued professionally in the 1950s that smoking did not cause cancer.
      • Supported eugenics.
  43. The Design of Experiments (1935)
  44. Background: Studies in crop variation I – VI (1921 – 1929)
      • In 1919 a statistician named Fisher was hired at Rothamsted agricultural station.
      • They had a lot of observational data on crop yields and hoped a statistician could analyze it to find the effects of various treatments.
      • All he had to do was sort out the effects of confounding variables.
  45. Field layouts
      • No replication (pre-Fisher): one field with high N, one field with low N.
      • Field broken up into smaller plots, and plots are grouped.
      • Plots are blocked by location or other condition; treatments are applied randomly to plots within blocks.
  46. ANOVA. NOTE: the t-statistic from our t-test was 6.71. If you square this (6.71 × 6.71) you get approximately 45.03, the F-statistic reported by ANOVA (any small discrepancy comes from rounding the t-statistic).
  47. Regression
      • Dependent variable is Fare and independent variable is the Southwest dummy.
      • Seen these numbers before?
  48. Regression
      • So we get the same output from regression as from a t-test or ANOVA.
      • Note that fares do not depend only on the presence of Southwest; other factors matter too.
      • In our example: Competition, Distance.
      • Run the regression again, including these as additional predictors.
      • Important to note that “Presence of Southwest” is NOT random.
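The equivalence between the two-sample t-test, one-way ANOVA, and a regression on a dummy variable can be checked on simulated data. The fares below are randomly generated and every number is an assumption for illustration. This is not the course's airline data set, so the values will not match the slides.

```r
set.seed(1)  # simulated fares; all numbers here are made up for illustration
fares <- data.frame(
  southwest = rep(c(0, 1), each = 50),
  fare = c(rnorm(50, mean = 300, sd = 60), rnorm(50, mean = 200, sd = 60))
)

# Two-sample t-test with pooled variance (the version that matches ANOVA)
t_stat <- unname(t.test(fare ~ southwest, data = fares, var.equal = TRUE)$statistic)

# One-way ANOVA on the same two groups
f_stat <- anova(lm(fare ~ factor(southwest), data = fares))[["F value"]][1]

# Regression of fare on the Southwest dummy
fit     <- lm(fare ~ southwest, data = fares)
slope   <- coef(fit)[["southwest"]]
slope_t <- summary(fit)$coefficients["southwest", "t value"]

# Difference in group means (with Southwest minus without)
group_means <- tapply(fares$fare, fares$southwest, mean)
mean_gap    <- unname(group_means["1"] - group_means["0"])

c(t_squared = t_stat^2, F = f_stat, slope = slope, mean_gap = mean_gap)
```

The squared t-statistic equals the ANOVA F-statistic, and the regression slope equals the difference in group means with the same t-statistic up to sign. The match relies on `var.equal = TRUE`, since R's default Welch test does not pool the variances.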
  49. Compare the R-square
  50. Regression: Anova Table
      The 'Anova' test suggests that the regression model as a whole explains a reasonable amount of variance in Fares. The calculated F-value is equal to 141 and has a very small p-value (0.000). The amount of variance in Fares explained by the model is equal to 41.6%.
      The null and alternative hypotheses for the F-test can be formulated as follows:
      • H0: All regression coefficients are equal to 0
      • Ha: At least one regression coefficient is not equal to zero
  51. Interpretation Of Coefficients
      • Southwest: After controlling for Distance and Competition (# of airlines), the presence of Southwest in the market reduces fares by approximately $49.
      • Distance: Increasing distance by 100 miles increases the fare by $21.5.
      • # of Airlines: Increasing the number of airlines serving the market by 1 reduces the fare by approximately $41.
  52. How does R Compute the parameters?
      • Least Squares Principle: Choose the β’s so that the sum of the squared prediction errors is as small as possible:

      $$SSQ = \sum_{m=1}^{M} \left( \mathit{Fare}_m - \beta_0 - \beta_1 \mathit{SW}_m - \beta_2 \mathit{Dist}_m - \beta_3 \mathit{Comp}_m \right)^2$$

      OK, but what does that mean? Open the file SSQ_Intuition.xls.
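The least squares principle can be made concrete on simulated data. In this sketch the `routes` data frame, the "true" coefficients, and the helper `ssq()` are all assumptions for illustration. The coefficients are obtained by solving the normal equations directly, which is a textbook shortcut used here for clarity; `lm()` itself uses a QR decomposition internally but lands on the same minimizer.

```r
set.seed(7)  # simulated markets; every number below is made up
M <- 200
routes <- data.frame(
  SW   = rbinom(M, 1, 0.4),                 # Southwest dummy
  Dist = runif(M, 200, 2500),               # distance in miles
  Comp = sample(1:6, M, replace = TRUE)     # number of airlines
)
routes$Fare <- 250 - 50 * routes$SW + 0.2 * routes$Dist -
  40 * routes$Comp + rnorm(M, sd = 30)

# Sum of squared prediction errors for a candidate coefficient vector
ssq <- function(beta, d) {
  pred <- beta[1] + beta[2] * d$SW + beta[3] * d$Dist + beta[4] * d$Comp
  sum((d$Fare - pred)^2)
}

# Minimizer of SSQ via the normal equations: (X'X) beta = X'y
X        <- cbind(1, routes$SW, routes$Dist, routes$Comp)
beta_hat <- as.vector(solve(crossprod(X), crossprod(X, routes$Fare)))

# lm() reaches the same coefficients
beta_lm <- unname(coef(lm(Fare ~ SW + Dist + Comp, data = routes)))

rbind(normal_equations = beta_hat, lm = beta_lm)
```

Nudging any coefficient away from `beta_hat`, for example `ssq(beta_hat + c(1, 0, 0, 0), routes)`, gives a larger sum of squares than `ssq(beta_hat, routes)`.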
  53. Conclusion
      • T-test and ANOVA are both used to compare means across different groups.
      • T-test for 2 groups and ANOVA for many groups.
      • We can always convert the question to a regression problem using dummy variables.
      • The advantage of regression is that it is straightforward to control for any number of other variables that might affect the outcome.
      • From now on, we will focus on regression analysis.
  54. Regression: Key Points
      Regression is a widely used research tool:
      • Determine whether the independent variables explain a significant variation in the dependent variable: whether a relationship exists.
      • Determine how much of the variation in the dependent variable can be explained by the independent variables: strength of the relationship.
      • Control for other independent variables when evaluating the contribution of a specific variable or set of variables: marginal effect.
      • Forecast/predict the values of the dependent variable.
      • Use regression results as inputs to additional computations: optimal pricing, promotion, time to launch a product…
