Exploring relationships

Published in: Education

  1. Exploring relationships
     Andrew Hingston
     switchsolutions.com.au
  2. Why explore relationships?
  4. Today
     1. Hypothesis test on X
     2. Y and X
     3. Y and D
     4. Y and Xs and Ds
     5. Process
     Course
     1. Understanding data
     2. Monitoring processes
     3. Exploring relationships
  5. Part 1: Testing X
  6. Revenue growth
     New strategy
     Costs ↑ $700
     Revenue ↑ $1234
     Is revenue ↑ really > $700?
  7. Revenue change ($), data file revenue2.csv
     > mydata = read.csv("revenue2.csv")
     > attach(mydata)
     > mydata
  8. Extreme outlier
     > boxplot(RevenueChange)
     Special causes (outliers) can stuff things up!
     > mydata = read.csv("revenue2b.csv")
  9. Is change significant?
     $1234 vs $700
  10. Null hypothesis
      No difference or no change
      Revenue growth is only $700
      Alternate hypothesis
      Difference or change
      Revenue growth is not $700
  11. p value
      Probability of chasing a false positive
      The probability of seeing a result at least this extreme if your hypothesis of ‘no change’ is actually true
      A small p value means little risk of rejecting ‘no change’ when it is actually true
  12. Interpretation vs reality
      p value < 0.05: interpret as ‘different’ (a false positive if in reality there is no change)
      p value > 0.05: interpret as ‘no change’ (a false negative if in reality there is a difference)
  13. p values
      Evidence against ‘no difference’
      … or evidence in favour of there being a difference!
  14. > t.test(RevenueChange, mu = 700)
      p-value = 1.713% (0.8565% in each tail), so revenue growth of $1234 is significantly greater than $700
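The test on this slide can be reproduced in outline without the course data. The sketch below uses simulated numbers as a stand-in for the `RevenueChange` column (the values and the name `revenue_change` are assumptions, not the contents of revenue2b.csv) and shows how the two-sided p value is the sum of two equal tails.

```r
# Made-up stand-in for the RevenueChange column; not the values in revenue2b.csv.
set.seed(1)
revenue_change <- rnorm(30, mean = 1200, sd = 1100)

# Two-sided test of the null hypothesis "true mean change is $700".
result <- t.test(revenue_change, mu = 700)
print(result$estimate)  # sample mean
print(result$p.value)   # two-sided p value

# Rebuild the p value from the t statistic: area in one tail, then doubled.
t_stat <- unname(result$statistic)
df <- unname(result$parameter)
one_tail <- pt(abs(t_stat), df = df, lower.tail = FALSE)
two_tail <- 2 * one_tail
print(c(one_tail = one_tail, two_tail = two_tail))
```

With the slide's data the same split would give 0.8565% per tail and 1.713% in total.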
  15. Confidence interval
      Range of plausible values for ‘true’ mean
      Plausible values for ‘true’ revenue growth across different samples
      Stops us ‘fixating’ on the estimate from one sample only
      > t.test(RevenueChange)$conf.int
  16. 95% confidence interval
      > t.test(RevenueChange)$conf.int
      $804 to $1663, around the sample mean of $1234; $700 lies outside
      Range of plausible values for revenue increase
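To see where such an interval comes from, here is a minimal sketch that builds a 95% confidence interval for a mean by hand and compares it with two built-in routes in base R. The data are simulated (an assumption for illustration), so the numbers will not match the $804 to $1663 on the slide.

```r
# Made-up stand-in for the RevenueChange column.
set.seed(1)
revenue_change <- rnorm(30, mean = 1200, sd = 1100)

# By hand: mean plus or minus a t critical value times the standard error.
n <- length(revenue_change)
xbar <- mean(revenue_change)
se <- sd(revenue_change) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)
manual_ci <- c(xbar - t_crit * se, xbar + t_crit * se)

# Built-in routes: t.test() on the vector, or confint() on an intercept-only model.
builtin_ci <- as.numeric(t.test(revenue_change)$conf.int)
model_ci <- as.numeric(confint(lm(revenue_change ~ 1)))

print(rbind(manual_ci, builtin_ci, model_ci))
```

All three rows should agree, since each uses the same t-based formula.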
  17. Assumptions
      Data normally distributed
      > shapiro.test(X)   # no evidence against normality if p-value > 0.05
      OR
      Sample size > 50
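The rule of thumb above can be written as a short check. This is only a sketch on simulated data; the variable names are illustrative, and the 0.05 and 50 thresholds are the ones quoted on the slide.

```r
# Made-up sample of 40 values standing in for X.
set.seed(2)
x <- rnorm(40, mean = 1200, sd = 1100)

# Shapiro-Wilk test: a small p value is evidence against normality.
check <- shapiro.test(x)
print(check$p.value)

# Slide's rule of thumb: p value > 0.05, or more than 50 observations.
assumption_ok <- check$p.value > 0.05 || length(x) > 50
print(assumption_ok)
```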
  18. Part 2: Y and X
  19. Exploring relationships
      Google returns (Y) and NASDAQ returns (X)
      Positive, negative or no relationship?
      Statistically significant?
      NASDAQ ↑ 1% … Google ↑ ?
      Range of plausible values?
      How ‘close’ is the relationship?
  20. Weekly returns (%)
      > mydata = read.csv("google_nasdaq_2010.csv")
      > attach(mydata)
      > mydata
  21. Weekly returns: Google vs NASDAQ Index
      > plot(NASDAQr, GOOGr)
      > abline(h=0, v=0)
  22. Line of best fit
      Y = <intercept> + <slope> × X
      GOOGr = <intercept> + <slope> × NASDAQr
      > fm = lm(GOOGr ~ NASDAQr)
      > fm
      GOOGr = 0.005289 + 1.071682 × NASDAQr
  23. Weekly returns: Google vs NASDAQ Index
      Intercept 0.005289
      Slope 1.071682
      > abline(coef(fm), col="red")
  24. Meaning
      GOOGr = 0.005289 + 1.071682 × NASDAQr
      Intercept: if NASDAQ = 0% then Google ↑ 0.0053%
      Slope: if NASDAQ ↑ 1% then Google ↑ 1.07%
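The reading of intercept and slope can be checked with predictions. The sketch below fits a line to simulated weekly returns (made-up numbers, not google_nasdaq_2010.csv; `nasdaq_r` and `goog_r` are illustrative names) and confirms that the prediction at 0% is the intercept and that one extra percentage point of X moves the prediction by exactly the slope.

```r
# Made-up weekly returns (%), loosely shaped like the slide's example.
set.seed(3)
nasdaq_r <- rnorm(52, mean = 0.2, sd = 2.5)
goog_r <- 0.005 + 1.07 * nasdaq_r + rnorm(52, sd = 2.4)
returns <- data.frame(nasdaq_r, goog_r)

fm <- lm(goog_r ~ nasdaq_r, data = returns)
b <- coef(fm)
print(b)

# Predicted Google return when NASDAQ returns 0% and 1%.
pred <- predict(fm, newdata = data.frame(nasdaq_r = c(0, 1)))
print(pred)

# Intercept = prediction at 0; slope = change in prediction per 1 point of X.
at_zero <- unname(pred[1])
per_point <- unname(pred[2] - pred[1])
```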
  25. No significant slope
      No significant relationship between X and Y
  26. Significance
      Intercept 0.005289
      Slope 1.071682
      Are they significantly different from zero?
      Use p-values for evidence that they are not zero
  27. Significance
      > summary(fm)
  28. Confidence interval
      Range of plausible values for ‘true’ intercept and slope across other samples
      Stops us ‘fixating’ on our slope and intercept from this one sample
      > confint(fm)
  29. Slope: 95% confidence interval
      > confint(fm)
      0.81 to 1.34, around the estimate of 1.07
      Range of plausible values for influence of NASDAQ on Google
  30. Intercept: 95% confidence interval
      > confint(fm)
      -0.002 to 0.012, around the estimate of 0.005
      Range of plausible values for Google’s return when NASDAQ is 0%
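The p values from `summary()` and the intervals from `confint()` tell the same story: a coefficient is significant at the 5% level exactly when its 95% confidence interval excludes zero. A minimal sketch on simulated returns (made-up data and names) that pulls both out of a fitted model:

```r
# Made-up weekly returns (%).
set.seed(3)
nasdaq_r <- rnorm(52, mean = 0.2, sd = 2.5)
goog_r <- 0.005 + 1.07 * nasdaq_r + rnorm(52, sd = 2.4)
fm <- lm(goog_r ~ nasdaq_r)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
coefs <- summary(fm)$coefficients
p_values <- coefs[, "Pr(>|t|)"]
ci <- confint(fm, level = 0.95)
print(cbind(estimate = coefs[, "Estimate"], ci, p_value = p_values))

# Two views of the same decision for each coefficient.
significant <- p_values < 0.05
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0
print(rbind(significant, excludes_zero))
```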
  31. Goodness of fit
      Adjusted R-squared
      Proportion of variability in Y explained by the model
      How close dots are to the line
      Usually between 0 and 1
      > summary(fm)
      Adjusted R-squared: 0.5419
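Adjusted R-squared can be rebuilt from the residuals, which makes the "proportion of variability explained" reading concrete. This sketch uses simulated returns (an assumption) and the standard adjustment for the number of X variables; note that the adjustment can push the value slightly below 0 when the fit is very poor.

```r
# Made-up weekly returns (%).
set.seed(3)
nasdaq_r <- rnorm(52, sd = 2.5)
goog_r <- 1.07 * nasdaq_r + rnorm(52, sd = 2.4)
fm <- lm(goog_r ~ nasdaq_r)

n <- nobs(fm)                             # number of observations
k <- length(coef(fm)) - 1                 # number of X variables
rss <- sum(residuals(fm)^2)               # variability left unexplained
tss <- sum((goog_r - mean(goog_r))^2)     # total variability in Y

r2 <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(c(r2 = r2, adj_r2 = adj_r2))
```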
  32. Recap
      Why?
      Intercept and slope
      p values
      Confidence interval
      Adj R-squared
  33. Part 3: Y and D
  34. Qualitative variables
      Age (Y) and purchase (Yes/No)
      How to run a regression on categories?
      What if there are many categories?
      Meaning of slope and intercept?
  35. Dummy variable
      > mydata = read.csv("toothpaste.csv")
      > attach(mydata)
      > mydata
  36. Age (Y) vs Purchase
      > boxplot(Age~Purchase)
      Boxes overlap
      No evidence of difference in age between two groups
      Very crude test!
  37. Line of best fit
      Y = <intercept> + <slope> × D
      Age = <intercept> + <slope> × Purchase
      > fm = lm(Age ~ Purchase)
      > fm
      Age = 47.2 – 7.4 × Purchase
      So non-purchasers are aged 47.2 on average
      and purchasers are aged 47.2 – 7.4 = 39.8
      BIG DEAL!
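With a 0/1 dummy, the regression simply reports group averages: the intercept is the mean of the 0 group and the slope is the gap between the two means. The sketch below checks this on simulated ages (made-up data, not toothpaste.csv; lowercase names are illustrative) and also shows that the slope's p value matches an equal-variance two-sample t test.

```r
# Made-up data: 0 = non-purchaser, 1 = purchaser.
set.seed(4)
purchase <- rep(c(0, 1), each = 20)
age <- 47 - 7 * purchase + rnorm(40, sd = 12)

fm <- lm(age ~ purchase)
b <- coef(fm)
group_means <- tapply(age, purchase, mean)
print(b)
print(group_means)

# Intercept = mean age of non-purchasers; intercept + slope = mean age of purchasers.
non_purchasers <- unname(b[1])
purchasers <- unname(b[1] + b[2])

# Same comparison as a pooled-variance two-sample t test.
slope_p <- summary(fm)$coefficients["purchase", "Pr(>|t|)"]
ttest_p <- t.test(age ~ purchase, var.equal = TRUE)$p.value
print(c(slope_p = slope_p, ttest_p = ttest_p))
```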
  38. Age (Y) vs Purchase
      Intercept 47.2
      Slope -7.4
      Nothing to see here!
      > plot(Age~Purchase)
      > abline(coef(fm), col="red")
  39. No significant slope
      No significant difference between the two groups
  40. Significance
      > summary(fm)
  41. Confidence interval
      Range of plausible values for ‘true’ intercept and slope across other samples
      Stops us ‘fixating’ on our slope and intercept from this one sample
      > confint(fm)
  42. Slope: 95% confidence interval
      > confint(fm)
      -15.1 to +0.3, around the estimate of -7.4
      Range of plausible values for age difference between purchasers and non-purchasers
  43. Intercept: 95% confidence interval
      > confint(fm)
      41.8 to 52.6, around the estimate of 47.2
      Range of plausible values for the age of the non-purchasers
  44. Goodness of fit
      Adjusted R-squared
      Proportion of variability in Y explained by the model
      How close dots are to the line
      Usually between 0 and 1
      > summary(fm)
      Adjusted R-squared: 0.068
  45. Part 4: Y and Xs and Ds
  46. Many relationships
      Sales Revenue (Y)
      Store Size (X1)
      Competitors (X2)
      Mall Size (X3)
      Main Walkway (D4)
      Multiple regression measures the individual influence of each X on Y, holding all other Xs constant
  47. > mydata = read.csv("clothing.csv")
      > attach(mydata)
      > mydata
  48. > plot(Sales~Store_Size)
      > …
      Main
  49. Model
      > fm = lm(Sales ~ Store_Size + Competitors + Mall_Size + Main)
      > fm
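"Holding all other Xs constant" can be made concrete by comparing two predictions that differ in one variable only. The sketch below reuses the column names from the slide, but the data are simulated and the coefficients used to generate them are arbitrary assumptions, not the contents of clothing.csv.

```r
# Made-up store data using the slide's column names.
set.seed(5)
n <- 60
stores <- data.frame(
  Store_Size  = runif(n, 50, 300),
  Competitors = rpois(n, 3),
  Mall_Size   = runif(n, 20, 150),
  Main        = rbinom(n, 1, 0.5)   # dummy: 1 = on the main walkway
)
stores$Sales <- with(stores,
  100 + 2 * Store_Size - 15 * Competitors + 1.5 * Mall_Size + 60 * Main +
    rnorm(n, sd = 40))

fm <- lm(Sales ~ Store_Size + Competitors + Mall_Size + Main, data = stores)
print(coef(fm))

# Two hypothetical stores, identical except for one extra unit of Store_Size.
base <- data.frame(Store_Size = 150, Competitors = 3, Mall_Size = 80, Main = 1)
bigger <- transform(base, Store_Size = Store_Size + 1)
effect <- unname(predict(fm, bigger) - predict(fm, base))
print(effect)   # equals the Store_Size coefficient
```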
  50. No significant slope
      No relationship between X and Y
  51. Significance
      > summary(fm)
  52. Confidence intervals
      > confint(fm)
  53. Goodness of fit
      Adjusted R-squared
      Proportion of variability in Y explained by the model
      How close dots are to the line
      Usually between 0 and 1
      > summary(fm)
      Adjusted R-squared: 0.8523
  54. Recap
      Why?
      Intercept and slope
      p values
      Confidence interval
      Adj R-squared
  55. Part 5: Process
  56. Steps …
      Visualise each X
      Visualise Y vs X
      Run regression
      Check valid model (Next module)
      Look at p values
      Look at sign and size of estimates
      Look at confidence intervals
      Goodness of fit
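The later steps (p values, estimates, confidence intervals, goodness of fit) can be gathered into one small summary. The helper below, `regression_report`, is a hypothetical name introduced here for illustration, and the example uses R's built-in `cars` dataset rather than any of the course files.

```r
# Hypothetical helper: collect the numbers the steps ask you to look at.
regression_report <- function(fm) {
  coefs <- summary(fm)$coefficients
  ci <- confint(fm)
  list(
    table = data.frame(
      estimate = coefs[, "Estimate"],   # sign and size of estimates
      lower_95 = ci[, 1],               # confidence interval
      upper_95 = ci[, 2],
      p_value  = coefs[, "Pr(>|t|)"]    # significance
    ),
    adj_r_squared = summary(fm)$adj.r.squared   # goodness of fit
  )
}

# Example with the built-in cars data: stopping distance vs speed.
print(summary(cars))                   # a numeric look at each variable
fm <- lm(dist ~ speed, data = cars)    # run regression
report <- regression_report(fm)
print(report$table)
print(report$adj_r_squared)
```

The visual steps would still use `plot()` and `boxplot()` as on the earlier slides.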
  57. Assumptions
      Sample size preferably exceeds the number of Xs by 10 or more
      Both X and Y are numbers
      Y has spread, and is not a dummy or a probability
      Straight-line relationship between each X and Y
      Plots of residuals must look random
      Residuals must be normal
      No missing but relevant X variables
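Some of these assumptions can be checked numerically once a model is fitted. This is only a rough sketch on R's built-in `cars` data; the sample-size rule is read literally from the slide (observations exceed the number of Xs by at least 10), and the residual plot is left as a comment.

```r
fm <- lm(dist ~ speed, data = cars)
res <- residuals(fm)

# Sample size versus number of X variables.
n <- nobs(fm)
k <- length(coef(fm)) - 1
enough_data <- (n - k) >= 10
print(c(n = n, k = k))

# Residuals should look normal: a small p value is evidence that they are not.
normality_p <- shapiro.test(res)$p.value
print(normality_p)

# Residuals should show no pattern against the fitted values; inspect with:
#   plot(fitted(fm), res); abline(h = 0)
# With an intercept in the model, the residuals always average to zero.
print(mean(res))
```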
  58. Part 6: Exercises
  59. Exercises in R
      Exercise 1 Recycled waste
      Exercise 3 Invoice processing
      Exercise 5 Surface hardness
      Exercise 6 Call center complaints
      Exercise 8 Parcel delivery
      Exercise 9 Wage discrimination
  60. THANKS
      Feedback please!
