Exploring relationships
Published in Education
Transcript

  • 1. Exploring relationships
    Andrew Hingston
    switchsolutions.com.au
  • 2. Why explore relationships?
  • 3.
  • 4. Today
    1. Hypothesis test on X
    2. Y and X
    3. Y and D
    4. Y and Xs and Ds
    5. Process
    Course
    1. Understanding data
    2. Monitoring processes
    3. Exploring relationships
  • 5. Part 1: Testing X
  • 6. Revenue growth
    New strategy
    Costs up $700
    Revenue up $1234
    Is revenue growth really > $700?
  • 7. Revenue change ($): revenue2.csv
    > mydata = read.csv("revenue2.csv")
    > attach(mydata)
    > mydata
  • 8. Extreme outlier
    > boxplot (RevenueChange)
    Special causes (outliers) can stuff things up!
    > mydata = read.csv("revenue2b.csv")
  • 9. Is change significant?
    Observed revenue growth $1234 vs hypothesised $700
  • 10. Null hypothesis
    no difference or no change
    revenue growth only $700
    Alternative hypothesis
    difference or change
    revenue growth is not $700
  • 11. p value
    A false positive
    Probability of chasing
    The probability of seeing a difference at least this large when your hypothesis of ‘no change’ is actually true
  • 12. Interpretation vs reality
    INTERPRETATION
    p value < 0.05: Different
    p value > 0.05: No change
    REALITY
    No change or Different
  • 13. p values
    Evidence against ‘no difference’
    … or evidence in favour of there being a difference!
  • 14. t test
    > t.test ( RevenueChange, mu = 700)
    p-value = 1.713% (two-sided, 0.8565% in each tail), so revenue growth of $1234 is significantly greater than $700
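A minimal sketch of the test above, in R. The data are simulated stand-ins (the course file revenue2b.csv is not reproduced here), and the name `revenue_change` is an assumption. It illustrates why the slide splits the p value into two equal halves: the two-sided p value reported by `t.test` is twice the area in one tail.

```r
# Simulated stand-in for the RevenueChange column (NOT the course data).
set.seed(1)
revenue_change <- rnorm(30, mean = 1200, sd = 900)

# Two-sided one-sample t test of H0: true mean revenue change = 700.
result <- t.test(revenue_change, mu = 700)

# Rebuild the p value from the t statistic: area in one tail, then doubled.
t_stat   <- unname(result$statistic)
df       <- unname(result$parameter)
one_tail <- pt(abs(t_stat), df = df, lower.tail = FALSE)

print(result$p.value)  # p value reported by t.test
print(2 * one_tail)    # same value, built from the two tails
```

If only an increase matters, `t.test(revenue_change, mu = 700, alternative = "greater")` runs the one-sided version instead.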
  • 15. Confidence interval
    Range of plausible values for ‘true’ mean
    Plausible values for ‘true’ revenue growth across different samples
    Stops us ‘fixating’ on estimate from one sample only
    > confint (lm (RevenueChange ~ 1))
  • 16. 95% Confidence Interval
    > confint (lm (RevenueChange ~ 1))
    Lower $804, estimate $1234, upper $1663 (hypothesised $700 lies below the interval)
    Range of plausible values for revenue increase
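A hedged sketch of where a 95% confidence interval for a mean comes from, using simulated data rather than the course file. In base R, `confint()` expects a fitted model, so this sketch fits an intercept-only model and compares it with the interval from `t.test` and with the textbook formula mean ± t × standard error. All variable names are assumptions.

```r
# Simulated stand-in for the RevenueChange column (NOT the course data).
set.seed(1)
revenue_change <- rnorm(30, mean = 1200, sd = 900)

# By hand: sample mean plus or minus t-critical times the standard error.
n       <- length(revenue_change)
xbar    <- mean(revenue_change)
se      <- sd(revenue_change) / sqrt(n)
t_crit  <- qt(0.975, df = n - 1)
by_hand <- c(xbar - t_crit * se, xbar + t_crit * se)

# The same interval from t.test and from an intercept-only regression.
from_t_test <- as.numeric(t.test(revenue_change)$conf.int)
from_lm     <- as.numeric(confint(lm(revenue_change ~ 1)))

print(by_hand)
print(from_t_test)
print(from_lm)
```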
  • 17. Assumptions
    Data normally distributed
    > shapiro.test(X) # No evidence against normality if p-value > 0.05
    OR
    Sample size > 50
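The normality check can be sketched as follows, again on simulated data (the sample `x` is an assumption, not the course data). The Shapiro-Wilk test has 'data are normal' as its null hypothesis, so a large p value means no evidence against normality rather than proof of it; a normal Q-Q plot is a useful visual companion.

```r
# Simulated sample standing in for X (NOT the course data).
set.seed(1)
x <- rnorm(40, mean = 1200, sd = 900)

# Shapiro-Wilk test: H0 is that the data come from a normal distribution.
sw <- shapiro.test(x)
print(sw$p.value)

# Visual check: points close to the line suggest approximate normality.
qqnorm(x)
qqline(x)
```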
  • 18. Part 2: Y and X
  • 19. Exploring Relationships
    Google returns (Y) and NASDAQ returns (X)
    Positive, negative or no relationship?
    Statistically significant?
    NASDAQ up 1% … Google up ?
    Range of plausible values?
    How ‘close’ is the relationship?
  • 20. Weekly returns (%)
    > mydata = read.csv("google_nasdaq_2010.csv")
    > attach(mydata)
    > mydata
  • 21. Weekly returns: Google vs NASDAQ Index
    > plot (NASDAQr, GOOGr )
    > abline(h=0,v=0)
  • 22. Line of best fit
    Y = <intercept> + <slope> × X
    GOOGr = <intercept> + <slope> × NASDAQr
    GOOGr = 0.005289 + 1.071682 × NASDAQr
    > fm = lm ( GOOGr ~ NASDAQr )
    > fm
  • 23. Weekly returns: Google vs NASDAQ Index
    Intercept 0.005289
    Slope 1.071682
    > abline (coef (fm), col="red" )
  • 24. Meaning
    GOOGr = 0.005289 + 1.071682 × NASDAQr
    Intercept
    NASDAQ = 0% then Google up 0.0053%
    Slope
    NASDAQ up 1% then Google up 1.07%
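The arithmetic behind the intercept and slope can be sketched in R. The two coefficients are the ones quoted on the slide; the helper `predict_goog` and the simulated returns further down are illustrative assumptions, not the course data.

```r
# Coefficients quoted on the slide (returns measured in %).
intercept <- 0.005289
slope     <- 1.071682

# Hypothetical helper: predicted Google return for a given NASDAQ return.
predict_goog <- function(nasdaq_r) intercept + slope * nasdaq_r

print(predict_goog(0))                    # NASDAQ flat: just the intercept
print(predict_goog(2) - predict_goog(1))  # one extra point of NASDAQ adds the slope

# The same idea with a fitted model, on simulated weekly returns.
set.seed(1)
nasdaq_r <- rnorm(52, mean = 0.3, sd = 2.5)
goog_r   <- 0.005 + 1.07 * nasdaq_r + rnorm(52, sd = 2.5)
fm       <- lm(goog_r ~ nasdaq_r)

# predict() applies intercept + slope * X using the fitted coefficients.
pred_at_1 <- unname(predict(fm, newdata = data.frame(nasdaq_r = 1)))
print(coef(fm))
print(pred_at_1)
```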
  • 25. No significant slope
    No significant relationship between X and Y
  • 26. Significance
    Are they significantly different from zero?
    Use p-values for evidence that they are not zero
    Intercept 0.005289
    Slope 1.071682
  • 27. Significance
    > summary ( fm )
  • 28. Confidence interval
    Range of plausible values for ‘true’ intercept and slope across other samples
    Stops us ‘fixating’ on our slope and intercept from this one sample
    > confint (fm)
  • 29. Slope: 95% Confidence Interval
    > confint (fm)
    Lower 0.81, estimate 1.07, upper 1.34
    Range of plausible values for influence of NASDAQ on Google
  • 30. Intercept: 95% Confidence Interval
    > confint (fm)
    Lower 0.002, estimate 0.005, upper 0.012
    Range of plausible values for Google’s return when NASDAQ is 0%
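A sketch, on simulated returns, of how the p values from `summary(fm)` and the intervals from `confint(fm)` fit together. Both are built from the same estimates and standard errors, so a coefficient has p < 0.05 exactly when its 95% interval excludes zero. The data and variable names are assumptions.

```r
# Simulated weekly returns (NOT the course data).
set.seed(1)
nasdaq_r <- rnorm(52, mean = 0.3, sd = 2.5)
goog_r   <- 0.005 + 1.07 * nasdaq_r + rnorm(52, sd = 2.5)
fm       <- lm(goog_r ~ nasdaq_r)

# Coefficient table: estimate, standard error, t value, p value.
tab <- coef(summary(fm))
est <- tab[, "Estimate"]
se  <- tab[, "Std. Error"]
p   <- tab[, "Pr(>|t|)"]

# Rebuild the 95% intervals by hand: estimate +/- t-critical * standard error.
t_crit  <- qt(0.975, df = df.residual(fm))
by_hand <- cbind(est - t_crit * se, est + t_crit * se)
ci      <- confint(fm)

# Significant at 5% if and only if the 95% interval excludes zero.
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0
significant   <- p < 0.05

print(ci)
print(data.frame(p_value = p, significant, excludes_zero))
```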
  • 31. Goodness of fit
    Adjusted R-squared
    Proportion of variability in Y explained by the model
    How close dots are to the line
    Usually between 0 and 1 (can dip just below 0 for a very poor fit)
    > summary ( fm )
    Adjusted R-squared = 0.5419
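For reference, a sketch of how R-squared and adjusted R-squared are computed, checked against `summary()`. The data are simulated and the names are assumptions; `k` is the number of X variables in the model.

```r
# Simulated data (NOT the course data).
set.seed(1)
x  <- rnorm(52, sd = 2.5)
y  <- 0.005 + 1.07 * x + rnorm(52, sd = 2.5)
fm <- lm(y ~ x)

n <- length(y)  # sample size
k <- 1          # number of X variables

# R-squared: share of the variability in Y that the line explains.
ss_res <- sum(residuals(fm)^2)   # variability left around the line
ss_tot <- sum((y - mean(y))^2)   # total variability in Y
r2     <- 1 - ss_res / ss_tot

# Adjusted R-squared: penalises R-squared for the number of Xs used.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(c(r2 = r2, adj_r2 = adj_r2))
print(summary(fm)$adj.r.squared)
```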
  • 32. Recap
    Why?
    intercept and slope
    p values
    confidence interval
    Adj R-squared
  • 33. Part 3: Y and D
  • 34. Qualitative variables
    Age (Y) and purchase (Yes/No)
    How to run a regression on categories?
    What if many categories?
    Meaning of slope, intercept?
  • 35. Dummy variable
    > mydata = read.csv("toothpaste.csv")
    > attach(mydata)
    > mydata
  • 36. Age (Y) vs Purchase
    > boxplot (Age~Purchase)
    Boxes overlap
    No evidence of difference in age between two groups
    Very crude test!
  • 37. Line of best fit
    Y = <intercept> + <slope> × D
    Age = <intercept> + <slope> × Purchase
    Age = 47.2 – 7.4 × Purchase
    > fm = lm ( Age ~ Purchase)
    > fm
    BIG DEAL!
    So non-purchasers aged 47.2
    and purchasers aged 47.2 – 7.4 = 39.8
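A sketch of why a regression on a 0/1 dummy reduces to comparing two group means. The data are simulated (the toothpaste file is not reproduced here) and all names are assumptions. The intercept equals the mean of the group coded 0, the slope equals the difference between the two group means, and the slope's p value matches a pooled-variance two-sample t test.

```r
# Simulated stand-in for toothpaste.csv (NOT the course data).
set.seed(1)
purchase <- rep(c(0, 1), each = 20)                  # 0 = non-purchaser, 1 = purchaser
age      <- 47 - 7 * purchase + rnorm(40, sd = 12)
fm       <- lm(age ~ purchase)

# Group means computed directly.
mean_non <- mean(age[purchase == 0])
mean_buy <- mean(age[purchase == 1])

print(coef(fm))                          # intercept and slope
print(c(mean_non, mean_buy - mean_non))  # same two numbers

# The slope's p value equals the pooled two-sample t test p value.
p_slope <- coef(summary(fm))["purchase", "Pr(>|t|)"]
p_ttest <- t.test(age ~ purchase, var.equal = TRUE)$p.value
print(c(p_slope, p_ttest))
```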
  • 38. Age (Y) vs Purchase
    Intercept 47.2
    Slope –7.4
    Nothing to see here!
    > plot (Age~Purchase)
    > abline(coef(fm), col="red")
  • 39. No significant slope
    No significant difference between the two groups
  • 40. Significance
    > summary ( fm )
  • 41. Confidence interval
    Range of plausible values for ‘true’ intercept and slope across other samples
    Stops us ‘fixating’ on our slope and intercept from this one sample
    > confint (fm)
  • 42. Slope: 95% Confidence Interval
    > confint (fm)
    Lower –15.1, estimate –7.4, upper +0.3 (interval includes 0)
    Range of plausible values for age difference between purchasers and non-purchasers
  • 43. Intercept: 95% Confidence Interval
    > confint (fm)
    Lower 41.8, estimate 47.2, upper 52.6
    Range of plausible values for the age of the non-purchasers
  • 44. Goodness of fit
    Adjusted R-squared
    Proportion of variability in Y explained by the model
    How close dots are to the line
    Usually between 0 and 1 (can dip just below 0 for a very poor fit)
    > summary ( fm )
    Adjusted R-squared = 0.068
  • 45. Part 4: Y and Xs and Ds
  • 46. Many relationships
    Sales Revenue (Y)
    Store Size (X1)
    Competitors (X2)
    Mall Size (X3)
    Main Walkway (D4)
    Multiple regression measures
    individual influence of each X on Y
    holding all other Xs constant
  • 47.
    > mydata = read.csv("clothing.csv")
    > attach(mydata)
    > mydata
  • 48.
    > plot (Sales~Store_Size)
    > …
    Main
  • 49. Model
    > fm = lm ( Sales ~ Store_Size + Competitors + Mall_Size + Main )
    > fm
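A sketch of what 'holding all other Xs constant' means in practice, using simulated data in place of clothing.csv. The column names, the data frame `shops` and the numbers used to generate it are assumptions. Two hypothetical stores that differ only by one unit of store size have predicted sales that differ by exactly the store-size slope.

```r
# Simulated stand-in for clothing.csv (NOT the course data).
set.seed(1)
n           <- 60
store_size  <- runif(n, 50, 300)
competitors <- rpois(n, 3)
mall_size   <- runif(n, 20, 120)
main        <- rbinom(n, 1, 0.5)   # dummy: 1 = on the main walkway
sales       <- 100 + 2 * store_size - 15 * competitors +
               1.5 * mall_size + 40 * main + rnorm(n, sd = 30)
shops       <- data.frame(sales, store_size, competitors, mall_size, main)

# Passing data = shops avoids the need for attach().
fm <- lm(sales ~ store_size + competitors + mall_size + main, data = shops)

# Two hypothetical stores, identical except for one extra unit of store size.
base   <- data.frame(store_size = 150, competitors = 3, mall_size = 60, main = 1)
bigger <- transform(base, store_size = store_size + 1)

# The gap in predicted sales is the store_size slope.
gap <- unname(predict(fm, bigger) - predict(fm, base))
print(coef(fm))
print(gap)
```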
  • 50. No significant slope
    No significant relationship between X and Y
  • 51. Significance
    > summary ( fm )
  • 52. Confidence intervals
    > confint ( fm )
  • 53. Goodness of fit
    Adjusted R-squared
    Proportion of variability in Y explained by the model
    How close dots are to the line
    Usually between 0 and 1 (can dip just below 0 for a very poor fit)
    > summary ( fm )
    Adjusted R-squared = 0.8523
  • 54. Recap
    Why?
    intercept and slope
    p values
    confidence interval
    Adj R-squared
  • 55. Part 5: Process
  • 56. Steps …
    Visualise each X
    Visualise Y vs X
    Run regression
    Check valid model (Next module)
    Look at p values
    Look at sign and size of estimates
    Look at confidence intervals
    Goodness of fit
  • 57. Assumptions
    Sample size preferably > X by 10+
    Both X and Y are numeric
    Y has spread (not a dummy or a probability)
    Straight-line relationship between each X and Y
    Plots of residuals must look random (no pattern)
    Residuals must be normally distributed
    No missing but relevant X variables
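A minimal sketch of how the residual assumptions above might be checked in R, on simulated data. The choice of plots and of the Shapiro-Wilk test is an assumption for illustration, not necessarily the procedure used later in the course.

```r
# Simulated data (NOT the course data).
set.seed(1)
x  <- rnorm(52, sd = 2.5)
y  <- 0.005 + 1.07 * x + rnorm(52, sd = 2.5)
fm <- lm(y ~ x)

res <- residuals(fm)

# Residuals vs fitted values: look for a random scatter around zero,
# with no curve and no funnel shape.
plot(fitted(fm), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)

# Normality of residuals: Q-Q plot plus Shapiro-Wilk test.
qqnorm(res)
qqline(res)
sw <- shapiro.test(res)
print(sw$p.value)  # a large p value means no evidence against normality
```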
  • 58. Part 6: Exercises
  • 59. Exercises in R
    Exercise 1 Recycled waste
    Exercise 3 Invoice processing
    Exercise 5 Surface hardness
    Exercise 6 Call center complaints
    Exercise 8 Parcel delivery
    Exercise 9 Wage discrimination
  • 60. THANKS
    Feedback please!