24 modelling

Transcript

  • 1. Stat405 Intro to modelling Hadley Wickham Tuesday, 16 November 2010
  • 2. 1. What is a linear model?
       2. Removing trends
       3. Transformations
       4. Categorical data
       5. Visualising models
  • 3. What is a linear model?
  • 4 to 9. [Figure, built up over six slides, labelling the observed value, the predicted value, and the residual.]
  • 10. y ~ x
        # yhat = b1x + b0
        # Want to find b's that minimise distance
        # between y and yhat
        z ~ x + y
        # zhat = b2x + b1y + b0
        # Want to find b's that minimise distance
        # between z and zhat
        z ~ x * y
        # zhat = b3(x⋅y) + b2x + b1y + b0
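To make the formula notation above concrete, here is a minimal sketch in base R showing which columns each formula generates and checking that `lm()` returns the coefficients with the smallest sum of squared residuals. The toy data frame `d`, its simulated coefficients, and the helper `rss()` are illustrative assumptions, not part of the original slides.

```r
# Hypothetical toy data; d, x, y, z and the true coefficients are made up.
set.seed(1)
d <- data.frame(x = runif(50), y = runif(50))
d$z <- 1 + 2 * d$x - 3 * d$y + 0.5 * d$x * d$y + rnorm(50, sd = 0.1)

# z ~ x + y: an intercept column plus one column per predictor
cols_add <- colnames(model.matrix(z ~ x + y, data = d))
# z ~ x * y: the same columns plus the x:y interaction
cols_int <- colnames(model.matrix(z ~ x * y, data = d))
print(cols_add)
print(cols_int)

# lm() finds the b's that minimise the sum of squared residuals
mod <- lm(z ~ x * y, data = d)
rss <- function(b) sum((d$z - drop(model.matrix(mod) %*% b))^2)

# Nudging the fitted coefficients can only make the fit worse
rss_fit <- rss(coef(mod))
rss_nudged <- rss(coef(mod) + 0.01)
print(c(fitted = rss_fit, nudged = rss_nudged))
```

Here "distance" is taken to mean the sum of squared residuals, which is the criterion `lm()` uses.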
  • 11. Assumptions
        X is measured without error.
        Relationship is linear.
        Errors are independent.
        Errors have normal distribution.
        Errors have constant variance.
  • 12. Removing trends
  • 13. library(ggplot2)
        # Treat impossible dimensions as missing
        diamonds$x[diamonds$x == 0] <- NA
        diamonds$y[diamonds$y == 0] <- NA
        diamonds$y[diamonds$y > 30] <- NA
        diamonds$z[diamonds$z == 0] <- NA
        diamonds$z[diamonds$z > 30] <- NA
        diamonds <- subset(diamonds, carat < 2)
        qplot(x, y, data = diamonds)
        qplot(x, z, data = diamonds)
  • 16. mody <- lm(y ~ x, data = diamonds, na.action = na.exclude)
        coef(mody)
        # yhat = 0.05 + 0.99⋅x
        # Plot x vs yhat
        qplot(x, predict(mody), data = diamonds)
        # Plot x vs (y - yhat) = residual
        qplot(x, resid(mody), data = diamonds)
        # Standardised residual:
        qplot(x, rstandard(mody), data = diamonds)
  • 17. qplot(x, resid(mody), data=dclean)
  • 18. qplot(x, y - x, data=dclean)
  • 19. Your turn
        Do the same thing for z and x.
        What threshold might you use to remove outlying values?
        Are the errors from predicting z and y from x related?
  • 20. modz <- lm(z ~ x, data = diamonds, na.action = na.exclude)
        coef(modz)
        # zhat = 0.03 + 0.61x
        qplot(x, rstandard(modz), data = diamonds)
        last_plot() + ylim(-10, 10)
        qplot(rstandard(mody), rstandard(modz))
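The models in this section are fitted with `na.exclude` so that residuals line up with the rows of the original data when plotted against `x`. The sketch below, on a made-up ten-row data frame `d` with one missing `x`, illustrates that padding and confirms that a residual is simply the observed value minus the predicted value. All names and numbers are assumptions for illustration.

```r
# Hypothetical data: 10 rows, the last x (and therefore y) is missing
set.seed(2)
d <- data.frame(x = c(runif(9, 4, 8), NA))
d$y <- 0.05 + 0.99 * d$x + rnorm(10, sd = 0.02)

# na.exclude drops the incomplete row for fitting, but pads
# resid() and predict() with NA so they match nrow(d)
mod_excl <- lm(y ~ x, data = d, na.action = na.exclude)
n_excl <- length(resid(mod_excl))

# na.omit (the usual default) returns only the complete cases
mod_omit <- lm(y ~ x, data = d, na.action = na.omit)
n_omit <- length(resid(mod_omit))
print(c(na.exclude = n_excl, na.omit = n_omit))

# residual = observed - predicted, row by row
same <- all.equal(unname(resid(mod_excl)), d$y - unname(predict(mod_excl)))
print(same)
```

With `na.omit` the residual vector is shorter than the data frame, so a call such as `qplot(x, resid(mod), data = d)` would fail on mismatched lengths.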
  • 21. Transformations
  • 22 to 23. Can we use a linear model to remove this trend?
  • 24. Can we use a linear model to remove this trend?
        Linear models are linear in their parameters; the variables themselves can be any transformation of the data.
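As a small illustration of "linear in the parameters", the sketch below fits a curved relationship with `lm()` by transforming the predictor inside the formula. The simulated data and the choice of a square-root transformation are assumptions made for this example only.

```r
# Hypothetical curved relationship: y = 2 + 3 * sqrt(x) + noise
set.seed(6)
x <- seq(1, 10, length.out = 50)
y <- 2 + 3 * sqrt(x) + rnorm(50, sd = 0.1)

# Still a linear model: y is linear in b0 and b1,
# even though it is not linear in x
mod <- lm(y ~ sqrt(x))
b <- coef(mod)
print(b)

# Residuals from the transformed fit should show no remaining trend
trend_left <- cor(resid(mod), sqrt(x))
print(trend_left)
```

The same idea covers `log()`, polynomial terms via `I(x^2)`, and transformations of the response.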
  • 25. Your turn
        Use a linear model to remove the effect of carat on price.
        Confirm that this worked by plotting model residuals vs. color.
        How can you interpret the model coefficients and residuals?
  • 26. modprice <- lm(log(price) ~ log(carat), data = diamonds, na.action = na.exclude)
        # Price relative to the price predicted from carat
        diamonds$relprice <- exp(resid(modprice))
        qplot(carat, relprice, data = diamonds)
        diamonds <- subset(diamonds, carat < 2)
        qplot(carat, relprice, data = diamonds)
        qplot(carat, relprice, data = diamonds) + facet_wrap(~ color)
        qplot(relprice, ..density.., data = diamonds, colour = color, geom = "freqpoly", binwidth = 0.2)
        qplot(relprice, ..density.., data = diamonds, colour = cut, geom = "freqpoly", binwidth = 0.2)
  • 27. Multiplicative model
        log(Y) = a * log(X) + b
        Y = c ⋅ X^a, where c = exp(b)
        An additive model on the log scale becomes a multiplicative model on the original scale. The intercept becomes the starting point (the value of Y at X = 1) and the slope becomes the exponent on X.
  • 28. Residuals
        resid(mod) = log(Y) - log(Yhat)
        exp(resid(mod)) = Y / (Yhat)
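The two identities above can be checked numerically. This is a minimal sketch on simulated data: the variable names mirror the slides, but the values 4000, 1.7 and 0.2 are invented for illustration and are not estimates from the diamonds data.

```r
# Hypothetical power-law data: price = c * carat^a * multiplicative noise
set.seed(3)
carat <- runif(100, 0.2, 2)
price <- 4000 * carat^1.7 * exp(rnorm(100, sd = 0.2))

mod <- lm(log(price) ~ log(carat))

# Back-transform the coefficients: Y = exp(b) * X^a
a <- coef(mod)[["log(carat)"]]
c0 <- exp(coef(mod)[["(Intercept)"]])
yhat <- exp(predict(mod))
power_law_ok <- all.equal(unname(yhat), c0 * carat^a)

# exp(residual) is the ratio of observed to predicted price
ratio_ok <- all.equal(unname(exp(resid(mod))), unname(price / yhat))
print(c(power_law = power_law_ok, ratio = ratio_ok))
```

A value of `exp(resid(mod))` equal to 1.2 therefore reads as "20% more expensive than predicted for its carat".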
  • 29. # Useful trick - close to 0, exp(x) ~ x + 1
        x <- seq(-0.2, 0.2, length = 100)
        qplot(x, exp(x)) + geom_abline(intercept = 1)
        qplot(x, x / exp(x)) + scale_y_continuous("Percent error", labels = scales::percent)
        # Not so useful here because the x is also
        # transformed
        coef(modprice)
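The approximation can also be tabulated without a plot. The sketch below compares `exp(x)` with `1 + x` at a few values near zero; the relative-error formula used here, `(1 + x) / exp(x) - 1`, is my own choice of measure and differs from the quantity plotted on the slide.

```r
# A few log-scale values close to zero
x <- c(-0.2, -0.1, -0.05, 0.05, 0.1, 0.2)

# Relative error of approximating exp(x) by 1 + x
rel_error <- (1 + x) / exp(x) - 1
tab <- data.frame(x = x, exact = exp(x), approx = 1 + x, rel_error = rel_error)
print(tab)

# Worst case over this range
worst <- max(abs(rel_error))
print(worst)
```

Over this range the approximation stays within 3% of the exact value, so a log-scale residual of 0.1 can be read as roughly 10% above the prediction.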
  • 30. Categorical data
  • 31. Your turn
        Compare the results of the following two functions. What can you say about the model?
        library(plyr)  # provides ddply() and summarise()
        ddply(diamonds, "color", summarise, mean = mean(price))
        coef(lm(price ~ color, data = diamonds))
  • 32. Categorical data
        Converted into a numeric matrix, with one column for each level. Each column contains 1 if that observation has that level, 0 otherwise.
        However, if we just do that naively, we end up with too many columns (because we have one extra column for the intercept).
        So the column for the first level is dropped, and everything is relative to the first level.
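The encoding described above can be inspected with `model.matrix()`. This is a sketch on a made-up data frame with an unordered three-level factor; the level names and prices are assumptions. Note that R gives ordered factors polynomial contrasts by default, so the coefficients look different for those.

```r
# Hypothetical data: an unordered factor with three levels
set.seed(4)
d <- data.frame(
  color = factor(rep(c("D", "E", "F"), each = 20)),
  price = rep(c(3000, 3200, 3700), each = 20) + rnorm(60, sd = 100)
)

# Intercept plus indicator columns for every level except the first
mm <- model.matrix(price ~ color, data = d)
print(colnames(mm))

# Group means, as a plain named vector
means <- sapply(split(d$price, d$color), mean)
b <- coef(lm(price ~ color, data = d))

# Intercept = mean of the first level;
# other coefficients = difference from the first level
intercept_ok <- all.equal(b[["(Intercept)"]], means[["D"]])
diff_ok <- all.equal(b[["colorE"]], means[["E"]] - means[["D"]])
print(c(intercept = intercept_ok, difference = diff_ok))
```

This is the relationship the preceding "Your turn" exercise asks you to discover by comparing group means with model coefficients.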
  • 33. Visualising models
  • 34. # What do you think this model does?
        lm(log(price) ~ log(carat) + color, data = diamonds)
        # What about this one?
        lm(log(price) ~ log(carat) * color, data = diamonds)
        # Or this one?
        lm(log(price) ~ cut * color, data = diamonds)
        # How can we interpret the results?
  • 35. mod1 <- lm(log(price) ~ log(carat) + cut, data = diamonds)
        mod2 <- lm(log(price) ~ log(carat) * cut, data = diamonds)
        # One way is to explore predictions from the model
        # over an evenly spaced grid. expand.grid makes
        # this easy
        grid <- expand.grid(
          carat = seq(0.2, 2, length = 20),
          cut = levels(diamonds$cut),
          KEEP.OUT.ATTRS = FALSE)
        str(grid)
        grid
        grid$p1 <- exp(predict(mod1, grid))
        grid$p2 <- exp(predict(mod2, grid))
  • 36. Your turn
        Plot the predictions from the two sets of models. How are they different?
  • 37. qplot(carat, p1, data = grid, colour = cut, geom = "line")
        qplot(carat, p2, data = grid, colour = cut, geom = "line")
        qplot(log(carat), log(p1), data = grid, colour = cut, geom = "line")
        qplot(log(carat), log(p2), data = grid, colour = cut, geom = "line")
        qplot(carat, p1 / p2, data = grid, colour = cut, geom = "line")
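One way to see the difference between the `+` and `*` models without plotting is to compare the gap between groups at several carat values. The sketch below does this on simulated data with two cut levels; the data, slopes and offsets are invented for illustration, and only the model formulas follow the slides.

```r
# Hypothetical data: two cuts with different slopes on the log scale
set.seed(5)
d <- expand.grid(carat = seq(0.2, 2, length.out = 30),
                 cut = factor(c("Fair", "Ideal")))
slope <- ifelse(d$cut == "Ideal", 1.8, 1.6)
d$price <- exp(8 + slope * log(d$carat) + 0.2 * (d$cut == "Ideal") +
               rnorm(nrow(d), sd = 0.05))

mod1 <- lm(log(price) ~ log(carat) + cut, data = d)  # parallel lines
mod2 <- lm(log(price) ~ log(carat) * cut, data = d)  # separate slopes

# Predict on a small grid; rows 1-3 are Fair, rows 4-6 are Ideal
grid <- expand.grid(carat = c(0.5, 1, 2), cut = levels(d$cut))
lp1 <- unname(predict(mod1, grid))
lp2 <- unname(predict(mod2, grid))

# Log-scale gap between Ideal and Fair at each carat value
gap1 <- lp1[grid$cut == "Ideal"] - lp1[grid$cut == "Fair"]
gap2 <- lp2[grid$cut == "Ideal"] - lp2[grid$cut == "Fair"]
print(rbind(additive = gap1, interaction = gap2))
```

In the additive model the gap is the same at every carat, so the lines are parallel on the log-log plot; with the interaction the gap changes with carat.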
  • 38. # Another approach is the effects package
        # install.packages("effects")
        library(effects)
        effect("cut", mod1)
        cut <- as.data.frame(effect("cut", mod1))
        qplot(fit, reorder(cut, fit), data = cut)
        qplot(fit, reorder(cut, fit), data = cut) +
          geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.1)
        qplot(exp(fit), reorder(cut, fit), data = cut) +
          geom_errorbarh(aes(xmin = exp(lower), xmax = exp(upper)), height = 0.1)