Boston Predictive Analytics: Linear and Logistic Regression Using R - Intermediate Topics

Presentation given for the Boston Predictive Analytics group by Daniel Gerlanc of Enplus Advisors Inc on July 25, 2012 at the CIC. Visit us at enplusadvisors.com

Intermediate Regression Topics including variable transformations and simulation for constructing confidence intervals.

    1. Intermediate Regression Topics
       Daniel Gerlanc, Director, Enplus Advisors Inc
    2. Topics
       • Abalone Data
       • Variable Transformation
       • Simulation for Predictive Inference
    3. Abalone
       http://archive.ics.uci.edu/ml/datasets/Abalone
    4. Loading the data
       > abalone.path = "~/data/abalone.csv"
       > abalone.cols = c("sex", "length", "diameter", "height", "whole.wt",
       +   "shucked.wt", "viscera.wt", "shell.wt", "rings")
       > abalone <- read.csv(abalone.path, sep=",", row.names=NULL,
       +   col.names=abalone.cols)
       > str(abalone)
       'data.frame': 4177 obs. of 9 variables:
        $ sex       : chr "M" "M" "F" "M" ...
        $ length    : num 0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
        $ diameter  : num 0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
        $ height    : num 0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
        $ whole.wt  : num 0.514 0.226 0.677 0.516 0.205 ...
        $ shucked.wt: num 0.2245 0.0995 0.2565 0.2155 0.0895 ...
        $ viscera.wt: num 0.101 0.0485 0.1415 0.114 0.0395 ...
        $ shell.wt  : num 0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
        $ rings     : int 15 7 9 10 7 8 20 16 9 19 ...
    5. Draw pictures
       Uses lattice graphics
    6. Lattice Plots
       > xyplot(jitter(rings) ~ shell.wt | sex, abalone, grid=T, pch=".",
           subset=volume < 0.2,
           panel=function(x, y, ...) {
             panel.lmline(x, y, ...)
             panel.xyplot(x, y, ...)
           }, ylab="rings")
       ggplot2 is a newer package that can be used to create similar plots.
    7. Combine groups
       (plot: Infant vs. Adult panels)
    8. Why Transform?
       • Interpretability
       • Additive vs. Multiplicative Form
       • Prediction
    9. Simple Model
       > fit.1 <- lm(rings ~ sex + shell.wt, abalone)
       > summary(fit.1)
       Call:
       lm(formula = rings ~ sex + shell.wt, data = abalone)
       Residuals:
          Min     1Q Median     3Q    Max
       -5.750 -1.592 -0.535  0.886 15.736
       Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
       (Intercept)   6.2423     0.0799   78.08   <2e-16 ***
       sex           0.9142     0.0984    9.29   <2e-16 ***
       shell.wt     12.8581     0.3300   38.96   <2e-16 ***
       ---
       Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
       Residual standard error: 2.5 on 4174 degrees of freedom
    10. Centering with z-scores
        Subtract the mean from each input and divide by 1 or 2 standard deviations.
        Dummy/proxy variables may be centered as well.
    11. Center Values
        > abalone.adj <- abalone[, c(outcome, predictors)]
        > for (i in predictors) {
        +   abalone.adj[[i]] <- (abalone.adj[[i]] - mean(abalone.adj[[i]])) /
        +     (2 * sd(abalone.adj[[i]]))
        + }
        Also look into the 'scale' function.
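The loop on this slide can also be written with the built-in `scale` function. A minimal sketch on a small synthetic vector (not the abalone data), showing that the result is centered and that one unit corresponds to 2 standard deviations:

```r
# Center a numeric vector and divide by 2 standard deviations, as on
# the slide above, but via scale(); synthetic data, not the abalone set.
x <- c(0.1, 0.2, 0.3, 0.4, 0.5)
x.adj <- as.numeric(scale(x, center = TRUE, scale = 2 * sd(x)))
mean(x.adj)  # ~0: centered at the mean
sd(x.adj)    # 0.5: a 2-SD change in x is a 1-unit change in x.adj
```

`scale` centers and rescales each column of its input; passing `2 * sd(x)` as the `scale` argument replaces the default division by 1 SD.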
    12. Why center?
        • Interpret coefficients in terms of standard deviations
        • Gives a sense of variable importance
    13. Interpretability
        > fit.1a <- lm(rings ~ sex + shell.wt, abalone.adj)
        > summary(fit.1a)
        Call:
        lm(formula = rings ~ sex + shell.wt, data = abalone.adj)
        Residuals:
           Min     1Q Median     3Q    Max
        -5.750 -1.592 -0.535  0.886 15.736
        Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
        (Intercept)   9.9337     0.0385  258.33   <2e-16 ***
        sex           0.8539     0.0919    9.29   <2e-16 ***
        shell.wt      3.5798     0.0919   38.96   <2e-16 ***
        ---
        Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
        Residual standard error: 2.5 on 4174 degrees of freedom
        Multiple R-squared: 0.406, Adjusted R-squared: 0.406
        F-statistic: 1.43e+03 on 2 and 4174 DF, p-value: <2e-16
    14. Two Models
        lm(formula = rings ~ sex + shell.wt, data = abalone)
                    coef.est coef.se
        (Intercept)     6.24    0.08
        sex             0.91    0.10
        shell.wt       12.86    0.33
        ---
        n = 4177, k = 3
        residual sd = 2.49, R-Squared = 0.41

        lm(formula = rings ~ sex + shell.wt, data = abalone.adj)
                    coef.est coef.se
        (Intercept)     9.93    0.04
        sex             0.85    0.09
        shell.wt        3.58    0.09
        ---
        n = 4177, k = 3
        residual sd = 2.49, R-Squared = 0.41

        Smaller difference in SD terms.
    15. Why divide by 2 SDs?
        So binary variables may be interpreted similarly to continuous variables.
        e.g., a binary variable taking values 0 and 1 with equal frequency has
        an sd of sqrt(0.5 * (1 - 0.5)) = 0.5.

        Divide by 2 SDs:               Divide by 1 SD:
        (1 - 0.5) / (2 * 0.5) = +0.5   (1 - 0.5) / 0.5 = +1
        (0 - 0.5) / (2 * 0.5) = -0.5   (0 - 0.5) / 0.5 = -1
        -0.5 --> +0.5: diff of 1       -1 --> +1: diff of 2
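The arithmetic on this slide can be checked directly in R:

```r
# A balanced 0/1 variable has sd sqrt(p * (1 - p)) = 0.5 when p = 0.5.
# Dividing by 2 SDs maps its two values to -0.5 and +0.5 (a 1-unit gap,
# like a 2-SD change in a standardized continuous predictor), while
# dividing by 1 SD maps them to -1 and +1 (a 2-unit gap).
p <- 0.5
s <- sqrt(p * (1 - p))                   # 0.5
two.sd <- c(0 - p, 1 - p) / (2 * s)      # -0.5, +0.5
one.sd <- c(0 - p, 1 - p) / s            # -1, +1
diff(two.sd)  # 1
diff(one.sd)  # 2
```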
    16. Prediction
    17. Simulation
        • Allows for more general inferences
        • Propagation of uncertainty
    18. Prediction Errors
        90th Percentile Adult vs. 50th Percentile Infant
        fit.4 <- lm(log(rings) ~ sex + log(shell.wt), abalone)
        large.abalone <- log(quantile(subset(abalone, sex == 1)$shell.wt, 0.90))
        small.infant <- log(median(abalone$shell.wt[abalone$sex == 0]))
        x.a <- sum(c(1, 1, large.abalone) * coef(fit.4))
        x.i <- sum(c(1, 0, small.infant) * coef(fit.4))
        set.seed(1)
        n.sims <- 1000
        pred.a <- exp(rnorm(n.sims, x.a, sigma.hat(fit.4)))  # sigma.hat() is from the arm package
        pred.i <- exp(rnorm(n.sims, x.i, sigma.hat(fit.4)))
        pred.diff <- pred.a - pred.i
        > mean(pred.diff)
        4.5
        > quantile(pred.diff, c(0.025, 0.975))
         2.5% 97.5%
         -1.9  11.3
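The slide's code depends on the abalone data and on `sigma.hat` from the arm package. A self-contained sketch of the same idea, using invented synthetic data and `summary(fit)$sigma` in place of `sigma.hat`: fit on the log scale, simulate normal draws around the predicted mean, and exponentiate to get a predictive distribution on the original scale.

```r
# Simulation-based predictive interval for a log-scale regression.
# Synthetic data stand in for the abalone set; x.new is an arbitrary
# illustrative point, not one from the talk.
set.seed(1)
n <- 500
x <- runif(n, 0.1, 1)
y <- exp(1 + 0.8 * log(x) + rnorm(n, sd = 0.2))   # log-linear relationship
fit <- lm(log(y) ~ log(x))
x.new <- 0.5
mu <- sum(c(1, log(x.new)) * coef(fit))           # predicted mean, log scale
n.sims <- 1000
pred <- exp(rnorm(n.sims, mu, summary(fit)$sigma))  # simulate, back-transform
quantile(pred, c(0.025, 0.975))                   # predictive interval, original scale
```

Exponentiating each simulated draw, rather than the interval endpoints of the mean, propagates the residual uncertainty through the back-transformation.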
    19. Simulation for Inferential Uncertainty
        (diagram: simulate the residual standard deviation, then simulate the coefficients)
    20. Inferential Uncertainty
        ## Create 1000 simulations of the residual standard error and coefficients
        fit.5 <- lm(log(rings) ~ sex + shell.wt + sex:shell.wt, abalone)
        n.sims <- 1000
        obj <- summary(fit.5)            # save off the summary object
        sigma.hat <- obj$sigma
        beta.hat <- obj$coef[, "Estimate", drop=TRUE]
        cov.beta <- obj$cov.unscaled     # extract the unscaled covariance matrix
        k <- obj$df[1]                   # number of predictors
        n <- obj$df[1] + obj$df[2]       # number of observations
        set.seed(1)
        sigma.sim <- sigma.hat * sqrt((n - k) / rchisq(n.sims, n - k))
        beta.sim <- matrix(NA_real_, n.sims, k,
                           dimnames=list(NULL, names(beta.hat)))
        for (i in seq_len(n.sims)) {
          beta.sim[i, ] <- MASS::mvrnorm(1, beta.hat, sigma.sim[i]^2 * cov.beta)
        }
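A self-contained version of the same scheme, ending with what the simulated draws are typically used for: simulation-based intervals for the coefficients. The data and true coefficient values here are invented for illustration; the abalone model is replaced by a simple one-predictor fit.

```r
# Simulate the residual sd from its scaled inverse-chi-square sampling
# distribution, then the coefficients from a multivariate normal given
# each simulated sd. Synthetic data, not the abalone set.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
fit <- lm(y ~ x)
s <- summary(fit)
k <- s$df[1]                     # number of coefficients
n.obs <- s$df[1] + s$df[2]       # number of observations
n.sims <- 1000
sigma.sim <- s$sigma * sqrt((n.obs - k) / rchisq(n.sims, n.obs - k))
beta.sim <- matrix(NA_real_, n.sims, k,
                   dimnames = list(NULL, names(coef(fit))))
for (i in seq_len(n.sims)) {
  beta.sim[i, ] <- MASS::mvrnorm(1, coef(fit), sigma.sim[i]^2 * s$cov.unscaled)
}
apply(beta.sim, 2, quantile, c(0.025, 0.975))  # simulation-based 95% intervals
```

For a plain linear model these intervals closely match `confint(fit)`; the advantage of the simulation approach is that the same draws propagate through any derived quantity, as in the prediction-difference example earlier.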
    21. Inferential Uncertainty
        (plot)
