Boston Predictive Analytics: Linear and Logistic Regression Using R - Intermediate Topics

Presentation given for the Boston Predictive Analytics group by Daniel Gerlanc of Enplus Advisors Inc on July 25, 2012 at the CIC. Visit us at enplusadvisors.com

Intermediate Regression Topics including variable transformations and simulation for constructing confidence intervals.

Transcript

  • 1. Intermediate Regression Topics. Daniel Gerlanc, Director, Enplus Advisors Inc
  • 2. Topics: Abalone Data; Variable Transformation; Simulation for Predictive Inference
  • 3. Abalone. Data: http://archive.ics.uci.edu/ml/datasets/Abalone
  • 4. Loading the data

        > abalone.path = "~/data/abalone.csv"
        > abalone.cols = c("sex", "length", "diameter", "height", "whole.wt",
        +                  "shucked.wt", "viscera.wt", "shell.wt", "rings")
        > abalone <- read.csv(abalone.path, sep=",", row.names=NULL,
        +                     col.names=abalone.cols)
        > str(abalone)
        'data.frame': 4177 obs. of 9 variables:
         $ sex       : chr "M" "M" "F" "M" ...
         $ length    : num 0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
         $ diameter  : num 0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
         $ height    : num 0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
         $ whole.wt  : num 0.514 0.226 0.677 0.516 0.205 ...
         $ shucked.wt: num 0.2245 0.0995 0.2565 0.2155 0.0895 ...
         $ viscera.wt: num 0.101 0.0485 0.1415 0.114 0.0395 ...
         $ shell.wt  : num 0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
         $ rings     : int 15 7 9 10 7 8 20 16 9 19 ...
  • 5. Draw pictures: uses lattice graphics
  • 6. Lattice Plots

        > xyplot(jitter(rings) ~ shell.wt | sex, abalone, grid=TRUE, pch=".",
        +        subset=volume < 0.2,
        +        panel=function(x, y, ...) {
        +          panel.lmline(x, y, ...)
        +          panel.xyplot(x, y, ...)
        +        },
        +        ylab="rings")

        ggplot2 is a newer package that can be used to create similar plots.
  • 7. Combine groups: Infant vs. Adult
  • 8. Why Transform? Interpretability; Additive vs. Multiplicative Form; Prediction
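A one-line illustration of the additive-vs.-multiplicative point (a hypothetical toy example, not from the slides): taking logs turns a multiplicative relationship y = a * x^b into the additive form log(y) = log(a) + b * log(x), which lm can fit directly.

```r
# Toy data with an exactly multiplicative relationship: y = 2 * x^1.5.
# (Hypothetical example; not part of the original slides.)
x <- 1:100
y <- 2 * x^1.5

# On the log scale the model is additive: log(y) = log(2) + 1.5 * log(x).
fit <- lm(log(y) ~ log(x))
coef(fit)  # intercept = log(2) ~ 0.693, slope = 1.5
```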
  • 9. Simple Model

        > fit.1 <- lm(rings ~ sex + shell.wt, abalone)
        > summary(fit.1)

        Call:
        lm(formula = rings ~ sex + shell.wt, data = abalone)

        Residuals:
           Min     1Q Median     3Q    Max
        -5.750 -1.592 -0.535  0.886 15.736

        Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
        (Intercept)   6.2423     0.0799   78.08   <2e-16 ***
        sex           0.9142     0.0984    9.29   <2e-16 ***
        shell.wt     12.8581     0.3300   38.96   <2e-16 ***
        ---
        Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        Residual standard error: 2.5 on 4174 degrees of freedom
  • 10. Centering with z-scores. Subtract the mean from each input and divide by 1 or 2 standard deviations. Dummy/proxy variables may be centered as well.
  • 11. Center Values

        > abalone.adj <- abalone[, c(outcome, predictors)]
        > for (i in predictors) {
        +   abalone.adj[[i]] <- (abalone.adj[[i]] - mean(abalone.adj[[i]])) /
        +     (2 * sd(abalone.adj[[i]]))
        + }

        Also look into the 'scale' function.
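As the slide notes, base R's `scale` function does the same centering; by default it divides by one standard deviation, so halving its result matches the 2-SD loop above. A minimal sketch on a toy vector (the example values are hypothetical):

```r
# scale() centers and divides by 1 SD by default; halve the result
# to get the 2-SD standardization used in the slide's loop.
x <- c(0.15, 0.07, 0.21, 0.155, 0.055)     # e.g., a few shell weights
z <- as.vector(scale(x) / 2)
all.equal(z, (x - mean(x)) / (2 * sd(x)))  # TRUE
```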
  • 12. Why center? Interpret coefficients in terms of standard deviations; gives a sense of variable importance.
  • 13. Interpretability

        > fit.1a <- lm(rings ~ sex + shell.wt, abalone.adj)
        > summary(fit.1a)

        Call:
        lm(formula = rings ~ sex + shell.wt, data = abalone.adj)

        Residuals:
           Min     1Q Median     3Q    Max
        -5.750 -1.592 -0.535  0.886 15.736

        Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
        (Intercept)   9.9337     0.0385  258.33   <2e-16 ***
        sex           0.8539     0.0919    9.29   <2e-16 ***
        shell.wt      3.5798     0.0919   38.96   <2e-16 ***
        ---
        Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        Residual standard error: 2.5 on 4174 degrees of freedom
        Multiple R-squared: 0.406, Adjusted R-squared: 0.406
        F-statistic: 1.43e+03 on 2 and 4174 DF, p-value: <2e-16
  • 14. Two Models

        lm(formula = rings ~ sex + shell.wt, data = abalone)
                    coef.est coef.se
        (Intercept)     6.24    0.08
        sex             0.91    0.10
        shell.wt       12.86    0.33
        ---
        n = 4177, k = 3
        residual sd = 2.49, R-Squared = 0.41

        lm(formula = rings ~ sex + shell.wt, data = abalone.adj)
                    coef.est coef.se
        (Intercept)     9.93    0.04
        sex             0.85    0.09
        shell.wt        3.58    0.09
        ---
        n = 4177, k = 3
        residual sd = 2.49, R-Squared = 0.41

        Smaller difference in SD terms.
  • 15. Why divide by 2 SDs? So binary variables may be interpreted similarly to continuous variables.

        e.g., a binary variable taking the values 0 and 1 with equal
        frequency has an sd of 0.5:

        sqrt(0.5 * (1 - 0.5)) = 0.5

        Divide by 2 SDs:                Divide by 1 SD:
        (1 - 0.5) / (2 * 0.5) = +0.5    (1 - 0.5) / 0.5 = +1
        (0 - 0.5) / (2 * 0.5) = -0.5    (0 - 0.5) / 0.5 = -1
        -0.5 --> +0.5                   -1 --> +1
        Diff of 1                       Diff of 2
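The arithmetic above can be checked directly in R (a toy check, assuming a perfectly balanced 0/1 variable):

```r
# A balanced 0/1 variable has sd ~ 0.5, so dividing by 2 SDs maps
# 0 -> ~-0.5 and 1 -> ~+0.5: a difference of 1, comparable to a
# 1-SD change in a 2-SD-standardized continuous predictor.
b <- rep(c(0, 1), 500)               # 0 and 1 with equal frequency
sd(b)                                # ~0.5 (sample sd)
(c(0, 1) - mean(b)) / (2 * sd(b))    # ~ -0.5  +0.5
```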
  • 16. Prediction
  • 17. Simulation: allows for more general inferences; propagation of uncertainty.
  • 18. Prediction Errors: 90th Percentile Adult vs. 50th Percentile Infant

        fit.4 <- lm(log(rings) ~ sex + log(shell.wt), abalone)
        large.abalone <- log(quantile(subset(abalone, sex == 1)$shell.wt, 0.90))
        small.infant <- log(median(abalone$shell.wt[abalone$sex == 0]))
        x.a <- sum(c(1, 1, large.abalone) * coef(fit.4))
        x.i <- sum(c(1, 0, small.infant) * coef(fit.4))
        set.seed(1)
        n.sims <- 1000
        # sigma.hat() comes from the 'arm' package
        pred.a <- exp(rnorm(n.sims, x.a, sigma.hat(fit.4)))
        pred.i <- exp(rnorm(n.sims, x.i, sigma.hat(fit.4)))
        pred.diff <- pred.a - pred.i

        > mean(pred.diff)
        4.5
        > quantile(pred.diff, c(0.025, 0.975))
         2.5% 97.5%
         -1.9  11.3
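The same recipe sketched on self-contained synthetic data, so it runs without the abalone file or the arm package; base R's `summary(fit)$sigma` stands in for arm's `sigma.hat()`. This illustrates the technique, not the slide's exact numbers:

```r
# Simulate from the predictive distribution at a new x, using the
# fitted mean and the residual standard error (ignoring coefficient
# uncertainty, as the slide's version does).
set.seed(1)
x <- runif(200)
y <- 2 + 3 * x + rnorm(200, sd = 0.5)
fit <- lm(y ~ x)

x.new  <- 0.8
mu.hat <- sum(c(1, x.new) * coef(fit))    # point prediction
pred   <- rnorm(1000, mu.hat, summary(fit)$sigma)
quantile(pred, c(0.025, 0.975))           # a 95% predictive interval
```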
  • 19. Simulation for Inferential Uncertainty: simulate the residual standard deviation, then simulate the coefficients given it.
  • 20. Inferential Uncertainty

        ## Create 1000 simulations of the residual standard error and coefficients
        fit.5 <- lm(log(rings) ~ sex + shell.wt + sex:shell.wt, abalone)
        n.sims <- 1000
        obj <- summary(fit.5)          # save off the summary object
        sigma.hat <- obj$sigma
        beta.hat <- obj$coef[, "Estimate", drop=TRUE]
        cov.beta <- obj$cov.unscaled   # extract the unscaled covariance matrix
        k <- obj$df[1]                 # number of predictors
        n <- obj$df[1] + obj$df[2]     # number of observations
        set.seed(1)
        sigma.sim <- sigma.hat * sqrt((n - k) / rchisq(n.sims, n - k))
        beta.sim <- matrix(NA_real_, n.sims, k,
                           dimnames=list(NULL, names(beta.hat)))
        for (i in seq_len(n.sims)) {
          beta.sim[i, ] <- MASS::mvrnorm(1, beta.hat, sigma.sim[i]^2 * cov.beta)
        }
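A self-contained version of the same simulation on synthetic data, checking that the simulation means recover the fitted coefficients (a sketch; MASS ships with R, so no extra packages are needed):

```r
# Draw sigma from its scaled inverse-chi-squared distribution, then
# draw coefficients from a multivariate normal around the estimates.
set.seed(2)
x <- runif(500)
y <- 1 + 2 * x + rnorm(500)
fit <- lm(y ~ x)
s   <- summary(fit)

n.sims <- 2000
k <- s$df[1]               # number of coefficients
n <- s$df[1] + s$df[2]     # number of observations
sigma.sim <- s$sigma * sqrt((n - k) / rchisq(n.sims, n - k))
beta.sim  <- t(sapply(sigma.sim, function(sg)
  MASS::mvrnorm(1, coef(fit), sg^2 * s$cov.unscaled)))
colMeans(beta.sim)         # close to coef(fit)
```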
  • 21. Inferential Uncertainty