R Regression Models with Zelig

This hands-on R course will demonstrate a variety of statistical procedures using the open-source statistical software program, R. Emphasis is on regression modeling using the Zelig package.

Workshop materials including example data sets and R scripts are available from http://projects.iq.harvard.edu/rtc/r-stats

Published in: Technology, Economy & Finance

  1. Regression Models in R
      Harvard MIT Data Center, May 3, 2013
      The Institute for Quantitative Social Science at Harvard University
  2. Outline
      1. Introduction
      2. Linear regression
      3. Interactions and factors
      4. Regression with binary outcomes
      5. Multiple imputation
      6. Multilevel Modeling
      7. Wrap-up
  3. Topic: Introduction
  4. Introduction: Workshop description
      This is an intermediate/advanced R course, appropriate for those with basic knowledge of R. This is not a statistics course!
      Learning objectives:
      - Learn the R formula interface
      - Specify factor contrasts to test specific hypotheses
      - Perform model comparisons
      - Run and interpret a variety of regression models in R
      - Create and use imputed data sets in regression models
  5. Introduction: Materials and setup
      Lab computer users:
      - USERNAME: dataclass
      - PASSWORD: dataclass
      - Find class materials at Scratch > DataClass > RStatistics
      - Copy this folder to your desktop!
      Laptop users:
      - Download materials from http://projects.iq.harvard.edu/rtc/r-stats
      - Scroll to the bottom of the page and download the r-programming.zip file
      - Move it to your desktop and extract
  6. Introduction: Launch RStudio
      - Open the RStudio program from the Windows start menu
      - Open up today's R script:
        - In RStudio, go to File => Open Script
        - Locate and open the Rstatistics.R script in the Rstatistics folder on your desktop
      - Go to Tools => Set working directory => To source file location (more on the working directory later)
      - I encourage you to add your own notes to this file!
  7. Introduction: Set working directory
      It is often helpful to start your R session by setting your working directory, so you don't have to type the full path names to your data and other files.
          > # set the working directory
          > # setwd("~/Desktop/Rstatistics")
      You might also start by listing the files in your working directory.
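
The slide mentions listing files but does not show the call. A minimal sketch using base R only; the commented-out path is the one from the slide and will likely differ on your machine, and the variable names `files` and `r_scripts` are illustrative.

```r
# Show where R is currently looking for files
getwd()

# Point R at the workshop folder (path taken from the slide; adjust as needed)
# setwd("~/Desktop/Rstatistics")

# List everything in the working directory
files <- list.files()
print(files)

# List only R scripts, matching the file extension with a regular expression
r_scripts <- list.files(pattern = "\\.R$")
print(r_scripts)
```

If the data files you expect are not in the listing, the working directory is probably not the folder you think it is.
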
  8. Introduction: Load data
      Many of the following examples use data from the 2011 National Health Interview Survey. From the CDC website:
          The National Health Interview Survey (NHIS) has monitored the health of the nation since 1957. NHIS data on a broad range of health topics are collected through personal household interviews. For over 50 years, the U.S. Census Bureau has been the data collection agent for the National Health Interview Survey. Survey results have been instrumental in providing data to track health status, health care access, and progress toward achieving national health objectives.
      Key variables include:
      See attributes(NH11)$labels for the full variable list.
  9. Topic: Linear regression
  10. Linear regression: Linear regression example
      Linear regression models can be fit with the lm() function. For example, we can use lm to predict bmi based on:
      - number of cigarettes smoked/day (cigsday)
      - duration of moderate exercise (modmin)
      - hours of sleep (sleep)
          > # Fit our regression model
          > weight.out <- lm(bmi~cigsday+modmin+sleep, # regression formula
          +                  data=NH11) # data set
          > # Print the results
          > coef(summary(weight.out)) # show regression coefficients table
                       Estimate Std. Error t value         Pr(>|t|)
          (Intercept) 26.281379    0.38557  68.163 0.00000000000000
          cigsday      0.038384    0.01909   2.011 0.04445427219108
          modmin      -0.000775    0.00175  -0.443 0.65785096091484
          sleep        0.244050    0.03416   7.144 0.00000000000113
  11. Linear regression: The lm class and methods
      OK, we fit our model. Now what? Examine the model object:
          > class(weight.out)
          [1] "lm"
          > names(weight.out)
           [1] "coefficients"  "residuals"     "effects"       "rank"
           [5] "fitted.values" "assign"        "qr"            "df.residual"
           [9] "na.action"     "xlevels"       "call"          "terms"
          [13] "model"
          > methods(class = class(weight.out))[1:9]
          [1] "add1.lm"           "alias.lm"          "anova.lm"
          [4] "case.names.lm"     "confint.lm"        "cooks.distance.lm"
          [7] "deviance.lm"       "dfbeta.lm"         "dfbetas.lm"
      Use function methods to get more information about the fit:
          > confint(weight.out)
                          2.5 %   97.5 %
          (Intercept) 25.525383 27.03738
          cigsday      0.000952  0.07582
          modmin      -0.004207  0.00266
          sleep        0.177065  0.31103
          > # summary(weight.out)
  12. Linear regression: Comparing models
      Does our model predict bmi over and above demographics? Fit two models and compare them:
          > # Omit missing values (models can only be compared if fit to the same data)
          > NH.nomiss <- na.omit(NH11[c("bmi", "sex", "age_p",
          +                            "cigsday", "modmin", "sleep")])
          > # demographics only model
          > demog.only <- lm(bmi~sex+age_p,
          +                  data=NH.nomiss)
          > # demographics plus smoking, exercise, and sleep
          > demog.plus <- update(demog.only, . ~ . +cigsday+modmin+sleep)
          > # compare using the anova() function
          > anova(demog.only, demog.plus)
          Analysis of Variance Table
          Model 1: bmi ~ sex + age_p
          Model 2: bmi ~ sex + age_p + cigsday + modmin + sleep
            Res.Df    RSS Df Sum of Sq    F          Pr(>F)
          1   3045 347864
          2   3042 341333  3      6531 19.4 0.0000000000019
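
To make the anova() table less of a black box, the F statistic for the nested comparison can be recomputed by hand from the residual sums of squares and degrees of freedom printed on the slide. This is a sketch in base R; the variable names are illustrative and the numbers are copied from the table, not recomputed from the NH11 data.

```r
# Values copied from the anova() table on the slide
rss_small <- 347864  # residual sum of squares, demographics-only model
rss_big   <- 341333  # residual sum of squares, demographics-plus model
df_small  <- 3045    # residual degrees of freedom, demographics-only model
df_big    <- 3042    # residual degrees of freedom, demographics-plus model

# Number of extra parameters in the larger model
df_diff <- df_small - df_big

# F = (drop in RSS per added parameter) / (residual variance of the larger model)
f_stat <- ((rss_small - rss_big) / df_diff) / (rss_big / df_big)

# Upper-tail probability under the F distribution
p_value <- pf(f_stat, df1 = df_diff, df2 = df_big, lower.tail = FALSE)

round(f_stat, 1)  # 19.4, matching the table
p_value
```

The comparison is only meaningful because both models were fit to the same rows, which is why the slide drops incomplete cases first.
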
  13. Linear regression: Exercise 0: least squares regression
      Use the NH11 data set.
      1. Use lm to fit a regression model predicting days missed work in past year (wkdayr) from age.
      2. Test the hypothesis that vigorous exercise (vigmin) and moderate exercise (modmin) together predict days missed work over and above age.
  14. Topic: Interactions and factors
  15. Interactions and factors: Modeling interactions
      Does the effect of smoking depend on exercise?
          > # Add the interaction to the model
          > weight.int.out <- lm(bmi~cigsday*modmin+sleep,
          +                      data=NH11)
          > # Show the results
          > coef(summary(weight.int.out)) # show regression coefficients table
                          Estimate Std. Error t value          Pr(>|t|)
          (Intercept)    25.895912   0.407954   63.48 0.000000000000000
          cigsday         0.066651   0.021471    3.10 0.001925465603062
          modmin          0.005822   0.002892    2.01 0.044163951769538
          sleep           0.246864   0.034137    7.23 0.000000000000601
          cigsday:modmin -0.000502   0.000175   -2.86 0.004213812367825
  16. Interactions and factors: Regression with categorical predictors
      Let's try to predict bmi from region, a categorical variable in which 1 = Northeast, 2 = Midwest, 3 = South and 4 = West:
          > str(NH11$region)
           num [1:33014] 3 3 1 3 3 1 3 3 3 3 ...
          > unique(NH11$region)
          [1] 3 1 2 4
          > weight.int.out <- lm(bmi~region,
          +                      data=NH11)
          > coef(summary(weight.int.out))
                      Estimate Std. Error t value Pr(>|t|)
          (Intercept)    30.55     0.2152  141.98  0.00000
          region         -0.24     0.0742   -3.23  0.00122
      Not what we want: R doesn't know that region is categorical!
  17. Interactions and factors: Telling R which variables are categorical
      Let's try again to predict bmi from region:
          > # make a factor version of region, with labels
          > NH11 <- within(NH11, {
          +   regionF <- factor(region,
          +                     levels=1:4,
          +                     labels=c("Northeast", "Midwest",
          +                              "South", "West"))})
          >
          > # predict bmi from the factor version of region
          > weight.fac.out <- lm(bmi~regionF,
          +                      data=NH11)
          > anova(weight.fac.out) # multi-df test of region
          Analysis of Variance Table
          Response: bmi
                       Df  Sum Sq Mean Sq F value Pr(>F)
          region        1    1980    1980    10.5 0.0012
          Residuals 33012 6245630     189
          > coef(summary(weight.fac.out)) # individual comparisons
                      Estimate Std. Error t value Pr(>|t|)
          (Intercept)    30.55     0.2152  141.98  0.00000
          region         -0.24     0.0742   -3.23  0.00122
      Note: the slide's formula read bmi~region, so the output reproduced above still comes from the numeric variable (a single degree of freedom). With regionF, the region term has 3 degrees of freedom and one coefficient per non-reference level.
      Take-home: make sure to tell R which variables are categorical by converting them to factors.
  18. Interactions and factors: Setting factor contrasts
      In the previous example we used the default contrasts for region. The default in R is treatment contrasts, AKA dummy codes. Sometimes this default is not what we want, so we can get and set contrasts using the contrasts() function.
          > # print default contrasts
          > contrasts(NH11$regionF)
                    Midwest South West
          Northeast       0     0    0
          Midwest         1     0    0
          South           0     1    0
          West            0     0    1
          > # change to sum-to-zero contrasts
          > contrasts(NH11$regionF) <- contr.sum(n = 4)
          > contrasts(NH11$regionF)
                    [,1] [,2] [,3]
          Northeast    1    0    0
          Midwest      0    1    0
          South        0    0    1
          West        -1   -1   -1
  19. Interactions and factors: Regression models with specific contrasts
      Regression models reflect the contrasts setting:
          > weight.fac2.out <- lm(bmi~regionF,
          +                       data=NH11)
          >
          > coef(summary(weight.fac2.out))
                      Estimate Std. Error t value Pr(>|t|)
          (Intercept)  29.8910     0.0789 378.860   0.0000
          regionF1      0.2383     0.1551   1.536   0.1245
          regionF2      0.0767     0.1381   0.555   0.5787
          regionF3      0.2971     0.1191   2.493   0.0127
      Contrasts can also be set as arguments to lm():
          > coef(summary(lm(bmi~regionF,
          +                 data=NH11,
          +                 contrasts=
          +                   list(
          +                     regionF=contr.treatment(
          +                       n=4, base=4)))))
                      Estimate Std. Error t value   Pr(>|t|)
          (Intercept)   29.279      0.149  196.19 0.00000000
          regionF1       0.850      0.241    3.53 0.00041194
          regionF2       0.689      0.219    3.14 0.00166450
          regionF3       0.909      0.195    4.65 0.00000332
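
What changes between contrast codings is the meaning of the coefficients, not the fit. The sketch below illustrates this with the built-in iris data rather than NH11 (an assumption made so the example runs without the workshop files): under treatment contrasts the intercept is the mean of the reference level, and under sum-to-zero contrasts it is the unweighted mean of the group means.

```r
# iris ships with R; Species is a three-level factor
fit_treat <- lm(Sepal.Length ~ Species, data = iris,
                contrasts = list(Species = "contr.treatment"))

fit_sum <- lm(Sepal.Length ~ Species, data = iris,
              contrasts = list(Species = "contr.sum"))

# Observed mean of the outcome within each species
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)

# Treatment contrasts: intercept equals the mean of the reference level (setosa)
coef(fit_treat)[["(Intercept)"]]
group_means[["setosa"]]

# Sum-to-zero contrasts: intercept equals the unweighted mean of the group means
coef(fit_sum)[["(Intercept)"]]
mean(group_means)

# The two codings give identical fitted values
all.equal(fitted(fit_treat), fitted(fit_sum))
```

This is why the slide's intercept moves from the reference region's mean to a grand-mean-like value once contr.sum is applied.
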
  20. Interactions and factors: Exercise 1: interactions and factors
      Use the NH11 data set.
      1. Use lm to fit a regression model predicting days missed work in past year (wkdayr) from age and race (mracrpi2).
      2. Change the contrasts for race (mracrpi2) to sum-to-zero contrasts and re-fit the model from step 1.
      3. Evaluate the hypothesis that the relation between days missed work and age differs as a function of race.
  21. Topic: Regression with binary outcomes
  22. Regression with binary outcomes: Logistic regression
      Thus far we have used the lm function to fit our regression models. lm is great, but limited: in particular, it only fits models for continuous dependent variables. For categorical dependent variables we can use the glm() function.
      Let's predict the probability of being diagnosed with hypertension based on age, sex, sleep, and bmi:
          > str(NH11$hypev) # check structure of hypev
           Factor w/ 2 levels "2 No","1 Yes": 1 1 2 1 1 2 1 1 2 1 ...
          > levels(NH11$hypev) # check levels of hypev
          [1] "2 No"  "1 Yes"
          > # collapse all missing values to NA
          > NH11$hypev <- factor(NH11$hypev, levels=c("2 No", "1 Yes"))
          > # run our regression model
          > hyp.out <- glm(hypev~age_p+sex+sleep+bmi,
          +                data=NH11, family="binomial")
          > coef(summary(hyp.out))
                      Estimate Std. Error z value Pr(>|z|)
          (Intercept) -4.26947   0.056495  -75.57 0.00e+00
          age_p        0.06070   0.000823   73.78 0.00e+00
          sex2 Female -0.14403   0.026798   -5.37 7.68e-08
          sleep       -0.00704   0.001640   -4.29 1.78e-05
          bmi          0.01857   0.000951   19.53 6.49e-85
  23. Regression with binary outcomes: Logistic regression coefficients
      Generalized linear models use link functions, so raw coefficients are difficult to interpret. For example, the age coefficient of .06 in the previous model tells us that for every one unit increase in age, the log odds of hypertension diagnosis increases by 0.06. Since most of us are not used to thinking in log odds, this is not too helpful!
      One solution is to transform the coefficients to make them easier to interpret:
          > hyp.out.tab <- coef(summary(hyp.out))
          > hyp.out.tab[, "Estimate"] <- exp(coef(hyp.out))
          > hyp.out.tab
                      Estimate Std. Error z value Pr(>|z|)
          (Intercept)    0.014   0.056495  -75.57 0.00e+00
          age_p          1.063   0.000823   73.78 0.00e+00
          sex2 Female    0.866   0.026798   -5.37 7.68e-08
          sleep          0.993   0.001640   -4.29 1.78e-05
          bmi            1.019   0.000951   19.53 6.49e-85
      Only the Estimate column is exponentiated; the standard errors, z values and p-values remain on the log-odds scale.
      Now we can say that for a one unit increase in age, the odds of being diagnosed with hypertension increase by a factor of 1.06. For more information on interpreting odds ratios see our.
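
The exponentiation step can be checked by hand, and the same arithmetic extends to larger age differences and to an approximate confidence interval. This is a sketch in base R using the age coefficient and standard error printed on the slide; the variable names and the choice of a Wald interval are illustrative assumptions, not part of the workshop script.

```r
b_age  <- 0.06070    # log-odds coefficient for age_p, copied from the slide
se_age <- 0.000823   # its standard error, copied from the slide

or_1yr  <- exp(b_age)        # odds ratio for a 1-year increase in age
or_10yr <- exp(10 * b_age)   # odds ratio for a 10-year increase in age

# Approximate 95% Wald interval: build it on the log-odds scale, then exponentiate
ci_1yr <- exp(b_age + c(-1, 1) * qnorm(0.975) * se_age)

round(or_1yr, 3)   # 1.063, matching the transformed table
or_10yr
ci_1yr
```

Odds ratios multiply rather than add, so ten years of age corresponds to the one-year odds ratio raised to the tenth power, not to ten times the one-year value.
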
  24. Regression with binary outcomes: Computing quantities of interest with predict()
      In addition to transforming the log-odds produced by glm to odds, we can use the predict() function to make direct statements about the predictors in our model. For example, we can ask "How much more likely is a 63 year old female to have hypertension compared to a 33 year old female?".
          > # Create a dataset with predictors set at desired levels
          > predDat <- with(NH11,
          +                 expand.grid(age_p = c(33, 63),
          +                             sex = "2 Female",
          +                             bmi = mean(bmi, na.rm = TRUE),
          +                             sleep = mean(sleep, na.rm = TRUE)))
          > # predict hypertension at those levels
          > # (predict.glm has no interval argument, so it is omitted here)
          > cbind(predDat, predict(hyp.out, type = "response",
          +                        se.fit = TRUE,
          +                        newdata = predDat))
            age_p      sex  bmi sleep   fit  se.fit residual.scale
          1    33 2 Female 29.9  7.86 0.129 0.00285              1
          2    63 2 Female 29.9  7.86 0.478 0.00482              1
      This tells us that a 33 year old female has a 13% probability of having been diagnosed with hypertension, while a 63 year old female has a 48% probability of having been diagnosed.
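
The predicted probabilities are just the inverse logit of the linear predictor, so they can be reproduced from the coefficient table without the fitted model. The sketch below copies the rounded coefficients and the rounded mean bmi and sleep values from the slides; `predict_prob` is a hypothetical helper name, and small rounding differences from predict() are to be expected.

```r
# Coefficients copied from the logistic regression table on the slide
b <- c(intercept = -4.26947, age_p = 0.06070, female = -0.14403,
       sleep = -0.00704, bmi = 0.01857)

# Hypothetical helper: probability of a hypertension diagnosis at given values.
# Defaults are the rounded sample means shown in the predict() output.
predict_prob <- function(age, female = 1, sleep = 7.86, bmi = 29.9) {
  eta <- b[["intercept"]] + b[["age_p"]] * age + b[["female"]] * female +
    b[["sleep"]] * sleep + b[["bmi"]] * bmi
  plogis(eta)  # inverse logit: 1 / (1 + exp(-eta))
}

p33 <- predict_prob(33)
p63 <- predict_prob(63)

round(c(p33, p63), 3)  # 0.129 0.478, matching the predict() output
p63 / p33              # ratio of the two probabilities
```

The ratio answers the "how much more likely" question posed on the slide: the older woman's predicted probability is between three and four times the younger woman's.
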
  25. Regression with binary outcomes: Packages for computing quantities of interest
      Instead of doing all this ourselves, we can use the Zelig package to compute quantities of interest for us (cf. the effects package). Use zelig instead of glm, and setx / sim instead of expand.grid / predict.
          > library(Zelig)
          > hyp.out.z <- zelig(hypev~age_p+sex+sleep+bmi,
          +                    data=NH11,
          +                    model="logit", cite=FALSE)
          > x.low <- setx(hyp.out.z, age_p=33) # set age to 33 years
          > x.high <- setx(hyp.out.z, age_p=63) # set age to 63 years
          > s.out <- sim(hyp.out.z, x=x.low, x1=x.high) # get predicted values
  26. Regression with binary outcomes: Computing quantities of interest with Zelig
          > summary(s.out) # show the results
          Model: logit
          Number of simulations: 1000
          Values of X
            (Intercept) age_p sex2 Female sleep  bmi
          1           1    33           1  7.86 29.9
          attr(,"assign")
          [1] 0 1 2 3 4
          attr(,"contrasts")
          attr(,"contrasts")$sex
          [1] "contr.treatment"
          Values of X1
            (Intercept) age_p sex2 Female sleep  bmi
          1           1    63           1  7.86 29.9
          attr(,"assign")
          [1] 0 1 2 3 4
          attr(,"contrasts")
          attr(,"contrasts")$sex
          [1] "contr.treatment"
  27. Regression with binary outcomes: Graphing quantities of interest
          plot(s.out) # show the results graphically
  28. Regression with binary outcomes: Exercise 2: logistic regression
      Use the NH11 data set.
      1. Use glm or zelig to conduct a logistic regression to predict ever worked (everwrk) using age (age_p) and marital status (r_maritl). You may want to re-code r_maritl first.
      2. Predict the probability of working for each level of (the possibly re-coded) marital status variable.
  29. Topic: Multiple imputation
  30. Multiple imputation: Multiple imputation
      - The majority of datasets contain missing data
      - Missing data produce a variety of problems and limitations for data analysis
      - Multiple imputation (MI) generates multiple complete datasets that contain estimates of the missing data points
  31. Multiple imputation: Multiple imputation
      Earlier we wanted to compare a model predicting bmi from demographic variables to a model including demographics and substantive predictors. We omitted missing data so that we could fit both models to the same data. That is a common practice, but it has many problems (which we unfortunately don't have time to discuss in detail). A popular solution is to use multiple imputation to fill in the missing values with reasonable placeholders.
      MI is typically thought of as involving three steps:
      - Selection of imputation model
      - Generation of imputed datasets
      - Combining results across imputed datasets
      There are a number of packages for doing this in R: we will use the Amelia package because it is powerful, fast, and easy to use. You can refer to the Amelia documentation for more information about its imputation procedures: http://r.iq.harvard.edu/docs/amelia/amelia.pdf
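
The third step, combining results across imputed datasets, is commonly done with Rubin's rules: average the point estimates, then add the between-imputation variance to the average within-imputation variance. The sketch below shows that arithmetic in base R for a single coefficient. The estimates and standard errors are hypothetical numbers invented for illustration, not results from the workshop data, and the slides do not spell out the exact pooling computation Zelig performs.

```r
# Hypothetical estimates and standard errors for ONE coefficient,
# one value per imputed data set (illustrative numbers only)
est <- c(2.10, 2.35, 2.20, 2.41, 1.98)
se  <- c(0.93, 0.95, 0.91, 0.96, 0.94)
m   <- length(est)  # number of imputations

pooled_est  <- mean(est)   # pooled point estimate
within_var  <- mean(se^2)  # average within-imputation variance
between_var <- var(est)    # variance of the estimates across imputations

# Rubin's rules: total variance adds a (1 + 1/m) inflated between component
total_var <- within_var + (1 + 1 / m) * between_var
pooled_se <- sqrt(total_var)

c(estimate = pooled_est, std.error = pooled_se)
```

The pooled standard error is always at least as large as the typical single-dataset standard error, which is how MI carries uncertainty about the missing values into the final inference.
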
  32. Multiple imputation: Creating imputed data sets
      We're going to create several datasets to look at a model predicting the number of days of work missed/year (wkdayr).
          > # load the Amelia package
          > library(Amelia)
          > # help(package="Amelia")
          > # load a smaller version of NH
          > NH08.mi <- readRDS("dataSets/NatHealth2008MI")
          > # generate five imputed data sets
          > amelia.log <- capture.output( # suppress amelia's chattiness
          +   NatHealth.MI <- amelia(NH08.mi,
          +                          m=5,
          +                          idvars=c("id")))
  33. Multiple imputation: Checking imputed values
      Compare imputed values to observed values:
          plot(NatHealth.MI, which.vars=9:12)
  34. Multiple imputation: Checking imputed values: overimputation
      Overimputation strategy:
      - Treat every observed value as if it were missing
      - Impute many values for that observed value
      - Examine the correspondence between imputed and observed values
          overimpute(NatHealth.MI, var="sleep")
  35. Multiple imputation: Using imputed data sets in regression models
      Zelig makes it very easy to use imputed data sets: just point to the list of imputed data sets in the data argument.
          > library(Zelig)
          > nhImp.out <- zelig(wkdayr ~ cigsday + modmin + sleep, model = "ls",
          +                    data = NatHealth.MI$imputations, cite = FALSE)
          >
          > coef(summary(nhImp.out))
                        Value Std. Error  t-stat p-value
          (Intercept) -8.8163     6.7633 -1.3035  0.2132
          cigsday     -0.0102     0.1438 -0.0707  0.9437
          modmin      -0.0373     0.0234 -1.5940  0.1110
          sleep        2.2303     0.9412  2.3697  0.0348
          For separate results, use print(summary(x), subset = i:j).
  36. Multiple imputation: Exercise 2: multiple imputation
      1. Using Amelia, generate 5 imputed versions of the Exam dataset. Make sure you tell Amelia which variables are nominal, and that school is the id variable.
      2. Create plots that compare imputed values to observed values.
      3. Overimpute the variable "schavg".
  37. Topic: Multilevel Modeling
  38. Multilevel Modeling: Multilevel modeling overview
      - Multi-level (AKA hierarchical) models are a type of mixed-effects model
      - Used to model variation due to group membership
      - Can model different intercepts and/or slopes for each group
      - Mixed-effects models include two types of predictors: fixed effects and random effects
        - Fixed effects: observed levels are of direct interest (e.g., sex, political party...)
        - Random effects: observed levels are not of direct interest; the goal is to make inferences to a population represented by the observed levels
      - In R the lme4 package is the most popular for mixed-effects models
      - Use the lmer function for linear mixed models, glmer for generalized mixed models
  39. Multilevel Modeling: The Exam data
      The Exam data set contains exam scores of 4,059 students from 65 schools in Inner London. The variable names are as follows:
      - school: School ID, a factor.
      - normexam: Normalized exam score.
      - schgend: School gender, a factor. Levels are 'mixed', 'boys', and 'girls'.
      - schavg: School average of intake score.
      - vr: Student level Verbal Reasoning (VR) score band at intake, a factor. Levels are 'bottom 25%', 'mid 50%', and 'top 25%'.
      - intake: Band of student's intake score, a factor. Levels are 'bottom 25%', 'mid 50%' and 'top 25%'.
      - standLRT: Standardised LR test score.
      - sex: Sex of the student. Levels are 'F' and 'M'.
      - type: School type. Levels are 'Mxd' and 'Sngl'.
      - student: Student id (within school), a factor.
  40. Multilevel Modeling: The null model and ICC
      As a preliminary step it is often useful to partition the variance in the dependent variable into the various levels. This can be accomplished by running a null model (i.e., a model with a random effects grouping structure, but no fixed-effects predictors).
          Linear mixed model fit by maximum likelihood
          Formula: normexam ~ 1 + (1 | school)
             Data: Exam
            AIC   BIC logLik deviance REMLdev
          10826 10844  -5410    10820   10824
          Random effects:
           Groups   Name        Variance Std.Dev.
           school   (Intercept) 0.169    0.412
           Residual             0.848    0.921
          Number of obs: 3987, groups: school, 65
          Fixed effects:
                      Estimate Std. Error t value
          (Intercept)  -0.0141     0.0538   -0.26
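
The slide shows the null model's output but not the call that produced it. A plausible reconstruction, inferred from the printed formula and from the later Norm2 call, is sketched below. Two assumptions are made and should be checked against the workshop script: that the model object is named Norm1 (as later slides suggest), and that a comparable Exam data set can be loaded from the mlmRev package. The workshop's copy evidently has missing values (3987 observations are used), so results from the mlmRev version may differ.

```r
# The null model formula, as printed in the slide output
null_formula <- normexam ~ 1 + (1 | school)

# Fit only when the (assumed) packages are installed
if (requireNamespace("lme4", quietly = TRUE) &&
    requireNamespace("mlmRev", quietly = TRUE)) {
  # Assumption: mlmRev's Exam stands in for the workshop's copy of the data
  data(Exam, package = "mlmRev")
  # REML = FALSE requests maximum likelihood, matching the slide's header line
  Norm1 <- lme4::lmer(null_formula, data = Exam, REML = FALSE)
  print(summary(Norm1))
}
```

The term (1 | school) asks for a separate intercept for each school, drawn from a common distribution, and the "~ 1" part means there are no fixed-effects predictors beyond the overall intercept.
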
  41. Multilevel Modeling: Calculating ICC
      Calculate the ICC: the proportion of total variance in exam scores that is between groups, i.e., between-school variance / total variance.
          > (N1re <- data.frame(summary(Norm1)@REmat,
          +                     stringsAsFactors = FALSE))
              Groups        Name Variance Std.Dev.
          1   school (Intercept)    0.169    0.412
          2 Residual                0.848    0.921
          >
          > N1re[3:4] <- data.matrix(N1re[3:4])
          > N1re[1, "Variance"] / sum(N1re["Variance"])
          [1] 0.166
      17% of the variance in exam scores is between schools; the rest is within-school variance.
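
The ICC is a one-line ratio once the two variance components are known, so it can be verified directly from the printed output. The sketch below does that arithmetic in base R. It also defines a hypothetical helper, `icc_from_fit`, for readers whose lme4 version no longer supports the `@REmat` slot used on the slide; the helper assumes a random-intercept-only model fit with a current lme4 release, and it is defined but not run here.

```r
# Variance components copied from the null model output on the slide
var_school   <- 0.169  # between-school (random intercept) variance
var_residual <- 0.848  # within-school (residual) variance

icc <- var_school / (var_school + var_residual)
round(icc, 3)  # 0.166, matching the slide

# Hypothetical helper for a fitted random-intercept lmer model.
# Assumes a current lme4, where as.data.frame(VarCorr(fit)) returns one row
# per variance component with the variance stored in the vcov column.
icc_from_fit <- function(fit) {
  vc <- as.data.frame(lme4::VarCorr(fit))
  between <- sum(vc$vcov[vc$grp != "Residual"])
  between / sum(vc$vcov)
}
```

With a fitted null model at hand, calling icc_from_fit on it should give the same ratio without any string-to-number conversion.
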
  42. Multilevel Modeling: Adding fixed-effects predictors
      Predict exam scores from students' standardized test scores:
          > Norm2 <- lmer(normexam~standLRT + (1|school),
          +               data=Exam,
          +               REML = FALSE)
          > summary(Norm2)
          Linear mixed model fit by maximum likelihood
          Formula: normexam ~ standLRT + (1 | school)
             Data: Exam
           AIC  BIC logLik deviance REMLdev
          9143 9169  -4568     9135    9147
          Random effects:
           Groups   Name        Variance Std.Dev.
           school   (Intercept) 0.0919   0.303
           Residual             0.5670   0.753
          Number of obs: 3958, groups: school, 65
          Fixed effects:
                      Estimate Std. Error t value
          (Intercept)  0.00121    0.04003     0.0
          standLRT     0.56559    0.01265    44.7
          Correlation of Fixed Effects:
                   (Intr)
          standLRT 0.007
  43. Multilevel Modeling: Multiple degree of freedom comparisons
      As with lm and glm models, you can compare the two lmer models using the anova function.
          > anova(Norm1, Norm2)
          Data: Exam
          Models:
          Norm1: normexam ~ 1 + (1 | school)
          Norm2: normexam ~ standLRT + (1 | school)
                Df   AIC   BIC logLik Chisq Chi Df Pr(>Chisq)
          Norm1  3 10826 10844  -5410
          Norm2  4  9143  9169  -4568  1684      1     <2e-16
  44. Multilevel Modeling: Random slopes
      Add a random effect of students' standardized test scores as well. Now in addition to estimating the distribution of intercepts across schools, we also estimate the distribution of the slope of exam on standardized test.
          > Norm3 <- lmer(normexam~standLRT + (standLRT|school), data=Exam,
          +               REML = FALSE)
          > summary(Norm3)
          Linear mixed model fit by maximum likelihood
          Formula: normexam ~ standLRT + (standLRT | school)
             Data: Exam
           AIC  BIC logLik deviance REMLdev
          9108 9146  -4548     9096    9107
          Random effects:
           Groups   Name        Variance Std.Dev. Corr
           school   (Intercept) 0.0899   0.300
                    standLRT    0.0141   0.119    0.512
           Residual             0.5552   0.745
          Number of obs: 3958, groups: school, 65
          Fixed effects:
                      Estimate Std. Error t value
          (Intercept)  -0.0122     0.0397   -0.31
          standLRT      0.5586     0.0199   28.08
  45. Multilevel Modeling: Test the significance of the random slope
      To test the significance of a random slope, just compare models with and without the random slope term:
          > anova(Norm2, Norm3)
          Data: Exam
          Models:
          Norm2: normexam ~ standLRT + (1 | school)
          Norm3: normexam ~ standLRT + (standLRT | school)
                Df  AIC  BIC logLik Chisq Chi Df   Pr(>Chisq)
          Norm2  4 9143 9169  -4568
          Norm3  6 9108 9146  -4548    39      2 0.0000000035
  46. Multilevel Modeling: Exercise 3: multilevel modeling
      Use the dataset bh1996:
          data(bh1996, package="multilevel")
      From the data documentation:
          Variables are Cohesion (COHES), Leadership Climate (LEAD), Well-Being (WBEING) and Work Hours (HRS). Each of these variables has two variants: a group mean version that replicates each group mean for every individual, and a within-group version where the group mean is subtracted from each individual response. The group mean version is designated with a G. (e.g., G.HRS), and the within-group version is designated with a W. (e.g., W.HRS).
      1. Create a null model predicting wellbeing ("WBEING").
      2. Calculate the ICC for your null model.
      3. Run a second multi-level model that adds two individual-level predictors, average number of hours worked ("HRS") and leadership skills ("LEAD"), to the model and interpret your output.
      4. Now, add a random effect of average number of hours worked ("HRS") to the model and interpret your output. Test the significance of this random term.
      5. Finally, add a group-level term, workplace cohesion ("G.COHES"), to the model.
  47. Topic: Wrap-up
  48. Wrap-up: Help us make this workshop better!
      - Please take a moment to fill out a very short feedback form
      - These workshops exist for you: tell us what you need!
      - http://tinyurl.com/RstatisticsFeedback
  49. Wrap-up: Additional resources
      - IQSS workshops: http://projects.iq.harvard.edu/rtc/filter_by/workshops
      - IQSS statistical consulting: http://rtc.iq.harvard.edu
      - Zelig
        - Website: http://gking.harvard.edu/zelig
        - Documentation: http://r.iq.harvard.edu/docs/zelig.pdf
      - Amelia
        - Website: http://gking.harvard.edu/Amelia/
        - Documentation: http://r.iq.harvard.edu/docs/amelia/amelia.pdf