
Multivariate Analysis in R


Prepared for HSCI 432, SFU

Published in: Education

  1. 1. USING R FOR EPIDEMIOLOGICAL RESEARCH. Kiffer G. Card, PhD; Kirk J. Hepburn, MPP
  2. 2. MULTIVARIABLE ANALYSIS IN R
  3. 3. Outline: • Linear Regression • Logistic Regression • Other Regression Models • Model Building & Fit • Stratification • Interactions
  4. 4. 3.1. LINEAR REGRESSION: Linear Regression • Diagnosing Linear Models • Assumptions of Linear Regression
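  5. 5. Linear Regression. [Figure: scatter plot of Y against X with a fitted straight line.] The familiar line equation $\hat{y} = mx + b$ is written in regression notation as $\hat{y} = \beta_0 + \beta_1 x$, where $\beta_0$ is the intercept and $\beta_1$ is the slope.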
  6. 6. Linear Regression: interpreting β1
      • For a continuous predictor, β1 is the change in ŷ (in the units of y) for each one-unit increase in x.
      • For a binary predictor, β1 is the difference in ŷ between the coded level (x = 1) and its reference level (x = 0).
          • Gender: Male = 0, Female = 1
          • ŷ = β0 + β1(Gender: M[x=0] or F[x=1])
      • For a categorical predictor, each level is coded as its own No = 0 / Yes = 1 indicator. One level is left out as the reference, and each remaining indicator's β is the difference in ŷ between that level and the reference level.
          • Ethnicity: White (No = 0, Yes = 1), Black (No = 0, Yes = 1), Asian (No = 0, Yes = 1), Indigenous (No = 0, Yes = 1)
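The 0/1 coding of a categorical predictor can be inspected directly in R. The sketch below uses a made-up four-person sample (an assumption for illustration, not data from the slides) and treats White as the reference level simply because it is listed first.

```r
# Hypothetical sample: one person per category named on the slide.
ethnicity <- factor(c("White", "Black", "Asian", "Indigenous"),
                    levels = c("White", "Black", "Asian", "Indigenous"))

# model.matrix() shows the 0/1 columns that lm() would build from the factor.
# The first level (White) gets no column of its own: it is the reference,
# represented by a row of zeros in the indicator columns.
design <- model.matrix(~ ethnicity)
design
```

Each remaining column corresponds to one β in the fitted model, which is why a four-level predictor contributes three coefficients.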
  7. 7. Linear Regression: residuals. [Figure: scatter plot of Y against X with a fitted line, marking an observed point and its predicted value on the line.] Residual = y − ŷ (observed minus predicted).
  8. 8. Linear Regression: residuals. [Figure: fitted line with each point's residual labelled.] The residuals above the line sum to 3 + 1 + 5 + 2 + 2 + 2 + 3 = +18 and the residuals below the line sum to (−1) + (−3) + (−3) + (−1) + (−5) + (−2) + (−3) = −18, so the raw residuals cancel out.
  9. 9. Linear Regression: residuals (regression error). [Figure: the same fitted line with each residual squared.] Squaring removes the signs: 3² + 1² + 5² + 2² + 2² + 2² + 3² = 56 above the line and (−1)² + (−3)² + (−3)² + (−1)² + (−5)² + (−2)² + (−3)² = 58 below it. $\text{Regression Error} = \sum_i (y_i - \hat{y}_i)^2$, where $\hat{y}_i$ is the value predicted by the regression line.
  10. 10. Linear Regression: total error. [Figure: the same points with a horizontal line at the mean of Y and each squared deviation from the mean labelled.] 1² + 5² + 2² + 4² + 1² + 5² + 7² + 3² + 4² = 146 above the mean and (−3)² + (−1)² + (−5)² + (−2)² + (−5)² = 64 below it. $\text{Total Error} = \sum_i (y_i - \bar{y})^2$, where $\bar{y}$ is the mean of Y.
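The regression error and total error are the two ingredients of R-squared, using the standard definition R² = 1 − regression error / total error. A minimal sketch, assuming the built-in swiss data and the same Fertility ~ Agriculture model that appears later in the deck (variable names here are illustrative):

```r
# Fit the bivariate model (swiss ships with base R).
model <- lm(Fertility ~ Agriculture, data = swiss)

# Regression error: squared distances between observed and predicted values.
regression_error <- sum((swiss$Fertility - fitted(model))^2)

# Total error: squared distances between observed values and their mean.
total_error <- sum((swiss$Fertility - mean(swiss$Fertility))^2)

# Share of the total error that the regression line removes.
r_squared <- 1 - regression_error / total_error

# Compare the hand calculation with the value reported by summary().
c(by_hand = r_squared, from_summary = summary(model)$r.squared)
```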
  11. 11. Linear Regression: Data. Output of summary(swiss):

                   Fertility   Agriculture   Examination   Education   Catholic   Infant Mortality
        Min.         35.00        1.20          3.00         1.00        2.150         10.80
        1st Qu.      64.70       35.90         12.00         6.00        5.195         18.15
        Median       70.40       54.10         16.00         8.00       15.140         20.00
        Mean         70.14       50.66         16.49        10.98       41.144         19.94
        3rd Qu.      78.45       67.65         22.00        12.00       93.125         21.70
        Max.         92.50       89.70         37.00        53.00      100.000         26.60
  12. 12. Linear Regression: Bivariate association. R syntax:
        UnivariateModel <- lm(formula = Fertility ~ Agriculture, data = swiss)
        summary(UnivariateModel)
      The lm() function means “linear model”, and we read the formula as “Fertility by Agriculture”.
  13. 13. Linear Regression: Bivariate association
        ## Call:
        ## lm(formula = Fertility ~ Agriculture, data = swiss)
        ##
        ## Residuals:
        ##      Min       1Q   Median       3Q      Max
        ## -25.5374  -7.8685  -0.6362   9.0464  24.4858
        ##
        ## Coefficients:
        ##             Estimate Std. Error t value Pr(>|t|)
        ## (Intercept) 60.30438    4.25126  14.185   <2e-16 ***
        ## Agriculture  0.19420    0.07671   2.532   0.0149 *
        ## ---
        ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
        ##
        ## Residual standard error: 11.82 on 45 degrees of freedom
        ## Multiple R-squared:  0.1247, Adjusted R-squared:  0.1052
        ## F-statistic: 6.409 on 1 and 45 DF,  p-value: 0.01492
  14. 14. Linear Regression: Bivariate association
      $\text{Fertility}_i = \alpha + \beta_{\text{Agriculture}} \text{Agriculture}_i + \epsilon_i$
      where $i$ is the Swiss province and $\beta_{\text{Agriculture}}$ is the coefficient for Agriculture.
      • term: the model term
      • estimate: the estimated $\alpha$ and $\beta$ in our formula
      • std.error: the standard error of each estimate, that is, how uncertain the estimate is given the residual variation (the $\epsilon_i$) in the data
      • statistic: the test statistic, in this case t
      • p.value: the all-important (sort of) p-value

        term          estimate   std.error   statistic   p.value
        (Intercept)     60.304       4.251      14.185     0.000
        Agriculture      0.194       0.077       2.532     0.015
  15. 15. Linear Regression: Bivariate association. While it is clear that this model does explain some of the variation in Fertility, the $R^2$ is only 0.125. This means that only about 12.5% of the variance in Fertility is explained by Agriculture.
  16. 16. Linear Models: Multivariate. Equation:
      $\text{Fertility}_i = \alpha + \beta_{\text{Agriculture}} \text{Agriculture}_i + \beta_{\text{Examination}} \text{Examination}_i + \beta_{\text{Education}} \text{Education}_i + \beta_{\text{Catholic}} \text{Catholic}_i + \beta_{\text{InfantMortality}} \text{InfantMortality}_i + \epsilon_i$
        MultivariableModel <- lm(formula = Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality, data = swiss)
        summary(MultivariableModel)
  17. 17. Linear Models: Multivariate
        ## Call:
        ## lm(formula = Fertility ~ Agriculture + Examination + Education +
        ##     Catholic + Infant.Mortality, data = swiss)
        ##
        ## Residuals:
        ##      Min       1Q   Median       3Q      Max
        ## -15.2743  -5.2617   0.5032   4.1198  15.3213
        ##
        ## Coefficients:
        ##                  Estimate Std. Error t value Pr(>|t|)
        ## (Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
        ## Agriculture      -0.17211    0.07030  -2.448  0.01873 *
        ## Examination      -0.25801    0.25388  -1.016  0.31546
        ## Education        -0.87094    0.18303  -4.758 2.43e-05 ***
        ## Catholic          0.10412    0.03526   2.953  0.00519 **
        ## Infant.Mortality  1.07705    0.38172   2.822  0.00734 **
        ## ---
        ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
        ##
        ## Residual standard error: 7.165 on 41 degrees of freedom
        ## Multiple R-squared:  0.7067, Adjusted R-squared:  0.671
        ## F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10
  18. 18. Diagnosing Linear Models: Visualization
        car::avPlots(MultivariableModel, layout = c(2, 3))
      You can call a function from a package without using library() first by naming the package and then adding “::” before the function name.
  19. 19. Assumption 1: Normality
      • Most researchers use Q-Q plots to check the assumption that the residuals are normally distributed.
      • In this method, the observed values are plotted against the values expected under a normal distribution.
      • If the plotted points stray far from a straight line, the residuals are probably not normally distributed.
      • If they follow the line closely, the normality assumption is reasonable.
  20. 20. Assumption 1: Normality
        plot(MultivariableModel, 2)
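The coordinates behind a Q-Q plot can also be pulled out and inspected numerically. The sketch below refits the multivariable model so that it runs on its own. The correlation it computes is only an informal, illustrative summary of how straight the plot is (an assumption of this sketch, not a test used in the slides).

```r
model <- lm(Fertility ~ Agriculture + Examination + Education +
              Catholic + Infant.Mortality, data = swiss)

# Standardized residuals are what plot(model, 2) puts on the vertical axis.
std_resid <- rstandard(model)

# plot.it = FALSE returns the coordinates instead of drawing them:
# x = quantiles expected under normality, y = observed residuals.
qq <- qqnorm(std_resid, plot.it = FALSE)

# Informal straightness summary: values close to 1 mean the points
# lie close to a straight line.
qq_straightness <- cor(qq$x, qq$y)
qq_straightness
```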
  21. 21. Assumption 2: Linearity
        plot(MultivariableModel, 1)
      Linearity: In the population, the relation between the dependent variable and each independent variable is linear when all the other independent variables are held constant.
  22. 22. Assumption 3: Homoscedasticity. [Figure: scatter plot of Y against X.]
  23. 23. Assumption 3: Homoscedasticity
      • To test the assumption of homogeneity of variance, Levene’s test is used.
      • Levene’s test is used to assess whether the groups have equal variances. The test should not be significant if the assumption of equal variances is met.
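Levene's test needs groups to compare, so the sketch below invents one: provinces split by whether they are majority Catholic. That grouping is purely an assumption for illustration. The test is written out by hand in base R as a one-way ANOVA on absolute deviations from each group's mean, and ready-made versions exist in packages such as car.

```r
model <- lm(Fertility ~ Agriculture + Examination + Education +
              Catholic + Infant.Mortality, data = swiss)

# Hypothetical grouping, used only to have two groups to compare.
group <- factor(ifelse(swiss$Catholic > 50, "majority Catholic", "other"))

res <- resid(model)

# Absolute deviation of each residual from its own group's mean residual.
abs_dev <- abs(res - ave(res, group))

# Levene's test is a one-way ANOVA on those absolute deviations.
levene <- anova(lm(abs_dev ~ group))
levene_p <- levene[["Pr(>F)"]][1]
levene_p
```

A small p-value would suggest that the spread of the residuals differs between the two groups.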
  24. 24. Assumption 3: Homoscedasticity
        plot(MultivariableModel, 3)
  25. 25. Assumption 4: Independent Observations
      • There is no general test for the independence of observations; it has to be judged from how the data were collected.
      • Chain-referral and other non-random sampling methods may generate data that are not independent.
  26. 26. Assumption 5: Multicollinearity. [Figure: two diagrams, each showing a pair of predictors X1 and X2.]
  27. 27. Assumption 5: Multicollinearity
        corrplot::corrplot(cor(swiss))
      • If you were to execute just cor(swiss), you would get a correlation table rather than this pretty plot. The table gives you exact values, which can be helpful in identifying highly correlated variables.
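One way to use the exact values from cor(swiss) is to list only the strongly correlated pairs. This is a sketch: the 0.6 cut-off is an arbitrary choice for illustration, not a rule from the slides.

```r
correlations <- cor(swiss)

# Keep each pair once by blanking the diagonal and the lower triangle.
correlations[lower.tri(correlations, diag = TRUE)] <- NA

# Long format: one row per pair of variables.
pairs <- as.data.frame(as.table(correlations), responseName = "r")
pairs <- pairs[!is.na(pairs$r), ]

# Pairs whose correlation exceeds the (arbitrary) cut-off, strongest first.
strong <- pairs[abs(pairs$r) > 0.6, ]
strong[order(-abs(strong$r)), ]
```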
  28. 28. Assumption 5: Multicollinearity
        car::vif(MultivariableModel)

        Agriculture        2.284129
        Examination        3.675420
        Education          2.774943
        Catholic           1.937160
        Infant.Mortality   1.107542

      • In statistics, the variance inflation factor (VIF) is the ratio of the variance of a coefficient estimate in a model with multiple terms to its variance in a model with that term alone.
      • It quantifies the severity of multicollinearity in an ordinary least squares regression analysis.
      • It provides an index that measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity.
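For a model of this kind, the VIF of a predictor can also be computed by hand with the standard formula VIF = 1 / (1 − R²), where R² comes from regressing that predictor on all the others. A minimal base-R sketch (the helper names are illustrative):

```r
predictors <- c("Agriculture", "Examination", "Education",
                "Catholic", "Infant.Mortality")

manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # Regress this predictor on all the other predictors.
  r2 <- summary(lm(reformulate(others, response = p), data = swiss))$r.squared
  # The better the others predict it, the larger the inflation.
  1 / (1 - r2)
})

round(manual_vif, 6)
```

Examination has the largest value because it is the predictor best explained by the other four.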
  29. 29. Assumption 6: Outliers
        plot(MultivariableModel, 4)
      • Cook’s distance identifies possible outliers.
      • A distance of 1 is usually used as the threshold of concern, but you are always free to explore notable outliers to determine why they are outliers.
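The distances drawn by plot(MultivariableModel, 4) are also available as numbers through cooks.distance(). A short sketch, refitting the model so that it runs on its own:

```r
model <- lm(Fertility ~ Agriculture + Examination + Education +
              Catholic + Infant.Mortality, data = swiss)

# One Cook's distance per province, named by province.
cooks <- cooks.distance(model)

# The three provinces with the largest distances, worth a closer look.
head(sort(cooks, decreasing = TRUE), 3)

# Provinces beyond the threshold of 1 mentioned above (possibly none).
names(cooks)[cooks > 1]
```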
  30. 30. 3.2. LOGISTIC REGRESSION: Logistic Regression • Multinomial Regression
  31. 31. Logistic Regression. [Figure: logistic distribution curve, with the vertical axis running from 0 to 1.] $\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$
  32. 32. Logistic Regression
      • With a dichotomous predictor, β1 is the log odds ratio comparing the group to the referent level.
      • With a continuous predictor, β1 is the change in the log odds per unit change.
          • β1 = 0: Equal Odds
          • β1 > 0: Increased Odds
          • β1 < 0: Decreased Odds
      • Exponentiating β1 gives you an odds ratio.
          • exp(β1) = 1.00: Equal Odds
          • exp(β1) > 1.00: Increased Odds
          • exp(β1) < 1.00: Decreased Odds
  33. 33. Logistic Regression. Output of summary(infert):

        Education      Age             Prior Pregnancies   Induced          Infertility     Miscarriages
        0-5yrs : 12    Min.   :21.00   Min.   :1.000       Min.   :0.0000   Mode :logical   Min.   :0.0000
        6-11yrs:120    1st Qu.:28.00   1st Qu.:1.000       1st Qu.:0.0000   FALSE:165       1st Qu.:0.0000
        12+ yrs:116    Median :31.00   Median :2.000       Median :0.0000   TRUE :83        Median :0.0000
                       Mean   :31.50   Mean   :2.093       Mean   :0.5726                   Mean   :0.5766
                       3rd Qu.:35.25   3rd Qu.:3.000       3rd Qu.:1.0000                   3rd Qu.:1.0000
                       Max.   :44.00   Max.   :6.000       Max.   :2.0000                   Max.   :2.0000
  34. 34. Logistic Regression
      $\ln\left(\frac{P(\text{Infertility})}{1 - P(\text{Infertility})}\right) = \alpha + \beta_{\text{Age}} \text{Age}_i + \beta_{\text{PriorPregnancies}} \text{PriorPregnancies}_i + \beta_{\text{Education}} \text{Education}_i + \beta_{\text{Miscarriages}} \text{Miscarriages}_i + \beta_{\text{Induced}} \text{Induced}_i$
        LogisticModel <- glm(formula = Infertility ~ Age + PriorPregnancies + Education + Miscarriages + Induced, data = infert, family = binomial())
        tidy(LogisticModel)
  35. 35. Logistic Regression
        ##   term             estimate std.error statistic  p.value
        ##   <chr>               <dbl>     <dbl>     <dbl>    <dbl>
        ## 1 (Intercept)       -1.15      1.41      -0.814 4.16e- 1
        ## 2 Age                0.0396    0.0312     1.27  2.05e- 1
        ## 3 PriorPregnancies  -0.828     0.196     -4.22  2.49e- 5
        ## 4 Education6-11yrs  -1.04      0.793     -1.32  1.88e- 1
        ## 5 Education12+ yrs  -1.40      0.834     -1.68  9.25e- 2
        ## 6 Miscarriages       2.05      0.310      6.60  4.21e-11
        ## 7 Induced            1.29      0.301      4.28  1.91e- 5

        exp(LogisticModel$coefficients)
        ##      (Intercept)              Age PriorPregnancies Education6-11yrs
        ##        0.3168786        1.0403758        0.4368011        0.3519579
        ## Education12+ yrs     Miscarriages          Induced
        ##        0.2458079        7.7361567        3.6282752
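The slides use renamed columns, while the infert data that ships with R uses the names case, age, parity, education, spontaneous and induced. The sketch below assumes the mapping shown in the comments, which is inferred from the summary statistics rather than stated in the slides. It reproduces the odds ratios and adds Wald 95% confidence intervals (an addition of this sketch).

```r
# Assumed mapping from the built-in column names to the slide names.
infertility_data <- data.frame(
  Infertility      = infert$case == 1,    # TRUE for cases
  Age              = infert$age,
  PriorPregnancies = infert$parity,
  Education        = infert$education,
  Miscarriages     = infert$spontaneous,  # spontaneous abortions
  Induced          = infert$induced
)

model <- glm(Infertility ~ Age + PriorPregnancies + Education +
               Miscarriages + Induced,
             data = infertility_data, family = binomial())

# Odds ratios with Wald 95% confidence intervals, all on the odds scale.
odds_ratios <- exp(cbind(OR = coef(model), confint.default(model)))
round(odds_ratios, 3)
```

An interval that excludes 1 corresponds to a coefficient whose Wald test is significant at the 5% level.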
  36. 36. Multinomial Logistic Regression. Output of summary(students):

        SES          Program        Writing
        low   :47    general : 45   Min.   :31.00
        middle:95    academic:105   1st Qu.:45.75
        high  :58    vocation: 50   Median :54.00
                                    Mean   :52.77
                                    3rd Qu.:60.00
                                    Max.   :67.00
  37. 37. Multinomial Logistic Regression
      $\ln\left(\frac{P(\text{Program} = \text{academic})}{P(\text{Program} = \text{general})}\right) = \alpha_1 + \beta_{\text{SESmiddle},1} \text{SESmiddle}_i + \beta_{\text{SEShigh},1} \text{SEShigh}_i + \beta_{\text{Writing},1} \text{Writing}_i$
      $\ln\left(\frac{P(\text{Program} = \text{vocation})}{P(\text{Program} = \text{general})}\right) = \alpha_2 + \beta_{\text{SESmiddle},2} \text{SESmiddle}_i + \beta_{\text{SEShigh},2} \text{SEShigh}_i + \beta_{\text{Writing},2} \text{Writing}_i$
        MultinomialModel <- multinom(formula = Program ~ SES + Writing, data = students)
        ## # weights:  15 (8 variable)
        ## initial  value 219.722458
        ## iter  10 value 179.985215
        ## final  value 179.981726
        ## converged
  38. 38. Multinomial Logistic Regression
        summary(MultinomialModel)
        ## Call:
        ## multinom(formula = Program ~ SES + Writing, data = students)
        ##
        ## Coefficients:
        ##          (Intercept) SESmiddle   SEShigh     Writing
        ## academic   -2.851973 0.5332914 1.1628257  0.05792480
        ## vocation    2.366097 0.8246384 0.1802176 -0.05567514
        ##
        ## Std. Errors:
        ##          (Intercept) SESmiddle   SEShigh    Writing
        ## academic    1.166437 0.4437319 0.5142215 0.02141092
        ## vocation    1.174251 0.4901237 0.6484508 0.02333135
        ##
        ## Residual Deviance: 359.9635
        ## AIC: 375.9635
  39. 39. Multinomial Logistic Regression
        tidy(MultinomialModel)
        ## # A tibble: 8 x 6
        ##   y.level  term        estimate std.error statistic p.value
        ##   <chr>    <chr>          <dbl>     <dbl>     <dbl>   <dbl>
        ## 1 academic (Intercept)   0.0577    1.17      -2.45  0.0145
        ## 2 academic SESmiddle     1.70      0.444      1.20  0.229
        ## 3 academic SEShigh       3.20      0.514      2.26  0.0237
        ## 4 academic Writing       1.06      0.0214     2.71  0.00682
        ## 5 vocation (Intercept)  10.7       1.17       2.01  0.0439
        ## 6 vocation SESmiddle     2.28      0.490      1.68  0.0925
        ## 7 vocation SEShigh       1.20      0.648      0.278 0.781
        ## 8 vocation Writing       0.946     0.0233    -2.39  0.0170
      • The tidy command from the broom package provides a better-arranged table with more information.
      • Note that the estimate column in this output is exponentiated (for example, exp(−2.852) ≈ 0.0577 for the academic intercept), while std.error and statistic remain on the log-odds scale of the summary() output.
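The statistic and p.value columns can be reproduced from the summary() output by dividing each coefficient by its standard error (a Wald z-test). The sketch below types in the coefficients and standard errors reported above, because the students data set is not shipped with R.

```r
term_names <- c("(Intercept)", "SESmiddle", "SEShigh", "Writing")
outcomes   <- c("academic", "vocation")

# Values copied from the summary(MultinomialModel) output.
coefficients <- matrix(c(-2.851973, 0.5332914, 1.1628257,  0.05792480,
                          2.366097, 0.8246384, 0.1802176, -0.05567514),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(outcomes, term_names))
std_errors <- matrix(c(1.166437, 0.4437319, 0.5142215, 0.02141092,
                       1.174251, 0.4901237, 0.6484508, 0.02333135),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(outcomes, term_names))

z_statistic <- coefficients / std_errors       # Wald z statistics
p_value     <- 2 * pnorm(-abs(z_statistic))    # two-sided p-values
exp_coef    <- exp(coefficients)               # exponentiated estimates

round(z_statistic, 3)
round(p_value, 4)
round(exp_coef, 3)
```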
  40. 40. 3.2. OTHER MODELS: Poisson Regression • Binomial Regression
  41. 41. Poisson Regression: data

        NumberOfAwards   Program        Math
        Min.   :0.00     academic:105   Min.   :33.00
        1st Qu.:0.00     general : 45   1st Qu.:45.00
        Median :0.00     vocation: 50   Median :52.00
        Mean   :0.63                    Mean   :52.65
        3rd Qu.:1.00                    3rd Qu.:59.00
        Max.   :6.00                    Max.   :75.00
  42. 42. Poisson Regression
        PoissonModel <- glm(formula = NumberOfAwards ~ Program + Math, data = students2, family = poisson())
        tidy(PoissonModel)
        ## # A tibble: 4 x 5
        ##   term            estimate std.error statistic  p.value
        ##   <chr>              <dbl>     <dbl>     <dbl>    <dbl>
        ## 1 (Intercept)      -4.16      0.663      -6.28 3.37e-10
        ## 2 Programgeneral   -1.08      0.358      -3.03 2.48e- 3
        ## 3 Programvocation  -0.714     0.320      -2.23 2.57e- 2
        ## 4 Math              0.0702    0.0106      6.62 3.63e-11

        exp(PoissonModel$coefficients)
        ##     (Intercept)  Programgeneral Programvocation            Math
        ##      0.01555668      0.33828750      0.48965711      1.07267164
      Note: The exponentiated Poisson regression coefficient is a rate ratio corresponding to a one-unit difference in the predictor.
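Because the model is multiplicative on the count scale, an expected count is the baseline rate multiplied by each applicable rate ratio. The sketch below types in the exponentiated coefficients reported above (the students2 data are not shipped with R). The function expected_awards() is a hypothetical helper written for illustration, with academic as the reference program.

```r
# Exponentiated coefficients copied from the output above.
rate_ratios <- c("(Intercept)"     = 0.01555668,
                 "Programgeneral"  = 0.33828750,
                 "Programvocation" = 0.48965711,
                 "Math"            = 1.07267164)

# Hypothetical helper: expected number of awards for one student.
expected_awards <- function(math,
                            program = c("academic", "general", "vocation")) {
  program <- match.arg(program)
  program_ratio <- switch(program,
                          academic = 1,  # reference level
                          general  = rate_ratios[["Programgeneral"]],
                          vocation = rate_ratios[["Programvocation"]])
  rate_ratios[["(Intercept)"]] * program_ratio * rate_ratios[["Math"]]^math
}

# One extra Math point multiplies the expected count by the Math rate ratio.
expected_awards(61) / expected_awards(60)

# Same Math score, general versus academic program.
expected_awards(60, "general") / expected_awards(60, "academic")
```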
  43. 43. 3.3. MODEL BUILDING & FIT: Multivariable Regression • Model Building • Variable Selection • Relative Fit • Absolute Fit
  44. 44. Multiple Regression. [Figure: Y plotted against two predictors, X1 and X2.] $\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
  45. 45. Relative Fit
      • You can imagine that you often have many variables to choose from in a dataset. How do you know which of the many potential models is the best?
      • Usually, the best thing to do is to set up some sort of rule for which variables will be considered, for instance all those with bivariate p-values < 0.20.
      • Then you choose a method for adding or removing variables:
          • Backwards selection (start with all variables)
          • Forwards selection (start with none and add them in)
          • Stepwise (drops and adds variables until AIC is optimized)
        BothSelected <- step(object = MultivariableModel, scope = list(lower = UnivariateModel, upper = MultivariableModel), direction = "both")
  46. 46. Relative Fit
        ## Start:  AIC=190.69
        ## Fertility ~ Agriculture + Examination + Education + Catholic +
        ##     Infant.Mortality
        ##
        ##                    Df Sum of Sq    RSS    AIC
        ## - Examination       1     53.03 2158.1 189.86
        ## <none>                          2105.0 190.69
        ## - Infant.Mortality  1    408.75 2513.8 197.03
        ## - Catholic          1    447.71 2552.8 197.75
        ## - Education         1   1162.56 3267.6 209.36
        ##
        ## Step:  AIC=189.86
        ## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
        ##
        ##                    Df Sum of Sq    RSS    AIC
        ## <none>                          2158.1 189.86
        ## + Examination       1     53.03 2105.0 190.69
        ## - Infant.Mortality  1    409.81 2567.9 196.03
        ## - Catholic          1    956.57 3114.6 205.10
        ## - Education         1   2249.97 4408.0 221.43
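The AIC values that step() prints for a linear model can be reproduced from the RSS column as n · log(RSS / n) + 2k, where n is the number of observations and k the number of estimated coefficients. A minimal sketch for the starting model (RSS 2105.0, AIC 190.69), refitted here so the block runs on its own:

```r
model <- lm(Fertility ~ Agriculture + Examination + Education +
              Catholic + Infant.Mortality, data = swiss)

n   <- nrow(swiss)            # number of observations
k   <- length(coef(model))    # estimated coefficients, intercept included
rss <- sum(resid(model)^2)    # residual sum of squares (the RSS column)

manual_aic <- n * log(rss / n) + 2 * k

# extractAIC() is the function step() relies on; its second element is the AIC.
c(manual = manual_aic, extractAIC = extractAIC(model)[2])
```

The value returned by AIC() differs from this by an additive constant, so only differences between models fitted to the same data are meaningful.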
  47. 47. Absolute Fit
        stargazer::stargazer(UnivariateModel, BothSelected, MultivariableModel, type = "text")
      • Absolute model fit tells you how well your model matches the data. For a linear model this is measured by R-squared; for other model designs, a pseudo R-squared can be generated.
  48. 48. Absolute Fit
        ## ======================================================================================
        ##                                        Dependent variable:
        ##                     ------------------------------------------------------------------
        ##                                             Fertility
        ##                              (1)                   (2)                    (3)
        ## --------------------------------------------------------------------------------------
        ## Agriculture                0.194**               -0.155**               -0.172**
        ##                            (0.077)               (0.068)                (0.070)
        ##
        ## Examination                                                              -0.258
        ##                                                                         (0.254)
        ##
        ## Education                                       -0.980***              -0.871***
        ##                                                  (0.148)                (0.183)
        ##
        ## Catholic                                         0.125***               0.104***
        ##                                                  (0.029)                (0.035)
        ##
        ## Infant.Mortality                                 1.078***               1.077***
        ##                                                  (0.382)                (0.382)
        ##
        ## Constant                  60.304***             62.101***              66.915***
        ##                            (4.251)               (9.605)                (10.706)
        ##
        ## --------------------------------------------------------------------------------------
        ## Observations                  47                    47                     47
        ## R2                          0.125                 0.699                  0.707
        ## Adjusted R2                 0.105                 0.671                  0.671
        ## Residual Std. Error    11.816 (df = 45)       7.168 (df = 42)        7.165 (df = 41)
        ## F Statistic          6.409** (df = 1; 45) 24.424*** (df = 4; 42) 19.761*** (df = 5; 41)
        ## ======================================================================================
        ## Note:                                                      *p<0.1; **p<0.05; ***p<0.01
  49. 49. 3.4. STRATIFICATION: Moderators • Stratifying
  50. 50. Stratifying
      You can imagine that sometimes there are variables that are so important to predicting an outcome that the basic processes and relationships differ across their levels. For example, gender is often considered an “effect modifier” or moderator. In these cases it is helpful to stratify your results.
      Stratify by selecting a subset using a logical condition (here, honors equal to “not enrolled” or to “enrolled”):
        NonHonorsModel <- lm(formula = Writing ~ read + SES, data = students, subset = honors == "not enrolled")
        HonorsModel <- lm(formula = Writing ~ read + SES, data = students, subset = honors == "enrolled")
      Although the sample of honors students may have been too small to detect an association anyway, the magnitude of the coefficient is so different that the relationship between writing and reading scores appears to be absent among honors students while it is present among non-honors students. Honors students may all have very similar scores.
  51. 51. Stratifying
        tidy(NonHonorsModel)
        ## # A tibble: 4 x 5
        ##   term        estimate std.error statistic  p.value
        ##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
        ## 1 (Intercept)   28.9      3.27       8.84  3.17e-15
        ## 2 read           0.394    0.0665     5.92  2.33e- 8
        ## 3 SESmiddle      1.09     1.48       0.736 4.63e- 1
        ## 4 SEShigh        0.504    1.80       0.281 7.79e- 1

        tidy(HonorsModel)
        ## # A tibble: 4 x 5
        ##   term        estimate std.error statistic  p.value
        ##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
        ## 1 (Intercept)  60.8       2.27      26.8   6.47e-31
        ## 2 read          0.0392    0.0382     1.03  3.10e- 1
        ## 3 SESmiddle     0.264     0.854      0.309 7.59e- 1
        ## 4 SEShigh       0.392     0.807      0.485 6.30e- 1
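The same stratified fit can be written once and applied to every stratum with split() and lapply(). Since the students data are not shipped with R, this sketch uses swiss with a made-up stratifying variable (majority-Catholic provinces versus the rest). Both the grouping and the model formula are assumptions chosen only to illustrate the pattern.

```r
# Hypothetical stratifying variable: whether a province is majority Catholic.
swiss_strata <- transform(swiss,
                          MajorityCatholic = ifelse(Catholic > 50, "yes", "no"))

# Fit the same model separately within each stratum.
stratified_models <- lapply(
  split(swiss_strata, swiss_strata$MajorityCatholic),
  function(stratum) lm(Fertility ~ Education + Infant.Mortality, data = stratum)
)

# One column of coefficients per stratum, for side-by-side comparison.
stratum_coefficients <- sapply(stratified_models, coef)
stratum_coefficients
```

Large differences between the columns are the kind of pattern that suggests effect modification.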
  52. 52. 3.5. INTERACTIONS: Mediators • Mediation Test • Interaction Terms
  53. 53. Interaction Terms
      In other cases, perhaps you believe that specific intersections (like race and gender combined) offer more explanatory power than the individual constructs themselves. In this case, interaction terms can be included in your model:
        InteractionModel <- lm(formula = Fertility ~ Agriculture + Examination + Education*Catholic + Infant.Mortality, data = swiss)
        summary(InteractionModel)
  54. 54. Interaction Terms
        ## Call:
        ## lm(formula = Fertility ~ Agriculture + Examination + Education *
        ##     Catholic + Infant.Mortality, data = swiss)
        ##
        ## Residuals:
        ##      Min       1Q   Median       3Q      Max
        ## -14.9575  -5.1321   0.9299   4.2203  13.1799
        ##
        ## Coefficients:
        ##                     Estimate Std. Error t value Pr(>|t|)
        ## (Intercept)        59.500060  10.534970   5.648 1.48e-06 ***
        ## Agriculture        -0.156010   0.066628  -2.342 0.024274 *
        ## Examination        -0.356748   0.242766  -1.470 0.149516
        ## Education          -0.313333   0.284332  -1.102 0.277050
        ## Catholic            0.187584   0.047430   3.955 0.000304 ***
        ## Infant.Mortality    1.255314   0.367195   3.419 0.001460 **
        ## Education:Catholic -0.012480   0.005057  -2.468 0.017960 *
        ## ---
        ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
        ##
        ## Residual standard error: 6.758 on 40 degrees of freedom
        ## Multiple R-squared:  0.7455, Adjusted R-squared:  0.7073
        ## F-statistic: 19.53 on 6 and 40 DF,  p-value: 1.725e-10
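In an R formula, Education*Catholic expands to Education + Catholic + Education:Catholic, which is why three related rows appear in the output. With the interaction in the model, the Education slope is no longer a single number: it equals the Education coefficient plus the interaction coefficient times the value of Catholic. A sketch of that calculation (education_slope() is a hypothetical helper, and the values 5 and 95 are arbitrary illustrative percentages):

```r
model <- lm(Fertility ~ Agriculture + Examination + Education * Catholic +
              Infant.Mortality, data = swiss)
b <- coef(model)

# Slope of Education at a given percentage of Catholic residents:
# b[Education] + b[Education:Catholic] * Catholic
education_slope <- function(catholic) {
  unname(b["Education"] + b["Education:Catholic"] * catholic)
}

# Compare a mostly Protestant and a mostly Catholic province.
sapply(c(catholic_5 = 5, catholic_95 = 95), education_slope)
```

Because the interaction coefficient is negative, the association between Education and Fertility becomes more strongly negative as the Catholic percentage rises.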
