
Regression: the equation that best describes the relationship between variables

Two continuous variables

Y = dependent, outcome variable

X = independent, predictor variable

As you can imagine, we could draw any number of lines through these data. Theoretically, however, there is a single "best-fitting" line.

This works because whenever you square a negative number, it becomes positive.
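A quick sketch of this idea in R, with made-up numbers: fit a line by least squares, then compare the raw and squared residual sums.

```r
# Toy data (hypothetical) to show why residuals are squared:
# raw residuals cancel to zero, squared residuals do not.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)          # the "best fitting" line by least squares
res <- residuals(fit)

sum(res)    # essentially 0 -- positives and negatives cancel
sum(res^2)  # positive -- the quantity least squares minimizes
```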

Here I have used the summary function to give us a glimpse of the dataset. Fertility is a standardized fertility measure. All others are percentages of populations as follows:

% of males in the agriculture occupation

% of draftees receiving the highest mark on army examination

% of draftees with education beyond primary school

% of population Catholic (vs. Protestant)

% of live births who live less than 1 year

Switzerland, in 1888, was entering a period known as the demographic transition; i.e., its fertility was beginning to fall from the high level typical of underdeveloped countries.

For these top three Cook's distances we might ask: What about these observations sets them apart from the others, and can we get data on that? Or were the data for these observations entered incorrectly?

Often a distance of 1 is used as a threshold for concern. Our top three distances are fairly small.

When outliers are detected, they should only be removed if doing so has a major influence on your coefficient estimates. Otherwise, they should remain in the model unless some characteristic makes them too different from the other observations to belong in the sample.
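As a sketch of that workflow (refitting the swiss-data model used throughout the deck), `cooks.distance()` returns the distances and `update()` refits without the flagged rows; the threshold of 1 is the rule of thumb mentioned above.

```r
# Sketch: inspect Cook's distances for the swiss multivariable model,
# flag any above the common threshold of 1, and refit without them
# to see whether the coefficient estimates change materially.
MultivariableModel <- lm(Fertility ~ Agriculture + Examination + Education +
                           Catholic + Infant.Mortality, data = swiss)

d <- cooks.distance(MultivariableModel)
head(sort(d, decreasing = TRUE), 3)   # top three distances

flagged <- which(d > 1)               # empty here -- all distances are small
if (length(flagged) > 0) {
  Refit <- update(MultivariableModel, subset = -flagged)
  cbind(full = coef(MultivariableModel), refit = coef(Refit))
}
```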

Outcome is modeled as log odds.

Model parameters are estimated using maximum likelihood techniques.

b1 is the log odds ratio.

exp(b1) is the odds ratio estimate from a logistic regression model.

As shown in the equation, the model produces estimates in log-odds. To produce an odds ratio instead, exponentiate the coefficients.

Women who underwent dangerous, illegal abortions had 3.63 times the odds of experiencing secondary infertility, controlling for prior miscarriages, prior pregnancies, education level, and age.
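A minimal sketch of this exponentiation step, using R's built-in infert data with its original column names (the slides use renamed versions of the same variables):

```r
# Fit the logistic model on the built-in infert data and convert
# log-odds coefficients to odds ratios. Column names here are the
# originals (case, parity, spontaneous, ...), not the renamed ones
# shown in the slides.
LogisticModel <- glm(case ~ age + parity + education + spontaneous + induced,
                     data = infert, family = binomial())

coef(LogisticModel)        # estimates on the log-odds scale
exp(coef(LogisticModel))   # odds ratios
```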

The exponentiated Poisson regression coefficient is a rate ratio corresponding to a one unit difference in the predictor.
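A small simulated sketch (hypothetical data) of that interpretation: generate counts whose log-rate depends on a predictor, then recover the rate ratio by exponentiating the fitted coefficient.

```r
# Simulated (hypothetical) count data: the true log-rate is
# 0.5 + 0.3 * x, so the true rate ratio per unit of x is exp(0.3).
set.seed(1)
x <- rnorm(500)
y <- rpois(500, lambda = exp(0.5 + 0.3 * x))

PoissonModel <- glm(y ~ x, family = poisson())
exp(coef(PoissonModel)[["x"]])  # close to exp(0.3), about 1.35
```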

- 1. USING R FOR EPIDEMIOLOGICAL RESEARCHKiffer G. Card, PhD Kirk J. Hepburn, MPP
- 2. MULTIVARIABLE ANALYSIS IN R
- 3. • Linear Regression • Logistic Regression • Other Regression Models • Model Building & Fit • Stratifications • Interactions Outline
- 4. 3.1. LINEAR REGRESSION Linear Regression Diagnosing Linear Models Assumptions of Linear Regression
- 5. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions X Y ŷ = mx + b ŷ = β0 + β1x Linear Regression
- 6. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions • For a continuous predictor, β1 is the increase in y (in the units of y) for each one-unit increase in x. • For a binary predictor, β1 is the difference in y comparing the variable level to its reference level. • Gender • Male = 0 • Female = 1 • ŷ = β0 + β1(Gender: M[x=0] or F[x=1]) • For a categorical predictor, each β1 is the difference in y comparing that level to the reference level. • Ethnicity • White • No = 0 • Yes = 1 • Black • No = 0 • Yes = 1 • Asian • No = 0 • Yes = 1 • Indigenous • No = 0 • Yes = 1 Linear Regression
- 7. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions X Y Residuals = y - ŷ Observed Predicted Linear Regression
- 8. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions X Y -1 +3 -3 +1 +2 +2 -1 -3 -5 +2 +3 -2 -3 +5 Residuals +3+1+5+2+2+2+3 = +18 (-1)+(-3)+(-3)+(-1)+(-5)+ (-2)+(-3) = -18 Linear Regression
- 9. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions X Y -12 +32 -32 +12 +22 +22 -12 -32 -52 +22 +32 -22 -32 +52 Residuals (Regression Error) +32+12+52+22+22+22+32 = +56 (-1)2+(-3)2+(-3)2+(-1)2+(-5)2+ (-2)2+(-3)2 = +58 𝑹𝒆𝒈𝒓𝒆𝒔𝒔𝒊𝒐𝒏 𝑬𝒓𝒓𝒐𝒓 = (𝒚𝒊− 𝒚 𝑹𝒆𝒈𝒓𝒆𝒔𝒔𝒊𝒐𝒏) 𝟐 Linear Regression
- 10. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions X Y -32 +12 -52 -12 +22 +42 +12 -22 -52 +52 +72 +32 +42 +52 Total Error +12+52+22+42+12+52+72 +32 +42 = +146 (-3)2+(-1)2+(-5)2+(-2)2+(-5)2 = +64 Mean 𝑻𝒐𝒕𝒂𝒍 𝑬𝒓𝒓𝒐𝒓 = (𝒚𝒊− 𝒚) 𝟐 Linear Regression
- 11. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Linear Regression: Data Fertility Agriculture Examination Education Catholic Infant Mortality Min. :35.00 Min. : 1.20 Min. : 3.00 Min. : 1.00 Min. : 2.150 Min. :10.80 1st Qu.:64.70 1st Qu.:35.90 1st Qu.:12.00 1st Qu.: 6.00 1st Qu.: 5.195 1st Qu.:18.15 Median :70.40 Median :54.10 Median :16.00 Median : 8.00 Median : 15.140 Median :20.00 Mean :70.14 Mean :50.66 Mean :16.49 Mean :10.98 Mean : 41.144 Mean :19.94 3rd Qu.:78.45 3rd Qu.:67.65 3rd Qu.:22.00 3rd Qu.:12.00 3rd Qu.: 93.125 3rd Qu.:21.70 Max. :92.50 Max. :89.70 Max. :37.00 Max. :53.00 Max. :100.000 Max. :26.60 summary(swiss)
- 12. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Linear Regression: Bivariate association R syntax: UnivariateModel <- lm(formula = Fertility ~ Agriculture, data = swiss) summary(UnivariateModel) The lm() function means “linear model”, and we read the formula as “Fertility by Agriculture”.
- 13. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Linear Regression: Bivariate association ## Call: ## lm(formula = Fertility ~ Agriculture, data = swiss) ## ## Residuals: ## Min 1Q Median 3Q Max ## -25.5374 -7.8685 -0.6362 9.0464 24.4858 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 60.30438 4.25126 14.185 <2e-16 *** ## Agriculture 0.19420 0.07671 2.532 0.0149 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 11.82 on 45 degrees of freedom ## Multiple R-squared: 0.1247, Adjusted R-squared: 0.1052 ## F-statistic: 6.409 on 1 and 45 DF, p-value: 0.01492
- 14. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Linear Regression: Bivariate association 𝐹𝑒𝑟𝑡𝑖𝑙𝑖𝑡𝑦𝑖 = 𝛼 + 𝛽𝐴𝑔𝑟𝑖𝑐𝑢𝑙𝑡𝑢𝑟𝑒 𝐴𝑔𝑟𝑖𝑐𝑢𝑙𝑡𝑢𝑟𝑒𝑖 + 𝜖𝑖, where 𝑖 indexes the Swiss province and 𝛽𝐴𝑔𝑟𝑖𝑐𝑢𝑙𝑡𝑢𝑟𝑒 is the coefficient for Agriculture • term: the model term • estimate: the α and β values in our formula • std.error: the standard error of each coefficient estimate • statistic: the test statistic, in this case t • p.value: the all-important (sort of) p-value term estimate std.error statistic p.value (Intercept) 60.304 4.251 14.185 0.000 Agriculture 0.194 0.077 2.532 0.015
- 15. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Linear Regression: Bivariate association While this model clearly explains some of the variation in Fertility, the R² is only 0.125. This means that only about 12.5% of the variance in Fertility is explained by Agriculture.
- 16. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Linear Models: Multivariate Equation: 𝐹𝑒𝑟𝑡𝑖𝑙𝑖𝑡𝑦𝑖 = 𝛼 + 𝛽𝐴𝑔𝑟𝑖𝑐𝑢𝑙𝑡𝑢𝑟𝑒 𝐴𝑔𝑟𝑖𝑐𝑢𝑙𝑡𝑢𝑟𝑒𝑖 + 𝛽 𝐸𝑥𝑎𝑚𝑖𝑛𝑎𝑡𝑖𝑜𝑛 𝐸𝑥𝑎𝑚𝑖𝑛𝑎𝑡𝑖𝑜𝑛𝑖 + 𝛽 𝐸𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛 𝐸𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛𝑖 + 𝛽 𝐶𝑎𝑡ℎ𝑜𝑙𝑖𝑐 𝐶𝑎𝑡ℎ𝑜𝑙𝑖𝑐𝑖 + 𝛽𝐼𝑛𝑓𝑎𝑛𝑡𝑀𝑜𝑟𝑡𝑎𝑙𝑖𝑡𝑦 𝐼𝑛𝑓𝑎𝑛𝑡𝑀𝑜𝑟𝑡𝑎𝑙𝑖𝑡𝑦𝑖 + 𝜖𝑖 MultivariableModel <- lm(formula = Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality, data = swiss) summary(MultivariableModel)
- 17. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Linear Models: Multivariate ## Call: ## lm(formula = Fertility ~ Agriculture + Examination + Education + ## Catholic + Infant.Mortality, data = swiss) ## ## Residuals: ## Min 1Q Median 3Q Max ## -15.2743 -5.2617 0.5032 4.1198 15.3213 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 66.91518 10.70604 6.250 1.91e-07 *** ## Agriculture -0.17211 0.07030 -2.448 0.01873 * ## Examination -0.25801 0.25388 -1.016 0.31546 ## Education -0.87094 0.18303 -4.758 2.43e-05 *** ## Catholic 0.10412 0.03526 2.953 0.00519 ** ## Infant.Mortality 1.07705 0.38172 2.822 0.00734 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 7.165 on 41 degrees of freedom ## Multiple R-squared: 0.7067, Adjusted R-squared: 0.671 ## F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
- 18. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Diagnosing Linear Models: Visualization car::avPlots(MultivariableModel, layout = c(2, 3)) You can call a function from a package without loading it via library() first by prefixing the function with the package name and "::"
- 19. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Assumption 1: Normality • Most researchers use Q-Q plots to check the assumption of normality. • In this method, the observed quantiles are plotted against the expected theoretical quantiles. • If the plotted points deviate substantially from a straight line, the residuals are not normally distributed. • If they fall close to the line, the normality assumption is reasonable.
- 20. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Assumption 1: Normality plot(MultivariableModel, 2)
- 21. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Assumption 2: Linearity plot(MultivariableModel, 1) Linearity: In the population, the relation between the dependent variable and the independent variable is linear when all the other independent variables are held constant.
- 22. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions X Y Assumption 3: Homoscedasticity
- 23. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Assumption 3: Homoscedasticity • Levene's test can be used to assess the assumption of homogeneity of variance, i.e., whether the groups have equal variances. • The test should not be significant if the assumption of equal variances is met.
- 24. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Assumption 3: Homoscedasticity plot(MultivariableModel, 3)
- 25. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions • There is no test for the independence of observations. • Chain-referral and other non-random sampling methods may generate data that is non-independent. Assumption 4: Independent Observations
- 26. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions X1 X2 X1 X2 Assumption 5: Multicollinearity
- 27. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Assumption 5: Multicollinearity corrplot::corrplot(cor(swiss)) • If you were to just execute the cor(swiss) function, you would get a correlation table, rather than this pretty plot. Doing so would give you exact values, which can be helpful in identifying highly correlated variables.
- 28. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Assumption 5: Multicollinearity vif(MultivariableModel) Agriculture 2.284129 Examination 3.675420 Education 2.774943 Catholic 1.937160 Infant.Mortality 1.107542 • In statistics, the variance inflation factor (VIF) is the ratio of variance in a model with multiple terms, divided by the variance of a model with one term alone. • It quantifies the severity of multicollinearity in an ordinary least squares regression analysis. • It provides an index that measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity.
- 29. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Assumption 6: Outliers plot(MultivariableModel, 4) • Cook’s distance identifies possible outliers. • A distance of 1 is usually used as threshold of concern, but you are always free to explore notable outliers to determine why they are outliers.
- 30. 3.2. LOGISTIC REGRESSION Logistic Regression Multinomial Regression
- 31. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions 0 1 Logistic Distribution 𝒍𝒏( 𝒑 𝟏− 𝒑 ) = 𝜷 𝟎 + 𝜷 𝟏 + ⋯ 𝜷 𝒏 Logistic Regression
- 32. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions • With a dichotomous predictor, β1 is the log odds comparing group(s) to the referent level. • With a continuous predictor, β1 is the log odds per unit change. • β1 = 0: Equal Odds • β1 > 0: Increased Odds • β1 < 0: Decreased Odds • Exponentiating the log odds (β1) gives you an odds ratio • exp(β1) = 1.00: Equal Odds • exp(β1) > 1.00: Increased Odds • exp(β1) < 1.00: Decreased Odds Logistic Regression
- 33. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Logistic Regression summary(infert) Education Age Prior Pregnancies Induced Infertility Miscarriages 0-5yrs : 12 Min. :21.00 Min. :1.000 Min. :0.0000 Mode :logical Min. :0.0000 6-11yrs:120 1st Qu.:28.00 1st Qu.:1.000 1st Qu.:0.0000 FALSE:165 1st Qu.:0.0000 12+ yrs:116 Median :31.00 Median :2.000 Median :0.0000 TRUE :83 Median :0.0000 Mean :31.50 Mean :2.093 Mean :0.5726 Mean :0.5766 3rd Qu.:35.25 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:1.0000 Max. :44.00 Max. :6.000 Max. :2.0000 Max. :2.0000
- 34. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Logistic Regression 𝑙𝑛( 𝑃(𝐼𝑛𝑓𝑒𝑟𝑡𝑖𝑙𝑖𝑡𝑦) 1 − 𝑃(𝐼𝑛𝑓𝑒𝑟𝑡𝑖𝑙𝑖𝑡𝑦) ) = 𝛼 + 𝛽𝐴𝑔𝑒 𝐴𝑔𝑒𝑖 + 𝛽 𝑃𝑟𝑖𝑜𝑟𝑃𝑟𝑒𝑔𝑛𝑎𝑛𝑐𝑖𝑒𝑠 𝑃𝑟𝑖𝑜𝑟𝑃𝑟𝑒𝑔𝑛𝑎𝑛𝑐𝑖𝑒𝑠𝑖 + 𝛽 𝐸𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛 𝐸𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛𝑖 + 𝛽 𝑀𝑖𝑠𝑐𝑎𝑟𝑟𝑖𝑎𝑔𝑒𝑠 𝑀𝑖𝑠𝑐𝑎𝑟𝑟𝑖𝑎𝑔𝑒𝑠𝑖 + 𝛽𝐼𝑛𝑑𝑢𝑐𝑒𝑑 𝐼𝑛𝑑𝑢𝑐𝑒𝑑𝑖 + 𝜖𝑖 LogisticModel <- glm(formula = Infertility ~ Age + PriorPregnancies + Education + Miscarriages + Induced, data = infert, family = binomial()) tidy(LogisticModel)
- 35. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Logistic Regression ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -1.15 1.41 -0.814 4.16e- 1 ## 2 Age 0.0396 0.0312 1.27 2.05e- 1 ## 3 PriorPregnancies -0.828 0.196 -4.22 2.49e- 5 ## 4 Education6-11yrs -1.04 0.793 -1.32 1.88e- 1 ## 5 Education12+ yrs -1.40 0.834 -1.68 9.25e- 2 ## 6 Miscarriages 2.05 0.310 6.60 4.21e-11 ## 7 Induced 1.29 0.301 4.28 1.91e- 5 exp(LogisticModel$coefficients) ## (Intercept) Age PriorPregnancies Education6-11yrs ## 0.3168786 1.0403758 0.4368011 0.3519579 ## Education12+ yrs Miscarriages Induced ## 0.2458079 7.7361567 3.6282752
- 36. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Multinomial Logistic Regression SES Program Writing low :47 general : 45 Min. :31.00 middle:95 academic:105 1st Qu.:45.75 high :58 vocation: 50 Median :54.00 Mean :52.77 3rd Qu.:60.00 Max. :67.00 summary(students)
- 37. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Multinomial Logistic Regression 𝑙𝑛( 𝑃(𝑃𝑟𝑜𝑔𝑟𝑎𝑚==𝑎𝑐𝑎𝑑𝑒𝑚𝑖𝑐) 𝑃(𝑃𝑟𝑜𝑔𝑟𝑎𝑚==𝑔𝑒𝑛𝑒𝑟𝑎𝑙) ) = 𝛼1 + 𝛽𝑆𝐸𝑆𝑚𝑖𝑑𝑑𝑙𝑒1 𝑆𝐸𝑆𝑚𝑖𝑑𝑑𝑙𝑒𝑖 + 𝛽𝑆𝐸𝑆ℎ𝑖𝑔ℎ1 𝑆𝐸𝑆ℎ𝑖𝑔ℎ𝑖 + 𝛽 𝑊𝑟𝑖𝑡𝑖𝑛𝑔1 𝑊𝑟𝑖𝑡𝑖𝑛𝑔𝑖 𝑙𝑛( 𝑃(𝑃𝑟𝑜𝑔𝑟𝑎𝑚 == 𝑣𝑜𝑐𝑎𝑡𝑖𝑜𝑛) 𝑃(𝑃𝑟𝑜𝑔𝑟𝑎𝑚 == 𝑔𝑒𝑛𝑒𝑟𝑎𝑙) ) = 𝛼2 + 𝛽𝑆𝐸𝑆𝑚𝑖𝑑𝑑𝑙𝑒2 𝑆𝐸𝑆𝑚𝑖𝑑𝑑𝑙𝑒𝑖 + 𝛽𝑆𝐸𝑆ℎ𝑖𝑔ℎ2 𝑆𝐸𝑆ℎ𝑖𝑔ℎ𝑖 + 𝛽 𝑊𝑟𝑖𝑡𝑖𝑛𝑔2 𝑊𝑟𝑖𝑡𝑖𝑛𝑔𝑖 MultinomialModel <- multinom(formula = Program ~ SES + Writing, data = students) ## # weights: 15 (8 variable) ## initial value 219.722458 ## iter 10 value 179.985215 ## final value 179.981726 ## converged
- 38. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Multinomial Logistic Regression summary(MultinomialModel) ## Call: ## multinom(formula = Program ~ SES + Writing, data = students) ## ## Coefficients: ## (Intercept) SESmiddle SEShigh Writing ## academic -2.851973 0.5332914 1.1628257 0.05792480 ## vocation 2.366097 0.8246384 0.1802176 -0.05567514 ## ## Std. Errors: ## (Intercept) SESmiddle SEShigh Writing ## academic 1.166437 0.4437319 0.5142215 0.02141092 ## vocation 1.174251 0.4901237 0.6484508 0.02333135 ## ## Residual Deviance: 359.9635 ## AIC: 375.9635
- 39. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Multinomial Logistic Regression tidy(MultinomialModel) ## # A tibble: 8 x 6 ## y.level term estimate std.error statistic p.value ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 academic (Intercept) 0.0577 1.17 -2.45 0.0145 ## 2 academic SESmiddle 1.70 0.444 1.20 0.229 ## 3 academic SEShigh 3.20 0.514 2.26 0.0237 ## 4 academic Writing 1.06 0.0214 2.71 0.00682 ## 5 vocation (Intercept) 10.7 1.17 2.01 0.0439 ## 6 vocation SESmiddle 2.28 0.490 1.68 0.0925 ## 7 vocation SEShigh 1.20 0.648 0.278 0.781 ## 8 vocation Writing 0.946 0.0233 -2.39 0.0170 • The tidy() function from the broom package provides a better-arranged table with more information
- 40. 3.2. OTHER MODELS Poisson Regression Binomial Regression
- 41. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Poisson Regression NumberOfAwards Program Math Min. :0.00 academic:105 Min. :33.00 1st Qu.:0.00 general : 45 1st Qu.:45.00 Median :0.00 vocation: 50 Median :52.00 Mean :0.63 Mean :52.65 3rd Qu.:1.00 3rd Qu.:59.00 Max. :6.00 Max. :75.00
- 42. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Poisson Regression PoissonModel <- glm(formula = NumberOfAwards ~ Program + Math, data = students2, family = poisson()) tidy(PoissonModel) ## # A tibble: 4 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -4.16 0.663 -6.28 3.37e-10 ## 2 Programgeneral -1.08 0.358 -3.03 2.48e- 3 ## 3 Programvocation -0.714 0.320 -2.23 2.57e- 2 ## 4 Math 0.0702 0.0106 6.62 3.63e-11 exp(PoissonModel$coefficients) ## (Intercept) Programgeneral Programvocation Math ## 0.01555668 0.33828750 0.48965711 1.07267164 Note: The exponentiated Poisson regression coefficient is a rate ratio corresponding to a one unit difference in the predictor.
- 43. 3.3. MODEL BUILDING & FIT Multivariable Regression Model Building Variable Selection Relative Fit Absolute Fit
- 44. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions ŷ = β0 + β1x + β2x X1 X2 Y Multiple Regression
- 45. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Relative Fit BothSelected <- step(object = MultivariableModel, scope = list(lower = UnivariateModel, upper = MultivariableModel), direction = "both") • You often have many variables to choose from in a dataset. How do you know which of the many potential models is best? • Usually, the best approach is to set up a rule for which variables will be considered, for instance all those with bivariate p-values < 0.20. • Then you choose a method for removing variables: • Backwards selection (start with all variables) • Forwards selection (start with none and add them in) • Stepwise (drops and adds variables until AIC is optimized)
- 46. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Relative Fit ## Start: AIC=190.69 ## Fertility ~ Agriculture + Examination + Education + Catholic + ## Infant.Mortality ## ## Df Sum of Sq RSS AIC ## - Examination 1 53.03 2158.1 189.86 ## <none> 2105.0 190.69 ## - Infant.Mortality 1 408.75 2513.8 197.03 ## - Catholic 1 447.71 2552.8 197.75 ## - Education 1 1162.56 3267.6 209.36 ## ## Step: AIC=189.86 ## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality ## ## Df Sum of Sq RSS AIC ## <none> 2158.1 189.86 ## + Examination 1 53.03 2105.0 190.69 ## - Infant.Mortality 1 409.81 2567.9 196.03 ## - Catholic 1 956.57 3114.6 205.10 ## - Education 1 2249.97 4408.0 221.43
- 47. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Absolute Fit stargazer::stargazer(UnivariateModel, BothSelected, MultivariableModel, type = "text") • Absolute model fit tells you how well your model matches the data. For a linear model this is measured by R²; for other model designs, a pseudo-R² can be generated.
- 48. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Absolute Fit## ## ====================================================================================== ## Dependent variable: ## ------------------------------------------------------------------ ## Fertility ## (1) (2) (3) ## -------------------------------------------------------------------------------------- ## Agriculture 0.194** -0.155** -0.172** ## (0.077) (0.068) (0.070) ## ## Examination -0.258 ## (0.254) ## ## Education -0.980*** -0.871*** ## (0.148) (0.183) ## ## Catholic 0.125*** 0.104*** ## (0.029) (0.035) ## ## Infant.Mortality 1.078*** 1.077*** ## (0.382) (0.382) ## ## Constant 60.304*** 62.101*** 66.915*** ## (4.251) (9.605) (10.706) ## ## -------------------------------------------------------------------------------------- ## Observations 47 47 47 ## R2 0.125 0.699 0.707 ## Adjusted R2 0.105 0.671 0.671 ## Residual Std. Error 11.816 (df = 45) 7.168 (df = 42) 7.165 (df = 41) ## F Statistic 6.409** (df = 1; 45) 24.424*** (df = 4; 42) 19.761*** (df = 5; 41) ## ====================================================================================== ## Note: *p<0.1; **p<0.05; ***p<0.01
- 49. 3.4. STRATIFICATION Moderators Stratifying
- 50. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Stratifying NonHonorsModel <- lm(formula = Writing ~ read + SES, data = students, subset = honors == "not enrolled") HonorsModel <- lm(formula = Writing ~ read + SES, data = students, subset = honors == "enrolled") Stratify by selecting a subset with a logical condition (here, honors == "not enrolled" or honors == "enrolled"). Although the sample of honors students may have been too small to detect an association anyway, the difference in the magnitude of the coefficients suggests that the relationship between writing and reading scores is present among non-honors students but not among honors students, who may all have very similar scores. Sometimes a variable is so important to predicting an outcome that the basic processes and relationships differ across its levels. For example, gender is often considered an "effect modifier" or moderator. In these cases it is helpful to stratify your results.
- 51. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Stratifying tidy(NonHonorsModel) ## # A tibble: 4 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 28.9 3.27 8.84 3.17e-15 ## 2 read 0.394 0.0665 5.92 2.33e- 8 ## 3 SESmiddle 1.09 1.48 0.736 4.63e- 1 ## 4 SEShigh 0.504 1.80 0.281 7.79e- 1 tidy(HonorsModel) ## # A tibble: 4 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 60.8 2.27 26.8 6.47e-31 ## 2 read 0.0392 0.0382 1.03 3.10e- 1 ## 3 SESmiddle 0.264 0.854 0.309 7.59e- 1 ## 4 SEShigh 0.392 0.807 0.485 6.30e- 1
- 52. 3.5. INTERACTIONS Mediators Mediation Test Interaction Terms
- 53. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Interaction Terms InteractionModel <- lm(formula = Fertility ~ Agriculture + Examination + Education*Catholic + Infant.Mortality, data = swiss) summary(InteractionModel) In other cases, perhaps you believe that specific intersections (like race and gender combined) offer more explanatory power than the individual constructs themselves. In this case, interaction terms can be included in your model:
- 54. Linear Regression Logistic Regression Other Regression Models Model Building & Fit Stratification Interactions Interaction Terms ## Call: ## lm(formula = Fertility ~ Agriculture + Examination + Education * ## Catholic + Infant.Mortality, data = swiss) ## ## Residuals: ## Min 1Q Median 3Q Max ## -14.9575 -5.1321 0.9299 4.2203 13.1799 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 59.500060 10.534970 5.648 1.48e-06 *** ## Agriculture -0.156010 0.066628 -2.342 0.024274 * ## Examination -0.356748 0.242766 -1.470 0.149516 ## Education -0.313333 0.284332 -1.102 0.277050 ## Catholic 0.187584 0.047430 3.955 0.000304 *** ## Infant.Mortality 1.255314 0.367195 3.419 0.001460 ** ## Education:Catholic -0.012480 0.005057 -2.468 0.017960 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 6.758 on 40 degrees of freedom ## Multiple R-squared: 0.7455, Adjusted R-squared: 0.7073 ## F-statistic: 19.53 on 6 and 40 DF, p-value: 1.725e-10
