Using forward selection, backward elimination, and stepwise regression, I construct regression models and identify the one that best explains the variation in miles per gallon that is predictable from other car characteristics in the R dataset mtcars.
Relentless Regression
By Nicholas Brooks
3/15/17
Using the dataset in R called mtcars I will use descriptive and inferential statistical methods
to find out whether any significant relationships exist between miles per gallon (mpg) and
the other variables in the dataset.
> datasets::mtcars
> mc<-mtcars
> head(mc,5)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
> str(mc)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
> summary(mc$mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.40   15.42   19.20   20.09   22.80   33.90
plot(mc$wt,mc$mpg,xlab="weight",ylab="mpg",main="weight and mpg comparison",col="blue")
The plot displays a possible strong negative relationship between weight and mpg.
cor.test(mc$wt, mc$mpg)
Pearson's product-moment correlation
data: mc$wt and mc$mpg
t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9338264 -0.7440872
sample estimates:
       cor
-0.8676594
The correlation test above supports a strong negative relationship between
weight and mpg (r = -0.87, p < 0.001): as the weight of the car increases,
mpg decreases.
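As a check on the test output, the t statistic reported by cor.test can be recovered from the sample correlation r and the sample size n as t = r * sqrt(n - 2) / sqrt(1 - r^2). The following is a minimal sketch, assuming the built-in mtcars data; the names r, n, t_manual, and t_test are illustrative and do not come from the original analysis.

```r
# Recover the t statistic of the Pearson correlation test by hand.
mc <- mtcars                      # same copy of the data used in the report
r  <- cor(mc$wt, mc$mpg)          # sample correlation between weight and mpg
n  <- nrow(mc)                    # number of cars

# t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Compare with the statistic computed by cor.test()
t_test <- unname(cor.test(mc$wt, mc$mpg)$statistic)
print(c(manual = t_manual, cor.test = t_test))
```

The two values should agree, which shows that the p-value depends only on r and the number of observations.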
[Figure: scatterplot "weight and mpg comparison"; x-axis: weight, y-axis: mpg]
plot(mc$hp, mc$mpg,xlab="horse power",ylab="mpg",main="horse power and mpg comparison",col="green")
cor.test(mc$hp, mc$mpg)
Pearson's product-moment correlation
data: mc$hp and mc$mpg
t = -6.7424, df = 30, p-value = 1.788e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8852686 -0.5860994
sample estimates:
       cor
-0.7761684
The plot displays a possible negative relationship between horsepower and
mpg. The correlation test between these two variables provides sufficient
evidence of a strong negative relationship: as horsepower increases, mpg
decreases.
[Figure: scatterplot "horse power and mpg comparison"; x-axis: horse power, y-axis: mpg]
plot(mc$disp, mc$mpg,xlab="displacement",ylab="mpg",main="displacement and mpg comparison",col="red")
cor.test(mc$disp, mc$mpg)
Pearson's product-moment correlation
data: mc$disp and mc$mpg
t = -8.7472, df = 30, p-value = 9.38e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9233594 -0.7081376
sample estimates:
       cor
-0.8475514
The plot and the correlation test between displacement (disp) and mpg both
indicate a strong negative relationship: as displacement increases, mpg
decreases.
[Figure: scatterplot "displacement and mpg comparison"; x-axis: displacement, y-axis: mpg]
plot(mc$drat, mc$mpg,xlab="drat",ylab="mpg",main="drat and mpg comparison",col="black")
cor.test(mc$drat, mc$mpg)
Pearson's product-moment correlation
data: mc$drat and mc$mpg
t = 5.096, df = 30, p-value = 1.776e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4360484 0.8322010
sample estimates:
      cor
0.6811719
The plot and correlation test between drat (rear axle ratio) and mpg indicate
a moderately strong positive relationship: as drat increases, mpg increases.
[Figure: scatterplot "drat and mpg comparison"; x-axis: drat, y-axis: mpg]
plot(mc$qsec, mc$mpg,xlab="qsec",ylab="mpg",main="qsec and mpg comparison",col="black")
cor.test(mc$qsec, mc$mpg)
Pearson's product-moment correlation
data: mc$qsec and mc$mpg
t = 2.5252, df = 30, p-value = 0.01708
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.08195487 0.66961864
sample estimates:
     cor
0.418684
The plot and correlation test indicate that a weak to moderate positive
relationship (r = 0.42) may exist between qsec (quarter-mile time) and mpg.
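The five correlation tests above can also be collected into a single table so the strengths of association are easy to compare. This is a minimal sketch, assuming the built-in mtcars data; the names predictors and cor_table are illustrative and not part of the original analysis.

```r
mc <- mtcars
predictors <- c("wt", "hp", "disp", "drat", "qsec")

# One row per predictor: correlation with mpg and the test's p-value
cor_table <- do.call(rbind, lapply(predictors, function(v) {
  ct <- cor.test(mc[[v]], mc$mpg)
  data.frame(variable = v,
             cor      = unname(ct$estimate),
             p_value  = ct$p.value)
}))

# Sort by strength of association, strongest first
cor_table <- cor_table[order(-abs(cor_table$cor)), ]
print(cor_table, row.names = FALSE)
```

Sorting by absolute correlation puts weight first and qsec last, matching the order of strength described in the text.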
[Figure: scatterplot "qsec and mpg comparison"; x-axis: qsec, y-axis: mpg]
boxplot(mc$mpg~factor.cyl,xlab="cylinder",ylab="mpg",main="mpg and cylinder comparison",col=c(3,5,7))
This box plot suggests that as the number of cylinders increases, mpg
decreases.
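The boxplot call refers to factor.cyl, whose definition does not appear in this write-up. The following is a minimal sketch of how it and the other factor.* variables used in the later models could be defined, assuming they are plain factor() conversions of the corresponding mtcars columns. That is an assumption, though it is consistent with coefficient names such as factor.cyl6 and factor.am1 in the model summaries.

```r
mc <- mtcars

# Assumed definitions: categorical versions of the discrete columns
factor.cyl  <- factor(mc$cyl)    # 4, 6, or 8 cylinders
factor.vs   <- factor(mc$vs)     # engine shape: 0 = V-shaped, 1 = straight
factor.am   <- factor(mc$am)     # transmission: 0 = automatic, 1 = manual
factor.gear <- factor(mc$gear)   # number of forward gears
factor.carb <- factor(mc$carb)   # number of carburetors

# Boxplot of mpg by number of cylinders, as in the report
boxplot(mc$mpg ~ factor.cyl, xlab = "cylinder", ylab = "mpg",
        main = "mpg and cylinder comparison", col = c(3, 5, 7))

# Group medians make the direction of the trend explicit
cyl_medians <- tapply(mc$mpg, factor.cyl, median)
print(cyl_medians)
```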
[Figure: boxplot "mpg and cylinder comparison"; x-axis: cylinder, y-axis: mpg]
The data visualization suggests that mpg is related to several of the other
variables and that it varies substantially across them. I will now construct
regression models to determine which independent variables are statistically
significant predictors of the dependent variable mpg, and how much of the
variation in mpg they explain. I will build one model with each of three
methods (forward selection, backward elimination, and stepwise regression)
and then compare the results to determine which model fits the data best.
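R also automates this kind of search with step(). It adds or drops terms by AIC rather than by the F-test p-values used with add1() in this report, so it is a related technique and may not select the same models. The following is a minimal sketch, assuming the discrete columns are converted to factors in a copy of the data; the data frame mcf and the object names are illustrative.

```r
# Copy of mtcars with the discrete columns stored as factors (assumption)
mcf <- mtcars
for (v in c("cyl", "vs", "am", "gear", "carb")) mcf[[v]] <- factor(mcf[[v]])

null_fit <- lm(mpg ~ 1, data = mcf)   # intercept-only model
full_fit <- lm(mpg ~ ., data = mcf)   # all ten predictors

# Forward selection: start empty, add terms while AIC improves
fwd <- step(null_fit, scope = formula(full_fit),
            direction = "forward", trace = 0)

# Backward elimination: start full, drop terms while AIC improves
bwd <- step(full_fit, direction = "backward", trace = 0)

# Stepwise: start empty, allow both additions and removals
both <- step(null_fit, scope = formula(full_fit),
             direction = "both", trace = 0)

# Compare the formulas each search ends on
print(lapply(list(forward = fwd, backward = bwd, stepwise = both), formula))
```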
add1(lm(mc$mpg~1),scope=(~.+mc$disp+mc$hp+mc$drat+mc$wt+mc$qsec+factor.cyl+factor.vs+factor.am+factor.gear+factor.carb),test="F")
Single term additions
Model:
mc$mpg ~ 1
            Df Sum of Sq     RSS     AIC F value    Pr(>F)
<none>                   1126.05 115.943
mc$disp      1    808.89  317.16  77.397 76.5127 9.380e-10 ***
mc$hp        1    678.37  447.67  88.427 45.4598 1.788e-07 ***
mc$drat      1    522.48  603.57  97.988 25.9696 1.776e-05 ***
mc$wt        1    847.73  278.32  73.217 91.3753 1.294e-10 ***
mc$qsec      1    197.39  928.66 111.776  6.3767 0.0170820 *
factor.cyl   2    824.78  301.26  77.752 39.6975 4.979e-09 ***
factor.vs    1    496.53  629.52  99.335 23.6622 3.416e-05 ***
factor.am    1    405.15  720.90 103.672 16.8603 0.0002850 ***
factor.gear  2    483.24  642.80 102.003 10.9007 0.0002948 ***
factor.carb  5    500.56  625.49 107.129  4.1614 0.0065462 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
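The same first step can be written with a data argument, so that terms are referred to by column name instead of mc$...; this is a stylistic alternative, not the call used in the report. The following is a minimal sketch, assuming a copy mcf (a hypothetical name) in which the discrete columns are factors. It also picks out the candidate with the smallest p-value, which is the term a forward selection would enter first.

```r
mcf <- mtcars
for (v in c("cyl", "vs", "am", "gear", "carb")) mcf[[v]] <- factor(mcf[[v]])

fit0 <- lm(mpg ~ 1, data = mcf)       # intercept-only starting model

# F-test for adding each candidate term on its own
step1 <- add1(fit0,
              scope = ~ disp + hp + drat + wt + qsec + cyl + vs + am + gear + carb,
              test = "F")
print(step1)

# The candidate with the smallest p-value is the first to enter
candidates  <- step1[-1, ]            # drop the "<none>" row
first_entry <- rownames(candidates)[which.min(candidates[["Pr(>F)"]])]
print(first_entry)
```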
add1(lm(mc$mpg~1+mc$wt),scope=(~.+mc$disp+mc$hp+mc$drat+mc$qsec+factor.cyl+factor.vs+factor.am+factor.gear+factor.carb),test="F")
Single term additions
> model3<-lm(mc$mpg~mc$hp+mc$wt+factor.cyl+factor.am)
> summary(model3)
Call:
lm(formula = mc$mpg ~ mc$hp + mc$wt + factor.cyl + factor.am)
Residuals:
Min 1Q Median 3Q Max
-3.9387 -1.2560 -0.4013 1.1253 5.0513
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
mc$hp -0.03211 0.01369 -2.345 0.02693 *
mc$wt -2.49683 0.88559 -2.819 0.00908 **
factor.cyl6 -3.03134 1.40728 -2.154 0.04068 *
factor.cyl8 -2.16368 2.28425 -0.947 0.35225
factor.am1 1.80921 1.39630 1.296 0.20646
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.41 on 26 degrees of freedom
Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
> AIC(model3)
[1] 154.4669
The model above was constructed using the stepwise regression method.
Now that I have three regression models built with three different methods,
I can choose the best-fitting one. Below I rerun each model so their results
can be compared easily.
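The key fit statistics can also be gathered into one table instead of being read off three separate summaries. This is a minimal sketch that refits the three models with the formulas shown in this report; the factor definitions are assumed (they are not shown in the original), and the names models and comparison are illustrative.

```r
mc <- mtcars
factor.cyl <- factor(mc$cyl)   # assumed definition (not shown in the report)
factor.am  <- factor(mc$am)    # assumed definition (not shown in the report)

model   <- lm(mc$mpg ~ mc$wt + mc$hp)
modelII <- lm(mc$mpg ~ mc$hp + mc$wt + factor.cyl)
model3  <- lm(mc$mpg ~ mc$hp + mc$wt + factor.cyl + factor.am)

models <- list(model = model, modelII = modelII, model3 = model3)

# One row per model: adjusted R-squared, residual standard error, AIC
comparison <- data.frame(
  adj_r_squared = sapply(models, function(m) summary(m)$adj.r.squared),
  resid_se      = sapply(models, function(m) summary(m)$sigma),
  AIC           = sapply(models, AIC)
)
print(round(comparison, 4))
```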
> summary(model)
Call:
lm(formula = mc$mpg ~ mc$wt + mc$hp)
Residuals:
Min 1Q Median 3Q Max
-3.941 -1.600 -0.182 1.050 5.854
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.22727 1.59879 23.285 < 2e-16 ***
mc$wt -3.87783 0.63273 -6.129 1.12e-06 ***
mc$hp -0.03177 0.00903 -3.519 0.00145 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
F-statistic: 69.21 on 2 and 29 DF, p-value: 9.109e-12
> summary(modelII)
Call:
lm(formula = mc$mpg ~ mc$hp + mc$wt + factor.cyl)
Residuals:
Min 1Q Median 3Q Max
-4.2612 -1.0320 -0.3210 0.9281 5.3947
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.84600 2.04102 17.563 2.67e-16 ***
mc$hp -0.02312 0.01195 -1.934 0.063613 .
mc$wt -3.18140 0.71960 -4.421 0.000144 ***
factor.cyl6 -3.35902 1.40167 -2.396 0.023747 *
factor.cyl8 -3.18588 2.17048 -1.468 0.153705
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.44 on 27 degrees of freedom
Multiple R-squared: 0.8572, Adjusted R-squared: 0.8361
F-statistic: 40.53 on 4 and 27 DF, p-value: 4.869e-11
> AIC(model)
[1] 156.6523
> AIC(modelII)
[1] 154.4692
> AIC(model3)
[1] 154.4669
After examining and comparing all three regression models, I conclude that
weight explains the most variability in mpg in every model, with horsepower
second. model3 has the lowest AIC, a measure that balances goodness of fit
against model complexity by penalizing additional parameters, although its
AIC (154.4669) is only marginally below that of modelII (154.4692). model3
also has the highest adjusted R-squared (0.8401), meaning the model explains
approximately 84% of the variation in mpg after adjusting for the number of
predictors.
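For reference, AIC is defined as -2 * logLik + 2 * k, where k is the number of estimated parameters (the regression coefficients plus the residual variance). The 2 * k term is what penalizes larger models. The following is a minimal sketch reproducing AIC(model) by hand, assuming the two-predictor model is refit as shown; the names ll, k, and aic_manual are illustrative.

```r
mc <- mtcars
model <- lm(mc$mpg ~ mc$wt + mc$hp)

ll <- logLik(model)              # maximized log-likelihood of the fit
k  <- attr(ll, "df")             # 3 coefficients + 1 residual variance

# AIC = -2 * log-likelihood + 2 * (number of estimated parameters)
aic_manual <- -2 * as.numeric(ll) + 2 * k
print(c(manual = aic_manual, builtin = AIC(model)))
```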