PART I
1) Looking at the graph below, it is clear that the two pieces are not the same line.
This is confirmed by the sameline test I performed, as shown below.
The p-value exceeds the significance level by only 0.0003, so the result is
borderline. I do not consider that small excess enough to call the result
insignificant. Treating the p-value as significant, we reject the null
hypothesis β2 = 0 in favor of the alternative hypothesis β2 ≠ 0, where β2 is the
coefficient of the change-of-slope term.
2) (a) Extra sum of squares = SSE(R) – SSE(F) = 306.48106 – 277.86404 = 28.61702
F-value: F(1, 34) = [(SSE(R) – SSE(F)) / 1] / [SSE(F) / 34] = (28.61702 / 1) / (277.86404 / 34) = 3.50163
(b) F(1, 34) = 3.50, p-value = 0.0699. Conclusion: Fail to reject H0: β7 = 0.
(c) t* = −1.87, p-value = 0.0699, (t*)² = 3.4969
A t-test with n degrees of freedom is equivalent to an F-test with (1, n) degrees of
freedom. The t-value squared gives the F-value from the previous parts of the
problem (3.4969 versus 3.50163; the small difference comes from rounding t* to two
decimal places). Hence, both tests produce the same p-value.
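The arithmetic in parts (a) through (c) can be checked with a short DATA step. This is a minimal sketch and not part of the submitted code: the variable names (sse_r, sse_f, df_f, and so on) are made up for illustration, and the only inputs are the values reported above. PROBF and PROBT are the SAS cumulative distribution functions for the F and t distributions.

```sas
data _null_;
  sse_r = 306.48106;            * SSE of the reduced model, without sum;
  sse_f = 277.86404;            * SSE of the full model, with sum;
  df_f  = 34;                   * error degrees of freedom of the full model;
  f     = ((sse_r - sse_f) / 1) / (sse_f / df_f);   * general linear F statistic;
  p_f   = 1 - probf(f, 1, df_f);                    * upper-tail p-value for F(1, 34);
  t     = -1.87;                * t statistic for sum as reported, already rounded;
  t_sq  = t**2;                 * should be close to f;
  p_t   = 2 * (1 - probt(abs(t), df_f));            * two-sided p-value for t(34);
  put f= p_f= t_sq= p_t=;       * write the results to the log;
run;
```

The log should show f and t_sq agreeing up to the rounding of t*, with both p-values near the reported 0.0699.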
3) Σ Type-I SS = 632.17685     Σ Type-II SS = 104.00489
The Type-I SS sum to SSM.
The two types of SS are equal for the Danger predictor. This is because Danger is
the last variable in the model, so both types condition on the same set of predictors:
SS(Danger | BodyWt BrainWt Dreaming LifeSpan Gestation Predation Exposure)
so we get the same value.
4) R² for each set of explanatory variables:
Explanatory Variables                                                 R²
BodyWt BrainWt Dreaming LifeSpan Gestation Predation Exposure Danger  0.6952
BodyWt                                                                0.1175
BrainWt                                                               0.1136
Dreaming                                                              0.5287
NonDreaming                                                           0.9364
LifeSpan                                                              0.1463
Gestation                                                             0.3776
Predation                                                             0.0078
Exposure                                                              0.3861
Danger                                                                0.3652
BodyWt BrainWt                                                        0.1186
BodyWt BrainWt Sum                                                    0.4118
LifeSpan Gestation                                                    0.3780
Dreaming Sum                                                          0.6346
NonDreaming Sum                                                       0.9376
Predation Exposure Gestation                                          0.5404
BodyWt BrainWt Dreaming                                               0.5991
PART II
1) These are the initial scatter plots for each variable:
I decided that BodyWt, BrainWt, Gestation, and Lifespan needed to be transformed
because their relationships with TotalSleep are all non-linear.
At the professor’s suggestion, I took the ratio of BrainWt to BodyWt to transform
those two. I also took the inverse of both Gestation and Lifespan.
LifeInverse ended up being fairly non-linear, so I checked two further
transformations: square root and log. I decided to go with logLifeInv as it is more
linear.
Note: From here onward I use NonDreaming and drop Dreaming
because NonDreaming has a more linear relationship with TotalSleep.
NonDreaming also has a lower or equal correlation with each of the other
variables compared to Dreaming (see below).
I ran the correlation procedure and found that Predation, Exposure, and Danger
were all highly correlated. Thus, I removed Danger, as it had the highest
correlation with the other two (it is also the least linear of the three).
Box-Cox gave an optimal λ of 0.75, which is close enough to 1 that I decided Y did not need to be transformed.
I ran regression with the following model to obtain residuals:
TotalSleep = NonDreaming + Predation + Exposure + BrainBody + GestInv + logLifeInv
Thus, I obtained the following residual plots:
I also produced the following histogram and QQ-plot, which both indicate the
residuals are approximately Normal.
I also checked TotalSleep with a QQ-plot and found that it was approximately
Normal as well.
In summary, I conclude that
TotalSleep = NonDreaming + Predation + Exposure + BrainBody + GestInv + logLifeInv
is a good starting model.
2) Mallows’ Cp reported the following models (I’ve only included the first 12):
I’ve highlighted my selection for best model in blue. I would use this model
because it has Cp < p, which is good, and a very high R². Adding more variables
doesn’t increase the R² value very much, so it would be unnecessary. I don’t
want to include the BrainBody term, since a negative coefficient would not make
sense for this variable: it is positively correlated with TotalSleep, as
evidenced by its scatter plot. Given the somewhat high (~0.62) correlation
between Predation and Exposure, it does not make sense to have both in the
model, and as you can see, replacing Predation with Exposure (last entry) results
in Cp > p, which would not be a good model.
The following is a summary of the part of the table I have removed.
The Cp values slowly increase as different combinations of variables are tried,
as long as NonDreaming stays in the model. Once NonDreaming is
removed, the Cp values skyrocket, increasing by approximately 2000% for the
“best” model without NonDreaming (highlighted in yellow below), and increasing
further from there on as different combinations without NonDreaming are tried.
This result makes sense because TotalSleep = NonDreaming + Dreaming, so
NonDreaming is the most influential explanatory variable.
In conclusion, I choose the best model to be:
TotalSleep = β0 + β1NonDreaming – β2Predation + β3GestationInverse
3) The stepwise selection method produced the following result:
As this is the same model as selected above, I will not restate my reasons for
selecting it. However, some points are worth noting:
- Predation contributes very little to R²; however, if we want to satisfy Cp < p,
  it is necessary to include it. This is also true of GestationInverse.
- As pointed out above, NonDreaming contributes very heavily to
  TotalSleep, which is why it has such a large partial R².
- The stepwise selection produces the same “best” model as the Cp
  criterion.
To reiterate, I choose the best model to be:
TotalSleep = β0 + β1NonDreaming – β2Predation + β3GestationInverse*
*Note: On Mixable I saw that Professor Sharabati suggested keeping
Dreaming and throwing out NonDreaming. I tested my model at every step after
switching NonDreaming with Dreaming and had much worse results, which
prompted me to continue using NonDreaming.
4) The residual plot for NonDreaming appears to be okay, except for a possible
outlier at ~ −2.75.
The residual plot for Predation is fine, with no discernible pattern.
However, the residual plot for GestationInverse indicates that the constant variance
assumption may be violated. We can also see the same possible outlier at ~ −2.75.
Summary of residuals:
The GestationInverse residual plot makes me cautious, but I would not
reject the model just yet.
I see no reason to assume the responses are not independent.
The following histogram and QQ-plot indicate that the residuals are
approximately Normally distributed. The histogram suggests that the possible
outlier is probably not an outlier, but I will confirm this in the next question.
Based on the scatter plots in Question 1, I would say the linearity assumption is
not violated.
Overall, I would say this is an acceptable model to use with some caution, because
the GestationInverse residuals show a slight problem with constant variance.
5) I use the Studentized Residuals and Cook’s Distance to check for outliers and
influential observations. I use VIF to check for multicollinearity.
VIF results: All VIF scores are well below the threshold for determining
multicollinearity. I conclude that multicollinearity is not a problem in the model.
For the residuals I only include output for the most influential/unusual points:
Looking at the Studentized Residuals and fences, none of the largest are
considered outliers.
Cook’s Distance critical F-value = F(0.5; 4, 40) = 0.85356585
As you can see, none of the largest Cook’s D values come close to exceeding
the critical value.
In conclusion, the Studentized Residuals and Cook’s Distance show that the suspected
outlier from the residual and QQ-plots is in fact not an outlier, and that there are
no influential observations.
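The critical value used for Cook’s Distance is the median of an F distribution with (4, 40) degrees of freedom. A minimal sketch of how it can be reproduced in SAS, taking the degrees of freedom quoted in the report as given; the variable name crit is made up for illustration, and FINV is the SAS quantile function for the F distribution.

```sas
data _null_;
  crit = finv(0.5, 4, 40);   * 50th percentile of F(4, 40);
  put crit=;                 * compare with the 0.85356585 quoted in the report;
run;
```

Under this rule, an observation would be flagged as influential only if its Cook’s D exceeded crit.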
6) (a) Ŷ = 1.153133 + 1.05652(NonDreaming) – 0.28902(Predation)
+ 35.46003(GestationInverse)
(b) 90% C.I. for μh: Highlighted in green below (first 20 obs.)
(c) 90% P.I. for Yh(new): Highlighted in pink below (first 20 obs.)
(d) 90% C.I. for βi: Highlighted in blue below
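Intervals like those in parts (b) through (d) can be requested directly from PROC REG. The following is a sketch under assumptions, not necessarily the code used for this report: it assumes the data set sleep already contains the transformed variable gestinv created in Part II, and the output data set name sleepci and the variable names lo_mean, hi_mean, lo_pred, and hi_pred are made up for illustration. CLM, CLI, and CLB are MODEL statement options for mean-response confidence limits, prediction limits, and parameter confidence limits; ALPHA=0.10 sets the 90% level.

```sas
proc reg data = sleep alpha = 0.10;   * 90% level for all intervals;
  model totalsleep = nondreaming predation gestinv / clm cli clb;
  * save fitted values and both sets of limits for later printing;
  output out = sleepci p = pred
         lclm = lo_mean uclm = hi_mean
         lcl  = lo_pred ucl  = hi_pred;
run;
quit;
```

CLM gives the limits for the mean response in (b), CLI gives the wider prediction limits in (c), and CLB adds confidence limits for each βi to the parameter estimates table as in (d).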
SAS CODE
*data imported using File menu
PART 1 ;
symbol1 v=dot i=sm75S;
proc gplot data = sleep;
plot TotalSleep* (BodyWt BrainWt NonDreaming Dreaming Lifespan
Gestation Predation Exposure Danger);
run; *I used this to figure out which variable I wanted to use;
quit;
data sleep; *I decided on gestation, so I create the cslope term;
set sleep;
if gestation le 175
then cslope=0;
if gestation gt 175
then cslope=(gestation-175);
proc reg data=sleep; *regression to get equation;
model totalsleep=gestation cslope / p;
output out=sleepoutpred p=pred;
sameline: test cslope; *sameline test;
run;
quit;
symbol1 v=circle i=none c=black;
symbol2 v=none i=join c=red;
title1 'Question 1 - Piecewise Regression';
title2 'Scott Cunningham';
axis1 label = (angle=90 'TotalSleep');
proc sort data=sleepoutpred; by gestation;
proc gplot data=sleepoutpred;
plot (totalsleep pred)*gestation / overlay
vaxis=axis1;
run;
quit; *plotting the graph;
* END PROBLEM 1
--------------------------------------------------------------
PROBLEM 2 ;
data sleep; *creating sum;
set sleep;
sum = lifespan+gestation;
proc reg data = sleep; *running the two regressions;
model totalsleep = bodywt brainwt dreaming predation exposure danger;
model totalsleep = bodywt brainwt dreaming predation exposure danger
sum;
nilsum: test sum; *F-test;
run;
quit;
* END PROBLEM 2
-------------------------------------------------------------
PROBLEM 3 ;
proc reg data = sleep;
model totalsleep = bodywt brainwt dreaming lifespan gestation predation
exposure danger / ss1 ss2;
run;
quit;
* END PROBLEM 3
--------------------------------------------------------------
PROBLEM 4 ;
proc reg data = sleep;
model totalsleep = bodywt;
model totalsleep = brainwt;
model totalsleep = dreaming;
model totalsleep = nondreaming;
model totalsleep = lifespan;
model totalsleep = gestation;
model totalsleep = predation;
model totalsleep = exposure;
model totalsleep = danger;
model totalsleep = bodywt brainwt;
model totalsleep = bodywt brainwt sum;
model totalsleep = lifespan gestation;
model totalsleep = dreaming sum;
model totalsleep = nondreaming sum;
model totalsleep = predation exposure danger;
model totalsleep = bodywt brainwt dreaming;
run;
quit;
* END PROBLEM 4
--------------------------------
PART 2
PROBLEM 1 ;
symbol1 v=dot i=sm75S;
title1 'Question 1 - Scatter Plot with Smoothing Curve';
title2 'Scott Cunningham';
proc gplot data = sleep;
plot TotalSleep* (BodyWt BrainWt NonDreaming Dreaming Lifespan
Gestation Predation Exposure Danger) / vaxis=axis1;
run; *to examine the response against each explanatory variable;
quit;
data sleep; *creating transforms of the variables I think need it;
set sleep;
brainbody = brainwt/bodywt;
gestinv = 1/gestation;
lifeinv = 1/lifespan;
proc gplot data = sleep;
plot TotalSleep*(brainbody gestinv lifeinv) / vaxis=axis1;
run; *checking the new transformed variables;
quit;
data sleep; *checking two possible further transformations;
set sleep;
loglifeinv=log(lifeinv);
sqrtlifeinv=sqrt(lifeinv);
proc gplot data = sleep; *checking again;
plot TotalSleep*(loglifeinv sqrtlifeinv) / vaxis=axis1;
run;
quit;
proc corr data=sleep; *checking correlation between explanatory variables;
var nondreaming dreaming predation exposure danger brainbody gestinv
loglifeinv;
proc transreg data = sleep; *performing Box-Cox to check if Y needs to be
transformed;
model boxcox(totalsleep)=identity(nondreaming predation exposure
brainbody gestinv loglifeinv);
run;
quit;
proc reg data = sleep;
model totalsleep = nondreaming predation exposure brainbody gestinv
loglifeinv / r;
output out=sleepoutresid r=resid;
run; *computing the residuals;
quit;
symbol1 v=dot i=none;
title1 'Question 1 - Residual Plot';
title2 'Scott Cunningham';
axis1 label = (angle=90 'Residual');
proc gplot data = sleepoutresid;
plot resid*(nondreaming predation exposure brainbody gestinv
loglifeinv) / vref=0
vaxis=axis1;
proc univariate data=sleepoutresid noprint;
qqplot resid totalsleep / normal (L=1 mu=est sigma=est)
odstitle='Question 1 - QQ-plot'
odstitle2='Scott Cunningham';
histogram resid / odstitle='Question 1 - Histogram'
odstitle2='Scott Cunningham'
normal(noprint);
run;
quit; *these graphs are just sort of a final check to see if I did well in
refining the model;
*END PROBLEM 1
---------------------------------------------------
PROBLEMS 2 & 3 ;
proc reg data = sleep;