This work was done as part of graduate coursework at the University of Florida. The author was a master's student in Environmental Engineering Sciences at the time of this presentation.
I. PRELIMINARY ANALYSIS
1. Plot Y against each individual X:
(Anti-clockwise) We observe that Y–X1 is more scattered, Y–X2 is close to linear, and Y is higher for X3 = 1 than for X3 = 0; the X1–X3 and X2–X3 plots are not very informative.
2. Run Naïve Model: 𝑌̂ = β0 + β1X1 + β2X2 + β3X3 + β4A
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1082.3793 165.4371 -6.543 8.17e-08 ***
X1 0.2397 3.4967 0.069 0.946
X2 1.2993 0.1620 8.019 7.45e-10 ***
X3 67.9740 47.9893 1.416 0.164
A 2.8016 7.1449 0.392 0.697
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 120.4 on 40 degrees of freedom
Multiple R-squared: 0.8635, Adjusted R-squared: 0.8498
F-statistic: 63.24 on 4 and 40 DF, p-value: < 2.2e-16
3. Only β2 is significant at the 95% confidence level. R² = 86% and R²-adj = 85% imply the model explains about 85% of the variability in the data and is concise. The F-statistic p-value < 0.05 implies at least one of the β's is significant at the 95% confidence level. In addition, AIC = 565.61.
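The fit statistics quoted here (R², R²-adj, AIC) all follow from the residuals. A minimal sketch of the formulas is given below in Python, purely for illustration (the report's own analysis is in R, and `fit_metrics` is a hypothetical helper, not part of the report's code); the AIC form matches what R's AIC() reports for a Gaussian lm fit.

```python
import math

def fit_metrics(y, y_hat, p):
    """R^2, adjusted R^2 and AIC for an OLS fit with p estimated
    coefficients (including the intercept), computed from residuals."""
    n = len(y)
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - sse / sst
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p)
    # Gaussian log-likelihood AIC (k = p + 1 parameters: the
    # coefficients plus the error variance)
    aic = n * math.log(2 * math.pi * sse / n) + n + 2 * (p + 1)
    return r2, r2_adj, aic
```

Penalizing by (n − 1)/(n − p) is what makes R²-adj, unlike R², able to fall when a useless predictor is added.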
3. Correlation Matrix
Y X1 X2 X3 A
Y 1.00 0.77 0.93 0.50 -0.51
X1 0.77 1.00 0.83 0.47 -0.48
X2 0.93 0.83 1.00 0.46 -0.57
X3 0.50 0.47 0.46 1.00 -0.32
A -0.51 -0.48 -0.57 -0.32 1.00
There is a strong correlation between Y–X2 and between X1–X2. Rejection region for r: with t(α/2, n−2) = 2.016, r_min = 0.29, so any correlation above 0.29 in absolute value is significant.
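The r_min threshold comes from solving the t-statistic for a correlation, t = r·sqrt(n−2)/sqrt(1−r²), for r. A quick check of the quoted value (Python used only for illustration):

```python
import math

# Significance threshold for a sample correlation coefficient:
# |r| is significant when |r| > t_crit / sqrt(t_crit^2 + n - 2),
# with t_crit = t(alpha/2, n - 2). Here n = 45 and t_crit = 2.016.
n, t_crit = 45, 2.016
r_min = t_crit / math.sqrt(t_crit ** 2 + n - 2)
print(round(r_min, 2))  # -> 0.29
```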
4. Assumptions to be met:
i. Normality of the error terms
ii. Independence in space and time
iii. Constant variance (homoscedasticity) of the error terms
iv. Errors independent (uncorrelated) of the predictors
The naïve model is found to violate most of these assumptions, as explained in the Appendix. In addition, R²-adj and AIC can be improved: a higher R²-adj and a lower AIC are desired.
II. A BETTER MODEL
5. Improve R²-adj and AIC
The following approaches are tried to see how the model performs:
i. Adding interaction terms
ii. Double log model
iii. Polynomial function
The model explains the variability better when an interaction term is added. Log and polynomial terms were also attempted, but the accuracy improvement was not significant. One of the better models, which is also simpler, is shown below.
Call:
lm(formula = Y ~ X1 + X2 + X3 + X1:X2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.771e+03 4.499e+02 3.936 0.000322 ***
X1 -6.394e+01 1.037e+01 -6.168 2.75e-07 ***
X2 -1.078e+00 3.852e-01 -2.799 0.007855 **
X3 8.190e+01 3.386e+01 2.419 0.020217 *
X1:X2 5.275e-02 8.278e-03 6.372 1.42e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 85.01 on 40 degrees of freedom
Multiple R-squared: 0.932, Adjusted R-squared: 0.9252
F-statistic: 137 on 4 and 40 DF, p-value: < 2.2e-16
> AIC(y.reg)
[1] 534.2501
All β's are significant at the 95% confidence level. R² = 93.2% (up from 86%) and R²-adj = 92.52% (up from 85%) have improved significantly. The F-statistic = 137 (up from 63) and its p-value < 0.05 imply at least one of the β's is significant (≠ 0) at the 95% confidence level. In addition, AIC = 534.25 (down from 565.61) has decreased, which is a good sign.
As an important assumption, normality is checked first. The Anderson–Darling test is performed, as it is more robust than Shapiro–Wilk for large sample sizes. It yields p-value = 0.006 < 0.05; hence H0 (normality) is rejected. As a remedy, a Box–Cox (power) transformation is applied. The estimated power for the dependent variable Y is −0.277.
The transformed response is Y* = Y^(-0.277). The transformed model is then:
Y* = 2.998 - 6.471*10^-3*X1 - 5.618*10^-5*X2 + 1.814*10^-2*X3 - 2.139*10^-3*A + 6.259*10^-6*X1*X2
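Since the fitted power is negative, predictions on the transformed scale must be inverted back to dollars before interpretation. A minimal sketch of the round trip, using the simple power form Y* = Y^λ stated in the report (Python used only for illustration; λ = −0.277 is the report's fitted value):

```python
# Box-Cox power from the report (simple power form, not (y^l - 1)/l)
LAM = -0.277

def to_transformed(y):
    """Y* = Y^lambda: maps the response onto the transformed scale."""
    return y ** LAM

def to_original(y_star):
    """Invert the power transform to get back to the original units."""
    return y_star ** (1 / LAM)

# Round trip should recover the original value (up to float error)
y = 800.0
assert abs(to_original(to_transformed(y)) - y) < 1e-6
```

Note that because λ < 0, the transform reverses order: larger Y maps to smaller Y*, so coefficient signs on the transformed scale must be read with care.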
Call:
lm(formula = Y.transform ~ X1 + X2 + X3 + X1:X2 + A)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.998e+00 9.894e-02 30.297 < 2e-16 ***
X1 -6.471e-03 2.197e-03 -2.945 0.005418 **
X2 -5.618e-05 8.309e-05 -0.676 0.502958
X3 1.814e-02 7.115e-03 2.549 0.014834 *
A -2.139e-03 1.069e-03 -2.001 0.052373 .
X1:X2 6.259e-06 1.755e-06 3.568 0.000973 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01782 on 39 degrees of freedom
Multiple R-squared: 0.9377, Adjusted R-squared: 0.9297
F-statistic: 117.3 on 5 and 39 DF, p-value: < 2.2e-16
[1] -227.1882
Though the β values for X2 and A are insignificant at the 95% confidence level, we keep them because they help the model conform to the normality assumption. Analysis shows that removing X2 and A yields a slightly better fit but violates normality. R²-adj = 92.97% is significantly high, the F-statistic p-value < 0.05, and AIC is reduced to −227.18, which is a good sign.
The new model conforms to the normality test. The assumptions are detailed in the following section.
III. CHECK FOR ASSUMPTIONS
6. Test for Normality: The following methods can be used
i. Shapiro-Wilk
ii. Kolmogorov-Smirnov
iii. D’Agostino’s Battery of Tests
The sample size is relatively small (n = 45), hence the Shapiro–Wilk test is performed; its p-value = 0.2489 > 0.05, so we fail to reject H0: the distribution is normal.
7. Test for independence: The following tests can be used to check whether the error terms are independent in space and time.
i. Runs test
ii. Durbin Watson test
The runs test is performed here. In the figure above, the plot on the right shows the distribution of residuals about the mean; no significant spatial dependence is observable. The p-value = 0.453 > 0.05, so we fail to reject H0: the error terms are independent.
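The runs test counts sign changes in the residual sequence: too few runs suggests positive autocorrelation, too many suggests alternation. A pure-Python sketch of the Wald–Wolfowitz idea behind lawstat::runs.test, using the normal approximation (for illustration only; the report's test is run in R):

```python
import math

def runs_test(resid):
    """Wald-Wolfowitz runs test on the signs of the residuals.
    H0: the sequence is random (errors independent).
    Returns (number of runs, z statistic, two-sided p-value)."""
    signs = [r > 0 for r in resid if r != 0]
    n1 = sum(signs)              # residuals above zero
    n2 = len(signs) - n1         # residuals below zero
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    mu = 2 * n1 * n2 / (n1 + n2) + 1
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (runs - mu) / math.sqrt(var)
    # two-sided p-value from the standard normal CDF (via erf)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return runs, z, p
```

A strictly alternating residual sequence gives the maximum number of runs and a tiny p-value; a random-looking one gives p well above 0.05, matching the report's fail-to-reject outcome.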
8. Test for homoscedasticity: Homoscedasticity (homogeneity of variance) can be tested with the Bartlett, Levene, and Fligner–Killeen tests. The Bartlett test is roughly valid only when the data are normally distributed. Levene's test is performed here; it uses a non-parametric approach and is more powerful than Bartlett's.
Levene's test yields p-value = 0.92 > 0.05. Hence we fail to reject H0: homogeneity of variance.
IV. RELIABILITY AND ROBUSTNESS
9. Coefficient of Determination (R²-adjusted): The transformed model has R²-adj = 92.97%, meaning it explains up to 92.97% of the variability.
10. Fisher–Snedecor Statistic: The transformed model has F-statistic = 117.3 with p-value ≈ 0, which means the model explains the variability well.
11. Akaike Information Criterion (AIC): A measure of the relative quality of statistical models. The transformed model yields AIC = −227.18, much lower than the naïve model's.
12. Modified Coefficient of Efficiency (E*): E* is less sensitive to large values because its terms are not squared. This model has E* = 0.7569.
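E* replaces the squared deviations of the classical Nash–Sutcliffe efficiency with absolute deviations, so a few large errors cannot dominate the score. A minimal sketch (Python used for illustration; `modified_efficiency` is a hypothetical helper, not the report's MODEL.ACCURACY code):

```python
def modified_efficiency(obs, pred):
    """Modified coefficient of efficiency E* (Legates & McCabe form):
    E* = 1 - sum|obs - pred| / sum|obs - mean(obs)|.
    Absolute deviations make E* less sensitive to extreme values."""
    o_bar = sum(obs) / len(obs)
    num = sum(abs(o - p) for o, p in zip(obs, pred))
    den = sum(abs(o - o_bar) for o in obs)
    return 1 - num / den
```

E* = 1 indicates a perfect fit; E* = 0 means the model is no better than predicting the mean, so the report's 0.7569 sits comfortably on the good side.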
There are other testing procedures available, but these are among the most powerful. With these we can finalize our model.
The best-fit regression model is:
Y^(-0.277) = 2.998 - 6.471*10^-3*X1 - 5.618*10^-5*X2 + 1.814*10^-2*X3 - 2.139*10^-3*A + 6.259*10^-6*X1*X2
Y: Monthly Mortgage Payment ($)
X1: Household Disposable Income (x1000 $)
X2: Square Footage of housing units
X3: Mortgage Type
A: Housing Unit’s Age
V. LIMITATIONS
13. Some model-validation tests are not performed explicitly but are implicit in the model. Along these lines, the following apply:
i. Multicollinearity: A test for multicollinearity is not systematically performed. It requires regressing each independent variable on the rest of the independent variables and computing the coefficient of determination (and hence the variance inflation factor) for each regression. In this model, multicollinearity exists with X2, but removing X2 is found to violate normality and other assumptions.
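The screening rule described in (i) can be stated compactly: each auxiliary regression's R² feeds a variance inflation factor. A one-line sketch of the rule, with the conventional cut-offs noted as assumptions (Python used for illustration only):

```python
# VIF from an auxiliary regression's R^2: regress predictor j on the
# other predictors, take that fit's R^2, then VIF_j = 1 / (1 - R^2).
# Common (conventional, not from the report) flags: VIF > 10, or > 5.
def vif(r2_aux):
    return 1.0 / (1.0 - r2_aux)

assert round(vif(0.9), 1) == 10.0   # R^2 = 0.9 inflates variance 10x
```

Given the 0.83 correlation between X1 and X2 in the matrix above, an auxiliary R² near 0.7 or higher for X2 would be unsurprising, which is consistent with the multicollinearity the report notes.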
ii. Polynomial Interaction Terms: Since R2-adj is above 90%, polynomial interaction terms are not included in
the final model.
iii. Outliers: Outliers are not identified in the model, i.e., points with high deviations from the predicted values (usually standardized residuals |ε̂i| > 2σ). It can be noticed that two points lie outside ±2σ; removing them might give a better model.
iv. Since not many models are formulated, Mallows' Cp, which identifies the best model among a set of candidates, is not implemented here.
v. In MODEL.ACCURACY(), a function created to check reliability, the significance test for β (≠ 0) is not explicitly performed, but it is implied by the t-statistic of each individual β.
VI. APPENDIX
The following program, written in R, is used to generate the above results.
#read data
raw.data=read.csv("~/My R Codes/Data/LabDataGEO6161.csv",header=T)
attach(raw.data); length(Y)
#split window
dev.list()
mat=matrix(c(1,2,1,2,3,4,3,5),2,4)
layout(mat); layout.show(5)
#scatter plot for each variable
plot(X1,Y,main="Y - X1",ylab="Y",las=1)
plot(X2,Y,main="Y - X2",ylab="Y",las=1)
boxplot(Y~X3,main="Y - X3",ylab="Y",xlab="X3",las=1)
boxplot(X1~X3,main="X1 - X3",ylab="X1",xlab="X3",las=1)
boxplot(X2~X3,main="X2 - X3",ylab="X2",xlab="X3",las=1)
TEST.ASSUMPTIONS = function(reg.sample, Yi) {
  error.sample = rstandard(reg.sample)
  mat = matrix(c(1,1,2,3), 2, 2); layout(mat)
  # Anderson-Darling test for normality of the standardized residuals
  qqnorm(error.sample, datax=TRUE); qqline(error.sample, datax=TRUE)
  p.norm = nortest::ad.test(error.sample)$p.value
  norm = ifelse(p.norm <= 0.05, "Ha: Normality Violated", "Ho: Normality Verified")
  # Runs test for independence of the residuals
  p.ind = lawstat::runs.test(error.sample, plot.it=TRUE, alternative="two.sided")$p.value
  ind = ifelse(p.ind <= 0.05, "Ha: Independence Violated", "Ho: Independence Verified")
  # Levene's test for homogeneity of variance (observed vs fitted groups)
  group.levene = as.factor(c(rep(1, length(Yi)), rep(2, length(reg.sample$fitted.values))))
  y.combined = c(Yi, reg.sample$fitted.values)
  p.var = lawstat::levene.test(y.combined, group.levene)$p.value
  var = ifelse(p.var <= 0.05, "Ha: Variance Violated", "Ho: Homoscedastic Variance")
  plot(error.sample ~ fitted.values(reg.sample), xlab=expression(hat(y)),
       ylab="std res.", main="Homogeneity / Fit")
  abline(h=0)
  RESULTS = list("Normality:" = c(round(p.norm,4), norm),
                 "Independence:" = c(round(p.ind,4), ind),
                 "Variance:" = c(round(p.var,4), var))
  return(RESULTS)
}
MODEL.ACCURACY = function(reg.sample, Yi) {
  r.sq.adj = summary(reg.sample)$adj.r.squared
  fstat = summary(reg.sample)$fstatistic