Running head: Statistics
Statistics
Name:
Course:
Instructor:
Institution:
Date of Submission:
Assignment #4: Model Diagnostics
A fundamental assumption of the classical linear regression model is that the regression error term is normally distributed with zero mean and constant variance (Greene, 2008). The results of the normality tests are presented below.
All the reported probability values are greater than the threshold of 0.05, so the null hypothesis that the regression residuals are normally distributed could not be rejected at the 5 per cent significance level. The residuals from the estimated equations are therefore treated as following a normal distribution. Since any linear function of normally distributed variables is itself normally distributed, normality of the residuals implies that the coefficient estimates are also normally distributed (Gujarati, 2004).
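The decision rule above can be illustrated with a short sketch. The report does not say which normality test was used, so the Jarque-Bera test here is an assumption chosen because its asymptotic p-value has a closed form; the function name `jarque_bera` is illustrative, and the residuals are the 19 values listed in the data behind the Age residual plot. With only 19 observations the asymptotic p-value is rough, so this is a sanity check rather than a replacement for the reported tests.

```python
import math

# The 19 residuals listed in the data behind the Age residual plot.
RESIDUALS = [
    0.50878516451792, 1.7144464705013149, -0.42159945482941003,
    0.54117037792769895, 0.71299080887547295, 1.269413932725179,
    0.26951686627728799, 0.22131431339594501, -0.13472012994437299,
    0.22075061567252199, -1.3199768562363781, -0.18681496091028299,
    0.380020030213299, -1.451131014273024, -0.56052688701790399,
    -0.116260966970037, -0.67291294283960901, -0.49015761805784802,
    -0.48430774902780499,
]


def jarque_bera(values):
    """Return (JB statistic, asymptotic p-value) for a sample."""
    n = len(values)
    mean = sum(values) / n
    # Central moments with divisor n, as in the usual JB definition.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # JB is asymptotically chi-square with 2 df, whose survival
    # function is exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


jb_stat, p_value = jarque_bera(RESIDUALS)
ALPHA = 0.05
# Normality is rejected only when the p-value falls below ALPHA.
print("reject normality" if p_value < ALPHA else "fail to reject normality")
```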
The residual plot is shown below:
The residual plot shows that all the residuals fall within the standard error bands, which suggests that the model is stable and can be used for forecasting.
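A check of this kind can be sketched in code. The report does not state how wide the bands are, so the multiplier `k` is left as a parameter, and the residual standard error assumes a regression with an intercept and one slope (`n_params=2`); both are assumptions, as are the function names. The ages and residuals are the 19 pairs listed in the data behind the Age residual plot.

```python
import math

# Age values and residuals listed in the data behind the Age residual plot.
AGES = [60, 30, 62, 44, 0, 30, 62, 68, 46, 56, 36, 28, 0, 0, 34, 26, 52, 50, 44]
RESIDUALS = [
    0.50878516451792, 1.7144464705013149, -0.42159945482941003,
    0.54117037792769895, 0.71299080887547295, 1.269413932725179,
    0.26951686627728799, 0.22131431339594501, -0.13472012994437299,
    0.22075061567252199, -1.3199768562363781, -0.18681496091028299,
    0.380020030213299, -1.451131014273024, -0.56052688701790399,
    -0.116260966970037, -0.67291294283960901, -0.49015761805784802,
    -0.48430774902780499,
]


def residual_standard_error(residuals, n_params=2):
    """Square root of SSE / (n - n_params); n_params=2 is an assumption."""
    dof = len(residuals) - n_params
    return math.sqrt(sum(r * r for r in residuals) / dof)


def outside_band(ages, residuals, k, n_params=2):
    """Return the (age, residual) pairs with |residual| > k * standard error."""
    s = residual_standard_error(residuals, n_params)
    return [(a, r) for a, r in zip(ages, residuals) if abs(r) > k * s]


s = residual_standard_error(RESIDUALS)
for k in (2.0, 3.0):
    flagged = outside_band(AGES, RESIDUALS, k)
    print(f"k={k}: band = +/-{k * s:.3f}, points outside = {len(flagged)}")
```

Whether every residual lies inside the band depends on the chosen `k`, so the width used for the plot should be stated alongside the conclusion.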
References
Greene, W. (2008). Econometric analysis (6th ed.). New Jersey: Pearson-Prentice Hall.
Gujarati, D. (2004). Basic econometrics (4th ed.). New York: McGraw Hill Companies.
[Figure: Normal Probability Plot; x-axis: Sample Percentile, y-axis: MedianSchoolYears]
[Figure: Age Residual Plot; x-axis: Age, y-axis: Residuals]
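The horizontal axis of the normal probability plot can be reproduced with a short sketch. The plotted percentiles (2.63, 7.89, ..., 97.37) match the rule 100 * (i - 0.5) / n for n = 19 sorted observations; the function name `sample_percentiles` is illustrative, and the MedianSchoolYears values are the 19 plotted values taken from the figure data.

```python
# The 19 MedianSchoolYears values from the normal probability plot data.
SCHOOL_YEARS = [
    10.7, 11.3, 11.8, 11.9, 12, 12, 12, 12.4, 12.5, 12.6,
    13.1, 13.2, 13.4, 13.5, 13.5, 14.2, 14.5, 14.5, 14.6,
]


def sample_percentiles(n):
    """Percentile assigned to the i-th smallest of n observations."""
    return [100.0 * (i - 0.5) / n for i in range(1, n + 1)]


percentiles = sample_percentiles(len(SCHOOL_YEARS))
# Pair each percentile with the corresponding sorted observation.
for pct, years in zip(percentiles, sorted(SCHOOL_YEARS)):
    print(f"{pct:8.4f}  {years}")
```

If the plotted points lie close to a straight line, the plot gives no visual evidence against normality.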
Running head: WEEK 3 ASSIGNMENT 4
Week 3 Assignment 4
Introduction
In this project I selected six variables from 'SampleDataSet.xlsx'. Three of these variables are continuous and the remaining three are discrete. The continuous variables selected for this study are Age, WealthScore and MedianSchoolYears. The discrete variables selected for this study are NumberOfChildren, MailResponder and NumberOfCars.
Analysis
Age
Age is a continuous variable that takes only positive values, even though we usually consider only its integer part. The descriptive statistics of the age variable are shown below.
Table 1: Descriptive statistics of Age
Statistic             Age
Mean                  46.202
Median                48
Mode                  0
Standard Deviation    20.85197
Sample Variance       434.8046
Kurtosis              0.249096
Skewness              -0.67306
Range                 96
Minimum               0
Maximum               96
Sum                   92404
Count                 2000
Table 1 shows that the data consist of the ages of 2000 people.
The mean age is 46.202 years and the median age is 48, but the
mode is 0. The maximum age is 96, and the variability, measured
by the standard deviation, is 20.85 years.
Graph-1: Distribution of Age
The histogram of age is shown in Graph-1. Barring a cluster at 0,
the distribution of age is approximately symmetric.
Wealth Score
Table 2: Descriptive statistics of WealthScore
Statistic             WealthScore
Mean                  301.8376
Median                299.01
Mode                  490.46
Standard Deviation    94.35198
Sample Variance       8902.296
Kurtosis              -0.66676
Skewness              0.124734
Range                 390.46
Minimum               100
Maximum               490.46
Count                 2000
Table 2 shows that the data consist of the wealth scores of 2000
people. The mean score is 301.84, the median score is 299.01 and
the mode is 490.46. The maximum score is 490.46, and the
variability, measured by the standard deviation, is 94.35.
Graph-2: Histogram of wealth score
The histogram of the wealth scores in Graph-2 shows that the
distribution is approximately symmetric and bell-shaped.
Median School years
Table 3: Descriptive statistics of MedianSchoolYears
Statistic             MedianSchoolYears
Mean                  13.26322
Median                13.2
Mode                  12
Standard Deviation    1.424966
Sample Variance       2.030528
Kurtosis              -0.21916
Skewness              0.303654
Range                 12.4
Minimum               5.7
Maximum               18.1
Count                 1903
Table 3 shows that the sample consists of the median school years
of 1903 people. The mean of the variable is 13.26 years, the
median is 13.2 years and the mode is 12 years. The maximum is
18.1 years, and the variability, measured by the standard
deviation, is 1.42 years.
Graph-3: Histogram of school years
The histogram of the median school years in Graph-3 shows that
the distribution is slightly positively skewed.
Number of Children
Table 4: Descriptive statistics of the number of children
Statistic             NumberOfChildren
Mean                  0.586
Median                [...]

NumberOfChildren      Frequency
0                     1355
1                     323
2                     174
3                     99
4                     42
5                     6
6                     1
Graph-4: Frequency distribution of the number of children
Graph-5: Pie chart representing the number of children
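The reported mean of 0.586 can be recomputed directly from the frequency counts of NumberOfChildren given in the chart data (1355, 323, 174, 99, 42, 6 and 1 for zero to six children). The sketch below is a minimal illustration; the function names are not from the original workbook.

```python
# Frequency of each NumberOfChildren value, as listed in the chart data.
FREQUENCY = {0: 1355, 1: 323, 2: 174, 3: 99, 4: 42, 5: 6, 6: 1}


def grouped_mean(freq):
    """Mean of a discrete variable from its frequency table."""
    total = sum(freq.values())
    return sum(value * count for value, count in freq.items()) / total


def grouped_mode(freq):
    """Most frequent value in the frequency table."""
    return max(freq, key=freq.get)


n_observations = sum(FREQUENCY.values())
mean_children = grouped_mean(FREQUENCY)
mode_children = grouped_mode(FREQUENCY)
print(n_observations, round(mean_children, 3), mode_children)  # 2000 0.586 0
```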
Mail Responder
The mode of the variable mail responder is 2.
Table 5: Frequency of the mail responder
Mail responder    Frequency
0                 447
1                 515
2                 1038
Graph-6: Pie chart representing the mail responder
The pie chart confirms that 2 is the most frequent value of the
mail responder variable.
Number of cars
Table 6: Descriptive statistics for the number of cars
Statistic             NumberOfCars
Mean                  1.0313
Median                1
Mode                  1
Standard Deviation    1.0557
Sample Variance       1.1146
Kurtosis              6.0731
Skewness              1.72
Range                 9
Minimum               0
Maximum               9
Sum                   956
Count                 927
Hypothesis tests
The first hypothesis tested in this study concerns the mean age.
The null hypothesis H0 was that the mean age of the population
is 50. The alternative hypothesis H1 was that the mean age of
the population is less than 50. A 5% level of significance was
used. The null hypothesis was rejected (t = -8.146, df = 1999,
p < 0.001) at the 5% level. Therefore, we conclude that there is
enough evidence to support the claim that the mean age is less
than 50.
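The test statistic can be reproduced from the summary statistics of Age (mean 46.202, standard deviation 20.85197, n = 2000). The sketch below is illustrative: the function name is an assumption, and the p-value uses the standard normal distribution as an approximation to the t distribution with 1999 degrees of freedom, so it will not match the Excel p-value exactly.

```python
import math
from statistics import NormalDist


def one_sample_t(sample_mean, sample_sd, n, mu0):
    """t statistic for H0: population mean equals mu0."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - mu0) / standard_error


# Summary statistics of Age from the descriptive statistics table.
t_stat = one_sample_t(46.202, 20.85197, 2000, 50)

# Lower-tail p-value for H1: mean < 50 (normal approximation).
p_lower = NormalDist().cdf(t_stat)

ALPHA = 0.05
print(f"t = {t_stat:.4f}, reject H0: {p_lower < ALPHA}")
```

The same function applies to the other one-sample tests in this section by substituting the corresponding mean, standard deviation, sample size and hypothesized value.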
Table 7: Test for the mean age
                                Age         Hypothesized value
Mean                            46.202      50
Variance                        434.8046    0
Observations                    2000        2000
Pearson Correlation             #DIV/0!
Hypothesized Mean Difference    0
df                              1999
t Stat                          -8.1456
P(T<=t) one-tail                3.29E-16
t Critical one-tail             1.645616
P(T<=t) two-tail                6.57E-16
t Critical two-tail             1.961151
The second hypothesis tested in this study concerns the mean
wealth score. The null hypothesis H0 was that the mean wealth
score of the population is 300. The alternative hypothesis H1
was that the mean wealth score of the population is not equal to
300. The null hypothesis could not be rejected (t = 0.871,
df = 1999, p = 0.389) at the 5% level. Therefore, we conclude
that there is not enough evidence to support the claim that the
mean wealth score is different from 300. Details are given in
Table 8.
Table 8: Test for the mean wealth score
                                WealthScore    Null wealth
[...]
The null hypothesis H0 was that the mean school years of the
population is 12 years. The alternative hypothesis H1 was that
the mean school years of the population is greater than 12
years. The null hypothesis was rejected (t = 38.672, df = 1902,
p < 0.001) at the 5% level. Therefore, we conclude that there is
enough evidence to support the claim that the mean of the median
school years is greater than 12.
Table 9: Test for the mean school years
                                MedianSchoolYears    Null schoolyears
Mean                            13.26322             12
Variance                        2.030528             0
Observations                    1903                 1903
Pearson Correlation             #DIV/0!
Hypothesized Mean Difference    0
df                              1902
t Stat                          38.67163
P(T<=t) one-tail                3.4E-242
t Critical one-tail             1.645655
P(T<=t) two-tail                6.9E-242
t Critical two-tail             1.961212
Next, a chi-square test was conducted to test the independence
of the number of children and the number of cars. The null
hypothesis H0 was that the number of children and the number of
cars are independent. The alternative hypothesis H1 was that the
number of children and the number of cars are dependent.
Table 10: Contingency table
zero    one    two    three    four    five    six and above    Total
zero    [...]
[...]    388    149    56    12    4    5    1995
Table 11: Hypothesis test
chi-square    35.99
df            36
p-value       .4691
The null hypothesis could not be rejected (chi-square = 35.99,
df = 36, p = 0.4691) at the 5% level. There is not enough
evidence to conclude that the number of children and the number
of cars are dependent.
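The p-value of the chi-square test can be checked without a statistics package, because the chi-square survival function has a closed form when the degrees of freedom are even. The sketch below assumes a 7 x 7 contingency table, which is what df = 36 implies under the usual (rows - 1) * (columns - 1) rule; the function names are illustrative.

```python
import math


def independence_df(n_rows, n_cols):
    """Degrees of freedom for a chi-square test of independence."""
    return (n_rows - 1) * (n_cols - 1)


def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of df.

    Uses the identity sf(x; 2m) = exp(-x/2) * sum_{k<m} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


df = independence_df(7, 7)            # 7 x 7 table is an assumption
p_value = chi2_sf_even_df(35.99, df)  # statistic reported in the test table

ALPHA = 0.05
print(f"df = {df}, p = {p_value:.4f}, reject H0: {p_value < ALPHA}")
```

A p-value well above 0.05 means the observed counts are consistent with independence at the 5% level.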
Limitations of the study
There were a number of missing observations in the variable
NumberOfCars. The t test assumes that the population is normally
distributed, but we did not test the populations for normality.
Number of Children
NumberOfChildren 0 1 2 3 4 5 6
Frequency 1355 323 174 99 42 6 1
[Figure: Number of children; categories zero to six, y-axis: Frequency]
[Figure: Mail responder; categories zero, one, two, y-axis: Frequency]
[Figure: Histogram of Age; x-axis: Age, y-axis: Frequency]
Distribution of WealthScore
Frequency 0 25 50 75 100 125 150 175 200