# Student 26 Revised Assign.1

### Student 26 Revised Assign.1

1. Assignment #1 for ECO 4313 Econometrics, Aaron Helton

   Part 1: Introduction

   Part 1 of Assignment 1 called for fitting a least-squares regression model to 200 observed home selling prices from Lucas County, Ohio, using a constant term and the square foot living area of the home as the explanatory variables. The source of the sample data is a publicly accessible data set provided by LeSage as part of the Spatial Econometrics Toolbox, from LeSage and Pace (2004), consisting of around 25,000 home sales in the years 1993 through 1998. The data used here, named Student26.data, is a sample of 200 nearby homes that were sold, with the selling price and several other characteristics of each home (house age, square foot living area, square foot lot size, number of rooms, number of full baths, number of half baths, number of bedrooms).

   The model is:

   $$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \theta \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} + \beta \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

   where $y_1, y_2, \ldots, y_n$ are selling prices, $x_1, x_2, \ldots, x_n$ are the values of square foot living area, and $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are unobserved errors for the sample of $n = 200$ homes. The relationship can be written observation by observation as:

   $$y_i = \theta + \beta x_i + \varepsilon_i, \qquad i = 1, 2, \ldots, n$$

   The slope $\beta = \Delta y / \Delta x$ describes how a change in square foot living area ($x$) is related to a change in selling price ($y$). The intercept $\theta$ is the selling price of a house with zero square feet of living area.
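As a minimal sketch of how $\theta$ and $\beta$ in the model above can be estimated by least squares, the closed-form formulas for a one-regressor model are shown below in Python with NumPy. The function name `fit_simple_ols` and the four data points are hypothetical illustrations, not the Student26 sample, and the assignment itself used the Spatial Econometrics Toolbox rather than this code.

```python
import numpy as np


def fit_simple_ols(x, y):
    """Least-squares estimates (theta_hat, beta_hat) for y_i = theta + beta * x_i + eps_i."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    # Slope: sample covariance of x and y divided by sample variance of x.
    beta_hat = np.sum(x_dev * (y - y.mean())) / np.sum(x_dev ** 2)
    # Intercept: forces the fitted line through the point of means.
    theta_hat = y.mean() - beta_hat * x.mean()
    return theta_hat, beta_hat


# Hypothetical, noise-free data (NOT the Student26 sample): price = 30000 + 20 * sqft.
sqft = np.array([900.0, 1200.0, 1500.0, 1800.0])
price = 30000.0 + 20.0 * sqft

theta_hat, beta_hat = fit_simple_ols(sqft, price)
print(theta_hat, beta_hat)
```

Because the illustration data lie exactly on a line, the estimates recover the intercept and slope used to generate them; with real data the errors $\varepsilon_i$ would make the fit inexact.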
2. Table 1: Summary Statistics

   Sample of 200 homes

   | Variable | Mean | Median | Std. deviation | Min | Max |
   |---|---|---|---|---|---|
   | price | 60312.9200 | 60200 | 11432.8343 | 26713 | 97000 |
   | house age | 64.8850 | 69 | 11.7688 | 28 | 81 |
   | lotsize | 5271.6800 | 4900 | 1328.2408 | 2900 | 9600 |
   | rooms | 6.2100 | 6 | 0.9111 | 4 | 9 |
   | sqft living area | 1376.9150 | 1348 | 251.6858 | 809 | 2524 |
   | bedrooms | 3.0550 | 3 | 0.5599 | 2 | 5 |
   | full baths | 1.0400 | 1 | 0.1965 | 1 | 2 |
   | half baths | 0.2350 | 0 | 0.4481 | 0 | 2 |

   Sample of all 25,357 homes

   | Variable | Mean | Median | Std. deviation | Min | Max |
   |---|---|---|---|---|---|
   | price | 79017.94 | 65500 | 59655.02 | 2000 | 875000 |
   | house age | 50.37 | 46 | 27.93 | 0 | 161 |
   | lotsize | 13332.210 | 6800 | 28940.7312 | 702 | 429100 |
   | rooms | 6.1147 | 6 | 1.3033 | 1 | 20 |
   | sqft living area | 1462.2486 | 1318 | 613.1206 | 120 | 7616 |
   | bedrooms | 2.9875 | 3 | 0.7226 | 0 | 9 |
   | full baths | 1.2420 | 1 | 0.4873 | 0 | 7 |
   | half baths | 0.3412 | 0 | 0.5018 | 0 | 3 |

   Part 1.1 Summary Statistics

   Table 1 shows summary statistics (mean, median, standard deviation, minimum and maximum) for selling price and all other variables, both for my sample of 200 homes and for the full sample of 25,357 homes. In my 200-home sample, house age is the only variable whose mean does not exceed its median, which implies a left-skewed distribution: the sample includes some homes that are much younger than the usual (median) home. The variables price, lot size, rooms, square foot living area, bedrooms, full baths, and half baths all have means above their medians, which indicates right-skewed distributions: the sample contains some homes that are more expensive, with bigger lots, more rooms, more full and half baths, more bedrooms, and bigger living areas than the median home.
3. Compared to the full sample of 25,357 houses, my sample has a higher mean for house age, number of rooms, and number of bedrooms, which suggests that the homes in my sample are on average older, with more rooms and more bedrooms. For price, lot size, living area, and full and half baths, the full sample has the higher mean, meaning that the houses in my sample are on average cheaper, with smaller lots and living areas and fewer full and half baths. In the full sample, all variables except bedrooms show right-skewed distributions, while bedrooms is very close to symmetric because its mean is close to its median.

   Comparing the usual (median) home in each sample: in my sample of 200 homes it has a selling price of \$60,200, a lot size of 4,900 square feet, a living area of 1,348 square feet, is sixty-nine years old, and has six rooms, three bedrooms, one full bath, and zero half baths. In the full sample, the median home has a selling price of \$65,500, a lot size of 6,800 square feet, a living area of 1,318 square feet, is forty-six years old, and has six rooms, three bedrooms, one full bath, and zero half baths.

   Part 1.2 Univariate regression results

   Table 2 shows the univariate regression results. The slope estimate $\hat{\beta}$ is 17.74, meaning that a one square foot increase in living area is associated, everything else constant, with an increase in selling price of \$17.74. The t-statistic shows that this estimate is around 6 standard errors away from zero, so the slope is positive and statistically significant. The $R^2$ in Table 2 tells us that the fitted line explains only about 15% of the variation in home selling prices.
4. Table 2: Ordinary Least-squares Estimates (levels model). Dependent variable = price; R-squared = 0.1525; Rbar-squared = 0.1483; sigma^2 = 111330770.8286; Nobs, Nvars = 200, 2.

   | Variable | Coefficient | t-statistic | t-probability |
   |---|---|---|---|
   | constant | 35884.612197 | 8.627344 | 0.000000 |
   | sqft TLA | 17.741333 | 5.969856 | 0.000000 |

   After the first regression, a second regression was estimated with the Y and X variables converted to logs. The results are shown in Table 3: a one percent increase in square foot living area is associated with a 0.37% increase in selling price. The t-statistic of about 5 means that the slope of the fitted line is positive and that the estimated slope coefficient is about 5 standard errors away from zero.

   Table 3: Ordinary Least-squares Estimates (log-converted model). Dependent variable = log(price); R-squared = 0.1127; Rbar-squared = 0.1082; sigma^2 = 0.0357; Nobs, Nvars = 200, 2.

   | Variable | Coefficient | t-statistic | t-probability |
   |---|---|---|---|
   | constant | 8.290236 | 15.402189 | 0.000000 |
   | log(sqft TLA) | 0.374129 | 5.014130 | 0.000001 |

   Along with the tables for the two regressions (logged and unlogged), there were two scatter plots, one per regression. Each plots the actual versus fitted (logged or unlogged) selling prices for my sample of homes, with (logged or unlogged) square foot living area on the horizontal axis. Figure 1 shows that the fitted line has large errors for 2 homes that have very low selling prices but living areas between 1,000 and 1,400 square feet. Figure 2, for the log-converted model, shows the same kind of result: large errors for the same 2 homes.
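Since a t-statistic is the coefficient divided by its standard error, the standard errors that the regression printouts omit can be backed out from the reported numbers. The short Python sketch below does this for the slope coefficients in Tables 2 and 3; the helper name `implied_std_error` is a hypothetical label, and the calculation assumes the t-statistics test the null hypothesis of a zero coefficient.

```python
def implied_std_error(coefficient, t_statistic):
    """Back out the standard error from t = coefficient / standard error."""
    return coefficient / t_statistic


# Slope coefficients and t-statistics as reported in Tables 2 and 3.
se_levels = implied_std_error(17.741333, 5.969856)   # sqft TLA, levels model
se_logs = implied_std_error(0.374129, 5.014130)      # log(sqft TLA), log model

print(round(se_levels, 4), round(se_logs, 4))
```

Read this way, "about 6 standard errors from zero" means the \$17.74 slope is roughly six times the size of its own sampling uncertainty.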
5. Figure 1: Scatter plot of actual selling prices versus fitted values (horizontal axis: sqft living area; vertical axis: selling price; blue o = actual, red * = predicted).

   Figure 2: Log-converted regression, actual selling price vs. fitted values (horizontal axis: log(sqft living area); vertical axis: log(selling price); blue o = actual, red * = predicted).
6. Part 2: Extended Regression

   For Part 2 of the assignment we ran two more regression models, adding other home characteristics to help explain the variation in selling prices. One regression uses the level (unlogged) data and the other uses log-converted data. In the log-converted model, three variables are logged (selling price, square foot living area, and lot size) and all other variables remain unlogged.

   From Table 4 we can see that adding the new variables gives a slightly higher $R^2$ of 0.197, so this model explains about 20% of the variation in selling prices rather than the 15% explained by the univariate model. Because the two models have different numbers of explanatory variables, it is more appropriate to compare them using the R-bar squared (adjusted $R^2$) statistic. In the new model, two coefficients are significant at the 99% level (the constant term and square foot living area), lot size is significant at the 95% level, and house age, number of bedrooms, rooms, and full and half baths are not significantly different from zero.

   Table 4 also shows two variables with negative coefficients (house age and number of rooms). If these coefficients were significantly different from zero, a one-unit increase in either variable would be associated with a decrease in selling price. For these and the other variables that are not significantly different from zero, we cannot conclude that they have any effect on selling price. The only variables with a measurable impact are lot size, with a change of \$1.50 in selling price for every one square foot increase in lot size, and square foot living area, with a \$16.26 increase in selling price for each additional square foot. The living-area estimate is lower than in the first regression model, which suggests omitted variables bias in the univariate model.
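The adjusted $R^2$ comparison can be checked by hand. The Python sketch below applies the standard formula $\bar{R}^2 = 1 - (1 - R^2)\,(n - 1)/(n - k)$ to the R-squared values reported in Tables 2 and 4, where $k$ counts the constant term as the printouts do (Nvars). The function name `adjusted_r2` is a hypothetical label, and small differences in the last digit can arise because the reported R-squared values are rounded.

```python
def adjusted_r2(r2, n_obs, n_vars):
    """Adjusted R-squared; n_vars counts the constant term, as in 'Nobs, Nvars'."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_vars)


# R-squared values as reported in Table 2 (univariate) and Table 4 (extended).
rbar_univariate = adjusted_r2(0.1525, 200, 2)
rbar_extended = adjusted_r2(0.1972, 200, 8)

print(round(rbar_univariate, 4), round(rbar_extended, 4))
```

The penalty factor $(n - 1)/(n - k)$ grows with the number of regressors, which is why the extended model's advantage shrinks once the six extra variables are accounted for.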
7. Table 4: Ordinary Least-squares Estimates (extended model). Dependent variable = price; R-squared = 0.1972; Rbar-squared = 0.1679; sigma^2 = 108764089.8805; Nobs, Nvars = 200, 8.

   | Variable | Coefficient | t-statistic | t-probability |
   |---|---|---|---|
   | constant | 31695.838158 | 4.567777 | 0.000009 |
   | house age | -84.966144 | -1.209722 | 0.227873 |
   | lotsize | 1.504773 | 2.596185 | 0.010156 |
   | # rooms | -8.075865 | -0.005547 | 0.995580 |
   | sqft living area | 16.259024 | 3.708296 | 0.000273 |
   | # bedrooms | 220.198751 | 0.105395 | 0.916172 |
   | # full baths | 2855.202094 | 0.739072 | 0.460766 |
   | # half baths | 928.377908 | 0.520066 | 0.603617 |

   The next regression model uses the log of selling price together with the logs of two explanatory variables: square foot living area and lot size. This lets us interpret the coefficients on those two variables as elasticities. The coefficients on the non-logged variables show how a one-unit change in those variables changes the log selling price, but they are not the focus of this assignment.

   In Table 5 we can see the estimates for the logged regression. The log of square foot living area is significant at the 99% level, and the only other significant variable is log(lotsize), at the 95% level. Both estimates are positive: a 10% increase in lot size would increase the selling price by about 1.4%, holding all else constant, and a 10% increase in square foot living area would increase the selling price by about 3.3%, holding all else constant. This is consistent with intuition, because houses with more living area and bigger lots would naturally be more expensive. The positive and significant intercept (constant) tells us that the predicted log price is 7.42 when all explanatory variables are zero, loosely a vacant lot.
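The elasticity reading of a log-log coefficient is a first-order approximation, and the gap between it and the exact implied change becomes visible for a change as large as 10%. The Python sketch below compares the two using the log(lotsize) and log(sqft living area) coefficients from Table 5; the function name `pct_change_in_price` is a hypothetical label, and the calculation holds all other regressors fixed.

```python
def pct_change_in_price(elasticity, pct_change_in_x):
    """Return (exact, approximate) percent change in price in a log-log model."""
    # Exact: log(price) shifts by elasticity * log(1 + pct/100).
    exact = ((1.0 + pct_change_in_x / 100.0) ** elasticity - 1.0) * 100.0
    # Approximation: elasticity times the percent change in x.
    approx = elasticity * pct_change_in_x
    return exact, approx


# Coefficients as reported in Table 5, evaluated for a 10% increase.
lot_exact, lot_approx = pct_change_in_price(0.143304, 10.0)
sqft_exact, sqft_approx = pct_change_in_price(0.325530, 10.0)

print(round(lot_exact, 2), round(lot_approx, 2))
print(round(sqft_exact, 2), round(sqft_approx, 2))
```

For a 1% change the two numbers are nearly identical, so the usual "one percent leads to $\beta$ percent" wording is safest for small changes.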
8. Table 5: Ordinary Least-squares Estimates (extended log-converted model). Dependent variable = log(price); R-squared = 0.1586; Rbar-squared = 0.1279; sigma^2 = 0.0349; Nobs, Nvars = 200, 8.

   | Variable | Coefficient | t-statistic | t-probability |
   |---|---|---|---|
   | constant | 7.421542 | 9.203428 | 0.000000 |
   | house age | -0.001531 | -1.203469 | 0.230277 |
   | log(lotsize) | 0.143304 | 2.441477 | 0.015534 |
   | # rooms | 0.000276 | 0.010561 | 0.991585 |
   | log(sqft living area) | 0.325530 | 2.897992 | 0.004192 |
   | # bedrooms | 0.011579 | 0.308910 | 0.757725 |
   | # full baths | 0.049763 | 0.718941 | 0.473051 |
   | # half baths | 0.023795 | 0.744367 | 0.457564 |

   Along with the second set of tables, Figure 3 plots the residuals of the two extended models. Since the residual is defined as $e = Y - \hat{Y}$, negative residuals mark observations where the predicted selling price $\hat{Y}$ is larger than the actual selling price. We can see about 19-20 large negative residuals among the homes with low selling prices, and they appear in both the logged and non-logged regressions. These negative residuals tell us that while some homes' selling prices were predicted accurately, there were also homes (the first 14 in the figure) whose predicted selling prices were much greater than their actual selling prices.

   Comparing Figures 1 and 2 with Figure 3, recall that 2 observations had living areas between 1,000 and 1,400 square feet but very low selling prices. In Figure 3 the first two negative residuals are among the lowest, which tells us that the model still predicts these homes' selling prices above their actual values, leading to negative residuals. The extended regression does not predict selling price much better than the original model.
9. Figure 3: Plot of extended models' residuals. Two panels: residuals against observations sorted by selling price (levels model), and residuals against observations sorted by logged selling price (log-converted model).
10. Part 3: Specification Tests

    For Part 3 of the assignment we run diagnostic tests on our regression specification. In Part 3 we consider only the levels regression and ignore the log-converted models. The first test asks whether the house age variable should enter the relationship in a linear or non-linear fashion. The second test explores the outliers in my sample.

    Part 3.1: House age test

    In this test we compare the R-bar squared statistic for three models that include, respectively, house age; house age plus house age squared; and house age plus house age squared plus house age cubed. Because the models have different numbers of explanatory variables, we use R-bar squared rather than R-squared.

    Table 6: Model fit for various functional forms for house age

    | Model | R-bar squared |
    |---|---|
    | age | 0.1679 |
    | age + age^2 | 0.1784 |
    | age + age^2 + age^3 | 0.1742 |

    The results suggest a slight edge for the specification including age + age^2, with the results as follows.

    Ordinary Least-squares Estimates. Dependent variable = price; R-squared = 0.2114; Rbar-squared = 0.1784; sigma^2 = 107391876.9702; Nobs, Nvars = 200, 9.

    | Variable | Coefficient | t-statistic | t-probability |
    |---|---|---|---|
    | constant | -4845.647095 | -0.232542 | 0.816366 |
    | age | 1204.598594 | 1.727155 | 0.085757 |
    | age2 | -11.037887 | -1.858307 | 0.064664 |
    | lotsize | 1.517449 | 2.634543 | 0.009116 |
    | rooms | 155.653687 | 0.107400 | 0.914584 |
    | sqft | 16.731792 | 3.833895 | 0.000171 |
    | beds | 190.843820 | 0.091924 | 0.926855 |
    | fbaths | 2065.720811 | 0.534854 | 0.593373 |
    | hbaths | 1079.980403 | 0.608202 | 0.543776 |

    Part 3.2: Outliers and robust regression
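The functional-form comparison in Part 3.1 can be sketched as a loop over design matrices, each scored by adjusted $R^2$. The Python example below uses NumPy on simulated data with a deliberately quadratic age effect; the data, the seed, the helper name `rbar_squared`, and the omission of the other house characteristics are all assumptions made for illustration, so the numbers it prints are unrelated to Table 6.

```python
import numpy as np


def rbar_squared(X, y):
    """Adjusted R-squared of an OLS fit; X must already contain the constant column."""
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)


# Simulated data (NOT the Student26 sample): price is quadratic in age plus noise.
rng = np.random.default_rng(0)
age = rng.uniform(28, 81, size=200)
price = 20000 + 1200 * age - 11 * age**2 + rng.normal(0, 500, size=200)

# Age in decades keeps the powers on similar scales; the fit is unchanged.
a = age / 10.0
const = np.ones_like(a)
designs = {
    "age": np.column_stack([const, a]),
    "age + age^2": np.column_stack([const, a, a**2]),
    "age + age^2 + age^3": np.column_stack([const, a, a**2, a**3]),
}

scores = {name: rbar_squared(X, price) for name, X in designs.items()}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

Because adjusted $R^2$ penalizes each added column, the cubic term only raises the score if it improves the fit by more than the penalty, which is the same logic behind preferring age + age^2 in Table 6.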
11. To test for outliers, we run a robust regression using the Bayesian MCMC estimates proposed by Geweke (1993). We run this regression on the "best model" from the diagnostic results in Part 3.1, which is the model that includes the age + age^2 variables.

    Bayesian Heteroscedastic Linear Model Gibbs Estimates. Dependent variable = price; R-squared = 0.2099; Rbar-squared = 0.1768; sigma^2 = 72471444.9419; Nobs, Nvars = 200, 9; ndraws, nomit = 2500, 500; time in secs = 3.4640; r-value = 4.

    Posterior Estimates

    | Variable | Coefficient | t-statistic | t-probability |
    |---|---|---|---|
    | constant | -8800.451738 | -0.432175 | 0.666080 |
    | house age | 1358.196299 | 1.979377 | 0.049145 |
    | house age^2 | -12.315389 | -2.106671 | 0.036393 |
    | lotsize | 1.594466 | 2.677398 | 0.008036 |
    | # rooms | -39.479899 | -0.028509 | 0.977285 |
    | sqft living area | 16.882491 | 3.900605 | 0.000131 |
    | # bedrooms | -62.219026 | -0.031466 | 0.974929 |
    | # full baths | 3628.894004 | 0.970255 | 0.333091 |
    | # half baths | 715.520445 | 0.418645 | 0.675925 |

    These results show that downweighting the outlier observations increases the t-statistics (and so lowers the t-probabilities) of lot size and square foot living area, but not by enough to change their significance levels. The variables house age and house age^2, however, change from significant at the 90% level to significant at the 95% level. This change provides weak evidence of outliers in our data set. As expected, the R-squared is lower because we are not trying to "fit" the outlier observations.

    While the lot size and square foot living area coefficients are both significantly different from zero, they do not change the selling price by much. A one square foot change in lot size (ceteris paribus) changes the selling price by only \$1.59, and a one square foot change in living area (ceteris paribus) increases the selling price by only \$16.88, which, in perspective, is not a huge change.
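The downweighting idea can be illustrated with ordinary weighted least squares, where observation $i$ receives weight $1/v_i$. The Python sketch below is not Geweke's Gibbs sampler: in that method the variance scalars are estimated as part of the sampling, whereas here the values in `v` are simply assumed. The data and the function name `weighted_least_squares` are likewise hypothetical.

```python
import numpy as np


def weighted_least_squares(X, y, v):
    """Least squares with weight 1 / v_i on observation i (v_i = 1 gives plain OLS)."""
    w = 1.0 / np.asarray(v, dtype=float)
    sw = np.sqrt(w)
    # Scaling each row by sqrt(w_i) turns the weighted problem into an ordinary one.
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef


# Hypothetical data: points on y = 2 + 3x, except the first one, pulled down by 10.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x
y[0] -= 10.0
X = np.column_stack([np.ones_like(x), x])

ols = weighted_least_squares(X, y, np.ones(5))                   # equal weights
robust = weighted_least_squares(X, y, [1e6, 1.0, 1.0, 1.0, 1.0])  # outlier downweighted

print(ols, robust)
```

With equal weights the single low point drags the slope well away from 3; giving it a very large assumed variance scalar restores a slope close to 3, which mirrors why the robust estimates differ from the least-squares ones when outliers are present.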
12. Figure 4: Plot of $V_i$ estimates (horizontal axis: observations sorted by selling price; vertical axis: $V_i$ estimates).

    Figure 4 shows the $V_i$ estimates. The observations with low selling prices have large variance scalar estimates $V_i$. In the robust regression, these observations are downweighted by $1/V_i$. For example, if $V_i = 5$, then observation $i$ receives a weight of $1/5 = 0.2$, whereas least-squares estimation gives an equal weight of unity to all observations. Figure 4 also shows that the homes with the highest selling prices are outlier observations as well, but all except one of their estimates are below 4, compared to four observations above 4 among the cheaper houses.

    Part 4: Conclusion

    From Part 3 we can conclude that the best model specification for this sample data set is the regression that includes house age and house age squared. There are only a few outliers in the regression, but a robust regression procedure would be appropriate: the changes in some of the variables' t-probabilities give us some evidence of the outliers that occur, and the procedure also reduces the influence of those outliers.
13. We can also note that the coefficient estimate for square foot living area from the extended robust regression is very close to the estimate from the simple regression in Table 2. This model will be used in Assignment #2 to carry out further diagnostics for problems of collinearity, heteroscedasticity and spatial autocorrelation of disturbances.

    References

    LeSage, James P. and R. Kelly Pace (2004), "Models for Spatially Dependent Missing Data," Journal of Real Estate Finance and Economics, 29(2), pp. 233-254.

    Geweke, J. (1993), "Bayesian Treatment of the Independent Student-t Linear Model," Journal of Applied Econometrics, 8, pp. 19-40.
