# Prediction of house price using multiple regression

- Constructed a mathematical model using Multiple Regression to estimate the Selling price of the house based on a set of predictor variables.
- SAS was used for Variable profiling, data transformations, data preparation, regression modeling, fitting data, model diagnostics, and outlier detection.


1. Prediction of House Price using Multiple Regression. By Vinod Kumar Shanmugam. MATH 661 – APPLIED STATISTICS, PROFESSOR ARIDAMAN JAIN, FALL 2009.
2. ABSTRACT
   - This project predicts the selling price of a house from parameters such as year built, square footage, lot size, number of bedrooms and bathrooms, features, and Walk Score.
   - The data is taken from www.zillow.com.
   - What is zillow.com? Zillow is an online real estate service that provides tools and information to help people make real estate decisions.
   - PROJECT OBJECTIVE: construct a mathematical model using multiple regression to estimate the selling price of a house from a set of predictor variables.
   - Analysis software used: SAS (Statistical Analysis System).
3. VARIABLES USED FOR ANALYSIS
   - LIST OF DEPENDENT AND INDEPENDENT VARIABLES
   - We have 8 independent variables and 1 dependent variable. We screen variables based on their correlation coefficient with price and on the amount of variability explained by the model (R-square).
4. STATISTICAL APPROACH
   - The statistical approach used here is multiple regression.
   - What is multiple regression? Multiple regression uses more than one independent variable to predict a dependent variable.
   - EQUATION FOR MULTIPLE REGRESSION: Y = b0 + b1*X1 + b2*X2 + ... + bp*Xp
   - X1, X2, ..., Xp are the independent variables, and Y, the housing price, is the dependent variable being predicted or explained.
   - b0 is the constant, or intercept.
   - b1 is the slope (beta coefficient) for X1, b2 is the beta coefficient for X2, and so on.
   - This equation is estimated using the least-squares method.
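
The least-squares estimate of b0, b1, ..., bp can be sketched in a few lines of plain Python. This is only an illustration, not the SAS procedure used in the project: the function name `fit_ols` and the toy data are hypothetical, and the coefficients are obtained by solving the normal equations with Gauss-Jordan elimination.

```python
def fit_ols(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bp*xp; returns [b0, b1, ..., bp]."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend the intercept column
    k = len(rows[0])
    # Augmented normal equations: (A^T A) b = A^T y
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]


# Hypothetical toy data generated exactly from y = 1 + 2*x1 + 3*x2
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]
coeffs = fit_ols(X, y)
print(coeffs)
```

Because the toy data contain no noise, the fitted coefficients recover the intercept and slopes used to generate them, up to floating-point error.
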
5. EXPLORATORY DATA ANALYSIS
   - The exploratory data analysis consists of scatter plots of house price against the predictor variables, with and without a natural log transform of price.
   - The log transform of price is applied to make the relationship between price and the independent variables closer to linear, and thereby to improve prediction accuracy.
6. DISTRIBUTION OF HOUSING PRICE VARIABLE WITHOUT NATURAL LOG TRANSFORM
   - Distribution
7. DISTRIBUTION OF HOUSING PRICE VARIABLE WITH NATURAL LOG TRANSFORM
   - Distribution: 1) normal probability plot, 2) histogram
   - After the natural log transform, the housing price appears very close to normally distributed. This helps linearize the relationship between housing price and the predictor variables.
   - The distribution is much less skewed than before the transformation.
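
The effect of the log transform on skewness can be checked numerically. The sketch below uses a handful of made-up prices (not the project's Zillow data) and a simple moment-based skewness; the function name `skewness` is hypothetical.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed prices: a few expensive houses stretch the upper tail
prices = [150_000, 180_000, 200_000, 220_000, 250_000, 300_000, 450_000, 900_000]
log_prices = [math.log(p) for p in prices]

skew_raw = skewness(prices)
skew_log = skewness(log_prices)
print(skew_raw, skew_log)
```

For these illustrative values the skewness of the log prices is smaller than that of the raw prices, which is the pattern the two distribution slides describe.
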
8. CORRELATION AND REGRESSION ANALYSIS
   - What is correlation? Correlation is a statistical relation between two or more variables such that systematic changes in the value of one variable are accompanied by systematic changes in the other. It is represented by r and ranges from -1 to +1.
   - Pearson correlation coefficient: relates the dependent variable, price, to other features of the house such as age, sqft, and appliances_cnt.
   - The highlighted correlations have absolute value greater than 0.5, indicating strong positive or negative correlation; these variables explain the variation in house price in the regression model better than the others.
   - Automatic variable selection is done in SAS based on the amount of variability explained by the model.
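
A minimal sketch of the Pearson correlation and the 0.5 screening rule follows. The `sqft` and `log_price` values are invented for illustration, and `pearson_r` is a hypothetical helper, not SAS output.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: square footage and natural log of price
sqft = [1000, 1500, 2000, 2500, 3000]
log_price = [11.9, 12.1, 12.4, 12.5, 12.9]

r = pearson_r(sqft, log_price)
strong = abs(r) > 0.5  # screening rule from the slide: |r| greater than 0.5
print(r, strong)
```

Using the absolute value keeps strongly negative predictors (for example, age) as well as strongly positive ones.
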
9. MULTIPLE REGRESSION ANALYSIS
   - Multiple regression was run on the data set using the PROC REG procedure in SAS.
   - ANOVA TABLE:
10. Main Points from SAS output
    - The F-value is 37.32 and the p-value is < 0.05, so the regression model is significant.
    - The p-values for the t-statistics of the selected variables are all <= 0.05, so all the variables are significant in the model.
    - The R-square is 0.8092, which means 80.92% of the total variability is explained by the age, lotsizesqft, bedrooms, appliances_cnt and numfloors variables.
    - The regression equation to predict the house price is
11. Identifying Outliers using residuals
    - After identifying influential observations, the outliers were removed from the data. The top 3 and bottom 3 cases were removed to see whether this improves the variability explained by the model. The R-square value increased from 0.8092 to 0.8322, so we retain the newly fit model after removing the outliers.
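
Dropping the top and bottom cases can be sketched as below. This assumes the cases are ranked by raw residual (actual minus predicted); the slides do not say exactly which diagnostic was used for the ranking, and `trim_extreme_residuals` and the sample numbers are hypothetical.

```python
def trim_extreme_residuals(actual, predicted, k=3):
    """Return the indices kept after dropping the k largest and k smallest residuals."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    if len(residuals) <= 2 * k:
        raise ValueError("not enough cases to trim k from each end")
    order = sorted(range(len(residuals)), key=lambda i: residuals[i])
    dropped = set(order[:k]) | set(order[-k:])
    return [i for i in range(len(residuals)) if i not in dropped]


# Hypothetical example with k=1: residuals are [-1, 1, 0, 19, -2, -9]
actual = [10, 12, 11, 30, 9, 2]
predicted = [11, 11, 11, 11, 11, 11]
kept = trim_extreme_residuals(actual, predicted, k=1)
print(kept)
```

The model would then be refit on the kept cases only, and the R-square of the two fits compared.
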
12. Main Points from SAS output
    - The F-value is 37.70 and the p-value is < 0.05, so the regression model is significant.
    - The p-values for the t-statistics of the selected variables are all <= 0.05 except for numfloors, so all the other variables are significant in the model; numfloors could be removed from the model if desired.
    - The R-square is 0.8322, which means 83.22% of the total variability is explained by the age, lotsizesqft, bedrooms, appliances_cnt and numfloors variables after removing the outliers.
    - FINAL MODEL
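
For an ordinary least-squares model with an intercept, the overall F-value follows from R-square, the number of cases n, and the number of predictors p. The sketch below applies that identity to the reported R-square values. The slides do not state the sample size: n = 50 before and n = 44 after removing six cases are assumptions, chosen because with five predictors they give F-values close to the reported 37.32 and 37.70.

```python
def f_statistic(r_squared, n, p):
    """Overall F statistic for an OLS model with intercept, n cases and p predictors."""
    return (r_squared / p) / ((1 - r_squared) / (n - p - 1))


# R-square values are taken from the slides.
# ASSUMPTION: n = 50 before and n = 44 after outlier removal (not stated in the slides).
f_before = f_statistic(0.8092, n=50, p=5)
f_after = f_statistic(0.8322, n=44, p=5)
print(round(f_before, 2), round(f_after, 2))
```
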
13. Explaining the effect of each independent variable selected by the regression model
    - Interpreting regression coefficients: each regression coefficient measures the average change in Y per unit change in the corresponding independent variable, with the other variables held fixed.
    - Starting point for comparing two houses: same input values, same output value, so no change here:
14. Explaining the effect of each independent variable selected by the regression model (cont.)
    - Explaining the age coefficient
    - Explaining the lot coefficient
15. Explaining the effect of each independent variable selected by the regression model (cont.)
    - Predicting: New Case 1 and New Case 2
    - This helps the house seller or realtor suggest modifications to an existing house to achieve a good selling price in the neighborhood.
    - The X (independent) variables should lie within the minimum and maximum of the data set used to fit the regression model, because out-of-range predictions (extrapolation) are unreliable.
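
The range restriction on new cases can be enforced with a simple check before predicting. The sketch below is illustrative only: the variable names come from the slides, but the training rows, the new cases, and the helper `out_of_range` are hypothetical.

```python
def out_of_range(training_rows, new_case):
    """Return the names of predictors in new_case that fall outside the training min/max."""
    flagged = []
    for name, value in new_case.items():
        column = [row[name] for row in training_rows]
        if not min(column) <= value <= max(column):
            flagged.append(name)
    return flagged


# Hypothetical training data and new cases
training = [
    {"age": 5, "lotsizesqft": 4000, "bedrooms": 3},
    {"age": 40, "lotsizesqft": 9000, "bedrooms": 5},
    {"age": 22, "lotsizesqft": 6500, "bedrooms": 2},
]
ok_case = {"age": 30, "lotsizesqft": 5000, "bedrooms": 4}
bad_case = {"age": 75, "lotsizesqft": 5000, "bedrooms": 4}

print(out_of_range(training, ok_case))
print(out_of_range(training, bad_case))
```

A case that returns a non-empty list would require extrapolating beyond the fitted data, so its prediction should be treated with caution.
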
16. PLOT OF ACTUAL VS PREDICTED VALUE
    - Panels: BEFORE REMOVING OUTLIERS, AFTER REMOVING OUTLIERS
    - After removing the outliers, the plot of actual vs predicted values shows a close-to-linear association.
17. CUMULATIVE DISTRIBUTION OF PREDICTION ERROR %
    - The formula is abs(actual - predicted) * 100 / actual.
    - The cumulative chart shows that 70% of cases (0.7 on the y-axis) have less than 9% prediction error compared with the actual selling price.
    - 80% of cases have less than 10% prediction error.
    - 90% of cases have less than 12% prediction error.
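
The percentage-error formula and the cumulative shares read off the chart can be computed as in this sketch. The actual and predicted prices are made up for illustration, and the helper names `pct_errors` and `share_within` are hypothetical.

```python
def pct_errors(actual, predicted):
    """Absolute prediction error as a percentage of the actual value."""
    return [abs(a - p) * 100 / a for a, p in zip(actual, predicted)]


def share_within(errors, threshold):
    """Fraction of cases whose percentage error is below the threshold."""
    return sum(e < threshold for e in errors) / len(errors)


# Hypothetical actual and predicted selling prices
actual = [200_000, 250_000, 300_000, 400_000, 500_000]
predicted = [210_000, 240_000, 330_000, 380_000, 420_000]

errors = pct_errors(actual, predicted)
print(errors)
print(share_within(errors, 10), share_within(errors, 12))
```

Evaluating `share_within` over a grid of thresholds gives the points of the cumulative distribution curve.
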
18. CONCLUSION
    - We are able to predict the house price with around 90% accuracy (less than about 10% prediction error) for most cases, and we have a good R-square of 0.83, which means 83% of the variability is explained by the model. We are also able to interpret the estimates of the model.
    - SCOPE OF THE PROJECT: In future, the latitude, longitude and elevation of the house can be included in the model to predict the house price more accurately. Future work can also include demographic variables such as income, number of children, education, and age of the family group, to explain more of the variability in house pricing and to predict prices more effectively.