HOUSE PRICES
Advanced Regression Technique
Prepared by: Anirvan Ghosh
Outline
■ Project Objective
■ Data Source and Variables
■ Data Processing
■ Method of Analysis
■ Result
■ Predicted House Prices
All coding and model building is done using R software
Objective
■ Create an analytical framework to understand
– Key factors impacting house price
■ Develop a modeling framework
– To estimate the price of a house that is up for sale
Data Source and Variables
■ Kaggle competition - “House Prices: Advanced Regression Techniques”
– Dataset prepared by Dean De Cock
■ Variables:
– 79 explanatory variables plus the dependent variable are present in the dataset
■ Variable named “SalePrice”
– Dependent variable
– Represents the final price at which the house was sold
■ Remaining 79 variables
– Represent different attributes of the house, such as area, car parking, number of
fireplaces, etc.
Data Processing
■ Normalizing Response Variable
■ Training Vs Validation split
– Train data – 75%
– Validation Data – 25%
■ Data cleansing
– Variable treatments
■ Missing value treatment:
– Continuous variables
– Character variables
■ Outlier treatment
■ Variable creation:
– Character variables were converted to indicators
– Based on train data, character variables were further grouped and new indicators were created
Data Processing – Normalizing
Response Variable
■ The response variable is log-transformed so that
its distribution is closer to normal.
■ Underlying reason:
– Helps satisfy the basic assumptions of
ordinary least squares (OLS), which concern normally distributed errors
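The effect of the log transform can be illustrated with a small sketch. The deck's own work was done in R; this is a Python illustration using only the standard library, and the price values and the `skewness` helper are hypothetical, not taken from the dataset.

```python
import math
import statistics

def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical sale prices with a long right tail (illustrative only).
prices = [90_000, 120_000, 135_000, 150_000, 165_000,
          180_000, 210_000, 260_000, 400_000, 755_000]

# Model the logarithm of the price instead of the raw price.
log_prices = [math.log(p) for p in prices]

print("skewness of raw prices:", round(skewness(prices), 2))
print("skewness of log prices:", round(skewness(log_prices), 2))

# Predictions made on the log scale are mapped back to prices with exp().
recovered = [math.exp(v) for v in log_prices]
```

The log-transformed values are noticeably less skewed than the raw prices, which is the property the transformation is meant to deliver.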
Data Processing – Training Vs
Validation split
■ Training Data
– Contains 75% of the total observations, selected at random.
– Model is developed on this dataset.
■ Validation Data
– Containing remaining 25% of the total observations.
– Validation of the model is carried out based on this data
– Model tuning, if required, is carried out based on the model performance on the
validation data
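A random 75/25 split of this kind can be sketched as follows. This is a minimal Python illustration, not the author's R code; the function name, the seed and the row ids are assumptions made for the example.

```python
import random

def train_validation_split(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of the rows and cut it into train and validation parts."""
    shuffled = rows[:]                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical row ids standing in for the observations.
rows = list(range(1, 1461))
train, validation = train_validation_split(rows)

print(len(train), len(validation))
```

Fixing the seed keeps the split stable between runs, so that model comparisons are made on the same validation rows.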
Data Processing – Missing value
treatment
■ Continuous Variable
– Missing values are replaced by the median of the corresponding variable.
■ Why median and not mean?
– The mean is sensitive to outliers and can be heavily distorted by them
– The median is a more stable measure
■ Character Variable
– A separate category is created for missing values
■ This helps retain the predictive power of the variable
■ The impact of missing values on the dependent variable can also be established
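Both treatments can be sketched in a few lines. This is a Python illustration under assumptions: missing values are represented as `None`, the numeric values are made up, and the helper names are hypothetical.

```python
import statistics

def impute_median(values):
    """Replace missing numeric values (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def label_missing(values, label="Missing"):
    """Turn missing categorical values (None) into their own category."""
    return [label if v is None else v for v in values]

# Illustrative data: one extreme value (313.0) pulls the mean far above the median.
frontage = [65.0, None, 80.0, 60.0, None, 313.0]
alley = ["Grvl", None, "Pave", None, None, "Grvl"]

print(impute_median(frontage))
print(label_missing(alley))
```

With these numbers the median of the observed values is 72.5 while the mean is 129.5, which shows why the median is the more stable fill value when outliers are present.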
Data Processing – Outlier treatment
■ Upper-tail values
– Cut-off value: 99th percentile plus
1.5 times the IQR of the corresponding
variable.
■ Values above the cut-off are replaced by the 99th percentile
■ Lower-tail values
– Cut-off value: 1st percentile
minus 1.5 times the IQR
■ Values below the cut-off are replaced by the 1st percentile
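The capping rule above can be sketched as follows. This is a Python illustration rather than the author's R code; it relies on `statistics.quantiles` with its default interpolation method, which may give slightly different percentiles from R's default, and the data are made up.

```python
import statistics

def cap_outliers(values):
    """Cap values beyond (99th pct + 1.5*IQR) or below (1st pct - 1.5*IQR)."""
    pct = statistics.quantiles(values, n=100)    # 99 cut points
    quartiles = statistics.quantiles(values, n=4)
    p01, p99 = pct[0], pct[98]                   # 1st and 99th percentiles
    iqr = quartiles[2] - quartiles[0]            # Q3 - Q1
    upper_cut = p99 + 1.5 * iqr
    lower_cut = p01 - 1.5 * iqr
    capped = []
    for v in values:
        if v > upper_cut:
            capped.append(p99)                   # upper-tail value -> 99th percentile
        elif v < lower_cut:
            capped.append(p01)                   # lower-tail value -> 1st percentile
        else:
            capped.append(v)
    return capped

# Illustrative data: 1..199 plus one extreme value.
values = list(range(1, 200)) + [10_000]
capped = cap_outliers(values)

print(max(values), max(capped))
```

Only the extreme value is changed; everything inside the cut-offs passes through untouched.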
Data Processing – Variable creation
■ Character Variables
– (n – 1) indicators are created for a
character variable containing n
different categories
– Separate indicator created for missing
value
■ Additional Indicator Variables
– Based on bivariate plots: if two or more
categories show a similar level of
the dependent variable, they are
combined and converted into a single
indicator
■ For example, Alley has three levels:
Grvl, Missing and Pave. A new
variable was created for the Missing and
Pave categories, as both have a high
median value of the dependent
variable.
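The two kinds of indicator described on this slide can be sketched as below. This is a Python illustration; the function names, the `is_` prefix and the choice of Grvl as the left-out reference category are assumptions made for the example.

```python
def make_indicators(values, reference):
    """Create (n - 1) 0/1 indicators, leaving out the reference category."""
    categories = sorted(set(values))
    return {
        f"is_{c}": [1 if v == c else 0 for v in values]
        for c in categories
        if c != reference
    }

def grouped_indicator(values, group):
    """Create one 0/1 indicator that is 1 for any category in the group."""
    return [1 if v in group else 0 for v in values]

# Alley levels from the slide, with missing values already labelled "Missing".
alley = ["Grvl", "Missing", "Pave", "Missing", "Grvl"]

indicators = make_indicators(alley, reference="Grvl")
alley_missing_or_pave = grouped_indicator(alley, {"Missing", "Pave"})

print(indicators)
print(alley_missing_or_pave)
```

Three categories yield two indicators, and the grouped indicator merges Missing and Pave into a single column as in the Alley example.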
Method of Analysis
■ Variable Selection Using Random Forest
■ Multiple Linear Regression
– Significance
■ t value and probability of t value
– Goodness of fit
■ Adjusted R-square
– Multicollinearity
■ Random Forest
■ Model Accuracy
– Error rate
– MAPE
Introduction – Random Forest
• A random forest operates by constructing a multitude of
decision trees at training time and outputting the mean
prediction of the individual trees. It also corrects the
decision trees’ habit of overfitting to the training set.
• It can also be used to rank the importance of variables
in a regression problem in a natural way.
• In a regression tree, for each independent variable, the
data is split at several candidate split points. The sum of squared
errors (SSE) between the predicted and actual values
is calculated at each split point. The variable and split point
resulting in the minimum SSE are selected for the node.
This process is then continued recursively until the
entire data is covered.
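The split search at a single node can be sketched for one variable as follows. This is a simplified Python illustration, not the algorithm of any particular library: it tries the midpoints between consecutive distinct values, predicts the mean on each side, and keeps the split with the lowest total SSE. The data are made up.

```python
def sse(values):
    """Sum of squared errors around the mean (the node's prediction)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total SSE) of the best binary split on one variable."""
    pairs = sorted(zip(x, y))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                          # no threshold between equal values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Illustrative data: living area against log sale price.
area = [800, 950, 1100, 1900, 2100, 2400]
log_price = [11.5, 11.6, 11.7, 12.4, 12.5, 12.6]

threshold, error = best_split(area, log_price)
print(threshold, round(error, 4))
```

A full tree repeats this search over every candidate variable at each node and then recurses into the two resulting subsets.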
Method of Analysis – Variable
Selection
■ Random Forest Model
– Model
■ A model was built on the train data
using random forest, with number of
trees: 100
– Importance of Variables
■ Variable importance measures were extracted
■ Variables were sorted in descending order
of the importance measure
– Variables are introduced into the OLS model
in order of the importance measure
[Figure: Model 1 (random forest) → table of variable importance → variables sorted in descending order of their importance → variables introduced one by one into Model 2]
Method of Analysis – Multiple Linear
Regression
■ Ordinary Least Squares (OLS):
– The simplest regression method, in which the
unknown coefficients of the features are estimated
so as to minimize the sum of squared
errors, i.e.
$\min \sum_{i} (Y_i - \hat{Y}_i)^2$
– Visually, this is the sum of the squared vertical
distances between each data point in the set
and the corresponding point on the regression
line
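For a single predictor the least-squares coefficients have a closed form, which makes the idea easy to check by hand. The sketch below is a Python illustration with made-up data; the deck's model is a multiple regression fitted in R, so this one-variable case is a simplification.

```python
def fit_ols(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_errors(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Illustrative data lying close to the line y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_ols(x, y)
fitted_sse = sum_squared_errors(x, y, intercept, slope)
nudged_sse = sum_squared_errors(x, y, intercept, slope + 0.1)

print(round(intercept, 4), round(slope, 4))
print(fitted_sse < nudged_sse)
```

Moving the slope away from the fitted value increases the sum of squared errors, which is what minimizing it means in practice.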
Method of Analysis – Multiple Linear
Regression
■ Iteration
– Select one variable at a time from the variable importance table created using random forest
■ Significance
– Check the significance of the new variable, along with the existing variables, using its t-value and the probability of the t-statistic.
– If R-square improves, keep the variable; otherwise drop it
■ Multicollinearity
– Multicollinearity is checked at each step, ensuring that the maximum variance inflation factor (VIF) is < 4
– If the new variable has a VIF above the threshold value, drop it
– If introducing the new variable increases the VIF of any existing variable, the variable
with the lowest t-value is dropped
■ Model Accuracy
– After adding the new variable, model accuracy on train and test data is checked using error rate and
MAPE.
– Drop the new variable if the model accuracy falls.
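The multicollinearity check can be illustrated with the variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others. The Python sketch below covers only the special case of two predictors, where that R² equals the squared Pearson correlation; the data and function names are made up, and the threshold of 4 is the one quoted on the slide.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF of either predictor when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

VIF_THRESHOLD = 4  # threshold used in the deck

# Illustrative predictors: x2 is moderately related to x1, x3 almost duplicates it.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
x3 = [1.1, 2.0, 2.9, 4.2, 5.0]

vif_moderate = vif_two_predictors(x1, x2)
vif_severe = vif_two_predictors(x1, x3)

print(round(vif_moderate, 2), vif_moderate < VIF_THRESHOLD)
print(vif_severe > VIF_THRESHOLD)
```

Under the rule on the slide, a candidate like `x3` would be dropped, while `x2` would pass the multicollinearity check.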
Method of Analysis – Random Forest
• Variables
• The same variables used in the linear regression are used
• Trees and Nodes
• Various combinations of the number of trees and the maximum number of nodes were checked to get the best
result.
• Number of trees = 100 and maximum nodes = 10 gave the best-fitted model.
• Model Accuracy
• Model accuracy is checked using error rate and MAPE
• Decision
• Drop the variable if the model accuracy falls or remains the same.
Method of Analysis – Model Accuracy
• Error rate
• Calculated as: Error Rate = $\left(1 - \frac{\hat{Y}}{Y}\right) \times 100$
• Minimum and maximum error rates are calculated for train and test data.
• Aim is to reduce the error rate
• The difference between the minimum and maximum error rates is compared between training and validation data.
• MAPE
• Mean Absolute Percentage Error is calculated as:
$MAPE = \mathrm{Mean}\left(\left|1 - \frac{\hat{Y}}{Y}\right|\right) \times 100$
• Aim is to minimize the MAPE
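Both accuracy measures can be sketched as below. This Python illustration assumes that the ratio in the formulas is predicted value over actual value and that MAPE averages the absolute error rates; the numbers are made up and sit on a log-price scale.

```python
def error_rates(actual, predicted):
    """Signed percentage error for each observation: (1 - predicted/actual) * 100."""
    return [(1 - p / a) * 100 for a, p in zip(actual, predicted)]

def mape(actual, predicted):
    """Mean absolute percentage error: the mean of the absolute error rates."""
    rates = error_rates(actual, predicted)
    return sum(abs(r) for r in rates) / len(rates)

# Illustrative actual and predicted values (hypothetical, log-price scale).
actual = [12.0, 12.5, 11.8]
predicted = [12.12, 12.4, 11.8]

rates = error_rates(actual, predicted)
print("minimum error rate:", round(min(rates), 3))
print("maximum error rate:", round(max(rates), 3))
print("MAPE:", round(mape(actual, predicted), 3))
```

With this sign convention a negative error rate means the model over-predicted, and a positive one means it under-predicted.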
Results
Result – Interpretation
■ Linear Regression Model
– All the variables taken in the final
model.
– Adjusted R-square is 90.76%
■ i.e. these variables together
explain 90.76% of the variability of
SalePrice.
– All of the variables are significant.
– Multicollinearity is not severe. All
VIFs are below 4.
Result – Interpretation
■ Linear Regression Model
– Train Data
■ Minimum error rate is -9.3%
■ Maximum error rate is 3.35%
■ MAPE is 0.697
– Test Data
■ Minimum error rate is -8.68%
■ Maximum error rate is 3.41%
■ MAPE is 0.778
■ Random Forest Model
– Train Data
■ Minimum error rate is -6.9%
■ Maximum error rate is 4.49%
■ MAPE is 0.993
– Test Data
■ Minimum error rate is -7.29%
■ Maximum error rate is 3.37%
■ MAPE is 1.02
Result
■ MAPE
– Lower for Linear Regression
– Higher for Random Forest
■ Linear Regression model chosen, based on minimum MAPE
THANK YOU
APPENDIX
Sample Predicted Sale Price – Linear Regression Model
Id     SalePrice   Id     SalePrice   Id     SalePrice   Id
1461   126701.3    1483   175239.3    1505   225337.5    1527
1462   162982.1    1484   170172.4    1506   194180.3    1528
1463   178089.9    1485   175177.9    1507   271850.8    1529
1464   208902.2    1486   199074.1    1508   208083.3    1530
1465   193823.7    1487   309688.9    1509   159138.8    1531
1466   168862.5    1488   245865.5    1510   139717.4    1532
1467   193708.7    1489   190057.9    1511   151421.5    1533
1468   165445.6    1490   225320      1512   174073.1    1534
1469   196142.3    1491   193731.9    1513   138853.8    1535

Prediction of House Sales Price

  • 1. HOUSE PRICES Advanced Regression Technique Prepared by: Anirvan Ghosh
  • 2. Outline ■ Project Objective ■ Data Source and Variables ■ Data Processing ■ Method of Analysis ■ Result ■ Predicted House Prices All coding and model building is done using R software
  • 3. Objective ■ Create an analytical framework to understand – Key factors impacting house price ■ Develop a modeling framework – To estimate the price of a house that is up for sale
  • 4. Data Source and Variables ■ Kaggle competition - “House Prices: Advanced Regression Techniques” – Dataset prepared by Dean De Cock ■ Variables: – 79 variables present in the dataset ■ Variable named “SalePrice” – Dependent variable – Represent final price at which the house was sold ■ Remaining 78 variables – Represent different attributes of the house like area, car parking, number of fireplaces, etc.
  • 5. Data Processing
    ■ Normalizing Response Variable
    ■ Training Vs Validation split
      – Train data – 75%
      – Validation data – 25%
    ■ Data cleansing
      – Variable treatments
    ■ Missing value treatment:
      – Continuous variables
      – Character variables
    ■ Outlier treatment
    ■ Variable creation:
      – Character variables were converted to indicators
      – Based on the train data, character variables were grouped further and new indicators were created
  • 6. Data Processing – Normalizing Response Variable
    ■ The response variable is converted to its logarithmic form to normalize it.
    ■ Underlying reason:
      – Satisfying the basic assumption of ordinary least squares (OLS)
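
The effect of the log transform on slide 6 can be illustrated with a small sketch. The slides state the work was done in R; this is a Python illustration only. The prices are made up (not from the Kaggle data), and the helper name `skewness` is hypothetical.

```python
import math
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

# Made-up sale prices with a long right tail (illustrative, not real data).
prices = [90_000, 120_000, 135_000, 150_000, 160_000,
          180_000, 210_000, 260_000, 400_000, 750_000]
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)
print(f"skewness raw: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

On this toy sample the log-transformed values are noticeably less right-skewed than the raw prices, which is the motivation the slide gives for modeling log(SalePrice).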
  • 7. Data Processing – Training Vs Validation split
    ■ Training Data
      – Contains 75% of the total observations, picked at random.
      – The model is developed on this dataset.
    ■ Validation Data
      – Contains the remaining 25% of the total observations.
      – Validation of the model is carried out on this data.
      – Model tuning, if required, is carried out based on the model's performance on the validation data.
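
A minimal sketch of the 75/25 random split on slide 7, in Python rather than the R used by the author. The function name `train_validation_split`, the seed, and the use of bare ids in place of full rows are assumptions made for illustration.

```python
import random

def train_validation_split(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of `rows` and cut it into train and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Ids 1..1460 stand in for the labelled rows (an assumption for this sketch).
house_ids = list(range(1, 1461))
train, validation = train_validation_split(house_ids)
print(len(train), len(validation))
```

Fixing the seed keeps the split stable between runs, so model comparisons are made against the same validation rows each time.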
  • 8. Data Processing – Missing value treatment
    ■ Continuous Variable
      – Missing values are replaced by the median of the corresponding variable.
    ■ Why median and not mean?
      – The mean is highly sensitive to outliers
      – The median is a more stable measure
    ■ Character Variable
      – A separate category is created for missing values
    ■ This helps retain the predictive power of the variable
    ■ The impact of missing values on the dependent variable can also be established
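
The two imputation rules on slide 8 could be sketched as follows (Python for illustration; the original work was in R). The helper names and the sample values are hypothetical; `None` stands for a missing value.

```python
import statistics

def impute_median(values):
    """Replace None in a numeric column with the median of the observed values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]

def impute_category(values, label="Missing"):
    """Replace None in a character column with a separate 'Missing' category."""
    return [label if v is None else v for v in values]

# Illustrative values only; 313.0 plays the role of an outlier.
lot_frontage = [60.0, None, 80.0, 70.0, 313.0, None]
alley = ["Grvl", None, "Pave", None]

filled_frontage = impute_median(lot_frontage)
filled_alley = impute_category(alley)
print(filled_frontage)
print(filled_alley)
```

Here the median of the observed values is 75.0, while the mean would be 130.75 because of the single large value, which is the stability argument the slide makes.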
  • 9. Data Processing – Outlier treatment
    ■ Upper-tail values
      – Cut-off value: the 99th percentile plus 1.5 times the IQR of the corresponding variable.
    ■ Values above the cut-off are replaced by the 99th percentile
    ■ Lower-tail values
      – Cut-off value: the 1st percentile minus 1.5 times the IQR
    ■ Values below the cut-off are replaced by the 1st percentile
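
The capping rule on slide 9 can be sketched as below. This is a Python illustration, not the author's R code; the percentile definition (linear interpolation) is an assumption, since the slides do not say which one was used, and the function names are hypothetical.

```python
def percentile(sorted_values, p):
    """Linear-interpolation percentile (p in 0..100) of an already sorted list."""
    position = (len(sorted_values) - 1) * p / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def cap_outliers(values):
    """Cap values beyond P99 + 1.5*IQR at P99, and below P1 - 1.5*IQR at P1."""
    ordered = sorted(values)
    p1, p99 = percentile(ordered, 1), percentile(ordered, 99)
    iqr = percentile(ordered, 75) - percentile(ordered, 25)
    upper_cut, lower_cut = p99 + 1.5 * iqr, p1 - 1.5 * iqr
    capped = []
    for v in values:
        if v > upper_cut:
            capped.append(p99)
        elif v < lower_cut:
            capped.append(p1)
        else:
            capped.append(v)
    return capped

# Toy column: 1..100 plus one extreme value.
values = list(range(1, 101)) + [1000]
capped = cap_outliers(values)
print(capped[-1])
```

Only the extreme value is changed; everything inside the cut-offs is left untouched.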
  • 10. Data Processing – Variable creation
    ■ Character Variables
      – (n – 1) indicators are created for a character variable containing n different categories
      – A separate indicator is created for missing values
    ■ Additional Indicator Variables
      – Based on bivariate plots: if two or more categories show a similar level of the dependent variable, they are combined and converted into a single indicator
    ■ For example, Alley has three levels: Grvl, Missing and Pave. A new variable was created for the Missing and Pave categories, as both have a high median value of the dependent variable.
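
A small sketch of the indicator creation on slide 10, using the Alley example from the slide. It is written in Python for illustration (the author used R); the function name, the column naming scheme, and the choice of the first sorted level as the baseline are assumptions.

```python
def make_indicators(values, prefix):
    """Create n-1 indicator columns; the first sorted level is the baseline."""
    levels = sorted(set(values))
    return {
        f"{prefix}_{level}": [1 if v == level else 0 for v in values]
        for level in levels[1:]
    }

# Missing values are assumed to be already recoded as the "Missing" category.
alley = ["Grvl", "Missing", "Pave", "Missing", "Grvl"]
indicators = make_indicators(alley, "Alley")

# Grouped indicator from the slide: Missing and Pave combined into one flag.
alley_missing_or_pave = [1 if v in ("Missing", "Pave") else 0 for v in alley]

print(indicators)
print(alley_missing_or_pave)
```

With three levels, two indicators are produced, and Grvl is represented by both being zero.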
  • 11. Method of Analysis
    ■ Variable Selection Using Random Forest
    ■ Multiple Linear Regression
      – Significance
        ■ t value and probability of t value
      – Goodness of fit
        ■ Adjusted R-square
      – Multicollinearity
    ■ Random Forest
    ■ Model Accuracy
      – Error rate
      – MAPE
  • 12. Introduction – Random Forest
    • Random forests operate by constructing a multitude of decision trees at training time and outputting the mean prediction of the individual trees. They also correct the decision trees' habit of overfitting to the training set.
    • They can also be used to rank the importance of variables in a regression problem in a natural way.
    • In a regression tree, for each independent variable, the data is split at several split points. The Sum of Squared Errors (SSE) between the predicted and actual values is calculated at each split point. The variable resulting in the minimum SSE is selected for the node. This process is then continued recursively until the entire data is covered.
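
The split search described on slide 12 can be sketched for a single variable as follows. This is a simplified Python illustration (not the author's R code and not a full random forest): each side of a split predicts its own mean, and candidate thresholds are taken as midpoints between distinct values, which is an assumption. The data and names are made up.

```python
def sse(values):
    """Sum of squared errors around the mean (the node's prediction)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_sse) for the split on x that minimizes SSE in y."""
    best = None
    candidates = sorted(set(x))
    for left_value, right_value in zip(candidates, candidates[1:]):
        threshold = (left_value + right_value) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Toy data: living area against log sale price (illustrative values).
living_area = [800, 900, 1000, 2000, 2100, 2200]
log_price = [11.5, 11.6, 11.7, 12.4, 12.5, 12.6]
threshold, total_sse = best_split(living_area, log_price)
print(threshold, round(total_sse, 4))
```

A full tree would run this search over every variable, keep the variable and threshold with the lowest SSE, and then repeat on each resulting subset.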
  • 13. Method of Analysis – Variable Selection
    ■ Random Forest Model
      – Model
        ■ A model was built on the train data using random forest
          – Number of trees: 100
      – Importance of Variables
        ■ Variable importances were extracted
        ■ Variables are sorted in descending order of the importance measure
      – Variables are introduced into the OLS model in order of their importance measure
    Flow diagram: Model 1 (Random Forest) → table of variable importance → variables sorted in descending order of importance → variables introduced one by one into Model 2
  • 14. Method of Analysis – Multiple Linear Regression
    ■ Ordinary Least Squares (OLS):
      – The simplest regression method, in which the unknown coefficients of the features are estimated so as to minimize the sum of squared errors, i.e. $\min \sum_i (Y_i - \hat{Y}_i)^2$, where $Y_i$ is the observed value and $\hat{Y}_i$ the fitted value.
      – Visually, this is the sum of the squared vertical distances between each data point in the set and the corresponding point on the regression line.
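
To make the least-squares criterion on slide 14 concrete, here is a sketch for the one-predictor case, where the minimizing coefficients have a closed form. The slides fit a multiple regression in R; this Python version with a single predictor and made-up data is a simplification, and the function name is hypothetical.

```python
def fit_simple_ols(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

def sum_squared_errors(x, y, intercept, slope):
    """Sum over i of (Y_i - Y_hat_i)^2 for the given line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Made-up data lying close to a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
intercept, slope = fit_simple_ols(x, y)
fitted_sse = sum_squared_errors(x, y, intercept, slope)
perturbed_sse = sum_squared_errors(x, y, intercept, slope + 0.1)
print(round(intercept, 2), round(slope, 2))
```

Any other line, such as the one with a slightly perturbed slope, gives a larger sum of squared errors than the fitted one.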
  • 15. Method of Analysis – Multiple Linear Regression
    ■ Iteration
      – Select one variable at a time from the variable importance table created using random forest
    ■ Significance
      – Check the significance of the new variable, along with the existing variables, by its t-value and the probability of the t-statistic.
      – If R-square improves, keep the variable; otherwise drop it
    ■ Multicollinearity
      – Multicollinearity is checked at each step, ensuring the maximum VIF is below 4
      – If the new variable has multicollinearity above the threshold value, drop it
      – If introducing the new variable increases the multicollinearity of any existing variable, then the variable with the lowest t-value is dropped
    ■ Model Accuracy
      – After adding the new variable, the model accuracy on train and test data is checked using error rate and MAPE.
      – Drop the new variable if the model accuracy falls.
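
The multicollinearity check on slide 15 can be illustrated with the variance inflation factor, VIF = 1 / (1 - R²), compared against the threshold of 4. The sketch below covers only the two-predictor case, where R² is the squared Pearson correlation between the two predictors; the general case regresses each predictor on all the others. It is Python for illustration (the author used R), and the data and names are made up.

```python
def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def vif_two_predictors(a, b):
    """VIF of either predictor when the model has exactly these two predictors."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r * r)

VIF_THRESHOLD = 4.0  # threshold stated in the slides

# Illustrative predictors: rooms track living area closely, garage size does not.
living_area = [800, 900, 1000, 2000, 2100, 2200]
total_rooms = [4, 5, 5, 8, 9, 9]
garage_cars = [1, 2, 0, 2, 1, 2]

vif_high = vif_two_predictors(living_area, total_rooms)
vif_low = vif_two_predictors(living_area, garage_cars)
print(vif_high > VIF_THRESHOLD, vif_low > VIF_THRESHOLD)
```

Under the rule on the slide, a candidate like `total_rooms` would be dropped here, while `garage_cars` would pass the multicollinearity check.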
  • 16. Method of Analysis – Random Forest
    • Variables
      • Using the same variables as in the linear regression
    • Trees and Nodes
      • Checking various combinations of the number of trees and the maximum number of nodes to get the best result.
      • Using number of trees = 100 and maximum nodes = 10 for the best-fitted model.
    • Model Accuracy
      • Checking model accuracy using error rate and MAPE
    • Decision
      • Drop the variable if the model accuracy falls or remains the same.
  • 17. Method of Analysis – Model Accuracy
    • Error rate
      • Calculated as: $\text{Error Rate} = \left(1 - \frac{\hat{Y}}{Y}\right) \times 100$, where $Y$ is the actual value and $\hat{Y}$ the predicted value.
      • The minimum and maximum error rates are calculated for the train and test data.
      • The aim is to reduce the error rate.
      • The minimum and maximum error rates are also compared between training and validation data.
    • MAPE
      • Mean Absolute Percentage Error is calculated as: $\text{MAPE} = \mathrm{Mean}\left(\left|1 - \frac{\hat{Y}}{Y}\right|\right) \times 100$
      • The aim is to minimize the MAPE.
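
The two accuracy measures on slide 17 could be computed as in the sketch below. It is Python for illustration (the author used R) and assumes the ratio is predicted over actual, which is the usual MAPE convention; the slide's own formula did not survive extraction cleanly. The values are made up and sit on a log-price scale.

```python
def error_rates(actual, predicted):
    """Signed percentage error per observation: (1 - predicted / actual) * 100."""
    return [(1 - p / a) * 100 for a, p in zip(actual, predicted)]

def mape(actual, predicted):
    """Mean absolute percentage error: the mean of the absolute error rates."""
    rates = error_rates(actual, predicted)
    return sum(abs(r) for r in rates) / len(rates)

# Illustrative actual and predicted values (log of sale price).
actual = [12.0, 12.5, 11.8, 12.2]
predicted = [12.12, 12.375, 11.8, 12.322]

rates = error_rates(actual, predicted)
min_rate, max_rate = min(rates), max(rates)
model_mape = mape(actual, predicted)
print(round(min_rate, 2), round(max_rate, 2), round(model_mape, 2))
```

A negative error rate means the model over-predicted that observation; MAPE discards the sign and averages the magnitudes.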
  • 19. Result – Interpretation
    ■ Linear Regression Model
      – All the variables taken in the final model.
      – Adjusted R-square is 90.76%
        ■ i.e. these variables together explain 90.76% of the variability of SalePrice.
      – All of the variables are significant.
      – Multicollinearity is not severe. All VIFs are below 4.
  • 20. Result – Interpretation
    ■ Linear Regression Model
      – Train Data
        ■ Minimum error rate is -9.3%
        ■ Maximum error rate is 3.35%
        ■ MAPE is 0.697
      – Test Data
        ■ Minimum error rate is -8.68%
        ■ Maximum error rate is 3.41%
        ■ MAPE is 0.778
    ■ Random Forest Model
      – Train Data
        ■ Minimum error rate is -6.9%
        ■ Maximum error rate is 4.49%
        ■ MAPE is 0.993
      – Test Data
        ■ Minimum error rate is -7.29%
        ■ Maximum error rate is 3.37%
        ■ MAPE is 1.02
  • 21. Result
    ■ MAPE
      – Lower for Linear Regression
      – Higher for Random Forest
    ■ Linear Regression Model chosen
      – Based on minimum MAPE
  • 24. Sample Predicted Sale Price – Linear Regression Model
    Id    SalePrice   Id    SalePrice   Id    SalePrice   Id
    1461  126701.3    1483  175239.3    1505  225337.5    1527
    1462  162982.1    1484  170172.4    1506  194180.3    1528
    1463  178089.9    1485  175177.9    1507  271850.8    1529
    1464  208902.2    1486  199074.1    1508  208083.3    1530
    1465  193823.7    1487  309688.9    1509  159138.8    1531
    1466  168862.5    1488  245865.5    1510  139717.4    1532
    1467  193708.7    1489  190057.9    1511  151421.5    1533
    1468  165445.6    1490  225320      1512  174073.1    1534
    1469  196142.3    1491  193731.9    1513  138853.8    1535