House On Sale - Prediction of Sales Price
Kexin
liu.kexin22@gmail.com
Module II
ML - Algorithm - Regression
Inspect Problems
Avg Price: $180k
Top Sale: June 2007
Top Drop: 2009 - 2010, ↓50%
Best Style: One Story
Best Building Type: Single-family Detached
[Charts: time axis 2007 - 2010; legend: Mode, Mean]
Preprocessing
Dataset Info
Total R (rows): 1460
Total C (columns): 81
Null: 6965
Duplicates: 0
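The four counts above can be reproduced with a few pandas calls. This is a minimal sketch on a toy table; the values and the helper name `dataset_info` are hypothetical and are not taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing table (hypothetical values, not the real data).
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 65.0],
    "Alley": ["Grvl", "Pave", np.nan, "Grvl"],
    "SalePrice": [208500, 181500, 223500, 208500],
})

def dataset_info(frame: pd.DataFrame) -> dict:
    """Return row count, column count, total NaN cells, and duplicated rows."""
    return {
        "total_rows": frame.shape[0],
        "total_cols": frame.shape[1],
        "null": int(frame.isna().sum().sum()),
        "duplicates": int(frame.duplicated().sum()),
    }

info = dataset_info(df)
print(info)
```

On the real training file, the same helper would be expected to report the numbers listed on the slide.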
Variable Type
Cat - Nominal
'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour',
'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'BldgType',
'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
'MasVnrType', 'Foundation', 'BsmtExposure', 'BsmtFinType1',
'BsmtFinType2', 'Heating', 'CentralAir', 'Electrical', 'BsmtFullBath',
'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
'KitchenAbvGr', 'TotRmsAbvGrd',
'Functional', 'Fireplaces', 'GarageType', 'GarageFinish',
'GarageCars', 'PavedDrive', 'Fence', 'MiscFeature', 'SaleType'
Cat - Ordinal
'Condition1', 'Condition2', 'OverallQual', 'OverallCond',
'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'HeatingQC',
'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond',
'PoolQC', 'SaleCondition'
Num (Continuous, Discrete)
'YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'MoSold', 'YrSold',
'MSSubClass', 'LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1',
'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF',
'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
'MiscVal', 'SalePrice'
Missing Value
Cat
Fill NaN by String 'NO' ✔
Num - LotFrontage
Fill NaN by Prediction ?
Fill NaN by Mean ? ✔
Drop NaN column ?
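For the categorical option, a minimal sketch of filling NaN with the string 'NO' could look like the following. The toy rows and the explicit `cat_cols` list are assumptions made for illustration; the original notebook may select the columns differently.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values); 'Alley' and 'Fence' use NaN to mean "feature absent".
df = pd.DataFrame({
    "Alley": ["Grvl", np.nan, "Pave"],
    "Fence": [np.nan, "MnPrv", np.nan],
    "LotArea": [8450, 9600, 11250],
})

# Assumed list of categorical columns to fill.
cat_cols = ["Alley", "Fence"]

# Replace missing categories with the literal string 'NO' so they become their own level.
df[cat_cols] = df[cat_cols].fillna("NO")
print(df)
```

Treating the missing category as its own level keeps the rows, which matters when NaN actually encodes "no alley" or "no fence" rather than an unknown value.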
Missing Value
Num - LotFrontage
Drop NaN column ?
NO
LotFrontage: linear feet of street connected to property
Image: https://www.concordma.gov/DocumentCenter/View/1385/Section-6-PDF?bidId=
Missing Value
Num - LotFrontage
Fill NaN by Prediction ?
NO
EDA - Correlation
Features: '1stFlrSF', 'LotArea', 'GrLivArea', 'TotalBsmtSF', 'GarageArea', 'MSSubClass'
Training Scores from models
Predict 'LotFrontage' with KNN - 0.52
Predict 'LotFrontage' with Linear - 0.40
Predict 'LotFrontage' with RandomForest - 0.59
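The comparison above can be sketched as follows, assuming scikit-learn regressors and their default R² `score`. The data is synthetic, the two feature columns only stand in for the correlated predictors, and the hyperparameters are assumptions, so the numbers will not match the slide.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in: two area-like predictors and a frontage-like target.
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(500, 2000, size=(n, 2))
y = 20 + 0.03 * X[:, 0] + rng.normal(0, 5, size=n)

# Candidate imputation models (hyperparameters are illustrative assumptions).
models = {
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "Linear": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit on rows where the target is known and report the training R^2 of each model.
scores = {name: model.fit(X, y).score(X, y) for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Training scores in the 0.4 to 0.6 range, as reported on the slide, suggest the predicted values would be noisy, which is consistent with the "NO" decision for this option.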
Missing Value
Num - LotFrontage
Fill NaN by Mean ?
Groupby Mean 'Neighborhood','YearBuilt'
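A minimal sketch of the group-mean fill with pandas is shown below. The toy rows are hypothetical, and the final fallback to the overall mean for groups with no observed value is an assumption that the slide does not state.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values).
df = pd.DataFrame({
    "Neighborhood": ["CollgCr", "CollgCr", "CollgCr", "Veenker", "Veenker"],
    "YearBuilt": [2003, 2003, 2003, 1976, 1976],
    "LotFrontage": [60.0, 70.0, np.nan, 80.0, np.nan],
})

# Mean LotFrontage of each (Neighborhood, YearBuilt) group, aligned to the original rows.
group_mean = df.groupby(["Neighborhood", "YearBuilt"])["LotFrontage"].transform("mean")

# Fill each missing value with the mean of its own group.
df["LotFrontage"] = df["LotFrontage"].fillna(group_mean)

# Assumed fallback: groups with no observed value get the overall mean.
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].mean())
print(df)
```

Using `transform` keeps the result the same length as the table, so it can be passed straight to `fillna`.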
EDA - Correlation
Target - SalePrice
Top 3 Corr
'GrLivArea', 'GarageArea', 'TotalBsmtSF'
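Ranking features by correlation with the target can be sketched with `DataFrame.corr`. The toy values are hypothetical, and Pearson correlation (the pandas default) is an assumption about what the slide used.

```python
import pandas as pd

# Toy table (hypothetical values) with two area features and one unrelated column.
df = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500, 3000],
    "GarageArea": [300, 250, 500, 450, 600],
    "MoSold": [6, 2, 9, 2, 6],
    "SalePrice": [120000, 160000, 210000, 240000, 310000],
})

# Correlation of every feature with the target, excluding the target itself.
corr = df.corr()["SalePrice"].drop("SalePrice")

# Rank by absolute correlation and keep the strongest features.
top = corr.abs().sort_values(ascending=False).head(2)
print(top)
```

On the full dataset, `head(3)` would give the top three correlations named on the slide.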
Feature Selection - All Numeric Variables
'GrLivArea',
'GarageArea',
'TotalBsmtSF',
'1stFlrSF',
'YearBuilt',
'YearRemodAdd',
'OpenPorchSF',
'LotArea',
'LotFrontage'
Feature Selection - Categorical Variables
Outliers
Building Models N
TrainTest Info
Total R (rows): 1426
Total C (columns): 80
Linear Regression: LR, Scaled LR, Lasso
Other models: KNN, Random Forest, Decision Tree
[Train / Test score tables]
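The model line-up above can be compared with one loop that records train and test scores. This is a sketch assuming scikit-learn, synthetic data from `make_regression`, an 80/20 split, and default-like hyperparameters; none of these settings are confirmed by the slides.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the prepared housing features.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# "Scaled LR" is read here as standardization followed by linear regression (assumption).
models = {
    "LR": LinearRegression(),
    "Scaled LR": make_pipeline(StandardScaler(), LinearRegression()),
    "Lasso": Lasso(alpha=1.0),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
}

# Fit each model and store (train R^2, test R^2).
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))

for name, (train_r2, test_r2) in results.items():
    print(f"{name:14s} train={train_r2:.2f} test={test_r2:.2f}")
```

A large gap between the train and test columns, typical for an unpruned decision tree, is the usual sign of overfitting to look for in such a table.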
Building Models C
TrainTest Info
Total R (rows): 1426
Total C (columns): 48
Linear Regression: LR, Scaled LR, Lasso
Other models: KNN, Random Forest, Decision Tree
[Train / Test score tables]
After Log Transform
TrainTest Info
Total R (rows): 1426
Total C (columns): 48
Model            Train   Test
LR               0.88    0.85
Lasso            0.38    0.84
KNN              0.74    0.74
Random Forest    0.75    0.71
Decision Tree    0.83    0.71
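A log transform of a right-skewed target can be sketched as below. The slides do not say which variables were transformed, so applying `np.log1p` to the price is an assumption, and the data here is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: price depends multiplicatively on the features,
# so log(price) is close to linear in them.
rng = np.random.default_rng(0)
n = 300
X = rng.uniform(0, 1, size=(n, 3))
price = 100_000 * np.exp(X @ np.array([0.8, 0.5, 0.2]) + rng.normal(0, 0.1, n))

X_train, X_test, p_train, p_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

# Fit on log1p(price); scores are R^2 measured on the log scale.
model = LinearRegression().fit(X_train, np.log1p(p_train))
train_r2 = model.score(X_train, np.log1p(p_train))
test_r2 = model.score(X_test, np.log1p(p_test))

# Map predictions back to the original price scale with the inverse transform.
pred_price = np.expm1(model.predict(X_test))
print(f"train={train_r2:.2f} test={test_r2:.2f}")
```

Scores computed on the log scale are not directly comparable with scores computed on raw prices, so the tables before and after the transform should be read with that in mind.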
Linear Regression: Train 0.88, Test 0.85
KNN: Train 0.74, Test 0.74

Machine Learning Project - House Price Prediction
