Boston housing data analysis

BOSTON HOUSING DATA ANALYSIS
The Boston housing data is a classic dataset that records the median values of 506 properties along with
details such as the crime rate in the town, the proportion of industrial property in the town and the
average number of rooms per property, among others. The goal of this exercise is to solve the prediction
problem with medv, the median value of owner-occupied homes in $1000s, as the response variable.
The data was randomly split into an 80-20 training-test sample. Multiple methods were employed to
solve the prediction problem, namely generalized linear regression, regression tree, generalized additive
model and neural network, to predict medv in the training and test data. The best model from each
method was evaluated and the results below were found.
Model equation, by method:
  GLM (Stepwise Variable Selection): -indus -age
  LASSO Regression: .
  Regression Tree: lstat + nox + crim + rm + dis
  GAM: Smoothing term: -age, -black, -zn, -ptratio, -chas, -rad
  Neural Network: -

                       GLM (Stepwise   LASSO        Regression   GAM       Neural
                       Selection)      Regression   Tree                   Network
Model MSE              23.73           24.97        -            -         -
R-squared              0.7216          0.7216       -            -         -
Adj R-squared          0.7138          -            -            0.877     -
AIC                    2439.68         -            -            2137.64   -
MSPE (In-sample)       23.02           23.01        13.43        10.18     0.077
MSPE (Out-of-sample)   18.15           18.30        16.53        10.15     9.00
Comparing the above models shows clearly that the neural network, built after choosing the number of
hidden nodes that results in the lowest test SSE, performed the best for predicting housing prices in
Boston. For the sake of interpretability, the GAM and the regression tree also performed fairly well and
can be used to gain an understanding of the response variable.

BOSTON HOUSING DATA:
BACKGROUND:
The Boston dataset is a classic dataset in the data science world that is used to benchmark algorithms. It
was collected by the US Census Service for the Boston, Mass. area. It was originally published in the
research paper Harrison, D. and Rubinfeld, D.L., `Hedonic prices and the demand for clean air', J. Environ.
Economics & Management, vol. 5, 81-102, 1978. Among its many applications, this dataset is apt for
exercising variable selection methods and building predictive models.
ABOUT THE DATA:
The dataset contains 506 observations and 14 variables, such as the crime rate, the proportion of
residential land zoned for lots over 25,000 sq. ft and the proportion of non-retail business acres, among
others. The data was randomly split into an 80-20 training-test sample using a seed value of 12420360.
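The 80-20 split described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the code used for the analysis: the function name is hypothetical, and although the seed value is taken from the text, a Python generator seeded with it will not reproduce the rows selected by the original software. Rounding the training size down is an assumption, chosen because it gives the 404 training observations mentioned in the regression tree section.

```python
import random

def train_test_split_indices(n_rows, train_fraction=0.8, seed=12420360):
    """Return (train_idx, test_idx) as sorted lists of row indices."""
    rng = random.Random(seed)      # local generator, so the split is reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)           # random order of all rows
    n_train = int(n_rows * train_fraction)   # rounds down: int(404.8) == 404
    return sorted(indices[:n_train]), sorted(indices[n_train:])

train_idx, test_idx = train_test_split_indices(506)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two index lists, so the test set is never seen while fitting.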
MODEL SELECTION:
1. GENERALIZED LINEAR REGRESSION:
I) Full Model:
For the full model, the response variable, medv, was modeled against all 13 explanatory variables.
The resulting p-value of the full model was < 0.05, indicating that the regression as a whole is
statistically significant. The R-squared of the full model was 0.7217, implying that 72.17% of the
variation in medv is explained by the regression model. All variables except indus and age were found
to be significant in explaining medv.
R-squared   Adj R-squared   MSE     AIC       MSPE (In-sample)   MSPE (Out-of-sample)
0.7217      0.7124          23.84   2443.48   23.01              18.33
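As a check on how the figures in this table relate to one another, the sketch below recomputes the adjusted R-squared and the model MSE from the reported R-squared and in-sample MSPE. It assumes n = 404 training observations and p = 13 predictors, and it assumes that the model MSE divides the residual sum of squares by n - p - 1 while the in-sample MSPE divides it by n. The report does not state these conventions, but the reported numbers are consistent with them. The function names are illustrative.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (plus intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def mse_from_mspe(mspe, n, p):
    """Convert SSE/n (in-sample MSPE) to SSE/(n - p - 1) (model MSE)."""
    sse = mspe * n
    return sse / (n - p - 1)

n, p = 404, 13   # assumed training size and number of predictors in the full model
print(round(adjusted_r_squared(0.7217, n, p), 4))  # 0.7124
print(round(mse_from_mspe(23.01, n, p), 2))        # 23.84
```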
II) Step-wise Variable Selection:
To identify the best model for predicting medv, forward, backward and stepwise variable selection were
used. In all three cases, the null model contained only a constant and the full model contained all
variables. AIC was used as the criterion for the variable selection methods employed. All three methods
produced the same result; the step function moving in both directions identified the best model, with
the lowest AIC value of 2439.68.
Final Model: medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + rad + tax + crim
The final model identified above was fit as a linear regression model to calculate the MSE, MSPE and
AIC.
R-squared   Adj R-squared   MSE     AIC       MSPE (In-sample)   MSPE (Out-of-sample)
0.7216      0.7138          23.73   2439.68   23.02              18.15
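The AIC values used for selection can be related to the in-sample error. The sketch below uses the Gaussian log-likelihood form of AIC for a least-squares fit, counting the slopes, the intercept and the error variance as estimated parameters. That this is the convention behind the reported values is an assumption, although it is the one R's AIC() applies to lm fits. Because the inputs are the rounded MSPE values from the tables, the results agree with the reported AICs only to within about 0.1.

```python
import math

def gaussian_aic(rss, n, n_predictors):
    """AIC = -2 * log-likelihood + 2 * k for a least-squares fit with Gaussian errors."""
    k = n_predictors + 2   # slopes + intercept + error variance
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return -2 * log_lik + 2 * k

n = 404                                          # assumed training size
full_aic = gaussian_aic(23.01 * n, n, 13)        # full model, RSS = MSPE * n
stepwise_aic = gaussian_aic(23.02 * n, n, 11)    # stepwise model drops indus and age
print(round(full_aic, 1), round(stepwise_aic, 1))
print(stepwise_aic < full_aic)
```

The stepwise model has a marginally larger residual sum of squares but two fewer parameters, which is why its AIC is lower.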

III) Reduced Model: LASSO regression
To identify the best model through LASSO regression, the independent variables were standardized and
the glmnet method was used to fit the LASSO regression model. Fig. 1 shows the LASSO plot indicating
which variables remain in the model at different values of lambda. As lambda increases, the number of
variables retained decreases. Cross-validation was used to identify an optimal value of the tuning
parameter lambda, lambda.min, of 0.0117. This optimal lambda is the value of lambda at which the
cross-validated MSE is the smallest.
Fig 1. LASSO regression plot for variable selection
Final Model: medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + lstat
The model was rebuilt using the optimal lambda, resulting in a best model that retains all the variables.
The out-of-sample error (MSPE), calculated on the test data, was found to be 18.30.
R-squared   Adj R-squared   MSE     AIC   MSPE (In-sample)   MSPE (Out-of-sample)
0.7216      -               24.97   -     23.01              18.30
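To illustrate why fewer variables survive as lambda grows, the sketch below fits a LASSO by cyclic coordinate descent with soft-thresholding on standardized predictors. It is a teaching sketch on made-up data, not the glmnet implementation and not the analysis code: the function names, the two toy predictors and the lambda values are all assumptions chosen for illustration.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def standardize(column):
    """Centre a column and scale it to unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / sd for v in column]

def lasso_coordinate_descent(columns, y, lam, n_sweeps=200):
    """Minimise (1/2n) * RSS + lam * sum(|b_j|) for standardized predictor columns."""
    n = len(y)
    intercept = sum(y) / n
    residual = [v - intercept for v in y]   # all coefficients start at zero
    beta = [0.0] * len(columns)
    for _ in range(n_sweeps):
        for j, x in enumerate(columns):
            # covariance of x_j with the residual after adding x_j's own effect back
            rho = sum(xi * (ri + beta[j] * xi) for xi, ri in zip(x, residual)) / n
            new_beta = soft_threshold(rho, lam)
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [ri - delta * xi for xi, ri in zip(x, residual)]
                beta[j] = new_beta
    return intercept, beta

# Hypothetical toy data: y depends strongly on x1 and weakly on x2.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [3, 1, 4, 1, 5, 9, 2, 6]
y = [2.0 * a + 0.3 * b for a, b in zip(x1, x2)]
columns = [standardize(x1), standardize(x2)]

nonzero_counts = []
for lam in (0.01, 2.0, 10.0):
    _, beta = lasso_coordinate_descent(columns, y, lam)
    nonzero_counts.append(sum(1 for b in beta if b != 0.0))
    print(lam, beta)
```

On this toy data the weak predictor is dropped first and the strong one only at the largest penalty. In the analysis, the cross-validated lambda of 0.0117 was small enough that all 13 variables stayed in the model.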
2. REGRESSION TREE:
The CART technique separates the dataset into bins by progressively adding variable-value combinations
to the sequence, ensuring that at each step the split increases the homogeneity of the resulting subsets
of observations. All 404 observations in the training dataset were fed into the regression tree and the
tree below was obtained.
From the regression tree, the MSPE on the training data was found to be 13.43 and on the test data was
found to be 16.53.

Fig 2. Regression tree plot from CART method
MSPE (In-sample)   MSPE (Out-of-sample)
13.43              16.53
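The splitting step that CART repeats at every node can be sketched as below: for one predictor, try each candidate threshold and keep the one that minimises the combined sum of squared errors of the two resulting groups. This is a simplified single-variable illustration with made-up values, not the routine used to grow the tree in Fig 2; a full tree would run this search over every predictor and recurse on each side.

```python
def sse(values):
    """Sum of squared deviations from the group mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_sse) for the best split of the form x < threshold."""
    pairs = sorted(zip(x, y))
    best = (None, sse(y))                 # baseline: no split at all
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                      # no threshold fits between equal x values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [yv for _, yv in pairs[:i]]
        right = [yv for _, yv in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical values: average rooms and median price in $1000s.
rooms = [4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0]
price = [15, 16, 17, 18, 30, 32, 34, 36]
threshold, total_sse = best_split(rooms, price)
print(threshold, total_sse)  # 6.25 25.0
```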
3. GENERALIZED ADDITIVE MODELS:
A generalized additive model was built with a non-linear component for all the variables except chas
and rad, which are categorical (chas is a binary indicator and rad is an index that takes a small number
of integer values). From the summary of this GAM model, zn, age, black and ptratio were found not to
have a non-linear relationship with the response variable, medv: zn, age and black were found to be
insignificant, while the edf of ptratio was found to be 1.

Based on the above inference, a new generalized additive model was built with a non-linear smoothing
term on the significant parameters only. The summary below was obtained as the output of the GAM model.
Model (Smoothing)                          Adj R-squared   AIC       MSPE (In-sample)   MSPE (Out-of-sample)
-age, -black, -zn, -ptratio, -chas, -rad   0.877           2173.64   10.18              10.65

The large reduction in both the in-sample and out-of-sample MSPE for the GAM model indicates that there
might be a strong non-linear relationship between some of the independent variables and the response
variable. Hence the GAM model, which provides greater flexibility, might be a better model for this
prediction problem.
4. NEURAL NETWORK:
To implement the neural network algorithm, a data preprocessing step is required. This step is
necessary to ensure that the algorithm converges. The independent variables were normalized
with min-max scaling using x = (X - Xmin)/(Xmax - Xmin).
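The min-max scaling described above can be sketched as follows. The function names and the sample values are hypothetical. Taking Xmin and Xmax from the training column and reusing them for the test column is an assumption about the workflow, made here so that the test data does not influence the preprocessing; the report does not say which rows the minimum and maximum were computed from.

```python
def fit_min_max(column):
    """Return (Xmin, Xmax) of a column, typically the training column."""
    return min(column), max(column)

def apply_min_max(column, lo, hi):
    """Scale values with x = (X - Xmin) / (Xmax - Xmin)."""
    if hi == lo:
        raise ValueError("a constant column cannot be min-max scaled")
    return [(value - lo) / (hi - lo) for value in column]

# Hypothetical values for one predictor.
train_column = [4.0, 6.0, 8.0]
test_column = [5.0, 9.0]

lo, hi = fit_min_max(train_column)
scaled_train = apply_min_max(train_column, lo, hi)
scaled_test = apply_min_max(test_column, lo, hi)
print(scaled_train)  # [0.0, 0.5, 1.0]
print(scaled_test)   # [0.25, 1.25]
```

Test values outside the training range map outside [0, 1], as the second output shows.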
Once the independent variables were scaled, the nnet function was run in a loop over candidate numbers
of hidden nodes to find the number of nodes that minimizes the test SSE.
Fig 3. Plot to evaluate optimal number of nodes
From the above plot, it is evident that the test SSE is at its minimum for 14 hidden nodes. With this
evaluation, the neural network was rebuilt using 14 hidden nodes to obtain the best MSPE for the
in-sample and out-of-sample sets.
MSPE (In-sample)   MSPE (Out-of-sample)
0.076              9.001
Based on the MSPE values calculated for the neural network, this model performs the best in
comparison with all the models run.

CONCLUSION:
Summarizing the results from all the models run for the prediction problem, the table below was
populated. From this table, in-sample performance can be compared using AIC and the in-sample MSPE,
while out-of-sample performance can be compared using the out-of-sample MSPE.
Model equation, by method:
  GLM (Stepwise Variable Selection): -indus -age
  LASSO Regression: .
  Regression Tree: lstat + nox + crim + rm + dis
  GAM: Smoothing term: -age, -black, -zn, -ptratio, -chas, -rad
  Neural Network: -

                       GLM (Stepwise   LASSO        Regression   GAM       Neural
                       Selection)      Regression   Tree                   Network
Model MSE              23.73           24.97        -            -         -
R-squared              0.7216          0.7216       -            -         -
Adj R-squared          0.7138          -            -            0.877     -
AIC                    2439.68         -            -            2137.64   -
MSPE (In-sample)       23.02           23.01        13.43        10.18     0.076
MSPE (Out-of-sample)   18.15           18.30        16.53        10.15     9.001
For the Boston housing data, the neural network model clearly generated the lowest MSPE for the training
sample chosen. The in-sample MSPE was 0.076 and the out-of-sample MSPE was 9.001, the lowest values
among the models fitted to this sample. For the sake of interpretability, the regression model and GAM
could also be further evaluated and better understood to predict the response variable, medv.
Extending this study, cross-validation methods can be used for all the models to generate a more
comparable value of the MSPE that is independent of the sample chosen.
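The cross-validation extension suggested above can be sketched as follows. This is a minimal k-fold illustration, not part of the original analysis: the function names are hypothetical, the response values are made up, and a model that simply predicts the training mean stands in for any of the models in the report.

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Shuffle the row indices and deal them into k folds of near-equal size."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_mspe(y, fit, predict, k=5, seed=0):
    """Average squared prediction error, with every row predicted while held out."""
    folds = k_fold_indices(len(y), k, seed)
    squared_error = 0.0
    for fold in folds:
        held_out = set(fold)
        train_rows = [i for i in range(len(y)) if i not in held_out]
        model = fit(train_rows)           # fit on the other k - 1 folds only
        squared_error += sum((y[i] - predict(model, i)) ** 2 for i in fold)
    return squared_error / len(y)

# Hypothetical response values in $1000s.
y = [20.0, 22.0, 35.0, 33.0, 36.0, 29.0, 23.0, 27.0, 17.0, 19.0]

def fit_mean(train_rows):
    """Stand-in model: the mean response of the training rows."""
    return sum(y[i] for i in train_rows) / len(train_rows)

def predict_mean(model, row):
    """The stand-in model predicts the same value for every row."""
    return model

cv_mspe = cross_validated_mspe(y, fit_mean, predict_mean, k=5)
print(cv_mspe)
```

Because every observation is held out exactly once, the resulting MSPE depends less on one particular 80-20 split than the single out-of-sample figures reported above.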
