BOSTON HOUSING DATA
A Comprehensive Regression Analysis
Ravish Kalra
Graduate Student, Business Analytics
University of Cincinnati
Table of Contents
Executive Summary - Boston Housing Data
Boston Housing Data
  Introduction
  Exploratory Data Analysis
  Variable Selection and Modelling
  Residual Diagnostics
  Final Model
  Comparison with CART
Executive Summary - Boston Housing Data
This report analyses and evaluates the factors affecting the median value of owner-occupied
homes in the suburbs of Boston. The built-in Boston Housing data set is used for this analysis.
It covers factors relating to structural quality, neighbourhood, accessibility and air pollution,
such as per capita crime rate by town, proportion of non-retail business acres per town, and
index of accessibility to radial highways.
Methods of analysis include (but are not limited to) summary statistics and visualization of the
distribution of the variables, correlation analysis between variables, and linear regression on
the data.
Further, variable selection methods such as Best Subset, Stepwise Selection and LASSO were
applied to arrive at the best linear regression model for predicting the median value of
owner-occupied homes. These models were then compared with a custom model designed using the
findings of the initial exploration.
Finally, linear regression was compared with CART for predicting the median price values from
the same data. The results indicated that although CART outperformed the full linear regression
model, the linear model that incorporated the findings of the exploratory phase was still the
better choice.
The final model included interaction terms and variable transformations. It achieved an
adjusted R-squared of 0.85 and an average RMSE of about 3.61:
medv ~ nox + ptratio + age + (log(lstat) + rm + log(crim) + dis) * rad_c
Boston Housing Data
Introduction
The full data set consists of 506 observations and 14 variables. The data was split at random into
training and test sets in an 80:20 ratio, giving a training set of 404 observations and 14
variables, all stored as numeric. The variable chas (which captures the amenities of a riverside
location) is categorical, while the rest are continuous. The exploratory data analysis and the
selection of the best model for predicting the median value of owner-occupied homes are
presented below.
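The 80:20 split described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the report: the function name, the seed and the use of floor rounding are assumptions, chosen so that 506 rows give the 404 training rows quoted in the text.

```python
import random

def train_test_split(n_rows, train_fraction=0.8, seed=42):
    """Return (train_idx, test_idx) as sorted lists of row indices.

    Hypothetical helper: the seed and the floor rounding are assumptions.
    """
    rng = random.Random(seed)              # fixed seed keeps the split reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = int(n_rows * train_fraction)  # floor of 506 * 0.8 is 404
    return sorted(indices[:n_train]), sorted(indices[n_train:])

train_idx, test_idx = train_test_split(506)
print(len(train_idx), len(test_idx))
```

Changing the seed changes which rows land in each set, which is why results from a single split can differ noticeably from run to run.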
Exploratory Data Analysis
An initial look at the summary statistics of the data gives us the following insights:
• There are no NA / missing values in the data set.
• The median value of owner-occupied homes (medv, the dependent variable) ranges
from 5 to 50 (in $1000s).
• The average number of rooms per dwelling is ~6 rooms.
• The full-value property-tax rate (per $10,000) varies from 187 to 711.
• The proportion of owner-occupied units built prior to 1940 (age) is on the higher side: in
more than 50% of the observations, over 75% of the units were built before 1940.
From the distributions shown in Figure 1, the following can be concluded about the variables
taken for this study:
• The proportion of owner-occupied units built prior to 1940 (age) and the proportion of
blacks by town (black) are highly skewed to the left, which means that most of the
observations of these variables lie at the higher end.
• The average number of rooms per dwelling (rm) is approximately normally distributed,
centred at about 6 rooms.
• Most dwellings are at smaller distances from the five Boston employment centers
(dis is skewed to the right).
• There are more dwellings with a lower median value (less than $25,000) than
dwellings with a higher value (medv is skewed to the right).
• Most towns have a lower proportion of adults without high school education and of male
workers classified as laborers (lstat is skewed to the right).
• The full-value property-tax rate (tax, per $10,000) separates into 2 distinct clusters,
one below 500 and the other above 650.
• The index of accessibility to radial highways (rad) also separates into 2 distinct
clusters: a large number of dwellings have an index below 10 and the rest have an index
of 24.
Figure 1: Histograms of different variables of Boston data set
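The direction of skew read off the histograms can also be checked numerically. The sketch below computes the moment-based sample skewness in plain Python; the two small lists are made-up values for illustration, not columns of the Boston data.

```python
def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5.

    Positive for a long right tail, negative for a long left tail.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

# Made-up data: most values are small with a few large ones (right-skewed),
# and its mirror image (left-skewed).
right_skewed = [1, 1, 1, 2, 2, 3, 4, 6, 9, 15]
left_skewed = [100 - v for v in right_skewed]

print(skewness(right_skewed) > 0, skewness(left_skewed) < 0)
```

A variable such as dis or lstat would be expected to give a positive value, and age or black a negative one, if the histogram readings above hold.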
Studying the correlation between the variables, the following observations were made:
• A strong correlation of 0.912 between the variables rad and tax. This is expected, as the
property tax rate of dwellings often increases as accessibility to radial highways
increases.
• A correlation of 0.76 between the proportion of non-retail business acres per town (indus)
and the nitrogen oxides concentration (nox). This is consistent with non-retail
businesses contributing heavily to the nitrogen oxides concentration in the air.
• A correlation of 0.73 between the proportion of non-retail business acres per town (indus)
and the property tax rate of the dwellings (tax). The tax rate may also be influenced by the
presence of non-retail businesses near the dwellings.
• A correlation of 0.73 between the proportion of owner-occupied units built prior to 1940
(age) and the nitrogen oxides concentration (nox). This suggests that the older parts of
the city, where the older houses are situated, have more air pollution.
• A correlation of -0.74 between the mean distance to five Boston employment
centers (dis) and the proportion of owner-occupied units built prior to 1940 (age). It is
interesting to note that older homes tend to be closer to the employment centers, which
suggests that the city expanded outward from where the employment centers are located.
Correlation with the median value of owner-occupied homes (medv):
• A correlation of -0.74 with lstat (percent of lower status of the population), i.e. the
higher the proportion of people with lower status, the lower the value of the house. This
can be attributed to affordability.
• A correlation of 0.70 with the average number of rooms per dwelling (rm), i.e. as the
number of rooms increases, the price of the dwelling tends to increase.
Figure 2: Correlation matrix
Figure 3 shows scatter plots of the variables against medv, with fitted linear regression lines
to better visualize each relationship. The plots also reinforce our understanding of the
variables rad and tax, which are highly correlated. It can also be seen that, after a log
transformation, the variables crim and lstat follow the fitted line more closely.
Figure 3: Scatter plots of different variables and medv (including log transformed variables)
Table 1: Correlation coefficients with respect to medv
Variable                 lstat_log  lstat    rm       ptratio  indus    crim_log  crim
Correlation coefficient  -0.82      -0.74    0.70     -0.51    -0.48    -0.45     -0.39
p-value                  9.2e-122   5.0e-88  2.4e-74  1.6e-34  4.9e-31  3.8e-27   1.1e-19
Further analysis of the correlation coefficients with respect to medv (Table 1) confirms that
the log-transformed variables are more strongly linearly correlated with medv than the originals.
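The effect of a log transformation on linear correlation can be illustrated with a small sketch. The data here is synthetic and constructed so that y is exactly linear in log(x); it only demonstrates the mechanism and does not reproduce the values in Table 1.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Synthetic data (not the Boston data): x spans several orders of magnitude,
# and y falls linearly with log(x).
x = [0.01 * 2 ** k for k in range(12)]
y = [30 - 2 * math.log(v) for v in x]

r_raw = pearson(x, y)                          # correlation on the raw scale
r_log = pearson([math.log(v) for v in x], y)   # correlation after the log transform
print(round(r_raw, 2), round(r_log, 2))
```

On the raw scale the relationship is curved, so the coefficient is well short of -1; after the transform it is a straight line, which is the pattern Table 1 reports for lstat and crim.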
The high correlation between tax and rad can also be observed in Figure 4. Since both
distributions fall into two clusters, a new categorical variable called rad_c was created, and
the tax variable was dropped because rad_c explains most of the variation in tax.
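The report does not state how rad_c was coded. One plausible construction, given the two clusters seen in the histogram of rad, is a two-level variable split at an index of 10. The sketch below makes that assumption explicit; the function name, the threshold and the level labels are all hypothetical.

```python
def make_rad_c(rad, threshold=10):
    """Hypothetical two-level coding of rad: 'low' below the threshold, 'high' otherwise.

    The threshold of 10 is an assumption taken from the gap between the two
    clusters described in the exploratory analysis.
    """
    return "low" if rad < threshold else "high"

rad_values = [1, 2, 3, 4, 5, 6, 7, 8, 24]   # illustrative index values
rad_c = [make_rad_c(r) for r in rad_values]
print(rad_c)
```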
Figure 4: Correlation and plots of variables tax and rad
With the introduction of the new variable, a change of slope across the levels of rad_c is
observed for the variables shown in Figure 5.
Figure 5: Introduction of rad_c variable forces a change in slope
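A change of slope between groups is what an interaction term captures: a formula of the form y ~ x * g allows a different intercept and slope for each level of g. The sketch below shows the idea by fitting a simple least-squares line within each group and comparing it with a single pooled line. The data and group labels are synthetic assumptions, and the final model in this report interacts only some predictors with rad_c, so this is an illustration of the mechanism rather than of that model.

```python
def ols_line(x, y):
    """Least-squares (slope, intercept) for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Synthetic rows (x, group, y): the two groups follow lines with different slopes.
rows = [(x, "low", 10 + 3.0 * x) for x in range(1, 6)]
rows += [(x, "high", 5 + 1.0 * x) for x in range(1, 6)]

# One line per group, as an interaction with the grouping variable would allow.
fits = {}
for group in ("low", "high"):
    xs = [r[0] for r in rows if r[1] == group]
    ys = [r[2] for r in rows if r[1] == group]
    fits[group] = ols_line(xs, ys)

# A single pooled line ignores the grouping and averages the two slopes here.
pooled = ols_line([r[0] for r in rows], [r[2] for r in rows])
print(fits, pooled)
```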
Variable Selection and Modelling
For the modelling phase, both classical and regularization techniques for variable selection
were used to arrive at the best linear regression model for the dependent variable medv. Best
subset selection, stepwise selection and LASSO (with parameter tuning to select the best lambda)
were performed. Table 2 gives a summary of these models.
Table 2: Comparison of different models obtained through variable selection techniques
Method               Formula                          10-fold     In-sample   Out-sample  R2     Adj R2  AIC   BIC
                                                      cross       prediction  prediction
                                                      validation
Best subset          medv ~ chas + nox + rm + dis +   23.56       24.35       12.75       0.741  0.735   2462  2514
                     ptratio + black + lstat + crim
                     + zn + rad + tax
Stepwise (B/F/Both)  medv ~ chas + nox + rm + dis +   23.56       24.35       12.75       0.741  0.735   2462  2514
                     ptratio + black + lstat + crim
                     + zn + rad + tax
Full model           medv ~ .                         24.42       25.10       13.39       0.742  0.734   3030  3102
Lasso (λ = 0.034)    medv ~ chas + nox + rm + dis +   24.12       24.26       12.00       0.735  0.729   3036  3095
                     ptratio + black + lstat + crim
                     + zn + rad + age + indus
The gap between the in-sample and out-of-sample prediction errors was large and, surprisingly,
the error was lower out of sample. This was due to the single random train/test split of the
data and shows why the result of a single split should not be trusted. When the comparison was
repeated with 10-fold cross validation, a more realistic picture emerged, very different from
the out-of-sample prediction. Since the splits were random, each run of 10-fold cross
validation gave slightly different results.
From our exploratory data analysis, we discovered that taking the log of the crim and lstat
variables increased their linear correlation with medv. We also observed that interaction terms
with the rad_c variable explained more of the variation around the regression line. We now
compare the above models with a customized model that incorporates these findings from the
exploratory data analysis.
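The index bookkeeping behind k-fold and repeated k-fold cross validation can be sketched as below. Model fitting and error calculation are left out; the function names and seeds are assumptions, and 404 is the training-set size quoted in the Introduction.

```python
import random

def k_fold_indices(n_rows, k, seed):
    """Shuffle row indices and deal them into k folds of near-equal size."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def repeated_k_fold(n_rows, k, repeats, base_seed=0):
    """Yield (train_idx, test_idx) for every fold of every repeat.

    Each repeat reshuffles the rows with a different seed, so the same row
    ends up in different folds across repeats.
    """
    for r in range(repeats):
        folds = k_fold_indices(n_rows, k, seed=base_seed + r)
        for i, test_idx in enumerate(folds):
            train_idx = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
            yield train_idx, test_idx

# 10-fold cross validation on 404 rows: every row is held out exactly once.
splits = list(repeated_k_fold(404, k=10, repeats=1))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Averaging the held-out error over all folds, and over several repeats, gives a steadier estimate than a single split; with k=20 and repeats=5 the generator yields 100 train/test pairs.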
Figure 6 compares the models using repeated cross-validation with 5 repeats and 20 folds.
Figure 6: Model comparison (RMSE and R sq) at 95% Confidence Interval
The customized model performed much better at explaining the variation in median housing prices
and predicting out of sample.
Residual Diagnostics
Stepwise Selection Model v/s Custom Model Residual Comparison
Figure 7: Residual plots comparison for stepwise model (left) and custom model (right)
Figure 7 shows that the custom model displays a slight improvement in the Q-Q plot, indicating
that the residuals of the model are nearly normal. The curvature in the Scale-Location graph has
also been flattened to an extent in the custom model. This indicates that the assumptions of
linear regression hold better for the custom model than for the other models. Thus, the custom
model should be preferred for making out-of-sample predictions.
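The points behind a normal Q-Q plot can be computed without a plotting library, as in the sketch below. It uses only the Python standard library; the plotting positions (i + 0.5) / n are one common convention and an assumption here, and the residuals are made-up numbers rather than residuals from either model.

```python
from statistics import NormalDist, mean, pstdev

def qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot.

    Residuals are standardized, sorted, and paired with standard normal
    quantiles at the plotting positions (i + 0.5) / n.
    """
    n = len(residuals)
    mu, sigma = mean(residuals), pstdev(residuals)
    sample_q = sorted((r - mu) / sigma for r in residuals)
    theoretical_q = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical_q, sample_q))

# Made-up residuals for illustration only.
points = qq_points([-2.0, -1.0, 0.0, 1.0, 2.0])
for theoretical, sample in points:
    print(round(theoretical, 3), round(sample, 3))
```

If the residuals are close to normal, the pairs lie near the line sample = theoretical; systematic departures in the tails are what the Q-Q panels in Figure 7 are checked for.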
Final Model
Table 3: Model summary of the final selected model
Formula                                         R2     Adj R2  AIC   BIC   RMSE
medv ~ nox + ptratio + age + (log(lstat) + rm   0.854  0.850   2735  2794  3.607
+ log(crim) + dis) * rad_c
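The relationship between R2 and adjusted R2 in Table 3 can be checked with the standard formula, adj R2 = 1 - (1 - R2)(n - 1)/(n - p - 1). The sketch below assumes n = 404 training observations and p = 12 estimated coefficients besides the intercept (eight main effects plus four interactions, taking rad_c to have two levels); both figures are inferred from the text, not stated in the table.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Assumed inputs: n = 404 training rows, p = 12 coefficients besides the intercept.
print(round(adjusted_r2(0.854, 404, 12), 3))
```

Under these assumptions the result is roughly 0.850, in line with the adjusted R2 reported for the final model.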
Comparison with CART
After fitting the tree on the same train/test split, we observed the following values in
comparison with the linear model:
Table 4: Comparison of predictions made by linear regression and CART
Sample Type        Linear Regression (full model)  CART (cp = 0.015642)
In-Sample (80%)    21.50                           17.81
Out-Sample (20%)   23.91                           21.76
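For readers unfamiliar with CART, the core step of a regression tree is choosing the split of one predictor that minimizes the total squared error around the two group means. The sketch below shows that single step on synthetic data. It is not the tree fitted for Table 4, and it does not model the complexity parameter cp, which controls how far the tree is grown or pruned.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_sse) for the split x < threshold with the lowest SSE.

    Candidate thresholds are midpoints between consecutive distinct x values.
    Returns (None, sse(y)) when no split improves on the unsplit node.
    """
    pairs = sorted(zip(x, y))
    best = (None, sse(y))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                       # no threshold between equal x values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Synthetic data: y jumps from 5 to 20 somewhere between x = 3 and x = 10.
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
threshold, err = best_split(x, y)
print(threshold, err)
```

A full tree repeats this search over every predictor and then recursively within each resulting group, which lets it capture step-like relationships that a linear model needs transformations or interactions to represent.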
The values in Table 4 suggest that CART performed better than the full regression model. These
values are, however, volatile: the prediction errors vary with even a slight change in the
train/test split. Thus, to compare these two models with the custom model arrived at earlier,
we ran a repeated cross validation with 5 repeats and 20 folds. Figure 8 summarizes the
predictions from these repeats at a 95% confidence interval.
Figure 8: Comparison of model prediction between full linear regression, CART and custom model
From Figure 8, it is evident that CART performs better than the full linear regression model.
However, the final linear model, which retains the simplicity of linear regression and
incorporates the analysis done in the exploratory phase, outperforms the CART model.

Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 

Regression Study: Boston Housing

  • 1. BOSTON HOUSING DATA: A Comprehensive Regression Analysis. Ravish Kalra, Graduate Student, Business Analytics, University of Cincinnati
  • 2. Table of Contents

Executive Summary - Boston Housing Data ... 2
Boston Housing Data ... 3
    Introduction ... 3
    Exploratory Data Analysis ... 3
    Variable Selection and Modelling ... 7
    Residual Diagnostics ... 9
    Final Model ... 9
    Comparison with CART ... 10

Executive Summary - Boston Housing Data

This report analyzes and evaluates the factors affecting the median value of owner-occupied homes in the suburbs of Boston. The analysis uses the built-in Boston Housing data set and takes into account factors relating to structural quality, neighbourhood, accessibility and air pollution, such as per capita crime rate by town, proportion of non-retail business acres per town, and index of accessibility to radial highways.

Methods of analysis include (but are not limited to) summary statistics and visualization of the distributions of the variables, correlation between variables, and linear regression on the data. Further, variable selection methods such as Best Subset, Stepwise Selection and LASSO were performed to arrive at the best linear regression model for predicting the median value of owner-occupied homes. These models were then compared with a custom model designed to incorporate the findings of the initial exploration. Finally, linear regression and CART were compared on their predictions of the median price from the same data.

The results indicated that while CART outperformed the full linear regression model, the custom linear regression model, which captures the additional details found in the exploratory phase, was still the better choice. The final model included interaction terms and variable transformations. It resulted in an adjusted R-squared value of 0.85 and an average RMSE of 3.60:

medv ~ nox + ptratio + age + (log(lstat) + rm + log(crim) + dis) * rad_c
  • 3. Boston Housing Data

Introduction

The entire data set consists of 506 observations and 14 variables. An 80:20 train-test split was sampled for the study, resulting in a training set of 404 observations and 14 variables, all stored as numeric type. The variable chas (which captures the amenities of a riverside location) is categorical, while the rest are continuous. The exploratory data analysis and the selection of the best model for predicting the median value of owner-occupied homes follow.

Exploratory Data Analysis

An initial look at the summary statistics of the data gives the following insights:
    • There are no NA / missing values in the data set.
    • The median value of owner-occupied homes (medv, the dependent variable) ranges from 5 to 50 (in $1000s).
    • The average number of rooms per dwelling is ~6 rooms.
    • The full-value property-tax rate (per $10,000) varies from 187 to 711.
    • The proportion of owner-occupied units built prior to 1940 is on the upper side: in more than 50% of the observations, the value of age exceeds 75.

From the distributions shown in Figure 1, the following can be concluded about the variables taken for this study:
    • The proportion of owner-occupied units built prior to 1940 (age) and the proportion of blacks by town (black) are highly skewed to the left, which means that most values of these variables lie at the higher end.
    • The average number of rooms per dwelling (rm) follows an approximately normal distribution, i.e. most of the dwellings have an average of 6 rooms.
    • More dwellings have smaller distances to the five Boston employment centers (dis is skewed to the right).
    • More dwellings have a lower median value (less than $25,000) than a higher one (medv is skewed to the right).
    • The proportion of adults without high school education and of male workers classified as laborers is low for most dwellings in the Boston suburbs (lstat is skewed to the right).
    • The full-value property-tax rate (tax, measured per $10,000) separates into 2 distinct clusters, one below 500 and the other more than 700.
    • The index of accessibility to radial highways (rad) also separates into 2 distinct clusters: a large number of dwellings have an index of less than 10 and the rest have an index of 24.
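The 80:20 split described in the introduction can be sketched as below. This is a minimal illustration, not the code used in the study: the helper name `train_test_split`, the seed and the stand-in rows are assumptions, and only the standard library is used.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the row indices once and cut them into train and test parts."""
    rng = random.Random(seed)  # the seed is an arbitrary choice for reproducibility
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_train = int(len(rows) * train_fraction)  # floor, so 506 rows give 404
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


# Stand-in for the 506 observations; real rows would hold the 14 variables.
rows = list(range(506))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Because the partition depends on a single random shuffle, any error measured on the test part will change with the seed, which is the volatility the report later addresses with cross validation.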
  • 4. Figure 1: Histograms of different variables of Boston data set

Studying the correlation between the variables, the following observations were made:
    • A strong correlation of 0.912 between the variables rad and tax. This is expected, as the property-tax rate of dwellings often increases as accessibility to radial highways increases.
    • A correlation of 0.76 between the proportion of non-retail business acres per town (indus) and the nitrogen oxides concentration (nox). This is consistent with non-retail businesses contributing heavily to the nitrogen oxides concentration in the air.
    • A correlation of 0.73 between the proportion of non-retail business acres per town (indus) and the property-tax rate of the dwellings (tax). The tax rate may also be influenced by the presence of non-retail businesses near the dwellings.
    • A correlation of 0.73 between the proportion of owner-occupied units built prior to 1940 (age) and the nitrogen oxides concentration (nox). This suggests that the older parts of the city, where the older houses are situated, have more air pollution.
    • A correlation of -0.74 between the mean of distances to five Boston employment centers (dis) and the proportion of owner-occupied units built prior to 1940 (age). It is interesting to note that areas with older homes tend to be closer to the employment centers, which suggests that the city expanded outward from where the employment centers are located.

Correlation with the median value of owner-occupied homes (medv):
    • A correlation of -0.74 with lstat (percent of lower status of the population), i.e. the higher the proportion of people with lower status, the lower the value of the house. This can be attributed to affordability.
  • 5. • A positive correlation of 0.70 with the average number of rooms per dwelling (rm), i.e. as the number of rooms increases, the price of the dwelling rises.

Figure 2: Correlation matrix

Figure 3 shows scatter plots of the variables against medv. Linear regression lines are plotted to better visualize each relationship with medv. The plots also reinforce our understanding of the variables rad and tax, which are highly correlated. Applying a log transformation to the variables crim and lstat appears to give a better linear fit.

Figure 3: Scatter plots of different variables and medv (including log transformed variables)
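The effect of a log transformation on linear correlation can be illustrated with a small sketch. The values below are made up so that medv is exactly linear in log(lstat); they are not taken from the Boston data, and the helper `pearson` is a plain standard-library implementation of the Pearson coefficient.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative values: medv falls linearly with log(lstat).
lstat = [2.0, 4.0, 8.0, 16.0, 32.0]
medv = [50.0 - 10.0 * math.log(v) for v in lstat]

r_raw = pearson(lstat, medv)                         # curved relationship
r_log = pearson([math.log(v) for v in lstat], medv)  # straightened by the log
print(round(r_raw, 3), round(r_log, 3))
```

On this toy data the raw correlation is weaker in magnitude than the correlation after the transformation, which is the same pattern the report describes for crim and lstat.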
  • 6. Table 1: Correlation coefficients with respect to medv

Variable | lstat_log | lstat | rm | ptratio | indus | crim_log | crim
Correlation coefficient | -0.82 | -0.74 | 0.70 | -0.51 | -0.48 | -0.45 | -0.39
p-value | 9.2e-122 | 5.0e-88 | 2.4e-74 | 1.6e-34 | 4.9e-31 | 3.8e-27 | 1.1e-19

Further analysis of the correlation coefficients of the variables with respect to medv (Table 1) confirms that the log-transformed variables are more strongly linearly correlated with medv than their untransformed counterparts.

The high correlation between tax and rad can also be observed in Figure 4. Since both distributions fall into two clusters, a new categorical variable called rad_c was created, and the tax variable was dropped, as rad_c explains most of the variation in tax.

Figure 4: Correlation and plots of variables tax and rad

With the introduction of the new variable, a change of slope is observed in several variables, as shown in Figure 5.
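One way to derive a two-level variable such as rad_c is sketched below. The report does not state the exact coding, so the threshold of 10 (taken from the two rad clusters noted earlier), the 0/1 labels, the helper name `make_rad_c` and the sample rows are all assumptions made for illustration.

```python
def make_rad_c(rad, threshold=10):
    """Return 1 for the high-index cluster and 0 otherwise (assumed coding)."""
    return 1 if rad >= threshold else 0


# Hypothetical rows holding only the two variables discussed above.
rows = [
    {"rad": 4, "tax": 277},
    {"rad": 24, "tax": 666},
    {"rad": 5, "tax": 311},
]

# Add rad_c and drop tax, mirroring the step described in the text.
recoded = []
for row in rows:
    new_row = {key: value for key, value in row.items() if key != "tax"}
    new_row["rad_c"] = make_rad_c(row["rad"])
    recoded.append(new_row)

print(recoded)
```

With a two-level rad_c, an interaction such as rm * rad_c lets the fitted slope of rm differ between the two clusters, which is the change of slope the report refers to.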
  • 7. Figure 5: Introduction of rad_c variable forces a change in slope

Variable Selection and Modelling

For the modelling phase, both classical and regularization techniques for variable selection were used to arrive at the best linear regression model for the dependent variable medv. Best subset, stepwise selection and LASSO (with parameter tuning to select the best lambda) were performed. Table 2 gives a summary of these models.

Table 2: Comparison of different models obtained through variable selection techniques

Method | Formula | 10-fold cross validation | In-sample prediction | Out-sample prediction | R2 | Adj R2 | AIC | BIC
Best subset | medv ~ chas + nox + rm + dis + ptratio + black + lstat + crim + zn + rad + tax | 23.56 | 24.35 | 12.75 | 0.741 | 0.735 | 2462 | 2514
Stepwise B/F/Both | medv ~ chas + nox + rm + dis + ptratio + black + lstat + crim + zn + rad + tax | 23.56 | 24.35 | 12.75 | 0.741 | 0.735 | 2462 | 2514
Full Model | medv ~ . | 24.42 | 25.10 | 13.39 | 0.742 | 0.734 | 3030 | 3102
Lasso (λ = 0.034) | medv ~ chas + nox + rm + dis + ptratio + black + lstat + crim + zn + rad + age + indus | 24.12 | 24.26 | 12.00 | 0.735 | 0.729 | 3036 | 3095

The in-sample and out-of-sample prediction errors differed widely and, surprisingly, the error was lower out of sample. This was due to the single random split of the data into train and test sets, and shows that a result from a single split should not be trusted. When the same comparison was repeated with 10-fold cross validation, a more realistic picture surfaced, which was very different from the out-of-sample prediction.

  • 8. Since the splits were random, different runs of 10-fold cross validation produced different results.

From our exploratory data analysis, we discovered that taking the log of the crim and lstat variables increased their linear correlation with medv. We also observed that an interaction term with the transformed rad_c variable explained more of the variation around the regression line. We now compare the above models with a customized model that incorporates the discoveries from the exploratory data analysis. A comparison using repeated cross validation with 5 repeats and 20 folds is depicted in Figure 6.

Figure 6: Model comparison (RMSE and R sq) at 95% Confidence Interval

The customized model performed much better at explaining the variation in median housing prices and at predicting out of sample.
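The resampling scheme of 5 repeats and 20 folds can be sketched as an index generator. This is an illustrative standard-library version, not the tooling used in the study; the function name `repeated_kfold`, the seed and the use of all 506 rows are assumptions.

```python
import random


def repeated_kfold(n_rows, n_folds=20, n_repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_rows))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        for fold in range(n_folds):
            test_idx = indices[fold::n_folds]  # every n_folds-th shuffled row
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield train_idx, test_idx


splits = list(repeated_kfold(n_rows=506))
print(len(splits))  # 5 repeats x 20 folds
```

Each model is then fitted on `train_idx` and scored on `test_idx` for every pair, so each model receives 100 error estimates instead of one, which makes the comparison far less sensitive to any single split.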
  • 9. Residual Diagnostics

Stepwise Selection Model v/s Custom Model Residual Comparison

Figure 7: Residual plots comparison for stepwise model (left) and custom model (right)

Figure 7 shows that the custom model displays a slight improvement in the Q-Q plot, indicating that the residuals of the model are nearly normal. The curvature in the Scale-Location graph has also been linearized to an extent in the custom model. This indicates that the assumptions of linear regression hold better for the custom model than for the other models. Thus, for out-of-sample predictions, the custom model should be preferred.

Final Model

Table 3: Model summary of the final selected model

Formula | R2 | Adj R2 | AIC | BIC | RMSE
medv ~ nox + ptratio + age + (log(lstat) + rm + log(crim) + dis) * rad_c | 0.854 | 0.850 | 2735 | 2794 | 3.607
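The relationship between the R2 and adjusted R2 figures in Table 3 can be checked with the standard adjustment formula. The sketch below rests on two assumptions not stated in the report: that the model was fitted on the 404 training observations, and that rad_c has two levels, so the formula expands to 12 terms (3 plain predictors, 4 main effects, rad_c itself and 4 interactions).

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Assumed inputs: 404 training rows and 12 model terms (see the note above).
adj = adjusted_r2(r2=0.854, n_obs=404, n_predictors=12)
print(round(adj, 3))
```

Under these assumptions the result lands within rounding of the adjusted R2 reported in the table.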
  • 10. Comparison with CART

After constructing the tree from the split data, we observed the following values in comparison to the linear model:

Table 4: Comparison of predictions made by linear regression and CART

Sample Type | Linear Regression (full model) | CART (cp = 0.015642)
In-Sample (80%) | 21.50 | 17.81
Out-Sample (20%) | 23.91 | 21.76

The values in Table 4 suggest that CART performed better than the full regression model. These values, however, are volatile, i.e. the prediction errors vary with a slight change in the train / test split. Thus, to compare these two models and the custom model arrived at earlier, we ran a repeated cross validation with 5 repeats and 20 folds. Figure 8 depicts the summary of these repeats and predictions at a 95% confidence interval.

Figure 8: Comparison of model prediction between full linear regression, CART and custom model

From the above, it is evident that CART performs better than the full linear regression model. However, the final linear model, which incorporates the analysis done in the exploratory phase, outperforms the CART model while retaining the simplicity of linear regression.
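A summary of resampled errors at a 95% confidence interval can be computed as sketched below. The report does not say how its intervals were obtained, so the normal approximation used here is an assumption, as are the helper name `mean_with_ci` and the RMSE values, which are invented for illustration and are not results from the study.

```python
import math
import statistics


def mean_with_ci(values, confidence=0.95):
    """Return the mean and a normal-approximation confidence interval for it."""
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return mean, mean - z * std_err, mean + z * std_err


# Invented per-resample RMSE values, standing in for cross-validation output.
rmse_values = [3.4, 3.9, 3.1, 4.2, 3.6, 3.8, 3.3, 3.7]
mean, low, high = mean_with_ci(rmse_values)
print(round(mean, 3), round(low, 3), round(high, 3))
```

Two models can then be compared by checking whether their intervals overlap; with only a handful of resamples a t-based interval would be the more careful choice.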