STATISTICS FOR BUSINESS ANALYTICS
BARATSAS SOTIRIOS
sotbaratsas@gmai.com
MSc in Business Analytics
1. Read the “usdata” dataset and use str() to understand its structure.
dataset <- read.csv(file="usdata", header=TRUE, sep=" ")
View(dataset)
str(dataset)
sapply(dataset, mode)
sapply(dataset, class)
NOTES
As we can see, we have a data frame with 63 observations of 6 variables. All the variables have been imported as integers.
Using the View(dataset) command, we can also see that the dataset is clean, without incorrect, invalid or missing values.
We only need to correct the data classes.
2. Convert the variables PRICE, SQFT, AGE, FEATS to be numeric variables and NE, COR to be factors.
dataset$PRICE<-as.numeric(dataset$PRICE)
dataset$SQFT<-as.numeric(dataset$SQFT)
dataset$AGE<-as.numeric(dataset$AGE)
dataset$FEATS<-as.numeric(dataset$FEATS)
dataset$NE<-factor(dataset$NE, levels=c(0,1), labels=c("No", "Yes"))
dataset$COR<-factor(dataset$COR, levels=c(0,1), labels=c("No", "Yes"))
str(dataset)
NOTES
We convert the variables PRICE, SQFT, AGE, FEATS into numeric variables and the variables NE, COR into factors. Then we
check with str(dataset) whether the transformation was successful; it appears it was.
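A more compact, equivalent way to perform the same conversion (an illustrative sketch, assuming the columns appear in the order PRICE, SQFT, AGE, FEATS, NE, COR shown by str(dataset)) would be:
# Convert the first four columns to numeric and the last two to labelled factors in one pass
dataset[, 1:4] <- lapply(dataset[, 1:4], as.numeric)
dataset[, 5:6] <- lapply(dataset[, 5:6], factor, levels=c(0,1), labels=c("No", "Yes"))
str(dataset)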
3. Perform descriptive analysis and visualization for each variable to get an initial insight of what the data
looks like. Comment on your findings.
# For the numeric variables
require(psych)
index <- sapply(dataset, class) == "numeric"
numcolumns <- dataset[,index]
round(t(describe(numcolumns)),2)
# For the factors we can get information, using:
summary(dataset[5:6])
prop.table(table(dataset$NE))
prop.table(table(dataset$COR))
# Visualizing the variables' distributions
par(mfrow=c(2,3))
hist(dataset$PRICE, main="Price", xlab="USD (hundreds)")
hist(dataset$SQFT, main="Sq Feet", xlab="Square Feet")
hist(dataset$AGE, main="Age", xlab="Years")
n <- nrow(numcolumns)
plot(table(dataset$FEATS)/n, type='h', xlim=range(dataset$FEATS)+c(-1,1), main="Features",
ylab='Relative frequency', xlab="Features")
fcolumns <- dataset[,!index] # for Factors
barplot(sapply(fcolumns,table)/n, horiz=T, las=1, col=2:3, ylim=c(0,8), cex.names=1.3,
space=0.8, xpd=F)
legend('top', fill=2:3, legend=c('No','Yes'), horiz=T, bty='n',cex=0.6)
NOTES
PRICE: The variable "PRICE" has a mean of 1158.42 USD (hundreds) and a median of 1049. Its distribution shows positive
skewness, caused by a number of observations lying far out on the right side of the distribution.
SQFT: The variable "SQFT" has a mean of 1729.54 sq.ft and a median of 1680 sq.ft. The majority of observations are located
between 1000 and 2000 sq.ft. There are some outlier observations, located between 2500 and 3000 sq.ft, giving the
distribution a positive skewness.
AGE: The variable "AGE" has a mean of 17.46 years and a median of 20 years. The observations appear widely spread, with
most of them concentrated in three ranges: 0-10 years, 15-20 years and 25-30 years.
FEATURES: Starting from 1 feature, the number of observations increases with the number of features, with 4 being the most
common number of features a house has. After 4, we observe a few houses with 5 or 6 features and very few with 7 or 8.
No house in our sample has more than 8 features.
NE: The majority of observations (61.9%) is located in the Northeast sector of the city. Since we want to use this sample as a
basis for the whole company (including other cities), we must take into account that this sample best represents sales in the
NE part of the city and might only be indicative of a trend in this area.
COR: The majority of the observations (77.7%) are not corner locations, which is reasonable.
4. Conduct pairwise comparisons between the variables in the dataset to investigate if there are any
associations implied by the dataset. Comment on your findings. Is there a linear relationship between PRICE
and any of the variables in the dataset?
pairs(numcolumns)
#On first sight, we can see there probably is a linear relationship between PRICE and SQFT.
par(mfrow=c(2,3))
for(j in 2:length(numcolumns[1,])){
plot(numcolumns[,j], numcolumns[,1], xlab=names(numcolumns)[j], ylab='Price',cex.lab=1.2)
abline(lm(numcolumns[,1]~numcolumns[,j]))
}
# For factor variables
for(j in 1:2){
boxplot(numcolumns[,1]~fcolumns[,j], xlab=names(fcolumns)[j], ylab='Price', cex.lab=1.2)
}
# We can also draw box plots, but in this instance I believe they don't present a clearer picture.
# We could keep the box plot only for the variable "FEATS".
# par(mfrow=c(1,3))
# for(j in 2:length(numcolumns[1,])){
# boxplot(numcolumns[,1]~numcolumns[,j], xlab=names(numcolumns)[j],
ylab='Price',cex.lab=1.5)
# abline(lm(numcolumns[,1]~numcolumns[,j]),col=2)
# }
NOTES
Plotting each variable against PRICE, we can (again) see that there is a visible linear relationship between PRICE and SQFT.
Observing the fitted lines, we can also see a weaker positive correlation between PRICE and FEATS and a slight negative
correlation between PRICE and AGE.
round(cor(numcolumns), 2)
require(corrplot)
par(mfrow=c(1,2))
corrplot(cor(numcolumns), method="ellipse")
corrplot(cor(numcolumns), method = "number")
NOTES
The corrplot paints a more definitive picture of the linear correlation between our variables. We can confirm that there is a
strong (positive) linear correlation between PRICE and SQFT, a weaker (positive) correlation between PRICE and FEATS, and
a slight negative correlation between PRICE and AGE.
We do not see any pair of covariates being heavily correlated with each other (which could indicate a multicollinearity
problem), so we do not exclude any variables at this point.
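As a numerical cross-check (an illustrative sketch using the car package; it is not part of the original analysis), we could also compute variance inflation factors, where values well below 5-10 would support the absence of multicollinearity:
# --- Multicollinearity check via VIF (illustrative) ---
library(car)
vif(lm(PRICE ~ ., data = dataset)) # GVIF values for the full set of covariates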
5. Construct a model for the expected selling prices (PRICE) according to the remaining features. (Hint:
Conduct multiple regression having PRICE as a response and all the other variables as predictors). Does this
linear model fit well to the data?
model1 <- lm(PRICE ~., data = dataset)
summary(model1)
NOTES
Generally, the goodness of fit of this model seems acceptable (R^2 = 0.87 | adjusted R^2 = 0.86). This means that the
covariates of the model explain about 87% of the variability of the response.
There are certain indicators (such as the negative intercept) that lead us to believe we might be able to build a better
model.
Also, the adjusted R-squared is lower than the multiple R-squared, which may indicate that the model could be improved by
removing at least one of the covariates.
6. Find the best model for predicting the selling prices (PRICE). Select the appropriate features using stepwise
methods. (Hint: Use Forward, Backward or Stepwise procedure according to AIC or BIC to choose which
variables appear to be more significant for predicting selling PRICES).
model2<-step(model1, direction='both')
#step(model1, direction='both', k=log(n)) # If we wanted to use BIC
NOTES
We start with the full model (model1) and, using the Akaike Information Criterion (AIC), check at every step whether the
model can be improved by adding or removing a variable. Initially, the full model has an AIC of about 632. We use the
stepwise method so that a variable can be both added and removed at any step of the process.
In the first step, we see that, by removing the variable "NE", we can decrease the AIC to 631.05, so we do that.
In the second step, we see that by removing the variable "AGE", we can further decrease the AIC to 629.69.
In the third step, we see that by removing the variable "COR", we can further decrease the AIC to 628.84.
In the fourth step, we see that neither removing nor adding another variable will help us further decrease the AIC of our
model. Therefore, our algorithm stops and we are presented with the final model of the stepwise method.
7. Get the summary of your final model, (the model that you ended up having after conducting the stepwise
procedure) and comment on the output. Interpret the coefficients. Comment on the significance of each
coefficient and write down the mathematical formulation of the model (e.g PRICES = Intercept +
coef1*Variable1 + coef2*Variable2 +.... + ε where ε ~ Ν(0, ...) ). Should the intercept be excluded from our
model?
summary(model2)
NOTES
We see that the adjusted R^2 has increased after removing these variables. Therefore, we assume this is a better model for
predicting the PRICEs of houses.
Our final model is:
PRICE = -175.92 + 0.68 * SQFT + 39.84 * FEATS + ε
ε ~ N(0, 143.6^2)
INTERPRETATION:
• Intercept = -175.92 --> The expected price of a house with SQFT = 0 and zero FEATURES is -175.92 USD (hundreds)
= -17592 USD. This interpretation is not sensible, since a house can't have a negative price and can't have an area of
0 square feet.
• b1(SQFT) = 0.68 --> An increase of 1 unit (sq.foot) in the lot size of a house corresponds to an increase of 0.68 USD
(hundreds) = 68 USD in the price of the house. More practically, if we compare two houses with the same
characteristics which differ only by 1 sq.ft, the expected difference in price is 68 USD in favor of the larger house.
Since a reasonable change would be at least 100 sq.ft, we can say that, if we compare two houses with the same
characteristics which differ only by 100 sq.ft, the expected difference in price is 6800 USD in favor of the larger
house.
• b2(FEATS) = 39.84 --> If we compare two houses with the same characteristics which differ only by 1 additional
feature, the expected difference in price is 39.84 USD (hundreds) = 3984 USD, in favor of the house with the larger
number of features.
All three coefficients are statistically significant: SQFT at any reasonable significance level, while the Intercept and FEATS
are significant at the 0.01 level.
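To illustrate the formula in practice (a hypothetical example; the house characteristics below are chosen purely for illustration), we could predict the price of a 1700 sq.ft house with 4 features and inspect the confidence intervals of the coefficients:
# Illustrative use of the final model
newhouse <- data.frame(SQFT = 1700, FEATS = 4)
predict(model2, newdata = newhouse, interval = "prediction")
confint(model2) # confidence intervals for the coefficients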
# Changing from square feet to square meters
# If we want, we can change the unit of the lotsize from Sq.Ft to Sq.Meters.
#
# dataset$SQFT<-dataset$SQFT/10.764
# summary(step(lm(PRICE ~., data = dataset), direction='both'))
# -------------------------------
# SHOULD WE REMOVE THE INTERCEPT?
# -------------------------------
# Removing the intercept from the beginning
model3<-lm(PRICE~.-1, data=dataset)
model4<-step(model3, direction='both')
summary(model4) # We see an R^2 adj = 0.9859, however, that is not its true value.
# true.r2 <- 1-sum(model4$res^2)/((n-1)*var(dataset$PRICE)); true.r2
true.r2 <- 1-sum(model4$res^2)/sum( (dataset$PRICE - mean(dataset$PRICE))^2 ) # This is the true R^2
#Removing the intercept straight from the last-used model (model2)
model5 <- update(model2, ~ . -1)
summary(model5) # This is not the true R^2
true.r2.2 <- 1-sum(model5$res^2)/sum( (dataset$PRICE - mean(dataset$PRICE))^2 ); true.r2.2 # This is the true R^2
NOTES
We notice that in both cases, when we remove the Intercept, the true R^2 of our model decreases, indicating that the
goodness of fit of our model also decreases. Also, in the summary of model2, we can see that the Intercept is significant at
the 0.01 significance level. Based on these two facts, we advise against removing the Intercept from the model, and we will
keep it in the model for the rest of our analysis.
# --- Model with centered covariates
# Since our numeric covariates never have values even close to 0 (sqft, age, feats), it makes sense
# to try to interpret the Intercept using a model with centered covariates.
numcol1 <- as.data.frame(scale(numcolumns, center = TRUE, scale = F))
numcol1$PRICE<-numcolumns$PRICE
sapply(numcol1,mean)
sapply(numcol1,sd)
round(sapply(numcol1,mean),5)
round(sapply(numcol1,sd),2)
cen_model<-lm(PRICE~., data=numcol1)
summary(cen_model)
NOTES
In this new model, the Intercept is equal to 1.158e+03 USD (hundreds) = 115800 USD. This represents the expected price of
a house when all the numeric covariates are at their mean, i.e. the expected price of a house with lot size = 1729.54 sq.ft,
age = 17.4 years and 3.95 features. Of course, for discrete variables we have to round (e.g. to 4 features) for the
interpretation to make perfect sense.
We can see the means using: sapply(numcolumns, mean)
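As a small cross-check (illustrative only), the prediction of model2 at the mean values of its covariates should coincide with the intercept of the centered model, since an OLS fit with an intercept passes through the point of means:
predict(model2, newdata = data.frame(SQFT = mean(dataset$SQFT), FEATS = mean(dataset$FEATS)))
mean(dataset$PRICE) # both values should equal approximately 1158.42 USD (hundreds)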
8. Check the assumptions of your final model. Are the assumptions satisfied? If not, what is the impact of the
violation of the assumption not satisfied in terms of inference? What could someone do about it?
# --- Normality of the residuals ---
par(mfrow=c(1,1))
plot(model2, which = 2)
require(nortest)
shapiro.test(model2$residuals) # The p-value is large. We do not reject NORMALITY.
lillie.test(model2$residuals) # The p-value is 0.09. We do not reject normality at the 0.01 and 0.05 significance levels.
NOTES
From the plot, we can see that the middle part of the distribution of the residuals falls close to the normal distribution.
However, there is a stronger deviation at the tails.
This is why the Shapiro-Wilk test gives a very high p-value (0.63), while the Lilliefors test confirms the normality assumption
only at the 0.01 and 0.05 significance levels (not at 0.10).
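As an additional visual check (purely illustrative), we could overlay a normal density on a histogram of the residuals:
hist(model2$residuals, freq = FALSE, main = "Residuals of model2", xlab = "Residuals")
curve(dnorm(x, mean = mean(model2$residuals), sd = sd(model2$residuals)), add = TRUE, col = 2)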
# --- Homoscedasticity of the errors ---
#Is the variance a constant variance?
Stud.residuals <- rstudent(model2)
yhat <- fitted(model2)
par(mfrow=c(1,2))
plot(yhat, Stud.residuals) #plotting the student residuals versus the y^hats
abline(h=c(-2,2), col=2, lty=2)
plot(yhat, Stud.residuals^2)
abline(h=4, col=2, lty=2)
# - ncvTest -
library(car)
ncvTest(model2)
# The p-value is very small, so we reject the hypothesis of constant variance.
# Thus, the assumption of homoscedasticity of the errors is violated.
# - Levene Test -
yhat.quantiles<-cut(yhat, breaks=quantile(yhat, probs=seq(0,1,0.25)), dig.lab=6)
leveneTest(rstudent(model2)~yhat.quantiles)
boxplot(rstudent(model2)~yhat.quantiles)
NOTES
The first two plots show that there are values outside the dotted red lines, which means we probably do not have constant
variance. The ncvTest and the Levene Test confirm that we reject the hypothesis of constant variance, since the p-values are
very small. As we see in the box plot, the variance across the 4 quantile groups differs considerably. The violation of this
assumption means that the variability of our response (PRICE) is not equal across the whole range of values of our
covariates; as a consequence, the standard errors of the coefficients, and hence the t-tests and confidence intervals, are no
longer reliable.
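A common remedy for the inference problem, offered here only as an alternative sketch (it assumes the sandwich package is installed; the analysis below addresses the violation with a log transformation instead), is to report heteroscedasticity-consistent standard errors:
# --- Robust (HC3) standard errors for model2 (illustrative alternative) ---
library(lmtest)
library(sandwich)
coeftest(model2, vcov = vcovHC(model2, type = "HC3"))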
# --- non linearity ---
library(car)
par(mfrow=c(1,1))
residualPlot(model2, type='rstudent')
residualPlots(model2, plot=F, type = "rstudent")
NOTES
At the 5% significance level, we reject the hypothesis of linearity between the response and the covariates. This means we
will probably need a quadratic term to produce a good regression model.
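A minimal sketch of what such a term could look like (illustrative only; the analysis below deals with the problem through a log transformation of PRICE instead):
# --- Trying a quadratic term for SQFT (illustrative alternative) ---
quad_model <- lm(PRICE ~ SQFT + I(SQFT^2) + FEATS, data = dataset)
summary(quad_model)
residualPlots(quad_model, plot = FALSE, type = "rstudent") # re-check the linearity tests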
# --- Independence ---
plot(rstudent(model2), type='l')
library(randtests); runs.test(model2$res)
library(lmtest);dwtest(model2)
library(car); durbinWatsonTest(model2)
NOTES
Using the runs.test and the durbinWatsonTest at the 0.05 significance level, we do not reject the null hypothesis that the
order of observations is random, i.e. the errors are independent.
Since two of our assumptions are violated, we need to apply transformations to fix the model.
The linearity and homoscedasticity problems may be solved by applying a log transformation to the dependent variable,
which is appropriate, since PRICE is strictly positive.
logmodel<-lm(log(PRICE)~.-AGE-NE-COR, data=dataset)
summary(logmodel)
#Normality of the residuals
plot(logmodel, which = 2)
require(nortest)
shapiro.test(logmodel$residuals) # The p-value is large. We do not reject NORMALITY.
lillie.test(logmodel$residuals)
# Homoscedasticity of the errors
log.Stud.residuals <- rstudent(logmodel)
log.yhat <- fitted(logmodel)
par(mfrow=c(1,2))
plot(log.yhat, log.Stud.residuals) #plotting the student residuals versus the y^hats
abline(h=c(-2,2), col=2, lty=2)
plot(log.yhat, log.Stud.residuals^2)
abline(h=4, col=2, lty=2)
library(car)
ncvTest(logmodel)
log.yhat.quantiles<-cut(log.yhat, breaks=quantile(log.yhat, probs=seq(0,1,0.25)), dig.lab=6)
leveneTest(rstudent(logmodel)~log.yhat.quantiles)
boxplot(rstudent(logmodel)~log.yhat.quantiles)
# Linearity
library(car)
par(mfrow=c(1,1))
residualPlot(logmodel, type='rstudent')
residualPlots(logmodel, plot=F, type = "rstudent")
#Independence
plot(rstudent(logmodel), type='l')
library(randtests); runs.test(logmodel$res)
library(lmtest);dwtest(logmodel)
library(car); durbinWatsonTest(logmodel)
NOTES
In all the tests, the p-values are large enough, which means that the log transformation has solved our problems.
INTERPRETATION
Since our new model has a higher adjusted R^2 (87.72% vs 86.61% for the previously best model) and all of our assumptions
are satisfied, we choose to keep this model.
We should be careful, though: the interpretation of the coefficients has changed:
• b1(SQFT) = 5.402e-04 --> An increase of 1 unit (sq.foot) in the lot size of a house corresponds to an increase of about
0.054% in the price of the house. Since a reasonable change would be at least 100 sq.ft, we can say that an increase of
100 sq.ft corresponds to an increase of about 5.4% in the price of a house.
• b2(FEATS) = 2.850e-02 --> If we add 1 additional feature to a house, keeping all other characteristics unchanged, then
the expected increase in the price is about 2.85%.
• Intercept = 5.959 --> The expected price of a house with SQFT = 0 and zero FEATURES is exp(5.959) = 387 USD
(hundreds).
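These percentage effects can be computed directly from the coefficients (a small illustrative computation; for small coefficients, 100*b and 100*(exp(b)-1) are almost identical):
# Percentage change in PRICE implied by the log-model coefficients
round(100 * (exp(coef(logmodel)[-1]) - 1), 2)            # per 1-unit change in each covariate
round(100 * (exp(100 * coef(logmodel)["SQFT"]) - 1), 2)  # per 100 sq.ft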
9. Conduct LASSO as a variable selection technique and compare the variables that you end up having using
LASSO to the variables that you ended up having using stepwise methods in (VI). Are you getting the same
results? Comment.
require(glmnet)
modmtrx <- model.matrix(model1)[,-1] # We create the design matrix from the full model.
# We drop the first column, which is the intercept; glmnet fits its own intercept and the
# response PRICE is supplied separately.
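The fitting and plotting step itself is not shown in the extracted text; a minimal sketch consistent with the notes below (which refer to a fitted object named lasso and its coefficient-path plot) would be:
lasso <- glmnet(modmtrx, dataset$PRICE, alpha = 1) # alpha = 1 corresponds to the LASSO penalty
plot(lasso, xvar = "lambda", label = TRUE)         # coefficient paths against log(lambda)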
NOTES
In this graph it's not visible, but SQFT is actually the last variable to "be killed" by LASSO, although it's line is very close to
the baseline, because its coefficient is very small. It becomes more visible through this plot:
library(plotmo)
plot_glmnet(lasso)
NOTES
As we can see, NE is the first variable to be removed from the model, followed by AGE. With a slightly higher lambda, COR is
removed too, and finally FEATS and SQFT are the last variables to be removed. We notice that this is exactly the order in
which the stepwise method removed variables from the full model using the AIC.
# Using cross validation to find a reasonable value for lambda
lasso1 <- cv.glmnet(modmtrx, dataset$PRICE, alpha = 1)
lasso1$lambda
lasso1$lambda.min
lasso1$lambda.1se
NOTES
If we chose lambda.1se = 46.55 as a reasonable value for the lambda (since this is the value of lambda, where the error is
within 1 standard error of the minimum), we see that LASSO would only keep FEATS and SQFT in the model, which is exactly
the same result as the stepwise regression, using AIC.
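To verify which variables survive at this value of lambda (a small illustrative check), we can extract the coefficients at lambda.1se; the non-zero rows correspond to the variables LASSO keeps:
coef(lasso1, s = "lambda.1se")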

More Related Content

What's hot

Logistic regression
Logistic regressionLogistic regression
Logistic regressionVARUN KUMAR
 
Gold Price Forecasting
Gold Price ForecastingGold Price Forecasting
Gold Price Forecastingyou wu
 
Cointegration and error correction model
Cointegration and error correction modelCointegration and error correction model
Cointegration and error correction modelAditya KS
 
Different Types of Graphs
Different Types of GraphsDifferent Types of Graphs
Different Types of GraphsRileyAntler
 
Linear models for data science
Linear models for data scienceLinear models for data science
Linear models for data scienceBrad Klingenberg
 
Categorical data analysis
Categorical data analysisCategorical data analysis
Categorical data analysisSumit Das
 
Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)Mohammed Musah
 
Forecasting with Vector Autoregression
Forecasting with Vector AutoregressionForecasting with Vector Autoregression
Forecasting with Vector AutoregressionBryan Butler, MBA, MS
 
Probability Distribution
Probability DistributionProbability Distribution
Probability DistributionSarabjeet Kaur
 
House price prediction
House price predictionHouse price prediction
House price predictionKaranseth30
 
Quantitative Methods for Lawyers - Class #6 - Basic Statistics + Probability ...
Quantitative Methods for Lawyers - Class #6 - Basic Statistics + Probability ...Quantitative Methods for Lawyers - Class #6 - Basic Statistics + Probability ...
Quantitative Methods for Lawyers - Class #6 - Basic Statistics + Probability ...Daniel Katz
 
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical InferenceSpanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inferencejemille6
 
Working with Numerical Data
Working with  Numerical DataWorking with  Numerical Data
Working with Numerical DataGlobal Polis
 
Cluster spss week7
Cluster spss week7Cluster spss week7
Cluster spss week7Birat Sharma
 
Principal Component Analysis and Clustering
Principal Component Analysis and ClusteringPrincipal Component Analysis and Clustering
Principal Component Analysis and ClusteringUsha Vijay
 
2.4 Scatterplots, correlation, and regression
2.4 Scatterplots, correlation, and regression2.4 Scatterplots, correlation, and regression
2.4 Scatterplots, correlation, and regressionLong Beach City College
 

What's hot (20)

PCA
PCAPCA
PCA
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Gold Price Forecasting
Gold Price ForecastingGold Price Forecasting
Gold Price Forecasting
 
Cointegration and error correction model
Cointegration and error correction modelCointegration and error correction model
Cointegration and error correction model
 
Different Types of Graphs
Different Types of GraphsDifferent Types of Graphs
Different Types of Graphs
 
Breeds of Dogs & Cats
Breeds of Dogs & CatsBreeds of Dogs & Cats
Breeds of Dogs & Cats
 
Linear models for data science
Linear models for data scienceLinear models for data science
Linear models for data science
 
Time series
Time seriesTime series
Time series
 
Categorical data analysis
Categorical data analysisCategorical data analysis
Categorical data analysis
 
Point Estimation
Point EstimationPoint Estimation
Point Estimation
 
Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)
 
Forecasting with Vector Autoregression
Forecasting with Vector AutoregressionForecasting with Vector Autoregression
Forecasting with Vector Autoregression
 
Probability Distribution
Probability DistributionProbability Distribution
Probability Distribution
 
House price prediction
House price predictionHouse price prediction
House price prediction
 
Quantitative Methods for Lawyers - Class #6 - Basic Statistics + Probability ...
Quantitative Methods for Lawyers - Class #6 - Basic Statistics + Probability ...Quantitative Methods for Lawyers - Class #6 - Basic Statistics + Probability ...
Quantitative Methods for Lawyers - Class #6 - Basic Statistics + Probability ...
 
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical InferenceSpanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
 
Working with Numerical Data
Working with  Numerical DataWorking with  Numerical Data
Working with Numerical Data
 
Cluster spss week7
Cluster spss week7Cluster spss week7
Cluster spss week7
 
Principal Component Analysis and Clustering
Principal Component Analysis and ClusteringPrincipal Component Analysis and Clustering
Principal Component Analysis and Clustering
 
2.4 Scatterplots, correlation, and regression
2.4 Scatterplots, correlation, and regression2.4 Scatterplots, correlation, and regression
2.4 Scatterplots, correlation, and regression
 

Similar to Predicting US house prices using Multiple Linear Regression in R

Week 4 Lecture 12 Significance Earlier we discussed co.docx
Week 4 Lecture 12 Significance Earlier we discussed co.docxWeek 4 Lecture 12 Significance Earlier we discussed co.docx
Week 4 Lecture 12 Significance Earlier we discussed co.docxcockekeshia
 
SBSI optimization tutorial
SBSI optimization tutorialSBSI optimization tutorial
SBSI optimization tutorialRichard Adams
 
(Machine Learning) Clustering & Classifying Houses in King County, WA
(Machine Learning) Clustering & Classifying Houses in King County, WA(Machine Learning) Clustering & Classifying Houses in King County, WA
(Machine Learning) Clustering & Classifying Houses in King County, WAMohammed Al Hamadi
 
Unsupervised learning
Unsupervised learning Unsupervised learning
Unsupervised learning AlexAman1
 
Bootcamp of new world to taken seriously
Bootcamp of new world to taken seriouslyBootcamp of new world to taken seriously
Bootcamp of new world to taken seriouslykhaled125087
 
ddiaz_regression_project_stat104
ddiaz_regression_project_stat104ddiaz_regression_project_stat104
ddiaz_regression_project_stat104Ryan Diaz
 
Quantitative Methods for Lawyers - Class #19 - Regression Analysis - Part 2
Quantitative Methods for Lawyers - Class #19 - Regression Analysis - Part 2Quantitative Methods for Lawyers - Class #19 - Regression Analysis - Part 2
Quantitative Methods for Lawyers - Class #19 - Regression Analysis - Part 2Daniel Katz
 
Linear logisticregression
Linear logisticregressionLinear logisticregression
Linear logisticregressionkongara
 
Workshop 4
Workshop 4Workshop 4
Workshop 4eeetq
 
Plane-and-Solid-Geometry. introduction to proving
Plane-and-Solid-Geometry. introduction to provingPlane-and-Solid-Geometry. introduction to proving
Plane-and-Solid-Geometry. introduction to provingReyRoluna1
 
DATA TABLE, EQUATION FIT OR INTERPOLATION
DATA TABLE, EQUATION FIT OR INTERPOLATIONDATA TABLE, EQUATION FIT OR INTERPOLATION
DATA TABLE, EQUATION FIT OR INTERPOLATIONijcsity
 
DATA TABLE, EQUATION FIT OR INTERPOLATION
DATA TABLE, EQUATION FIT OR INTERPOLATION DATA TABLE, EQUATION FIT OR INTERPOLATION
DATA TABLE, EQUATION FIT OR INTERPOLATION ijcsity
 
DATA TABLE, EQUATION FIT OR INTERPOLATION
DATA TABLE, EQUATION FIT OR INTERPOLATIONDATA TABLE, EQUATION FIT OR INTERPOLATION
DATA TABLE, EQUATION FIT OR INTERPOLATIONijcsity
 
5 structured programming
5 structured programming 5 structured programming
5 structured programming hccit
 
BASIC OF ALGORITHM AND MATHEMATICS STUDENTS
BASIC OF ALGORITHM AND MATHEMATICS STUDENTSBASIC OF ALGORITHM AND MATHEMATICS STUDENTS
BASIC OF ALGORITHM AND MATHEMATICS STUDENTSjainyshah20
 

Similar to Predicting US house prices using Multiple Linear Regression in R (20)

Week 4 Lecture 12 Significance Earlier we discussed co.docx
Week 4 Lecture 12 Significance Earlier we discussed co.docxWeek 4 Lecture 12 Significance Earlier we discussed co.docx
Week 4 Lecture 12 Significance Earlier we discussed co.docx
 
Fst ch3 notes
Fst ch3 notesFst ch3 notes
Fst ch3 notes
 
SBSI optimization tutorial
SBSI optimization tutorialSBSI optimization tutorial
SBSI optimization tutorial
 
(Machine Learning) Clustering & Classifying Houses in King County, WA
(Machine Learning) Clustering & Classifying Houses in King County, WA(Machine Learning) Clustering & Classifying Houses in King County, WA
(Machine Learning) Clustering & Classifying Houses in King County, WA
 
Unsupervised learning
Unsupervised learning Unsupervised learning
Unsupervised learning
 
Bootcamp of new world to taken seriously
Bootcamp of new world to taken seriouslyBootcamp of new world to taken seriously
Bootcamp of new world to taken seriously
 
Explore ml day 2
Explore ml day 2Explore ml day 2
Explore ml day 2
 
Case Study of Petroleum Consumption With R Code
Case Study of Petroleum Consumption With R CodeCase Study of Petroleum Consumption With R Code
Case Study of Petroleum Consumption With R Code
 
ddiaz_regression_project_stat104
ddiaz_regression_project_stat104ddiaz_regression_project_stat104
ddiaz_regression_project_stat104
 
Multiple regression
Multiple regressionMultiple regression
Multiple regression
 
Curvefitting
CurvefittingCurvefitting
Curvefitting
 
Quantitative Methods for Lawyers - Class #19 - Regression Analysis - Part 2
Quantitative Methods for Lawyers - Class #19 - Regression Analysis - Part 2Quantitative Methods for Lawyers - Class #19 - Regression Analysis - Part 2
Quantitative Methods for Lawyers - Class #19 - Regression Analysis - Part 2
 
Linear logisticregression
Linear logisticregressionLinear logisticregression
Linear logisticregression
 
Workshop 4
Workshop 4Workshop 4
Workshop 4
 
Plane-and-Solid-Geometry. introduction to proving
Plane-and-Solid-Geometry. introduction to provingPlane-and-Solid-Geometry. introduction to proving
Plane-and-Solid-Geometry. introduction to proving
 
DATA TABLE, EQUATION FIT OR INTERPOLATION
DATA TABLE, EQUATION FIT OR INTERPOLATIONDATA TABLE, EQUATION FIT OR INTERPOLATION
DATA TABLE, EQUATION FIT OR INTERPOLATION
 
DATA TABLE, EQUATION FIT OR INTERPOLATION
DATA TABLE, EQUATION FIT OR INTERPOLATION DATA TABLE, EQUATION FIT OR INTERPOLATION
DATA TABLE, EQUATION FIT OR INTERPOLATION
 
DATA TABLE, EQUATION FIT OR INTERPOLATION
DATA TABLE, EQUATION FIT OR INTERPOLATIONDATA TABLE, EQUATION FIT OR INTERPOLATION
DATA TABLE, EQUATION FIT OR INTERPOLATION
 
5 structured programming
5 structured programming 5 structured programming
5 structured programming
 
BASIC OF ALGORITHM AND MATHEMATICS STUDENTS
BASIC OF ALGORITHM AND MATHEMATICS STUDENTSBASIC OF ALGORITHM AND MATHEMATICS STUDENTS
BASIC OF ALGORITHM AND MATHEMATICS STUDENTS
 

More from Sotiris Baratsas

Twitter Mention Graph - Analytics Project
Twitter Mention Graph - Analytics ProjectTwitter Mention Graph - Analytics Project
Twitter Mention Graph - Analytics ProjectSotiris Baratsas
 
Suicides in Greece (vs rest of Europe)
Suicides in Greece (vs rest of Europe)Suicides in Greece (vs rest of Europe)
Suicides in Greece (vs rest of Europe)Sotiris Baratsas
 
Azure Stream Analytics Report - Toll Booth Stream
Azure Stream Analytics Report - Toll Booth StreamAzure Stream Analytics Report - Toll Booth Stream
Azure Stream Analytics Report - Toll Booth StreamSotiris Baratsas
 
Brooklyn Property Sales - DATA WAREHOUSE (DW)
Brooklyn Property Sales - DATA WAREHOUSE (DW)Brooklyn Property Sales - DATA WAREHOUSE (DW)
Brooklyn Property Sales - DATA WAREHOUSE (DW)Sotiris Baratsas
 
Car Rental Agency - Database - MySQL
Car Rental Agency - Database - MySQLCar Rental Agency - Database - MySQL
Car Rental Agency - Database - MySQLSotiris Baratsas
 
Predicting Customer Churn in Telecom (Corporate Presentation)
Predicting Customer Churn in Telecom (Corporate Presentation)Predicting Customer Churn in Telecom (Corporate Presentation)
Predicting Customer Churn in Telecom (Corporate Presentation)Sotiris Baratsas
 
Understanding Customer Churn in Telecom - Corporate Presentation
Understanding Customer Churn in Telecom - Corporate PresentationUnderstanding Customer Churn in Telecom - Corporate Presentation
Understanding Customer Churn in Telecom - Corporate PresentationSotiris Baratsas
 
How to Avoid (causing) Death by Powerpoint!
How to Avoid (causing) Death by Powerpoint!How to Avoid (causing) Death by Powerpoint!
How to Avoid (causing) Death by Powerpoint!Sotiris Baratsas
 
The Secrets of the World's Best Presenters
The Secrets of the World's Best PresentersThe Secrets of the World's Best Presenters
The Secrets of the World's Best PresentersSotiris Baratsas
 
The Capitalist's Dilemma - Presentation
The Capitalist's Dilemma - PresentationThe Capitalist's Dilemma - Presentation
The Capitalist's Dilemma - PresentationSotiris Baratsas
 
A behavioral explanation of the DOT COM bubble
A behavioral explanation of the DOT COM bubbleA behavioral explanation of the DOT COM bubble
A behavioral explanation of the DOT COM bubbleSotiris Baratsas
 
[AIESEC] Welcome Week Presentation
[AIESEC] Welcome Week Presentation[AIESEC] Welcome Week Presentation
[AIESEC] Welcome Week PresentationSotiris Baratsas
 
Advanced Feedback Methodologies
Advanced Feedback MethodologiesAdvanced Feedback Methodologies
Advanced Feedback MethodologiesSotiris Baratsas
 
Advanced University Relations [AIESEC Training]
Advanced University Relations [AIESEC Training]Advanced University Relations [AIESEC Training]
Advanced University Relations [AIESEC Training]Sotiris Baratsas
 
How to organize massive EwA Events [AIESEC Training]
How to organize massive EwA Events [AIESEC Training]How to organize massive EwA Events [AIESEC Training]
How to organize massive EwA Events [AIESEC Training]Sotiris Baratsas
 
How to run effective Social Media Campaigns [AIESEC Training]
How to run effective Social Media Campaigns [AIESEC Training]How to run effective Social Media Campaigns [AIESEC Training]
How to run effective Social Media Campaigns [AIESEC Training]Sotiris Baratsas
 
Global Youth to Business Forum Sponsorship Package
Global Youth to Business Forum Sponsorship PackageGlobal Youth to Business Forum Sponsorship Package
Global Youth to Business Forum Sponsorship PackageSotiris Baratsas
 
Online Marketing - STEP IT UP
Online Marketing - STEP IT UPOnline Marketing - STEP IT UP
Online Marketing - STEP IT UPSotiris Baratsas
 
Ready, Set, Go Global (Opening & Closing)
Ready, Set, Go Global (Opening & Closing)Ready, Set, Go Global (Opening & Closing)
Ready, Set, Go Global (Opening & Closing)Sotiris Baratsas
 

More from Sotiris Baratsas (20)

Twitter Mention Graph - Analytics Project
Twitter Mention Graph - Analytics ProjectTwitter Mention Graph - Analytics Project
Twitter Mention Graph - Analytics Project
 
Suicides in Greece (vs rest of Europe)
Suicides in Greece (vs rest of Europe)Suicides in Greece (vs rest of Europe)
Suicides in Greece (vs rest of Europe)
 
Azure Stream Analytics Report - Toll Booth Stream
Azure Stream Analytics Report - Toll Booth StreamAzure Stream Analytics Report - Toll Booth Stream
Azure Stream Analytics Report - Toll Booth Stream
 
Brooklyn Property Sales - DATA WAREHOUSE (DW)
Brooklyn Property Sales - DATA WAREHOUSE (DW)Brooklyn Property Sales - DATA WAREHOUSE (DW)
Brooklyn Property Sales - DATA WAREHOUSE (DW)
 
Car Rental Agency - Database - MySQL
Car Rental Agency - Database - MySQLCar Rental Agency - Database - MySQL
Car Rental Agency - Database - MySQL
 
Predicting Customer Churn in Telecom (Corporate Presentation)
Predicting Customer Churn in Telecom (Corporate Presentation)Predicting Customer Churn in Telecom (Corporate Presentation)
Predicting Customer Churn in Telecom (Corporate Presentation)
 
Understanding Customer Churn in Telecom - Corporate Presentation
Understanding Customer Churn in Telecom - Corporate PresentationUnderstanding Customer Churn in Telecom - Corporate Presentation
Understanding Customer Churn in Telecom - Corporate Presentation
 
How to Avoid (causing) Death by Powerpoint!
How to Avoid (causing) Death by Powerpoint!How to Avoid (causing) Death by Powerpoint!
How to Avoid (causing) Death by Powerpoint!
 
The Secrets of the World's Best Presenters
The Secrets of the World's Best PresentersThe Secrets of the World's Best Presenters
The Secrets of the World's Best Presenters
 
The Capitalist's Dilemma - Presentation
The Capitalist's Dilemma - PresentationThe Capitalist's Dilemma - Presentation
The Capitalist's Dilemma - Presentation
 
Why Global Talent
Why Global TalentWhy Global Talent
Why Global Talent
 
A behavioral explanation of the DOT COM bubble
A behavioral explanation of the DOT COM bubbleA behavioral explanation of the DOT COM bubble
A behavioral explanation of the DOT COM bubble
 
[AIESEC] Welcome Week Presentation
[AIESEC] Welcome Week Presentation[AIESEC] Welcome Week Presentation
[AIESEC] Welcome Week Presentation
 
Advanced Feedback Methodologies
Advanced Feedback MethodologiesAdvanced Feedback Methodologies
Advanced Feedback Methodologies
 
Advanced University Relations [AIESEC Training]
Advanced University Relations [AIESEC Training]Advanced University Relations [AIESEC Training]
Advanced University Relations [AIESEC Training]
 
How to organize massive EwA Events [AIESEC Training]
How to organize massive EwA Events [AIESEC Training]How to organize massive EwA Events [AIESEC Training]
How to organize massive EwA Events [AIESEC Training]
 
How to run effective Social Media Campaigns [AIESEC Training]
How to run effective Social Media Campaigns [AIESEC Training]How to run effective Social Media Campaigns [AIESEC Training]
How to run effective Social Media Campaigns [AIESEC Training]
 
Global Youth to Business Forum Sponsorship Package
Global Youth to Business Forum Sponsorship PackageGlobal Youth to Business Forum Sponsorship Package
Global Youth to Business Forum Sponsorship Package
 
Online Marketing - STEP IT UP
Online Marketing - STEP IT UPOnline Marketing - STEP IT UP
Online Marketing - STEP IT UP
 
Ready, Set, Go Global (Opening & Closing)
Ready, Set, Go Global (Opening & Closing)Ready, Set, Go Global (Opening & Closing)
Ready, Set, Go Global (Opening & Closing)
 

Recently uploaded

Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 

Recently uploaded (20)

Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 

Predicting US house prices using Multiple Linear Regression in R

  • 1. 1 STATISTICS FOR BUSINESS ANALYTICS BARATSAS SOTIRIOS sotbaratsas@gmai.com MSc in Business Analytics 1. Read the “usdata” dataset and use str() to understand its structure. dataset <- read.csv(file="usdata", header=TRUE, sep=" ") View(dataset) str(dataset) sapply(dataset, mode) sapply(dataset, class) NOTES As we can see, we have a data frame with 63 observations of 6 variables. All the variables have been imported as integer, numeric variables. Using the View(dataset) command, we can also see that the dataset is clean, withouth incorrect, invalid or missing values. We only need to correct the data classes. 2. Convert the variables PRICE, SQFT, AGE, FEATS to be numeric variables and NE, COR to be factors. dataset$PRICE<-as.numeric(dataset$PRICE) dataset$SQFT<-as.numeric(dataset$SQFT) dataset$AGE<-as.numeric(dataset$AGE) dataset$FEATS<-as.numeric(dataset$FEATS) dataset$NE<-factor(dataset$NE, levels=c(0,1), labels=c("No", "Yes")) dataset$COR<-factor(dataset$COR, levels=c(0,1), labels=c("No", "Yes")) str(dataset) NOTES We convert the variables PRICE, SQFT, AGE, FEATS into numeric variables and the variables NE, COR into factors. Then, we check using the command str(dataset), if the transformation was successful. It appears it was. 3. Perform descriptive analysis and visualization for each variable to get an initial insight of what the data looks like. Comment on your findings. # For the numeric variables require(psych) index <- sapply(dataset, class) == "numeric" numcolumns <- dataset[,index]
  • 2. 2 round(t(describe(numcolumns)),2) # For the factors we can get information, using: summary(dataset[5:6]) prop.table(table(dataset$NE)) prop.table(table(dataset$COR)) # Visualizing the variables' distributions par(mfrow=c(2,3)) hist(dataset$PRICE, main="Price", xlab="USD (hundreds)") hist(dataset$SQFT, main="Sq Feet", xlab="Square Feet") hist(dataset$AGE, main="Age", xlab="Years") n <- nrow(numcolumns) plot(table(dataset$FEATS)/n, type='h', xlim=range(dataset$FEATS)+c(-1,1), main="Features", ylab='Relative frequency', xlab="Features") fcolumns <- dataset[,!index] # for Factors barplot(sapply(fcolumns,table)/n, horiz=T, las=1, col=2:3, ylim=c(0,8), cex.names=1.3, space=0.8, xpd=F) legend('top', fill=2:3, legend=c('No','Yes'), horiz=T, bty='n',cex=0.6) NOTES PRICE: The variable "PRICE" has a mean of 1158.42 USD (hundreds) and a median of 1049. In its distribution, we can observe there is a positive skewness, due to a large number of observations, outlying on the right side of the distribution. SQFT: The variable "SQFT" has a mean of 1729.54 sq.ft and a median of 1680 sq.ft. The majority of observations are located
  • 3. 3 between 1000 and 2000 sq.ft. There are some outlier observations, located between 2500 and 3000 sq.ft, giving the distribution a positive skewness. AGE: The variable "AGE" has a mean of 17.46 years and a median of 20 years. The observations seem to be randomly distributed, with the majority of observations included in 3 periods: 25-30 years, 15-20 years and 0-10 years. FEATURES: Starting with 1 feature, we see the number of observations increase, as we increase the number of features, with 4 being the most common number of features that a house has. After 4, we observe a few houses having 5 or 6 features and a very low number of houses having 7 or 8. No house in our sample has more than 8 features. NE: The majority of observations (61.9%) seems to be located in the the Northeast sector of the city. Since we want to use this sample as a basis for the whole company (including other cities), we must take into account that this sample best represents the sales in the NE part of the city and might be indicative of a trend in this area. COR: The majority of the observations (77.7%) does not seem to be a Corner Location, which is something reasonable. 4. Conduct pairwise comparisons between the variables in the dataset to investigate if there are any associations implied by the dataset. Comment on your findings. Is there a linear relationship between PRICE and any of the variables in the dataset? pairs(numcolumns) #On first sight, we can see there probably is a linear relationship between PRICE and SQFT. par(mfrow=c(2,3)) for(j in 2:length(numcolumns[1,])){ plot(numcolumns[,j], numcolumns[,1], xlab=names(numcolumns)[j], ylab='Price',cex.lab=1.2) abline(lm(numcolumns[,1]~numcolumns[,j])) } # For factor variables for(j in 1:2){ boxplot(numcolumns[,1]~fcolumns[,j], xlab=names(fcolumns)[j], ylab='Price', cex.lab=1.2) } # We can also do the box plots, but in this instance, I believe they don't present a clarifying image. We could keep the box plot only for the variable "FEATS". # par(mfrow=c(1,3)) # for(j in 2:length(numcolumns[1,])){ # boxplot(numcolumns[,1]~numcolumns[,j], xlab=names(numcolumns)[j], ylab='Price',cex.lab=1.5) # abline(lm(numcolumns[,1]~numcolumns[,j]),col=2) # }
  • 4. 4 NOTES Plotting each variable against PRICE, we can (again) see that there is a visible linear correlation between PRICE and SQFT. Observing the lines, we can also see that there is a weaker positive correlation between PRICE and FEATS and a slight negative correlaction between PRICE and AGE. round(cor(numcolumns), 2) require(corrplot) par(mfrow=c(1,2)) corrplot(cor(numcolumns), method="ellipse") corrplot(cor(numcolumns), method = "number") NOTES The corrplot paints a more definitive picture of the linear correlation between our variables. We can confirm that there is a strong (positive) linear correlation between PRICE and SQFT, a weaker (positive) correlation between PRICE and FEATS, and a slight negative correlation between PRICE and AGE. We do not see any covariates being heavily correlated (which could mean a problem of multicollinearity). So, we do not exclude any variables at this point. 5. Construct a model for the expected selling prices (PRICE) according to the remaining features.(hint: Conduct multiple regression having PRICE as a response and all the other variables as predictors). Does this linear model fit well to the data? model1 <- lm(PRICE ~., data = dataset) summary(model1)
  • 5. 5 NOTES Generally, the goodness of fit of this model seems to be acceptable (R^2=0.87 | R^2 adj=0.86). This means, that changes (increase/decrease) in the covariates of the model explain by 87% the changes (increases/decreases) in the response. There are certain indicators (such as the negative intercept) that lead us to believe we might be able to build a better model. Also, the adjusted R-squared is lower than the Multiple R-squared. This might be an indicator, that the suitability of our model could become better if we removed at least one of the covariates. 6. Find the best model for predicting the selling prices (PRICE). Select the appropriate features using stepwise methods. (Hint: Use Forward, Backward or Stepwise procedure according to AIC or BIC to choose which variables appear to be more significant for predicting selling PRICES). model2<-step(model1, direction='both') #step(model1, direction='both', k=log(n)) # If we wanted to use BIC NOTES We start with the full model (model1) and using the Akaike Information Criterion (AIC), we try to see if at every step our model can be improved by adding or removing a variable. In the beginning, our model has AIC=632.62.44. We use the stepwise method, in order to be able to both add and remove a variable at any step of the process. In the first step, we see that, by removing the variable "NE", we can decrease the AIC to 631.05, so we do that. In the second step, we see that by removing the variable "AGE", we can further decrease the AIC to 629.69. In the third step, we see that by removing the variable "COR", we can further decrease the AIC to 628.84. In the fourth step, we see that neither removing nor adding another variable will help us further decrease the AIC of our model. Therefore, our algorithm stops and we are presented with the final model of the stepwise method. 7. Get the summary of your final model, (the model that you ended up having after conducting the stepwise procedure) and comment on the output. Interpret the coefficients. Comment on the significance of each coefficient and write down the mathematical formulation of the model (e.g PRICES = Intercept + coef1*Variable1 + coef2*Variable2 +.... + ε where ε ~ Ν(0, ...) ). Should the intercept be excluded from our model? summary(model2)
  • 6. 6 NOTES We see that the R^2 adj has increased, by removing these variables. Therefore, we assume this is a better model for predicting the PRICEs of houses. Our final model is: PRICE = -175.92 + 0.68 * SQFT + 39.84 * FEATS + ε ε ~ N(0, 143.6^2) INTERPRETATION: § Intercept = -175.92 --> The expected price of a house, when SQFT = 0 and the house has zero FEATURES, is - 175.92 USD (hundreds) = -17592 USD. This interpretation is not sensible, since a house can't have a negative price. Also, a house can not have Area=0 square feet. § b1(SQFT) = 0.68 --> An increase of 1 unit (sq.foot) in the lotsize of a house, will mean an increase of 0.68 USD (hundreds) = 68 USD in the Price of the house. More practically, if we compare two houses with the same characteristics which differ only by 1 sq.ft, then the expected difference in the price will be 68 USD in favor of the larger house. Since, a reasonable change would be at least 100 sqft, we can say that, if we compare two houses with the same characteristics which differ only by 100 sq.ft, then the expected difference in the price will be 6800 USD in favor of the larger house. § b2(FEATS) = 39.84 --> If we compare two houses with the same characteristics which differ only by 1 additional feature, then the expected difference in the price is equal to 39.84 USD (hundreds) = 3984 USD, in favor of the house with the larger number of features. All three covariates are statistically significant. SQFT is statistically significant in all confidence intervals, while the Intercept and FEATS are statistically significant with a 0.01 significance level (or lower). # Changing from square feet to square meters # If we want, we can change the unit of the lotsize from Sq.Ft to Sq.Meters. # # dataset$SQFT<-dataset$SQFT/10.764 # summary(step(lm(PRICE ~., data = dataset), direction='both')) # ------------------------------- # SHOULD WE REMOVE THE INTERCEPT? # ------------------------------- # Removing the intercept from the beginning model3<-lm(PRICE~.-1, data=dataset) model4<-step(model3, direction='both') summary(model4) # We see an R^2 adj = 0.9859, however, that is not its true value. # true.r2 <- 1-sum(model4$res^2)/((n-1)*var(dataset$PRICE)); true.r2 true.r2 <- 1-sum(model4$res^2)/sum( (dataset$PRICE - mean(dataset$PRICE))^2 ) # This is the true R^2
7
# Removing the intercept straight from the last-used model (model2)
model5 <- update(model2, ~ . -1)
summary(model5) # The R^2 reported here is not the true R^2
true.r2.2 <- 1-sum(model5$res^2)/sum( (dataset$PRICE - mean(dataset$PRICE))^2 ); true.r2.2 # This is the true R^2

NOTES
We notice that, in both cases, when we remove the Intercept the R^2 reported by summary() is inflated, while the true R^2 of the model decreases, indicating that the goodness of fit actually worsens. Moreover, in the summary of model2 we can see that the Intercept is significant at the 0.01 significance level (or lower). Based on these two facts, we advise against removing the Intercept from the model, and we keep it in the model for the rest of our analysis.

# --- Model with centered covariates ---
# Since our numeric covariates never take values even close to 0 (SQFT, AGE, FEATS),
# it makes sense to try to interpret the Intercept using a model with centered covariates.
numcol1 <- as.data.frame(scale(numcolumns, center = TRUE, scale = F))
numcol1$PRICE<-numcolumns$PRICE
sapply(numcol1,mean)
sapply(numcol1,sd)
round(sapply(numcol1,mean),5)
round(sapply(numcol1,sd),2)
cen_model<-lm(PRICE~., data=numcol1)
summary(cen_model)
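Before reading the summary, a quick hedged sanity check: with all numeric covariates centered, the fitted regression passes through the point of means, so the intercept of cen_model should coincide with the sample mean of PRICE.

# Sketch: the centered-model intercept should equal the mean selling price.
coef(cen_model)["(Intercept)"]
mean(dataset$PRICE)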
8
NOTES
In this new model, the Intercept is equal to 1.158e+03 USD (hundreds) = 115800 USD. This is the expected price of a house when all the numeric covariates are at their means, i.e. a house with size = 1729.54 sq.ft, age = 17.4 years and 3.95 features. Of course, for discrete variables we have to round (e.g. 4 features) for the interpretation to make complete sense. We can see the means using: sapply(numcolumns, mean)

8. Check the assumptions of your final model. Are the assumptions satisfied? If not, what is the impact of the violation of the assumption not satisfied in terms of inference? What could someone do about it?

# --- Normality of the residuals ---
par(mfrow=c(1,1))
plot(model2, which = 2)
require(nortest)
shapiro.test(model2$residuals) # The p-value is large. We do not reject NORMALITY.
lillie.test(model2$residuals) # The p-value is 0.09. We do not reject Normality at the 0.01 and 0.05 significance levels.

NOTES
From the Q-Q plot, we can see that the middle part of the distribution of the residuals falls close to the normal distribution, but there is a stronger deviation at the tails. Accordingly, the Shapiro-Wilk test gives a very high p-value (0.63), while the Lilliefors test supports the normality assumption only at the 0.01 and 0.05 significance levels (not at 0.10).

# --- Homoscedasticity of the errors ---
# Is the variance constant?
Stud.residuals <- rstudent(model2)
yhat <- fitted(model2)
par(mfrow=c(1,2))
plot(yhat, Stud.residuals) # plotting the studentized residuals versus the fitted values
abline(h=c(-2,2), col=2, lty=2)
plot(yhat, Stud.residuals^2)
abline(h=4, col=2, lty=2)
# - ncvTest -
library(car)
ncvTest(model2)
# The p-value is very small, thus we reject the hypothesis of constant variance.
# Thus, the assumption of homoscedasticity of the errors is violated.
# - Levene Test -
yhat.quantiles<-cut(yhat, breaks=quantile(yhat, probs=seq(0,1,0.25)), dig.lab=6)
leveneTest(rstudent(model2)~yhat.quantiles)
boxplot(rstudent(model2)~yhat.quantiles)
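As an additional, hedged check of constant variance (not part of the original output), the Breusch-Pagan test from the lmtest package can be applied to the same model; a small p-value would again point towards heteroscedasticity.

# Sketch: Breusch-Pagan test for heteroscedasticity on the stepwise model.
library(lmtest)
bptest(model2)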
9
NOTES
The first two plots show that there are values outside the dotted red lines, which means we probably do not have constant variance. The ncvTest and the Levene test confirm that we reject the hypothesis of constant variance, since the p-values are very small. As we see in the boxplot, the variance across the four fitted-value quartile groups differs considerably. The violation of this assumption means that the variability of our response (PRICE) is not constant across the range of values of our covariates, so the standard errors of the coefficients (and therefore the t-tests and confidence intervals based on them) are not reliable.

# --- Non-linearity ---
library(car)
par(mfrow=c(1,1))
residualPlot(model2, type='rstudent')
residualPlots(model2, plot=F, type = "rstudent")

NOTES
At the 5% significance level, we reject the hypothesis of linearity between the response and the covariates. This means that we will probably need a quadratic term to produce a good regression model (an illustrative sketch follows the independence checks below).

# --- Independence ---
plot(rstudent(model2), type='l')
library(randtests); runs.test(model2$res)
library(lmtest); dwtest(model2)
library(car); durbinWatsonTest(model2)
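As an aside (a hedged sketch, not the remedy chosen below), the non-linearity flagged by residualPlots could also be addressed by adding a quadratic term for SQFT and comparing the two fits:

# Sketch: add a quadratic term for SQFT to the stepwise model and compare
# the nested models with a partial F-test.
quad_model <- update(model2, . ~ . + I(SQFT^2))
summary(quad_model)
anova(model2, quad_model)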
10
NOTES
Using the runs test and the Durbin-Watson test at the 0.05 significance level, we do not reject the null hypothesis that the order of the observations is random, i.e. the errors appear to be independent. Since two of our assumptions (linearity and homoscedasticity) are violated, we need to apply a transformation to fix the model. Both problems may be solved by applying a log transformation to the dependent variable, which is appropriate since PRICE is strictly positive.

logmodel<-lm(log(PRICE)~.-AGE-NE-COR, data=dataset)
summary(logmodel)

# Normality of the residuals
plot(logmodel, which = 2)
require(nortest)
shapiro.test(logmodel$residuals) # The p-value is large. We do not reject NORMALITY.
lillie.test(logmodel$residuals)

# Homoscedasticity of the errors
log.Stud.residuals <- rstudent(logmodel)
log.yhat <- fitted(logmodel)
par(mfrow=c(1,2))
plot(log.yhat, log.Stud.residuals) # plotting the studentized residuals versus the fitted values
abline(h=c(-2,2), col=2, lty=2)
plot(log.yhat, log.Stud.residuals^2)
abline(h=4, col=2, lty=2)
library(car)
ncvTest(logmodel)
log.yhat.quantiles<-cut(log.yhat, breaks=quantile(log.yhat, probs=seq(0,1,0.25)), dig.lab=6)
leveneTest(rstudent(logmodel)~log.yhat.quantiles)
boxplot(rstudent(logmodel)~log.yhat.quantiles)

# Linearity
library(car)
par(mfrow=c(1,1))
residualPlot(logmodel, type='rstudent')
residualPlots(logmodel, plot=F, type = "rstudent")
11
# Independence
plot(rstudent(logmodel), type='l')
library(randtests); runs.test(logmodel$res)
library(lmtest); dwtest(logmodel)
library(car); durbinWatsonTest(logmodel)

NOTES
In all the tests the p-values are large enough, which means that the log transformation has solved our problems.

INTERPRETATION
Since the new model has a higher adjusted R^2 (87.72% versus 86.61% for the previously best model) and all of the assumptions are now satisfied, we choose to keep this model. We should note that the interpretation of the coefficients has changed:
§ b1(SQFT) = 5.402e-04 --> An increase of 1 sq.ft in the size of a house corresponds to an increase of about 0.054% in its price (for a coefficient this small, 100*b is a good approximation of the percentage change, since exp(b) - 1 ≈ b). Since a more realistic difference would be at least 100 sq.ft, an increase of 100 sq.ft corresponds to an increase of roughly 5.4% in the price of a house.
§ b2(FEATS) = 2.850e-02 --> If we add 1 feature to a house, keeping all other characteristics unchanged, the expected increase in the price is about 2.85%.
§ Intercept = 5.959 --> The expected price of a house with SQFT = 0 and zero FEATURES is exp(5.959) ≈ 387 USD (hundreds).

9. Conduct LASSO as a variable selection technique and compare the variables that you end up having using LASSO to the variables that you ended up having using stepwise methods in (VI). Are you getting the same results? Comment.

require(glmnet)
modmtrx <- model.matrix(model1)[,-1] # We create the design matrix from the full model.
# We remove the first column of the design matrix, because it is the intercept column;
# glmnet fits its own intercept, so the intercept should not be included as a covariate.
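The slide then shows the LASSO coefficient paths. The fitting and plotting code is not reproduced on the slide, but it was presumably along the lines of the following sketch (the object name lasso is assumed here because plot_glmnet(lasso) is called further down):

# Sketch: fit the LASSO path over a grid of lambda values and plot the
# coefficient paths against log(lambda).
lasso <- glmnet(modmtrx, dataset$PRICE, alpha = 1)
plot(lasso, xvar = "lambda", label = TRUE)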
12
NOTES
In this graph it is not easy to see, but SQFT is actually the last variable to be shrunk to zero by LASSO; its path lies very close to the zero line because its coefficient is very small. This becomes more visible through the following plot:

library(plotmo)
plot_glmnet(lasso)

NOTES
As we can see, NE is the first variable to be removed from the model, followed by AGE. With a slightly higher lambda, COR is removed too, and finally FEATS and SQFT are the last variables to be removed. We notice that this is exactly the order in which the stepwise method removed the variables from the full model using the AIC.

# Using cross-validation to find a reasonable value for lambda
lasso1 <- cv.glmnet(modmtrx, dataset$PRICE, alpha = 1)
lasso1$lambda
lasso1$lambda.min
lasso1$lambda.1se
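The cross-validation curve that the slide presumably showed can be reproduced with the standard plot method for cv.glmnet objects (a hedged sketch):

# Sketch: cross-validated error as a function of log(lambda), with vertical
# lines marking lambda.min and lambda.1se.
plot(lasso1)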
13
NOTES
If we choose lambda.1se = 46.55 as a reasonable value for lambda (this is the largest lambda whose cross-validated error is within one standard error of the minimum), we see that LASSO keeps only FEATS and SQFT in the model (see the coefficient sketch below), which is exactly the same result as the stepwise regression using AIC.
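The variables selected at lambda.1se can be read directly from the fitted cv.glmnet object; a minimal sketch (coefficients shrunk to zero appear as dots in the sparse output):

# Sketch: LASSO coefficients at lambda.1se; only the non-zero entries
# (expected here: SQFT and FEATS, plus the unpenalized intercept) remain.
coef(lasso1, s = "lambda.1se")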