Summary:
This report applies multiple linear regression and other statistical techniques to a data set of red wine quality ratings and physicochemical attributes. Key findings are that alcohol content and sulphates most strongly predict higher quality ratings, while chlorides and volatile acidity predict lower ratings. A backward selection multiple linear regression produced the best-fitting model, with seven significant predictors explaining wine quality; principal component analysis and cross-validation methods were also applied to validate the model.
Data Set Review:
What are the basic factors that make a wine more preferable or highly rated? Are these
factors imaginary or real? These are age-old conundrums about the fermented grape
juice the world has come to know and love. I have been skeptical of professional wine
ratings ever since reading about a blind taste test in which oenologists could not
accurately pick out the most expensive wines. The wine quality data set captured my
fascination with this problem.
My objective is to build a multiple linear regression (MLR) model that predicts wine
quality from physicochemical attributes. My questions are as follows:
1. Is there a linear relationship between wine ratings (quality) and at least one of
the predictor variables?
2. Which physicochemical attribute(s) are the strongest predictors of highly rated
wine and poorly rated wine?
3. Which model has the best performance in fitting the data?
Wine ratings, or “quality” scores, are sensory data, meaning humans assess the
quality, whereas the physicochemical values are objective data collected from lab tests.
There may be real benefits to measuring the impact of physicochemical tests on
wine quality: it could help wine producers improve the production process and
identify target markets to increase profitability. In wine industry practice,
certification and quality assessments to prevent adulteration of wine are also based on
these types of data collection and descriptive analysis.
I transformed the raw data set, obtained from http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/, using the text-to-columns feature in Excel to make it
suitable for analysis in R.
The data set is related to red variants of the Portuguese "Vinho Verde" wine. Only
physicochemical (input) and sensory (output) variables are available (e.g., there is no
private data about grape types, wine brand, wine selling price, etc.). There are 1,599
observations with 11 input variables and 1 output variable. Several of the attributes
may be correlated, so it makes sense to apply some form of feature selection.
Input Variables (based on physicochemical tests):
1 - fixed acidity (min: 4.60, max: 15.90)
2 - volatile acidity (min: 0.12, max: 1.58)
3 - citric acid (min: 0.00, max: 1.00)
4 - residual sugar (min: 0.90, max: 15.50)
5 - chlorides (min: 0.01, max: 0.61)
6 - free sulfur dioxide (min: 1.00, max: 72.00)
7 - total sulfur dioxide (min: 6.00, max: 289.00)
8 - density (min: 0.99, max: 1.00)
9 - pH (min: 2.74, max: 4.01)
10 - sulphates (min: 0.33, max: 2.00)
11 - alcohol (min: 8.40, max: 14.90)
Output Variable (based on sensory data):
12 - quality (min: 3.00, max: 8.00), score between 0 and 10
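As a quick sanity check, the ranges listed above can be reproduced in R (a minimal sketch, assuming the comma-separated winequality-red.csv described earlier):
mydata = read.csv("winequality-red.csv", header=TRUE)
sapply(mydata, range) #min and max of each of the 12 variables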
Red Wine Data Set:
I applied a simple validation set approach to randomly divide the available set of
observations into two parts: a 60% training set and a 40% validation set. The model in
Figure 1 is fit on the training set, and the fitted model is used to predict the responses
for the observations in the validation set. The resulting validation set error rate,
assessed using the mean squared error (MSE), provides an estimate of the test error
rate.
> mean((quality-predict(lm.fit,mydata))[-train]^2) #estimated test MSE in validation set
[1] 0.4374297
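For reference, this estimate comes from the following split and fit (mirroring the appendix code):
set.seed(1)
train=sample(1599,960) #indices of the 60% training rows
lm.fit=lm(quality~., data=mydata, subset=train)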
To see whether there is a relationship between quality (the response variable) and the 11
physicochemical attributes (the predictor variables), we need to test the null hypothesis. The
hypothesis test can be performed by computing the F-statistic for the full model.
Null Hypothesis H0: B1 = B2 = B3 = ... = B11 = 0
Alternative Hypothesis Ha: at least one Bj is non-zero
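For reference, with p predictors and n observations, the F-statistic compares the fit of the full model against the intercept-only model:
F = ((TSS - RSS)/p) / (RSS/(n - p - 1))
where TSS is the total sum of squares and RSS is the residual sum of squares; values of F much larger than 1 provide evidence against H0.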
1. Multiple Linear Regression Model Using All 11 Predictor Variables
trainfit<- lm(quality~., data=training)
summary(trainfit) #Output of Regression coefficients for all Predictors
Figure 1
The F-statistic on this full model (Figure 1) is 50.26, which is much larger than 1, so we
can reject the null hypothesis H0 and conclude that there is a relationship between
quality (Y) and at least one of the predictors. Since the training set is relatively large
(roughly 960 of the 1,599 observations), an F-statistic of 50.26 provides strong evidence against H0.
2. Diagnostics Plots to Verify Regression Assumptions
Figure 2a
Figure 2b
The “Residuals vs. Fitted” plot (Figure 2a) checks for equal variance. The red line is a
smooth fit to the residuals, intended to make any patterns easier to identify.
Due to the strange parallel-lines pattern in the plot, which arises because quality takes
only integer values (3 through 8), it is not easy to confirm that the equal-variance
assumption holds.
The “Normal Q-Q” plot (Figure 2b), which plots the quantiles of the residual distribution
against the quantiles of the standard normal distribution, does not show systematic
deviations from normality, since the bulk of the points lie on a straight line.
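Both panels can be produced with R's built-in diagnostics for lm objects; a minimal sketch:
plot(trainfit, which=1) #Residuals vs. Fitted (Figure 2a)
plot(trainfit, which=2) #Normal Q-Q (Figure 2b)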
3. Correlation Plots
We can explore the numerical predictors and the response (quality) by creating a
correlation table (Figures 3a/3b) between quality and all the physicochemical predictors.
Figure 3a
Correlation Plots (numerical values)
Figure 3b
Correlation Plot (coded by size/color of circles instead of numerical values)
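Both figures can be generated with the corrplot package, as in the appendix:
require(corrplot)
M = cor(training)
corrplot(M, method="number") #Figure 3a: numerical values
corrplot(M, method="circle") #Figure 3b: size/color-coded circles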
Alcohol seems to be the best single predictor of quality, with a correlation of about 0.49.
4. Variance Inflation Factor (VIF)
In addition to inspecting the correlation matrix, I computed the VIF (Figure 4) for all
predictors. There is typically a small amount of collinearity among predictors, so I am
concerned with values between 5 and 10, which can indicate a problematic
amount of collinearity.
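The VIF of a coefficient Bj is obtained by regressing Xj on all the other predictors:
VIF(Bj) = 1 / (1 - R^2_j)
where R^2_j is the R-squared from that regression; a VIF of 1 indicates no collinearity at all. In R (as in the appendix):
require(car)
vif(trainfit)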
Figure 4
Both the correlation matrix and the VIF calculations confirm that there is a multicollinearity
problem with fixed.acidity and two other predictors: citric.acid and
density.
5. Multiple Linear Regression Model Without fixed.acidity Predictor Variable
trainfit1= update(trainfit, ~.-fixed.acidity, data = training)
summary(trainfit1)
I updated the MLR model (Figure 5) to remove fixed.acidity, and the updated model
fits the data better (a larger F-statistic and a lower residual standard error).
Figure 5
Although the MLR model without fixed.acidity fits the data better, we can also see
relatively large individual p-values on some of the predictor variables, such as
residual.sugar, density, and citric.acid.
6. Principal Component Analysis
The response variable quality appears to be related to only a subset of the predictors.
We can use a dimensionality-reduction method such as PCA for exploratory analysis and
to produce derived variables (principal components) for use in a supervised learning
method.
PCA reduces the set of numerical variables by removing the overlap of information
between them. The new variables are linear combinations (weighted averages) of the
original variables. These linear combinations are uncorrelated, thus potentially
correcting the problem of multicollinearity.
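Concretely, the first principal component is the normalized linear combination of the p predictors with the largest variance:
Z1 = phi11*X1 + phi21*X2 + ... + phip1*Xp, subject to phi11^2 + phi21^2 + ... + phip1^2 = 1
where the loadings phi are the entries of pr.out$rotation below.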
#Centers variables to have mean = 0 and normalizes data to have standard deviation = 1
> pr.out = prcomp(mydata.pca, scale = TRUE)
The rotation matrix provides the principal component loadings; each column of
pr.out$rotation contains one principal component loading vector.
> pr.out$rotation
Figure 6
Figure 6a
Figure 6b
Plot of First Two Principal Components
We infer that the first principal component corresponds roughly to a measure of
total.sulfur.dioxide and alcohol. Similarly, the second
component corresponds to a measure of pH and fixed.acidity.
Figure 6a shows that the first principal component explains 28.2% of the variance, the
second component explains 17.5%, the third explains 14.1%, and so
on.
To choose the number of components for the principal component regression model,
we need to look at a scree plot. A scree plot is used to assess which components
explain the most variability in the data by plotting the values in descending order.
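The proportion of variance explained (PVE) by each component is its variance over the total variance; in R (as in the appendix):
pr.var = pr.out$sdev^2 #variance explained by each component
pve = pr.var/sum(pr.var) #proportion of variance explained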
Figure 6c
Scree Plot of PVE
Figure 6d
Scree Plot of Cumulative PVE (Sum = 1)
It appears that nine principal components are needed to reach a cumulative explained
variance of close to 98%. In this case, PCA did not do well: it could only reduce the number
of predictors from 11 to 9 without compromising the proportion of explained variance.
After checking both the scree plot of the PVE (Figure 6c) and the cumulative PVE (Figure 6d),
we see that PCA does not yield a small enough number of principal components to give a
compact understanding of the data.
7. Forward and Backward Selection MLR Model
We know from the previous exploration of the data set that the response variable
quality is related to only a subset of the predictors. To determine which
predictors are associated with the response, we want to fit a single model involving only
those predictors.
We cannot consider all 2^p models with p = 11; that would be 2^11 = 2,048 models! Instead, we
can find a good model using variable selection. The two classical
approaches, forward selection and backward selection, provide a more efficient,
automated way to choose a smaller set of models to consider.
#Forward Selection Method
null<-lm(quality~1, data=training)
forward<-step(null, scope=list(upper=trainfit), direction='forward')
summary(forward)
Figure 7a
#Backward Selection Method
backward<-step(trainfit, direction='backward')
summary(backward)
Figure 7b
The forward selection method begins with the null model and then repeatedly adds the
predictor that most improves the fit, whereas the backward selection method performs the
opposite optimization, starting with all the predictors and repeatedly eliminating the least
useful variable; in both cases R's step() function makes the decision using AIC. Although
these techniques differ in their search strategy, it is interesting to point out that in this
instance both methods fit the data to the same MLR model (Figure 7a and Figure 7b).
8. Diagnostics Plots to Verify Regression Assumptions
I plotted the “Residuals vs. Fitted” and “Normal Q-Q” plots to search for any patterns
that might violate the regression assumptions (Figure 8a and Figure 8b).
Figure 8a
Figure 8b
The forward/backward model's “Residuals vs. Fitted” plot (Figure 8a) is very similar to the
full model's and shows the same odd parallel-band pattern in the residuals, so we still
cannot rule out a violation of the equal (constant) variance assumption. The “Normal Q-Q”
plot (Figure 8b) does not show systematic deviations from normality, since the bulk of
the points lie on a relatively straight line. Another important residual plot is the
histogram of residuals (Figure 8c), which shows a roughly bell-shaped, normally
distributed curve.
Figure 8c
Figure 8d
We can also visualize the regression results with the coefficient plot (Figure 8d), where
each coefficient is plotted as a point, a thick blue line represents a one-standard-error
confidence interval, and a vertical gray dotted line indicates 0. If a coefficient's
standard error confidence interval does not contain 0, the coefficient is statistically
significant. Here we can see that alcohol and sulphates have the largest effects on
the quality of the wine.
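Figure 8d comes from the coefplot package, as in the appendix:
require(coefplot)
coefplot(backward)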
9. Fitting the Backward Selection Model on the Validation Data Set
> PredBack$residual.scale
[1] 0.6540796
Figure 9
> AIC(backward)
[1] 1917.249
> AIC(trainfit)
[1] 1923.308
> AIC(trainfit1)
[1] 1921.749
Another metric we can use to compare the performance of the models (trainfit, the full
model; trainfit1, the full model minus fixed.acidity; and the model obtained from
backward selection) is the Akaike Information Criterion (AIC). The AIC value of
backward, 1917.249, is smaller than those of trainfit and trainfit1 at 1923.308 and
1921.749, so the model obtained with the backward selection method fits the
data best.
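For reference, R's AIC() computes AIC = 2k - 2*log(L), where k is the number of estimated parameters (including the error variance) and L is the maximized likelihood; smaller values indicate a better trade-off between fit and model complexity.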
10. Leave-One-Out-Cross-Validation Method (LOOCV)
The Validation Set Approach from part 1 has its drawbacks. Instead, LOOCV can be
used: the model is fit on n - 1 training observations and a prediction is made for the
single excluded observation. Each such single-observation MSE is an approximately
unbiased but highly variable estimate of the test error, so LOOCV averages these
errors over all n observations.
A major advantage of the LOOCV method over the validation set approach is that it has
far less bias. The statistical learning model is repeatedly fit on training sets that contain
n - 1 observations, almost the entire data set, in contrast to the validation set
approach, which uses only 60% of the original data set to train the model.
The LOOCV approach also tends not to overestimate the test error rate as much as the
validation set approach does. And instead of yielding different results due to randomness
in the training/validation split, performing LOOCV multiple times will always yield the
same result, because there is no randomness in the splits.
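Formally, the LOOCV estimate averages the n held-out squared errors:
CV(n) = (1/n) * (MSE1 + MSE2 + ... + MSEn), where MSEi = (yi - yi_hat)^2
and yi_hat is the prediction for observation i from the model fit without observation i.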
#LOOCV estimate for test error is about 0.425
> cv.err$delta
[1] 0.4247753 0.4247728
11. k-Fold Cross-Validation
I used the k-fold cross-validation method as an alternative to LOOCV. The approach
randomly divides the set of observations into k groups, or folds, of approximately equal
size. One fold is treated as the validation set, and the model is fit on the
remaining k - 1 folds. The MSE is then computed on the observations in the held-out
fold. The procedure is repeated k times, each time with a
different fold treated as the validation set.
Although there is variability in the resulting test error estimates, it is often lower than the
variability that results from the validation set approach.
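The k-fold estimate averages the k held-out mean squared errors:
CV(k) = (1/k) * (MSE1 + MSE2 + ... + MSEk)
with k = 10 folds used here.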
#k-Fold CV estimate for test error is about 0.424
cv.error.10
[1] 0.42409
12. Bias-Variance Trade-off for k-Fold Cross Validation
On the basis of bias reduction, LOOCV is preferred over the k-Fold CV method since it
takes (k - 1) n / k observations to train the model in contrast to the LOOCV method that
trains the model on n - 1 observations, which is almost the entire data set. However,
the LOOCV method has a higher variance than k-Fold CV with k < n due to the high
18. 17
correlation of averaging n fitted models whereas the average of k fitted models is
somewhat less correlated. This results in test error rates from LOOCV tendency to have
a higher variance than k-fold CV.
13. Conclusion
Backward Selection Multiple Linear Regression Model:
Quality = 4.1747748 - 0.8826331*volatile.acidity - 1.9781343*chlorides +
0.0041786*free.sulfur.dioxide - 0.0037963*total.sulfur.dioxide - 0.5251147*pH +
0.9535727*sulphates + 0.3182497*alcohol
It is interesting to note that the backward selection method kept only seven regression
coefficients versus the original eleven predictors. This model shows that sulphates
and alcohol have the largest effects on predicting higher quality ratings of red wine in
this data set, whereas higher physicochemical measures of chlorides and
volatile.acidity have the largest effects on predicting lower quality ratings.
With respect to the adjusted R2, residual standard error, and AIC measures, we can
determine that the backward selection model fits the data best. Having examined the many
diagnostic graphs and plots, we can also conclude that linear regression is a compelling
method for analyzing the effects of physicochemical tests on the quality rating of Vinho
Verde red wine.
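As a usage sketch, the fitted backward selection model can score a new wine; the physicochemical values below are hypothetical, chosen only to fall within the observed ranges:
new_wine = data.frame(volatile.acidity=0.5, chlorides=0.08, free.sulfur.dioxide=15, total.sulfur.dioxide=46, pH=3.3, sulphates=0.65, alcohol=10.4)
predict(backward, newdata=new_wine) #predicted quality score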
Appendix of R Codes:
mydata = read.csv("winequality-red.csv",header=TRUE)
mydata.pca = read.csv("winequality-red-pca.csv",header=TRUE)
row <- nrow(mydata)
set.seed(12345)
trainindex <- sample(row, 0.6*row, replace=FALSE)
training <- mydata[trainindex,]
validation <- mydata[-trainindex,]
#ValidationSetApproach
require(ISLR)
set.seed(1)
train=sample(1599,960)
lm.fit=lm(quality~.,data=mydata,subset=train)
predict(lm.fit)
rating.predict<-predict(lm.fit)
plot(rating.predict)
attach(mydata)
mean((quality-predict(lm.fit,mydata))[-train]^2) #estimated test MSE in validation set
#MLR with all 11 predictors (full model), used in forward selection method later
trainfit<- lm(quality~., data=training)
#Output of Regression coefficients for all Predictors
summary(trainfit)
#Model fit
plot(trainfit)
#fitsummary = summary(trainfit)
#fitsummary$r.squared
#fitsummary$adj.r.squared
#fitsummary$sigma #RSE
AIC(trainfit)
BIC(trainfit)
#Principal Component Analysis (PCA)
physiochemattributes=row.names(mydata.pca)
physiochemattributes
names(mydata.pca)
#Means/Variances of variables
apply(mydata.pca, 2, mean)
apply(mydata.pca, 2, var)
#Centers variables to have mean zero and set variables to standard deviation one
pr.out = prcomp(mydata.pca, scale = TRUE)
pr.out
summary(pr.out)
dim(pr.out$x)
pr.out$center
pr.out$scale
pr.out$rotation
#Plot first two principal components
biplot(pr.out, scale=0)
pr.out$rotation = -pr.out$rotation
pr.out$x = -pr.out$x
biplot(pr.out, scale=0)
#Check standard deviation and variance explained
pr.out$sdev
pr.var = pr.out$sdev^2
pr.var
#Compute proportion of variance explained PVE
pve= pr.var/sum(pr.var)
pve
#Plot PVE and Cumulative PVE
plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1), type='b', col="blue")
plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1), type='b', col="blue")
#PCA Regression
require(pls)
set.seed(1000)
train <- mydata[1:1279,] #first 1,279 rows for training
y_test <- mydata[1280:1599, 12] #quality is the 12th column
test <- mydata[1280:1599, 1:11] #all 11 predictors for the held-out rows
pcr_model <- pcr(quality~., data = train,scale =TRUE, validation = "CV")
pcr_model
summary(pcr_model)
# Plot the root mean squared error
validationplot(pcr_model)
predplot(pcr_model)
coefplot(pcr_model)
# Plot the cross validation MSE
validationplot(pcr_model, val.type="MSEP")
pcr_pred <- predict(pcr_model, newdata=test) #predict the held-out rows
summary(pcr_model)
summary(pcr_pred)
mean(pcr_pred)
require(car)
vif(trainfit)
#remove fixed acidity from model because of multicollinearity
trainfit1= update(trainfit, ~.-fixed.acidity, data = training)
summary(trainfit1)
plot(trainfit1)
anova(trainfit,trainfit1, backward)
#coefficients(trainfit)
#fitted(trainfit)
#anova(trainfit)
#residuals(trainfit)
#vcov(trainfit)
#histogram of residuals with 20 bars
Residual<-residuals(trainfit1)
hist(Residual,breaks=20)
#Q-Q plot with residuals
qqnorm(Residual, ylab="Standardized Residuals", xlab="Normal Scores", main="Residual")
qqline(Residual)
layout(matrix(c(1,2,3,4),2,2))
plot(trainfit)
#Correlation and Scatter Plots for full model (trainfit)
require(corrplot)
M = cor(training)
corrplot(M, method = "number")
corrplot(M, method = "circle")
corrplot.mixed(M)
#Scatterplot
plot(training)
#Forward Selection Method
null<-lm(quality~1, data=training)
forward<-step(null, scope=list(upper=trainfit), direction='forward')
summary(forward)
plot(forward)
#histogram of forward-model residuals with 20 bars
Residual1<-residuals(forward)
hist(Residual1,breaks=20)
#Backward Selection Method; yields the same model as the Forward Selection Method
backward<-step(trainfit, direction='backward')
summary(backward)
AIC(backward)
AIC(trainfit)
AIC(trainfit1)
#lm.fit= lm(quality~volatile.acidity + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates + alcohol, data=training)
#attach(training)
#Compute MSE on validation set
#backward$delta
#Predicting validation data set with model with all predictors
PredTrainfit<-predict(trainfit, validation, se.fit=TRUE)
PredTrainfit
PredTrainfit$residual.scale
#Predicting validation data set with model that does not have fixed.acidity
PredTrainfit1<-predict(trainfit1, validation, se.fit=TRUE)
PredTrainfit1
PredTrainfit1$residual.scale
#Predicting the validation data set based on the backward selection model
PredBack<-predict(backward, validation, se.fit=TRUE)
PredBack
PredBack$residual.scale
# Comparing the validation residual standard deviations, the backward selection model is better than the base model
#Coefficient Plot of Backwards Selection Model
require(coefplot)
coefplot(backward)
#LOOCV
require(boot)
glm.fit=glm(quality~.,data=mydata)
cv.err=cv.glm(mydata,glm.fit)
cv.err$delta
cv.error=rep(0,5)
for (i in 1:5){
glm.fit=glm(quality~.,data=mydata)
cv.error[i]=cv.glm(mydata,glm.fit)$delta[1] #same model each iteration, so all five LOOCV estimates are identical
}
cv.error
#cross validation estimate for test error is about 0.425
coef(glm.fit)
#K-fold Cross Validation
set.seed(17)
cv.error.10=rep(0,10)
for (i in 1:10) {
glm.fit=glm(quality~., data=mydata)
cv.error.10[i]=cv.glm(mydata,glm.fit,K=10)$delta[2] } #store each run's estimate instead of overwriting the vector
cv.error.10
#k-fold CV estimate for test error is about 0.424
References:
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical
learning: With applications in R. New York: Springer.
Lander, J. P. (2014). R for Everyone: Advanced Analytics and Graphics. Upper Saddle River,
NJ: Addison-Wesley.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences
by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553.
ISSN: 0167-9236. http://dx.doi.org/10.1016/j.dss.2009.05.016
Pre-press (pdf): http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
BibTeX: http://www3.dsi.uminho.pt/pcortez/dss09.bib