Melissa A. Johnson
STAT 515 - Fall 2016
December 8, 2016
A STATISTICAL MODEL OF
WINE QUALITY
Data Set Review:
What basic factors make a wine more preferable or highly rated? Are these factors
imaginary or real? These are age-old conundrums about the fermented grape juice the
world has come to know and love. I have been skeptical of professional wine ratings
ever since reading about a blind taste test in which oenologists could not accurately
pick out the most expensive wines, so the wine quality data set caught my interest.
My objective is to fit a multiple linear regression (MLR) model that predicts wine
quality from physicochemical attributes. My questions are as follows:
1. Is there a linear relationship between wine ratings (quality) and at least one of
the predictor variables?
2. Which physicochemical attribute(s) are the strongest predictors of highly-rated
wine and poorly-rated wine?
3. Which model fits the data best?
Wine ratings, or “quality” scores, are sensory data: humans assess the quality,
whereas the physicochemical values are objective data collected from lab tests.
There may be practical benefits to measuring the impact of physicochemical tests on
wine quality: it could help wine producers improve the production process and
identify target markets to increase profitability. In wine industry practice,
certification and quality assessments to prevent adulteration of wine are also based
on these types of data collection and descriptive analysis.
I transformed the raw data set, obtained from http://mlr.cs.umass.edu/ml/machine-
learning-databases/wine-quality/, using the text-to-columns feature in Excel to make
it suitable for analysis in R.
The data set is related to red variants of the Portuguese "Vinho Verde" wine. Only
physicochemical (inputs) and sensory (output) variables are available (e.g., there is no
private data about grape types, wine brand, wine selling price, etc.). There are 1599
observations with 11 input variables and 1 output variable. Several of the attributes
may be correlated, so it makes sense to apply some form of variable selection.
Input Variables (based on physicochemical tests):
1 - fixed acidity (min: 4.60, max: 15.90)
2 - volatile acidity (min: 0.12, max: 1.58)
3 - citric acid (min: 0.00, max: 1.00)
4 - residual sugar (min: 0.90, max: 15.50)
5 - chlorides (min: 0.01, max: 0.61)
6 - free sulfur dioxide (min: 1.00, max: 72.00)
7 - total sulfur dioxide (min: 6.00, max: 289.00)
8 - density (min: 0.99, max: 1.00)
9 - pH (min: 2.74, max: 4.01)
10 - sulphates (min: 0.33, max: 2.00)
11 - alcohol (min: 8.40, max: 14.90)
Output Variable (based on sensory data):
12 - quality (min: 3.00, max: 8.00), score between 0 and 10
Red Wine Data Set:
I applied a simple Validation Set Approach to randomly divide the available
observations into two parts: a 60% training set and a 40% validation set. The model
in Figure 1 is fit on the training set, and the fitted model is used to predict the
responses for the observations in the validation set. The resulting validation set
error rate, assessed using the Mean Squared Error (MSE), provides an estimate of the
test error rate.
> mean((quality-predict(lm.fit,mydata))[-train]^2) # estimated test MSE on the validation set
[1] 0.4374297
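
For reference, here is a minimal sketch of the split and fit behind this estimate, mirroring the appendix code (mydata is the full red wine data frame):

set.seed(1)
train = sample(1599, 960) # indices of the ~60% training split
lm.fit = lm(quality~., data=mydata, subset=train)
mean((mydata$quality - predict(lm.fit, mydata))[-train]^2) # validation set MSE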
To see whether there is a relationship between quality (the response variable) and
the 11 physicochemical attributes (the predictor variables), we test the null
hypothesis that all slope coefficients are zero. The test is performed by computing
the F-statistic for the full model.
Null hypothesis H0: β1 = β2 = ... = β11 = 0
Alternative hypothesis Ha: at least one βj is non-zero
1. Multiple Linear Regression Model Using All 11 Predictor Variables
trainfit<- lm(quality~., data=training)
summary(trainfit) #Output of Regression coefficients for all Predictors
Figure 1
The F-statistic for this full model (Figure 1) is 50.26, much larger than 1, so we
can reject the null hypothesis H0 and conclude that there is a relationship between
quality (Y) and at least one of the predictors. Given the relatively large sample
size (the training set holds roughly 60% of the 1,599 observations), an F-statistic
of 50.26 provides strong evidence against H0.
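
As a quick check, the F-statistic and its p-value can be extracted directly from the fitted model (a sketch, assuming the trainfit object fit above):

fs = summary(trainfit)$fstatistic
pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail=FALSE) # p-value for the overall F-test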
2. Diagnostics Plots to Verify Regression Assumptions
Figure 2a
Figure 2b
The “Residuals vs. Fitted” plot (Figure 2a) checks for equal variance. The red line
is a smooth fit to the residuals, intended to make any patterns easier to identify.
Given the strange parallel-line pattern in the plot (a consequence of the quality
scores being discrete integers), it is difficult to confirm that the equal-variance
assumption holds.
The “Normal Q-Q” plot (Figure 2b), which plots the quantiles of the residual
distribution against the quantiles of the standard normal distribution, shows no
systematic deviations from normality, since the bulk of the points lie on a straight
line.
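
Both panels can be reproduced individually from the fitted model; a quick sketch:

plot(trainfit, which=1) # Residuals vs. Fitted
plot(trainfit, which=2) # Normal Q-Q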
3. Correlation Plots
We can explore the numerical predictors and response (quality) by creating a
correlation table (Figure 3a/3b) between quality and all physiochemical predictors.
Figure 3a
Correlation Plots (numerical values)
Figure 3b
Correlation Plot (coded by size/color of circles instead of numerical values)
Alcohol appears to be the strongest single predictor of quality, with a correlation
of 0.49.
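
A one-line sketch that ranks all variables by their correlation with quality (quality itself, at 1.00, tops the list):

round(sort(cor(training)[, "quality"], decreasing=TRUE), 2)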
4. Variance Inflation Factor (VIF)
In addition to inspecting the correlation matrix, I computed the VIF (Figure 4) for
every predictor. Some collinearity among predictors is typical, so I am concerned
with values between 5 and 10, which can indicate a problematic amount of
collinearity.
Figure 4
Both the correlation matrix and the VIF calculations confirm a multicollinearity
problem involving fixed.acidity and two (2) other predictors: citric.acid and
density.
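
The VIF definition can be checked by hand for a single predictor: regress it on the other predictors and compute 1/(1 - R^2). A sketch for fixed.acidity:

require(car)
vif(trainfit) # VIFs for all predictors (Figure 4)
r2 = summary(lm(fixed.acidity~. - quality, data=training))$r.squared
1/(1 - r2) # should match vif(trainfit)["fixed.acidity"]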
5. Multiple Linear Regression Model Without fixed.acidity Predictor Variable
trainfit1= update(trainfit, ~.-fixed.acidity, data = training)
summary(trainfit1)
I updated the MLR model (Figure 5) to remove fixed.acidity, and the resulting model
fits the data better (larger F-statistic and lower Residual Standard Error).
Figure 5
Although the MLR model without fixed.acidity fits the data better, several predictor
variables still have relatively large individual p-values, such as residual.sugar,
density, and citric.acid.
6. Principal Component Analysis
The response variable quality appears to be related to only a subset of the
predictors. We can use PCA, a dimensionality-reduction method, for exploratory
analysis and to produce derived variables (principal components) for use in a
supervised learning method.
PCA reduces the set of numerical variables by removing the overlap of information
between them. The new variables are linear combinations (weighted averages) of the
original variables, and these linear combinations are uncorrelated, potentially
correcting the multicollinearity problem.
# Center variables to mean 0 and scale to standard deviation 1
> pr.out = prcomp(mydata.pca, scale = TRUE)
The rotation matrix provides the principal component loadings: each of its columns
is a loading vector, which is the quantity of interest here.
> pr.out$rotation
Figure 6
Figure 6a
Figure 6b
Plot of First Two Principal Components
We infer that the first principal component corresponds mainly to
total.sulfur.dioxide and alcohol; similarly, the second component corresponds
mainly to pH and fixed.acidity.
Figure 6a shows that the first principal component explains 28.2% of the variance,
the second 17.5%, the third 14.1%, and so on.
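
These percentages come directly from the component standard deviations; a sketch, assuming pr.out from above:

pve = pr.out$sdev^2/sum(pr.out$sdev^2) # proportion of variance explained
round(pve[1:3], 3) # approximately 0.282, 0.175, 0.141 (Figure 6a)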
To choose the number of components for a principal component regression model, we
look at a scree plot, which displays the variance explained by each component in
descending order and helps assess which components explain most of the variability
in the data.
Figure 6c
Scree Plot of PVE
Figure 6d
Scree Plot of Cumulative PVE (Sum = 1)
It takes the first nine principal components to reach a cumulative explained
variance of close to 98%. In this case, PCA did not succeed in reducing the number
of predictors from 11 to substantially fewer without compromising the proportion of
explained variance. After checking both the scree plot of PVE (Figure 6c) and the
cumulative PVE (Figure 6d), we see that PCA does not yield a small enough number of
principal components to summarize the data well.
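
The cumulative PVE makes this concrete; a one-line sketch, using pve from above, that finds the smallest number of components reaching a 98% threshold:

which(cumsum(pve) >= 0.98)[1]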
7. Forward and Backward Selection MLR Model
We know from the previous exploration of the data set that the response variable,
quality is related to only a subset of the predictors. In order to determine which
predictors are associated with the response, we need to fit a single model involving only
those predictors.
We cannot consider all 2^p models with p = 11: that is 2,048 models! Instead, we can
search for the best model with stepwise variable selection. The two classical
approaches, forward selection and backward selection, offer an efficient, automated
way to choose a smaller set of models to consider.
#Forward Selection Method
null<-lm(quality~1, data=training)
forward<-step(null, scope=list(upper=trainfit), direction='forward')
summary(forward)
Figure 7a
#Backward Selection Method
backward<-step(trainfit, direction='backward')
summary(backward)
Figure 7b
The forward selection method begins with the null model and, at each step, adds the
predictor that yields the lowest RSS, whereas the backward selection method runs the
search in the opposite direction, starting from the full model and eliminating the
least useful variable at each step (R's step() function compares candidate models by
AIC). Although the two techniques differ in their search strategy, it is interesting
that in this instance both methods arrive at the same MLR model (Figure 7a and
Figure 7b).
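
A quick sketch confirming that the two searches selected the same set of terms:

setequal(names(coef(forward)), names(coef(backward))) # TRUE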
8. Diagnostics Plots to Verify Regression Assumptions
I plotted the “Residuals vs. Fitted” and “Normal Q-Q plot” to search for any patterns
that might violate the regression assumptions (Figure 8a and Figure 8b).
Figure 8a
Figure 8b
This is very similar to the full model's “Residuals vs. Fitted” plot. The plot for
the forward/backward model (Figure 8a) shows the same odd pattern in the residuals,
so we cannot rule out a violation of the equal (constant) variance assumption. The
“Normal Q-Q plot” (Figure 8b) shows no systematic deviations from normality, since
the bulk of the points lie on a relatively straight line. Another important residual
plot is the histogram of residuals (Figure 8c), which shows a roughly normal,
bell-shaped distribution.
Figure 8c
Figure 8d
We can also visualize the regression results with the coefficient plot (Figure 8d),
where each coefficient is plotted as a point, a thick blue line represents a
one-standard-error confidence interval, and a vertical gray dotted line marks 0. If
the confidence interval does not contain 0, the coefficient is statistically
significant. Here we can see that alcohol and sulphates have the largest effects on
wine quality.
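
The same information is available numerically through confidence intervals for the coefficients; intervals that exclude 0 correspond to significant terms. A sketch:

round(confint(backward), 3) # 95% confidence intervals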
9. Fitting the Backward Selection Model on the Validation Data Set
> PredBack$residual.scale
[1] 0.6540796
Figure 9
> AIC(backward)
[1] 1917.249
> AIC(trainfit)
[1] 1923.308
> AIC(trainfit1)
[1] 1921.749
Another metric we can use to compare the models (trainfit, the full model;
trainfit1, the full model minus fixed.acidity; and the model obtained by backward
selection) is the Akaike Information Criterion (AIC). The AIC of the backward model,
1917.249, is smaller than those of trainfit and trainfit1 (1923.308 and 1921.749),
so the backward selection model fits the data best.
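
Passing all three fitted models to AIC() at once returns a small comparison table; a sketch:

AIC(trainfit, trainfit1, backward) # data frame of df and AIC; smallest AIC is preferred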
10. Leave-One-Out-Cross-Validation Method (LOOCV)
The Validation Set Approach from part 1 has its drawbacks. LOOCV instead fits the
model on n - 1 training observations and predicts the single excluded observation.
That single squared error is approximately unbiased for the test error but highly
variable, so LOOCV repeats the procedure for each of the n observations and averages
the results.
A major advantage of LOOCV over the validation set approach is that it has far less
bias: the model is repeatedly trained on sets of n - 1 observations, almost the
entire data set, in contrast to the validation set approach, which here uses only
60% of the original data to train the model.
LOOCV also tends not to overestimate the test error rate as much as the validation
set approach does. And unlike the validation set approach, whose results vary with
the random split, performing LOOCV multiple times always yields the same result,
because there is no randomness in how observations are held out.
#LOOCV estimate of the test MSE is 0.4248
> cv.err$delta
[1] 0.4247753 0.4247728
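
The absence of randomness is easy to see in a hand-rolled version of LOOCV; a minimal sketch (slow but transparent, assuming mydata is loaded):

n = nrow(mydata)
errs = sapply(seq_len(n), function(i) {
  fit = lm(quality~., data=mydata[-i, ]) # fit on n - 1 observations
  (mydata$quality[i] - predict(fit, mydata[i, ]))^2 # squared error on the held-out row
})
mean(errs) # matches cv.err$delta[1] up to rounding, with no randomness involved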
11. k-Fold Cross-Validation
I used the k-fold cross validation method as an alternative to LOOCV. The approach
randomly divides the set of observations into k groups, or folds, of approximately equal
size. The first fold is treated as the validation set and the model is fitted on the
remaining k – 1 folds. The MSE is computed on observations in the held-out fold that is
treated as the validation set. The procedure is repeated k times, each time with a
different group of observations treated as the validation set.
Although there is variability in the k-fold test error estimates, it is typically
lower than the variability that results from the validation set approach.
#k-Fold CV estimate of the test MSE is 0.4241
cv.error.10
[1] 0.42409
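
Unlike LOOCV, repeating 10-fold CV gives slightly different answers each time because the fold assignment is random; a sketch (glm.fit as defined in the appendix):

require(boot)
glm.fit = glm(quality~., data=mydata)
replicate(5, cv.glm(mydata, glm.fit, K=10)$delta[1]) # five slightly different estimates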
12. Bias-Variance Trade-off for k-Fold Cross Validation
On the basis of bias reduction, LOOCV is preferred over the k-Fold CV method since it
takes (k - 1) n / k observations to train the model in contrast to the LOOCV method that
trains the model on n - 1 observations, which is almost the entire data set. However,
the LOOCV method has a higher variance than k-Fold CV with k < n due to the high
17
correlation of averaging n fitted models whereas the average of k fitted models is
somewhat less correlated. This results in test error rates from LOOCV tendency to have
a higher variance than k-fold CV.
13. Conclusion
Backward Selection Multiple Linear Regression Model:
Quality = 4.1747748 - 0.8826331*volatile.acidity - 1.9781343*chlorides +
0.0041786*free.sulfur.dioxide - 0.0037963*total.sulfur.dioxide - 0.5251147*pH +
0.9535727*sulphates + 0.3182497*alcohol
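
The fitted equation can be applied with predict(); a sketch scoring one hypothetical wine (the input values below are illustrative, not taken from the data set):

new_wine = data.frame(volatile.acidity=0.5, chlorides=0.08,
                      free.sulfur.dioxide=15, total.sulfur.dioxide=46,
                      pH=3.3, sulphates=0.65, alcohol=10.5)
predict(backward, newdata=new_wine) # predicted quality score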
It is interesting to note that backward selection retained only seven of the
original eleven predictors. The model shows that sulphates and alcohol have the
largest effects in predicting higher quality ratings of red wine in this data set,
whereas higher physicochemical measures of chlorides and volatile.acidity have
the largest effects in predicting lower quality ratings.
With respect to the adjusted R2, Residual Standard Error, and AIC measures, we can
determine that the backward selection model fits the data best. After examining many
diagnostic graphs and plots, we can also conclude that linear regression is a
compelling method for analyzing the effects of physicochemical tests on the quality
ratings of Vinho Verde red wine.
Appendix of R Codes:
mydata = read.csv("winequality-red.csv",header=TRUE)
mydata.pca = read.csv("winequality-red-pca.csv",header=TRUE)
row <- nrow(mydata)
set.seed(12345)
trainindex <- sample(row, 0.6*row, replace=FALSE)
training <- mydata[trainindex,]
validation <- mydata[-trainindex,]
#ValidationSetApproach
require(ISLR)
set.seed(1)
train=sample(1599,960)
lm.fit=lm(quality~.,data=mydata,subset=train)
predict(lm.fit)
rating.predict<-predict(lm.fit)
plot(rating.predict)
attach(mydata)
mean((quality-predict(lm.fit,mydata))[-train]^2) # estimated test MSE on the validation set
#MLR with all 11 predictors (full model); also used later in the forward selection method
trainfit<- lm(quality~., data=training)
#Output of Regression coefficients for all Predictors
summary(trainfit)
#Model fit
plot(trainfit)
#fitsummary = summary(trainfit)
#fitsummary$r.squared
#fitsummary$adj.r.squared
#fitsummary$sigma #RSE
AIC(trainfit)
BIC(trainfit)
#Principal Component Analysis (PCA)
physiochemattributes=row.names(mydata.pca)
physiochemattributes
names(mydata.pca)
#Means/Variances of variables
apply(mydata.pca, 2, mean)
apply(mydata.pca, 2, var)
#Center variables to mean zero and scale to standard deviation one
pr.out = prcomp(mydata.pca, scale = TRUE)
pr.out
summary(pr.out)
dim(pr.out$x)
pr.out$center
pr.out$scale
pr.out$rotation
#Plot first two principal component
biplot(pr.out, scale=0)
pr.out$rotation = -pr.out$rotation
pr.out$x = -pr.out$x
biplot(pr.out, scale=0)
#Check standard deviation and variance explained
pr.out$sdev
pr.var = pr.out$sdev^2
pr.var
#Compute proportion of variance explained PVE
pve= pr.var/sum(pr.var)
pve
#Plot PVE and Cumulative PVE
plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1), type='b', col="blue")
plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1), type='b', col="blue")
#PCA Regression
require(pls)
set.seed(1000)
train <- mydata[1:1279, ]
y_test <- mydata[1280:1599, "quality"] # held-out response (quality is the 12th column)
test <- mydata[1280:1599, setdiff(names(mydata), "quality")] # held-out predictors
pcr_model <- pcr(quality~., data = train,scale =TRUE, validation = "CV")
pcr_model
summary(pcr_model)
# Plot the root mean squared error
validationplot(pcr_model)
predplot(pcr_model)
coefplot(pcr_model)
# Plot the cross validation MSE
validationplot(pcr_model, val.type="MSEP")
pcr_pred <- predict(pcr_model, newdata = test, ncomp = 9) # predict with the 9 components identified above
summary(pcr_model)
mean((y_test - pcr_pred)^2) # estimated test MSE for the PCR model
require(car)
vif(trainfit)
#remove fixed acidity from model because of multicollinearity
trainfit1= update(trainfit, ~.-fixed.acidity, data = training)
summary(trainfit1)
plot(trainfit1)
anova(trainfit,trainfit1, backward)
#coefficients(trainfit)
#fitted(trainfit)
#anova(trainfit)
#residuals(trainfit)
#vcov(trainfit)
#histogram with 20 bars
Residual<-residuals(trainfit1)
hist(Residual,breaks=20)
#Q-Q plot with residuals
qqnorm(Residual, ylab="Standardized Residuals", xlab="Normal Scores", main="Residuals")
qqline(Residual)
layout(matrix(c(1,2,3,4),2,2))
plot(trainfit)
#Correlation and Scatter Plots for full model (trainfit)
require(corrplot)
M = cor(training)
corrplot(M, method = "number")
corrplot(M, method = "circle")
corrplot.mixed(M)
#Scatterplot
plot(training)
#Forward Selection Method
null<-lm(quality~1, data=training)
forward<-step(null, scope=list(upper=trainfit), direction='forward')
summary(forward)
plot(forward)
#histogram with 20 bars
Residual1<-residuals(forward)
hist(Residual1,breaks=20)
#Backward Selection Method; yields the same model as the Forward Selection Method
backward<-step(trainfit, direction='backward')
summary(backward)
AIC(backward)
AIC(trainfit)
AIC(trainfit1)
#lm.fit= lm(quality~volatile.acidity + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates + alcohol, data=training)
#attach(training)
#Compute MSE on validation set
#########backward$delta
#Predicting validation data set with model with all predictors
PredTrainfit<-predict(trainfit, validation, se.fit=TRUE)
PredTrainfit
PredTrainfit$residual.scale
#Predicting validation data set with the model that does not have fixed.acidity
PredTrainfit1<-predict(trainfit1, validation, se.fit=TRUE)
PredTrainfit1
PredTrainfit1$residual.scale
#Predicting the validation data set with the backward selection model
PredBack<-predict(backward, validation, se.fit=TRUE)
PredBack
PredBack$residual.scale
# Comparing the validation residual standard deviations, the backward selection model is better than the base model
#Coefficient Plot of Backwards Selection Model
require(coefplot)
coefplot(backward)
#LOOCV
require(boot)
glm.fit=glm(quality~.,data=mydata)
cv.err=cv.glm(mydata,glm.fit)
cv.err$delta
cv.error=rep(0,5)
for (i in 1:5){
glm.fit=glm(quality~.,data=mydata)
cv.error[i]=cv.glm(mydata,glm.fit)$delta[1] # each iteration fits the same model, so all five estimates match
}
cv.error
#LOOCV estimate of the test MSE is 0.4248
coef(glm.fit)
#K-fold Cross Validation
set.seed(17)
cv.error.10=rep(0,10)
for (i in 1:10) {
glm.fit=glm(quality~., data=mydata)
cv.error.10[i]=cv.glm(mydata,glm.fit,K=10)$delta[2] # index by i so each repetition's estimate is kept
}
cv.error.10
#k-fold CV estimate of the test MSE is about 0.424
References:
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical
learning: With applications in R. New York: Springer.
Lander, J. P. (2014). R for Everyone: Advanced Analytics and Graphics. Upper Saddle River,
NJ: Addison-Wesley.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine
preferences by data mining from physicochemical properties. Decision Support
Systems, 47(4), 547-553. http://dx.doi.org/10.1016/j.dss.2009.05.016
Pre-press (pdf): http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
bib: http://www3.dsi.uminho.pt/pcortez/dss09.bib