1. Project Title:
Data analysis on quality of Wine
Project Team:
1. Saurabh Choudhary: sxc143430
2. Vijay Ramanathan: vxr141530
3. Chaitanya Vejendla: cxv140530
4. Siri Venkat Vemuri: sxv141130
2. The Wine Dataset:
Two datasets (Red Wine and White Wine), each consisting of the following columns:
Input variables:
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol
Output variable:
12. quality (score between 0 and 10)
Regression Model:
We fitted linear models with lm, regressing the response variable (quality) on different subsets of the predictor variables. Each fit estimates one coefficient per retained predictor. Accuracy is calculated by rounding each predicted value to the nearest integer and comparing it with the actual quality score, as follows:
White Wine:
lm.fit=lm(quality~.-density-total.sulfur.dioxide,data=white)
mean((round(predict(lm.fit)))==white$quality)
[1] 0.5165374
Red Wine:
lm.fit_red=lm(quality~.-fixed.acidity-citric.acid,data=red)
mean((round(predict(lm.fit_red)))==red$quality)
[1] 0.5959975
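The rounding-based accuracy can be tried without the wine files. The following is a minimal sketch on simulated data, not the original analysis: the data frame `toy`, its values, and the helper `rounded_accuracy` are hypothetical names introduced here for illustration.

```r
# Simulated stand-in for a wine table (hypothetical data, not the real dataset).
set.seed(1)
n <- 200
toy <- data.frame(
  alcohol   = rnorm(n, mean = 10.5, sd = 1),
  sulphates = rnorm(n, mean = 0.6, sd = 0.15)
)
# Integer "quality" score loosely driven by alcohol, plus noise.
toy$quality <- round(5.5 + 0.4 * (toy$alcohol - 10.5) + rnorm(n, sd = 0.7))

# Share of fitted values that equal the actual score after rounding.
rounded_accuracy <- function(fit, actual) {
  mean(round(predict(fit)) == actual)
}

fit <- lm(quality ~ ., data = toy)
acc <- rounded_accuracy(fit, toy$quality)
print(acc)
```

As in the calls above, `predict` without `newdata` returns fitted values for the training rows, so the figure is an in-sample accuracy.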
Residual Plots
Figure 1: White Wine
Figure 2: Red Wine
Conclusions:
- For white wine, the best accuracy of 51.65% is obtained when all predictors except density and total sulfur dioxide are considered.
- For red wine, the best accuracy of 59.60% is obtained when all predictors except fixed acidity and citric acid are considered.
- The physicochemical properties to be considered when predicting quality differ between white and red wines.
3. Verifying Model with LDA:
1. White Wine: Accuracy: 49.8%
2. Red Wine: Accuracy: 56.1%
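The report does not show how the LDA check was run. One possible way to compute such an accuracy is sketched below, assuming `lda` from the MASS package, quality treated as a class label, and in-sample evaluation; the simulated data frame `toy` and all of its values are hypothetical.

```r
library(MASS)  # provides lda()

# Simulated stand-in data (hypothetical, not the wine dataset).
set.seed(2)
n <- 300
toy <- data.frame(
  alcohol = rnorm(n, mean = 10.5, sd = 1),
  pH      = rnorm(n, mean = 3.2, sd = 0.15)
)
score <- round(5.5 + 0.8 * (toy$alcohol - 10.5) + rnorm(n, sd = 0.6))
# Clamp to a few classes so that every class has enough rows.
toy$quality <- factor(pmin(pmax(score, 4), 7))

# Fit LDA with quality as the class and all other columns as predictors.
lda.fit <- lda(quality ~ ., data = toy)

# Predicted class for each training row, compared with the actual class.
lda.class <- predict(lda.fit)$class
lda.acc <- mean(as.character(lda.class) == as.character(toy$quality))
print(lda.acc)
```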
KNN Model:
Model:
knn.pred=knn(train.x,test.x,train.y,k=3)
The best accuracy was observed for k=5 in the red wine data and for k=1 in the white wine data.
K-Value   Red Accuracy (%)   White Accuracy (%)
1         49.5               58.96
3         47.9               49.18
5         52.16              48.15
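A comparison across k values like the one tabulated above can be produced with a short loop. The sketch below assumes `knn` from the class package and reuses the names `train.x`, `test.x`, and `train.y` from the call above; the simulated data, the 70/30 split, and the names `test.y`, `k.values`, and `knn.acc` are assumptions made for illustration.

```r
library(class)  # provides knn()

# Simulated stand-in data (hypothetical, not the wine dataset).
set.seed(3)
n <- 300
x <- data.frame(
  alcohol = rnorm(n, mean = 10.5, sd = 1),
  pH      = rnorm(n, mean = 3.2, sd = 0.15)
)
y <- factor(ifelse(x$alcohol + rnorm(n, sd = 0.5) > 10.5, "high", "low"))

# Hypothetical 70/30 train/test split.
train.idx <- sample(n, size = 210)
train.x <- x[train.idx, ]
test.x  <- x[-train.idx, ]
train.y <- y[train.idx]
test.y  <- y[-train.idx]

# Test-set accuracy for each candidate k.
k.values <- c(1, 3, 5)
knn.acc <- sapply(k.values, function(k) {
  knn.pred <- knn(train.x, test.x, train.y, k = k)
  mean(knn.pred == test.y)
})
names(knn.acc) <- k.values
print(knn.acc)
```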
Decision Tree:
Accuracies of 54.21% and 52.36% were obtained for the red and white datasets, respectively.
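The report does not say which tree implementation was used. As one possibility, a classification tree can be fitted and scored as sketched below, assuming the rpart package; the simulated data, the split, and the names `tree.fit`, `tree.pred`, and `tree.acc` are hypothetical.

```r
library(rpart)  # provides rpart(); an assumption, the report names no package

# Simulated stand-in data (hypothetical, not the wine dataset).
set.seed(4)
n <- 300
toy <- data.frame(
  alcohol = rnorm(n, mean = 10.5, sd = 1),
  pH      = rnorm(n, mean = 3.2, sd = 0.15)
)
toy$quality <- factor(ifelse(toy$alcohol + rnorm(n, sd = 0.5) > 10.5, "high", "low"))

# Hypothetical 70/30 train/test split.
train.idx <- sample(n, size = 210)

# Fit a classification tree on the training rows.
tree.fit <- rpart(quality ~ ., data = toy[train.idx, ], method = "class")

# Predict classes for the held-out rows and compute accuracy.
tree.pred <- predict(tree.fit, newdata = toy[-train.idx, ], type = "class")
tree.acc <- mean(tree.pred == toy$quality[-train.idx])
print(tree.acc)
```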
4. RANDOM FORESTS:
Because the earlier models gave poor accuracies, we divided the wines into 3 groups before running the random forest algorithm:
Good wine (with quality score above 6)
Normal wine (with quality score equal to 6)
Bad wine (with quality score below 6)
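The three-way grouping can be expressed in a few lines of base R. This is a minimal sketch: the column name `taste` follows the model formulas in this section, while the `quality` vector below is a hypothetical example rather than data from the wine files.

```r
# Hypothetical quality scores; the real data has one score per wine.
quality <- c(3, 5, 6, 6, 7, 8, 4, 6, 9)

# Below 6 is "bad", exactly 6 is "normal", above 6 is "good".
taste <- ifelse(quality < 6, "bad",
                ifelse(quality == 6, "normal", "good"))
taste <- factor(taste, levels = c("bad", "good", "normal"))

print(table(taste))
```

With the real data, the same expression would be applied to the quality column of each data frame before fitting.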
White Wine
randomForest(formula = taste ~ . - quality, data = white_train)
Predictor   bad   good   normal
bad         479   10     127
good        17    249    91
normal      171   152    664
Accuracy: 71.02%
Red Wine
randomForest(formula = taste ~ . - quality, data = red_train)
Predictor   bad   good   normal
bad         0     0      0
good        5     274    56
normal      15    60     230
Accuracy: 78.75%
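The reported accuracies follow from the two tables above: the diagonal entries (correct predictions) are summed and divided by the total count. The sketch below re-enters the counts by hand; the names `white.cm`, `red.cm`, and `cm_accuracy` are hypothetical helpers introduced here.

```r
labels <- c("bad", "good", "normal")

# Counts copied from the white wine table, row by row.
white.cm <- matrix(c(479,  10, 127,
                      17, 249,  91,
                     171, 152, 664),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(labels, labels))

# Counts copied from the red wine table, row by row.
red.cm <- matrix(c( 0,   0,   0,
                    5, 274,  56,
                   15,  60, 230),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(labels, labels))

# Accuracy = correctly classified wines / all wines.
cm_accuracy <- function(cm) sum(diag(cm)) / sum(cm)

white.acc <- round(100 * cm_accuracy(white.cm), 2)  # 71.02
red.acc   <- round(100 * cm_accuracy(red.cm), 2)    # 78.75
print(c(white = white.acc, red = red.acc))
```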
Conclusion:
1. Grouping the wines into quality segments gives better accuracy.
2. The low accuracies hint that external factors beyond the contents of the wine, which are not included in the dataset, might affect its quality (e.g., age of the wine, manufacturing process, etc.).