Project Report

Project Title:
Data Analysis on Quality of Wine

Project Team:
1. Saurabh Choudhary: sxc143430
2. Vijay Ramanathan: vxr141530
3. Chaitanya Vejendla: cxv140530
4. Siri Venkat Vemuri: sxv141130

The Wine Dataset:

Two datasets (Red Wine and White Wine), each consisting of the following columns:
Input variables
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol
Output variable
12. quality (score between 0 and 10)

Regression Model:

We fit linear models with lm, regressing the response variable (quality) on different subsets of the predictor variables. Each fit yields a set of coefficients for the data. Accuracy is calculated by rounding each predicted value to the nearest integer and comparing it with the observed quality score, as follows:

White Wine:
lm.fit=lm(quality~.-density-total.sulfur.dioxide,data=white)
mean((round(predict(lm.fit)))==white$quality)
[1] 0.5165374

Red Wine:
lm.fit_red=lm(quality~.-fixed.acidity-citric.acid,data=red)
mean((round(predict(lm.fit_red)))==red$quality)
[1] 0.5959975
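
The subset comparison described above can be sketched on synthetic data. This is only an illustration: the data frame toy, its columns, and the helper rounded_accuracy are hypothetical stand-ins, not the wine data or the project's actual script. Like the calls above, it scores each fit on the same rows it was trained on.

```r
# Synthetic stand-in for a wine table; all column names are hypothetical.
set.seed(1)
n <- 200
toy <- data.frame(
  alcohol = rnorm(n, mean = 10.5, sd = 1),
  sulphates = rnorm(n, mean = 0.6, sd = 0.1),
  noise = rnorm(n)
)
# Integer quality score driven mostly by alcohol, clipped to the 0-10 scale.
toy$quality <- pmin(10, pmax(0, round(6 + 0.8 * (toy$alcohol - 10.5) + rnorm(n, sd = 0.5))))

# Fraction of rows whose rounded fitted value equals the observed integer score.
rounded_accuracy <- function(fit, observed) {
  mean(round(predict(fit)) == observed)
}

# Two candidate predictor subsets: everything, and everything except `noise`.
fit_all <- lm(quality ~ ., data = toy)
fit_drop <- lm(quality ~ . - noise, data = toy)

acc_all <- rounded_accuracy(fit_all, toy$quality)
acc_drop <- rounded_accuracy(fit_drop, toy$quality)
print(c(all = acc_all, without_noise = acc_drop))
```

Repeating the last step for each candidate formula and keeping the highest value is one way to arrive at a "best subset" figure like the ones reported here.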

Residual Plots:

Figure 1: White Wine

Figure 2: Red Wine

Conclusions:

- For white wine, the best accuracy of 51.65% is obtained when all predictors except density and total sulfur dioxide are considered.

- For red wine, the best accuracy of 59.60% is obtained when all predictors except fixed acidity and citric acid are considered.

- The physicochemical properties to consider when predicting quality differ between white and red wines.

Verifying Model with LDA:

1. White Wine: Accuracy: 49.8%

2. Red Wine: Accuracy: 56.1%
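
For illustration, an LDA accuracy of this kind could be computed roughly as follows. This is a sketch under assumptions: it uses lda from the MASS package (the report does not name the function used), synthetic data with hypothetical columns, and an arbitrary train/test split.

```r
library(MASS)  # provides lda(); a recommended package bundled with most R installations

set.seed(2)
n <- 300
# Hypothetical two-predictor data with three integer quality classes.
quality <- sample(5:7, n, replace = TRUE)
toy <- data.frame(
  quality = factor(quality),
  alcohol = rnorm(n, mean = 9 + 0.7 * (quality - 5), sd = 0.8),
  volatile.acidity = rnorm(n, mean = 0.6 - 0.1 * (quality - 5), sd = 0.15)
)

# Hold out 100 rows for testing (the split size is an assumption).
test_idx <- sample(n, 100)
lda_fit <- lda(quality ~ ., data = toy[-test_idx, ])
lda_pred <- predict(lda_fit, newdata = toy[test_idx, ])$class

# Share of held-out rows whose predicted class matches the true class.
lda_acc <- mean(as.character(lda_pred) == as.character(toy$quality[test_idx]))
print(lda_acc)
```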

KNN Model:

Model:
knn.pred=knn(train.x,test.x,train.y,k=3)

The best accuracy was observed at k=5 for the red wine data and at k=1 for the white wine data.

K-Value   Red Accuracy (%)   White Accuracy (%)
1         49.5               58.96
3         47.9               49.18
5         52.16              48.15
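
A table like the one above can be produced by looping the knn call over several values of k. The sketch below runs on synthetic data with hypothetical columns; the split size and the scaling step are assumptions, since the report does not say how train.x and test.x were prepared.

```r
library(class)  # provides knn()

set.seed(3)
n <- 300
# Hypothetical predictors whose means shift with the quality class.
quality <- sample(5:7, n, replace = TRUE)
x <- cbind(
  alcohol = rnorm(n, mean = 9 + 0.7 * (quality - 5), sd = 0.8),
  sulphates = rnorm(n, mean = 0.5 + 0.1 * (quality - 5), sd = 0.1)
)
# kNN is distance based, so predictors are put on a common scale (an assumption).
x <- scale(x)

test_idx <- sample(n, 100)
train.x <- x[-test_idx, ]
test.x <- x[test_idx, ]
train.y <- factor(quality[-test_idx])
test.y <- quality[test_idx]

# Test-set accuracy for each candidate k.
k_values <- c(1, 3, 5)
knn_acc <- sapply(k_values, function(k) {
  knn.pred <- knn(train.x, test.x, train.y, k = k)
  mean(as.character(knn.pred) == as.character(test.y))
})
names(knn_acc) <- paste0("k=", k_values)
print(round(100 * knn_acc, 2))
```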

Decision Tree:

Accuracies of 54.21% and 52.36% were obtained for the red and white data sets, respectively.

Random Forests:

After getting poor accuracies from the models above, we grouped the quality scores into three classes before running the random forest algorithm:

Good wine (quality score above 6)

Normal wine (quality score equal to 6)

Bad wine (quality score below 6)
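
The three-way grouping can be sketched as below. The column name taste matches the formula used in the random forest calls; the example scores are made up for illustration.

```r
# Made-up quality scores, for illustration only.
quality <- c(3, 5, 6, 6, 7, 8, 4, 9)

# Below 6 -> bad, exactly 6 -> normal, above 6 -> good.
taste <- ifelse(quality < 6, "bad", ifelse(quality == 6, "normal", "good"))
taste <- factor(taste, levels = c("bad", "good", "normal"))

print(table(taste))
```

Storing taste as a factor matters here: randomForest fits a classifier when the response is a factor and a regression when it is numeric.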

White Wine:

randomForest(formula = taste ~ . - quality, data = white_train)

Predictor   bad   good   normal
bad         479   10     127
good        17    249    91
normal      171   152    664

Accuracy: 71.02%

Red Wine:

randomForest(formula = taste ~ . - quality, data = red_train)

Predictor   bad   good   normal
bad         0     0      0
good        5     274    56
normal      15    60     230

Accuracy: 78.75%
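
The reported accuracies follow from the confusion matrices: correctly classified wines lie on the diagonal, so accuracy is the diagonal sum divided by the total count. A short check, with the counts copied from the tables above (the helper name cm_accuracy is ours):

```r
classes <- c("bad", "good", "normal")

# Counts copied row by row from the white and red wine tables.
white_cm <- matrix(c(479, 10, 127,
                     17, 249, 91,
                     171, 152, 664),
                   nrow = 3, byrow = TRUE, dimnames = list(classes, classes))
red_cm <- matrix(c(0, 0, 0,
                   5, 274, 56,
                   15, 60, 230),
                 nrow = 3, byrow = TRUE, dimnames = list(classes, classes))

# Accuracy = correctly classified (diagonal) / all classified.
cm_accuracy <- function(cm) sum(diag(cm)) / sum(cm)

print(round(100 * cm_accuracy(white_cm), 2))  # 71.02
print(round(100 * cm_accuracy(red_cm), 2))    # 78.75
```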

Conclusion:

1. Grouping the quality scores into broader segments gives better accuracy.

2. The low accuracies hint that factors other than the content of the wine may affect its quality and are not included in the dataset (e.g. age of the wine, manufacturing process, etc.).
