Abstract
This study aims to highlight the differences between two classification algorithms. Both
algorithms are applied to a dataset of white Portuguese wines and are used to classify
human wine taste preferences based on physicochemical properties. The algorithms studied
here are J48 (C4.5) and AdaBoostM1(J48). AdaBoostM1(J48) starts from the basic J48
algorithm and attempts to improve it, which makes it interesting to see how it performs on a
real-world application. Since AdaBoostM1(J48) adds complexity, a tradeoff should be made
between its added value and its added cost. This study provides clarity on its added value in
classifying white Portuguese wine.
1. Introduction
The classification task is a form of supervised learning where a dataset, labeled with the right
class, is first specified. A classifier is trained on this set with the goal of generalizing its model
to other data. Different classifiers (learning algorithms) exist for this task, each with its
advantages and disadvantages. As computational power has increased strongly over the
years, increasingly complex and accurate algorithms have been developed, and much
attention therefore typically goes to the accuracy of an algorithm. Currently, however, many
smart mobile devices and services are being developed, and for these less powerful, mobile
applications the context is different: power consumption and efficiency become important.
The algorithms under study here are J48 and AdaBoostM1(J48). J48 is a relatively simple tree-
building algorithm that is very popular thanks to its good performance and its understandable
models. AdaBoost attempts to boost the performance of an underlying algorithm, which in
this case is J48. This boosting comes at the cost of increased complexity. For the average
dataset, AdaBoost has shown better performance at the cost of computational power
(Quinlan, 2006). For it to be relevant, the gain should outweigh the cost, which of course
depends on the context.
The study is organized as follows: Section 2 discusses the dataset and its attributes, followed
by the algorithms that are applied to it. Section 5.1 then discusses the preprocessing steps
the dataset underwent in order to increase the performance of both algorithms. Once the
algorithms are applied, the results are presented and briefly discussed in section 6.
2. Dataset
The dataset concerns wine. More precisely, it contains both physicochemical properties and
sensory data of red and white Portuguese wines (vinho verde). It was collected between May
2004 and February 2007 to study three regression techniques: support vector machines,
multiple regression and neural networks (Cortez, Cerdeira, Almeida, Matos, & Reis, 2009).
Since the sensory data of the two types of wine reflect completely different tastes, the
authors decided to split the dataset into two: a red dataset and a white dataset. The
experiments below are based solely on the white dataset, since the goal is not to compare
the results of both types of wine, but rather to compare different algorithms. For the white
dataset, 4898 instances were collected with 12 attributes each.
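For reference, the white dataset is distributed through the UCI Machine Learning Repository;
a minimal Python sketch for loading it could look as follows (the URL and the semicolon
separator are assumptions based on the UCI distribution; the experiments themselves were
run in Weka):
-------------------------------------------------------------------------------------------------
import pandas as pd

# UCI Wine Quality data (white); the file is semicolon-separated.
URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-white.csv")

df = pd.read_csv(URL, sep=";")
X = df.drop(columns="quality")   # the 11 physicochemical attributes
y = df["quality"]                # the sensory quality score (class attribute)
print(df.shape)                  # expected: (4898, 12)
-------------------------------------------------------------------------------------------------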
2.1. Attributes
There are eleven physicochemical properties recorded: fixed acidity, volatile acidity, citric
acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates
and alcohol.
Table 1 : Attribute characteristics of the white wine dataset

Attribute                             Min    Max     Mean    StdDev
Fixed acidity (g tartaric acid/l)     3.80   14.20   6.86    0.84
Volatile acidity (g acetic acid/l)    0.08   1.10    0.28    0.10
Citric acid (g/l)                     0.00   1.66    0.33    0.12
Residual sugar (g/l)                  0.60   65.80   6.39    5.07
Chlorides (g sodium chloride/l)       0.01   0.35    0.05    0.02
Free sulfur dioxide (mg/l)            2.00   289.00  35.31   17.01
Total sulfur dioxide (mg/l)           9.00   440.00  138.36  42.50
Density (g/ml)                        0.99   1.04    0.99    0.00
pH                                    2.72   3.82    3.19    0.15
Sulphates (g potassium sulphate/l)    0.22   1.08    0.49    0.11
Alcohol (vol.%)                       8.00   14.20   10.51   1.23
The acidity of the wine protects it from bacteria during the fermentation process. A
distinction should be made between the amount of acidity and the strength of the acidity.
The amount is measured in g/l, whereas the strength is measured in pH. Most wines show
pH values between 2.9 and 3.9; the higher the wine's acidity, the lower the pH value.
Three main acids can be found in wine grapes: tartaric, malic and citric acid. These are fixed
acids that contribute to the quality of the wine, the shaping of its distinct taste during
winemaking and the aging of the wine. Tartaric acid is important for maintaining the wine's
chemical stability and color; its concentration in the wine grape varies with the soil and grape
type. Malic acid is not measured in the dataset. Citric acid is also present, but in much
smaller concentrations (about 1/20 of tartaric acid). It can be found in many citrus fruits and
gives the wine a strong citric taste. Extra citric acid can be added, but with caution: certain
bacteria can convert it into acetic acid, raising the volatile acidity.
3. Algorithms
Three algorithms were chosen based on their interesting aspects. First of all, ZeroR is
included as a baseline algorithm that any smart algorithm should outperform in order to be
useful. J48 is included as a tree-building algorithm, because trees are easy to understand and
to model. A disadvantage of trees is that they are prone to overfitting. To improve the J48
algorithm, AdaBoostM1 is included; it is a meta-algorithm that must be combined with an
underlying algorithm, and here it is combined with J48 to test its improvements.
3.1. ZeroR
Commonly referred to as the baseline algorithm, ZeroR is the simplest classification method.
It determines the majority class from a frequency table and always predicts that class, thus
relying solely on the class attribute and ignoring all other attributes. ZeroR has no real
predictive power, but it is useful as a benchmark for other classification methods.
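In other toolkits the same baseline exists; as an illustration, a minimal sketch with
scikit-learn's DummyClassifier (not the Weka implementation used in the experiments):
-------------------------------------------------------------------------------------------------
from sklearn.dummy import DummyClassifier

# ZeroR: always predict the majority class, ignoring every other attribute.
# X, y: the attribute matrix and class vector loaded earlier.
zeror = DummyClassifier(strategy="most_frequent").fit(X, y)
print(zeror.predict(X[:5]))   # the same (majority) class for every instance
-------------------------------------------------------------------------------------------------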
3.2. J48 (C4.5)
J48 is an open-source Java implementation of the C4.5 algorithm. C4.5 is a widely popular
statistical classifier that generates decision trees using the concept of information gain. It is
an improvement over the ID3 algorithm, developed earlier by Ross Quinlan (Quinlan, 1992).
Basically, it generates a decision tree in a top-down manner. Using a training set, at each stage
it greedily looks for the attribute that best splits the set into subsets. Let T be the set of
instances associated with a stage. Each attribute is evaluated separately on T using the
information gain criterion, which is derived from entropy (more precisely, C4.5 uses the gain
ratio: the attribute's information gain divided by the entropy of the split itself). The attribute
providing the highest gain ratio is selected as the node in the tree. For each subset, this
process is repeated until the subset contains only samples from the same class or until the
minimum number of leaf objects is reached (see Salvatore, 2000, for a simplified pseudo-code
of the C4.5 tree-construction algorithm).
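To make the criterion concrete, a small illustrative sketch of the gain-ratio computation for
one discrete attribute is given below (an illustration in Python, not the C4.5 code itself):
-------------------------------------------------------------------------------------------------
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(attribute, labels):
    """C4.5's splitting criterion: information gain divided by split entropy."""
    attribute, labels = np.asarray(attribute), np.asarray(labels)
    total = len(labels)
    values, counts = np.unique(attribute, return_counts=True)
    # Expected entropy of the class after splitting T on this attribute
    conditional = sum((c / total) * entropy(labels[attribute == v])
                      for v, c in zip(values, counts))
    gain = entropy(labels) - conditional      # information gain
    split_info = entropy(attribute)           # entropy of the split itself
    return gain / split_info if split_info > 0 else 0.0
-------------------------------------------------------------------------------------------------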
3.3. AdaBoostM1
AdaBoost boosts the performance of the original learning algorithm on the training data. To
do this, it iteratively uses a weighted training set (Freund & Schapire, 1996; Quinlan, 2006;
Schapire, 2013). Each instance weight w[n] reflects the importance of the respective instance
and starts at the same value for all instances. A first hypothesis is built on this set. An error
is then calculated as the sum of the weights of all misclassified instances, and all correctly
classified instances are given lower weights. A new hypothesis is generated using these new
weights. This process is repeated until there are T hypotheses, with T being an input to
AdaBoost. All hypotheses can be seen as committee members, with the weights of their votes
z being a function of their accuracy on the training set. The final hypothesis sums the
weighted votes of all committee members and picks the majority. Pseudo-code for this
algorithm can be seen below.
For clarity, the normalization details and the computation of the final boosted classifier are
not spelled out, and the edge cases where the error equals 0 or exceeds 0.5 are not shown in
the code, although they require special handling. When there is no error, no extra trials
should be performed and T should be set to t. The error rate of the boosted classifier
approaches 0 as T increases, but only as long as the error rate of each trial stays below 0.5.
Therefore, when error > 0.5, the trial should be discarded and T replaced by t − 1. AdaBoost
thus assumes that the simple classifiers perform better than random guessing; this is known
as the weak learning condition.
-------------------------------------------------------------------------------------------------
function AdaBoost(examples, L, T) returns a weighted-majority hypothesis
  inputs: examples, set of N labeled examples (x1, y1), …, (xN, yN)
          L, a 'simple' learning algorithm
          T, the number of hypotheses (trials / iterations) in the ensemble
  local variables: w, a vector of N example weights
                   h, a vector of T hypotheses
                   z, a vector of T hypothesis weights

  for n = 1 to N do
      w[n] ← 1/N
  for t = 1 to T do
      h[t] ← L(examples, w)
      error ← 0
      for n = 1 to N do
          if h[t](xn) ≠ yn then error ← error + w[n]
      for n = 1 to N do
          if h[t](xn) = yn then w[n] ← w[n] · error / (1 − error)
      w ← NORMALIZE(w)            (so that the weights sum to 1)
      z[t] ← log((1 − error) / error)
  return WEIGHTED-MAJORITY(h, z)
-------------------------------------------------------------------------------------------------
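To complement the pseudo-code, a runnable sketch of the same procedure is given below,
written in Python with a scikit-learn decision stump standing in for the weak learner L; this is
an illustrative re-implementation of AdaBoost.M1, not Weka's AdaBoostM1 code:
-------------------------------------------------------------------------------------------------
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, T):
    """AdaBoost.M1 following the pseudo-code above."""
    y = np.asarray(y)
    N = len(y)
    w = np.full(N, 1.0 / N)                       # w[n] <- 1/N
    hypotheses, z = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = h.predict(X) != y
        error = w[miss].sum()
        if error > 0.5:                           # weak learning condition broken:
            break                                 # discard this trial (T <- t-1)
        if error == 0:                            # perfect hypothesis: keep it, stop
            hypotheses.append(h)
            z.append(np.log(1e12))                # arbitrarily large vote weight
            break
        w[~miss] *= error / (1.0 - error)         # down-weight correct instances
        w /= w.sum()                              # NORMALIZE(w)
        hypotheses.append(h)
        z.append(np.log((1.0 - error) / error))   # z[t] <- log((1-error)/error)
    return hypotheses, z

def weighted_majority(hypotheses, z, X, classes):
    """Sum the weighted votes per class and return the winning class."""
    votes = np.zeros((len(X), len(classes)))
    for h, zt in zip(hypotheses, z):
        pred = h.predict(X)
        for k, c in enumerate(classes):
            votes[pred == c, k] += zt
    return classes[votes.argmax(axis=1)]
-------------------------------------------------------------------------------------------------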
5.1.2. Problem: Outliers
The accuracy improvements are tested using a paired t-test. The paired t-test assumes that
the results from both datasets are independent and normally distributed. These assumptions
are fulfilled, since the results on the dataset without outliers are independent of the results
before deleting them, and by setting the Weka Experimenter to perform 30 iterations, the
distribution of the results can be approximated by a normal distribution. This is done for all
experiments in this study. During this experiment and the experiments in the further
preprocessing steps, the algorithms are applied with their default values in Weka. The models
are trained and evaluated with 10-fold stratified cross-validation. Compared to normal
cross-validation, stratified cross-validation has the benefit that every fold is a good
representation of the dataset: the folds are selected so that the mean response value is
approximately equal in all folds, which has been shown to reduce the variance of the
estimated accuracy. The results are given in the table below.
Figure 6 : Performance difference deleting outliers
No significant improvements are found for either J48 or AdaBoost at a 95% confidence
level (two-tailed). Except for the baseline algorithm ZeroR, deleting outliers has no visible
effect on the accuracy or its standard deviation for the different algorithms.
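As an illustration, a comparable testing procedure can be sketched outside Weka with
scikit-learn and scipy; note that this plain paired t-test is a simplification, since the Weka
Experimenter typically applies a variance correction for resampled data:
-------------------------------------------------------------------------------------------------
from scipy.stats import ttest_rel
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 3 repeats x 10 stratified folds = 30 paired accuracy estimates, as in the text.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
acc_tree = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
acc_boost = cross_val_score(AdaBoostClassifier(), X, y, cv=cv)

t_stat, p_value = ttest_rel(acc_tree, acc_boost)   # paired t-test on the 30 pairs
print(f"tree {acc_tree.mean():.3f} vs boost {acc_boost.mean():.3f}, p = {p_value:.4f}")
-------------------------------------------------------------------------------------------------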
5.1.3. Problem: Imbalanced dataset
When the separate classes are not equally represented, the dataset is imbalanced. An
imbalanced dataset can lead to overfitting and underperforming algorithms. Our dataset is
severely imbalanced, with the number of instances ranging from 5 in the minority class up to
2188 in the majority class; extreme quality scores are rare compared to the mediocre classes.
This problem can be addressed by resampling. Resampling can either be done by deleting
instances from the over-represented class (under-sampling) or by adding copies of instances
from the under-represented class or synthetically creating such instances (over-sampling).
Generally, it might be better to over-sample unless there is plenty of data. There are some
disadvantages to over-sampling, however. It enlarges the dataset, increasing the processing
time needed to build a model. Also, since neighbouring instances of other classes are not
taken into account, it may cause overgeneralization, and when pushed to extremes,
over-sampling can lead to overfitting (Drummond & Holte, n.d.; Rahman & Davis, 2013).
Another option would be to keep the imbalanced dataset but to wrap the learning
algorithms in a penalization scheme, which adds an extra cost for misclassifying a minority
class. This, however, means that the algorithms to be compared are changed, making
comparisons less intuitive. Therefore, sampling is preferred.
In Weka, this sampling can be achieved by applying the supervised SMOTE filter (Chawla,
2005), which resamples the dataset using the Synthetic Minority Oversampling Technique. It
does not simply copy instances from the minority class; rather, it iteratively looks at a number
of nearest neighbors of a minority instance and creates a synthetic instance whose attribute
values lie, with some random variation, within the boundaries set by those neighbors.
We changed the percentage parameter to correspond to the number of extra instances to be
created. Since the over-sampling percentages are extreme, we expect a certain bias in the
results due to overgeneralization. However, this does not impact the differences between
J48 and AdaBoost; remarks on this method can be found in the limitations paragraph. After
balancing, our training set consists of 15311 instances, which means that 10445 instances
were created. Weka appends these extra instances at the bottom of the dataset. With 10-fold
cross-validation, this could lead to folds dominated by instances of the same class and thus
eventually to overfitting. To avoid this issue, we apply an extra filter that randomizes the
order of the instances in the dataset.
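Outside Weka, an equivalent balancing-plus-randomizing step can be sketched with the
imbalanced-learn library (an assumption: imblearn is installed; its SMOTE oversamples all
minority classes up to the majority size, rather than per-class percentages as set above):
-------------------------------------------------------------------------------------------------
from imblearn.over_sampling import SMOTE
from sklearn.utils import shuffle

# Oversample every minority class up to the majority-class size, interpolating
# between each minority instance and its nearest neighbours. k_neighbors must
# be smaller than the smallest class (here 5 instances), hence k_neighbors=4.
X_bal, y_bal = SMOTE(k_neighbors=4, random_state=42).fit_resample(X, y)

# Synthetic rows are appended at the end, so shuffle before cross-validation
# (the equivalent of Weka's Randomize filter).
X_bal, y_bal = shuffle(X_bal, y_bal, random_state=42)
print(y_bal.value_counts())   # balanced class counts
-------------------------------------------------------------------------------------------------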
Table 3 : Balancing the dataset using the SMOTE filter

Class   Number of instances   % to add   Amount added
1       14                    15528      2173
2       161                   1259       2026
3       1443                  51.6       744
4       2188                  0          0
5       880                   148.6      1307
6       175                   1150       2012
7       5                     43660      2183
Figure 7 : Effect of balancing dataset
Figure 11 : Weighted average F-measure
Results from the experiments are shown above. With a two-tailed confidence level of 95%,
the performance of the J48 and AdaBoostM1(J48) algorithms improved significantly (v) by
balancing the dataset, as found by running the Weka Experimenter. Only the baseline
algorithm deteriorated significantly (*). The standard deviations also decreased, indicating
more stable results. This supports the broad conclusion that balancing genuinely improves
the performance of the mentioned algorithms. Here the default values of the algorithms
were used; the parameters of the different algorithms will be adjusted at a later stage, when
we compare them to one another.
5.1.4. Normalization
When the ranges of the variables differ greatly, normalization can be beneficial. For this
dataset, the scales are very different among the attributes. The values, measured on
different scales, are adjusted to fit a common scale. It is important that normalization is
applied after checking for outliers, which has already been done above. The default values
for the scale (1) and translation (0.0) are used, meaning that everything is scaled to the
interval [0, 1]. The class values are ignored since they are nominal. At a 95% confidence level,
there is no significant difference when looking at the accuracy of the three algorithms;
normalization has no effect here. Therefore, we continue with the dataset without
normalizing the numeric attributes.
Figure 12 : Effect of normalising numeric attributes
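For reference, Weka's Normalize filter with these defaults corresponds to min-max scaling;
a scikit-learn sketch:
-------------------------------------------------------------------------------------------------
from sklearn.preprocessing import MinMaxScaler

# Rescale every numeric attribute to [0, 1]; the nominal class stays untouched.
scaler = MinMaxScaler(feature_range=(0, 1))
X_norm = scaler.fit_transform(X_bal)   # in practice, fit on the training folds only
-------------------------------------------------------------------------------------------------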
5.1.5. Feature Selection
In the real world, more attributes can lead to higher discrimination power. However, most
machine learning algorithms have difficulties handling irrelevant or redundant information.
Sometimes attributes can be completely irrelevant for the class. These attributes still need
processing power and can even bias the result. Therefore, feature subset selection is a great
way to improve classification results, lower processing time and raise readability of the model
(Guyon, 2003). This is done by identifying and neglecting or removing the irrelevant
information. Feature selection is successful if the number of dimensions can be reduced
without lowering (or by improving) the accuracy of the induction algorithm.
The search direction can have a serious effect on the attributes selected. One can start by
selecting all attributes and iteratively deleting attributes from that selection until some
termination point; this method is called backward elimination. The forward selection
method, on the other hand, starts with zero attributes and gradually builds up a selection
until some termination point. Combining these two methods leads to bi-directional search,
which starts with a subset of attributes and either deletes or adds attributes depending on
some characteristic such as merit.
By setting a termination point, searching the entire search space is avoided. Typically, a
termination point could be a fixed number of attributes to select or a merit threshold.
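Both search directions are available in, for example, scikit-learn's SequentialFeatureSelector;
a sketch under the assumption of a fixed-size termination point:
-------------------------------------------------------------------------------------------------
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Greedy forward selection: start from zero attributes and repeatedly add the
# one that most improves cross-validated accuracy; the fixed number of
# attributes (8 here, an arbitrary choice) acts as the termination point.
sfs = SequentialFeatureSelector(DecisionTreeClassifier(),
                                n_features_to_select=8,
                                direction="forward", cv=10)
sfs.fit(X_bal, y_bal)
print(sfs.get_support())          # boolean mask of the selected attributes
# direction="backward" gives backward elimination instead.
-------------------------------------------------------------------------------------------------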
(i) Feature subset selection in Weka
Based on the scatter plots, we suspect some attributes to be irrelevant because of their
seemingly high correlation with each other. An example can be seen in the figure below,
which shows the relation of 'residual sugar' to 'density'. Based on the theory above, one can
see that the higher the amount of residual sugar, the higher the density will be. This relation,
combined with the low correlation of 'residual sugar' with 'quality', will probably lead to the
exclusion of one of the two attributes.
Figure 14 : Relation of ‘residual sugar’ to ‘density’ shows high correlation
Weka offers many methods to apply feature subset selection, either permanently or
temporarily during learning-algorithm execution. Since processing power is limited, all
irrelevant attributes were discarded first, before the adapted dataset was used to train the
models. This method is very fast and leads to performance similar to that of the slower
wrapper method. Although there is a filter in Weka called AttributeSelection that combines
an evaluation strategy with a search method to automatically select the correct attributes, it
does not apply cross-validation. Therefore, attribute selection is run first and its results are
applied manually afterwards. The evaluation strategy used is CfsSubsetEval (CFS =
Correlation-based Feature Selection), which looks at the correlation matrix of all attributes
using a metric called "symmetric uncertainty" (Hall, 1999). It considers the predictive value
of each attribute together with the degree of redundancy between attributes.
Attributes with a high correlation with the class attribute and low inter-correlations are
preferred. CFS assumes that the attributes are independent and can fail to select the
relevant attributes when they depend strongly on other attributes given a class. The
components of CFS are listed in the figure below (Hall, 1999).
Figure 15 : Components of CFS
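The symmetric uncertainty metric can be written out directly; a small illustrative sketch for
two discrete variables (numeric attributes would first have to be discretized, as CFS does
internally):
-------------------------------------------------------------------------------------------------
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy_bits(x):
    """Shannon entropy (in bits) of a discrete variable."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), ranging from 0 to 1."""
    mi_bits = mutual_info_score(x, y) / np.log(2)   # convert nats to bits
    return 2.0 * mi_bits / (entropy_bits(x) + entropy_bits(y))
-------------------------------------------------------------------------------------------------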
Multiple search methods were used to compare the results, and all led to the same result.
Ultimately, exhaustive search was used, because it was more than feasible with only twelve
attributes. With 10-fold cross-validation, it shows a clear exclusion of 'residual sugar'.
Experimenting with the dataset before and after deleting the attribute 'residual sugar' shows
that deleting it does not significantly deteriorate the performance of the different algorithms
(95% confidence level). More importantly, however, the CPU time needed to build the model
does decrease significantly, from 0.96 to 0.87 for J48 and from 8.84 to 8.35 for AdaBoost
(cf. the figures below). This shows that by removing the attribute, one can reduce processing
time without lowering the performance of the algorithms. It is therefore beneficial to remove
the attribute, especially when processing power is limited.
Figure 16 : Results of attribute selection with 10-fold cross-validation
5.3.1. Optimizing J48
By increasing the minimum number of objects in a leaf, the size of the tree can be limited.
This allows for a model that is easier to understand and can reduce overfitting. However,
accuracy is expected to decrease as the leaves grow. Therefore, a tradeoff needs to be made
between tree size and accuracy. The graph below shows the impact of adjusting the
minNumObj parameter; the Experimenter was used with 30 iterations.
Figure 19 : Adjusting minNumObj parameter for J48
By default, minNumObj is set to 2; here it is increased up to 500. The graph shows a gradual
decline in both tree size and accuracy as a result. The standard deviation is not included in
this graph but gradually increases for both metrics. The tree size shrinks much faster than the
accuracy: by changing the parameter to 5, the tree size is halved while the accuracy decreases
only from 69.8% to 68.12%. Going further to a minimum of 10 results in yet again half the
tree size, with a slight decrease in accuracy to 66.5%. Beyond that point, the accuracy drops
slightly faster. Therefore, minNumObj is set to 10. Remarks on this decision can be found in
the limitations section.
Adjusting the confidenceFactor does not change the accuracy significantly; values from 0.05
up to 0.5 were tested, with 0.25 being the default value. The default value is therefore used.
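A comparable sweep can be sketched with scikit-learn, where min_samples_leaf is the closest
analogue of minNumObj (only an approximation, since J48 and sklearn's CART-style tree
differ in their splitting and pruning details):
-------------------------------------------------------------------------------------------------
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

for leaf in [2, 5, 10, 25, 50, 100, 250, 500]:
    tree = DecisionTreeClassifier(min_samples_leaf=leaf)
    acc = cross_val_score(tree, X_bal, y_bal, cv=10).mean()
    size = tree.fit(X_bal, y_bal).tree_.node_count   # tree size in nodes
    print(f"min objects per leaf {leaf:3d}: accuracy {acc:.3f}, {size} nodes")
-------------------------------------------------------------------------------------------------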
5.3.2. Optimizing AdaBoost
Since AdaBoost(J48) wraps J48, it is important to use the same parameters for J48 here as
those used for the standalone J48 algorithm. AdaBoost itself also allows some tuning:
the number of iterations can be adjusted (cf. T in the theoretical section on AdaBoost). As
expected, the accuracy increases as more iterations make up the committee. The graph
below shows the accuracies corresponding to different numbers of iterations; the accuracy
improvements gradually decline. By setting an improvement threshold at 1%, the number of
iterations is set to 15, which leads to an accuracy of 78.28%.
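A comparable sweep over the number of iterations could look as follows (the estimator
parameter is named base_estimator in scikit-learn versions before 1.2, and algorithm="SAMME"
selects the discrete multiclass variant closest to AdaBoost.M1):
-------------------------------------------------------------------------------------------------
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

base = DecisionTreeClassifier(min_samples_leaf=10)   # same settings as the standalone tree
prev = 0.0
for T in [1, 5, 10, 15, 20, 30, 50]:
    booster = AdaBoostClassifier(estimator=base, n_estimators=T,
                                 algorithm="SAMME")  # discrete multiclass boosting
    acc = cross_val_score(booster, X_bal, y_bal, cv=10).mean()
    print(f"T = {T:2d}: accuracy {acc:.3f} (gain {acc - prev:+.3f})")
    prev = acc
-------------------------------------------------------------------------------------------------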
6. Results & Interpretation
The table below shows all relevant performance metrics of the different algorithms.
Table 4 : Experimental results

                              ZeroR    J48     AdaBoost
Accuracy                      12.93%   65.48%  78.33%
Kappa                         0.00     0.60    0.75
Mean absolute error           0.24     0.12    0.07
Root mean squared error       0.35     0.26    0.22
Relative absolute error       100%     48%     27%
Root relative squared error   100%     75%     63%
TP rate                       0.13     0.66    0.78
FP rate                       0.13     0.06    0.04
Precision                     0.02     0.65    0.78
Recall                        0.13     0.66    0.78
F-measure                     0.03     0.65    0.78
ROC area                      0.50     0.89    0.96
From the table it is clear that AdaBoost outperforms the other algorithms in every respect,
with J48 as the runner-up. Although our dataset is balanced, small differences in class size
exist due to the random split. In the training set, the class with a score of 9 is slightly bigger
than the other classes, so ZeroR adopts the rule of always choosing that class. This leads to
an accuracy of merely 12.93% on the test set. J48 performs much better, with an accuracy of
65.48%, which equals an error reduction of 60%. Compared to J48, AdaBoost reduces the
error by another 37%. Since the classes are still quite balanced, the weighted average of the
F-measure approximates the accuracy. In the case of J48 and AdaBoost, the (weighted-
average) precision and recall also approximate the accuracy. ZeroR, on the other hand, shows
a much lower value for the weighted average of precision, since in all but one class the TP
and FP rates are zero.
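Most of the metrics in Table 4 can be recomputed from the predictions; a sketch with
scikit-learn's metrics module, assuming a stratified hold-out split comparable to the test set
used here:
-------------------------------------------------------------------------------------------------
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, precision_recall_fscore_support)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# X_bal, y_bal: the balanced dataset from the preprocessing steps above.
X_tr, X_te, y_tr, y_te = train_test_split(X_bal, y_bal, test_size=0.3,
                                          random_state=1, stratify=y_bal)
model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(min_samples_leaf=10),
    n_estimators=15, algorithm="SAMME").fit(X_tr, y_tr)
pred = model.predict(X_te)

print("accuracy:", accuracy_score(y_te, pred))
print("kappa:   ", cohen_kappa_score(y_te, pred))
p, r, f, _ = precision_recall_fscore_support(y_te, pred, average="weighted")
print(f"precision={p:.2f}  recall={r:.2f}  F-measure={f:.2f}")
print(confusion_matrix(y_te, pred))   # cf. the confusion matrices in appendix B
-------------------------------------------------------------------------------------------------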
Although the ROC area is especially useful for unbalanced datasets, here it confirms the other
metrics, with the area for AdaBoost coming close to 1, which indicates an excellent predictor.
J48, with an ROC area just below 0.9, shows a good predictor. The confusion matrices can be
found in appendix B. They show that for the class with score 9, very few errors are made by
J48 and AdaBoost: for J48, no instances from this class are misclassified as having lower
scores, and AdaBoost misclassifies only 3 instances in this way. Also, only instances with
scores from 6 to 8 are occasionally misclassified as belonging to the highest class. This shows
that the models perform better on good wines.