Abstract
This study aims to highlight the differences between two classification algorithms. Both
algorithms are applied to a dataset of white Portuguese wines and are used to classify
human wine taste preferences based on physicochemical properties. The algorithms studied
here are J48 (C4.5) and AdaBoostM1(J48). AdaBoostM1(J48) starts from the basic J48
algorithm and attempts to improve it, which makes it interesting to see how it performs on a
real-world application. Since AdaBoostM1(J48) adds complexity, a tradeoff should be made
between its added value and its added cost. This study provides clarity on its added value in
classifying white Portuguese wine.
1. Introduction
The classification task is a form of supervised learning where a dataset, labeled with the right
class, is first specified. A classifier is trained on this set with the goal of generalizing its model
to other data. Different classifiers (learning algorithms) exist for this task, each with its
advantages and disadvantages. As computational power has increased strongly over the
years, increasingly complex and accurate algorithms have been developed, and much
attention therefore typically goes to the accuracy of an algorithm. Currently, however, many
smart mobile devices and services are being developed, and for these less powerful, mobile
applications the context is different: power consumption and efficiency become important.
The algorithms under study here are J48 and AdaBoostM1(J48). J48 is a relatively simple tree-
building algorithm that is very popular thanks to its good performance and its understandable
models. AdaBoost attempts to boost the performance of an underlying algorithm, which in
this case is J48. This boosting comes at the cost of increased complexity. For the average
dataset, AdaBoost has shown better performance at the cost of computational power
(Quinlan, 2006). For it to be relevant, the gain should outweigh the cost, which of course
depends on the context.
The study is organized as follows: Section 2 discusses the dataset and its attributes, followed
by the algorithms that are applied to it. Section 5.1 then discusses the preprocessing steps
the dataset underwent in order to increase the performance of both algorithms. Once the
algorithms are applied, the results are presented and briefly discussed in section 6.
2. Dataset
The dataset concerns wine. More precisely, it contains both physicochemical properties and
sensory data of red and white Portuguese wines (vinho verde). It was collected between May
2004 and February 2007 to study three regression techniques: support vector machines,
multiple regression and neural networks (Cortez, Cerdeira, Almeida, Matos, & Reis, 2009).
Since the sensory data of the two types of wine reflect completely different tastes, the
authors decided to split the dataset into two: a red dataset and a white dataset. The
experiments below are based solely on the white dataset, since the goal is not to compare
the results of both types of wine, but rather to compare different algorithms. For the white
dataset, 4898 instances were collected with 12 attributes each.
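For reference, the white dataset is distributed through the UCI Machine Learning Repository;
a minimal Python sketch for loading it could look as follows (the URL and the semicolon
separator are assumptions based on the UCI distribution; the experiments themselves were
run in Weka):
-------------------------------------------------------------------------------------------------
import pandas as pd

# UCI Wine Quality data (white); the file is semicolon-separated.
URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-white.csv")

df = pd.read_csv(URL, sep=";")
X = df.drop(columns="quality")   # the 11 physicochemical attributes
y = df["quality"]                # the sensory quality score (class attribute)
print(df.shape)                  # expected: (4898, 12)
-------------------------------------------------------------------------------------------------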
2.1. Attributes
There are eleven physicochemical properties recorded: fixed acidity, volatile acidity, citric
acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates
and alcohol.
Table 1 : Attribute characteristics of the white wine dataset

Attribute                             Min    Max     Mean    StdDev
Fixed acidity (g tartaric acid/l)     3.80   14.20   6.86    0.84
Volatile acidity (g acetic acid/l)    0.08   1.10    0.28    0.10
Citric acid (g/l)                     0.00   1.66    0.33    0.12
Residual sugar (g/l)                  0.60   65.80   6.39    5.07
Chlorides (g sodium chloride/l)       0.01   0.35    0.05    0.02
Free sulfur dioxide (mg/l)            2.00   289.00  35.31   17.01
Total sulfur dioxide (mg/l)           9.00   440.00  138.36  42.50
Density (g/ml)                        0.99   1.04    0.99    0.00
pH                                    2.72   3.82    3.19    0.15
Sulphates (g potassium sulphate/l)    0.22   1.08    0.49    0.11
Alcohol (vol.%)                       8.00   14.20   10.51   1.23
The acidity of the wine protects it from bacteria during the fermentation process. A
distinction should be made between the amount of acidity and the strength of the acidity.
The amount is measured in g/l, whereas the strength is measured in pH. Most wines show
pH values between 2.9 and 3.9; the higher the wine's acidity, the lower the pH value.
Three main acids can be found in wine grapes: tartaric, malic and citric acid. These are fixed
acids that contribute to the quality of the wine, the shaping of its distinct taste during
winemaking and the aging of the wine. Tartaric acid is important for maintaining the wine's
chemical stability and color; its concentration in the wine grape varies with the soil and grape
type. Malic acid is not measured in the dataset. Citric acid is also present, but in much
smaller concentrations (about 1/20 of tartaric acid). It can be found in many citrus fruits and
gives the wine a strong citric taste. Extra citric acid can be added, but with caution: certain
bacteria can convert it into acetic acid, raising the volatile acidity.
3. Algorithms
Three algorithms were chosen based on their interesting aspects. First of all, ZeroR is
included as a baseline algorithm that any smart algorithm should outperform in order to be
useful. J48 is included as a tree-building algorithm, because trees are easy to understand and
to model. A disadvantage of trees is that they are prone to overfitting. To improve the J48
algorithm, AdaBoostM1 is included; it is a meta-algorithm that must be combined with an
underlying algorithm, and here it is combined with J48 to test its improvements.
3.1. ZeroR
Commonly referred to as the baseline algorithm, ZeroR is the simplest classification method.
It determines the majority class from a frequency table and always predicts that class, thus
relying solely on the class attribute and ignoring all other attributes. ZeroR has no real
predictive power, but it is useful as a benchmark for other classification methods.
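In other toolkits the same baseline exists; as an illustration, a minimal sketch with
scikit-learn's DummyClassifier (not the Weka implementation used in the experiments):
-------------------------------------------------------------------------------------------------
from sklearn.dummy import DummyClassifier

# ZeroR: always predict the majority class, ignoring every other attribute.
# X, y: the attribute matrix and class vector loaded earlier.
zeror = DummyClassifier(strategy="most_frequent").fit(X, y)
print(zeror.predict(X[:5]))   # the same (majority) class for every instance
-------------------------------------------------------------------------------------------------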
3.2. J48 (C4.5)
J48 is an open-source Java implementation of the C4.5 algorithm. C4.5 is a widely popular
statistical classifier that generates decision trees using the concept of information gain. It is
an improvement over the ID3 algorithm, developed earlier by Ross Quinlan (Quinlan, 1992).
Basically, it generates a decision tree in a top-down manner. Using a training set, at each stage
it greedily looks for the attribute that best splits the set into subsets. Let T be the set of
instances associated with a stage. Each attribute is evaluated separately on T using the
information gain criterion, which is derived from entropy (more precisely, C4.5 uses the gain
ratio: the attribute's information gain divided by the entropy of the split itself). The attribute
providing the highest gain ratio is selected as the node in the tree. For each subset, this
process is repeated until the subset contains only samples from the same class or until the
minimum number of leaf objects is reached (see Salvatore, 2000, for a simplified pseudo-code
of the C4.5 tree-construction algorithm).
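To make the criterion concrete, a small illustrative sketch of the gain-ratio computation for
one discrete attribute is given below (an illustration in Python, not the C4.5 code itself):
-------------------------------------------------------------------------------------------------
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(attribute, labels):
    """C4.5's splitting criterion: information gain divided by split entropy."""
    attribute, labels = np.asarray(attribute), np.asarray(labels)
    total = len(labels)
    values, counts = np.unique(attribute, return_counts=True)
    # Expected entropy of the class after splitting T on this attribute
    conditional = sum((c / total) * entropy(labels[attribute == v])
                      for v, c in zip(values, counts))
    gain = entropy(labels) - conditional      # information gain
    split_info = entropy(attribute)           # entropy of the split itself
    return gain / split_info if split_info > 0 else 0.0
-------------------------------------------------------------------------------------------------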
3.3. AdaBoostM1
AdaBoost boosts the performance of the original learning algorithm on the training data. To
do this, it iteratively uses a weighted training set (Freund & Schapire, 1996; Quinlan, 2006;
Schapire, 2013). Each instance weight w[n] reflects the importance of the respective instance
and starts at the same value for all instances. A first hypothesis is built on this set. An error
is then calculated as the sum of the weights of all misclassified instances, and all correctly
classified instances are given lower weights. A new hypothesis is generated using these new
weights. This process is repeated until there are T hypotheses, with T being an input to
AdaBoost. All hypotheses can be seen as committee members, with the weights of their votes
z being a function of their accuracy on the training set. The final hypothesis sums the
weighted votes of all committee members and picks the majority. Pseudo-code for this
algorithm can be seen below.
For clarity, the normalization details and the computation of the final boosted classifier are
not spelled out, and the edge cases where the error equals 0 or exceeds 0.5 are not shown in
the code, although they require special handling. When there is no error, no extra trials
should be performed and T should be set to t. The error rate of the boosted classifier
approaches 0 as T increases, but only as long as the error rate of each trial stays below 0.5.
Therefore, when error > 0.5, the trial should be discarded and T replaced by t − 1. AdaBoost
thus assumes that the simple classifiers perform better than random guessing; this is known
as the weak learning condition.
-------------------------------------------------------------------------------------------------
function AdaBoost(examples, L, T) returns a weighted-majority hypothesis
  inputs: examples, set of N labeled examples (x1, y1), …, (xN, yN)
          L, a 'simple' learning algorithm
          T, the number of hypotheses (trials / iterations) in the ensemble
  local variables: w, a vector of N example weights
                   h, a vector of T hypotheses
                   z, a vector of T hypothesis weights

  for n = 1 to N do
      w[n] ← 1/N
  for t = 1 to T do
      h[t] ← L(examples, w)
      error ← 0
      for n = 1 to N do
          if h[t](xn) ≠ yn then error ← error + w[n]
      for n = 1 to N do
          if h[t](xn) = yn then w[n] ← w[n] · error / (1 − error)
      w ← NORMALIZE(w)            (so that the weights sum to 1)
      z[t] ← log((1 − error) / error)
  return WEIGHTED-MAJORITY(h, z)
-------------------------------------------------------------------------------------------------
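To complement the pseudo-code, a runnable sketch of the same procedure is given below,
written in Python with a scikit-learn decision stump standing in for the weak learner L; this is
an illustrative re-implementation of AdaBoost.M1, not Weka's AdaBoostM1 code:
-------------------------------------------------------------------------------------------------
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, T):
    """AdaBoost.M1 following the pseudo-code above."""
    y = np.asarray(y)
    N = len(y)
    w = np.full(N, 1.0 / N)                       # w[n] <- 1/N
    hypotheses, z = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = h.predict(X) != y
        error = w[miss].sum()
        if error > 0.5:                           # weak learning condition broken:
            break                                 # discard this trial (T <- t-1)
        if error == 0:                            # perfect hypothesis: keep it, stop
            hypotheses.append(h)
            z.append(np.log(1e12))                # arbitrarily large vote weight
            break
        w[~miss] *= error / (1.0 - error)         # down-weight correct instances
        w /= w.sum()                              # NORMALIZE(w)
        hypotheses.append(h)
        z.append(np.log((1.0 - error) / error))   # z[t] <- log((1-error)/error)
    return hypotheses, z

def weighted_majority(hypotheses, z, X, classes):
    """Sum the weighted votes per class and return the winning class."""
    votes = np.zeros((len(X), len(classes)))
    for h, zt in zip(hypotheses, z):
        pred = h.predict(X)
        for k, c in enumerate(classes):
            votes[pred == c, k] += zt
    return classes[votes.argmax(axis=1)]
-------------------------------------------------------------------------------------------------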
5.1.2. Problem: Outliers
The accuracy improvements are tested using a paired t-test. The paired t-test assumes that
the results from both datasets are independent and normally distributed. These assumptions
are fulfilled, since the results on the dataset without outliers are independent of the results
before deleting them, and by setting the Weka Experimenter to perform 30 iterations, the
distribution of the results can be approximated by a normal distribution. This is done for all
experiments in this study. During this experiment and the experiments in the further
preprocessing steps, the algorithms are applied with their default values in Weka. The models
are trained and evaluated with 10-fold stratified cross-validation. Compared to normal
cross-validation, stratified cross-validation has the benefit that every fold is a good
representation of the dataset: the folds are selected so that the mean response value is
approximately equal in all folds, which has been shown to reduce the variance of the
estimated accuracy. The results are given in the table below.
Figure 6 : Performance difference deleting outliers
No significant improvements are found for either J48 or AdaBoost at a 95% confidence
level (two-tailed). Except for the baseline algorithm ZeroR, deleting outliers has no visible
effect on the accuracy or its standard deviation for the different algorithms.
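As an illustration, a comparable testing procedure can be sketched outside Weka with
scikit-learn and scipy; note that this plain paired t-test is a simplification, since the Weka
Experimenter typically applies a variance correction for resampled data:
-------------------------------------------------------------------------------------------------
from scipy.stats import ttest_rel
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 3 repeats x 10 stratified folds = 30 paired accuracy estimates, as in the text.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
acc_tree = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
acc_boost = cross_val_score(AdaBoostClassifier(), X, y, cv=cv)

t_stat, p_value = ttest_rel(acc_tree, acc_boost)   # paired t-test on the 30 pairs
print(f"tree {acc_tree.mean():.3f} vs boost {acc_boost.mean():.3f}, p = {p_value:.4f}")
-------------------------------------------------------------------------------------------------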
5.1.3. Problem: Imbalanced dataset
When the separate classes are not equally represented, the dataset is imbalanced. An
imbalanced dataset can lead to overfitting and underperforming algorithms. Our dataset is
severely imbalanced, with the number of instances ranging from 5 in the minority class up to
2188 in the majority class; extreme quality scores are rare compared to the mediocre classes.
This problem can be addressed by resampling. Resampling can either be done by deleting
instances from the over-represented class (under-sampling) or by adding copies of instances
from the under-represented class or synthetically creating such instances (over-sampling).
Generally, it might be better to over-sample unless there is plenty of data. There are some
disadvantages to over-sampling, however. It enlarges the dataset, increasing the processing
time needed to build a model. Also, since neighbouring instances of other classes are not
taken into account, it may cause overgeneralization, and when pushed to extremes,
over-sampling can lead to overfitting (Drummond & Holte, n.d.; Rahman & Davis, 2013).
Another option would be to keep the imbalanced dataset but to wrap the learning
algorithms in a penalization scheme, which adds an extra cost for misclassifying a minority
class. This, however, means that the algorithms to be compared are changed, making
comparisons less intuitive. Therefore, sampling is preferred.
In Weka, this sampling can be achieved by applying the supervised SMOTE filter (Chawla,
2005), which resamples the dataset using the Synthetic Minority Oversampling Technique. It
does not simply copy instances from the minority class; rather, it iteratively looks at a number
of nearest neighbors of a minority instance and creates a synthetic instance whose attribute
values lie, with some random variation, within the boundaries set by those neighbors.
We changed the percentage parameter to correspond to the number of extra instances to be
created. Since the over-sampling percentages are extreme, we expect a certain bias in the
results due to overgeneralization. However, this does not impact the differences between
J48 and AdaBoost; remarks on this method can be found in the limitations paragraph. After
balancing, our training set consists of 15311 instances, which means that 10445 instances
were created. Weka appends these extra instances at the bottom of the dataset. With 10-fold
cross-validation, this could lead to folds dominated by instances of the same class and thus
eventually to overfitting. To avoid this issue, we apply an extra filter that randomizes the
order of the instances in the dataset.
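Outside Weka, an equivalent balancing-plus-randomizing step can be sketched with the
imbalanced-learn library (an assumption: imblearn is installed; its SMOTE oversamples all
minority classes up to the majority size, rather than per-class percentages as set above):
-------------------------------------------------------------------------------------------------
from imblearn.over_sampling import SMOTE
from sklearn.utils import shuffle

# Oversample every minority class up to the majority-class size, interpolating
# between each minority instance and its nearest neighbours. k_neighbors must
# be smaller than the smallest class (here 5 instances), hence k_neighbors=4.
X_bal, y_bal = SMOTE(k_neighbors=4, random_state=42).fit_resample(X, y)

# Synthetic rows are appended at the end, so shuffle before cross-validation
# (the equivalent of Weka's Randomize filter).
X_bal, y_bal = shuffle(X_bal, y_bal, random_state=42)
print(y_bal.value_counts())   # balanced class counts
-------------------------------------------------------------------------------------------------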
Table 3 : Balancing the dataset using the SMOTE filter

Class   Number of instances   % to add   Amount added
1       14                    15528      2173
2       161                   1259       2026
3       1443                  51.6       744
4       2188                  0          0
5       880                   148.6      1307
6       175                   1150       2012
7       5                     43660      2183
Figure 7 : Effect of balancing dataset
Figure 11 : Weighted average F-measure
Results from the experiments are shown above. With a two-tailed confidence level of 95%,
the performance of the J48 and AdaBoostM1(J48) algorithms improved significantly (v) by
balancing the dataset, as found by running the Weka Experimenter. Only the baseline
algorithm deteriorated significantly (*). The standard deviations also decreased, indicating
more stable results. This supports the broad conclusion that balancing genuinely improves
the performance of the mentioned algorithms. Here the default values of the algorithms
were used; the parameters of the different algorithms will be adjusted at a later stage, when
we compare them to one another.
5.1.4. Normalization
When the ranges of the variables differ greatly, normalization can be beneficial. For this
dataset, the scales are very different among the attributes. The values, measured on
different scales, are adjusted to fit a common scale. It is important that normalization is
applied after checking for outliers, which has already been done above. The default values
for the scale (1) and translation (0.0) are used, meaning that everything is scaled to the
interval [0, 1]. The class values are ignored since they are nominal. At a 95% confidence level,
there is no significant difference when looking at the accuracy of the three algorithms;
normalization has no effect here. Therefore, we continue with the dataset without
normalizing the numeric attributes.
Figure 12 : Effect of normalising numeric attributes
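For reference, Weka's Normalize filter with these defaults corresponds to min-max scaling;
a scikit-learn sketch:
-------------------------------------------------------------------------------------------------
from sklearn.preprocessing import MinMaxScaler

# Rescale every numeric attribute to [0, 1]; the nominal class stays untouched.
scaler = MinMaxScaler(feature_range=(0, 1))
X_norm = scaler.fit_transform(X_bal)   # in practice, fit on the training folds only
-------------------------------------------------------------------------------------------------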
5.1.5. Feature Selection
In the real world, more attributes can lead to higher discrimination power. However, most
machine learning algorithms have difficulties handling irrelevant or redundant information.
Sometimes attributes can be completely irrelevant for the class. These attributes still need
processing power and can even bias the result. Therefore, feature subset selection is a great
way to improve classification results, lower processing time and raise readability of the model
(Guyon, 2003). This is done by identifying and neglecting or removing the irrelevant
information. Feature selection is successful if the number of dimensions can be reduced
without lowering (or by improving) the accuracy of the induction algorithm.
The search direction can have a serious effect on the attributes selected. One can start by
selecting all attributes and iteratively deleting attributes from that selection until some
termination point; this method is called backward elimination. The forward selection
method, on the other hand, starts with zero attributes and gradually builds up a selection
until some termination point. Combining these two methods leads to bi-directional search,
which starts with a subset of attributes and either deletes or adds attributes depending on
some characteristic such as merit.
By setting a termination point, searching the entire search space is avoided. Typically, a
termination point could be a fixed number of attributes to select or a merit threshold.
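Both search directions are available in, for example, scikit-learn's SequentialFeatureSelector;
a sketch under the assumption of a fixed-size termination point:
-------------------------------------------------------------------------------------------------
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Greedy forward selection: start from zero attributes and repeatedly add the
# one that most improves cross-validated accuracy; the fixed number of
# attributes (8 here, an arbitrary choice) acts as the termination point.
sfs = SequentialFeatureSelector(DecisionTreeClassifier(),
                                n_features_to_select=8,
                                direction="forward", cv=10)
sfs.fit(X_bal, y_bal)
print(sfs.get_support())          # boolean mask of the selected attributes
# direction="backward" gives backward elimination instead.
-------------------------------------------------------------------------------------------------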
(i) Feature subset selection in Weka
Based on the scatter plots, we suspect some attributes to be irrelevant because of their
seemingly high correlation with each other. An example can be seen in the figure below,
which shows the relation of 'residual sugar' to 'density'. Based on the theory above, one can
see that the higher the amount of residual sugar, the higher the density will be. This relation,
combined with the low correlation of 'residual sugar' with 'quality', will probably lead to the
exclusion of one of the two attributes.
Figure 14 : Relation of ‘residual sugar’ to ‘density’ shows high correlation
Weka offers many methods to apply feature subset selection, either permanently or
temporarily during learning-algorithm execution. Since processing power is limited, all
irrelevant attributes were discarded first, before the adapted dataset was used to train the
models. This method is very fast and leads to performance similar to that of the slower
wrapper method. Although there is a filter in Weka called AttributeSelection that combines
an evaluation strategy with a search method to automatically select the correct attributes, it
does not apply cross-validation. Therefore, attribute selection is run first and its results are
applied manually afterwards. The evaluation strategy used is CfsSubsetEval (CFS =
Correlation-based Feature Selection), which looks at the correlation matrix of all attributes
using a metric called "symmetric uncertainty" (Hall, 1999). It considers the predictive value
of each attribute together with the degree of redundancy between attributes.
Attributes with a high correlation with the class attribute and low inter-correlations are
preferred. CFS assumes that the attributes are independent and can fail to select the
relevant attributes when they depend strongly on other attributes given a class. The
components of CFS are listed in the figure below (Hall, 1999).
Figure 15 : Components of CFS
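The symmetric uncertainty metric can be written out directly; a small illustrative sketch for
two discrete variables (numeric attributes would first have to be discretized, as CFS does
internally):
-------------------------------------------------------------------------------------------------
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy_bits(x):
    """Shannon entropy (in bits) of a discrete variable."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), ranging from 0 to 1."""
    mi_bits = mutual_info_score(x, y) / np.log(2)   # convert nats to bits
    return 2.0 * mi_bits / (entropy_bits(x) + entropy_bits(y))
-------------------------------------------------------------------------------------------------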
Multiple search methods were used to compare the results, and all led to the same result.
Ultimately, exhaustive search was used, because it was more than feasible with only twelve
attributes. With 10-fold cross-validation, it shows a clear exclusion of 'residual sugar'.
Experimenting with the dataset before and after deleting the attribute 'residual sugar' shows
that deleting it does not significantly deteriorate the performance of the different algorithms
(95% confidence level). More importantly, however, the CPU time needed to build the model
does decrease significantly, from 0.96 to 0.87 for J48 and from 8.84 to 8.35 for AdaBoost
(cf. the figures below). This shows that by removing the attribute, one can reduce processing
time without lowering the performance of the algorithms. It is therefore beneficial to remove
the attribute, especially when processing power is limited.
Figure 16 : Results of attribute selection with 10-fold cross-validation
5.3.1. Optimizing J48
By increasing the minimum number of objects in a leaf, the size of the tree can be limited.
This allows for a model that is easier to understand and can reduce overfitting. However,
accuracy is expected to decrease as the leaves grow. Therefore, a tradeoff needs to be made
between tree size and accuracy. The graph below shows the impact of adjusting the
minNumObj parameter; the Experimenter was used with 30 iterations.
Figure 19 : Adjusting minNumObj parameter for J48
By default, minNumObj is set to 2; here it is increased up to 500. The graph shows a gradual
decline in both tree size and accuracy as a result. The standard deviation is not included in
this graph but gradually increases for both metrics. The tree size shrinks much faster than the
accuracy: by changing the parameter to 5, the tree size is halved while the accuracy decreases
only from 69.8% to 68.12%. Going further to a minimum of 10 results in yet again half the
tree size, with a slight decrease in accuracy to 66.5%. Beyond that point, the accuracy drops
slightly faster. Therefore, minNumObj is set to 10. Remarks on this decision can be found in
the limitations section.
Adjusting the confidenceFactor does not change the accuracy significantly; values from 0.05
up to 0.5 were tested, with 0.25 being the default value. The default value is therefore used.
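A comparable sweep can be sketched with scikit-learn, where min_samples_leaf is the closest
analogue of minNumObj (only an approximation, since J48 and sklearn's CART-style tree
differ in their splitting and pruning details):
-------------------------------------------------------------------------------------------------
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

for leaf in [2, 5, 10, 25, 50, 100, 250, 500]:
    tree = DecisionTreeClassifier(min_samples_leaf=leaf)
    acc = cross_val_score(tree, X_bal, y_bal, cv=10).mean()
    size = tree.fit(X_bal, y_bal).tree_.node_count   # tree size in nodes
    print(f"min objects per leaf {leaf:3d}: accuracy {acc:.3f}, {size} nodes")
-------------------------------------------------------------------------------------------------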
5.3.2. Optimizing AdaBoost
Since AdaBoost(J48) wraps J48, it is important to use the same parameters for J48 here as
those used for the standalone J48 algorithm. AdaBoost itself also allows some tuning:
the number of iterations can be adjusted (cf. T in the theoretical section on AdaBoost). As
expected, the accuracy increases as more iterations make up the committee. The graph
below shows the accuracies corresponding to different numbers of iterations; the accuracy
improvements gradually decline. By setting an improvement threshold at 1%, the number of
iterations is set to 15, which leads to an accuracy of 78.28%.
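A comparable sweep over the number of iterations could look as follows (the estimator
parameter is named base_estimator in scikit-learn versions before 1.2, and algorithm="SAMME"
selects the discrete multiclass variant closest to AdaBoost.M1):
-------------------------------------------------------------------------------------------------
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

base = DecisionTreeClassifier(min_samples_leaf=10)   # same settings as the standalone tree
prev = 0.0
for T in [1, 5, 10, 15, 20, 30, 50]:
    booster = AdaBoostClassifier(estimator=base, n_estimators=T,
                                 algorithm="SAMME")  # discrete multiclass boosting
    acc = cross_val_score(booster, X_bal, y_bal, cv=10).mean()
    print(f"T = {T:2d}: accuracy {acc:.3f} (gain {acc - prev:+.3f})")
    prev = acc
-------------------------------------------------------------------------------------------------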
6. Results & Interpretation
The table below shows all relevant performance metrics of the different algorithms.
Table 4 : Experimental results

                              ZeroR    J48     AdaBoost
Accuracy                      12.93%   65.48%  78.33%
Kappa                         0.00     0.60    0.75
Mean absolute error           0.24     0.12    0.07
Root mean squared error       0.35     0.26    0.22
Relative absolute error       100%     48%     27%
Root relative squared error   100%     75%     63%
TP rate                       0.13     0.66    0.78
FP rate                       0.13     0.06    0.04
Precision                     0.02     0.65    0.78
Recall                        0.13     0.66    0.78
F-measure                     0.03     0.65    0.78
ROC area                      0.50     0.89    0.96
From the table it is clear that AdaBoost outperforms the other algorithms in every respect,
with J48 as the runner-up. Although our dataset is balanced, small differences in class size
exist due to the random split. In the training set, the class with a score of 9 is slightly bigger
than the other classes, so ZeroR adopts the rule of always choosing that class. This leads to
an accuracy of merely 12.93% on the test set. J48 performs much better, with an accuracy of
65.48%, which equals an error reduction of 60%. Compared to J48, AdaBoost reduces the
error by another 37%. Since the classes are still quite balanced, the weighted average of the
F-measure approximates the accuracy. In the case of J48 and AdaBoost, the (weighted-
average) precision and recall also approximate the accuracy. ZeroR, on the other hand, shows
a much lower value for the weighted average of precision, since in all but one class the TP
and FP rates are zero.
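Most of the metrics in Table 4 can be recomputed from the predictions; a sketch with
scikit-learn's metrics module, assuming a stratified hold-out split comparable to the test set
used here:
-------------------------------------------------------------------------------------------------
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, precision_recall_fscore_support)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# X_bal, y_bal: the balanced dataset from the preprocessing steps above.
X_tr, X_te, y_tr, y_te = train_test_split(X_bal, y_bal, test_size=0.3,
                                          random_state=1, stratify=y_bal)
model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(min_samples_leaf=10),
    n_estimators=15, algorithm="SAMME").fit(X_tr, y_tr)
pred = model.predict(X_te)

print("accuracy:", accuracy_score(y_te, pred))
print("kappa:   ", cohen_kappa_score(y_te, pred))
p, r, f, _ = precision_recall_fscore_support(y_te, pred, average="weighted")
print(f"precision={p:.2f}  recall={r:.2f}  F-measure={f:.2f}")
print(confusion_matrix(y_te, pred))   # cf. the confusion matrices in appendix B
-------------------------------------------------------------------------------------------------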
Although the ROC area is especially useful for unbalanced datasets, here it confirms the other
metrics, with the area for AdaBoost coming close to 1, which indicates an excellent predictor.
J48, with an ROC area just below 0.9, shows a good predictor. The confusion matrices can be
found in appendix B. They show that for the class with score 9, very few errors are made by
J48 and AdaBoost: for J48, no instances from this class are misclassified as having lower
scores, and AdaBoost misclassifies only 3 instances in this way. Also, only instances with
scores from 6 to 8 are occasionally misclassified as belonging to the highest class. This shows
that the models perform better on good wines.