Bagging Decision Trees on Data Sets
with Classification Noise
Joaquín Abellán and Andrés R. Masegosa
Department of Computer Science and Artificial Intelligence
University of Granada
Sofia, February 2010
6th International Symposium on Foundations of Information and Knowledge Systems
Part I
Introduction
Ensembles of Decision Trees (DT)
Features
They usually build different DT for different samples of the training dataset.
The final prediction is a combination of the individual predictions of each tree.
Take advantage of the inherent instability of DT.
Bagging, AdaBoost and Randomization are the best-known approaches.
Classification Noise (CN) in the class values
Definition
The class values of the samples given to the learning algorithm have some
errors.
Random classification noise: the label of each example is flipped randomly
and independently with some fixed probability called the noise rate.
Causes
It is mainly due to errors in the data capture process.
Very common in real world applications: surveys, biological or medical
information...
Effects on ensembles of decision trees
The presence of classification noise degrades the performance of any
classification inducer.
AdaBoost is known to be very affected by the presence of classification noise.
Bagging is the ensemble approach with the best response to classification
noise, especially with C4.5 [21] as the decision tree inducer.
Motivation of this study
Previous Works [7]
Simple decision trees (no continuous attributes, no missing values, no
post-pruning) built with different split criteria were considered in a Bagging
scheme.
Classic split criteria (InfoGain, InfoGain Ratio and Gini Index) and a new split
criterion based on imprecise probabilities were analyzed.
Imprecise Info-Gain generates the most robust Bagging ensembles in data
sets with classification noise.
Contributions
An extension of Bagging ensembles of credal decision trees to deal with:
Continuous Variables
Missing Data
Post-Pruning Process
Evaluate the performance on data sets with different rates of random
classification noise.
Experimental comparison with Bagging ensembles built with the C4.5R8 decision
tree inducer.
Part II
Previous Knowledge
Decision Trees
Description
Attributes are placed at the nodes.
Class value predictions are placed at
the leaves.
Each leaf corresponds to a
classification decision rule.
Learning
Split Criteria selects the attribute to place at each branching node.
Stop Criteria decides when to fix a leaf and stop the branching.
Post-Pruning simplifies the tree by pruning those branches with low support in
their associated decision rules.
C4.5 Tree Inducer
Description
It is the most famous tree inducer, introduced by Quinlan in 1993 [21].
Eight different releases have been proposed; in this work, we consider the last
one, C4.5R8.
The most influential data mining algorithm (IEEE ICDM06).
Features
Split Criteria: the Info-Gain Ratio, i.e. the quotient between the information gain of
an attribute and its entropy (a minimal sketch follows this list).
Numeric Attributes: It looks for the optimal binary split point in terms of
information gain.
Missing Values: assumes the Missing at Random hypothesis and marginalizes over
the missing variable when making predictions.
Post-Pruning: It employs the pessimistic error pruning: computes an upper
bound of the estimated error and when the bound of a leaf is higher than the
bound of its ancestor, this leaf is pruned.
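As a rough illustration of the Info-Gain Ratio score mentioned above, here is a minimal Python sketch for a discrete attribute. The function names are illustrative; this is not part of C4.5's actual implementation.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of an array of labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain_ratio(attribute, labels):
    """Info-Gain Ratio for a discrete attribute: information gain divided by
    the entropy of the attribute itself (the split information)."""
    attribute, labels = np.asarray(attribute), np.asarray(labels)
    values, counts = np.unique(attribute, return_counts=True)
    weights = counts / counts.sum()
    conditional = sum(w * entropy(labels[attribute == v])
                      for v, w in zip(values, weights))
    gain = entropy(labels) - conditional
    split_info = entropy(attribute)
    return gain / split_info if split_info > 0 else 0.0
```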
Bagging Decision Trees
Procedure
Ti samples are generated by
random sampling with
replacement from the initial training
dataset.
From each Ti sample, a simple
decision tree is built using a given
split criterion.
The final prediction is made by a
majority voting criterion (see the sketch below).
Motivation
Employing different decision tree inducers (C4.5, CART,...) we get different
Bagging ensembles.
Bagging ensembles have been reported to be one of the best classification
models when there are noisy data samples [10,13,18].
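A minimal sketch of the Bagging procedure described on this slide, using scikit-learn's DecisionTreeClassifier as a stand-in base inducer (the experiments in this work use C4.5 and credal trees instead). X and y are assumed to be NumPy arrays with integer-coded class labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_trees=100, seed=0):
    """Train n_trees decision trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # sampling with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Combine the individual predictions by majority vote."""
    votes = np.array([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```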
Part III
Bagging Credal Decision Trees
Credal Decision Trees (CDT) Inducer
New Features
Handles numeric attributes and the presence of
missing values in the data set.
Adaptation of C4.5 methods.
Strong simplification of the algorithm, with far fewer
parameters.
Takes advantage of the good properties of the Imprecise
Information Gain [2] measure against overfitting.
Imprecise Information Gain (Imprecise-IG) measure
Maximum Entropy Function
Probability intervals for multinomial
variables are computed from the
data set using Walley’s IDM [22].
We then obtain a credal set of
probability distributions for the
class variable, K(C).
The maximum entropy S(K) of a credal set
is efficiently computed using
Abellán and Moral's method [2].
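The following sketch illustrates, under the IDM with parameter s, one way to obtain the maximum-entropy distribution of the credal set K(C): start from the interval lower bounds n_i/(N+s) and pour the remaining mass s/(N+s) onto the least frequent classes until they level up. This is a water-filling reading of the procedure; the actual algorithm in [2] is stated differently.

```python
import numpy as np

def max_entropy_idm(counts, s=1.0):
    """Maximum-entropy distribution in the IDM credal set for class counts.
    Lower bounds are n_i/(N+s); the free mass s/(N+s) is poured onto the
    least frequent classes until they level up (water filling)."""
    counts = np.asarray(counts, dtype=float)
    probs = counts / (counts.sum() + s)          # interval lower bounds
    mass = s / (counts.sum() + s)                # mass left to distribute
    order = np.argsort(probs)
    i = 0
    while mass > 1e-12 and i < len(order) - 1:
        gap = probs[order[i + 1]] - probs[order[i]]
        need = gap * (i + 1)                     # mass needed to level the bottom group up
        if need >= mass:
            probs[order[:i + 1]] += mass / (i + 1)
            mass = 0.0
        else:
            probs[order[:i + 1]] += gap
            mass -= need
            i += 1
    if mass > 1e-12:                             # every class already level
        probs += mass / len(probs)
    return probs

def max_entropy_value(counts, s=1.0):
    """S(K(C)): entropy (base 2) of the maximum-entropy distribution."""
    p = max_entropy_idm(counts, s)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))
```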
Imprecise-IG Split criteria
Imprecise Info-Gain for each variable X is defined as:
IIG(X, C) = S(K(C)) − Σ_i p(x_i) · S(K(C | X = x_i))
It was successfully applied to build simple decision trees in [3].
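Reusing max_entropy_value from the sketch above, the Imprecise-IG score can be computed as follows; variable and function names are illustrative.

```python
import numpy as np

def imprecise_info_gain(attribute, labels, classes, s=1.0):
    """IIG(X, C) = S(K(C)) - sum_i p(x_i) * S(K(C | X = x_i))."""
    attribute, labels = np.asarray(attribute), np.asarray(labels)
    root_counts = np.array([(labels == c).sum() for c in classes])
    iig = max_entropy_value(root_counts, s)
    for v in np.unique(attribute):
        mask = attribute == v
        branch_counts = np.array([(labels[mask] == c).sum() for c in classes])
        iig -= mask.mean() * max_entropy_value(branch_counts, s)
    return iig     # can be negative, which is what the stop criterion exploits
```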
Split Criteria
Credal DT
The attribute with maximum Imprecise-IG score is
selected as split attribute at each branching node.
C4.5R8
There are multiple conditions and heuristics:
Maximum Info-Gain Ratio score.
IG score higher than the average value of the valid split
attributes.
Valid split attributes are those whose number of values is
smaller than 30% of the number of samples in that branch.
Stop Criteria
Credal DT
All split attributes have negative
Imprecise-IG.
Minimum number of instances in a
leaf higher than 2.
C4.5R8
The Info-Gain measure is always positive, so it cannot trigger a stop by itself.
There are multiple conditions and heuristics:
Minimum number of instances in a leaf higher than 2.
When there is no valid split attribute.
Numeric Attributes
Credal DT
Each possible split point is evaluated; the one which generates the
bi-partition with the highest Imprecise-IG score is selected (see the sketch at the end of this slide).
C4.5R8
Same approach: split point with the maximum Info-Gain.
There are extra restrictions and heuristics:
Minimum n. of instances: 10% of the ratio between the number of
instances in this branch and the cardinality of the class variable.
Info-Gain is corrected by subtracting the logarithm of the number of
evaluated split points divided by the number of instances in this branch.
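A sketch of the binary split-point search referenced above: candidate thresholds are midpoints between consecutive distinct values, and the threshold whose bi-partition gets the highest score is kept. The imprecise_info_gain sketch from the previous slide is assumed as the score; C4.5 would plug in Info-Gain plus its extra corrections.

```python
import numpy as np

def best_binary_split(values, labels, classes, score=imprecise_info_gain):
    """Evaluate every midpoint between consecutive distinct values of a
    numeric attribute and return the threshold with the highest score."""
    v, y = np.asarray(values, dtype=float), np.asarray(labels)
    distinct = np.unique(v)                       # sorted distinct values
    best_threshold, best_score = None, -np.inf
    for a, b in zip(distinct[:-1], distinct[1:]):
        threshold = (a + b) / 2.0
        partition = (v > threshold).astype(int)   # induced two-valued attribute
        s = score(partition, y, classes)
        if s > best_score:
            best_threshold, best_score = threshold, s
    return best_threshold, best_score
```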
Post-Pruning
Credal DT
Reduced Error Pruning [20].
The simplest pruning process.
2 folds are used to build the tree and 1 fold to estimate the test error.
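A toy sketch of Reduced Error Pruning on a nested-dict tree representation (the node format, field names and helper functions are assumptions of this sketch, not of the paper): subtrees are replaced bottom-up by majority-class leaves whenever that does not increase the error on the held-out pruning fold. Internal nodes look like {'attr', 'children', 'majority'}, leaves like {'label'}; y_prune is a NumPy array, and every branch value seen in the pruning fold is assumed to exist in the tree.

```python
import numpy as np

def predict_one(node, x):
    """Walk the nested-dict tree until a leaf is reached."""
    while 'label' not in node:                 # internal node
        node = node['children'][x[node['attr']]]
    return node['label']

def n_errors(node, X, y):
    return sum(predict_one(node, xi) != yi for xi, yi in zip(X, y))

def reduced_error_pruning(node, X_prune, y_prune):
    """Bottom-up REP: replace a subtree by a majority-class leaf whenever this
    does not increase the error on the held-out pruning fold."""
    if 'label' in node or len(y_prune) == 0:   # leaf, or no data reaches here
        return node
    for value, child in node['children'].items():
        mask = np.array([x[node['attr']] == value for x in X_prune])
        node['children'][value] = reduced_error_pruning(
            child, [x for x, m in zip(X_prune, mask) if m], y_prune[mask])
    leaf = {'label': node['majority']}         # majority class seen at training time
    if n_errors(leaf, X_prune, y_prune) <= n_errors(node, X_prune, y_prune):
        return leaf
    return node
```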
Part IV
Experiments
Experimental Set-up
Benchmark
25 UCI data sets with very different features.
Bagging ensembles of 100 trees.
Bagging-CDT versus Bagging-C4.5R8.
Different noise rates were applied to the training data sets (not to the test data sets):
0%, 5%, 10%, 20% and 30% (see the sketch below).
10-fold cross-validation repeated 10 times was used to estimate the
classification accuracy.
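A sketch of the random classification noise used to corrupt the training sets: each label is replaced, independently and with probability equal to the noise rate, by a different class chosen uniformly at random. The exact protocol in the paper may differ in details.

```python
import numpy as np

def add_classification_noise(y, noise_rate, seed=0):
    """Return a copy of y where each label is replaced, with probability
    noise_rate, by a different class drawn uniformly at random."""
    rng = np.random.default_rng(seed)
    y_noisy = np.asarray(y).copy()
    classes = np.unique(y_noisy)
    flip = rng.random(len(y_noisy)) < noise_rate
    for i in np.flatnonzero(flip):
        others = classes[classes != y_noisy[i]]
        y_noisy[i] = rng.choice(others)
    return y_noisy

# e.g. train_labels_10 = add_classification_noise(train_labels, 0.10)
```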
Statistical Tests [12,24]
Corrected Paired T-test.
Wilcoxon Signed-Ranks Test.
Sign Test.
Friedman Test.
Nemenyi Test.
Performance Evaluation without tree post-pruning
Analysis
There are no statistically significant differences at low noise levels.
B-CDT outperforms B-C4.5 ensembles at high noise levels.
B-CDT always induces simpler decision trees (lower number of nodes).
Performance Evaluation with tree post-pruning
Analysis
Post-pruning methods help to improve the performance when there is noise in
the data.
The performance of B-C4.5 does not degrade as quickly.
B-CDT also has better performance at high noise levels.
B-CDT also induces simpler trees.
Bias-Variance (BV) Error Analysis
Bias-Variance Decomposition of 0-1 Loss Functions [17]
Error = Bias² + Variance
Bias: Error component due to the incapacity of the predictor to model the
underlying distribution.
Variance: Error component that stems from the particularities of the training data
set (i.e. measure of overfitting).
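A rough sketch of how such a bias-variance decomposition of the 0-1 loss can be estimated empirically: train the classifier on many bootstrap training sets, take the majority ("main") prediction per test point, and read bias from the main prediction's error and variance from the disagreement with it. This follows the spirit of Kohavi-Wolpert-style estimators; the exact estimator used in [17] may differ. The fit_predict callable and integer-coded labels are assumptions of this sketch.

```python
import numpy as np

def bias_variance_01(fit_predict, X_train, y_train, X_test, y_test,
                     n_rounds=50, seed=0):
    """Estimate 0-1 loss bias/variance terms over bootstrap training sets.
    fit_predict(Xtr, ytr, Xte) -> predicted labels is assumed by this sketch."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(X_train), size=len(X_train))
        preds.append(fit_predict(X_train[idx], y_train[idx], X_test))
    preds = np.array(preds)                          # shape: (n_rounds, n_test)
    # main prediction: majority vote over training rounds for each test point
    main = np.array([np.bincount(col.astype(int)).argmax() for col in preds.T])
    bias = np.mean(main != y_test)                   # main prediction is wrong
    variance = np.mean(preds != main[None, :])       # disagreement with the main prediction
    return bias, variance
```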
BV Error Differences: No pruning
Analysis
Bias differences remain stable across the different noise levels.
Variance differences increase with the noise rate.
B-CDT does not overfit as much as B-C4.5 (lower variance error) at high noise levels.
The large number of heuristics in C4.5 may not help when there is
spurious data.
Part V
Conclusions and Future Works
Conclusions
We have presented an interesting application of information-based
uncertainty measures to a challenging data-mining problem.
A very simple decision tree inducer that handles numeric attributes and deals
with missing values has been proposed.
An extensive experimental evaluation has been carried out to analyze the effect
of classification noise on Bagging ensembles.
Bagging with decision trees induced by the Imprecise-IG measure has better
performance and less overfitting at medium-to-high noise levels.
The Imprecise-IG is a robust split criterion for building Bagging ensembles of
decision trees.
Future Works
Develop a new pruning method based on imprecise probabilities.
Extend these methods to carry out credal classification.
Apply new imprecise models such as the NPI model.
Thanks for your attention!!
Questions?