Decision Tree Induction Using Impurity Measures
1. Introduction

Decision trees are one of the methods for concept learning from examples. They are widely

used in machine learning and knowledge acquisition systems. The main application area is

classification:

We are given a set of records, called the training set. Each record from the training set has the same

structure, consisting of a number of attribute/value pairs. One of these attributes represents the class

of the record. We also have a test set for which the class information is unknown. The problem is

to derive a decision tree, using the examples from the training set, which will determine the class of each

record in the test set.

The leaves of the induced decision tree are class names, and the other nodes represent attribute-based tests,

with a branch for each possible value of the particular attribute.

Once the tree is formed, we can classify objects from the test set: starting at the root of the tree, we

evaluate the test and take the branch appropriate to the outcome. The process continues until a leaf

is encountered, at which point the object is asserted to belong to the class named by the leaf.
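
As an illustration of this classification procedure, here is a minimal Python sketch (the nested-dictionary tree representation and the example record are my own assumptions, not a format used by the systems discussed below); the example tree corresponds to Figure 1 in the next section:

```python
# Minimal sketch of classifying a record with a decision tree.
# The nested-dictionary representation is an illustrative assumption.

def classify(tree, record):
    """Walk the tree from the root, following the branch that matches the
    record's value for the tested attribute, until a leaf (class name) is hit."""
    while isinstance(tree, dict):
        attribute = tree["attribute"]            # attribute tested at this node
        tree = tree["branches"][record[attribute]]
    return tree                                  # a leaf, i.e. a class name

# The simple tree of Figure 1 (outlook tested at the root).
figure1_tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "N", "normal": "P"}},
        "overcast": "P",
        "rain": {"attribute": "windy",
                 "branches": {"true": "N", "false": "P"}},
    },
}

record = {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "false"}
print(classify(figure1_tree, record))            # -> N (object 1 of Table 1)
```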

Induction of decision trees has been a very active area of machine learning, and many approaches

and techniques have been developed for building trees with high classification performance. The

most commonly addressed problems are:

       •selecting the best attribute for splitting

       •dealing with noise in real-world tasks

       •pruning of complex decision trees

       •dealing with unknown attribute values

       •dealing with continuous attribute values.


My intention is to give an overview of methods addressing the first three problems, which I find

to be strongly mutually dependent and of primary importance for building trees with good

classification ability.




2. Selection Criterion

For a given training set it is possible to construct many decision trees that will correctly classify

all of its objects. Among all accurate trees we are interested in the simplest one. This search is

guided by the Occam’s Razor heuristic: among all rules that accurately account for the training set,

the simplest is likely to have the highest success rate when used to classify unseen objects. This

heuristic is also supported by analysis: Pearl and Quinlan [11] have derived upper bounds on the

expected error using different formalisms for generalizing from a set of known cases. For a

training set of predetermined size, these bounds increase with the complexity of the induced

generalization.

Since a decision tree is made of nodes that represent attribute-based tests, simplifying the tree means

reducing the number of tests. We can achieve this by carefully selecting the order of the tests to

be conducted. The example given in [11] shows how, for the same training set given in Table 1,

different decision trees may be constructed. Each of the examples in the training set is described in

terms of 4 discrete attributes: outlook {sunny, overcast, rain}, temperature {cool, mild, hot},

humidity {high, normal}, windy {true, false}, and each belongs to one of the classes N or P.

Figure 1 shows the decision tree when the attribute outlook is used for the first test, and Figure 2 shows

the decision tree with temperature as the first test. The difference in complexity is obvious.




No.   Outlook    Temperature   Humidity   Windy   Class
 1    sunny      hot           high       false   N
 2    sunny      hot           high       true    N
 3    overcast   hot           high       false   P
 4    rain       mild          high       false   P
 5    rain       cool          normal     false   P
 6    rain       cool          normal     true    N
 7    overcast   cool          normal     true    P
 8    sunny      mild          high       false   N
 9    sunny      cool          normal     false   P
10    rain       mild          normal     false   P
11    sunny      mild          normal     true    P
12    overcast   mild          high       true    P
13    overcast   hot           normal     false   P
14    rain       mild          high       true    N

Table 1. A small training set




Figure 1. A simple decision tree: outlook is tested at the root; the sunny branch tests humidity (high → N, normal → P), the overcast branch is the pure leaf P, and the rain branch tests windy (true → N, false → P).

Figure 2. A complex decision tree: temperature is tested at the root, and every branch requires further tests on outlook, windy and humidity before a class is reached.



This shows that the choice of test is crucial for the simplicity of the decision tree, a point on which many

researchers, such as Quinlan [11], Fayyad [5], and White and Liu [16], agree.

A method of choosing a test to form the root of a decision tree is usually referred to as the selection

criterion. Many different selection criteria have been tested over the years, and the most common

ones among them are maximum information gain and the GINI index. Both of these methods belong

to the class of impurity measures, which are designed to capture aspects of the partitioning of

examples relevant to good classification.



Impurity measures

Let S be a set of training examples, with each example e ∈ S belonging to one of the classes in

C = {C1, C2, …, Ck}. We can define the class vector (c1, c2, …, ck) ∈ N^k, where ci = |{e ∈ S |

class(e) = Ci}|, and the class probability vector (p1, p2, …, pk) ∈ [0, 1]^k:

    (p1, p2, …, pk) = (c1/|S|, c2/|S|, …, ck/|S|)

It’s obvious that ∑ pi = 1.

A set of examples is said to be pure if all its examples belong to one class. Hence, if the class probability

vector of a set of examples has a component equal to 1 (all other components being equal to 0), the set is

pure. On the other hand, if all components are equal we get an extreme case of

impurity.

To quantify the notion of impurity, a family of functions known as impurity measures [5] is

defined.


Definition 1 Let S be a set of training examples having a class probability vector PC. A function

φ : [0, 1]^k → R such that φ(PC) ≥ 0 is an impurity measure if it satisfies the following

conditions:

1.   φ (PC) is minimum if ∃i such that component PCi = 1.

2.   φ (PC) is maximum if ∀i, 1 ≤ i ≤ k, PCi = 1/k.

3.   φ (PC) is symmetric with respect to components of PC.

4.   φ (PC) is smooth (differentiable everywhere) in its range.

Conditions 1 and 2 express the well-known extreme cases, and condition 3 ensures that the measure is

not biased towards any of the classes.

For induction of decision trees, an impurity measure is used to evaluate the impurity of the partition induced

by an attribute.

Let PC(S) be the class probability vector of S and let A be a discrete attribute over the set S. Let

us assume the attribute A partitions the set S into the sets S1, S2, …, Sv. The impurity of the partition is

defined as the weighted average impurity of its component blocks:

    Φ(S, A) = ∑_{i=1}^{v} (|Si| / |S|) · φ(PC(Si))

Finally, the goodness-of-split due to attribute A is defined as the reduction in impurity after the

partition:

    ∆Φ(S, A) = φ(PC(S)) − Φ(S, A)

If we choose the entropy of the partition, E(A, S), as the impurity measure:

    φ(PC(S)) = E(A, S) = ∑_{i=1}^{k} −PCi · log2(PCi)

then the reduction in impurity gained by an attribute is called information gain.
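
As a concrete sketch of these definitions (the helper names and the example representation – a list of (attribute-values dict, class label) pairs – are my own, not from the paper), entropy and information gain can be computed as follows:

```python
from collections import Counter
from math import log2

def entropy(examples):
    """phi(PC(S)): entropy of the class distribution of a set of examples,
    where each example is an (attribute_values, class_label) pair."""
    counts = Counter(cls for _, cls in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    """Reduction in impurity when the set is partitioned on `attribute`:
    phi(PC(S)) minus the weighted average impurity of the blocks."""
    total = len(examples)
    blocks = {}
    for values, cls in examples:
        blocks.setdefault(values[attribute], []).append((values, cls))
    weighted_impurity = sum(len(b) / total * entropy(b) for b in blocks.values())
    return entropy(examples) - weighted_impurity
```

Applied to the training set of Table 1, this ranks outlook above temperature, which is exactly why the tree of Figure 1 is simpler than the tree of Figure 2.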




This method was used in many algorithms for induction of decision trees such as ID3, GID3* and

CART.

The other popular impurity measure is the GINI index used in CART [2]. To obtain the GINI index we

set φ to be

    φ(PC(S)) = ∑_{i≠j} PCi · PCj
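
A corresponding sketch of the GINI index, using the same example representation and the identity ∑_{i≠j} PCi · PCj = 1 − ∑_i PCi² (the helper name is mine):

```python
from collections import Counter

def gini(examples):
    """GINI impurity of a set of examples: sum over i != j of PC_i * PC_j,
    computed here through the equivalent form 1 - sum_i PC_i**2."""
    counts = Counter(cls for _, cls in examples)
    total = len(examples)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())
```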


All functions belonging to the family of impurity measures agree on the minima, maxima and

smoothness, and as a consequence they should result in similar trees [2], [9] regarding complexity

and accuracy. After a detailed analysis, Breiman [3] reports basic differences in trees produced

using information gain and the GINI index. The GINI index prefers splits that put the largest class into one

pure node, and all the others into the other. Entropy favors size-balanced child nodes. If the

number of classes is small, both criteria should produce similar results. The difference appears

when the number of classes is larger. In this case GINI produces splits that are too unbalanced near

the root of the tree. On the other hand, splits produced by entropy show a lack of uniqueness.

This analysis points out some of the problems associated with impurity measures. Unfortunately,

these are not the only ones.

Some experiments carried out in the mid eighties showed that the gain criterion tends to favor

attributes with many values [8]. This finding was supported by the analysis in [11]. One of the

solutions to this problem was offered by Kononenko et al. in [8]: the induced decision tree has to be a

binary tree. This means that every test has only two outcomes. If we have an attribute A with

values A1, A2, …, Av, the decision tree no longer branches on each possible value. Instead, a subset

of the attribute's values is chosen, and the tree has one branch for that subset and another for the remainder. This

criterion is known as the subset criterion. Kononenko et al. report that this modification led to

smaller decision trees with improved classification performance. Although it’s obvious that

binary trees don’t suffer from the bias in favor of attributes with a large number of values, it is not

known whether this is the only reason for their better performance.

This finding is repeated in [4], and in [5] Fayyad and Irani introduce the binary tree hypothesis:

for a top-down, non-backtracking decision tree generation algorithm, if the algorithm applies a

proper attribute selection measure, then selecting a single attribute-value pair at each node, and

thus constructing a binary tree, rather than selecting an attribute and branching on all its values

simultaneously, is likely to lead to a decision tree with fewer leaves.

A formal proof of this hypothesis doesn’t exist; it’s a result of informal analysis and empirical

evaluation. Fayyad has also shown in [4] that for every decision tree there exists a binary decision

tree that is logically equivalent to it. This would mean that for every decision tree we could

induce a logically equivalent binary decision tree that is expected to have fewer nodes and to be

more accurate.

But binary trees have some side effects, explained in [11]:

First, this kind of tree is undoubtedly less intelligible to human experts than is ordinarily the

case, with unrelated attribute values being grouped together and with multiple tests on the same

attribute.

Second, the subset criterion can require a large increase in computation, especially for attributes

with many values – for an attribute A with v values there are 2^(v−1) − 1 different ways of specifying the

distinguished subset of attribute values. But since a decision tree is induced only once and then

used for classification, and since computing power is rapidly increasing, this problem seems to

diminish.
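
To make the 2^(v−1) − 1 count concrete, the following sketch (a hypothetical helper, not part of any of the cited systems) enumerates the candidate binary splits of a value set, counting each subset/complement pair only once:

```python
from itertools import combinations

def candidate_subsets(values):
    """All 2**(v-1) - 1 ways of splitting a value set into a distinguished
    subset and its remainder (a split and its mirror image are the same)."""
    values = sorted(values)
    anchor, rest = values[0], values[1:]     # fixing one value avoids duplicates
    splits = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            subset = {anchor, *combo}
            if len(subset) < len(values):    # the full set is not a real split
                splits.append((subset, set(values) - subset))
    return splits

print(len(candidate_subsets(["sunny", "overcast", "rain"])))   # 2**(3-1) - 1 = 3
```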




In [11] Quinlan proposes another method for overcoming the bias in information gain, called the

gain ratio. The gain ratio, GR (originally denoted by Quinlan as IV), normalizes the gain by the

attribute information:

    GR = ∆Φ(S, A) / ( −∑_{i=1}^{v} (|Si| / |S|) · log2(|Si| / |S|) )

The attribute information is used as the normalizing factor because it increases as the

number of possible values increases.

As mentioned in [11], this ratio may not always be defined – the denominator may be zero – or it

may tend to favor attributes for which the denominator is very small. As a solution to this, the

gain ratio criterion selects, from among those attributes with an average-or-better gain, the

attribute that maximizes GR. The experiments described in [14] show an improvement in tree

simplicity and prediction accuracy when the gain ratio criterion is used.
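
Building on the information_gain sketch above (again with illustrative helper names), the gain ratio is simply the gain divided by the attribute information:

```python
from collections import Counter
from math import log2

def split_information(examples, attribute):
    """The attribute information: entropy of the block sizes |S_i| / |S|."""
    total = len(examples)
    sizes = Counter(values[attribute] for values, _ in examples)
    return -sum((n / total) * log2(n / total) for n in sizes.values())

def gain_ratio(examples, attribute):
    si = split_information(examples, attribute)
    if si == 0.0:              # single-valued attribute: the ratio is undefined
        return 0.0
    return information_gain(examples, attribute) / si
```

Quinlan's restriction to attributes with average-or-better gain would then be applied in the surrounding attribute-selection loop.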

There is also another measure, introduced by Lopez de Mantras in [6], called the distance measure, dN:

    1 − dN = ∆Φ(S, A) / ( −∑_{i=1}^{k} ∑_{j=1}^{v} (|Sij| / |S|) · log2(|Sij| / |S|) )

where |Sij| is the number of examples with value aj of attribute A that belong to class Ci.

This is just another attempt at normalizing information gain, in this case with the cell information

(a cell is a subset of S which contains all examples with one attribute value that belong to one

class).

Although both of these normalized measures were claimed to be unbiased, the statistical analysis

in [17] shows that each again favors attributes with a larger number of values. These results also

suggest that, in this respect, information gain is the worst of the three measures, while the gain ratio is

the least biased. Furthermore, their analysis shows that the magnitude of the bias is strongly

dependent on the number of classes, increasing as k increases.



Orthogonality measure

Recently, a conceptually new approach was introduced by Fayyad and Irani in [5]. In their

analysis they give a number of reasons why information entropy, as a representative of the class of

impurity measures, is not suitable for attribute selection.

Consider the following example: a set S of 110 examples belonging to three classes {C1,

C2, C3} whose class vector is (50, 10, 50). Assume that the attribute-value pairs (A, a1) and (A, a2)

induce two binary partitions on S, π1 and π2, shown in Figure 3. We can see that π2 separates the

class C2 from the classes C1 and C3. However, the information gain measure prefers partition π1

(gain = 0.51) over π2 (gain = 0.43).




Partition π1 (produced by a1): blocks (45, 8, 5) and (5, 2, 45); gain = 0.51
Partition π2 (produced by a2): blocks (50, 0, 50) and (0, 10, 0); gain = 0.43
(class counts listed in the order C1, C2, C3)

Figure 3. Two possible binary partitions.




Analysis shows that if π2 is accepted, the subtree under this node has a lower bound of three leaves. On

the other hand, if π1 is chosen, the subtree could minimally have 6 leaves.

Intuitively, if the goal is to generate a tree with a smaller number of leaves, the selection measure

should be sensitive to total class separation – it should separate differing classes from each other

as much as possible while separating as few examples of the same class as possible. The above

example shows that information entropy doesn’t satisfy these demands – it is completely

insensitive to class separation and within-class fragmentation. The only exception is when the

learning problem has exactly two classes: then class purity and class separation become the same.

Another negative property of information gain emphasized in this paper is its tendency to induce

decision trees with near-minimal average depth. The empirical evaluation of this kind of tree

shows that they tend to have a large number of leaves and a high error rate [4].

Another of the deficiencies pointed out is actually embedded in the definition of impurity measures:

their symmetry with respect to the components of PC. This means that a set with a given class

probability vector evaluates identically to another set whose class vector is a permutation of the

first. Thus, if one of the subsets of a set S has a different majority class than the original, but the

distribution of classes is simply permuted, entropy will not detect the change. However, this

change in dominant class is generally a strong indicator that the attribute value is relevant to

classification.

Recognizing the above weaknesses of impurity measures, the authors define the desirable class of selection

measures:

Assuming induction of a binary tree (relying on the binary tree hypothesis) for a training set S and an

attribute A, a test τ on this attribute induces a binary partition of the set S:

S = Sτ ∪ S¬τ, where Sτ = { e ∈ S | e satisfies τ } and S¬τ = S \ Sτ.



Selection measure should satisfy the properties:

1. It is maximum when the classes in Sτ are disjoint from the classes in S¬τ (inter-class

    separation).

2. It is minimum when the class distribution in Sτ is identical to the class distribution in S¬τ.

3. It favors partitions which keep examples of the same class in the same block (intra-class

    cohesiveness).

4. It is sensitive to permutations in the class distribution.

5. It is non-negative, smooth (differentiable), and symmetric with respect to the classes.

This defines a family of measures called C-SEP (for Class SEParation), for evaluating binary

partitions.

One such measure, proposed in this paper and proven to satisfy all requirements of the C-SEP family,

is the orthogonality measure, defined as

    ORT(τ, S) = 1 − cos θ(V1, V2),

where θ(V1, V2) is the angle between the two class vectors V1 and V2 of the blocks Sτ and S¬τ,

respectively.
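
A sketch of ORT for a binary split (class_vector and the block arguments are my own naming):

```python
from collections import Counter
from math import sqrt

def class_vector(examples, classes):
    """Class vector (c_1, ..., c_k) of a block of examples."""
    counts = Counter(cls for _, cls in examples)
    return [counts.get(c, 0) for c in classes]

def ort(block_tau, block_not_tau, classes):
    """ORT(tau, S) = 1 - cos(angle between the class vectors of the two blocks)."""
    v1 = class_vector(block_tau, classes)
    v2 = class_vector(block_not_tau, classes)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    return 1.0 - dot / norm if norm else 0.0
```

For partition π2 of Figure 3 the class vectors (50, 0, 50) and (0, 10, 0) are orthogonal, so ORT = 1; for π1, with (45, 8, 5) and (5, 2, 45), ORT ≈ 0.78, so π2 is preferred.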

The results of empirical comparisons of the orthogonality measure embedded in the O-BTREE system

with the entropy measure used in GID3* (which branches only on a few individual values while

grouping the rest in one default branch), the information gain in ID3, the gain ratio in ID3-IV and the

information gain for induction of binary trees in ID3-BIN, are taken from [5] and given in Figures

4, 5, 6 and 7. In these experiments 5 different data sets were used: RIE (Reactive Ion Etching) –

synthetic data, and the real-world data sets Soybean, Auto, Harr90 and Mushroom. Descriptions of

these sets may be found in [5]. The results are reported as ratios relative to GID3*

performance (GID3* = 1.0 in both cases).

Figure 4. Error ratios for RIE-random domains.

Figure 5. Leaf ratios for RIE-random domains.

Figure 6. Relative ratios of error rates on Soybean, Auto, Harr90 and Mushroom (GID3* = 1).

Figure 7. Ratios of numbers of leaves on Soybean, Auto, Harr90 and Mushroom (GID3* = 1).

Figures 4, 5, 6 and 7 show that the results for the O-BTREE algorithm are almost always superior to

those of the other algorithms.


Conclusion


Until recently most algorithms for the induction of decision trees were using one of the impurity

measures described above. These functions were borrowed from information theory

without any formal analysis of their suitability as a selection criterion. The empirical results were

acceptable, and only small variations of these methods were further tested.



Fayyad’s and Irani’s approach in [5] introduces a completely different family of measures, C-SEP,

for binary partitions. They recognize important properties of a selection measure – inter-class

separation and intra-class cohesiveness – that were not precisely captured by impurity measures. This

is a first step toward a better formalization of the selection criterion, which is necessary for further

improvement of decision trees’ accuracy and simplicity.


3. Noise

When we use decision tree induction techniques in real-world domains we have to expect noisy

data. The description of an object may include attributes based on measurements or subjective

judgement, both of which can give rise to errors in the values of the attributes. Sometimes the

class information itself may contain errors. These defects in the data may lead to two known problems:

       •attribute inadequacy, meaning that even if some examples have identical descriptions in

        terms of attribute values, they don’t belong to the same class. Inadequate attributes are not

        able to distinguish among the objects in the set S.

       •spurious tree complexity, which is the result of the tree induction algorithm trying to fit the

        noisy data into the tree.

Recognizing these two problems, we can define two modifications of the tree-building algorithm if

it is to be able to operate with a noise-affected training set [11]:

       •the algorithm must be able to decide that testing further attributes will not improve the

        predictive accuracy of the decision tree

       •the algorithm must be able to work with inadequate attributes

In [11] Quinlan suggests the chi-square test for stochastic independence as the implementation of

the first modification:


Let S be a collection of objects which belong to one of two classes N and P, and let A be an attribute

with v values that produces subsets {S1, S2, …, Sv} of S, where Si contains pi and ni objects of classes

P and N, respectively, and p and n are the total numbers of objects of classes P and N in S. If the value

of A is irrelevant to the class of an object in S (if the values of A for these objects are just noise, the

values would be expected to be unrelated to the objects’ classes), the expected value pi′ of pi should be

    pi′ = p · (pi + ni) / (p + n)

If ni′ is the corresponding expected value of ni, the statistic

    ∑_{i=1}^{v} [ (pi − pi′)² / pi′ + (ni − ni′)² / ni′ ]

is approximately chi-square with v − 1 degrees of freedom. This statistic can be used to determine

the confidence with which one can reject the hypothesis that A is independent of the class of

objects in S [11].

The tree-building procedure can then be modified to prevent testing any attribute whose

irrelevance cannot be rejected with a very high (e.g., 99%) confidence level. One difficulty with the

chi-square test is that it’s unreliable for very small values of the expectations pi′ and ni′, so the

common practice is to use it only when all expectations are at least 4 [12].
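
A sketch of the test for the two-class case described above (the function and its interface are my own; it only computes the statistic and its degrees of freedom, leaving the comparison against a chi-square table to the caller, and it assumes all expected counts are positive):

```python
def chi_square_statistic(blocks):
    """blocks: list of (p_i, n_i) class counts in the subsets S_1 ... S_v.
    Returns the chi-square statistic and its v - 1 degrees of freedom."""
    p = sum(pi for pi, _ in blocks)          # total objects of class P
    n = sum(ni for _, ni in blocks)          # total objects of class N
    stat = 0.0
    for pi, ni in blocks:
        expected_p = p * (pi + ni) / (p + n)     # p_i' under independence
        expected_n = n * (pi + ni) / (p + n)     # n_i' under independence
        stat += (pi - expected_p) ** 2 / expected_p
        stat += (ni - expected_n) ** 2 / expected_n
    return stat, len(blocks) - 1

# The attribute is only used for a test if independence can be rejected at a
# very high confidence level (e.g. the statistic exceeds the 99th percentile of
# the chi-square distribution with the returned degrees of freedom).
```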

The second modification of the algorithm should cope with inadequate attributes. Quinlan [12] suggests

two possibilities:

       •The notion of the class could be generalized to a continuous value c lying between 0 and 1: if

        the subset of objects at a leaf contained p examples belonging to class P and n examples

        belonging to class N, the choice for c would be

    c = p / (p + n)



In this case a class value of 0.8 would be interpreted as ‘belonging to class P with probability 0.8’.

The classification error is defined as:

                        if the object is really of class N: c;

                        if the object is really of class P: 1 − c.

This method is called the probability method.

       •A voting model could be established: assign all objects to the more numerous class at the

        leaf.

This method is called the majority method.

It can be verified that the first method minimizes the sum of the squares of the classification

errors, while the second one minimizes the sum of absolute errors over the objects in S. If the goal

is to minimize the expected error, the second approach seems more suitable, and the empirical results

shown in [12] confirm this.
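
A small sketch of the two leaf-labelling options (class names P and N follow the text; the function names are mine):

```python
def leaf_probability(p, n):
    """Probability method: the leaf asserts class P with probability c = p / (p + n)."""
    return p / (p + n)

def leaf_majority(p, n):
    """Majority method: assign every object at the leaf to the more numerous class."""
    return "P" if p >= n else "N"

# With p = 4 and n = 1 the probability method reports c = 0.8 ('class P with
# probability 0.8'), while the majority method simply labels the leaf P.
```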

The two suggested modifications were tested on various data sets, with different noise levels

affecting different attributes or the class information; the results are shown in [12]. Quite different

forms of degradation are observed:

       •Destroying class information produces a linear increase in error, reaching 50% error for a

        noise level of 100%, which means that objects would be classified randomly.

       •Noise in a single attribute doesn’t have a dramatic effect, and its impact appears to be directly

        proportional to the importance of the attribute. The importance of an attribute can be defined as

        the average classification error produced if the attribute is deleted altogether from the data.

       •Noise in all attributes together leads to a relatively rapid increase in error, which generally

        reaches a peak and then declines. The appearance of this peak is explained in [11].




These experiments led to a very interesting and unexpected observation given in [11]: for

higher noise levels, the performance of the correct decision tree on corrupted data was found to be

inferior to that of an imperfect decision tree formed from data corrupted to a similar level.

These observations suggest some basic tactics for dealing with noisy data:

       •It is important to eliminate noise affecting the class membership of the objects in the

        training set.

       •It is worthwhile to exclude noisy, less important attributes.

       •The payoff in noise reduction increases with the importance of the attribute.

       •The training set should reflect the noise distribution and level expected when the

        induced decision tree is used in practice.

       •The majority method of assigning classes to leaves is preferable to the probability method.



Conclusion

The methods employed to cope with noise in decision tree induction are mostly based on

empirical results. Although it is obvious that they lead to improvement of the decision trees in the

terms of simplicity and accuracy there is no formal theory to support them. This implies that

laying some theoretical foundation should be necessity in the future.




4. Pruning




In noisy domains, pruning methods are employed to cut back a full-size tree to a smaller one that is

likely to give better classification performance. Decision trees generated from the examples in

the training set are generally overfitted to it and therefore fail to accurately classify unseen examples from a test set.

Techniques used to prune the original tree usually consist of the following steps [15]:

       •generate a set of pruned trees

       •estimate the performance of each of these trees

       •select the best tree.

One of the major issues is what data set will be used to test the performance of the pruned trees.

The ideal situation would be to have the complete set of test examples; only then would we

be able to make the optimal tree selection. However, in practice this is not possible, and it is

approximated with a very large, independent test set, if one is available.

The real problem arises when such a test set is not available. Then the same set used for building the

decision tree has to be used to estimate the accuracy of the pruned trees. Resampling methods, such

as cross-validation, are the principal technique used in these situations. f-fold cross-validation is

a technique which divides the training set S into f blocks of roughly the same distribution; then,

for each block in turn, a classifier is constructed from the cases in the remaining blocks and

tested on the cases in the hold-out block. The error rate of the classifier produced from all the

cases is estimated as the ratio of the total number of errors on the hold-out cases to the total

number of cases. The average error rate from these distinct cross-validations is then a relatively

reliable estimate of the error rate of the single classifier produced from all the cases. 10-fold

cross-validation has proven to be very reliable and is widely used for many different learning

models.
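
A sketch of this estimate (build_tree and classify stand for whatever induction and classification routines are being evaluated; they, the simple shuffling in place of stratified blocks, and the parameter names are all my own assumptions):

```python
import random

def cross_validated_error(examples, build_tree, classify, folds=10, seed=0):
    """Estimate the error rate of the classifier built from all cases by
    averaging the hold-out errors over `folds` train/test splits."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    blocks = [examples[i::folds] for i in range(folds)]
    errors = 0
    for i, hold_out in enumerate(blocks):
        training = [e for j, block in enumerate(blocks) if j != i for e in block]
        tree = build_tree(training)
        errors += sum(1 for values, cls in hold_out if classify(tree, values) != cls)
    return errors / len(examples)
```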

Quinlan in [13] describes three techniques for pruning:

       •cost-complexity pruning,

       •reduced error pruning, and

       •pessimistic pruning.



Cost-Complexity Pruning

This technique was initially described in [2]. It consists of two stages:

       •First, a sequence of trees T0, T1, …, Tk is generated, where T0 is the original decision tree

        and each Ti+1 is obtained by replacing one or more subtrees of Ti with leaves, until the final

        tree Tk is just a leaf.

       •Then, each tree in the sequence is evaluated and one of them is selected as the final

        pruned tree.

A cost-complexity measure is used for the evaluation of a pruned tree T:

if N is the total number of examples classified by T, E is the number of misclassified ones, and

L(T) is the number of leaves in T, then the cost-complexity is defined as the sum

    E/N + α · L(T)

where α is some parameter. Now, let’s suppose that we replace some subtree T* of the tree T with the

best possible leaf. In general, this pruned tree would have M more misclassified examples and

L(T*) − 1 fewer leaves. The original and the pruned tree would have the same cost-complexity if

    α = M / (N · (L(T*) − 1))

To produce Ti+1 from Ti, each non-leaf subtree of Ti is examined to find the one with the minimum value

of α. The one or more subtrees with that value of α are then replaced by their respective best

leaves.


For the second stage of pruning we use an independent test set containing N′ examples to test the

accuracy of the pruned trees. If E′ is the minimum number of errors observed with any Ti, and the

standard error of E′ is given by

    se(E′) = √( E′ · (N′ − E′) / N′ )

then the tree selected is the smallest one whose number of errors does not exceed E′ + se(E′).
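
The two quantities driving this procedure can be written down directly; the sketch below (my own naming, with the number of leaves of each candidate tree assumed to be available as tree.num_leaves) covers the per-subtree α and the final tree selection:

```python
from math import sqrt

def alpha(extra_errors, total_examples, subtree_leaves):
    """alpha at which pruning a subtree with `subtree_leaves` leaves, at the cost
    of `extra_errors` additional misclassifications, leaves cost-complexity equal."""
    return extra_errors / (total_examples * (subtree_leaves - 1))

def select_pruned_tree(candidates, n_test):
    """Second stage: `candidates` is a list of (tree, errors_on_test_set) pairs.
    Pick the smallest tree whose errors do not exceed E' + se(E')."""
    best_errors = min(errors for _, errors in candidates)
    threshold = best_errors + sqrt(best_errors * (n_test - best_errors) / n_test)
    eligible = [(tree, errors) for tree, errors in candidates if errors <= threshold]
    return min(eligible, key=lambda pair: pair[0].num_leaves)[0]
```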



Reduced Error Pruning

This technique is probably the simplest and most intuitive one for finding small pruned trees of

high accuracy. First, the original tree T is used to classify an independent test set. Then, for every

non-leaf subtree T* of T, we examine the change in misclassifications over the test set that

would occur if T* were replaced by the best possible leaf. If the replacement does not increase the

number of errors and T* contains no subtree with the same property, T* is replaced by the leaf. The

process continues until any further replacement would increase the number of errors over the test set.
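
A simplified sketch of this procedure on the dictionary tree representation used earlier (it prunes bottom-up in a single recursive pass, choosing the best leaf from the test cases that reach each subtree; a faithful implementation would iterate the search over the whole tree as described above):

```python
from collections import Counter

def errors(tree, test_cases):
    """Misclassifications made by `tree` (a subtree or a leaf) on `test_cases`."""
    return sum(1 for values, cls in test_cases if classify(tree, values) != cls)

def best_leaf(test_cases):
    """The single class label that misclassifies the fewest of the given cases."""
    counts = Counter(cls for _, cls in test_cases)
    return counts.most_common(1)[0][0] if counts else "P"

def reduced_error_prune(tree, test_cases):
    """Replace a subtree by its best leaf whenever this does not increase the
    number of misclassifications over the test cases that reach it."""
    if not isinstance(tree, dict):                      # already a leaf
        return tree
    attribute = tree["attribute"]
    branches = {}
    for value, subtree in tree["branches"].items():
        reaching = [(v, c) for v, c in test_cases if v[attribute] == value]
        branches[value] = reduced_error_prune(subtree, reaching)
    pruned = {"attribute": attribute, "branches": branches}
    leaf = best_leaf(test_cases)
    return leaf if errors(leaf, test_cases) <= errors(pruned, test_cases) else pruned
```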



Pessimistic Pruning

This technique does not require a separate test set. If a decision tree T was generated from a training set

with N examples and then tested on the same set, we can assume that at some leaf of T there are

K classified examples, of which J are misclassified. The ratio J/K does not provide a reliable

estimate of the error rate of that leaf when unseen objects are classified, since the tree T has been tailored

to the training set. Instead, we can use a more realistic measure known as the continuity correction for the

binomial distribution, in which J is replaced with J + 1/2.




Now, let's consider some subtree T* of T, containing L(T*) leaves and classifying ΣK examples

(sum over all leaves of T*), with ΣJ of them misclassified. According to the above measure it will

misclassify ΣJ + L(T*)/2 unseen cases. If T* is replaced with the best leaf, which misclassifies E

examples from the training set, the new pruned tree will be accepted whenever E + 1/2 is within one

standard error of ΣJ + L(T*)/2 (the standard error is defined as in cost-complexity pruning).

All non-leaf subtrees are examined just once to see whether they should be pruned, and once a

subtree is pruned its subtrees aren’t examined further. This strategy makes this algorithm much

faster than the previous two.
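
The pruning test for a single subtree can be sketched as follows (training-set counts only; the standard error is formed as in cost-complexity pruning, with the corrected error count in the role of E′ and the number of covered cases in the role of N′ – that reading of 'defined as in the cost-complexity pruning' is my interpretation):

```python
from math import sqrt

def should_prune_pessimistic(sum_j, sum_k, num_leaves, leaf_errors):
    """Decide whether to replace subtree T* by its best leaf.
    sum_j       - training cases misclassified over the leaves of T*
    sum_k       - all training cases covered by T*
    num_leaves  - L(T*), the number of leaves of T*
    leaf_errors - E, errors of the best single leaf replacing T*"""
    corrected = sum_j + num_leaves / 2.0             # continuity-corrected errors of T*
    se = sqrt(corrected * (sum_k - corrected) / sum_k)
    return leaf_errors + 0.5 <= corrected + se       # prune if within one standard error
```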

Quinlan compares these three techniques on 6 different domains with both real-world and synthetic

data. The general observation is that the simplified trees are of superior or equivalent accuracy to the

originals, so pruning has been beneficial on both counts. Cost-complexity pruning tends to

produce smaller decision trees than either reduced error or pessimistic pruning, but they are less

accurate than the trees produced by the two other techniques. This suggests that cost-complexity pruning may

be overpruning. On the other hand, reduced error pruning and pessimistic pruning produce trees

with very similar accuracy, but since the latter uses only the training set for pruning and

is more efficient than the former, it can be pronounced the optimal technique among the

three suggested.



OPT Algorithm

While the previously described techniques are used to prune decision trees generated from noisy

data, Bratko and Bohanec in [1] introduce the OPT algorithm for pruning accurate decision trees. The

problem they aim to solve is [1]: given a completely accurate, but complex, definition of a

concept, simplify the definition, possibly at the expense of accuracy, so that the simplified

definition still corresponds to the concept well in general, but may be inaccurate in some details.

So, while the previously mentioned techniques were designed to improve tree accuracy, this one is

designed to reduce the size of a tree whose complexity makes it impractical to be communicated to and understood by the

user.

Bratko's and Bohanec's approach is somewhat similar to the previous pruning algorithms: they

construct a sequence of pruned trees and then select the smallest tree that satisfies some

required accuracy. However, the tree sequence they construct is denser with respect to the number

of leaves:

The sequence T0, T1, ..., Tn is constructed such that

1. n = L(T0) − 1,

2. the trees in the sequence decrease in size by one, i.e., L(Ti) = L(T0) − i for i = 0, 1, ..., n (unless

there is no pruned tree of the corresponding size), and

3. each Ti has the highest accuracy among all the pruned trees of T0 of the same size.

This sequence is called the optimal pruning sequence and was initially suggested by Breiman et al.

[2]. To construct this optimal pruning sequence efficiently, in quadratic (polynomial) time with

respect to the number of leaves of T0, they use dynamic programming. The construction is

recursive in that each subtree of T0 is again a decision tree with its own optimal pruning sequence.

The algorithm starts by constructing the sequences that correspond to small subtrees near the leaves of

T0. These are then combined together, yielding sequences that correspond to larger and larger

subtrees of T0, until the optimal pruning sequence for T0 is finally constructed.

The main advantage of the OPT algorithm is the density of its optimal pruning sequence, which always

contains an optimal tree. The sequences produced by cost-complexity pruning or reduced error

pruning are sparse and can therefore miss some optimal solutions.




One interesting observation derived from the experiments conducted by Bratko and Bohanec is

that, for real-world data, a relatively high accuracy was achieved with relatively small pruned trees

regardless of the technique used for pruning, while that wasn't the case with synthetic data. This is

further evidence of the usefulness of pruning, especially for real-world domains.




Conclusion
Whether we want to improve the classification accuracy of decision trees generated from noisy data

or to simplify accurate but complex decision trees to make them more intelligible to human

experts, pruning has proved to be very successful. Recent papers [], [] suggest there is still some

room left for improvement of the basic and most commonly used techniques described in this section.




5. Summary

The selection criterion is probably the most important aspect that determines the behavior of a top-

down decision tree generation algorithm. If it selects the most important attributes with respect to the class

information near the root of the tree, then any pruning technique can successfully cut off the

branches of class-independent and/or noisy attributes, because they will appear near the leaves

of the tree. Thus, an intelligent selection method, able to recognize the most important

attributes for classification, will initially generate simpler trees and, additionally, will ease the

job of the pruning algorithm.

The main problem of this domain seems to be the lack of a theoretical foundation: many techniques are

still used because of their acceptable empirical evaluation, not because they have been formally

proven to be superior. Development of a formal theory for decision tree induction is necessary for

better understanding of this domain and for further improvement of decision trees’ classification

accuracy, especially for noisy, incomplete, real-world data.




6. References

[1] Bratko, I. & Bohanec, M. (1994). Trading accuracy for simplicity in decision trees, Machine

Learning 15, 223-250.

[2] Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984). Classification and

Regression Trees. Monterey, CA: Wadsworth & Brooks.

[3] Breiman, L. (1996). Technical note: Some properties of splitting criteria, Machine Learning

24, 41-47.

[4] Fayyad, U.M. (1991). On the induction of decision trees for multiple concept learning, PhD

dissertation, EECS Department, The University of Michigan.

[5] Fayyad, U.M. & Irani, R.B. (1993). The attribute selection problem in decision tree generation,

Proceedings of the 10th National Conference on AI, AAAI-92, 104-110, MIT Press.

[6] Lopez de Mantras, R. (1991). A distance-based attribute selection measure for decision tree

induction, Machine Learning 6, 81-92.

[7] Kearns, M. & Mansour, Y. (1998). A fast, bottom-up decision tree pruning algorithm with

near optimal generalization, Submitted.




[8] Kononenko, I. , Bratko, I., Roskar, R. (1984). Experiments in automatic learning of medical

diagnosis rules, Technical Report, Faculty of Electrical Engineering, E.Kardelj University,

Ljubljana.

[9] Mingers, J. (1989). An empirical comparison of selection measures for decision-tree

induction, Machine Learning 3, 319-342.

[10] Schapire, R.E. & Helmbold, D.P. (1995). Predicting nearly as well as the best pruning of a

decision tree, Proceedings of the 8th Annual Conference on Computational Learning Theory,

ACM Press, 61-68.

[11] Quinlan, J.R. (1986). Induction of decision trees, Machine Learning 1, 81-106.

[12] Quinlan, J.R. (1986). The effect of noise on concept learning, Machine Learning: An

artificial intelligence approach, Morgan Kaufmann: San Mateo, CA, 148-166.

[13] Quinlan, J.R. (1987). Simplifying decision trees, International Journal of Man-Machine

Studies, 27, 221-234.

[14] Quinlan, J.R. (1988). Decision trees and multi-valued attributes, Machine Intelligence 11,

305-318.

[15] Weiss, S.M. & Indurkhya, N. (1994). Small sample decision tree pruning, Proceedings of

the 11th International Conference on Machine Learning, Morgan Kaufmann, 335-342.

[16] White, A.P. & Liu, W.Z. (1994). The importance of attribute selection measures in decision

tree induction, Machine Learning 15, 25-41.

[17] White, A.P. & Liu, W.Z. (1994). Technical note: Bias in information-based measures in

decision tree induction, Machine Learning 15, 321-329.




                                                25
26

More Related Content

More from butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Decision Tree Induction Using Impurity Measures

  • 1. 1. Introduction Decision trees are one of the methods for concept learning from the examples. They are widely used in machine learning and knowledge acquisition systems. The main application area are classification tasks: We are given a set of records, called training set. Each record from training set has the same structure, consisting of a number of attribute/value pairs. One of these attributes represents class of the record. We also have a test set for which the class information is unknown. The problem is to derive a decision tree, using examples from training set, which will determine class for each record in the test set. The leaves of induced decision tree are class names and other nodes represent attribute-based tests with a branch for each possible value of particular attribute. Once the tree is formed, we can classify objects from the test set: starting at the root of tree, we evaluate the test, and take the branch appropriate to the outcome. The process continues until leaf is encountered, at which time the object is asserted to belong to class named by the leaf. Induction of decision trees has been very active area of machine learning and many approaches and techniques have been developed for building trees with high classification performance. The most commonly addressed problems are : •selecting of the best attribute for splitting •dealing with noise in real-world tasks •pruning of complex decision trees •dealing with unknown attribute values •dealing with continuous attribute values. 1
  • 2. My intention is to give an overview of methods considering the first three problems, which I find to be very mutually dependant and of prior importance for building of trees with good classification ability. 2. Selection Criterion For a given training set it is possible to construct many decision trees that will correctly classify all of its objects. Among all accurate trees we are interested in the most simple one. This search is guided by Occam’s Razor heuristic: among all rules that accurately account for the training set, the simplest is likely to have the highest success rate when used to classify unseen objects. This heuristics is also supported by analysis: Pearl and Quinlan [11] have derived upper bounds on the expected error using different formalisms for generalizing from a set of known cases. For a training set of predetermined size, these bounds increase with the complexity of induced generalization. Since decision tree is made of nodes that represent attribute-based test, simplifying the tree would mean reducing the number of tests. We can achieve this by carefully selecting the order of tests to be conducted. The example given in [11] shows how for the same training set given in Table 1 different decision trees may be constructed. Each of the examples in training set is described in terms of 4 discrete attributes: outlook {sunny, overcast, rain}, temperature {cold, mild, hot}, humidity {high, normal}, windy {true, false} and each belonging to one of the classes N or P. Figure 1 shows decision tree when attribute outlook is used for the first test and figure 2 shows decision tree with the first test temperature. The difference in complexity is obvious. 2
  • 3. No. Outlook Temperature Humidity Windy Class 1 sunny hot high false N 2 sunny hot high true N 3 overcast hot high false P 4 rain mild high false P 5 rain cool normal false P 6 rain cool normal true N 7 overcast cool normal true P 8 sunny mild high false N 9 sunny cool normal false P 10 rain mild normal false P 11 sunny mild normal true P 3
  • 4. 12 overcast mild high true P 13 overcast hot normal false P 14 rain mild high true N Table1. A small training set outlook sunny overcast rain P humidity windy high normal true false Figure 1. A simple decision tree N P N P temperature sunny windy outlook outlook sunny o’cast rain sunny o’cast rain true false P P N P windy windy humidit humidit y y true false true false high normal high normal N P P N P P windy outlook true false sunny o’cast rain 4 N P N P null
  • 5. Figure 2. A complex decision tree This infers that the choice of test is crucial for simplicity of decision tree, on which many researchers, such as Quinlan [11], Fayyad [5], White and Liu [16], agree. A method of choosing a test to form the root of decision tree is usually referred to as selection criterion. Many different selection criterion have been tested over the years and most common once among them are maximum information gain and GINI index. Both of these methods belong to the class of impurity measures, which are designed to capture aspects of partitioning of examples relevant to good classification. Impurity measures Let S be set of training examples with each example e ∈ S belonging to one of the classes in C = {C1, C2, …, Ck}. We can define the class vector (c1, c2, …, ck) ∈ Nk , where ci = |{e ∈ S | class(e) = Ci}| and class probability vector (p1, p2, …, pk) ∈ [0, 1]k: c1 c 2 c ( p1 , p 2 ,..., p k ) = ( , ,..., 3 ) |S| |S| |S| It’s obvious that Σ pi =1. A set of examples is said to be pure if all its examples belong to one class. Hence, if probability vector of a set of examples has a component 1 (all other components being equal to 0) the set is said to be pure. On the other hand, if all components are equal we get an extreme case of impurity. To quantify the notion of impurity, a family of functions known as impurity measures [5] is defined. 5
  • 6. Definition 1 Let S be a set of training examples having a class probability vector PC. A function φ : [0, 1]k → R such that φ (PC) ≥ 0 is an impurity measure if it satisfies the following conditions: 1. φ (PC) is minimum if ∃i such that component PCi = 1. 2. φ (PC) is maximum if ∀i, 1 ≤ i ≤ k, PCi = 1/k. 3. φ (PC) is symmetric with respect to components of PC. 4. φ (PC) is smooth (differentiable everywhere) in its range. Conditions 1 and 2 express well-known extreme cases, and condition 3 insures that the measure is not biased towards any of the classes. For induction of decision trees, impurity measure is used to evaluate impurity of partition induced by an attribute. Let PC(S) be the class probability vector of S and let A be a discrete attribute over the set S. Let assume the attribute A partition set S into the sets S 1, S2, …, Sv. The impurity of the partition is defined as weighted average impurity on its component blocks: v | Si | ∆Φ( S , A) = ∑ ⋅ φ ( PC ( S i )) i =1 | S | Finally, the goodness-of-split due to attribute A is defined as reduction in impurity after the partition. ∆Φ(S, A) = φ (PC(S)) - Φ(S, A) If we choose entropy of the partition, E(A,S), as an impurity measure: k φ (PC(S)) = E(A, S) = ∑ − PC i =1 i ⋅ log 2 ( PC i ) than the reduction in impurity gained by an attribute is called information gain. 6
  • 7. This method was used in many algorithms for induction of decision trees such as ID3, GID3* and CART. The other popular impurity measure is GINI index used in CART [2]. To obtain GINI index we set φ to be φ (PC(S)) = ∑ PC i≠ j i ⋅ PC j All functions belonging to family of impurity measures agree on the minima, maxima and smoothness and as a consequence they should result in similar trees [2], [9] regarding complexity and accuracy. After detailed analysis Breiman [3] reports basic differences in trees produced using information gain and GINI index. The GINI prefers splits that put the largest class into one pure node, and all others into the other. Entropy favors size–balanced children nodes. If the number of classes is small both criterions should produce similar results. The difference appears when number of classes is larger. In this case GINI produces splits that are too unbalanced near to the root of the tree. On the other hand, splits produced by entropy show a lack of uniqueness. This analysis point out some of the problems associated with impurity measures. But, unfortunately, these are not the only ones. Some experiments carried out in the mid eighties showed that the gain criterion tends to favor attributes with many values [8]. This finding was supported by analysis in [11]. One of the solutions to this problem was offered by Kononenko et al. in [8]: decision tree induced has to be binary tree. This means that every test has only two outcomes. If we have an attribute A with values A1, A2, …, Av the decision tree no longer branches on each possible value. Instead, a subset of S is chosen and the tree has one branch for that subset and the other for remainder for S. This criterion is known as subset criterion. Kononenko et al. report that this modification led to smaller decision trees with an improved classification performance. Although, it’s obvious that 7
binary trees do not suffer from the bias in favor of attributes with a large number of values, it is not known whether this is the only reason for their better performance. This finding is repeated in [4], and in [5] Fayyad and Irani introduce the binary tree hypothesis:

For a top-down, non-backtracking decision tree generation algorithm, if the algorithm applies a proper attribute selection measure, then selecting a single attribute-value pair at each node, and thus constructing a binary tree, rather than selecting an attribute and branching on all its values simultaneously, is likely to lead to a decision tree with fewer leaves.

A formal proof for this hypothesis does not exist; it is the result of informal analysis and empirical evaluation. Fayyad has also shown in [4] that for every decision tree there exists a binary decision tree that is logically equivalent to it. This means that for every decision tree we could induce a logically equivalent binary decision tree that is expected to have fewer nodes and to be more accurate. But binary trees have some side-effects, explained in [11]. First, trees of this kind are undoubtedly less intelligible to human experts than is ordinarily the case, with unrelated attribute values being grouped together and with multiple tests on the same attribute. Second, the subset criterion can require a large increase in computation, especially for attributes with many values: for an attribute A with v values there are 2^(v-1) − 1 different ways of specifying the distinguished subset of attribute values. But since a decision tree is induced only once and then used for classification, and since computer efficiency is rapidly increasing, this problem seems to diminish.
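The count 2^(v-1) − 1 is easy to verify by enumeration. The sketch below (purely illustrative) lists the distinct binary splits of an attribute's value set, fixing one value in the left block so that a subset and its complement are not counted twice.

```python
from itertools import combinations

def binary_subset_splits(values):
    """All distinct binary splits of an attribute's value set; for v values
    there are 2**(v-1) - 1 of them."""
    values = list(values)
    anchor, rest = values[0], values[1:]     # fix one value on the left to avoid duplicates
    splits = []
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = {anchor, *extra}
            right = set(values) - left
            if right:                        # skip the trivial split with an empty block
                splits.append((left, right))
    return splits

print(len(binary_subset_splits(["sunny", "overcast", "rain"])))   # 3 = 2**(3-1) - 1
```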
In [11] Quinlan proposes another method for overcoming the bias in information gain, called the gain ratio. The gain ratio, GR, normalizes the gain with the attribute information (denoted IV by Quinlan):

$$GR = \frac{\Delta\Phi(S, A)}{-\sum_{i=1}^{v} \frac{|S_i|}{|S|} \cdot \log_2 \frac{|S_i|}{|S|}}$$

The attribute information is used as the normalizing factor because of its property of increasing as the number of possible values increases. As mentioned in [11], this ratio may not always be defined – the denominator may be zero – or it may tend to favor attributes for which the denominator is very small. As a solution to this, the gain ratio criterion selects, from among those attributes with an average-or-better gain, the attribute that maximizes GR. The experiments described in [14] show an improvement in tree simplicity and prediction accuracy when the gain ratio criterion is used.

There is also another measure, introduced by Lopez de Mantras in [6], called the distance measure, dN:

$$1 - d_N = \frac{\Delta\Phi(S, A)}{-\sum_{i=1}^{k} \sum_{j=1}^{v} \frac{|S_{ij}|}{|S|} \cdot \log_2 \frac{|S_{ij}|}{|S|}}$$

where |Sij| is the number of examples with value aj of attribute A that belong to class Ci. This is just another attempt at normalizing the information gain, but in this case with the cell information (a cell is a subset of S which contains all examples with one attribute value that belong to one class). Although both of these normalized measures were claimed to be unbiased, the statistical analysis in [17] shows that each of them again favors attributes with a larger number of values. These results also suggest that information gain is the worst measure in this respect compared to gain ratio and the distance measure, while gain ratio is the least biased. Furthermore, their analysis shows that the magnitude of the bias is strongly dependent on the number of classes, increasing as k is increased.
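Building on the information_gain sketch given earlier, the two normalizations can be written as follows; the helper names are my own, and both functions assume the attribute actually splits S into at least two blocks, otherwise the denominators are zero (the degenerate case mentioned above).

```python
import math
from collections import Counter
# information_gain and the (attributes, label) example layout are as in the earlier sketch

def attribute_information(examples, attribute):
    """-sum_i |Si|/|S| * log2(|Si|/|S|) over the value blocks of the attribute."""
    n = len(examples)
    sizes = Counter(attrs[attribute] for attrs, _ in examples)
    return -sum((s / n) * math.log2(s / n) for s in sizes.values())

def gain_ratio(examples, attribute):
    """Gain ratio: information gain normalized by the attribute information."""
    return information_gain(examples, attribute) / attribute_information(examples, attribute)

def cell_information(examples, attribute):
    """-sum_ij |Sij|/|S| * log2(|Sij|/|S|) over the (value, class) cells."""
    n = len(examples)
    cells = Counter((attrs[attribute], label) for attrs, label in examples)
    return -sum((c / n) * math.log2(c / n) for c in cells.values())

def one_minus_distance(examples, attribute):
    """1 - d_N: information gain normalized by the cell information."""
    return information_gain(examples, attribute) / cell_information(examples, attribute)
```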
Orthogonality measure

Recently, a conceptually new approach was introduced by Fayyad and Irani in [5]. In their analysis they give a number of reasons why information entropy, as a representative of the class of impurity measures, is not suitable for attribute selection. Consider the following example: a set S of 110 examples belongs to three classes {C1, C2, C3} and has class vector (50, 10, 50). Assume that the attribute-value pairs (A, a1) and (A, a2) induce two binary partitions of S, π1 and π2, shown in Figure 3. We can see that π2 completely separates the class C2 from the classes C1 and C3. However, the information gain measure prefers partition π1 (gain = 0.51) over π2 (gain = 0.43).

Figure 3. Two possible binary partitions (class counts for C1, C2, C3):
  Partition π1, produced by a1: blocks (45, 8, 5) and (5, 2, 45); gain = 0.51
  Partition π2, produced by a2: blocks (50, 0, 50) and (0, 10, 0); gain = 0.43
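The gains quoted in Figure 3 can be checked directly from the class-count vectors. The short computation below reproduces them up to rounding (about 0.51 for π1 and about 0.44 for π2, the latter quoted as 0.43 in [5]).

```python
import math

def entropy_of_counts(counts):
    """Entropy of a class-count vector such as (50, 10, 50)."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def gain(parent, blocks):
    """Information gain of splitting the parent class counts into the given blocks."""
    n = sum(parent)
    return entropy_of_counts(parent) - sum(sum(b) / n * entropy_of_counts(b) for b in blocks)

S = (50, 10, 50)                                    # class vector of S for (C1, C2, C3)
print(gain(S, [(45, 8, 5), (5, 2, 45)]))            # partition pi_1: ~0.51
print(gain(S, [(50, 0, 50), (0, 10, 0)]))           # partition pi_2: ~0.44 (0.43 in [5])
```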
Analysis shows that if π2 is accepted, the subtree under this node has a lower bound of three leaves (the pure block immediately becomes a leaf). On the other hand, if π1 is chosen the subtree could minimally have 6 leaves. Intuitively, if the goal is to generate a tree with a small number of leaves, the selection measure should be sensitive to total class separation – it should separate differing classes from each other as much as possible while separating as few examples of the same class as possible. The above example shows that information entropy does not satisfy these demands – it is completely insensitive to class separation and within-class fragmentation. The only exception is when the learning problem has exactly two classes: then class purity and class separation become the same.

Another negative property of information gain emphasized in this paper is its tendency to induce decision trees with near-minimal average depth. The empirical evaluation of such trees shows that they tend to have a large number of leaves and a high error rate [4]. Another of the deficiencies pointed out is actually embedded in the definition of impurity measures: their symmetry with respect to the components of PC. This means that a set with a given class probability vector evaluates identically to another set whose class vector is a permutation of the first. Thus, if one of the subsets of a set S has a different majority class than the original but the distribution of classes is simply permuted, entropy will not detect the change. However, this change in the dominant class is generally a strong indicator that the attribute value is relevant to classification.

Recognizing the above weaknesses of impurity measures, the authors define a desirable class of selection measures. Assuming induction of a binary tree (relying on the binary tree hypothesis), for a training set S and an attribute A, a test τ on this attribute induces a binary partition of the set S: S = Sτ ∪ S¬τ, where Sτ = { e ∈ S | e satisfies τ } and S¬τ = S \ Sτ.
The selection measure should satisfy the following properties:

1. It is maximum when the classes in Sτ are disjoint from the classes in S¬τ (inter-class separation).
2. It is minimum when the class distribution in Sτ is identical to the class distribution in S¬τ.
3. It favors partitions which keep examples of the same class in the same block (intra-class cohesiveness).
4. It is sensitive to permutations in the class distribution.
5. It is non-negative, smooth (differentiable), and symmetric with respect to the classes.

This defines a family of measures called C-SEP (for Class SEParation) for evaluating binary partitions. One such measure, proposed in this paper and proven to satisfy all requirements of the C-SEP family, is the orthogonality measure, defined as

ORT(τ, S) = 1 − cos θ(V1, V2),

where θ(V1, V2) is the angle between the two class vectors V1 and V2 of the partitions Sτ and S¬τ, respectively.
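The orthogonality measure itself is only a few lines of code. The sketch below (illustrative, not the O-BTREE implementation) applies it to the class vectors of the two partitions from Figure 3 and, unlike information gain, it clearly prefers π2.

```python
import math

def ort(v1, v2):
    """Orthogonality measure ORT(tau, S) = 1 - cos(angle between the class
    vectors of the two blocks of a binary partition)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return 1.0 - dot / norm

# the two candidate partitions from Figure 3
print(ort([45, 8, 5], [5, 2, 45]))    # pi_1: ~0.78, much overlap between the blocks
print(ort([50, 0, 50], [0, 10, 0]))   # pi_2: 1.0, the class vectors are orthogonal
```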
The results of an empirical comparison of the orthogonality measure, embedded in the O-BTREE system, with the entropy measure used in GID3* (which branches only on a few individual values while grouping the rest in one default branch), the information gain in ID3, the gain ratio in ID3-IV, and the information gain for induction of binary trees in ID3-BIN, are taken from [5] and given in Figures 4, 5, 6 and 7. In these experiments 5 different data sets were used: RIE (Reactive Ion Etching) – synthetic data, and the real-world data sets Soybean, Auto, Harr90 and Mushroom. Descriptions of these sets may be found in [5]. The results reported are in terms of ratios relative to GID3* performance (GID3* = 1.0 in both cases).

Figure 4. Error ratios for RIE-random domains.
Figure 5. Leaf ratios for RIE-random domains.
Figure 6. Relative ratios of error rates (GID3* = 1).
Figure 7. Ratios of numbers of leaves (GID3* = 1).

Figures 4, 5, 6 and 7 show that the results for the O-BTREE algorithm are almost always superior to those of the other algorithms.

Conclusion

Until recently, most algorithms for the induction of decision trees used one of the impurity measures described in the previous section. These functions were borrowed from information theory without any formal analysis of their suitability as a selection criterion. The empirical results were acceptable and only small variations of these methods were further tested.
Fayyad's and Irani's approach in [5] introduces a completely different family of measures, C-SEP, for binary partitions. They recognize very important properties of a measure, inter-class separation and intra-class cohesiveness, which were not precisely captured by impurity measures. This is a first step towards a better formalization of the selection criterion, which is necessary for further improvement of decision trees' accuracy and simplicity.

3. Noise

When we use decision tree induction techniques in real-world domains we have to expect noisy data. The description of an object may include attributes based on measurements or subjective judgement, both of which can give rise to errors in the values of the attributes. Sometimes the class information itself may contain errors. These defects in the data lead to two known problems:

• attribute inadequacy, meaning that even though some examples have identical descriptions in terms of attribute values, they do not belong to the same class. Inadequate attributes are not able to distinguish among the objects in the set S.
• spurious tree complexity, which is the result of the tree induction algorithm trying to fit the noisy data into the tree.

Recognizing these two problems, we can define two modifications of the tree-building algorithm if it is to be able to operate with a noise-affected training set [11]:

• the algorithm must be able to decide that testing further attributes will not improve the predictive accuracy of the decision tree;
• the algorithm must be able to work with inadequate attributes.

In [11] Quinlan suggests the chi-square test for stochastic independence as an implementation of the first modification:
Let S be a collection of objects which belong to one of two classes, N and P, and let A be an attribute with v values that produces the subsets {S1, S2, …, Sv} of S, where Si contains pi and ni objects of classes P and N, respectively. If the value of A is irrelevant to the class of an object in S (if the values of A for these objects are just noise, they would be expected to be unrelated to the objects' classes), the expected value pi′ of pi should be

$$p_i' = p \cdot \frac{p_i + n_i}{p + n}$$

If ni′ is the corresponding expected value of ni, the statistic

$$\sum_{i=1}^{v} \left( \frac{(p_i - p_i')^2}{p_i'} + \frac{(n_i - n_i')^2}{n_i'} \right)$$

is approximately chi-square with v − 1 degrees of freedom. This statistic can be used to determine the confidence with which one can reject the hypothesis that A is independent of the class of the objects in S [11]. The tree-building procedure can then be modified to prevent testing any attribute whose irrelevance cannot be rejected with a very high (e.g., 99%) confidence level. One difficulty with the chi-square test is that it is unreliable for very small values of the expectations pi′ and ni′, so the common practice is to use it only when all expectations are at least 4 [12].
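A small sketch of this statistic is given below; the decision itself would compare the returned value with the critical value of the chi-square distribution with v − 1 degrees of freedom at the chosen confidence level, and the block counts used in the example are made up.

```python
def chi_square_statistic(blocks):
    """blocks: list of (p_i, n_i) counts of classes P and N in each value block.
    Returns the statistic comparing observed counts with the counts expected
    if the attribute were irrelevant to the class."""
    p = sum(pi for pi, _ in blocks)
    n = sum(ni for _, ni in blocks)
    stat = 0.0
    for pi, ni in blocks:
        size = pi + ni
        pi_exp = p * size / (p + n)       # p_i'
        ni_exp = n * size / (p + n)       # n_i'
        stat += (pi - pi_exp) ** 2 / pi_exp + (ni - ni_exp) ** 2 / ni_exp
    return stat                            # ~ chi-square with v - 1 degrees of freedom

# a near-uniform split gives a small value: consistent with an irrelevant attribute
print(chi_square_statistic([(4, 5), (5, 4), (6, 4)]))
```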
The second modification should enable the algorithm to cope with inadequate attributes. Quinlan [12] suggests two possibilities:

• The notion of class membership could be generalized to a continuous value c lying between 0 and 1: if the subset of objects at a leaf contains p examples belonging to class P and n examples belonging to class N, the choice for c would be

$$c = \frac{p}{p + n}$$

In this case a class value of 0.8 would be interpreted as 'belonging to class P with probability 0.8'. The classification error is then c if the object really belongs to class N, and 1 − c if it really belongs to class P. This method is called the probability method.

• A voting model could be established: assign all objects to the more numerous class at the leaf. This method is called the majority method.

It can be verified that the first method minimizes the sum of the squares of the classification errors, while the second one minimizes the sum of absolute errors over the objects in S. If the goal is to minimize the expected error, the second approach seems more suitable, and the empirical results shown in [12] confirm this.

The two suggested modifications were tested on various data sets, with different noise levels affecting different attributes or the class information; the results are shown in [12]. Quite different forms of degradation are observed:

• Destroying class information produces a linear increase in error, for a noise level of 100% reaching 50% error, which means that objects would be classified randomly.
• Noise in a single attribute does not have a dramatic effect, and its impact appears to be directly proportional to the importance of the attribute. The importance of an attribute can be defined as the average classification error produced if the attribute is deleted altogether from the data.
• Noise in all attributes together leads to a relatively rapid increase in error, which generally reaches a peak and then declines. The appearance of this peak is explained in [11].
These experiments led to a very interesting and unexpected observation, given in [11]: for higher noise levels, the performance of the correct decision tree on corrupted data was found to be inferior to that of an imperfect decision tree formed from data corrupted to a similar level. These observations suggest some basic tactics for dealing with noisy data:

• It is important to eliminate noise affecting the class membership of the objects in the training set.
• It is worthwhile to exclude noisy, less important attributes.
• The payoff in noise reduction increases with the importance of the attribute.
• The training set should reflect the noise distribution and level expected when the induced decision tree is used in practice.
• The majority method of assigning classes to leaves is preferable to the probability method.

Conclusion

The methods employed to cope with noise in decision tree induction are mostly based on empirical results. Although it is obvious that they lead to improvement of the decision trees in terms of simplicity and accuracy, there is no formal theory to support them. This implies that laying some theoretical foundation should be a necessity in the future.

4. Pruning
In noisy domains, pruning methods are employed to cut back a full-size tree to a smaller one that is likely to give better classification performance. Decision trees generated from the examples in the training set are generally too overfitted to accurately classify unseen examples from a test set. Techniques used to prune the original tree usually consist of the following steps [15]:

• generate a set of pruned trees;
• estimate the performance of each of these trees;
• select the best tree.

One of the major issues is what data set will be used to test the performance of the pruned trees. The ideal situation would be to have the complete set of test examples; only then would we be able to make an optimal tree selection. However, in practice this is not possible, and it is approximated with a very large, independent test set, if one is available. The real problem arises when such a test set is not available. Then the same set used for building the decision tree has to be used to estimate the accuracy of the pruned trees. Resampling methods, such as cross-validation, are the principal technique used in these situations. f-fold cross-validation divides the training set S into f blocks of roughly the same size and class distribution; then, for each block in turn, a classifier is constructed from the cases in the remaining blocks and tested on the cases in the hold-out block. The error rate of the classifier produced from all the cases is estimated as the ratio of the total number of errors on the hold-out cases to the total number of cases. The average error rate from several such distinct cross-validations is then a relatively reliable estimate of the error rate of the single classifier produced from all the cases. 10-fold cross-validation has proven to be very reliable and is widely used for many different learning models.
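A minimal sketch of the f-fold cross-validation estimate is given below; build_classifier stands for any tree inducer and is an assumption of the sketch, and the stratification of the blocks by class ("roughly the same distribution") is omitted for brevity.

```python
import random

def cross_validation_error(examples, build_classifier, f=10, seed=0):
    """Estimate the error rate of the classifier built from all examples:
    train on f-1 blocks, count errors on the hold-out block, sum over blocks."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    blocks = [examples[i::f] for i in range(f)]
    errors = 0
    for i, hold_out in enumerate(blocks):
        training = [e for j, block in enumerate(blocks) if j != i for e in block]
        classifier = build_classifier(training)          # returns a function attrs -> class
        errors += sum(1 for attrs, label in hold_out if classifier(attrs) != label)
    return errors / len(examples)

# toy inducer for illustration: always predicts the majority class of its training data
def majority_inducer(training):
    labels = [label for _, label in training]
    winner = max(set(labels), key=labels.count)
    return lambda attrs: winner
```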
Quinlan in [13] describes three techniques for pruning:

• cost-complexity pruning,
• reduced error pruning, and
• pessimistic pruning.

Cost-Complexity Pruning

This technique was initially described in [2]. It consists of two stages:

• First, a sequence of trees T0, T1, …, Tk is generated, where T0 is the original decision tree and each Ti+1 is obtained by replacing one or more subtrees of Ti with leaves, until the final tree Tk is just a leaf.
• Then, each tree in the sequence is evaluated and one of them is selected as the final pruned tree.

The cost-complexity measure is used for the evaluation of a pruned tree T: if N is the total number of examples classified by T, E is the number of misclassified ones, and L(T) is the number of leaves in T, then the cost-complexity is defined as the sum

$$\frac{E}{N} + \alpha \cdot L(T)$$

where α is some parameter. Now suppose that we replace some subtree T* of the tree T with the best possible leaf. In general, this pruned tree would have M more misclassified examples and L(T*) − 1 fewer leaves. T and the pruned tree would have the same cost-complexity if

$$\alpha = \frac{M}{N \cdot (L(T^*) - 1)}$$

To produce Ti+1 from Ti, each non-leaf subtree of Ti is examined to find the one with the minimum value of α. The one or more subtrees with that value of α are then replaced by their respective best leaves.
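The arithmetic of the first stage is simple enough to state directly; the following sketch just transcribes the two formulas above (the argument names are mine).

```python
def cost_complexity(errors, n, leaves, alpha):
    """Cost-complexity of a tree: E/N + alpha * L(T)."""
    return errors / n + alpha * leaves

def critical_alpha(extra_errors, n, subtree_leaves):
    """Value of alpha at which pruning subtree T* (replacing it with its best
    leaf) leaves the cost-complexity unchanged: alpha = M / (N * (L(T*) - 1))."""
    return extra_errors / (n * (subtree_leaves - 1))

# at each step, prune the subtree(s) with the smallest critical alpha
print(critical_alpha(extra_errors=3, n=100, subtree_leaves=4))   # 0.01
```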
For the second stage of pruning we use an independent test set containing N′ examples to test the accuracy of the pruned trees. If E′ is the minimum number of errors observed with any Ti and the standard error of E′ is given by

$$se(E') = \sqrt{\frac{E' \cdot (N' - E')}{N'}}$$

then the tree selected is the smallest one whose number of errors does not exceed E′ + se(E′).

Reduced Error Pruning

This technique is probably the simplest and most intuitive one for finding small pruned trees of high accuracy. First, the original tree T is used to classify an independent test set. Then, for every non-leaf subtree T* of T, we examine the change in misclassifications over the test set that would occur if T* were replaced by the best possible leaf. If the replacement never increases the number of errors, and T* contains no subtree with the same property, T* is replaced by the leaf. The process continues until any further replacement would increase the number of errors over the test set.
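A simplified, bottom-up sketch of reduced error pruning is shown below; it is not Quinlan's exact procedure (which repeatedly scans the whole tree), and the Node layout for trees over discrete attributes is an assumption of the sketch.

```python
class Node:
    """Decision tree node over discrete attributes.  Leaves have no branches;
    'majority' is the best possible leaf class for the node."""
    def __init__(self, majority, attribute=None, branches=None):
        self.majority = majority
        self.attribute = attribute
        self.branches = branches or {}            # attribute value -> child Node

    def classify(self, attrs):
        if not self.branches:
            return self.majority
        child = self.branches.get(attrs.get(self.attribute))
        return child.classify(attrs) if child else self.majority

def errors(node, test_set):
    """Number of misclassified (attrs, label) pairs."""
    return sum(1 for attrs, label in test_set if node.classify(attrs) != label)

def reduced_error_prune(node, test_set):
    """test_set holds the test examples that reach this node.  Children are
    pruned first on their share of the examples; the node is then replaced by
    its best leaf if that leaf makes no more errors here than the subtree."""
    if not node.branches:
        return node
    for value, child in list(node.branches.items()):
        reaching = [(a, l) for a, l in test_set if a.get(node.attribute) == value]
        node.branches[value] = reduced_error_prune(child, reaching)
    leaf_errors = sum(1 for _, label in test_set if label != node.majority)
    return Node(node.majority) if leaf_errors <= errors(node, test_set) else node
```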
Pessimistic Pruning

This technique does not require a separate test set. If the decision tree T was generated from a training set with N examples and then tested on the same set, we can assume that at some leaf of T there are K classified examples, of which J are misclassified. The ratio J/K does not provide a reliable estimate of the error rate of that leaf when unseen objects are classified, since the tree T has been tailored to the training set. Instead, we can use a more realistic measure known as the continuity correction for the binomial distribution, in which J is replaced by J + 1/2.

Now consider some subtree T* of T, containing L(T*) leaves and classifying ΣK examples (sums are taken over all leaves of T*), with ΣJ of them misclassified. According to the above measure it will misclassify ΣJ + L(T*)/2 unseen cases. If T* is replaced with the best leaf, which misclassifies E examples from the training set, the new pruned tree is accepted whenever E + 1/2 is within one standard error of ΣJ + L(T*)/2 (the standard error is defined as in cost-complexity pruning). All non-leaf subtrees are examined just once to see whether they should be pruned, but once a subtree is pruned its own subtrees are not examined further. This strategy makes the algorithm much faster than the previous two.

Quinlan compares these three techniques on 6 different domains with both real-world and synthetic data. The general observation is that the simplified trees are of superior or equivalent accuracy to the originals, so pruning has been beneficial on both counts. Cost-complexity pruning tends to produce smaller decision trees than either reduced error or pessimistic pruning, but they are less accurate than the trees produced by the other two techniques. This suggests that cost-complexity pruning may be overpruning. On the other hand, reduced error pruning and pessimistic pruning produce trees with very similar accuracy, but knowing that the latter uses only the training set for pruning and that it is more efficient than the former, it can be pronounced the optimal technique among the three suggested.
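To make the pessimistic acceptance test described above concrete, here is a small numeric sketch (the argument names are mine); it simply compares the corrected error of the best leaf with the corrected error of the subtree plus one standard error.

```python
import math

def pessimistic_prune_accepts(sum_j, num_leaves, leaf_errors, n_covered):
    """Accept replacing subtree T* by its best leaf when E + 1/2 lies within one
    standard error of sum(J) + L(T*)/2, computed over the n_covered training
    examples that T* classifies."""
    corrected_subtree = sum_j + num_leaves / 2.0        # sum J + L(T*)/2
    corrected_leaf = leaf_errors + 0.5                  # E + 1/2
    se = math.sqrt(corrected_subtree * (n_covered - corrected_subtree) / n_covered)
    return corrected_leaf <= corrected_subtree + se

# e.g. a subtree with 4 leaves that misclassifies 6 of the 40 examples it covers,
# whose best leaf would misclassify 9 of them: the replacement is accepted
print(pessimistic_prune_accepts(sum_j=6, num_leaves=4, leaf_errors=9, n_covered=40))
```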
OPT Algorithm

While the previously described techniques are used to prune decision trees generated from noisy data, Bratko and Bohanec in [1] introduce the OPT algorithm for pruning accurate decision trees. The problem they aim to solve is [1]: given a completely accurate, but complex, definition of a concept, simplify the definition, possibly at the expense of accuracy, so that the simplified definition still corresponds to the concept well in general, but may be inaccurate in some details. So, while the previously mentioned techniques were designed to improve tree accuracy, this one is designed to reduce the size of a tree that would otherwise be impractical to communicate to, and be understood by, the user.

Bratko's and Bohanec's approach is somewhat similar to the previous pruning algorithms: they construct a sequence of pruned trees and then select the smallest tree that satisfies some required accuracy. However, the tree sequence they construct is denser with respect to the number of leaves: a sequence T0, T1, ..., Tn is constructed such that

1. n = L(T0) − 1,
2. the trees in the sequence decrease in size by one leaf, i.e., L(Ti) = L(T0) − i for i = 0, 1, ..., n (unless there is no pruned tree of the corresponding size), and
3. each Ti has the highest accuracy among all the pruned trees of T0 of the same size.

This sequence is called the optimal pruning sequence and was initially suggested by Breiman et al. [2]. To construct the optimal pruning sequence efficiently, in quadratic (polynomial) time with respect to the number of leaves of T0, they use dynamic programming. The construction is recursive in that each subtree of T0 is again a decision tree with its own optimal pruning sequence. The algorithm starts by constructing the sequences that correspond to small subtrees near the leaves of T0. These are then combined, yielding sequences that correspond to larger and larger subtrees of T0, until the optimal pruning sequence for T0 itself is finally constructed. The main advantage of the OPT algorithm is the density of its optimal pruning sequence, which always contains an optimal tree. Sequences produced by cost-complexity pruning or reduced error pruning are sparse and can therefore miss some optimal solutions.
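The dynamic-programming idea can be sketched as follows. This is only the error table of the optimal pruning sequence (the pruned trees themselves are not reconstructed), and the Subtree layout, with per-node class counts, is an assumption of the sketch rather than the OPT data structures.

```python
class Subtree:
    """Node of a decision tree; 'counts' are the class counts of the training
    examples reaching the node, 'children' its child subtrees."""
    def __init__(self, counts, children=()):
        self.counts = list(counts)
        self.children = list(children)

def leaf_errors(node):
    """Errors made if the node is pruned to a single (majority-class) leaf."""
    return sum(node.counts) - max(node.counts)

def optimal_error_table(node):
    """For each achievable leaf count l, the minimum number of errors of any
    pruned version of this subtree with exactly l leaves."""
    table = {1: leaf_errors(node)}                 # option: prune to a single leaf
    if node.children:
        combined = {0: 0}
        for child in node.children:                # knapsack-style merge of child tables
            child_table = optimal_error_table(child)
            merged = {}
            for l1, e1 in combined.items():
                for l2, e2 in child_table.items():
                    l, e = l1 + l2, e1 + e2
                    if e < merged.get(l, float("inf")):
                        merged[l] = e
            combined = merged
        for l, e in combined.items():
            if l > 1 and e < table.get(l, float("inf")):
                table[l] = e
    return table

# toy tree: a root whose test splits (50, 10, 50) into (45, 8, 5) and (5, 2, 45)
root = Subtree([50, 10, 50], [Subtree([45, 8, 5]), Subtree([5, 2, 45])])
print(optimal_error_table(root))                   # {1: 60, 2: 20}
```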
One interesting observation derived from the experiments conducted by Bratko and Bohanec is that for real-world data a relatively high accuracy was achieved with relatively small pruned trees, regardless of the technique used for pruning, while that was not the case with synthetic data. This is further evidence of the usefulness of pruning, especially in real-world domains.

Conclusion

Whether we want to improve the classification accuracy of decision trees generated from noisy data, or to simplify accurate but complex decision trees to make them more intelligible to human experts, pruning has proved to be very successful. Recent papers [7], [10] suggest there is still some space left for improvement of the basic and most commonly used techniques described in this section.

5. Summary

The selection criterion is probably the most important aspect determining the behavior of a top-down decision tree generation algorithm. If it selects the attributes most relevant to the class information near the root of the tree, then any pruning technique can successfully cut off the branches of class-independent and/or noisy attributes, because they will appear near the leaves of the tree. Thus, an intelligent selection method which is able to recognize the attributes most important for classification will initially generate simpler trees and, additionally, will ease the job of the pruning algorithm.

The main problem in this domain seems to be the lack of theoretical foundation: many techniques are still used because of their acceptable empirical evaluation, not because they have been formally
proven to be superior. Development of a formal theory for decision tree induction is necessary for a better understanding of this domain and for further improvement of decision trees' classification accuracy, especially for noisy, incomplete, real-world data.

6. References

[1] Bratko, I. & Bohanec, M. (1994). Trading accuracy for simplicity in decision trees, Machine Learning 15, 223-250.
[2] Breiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks.
[3] Breiman, L. (1996). Technical note: Some properties of splitting criteria, Machine Learning 24, 41-47.
[4] Fayyad, U.M. (1991). On the induction of decision trees for multiple concept learning, PhD dissertation, EECS Department, The University of Michigan.
[5] Fayyad, U.M. & Irani, R.B. (1993). The attribute selection problem in decision tree generation, Proceedings of the 10th National Conference on AI, AAAI-92, MIT Press, 104-110.
[6] Lopez de Mantras, R. (1991). A distance-based attribute selection measure for decision tree induction, Machine Learning 6, 81-92.
[7] Kearns, M. & Mansour, Y. (1998). A fast, bottom-up decision tree pruning algorithm with near-optimal generalization, submitted.
[8] Kononenko, I., Bratko, I., & Roskar, R. (1984). Experiments in automatic learning of medical diagnosis rules, Technical Report, Faculty of Electrical Engineering, E. Kardelj University, Ljubljana.
[9] Mingers, J. (1989). An empirical comparison of selection measures for decision-tree induction, Machine Learning 3, 319-342.
[10] Schapire, R.E. & Helmbold, D.P. (1995). Predicting nearly as well as the best pruning of a decision tree, Proceedings of the 8th Annual Conference on Computational Learning Theory, ACM Press, 61-68.
[11] Quinlan, J.R. (1986). Induction of decision trees, Machine Learning 1, 81-106.
[12] Quinlan, J.R. (1986). The effect of noise on concept learning, Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann: San Mateo, CA, 148-166.
[13] Quinlan, J.R. (1987). Simplifying decision trees, International Journal of Man-Machine Studies 27, 221-234.
[14] Quinlan, J.R. (1988). Decision trees and multi-valued attributes, Machine Intelligence 11, 305-318.
[15] Weiss, S.M. & Indurkhya, N. (1994). Small sample decision tree pruning, Proceedings of the 11th International Conference on Machine Learning, Morgan Kaufmann, 335-342.
[16] White, A.P. & Liu, W.Z. (1994). The importance of attribute selection measures in decision tree induction, Machine Learning 15, 25-41.
[17] White, A.P. & Liu, W.Z. (1994). Technical note: Bias in information-based measures in decision tree induction, Machine Learning 15, 321-329.