Decision Tree Induction Using Impurity Measures
1. Introduction

Decision trees are one of the methods for concept learning from examples. They are widely

used in machine learning and knowledge acquisition systems. The main application area is

classification:

We are given a set of records, called the training set. Each record from the training set has the same

structure, consisting of a number of attribute/value pairs. One of these attributes represents the class

of the record. We also have a test set for which the class information is unknown. The problem is

to derive a decision tree, using the examples from the training set, which will determine the class of each

record in the test set.

The leaves of the induced decision tree are class names, and the other nodes represent attribute-based tests,

with a branch for each possible value of the particular attribute.

Once the tree is formed, we can classify objects from the test set: starting at the root of the tree, we

evaluate the test and take the branch appropriate to the outcome. The process continues until a leaf

is encountered, at which point the object is asserted to belong to the class named by the leaf.
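
As an illustration of this classification procedure, here is a minimal Python sketch (the nested-dictionary tree representation and the example record are my own assumptions, not a format used by the systems discussed below); the example tree corresponds to Figure 1 in the next section:

```python
# Minimal sketch of classifying a record with a decision tree.
# The nested-dictionary representation is an illustrative assumption.

def classify(tree, record):
    """Walk the tree from the root, following the branch that matches the
    record's value for the tested attribute, until a leaf (class name) is hit."""
    while isinstance(tree, dict):
        attribute = tree["attribute"]            # attribute tested at this node
        tree = tree["branches"][record[attribute]]
    return tree                                  # a leaf, i.e. a class name

# The simple tree of Figure 1 (outlook tested at the root).
figure1_tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "N", "normal": "P"}},
        "overcast": "P",
        "rain": {"attribute": "windy",
                 "branches": {"true": "N", "false": "P"}},
    },
}

record = {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "false"}
print(classify(figure1_tree, record))            # -> N (object 1 of Table 1)
```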

Induction of decision trees has been a very active area of machine learning, and many approaches

and techniques have been developed for building trees with high classification performance. The

most commonly addressed problems are:

       •selecting the best attribute for splitting

       •dealing with noise in real-world tasks

       •pruning of complex decision trees

       •dealing with unknown attribute values

       •dealing with continuous attribute values.


My intention is to give an overview of methods addressing the first three problems, which I find

to be strongly mutually dependent and of primary importance for building trees with good

classification ability.




2. Selection Criterion

For a given training set it is possible to construct many decision trees that will correctly classify

all of its objects. Among all accurate trees we are interested in the simplest one. This search is

guided by the Occam’s Razor heuristic: among all rules that accurately account for the training set,

the simplest is likely to have the highest success rate when used to classify unseen objects. This

heuristic is also supported by analysis: Pearl and Quinlan [11] have derived upper bounds on the

expected error using different formalisms for generalizing from a set of known cases. For a

training set of predetermined size, these bounds increase with the complexity of the induced

generalization.

Since a decision tree is made of nodes that represent attribute-based tests, simplifying the tree means

reducing the number of tests. We can achieve this by carefully selecting the order of the tests to

be conducted. The example given in [11] shows how, for the same training set given in Table 1,

different decision trees may be constructed. Each of the examples in the training set is described in

terms of 4 discrete attributes: outlook {sunny, overcast, rain}, temperature {cool, mild, hot},

humidity {high, normal}, windy {true, false}, and each belongs to one of the classes N or P.

Figure 1 shows the decision tree when the attribute outlook is used for the first test, and Figure 2 shows

the decision tree with temperature as the first test. The difference in complexity is obvious.




No.   Outlook    Temperature   Humidity   Windy   Class
 1    sunny      hot           high       false   N
 2    sunny      hot           high       true    N
 3    overcast   hot           high       false   P
 4    rain       mild          high       false   P
 5    rain       cool          normal     false   P
 6    rain       cool          normal     true    N
 7    overcast   cool          normal     true    P
 8    sunny      mild          high       false   N
 9    sunny      cool          normal     false   P
10    rain       mild          normal     false   P
11    sunny      mild          normal     true    P
12    overcast   mild          high       true    P
13    overcast   hot           normal     false   P
14    rain       mild          high       true    N

Table 1. A small training set




Figure 1. A simple decision tree: outlook is tested at the root; the sunny branch tests humidity (high → N, normal → P), the overcast branch is the pure leaf P, and the rain branch tests windy (true → N, false → P).

Figure 2. A complex decision tree: temperature is tested at the root, and every branch requires further tests on outlook, windy and humidity before a class is reached.



This shows that the choice of test is crucial for the simplicity of the decision tree, a point on which many

researchers, such as Quinlan [11], Fayyad [5], and White and Liu [16], agree.

A method of choosing a test to form the root of a decision tree is usually referred to as the selection

criterion. Many different selection criteria have been tested over the years, and the most common

ones among them are maximum information gain and the GINI index. Both of these methods belong

to the class of impurity measures, which are designed to capture aspects of the partitioning of

examples relevant to good classification.



Impurity measures

Let S be a set of training examples, with each example e ∈ S belonging to one of the classes in

C = {C1, C2, …, Ck}. We can define the class vector (c1, c2, …, ck) ∈ N^k, where ci = |{e ∈ S |

class(e) = Ci}|, and the class probability vector (p1, p2, …, pk) ∈ [0, 1]^k:

    (p1, p2, …, pk) = (c1/|S|, c2/|S|, …, ck/|S|)

It’s obvious that ∑ pi = 1.

A set of examples is said to be pure if all its examples belong to one class. Hence, if the class probability

vector of a set of examples has a component equal to 1 (all other components being equal to 0), the set is

pure. On the other hand, if all components are equal we get an extreme case of

impurity.

To quantify the notion of impurity, a family of functions known as impurity measures [5] is

defined.


Definition 1 Let S be a set of training examples having a class probability vector PC. A function

φ : [0, 1]^k → R such that φ(PC) ≥ 0 is an impurity measure if it satisfies the following

conditions:

1.   φ (PC) is minimum if ∃i such that component PCi = 1.

2.   φ (PC) is maximum if ∀i, 1 ≤ i ≤ k, PCi = 1/k.

3.   φ (PC) is symmetric with respect to components of PC.

4.   φ (PC) is smooth (differentiable everywhere) in its range.

Conditions 1 and 2 express the well-known extreme cases, and condition 3 ensures that the measure is

not biased towards any of the classes.

For induction of decision trees, an impurity measure is used to evaluate the impurity of the partition induced

by an attribute.

Let PC(S) be the class probability vector of S and let A be a discrete attribute over the set S. Let

us assume the attribute A partitions the set S into the sets S1, S2, …, Sv. The impurity of the partition is

defined as the weighted average impurity of its component blocks:

    Φ(S, A) = ∑_{i=1}^{v} (|Si| / |S|) · φ(PC(Si))

Finally, the goodness-of-split due to attribute A is defined as the reduction in impurity after the

partition:

    ∆Φ(S, A) = φ(PC(S)) − Φ(S, A)

If we choose the entropy of the partition, E(A, S), as the impurity measure:

    φ(PC(S)) = E(A, S) = ∑_{i=1}^{k} −PCi · log2(PCi)

then the reduction in impurity gained by an attribute is called information gain.
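
As a concrete sketch of these definitions (the helper names and the example representation – a list of (attribute-values dict, class label) pairs – are my own, not from the paper), entropy and information gain can be computed as follows:

```python
from collections import Counter
from math import log2

def entropy(examples):
    """phi(PC(S)): entropy of the class distribution of a set of examples,
    where each example is an (attribute_values, class_label) pair."""
    counts = Counter(cls for _, cls in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    """Reduction in impurity when the set is partitioned on `attribute`:
    phi(PC(S)) minus the weighted average impurity of the blocks."""
    total = len(examples)
    blocks = {}
    for values, cls in examples:
        blocks.setdefault(values[attribute], []).append((values, cls))
    weighted_impurity = sum(len(b) / total * entropy(b) for b in blocks.values())
    return entropy(examples) - weighted_impurity
```

Applied to the training set of Table 1, this ranks outlook above temperature, which is exactly why the tree of Figure 1 is simpler than the tree of Figure 2.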




This method was used in many algorithms for induction of decision trees such as ID3, GID3* and

CART.

The other popular impurity measure is the GINI index used in CART [2]. To obtain the GINI index we

set φ to be

    φ(PC(S)) = ∑_{i≠j} PCi · PCj
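
A corresponding sketch of the GINI index, using the same example representation and the identity ∑_{i≠j} PCi · PCj = 1 − ∑_i PCi² (the helper name is mine):

```python
from collections import Counter

def gini(examples):
    """GINI impurity of a set of examples: sum over i != j of PC_i * PC_j,
    computed here through the equivalent form 1 - sum_i PC_i**2."""
    counts = Counter(cls for _, cls in examples)
    total = len(examples)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())
```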


All functions belonging to the family of impurity measures agree on the minima, maxima and

smoothness, and as a consequence they should result in similar trees [2], [9] regarding complexity

and accuracy. After a detailed analysis, Breiman [3] reports basic differences in trees produced

using information gain and the GINI index. The GINI index prefers splits that put the largest class into one

pure node, and all the others into the other. Entropy favors size-balanced child nodes. If the

number of classes is small, both criteria should produce similar results. The difference appears

when the number of classes is larger. In this case GINI produces splits that are too unbalanced near

the root of the tree. On the other hand, splits produced by entropy show a lack of uniqueness.

This analysis points out some of the problems associated with impurity measures. Unfortunately,

these are not the only ones.

Some experiments carried out in the mid eighties showed that the gain criterion tends to favor

attributes with many values [8]. This finding was supported by the analysis in [11]. One of the

solutions to this problem was offered by Kononenko et al. in [8]: the induced decision tree has to be a

binary tree. This means that every test has only two outcomes. If we have an attribute A with

values A1, A2, …, Av, the decision tree no longer branches on each possible value. Instead, a subset

of the attribute's values is chosen, and the tree has one branch for that subset and another for the remainder. This

criterion is known as the subset criterion. Kononenko et al. report that this modification led to

smaller decision trees with improved classification performance. Although it’s obvious that

binary trees don’t suffer from the bias in favor of attributes with a large number of values, it is not

known whether this is the only reason for their better performance.

This finding is repeated in [4], and in [5] Fayyad and Irani introduce the binary tree hypothesis:

for a top-down, non-backtracking decision tree generation algorithm, if the algorithm applies a

proper attribute selection measure, then selecting a single attribute-value pair at each node, and

thus constructing a binary tree, rather than selecting an attribute and branching on all its values

simultaneously, is likely to lead to a decision tree with fewer leaves.

A formal proof of this hypothesis doesn’t exist; it’s a result of informal analysis and empirical

evaluation. Fayyad has also shown in [4] that for every decision tree there exists a binary decision

tree that is logically equivalent to it. This would mean that for every decision tree we could

induce a logically equivalent binary decision tree that is expected to have fewer nodes and to be

more accurate.

But binary trees have some side effects, explained in [11]:

First, this kind of tree is undoubtedly less intelligible to human experts than is ordinarily the

case, with unrelated attribute values being grouped together and with multiple tests on the same

attribute.

Second, the subset criterion can require a large increase in computation, especially for attributes

with many values – for an attribute A with v values there are 2^(v−1) − 1 different ways of specifying the

distinguished subset of attribute values. But since a decision tree is induced only once and then

used for classification, and since computing power is rapidly increasing, this problem seems to

diminish.
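
To make the 2^(v−1) − 1 count concrete, the following sketch (a hypothetical helper, not part of any of the cited systems) enumerates the candidate binary splits of a value set, counting each subset/complement pair only once:

```python
from itertools import combinations

def candidate_subsets(values):
    """All 2**(v-1) - 1 ways of splitting a value set into a distinguished
    subset and its remainder (a split and its mirror image are the same)."""
    values = sorted(values)
    anchor, rest = values[0], values[1:]     # fixing one value avoids duplicates
    splits = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            subset = {anchor, *combo}
            if len(subset) < len(values):    # the full set is not a real split
                splits.append((subset, set(values) - subset))
    return splits

print(len(candidate_subsets(["sunny", "overcast", "rain"])))   # 2**(3-1) - 1 = 3
```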




In [11] Quinlan proposes another method for overcoming the bias in information gain, called the

gain ratio. The gain ratio, GR (originally denoted by Quinlan as IV), normalizes the gain by the

attribute information:

    GR = ∆Φ(S, A) / ( −∑_{i=1}^{v} (|Si| / |S|) · log2(|Si| / |S|) )

The attribute information is used as the normalizing factor because it increases as the

number of possible values increases.

As mentioned in [11], this ratio may not always be defined – the denominator may be zero – or it

may tend to favor attributes for which the denominator is very small. As a solution to this, the

gain ratio criterion selects, from among those attributes with an average-or-better gain, the

attribute that maximizes GR. The experiments described in [14] show an improvement in tree

simplicity and prediction accuracy when the gain ratio criterion is used.
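
Building on the information_gain sketch above (again with illustrative helper names), the gain ratio is simply the gain divided by the attribute information:

```python
from collections import Counter
from math import log2

def split_information(examples, attribute):
    """The attribute information: entropy of the block sizes |S_i| / |S|."""
    total = len(examples)
    sizes = Counter(values[attribute] for values, _ in examples)
    return -sum((n / total) * log2(n / total) for n in sizes.values())

def gain_ratio(examples, attribute):
    si = split_information(examples, attribute)
    if si == 0.0:              # single-valued attribute: the ratio is undefined
        return 0.0
    return information_gain(examples, attribute) / si
```

Quinlan's restriction to attributes with average-or-better gain would then be applied in the surrounding attribute-selection loop.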

There is also another measure, introduced by Lopez de Mantras in [6], called the distance measure, dN:

    1 − dN = ∆Φ(S, A) / ( −∑_{i=1}^{k} ∑_{j=1}^{v} (|Sij| / |S|) · log2(|Sij| / |S|) )

where |Sij| is the number of examples with value aj of attribute A that belong to class Ci.

This is just another attempt at normalizing information gain, in this case with the cell information

(a cell is a subset of S which contains all examples with one attribute value that belong to one

class).

Although both of these normalized measures were claimed to be unbiased, the statistical analysis

in [17] shows that each again favors attributes with a larger number of values. These results also

suggest that, in this respect, information gain is the worst of the three measures, while the gain ratio is

the least biased. Furthermore, their analysis shows that the magnitude of the bias is strongly

dependent on the number of classes, increasing as k increases.



Orthogonality measure

Recently, a conceptually new approach was introduced by Fayyad and Irani in [5]. In their

analysis they give a number of reasons why information entropy, as a representative of the class of

impurity measures, is not suitable for attribute selection.

Consider the following example: a set S of 110 examples belonging to three classes {C1,

C2, C3} whose class vector is (50, 10, 50). Assume that the attribute-value pairs (A, a1) and (A, a2)

induce two binary partitions on S, π1 and π2, shown in Figure 3. We can see that π2 separates the

class C2 from the classes C1 and C3. However, the information gain measure prefers partition π1

(gain = 0.51) over π2 (gain = 0.43).




Partition π1 (produced by a1): blocks (45, 8, 5) and (5, 2, 45); gain = 0.51
Partition π2 (produced by a2): blocks (50, 0, 50) and (0, 10, 0); gain = 0.43
(class counts listed in the order C1, C2, C3)

Figure 3. Two possible binary partitions.




Analysis shows that if π2 is accepted, the subtree under this node has a lower bound of three leaves. On

the other hand, if π1 is chosen, the subtree could minimally have 6 leaves.

Intuitively, if the goal is to generate a tree with a smaller number of leaves, the selection measure

should be sensitive to total class separation – it should separate differing classes from each other

as much as possible while separating as few examples of the same class as possible. The above

example shows that information entropy doesn’t satisfy these demands – it is completely

insensitive to class separation and within-class fragmentation. The only exception is when the

learning problem has exactly two classes: then class purity and class separation become the same.

Another negative property of information gain emphasized in this paper is its tendency to induce

decision trees with near-minimal average depth. The empirical evaluation of this kind of tree

shows that they tend to have a large number of leaves and a high error rate [4].

Another of the deficiencies pointed out is actually embedded in the definition of impurity measures:

their symmetry with respect to the components of PC. This means that a set with a given class

probability vector evaluates identically to another set whose class vector is a permutation of the

first. Thus, if one of the subsets of a set S has a different majority class than the original, but the

distribution of classes is simply permuted, entropy will not detect the change. However, this

change in dominant class is generally a strong indicator that the attribute value is relevant to

classification.

Recognizing the above weaknesses of impurity measures, the authors define the desirable class of selection

measures:

Assuming induction of a binary tree (relying on the binary tree hypothesis) for a training set S and an

attribute A, a test τ on this attribute induces a binary partition of the set S:

S = Sτ ∪ S¬τ, where Sτ = { e ∈ S | e satisfies τ } and S¬τ = S \ Sτ.



Selection measure should satisfy the properties:

1. It is maximum when the classes in Sτ are disjoint from the classes in S¬τ (inter-class

    separation).

2. It is minimum when the class distribution in Sτ is identical to the class distribution in S¬τ.

3. It favors partitions which keep examples of the same class in the same block (intra-class

    cohesiveness).

4. It is sensitive to permutations in the class distribution.

5. It is non-negative, smooth (differentiable), and symmetric with respect to the classes.

This defines a family of measures called C-SEP (for Class SEParation), for evaluating binary

partitions.

One such measure, proposed in this paper and proven to satisfy all requirements of the C-SEP family,

is the orthogonality measure, defined as

    ORT(τ, S) = 1 − cos θ(V1, V2),

where θ(V1, V2) is the angle between the two class vectors V1 and V2 of the blocks Sτ and S¬τ,

respectively.
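
A sketch of ORT for a binary split (class_vector and the block arguments are my own naming):

```python
from collections import Counter
from math import sqrt

def class_vector(examples, classes):
    """Class vector (c_1, ..., c_k) of a block of examples."""
    counts = Counter(cls for _, cls in examples)
    return [counts.get(c, 0) for c in classes]

def ort(block_tau, block_not_tau, classes):
    """ORT(tau, S) = 1 - cos(angle between the class vectors of the two blocks)."""
    v1 = class_vector(block_tau, classes)
    v2 = class_vector(block_not_tau, classes)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    return 1.0 - dot / norm if norm else 0.0
```

For partition π2 of Figure 3 the class vectors (50, 0, 50) and (0, 10, 0) are orthogonal, so ORT = 1; for π1, with (45, 8, 5) and (5, 2, 45), ORT ≈ 0.78, so π2 is preferred.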

The results of empirical comparisons of the orthogonality measure embedded in the O-BTREE system

with the entropy measure used in GID3* (which branches only on a few individual values while

grouping the rest in one default branch), the information gain in ID3, the gain ratio in ID3-IV and the

information gain for induction of binary trees in ID3-BIN, are taken from [5] and given in Figures

4, 5, 6 and 7. In these experiments 5 different data sets were used: RIE (Reactive Ion Etching) –

synthetic data, and the real-world data sets Soybean, Auto, Harr90 and Mushroom. Descriptions of

these sets may be found in [5]. The results are reported as ratios relative to GID3*

performance (GID3* = 1.0 in both cases).

Figure 4. Error ratios for RIE-random domains.

Figure 5. Leaf ratios for RIE-random domains.

Figure 6. Relative ratios of error rates on Soybean, Auto, Harr90 and Mushroom (GID3* = 1).

Figure 7. Ratios of numbers of leaves on Soybean, Auto, Harr90 and Mushroom (GID3* = 1).

Figures 4, 5, 6 and 7 show that the results for the O-BTREE algorithm are almost always superior to

those of the other algorithms.


Conclusion


Until recently most algorithms for the induction of decision trees were using one of the impurity

measures described above. These functions were borrowed from information theory

without any formal analysis of their suitability as a selection criterion. The empirical results were

acceptable, and only small variations of these methods were further tested.



Fayyad’s and Irani’s approach in [5] introduces a completely different family of measures, C-SEP,

for binary partitions. They recognize important properties of a selection measure – inter-class

separation and intra-class cohesiveness – that were not precisely captured by impurity measures. This

is a first step toward a better formalization of the selection criterion, which is necessary for further

improvement of decision trees’ accuracy and simplicity.


3. Noise

When we use decision tree induction techniques in real-world domains we have to expect noisy

data. The description of an object may include attributes based on measurements or subjective

judgement, both of which can give rise to errors in the values of the attributes. Sometimes the

class information itself may contain errors. These defects in the data may lead to two known problems:

       •attribute inadequacy, meaning that even if some examples have identical descriptions in

        terms of attribute values, they don’t belong to the same class. Inadequate attributes are not

        able to distinguish among the objects in the set S.

       •spurious tree complexity, which is the result of the tree induction algorithm trying to fit the

        noisy data into the tree.

Recognizing these two problems, we can define two modifications of the tree-building algorithm if

it is to be able to operate with a noise-affected training set [11]:

       •the algorithm must be able to decide that testing further attributes will not improve the

        predictive accuracy of the decision tree

       •the algorithm must be able to work with inadequate attributes

In [11] Quinlan suggests the chi-square test for stochastic independence as the implementation of

the first modification:


Let S be a collection of objects which belong to one of two classes N and P, and let A be an attribute

with v values that produces subsets {S1, S2, …, Sv} of S, where Si contains pi and ni objects of classes

P and N, respectively, and p and n are the total numbers of objects of classes P and N in S. If the value

of A is irrelevant to the class of an object in S (if the values of A for these objects are just noise, the

values would be expected to be unrelated to the objects’ classes), the expected value pi′ of pi should be

    pi′ = p · (pi + ni) / (p + n)

If ni′ is the corresponding expected value of ni, the statistic

    ∑_{i=1}^{v} [ (pi − pi′)² / pi′ + (ni − ni′)² / ni′ ]

is approximately chi-square with v − 1 degrees of freedom. This statistic can be used to determine

the confidence with which one can reject the hypothesis that A is independent of the class of

objects in S [11].

The tree-building procedure can then be modified to prevent testing any attribute whose

irrelevance cannot be rejected with a very high (e.g., 99%) confidence level. One difficulty with the

chi-square test is that it’s unreliable for very small values of the expectations pi′ and ni′, so the

common practice is to use it only when all expectations are at least 4 [12].
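
A sketch of the test for the two-class case described above (the function and its interface are my own; it only computes the statistic and its degrees of freedom, leaving the comparison against a chi-square table to the caller, and it assumes all expected counts are positive):

```python
def chi_square_statistic(blocks):
    """blocks: list of (p_i, n_i) class counts in the subsets S_1 ... S_v.
    Returns the chi-square statistic and its v - 1 degrees of freedom."""
    p = sum(pi for pi, _ in blocks)          # total objects of class P
    n = sum(ni for _, ni in blocks)          # total objects of class N
    stat = 0.0
    for pi, ni in blocks:
        expected_p = p * (pi + ni) / (p + n)     # p_i' under independence
        expected_n = n * (pi + ni) / (p + n)     # n_i' under independence
        stat += (pi - expected_p) ** 2 / expected_p
        stat += (ni - expected_n) ** 2 / expected_n
    return stat, len(blocks) - 1

# The attribute is only used for a test if independence can be rejected at a
# very high confidence level (e.g. the statistic exceeds the 99th percentile of
# the chi-square distribution with the returned degrees of freedom).
```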

The second modification of the algorithm should cope with inadequate attributes. Quinlan [12] suggests

two possibilities:

       •The notion of the class could be generalized to a continuous value c lying between 0 and 1: if

        the subset of objects at a leaf contained p examples belonging to class P and n examples

        belonging to class N, the choice for c would be

    c = p / (p + n)



In this case a class value of 0.8 would be interpreted as ‘belonging to class P with probability 0.8’.

The classification error is defined as:

                        if the object is really of class N: c;

                        if the object is really of class P: 1 − c.

This method is called the probability method.

       •A voting model could be established: assign all objects to the more numerous class at the

        leaf.

This method is called the majority method.

It can be verified that the first method minimizes the sum of the squares of the classification

errors, while the second one minimizes the sum of absolute errors over the objects in S. If the goal

is to minimize the expected error, the second approach seems more suitable, and the empirical results

shown in [12] confirm this.
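
A small sketch of the two leaf-labelling options (class names P and N follow the text; the function names are mine):

```python
def leaf_probability(p, n):
    """Probability method: the leaf asserts class P with probability c = p / (p + n)."""
    return p / (p + n)

def leaf_majority(p, n):
    """Majority method: assign every object at the leaf to the more numerous class."""
    return "P" if p >= n else "N"

# With p = 4 and n = 1 the probability method reports c = 0.8 ('class P with
# probability 0.8'), while the majority method simply labels the leaf P.
```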

The two suggested modifications were tested on various data sets, with different noise levels

affecting different attributes or the class information; the results are shown in [12]. Quite different

forms of degradation are observed:

       •Destroying class information produces a linear increase in error, reaching 50% error for a

        noise level of 100%, which means that objects would be classified randomly.

       •Noise in a single attribute doesn’t have a dramatic effect, and its impact appears to be directly

        proportional to the importance of the attribute. The importance of an attribute can be defined as

        the average classification error produced if the attribute is deleted altogether from the data.

       •Noise in all attributes together leads to a relatively rapid increase in error, which generally

        reaches a peak and then declines. The appearance of this peak is explained in [11].




These experiments led to a very interesting and unexpected observation given in [11]: for

higher noise levels, the performance of the correct decision tree on corrupted data was found to be

inferior to that of an imperfect decision tree formed from data corrupted to a similar level.

These observations suggest some basic tactics for dealing with noisy data:

       •It is important to eliminate noise affecting the class membership of the objects in the

        training set.

       •It is worthwhile to exclude noisy, less important attributes.

       •The payoff in noise reduction increases with the importance of the attribute.

       •The training set should reflect the noise distribution and level expected when the

        induced decision tree is used in practice.

       •The majority method of assigning classes to leaves is preferable to the probability method.



Conclusion

The methods employed to cope with noise in decision tree induction are mostly based on

empirical results. Although it is obvious that they lead to improvement of the decision trees in the

terms of simplicity and accuracy there is no formal theory to support them. This implies that

laying some theoretical foundation should be necessity in the future.




4. Pruning




In noisy domains, pruning methods are employed to cut back a full-size tree to a smaller one that is

likely to give better classification performance. Decision trees generated from the examples in

the training set are generally overfitted to it and therefore fail to accurately classify unseen examples from a test set.

Techniques used to prune the original tree usually consist of the following steps [15]:

       •generate a set of pruned trees

       •estimate the performance of each of these trees

       •select the best tree.

One of the major issues is what data set will be used to test the performance of the pruned trees.

The ideal situation would be to have the complete set of test examples; only then would we

be able to make the optimal tree selection. However, in practice this is not possible, and it is

approximated with a very large, independent test set, if one is available.

The real problem arises when such a test set is not available. Then the same set used for building the

decision tree has to be used to estimate the accuracy of the pruned trees. Resampling methods, such

as cross-validation, are the principal technique used in these situations. f-fold cross-validation is

a technique which divides the training set S into f blocks of roughly the same distribution; then,

for each block in turn, a classifier is constructed from the cases in the remaining blocks and

tested on the cases in the hold-out block. The error rate of the classifier produced from all the

cases is estimated as the ratio of the total number of errors on the hold-out cases to the total

number of cases. The average error rate from these distinct cross-validations is then a relatively

reliable estimate of the error rate of the single classifier produced from all the cases. 10-fold

cross-validation has proven to be very reliable and is widely used for many different learning

models.
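
A sketch of this estimate (build_tree and classify stand for whatever induction and classification routines are being evaluated; they, the simple shuffling in place of stratified blocks, and the parameter names are all my own assumptions):

```python
import random

def cross_validated_error(examples, build_tree, classify, folds=10, seed=0):
    """Estimate the error rate of the classifier built from all cases by
    averaging the hold-out errors over `folds` train/test splits."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    blocks = [examples[i::folds] for i in range(folds)]
    errors = 0
    for i, hold_out in enumerate(blocks):
        training = [e for j, block in enumerate(blocks) if j != i for e in block]
        tree = build_tree(training)
        errors += sum(1 for values, cls in hold_out if classify(tree, values) != cls)
    return errors / len(examples)
```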

Quinlan in [13] describes three techniques for pruning:

       •cost-complexity pruning,

       •reduced error pruning, and

       •pessimistic pruning.



Cost-Complexity Pruning

This technique was initially described in [2]. It consists of two stages:

       •First, a sequence of trees T0, T1, …, Tk is generated, where T0 is the original decision tree

        and each Ti+1 is obtained by replacing one or more subtrees of Ti with leaves, until the final

        tree Tk is just a leaf.

       •Then, each tree in the sequence is evaluated and one of them is selected as the final

        pruned tree.

A cost-complexity measure is used for the evaluation of a pruned tree T:

if N is the total number of examples classified by T, E is the number of misclassified ones, and

L(T) is the number of leaves in T, then the cost-complexity is defined as the sum

    E/N + α · L(T)

where α is some parameter. Now, let’s suppose that we replace some subtree T* of the tree T with the

best possible leaf. In general, this pruned tree would have M more misclassified examples and

L(T*) − 1 fewer leaves. The original and the pruned tree would have the same cost-complexity if

    α = M / (N · (L(T*) − 1))

To produce Ti+1 from Ti, each non-leaf subtree of Ti is examined to find the one with the minimum value

of α. The one or more subtrees with that value of α are then replaced by their respective best

leaves.


For the second stage of pruning we use an independent test set containing N′ examples to test the

accuracy of the pruned trees. If E′ is the minimum number of errors observed with any Ti, and the

standard error of E′ is given by

    se(E′) = √( E′ · (N′ − E′) / N′ )

then the tree selected is the smallest one whose number of errors does not exceed E′ + se(E′).
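
The two quantities driving this procedure can be written down directly; the sketch below (my own naming, with the number of leaves of each candidate tree assumed to be available as tree.num_leaves) covers the per-subtree α and the final tree selection:

```python
from math import sqrt

def alpha(extra_errors, total_examples, subtree_leaves):
    """alpha at which pruning a subtree with `subtree_leaves` leaves, at the cost
    of `extra_errors` additional misclassifications, leaves cost-complexity equal."""
    return extra_errors / (total_examples * (subtree_leaves - 1))

def select_pruned_tree(candidates, n_test):
    """Second stage: `candidates` is a list of (tree, errors_on_test_set) pairs.
    Pick the smallest tree whose errors do not exceed E' + se(E')."""
    best_errors = min(errors for _, errors in candidates)
    threshold = best_errors + sqrt(best_errors * (n_test - best_errors) / n_test)
    eligible = [(tree, errors) for tree, errors in candidates if errors <= threshold]
    return min(eligible, key=lambda pair: pair[0].num_leaves)[0]
```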



Reduced Error Pruning

This technique is probably the simplest and most intuitive one for finding small pruned trees of

high accuracy. First, the original tree T is used to classify an independent test set. Then, for every

non-leaf subtree T* of T, we examine the change in misclassifications over the test set that

would occur if T* were replaced by the best possible leaf. If the replacement does not increase the

number of errors and T* contains no subtree with the same property, T* is replaced by the leaf. The

process continues until any further replacement would increase the number of errors over the test set.
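
A simplified sketch of this procedure on the dictionary tree representation used earlier (it prunes bottom-up in a single recursive pass, choosing the best leaf from the test cases that reach each subtree; a faithful implementation would iterate the search over the whole tree as described above):

```python
from collections import Counter

def errors(tree, test_cases):
    """Misclassifications made by `tree` (a subtree or a leaf) on `test_cases`."""
    return sum(1 for values, cls in test_cases if classify(tree, values) != cls)

def best_leaf(test_cases):
    """The single class label that misclassifies the fewest of the given cases."""
    counts = Counter(cls for _, cls in test_cases)
    return counts.most_common(1)[0][0] if counts else "P"

def reduced_error_prune(tree, test_cases):
    """Replace a subtree by its best leaf whenever this does not increase the
    number of misclassifications over the test cases that reach it."""
    if not isinstance(tree, dict):                      # already a leaf
        return tree
    attribute = tree["attribute"]
    branches = {}
    for value, subtree in tree["branches"].items():
        reaching = [(v, c) for v, c in test_cases if v[attribute] == value]
        branches[value] = reduced_error_prune(subtree, reaching)
    pruned = {"attribute": attribute, "branches": branches}
    leaf = best_leaf(test_cases)
    return leaf if errors(leaf, test_cases) <= errors(pruned, test_cases) else pruned
```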



Pessimistic Pruning

This technique does not require a separate test set. If a decision tree T was generated from a training set

with N examples and then tested on the same set, we can assume that at some leaf of T there are

K classified examples, of which J are misclassified. The ratio J/K does not provide a reliable

estimate of the error rate of that leaf when unseen objects are classified, since the tree T has been tailored

to the training set. Instead, we can use a more realistic measure known as the continuity correction for the

binomial distribution, in which J is replaced with J + 1/2.




Now, let's consider some subtree T* of T, containing L(T*) leaves and classifying ΣK examples

(sum over all leaves of T*), with ΣJ of them misclassified. According to the above measure it will

misclassify ΣJ + L(T*)/2 unseen cases. If T* is replaced with the best leaf, which misclassifies E

examples from the training set, the new pruned tree will be accepted whenever E + 1/2 is within one

standard error of ΣJ + L(T*)/2 (the standard error is defined as in cost-complexity pruning).

All non-leaf subtrees are examined just once to see whether they should be pruned, and once a

subtree is pruned its subtrees aren’t examined further. This strategy makes this algorithm much

faster than the previous two.
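
The pruning test for a single subtree can be sketched as follows (training-set counts only; the standard error is formed as in cost-complexity pruning, with the corrected error count in the role of E′ and the number of covered cases in the role of N′ – that reading of 'defined as in the cost-complexity pruning' is my interpretation):

```python
from math import sqrt

def should_prune_pessimistic(sum_j, sum_k, num_leaves, leaf_errors):
    """Decide whether to replace subtree T* by its best leaf.
    sum_j       - training cases misclassified over the leaves of T*
    sum_k       - all training cases covered by T*
    num_leaves  - L(T*), the number of leaves of T*
    leaf_errors - E, errors of the best single leaf replacing T*"""
    corrected = sum_j + num_leaves / 2.0             # continuity-corrected errors of T*
    se = sqrt(corrected * (sum_k - corrected) / sum_k)
    return leaf_errors + 0.5 <= corrected + se       # prune if within one standard error
```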

Quinlan compares these three techniques on 6 different domains with both real-world and synthetic

data. The general observation is that the simplified trees are of superior or equivalent accuracy to the

originals, so pruning has been beneficial on both counts. Cost-complexity pruning tends to

produce smaller decision trees than either reduced error or pessimistic pruning, but they are less

accurate than the trees produced by the two other techniques. This suggests that cost-complexity pruning may

be overpruning. On the other hand, reduced error pruning and pessimistic pruning produce trees

with very similar accuracy, but since the latter uses only the training set for pruning and

is more efficient than the former, it can be pronounced the optimal technique among the

three suggested.



OPT Algorithm

While the previously described techniques are used to prune decision trees generated from noisy

data, Bratko and Bohanec in [1] introduce the OPT algorithm for pruning accurate decision trees. The

problem they aim to solve is [1]: given a completely accurate, but complex, definition of a

concept, simplify the definition, possibly at the expense of accuracy, so that the simplified

definition still corresponds to the concept well in general, but may be inaccurate in some details.

So, while the previously mentioned techniques were designed to improve tree accuracy, this one is

designed to reduce the size of a tree whose complexity makes it impractical to be communicated to and understood by the

user.

Bratko's and Bohanec's approach is somewhat similar to the previous pruning algorithms: they

construct a sequence of pruned trees and then select the smallest tree that satisfies some

required accuracy. However, the tree sequence they construct is denser with respect to the number

of leaves:

The sequence T0, T1, ..., Tn is constructed such that

1. n = L(T0) − 1,

2. the trees in the sequence decrease in size by one, i.e., L(Ti) = L(T0) − i for i = 0, 1, ..., n (unless

there is no pruned tree of the corresponding size), and

3. each Ti has the highest accuracy among all the pruned trees of T0 of the same size.

This sequence is called the optimal pruning sequence and was initially suggested by Breiman et al.

[2]. To construct this optimal pruning sequence efficiently, in quadratic (polynomial) time with

respect to the number of leaves of T0, they use dynamic programming. The construction is

recursive in that each subtree of T0 is again a decision tree with its own optimal pruning sequence.

The algorithm starts by constructing the sequences that correspond to small subtrees near the leaves of

T0. These are then combined together, yielding sequences that correspond to larger and larger

subtrees of T0, until the optimal pruning sequence for T0 is finally constructed.

The main advantage of the OPT algorithm is the density of its optimal pruning sequence, which always

contains an optimal tree. The sequences produced by cost-complexity pruning or reduced error

pruning are sparse and can therefore miss some optimal solutions.




One interesting observation derived from the experiments conducted by Bratko and Bohanec is

that, for real-world data, a relatively high accuracy was achieved with relatively small pruned trees

regardless of the technique used for pruning, while that wasn't the case with synthetic data. This is

further evidence of the usefulness of pruning, especially for real-world domains.




Conclusion
Whether we want to improve the classification accuracy of decision trees generated from noisy data

or to simplify accurate but complex decision trees to make them more intelligible to human

experts, pruning has proved to be very successful. Recent papers [], [] suggest there is still some

room left for improvement of the basic and most commonly used techniques described in this section.




5. Summary

The selection criterion is probably the most important aspect that determines the behavior of a top-

down decision tree generation algorithm. If it selects the most important attributes with respect to the class

information near the root of the tree, then any pruning technique can successfully cut off the

branches of class-independent and/or noisy attributes, because they will appear near the leaves

of the tree. Thus, an intelligent selection method, able to recognize the most important

attributes for classification, will initially generate simpler trees and, additionally, will ease the

job of the pruning algorithm.

The main problem of this domain seems to be the lack of a theoretical foundation: many techniques are

still used because of their acceptable empirical evaluation, not because they have been formally

proven to be superior. Development of a formal theory for decision tree induction is necessary for

better understanding of this domain and for further improvement of decision trees’ classification

accuracy, especially for noisy, incomplete, real-world data.




6. References

[1] Bratko, I. & Bohanec, M. (1994). Trading accuracy for simplicity in decision trees, Machine

Learning 15, 223-250.

[2] Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984). Classification and

Regression Trees. Monterey, CA: Wadsworth & Brooks.

[3] Breiman, L. (1996). Technical note: Some properties of splitting criteria, Machine Learning

24, 41-47.

[4] Fayyad, U.M. (1991). On the induction of decision trees for multiple concept learning, PhD

dissertation, EECS Department, The University of Michigan.

[5] Fayyad, U.M. & Irani, R.B. (1993). The attribute selection problem in decision tree generation,

Proceedings of the 10th National Conference on AI, AAAI-92, 104-110, MIT Press.

[6] Lopez de Mantras, R. (1991). A distance-based attribute selection measure for decision tree

induction, Machine Learning 6, 81-92.

[7] Kearns, M. & Mansour, Y. (1998). A fast, bottom-up decision tree pruning algorithm with

near optimal generalization, Submitted.




[8] Kononenko, I. , Bratko, I., Roskar, R. (1984). Experiments in automatic learning of medical

diagnosis rules, Technical Report, Faculty of Electrical Engineering, E.Kardelj University,

Ljubljana.

[9] Mingers, J. (1989). An empirical comparison of selection measures for decision-tree

induction, Machine Learning 3, 319-342.

[10] Schapire, R.E. & Helmbold, D.P. (1995). Predicting nearly as well as the best pruning of a

decision tree, Proceedings of the 8th Annual Conference on Computational Learning Theory,

ACM Press, 61-68.

[11] Quinlan, J.R. (1986). Induction of decision trees, Machine Learning 1, 81-106.

[12] Quinlan, J.R. (1986). The effect of noise on concept learning, Machine Learning: An

artificial intelligence approach, Morgan Kaufmann: San Mateo, CA, 148-166.

[13] Quinlan, J.R. (1987). Simplifying decision trees, International Journal of Man-Machine

Studies, 27, 221-234.

[14] Quinlan, J.R. (1988). Decision trees and multi-valued attributes, Machine Intelligence 11,

305-318.

[15] Weiss, S.M. & Indurkhya, N. (1994). Small sample decision tree pruning, Proceedings of

the 11th International Conference on Machine Learning, Morgan Kaufmann, 335-342.

[16] White, A.P. & Liu, W.Z. (1994). The importance of attribute selection measures in decision

tree induction, Machine Learning 15, 25-41.

[17] White, A.P. & Liu, W.Z. (1994). Technical note: Bias in information-based measures in

decision tree induction, Machine Learning 15, 321-329.




                                                25
26

More Related Content

More from butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Decision Tree Induction Using Impurity Measures

  • 1. 1. Introduction Decision trees are one of the methods for concept learning from the examples. They are widely used in machine learning and knowledge acquisition systems. The main application area are classification tasks: We are given a set of records, called training set. Each record from training set has the same structure, consisting of a number of attribute/value pairs. One of these attributes represents class of the record. We also have a test set for which the class information is unknown. The problem is to derive a decision tree, using examples from training set, which will determine class for each record in the test set. The leaves of induced decision tree are class names and other nodes represent attribute-based tests with a branch for each possible value of particular attribute. Once the tree is formed, we can classify objects from the test set: starting at the root of tree, we evaluate the test, and take the branch appropriate to the outcome. The process continues until leaf is encountered, at which time the object is asserted to belong to class named by the leaf. Induction of decision trees has been very active area of machine learning and many approaches and techniques have been developed for building trees with high classification performance. The most commonly addressed problems are : •selecting of the best attribute for splitting •dealing with noise in real-world tasks •pruning of complex decision trees •dealing with unknown attribute values •dealing with continuous attribute values. 1
  • 2. My intention is to give an overview of methods considering the first three problems, which I find to be very mutually dependant and of prior importance for building of trees with good classification ability. 2. Selection Criterion For a given training set it is possible to construct many decision trees that will correctly classify all of its objects. Among all accurate trees we are interested in the most simple one. This search is guided by Occam’s Razor heuristic: among all rules that accurately account for the training set, the simplest is likely to have the highest success rate when used to classify unseen objects. This heuristics is also supported by analysis: Pearl and Quinlan [11] have derived upper bounds on the expected error using different formalisms for generalizing from a set of known cases. For a training set of predetermined size, these bounds increase with the complexity of induced generalization. Since decision tree is made of nodes that represent attribute-based test, simplifying the tree would mean reducing the number of tests. We can achieve this by carefully selecting the order of tests to be conducted. The example given in [11] shows how for the same training set given in Table 1 different decision trees may be constructed. Each of the examples in training set is described in terms of 4 discrete attributes: outlook {sunny, overcast, rain}, temperature {cold, mild, hot}, humidity {high, normal}, windy {true, false} and each belonging to one of the classes N or P. Figure 1 shows decision tree when attribute outlook is used for the first test and figure 2 shows decision tree with the first test temperature. The difference in complexity is obvious. 2
  • 3. No. Outlook Temperature Humidity Windy Class 1 sunny hot high false N 2 sunny hot high true N 3 overcast hot high false P 4 rain mild high false P 5 rain cool normal false P 6 rain cool normal true N 7 overcast cool normal true P 8 sunny mild high false N 9 sunny cool normal false P 10 rain mild normal false P 11 sunny mild normal true P 3
  • 4. 12 overcast mild high true P 13 overcast hot normal false P 14 rain mild high true N Table1. A small training set outlook sunny overcast rain P humidity windy high normal true false Figure 1. A simple decision tree N P N P temperature sunny windy outlook outlook sunny o’cast rain sunny o’cast rain true false P P N P windy windy humidit humidit y y true false true false high normal high normal N P P N P P windy outlook true false sunny o’cast rain 4 N P N P null
  • 5. Figure 2. A complex decision tree This infers that the choice of test is crucial for simplicity of decision tree, on which many researchers, such as Quinlan [11], Fayyad [5], White and Liu [16], agree. A method of choosing a test to form the root of decision tree is usually referred to as selection criterion. Many different selection criterion have been tested over the years and most common once among them are maximum information gain and GINI index. Both of these methods belong to the class of impurity measures, which are designed to capture aspects of partitioning of examples relevant to good classification. Impurity measures Let S be set of training examples with each example e ∈ S belonging to one of the classes in C = {C1, C2, …, Ck}. We can define the class vector (c1, c2, …, ck) ∈ Nk , where ci = |{e ∈ S | class(e) = Ci}| and class probability vector (p1, p2, …, pk) ∈ [0, 1]k: c1 c 2 c ( p1 , p 2 ,..., p k ) = ( , ,..., 3 ) |S| |S| |S| It’s obvious that Σ pi =1. A set of examples is said to be pure if all its examples belong to one class. Hence, if probability vector of a set of examples has a component 1 (all other components being equal to 0) the set is said to be pure. On the other hand, if all components are equal we get an extreme case of impurity. To quantify the notion of impurity, a family of functions known as impurity measures [5] is defined. 5
  • 6. Definition 1 Let S be a set of training examples having a class probability vector PC. A function φ : [0, 1]k → R such that φ (PC) ≥ 0 is an impurity measure if it satisfies the following conditions: 1. φ (PC) is minimum if ∃i such that component PCi = 1. 2. φ (PC) is maximum if ∀i, 1 ≤ i ≤ k, PCi = 1/k. 3. φ (PC) is symmetric with respect to components of PC. 4. φ (PC) is smooth (differentiable everywhere) in its range. Conditions 1 and 2 express well-known extreme cases, and condition 3 insures that the measure is not biased towards any of the classes. For induction of decision trees, impurity measure is used to evaluate impurity of partition induced by an attribute. Let PC(S) be the class probability vector of S and let A be a discrete attribute over the set S. Let assume the attribute A partition set S into the sets S 1, S2, …, Sv. The impurity of the partition is defined as weighted average impurity on its component blocks: v | Si | ∆Φ( S , A) = ∑ ⋅ φ ( PC ( S i )) i =1 | S | Finally, the goodness-of-split due to attribute A is defined as reduction in impurity after the partition. ∆Φ(S, A) = φ (PC(S)) - Φ(S, A) If we choose entropy of the partition, E(A,S), as an impurity measure: k φ (PC(S)) = E(A, S) = ∑ − PC i =1 i ⋅ log 2 ( PC i ) than the reduction in impurity gained by an attribute is called information gain. 6
  • 7. This method was used in many algorithms for induction of decision trees such as ID3, GID3* and CART. The other popular impurity measure is GINI index used in CART [2]. To obtain GINI index we set φ to be φ (PC(S)) = ∑ PC i≠ j i ⋅ PC j All functions belonging to family of impurity measures agree on the minima, maxima and smoothness and as a consequence they should result in similar trees [2], [9] regarding complexity and accuracy. After detailed analysis Breiman [3] reports basic differences in trees produced using information gain and GINI index. The GINI prefers splits that put the largest class into one pure node, and all others into the other. Entropy favors size–balanced children nodes. If the number of classes is small both criterions should produce similar results. The difference appears when number of classes is larger. In this case GINI produces splits that are too unbalanced near to the root of the tree. On the other hand, splits produced by entropy show a lack of uniqueness. This analysis point out some of the problems associated with impurity measures. But, unfortunately, these are not the only ones. Some experiments carried out in the mid eighties showed that the gain criterion tends to favor attributes with many values [8]. This finding was supported by analysis in [11]. One of the solutions to this problem was offered by Kononenko et al. in [8]: decision tree induced has to be binary tree. This means that every test has only two outcomes. If we have an attribute A with values A1, A2, …, Av the decision tree no longer branches on each possible value. Instead, a subset of S is chosen and the tree has one branch for that subset and the other for remainder for S. This criterion is known as subset criterion. Kononenko et al. report that this modification led to smaller decision trees with an improved classification performance. Although, it’s obvious that 7
binary trees do not suffer from the bias in favor of attributes with a large number of values, it is not known whether this is the only reason for their better performance. This finding is repeated in [4], and in [5] Fayyad and Irani introduce the binary tree hypothesis:

For a top-down, non-backtracking decision tree generation algorithm, if the algorithm applies a proper attribute selection measure, then selecting a single attribute-value pair at each node, and thus constructing a binary tree, rather than selecting an attribute and branching on all its values simultaneously, is likely to lead to a decision tree with fewer leaves.

A formal proof for this hypothesis does not exist; it is the result of informal analysis and empirical evaluation. Fayyad has also shown in [4] that for every decision tree there exists a binary decision tree that is logically equivalent to it. This means that for every decision tree we could induce a logically equivalent binary decision tree that is expected to have fewer nodes and to be more accurate. But binary trees have some side-effects, explained in [11]. First, trees of this kind are undoubtedly less intelligible to human experts than is ordinarily the case, with unrelated attribute values being grouped together and with multiple tests on the same attribute. Second, the subset criterion can require a large increase in computation, especially for attributes with many values: for an attribute A with v values there are 2^(v-1) − 1 different ways of specifying the distinguished subset of attribute values. But since a decision tree is induced only once and then used for classification, and since computer efficiency is rapidly increasing, this problem seems to diminish.
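The count 2^(v-1) − 1 is easy to verify by enumeration. The sketch below (purely illustrative) lists the distinct binary splits of an attribute's value set, fixing one value in the left block so that a subset and its complement are not counted twice.

```python
from itertools import combinations

def binary_subset_splits(values):
    """All distinct binary splits of an attribute's value set; for v values
    there are 2**(v-1) - 1 of them."""
    values = list(values)
    anchor, rest = values[0], values[1:]     # fix one value on the left to avoid duplicates
    splits = []
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = {anchor, *extra}
            right = set(values) - left
            if right:                        # skip the trivial split with an empty block
                splits.append((left, right))
    return splits

print(len(binary_subset_splits(["sunny", "overcast", "rain"])))   # 3 = 2**(3-1) - 1
```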
In [11] Quinlan proposes another method for overcoming the bias in information gain, called the gain ratio. The gain ratio, GR, normalizes the gain with the attribute information (denoted IV by Quinlan):

$$GR = \frac{\Delta\Phi(S, A)}{-\sum_{i=1}^{v} \frac{|S_i|}{|S|} \cdot \log_2 \frac{|S_i|}{|S|}}$$

The attribute information is used as the normalizing factor because of its property of increasing as the number of possible values increases. As mentioned in [11], this ratio may not always be defined – the denominator may be zero – or it may tend to favor attributes for which the denominator is very small. As a solution to this, the gain ratio criterion selects, from among those attributes with an average-or-better gain, the attribute that maximizes GR. The experiments described in [14] show an improvement in tree simplicity and prediction accuracy when the gain ratio criterion is used.

There is also another measure, introduced by Lopez de Mantras in [6], called the distance measure, dN:

$$1 - d_N = \frac{\Delta\Phi(S, A)}{-\sum_{i=1}^{k} \sum_{j=1}^{v} \frac{|S_{ij}|}{|S|} \cdot \log_2 \frac{|S_{ij}|}{|S|}}$$

where |Sij| is the number of examples with value aj of attribute A that belong to class Ci. This is just another attempt at normalizing the information gain, but in this case with the cell information (a cell is a subset of S which contains all examples with one attribute value that belong to one class). Although both of these normalized measures were claimed to be unbiased, the statistical analysis in [17] shows that each of them again favors attributes with a larger number of values. These results also suggest that information gain is the worst measure in this respect compared to gain ratio and the distance measure, while gain ratio is the least biased. Furthermore, their analysis shows that the magnitude of the bias is strongly dependent on the number of classes, increasing as k is increased.
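Building on the information_gain sketch given earlier, the two normalizations can be written as follows; the helper names are my own, and both functions assume the attribute actually splits S into at least two blocks, otherwise the denominators are zero (the degenerate case mentioned above).

```python
import math
from collections import Counter
# information_gain and the (attributes, label) example layout are as in the earlier sketch

def attribute_information(examples, attribute):
    """-sum_i |Si|/|S| * log2(|Si|/|S|) over the value blocks of the attribute."""
    n = len(examples)
    sizes = Counter(attrs[attribute] for attrs, _ in examples)
    return -sum((s / n) * math.log2(s / n) for s in sizes.values())

def gain_ratio(examples, attribute):
    """Gain ratio: information gain normalized by the attribute information."""
    return information_gain(examples, attribute) / attribute_information(examples, attribute)

def cell_information(examples, attribute):
    """-sum_ij |Sij|/|S| * log2(|Sij|/|S|) over the (value, class) cells."""
    n = len(examples)
    cells = Counter((attrs[attribute], label) for attrs, label in examples)
    return -sum((c / n) * math.log2(c / n) for c in cells.values())

def one_minus_distance(examples, attribute):
    """1 - d_N: information gain normalized by the cell information."""
    return information_gain(examples, attribute) / cell_information(examples, attribute)
```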
Orthogonality measure

Recently, a conceptually new approach was introduced by Fayyad and Irani in [5]. In their analysis they give a number of reasons why information entropy, as a representative of the class of impurity measures, is not suitable for attribute selection. Consider the following example: a set S of 110 examples belongs to three classes {C1, C2, C3} and has class vector (50, 10, 50). Assume that the attribute-value pairs (A, a1) and (A, a2) induce two binary partitions of S, π1 and π2, shown in Figure 3. We can see that π2 completely separates the class C2 from the classes C1 and C3. However, the information gain measure prefers partition π1 (gain = 0.51) over π2 (gain = 0.43).

Figure 3. Two possible binary partitions (class counts for C1, C2, C3):
  Partition π1, produced by a1: blocks (45, 8, 5) and (5, 2, 45); gain = 0.51
  Partition π2, produced by a2: blocks (50, 0, 50) and (0, 10, 0); gain = 0.43
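The gains quoted in Figure 3 can be checked directly from the class-count vectors. The short computation below reproduces them up to rounding (about 0.51 for π1 and about 0.44 for π2, the latter quoted as 0.43 in [5]).

```python
import math

def entropy_of_counts(counts):
    """Entropy of a class-count vector such as (50, 10, 50)."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def gain(parent, blocks):
    """Information gain of splitting the parent class counts into the given blocks."""
    n = sum(parent)
    return entropy_of_counts(parent) - sum(sum(b) / n * entropy_of_counts(b) for b in blocks)

S = (50, 10, 50)                                    # class vector of S for (C1, C2, C3)
print(gain(S, [(45, 8, 5), (5, 2, 45)]))            # partition pi_1: ~0.51
print(gain(S, [(50, 0, 50), (0, 10, 0)]))           # partition pi_2: ~0.44 (0.43 in [5])
```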
Analysis shows that if π2 is accepted, the subtree under this node has a lower bound of three leaves (the pure block immediately becomes a leaf). On the other hand, if π1 is chosen the subtree could minimally have 6 leaves. Intuitively, if the goal is to generate a tree with a small number of leaves, the selection measure should be sensitive to total class separation – it should separate differing classes from each other as much as possible while separating as few examples of the same class as possible. The above example shows that information entropy does not satisfy these demands – it is completely insensitive to class separation and within-class fragmentation. The only exception is when the learning problem has exactly two classes: then class purity and class separation become the same.

Another negative property of information gain emphasized in this paper is its tendency to induce decision trees with near-minimal average depth. The empirical evaluation of such trees shows that they tend to have a large number of leaves and a high error rate [4]. Another of the deficiencies pointed out is actually embedded in the definition of impurity measures: their symmetry with respect to the components of PC. This means that a set with a given class probability vector evaluates identically to another set whose class vector is a permutation of the first. Thus, if one of the subsets of a set S has a different majority class than the original but the distribution of classes is simply permuted, entropy will not detect the change. However, this change in the dominant class is generally a strong indicator that the attribute value is relevant to classification.

Recognizing the above weaknesses of impurity measures, the authors define a desirable class of selection measures. Assuming induction of a binary tree (relying on the binary tree hypothesis), for a training set S and an attribute A, a test τ on this attribute induces a binary partition of the set S: S = Sτ ∪ S¬τ, where Sτ = { e ∈ S | e satisfies τ } and S¬τ = S \ Sτ.
The selection measure should satisfy the following properties:

1. It is maximum when the classes in Sτ are disjoint from the classes in S¬τ (inter-class separation).
2. It is minimum when the class distribution in Sτ is identical to the class distribution in S¬τ.
3. It favors partitions which keep examples of the same class in the same block (intra-class cohesiveness).
4. It is sensitive to permutations in the class distribution.
5. It is non-negative, smooth (differentiable), and symmetric with respect to the classes.

This defines a family of measures called C-SEP (for Class SEParation) for evaluating binary partitions. One such measure, proposed in this paper and proven to satisfy all requirements of the C-SEP family, is the orthogonality measure, defined as

ORT(τ, S) = 1 − cos θ(V1, V2),

where θ(V1, V2) is the angle between the two class vectors V1 and V2 of the partitions Sτ and S¬τ, respectively.
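The orthogonality measure itself is only a few lines of code. The sketch below (illustrative, not the O-BTREE implementation) applies it to the class vectors of the two partitions from Figure 3 and, unlike information gain, it clearly prefers π2.

```python
import math

def ort(v1, v2):
    """Orthogonality measure ORT(tau, S) = 1 - cos(angle between the class
    vectors of the two blocks of a binary partition)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return 1.0 - dot / norm

# the two candidate partitions from Figure 3
print(ort([45, 8, 5], [5, 2, 45]))    # pi_1: ~0.78, much overlap between the blocks
print(ort([50, 0, 50], [0, 10, 0]))   # pi_2: 1.0, the class vectors are orthogonal
```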
The results of an empirical comparison of the orthogonality measure, embedded in the O-BTREE system, with the entropy measure used in GID3* (which branches only on a few individual values while grouping the rest in one default branch), the information gain in ID3, the gain ratio in ID3-IV, and the information gain for induction of binary trees in ID3-BIN, are taken from [5] and given in Figures 4, 5, 6 and 7. In these experiments 5 different data sets were used: RIE (Reactive Ion Etching) – synthetic data, and the real-world data sets Soybean, Auto, Harr90 and Mushroom. Descriptions of these sets may be found in [5]. The results reported are in terms of ratios relative to GID3* performance (GID3* = 1.0 in both cases).

Figure 4. Error ratios for RIE-random domains.
Figure 5. Leaf ratios for RIE-random domains.
Figure 6. Relative ratios of error rates (GID3* = 1).
Figure 7. Ratios of numbers of leaves (GID3* = 1).

Figures 4, 5, 6 and 7 show that the results for the O-BTREE algorithm are almost always superior to those of the other algorithms.

Conclusion

Until recently, most algorithms for the induction of decision trees used one of the impurity measures described in the previous section. These functions were borrowed from information theory without any formal analysis of their suitability as a selection criterion. The empirical results were acceptable and only small variations of these methods were further tested.
Fayyad's and Irani's approach in [5] introduces a completely different family of measures, C-SEP, for binary partitions. They recognize very important properties of a measure, inter-class separation and intra-class cohesiveness, which were not precisely captured by impurity measures. This is a first step towards a better formalization of the selection criterion, which is necessary for further improvement of decision trees' accuracy and simplicity.

3. Noise

When we use decision tree induction techniques in real-world domains we have to expect noisy data. The description of an object may include attributes based on measurements or subjective judgement, both of which can give rise to errors in the values of the attributes. Sometimes the class information itself may contain errors. These defects in the data lead to two known problems:

• attribute inadequacy, meaning that even though some examples have identical descriptions in terms of attribute values, they do not belong to the same class. Inadequate attributes are not able to distinguish among the objects in the set S.
• spurious tree complexity, which is the result of the tree induction algorithm trying to fit the noisy data into the tree.

Recognizing these two problems, we can define two modifications of the tree-building algorithm if it is to be able to operate with a noise-affected training set [11]:

• the algorithm must be able to decide that testing further attributes will not improve the predictive accuracy of the decision tree;
• the algorithm must be able to work with inadequate attributes.

In [11] Quinlan suggests the chi-square test for stochastic independence as an implementation of the first modification:
Let S be a collection of objects which belong to one of two classes, N and P, and let A be an attribute with v values that produces the subsets {S1, S2, …, Sv} of S, where Si contains pi and ni objects of classes P and N, respectively. If the value of A is irrelevant to the class of an object in S (if the values of A for these objects are just noise, they would be expected to be unrelated to the objects' classes), the expected value pi′ of pi should be

$$p_i' = p \cdot \frac{p_i + n_i}{p + n}$$

If ni′ is the corresponding expected value of ni, the statistic

$$\sum_{i=1}^{v} \left( \frac{(p_i - p_i')^2}{p_i'} + \frac{(n_i - n_i')^2}{n_i'} \right)$$

is approximately chi-square with v − 1 degrees of freedom. This statistic can be used to determine the confidence with which one can reject the hypothesis that A is independent of the class of the objects in S [11]. The tree-building procedure can then be modified to prevent testing any attribute whose irrelevance cannot be rejected with a very high (e.g., 99%) confidence level. One difficulty with the chi-square test is that it is unreliable for very small values of the expectations pi′ and ni′, so the common practice is to use it only when all expectations are at least 4 [12].
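A small sketch of this statistic is given below; the decision itself would compare the returned value with the critical value of the chi-square distribution with v − 1 degrees of freedom at the chosen confidence level, and the block counts used in the example are made up.

```python
def chi_square_statistic(blocks):
    """blocks: list of (p_i, n_i) counts of classes P and N in each value block.
    Returns the statistic comparing observed counts with the counts expected
    if the attribute were irrelevant to the class."""
    p = sum(pi for pi, _ in blocks)
    n = sum(ni for _, ni in blocks)
    stat = 0.0
    for pi, ni in blocks:
        size = pi + ni
        pi_exp = p * size / (p + n)       # p_i'
        ni_exp = n * size / (p + n)       # n_i'
        stat += (pi - pi_exp) ** 2 / pi_exp + (ni - ni_exp) ** 2 / ni_exp
    return stat                            # ~ chi-square with v - 1 degrees of freedom

# a near-uniform split gives a small value: consistent with an irrelevant attribute
print(chi_square_statistic([(4, 5), (5, 4), (6, 4)]))
```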
The second modification should enable the algorithm to cope with inadequate attributes. Quinlan [12] suggests two possibilities:

• The notion of class membership could be generalized to a continuous value c lying between 0 and 1: if the subset of objects at a leaf contains p examples belonging to class P and n examples belonging to class N, the choice for c would be

$$c = \frac{p}{p + n}$$

In this case a class value of 0.8 would be interpreted as 'belonging to class P with probability 0.8'. The classification error is then c if the object really belongs to class N, and 1 − c if it really belongs to class P. This method is called the probability method.

• A voting model could be established: assign all objects to the more numerous class at the leaf. This method is called the majority method.

It can be verified that the first method minimizes the sum of the squares of the classification errors, while the second one minimizes the sum of absolute errors over the objects in S. If the goal is to minimize the expected error, the second approach seems more suitable, and the empirical results shown in [12] confirm this.

The two suggested modifications were tested on various data sets, with different noise levels affecting different attributes or the class information; the results are shown in [12]. Quite different forms of degradation are observed:

• Destroying class information produces a linear increase in error, for a noise level of 100% reaching 50% error, which means that objects would be classified randomly.
• Noise in a single attribute does not have a dramatic effect, and its impact appears to be directly proportional to the importance of the attribute. The importance of an attribute can be defined as the average classification error produced if the attribute is deleted altogether from the data.
• Noise in all attributes together leads to a relatively rapid increase in error, which generally reaches a peak and then declines. The appearance of this peak is explained in [11].
These experiments led to a very interesting and unexpected observation, given in [11]: for higher noise levels, the performance of the correct decision tree on corrupted data was found to be inferior to that of an imperfect decision tree formed from data corrupted to a similar level. These observations suggest some basic tactics for dealing with noisy data:

• It is important to eliminate noise affecting the class membership of the objects in the training set.
• It is worthwhile to exclude noisy, less important attributes.
• The payoff in noise reduction increases with the importance of the attribute.
• The training set should reflect the noise distribution and level expected when the induced decision tree is used in practice.
• The majority method of assigning classes to leaves is preferable to the probability method.

Conclusion

The methods employed to cope with noise in decision tree induction are mostly based on empirical results. Although it is obvious that they lead to improvement of the decision trees in terms of simplicity and accuracy, there is no formal theory to support them. This implies that laying some theoretical foundation should be a necessity in the future.

4. Pruning
In noisy domains, pruning methods are employed to cut back a full-size tree to a smaller one that is likely to give better classification performance. Decision trees generated from the examples in the training set are generally too overfitted to accurately classify unseen examples from a test set. Techniques used to prune the original tree usually consist of the following steps [15]:

• generate a set of pruned trees;
• estimate the performance of each of these trees;
• select the best tree.

One of the major issues is what data set will be used to test the performance of the pruned trees. The ideal situation would be to have the complete set of test examples; only then would we be able to make an optimal tree selection. However, in practice this is not possible, and it is approximated with a very large, independent test set, if one is available. The real problem arises when such a test set is not available. Then the same set used for building the decision tree has to be used to estimate the accuracy of the pruned trees. Resampling methods, such as cross-validation, are the principal technique used in these situations. f-fold cross-validation divides the training set S into f blocks of roughly the same size and class distribution; then, for each block in turn, a classifier is constructed from the cases in the remaining blocks and tested on the cases in the hold-out block. The error rate of the classifier produced from all the cases is estimated as the ratio of the total number of errors on the hold-out cases to the total number of cases. The average error rate from several such distinct cross-validations is then a relatively reliable estimate of the error rate of the single classifier produced from all the cases. 10-fold cross-validation has proven to be very reliable and is widely used for many different learning models.
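A minimal sketch of the f-fold cross-validation estimate is given below; build_classifier stands for any tree inducer and is an assumption of the sketch, and the stratification of the blocks by class ("roughly the same distribution") is omitted for brevity.

```python
import random

def cross_validation_error(examples, build_classifier, f=10, seed=0):
    """Estimate the error rate of the classifier built from all examples:
    train on f-1 blocks, count errors on the hold-out block, sum over blocks."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    blocks = [examples[i::f] for i in range(f)]
    errors = 0
    for i, hold_out in enumerate(blocks):
        training = [e for j, block in enumerate(blocks) if j != i for e in block]
        classifier = build_classifier(training)          # returns a function attrs -> class
        errors += sum(1 for attrs, label in hold_out if classifier(attrs) != label)
    return errors / len(examples)

# toy inducer for illustration: always predicts the majority class of its training data
def majority_inducer(training):
    labels = [label for _, label in training]
    winner = max(set(labels), key=labels.count)
    return lambda attrs: winner
```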
Quinlan in [13] describes three techniques for pruning:

• cost-complexity pruning,
• reduced error pruning, and
• pessimistic pruning.

Cost-Complexity Pruning

This technique was initially described in [2]. It consists of two stages:

• First, a sequence of trees T0, T1, …, Tk is generated, where T0 is the original decision tree and each Ti+1 is obtained by replacing one or more subtrees of Ti with leaves, until the final tree Tk is just a leaf.
• Then, each tree in the sequence is evaluated and one of them is selected as the final pruned tree.

The cost-complexity measure is used for the evaluation of a pruned tree T: if N is the total number of examples classified by T, E is the number of misclassified ones, and L(T) is the number of leaves in T, then the cost-complexity is defined as the sum

$$\frac{E}{N} + \alpha \cdot L(T)$$

where α is some parameter. Now suppose that we replace some subtree T* of the tree T with the best possible leaf. In general, this pruned tree would have M more misclassified examples and L(T*) − 1 fewer leaves. T and the pruned tree would have the same cost-complexity if

$$\alpha = \frac{M}{N \cdot (L(T^*) - 1)}$$

To produce Ti+1 from Ti, each non-leaf subtree of Ti is examined to find the one with the minimum value of α. The one or more subtrees with that value of α are then replaced by their respective best leaves.
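The arithmetic of the first stage is simple enough to state directly; the following sketch just transcribes the two formulas above (the argument names are mine).

```python
def cost_complexity(errors, n, leaves, alpha):
    """Cost-complexity of a tree: E/N + alpha * L(T)."""
    return errors / n + alpha * leaves

def critical_alpha(extra_errors, n, subtree_leaves):
    """Value of alpha at which pruning subtree T* (replacing it with its best
    leaf) leaves the cost-complexity unchanged: alpha = M / (N * (L(T*) - 1))."""
    return extra_errors / (n * (subtree_leaves - 1))

# at each step, prune the subtree(s) with the smallest critical alpha
print(critical_alpha(extra_errors=3, n=100, subtree_leaves=4))   # 0.01
```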
For the second stage of pruning we use an independent test set containing N′ examples to test the accuracy of the pruned trees. If E′ is the minimum number of errors observed with any Ti and the standard error of E′ is given by

$$se(E') = \sqrt{\frac{E' \cdot (N' - E')}{N'}}$$

then the tree selected is the smallest one whose number of errors does not exceed E′ + se(E′).

Reduced Error Pruning

This technique is probably the simplest and most intuitive one for finding small pruned trees of high accuracy. First, the original tree T is used to classify an independent test set. Then, for every non-leaf subtree T* of T, we examine the change in misclassifications over the test set that would occur if T* were replaced by the best possible leaf. If the replacement never increases the number of errors, and T* contains no subtree with the same property, T* is replaced by the leaf. The process continues until any further replacement would increase the number of errors over the test set.
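A simplified, bottom-up sketch of reduced error pruning is shown below; it is not Quinlan's exact procedure (which repeatedly scans the whole tree), and the Node layout for trees over discrete attributes is an assumption of the sketch.

```python
class Node:
    """Decision tree node over discrete attributes.  Leaves have no branches;
    'majority' is the best possible leaf class for the node."""
    def __init__(self, majority, attribute=None, branches=None):
        self.majority = majority
        self.attribute = attribute
        self.branches = branches or {}            # attribute value -> child Node

    def classify(self, attrs):
        if not self.branches:
            return self.majority
        child = self.branches.get(attrs.get(self.attribute))
        return child.classify(attrs) if child else self.majority

def errors(node, test_set):
    """Number of misclassified (attrs, label) pairs."""
    return sum(1 for attrs, label in test_set if node.classify(attrs) != label)

def reduced_error_prune(node, test_set):
    """test_set holds the test examples that reach this node.  Children are
    pruned first on their share of the examples; the node is then replaced by
    its best leaf if that leaf makes no more errors here than the subtree."""
    if not node.branches:
        return node
    for value, child in list(node.branches.items()):
        reaching = [(a, l) for a, l in test_set if a.get(node.attribute) == value]
        node.branches[value] = reduced_error_prune(child, reaching)
    leaf_errors = sum(1 for _, label in test_set if label != node.majority)
    return Node(node.majority) if leaf_errors <= errors(node, test_set) else node
```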
Pessimistic Pruning

This technique does not require a separate test set. If the decision tree T was generated from a training set with N examples and then tested on the same set, we can assume that at some leaf of T there are K classified examples, of which J are misclassified. The ratio J/K does not provide a reliable estimate of the error rate of that leaf when unseen objects are classified, since the tree T has been tailored to the training set. Instead, we can use a more realistic measure known as the continuity correction for the binomial distribution, in which J is replaced by J + 1/2.

Now consider some subtree T* of T, containing L(T*) leaves and classifying ΣK examples (sums are taken over all leaves of T*), with ΣJ of them misclassified. According to the above measure it will misclassify ΣJ + L(T*)/2 unseen cases. If T* is replaced with the best leaf, which misclassifies E examples from the training set, the new pruned tree is accepted whenever E + 1/2 is within one standard error of ΣJ + L(T*)/2 (the standard error is defined as in cost-complexity pruning). All non-leaf subtrees are examined just once to see whether they should be pruned, but once a subtree is pruned its own subtrees are not examined further. This strategy makes the algorithm much faster than the previous two.

Quinlan compares these three techniques on 6 different domains with both real-world and synthetic data. The general observation is that the simplified trees are of superior or equivalent accuracy to the originals, so pruning has been beneficial on both counts. Cost-complexity pruning tends to produce smaller decision trees than either reduced error or pessimistic pruning, but they are less accurate than the trees produced by the other two techniques. This suggests that cost-complexity pruning may be overpruning. On the other hand, reduced error pruning and pessimistic pruning produce trees with very similar accuracy, but knowing that the latter uses only the training set for pruning and that it is more efficient than the former, it can be pronounced the optimal technique among the three suggested.
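To make the pessimistic acceptance test described above concrete, here is a small numeric sketch (the argument names are mine); it simply compares the corrected error of the best leaf with the corrected error of the subtree plus one standard error.

```python
import math

def pessimistic_prune_accepts(sum_j, num_leaves, leaf_errors, n_covered):
    """Accept replacing subtree T* by its best leaf when E + 1/2 lies within one
    standard error of sum(J) + L(T*)/2, computed over the n_covered training
    examples that T* classifies."""
    corrected_subtree = sum_j + num_leaves / 2.0        # sum J + L(T*)/2
    corrected_leaf = leaf_errors + 0.5                  # E + 1/2
    se = math.sqrt(corrected_subtree * (n_covered - corrected_subtree) / n_covered)
    return corrected_leaf <= corrected_subtree + se

# e.g. a subtree with 4 leaves that misclassifies 6 of the 40 examples it covers,
# whose best leaf would misclassify 9 of them: the replacement is accepted
print(pessimistic_prune_accepts(sum_j=6, num_leaves=4, leaf_errors=9, n_covered=40))
```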
OPT Algorithm

While the previously described techniques are used to prune decision trees generated from noisy data, Bratko and Bohanec in [1] introduce the OPT algorithm for pruning accurate decision trees. The problem they aim to solve is [1]: given a completely accurate, but complex, definition of a concept, simplify the definition, possibly at the expense of accuracy, so that the simplified definition still corresponds to the concept well in general, but may be inaccurate in some details. So, while the previously mentioned techniques were designed to improve tree accuracy, this one is designed to reduce the size of a tree that would otherwise be impractical to communicate to, and be understood by, the user.

Bratko's and Bohanec's approach is somewhat similar to the previous pruning algorithms: they construct a sequence of pruned trees and then select the smallest tree that satisfies some required accuracy. However, the tree sequence they construct is denser with respect to the number of leaves: a sequence T0, T1, ..., Tn is constructed such that

1. n = L(T0) − 1,
2. the trees in the sequence decrease in size by one leaf, i.e., L(Ti) = L(T0) − i for i = 0, 1, ..., n (unless there is no pruned tree of the corresponding size), and
3. each Ti has the highest accuracy among all the pruned trees of T0 of the same size.

This sequence is called the optimal pruning sequence and was initially suggested by Breiman et al. [2]. To construct the optimal pruning sequence efficiently, in quadratic (polynomial) time with respect to the number of leaves of T0, they use dynamic programming. The construction is recursive in that each subtree of T0 is again a decision tree with its own optimal pruning sequence. The algorithm starts by constructing the sequences that correspond to small subtrees near the leaves of T0. These are then combined, yielding sequences that correspond to larger and larger subtrees of T0, until the optimal pruning sequence for T0 itself is finally constructed. The main advantage of the OPT algorithm is the density of its optimal pruning sequence, which always contains an optimal tree. Sequences produced by cost-complexity pruning or reduced error pruning are sparse and can therefore miss some optimal solutions.
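The dynamic-programming idea can be sketched as follows. This is only the error table of the optimal pruning sequence (the pruned trees themselves are not reconstructed), and the Subtree layout, with per-node class counts, is an assumption of the sketch rather than the OPT data structures.

```python
class Subtree:
    """Node of a decision tree; 'counts' are the class counts of the training
    examples reaching the node, 'children' its child subtrees."""
    def __init__(self, counts, children=()):
        self.counts = list(counts)
        self.children = list(children)

def leaf_errors(node):
    """Errors made if the node is pruned to a single (majority-class) leaf."""
    return sum(node.counts) - max(node.counts)

def optimal_error_table(node):
    """For each achievable leaf count l, the minimum number of errors of any
    pruned version of this subtree with exactly l leaves."""
    table = {1: leaf_errors(node)}                 # option: prune to a single leaf
    if node.children:
        combined = {0: 0}
        for child in node.children:                # knapsack-style merge of child tables
            child_table = optimal_error_table(child)
            merged = {}
            for l1, e1 in combined.items():
                for l2, e2 in child_table.items():
                    l, e = l1 + l2, e1 + e2
                    if e < merged.get(l, float("inf")):
                        merged[l] = e
            combined = merged
        for l, e in combined.items():
            if l > 1 and e < table.get(l, float("inf")):
                table[l] = e
    return table

# toy tree: a root whose test splits (50, 10, 50) into (45, 8, 5) and (5, 2, 45)
root = Subtree([50, 10, 50], [Subtree([45, 8, 5]), Subtree([5, 2, 45])])
print(optimal_error_table(root))                   # {1: 60, 2: 20}
```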
One interesting observation derived from the experiments conducted by Bratko and Bohanec is that for real-world data a relatively high accuracy was achieved with relatively small pruned trees, regardless of the technique used for pruning, while that was not the case with synthetic data. This is further evidence of the usefulness of pruning, especially in real-world domains.

Conclusion

Whether we want to improve the classification accuracy of decision trees generated from noisy data, or to simplify accurate but complex decision trees to make them more intelligible to human experts, pruning has proved to be very successful. Recent papers [7], [10] suggest there is still some space left for improvement of the basic and most commonly used techniques described in this section.

5. Summary

The selection criterion is probably the most important aspect determining the behavior of a top-down decision tree generation algorithm. If it selects the attributes most relevant to the class information near the root of the tree, then any pruning technique can successfully cut off the branches of class-independent and/or noisy attributes, because they will appear near the leaves of the tree. Thus, an intelligent selection method which is able to recognize the attributes most important for classification will initially generate simpler trees and, additionally, will ease the job of the pruning algorithm.

The main problem in this domain seems to be the lack of theoretical foundation: many techniques are still used because of their acceptable empirical evaluation, not because they have been formally
proven to be superior. Development of a formal theory for decision tree induction is necessary for a better understanding of this domain and for further improvement of decision trees' classification accuracy, especially for noisy, incomplete, real-world data.

6. References

[1] Bratko, I. & Bohanec, M. (1994). Trading accuracy for simplicity in decision trees, Machine Learning 15, 223-250.
[2] Breiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks.
[3] Breiman, L. (1996). Technical note: Some properties of splitting criteria, Machine Learning 24, 41-47.
[4] Fayyad, U.M. (1991). On the induction of decision trees for multiple concept learning, PhD dissertation, EECS Department, The University of Michigan.
[5] Fayyad, U.M. & Irani, R.B. (1993). The attribute selection problem in decision tree generation, Proceedings of the 10th National Conference on AI, AAAI-92, MIT Press, 104-110.
[6] Lopez de Mantras, R. (1991). A distance-based attribute selection measure for decision tree induction, Machine Learning 6, 81-92.
[7] Kearns, M. & Mansour, Y. (1998). A fast, bottom-up decision tree pruning algorithm with near-optimal generalization, submitted.
[8] Kononenko, I., Bratko, I., & Roskar, R. (1984). Experiments in automatic learning of medical diagnosis rules, Technical Report, Faculty of Electrical Engineering, E. Kardelj University, Ljubljana.
[9] Mingers, J. (1989). An empirical comparison of selection measures for decision-tree induction, Machine Learning 3, 319-342.
[10] Schapire, R.E. & Helmbold, D.P. (1995). Predicting nearly as well as the best pruning of a decision tree, Proceedings of the 8th Annual Conference on Computational Learning Theory, ACM Press, 61-68.
[11] Quinlan, J.R. (1986). Induction of decision trees, Machine Learning 1, 81-106.
[12] Quinlan, J.R. (1986). The effect of noise on concept learning, Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann: San Mateo, CA, 148-166.
[13] Quinlan, J.R. (1987). Simplifying decision trees, International Journal of Man-Machine Studies 27, 221-234.
[14] Quinlan, J.R. (1988). Decision trees and multi-valued attributes, Machine Intelligence 11, 305-318.
[15] Weiss, S.M. & Indurkhya, N. (1994). Small sample decision tree pruning, Proceedings of the 11th International Conference on Machine Learning, Morgan Kaufmann, 335-342.
[16] White, A.P. & Liu, W.Z. (1994). The importance of attribute selection measures in decision tree induction, Machine Learning 15, 25-41.
[17] White, A.P. & Liu, W.Z. (1994). Technical note: Bias in information-based measures in decision tree induction, Machine Learning 15, 321-329.