1. Introduction

Decision trees are one of the methods for concept learning from examples. They are widely used in machine learning and knowledge acquisition systems. The main application area is classification tasks: we are given a set of records, called the training set. Each record in the training set has the same structure, consisting of a number of attribute/value pairs. One of these attributes represents the class of the record. We also have a test set for which the class information is unknown. The problem is to derive a decision tree, using the examples from the training set, which will determine the class of each record in the test set.

The leaves of the induced decision tree are class names, and the other nodes represent attribute-based tests with a branch for each possible value of the particular attribute. Once the tree is formed, we can classify objects from the test set: starting at the root of the tree, we evaluate the test and take the branch appropriate to the outcome. The process continues until a leaf is encountered, at which point the object is asserted to belong to the class named by the leaf (a short sketch of this procedure appears after the list below).

Induction of decision trees has been a very active area of machine learning, and many approaches and techniques have been developed for building trees with high classification performance. The most commonly addressed problems are:

•selecting the best attribute for splitting
•dealing with noise in real-world tasks
•pruning of complex decision trees
•dealing with unknown attribute values
•dealing with continuous attribute values
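As a small illustration of the classification procedure described above (mine, not part of the original text), the following sketch walks a tree of attribute-based tests until a leaf is reached. The nested-tuple encoding of the tree of Figure 1 is an assumption made only for this example.

    # A decision node is a pair ("attribute", {value: subtree, ...});
    # a leaf is simply a class name. This encodes the tree of Figure 1.
    simple_tree = (
        "outlook",
        {
            "sunny": ("humidity", {"high": "N", "normal": "P"}),
            "overcast": "P",
            "rain": ("windy", {"true": "N", "false": "P"}),
        },
    )

    def classify(tree, record):
        """Follow the branch matching the record's attribute value until a leaf."""
        while isinstance(tree, tuple):
            attribute, branches = tree
            tree = branches[record[attribute]]
        return tree

    print(classify(simple_tree, {"outlook": "sunny", "humidity": "normal"}))  # -> P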
My intention is to give an overview of methods addressing the first three problems, which I find to be highly interdependent and of primary importance for building trees with good classification ability.

2. Selection Criterion

For a given training set it is possible to construct many decision trees that correctly classify all of its objects. Among all accurate trees we are interested in the simplest one. This search is guided by the Occam's Razor heuristic: among all rules that accurately account for the training set, the simplest is likely to have the highest success rate when used to classify unseen objects. This heuristic is also supported by analysis: Pearl and Quinlan [11] have derived upper bounds on the expected error using different formalisms for generalizing from a set of known cases. For a training set of predetermined size, these bounds increase with the complexity of the induced generalization.

Since a decision tree is made of nodes that represent attribute-based tests, simplifying the tree means reducing the number of tests. We can achieve this by carefully selecting the order of the tests to be conducted. The example given in [11] shows how different decision trees may be constructed for the same training set, given in Table 1. Each example in the training set is described in terms of 4 discrete attributes: outlook {sunny, overcast, rain}, temperature {cool, mild, hot}, humidity {high, normal}, windy {true, false}, and each belongs to one of the classes N or P. Figure 1 shows the decision tree obtained when the attribute outlook is used for the first test, and Figure 2 shows the decision tree whose first test is on temperature. The difference in complexity is obvious.
No.  Outlook   Temperature  Humidity  Windy  Class
1    sunny     hot          high      false  N
2    sunny     hot          high      true   N
3    overcast  hot          high      false  P
4    rain      mild         high      false  P
5    rain      cool         normal    false  P
6    rain      cool         normal    true   N
7    overcast  cool         normal    true   P
8    sunny     mild         high      false  N
9    sunny     cool         normal    false  P
10   rain      mild         normal    false  P
11   sunny     mild         normal    true   P
12   overcast  mild         high      true   P
13   overcast  hot          normal    false  P
14   rain      mild         high      true   N

Table 1. A small training set

Figure 1. A simple decision tree (first test on outlook)
Figure 2. A complex decision tree (first test on temperature)

This shows that the choice of test is crucial for the simplicity of the decision tree, a point on which many researchers, such as Quinlan [11], Fayyad [5], and White and Liu [16], agree. A method of choosing a test to form the root of a decision tree is usually referred to as a selection criterion. Many different selection criteria have been tested over the years, and the most common among them are maximum information gain and the GINI index. Both of these methods belong to the class of impurity measures, which are designed to capture aspects of partitioning of examples relevant to good classification.

Impurity measures

Let S be a set of training examples, with each example e ∈ S belonging to one of the classes in C = {C1, C2, …, Ck}. We can define the class vector (c1, c2, …, ck) ∈ N^k, where ci = |{e ∈ S | class(e) = Ci}|, and the class probability vector (p1, p2, …, pk) ∈ [0, 1]^k:

(p1, p2, …, pk) = (c1/|S|, c2/|S|, …, ck/|S|)

It is obvious that Σ pi = 1. A set of examples is said to be pure if all its examples belong to one class. Hence, if the probability vector of a set of examples has a component equal to 1 (all other components being 0), the set is pure. On the other hand, if all components are equal, we get an extreme case of impurity. To quantify the notion of impurity, a family of functions known as impurity measures [5] is defined.
Definition 1. Let S be a set of training examples having a class probability vector PC. A function φ : [0, 1]^k → R such that φ(PC) ≥ 0 is an impurity measure if it satisfies the following conditions:

1. φ(PC) is minimum if ∃i such that the component PCi = 1.
2. φ(PC) is maximum if ∀i, 1 ≤ i ≤ k, PCi = 1/k.
3. φ(PC) is symmetric with respect to the components of PC.
4. φ(PC) is smooth (differentiable everywhere) in its range.

Conditions 1 and 2 express the well-known extreme cases, and condition 3 ensures that the measure is not biased towards any of the classes.

For induction of decision trees, an impurity measure is used to evaluate the impurity of the partition induced by an attribute. Let PC(S) be the class probability vector of S and let A be a discrete attribute over the set S. Assume the attribute A partitions the set S into the sets S1, S2, …, Sv. The impurity of the partition is defined as the weighted average impurity of its component blocks:

Φ(S, A) = Σ_{i=1}^{v} (|Si| / |S|) · φ(PC(Si))

Finally, the goodness-of-split due to attribute A is defined as the reduction in impurity after the partition:

ΔΦ(S, A) = φ(PC(S)) − Φ(S, A)

If we choose entropy as the impurity measure,

φ(PC(S)) = E(S) = Σ_{i=1}^{k} −PCi · log2(PCi)

then the reduction in impurity gained by an attribute is called the information gain.
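As an illustration of these definitions (not code from any of the cited papers), entropy and the information gain of an attribute can be computed directly; the list-of-(attributes, class) representation of the training examples is my own assumption.

    from collections import Counter
    from math import log2

    def entropy(examples):
        """phi(PC(S)): entropy of a list of (attributes, class) examples."""
        counts = Counter(cls for _, cls in examples)
        total = len(examples)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def information_gain(examples, attribute):
        """Goodness-of-split: entropy of S minus the weighted average entropy
        of the blocks S1, ..., Sv induced by the attribute."""
        blocks = {}
        for attrs, cls in examples:
            blocks.setdefault(attrs[attribute], []).append((attrs, cls))
        remainder = sum(len(b) / len(examples) * entropy(b) for b in blocks.values())
        return entropy(examples) - remainder

    # Two records in the style of Table 1: the split separates the classes,
    # so the full 1 bit of entropy is gained.
    S = [({"outlook": "sunny"}, "N"), ({"outlook": "overcast"}, "P")]
    print(information_gain(S, "outlook"))  # 1.0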
The information gain criterion was used in many algorithms for induction of decision trees, such as ID3, GID3* and CART.

The other popular impurity measure is the GINI index used in CART [2]. To obtain the GINI index we set φ to be

φ(PC(S)) = Σ_{i≠j} PCi · PCj

All functions belonging to the family of impurity measures agree on the minima, maxima and smoothness, and as a consequence they should result in similar trees [2], [9] regarding complexity and accuracy. After detailed analysis, Breiman [3] reports basic differences between trees produced using information gain and the GINI index. GINI prefers splits that put the largest class into one pure node and all the others into the other node. Entropy favors size-balanced child nodes. If the number of classes is small, both criteria should produce similar results. The difference appears when the number of classes is larger: in this case GINI produces splits that are too unbalanced near the root of the tree, while the splits produced by entropy show a lack of uniqueness.

This analysis points out some of the problems associated with impurity measures, but unfortunately these are not the only ones. Experiments carried out in the mid-eighties showed that the gain criterion tends to favor attributes with many values [8]. This finding was supported by the analysis in [11]. One of the solutions to this problem was offered by Kononenko et al. in [8]: the induced decision tree has to be a binary tree, meaning that every test has only two outcomes. If we have an attribute A with values A1, A2, …, Av, the decision tree no longer branches on each possible value. Instead, a subset of the values is chosen and the tree has one branch for that subset and another for the remainder. This is known as the subset criterion. Kononenko et al. report that this modification led to smaller decision trees with improved classification performance.
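For comparison, the GINI index introduced above can be sketched in the same style (again my illustration, using the identity Σ_{i≠j} pi · pj = 1 − Σ pi²):

    from collections import Counter

    def gini(examples):
        """GINI index of a set of (attributes, class) examples."""
        counts = Counter(cls for _, cls in examples)
        total = len(examples)
        # sum over i != j of p_i * p_j equals 1 - sum of p_i squared
        return 1.0 - sum((c / total) ** 2 for c in counts.values())

Reusing the two-example set S from the previous sketch, gini(S) evaluates to 0.5, the maximally impure case for two classes.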
Although it is obvious that binary trees do not suffer from the bias in favor of attributes with a large number of values, it is not known whether this is the only reason for their better performance. This finding is repeated in [4], and in [5] Fayyad and Irani introduce the binary tree hypothesis: for a top-down, non-backtracking decision tree generation algorithm, if the algorithm applies a proper attribute selection measure, then selecting a single attribute-value pair at each node, and thus constructing a binary tree, rather than selecting an attribute and branching on all its values simultaneously, is likely to lead to a decision tree with fewer leaves.

A formal proof of this hypothesis does not exist; it is the result of informal analysis and empirical evaluation. Fayyad has also shown in [4] that for every decision tree there exists a binary decision tree that is logically equivalent to it. This means that for every decision tree we could induce a logically equivalent binary decision tree that is expected to have fewer nodes and to be more accurate. But binary trees have some side effects, explained in [11]. First, such trees are undoubtedly less intelligible to human experts than is ordinarily the case, with unrelated attribute values being grouped together and with multiple tests on the same attribute. Second, the subset criterion can require a large increase in computation, especially for attributes with many values: for an attribute A with v values there are 2^(v-1) − 1 different ways of specifying the distinguished subset of attribute values. But since a decision tree is induced only once and then used for classification, and since computing power is rapidly increasing, this problem seems to diminish.
In [11] Quinlan proposes another method for overcoming the bias in information gain, called the gain ratio. The gain ratio, GR (originally denoted by Quinlan as IV), normalizes the gain by the attribute information:

GR(S, A) = ΔΦ(S, A) / ( −Σ_{i=1}^{v} (|Si|/|S|) · log2(|Si|/|S|) )

The attribute information is used as a normalizing factor because of its property of increasing as the number of possible values increases. As mentioned in [11], this ratio may not always be defined (the denominator may be zero), or it may tend to favor attributes for which the denominator is very small. As a solution, the gain ratio criterion selects, from among those attributes with average-or-better gain, the attribute that maximizes GR. The experiments described in [14] show an improvement in tree simplicity and prediction accuracy when the gain ratio criterion is used.

There is also another measure, introduced by Lopez de Mantras in [6], called the distance measure, dN:

1 − dN(S, A) = ΔΦ(S, A) / ( −Σ_{i=1}^{k} Σ_{j=1}^{v} (|Sij|/|S|) · log2(|Sij|/|S|) )

where |Sij| is the number of examples with value aj of attribute A that belong to class Ci. This is just another attempt at normalizing information gain, but in this case with the cell information (a cell is a subset of S containing all examples with one attribute value that belong to one class). Although both of these normalized measures were claimed to be unbiased, the statistical analysis in [17] shows that each still favors attributes with a larger number of values.
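A sketch of the gain ratio, reusing the information_gain function and example representation from the earlier snippet (the handling of a zero denominator follows the remark above):

    from math import log2

    def gain_ratio(examples, attribute):
        """Information gain normalized by the attribute (split) information."""
        blocks = {}
        for attrs, cls in examples:
            blocks.setdefault(attrs[attribute], []).append((attrs, cls))
        total = len(examples)
        split_info = -sum(
            (len(b) / total) * log2(len(b) / total) for b in blocks.values()
        )
        if split_info == 0.0:  # single-valued attribute: the ratio is undefined
            return 0.0
        return information_gain(examples, attribute) / split_info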
These results also suggest that, in this respect, information gain is the worst of the three measures, while gain ratio is the least biased. Furthermore, their analysis shows that the magnitude of the bias is strongly dependent on the number of classes, increasing as k increases.

Orthogonality measure

Recently, a conceptually new approach was introduced by Fayyad and Irani in [5]. In their analysis they give a number of reasons why information entropy, as a representative of the class of impurity measures, is not suitable for attribute selection. Consider the following example: a set S of 110 examples belonging to three classes {C1, C2, C3}, whose class vector is (50, 10, 50). Assume that the attribute-value pairs (A, a1) and (A, a2) induce two binary partitions of S, π1 and π2, shown in Figure 3: π1 splits the class vector into the blocks (45, 8, 5) and (5, 2, 45), while π2 splits it into (50, 0, 50) and (0, 10, 0). We can see that π2 cleanly separates the class C2 from the classes C1 and C3. However, the information gain measure prefers partition π1 (gain = 0.51) over π2 (gain = 0.43).

Figure 3. Two possible binary partitions
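The preference can be checked directly from the block counts shown in Figure 3; the small helper below is my own illustration (the recomputed gain for π2 comes out near 0.44, close to the 0.43 quoted from [5]):

    from math import log2

    def entropy_of_counts(counts):
        """Entropy of a class-count vector such as (50, 10, 50)."""
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c)

    def gain_of_partition(parent, blocks):
        """Information gain of splitting `parent` into the given count blocks."""
        n = sum(parent)
        remainder = sum(sum(b) / n * entropy_of_counts(b) for b in blocks)
        return entropy_of_counts(parent) - remainder

    S = (50, 10, 50)
    print(round(gain_of_partition(S, [(45, 8, 5), (5, 2, 45)]), 2))   # pi_1: 0.51
    print(round(gain_of_partition(S, [(50, 0, 50), (0, 10, 0)]), 2))  # pi_2: ~0.44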
Analysis shows that if π2 is accepted, the subtree under this node can have as few as 3 leaves, since the block containing only C2 immediately becomes a leaf. On the other hand, if π1 is chosen, the subtree must have at least 6 leaves, because each block still contains examples of all three classes. Intuitively, if the goal is to generate a tree with a smaller number of leaves, the selection measure should be sensitive to total class separation: it should separate differing classes from each other as much as possible while separating as few examples of the same class as possible. The above example shows that information entropy does not satisfy these demands; it is completely insensitive to class separation and within-class fragmentation. The only exception is when the learning problem has exactly two classes: then class purity and class separation become the same thing.

Another negative property of information gain emphasized in this paper is its tendency to induce decision trees with near-minimal average depth. The empirical evaluation of such trees shows that they tend to have a large number of leaves and a high error rate [4].

Another deficiency pointed out is actually embedded in the definition of impurity measures: their symmetry with respect to the components of PC. This means that a set with a given class probability vector evaluates identically to another set whose class vector is a permutation of the first. Thus, if one of the subsets of a set S has a different majority class than the original, but the distribution of classes is simply permuted, entropy will not detect the change. However, this change in dominant class is generally a strong indicator that the attribute value is relevant to classification.

Recognizing the above weaknesses of impurity measures, the authors define a desirable class of selection measures. Assuming induction of a binary tree (relying on the binary tree hypothesis), for a training set S and attribute A, a test τ on this attribute induces a binary partition of the set S:

S = Sτ ∪ S¬τ, where Sτ = {e ∈ S | e satisfies τ} and S¬τ = S \ Sτ.
The selection measure should satisfy the following properties:

1. It is maximum when the classes in Sτ are disjoint from the classes in S¬τ (inter-class separation).
2. It is minimum when the class distribution in Sτ is identical to the class distribution in S¬τ.
3. It favors partitions which keep examples of the same class in the same block (intra-class cohesiveness).
4. It is sensitive to permutations in the class distribution.
5. It is non-negative, smooth (differentiable), and symmetric with respect to the classes.

This defines a family of measures for evaluating binary partitions, called C-SEP (for Class SEParation). One such measure proposed in the paper, and proven to satisfy all requirements of the C-SEP family, is the orthogonality measure, defined as

ORT(τ, S) = 1 − cos θ(V1, V2),

where θ(V1, V2) is the angle between the two class vectors V1 and V2 of the partitions Sτ and S¬τ, respectively.
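The orthogonality measure itself is only a few lines; the following sketch (mine, not code from [5]) applies it to the class vectors of the two partitions from Figure 3:

    from math import sqrt

    def ort(v1, v2):
        """ORT(tau, S) = 1 - cos(theta) for the class vectors of S_tau and S_not_tau."""
        dot = sum(a * b for a, b in zip(v1, v2))
        norm = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
        return 1.0 - dot / norm

    print(ort((45, 8, 5), (5, 2, 45)))   # ~0.78
    print(ort((50, 0, 50), (0, 10, 0)))  # 1.0: orthogonal vectors, perfect separation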
The results of an empirical comparison of the orthogonality measure, embedded in the O-BTREE system, with the entropy measure used in GID3* (which branches only on a few individual values while grouping the rest into one default branch), information gain in ID3, gain ratio in ID3-IV, and information gain for induction of binary trees in ID3-BIN, are taken from [5] and summarized in Figures 4, 5, 6 and 7. In these experiments 5 different data sets were used: RIE (Reactive Ion Etching), a synthetic data set, and the real-world data sets Soybean, Auto, Harr90 and Mushroom; descriptions of these data sets may be found in [5]. The results are reported as ratios relative to the performance of GID3* (GID3* = 1.0 in both cases).

Figure 4. Error ratios for the RIE-random domains
Figure 5. Leaf ratios for the RIE-random domains
Figure 6. Relative ratios of error rates on the real-world data sets (GID3* = 1.0)
Figure 7. Ratios of numbers of leaves on the real-world data sets (GID3* = 1.0)

Figures 4, 5, 6 and 7 show that the results for the O-BTREE algorithm are almost always superior to those of the other algorithms.

Conclusion

Until recently most algorithms for induction of decision trees used one of the impurity measures described in the previous section. These functions were borrowed from information theory without any formal analysis of their suitability as a selection criterion. The empirical results were acceptable, and only small variations of these methods were further tested.
Fayyad's and Irani's approach in [5] introduces a completely different family of measures, C-SEP, for binary partitions. They recognize very important properties of a measure, inter-class separation and intra-class cohesiveness, that were not precisely captured by impurity measures. This is the first step toward a better formalization of the selection criterion, which is necessary for further improvement of decision trees' accuracy and simplicity.

3. Noise

When we use decision tree induction techniques in real-world domains we have to expect noisy data. The description of an object may include attributes based on measurements or subjective judgement, both of which can give rise to errors in the values of the attributes. Sometimes the class information itself may contain errors. These defects in the data lead to two known problems:

•attribute inadequacy, meaning that even though some examples have identical descriptions in terms of attribute values, they do not belong to the same class. Inadequate attributes are not able to distinguish among the objects in the set S.
•spurious tree complexity, which is the result of the tree induction algorithm trying to fit the noisy data into the tree.

Recognizing these two problems, we can define two modifications of the tree-building algorithm if it is to operate with a noise-affected training set [11]:

•the algorithm must be able to decide that testing further attributes will not improve the predictive accuracy of the decision tree;
•the algorithm must be able to work with inadequate attributes.

In [11] Quinlan suggests the chi-square test for stochastic independence as the implementation of the first modification:
Let S be a collection of objects which belong to one of two classes N and P, and let A be an attribute with v values that produces subsets {S1, S2, …, Sv} of S, where Si contains pi and ni objects of class P and N, respectively. If the value of A is irrelevant to the class of an object in S (if the values of A for these objects are just noise, they would be expected to be unrelated to the objects' classes), the expected value pi′ of pi should be

pi′ = p · (pi + ni) / (p + n)

where p and n denote the total numbers of objects of class P and class N in S. If ni′ is the corresponding expected value of ni, the statistic

Σ_{i=1}^{v} [ (pi − pi′)² / pi′ + (ni − ni′)² / ni′ ]

is approximately chi-square with v − 1 degrees of freedom. This statistic can be used to determine the confidence with which one can reject the hypothesis that A is independent of the class of the objects in S [11]. The tree-building procedure can then be modified to prevent testing any attribute whose irrelevance cannot be rejected with a very high (e.g., 99%) confidence level. One difficulty with the chi-square test is that it is unreliable for very small values of the expectations pi′ and ni′, so the common practice is to use it only when all expectations are at least 4 [12].
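The statistic is straightforward to compute from a contingency table of (pi, ni) counts per attribute value; the sketch below is my own illustration on the attribute outlook from Table 1 (with so few examples the expectations fall below 4, so in practice the test would not be applied to this set):

    def chi_square_statistic(counts):
        """counts: list of (p_i, n_i) pairs, one per value of attribute A."""
        p = sum(pi for pi, _ in counts)
        n = sum(ni for _, ni in counts)
        stat = 0.0
        for pi, ni in counts:
            expected_p = p * (pi + ni) / (p + n)
            expected_n = n * (pi + ni) / (p + n)
            stat += (pi - expected_p) ** 2 / expected_p
            stat += (ni - expected_n) ** 2 / expected_n
        return stat  # approximately chi-square with v - 1 degrees of freedom

    # (P, N) counts for outlook = sunny, overcast, rain in Table 1
    stat = chi_square_statistic([(2, 3), (4, 0), (3, 2)])
    # the 99% critical value for v - 1 = 2 degrees of freedom is about 9.21
    print(stat, stat > 9.21)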
The second modification of the algorithm should cope with inadequate attributes. Quinlan [12] suggests two possibilities:

•The notion of class could be generalized to a continuous value c lying between 0 and 1: if the subset of objects at a leaf contains p examples belonging to class P and n examples belonging to class N, the choice for c would be c = p / (p + n). In this case a class value of 0.8 would be interpreted as 'belonging to class P with probability 0.8'. The classification error is then defined as c if the object really belongs to class N, and 1 − c if it really belongs to class P. This is called the probability method.
•A voting model could be established: assign all objects to the more numerous class at the leaf. This is called the majority method.

It can be verified that the first method minimizes the sum of the squares of the classification errors, while the second minimizes the sum of absolute errors over the objects in S. If the goal is to minimize the expected error, the second approach seems more suitable, and the empirical results shown in [12] confirm this.

The two suggested modifications were tested on various data sets, with different noise levels affecting different attributes or the class information; the results are shown in [12]. Quite different forms of degradation are observed:

•Destroying class information produces a linear increase in error, reaching 50% error for a noise level of 100%, which means that objects are classified randomly.
•Noise in a single attribute does not have a dramatic effect, and its impact appears to be directly proportional to the importance of the attribute. The importance of an attribute can be defined as the average classification error produced if the attribute is deleted altogether from the data.
•Noise in all attributes together leads to a relatively rapid increase in error, which generally reaches a peak and then declines. The appearance of the peak is explained in [11].
These experiments led to a very interesting and unexpected observation, given in [11]: for higher noise levels, the performance of the correct decision tree on corrupted data was found to be inferior to that of an imperfect decision tree formed from data corrupted to a similar level. These observations suggest some basic tactics for dealing with noisy data:

•It is important to eliminate noise affecting the class membership of the objects in the training set.
•It is worthwhile to exclude noisy, less important attributes.
•The payoff in noise reduction increases with the importance of the attribute.
•The training set should reflect the noise distribution and level expected when the induced decision tree is used in practice.
•The majority method of assigning classes to leaves is preferable to the probability method.

Conclusion

The methods employed to cope with noise in decision tree induction are mostly based on empirical results. Although it is obvious that they lead to improvement of decision trees in terms of simplicity and accuracy, there is no formal theory to support them. This implies that laying some theoretical foundation should be a necessity in the future.

4. Pruning
In noisy domains, pruning methods are employed to cut back a full-size tree to a smaller one that is likely to give better classification performance. Decision trees generated from the examples in the training set are generally overfitted to them, and therefore too specific to accurately classify unseen examples from a test set. Techniques used to prune the original tree usually consist of the following steps [15]:

•generate a set of pruned trees;
•estimate the performance of each of these trees;
•select the best tree.

One of the major issues is what data set will be used to test the performance of the pruned trees. The ideal situation would be to have the complete set of test examples; only then could we make the optimal tree selection. In practice this is not possible, and it is approximated with a very large, independent test set, if one is available. The real problem arises when such a test set is not available. Then the same set used for building the decision tree has to be used to estimate the accuracy of the pruned trees. Resampling methods, such as cross-validation, are the principal technique used in these situations. F-fold cross-validation divides the training set S into f blocks of roughly the same size and class distribution; then, for each block in turn, a classifier is constructed from the cases in the remaining blocks and tested on the cases in the held-out block. The error rate of the classifier produced from all the cases is estimated as the ratio of the total number of errors on the held-out cases to the total number of cases. The average error rate from several distinct cross-validations is then a relatively reliable estimate of the error rate of the single classifier produced from all the cases. 10-fold cross-validation has proven to be very reliable and is widely used for many different learning models.
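A minimal sketch of f-fold cross-validated error estimation follows (an illustration under my own assumptions: train builds a classifier from a list of (attributes, class) examples and classify applies it, neither being defined in the text; for brevity the folds are shuffled rather than stratified by class):

    import random

    def cross_validated_error(examples, train, classify, f=10, seed=0):
        """Average the errors on each held-out block over f folds; the result
        estimates the error rate of the classifier built from all examples."""
        examples = list(examples)
        random.Random(seed).shuffle(examples)
        folds = [examples[i::f] for i in range(f)]
        errors = 0
        for i, hold_out in enumerate(folds):
            training = [e for j, fold in enumerate(folds) if j != i for e in fold]
            model = train(training)
            errors += sum(1 for attrs, cls in hold_out if classify(model, attrs) != cls)
        return errors / len(examples)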
Quinlan in [13] describes three techniques for pruning:

•cost-complexity pruning,
•reduced error pruning, and
•pessimistic pruning.

Cost-Complexity Pruning

This technique was initially described in [2]. It consists of two stages:

•First, a sequence of trees T0, T1, …, Tk is generated, where T0 is the original decision tree and each Ti+1 is obtained by replacing one or more subtrees of Ti with leaves, until the final tree Tk is just a leaf.
•Then, each tree in the sequence is evaluated and one of them is selected as the final pruned tree.

A cost-complexity measure is used for the evaluation of a pruned tree T: if N is the total number of examples classified by T, E is the number of misclassified ones, and L(T) is the number of leaves in T, then the cost-complexity is defined as the sum

E / N + α · L(T)

where α is some parameter. Now, suppose that we replace some subtree T* of the tree T with the best possible leaf. In general, this pruned tree would have M more misclassified examples and L(T*) − 1 fewer leaves. The original and the pruned tree would have the same cost-complexity if

α = M / (N · (L(T*) − 1)).

To produce Ti+1 from Ti, each non-leaf subtree of Ti is examined to find the one with the minimum value of α. The one or more subtrees with that value of α are then replaced by their respective best leaves.
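The α threshold for a candidate subtree is a one-line computation; the sketch below is my own illustration (the counts would come from the tree and the training set, which are only assumed here):

    def cost_complexity_alpha(extra_errors, subtree_leaves, total_examples):
        """alpha at which pruning a subtree with `subtree_leaves` leaves to its best
        leaf, at the cost of `extra_errors` extra misclassifications, breaks even."""
        return extra_errors / (total_examples * (subtree_leaves - 1))

    # e.g. pruning a 3-leaf subtree that adds 2 errors over 14 training examples
    print(cost_complexity_alpha(2, 3, 14))  # 2 / (14 * 2) ~ 0.071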
For the second stage of pruning we use an independent test set containing N′ examples to test the accuracy of the pruned trees. If E′ is the minimum number of errors observed with any Ti, and the standard error of E′ is given by

se(E′) = √( E′ · (N′ − E′) / N′ )

then the tree selected is the smallest one whose number of errors on the test set does not exceed E′ + se(E′).
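This selection rule can be sketched as follows (my illustration; the list of (leaf count, test-set errors) pairs for T0, …, Tk is assumed):

    from math import sqrt

    def select_pruned_tree(candidates, n_test):
        """Pick the smallest tree whose errors stay within one standard error
        of the best error count observed in the sequence."""
        best_errors = min(errors for _, errors in candidates)
        se = sqrt(best_errors * (n_test - best_errors) / n_test)
        eligible = [c for c in candidates if c[1] <= best_errors + se]
        return min(eligible, key=lambda c: c[0])  # smallest number of leaves

    print(select_pruned_tree([(8, 10), (5, 11), (3, 14), (1, 30)], n_test=100))
    # se = 3, so (5, 11) is selected over the more accurate but larger (8, 10)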
Reduced Error Pruning

This technique is probably the simplest and most intuitive one for finding small pruned trees of high accuracy. First, the original tree T is used to classify an independent test set. Then, for every non-leaf subtree T* of T, we examine the change in misclassifications over the test set that would occur if T* were replaced by the best possible leaf. If the new tree would contain no subtree with the same property, T* is replaced by the leaf. The process continues until any further replacement would increase the number of errors over the test set.

Pessimistic Pruning

This technique does not require a separate test set. If the decision tree T was generated from a training set with N examples and then tested on that same set, we can assume that at some leaf of T there are K classified examples, of which J are misclassified. The ratio J/K does not provide a reliable estimate of the error rate of that leaf when unseen objects are classified, since the tree T has been tailored to the training set. Instead, we can use a more realistic measure known as the continuity correction for the binomial distribution, in which J is replaced by J + 1/2.

Now let us consider some subtree T* of T, containing L(T*) leaves and classifying ΣK examples (summing over all leaves of T*), with ΣJ of them misclassified. According to the above measure, it will misclassify ΣJ + L(T*)/2 unseen cases. If T* is replaced with the best leaf, which misclassifies E examples from the training set, the new pruned tree is accepted whenever E + 1/2 is within one standard error of ΣJ + L(T*)/2 (the standard error is defined as in cost-complexity pruning). All non-leaf subtrees are examined just once to see whether they should be pruned, and once a subtree is pruned its own subtrees are not examined further. This strategy makes the algorithm much faster than the previous two.

Quinlan compares these three techniques on 6 different domains with both real-world and synthetic data. The general observation is that the simplified trees are of superior or equivalent accuracy to the originals, so pruning has been beneficial on both counts. Cost-complexity pruning tends to produce smaller decision trees than either reduced error or pessimistic pruning, but they are less accurate than the trees produced by the other two techniques. This suggests that cost-complexity pruning may be overpruning. On the other hand, reduced error pruning and pessimistic pruning produce trees of very similar accuracy, but since the latter uses only the training set for pruning and is more efficient than the former, it can be pronounced the optimal technique among the three.
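The pessimistic pruning decision for a single subtree can be sketched as below (illustrative only; the counts are assumed inputs rather than derived from an actual tree):

    from math import sqrt

    def should_prune_pessimistic(subtree_errors, subtree_leaves, subtree_examples,
                                 leaf_errors):
        """Prune T* if the corrected error of its best leaf (E + 1/2) lies within
        one standard error of the subtree's corrected error (sum J + L(T*)/2)."""
        corrected = subtree_errors + subtree_leaves / 2.0
        se = sqrt(corrected * (subtree_examples - corrected) / subtree_examples)
        return leaf_errors + 0.5 <= corrected + se

    # a subtree with 4 leaves misclassifying 3 of its 40 examples, versus a single
    # leaf that would misclassify 6 of them: prune, since 6.5 <= 5 + 2.09
    print(should_prune_pessimistic(3, 4, 40, 6))  # True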
OPT Algorithm

While the previously described techniques are used to prune decision trees generated from noisy data, Bratko and Bohanec in [1] introduce the OPT algorithm for pruning accurate decision trees. The problem they aim to solve is [1]: given a completely accurate, but complex, definition of a concept, simplify the definition, possibly at the expense of accuracy, so that the simplified definition still corresponds to the concept well in general, but may be inaccurate in some details. So, while the previously mentioned techniques were designed to improve tree accuracy, this one is designed to reduce the size of a tree that would otherwise be impractical to communicate to and be understood by the user.

Bratko's and Bohanec's approach is somewhat similar to the previous pruning algorithms: they construct a sequence of pruned trees and then select the smallest tree that satisfies some required accuracy. However, the tree sequence they construct is denser with respect to the number of leaves. The sequence T0, T1, …, Tn is constructed such that:

1. n = L(T0) − 1;
2. the trees in the sequence decrease in size by one, i.e., L(Ti) = L(T0) − i for i = 0, 1, …, n (unless there is no pruned tree of the corresponding size); and
3. each Ti has the highest accuracy among all the pruned trees of T0 of the same size.

This sequence is called the optimal pruning sequence and was initially suggested by Breiman et al. [2]. To construct the optimal pruning sequence efficiently, in quadratic (polynomial) time with respect to the number of leaves of T0, they use dynamic programming. The construction is recursive, in that each subtree of T0 is again a decision tree with its own optimal pruning sequence. The algorithm starts by constructing the sequences that correspond to small subtrees near the leaves of T0. These are then combined, yielding sequences that correspond to larger and larger subtrees of T0, until the optimal pruning sequence for T0 itself is finally constructed.

The main advantage of the OPT algorithm is the density of its optimal pruning sequence, which always contains an optimal tree. Sequences produced by cost-complexity pruning or reduced error pruning are sparse and can therefore miss some optimal solutions.
One interesting observation derived from the experiments conducted by Bratko and Bohanec is that, for real-world data, a relatively high accuracy was achieved with relatively small pruned trees regardless of the technique used for pruning, while this was not the case with synthetic data. This is further evidence of the usefulness of pruning, especially in real-world domains.

Conclusion

Whether we want to improve the classification accuracy of decision trees generated from noisy data, or to simplify accurate but complex decision trees to make them more intelligible to human experts, pruning has proved to be very successful. Recent papers [7], [10] suggest there is still some room left for improvement of the basic and most commonly used techniques described in this section.

5. Summary

The selection criterion is probably the most important aspect determining the behavior of a top-down decision tree generation algorithm. If it selects the attributes most relevant to the class information near the root of the tree, then any pruning technique can successfully cut off the branches of class-independent and/or noisy attributes, because they will appear near the leaves of the tree. Thus, an intelligent selection method, able to recognize the attributes most important for classification, will initially generate simpler trees and will additionally ease the job of the pruning algorithm.

The main problem of this domain seems to be the lack of theoretical foundation: many techniques are still used because of their acceptable empirical evaluation, not because they have been formally
proven to be superior. Development of a formal theory of decision tree induction is necessary for better understanding of this domain and for further improvement of decision trees' classification accuracy, especially for noisy, incomplete, real-world data.

6. References

[1] Bratko, I. & Bohanec, M. (1994). Trading accuracy for simplicity in decision trees, Machine Learning 15, 223-250.
[2] Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks.
[3] Breiman, L. (1996). Technical note: Some properties of splitting criteria, Machine Learning 24, 41-47.
[4] Fayyad, U.M. (1991). On the induction of decision trees for multiple concept learning, PhD dissertation, EECS Department, The University of Michigan.
[5] Fayyad, U.M. & Irani, K.B. (1992). The attribute selection problem in decision tree generation, Proceedings of the 10th National Conference on AI (AAAI-92), 104-110, MIT Press.
[6] Lopez de Mantras, R. (1991). A distance-based attribute selection measure for decision tree induction, Machine Learning 6, 81-92.
[7] Kearns, M. & Mansour, Y. (1998). A fast, bottom-up decision tree pruning algorithm with near optimal generalization, submitted.
[8] Kononenko, I., Bratko, I. & Roskar, R. (1984). Experiments in automatic learning of medical diagnosis rules, Technical Report, Faculty of Electrical Engineering, E. Kardelj University, Ljubljana.
[9] Mingers, J. (1989). An empirical comparison of selection measures for decision-tree induction, Machine Learning 3, 319-342.
[10] Schapire, R.E. & Helmbold, D.P. (1995). Predicting nearly as well as the best pruning of a decision tree, Proceedings of the 8th Annual Conference on Computational Learning Theory, ACM Press, 61-68.
[11] Quinlan, J.R. (1986). Induction of decision trees, Machine Learning 1, 81-106.
[12] Quinlan, J.R. (1986). The effect of noise on concept learning, Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann: San Mateo, CA, 148-166.
[13] Quinlan, J.R. (1987). Simplifying decision trees, International Journal of Man-Machine Studies 27, 221-234.
[14] Quinlan, J.R. (1988). Decision trees and multi-valued attributes, Machine Intelligence 11, 305-318.
[15] Weiss, S.M. & Indurkhya, N. (1994). Small sample decision tree pruning, Proceedings of the 11th International Conference on Machine Learning, Morgan Kaufmann, 335-342.
[16] White, A.P. & Liu, W.Z. (1994). The importance of attribute selection measures in decision tree induction, Machine Learning 15, 25-41.
[17] White, A.P. & Liu, W.Z. (1994). Technical note: Bias in information-based measures in decision tree induction, Machine Learning 15, 321-329.