C4.5 algorithm and Multivariate Decision Trees

Thales Sehn Korting
Image Processing Division, National Institute for Space Research – INPE
São José dos Campos – SP, Brazil
firstname.lastname@example.org

Abstract

The aim of this article is to give a brief description of the C4.5 algorithm, used to create Univariate Decision Trees. We also discuss Multivariate Decision Trees and their process of classifying instances using more than one attribute per node of the tree. We discuss how they work and how to implement the algorithms that build such trees, including examples of Univariate and Multivariate results.

1. Introduction

In the Pattern Recognition process, the goal is to learn (or to "teach" a machine) how to classify objects through the analysis of a set of instances whose classes¹ are known. Since we know the classes of the instances of this set (the training set), we can use several algorithms to discover how the attribute-vectors of the instances behave, in order to estimate the classes of new instances. One way to do this is through Decision Trees (DT's).

¹ Mutually exclusive labels, such as "buildings", "deforestment", etc.

Figure 1. Simple example of a classification process.

A tree is either a leaf node labeled with a class, or a structure containing a test, linked to two or more nodes (or subtrees). So, to classify some instance, we first get its attribute-vector and apply this vector to the tree. The tests are performed on these attributes, reaching one or another leaf, to complete the classification process, as in Figure 1.

If we have n attributes for our instances, we have an n-dimensional space for the classes, and the DT creates hyperplanes (or partitions) to divide this space among the classes. A 2D space is shown in Figure 2, where the lines represent the hyperplanes in this dimension.

A DT can deal with one attribute per test node or with more than one. The former approach is called the Univariate DT, and the second is the Multivariate method. This article explains the construction of Univariate DT's and the C4.5 algorithm, used to build such trees (Section 2). After this, we discuss the Multivariate approach and how to construct such trees (Section 3). At the end of each approach (Uni- and Multivariate), we show some results for different test cases.

2. C4.5 Algorithm

This section explains one of the algorithms used to create Univariate DT's. This one, called C4.5, is based
on the ID3² algorithm, which tries to find small (or simple) DT's. We start by presenting some premises on which this algorithm is based, and afterwards we discuss the inference of the weights and tests in the nodes of the trees.

² ID3 stands for Iterative Dichotomiser 3.

Figure 2. Partitions created in a DT.

2.1. Construction

Some premises guide this algorithm, such as the following:

• if all cases are of the same class, the tree is a leaf, and so the leaf is returned labelled with this class;

• for each attribute, calculate the potential information provided by a test on the attribute (based on the probabilities of each case having a particular value for the attribute). Also calculate the gain in information that would result from a test on the attribute (based on the probabilities of each case with a particular value for the attribute being of a particular class);

• depending on the current selection criterion, find the best attribute to branch on.

2.2. Counting gain

This process uses "Entropy", i.e. a measure of the disorder of the data. The Entropy of y is calculated by

Entropy(y) = − Σ_{j=1}^{n} (|y_j| / |y|) log(|y_j| / |y|)

iterating over all possible values of y. The conditional Entropy is

Entropy(j|y) = (|y_j| / |y|) log(|y_j| / |y|)

and, finally, we define the Gain by

Gain(y, j) = Entropy(y) − Entropy(j|y)

The aim is to maximize the Gain, i.e. the reduction of the overall entropy due to splitting argument y by value j.

2.3. Pruning

This is an important step for the result, because of the outliers. All data sets contain a small subset of instances that are not well-defined and differ from the other ones in their neighborhood.

After the complete creation of the tree, which must classify all the instances of the training set, it is pruned. This is done to reduce the classification errors caused by specialization on the training set, making the tree more general.

2.4. Results

To show concrete examples of the application of the C4.5 algorithm, we used the WEKA system. One training set considers some aspects of working people, like vacation time, working hours, and health plan. The resulting classes describe the work conditions, i.e. good or bad. Figure 3 shows the resulting DT, using the C4.5 implementation from WEKA.

Another example deals with levels of contact lenses, according to some characteristics of the patients. The results are in Figure 4.

3. Multivariate DT's

Multivariate DT's, being inductive-learning methods, are able to generalize well when dealing with correlated attributes. Also, the results are easy for humans to interpret, i.e. we can understand the influence of each attribute on the whole process.

One problem when using simple (or Univariate) DT's is that, along a path, they may test some attributes more than once. Sometimes this hurts the performance of the system, because with a simple transformation of the data, such as principal components, we can reduce the correlation between the attributes and achieve the same classification with a single test. The aim of Multivariate DT's is thus to perform different tests on the data, as shown in Figure 5.

The purpose of the Multivariate approach is to use more than one attribute in the test nodes.
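Before detailing the multivariate case, the entropy and gain computations of Section 2.2 can be sketched in code. This is a minimal Python illustration, not the actual C4.5/WEKA implementation; the toy instances and the "vacation" attribute are hypothetical, loosely inspired by the working-conditions example of Section 2.4:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(y) = -sum over classes of (|y_j|/|y|) * log(|y_j|/|y|)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain(rows, labels, attr):
    """Gain for attribute `attr`: the overall entropy minus the entropy
    remaining after partitioning the instances by their value of `attr`."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# Hypothetical toy instances: a single symbolic attribute and two classes
rows = [{"vacation": "long"}, {"vacation": "long"},
        {"vacation": "short"}, {"vacation": "short"}]
labels = ["good", "good", "bad", "bad"]

# Premise from Section 2.1: branch on the attribute with the best gain
best = max(rows[0], key=lambda a: gain(rows, labels, a))
```

Here an attribute that separates the classes perfectly yields a gain equal to the full entropy of the labels, which is why the selection criterion branches on it first.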
In the example of Figure 5, we can replace the whole set of tests by the single one, x + y ≥ 8. But how can we develop an algorithm that is able to "discover" such planes? This is the content of the following sections.

Figure 3. Simple Univariate DT, created by the C4.5 algorithm. In blue are the tests; in green and red are the resulting classes.

Figure 4. Another Univariate DT, created by the C4.5 algorithm. In blue are the tests, and in red the resulting classes.

We can think of this approach as a linear combination of the attributes at each internal node. For example, consider an instance with attributes y = y_1, y_2, ..., y_n belonging to class C_j. The test at each node of the tree follows the form

Σ_{i=1}^{n+1} w_i y_i > 0

where w_1, w_2, ..., w_{n+1} are real-valued coefficients. The attributes y_1, y_2, ..., y_n can be real-valued too; some approaches also deal with symbolic attributes, most of the time by inserting them into a scale of numbers.

Multivariate and Univariate DT's share some properties when modelling the tree, especially at the stage of pruning statistically invalid branches.

3.1. Tree Construction

The first step in this phase is to have a set of training instances. All of them have a attributes and an associated class; this is the default setup for all classification methods.

Through a top-down decision-tree algorithm and a merit selection criterion, the process chooses the best test to split the data, creating a branch. After the first split we have two partitions, on which the algorithm performs the same top-down analysis to make more partitions, according to the criteria.

One of the stopping criteria is that some partition presents just a single class, so that node becomes a leaf with the associated class.

But we still want to know how the process splits the data, and here lies the difference between Multi- and Univariate DT's. Considering a multiclass instance set, we can represent the multivariate tests with a Linear Machine (LM).
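The linear-combination test above can be sketched as follows. This is a minimal illustration; the weight values are hypothetical, and the constant attribute y_{n+1} = 1 lets the last coefficient act as a threshold:

```python
def multivariate_test(weights, attributes):
    """Evaluate sum_{i=1}^{n+1} w_i * y_i > 0, where the attribute vector
    is extended with a constant 1 so that w_{n+1} acts as a threshold."""
    extended = list(attributes) + [1.0]
    return sum(w * y for w, y in zip(weights, extended)) > 0

# The test x + y >= 8 from Figure 5 can be encoded (ignoring the
# boundary case) as 1*x + 1*y - 8 > 0:
w = [1.0, 1.0, -8.0]
```

For an instance (x, y) = (5, 6) the sum is 5 + 6 − 8 = 3 > 0, so the test succeeds; for (2, 3) it fails, routing the instance down the other branch.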
Figure 5. Problem with the Univariate approach: it performs several tests, while the blue line (Multivariate) is much more efficient.

LM: Let y be an instance description consisting of 1 and the n features that describe the instance. Then each discriminant function g_i(y) has the form

g_i(y) = w_i^T y

where w_i is a vector of n + 1 coefficients. The LM infers that instance y belongs to class i iff

(∀j, j ≠ i) g_i(y) > g_j(y)

Some methods for training an LM have been proposed. We can start the weights vector with a default value for all w_i, i = 1, ..., N. Here, we show the absolute error correction rule and the thermal perceptron.

3.1.1. Absolute Error Correction rule: One approach for updating the weights of the discriminant functions is the absolute error correction rule, which adjusts w_i, where i is the class to which the instance belongs, and w_j, where j is the class to which the LM incorrectly assigns the instance. The correction is accomplished by

w_i ← w_i + c y    and    w_j ← w_j − c y

where

c = (w_j − w_i)^T y / (2 y^T y)

is the smallest integer such that the updated LM will classify the instance correctly.

3.1.2. Thermal Perceptron: For instances that are not linearly separable, one method is the "thermal perceptron", which also adjusts w_i and w_j, and deals with the constants

c = B / (B + k)

and

k = (w_j − w_i)^T y / (2 y^T y)

The process follows this algorithm:

1. B = 2;
2. If the LM is correct for all instances, or B < 0.001, RETURN;
3. Otherwise, for each misclassified instance:
   3.1. compute the correction c and update w[i] and w[j];
   3.2. adjust B ← aB − b, with a = 0.99 and b = 0.0005;
4. Go back to step 2.

The basic idea of this algorithm is to correct the weights-vector until all instances are classified correctly or, in the worst case, until a certain number of iterations is reached. This bound is represented by the update of B, which decreases according to the equation B = aB − b; since a = 99% and b = 0.0005, this is a small linear decrease of B.

3.2. Pruning

When pruning Multivariate DT's, one must consider that this can result in more classification errors than in an increase of generalization. Generally, just some features (or attributes) are removed from the multivariate tests, instead of pruning the whole node. It has been observed that a multivariate test with n − 1 features is more general than one based on n features.
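The thermal perceptron loop above can be sketched for a two-class LM as follows. This is a minimal pure-Python sketch under the stated constants (a = 0.99, b = 0.0005), with a hypothetical one-dimensional toy data set rather than a real training set:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_thermal_perceptron(instances, labels, n_classes=2, a=0.99, b=0.0005):
    """Sketch of the thermal perceptron of Section 3.1.2. Each instance
    is extended with a constant 1 (so the last weight acts as a
    threshold); w[i] holds the coefficients of discriminant g_i."""
    data = [[1.0] + list(x) for x in instances]
    w = [[0.0] * len(data[0]) for _ in range(n_classes)]  # default weights
    B = 2.0                                               # step 1
    while B >= 0.001:                                     # step 2: B bound
        mistakes = 0
        for y, label in zip(data, labels):
            predicted = max(range(n_classes), key=lambda c_: dot(w[c_], y))
            if predicted != label:                        # step 3: misclassified
                mistakes += 1
                i, j = label, predicted
                k = dot([wj - wi for wj, wi in zip(w[j], w[i])], y) / (2 * dot(y, y))
                c = B / (B + k)                           # thermal correction
                w[i] = [wi + c * yi for wi, yi in zip(w[i], y)]  # w_i <- w_i + c y
                w[j] = [wj - c * yi for wj, yi in zip(w[j], y)]  # w_j <- w_j - c y
                B = a * B - b                             # step 3.2: B <- aB - b
        if mistakes == 0:                                 # step 2: all correct
            break
    return w

# Hypothetical toy data: class 0 for small values, class 1 for large ones
w = train_thermal_perceptron([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
```

Because B shrinks after every correction, later corrections become smaller and smaller, so the loop terminates even when the instances are not linearly separable.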
3.3. Results

Figure 6 shows a good example, performing the classification with simple tests, even on a complicated data set.

Figure 6. Multivariate DT, created by the OC1 algorithm (Oblique Classifier 1).

4. Conclusion

In this article we discussed Decision Trees, in both the Univariate and the Multivariate approaches. The C4.5 algorithm implements one way to build Univariate DT's, and some results were shown. About the Multivariate approach, we first discussed the advantages of using it, and then we showed how to build such trees with the Linear Machine approach, using the Absolute Error Correction and the Thermal Perceptron rules.

DT's are a powerful tool for classification, especially when the results need to be interpreted by humans. Multivariate DT's deal well with attribute correlation, presenting advantages in the tests over the Univariate approach.

References

C. Brodley and P. Utgoff. Multivariate Versus Univariate Decision Trees. 1992.
C. Brodley and P. Utgoff. Multivariate decision trees. Machine Learning, 19(1):45–77, 1995.
S. Murthy, S. Kasif, and S. Salzberg. A System for Induction of Oblique Decision Trees. arXiv preprint cs.AI/9408103, 1994.
J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1992.
J. Quinlan. Learning decision tree classifiers. ACM Computing Surveys (CSUR), 28(1):71–72, 1996.
WEKA (Data Mining Software). Available at http://www.cs.waikato.ac.nz/ml/weka/, 2006.