Computing and Information Systems, 7 (2000), p. 91-97 © University of Paisley 2000

Decision Trees as a Data Mining Tool

Bruno Crémilleux

The production of decision trees is usually regarded as an automatic method to discover knowledge from data: trees stem directly from the data without other intervention. However, we cannot expect acceptable results if we naively apply machine learning to arbitrary data. By reviewing the whole process and the other work that implicitly has to be done to generate a decision tree, this paper shows that this method has to be placed within the knowledge discovery in databases process and that, in fact, the user has to intervene both during the core of the method (building and pruning) and during other associated tasks.

1. INTRODUCTION

Data mining and Knowledge Discovery in Databases (KDD) are fields of increasing interest combining databases, artificial intelligence, machine learning and statistics. Briefly, the purpose of KDD is to extract from large amounts of data non-trivial "nuggets" of information in an easily understandable form. Such discovered knowledge may be, for instance, regularities or exceptions.

The decision tree is a method which comes from the machine learning community and explores data. Such a method is able to give a summary of the data (which is easier to analyze than the raw data) or can be used to build a tool (for example a classifier) to help a user in many different decision making tasks. Broadly speaking, a decision tree is built from a set of training data having attribute values and a class name. The result of the process is represented as a tree whose nodes specify attributes and whose branches specify attribute values. Leaves of the tree correspond to sets of examples with the same class or to elements for which no more attributes are available. Construction of decision trees is described, among others, by Breiman et al. (1984) [1], who present an important and well-known monograph on classification trees. A number of standard techniques have been developed, for example the basic algorithms ID3 [20] and CART [1]. A survey of different methods of decision tree classifiers and the various existing issues is presented in Safavian and Landgrebe [25].

Usually, the production of decision trees is regarded as an automatic process: trees are straightforwardly generated from data and the user is relegated to a minor role. Nevertheless, we think that this method intrinsically requires the user, and not only during the step of data preparation, but also during the whole process. In fact, using decision trees can be embedded in the KDD process within the main steps (selection, preprocessing, data mining, interpretation / evaluation). The aim of the paper is to show the role of the user and to place the use of decision trees within the data mining framework.

This paper is organized as follows. Section 2 outlines the core of the decision tree method (i.e. building and pruning). The literature usually presents these points from a technical side without describing the part regarding the user: we will see that he has a role to play. Section 3 deals with associated tasks which are, in fact, absolutely necessary. These tasks, where the user clearly has to intervene, are often not emphasized when we speak of decision trees. We will see that they have a great relevance and that they act upon the final result.

2. BUILDING AND PRUNING

2.1 Building decision trees: choice of an attribute selection criterion

In the induction of decision trees, various attribute selection criteria are used to estimate the quality of attributes in order to select the best one to split on. But we know at a theoretical level that criteria derived from an impurity measure have suitable properties to generate decision trees and perform comparably (see [10], [1] and [6]). We call such criteria C.M. criteria (concave-maximum criteria) because an impurity measure, among other characteristics, is defined by a concave function. The most commonly used criteria, which are the Shannon entropy (in the family of ID3 algorithms) and the Gini criterion (in CART algorithms, see [1] for details), are C.M. criteria.

Nevertheless, there exist other paradigms to build decision trees. For example, Fayyad and Irani [10] claim that grouping values of attributes and building binary trees yields better trees. For that, they propose the ORT measure. ORT favours attributes that simply separate the different classes without taking into account the number of examples in the nodes, so that ORT produces trees with small pure (or nearly pure) leaves at their top more often than C.M. criteria.

To better understand the differences between C.M. and ORT criteria, let us consider the data set given in the appendix and the trees induced from this data depicted in Figure 1: a tree built with a C.M. criterion is represented at the top and the tree built with the
ORT criterion at the bottom. ORT quickly isolates the pure leaf Y2 = y21, whereas the C.M. criterion splits it and reaches the resulting leaves only later.

C.M. tree:

(2500,200,2500)
    Y1 = y11: (2350,150,150)
        Y2 = y21: (0,150,0)
        Y2 = y22: (2350,0,150)
    Y1 = y12: (150,50,2350)
        Y2 = y21: (0,50,0)
        Y2 = y22: (150,0,2350)

ORT tree:

(2500,200,2500)
    Y2 = y21: (0,200,0)
    Y2 = y22: (2500,0,2500)
        Y1 = y11: (2350,0,150)
        Y1 = y12: (150,0,2350)

Figure 1: An example of C.M. and ORT trees.

We give here just a simple example, but some others, both in artificial and real world domains, are detailed in [6]: they show that the ORT criterion produces trees with small leaves at their top more often than C.M. criteria. We also see in [6] that overspecified leaves with C.M. criteria tend to be small and at the bottom of the tree (thus easy to prune), while leaves at the bottom of ORT trees can be large. In uncertain domains (we will see this point in the next section), such leaves produced by ORT may be irrelevant and it is difficult to prune them without destroying the tree.

Let us note that other selection criteria, such as the ratio criterion, are related to other specific issues. The ratio criterion proposed by Quinlan [20], derived from the entropy criterion, is customized to avoid favouring attributes with many values. Actually, in some situations, selecting an attribute essentially because it has many values might jeopardize the semantic acceptance of the induced trees ([27] and [18]). The J-measure [15] is the product of two terms that are considered by Goodman and Smyth as the two basic criteria for evaluating a rule: one term is derived from the entropy function and the other measures the simplicity of a rule. Quinlan and Rivest [21] were interested in the minimum description length principle to construct a decision tree minimizing a false classification rate when one looks for general rules and their exceptional conditions. This principle has been taken up by Wallace and Patrick [26], who suggest some improvements and show that they generally obtain better empirical results than those found by Quinlan. Buntine [3] presents a tree learning algorithm derived from Bayesian statistics whose main objective is to provide outstanding predicted class probabilities on the nodes.

We can also address the question of deciding which sub-nodes have to be built. For a split, the GID3* algorithm [12] groups in a single branch the values of an attribute which are estimated to be meaningless compared to its other values. For building binary trees, another criterion is twoing [1]. Twoing groups classes into two superclasses so that, considered as a two-class problem, the greatest decrease in node impurity is realized. Some properties of twoing are described in Breiman [2]. About binary decision trees, let us note that in some situations users do not agree to group values since it yields meaningless trees, and thus non-binary trees must not be definitively discarded.

So, we have seen that there are many attribute selection criteria and, even if some of them can be gathered in families, a choice has to be made. We think that the choice of a paradigm depends on whether the data sets used embed uncertainty or not, whether the phenomenon under study admits deterministic causes, and what level of intelligibility is required.

In the next section, we move to the pruning stage.

2.2 Pruning decision trees: what about the classification and the quality?

We know that in many areas, like medicine, data are uncertain: there are always some examples which escape from the rules. Translated into the context of decision trees, that means these examples seem similar but in fact differ in their classes. In these situations, it is well known (see [1], [4]) that decision tree algorithms tend to divide nodes having few examples and that the resulting trees tend to be very large and overspecified. Some branches, especially towards the bottom, are present due to sample variability and are statistically meaningless (one can also say that they are due to noise in the sample). Such branches must either not be built or be pruned. If we do not want to build them, we have to set out rules to stop the building of the tree. We know it is better to generate the entire tree and then to prune it (see for example [1] and [14]). Pruning methods (see [1], [19], [20]) try to cut such branches in order to avoid this drawback.

The principal methods for pruning decision trees are examined in [9] and [19]. Most of these pruning methods are based on minimizing a classification error
rate when each element of the same node is classified in the most frequent class in this node. The latter is estimated with a test file or using statistical methods such as cross-validation or bootstrap.

These pruning methods are inferred from situations where the built tree will be used as a classifier, and they systematically discard a sub-tree which does not improve the classification error rate used. Let us consider the sub-tree depicted in Figure 2. D is the class and it is here bivalued. In each node the first (resp. second) value indicates the number of examples having the first (resp. second) value of D. This sub-tree does not lessen the error rate, which is 10% both at its root and in its leaves; nevertheless the sub-tree is of interest since it points out a specific population with a constant value of D, while in the remaining population it is impossible to predict a value for D.

(90,10)
    (79,0)
    (11,10)

Figure 2: A tree which could be interesting although it doesn't decrease the number of errors.

In [5], we have proposed a pruning method (called C.M. pruning because a C.M. criterion is used to build the entire tree) suitable for uncertain domains. C.M. pruning builds a new attribute binding the root of a tree with its leaves, the attribute's values corresponding to the branches leading to a leaf. It permits computation of the global quality of a tree. The best sub-tree for pruning is the one that yields the highest quality pruned tree. This pruning method is not tied to the use of the pruned tree as a classifier.

This work has been taken up in [13]. In uncertain domains, a deep tree is less relevant than a small one: the deeper a tree, the less understandable and reliable it is. So, a new quality index (called DI for Depth-Impurity) has been defined in [13]. The latter manages a trade-off between depth and impurity of each node of a tree. From this index, a new pruning method (denoted DI pruning) has been inferred. Compared with C.M. pruning, DI pruning introduces a damping function to take into account the depth of the leaves. Moreover, by giving the quality of each node (and not only of a sub-tree), DI pruning is able to distinguish some sub-populations of interest in large populations or, on the contrary, to highlight sets of examples with high uncertainty (in the context of the studied problem). In this case, the user has to come back to the data to try to improve their collection and preparation. Getting the quality of each node is a key point in uncertain domains.

So, about the pruning stage, the user is confronted with some questions:

- am I interested in obtaining a quality value of each node?
- is there uncertainty in the data?

and he has to know which use of the tree is pursued:

- a tree can be an efficient description oriented by an a priori classification of its elements. Then, pruning the tree discards overspecific information to get a more legible description.
- a tree can be built to highlight reliable sub-populations. Here only some leaves of the pruned tree will be considered for further investigation.
- the tree can be transformed into a classifier for any new element in a large population.

The choice of a pruning strategy is tied to the answers to these questions.

3. ASSOCIATED TASKS

We indicate in this section when and how the users, by means of various associated tasks, intervene in the process of developing decision trees. Schematically, it is about gathering the data for the design of the training set, the encoding of the attributes, the specific analysis of examples, the analysis of the resulting tree, ...

Generally, these tasks are not emphasized in the literature; they are usually considered as secondary, but we will see that they have a great relevance and that they act upon the final result. Of course, these tasks intersect with the building and pruning work that we have previously described.

In practice, apart from the building and pruning steps, there is another step: the data preparation. We add a fourth step which aims to study the classification of new examples on a (potentially pruned) tree. The user strongly intervenes during the first step, but also has a supervising role during all steps and, more particularly, a critic's role after the second and third steps (see Figure 3). We do not detail here the fourth step, which is marginal from the point of view of the user's role.

3.1 Data preparation

The aim of this step is to supply, from the database gathering examples in their raw form, a training set as adapted as possible to the development of decision trees. This step is the one where the user intervenes most directly. His tasks are numerous: deleting examples
considered as aberrant (outliers) and/or containing too many missing values, deleting attributes evaluated as irrelevant to the given task, re-encoding the attribute values (one knows that if the attributes have very different numbers of values, those having more values tend to be chosen first ([27] and [18]); we have already referred to this point with the gain ratio criterion), re-encoding several attributes (for example, the fusion of attributes), segmenting continuous attributes, analyzing missing data, ...

[Figure 3 (diagram): inside the decision trees software, the steps data manipulation, building, pruning and classification successively produce the data set, the entire tree, the pruned tree and the results of the classification. The user prepares the data, then checks and intervenes after building, pruning and classification.]

Figure 3: Process to generate decision trees and relations with the user.

Let us get back to some of these tasks. At first [16], decision tree algorithms did not accept quantitative attributes; these had to be discretized. This initial segmentation can be done by asking experts to set thresholds or by using a strategy relying on an impurity function [11]. The segmentation can also be done while building the trees, as is the case with the software C4.5 [22]. A continuous attribute can then be segmented several times in the same tree. It seems relevant to us that the user may actively intervene in this process by indicating, for example, an a priori discretization of the attributes for which it is meaningful and by letting the system manage the others. One should remark that, while one knows in a reasonable way how to split a continuous attribute into two intervals, the question is more delicate for a discretization into three or more values.

The user also generally has to decide on the deletion, re-encoding or fusion of attributes. He has a priori ideas allowing a first pass at this task. But we shall see that the tree construction, by making explicit the underlying studied phenomenon, suggests to the user new re-encodings and/or fusions of attributes, often leading to a more general description level.

Current decision tree construction algorithms most often deal with missing values by means of specific and internal treatments [7]. On the contrary, by a preliminary analysis of the database, relying on the search for associations between data and leading to uncertain rules that determine missing values, Ragel ([7], [24]) offers a strategy where the user can intervene: such a method leaves a place for the user and his knowledge in order to delete, add or modify some rules.

As we can see, this step in fact depends a lot on the user's work.

3.2 Building step

The aim of this step is to induce a tree from a training set arising from the previous step. Some system parameters have to be specified. For example, it is useless to keep on building a tree from a node having too few examples, this amount being relative to the initial number of examples in the base. An important parameter to set is thus the minimum number of examples necessary for splitting a node. Facing a particularly huge tree, the user will ask for the construction of a new tree by setting this parameter to a higher value, which amounts to pruning the tree by means of a pragmatic process. We have seen (section 2.1) that in uncertain induction, the user will most probably choose a C.M. criterion in order to be able to prune. But if he knows that the studied phenomenon allows deterministic causes in situations with few examples,
he can choose the ORT criterion to get a more concise description of these situations.

The presentation of the attributes and their respective criterion scores at each node may allow the user to select attributes that might not have the best score but that provide a promising way to lead to a relevant leaf.

The criticism of the tree thus obtained is the most important participation of the user in this step. He checks whether the tree is understandable regarding his domain knowledge and whether its general structure conforms to his expectations. Facing a surprising result, he wonders if this is due to a bias in the training step or if it reflects a phenomenon, sometimes suspected, but not yet explicitly stated. Most often, seeing the tree gives the user new ideas about the attributes and he will choose to build the tree again after working again on the training set and/or changing a parameter in the induction system to confirm or refute a conjecture.

3.3 Pruning step

Apart from the questions at the end of section 2.2 about the data types and the aim pursued in producing a tree, more questions arise for the user if he uses a technique such as DI pruning.

In fact, in this situation, the user has more information to react upon. First, he knows the quality index of the entire tree, which allows him to evaluate the global complexity of the problem. If this index is low, this means that the problem is delicate or inadequately described, that the training set is not representative, or even that the decision tree method is not adapted to this specific problem. If the user has several trees, the quality index allows him to compare them and possibly to suggest new experiments.

Moreover, the quality index on each node highlights the populations where the class is easy to determine, as opposed to sets of examples where it is impossible to predict it. Such areas can suggest new experiments on smaller populations or can even raise the question of the existence of additional attributes (which will have to be collected) to help determine the class for examples where it is not yet possible.

From experiments [13], we noticed that the degree of pruning is closely bound to the uncertainty embedded in the data. In practice, that means that the damping process has to be adjusted according to the data in order to obtain, in all situations, a relevant number of pruned trees. For that, we introduce a parameter to control the damping process. By varying this parameter, one follows the evolution of the quality index during the pruning (for example, the user distinguishes the parts of the tree that are due to chance from those that are reliable). Such work distinguishes the most relevant attributes from those that may need to be redefined.

Finally, building and pruning steps can be viewed as part of the study of the attributes. Domain experts usually appreciate being able to restructure the set of the initial attributes and to see at once the effect of such a modification on the tree (in general, after a preliminary decision tree, they define new attributes which summarize some of the initial ones). We have noticed [5] that when such attributes are used, the shape of the graphic representation of the quality index as a function of the number of pruned sub-trees changes and tends to show three parts: in the first one, the variation of the quality index is small; in the second part this quality decreases regularly; and in the third part the quality rapidly becomes very low. It shows that the information embedded in the data set is mainly in the top of the tree while the bottom can be pruned.

3.4 Conclusion

Throughout this section, we have seen that the user interventions are numerous and that the realization of the associated tasks is closely linked to him. These tasks are fundamental since they directly affect the results: the study of the results brings new experiments. The user restarts many times the work done during a step by changing the parameters, or comes back to previous steps (the arrows in Figure 3 show all the relations between the different steps). At each step, the user may accept, override, or modify the generated rules, but more often he suggests alternative features and experiments. Finally, the rule set is redefined through subsequent data collection, rule induction, and expert consideration.

We think it is necessary for the user to take part in the system so that a real development cycle takes place. The latter seems fundamental to us in order to obtain useful and satisfying trees. The user does not usually know beforehand which tree is relevant to his problem, and it is because he finds it gratifying to take part in this search that he takes an interest in the induction work.

Let us note that most authors try to define a software architecture explicitly integrating the user. In the area of induction graphs (which are a generalization of decision trees), the SIPINA software allows the user to fix the choice of an attribute, to gather temporarily some values of an attribute, to stop the construction from some nodes, and so on. Dabija et al. [8] offer a learning system architecture (called KAISER, for Knowledge Acquisition Inductive System driven by Explanatory Reasoning) for an interactive knowledge acquisition system based on decision trees and driven by explanatory reasoning. Moreover, the experts can incrementally add knowledge corresponding to the
domain theory. KAISER confronts built trees with the domain theory, so that some inconsistencies may be detected (for instance, the value of the attribute "eye" for a cat has to be "oval"). Kervahut & Potvin [17] have designed an assistant to collaborate with the user. This assistant, which takes the form of a graphic interface, helps the user test the methods and their parameters in order to get the most relevant combination for the problem at hand.

4. CONCLUSION

Producing decision trees is often presented as "automatic" with a marginal participation from the user: we have stressed the fact that the user has a fundamental critic's and supervisor's role and that he intervenes in a major way. This leads to a real development cycle between the user and the system. This cycle is only possible because the construction of a tree is nearly instantaneous.

The participation of the user in the data preparation, the choice of the parameters and the criticism of the results is in fact at the heart of the more general process of Knowledge Discovery in Databases. As usual in KDD, we claim that the understanding and the declarativity of the mechanism of the methods is a key point to achieve in practice a fruitful process of information extraction. Finally, we think that, in order to really reach a data exploration reasoning, associating the user in a profitable way, it is important to give him a framework gathering all the tasks intervening in the process, so that he may freely explore the data, react, and innovate with new experiments.

References

[1] Breiman L., Friedman J. H., Olshen R. A., & Stone C. J. Classification and regression trees. Wadsworth. Statistics probability series. Belmont, 1984.
[2] Breiman L. Some properties of splitting criteria (technical note). Machine Learning 21, 41-47, 1996.
[3] Buntine W. Learning classification trees. Statistics and Computing 2, 63-73, 1992.
[4] Catlett J. Overpruning large decision trees. In proceedings of the Twelfth International Joint Conference on Artificial Intelligence IJCAI 91, pp. 764-769, Sydney, Australia, 1991.
[5] Crémilleux B., & Robert C. A Pruning Method for Decision Trees in Uncertain Domains: Applications in Medicine. In proceedings of the workshop Intelligent Data Analysis in Medicine and Pharmacology, ECAI 96, pp. 15-20, Budapest, Hungary, 1996.
[6] Crémilleux B., Robert C., & Gaio M. Uncertain domains and decision trees: ORT versus C.M. criteria. In proceedings of the 7th Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems, pp. 540-546, Paris, France, 1998.
[7] Crémilleux B., Ragel A., & Bosson J. L. An Interactive and Understandable Method to Treat Missing Values: Application to a Medical Data Set. In proceedings of the 5th International Conference on Information Systems Analysis and Synthesis (ISAS / SCI 99), pp. 137-144, M. Torres, B. Sanchez & E. Wills (Eds.), Orlando, FL, 1999.
[8] Dabija V. G., Tsujino K., & Nishida S. Theory formation in the decision trees domain. Journal of Japanese Society for Artificial Intelligence, 7 (3), 136-147, 1992.
[9] Esposito F., Malerba D., & Semeraro G. Decision tree pruning as search in the state space. In proceedings of European Conference on Machine Learning ECML 93, pp. 165-184, P. B. Brazdil (Ed.), Lecture notes in artificial intelligence, N° 667, Springer-Verlag, Vienna, Austria, 1993.
[10] Fayyad U. M., & Irani K. B. The attribute selection problem in decision tree generation. In proceedings of Tenth National Conference on Artificial Intelligence, pp. 104-110, Cambridge, MA: AAAI Press/MIT Press, 1992.
[11] Fayyad U. M., & Irani K. B. Multi-interval discretization of continuous-valued attributes for classification learning. In proceedings of the Thirteenth International Joint Conference on Artificial Intelligence IJCAI 93, pp. 1022-1027, Chambéry, France, 1993.
[12] Fayyad U. M. Branching on attribute values in decision tree generation. In proceedings of Twelfth National Conference on Artificial Intelligence, pp. 601-606, AAAI Press/MIT Press, 1994.
[13] Fournier D., & Crémilleux B. Using impurity and depth for decision trees pruning. In proceedings of the 2nd International ICSC Symposium on Engineering of Intelligent Systems (EIS 2000), Paisley, UK, 2000.
[14] Gelfand S. B., Ravishankar C. S., & Delp E. J. An iterative growing and pruning algorithm for classification tree design. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(2), 163-174, 1991.
[15] Goodman R. M. F., & Smyth P. Information-theoretic rule induction. In proceedings of the Eighth European Conference on Artificial Intelligence ECAI 88, pp. 357-362, München, Germany, 1988.
[16] Hunt E. B., Marin J., & Stone P. J. Experiments in induction. New York: Academic Press, 1966.
[17] Kervahut T., & Potvin J. Y. An interactive-graphic environment for automatic generation of decision trees. Decision Support Systems 18, 117-134, 1996.
[18] Kononenko I. On biases in estimating multi-valued attributes. In proceedings of the Fourteenth International Joint Conference on Artificial Intelligence IJCAI 95, pp. 1034-1040, Montréal, Canada, 1995.
[19] Mingers J. An empirical comparison of pruning methods for decision-tree induction. Machine Learning 4, 227-243, 1989.
[20] Quinlan J. R. Induction of decision trees. Machine Learning 1, 81-106, 1986.
[21] Quinlan J. R., & Rivest R. L. Inferring decision trees using the minimum description length principle. Information and Computation 80(3), 227-248, 1989.
[22] Quinlan J. R. C4.5 Programs for Machine Learning. San Mateo, CA. Morgan Kaufmann, 1993.
[23] Quinlan J. R. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research 4, 77-90, 1996.
[24] Ragel A., & Crémilleux B. Treatment of Missing Values for Association Rules. Second Pacific Asia Conference on KDD, PAKDD 98, pp. 258-270, X. Wu, R. Kotagiri & K. B. Korb (Eds.), Lecture notes in artificial intelligence, N° 1394, Springer-Verlag, Melbourne, Australia, 1998.
[25] Safavian S. R., & Landgrebe D. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics 21(3), 660-674, 1991.
[26] Wallace C. S., & Patrick J. D. Coding decision trees. Machine Learning 11, 7-22, 1993.
[27] White A. P., & Liu W. Z. Bias in Information-Based Measures in Decision Tree Induction. Machine Learning 15, 321-329, 1994.

B. Crémilleux is Maître de Conférences at the Université de Caen, France.

APPENDIX

Data file used to build trees for Figure 1 (D denotes the class and Y1 and Y2 are the attributes).

Examples        D     Y1     Y2
1 to 2350       d1    y11    y22
2351 to 2500    d1    y12    y22
2501 to 2650    d2    y11    y21
2651 to 2700    d2    y12    y21
2701 to 2850    d3    y11    y22
2851 to 5200    d3    y12    y22
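
The appendix data can be used to check which attribute each family of criteria would select at the root of the trees in Figure 1. The following is only a sketch under stated assumptions: the entropy and Gini gains are the standard impurity-decrease formulas, and the ORT score is taken to be one minus the cosine of the angle between the class-count vectors of the two branches, which is our reading of the measure of Fayyad and Irani [10] (the paper itself does not state the formula). All function and variable names are hypothetical.

```python
import math

# Class counts (d1, d2, d3) for each (Y1, Y2) combination in the appendix.
CELLS = {
    ("y11", "y22"): (2350, 0, 150),
    ("y12", "y22"): (150, 0, 2350),
    ("y11", "y21"): (0, 150, 0),
    ("y12", "y21"): (0, 50, 0),
}


def branch_counts(attr_index):
    """Class counts per value of Y1 (attr_index=0) or Y2 (attr_index=1)."""
    branches = {}
    for key, counts in CELLS.items():
        value = key[attr_index]
        previous = branches.get(value, (0, 0, 0))
        branches[value] = tuple(a + b for a, b in zip(previous, counts))
    return branches


def entropy(counts):
    """Shannon entropy (base 2) of a vector of class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)


def gini(counts):
    """Gini impurity of a vector of class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def impurity_gain(impurity, branches):
    """Decrease in impurity obtained by splitting the root into `branches`."""
    root = tuple(sum(column) for column in zip(*branches.values()))
    n = sum(root)
    weighted = sum(sum(b) / n * impurity(b) for b in branches.values())
    return impurity(root) - weighted


def ort(branches):
    """ASSUMED ORT score for a binary split: 1 - cos(angle between branches)."""
    v1, v2 = branches.values()
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return 1.0 - dot / norms


if __name__ == "__main__":
    for name, index in (("Y1", 0), ("Y2", 1)):
        b = branch_counts(index)
        print(name, b)
        print("  entropy gain:", impurity_gain(entropy, b))
        print("  gini gain:   ", impurity_gain(gini, b))
        print("  ORT (assumed):", ort(b))
```

Under these assumptions, both impurity-based gains are larger for Y1, which matches the root of the C.M. tree, while the ORT score is larger for Y2 (its two branches share no class, so their count vectors are orthogonal), which matches the root of the ORT tree.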