To appear in Proc. of the 2003 International Conference on Machine Learning and Applications (ICMLA'03), Los Angeles, California, June 23-24, 2003.

Fast Decision Tree Learning Techniques for Microarray Data Collections

Xiaoyong Li and Christoph F. Eick
Department of Computer Science
University of Houston, TX 77204-3010
e-mail: ceick@cs.uh.edu

Abstract

DNA microarrays allow monitoring of expression levels for thousands of genes simultaneously. The ability to successfully analyze the huge amounts of genomic data is of increasing importance for research in biology and medicine. The focus of this paper is the discussion of techniques and algorithms of a decision tree learning tool that has been devised taking into consideration the special features of microarray data sets: continuous-valued attributes and a small number of examples with a large number of genes. The paper introduces novel approaches to speed up leave-one-out cross validation through the reuse of results of previous computations, attribute pruning, and through approximate computation techniques. Our approach employs special histogram-based data structures for continuous attributes for speed up and for the purpose of pruning. We present experimental results concerning three microarray data sets that suggest that these optimizations lead to speedups between 150% and 400%. We also present arguments that our attribute pruning techniques not only lead to better speed but also enhance the testing accuracy.

Key words and phrases: decision trees, concept learning for microarray data sets, leave-one-out cross validation, heuristics for split point selection, decision tree reuse.

1. Introduction

The advent of DNA microarray technology provides biologists with the ability of monitoring expression levels for thousands of genes simultaneously. Applications of microarrays range from the study of gene expression in yeast under different environmental stress conditions to the comparison of gene expression profiles of tumors from cancer patients [1]. In addition to the enormous scientific potential of DNA microarrays to help in understanding gene regulation and interactions, microarrays have very important applications in pharmaceutical and clinical research. By comparing gene expression in normal and abnormal cells, microarrays may be used to identify which genes are involved in causing particular diseases. Currently, most approaches to the computational analysis of gene expression data focus more on the attempt to learn about genes and tumor classes in an unsupervised way. Many research projects employ cluster analysis for both tumor samples and genes, and mostly use hierarchical clustering methods [2,3] and partitioning methods, such as self-organizing maps [4], to identify groups of similar genes and groups of similar samples.

This paper, however, centers on the application of supervised learning techniques to microarray data collections. In particular, we will discuss the features of a decision tree learning tool for microarray data sets. We assume that each data set includes gene expression data of m-RNA samples. Normally, in these data sets the number of genes is pretty large (usually between 1000 and 10,000). Each gene is characterized by numerical values that measure the degree to which the gene is turned on for the particular sample. The number of examples in the training set, on the other hand, is typically below one hundred. Associated with each sample is its type or class that we are trying to predict. Moreover, in this paper we will restrict our discussions to binary classification problems.

Section 2 introduces decision tree learning techniques for microarray data collections. Section 3 discusses how to speed up leave-one-out cross validation. Section 4 presents experimental results that evaluate our techniques for three microarray data sets and Section 5 summarizes our findings.
2. Decision Tree Learning Techniques for Microarray Data Collections

2.1 Decision Tree Algorithms Reviewed

The traditional decision tree learning algorithm (for more discussions on decision trees see [5]) builds a decision tree breadth-first by recursively dividing the examples until each partition is pure by definition or meets other termination conditions (to be discussed later). If a node satisfies a termination condition, the node is marked with a class label that is the majority class of the samples associated with this node. In the case of microarray data sets, the splitting criterion for assigning examples to nodes is of the form "A < v" (where A is an attribute and v is a real number).

In the algorithm description in Fig. 1 below, we assume that
1. D is the whole microarray training data set;
2. T is the decision tree to be built;
3. N is one node of the decision tree, which holds the indexes of samples;
4. R is the root node of the decision tree;
5. Q is a queue which contains nodes of the same type as N;
6. Si is a split point, which is a structure containing a gene index i, a real number v and an information gain value. A split point can be used to provide a split criterion to partition the tree node N into two nodes N1 and N2 based on whether gene i's value of each example in the node is or is not greater than value v;
7. Gi denotes the i-th gene.

The result of applying the decision tree learning algorithm is a tree whose intermediate nodes associate split points with attributes, and whose leaf nodes represent decisions (classes in our case). Test conditions for a node are selected by maximizing the information gain, relying on the following framework: We assume we have 2 classes, sometimes called '+' and '-' in the following, in our classification problem. A test S subdivides the examples D = (p1, p2) into 2 subsets D1 = (p11, p12) and D2 = (p21, p22). The quality of a test S is measured using Gain(D,S). Let

$$H(D = (p_1, \ldots, p_m)) = \sum_{i=1}^{m} p_i \log_2(1/p_i)$$

(called the entropy function); then

$$Gain(D,S) = H(D) - \sum_{i=1}^{2} \frac{|D_i|}{|D|} \, H(D_i)$$

In the above, |D| denotes the number of elements in set D, and D = (p1, p2) with p1 + p2 = 1 indicates that of the |D| examples p1*|D| examples belong to the first class and p2*|D| examples belong to the second class.

Procedure buildTree(D):
1. Initialize root node R of tree T using data set D;
2. Initialize queue Q to contain root node R;
3. While Q is not empty do {
4.     De-queue the first node N in Q;
5.     If N does not satisfy the termination condition {
6.         For each gene Gi (i = 1, 2, ...)
7.             { Evaluate splits on gene Gi based on information gain;
8.               Record the best split point Si for Gi and its information gain }
9.         Determine split point Smax with the highest information gain;
10.        Use Smax to divide node N into N1 and N2 and attach nodes N1 and N2 to node N in the decision tree T;
11.        En-queue N1 and N2 to Q;
12.    }
13. }

Figure 1: Decision Tree Learning Algorithm

2.2 Attribute Histograms

Our research introduced a number of new data structures for the purpose of speeding up the decision tree learning algorithms. One of these data structures is called attribute histogram; it captures the class distribution of a sorted continuous attribute. Let us assume we have 7 examples whose attribute values for an attribute A are 1.01, 1.07, 1.44, 2.20, 3.86, 4.3, and 5.71 and whose class distribution is (-, +, +, +, -, -, +); that is, the first example belongs to class 2, the second example to class 1, and so on. If we group all the adjacent samples with the same class, we obtain the histogram for this attribute, which is (1-, 3+, 2-, 1+), for short (1,3,2,1), as depicted in Fig. 2; if the class distribution for the sorted attribute A had been (+,+,-,-,-,-,+), A's histogram would be (2,4,1). Efficient algorithms to compute attribute histograms have been discussed in [6].
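The histogram construction of Section 2.2 and the entropy and gain definitions of Section 2.1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the tool's implementation (which is described in [6]): the function names `attribute_histogram`, `entropy` and `information_gain` are hypothetical, class distributions are passed as raw class counts rather than as relative frequencies, and identical attribute values that belong to different classes receive no special treatment here.

```python
from itertools import groupby
from math import log2


def attribute_histogram(values, labels):
    """Sort examples by attribute value, then collapse runs of equal labels.

    Returns a list of (label, count) blocks, e.g. [('-', 1), ('+', 3)].
    """
    ordered = [label for _, label in sorted(zip(values, labels), key=lambda p: p[0])]
    return [(label, len(list(run))) for label, run in groupby(ordered)]


def entropy(counts):
    """H(D) for a sequence of class counts; empty classes contribute 0."""
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)


def information_gain(parent, parts):
    """Gain(D, S) = H(D) - sum_i (|D_i| / |D|) * H(D_i), all as class counts."""
    n = sum(parent)
    return entropy(parent) - sum(sum(p) / n * entropy(p) for p in parts)


# The seven-example attribute A used in Section 2.2.
values = [1.01, 1.07, 1.44, 2.20, 3.86, 4.3, 5.71]
labels = ["-", "+", "+", "+", "-", "-", "+"]
print(attribute_histogram(values, labels))
# -> [('-', 1), ('+', 3), ('-', 2), ('+', 1)]

# A test that separates two '+' from two '-' examples perfectly.
print(information_gain((2, 2), [(2, 0), (0, 2)]))
# -> 1.0
```

Passing counts instead of probabilities keeps the sketch short; dividing each count by the total recovers the (p1, p2) notation used in the text.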
2.3 Searching for the Best Split Point

As mentioned earlier, the traditional decision tree algorithm has a preference for tests that reduce entropy. To find the best test for a node, we have to search through all the possible split points for each attribute. In order to compute the best split point for a numeric attribute, normally the (sorted) list of its values is scanned from the beginning, and for each split point that is placed half way between two adjacent attribute values, the entropy is computed. The entropy for each split point can actually be computed efficiently, as shown in Figure 2, because of the existence of our attribute histogram data structure. Based on its histogram (1-, 3+, 2-, 1+), we only consider three possible splits: (1- | 3+, 2-, 1+), (1-, 3+ | 2-, 1+) and (1-, 3+, 2- | 1+). The vertical bar represents the split point. Thus we reduce the number of candidates from 6 split points down to 3 split points. (Fayyad and Irani proved in [7] that splitting between adjacent samples that belong to the same class leads to sub-optimal information gain; in general, their paper advocates multi-splitting algorithms for continuous attributes, whereas our approach relies on binary splits.)

Figure 2: Example of an Attribute Histogram

A situation that we have not discussed until now involves histograms that contain identical attribute values that belong to different classes. To cope with this situation when considering a split point, we need to check the two neighboring examples' attribute values on both sides of the split point. If they are the same, we have to discard this split point even if its information gain is high.

After we have determined the best split point for all the attributes (genes in our case), the attribute with the highest information gain is selected and used to split the current node.

3. Optimizations for Leave-one-out Cross-validation

In k-fold cross-validation, we divide the data into k disjoint subsets of (approximately) equal size, then train the classifier k times, each time leaving out one of the subsets from training and using only the omitted subset as the test set to compute the error rate. If k equals the sample size, this is called "leave-one-out" cross-validation. For large data sets, leave-one-out is computationally very demanding since it has to construct more decision trees than the usual types of cross validation (k=10 is a popular choice in the literature). But for data sets with few examples, such as microarray data sets, leave-one-out cross validation is pretty popular and practical since it gives the most unbiased evaluation model. Also, when doing leave-one-out cross validation, the computations for different subsets tend to be very similar. Therefore, it seems attractive to speed up leave-one-out cross validation through the reuse of results of previous computations, which is the main topic of the next subsection.

3.1 Reuse of Sub-trees from Previous Runs

It is important to note that the whole data set and the training sets in leave-one-out only differ in one example. Therefore, in the likely event that the same root test is selected for the two data sets, we already know that at least one of the 2 sub-trees below the root node generated by the first run (for the whole data set) can be reused when constructing other decision trees. Similar opportunities for reuse exist at other levels of decision trees. Taking advantage of this property, we compare the node to be split with the stored nodes from previous runs, and reuse sub-trees if a match occurs.

In order to get a speed up through sub-tree reuse, it is critical that matching nodes from previous runs can be found quickly. To facilitate the comparison of two nodes, we use bit strings to represent the sample list of each node. For example, if we have 10 samples in total, and 5 are associated with the current node, we use the bit string "0101001101" as the signature of this node, and use XOR string comparisons and signature hashing to quickly determine if a reusable sub-tree exists.

3.2 Using Histograms for Attribute Pruning
Assume that two histograms A (2+, 2-) and B (1+, 1-, 1+, 1-) are given. In this case, our job is to find the best split point among all possible splits of both histograms. Obviously, B can never give a better split than A because (2+ | 2-) has entropy 0. This implies that performing information gain computations for attribute B is a waste of time. That prompts us to think of some way to distinguish between "good" and "bad" histograms, and to exclude attributes with bad histograms from consideration for speed up.

Mathematically, it might be quite complicated to come up with a formula that predicts the best attribute to be used for a particular node of the decision tree. However, we are considering an approximate method that may not always be correct but hopefully is correct most of the time. The idea is to use an index, which we call "hist index". The hist index of histogram S is defined as:

$$Hist(S) = \sum_{j=1}^{m} P_j^2$$

where P_j is the relative frequency of block j in S and m is the number of blocks. For example, using raw block sizes, the histogram (1, 3, 4, 2) has a hist index of 1^2 + 3^2 + 4^2 + 2^2 = 30 (dividing by the squared number of examples, 10^2 = 100, gives 0.30 in terms of relative frequencies; because all histograms of a node cover the same examples, this rescaling does not change their ranking). A histogram with a high hist index is more likely to contain the best split point than a histogram with a low hist index. Intuitively, we know that the fewer blocks the histogram has, the better chance it has to contain a good split point; mathematically, a^2 > a1^2 + a2^2 holds if a = a1 + a2 with positive a1 and a2, so merging two blocks always increases the index.

Our decision tree learning algorithm uses the hist index to prune attributes as follows. Prior to determining the best split point of an attribute, its hist index is computed and compared with the average hist index of all the previous histograms in the same round; only if its hist index value is larger than the previous average will the best split point for this attribute be determined; otherwise, the attribute is excluded from consideration for test conditions of the particular node.

3.3 Approximating Entropy Computations

This sub-section addresses the following question: Do we really have to compute the log values, which require a lot of floating point computation, to find the smallest entropy values?

Let us assume we have a histogram (2-, 3+, 7-, 5+, 2-) and we need to determine the split point that minimizes entropy. Let us consider the difference between two splits, 1st: (2-, 3+ | 7-, 5+, 2-) and 2nd: (2-, 3+, 7- | 5+, 2-). Apparently, the 2nd is better than the 1st. Since we are dealing with only binary classification, we can assign a numeric value of +1 to one class and a value of -1 to the other class, and we can use the sum of absolute differences in class memberships in the two resulting partitions to approximate entropy computations; the larger this result is, the lower the entropy is. In this case, for the first split the sum is |-2 + 3| + |-7 + 5 - 2| = 5, and for the second the sum is |-2 + 3 - 7| + |5 - 2| = 9. We call this method the absolute difference heuristic. We performed some experiments [8] to determine how often the same split point is picked by the information gain heuristic and the absolute difference heuristic. Our results indicate that in most cases (approximately between 91% and 100%, depending on data set characteristics) the same split point is picked by both methods.

4. Evaluation

In this section we present the results of experiments that evaluate our methods for 3 different microarray data sets.

4.1 Data Sets and Experimental Design

The first data set is a leukemia data collection that consists of 62 bone marrow and 10 peripheral blood samples from acute leukemia patients (obtained from Golub et al. [8]). The total of 72 samples fall into two types of acute leukemia: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). These samples come from both adults and children. The RNA samples were hybridized to Affymetrix high-density oligonucleotide microarrays that contain probes for p = 7,130 human genes.

The second data set, a colon tissue data set, contains the expression levels (Red intensity/Green intensity) of the 2000 genes with highest minimal intensity across 62 colon tissues. These gene expressions in 40 tumor and 22 normal colon tissue samples were analyzed with an Affymetrix oligonucleotide array containing over 6,500 human genes (Alon et al. [2]).

The third data set comes from a study of gene expression in breast cancer patients (van 't Veer et al. [3]). The data set contains data from 98 primary breast cancer patients: 34 from patients who developed distant metastases within 5 years, 44 from patients who continued to be disease-free after a period of at least 5 years, 18 from patients with
BRCA1 germline mutations, and 2 from BRCA2 carriers. All patients were lymph node negative, and under 55 years of age at diagnosis.

In the experiments, we did not use all genes, but rather selected a subset P with p elements of the genes. Decision trees were then learnt that operate on the selected subset of genes. As proposed in [9], we remove genes from the data sets based on the ratio of their between-groups to within-groups sum of squares. For a particular gene j, the ratio is defined as:

$$\frac{BSS(j)}{WSS(j)} = \frac{\sum_i \sum_k I(y_i = k)\,(\bar{x}_{kj} - \bar{x}_{\cdot j})^2}{\sum_i \sum_k I(y_i = k)\,(x_{ij} - \bar{x}_{kj})^2}$$

where x_{ij} is the expression level of gene j in sample i, y_i is the class of sample i, I(.) is the indicator function, \bar{x}_{\cdot j} denotes the average expression level of gene j across all samples and \bar{x}_{kj} denotes the average level of gene j across samples belonging to class k.

To give an explicit example, assume we have four samples and two genes for each sample: the first gene's expression level values for the four samples are (1, 2, 3, 4) and the second's are (1, 3, 2, 4); the sample class memberships are (+, -, +, -) (listed in the order of samples no.1, no.2, no.3 and no.4). For gene 1, the class means are 2 and 3 and the overall mean is 2.5, so BSS = 4 * 0.5^2 = 1 and WSS = 4 * 1^2 = 4, giving BSS/WSS = 0.25; for gene 2, the class means are 1.5 and 3.5, so BSS = 4 * 1^2 = 4 and WSS = 4 * 0.5^2 = 1, giving BSS/WSS = 4. If we have to remove one gene, gene 1 will be removed according to our rule since it has a lower BSS/WSS value. The removal of gene 1 is reasonable because we can tell the class membership of the samples by looking at their gene 2 expression level values: if a sample's gene 2 expression level is greater than 2.5, the sample should belong to the negative class, otherwise the sample belongs to the positive class. If we evaluate gene 1 instead, we will not be able to perform the classification in one single step as we have just done with gene 2.

After we calculate the BSS/WSS ratios for all genes in a data set, only the p genes with the largest ratios remain in the data sets that are used in the experiments. Experiments were conducted with different p values.

In the experiments, we compared the popular C5.0/See5.0 decision tree tool (which was run with its default parameter settings) with two versions of our tool. The first version, called the microarray decision tree tool, does not use any optimizations but employs pre-pruning. It stops growing the tree when at least 90% of the examples belong to the majority class. The second version of our tool, called the optimized decision tree tool, uses the same pre-pruning and employs all the techniques that were discussed in Section 3.

4.2 Experimental Results

The first experiment evaluated the accuracy of the three decision tree learning tools. Tables 1-3 below display each algorithm's error rate using the three different data sets and also using three different p values for gene selection.

The first column of the three tables gives the p values that were used. The other columns give the total number of misclassifications and the error rate (in parentheses). Error rates were computed using leave-one-out cross validation.

Table 1: The Leukemia data set test result (72 samples)

p       C5.0 Decision Tree    Microarray Decision Tree    Optimized Decision Tree
1024    5 (6.9%)              5 (6.9%)                    4 (5.6%)
900     4 (4.6%)              8 (11.1%)                   5 (6.9%)
750     13 (18.1%)            11 (15.3%)                  3 (4.2%)

Table 2: Colon Tissue data set test result (62 samples)

p       C5.0 Decision Tree    Microarray Decision Tree    Optimized Decision Tree
1600    12 (19.4%)            15 (24.2%)                  16 (25.8%)
1200    12 (19.4%)            15 (24.2%)                  16 (25.8%)
800     12 (19.4%)            14 (22.6%)                  16 (25.8%)

Table 3: Breast Cancer data set test result (78 samples)

p       C5.0 Decision Tree    Microarray Decision Tree    Optimized Decision Tree
5000    38 (48.7%)            29 (33.3%)                  35 (44.9%)
1600    39 (50.0%)            32 (41.0%)                  30 (38.5%)
1200    39 (50.0%)            31 (39.7%)                  29 (33.3%)

If we study the error rates for the three methods listed in the three tables carefully, it can be noticed that on average the error rates for the optimized decision tree are lower than those of the non-optimized one, which looks quite surprising since the optimized decision tree tool uses a lot of approximate computations and pruning.

However, further analysis revealed that the use of attribute pruning (using the hist index we introduced in Section 3.2) provides an explanation for the better average accuracy of the optimized decision tree tool. Why would attribute pruning lead to a more accurate prediction in some cases? The reason is that the entropy function does not take the class distribution on sorted attributes into consideration. For example, if we have two attribute histograms (3+, 3-, 6+) and (3+, 1-, 2+, 1-, 2+, 1-, 2+), for the first histogram the best split point is (3+ | 3-, 6+), but for the second histogram there is one similar split point (3+ | 1-, 2+, 1-, 2+, 1-, 2+) which is equivalent to (3+ | 3-, 6+) with respect to the information gain heuristic. Therefore, both split points have the same chance to be selected. But, just by intuition, we would say that the second split point is much worse than the first split point because of its large number of blocks, which requires more tests to separate the two classes properly than the first one.

The traditional information gain heuristic ignores such distributional aspects entirely, which causes a loss of accuracy in some circumstances. However, hist index based pruning, as proposed in Section 3.2, improves on this situation by removing attributes that have a low hist index (like the second attribute in the above example) beforehand. Intuitively, continuous attributes with long histograms "representing flip-flopping class memberships" are not very attractive choices for test conditions, because more nodes/tests are necessary in a decision tree to predict classes correctly based on such an attribute. In summary, some of those "bad" attributes were removed by attribute pruning, which explains the higher average accuracy in the experiments.

In another experiment we compared the CPU time for leave-one-out cross validation for the three decision tree learning tools: C5.0 Decision Tree, normal (Microarray Decision Tree) and optimized (Optimized Decision Tree). All these experiments were performed on an 850 MHz Intel Pentium processor with 128 MB main memory. The CPU time that is displayed (in seconds) in Table 4 includes the time for the tree building and evaluation process (Note: these experiments are identical to those previously listed in Tables 1 to 3). Our experimental results suggest that the decision tree tool designed for microarray data sets normally runs slightly faster than the C5.0 tool, while the speedup of the optimized microarray decision tree tool is quite significant and ranges from 150% to 400%.

Table 4: CPU time comparison of three different decision tree tools (CPU time in seconds)

Data set         p       C5.0    Normal    Optimized
Leukemia         1024    6.7     3.5       1.2
Leukemia         900     5.6     3.1       1.1
Leukemia         750     6.0     4.1       1.1
Colon Tissue     1600    12.0    8.0       2.2
Colon Tissue     1200    9.0     6.0       1.7
Colon Tissue     800     5.9     3.8       1.1
Breast Cancer    5000    74.5    75.3      15.9
Breast Cancer    2000    30.4    30.2      6.4
Breast Cancer    1500    22.4    20.4      4.8

5. Summary and Conclusion

We introduced decision tree learning algorithms for microarray data sets, and their optimization to speed up leave-one-out cross validation. Aimed at this goal, several strategies were employed: the introduction of the hist index to help prune attributes; approximate computations that measure entropy; and the reuse of sub-trees from previous runs. We claim that the first two ideas are new, whereas the third idea was also explored in Blockeel's paper [10], which centered on the reuse of split points. The performance of the microarray decision tree tool was compared with that of the commercially available decision tree tool C5.0/See5.0 using 3 microarray data sets. The experiments suggest that our tool runs between 150% and 400% faster than C5.0.

We also compared the trees that were generated in the experiments for the same data sets. We
observed that the trees generated by the same tool are very similar. Trees generated by different tools also had a significant degree of similarity. Basically, all the trees that were generated for the three data sets are of small size, with normally fewer than 10 nodes. We also noticed that smaller trees seem to be correlated with lower error rates.

Also worth mentioning is that our experimental results revealed that the use of the hist index resulted in better accuracy in some cases. These results also suggest that for continuous attributes the traditional entropy-based information gain heuristic does not work very well, because of its weakness in reflecting the class distribution characteristics of the samples with respect to continuous attributes. Therefore, better evaluation heuristics are needed for continuous attributes. This problem is the subject of our current research; in particular, we are currently investigating multi-modal heuristics that use both hist index and entropy. Another problem that is investigated in our current research is the generalization of the techniques described in this paper to classification problems that involve more than two classes.

References

[1] A. Brazma, J. Vilo. Gene expression data analysis, FEBS Letters, 480:17-24, 2000.

[2] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Cell Biology, Vol. 96, pp. 6745-6750, June 1999.

[3] Laura J. van 't Veer, Hongyue Dai, Marc J. van de Vijver, Yudong D. He, Augustinus A.M. Hart, Mao Mao, Hans L. Peterse, Karin van der Kooy, Matthew J. Marton, Anke T. Witteveen, George J. Schreiber, Ron M. Kerkhoven, Chris Roberts, Peter S. Linsley, René Bernards and Stephen H. Friend. Gene expression profiling predicts clinical outcome of breast cancer, Nature, 415, pp. 530-536, 2002.

[4] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. Lander, and T. Golub. Interpreting patterns of gene expression with self-organizing maps, PNAS, 96:2907-2912, 1999.

[5] J.R. Quinlan. C4.5: Programs for machine learning. Morgan Kaufmann, San Mateo, 1993.

[6] Xiaoyong Li. Concept learning techniques for microarray data collections, Master's Thesis, University of Houston, December 2002.

[7] U. Fayyad and K. Irani. Multi-interval discretization of continuous-valued attributes for classification learning, Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI-93), pp. 1022-1029, 1993.

[8] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M.L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 286:531-537, 1999.

[9] S. Dudoit, J. Fridlyand, and T. P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association, Vol. 97, No. 457, pp. 77-87, 2002.

[10] H. Blockeel, J. Struyf. Efficient algorithms for decision tree cross-validation, Machine Learning: Proceedings of the Eighteenth International Conference, pp. 11-18, 2001.