A Combination of Decision Tree Learning and Clustering for Data Classification
Transcript

  • 1. 2011 Eighth International Joint Conference on Computer Science and Software Engineering (JCSSE) A Combination of Decision Tree Learning and Clustering for Data Classification Chinnapat Kaewchinporn Department of Computer Science King Mongkut’s Institute of Technology Ladkrabang, Ladkrabang, Thailand s0050117@kmitl.ac.th, scriptsds@gmail.com
  • 2. Contents: Abstract, Introduction, C4.5 Decision Tree algorithm, k-means Clustering algorithm, Bagging algorithm, Tree Bagging and Weighted Clustering algorithm, Datasets, Experimental Results, Conclusion
  • 3. Abstract We present a new classification algorithm which is a combination of decision tree learning and clustering, called Tree Bagging and Weighted Clustering (TBWC).
  • 4. Introduction Data classification is an important problem in Knowledge Discovery in Databases (KDD). Currently, there are many techniques to solve this problem, such as decision trees, naïve Bayes, instance-based learning, and artificial neural networks.
  • 5. Introduction However, the techniques mentioned above have some problems in classification, for example, outlier handling, reduced accuracy as the amount of testing data increases, and increased classification time when there is a large number of attributes. Many studies have proposed new methods to enhance the predictive performance of a classifier.
  • 6. Introduction (Reference) For decision tree learning, Guo et al. introduced an improved C4.5 decision tree based on weights, which considers imbalanced weights between different instances, to address class imbalance problems. For k-means clustering, Zhu and Wang proposed an approach to optimize the value of k by using a genetic algorithm. Qin et al. addressed the efficiency of the k-means clustering algorithm for large datasets; their improved algorithm avoids unnecessary calculations by using the triangle inequality.
  • 7. Introduction (Reference) For ensemble classifier research, Gaddam et al. [13] proposed a novel method called “K-Means+ID3” for anomaly detection. The K-Means+ID3 algorithm first partitions the training instances into k clusters with the k-means clustering method, then builds an ID3 decision tree on the instances in each cluster. For the final output of K-Means+ID3, the decisions of the k-means and ID3 methods are combined using the nearest-neighbor rule and the nearest-consensus rule.
  • 8. Introduction (Summary) We present a new classification algorithm, a combination of decision tree learning and clustering, called Tree Bagging and Weighted Clustering (TBWC). The idea of the TBWC algorithm is to select important attributes and assign weights to them by using decision tree bagging; the weighted attributes are then used to generate clusters for classifying a new instance.
  • 9. C4.5 Decision Tree Algorithm C4.5 is an algorithm used to generate decision trees for classification problems.
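The slide names C4.5 without detail; for context, C4.5 chooses split attributes by the gain ratio (information gain divided by split information). A minimal sketch for a categorical attribute, with helper names of our own choosing:

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def gain_ratio(attr_values, labels):
    """C4.5's split criterion: information gain divided by split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(attr_values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(attr_values)   # entropy of the attribute's own partition
    return gain / split_info if split_info > 0 else 0.0
```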
  • 10. k-means Clustering algorithm k-means is a well-known clustering algorithm. The main idea of k-means is to partition a dataset of n objects into k clusters, given an integer value k.
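To make the partitioning idea concrete, here is a minimal Lloyd's-algorithm sketch in NumPy; it is illustrative only, as the slides do not prescribe an implementation:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Partition the n rows of X into k clusters (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # pick k initial centers
    for _ in range(iters):
        # assignment step: attach each object to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each center to the mean of its members
        new_centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return centers, labels
```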
  • 11. Bagging algorithm Bagging is a classifier combination method. The objective of bagging is to construct a composite model that improves on the predictive performance of a single model. Basically, bagging uses a single algorithm to construct n models and requires diversity in the training data. For the final prediction, the bagging algorithm counts the votes of all the classification models and assigns the class with the most votes to the new data.
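A minimal sketch of bagging with majority voting; scikit-learn's CART trees stand in here for C4.5, and all function names are illustrative:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=10, seed=0):
    """Train n models, each on a bootstrap sample of the training data."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))    # sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    """Count the votes of all models and return the class with the most votes."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return Counter(votes).most_common(1)[0][0]
```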
  • 12. Tree Bagging and Weighted Clustering algorithm Tree Bagging and Weighted Clustering (TBWC) consists of two main parts. First, important attributes are selected and weighted by using the bagging technique with C4.5 decision trees. [Figure: The processes of TBWC]
  • 13. Tree Bagging and Weighted Clustering algorithm Second, the weighted attributes are used as the inputs of k-means clustering to create clusters. [Figure: The processes of TBWC]
  • 14. Tree Bagging and Weighted Clustering algorithm As shown in the figure, there are four processes in TBWC: 1) modeling, 2) attribute selection, 3) weighted clustering, and 4) prediction. [Figure: The processes of TBWC]
  • 15. Tree Bagging and Weighted Clustering algorithm A. Modeling process: Training data are used to create models with the bagging technique and C4.5 decision trees. The output of the bagging algorithm is n decision tree models. [Figure: The processes of TBWC]
  • 16. Tree Bagging and Weighted Clustering algorithm B. Attribute selection process: After the decision trees are created by bagging, the attributes that appear in the trees are selected. To assign a weight to each attribute, we consider the attribute nodes of each tree. The weight of an attribute varies with the position at which the attribute appears in a tree and with the size of the tree. [Figure: The processes of TBWC]
  • 17. Tree Bagging and Weighted Clustering algorithm B. Attribute selection process: If the same attribute occurs at several positions in a tree, only the attribute node appearing at the lowest level is considered. [Figure: The processes of TBWC]
  • 18. Tree Bagging and Weighted Clustering algorithm The algorithm starts by defining a weight for each attribute, initialized to zero. The next step is to assign a weight to each attribute by considering all internal nodes of each tree, as sketched below.
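The slides' exact weight formula appears only as an image that did not survive extraction, so the sketch below assumes an illustrative rule: an attribute node at level l in a tree of depth d contributes (d - l) / d, so nodes near the root count more and deeper trees dilute each node's share. This matches the stated dependence on node position and tree size, but it is not the paper's formula. Each tree is represented as a map from attribute to the lowest level at which it appears:

```python
def attribute_weights(trees, attributes):
    """Accumulate one weight per attribute over all bagged trees.

    trees: list of dicts mapping attribute -> lowest level of occurrence
           (level 0 is the root); the contribution rule (d - l) / d is an
           assumption for illustration, not taken from the paper.
    """
    weights = {a: 0.0 for a in attributes}            # initialized to zero
    for tree in trees:
        depth = max(tree.values()) + 1                # size proxy: tree depth
        for attr, level in tree.items():
            weights[attr] += (depth - level) / depth  # root-most nodes weigh more
    return weights

# Illustrative only: two toy trees over A = {age, income, student, credit_rating}
trees = [{"student": 0, "age": 1}, {"age": 0, "credit_rating": 1}]
print(attribute_weights(trees, ["age", "income", "student", "credit_rating"]))
```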
  • 19. Tree Bagging and Weighted Clustering algorithm For example, suppose there are two decision tree models and the attribute set is A = {age, income, student, credit_rating}. [Figure: The decision tree model 1 (M1)]
  • 20. Tree Bagging and Weighted Clustering algorithm To assign a weight, the attribute node at level 0 is considered first; thus the weight of the attribute student is defined by the formula shown on the slide. [Formula not captured in the transcript.]
  • 21. Tree Bagging and Weighted Clustering algorithm [Slide content not captured in the transcript.]
  • 22. Tree Bagging and Weighted Clustering algorithm [Figure: The decision tree model 2 (M2)]
  • 23. Tree Bagging and Weighted Clustering algorithm [Slide content not captured in the transcript.]
  • 24. Tree Bagging and Weighted Clustering algorithm When all tree models have been considered, the next step is to calculate the actual weight of each attribute using the equation shown on the slide. [Equation not captured in the transcript.]
  • 25. Tree Bagging and Weighted Clustering algorithm The total weight of each attribute is shown on the slide. [Values not captured in the transcript.]
  • 26. Tree Bagging and Weighted Clustering algorithm At the end of the algorithm, every attribute whose weight is greater than zero is selected. Finally, the weights of all selected attributes are normalized into [0, 1] using min-max normalization.
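Min-max normalization maps each weight w to (w - min) / (max - min). A small sketch, assuming the weights are held in a dict as in the weighting sketch above:

```python
def min_max_normalize(weights):
    """Scale a dict of attribute weights into the range [0, 1]."""
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:                                # degenerate case: all weights equal
        return {a: 1.0 for a in weights}
    return {a: (w - lo) / (hi - lo) for a, w in weights.items()}
```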
  • 27. Tree Bagging and Weighted Clustering algorithm C. Weighted clustering process: The dataset with the selected attributes and their weights is used to generate k clusters by k-means clustering. Each center is assigned a class label by a majority vote. [Figure: The processes of TBWC]
  • 28. Tree Bagging and Weighted Clustering algorithm D. Prediction process: When a new instance x is presented, the distances between the instance x and all cluster centers are measured. The prediction for the instance x is the class label of the nearest cluster center, as sketched below. [Figure: The processes of TBWC]
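A sketch of steps C and D together. The slides do not spell out how the weights enter the clustering; a common choice, assumed here, is to scale each attribute by its weight before computing Euclidean distances (kmeans is the function from the k-means sketch above; X and y are NumPy arrays):

```python
import numpy as np
from collections import Counter

def fit_weighted_clusters(X, y, w, k, seed=0):
    """Cluster weight-scaled data and label each center by majority vote."""
    centers, labels = kmeans(X * w, k, seed=seed)
    center_class = []
    for j in range(k):
        members = y[labels == j]
        # majority vote over the training instances assigned to cluster j
        center_class.append(Counter(members.tolist()).most_common(1)[0][0]
                            if len(members) else None)
    return centers, center_class

def predict(x, w, centers, center_class):
    """Predict the class label of the cluster center nearest to instance x."""
    d = np.linalg.norm(centers - x * w, axis=1)
    return center_class[d.argmin()]
```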
  • 29. Datasets In the experiments, we use five datasets.

TABLE 1. THE CHARACTERISTICS OF EACH DATASET

  datasets                     instances   attributes (all)   attribute type               classes
  1. Cardiocography 1              2,126                 23   real                               3
  2. Cardiocography 2              2,126                 23   real                              10
  3. Internet Advertisements       3,279              1,558   categorical, integer, real         2
  4. Libras Movement                 360                 91   real                              15
  5. Multiple Features             2,000                649   integer, real                     10

- Cardiocography 1 and Cardiocography 2 contain fetal heart rate (FHR) and uterine contraction (UC) measurements.
- Internet Advertisements concerns identifying whether an image on a webpage is an advertisement.
- Libras Movement concerns classifying hand movements.
- Multiple Features concerns classifying handwritten numerals described by several feature sets.
  • 30. Experimental Results To evaluate the predictive performance of the proposed algorithm, the C4.5 decision tree and k-means clustering were used to compare predictive accuracy with TBWC.

TABLE 2. COMPARISON OF THE PREDICTIVE ACCURACY OF EACH ALGORITHM

  Datasets                   C4.5     k-means   TBWC
  Cardiocography 1           92.56%   86.97%    96.47%
  Cardiocography 2           82.83%   62.65%    90.35%
  Internet Advertisements    96.19%   93.11%    96.65%
  Libras Movement            60.56%   43.89%    80.56%
  Multiple Features          92.60%   93.45%    96.55%
  • 31. Experimental Results [Figure: The predictive accuracy of each algorithm: a bar chart of accuracy (%) for C4.5, k-means, and TBWC on the five datasets.]
  • 32. Experimental Results

TABLE 3. PARAMETER SETTINGS FOR EACH DATASET

  datasets                   number of models (n)   number of clusters (k)
  Cardiocography 1                             40                      100
  Cardiocography 2                             20                      100
  Internet Advertisements                      10                      100
  Libras Movement                               5                       20
  Multiple Features                            10                       80

The number of models (n) is the number of C4.5 decision tree models produced by the bagging algorithm. The number of clusters (k) is the number of clusters in k-means clustering. For each dataset, the values of n and k are those of the model with the highest accuracy on a validation dataset.
  • 33. Experimental Results

TABLE 4. ATTRIBUTE REDUCTION FOR EACH DATASET

  datasets                   number of attributes   number of selected attributes used   percent attribute reduction
  Cardiocography 1                             23                                   22                         4.35%
  Cardiocography 2                             23                                   22                         4.35%
  Internet Advertisements                   1,558                                  626                        59.82%
  Libras Movement                              91                                   86                         5.49%
  Multiple Features                           649                                  454                        30.05%

The results show that the proposed algorithm has the ability to select important attributes.
  • 34. Conclusion The Tree Bagging and Weighted Clustering (TBWC) algorithm was proposed to enhance the efficiency of data classification. The TBWC algorithm consists of two main steps: attribute selection and classification of a new instance. In the attribute selection step, the bagging algorithm with C4.5 decision trees is used to select and weight attributes. Then k-means clustering is applied to assign a class label to a new instance.
  • 35. Conclusion The experimental results are summarized in the following aspects. 1. The TBWC algorithm yields the highest accuracy compared with the decision tree and clustering algorithms on all datasets. 2. The TBWC algorithm can greatly improve predictive performance, especially for multi-class datasets such as Libras Movement and Cardiocography 2. 3. The TBWC algorithm can reduce the attributes by up to 59.82% on the Internet Advertisements dataset while achieving higher accuracy than the C4.5 decision tree and k-means clustering.