
- 1. 2011 Eighth International Joint Conference on Computer Science and Software Engineering (JCSSE). A Combination of Decision Tree Learning and Clustering for Data Classification. Chinnapat Kaewchinporn, Department of Computer Science, King Mongkut’s Institute of Technology Ladkrabang, Ladkrabang, Thailand. s0050117@kmitl.ac.th, scriptsds@gmail.com
- 2. Contents: Abstract, Introduction, C4.5 Decision Tree Algorithm, k-means Clustering Algorithm, Bagging Algorithm, Tree Bagging and Weighted Clustering Algorithm, Datasets, Experimental Results, Conclusion.
- 3. Abstract: We present a new classification algorithm, called Tree Bagging and Weighted Clustering (TBWC), which combines decision tree learning and clustering.
- 4. Introduction: Data classification is an important problem in Knowledge Discovery in Databases (KDD). Currently, there are many techniques to solve this problem, such as decision trees, naïve Bayes, instance-based learning, and artificial neural networks.
- 5. Introduction: However, the techniques mentioned above have some problems in classification, for example, outlier handling, reduced accuracy as the amount of test data increases, and increased classification time when there is a large number of attributes. Many studies have proposed new methods to enhance the predictive performance of a classifier.
- 6. Introduction (References): For decision tree learning, Guo et al. introduced an improved weight-based C4.5 decision tree, which considers imbalanced weights between different instances, to address class-imbalance problems. For k-means clustering, Zhu and Wang proposed an approach that optimizes the value of k using a genetic algorithm. Qin et al. addressed the efficiency of the k-means clustering algorithm on large datasets; their improved algorithm avoids unnecessary calculations by using the triangle inequality.
- 7. Introduction (References): For ensemble classifiers, Gaddam et al. [13] proposed a method called "K-Means+ID3" for anomaly detection. K-Means+ID3 first partitions the training instances into k clusters with k-means, then builds an ID3 decision tree from the instances in each cluster. For the final output, the decisions of the k-means and ID3 methods are combined using the nearest-neighbor rule and the nearest-consensus rule.
- 8. Introduction (Summary): We present a new classification algorithm, called Tree Bagging and Weighted Clustering (TBWC), which combines decision tree learning and clustering. The idea of TBWC is to select important attributes and assign weights to them using decision tree bagging; the weighted attributes are then used to generate clusters for classifying new instances.
- 9. C4.5 Decision Tree Algorithm: C4.5 is an algorithm for generating decision trees for classification problems; at each node it chooses the attribute whose split best separates the classes, measured by an entropy-based criterion (information gain ratio).
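As a minimal illustration of the entropy-based criterion behind C4.5 (this sketch computes plain information gain, the quantity underlying C4.5's gain ratio; the toy data are invented for the example):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction from splitting (rows, labels) on the
    categorical attribute at attr_index."""
    base = entropy(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return base - remainder

# Toy data: attribute "student?" predicts the class perfectly.
rows = [("yes",), ("yes",), ("no",), ("no",)]
labels = ["buy", "buy", "skip", "skip"]
print(information_gain(rows, labels, 0))  # 1.0: a perfect split
```

The attribute with the highest score is placed nearer the root, which is why node depth is a plausible proxy for attribute importance in TBWC's weighting step.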
- 10. k-means Clustering Algorithm: k-means is a well-known clustering algorithm. Given a dataset of n objects and an integer k, k-means partitions the data into k clusters.
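A minimal pure-Python sketch of the k-means loop (assign each point to the nearest center, then move each center to its cluster mean); the data and iteration count are illustrative, not from the paper:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on tuples of floats: returns k centers."""
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)          # initialize from random points
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers

# Two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers = kmeans(points, k=2)
```

With well-separated data like this, the two centers converge to roughly (0.05, 0.1) and (5.1, 4.95) regardless of initialization.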
- 11. Bagging Algorithm: Bagging is a classifier-combination method. Its objective is to construct a composite model that improves the predictive performance of a single model. Basically, bagging uses a single algorithm to construct n models and requires diversity in the training data. For the final prediction, the bagging algorithm counts the votes from all classification models and assigns the class with the most votes to the new instance.
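The bootstrap-and-vote procedure can be sketched as follows; `majority_learner` is a hypothetical stand-in for the C4.5 learner used in the paper:

```python
import random
from collections import Counter

def bagging_predict(train, n_models, learn, x, seed=0):
    """Train n_models learners on bootstrap samples of `train`,
    then classify x by majority vote over their predictions."""
    rnd = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Bootstrap sample: draw len(train) items with replacement.
        sample = [rnd.choice(train) for _ in train]
        model = learn(sample)
        votes.append(model(x))
    return Counter(votes).most_common(1)[0][0]

# Hypothetical trivial learner: always predict the sample's majority class
# (a real bagging ensemble would train a C4.5 tree here instead).
def majority_learner(sample):
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: label

train = [(0, "a"), (1, "a"), (2, "a"), (3, "b")]
print(bagging_predict(train, n_models=5, learn=majority_learner, x=2))
```

Resampling with replacement is what gives each model a different view of the training data, providing the diversity the slide mentions.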
- 12. Tree Bagging and Weighted Clustering Algorithm: Tree Bagging and Weighted Clustering (TBWC) consists of two main parts. First, important attributes are selected and weighted using the bagging technique with C4.5 decision trees. [Figure: the processes of TBWC]
- 13. Tree Bagging and Weighted Clustering Algorithm: Second, the weighted attributes are used as the inputs of k-means clustering to create clusters.
- 14. Tree Bagging and Weighted Clustering Algorithm: As shown in the figure, TBWC has four processes: 1) modeling, 2) attribute selection, 3) weighted clustering, and 4) prediction.
- 15. Tree Bagging and Weighted Clustering Algorithm. A. Modeling process: Training data are used to create models using the bagging technique with C4.5 decision trees. The output of the bagging algorithm is n decision-tree models.
- 16. Tree Bagging and Weighted Clustering Algorithm. B. Attribute selection process: After the decision trees have been created by bagging, the attributes that appear in the trees are selected. To assign weights, we consider the attribute nodes of each tree: the weight of an attribute depends on the position at which the attribute appears in a tree and on the size of the tree.
- 17. Tree Bagging and Weighted Clustering Algorithm. B. Attribute selection process (continued): If the same attribute occurs at several positions in a tree, only the attribute node appearing at the lowest level is considered.
- 18. Tree Bagging and Weighted Clustering Algorithm: The algorithm starts by defining a weight for each attribute, initialized to zero. The next step is to assign a weight to each attribute by considering all internal nodes of each tree.
- 19. Tree Bagging and Weighted Clustering Algorithm: For example, suppose there are two decision-tree models and the set of attributes is A = {age, income, student, credit_rating}. [Figure: decision tree model 1 (M1)]
- 20. Tree Bagging and Weighted Clustering Algorithm: To assign weights, the attribute node at level 0 (the root) is considered first; thus the weight of attribute student is defined by the weighting formula.
- 22. Tree Bagging and Weighted Clustering Algorithm: [Figure: decision tree model 2 (M2)]
- 24. Tree Bagging and Weighted Clustering Algorithm: When all tree models have been considered, the next step is to calculate the actual weight of each attribute using the weighting equation.
- 25. Tree Bagging and Weighted Clustering Algorithm: The total weight of each attribute is then obtained. [Values shown in the slide figure]
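The weighting idea (attributes nearer the root of larger trees contribute more, summed over all bagged trees) can be illustrated with the M1/M2 example above. The slides do not reproduce the paper's actual formula, so the per-node contribution (size − level) / size used here is a hypothetical stand-in, as are the tree sizes:

```python
def attribute_weights(trees, attributes):
    """trees: one dict per bagged tree, mapping each attribute to the level
    of its shallowest occurrence, plus the tree's node count under "__size__".
    Returns a total weight per attribute (assumed formula, for illustration)."""
    weights = dict.fromkeys(attributes, 0.0)   # every weight starts at zero
    for tree in trees:
        size = tree["__size__"]
        for attr, level in tree.items():
            if attr == "__size__":
                continue
            weights[attr] += (size - level) / size   # nearer the root => larger share
    return weights

attrs = ["age", "income", "student", "credit_rating"]
trees = [
    {"student": 0, "age": 1, "__size__": 5},        # model M1 (sizes assumed)
    {"age": 0, "credit_rating": 1, "__size__": 4},  # model M2
]
w = attribute_weights(trees, attrs)
print(w["income"])  # 0.0: income appears in no tree, so it will be dropped
```

Under this stand-in formula, age (appearing in both trees, once at the root) accumulates the largest weight, while income stays at zero and is filtered out in the selection step.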
- 26. Tree Bagging and Weighted Clustering Algorithm: At the end of the algorithm, the attributes whose weight is greater than zero are selected. Finally, the weights of all selected attributes are normalized into [0, 1] using min-max normalization.
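Min-max normalization rescales each weight w to (w − min) / (max − min), a one-liner in Python:

```python
def min_max(values):
    """Rescale values into [0, 1]: (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # degenerate case: all weights equal
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```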
- 27. Tree Bagging and Weighted Clustering Algorithm. C. Weighted clustering process: The dataset with the selected attributes and their weights is used to generate k clusters by k-means clustering. Each cluster center is assigned a class label by majority vote.
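The majority-vote labeling of cluster centers can be sketched as follows, assuming cluster assignments come from the k-means step:

```python
from collections import Counter

def label_clusters(assignments, labels, k):
    """assignments[i] = cluster index of training instance i.
    Return the majority class label for each of the k cluster centers."""
    per_cluster = [[] for _ in range(k)]
    for cluster, label in zip(assignments, labels):
        per_cluster[cluster].append(label)
    return [Counter(c).most_common(1)[0][0] if c else None
            for c in per_cluster]

print(label_clusters([0, 0, 1, 1, 1], ["yes", "yes", "no", "yes", "no"], k=2))
# ['yes', 'no']
```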
- 28. Tree Bagging and Weighted Clustering Algorithm. D. Prediction process: When a new instance x is presented, the distances between x and all cluster centers are measured. The prediction for x is the class label of the nearest cluster center.
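Nearest-center classification in a sketch; using the attribute weights inside the distance is an assumption consistent with the "weighted clustering" name, not a detail spelled out in the slides:

```python
def predict(x, centers, center_labels, weights):
    """Classify x by the label of the nearest cluster center,
    using a weighted squared Euclidean distance."""
    def dist(center):
        return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, center))
    nearest = min(range(len(centers)), key=lambda i: dist(centers[i]))
    return center_labels[nearest]

centers = [(0.0, 0.0), (1.0, 1.0)]
print(predict((0.9, 0.8), centers, ["neg", "pos"], weights=[1.0, 1.0]))  # "pos"
```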
- 29. Datasets: In the experiments, we use five datasets.

  TABLE 1. THE CHARACTERISTICS OF EACH DATASET

  dataset                      instances   attributes   attribute type               classes
  1. Cardiotocography 1        2,126       23           real                         3
  2. Cardiotocography 2        2,126       23           real                         10
  3. Internet Advertisements   3,279       1,558        categorical, integer, real   2
  4. Libras Movement           360         91           real                         15
  5. Multiple Features         2,000       649          integer, real                10

  Cardiotocography 1 and Cardiotocography 2 contain fetal heart rate (FHR) and uterine contraction (UC) measurements. Internet Advertisements concerns identifying advertisement images on web pages. Libras Movement concerns classifying hand movements. Multiple Features concerns recognizing handwritten numerals described by several feature sets.
- 30. Experimental Results: To evaluate the predictive performance of the proposed algorithm, the predictive accuracy of TBWC was compared with that of the C4.5 decision tree and k-means clustering.

  TABLE 2. COMPARISON OF THE PREDICTIVE ACCURACY OF EACH ALGORITHM

  dataset                      C4.5      k-means   TBWC
  Cardiotocography 1           92.56%    86.97%    96.47%
  Cardiotocography 2           82.83%    62.65%    90.35%
  Internet Advertisements      96.19%    93.11%    96.65%
  Libras Movement              60.56%    43.89%    80.56%
  Multiple Features            92.60%    93.45%    96.55%
- 31. Experimental Results: [Figure: bar chart of the predictive accuracy (%) of C4.5, k-means, and TBWC on each dataset]
- 32. Experimental Results:

  TABLE 3. PARAMETER SETTINGS FOR EACH DATASET

  dataset                      number of models n   number of clusters k
  Cardiotocography 1           40                   100
  Cardiotocography 2           20                   100
  Internet Advertisements      10                   100
  Libras Movement              5                    20
  Multiple Features            10                   80

  The number of models (n) is the number of C4.5 decision-tree models produced by the bagging algorithm; the number of clusters (k) is the number of clusters in k-means clustering. For each dataset, the values of n and k are those of the model with the highest accuracy on a validation set.
- 33. Experimental Results:

  TABLE 4. ATTRIBUTE REDUCTION FOR EACH DATASET

  dataset                      number of attributes   attributes used   percent reduction
  Cardiotocography 1           23                     22                4.35%
  Cardiotocography 2           23                     22                4.35%
  Internet Advertisements      1,558                  626               59.82%
  Libras Movement              91                     86                5.49%
  Multiple Features            649                    454               30.05%

  The results show that the proposed algorithm is able to select important attributes.
- 34. Conclusion: The Tree Bagging and Weighted Clustering (TBWC) algorithm was proposed to enhance the efficiency of data classification. TBWC consists of two main steps: attribute selection and classification of new instances. In the attribute selection step, the bagging algorithm with C4.5 decision trees is used to select and weight attributes. Then k-means clustering is applied to assign a class label to a new instance.
- 35. Conclusion: The experimental results can be summarized as follows. 1. The TBWC algorithm yields the highest accuracy compared with the decision tree and clustering on all datasets. 2. The TBWC algorithm greatly improves predictive performance, especially on multi-class datasets such as Libras Movement and Cardiotocography 2. 3. The TBWC algorithm reduces the number of attributes by up to 59.82% (Internet Advertisements dataset) while achieving higher accuracy than the C4.5 decision tree and k-means clustering.