QUALITY OF CLUSTER INDEX BASED ON STUDY OF DECISION TREE


International Journal of Research in Computer Science, eISSN 2249-8265, Volume 2, Issue 1 (2011), pp. 39-43. © White Globe Publications. www.ijorcs.org

B. Rajasekhar (1), B. Sunil Kumar (2), Rajesh Vibhudi (3), B. V. Rama Krishna (4)
(1, 2) Assistant Professor, Jawaharlal Nehru Institute of Technology, Hyderabad
(3) Sri Mittapalli College of Engineering, Guntur
(4) Associate Professor, St Mary's College of Engineering & Technology, Hyderabad

Abstract: Quality of clustering is an important issue in the application of clustering techniques. Most traditional cluster validity indices are geometry-based cluster quality measures. This work proposes a cluster validity index based on the decision-theoretic rough set model by considering various loss functions. Real-world retail data show the usefulness of the proposed validity index for the evaluation of rough and crisp clustering. The measure is shown to help determine the optimal number of clusters, as well as an important parameter in rough clustering called the threshold. Experiments with a promotional campaign on the retail data illustrate the ability of the proposed measure to incorporate financial considerations into the evaluation of a clustering scheme. This ability to deal with monetary values distinguishes the proposed decision-theoretic measure from other distance-based measures. The proposed validity index can also be extended to evaluate other clustering algorithms, such as fuzzy clustering.

Keywords: Clustering, Classification, Decision Tree, K-means.

I. INTRODUCTION

Clustering, an unsupervised learning technique in data mining, categorizes unlabeled objects into clusters such that objects belonging to the same cluster are more similar to each other than to objects belonging to different clusters. Conventional clustering assigns an object to exactly one cluster; a rough-set-based variation makes it possible to assign an object to more than one cluster [3]. Quality of clustering is an important issue in the application of clustering techniques to real-world data. A good measure of cluster quality will help in deciding the various parameters used in clustering algorithms. One such parameter, common to most clustering algorithms, is the number of clusters.

Many different indices of cluster validity have been proposed. In general, they fall into one of three categories. Some validity indices measure partition validity to evaluate the properties of the crisp structure imposed on the data by the clustering algorithm, such as the Dunn index [3] and the Davies-Bouldin index [2]; these are based on a similarity measure of clusters whose bases are the dispersion measure of a cluster and the cluster dissimilarity measure. In the case of fuzzy clustering algorithms, some validity indices, such as the partition coefficient [1] and classification entropy, use only the information of fuzzy membership grades to evaluate clustering results. The third category consists of validity indices that make use not only of the fuzzy membership grades but also of the structure of the data. All these validity indices are essentially based on the geometric characteristics of the clusters.

The decision-theoretic framework has been helpful in providing a better understanding of classification models [4]. The decision-theoretic rough set model considers various classes of loss functions, and by adjusting the loss functions it can be extended to the multi-category problem. It is therefore possible to construct a cluster validity index by considering various loss functions based on decision theory. Such a measure has the added advantage of being applicable to rough-set-based clustering. This work describes how to develop a cluster validity index from the decision-theoretic rough set model. Based on decision theory, the proposed rough cluster validity index is taken as a function of the total risk of grouping objects using a clustering algorithm. Since crisp clustering [5] is a special case of rough clustering, the validity index is applicable to both rough and crisp clustering. Experiments with synthetic and real-world data show the usefulness of the proposed validity index for the evaluation of rough clustering and crisp clustering.

II. CLUSTERING TECHNIQUE

K-means [7] is a prototype-based, simple partitional clustering technique which attempts to find k non-overlapping clusters. These clusters are represented by their centroids (a cluster centroid is typically the mean of the points in the cluster). The clustering process of K-means is as follows. First, k initial centroids are selected, where k is specified by the user and indicates the desired number of clusters. Second, every point in the data is assigned to the closest centroid, and each collection of points assigned to a centroid forms a cluster. The centroid of each cluster is then updated based on the points assigned to it. This process is repeated until no point changes clusters. A minimal sketch of this procedure is given below.
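The following is a minimal sketch of the procedure just described, assuming the data arrive as a NumPy array X of shape (n, d); the initialization by sampling k data points and the convergence test are illustrative choices, not the paper's prescription.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch: X is an (n, d) data matrix, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: select k initial centroids (here, k distinct random data points).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no point changed clusters: the centroids have stabilized
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its assigned points.
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
    return labels, centroids
```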
A. Crisp Clustering Method: The objective of k-means is to assign n objects to k clusters. The process begins by randomly choosing k objects as the centroids of the k clusters. Each object is assigned to one of the k clusters based on the minimum value of the distance d(x, c_i) between the object vector x and the cluster vector c_i; the distance d(x, c_i) can be the standard Euclidean distance. After all the objects have been assigned to clusters, the new centroid vectors of the clusters are calculated as

\vec{c}_i = \frac{\sum_{\vec{x} \in c_i} \vec{x}}{|c_i|}, \qquad 1 \le i \le k

where |c_i| is the cardinality of cluster c_i. The process stops when the centroids of the clusters stabilize, i.e., when the centroid vectors from the previous iteration are identical to those generated in the current iteration.

B. Cluster Validity: A new validity index, conn_index, for prototype-based clustering of data sets is applicable to clusters with a wide variety of characteristics: different shapes, sizes, densities, and even overlaps. Conn_index is based on a weighted Delaunay triangulation called the "connectivity matrix". For crisp clustering, the Davies-Bouldin index and the generalized Dunn index are among the most commonly used indices; they depend on a separation measure between clusters and a distance-based measure of cluster compactness. When the clusters have a homogeneous density distribution, one effective approach to correctly evaluating a clustering is CDbw (composite density between and within clusters) [16]. CDbw finds prototypes for the clusters instead of representing them by their centroids, and calculates the validity measure based on inter- and intra-cluster densities and cluster separation.

C. Compactness of Clusters: Assume K clusters and N prototypes v in a data set, and let C_k and C_l be two different clusters, where 1 ≤ k, l ≤ K. The proposed conn_index is defined with the help of the quantities Intra and Inter, which capture compactness and separation respectively. The compactness of C_k, Intra(C_k), is the ratio of the number of data vectors in C_k whose second BMU (best matching unit) is also in C_k to the number of data vectors in C_k:

Intra\_Conn(C_k) = \frac{\sum_{i,j} \{ CADJ(i,j) : v_i, v_j \in C_k \}}{\sum_{i,j} \{ CADJ(i,j) : v_i \in C_k \}}

with Intra(C_k) ∈ [0, 1]; the greater the value of Intra, the more compact the cluster [1]. If the second BMUs of all data vectors in C_k are also in C_k, then Intra(C_k) = 1. The intra-cluster connectivity of all clusters, Intra, is the average compactness:

Intra = \frac{1}{K} \sum_{k=1}^{K} Intra\_Conn(C_k)

D. Cluster Quality: Several cluster validity indices evaluate the quality of the clusters obtained by different clustering algorithms; an excellent summary of various validity measures can be found in [10]. Two classical cluster validity indices are described below.

1. Davies-Bouldin Index: This index [2] is a function of the ratio of within-cluster scatter to between-cluster separation. The scatter within the ith cluster, S_{i,q}, and the distance between clusters c_i and c_j, d_{ij,t}, are defined as follows:

S_{i,q} = \left( \frac{1}{|c_i|} \sum_{\vec{x} \in c_i} \lVert \vec{x} - \vec{c}_i \rVert^q \right)^{1/q}, \qquad d_{ij,t} = \lVert \vec{c}_i - \vec{c}_j \rVert_t

where \vec{c}_i is the center of the ith cluster and |c_i| is the number of objects in c_i. The integers q and t can be selected independently such that q, t ≥ 1. The Davies-Bouldin index for a clustering scheme (CS) is then defined as

DB(CS) = \frac{1}{k} \sum_{i=1}^{k} R_{i,qt}, \qquad R_{i,qt} = \max_{1 \le j \le k,\, j \ne i} \left\{ \frac{S_{i,q} + S_{j,q}}{d_{ij,t}} \right\}

The Davies-Bouldin index considers the average similarity between each cluster and the cluster most similar to it; a lower value indicates a better clustering scheme. A sketch of this computation follows.
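As an illustration, here is a minimal sketch of the Davies-Bouldin computation with q = t = 2 (Euclidean norms), reusing the labels and centroids produced by the k_means sketch above; the parameter defaults are illustrative.

```python
import numpy as np

def davies_bouldin(X, labels, centroids, q=2, t=2):
    """Davies-Bouldin index sketch: lower values indicate a better clustering."""
    k = len(centroids)
    # Within-cluster scatter S_{i,q} for each cluster.
    S = np.array([
        np.mean(np.linalg.norm(X[labels == i] - centroids[i], axis=1) ** q) ** (1.0 / q)
        for i in range(k)
    ])
    R = np.zeros(k)
    for i in range(k):
        # R_{i,qt}: similarity to the most similar other cluster.
        R[i] = max(
            (S[i] + S[j]) / np.linalg.norm(centroids[i] - centroids[j], ord=t)
            for j in range(k) if j != i
        )
    return R.mean()
```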
2. Dunn Index: Dunn proposed another cluster validity index [3]. The index corresponding to a clustering scheme (CS) is defined by

D(CS) = \min_{1 \le i \le k} \left\{ \min_{1 \le j \le k,\, j \ne i} \left\{ \frac{\delta(c_i, c_j)}{\max_{1 \le q \le k} \Delta(c_q)} \right\} \right\}

where the inter-cluster distance δ and the cluster diameter Δ are

\delta(c_i, c_j) = \min_{\vec{x} \in c_i,\, \vec{y} \in c_j} \lVert \vec{x} - \vec{y} \rVert, \qquad \Delta(c_i) = \max_{\vec{x}, \vec{y} \in c_i} \lVert \vec{x} - \vec{y} \rVert

If a data set is well separated by a clustering scheme, the distances among the clusters, δ(c_i, c_j) (1 ≤ i, j ≤ k), are usually large, and the diameters of the clusters, Δ(c_i) (1 ≤ i ≤ k), are expected to be small. Therefore, a large value of D(CS) corresponds to a good clustering scheme. The main drawback of the Dunn index is that its calculation is computationally expensive and the index is sensitive to noise. A corresponding sketch is given below.
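A minimal sketch of the Dunn index follows; the brute-force pairwise distances make the computational cost noted above explicit.

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn index sketch: larger values indicate compact, well-separated clusters."""
    clusters = [X[labels == i] for i in np.unique(labels)]
    # Denominator: the largest cluster diameter (max pairwise distance within a cluster).
    max_diameter = max(
        max(np.linalg.norm(a - b) for a in c for b in c) for c in clusters
    )
    best = np.inf
    for i in range(len(clusters)):
        for j in range(len(clusters)):
            if i == j:
                continue
            # Inter-cluster distance: minimum pairwise distance between the clusters.
            delta = min(
                np.linalg.norm(a - b) for a in clusters[i] for b in clusters[j]
            )
            best = min(best, delta / max_diameter)
    return best
```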
III. DECISION TREE

A. Decision Tree: A decision tree depicts rules for classifying data into groups. The first rule splits the entire data set into some number of pieces; another rule may then be applied to a piece, and different rules to different pieces, forming a second generation of pieces. The tree depicts the first split as branches emanating from a root, and subsequent splits as branches emanating from nodes on older branches. The leaves of the tree are the final groups, the unsplit nodes. For some perverse reason, trees are always drawn upside down, like an organizational chart. For a tree to be useful, the data in a leaf must be similar with respect to some target measure, so that the tree represents the segregation of a mixture of data into purified groups.

Consider, as an example, data collected on people in a city park in the vicinity of a hotdog and ice cream stand. The owner of the concession stand wants to know what predisposes people to buy ice cream. Among all the people observed, forty percent buy ice cream; this is represented in the root node at the top of the diagram [9]. The first rule splits the data according to the weather: unless it is sunny and hot, only five percent buy ice cream, which is represented in the leaf on the left branch. On sunny and hot days, sixty percent buy ice cream. The tree represents this population as an internal node that is further split into two branches, one of which is split again (a code sketch of this example appears at the end of this section).

Figure 1: Example of a decision tree. Root: 40% buy ice cream; splitting on "sunny and hot" gives 5% (No) and 60% (Yes); the 60% node splits on "have extra money?" into 30% (No) and 80% (Yes); a further split on "crave ice cream" gives 10% (No) and 70% (Yes).

B. Yao's Model for Decision Tree: The model consists of two parties, holding values x ∈ X and y ∈ Y respectively, who can communicate with each other and would like to compute a function f: X × Y → {0, 1} at (x, y) with a minimal amount of interaction. Interaction is some measure of communication between the two parties, usually the total number of bits exchanged. The classification of objects according to the approximation operators of rough set theory can easily be fitted into the Bayesian decision-theoretic framework. Let Ω = {A, A^c} denote the set of states, indicating that an object is in A or not in A, respectively. Let {a1, a2, a3} be the set of actions, where a1, a2, a3 represent the three actions in classifying an object: deciding POS(A), deciding NEG(A), and deciding BND(A), respectively.

C. Implementation of CRISP-DM: CRISP-DM is based on the process flow shown in Figure 2. The model proposes the following steps:

1. Business Understanding – understand the rules and business objectives of the company.
2. Data Understanding – collect and describe the data.
3. Data Preparation – prepare the data for import into the software.
4. Modelling – select the modelling technique to be used.
5. Evaluation – evaluate the process to see whether the technique solves the modelling problem and creates the rules.
6. Deployment – deploy the system and train its users.

Figure 2: Example of CRISP data mining (the CRISP-DM process flow).
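For concreteness, the ice cream example of Figure 1 can be written as nested rules. The percentages come from the figure, while the placement of the "crave ice cream" split (on the no-extra-money branch) is an assumption for illustration.

```python
def buy_probability(sunny_and_hot: bool, extra_money: bool, craves_ice_cream: bool) -> float:
    """Walk the example tree of Figure 1 from the root to a leaf and return the
    fraction of observed people in that leaf who buy ice cream."""
    if not sunny_and_hot:
        return 0.05              # left leaf: only 5% buy ice cream
    # Sunny and hot: 60% buy overall; this internal node is split again.
    if extra_money:
        return 0.80              # leaf: 80% buy ice cream
    # No extra money: 30% buy overall; assumed to split once more.
    return 0.70 if craves_ice_cream else 0.10

print(buy_probability(True, False, True))   # -> 0.7
```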
IV. PROPOSED SYSTEM

Unsupervised classification methods are used when the only data available are unlabeled, and they need to know the number of clusters. A cluster validity measure can provide some information about the appropriate number of clusters. Our solution makes it possible to construct such a cluster validity measure by considering various loss functions based on decision theory.
Figure 3: The proposed system [6]. A dataset is clustered, a decision tree is built, and a loss function is applied to produce the cluster validity measure and the final result.

We choose K-means clustering because (1) it is a data-driven method that makes relatively few assumptions about the distribution of the underlying data, and (2) the greedy search strategy of K-means guarantees at least a local minimum of the criterion function, thereby accelerating the convergence of clusters on large datasets.

A. Cluster Quality on Decision Theory: Unsupervised learning techniques are applied when the only data available are unlabeled, and the algorithms need to know the number of clusters. Cluster validity measures such as Davies-Bouldin can help us assess whether a clustering method accurately presents the structure of the data set, and several cluster indices exist to evaluate crisp and fuzzy clustering. The decision-theoretic framework has been helpful in providing a better understanding of the classification model. The decision-theoretic rough set model considers various classes of loss functions; with its extension to the multi-category case, it is possible to construct a cluster validity measure by considering various loss functions based on decision theory.

Within a given set of objects there may be clusters such that objects in the same cluster are more similar than those in different clusters; clustering aims to find the right groups for the given set of objects. Finding the right clusters requires an exponential number of comparisons and has been proved to be NP-hard. To define the framework, we assume that a clustering scheme partitions a set of objects X = {x1, …, xn} into clusters CS = {c1, …, ck}; the k-means algorithm approximates the actual clustering. Each object may not necessarily belong to only one cluster. However, corresponding to each cluster within the clustering scheme, the centroid of the hypothetical core will be used as the cluster core. Let core(ci) be the core of cluster ci, which is used to calculate the centroid of the cluster. Any x ∈ core(ci) cannot belong to other clusters; therefore, core(ci) can be considered the best representation of ci to a certain extent. A sketch of a risk-based quality measure along these lines is given below.
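This excerpt does not spell out the risk formula, so the following is only a minimal sketch of the idea under stated assumptions: each object receives the action (positive, boundary, or negative region) with the smallest expected loss, a user-supplied loss table prices each action in each state, and the validity index is the total risk. The loss values and membership probabilities here are illustrative, not the paper's.

```python
# Illustrative loss table: LOSS[action][state] is the cost of taking an action
# when the object truly belongs ("in") or does not belong ("out") to the cluster.
# The numbers are assumptions; in practice they could encode monetary costs and
# benefits, e.g. of targeting a customer in a promotional campaign.
LOSS = {
    "positive": {"in": 0.0, "out": 4.0},   # deciding POS(A)
    "boundary": {"in": 1.0, "out": 1.0},   # deciding BND(A)
    "negative": {"in": 5.0, "out": 0.0},   # deciding NEG(A)
}

def object_risk(p_in, action):
    """Expected loss of one action, given P(object belongs to the cluster)."""
    return p_in * LOSS[action]["in"] + (1.0 - p_in) * LOSS[action]["out"]

def total_risk(memberships):
    """Sketch of a decision-theoretic validity index: the total risk of taking
    the best (minimum expected loss) action for every object. Lower is better."""
    return sum(
        min(object_risk(p_in, action) for action in LOSS)
        for p_in in memberships
    )

# Usage: membership probabilities could be derived from distances to centroids.
print(total_risk([0.95, 0.60, 0.10]))   # -> 1.7 with the illustrative losses
```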
B. Comparison of Clustering and Classification: Clustering works well for finding unlabeled clusters in small to large data sets. The K-means algorithm has a favorable execution time, but the user has to know in advance how many clusters are to be searched; K-means is data driven and is efficient for smaller data sets and anomaly detection. Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster (see the sketch at the end of this section). Clustering requires the distance between every pair of objects to be computed only once, and uses the distances at every stage of iteration. Compared to clustering [8], classification algorithms perform efficiently on complex datasets and for noise and outlier detection; algorithm designers have had much success with equal-width and equal-depth discretization approaches to building class descriptions. We chose decision tree learners, made popular by ID3, C4.5 and CART, because they are relatively fast and typically produce competitive classifiers. In fact, the decision tree generator C4.5, a successor to ID3, has become a standard basis for comparison in machine learning research, because it produces good classifiers quickly. For non-numeric datasets, the growth of the run time of ID3 (and C4.5) is linear in the number of examples.

The practical run-time complexity of C4.5 has been determined empirically to be worse than O(e^2) on some datasets. One possible explanation is based on the observation of Oates and Jensen (1998) that the size of C4.5 trees increases linearly with the number of examples. One of the factors in C4.5's run-time complexity corresponds to the tree depth, which cannot be larger than the number of attributes. Tree depth is related to tree size, and thereby to the number of examples. When compared with C4.5, the run-time complexity of CART is satisfactory.
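A short sketch of the medoid mentioned above: the most centrally located object of a cluster is the point with the smallest summed distance to all other points in the cluster. The use of Euclidean distance here is an illustrative choice.

```python
import numpy as np

def medoid(cluster_points):
    """Return the most centrally located object of a cluster: the data point
    whose summed Euclidean distance to all other points is minimal."""
    diffs = cluster_points[:, None, :] - cluster_points[None, :, :]
    dist = np.linalg.norm(diffs, axis=2)      # pairwise distances, computed once
    return cluster_points[dist.sum(axis=1).argmin()]

# Unlike the mean, the medoid is always an actual data point and is less
# sensitive to the outlier at (5, 5).
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
print(medoid(pts))   # -> [1. 0.]
```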
V. CONCLUSION

This paper proposes a cluster quality index based on decision theory; the proposal uses a loss function to construct the quality index, so the cluster quality is evaluated by considering the total risk of classifying all the objects. Such a decision-theoretic representation of cluster quality may be more useful in business-oriented data mining than traditional geometry-based cluster quality measures. In addition to evaluating crisp clustering, the proposal is an evaluation measure for rough clustering. This is the first measure that takes into account the special features of rough clustering that allow an object to belong to more than one cluster. The measure is shown to be useful in determining important aspects of a clustering exercise, such as the appropriate number of clusters and the size of the boundary region. The application of the measure to synthetic data with a known number of clusters and boundary region lends credence to the proposal.

A real advantage of the decision-theoretic cluster validity measure is its ability to include monetary considerations in evaluating a clustering scheme. Use of the measure to derive an appropriate clustering scheme for a promotional campaign in a retail store highlighted its unique ability to include cost and benefit considerations in commercial data mining. We can also extend it to evaluate other clustering algorithms, such as fuzzy clustering. Such a cluster validity measure can be useful in further theoretical development in clustering.

VI. REFERENCES

[1] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, 1981.
[2] D.L. Davies and D.W. Bouldin, "A Cluster Separation Measure," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 224-227, Apr. 1979.
[3] J.C. Dunn, "Well Separated Clusters and Optimal Fuzzy Partitions," J. Cybernetics, vol. 4, pp. 95-104, 1974.
[4] S. Hirano and S. Tsumoto, "On Constructing Clusters from Non-Euclidean Dissimilarity Matrix by Using Rough Clustering," Proc. Japanese Soc. for Artificial Intelligence (JSAI) Workshops, pp. 5-16, 2005.
[5] T.B. Ho and N.B. Nguyen, "Nonhierarchical Document Clustering by a Tolerance Rough Set Model," Int'l J. Intelligent Systems, vol. 17, no. 2, pp. 199-212, 2002.
[6] P. Lingras, M. Chen, and D. Miao, "Rough Cluster Quality Index Based on Decision Theory," IEEE Trans. Knowledge and Data Engineering, vol. 21, no. 7, July 2009.
[7] W. Pedrycz and J. Waletzky, "Fuzzy Clustering with Partial Supervision," IEEE Trans. Systems, Man, and Cybernetics, vol. 27, no. 5, pp. 787-795, Sept. 1997.
[8] "Partition Algorithms – A Study and Emergence of Mining Projected Clusters in High-Dimensional Dataset," Int'l J. Computer Science and Telecommunications, vol. 2, issue 4, July 2011.
[9] D.D. Jensen and P.R. Cohen, "Multiple Comparisons in Induction Algorithms," Machine Learning, 1999 (to appear). http://www.cs.umass.edu/~jensen/papers
