International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 6, November-December (2013), pp. 78-82, © IAEME: www.iaeme.com/ijcet.asp, Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com

DIVISIVE HIERARCHICAL CLUSTERING USING PARTITIONING METHODS

Megha Gupta, M.Tech Scholar, Computer Science & Engineering, Arya College of Engg & IT, Jaipur, Rajasthan, India
Vishal Shrivastava, Professor, Computer Science & Engineering, Arya College of Engg & IT, Jaipur, Rajasthan, India

ABSTRACT

Clustering is the process of partitioning a set of data into subsets, so that similar data objects are grouped together while dissimilar objects fall into separate groups. Clustering can be performed with several families of methods, such as partitioning methods, hierarchical methods, and density-based methods. A hierarchical method creates a hierarchical decomposition of the given set of data objects: in successive iterations a cluster is split into smaller clusters, until eventually each object is in its own cluster or a termination condition holds. In this paper, a partitioning method is combined with a hierarchical method, using several algorithms, to form better and more refined clusters.

Keywords: Clustering, Hierarchical, Partitioning Methods.

I. INTRODUCTION

Data mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data in databases. While data mining and KDD are frequently treated as synonyms, data mining is actually one step of the knowledge discovery process.
The Knowledge Discovery in Databases process comprises a series of steps leading from raw data collections to some form of new knowledge [1]. The iterative process consists of the following steps:

• Data cleaning: also known as data cleansing, a phase in which noisy and irrelevant data are removed from the collection.
• Data integration: at this stage, multiple data sources, often heterogeneous, may be combined into a common source.
• Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the data collection.
• Data transformation: also known as data consolidation, a phase in which the selected data is transformed into forms appropriate for the mining procedure.
• Data mining: the crucial step in which clever techniques are applied to extract potentially useful patterns.
• Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified based on given measures.
• Knowledge representation: the final phase, in which the discovered knowledge is visually presented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.

Clustering is the organization of data into classes. However, unlike classification, in clustering the class labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering is also called unsupervised classification, because the classification is not dictated by given class labels. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity) [2].

II. RELATED WORK

Hierarchical Clustering for Data Mining

Hierarchical methods for supervised and unsupervised data mining give a multilevel description of data. They are relevant for many applications related to information extraction, retrieval, navigation, and organization. Two interpretation techniques have been used for describing the clusters:
1. Listing prototypical data examples from the cluster.
2. Listing typical features associated with the cluster.

The Generalizable Gaussian Mixture model (GGM) and the Soft Generalizable Gaussian Mixture model (SGGM) are addressed for supervised and unsupervised learning. These two models estimate the parameters of the Gaussian clusters with a modified EM procedure operating on two disjoint sets of observations, which helps ensure high generalization ability [3].

Procedure

The agglomerative clustering scheme starts with the k clusters at level j = 1 given by the optimized GGM model of p(x). At each higher level in the hierarchy, two clusters are merged based on a similarity measure between pairs of clusters. The procedure is repeated until the top level is reached: there are k clusters at level j = 1 and one cluster at the final level, j = 2k - 1. The natural distance measure between the cluster densities is the Kullback-Leibler (KL) divergence, since it reflects dissimilarity between the densities in probability space.
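The merging step above can be sketched in code. The following is a minimal NumPy illustration, not the paper's GGM/EM implementation: it assumes cluster means and covariances have already been fitted, symmetrizes the KL divergence so it can serve as a pairwise distance, and merges clusters by equal-weight moment matching. All of these are illustrative choices.

```python
import numpy as np

def kl_gaussian(mu0, cov0, mu1, cov1):
    """Closed-form KL(N0 || N1) between two multivariate Gaussians."""
    d = mu0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def symmetric_kl(a, b):
    """Symmetrized KL between two (mean, covariance) cluster densities."""
    return kl_gaussian(*a, *b) + kl_gaussian(*b, *a)

def merge_order(clusters):
    """Greedily merge the KL-closest pair of Gaussian clusters until one
    remains; returns the sequence of merged index pairs (indices refer to
    the current working list at the time of each merge)."""
    clusters = list(clusters)
    order = []
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: symmetric_kl(clusters[p[0]],
                                                     clusters[p[1]]))
        (mu_i, cov_i), (mu_j, cov_j) = clusters[i], clusters[j]
        mu = 0.5 * (mu_i + mu_j)
        # equal-weight moment matching for the merged component
        cov = 0.5 * (cov_i + cov_j) + 0.25 * np.outer(mu_i - mu_j, mu_i - mu_j)
        order.append((i, j))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((mu, cov))
    return order
```

With k starting clusters, the loop performs k - 1 merges, matching the 2k - 1 levels of the hierarchy described above.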
Limitations

A drawback is that the KL divergence has an analytical expression only for the first level in the hierarchy; the distances for subsequent levels have to be approximated.

Automatically Labeling Hierarchical Clusters

A simple algorithm has been used that automatically assigns labels to hierarchical clusters. The algorithm evaluates candidate labels using information from the cluster, the parent cluster, and corpus statistics. A trainable threshold enables the algorithm to assign just a few high-quality labels to each cluster. It is assumed that the algorithm has access to a general collection of documents E, representing the word distribution in general English; this English corpus is used in selecting label candidates [4].

Procedure

Given a cluster S and its parent cluster P, which includes all documents in S and in the sibling clusters of S, the algorithm selects labels for the cluster with the following steps:
1. Collect phrase statistics: for every unigram, bigram, and trigram phrase p occurring in the cluster S, calculate the document-frequency and term-frequency statistics for the cluster, the parent cluster, and the general English corpus.
2. Select label candidates: select label candidates from the unigram, bigram, and trigram phrases based on document frequency in the cluster and in general English.
3. Calculate the descriptive score: calculate a descriptive score for each label candidate, and then sort the candidates by these scores.
4. Calculate the cutoff point: decide how many label candidates to display based on the descriptive scores.

Limitation

Most errors come from clusters containing small numbers of documents.
With so few observations, small clusters can make good and bad labels statistically indistinguishable; minor variations in vocabulary can also produce statistical features with high variance.

Fast Hierarchical Clustering and Other Applications of Dynamic Closest Pairs

This work presents data structures for dynamic closest-pair problems with arbitrary distance functions, based on a technique previously used for Euclidean closest pairs. It shows how to insert or delete an object in an n-object set while maintaining the closest pair, in O(n log² n) time per update and O(n) space. The purpose of the paper is to show that much better bounds are possible using data structures that are simple; when linear space is required, this represents an order-of-magnitude improvement.

Procedure

The data structure consists of a partition of the dynamic set S into k ≤ log n subsets S1, S2, …, Sk, together with a digraph Gi for each set Si. Initially all points are in S1 and G1 has n - 1 edges. Gi may contain edges with neither endpoint in Si; if the number of edges in all graphs grows to 2n, the data structure is rebuilt by moving all points to S1 and recomputing G1. The closest pair is represented by an edge in some Gi, so the pair can be found by scanning the edges in all graphs [5].

Create Gi for a new partition Si: initially, Gi consists of a single path. Choose the first vertex of the path to be any object in Si, then extend the path one edge at a time. When the last vertex in the path P is in Si, choose the next vertex to be its nearest neighbor in S \ P, and when the last vertex is in S \ Si, choose the next vertex to be its nearest neighbor in Si \ P. Continue until the path can no longer be extended because S \ P or Si \ P is empty.
Merge partitions: the update operations can cause k to become too large relative to n. If so, choose subsets Si and Sj as close to equal in size as possible, with |Si| ≤ |Sj| and |Sj| / |Si| minimized; merge these two subsets into one and create the graph for the merged subset.

To insert x: create a new subset Sk+1 = {x} in the partition of S, create Gk+1, and merge partitions as necessary until k ≤ log n.

To delete x: create a new subset Sk+1 consisting of all objects y such that (y, x) is a directed edge in some Gi. Remove x and all its adjacent edges from all the graphs Gi. Create the graph Gk+1 for Sk+1, and merge partitions as necessary until k ≤ log n.

Theorem: the data structure above maintains the closest pair in S in O(n) space, amortized time O(n log n) per insertion, and amortized time O(n log² n) per deletion.

Limitations

The methods tested involve sequential scans through memory, a behavior known to reduce the effectiveness of cache memory.

Motivation

Using hierarchical clustering, better clusters can be formed: the resulting clusters are more clearly separated, with tight bonding within each cluster. That is, the clusters formed are refined using the various algorithms of hierarchical clustering.

III. PROBLEM STATEMENT

The objective of the proposed work is to perform hierarchical clustering to obtain more refined clusters with strong relationships between members of the same cluster.

IV. PROPOSED APPROACH

In this paper, we have used the K-means algorithm and the CFNG to find better, improved clusters.

K-means Algorithm

Suppose a data set D contains n objects in Euclidean space. Partitioning methods distribute the objects into k clusters, C1, …, Ck, such that Ci ⊂ D and Ci ∩ Cj = Ø for 1 ≤ i, j ≤ k, i ≠ j.
An objective function is used to assess the partitioning quality, so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. That is, the objective function aims for high intra-cluster similarity and low inter-cluster similarity [6].

A centroid-based partitioning technique uses the centroid of a cluster Ci to represent that cluster. The centroid of a cluster is its center point; it can be defined in various ways, such as by the mean or the medoid of the objects assigned to the cluster. The difference between an object p and the cluster representative Ci is measured by dist(p, Ci), where dist(x, y) is the Euclidean distance between two points x and y.

CFNG

The colored farthest-neighbor graph (CFNG) shares many characteristics with the shared farthest neighbors (SFN) approach of Rovetta and Masulli [7]. The CFNG algorithm yields binary partitions of objects into subsets, whereas the number of subsets obtained by SFN can vary. The SFN algorithm can easily split a cluster where no natural partition exists, while the CFNG often avoids such splits.
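A divisive scheme built from these pieces can be sketched as follows. This is a minimal illustration, not the implementation evaluated in this paper: the CFNG construction is not reproduced here, and as a rough stand-in for its farthest-neighbor idea each binary split is a 2-means bisection seeded with the two mutually farthest points. All function names and parameters are illustrative.

```python
import numpy as np

def bisect(points, iters=20):
    """Split one cluster in two with 2-means, seeded by the farthest pair
    (a loose stand-in for CFNG's farthest-neighbor idea, not the real CFNG)."""
    # pairwise squared Euclidean distances; seed centers at the farthest pair
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    i, j = np.unravel_index(np.argmax(d2), d2.shape)
    centers = points[[i, j]].astype(float)
    for _ in range(iters):
        # assign each point to its nearest center, then recompute the centers
        labels = ((points[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for c in (0, 1):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return points[labels == 0], points[labels == 1]

def divisive(points, min_size=2, depth=3):
    """Recursively bisect until clusters are small, depth is exhausted,
    or a split degenerates (one side empty)."""
    if depth == 0 or len(points) <= min_size:
        return [points]
    left, right = bisect(points)
    if len(left) == 0 or len(right) == 0:
        return [points]  # no natural partition found; keep the cluster whole
    return divisive(left, min_size, depth - 1) + divisive(right, min_size, depth - 1)
```

On two well-separated blobs, `divisive` recovers the two blobs; the degenerate-split guard loosely mirrors the CFNG's tendency, noted above, to avoid splitting clusters that have no natural partition.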
V. RESULTS

To observe the effect of hierarchical clustering, the K-means and CFNG algorithms are used together; to observe the results, the experimental setup was implemented using Java and MySQL. The obtained results are compared with K-means and CFNG executed individually.

Figure 1: Comparison of the proposed algorithm with K-means and CFNG

VI. CONCLUSION AND FUTURE SCOPE

We have obtained better, improved clusters by using the K-means and CFNG algorithms hierarchically. The final clusters obtained are tightly bonded. In this paper, we have used two different algorithms for hierarchical clustering; instead of CFNG, other hierarchical clustering algorithms could also be used.

REFERENCES

[1] Osmar R. Zaïane, "Principles of Knowledge Discovery in Databases", 1999.
[2] Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and Techniques".
[3] Osmar R. Zaïane, "Principles of Knowledge Discovery in Databases", 1999.
[4] Pucktada Treeratpituk, Jamie Callan, "Automatically Labeling Hierarchical Clusters".
[5] David Eppstein, "Fast Hierarchical Clustering and Other Applications of Dynamic Closest Pairs".
[6] C. G. Plaxton, "Approximation Algorithms for Hierarchical Location Problems", Proceedings of the 35th ACM Symposium on Theory of Computing, 2003.
[7] A. Borodin, R. Ostrovsky and Y. Rabani, "Subquadratic Approximation Algorithms for Clustering Problems in High Dimensional Spaces", Proceedings of the 31st ACM Symposium on Theory of Computing, 1999.
[8] Rinal H. Doshi, Harshad B. Bhadka and Richa Mehta, "Development of Pattern Knowledge Discovery Framework using Clustering Data Mining Algorithm", International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 101-112.
[9] Deepika Khurana and M. P. S. Bhatia, "Dynamic Approach to K-Means Clustering Algorithm", International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 204-219.
[10] Meghana N. Ingole, M. S. Bewoor, S. H. Patil, "Context Sensitive Text Summarization using Hierarchical Clustering Algorithm", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 322-329.