
Mr. Santosh D. Rokade, Mr. A. M. Bainwad / International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 3, Issue 3, May-Jun 2013, pp. 504-508

Survey of Combined Clustering Approaches

Mr. Santosh D. Rokade*, Mr. A. M. Bainwad**
* (M.Tech Student, CSE Department, SGGS IE&T, Nanded, India)
** (Assistant Professor, CSE Department, SGGS IE&T, Nanded, India)

ABSTRACT
Data clustering is a very important technique and plays an important role in many areas such as data mining, pattern recognition, machine learning, and bioinformatics. Many clustering approaches with different quality-complexity tradeoffs have been proposed in the literature. Each clustering algorithm has its own characteristics and works on its own domain space, with no single solution that is optimal for datasets of different properties, sizes, structures, and distributions. Combining multiple clusterings is considered a new advance in the area of data clustering. In this paper, different combined clustering algorithms are discussed. Combined clustering is based on the level of cooperation between the clustering algorithms: they cooperate either at the intermediate level or at the end-result level. The goal of cooperation among multiple clustering techniques is to increase the homogeneity of objects within the clusters.

Keywords - Chameleon, Ensemble clustering, Generic and kernel-based ensemble clustering, Hybrid clustering

1. Introduction
The goal of clustering is to group the data points or objects that are close or similar to each other and to identify such groupings in an unsupervised manner; unsupervised in the sense that no information is provided to the algorithm about which data point belongs to which cluster. In other words, data clustering is a data analysis technique that enables the abstraction of large amounts of data by forming meaningful groups or categories of objects; these groups are formally known as clusters.
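A minimal sketch of such unsupervised grouping (the toy data and the choice of k-means, one of the algorithms combined later in this survey, are illustrative assumptions, not material from the paper):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal Lloyd's k-means: alternate point assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        # Recompute each centroid as the mean of its members (guard empty clusters).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Two well-separated 2-D blobs: no cluster labels are given to the algorithm,
# yet objects that are close end up grouped together.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
labels, centroids = kmeans(X, k=2)
```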
This grouping is done in such a way that objects in the same cluster are similar to each other, and those in different clusters are dissimilar, according to some similarity measure or criterion. The increasing importance of data clustering in different areas has led to the development of a variety of algorithms. These algorithms differ in many aspects, such as the similarity measure used, the types of attributes they use to characterize the dataset, and the representation of the clusters. Combining different clustering algorithms is a new line of progress in document clustering that aims to improve on the results of the individual algorithms. The cooperation is performed at the end-result level or at the intermediate level. Examples of end-result cooperation are the ensemble clustering and hybrid clustering approaches [3-9, 11, 12]. The cooperative clustering model is an example of intermediate-level cooperation [13]. Sometimes k-means and agglomerative hierarchical approaches are combined so as to get better results than either individual algorithm. For example, in the document domain, Scatter/Gather [1], a document browsing system based on clustering, uses a hybrid approach involving both k-means and agglomerative hierarchical clustering. K-means is used because of its run-time efficiency and agglomerative hierarchical clustering because of its quality [2]. A survey of different ensemble clustering and hybrid clustering approaches is given in the following sections of this paper.

2. Ensemble Clustering
The concept of the cluster ensemble was introduced in [3] by A. Strehl and J. Ghosh. In that paper, they introduce the problem of combining multiple partitionings of a set of objects without accessing the original features, which they call the cluster ensemble problem. The cluster ensemble problem is then formalized as a combinatorial optimization problem in terms of shared mutual information.
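This shared-information objective can be sketched as follows: a candidate consensus clustering is scored by its average normalized mutual information (NMI) with the ensemble members, and the cluster ensemble problem seeks the labeling that maximizes this score. The toy labelings below are illustrative assumptions; NMI here uses the square-root normalization of [3]:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * np.log(c / n) for c in Counter(labels).values())

def mutual_info(a, b):
    # Mutual information of two labelings from their joint label counts.
    n = len(a)
    mi = 0.0
    for (i, j), c in Counter(zip(a, b)).items():
        p_ij = c / n
        p_i = sum(1 for x in a if x == i) / n
        p_j = sum(1 for x in b if x == j) / n
        mi += p_ij * np.log(p_ij / (p_i * p_j))
    return mi

def nmi(a, b):
    # NMI with sqrt normalization; equals 1 for identical partitionings
    # up to a relabeling of the clusters.
    h = np.sqrt(entropy(a) * entropy(b))
    return mutual_info(a, b) / h if h > 0 else 0.0

def avg_nmi(consensus, members):
    # The quantity the cluster ensemble problem maximizes: average
    # shared information between the consensus and the members.
    return sum(nmi(consensus, m) for m in members) / len(members)

members = [[0, 0, 1, 1, 2, 2],
           [0, 0, 0, 1, 2, 2],   # slightly perturbed partitioning
           [1, 1, 0, 0, 2, 2]]   # same structure, clusters relabelled
good = [0, 0, 1, 1, 2, 2]        # agrees with the members' structure
bad = [0, 1, 2, 0, 1, 2]         # cuts across the members' clusters
```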
In addition to a direct maximization approach, they propose three effective and efficient techniques for obtaining high-quality combiners or consensus functions. The first combiner induces a similarity measure from the partitionings and then reclusters the objects. The second combiner is based on hypergraph partitioning. The third one collapses groups of clusters into meta-clusters, which then compete for each object to determine the combined clustering. Unlike in classification or regression settings, very few approaches have been proposed for combining multiple clusterings. Bradley and Fayyad [4] in 1998 proposed an approach for combining multiple clusterings; they combine the results of several clusterings of a given dataset, where each solution resides in a common known feature space, for example combining multiple sets of cluster centers obtained by running k-means with different initializations. According to D. Greene and P. Cunningham, recent techniques for ensemble clustering are effective in
improving the accuracy and stability of standard clustering algorithms, though these techniques have the drawback of the computational cost of generating and combining multiple clusterings of the data. D. Greene and P. Cunningham proposed efficient ensemble methods for document clustering; in particular, they present an efficient kernel-based ensemble clustering method suitable for application to large, high-dimensional datasets [5].

2.1 Generic Ensemble Clustering
Ensemble clustering is based on the idea of combining multiple clusterings of a given dataset to produce a better combined or aggregated solution. The general process followed by these techniques is given in Fig. 1, which has two different phases.

Fig. 1 Generic ensemble clustering process.

Phase 1. Generation: Construct a collection of τ base clustering solutions, which represent the members of the ensemble. This is typically done by repeatedly applying a given clustering algorithm in a manner that leads to diversity among the members.

Phase 2. Integration: Once a collection of ensemble members has been generated, a suitable integration function is applied to combine them and produce a final "consensus" clustering.

2.2 Kernel-based Ensemble Clustering
In order to avoid repeated recomputation of similarity values in the original feature space, D. Greene and P. Cunningham choose to represent the data in the form of an n × n kernel matrix K, whose entries indicate the degree of resemblance between pairs of objects. The main advantage of using kernel methods in ensemble clustering is that, after constructing a single kernel matrix, we may subsequently generate multiple partitions without using the original data.
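A rough sketch of this idea (the RBF kernel choice and the toy data are assumptions, not details from [5]): build the kernel matrix once, then run a kernel k-means that touches only K, so any number of partitions can be generated afterwards without the original features:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Build the n x n kernel matrix once; afterwards the original
    # features X are never needed again.
    sq = ((X[:, None] - X[None]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_kmeans(K, k, iters=100, seed=0, init=None):
    """Kernel k-means driven only by the kernel matrix K.
    Squared distance from object i to the implicit centroid of cluster c:
        K[i,i] - (2/|c|) * sum_{j in c} K[i,j] + (1/|c|^2) * sum_{j,l in c} K[j,l]
    """
    n = len(K)
    rng = np.random.default_rng(seed)
    labels = np.array(init) if init is not None else rng.integers(0, k, size=n)
    for _ in range(iters):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            if len(idx) == 0:
                continue  # empty cluster: leave its distance at infinity
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        new = dist.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [3.0, 3.0], [3.1, 3.1], [2.9, 3.0]])
K = rbf_kernel(X, gamma=1.0)
# Many partitions can now be generated from the same K, e.g. for an ensemble:
partitions = [kernel_kmeans(K, k=2, seed=s) for s in range(3)]
```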
In [5], Greene and Cunningham proposed kernel-based correspondence clustering with prototype reduction, which produces more stable results than other schemes, such as those based on pair-wise co-assignments, which are highly sensitive to the choice of the final clustering algorithm. The kernel-based correspondence clustering algorithm is described as follows:
1) Construct the full kernel matrix K and set the counter t = 0.
2) Increment t and generate a base clustering: produce a subsample without replacement, apply adjusted kernel k-means with random initialization to the sampled objects, and assign each out-of-sample object to its nearest centroid.
3) If t = 1, initialize V as the binary membership matrix for the first base clustering. Otherwise, update V as follows: compute the current consensus clustering C from V; find the optimal correspondence between the clusters of the new base clustering and those of C; and, for each object assigned to the jth cluster of the base clustering, increment the corresponding entry of V.
4) Repeat from Step 2 until the consensus clustering is stable or a maximum number of iterations is reached.
5) Return the final consensus clustering C.

2.3 Ensemble Clustering with Kernel Reduction
The ensemble clustering approach introduced by D. Greene and P. Cunningham [5] allows each base clustering to be generated without referring back to the original feature space but, for larger datasets, the computational cost of repeatedly applying an algorithm is very high. To reduce this computational cost, the value of n should be reduced; the ensemble process then becomes less computationally expensive. Greene and Cunningham [5] showed that the principles underlying the kernel-based prototype reduction technique may also be used to improve the efficiency of ensemble clustering. The proposed technique is performed in three main steps: applying prototype reduction, performing correspondence clustering on the reduced representation, and subsequently mapping the resulting aggregate solution back to the original data. The entire process is illustrated in Fig. 2.

Fig. 2 Ensemble clustering process with prototype reduction.

The entire ensemble process with prototype reduction is summarized in the following algorithm.
1) Construct the full n × n kernel matrix K from the original data X.
2) Apply prototype reduction to form the reduced kernel matrix K'.
3) Apply kernel-based correspondence clustering using K', as given in the kernel-based correspondence clustering algorithm, to produce a consensus clustering C'.
4) Construct a full clustering C by assigning a cluster label to each object based on its nearest cluster in C'.
5) Apply adjusted kernel k-means using C as an initial partition to produce a refined final clustering of X.

Recent ensemble clustering techniques have been shown to be effective in improving the accuracy and stability of standard clustering algorithms, but computational complexity is the main drawback of ensemble clustering techniques.

3. Hybrid Clustering
To improve the performance and efficiency of algorithms, several clustering methods have been proposed that combine the features of hierarchical and partitional clustering algorithms. In general, these algorithms first partition the input data set into m subclusters, and then a new hierarchical structure is constructed based on these m subclusters.

This idea of hybrid clustering was first proposed in [6], where a multilevel algorithm is developed. N. M. Murty and G. Krishna described a hybrid clustering algorithm based on the concepts of multilevel theory, which is nonhierarchical at the first level and hierarchical from the second level onwards, to cluster data sets having chain-like and concentric clusters. They observed that this hybrid clustering algorithm gives the same results as the hierarchical clustering algorithm with lower computation and storage requirements. At the first level, the multilevel algorithm partitions the data set into several partitions and then performs the k-means algorithm on each partition to obtain several subclusters.
In subsequent levels, the algorithm uses the centroids of the subclusters identified in the previous level as the new input data points and performs hierarchical clustering on those points. This process continues until exactly k clusters are determined. Finally, the algorithm performs a top-down pass to reassign all points of each subcluster to the cluster of their centroids [7].

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is another hybrid clustering algorithm, designed to deal with large input data sets [8, 9]. The BIRCH algorithm introduces two important concepts, the clustering feature and the clustering feature tree, which are used to summarize cluster representations. These structures help the clustering method achieve good speed and scalability in large databases and also make it effective for incremental and dynamic clustering of incoming objects. A clustering feature (CF) is a three-dimensional vector summarizing information about a cluster of objects. BIRCH uses a CF to represent a subcluster. Given the CF of a subcluster, we can easily obtain the centroid, radius, and diameter of that subcluster. The CF vector of a cluster formed by merging two subclusters can be derived directly from the CF vectors of the two subclusters by simple algebra. A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering. The non-leaf nodes store the sums of the CFs of their children and thus summarize the clustering information about their children.

The BIRCH algorithm consists of four phases. In Phase 1, BIRCH scans the database to build an initial in-memory CF tree, which can be viewed as a multilevel compression of the data that tries to preserve its inherent clustering structure; in this way BIRCH partitions the input data set into many subclusters organized in a CF tree. In Phase 2, it reduces the size of the CF tree, that is, the number of subclusters, in order to apply a global clustering algorithm in Phase 3 on those generated subclusters.
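The clustering-feature idea can be sketched as follows, with CF = (N, LS, SS) as in [8]: the point count, the linear sum of the points, and the sum of their squared norms. The additivity of CFs is what makes merging subclusters cheap; the toy points are illustrative assumptions:

```python
import numpy as np

class CF:
    """BIRCH clustering feature (N, LS, SS) summarizing a subcluster."""
    def __init__(self, n, ls, ss):
        self.n, self.ls, self.ss = n, np.asarray(ls, float), float(ss)

    @classmethod
    def from_points(cls, pts):
        pts = np.asarray(pts, float)
        return cls(len(pts), pts.sum(axis=0), (pts ** 2).sum())

    def merge(self, other):
        # CF additivity: merging two subclusters is component-wise addition,
        # so the merged summary never requires revisiting the raw points.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Root-mean-square distance of the members from the centroid,
        # derived from the summary alone.
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - (c ** 2).sum(), 0.0))

a = CF.from_points([[0, 0], [2, 0]])
b = CF.from_points([[0, 2], [2, 2]])
m = a.merge(b)   # identical to the CF built from all four points at once
```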
In Phase 4, each point in the data set is redistributed to the closest centroid of the clusters produced in Phase 3. Among these phases, Phase 2 and Phase 4 serve to further improve the clustering quality and are therefore optional. BIRCH tries to produce the best clusters with the available resources, given a limited amount of main memory, and an important consideration is to minimize the time required for input/output. BIRCH applies a multiphase clustering technique: a single scan of the data set yields a basic good clustering, and one or more additional scans can be used to further improve its quality [9].

G. Karypis, E. H. Han, and V. Kumar in [10] proposed another hybrid clustering algorithm, named CHAMELEON. Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine the similarity between pairs of clusters. Its key feature is that it considers both interconnectivity and closeness in identifying the most similar pair of clusters, where interconnectivity is the number of links between two clusters and closeness is the length of those links. The algorithm proceeds as follows and is summarized in Fig. 3:
1) Construct a k-nearest-neighbour graph.
2) Partition the k-nearest-neighbour graph into many small subclusters.
3) Merge those subclusters into the final clustering results.
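The first of these steps can be sketched as below; the similarity weighting is an assumption, and steps 2 and 3 additionally require a graph partitioner (such as METIS) and a dynamic-modeling merge criterion, which are beyond a short sketch:

```python
import numpy as np

def knn_graph(X, k, gamma=1.0):
    """Step 1 of Chameleon: a sparse k-nearest-neighbour graph.
    Each vertex is a data object; an edge (i, j) exists iff j is among
    the k nearest neighbours of i (or vice versa), and edges are
    weighted by similarity (here an RBF of squared distance, assumed)."""
    X = np.asarray(X, float)
    n = len(X)
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        # k nearest neighbours of i, excluding i itself.
        nn = np.argsort(d2[i])[1:k + 1]
        W[i, nn] = np.exp(-gamma * d2[i, nn])
    # Symmetrize: keep an edge if either endpoint selected the other.
    return np.maximum(W, W.T)

X = [[0, 0], [0.1, 0], [0.2, 0.1], [5, 5], [5.1, 5], [5, 5.2]]
W = knn_graph(X, k=2)
```

Because distant objects never enter each other's neighbour lists, the graph stays sparse and naturally separates regions of different density, which is what the later partitioning step exploits.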
Fig. 3 Chameleon algorithm.

Chameleon uses a k-nearest-neighbour graph approach to construct a sparse graph. Each vertex of this graph represents a data object, and there exists an edge between two vertices if one object is among the k most similar objects of the other. The edges are weighted to reflect the similarity between objects. Chameleon uses a graph partitioning algorithm to partition the k-nearest-neighbour graph into a large number of relatively small subclusters. It then uses an agglomerative hierarchical clustering algorithm that repeatedly merges subclusters based on their similarity; to determine the most similar pairs of subclusters, it takes into account both the interconnectivity and the closeness of the clusters [9].

Zhao and Karypis in [11] showed that the hybrid model of bisecting k-means (BKM) and k-means (KM) clustering produces better results than BKM or KM individually. BKM [12] is a variant of KM clustering that produces either a partitional or a hierarchical clustering by recursively applying the basic KM method. It starts by considering the whole dataset to be one cluster. At each step, one cluster V is selected and bisected into two partitions V1 and V2 using the basic KM algorithm. This process continues until the desired number of clusters, or some other specified stopping condition, is reached. There are a number of different ways to choose which cluster to split: for example, the largest cluster at each step, the one with the least overall similarity, or a criterion that combines both size and overall similarity. This bisecting approach is very attractive in many applications, such as document retrieval, document indexing, and gene expression analysis, as it is based on a homogeneity criterion.
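A minimal sketch of BKM with the "largest cluster" splitting rule (the farthest-point initialization of each bisection and the toy data are assumptions, made to keep the sketch deterministic):

```python
import numpy as np

def bisect(Xs, iters=100):
    """Split one cluster in two with 2-means; centroids start at the
    first point and the point farthest from it (deterministic init)."""
    C = np.array([Xs[0], Xs[((Xs - Xs[0]) ** 2).sum(-1).argmax()]])
    for _ in range(iters):
        lab = np.argmin(((Xs[:, None] - C[None]) ** 2).sum(-1), axis=1)
        newC = np.array([Xs[lab == j].mean(axis=0) if np.any(lab == j)
                         else C[j] for j in range(2)])
        if np.allclose(newC, C):
            break
        C = newC
    return lab

def bisecting_kmeans(X, k):
    # Start with the whole dataset as one cluster, then repeatedly
    # bisect the largest remaining cluster until k clusters exist.
    X = np.asarray(X, float)
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        clusters.sort(key=len)
        idx = clusters.pop()            # largest cluster is split next
        lab = bisect(X[idx])
        clusters += [idx[lab == 0], idx[lab == 1]]
    labels = np.empty(len(X), dtype=int)
    for c, idx in enumerate(clusters):
        labels[idx] = c
    return labels

X = np.array([[0, 0], [0.1, 0.1], [4, 0], [4.1, 0.1], [0, 4], [0.1, 4.1]])
labels = bisecting_kmeans(X, k=3)
```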
However, in some cases, when a fraction of the dataset is left behind with no other way to re-cluster it at each level of the binary hierarchical tree, a "refinement" is needed to re-cluster the resulting solutions. In [11], it was concluded that BKM with end-result refinement using KM produces better results than KM and BKM alone. A drawback of this end-result enhancement is that KM has to wait until the preceding BKM finishes its clustering; it then takes the final set of centroids as initial centres for a better refinement [2]. Thus, in hybrid clustering, cascaded clustering algorithms cooperate for the goal of refining the clustering solutions produced by a former clustering algorithm. The different hybrid clustering approaches discussed above are shown to be effective in improving clustering quality, but the main drawback of the hybrid approach is that it does not allow synchronous execution of the clustering algorithms; that is, one algorithm has to wait for another algorithm to finish its clustering.

4. Conclusions
Combining multiple clusterings is considered a new line of progress in the area of data clustering. In this paper, different combined clustering approaches have been discussed that are shown to be effective in improving clustering quality. However, computational complexity is the main drawback of ensemble clustering techniques, and idle-time wastage is a drawback of hybrid clustering approaches.

REFERENCES
[1] Cutting, D., Karger, D., Pedersen, J. and Tukey, J. W., "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections," SIGIR '92, 1992, pp. 318-329.
[2] M. Steinbach, G. Karypis, V. Kumar, "A comparison of document clustering techniques", In Proceedings of the KDD Workshop on Text Mining, 2000, pp. 109-110.
[3] A. Strehl, J. Ghosh, "Cluster ensembles: a knowledge reuse framework for combining partitionings", Conference on Artificial Intelligence (AAAI 2002), AAAI/MIT Press, Cambridge, MA, 2002, pp. 93-98.
[4] U. M. Fayyad, C. Reina, and P. S. Bradley, "Initialization of iterative refinement clustering algorithms", In Proc. 14th Intl. Conf. on Machine Learning (ICML), 1998, pp. 194-198.
[5] D. Greene, P. Cunningham, "Efficient ensemble methods for document clustering", Technical Report, Trinity College Dublin, Computer Science Department, 2006.
[6] N. M. Murty and G. Krishna, "A Hybrid Clustering Procedure for Concentric and Chain-Like Clusters", Int'l J. Computer and Information Sciences, vol. 10, no. 6, 1981, pp. 397-412.
[7] C. Lin, M. Chen, "Combining partitional and hierarchical algorithms for robust and efficient data clustering with cohesion self-merging", IEEE Transactions on Knowledge and Data Engineering, 17 (2), 2005, pp. 145-159.
[8] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," Proc. Conf. Management of Data (ACM SIGMOD '96), 1996, pp. 103-114.
[9] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques (Second Edition, Jim Gray, Series Editor, Morgan Kaufmann Publishers, March 2006).
[10] G. Karypis, E. H. Han, and V. Kumar, "Chameleon: Hierarchical Clustering Using Dynamic Modeling," IEEE Computer, vol. 32, no. 8, Aug. 1999, pp. 68-75.
[11] Y. Zhao, G. Karypis, "Criterion functions for document clustering: experiments and analysis", Technical Report, 2002.
[12] S. M. Savaresi, D. Boley, "On the performance of bisecting k-means and PDDP", in: Proceedings of the 1st SIAM International Conference on Data Mining, 2001, pp. 114.
[13] Rasha Kashef, Mohamed S. Kamel, "Cooperative Clustering", Pattern Recognition, 43, 2010, pp. 2315-2329.