New Approach for K-mean and K-medoids Algorithm


International Journal of Computer Applications Technology and Research, Volume 2, Issue 1, 1-5, 2013 (www.ijcat.com)

Abhishek Patel, Department of Information & Technology, Parul Institute of Engineering & Technology, Vadodara, Gujarat, India
Purnima Singh, Department of Computer Science & Engineering, Parul Institute of Engineering & Technology, Vadodara, Gujarat, India

Abstract: K-means and K-medoids clustering algorithms are widely used for many practical applications. The original k-means and k-medoids algorithms select initial centroids and medoids randomly, which affects the quality of the resulting clusters and sometimes generates unstable and empty clusters that are meaningless. The original k-means and k-medoids algorithms are also computationally expensive, requiring time proportional to the product of the number of data items, the number of clusters and the number of iterations. The new approach for the k-means algorithm eliminates this deficiency of the existing k-means: it first calculates the k initial centroids as per the requirements of the user and then gives better, effective and stable clusters. It also takes less execution time because it eliminates unnecessary distance computations by using the previous iteration. The new approach for k-medoids selects the initial k medoids systematically, based on the initial centroids. It generates stable clusters to improve accuracy.

Keywords: K-means; K-medoids; centroids; clusters

1. INTRODUCTION
Technology advances have made data collection easier and faster, resulting in larger, more complex datasets with many objects and dimensions. Important information is hidden in this data. Data mining has become an intriguing and interesting topic for extracting information from such data collections over the past decade, and there are so many subtopics related to it that research has become a fascination for data miners. Cluster analysis, or primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities.

The goal of a clustering algorithm is to group similar data points in the same cluster while putting dissimilar data points in different clusters. It groups the data in such a way that inter-cluster similarity is maximized and intra-cluster similarity is minimized. Clustering is an unsupervised learning technique of machine learning whose purpose is to give machines the ability to find hidden structure within data.

K-means and K-medoids are widely used, simple partition-based unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori.

In this direction, we have tried to put our efforts into enhancing the partition-based clustering algorithms to improve accuracy, generate better and more stable clusters, reduce time complexity, and efficiently scale to large data sets. For this we have used the concept of "initial centroids", which has proved to be a better option.

2. ANALYSIS OF EXISTING SYSTEM
K-means Clustering
K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way, because different locations cause different results; the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point the method re-calculates k new centroids as barycenters of the clusters resulting from the previous step. Given these k new centroids, a new binding is done between the same data set points and the nearest new centroid, generating a loop. As a result of this loop the k centroids change their location step by step until no more changes occur; in other words, the centroids do not move any more. Finally, this algorithm aims at minimizing an objective function, in this case a squared error function:

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$$

where $\|x_i^{(j)} - c_j\|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$; $J$ is an indicator of the distance of the n data points from their respective cluster centres.

The algorithm is composed of the following steps:
1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

Time Complexity & Space Complexity:
Let n = number of objects, k = number of clusters, and t = number of iterations. The time complexity of the k-means algorithm is O(nkt) and its space complexity is O(n+k).
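To make the procedure concrete, here is a minimal sketch of the standard k-means loop described above, written in Python with NumPy. The random seeding, the Euclidean distance and all names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k initial centroids at random (the weakness discussed below).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate the position of each of the k centroids.
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Objective J: sum of squared distances of points to their cluster centres.
    J = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, J
```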
Observation: The k-means algorithm is a popular and widely applied clustering algorithm, but the standard algorithm, which selects k objects randomly from the population as initial centroids, cannot always give good and stable clusterings. Selecting centroids by our algorithm can lead to a better clustering.

K-medoids Algorithm
The k-medoids algorithm is a clustering algorithm related to the k-means algorithm and the medoidshift algorithm. Both the k-means and k-medoids algorithms are partitional (breaking the dataset up into groups) and both attempt to minimize squared error, the distance between points labeled to be in a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses data points as centers.

K-medoids is a classical partitioning technique of clustering that clusters the data set of n objects into k clusters known a priori. (A useful tool for determining k is the silhouette.) It is more robust to noise and outliers than k-means. A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e. it is the most centrally located point in the given data set.

The most common realization of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm, which is as follows:
1. Initialize: randomly select k of the n data points as the medoids.
2. Associate each data point with the closest medoid. ("Closest" here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski distance.)
3. For each medoid m, for each non-medoid data point o: swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.

Demonstration of PAM
Cluster the following data set of ten objects into two clusters, i.e. k = 2.

Table 1 – Data points for the distribution of the data

Point   X   Y
X1      2   6
X2      3   4
X3      3   8
X4      4   7
X5      6   2
X6      6   4
X7      7   3
X8      7   4
X9      8   5
X10     7   6

Figure 1 – Distribution of the data

Step 1: Initialize the k centres. Let us assume c1 = (3,4) and c2 = (7,4), so c1 and c2 are selected as medoids. Calculate the distances so as to associate each data object with its nearest medoid. Cost is calculated using the Minkowski distance metric with r = 1 (the Manhattan distance).

Table 2 – Cost calculation for the distribution of the data

c1 = (3,4)
Data object (Xi)   Cost (distance)
(2,6)              3
(3,8)              4
(4,7)              4
(6,2)              5
(6,4)              3
(7,3)              5
(8,5)              6
(7,6)              6

c2 = (7,4)
Data object (Xi)   Cost (distance)
(2,6)              7
(3,8)              8
(4,7)              6
(6,2)              3
(6,4)              1
(7,3)              1
(8,5)              2
(7,6)              2

Then the clusters become:
Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}

Since the points (2,6), (3,8) and (4,7) are closest to c1, they form one cluster, whilst the remaining points form the other. The total cost involved is 20, where the cost between any two points is found using the formula

$$\text{cost}(x, c) = \sum_{i=1}^{d} |x_i - c_i|$$

where x is any data object, c is the medoid, and d is the dimension of the object, which in this case is 2. The total cost is the summation of the costs of the data objects from the medoids of their clusters, so here:

total cost = {cost((3,4),(2,6)) + cost((3,4),(3,8)) + cost((3,4),(4,7))} + {cost((7,4),(6,2)) + cost((7,4),(6,4)) + cost((7,4),(7,3)) + cost((7,4),(8,5)) + cost((7,4),(7,6))}
= (3+4+4) + (3+1+1+2+2)
= 20
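The Step 1 costs in Table 2 can be checked in a few lines of Python. This is only a sketch reproducing the worked example above; the point list and helper names are illustrative, and Minkowski r = 1 is taken as the Manhattan distance.

```python
# Reproduce the Step 1 cost calculation of Table 2.
points = [(2,6),(3,4),(3,8),(4,7),(6,2),(6,4),(7,3),(7,4),(8,5),(7,6)]
medoids = [(3,4), (7,4)]  # c1 and c2

def manhattan(p, q):
    # Minkowski distance with r = 1
    return sum(abs(a - b) for a, b in zip(p, q))

# each non-medoid object contributes the distance to its nearest medoid
total = sum(min(manhattan(p, m) for m in medoids)
            for p in points if p not in medoids)
print(total)  # -> 20, the total cost reported above
```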
Figure 2 – Clusters after step 1

Step 2: Select a non-medoid O′ randomly. Let us assume O′ = (7,3), so now the candidate medoids are c1 = (3,4) and O′ = (7,3). If c1 and O′ are the new medoids, calculate the total cost involved using the formula from step 1.

Table 3 – Cost calculation for the distribution of the data

c1 = (3,4)
Data object (Xi)   Cost (distance)
(2,6)              3
(3,8)              4
(4,7)              4
(6,2)              5
(6,4)              3
(7,4)              4
(8,5)              6
(7,6)              6

O′ = (7,3)
Data object (Xi)   Cost (distance)
(2,6)              8
(3,8)              9
(4,7)              7
(6,2)              2
(6,4)              2
(7,4)              1
(8,5)              3
(7,6)              3

Figure 3 – Clusters after step 2

Total cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22

So the cost of swapping the medoid from c2 to O′ is
S = current total cost − past total cost = 22 − 20 = 2 > 0.

Since S > 0, moving to O′ would be a bad idea, so the previous choice was good and the algorithm terminates here (i.e. there is no change in the medoids). Some data points may shift from one cluster to another depending on their closeness to the medoids. For large values of n and k, such computations become very costly.
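The Step 2 swap test can be expressed the same way. The sketch below reuses `points` and the `manhattan` helper from the previous snippet, computes the cost of the configuration {c1, O′}, and evaluates the swap cost S.

```python
# Step 2: evaluate the swap of medoid c2 = (7,4) for the non-medoid O' = (7,3).
def total_cost(points, medoids):
    return sum(min(manhattan(p, m) for m in medoids)
               for p in points if p not in medoids)

cost_before = total_cost(points, [(3, 4), (7, 4)])  # 20
cost_after  = total_cost(points, [(3, 4), (7, 3)])  # 22
S = cost_after - cost_before                        # 2 > 0: reject the swap
print(cost_before, cost_after, S)
```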
3. PROPOSED PARTITION METHOD ALGORITHM FOR NEW K-MEANS AND K-MEDOIDS CLUSTERS
The proposed algorithms for the classical partition methods are based on initial centroids: how to assign a new data point to the appropriate cluster and calculate the new centroid and, in the proposed k-medoids algorithm, how to initialize the cluster seeds, i.e. how to choose the initial medoids. The proposed algorithms are accurate and efficient because with this approach significant computation is eliminated: all of the distance comparisons among points that do not fall within a common cluster are avoided. Also, so that the sizes of all clusters are similar, a threshold value is calculated which controls the size of each cluster. Only the number of clusters and the data set have to be provided as input parameters.

3.1 Process in the proposed algorithm
The key idea behind calculating the initial centroids is that it greatly improves accuracy and makes the algorithm more stable. Based on the initial cluster centroids and medoids, the data points are partitioned into k clusters. Calculating the initial centroids and medoids requires only the input data points and the number of clusters k. In the second stage, the algorithm also stores the previous cluster centroids and the Euclidean distance between each data point and its centroid. When a new centroid is calculated, whether a data point is assigned to a new cluster or stays in its previous cluster is decided by comparing the present Euclidean distance between the two points with the previous Euclidean distance. This greatly reduces the number of distance calculations required for clustering.

3.2 Proposed Algorithm for New k-means
Instead of selecting the initial centroids randomly, for stable clusters the initial centroids are determined systematically. The algorithm calculates the Euclidean distance between each pair of data points, selects the two data points between which the distance is shortest, forms a data-point set containing these two points, and deletes them from the population. It then repeatedly finds the nearest remaining data point to this set and puts it into the set. The number of elements in each set is decided systematically by the initial population and the number of clusters; in this way the different sets of data points are found, and the number of sets depends on the value of k. The mean value of each set is then calculated, and these means become the initial centroids of the proposed k-means algorithm (a sketch of this selection appears below).

After finding the initial centroids, the algorithm starts by forming initial clusters based on the relative distance of each data point from the initial centroids. These clusters are subsequently fine-tuned by using a heuristic approach, thereby improving the efficiency.
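A sketch of this systematic initial-centroid selection is given below, assuming Euclidean distance. The function name, the list-based bookkeeping and the handling of leftover points are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def initial_centroids(X, k):
    D = [tuple(x) for x in X]        # working copy of the data set
    limit = int(0.75 * len(X) / k)   # target size 0.75*(n/k) per set
    sets = []
    for _ in range(k):
        # find the closest pair in D and seed a new data-point set with it
        i, j = min(((a, b) for a in range(len(D)) for b in range(a + 1, len(D))),
                   key=lambda ab: np.linalg.norm(np.subtract(D[ab[0]], D[ab[1]])))
        A = [D[i], D[j]]
        for idx in (j, i):           # delete the higher index first
            del D[idx]
        # grow the set with the nearest remaining point until it reaches the limit
        while len(A) < limit and D:
            p = min(D, key=lambda q: min(np.linalg.norm(np.subtract(q, a))
                                         for a in A))
            A.append(p)
            D.remove(p)
        sets.append(A)
    # the mean of each data-point set becomes an initial centroid
    return np.array([np.mean(s, axis=0) for s in sets])
```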
Input:
D = {d1, d2, ..., dn} // set of n data items
k // number of desired clusters
Output:
A set of k clusters
Steps:
1. Set p = 1.
2. Compute the distance between each data point and all other data points in the set D.
3. Find the closest pair of data points in the set D and form a data-point set Ap (1 <= p <= k) which contains these two data points; delete these two data points from the set D.
4. Find the data point in D that is closest to the data-point set Ap; add it to Ap and delete it from D.
5. Repeat step 4 until the number of data points in Ap reaches 0.75*(n/k).
6. If p < k, then p = p + 1; find another pair of data points in D between which the distance is the shortest, form another data-point set Ap and delete them from D; go to step 4.
7. For each data-point set Ap (1 <= p <= k), find the arithmetic mean Cp of the vectors of data points in Ap; these means will be the initial centroids.
8. Compute the distance of each data point di (1 <= i <= n) to all the centroids cj (1 <= j <= k) as d(di, cj).
9. For each data point di, find the closest centroid cj and assign di to cluster j.
10. Set ClusterId[i] = j. // j: id of the closest cluster
11. Set Nearest_Dist[i] = d(di, cj).
12. For each cluster j (1 <= j <= k), recalculate the centroids.
13. Repeat: for each data point di,
  13.1 compute its distance from the centroid of the present nearest cluster;
  13.2 if this distance is less than or equal to the present nearest distance, the data point stays in that cluster; else
    13.2.1 for every centroid cj (1 <= j <= k), compute the distance d(di, cj);
    13.2.2 assign the data point di to the cluster with the nearest centroid cj;
    13.2.3 set ClusterId[i] = j;
    13.2.4 set Nearest_Dist[i] = d(di, cj).
14. For each cluster j (1 <= j <= k), recalculate the centroids; repeat steps 13 and 14 until the convergence criterion is met.

Complexity of the Algorithm
The enhanced algorithm requires a time complexity of O(n²) for finding the initial centroids, as the maximum time required here is for computing the distances between each data point and all other data points in the set D.

In the original k-means algorithm, before the algorithm converges the centroids are calculated many times and the data points are assigned to their nearest centroids. Since a complete redistribution of the data points takes place according to the new centroids, this takes O(nkl), where n is the number of data points, k is the number of clusters and l is the number of iterations.

To obtain the initial clusters, the enhanced algorithm requires O(nk). Thereafter some data points remain in their cluster while others move to other clusters, depending on their relative distances from the new centroid and the old centroid. This requires O(1) if a data point stays in its cluster and O(k) otherwise. As the algorithm converges, the number of data points moving away from their cluster decreases with each iteration. Assuming that half the data points move from their clusters, this requires O(nk/2). Hence the total cost of this phase of the algorithm is O(nk), not O(nkl). Thus the overall time complexity of the improved algorithm becomes O(n²), since k is much less than n.
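Steps 13 and 14, the heuristic that skips redundant distance computations, might look like the following sketch. NumPy arrays and Euclidean distance are assumed, and `ClusterId` and `Nearest_Dist` from the pseudocode are passed in as `cluster_id` and `nearest_dist`.

```python
import numpy as np

def reassign(X, centroids, cluster_id, nearest_dist):
    """One pass of steps 13.1-13.2: search all k centroids only for points
    that have drifted away from their current cluster."""
    for i, x in enumerate(X):
        d = np.linalg.norm(x - centroids[cluster_id[i]])
        if d <= nearest_dist[i]:
            nearest_dist[i] = d          # still nearest: skip the k-way search
        else:
            dists = np.linalg.norm(centroids - x, axis=1)
            cluster_id[i] = int(dists.argmin())
            nearest_dist[i] = float(dists.min())
    return cluster_id, nearest_dist
```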
3.3 Proposed Algorithm for New K-medoids
Unfortunately, k-means clustering is sensitive to outliers, and the set of objects closest to a centroid may be empty, in which case that centroid cannot be updated. For this reason, k-medoids clustering is sometimes used, where representative objects called medoids are considered instead of centroids. Because it uses the most centrally located object in a cluster, it is less sensitive to outliers than k-means clustering. Among the many algorithms for k-medoids clustering, Partitioning Around Medoids (PAM), proposed by Kaufman and Rousseeuw (1990), is known to be the most powerful. However, PAM has the drawback that it works inefficiently for large data sets due to its complexity. We are interested in developing a new k-medoids clustering method that is fast and efficient.

Algorithm for selecting the initial medoids
Input:
D = {d1, d2, ..., dn} // set of n data items
k // number of desired clusters
Output:
A set of k initial centroids Cm = {C1, C2, ..., Ck}
Steps:
1. Set p = 1.
2. Compute the distance between each data point and all other data points in the set D.
3. Find the closest pair of data points in the set D and form a data-point set Ap (1 <= p <= k) which contains these two data points; delete these two data points from the set D.
4. Find the data point in D that is closest to the data-point set Ap; add it to Ap and delete it from D.
5. Repeat step 4 until the number of data points in Ap reaches 0.75*(n/k).
6. If p < k, then p = p + 1; find another pair of data points in D between which the distance is the shortest, form another data-point set Ap and delete them from D; go to step 4.
7. For each data-point set Ap (1 <= p <= k), find the arithmetic mean Cp of the vectors of data points in Ap; these means will be the initial centroids.
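Since this selection procedure is identical to the one in Section 3.2, a sketch can simply reuse the `initial_centroids` function from above and then take, for each centroid, the nearest actual data point as an initial medoid (this anticipates step 1 of the clustering algorithm that follows). Reading "very close to centroids" as "nearest data point to each centroid" is an illustrative assumption.

```python
import numpy as np

def initial_medoids(X, k):
    # systematic centroids from the Section 3.2 sketch above
    C = initial_centroids(X, k)
    # the data point nearest each centroid serves as an initial medoid
    return [tuple(X[np.linalg.norm(X - c, axis=1).argmin()]) for c in C]
```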
Algorithm for k-medoids
Input:
(1) A database of n objects
(2) A set of k initial centroids Cm = {C1, C2, ..., Ck}
Output:
A set of k clusters
Steps:
1. Initialize the initial medoids as the data points closest to the centroids {C1, C2, ..., Ck}.
2. Associate each data point with the closest medoid. ("Closest" here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski distance.)
3. For each medoid m, for each non-medoid data point o: swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.

Complexity of the Algorithm
The enhanced algorithm requires a time complexity of O(n²) for finding the initial centroids, as the maximum time required here is for computing the distances between each data point and all other data points in the set D. The complexity of the remaining part of the algorithm is O(k(n−k)²), because it works just like the PAM algorithm. So the overall complexity of the algorithm is O(n²), since k is much less than n.
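Putting the pieces together, a minimal end-to-end sketch of the proposed k-medoids (systematic seeding followed by PAM-style swaps) could look as follows. It reuses `manhattan`, `total_cost` and `initial_medoids` from the earlier snippets; the greedy swap-acceptance order is an implementation choice, not specified by the paper.

```python
import numpy as np

def new_k_medoids(points, k):
    # step 1: initial medoids close to the systematically computed centroids
    medoids = initial_medoids(np.array(points), k)
    cost = total_cost(points, medoids)
    improved = True
    while improved:                      # steps 2-5: swap until medoids are stable
        improved = False
        for mi in range(k):
            for o in points:
                if o in medoids:
                    continue
                trial = medoids[:mi] + [o] + medoids[mi + 1:]
                c = total_cost(points, trial)
                if c < cost:             # keep the cheapest configuration found
                    medoids, cost, improved = trial, c, True
    return medoids, cost
```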
4. CONCLUSION
The k-means algorithm is widely used for clustering large sets of data, but the standard algorithm does not always guarantee good results, as the accuracy of the final clusters depends on the selection of the initial centroids. Moreover, the computational complexity of the standard algorithm is objectionably high, owing to the need to reassign the data points a number of times during every iteration of the loop.

We have presented an enhanced k-means algorithm which combines a systematic method for finding the initial centroids with an efficient way of assigning data points to clusters. This method ensures that the entire process of clustering runs in O(n²) time without sacrificing the accuracy of the clusters, whereas previous improvements of the k-means algorithm compromise on either accuracy or efficiency.

The proposed k-medoids algorithm runs just like the k-means clustering algorithm, but uses a systematic method for choosing the initial medoids. The performance of the algorithm may vary according to the method of selecting the initial medoids; the proposed method is more efficient than the existing k-medoids algorithm. The time complexity of clustering is O(n²), again without sacrificing the accuracy of the clusters.

5. FUTURE WORK
In the new approach to the classical partition-based clustering algorithms, the value of k (the number of desired clusters) is given as an input parameter, regardless of the distribution of the data points. It would be better to develop statistical methods to compute the value of k depending on the data distribution.
