# Dynamic approach to k means clustering algorithm-2

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 3, May-June (2013), pp. 204-219, © IAEME: www.iaeme.com/ijcet.asp. Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com

DYNAMIC APPROACH TO k-Means CLUSTERING ALGORITHM

Deepika Khurana (1) and Dr. M.P.S Bhatia (2)

(1) Department of Computer Engineering, Netaji Subhas Institute of Technology, University of Delhi, New Delhi, India
(2) Department of Computer Engineering, Netaji Subhas Institute of Technology, University of Delhi, New Delhi, India

ABSTRACT

The k-Means clustering algorithm is a heuristic algorithm that partitions a dataset into k clusters by minimizing the sum of squared distances within each cluster. It has, however, a number of weaknesses. First, it requires prior knowledge of the number of clusters k. Second, it is sensitive to initialization, which leads to different solutions from run to run. This paper presents a new approach to k-Means clustering that provides a solution to the initial selection of cluster centroids and a dynamic approach based on the silhouette validity index. Instead of running the algorithm for different values of k, the user needs to give only an initial value of k, ko, as input, and the algorithm itself determines the right number of clusters for the given dataset. The algorithm is implemented in MATLAB R2009b and the results are compared with the original k-Means algorithm and other modified k-Means clustering algorithms. The experimental results demonstrate that the proposed scheme improves the initial center selection and the overall computation time.

Keywords: Clustering, Data mining, Dynamic, k-Means, Silhouette validity index.

I. INTRODUCTION

Data mining is defined as the mining of knowledge from huge amounts of data. Using data mining we can predict the nature and behaviour of any kind of data. It was recognized that information is at the heart of business operations and that decision makers could use the stored data to gain valuable insight into the business. DBMS gave access to the stored data, but this was only a small part of what could be gained from the data.
Analyzing data can provide further knowledge about the business by going beyond the data explicitly stored.

The value of the information learned from data has led clustering techniques to be widely applied in artificial intelligence, customer-relationship management, data compression, data mining, image processing, machine learning, pattern recognition, market analysis, fraud detection and so on. Cluster analysis of data is an important task in Knowledge Discovery and Data Mining. Clustering is the process of grouping data on the basis of similarities and dissimilarities among the data elements, so that the objects in one group are similar to one another and different from the objects in the other groups.

Clustering is an unsupervised technique that requires a parameter specifying the number of clusters k. Setting this parameter requires either detailed knowledge of the dataset or running the algorithm for different values of k to determine the correct number of clusters. For large and multidimensional data, however, clustering becomes time consuming and determining the correct number of clusters becomes difficult.

The k-Means clustering algorithm is an old algorithm that has been intensely researched owing to its simplicity and ease of implementation. However, there have also been criticisms of its performance, in particular for demanding the value of k in advance. It is evident from previous research that providing the number of clusters in advance does not in any way assist in the production of good quality clusters. The Original k-Means also determines the initial centers randomly in each run, which leads to different solutions.

To validate the clustering results we have chosen the Silhouette validity index as the validity measure. The Silhouette validity index is particularly useful when seeking the number of clusters that are compact and well separated. This index is used after clustering to check the validity of the clusters produced.

This paper presents a new method for selecting the initial k centers and a dynamic approach to k-Means clustering. An initial value of k, ko, is provided by the user. The algorithm then partitions the whole space into different segments and calculates the frequency of data points in each segment. The ko highest-frequency segments are chosen as the initial ko clusters. To determine the initial centers, the algorithm calculates, for each segment, the distance of its points from the origin, sorts these distances, and chooses the coordinates corresponding to the middle distance as the center for that segment. The cluster assignment process is then carried out, and the Silhouette validity index is calculated for the initial ko clusters. This step is repeated for (ko + 2) and (ko - 2) clusters. The algorithm then iterates again under specified conditions and stops at the maximum value of the silhouette index, yielding the correct number of clusters k. The proposed approach is dynamic in the sense that the user need not check the algorithm for different values of k; instead, the algorithm stops by itself at the best value of k, giving compact and well separated clusters. The proposed algorithm takes less execution time when compared with the Original k-Means and a modified approach to k-Means clustering.

The paper is organised as follows: Section 2 presents related work. The Silhouette validity index is discussed in Section 3. Section 4 describes the Original k-Means. Sections 5 and 6 detail the approaches discussed in [1] and [2] respectively. Section 7 describes the proposed algorithm. Section 8 shows implementation results. Conclusion and future work are presented in Section 9.
II. RELATED WORK

In literature [1] an improved k-Means algorithm is presented that reduces the sensitivity to the initial centers. This algorithm partitions the whole data space into different segments and calculates the frequency of points in each segment. The segments with the maximum frequency are considered for the initial centroids, depending upon the value of k.

In literature [2] another method of finding initial cluster centers is discussed. It first finds the closest pair of data points and, on the basis of these points, forms a subset of the dataset; this process is repeated k times to find k small subsets, from which the k initial centroids are obtained.

The author in literature [3] uses Principal Component Analysis for dimension reduction and to find initial cluster centers.

In [4] the data set is first pre-processed by transforming all data values to positive space; the data is then sorted and divided into k equal sets, and the middle value of each set is taken as the initial center point.

In literature [5] a dynamic solution to k-Means is proposed: the algorithm is designed with a pre-processor using the silhouette validity index that automatically determines the appropriate number of clusters, which increases the efficiency of clustering to a great extent.

In [6] a method is proposed to make the algorithm independent of the number of iterations; it avoids repeatedly computing the distance of each data point to the cluster centers, saving running time and reducing computational complexity.

In literature [7] a dynamic means algorithm is proposed to improve the cluster quality and optimize the number of clusters. The user has the flexibility either to fix the number of clusters or to input the minimum number of clusters required. In the former case it works the same as the k-Means algorithm. In the latter case the algorithm computes new cluster centers by incrementing the cluster count by one in every iteration until the cluster quality validity criterion is satisfied.

In [8] the main purpose is to optimize the initial centroids for the k-Means algorithm. The author proposed a Hierarchical k-Means algorithm. It uses the clustering results of several k-Means runs, even though some of them reach only local optima, and then combines the centroids of these results using a hierarchical algorithm in order to determine the initial centroids for k-Means. This algorithm is better suited to complex clustering cases with large data sets and many dimensional attributes.

III. SILHOUETTE VALIDITY INDEX

The silhouette value for each point is a measure of how similar that point is to the points in its own cluster compared to the points in other clusters. This technique computes the silhouette width for each data point, the silhouette width for each cluster and the overall average silhouette width.

The silhouette width for the i-th point of the m-th cluster is given by equation 1:

$$S_i(m) = \frac{b_i - a_i}{\max(b_i, a_i)} \tag{1}$$

where a_i is the average distance from the i-th point to the other points in its cluster and b_i is the minimum of the average distances from point i to the points in each of the other k-1 clusters. It ranges from -1 to +1. A silhouette index close to 1 for point i indicates that the point belongs to the cluster it has been assigned to. A value of zero indicates that the object could equally be assigned to the closest neighbouring cluster. A value close to -1 indicates that the object is wrongly clustered or lies somewhere between clusters.

The silhouette width for each cluster is given by equation 2, where n(m) is the number of points in cluster m:

$$S(m) = \frac{1}{n(m)} \sum_{i=1}^{n(m)} S_i(m) \tag{2}$$

The overall average silhouette width is given by equation 3:

$$S = \frac{1}{k} \sum_{m=1}^{k} S(m) \tag{3}$$

We have used this silhouette validity index as the measure of cluster validity in the implementation of the Original k-Means, modified approach I and modified approach II. We have also used this measure as the basis for making the proposed algorithm work dynamically.

IV. ORIGINAL K-MEANS ALGORITHM

The k-Means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. Cluster similarity is measured with regard to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.

The k-Means algorithm proceeds as follows:

1. Randomly select k of the objects, each of which initially represents a cluster mean or center.
2. Assign each of the remaining objects to the cluster to which it is most similar, based on the distance between the object and the cluster mean.
3. Compute the new mean for each cluster using equation 4:

$$M_j = \frac{1}{n_j} \sum_{\forall Z_p \in C_j} Z_p \tag{4}$$

where M_j is the centroid of cluster j and n_j is the number of data points in cluster j.

4. Iterate this process until the criterion function converges.

Typically the square-error criterion is used, defined by equation 5:

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2 \tag{5}$$

where p is a data point and m_i is the center of cluster C_i. E is the sum of squared errors over all points in the dataset. The distance used in the criterion function is the Euclidean distance between a data point and its cluster center.
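The four steps above and the square-error criterion can be sketched in a few lines of Python. This is a minimal illustration, not the paper's MATLAB implementation: the names `k_means`, `euclidean` and `mean_point`, the `seed` and `max_iter` parameters, and the choice to keep an old center when a cluster empties are all assumptions made for the example.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_point(points):
    """Component-wise mean of a non-empty list of vectors (equation 4)."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def k_means(data, k, seed=0, max_iter=100):
    """Plain k-Means: random initial centers, then assign/update until stable."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)          # step 1: k random objects as centers
    labels = None
    for _ in range(max_iter):
        # step 2: assign every object to its nearest center
        new_labels = [min(range(k), key=lambda j: euclidean(p, centers[j]))
                      for p in data]
        if new_labels == labels:           # step 4: no change, so converged
            break
        labels = new_labels
        # step 3: recompute each center as the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:                    # assumption: keep old center if empty
                centers[j] = mean_point(members)
    # equation 5: sum of squared distances to the assigned centers
    sse = sum(euclidean(p, centers[lab]) ** 2 for p, lab in zip(data, labels))
    return labels, centers, sse
```

Because the initial centers come from `rng.sample`, different seeds can give different partitions on harder data, which is the sensitivity to initialization that the modified approaches try to remove.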
The Euclidean distance between two vectors x = (x1, x2, x3, x4, ..., xn) and y = (y1, y2, y3, y4, ..., yn) can be calculated using equation 6:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{6}$$

Algorithm: The k-Means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.

Input:
• k: the number of clusters,
• D: a data set containing n objects.

Output: A set of k clusters.

Method:
1. arbitrarily choose k objects from D as the initial cluster centers;
2. repeat
3. (re)assign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster;
4. update the cluster means, i.e., calculate the mean value of the objects for each cluster;
5. until no change;

V. MODIFIED APPROACH I

The first approach, discussed in [1], optimizes the Original k-Means algorithm by proposing a method for choosing the initial clusters. The author proposed a method that partitions the given input data space into k * k segments, where k is the desired number of clusters. After partitioning the data space, the frequency of each segment is calculated and the k highest-frequency segments are chosen to represent the initial clusters. If some segments have the same frequency, the adjacent segments with the same least frequency are merged until k segments are obtained. The initial k centers are then calculated by taking the mean of the data points in the respective segments. This process always yields the same initial centers for a given dataset, whereas the Original k-Means algorithm selects random initial centers on every run.

Next, a threshold distance is defined for each centroid: the distance between every pair of cluster centroids is calculated and, for each centroid, half of the minimum distance to the remaining centroids is taken. The threshold distance for cluster C_i is denoted by dc(i).

To assign a data point to a cluster, take a point p in the dataset, calculate its distance from the centroid of cluster i and compare it with dc(i). If the distance is less than or equal to dc(i), assign the data point p to cluster i; otherwise calculate its distance from the other centroids. This process is repeated until the data point p is assigned to one of the clusters. If the data point p is not assigned to any cluster in this way, the centroid at minimum distance from p becomes the centroid for that point. The centroid is then updated by calculating the mean of the data points in the cluster.
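The threshold rule described above can be sketched as follows. This is an illustrative reading of the description, not the code of [1]; the function names are hypothetical, and at least two centroids are assumed (the pseudo code exits when k is 1).

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def threshold_distances(centroids):
    """dc(i): half the distance from centroid i to its nearest other centroid."""
    return [
        0.5 * min(euclidean(c, other)
                  for j, other in enumerate(centroids) if j != i)
        for i, c in enumerate(centroids)
    ]


def assign_with_threshold(point, centroids, dc):
    """Return the index of the cluster that `point` is assigned to."""
    best_j, best_d = None, float("inf")
    for j, c in enumerate(centroids):
        d = euclidean(point, c)
        if d <= dc[j]:
            return j                      # inside the threshold: stop early
        if d < best_d:
            best_j, best_d = j, d
    return best_j                         # no threshold hit: nearest centroid
```

The early exit is safe: since dc(j) is half the distance to the nearest other centroid, the triangle inequality guarantees that a point within dc(j) of centroid j is at least as close to j as to any other centroid. The saving is that the remaining distances need not be computed for such points.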
Pseudo code for the Modified k-Means algorithm is as follows:

Input: Dataset of N data points D (i = 1 to N)
Desired number of clusters = k
Output: N data points clustered into k clusters.

Steps:
1. Input the data set and value of k.
2. If the value of k is 1 then Exit.
3. Else
4. /* divide the data point space into k*k, i.e. k vertically and k horizontally */
5. For each dimension {
6. Calculate the minimum and maximum value of data points.
7. Calculate the range of group (RG) using equation 7:

$$RG = \frac{min + max}{k} \tag{7}$$

8. Divide the data point space into k groups of width RG
9. }
10. Calculate the frequency of data points in each partitioned space.
11. Choose the k highest-frequency groups.
12. Calculate the mean of each selected group. /* These will be the initial centroids of the k clusters. */
13. Calculate the distance between each pair of clusters using equation 8:

$$d(C_i, C_j) = \{\, d(m_i, m_j) : (i, j) \in [1, k] \ \&\ i \neq j \,\} \tag{8}$$

where d(C_i, C_j) is the distance between centroids i and j.
14. Take the minimum distance for each cluster and halve it using equation 9:

$$dc(i) = \frac{1}{2} \min_{j \neq i} d(C_i, C_j) \tag{9}$$

where dc(i) is half of the minimum distance of the i-th cluster from the other remaining clusters.
15. For each data point Zp, p = 1 to N {
16. For each cluster j = 1 to k {
17. Calculate d(Zp, Mj) using equation 10:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{10}$$

where d(x, y) is the distance between vectors x = (x1, x2, x3, x4, ..., xn) and y = (y1, y2, y3, y4, ..., yn).
18. If (d(Zp, Mj) ≤ dc(j)) {
19. Then assign Zp to cluster Cj.
20. Break;
21. }
22. Else
23. Continue;
24. }
25. If Zp does not belong to any cluster then
26. Assign Zp to the cluster with min(d(Zp, Mi)), where i ∈ [1, k]
27. }
28. Check the termination condition of the algorithm; if satisfied
29. Exit.
30. Else
31. Calculate the centroid of each cluster using equation 11:

$$M_j = \frac{1}{n_j} \sum_{\forall Z_p \in C_j} Z_p \tag{11}$$

where M_j is the centroid of cluster j and n_j is the number of data points in cluster j.
32. Go to step 13.

VI. MODIFIED APPROACH II

In the work of [2], the author calculates the distance between each pair of data points, selects the pair with the minimum distance and removes it from the dataset. Then the data point in the remaining dataset that is closest to the selected points is added to them. This process is repeated until a threshold size is reached. If the number of subsets formed is less than k, the distances between the remaining data points are calculated again and the process is repeated until k clusters are formed.

The first phase is to determine the initial centroids. For this, compute the distance between each data point and all other data points in the set D. Then find the closest pair of data points, form a set A1 consisting of these two data points, and delete them from the data point set D. Then determine the data point which is closest to the set A1, add it to A1 and delete it from D. Repeat this procedure until the number of elements in the set A1 reaches a threshold. Then form another data-point set A2 in the same way. Repeat this until k such sets of data points are obtained. Finally, the initial centroids are obtained by averaging all the vectors in each data-point set. The Euclidean distance is used for determining the closeness of each data point to the cluster centroids.

The next phase is to assign points to the clusters. Here the main idea is to keep two simple data structures that retain, for every data object, the label of its cluster and its distance to the nearest cluster center during each iteration, so that they can be used in the next iteration. We calculate the distance between the current data object and the new center of its cluster; if the computed distance is smaller than or equal to the distance to the old center, the data object stays in the cluster it was assigned to in the previous iteration. There is then no need to calculate the distance from this data object to the other k-1 cluster centers, which saves the time of computing those k-1 distances. Otherwise, we must calculate the distance from the current data object to all k cluster centers, find the nearest cluster center and assign the point to it. We then record the label of the nearest cluster center and the distance to that center. Because in each iteration some data points remain in their original cluster, the distances for those points are not recalculated, which reduces the total time spent calculating distances and thereby enhances the efficiency of the algorithm.
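The two cached structures described above can be illustrated with a short sketch of one assignment pass. The names `cluster_id`, `nearest_dist` and `reassign_with_cache` are chosen for this example (they mirror the description but are not taken from the code of [2]), and the returned counter is added only to make the saving visible.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def reassign_with_cache(data, centroids, cluster_id, nearest_dist):
    """One assignment pass; updates cluster_id and nearest_dist in place.

    Returns the number of points that needed a full search over all centroids.
    """
    full_searches = 0
    for i, p in enumerate(data):
        d_own = euclidean(p, centroids[cluster_id[i]])
        if d_own <= nearest_dist[i]:
            nearest_dist[i] = d_own       # point stays; only refresh the cache
            continue
        full_searches += 1                # otherwise compare against all k centers
        j = min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
        cluster_id[i] = j
        nearest_dist[i] = euclidean(p, centroids[j])
    return full_searches
```

Each point that stays costs one distance computation instead of k, so the benefit grows as the clustering stabilises and fewer points move between iterations.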
Pseudo code for the modified k-Means algorithm is as follows:

Input: Dataset D of N data points (i = 1 to N)
Desired number of clusters = k
Output: N data points clustered into k clusters.

Phase 1:
Steps:
1. Set m = 1;
2. Compute the distance between each data point and all other data points in the set D using equation 12:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{12}$$

where d(x, y) is the distance between vectors x = (x1, x2, x3, x4, ..., xn) and y = (y1, y2, y3, y4, ..., yn).
3. Find the closest pair of data points from the set D and form a data-point set Am (1 <= m <= k) which contains these two data points. Delete these two data points from the set D;
4. Find the data point in D that is closest to the data-point set Am, add it to Am and delete it from D;
5. Repeat step 4 until the number of data points in Am reaches 0.75*(N/k);
6. If m < k, then m = m+1, find another pair of data points from D between which the distance is the shortest, form another data-point set Am and delete them from D. Go to step 4;
7. For each data-point set Am (1 <= m <= k) find the arithmetic mean of the vectors of data points in Am; these means will be the initial centroids.

Phase 2:
Steps:
1. Compute the distance of each data point di (1 <= i <= N) to all the centroids Cj (1 <= j <= k) as d(di, Cj) using equation 12.
2. For each data point di, find the closest centroid Cj and assign di to cluster j.
3. Set ClusterId[i] = j; /* j: Id of the closest cluster for point i */
4. Set Nearest_Dist[i] = d(di, Cj);
5. For each cluster j (1 <= j <= k), recalculate the centroids;
6. Repeat
7. For each data point di,
a. Compute its distance from the centroid of the present nearest cluster;
b. If this distance is less than or equal to the present nearest distance, the data point stays in the cluster;
c. Else for every centroid cj (1 <= j <= k) compute the distance d(di, Cj);
d. End for;
8. Assign the data point di to the cluster with the nearest centroid Cj
9. Set ClusterId[i] = j;
10. Set Nearest_Dist[i] = d(di, Cj);
11. End for (step (2));
12. For each cluster j (1 <= j <= k), recalculate the centroids until the convergence criterion is met, i.e. either no center updates or no point moves to another cluster.
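Phase 1 above can be sketched as follows. The pseudo code does not say how the distance from a point to the set Am is measured, so this example assumes the distance to the nearest member of Am; that choice, the function name `initial_centroids`, and the requirement that enough points remain to seed every pair are assumptions of the sketch rather than details from [2].

```python
import math
from itertools import combinations


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def initial_centroids(data, k):
    """Phase 1 sketch: grow k subsets from closest pairs, return their means.

    Assumes at least two points remain whenever a new subset is seeded.
    """
    remaining = list(data)
    target = 0.75 * len(data) / k         # subset size threshold from step 5
    centroids = []
    for _ in range(k):
        # steps 3 and 6: seed the subset with the closest remaining pair
        a, b = min(combinations(range(len(remaining)), 2),
                   key=lambda ij: euclidean(remaining[ij[0]], remaining[ij[1]]))
        subset = [remaining[a], remaining[b]]
        for idx in sorted((a, b), reverse=True):
            del remaining[idx]
        # steps 4 and 5: absorb the nearest remaining point until the threshold
        while len(subset) < target and remaining:
            nearest = min(
                range(len(remaining)),
                key=lambda i: min(euclidean(remaining[i], s) for s in subset))
            subset.append(remaining.pop(nearest))
        # step 7: the subset mean is one initial centroid
        centroids.append(tuple(sum(col) / len(subset) for col in zip(*subset)))
    return centroids
```

The closest-pair search compares all pairs of remaining points, so this phase is quadratic in the number of points for each subset; that cost is worth keeping in mind when reading the execution times reported later.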
VII. PROPOSED APPROACH

The changes are based on the selection of the initial k centers and on making the algorithm work dynamically, i.e. instead of running the algorithm for different values of k, we design it so that it decides by itself how many clusters there are in a given dataset. The two modifications are as follows:

• A method to select the initial k centers.
• Making the algorithm dynamic.

Proposed Algorithm:

The algorithm consists of three phases.

The first phase is to determine the initial k centroids. In this phase the user inputs the dataset and the value of k. The data space is divided into k*k segments as discussed in [1]. After dividing the data space we choose the k segments with the highest frequency of points. If some segments have the same frequency, the adjacent segments with the same least frequency are merged until we get k segments. Then we find the distance of each point in each selected segment from the origin; these distances are sorted for each segment and the middle point is selected as the center for that segment. This step is repeated for each of the k selected segments. These points represent the initial k centroids.

The second phase is to assign points to the clusters based on the minimum distance between the point and the cluster centers. The distance measure used is the Euclidean distance. The mean of each cluster formed is then computed as the next center. This process is repeated until there are no more center updates.

The third phase is where the algorithm iterates dynamically to determine the right number of clusters. To choose the right number of clusters we use the Silhouette Validity Index.

Pseudo Code for Proposed Algorithm:

Input: Dataset of N points.
Desired number of clusters k.
Output: N points grouped into k clusters.

Phase 1: Finding initial centroids
Steps:
1. Input the dataset and value of k ≥ 2.
2. Divide the data point set into k*k segments /* k vertically and k horizontally */
3. For each dimension {
4. Calculate the minimum and maximum value of data points.
5. Calculate the width (Rg) using equation 13:

$$R_g = \frac{min + max}{k} \tag{13}$$

}
6. Calculate the frequency of data points in each segment.
7. Choose the k highest-frequency segments.
8. For each segment i = 1 to k {
9. For each point j in the segment i {
10. Calculate the distance of point j from the origin }
11. Sort these distances in ascending order in matrix D
12. Select the middle point distance.
13. The co-ordinates corresponding to the distance in step 12 are chosen as the initial center for the i-th cluster. }
14. These k co-ordinates are stored in matrix C, which represents the initial centroids.

Phase 2: Assigning points to the cluster
Steps:
1. Repeat
2. For each data point p = 1 to N {
3. For each cluster j = 1 to k {
4. Calculate the distance between point p and cluster centroid cj of Cj using equation 14:

$$d(p, c_j) = \sqrt{(p - c_j)^2} \tag{14}$$

}
5. Assign p to the cluster with min{d(p, cj)}, where j ∈ [1, k]. }
6. Check the termination condition of the algorithm; if satisfied
7. Exit
8. Else
9. Calculate the new centroids of the clusters using equation 15:

$$c_j = \frac{1}{n_j} \sum_{\forall p \in C_j} p \tag{15}$$

where n_j is the number of points in cluster C_j.
10. Go to step 1.

Phase 3: Determining the appropriate number of clusters

For the given value of k, phases 1 and 2 are run three times, using k-2, k and k+2. The three corresponding silhouette values are calculated as discussed in Section III. These are denoted by Sk-2, Sk and Sk+2. The appropriate number of clusters is then found using the following steps.

Steps:
1. If Sk-2 < Sk and Sk > Sk+2 then run phase 1 and phase 2 using k+1 and k-1, and the corresponding Sk+1 and Sk-1 are found. The maximum of the three values Sk-1, Sk, Sk+1 then determines the appropriate number of clusters. For example, if Sk+1 is the maximum, then the number of clusters formed by the algorithm is k+1.
2. Else if Sk+2 > Sk and Sk+2 > Sk-2 then run phase 1 and phase 2 using k+1, k+3 and k+4, and the corresponding Sk+1, Sk+3 and Sk+4 are found. The value of k corresponding to the maximum of Sk+1, Sk+2, Sk+3, Sk+4 is returned.
3. Else if Sk+2 < Sk-2 and Sk < Sk-2 then run phase 1 and phase 2 using k-1, k-3 and k-4, and the corresponding Sk-1, Sk-3 and Sk-4 are found (Sk-2 is already known). The value of k corresponding to the maximum of Sk-1, Sk-2, Sk-3, Sk-4 is returned.
4. Stop.

Thus the algorithm terminates by itself where the best value of k is found. This value of k is the appropriate number of clusters for the given data set.

VIII. RESULT ANALYSIS

The proposed algorithm is implemented and the results are compared with those of the modified approaches [1] and [2] in terms of execution time and the initial centers chosen.

1. The total time taken by the proposed algorithm to form clusters and dynamically determine the appropriate number of clusters is less than the total time taken by the algorithm in [1] when it is run for different values of k. For example, if we run the algorithm in [1] separately for k = 2, 3, 4, 5, 6, 7, etc., it takes more time than the proposed algorithm, which itself runs for different values of k.
2. We define a new method to determine the initial centers that is based on the middle value rather than the mean value. The reason is that the middle value better represents the distribution: the mean is influenced by very large and very small values, whereas the middle value is not.

The results show that the algorithm works dynamically and is also an improvement over the original k-Means. Table I shows the results of running the algorithm in [1] over the wine dataset from the UCI repository for k = 3, 4, 5, 6, 7 and 9. The algorithm is run for these values of k because the proposed algorithm, initially fed k = 7, runs for these same values of k automatically, so the total execution times of both algorithms can be compared. The results show that the proposed algorithm takes less time than running the algorithm in [1] individually for the different values of k.

TABLE I: RESULTS OF ALGORITHM [1] FOR WINE DATASET

| Sr. no. | Value of k | Silhouette validity index | Execution time (s) |
|---|---|---|---|
| 1. | 3 | 0.4437 | 3.92 |
| 2. | 4 | 0.3691 | 4.98 |
| 3. | 5 | 0.3223 | 2.89 |
| 4. | 6 | 0.2923 | 7.51 |
| 5. | 7 | 0.2712 | 3.56 |
| 6. | 9 | 0.2082 | 11.96 |
| | | TOTAL EXECUTION TIME | 34.82 |

The proposed algorithm goes through its different runs and stops at:

maximum silhouette value = 0.443641 for best value of k = 3
Elapsed time is 30.551549 seconds.
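The Phase 3 decision rule that produced this run can be sketched as below. The function name `choose_k`, the `k_min` guard that ignores cluster counts below 2, and the fallback used when none of the three conditions holds (ties, which the steps do not cover) are assumptions of this sketch. The silhouette values of Table I, which come from the algorithm in [1], are used only as stand-in numbers for `silhouette_for`.

```python
def choose_k(silhouette_for, k0, k_min=2):
    """Pick the number of clusters by the Phase 3 rule.

    silhouette_for(k) stands for running Phases 1 and 2 with k clusters and
    returning the overall silhouette width; results are cached per k.
    """
    cache = {}

    def s(k):
        if k < k_min:
            return float("-inf")          # assumption: too-small k never wins
        if k not in cache:
            cache[k] = silhouette_for(k)
        return cache[k]

    if s(k0 - 2) < s(k0) and s(k0) > s(k0 + 2):          # step 1: peak near k0
        candidates = [k0 - 1, k0, k0 + 1]
    elif s(k0 + 2) > s(k0) and s(k0 + 2) > s(k0 - 2):    # step 2: still rising
        candidates = [k0 + 1, k0 + 2, k0 + 3, k0 + 4]
    elif s(k0 + 2) < s(k0 - 2) and s(k0) < s(k0 - 2):    # step 3: falling
        candidates = [k0 - 1, k0 - 2, k0 - 3, k0 - 4]
    else:                                  # ties: not covered by the steps
        candidates = [k0 - 2, k0, k0 + 2]
    best_k = max(candidates, key=s)
    return best_k, s(best_k)


# Stand-in silhouette values copied from Table I (algorithm [1], wine dataset).
table_1 = {3: 0.4437, 4: 0.3691, 5: 0.3223, 6: 0.2923, 7: 0.2712, 9: 0.2082}
print(choose_k(table_1.__getitem__, 7))   # (3, 0.4437)
```

Starting from k = 7, the rule evaluates k = 5, 7 and 9, finds the silhouette falling, and then evaluates k = 6, 4 and 3. These are exactly the six values listed in Table I, and the selected k = 3 agrees with the stopping point reported above.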
12. 12. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 3, May – June (2013), © IAEME215Thus, from results it is clear that algorithm stops itself when it finds the right numberof clusters. It shows that proposed algorithm takes less time as compared to algorithm in [1].Table II shows results of execution time for both algorithms. It also shows initial value fed toour algorithm and where the best value of k is found the algorithm stops. Experiments areperformed on random datasets of 50, 178 and 500 points.TABLE II COMPARING RESULTS OF PROPOSED ALGORITHM &ALGORITHM IN [1]Sr.no.Dataset Initial value ofkBest value ofkProposedalgorithm time (s)Algorithm [1]time (s)1. 50 points 4 6 18.6084 28.00962. 178points9 10 39.1941 50.27263. 500points5 4 66.6134 91.9400Table III shows comparison results between original k-means, modified approach IIand proposed algorithm. When comparing execution times of proposed algorithm with otheralgorithms, it is seen that proposed algorithm takes much less time than the original k-Meansand for the large dataset such as dataset of 500 points; the proposed algorithm alsooutperforms the modified approach II.TABLE- III EXECUTION TIME(s) COMPARISONSr. No. Dataset Original k-MeansModified approachIIProposed algorithm1. 50 points 15.1727 11.4979 18.60842. 178 points 74.5168 21.6497 39.19413. 500 points 86.7619 87.2461 66.6134From all the results we can conclude that although procedure of proposed algorithm is longbut it prevents user from running the algorithm for different values of k as in other threealgorithms discussed in previous chapters. The proposed algorithm dynamically iterates andstops at best value of k.Figure I –III shows different silhouette plots for all 3 datasets of random pointsdiscussed above depicting how close a point to the other members of its own cluster is. 
The plots also show whether any point has been placed in the wrong cluster: this is the case when the silhouette index value for that point is negative. Figure 4 shows the execution time comparison of all the algorithms discussed in the paper.
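To make the role of the sign concrete, the sketch below computes the standard per-point silhouette value s(i) = (b − a) / max(a, b), where a is the mean distance from point i to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. It is a minimal pure-Python illustration, not the paper's MATLAB code; the function name `silhouette_values` and the toy data are assumptions made for the example.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def silhouette_values(points, labels):
    """Return the silhouette value s(i) = (b - a) / max(a, b) of every point."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own or len(clusters) < 2:
            values.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Toy data (hypothetical): the point (9,) lies next to cluster 1
# but has been labelled as a member of cluster 0.
points = [(0.0,), (1.0,), (9.0,), (10.0,), (11.0,)]
labels = [0, 0, 0, 1, 1]
scores = silhouette_values(points, labels)
print(scores[2] < 0)  # the mislabelled point has a negative silhouette
```

In this toy example only the mislabelled point receives a negative value; the other four points have positive values because each is closer, on average, to its own cluster than to the other one.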
13. Figure 1 Silhouette plot for 50 points

Figure 2 Silhouette plot for 178 points

Figure 3 Silhouette plot for 500 points
14. Figure 4 Execution time (s) comparison of different algorithms

Table IV compares all four algorithms discussed in the paper on the basis of the experimental results.

TABLE IV: COMPARISON OF ALGORITHMS

| Parameters | Original k-Means | Modified Approach I | Modified Approach II | Proposed Algorithm |
|---|---|---|---|---|
| Initial centers | Always random, and thus different clusters for the same value of k on the same data. | Fixed: initial centers are always chosen in the highest-frequency segment. | Initial centers are always chosen based on the similarity between points. | Fixed: centers are chosen in the highest-frequency segment; each center is the middle point of that segment's points, calculated from the origin. |
| Redundant data | Can work with data redundancies. | Suitable. | Not suitable for data with redundant points. | Can work with redundant data. |
| Dead unit problem | Yes. | No. | No. | No. |
| Value of k | Fixed input parameter. | Fixed input parameter. | Fixed input parameter. | Initial value given as input; the algorithm dynamically iterates and determines the best value of k for the given data. |
| Execution time | More. | Less than Original k-Means, but more than the other two algorithms. | Less than Original k-Means and Modified Approach I, but more than Proposed Algorithm. | Less than all other three algorithms. |
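The "Value of k" row is the main difference between the proposed algorithm and the others. This part of the paper does not spell out the exact iteration and stopping rule, so the Python sketch below is only one plausible reading: a hill-climb over candidate values of k that starts at the initial value and moves to a neighbouring k while the silhouette validity index keeps improving. The function name `dynamic_k`, its parameters and the stopping rule are assumptions. In a real run, `score(k)` would cluster the data into k clusters and return the silhouette validity index; here it simply looks up the values from Table I.

```python
def dynamic_k(score, candidates, k0):
    """Hill-climb from k0 over the sorted candidates; stop at a local maximum.

    score(k) is assumed to return the silhouette validity index for k clusters.
    """
    ks = sorted(candidates)
    cache = {}

    def cached(k):
        # Evaluate each k once, since every evaluation is a full clustering run.
        if k not in cache:
            cache[k] = score(k)
        return cache[k]

    i = ks.index(k0)
    while True:
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(ks)]
        if not neighbours:
            break
        j = max(neighbours, key=lambda n: cached(ks[n]))
        if cached(ks[j]) <= cached(ks[i]):
            break  # no neighbouring k improves the index
        i = j
    return ks[i], cached(ks[i])


# Silhouette validity index per k, copied from Table I (wine dataset).
table_i = {3: 0.4437, 4: 0.3691, 5: 0.3223, 6: 0.2923, 7: 0.2712, 9: 0.2082}

best_k, best_score = dynamic_k(table_i.__getitem__, table_i.keys(), k0=7)
print(best_k, best_score)
```

Started from k = 7, as in the wine experiment, this sketch steps down to k = 3, the value with the maximum index in Table I. A hill-climb of this kind can stop at a local maximum, so it is not guaranteed to find the globally best k.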
15. IX. CONCLUSION

In this paper we presented different approaches to k-Means clustering, compared them, and highlighted their pros and cons. Another issue discussed in the paper is clustering validity, which we measured using the silhouette validity index. For a given dataset, this index shows which value of k produces compact and well-separated clusters. The paper presents a new method for selecting initial centers and a dynamic approach to k-Means clustering, so that the user need not check the clusters for different values of k. Instead, the user inputs an initial value of k and the algorithm stops after it finds the best value of k, i.e. when it attains the maximum silhouette value. Experiments also show that the proposed dynamic algorithm takes much less computation time than the other three algorithms discussed in the paper.

X. ACKNOWLEDGEMENTS

I am eternally grateful to my research supervisor Dr. M.P.S Bhatia for his invigorating support, valuable suggestions and guidance. I thank him for his supervision and for patiently correcting my work. It has been a great experience and I have gained a lot here. Finally, I am thankful to almighty God, who has given me the strength, good sense and confidence to complete my research analysis successfully. I also thank my parents and my friends, who were a constant source of encouragement. I would also like to thank Navneet Singh for his appreciation.

REFERENCES

Proceedings Papers
[1] Ran Vijay Singh and M.P.S Bhatia, "Data Clustering with Modified K-means Algorithm", IEEE International Conference on Recent Trends in Information Technology, ICRTIT 2011, pp. 717-721.
[2] D. Napoleon and P. Ganga lakshmi, "An Efficient K-Means Clustering Algorithm for Reducing Time Complexity using Uniform Distribution Data Points", IEEE 2010.

Journal Papers
[3] Tajunisha and Saravanan, "Performance Analysis of k-means with different initialization methods for high dimensional data", International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 1, No. 4, October 2010.
[4] Neha Aggarwal and Kriti Aggarwal, "A Mid-point based k-mean Clustering Algorithm for Data Mining", International Journal on Computer Science and Engineering (IJCSE), 2012.
[5] Barileé Barisi Baridam, "More work on k-means Clustering algorithm: The Dimensionality Problem", International Journal of Computer Applications (0975-8887), Volume 44, No. 2, April 2012.

Proceedings Papers
[6] Shi Na, Li Xumin, Guan Yong, "Research on K-means clustering algorithm", Proc. of Third International Symposium on Intelligent Information Technology and Security Informatics, IEEE 2010.
16. [7] Ahamad Shafeeq and Hareesha, "Dynamic clustering of data with modified K-mean algorithm", Proc. International Conference on Information and Computer Networks (ICICN 2012), IPCSIT vol. 27 (2012), IACSIT Press, Singapore, 2012.

Research
[8] Kohei Arai and Ali Ridho Barakbah, "Hierarchical K-means: an algorithm for centroids initialization for k-Means", Reports of the Faculty of Science and Engineering, Saga University, Vol. 26, No. 1, 2007.

Books
[1] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques (Second Edition).