1. READING SEMINAR ON CLASSICS
PRESENTED BY
BÉJI CÉLINE
Article
AS 136: A K-Means Clustering Algorithm
Suggested by C. Robert
2. Plan
1 INTRODUCTION
2 ALGORITHM
3 DEMONSTRATION
4 CONVERGENCE AND TIME
5 CONSIDERABLE PROGRESS
6 THE LIMITS OF THE ALGORITHM
7 CONCLUSION
9. INTRODUCTION
10. INTRODUCTION
PRESENTATION OF THE ARTICLE
AS 136: A K-Means Clustering Algorithm
Authors: J. A. Hartigan and M. A. Wong
Source: Journal of the Royal Statistical Society, Series C
(Applied Statistics), Vol. 28, No. 1 (1979), pp. 100-108
Implemented in FORTRAN
11. INTRODUCTION
CLUSTERING
Clustering is the classical problem of dividing a data sample in
some space into a collection of disjoint groups.
12. INTRODUCTION
THE AIM OF THE ALGORITHM
Divide M points in N dimensions into K clusters so that the
within-cluster sum of squares is minimized.
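As a sketch of this objective, here is a minimal Python version (NumPy is assumed; the name `wcss` is illustrative, not from the article):

```python
import numpy as np

def wcss(points, labels, centres):
    """Within-cluster sum of squares: for every point, the squared
    Euclidean distance to its assigned cluster centre, summed up."""
    diffs = points - centres[labels]      # (M, N) offsets to each point's own centre
    return float((diffs ** 2).sum())

# M=4 points in N=1 dimension split into K=2 clusters:
pts = np.array([[0.0], [1.0], [10.0], [11.0]])
lab = np.array([0, 0, 1, 1])
cen = np.array([[0.5], [10.5]])           # the cluster means
total = wcss(pts, lab, cen)               # each point is 0.5 from its mean -> 1.0
```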
13. INTRODUCTION
EXAMPLES OF APPLICATION
The algorithm applies even to large data sets,
in fields ranging from:
Market segmentation
Image processing
Geostatistics
...
14. INTRODUCTION
EXAMPLE OF APPLICATION
MARKET SEGMENTATION
It is necessary to segment the customers in order to know more
precisely the needs and expectations of each group.
M: number of people
N: number of criteria (age, sex, social status, etc.)
K: number of clusters
FIGURE: market segmentation
15. ALGORITHM
18. ALGORITHM
INITIALIZATION
1 For each point I (I=1,2,...,M), find its closest and
second-closest cluster centres and keep them in IC1(I) and
IC2(I), respectively.
2 Update the cluster centres to be the averages of the points
contained within them.
3 Put all clusters in the live set.
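The initialization steps can be sketched as follows (a NumPy sketch, not the FORTRAN original; it assumes no cluster ends up empty, and step 3, putting all clusters in the live set, is plain bookkeeping and omitted):

```python
import numpy as np

def initialise(points, centres):
    """Step 1: record each point's closest (IC1) and second-closest (IC2)
    cluster centres. Step 2: move every centre to the mean of the points
    assigned to it. Assumes K >= 2 and that no cluster ends up empty."""
    # (M, K) matrix of squared Euclidean distances to every centre.
    d2 = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    order = np.argsort(d2, axis=1)
    ic1, ic2 = order[:, 0], order[:, 1]
    new_centres = np.array([points[ic1 == l].mean(axis=0)
                            for l in range(len(centres))])
    return ic1, ic2, new_centres

ic1, ic2, new_cen = initialise(np.array([[0.0], [1.0], [10.0], [11.0]]),
                               np.array([[0.0], [10.0]]))
```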
22. ALGORITHM
THE LIVE SET
Consider each point I (I=1,2,...,M), and let point I be in cluster L1.
L1 is in the live set if the cluster was updated in the last
quick-transfer (QTRAN) stage.
L1 leaves the live set otherwise, once it has not been updated
in the last M optimal-transfer (OPTRA) steps.
26. ALGORITHM
OPTIMAL-TRANSFER STAGE (OPTRA)
Compute the minimum of the quantity, over all clusters L (L ≠ L1),
R2 = NC(L) * D(I, L)^2 / (NC(L) + 1)    (1)
NC(L): the number of points in cluster L
D(I, L): the Euclidean distance between point I and the centre of cluster L
Let L2 be the cluster with the smallest R2.
27. ALGORITHM
OPTIMAL-TRANSFER STAGE (OPTRA)
If
NC(L1) * D(I, L1)^2 / (NC(L1) + 1) < NC(L2) * D(I, L2)^2 / (NC(L2) + 1)    (2)
no reallocation takes place, and L2 is the new IC2(I).
If
NC(L1) * D(I, L1)^2 / (NC(L1) + 1) >= NC(L2) * D(I, L2)^2 / (NC(L2) + 1)    (3)
point I is allocated to cluster L2, and L1 is the new IC2(I).
Cluster centres are updated to be the means of the points
assigned to them.
The two clusters involved in the transfer of point I at
this particular step are placed in the live set.
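A minimal sketch of one optimal-transfer decision, assuming NumPy and following the slides' simplified criterion (note that the published AS 136 routine uses NC(L1) − 1 in the source-cluster term; `optra_step` is an illustrative name):

```python
import numpy as np

def optra_step(points, centres, counts, i, l1):
    """One optimal-transfer decision for point i, currently in cluster l1.
    Finds the cluster L2 minimising NC(L)*D(i,L)^2/(NC(L)+1) over L != l1
    and reports whether i should move there."""
    d2 = ((centres - points[i]) ** 2).sum(axis=1)   # squared distances to all centres
    r = counts * d2 / (counts + 1.0)
    candidates = r.copy()
    candidates[l1] = np.inf                          # exclude the current cluster
    l2 = int(np.argmin(candidates))
    move = bool(r[l1] >= candidates[l2])             # transfer when leaving is no cheaper
    return l2, move

# Point 9.0 sits in cluster 0 but is far closer to cluster 1's centre:
pts = np.array([[0.0], [1.0], [9.0], [10.0], [11.0]])
cen = np.array([[0.5], [10.5]])
cnt = np.array([3.0, 2.0])
l2, move = optra_step(pts, cen, cnt, i=2, l1=0)
```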
33. ALGORITHM
THE QUICK-TRANSFER STAGE (QTRAN)
Consider each point I (I=1,2,...,M) in turn.
Let L1 = IC1(I) and L2 = IC2(I).
If
NC(L1) * D(I, L1)^2 / (NC(L1) + 1) < NC(L2) * D(I, L2)^2 / (NC(L2) + 1)    (4)
then point I remains in cluster L1.
Otherwise:
IC1 ↔ IC2
The centres of clusters L1 and L2 are updated.
It is noted that a transfer took place.
This step is repeated until no transfer has taken place in the last M
steps.
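The quick-transfer loop can be sketched like this (NumPy assumed; a simplification of the FORTRAN routine, without the live-set bookkeeping):

```python
import numpy as np

def qtran(points, ic1, ic2, centres, counts):
    """Quick-transfer pass (sketch): each point is compared only against
    its closest (IC1) and second-closest (IC2) recorded clusters; when
    criterion (4) fails, the two labels are swapped and the two affected
    centres recomputed. Stops once no transfer occurred in the last M checks."""
    m = len(points)
    since_transfer, i = 0, 0
    while since_transfer < m:
        l1, l2 = ic1[i], ic2[i]
        d2 = ((centres[[l1, l2]] - points[i]) ** 2).sum(axis=1)
        r1 = counts[l1] * d2[0] / (counts[l1] + 1.0)
        r2 = counts[l2] * d2[1] / (counts[l2] + 1.0)
        if r1 < r2:
            since_transfer += 1          # point stays in its cluster
        else:
            ic1[i], ic2[i] = l2, l1      # IC1 <-> IC2
            counts[l1] -= 1
            counts[l2] += 1
            for l in (l1, l2):           # update only the two affected centres
                centres[l] = points[ic1 == l].mean(axis=0)
            since_transfer = 0
        i = (i + 1) % m
    return ic1, centres

# Point 10.0 starts misassigned to cluster 0 and is quickly transferred:
pts = np.array([[0.0], [1.0], [10.0], [11.0]])
ic1 = np.array([0, 0, 0, 1])
ic2 = np.array([1, 1, 1, 0])
cen = np.array([pts[:3].mean(axis=0), pts[3:].mean(axis=0)])
cnt = np.array([3.0, 1.0])
ic1, cen = qtran(pts, ic1, ic2, cen, cnt)
```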
39. DEMONSTRATION
47. CONVERGENCE AND TIME
48. CONVERGENCE AND TIME
CONVERGENCE
The algorithm converges, but it produces a clustering
which is only locally optimal!
FIGURE: A typical example of k-means converging to a local
optimum
49. CONVERGENCE AND TIME
TIME
The running time is approximately equal to
C * M * N * K * I
C: depends on the speed of the computer (= 2.1 × 10^-5 sec
for an IBM 370/158)
M: the number of points
N: the number of dimensions
K: the number of clusters
I: the number of iterations (usually less than 10)
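Plugging values into this formula (the constant C is the slide's IBM 370/158 figure; the workload numbers M, N, K, I are made up for illustration):

```python
def kmeans_time_estimate(c, m, n, k, i):
    """Rough running-time estimate C*M*N*K*I from the slide."""
    return c * m * n * k * i

# Hypothetical workload: M=1000 points, N=10 dimensions,
# K=5 clusters, I=10 iterations -> roughly 10.5 seconds.
t = kmeans_time_estimate(2.1e-5, 1000, 10, 5, 10)
```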
50. CONSIDERABLE PROGRESS
51. CONSIDERABLE PROGRESS
RELATED ALGORITHM
AS 113: A TRANSFER ALGORITHM FOR NON-HIERARCHICAL
CLASSIFICATION
(by Banfield and Bassil in 1977)
It uses swaps as well as transfers to try to overcome the
problem of local optima.
It is too expensive to use if the size of the data set is
significant.
52. CONSIDERABLE PROGRESS
RELATED ALGORITHM
AS 58: EUCLIDEAN CLUSTER ANALYSIS
(by Sparks in 1973)
It finds a K-partition of the sample with a small within-cluster
sum of squares.
Only the closest centre is used to check for possible
reallocation of a given point, so it does not provide a locally
optimal solution.
A saving of about 50 per cent in time occurs in the
K-means algorithm due to the use of "live" sets and of a
quick-transfer stage, which reduces the number of
optimal-transfer iterations by a factor of 4.
53. THE LIMITS OF THE ALGORITHM
54. THE LIMITS OF THE ALGORITHM
THE SELECTION OF THE INITIAL CLUSTER CENTRES
FIGURE: The choice of the initial cluster centres affects the final results
55. THE LIMITS OF THE ALGORITHM
THE SELECTION OF THE INITIAL CLUSTER CENTRES
THE SOLUTION TO THE PROBLEM
Proposed in the article: K sample points are chosen as
the initial cluster centres.
The points are first ordered by their distances to the overall
mean of the sample. Then, for cluster L (L = 1,2,...,K), the
{1 + (L-1) * [M/K]}th point is chosen to be its initial cluster
centre (this guarantees that no cluster will be empty).
Commonly used: the cluster centres are chosen
randomly.
The algorithm is run several times, and the clustering with the
minimum sum of squared errors is selected.
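The article's deterministic seeding rule can be sketched as (NumPy assumed; `initial_centres` is an illustrative name):

```python
import numpy as np

def initial_centres(points, k):
    """Initial centres as proposed in the article: order the M points by
    distance to the overall sample mean, then take the
    {1 + (L-1)*[M/K]}-th point (1-based) for each cluster L = 1..K."""
    m = len(points)
    d2 = ((points - points.mean(axis=0)) ** 2).sum(axis=1)
    order = np.argsort(d2)
    idx = [(l - 1) * (m // k) for l in range(1, k + 1)]   # 0-based ranks
    return points[order[idx]]

cen = initial_centres(np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [10.0]]), 2)
```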
56. THE LIMITS OF THE ALGORITHM
THE SELECTION OF THE NUMBER OF CLUSTERS
THE SOLUTION TO THE PROBLEM: THE MEAN SHIFT ALGORITHM
Mean shift is similar to K-means in that it maintains a set of
data points that are iteratively replaced by means.
But there is no need to choose the number of clusters, because
mean shift is likely to find only a few clusters if indeed only a
small number exist.
57. THE LIMITS OF THE ALGORITHM
SENSITIVITY TO NOISE
THE SOLUTION TO THE PROBLEM: A K-MEANS ALGORITHM
USING WEIGHTS
This algorithm is the same as k-means, but each variable is
weighted, assigning a lower weight to the variables affected by high
noise.
The centre of cluster L (summing over the points I in cluster L):
C(L) = Σ WH(I) * XI / WHC(L)    (5)
The quantity computed:
R = WHC(L) * D(I, L)^2 / (WHC(L) - WH(I))    (6)
WH(I): the weight of each point I
WHC(L): the weight of each cluster L
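Equation (5) can be sketched as (NumPy assumed; `weighted_centre` is an illustrative name, and the cluster weight WHC(L) is taken as the sum of its points' weights):

```python
import numpy as np

def weighted_centre(points, weights):
    """Weighted cluster centre, as in eq. (5):
    C(L) = sum_I WH(I) * X_I / WHC(L), with WHC(L) the cluster's
    total weight (sum of its points' weights)."""
    whc = weights.sum()
    return (weights[:, None] * points).sum(axis=0) / whc

# A reliable point with weight 3 pulls the centre far more than a
# noisy point with weight 1:
c = weighted_centre(np.array([[0.0], [10.0]]), np.array([3.0, 1.0]))
```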
58. CONCLUSION
59. CONCLUSION
CONCLUSION
This algorithm has been a real revolution for clustering
methods: it produces good results and converges quickly.
The algorithm has flaws, but many algorithms
implemented today arise from it.
60. CONCLUSION
REFERENCES
J. A. Hartigan and M. A. Wong (1979) Algorithm AS 136: A
K-Means Clustering Algorithm. Appl. Statist., 28, 100-108.
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
http://en.wikipedia.org/wiki/K-means_clustering