Slides: Céline Beji — AS 136: A K-Means Clustering Algorithm

Presentation given by Céline Beji at the Reading Seminar on Statistical Classics, Oct. 22, 2012


  1. Reading Seminar on Classics — presented by Céline Beji. Article AS 136: A K-Means Clustering Algorithm. Suggested by C. Robert.
  2. Plan: 1. Introduction; 2. Algorithm; 3. Demonstration; 4. Convergence and Time; 5. Considerable Progress; 6. The Limits of the Algorithm; 7. Conclusion.
  9. Introduction
  10. Presentation of the article. AS 136: A K-Means Clustering Algorithm. Authors: J. A. Hartigan and M. A. Wong. Source: Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 28, No. 1 (1979), pp. 100-108. Implemented in FORTRAN.
  11. Clustering. Clustering is the classical problem of dividing a data sample in some space into a collection of disjoint groups.
  12. The aim of the algorithm. Divide M points in N dimensions into K clusters so that the within-cluster sum of squares is minimized.
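The objective on this slide fits in a few lines of code. A minimal Python/NumPy sketch (the article's own implementation is FORTRAN; the function name and toy data here are illustrative):

```python
import numpy as np

def within_cluster_ss(points, labels, centers):
    """Within-cluster sum of squares: the quantity k-means minimizes."""
    return sum(
        np.sum((points[labels == k] - centers[k]) ** 2)
        for k in range(len(centers))
    )

# Two tight clusters around (0, 0.5) and (10, 10.5).
pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 0.5], [10.0, 10.5]])
print(within_cluster_ss(pts, labels, centers))  # → 1.0
```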
  13. Example of application. The algorithm applies even to large data sets, in fields ranging from market segmentation to image processing, geostatistics, ...
  14. Example of application: market segmentation. It is necessary to segment the customers in order to know more precisely the needs and expectations of each group. M: number of people; N: number of criteria (age, sex, social status, etc.); K: number of clusters. Figure: market segmentation.
  15. Algorithm
  16. Figure: general schema of the algorithm.
  18. Initialization. 1. For each point I (I = 1, 2, ..., M), find its closest and second-closest cluster centres and keep them in IC1(I) and IC2(I), respectively. 2. Update the cluster centres to be the average of the points contained within them. 3. Put all clusters in the live set.
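The first two initialization steps can be sketched as follows; this is an illustrative Python/NumPy version, not the article's FORTRAN, and the function names are made up:

```python
import numpy as np

def init_assignments(points, centers):
    """Step 1: for each point, find the closest (IC1) and second-closest (IC2) centre."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    order = np.argsort(d2, axis=1)
    return order[:, 0], order[:, 1]

def update_centers(points, ic1, k):
    """Step 2: each centre becomes the average of the points assigned to it."""
    return np.array([points[ic1 == l].mean(axis=0) for l in range(k)])

pts = np.array([[0.0], [1.0], [9.0], [10.0]])
ic1, ic2 = init_assignments(pts, np.array([[0.0], [10.0]]))
print(ic1, ic2)                      # → [0 0 1 1] [1 1 0 0]
print(update_centers(pts, ic1, 2))   # → [[0.5] [9.5]]
```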
  22. The live set. Consider each point I (I = 1, 2, ..., M) and let point I be in cluster L1. L1 is in the live set if the cluster was updated in the quick-transfer (QTRAN) stage; it leaves the live set once it has not been updated in the last M optimal-transfer (OPTRA) steps.
  26. Optimal-transfer stage (OPTRA). Compute the minimum over all clusters L (L ≠ L1) of the quantity R2 = NC(L) · D(I, L)² / (NC(L) + 1) (1), where NC(L) is the number of points in cluster L and D(I, L) is the Euclidean distance between point I and the centre of cluster L. Let L2 be the cluster with the smallest R2.
  27. Optimal-transfer stage (OPTRA). If NC(L2) · D(I, L2)² / (NC(L2) + 1) ≥ NC(L1) · D(I, L1)² / (NC(L1) + 1) (2), no reallocation takes place and L2 becomes the new IC2(I). If NC(L2) · D(I, L2)² / (NC(L2) + 1) < NC(L1) · D(I, L1)² / (NC(L1) + 1) (3), point I is allocated to cluster L2 and L1 becomes the new IC2(I); the cluster centres are updated to be the means of the points assigned to them, and the two clusters involved in the transfer of point I at this step are placed in the live set.
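One OPTRA decision for a single point can be sketched with the slides' quantity NC(L) · D(I, L)² / (NC(L) + 1) applied to every cluster; this Python sketch with invented names and toy data moves the point when the best other cluster is cheaper:

```python
import numpy as np

def optra_step(point, l1, centers, counts):
    """One optimal-transfer decision for a point currently in cluster l1.
    Returns the cluster the point should belong to after the step."""
    def r(l):
        # The slides' criterion: NC(L) * D(I, L)^2 / (NC(L) + 1).
        return counts[l] * np.sum((point - centers[l]) ** 2) / (counts[l] + 1)
    # Best candidate L2 among all clusters other than l1.
    r2, l2 = min((r(l), l) for l in range(len(centers)) if l != l1)
    return l2 if r2 < r(l1) else l1

centers = np.array([[0.0], [10.0]])
counts = [2, 2]
print(optra_step(np.array([9.0]), 0, centers, counts))  # → 1 (transfer)
print(optra_step(np.array([1.0]), 0, centers, counts))  # → 0 (stay)
```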
  33. The quick-transfer stage (QTRAN). Consider each point I (I = 1, 2, ..., M) in turn and let L1 = IC1(I) and L2 = IC2(I). If NC(L1) · D(I, L1)² / (NC(L1) + 1) < NC(L2) · D(I, L2)² / (NC(L2) + 1) (4), point I remains in cluster L1; otherwise IC1 ↔ IC2, the centres of clusters L1 and L2 are updated, and it is noted that a transfer took place. This stage is repeated until no transfer takes place in the last M steps.
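The QTRAN check for a single point can be sketched likewise (Python, illustrative names): only IC1 and IC2 are compared, which is what makes this stage quick.

```python
import numpy as np

def qtran_step(point, ic1, ic2, centers, counts):
    """Quick-transfer check between the closest (IC1) and second-closest (IC2)
    clusters only; swaps them when the transfer lowers the criterion."""
    def r(l):
        # The slides' criterion: NC(L) * D(I, L)^2 / (NC(L) + 1).
        return counts[l] * np.sum((point - centers[l]) ** 2) / (counts[l] + 1)
    if r(ic1) < r(ic2):
        return ic1, ic2, False        # point stays in IC1, no transfer
    return ic2, ic1, True             # IC1 <-> IC2: a transfer took place

centers = np.array([[0.0], [10.0]])
print(qtran_step(np.array([9.0]), 0, 1, centers, [2, 2]))  # → (1, 0, True)
```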
  39. Demonstration
  40. Demonstration (figures only).
  47. Convergence and Time
  48. Convergence. The algorithm converges, but it produces a clustering that is only locally optimal! Figure: a typical example of k-means converging to a local optimum.
  49. Time. The running time is approximately equal to C·M·N·K·I, where C depends on the speed of the computer (≈ 2.1 × 10⁻⁵ s for an IBM 370/158), M is the number of points, N the number of dimensions, K the number of clusters, and I the number of iterations (usually less than 10).
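A back-of-the-envelope use of this estimate, for a hypothetical problem size (only the constant C is from the slides):

```python
# T ≈ C*M*N*K*I, with C = 2.1e-5 s quoted for an IBM 370/158.
C = 2.1e-5
M, N, K, I = 1000, 10, 5, 10   # hypothetical problem size
T = C * M * N * K * I
print(T)  # → 10.5 (seconds, on that 1979-era machine)
```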
  50. Considerable Progress
  51. Related algorithm. AS 113: A transfer algorithm for non-hierarchical classification (Banfield and Bassil, 1977). It uses swaps as well as transfers to try to overcome the problem of local optima, but it is too expensive to use when the data set is large.
  52. Related algorithm. AS 58: Euclidean cluster analysis (Sparks, 1973). It finds a K-partition of the sample with small within-cluster sum of squares, but only the closest centre is used to check for possible reallocation of a given point, so it does not provide a locally optimal solution. The k-means algorithm saves about 50 per cent in time compared to it, thanks to the "live" sets and to the quick-transfer stage, which reduces the number of optimal-transfer iterations by a factor of 4.
  53. The Limits of the Algorithm
  54. The selection of the initial cluster centres. Figure: the choice of the initial cluster centres affects the final results.
  55. The selection of the initial cluster centres: the solution to the problem. Proposed in the article: K sample points are chosen as the initial cluster centres. The points are first ordered by their distances to the overall mean of the sample; then, for cluster L (L = 1, 2, ..., K), the 1 + (L − 1) · [M/K]-th point is chosen as its initial cluster centre (this guarantees that no cluster will be empty). Commonly used: the cluster centres are chosen randomly, the algorithm is run several times, and the clustering with the minimum sum of squared errors is selected.
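The article's selection rule can be sketched in a few lines; this is an illustrative Python/NumPy version (the 1-based rule 1 + (L − 1) · [M/K] becomes 0-based indices here, and ties are broken with a stable sort):

```python
import numpy as np

def initial_centres(points, k):
    """The article's rule: order points by distance to the overall mean, then
    take the 1 + (L-1)*[M/K]-th point (1-based) as centre of cluster L."""
    m = len(points)
    order = np.argsort(((points - points.mean(axis=0)) ** 2).sum(axis=1),
                       kind="stable")
    idx = [(l - 1) * (m // k) for l in range(1, k + 1)]  # 0-based indices
    return points[order[idx]]

pts = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
print(initial_centres(pts, 2))  # → [[2.] [4.]]
```

Because the chosen indices are spread evenly through the ordering, each centre starts near a distinct band of the data, which is why no cluster can start empty.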
  56. The selection of the number of clusters. The solution to the problem: the mean-shift algorithm. Mean shift is similar to k-means in that it maintains a set of data points that are iteratively replaced by means, but there is no need to choose the number of clusters, because mean shift will find only a few clusters if indeed only a small number exist.
  57. Sensitivity to noise. The solution to the problem: a k-means algorithm with weights. This algorithm is the same as k-means, but each point is weighted so as to give a lower weight to points affected by high noise. The centre of cluster L: C(L) = Σ_I WH(I) · X_I / WHC(L) (5). The quantity computed: R = WHC(L) · D(I, L)² / (WHC(L) − WH(I)) (6), where WH(I) is the weight of point I and WHC(L) is the total weight of cluster L.
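Formula (5) in code, as a minimal Python/NumPy sketch (the function name and toy weights are illustrative): a down-weighted noisy point pulls the centre less.

```python
import numpy as np

def weighted_centre(points, weights):
    """Weighted cluster centre, formula (5): C(L) = sum_I WH(I)*X_I / WHC(L),
    where WHC(L) is the total weight of the cluster."""
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * points).sum(axis=0) / w.sum()

pts = np.array([[0.0], [10.0]])
print(weighted_centre(pts, [3.0, 1.0]))  # → [2.5] (the low-weight point pulls less)
```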
  58. Conclusion
  59. Conclusion. This algorithm has been a real revolution for clustering methods: it produces good results with a high speed of convergence. The algorithm has flaws, but many of the algorithms implemented today arise from it.
  60. References. J. A. Hartigan and M. A. Wong (1979), Algorithm AS 136: A K-Means Clustering Algorithm. Appl. Statist., 28, 100-108. http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html http://en.wikipedia.org/wiki/K-means_clustering
  61. Thank you for your attention!
