K-means++

K-means++ implementation for MLDemos



  1. K-means++ Seeding Algorithm: Implementation in MLDemos
     Renaud Richardet, Brain Mind Institute,
     École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
     renaud.richardet@epfl.ch
  2. K-means
     • K-means: a widely used clustering technique
     • Initialization: blind random selection among the input data points
     • Drawback: very sensitive to the choice of initial cluster centers (seeds)
     • A local optimum can be arbitrarily bad with respect to the objective function, compared to the globally optimal clustering
  3. K-means++
     • A seeding technique for k-means, from Arthur and Vassilvitskii [2007]
     • Idea: spread the k initial cluster centers away from each other
     • O(log k)-competitive with the optimal clustering
     • Substantial convergence-time speedups (empirical)
  4. Algorithm
     • c ∈ C: cluster center
     • x ∈ X: data point
     • D(x): distance between x and the nearest cluster center that has already been chosen
     • Each new center is drawn from X with probability proportional to D(x)² [Arthur and Vassilvitskii, 2007]
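A minimal Python sketch of this seeding step (illustrative only, not the MLDemos C++ code; all function and variable names are mine). It implements the D(x)²-proportional draw, which surfaces as the "distSqSum" and "random index" values in the sample output on slides 8 and 9:

import random

def squared_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeanspp_seeds(points, k, rng=random):
    # First center: uniform random pick among the data points.
    centers = [rng.choice(points)]
    # min_dist[i] = D(x_i)^2, squared distance to the nearest chosen center.
    min_dist = [squared_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # The "distSqSum" and "random index" of the trace on slides 8-9:
        dist_sq_sum = sum(min_dist)
        r = rng.uniform(0, dist_sq_sum)
        # Walk the cumulative D(x)^2 mass until it covers r.
        acc = 0.0
        idx = len(min_dist) - 1  # fallback for floating-point edge cases
        for i, d in enumerate(min_dist):
            acc += d
            if acc >= r:
                idx = i
                break
        centers.append(points[idx])
        # Update D(x)^2 against the newly chosen center.
        min_dist = [min(d, squared_dist(p, points[idx]))
                    for p, d in zip(points, min_dist)]
    return centers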
  5. Implementation
     • Based on Apache Commons Math's KMeansPlusPlusClusterer and Arthur's [2007] implementation
     • Implemented directly in MLDemos' core
  6. Implementation Test Dataset: 4 squares (n=16)
  7. Expected: 4 nice clusters
  8. Sample Output
     1: first cluster center 0 at rand: x=4 [-2.0; 2.0]
     1: initial minDist for 0  [-1.0;-1.0] = 10.0
     1: initial minDist for 1  [ 2.0; 1.0] = 17.0
     1: initial minDist for 2  [ 1.0;-1.0] = 18.0
     1: initial minDist for 3  [-1.0;-2.0] = 17.0
     1: initial minDist for 5  [ 2.0; 2.0] = 16.0
     1: initial minDist for 6  [ 2.0;-2.0] = 32.0
     1: initial minDist for 7  [-1.0; 2.0] =  1.0
     1: initial minDist for 8  [-2.0;-2.0] = 16.0
     1: initial minDist for 9  [ 1.0; 1.0] = 10.0
     1: initial minDist for 10 [ 2.0;-1.0] = 25.0
     1: initial minDist for 11 [-2.0;-1.0] =  9.0
     […]
     2: picking cluster center 1 --------------
     3:   distSqSum=3345.0
     3:   random index 1532.706909
     4:   new cluster point: x=6 [2.0;-2.0]
  9. Sample Output (2)
     4:   updating minDist for 0  [-1.0;-1.0] = 10.0
     4:   updating minDist for 1  [ 2.0; 1.0] =  9.0
     4:   updating minDist for 2  [ 1.0;-1.0] =  2.0
     4:   updating minDist for 3  [-1.0;-2.0] =  9.0
     4:   updating minDist for 5  [ 2.0; 2.0] = 16.0
     4:   updating minDist for 7  [-1.0; 2.0] = 25.0
     4:   updating minDist for 8  [-2.0;-2.0] = 16.0
     4:   updating minDist for 9  [ 1.0; 1.0] = 10.0
     4:   updating minDist for 10 [ 2.0;-1.0] =  1.0
     4:   updating minDist for 11 [-2.0;-1.0] = 17.0
     […]
     2: picking cluster center 2 --------------
     3:   distSqSum=961.0
     3:   random index 103.404701
     4:   new cluster point: x=1 [2.0;1.0]
     4:   updating minDist for 0  [-1.0;-1.0] = 13.0
     […]
  10. Evaluation on Test Dataset
     • 200 clustering runs each, with and without k-means++ initialization
     • Measure RSS (intra-class variance)
     • K-means reached the optimal clustering 115 times (57.5%)
     • K-means++ reached the optimal clustering 182 times (91%)
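A sketch of how such an evaluation could look, reusing squared_dist and kmeanspp_seeds from the sketch above. This is not the actual MLDemos evaluation code: the 4-squares coordinates are reconstructed from those visible in the sample output, lloyd() is a plain textbook k-means loop, and the best RSS found across runs stands in for the true optimum.

def lloyd(points, centers, iters=50):
    # Standard k-means: alternate assignment and mean-update steps.
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)),
                    key=lambda j: squared_dist(p, centers[j]))
            clusters[j].append(p)
        centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
                   for cl, c in zip(clusters, centers)]
    return centers

def rss(points, centers):
    # Residual sum of squares: squared distance to the nearest center.
    return sum(min(squared_dist(p, c) for c in centers) for p in points)

# 4-squares dataset (reconstruction): 4 points per quadrant, n=16.
points = [(sx * a, sy * b) for sx in (1, -1) for sy in (1, -1)
          for a in (1.0, 2.0) for b in (1.0, 2.0)]

rng = random.Random(42)
inits = {"k-means (random seeds)": lambda: rng.sample(points, 4),
         "k-means++": lambda: kmeanspp_seeds(points, 4, rng)}
for name, init in inits.items():
    results = [rss(points, lloyd(points, init())) for _ in range(200)]
    # Count runs that reached the best RSS observed for this method.
    optimal = sum(1 for v in results if abs(v - min(results)) < 1e-6)
    print(f"{name}: optimal clustering in {optimal}/200 runs")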
  11. Comparison of the frequency distribution of RSS values between k-means and k-means++ on the evaluation dataset (n=200)
  12. Evaluation on Real Dataset
     • UCI's Water Treatment Plant data set: daily sensor measurements from an urban waste water treatment plant (n=396, d=38)
     • Two samples of 500 clustering runs each, for k-means and for k-means++, with k=13; recorded the RSS
     • The difference is highly significant (P < 0.0001)
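The slides do not name the statistical test used. As an illustration only, a nonparametric two-sample test such as Mann-Whitney U (here via SciPy) could compare the two samples of recorded RSS values; the data below reuse the toy 4-squares sketch above as a stand-in for the actual UCI runs with k=13.

from scipy import stats

# Stand-in data: in the slides' experiment these would be the two lists
# of 500 RSS values recorded on the UCI dataset (names are mine).
rss_kmeans = [rss(points, lloyd(points, rng.sample(points, 4)))
              for _ in range(500)]
rss_kmeanspp = [rss(points, lloyd(points, kmeanspp_seeds(points, 4, rng)))
                for _ in range(500)]

u_stat, p_value = stats.mannwhitneyu(rss_kmeans, rss_kmeanspp,
                                     alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, P = {p_value:.4g}")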
  13. Comparison of the frequency distribution of RSS values between k-means and k-means++ on the UCI real-world dataset (n=500)
  14. Alternative Seeding Algorithms
     • There is extensive research into seeding techniques for k-means.
     • Steinley [2007] evaluated 12 different techniques (omitting k-means++) and recommends multiple random starting points for general use.
     • Maitra [2011] evaluated 11 techniques (including k-means++) and was unable to provide recommendations when evaluating nine standard real-world datasets.
     • On simulated datasets, Maitra recommends Milligan's [1980] or Mirkin's [2005] seeding technique, and Bradley's [1998] when the dataset is very large.
  15. Conclusions and Future Work
     • Using a synthetic test dataset and a real-world dataset, we showed that our implementation of the k-means++ seeding procedure in the MLDemos software package yields a significant reduction in RSS.
     • A short literature survey revealed that many seeding procedures exist for k-means, and that some alternatives to k-means++ might yield even larger improvements.
  16. References
     • Arthur, D. & Vassilvitskii, S.: “k-means++: The advantages of careful seeding”. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027–1035 (2007).
     • Bahmani, B., Moseley, B., Vattani, A., Kumar, R. & Vassilvitskii, S.: “Scalable K-Means++”. Unpublished working paper available at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf (2012).
     • Bradley, P. S. & Fayyad, U. M.: “Refining initial points for K-Means clustering”. Proc. 15th International Conf. on Machine Learning, 91–99 (1998).
     • Maitra, R., Peterson, A. D. & Ghosh, A. P.: “A systematic evaluation of different methods for initializing the K-means clustering algorithm”. Unpublished working paper available at http://apghosh.public.iastate.edu/files/IEEEclust2.pdf (2011).
     • Milligan, G. W.: “The validation of four ultrametric clustering algorithms”. Pattern Recognition, vol. 12, 41–50 (1980).
     • Mirkin, B.: “Clustering for data mining: A data recovery approach”. Chapman and Hall (2005).
     • Steinley, D. & Brusco, M. J.: “Initializing k-means batch clustering: A critical evaluation of several techniques”. Journal of Classification 24, 99–121 (2007).
