K-means++
K-means++ implementation for MLDemos

    Presentation Transcript

    • K-means++ Seeding Algorithm, Implementation in MLDemos
      Renaud Richardet, Brain Mind Institute,
      Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
      renaud.richardet@epfl.ch
    • K-means
      • K-means: widely used clustering technique
      • Initialization: blind random on the input data
      • Drawback: very sensitive to the choice of initial cluster centers (seeds)
      • The local optimum it finds can be arbitrarily bad with respect to the objective function, compared to the globally optimal clustering (see the sketch below)
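    To make the drawback concrete, here is a minimal C++ sketch of one Lloyd (k-means) iteration. It is illustrative only, not the MLDemos code; the Point struct and all names are assumptions. The loop only refines whatever initial centers it is given, so a poor seeding is never escaped.

      #include <cstddef>
      #include <vector>

      struct Point { double x, y; };

      static double sqDist(const Point& a, const Point& b) {
          double dx = a.x - b.x, dy = a.y - b.y;
          return dx * dx + dy * dy;
      }

      // One Lloyd step: assign every point to its nearest center, then move
      // each center to the mean of its assigned points. Iterating this to
      // convergence yields a local optimum determined by the initial centers.
      void lloydStep(const std::vector<Point>& data, std::vector<Point>& centers) {
          std::vector<Point> sums(centers.size(), Point{0.0, 0.0});
          std::vector<std::size_t> counts(centers.size(), 0);
          for (const Point& p : data) {
              std::size_t best = 0;
              for (std::size_t c = 1; c < centers.size(); ++c)
                  if (sqDist(p, centers[c]) < sqDist(p, centers[best])) best = c;
              sums[best].x += p.x;
              sums[best].y += p.y;
              ++counts[best];
          }
          for (std::size_t c = 0; c < centers.size(); ++c)
              if (counts[c] > 0)
                  centers[c] = Point{sums[c].x / counts[c], sums[c].y / counts[c]};
      }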
    • K-means++
      • A seeding technique for k-means, from Arthur and Vassilvitskii [2007]
      • Idea: spread the k initial cluster centers away from each other
      • O(log k)-competitive with the optimal clustering
      • Substantial convergence-time speedups (empirical)
    • Algorithm
      c ∈ C: cluster center
      x ∈ X: data point
      D(x): distance between x and the nearest cluster center that has already been chosen
      After the first center is drawn uniformly at random, each subsequent center is chosen from X with probability proportional to D(x)² (sketched below).
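    The numbered steps in the sample output further below (1: initial distances, 2-3: the D(x)²-weighted draw, 4: the distance update) correspond to a seeding loop like the following. This is a minimal sketch of the D² weighting, not the actual MLDemos or Apache Commons Math source; it reuses the Point and sqDist helpers from the previous sketch and a plain std::rand() in place of a proper RNG.

      #include <cstdlib>
      #include <vector>

      std::vector<Point> seedKMeansPP(const std::vector<Point>& data, std::size_t k) {
          std::vector<Point> centers;
          // Step 1: pick the first center uniformly at random, then record
          // minDist[i] = D(x_i)^2, the squared distance from data[i] to the
          // nearest center chosen so far.
          centers.push_back(data[std::rand() % data.size()]);
          std::vector<double> minDist(data.size());
          for (std::size_t i = 0; i < data.size(); ++i)
              minDist[i] = sqDist(data[i], centers[0]);

          while (centers.size() < k) {
              // Steps 2-3: draw a uniform "random index" in [0, distSqSum) and
              // walk the cumulative sum of minDist until it is exceeded, so each
              // point is selected with probability proportional to D(x)^2.
              double distSqSum = 0.0;
              for (double d : minDist) distSqSum += d;
              double r = distSqSum * (std::rand() / (RAND_MAX + 1.0));
              std::size_t next = 0;
              for (double acc = minDist[0]; acc < r && next + 1 < data.size(); )
                  acc += minDist[++next];
              centers.push_back(data[next]);

              // Step 4: update each D(x)^2 against the newly chosen center.
              for (std::size_t i = 0; i < data.size(); ++i) {
                  double d = sqDist(data[i], centers.back());
                  if (d < minDist[i]) minDist[i] = d;
              }
          }
          return centers;
      }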
    • Implementation
      • Based on Apache Commons Math's KMeansPlusPlusClusterer and Arthur's [2007] implementation
      • Implemented directly in MLDemos' core
    • Implementation Test Dataset: 4 squares (n=16)
    • Expected: 4 nice clusters
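    The dataset itself is not listed on the slides, but the coordinates visible in the sample output below are consistent with four 2×2 grids of integer points, one per quadrant. A sketch of that assumed layout:

      // Assumed reconstruction of the 4-squares test dataset (n=16): four 2x2
      // grids of integer points; the exact layout is inferred from the sample
      // output, not taken from the slides.
      std::vector<Point> fourSquares() {
          std::vector<Point> data;
          const double corners[4][2] = {{1, 1}, {-2, 1}, {-2, -2}, {1, -2}};
          for (const auto& c : corners)
              for (int dx = 0; dx < 2; ++dx)
                  for (int dy = 0; dy < 2; ++dy)
                      data.push_back(Point{c[0] + dx, c[1] + dy});
          return data;  // 4 squares x 4 points = 16 points
      }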
    • Sample Output
      1: first cluster center 0 at rand: x=4 [-2.0; 2.0]
      1: initial minDist for 0  [-1.0;-1.0] = 10.0
      1: initial minDist for 1  [ 2.0; 1.0] = 17.0
      1: initial minDist for 2  [ 1.0;-1.0] = 18.0
      1: initial minDist for 3  [-1.0;-2.0] = 17.0
      1: initial minDist for 5  [ 2.0; 2.0] = 16.0
      1: initial minDist for 6  [ 2.0;-2.0] = 32.0
      1: initial minDist for 7  [-1.0; 2.0] =  1.0
      1: initial minDist for 8  [-2.0;-2.0] = 16.0
      1: initial minDist for 9  [ 1.0; 1.0] = 10.0
      1: initial minDist for 10 [ 2.0;-1.0] = 25.0
      1: initial minDist for 11 [-2.0;-1.0] =  9.0
      […]
      2: picking cluster center 1 --------------
      3:   distSqSum=3345.0
      3:   random index 1532.706909
      4:   new cluster point: x=6 [2.0;-2.0]
    • Sample Output (2)
      4:   updating minDist for 0  [-1.0;-1.0] = 10.0
      4:   updating minDist for 1  [ 2.0; 1.0] =  9.0
      4:   updating minDist for 2  [ 1.0;-1.0] =  2.0
      4:   updating minDist for 3  [-1.0;-2.0] =  9.0
      4:   updating minDist for 5  [ 2.0; 2.0] = 16.0
      4:   updating minDist for 7  [-1.0; 2.0] = 25.0
      4:   updating minDist for 8  [-2.0;-2.0] = 16.0
      4:   updating minDist for 9  [ 1.0; 1.0] = 10.0
      4:   updating minDist for 10 [ 2.0;-1.0] =  1.0
      4:   updating minDist for 11 [-2.0;-1.0] = 17.0
      […]
      2: picking cluster center 2 --------------
      3:   distSqSum=961.0
      3:   random index 103.404701
      4:   new cluster point: x=1 [2.0;1.0]
      4:   updating minDist for 0  [-1.0;-1.0] = 13.0
      […]
    • Evaluation on Test Dataset
      • 200 clustering runs, each with and without k-means++ initialization
      • Measured RSS (intra-class variance; see the sketch after this list)
      • K-means reached the optimal clustering 115 times (57.5%)
      • K-means++ reached the optimal clustering 182 times (91%)
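    RSS here is the usual k-means objective: the sum of squared distances from each point to its nearest cluster center. A minimal sketch of that measure, again reusing the helpers above (illustrative, not the evaluation code behind the slides):

      // RSS (residual sum of squares, i.e. intra-class variance): the sum over
      // all points of the squared distance to the nearest cluster center.
      double rss(const std::vector<Point>& data, const std::vector<Point>& centers) {
          double total = 0.0;
          for (const Point& p : data) {
              double best = sqDist(p, centers[0]);
              for (std::size_t c = 1; c < centers.size(); ++c) {
                  double d = sqDist(p, centers[c]);
                  if (d < best) best = d;
              }
              total += best;
          }
          return total;
      }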
    • Comparison of the frequency distribution of RSS values between k-means and k-means++ on the evaluation dataset (n=200)
    • Evaluation on Real Dataset
      • UCI's Water Treatment Plant data set: daily sensor measures from an urban waste water treatment plant (n=396, d=38)
      • Ran 500 clustering runs each for k-means and for k-means++, with k=13, and recorded the RSS (see the sketch after this list)
      • Difference highly significant (P < 0.0001)
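    The evaluation procedure amounts to repeatedly seeding, running k-means to convergence, and recording the final RSS of each run. A sketch of such a harness, assuming the helpers from the previous sketches; the fixed iteration cap and the function-pointer interface are simplifications, not the setup actually used for the slides:

      #include <cstdio>

      // Run `runs` clusterings with the given seeding function and print the
      // final RSS of each, for later comparison of the two distributions.
      void evaluate(const std::vector<Point>& data, std::size_t k, int runs,
                    std::vector<Point> (*seed)(const std::vector<Point>&, std::size_t)) {
          for (int i = 0; i < runs; ++i) {
              std::vector<Point> centers = seed(data, k);
              for (int it = 0; it < 100; ++it)   // crude convergence cap
                  lloydStep(data, centers);
              std::printf("%f\n", rss(data, centers));
          }
      }

    For the k-means++ condition this would be called as evaluate(data, 13, 500, seedKMeansPP); the plain k-means condition would pass a uniform random seeding instead.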
    • Comparison of the frequency distribution of RSS values between k-means and k-means++ on the UCI real-world dataset (n=500)
    • Alternative Seeding Algorithms
      • There has been extensive research into seeding techniques for k-means.
      • Steinley [2007] evaluated 12 different techniques (omitting k-means++) and recommends multiple random starting points for general use.
      • Maitra [2011] evaluated 11 techniques (including k-means++) but was unable to provide recommendations based on nine standard real-world datasets.
      • On simulated datasets, Maitra recommends Milligan's [1980] or Mirkin's [2005] seeding technique, and Bradley's [1998] when the dataset is very large.
    • Conclusions and Future Work
      • Using a synthetic test dataset and a real-world dataset, we showed that our implementation of the k-means++ seeding procedure in the MLDemos software package yields a significant reduction of the RSS.
      • A short literature survey revealed that many seeding procedures exist for k-means, and that some alternatives to k-means++ might yield even larger improvements.
    • References
      • Arthur, D. & Vassilvitskii, S.: "k-means++: The advantages of careful seeding". Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–1035 (2007).
      • Bahmani, B., Moseley, B., Vattani, A., Kumar, R. & Vassilvitskii, S.: "Scalable K-Means++". Unpublished working paper available at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf (2012).
      • Bradley, P. S. & Fayyad, U. M.: "Refining initial points for K-Means clustering". Proc. 15th International Conf. on Machine Learning, 91–99 (1998).
      • Maitra, R., Peterson, A. D. & Ghosh, A. P.: "A systematic evaluation of different methods for initializing the K-means clustering algorithm". Unpublished working paper available at http://apghosh.public.iastate.edu/files/IEEEclust2.pdf (2011).
      • Milligan, G. W.: "The validation of four ultrametric clustering algorithms". Pattern Recognition, vol. 12, 41–50 (1980).
      • Mirkin, B.: "Clustering for data mining: A data recovery approach". Chapman and Hall (2005).
      • Steinley, D. & Brusco, M. J.: "Initializing k-means batch clustering: A critical evaluation of several techniques". Journal of Classification 24, 99–121 (2007).