K-means++ Seeding Algorithm, Implementation in MLDemos

Renaud Richardet
Brain Mind Institute
Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
renaud.richardet@epfl.ch
K-means
•  K-means: a widely used clustering technique
•  Initialization: seeds are picked blindly at random from the input data
•  Drawback: very sensitive to the choice of initial cluster centers (seeds)
•  A local optimum can be arbitrarily bad w.r.t. the objective function, compared to the globally optimal clustering
K-means++
•  A seeding technique for k-means from Arthur and Vassilvitskii [2007]
•  Idea: spread the k initial cluster centers away from each other
•  O(log k)-competitive with the optimal clustering (in expectation)
•  Substantial convergence time speedups (empirical)
Algorithm
c ∈ C: cluster center
x ∈ X: data point
D(x): distance between x and the nearest c_k that has already been chosen
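The seeding procedure that these symbols describe can be sketched as follows. This is a minimal Python illustration of the D(x)² weighting published by Arthur and Vassilvitskii [2007], not the MLDemos or Apache Commons Math code. The names `squared_distance` and `kmeanspp_seeds` are hypothetical, and points are assumed to be tuples of numbers.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeanspp_seeds(points, k, rng=None):
    """Pick k seeds from `points` with D(x)^2 weighting (hypothetical helper)."""
    rng = rng or random.Random()
    # First center: chosen uniformly at random from the data.
    first = rng.randrange(len(points))
    chosen = [first]
    # min_sq[i] holds D(x_i)^2, the squared distance to the nearest chosen center.
    min_sq = [squared_distance(p, points[first]) for p in points]
    while len(chosen) < k:
        total = sum(min_sq)
        if total == 0:
            raise ValueError("fewer than k distinct points")
        # Draw a point with probability D(x)^2 / sum of D(x)^2.
        r = rng.random() * total
        cumulative = 0.0
        pick = None
        for i, d in enumerate(min_sq):
            if d == 0:
                continue  # already a center (or a duplicate of one)
            cumulative += d
            pick = i  # keeps the last candidate if rounding leaves cumulative < r
            if cumulative >= r:
                break
        chosen.append(pick)
        # Refresh D(x)^2 against the newly added center.
        new_center = points[pick]
        min_sq = [min(d, squared_distance(p, new_center))
                  for p, d in zip(points, min_sq)]
    return [points[i] for i in chosen]


if __name__ == "__main__":
    # Illustrative data only: four well-separated groups of two points each.
    data = [(-2.0, 2.0), (-1.0, 2.0), (2.0, 2.0), (2.0, 1.0),
            (2.0, -2.0), (1.0, -1.0), (-2.0, -2.0), (-1.0, -2.0)]
    print(kmeanspp_seeds(data, 4, random.Random(42)))
```

Points far from every chosen center carry most of the probability mass, which is what pushes the seeds apart. A point that is already a center has D(x)² = 0 and cannot be drawn again.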
  	
  
	
  
Implementation
•  Based on Apache Commons Math's KMeansPlusPlusClusterer and Arthur's [2007] implementation
•  Implemented directly in MLDemos' core
Implementation Test Dataset: 4 squares (n=16)
Expected: 4 nice clusters
Sample Output
  1: first cluster center 0 at rand: x=4 [-2.0; 2.0]
  1: initial minDist for 0 [-1.0;-1.0] = 10.0
  1: initial minDist for 1 [ 2.0; 1.0] = 17.0
  1: initial minDist for 2 [ 1.0;-1.0] = 18.0
  1: initial minDist for 3 [-1.0;-2.0] = 17.0
  1: initial minDist for 5 [ 2.0; 2.0] = 16.0
  1: initial minDist for 6 [ 2.0;-2.0] = 32.0
  1: initial minDist for 7 [-1.0; 2.0] =  1.0
  1: initial minDist for 8 [-2.0;-2.0] = 16.0
  1: initial minDist for 9 [ 1.0; 1.0] = 10.0
  1: initial minDist for 10[ 2.0;-1.0] = 25.0
  1: initial minDist for 11[-2.0;-1.0] =  9.0
     […]
  2: picking cluster center 1 --------------
  3:   distSqSum=3345.0
  3:   random index 1532.706909
  4:   new cluster point: x=6 [2.0;-2.0]
  	
  
Sample Output (2)
  4:   updating minDist for 0 [-1.0;-1.0] = 10.0
  4:   updating minDist for 1 [ 2.0; 1.0] =  9.0
  4:   updating minDist for 2 [ 1.0;-1.0] =  2.0
  4:   updating minDist for 3 [-1.0;-2.0] =  9.0
  4:   updating minDist for 5 [ 2.0; 2.0] = 16.0
  4:   updating minDist for 7 [-1.0; 2.0] = 25.0
  4:   updating minDist for 8 [-2.0;-2.0] = 16.0
  4:   updating minDist for 9 [ 1.0; 1.0] = 10.0
  4:   updating minDist for 10[ 2.0;-1.0] =  1.0
  4:   updating minDist for 11[-2.0;-1.0] = 17.0
     […]
  2: picking cluster center 2 -----------------
  3:   distSqSum=961.0
  3:   random index 103.404701
  4:   new cluster point: x=1 [2.0;1.0]
  4:   updating minDist for 0 [-1.0;-1.0] = 13.0
  […]
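The numbers in the two log excerpts can be checked by hand. The sketch below is a reading of the printed figures, not of the MLDemos source. It assumes the coordinates exactly as printed for indices 0 to 11, that minDist is a squared Euclidean distance, and that distSqSum adds up the squares of the minDist values. All variable names are hypothetical.

```python
# Coordinates as printed in the sample output (index -> point).
points = {
    0: (-1.0, -1.0), 1: (2.0, 1.0), 2: (1.0, -1.0), 3: (-1.0, -2.0),
    4: (-2.0, 2.0), 5: (2.0, 2.0), 6: (2.0, -2.0), 7: (-1.0, 2.0),
    8: (-2.0, -2.0), 9: (1.0, 1.0), 10: (2.0, -1.0), 11: (-2.0, -1.0),
}


def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


# After the first center (x=4): squared distance of every point to it.
initial = {i: sq_dist(p, points[4]) for i, p in points.items()}
sum_sq_initial = sum(d ** 2 for d in initial.values())

# After the second center (x=6): keep the smaller of the old and new value.
to_second = {i: sq_dist(p, points[6]) for i, p in points.items()}
running_min = {i: min(initial[i], to_second[i]) for i in points}
sum_sq_after = sum(d ** 2 for d in running_min.values())

print(initial[0], initial[6], initial[7])  # compare with the "initial minDist" lines
print(sum_sq_initial)                      # compare with the first distSqSum
print(to_second[7], running_min[7])        # printed value vs. running minimum
print(sum_sq_after)                        # compare with the second distSqSum
```

Under these assumptions the twelve listed points alone reproduce both distSqSum values (3345.0 and 961.0). The "updating minDist" lines for points 7 and 11 (25.0 and 17.0) match the squared distance to the new center, while the second distSqSum is only reproduced with the running minimum (1.0 and 9.0).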
  
Evaluation on Test Dataset
•  200 clustering runs, each with and without k-means++ initialization
•  Measure: RSS (residual sum of squares, i.e. intra-class variance)
•  K-means: optimal clustering 115 times (57.5%)
•  K-means++: optimal clustering 182 times (91%)
Comparison of the frequency distribution of RSS values between k-means and k-means++ on the evaluation dataset (n=200)
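As a reminder of what is being measured, here is a minimal sketch of the RSS of a clustering, assuming RSS means the sum of squared Euclidean distances from each point to its nearest center. The helper name `rss` and the four-point dataset are illustrative and are not taken from the evaluation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def rss(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Illustrative data: two tight pairs of points, far apart from each other.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

good_centers = [(0.0, 1.0), (10.0, 1.0)]  # one center per pair
bad_centers = [(0.0, 0.0), (0.0, 2.0)]    # both centers in the same pair

print(rss(data, good_centers))
print(rss(data, bad_centers))

# Share of runs that reached the optimal clustering, from the counts above.
print(115 / 200, 182 / 200)
```

A lower RSS means tighter clusters. A run that starts with two seeds in the same group tends to end with a much higher RSS, which is the failure mode k-means++ seeding is meant to make less likely.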
Evaluation on Real Dataset
•  UCI's Water Treatment Plant data set: daily measures of sensors in an urban waste water treatment plant (n=396, d=38)
•  Two samples of 500 clustering runs each (one for k-means, one for k-means++) with k=13; RSS recorded for each run
•  Difference highly significant (P < 0.0001)
Comparison of the frequency distribution of RSS values between k-means and k-means++ on the UCI real world dataset (n=500)
Alternative Seeding Algorithms
•  Extensive research into seeding techniques for k-means.
•  Steinley and Brusco [2007] evaluated 12 different techniques (omitting k-means++). They recommend multiple random starting points for general use.
•  Maitra et al. [2011] evaluated 11 techniques (including k-means++). They were unable to provide recommendations when evaluating nine standard real-world datasets.
•  On simulated datasets, Maitra et al. recommend Milligan's [1980] or Mirkin's [2005] seeding technique, and Bradley's [1998] when the dataset is very large.
Conclusions and Future Work
•  Using a synthetic test dataset and a real world dataset, we showed that our implementation of the k-means++ seeding procedure in the MLDemos software package yields a significant reduction of the RSS.
•  A short literature survey revealed that many seeding procedures exist for k-means, and that some alternatives to k-means++ might yield even larger improvements.
References
•  Arthur, D. & Vassilvitskii, S.: "k-means++: The advantages of careful seeding". Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027–1035 (2007).
•  Bahmani, B., Moseley, B., Vattani, A., Kumar, R. & Vassilvitskii, S.: "Scalable K-Means++". Unpublished working paper available at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf (2012).
•  Bradley, P. S. & Fayyad, U. M.: "Refining initial points for K-Means clustering". Proc. 15th International Conf. on Machine Learning, 91–99 (1998).
•  Maitra, R., Peterson, A. D. & Ghosh, A. P.: "A systematic evaluation of different methods for initializing the K-means clustering algorithm". Unpublished working paper available at http://apghosh.public.iastate.edu/files/IEEEclust2.pdf (2011).
•  Milligan, G. W.: "The validation of four ultrametric clustering algorithms". Pattern Recognition, vol. 12, 41–50 (1980).
•  Mirkin, B.: "Clustering for data mining: A data recovery approach". Chapman and Hall (2005).
•  Steinley, D. & Brusco, M. J.: "Initializing k-means batch clustering: A critical evaluation of several techniques". Journal of Classification 24, 99–121 (2007).
