SlideShare a Scribd company logo
1 of 31
Download to read offline
1
K-Means
Class Algorithmic Methods of Data Mining
Program M. Sc. Data Science
University Sapienza University of Rome
Semester Fall 2015
Lecturer Carlos Castillo http://chato.cl/
Sources:
● Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Analysis:
Fundamental Concepts and Algorithms, Cambridge University
Press, May 2014. Example 13.1. [download]
● Evimaria Terzi: Data Mining course at Boston University
http://www.cs.bu.edu/~evimaria/cs565-13.html
2
Boston University Slideshow Title Goes Here
The k-means problem
• consider set X={x1,...,xn} of n points in Rd
• assume that the number k is given
• problem:
• find k points c1,...,ck (named centers or means)
so that the cost
is minimized
3
Boston University Slideshow Title Goes Here
The k-means problem
• k=1 and k=n are easy special cases (why?)
• an NP-hard problem if the dimension of the
data is at least 2 (d≥2)
• in practice, a simple iterative algorithm
works quite well
4
Boston University Slideshow Title Goes Here
The k-means
algorithm
• voted among the top-10
algorithms in data mining
• one way of solving the k-
means problem
5
K-means algorithm
6
Boston University Slideshow Title Goes Here
The k-means algorithm
1.randomly (or with another method) pick k
cluster centers {c1,...,ck}
2.for each j, set the cluster Xj to be the set of
points in X that are the closest to center cj
3.for each j let cj be the center of cluster Xj
(mean of the vectors in Xj)
1.repeat (go to step 2) until convergence
7
Boston University Slideshow Title Goes Here
Sample execution
8
1-dimensional clustering exercise
Exercise:
● For the data in the figure
● Run k-means with k=2 and initial centroids u1=2, u2=4
(Verify: last centroids are 18 units apart)
● Try with k=3 and initialization 2,3,30
http://www.dataminingbook.info/pmwiki.php/Main/BookDownload Exercise 13.1
9
Limitations of k-means
● Clusters of different size
● Clusters of different density
● Clusters of non-globular shape
● Sensitive to initialization
10
Boston University Slideshow Title Goes Here
Limitations of k-means: different sizes
11
Boston University Slideshow Title Goes Here
Limitations of k-means: different
density
12
Boston University Slideshow Title Goes Here
Limitations of k-means: non-spherical
shapes
13
Boston University Slideshow Title Goes Here
Effects of bad initialization
14
Boston University Slideshow Title Goes Here
k-means algorithm
• finds a local optimum
• often converges quickly
but not always
• the choice of initial points can have large
influence in the result
• tends to find spherical clusters
• outliers can cause a problem
• different densities may cause a problem
15
Advanced: k-means initialization
16
Boston University Slideshow Title Goes Here
Initialization
• random initialization
• random, but repeat many times and take the
best solution
• helps, but solution can still be bad
• pick points that are distant to each other
• k-means++
• provable guarantees
17
Boston University Slideshow Title Goes Here
k-means++
David Arthur and Sergei Vassilvitskii
k-means++: The advantages of careful
seeding
SODA 2007
18
Boston University Slideshow Title Goes Here
k-means algorithm: random
initialization
19
Boston University Slideshow Title Goes Here
k-means algorithm: random
initialization
20
Boston University Slideshow Title Goes Here
1
2
3
4
k-means algorithm:
initialization with further-first
traversal
21
Boston University Slideshow Title Goes Here
k-means algorithm:
initialization with further-first
traversal
22
Boston University Slideshow Title Goes Here
1
2
3
but... sensitive to outliers
23
Boston University Slideshow Title Goes Here
but... sensitive to outliers
24
Boston University Slideshow Title Goes Here
Here random may work well
25
Boston University Slideshow Title Goes Here
k-means++ algorithm
• interpolate between the two methods
• let D(x) be the distance between x and the
nearest center selected so far
• choose next center with probability proportional to
(D(x))a = Da(x)
 a = 0      random initialization
 a = ∞ furthest­first traversal
 a = 2      k­means++ 
26
Boston University Slideshow Title Goes Here
k-means++ algorithm
• initialization phase:
• choose the first center uniformly at random
• choose next center with probability proportional
to D2(x)
• iteration phase:
• iterate as in the k-means algorithm until
convergence
27
Boston University Slideshow Title Goes Here
k-means++ initialization
1
2
3
28
Boston University Slideshow Title Goes Here
k-means++ result
29
Boston University Slideshow Title Goes Here
• approximation guarantee comes just from the
first iteration (initialization)
• subsequent iterations can only improve cost
k-means++ provable guarantee
30
Boston University Slideshow Title Goes Here
Lesson learned
• no reason to use k-means and not k-means++
• k-means++ :
• easy to implement
• provable guarantee
• works well in practice
31
k-means--
● Algorithm 4.1 in [Chawla & Gionis SDM 2013]

More Related Content

What's hot

KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
Simplilearn
 
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Simplilearn
 
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...
Simplilearn
 

What's hot (20)

3.3 hierarchical methods
3.3 hierarchical methods3.3 hierarchical methods
3.3 hierarchical methods
 
K - Nearest neighbor ( KNN )
K - Nearest neighbor  ( KNN )K - Nearest neighbor  ( KNN )
K - Nearest neighbor ( KNN )
 
K MEANS CLUSTERING
K MEANS CLUSTERINGK MEANS CLUSTERING
K MEANS CLUSTERING
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithm
 
Chapter8
Chapter8Chapter8
Chapter8
 
Clustering
ClusteringClustering
Clustering
 
Support vector machine
Support vector machineSupport vector machine
Support vector machine
 
Birch1
Birch1Birch1
Birch1
 
K means clustering
K means clusteringK means clustering
K means clustering
 
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
 
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
 
K Nearest Neighbors
K Nearest NeighborsK Nearest Neighbors
K Nearest Neighbors
 
K nearest neighbor
K nearest neighborK nearest neighbor
K nearest neighbor
 
Dimensionality reduction: SVD and its applications
Dimensionality reduction: SVD and its applicationsDimensionality reduction: SVD and its applications
Dimensionality reduction: SVD and its applications
 
Hierarchical Clustering
Hierarchical ClusteringHierarchical Clustering
Hierarchical Clustering
 
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...
 
Random forest algorithm
Random forest algorithmRandom forest algorithm
Random forest algorithm
 
K-means clustering algorithm
K-means clustering algorithmK-means clustering algorithm
K-means clustering algorithm
 
Supervised learning
  Supervised learning  Supervised learning
Supervised learning
 
KNN
KNNKNN
KNN
 

Viewers also liked

K means Clustering
K means ClusteringK means Clustering
K means Clustering
Edureka!
 
Kmeans initialization
Kmeans initializationKmeans initialization
Kmeans initialization
djempol
 

Viewers also liked (20)

Clustering, k means algorithm
Clustering, k means algorithmClustering, k means algorithm
Clustering, k means algorithm
 
K means Clustering
K means ClusteringK means Clustering
K means Clustering
 
slides Céline Beji
slides Céline Bejislides Céline Beji
slides Céline Beji
 
K-Means manual work
K-Means manual workK-Means manual work
K-Means manual work
 
Enhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetEnhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial Dataset
 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its Applications
 
Decision tree Using c4.5 Algorithm
Decision tree Using c4.5 AlgorithmDecision tree Using c4.5 Algorithm
Decision tree Using c4.5 Algorithm
 
Kmeans initialization
Kmeans initializationKmeans initialization
Kmeans initialization
 
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
 
K means
K meansK means
K means
 
Project PPT
Project PPTProject PPT
Project PPT
 
Neural nw k means
Neural nw k meansNeural nw k means
Neural nw k means
 
K means clustering algorithm
K means clustering algorithmK means clustering algorithm
K means clustering algorithm
 
Databeers: Big Crisis Data
Databeers: Big Crisis DataDatabeers: Big Crisis Data
Databeers: Big Crisis Data
 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overview
 
Big Crisis Data for ISPC
Big Crisis Data for ISPCBig Crisis Data for ISPC
Big Crisis Data for ISPC
 
Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
 
Data miningpresentation
Data miningpresentationData miningpresentation
Data miningpresentation
 

Similar to K-Means Algorithm

The Million Domain Challenge: Broadcast Email Prioritization by Cross-domain ...
The Million Domain Challenge: Broadcast Email Prioritization by Cross-domain ...The Million Domain Challenge: Broadcast Email Prioritization by Cross-domain ...
The Million Domain Challenge: Broadcast Email Prioritization by Cross-domain ...
Daiki Tanaka
 
IOEfficientParalleMatrixMultiplication_present
IOEfficientParalleMatrixMultiplication_presentIOEfficientParalleMatrixMultiplication_present
IOEfficientParalleMatrixMultiplication_present
Shubham Joshi
 
Fuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networksFuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networks
mourya chandra
 
Poggi analytics - distance - 1a
Poggi   analytics - distance - 1aPoggi   analytics - distance - 1a
Poggi analytics - distance - 1a
Gaston Liberman
 

Similar to K-Means Algorithm (20)

The Million Domain Challenge: Broadcast Email Prioritization by Cross-domain ...
The Million Domain Challenge: Broadcast Email Prioritization by Cross-domain ...The Million Domain Challenge: Broadcast Email Prioritization by Cross-domain ...
The Million Domain Challenge: Broadcast Email Prioritization by Cross-domain ...
 
On-Homomorphic-Encryption-and-Secure-Computation.ppt
On-Homomorphic-Encryption-and-Secure-Computation.pptOn-Homomorphic-Encryption-and-Secure-Computation.ppt
On-Homomorphic-Encryption-and-Secure-Computation.ppt
 
Learning multifractal structure in large networks (Purdue ML Seminar)
Learning multifractal structure in large networks (Purdue ML Seminar)Learning multifractal structure in large networks (Purdue ML Seminar)
Learning multifractal structure in large networks (Purdue ML Seminar)
 
Deep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsDeep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender Systems
 
ECCV WS 2012 (Frank)
ECCV WS 2012 (Frank)ECCV WS 2012 (Frank)
ECCV WS 2012 (Frank)
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data Science
 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford
 
IOEfficientParalleMatrixMultiplication_present
IOEfficientParalleMatrixMultiplication_presentIOEfficientParalleMatrixMultiplication_present
IOEfficientParalleMatrixMultiplication_present
 
DLBLR talk
DLBLR talkDLBLR talk
DLBLR talk
 
Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Deep Learning Bangalore meet up
Deep Learning Bangalore meet up
 
SP18 Generative Design - Week 8 - Optimization
SP18 Generative Design - Week 8 - OptimizationSP18 Generative Design - Week 8 - Optimization
SP18 Generative Design - Week 8 - Optimization
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
 
More investment in Research and Development for better Education in the future?
More investment in Research and Development for better Education in the future?More investment in Research and Development for better Education in the future?
More investment in Research and Development for better Education in the future?
 
Fuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networksFuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networks
 
Poggi analytics - distance - 1a
Poggi   analytics - distance - 1aPoggi   analytics - distance - 1a
Poggi analytics - distance - 1a
 
Lecture1
Lecture1Lecture1
Lecture1
 
Lecture 3.pdf
Lecture 3.pdfLecture 3.pdf
Lecture 3.pdf
 
Evaluation of programs codes using machine learning
Evaluation of programs codes using machine learningEvaluation of programs codes using machine learning
Evaluation of programs codes using machine learning
 
lecture15-neural-nets (2).pptx
lecture15-neural-nets (2).pptxlecture15-neural-nets (2).pptx
lecture15-neural-nets (2).pptx
 
Entity-Relationship Queries over Wikipedia
Entity-Relationship Queries over WikipediaEntity-Relationship Queries over Wikipedia
Entity-Relationship Queries over Wikipedia
 

More from Carlos Castillo (ChaTo)

More from Carlos Castillo (ChaTo) (20)

Finding High Quality Content in Social Media
Finding High Quality Content in Social MediaFinding High Quality Content in Social Media
Finding High Quality Content in Social Media
 
When no clicks are good news
When no clicks are good newsWhen no clicks are good news
When no clicks are good news
 
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
 
Discrimination Discovery
Discrimination DiscoveryDiscrimination Discovery
Discrimination Discovery
 
Fairness-Aware Data Mining
Fairness-Aware Data MiningFairness-Aware Data Mining
Fairness-Aware Data Mining
 
Observational studies in social media
Observational studies in social mediaObservational studies in social media
Observational studies in social media
 
Natural experiments
Natural experimentsNatural experiments
Natural experiments
 
Content-based link prediction
Content-based link predictionContent-based link prediction
Content-based link prediction
 
Link prediction
Link predictionLink prediction
Link prediction
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Graph Partitioning and Spectral Methods
Graph Partitioning and Spectral MethodsGraph Partitioning and Spectral Methods
Graph Partitioning and Spectral Methods
 
Finding Dense Subgraphs
Finding Dense SubgraphsFinding Dense Subgraphs
Finding Dense Subgraphs
 
Graph Evolution Models
Graph Evolution ModelsGraph Evolution Models
Graph Evolution Models
 
Link-Based Ranking
Link-Based RankingLink-Based Ranking
Link-Based Ranking
 
Text Indexing / Inverted Indices
Text Indexing / Inverted IndicesText Indexing / Inverted Indices
Text Indexing / Inverted Indices
 
Indexing
IndexingIndexing
Indexing
 
Text Summarization
Text SummarizationText Summarization
Text Summarization
 
Hierarchical Clustering
Hierarchical ClusteringHierarchical Clustering
Hierarchical Clustering
 
Clustering
ClusteringClustering
Clustering
 
Text similarity and the vector space model
Text similarity and the vector space modelText similarity and the vector space model
Text similarity and the vector space model
 

Recently uploaded

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 

K-Means Algorithm

  • 1. 1 K-Means Class Algorithmic Methods of Data Mining Program M. Sc. Data Science University Sapienza University of Rome Semester Fall 2015 Lecturer Carlos Castillo http://chato.cl/ Sources: ● Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, May 2014. Example 13.1. [download] ● Evimaria Terzi: Data Mining course at Boston University http://www.cs.bu.edu/~evimaria/cs565-13.html
  • 2. 2 Boston University Slideshow Title Goes Here The k-means problem • consider set X={x1,...,xn} of n points in Rd • assume that the number k is given • problem: • find k points c1,...,ck (named centers or means) so that the cost is minimized
  • 3. 3 Boston University Slideshow Title Goes Here The k-means problem • k=1 and k=n are easy special cases (why?) • an NP-hard problem if the dimension of the data is at least 2 (d≥2) • in practice, a simple iterative algorithm works quite well
  • 4. 4 Boston University Slideshow Title Goes Here The k-means algorithm • voted among the top-10 algorithms in data mining • one way of solving the k- means problem
  • 6. 6 Boston University Slideshow Title Goes Here The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j, set the cluster Xj to be the set of points in X that are the closest to center cj 3.for each j let cj be the center of cluster Xj (mean of the vectors in Xj) 1.repeat (go to step 2) until convergence
  • 7. 7 Boston University Slideshow Title Goes Here Sample execution
  • 8. 8 1-dimensional clustering exercise Exercise: ● For the data in the figure ● Run k-means with k=2 and initial centroids u1=2, u2=4 (Verify: last centroids are 18 units apart) ● Try with k=3 and initialization 2,3,30 http://www.dataminingbook.info/pmwiki.php/Main/BookDownload Exercise 13.1
  • 9. 9 Limitations of k-means ● Clusters of different size ● Clusters of different density ● Clusters of non-globular shape ● Sensitive to initialization
  • 10. 10 Boston University Slideshow Title Goes Here Limitations of k-means: different sizes
  • 11. 11 Boston University Slideshow Title Goes Here Limitations of k-means: different density
  • 12. 12 Boston University Slideshow Title Goes Here Limitations of k-means: non-spherical shapes
  • 13. 13 Boston University Slideshow Title Goes Here Effects of bad initialization
  • 14. 14 Boston University Slideshow Title Goes Here k-means algorithm • finds a local optimum • often converges quickly but not always • the choice of initial points can have large influence in the result • tends to find spherical clusters • outliers can cause a problem • different densities may cause a problem
  • 16. 16 Boston University Slideshow Title Goes Here Initialization • random initialization • random, but repeat many times and take the best solution • helps, but solution can still be bad • pick points that are distant to each other • k-means++ • provable guarantees
  • 17. 17 Boston University Slideshow Title Goes Here k-means++ David Arthur and Sergei Vassilvitskii k-means++: The advantages of careful seeding SODA 2007
  • 18. 18 Boston University Slideshow Title Goes Here k-means algorithm: random initialization
  • 19. 19 Boston University Slideshow Title Goes Here k-means algorithm: random initialization
  • 20. 20 Boston University Slideshow Title Goes Here 1 2 3 4 k-means algorithm: initialization with further-first traversal
  • 21. 21 Boston University Slideshow Title Goes Here k-means algorithm: initialization with further-first traversal
  • 22. 22 Boston University Slideshow Title Goes Here 1 2 3 but... sensitive to outliers
  • 23. 23 Boston University Slideshow Title Goes Here but... sensitive to outliers
  • 24. 24 Boston University Slideshow Title Goes Here Here random may work well
  • 25. 25 Boston University Slideshow Title Goes Here k-means++ algorithm • interpolate between the two methods • let D(x) be the distance between x and the nearest center selected so far • choose next center with probability proportional to (D(x))a = Da(x)  a = 0      random initialization  a = ∞ furthest­first traversal  a = 2      k­means++ 
  • 26. 26 Boston University Slideshow Title Goes Here k-means++ algorithm • initialization phase: • choose the first center uniformly at random • choose next center with probability proportional to D2(x) • iteration phase: • iterate as in the k-means algorithm until convergence
  • 27. 27 Boston University Slideshow Title Goes Here k-means++ initialization 1 2 3
  • 28. 28 Boston University Slideshow Title Goes Here k-means++ result
  • 29. 29 Boston University Slideshow Title Goes Here • approximation guarantee comes just from the first iteration (initialization) • subsequent iterations can only improve cost k-means++ provable guarantee
  • 30. 30 Boston University Slideshow Title Goes Here Lesson learned • no reason to use k-means and not k-means++ • k-means++ : • easy to implement • provable guarantee • works well in practice
  • 31. 31 k-means-- ● Algorithm 4.1 in [Chawla & Gionis SDM 2013]