2013 KSE Seminar
2013/10/11
Jung hoon Kim
TOPIC
Selection of K in
K-means clustering
Why I Chose This Paper
• There is always an assumption about K in the k-means algorithm, but I want to run it without human intuition or insight.
• This paper is the first to review existing automatic methods for selecting the number of clusters for the k-means algorithm.
Paper Format
1) Introduction
2) Reviews the main known methods for selecting K
3) Analyses the factors influencing the selection of K
4) Describes the proposed evaluation measure
5) Presents the results of applying the proposed measure to select K for different data sets
6) Concludes the paper
A Brief Introduction
K-means Algorithm
• k-means is a clustering algorithm, originally from signal processing, that is popular in machine learning and data mining.
• k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, iterating until the centers move less than a threshold.
K-means Algorithm
1) Pick k points at random as the initial cluster centers
2) Assign every point to its nearest cluster center
3) Move each cluster center to the mean of its assigned points
4) Repeat 2)-3) until convergence, as in the sketch below
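Below is a minimal NumPy sketch of these four steps. It is my own illustration, not code from the paper or the talk; the function name `kmeans`, the iteration cap `n_iter`, and the convergence test are assumptions.

```python
# A minimal NumPy sketch of the four steps above. Illustration only,
# not the paper's code; `kmeans`, `n_iter`, and the convergence test
# are assumptions.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Pick k points at random as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2) Assign every point to its nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Move each center to the mean of its assigned points
        #    (keep a center in place if its cluster went empty).
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # 4) Repeat 2-3 until convergence (centers stop moving).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```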
Clustering: Example 2, Steps 1-5
Algorithm: k-means, Distance Metric: Euclidean Distance
[Five figure slides: scatter plots in the plane of expression in condition 1 (x-axis, 0-5) versus expression in condition 2 (y-axis, 0-5), showing cluster centers k1, k2, and k3 moving step by step until convergence.]
Comments on the K-means Method
• Strengths
  • Relatively efficient: O(tkn), where n is the number of instances, k the number of clusters, and t the number of iterations; normally k, t << n
  • Often terminates at a local optimum; the global optimum may be found using techniques such as simulated annealing or genetic algorithms
• Weaknesses
  • Need to specify k, the number of clusters, in advance
  • Initialization problem
  • Not suitable for discovering clusters with non-convex shapes
What’s the problem?
What’s the problem?
• Initialization problem
• With randomly chosen initial centers, many centers tend to land in high-density regions and few in low-density regions, so the final clustering depends on the starting points, as the small demo below shows.
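As a hypothetical demo of this sensitivity (the data set is invented for illustration; it uses scikit-learn's KMeans with purely random initialization and a single run per seed), different seeds can end in different local optima:

```python
# Hypothetical demo of the initialization problem (data set invented for
# illustration): the same data clustered with different random seeds can
# end in different local optima, visible as different final inertias.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.3, (200, 2)),       # dense blob near the origin
    rng.normal(5, 1.0, (30, 2)),        # sparse blob
    rng.normal((-5, 5), 1.0, (30, 2)),  # another sparse blob
])

for seed in range(3):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}: inertia={km.inertia_:.1f}")
```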
What’s the problem?
• Hard to find clusters with non-convex shapes
What’s the problem?
• Selection of K
Existing Methods
• Values of K determined through a human’s viewpoint
• Using probabilistic theory
  • Akaike’s information criterion: if the data sets are generated by a set of Gaussian distributions
  • Hardy’s method: if the data sets are generated by a set of Poisson distributions
  • Monte Carlo techniques (with an associated null hypothesis)
What the Paper Proposes
Formula
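The slide's formula image did not survive. As a hedged reconstruction from the paper this talk reviews (it appears to be Pham, Dimov and Nguyen's "Selection of K in K-means clustering"), the evaluation function f(K) is defined from the sum of cluster distortions S_K and the number of attributes N_d:

```latex
% Reconstruction of the evaluation function (the slide's image is missing):
% S_K = sum of cluster distortions with K clusters, N_d = number of attributes.
f(K) =
\begin{cases}
  1 & \text{if } K = 1 \\[4pt]
  \dfrac{S_K}{\alpha_K \, S_{K-1}} & \text{if } S_{K-1} \neq 0,\ K > 1 \\[4pt]
  1 & \text{if } S_{K-1} = 0,\ K > 1
\end{cases}
\qquad
\alpha_K =
\begin{cases}
  1 - \dfrac{3}{4 N_d} & \text{if } K = 2 \text{ and } N_d > 1 \\[4pt]
  \alpha_{K-1} + \dfrac{1 - \alpha_{K-1}}{6} & \text{if } K > 2
\end{cases}
```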
Research Method
• The method has been validated on 15 artificial and 12 benchmark data sets.
• The 12 benchmark data sets come from the UCI Machine Learning Repository.
• The fifteen artificial data sets give effective samples of the many distributions that commonly arise.
Sample
[Four figure slides titled "Sample"; images not preserved.]
Recommendation Example
If f(X) < 0.85 for some X, recommend K = X; otherwise K = 1 (sketched in code below).
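A hedged Python sketch of this rule: it computes f(K) from k-means distortions (scikit-learn's inertia_ is used as S_K) following the reconstructed formula above, then recommends every K below the threshold; the helper names f_values and recommend_k are my own.

```python
# A hedged sketch of the rule above. f(K) is computed from the k-means
# distortions S_K (scikit-learn's inertia_ used as S_K) following the
# reconstructed formula; helper names f_values/recommend_k are my own.
from sklearn.cluster import KMeans

def f_values(X, k_max=10):
    n_d = X.shape[1]  # N_d: number of attributes (dimensions)
    # Distortions S_1 .. S_k_max from repeated k-means runs.
    s = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
         for k in range(1, k_max + 1)]
    f = {1: 1.0}
    alpha = 1 - 3.0 / (4 * n_d)           # alpha_2 (assumes N_d > 1)
    for k in range(2, k_max + 1):
        # f(K) = S_K / (alpha_K * S_{K-1}), or 1 when S_{K-1} = 0.
        f[k] = 1.0 if s[k - 2] == 0 else s[k - 1] / (alpha * s[k - 2])
        alpha = alpha + (1 - alpha) / 6   # alpha_{K+1} from alpha_K
    return f

def recommend_k(X, k_max=10, threshold=0.85):
    f = f_values(X, k_max)
    good = [k for k, v in f.items() if k > 1 and v < threshold]
    return good if good else [1]
```

On a data set with clearly separated clusters, f(K) would be expected to dip below the threshold at the true K, so recommend_k returns that K.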
Conclusion
• The new method is closely related to the K-means approach because it takes into account information reflecting the performance of the algorithm.
• The proposed method can suggest multiple values of K for cases where different clustering results could be obtained at various required levels of detail.
• The method is computationally expensive if used with large data sets.
Improvements
• The paper does not mention how to calculate the threshold (e.g., f(X) < 0.85); with many data sets, a learning algorithm could be applied to determine it.
• The experimental data sets are largely biased: they are too ideal and do not consider real-world complexity at all. Evaluating on randomly drawn data could be one remedy.
• Knowing the range, or the maximum value, of K is an important issue.
Do you have any questions?
Thank you