Selection K in K-means Clustering

This document summarizes a paper presentation on selecting the optimal number of clusters (K) for k-means clustering. The paper proposes a new evaluation measure to automatically select K without human intuition. It reviews existing methods, analyzes factors influencing K selection, describes the proposed measure, and applies it to real datasets. The method was validated on artificial and benchmark datasets. It aims to suggest multiple K values depending on the required detail level for clustering. However, it is computationally expensive for large datasets and the data used may not reflect real complexity.

2013 KSE Seminar
2013/10/11
Jung hoon Kim

Why I chose this paper
• The k-means algorithm always requires an assumption,
but I want to run it without human intuition or insight.
• This paper first reviews existing automatic
methods for selecting the number of clusters for
the k-means algorithm

Paper Format
1) Introduction
2) Reviews the main known methods for selecting K
3) Analyses the factors influencing the selection of K
4) Describes the proposed evaluation measure
5) Presents the results of applying the proposed
measure to select K for different data sets
6) Concludes the paper

K-means Algorithm
• The k-means algorithm is a clustering method
that originated in signal processing and is
popular in machine learning and data mining.
• k-means clustering aims to partition n
observations into k clusters in which each
observation belongs to the cluster with the
nearest mean. It iterates until the cluster centers
move less than a threshold distance.

K-means Algorithm
1) Pick k points at random as the initial cluster centers
2) Assign every point to its nearest cluster center
3) Move each cluster center to the mean of its
assigned points
4) Repeat 2-3 until convergence
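
The four steps above can be sketched in plain Python. This is a minimal illustration rather than the implementation used in the paper: the function name `kmeans`, the parameters `threshold`, `max_iter`, and `seed`, the sample points, and the choice to leave an empty cluster's center where it was are all assumptions made for this example.

```python
import math
import random


def kmeans(points, k, threshold=1e-4, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # 1) Pick k points at random as the initial cluster centers.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # 2) Assign every point to its nearest center (Euclidean distance).
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # 3) Move each center to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[j])
        # 4) Repeat until no center moves farther than the threshold.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < threshold:
            break
    # `labels` comes from the last assignment step that was executed.
    return centers, labels


# Hypothetical sample data: two loose groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

The result depends on the random initial centers, which is the initialization problem raised later in the slides; changing `seed` can change the outcome on less well-separated data.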

Clustering: Example 2, Step 1
Algorithm: k-means, Distance Metric: Euclidean Distance

[Figure: scatter plot with cluster centers k1, k2, k3. X-axis: expression in condition 1 (0 to 5). Y-axis: expression in condition 2 (0 to 5).]

Clustering: Example 2, Step 2
Algorithm: k-means, Distance Metric: Euclidean Distance

[Figure: scatter plot with cluster centers k1, k2, k3. X-axis: expression in condition 1 (0 to 5). Y-axis: expression in condition 2 (0 to 5).]

Clustering: Example 2, Step 3
Algorithm: k-means, Distance Metric: Euclidean Distance

[Figure: scatter plot with cluster centers k1, k2, k3. X-axis: expression in condition 1 (0 to 5). Y-axis: expression in condition 2 (0 to 5).]

Clustering: Example 2, Step 4
Algorithm: k-means, Distance Metric: Euclidean Distance

[Figure: scatter plot with cluster centers k1, k2, k3. X-axis: expression in condition 1 (0 to 5). Y-axis: expression in condition 2 (0 to 5).]

Clustering: Example 2, Step 5
Algorithm: k-means, Distance Metric: Euclidean Distance

[Figure: scatter plot with cluster centers k1, k2, k3. X-axis: expression in condition 1 (0 to 5). Y-axis: expression in condition 2 (0 to 5).]

Comments on the K-Means Method
• Strength
• Relatively efficient: O(tkn), where n is # instances, k is # clusters,
and t is # iterations. Normally, k, t << n.
• Comment
• Often terminates at a local optimum. The global optimum may
be found using techniques such as simulated annealing or
genetic algorithms

• Weakness
• Need to specify k, the number of clusters, in advance
• Initialization problem
• Not suitable for discovering clusters with non-convex shapes

What’s the problem?
• Initialization problem

• It arises when many points are assigned to regions of high density
and few points are assigned to regions of low density

What’s the problem?
• Clusters with non-convex shapes are hard to find

Existing Method
• Values of K determined from a human viewpoint

• Using probabilistic theory
• Akaike’s information criterion
• if the data sets are generated by a set of Gaussian distributions

• Hardy’s method
• if the data sets are generated by a set of Poisson distributions

• Monte Carlo techniques (associated with a null hypothesis)

Research Method
• The method has been validated on
15 artificial and 12 benchmark data sets.
• The 12 benchmark data sets come from the
UCI Repository of Machine Learning Databases.
• The 15 artificial data sets are representative
samples of many commonly occurring
distributions.

Recommendation Example
If f(X) < 0.85, then K = X
else K=1
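
The rule above can be sketched as a small function. The slides do not define f here, so the function takes already computed f(K) values as input. The name `recommend_k`, the dictionary of f values, and the choice to return every K below the threshold (following the conclusion that the method can suggest multiple values of K) are assumptions made for illustration.

```python
def recommend_k(f_values, threshold=0.85):
    """Return every K whose f(K) is below the threshold, or [1] if none is."""
    candidates = sorted(k for k, f in f_values.items() if f < threshold)
    return candidates if candidates else [1]


# Hypothetical f(K) values, made up for illustration only.
f_values = {1: 1.00, 2: 0.95, 3: 0.62, 4: 0.97, 5: 0.80}
print(recommend_k(f_values))  # prints [3, 5]
```

With these made-up values, K = 3 and K = 5 both fall below 0.85, so both are suggested. If no value falls below the threshold, the function falls back to K = 1.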

Conclusion
• The new method is closely related to the approach
of K-means clustering because it takes into account
information reflecting the performance of the
algorithm
• The proposed method can suggest multiple values
of K to users for cases when different clustering
results could be obtained with various required
levels of detail
• This method is computationally expensive if used
with large data sets

Improvement
• The paper does not mention how to calculate the
threshold (e.g., f(x) < 0.85). If we have many data
sets, we can apply a learning algorithm to determine
the threshold.
• The experimental data sets are biased: they are too
idealized and do not reflect the complexity of real
data at all. Evaluating on randomly chosen data
could be one remedy.
• Knowing the range, or the maximum value, of K is
an important issue.
