Binoy B Nair
What is Clustering?
• Cluster: a collection of data objects
• Similar to one another within the same cluster
• Dissimilar to the objects in other clusters
• Cluster analysis
• Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined classes
• Typical applications
• As a stand-alone tool to get insight into data distribution
• As a preprocessing step for other algorithms
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this
knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation database
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
• Urban planning: Identifying groups of houses according to their house type, value, and
geographical location
• Seismology: Observed earthquake epicenters should be clustered along continental faults
What Is a Good Clustering?
• A good clustering method will produce clusters with
• High intra-class similarity
• Low inter-class similarity
• Precise definition of clustering quality is difficult
• Application-dependent
• Ultimately subjective
Requirements for Clustering in Data Mining
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal domain knowledge required to determine input parameters
• Ability to deal with noise and outliers
• Insensitivity to order of input records
• Robustness with respect to high dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
Major Clustering Approaches
• Partitioning: Construct various partitions and then evaluate them by some criterion
• Hierarchical: Create a hierarchical decomposition of the set of objects using some criterion
• Model-based: Hypothesize a model for each cluster and find best fit of models to data
• Density-based: Guided by connectivity and density functions
Partitioning Algorithms
• Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
• Globally optimal: exhaustively enumerate all partitions
• Heuristic methods: k-means and k-medoids algorithms
• k-means (MacQueen, 1967): Each cluster is represented by the center of the cluster
• k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987):
Each cluster is represented by one of the objects in the cluster
Partitional Clustering
• Each instance is placed in exactly one of K nonoverlapping clusters.
• Since only one set of clusters is output, the user normally has
to input the desired number of clusters K.
Similarity and Dissimilarity Between Objects
• Euclidean distance (p = 2):
$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
• Properties of a metric d(i,j):
• d(i,j) ≥ 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• d(i,j) ≤ d(i,k) + d(k,j)
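The distance formula and the four metric properties can be checked numerically. The following is a minimal sketch, assuming points are given as equal-length tuples; the function name `euclidean` and the three sample points are illustrative choices, not part of the slides.

```python
import math
from itertools import permutations

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Illustrative points (hypothetical, chosen only for this check).
points = [(1.0, 1.0), (4.0, 5.0), (0.0, 3.0)]

d_ab = euclidean(points[0], points[1])  # sqrt(3^2 + 4^2)

# Check the four metric properties on every ordered triple of the sample points.
# The small tolerance guards the triangle inequality against rounding error.
metric_ok = all(
    euclidean(i, j) >= 0
    and euclidean(i, i) == 0
    and euclidean(i, j) == euclidean(j, i)
    and euclidean(i, j) <= euclidean(i, k) + euclidean(k, j) + 1e-12
    for i, j, k in permutations(points, 3)
)
print(d_ab, metric_ok)
```

A check on three points is only a sanity test, of course; it does not prove the properties in general.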
Squared Error
[Figure: scatter plot illustrating the squared-error objective function; axes run from 1 to 10]
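The formula on this slide did not survive extraction. A common way to write the squared-error objective that k-means tries to minimize is sketched below; the symbols E, C_j (the set of objects in cluster j), and m_j (the mean of cluster j) are notation assumed here, not taken from the slide.

```latex
E = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - m_j \rVert^2,
\qquad
m_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
```

In words: for each cluster, add up the squared Euclidean distances from its members to the cluster mean, then sum over all k clusters.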
Algorithm k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the
nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the memberships found
above are correct.
5. If none of the N objects changed membership in the last iteration, exit.
Otherwise, go to step 3.
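The five steps can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `max_iter` safety cap, the tie-breaking rule (lowest cluster index wins), the handling of empty clusters, and the toy data are all assumptions made for this example. It relies on `math.dist`, available from Python 3.8.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain k-means on a list of tuples, starting from the given centers.

    Returns (labels, centers). max_iter is a safety cap (an assumption of
    this sketch, not one of the five steps).
    """
    centers = [tuple(c) for c in centers]  # steps 1 and 2: k initial centers
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each object to its nearest center
        # (ties go to the lowest-numbered cluster).
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Step 5: stop when no object changed membership.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Hypothetical toy data: two well-separated groups, k = 2.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centers = kmeans(data, centers=[data[0], data[3]])
print(labels)
print(centers)
```

Passing the initial centers explicitly keeps the run deterministic; random initialization, as mentioned in step 2, would replace that argument with a random sample of the data.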
K-means Clustering: Steps 1 to 5
Algorithm: k-means, Distance Metric: Euclidean Distance
[Figures: five scatter plots, one per step, with both axes running from 0 to 5 and three cluster centers labelled k1, k2, and k3]
Worked out Example
As a simple illustration of a k-means algorithm, consider the following data set
consisting of the scores of two variables on each of seven individuals:
Subject   Features
1         (1, 1)
2         (1.5, 2)
3         (3, 4)
4         (5, 7)
5         (3.5, 5)
6         (4.5, 5)
7         (3.5, 4.5)
[Figure: scatter plot of the seven individuals]
This data set is to be grouped into two clusters, i.e., k = 2.
As a first step in finding a sensible initial partition, let the feature values of the two
individuals furthest apart (using the Euclidean distance measure) define the initial
cluster means, giving:
          Individual   Mean Vector (centroid)
Group 1   1            (1, 1)
Group 2   4            (5, 7)
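The choice of the two individuals furthest apart can be reproduced with a short search over all pairs. This is a sketch under the assumption that the seven subjects are stored in a dictionary keyed by subject number; the variable names are illustrative.

```python
import math
from itertools import combinations

# The seven individuals from the example, keyed by subject number.
subjects = {
    1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
    5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5),
}

# Pick the pair of subjects with the largest Euclidean distance between them.
a, b = max(
    combinations(subjects, 2),
    key=lambda pair: math.dist(subjects[pair[0]], subjects[pair[1]]),
)
initial_centroids = [subjects[a], subjects[b]]
print(a, b, round(math.dist(subjects[a], subjects[b]), 2))
```

The search compares every pair, which is fine for seven points but grows quadratically with the number of objects.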
• The remaining individuals are now examined in sequence and
allocated to the cluster to which they are closest, in terms of
Euclidean distance to the cluster mean.
• The mean vectors are recalculated at the end of each iteration, once every individual has been allocated.
• This leads to the following series of steps:
Iteration 1 - Assign objects to the closest cluster
A cluster is defined by its centroid.
Object   Features     Centroid 1   D(Oi, C1)   Centroid 2   D(Oi, C2)   Assigned cluster
1        (1, 1)       (1, 1)       0.00        (5, 7)       7.21        C1
2        (1.5, 2)     (1, 1)       1.12        (5, 7)       6.10        C1
3        (3, 4)       (1, 1)       3.61        (5, 7)       3.61        C1
4        (5, 7)       (1, 1)       7.21        (5, 7)       0.00        C2
5        (3.5, 5)     (1, 1)       4.72        (5, 7)       2.50        C2
6        (4.5, 5)     (1, 1)       5.32        (5, 7)       2.06        C2
7        (3.5, 4.5)   (1, 1)       4.30        (5, 7)       2.92        C2
For example, $D(O_2, C_1) = \sqrt{(1 - 1.5)^2 + (1 - 2)^2} \approx 1.12$, and so on.
Object 1 is assigned to cluster 1, and so on. Object 3 is equidistant from both centroids ($\sqrt{13} \approx 3.61$); the tie is resolved here in favour of cluster 1.
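The assignment step of iteration 1 can be recomputed with a few lines of Python. This is a sketch, with distances rounded to two decimals, so small differences from hand-computed figures are possible. The rule that a tie goes to cluster 1 is an assumption of this sketch; it matters for object 3, which comes out equidistant from both initial centroids.

```python
import math

objects = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
c1, c2 = (1, 1), (5, 7)  # initial centroids: objects 1 and 4

rows = []
for number, obj in enumerate(objects, start=1):
    d1 = math.dist(obj, c1)
    d2 = math.dist(obj, c2)
    # Assumption of this sketch: a tie is resolved in favour of cluster 1.
    cluster = "C1" if d1 <= d2 else "C2"
    rows.append((number, round(d1, 2), round(d2, 2), cluster))

for row in rows:
    print(row)
```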
Recomputing centroids at the end of Iteration 1
Now the initial partition has changed, and the two clusters at this stage have the
following characteristics:
            Individuals   New Centroids
Cluster 1   1, 2, 3       C1 = ((1,1) + (1.5,2) + (3,4)) / 3 ≈ (1.8, 2.3)
Cluster 2   4, 5, 6, 7    C2 = ((5,7) + (3.5,5) + (4.5,5) + (3.5,4.5)) / 4 ≈ (4.1, 5.4)
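The centroid update is a component-wise mean of the members of each cluster. A minimal sketch, assuming each cluster is a list of coordinate tuples (the helper name `centroid` is illustrative):

```python
def centroid(members):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(col) / len(members) for col in zip(*members))

cluster1 = [(1, 1), (1.5, 2), (3, 4)]                 # individuals 1, 2, 3
cluster2 = [(5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]   # individuals 4, 5, 6, 7

new_c1 = centroid(cluster1)  # about (1.83, 2.33) before rounding
new_c2 = centroid(cluster2)  # (4.125, 5.375) before rounding
print([round(v, 1) for v in new_c1], [round(v, 1) for v in new_c2])
```

The worked example carries the centroids forward rounded to one decimal; keeping full precision would change the later distances slightly but not the assignments.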
Iteration 2 - Check if any object has changed clusters
Object   Features     Centroid 1   D(Oi, C1)   Centroid 2   D(Oi, C2)   Assigned cluster
1        (1, 1)       (1.8, 2.3)   1.53        (4.1, 5.4)   5.38        C1
2        (1.5, 2)     (1.8, 2.3)   0.42        (4.1, 5.4)   4.28        C1
3        (3, 4)       (1.8, 2.3)   2.08        (4.1, 5.4)   1.78        C2
4        (5, 7)       (1.8, 2.3)   5.69        (4.1, 5.4)   1.84        C2
5        (3.5, 5)     (1.8, 2.3)   3.19        (4.1, 5.4)   0.72        C2
6        (4.5, 5)     (1.8, 2.3)   3.82        (4.1, 5.4)   0.57        C2
7        (3.5, 4.5)   (1.8, 2.3)   2.78        (4.1, 5.4)   1.08        C2
Object 3 has changed cluster from 1 to 2.
Recomputing centroids at the end of Iteration 2
Now the partition has changed, with Object 3 relocated to cluster 2, and
the two clusters at this stage have the following characteristics:
            Individuals     New Centroids
Cluster 1   1, 2            C1 = ((1,1) + (1.5,2)) / 2 = (1.25, 1.5) ≈ (1.3, 1.5)
Cluster 2   3, 4, 5, 6, 7   C2 = ((3,4) + (5,7) + (3.5,5) + (4.5,5) + (3.5,4.5)) / 5 = (3.9, 5.1)
Iteration 3 - Check if any object has changed clusters
Object   Features     Centroid 1   D(Oi, C1)   Centroid 2   D(Oi, C2)   Assigned cluster
1        (1, 1)       (1.3, 1.5)   0.58        (3.9, 5.1)   5.02        C1
2        (1.5, 2)     (1.3, 1.5)   0.54        (3.9, 5.1)   3.92        C1
3        (3, 4)       (1.3, 1.5)   3.02        (3.9, 5.1)   1.42        C2
4        (5, 7)       (1.3, 1.5)   6.63        (3.9, 5.1)   2.20        C2
5        (3.5, 5)     (1.3, 1.5)   4.13        (3.9, 5.1)   0.41        C2
6        (4.5, 5)     (1.3, 1.5)   4.74        (3.9, 5.1)   0.61        C2
7        (3.5, 4.5)   (1.3, 1.5)   3.72        (3.9, 5.1)   0.72        C2
No change in cluster membership compared to Iteration 2.
• In this example each individual is now nearer its own cluster mean
than that of the other cluster and the iteration stops, choosing the
latest partitioning as the final cluster solution.
• Hence Objects {1,2} belong to first cluster and Objects {3,4,5,6,7}
belong to second cluster.
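As a sanity check on this result, the squared-error objective can be compared for the partition after iteration 1 and the final partition. This is a sketch; the helper name `sse` and the use of unrounded cluster means are choices made for this example.

```python
def sse(clusters):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    total = 0.0
    for members in clusters:
        mean = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            sum((v - m) ** 2 for v, m in zip(point, mean)) for point in members
        )
    return total

p = {1: (1, 1), 2: (1.5, 2), 3: (3, 4), 4: (5, 7),
     5: (3.5, 5), 6: (4.5, 5), 7: (3.5, 4.5)}

after_iteration_1 = [[p[1], p[2], p[3]], [p[4], p[5], p[6], p[7]]]
final_partition = [[p[1], p[2]], [p[3], p[4], p[5], p[6], p[7]]]

print(round(sse(after_iteration_1), 3), round(sse(final_partition), 3))
```

Moving object 3 to the second cluster lowers the squared error, which is consistent with the relocation made in iteration 2.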
• The iterative relocation continues until no more relocations occur.
• Luckily, in the example, the no-relocation condition was satisfied in 3
iterations, but this is not usually the case. It might require hundreds of
iterations depending on the dataset.
• Also, it is possible that the k-means algorithm won't reach a final solution within a
practical number of iterations.
• In this case it would be a good idea to consider stopping the algorithm
after a pre-chosen maximum number of iterations.
Comments on the K-Means Method
• Strength
• Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations.
Normally, k, t << n.
• Often terminates at a local optimum. The global optimum may be found using
techniques such as: deterministic annealing and genetic algorithms
• Weakness
• Applicable only when a mean is defined, so categorical data cannot be handled directly
• Need to specify k, the number of clusters, in advance
• Sensitive to noisy data and outliers
• Not suitable for discovering clusters with non-convex shapes
• K-means algorithm is a simple yet popular method for clustering analysis
• Its performance is determined by initialisation and an appropriate distance measure
• There are several variants of K-means to overcome its weaknesses
• K-Medoids: resistance to noise and/or outliers
• K-Modes: extension to categorical data clustering analysis
• CLARA: extension to deal with large data sets
• Mixture models (EM algorithm): handling uncertainty of clusters
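To illustrate why a medoid resists outliers better than a mean, the following sketch compares the two on a one-dimensional cluster. The data and the helper names `mean` and `medoid` are hypothetical, chosen only for this illustration; it is not an implementation of PAM.

```python
def mean(values):
    """Arithmetic mean, the cluster representative used by k-means."""
    return sum(values) / len(values)

def medoid(values):
    """The member with the smallest total distance to all other members."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))

# Hypothetical one-dimensional cluster with a single outlier (100).
cluster = [1, 2, 3, 4, 100]
print(mean(cluster), medoid(cluster))
```

The mean is dragged far from the bulk of the data by the single outlier, while the medoid stays on an actual, central member of the cluster.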
Online tutorial: the K-means function in Matlab
• Ke Chen, K-means Clustering ,COMP24111-Machine Learning,
University of Manchester, 2016.
• {Insert Reference}/ 10.ppt
• {Insert Reference}/MachinLearning3.ppt

Pattern recognition binoy k means clustering

  • 2. What is Clustering? • Cluster: a collection of data objects • Similar to one another within the same cluster • Dissimilar to the objects in other clusters • Cluster analysis • Grouping a set of data objects into clusters • Clustering is unsupervised classification: no predefined classes • Typical applications • As a stand-alone tool to get insight into data distribution • As a preprocessing step for other algorithms
  • 3. 3 Examples of Clustering Applications • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs • Land use: Identification of areas of similar land use in an earth observation database • Insurance: Identifying groups of motor insurance policy holders with a high average claim cost • Urban planning: Identifying groups of houses according to their house type, value, and geographical location • Seismology: Observed earth quake epicenters should be clustered along continent faults
  • 4. 4 What Is a Good Clustering? • A good clustering method will produce clusters with • High intra-class similarity • Low inter-class similarity • Precise definition of clustering quality is difficult • Application-dependent • Ultimately subjective
  • 5. 5 Requirements for Clustering in Data Mining • Scalability • Ability to deal with different types of attributes • Discovery of clusters with arbitrary shape • Minimal domain knowledge required to determine input parameters • Ability to deal with noise and outliers • Insensitivity to order of input records • Robustness wrt high dimensionality • Incorporation of user-specified constraints • Interpretability and usability
  • 6. 6 Major Clustering Approaches • Partitioning: Construct various partitions and then evaluate them by some criterion • Hierarchical: Create a hierarchical decomposition of the set of objects using some criterion • Model-based: Hypothesize a model for each cluster and find best fit of models to data • Density-based: Guided by connectivity and density functions
  • 7. 7 Partitioning Algorithms • Partitioning method: Construct a partition of a database D of n objects into a set of k clusters • Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion • Global optimal: exhaustively enumerate all partitions • Heuristic methods: k-means and k-medoids algorithms • k-means (MacQueen, 1967): Each cluster is represented by the center of the cluster • k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw, 1987): Each cluster is represented by one of the objects in the cluster
  • 8. Partitional Clustering • Each instance is placed in exactly one of K nonoverlapping clusters. • Since only one set of clusters is output, the user normally has to input the desired number of clusters K.
  • 9. Similarity and Dissimilarity Between Objects • Euclidean distance (the Minkowski distance of order 2) between objects i and j described by p attributes: $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$ • Properties of a metric d(i,j): • d(i,j) ≥ 0 • d(i,i) = 0 • d(i,j) = d(j,i) • d(i,j) ≤ d(i,k) + d(k,j)
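The distance and its metric properties can be checked numerically. The following is a minimal sketch, not part of the original slides; the function name `euclidean` and the sample points are chosen here for illustration only.

```python
import math
from itertools import product

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# A few sample points (illustrative values).
points = [(1, 1), (1.5, 2), (3, 4), (5, 7)]

# Check the four metric properties on every triple of sample points.
for i, j, k in product(points, repeat=3):
    assert euclidean(i, j) >= 0                    # non-negativity
    assert euclidean(i, i) == 0                    # distance to itself is zero
    assert euclidean(i, j) == euclidean(j, i)      # symmetry
    # Triangle inequality, with a small tolerance for floating-point error.
    assert euclidean(i, j) <= euclidean(i, k) + euclidean(k, j) + 1e-12

print(round(euclidean((1, 1), (5, 7)), 2))  # 7.21
```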
  • 10. Squared Error • Objective Function [figure: plot illustrating the squared-error objective function]
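The formula on this slide is not present in the transcript. As an assumption about what it showed, the squared-error criterion usually minimized by k-means can be written as follows, where $C_j$ denotes the j-th cluster and $m_j$ its mean (both symbols are introduced here, not taken from the slide):

```latex
E = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - m_j \rVert^2,
\qquad
m_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Each term is the squared Euclidean distance from an object to the center of the cluster it belongs to, so a smaller E means tighter clusters.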
  • 11. Algorithm k-means
      1. Decide on a value for k.
      2. Initialize the k cluster centers (randomly, if necessary).
      3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
      4. Re-estimate the k cluster centers, by assuming the memberships found above are correct.
      5. If none of the N objects changed membership in the last iteration, exit. Otherwise go to step 3.
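The five steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than a reference implementation: the function name `kmeans`, the `seed` parameter, the choice of random data points as initial centers, and the handling of empty clusters are all assumptions made here.

```python
import math
import random

def kmeans(points, k, seed=0):
    """Plain k-means following the five steps; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centers (assumption).
    centers = rng.sample(points, k)
    labels = None
    while True:
        # Step 3: assign every object to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step 5: stop when no object changed membership.
        if new_labels == labels:
            return centers, labels
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))

# Two well-separated groups of three points each (illustrative data).
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_centers, toy_labels = kmeans(toy, k=2)
print(toy_labels)
```

Step 1 (choosing k) is left to the caller. `math.dist` requires Python 3.8 or later.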
  • 12. K-means Clustering: Step 1 • Algorithm: k-means, Distance Metric: Euclidean Distance [figure: scatter plot on 0 to 5 axes with cluster centers k1, k2, k3]
  • 13. K-means Clustering: Step 2 • Algorithm: k-means, Distance Metric: Euclidean Distance [figure: scatter plot on 0 to 5 axes with cluster centers k1, k2, k3]
  • 14. K-means Clustering: Step 3 • Algorithm: k-means, Distance Metric: Euclidean Distance [figure: scatter plot on 0 to 5 axes with cluster centers k1, k2, k3]
  • 15. K-means Clustering: Step 4 • Algorithm: k-means, Distance Metric: Euclidean Distance [figure: scatter plot on 0 to 5 axes with cluster centers k1, k2, k3]
  • 16. K-means Clustering: Step 5 • Algorithm: k-means, Distance Metric: Euclidean Distance [figure: scatter plot on 0 to 5 axes, expression in condition 1 vs. expression in condition 2, with cluster centers k1, k2, k3]
  • 18. Example • As a simple illustration of a k-means algorithm, consider the following data set consisting of the scores of two variables on each of seven individuals:
      Subject   Features
      1         (1, 1)
      2         (1.5, 2)
      3         (3, 4)
      4         (5, 7)
      5         (3.5, 5)
      6         (4.5, 5)
      7         (3.5, 4.5)
    [figure: scatter plot of the seven individuals]
  • 19. Working • This data set is to be grouped into two clusters, i.e., k = 2. As a first step in finding a sensible initial partition, let the feature values of the two individuals furthest apart (using the Euclidean distance measure) define the initial cluster means, giving:
      Group     Individual   Mean vector (centroid)
      Group 1   1            (1, 1)
      Group 2   4            (5, 7)
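The choice of initial means can be reproduced by searching all pairs of individuals for the largest Euclidean distance. This is a small sketch of that step only; the dictionary layout and variable names are assumptions made for illustration.

```python
import math
from itertools import combinations

# The seven individuals from the example, keyed by subject number.
data = {1: (1, 1), 2: (1.5, 2), 3: (3, 4), 4: (5, 7),
        5: (3.5, 5), 6: (4.5, 5), 7: (3.5, 4.5)}

# Find the pair of individuals with the largest Euclidean distance.
a, b = max(combinations(data, 2),
           key=lambda pair: math.dist(data[pair[0]], data[pair[1]]))

print(a, b, round(math.dist(data[a], data[b]), 2))  # 1 4 7.21
```

Individuals 1 and 4 are the furthest apart, which matches the initial centroids (1, 1) and (5, 7) used in the example.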
  • 20. Working • The individuals are now examined in sequence and allocated to the cluster to which they are closest, in terms of Euclidean distance to the cluster mean. • In the tables that follow, the mean vectors are recalculated at the end of each iteration, once every individual has been assigned. • This leads to the following series of steps:
  • 21. Iteration 1 - Assign objects to closest clusters • A cluster is defined by its centroid • Distances are Euclidean, rounded to two decimals, e.g. $D(O_2, C_1) = \sqrt{(1 - 1.5)^2 + (1 - 2)^2} \approx 1.12$, and so on:
      Object (Oi)   Features     Centroid 1 (C1)   D(Oi,C1)   Centroid 2 (C2)   D(Oi,C2)   Closest centroid
      1             (1, 1)       (1, 1)            0.00       (5, 7)            7.21       C1
      2             (1.5, 2)     (1, 1)            1.12       (5, 7)            6.10       C1
      3             (3, 4)       (1, 1)            3.61       (5, 7)            3.61       C1 (tie)
      4             (5, 7)       (1, 1)            7.21       (5, 7)            0.00       C2
      5             (3.5, 5)     (1, 1)            4.72       (5, 7)            2.50       C2
      6             (4.5, 5)     (1, 1)            5.32       (5, 7)            2.06       C2
      7             (3.5, 4.5)   (1, 1)            4.30       (5, 7)            2.92       C2
    Object 1 is assigned to cluster 1, and so on. Object 3 is equidistant from both initial centroids and is placed in cluster 1.
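The assignment step of iteration 1 can be recomputed directly. The sketch below assumes the seven data points and the initial centroids given in the example; the variable names are illustrative. Note that object 3 is exactly equidistant (√13 ≈ 3.61) from both initial centroids, so the result for that object depends on the tie-breaking rule; here ties go to C1.

```python
import math

# The seven individuals from the example, keyed by object number.
data = {1: (1, 1), 2: (1.5, 2), 3: (3, 4), 4: (5, 7),
        5: (3.5, 5), 6: (4.5, 5), 7: (3.5, 4.5)}
centroids = {"C1": (1, 1), "C2": (5, 7)}

assignment = {}
for obj, features in data.items():
    dists = {name: math.dist(features, c) for name, c in centroids.items()}
    # min() returns the first minimal key, so a tie goes to C1.
    assignment[obj] = min(dists, key=dists.get)
    print(obj, {n: round(d, 2) for n, d in dists.items()}, assignment[obj])
```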
  • 24. Recomputing centroids at the end of Iteration 1 • The initial partition has now changed, and the two clusters at this stage have the following characteristics:
      Cluster     Individuals   New centroid
      Cluster 1   1, 2, 3       C1 = ((1,1) + (1.5,2) + (3,4))/3 ≈ (1.8, 2.3)
      Cluster 2   4, 5, 6, 7    C2 = ((5,7) + (3.5,5) + (4.5,5) + (3.5,4.5))/4 ≈ (4.1, 5.4)
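The centroid update is a component-wise mean of the cluster members. A minimal sketch, with the helper name `centroid` chosen here for illustration:

```python
def centroid(points):
    """Component-wise mean of a list of feature vectors."""
    return tuple(sum(v) / len(points) for v in zip(*points))

cluster1 = [(1, 1), (1.5, 2), (3, 4)]               # individuals 1, 2, 3
cluster2 = [(5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]  # individuals 4, 5, 6, 7

c1 = centroid(cluster1)
c2 = centroid(cluster2)
print(tuple(round(v, 1) for v in c1))  # (1.8, 2.3)
print(tuple(round(v, 1) for v in c2))  # (4.1, 5.4)
```

The unrounded values are approximately (1.833, 2.333) and exactly (4.125, 5.375); the slides carry the one-decimal values into the next iteration.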
  • 25. Iteration 2 - Check if any object has changed clusters
      Object (Oi)   Features     Centroid 1 (C1)   D(Oi,C1)   Centroid 2 (C2)   D(Oi,C2)   Closest centroid
      1             (1, 1)       (1.8, 2.3)        1.53       (4.1, 5.4)        5.38       C1
      2             (1.5, 2)     (1.8, 2.3)        0.42       (4.1, 5.4)        4.28       C1
      3             (3, 4)       (1.8, 2.3)        2.08       (4.1, 5.4)        1.78       C2
      4             (5, 7)       (1.8, 2.3)        5.69       (4.1, 5.4)        1.84       C2
      5             (3.5, 5)     (1.8, 2.3)        3.19       (4.1, 5.4)        0.72       C2
      6             (4.5, 5)     (1.8, 2.3)        3.82       (4.1, 5.4)        0.57       C2
      7             (3.5, 4.5)   (1.8, 2.3)        2.78       (4.1, 5.4)        1.08       C2
    Object 3 has changed cluster from 1 to 2.
  • 27. Recomputing centroids at the end of Iteration 2 • The partition has changed again, with Object 3 relocated to cluster 2, and the two clusters at this stage have the following characteristics:
      Cluster     Individuals     New centroid
      Cluster 1   1, 2            C1 = ((1,1) + (1.5,2))/2 = (1.25, 1.5) ≈ (1.3, 1.5)
      Cluster 2   3, 4, 5, 6, 7   C2 = ((3,4) + (5,7) + (3.5,5) + (4.5,5) + (3.5,4.5))/5 = (3.9, 5.1)
  • 28. Iteration 3 - Check if any object has changed clusters
      Object (Oi)   Features     Centroid 1 (C1)   D(Oi,C1)   Centroid 2 (C2)   D(Oi,C2)   Closest centroid
      1             (1, 1)       (1.3, 1.5)        0.58       (3.9, 5.1)        5.02       C1
      2             (1.5, 2)     (1.3, 1.5)        0.54       (3.9, 5.1)        3.92       C1
      3             (3, 4)       (1.3, 1.5)        3.02       (3.9, 5.1)        1.42       C2
      4             (5, 7)       (1.3, 1.5)        6.63       (3.9, 5.1)        2.20       C2
      5             (3.5, 5)     (1.3, 1.5)        4.13       (3.9, 5.1)        0.41       C2
      6             (4.5, 5)     (1.3, 1.5)        4.74       (3.9, 5.1)        0.61       C2
      7             (3.5, 4.5)   (1.3, 1.5)        3.72       (3.9, 5.1)        0.72       C2
    No change in clusters compared to the previous iteration.
  • 29. Conclusion • In this example each individual is now nearer its own cluster mean than that of the other cluster and the iteration stops, choosing the latest partitioning as the final cluster solution. • Hence Objects {1,2} belong to first cluster and Objects {3,4,5,6,7} belong to second cluster.
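The whole worked example can be replayed end to end to confirm this conclusion. The sketch below assumes the example's seven data points and its initial means (individuals 1 and 4), and uses unrounded centroids, so intermediate distances may differ slightly in the second decimal from a hand calculation with rounded centroids; the variable names are illustrative.

```python
import math

# The seven individuals from the example, keyed by object number.
data = {1: (1, 1), 2: (1.5, 2), 3: (3, 4), 4: (5, 7),
        5: (3.5, 5), 6: (4.5, 5), 7: (3.5, 4.5)}
centroids = [(1, 1), (5, 7)]  # initial means: individuals 1 and 4
labels = None
iterations = 0

while True:
    iterations += 1
    # Assign each object to the index (0 or 1) of its nearest centroid.
    new_labels = {obj: min((0, 1), key=lambda c: math.dist(x, centroids[c]))
                  for obj, x in data.items()}
    if new_labels == labels:  # no relocation: stop
        break
    labels = new_labels
    # Recompute each centroid as the mean of its current members.
    for c in (0, 1):
        members = [data[o] for o, l in labels.items() if l == c]
        centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))

clusters = [[o for o, l in labels.items() if l == c] for c in (0, 1)]
print(iterations, clusters, centroids)
```

The loop stops on its third pass with clusters {1, 2} and {3, 4, 5, 6, 7}.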
  • 30. Notes • The iterative relocation continues until no more relocations occur. • In the example, the no-relocation condition was satisfied after 3 iterations, but this is not usually the case; depending on the dataset, hundreds of iterations may be required. • Convergence can also be impractically slow, and the solution reached may be only a local optimum. • It is therefore a good idea to stop the algorithm after a pre-chosen maximum number of iterations.
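A cap on the number of iterations can be added with a bounded loop. This is a hedged sketch, not a prescribed design: the function name `kmeans_capped`, the `max_iter` parameter, and the `converged` flag in the return value are all assumptions introduced here.

```python
import math

def kmeans_capped(points, centers, max_iter=100):
    """k-means from given initial centers, with an iteration budget.

    Returns (centers, labels, converged).
    """
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            return centers, labels, True   # no relocation: converged
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, labels, False          # iteration budget exhausted

# The seven individuals from the example, with its initial means.
pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
_, labels_short, ok_short = kmeans_capped(pts, [(1, 1), (5, 7)], max_iter=2)
_, labels_full, ok_full = kmeans_capped(pts, [(1, 1), (5, 7)], max_iter=10)
print(ok_short, ok_full)  # False True
```

With a budget of 2 passes the example data has not yet been confirmed stable, so the flag is False; with a budget of 10 the no-relocation check succeeds on the third pass.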
  • 31. Comments on the K-Means Method • Strength • Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n. • Caveat: often terminates at a local optimum rather than the global one. The global optimum may be sought using techniques such as deterministic annealing and genetic algorithms • Weakness • Applicable only when a mean is defined, which excludes categorical data • Need to specify k, the number of clusters, in advance • Sensitive to noisy data and outliers • Not suitable for discovering clusters with non-convex shapes
  • 32. Summary • The K-means algorithm is a simple yet popular method for cluster analysis • Its performance depends on the initialisation and on an appropriate distance measure • There are several variants of K-means to overcome its weaknesses • K-Medoids: resistance to noise and/or outliers • K-Modes: extension to categorical data • CLARA: extension to deal with large data sets • Mixture models (EM algorithm): handling uncertainty of clusters • Online tutorial: the K-means function in Matlab