Dept of EEE
K-means Algorithm
Abstract
k-means is a rather simple but well-known algorithm for grouping objects, i.e.,
clustering. All objects need to be represented as a set of numerical features.
In addition, the user has to specify the number of groups (referred to as k) to
be identified. Each object can be thought of as being represented by a feature
vector in an n-dimensional space, n being the number of features used to
describe the objects to cluster. The algorithm then randomly chooses k points
in that vector space; these points serve as the initial centers of the clusters.
Afterwards, every object is assigned to the center it is closest to. Usually the
distance measure is chosen by the user and determined by the learning task.
After that, a new center is computed for each cluster by averaging the feature
vectors of all objects assigned to it. The process of assigning objects and
recomputing centers is repeated until it converges, and the algorithm can be
proven to converge after a finite number of iterations. Several tweaks
concerning the distance measure, the choice of initial centers and the
computation of new average centers have been explored, as well as the
estimation of the number of clusters k, yet the main principle always remains
the same. In this project we discuss the k-means clustering algorithm, its
implementation, and its application to the problem of unsupervised learning.
Contents
Abstract
1. Introduction
2. The k-means algorithm
3. How the k-means clustering algorithm works
4. Task Formulation
   4.1 K-means implementation
   4.2 Estimation of parameters of a Gaussian mixture
   4.3 Unsupervised learning
5. Limitations
6. Difficulties with k-means
7. Available software
8. Applications of the k-Means Clustering Algorithm
9. Conclusion
References
The k-means Algorithm
1 Introduction
In this project, we describe the k-means algorithm, a straightforward and
widely-used clustering algorithm. Given a set of objects (records), the goal of
clustering or segmentation is to divide these objects into groups or “clusters”
such that objects within a group tend to be more similar to one another as
compared to objects belonging to different groups. In other words, clustering
algorithms place similar points in the same cluster while placing dissimilar
points in different clusters. Note that, in contrast to supervised tasks such as
regression or classification where there is a notion of a target value or class
label, the objects that form the inputs to a clustering procedure do not come
with an associated target. Therefore clustering is often referred to as
unsupervised learning. Because there is no need for labeled data, unsupervised
algorithms are suitable for many applications where labeled data is difficult to
obtain. Unsupervised tasks such as clustering are also often used to explore and
characterize the dataset before running a supervised learning task. Since
clustering makes no use of class labels, some notion of similarity must be
defined based on the attributes of the objects. The definition of similarity and
the method in which points are clustered differ based on the clustering
algorithm being applied. Thus, different clustering algorithms are suited to
different types of data sets and different purposes. The “best” clustering
algorithm to use therefore depends on the application. It is not uncommon to
try several different algorithms and choose depending on which is the most
useful.
The k-means algorithm is a simple iterative clustering algorithm that
partitions a given dataset into a user-specified number of clusters, k. The
algorithm is simple to implement and run, relatively fast, easy to adapt, and
common in practice. It is historically one of the most important algorithms in
data mining. k-means in its essential form has been independently discovered by
several researchers across different disciplines, most notably Lloyd
(1957, 1982), Forgy (1965), Friedman and Rubin (1967), and MacQueen (1967).
A detailed history of k-means, along with descriptions of several variations, is
given in Jain and Dubes. Gray and Neuhoff provide a nice historical
background for k-means placed in the larger context of hill-climbing
algorithms.
In the rest of this project, we describe how k-means works, discuss its
limitations and difficulties, and present some applications of the algorithm.
2 The k-means algorithm
The k-means algorithm applies to objects that are represented by points in a
𝑑-dimensional vector space. Thus, it clusters a set of 𝑑-dimensional vectors,
𝐷 = {𝑥𝑖 | 𝑖 = 1, ..., 𝑁},
where 𝑥𝑖 ∈ ℜ^𝑑 denotes the i-th object or “data point”. As discussed in the
introduction, k-means is a clustering algorithm that partitions D into k clusters
of points. That is, the k-means algorithm clusters all of the data points in 𝐷
such that each point 𝑥𝑖 falls in one and only one of the 𝑘 partitions. One can
keep track of which point is in which cluster by assigning each point a cluster
ID. Points with the same cluster ID are in the same cluster, while points with
different cluster IDs are in different clusters. One can denote this with a cluster
membership vector m of length N, where 𝑚𝑖 is the cluster ID of 𝑥𝑖. The value
of k is an input to the base algorithm. Typically, the value for k is based on
criteria such as prior knowledge of how many clusters actually appear in 𝐷,
how many clusters are desired for the current application, or the types of
clusters found by exploring/experimenting with different values of 𝑘. How 𝑘
is chosen is not necessary for understanding how k-means partitions the dataset
𝐷, and we will discuss how to choose 𝑘 when it is not pre-specified in a later
section.
In k-means, each of the 𝑘 clusters is represented by a single point in ℜ^𝑑.
Let us denote this set of cluster representatives as the set
𝐶 = {𝑐𝑗 | 𝑗 = 1, ..., 𝑘}.
These 𝑘 cluster representatives are also called the cluster means or cluster
centroids. In clustering algorithms, points are grouped by some notion of
“closeness” or “similarity.” In k-means, the default measure of closeness is the
Euclidean distance. In particular, one can readily show that k-means attempts
to minimize the following non-negative cost function:
\[
\sum_{i=1}^{N} \min_{j \in \{1,\dots,k\}} \lVert x_i - c_j \rVert_2^2 \qquad (1)
\]
In other words, k-means attempts to minimize the total squared Euclidean
distance between each point 𝑥𝑖 and its closest cluster representative 𝑐𝑗.
Equation 1 is often referred to as the k-means objective function.
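As a concrete illustration, the cost in Equation 1 can be evaluated directly from the data and the current centroids. The following is a minimal Matlab sketch (the function and variable names kmeans_cost, X and C are illustrative and not part of the project code); it assumes points are stored as columns of a d×N matrix, as in the data.mat file used later.

% Evaluate the k-means objective of Eq. (1).
% X is a d-by-N matrix of data points (one point per column),
% C is a d-by-k matrix of cluster representatives (one centroid per column).
function J = kmeans_cost(X, C)
    N  = size(X, 2);
    k  = size(C, 2);
    D2 = zeros(k, N);                          % squared distances, one row per centroid
    for j = 1:k
        diff     = X - repmat(C(:, j), 1, N);  % x_i - c_j for every point i
        D2(j, :) = sum(diff.^2, 1);            % ||x_i - c_j||_2^2
    end
    J = sum(min(D2, [], 1));                   % each point contributes its closest centroid
end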
3 How the k-means clustering algorithm works
Here is the k-means clustering algorithm, step by step:
k-means clustering algorithm flowchart
Step 1. Begin with a decision on the value of k = the number of clusters.
Step 2. Put any initial partition that classifies the data into k clusters. You may
assign the training samples randomly, or systematically as follows:
1. Take the first k training samples as single-element clusters.
2. Assign each of the remaining (N − k) training samples to the cluster with
the nearest centroid. After each assignment, recompute the centroid of the
gaining cluster.
Step 3. Take each sample in sequence and compute its distance from the
centroid of each of the clusters. If a sample is not currently in the
cluster with the closest centroid, switch it to that cluster and
update the centroids of the cluster gaining the new sample and the cluster
losing it.
Step 4. Repeat Step 3 until convergence is achieved, that is, until a pass through
the training samples causes no new assignments.
If the number of data points is less than the number of clusters, we assign each
data point as the centroid of a cluster; each centroid is given a cluster number.
If the number of data points is larger than the number of clusters, then for each
data point we calculate the distance to every centroid and take the minimum:
the point is said to belong to the cluster whose centroid is nearest.
Since we are not sure about the location of the centroids, we need to adjust
them based on the current assignments, and then reassign all the data points to
the new centroids. This process is repeated until no data point moves to another
cluster anymore. Mathematically this loop can be proved to converge. The
convergence is guaranteed because:
1. Each switch in Step 3 decreases the sum of distances from each training
sample to its group's centroid.
2. There are only finitely many partitions of the training samples into k
clusters.
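The two-step loop just described (assign each point to its nearest centroid, then recompute each centroid as the mean of its points) can be written compactly. Below is a minimal, self-contained Matlab sketch of this loop; simple_kmeans is an illustrative name and is not the kminovec toolbox function used in the tasks that follow. The initial centroids are taken to be k randomly chosen data points, which is one common choice.

% Minimal k-means (Lloyd's loop). X is d-by-N with one data point per column,
% k is the desired number of clusters. Returns the centroids C (d-by-k) and
% the cluster ID m(i) of every point.
function [C, m] = simple_kmeans(X, k)
    N = size(X, 2);
    C = X(:, randperm(N, k));            % initialise centroids with k random data points
    m = zeros(1, N);
    changed = true;
    while changed
        m_old = m;
        % Assignment step: each point goes to the cluster with the nearest centroid
        for i = 1:N
            d2 = sum((C - repmat(X(:, i), 1, k)).^2, 1);
            [~, m(i)] = min(d2);
        end
        % Update step: each centroid becomes the mean of the points assigned to it
        for j = 1:k
            if any(m == j)
                C(:, j) = mean(X(:, m == j), 2);
            end
        end
        changed = any(m ~= m_old);       % converged when no point changes cluster
    end
end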
4 Task Formulation
4.1 K-means implementation
To implement the K-means clustering algorithm we have to follow the
description given below:
Tasks:
1. Download the test data data.mat and display it using ppatterns().
The file data.mat contains a single variable X, a 2×N matrix of 2D
points.
2. Run the algorithm. In each iteration, display locations of means μj and
current classification of the test data. To display the classification, use
again the ppatterns function.
3. In each iteration, plot the average distance between points and their
respective closest means μj.
4. Experiment with different numbers K of means, e.g. K = 2, 3, 4. Execute
the algorithm repeatedly, initialising the mean values μj with random
positions; use the function rand. (A minimal sketch of such a script is
given below.)
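One possible way to carry out Tasks 2–4 without the course toolbox is sketched below; it assumes data.mat contains the 2×N matrix X described in Task 1, and the names K, C and avgDist are illustrative. The script runs a fixed number of iterations from random initial means and plots the average distance between the points and their closest means, as asked for in Task 3.

load('data.mat');                          % X: 2-by-N matrix of 2D points (Task 1)
K = 3;                                     % try K = 2, 3, 4 (Task 4)
C = 6 * rand(2, K) - 3;                    % random initial means, here drawn from [-3, 3]^2
avgDist = zeros(1, 10);
for it = 1:10                              % fixed number of iterations for the plot
    % assign every point to its closest mean and record the average distance
    d2 = zeros(K, size(X, 2));
    for j = 1:K
        d2(j, :) = sum((X - repmat(C(:, j), 1, size(X, 2))).^2, 1);
    end
    [dmin, m] = min(d2, [], 1);
    avgDist(it) = mean(sqrt(dmin));
    % recompute each mean from the points currently assigned to it
    for j = 1:K
        if any(m == j)
            C(:, j) = mean(X(:, m == j), 2);
        end
    end
end
figure; plot(avgDist, 'o-');
xlabel('iteration'); ylabel('average distance to closest mean');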
4.2 Estimation of parameters of a Gaussian mixture
Let us assume that the distribution of our data is a mixture of three
Gaussians:
\[
p(x) = \sum_{j=1}^{3} P(j)\, \mathcal{N}(x \mid \mu_j, \Sigma_j),
\]
where N(μj, Σj) denotes a normal distribution with mean value μj and
covariance Σj, and P(j) denotes the weight of the j-th Gaussian within the
mixture. The task is, for given input data x1, x2, ..., xN, to estimate the
mixture parameters μj, Σj, P(j).
Tasks:
1. In each iteration of the implemented k-means algorithm, re-estimate the
means μj and covariances Σj using the maximum likelihood
method. P(j) will be the relative number (percentage) of data points
classified to the j-th cluster.
2. In each iteration, plot the total likelihood L of the estimated
parameters μj, Σj, P(j). (A sketch of this re-estimation is given below.)
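A minimal sketch of this re-estimation from the hard k-means assignments is given below; the function name reestimate_gmm and the interpretation of L as the total log-likelihood of the data under the fitted mixture are assumptions made here for illustration. Given the data X (d-by-N, points as columns) and the current cluster IDs m(i), the maximum likelihood estimates of μj and Σj are simply the sample mean and covariance of the points assigned to cluster j, and P(j) is the fraction of points in that cluster.

% Re-estimate mixture parameters from hard assignments m (1-by-N vector of
% cluster IDs) and compute the total log-likelihood L of the data.
% Assumes each cluster has enough points for a non-singular covariance.
function [Mu, Sigma, P, L] = reestimate_gmm(X, m, k)
    [d, N] = size(X);
    Mu = zeros(d, k); Sigma = zeros(d, d, k); P = zeros(1, k);
    for j = 1:k
        Xj = X(:, m == j);
        P(j) = size(Xj, 2) / N;              % weight = fraction of points in cluster j
        Mu(:, j) = mean(Xj, 2);              % ML estimate of the mean
        Sigma(:, :, j) = cov(Xj', 1);        % ML estimate of the covariance (normalised by n)
    end
    % L = sum_i log sum_j P(j) N(x_i | mu_j, Sigma_j)
    L = 0;
    for i = 1:N
        p = 0;
        for j = 1:k
            df = X(:, i) - Mu(:, j);
            p = p + P(j) * exp(-0.5 * df' * (Sigma(:, :, j) \ df)) ...
                  / sqrt((2*pi)^d * det(Sigma(:, :, j)));
        end
        L = L + log(p);
    end
end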
4.3 Unsupervised learning
Apply K-means clustering to the problem of "unsupervised learning".
The input consists of images of three letters: H, L, T. It is not known which
letter is shown in which image. The task is to classify the images into three
classes. The images are described by the two usual measurements:
x = (sum of pixel intensities in the left half of the image) − (sum of pixel
intensities in the right half of the image)
y = (sum of pixel intensities in the upper half of the image) − (sum of pixel
intensities in the lower half of the image)
Tasks:
1. Download the images of letters image_data.mat, compute
measurements x and y.
2. Using the k-means method, classify the images into three classes. In
each iteration, display the means μj , current classification, and the
likelihood L .
3. After the iteration stops, compute and display the average image of
each of the three classes (a small sketch is given below). To display the
final classification, you can use the show_class function.
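For Task 3, the average image of each class can be computed directly from the final classification. A minimal sketch, assuming the images are stored in an H×W×N array data.images and that model.class(i) is the cluster ID assigned to image i (as in the appendix script at the end of this report):

figure;
for j = 1:3
    avgImg = mean(data.images(:, :, model.class == j), 3);   % pixel-wise mean of class j
    subplot(1, 3, j);
    imagesc(avgImg); colormap gray; axis image off;
    title(sprintf('Class %d', j));
end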
Visualisation of classification by show_class: Class 1, Class 2, Class 3
5 Limitations
The greedy-descent nature of k-means on a non-convex cost implies that the
convergence is only to a local optimum, and indeed the algorithm is typically
quite sensitive to the initial centroid locations. In other words, initializing the
set of cluster representatives 𝐶 differently can lead to very different clusters,
even on the same dataset 𝐷. A poor initialization can lead to very poor
clusters.
The local minima problem can be countered to some extent by running the
algorithm multiple times with different initial centroids and then selecting the
best result, or by doing limited local search about the converged solution. Other
approaches include methods that attempt to keep k-means from converging to
local minima. The literature also offers a variety of initialization methods, as
well as discussions of other limitations of k-means.
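As an illustration of the multiple-restart strategy, the sketch below assumes the Statistics Toolbox function kmeans is available; it takes data with one point per row and returns, among other outputs, the within-cluster sums of point-to-centroid distances. The same effect can be obtained in a single call via the 'Replicates' option.

X = randn(100, 2);                     % example data: N-by-2, one point per row (as kmeans expects)
k = 3;
bestCost = inf;
for r = 1:20
    [idx, C, sumd] = kmeans(X, k);     % one run from a fresh random initialisation
    if sum(sumd) < bestCost            % keep the run with the lowest objective (Eq. 1)
        bestCost = sum(sumd);
        bestIdx = idx;
        bestC = C;
    end
end
% Equivalently, in one call: [bestIdx, bestC] = kmeans(X, k, 'Replicates', 20);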
As mentioned, choosing the optimal value of 𝑘 may be difficult. If one has
knowledge about the dataset, such as the number of partitions that naturally
comprise the dataset, then that knowledge can be used to choose 𝑘. Otherwise,
one must use some other criteria to choose 𝑘, thus solving the model selection
problem. One naive solution is to try several different values of 𝑘 and choose
the clustering which minimizes the k-means objective function (Equation 1).
Unfortunately, the value of the objective function is not as informative as one
would hope in this case. For example, the cost of the optimal solution decreases
with increasing 𝑘 till it hits zero when the number of clusters equals the
number of distinct data points. This makes it more difficult to use the objective
function to
(a) directly compare solutions with different numbers of clusters and
(b) find the optimum value of 𝑘.
Thus, if the desired 𝑘 is not known in advance, one will typically run k-means
with different values of 𝑘, and then use some other, more suitable criterion to
select one of the results. For example, SAS uses the cubic clustering criterion,
while X-means adds a complexity term (which increases with 𝑘) to the original
cost function (Eq. 1) and then identifies the k which minimizes this adjusted
cost. Alternatively, one can progressively increase the number of clusters, in
conjunction with a suitable stopping criterion. Bisecting k-means achieves this
by first putting all the data into a single cluster, and then recursively splitting
the least compact cluster into two using 2-means. The celebrated LBG
algorithm used for vector quantization doubles the number of clusters till a
suitable code-book size is obtained. Both these approaches thus alleviate the
need to know k beforehand.
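To see concretely why the raw objective cannot be used to pick k, one can run k-means for a range of k values and plot the final cost; it keeps decreasing as k grows, exactly as described above. A small sketch, again assuming the Statistics Toolbox kmeans:

X = randn(200, 2);                                % example data
costs = zeros(1, 10);
for k = 1:10
    [~, ~, sumd] = kmeans(X, k, 'Replicates', 5); % best of 5 restarts for each k
    costs(k) = sum(sumd);                         % final value of the objective (Eq. 1)
end
figure; plot(1:10, costs, 'o-');
xlabel('k'); ylabel('k-means objective');
% The curve decreases with k, so its minimum alone cannot identify a "correct"
% number of clusters; an extra criterion, as discussed above, is needed.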
6 Difficulties with k-means
k-means suffers from several other problems that can be understood by first
noting that the problem of fitting data using a mixture of k Gaussians with
identical, isotropic covariance matrices (Σ = σ²I, where I is the identity
matrix) results in a “soft” version of k-means.
More precisely, if the soft assignments of data points to the mixture
components of such a model are instead hardened so that each data point is
solely allocated to the most likely component, then one obtains the k-means
algorithm. From this connection it is evident that k-means inherently assumes
that the dataset is composed of a mixture of k balls or hyperspheres of data, and
each of the k clusters corresponds to one of the mixture components. Because
of this implicit assumption, k-means will falter whenever the data is not well
described by a superposition of reasonably separated spherical Gaussian
distributions. For example, k-means will have trouble if there are non-convex
shaped clusters in the data. This problem may be alleviated by rescaling the
data to “whiten” it before clustering, or by using a different distance measure
that is more appropriate for the dataset. For example, information-theoretic
clustering uses the KL-divergence to measure the distance between two data
points representing two discrete probability distributions. It has been recently
shown that if one measures distance by selecting any member of a very large
class of divergences called Bregman divergences during the assignment step
and makes no other changes, the essential properties of k-means, including
guaranteed convergence, linear separation boundaries and scalability, are
retained. This result makes k-means effective for a much larger class of
datasets so long as an appropriate divergence is used.
Another method of dealing with non-convex clusters is by pairing k-means
with another algorithm. For example, one can first cluster the data into a large
number of groups using k-means. These groups are then agglomerated into
larger clusters using single link hierarchical clustering, which can detect
complex shapes. This approach also makes the solution less sensitive to
initialization, and since the hierarchical method provides results at multiple
resolutions, one does not need to worry about choosing an exact value for k
either; instead, one can simply use a large value for k when creating the initial
clusters.
The algorithm is also sensitive to the presence of outliers, since “mean” is not a
robust statistic. A preprocessing step to remove outliers can be helpful.
Post-processing the results, for example to eliminate small clusters, or to merge
close clusters into a large cluster, is also desirable.
Another potential issue is the problem of “empty” clusters. When running
k-means, particularly with large values of k and/or when the data resides in a
very high-dimensional space, it is possible that at some point of execution there
exists a cluster representative cj such that all points xi in D are closer to some
other cluster representative. When the points in D are then assigned to
their closest cluster, the j-th cluster will have zero points assigned to it; that is,
cluster j is now an empty cluster. The standard algorithm does not guard
against empty clusters, but simple extensions (such as reinitializing the cluster
representative of the empty cluster or “stealing” some points from the largest
cluster) are possible.
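One simple guard of the kind mentioned above can be added to the update step of the earlier simple_kmeans sketch: if a cluster ends up with no points, reinitialise its representative, here (as an illustrative choice) at the point that is currently farthest from its assigned centroid. The variables m and dmin are the cluster IDs and squared distances computed in the assignment step.

% Update step with a guard against empty clusters.
for j = 1:k
    if ~any(m == j)                    % cluster j received no points
        [~, far] = max(dmin);          % point that is currently worst represented
        C(:, j) = X(:, far);           % restart the empty cluster at that point
        m(far) = j;
        dmin(far) = 0;
    else
        C(:, j) = mean(X(:, m == j), 2);
    end
end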
7 Available software
Because of the k-means algorithm’s simplicity, effectiveness, and historical
importance, software to run the k-means algorithm is readily available in
several forms. It is a standard feature in many popular data mining software
packages. For example, it can be found in Weka or in SAS under the
FASTCLUS procedure. It is also commonly included as an add-on to existing
software. For example, several implementations of k-means are available as
parts of various toolboxes in Matlab. k-means is also available in Microsoft
Excel after adding the XLMiner add-in. Finally, several stand-alone versions of
k-means exist and can be easily found on the Internet. The algorithm is also
straightforward to code, and the reader is encouraged to create their own
implementation of k-means as an exercise.
8 Applications of the k-Means Clustering Algorithm
Briefly, optical character recognition, speech recognition, and
encoding/decoding are example applications of k-means. However, a survey
of the literature on the subject offers a more in-depth treatment of some other
practical applications, such as "data detection … for burst-mode optical
receiver[s]" and recognition of musical genres. Researchers describing
"burst-mode data-transmission systems" note that a "significant feature of
burst-mode data transmissions is that due to unequal distances between" sender
and receivers, "signal attenuation is not the same" for all receivers. Because of
this, "conventional receivers are not suitable for burst-mode data
transmissions." The importance, they note, is that many "high-speed optical
multi-access network applications, [such as] optical bus networks [and]
WDMA optical star networks" can use burst-mode receivers.
In their paper, they provide a "new, efficient burst-mode signal detection
scheme" that utilizes "a two-step data clustering method based on a K-means
algorithm." They go on to explain that "the burst-mode signal detection
problem" can be expressed as a "binary hypothesis," determining if a bit is 0 or
1. Further, although they could use maximum likelihood sequence estimation
(MLSE) to determine the class, it "is very computationally complex, and not
suitable for high-speed burst-mode data transmission." Thus, they use an
approach based on k-means to solve the practical problem where simple MLSE
is not enough.
9 Conclusion
This project has tried to explain the k-means clustering algorithm and its
application to the problem of unsupervised learning. The k-means algorithm is
a simple iterative clustering algorithm that partitions a dataset into k clusters.
At its core, the algorithm works by iterating over two steps:
1) clustering all points in the dataset based on the distance between each
point and its closest cluster representative, and
2) re-estimating the cluster representatives.
Limitations of the k-means algorithm include the sensitivity of k-means to
initialization and determining the value of k. Despite its drawbacks, k-means
remains the most widely used partitional clustering algorithm in practice. The
algorithm is simple, easily understandable and reasonably scalable, and can be
easily modified to deal with different scenarios such as semi-supervised
learning or streaming data. Continual improvements and generalizations of the
basic algorithm have ensured its continued relevance and gradually increased
its effectiveness as well.
References
1. http://www.ideal.ece.utexas.edu/papers/km.pdf
2. http://www.science.uva.nl/research/ias/alumni/m.sc.theses/theses/NoahLaith.doc
3. http://cw.felk.cvut.cz/cmp/courses/ae4b33rpz/Labs/kmeans/index_en.html
Matlab codes for unsupervised learning task
clear; close all;
load('data.mat');                         % X = 2x140 matrix of 2D points

%% Part 1: k-means on the test data
model = kminovec(X, 4, 10, 1);

%% Part 2: estimation of Gaussian mixture parameters
Gmodel.Mean = [-2, 1; 1, 1; 0, -1]';      % means of the three components (one per column)
Gmodel.Cov(:, :, 1) = [0.1 0; 0 0.1];
Gmodel.Cov(:, :, 2) = [0.3 0; 0 0.3];
Gmodel.Cov(:, :, 3) = [0.01 0; 0 0.5];
Gmodel.Prior = [0.4; 0.4; 0.2];           % mixture weights P(j)
gmm = gmmsamp(Gmodel, 100);               % sample 100 points from the mixture
figure(gcf); clf;
ppatterns(gmm.X, gmm.y);
axis([-3 3 -3 3]);
model = kminovec(gmm.X, 3, 10, 1, gmm);
figure(gcf); plot(model.L);               % total likelihood over the iterations

%% Part 3: unsupervised learning on the letter images
data = load('image_data.mat');
for i = 1:size(data.images, 3)
    % x: sum of pixel intensities in the left half minus the right half
    pX(i) = sum(sum(data.images(:, 1:floor(end/2), i))) ...
          - sum(sum(data.images(:, (floor(end/2)+1):end, i)));
    % y: sum of pixel intensities in the upper half minus the lower half
    pY(i) = sum(sum(data.images(1:floor(end/2), :, i))) ...
          - sum(sum(data.images((floor(end/2)+1):end, :, i)));
end
model = kminovec([pX; pY], 3, 10, 1);
show_class(data.images, model.class');

%% Part 4: sampling and plotting a 1D Gaussian mixture
model = struct('Mean', [-2 3; 5 8], 'Cov', [1 0.5], 'Prior', [0.4 0.6]);
figure; hold on;
plot(-4:0.1:5, pdfgmm(-4:0.1:5, model), 'r');
sample = gmmsamp(model, 500);
[Y, X] = hist(sample.X, 10);
bar(X, Y / 500);
More Related Content

What's hot (20)

3.4 density and grid methods
3.4 density and grid methods3.4 density and grid methods
3.4 density and grid methods
 
Fuzzy Clustering(C-means, K-means)
Fuzzy Clustering(C-means, K-means)Fuzzy Clustering(C-means, K-means)
Fuzzy Clustering(C-means, K-means)
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithm
 
CLUSTER SILHOUETTES.pptx
CLUSTER SILHOUETTES.pptxCLUSTER SILHOUETTES.pptx
CLUSTER SILHOUETTES.pptx
 
DBSCAN (1) (4).pptx
DBSCAN (1) (4).pptxDBSCAN (1) (4).pptx
DBSCAN (1) (4).pptx
 
K nearest neighbor
K nearest neighborK nearest neighbor
K nearest neighbor
 
Birch Algorithm With Solved Example
Birch Algorithm With Solved ExampleBirch Algorithm With Solved Example
Birch Algorithm With Solved Example
 
K means clustring @jax
K means clustring @jaxK means clustring @jax
K means clustring @jax
 
DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)
 
K means clustering
K means clusteringK means clustering
K means clustering
 
K-means clustering algorithm
K-means clustering algorithmK-means clustering algorithm
K-means clustering algorithm
 
K means clustering
K means clusteringK means clustering
K means clustering
 
Chapter 09 classification advanced
Chapter 09 classification advancedChapter 09 classification advanced
Chapter 09 classification advanced
 
Hierachical clustering
Hierachical clusteringHierachical clustering
Hierachical clustering
 
KNN
KNNKNN
KNN
 
K means clustering | K Means ++
K means clustering | K Means ++K means clustering | K Means ++
K means clustering | K Means ++
 
Mean shift and Hierarchical clustering
Mean shift and Hierarchical clustering Mean shift and Hierarchical clustering
Mean shift and Hierarchical clustering
 
My8clst
My8clstMy8clst
My8clst
 
3.3 hierarchical methods
3.3 hierarchical methods3.3 hierarchical methods
3.3 hierarchical methods
 
Presentation on K-Means Clustering
Presentation on K-Means ClusteringPresentation on K-Means Clustering
Presentation on K-Means Clustering
 

Viewers also liked

Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16MLconf
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learningAnil Yadav
 
Learning Vector Quantization LVQ
Learning Vector Quantization LVQLearning Vector Quantization LVQ
Learning Vector Quantization LVQESCOM
 
K means Clustering
K means ClusteringK means Clustering
K means ClusteringEdureka!
 
Harmony search algorithm
Harmony search algorithmHarmony search algorithm
Harmony search algorithmAhmed Fouad Ali
 

Viewers also liked (10)

K-Means manual work
K-Means manual workK-Means manual work
K-Means manual work
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
 
slides Céline Beji
slides Céline Bejislides Céline Beji
slides Céline Beji
 
vector QUANTIZATION
vector QUANTIZATIONvector QUANTIZATION
vector QUANTIZATION
 
Learning Vector Quantization LVQ
Learning Vector Quantization LVQLearning Vector Quantization LVQ
Learning Vector Quantization LVQ
 
Vector quantization
Vector quantizationVector quantization
Vector quantization
 
K-Means Algorithm
K-Means AlgorithmK-Means Algorithm
K-Means Algorithm
 
K means Clustering
K means ClusteringK means Clustering
K means Clustering
 
Harmony search algorithm
Harmony search algorithmHarmony search algorithm
Harmony search algorithm
 

Similar to Neural nw k means

8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithmLaura Petrosanu
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3Nandhini S
 
K means Clustering - algorithm to cluster n objects
K means Clustering - algorithm to cluster n objectsK means Clustering - algorithm to cluster n objects
K means Clustering - algorithm to cluster n objectsVoidVampire
 
A survey on Efficient Enhanced K-Means Clustering Algorithm
 A survey on Efficient Enhanced K-Means Clustering Algorithm A survey on Efficient Enhanced K-Means Clustering Algorithm
A survey on Efficient Enhanced K-Means Clustering Algorithmijsrd.com
 
Optimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering AlgorithmOptimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering AlgorithmIJERA Editor
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptionsrefedey275
 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringIJCSIS Research Publications
 
AI-Lec20 Clustering I - Kmean.pptx
AI-Lec20 Clustering I - Kmean.pptxAI-Lec20 Clustering I - Kmean.pptx
AI-Lec20 Clustering I - Kmean.pptxSyed Ejaz
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsIJDKP
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)theijes
 
An improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracyAn improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracyijpla
 
Unsupervised Machine Learning Algorithm K-means-Clustering.pptx
Unsupervised Machine Learning Algorithm K-means-Clustering.pptxUnsupervised Machine Learning Algorithm K-means-Clustering.pptx
Unsupervised Machine Learning Algorithm K-means-Clustering.pptxAnupama Kate
 
Enhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetEnhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetAlaaZ
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.pptvikassingh569137
 

Similar to Neural nw k means (20)

8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm
 
K means report
K means reportK means report
K means report
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
 
K means Clustering - algorithm to cluster n objects
K means Clustering - algorithm to cluster n objectsK means Clustering - algorithm to cluster n objects
K means Clustering - algorithm to cluster n objects
 
A survey on Efficient Enhanced K-Means Clustering Algorithm
 A survey on Efficient Enhanced K-Means Clustering Algorithm A survey on Efficient Enhanced K-Means Clustering Algorithm
A survey on Efficient Enhanced K-Means Clustering Algorithm
 
Optimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering AlgorithmOptimising Data Using K-Means Clustering Algorithm
Optimising Data Using K-Means Clustering Algorithm
 
k-mean-clustering.pdf
k-mean-clustering.pdfk-mean-clustering.pdf
k-mean-clustering.pdf
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptions
 
Clustering.pptx
Clustering.pptxClustering.pptx
Clustering.pptx
 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means Clustering
 
Clustering
ClusteringClustering
Clustering
 
AI-Lec20 Clustering I - Kmean.pptx
AI-Lec20 Clustering I - Kmean.pptxAI-Lec20 Clustering I - Kmean.pptx
AI-Lec20 Clustering I - Kmean.pptx
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
An improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracyAn improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracy
 
Unsupervised Machine Learning Algorithm K-means-Clustering.pptx
Unsupervised Machine Learning Algorithm K-means-Clustering.pptxUnsupervised Machine Learning Algorithm K-means-Clustering.pptx
Unsupervised Machine Learning Algorithm K-means-Clustering.pptx
 
Master's Thesis Presentation
Master's Thesis PresentationMaster's Thesis Presentation
Master's Thesis Presentation
 
Enhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetEnhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial Dataset
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
 
47 292-298
47 292-29847 292-298
47 292-298
 

Recently uploaded

CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZTE
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2RajaP95
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 

Recently uploaded (20)

★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 

Neural nw k means

  • 2. 2 Abstract k-Means is a rather simple but well known algorithms for grouping objects, clustering. Again all objects need to be represented as a set of numerical features. In addition the user has to specify the number of groups (referred to as k) he wishes to identify. Each object can be thought of as being represented by some feature vector in an n dimensional space, n being the number of all features used to describe the objects to cluster. The algorithm then randomly chooses k points in that vector space, these point serve as the initial centers of the clusters. Afterwards all objects are each assigned to center they are closest to. Usually the distance measure is chosen by the user and determined by the learning task. After that, for each cluster a new center is computed by averaging the feature vectors of all objects assigned to it. The process of assigning objects and recomputing centers is repeated until the process converges. The algorithm can be proven to converge after a finite number of iterations. Several tweaks concerning distance measure, initial center choice and computation of new average centers have been explored, as well as the estimation of the number of clusters k. Yet the main principle always remains the same. In this project we will discuss about K-means clustering algorithm, implementation and its application to the problem of unsupervised learning
  • 3. 3 Contents Abstract………………………………………………………………………....1 1. Introduction……………………………………………………………...…3 2. The k-means algorithm…………………………..........................................4 3. How the k-mean clustering algorithm works………………………….…...5 4. Task Formulation…………………………………………………………..6 4.1 K-means implementation………………………………………….…6 4.2 Estimation of parameters of a Gaussian mixture…………………….8 4.3 Unsupervised learning………………………………………………..9 5. Limitations………………………………………………………………..13 6. Difficulties with k-means…………………………………………………14 7. Available software……………………………………………………...…15 8. Applications of the k-Means Clustering Algorithm………………………15 9. Conclusion……………………………………………………………...…16 References…………………………………………………………………......17
  • 4. 4 The k-means Algorithm 1 Introduction In this project, we describe the k-means algorithm, a straightforward and widely-used clustering algorithm. Given a set of objects (records), the goal of clustering or segmentation is to divide these objects into groups or “clusters” such that objects within a group tend to be more similar to one another as compared to objects belonging to different groups. In other words, clustering algorithms place similar points in the same cluster while placing dissimilar points in different clusters. Note that, in contrast to supervised tasks such as regression or classification where there is a notion of a target value or class label, the objects that form the inputs to a clustering procedure do not come with an associated target. Therefore clustering is often referred to as unsupervised learning. Because there is no need for labelled data, unsupervised algorithms are suitable for many applications where labeled data is difficult to obtain. Unsupervised tasks such as clustering are also often used to explore and characterize the dataset before running a supervised learning task. Since clustering makes no use of class labels, some notion of similarity must be defined based on the attributes of the objects. The definition of similarity and the method in which points are clustered differ based on the clustering algorithm being applied. Thus, different clustering algorithms are suited to different types of data sets and different purposes. The “best” clustering algorithm to use therefore depends on the application. It is not uncommon to try several different algorithms and choose depending on which is the most useful. The k-means algorithm is a simple iterative clustering algorithm that partitions a given dataset into a user-specified number of clusters, k. The algorithm is simple to implement and run, relatively fast, easy to adapt, and common in practice. It is historically one of the most important algorithms in data mining. Historically, k-means in its essential form has been discovered by several researchers across different disciplines, most notably Lloyd (1957,1982), Forgey (1965), Friedman and Rubin(1967), and McQueen(1967). A detailed history of k-means along with descriptions of several variations are given in Jain and Dubes. Gray and Neuhoff provide a nice historical background for k-means placed in the larger context of hill-climbing algorithms. In the rest of this project, we will describe how k-means works, discuss the limitations of k-means, difficulties and some applications of this algorithm.
  • 5. 5 2 The k-means algorithm The k-means algorithm applies to objects that are represented by points in a 𝑑-dimensional vector space. Thus, it clusters a set of 𝑑-dimensional vectors, 𝐷 = {𝑥𝑖|𝑖 = 1,. .. , 𝑁} where 𝑥𝑖 ∈ ℜ 𝑑 denotes the ith object or “data point”. As discussed in the introduction, k-means is a clustering algorithm that partitions D intokclusters of points. That is, the k-means algorithm clusters all of the data points in 𝐷 such that each point 𝑥𝑖 falls in one and only one of the 𝑘 partitions. One can keep track of which point is in which cluster by assigning each point a cluster ID. Points with the same cluster ID are in the same cluster, while points with different cluster IDs are in different clusters. One can denote this with a cluster membership vector m of length N, where 𝑚𝑖 is the cluster ID of 𝑥𝑖. The value ofk is an input to the base algorithm. Typically, the value fork is based on criteria such as prior knowledge of how many clusters actually appear in 𝐷, how many clusters are desired for the current application, or the types of clusters found by exploring/experimenting with different values of 𝑘. How 𝑘 is chosen is not necessary for understanding how k-means partitions the dataset 𝐷, and we will discuss how to choose 𝑘 when it is not pre-specified in a later section. In k-means, each of the 𝑘 clusters is represented by a single point in ℜ 𝑑 . Let us denote this set of cluster representatives as the set 𝐶 = {𝑐𝑗|𝑗 = 1, .. ., 𝑘}. These 𝑘 cluster representatives are also called the cluster means or cluster centroids. In clustering algorithms, points are grouped by some notion of “closeness” or “similarity.” In k-means, the default measure of closeness is the Euclidean distance. In particular, one can readily show that k-means attempts to minimize the following non-negative cost function: ∑ (𝑎𝑟𝑔𝑚𝑖𝑛𝑗 ||𝑥𝑖 − 𝑐𝑗||2 2𝑁 𝑖=1 (1) In other words, k-means attempts to minimize the total squared Euclidean distance between each point 𝑥𝑖 and its closest cluster representative 𝑐𝑗 . Equation 1 is often referredto as the k-means objective function.
  • 6. 6 3 How the k-mean clustering algorithm works Here is step by step k-means clustering algorithm: K -means clustering algorithm flowchart Step1. Begin with a decision on the value of k =number of clusters Step2. Put any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as the following: 1. Take the first k training sample as single-element clusters. 2. Assign each of the remaining(N-k) training sample to the cluster with the nearest centroid. After each assignment, recomputed the centroid of the gaining cluster. Step3. Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch this sample to that cluster and update the centroid of the cluster gaining the new sample and the cluster losing the sample. Step4. Repeat step 3 until convergence is achieved, that is until a pass through the training sample causes no new assignments.
  • 7. 7 If the number of data is less than the number of cluster then we assign each data as the centroid of the cluster. Each centroid will have a cluster number. If the number of data is bigger than the number of cluster, for each data, we calculate the distance to all centroid and get the minimum distance. This data is said belong to the cluster that has minimum distance from this data. Since we are not sure about the location of the centroid, we need to adjust the centroid location based on the current updated data. Then we assign all the data to this new centroid. This process is repeated until no data is moving to another cluster anymore. Mathematically this loop can be proved to be convergent. The convergence will always occur if the following condition satisfied: 1. Each switch in step 2 the sum of distances from each training sample to that training sample’s group centroid is decreased. 2. There are only finitely many partitions of the training examples into k clusters. 4 Task Formulation 4.1 K-means implementation To implement the K-means clustering algorithm we have to follow the description given below: Tasks: 1. Download test data data.mat, display them using ppatterns(). The file data.mat contains single variable X - 2×N matrix of 2D points. 2. Run the algorithm. In each iteration, display locations of means μj and current classification of the test data. To display the classification, use again the ppatterns function. 3. In each iteration, plot the average distance between points and their respective closest means μj. 4. Experiment with diferent number K of means, eg. K = 2,3,4. Execute the algorithm repeatedly, initialise the mean values μj with random positions. Use the function rand.
  • 8. 8
  • 9. 9 4.2 Estimation of parameters of a Gaussian mixture Let us assume that the distribution of our data is a mixture of three gaussians: 𝑝(𝑥) = 𝑆𝑢𝑚3 𝑗=1 𝑃(𝑗) 𝑁( 𝑥 | 𝜇 𝑗, 𝛴𝑗) , where N(μj ,Σj) denotes a normal distribution with mean value μ j and covariance Σ j. P(j) denotes the weight of j-th gaussian within the mixture. The task is, for given input data x1 , x2 , ... , xN, to estimate the mixture parameters μ j , Σ j , P(j) . Tasks: 1. In each iteration of the implemented k-means algorithm, reestimate means μ j and covariances Σ j using the maximal likelihood method. P(j) will be the relative number (percentage) of data points classified to j-th cluster. 2. In each iteration, plot the total likelihood L of estimated parameters μ j , Σ j , P(j) :
  • 10. 10 4.3 Unsupervised learning Apply the K-means clustering to the problem of "unsupervised learning". The input consists of images of three letters, H, L, T. It is not known, which letter is shown in which images. The task is to classify the images into three classes. The images will be described by the two usual measurements: x = (sum of pixel intensities in the left half of the image) - (sum of pixel intensities in the right half of the image) y = (sum of pixel intensities in the upper half of the image) - (sum of pixel intensities in the lower half of the image) Tasks: 1. Download the images of letters image_data.mat, compute measurements x and y. 2. Using the k-means method, classify the images into three classes. In each iteration, display the means μj , current classification, and the likelihood L . 3. After the iteration stops, compute and display the average image of each of the three classes. To display the final classification, you can use show_class function.
  • 11. 11
  • 12. 12 Visualisation of classification by show_class: Class 1
  • 14. 14 5 Limitations The greedy-descent nature of k-means on a non-convex cost implies that the convergence is only to a local optimum, and indeed the algorithm is typically quite sensitive to the initial centroid locations. In other words, initializing the set of cluster representatives 𝐶 differently can lead to very different clusters, even on the same dataset 𝐷. A poor initialization can lead to very poor clusters. The local minima problem can be countered to some extent by running the algorithm multiple times with different initial centroids and then selecting the best result, or by doing limited local search about the converged solution. Other approaches include methods that attempts to keep k-means from converging to local minima. There are also a list of different methods of initialization, as well as a discussion of other limitations of k-means. As mentioned, choosing the optimal value of 𝑘 may be difficult. If one has knowledge about the dataset, such as the number of partitions that naturally comprise the dataset, then that knowledge can be used to choose 𝑘. Otherwise, one must use some other criteria to choose 𝑘, thus solving the model selection problem. One naive solution is to try several different values of 𝑘 and choose the clustering which minimizes the k-means objective function (Equation 1). Unfortunately, the value of the objective function is not as informative as one would hope in this case. For example, the cost of the optimal solution decreases with increasing 𝑘 till it hits zero when the number of clusters equals the number of distinct data points. This makes it more difficult to use the objective function to (a) directly compare solutions with different numbers of clusters and (b) to find the optimum value of 𝑘. Thus, if the desired 𝑘 is not known in advance, one will typically run k-means with different values of 𝑘, and then use some other, more suitable criterion to select one of the results. For example, SAS uses the cube-clustering-criterion, while X-means adds a complexity term (which increases with 𝑘) to the original cost function (Eq. 1) and then identifies the k which minimizes this adjusted cost. Alternatively, one can progressively increase the number of clusters, in conjunction with a suitable stopping criterion. Bisecting k-means achieves this by first putting all the data into a single cluster, and then recursively splitting the least compact cluster into two using 2-means. The celebrated LBG algorithm used for vector quantization doubles the number of clusters till a suitable code-book size is obtained. Both these approaches thus alleviate the need to know k beforehand.
  • 15. 15 6 Difficulties with k-means k-means suffers from several other problems that can be understood by first noting that the problem of fitting data using a mixture of k Gaussians with identical, isotropic covariance matrices, (∑ = σ2 I) where I is the identity matrix, results in a “soft” version of k-means. More precisely, if the soft assignments of data points to the mixture components of such a model are instead hardened so that each data point is solely allocated to the most likely component, then one obtains the k-means algorithm. From this connection it is evident that k-means inherently assumes that the dataset is composed of a mixture of k balls or hyperspheres of data, and each of the k clusters corresponds to one of the mixture components. Because of this implicit assumption, k-means will falter whenever the data is not well described by a superposition of reasonably separated spherical Gaussian distributions. For example, k-means will have trouble if there are non-convex shaped clusters in the data. This problem may be alleviated by rescaling the data to “whiten” it before clustering, or by using a different distance measure that is more appropriate for the dataset. For example, information-theoretic clustering uses the KL-divergence to measure the distance between two data points representing two discrete probability distributions. It has been recently shown that if one measures distance by selecting any member of a very large class of divergences called Bregman divergences during the assignment step and makes no other changes, the essential properties of k-means, including guaranteed convergence, linear separation boundaries and scalability, are retained. This result makes k-means effective for a much larger class of datasets so long as an appropriate divergence is used. Another method of dealing with non-convex clusters is by pairing k-means with another algorithm. For example, one can first cluster the data into a large number of groups using k-means. These groups are then agglomerated into larger clusters using single link hierarchical clustering, which can detect complex shapes. This approach also makes the solution less sensitive to initialization, and since the hierarchical method provides results at multiple resolutions, one does not need to worry about choosing an exact value for k either; instead, one can simply use a large value for k when creating the initial clusters. The algorithm is also sensitive to the presence of outliers, since “mean” is not a robust statistic. A preprocessing step to remove outliers can be helpful. Post-processing the results, for example to eliminate small clusters, or to merge close clusters into a large cluster, is also desirable. Another potential issue is the problem of “empty” clusters . When running k-means, particularly with large values of k and/or when data resides in very high dimensional space, it is possible that at some point of execution, there exists a cluster representative cj such that all points xj in D are closer to some other cluster representative that is not cj . When points in D are assigned to
7 Available software

Because of the k-means algorithm's simplicity, effectiveness, and historical importance, software to run the k-means algorithm is readily available in several forms. It is a standard feature in many popular data mining software packages; for example, it can be found in Weka, and in SAS under the FASTCLUS procedure. It is also commonly included as an add-on to existing software: several implementations of k-means are available as parts of various toolboxes in Matlab, and k-means is available in Microsoft Excel after adding XLMiner. Finally, several stand-alone versions of k-means exist and can easily be found on the Internet. The algorithm is also straightforward to code, and the reader is encouraged to create their own implementation of k-means as an exercise.
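As a small usage sketch of the Matlab toolbox route mentioned above (this assumes the kmeans function of the Statistics and Machine Learning Toolbox is installed; the data is synthetic and purely illustrative):

X = [randn(50, 2); randn(50, 2) + 4];        % two well-separated 2-D blobs, one point per row
[idx, C] = kmeans(X, 2, 'Replicates', 5);    % 5 random restarts, the best run is kept
% idx(i) is the cluster label of point i; C holds the two cluster centers

Note that, unlike the listing at the end of this document where points are stored as columns, this toolbox function expects one observation per row.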
8 Applications of the k-Means Clustering Algorithm

Briefly, optical character recognition, speech recognition, and encoding/decoding serve as example applications of k-means. A survey of the literature, however, offers a more in-depth treatment of some other practical applications, such as "data detection … for burst-mode optical receiver[s]" and the recognition of musical genres. Researchers describe "burst-mode data-transmission systems": a "significant feature of burst-mode data transmissions is that due to unequal distances between" sender and receivers, "signal attenuation is not the same" for all receivers. Because of this, "conventional receivers are not suitable for burst-mode data transmissions." The importance, they note, is that many "high-speed optical multi-access network applications, [such as] optical bus networks [and] WDMA optical star networks" can use burst-mode receivers. In their paper, they provide a "new, efficient burst-mode signal detection scheme" that utilizes "a two-step data clustering method based on a K-means algorithm." They go on to explain that "the burst-mode signal detection problem" can be expressed as a "binary hypothesis" problem, determining whether a bit is 0 or 1. Further, although they could use maximum likelihood sequence estimation (MLSE) to determine the class, it "is very computationally complex, and not suitable for high-speed burst-mode data transmission." Thus, they use an approach based on k-means to solve a practical problem where straightforward MLSE is not enough.

9 Conclusion

This project has tried to explain the k-means clustering algorithm and its application to the problem of unsupervised learning. The k-means algorithm is a simple iterative clustering algorithm that partitions a dataset into k clusters. At its core, the algorithm works by iterating over two steps: 1) assigning all points in the dataset to clusters based on the distance between each point and its closest cluster representative, and 2) re-estimating the cluster representatives. Limitations of the k-means algorithm include its sensitivity to initialization and the difficulty of determining the value of k. Despite these drawbacks, k-means remains the most widely used partitional clustering algorithm in practice. The algorithm is simple, easily understandable and reasonably scalable, and can be easily modified to deal with different scenarios such as semi-supervised learning or streaming data. Continual improvements and generalizations of the basic algorithm have ensured its continued relevance and have gradually increased its effectiveness as well.
Matlab codes for unsupervised learning task

clear; close all;
load('data.mat');                       % X = 2x140

%% part 1: k-means on the raw data
model = kminovec(X, 4, 10, 1);

%% part 2: k-means on data sampled from a known Gaussian mixture
% clear
Gmodel.Mean = [-2,1; 1,1; 0,-1]';       % three component means (one per column)
Gmodel.Cov(:, :, 1) = [0.1 0; 0 0.1];
Gmodel.Cov(:, :, 2) = [0.3 0; 0 0.3];
Gmodel.Cov(:, :, 3) = [0.01 0; 0 0.5];
Gmodel.Prior = [0.4; 0.4; 0.2];
gmm = gmmsamp(Gmodel, 100);             % sample 100 points from the mixture
figure(gcf); clf;
ppatterns(gmm.X, gmm.y);                % plot the sampled points, colored by component
axis([-3 3 -3 3]);
model = kminovec(gmm.X, 3, 10, 1, gmm);
figure(gcf); plot(model.L);             % evolution of the objective function

%% part 3: clustering images by two simple intensity features
data = load('image_data.mat');
for i = 1:size(data.images, 3)
    % sum of the left half minus sum of the right half of the image
    pX(i) = sum(sum(data.images(:, 1:floor(end/2), i))) ...
          - sum(sum(data.images(:, (floor(end/2)+1):end, i)));
    % sum of the top half minus sum of the bottom half
    pY(i) = sum(sum(data.images(1:floor(end/2), :, i))) ...
          - sum(sum(data.images((floor(end/2)+1):end, :, i)));
end
model = kminovec([pX; pY], 3, 10, 1);
show_class(data.images, model.class');

%% part d: sampling from and plotting a Gaussian mixture
model = struct('Mean', [-2 3; 5 8], 'Cov', [1 0.5], 'Prior', [0.4 0.6]);
figure; hold on;
plot(-4:0.1:5, pdfgmm(-4:0.1:5, model), 'r');   % mixture density on a 1-D grid
sample = gmmsamp(model, 500);
[Y, X] = hist(sample.X, 10);
bar(X, Y/500);                                  % normalized histogram of the sample