CLUSTER ANALYSIS
PRASIDDHA SARMA
17352220, MCA -III
Introduction
Cluster analysis is a class of techniques used to
classify objects or cases into relatively homogeneous groups called clusters.
Clustering as a data mining tool has its roots in many
application areas, such as biology, security, business
intelligence, and Web search.
 Why Clustering?
Requirements for Cluster Analysis
Scalability − We need highly scalable clustering algorithms
to deal with large databases.
Ability to deal with different kinds of attributes −
Algorithms should be applicable to any kind of
data, such as interval-based (numerical), categorical, and
binary data.
Requirements for Cluster Analysis (2)
Discovery of clusters with arbitrary shape − The clustering
algorithm should be capable of detecting clusters of arbitrary
shape. It should not be restricted to distance
measures that tend to find only small spherical clusters.
High dimensionality − The clustering algorithm should be
able to handle not only low-dimensional data but also
high-dimensional data.
Requirements for Cluster Analysis (3)
Ability to deal with noisy data − Databases often contain noisy,
missing, or erroneous data. Some algorithms are sensitive to
such data and may produce poor-quality clusters.
Interpretability − The clustering results should be
interpretable, comprehensible, and usable.
Applications
Cluster analysis is widely used in applications
such as market research, pattern recognition, data analysis,
and image processing.
Clustering can also help marketers discover distinct groups
in their customer base and characterize those
groups based on their purchasing patterns.
Applications (2)
In the field of biology, it can be used to derive plant and
animal taxonomies, categorize genes with similar
functionalities, and gain insight into structures inherent to
populations.
Clustering also helps in identification of areas of similar
land use in an earth observation database. It also helps in the
identification of groups of houses in a city according to
house type, value, and geographic location.
Applications (3)
Clustering also helps in classifying documents on the web
for information discovery.
Clustering is also used in outlier detection applications
such as detection of credit card fraud.
Applications (4)
As a data mining function, cluster analysis serves as a tool
to gain insight into the distribution of data and to observe
the characteristics of each cluster.
Overview of Basic Clustering Methods
Clustering methods can be classified into the following
categories −
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Partitioning Methods
K-Means Algorithm
◦ It is a centroid-based technique.
Suppose a data set, D, contains n objects in Euclidean
space. Partitioning methods distribute the objects in D into k
clusters, C1,...,Ck, that is, Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k
with i ≠ j.
An objective function is used to assess the partitioning
quality so that objects within a cluster are similar to one another
but dissimilar to objects in other clusters.
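One common choice of objective function for centroid-based partitioning is the within-cluster sum of squared errors: for each cluster, add up the squared Euclidean distances between its objects and the cluster mean, then sum over all clusters. The slides do not name a specific objective, so the sketch below assumes that choice; the function names and the sample points are illustrative only.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def centroid(cluster):
    """Return the mean point of a non-empty cluster of equal-length tuples."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def within_cluster_sse(clusters):
    """Sum, over all clusters, of squared distances from each object to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(dist(p, c) ** 2 for p in cluster)
    return total


# Two candidate partitionings of the same four points (made-up data).
good = [[(1.0, 1.0), (1.0, 3.0)], [(9.0, 1.0), (9.0, 3.0)]]
bad = [[(1.0, 1.0), (9.0, 1.0)], [(1.0, 3.0), (9.0, 3.0)]]

print(within_cluster_sse(good), within_cluster_sse(bad))
```

A lower value indicates tighter clusters, so under this objective the first partitioning would be preferred over the second.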
Partitioning Methods
K-Means Algorithm
◦ In the k-means algorithm for partitioning, each cluster’s
center is represented by the mean value of the objects in the
cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Partitioning Methods
K-Means Algorithm
Method
1. arbitrarily choose k objects from D as the initial cluster centers;
2. repeat
3. (re)assign each object to the cluster to which the object is the most
similar, based on the mean value of the objects in the cluster;
4. update the cluster means, that is, calculate the mean value of the
objects for each cluster;
5. until no change;
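The five steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: it assumes objects are tuples of numbers, measures similarity by Euclidean distance to the cluster mean, and adds a few things that are not in the slides (the `seed` and `max_iter` parameters, and keeping the old center when a cluster becomes empty).

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)


def mean_point(points):
    """Return the coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(data, k, seed=0, max_iter=100):
    """Return (centers, assignment) where assignment[i] is the cluster index of data[i]."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)  # step 1: pick k objects as initial centers
    assignment = None
    for _ in range(max_iter):  # step 2: repeat
        # step 3: (re)assign each object to the nearest cluster mean
        new_assignment = [
            min(range(k), key=lambda j: dist(p, centers[j])) for p in data
        ]
        if new_assignment == assignment:  # step 5: stop when nothing changes
            break
        assignment = new_assignment
        # step 4: update each cluster mean
        for j in range(k):
            members = [p for p, a in zip(data, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = mean_point(members)
    return centers, assignment


# Made-up data: two visibly separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = k_means(points, k=2)
print(centers)
print(labels)
```

Because the initial centers are chosen arbitrarily, different seeds can lead to different results on less clearly separated data; running the procedure several times and keeping the best partitioning is a common precaution.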
Hierarchical Clustering
Agglomerative vs. Divisive