CLUSTERING:
 What is Clustering
 Clustering Techniques
 Partitioning methods
 Hierarchical methods
 Density-based methods
 Graph-based methods
 Model-based methods
 Applications of Clustering
Clustering
 Clustering is a technique that groups similar objects such that the objects in the same group are more similar to each other than to the objects in other groups. The group of similar objects is called a cluster.
 Clustering helps to split data into several subsets. Each of these clusters consists of data objects with high intra-similarity (within the cluster) and low inter-similarity (between clusters).
Example
[Figure: example of data points grouped into clusters omitted]
Clustering Techniques
 Partitioning methods
• k-Means algorithm [1957, 1967]
• k-Medoids algorithm
• k-Modes [1998]
• Fuzzy c-means algorithm [1999]
• PAM [1990]
• CLARA [1990]
• CLARANS [1994]
 Hierarchical methods
Agglomerative methods:
• AGNES [1990]
• BIRCH [1996]
• CURE [1998]
• ROCK [1999]
• Chameleon [1999]
Divisive methods:
• DIANA [1990]
 Density-based methods
• STING [1997]
• DBSCAN [1996]
• CLIQUE [1998]
• DENCLUE [1998]
• OPTICS [1999]
• Wave Cluster [1998]
 Graph-based methods
• MST Clustering [1999]
• OPOSSUM [2000]
• SNN Similarity Clustering [2001, 2003]
 Model-based clustering
• EM Algorithm [1977]
• AutoClass [1996]
• COBWEB [1987]
• ANN Clustering [1982, 1989]
1. Centroid-based Clustering (Partitioning Clustering)
 Centroid-based clustering is one of the simplest clustering approaches, yet an effective way of creating clusters and assigning data points to them.
 These methods iteratively measure the distance between data points and the characteristic cluster centroids, using a distance metric such as Euclidean, Manhattan, or Minkowski distance.
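As a quick illustration, Manhattan and Euclidean distance are both special cases of the Minkowski distance of order p; a minimal SciPy sketch (the sample vectors are made up):

```python
import numpy as np
from scipy.spatial.distance import minkowski

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Minkowski distance of order p: p=1 gives Manhattan (city-block)
# distance, p=2 gives Euclidean distance.
manhattan = minkowski(a, b, p=1)   # 5.0
euclidean = minkowski(a, b, p=2)   # ~3.606
```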
k-Means Algorithm
 k-Means is one of the most widely used, and perhaps the simplest, unsupervised algorithms for solving clustering problems.
 Using this algorithm, we partition a given data set into a predetermined number of clusters, "k".
 Each cluster is assigned a designated cluster centre, and the centres are placed as far away from each other as possible.
k-Means minimizes the within-cluster sum of squared distances:

J(V) = \sum_{j=1}^{C} \sum_{i=1}^{C_j} \lVert x_i - v_j \rVert^2

where \lVert x_i - v_j \rVert is the distance between data point x_i and centroid v_j, C_j is the count of data points in cluster j, and C is the number of cluster centroids.
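A minimal NumPy sketch of the standard k-Means (Lloyd's) iteration, to make the assign-and-update loop concrete; the function and variable names are illustrative, and a library implementation such as scikit-learn's KMeans would normally be used instead:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Cluster the rows of X into k groups by iterative centroid refinement."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stable: converged
        centroids = new_centroids
    return labels, centroids

# Example: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```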
Advantages:
• Can be applied to any form of data, as long as the features are numerical (continuous).
• Much faster than most other clustering algorithms.
• Easy to understand and interpret.
Drawbacks:
• Fails for non-linearly separable (non-convex) clusters.
• Does not work with categorical data.
• Sensitive to outliers.
K-Medoids Algorithm
 K-Medoids is a clustering algorithm resembling k-Means, and it likewise falls under the category of unsupervised techniques. It differs from k-Means mainly in how it selects the clusters' centres: k-Means takes the average of a cluster's points as its centre (which may or may not be one of the data points), while K-Medoids always picks actual data points from the clusters as their centres (also known as 'exemplars' or 'medoids'). K-Medoids also differs in this respect from the K-Medians algorithm, which, like k-Means, uses a computed centre (the component-wise median) rather than an actual data point.
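A simplified alternating K-Medoids sketch (assign each point to its nearest medoid, then pick the in-cluster point with minimal total distance as the new medoid). Note this is the simple Voronoi-iteration variant, not the full PAM swap procedure, and all names are illustrative:

```python
import numpy as np

def k_medoids(X, k, n_iters=100, seed=0):
    """Simplified K-Medoids: cluster centres are always actual data points."""
    rng = np.random.default_rng(seed)
    # Precompute the full pairwise distance matrix.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iters):
        labels = D[:, medoids].argmin(axis=1)  # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # New medoid: the member minimizing total distance to its cluster.
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[within.argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break  # medoids stable: converged
        medoids = new_medoids
    return labels, medoids

X = np.vstack([np.random.randn(40, 2), np.random.randn(40, 2) + 4])
labels, medoids = k_medoids(X, k=2)
```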
2. Hierarchical Clustering
 Hierarchical clustering, also called hierarchical cluster analysis (HCA), is an unsupervised clustering algorithm that builds clusters with a predefined ordering from top to bottom.
 It then decomposes the data objects according to this hierarchy, thereby obtaining the clusters.
 This clustering technique is divided into two types:
 Agglomerative Hierarchical Clustering
 Divisive Hierarchical Clustering
Agglomerative Approach
Agglomerative Hierarchical Clustering is the most common type of hierarchical clustering, used to group objects into clusters based on their similarity. It is also known as AGNES (Agglomerative Nesting). It is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
How does it work?
1. Make each data point a single-point cluster → forms N clusters.
2. Take the two closest data points and make them one cluster → forms N−1 clusters.
3. Take the two closest clusters and make them one cluster → forms N−2 clusters.
4. Repeat step 3 until you are left with only one cluster.
[Visual representation of Agglomerative Hierarchical Clustering omitted]
There are several ways to measure the distance between clusters in order to decide the rules for merging, and they are often called linkage methods. Some of the common linkage methods are:
• Complete-linkage: the distance between two clusters is defined as the longest distance between any two points, one from each cluster.
• Single-linkage: the distance between two clusters is defined as the shortest distance between any two points, one from each cluster. This linkage can be used to detect outliers, since points far from everything else are merged only at the end.
• Average-linkage: the distance between two clusters is defined as the average distance from each point in one cluster to every point in the other cluster.
• Centroid-linkage: finds the centroid of each of the two clusters and uses the distance between the two centroids.
The choice of linkage method depends on the data; there is no hard and fast rule that always gives good results, and different linkage methods lead to different clusters.
Throughout this process, hierarchical clustering maintains a memory of how the merges proceeded, and that memory is stored in a dendrogram.
What is a Dendrogram?
A dendrogram is a type of tree diagram showing hierarchical relationships between different sets of data.
As noted above, a dendrogram stores the memory of the hierarchical clustering algorithm, so just by looking at the dendrogram you can tell how the clusters were formed.
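To make this concrete, a short sketch using SciPy's hierarchical-clustering utilities (the toy data is made up; the method argument selects one of the linkage methods described above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy data: two loose groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(6, 1, (10, 2))])

# Agglomerative clustering; 'single', 'complete', 'average', and
# 'centroid' correspond to the linkage methods above.
Z = linkage(X, method="average")

# Cut the hierarchy to obtain a flat clustering with 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")

# The dendrogram records the full merge history.
dendrogram(Z)  # renders with matplotlib, if available
```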
Divisive Approach
Divisive clustering, or DIANA (DIvisive ANAlysis Clustering), is a top-down clustering method: we assign all of the observations to a single cluster and then partition that cluster into the two least similar clusters. We proceed recursively on each cluster until there is one cluster per observation. This approach is thus exactly the opposite of agglomerative clustering.
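DIANA itself uses a dissimilarity-based splitting rule; as a simpler illustration of the top-down idea, here is a bisecting strategy that repeatedly splits the largest cluster in two with k-Means (this is bisecting k-means, not DIANA proper, and the helper name is mine):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_clustering(X, n_clusters):
    """Top-down clustering: repeatedly split the largest cluster in two."""
    labels = np.zeros(len(X), dtype=int)
    while labels.max() + 1 < n_clusters:
        # Pick the largest current cluster to split.
        sizes = np.bincount(labels)
        mask = labels == sizes.argmax()
        # Split it into two with 2-means.
        sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[mask])
        if len(np.unique(sub)) < 2:
            break  # cluster could not be split further
        # One half keeps the old label; the other gets a fresh label.
        labels[np.where(mask)[0][sub == 1]] = labels.max() + 1
    return labels

X = np.vstack([np.random.randn(30, 2),
               np.random.randn(30, 2) + 5,
               np.random.randn(30, 2) + np.array([5, -5])])
labels = bisecting_clustering(X, n_clusters=3)
```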
3. Density-based Clustering
 Looking back at the two previous methods, one observes that both hierarchical and centroid-based algorithms depend on a distance metric.
 The very definition of a cluster is based on this metric. Density-based clustering methods take density into consideration instead of distances.
 Clusters are regarded as the densest regions in the data space, separated by regions of lower object density; a cluster is defined as a maximal set of density-connected points.
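DBSCAN is the canonical density-based method; a short scikit-learn sketch (the eps and min_samples values are dataset-dependent and shown here only as plausible choices):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

# eps: neighborhood radius; min_samples: points needed for a dense region.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

labels = db.labels_  # -1 marks noise points in low-density regions
```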
4. Graph-based Clustering
 Transform the data into a graph representation.
 Vertices are the data points to be clustered.
 Edges are weighted based on the similarity between data points.
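One simple instance of this idea (illustrative only, not the only graph method): build a similarity graph, drop weak edges, and read clusters off as connected components. The similarity function and threshold below are arbitrary choices for the sketch:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (15, 2)), rng.normal(4, 0.5, (15, 2))])

# Vertices = data points; edge weights from a Gaussian similarity.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
similarity = np.exp(-dists**2)

# Keep only sufficiently strong edges, then take connected
# components of the resulting graph as the clusters.
adjacency = csr_matrix(similarity > 0.5)
n_clusters, labels = connected_components(adjacency, directed=False)
```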
5. Model-based Clustering
 Model-based clustering is a broad family of algorithms designed for modelling an unknown distribution as a mixture of simpler distributions, sometimes called basis distributions. Mixture-model clustering is commonly classified along the following four criteria:
 Parametric vs. non-parametric models
 Gaussian mixture models (GMMs)
 Bayesian vs. non-Bayesian methods
 Mixtures of factor analysers (MFA)
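As one concrete member of this family, a Gaussian mixture model fitted with the EM algorithm, here via scikit-learn (a minimal sketch on toy data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# Fit a 2-component Gaussian mixture; fitting uses the EM algorithm.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

labels = gmm.predict(X)       # hard cluster assignments
probs = gmm.predict_proba(X)  # soft (probabilistic) memberships
```

Unlike k-Means, the mixture model yields soft memberships, which is often the practical reason for choosing it.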
Applications
 Pattern Recognition
 Spatial Data Analysis
 Image Processing
 Economic Science
 Crime Analysis
 Bioinformatics
 Medical Imaging
 Robotics
 Climatology