Categorization Of Major
Clustering Methods
Presented by,
C.Dhivyasri
I-M.Sc(Computer science)
Nadar Saraswathi College of Arts and Science
Definition of clustering methods
 Clustering methods are unsupervised machine learning techniques that group
unlabeled data points into clusters based on their inherent similarities, using
various algorithms and similarity measures to identify statistically validated
subgroups within datasets.
 These techniques are used in various fields, including market research and
healthcare, to find natural patterns and categorize data into meaningful
segments for analysis and strategic decision-making.
Clustering in Machine Learning
Types of Clustering Methods
 At the surface level, clustering helps in the analysis of unstructured
data. Graph connectivity, the distances between points, and the density of the
data points are a few of the factors that influence how clusters form.
 Centroid-based Clustering (Partitioning methods)
 Density-based Clustering (e.g., DBSCAN)
 Connectivity-based Clustering (Hierarchical clustering)
Centroid-based Clustering
(Partitioning methods)
 Centroid-based clustering organizes data points around central vectors
(centroids) that represent clusters. Each data point belongs to the cluster with
the nearest centroid. The similarity measure chosen for these algorithms is
typically Euclidean distance, Manhattan distance, or Minkowski distance (see the
short sketch below).
 The dataset is separated into a predetermined number of clusters, and each
cluster is represented by a vector of values. Each input data point is compared
against these cluster vectors and joins the cluster whose vector it is closest to.
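As a quick illustration of these distance measures, here is a minimal NumPy sketch (the two points are hypothetical example data; note that Minkowski distance generalizes the other two, with p = 1 giving Manhattan and p = 2 giving Euclidean):

```python
import numpy as np

# Two hypothetical example points
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))          # L2 distance
manhattan = np.sum(np.abs(a - b))                  # L1 distance
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)  # order-p Minkowski

print(euclidean, manhattan, minkowski)             # approx. 3.606, 5.0, 3.271
```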
Popular Algorithms
 K-means:
This is the most widely known centroid-based algorithm. It aims to
minimize the sum of squared distances between data points and their respective
cluster centroids.
 K-medoids (PAM - Partitioning Around Medoids):
Unlike K-means, K-medoids uses an actual data point (medoid) as the
cluster center, making it more robust to outliers than K-means.
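Both algorithms are available off the shelf. A minimal scikit-learn sketch of K-means follows (the toy blobs are hypothetical data); K-medoids is not in scikit-learn itself, though an implementation (sklearn_extra.cluster.KMedoids) ships with the separate scikit-learn-extra package.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 300 points drawn around 3 hypothetical centres
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)  # the learned centroids
print(km.inertia_)          # the sum of squared distances K-means minimizes
```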
K-means Algorithm
STEPS
 1. Randomly select K objects from the dataset (D) as the initial cluster
centres (C).
 2. (Re)assign each object to the cluster whose mean it is most similar to.
 3. Update the cluster means, i.e., recalculate the mean of each cluster with
the updated assignments.
 4. Repeat steps 2 and 3 until no assignments change.
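These steps translate almost directly into code. A minimal, illustrative NumPy sketch follows (the kmeans function and its parameters are hypothetical, and it omits refinements such as smarter initialization):

```python
import numpy as np

def kmeans(D, K, max_iter=100, seed=0):
    """Minimal K-means following the steps above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly select K objects from D as the initial centres C
    C = D[rng.choice(len(D), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to its nearest centre
        labels = np.argmin(np.linalg.norm(D[:, None] - C[None, :], axis=2), axis=1)
        # Step 3: recalculate each cluster mean (keep the old centre if a cluster empties)
        new_C = np.array([D[labels == k].mean(axis=0) if np.any(labels == k) else C[k]
                          for k in range(K)])
        # Step 4: stop once the means no longer change
        if np.allclose(new_C, C):
            break
        C = new_C
    return C, labels
```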
Diagram for K-means Clustering
Flowchart
Connectivity-based Clustering
(Hierarchical clustering)
 Connectivity-based clustering organizes a collection of items by their
similarities, building a hierarchy of clusters from a distance-based measure of
connectivity.
 This method builds a dendrogram, a tree-like structure that visually represents the
relationships between objects.
 The key idea is that nearby objects are more related than distant ones, and the final
cluster hierarchy is determined by a chosen distance metric and a linkage
criterion.
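A minimal sketch of building and plotting such a dendrogram with SciPy, assuming SciPy and Matplotlib are installed (the 12 random points are hypothetical data):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(12, 2))  # 12 hypothetical points

# Build the hierarchy (Euclidean distance, Ward linkage) and draw
# the dendrogram described above.
Z = linkage(X, method="ward", metric="euclidean")
dendrogram(Z)
plt.show()
```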
Hierarchical Clustering
Types of Hierarchical Clustering
There are two main types of hierarchical clustering:
 Agglomerative Hierarchical Clustering (Bottom-Up)
 Divisive Hierarchical Clustering (Top-Down)
1. Agglomerative Hierarchical Clustering (Bottom-Up)
 This method begins by treating each individual data point as a separate cluster.
 It then iteratively merges the two closest clusters at each step.
 This merging process continues until all data points are grouped into a single,
large cluster.
 The "closeness" or "similarity" between clusters is determined by a chosen
linkage criterion (e.g., single linkage, complete linkage, average linkage, Ward's
method).
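For a flat clustering cut from this hierarchy, scikit-learn's AgglomerativeClustering exposes the linkage criterion directly. A minimal sketch (the toy blobs are hypothetical data):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=1)

# linkage can be "ward", "complete", "average", or "single"
agg = AgglomerativeClustering(n_clusters=3, linkage="average").fit(X)
print(agg.labels_[:10])  # cluster assignment of the first 10 points
```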
Diagram
Advantages
 No Predefined K: Unlike algorithms like K-Means, it doesn't require
specifying the number of clusters (K) beforehand.
 Visual Insights: The resulting dendrogram offers a rich, hierarchical view
of the data's structure.
 Flexibility: It allows for various distance metrics and linkage methods to be
used.
Disadvantages
 Computational Cost:
It can be computationally expensive and slow for very large datasets:
the standard algorithm takes at least O(n²) time and needs O(n²) memory
for the pairwise distance matrix.
 Sensitivity:
It can be sensitive to noisy data and outliers, and once a merge is
done, it cannot be undone.
2. Divisive Hierarchical Clustering (Top-Down)
 Divisive Hierarchical Clustering (Top-Down Clustering) is a method that
starts with all data points in a single cluster and recursively splits it into
smaller, more homogeneous clusters.
 Unlike the bottom-up agglomerative approach, divisive clustering breaks
down broad categories into smaller groups by identifying the most dissimilar
points or clusters to split at each step.
 This top-down process continues until a stopping condition is met, such as
each data point forming its own cluster.
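Divisive clustering is rarely implemented exactly as stated; a common approximation is bisecting K-means, which repeatedly splits the largest remaining cluster with 2-means. A minimal, illustrative sketch (the divisive function and its parameters are hypothetical; recent scikit-learn releases also include a built-in sklearn.cluster.BisectingKMeans):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, n_clusters):
    """Top-down sketch: repeatedly bisect the largest cluster with 2-means."""
    labels = np.zeros(len(X), dtype=int)          # start with one big cluster
    while labels.max() + 1 < n_clusters:
        sizes = np.bincount(labels)               # pick the largest cluster...
        mask = labels == sizes.argmax()           # ...as the one to split next
        sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[mask])
        # points falling in the second half get a brand-new label
        labels[np.flatnonzero(mask)[sub == 1]] = labels.max() + 1
    return labels
```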
Diagram
Thank You
