Data Mining: Cluster Analysis
Faculty of Information Science 1
Data Mining
Chapter 10 Cluster Analysis
10.1 Basic Concept
Part 1 of 7
Course Objectives and Learning Outcomes
Objective
To understand cluster criteria and
clustering analysis.
To develop research interest in
advances in data mining.
To discover structures and patterns in
high-dimensional data.
To group data with similar patterns
together.
To reduce complexity and facilitate
interpretation.
Outcome
To identify appropriate data mining
algorithms to solve real-world problems.
To compare and evaluate different data
mining techniques like classification,
prediction, clustering and association rule
mining.
To learn about clusters such that
observations belonging to the same
cluster are very similar or homogeneous
and observations belonging to different
clusters are different or heterogeneous.
Course Outline
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Part I
What Is Cluster Analysis?
 Cluster: A collection of data objects
 similar (or related) to one another within the same group
 dissimilar (or unrelated) to the objects in other groups
 Cluster analysis (or clustering, data segmentation, …)
 Finding similarities between data according to the characteristics found in the data and
grouping similar data objects into clusters
 Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning
by examples: supervised)
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
Clustering as a Preprocessing Tool (Utility)
 Summarization:
 Preprocessing for regression, PCA, classification, and association analysis
 Compression:
 Image processing: vector quantization
 Finding K-nearest Neighbors
 Localizing search to one or a small number of clusters
 Outlier detection
 Outliers are often viewed as those “far away” from any cluster
Quality: What Is Good Clustering?
 A good clustering method will produce high quality clusters
 high intra-class similarity: cohesive within clusters
 low inter-class similarity: distinctive between clusters
 The quality of a clustering method depends on
 the similarity measure used by the method
 its implementation, and
 its ability to discover some or all of the hidden patterns
Measure the Quality of Clustering
 Dissimilarity/Similarity metric
 Similarity is expressed in terms of a distance function, typically metric: d(i, j)
 The definitions of distance functions are usually rather different for interval-scaled,
Boolean, categorical, ordinal, ratio, and vector variables
 Weights should be associated with different variables based on applications and data
semantics
 Quality of clustering:
 There is usually a separate “quality” function that measures the “goodness” of a
cluster.
 It is hard to define “similar enough” or “good enough”
 The answer is typically highly subjective
Considerations for Cluster Analysis
 Partitioning criteria
 Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is
desirable)
 Separation of clusters
 Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one
document may belong to more than one class)
 Similarity measure
 Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g.,
density or contiguity)
 Clustering space
 Full space (often when low dimensional) vs. subspaces (often in high-dimensional
clustering)
Requirements and Challenges
 Scalability
 Clustering all the data instead of only samples
 Ability to deal with different types of attributes
 Numerical, binary, categorical, ordinal, linked, and mixture of these
 Constraint-based clustering
 User may give inputs on constraints
 Use domain knowledge to determine input parameters
 Interpretability and usability
 Others
 Discovery of clusters with arbitrary shape
 Ability to deal with noisy data
 Incremental clustering and insensitivity to input order
 High dimensionality
Major Clustering Approaches (I)
 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion, e.g., minimizing
the sum of squared errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using some criterion
 Typical methods: Diana, Agnes, BIRCH, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
 Grid-based approach:
 Based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches (II)
 Model-based:
 A model is hypothesized for each of the clusters, and the method tries to find the best
fit of the data to the given model
 Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: p-Cluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific constraints
 Typical methods: COD (obstacles), constrained clustering
 Link-based clustering:
 Objects are often linked together in various ways
 Massive links can be used to cluster objects: SimRank, LinkClus
Summary
 Basic Concept
 Cluster, Clustering Analysis, Unsupervised Learning Methods, Application of
Clustering Analysis, Measure of cluster quality
 Clustering Methods
 Partition Methods
 Hierarchical Methods
 Density Based Methods
 Grid Based Methods
Data Mining
Chapter 10 Cluster Analysis
10.2 Partitioning Methods with K-Means
Part 2 of 7
Course Outline
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Part I
Partitioning Algorithms: Basic Concept
 Partitioning method: Partitioning a database D of n objects into a set of k clusters,
such that the sum of squared distances is minimized (where ci is the centroid or
medoid of cluster Ci):
E = Σ_{i=1}^{k} Σ_{p∈Ci} d(p, ci)²
 Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67, Lloyd’57/’82): where each cluster is represented by the
mean value of the objects in the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87):
where each cluster is represented by one of the objects located near the center
of the cluster
The K-Means Clustering Method
(Flowchart: Start → choose the number of clusters K → select initial centroids →
compute the distance of each object to the centroids → assign each object to the
group with the minimum distance → if any object changed its group, recompute the
centroids and repeat; otherwise End.)
Example of K-means Clustering
 Suppose we use objects A and B as the initial centroids and measure distance with
the Euclidean distance. Use the K-means algorithm to group the following points into
K = 2 clusters.

            A    B    C    D
Attribute1  2    2    5    7
Attribute2  10   5    8    5

Euclidean distance: d(x, y) = sqrt((x1 − x2)² + (y1 − y2)²)

K = 2. Initial centroids: A(2, 10) and B(2, 5). Points to assign: C(5, 8) and D(7, 5).
Example of K-means Clustering
Distances to the initial centroids (* marks the nearest centroid):

            C      D
A(2, 10)   3.61*  7.07
B(2, 5)    4.24   5.00*

After the first round:
Cluster 1: A, C
Cluster 2: B, D

New centroid points:
Cluster 1 (A, C): (3.5, 9)
Cluster 2 (B, D): (4.5, 5)

Distances to the new centroids:

            A      B      C      D
Cluster 1  1.80*  4.27   1.80*  5.32
Cluster 2  5.59   2.50*  3.04   2.50*

After the second round the assignment is unchanged (Cluster 1: A, C;
Cluster 2: B, D), so the algorithm terminates.
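The first round above can be reproduced with a short script (a sketch using the point coordinates from the table; `dist` and `centroid` are helper names introduced here, not from the slides):

```python
import math

def dist(p, q):
    # Euclidean distance between two 2-D points
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

points = {"A": (2, 10), "B": (2, 5), "C": (5, 8), "D": (7, 5)}

# Round 1: the initial centroids are A and B themselves
c1, c2 = points["A"], points["B"]
clusters = {1: [], 2: []}
for name, p in points.items():
    clusters[1 if dist(p, c1) <= dist(p, c2) else 2].append(name)
# clusters is now {1: ['A', 'C'], 2: ['B', 'D']}

def centroid(names):
    # mean of the member points, attribute by attribute
    xs = [points[n][0] for n in names]
    ys = [points[n][1] for n in names]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# New centroids after round 1: (3.5, 9.0) and (4.5, 5.0)
new_c1, new_c2 = centroid(clusters[1]), centroid(clusters[2])
```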
An Example of K-Means Clustering
 Partition objects into k nonempty
subsets
 Repeat
 Compute centroid (i.e., mean
point) for each partition
 Assign each object to the
cluster of its nearest centroid
 Until no change
(Illustration, K = 2: arbitrarily partition the objects into k groups, update the cluster
centroids, reassign each object to its nearest centroid, update the centroids again,
and loop while any object changes cluster.)
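The loop described above can be sketched as a small function; `kmeans`, `points`, and `init_centroids` are illustrative names introduced here, not part of any library:

```python
def kmeans(points, init_centroids):
    """Lloyd's k-means: assign each point to its nearest centroid,
    recompute the centroids, and repeat until no assignment changes."""
    centroids = list(init_centroids)
    assignment = None
    while True:
        # Assignment step: index of the nearest centroid for every point
        new_assignment = [
            min(range(len(centroids)),
                key=lambda i: sum((p - c) ** 2 for p, c in zip(pt, centroids[i])))
            for pt in points
        ]
        if new_assignment == assignment:  # no object moved group: done
            return centroids, assignment
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members
        for i in range(len(centroids)):
            members = [pt for pt, a in zip(points, assignment) if a == i]
            if members:
                centroids[i] = tuple(sum(x) / len(members) for x in zip(*members))
```

Running it on the four points of the earlier example, with A and B as the initial centroids, returns the centroids (3.5, 9.0) and (4.5, 5.0) from that example.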
Animation for K-Means
Mean-Shift Clustering
Mean-Shift Clustering for a single sliding window
 Mean-shift clustering is a sliding-window-based algorithm that tries to find
dense areas of data points.
 It is also a centroid-based algorithm: the goal is to locate the center point of
each group/class, and it works by updating candidate center points to be the
mean of the points within the sliding window.
 In contrast to K-means clustering, there is no need to select the number of clusters
as mean-shift automatically discovers this. That’s a massive advantage.
 The fact that the cluster centers converge towards the points of maximum density is
also quite desirable as it is quite intuitive to understand and fits well in a naturally
data-driven sense.
 The drawback is that the selection of the window size/radius “r” can be non-trivial.
 The heuristic clustering methods work well for finding spherical-shaped clusters in
small to medium databases.
 To find clusters with complex shapes and for clustering very large data sets,
partitioning based methods need to be extended.
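The single-window behavior described above can be sketched in one dimension; the function name, the sample data, and the radius value below are illustrative assumptions, not from the slides:

```python
def mean_shift_window(points, start, r, tol=1e-6):
    """Shift a window center to the mean of the points within radius r,
    repeating until the center stops moving (a density peak)."""
    center = start
    while True:
        inside = [p for p in points if abs(p - center) <= r]
        new_center = sum(inside) / len(inside)
        if abs(new_center - center) < tol:  # converged
            return new_center
        center = new_center
```

Started near either of two dense groups, the window drifts to that group's center; the choice of `r` decides which points count as "inside", which is exactly the non-trivial tuning the text mentions.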
Comments on the K-Means Method
 Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations.
Normally, k, t << n.
 Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
 Comment: Often terminates at a local optimum.
 Weakness
 Applicable only to objects in a continuous n-dimensional space
 Using the k-modes method for categorical data
 In comparison, k-medoids can be applied to a wide range of data
 Need to specify k, the number of clusters, in advance (there are ways to automatically
determine the best k; see Hastie et al., 2009)
 Sensitive to noisy data and outliers
 Not suitable to discover clusters with non-convex shapes
Summary
 Basics of Partitioning Methods
 Definition, evaluation criteria, when partitioning applies, and some representative
partitioning methods
 K-Means
 The algorithm, and its behavior illustrated with a worked example
Data Mining
Lecture Title : Cluster Analysis
Lecture Sub-title : Hierarchical Methods
Faculty of Information Science
University of Computer Studies, Magway
Textbooks & References / Citation
 Text Book: Data Mining: Concepts and Techniques (3rd ed.)
 Reference Book: Data Mining: Concepts and Techniques (2nd ed.)
 Citations: Hierarchical Clustering - AGNES, DIANA, BIRCH &
CHAMELEON, Probabilistic Hierarchical Clustering in
https://www.datamining365.com/2020/01/clustering-in-data-mining.html
Course Outline
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Part I
Hierarchical Clustering Method
 Use distance matrix as clustering criteria. This method does not require the number
of clusters k as an input, but needs a termination condition
(Illustration: five objects a, b, c, d, e. Agglomerative clustering (AGNES) runs from
Step 0 to Step 4, merging {a, b}, {d, e}, then {c, d, e}, and finally {a, b, c, d, e};
divisive clustering (DIANA) runs the same steps in reverse, from Step 4 back to Step 0.)
Agglomerative Hierarchical Clustering
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical packages, e.g., Splus
 Use the single-link method and the dissimilarity matrix
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
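The AGNES merge loop can be sketched in pure Python for 1-D points; the function name, the stopping rule (stop at k clusters rather than merging to one), and the data are illustrative:

```python
def agnes_single_link(points, k):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    least single-link dissimilarity until only k clusters remain."""
    clusters = [[p] for p in points]  # every object starts as its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: smallest pairwise distance between the clusters
                d = min(abs(p - q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters
```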
(Three scatter plots on a 0–10 × 0–10 grid illustrating AGNES: the original objects,
the merging of the closest clusters, and the final single cluster.)
DIANA (Divisive Analysis)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own
(Three scatter plots on a 0–10 × 0–10 grid illustrating DIANA: one cluster is split
step by step until each object forms a cluster on its own.)
Dendrogram: Shows How Clusters are Merged
 Decompose data objects into several levels of nested partitioning (a tree of
clusters), called a dendrogram
 A clustering of the data objects is obtained by cutting the dendrogram at the
desired level, then each connected component forms a cluster
(Dendrogram: read bottom-up for agglomerative clustering (AGNES), top-down for
divisive clustering (DIANA).)
Example: Dendrogram representation
Distance between Clusters
(Illustration: closest pair, furthest pair, centroid pair, average over all pairs.)
 Single link: smallest distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
 Complete link: largest distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
 Average: avg distance between an element in one cluster and an element in
the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
 Centroid: distance between the centroids of two clusters,
i.e., dist(Ki, Kj) = dist(Ci, Cj)
 Medoid: distance between the medoids of two clusters,
i.e., dist(Ki, Kj) = dist(Mi, Mj)
 Medoid: a chosen, centrally located data object in the cluster
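The linkage measures above can be written directly from their definitions; here `Ki` and `Kj` are plain lists of 1-D points, an illustrative simplification:

```python
def single_link(Ki, Kj):
    # smallest distance between an element of Ki and an element of Kj
    return min(abs(p - q) for p in Ki for q in Kj)

def complete_link(Ki, Kj):
    # largest distance between an element of Ki and an element of Kj
    return max(abs(p - q) for p in Ki for q in Kj)

def average_link(Ki, Kj):
    # average distance over all cross-cluster pairs
    return sum(abs(p - q) for p in Ki for q in Kj) / (len(Ki) * len(Kj))

def centroid_link(Ki, Kj):
    # distance between the centroids of the two clusters
    return abs(sum(Ki) / len(Ki) - sum(Kj) / len(Kj))

Ki, Kj = [1, 2, 3], [7, 9]
# single_link = 4, complete_link = 8, average_link = 6.0, centroid_link = 6.0
```

By construction, single link ≤ average link ≤ complete link for any pair of clusters.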
Centroid, Radius and Diameter of a Cluster (for
numerical data sets)
 Centroid: the “middle” of a cluster
 Radius: square root of average distance from any point of the cluster to its centroid
 Diameter: square root of average mean squared distance between all pairs of points
in the cluster
Cm = (Σ_{i=1}^{N} t_ip) / N

Rm = sqrt( Σ_{i=1}^{N} (t_ip − Cm)² / N )

Dm = sqrt( Σ_{i=1}^{N} Σ_{j≠i} (t_ip − t_jq)² / (N (N − 1)) )
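For a concrete 1-D cluster, the three quantities can be computed straight from the definitions of centroid, radius, and diameter above; the sample points are illustrative:

```python
import math

cluster = [2.0, 4.0, 6.0]  # a small 1-D cluster (illustrative data)
N = len(cluster)

# Centroid Cm: the mean of the member points
centroid = sum(cluster) / N

# Radius Rm: square root of the average squared distance to the centroid
radius = math.sqrt(sum((t - centroid) ** 2 for t in cluster) / N)

# Diameter Dm: square root of the average squared distance over all
# ordered pairs of distinct points, hence the division by N(N - 1)
diameter = math.sqrt(
    sum((cluster[i] - cluster[j]) ** 2
        for i in range(N) for j in range(N) if i != j) / (N * (N - 1))
)
```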
Extensions to Hierarchical Clustering
 Major weakness of agglomerative clustering methods
 Can never undo what was done previously, and cannot work well on large
datasets
 Do not scale well: time complexity of at least O(n²), where n is the number of
total objects
 Integration of hierarchical & distance-based clustering
 BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
 CHAMELEON (1999): hierarchical clustering using dynamic modeling
Types of Hierarchical Clustering
 BIRCH - Balanced Iterative Reducing and Clustering using Hierarchies
 CURE - Clustering Using REpresentatives
 CHAMELEON - Hierarchical Clustering using Dynamic Modeling. This is a graph
partitioning approach used in clustering complex structures.
 Probabilistic Hierarchical Clustering
 Generative Clustering Model
Quiz I
 In which of the following cases will K-Means clustering fail to give good results?
1. Data points with outliers
2. Data points with different densities
3. Data points with round shapes
4. Data points with non-convex shapes
A. 1 and 2
B. 2 and 3
C. 2 and 4
D. 1, 2 and 4
Answer: D
Quiz II
 Which of the following metrics can be used for measuring the dissimilarity between
two clusters in hierarchical clustering?
1. Single-link
2. Complete-link
3. Average-link
A. 1 and 2
B. 1 and 3
C. 2 and 3
D. 1, 2 and 3
Answer: D

Chapter 10.1,2,3.pptx

  • 1.
    Data Mining: ClusterAnalysis Faculty of Information Science 1 Data Mining Chapter 10 Cluster Analysis 10.1 Basic Concept Part 1 of 7
  • 2.
    Course Objectives andLearning Outcomes Data Mining: Cluster Analysis Faculty of Information Science 2 Objective To understand the cluster criteria and Clustering analysis. To develop research interest towards advances in data mining. To discover structures and patterns in high-dimensional data. To collect group data with similar patterns together. To reduces the complexity and facilitates interpretation Outcome To Identify appropriate data mining algorithms to solve real world problems To compare and evaluate different data mining techniques like classification, prediction, clustering and association rule mining. To learn about clusters such that observations belonging to the same cluster are very similar or homogeneous and observations belonging to different clusters are different or heterogeneous.
  • 3.
    Course Outline Data Mining:Cluster Analysis Faculty of Information Science 3 Cluster Analysis: Basic Concepts Density-Based Methods Partitioning Methods Grid-Based Methods Hierarchical Methods Evaluation of Clustering Part I
  • 4.
    What is ClusteringAnalysis  Cluster: A collection of data objects  similar (or related) to one another within the same group  dissimilar (or unrelated) to the objects in other groups  Cluster analysis (or clustering, data segmentation, …)  Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters  Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)  Typical applications  As a stand-alone tool to get insight into data distribution  As a preprocessing step for other algorithms Data Mining: Cluster Analysis Faculty of Information Science 4
  • 5.
    Clustering as aPreprocessing Tool (Utility)  Summarization:  Preprocessing for regression, PCA, classification, and association analysis  Compression:  Image processing: vector quantization  Finding K-nearest Neighbors  Localizing search to one or a small number of clusters  Outlier detection  Outliers are often viewed as those “far away” from any cluster Data Mining: Cluster Analysis Faculty of Information Science 5
  • 6.
    Quality: What IsGood Clustering?  A good clustering method will produce high quality clusters  high intra-class similarity: cohesive within clusters  low inter-class similarity: distinctive between clusters  The quality of a clustering method depends on  the similarity measure used by the method  its implementation, and  Its ability to discover some or all of the hidden patterns Data Mining: Cluster Analysis Faculty of Information Science 6
  • 7.
    Measure the Qualityof Clustering  Dissimilarity/Similarity metric  Similarity is expressed in terms of a distance function, typically metric: d(i, j)  The definitions of distance functions are usually rather different for interval-scaled, Boolean, categorical, ordinal ratio, and vector variables  Weights should be associated with different variables based on applications and data semantics  Quality of clustering:  There is usually a separate “quality” function that measures the “goodness” of a cluster.  It is hard to define “similar enough” or “good enough”  The answer is typically highly subjective Data Mining: Cluster Analysis Faculty of Information Science 7
  • 8.
    Considerations for ClusterAnalysis  Partitioning criteria  Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)  Separation of clusters  Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)  Similarity measure  Distance-based (e.g., Euclidian, road network, vector) vs. connectivity-based (e.g., density or contiguity)  Clustering space  Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering) Data Mining: Cluster Analysis Faculty of Information Science 8
  • 9.
    Requirements and Challenges Scalability  Clustering all the data instead of only on samples  Ability to deal with different types of attributes  Numerical, binary, categorical, ordinal, linked, and mixture of these  Constraint-based clustering  User may give inputs on constraints  Use domain knowledge to determine input parameters  Interpretability and usability  Others  Discovery of clusters with arbitrary shape  Ability to deal with noisy data  Incremental clustering and insensitivity to input order  High dimensionality Data Mining: Cluster Analysis Faculty of Information Science 9
  • 10.
    Major Clustering Approaches(I)  Create a hierarchical decomposition of the set of data (or objects) using some criterion  Partitioning approach:  Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square k-melodies, errors  Typical methods: k-means, CLARANS  Hierarchical approach:  Typical methods: Diana, Agnes, BIRCH, CAMELEON  Density-based approach:  Based on connectivity and density functions  Typical methods: DBSACN, OPTICS, DenClue  Grid-based approach:  based on a multiple-level granularity structure  Typical methods: STING, Wave Cluster, CLIQUE Data Mining: Preprocessing Faculty of Information Science 10
  • 11.
    Major Clustering Approaches(II)  Model-based:  A model is hypothesized for each of the clusters and tries to find the best fit of that model to each other  Typical methods: EM, SOM, COBWEB  Frequent pattern-based:  Based on the analysis of frequent patterns  Typical methods: p-Cluster  User-guided or constraint-based:  Clustering by considering user-specified or application-specific constraints  Typical methods: COD (obstacles), constrained clustering  Link-based clustering:  Objects are often linked together in various ways  Massive links can be used to cluster objects: SimRank, LinkClus Data Mining: Cluster Analysis Faculty of Information Science 11
  • 12.
    Data Mining: ClusterAnalysis Faculty of Information Science 12
  • 13.
    Summary  Basic Concept Cluster, Clustering Analysis, Unsupervised Learning Methods, Application of Clustering Analysis, Measure of cluster quality  Clustering Methods  Partition Methods  Hierarchical Methods  Density Based Methods  Grid Based Methods Data Mining: Cluster Analysis Faculty of Information Science 13
  • 14.
    Data Mining: ClusterAnalysis Faculty of Information Science 14 Data Mining Chapter 10 Cluster Analysis 10.2 Partitions Methods with K-Mean Part 2 of 7
  • 15.
    Course Outline Data Mining:Cluster Analysis Faculty of Information Science 15 Cluster Analysis: Basic Concepts Density-Based Methods Partitioning Methods Grid-Based Methods Hierarchical Methods Evaluation of Clustering Part I
  • 16.
    Partitioning Algorithms: BasicConcept  Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where ci is the centroid or myeloid of cluster Ci)  Given k, find a partition of k clusters that optimizes the chosen partitioning criterion  Global optimal: exhaustively enumerate all partitions  Heuristic methods: k-means and k-melodies algorithms  k-means (MacQueen’67, Lloyd’57/’82): where each cluster is represented by the mean value of the objects in the cluster  k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): where each cluster is represented by one of the objects located near the center of the cluster Data Mining: Cluster Analysis Faculty of Information Science 16
  • 17.
    The K-Means ClusteringMethod Data Mining: Cluster Analysis Faculty of Information Science 17 Start Number of Cluster K Centroid Distance Objects to Centroid Group Objects in Minimum Distance No Object move Group End No Yes
  • 18.
    Example of Kmean Clustering Data Mining: Cluster Analysis Faculty of Information Science 18  Suppose we use machine A and machine B as the first centroid and the distance using Euclidean distance. Use K-mean algorithm for the last two cluster the following point. A B C D Attribute1 2 2 5 7 Attribute2 10 5 8 5 Euclidean Distance 𝑑 𝑥, 𝑦 = (𝑥1 − 𝑥2)2+(𝑦1 − 𝑦2)2 K=2 Centroid New Point A(2,10) C(5,8) B(2,5) D(7,5)
  • 19.
    Example of Kmean Clustering Data Mining: Cluster Analysis Faculty of Information Science 19 C D A 3.61 * 7.07 B 4.24 3.61* After the First Round Centroid New Point Cluster 1 A,C Cluster 2 B,D Calculate New Centroid Point Centroid New Point Cluster 1 A,C (3.5,9) Cluster 2 B,D (4.5,5) A B C D Cluster 1 1.80* 4.27 1.8* 5.32 Cluster 2 5.12 2.5* 3.04 2.5* After the Second Round Centroid New Point Cluster 1 A,C Cluster 2 B,D
  • 20.
    An Example ofK-Means Clustering Data Mining: Cluster Analysis Faculty of Information Science 20  Partition objects into k nonempty subsets  Repeat  Compute centroid (i.e., mean point) for each partition  Assign each object to the cluster of its nearest centroid  Until no change Arbitrarily partition objects into k groups K=2 Update the cluster centroids Reassign Object Update the cluster centroids Loop if needed
  • 21.
    Animation for K-Means DataMining: Cluster Analysis Faculty of Information Science 21
  • 22.
    Mean-Shift Clustering Data Mining:Cluster Analysis Faculty of Information Science 22
  • 23.
    Mean-Shift Clustering fora single sliding window  Mean shift clustering is a sliding- window-based algorithm  to find dense areas of data points  It is also a centroid-based algorithm meaning that the goal is to locate the center points of each group/class, which works by updating candidates for center points to be the mean of the points within the sliding-window. Data Mining: Cluster Analysis Faculty of Information Science 23
  • 24.
     In contrastto K-means clustering, there is no need to select the number of clusters as mean-shift automatically discovers this. That’s a massive advantage.  The fact that the cluster centers converge towards the points of maximum density is also quite desirable as it is quite intuitive to understand and fits well in a naturally data-driven sense.  The drawback is that the selection of the window size/radius “r” can be non-trivial.  The heuristic clustering methods work well for finding spherical-shaped clusters in small to medium databases.  To find clusters with complex shapes and for clustering very large data sets, partitioning based methods need to be extended. Data Mining: Cluster Analysis Faculty of Information Science 24
  • 25.
    Comments on theK-Means Method  Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.  Comparing: PAM: O(k(n-k)2 ), CLARA: O(ks2 + k(n-k))  Comment: Often terminates at a local optimal.  Weakness  Applicable only to objects in a continuous n-dimensional space  Using the k-modes method for categorical data  In comparison, k-medoids can be applied to a wide range of data  Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k (see Hastie et al., 2009)  Sensitive to noisy data and outliers  Not suitable to discover clusters with non-convex shapes Data Mining: Cluster Analysis Faculty of Information Science 25
  • 26.
    Summary  Basic ofPartition Methods  Definition, Evolution and when the partition, Some methods of Partition Methods  K-Means  Working with Algorithms, Performance of algorithms with Example Data Mining: Cluster Analysis Faculty of Information Science 26
  • 27.
    Data Mining: ClusterAnalysis Faculty of Information Science 27 Data Mining Lecture Title : Cluster Analysis Lecture Sub-title : Hierarchical Methods Faculty of Information Science University of Computer Studies, Magway
  • 28.
    Textbooks & References/ Citation Data Mining: Cluster Analysis Faculty of Information Science 28  Text Books :Data Mining: Concepts and Techniques (3rd ed.)  Reference Books :Data Mining: Concepts and Techniques (2rd ed.)  Citations: Hierarchical Clustering - AGNES, DIANA, BIRCH & CHAMELEON, Probabilistic Hierarchical Clustering in https://www.datamining365.com/2020/01/clustering-in-data-mining.html
  • 29.
    Course Outline Data Mining:Cluster Analysis Faculty of Information Science 29 Cluster Analysis: Basic Concepts Density-Based Methods Partitioning Methods Grid-Based Methods Evaluation of Clustering Part I Hierarchical Methods
  • 30.
    Hierarchical Clustering Method Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition Data Mining: Cluster Analysis Faculty of Information Science 30 Step 0 Step 1 Step 2 Step 3 Step 4 b d c e a a b d e c d e a b c d e Step 4 Step 3 Step 2 Step 1 Step 0 agglomerative (AGNES) divisive (DIANA)
  • 31.
    Agglomerative Hierarchical Clustering DataMining: Cluster Analysis Faculty of Information Science 31
  • 32.
    AGNES (Agglomerative Nesting) Introduced in Kaufmann and Rousseeuw (1990)  Implemented in statistical packages, e.g., Splus  Use the single-link method and the dissimilarity matrix  Merge nodes that have the least dissimilarity  Go on in a non-descending fashion  Eventually all nodes belong to the same cluster Data Mining: Cluster Analysis Faculty of Information Science 32 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
  • 33.
    DIANA (Divisive Analysis) Introduced in Kaufmann and Rousseeuw (1990)  Implemented in statistical analysis packages, e.g., Splus  Inverse order of AGNES  Eventually each node forms a cluster on its own Data Mining: Cluster Analysis Faculty of Information Science 33 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
  • 34.
    Dendrogram: Shows HowClusters are Merged  Decompose data objects into a several levels of nested partitioning (tree of clusters), called a dendrogram  A clustering of the data objects is obtained by cutting the dendrogram at the desired level, then each connected component forms a cluster Data Mining: Cluster Analysis Faculty of Information Science 34 divisive (DIANA) agglomerative (AGNES)
  • 35.
    Example: Dendrogram representation DataMining: Cluster Analysis Faculty of Information Science 35
  • 36.
    Distance between Clusters DataMining: Cluster Analysis Faculty of Information Science 36 Closet Pair Furthest Pair Centroid Pair Average set of pair  Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)  Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)  Average: avg distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)  Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)  Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)  Medoid: a chosen, centrally located data object in the cluster
  • 37.
    Centroid, Radius and Diameter of a Cluster (for numerical data sets)  Centroid: the "middle" of a cluster: Cm = (Σ_{i=1..N} tip) / N  Radius: square root of the average distance from any point of the cluster to its centroid: Rm = sqrt( Σ_{i=1..N} (tip − cm)² / N )  Diameter: square root of the average mean squared distance between all pairs of points in the cluster: Dm = sqrt( Σ_{i=1..N} Σ_{j=1..N} (tip − tjq)² / (N(N−1)) )
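The three quantities follow directly from the formulas above. A minimal sketch for a 1-D cluster (illustrative function names):

```python
# Sketch computing centroid, radius, and diameter of a 1-D cluster,
# following the formulas on this slide.
from math import sqrt

def centroid(cluster):
    # Cm = (sum of points) / N: the "middle" of the cluster.
    return sum(cluster) / len(cluster)

def radius(cluster):
    # Rm = sqrt(average squared distance from each point to the centroid).
    N = len(cluster)
    c = centroid(cluster)
    return sqrt(sum((t - c) ** 2 for t in cluster) / N)

def diameter(cluster):
    # Dm = sqrt(average squared distance over all ordered pairs of points),
    # normalized by N(N-1); the p == q terms contribute zero.
    N = len(cluster)
    return sqrt(sum((p - q) ** 2 for p in cluster for q in cluster)
                / (N * (N - 1)))

print(centroid([0, 2, 4]), radius([0, 2, 4]), diameter([0, 2, 4]))
```

Note the diameter here is a squared-average measure, not the maximum pairwise distance; both definitions appear in the literature.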
  • 38.
    Extensions to Hierarchical Clustering  Major weaknesses of agglomerative clustering methods  Can never undo what was done previously, and cannot work well on large data sets  Do not scale well: time complexity of at least O(n²), where n is the total number of objects  Integration of hierarchical & distance-based clustering  BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters  CHAMELEON (1999): hierarchical clustering using dynamic modeling
  • 39.
    Types of Hierarchical Clustering  BIRCH - Balanced Iterative Reducing and Clustering using Hierarchies  CURE - Clustering Using REpresentatives  CHAMELEON - Hierarchical Clustering using Dynamic Modeling: a graph-partitioning approach used to cluster complex structures  Probabilistic Hierarchical Clustering  Generative Clustering Model
  • 40.
    Quiz I  Inwhich of the following cases will K-Means clustering fail to give good results? 1. Data points with outliers 2. Data points with different densities 3. Data points with round shapes 4. Data points with non-convex shapes Data Mining: Cluster Analysis Faculty of Information Science 40 Answer : D A. 1 and 2 B. 2 and 3 C. 2 and 4 D. 1, 2 and 4
  • 41.
    Quiz II  Whichof the following metrics, do we have for finding dissimilarity between two clusters in hierarchical clustering? 1. Single-link 2. Complete-link 3. Average-link Data Mining: Cluster Analysis Faculty of Information Science 41 A. 1 and 2 B. 1 and 3 C. 2 and 3 D. 1, 2 and 3 Answer : D

Editor's Notes

  • #18 Given k, the k-means algorithm is implemented in four steps: (1) partition the objects into k nonempty subsets; (2) compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster); (3) assign each object to the cluster with the nearest seed point; (4) go back to step 2, and stop when the assignment no longer changes.
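The four steps in this note can be sketched directly in Python for 1-D points. This is a teaching sketch with illustrative names (it initializes by round-robin and omits empty-cluster handling), not a production implementation:

```python
# Minimal k-means sketch following the four steps in the note above.
from statistics import mean

def k_means(points, k, max_iter=100):
    # Step 1: partition objects into k nonempty subsets (round-robin here).
    clusters = [points[i::k] for i in range(k)]
    for _ in range(max_iter):
        # Step 2: compute seed points as centroids of the current clusters.
        # (Assumes no cluster becomes empty; a real implementation must
        # handle that case.)
        seeds = [mean(c) for c in clusters]
        # Step 3: assign each object to the cluster with the nearest seed.
        new_clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - seeds[i]))
            new_clusters[nearest].append(p)
        # Step 4: stop when the assignment no longer changes.
        if new_clusters == clusters:
            break
        clusters = new_clusters
    return clusters

print(k_means([1.0, 1.1, 0.9, 8.0, 8.2, 7.8], k=2))
```

On this toy input the two well-separated groups are recovered after one reassignment pass.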
  • #31 Refer: https://www.tutorialandexample.com/hierarchical-clustering-algorithm/
  • #33 Refer: https://medium.com/@ramhajraj/list-of-statistical-packages-f897f6aa3647
  • #37 Refer: https://youtu.be/uxmHoQ1DGPU https://online.stat.psu.edu/stat505/lesson/14/14.4 https://www.stat.cmu.edu/~ryantibs/datamining/lectures/05-clus2.pdf The three most common linkage functions are single, complete, and average linkage. Single linkage measures the least dissimilar pair between groups, complete linkage measures the most dissimilar pair, and average linkage measures the average dissimilarity over all pairs. dist(Ki, Kj) = min(tip, tjq); dmj = max(dkj, dlj). Notation: dist(Ki, Kj): distance between clusters Ki and Kj; min(tip, tjq): smallest distance between element tip in one cluster and element tjq in the other cluster; tip: distance between clusters i and p; tjq: distance between clusters j and q.
  • #38 Refer: https://www.stat.cmu.edu/~ryantibs/datamining/lectures/06-clus3.pdf
  • #41 K-Means clustering algorithm fails to give good results when the data contains outliers, the density spread of data points across the data space is different and the data points follow non-convex shapes.
  • #42 Answer: D