Data Mining: Cluster Analysis
Faculty of Information Science 1
Data Mining
Chapter 10 Cluster Analysis
10.1 Basic Concept
Part 1 of 7
Course Objectives and Learning Outcomes
Objective
To understand cluster criteria and
clustering analysis.
To develop research interest in
advances in data mining.
To discover structures and patterns in
high-dimensional data.
To group data with similar patterns
together.
To reduce complexity and facilitate
interpretation.
Outcome
To identify appropriate data mining
algorithms to solve real-world problems.
To compare and evaluate different data
mining techniques like classification,
prediction, clustering and association rule
mining.
To learn about clusters such that
observations belonging to the same
cluster are very similar or homogeneous
and observations belonging to different
clusters are different or heterogeneous.
Course Outline
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Part I
What Is Cluster Analysis?
 Cluster: A collection of data objects
 similar (or related) to one another within the same group
 dissimilar (or unrelated) to the objects in other groups
 Cluster analysis (or clustering, data segmentation, …)
 Finding similarities between data according to the characteristics found in the data and
grouping similar data objects into clusters
 Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning
by examples: supervised)
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
Clustering as a Preprocessing Tool (Utility)
 Summarization:
 Preprocessing for regression, PCA, classification, and association analysis
 Compression:
 Image processing: vector quantization
 Finding K-nearest Neighbors
 Localizing search to one or a small number of clusters
 Outlier detection
 Outliers are often viewed as those “far away” from any cluster
Quality: What Is Good Clustering?
 A good clustering method will produce high quality clusters
 high intra-class similarity: cohesive within clusters
 low inter-class similarity: distinctive between clusters
 The quality of a clustering method depends on
 the similarity measure used by the method
 its implementation, and
 its ability to discover some or all of the hidden patterns
Measure the Quality of Clustering
 Dissimilarity/Similarity metric
 Similarity is expressed in terms of a distance function, typically metric: d(i, j)
 The definitions of distance functions are usually rather different for interval-scaled,
Boolean, categorical, ordinal, ratio, and vector variables
 Weights should be associated with different variables based on applications and data
semantics
 Quality of clustering:
 There is usually a separate “quality” function that measures the “goodness” of a
cluster.
 It is hard to define “similar enough” or “good enough”
 The answer is typically highly subjective
Considerations for Cluster Analysis
 Partitioning criteria
 Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is
desirable)
 Separation of clusters
 Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one
document may belong to more than one class)
 Similarity measure
 Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g.,
density or contiguity)
 Clustering space
 Full space (often when low dimensional) vs. subspaces (often in high-dimensional
clustering)
Requirements and Challenges
 Scalability
 Clustering all the data instead of only samples
 Ability to deal with different types of attributes
 Numerical, binary, categorical, ordinal, linked, and mixture of these
 Constraint-based clustering
 User may give inputs on constraints
 Use domain knowledge to determine input parameters
 Interpretability and usability
 Others
 Discovery of clusters with arbitrary shape
 Ability to deal with noisy data
 Incremental clustering and insensitivity to input order
 High dimensionality
Major Clustering Approaches (I)
 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion, e.g., minimizing
the sum of squared errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using some criterion
 Typical methods: Diana, Agnes, BIRCH, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
 Grid-based approach:
 Based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches (II)
 Model-based:
 A model is hypothesized for each of the clusters, and the method tries to find the best
fit of the data to the given model
 Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: p-Cluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific constraints
 Typical methods: COD (obstacles), constrained clustering
 Link-based clustering:
 Objects are often linked together in various ways
 Massive links can be used to cluster objects: SimRank, LinkClus
Summary
 Basic Concept
 Cluster, Clustering Analysis, Unsupervised Learning Methods, Application of
Clustering Analysis, Measure of cluster quality
 Clustering Methods
 Partition Methods
 Hierarchical Methods
 Density Based Methods
 Grid Based Methods
Data Mining
Chapter 10 Cluster Analysis
10.2 Partitioning Methods with K-Means
Part 2 of 7
Course Outline
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Part I
Partitioning Algorithms: Basic Concept
 Partitioning method: Partitioning a database D of n objects into a set of k clusters,
such that the sum of squared distances is minimized (where ci is the centroid or
medoid of cluster Ci):
E = Σ_{i=1}^{k} Σ_{p∈Ci} d(p, ci)²
 Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67, Lloyd’57/’82): where each cluster is represented by the
mean value of the objects in the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87):
where each cluster is represented by one of the objects located near the center
of the cluster
The K-Means Clustering Method
(Flowchart: Start → choose the number of clusters K → select initial centroids →
compute the distance of each object to the centroids → assign each object to the
group with the minimum distance → if any object changed its group, recompute the
centroids and repeat; otherwise End.)
Example of K-means Clustering
 Suppose we use objects A and B as the initial centroids and measure distance with
the Euclidean distance. Use the K-means algorithm to group the following points into
K = 2 clusters.

            A    B    C    D
Attribute1  2    2    5    7
Attribute2  10   5    8    5

Euclidean distance: d(x, y) = sqrt((x1 − x2)² + (y1 − y2)²)

K = 2. Initial centroids: A(2, 10) and B(2, 5). Points to assign: C(5, 8) and D(7, 5).
Example of K-means Clustering
Distances to the initial centroids (* marks the nearest centroid):

            C      D
A(2, 10)   3.61*  7.07
B(2, 5)    4.24   5.00*

After the first round:
Cluster 1: A, C
Cluster 2: B, D

New centroid points:
Cluster 1 (A, C): (3.5, 9)
Cluster 2 (B, D): (4.5, 5)

Distances to the new centroids:

            A      B      C      D
Cluster 1  1.80*  4.27   1.80*  5.32
Cluster 2  5.59   2.50*  3.04   2.50*

After the second round the assignment is unchanged (Cluster 1: A, C;
Cluster 2: B, D), so the algorithm terminates.
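The first round above can be reproduced with a short script (a sketch using the point coordinates from the table; `dist` and `centroid` are helper names introduced here, not from the slides):

```python
import math

def dist(p, q):
    # Euclidean distance between two 2-D points
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

points = {"A": (2, 10), "B": (2, 5), "C": (5, 8), "D": (7, 5)}

# Round 1: the initial centroids are A and B themselves
c1, c2 = points["A"], points["B"]
clusters = {1: [], 2: []}
for name, p in points.items():
    clusters[1 if dist(p, c1) <= dist(p, c2) else 2].append(name)
# clusters is now {1: ['A', 'C'], 2: ['B', 'D']}

def centroid(names):
    # mean of the member points, attribute by attribute
    xs = [points[n][0] for n in names]
    ys = [points[n][1] for n in names]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# New centroids after round 1: (3.5, 9.0) and (4.5, 5.0)
new_c1, new_c2 = centroid(clusters[1]), centroid(clusters[2])
```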
An Example of K-Means Clustering
 Partition objects into k nonempty
subsets
 Repeat
 Compute centroid (i.e., mean
point) for each partition
 Assign each object to the
cluster of its nearest centroid
 Until no change
(Illustration, K = 2: arbitrarily partition the objects into k groups, update the cluster
centroids, reassign each object to its nearest centroid, update the centroids again,
and loop while any object changes cluster.)
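The loop described above can be sketched as a small function; `kmeans`, `points`, and `init_centroids` are illustrative names introduced here, not part of any library:

```python
def kmeans(points, init_centroids):
    """Lloyd's k-means: assign each point to its nearest centroid,
    recompute the centroids, and repeat until no assignment changes."""
    centroids = list(init_centroids)
    assignment = None
    while True:
        # Assignment step: index of the nearest centroid for every point
        new_assignment = [
            min(range(len(centroids)),
                key=lambda i: sum((p - c) ** 2 for p, c in zip(pt, centroids[i])))
            for pt in points
        ]
        if new_assignment == assignment:  # no object moved group: done
            return centroids, assignment
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members
        for i in range(len(centroids)):
            members = [pt for pt, a in zip(points, assignment) if a == i]
            if members:
                centroids[i] = tuple(sum(x) / len(members) for x in zip(*members))
```

Running it on the four points of the earlier example, with A and B as the initial centroids, returns the centroids (3.5, 9.0) and (4.5, 5.0) from that example.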
Animation for K-Means
Mean-Shift Clustering
Mean-Shift Clustering for a single sliding window
 Mean-shift clustering is a sliding-window-based algorithm that tries to find
dense areas of data points.
 It is also a centroid-based algorithm: the goal is to locate the center point of
each group/class, and it works by updating candidate center points to be the
mean of the points within the sliding window.
 In contrast to K-means clustering, there is no need to select the number of clusters
as mean-shift automatically discovers this. That’s a massive advantage.
 The fact that the cluster centers converge towards the points of maximum density is
also quite desirable as it is quite intuitive to understand and fits well in a naturally
data-driven sense.
 The drawback is that the selection of the window size/radius “r” can be non-trivial.
 The heuristic clustering methods work well for finding spherical-shaped clusters in
small to medium databases.
 To find clusters with complex shapes and for clustering very large data sets,
partitioning based methods need to be extended.
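The single-window behavior described above can be sketched in one dimension; the function name, the sample data, and the radius value below are illustrative assumptions, not from the slides:

```python
def mean_shift_window(points, start, r, tol=1e-6):
    """Shift a window center to the mean of the points within radius r,
    repeating until the center stops moving (a density peak)."""
    center = start
    while True:
        inside = [p for p in points if abs(p - center) <= r]
        new_center = sum(inside) / len(inside)
        if abs(new_center - center) < tol:  # converged
            return new_center
        center = new_center
```

Started near either of two dense groups, the window drifts to that group's center; the choice of `r` decides which points count as "inside", which is exactly the non-trivial tuning the text mentions.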
Comments on the K-Means Method
 Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations.
Normally, k, t << n.
 Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
 Comment: Often terminates at a local optimum.
 Weakness
 Applicable only to objects in a continuous n-dimensional space
 Using the k-modes method for categorical data
 In comparison, k-medoids can be applied to a wide range of data
 Need to specify k, the number of clusters, in advance (there are ways to automatically
determine the best k; see Hastie et al., 2009)
 Sensitive to noisy data and outliers
 Not suitable to discover clusters with non-convex shapes
Summary
 Basics of Partitioning Methods
 Definition, evaluation criteria, when partitioning applies, and some representative
partitioning methods
 K-Means
 The algorithm, and its behavior illustrated with a worked example
Data Mining
Lecture Title : Cluster Analysis
Lecture Sub-title : Hierarchical Methods
Faculty of Information Science
University of Computer Studies, Magway
Textbooks & References / Citation
 Text Book: Data Mining: Concepts and Techniques (3rd ed.)
 Reference Book: Data Mining: Concepts and Techniques (2nd ed.)
 Citations: Hierarchical Clustering - AGNES, DIANA, BIRCH &
CHAMELEON, Probabilistic Hierarchical Clustering in
https://www.datamining365.com/2020/01/clustering-in-data-mining.html
Course Outline
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Part I
Hierarchical Clustering Method
 Use distance matrix as clustering criteria. This method does not require the number
of clusters k as an input, but needs a termination condition
(Illustration: five objects a, b, c, d, e. Agglomerative clustering (AGNES) runs from
Step 0 to Step 4, merging {a, b}, {d, e}, then {c, d, e}, and finally {a, b, c, d, e};
divisive clustering (DIANA) runs the same steps in reverse, from Step 4 back to Step 0.)
Agglomerative Hierarchical Clustering
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical packages, e.g., Splus
 Use the single-link method and the dissimilarity matrix
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
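The AGNES merge loop can be sketched in pure Python for 1-D points; the function name, the stopping rule (stop at k clusters rather than merging to one), and the data are illustrative:

```python
def agnes_single_link(points, k):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    least single-link dissimilarity until only k clusters remain."""
    clusters = [[p] for p in points]  # every object starts as its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: smallest pairwise distance between the clusters
                d = min(abs(p - q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters
```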
(Three scatter plots on a 0–10 × 0–10 grid illustrating AGNES: the original objects,
the merging of the closest clusters, and the final single cluster.)
DIANA (Divisive Analysis)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own
(Three scatter plots on a 0–10 × 0–10 grid illustrating DIANA: one cluster is split
step by step until each object forms a cluster on its own.)
Dendrogram: Shows How Clusters are Merged
 Decompose data objects into several levels of nested partitioning (a tree of
clusters), called a dendrogram
 A clustering of the data objects is obtained by cutting the dendrogram at the
desired level, then each connected component forms a cluster
(Dendrogram: read bottom-up for agglomerative clustering (AGNES), top-down for
divisive clustering (DIANA).)
Example: Dendrogram representation
Distance between Clusters
(Illustration: closest pair, furthest pair, centroid pair, average over all pairs.)
 Single link: smallest distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
 Complete link: largest distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
 Average: avg distance between an element in one cluster and an element in
the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
 Centroid: distance between the centroids of two clusters,
i.e., dist(Ki, Kj) = dist(Ci, Cj)
 Medoid: distance between the medoids of two clusters,
i.e., dist(Ki, Kj) = dist(Mi, Mj)
 Medoid: a chosen, centrally located data object in the cluster
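The linkage measures above can be written directly from their definitions; here `Ki` and `Kj` are plain lists of 1-D points, an illustrative simplification:

```python
def single_link(Ki, Kj):
    # smallest distance between an element of Ki and an element of Kj
    return min(abs(p - q) for p in Ki for q in Kj)

def complete_link(Ki, Kj):
    # largest distance between an element of Ki and an element of Kj
    return max(abs(p - q) for p in Ki for q in Kj)

def average_link(Ki, Kj):
    # average distance over all cross-cluster pairs
    return sum(abs(p - q) for p in Ki for q in Kj) / (len(Ki) * len(Kj))

def centroid_link(Ki, Kj):
    # distance between the centroids of the two clusters
    return abs(sum(Ki) / len(Ki) - sum(Kj) / len(Kj))

Ki, Kj = [1, 2, 3], [7, 9]
# single_link = 4, complete_link = 8, average_link = 6.0, centroid_link = 6.0
```

By construction, single link ≤ average link ≤ complete link for any pair of clusters.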
Centroid, Radius and Diameter of a Cluster (for
numerical data sets)
 Centroid: the “middle” of a cluster
 Radius: square root of average distance from any point of the cluster to its centroid
 Diameter: square root of average mean squared distance between all pairs of points
in the cluster
Cm = (Σ_{i=1}^{N} t_ip) / N

Rm = sqrt( Σ_{i=1}^{N} (t_ip − Cm)² / N )

Dm = sqrt( Σ_{i=1}^{N} Σ_{j≠i} (t_ip − t_jq)² / (N (N − 1)) )
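For a concrete 1-D cluster, the three quantities can be computed straight from the definitions of centroid, radius, and diameter above; the sample points are illustrative:

```python
import math

cluster = [2.0, 4.0, 6.0]  # a small 1-D cluster (illustrative data)
N = len(cluster)

# Centroid Cm: the mean of the member points
centroid = sum(cluster) / N

# Radius Rm: square root of the average squared distance to the centroid
radius = math.sqrt(sum((t - centroid) ** 2 for t in cluster) / N)

# Diameter Dm: square root of the average squared distance over all
# ordered pairs of distinct points, hence the division by N(N - 1)
diameter = math.sqrt(
    sum((cluster[i] - cluster[j]) ** 2
        for i in range(N) for j in range(N) if i != j) / (N * (N - 1))
)
```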
Extensions to Hierarchical Clustering
 Major weakness of agglomerative clustering methods
 Can never undo what was done previously, and cannot work well on large
datasets
 Do not scale well: time complexity of at least O(n²), where n is the number of
total objects
 Integration of hierarchical & distance-based clustering
 BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
 CHAMELEON (1999): hierarchical clustering using dynamic modeling
Types of Hierarchical Clustering
 BIRCH - Balanced Iterative Reducing and Clustering using Hierarchies
 CURE - Clustering Using REpresentatives
 CHAMELEON - Hierarchical Clustering using Dynamic Modeling. This is a graph
partitioning approach used in clustering complex structures.
 Probabilistic Hierarchical Clustering
 Generative Clustering Model
Quiz I
 In which of the following cases will K-Means clustering fail to give good results?
1. Data points with outliers
2. Data points with different densities
3. Data points with round shapes
4. Data points with non-convex shapes
A. 1 and 2
B. 2 and 3
C. 2 and 4
D. 1, 2 and 4
Answer: D
Quiz II
 Which of the following metrics can be used for measuring the dissimilarity between
two clusters in hierarchical clustering?
1. Single-link
2. Complete-link
3. Average-link
A. 1 and 2
B. 1 and 3
C. 2 and 3
D. 1, 2 and 3
Answer: D

Chapter 10.1,2,3.pptx

  • 1.
    Data Mining: ClusterAnalysis Faculty of Information Science 1 Data Mining Chapter 10 Cluster Analysis 10.1 Basic Concept Part 1 of 7
  • 2.
    Course Objectives andLearning Outcomes Data Mining: Cluster Analysis Faculty of Information Science 2 Objective To understand the cluster criteria and Clustering analysis. To develop research interest towards advances in data mining. To discover structures and patterns in high-dimensional data. To collect group data with similar patterns together. To reduces the complexity and facilitates interpretation Outcome To Identify appropriate data mining algorithms to solve real world problems To compare and evaluate different data mining techniques like classification, prediction, clustering and association rule mining. To learn about clusters such that observations belonging to the same cluster are very similar or homogeneous and observations belonging to different clusters are different or heterogeneous.
  • 3.
    Course Outline Data Mining:Cluster Analysis Faculty of Information Science 3 Cluster Analysis: Basic Concepts Density-Based Methods Partitioning Methods Grid-Based Methods Hierarchical Methods Evaluation of Clustering Part I
  • 4.
    What is ClusteringAnalysis  Cluster: A collection of data objects  similar (or related) to one another within the same group  dissimilar (or unrelated) to the objects in other groups  Cluster analysis (or clustering, data segmentation, …)  Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters  Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)  Typical applications  As a stand-alone tool to get insight into data distribution  As a preprocessing step for other algorithms Data Mining: Cluster Analysis Faculty of Information Science 4
  • 5.
    Clustering as aPreprocessing Tool (Utility)  Summarization:  Preprocessing for regression, PCA, classification, and association analysis  Compression:  Image processing: vector quantization  Finding K-nearest Neighbors  Localizing search to one or a small number of clusters  Outlier detection  Outliers are often viewed as those “far away” from any cluster Data Mining: Cluster Analysis Faculty of Information Science 5
  • 6.
    Quality: What IsGood Clustering?  A good clustering method will produce high quality clusters  high intra-class similarity: cohesive within clusters  low inter-class similarity: distinctive between clusters  The quality of a clustering method depends on  the similarity measure used by the method  its implementation, and  Its ability to discover some or all of the hidden patterns Data Mining: Cluster Analysis Faculty of Information Science 6
  • 7.
    Measure the Qualityof Clustering  Dissimilarity/Similarity metric  Similarity is expressed in terms of a distance function, typically metric: d(i, j)  The definitions of distance functions are usually rather different for interval-scaled, Boolean, categorical, ordinal ratio, and vector variables  Weights should be associated with different variables based on applications and data semantics  Quality of clustering:  There is usually a separate “quality” function that measures the “goodness” of a cluster.  It is hard to define “similar enough” or “good enough”  The answer is typically highly subjective Data Mining: Cluster Analysis Faculty of Information Science 7
  • 8.
    Considerations for ClusterAnalysis  Partitioning criteria  Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)  Separation of clusters  Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)  Similarity measure  Distance-based (e.g., Euclidian, road network, vector) vs. connectivity-based (e.g., density or contiguity)  Clustering space  Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering) Data Mining: Cluster Analysis Faculty of Information Science 8
  • 9.
    Requirements and Challenges Scalability  Clustering all the data instead of only on samples  Ability to deal with different types of attributes  Numerical, binary, categorical, ordinal, linked, and mixture of these  Constraint-based clustering  User may give inputs on constraints  Use domain knowledge to determine input parameters  Interpretability and usability  Others  Discovery of clusters with arbitrary shape  Ability to deal with noisy data  Incremental clustering and insensitivity to input order  High dimensionality Data Mining: Cluster Analysis Faculty of Information Science 9
  • 10.
    Major Clustering Approaches(I)  Create a hierarchical decomposition of the set of data (or objects) using some criterion  Partitioning approach:  Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square k-melodies, errors  Typical methods: k-means, CLARANS  Hierarchical approach:  Typical methods: Diana, Agnes, BIRCH, CAMELEON  Density-based approach:  Based on connectivity and density functions  Typical methods: DBSACN, OPTICS, DenClue  Grid-based approach:  based on a multiple-level granularity structure  Typical methods: STING, Wave Cluster, CLIQUE Data Mining: Preprocessing Faculty of Information Science 10
  • 11.
    Major Clustering Approaches(II)  Model-based:  A model is hypothesized for each of the clusters and tries to find the best fit of that model to each other  Typical methods: EM, SOM, COBWEB  Frequent pattern-based:  Based on the analysis of frequent patterns  Typical methods: p-Cluster  User-guided or constraint-based:  Clustering by considering user-specified or application-specific constraints  Typical methods: COD (obstacles), constrained clustering  Link-based clustering:  Objects are often linked together in various ways  Massive links can be used to cluster objects: SimRank, LinkClus Data Mining: Cluster Analysis Faculty of Information Science 11
  • 12.
    Data Mining: ClusterAnalysis Faculty of Information Science 12
  • 13.
    Summary  Basic Concept Cluster, Clustering Analysis, Unsupervised Learning Methods, Application of Clustering Analysis, Measure of cluster quality  Clustering Methods  Partition Methods  Hierarchical Methods  Density Based Methods  Grid Based Methods Data Mining: Cluster Analysis Faculty of Information Science 13
  • 14.
    Data Mining: ClusterAnalysis Faculty of Information Science 14 Data Mining Chapter 10 Cluster Analysis 10.2 Partitions Methods with K-Mean Part 2 of 7
  • 15.
    Course Outline Data Mining:Cluster Analysis Faculty of Information Science 15 Cluster Analysis: Basic Concepts Density-Based Methods Partitioning Methods Grid-Based Methods Hierarchical Methods Evaluation of Clustering Part I
  • 16.
    Partitioning Algorithms: BasicConcept  Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where ci is the centroid or myeloid of cluster Ci)  Given k, find a partition of k clusters that optimizes the chosen partitioning criterion  Global optimal: exhaustively enumerate all partitions  Heuristic methods: k-means and k-melodies algorithms  k-means (MacQueen’67, Lloyd’57/’82): where each cluster is represented by the mean value of the objects in the cluster  k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): where each cluster is represented by one of the objects located near the center of the cluster Data Mining: Cluster Analysis Faculty of Information Science 16
  • 17.
    The K-Means ClusteringMethod Data Mining: Cluster Analysis Faculty of Information Science 17 Start Number of Cluster K Centroid Distance Objects to Centroid Group Objects in Minimum Distance No Object move Group End No Yes
  • 18.
    Example of Kmean Clustering Data Mining: Cluster Analysis Faculty of Information Science 18  Suppose we use machine A and machine B as the first centroid and the distance using Euclidean distance. Use K-mean algorithm for the last two cluster the following point. A B C D Attribute1 2 2 5 7 Attribute2 10 5 8 5 Euclidean Distance 𝑑 𝑥, 𝑦 = (𝑥1 − 𝑥2)2+(𝑦1 − 𝑦2)2 K=2 Centroid New Point A(2,10) C(5,8) B(2,5) D(7,5)
  • 19.
    Example of Kmean Clustering Data Mining: Cluster Analysis Faculty of Information Science 19 C D A 3.61 * 7.07 B 4.24 3.61* After the First Round Centroid New Point Cluster 1 A,C Cluster 2 B,D Calculate New Centroid Point Centroid New Point Cluster 1 A,C (3.5,9) Cluster 2 B,D (4.5,5) A B C D Cluster 1 1.80* 4.27 1.8* 5.32 Cluster 2 5.12 2.5* 3.04 2.5* After the Second Round Centroid New Point Cluster 1 A,C Cluster 2 B,D
  • 20.
    An Example ofK-Means Clustering Data Mining: Cluster Analysis Faculty of Information Science 20  Partition objects into k nonempty subsets  Repeat  Compute centroid (i.e., mean point) for each partition  Assign each object to the cluster of its nearest centroid  Until no change Arbitrarily partition objects into k groups K=2 Update the cluster centroids Reassign Object Update the cluster centroids Loop if needed
  • 21.
    Animation for K-Means DataMining: Cluster Analysis Faculty of Information Science 21
  • 22.
    Mean-Shift Clustering Data Mining:Cluster Analysis Faculty of Information Science 22
  • 23.
    Mean-Shift Clustering fora single sliding window  Mean shift clustering is a sliding- window-based algorithm  to find dense areas of data points  It is also a centroid-based algorithm meaning that the goal is to locate the center points of each group/class, which works by updating candidates for center points to be the mean of the points within the sliding-window. Data Mining: Cluster Analysis Faculty of Information Science 23
  • 24.
     In contrastto K-means clustering, there is no need to select the number of clusters as mean-shift automatically discovers this. That’s a massive advantage.  The fact that the cluster centers converge towards the points of maximum density is also quite desirable as it is quite intuitive to understand and fits well in a naturally data-driven sense.  The drawback is that the selection of the window size/radius “r” can be non-trivial.  The heuristic clustering methods work well for finding spherical-shaped clusters in small to medium databases.  To find clusters with complex shapes and for clustering very large data sets, partitioning based methods need to be extended. Data Mining: Cluster Analysis Faculty of Information Science 24
  • 25.
    Comments on theK-Means Method  Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.  Comparing: PAM: O(k(n-k)2 ), CLARA: O(ks2 + k(n-k))  Comment: Often terminates at a local optimal.  Weakness  Applicable only to objects in a continuous n-dimensional space  Using the k-modes method for categorical data  In comparison, k-medoids can be applied to a wide range of data  Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k (see Hastie et al., 2009)  Sensitive to noisy data and outliers  Not suitable to discover clusters with non-convex shapes Data Mining: Cluster Analysis Faculty of Information Science 25
  • 26.
    Summary  Basic ofPartition Methods  Definition, Evolution and when the partition, Some methods of Partition Methods  K-Means  Working with Algorithms, Performance of algorithms with Example Data Mining: Cluster Analysis Faculty of Information Science 26
  • 27.
    Data Mining: ClusterAnalysis Faculty of Information Science 27 Data Mining Lecture Title : Cluster Analysis Lecture Sub-title : Hierarchical Methods Faculty of Information Science University of Computer Studies, Magway
  • 28.
    Textbooks & References/ Citation Data Mining: Cluster Analysis Faculty of Information Science 28  Text Books :Data Mining: Concepts and Techniques (3rd ed.)  Reference Books :Data Mining: Concepts and Techniques (2rd ed.)  Citations: Hierarchical Clustering - AGNES, DIANA, BIRCH & CHAMELEON, Probabilistic Hierarchical Clustering in https://www.datamining365.com/2020/01/clustering-in-data-mining.html
  • 29.
    Course Outline Data Mining:Cluster Analysis Faculty of Information Science 29 Cluster Analysis: Basic Concepts Density-Based Methods Partitioning Methods Grid-Based Methods Evaluation of Clustering Part I Hierarchical Methods
  • 30.
    Hierarchical Clustering Method Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition Data Mining: Cluster Analysis Faculty of Information Science 30 Step 0 Step 1 Step 2 Step 3 Step 4 b d c e a a b d e c d e a b c d e Step 4 Step 3 Step 2 Step 1 Step 0 agglomerative (AGNES) divisive (DIANA)
  • 31.
    Agglomerative Hierarchical Clustering DataMining: Cluster Analysis Faculty of Information Science 31
  • 32.
    AGNES (Agglomerative Nesting) Introduced in Kaufmann and Rousseeuw (1990)  Implemented in statistical packages, e.g., Splus  Use the single-link method and the dissimilarity matrix  Merge nodes that have the least dissimilarity  Go on in a non-descending fashion  Eventually all nodes belong to the same cluster Data Mining: Cluster Analysis Faculty of Information Science 32 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
  • 33.
    DIANA (Divisive Analysis) Introduced in Kaufmann and Rousseeuw (1990)  Implemented in statistical analysis packages, e.g., Splus  Inverse order of AGNES  Eventually each node forms a cluster on its own Data Mining: Cluster Analysis Faculty of Information Science 33 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
  • 34.
    Dendrogram: Shows HowClusters are Merged  Decompose data objects into a several levels of nested partitioning (tree of clusters), called a dendrogram  A clustering of the data objects is obtained by cutting the dendrogram at the desired level, then each connected component forms a cluster Data Mining: Cluster Analysis Faculty of Information Science 34 divisive (DIANA) agglomerative (AGNES)
  • 35.
    Example: Dendrogram representation DataMining: Cluster Analysis Faculty of Information Science 35
  • 36.
    Distance between Clusters DataMining: Cluster Analysis Faculty of Information Science 36 Closet Pair Furthest Pair Centroid Pair Average set of pair  Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)  Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)  Average: avg distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)  Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)  Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)  Medoid: a chosen, centrally located data object in the cluster
  • 37.
    Centroid, Radius and Diameter of a Cluster (for numerical data sets)  Centroid: the "middle" of a cluster: Cm = (Σ_{i=1..N} tip) / N  Radius: square root of the average distance from any point of the cluster to its centroid: Rm = sqrt( Σ_{i=1..N} (tip − cm)² / N )  Diameter: square root of the average mean squared distance between all pairs of points in the cluster: Dm = sqrt( Σ_{i=1..N} Σ_{j=1..N} (tip − tjq)² / (N(N−1)) )
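The three quantities follow directly from the formulas above. A minimal sketch for a 1-D cluster (illustrative function names):

```python
# Sketch computing centroid, radius, and diameter of a 1-D cluster,
# following the formulas on this slide.
from math import sqrt

def centroid(cluster):
    # Cm = (sum of points) / N: the "middle" of the cluster.
    return sum(cluster) / len(cluster)

def radius(cluster):
    # Rm = sqrt(average squared distance from each point to the centroid).
    N = len(cluster)
    c = centroid(cluster)
    return sqrt(sum((t - c) ** 2 for t in cluster) / N)

def diameter(cluster):
    # Dm = sqrt(average squared distance over all ordered pairs of points),
    # normalized by N(N-1); the p == q terms contribute zero.
    N = len(cluster)
    return sqrt(sum((p - q) ** 2 for p in cluster for q in cluster)
                / (N * (N - 1)))

print(centroid([0, 2, 4]), radius([0, 2, 4]), diameter([0, 2, 4]))
```

Note the diameter here is a squared-average measure, not the maximum pairwise distance; both definitions appear in the literature.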
  • 38.
    Extensions to Hierarchical Clustering  Major weaknesses of agglomerative clustering methods  Can never undo what was done previously, and cannot work well on large data sets  Do not scale well: time complexity of at least O(n²), where n is the total number of objects  Integration of hierarchical & distance-based clustering  BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters  CHAMELEON (1999): hierarchical clustering using dynamic modeling
  • 39.
    Types of Hierarchical Clustering  BIRCH - Balanced Iterative Reducing and Clustering using Hierarchies  CURE - Clustering Using REpresentatives  CHAMELEON - Hierarchical Clustering using Dynamic Modeling: a graph-partitioning approach used to cluster complex structures  Probabilistic Hierarchical Clustering  Generative Clustering Model
  • 40.
    Quiz I  Inwhich of the following cases will K-Means clustering fail to give good results? 1. Data points with outliers 2. Data points with different densities 3. Data points with round shapes 4. Data points with non-convex shapes Data Mining: Cluster Analysis Faculty of Information Science 40 Answer : D A. 1 and 2 B. 2 and 3 C. 2 and 4 D. 1, 2 and 4
  • 41.
    Quiz II  Whichof the following metrics, do we have for finding dissimilarity between two clusters in hierarchical clustering? 1. Single-link 2. Complete-link 3. Average-link Data Mining: Cluster Analysis Faculty of Information Science 41 A. 1 and 2 B. 1 and 3 C. 2 and 3 D. 1, 2 and 3 Answer : D

Editor's Notes

  • #18 Given k, the k-means algorithm is implemented in four steps: (1) partition the objects into k nonempty subsets; (2) compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster); (3) assign each object to the cluster with the nearest seed point; (4) go back to step 2, and stop when the assignment no longer changes.
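The four steps in this note can be sketched directly in Python for 1-D points. This is a teaching sketch with illustrative names (it initializes by round-robin and omits empty-cluster handling), not a production implementation:

```python
# Minimal k-means sketch following the four steps in the note above.
from statistics import mean

def k_means(points, k, max_iter=100):
    # Step 1: partition objects into k nonempty subsets (round-robin here).
    clusters = [points[i::k] for i in range(k)]
    for _ in range(max_iter):
        # Step 2: compute seed points as centroids of the current clusters.
        # (Assumes no cluster becomes empty; a real implementation must
        # handle that case.)
        seeds = [mean(c) for c in clusters]
        # Step 3: assign each object to the cluster with the nearest seed.
        new_clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - seeds[i]))
            new_clusters[nearest].append(p)
        # Step 4: stop when the assignment no longer changes.
        if new_clusters == clusters:
            break
        clusters = new_clusters
    return clusters

print(k_means([1.0, 1.1, 0.9, 8.0, 8.2, 7.8], k=2))
```

On this toy input the two well-separated groups are recovered after one reassignment pass.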
  • #31 Refer: https://www.tutorialandexample.com/hierarchical-clustering-algorithm/
  • #33 Refer: https://medium.com/@ramhajraj/list-of-statistical-packages-f897f6aa3647
  • #37 Refer: https://youtu.be/uxmHoQ1DGPU https://online.stat.psu.edu/stat505/lesson/14/14.4 https://www.stat.cmu.edu/~ryantibs/datamining/lectures/05-clus2.pdf The three most common linkage functions are single, complete, and average linkage. Single linkage measures the least dissimilar pair between groups, complete linkage measures the most dissimilar pair, and average linkage measures the average dissimilarity over all pairs. dist(Ki, Kj) = min(tip, tjq); dmj = max(dkj, dlj). Notation: dist(Ki, Kj): distance between clusters Ki and Kj; min(tip, tjq): smallest distance between element tip in one cluster and element tjq in the other cluster; tip: distance between clusters i and p; tjq: distance between clusters j and q.
  • #38 Refer: https://www.stat.cmu.edu/~ryantibs/datamining/lectures/06-clus3.pdf
  • #41 K-Means clustering algorithm fails to give good results when the data contains outliers, the density spread of data points across the data space is different and the data points follow non-convex shapes.
  • #42 Answer: D