Business Analytics – The Science of Data Driven Decision Making
CLUSTERING
U Dinesh Kumar
INTRODUCTION TO CLUSTERING
Clustering is usually one of the first tasks performed in most analytics projects: it groups similar observations together so that data scientists can analyze each cluster further on its own.
Non-overlapping clusters
Clusters in which each observation belongs to exactly one cluster. Non-overlapping clustering is the most frequently used form of clustering in practice.
Overlapping clusters
• An observation may belong to more than one cluster.
Probabilistic clusters
An observation may belong to a cluster according to a
probability distribution.
Hierarchical clustering
Hierarchical clustering creates subsets of the data organized in a tree-like structure in which the root node corresponds to the complete data set. Branches are created from the root node to split the data into subsets (clusters) that are heterogeneous with respect to one another.
Euclidean Distance
Euclidean distance is one of the most frequently used distance measures when the data are on an interval or ratio scale.
The Euclidean distance between two n-dimensional observations X1 = (x11, x12, …, x1n) and X2 = (x21, x22, …, x2n) is given by

D(X_1, X_2) = \sqrt{(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2 + \cdots + (x_{1n} - x_{2n})^2}
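As an illustrative sketch (not part of the original slides), the formula can be computed directly in Python:

```python
import math

def euclidean(x1, x2):
    """Euclidean distance: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

# Wines 1 (14.8, 28) and 2 (11.05, 12) from the example that follows:
print(euclidean((14.8, 28.0), (11.05, 12.0)))
```

In Python 3.8+ the standard library's `math.dist` performs the same computation.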
Example
The table below gives the alcohol and alkalinity of ash content for 20 wines sold in the market.

Wine  Alcohol  Alkalinity of Ash    Wine  Alcohol  Alkalinity of Ash
  1    14.8     28                   11    10.7     12.2
  2    11.05    12                   12    14.3     27
  3    12.2     21                   13    12.4     19.5
  4    12       20                   14    14.85    29.2
  5    14.5     29.5                 15    10.9     13.6
  6    11.2     13                   16    13.9     29.7
  7    11.5     12                   17    10.4     12.2
  8    12.8     19                   18    10.8     13.6
  9    14.75    28.8                 19    14       28.8
 10    10.5     14                   20    12.47    22.8
Figure: Clusters of wine based on alcohol and ash content.
Standardized Euclidean Distance
Let X1k and X2k be the values of two attributes of the data for the kth observation. The range of X1k can be much smaller than that of X2k, which skews the Euclidean distance toward the attribute with the larger range. An easy way to handle this potential bias is to standardize the data using the following equation:

Standardized value of the attribute = \frac{X_{ik} - \bar{X}_i}{\sigma_{X_i}}

where \bar{X}_i and \sigma_{X_i} are, respectively, the mean and standard deviation of the ith attribute.
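A minimal sketch of this standardization (assuming the population standard deviation, a common choice not fixed by the slides):

```python
import math

def standardize(values):
    """Return z-scores: (x - mean) / standard deviation for each value."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    return [(x - mean) / std for x in values]

# Each attribute (column) is standardized separately before computing distances.
alcohol = [14.8, 11.05, 12.2, 12.0, 14.5]  # first five wines from the earlier example
print(standardize(alcohol))
```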
Manhattan Distance (City Block Distance)
Euclidean distance may not be appropriate for measuring the distance between different locations (for example, the distance between two shops in a city, where travel follows the street grid). In such cases we use the Manhattan distance, which is given by

D_M(X_1, X_2) = \sum_{i=1}^{n} |X_{1i} - X_{2i}|
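A short illustrative implementation of the city-block distance:

```python
def manhattan(x1, x2):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x1, x2))

print(manhattan((1, 2), (4, 6)))  # |1-4| + |2-6| = 7
```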
Minkowski Distance
Minkowski distance is a generalized distance measure between two cases in the data set and is given by

D_{\text{Minkowski}}(X_1, X_2) = \left( \sum_{i=1}^{n} |X_{1i} - X_{2i}|^p \right)^{1/p}

When p = 1, the Minkowski distance is the same as the Manhattan distance.
For p = 2, the Minkowski distance is the same as the Euclidean distance.
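A sketch showing how the special cases fall out of the general formula:

```python
def minkowski(x1, x2, p):
    """Generalized distance: (sum of |differences|^p) raised to the power 1/p."""
    return sum(abs(a - b) ** p for a, b in zip(x1, x2)) ** (1.0 / p)

print(minkowski((1, 2), (4, 6), 1))  # p = 1: Manhattan distance, 7.0
print(minkowski((1, 2), (4, 6), 2))  # p = 2: Euclidean distance, 5.0
```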
Jaccard Similarity Coefficient (Jaccard
Index)
The Jaccard similarity coefficient (JSC) or Jaccard index (Real and Vargas, 1996) is a measure used when the data are qualitative, especially when the attributes can be represented in binary form.
The JSC for two n-dimensional observations (n attributes) X1 and X2 is given by

\text{Jaccard}(X_1, X_2) = \frac{n(X_1 \cap X_2)}{n(X_1 \cup X_2)}

where n(X1 ∩ X2) is the number of attributes that belong to both X1 and X2, and n(X1 ∪ X2) is the number of attributes that belong to either X1 or X2.
Example
Consider the movie DVD purchases made by two customers, given by the following sets:
Customer 1 = {Jungle Book (JB), Iron Man (IM), Kung Fu Panda (KFP), Before Sunrise (BS), Bridge of Spies (BoS), Forrest Gump (FG)}
Customer 2 = {Casablanca (C), Jungle Book (JB), Forrest Gump (FG), Iron Man (IM), Kung Fu Panda (KFP), Schindler's List (SL), The Godfather (TGF)}
In this case each movie is an attribute. The purchases made by the two customers are shown in the table below:

Movie Title  BS  BoS  C  FG  IM  JB  KFP  SL  TGF
Customer 1    1    1  0   1   1   1    1   0    0
Customer 2    0    0  1   1   1   1    1   1    1
• The JSC is given by

\text{JSC} = \frac{n(\text{Customer 1} \cap \text{Customer 2})}{n(\text{Customer 1} \cup \text{Customer 2})} = \frac{4}{9} \approx 0.44

The higher the Jaccard coefficient, the higher the similarity between the two observations being compared. The value of the JSC lies between 0 and 1.
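The calculation above can be reproduced with a short sketch using the movie abbreviations from the example:

```python
def jaccard(a, b):
    """Jaccard index: |A intersection B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

customer1 = {"JB", "IM", "KFP", "BS", "BoS", "FG"}
customer2 = {"C", "JB", "FG", "IM", "KFP", "SL", "TGF"}
print(round(jaccard(customer1, customer2), 2))  # 4 shared of 9 distinct titles -> 0.44
```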
Cosine Similarity
In cosine similarity, X1 and X2 are two n-dimensional vectors, and the measure is the cosine of the angle θ between them (hence the name vector space model). The cosine similarity between X1 and X2 is given by

\text{Similarity}(X_1, X_2) = \cos(\theta) = \frac{X_1 \cdot X_2}{\|X_1\| \, \|X_2\|} = \frac{\sum_{i=1}^{n} X_{1i} X_{2i}}{\sqrt{\sum_{i=1}^{n} X_{1i}^2} \; \sqrt{\sum_{i=1}^{n} X_{2i}^2}}
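A minimal sketch of the formula above:

```python
import math

def cosine_similarity(x1, x2):
    """cos(theta) = dot(X1, X2) / (||X1|| * ||X2||)."""
    dot = sum(a * b for a, b in zip(x1, x2))
    return dot / (math.sqrt(sum(a * a for a in x1)) * math.sqrt(sum(b * b for b in x2)))

print(cosine_similarity((1, 0), (0, 1)))  # orthogonal vectors -> 0.0
```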
Figure: Cosine similarity for different values of θ.
Gower’s Similarity Coefficient
Gower's similarity coefficient (Gower, 1971) is used when the data contain both quantitative and qualitative attributes.
Gower's coefficient between two n-dimensional observations i and j is given by

D_{ij} = \frac{\sum_{k=1}^{n} W_{ijk} D_{ijk}}{\sum_{k=1}^{n} W_{ijk}}

where D_ijk is the score between observations i and j for the kth variable, and W_ijk is a binary variable that captures whether the comparison between observations i and j is valid for the kth variable.
Example
Table 14.5 shows five customers and their movie downloads from a portal. The data consist of the number of movies downloaded in each genre (k = 1 to 4), the maximum rating given by the customer (k = 5), and the marital status (k = 6; code 1 implies married, 0 otherwise). For example, customer 1 downloaded 23 action, 5 romance, 15 comedy, and 0 sci-fi movies, and his maximum rating was 4.

Customer  Action  Romance  Comedy  Sci-fi  Max Rating  Married
          (k=1)   (k=2)    (k=3)   (k=4)   (k=5)       (k=6)
   1        23       5       15       0        4           0
   2         5      18       16       2        5           1
   3        25       0        0      15        5           0
   4         2      30       15       0        4           1
   5        45       0        0      10        5           0
Solution
The Gower coefficient between customers 1 and 2 can be calculated as shown in the table below:

        k=1     k=2     k=3     k=4     k=5     k=6    Sum
D_ijk   0.5814  0.5667  0.9375  0.8667  0.0000  0      2.952
W_ijk   1       1       1       1       1       1      6

Gower's coefficient between customers 1 and 2 is therefore 2.952/6 = 0.492.
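The worked example can be reproduced with a short sketch. One assumption on my part, which matches the tabled values: for the quantitative attributes (k = 1 to 5) the per-attribute score is 1 − |x_ik − x_jk| / R_k, where R_k is the attribute's range over all customers, and for the binary marital-status attribute the score is 1 on a match and 0 otherwise.

```python
# Rows: customers 1-5; columns: action, romance, comedy, sci-fi, max rating, married.
customers = [
    [23, 5, 15, 0, 4, 0],
    [5, 18, 16, 2, 5, 1],
    [25, 0, 0, 15, 5, 0],
    [2, 30, 15, 0, 4, 1],
    [45, 0, 0, 10, 5, 0],
]
ranges = [max(col) - min(col) for col in zip(*customers)]

def gower(i, j):
    """Gower coefficient between customers i and j (all weights W_ijk = 1 here)."""
    scores = []
    for k in range(6):
        if k < 5:  # quantitative attributes: range-normalized score
            scores.append(1 - abs(customers[i][k] - customers[j][k]) / ranges[k])
        else:      # binary attribute: 1 if the values match, else 0
            scores.append(1.0 if customers[i][k] == customers[j][k] else 0.0)
    return sum(scores) / len(scores)

print(round(gower(0, 1), 3))  # matches the 2.952 / 6 = 0.492 computed above
```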
Quality and Optimal Number of Clusters
Milligan and Cooper (1985) analysed over 30 procedures for determining the optimal number of clusters and recommended the index proposed by Calinski and Harabasz (1974), which is given by

CH(k) = \frac{B(k)/(k - 1)}{W(k)/(n - k)}

where CH(k) is the Calinski–Harabasz index with k clusters (k > 1), and B(k) and W(k) are the between-cluster and within-cluster sums of squared variation with k clusters.
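The index can be sketched for one-dimensional data with given cluster labels (an illustrative toy, not code from the book):

```python
def ch_index(points, labels, k):
    """Calinski-Harabasz index: [B(k)/(k-1)] / [W(k)/(n-k)] for 1-D data."""
    n = len(points)
    overall_mean = sum(points) / n
    clusters = {c: [p for p, l in zip(points, labels) if l == c] for c in set(labels)}
    means = {c: sum(m) / len(m) for c, m in clusters.items()}
    # B(k): between-cluster sum of squares, weighted by cluster size
    B = sum(len(m) * (means[c] - overall_mean) ** 2 for c, m in clusters.items())
    # W(k): within-cluster sum of squares
    W = sum((p - means[c]) ** 2 for c, m in clusters.items() for p in m)
    return (B / (k - 1)) / (W / (n - k))

# Two tight, well-separated groups give a large CH value:
print(ch_index([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0, 0, 0, 1, 1, 1], 2))
```

In practice one computes CH(k) for a range of k and picks the k that maximizes it.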
Clustering Algorithms
Clustering algorithms group data into a finite number of mutually exclusive subsets.
Steps followed in clustering algorithms:
• Variable selection.
• Deciding the distance/similarity measure for measuring
distance/dissimilarity between the observations.
• Deciding the number of clusters.
• Validation of the clusters.
Variable Selection
Ketchen and Shook (1996) suggest inductive, deductive,
and cognitive approaches for variable selection.
• Inductive is basically an exploratory approach and
starts with as many variables as possible.
• On the other hand, in deductive variable selection,
suitability of the variable and theoretical basis
influence selection of variables.
• Under cognitive variable selection, expert opinion plays a major role.
Deciding Distance/Similarity Measures
Choosing the right distance/similarity measure plays an
important role in developing clusters.
Number of Clusters
Several approaches are available for deciding the number of clusters, such as the CH index, the Hartigan statistic [Eq. (14.14)], the Silhouette statistic, and the elbow method, in which the ideal number of clusters is given by the position of the elbow (bend) in an L-shaped curve.
Cluster Validation
The clusters created should be validated for consistency
using different algorithms to ensure that the clusters
represent the structures that exist in the population.
Halkidi et al. (2001) suggest the following measures to
validate the clusters:
• Compactness: Closeness of the members of a cluster to one another, which can be measured through variance.
• Separation: Distance between different clusters.
K-Means Clustering
• K-means clustering is one of the frequently used
clustering algorithms.
• It is a non-hierarchical clustering method in which the
number of clusters (K) is decided a priori.
K-Means Clustering - Steps
1) Choose K observations from the data that are likely to fall in different clusters. There are many ways of choosing these initial K seeds; the easiest is to choose observations that are farthest apart (on one of the attributes of the data).
2) The K observations chosen in step 1 become the centroids of their clusters.
3) For each remaining observation, find the cluster whose centroid is closest, based on an appropriate distance measure. Add the observation to that cluster and adjust the centroid after adding it.
4) Repeat step 3 until all observations are assigned to a cluster.
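The steps above describe a sequential assignment scheme; the widely used iterative refinement (Lloyd's algorithm) can be sketched as follows (an illustration, not the book's code):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means (Lloyd's algorithm) for points given as coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick K initial observations as centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assign each observation to its closest centroid (Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [math.dist(p, c) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # recompute each centroid as the mean of its cluster
        centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Six wines from the earlier example: (alcohol, alkalinity of ash).
wines = [(14.8, 28), (11.05, 12), (14.5, 29.5), (11.2, 13), (14.75, 28.8), (10.5, 14)]
centroids, clusters = kmeans(wines, 2)
print(sorted(len(c) for c in clusters))  # the two natural groups of three wines each
```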
Hierarchical Clustering
Hierarchical clustering is a clustering algorithm that uses the following steps to develop clusters:
1) Start with each data point as its own cluster.
2) Find the two clusters that are closest (using an appropriate distance measure) and merge them into a single cluster.
3) Repeat step 2 until all data points are merged into a single cluster.
The above procedure is called agglomerative hierarchical clustering.
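A minimal agglomerative sketch (assuming single linkage, i.e. cluster-to-cluster distance is the closest pair of points; the slides do not fix a linkage rule):

```python
import math

def agglomerative(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]  # step 1: each point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # step 2: find the pair of clusters with the smallest single-linkage distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((clusters[i], clusters[j], d))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)  # step 3: repeat until one cluster remains
    return merges

history = agglomerative([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)])
print(len(history))  # 2 merges collapse 3 points into a single cluster
```

The recorded merge heights are exactly what a dendrogram plots.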
Figure: Dendrogram for movie clustering.
Summary
• Clustering is an unsupervised learning technique that divides the data set into mutually exclusive and exhaustive subsets (in non-overlapping clustering) that are homogeneous within each group and heterogeneous between groups.
• Clustering is one of the most frequently used techniques; practitioners often first cluster the data and then develop a predictive model for each cluster for better management.
• Several distance measures, such as Euclidean distance and Gower distance, are used in clustering algorithms. Similarity coefficients such as the Jaccard coefficient and cosine similarity are used depending on the data type.
• K-means clustering and hierarchical clustering are two popular clustering techniques.
• One of the decisions to be taken during clustering is the number of clusters. This is usually decided using the elbow curve: the number of clusters at which the elbow (bend) occurs is the optimal number of clusters.