• 𝐾-means results can be sensitive to initialization
• 𝐾-means++ (Arthur and Vassilvitskii, 2006) is an
improvement over 𝐾-means
• Only difference is the way we initialize the cluster centers (rest
of it is just 𝐾-means)
• Basic idea: Initialize cluster centers such that they are
reasonably far from each other
• Note: In 𝐾-means++, the cluster centers are chosen to be 𝐾 of
the data points themselves
K-means++
[Figure: a poor initialization leading to a bad clustering vs. the desired clustering]
• K-means++ works as follows
• Choose the first cluster mean uniformly randomly to be one of
the data points
• The subsequent 𝐾−1 cluster means are chosen as follows
• (1) For each unselected point 𝒙, compute its smallest distance 𝐷(𝒙) from the already initialized means
• (2) Select the next cluster mean at random from the unselected points, choosing 𝒙 with probability proportional to 𝐷(𝒙)², i.e.,
$$P(\boldsymbol{x}) = \frac{D(\boldsymbol{x})^2}{\sum_{\boldsymbol{x}' \in \mathcal{X}} D(\boldsymbol{x}')^2}$$
• (3) Repeat 1 and 2 until the 𝐾−1 cluster means are initialized
• Now run standard K-means with these initial cluster means
• The K-means++ initialization scheme helps ensure that the initial cluster means are located in different clusters
K-means++
Thus, the farthest points are the most likely to be selected as cluster means.
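The following is a minimal NumPy sketch of this seeding scheme; the function name kmeans_pp_init and its interface are illustrative assumptions rather than code from the slides, and it uses the D(x)²-proportional sampling described above.

```python
import numpy as np

def kmeans_pp_init(X, K, rng=None):
    """Sketch of K-means++ seeding: pick K data points as the initial cluster means."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                          # first mean: a uniformly random data point
    for _ in range(K - 1):
        # D(x)^2: squared distance of each point to its nearest already-chosen mean
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])   # sample proportionally to D(x)^2
    return np.array(centers)                                # hand these to standard K-means
```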
• Produces a set of nested clusters organized as a
hierarchical tree
• Can be visualized as a dendrogram
• A tree-like diagram that records the sequence of merges
or splits
Hierarchical Clustering
[Figure: a dendrogram over points 1–6, with merge heights between 0 and 0.2, shown alongside the corresponding nested clusters in the plane]
• Do not have to assume any particular number of
clusters
• Any desired number of clusters can be obtained by
‘cutting’ the dendrogram at the proper level
• They may correspond to meaningful taxonomies
• Example in biological sciences (e.g., animal kingdom,
phylogeny reconstruction, …)
Strengths of Hierarchical Clustering
• Two main types of hierarchical clustering
• Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one
cluster (or 𝑘 clusters) left
• Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains an
individual point (or there are 𝑘 clusters)
• Traditional hierarchical algorithms use a similarity or
distance matrix
• Merge or split one cluster at a time
Hierarchical Clustering
• Key Idea: Successively merge the closest clusters
• Basic algorithm
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
• Key operation is the computation of the proximity of two
clusters
• Different approaches to define the distance between clusters
distinguish the different algorithms
Agglomerative Clustering Algorithm
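A naive sketch of this basic loop; single linkage is used as the proximity rule for concreteness (it can be swapped for any of the measures on the following slides), and the helper names are assumptions rather than code from the slides.

```python
import numpy as np

def pair_distance(A, B):
    """Cluster proximity; here MIN (single linkage), but any rule from the later slides fits."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1).min()

def agglomerative(X, k=1):
    clusters = [[i] for i in range(len(X))]        # step 2: every point starts as its own cluster
    while len(clusters) > k:                       # steps 3-6: repeat until k clusters remain
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: pair_distance(X[clusters[p[0]]], X[clusters[p[1]]]))
        clusters[i] += clusters.pop(j)             # steps 4-5: merge the closest pair of clusters
    return clusters                                # each cluster is a list of point indices
```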
• Start with clusters of individual points and a
proximity matrix
Steps 1 and 2
[Figure: five points p1–p5 and the initial proximity matrix with one row and column per point]
• After some merging steps, we have some clusters
Intermediate Situation
[Figure: five clusters C1–C5 and the current proximity matrix with one row and column per cluster]
• We want to merge the two closest clusters (C2 and
C5) and update the proximity matrix.
Step 4
[Figure: the two closest clusters, C2 and C5, highlighted in the plane and in the proximity matrix]
• The question is “How do we update the proximity
matrix?”
Step 5
[Figure: after merging C2 and C5, the row and column for the new cluster C2 ∪ C5 in the proximity matrix are marked with question marks and must be recomputed]
How to Define Inter-Cluster Similarity
[Figure: two candidate groups among points p1–p5 and the proximity matrix; the similarity between the two groups is to be defined]
• MIN (single linkage)
• MAX (complete linkage)
• Group average (average linkage)
• Distance between centroids
• Other methods driven by an objective function (e.g., Ward’s method uses squared error)
• Single-link distance between clusters 𝐶ᵢ and 𝐶ⱼ is the minimum distance between any object in 𝐶ᵢ and any object in 𝐶ⱼ
• The distance is defined by the two most similar objects
Distance between two clusters: MIN
$$D_{sl}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$$
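A one-line NumPy rendering of this definition; the name d_min and the (points × features) array inputs are assumptions for illustration.

```python
import numpy as np

def d_min(Ci, Cj):
    """Single-link (MIN): distance of the closest pair, one point from each cluster."""
    return np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=-1).min()
```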
• Problem: clustering analysis with an agglomerative
algorithm using single-linkage
Example: MIN
[Figure: the example’s data matrix and its Euclidean distance matrix]
• Merge the two closest clusters (iteration 1)
Example: MIN
• Merge the two closest clusters and update the distance matrix (iteration 3)
Example: MIN
• Merge the two closest clusters and update the distance matrix (iteration 4)
Example: MIN
• Final result (meeting the termination condition)
Example: MIN
• Complete-link distance between clusters 𝐶ᵢ and 𝐶ⱼ is the maximum distance between any object in 𝐶ᵢ and any object in 𝐶ⱼ
• The distance is defined by the two most dissimilar objects
Distance between two clusters: MAX
$$D_{cl}(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$$
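The same kind of sketch for complete linkage (names and inputs assumed as before):

```python
import numpy as np

def d_max(Ci, Cj):
    """Complete-link (MAX): distance of the farthest pair, one point from each cluster."""
    return np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=-1).max()
```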
• Group-average distance between clusters 𝐶ᵢ and 𝐶ⱼ is the average distance over all pairs of objects, one from 𝐶ᵢ and one from 𝐶ⱼ
Distance between two clusters: Average
$$D_{avg}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$$
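A corresponding sketch for the group-average distance (names and inputs assumed as before):

```python
import numpy as np

def d_avg(Ci, Cj):
    """Group average: mean over all |Ci|*|Cj| cross-cluster pairwise distances."""
    return np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=-1).mean()
```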
• Centroid distance between clusters 𝐶ᵢ and 𝐶ⱼ is the distance between the centroid 𝑟ᵢ of 𝐶ᵢ and the centroid 𝑟ⱼ of 𝐶ⱼ
Distance between two clusters: Centroid
$$D_{centroids}(C_i, C_j) = d(r_i, r_j)$$
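A corresponding sketch for the centroid distance (names and inputs assumed as before):

```python
import numpy as np

def d_centroid(Ci, Cj):
    """Centroid linkage: distance between the cluster centroids r_i and r_j."""
    return np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))
```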
• Ward’s distance between clusters 𝐶ᵢ and 𝐶ⱼ is the increase in the total within-cluster sum of squares caused by merging the two clusters into a single cluster 𝐶ᵢⱼ, relative to keeping them separate
• 𝑟ᵢ: centroid of 𝐶ᵢ
• 𝑟ⱼ: centroid of 𝐶ⱼ
• 𝑟ᵢⱼ: centroid of 𝐶ᵢⱼ
Distance between two clusters: Ward
$$D_{w}(C_i, C_j) = \sum_{x \in C_{ij}} (x - r_{ij})^2 - \sum_{x \in C_i} (x - r_i)^2 - \sum_{x \in C_j} (x - r_j)^2$$
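A corresponding sketch for Ward’s distance as the increase in within-cluster sum of squares (names and inputs assumed as before):

```python
import numpy as np

def d_ward(Ci, Cj):
    """Ward: within-cluster sum of squares of the merged cluster minus that of Ci and Cj separately."""
    def sse(C):
        return ((C - C.mean(axis=0)) ** 2).sum()
    return sse(np.vstack([Ci, Cj])) - sse(Ci) - sse(Cj)
```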
• MIN
• Can handle non-elliptical shapes
• Sensitive to noise and outliers
• MAX
• Less susceptible to noise and outliers
• Tends to break large clusters and is biased towards globular clusters
• Group average
• Compromise between MIN and MAX
• Less susceptible to noise and outliers
• Biased towards globular clusters
• Ward
• Similar to group average and centroid distance
• Less susceptible to noise and outliers
• Biased towards globular clusters
MIN, MAX, Group Average, and Ward
• 𝑂(𝑁²) space, since it stores the proximity matrix
• 𝑁 is the number of points
• 𝑂(𝑁³) time in many cases
• There are 𝑁 steps, and at each step a proximity matrix of size 𝑁² must be updated and searched
• Complexity can be reduced to 𝑂(𝑁² log 𝑁) time with some cleverness
Hierarchical Clustering: Time and Space Requirements
• Once a decision is made to combine two clusters, it
cannot be undone
• No global objective function is directly minimized
• Different schemes have problems with one or more
of the following:
• Sensitivity to noise
• Difficulty handling clusters of different sizes and non-globular shapes
• Breaking large clusters
Hierarchical Clustering: Problems and Limitations
• Dendrogram tree representation
• For a dendrogram tree, its horizontal axis indexes all objects in a given data set, while
its vertical axis expresses the lifetime of all possible cluster formations.
• The lifetime of a cluster in the dendrogram is defined as the distance interval from the moment the cluster is created to the moment it disappears by being merged with another cluster.
Key Concepts in Hierarchical Clustering
1. In the beginning we have 6
clusters: A, B, C, D, E and F
2. We merge clusters D and F into
cluster (D, F) at distance 0.50
3. We merge cluster A and cluster B
into (A, B) at distance 0.71
4. We merge clusters E and (D, F)
into ((D, F), E) at distance 1.00
5. We merge clusters ((D, F), E) and C
into (((D, F), E), C) at distance 1.41
6. We merge clusters (((D, F), E), C)
and (A, B) into ((((D, F), E), C), (A, B))
at distance 2.50
7. The last cluster contains all the objects, which concludes the computation
[Figure: dendrogram over objects A–F; the horizontal axis lists the objects, the vertical axis the lifetime (merge distance)]
• Lifetime vs. K-cluster Lifetime
Key Concepts in Hierarchical Clustering
• Lifetime
The distance from the point at which a cluster is created to the point at which it disappears (merges with another cluster during the clustering).
e.g., the lifetimes of A, B, C, D, E and F are 0.71, 0.71, 1.41, 0.50, 1.00 and 0.50, respectively; the lifetime of (A, B) is 2.50 − 0.71 = 1.79, …
• K-cluster Lifetime
The distance from the point at which K clusters emerge to the point at which the K clusters vanish (due to the reduction to K−1 clusters).
e.g.
5-cluster lifetime is 0.71 - 0.50 = 0.21
4-cluster lifetime is 1.00 - 0.71 = 0.29
3-cluster lifetime is 1.41 – 1.00 = 0.41
2-cluster lifetime is 2.50 – 1.41 = 1.09
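A small sketch that recovers these K-cluster lifetimes from the merge heights read off the A–F dendrogram; the helper name and the hard-coded heights are taken from the example above.

```python
# Merge heights read off the A-F dendrogram above (the last merge is at 2.50)
merge_heights = [0.50, 0.71, 1.00, 1.41, 2.50]

def k_cluster_lifetimes(heights):
    """K-cluster lifetime: gap between the height where K clusters appear and where they vanish."""
    h = sorted(heights)
    n = len(h) + 1                                 # number of objects (here 6)
    # after the m-th merge there are n - m clusters, so K clusters live on [h[n-K-1], h[n-K]]
    return {k: h[n - k] - h[n - k - 1] for k in range(2, n)}

print(k_cluster_lifetimes(merge_heights))
# ≈ {2: 1.09, 3: 0.41, 4: 0.29, 5: 0.21}, matching the slide up to floating-point rounding
```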
• Given a data set of five objects characterised by a single
continuous feature:
• The distance matrix on this dataset is given below. Apply the
agglomerative algorithm with single-link, complete-link and
averaging cluster distance measures to produce three
dendrogram trees, respectively.
Exercise
Object:   a  b  c  d  e
Feature:  1  2  4  5  6

Distance matrix (Euclidean):
      a  b  c  d  e
  a   0  1  3  4  5
  b   1  0  2  3  4
  c   3  2  0  1  2
  d   4  3  1  0  1
  e   5  4  2  1  0
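One possible way to check the three requested dendrograms against SciPy’s implementation (a verification aid under the assumption that SciPy is available, not the intended pen-and-paper solution):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])   # objects a, b, c, d, e (single feature)
d = pdist(X)                                        # condensed Euclidean distance matrix

for method in ('single', 'complete', 'average'):
    Z = linkage(d, method=method)                   # merge history, i.e. the dendrogram
    print(method, Z, sep='\n')
    # dendrogram(Z, labels=list('abcde'))           # plot the tree (requires matplotlib)
```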
• Start with a single cluster composed of all data
points
• Split this into components
• Continue recursively
• Monothetic divisive methods split clusters using one
variable/dimension at a time
• Polythetic divisive methods make splits based on all
variables together
• Any intercluster distance measure can be used
• Computationally intensive, less widely used than
agglomerative methods
Divisive hierarchical clustering
• Cluster Cohesion: measures how closely related the objects in a cluster are
• Example: SSE
• Cluster Separation: measures how distinct or well-separated a cluster is from other clusters
• Example: Squared Error
• Cohesion is measured by within cluster sum of squares (SSE)
• Separation is measured by the between cluster sum of squares
(SSB)
Measures of Cluster Validity: Cohesion and Separation
$$\mathrm{SSE} = \sum_i \sum_{x \in C_i} (x - m_i)^2 \qquad\qquad \mathrm{SSB} = \sum_i |C_i|\,(m - m_i)^2$$
where |𝐶ᵢ| is the size of cluster 𝑖, 𝑚ᵢ is the centroid of cluster 𝐶ᵢ, and 𝑚 is the overall mean of the data.
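A minimal NumPy sketch of both quantities; the function name and interface are assumptions for illustration.

```python
import numpy as np

def sse_ssb(X, labels):
    """Within-cluster SSE (cohesion) and between-cluster SSB (separation) for a clustering."""
    m = X.mean(axis=0)                              # overall mean of the data
    sse = ssb = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)                        # centroid m_i of cluster C_i
        sse += ((Xc - mc) ** 2).sum()               # within-cluster sum of squares
        ssb += len(Xc) * ((m - mc) ** 2).sum()      # |C_i| * squared distance of m_i to m
    return sse, ssb
```

For a fixed data set, SSE + SSB equals the total sum of squares, so lower SSE (tighter clusters) goes hand in hand with higher SSB (better separation).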
• The silhouette coefficient combines ideas of both cohesion and
separation, but for individual points, as well as clusters and
clusterings
• For an individual point, 𝑖
• Calculate 𝒂 = average distance of 𝑖 to the points in its cluster
• Calculate 𝒃 = min (average distance of 𝑖 to points in another cluster)
• The silhouette coefficient for a point is then given by
s = (b – a) / max(a,b)
• Value can vary between -1 and 1
• Typically ranges between 0 and 1.
• The closer to 1 the better.
• Can calculate the average silhouette coefficient for a cluster
or a clustering
Unsupervised Measures: Silhouette Coefficient
[Figure: for a point i, the distances used to calculate a (to points in its own cluster) and b (to points in another cluster)]
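A per-point sketch of s = (b − a) / max(a, b), assuming Euclidean distances and that point i’s cluster contains at least two members; the function name is an assumption.

```python
import numpy as np

def silhouette(X, labels, i):
    """Silhouette coefficient of point i: s = (b - a) / max(a, b)."""
    d = np.linalg.norm(X - X[i], axis=1)                        # distances from i to every point
    own = labels == labels[i]
    a = d[own & (np.arange(len(X)) != i)].mean()                # mean distance within i's own cluster
    b = min(d[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)
```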
• SSE is good for comparing two clusterings or two
clusters
• SSE can also be used to estimate the number of
clusters
Determining the Number of Clusters
[Figure: an example data set and its SSE vs. K curve (K = 2 to 30); the knee of the curve suggests the number of clusters]
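A sketch of producing such an SSE-vs-K curve with scikit-learn’s KMeans (whose inertia_ attribute is the SSE); the random data below is only a placeholder for the data set in the figure.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(500, 2))              # placeholder data set
ks = range(2, 31)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker='o')                             # look for the 'elbow' in this curve
plt.xlabel('K')
plt.ylabel('SSE')
plt.show()
```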