Here we give an overview of the clustering problem, explain how it differs from supervised learning, and discuss how to select the number of clusters. We cover the main approaches to clustering: k-means, hierarchical clustering, and DBSCAN.
4. Clustering Problem formulation
Problem formulation
The main task of cluster analysis is to group instances into subgroups (clusters) of similar instances.
These groups can be:
Partitions
Hierarchies
Fuzzy partitions
Biclusters
Mixtures of distributions
5. Clustering Applications
Applications
Biology and medicine
Gene expression analysis
Tomography clustering
Humanitarian sciences
Sociology and anthropology
Psychology
Technical systems
Telemetry
Image segmentation
Marketing
Customer segmentation
Subgroup behavioral analysis
Text analytics
News clustering
Social networks
Community detection
6. Clustering methods
Plan
1 Clustering
Problem formulation
Applications
2 Clustering methods
k-Means
Hierarchical methods
Agglomerative clustering
Density-based methods
7. Clustering methods
How to measure dissimilarity of instances
Instances $x \in \mathbb{R}^m$ are represented as rows of a feature matrix:
$$\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \Longleftrightarrow \begin{pmatrix} x_1^1 & x_1^2 & \cdots & x_1^m \\ x_2^1 & x_2^2 & \cdots & x_2^m \\ \cdots & \cdots & \cdots & \cdots \\ x_n^1 & x_n^2 & \cdots & x_n^m \end{pmatrix}$$
Minkowski distance
$$d(x, y) = \left( \sum_{i=1}^{m} |x^i - y^i|^p \right)^{1/p}$$
Cosine distance
$$d(x, y) = 1 - \frac{\langle x, y \rangle}{\sqrt{\langle x, x \rangle}\,\sqrt{\langle y, y \rangle}}$$
Hamming distance
$$d(x, y) = \frac{1}{m} \sum_{i=1}^{m} [x^i \neq y^i]$$
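These three dissimilarity measures are easy to compute directly; here is a minimal NumPy sketch (function names and the sample vectors are illustrative, not from the slides):

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance; p=2 gives the Euclidean distance, p=1 the Manhattan distance."""
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

def cosine(x, y):
    """Cosine distance: 1 minus the cosine of the angle between x and y."""
    return 1 - np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))

def hamming(x, y):
    """Hamming distance: fraction of coordinates in which x and y disagree."""
    return np.mean(x != y)

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.0, 0.0, 1.0])
print(minkowski(x, y, p=2), cosine(x, y), hamming(x, y))
```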
8. Clustering methods
k-Means
k-Means is an iterative algorithm to split data into k clusters.
The centroid of each cluster $C_j$ (the mean of its instances), denoted $c_j$, is defined as
$$c_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i$$
The objective is the sum of squared distances between each instance and the centroid of the cluster it belongs to:
$$J(C) = \sum_{j=1}^{k} \sum_{i \in C_j} d(x_i, c_j)^2$$
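For a fixed assignment of instances to clusters, the centroids and the objective $J(C)$ take only a few lines to compute; a minimal NumPy sketch on made-up data (the `labels` array, which encodes the cluster of each instance, is an illustrative assumption):

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8]])
labels = np.array([0, 0, 1, 1])          # cluster index of each instance
k = 2

# centroid of each cluster = mean of the instances assigned to it
centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

# J(C) = sum of squared Euclidean distances from instances to their centroids
J = sum(np.sum((X[labels == j] - centroids[j]) ** 2) for j in range(k))
print(centroids)
print("J(C) =", J)
```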
9. Clustering methods
k-Means
The algorithm
Input: data and k (a hyperparameter)
Output: partition of the data into k clusters
1. Initialization: choose k points as the initial centroids.
2. Update clusters: given the k centroids, assign each instance to the nearest centroid; the instances assigned to centroid $c_j$ ($j = 1, \dots, k$) form cluster $C_j$.
3. Update centroids: for each cluster $C_j$, recompute the centroid as the mean of all instances in that cluster.
Steps 2-3 are repeated until convergence.
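A compact NumPy sketch of steps 1-3 (the classic Lloyd iteration); picking k random instances as initial centroids is one common initialization, not prescribed by the slides:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: choose k instances as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Update clusters: assign each instance to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update centroids: mean of the instances in each cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # convergence
            break
        centroids = new_centroids
    return labels, centroids
```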
17. Clustering methods
Clustering quality and the number of clusters
Elbow method
For each k we can calculate J(C).
Then we look for the value of k after which increasing k no longer decreases J "too much".
Formally, we choose the k that minimizes
$$D(k) = \frac{|J(k) - J(k+1)|}{|J(k-1) - J(k)|}$$
18. Clustering methods
Clustering quality and the number of clusters
Elbow method
[Figure "Elbow Method": a 2-D sample data set (left) and the curve of J plotted against k = 2, ..., 10 (right); the bend of the curve suggests the number of clusters.]
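One way to obtain such a curve is to run k-means for a range of k and record the objective; in scikit-learn the fitted objective is exposed as `inertia_`. A sketch on synthetic data (not the slide's data set), assuming scikit-learn is available:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

ks = range(1, 11)
J = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# D(k) = |J(k) - J(k+1)| / |J(k-1) - J(k)| for k = 2 .. 9  (k = i + 1 for index i)
D = {k: abs(J[i] - J[i + 1]) / abs(J[i - 1] - J[i])
     for i, k in enumerate(ks) if 2 <= k <= 9}
best_k = min(D, key=D.get)
print(D, "chosen k:", best_k)
```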
19. Clustering methods
Clustering quality and the number of clusters
Silhouette
The silhouette of an instance $x_i$ in a cluster $C$ is defined as
$$s(i) = \frac{b(i) - a(i)}{\max\big(a(i),\, b(i)\big)},$$
where $a(i)$ is the mean distance from $x_i$ to all other instances of $C$, and $b(i)$ is the mean distance from $x_i$ to the instances of the nearest other cluster.
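The mean silhouette over all instances is often used to compare candidate values of k, and scikit-learn provides it directly; a minimal sketch on synthetic data (not the slide's example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # higher mean silhouette is better
```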
24. Clustering methods Hierarchical methods
Agglomerative clustering
Sequential merging of the most similar clusters:
0. Start with each instance forming its own cluster.
1. Find the two closest clusters.
2. Merge them.
Repeat steps 1-2 until all instances are in the same cluster.
How do we define the distance between clusters?
25. Clustering methods Hierarchical methods
Agglomerative clustering
Linkage
1. Single linkage
$$d(A, B) = \min_{x \in A,\, y \in B} d(x, y)$$
2. Complete linkage
$$d(A, B) = \max_{x \in A,\, y \in B} d(x, y)$$
26. Clustering methods Hierarchical methods
Agglomerative clustering
Linkage
3. Average linkage
$$d(A, B) = \frac{1}{|A|\,|B|} \sum_{i \in A} \sum_{j \in B} d(x_i, y_j)$$
4. Weighted average linkage: let cluster A be the union of clusters p and q; then
$$d(A, B) = \frac{d(p, B) + d(q, B)}{2}$$
5. Centroid linkage
$$d(A, B) = \| c_A - c_B \|_2$$
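The single, complete, average, and centroid rules can be evaluated directly for two clusters given as arrays of points; a minimal NumPy sketch (helper and function names are illustrative):

```python
import numpy as np

def pairwise_dists(A, B):
    """All Euclidean distances between points of cluster A and points of cluster B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_linkage(A, B):
    return pairwise_dists(A, B).min()      # distance of the closest pair

def complete_linkage(A, B):
    return pairwise_dists(A, B).max()      # distance of the farthest pair

def average_linkage(A, B):
    return pairwise_dists(A, B).mean()     # mean over all |A||B| pairs

def centroid_linkage(A, B):
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```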
27. Clustering methods Hierarchical methods
Agglomerative clustering
Merging clusters can be depicted with a dendrogram.
Let us take a look at a 1D sample: { 1, 2, 3, 7, 10, 12, 25, 29 }
[Figure: dendrogram of this sample. The x-axis lists the objects 1, 2, 3, 7, 10, 12, 25, 29; the y-axis shows cluster distances; the height at which clusters A and B merge is the distance between them.]
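The same 1-D sample can be clustered hierarchically and drawn as a dendrogram with SciPy; a sketch assuming single linkage (the slides do not state which linkage produced the picture):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

objects = [1, 2, 3, 7, 10, 12, 25, 29]
X = np.array(objects, dtype=float).reshape(-1, 1)   # 1-D sample as an n x 1 matrix

Z = linkage(X, method="single")   # also try "complete", "average", "centroid"
dendrogram(Z, labels=objects)
plt.xlabel("Objects")
plt.ylabel("Cluster distance")
plt.show()
```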
28. Clustering methods Density-based methods
Density-based methods
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
29. Clustering methods Density-based methods
DBSCAN algorithm
All points can be divided into core points of dense regions, border points, and noise
(we skip the formal definitions here).
30. Clustering methods Density-based methods
DBSCAN. Example
Hyperparams: M = 4, Eps > 0
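In scikit-learn the two hyperparameters correspond to `min_samples` (M) and `eps`; a minimal sketch on two interleaved half-moons, a shape where k-means struggles but DBSCAN typically recovers both clusters (the data set and the parameter values here are illustrative, not the slide's example):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=4).fit_predict(X)
print("clusters found:", len(set(labels) - {-1}))   # label -1 marks noise points
```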
34. Clustering methods Density-based methods
DBSCAN. Pros and cons
Pros
+ Can find clusters of any shape
+ Easy to implement
+ Can find noise in data
+ Nice complexity: O(n log n) with a suitable data structure (otherwise O(n²))
Cons
- Parametric (Eps and M must be chosen)
- Doesn’t work well when clusters differ in density
- Depends on the chosen metric
35. Clustering methods Density-based methods
Contacts
Questions
Thanks!
Please ask your questions in the OpenDataScience Slack team.
http://ods.ai