SlideShare a Scribd company logo
1 of 92
Download to read offline
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Assessing the quality of a clustering
Christian Hennig
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Cluster analysis methods
1. A short introduction to cluster analysis
Cluster analysis is about finding groups in data.
var 1
−0.4 −0.2 0.0 0.2 0.4
3
3
3333
333 333
3333
333
3
33
3
3333
33
3
3
3333
4
5
4
4
4 544444
4444444444444 4
5
5
55
5
55 55
5
5 5
5
5
5
5
9
9995
58888
8
88
9
888
6
6
6
66
6
66666 6
6
11111111111
1
11111111
1
111
1
1111
6
111111111111111
1
111111
111111
111
6
1
9
999
2222
2
2
2
222
2
2
22 2
22
2
2
22
2 2 222
2 22 2
22 2
2
22
2
22
222
2
2
2
2
2
2 7
7
777
7777
7
777
7
7
3
3
3 333
333333
3333
333
3
33
3
3333
33
3
3
3333
4
5
4
4
454 444 4
4 4444 44 44 44 44 4
5
5
55
5
5 55 5
5
55
5
5
5
5
9
9995
5 8888
8
88
9
888
6
6
6
6 6
6
666 666
6
11111 11 111 1
1
11111 111
1
11 1
1
111 1
6
1 11111 11 11 1111 1
1
1 11 11 1
11 11 11
111
6
1
9
999
22 22
2
2
2
22 2
2
2
2 22
22
2
2
22
2 2222
2222
222
2
22
2
2 2
2 22
2
2
2
2
2
27
7
7 77
7777
7
777
7
7
−0.4 −0.2 0.0 0.2
−0.4−0.20.00.20.4
3
3
3 333
33 33 33
3333
33 3
3
33
3
33 33
33
3
3
3333
4
5
4
4
45 4 4444
444444 444 44 44 4
5
5
55
5
55 5 5
5
5 5
5
5
5
5
9
999 5
5888 8
8
88
9
888
6
6
6
66
6
66 666 6
6
1111111 11 11
1
11 11 1111
1
111
1
11 11
6
1111 111 111 11 11 1
1
111111
111111
111
6
1
9
999
2 222
2
2
2
22 2
2
2
22 2
2 2
2
2
22
22222
2 22 2
2 22
2
22
2
22
22 2
2
2
2
2
2
27
7
77
7
77 77
7
77 7
7
7
−0.4−0.20.00.20.4
3 3
3333333
3
3
3 333333
3
3 333 33
3
3333
333
3
3
45
4
44
5
44444 44
4444444444
4
45
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
9999
55
8
888
8
8
89 888
6
6
6
66 6
666
66
66
111111111111
1111111
1 1111
1
11
11
6
1111
11
111111
1111
111111
111111
111
6
1
9
9
99
22
2
2 222 2222
2
2
2
2 2222 22
2
2
2
2
22
2
2
2
22
2
2
2
2 2
22 2
22
2 2
2
2
22
77
7
7
7
77
77
7
7
77 7
7
var 2
33
3 333333
3
3
3 333333
3
3333 33
3
3333
333
3
3
45
4
44
5
4 444 44 4
444 44 44 44 4
4
45
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
9999
55
8
888
8
8
89888
6
6
6
6 6 6
666
66
6 6
11111 11 111 11
11111 11
11 11 1
1
11
1 1
6
1 11111
11 11 11
11 11
1 11 11 1
11 11 11
111
6
1
9
9
99
22
2
222222 22
2
2
2
2 222222
2
2
2
2
22
2
2
2
22
2
2
2
2 2
2 22
22
2 2
2
2
2 2
77
7
7
7
77
77
7
7
777
7
33
3 33333 3
3
3
3 3333 33
3
3333 33
3
3333
333
3
3
45
4
44
5
4 444444
4444 444 44 4
4
45
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
9999
55
8
88 8
8
8
8 9888
6
6
6
66 6
66 6
66
66
1111111 11 111
11 11 111
11 111
1
11
11
6
1111
11
1 111 11
11 11
111111
111111
111
6
1
9
9
99
2 2
2
2 2 22 22 22
2
2
2
22 222 22
2
2
2
2
22
2
2
2
2 2
2
2
2
22
22 2
2 2
22
2
2
2 2
7 7
7
7
7
77
77
7
7
7 77
7
3
33
33
3
3
3333
3
33333
33
3 33
3
33333
33 3333
3 4
5
4
4454
4
44
4
4
44
4
4
44
4
4
44
4
4
4
5
5
5
5
5
5
5
5
5
5
5
5
55
5 5
99
9
9
5
5
8
88888
89 888
6 6 6
6
6
6
6
66
6
6
6
6
11
1
1
1
11
1
1
1
11
1
1
111
1
11
1
11
1
1
111
1 61
1
11
1
1
11
11
11
1
1
1
1
1
1
1
1
1
1
11
1
1
11
111
6
1
9
9
9
9
2
2
22 222 2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
22
22222
222222
2
2
2 2
2
22
2
2
2
2
2
7
7
7
7
7
7
777
7
7
77
77
3
33
33
3
3
33 33
3
33333
33
333
3
33333
333333
34
5
4
44 54
4
44
4
4
44
4
4
44
4
4
44
4
4
4
5
5
5
5
5
5
5
5
5
5
5
5
55
5 5
99
9
9
5
5
8
88888
89888
6 66
6
6
6
6
66
6
6
6
6
11
1
1
1
11
1
1
1
11
1
1
111
1
11
1
11
1
1
111
1 61
1
11
1
1
11
11
11
1
1
1
1
1
1
1
1
1
1
11
1
1
11
111
6
1
9
9
9
9
2
2
222222
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
22
22 22 2
22 22 22
2
2
22
2
22
2
2
2
2
2
7
7
7
7
7
7
777
7
7
77
7 7
var 3
−0.20.00.20.4
3
33
33
3
3
3 33 3
3
3333 3
3 3
333
3
33 333
33 3333
3 4
5
4
445 4
4
44
4
4
44
4
4
4 4
4
4
44
4
4
4
5
5
5
5
5
5
5
5
5
5
5
5
5 5
55
99
9
9
5
5
8
88 888
8 9888
66 6
6
6
6
6
6 6
6
6
6
6
11
1
1
1
11
1
1
1
11
1
1
11 1
1
11
1
11
1
1
11 1
161
1
11
1
1
1 1
11
11
1
1
1
1
1
1
1
1
1
1
11
1
1
11
111
6
1
9
9
9
9
2
2
22 2 22 2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
22
22 22 2
2 22222
2
2
2 2
2
22
2
2
2
2
2
7
7
7
7
7
7
7 77
7
7
7 7
77
−0.4 −0.2 0.0 0.2 0.4
−0.4−0.20.00.2
3 3
3
33
3
33
3
3
3
3
3
33
3
3
3
3
3 33
3
3
3
3
333
3
33
333
4
5
4
44
5
4
4
444 444
44
4
4
4
4
44
4
4
4
5
5
5
5
5
55
5
5
5 5
5
5
5
5
5
9
99
9
5
5
88
8
88
8
8
9
888
6
6
6
6
6
6
66
6
6
6
66
11
11111
1
1
1
1
1
1
1
11
1
11
1
1
11
1
1
1
1
11 6
1111
11
1
1
1
1
1
1
11
11
1
1
1
1
1111111
1
1
11
6
1
9
9
99
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
22
2
2
2
222
22
2
2
2
2
2
22222 2
2
2
2
2
2
2 22 2
2
2
7
777
7
7
7
7
7
777
7 7
7
33
3
33
3
33
3
3
3
3
3
33
3
3
3
3
333
3
3
3
3
333
3
33
333
4
5
4
44
5
4
4
444444
44
4
4
4
4
44
4
4
4
5
5
5
5
5
55
5
5
55
5
5
5
5
5
9
99
9
5
5
88
8
88
8
8
9
888
6
6
6
6
6
6
66
6
6
6
66
11
11111
1
1
1
1
1
1
1
11
1
11
1
1
11
1
1
1
1
11 6
1111
11
1
1
1
1
1
1
11
11
1
1
1
1
1111111
1
1
11
6
1
9
9
99
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
22
2
2
2
2 2 2
22
2
2
2
2
2
2 22 222
2
2
2
2
2
2222
2
2
7
7 77
7
7
7
7
7
777
77
7
−0.2 0.0 0.2 0.4
33
3
33
3
33
3
3
3
3
3
33
3
3
3
3
333
3
3
3
3
333
3
33
333
4
5
4
44
5
4
4
44 44 44
44
4
4
4
4
44
4
4
4
5
5
5
5
5
5 5
5
5
5 5
5
5
5
5
5
9
99
9
5
5
88
8
88
8
8
9
888
6
6
6
6
6
6
66
6
6
6
6 6
11
11
1 11
1
1
1
1
1
1
1
11
1
11
1
1
11
1
1
1
1
1 16
1 111
11
1
1
1
1
1
1
11
11
1
1
1
1
1 111
11 1
1
1
11
6
1
9
9
99
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
22
2
2
2
2 22
22
2
2
2
2
2
22222 2
2
2
2
2
2
2 22 2
2
2
7
77 7
7
7
7
7
7
7 77
77
7
var 4
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Cluster analysis methods
1.1 Cluster analysis methods
1.1.1 k-means (Fix & Hodges 1951)
n
i=1
xi − ¯xC(i)
2
= min!
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Cluster analysis methods
1.1 Cluster analysis methods
1.1.1 k-means (Fix & Hodges 1951)
n
i=1
xi − ¯xC(i)
2
= min!
represents all objects by centroid,
“compact” clusters.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Cluster analysis methods
1.1 Cluster analysis methods
1.1.1 k-means (Fix & Hodges 1951)
n
i=1
xi − ¯xC(i)
2
= min!
represents all objects by centroid,
“compact” clusters.
Version: Don’t square, other centroids than mean (“pam”).
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Cluster analysis methods
8
8
8888
8
8 8
8
8
8
888
88
8
8
8
88
8
8
8
8
8
8
88
8
8
8
8
8
71
7
7 7
7
7
7
7
7
7
77
77
7
7
7
77
7 7
7
7
7
1
7
1
1
9
9
1
9
1
9
1
1
9
9
1
9
44
44
4
4
9
9
9
9
9
9
92 99
9
2
2
2
2
2
2
2
2 2
2
2
2
2
55
5
5 5
55
5
5
55
5
5
5
55
55
5
5
5
55
5
5
5
5
55
3
5
5
55
5 5
5
5
5
5
5
5
555
5
55 5
5
55
5
5555
5
555
3
5
2
2
22
6
6
3
6
226
666
3
2
6
6
6
2
2
66 6
6
6
6
3
6
66
3
6
3
6
3
3
6
3
6
6
3
6
6
6 6
3
6
3
2
3
6
3
3
3
3
6
3
3
33
3
3
3
3
3
3
−0.4 −0.2 0.0 0.2 0.4
−0.4−0.20.00.20.4
MDS 1
MDS2
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Cluster analysis methods
1.1.2 Gaussian mixture model (Pearson 1894)
f(x) =
k
j=1
πjϕaj ,Σj
(x).
Clusters are described by Gaussian distributions.
Elliptical clusters, flexible size and shape.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Cluster analysis methods
3
3
3333
3
33
3
3
3 33333
3
3
3
333
3
3
3
33
33
3
3
3
3
3
45
4
4 4
5
4
4
4
4
4
44
44
4
44
44
4 44
4
4
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
99
99
5 5
8
8
88
8
8
89 888
6
6
6
6
6 6
6
6 6
6
6
6
6
111
1 1
11
1
1
11
1
1
1
1111
1
1
1
11
1
1
11
11
6
1111
1 1
1
11
11
1
111
1
111
1
11
1
11111
111
6
1
9
9
99
2
2
2
2 222
2222
2
2
2
2
2
2
22 2
2
2
2
2
2
22
2
2
2
2
2
2
2
2
2
2
2
2
2
2 2
2
2
2
2
2
2
7
7
7
7
7
7
7
77
7
7
7
7
7
7
−0.4 −0.2 0.0 0.2 0.4
−0.4−0.20.00.20.4
MDS 1
MDS2
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Cluster analysis methods
1.1.3 Classical hierarchical methods
Operate on dissimilarity matrices;
compute dissimilarity measure for every pair of observations.
Can use Euclidean distance,
but also tailor-made distances for other data formats.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Cluster analysis methods
1.1.3 Classical hierarchical methods
Operate on dissimilarity matrices;
compute dissimilarity measure for every pair of observations.
Can use Euclidean distance,
but also tailor-made distances for other data formats.
“Cluster”: a collection of similar objects,
dissimilar to the others.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Cluster analysis methods
Genetic data: 236 Tetragonula bees, 13 allele pairs
[,1] [,2] [,3] [,4] [,5] [,6] (...)
[1,] "NO" "AA" "PP" "HH" "EH" "FF"
[2,] "EO" "AA" "PP" "HH" "GH" "FF"
[3,] "NQ" "AA" "PT" "HH" "GF" "EF"
[4,] "OO" "AA" "PP" "GH" "GH" "EF"
[5,] "OO" "AA" "PP" "GH" "GH" "EF"
[6,] "LN" "AA" "PP" "HH" "EG" "FE"
(...)
Compute “shared allele distance”.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Cluster analysis methods
[,1] [,2] [,3] [,4] [,5]
[1,] 0.00 0.21 0.33 0.29 0.25
[2,] 0.21 0.00 0.33 0.25 0.21
[3,] 0.33 0.33 0.00 0.29 0.33 (...)
[4,] 0.29 0.25 0.29 0.00 0.08
[5,] 0.25 0.21 0.33 0.08 0.00
(...)
Dataset seen before is a
Euclidean approximation (“MDS”) of this.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Cluster analysis methods
1.1.3 Classical hierarchical methods
Operate on dissimilarities and produce hierarchical trees
(originally motivated by biological classification).
Differ in definition of “dissimilarity between clusters”.
818280797778676675737668706361624172716469377465465654534751455955525048385749604358403942364432302826231816129198453353172917206111342433222710251415213219085848393878886899192170172173961041051069910398971021001019495171198182177168234220208206199205216210204185197194191190189219218209217215207214202175188183181178200179193176196213212180187192174221195211203186184201169167136133117151166145116165156142131110146155149144143132157128134125152158124154147129161163160153162150159140137126119122135121118111127120112130109115113164139123141108107114138148229233235231226236228222230224232227223225
0.00.20.40.6
Cluster Dendrogram
hclust (*, "single")
as.dist(tai$distmat)
Height
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Cluster analysis methods
Single Linkage: (Florek and Perkal 1951)
˜d(A, B) = min
a∈A,b∈B
d(a, b)
Complete Linkage:
˜d(A, B) = max
a∈A,b∈B
d(a, b)
Average Linkage:
˜d(A, B) = avea∈A,b∈Bd(a, b)
These can deliver quite different clusterings.
(Complete L. very compact,
Single L. separated but maybe widespread)
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Cluster analysis methods
1.1.4 Spectral clustering (Shi and Malik 2000)
Dissimilarity-based nonlinear dimension reduction
for k-means.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Cluster analysis methods
1
1
1111
1
1 1
1
1
1
111
11
1
1
1
11
1
1
1
1
1
1
11
1
1
1
1
1
62
6
6 6
2
6
6
6
6
6
66
66
6
6
6
66
6 6
6
6
6
2
2
2
2
6
2
2
2
2
2
2
2
2
2
2
2
55
55
2
2
3
3
3
3
3
3
35 33
3
7
7
7
7
7
7
7
7 7
7
7
7
7
44
4
4 4
44
4
4
44
4
4
4
44
44
4
4
4
44
4
4
4
4
44
4
4
4
44
4 4
4
4
4
4
4
4
444
4
44 4
4
44
4
4444
4
444
4
4
5
5
55
7
7
7
7
777
777
7
7
7
7
7
7
7
77 7
7
7
7
7
7
77
7
7
7
7
7
7
7
7
7
7
7
7
7
7 7
7
7
7
7
7
7
7
7
7
7
7
7
7
77
7
7
7
7
7
7
−0.4 −0.2 0.0 0.2 0.4
−0.4−0.20.00.20.4
ctai$points[,1]
ctai$points[,2]
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Cluster analysis methods
1.1.5 Density-based methods
such as “DBSCAN” (Ester et al. 1996),
joins observations with all neighbouring points,
and neighbourhoods if they share enough points.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Cluster analysis methods
1
1
1111
1
1 1
1
1
1
111
11
1
1
1
11
1
1
1
1
1
1
11
1
1
1
1
1
2N
2
2 2
N
2
2
2
2
2
22
22
2
2
2
22
2 2
2
2
2
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
NN
NN
N
N
3
3
3
3
3
3
3N 33
3
N
N
4
4
4
N
4
4 4
N
N
4
4
55
5
5 5
55
5
5
55
5
5
5
55
55
5
5
5
55
5
5
5
5
55
N
5
5
55
5 5
5
5
5
5
5
5
555
5
55 5
5
55
5
5555
5
555
N
5
N
N
NN
6
6
6
6
666
666
6
N
6
6
6
6
6
66 6
6
6
6
6
6
66
6
6
6
6
6
6
6
6
6
6
6
6
6
6 6
6
6
6
6
6
6
7
7
7
7
N
7
7
77
7
7
7
7
7
7
−0.4 −0.2 0.0 0.2 0.4
−0.4−0.20.00.20.4
ctai$points[,1]
ctai$points[,2]
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Cluster analysis methods
1.1.6 Further issues in cluster analysis
Number of clusters
Cluster validation
Dissimilarity definition
Choice of method
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
2. Benchmarking and measurement of quality
Which clustering is better?
(Old faithful geyser data)
−2 −1 0 1 2
−2−101
mclust
waiting
duration
−2 −1 0 1 2
−2−101
pam
waiting
duration
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
Which clustering is better?
−10 0 10 20 30
010203040
xy[,1]
xy[,2]
−10 0 10 20 30
010203040
xy[,1]
xy[,2]
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
Benchmarking approaches:
Real datasets with known classes
Simulated datasets from mixture distributions
Real datasets without known classes
With known truth can compute misclassification rates.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
Disadvantages of benchmarking with known truth
In datasets with known classes
clustering is not of real scientific interest.
Deviate systematically from real clustering problems.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
Disadvantages of benchmarking with known truth
In datasets with known classes
clustering is not of real scientific interest.
Deviate systematically from real clustering problems.
The fact that we know certain true classes
doesn’t preclude other legitimate/”true” clusterings.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
Disadvantages of benchmarking with known truth
In datasets with known classes
clustering is not of real scientific interest.
Deviate systematically from real clustering problems.
The fact that we know certain true classes
doesn’t preclude other legitimate/”true” clusterings.
Classes in supervised classification problems
may not qualify as data analytic clusters.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
Disadvantages of benchmarking with known truth
In datasets with known classes
clustering is not of real scientific interest.
Deviate systematically from real clustering problems.
The fact that we know certain true classes
doesn’t preclude other legitimate/”true” clusterings.
Classes in supervised classification problems
may not qualify as data analytic clusters.
So there could be better truths than the known one.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
How true are the true given classes?
(Hennig and Liao 2013, social stratification data)
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
7 standard occupation classes such as “manual workers”,
“managerials and professionals”, “not working”
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
These are not “data analytic clusters”.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
Mixture components aren’t always “data analytic clusters”
either.
55
3
54
3
54
5
5
4
4
5
4
3
5
4
5
5
4
3
534
4
4
4
4
4
4
5
54
5
4 5
4
4
4
5
5
5
4
54
4
5
5
4
5
4
5
5
5
4
4
4
4
5
4
5
5
55
5
4
4 4
4
4
5
5
5
4
3
4
1
4
4
3
4
4
5
4
1
4 54
5
5
3
5
54
5
1
4
4
4
4
4
5
4
3
4
5
4
55
5
4
5
4
5
3
5
4 4
4
5
3
5
4 5
34
4
5
5
5
4
3
4
5
5
55
4 5
4
54
4
5
4
4
4
4
4
5
4
5
3
5
5
3
4
5 5
4 5
4 5
5
54
4
4
4 4
5
5
4
54
5
5
4
4
5
4
5
5
5 5
4
4
4
5
5
5
5
4
4
4
5
4
4
2
5
4
3
5
4
4
5
4 54
5
4
4
4
4
5
4
5
5
5
4 4
5
4
5
54
4 4 5
4
5
5
4
5
5
5
5
4
4 5
3
5
5 54
5
4
4 4
4
35
5
5
4
5
4
4
5
3
4
5
5
4
4
4
5
4
4
4 5
5
4
54
5
44
5
4 5
3
4
4
3
3
4
4
55
4 5
4
4
4
5
5
4
5
5
555
4
5
4
4
5
5
3
5
4
4
5
5
4
4
5
3
4
55
4 54
4
4
45
3
4
5
3
5
5
4
4
3
4
2 5
4
54
4
4
4
4
4
4
2 5
4
4
4
5
5
4
4
5
5
5
4
5
5
5
55
4 5
5
5
4 54
5
4
4
554
4 551
4
5
4
5
2
33
4
4
45
54
4
5
1
5
44
4
4
4
4
54
4
3
4
4
4
4
5
3
5 554
4
44
5
4
4
5
5
4
5
4
4
5
4
4
5
5
5
2
3
5
34
5
4
5
3
4
4
5
5
1
4
5
4
4 5 4 54 54
5
4
4
4
4
55
4 5
5
54
4 5
4
5
5
5
2
4
3
4
4
4
5
5
5 5
5
4
4
5
4 4
3
5
0 20 40 60 80 100
−20−10010203040
xdata$x[,1]
xdata$x[,2]
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
Using a known truth is useful and fair enough
but also want to evaluate clusterings
on data for which truth is not known.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
There is a range of cluster validation indexes
measuring clustering quality, such as
Average silhouette width (ASW)
(Kaufman and Rouseeuw 1990)
sw(i, C) = b(i,C)−a(i,C)
max(a(i,C),b(i,C)),
a(i, C) =
1
|Cj| − 1
x∈Cj
d(xi, x), b(i, C) = min
xi ∈Cl
1
|Cl|
x∈Cl
d(xi, x).
Maximum average sw ⇒ good C.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
“One size fits it all”-approach.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
“One size fits it all”-approach.
Homogeneity will normally dominate here:
−10 0 10 20 30
010203040
xy[,1]
xy[,2]
−10 0 10 20 30
010203040
xy[,1]
xy[,2]
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
My general philosophy
There are various different aims of clustering.
Measure them separately to characterise
what a method does best,
instead of producing a single ranking.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
Typical clustering aims
Between-cluster separation
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
Typical clustering aims
Between-cluster separation
Within-cluster homogeneity (low distances)
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
Typical clustering aims
Between-cluster separation
Within-cluster homogeneity (low distances)
Within-cluster homogeneous distributional shape
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
Typical clustering aims
Between-cluster separation
Within-cluster homogeneity (low distances)
Within-cluster homogeneous distributional shape
Good representation of data by centroids
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
Typical clustering aims
Between-cluster separation
Within-cluster homogeneity (low distances)
Within-cluster homogeneous distributional shape
Good representation of data by centroids
Little loss of information
from original distance between objects.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
Typical clustering aims
Between-cluster separation
Within-cluster homogeneity (low distances)
Within-cluster homogeneous distributional shape
Good representation of data by centroids
Little loss of information
from original distance between objects.
Clusters are regions of high density
without within-cluster gaps
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
Typical clustering aims
Between-cluster separation
Within-cluster homogeneity (low distances)
Within-cluster homogeneous distributional shape
Good representation of data by centroids
Little loss of information
from original distance between objects.
Clusters are regions of high density
without within-cluster gaps
Uniform cluster sizes
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
Typical clustering aims
Between-cluster separation
Within-cluster homogeneity (low distances)
Within-cluster homogeneous distributional shape
Good representation of data by centroids
Little loss of information
from original distance between objects.
Clusters are regions of high density
without within-cluster gaps
Uniform cluster sizes
Stability
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
These may be in conflict with each other.
−10 0 10 20 30
010203040
xy[,1]
xy[,2]
−10 0 10 20 30
010203040
xy[,1]
xy[,2]
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
E.g., pattern recognition in images
requires separation,
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
E.g., pattern recognition in images
requires separation,
clustering for information reduction requires
good representation by centroids,
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
E.g., pattern recognition in images
requires separation,
clustering for information reduction requires
good representation by centroids,
groups in social network analysis shouldn’t have
large within-cluster gaps,
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Which clustering is better?
Benchmarking approaches
Cluster validation indexes
My general philosophy
Typical clustering aims
E.g., pattern recognition in images
requires separation,
clustering for information reduction requires
good representation by centroids,
groups in social network analysis shouldn’t have
large within-cluster gaps,
underlying “true” classes (biological species)
may cause homogeneous distributional shapes.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
3. Cluster quality statistics
Measuring between-cluster separation
∃ several ways measuring separation (as for other aims).
Straightforward: min distance between any two clusters,
or distance between centroids (e.g., k-means).
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
−2 −1 0 1 2
−2−101
waiting
duration
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
−2 −1 0 1 2
−2−101
waiting
duration
M
M
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
Measuring between-cluster separation
∃ several ways measuring separation (as for other aims).
Straightforward: min distance between any two clusters,
or distance between centroids (e.g., k-means).
These measure quite different concepts of separation.
(min distance relies on only two points;
centroid distance ignores what goes on at border.)
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
p-separation index:
More stable version of “min distance”:
Average distance to nearest point in different cluster for
p = 10% “border” points in any cluster.
−2 −1 0 1 2
−2−101
waiting
duration
X
X
X
X
X
X
X
XX
X
X XX
X
X
X
X
X
X
X
X
X X
X
X
X X
X
X
X
X
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
Measuring “density mountains vs. valleys”
Index that measures whether clusters correspond
to “density mountains”,
and whether “valleys” are between clusters.
Note: This is current research and may be revised.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
Two aspects:
(a) Density goes down from mode;
no gaps and valleys within clusters.
(b) Cluster borders are valleys;
they don’t run through mountains.
Estimate density by weighted count of close points
(“kernel density”).
0.00.51.01.52.0
x
k(x)
10% quantile of within−cluster distances
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
−2 −1 0 1 2
−2−101
sinlink g= 2
waiting
duration
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
Start from cluster modes
−0.5 0.0 0.5 1.0 1.5 2.0
−1.8−1.6−1.4−1.2−1.0−0.8
sinlink g= 2 Step 1
waiting
duration
X
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
Connect closest point to cluster
−0.5 0.0 0.5 1.0 1.5 2.0
−1.8−1.6−1.4−1.2−1.0−0.8
sinlink g= 2 Step 3
waiting
duration
X
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
As long as density goes down, no penalty
−0.5 0.0 0.5 1.0 1.5 2.0
−1.8−1.6−1.4−1.2−1.0−0.8
sinlink g= 2 Step 6
waiting
duration
X
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
Penalty for density increase
−2 −1 0 1 2
−2−101
sinlink g= 2 Step 98
waiting
duration
X
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
−2 −1 0 1 2
−2−101
sinlink g= 2 Step 99
waiting
duration
X
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
−2 −1 0 1 2
−2−101
sinlink g= 2 Step 100
waiting
duration
X
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
−2 −1 0 1 2
−2−101
sinlink g= 2 Step 101
waiting
duration
X
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
−2 −1 0 1 2
−2−101
sinlink g= 2 Step 102
waiting
duration
X
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
−2 −1 0 1 2
−2−101
sinlink g= 2 Step 103
waiting
duration
X
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
−2 −1 0 1 2
−2−101
sinlink g= 2 Step 104
waiting
duration
X
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
−2 −1 0 1 2
−2−101
sinlink g= 2 Step 105
waiting
duration
X
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
−2 −1 0 1 2
−2−101
sinlink g= 2 Step 106
waiting
duration
X
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
−2 −1 0 1 2
−2−101
sinlink g= 2 Step 107
waiting
duration
X
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
−2 −1 0 1 2
−2−101
sinlink g= 2 Step 108
waiting
duration
X
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
−2 −1 0 1 2
−2−101
sinlink g= 2 Step 297
waiting
duration
X
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
Add penalty density∗density from other clusters
−2 −1 0 1 2
−2−101
specc g= 3
waiting
duration
P P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
PP
P
P
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
Other statistics
Within-cluster average distance
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
Other statistics
Within-cluster average distance
Within-cluster similarity measure
to normal/uniform
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
Other statistics
Within-cluster average distance
Within-cluster similarity measure
to normal/uniform
Within-cluster (squared) distance to centroid
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
Other statistics
Within-cluster average distance
Within-cluster similarity measure
to normal/uniform
Within-cluster (squared) distance to centroid
ρ(distance, cluster induced distance) (Hubert’s Γ)
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
Other statistics
Within-cluster average distance
Within-cluster similarity measure
to normal/uniform
Within-cluster (squared) distance to centroid
ρ(distance, cluster induced distance) (Hubert’s Γ)
Entropy of cluster sizes
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
Other statistics
Within-cluster average distance
Within-cluster similarity measure
to normal/uniform
Within-cluster (squared) distance to centroid
ρ(distance, cluster induced distance) (Hubert’s Γ)
Entropy of cluster sizes
Average largest within-cluster gap
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
Other statistics
Within-cluster average distance
Within-cluster similarity measure
to normal/uniform
Within-cluster (squared) distance to centroid
ρ(distance, cluster induced distance) (Hubert’s Γ)
Entropy of cluster sizes
Average largest within-cluster gap
Variation of clusterings on bootstrapped data
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Measuring between-cluster separation
Measuring “density mountains vs. valleys”
Other statistics
Other statistics
Within-cluster average distance
Within-cluster similarity measure
to normal/uniform
Within-cluster (squared) distance to centroid
ρ(distance, cluster induced distance) (Hubert’s Γ)
Entropy of cluster sizes
Average largest within-cluster gap
Variation of clusterings on bootstrapped data
Standardise all indexes to [0, 1] so that
“large is good”.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
4. Examples
−10 0 10 20 30
010203040
xy[,1]
xy[,2]
−10 0 10 20 30
010203040
xy[,1]
xy[,2]
3-means mclust-3
ave within 0.811 0.643
sep index 0.163 0.306
density index 0.977 0.978
within gap 0.927 0.949
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
−2 −1 0 1 2
−2−101
mclust
waiting
duration
−2 −1 0 1 2
−2−101
pam
waitingduration
−2 −1 0 1 2
−2−101
spectral
waiting
duration
−2 −1 0 1 2
−2−101
ave.linkage
waiting
duration
−2 −1 0 1 2
−2−101
single linkage
waiting
duration
−2 −1 0 1 2
−2−101
pdfCluster (3)
waiting
duration
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
mclust pam spect ave.l sing.l pdf3
ave within 0.71 0.95 0.82 0.90 0.04 0.98
sep index 0.98 0.30 0.94 0.60 0.99 0.78
density 0.99 0.44 0.70 0.63 0.59 0.99
gap 0.14 0.46 0.46 0.46 0.99 0.48
gamma 0.81 0.91 0.92 0.96 0.06 0.98
normality 0.69 0.44 0.45 0.48 0.11 0.52
Note: These values are quantile-standardised,
implementation of this in fpc is still to come.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
0.00.20.40.60.81.0
Number of clusters
dindex
2 3 4 5
kmeans
kmeans
kmeans
kmeans
avelink
avelink
avelink avelink
sinlink
sinlink
sinlink
sinlink
comlink
comlink
comlink
comlink
mclust
mclust
mclust mclust
pam
pam
pam
pam
specc
specc
specc
specc
pdfclus
Quantile−calibrated density index
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
6. Discussion
Clustering quality is multidimensional.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
6. Discussion
Clustering quality is multidimensional.
Provide multidimensional evaluation,
characterising a method’s behaviour.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
6. Discussion
Clustering quality is multidimensional.
Provide multidimensional evaluation,
characterising a method’s behaviour.
Can aggregate criteria by weighted mean
given well justified weights.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
6. Discussion
Clustering quality is multidimensional.
Provide multidimensional evaluation,
characterising a method’s behaviour.
Can aggregate criteria by weighted mean
given well justified weights.
Benchmarking without known truth
and comparison of clusterings in practice.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
6. Discussion
Clustering quality is multidimensional.
Provide multidimensional evaluation,
characterising a method’s behaviour.
Can aggregate criteria by weighted mean
given well justified weights.
Benchmarking without known truth
and comparison of clusterings in practice.
Required: standardisation to compare
different indexes and numbers of clusters.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Much of this is implemented in R-package fpc,
more will be.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
Much of this is implemented in R-package fpc,
more will be.
Soon to come:
IFCS Cluster Benchmarking Repository
(Iven Van Mechelen, Nema Dean, Isabelle Guyon,
Anne-Laure Boulesteix, Doug Steinley, Friedrich Leisch,
Christian Hennig, Rainer Dangl)
This work is supported by EPSRC Grant EP/K033972/1.
Christian Hennig Assessing the quality of a clustering
A short introduction to cluster analysis
Benchmarking and measurement of quality
Cluster quality statistics
Examples
Discussion
A bit of marketing:
Christian Hennig Assessing the quality of a clustering

More Related Content

Similar to Christian Hennig- Assessing the quality of a clustering

psikometri
psikometripsikometri
psikometriekasepta
 
Thesis-presentation: Tuenti Engineering
Thesis-presentation: Tuenti EngineeringThesis-presentation: Tuenti Engineering
Thesis-presentation: Tuenti EngineeringMarcus Ljungblad
 
Canny Edge & Image Representation.pptx
Canny Edge & Image Representation.pptxCanny Edge & Image Representation.pptx
Canny Edge & Image Representation.pptxPriyankaHemrajani2
 
Evaluation of chem lab software.
Evaluation of chem lab software.Evaluation of chem lab software.
Evaluation of chem lab software.Abir Almaqrashi
 
Image scalar hw_algorithm
Image scalar hw_algorithmImage scalar hw_algorithm
Image scalar hw_algorithmsean chen
 
HIERARCHICAL CLUSTER ANALYSIS.pptx
HIERARCHICAL CLUSTER ANALYSIS.pptxHIERARCHICAL CLUSTER ANALYSIS.pptx
HIERARCHICAL CLUSTER ANALYSIS.pptxagniva pradhan
 
Lt analysis of item test validity
Lt analysis of item test validityLt analysis of item test validity
Lt analysis of item test validitySiti Purwaningsih
 
More Reliable Delivery with Monte Carlo & Story Mapping
More Reliable Delivery with Monte Carlo & Story MappingMore Reliable Delivery with Monte Carlo & Story Mapping
More Reliable Delivery with Monte Carlo & Story MappingConal Scanlon
 
wealth age region37 50 M24 88 U14 64 A13 63 U13 66 .docx
wealth age region37 50 M24 88 U14 64 A13 63 U13 66 .docxwealth age region37 50 M24 88 U14 64 A13 63 U13 66 .docx
wealth age region37 50 M24 88 U14 64 A13 63 U13 66 .docxmelbruce90096
 
Clustering and Association Rule
Clustering and Association RuleClustering and Association Rule
Clustering and Association RuleCisco
 
Anytime Dynamic A*
Anytime Dynamic A*Anytime Dynamic A*
Anytime Dynamic A*Elben Shira
 
Estadística descriptiva 1
Estadística descriptiva 1Estadística descriptiva 1
Estadística descriptiva 1Ian Santillann
 
DSD-INT 2016 Urban water modelling - Meijer
DSD-INT 2016 Urban water modelling - MeijerDSD-INT 2016 Urban water modelling - Meijer
DSD-INT 2016 Urban water modelling - MeijerDeltares
 
CTET 20th September 2015 Paper-2 Answer Key by Success Mantra
CTET 20th September 2015 Paper-2 Answer Key by Success MantraCTET 20th September 2015 Paper-2 Answer Key by Success Mantra
CTET 20th September 2015 Paper-2 Answer Key by Success MantraSuccessMantraInstitute
 

Similar to Christian Hennig- Assessing the quality of a clustering (20)

Accelerate performance
Accelerate performanceAccelerate performance
Accelerate performance
 
Tes Reliabilitas
Tes ReliabilitasTes Reliabilitas
Tes Reliabilitas
 
psikometri
psikometripsikometri
psikometri
 
Thesis-presentation: Tuenti Engineering
Thesis-presentation: Tuenti EngineeringThesis-presentation: Tuenti Engineering
Thesis-presentation: Tuenti Engineering
 
Canny Edge & Image Representation.pptx
Canny Edge & Image Representation.pptxCanny Edge & Image Representation.pptx
Canny Edge & Image Representation.pptx
 
Evaluation of chem lab software.
Evaluation of chem lab software.Evaluation of chem lab software.
Evaluation of chem lab software.
 
Image scalar hw_algorithm
Image scalar hw_algorithmImage scalar hw_algorithm
Image scalar hw_algorithm
 
HIERARCHICAL CLUSTER ANALYSIS.pptx
HIERARCHICAL CLUSTER ANALYSIS.pptxHIERARCHICAL CLUSTER ANALYSIS.pptx
HIERARCHICAL CLUSTER ANALYSIS.pptx
 
4
44
4
 
Lt analysis of item test validity
Lt analysis of item test validityLt analysis of item test validity
Lt analysis of item test validity
 
More Reliable Delivery with Monte Carlo & Story Mapping
More Reliable Delivery with Monte Carlo & Story MappingMore Reliable Delivery with Monte Carlo & Story Mapping
More Reliable Delivery with Monte Carlo & Story Mapping
 
AIPMT-2014-Answer-Key
AIPMT-2014-Answer-KeyAIPMT-2014-Answer-Key
AIPMT-2014-Answer-Key
 
Swot
SwotSwot
Swot
 
wealth age region37 50 M24 88 U14 64 A13 63 U13 66 .docx
wealth age region37 50 M24 88 U14 64 A13 63 U13 66 .docxwealth age region37 50 M24 88 U14 64 A13 63 U13 66 .docx
wealth age region37 50 M24 88 U14 64 A13 63 U13 66 .docx
 
Clustering and Association Rule
Clustering and Association RuleClustering and Association Rule
Clustering and Association Rule
 
Dr. azad hama sharif
Dr. azad hama sharifDr. azad hama sharif
Dr. azad hama sharif
 
Anytime Dynamic A*
Anytime Dynamic A*Anytime Dynamic A*
Anytime Dynamic A*
 
Estadística descriptiva 1
Estadística descriptiva 1Estadística descriptiva 1
Estadística descriptiva 1
 
DSD-INT 2016 Urban water modelling - Meijer
DSD-INT 2016 Urban water modelling - MeijerDSD-INT 2016 Urban water modelling - Meijer
DSD-INT 2016 Urban water modelling - Meijer
 
CTET 20th September 2015 Paper-2 Answer Key by Success Mantra
CTET 20th September 2015 Paper-2 Answer Key by Success MantraCTET 20th September 2015 Paper-2 Answer Key by Success Mantra
CTET 20th September 2015 Paper-2 Answer Key by Success Mantra
 

More from PyData

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...PyData
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshPyData
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiPyData
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...PyData
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerPyData
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaPyData
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...PyData
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroPyData
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...PyData
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottPyData
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroPyData
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...PyData
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPyData
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...PyData
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydPyData
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverPyData
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldPyData
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...PyData
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardPyData
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...PyData
 

More from PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 

Recently uploaded

INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 

Recently uploaded (20)

INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 

Christian Hennig- Assessing the quality of a clustering

  • 1. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Assessing the quality of a clustering Christian Hennig Christian Hennig Assessing the quality of a clustering
  • 2. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Cluster analysis methods 1. A short introduction to cluster analysis Cluster analysis is about finding groups in data. var 1 −0.4 −0.2 0.0 0.2 0.4 3 3 3333 333 333 3333 333 3 33 3 3333 33 3 3 3333 4 5 4 4 4 544444 4444444444444 4 5 5 55 5 55 55 5 5 5 5 5 5 5 9 9995 58888 8 88 9 888 6 6 6 66 6 66666 6 6 11111111111 1 11111111 1 111 1 1111 6 111111111111111 1 111111 111111 111 6 1 9 999 2222 2 2 2 222 2 2 22 2 22 2 2 22 2 2 222 2 22 2 22 2 2 22 2 22 222 2 2 2 2 2 2 7 7 777 7777 7 777 7 7 3 3 3 333 333333 3333 333 3 33 3 3333 33 3 3 3333 4 5 4 4 454 444 4 4 4444 44 44 44 44 4 5 5 55 5 5 55 5 5 55 5 5 5 5 9 9995 5 8888 8 88 9 888 6 6 6 6 6 6 666 666 6 11111 11 111 1 1 11111 111 1 11 1 1 111 1 6 1 11111 11 11 1111 1 1 1 11 11 1 11 11 11 111 6 1 9 999 22 22 2 2 2 22 2 2 2 2 22 22 2 2 22 2 2222 2222 222 2 22 2 2 2 2 22 2 2 2 2 2 27 7 7 77 7777 7 777 7 7 −0.4 −0.2 0.0 0.2 −0.4−0.20.00.20.4 3 3 3 333 33 33 33 3333 33 3 3 33 3 33 33 33 3 3 3333 4 5 4 4 45 4 4444 444444 444 44 44 4 5 5 55 5 55 5 5 5 5 5 5 5 5 5 9 999 5 5888 8 8 88 9 888 6 6 6 66 6 66 666 6 6 1111111 11 11 1 11 11 1111 1 111 1 11 11 6 1111 111 111 11 11 1 1 111111 111111 111 6 1 9 999 2 222 2 2 2 22 2 2 2 22 2 2 2 2 2 22 22222 2 22 2 2 22 2 22 2 22 22 2 2 2 2 2 2 27 7 77 7 77 77 7 77 7 7 7 −0.4−0.20.00.20.4 3 3 3333333 3 3 3 333333 3 3 333 33 3 3333 333 3 3 45 4 44 5 44444 44 4444444444 4 45 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 9999 55 8 888 8 8 89 888 6 6 6 66 6 666 66 66 111111111111 1111111 1 1111 1 11 11 6 1111 11 111111 1111 111111 111111 111 6 1 9 9 99 22 2 2 222 2222 2 2 2 2 2222 22 2 2 2 2 22 2 2 2 22 2 2 2 2 2 22 2 22 2 2 2 2 22 77 7 7 7 77 77 7 7 77 7 7 var 2 33 3 333333 3 3 3 333333 3 3333 33 3 3333 333 3 3 45 4 44 5 4 444 44 4 444 44 44 44 4 4 45 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 9999 55 8 888 8 8 89888 6 6 6 6 6 6 666 66 6 6 11111 11 111 11 11111 11 11 11 1 1 11 1 1 6 1 11111 11 11 11 11 11 1 11 11 1 11 11 11 111 6 1 9 9 99 22 2 222222 22 2 2 2 2 222222 2 2 2 2 22 2 2 2 22 2 2 2 2 2 2 22 22 2 2 2 2 2 2 77 7 7 7 77 77 7 7 777 7 33 3 33333 3 3 3 3 3333 33 3 3333 33 3 3333 333 3 3 45 4 44 5 4 444444 4444 444 44 4 4 45 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 9999 55 8 88 8 8 8 8 9888 6 6 6 66 6 66 6 66 66 1111111 11 111 11 11 111 11 111 1 11 11 6 1111 11 1 111 11 11 11 111111 111111 111 6 1 9 9 99 2 2 2 2 2 22 22 22 2 2 2 22 222 22 2 2 2 2 22 2 2 2 2 2 2 2 2 22 22 2 2 2 22 2 2 2 2 7 7 7 7 7 77 77 7 7 7 77 7 3 33 33 3 3 3333 3 33333 33 3 33 3 33333 33 3333 3 4 5 4 4454 4 44 4 4 44 4 4 44 4 4 44 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 55 5 5 99 9 9 5 5 8 88888 89 888 6 6 6 6 6 6 6 66 6 6 6 6 11 1 1 1 11 1 1 1 11 1 1 111 1 11 1 11 1 1 111 1 61 1 11 1 1 11 11 11 1 1 1 1 1 1 1 1 1 1 11 1 1 11 111 6 1 9 9 9 9 2 2 22 222 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 22 22222 222222 2 2 2 2 2 22 2 2 2 2 2 7 7 7 7 7 7 777 7 7 77 77 3 33 33 3 3 33 33 3 33333 33 333 3 33333 333333 34 5 4 44 54 4 44 4 4 44 4 4 44 4 4 44 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 55 5 5 99 9 9 5 5 8 88888 89888 6 66 6 6 6 6 66 6 6 6 6 11 1 1 1 11 1 1 1 11 1 1 111 1 11 1 11 1 1 111 1 61 1 11 1 1 11 11 11 1 1 1 1 1 1 1 1 1 1 11 1 1 11 111 6 1 9 9 9 9 2 2 222222 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 22 22 22 2 22 22 22 2 2 22 2 22 2 2 2 2 2 7 7 7 7 7 7 777 7 7 77 7 7 var 3 −0.20.00.20.4 3 33 33 3 3 3 33 3 3 3333 3 3 3 333 3 33 333 33 3333 3 4 5 4 445 4 4 44 4 4 44 4 4 4 4 4 4 44 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 55 99 9 9 5 5 8 88 888 8 9888 66 6 6 6 6 6 6 6 6 6 6 6 11 1 1 1 11 1 1 1 11 1 1 11 1 1 11 1 11 1 1 11 1 161 1 11 1 1 1 1 11 11 1 1 1 1 1 1 1 1 1 1 11 1 1 11 111 6 1 9 9 9 9 2 2 22 2 22 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 22 22 22 2 2 22222 2 2 2 2 2 22 2 2 2 2 2 7 7 7 7 7 7 7 77 7 7 7 7 77 −0.4 −0.2 0.0 0.2 0.4 −0.4−0.20.00.2 3 3 3 33 3 33 3 3 3 3 3 33 3 3 3 3 3 33 3 3 3 3 333 3 33 333 4 5 4 44 5 4 4 444 444 44 4 4 4 4 44 4 4 4 5 5 5 5 5 55 5 5 5 5 5 5 5 5 5 9 99 9 5 5 88 8 88 8 8 9 888 6 6 6 6 6 6 66 6 6 6 66 11 11111 1 1 1 1 1 1 1 11 1 11 1 1 11 1 1 1 1 11 6 1111 11 1 1 1 1 1 1 11 11 1 1 1 1 1111111 1 1 11 6 1 9 9 99 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 22 2 2 2 222 22 2 2 2 2 2 22222 2 2 2 2 2 2 2 22 2 2 2 7 777 7 7 7 7 7 777 7 7 7 33 3 33 3 33 3 3 3 3 3 33 3 3 3 3 333 3 3 3 3 333 3 33 333 4 5 4 44 5 4 4 444444 44 4 4 4 4 44 4 4 4 5 5 5 5 5 55 5 5 55 5 5 5 5 5 9 99 9 5 5 88 8 88 8 8 9 888 6 6 6 6 6 6 66 6 6 6 66 11 11111 1 1 1 1 1 1 1 11 1 11 1 1 11 1 1 1 1 11 6 1111 11 1 1 1 1 1 1 11 11 1 1 1 1 1111111 1 1 11 6 1 9 9 99 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 22 2 2 2 2 2 2 22 2 2 2 2 2 2 22 222 2 2 2 2 2 2222 2 2 7 7 77 7 7 7 7 7 777 77 7 −0.2 0.0 0.2 0.4 33 3 33 3 33 3 3 3 3 3 33 3 3 3 3 333 3 3 3 3 333 3 33 333 4 5 4 44 5 4 4 44 44 44 44 4 4 4 4 44 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 9 99 9 5 5 88 8 88 8 8 9 888 6 6 6 6 6 6 66 6 6 6 6 6 11 11 1 11 1 1 1 1 1 1 1 11 1 11 1 1 11 1 1 1 1 1 16 1 111 11 1 1 1 1 1 1 11 11 1 1 1 1 1 111 11 1 1 1 11 6 1 9 9 99 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 22 2 2 2 2 22 22 2 2 2 2 2 22222 2 2 2 2 2 2 2 22 2 2 2 7 77 7 7 7 7 7 7 7 77 77 7 var 4 Christian Hennig Assessing the quality of a clustering
  • 3. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Cluster analysis methods 1.1 Cluster analysis methods 1.1.1 k-means (Fix & Hodges 1951) n i=1 xi − ¯xC(i) 2 = min! Christian Hennig Assessing the quality of a clustering
  • 4. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Cluster analysis methods 1.1 Cluster analysis methods 1.1.1 k-means (Fix & Hodges 1951) n i=1 xi − ¯xC(i) 2 = min! represents all objects by centroid, “compact” clusters. Christian Hennig Assessing the quality of a clustering
  • 5. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Cluster analysis methods 1.1 Cluster analysis methods 1.1.1 k-means (Fix & Hodges 1951) n i=1 xi − ¯xC(i) 2 = min! represents all objects by centroid, “compact” clusters. Version: Don’t square, other centroids than mean (“pam”). Christian Hennig Assessing the quality of a clustering
  • 6. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Cluster analysis methods 8 8 8888 8 8 8 8 8 8 888 88 8 8 8 88 8 8 8 8 8 8 88 8 8 8 8 8 71 7 7 7 7 7 7 7 7 7 77 77 7 7 7 77 7 7 7 7 7 1 7 1 1 9 9 1 9 1 9 1 1 9 9 1 9 44 44 4 4 9 9 9 9 9 9 92 99 9 2 2 2 2 2 2 2 2 2 2 2 2 2 55 5 5 5 55 5 5 55 5 5 5 55 55 5 5 5 55 5 5 5 5 55 3 5 5 55 5 5 5 5 5 5 5 5 555 5 55 5 5 55 5 5555 5 555 3 5 2 2 22 6 6 3 6 226 666 3 2 6 6 6 2 2 66 6 6 6 6 3 6 66 3 6 3 6 3 3 6 3 6 6 3 6 6 6 6 3 6 3 2 3 6 3 3 3 3 6 3 3 33 3 3 3 3 3 3 −0.4 −0.2 0.0 0.2 0.4 −0.4−0.20.00.20.4 MDS 1 MDS2 Christian Hennig Assessing the quality of a clustering
  • 7. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Cluster analysis methods 1.1.2 Gaussian mixture model (Pearson 1894) f(x) = k j=1 πjϕaj ,Σj (x). Clusters are described by Gaussian distributions. Elliptical clusters, flexible size and shape. Christian Hennig Assessing the quality of a clustering
  • 8. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Cluster analysis methods 3 3 3333 3 33 3 3 3 33333 3 3 3 333 3 3 3 33 33 3 3 3 3 3 45 4 4 4 5 4 4 4 4 4 44 44 4 44 44 4 44 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 99 99 5 5 8 8 88 8 8 89 888 6 6 6 6 6 6 6 6 6 6 6 6 6 111 1 1 11 1 1 11 1 1 1 1111 1 1 1 11 1 1 11 11 6 1111 1 1 1 11 11 1 111 1 111 1 11 1 11111 111 6 1 9 9 99 2 2 2 2 222 2222 2 2 2 2 2 2 22 2 2 2 2 2 2 22 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 7 7 7 7 7 7 7 77 7 7 7 7 7 7 −0.4 −0.2 0.0 0.2 0.4 −0.4−0.20.00.20.4 MDS 1 MDS2 Christian Hennig Assessing the quality of a clustering
  • 9. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Cluster analysis methods 1.1.3 Classical hierarchical methods Operate on dissimilarity matrices; compute dissimilarity measure for every pair of observations. Can use Euclidean distance, but also tailor-made distances for other data formats. Christian Hennig Assessing the quality of a clustering
  • 10. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Cluster analysis methods 1.1.3 Classical hierarchical methods Operate on dissimilarity matrices; compute dissimilarity measure for every pair of observations. Can use Euclidean distance, but also tailor-made distances for other data formats. “Cluster”: a collection of similar objects, dissimilar to the others. Christian Hennig Assessing the quality of a clustering
  • 11. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Cluster analysis methods Genetic data: 236 Tetragonula bees, 13 allele pairs [,1] [,2] [,3] [,4] [,5] [,6] (...) [1,] "NO" "AA" "PP" "HH" "EH" "FF" [2,] "EO" "AA" "PP" "HH" "GH" "FF" [3,] "NQ" "AA" "PT" "HH" "GF" "EF" [4,] "OO" "AA" "PP" "GH" "GH" "EF" [5,] "OO" "AA" "PP" "GH" "GH" "EF" [6,] "LN" "AA" "PP" "HH" "EG" "FE" (...) Compute “shared allele distance”. Christian Hennig Assessing the quality of a clustering
  • 12. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Cluster analysis methods [,1] [,2] [,3] [,4] [,5] [1,] 0.00 0.21 0.33 0.29 0.25 [2,] 0.21 0.00 0.33 0.25 0.21 [3,] 0.33 0.33 0.00 0.29 0.33 (...) [4,] 0.29 0.25 0.29 0.00 0.08 [5,] 0.25 0.21 0.33 0.08 0.00 (...) Dataset seen before is a Euclidean approximation (“MDS”) of this. Christian Hennig Assessing the quality of a clustering
  • 13. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Cluster analysis methods 1.1.3 Classical hierarchical methods Operate on dissimilarities and produce hierarchical trees (originally motivated by biological classification). Differ in definition of “dissimilarity between clusters”. 818280797778676675737668706361624172716469377465465654534751455955525048385749604358403942364432302826231816129198453353172917206111342433222710251415213219085848393878886899192170172173961041051069910398971021001019495171198182177168234220208206199205216210204185197194191190189219218209217215207214202175188183181178200179193176196213212180187192174221195211203186184201169167136133117151166145116165156142131110146155149144143132157128134125152158124154147129161163160153162150159140137126119122135121118111127120112130109115113164139123141108107114138148229233235231226236228222230224232227223225 0.00.20.40.6 Cluster Dendrogram hclust (*, "single") as.dist(tai$distmat) Height Christian Hennig Assessing the quality of a clustering
  • 14. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Cluster analysis methods Single Linkage: (Florek and Perkal 1951) ˜d(A, B) = min a∈A,b∈B d(a, b) Complete Linkage: ˜d(A, B) = max a∈A,b∈B d(a, b) Average Linkage: ˜d(A, B) = avea∈A,b∈Bd(a, b) These can deliver quite different clusterings. (Complete L. very compact, Single L. separated but maybe widespread) Christian Hennig Assessing the quality of a clustering
  • 15. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Cluster analysis methods 1.1.4 Spectral clustering (Shi and Malik 2000) Dissimilarity-based nonlinear dimension reduction for k-means. Christian Hennig Assessing the quality of a clustering
  • 16. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Cluster analysis methods 1 1 1111 1 1 1 1 1 1 111 11 1 1 1 11 1 1 1 1 1 1 11 1 1 1 1 1 62 6 6 6 2 6 6 6 6 6 66 66 6 6 6 66 6 6 6 6 6 2 2 2 2 6 2 2 2 2 2 2 2 2 2 2 2 55 55 2 2 3 3 3 3 3 3 35 33 3 7 7 7 7 7 7 7 7 7 7 7 7 7 44 4 4 4 44 4 4 44 4 4 4 44 44 4 4 4 44 4 4 4 4 44 4 4 4 44 4 4 4 4 4 4 4 4 444 4 44 4 4 44 4 4444 4 444 4 4 5 5 55 7 7 7 7 777 777 7 7 7 7 7 7 7 77 7 7 7 7 7 7 77 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 77 7 7 7 7 7 7 −0.4 −0.2 0.0 0.2 0.4 −0.4−0.20.00.20.4 ctai$points[,1] ctai$points[,2] Christian Hennig Assessing the quality of a clustering
  • 17. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Cluster analysis methods 1.1.5 Density-based methods such as “DBSCAN” (Ester et al. 1996), joins observations with all neighbouring points, and neighbourhoods if they share enough points. Christian Hennig Assessing the quality of a clustering
  • 18. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Cluster analysis methods 1 1 1111 1 1 1 1 1 1 111 11 1 1 1 11 1 1 1 1 1 1 11 1 1 1 1 1 2N 2 2 2 N 2 2 2 2 2 22 22 2 2 2 22 2 2 2 2 2 N N N N N N N N N N N N N N N N NN NN N N 3 3 3 3 3 3 3N 33 3 N N 4 4 4 N 4 4 4 N N 4 4 55 5 5 5 55 5 5 55 5 5 5 55 55 5 5 5 55 5 5 5 5 55 N 5 5 55 5 5 5 5 5 5 5 5 555 5 55 5 5 55 5 5555 5 555 N 5 N N NN 6 6 6 6 666 666 6 N 6 6 6 6 6 66 6 6 6 6 6 6 66 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 N 7 7 77 7 7 7 7 7 7 −0.4 −0.2 0.0 0.2 0.4 −0.4−0.20.00.20.4 ctai$points[,1] ctai$points[,2] Christian Hennig Assessing the quality of a clustering
  • 19. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Cluster analysis methods 1.1.6 Further issues in cluster analysis Number of clusters Cluster validation Dissimilarity definition Choice of method Christian Hennig Assessing the quality of a clustering
  • 20. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims 2. Benchmarking and measurement of quality Which clustering is better? (Old faithful geyser data) −2 −1 0 1 2 −2−101 mclust waiting duration −2 −1 0 1 2 −2−101 pam waiting duration Christian Hennig Assessing the quality of a clustering
  • 21. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims Which clustering is better? −10 0 10 20 30 010203040 xy[,1] xy[,2] −10 0 10 20 30 010203040 xy[,1] xy[,2] Christian Hennig Assessing the quality of a clustering
  • 22. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims Benchmarking approaches: Real datasets with known classes Simulated datasets from mixture distributions Real datasets without known classes With known truth can compute misclassification rates. Christian Hennig Assessing the quality of a clustering
  • 23. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims Disadvantages of benchmarking with known truth In datasets with known classes clustering is not of real scientific interest. Deviate systematically from real clustering problems. Christian Hennig Assessing the quality of a clustering
  • 24. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims Disadvantages of benchmarking with known truth In datasets with known classes clustering is not of real scientific interest. Deviate systematically from real clustering problems. The fact that we know certain true classes doesn’t preclude other legitimate/”true” clusterings. Christian Hennig Assessing the quality of a clustering
  • 25. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims Disadvantages of benchmarking with known truth In datasets with known classes clustering is not of real scientific interest. Deviate systematically from real clustering problems. The fact that we know certain true classes doesn’t preclude other legitimate/”true” clusterings. Classes in supervised classification problems may not qualify as data analytic clusters. Christian Hennig Assessing the quality of a clustering
  • 26. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims Disadvantages of benchmarking with known truth In datasets with known classes clustering is not of real scientific interest. Deviate systematically from real clustering problems. The fact that we know certain true classes doesn’t preclude other legitimate/”true” clusterings. Classes in supervised classification problems may not qualify as data analytic clusters. So there could be better truths than the known one. Christian Hennig Assessing the quality of a clustering
  • 27. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims How true are the true given classes? (Hennig and Liao 2013, social stratification data) Christian Hennig Assessing the quality of a clustering
  • 28. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims 7 standard occupation classes such as “manual workers”, “managerials and professionals”, “not working” Christian Hennig Assessing the quality of a clustering
  • 29. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims These are not “data analytic clusters”. Christian Hennig Assessing the quality of a clustering
  • 30. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims Mixture components aren’t always “data analytic clusters” either. 55 3 54 3 54 5 5 4 4 5 4 3 5 4 5 5 4 3 534 4 4 4 4 4 4 5 54 5 4 5 4 4 4 5 5 5 4 54 4 5 5 4 5 4 5 5 5 4 4 4 4 5 4 5 5 55 5 4 4 4 4 4 5 5 5 4 3 4 1 4 4 3 4 4 5 4 1 4 54 5 5 3 5 54 5 1 4 4 4 4 4 5 4 3 4 5 4 55 5 4 5 4 5 3 5 4 4 4 5 3 5 4 5 34 4 5 5 5 4 3 4 5 5 55 4 5 4 54 4 5 4 4 4 4 4 5 4 5 3 5 5 3 4 5 5 4 5 4 5 5 54 4 4 4 4 5 5 4 54 5 5 4 4 5 4 5 5 5 5 4 4 4 5 5 5 5 4 4 4 5 4 4 2 5 4 3 5 4 4 5 4 54 5 4 4 4 4 5 4 5 5 5 4 4 5 4 5 54 4 4 5 4 5 5 4 5 5 5 5 4 4 5 3 5 5 54 5 4 4 4 4 35 5 5 4 5 4 4 5 3 4 5 5 4 4 4 5 4 4 4 5 5 4 54 5 44 5 4 5 3 4 4 3 3 4 4 55 4 5 4 4 4 5 5 4 5 5 555 4 5 4 4 5 5 3 5 4 4 5 5 4 4 5 3 4 55 4 54 4 4 45 3 4 5 3 5 5 4 4 3 4 2 5 4 54 4 4 4 4 4 4 2 5 4 4 4 5 5 4 4 5 5 5 4 5 5 5 55 4 5 5 5 4 54 5 4 4 554 4 551 4 5 4 5 2 33 4 4 45 54 4 5 1 5 44 4 4 4 4 54 4 3 4 4 4 4 5 3 5 554 4 44 5 4 4 5 5 4 5 4 4 5 4 4 5 5 5 2 3 5 34 5 4 5 3 4 4 5 5 1 4 5 4 4 5 4 54 54 5 4 4 4 4 55 4 5 5 54 4 5 4 5 5 5 2 4 3 4 4 4 5 5 5 5 5 4 4 5 4 4 3 5 0 20 40 60 80 100 −20−10010203040 xdata$x[,1] xdata$x[,2] Christian Hennig Assessing the quality of a clustering
  • 31. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims Using a known truth is useful and fair enough but also want to evaluate clusterings on data for which truth is not known. Christian Hennig Assessing the quality of a clustering
  • 32. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims There is a range of cluster validation indexes measuring clustering quality, such as Average silhouette width (ASW) (Kaufman and Rouseeuw 1990) sw(i, C) = b(i,C)−a(i,C) max(a(i,C),b(i,C)), a(i, C) = 1 |Cj| − 1 x∈Cj d(xi, x), b(i, C) = min xi ∈Cl 1 |Cl| x∈Cl d(xi, x). Maximum average sw ⇒ good C. Christian Hennig Assessing the quality of a clustering
  • 33. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims “One size fits it all”-approach. Christian Hennig Assessing the quality of a clustering
  • 34. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims “One size fits it all”-approach. Homogeneity will normally dominate here: −10 0 10 20 30 010203040 xy[,1] xy[,2] −10 0 10 20 30 010203040 xy[,1] xy[,2] Christian Hennig Assessing the quality of a clustering
  • 35. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims My general philosophy There are various different aims of clustering. Measure them separately to characterise what a method does best, instead of producing a single ranking. Christian Hennig Assessing the quality of a clustering
  • 36. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims Typical clustering aims Between-cluster separation Christian Hennig Assessing the quality of a clustering
  • 37. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims Typical clustering aims Between-cluster separation Within-cluster homogeneity (low distances) Christian Hennig Assessing the quality of a clustering
  • 38. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims Typical clustering aims Between-cluster separation Within-cluster homogeneity (low distances) Within-cluster homogeneous distributional shape Christian Hennig Assessing the quality of a clustering
  • 39. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims Typical clustering aims Between-cluster separation Within-cluster homogeneity (low distances) Within-cluster homogeneous distributional shape Good representation of data by centroids Christian Hennig Assessing the quality of a clustering
  • 40. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims Typical clustering aims Between-cluster separation Within-cluster homogeneity (low distances) Within-cluster homogeneous distributional shape Good representation of data by centroids Little loss of information from original distance between objects. Christian Hennig Assessing the quality of a clustering
  • 41. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims Typical clustering aims Between-cluster separation Within-cluster homogeneity (low distances) Within-cluster homogeneous distributional shape Good representation of data by centroids Little loss of information from original distance between objects. Clusters are regions of high density without within-cluster gaps Christian Hennig Assessing the quality of a clustering
  • 42. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims Typical clustering aims Between-cluster separation Within-cluster homogeneity (low distances) Within-cluster homogeneous distributional shape Good representation of data by centroids Little loss of information from original distance between objects. Clusters are regions of high density without within-cluster gaps Uniform cluster sizes Christian Hennig Assessing the quality of a clustering
  • 43. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims Typical clustering aims Between-cluster separation Within-cluster homogeneity (low distances) Within-cluster homogeneous distributional shape Good representation of data by centroids Little loss of information from original distance between objects. Clusters are regions of high density without within-cluster gaps Uniform cluster sizes Stability Christian Hennig Assessing the quality of a clustering
  • 44. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims These may be in conflict with each other. −10 0 10 20 30 010203040 xy[,1] xy[,2] −10 0 10 20 30 010203040 xy[,1] xy[,2] Christian Hennig Assessing the quality of a clustering
  • 45. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims E.g., pattern recognition in images requires separation, Christian Hennig Assessing the quality of a clustering
  • 46. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims E.g., pattern recognition in images requires separation, clustering for information reduction requires good representation by centroids, Christian Hennig Assessing the quality of a clustering
  • 47. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims E.g., pattern recognition in images requires separation, clustering for information reduction requires good representation by centroids, groups in social network analysis shouldn’t have large within-cluster gaps, Christian Hennig Assessing the quality of a clustering
  • 48. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Which clustering is better? Benchmarking approaches Cluster validation indexes My general philosophy Typical clustering aims E.g., pattern recognition in images requires separation, clustering for information reduction requires good representation by centroids, groups in social network analysis shouldn’t have large within-cluster gaps, underlying “true” classes (biological species) may cause homogeneous distributional shapes. Christian Hennig Assessing the quality of a clustering
  • 49. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics 3. Cluster quality statistics Measuring between-cluster separation ∃ several ways measuring separation (as for other aims). Straightforward: min distance between any two clusters, or distance between centroids (e.g., k-means). Christian Hennig Assessing the quality of a clustering
  • 50. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics −2 −1 0 1 2 −2−101 waiting duration Christian Hennig Assessing the quality of a clustering
  • 51. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics −2 −1 0 1 2 −2−101 waiting duration M M Christian Hennig Assessing the quality of a clustering
  • 52. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics Measuring between-cluster separation ∃ several ways measuring separation (as for other aims). Straightforward: min distance between any two clusters, or distance between centroids (e.g., k-means). These measure quite different concepts of separation. (min distance relies on only two points; centroid distance ignores what goes on at border.) Christian Hennig Assessing the quality of a clustering
  • 53. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics p-separation index: More stable version of “min distance”: Average distance to nearest point in different cluster for p = 10% “border” points in any cluster. −2 −1 0 1 2 −2−101 waiting duration X X X X X X X XX X X XX X X X X X X X X X X X X X X X X X X Christian Hennig Assessing the quality of a clustering
  • 54. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics Measuring “density mountains vs. valleys” Index that measures whether clusters correspond to “density mountains”, and whether “valleys” are between clusters. Note: This is current research and may be revised. Christian Hennig Assessing the quality of a clustering
  • 55. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics Two aspects: (a) Density goes down from mode; no gaps and valleys within clusters. (b) Cluster borders are valleys; they don’t run through mountains. Estimate density by weighted count of close points (“kernel density”). 0.00.51.01.52.0 x k(x) 10% quantile of within−cluster distances Christian Hennig Assessing the quality of a clustering
  • 56. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics −2 −1 0 1 2 −2−101 sinlink g= 2 waiting duration Christian Hennig Assessing the quality of a clustering
  • 57. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics Start from cluster modes −0.5 0.0 0.5 1.0 1.5 2.0 −1.8−1.6−1.4−1.2−1.0−0.8 sinlink g= 2 Step 1 waiting duration X Christian Hennig Assessing the quality of a clustering
  • 58. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics Connect closest point to cluster −0.5 0.0 0.5 1.0 1.5 2.0 −1.8−1.6−1.4−1.2−1.0−0.8 sinlink g= 2 Step 3 waiting duration X Christian Hennig Assessing the quality of a clustering
  • 59. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics As long as density goes down, no penalty −0.5 0.0 0.5 1.0 1.5 2.0 −1.8−1.6−1.4−1.2−1.0−0.8 sinlink g= 2 Step 6 waiting duration X Christian Hennig Assessing the quality of a clustering
  • 60. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics Penalty for density increase −2 −1 0 1 2 −2−101 sinlink g= 2 Step 98 waiting duration X Christian Hennig Assessing the quality of a clustering
  • 61. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics −2 −1 0 1 2 −2−101 sinlink g= 2 Step 99 waiting duration X Christian Hennig Assessing the quality of a clustering
  • 62. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics −2 −1 0 1 2 −2−101 sinlink g= 2 Step 100 waiting duration X Christian Hennig Assessing the quality of a clustering
  • 63. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics −2 −1 0 1 2 −2−101 sinlink g= 2 Step 101 waiting duration X Christian Hennig Assessing the quality of a clustering
  • 64. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics −2 −1 0 1 2 −2−101 sinlink g= 2 Step 102 waiting duration X Christian Hennig Assessing the quality of a clustering
  • 65. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics −2 −1 0 1 2 −2−101 sinlink g= 2 Step 103 waiting duration X Christian Hennig Assessing the quality of a clustering
  • 66. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics −2 −1 0 1 2 −2−101 sinlink g= 2 Step 104 waiting duration X Christian Hennig Assessing the quality of a clustering
  • 67. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics −2 −1 0 1 2 −2−101 sinlink g= 2 Step 105 waiting duration X Christian Hennig Assessing the quality of a clustering
  • 68. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics −2 −1 0 1 2 −2−101 sinlink g= 2 Step 106 waiting duration X Christian Hennig Assessing the quality of a clustering
  • 69. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics −2 −1 0 1 2 −2−101 sinlink g= 2 Step 107 waiting duration X Christian Hennig Assessing the quality of a clustering
  • 70. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics −2 −1 0 1 2 −2−101 sinlink g= 2 Step 108 waiting duration X Christian Hennig Assessing the quality of a clustering
  • 71. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics −2 −1 0 1 2 −2−101 sinlink g= 2 Step 297 waiting duration X Christian Hennig Assessing the quality of a clustering
  • 72. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics Add penalty density∗density from other clusters −2 −1 0 1 2 −2−101 specc g= 3 waiting duration P P P P P P P P P P P P P P P P P P P PP P P Christian Hennig Assessing the quality of a clustering
  • 73. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics Other statistics Within-cluster average distance Christian Hennig Assessing the quality of a clustering
  • 74. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics Other statistics Within-cluster average distance Within-cluster similarity measure to normal/uniform Christian Hennig Assessing the quality of a clustering
  • 75. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics Other statistics Within-cluster average distance Within-cluster similarity measure to normal/uniform Within-cluster (squared) distance to centroid Christian Hennig Assessing the quality of a clustering
  • 76. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics Other statistics Within-cluster average distance Within-cluster similarity measure to normal/uniform Within-cluster (squared) distance to centroid ρ(distance, cluster induced distance) (Hubert’s Γ) Christian Hennig Assessing the quality of a clustering
  • 77. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics Other statistics Within-cluster average distance Within-cluster similarity measure to normal/uniform Within-cluster (squared) distance to centroid ρ(distance, cluster induced distance) (Hubert’s Γ) Entropy of cluster sizes Christian Hennig Assessing the quality of a clustering
  • 78. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics Other statistics Within-cluster average distance Within-cluster similarity measure to normal/uniform Within-cluster (squared) distance to centroid ρ(distance, cluster induced distance) (Hubert’s Γ) Entropy of cluster sizes Average largest within-cluster gap Christian Hennig Assessing the quality of a clustering
  • 79. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics Other statistics Within-cluster average distance Within-cluster similarity measure to normal/uniform Within-cluster (squared) distance to centroid ρ(distance, cluster induced distance) (Hubert’s Γ) Entropy of cluster sizes Average largest within-cluster gap Variation of clusterings on bootstrapped data Christian Hennig Assessing the quality of a clustering
  • 80. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Measuring between-cluster separation Measuring “density mountains vs. valleys” Other statistics Other statistics Within-cluster average distance Within-cluster similarity measure to normal/uniform Within-cluster (squared) distance to centroid ρ(distance, cluster induced distance) (Hubert’s Γ) Entropy of cluster sizes Average largest within-cluster gap Variation of clusterings on bootstrapped data Standardise all indexes to [0, 1] so that “large is good”. Christian Hennig Assessing the quality of a clustering
  • 81. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion 4. Examples −10 0 10 20 30 010203040 xy[,1] xy[,2] −10 0 10 20 30 010203040 xy[,1] xy[,2] 3-means mclust-3 ave within 0.811 0.643 sep index 0.163 0.306 density index 0.977 0.978 within gap 0.927 0.949 Christian Hennig Assessing the quality of a clustering
  • 82. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion −2 −1 0 1 2 −2−101 mclust waiting duration −2 −1 0 1 2 −2−101 pam waitingduration −2 −1 0 1 2 −2−101 spectral waiting duration −2 −1 0 1 2 −2−101 ave.linkage waiting duration −2 −1 0 1 2 −2−101 single linkage waiting duration −2 −1 0 1 2 −2−101 pdfCluster (3) waiting duration Christian Hennig Assessing the quality of a clustering
  • 83. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion mclust pam spect ave.l sing.l pdf3 ave within 0.71 0.95 0.82 0.90 0.04 0.98 sep index 0.98 0.30 0.94 0.60 0.99 0.78 density 0.99 0.44 0.70 0.63 0.59 0.99 gap 0.14 0.46 0.46 0.46 0.99 0.48 gamma 0.81 0.91 0.92 0.96 0.06 0.98 normality 0.69 0.44 0.45 0.48 0.11 0.52 Note: These values are quantile-standardised, implementation of this in fpc is still to come. Christian Hennig Assessing the quality of a clustering
  • 84. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion 0.00.20.40.60.81.0 Number of clusters dindex 2 3 4 5 kmeans kmeans kmeans kmeans avelink avelink avelink avelink sinlink sinlink sinlink sinlink comlink comlink comlink comlink mclust mclust mclust mclust pam pam pam pam specc specc specc specc pdfclus Quantile−calibrated density index Christian Hennig Assessing the quality of a clustering
  • 85. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion 6. Discussion Clustering quality is multidimensional. Christian Hennig Assessing the quality of a clustering
  • 86. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion 6. Discussion Clustering quality is multidimensional. Provide multidimensional evaluation, characterising a method’s behaviour. Christian Hennig Assessing the quality of a clustering
  • 87. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion 6. Discussion Clustering quality is multidimensional. Provide multidimensional evaluation, characterising a method’s behaviour. Can aggregate criteria by weighted mean given well justified weights. Christian Hennig Assessing the quality of a clustering
  • 88. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion 6. Discussion Clustering quality is multidimensional. Provide multidimensional evaluation, characterising a method’s behaviour. Can aggregate criteria by weighted mean given well justified weights. Benchmarking without known truth and comparison of clusterings in practice. Christian Hennig Assessing the quality of a clustering
  • 89. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion 6. Discussion Clustering quality is multidimensional. Provide multidimensional evaluation, characterising a method’s behaviour. Can aggregate criteria by weighted mean given well justified weights. Benchmarking without known truth and comparison of clusterings in practice. Required: standardisation to compare different indexes and numbers of clusters. Christian Hennig Assessing the quality of a clustering
  • 90. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Much of this is implemented in R-package fpc, more will be. Christian Hennig Assessing the quality of a clustering
  • 91. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion Much of this is implemented in R-package fpc, more will be. Soon to come: IFCS Cluster Benchmarking Repository (Iven Van Mechelen, Nema Dean, Isabelle Guyon, Anne-Laure Boulesteix, Doug Steinley, Friedrich Leisch, Christian Hennig, Rainer Dangl) This work is supported by EPSRC Grant EP/K033972/1. Christian Hennig Assessing the quality of a clustering
  • 92. A short introduction to cluster analysis Benchmarking and measurement of quality Cluster quality statistics Examples Discussion A bit of marketing: Christian Hennig Assessing the quality of a clustering