
Christian Hennig- Assessing the quality of a clustering


  1. Assessing the quality of a clustering. Christian Hennig. Outline: A short introduction to cluster analysis; Benchmarking and measurement of quality; Cluster quality statistics; Examples; Discussion.
  2. Section 1: A short introduction to cluster analysis. Cluster analysis is about finding groups in data. [Figure: scatterplot matrix of var 1 to var 4; points are labelled by cluster number, 1 to 9.]
  3-5. 1.1 Cluster analysis methods. 1.1.1 k-means (Fix & Hodges 1951): $\sum_{i=1}^{n} \|x_i - \bar{x}_{C(i)}\|^2 = \min!$ Represents all objects by their cluster centroid; yields “compact” clusters. Variant: don’t square the distances and use centroids other than the mean (“pam”).
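A minimal Python sketch of the k-means criterion above, for illustration only: it evaluates the sum of squared Euclidean distances between each point and the centroid of its assigned cluster for a given labelling, and does not search for the minimising clustering. The function names and the toy data are hypothetical and not taken from the talk.

```python
from statistics import fmean


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    return tuple(fmean(coord) for coord in zip(*points))


def kmeans_objective(points, labels):
    """Sum of squared Euclidean distances from each point to the centroid
    of the cluster it is assigned to (the quantity k-means minimises)."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    centroids = {c: centroid(members) for c, members in clusters.items()}
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[c]))
        for p, c in zip(points, labels)
    )


# Hypothetical toy data: two tight pairs of points, far apart horizontally.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = kmeans_objective(points, [0, 0, 1, 1])  # left pair vs. right pair
bad = kmeans_objective(points, [0, 1, 0, 1])   # bottom pair vs. top pair
print(good, bad)
```

The labelling that keeps nearby points together gives the smaller criterion value, which is the sense in which k-means prefers “compact” clusters.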
  6. [Figure: scatterplot on axes MDS 1 and MDS 2; points are labelled by cluster number, 1 to 9.]
  7. 1.1.2 Gaussian mixture model (Pearson 1894): $f(x) = \sum_{j=1}^{k} \pi_j \varphi_{a_j, \Sigma_j}(x)$. Clusters are described by Gaussian distributions. Elliptical clusters, flexible size and shape.
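As a hedged illustration of the mixture density formula, the sketch below evaluates it in one dimension only, so each covariance matrix reduces to a standard deviation. The function name `mixture_density` and the parameter values are assumptions made for this example; fitting the parameters (for instance by the EM algorithm) is not shown.

```python
from statistics import NormalDist


def mixture_density(x, weights, means, sds):
    """Evaluate f(x) = sum_j pi_j * phi_{a_j, sigma_j}(x) in one dimension."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture proportions must sum to 1")
    return sum(
        w * NormalDist(m, s).pdf(x)
        for w, m, s in zip(weights, means, sds)
    )


# Hypothetical two-component mixture: a narrow and a wide component.
weights, means, sds = [0.3, 0.7], [0.0, 5.0], [1.0, 2.0]
density_at_zero = mixture_density(0.0, weights, means, sds)
print(density_at_zero)
```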
  8. [Figure: scatterplot on axes MDS 1 and MDS 2; points are labelled by cluster number, 1 to 9.]
  9-10. 1.1.3 Classical hierarchical methods. Operate on dissimilarity matrices; compute a dissimilarity measure for every pair of observations. Can use Euclidean distance, but also tailor-made distances for other data formats. “Cluster”: a collection of similar objects, dissimilar to the others.
  11. Genetic data: 236 Tetragonula bees, 13 allele pairs.

           [,1] [,2] [,3] [,4] [,5] [,6] (...)
      [1,] "NO" "AA" "PP" "HH" "EH" "FF"
      [2,] "EO" "AA" "PP" "HH" "GH" "FF"
      [3,] "NQ" "AA" "PT" "HH" "GF" "EF"
      [4,] "OO" "AA" "PP" "GH" "GH" "EF"
      [5,] "OO" "AA" "PP" "GH" "GH" "EF"
      [6,] "LN" "AA" "PP" "HH" "EG" "FE"
      (...)

      Compute “shared allele distance”.
  12. Shared allele distance matrix (excerpt):

           [,1] [,2] [,3] [,4] [,5]
      [1,] 0.00 0.21 0.33 0.29 0.25
      [2,] 0.21 0.00 0.33 0.25 0.21
      [3,] 0.33 0.33 0.00 0.29 0.33 (...)
      [4,] 0.29 0.25 0.29 0.00 0.08
      [5,] 0.25 0.21 0.33 0.08 0.00
      (...)

      The dataset seen before is a Euclidean approximation (“MDS”) of this.
  13. 1.1.3 Classical hierarchical methods. Operate on dissimilarities and produce hierarchical trees (originally motivated by biological classification). Differ in the definition of “dissimilarity between clusters”. [Figure: cluster dendrogram, hclust (*, "single") on as.dist(tai$distmat); vertical axis: Height, 0.0 to 0.6.]
  14. Single Linkage (Florek and Perkal 1951): $\tilde{d}(A, B) = \min_{a \in A, b \in B} d(a, b)$. Complete Linkage: $\tilde{d}(A, B) = \max_{a \in A, b \in B} d(a, b)$. Average Linkage: $\tilde{d}(A, B) = \operatorname{ave}_{a \in A, b \in B} d(a, b)$. These can deliver quite different clusterings (Complete Linkage: very compact clusters; Single Linkage: separated but possibly widespread clusters).
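The three between-cluster dissimilarities above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the talk: the function names are hypothetical, and Euclidean distance (`math.dist`) is only the default; any dissimilarity function `d(a, b)` can be passed instead.

```python
from math import dist
from statistics import fmean


def pair_distances(A, B, d=dist):
    """All dissimilarities d(a, b) with a in cluster A and b in cluster B."""
    return [d(a, b) for a in A for b in B]


def single_linkage(A, B, d=dist):
    return min(pair_distances(A, B, d))


def complete_linkage(A, B, d=dist):
    return max(pair_distances(A, B, d))


def average_linkage(A, B, d=dist):
    return fmean(pair_distances(A, B, d))


# Hypothetical toy clusters on a line.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(3.0, 0.0), (6.0, 0.0)]
print(single_linkage(A, B), complete_linkage(A, B), average_linkage(A, B))
# prints: 2.0 6.0 4.0
```

A hierarchical method merges, at each step, the two clusters with the smallest value of the chosen dissimilarity, so the same data can yield different trees depending on which of the three functions is used.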
  15. 1.1.4 Spectral clustering (Shi and Malik 2000): dissimilarity-based nonlinear dimension reduction for k-means.
  16. [Figure: scatterplot on axes ctai$points[,1] and ctai$points[,2]; points are labelled by cluster number, 1 to 7.]
  17. 1.1.5 Density-based methods such as “DBSCAN” (Ester et al. 1996) join observations with all their neighbouring points, and join neighbourhoods if they share enough points.
  18. [Figure: scatterplot on axes ctai$points[,1] and ctai$points[,2]; points are labelled by cluster number, 1 to 7, or by “N”.]
  19. 1.1.6 Further issues in cluster analysis: number of clusters; cluster validation; dissimilarity definition; choice of method.
  20. Section 2: Benchmarking and measurement of quality (subsections: Which clustering is better?; Benchmarking approaches; Cluster validation indexes; My general philosophy; Typical clustering aims). Which clustering is better? (Old Faithful geyser data) [Figure: two scatterplots on axes waiting and duration; panel titles “mclust” and “pam”.]
  21. Which clustering is better? [Figure: two scatterplots on axes xy[,1] and xy[,2].]
  22. Benchmarking approaches: real datasets with known classes; simulated datasets from mixture distributions; real datasets without known classes. With a known truth, misclassification rates can be computed.
  23-26. Disadvantages of benchmarking with known truth: In datasets with known classes, clustering is not of real scientific interest. Such datasets deviate systematically from real clustering problems. The fact that we know certain true classes doesn’t preclude other legitimate/“true” clusterings. Classes in supervised classification problems may not qualify as data analytic clusters. So there could be better truths than the known one.
  27. How true are the true given classes? (Hennig and Liao 2013, social stratification data)
  28. 7 standard occupation classes such as “manual workers”, “managerials and professionals”, “not working”.
  29. These are not “data analytic clusters”.
  30. Mixture components aren’t always “data analytic clusters” either. [Figure: scatterplot on axes xdata$x[,1] and xdata$x[,2]; points are labelled by number, 1 to 5.]
  31. Using a known truth is useful and fair enough, but we also want to evaluate clusterings on data for which the truth is not known.
  32. There is a range of cluster validation indexes measuring clustering quality, such as the average silhouette width (ASW) (Kaufman and Rousseeuw 1990): $sw(i, C) = \frac{b(i, C) - a(i, C)}{\max(a(i, C), b(i, C))}$, where, for $x_i \in C_j$, $a(i, C) = \frac{1}{|C_j| - 1} \sum_{x \in C_j} d(x_i, x)$ and $b(i, C) = \min_{l:\, x_i \notin C_l} \frac{1}{|C_l|} \sum_{x \in C_l} d(x_i, x)$. Maximum average $sw$ ⇒ good $C$.
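A small Python sketch of the silhouette width defined above, under stated assumptions: Euclidean distance is the default dissimilarity, and points in singleton clusters (for which a(i, C) is undefined) are given width 0, a common convention that the slide does not spell out. The function names and toy data are hypothetical.

```python
from math import dist
from statistics import fmean


def silhouette_widths(points, labels, d=dist):
    """Silhouette width sw(i, C) for every point, given cluster labels."""
    clusters = {}
    for idx, c in enumerate(labels):
        clusters.setdefault(c, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette widths need at least two clusters")
    widths = []
    for i, own in enumerate(labels):
        own_others = [j for j in clusters[own] if j != i]
        if not own_others:
            widths.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: average dissimilarity to the other members of the own cluster
        a = fmean([d(points[i], points[j]) for j in own_others])
        # b: smallest average dissimilarity to the members of another cluster
        b = min(
            fmean([d(points[i], points[j]) for j in members])
            for c, members in clusters.items()
            if c != own
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


def average_silhouette_width(points, labels, d=dist):
    return fmean(silhouette_widths(points, labels, d))


# Hypothetical toy data: two pairs of points, far apart horizontally.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
asw_good = average_silhouette_width(points, [0, 0, 1, 1])
asw_bad = average_silhouette_width(points, [0, 1, 0, 1])
print(asw_good, asw_bad)
```

Comparing the two labellings shows how the index is used: the clustering with the larger average silhouette width is the one the ASW rates as better.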
  33-34. “One size fits all” approach. Homogeneity will normally dominate here: [Figure: two scatterplots on axes xy[,1] and xy[,2].]
  35. My general philosophy: There are various different aims of clustering. Measure them separately to characterise what a method does best, instead of producing a single ranking.
  36-43. Typical clustering aims: between-cluster separation; within-cluster homogeneity (low distances); within-cluster homogeneous distributional shape; good representation of data by centroids; little loss of information from the original distances between objects; clusters are regions of high density without within-cluster gaps; uniform cluster sizes; stability.
  44. These may be in conflict with each other. [Figure: two scatterplots on axes xy[,1] and xy[,2].]
  45-48. E.g., pattern recognition in images requires separation; clustering for information reduction requires good representation by centroids; groups in social network analysis shouldn’t have large within-cluster gaps; underlying “true” classes (biological species) may cause homogeneous distributional shapes.
  49. Section 3: Cluster quality statistics (subsections: Measuring between-cluster separation; Measuring “density mountains vs. valleys”; Other statistics). Measuring between-cluster separation: there are several ways of measuring separation (as for other aims). Straightforward: minimum distance between any two clusters, or distance between centroids (e.g., k-means).
  50. [Figure: scatterplot on axes waiting and duration.]
  51. [Figure: scatterplot on axes waiting and duration, with two points marked “M”.]
  52. Measuring between-cluster separation: there are several ways of measuring separation (as for other aims). Straightforward: minimum distance between any two clusters, or distance between centroids (e.g., k-means). These measure quite different concepts of separation (minimum distance relies on only two points; centroid distance ignores what goes on at the border).
  53. 53. p-separation index: a more stable version of “min distance”. It is the average distance to the nearest point in a different cluster, taken over the p = 10% “border” points in any cluster. [Figure: waiting vs. duration scatterplot with the border points marked “X”]
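A minimal sketch of the p-separation index as described on this slide. It assumes that the border points of a cluster are the fraction p of its points lying closest to any other cluster, and that at least two clusters exist; the function name is hypothetical and this is not the fpc implementation.

```python
import math

def p_separation_index(points, labels, p=0.1):
    """Average nearest-other-cluster distance over the border points of all clusters."""
    border_distances = []
    for cluster in sorted(set(labels)):
        inside = [x for x, lab in zip(points, labels) if lab == cluster]
        outside = [x for x, lab in zip(points, labels) if lab != cluster]
        # Distance from each point of this cluster to its nearest point in another cluster.
        nearest = sorted(min(math.dist(x, y) for y in outside) for x in inside)
        # Assumption: border points are the fraction p with the smallest such distance,
        # and every cluster contributes at least one point.
        n_border = max(1, math.ceil(p * len(inside)))
        border_distances.extend(nearest[:n_border])
    return sum(border_distances) / len(border_distances)

# Hypothetical toy data with two clusters.
points = [(0, 0), (0, 1), (1, 0), (5, 0), (5, 1), (6, 0)]
labels = [0, 0, 0, 1, 1, 1]
print(p_separation_index(points, labels, p=0.1))
```

Averaging over several border points instead of taking a single minimum is what makes the index less sensitive to one unusual pair of points.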
  54. 54. Measuring “density mountains vs. valleys”: an index that measures whether clusters correspond to “density mountains”, and whether “valleys” are between clusters. Note: This is current research and may be revised.
  55. 55. Two aspects: (a) Density goes down from the mode; no gaps and valleys within clusters. (b) Cluster borders are valleys; they don’t run through mountains. Estimate density by a weighted count of close points (“kernel density”). [Figure: kernel k(x) against x, with the 10% quantile of within-cluster distances marked]
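The density estimate by a weighted count of close points can be sketched as follows. The slide does not give the kernel formula, so the linearly decaying kernel k(d) = max(0, 1 - d/bandwidth) is an assumption, as is the use of the 10% quantile of within-cluster distances as the bandwidth; both function names are hypothetical.

```python
import math

def within_cluster_quantile(points, labels, q=0.1):
    """q-quantile of the distances between pairs of points sharing a cluster."""
    dists = sorted(
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
        if labels[i] == labels[j]
    )
    # Simple order-statistic quantile; needs at least one within-cluster pair.
    return dists[min(len(dists) - 1, int(q * len(dists)))]

def kernel_density(points, bandwidth):
    """Weighted count of close points; each point also counts itself with weight 1."""
    return [
        # Assumed kernel: weight falls linearly from 1 at distance 0 to 0 at the bandwidth.
        sum(max(0.0, 1.0 - math.dist(x, y) / bandwidth) for y in points)
        for x in points
    ]

points = [(0.0,), (1.0,), (2.0,), (10.0,)]   # hypothetical one-dimensional data
labels = [0, 0, 0, 1]
print(within_cluster_quantile(points, labels))
print(kernel_density(points, bandwidth=2.0))
```

Points with many close neighbours get a high value and isolated points a low one, which is all the later steps need from the density.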
  56. 56. [Figure: waiting vs. duration scatterplot, sinlink clustering, g = 2]
  57. 57. Start from cluster modes. [Figure: sinlink, g = 2, step 1]
  58. 58. Connect closest point to cluster. [Figure: sinlink, g = 2, step 3]
  59. 59. As long as density goes down, no penalty. [Figure: sinlink, g = 2, step 6]
  60. 60. Penalty for density increase. [Figure: sinlink, g = 2, step 98]
  61-71. [Figure sequence continuing the same procedure: sinlink, g = 2, steps 99 to 108 and step 297]
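The stepwise procedure shown in these frames (start from the mode, connect the closest point, penalise density increases) can be sketched for a single cluster as below. This is a simplified reading of the slides, not the author's index: the squared increase as penalty, the comparison with the nearest already connected point, and the function name are all assumptions.

```python
import math

def density_increase_penalty(points, density):
    """Grow one cluster from its mode and sum the squared density increases on the way."""
    remaining = set(range(len(points)))
    mode = max(remaining, key=lambda i: density[i])  # start from the cluster mode
    visited = [mode]
    remaining.remove(mode)
    penalty = 0.0
    while remaining:
        # Unvisited point closest to the connected part, and the point it attaches to.
        new, anchor = min(
            ((i, j) for i in remaining for j in visited),
            key=lambda pair: math.dist(points[pair[0]], points[pair[1]]),
        )
        # As long as density goes down there is no penalty; an increase is penalised.
        increase = density[new] - density[anchor]
        if increase > 0:
            penalty += increase ** 2  # assumed penalty form
        visited.append(new)
        remaining.remove(new)
    return penalty

points = [(0.0,), (1.0,), (2.5,), (4.5,)]               # hypothetical cluster
print(density_increase_penalty(points, [4, 3, 2, 1]))  # density falls away from the mode
print(density_increase_penalty(points, [4, 1, 3, 2]))  # a valley inside the cluster
```

A cluster with a single density mountain gets no penalty, while a cluster containing a valley followed by a second peak does.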
  72. 72. Add penalty: density * density from other clusters. [Figure: waiting vs. duration scatterplot, specc clustering, g = 3, with points marked “P”]
  73. 73. Other statistics (list built up over slides 73 to 80): within-cluster average distance; within-cluster similarity measure to normal/uniform; within-cluster (squared) distance to centroid; ρ(distance, cluster induced distance) (Hubert’s Γ); entropy of cluster sizes; average largest within-cluster gap; variation of clusterings on bootstrapped data. Standardise all indexes to [0, 1] so that “large is good”.
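Three of the statistics listed above are simple enough to sketch directly. The code assumes that the cluster induced distance is 0 for pairs in the same cluster and 1 otherwise, that ρ is the Pearson correlation, and that within-cluster distances are pooled over all clusters; the function names are hypothetical and the values are not yet standardised to [0, 1].

```python
import math
from collections import Counter

def pairwise(points, labels):
    """Yield (distance, same_cluster) for every pair of points."""
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            yield math.dist(points[i], points[j]), labels[i] == labels[j]

def average_within_distance(points, labels):
    """Mean distance over all pairs of points that share a cluster (pooled)."""
    within = [d for d, same in pairwise(points, labels) if same]
    return sum(within) / len(within)

def pearson_gamma(points, labels):
    """Correlation between distances and the 0/1 cluster induced distance."""
    pairs = list(pairwise(points, labels))
    d = [dist for dist, _ in pairs]
    c = [0.0 if same else 1.0 for _, same in pairs]
    md, mc = sum(d) / len(d), sum(c) / len(c)
    cov = sum((a - md) * (b - mc) for a, b in zip(d, c))
    # Undefined (division by zero) if all points are in one cluster or all distances are equal.
    return cov / math.sqrt(sum((a - md) ** 2 for a in d) * sum((b - mc) ** 2 for b in c))

def size_entropy(labels):
    """Entropy of the cluster size proportions; largest for equally sized clusters."""
    n = len(labels)
    return -sum(k / n * math.log(k / n) for k in Counter(labels).values())

points = [(0.0,), (1.0,), (10.0,), (11.0,)]  # hypothetical data
labels = [0, 0, 1, 1]
print(average_within_distance(points, labels), pearson_gamma(points, labels), size_entropy(labels))
```

Note that a small average within-cluster distance is good while a large Γ is good, which is why the slide asks for all indexes to be standardised so that “large is good”.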
  81. 81. 4. Examples. [Figure: two scatterplots of xy[,1] vs. xy[,2], one per clustering]
                       3-means   mclust-3
      ave within         0.811      0.643
      sep index          0.163      0.306
      density index      0.977      0.978
      within gap         0.927      0.949
  82. 82. [Figure: six waiting vs. duration scatterplots, one per clustering: mclust, pam, spectral, ave.linkage, single linkage, pdfCluster (3)]
  83. 83.
                    mclust   pam    spect  ave.l  sing.l  pdf3
      ave within     0.71    0.95   0.82   0.90   0.04    0.98
      sep index      0.98    0.30   0.94   0.60   0.99    0.78
      density        0.99    0.44   0.70   0.63   0.59    0.99
      gap            0.14    0.46   0.46   0.46   0.99    0.48
      gamma          0.81    0.91   0.92   0.96   0.06    0.98
      normality      0.69    0.44   0.45   0.48   0.11    0.52
      Note: These values are quantile-standardised; implementation of this in fpc is still to come.
  84. 84. [Figure: quantile-calibrated density index (dindex, scale 0 to 1) against number of clusters (2 to 5) for kmeans, avelink, sinlink, comlink, mclust, pam, specc and pdfclus]
  85. 85. 6. Discussion (list built up over slides 85 to 89): Clustering quality is multidimensional. Provide multidimensional evaluation, characterising a method’s behaviour. Can aggregate criteria by weighted mean given well justified weights. Benchmarking without known truth and comparison of clusterings in practice. Required: standardisation to compare different indexes and numbers of clusters.
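Aggregation by weighted mean can be illustrated with the quantile-standardised values from the six-method comparison in the Examples section. The weights below are purely hypothetical placeholders; the talk stresses that real weights need to be well justified by the aim of the clustering.

```python
INDEXES = ["ave within", "sep index", "density", "gap", "gamma", "normality"]

# Quantile-standardised values as given in the Examples section (one row per method).
VALUES = {
    "mclust": [0.71, 0.98, 0.99, 0.14, 0.81, 0.69],
    "pam":    [0.95, 0.30, 0.44, 0.46, 0.91, 0.44],
    "spect":  [0.82, 0.94, 0.70, 0.46, 0.92, 0.45],
    "ave.l":  [0.90, 0.60, 0.63, 0.46, 0.96, 0.48],
    "sing.l": [0.04, 0.99, 0.59, 0.99, 0.06, 0.11],
    "pdf3":   [0.98, 0.78, 0.99, 0.48, 0.98, 0.52],
}

def aggregate(values, weights):
    """Weighted mean of the standardised indexes for every method."""
    total = sum(weights[name] for name in INDEXES)
    return {
        method: sum(weights[name] * v for name, v in zip(INDEXES, row)) / total
        for method, row in values.items()
    }

# Hypothetical weights for a user who mainly cares about separation and density.
weights = {"ave within": 1, "sep index": 3, "density": 3, "gap": 1, "gamma": 1, "normality": 0}
for method, score in sorted(aggregate(VALUES, weights).items(), key=lambda kv: -kv[1]):
    print(f"{method:7s} {score:.3f}")
```

Changing the weights changes the ranking, so the aggregate is only as meaningful as the justification behind the weights.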
  90. 90. Much of this is implemented in the R package fpc; more will be. Soon to come: IFCS Cluster Benchmarking Repository (Iven Van Mechelen, Nema Dean, Isabelle Guyon, Anne-Laure Boulesteix, Doug Steinley, Friedrich Leisch, Christian Hennig, Rainer Dangl). This work is supported by EPSRC Grant EP/K033972/1.
  92. 92. A bit of marketing:
