Clustering introduction
SCR©
Roadmap: Method
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
– Feature selection - Yan
• Supervised learning (4 sessions)
– Regression models - Yan
– SVM and kernel SVM - Yan
– Tree-based models - Dario
– Bayesian method - Xiaoyang
– Ensemble models - Yan
• Unsupervised learning (3 sessions)
– K-means clustering
– DBSCAN - Cheng
– Mean shift
– Agglomerative clustering - Kunal
– Dimension reduction for data visualization - Yan
• Deep learning (4 sessions)
– Neural network
– From neural network to deep learning
– Convolutional neural network
– Train deep nets with open-source tools
Clustering Algorithms
• K-Means (King of clustering, many variants)
• DBSCAN (group neighboring points)
• Mean shift (locating the maxima of density)
• Spectral clustering (cares about connectivity instead of proximity)
• Hierarchical clustering (a hierarchical structure, multiple levels)
• Expectation Maximization (k-means is a variant of EM)
• Latent Dirichlet Allocation (natural language processing)
……
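A minimal sketch of trying two of the algorithms above, assuming scikit-learn (`KMeans` and `DBSCAN`) and a synthetic two-blob data set:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
# Two well-separated blobs of 50 points each.
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(4.0, 0.3, (50, 2))])

# K-Means needs K up front; DBSCAN instead groups neighboring points
# and may label sparse points as noise (-1).
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print(sorted(set(kmeans_labels)))
print(sorted(set(dbscan_labels)))
```

On data this clean both algorithms recover the same two groups; they diverge on elongated or variable-density clusters.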
Cluster Validity
• For cluster analysis, the question is how to evaluate the
“goodness” of the resulting clusters.
• Why do we want to evaluate them?
– To avoid finding patterns in noise
– To compare clustering algorithms
– To determine the optimal number of clusters
Cluster Validity
• Numerical measures:
– External: Used to measure the extent to which cluster labels match
externally supplied class labels.
• Entropy
– Internal: Used to measure the goodness of a clustering structure without
respect to external information.
• Sum of Squared Error (SSE)
– Relative: Used to compare two different clusterings.
• Often an external or internal measure is used for this purpose, e.g., SSE or entropy
• Visualization
Internal Measures: WSS and BSS
• Cluster Cohesion: measures how closely related the objects in a
cluster are
– Example: SSE
• Cluster Separation: measures how distinct or well-separated a
cluster is from other clusters
– Cohesion is measured by the within-cluster sum of squares (WSS, i.e., SSE):

  WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2

– Separation is measured by the between-cluster sum of squares:

  BSS = \sum_i |C_i| (m - m_i)^2

where m_i is the mean of cluster C_i, m is the overall mean, and |C_i| is the size of cluster i.
Internal Measures: WSS and BSS
• Example: SSE
– BSS + WSS = constant
Data: the points 1, 2, 4, 5 on a line; overall mean m = 3.

K = 1 cluster:
WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10
BSS = 4 (3 - 3)^2 = 0
Total = 10 + 0 = 10

K = 2 clusters ({1, 2} with mean m1 = 1.5; {4, 5} with mean m2 = 4.5):
WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1
BSS = 2 (3 - 1.5)^2 + 2 (4.5 - 3)^2 = 9
Total = 1 + 9 = 10
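The arithmetic above can be checked with a short numpy sketch (`wss_bss` is a hypothetical helper written for this example):

```python
import numpy as np

def wss_bss(X, labels):
    """Within- and between-cluster sums of squares."""
    m = X.mean(axis=0)                       # overall mean
    wss = bss = 0.0
    for k in np.unique(labels):
        Ck = X[labels == k]
        mk = Ck.mean(axis=0)                 # cluster mean
        wss += ((Ck - mk) ** 2).sum()
        bss += len(Ck) * ((m - mk) ** 2).sum()
    return wss, bss

X = np.array([[1.0], [2.0], [4.0], [5.0]])   # the 1-D example points
w1, b1 = wss_bss(X, np.array([0, 0, 0, 0]))  # K = 1
w2, b2 = wss_bss(X, np.array([0, 0, 1, 1]))  # K = 2
print(w1, b1)  # 10.0 0.0
print(w2, b2)  # 1.0 9.0
```

The totals agree: WSS + BSS = 10 in both cases, as the slide claims.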
Internal Measures: WSS and BSS
• Can be used to estimate the number of clusters
[Figure: a 2-D data set (left) and its SSE as a function of the number of clusters K (right); the knee of the SSE curve suggests the number of clusters.]
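A sketch of that elbow heuristic, assuming scikit-learn: run KMeans over a range of K and inspect the WSS, which scikit-learn exposes as `inertia_`. The toy data here has three planted clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three tight clusters along the diagonal.
X = np.vstack([rng.normal(c, 0.2, (40, 2)) for c in (0.0, 3.0, 6.0)])

# inertia_ is exactly the within-cluster sum of squares (WSS).
wss = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 7)}

for k, v in wss.items():
    print(k, round(v, 1))
# The curve drops steeply until K reaches the true number of clusters,
# then flattens: the "elbow".
```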
Internal Measures: Proximity graph measures
• Cluster cohesion is the sum of the weight of all links within a
cluster.
• Cluster separation is the sum of the weights between nodes in the
cluster and nodes outside the cluster.
[Figure: a proximity graph; cohesion corresponds to links within a cluster, separation to links between clusters.]
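A sketch of these two graph quantities, assuming the links are given as a symmetric similarity (weight) matrix W; `cohesion_separation` is a hypothetical helper:

```python
import numpy as np

def cohesion_separation(W, labels):
    """Sum of within-cluster link weights and of cross-cluster link weights."""
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    cohesion = W[same & off_diag].sum() / 2   # each undirected link counted once
    separation = W[~same].sum() / 2
    return cohesion, separation

# Four points, two clusters: strong links inside, weak links across.
W = np.array([[1.0, 0.9, 0.1, 0.0],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.8],
              [0.0, 0.1, 0.8, 1.0]])
labels = np.array([0, 0, 1, 1])
print(cohesion_separation(W, labels))
```

For this matrix the within-cluster links (0.9 and 0.8) dominate the cross-cluster ones, so cohesion far exceeds separation.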
Correlation between affinity matrix and incidence matrix
• Given an affinity (distance) matrix D = {d11, d12, …, dnn} and an
incidence matrix C = {c11, c12, …, cnn} from the clustering
(cij = 1 if points i and j belong to the same cluster, 0 otherwise)
• The correlation r between D and C is given by

  r = \frac{\sum_{i,j=1}^{n} (d_{ij} - \bar{d})(c_{ij} - \bar{c})}
           {\sqrt{\sum_{i,j=1}^{n} (d_{ij} - \bar{d})^2} \sqrt{\sum_{i,j=1}^{n} (c_{ij} - \bar{c})^2}}
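A numpy sketch of computing r: build the pairwise-distance matrix D and the 0/1 incidence matrix C from the labels, then correlate their entries (`incidence_correlation` is a hypothetical helper):

```python
import numpy as np

def incidence_correlation(X, labels):
    # Pairwise Euclidean distances d_ij.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # Incidence matrix: c_ij = 1 if i and j share a cluster label.
    C = (labels[:, None] == labels[None, :]).astype(float)
    # Pearson correlation over all entries, as in the formula above.
    return np.corrcoef(D.ravel(), C.ravel())[0, 1]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, (30, 2)),
               rng.normal(3.0, 0.2, (30, 2))])
labels = np.array([0] * 30 + [1] * 30)
r = incidence_correlation(X, labels)
print(r)  # strongly negative: same-cluster pairs have small distances
```

A good clustering makes r strongly negative, because c_ij = 1 exactly where d_ij is small.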
Correlation with Incidence matrix
• The same correlation r, computed for two 2-D data sets:
[Figure: two scatter plots of points in the unit square; the correlation between the distance and incidence matrices is r = -0.9235 for the first data set and r = -0.5810 for the second.]
Visualization of similarity matrix
• Order the similarity matrix with respect to cluster labels and
inspect visually.
[Figure: a 2-D data set of well-separated clusters (left) and its point-by-point similarity matrix ordered by cluster label (right); similarity scale 0–1, with bright diagonal blocks.]
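A numpy sketch of the reordering step: shuffle a three-cluster data set, convert distances to similarities, and sort the similarity matrix by cluster label; the block-diagonal structure then becomes easy to check (the 1/(1+d) similarity is one simple choice assumed here).

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, (20, 2)) for c in (0.0, 2.0, 4.0)])
labels = np.repeat([0, 1, 2], 20)
perm = rng.permutation(60)                 # scramble the point order
X, labels = X[perm], labels[perm]

D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
S = 1.0 / (1.0 + D)                        # similarity in (0, 1]
order = np.argsort(labels, kind="stable")  # group points by cluster label
S_ordered = S[np.ix_(order, order)]

# Bright diagonal blocks: within-cluster similarity beats cross-cluster.
within = S_ordered[:20, :20].mean()
cross = S_ordered[:20, 20:40].mean()
print(within > cross)
```

Plotting `S_ordered` as a heatmap (e.g. with matplotlib's `imshow`) reproduces the block-diagonal picture on the slide.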
Visualization of similarity matrix
• Clusters in random data are not so crisp
[Figure: a random 2-D data set and its similarity matrix ordered by cluster label; the block-diagonal structure is much weaker.]
Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and frustrating part
of cluster analysis.
Without a strong effort in this direction, cluster analysis will remain a black art
accessible only to those true believers who have experience and great
courage.”
Algorithms for Clustering Data, Jain and Dubes
Roadmap: Method
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
– Feature selection - Yan
• Supervised learning (4 sessions)
– Regression models - Yan
– SVM and kernel SVM - Yan
– Tree-based models - Dario
– Bayesian method - Xiaoyang
– Ensemble models - Yan
• Unsupervised learning (3 sessions)
– K-means clustering
– DBSCAN - Cheng
– Mean shift
– Hierarchical clustering - Kunal
– Dimension reduction for data visualization - Yan
• Deep learning (4 sessions)
– Neural network
– From neural network to deep learning - Yan
– Convolutional neural network
– Train deep nets with open-source tools