Clustering introduction
SCR©
Roadmap: Method
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
– Feature selection - Yan
• Supervised learning (4 sessions)
– Regression models - Yan
– SVM and kernel SVM - Yan
– Tree-based models - Dario
– Bayesian method - Xiaoyang
– Ensemble models - Yan
• Unsupervised learning (3 sessions)
– K-means clustering
– DBSCAN - Cheng
– Mean shift
– Agglomerative clustering - Kunal
– Dimension reduction for data visualization - Yan
• Deep learning (4 sessions)
– Neural network
– From neural network to deep learning
– Convolutional neural network
– Train deep nets with open-source tools
Clustering Algorithms
• K-Means (King of clustering, many variants)
• DBSCAN (group neighboring points)
• Mean shift (locating the maxima of density)
• Spectral clustering (cares about connectivity instead of proximity)
• Hierarchical clustering (a hierarchical structure, multiple levels)
• Expectation Maximization (k-means is a variant of EM)
• Latent Dirichlet Allocation (natural language processing)
……
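A minimal sketch of trying two of the algorithms above, assuming scikit-learn (`KMeans` and `DBSCAN`) and a synthetic two-blob data set:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
# Two well-separated blobs of 50 points each.
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(4.0, 0.3, (50, 2))])

# K-Means needs K up front; DBSCAN instead groups neighboring points
# and may label sparse points as noise (-1).
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print(sorted(set(kmeans_labels)))
print(sorted(set(dbscan_labels)))
```

On data this clean both algorithms recover the same two groups; they diverge on elongated or variable-density clusters.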
Cluster Validity
• For cluster analysis, the question is how to evaluate the
“goodness” of the resulting clusters.
• Why do we want to evaluate them?
– To avoid finding patterns in noise
– To compare clustering algorithms
– To determine the optimal number of clusters
Cluster Validity
• Numerical measures:
– External: Used to measure the extent to which cluster labels match
externally supplied class labels.
• Entropy
– Internal: Used to measure the goodness of a clustering structure without
respect to external information.
• Sum of Squared Error (SSE)
– Relative: Used to compare two different clusterings.
• Often an external or internal measure is used for this purpose, e.g., SSE or entropy
• Visualization
Internal Measures: WSS and BSS
• Cluster Cohesion: measures how closely related the objects in a
cluster are
– Example: SSE
• Cluster Separation: measures how distinct or well-separated a
cluster is from other clusters
– Cohesion is measured by the within-cluster sum of squares (WSS, i.e., SSE):

  WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2

– Separation is measured by the between-cluster sum of squares:

  BSS = \sum_i |C_i| (m - m_i)^2

where m_i is the mean of cluster C_i, m is the overall mean, and |C_i| is the size of cluster i.
Internal Measures: WSS and BSS
• Example: SSE
– BSS + WSS = constant
Data: the points 1, 2, 4, 5 on a line; overall mean m = 3.

K = 1 cluster:
WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10
BSS = 4 (3 - 3)^2 = 0
Total = 10 + 0 = 10

K = 2 clusters ({1, 2} with mean m1 = 1.5; {4, 5} with mean m2 = 4.5):
WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1
BSS = 2 (3 - 1.5)^2 + 2 (4.5 - 3)^2 = 9
Total = 1 + 9 = 10
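The arithmetic above can be checked with a short numpy sketch (`wss_bss` is a hypothetical helper written for this example):

```python
import numpy as np

def wss_bss(X, labels):
    """Within- and between-cluster sums of squares."""
    m = X.mean(axis=0)                       # overall mean
    wss = bss = 0.0
    for k in np.unique(labels):
        Ck = X[labels == k]
        mk = Ck.mean(axis=0)                 # cluster mean
        wss += ((Ck - mk) ** 2).sum()
        bss += len(Ck) * ((m - mk) ** 2).sum()
    return wss, bss

X = np.array([[1.0], [2.0], [4.0], [5.0]])   # the 1-D example points
w1, b1 = wss_bss(X, np.array([0, 0, 0, 0]))  # K = 1
w2, b2 = wss_bss(X, np.array([0, 0, 1, 1]))  # K = 2
print(w1, b1)  # 10.0 0.0
print(w2, b2)  # 1.0 9.0
```

The totals agree: WSS + BSS = 10 in both cases, as the slide claims.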
Internal Measures: WSS and BSS
• Can be used to estimate the number of clusters
[Figure: a 2-D data set (left) and its SSE as a function of the number of clusters K (right); the knee of the SSE curve suggests the number of clusters.]
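A sketch of that elbow heuristic, assuming scikit-learn: run KMeans over a range of K and inspect the WSS, which scikit-learn exposes as `inertia_`. The toy data here has three planted clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three tight clusters along the diagonal.
X = np.vstack([rng.normal(c, 0.2, (40, 2)) for c in (0.0, 3.0, 6.0)])

# inertia_ is exactly the within-cluster sum of squares (WSS).
wss = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 7)}

for k, v in wss.items():
    print(k, round(v, 1))
# The curve drops steeply until K reaches the true number of clusters,
# then flattens: the "elbow".
```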
Internal Measures: Proximity graph measures
• Cluster cohesion is the sum of the weight of all links within a
cluster.
• Cluster separation is the sum of the weights between nodes in the
cluster and nodes outside the cluster.
[Figure: a proximity graph; cohesion corresponds to links within a cluster, separation to links between clusters.]
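A sketch of these two graph quantities, assuming the links are given as a symmetric similarity (weight) matrix W; `cohesion_separation` is a hypothetical helper:

```python
import numpy as np

def cohesion_separation(W, labels):
    """Sum of within-cluster link weights and of cross-cluster link weights."""
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    cohesion = W[same & off_diag].sum() / 2   # each undirected link counted once
    separation = W[~same].sum() / 2
    return cohesion, separation

# Four points, two clusters: strong links inside, weak links across.
W = np.array([[1.0, 0.9, 0.1, 0.0],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.8],
              [0.0, 0.1, 0.8, 1.0]])
labels = np.array([0, 0, 1, 1])
print(cohesion_separation(W, labels))
```

For this matrix the within-cluster links (0.9 and 0.8) dominate the cross-cluster ones, so cohesion far exceeds separation.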
Correlation between affinity matrix and incidence matrix
• Given an affinity (distance) matrix D = {d11, d12, …, dnn} and an
incidence matrix C = {c11, c12, …, cnn} from the clustering
(cij = 1 if points i and j belong to the same cluster, 0 otherwise)
• The correlation r between D and C is given by

  r = \frac{\sum_{i,j=1}^{n} (d_{ij} - \bar{d})(c_{ij} - \bar{c})}
           {\sqrt{\sum_{i,j=1}^{n} (d_{ij} - \bar{d})^2} \sqrt{\sum_{i,j=1}^{n} (c_{ij} - \bar{c})^2}}
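A numpy sketch of computing r: build the pairwise-distance matrix D and the 0/1 incidence matrix C from the labels, then correlate their entries (`incidence_correlation` is a hypothetical helper):

```python
import numpy as np

def incidence_correlation(X, labels):
    # Pairwise Euclidean distances d_ij.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # Incidence matrix: c_ij = 1 if i and j share a cluster label.
    C = (labels[:, None] == labels[None, :]).astype(float)
    # Pearson correlation over all entries, as in the formula above.
    return np.corrcoef(D.ravel(), C.ravel())[0, 1]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, (30, 2)),
               rng.normal(3.0, 0.2, (30, 2))])
labels = np.array([0] * 30 + [1] * 30)
r = incidence_correlation(X, labels)
print(r)  # strongly negative: same-cluster pairs have small distances
```

A good clustering makes r strongly negative, because c_ij = 1 exactly where d_ij is small.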
Correlation with Incidence matrix
• The same correlation r, computed for two 2-D data sets:
[Figure: two scatter plots of points in the unit square; the correlation between the distance and incidence matrices is r = -0.9235 for the first data set and r = -0.5810 for the second.]
Visualization of similarity matrix
• Order the similarity matrix with respect to cluster labels and
inspect visually.
[Figure: a 2-D data set of well-separated clusters (left) and its point-by-point similarity matrix ordered by cluster label (right); similarity scale 0–1, with bright diagonal blocks.]
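A numpy sketch of the reordering step: shuffle a three-cluster data set, convert distances to similarities, and sort the similarity matrix by cluster label; the block-diagonal structure then becomes easy to check (the 1/(1+d) similarity is one simple choice assumed here).

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, (20, 2)) for c in (0.0, 2.0, 4.0)])
labels = np.repeat([0, 1, 2], 20)
perm = rng.permutation(60)                 # scramble the point order
X, labels = X[perm], labels[perm]

D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
S = 1.0 / (1.0 + D)                        # similarity in (0, 1]
order = np.argsort(labels, kind="stable")  # group points by cluster label
S_ordered = S[np.ix_(order, order)]

# Bright diagonal blocks: within-cluster similarity beats cross-cluster.
within = S_ordered[:20, :20].mean()
cross = S_ordered[:20, 20:40].mean()
print(within > cross)
```

Plotting `S_ordered` as a heatmap (e.g. with matplotlib's `imshow`) reproduces the block-diagonal picture on the slide.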
Visualization of similarity matrix
• Clusters in random data are not so crisp
[Figure: a random 2-D data set and its similarity matrix ordered by cluster label; the block-diagonal structure is much weaker.]
Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and frustrating part
of cluster analysis.
Without a strong effort in this direction, cluster analysis will remain a black art
accessible only to those true believers who have experience and great
courage.”
Algorithms for Clustering Data, Jain and Dubes
Roadmap: Method
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
– Feature selection - Yan
• Supervised learning (4 sessions)
– Regression models - Yan
– SVM and kernel SVM - Yan
– Tree-based models - Dario
– Bayesian method - Xiaoyang
– Ensemble models - Yan
• Unsupervised learning (3 sessions)
– K-means clustering
– DBSCAN - Cheng
– Mean shift
– Hierarchical clustering - Kunal
– Dimension reduction for data visualization - Yan
• Deep learning (4 sessions)
– Neural network
– From neural network to deep learning - Yan
– Convolutional neural network
– Train deep nets with open-source tools