Jorge Silva, Sr. Research Statistician Developer, SAS at MLconf ATL - 9/18/15

Copyright © 2012, SAS Institute Inc. All rights reserved.
DETERMINING THE NUMBER OF CLUSTERS
IN A DATASET USING ABC
I. KABUL, P. HALL, J. SILVA, W. SARLE
ENTERPRISE MINER R&D
SAS INSTITUTE

CLUSTERING
Objects within a
cluster are as
similar as possible
Objects from
different clusters
are as dissimilar as
possible
Hossein Parsaei

CHALLENGES IN CLUSTERING
• No prior knowledge
• Which similarity measure?
• Which clustering algorithm?
• How to evaluate the results?
• How many clusters?
The Aligned Box Criterion (ABC) addresses the unsolved, important
problem of determining the number of clusters in a data set.
ABC can be applied in Market Segmentation and many other types of
statistical, data mining and machine learning analyses.

CONTENTS
• Background
• Aligned Box Criterion (ABC) Method
• Results
• ABC Method in Parallel and Distributed Architecture
• Conclusions

BACKGROUND

FINDING THE RIGHT NUMBER OF CLUSTERS
• Many methods have been proposed:
• Calinski-Harabasz index [Calinski 1974]
• Cubic clustering criterion (CCC) [Sarle 1983]
• Silhouette statistic [Rousseeuw 1987]
• Gap statistic [Tibshirani 2001]
• Jump method [Sugar 2003]
• Prediction strength [Tibshirani 2005]
• Dirichlet process [Teh 2006]

WITHIN CLUSTER SUM OF SQUARES
• A good clustering yields clusters where
observations have small within-cluster
sum-of-squares (and high between-
cluster sum-of-squares).
• The within-cluster sum of squares is low when
the partition is good, BUT it is by construction
monotone nonincreasing (within-cluster
dissimilarity never increases as clusters
are added), so its minimum cannot select k

 

 


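$$D_r = \sum_{i \in C_r} \sum_{j \in C_r} \lVert x_i - x_j \rVert^2 = 2 n_r \sum_{i \in C_r} \lVert x_i - \bar{x}_r \rVert^2$$

$$W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r$$

Here $C_r$ is cluster $r$, $n_r$ its size, and $\bar{x}_r$ its mean.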
Within-cluster SSE:
Measure of compactness of
clusters
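The within-cluster sum of squares can be checked numerically. The sketch below is illustrative only (the function names and the toy points are assumptions, not part of the talk): it computes the quantity once from pairwise squared distances and once from distances to each cluster mean, which should agree for squared Euclidean distance.

```python
def pairwise_wk(points, labels):
    """W_k as the sum over clusters of D_r / (2 n_r), with D_r over ordered pairs."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        n_r = len(members)
        # D_r: squared Euclidean distance summed over all ordered pairs (i, j).
        d_r = sum(
            sum((a - b) ** 2 for a, b in zip(x_i, x_j))
            for x_i in members
            for x_j in members
        )
        total += d_r / (2 * n_r)
    return total


def centroid_wk(points, labels):
    """The same quantity computed from squared distances to each cluster mean."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        mean = [sum(col) / len(members) for col in zip(*members)]
        total += sum((v - m) ** 2 for x in members for v, m in zip(x, mean))
    return total


# Toy data (hypothetical): two clusters of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
labels = [0, 0, 1, 1]
w_pairwise = pairwise_wk(points, labels)
w_centroid = centroid_wk(points, labels)
```

The centroid form costs O(n) per cluster instead of O(n^2), which is why it is the one normally used in practice.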

BACKGROUND USING WK TO DETERMINE # OF CLUSTERS
Elbow method (L-curve method)
Idea:
use the k corresponding to the “elbow”
Problem:
no reference clustering to compare against;
the differences $W_k - W_{k-1}$ are not normalized for comparison

BACKGROUND REFERENCE DISTRIBUTIONS
• Cubic Clustering Criterion (CCC), Gap Statistic and ABC amplify the elbow
phenomenon by using the difference between the within-cluster sum of squares of a
clustering solution in the training data ($W_k$) and that of a clustering solution in a
reference distribution ($W_k^*$).
• Aligned box criterion (ABC)
• Gap statistic
• Cubic clustering criterion (CCC)
[Diagram: the three methods ordered by reference distribution complexity]
Cubic Clustering Criterion (CCC): SAS Technical Report A-108, 1983
Gap Statistic: Tibshirani et al, J.R. Statist. Soc., 2001

CCC METHOD
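Instead of using $W_k$ directly, CCC uses $R^2$.
$$R^2 = 1 - \frac{\operatorname{trace}(W)}{\operatorname{trace}(T)}, \qquad \operatorname{trace}(W) = W_k$$
For CCC calculation, $R^2$ and $E(R^2)$ are approximated by
heuristic formulas.
$$CCC = \log\left[\frac{1 - E(R^2)}{1 - R^2}\right] \frac{\sqrt{n p^* / 2}}{\left(0.001 + E(R^2)\right)^{1.2}}$$
Cubic Clustering Criterion (CCC): SAS Technical Report A-108, 1983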
The heuristic formulas were derived from numerous Monte Carlo simulations.
CCC generates one hypercube reference distribution, based on the dimensions of the
given training dataset, and uses it to test all k of interest.
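As a rough illustration of how the CCC value responds to its inputs, the sketch below evaluates the published form of the criterion (log ratio times sqrt(n p*/2), divided by (0.001 + E(R2)) to the power 1.2). The heuristic formulas for E(R2) and p* from the technical report are not reproduced here; both are treated as given inputs, and the function names and numbers are assumptions for illustration.

```python
import math


def r_squared(trace_w, trace_t):
    """R^2 = 1 - trace(W) / trace(T)."""
    return 1.0 - trace_w / trace_t


def ccc(r2, expected_r2, n, p_star):
    """CCC from observed R^2 and a supplied E(R^2); p_star is also supplied."""
    return (
        math.log((1.0 - expected_r2) / (1.0 - r2))
        * math.sqrt(n * p_star / 2.0)
        / (0.001 + expected_r2) ** 1.2
    )


# Hypothetical inputs: the observed R^2 exceeds the reference expectation.
r2 = r_squared(trace_w=20.0, trace_t=100.0)
better_than_reference = ccc(r2, expected_r2=0.6, n=200, p_star=2)
worse_than_reference = ccc(0.4, expected_r2=0.6, n=200, p_star=2)
same_as_reference = ccc(0.5, expected_r2=0.5, n=100, p_star=2)
```

The sign carries the interpretation: the value is positive when the observed R2 is higher than expected under the reference distribution, zero when they match, and negative otherwise.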

GAP STATISTICS METHOD
The Gap Statistic computes the (log) ratio $W_k^* / W_k$:
$$\mathrm{Gap}(k) = \log W_k^* - \log W_k$$
$W_k^*$ is calculated from a clustering solution in the reference distribution.
The method finds the k that maximizes Gap(k) (within some tolerance).

TWO TYPES OF UNIFORM DISTRIBUTIONS
1. Align with feature axes (data-geometry independent)
[Figure panels: Observations; Bounding Box (aligned with feature axes); Monte Carlo Simulations]

TWO TYPES OF UNIFORM DISTRIBUTIONS
2. Align with principal axes (data-geometry dependent)
[Figure panels: Observations; Bounding Box (aligned with principal axes); Monte Carlo Simulations]
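The two uniform reference distributions can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function names and the toy data are assumptions, and the principal axes are obtained from an SVD of the centered data.

```python
import numpy as np


def uniform_in_feature_box(X, rng):
    """Sample n points uniformly in the bounding box aligned with the feature axes."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return rng.uniform(lo, hi, size=X.shape)


def uniform_in_principal_box(X, rng):
    """Sample n points uniformly in the bounding box aligned with the principal axes."""
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt.T
    lo, hi = projected.min(axis=0), projected.max(axis=0)
    sample = rng.uniform(lo, hi, size=projected.shape)
    # Rotate the sample back into the original feature space.
    return sample @ vt + mean


rng = np.random.default_rng(0)
# Toy data (hypothetical): an elongated cloud along the diagonal.
t = rng.normal(size=500)
X = np.column_stack([t, t + 0.1 * rng.normal(size=500)])
Z_axis = uniform_in_feature_box(X, rng)
Z_pca = uniform_in_principal_box(X, rng)
```

On diagonal data like this, the feature-aligned box covers a large empty square, while the principal-axes box hugs the cloud. That is the sense in which the second option depends on the data geometry.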

COMPUTATION OF THE GAP STATISTIC
for b = 1 to B
    Compute Monte Carlo sample $X_{1b}, X_{2b}, \ldots, X_{nb}$ (n is # obs.)
for k = 1 to K
    Cluster the observations into k groups and compute $\log W_k$
    for b = 1 to B
        Cluster the M.C. sample into k groups and compute $\log W_{kb}$
    Compute $\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log W_{kb} - \log W_k$
    Compute sd(k), the standard deviation of $\{\log W_{kb}\}_{b=1,\ldots,B}$
    Set the total s.e. $s_k = \sqrt{1 + 1/B}\,\mathrm{sd}(k)$
Find the smallest k such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$
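The gap statistic procedure can be sketched end to end with NumPy. This is a minimal sketch under stated assumptions, not the implementation used in the talk: the clustering step is a small Lloyd's k-means with k-means++ style seeding (a choice made here for self-containment), the reference distribution is the feature-aligned bounding box, and all names and the toy data are hypothetical.

```python
import numpy as np


def kmeans_wk(X, k, rng, n_init=3, n_iter=30):
    """Smallest within-cluster sum of squares found over a few k-means restarts."""
    best = np.inf
    for _ in range(n_init):
        # k-means++ style seeding: new centers are drawn proportionally to squared distance.
        centers = X[[rng.integers(len(X))]]
        while len(centers) < k:
            d2 = ((X[:, None, :] - centers[None]) ** 2).sum(axis=2).min(axis=1)
            centers = np.vstack([centers, X[rng.choice(len(X), p=d2 / d2.sum())]])
        for _ in range(n_iter):
            labels = ((X[:, None, :] - centers[None]) ** 2).sum(axis=2).argmin(axis=1)
            # Empty clusters keep their previous center.
            centers = np.array([
                X[labels == r].mean(axis=0) if np.any(labels == r) else centers[r]
                for r in range(k)
            ])
        d2 = ((X[:, None, :] - centers[None]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())
    return best


def gap_statistic(X, k_max, n_ref=10, seed=0):
    """Return (selected k, array of Gap(1..k_max)) using the smallest-k rule."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    refs = [rng.uniform(lo, hi, size=X.shape) for _ in range(n_ref)]
    gap = np.zeros(k_max + 1)  # indexed by k; entry 0 is unused
    s = np.zeros(k_max + 1)
    for k in range(1, k_max + 1):
        log_wk = np.log(kmeans_wk(X, k, rng))
        log_wkb = np.array([np.log(kmeans_wk(Z, k, rng)) for Z in refs])
        gap[k] = log_wkb.mean() - log_wk
        s[k] = np.sqrt(1.0 + 1.0 / n_ref) * log_wkb.std()
    for k in range(1, k_max):
        if gap[k] >= gap[k + 1] - s[k + 1]:
            return k, gap[1:]
    return k_max, gap[1:]


# Toy data (hypothetical): three well-separated blobs in two dimensions.
rng = np.random.default_rng(1)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(60, 2)) for c in blob_centers])
k_hat, gaps = gap_statistic(X, k_max=5)
```

Every tested k reuses the same B reference samples, which is the property ABC changes by regenerating the reference distribution at each k.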

GAP STATISTIC

NO-CLUSTER EXAMPLE
(JOURNAL VERSION)

ABC (ALIGNED BOX CRITERION)

ABC METHOD
ABC improves upon CCC and the Gap Statistic by generating better estimates for $W_k^*$.
ABC uses a separate reference distribution for each tested k (k is the number of clusters).
• Data-driven Monte Carlo simulation of the reference distribution at each tested k.
• The reference distribution is k uniform hyperboxes aligned with the principal
components from the clustering solution of the input data.
[Figure panels: Gap Statistic Reference Distribution; ABC Reference Distribution]

ABC METHOD
Why multiple reference distributions?
• The gap statistic performs hypothesis testing between k clusters/no-clusters for the whole
input space
• ABC is similar to recursive hypothesis testing between 1 cluster/2 clusters for each of
the k candidate clusters
• More stringent test. It is harder for larger k to pass this test. This is desirable.
[Figure panels: Gap Statistic Reference Distribution; ABC Reference Distribution]

ESTIMATING k REFERENCE DISTRIBUTIONS
Sample Data

Aligned Box Criterion

ALIGNED BOX CRITERION (ABC)
for k = 1 to K
    Cluster the observations into k groups and compute $\log W_k$
    for b = 1 to B
        Considering each of the k clusters separately,
        compute Monte Carlo sample $X_{1b}, X_{2b}, \ldots, X_{nb}$ (n is # obs.)
        Cluster the M.C. sample into k groups and compute $\log W_{kb}$
    Compute $\mathrm{ABC}(k) = \log W_k^{+} - \log W_k$,
    where $\log W_k^{+}$ is obtained from the reference values $\log W_{kb}$
    Compute sd(k), the s.d. of $\{\log W_{kb}\}_{b=1,\ldots,B}$
    Set the total s.e. $s_k = \sqrt{1 + 1/B}\,\mathrm{sd}(k)$
Find the smallest k such that $\mathrm{ABC}(k) \ge \mathrm{ABC}(k+1) - s_{k+1}$
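The step that distinguishes ABC from the gap statistic is the per-cluster reference sample. The sketch below shows one plausible way to generate it, assuming cluster labels are already available from any clustering routine. The function name, the choice of one reference point per observation in each cluster, and the toy data are assumptions made for illustration.

```python
import numpy as np


def abc_reference_sample(X, labels, rng):
    """Draw one reference sample: a uniform box per cluster, aligned with that cluster's principal axes."""
    Z = np.empty_like(X, dtype=float)
    for r in np.unique(labels):
        mask = labels == r
        members = X[mask]
        mean = members.mean(axis=0)
        centered = members - mean
        # Principal axes of this cluster only (rows of vt); full_matrices=True
        # keeps vt square even when the cluster has fewer points than features.
        _, _, vt = np.linalg.svd(centered, full_matrices=True)
        projected = centered @ vt.T
        lo, hi = projected.min(axis=0), projected.max(axis=0)
        # As many reference points as cluster members, uniform inside the aligned box.
        sample = rng.uniform(lo, hi, size=projected.shape)
        Z[mask] = sample @ vt + mean
    return Z


rng = np.random.default_rng(0)
# Toy data (hypothetical): two elongated clusters with different orientations.
t = 2.0 * rng.normal(size=200)
u = 2.0 * rng.normal(size=200)
cluster_a = np.column_stack([t, t + 0.1 * rng.normal(size=200)])
cluster_b = np.column_stack([20.0 + u, -u + 0.1 * rng.normal(size=200)])
X = np.vstack([cluster_a, cluster_b])
labels = np.repeat([0, 1], 200)
Z = abc_reference_sample(X, labels, rng)
```

Clustering `Z` into k groups and taking the log within-cluster sum of squares gives one reference value for that k; repeating over B draws supplies the values used in the criterion.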

ABC METHOD
RESULTS

[Figure: Gap Statistic vs. Aligned Box Criterion. Annotation: $W_k^*$ decreases faster.]

[Figure: Gap Statistic vs. Aligned Box Criterion. Annotation: clearer maxima.]

RESULTS SIMULATED: SEVEN OVERLAPPING CLUSTERS
• Observations: 7,000
• Variables: 2
• Monte Carlo Replications: 20
[Figure panels: CCC method; ABC method]

ESTIMATING k CLAIMS PREDICTION CHALLENGE DATA
• Anonymized customer data
• 32 customer and product features
• 13,184,290 customer records

ESTIMATING k EXECUTING CALCULATIONS
• Cubic clustering criterion: PROC FASTCLUS
• Gap statistic: R cluster package in the Open Source Integration Node in
SAS Enterprise Miner
• Aligned box criterion: PROC HPCLUS

ESTIMATING k INTERPRETING RESULTS
Cubic Clustering Criterion

Gap Statistic

REFERENCE DISTRIBUTION EFFECT OF CHANGING NUMBER OF OBSERVATIONS
• How does the number of observations in the reference distribution affect the result?
• For an input dataset with n observations, we generated w*n observations
in the reference distribution, where w is between 0 and 1

RESULTS SIMPLE CASE

RESULTS DATA SET WITH MORE CLUSTERS

RESULTS DATA SET WITH MORE OBSERVATIONS

RESULTS REAL DATA
Kaggle Claims Prediction Challenge (n= 13,184,290, p= 35), 50 runs

RESULTS SCALABILITY

RESULTS STABILITY

ABC METHOD
FOR PARALLEL AND DISTRIBUTED ARCHITECTURES

PARALLEL ABC PART 1-2
Part 1 [Diagram: Root, Node1, Node2, Node3, ..., NodeN]
1) Run k-means clustering (in parallel) for k clusters
2) Assign each observation to a cluster
3) Compute $W_k$
Part 2 [Diagram: Node1, Node2, Node3, ..., NodeN]
1) Assign each cluster to a node
2) Collect the XX’ matrix for each cluster
in the assigned node using a tree-based algorithm
3) Do PCA using the XX’ matrix
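The collection of the XX’ matrix can be pictured as a reduction over per-node sufficient statistics. The sketch below simulates nodes as array slices in one process and merges them pairwise in a tree shape; it is an illustration only. It assumes observations are stored as rows, so the cross-product matrix is computed as `X.T @ X` (features by features), and all function names are hypothetical.

```python
import numpy as np


def local_stats(X_part):
    """Per-node statistics for one cluster: count, feature sums, cross-product matrix."""
    return len(X_part), X_part.sum(axis=0), X_part.T @ X_part


def combine(a, b):
    """Merge the statistics of two nodes; every component is additive."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def tree_reduce(items):
    """Pairwise (tree-shaped) reduction, standing in for node-to-node merging."""
    while len(items) > 1:
        merged = [combine(items[i], items[i + 1]) for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:
            merged.append(items[-1])
        items = merged
    return items[0]


def principal_axes(n, total, cross):
    """PCA from the merged statistics: eigen-decomposition of the covariance matrix."""
    mean = total / n
    cov = cross / n - np.outer(mean, mean)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3)) * np.array([3.0, 1.0, 0.3])
node_parts = np.array_split(X, 5)  # five simulated nodes holding one cluster's rows
n, total, cross = tree_reduce([local_stats(part) for part in node_parts])
eigvals, axes = principal_axes(n, total, cross)
```

Because the statistics are additive, only a small matrix per cluster travels between nodes, regardless of how many observations each node holds.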

PARALLEL ABC PART 3-4
Part 3 [Diagram: Node1, Node2, Node3, ..., NodeN]
1) Eigenvectors are broadcast to every node
2) Based on their assigned clusters,
the observations in each node are projected into
the new space
Part 4 [Diagrams: Node1 ... NodeN; Root with Node1 ... NodeN; Node1 ... NodeN]
1) Bounding boxes are computed locally at each node for each cluster
2) Bounding box information from each node is collected at the root
and the root computes the bounding box coordinates for each cluster
3) This information is distributed to each node and each node generates
reference distributions
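The bounding-box exchange reduces to elementwise minima and maxima. The sketch below simulates it in a single process, with nodes represented as slices of one array; the function names and the toy data are assumptions for illustration, and the observations are taken to be already projected onto the principal axes.

```python
import numpy as np


def local_boxes(projected, labels, k):
    """Per-node (k, p) lower and upper bounds; clusters absent on the node stay at +/- inf."""
    p = projected.shape[1]
    lo = np.full((k, p), np.inf)
    hi = np.full((k, p), -np.inf)
    for r in range(k):
        members = projected[labels == r]
        if len(members):
            lo[r], hi[r] = members.min(axis=0), members.max(axis=0)
    return lo, hi


def merge_boxes(boxes):
    """Root-side merge: elementwise min of lower bounds and max of upper bounds."""
    los, his = zip(*boxes)
    return np.minimum.reduce(los), np.maximum.reduce(his)


rng = np.random.default_rng(0)
k = 3
projected = rng.normal(size=(90, 2))
labels = np.arange(90) % k
# Three simulated nodes, each holding a slice of the rows.
node_rows = np.array_split(np.arange(90), 3)
boxes = [local_boxes(projected[rows], labels[rows], k) for rows in node_rows]
lo, hi = merge_boxes(boxes)
```

Only `2 * k * p` numbers per node reach the root, and the merged boxes match what a single machine would compute on the full data.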

PARALLEL ABC PART 5
[Diagram: Root, Node1, Node2, Node3, ..., NodeN]
Run k-means clustering in parallel for the reference distribution and compute $W_k^{+}$
Do this for B reference distributions
Compute ABC(k)

PARALLEL ABC PART 6
What about the O(n^3) complexity of SVD???
- Computation of XX’ is parallelized
- Or, do stochastic SVD
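The slide does not say which stochastic SVD is meant. One common reading is a randomized range finder followed by an exact SVD of a much smaller matrix, and the sketch below follows that reading as an assumption. The function name, the oversampling and power-iteration defaults, and the test matrix are all illustrative.

```python
import numpy as np


def randomized_svd(A, rank, n_oversample=10, n_power_iter=2, seed=0):
    """Approximate the top `rank` singular triplets of A via random projection."""
    rng = np.random.default_rng(seed)
    # Project onto a random subspace slightly larger than the target rank.
    omega = rng.normal(size=(A.shape[1], rank + n_oversample))
    Y = A @ omega
    # Power iterations sharpen the separation between large and small singular values.
    for _ in range(n_power_iter):
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)
    # Exact SVD of the small projected matrix.
    B = Q.T @ A
    u_small, s, vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ u_small)[:, :rank], s[:rank], vt[:rank]


rng = np.random.default_rng(1)
# Toy matrix (hypothetical): exactly rank 3, so the approximation should be near exact.
A = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 20))
U, s, Vt = randomized_svd(A, rank=3)
s_exact = np.linalg.svd(A, compute_uv=False)[:3]
```

The expensive products involve only tall, thin matrices, so they parallelize across rows in the same way as the cross-product computation.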

ABC METHOD
CONCLUSION

RESULTS
More accurate reference distributions lead to:
• Better defined maxima.
• $W_k^*$ values decreasing rapidly, especially for K > k.
• Exposure of possible alternative solutions.

CONCLUSION
For large, high-dimensional, or noisy data, ABC is found
to be:
• Stable
• Scalable
Moreover, it exhibits desirable properties:
• Clearer peaks
• More stringent hypothesis test promotes smaller k values
