Clustering - ACM 2013 02-25

Fast Single-pass k-means Clustering

whoami – Ted Dunning
• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software Foundation
– particularly Mahout, Zookeeper and Drill
• Contact me at
tdunning@maprtech.com
tdunning@apache.com
ted.dunning@gmail.com
@ted_dunning

Agenda
• Rationale
• Theory
– clusterable data, k-means failure modes, sketches
• Algorithms
– ball k-means, surrogate methods
• Implementation
– searchers, vectors, clusterers
• Results
• Application

Why k-means?
• Clustering allows fast search
– k-nn models allow agile modeling
– lots of data points, 10^8 typical
– lots of clusters, 10^4 typical
• Model features
– Distance to nearest centroids
– Poor man’s manifold discovery

What is Quality?
• Robust clustering not a goal
– we don’t care if the same clustering is replicated
• Generalization to unseen data critical
– number of points per cluster
– distance distributions
– target function distributions
– model performance stability
• Agreement to “gold standard” is a non-issue

The Problem
• Spirals are a classic “counter” example for k-means
• Classic low dimensional manifold with added noise
• But clustering still makes modeling work well

The Cluster Proximity Features
• Every point can be described by the nearest cluster
– 4.3 bits per point in this case
– Significant error that can be decreased (to a point) by increasing the number of clusters
• Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)
– Error is negligible
– Unwinds the data into a simple representation
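
A minimal Python sketch of the two-nearest-cluster encoding described above. The function name `proximity_features` and the toy centroids are illustrative assumptions, and the sign bit is left out because the slide does not say how it is defined. The 4.3 bits figure would match about 20 clusters (log2 20 ≈ 4.32), which is an inference rather than something the slide states.

```python
import math

def proximity_features(point, centroids):
    """Describe a point by its two nearest centroids and the distance to each."""
    ranked = sorted(range(len(centroids)),
                    key=lambda j: math.dist(point, centroids[j]))
    first, second = ranked[0], ranked[1]
    return (first, second,
            math.dist(point, centroids[first]),
            math.dist(point, centroids[second]))

# Hypothetical centroids, for illustration only.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
features = proximity_features((1.0, 0.0), centroids)

# Bits needed to name one cluster out of 20 (an assumed cluster count).
bits_per_cluster_id = math.log2(20)
```

The two cluster ids locate the point coarsely and the two distances refine the position between those clusters.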

Diagonalized Cluster Proximity

The Limiting Case
• Too many clusters lead to over-fitting
• Which we mitigate by averaging over several nearby clusters
• In the limit we get k-nn modeling
– and probably use k-means to speed up search

Intuitive Theory
• Traditionally, minimize over all distributions
– optimization is NP-complete
– that isn’t like real data
• Recently, assume well-clusterable data
• Interesting approximation bounds provable
$\sigma^2 \Delta_{k-1}^2(X) > \Delta_k^2(X)$
$1 + O(\sigma^2)$

For Example
Grouping these two clusters seriously hurts squared distance
$\Delta_4^2(X) > \frac{1}{\sigma^2} \Delta_5^2(X)$

Lloyd’s Algorithm
• Part of CS folklore
• Developed in the late 50’s for signal quantization, published in the 80’s
initialize k cluster centroids somehow
for each of many iterations:
    for each data point:
        assign point to nearest cluster
    recompute cluster centroids from points assigned to clusters
• Highly variable quality, several restarts recommended
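
The pseudocode above can be sketched in plain Python as follows. This is a minimal illustration, not the Mahout implementation: the "somehow" initialization is assumed here to be a random sample of the data points, and the function names are made up for the example.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, k, iterations=10, seed=0):
    """Plain Lloyd iteration; returns k centroids as lists of floats."""
    rng = random.Random(seed)
    # Assumption: initialize "somehow" by sampling k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its assigned points.
        for j, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = sorted(lloyd(points, 2))
```

On harder data the result depends on the seed, which is why several restarts are usually recommended.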

Typical k-means Failure
Selecting two seeds here cannot be fixed with Lloyd’s
Result is that these two clusters get glued together

Ball k-means
• Provably better for highly clusterable data
• Tries to find initial centroids in the “core” of each real cluster
• Avoids outliers in centroid computation
initialize centroids randomly with distance maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer to their own centroid than to the closest other cluster
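
A hedged Python sketch of the two ideas above: seeding that favours far-apart points, and a trimmed centroid update. The seeding rule (probability proportional to squared distance from the seeds chosen so far) and the `trim=1/3` ball radius are assumptions made for this example. The slide does not give exact values, and all names are illustrative. It assumes k ≥ 2 and distinct data points.

```python
import math
import random

def seed_centroids(points, k, rng):
    """Pick k seeds at random, favouring points far from seeds already chosen."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        weights = [min(math.dist(p, s) for s in seeds) ** 2 for p in points]
        seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return [list(s) for s in seeds]

def ball_kmeans(points, initial_centroids, iterations=3, trim=1 / 3):
    """Lloyd-style update that ignores points outside each centroid's core ball."""
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    for _ in range(iterations):
        # Core radius: a fraction of the distance to the nearest other centroid.
        radii = [trim * min(math.dist(c, other)
                            for j, other in enumerate(centroids) if j != i)
                 for i, c in enumerate(centroids)]
        balls = [[] for _ in range(k)]
        for p in points:
            own = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            if math.dist(p, centroids[own]) < radii[own]:
                balls[own].append(p)
        for j, members in enumerate(balls):
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids

# Two tight clusters plus one outlier at (5, 5).
points = [(0, 0), (0, 2), (2, 0), (2, 2),
          (10, 10), (10, 12), (12, 10), (12, 12), (5, 5)]
result = ball_kmeans(points, [[0.0, 0.0], [12.0, 12.0]], iterations=2)
```

In this toy run the outlier never falls inside a core ball, so it does not pull either centroid away from its cluster.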

Still Not a Win
• Ball k-means is nearly guaranteed to succeed with k = 2
• Probability of successful seeding drops exponentially with k
• Alternative strategy has high probability of success, but takes O(nkd + k^3 d) time

Surrogate Method
• Start with sloppy clustering into κ = k log n clusters
• Use this sketch as a weighted surrogate for the data
• Cluster surrogate data using ball k-means
• Results are provably good for highly clusterable data
• Sloppy clustering is on-line
• Surrogate can be kept in memory
• Ball k-means pass can be done at any time

Algorithm Costs
• O(k d) per point per iteration for Lloyd’s algorithm
• Number of iterations not well known
• Assuming at least log n iterations is reasonable, giving O(k d log n) per point

Algorithm Costs
• Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids
  O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
  O(κ d log k) or O(d log κ log k) for larger k, looser quality
– result is k high-quality centroids
• Even the sloppy surrogate may suffice

Algorithm Costs
• How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000,000 (logs base 2)
– k d log n = 2000 x 10 x 26 ≈ 500,000
– d (log k + log log n) = 10(11 + 5) = 160
– roughly 3,000 times faster is a bona fide big deal
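
The comparison can be checked with a few lines of Python. This sketch assumes base-2 logarithms and n = 10^8, the value for which log n ≈ 26 as used in the arithmetic above. The function names are illustrative.

```python
import math

def lloyd_cost_per_point(k, d, n):
    """k * d work per iteration, with about log2(n) iterations assumed."""
    return k * d * math.log2(n)

def sketch_cost_per_point(k, d, n):
    """d * log2(kappa) work per point, with kappa = k * log2(n) sketch centroids."""
    kappa = k * math.log2(n)
    return d * math.log2(kappa)

k, d, n = 2000, 10, 100_000_000
lloyd_cost = lloyd_cost_per_point(k, d, n)
sketch_cost = sketch_cost_per_point(k, d, n)
speedup = lloyd_cost / sketch_cost
```

Without rounding the logarithms the ratio comes out a little above 3,000, so the order of magnitude quoted above holds.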

Pragmatics
• But this requires a fast search internally
• Have to cluster on the fly for sketch
• Have to guarantee sketch quality
• Previous methods had very high complexity

How It Works
• For each point
– Find approximately nearest centroid (distance = d)
– If (d > threshold) new centroid
– Else if (u < d/threshold) new centroid, where u is uniform random in [0, 1)
– Else add to nearest centroid
• If number of centroids > κ ≈ C log N
– Recursively cluster centroids with higher threshold
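
The steps above can be sketched in Python as follows. This is a simplified illustration rather than the Mahout code: it uses brute-force search where the real system would use a fast approximate searcher, the `growth` factor for raising the threshold is an assumed value, and all names are illustrative. A new centroid is started with probability d/threshold, which also covers the d > threshold case.

```python
import math
import random

def absorb(centroids, point, weight, threshold, rng):
    """Merge a weighted point into the nearest centroid or start a new one."""
    if centroids:
        nearest = min(centroids, key=lambda c: math.dist(c[0], point))
        d = math.dist(nearest[0], point)
        if rng.random() >= d / threshold:
            # Close enough: fold the point into the nearest centroid's mean.
            mean, w = nearest
            total = w + weight
            nearest[0] = [(m * w + x * weight) / total for m, x in zip(mean, point)]
            nearest[1] = total
            return
    centroids.append([list(point), weight])

def streaming_sketch(points, max_centroids, threshold=1.0, growth=1.5, seed=0):
    """Single pass over the data; returns a list of [mean, weight] centroids."""
    rng = random.Random(seed)
    centroids = []
    for p in points:
        absorb(centroids, p, 1, threshold, rng)
        while len(centroids) > max_centroids:
            # Too many centroids: raise the threshold and re-cluster the sketch.
            threshold *= growth
            old, centroids = centroids, []
            for mean, w in old:
                absorb(centroids, mean, w, threshold, rng)
    return centroids

data_rng = random.Random(42)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(500)]
sketch = streaming_sketch(data, max_centroids=20)
```

The weights in the sketch always add up to the number of points seen, which is what lets the sketch stand in for the original data.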

Resulting Surrogate
• Result is large set of centroids
– these provide an approximation of the original distribution
– we can cluster the centroids to get a close approximation of clustering the original data
– or we can just use the result directly
• Either way, we win

How Can We Search Faster?
• First rule: don’t do it
– If we can eliminate most candidates, we can do less work
– Projection search and k-means search
• Second rule: don’t do it
– We can convert big floating point math to clever bit-wise integer math
– Locality sensitive hashing
• Third rule: reduce dimensionality
– Projection search
– Random projection for very high dimension

Projection Search
total ordering!
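
One way to read the "total ordering" remark: projecting every vector onto a single random direction gives a scalar key, the keys can be sorted, and a query only needs to examine its neighbors in that order. The Python sketch below illustrates that idea under this assumption. The class name, the single projection, and the `window` size are all hypothetical and not taken from Mahout's ProjectionSearch.

```python
import bisect
import math
import random

class ProjectionIndex:
    """Toy nearest-neighbor index based on one random projection."""

    def __init__(self, points, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Sorting by the projected value gives a total ordering of the points.
        self.entries = sorted((self._project(p), p) for p in points)
        self.keys = [key for key, _ in self.entries]

    def _project(self, p):
        return sum(a * b for a, b in zip(self.direction, p))

    def search(self, query, window=10):
        """Check only points whose projection is close to the query's projection."""
        i = bisect.bisect_left(self.keys, self._project(query))
        candidates = self.entries[max(0, i - window): i + window]
        return min((p for _, p in candidates), key=lambda p: math.dist(query, p))

rng = random.Random(7)
points = [(rng.random(), rng.random(), rng.random()) for _ in range(100)]
index = ProjectionIndex(points)
```

A single projection can miss the true nearest neighbor, so a practical version would probably combine several projections and a wider window.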

LSH Search
• Each random projection produces an independent sign bit
• If two vectors have the same projected sign bits, they probably point in the same direction (i.e. cos θ ≈ 1)
• Distance in L2 is closely related to cosine
• We can replace (some) vector dot products with long integer XOR
$\|x - y\|^2 = \|x\|^2 - 2 (x \cdot y) + \|y\|^2 = \|x\|^2 - 2 \|x\| \|y\| \cos\theta + \|y\|^2$
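
A small Python sketch of sign-bit hashing with random projections. Each projection contributes one bit, two hashes are compared with XOR and a bit count, and the fraction of differing bits estimates the angle, since two vectors at angle θ disagree on any one bit with probability θ/π. The function names and the choice of 64 bits are assumptions for the example.

```python
import math
import random

def make_projections(dim, bits=64, seed=0):
    """One random Gaussian direction per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def sign_hash(vector, projections):
    """Pack the sign of each projection into one integer."""
    h = 0
    for direction in projections:
        dot = sum(a * b for a, b in zip(direction, vector))
        h = (h << 1) | int(dot > 0)
    return h

def estimated_cosine(h1, h2, bits=64):
    """Estimate cos(theta) from the number of differing bits."""
    differing = bin(h1 ^ h2).count("1")
    return math.cos(math.pi * differing / bits)

projections = make_projections(dim=5)
v = [1.0, -2.0, 0.5, 3.0, 1.5]
same_direction = estimated_cosine(sign_hash(v, projections),
                                  sign_hash([2 * x for x in v], projections))
opposite = estimated_cosine(sign_hash(v, projections),
                            sign_hash([-x for x in v], projections))
```

The estimate is coarse with only 64 bits, so it is best used to filter candidates before computing exact distances.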

LSH Bit-match Versus Cosine
[Figure: cosine (vertical axis, -1 to 1) versus number of matching LSH bits (horizontal axis, 0 to 64)]

The Internals
• Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid
• Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute
• Super-fast clustering
– Kmeans, StreamingKmeans

Parallel Speedup?
[Figure: time per point (μs) versus number of threads (1 to 20), comparing the threaded version, the non-threaded version, and perfect scaling]

What About Map-Reduce?
• Map-reduce implementation is nearly trivial
– Compute surrogate on each split
– Total surrogate is union of all partial surrogates
– Do in-memory clustering on total surrogate
• Threaded version already shows linear speedup
• Map-reduce version shows the same linear speedup
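
The map-reduce structure can be sketched in a few lines of Python. Here `split_surrogate` is a deliberately crude stand-in for the sloppy clusterer (one weighted centroid per grid cell), chosen only to keep the example short. The function names and the grid are assumptions, not the method used in the talk.

```python
import math
from collections import defaultdict

def split_surrogate(points, cell=1.0):
    """Stand-in sketch for one split: a weighted centroid per grid cell."""
    cells = defaultdict(list)
    for p in points:
        cells[tuple(math.floor(x / cell) for x in p)].append(p)
    return [([sum(c) / len(members) for c in zip(*members)], len(members))
            for members in cells.values()]

def total_surrogate(splits, cell=1.0):
    """Map: sketch each split independently. Reduce: take the union."""
    surrogate = []
    for split in splits:
        surrogate.extend(split_surrogate(split, cell))
    return surrogate

splits = [
    [(0.1, 0.2), (0.3, 0.4), (5.0, 5.0)],
    [(0.2, 0.2), (9.5, 9.5)],
]
surrogate = total_surrogate(splits)
```

The union keeps every weighted centroid from every split, so the final in-memory clustering sees the full weight of the original data.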

How Well Does it Work?
• Theoretical guarantees for well clusterable data
– Shindler, Wong and Meyerson, NIPS, 2011
• Evaluation on synthetic data
– Rough clustering produces correct surrogates
– Ball k-means strategy 1 performance is very good with large k

How Well Does it Work?
• Empirical evaluation on 20 newsgroups
• Algorithms compared: ball k-means alone versus streaming k-means followed by ball k-means
• Results
– Average distance to nearest cluster on held-out data is the same or slightly smaller
– Median distance to nearest cluster is smaller
– More than 10x faster (I/O and encoding limited)

The Business Case
• Our customer has 100 million cards in circulation
• Quick and accurate decision-making is key.
– Marketing offers
– Fraud prevention

Opportunity
• Demand for modeling is increasing rapidly
• So they are testing something simpler and more agile
• Like k-nearest neighbor

What’s that?
• Find the k nearest training examples – lookalike customers
• This is easy … but hard
– easy because it is so conceptually simple and you don’t have knobs to turn or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results
• Initial rapid prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time

Required Scale and Speed and Accuracy
• Want 20 million queries against 25 million references in 10,000 s
• Should be able to search > 100 million references
• Should be linearly and horizontally scalable
• Must have >50% overlap against reference search

How Hard is That?
• 20 M x 25 M x 100 Flop = 50 P Flop
• 1 CPU = 5 Gflops
• We need 10 M CPU seconds => 1,000 CPUs to finish in 10,000 s
• Real-world efficiency losses may increase that by 10x, to 10,000 CPUs
• Not good!
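
The estimate can be reproduced with integer arithmetic in Python. The inputs are the figures from the slides (20 M queries, 25 M references, 100 Flop per comparison, 5 Gflops per CPU, a 10,000 s budget, and a 10x efficiency allowance). The variable names are illustrative.

```python
queries = 20_000_000
references = 25_000_000
flop_per_comparison = 100
flops_per_cpu = 5_000_000_000      # 5 Gflops
time_budget_seconds = 10_000
efficiency_penalty = 10            # real-world losses

total_flop = queries * references * flop_per_comparison   # 50 PFlop
cpu_seconds = total_flop // flops_per_cpu                  # 10 M CPU seconds
cpus_ideal = cpu_seconds // time_budget_seconds
cpus_realistic = cpus_ideal * efficiency_penalty
```

Brute force therefore needs on the order of a thousand CPUs at perfect efficiency and about ten thousand once the 10x allowance is applied.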

K-means Search
• First do clustering with lots (thousands) of clusters
• Then search nearest clusters to find nearest points
• We win if we find >50% overlap with “true” answer
• We lose if we can’t cluster super-fast
– more on this later
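
A toy Python sketch of cluster-based search and of the overlap measure. Reference points are bucketed by nearest centroid, and a query scans only the buckets of its few nearest centroids. The centroids here are just a random sample of the data, standing in for a real clustering, and `probes`, `top`, and the function names are assumptions for the example.

```python
import math
import random

def build_index(points, centroids):
    """Bucket each reference point under its nearest centroid."""
    buckets = [[] for _ in centroids]
    for p in points:
        j = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        buckets[j].append(p)
    return buckets

def kmeans_search(query, centroids, buckets, probes=2, top=5):
    """Search only the buckets of the `probes` centroids nearest to the query."""
    nearest = sorted(range(len(centroids)),
                     key=lambda j: math.dist(query, centroids[j]))[:probes]
    candidates = [p for j in nearest for p in buckets[j]]
    return sorted(candidates, key=lambda p: math.dist(query, p))[:top]

def overlap(approximate, exact):
    """Fraction of the exact answer that the approximate search recovered."""
    return len(set(approximate) & set(exact)) / len(exact)

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(200)]
centroids = rng.sample(points, 10)   # stand-in for a real clustering
buckets = build_index(points, centroids)

query = (0.5, 0.5)
exact = sorted(points, key=lambda p: math.dist(query, p))[:5]
full_probe = kmeans_search(query, centroids, buckets, probes=len(centroids))
partial_probe = kmeans_search(query, centroids, buckets, probes=2)
```

Scanning every bucket reproduces the brute-force answer, and lowering `probes` trades overlap for speed.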

Some Details
• Clumpy data works better
– Real data is clumpy
• Speedups of 100-200x seem practical with 50% overlap
– Projection search and LSH give additional 100x
• More experiments needed

Summary
• Nearest neighbor algorithms can be blazing fast
• But you need blazing fast clustering
– Which we now have

Contact Me!
• We’re hiring at MapR in US and Europe
• MapR software available for research use
• Come get the slides at http://www.mapr.com/company/events/acmsf-2-25-13
• Get the code as part of Mahout trunk
• Contact me at tdunning@maprtech.com or @ted_dunning
