Large-scale Single-pass k-Means Clustering at Scale
Ted Dunning
Large-scale Single-pass k-Means Clustering
Large-scale k-Means Clustering
Goals
• Cluster very large data sets
• Facilitate large-scale nearest neighbor search
• Allow a very large number of clusters
• Achieve good quality
– low average distance to nearest centroid on held-out data
• Based on Mahout Math
• Runs on a Hadoop (really MapR) cluster
• FAST – cluster tens of millions of points in minutes
Non-goals
• Use map-reduce (but it is there)
• Minimize the number of clusters
• Support metrics other than L2
Anti-goals
• Multiple passes over original data
• Scale as O(k n)
Why?
K-nearest Neighbor with Super Fast k-means
What’s that?
• Find the k nearest training examples
• Use the average value of the target variable from them
• This is easy … but hard
– easy because it is so conceptually simple and you don’t have knobs to turn or models to build
– hard because of the stunning amount of math
– also hard because we need the top 50,000 results
• Initial prototype was massively too slow
– 3K queries × 200K examples took hours
– needed 20M × 25M in the same time
How We Did It
• 2-week hackathon with 6 developers from a customer bank
• Agile-ish development
• To avoid IP issues
– all code is Apache-licensed (no ownership question)
– all data is synthetic (no question of private data)
– all development done on individual machines, hosted on GitHub
– open is easier than closed (in this case)
• Goal is new open technology to facilitate new closed solutions
• Ambitious goal of ~1,000,000x speedup
– well, really only 100-1000x after basic hygiene
What We Did
• Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid
• Shared memory matrix
– FileBasedMatrix uses mmap to share very large dense matrices
• Searcher interface (sketched below)
– ProjectionSearch, KmeansSearch, LshSearch, Brute
• Super-fast clustering
– Kmeans, StreamingKmeans
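The Searcher abstraction is what makes the pieces composable: every strategy indexes vectors and answers nearest-neighbor queries behind the same interface. A minimal sketch of the shape (simplified for illustration; these are not the exact Mahout signatures):

```java
import java.util.List;

// Simplified sketch of the Searcher idea, not the exact Mahout API:
// any implementation (Brute, ProjectionSearch, LshSearch, KmeansSearch)
// indexes vectors and answers k-nearest-neighbor queries the same way.
interface Searcher<V> {
  void add(V vector);                 // index one vector
  List<V> search(V query, int limit); // return the `limit` nearest indexed vectors
}
```

Because the strategies are interchangeable, clustering code can start with Brute and swap in ProjectionSearch without changing anything else.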
Projection Search
java.util.TreeSet!
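The punchline is that a sorted structure does most of the work: project every vector onto a random direction, keep the projections sorted, and the candidates for a query are its neighbors in projection order. A minimal single-projection sketch, assuming plain double[] vectors and using a TreeMap in place of a TreeSet for brevity (class and method names here are illustrative, not Mahout's):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.Random;
import java.util.TreeMap;

// Sketch of projection search with one random projection; the real
// ProjectionSearch keeps several projections and merges the candidates.
class ProjectionSearchSketch {
  private final double[] direction;                   // random unit vector
  private final NavigableMap<Double, double[]> index = new TreeMap<>();

  ProjectionSearchSketch(int dim, Random rand) {
    direction = new double[dim];
    double norm = 0;
    for (int i = 0; i < dim; i++) {
      direction[i] = rand.nextGaussian();
      norm += direction[i] * direction[i];
    }
    norm = Math.sqrt(norm);
    for (int i = 0; i < dim; i++) direction[i] /= norm;
  }

  private double project(double[] v) {
    double dot = 0;
    for (int i = 0; i < v.length; i++) dot += v[i] * direction[i];
    return dot;
  }

  void add(double[] v) { index.put(project(v), v); }  // ties overwrite; fine for a sketch

  // Scan a window of entries on either side of the query's projected value
  // and return the candidate with the smallest true distance.
  double[] nearest(double[] query, int window) {
    double key = project(query);
    List<double[]> candidates = new ArrayList<>();
    int taken = 0;
    for (double[] v : index.headMap(key, true).descendingMap().values()) {
      if (taken++ >= window) break;
      candidates.add(v);
    }
    taken = 0;
    for (double[] v : index.tailMap(key, false).values()) {
      if (taken++ >= window) break;
      candidates.add(v);
    }
    double best = Double.POSITIVE_INFINITY;
    double[] bestVec = null;
    for (double[] v : candidates) {
      double d = 0;
      for (int i = 0; i < v.length; i++) {
        double diff = v[i] - query[i];
        d += diff * diff;
      }
      if (d < best) { best = d; bestVec = v; }
    }
    return bestVec;
  }
}
```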
How Many Projections?
K-means Search
• Simple idea
– pre-cluster the data
– to find the nearest points, search the nearest clusters
• Recursive application
– to search a cluster, use a Searcher! (see the sketch below)
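As a rough sketch of that idea, assuming plain double[] vectors and brute force inside each cluster (names are illustrative, not Mahout's KmeansSearch API):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of k-means search: rank clusters by centroid distance,
// then scan only the few closest clusters for the true nearest point.
class KmeansSearchSketch {
  static double dist(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      s += d * d;
    }
    return s; // squared Euclidean is enough for ranking
  }

  // clusters.get(i) holds the points assigned to centroids[i] by a pre-clustering pass
  static double[] nearest(double[] query, double[][] centroids,
                          List<List<double[]>> clusters, int probes) {
    Integer[] order = new Integer[centroids.length];
    for (int i = 0; i < order.length; i++) order[i] = i;
    Arrays.sort(order, Comparator.comparingDouble(i -> dist(query, centroids[i])));
    double best = Double.POSITIVE_INFINITY;
    double[] bestVec = null;
    for (int p = 0; p < Math.min(probes, order.length); p++) {
      for (double[] v : clusters.get(order[p])) { // brute force inside the cluster;
        double d = dist(query, v);                // recursively, this inner scan
        if (d < best) { best = d; bestVec = v; }  // could itself be another Searcher
      }
    }
    return bestVec;
  }
}
```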
[Slides 16-20: diagrams of a query point x being routed to its nearest clusters, which are then searched]
But This Requires k-means!
• Need a new k-means algorithm to get speed
– Hadoop is very slow at iterative map-reduce
– maybe Pregel clones like Giraph would be better
– or maybe not
• Streaming k-means is
– one pass (through the original data)
– very fast (20 μs per data point with threads)
– very parallelizable
Basic Method
• Use a single pass of k-means with very many clusters
– output is a bad-ish clustering but a good surrogate
• Use weighted centroids from step 1 to do in-memory clustering
– output is a good clustering with fewer clusters (sketched below)
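The second step is just weighted k-means over the step-1 centroids, which are few enough to fit in memory. A minimal sketch of that reclustering pass, assuming plain double[] vectors with per-point weights (Mahout's BallKMeans plays this role, with much more careful seeding):

```java
import java.util.Random;

// Sketch of the in-memory step: weighted Lloyd's iterations over the
// weighted centroids produced by the streaming pass.
class ReclusterSketch {
  static double[][] cluster(double[][] points, double[] weights,
                            int k, int iterations, Random rand) {
    int dim = points[0].length;
    double[][] means = new double[k][];
    for (int c = 0; c < k; c++) {        // crude init: random sample of the sketch points
      means[c] = points[rand.nextInt(points.length)].clone();
    }
    for (int iter = 0; iter < iterations; iter++) {
      double[][] sums = new double[k][dim];
      double[] totalWeight = new double[k];
      for (int p = 0; p < points.length; p++) {
        int best = 0;
        double bestD = Double.POSITIVE_INFINITY;
        for (int c = 0; c < k; c++) {
          double d = 0;
          for (int j = 0; j < dim; j++) {
            double diff = points[p][j] - means[c][j];
            d += diff * diff;
          }
          if (d < bestD) { bestD = d; best = c; }
        }
        totalWeight[best] += weights[p]; // each point counts with its centroid's mass
        for (int j = 0; j < dim; j++) sums[best][j] += weights[p] * points[p][j];
      }
      for (int c = 0; c < k; c++) {
        if (totalWeight[c] > 0) {
          for (int j = 0; j < dim; j++) means[c][j] = sums[c][j] / totalWeight[c];
        }
      }
    }
    return means;
  }
}
```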
Algorithmic Details
for each data point x_n:
    compute the distance to the nearest centroid, d
    sample u ~ Uniform(0, 1)
    if u > d / β: add x_n to the nearest centroid
    else: create a new centroid at x_n
    if the number of centroids > 10 log n:
        recursively cluster the centroids
        set β = 1.5 β if the number of centroids did not decrease
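A minimal Java rendering of that loop, assuming Euclidean distance and a brute-force nearest-centroid scan (the real code swaps in a Searcher there) and eliding the recursive collapse:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the one-pass streaming step; class and method names are illustrative.
class StreamingPassSketch {
  static class WeightedCentroid {
    double[] mean;
    double weight;
    WeightedCentroid(double[] point) { mean = point.clone(); weight = 1; }
    void add(double[] point) {          // fold a point into the running weighted mean
      weight += 1;
      for (int i = 0; i < mean.length; i++) {
        mean[i] += (point[i] - mean[i]) / weight;
      }
    }
  }

  static List<WeightedCentroid> cluster(Iterable<double[]> data, double beta, Random rand) {
    List<WeightedCentroid> centroids = new ArrayList<>();
    long n = 0;
    for (double[] x : data) {
      n++;
      if (centroids.isEmpty()) {
        centroids.add(new WeightedCentroid(x));
        continue;
      }
      WeightedCentroid nearest = null;     // brute-force nearest centroid;
      double d = Double.POSITIVE_INFINITY; // the real code uses a Searcher here
      for (WeightedCentroid c : centroids) {
        double s = 0;
        for (int i = 0; i < x.length; i++) {
          double diff = x[i] - c.mean[i];
          s += diff * diff;
        }
        s = Math.sqrt(s);
        if (s < d) { d = s; nearest = c; }
      }
      if (rand.nextDouble() > d / beta) {
        nearest.add(x);                         // close enough: merge into nearest
      } else {
        centroids.add(new WeightedCentroid(x)); // far away: spawn a new centroid
      }
      if (centroids.size() > 10 * Math.log(n)) {
        // here the real algorithm recursively clusters the centroids, then
        // sets beta = 1.5 * beta if the collapse did not shrink the set;
        // this sketch just relaxes beta
        beta *= 1.5;
      }
    }
    return centroids;
  }
}
```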
How It Works
• Result is a large set of centroids
– these provide an approximation of the original distribution
– we can cluster the centroids to get a close approximation of clustering the original data
– or we can just use the result directly
Parallel Speedup?
[Figure: time per point (μs) vs. number of threads; the threaded version tracks perfect scaling, with the non-threaded version shown for comparison]
Warning, Recursive Descent
• Inner loop requires finding the nearest centroid
• With lots of centroids, this is slow
• But wait, we have classes to accelerate that!
(Let’s not use the k-means searcher, though)
• Empirically, projection search beats 64-bit LSH by a bit
Moving to Scale
• Map-reduce implementation is nearly trivial (see the sketch below)
• Map: rough-cluster the input data; output β and the weighted centroids
• Reduce:
– single reducer gets all centroids
– if too many centroids, merge using recursive clustering
– optionally do final clustering in-memory
• Combiner possible, but essentially never important
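A sketch of the map side, assuming Hadoop's Mapper API and Mahout's Centroid/VectorWritable; streamingStep() stands in for the one-pass update shown earlier and is left abstract here:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.Centroid;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Sketch of the rough-clustering mapper; the reducer collects all weighted
// centroids under one key and, if there are too many, merges them with the
// same streaming pass before the optional in-memory final clustering.
public abstract class RoughClusterMapper
    extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  private final List<Centroid> sketch = new ArrayList<>();

  // the one-pass streaming update from the earlier sketch (abstract placeholder)
  protected abstract void streamingStep(List<Centroid> sketch, Vector point);

  @Override
  protected void map(IntWritable key, VectorWritable value, Context context) {
    streamingStep(sketch, value.get()); // grow this mapper's centroid sketch
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // emit everything under a single key so one reducer sees all centroids;
    // beta would be emitted alongside (e.g., as a tagged record)
    for (Centroid c : sketch) {
      context.write(new IntWritable(0), new VectorWritable(c));
    }
  }
}
```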
• Contact:
– tdunning@maprtech.com
– @ted_dunning
• Slides and such:
– http://info.mapr.com/ted-mlconf
Hash tags: #mlconf #mahout #mapr
