Large-scale Single-pass k-Means Clustering at Scale
Ted Dunning


Goals

- Cluster very large data sets
- Facilitate large-scale nearest-neighbor search
- Allow a very large number of clusters
- Achieve good quality
  – low average distance to nearest centroid on held-out data
- Based on Mahout Math
- Runs on a Hadoop (really MapR) cluster
- FAST – cluster tens of millions of points in minutes




Non-goals

- Use map-reduce (but it is there)
- Minimize the number of clusters
- Support metrics other than L2




Anti-goals

- Multiple passes over the original data
- Scaling as O(kn)




Why?




K-nearest Neighbor with Super Fast k-means




What’s that?

- Find the k nearest training examples
- Use the average value of the target variable from them

- This is easy … but hard
  – easy because it is so conceptually simple and you don't have knobs to turn or models to build
  – hard because of the stunning amount of math
  – also hard because we need the top 50,000 results

- Initial prototype was massively too slow
  – 3K queries x 200K examples took hours
  – needed 20M x 25M in the same time
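
A minimal sketch of the scoring step described above, in plain Java with brute-force search (all names are illustrative, not the hackathon code; the rest of the talk is about replacing this inner loop with something much faster):

    import java.util.PriorityQueue;

    // k-NN regression: average the target values of the k training
    // examples nearest (by L2 distance) to the query point.
    public class KnnSketch {
        static double knnScore(double[][] train, double[] targets, double[] query, int k) {
            // max-heap on distance, so the root is the worst of the current k best
            PriorityQueue<double[]> best =
                new PriorityQueue<>((a, b) -> Double.compare(b[0], a[0]));
            for (int i = 0; i < train.length; i++) {
                double d = 0;
                for (int j = 0; j < query.length; j++) {
                    double diff = train[i][j] - query[j];
                    d += diff * diff;
                }
                best.add(new double[] {d, targets[i]});
                if (best.size() > k) {
                    best.poll();                   // drop the farthest candidate
                }
            }
            double sum = 0;
            for (double[] entry : best) {
                sum += entry[1];
            }
            return sum / best.size();              // average target of the k nearest
        }
    }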

How We Did It

- 2-week hackathon with 6 developers from a customer bank
- Agile-ish development
- To avoid IP issues
  – all code is Apache licensed (no ownership question)
  – all data is synthetic (no question of private data)
  – all development done on individual machines, hosted on Github
  – open is easier than closed (in this case)
- Goal is new open technology to facilitate new closed solutions

- Ambitious goal of ~1,000,000x speedup
  – well, really only 100-1000x after basic hygiene
What We Did

- Mechanism for extending Mahout Vectors
  – DelegatingVector, WeightedVector, Centroid

- Shared-memory matrix
  – FileBasedMatrix uses mmap to share very large dense matrices (see the sketch below)

- Searcher interface
  – ProjectionSearch, KmeansSearch, LshSearch, Brute

- Super-fast clustering
  – Kmeans, StreamingKmeans
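
A minimal sketch of the memory-mapping idea behind FileBasedMatrix, assuming doubles stored row-major in a file (this is illustrative, not the actual Mahout class):

    import java.io.RandomAccessFile;
    import java.nio.DoubleBuffer;
    import java.nio.channels.FileChannel;

    // Expose a large dense matrix stored row-major in a file via mmap, so
    // several processes can share one copy through the OS page cache.
    public class MmapMatrixSketch {
        private final DoubleBuffer data;
        private final int columns;

        public MmapMatrixSketch(String path, int rows, int columns) throws Exception {
            this.columns = columns;
            try (RandomAccessFile raf = new RandomAccessFile(path, "r");
                 FileChannel channel = raf.getChannel()) {
                // a single mapping is limited to 2 GB, so a real implementation
                // maps the file in chunks; one chunk keeps this sketch short
                long bytes = (long) rows * columns * Double.BYTES;
                data = channel.map(FileChannel.MapMode.READ_ONLY, 0, bytes).asDoubleBuffer();
            }
        }

        public double get(int row, int col) {
            return data.get(row * columns + col);  // no copy; paged in on demand
        }
    }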

Projection Search

- Project the data onto a few random vectors
- Keep each projection sorted (a java.util.TreeSet works); the candidates for a query are the points whose projections fall nearest the query's projection, re-ranked by true distance
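
A minimal sketch of one such projection (names are illustrative; a TreeMap stands in for the TreeSet so each projection value can carry its point, and a real searcher unions the candidate windows from several projections):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.NavigableMap;
    import java.util.Random;
    import java.util.TreeMap;

    // Projection search: order points by their dot product with a random
    // direction, then answer queries by scanning a small window around the
    // query's projection and re-ranking those candidates by true distance.
    public class ProjectionSearchSketch {
        private final double[] direction;                  // one random projection
        private final TreeMap<Double, double[]> index = new TreeMap<>();

        public ProjectionSearchSketch(int dim, Random rand) {
            direction = new double[dim];
            for (int i = 0; i < dim; i++) {
                direction[i] = rand.nextGaussian();
            }
        }

        static double dot(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) {
                s += a[i] * b[i];
            }
            return s;
        }

        public void add(double[] point) {
            // sorted by projection value; ties would need a multimap in real code
            index.put(dot(direction, point), point);
        }

        // Return up to 2 * window candidates near the query along this line.
        public List<double[]> candidates(double[] query, int window) {
            double key = dot(direction, query);
            List<double[]> result = new ArrayList<>();
            NavigableMap<Double, double[]> below = index.headMap(key, true).descendingMap();
            for (double[] p : below.values()) {
                if (result.size() >= window) break;
                result.add(p);
            }
            for (double[] p : index.tailMap(key, false).values()) {
                if (result.size() >= 2 * window) break;
                result.add(p);
            }
            return result;                                 // caller re-ranks by true distance
        }
    }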




How Many Projections?

[Chart not reproduced in this text version.]
K-means Search

- Simple idea
  – pre-cluster the data
  – to find the nearest points, search only the nearest clusters

- Recursive application
  – to search a cluster, use a Searcher! (see the sketch below)
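
A minimal sketch of the two-level form of this idea (names are illustrative; a real implementation probes several nearby clusters and can recurse by making the inner loop another Searcher):

    import java.util.Arrays;
    import java.util.List;

    // Two-level nearest-neighbor search: find the closest centroids first,
    // then look only at the points assigned to those clusters.
    public class KmeansSearchSketch {
        private final double[][] centroids;          // centers from a pre-clustering pass
        private final List<List<double[]>> members;  // points assigned to each cluster

        public KmeansSearchSketch(double[][] centroids, List<List<double[]>> members) {
            this.centroids = centroids;
            this.members = members;
        }

        static double dist2(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                s += d * d;
            }
            return s;
        }

        // Search only the `probes` clusters nearest the query.
        public double[] nearest(double[] query, int probes) {
            Integer[] order = new Integer[centroids.length];
            for (int i = 0; i < order.length; i++) order[i] = i;
            Arrays.sort(order, (a, b) ->
                Double.compare(dist2(centroids[a], query), dist2(centroids[b], query)));

            double best = Double.POSITIVE_INFINITY;
            double[] bestPoint = null;
            for (int i = 0; i < Math.min(probes, order.length); i++) {
                for (double[] p : members.get(order[i])) {     // brute force per cluster;
                    double d = dist2(p, query);                // recursively, this inner
                    if (d < best) { best = d; bestPoint = p; } // loop could be a Searcher
                }
            }
            return bestPoint;
        }
    }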




[Slides 16-20: step-by-step illustrations of k-means search; the figures are not reproduced in this text version.]
But This Requires k-means!

- Need a new k-means algorithm to get speed
  – Hadoop is very slow at iterative map-reduce
  – maybe Pregel clones like Giraph would be better
  – or maybe not

- Streaming k-means is
  – one pass (through the original data)
  – very fast (20 μs per data point with threads)
  – very parallelizable




Basic Method

- Use a single pass of k-means with very many clusters
  – output is a bad-ish clustering but a good surrogate for the data
- Use the weighted centroids from step 1 to do in-memory clustering
  – output is a good clustering with fewer clusters




Algorithmic Details

For each data point x_n:
    compute the distance to the nearest centroid, d
    sample u ~ Uniform(0, 1); if u > d/β, add x_n to the nearest centroid,
    else create a new centroid

    if the number of centroids > 10 log n:
        recursively cluster the centroids
        set β = 1.5 β if the number of centroids did not decrease
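
A minimal single-threaded sketch of that loop in plain Java (the u > d/β rule, the 10 log n limit, and the 1.5 β growth follow the slide; everything else, including the class and method names and the starting β, is illustrative):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // One-pass streaming k-means: each point either joins its nearest
    // centroid or seeds a new one; too many centroids trigger a collapse
    // that reruns the same rule over the centroids themselves.
    public class StreamingKmeansSketch {
        static class Centroid {
            double[] mean;
            double weight;
            Centroid(double[] p, double w) { mean = p.clone(); weight = w; }
            void add(double[] p, double w) {           // weighted running mean
                weight += w;
                for (int i = 0; i < mean.length; i++) {
                    mean[i] += (p[i] - mean[i]) * w / weight;
                }
            }
        }

        static double dist(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                s += d * d;
            }
            return Math.sqrt(s);
        }

        // Assign-or-create rule from the slide: merge with probability
        // rising as the distance d shrinks relative to the cutoff beta.
        static void assign(List<Centroid> centroids, double[] x, double w,
                           double beta, Random rand) {
            Centroid nearest = null;
            double d = Double.POSITIVE_INFINITY;
            for (Centroid c : centroids) {
                double dc = dist(c.mean, x);
                if (dc < d) { d = dc; nearest = c; }
            }
            if (nearest != null && rand.nextDouble() > d / beta) {
                nearest.add(x, w);                     // close enough: merge
            } else {
                centroids.add(new Centroid(x, w));     // far away: new centroid
            }
        }

        // One pass of the same rule over weighted centroids (collapse step).
        static List<Centroid> onePass(List<Centroid> points, double beta, Random rand) {
            List<Centroid> out = new ArrayList<>();
            for (Centroid p : points) {
                assign(out, p.mean, p.weight, beta, rand);
            }
            return out;
        }

        static List<Centroid> cluster(List<double[]> data, Random rand) {
            List<Centroid> centroids = new ArrayList<>();
            double beta = 1;                           // initial cutoff (not fixed by the slide)
            long n = 0;
            for (double[] x : data) {
                n++;
                assign(centroids, x, 1, beta, rand);
                // too many centroids: recursively cluster the centroids,
                // growing beta whenever that fails to shrink them
                // (the +1 guards the n = 1 edge case where log n = 0)
                while (centroids.size() > 10 * Math.log(n) + 1) {
                    int before = centroids.size();
                    centroids = onePass(centroids, beta, rand);
                    if (centroids.size() >= before) {
                        beta *= 1.5;
                    }
                }
            }
            return centroids;
        }
    }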




How It Works

- Result is a large set of centroids
  – these provide an approximation of the original distribution
  – we can cluster the centroids to get a close approximation of clustering the original data
  – or we can just use the result directly




Parallel Speedup?

[Figure not reproduced: time per data point (μs) versus number of threads, 1-20. The non-threaded version sits near 200 μs per point; the threaded version approaches 20 μs per point at high thread counts, tracking a perfect-scaling reference line.]
Warning, Recursive Descent

- Inner loop requires finding the nearest centroid

- With lots of centroids, this is slow

- But wait, we have classes to accelerate that!

                   (Let's not use the k-means searcher, though)

- Empirically, projection search beats 64-bit LSH by a bit



Moving to Scale

- Map-reduce implementation is nearly trivial

- Map: rough-cluster the input data; output β and the weighted centroids (see the sketch below)

- Reduce:
  – a single reducer gets all the centroids
  – if there are too many centroids, merge them using recursive clustering
  – optionally do the final clustering in-memory

- A combiner is possible, but essentially never important
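
A minimal sketch of that map/reduce shape, written as plain Java functions rather than against the Hadoop API (it reuses the hypothetical StreamingKmeansSketch from the algorithm slide; all names are illustrative):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Map/reduce shape of the scale-out version: each mapper rough-clusters
    // its split; one reducer merges all the weighted centroids by clustering
    // them again with the same assign-or-create rule.
    public class ScaleOutSketch {
        // Map: one-pass cluster a split and emit its weighted centroids.
        static List<StreamingKmeansSketch.Centroid> map(List<double[]> split) {
            return StreamingKmeansSketch.cluster(split, new Random());
        }

        // Reduce: concatenate every mapper's centroids and, if there are too
        // many, collapse them by clustering the weighted centroids. (The slide
        // also forwards each mapper's beta; this sketch just restarts its own.)
        static List<StreamingKmeansSketch.Centroid> reduce(
                List<List<StreamingKmeansSketch.Centroid>> perMapper, int maxCentroids) {
            List<StreamingKmeansSketch.Centroid> all = new ArrayList<>();
            for (List<StreamingKmeansSketch.Centroid> part : perMapper) {
                all.addAll(part);
            }
            Random rand = new Random();
            double beta = 1;
            while (all.size() > maxCentroids) {
                int before = all.size();
                all = StreamingKmeansSketch.onePass(all, beta, rand);
                if (all.size() >= before) {
                    beta *= 1.5;              // grow the cutoff until merging succeeds
                }
            }
            return all;                       // optionally: final in-memory k-means here
        }
    }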



Contact:
  – tdunning@maprtech.com
  – @ted_dunning

Slides and such:
  – http://info.mapr.com/ted-mlconf

Hash tags: #mlconf #mahout #mapr




