Graphlab Ted Dunning Clustering

Talk on the Mahout nearest-neighbor framework, focusing particularly on the k-means acceleration provided by the streaming k-means implementation.

  1. Large-scale Single-pass k-Means Clustering at Scale
     Ted Dunning
  2. Large-scale Single-pass k-Means Clustering
  3. Large-scale k-Means Clustering
  4. Goals
     • Cluster very large data sets
     • Facilitate large nearest neighbor search
     • Allow very large number of clusters
     • Achieve good quality
       – low average distance to nearest centroid on held-out data
     • Based on Mahout Math
     • Runs on Hadoop (really MapR) cluster
     • FAST – cluster tens of millions in minutes
  5. Non-goals
     • Use map-reduce (but it is there)
     • Minimize the number of clusters
     • Support metrics other than L2
  6. Anti-goals
     • Multiple passes over original data
     • Scale as O(k n)
  7. Why?
  8. K-nearest Neighbor with Super Fast k-means
  9. What’s that?
     • Find the k nearest training examples
     • Use the average value of the target variable from them
     • This is easy … but hard
       – easy because it is so conceptually simple and you don’t have knobs to turn or models to build
       – hard because of the stunning amount of math
       – also hard because we need top 50,000 results
     • Initial prototype was massively too slow
       – 3K queries x 200K examples takes hours
       – needed 20M x 25M in the same time
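For readers who have not seen it, here is a brute-force sketch of exactly this: k-nearest-neighbor regression that averages the target values of the k closest training examples. It is a hypothetical illustration (plain Java arrays rather than Mahout Vectors), and this naive form is precisely what turned out to be far too slow at the 20M x 25M scale mentioned above.

    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;

    // Brute-force k-nearest-neighbor regression: predict the mean target value
    // of the k closest training examples. Plain arrays stand in for Mahout
    // Vectors; this naive form is exactly what turned out to be too slow.
    class KnnRegressionSketch {
      static double predict(List<double[]> examples, List<Double> targets,
                            double[] query, int k) {
        Integer[] order = new Integer[examples.size()];
        for (int i = 0; i < order.length; i++) { order[i] = i; }
        // sort example indexes by distance to the query
        Arrays.sort(order,
            Comparator.comparingDouble(i -> distance(examples.get(i), query)));
        int kk = Math.min(k, order.length);
        double sum = 0;
        for (int i = 0; i < kk; i++) {
          sum += targets.get(order[i]);     // average the targets of the k nearest
        }
        return sum / kk;
      }

      static double distance(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
          d += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(d);
      }
    }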
  10. How We Did It
      • 2 week hackathon with 6 developers from customer bank
      • Agile-ish development
      • To avoid IP issues
        – all code is Apache Licensed (no ownership question)
        – all data is synthetic (no question of private data)
        – all development done on individual machines, hosting on Github
        – open is easier than closed (in this case)
      • Goal is new open technology to facilitate new closed solutions
      • Ambitious goal of ~ 1,000,000 x speedup
  11. How We Did It
      • 2 week hackathon with 6 developers from customer bank
      • Agile-ish development
      • To avoid IP issues
        – all code is Apache Licensed (no ownership question)
        – all data is synthetic (no question of private data)
        – all development done on individual machines, hosting on Github
        – open is easier than closed (in this case)
      • Goal is new open technology to facilitate new closed solutions
      • Ambitious goal of ~ 1,000,000 x speedup
        – well, really only 100-1000x after basic hygiene
  12. What We Did
      • Mechanism for extending Mahout Vectors
        – DelegatingVector, WeightedVector, Centroid
      • Shared memory matrix
        – FileBasedMatrix uses mmap to share very large dense matrices
      • Searcher interface
        – ProjectionSearch, KmeansSearch, LshSearch, Brute
      • Super-fast clustering
        – Kmeans, StreamingKmeans
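The Searcher abstraction named on this slide can be pictured roughly as the interface below. This is a hypothetical sketch (the real classes come from the hackathon code and later Mahout releases, and their exact signatures differ); the point is that ProjectionSearch, KmeansSearch, LshSearch and Brute are interchangeable behind one small interface.

    import java.util.List;
    import org.apache.mahout.math.Vector;

    // Hypothetical sketch of the Searcher abstraction: reference vectors are
    // added once, then queried for their nearest stored members. The concrete
    // implementations differ only in how cheaply they find those neighbors.
    interface Searcher {
      void add(Vector v);                            // index one reference vector
      List<Vector> search(Vector query, int limit);  // the 'limit' nearest stored vectors
    }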
  13. Projection Search
      java.util.TreeSet!
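Why a TreeSet is nearly the whole story: project each vector onto a random unit direction, keep the vectors ordered by that one scalar, and answer a query by looking at the few entries that bracket the query's own projection. The class below is an illustrative sketch only (plain arrays, a single projection, a TreeMap so the vectors ride along with their keys); the real search uses several independent projections and re-ranks the candidates by true distance.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.Random;
    import java.util.TreeMap;

    // Illustrative sketch of projection search: order vectors by their scalar
    // projection onto one random unit direction, then take the entries nearest
    // the query's projection as candidate neighbors.
    class ProjectionSearchSketch {
      private final double[] direction;                        // one random unit direction
      private final NavigableMap<Double, double[]> index = new TreeMap<>();

      ProjectionSearchSketch(int dimension, Random rand) {
        direction = new double[dimension];
        double norm = 0;
        for (int i = 0; i < dimension; i++) {
          direction[i] = rand.nextGaussian();
          norm += direction[i] * direction[i];
        }
        norm = Math.sqrt(norm);
        for (int i = 0; i < dimension; i++) {
          direction[i] /= norm;
        }
      }

      void add(double[] v) {
        index.put(project(v), v);   // ties on the projection value overwrite; fine for a sketch
      }

      // Candidate neighbors: up to 'width' vectors whose projection falls just
      // below the query's, and up to 'width' just above it.
      List<double[]> candidates(double[] query, int width) {
        double p = project(query);
        List<double[]> result = new ArrayList<>();
        for (Map.Entry<Double, double[]> e : index.headMap(p, true).descendingMap().entrySet()) {
          if (result.size() >= width) break;
          result.add(e.getValue());
        }
        for (Map.Entry<Double, double[]> e : index.tailMap(p, false).entrySet()) {
          if (result.size() >= 2 * width) break;
          result.add(e.getValue());
        }
        return result;
      }

      private double project(double[] v) {
        double sum = 0;
        for (int i = 0; i < v.length; i++) {
          sum += v[i] * direction[i];
        }
        return sum;
      }
    }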
  14. How Many Projections?
  15. K-means Search
      • Simple Idea
        – pre-cluster the data
        – to find the nearest points, search the nearest clusters
      • Recursive application
        – to search a cluster, use a Searcher!
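The same idea as a toy class, assuming the data has already been clustered somehow: a query is compared only against points in the few clusters whose centroids are nearest to it, instead of against every point. Everything here (class name, brute-force centroid ranking) is hypothetical and just illustrates the shape of the search; in the recursive version each cluster's members would themselves sit inside another Searcher.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;

    // Hypothetical sketch of k-means search: points are bucketed by their
    // nearest centroid, and a query only examines the few closest buckets.
    class KmeansSearchSketch {
      List<double[]> centroids = new ArrayList<>();
      List<List<double[]>> members = new ArrayList<>();   // members.get(i) belong to centroids.get(i)

      List<double[]> candidates(double[] query, int clustersToSearch) {
        // rank clusters by distance from the query to their centroid
        Integer[] order = new Integer[centroids.size()];
        for (int i = 0; i < order.length; i++) { order[i] = i; }
        Arrays.sort(order,
            Comparator.comparingDouble(i -> squaredDistance(centroids.get(i), query)));
        // only points in the nearest few clusters get a full distance check
        List<double[]> result = new ArrayList<>();
        for (int c = 0; c < clustersToSearch && c < order.length; c++) {
          result.addAll(members.get(order[c]));
        }
        return result;
      }

      static double squaredDistance(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
          d += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return d;
      }
    }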
  16. (figure slide)
  17. (figure slide; only text: “x”)
  18. (figure slide)
  19. (figure slide)
  20. (figure slide; only text: “x”)
  21. But This Requires k-means!
      • Need a new k-means algorithm to get speed
        – Hadoop is very slow at iterative map-reduce
        – Maybe Pregel clones like Giraph would be better
        – Or maybe not
      • Streaming k-means is
        – One pass (through the original data)
        – Very fast (20 μs per data point with threads)
        – Very parallelizable
  22. Basic Method
      • Use a single pass of k-means with very many clusters
        – output is a bad-ish clustering but a good surrogate
      • Use weighted centroids from step 1 to do in-memory clustering
        – output is a good clustering with fewer clusters
  23. Algorithmic Details
      Foreach data point xₙ
        compute distance to nearest centroid, ∂
        sample u, if u > ∂/ß add to nearest centroid
        else create new centroid
        if number of centroids > 10 log n
          recursively cluster centroids
          set ß = 1.5 ß if number of centroids did not decrease
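Below is a literal, single-threaded transcription of that pseudo-code, not the Mahout implementation: plain arrays, a natural log for the 10 log n bound, and a brute-force nearest-centroid search standing in for the accelerated Searcher discussed a few slides later. The variable beta plays the role of ß and d the role of ∂.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Sketch of single-pass streaming k-means following the slide's pseudo-code.
    // ß starts small and is inflated whenever collapsing the centroids stops
    // shrinking their number.
    class StreamingKmeansSketch {
      static class Centroid {
        double[] mean;
        double weight;
        Centroid(double[] point, double weight) {
          this.mean = point.clone();
          this.weight = weight;
        }
        void add(double[] point, double w) {         // weighted running mean
          for (int i = 0; i < mean.length; i++) {
            mean[i] = (mean[i] * weight + point[i] * w) / (weight + w);
          }
          weight += w;
        }
      }

      final Random rand = new Random();
      List<Centroid> centroids = new ArrayList<>();
      double beta = 1;       // distance scale ß from the slide
      long n = 0;            // points seen so far

      // feed points one at a time; this is the single pass over the data
      void cluster(double[] x) {
        n++;
        add(x, 1);
        if (centroids.size() > 10 * Math.log(n)) {   // "10 log n" bound from the slide
          collapse();
        }
      }

      void add(double[] x, double weight) {
        if (centroids.isEmpty()) {
          centroids.add(new Centroid(x, weight));
          return;
        }
        // nearest-centroid lookup is brute force here; slides 26-28 are about accelerating it
        int nearest = nearest(x);
        double d = distance(centroids.get(nearest).mean, x);
        if (rand.nextDouble() > d / beta) {
          centroids.get(nearest).add(x, weight);     // close enough: absorb into nearest centroid
        } else {
          centroids.add(new Centroid(x, weight));    // far away: start a new centroid
        }
      }

      // recursively cluster the centroids with the same rule; if that did not
      // shrink the set, relax the cutoff: ß = 1.5 ß
      void collapse() {
        List<Centroid> old = centroids;
        centroids = new ArrayList<>();
        for (Centroid c : old) {
          add(c.mean, c.weight);
        }
        if (centroids.size() >= old.size()) {
          beta *= 1.5;
        }
      }

      int nearest(double[] x) {
        int best = 0;
        for (int i = 1; i < centroids.size(); i++) {
          if (distance(centroids.get(i).mean, x) < distance(centroids.get(best).mean, x)) {
            best = i;
          }
        }
        return best;
      }

      static double distance(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
          d += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(d);
      }
    }

Feeding points one at a time through cluster(x) leaves a set of weighted centroids that grows only about logarithmically with n; that set is the surrogate of slide 22, which a second, in-memory clustering pass then reduces to the final clusters.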
  24. How It Works
      • Result is large set of centroids
        – these provide approximation of original distribution
        – we can cluster centroids to get a close approximation of clustering original
        – or we can just use the result directly
  25. Parallel Speedup?
      (plot: time per point in μs versus number of threads, for 2 to 16 threads, comparing the threaded and non-threaded versions against perfect scaling)
  26. Warning, Recursive Descent
      • Inner loop requires finding nearest centroid
      • With lots of centroids, this is slow
      • But wait, we have classes to accelerate that!
  27. Warning, Recursive Descent
      • Inner loop requires finding nearest centroid
      • With lots of centroids, this is slow
      • But wait, we have classes to accelerate that! (Let’s not use k-means searcher, though)
  28. Warning, Recursive Descent
      • Inner loop requires finding nearest centroid
      • With lots of centroids, this is slow
      • But wait, we have classes to accelerate that! (Let’s not use k-means searcher, though)
      • Empirically, projection search beats 64 bit LSH by a bit
  29. Moving to Scale
      • Map-reduce implementation nearly trivial
      • Map: rough-cluster input data, output ß, weighted centroids
      • Reduce:
        – single reducer gets all centroids
        – if too many centroids, merge using recursive clustering
        – optionally do final clustering in-memory
      • Combiner possible, but essentially never important
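Roughly what that job could look like with Hadoop's mapreduce API, reusing the StreamingKmeansSketch toy class from slide 23. This is a hypothetical shape only: input parsing is naive CSV, every mapper emits under a single key so one reducer sees all centroids, and, unlike the real code, the centroid weights and ß are not carried from mapper to reducer.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.VectorWritable;

    // Hypothetical mapper: rough-cluster this split with the streaming pass,
    // then emit the resulting centroids under a single key.
    class RoughClusterMapper extends Mapper<LongWritable, Text, IntWritable, VectorWritable> {
      private final StreamingKmeansSketch clusterer = new StreamingKmeansSketch();

      @Override
      protected void map(LongWritable key, Text value, Context context) {
        clusterer.cluster(parse(value));
      }

      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
        for (StreamingKmeansSketch.Centroid c : clusterer.centroids) {
          context.write(new IntWritable(0), new VectorWritable(new DenseVector(c.mean)));
        }
      }

      private static double[] parse(Text line) {
        String[] fields = line.toString().split(",");
        double[] v = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
          v[i] = Double.parseDouble(fields[i]);
        }
        return v;
      }
    }

    // Hypothetical reducer: the single reducer merges all mapper centroids by
    // re-clustering them with the same streaming rule.
    class MergeCentroidsReducer extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {
      @Override
      protected void reduce(IntWritable key, Iterable<VectorWritable> values, Context context)
          throws IOException, InterruptedException {
        StreamingKmeansSketch merger = new StreamingKmeansSketch();
        for (VectorWritable v : values) {
          merger.cluster(toArray(v));   // real code carries each centroid's weight along too
        }
        for (StreamingKmeansSketch.Centroid c : merger.centroids) {
          context.write(key, new VectorWritable(new DenseVector(c.mean)));
        }
      }

      private static double[] toArray(VectorWritable v) {
        double[] result = new double[v.get().size()];
        for (int i = 0; i < result.length; i++) {
          result[i] = v.get().get(i);
        }
        return result;
      }
    }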
  30. • Contact:
        – tdunning@maprtech.com
        – @ted_dunning
      • Slides and such:
        – http://info.mapr.com/ted-mlconf
      Hash tags: #mlconf #mahout #mapr