Graphlab dunning-clustering

A talk on the upcoming Mahout nearest-neighbor framework, focusing particularly on the k-means acceleration provided by the streaming k-means implementation.

1. Large-scale Single-pass k-Means Clustering at Scale
   Ted Dunning, MapR Technologies
2. Large-scale Single-pass k-Means Clustering
3. Large-scale k-Means Clustering
4. Goals
   - Cluster very large data sets
   - Facilitate large nearest neighbor search
   - Allow very large numbers of clusters
   - Achieve good quality – low average distance to nearest centroid on held-out data
   - Based on Mahout Math
   - Runs on a Hadoop (really MapR) cluster
   - FAST – cluster tens of millions of points in minutes
5. Non-goals
   - Use map-reduce (but it is there)
   - Minimize the number of clusters
   - Support metrics other than L2
6. Anti-goals
   - Multiple passes over the original data
   - Scaling as O(k n)
7. Why?
8. k-Nearest Neighbor with Super-Fast k-means
9. What's that?
   - Find the k nearest training examples
   - Use the average value of the target variable from them
   - This is easy … but hard
     – easy because it is so conceptually simple; there are no knobs to turn or models to build
     – hard because of the stunning amount of math
     – also hard because we need the top 50,000 results
   - The initial prototype was massively too slow
     – 3K queries x 200K examples took hours
     – we needed 20M x 25M in the same time
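For concreteness, here is a minimal brute-force sketch of the k-NN regression just described (class and method names are my own; the rest of the talk is about replacing this linear scan with something far faster):

    import java.util.Arrays;

    // k-NN regression: predict by averaging the target values of the k
    // nearest training examples. Brute force: every query scans all points.
    class KnnSketch {
        static double euclidean(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                s += d * d;
            }
            return Math.sqrt(s);
        }

        static double predict(double[][] xs, double[] ys, double[] query, int k) {
            Integer[] order = new Integer[xs.length];
            for (int i = 0; i < xs.length; i++) order[i] = i;
            // sort training examples by distance to the query
            Arrays.sort(order, (a, b) ->
                    Double.compare(euclidean(xs[a], query), euclidean(xs[b], query)));
            double sum = 0;
            int m = Math.min(k, order.length);
            for (int i = 0; i < m; i++) sum += ys[order[i]];
            return sum / m;   // average target value of the k nearest
        }
    }

At 3K queries against 200K examples, a scan like this is exactly what made the prototype take hours.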
10-11. How We Did It
   - 2-week hackathon with 6 developers from a customer bank
   - Agile-ish development
   - To avoid IP issues
     – all code is Apache-licensed (no ownership question)
     – all data is synthetic (no question of private data)
     – all development done on individual machines, with hosting on GitHub
     – open is easier than closed (in this case)
   - The goal is new open technology to facilitate new closed solutions
   - Ambitious goal of ~1,000,000x speedup
     – well, really only 100-1000x after basic hygiene
12. What We Did
   - A mechanism for extending Mahout Vectors
     – DelegatingVector, WeightedVector, Centroid
   - A shared-memory matrix
     – FileBasedMatrix uses mmap to share very large dense matrices
   - A Searcher interface (sketched below)
     – ProjectionSearch, KmeansSearch, LshSearch, Brute
   - Super-fast clustering
     – Kmeans, StreamingKmeans
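As a rough illustration of what a pluggable Searcher looks like (the method names and signatures below are assumptions made for this sketch, not Mahout's actual API):

    import java.util.List;
    import org.apache.mahout.math.Vector;

    // Illustrative shape of a nearest-neighbor search abstraction; the four
    // implementations listed above would all sit behind something like this.
    interface NeighborSearcher {
        void add(Vector point);                        // index one point
        List<Vector> search(Vector query, int limit);  // nearest first
    }

Making the search strategy an interface is what enables the recursive trick on slide 15: a searcher over clusters can itself use a searcher within each cluster.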
13. Projection Search
   - java.util.TreeSet! (see the sketch below)
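The idea, sketched below for a single projection (illustrative, not Mahout's ProjectionSearch; a TreeMap stands in for the TreeSet so each sorted projection value can carry its point): points are ordered by their dot product with a random unit vector, so candidates whose projections bracket the query's projection are likely near neighbors, and several independent projections are combined to improve recall.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import java.util.TreeMap;

    // One projection of a projection search: keep points sorted by their
    // projection onto a random direction; near neighbors tend to have
    // nearby projections.
    class ProjectionIndex {
        private final double[] direction;   // random unit vector
        private final TreeMap<Double, double[]> byProjection = new TreeMap<>();

        ProjectionIndex(int dim, Random rnd) {
            direction = new double[dim];
            double norm = 0;
            for (int i = 0; i < dim; i++) {
                direction[i] = rnd.nextGaussian();
                norm += direction[i] * direction[i];
            }
            norm = Math.sqrt(norm);
            for (int i = 0; i < dim; i++) direction[i] /= norm;
        }

        private double project(double[] v) {
            double s = 0;
            for (int i = 0; i < v.length; i++) s += v[i] * direction[i];
            return s;
        }

        void add(double[] v) {
            byProjection.put(project(v), v);   // a real index must also handle ties
        }

        // Up to `width` candidates on each side of the query's projection.
        List<double[]> candidates(double[] query, int width) {
            double p = project(query);
            List<double[]> out = new ArrayList<>();
            int taken = 0;
            for (double[] v : byProjection.headMap(p, true).descendingMap().values()) {
                if (taken++ >= width) break;
                out.add(v);
            }
            taken = 0;
            for (double[] v : byProjection.tailMap(p, false).values()) {
                if (taken++ >= width) break;
                out.add(v);
            }
            return out;
        }
    }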
14. How Many Projections?
15. K-means Search
   - Simple idea (sketched below)
     – pre-cluster the data
     – to find the nearest points, search the nearest clusters
   - Recursive application
     – to search a cluster, use a Searcher!
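A minimal sketch of that simple idea (names are assumptions): rank the centroids by distance to the query, then brute-force only the members of the few nearest clusters.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Search pre-clustered data: probe only the nearest clusters.
    class ClusteredSearchSketch {
        final List<double[]> centroids;
        final List<List<double[]>> members;   // members.get(i): points of cluster i

        ClusteredSearchSketch(List<double[]> centroids, List<List<double[]>> members) {
            this.centroids = centroids;
            this.members = members;
        }

        List<double[]> search(double[] query, int k, int clustersToProbe) {
            // rank clusters by centroid distance to the query
            List<Integer> order = new ArrayList<>();
            for (int i = 0; i < centroids.size(); i++) order.add(i);
            order.sort(Comparator.comparingDouble(i -> dist(centroids.get(i), query)));
            // pool the members of the nearest clusters, then brute-force the pool
            List<double[]> pool = new ArrayList<>();
            for (int j = 0; j < clustersToProbe && j < order.size(); j++)
                pool.addAll(members.get(order.get(j)));
            pool.sort(Comparator.comparingDouble(p -> dist(p, query)));
            return pool.subList(0, Math.min(k, pool.size()));
        }

        static double dist(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                s += d * d;
            }
            return s;   // squared L2 is fine for ranking
        }
    }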
16-20. [Diagram slides: locating a query point x by searching only the nearest clusters]
21. But This Requires k-means!
   - Need a new k-means algorithm to get speed
     – Hadoop is very slow at iterative map-reduce
     – maybe Pregel clones like Giraph would be better
     – or maybe not
   - Streaming k-means is
     – one pass (through the original data)
     – very fast (20 μs per data point with threads)
     – very parallelizable
22. Basic Method
   - Use a single pass of k-means with very many clusters
     – the output is a bad-ish clustering but a good surrogate
   - Use the weighted centroids from step 1 to do in-memory clustering
     – the output is a good clustering with fewer clusters
23. Algorithmic Details

        for each data point x_n:
            compute the distance to the nearest centroid, ∂
            sample u ~ Uniform(0, 1)
            if u > ∂ / ß:
                add x_n to the nearest centroid
            else:
                create a new centroid at x_n
            if the number of centroids > 10 log n:
                recursively cluster the centroids
                set ß = 1.5 ß if the number of centroids did not decrease
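A compact, illustrative Java rendering of this loop (the initial ß, the collapse strategy, and the brute-force nearest-centroid scan are simplifications of what Mahout's StreamingKmeans actually does; the real code uses a fast Searcher in the inner loop, as slides 26-28 warn):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Single-pass streaming k-means sketch of the pseudocode above.
    class StreamingKMeansSketch {
        static class Centroid {
            double[] mean; double weight;
            Centroid(double[] x) { mean = x.clone(); weight = 1; }
            void add(double[] x) {
                weight++;
                for (int i = 0; i < mean.length; i++)
                    mean[i] += (x[i] - mean[i]) / weight;   // incremental mean
            }
        }

        final List<Centroid> centroids = new ArrayList<>();
        final Random rnd = new Random();
        double beta = 1;   // distance scale ß; grows when collapsing stalls
        long n = 0;        // data points seen so far

        void accept(double[] x) {
            n++;
            Centroid nearest = nearest(x, centroids);
            double d = nearest == null ? Double.POSITIVE_INFINITY : dist(x, nearest.mean);
            if (nearest != null && rnd.nextDouble() > d / beta) {
                nearest.add(x);                    // close: absorb into nearest centroid
            } else {
                centroids.add(new Centroid(x));    // far (probability ~ ∂/ß): new centroid
            }
            if (centroids.size() > 10 * Math.log(n + 1)) {
                int before = centroids.size();
                collapse();
                if (centroids.size() >= before) beta *= 1.5;
            }
        }

        // Re-cluster the centroids themselves with the same acceptance rule.
        void collapse() {
            List<Centroid> old = new ArrayList<>(centroids);
            centroids.clear();
            for (Centroid c : old) {
                Centroid near = nearest(c.mean, centroids);
                if (near != null && rnd.nextDouble() > dist(c.mean, near.mean) / beta) {
                    double w = near.weight + c.weight;   // weighted merge
                    for (int i = 0; i < near.mean.length; i++)
                        near.mean[i] = (near.mean[i] * near.weight + c.mean[i] * c.weight) / w;
                    near.weight = w;
                } else {
                    centroids.add(c);
                }
            }
        }

        static Centroid nearest(double[] x, List<Centroid> cs) {
            Centroid best = null;
            double bestD = Double.POSITIVE_INFINITY;
            for (Centroid c : cs) {              // brute force here; see slides 26-28
                double d = dist(x, c.mean);
                if (d < bestD) { bestD = d; best = c; }
            }
            return best;
        }

        static double dist(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                s += diff * diff;
            }
            return Math.sqrt(s);
        }
    }

Note how a point far from every centroid (∂ near or above ß) almost always spawns a new centroid, while a point close to an existing centroid is almost always absorbed.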
24. How It Works
   - The result is a large set of centroids
     – these provide an approximation of the original distribution
     – we can cluster the centroids to get a close approximation of clustering the original data
     – or we can just use the result directly
25. Parallel Speedup?
   [Chart: time per data point (μs) vs. number of threads, showing the non-threaded version, the threaded version at 2-16 threads, and a perfect-scaling reference line]
26-28. Warning, Recursive Descent
   - The inner loop requires finding the nearest centroid
   - With lots of centroids, this is slow
   - But wait, we have classes to accelerate that!
   - (Let's not use the k-means searcher, though)
   - Empirically, projection search beats 64-bit LSH by a bit
29. Moving to Scale
   - The map-reduce implementation is nearly trivial (sketched below)
   - Map: rough-cluster the input data; output ß and the weighted centroids
   - Reduce:
     – a single reducer gets all centroids
     – if there are too many centroids, merge them using recursive clustering
     – optionally do the final clustering in-memory
   - A combiner is possible, but essentially never important
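A structural sketch of the map side (Hadoop mapreduce API; the comma-separated input format and the text encoding of weighted centroids are illustrative assumptions, and StreamingKMeansSketch is the toy class from slide 23 standing in for the real clusterer):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Each mapper rough-clusters its split, then emits ß and its weighted
    // centroids under a single key so that one reducer sees all of them.
    class RoughClusterMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        private final StreamingKMeansSketch sketch = new StreamingKMeansSketch();
        private static final IntWritable SINGLE_REDUCER = new IntWritable(0);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx) {
            String[] fields = line.toString().split(",");
            double[] x = new double[fields.length];
            for (int i = 0; i < fields.length; i++) x[i] = Double.parseDouble(fields[i]);
            sketch.accept(x);   // one pass over this split's data
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            for (StreamingKMeansSketch.Centroid c : sketch.centroids) {
                StringBuilder sb = new StringBuilder();
                sb.append(sketch.beta).append(' ').append(c.weight);
                for (double v : c.mean) sb.append(' ').append(v);
                ctx.write(SINGLE_REDUCER, new Text(sb.toString()));
            }
        }
    }

The reducer would parse these weighted centroids back and run the same recursive collapse logic over them, optionally finishing with the in-memory clustering from slide 22.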
30. Contact:
     – tdunning@maprtech.com
     – @ted_dunning
    Slides and such:
     – http://info.mapr.com/ted-mlconf
    Hash tags: #mlconf #mahout #mapr
