Graphlab Ted Dunning Clustering

Talk on the Mahout nearest neighbor framework focussing particularly on the k-means acceleration provided by the streaming k-means implementation.


Published in: Technology, Education

Transcript

  • 1. ©MapR Technologies - Confidential. Large-scale Single-pass k-Means Clustering at Scale. Ted Dunning
  • 2. Large-scale Single-pass k-Means Clustering
  • 3. Large-scale k-Means Clustering
  • 4. Goals
      • Cluster very large data sets
      • Facilitate large nearest neighbor search
      • Allow very large number of clusters
      • Achieve good quality
        – low average distance to nearest centroid on held-out data
      • Based on Mahout Math
      • Runs on Hadoop (really MapR) cluster
      • FAST
        – cluster tens of millions in minutes
  • 5. Non-goals
      • Use map-reduce (but it is there)
      • Minimize the number of clusters
      • Support metrics other than L2
  • 6. Anti-goals
      • Multiple passes over original data
      • Scale as O(k n)
  • 7. Why?
  • 8. K-nearest Neighbor with Super Fast k-means
  • 9. What’s that?
      • Find the k nearest training examples
      • Use the average value of the target variable from them
      • This is easy … but hard
        – easy because it is so conceptually simple and you don’t have knobs to turn or models to build
        – hard because of the stunning amount of math
        – also hard because we need top 50,000 results
      • Initial prototype was massively too slow
        – 3K queries x 200K examples takes hours
        – needed 20M x 25M in the same time
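The brute-force approach described on this slide can be sketched in a few lines. This is a minimal illustration in Python, not the prototype from the talk; the function name `knn_predict` and the toy data are assumptions made for the example.

```python
import heapq
import math

def knn_predict(query, examples, targets, k):
    """Average the target over the k training examples nearest to query (L2)."""
    # One distance per training example: O(n) work per query, which is
    # why the brute-force version becomes slow at scale.
    scored = ((math.dist(query, x), y) for x, y in zip(examples, targets))
    nearest = heapq.nsmallest(k, scored, key=lambda pair: pair[0])
    return sum(y for _, y in nearest) / len(nearest)

# Toy data (hypothetical): three points near the origin and one far away.
examples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
targets = [1.0, 2.0, 3.0, 100.0]
prediction = knn_predict((0.1, 0.1), examples, targets, k=3)
```

In terms of raw distance computations, 3K x 200K is 6 x 10^8 while 20M x 25M is 5 x 10^14, roughly 830,000 times more work for the same time budget.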
  • 10. How We Did It
      • 2 week hackathon with 6 developers from customer bank
      • Agile-ish development
      • To avoid IP issues
        – all code is Apache Licensed (no ownership question)
        – all data is synthetic (no question of private data)
        – all development done on individual machines, hosting on GitHub
        – open is easier than closed (in this case)
      • Goal is new open technology to facilitate new closed solutions
      • Ambitious goal of ~1,000,000x speedup
  • 11. How We Did It
      • 2 week hackathon with 6 developers from customer bank
      • Agile-ish development
      • To avoid IP issues
        – all code is Apache Licensed (no ownership question)
        – all data is synthetic (no question of private data)
        – all development done on individual machines, hosting on GitHub
        – open is easier than closed (in this case)
      • Goal is new open technology to facilitate new closed solutions
      • Ambitious goal of ~1,000,000x speedup
        – well, really only 100-1000x after basic hygiene
  • 12. What We Did
      • Mechanism for extending Mahout Vectors
        – DelegatingVector, WeightedVector, Centroid
      • Shared memory matrix
        – FileBasedMatrix uses mmap to share very large dense matrices
      • Searcher interface
        – ProjectionSearch, KmeansSearch, LshSearch, Brute
      • Super-fast clustering
        – Kmeans, StreamingKmeans
  • 13. Projection Search
      • java.util.TreeSet!
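The slide gives only the hint that projection search is built on a sorted set. One plausible reading, sketched below in Python, is to project every point onto a few random directions, keep each set of projections sorted, and collect candidates from a small window around the query's projection before ranking them by true distance. The class name `ProjectionSearchSketch`, the `window` parameter, and the defaults are assumptions for illustration; this is not Mahout's `ProjectionSearch` code, and a sorted list with `bisect` stands in for the `TreeSet`.

```python
import bisect
import math
import random

class ProjectionSearchSketch:
    """Hypothetical sketch of approximate search via sorted 1-D projections."""

    def __init__(self, points, num_projections=3, window=4, seed=0):
        rng = random.Random(seed)
        self.points = list(points)
        self.window = window
        dim = len(self.points[0])
        self.directions = []
        self.tables = []
        for _ in range(num_projections):
            d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            norm = math.sqrt(sum(v * v for v in d))
            d = [v / norm for v in d]
            # Sorted (projection, index) pairs play the role of the sorted set.
            table = sorted((self._dot(d, p), i) for i, p in enumerate(self.points))
            self.directions.append(d)
            self.tables.append(table)

    @staticmethod
    def _dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def search(self, query, k):
        candidates = set()
        for d, table in zip(self.directions, self.tables):
            # Locate the query's projection, then take neighbours on both sides.
            pos = bisect.bisect_left(table, (self._dot(d, query), -1))
            lo = max(0, pos - self.window)
            hi = min(len(table), pos + self.window)
            candidates.update(i for _, i in table[lo:hi])
        # Rank the small candidate set by true L2 distance.
        ranked = sorted(candidates, key=lambda i: math.dist(query, self.points[i]))
        return ranked[:k]

rng = random.Random(1)
points = [(rng.random(), rng.random(), rng.random()) for _ in range(200)]
searcher = ProjectionSearchSketch(points)
hits = searcher.search(points[7], k=3)
```

More projections or a wider window raise the chance of catching the true nearest neighbours at the cost of more distance computations, which is presumably the trade-off behind the "How Many Projections?" question.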
  • 14. How Many Projections?
  • 15. K-means Search
      • Simple Idea
        – pre-cluster the data
        – to find the nearest points, search the nearest clusters
      • Recursive application
        – to search a cluster, use a Searcher!
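A minimal Python sketch of the idea on this slide follows. It assumes centroids are already available (here a random sample of the data stands in for real k-means output) and uses a brute-force scan inside each probed cluster where the slide suggests a nested Searcher. The names `build_index`, `kmeans_search`, and `probes` are hypothetical.

```python
import math
import random

def build_index(points, centroids):
    """Group point indices by nearest centroid (the pre-clustering step)."""
    members = [[] for _ in centroids]
    for i, p in enumerate(points):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        members[nearest].append(i)
    return members

def kmeans_search(query, points, centroids, members, k, probes=2):
    """Scan only the `probes` clusters whose centroids are closest to query."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in order[:probes] for i in members[c]]
    # Brute force inside the chosen clusters; a nested searcher could go here.
    candidates.sort(key=lambda i: math.dist(query, points[i]))
    return candidates[:k]

rng = random.Random(2)
points = [(rng.random(), rng.random()) for _ in range(500)]
centroids = rng.sample(points, 10)  # stand-in for real k-means centroids
members = build_index(points, centroids)
hits = kmeans_search(points[42], points, centroids, members, k=5)
```

The result is approximate: a true neighbour that sits just across a cluster boundary is missed unless its cluster is among those probed.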
  • 16 to 20. (Image-only slides; the only text in the transcript is an "x" label on slides 17 and 20.)
  • 21. But This Requires k-means!
      • Need a new k-means algorithm to get speed
        – Hadoop is very slow at iterative map-reduce
        – Maybe Pregel clones like Giraph would be better
        – Or maybe not
      • Streaming k-means is
        – One pass (through the original data)
        – Very fast (20 μs per data point with threads)
        – Very parallelizable
  • 22. Basic Method
      • Use a single pass of k-means with very many clusters
        – output is a bad-ish clustering but a good surrogate
      • Use weighted centroids from step 1 to do in-memory clustering
        – output is a good clustering with fewer clusters
  • 23. Algorithmic Details
      Foreach data point x_n
        compute distance to nearest centroid, ∂
        sample u; if u > ∂/ß, add to nearest centroid
        else create new centroid
        if number of centroids > 10 log n
          recursively cluster centroids
          set ß = 1.5 ß if number of centroids did not decrease
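The pseudocode on this slide can be turned into a small runnable sketch. The version below is a Python illustration under several assumptions that the slide does not state: u is drawn uniformly from [0, 1), the logarithm is natural, the starting value of ß is arbitrary, and the centroid limit has a floor of 2 so the first few points do not trigger a collapse. It is not the Mahout StreamingKmeans implementation, and the names `_absorb` and `streaming_kmeans` are hypothetical.

```python
import math
import random

def _absorb(centroids, x, w, beta, rng):
    """Apply the slide's rule to one weighted point, mutating centroids."""
    if centroids:
        i = min(range(len(centroids)), key=lambda j: math.dist(x, centroids[j][0]))
        d = math.dist(x, centroids[i][0])
        if rng.random() > d / beta:
            # Close enough: fold the point into the nearest centroid
            # using a weighted mean, and add its weight.
            c, cw = centroids[i]
            total = cw + w
            centroids[i] = ([(ci * cw + xi * w) / total for ci, xi in zip(c, x)], total)
            return
    # Far away (or the very first point): start a new centroid.
    centroids.append((list(x), w))

def streaming_kmeans(points, beta=0.05, cap_factor=10.0, seed=0):
    """Single pass over points; returns (weighted centroids, final beta)."""
    rng = random.Random(seed)
    centroids = []
    for n, x in enumerate(points, start=1):
        _absorb(centroids, x, 1.0, beta, rng)
        limit = max(2.0, cap_factor * math.log(n))  # floor of 2 is an assumption
        if len(centroids) > limit:
            # Too many centroids: re-cluster the centroids themselves.
            merged = []
            for c, w in centroids:
                _absorb(merged, c, w, beta, rng)
            if len(merged) >= len(centroids):
                beta *= 1.5  # no progress, so loosen the cutoff
            centroids = merged
    return centroids, beta

rng = random.Random(3)
data = [(rng.random(), rng.random()) for _ in range(400)]
sketch, final_beta = streaming_kmeans(data)
```

Because every merge is a weighted mean, the total weight of the centroids always equals the number of points seen, and their weighted mean equals the mean of the data. That is what lets the centroids act as a surrogate for the original data set.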
  • 24. How It Works
      • Result is large set of centroids
        – these provide approximation of original distribution
        – we can cluster centroids to get a close approximation of clustering original
        – or we can just use the result directly
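Clustering the centroids rather than the original data only requires that the in-memory algorithm respect each centroid's weight. As an illustration, the Python sketch below runs plain Lloyd iterations over (point, weight) pairs. The choice of Lloyd's algorithm, the function name `weighted_kmeans`, and the toy input are assumptions; the talk does not say which in-memory method was used.

```python
import math

def weighted_kmeans(weighted_points, initial_centers, iterations=10):
    """Lloyd's iterations where each input point carries a weight."""
    centers = [list(c) for c in initial_centers]
    for _ in range(iterations):
        sums = [[0.0] * len(c) for c in centers]
        totals = [0.0] * len(centers)
        for x, w in weighted_points:
            # Assign the weighted point to its nearest current center.
            j = min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
            totals[j] += w
            for d, xd in enumerate(x):
                sums[j][d] += w * xd
        for j, total in enumerate(totals):
            if total > 0:  # leave an empty cluster's center where it was
                centers[j] = [s / total for s in sums[j]]
    return centers

# Hypothetical output of a first pass: four centroids with weights.
sketch = [((0.0, 0.0), 3.0), ((0.0, 2.0), 1.0), ((10.0, 10.0), 2.0), ((12.0, 10.0), 2.0)]
centers = weighted_kmeans(sketch, initial_centers=[(0.0, 0.0), (10.0, 10.0)])
# centers == [[0.0, 0.5], [11.0, 10.0]]
```

The heavier centroid at the origin pulls the first center toward itself, which is the effect the weights are meant to preserve.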
  • 25. Parallel Speedup? (Plot: time per point in μs against number of threads; series: threaded version, non-threaded, perfect scaling.)
  • 26. Warning, Recursive Descent
      • Inner loop requires finding nearest centroid
      • With lots of centroids, this is slow
      • But wait, we have classes to accelerate that!
  • 27. Warning, Recursive Descent
      • Inner loop requires finding nearest centroid
      • With lots of centroids, this is slow
      • But wait, we have classes to accelerate that! (Let’s not use k-means searcher, though)
  • 28. Warning, Recursive Descent
      • Inner loop requires finding nearest centroid
      • With lots of centroids, this is slow
      • But wait, we have classes to accelerate that! (Let’s not use k-means searcher, though)
      • Empirically, projection search beats 64 bit LSH by a bit
  • 29. Moving to Scale
      • Map-reduce implementation nearly trivial
      • Map: rough-cluster input data, output ß, weighted centroids
      • Reduce:
        – single reducer gets all centroids
        – if too many centroids, merge using recursive clustering
        – optionally do final clustering in-memory
      • Combiner possible, but essentially never important
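The map and reduce roles on this slide can be mimicked with ordinary functions. In the Python sketch below, the mapper rough-clusters one input split and emits its cutoff together with weighted centroids, and a single reducer concatenates everything and re-clusters while there are too many centroids. To keep the example short, a deterministic fixed-cutoff greedy rule stands in for the randomized streaming pass, and starting the reducer from the largest mapper cutoff is an assumption about how ß is used. All names (`rough_cluster`, `mapper`, `reducer`, `max_centroids`) are hypothetical and this is not Hadoop or Mahout code.

```python
import math

def rough_cluster(weighted_points, cutoff):
    """Greedy stand-in for the streaming pass: merge a point into the nearest
    centroid when it is within cutoff, otherwise start a new centroid."""
    centroids = []
    for x, w in weighted_points:
        best = min(range(len(centroids)),
                   key=lambda i: math.dist(x, centroids[i][0]), default=None)
        if best is not None and math.dist(x, centroids[best][0]) <= cutoff:
            c, cw = centroids[best]
            total = cw + w
            centroids[best] = ([(ci * cw + xi * w) / total for ci, xi in zip(c, x)], total)
        else:
            centroids.append((list(x), w))
    return centroids

def mapper(split, cutoff):
    """Map: rough-cluster one split; emit the cutoff and weighted centroids."""
    return cutoff, rough_cluster(((x, 1.0) for x in split), cutoff)

def reducer(mapper_outputs, max_centroids):
    """Reduce: one reducer sees all centroids and merges if there are too many."""
    cutoff = max(c for c, _ in mapper_outputs)  # assumption: start from largest
    merged = [cw for _, centroids in mapper_outputs for cw in centroids]
    while len(merged) > max_centroids:
        merged = rough_cluster(merged, cutoff)
        cutoff *= 1.5  # loosen the cutoff in case another round is needed
    return merged

splits = [
    [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)],
    [(0.0, 0.1), (5.1, 5.0), (9.0, 0.0)],
]
outputs = [mapper(split, cutoff=0.5) for split in splits]
final = reducer(outputs, max_centroids=3)
```

Only the centroids cross the map/reduce boundary, never the raw points, so the reducer's input stays small however large the splits are.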
  • 30. Contact:
        – tdunning@maprtech.com
        – @ted_dunning
      Slides and such:
        – http://info.mapr.com/ted-mlconf
      Hash tags: #mlconf #mahout #mapr
