These new algorithms require only a single pass over the data, and the cost per data point is roughly O(log k), where k is the desired number of clusters. The resulting implementation [3], which is being ported into Mahout, has demonstrated some stunning speed. In one test, a uni-processor threaded implementation clustered data points in just 20 microseconds per data point. Moreover, this algorithm is easily ported to map-reduce with essentially perfect linear scaling. This implies we should be able to cluster hundreds of millions of data points in minutes on a moderately sized cluster. Even more exciting, these are online algorithms, so it is possible to build a real-time clustering engine that clusters data points as they arrive and never needs to look back at old data.

I will talk about the basic intuitions behind these algorithms, how they are implemented, their limitations and how to use them. I will also talk about some of the very exciting practical implications of having a super-fast clustering algorithm.

- 1. Super-fast Online Clustering © MapR Technologies - Confidential
- 2. whoami – Ted Dunning
- 3. Clustering? Why?
  - Because other people do it
    - Really!
  - Because cluster distances make great model features
    - Better
  - Because good clusters help with really fast nearest neighbor search
    - Very nice
  - Because we can use clusters as a surrogate for all the data
    - And that lets us train models or do visualization
- 4. Agenda
  - Nearest neighbor models
    - Colored dots; need good distance metric; projection, LSH and k-means search
  - K-means algorithms
    - O(k d log n) per point for Lloyd’s algorithm … not good for k = 2000, n = 10^8
    - Surrogate methods
      - fast, sloppy single pass clustering with κ = k log n
      - fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
      - fast, in-memory, high-quality clustering of κ weighted centroids
      - result consists of k high-quality centroids for the original data
  - Results
- 5. Nearest Neighbor Models
  - Find the k nearest training examples
  - Use the average value of the target variable from them
  - This is easy … but hard
    - easy because it is so conceptually simple and you don’t have knobs to turn or models to build
    - hard because of the stunning amount of math
    - also hard because we need top 50,000 results
  - Initial rapid prototype was massively too slow
    - 3K queries x 200K examples takes hours
    - needed 20M x 25M in the same time
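The nearest neighbor model on slide 5 can be sketched in a few lines. This is only a brute-force illustration, not the implementation from the talk; the function name `knn_predict` and the toy data are made up for the example.

```python
import math

def knn_predict(query, examples, k):
    """Average the target of the k training examples nearest to `query`.

    `examples` is a list of (vector, target) pairs. The search is a plain
    linear scan, which is exactly the part that becomes too slow at scale.
    """
    by_distance = sorted(examples, key=lambda ex: math.dist(query, ex[0]))
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)

# Hypothetical toy data: two close examples and one far away.
training = [((0.0, 0.0), 1.0), ((0.1, 0.0), 3.0), ((5.0, 5.0), 10.0)]
prediction = knn_predict((0.0, 0.1), training, k=2)
print(prediction)
```

Every query touches every training example here, which is why the rest of the deck is about avoiding most of those distance computations.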
- 6. K-Nearest Neighbor Example
- 7. Comparison to Other Modeling Approaches
  - Logistic regression
    - Depends on linear separability
    - k-nn works very well if logistic regression works
    - k-nn can work very well even if logistic regression fails due to interactions producing non-linear decision surface
  - Tree-based methods
    - mostly roughly equivalent in accuracy with k-nn
- 8. Required Scale and Speed and Accuracy
  - Want 20 million queries against 25 million references in 10,000 s
  - Should be able to search > 100 million references
  - Should be linearly and horizontally scalable
  - Must have >50% overlap against reference search
  - Evaluation by sub-sampling is viable, but tricky
- 9. How Hard is That?
  - 20 M x 25 M x 100 Flop = 50 P Flop
  - 1 CPU = 5 Gflops
  - We need 10 M CPU seconds => 1,000 CPUs to finish in 10,000 s
  - Real-world efficiency losses may increase that by 10x
  - Not good!
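The back-of-envelope estimate on slide 9 can be checked directly. The sketch below uses only the figures quoted on slides 8 and 9 (100 flop per comparison, 5 Gflops per CPU, a 10,000 s budget); the variable names are illustrative. Dividing the CPU-seconds by the time budget gives the ideal CPU count before any efficiency losses.

```python
queries = 20_000_000          # 20 M queries
references = 25_000_000       # 25 M reference points
flop_per_comparison = 100     # per query/reference pair, as quoted on the slide
cpu_flops = 5e9               # 5 Gflops per CPU
time_budget_s = 10_000        # time budget from slide 8

total_flop = queries * references * flop_per_comparison   # 5e16 = 50 PFlop
cpu_seconds = total_flop / cpu_flops                      # 1e7 = 10 M CPU seconds
cpus_needed = cpu_seconds / time_budget_s                 # ideal CPU count

print(total_flop, cpu_seconds, cpus_needed)
```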
- 10. How Can We Search Faster?
  - First rule: don’t do it
    - If we can eliminate most candidates, we can do less work
    - Projection search and k-means search
  - Second rule: don’t do it
    - We can convert big floating point math to clever bit-wise integer math
    - Locality sensitive hashing
  - Third rule: reduce dimensionality
    - Projection search
    - Random projection for very high dimension
- 11. Note the Circularity
  - Clustering helps nearest neighbor search
  - But clustering needs nearest neighbor search internally
  - How droll!
- 12. Projection Search
  - java.util.TreeSet!
- 13. How Many Projections?
- 14. LSH Search
  - Each random projection produces independent sign bit
  - If two vectors have the same projected sign bits, they probably point in the same direction (i.e. cos θ ≈ 1)
  - Distance in L2 is closely related to cosine: $\|x - y\|^2 = \|x\|^2 - 2 (x \cdot y) + \|y\|^2 = \|x\|^2 - 2 \|x\| \|y\| \cos\theta + \|y\|^2$
  - We can replace (some) vector dot products with long integer XOR
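The sign-bit idea on slide 14 can be illustrated as follows. This is a minimal sketch, not the `LshSearch` code from the talk: the function names are hypothetical, and the cosine estimate relies on the standard property of random hyperplane hashing that two vectors at angle θ disagree on each sign bit with probability θ/π, which the slides do not state explicitly.

```python
import math
import random

def random_projections(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian directions in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_signature(vector, projections):
    """Pack the sign bit of each random projection into one integer."""
    bits = 0
    for i, direction in enumerate(projections):
        dot = sum(v * p for v, p in zip(vector, direction))
        if dot >= 0:
            bits |= 1 << i
    return bits

def estimated_cosine(sig_a, sig_b, n_bits):
    """Estimate cos(theta) from the number of differing sign bits (XOR + popcount)."""
    differing = bin(sig_a ^ sig_b).count("1")
    theta = math.pi * differing / n_bits
    return math.cos(theta)

projections = random_projections(dim=10, n_bits=64)
x = [1.0] * 10
same_direction = [2.0] * 10   # same direction, different length
opposite = [-1.0] * 10        # exactly opposite direction
sig_x = lsh_signature(x, projections)
```

Once the signatures are computed, comparing two vectors costs one XOR and one bit count instead of a full floating point dot product.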
- 15. LSH Bit-match Versus Cosine [Figure: plot with the y-axis running from -1 to 1 and the x-axis running from 0 to 64]
- 16. Results
- 17. K-means Search
  - First do clustering with lots (thousands) of clusters
  - Then search nearest clusters to find nearest points
  - We win if we find >50% overlap with “true” answer
  - We lose if we can’t cluster super-fast
    - more on this later
- 18. Lots of Clusters Are Fine
- 19. Lots of Clusters Are Fine
- 20. Some Details
  - Clumpy data works better
    - Real data is clumpy
  - Speedups of 100-200x seem practical with 50% overlap
    - Projection search and LSH can be used to accelerate that (some)
  - More experiments needed
  - Definitely need fast search
- 21. So Now Some Clustering
- 22. Lloyd’s Algorithm
  - Part of CS folk-lore
  - Developed in the late 50’s for signal quantization, published in 80’s
  - initialize k cluster centroids somehow
  - for each of many iterations:
    - for each data point:
      - assign point to nearest cluster
    - recompute cluster centroids from points assigned to clusters
  - Highly variable quality, several restarts recommended
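The pseudocode on slide 22 translates almost line for line into Python. The version below is a small sketch under stated assumptions: centroids are initialized by sampling k data points at random (the slide only says "somehow"), the iteration count is fixed, and the name `lloyd_kmeans` is illustrative.

```python
import math
import random

def lloyd_kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm with brute-force nearest-centroid search."""
    rng = random.Random(seed)
    # Assumed initialization: pick k distinct data points at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its points.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return centroids

# Two well-separated clumps of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = sorted(lloyd_kmeans(data, k=2))
print(centers)
```

Each iteration computes k distances per point, which is the k d factor in the per-point cost the deck quotes for Lloyd's algorithm.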
- 23. Ball k-means
  - Provably better for highly clusterable data
  - Tries to find initial centroids in the “core” of real clusters
  - Avoids outliers in centroid computation
  - initialize centroids randomly with distance maximizing tendency
  - for each of a very few iterations:
    - for each data point:
      - assign point to nearest cluster
    - recompute centroids using only points much closer than closest cluster
- 24. Surrogate Method
  - Start with sloppy clustering into κ = k log n clusters
  - Use these clusters as a weighted surrogate for the data
  - Cluster surrogate data using ball k-means
  - Results are provably high quality for highly clusterable data
  - Sloppy clustering can be done on-line
  - Surrogate can be kept in memory
  - Ball k-means pass can be done at any time
- 25. Algorithm Costs
  - O(k d log n) per point for Lloyd’s algorithm … not so good for k = 2000, n = 10^8
  - Surrogate methods
    - fast, sloppy single pass clustering with κ = k log n
    - fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
    - fast, in-memory, high-quality clustering of κ weighted centroids
    - result consists of k high-quality centroids
  - This is a big deal:
    - k d log n = 2000 x 10 x 26 = 520,000
    - d (log k + log log n) = 10 x (11 + 5) = 160
    - roughly 3000 times faster makes the grade as a bona fide big deal
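The cost comparison on slide 25 can be reproduced numerically. The sketch assumes base-2 logarithms (log n ≈ 26 for n = 10^8 suggests this) and d = 10, the dimension implied by the slide's own product; constant factors hidden by the O-notation are ignored, so the result is only an order-of-magnitude check.

```python
import math

k = 2000        # desired number of clusters
d = 10          # dimension implied by the slide's arithmetic
n = 10**8       # number of data points

log_n = math.log2(n)                      # about 26.6
lloyd_cost = k * d * log_n                # per-point cost of Lloyd's algorithm
kappa = k * log_n                         # number of surrogate clusters, k log n
surrogate_cost = d * math.log2(kappa)     # d (log k + log log n)
speedup = lloyd_cost / surrogate_cost

print(round(lloyd_cost), round(surrogate_cost), round(speedup))
```

With unrounded logarithms the ratio comes out slightly above 3000, consistent with the slide's claim.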
- 26. The Internals
  - Mechanism for extending Mahout Vectors
    - DelegatingVector, WeightedVector, Centroid
  - Searcher interface
    - ProjectionSearch, KmeansSearch, LshSearch, Brute
  - Super-fast clustering
    - Kmeans, StreamingKmeans
- 27. How It Works
  - For each point
    - Find approximately nearest centroid (distance = d)
    - If d > threshold, new centroid
    - Else possibly new cluster
    - Else add to nearest centroid
  - If centroids > K ~ C log N
    - Recursively cluster centroids with higher threshold
  - Result is large set of centroids
    - these provide approximation of original distribution
    - we can cluster centroids to get a close approximation of clustering original
    - or we can just use the result directly
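The steps on slide 27 can be sketched as a single-pass loop. This is an illustration under several assumptions, not the Mahout `StreamingKmeans` code: the nearest centroid is found by exact brute-force search rather than an approximate searcher, "possibly new cluster" is modeled as starting a new centroid with probability d / threshold, the threshold grows by a fixed factor when there are too many centroids, and `max_centroids` (assumed to be at least 1) stands in for K ~ C log N. All names are hypothetical.

```python
import math
import random

def streaming_cluster(points, max_centroids, threshold=1.0, growth=1.5, seed=0):
    """One-pass sketch; returns a list of [centroid, weight] pairs."""
    rng = random.Random(seed)
    centroids = []

    def add(point, weight, cents):
        if cents:
            i = min(range(len(cents)), key=lambda j: math.dist(point, cents[j][0]))
            dist = math.dist(point, cents[i][0])
            # Far points always start a new centroid; nearer points do so at
            # random with probability dist / threshold (an assumed rule).
            if dist <= threshold and rng.random() >= dist / threshold:
                center, w = cents[i]
                total = w + weight
                # Merge: move the centroid to the weighted mean.
                cents[i] = [[(c * w + x * weight) / total
                             for c, x in zip(center, point)], total]
                return
        cents.append([list(point), weight])

    for p in points:
        add(p, 1.0, centroids)
        # Too many centroids: raise the threshold and recluster the centroids.
        while len(centroids) > max_centroids:
            threshold *= growth
            old, centroids = centroids, []
            for center, w in old:
                add(center, w, centroids)
    return centroids

# Hypothetical data: three tight clumps of 200 points each.
data_rng = random.Random(1)
pts = [(data_rng.gauss(cx, 0.1), data_rng.gauss(cy, 0.1))
       for cx, cy in [(0, 0), (5, 5), (10, 0)] for _ in range(200)]
surrogate = streaming_cluster(pts, max_centroids=20)
```

The returned weighted centroids play the role of the surrogate: their weights sum to the number of input points, so they can be handed to a second, in-memory clustering step.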
- 28. Parallel Speedup? [Figure: time per point (μs) versus number of threads, showing the non-threaded version, the threaded version, and a perfect-scaling reference]
- 29. What About Map-Reduce
  - Map-reduce implementation is nearly trivial
    - Compute surrogate on each split
    - Total surrogate is union of all partial surrogates
    - Do in-memory clustering on total surrogate
  - Threaded version shows linear speedup already
    - Map-reduce speedup is likely, not entirely guaranteed
- 30. How Well Does it Work?
  - Theoretical guarantees for well clusterable data
    - Shindler, Wong and Meyerson, NIPS, 2011
  - Evaluation on held-out data
    - Need results here
- 31. Summary
  - Nearest neighbor algorithms can be blazing fast
  - But you need blazing fast clustering
    - Which we now have
- 32. Contact Us!
  - We’re hiring at MapR in California
  - Contact Ted at tdunning@maprtech.com or @ted_dunning
  - For slides and other info: http://www.mapr.com/company/events/speaking/la-hug-9-25-12
