LA HUG - Ted Dunning 2012-09-25

Recent algorithmic developments [1] have enabled dramatic improvements in performance for clustering applications. Previously, the workhorse clustering algorithm was k-means, which scaled as the desired number of clusters times the data size times the number of iterations required. The number of iterations itself depended on the number of clusters, and in map-reduce implementations such as the one in Mahout [2], the required iterative implementation is exceedingly painful.

These new algorithms require only a single pass over the data, and each pass costs roughly O(log k) per point, where k is the desired number of clusters. The resulting implementation [3], which is being ported into Mahout, has demonstrated some stunning speed. In one test, a threaded implementation on a single machine clustered data at just 20 microseconds per point. Moreover, the algorithm ports easily to map-reduce with essentially perfect linear scaling. This implies we should be able to cluster hundreds of millions of data points in minutes on a moderate-sized cluster. Even more exciting, these are online algorithms, so it is possible to build a real-time clustering engine that clusters data points as they arrive and never needs to look back at old data.

I will talk about the basic intuitions behind these algorithms, how they are implemented, their limitations and how to use them. I will also talk about some of the very exciting practical implications of having a super-fast clustering algorithm.


    1. Super-fast Online Clustering

    2. whoami – Ted Dunning

    3. Clustering? Why?
       • Because other people do it
         – Really!
       • Because cluster distances make great model features
         – Better
       • Because good clusters help with really fast nearest neighbor search
         – Very nice
       • Because we can use clusters as a surrogate for all the data
         – And that lets us train models or do visualization

    4. Agenda
       • Nearest neighbor models
         – Colored dots; need good distance metric; projection, LSH and k-means search
       • K-means algorithms
         – O(k d log n) per point for Lloyd's algorithm … not good for k = 2000, n = 10^8
         – Surrogate methods
           • fast, sloppy single-pass clustering with κ = k log n
           • fast, sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
           • fast, in-memory, high-quality clustering of κ weighted centroids
           • result consists of k high-quality centroids for the original data
       • Results

    5. Nearest Neighbor Models
       • Find the k nearest training examples
       • Use the average value of the target variable from them
       • This is easy … but hard
         – easy because it is so conceptually simple and you don't have knobs to turn or models to build
         – hard because of the stunning amount of math
         – also hard because we need top 50,000 results
       • Initial rapid prototype was massively too slow
         – 3K queries x 200K examples takes hours
         – needed 20M x 25M in the same time

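    The model really is that simple. Below is a minimal sketch of brute-force k-NN regression as the slide describes it, assuming dense double[] feature vectors and Euclidean distance; the class and method names are illustrative, not from the Mahout code.

        import java.util.Comparator;
        import java.util.PriorityQueue;

        // Minimal sketch: brute-force k-NN regression as described on this slide.
        public class KnnRegression {
            static double predict(double[][] examples, double[] targets, double[] query, int k) {
                // max-heap on distance so we can evict the farthest of the current k
                PriorityQueue<Integer> heap = new PriorityQueue<>(
                    k, Comparator.comparingDouble((Integer i) -> distance(examples[i], query)).reversed());
                for (int i = 0; i < examples.length; i++) {
                    heap.add(i);
                    if (heap.size() > k) {
                        heap.poll();   // drop the farthest candidate
                    }
                }
                double sum = 0;
                int n = heap.size();
                for (int i : heap) {
                    sum += targets[i];
                }
                return sum / n;        // average target value of the k nearest examples
            }

            static double distance(double[] a, double[] b) {
                double s = 0;
                for (int i = 0; i < a.length; i++) {
                    double d = a[i] - b[i];
                    s += d * d;
                }
                return Math.sqrt(s);
            }
        }

    Each query scans all n examples with an O(log k) heap update per example, which is exactly why the rest of the talk is needed: at 20M queries against 25M references, brute force is hopeless.
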
    6. K-Nearest Neighbor Example
       (Speaker note: the idea is to guess what color a new dot should be by looking at the points within the circle. The first should obviously be purple, the second cyan. The third is uncertain, but probably isn't green or cyan, and is a bit more likely to be red than purple.)

    7. Comparison to Other Modeling Approaches
       • Logistic regression
         – Depends on linear separability
         – k-nn works very well if logistic regression works
         – k-nn can work very well even if logistic regression fails due to interactions producing a non-linear decision surface
       • Tree-based methods
         – mostly roughly equivalent in accuracy with k-nn

    8. Required Scale and Speed and Accuracy
       • Want 20 million queries against 25 million references in 10,000 s
       • Should be able to search > 100 million references
       • Should be linearly and horizontally scalable
       • Must have > 50% overlap against reference search
       • Evaluation by sub-sampling is viable, but tricky

    9. How Hard is That?
       • 20M x 25M x 100 Flop = 50 PFlop
       • 1 CPU = 5 GFlops
       • We need 10M CPU-seconds => 1,000 CPUs for 10,000 s
       • Real-world efficiency losses may increase that by 10x
       • Not good!

    10. How Can We Search Faster?
       • First rule: don't do it
         – If we can eliminate most candidates, we can do less work
         – Projection search and k-means search
       • Second rule: don't do it
         – We can convert big floating point math to clever bit-wise integer math
         – Locality sensitive hashing
       • Third rule: reduce dimensionality
         – Projection search
         – Random projection for very high dimension

    11. Note the Circularity
       • Clustering helps nearest neighbor search
       • But clustering needs nearest neighbor search internally
       • How droll!

    12. Projection Search
       • java.lang.TreeSet!

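    The one-word hint above points at the whole trick: project every reference vector onto a random direction, keep the resulting scalars in an ordered collection, and scan outward from the query's own projection, since points that are close in space must also be close in projection. Here is a minimal sketch under those assumptions; a TreeMap keyed by projection value stands in for the TreeSet, and all names are illustrative.

        import java.util.ArrayList;
        import java.util.List;
        import java.util.Map;
        import java.util.Random;
        import java.util.TreeMap;

        // Sketch of projection search: points ordered by their projection onto
        // one random direction; only a window around the query's projection
        // value is examined.
        public class ProjectionSearch {
            private final double[] direction;                    // random direction
            private final TreeMap<Double, double[]> index = new TreeMap<>();

            public ProjectionSearch(int dim, Random rand) {
                direction = new double[dim];
                for (int i = 0; i < dim; i++) {
                    direction[i] = rand.nextGaussian();
                }
            }

            public void add(double[] point) {
                index.put(dot(direction, point), point);         // ignores key collisions in this sketch
            }

            // Return up to 2*window candidates whose projections bracket the query's.
            public List<double[]> candidates(double[] query, int window) {
                double key = dot(direction, query);
                List<double[]> result = new ArrayList<>();
                for (Map.Entry<Double, double[]> e : index.headMap(key, true).descendingMap().entrySet()) {
                    if (result.size() >= window) break;
                    result.add(e.getValue());
                }
                for (Map.Entry<Double, double[]> e : index.tailMap(key, false).entrySet()) {
                    if (result.size() >= 2 * window) break;
                    result.add(e.getValue());
                }
                return result;
            }

            private static double dot(double[] a, double[] b) {
                double s = 0;
                for (int i = 0; i < a.length; i++) s += a[i] * b[i];
                return s;
            }
        }

    The converse does not hold (points close in projection can still be far apart), so candidates still need exact distance checks, and several independent projections are used to cut down the false candidates. How many is exactly the next slide's question.
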
    13. How Many Projections?

    14. LSH Search
       • Each random projection produces an independent sign bit
       • If two vectors have the same projected sign bits, they probably point in the same direction (i.e. cos θ ≈ 1)
       • Distance in L2 is closely related to cosine:
         \|x - y\|^2 = \|x\|^2 - 2(x \cdot y) + \|y\|^2 = \|x\|^2 - 2\|x\|\|y\|\cos\theta + \|y\|^2
       • We can replace (some) vector dot products with long integer XOR

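    Here is a sketch of that sign-bit machinery, assuming 64 random projections packed into a single long; the names are illustrative. For random directions the probability that a single sign bit differs between two vectors is θ/π, so the fraction of mismatched bits estimates the angle, and one XOR plus a bit count replaces 64 dot products.

        import java.util.Random;

        // Sketch: 64 random-projection sign bits per vector, packed into a long.
        public class LshSketch {
            private final double[][] projections = new double[64][];

            public LshSketch(int dim, Random rand) {
                for (int b = 0; b < 64; b++) {
                    projections[b] = new double[dim];
                    for (int i = 0; i < dim; i++) {
                        projections[b][i] = rand.nextGaussian();
                    }
                }
            }

            public long signature(double[] v) {
                long bits = 0;
                for (int b = 0; b < 64; b++) {
                    double s = 0;
                    for (int i = 0; i < v.length; i++) {
                        s += projections[b][i] * v[i];
                    }
                    if (s > 0) {
                        bits |= 1L << b;       // one sign bit per random projection
                    }
                }
                return bits;
            }

            // Estimated cosine between the vectors behind two signatures:
            // mismatched bits estimate the angle as pi * mismatches / 64.
            public static double cosineEstimate(long sig1, long sig2) {
                int mismatches = Long.bitCount(sig1 ^ sig2);   // cheap integer XOR
                return Math.cos(Math.PI * mismatches / 64.0);
            }
        }

    The staircase relation between matched bits and estimated cosine is what the plot on the next slide shows.
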
    15. LSH Bit-match Versus Cosine
        [Figure: cosine of the angle between two vectors (y axis, -1 to 1) plotted against the number of matching sign bits out of 64 (x axis, 0 to 64)]

    16. Results

    17. K-means Search
       • First do clustering with lots (thousands) of clusters
       • Then search nearest clusters to find nearest points
       • We win if we find > 50% overlap with the "true" answer
       • We lose if we can't cluster super-fast
         – more on this later

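    A sketch of that two-stage search, assuming the clustering has already been done and cluster memberships are available; brute force over the probed clusters stands in for the real searcher, and the names are illustrative.

        import java.util.ArrayList;
        import java.util.Comparator;
        import java.util.List;

        // Sketch of k-means search: rank centroids by distance to the query,
        // then examine only the members of the nearest few clusters.
        public class KmeansSearch {
            public static List<double[]> candidates(double[][] centroids,
                                                    List<List<double[]>> members,
                                                    double[] query,
                                                    int clustersToProbe) {
                // indices of centroids sorted by distance to the query
                List<Integer> order = new ArrayList<>();
                for (int i = 0; i < centroids.length; i++) order.add(i);
                order.sort(Comparator.comparingDouble((Integer i) -> distance(centroids[i], query)));

                // collect points from the nearest clusters; most of the data is never touched
                List<double[]> result = new ArrayList<>();
                for (int j = 0; j < Math.min(clustersToProbe, order.size()); j++) {
                    result.addAll(members.get(order.get(j)));
                }
                return result;
            }

            static double distance(double[] a, double[] b) {
                double s = 0;
                for (int i = 0; i < a.length; i++) {
                    double d = a[i] - b[i];
                    s += d * d;
                }
                return s;   // squared distance; the ordering is the same
            }
        }

    Probing a handful of the thousands of clusters means most of the references are never touched at all, which is where the 100-200x speedups quoted on slide 20 come from.
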
    18. Lots of Clusters Are Fine

    19. Lots of Clusters Are Fine

    20. Some Details
       • Clumpy data works better
         – Real data is clumpy
       • Speedups of 100-200x seem practical with 50% overlap
         – Projection search and LSH can be used to accelerate that (some)
       • More experiments needed
       • Definitely need fast search

    21. So Now Some Clustering

    22. Lloyd's Algorithm
       • Part of CS folklore
       • Developed in the late 50's for signal quantization, published in the 80's

           initialize k cluster centroids somehow
           for each of many iterations:
               for each data point:
                   assign point to nearest cluster
               recompute cluster centroids from points assigned to clusters

       • Highly variable quality; several restarts recommended

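    A compact, runnable rendering of that pseudocode. This is a sketch only: random initialization, a fixed iteration count, and none of the recommended restart logic.

        import java.util.Random;

        // Sketch of Lloyd's algorithm: alternate assignment and centroid update.
        public class Lloyds {
            public static double[][] cluster(double[][] points, int k, int iterations, Random rand) {
                int dim = points[0].length;
                double[][] centroids = new double[k][];
                for (int i = 0; i < k; i++) {
                    centroids[i] = points[rand.nextInt(points.length)].clone();  // "somehow"
                }
                int[] assignment = new int[points.length];
                for (int iter = 0; iter < iterations; iter++) {
                    // assign each point to its nearest centroid
                    for (int p = 0; p < points.length; p++) {
                        int best = 0;
                        double bestDist = Double.MAX_VALUE;
                        for (int c = 0; c < k; c++) {
                            double d = distance(points[p], centroids[c]);
                            if (d < bestDist) { bestDist = d; best = c; }
                        }
                        assignment[p] = best;
                    }
                    // recompute centroids from the points assigned to them
                    double[][] sums = new double[k][dim];
                    int[] counts = new int[k];
                    for (int p = 0; p < points.length; p++) {
                        int c = assignment[p];
                        counts[c]++;
                        for (int i = 0; i < dim; i++) sums[c][i] += points[p][i];
                    }
                    for (int c = 0; c < k; c++) {
                        if (counts[c] > 0) {
                            for (int i = 0; i < dim; i++) centroids[c][i] = sums[c][i] / counts[c];
                        }
                    }
                }
                return centroids;
            }

            static double distance(double[] a, double[] b) {
                double s = 0;
                for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
                return s;
            }
        }

    Every iteration touches every point-centroid pair, which is the O(k d) per point per iteration that makes the k = 2000, n = 10^8 case from the agenda hopeless.
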
    23. Ball k-means
       • Provably better for highly clusterable data
       • Tries to find initial centroids in the "core" of real clusters
       • Avoids outliers in centroid computation

           initialize centroids randomly with distance-maximizing tendency
           for each of a very few iterations:
               for each data point:
                   assign point to nearest cluster
               recompute centroids using only points much closer than the closest other cluster

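    The distinctive step is the trimmed recomputation at the end. Below is a sketch of one such "ball" update, in which the core of a cluster is taken to be the points within a third of the distance to the nearest other centroid; that trimming rule and its constant are assumptions of this sketch, not necessarily what the Mahout code does.

        // Sketch of the ball step in ball k-means: each centroid is recomputed
        // from only the points in its core, ignoring outliers near cluster
        // boundaries.
        public class BallStep {
            public static double[] ballUpdate(double[][] points, int[] assignment,
                                              double[][] centroids, int c) {
                // distance from centroid c to the nearest other centroid
                double nearestOther = Double.MAX_VALUE;
                for (int j = 0; j < centroids.length; j++) {
                    if (j != c) {
                        nearestOther = Math.min(nearestOther, distance(centroids[c], centroids[j]));
                    }
                }
                double radius = nearestOther / 3.0;   // core radius (assumed constant)

                int dim = centroids[c].length;
                double[] mean = new double[dim];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    // only points assigned to c AND inside the core contribute
                    if (assignment[p] == c && distance(points[p], centroids[c]) < radius) {
                        count++;
                        for (int i = 0; i < dim; i++) mean[i] += points[p][i];
                    }
                }
                if (count == 0) return centroids[c].clone();  // empty core: keep old centroid
                for (int i = 0; i < dim; i++) mean[i] /= count;
                return mean;
            }

            static double distance(double[] a, double[] b) {
                double s = 0;
                for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
                return Math.sqrt(s);
            }
        }
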
    24. Surrogate Method
       • Start with sloppy clustering into κ = k log n clusters
       • Use these clusters as a weighted surrogate for the data
       • Cluster surrogate data using ball k-means
       • Results are provably high quality for highly clusterable data
       • Sloppy clustering can be done on-line
       • Surrogate can be kept in memory
       • Ball k-means pass can be done at any time

    25. Algorithm Costs
       • O(k d log n) per point for Lloyd's algorithm
         – … not so good for k = 2000, n = 10^8
       • Surrogate methods
         – fast, sloppy single-pass clustering with κ = k log n
         – fast, sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
         – fast, in-memory, high-quality clustering of κ weighted centroids
         – result consists of k high-quality centroids
       • This is a big deal:
         – k log n = 2000 x 26 ≈ 50,000
         – log k + log log n = 11 + 5 = 16
         – ~3000 times faster makes the grade as a bona fide big deal

    26. The Internals
       • Mechanism for extending Mahout Vectors
         – DelegatingVector, WeightedVector, Centroid
       • Searcher interface
         – ProjectionSearch, KmeansSearch, LshSearch, Brute
       • Super-fast clustering
         – Kmeans, StreamingKmeans

    27. How It Works
       • For each point
         – Find approximately nearest centroid (distance = d)
         – If d > threshold, create a new centroid
         – Else possibly create a new centroid anyway
         – Else add to the nearest centroid
       • If centroids > K ~ C log N
         – Recursively cluster the centroids with a higher threshold
       • Result is a large set of centroids
         – these provide an approximation of the original distribution
         – we can cluster the centroids to get a close approximation of clustering the original data
         – or we can just use the result directly

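    A sketch of that loop. The probabilistic "possibly create a new centroid" rule, the initial threshold, and the threshold growth factor are all assumptions chosen to mirror common streaming k-means formulations rather than the exact Mahout StreamingKmeans code, and brute force stands in for the approximate nearest-centroid search.

        import java.util.ArrayList;
        import java.util.List;
        import java.util.Random;

        // Sketch of one-pass streaming clustering as outlined on this slide.
        public class StreamingSketch {
            static class WeightedCentroid {
                double[] mean;
                double weight;
                WeightedCentroid(double[] p) { mean = p.clone(); weight = 1; }
            }

            private final List<WeightedCentroid> centroids = new ArrayList<>();
            private final int maxCentroids;          // K ~ C log N
            private double threshold;
            private final Random rand = new Random();

            public StreamingSketch(int maxCentroids, double initialThreshold) {
                this.maxCentroids = maxCentroids;
                this.threshold = initialThreshold;
            }

            public void add(double[] point) {
                if (centroids.isEmpty()) {
                    centroids.add(new WeightedCentroid(point));
                    return;
                }
                // find (approximately) nearest centroid; brute force stands in
                // for the fast searcher here
                WeightedCentroid nearest = null;
                double d = Double.MAX_VALUE;
                for (WeightedCentroid c : centroids) {
                    double dist = distance(c.mean, point);
                    if (dist < d) { d = dist; nearest = c; }
                }
                // new centroid if far away, or occasionally even when close
                if (d > threshold || rand.nextDouble() < d / threshold) {
                    centroids.add(new WeightedCentroid(point));
                } else {
                    nearest.weight += 1;              // weighted running-mean update
                    for (int i = 0; i < point.length; i++) {
                        nearest.mean[i] += (point[i] - nearest.mean[i]) / nearest.weight;
                    }
                }
                // too many centroids: raise the threshold and re-cluster the
                // centroids themselves (recursive collapse)
                if (centroids.size() > maxCentroids) {
                    threshold *= 1.5;                 // growth factor is an assumption
                    List<WeightedCentroid> old = new ArrayList<>(centroids);
                    centroids.clear();
                    for (WeightedCentroid c : old) {
                        add(c.mean);                  // sketch: re-insert, ignoring weights
                    }
                }
            }

            public List<double[]> centroidMeans() {
                List<double[]> result = new ArrayList<>();
                for (WeightedCentroid c : centroids) result.add(c.mean.clone());
                return result;
            }

            static double distance(double[] a, double[] b) {
                double s = 0;
                for (int i = 0; i < a.length; i++) { double t = a[i] - b[i]; s += t * t; }
                return Math.sqrt(s);
            }
        }

    Feeding the weighted centroids this produces into a ball k-means pass gives exactly the surrogate method of slide 24.
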
    28. Parallel Speedup?
        [Figure: time per point (μs, log scale) versus number of threads (1 to 20); the threaded version tracks the perfect-scaling line from roughly 100 μs at 2 threads down to roughly 10 μs at 16 threads, while the non-threaded version sits near 200 μs per point]

    29. What About Map-Reduce
       • Map-reduce implementation is nearly trivial
         – Compute surrogate on each split
         – Total surrogate is union of all partial surrogates
         – Do in-memory clustering on total surrogate
       • Threaded version shows linear speedup already
         – Map-reduce speedup is likely, not entirely guaranteed

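    That split/union structure is simple enough to show without any Hadoop machinery. In the sketch below, threads stand in for mappers and the StreamingSketch class sketched under slide 27 computes each partial surrogate; the final in-memory ball k-means pass over the union is elided.

        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.Future;

        // Sketch of the map-reduce structure: one sloppy surrogate per split,
        // union of the partial surrogates, then in-memory clustering afterwards.
        public class ParallelSurrogate {
            public static List<double[]> surrogate(List<double[][]> splits, int kappa)
                    throws Exception {
                ExecutorService pool = Executors.newFixedThreadPool(splits.size());
                List<Future<List<double[]>>> partials = new ArrayList<>();
                for (double[][] split : splits) {
                    partials.add(pool.submit(() -> {
                        // "map": compute a partial surrogate for this split
                        StreamingSketch sketch = new StreamingSketch(kappa, 1.0);
                        for (double[] point : split) {
                            sketch.add(point);
                        }
                        return sketch.centroidMeans();
                    }));
                }
                // "reduce": the total surrogate is just the union of the partials,
                // small enough to cluster in memory with ball k-means afterwards
                List<double[]> union = new ArrayList<>();
                for (Future<List<double[]>> f : partials) {
                    union.addAll(f.get());
                }
                pool.shutdown();
                return union;
            }
        }
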
    30. How Well Does it Work?
       • Theoretical guarantees for well clusterable data
         – Shindler, Wong and Meyerson, NIPS, 2011
       • Evaluation on held-out data
         – Need results here

    31. Summary
       • Nearest neighbor algorithms can be blazing fast
       • But you need blazing fast clustering
         – Which we now have

    32. Contact Us!
       • We're hiring at MapR in California
       • Contact Ted at tdunning@maprtech.com or @ted_dunning
       • For slides and other info: http://www.mapr.com/company/events/speaking/la-hug-9-25-12
