Just as significant, this new algorithm allows clustering with a very large number of clusters which makes it practical to use as a feature extraction algorithm or set up for a nearest neighbor search.

Published in:
Technology


- 1. Customer Behavior Analysis with Large Scale k-means Analysis
- 2. whoami – Ted Dunning• Chief Application Architect, MapR Technologies• Committer, member, Apache Software Foundation – particularly Mahout, Zookeeper and Drill• Contact me at tdunning@maprtech.com tdunning@apache.com ted.dunning@gmail.com @ted_dunning• Get slides and more info at http://www.mapr.com/company/events/speaking/strata-10-2-12
- 3. Agenda• Nearest neighbor models• K-means algorithms – O(k d log n) per point for Lloyd’s algorithm – Surrogate (sketch) methods• Results
- 4. Context• Digital transformation.• Data helps us better serve our customers.• Privacy is paramount.
- 5. The Business Case• Our customer has 100 million cards in circulation• Quick and accurate decision-making is key. – Marketing offers – Fraud prevention
- 6. Opportunity• Demand of modeling is increasing rapidly• So we are testing something simpler and more agile• Like k-nearest neighbor
- 7. What’s that?• Find the k nearest training examples – lookalike customers• This is easy … but hard – easy because it is so conceptually simple and you don’t have knobs to turn or models to build – hard because of the stunning amount of math – also hard because we need top 50,000 results• Initial rapid prototype was massively too slow – 3K queries x 200K examples takes hours – needed 20M x 25M in the same time
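The brute-force search that the slide calls "massively too slow" can be sketched as follows. This is a minimal illustration, not the prototype from the talk; the function names and the synthetic data are assumptions made for the example.

```python
import heapq
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_knn(query, references, k):
    """Return the indices of the k references closest to the query.

    Every query touches every reference, so the cost is O(n * d) per query.
    """
    return heapq.nsmallest(
        k,
        range(len(references)),
        key=lambda i: squared_distance(query, references[i]),
    )


# Synthetic stand-in data: 1,000 reference vectors with 10 dimensions each.
rng = random.Random(0)
references = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(1000)]
neighbors = brute_force_knn(references[0], references, k=5)
```

Because the query here is itself a reference point, it comes back as its own nearest neighbor. The per-query cost grows linearly with the number of references, which is why the remaining slides look for ways to skip most candidates.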
- 8. K-Nearest Neighbor Example
- 9. Required Scale and Speed and Accuracy• Want 20 million queries against 25 million references in 10,000 s• Should be able to search > 100 million references• Should be linearly and horizontally scalable• Must have >50% overlap against reference search• Evaluation by sub-sampling is viable, but tricky
- 10. How Hard is That?• 20 M x 25 M x 100 Flop = 50 P Flop• 1 CPU = 5 Gflops• We need 10 M CPU seconds => 1,000 CPUs to finish in 10,000 s• Real-world efficiency losses may increase that by 10x, to 10,000 CPUs• Not good!
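The estimate can be checked with a few lines of integer arithmetic. The variable names are illustrative; the figures are the ones on the slide plus the 10,000 s budget from the requirements slide. Dividing the 10 M CPU-seconds by that budget gives about 1,000 CPUs before efficiency losses and about 10,000 once the 10x loss factor is applied.

```python
queries = 20_000_000            # 20 M queries
references = 25_000_000         # 25 M reference points
flop_per_comparison = 100       # per query/reference pair, as on the slide
cpu_flop_per_second = 5_000_000_000  # 5 Gflops per CPU
deadline_seconds = 10_000       # time budget from the requirements slide
efficiency_loss_factor = 10     # "real-world" penalty from the slide

total_flop = queries * references * flop_per_comparison
cpu_seconds = total_flop // cpu_flop_per_second
cpus_ideal = cpu_seconds // deadline_seconds
cpus_with_losses = cpus_ideal * efficiency_loss_factor

print(total_flop, cpu_seconds, cpus_ideal, cpus_with_losses)
```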
- 11. How Can We Search Faster?• First rule: don’t do it – If we can eliminate most candidates, we can do less work – Projection search and k-means search• Second rule: don’t do it – We can convert big floating point math to clever bit-wise integer math – Locality sensitive hashing• Third rule: reduce dimensionality – Projection search – Random projection for very high dimension
- 12. Projection Search total ordering!
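One way to read "total ordering" is that projecting every point onto a single direction lets the points be sorted, so a query only needs to examine points whose projections are close to its own. The sketch below assumes that reading; the window size, the single random direction, and all function names are choices made for illustration, not the Mahout `ProjectionSearch` implementation.

```python
import bisect
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_index(points, direction):
    """Sort point indices by their projection onto one direction."""
    entries = sorted((dot(p, direction), i) for i, p in enumerate(points))
    keys = [key for key, _ in entries]
    order = [i for _, i in entries]
    return keys, order


def projection_search(query, points, direction, keys, order, k, window):
    """Rank only the points whose projections fall near the query's."""
    pos = bisect.bisect_left(keys, dot(query, direction))
    lo, hi = max(0, pos - window), min(len(order), pos + window)
    candidates = order[lo:hi]
    return sorted(candidates,
                  key=lambda i: squared_distance(query, points[i]))[:k]


rng = random.Random(1)
points = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(2000)]
direction = [rng.gauss(0, 1) for _ in range(10)]
keys, order = build_index(points, direction)
found = projection_search(points[0], points, direction, keys, order,
                          k=5, window=100)
```

Each query computes at most `2 * window` full distances instead of one per point. A single projection can miss true neighbors, which is presumably why the next slide asks how many projections are needed.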
- 13. How Many Projections?
- 14. LSH Search• Each random projection produces independent sign bit• If two vectors have the same projected sign bits, they probably point in the same direction (i.e. cos θ ≈ 1)• Distance in L2 is closely related to cosine: $\|x - y\|^2 = \|x\|^2 - 2 (x \cdot y) + \|y\|^2 = \|x\|^2 - 2 \|x\| \|y\| \cos\theta + \|y\|^2$• We can replace (some) vector dot products with long integer XOR
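The sign-bit idea can be sketched with the standard random-hyperplane construction: each bit records which side of a random hyperplane a vector falls on, and two bits disagree with probability θ/π. The helper names and the 64-bit code length are assumptions for this example, not the `LshSearch` code from the talk.

```python
import math
import random


def random_projections(dim, bits, rng):
    """One Gaussian random direction per output bit."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]


def sign_bits(vector, projections):
    """Pack one sign bit per projection into a single integer."""
    code = 0
    for p in projections:
        d = sum(a * b for a, b in zip(vector, p))
        code = (code << 1) | (1 if d >= 0 else 0)
    return code


def matching_bits(code_a, code_b, bits):
    """XOR marks the disagreeing bits; the rest match."""
    return bits - bin(code_a ^ code_b).count("1")


def estimated_cosine(code_a, code_b, bits):
    """Bits disagree with probability theta / pi, so invert that relation."""
    theta = math.pi * (1 - matching_bits(code_a, code_b, bits) / bits)
    return math.cos(theta)


rng = random.Random(2)
BITS = 64
planes = random_projections(10, BITS, rng)
x = [rng.gauss(0, 1) for _ in range(10)]
neg_x = [-v for v in x]
code_x, code_neg = sign_bits(x, planes), sign_bits(neg_x, planes)
```

Comparing two codes costs one XOR and one bit count rather than a floating-point dot product over all dimensions, which is the saving the last bullet refers to.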
- 15. LSH Bit-match Versus Cosine (figure: plot with one axis running from 0 to 64 and the other from -1 to 1)
- 16. Results with 32 Bits
- 17. K-means Search• First do clustering with lots (thousands) of clusters• Then search nearest clusters to find nearest points• We win if we find >50% overlap with “true” answer• We lose if we can’t cluster super-fast – more on this later
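The two-stage search can be sketched as below. As a loud assumption, the centroids here are just a random sample of the data, standing in for the fast clustering discussed later; `probes` (how many nearby clusters to scan) and the function names are also hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def build_clusters(points, centroids):
    """Map each centroid index to the indices of the points it owns."""
    clusters = {c: [] for c in range(len(centroids))}
    for i, p in enumerate(points):
        clusters[nearest_centroid(p, centroids)].append(i)
    return clusters


def cluster_search(query, points, centroids, clusters, k, probes):
    """Scan only the members of the `probes` clusters nearest the query."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: squared_distance(query, centroids[c]))
    candidates = [i for c in ranked[:probes] for i in clusters[c]]
    return sorted(candidates,
                  key=lambda i: squared_distance(query, points[i]))[:k]


rng = random.Random(3)
points = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(2000)]
centroids = rng.sample(points, 50)  # assumption: stand-in for real clustering
clusters = build_clusters(points, centroids)

query = points[0]
hits = cluster_search(query, points, centroids, clusters, k=10, probes=3)
exact = sorted(range(len(points)),
               key=lambda i: squared_distance(query, points[i]))[:10]
overlap = len(set(hits) & set(exact)) / 10
```

`overlap` is the fraction of the true top 10 that the approximate search recovered, which is the quantity the ">50% overlap" criterion measures. Raising `probes` trades speed for overlap.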
- 18. Lots of Clusters Are Fine
- 19. Lots of Clusters Are Fine
- 20. Some Details• Clumpy data works better – Real data is clumpy • Speedups of 100-200x seem practical with 50% overlap – Projection search and LSH can be used to accelerate that (some)• More experiments needed• Definitely need fast search
- 21. Lloyd’s Algorithm• Part of CS folklore• Developed in the late 50’s for signal quantization, published in the 80’s
    initialize k cluster centroids somehow
    for each of many iterations:
        for each data point:
            assign point to nearest cluster
        recompute cluster centroids from points assigned to clusters
  • Highly variable quality, several restarts recommended
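The pseudocode on this slide translates almost line for line into runnable code. In this sketch, "somehow" is taken to mean sampling k data points at random, which is an assumption; the two-blob test data is synthetic.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, k, iterations, rng):
    """Plain Lloyd iteration: assign points, then recompute means."""
    # Assumption: initialize by sampling k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        members = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: squared_distance(p, centroids[c]))
            members[nearest].append(p)
        for c, assigned in enumerate(members):
            if assigned:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(assigned)
                                for col in zip(*assigned)]
    return centroids


rng = random.Random(4)
blob_a = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
blob_b = [[rng.gauss(10, 1), rng.gauss(10, 1)] for _ in range(100)]
centroids = lloyd(blob_a + blob_b, k=2, iterations=10, rng=rng)
```

Every iteration compares each point with all k centroids, which is the per-point cost the later slides aim to avoid when k is in the thousands.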
- 22. Ball k-means• Provably better for highly clusterable data• Tries to find initial centroids in the “core” of real clusters• Avoids outliers in centroid computation
    initialize centroids randomly with distance maximizing tendency
    for each of a very few iterations:
        for each data point:
            assign point to nearest cluster
        recompute centroids using only points much closer to their own centroid than to the closest other cluster
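The trimmed centroid update is what distinguishes ball k-means from Lloyd's algorithm. The sketch below shows one such update under a stated assumption: a point counts as "much closer" when it lies within a fixed fraction (here one third) of the distance from its centroid to the nearest other centroid. That fraction and the name `ball_update` are illustrative, not taken from the talk.

```python
import math


def ball_update(points, centroids, fraction=1 / 3):
    """Recompute each centroid from the points inside its 'ball' only.

    The ball radius is `fraction` times the distance to the nearest other
    centroid (an assumed rule), so far-away outliers are ignored.
    Requires at least two centroids.
    """
    updated = []
    for c, center in enumerate(centroids):
        gap = min(math.dist(center, other)
                  for o, other in enumerate(centroids) if o != c)
        radius = fraction * gap
        core = [p for p in points if math.dist(p, center) <= radius]
        if core:
            updated.append([sum(col) / len(core) for col in zip(*core)])
        else:
            updated.append(list(center))  # nothing in the ball: keep as is
    return updated


# One-dimensional toy data: two tight groups plus an outlier at 4.0.
points = [[0.0], [0.2], [-0.2], [4.0], [10.0], [9.8], [10.2]]
start = [[0.5], [10.0]]
refined = ball_update(points, start)
```

With these numbers the ball radius is about 3.17, so the outlier at 4.0 is excluded and the first centroid moves to the middle of its tight group instead of being pulled toward the outlier.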
- 23. Surrogate Method• Start with sloppy clustering into κ = k log n clusters• Use this sketch as a weighted surrogate for the data• Cluster surrogate data using ball k-means• Results are provably good for highly clusterable data• Sloppy clustering is on-line• Surrogate can be kept in memory• Ball k-means pass can be done at any time
- 24. Algorithm Costs• O(k d log n) per point for Lloyd’s algorithm … not so good for k = 2000, n = 10^8• Surrogate methods – fast, sloppy single pass clustering with κ = k log n – fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point – fast, in-memory, high-quality clustering of κ weighted centroids – result consists of k high-quality centroids• This is a big deal: – k d log n = 2000 x 10 x 26 = 520,000 – d (log k + log log n) = 10 x (11 + 5) = 160 – about 3000 times faster makes the grade as a bona fide big deal
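Plugging the slide's values into the two cost expressions gives a ratio of roughly three thousand, in line with the slide's claim. The sketch assumes base-2 logarithms and d = 10, and it ignores constant factors, so it only compares orders of magnitude.

```python
import math

k = 2000      # target number of clusters
d = 10        # dimensions, as used in the slide's arithmetic
n = 10**8     # number of data points

log_n = math.log2(n)                 # about 26.6
lloyd_cost = k * d * log_n           # O(k d log n) per point

kappa = k * log_n                    # size of the sketch, kappa = k log n
sketch_cost = d * math.log2(kappa)   # O(d log kappa) per point

speedup = lloyd_cost / sketch_cost
print(round(lloyd_cost), round(sketch_cost), round(speedup))
```

Note that log κ = log k + log log n, which is where the slide's "11 + 5" comes from.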
- 25. The Internals• Mechanism for extending Mahout Vectors – DelegatingVector, WeightedVector, Centroid• Searcher interface – ProjectionSearch, KmeansSearch, LshSearch, Brute• Super-fast clustering – Kmeans, StreamingKmeans
- 26. How It Works• For each point – Find approximately nearest centroid (distance = d) – If d > threshold, new centroid – Else possibly new cluster – Else add to nearest centroid• If centroids > K ~ C log N – Recursively cluster centroids with higher threshold• Result is large set of centroids – these provide approximation of original distribution – we can cluster centroids to get a close approximation of clustering original – or we can just use the result directly
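The single-pass procedure can be sketched as below. Several details are assumptions made to get something runnable: "possibly" is modeled as a new centroid with probability d / threshold, the threshold grows by a factor of 1.5 on each collapse, and the nearest centroid is found by exact scan rather than the approximate search the slide mentions. This is not the Mahout `StreamingKmeans` code.

```python
import math
import random


def sketch(weighted_points, threshold, max_centroids, rng, growth=1.5):
    """One pass over (vector, weight) pairs, returning weighted centroids."""
    centroids = []
    for vec, w in weighted_points:
        start_new = not centroids
        if centroids:
            idx = min(range(len(centroids)),
                      key=lambda c: math.dist(vec, centroids[c][0]))
            d = math.dist(vec, centroids[idx][0])
            # Far away: always a new centroid. Otherwise: maybe (assumed rule).
            start_new = d > threshold or rng.random() < d / threshold
        if start_new:
            centroids.append((list(vec), w))
        else:
            cvec, cw = centroids[idx]
            total = cw + w
            merged = [(a * cw + b * w) / total for a, b in zip(cvec, vec)]
            centroids[idx] = (merged, total)
        if len(centroids) > max_centroids:
            # Too many centroids: re-cluster them with a higher threshold.
            threshold *= growth
            centroids = sketch(centroids, threshold, max_centroids,
                               rng, growth)
    return centroids


rng = random.Random(5)
data = [[rng.gauss(m, 1), rng.gauss(m, 1)]
        for m in (0, 10, 20) for _ in range(200)]
rng.shuffle(data)
surrogate = sketch([(p, 1) for p in data], threshold=0.5,
                   max_centroids=50, rng=rng)
```

The weights record how many original points each centroid stands for, so the result can be handed to a weighted clustering step (for example ball k-means) in place of the full data set.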
- 27. Parallel Speedup? (figure: time per point in μs versus number of threads, for the non-threaded version, the threaded version, and perfect scaling)
- 28. What About Map-Reduce• Map-reduce implementation is nearly trivial – Compute surrogate on each split – Total surrogate is union of all partial surrogates – Do in-memory clustering on total surrogate• Threaded version shows linear speedup already – Map-reduce speedup is likely, not entirely guaranteed
- 29. How Well Does it Work?• Theoretical guarantees for well clusterable data – Shindler, Wong and Meyerson, NIPS, 2011• Evaluation on synthetic data – Rough clustering produces correct surrogates – Possible issue in ball k-means initialization (still produces good clustering on test data)
- 30. Summary• Nearest neighbor algorithms can be blazing fast• But you need blazing fast clustering – Which we now have
- 31. Contact Us!• We’re hiring at MapR in US and Europe• Amex is hiring in Phoenix and New York• Come get the slides at http://www.mapr.com/company/events/speaking/strata-10-2-12• Contact Ted at tdunning@maprtech.com or @ted_dunning
