
Super-Fast Clustering Report in MapR

Presentation by Ted Dunning, core Mahout committer, architect at MapR, and author of Mahout in Action, at Data Science London, 23/05/12.

  • MapR combines the best of the open source technology with our own deep innovations to provide the most advanced distribution for Apache Hadoop. MapR’s team has a deep bench of enterprise software experience with proven success across storage, networking, virtualization, analytics, and open source technologies. Our CEO has driven multiple companies to successful outcomes in the analytic, storage, and virtualization spaces. Our CTO and co-founder M.C. Srivas most recently worked on BigTable at Google. He understands the challenges of MapReduce at huge scale. Srivas was also the chief software architect at Spinnaker Networks, which came out of stealth with the fastest NAS storage on the market and was quickly acquired by NetApp. The team includes experience with enterprise storage at Cisco, VMware, IBM and EMC. Our VP of Engineering led emerging technologies and a 600-person NAS engineering team at EMC. We also have experience in Business Intelligence and Analytic companies, and open source committers in Hadoop, Zookeeper and Mahout, including PMC members. MapR is proven technology, with installs at leading Hadoop installations across industries and OEM relationships with EMC and Cisco.
  • Transcript of "Super-Fast Clustering Report in MapR"

    1. Super-Fast Clustering
       Report from MapR workshop
       ©MapR Technologies - Confidential
    2. Contact:
         – tdunning@maprtech.com
         – @ted_dunning
       Twitter for this talk
         – #mapr_uk
       Slides and such:
         – http://info.mapr.com/ted-uk-05-2012
    3. Company Background
       MapR provides the industry’s best Hadoop Distribution
         – Combines the best of the Hadoop community contributions with significant internally financed infrastructure development
       Background of Team
         – Deep management bench with extensive analytic, storage, virtualization, and open source experience
         – Google, EMC, Cisco, VMware, Network Appliance, IBM, Microsoft, Apache Foundation, Aster Data, Brio, ParAccel
       Proven
         – MapR used across industries (Financial Services, Media, Telecom, Health Care, Internet Services, Government)
         – Strategic OEM relationship with EMC and Cisco
         – Over 1,000 installs
    4. We Also Do …
       Open source development
         – Zookeeper
         – Hadoop
         – Mahout
         – Stuff
       Partner workshops
         – Machine learning
         – Information architecture
         – Cluster design
    6. The Problem
       A certain bank
         – had lots of customers
         – had lots of prospective customers
         – had a non-trivial number of fraudulent customers
         – had a non-trivial number of fraudulent merchants
       They also
         – collected data
         – built models
         – collected more data
         – built more models
    7. But …
       These models were arduous to build
       And hard to test
       So people suggested something simpler
       Like k-nearest neighbor
    8. What’s that?
       Find the k nearest training examples
       Use the average value of the target variable from them
       This is easy … but hard
         – easy because it is so conceptually simple and you don’t have knobs to turn or models to build
         – hard because of the stunning amount of math
         – also hard because we need top 50,000 results
       Initial prototype was massively too slow
         – 3K queries x 200K examples takes hours
         – needed 20M x 25M in the same time
    9. What We Did
       Mechanism for extending Mahout Vectors
         – DelegatingVector, WeightedVector, Centroid
       Searcher interface
         – ProjectionSearch, KmeansSearch, LshSearch, Brute
       Super-fast clustering
         – Kmeans, StreamingKmeans
    10. Projection Search
    11. K-means Search
    12. But These Require k-means!
        Need a new k-means algorithm to get speed
        Streaming k-means is
          – One pass (through the original data)
          – Very fast (20 μs per data point with threads)
          – Very parallelizable
    13. How It Works
        For each point
          – Find approximately nearest centroid (distance = d)
          – If d > threshold, new centroid
          – Else possibly new cluster
          – Else add to nearest centroid
        If centroids > K ~ C log N
          – Recursively cluster centroids with higher threshold
        Result is large set of centroids
          – these provide approximation of original distribution
          – we can cluster centroids to get a close approximation of clustering original
          – or we can just use the result directly
    14. Parallel Speedup?
        [Figure: time per point (μs) versus number of threads, comparing the non-threaded version, the threaded version, and perfect scaling]
    15. Warning, Recursive Descent
        Inner loop requires finding nearest centroid
        With lots of centroids, this is slow
        But wait, we have classes to accelerate that!
    16. Warning, Recursive Descent
        Inner loop requires finding nearest centroid
        With lots of centroids, this is slow
        But wait, we have classes to accelerate that!
        (Let’s not use k-means searcher, though)
    17. Contact:
          – tdunning@maprtech.com
          – @ted_dunning
        Slides and such:
          – http://info.mapr.com/ted-uk-05-2012
    18. Thank You
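
Slide 8 describes k-nearest-neighbor prediction as finding the k nearest training examples and averaging their target values. A minimal brute-force sketch of that idea follows. It is not the code from the talk: the function name `knn_predict`, the use of Euclidean distance, and the toy data are all assumptions made for illustration.

```python
import heapq
import math


def knn_predict(query, examples, k):
    """Average the target over the k training examples nearest to `query`.

    `examples` is a list of (vector, target) pairs. Euclidean distance is
    an assumption; the slides do not say which metric was used.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Brute force: one distance computation per training example.
    nearest = heapq.nsmallest(k, examples, key=lambda ex: dist(query, ex[0]))
    return sum(target for _, target in nearest) / len(nearest)


# Toy data (hypothetical): three points near the origin, one far away.
training = [
    ((0.0, 0.0), 0.0),
    ((0.0, 1.0), 0.0),
    ((1.0, 0.0), 1.0),
    ((5.0, 5.0), 1.0),
]
prediction = knn_predict((0.1, 0.1), training, k=3)
```

Each query here touches every training example, which is why the cost grows with queries x examples. The prototype workload on slide 8 (3K x 200K) is 6 x 10^8 query-example pairs, while the target workload (20M x 25M) is 5 x 10^14 pairs, roughly 800,000 times as many.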

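The per-point loop on slide 13 can be sketched as below. This is a rough illustration under stated assumptions, not Mahout's StreamingKmeans: the names `streaming_kmeans` and `_absorb`, the d / threshold probability used for the "possibly new cluster" step, the `growth` factor applied to the threshold, and the linear scan for the nearest centroid are all choices made here for the example.

```python
import math
import random


def _absorb(centroids, vec, weight, threshold, rng):
    """Add one weighted point: start a new centroid or merge into the nearest."""
    if not centroids:
        centroids.append([list(vec), weight])
        return
    # Linear scan for the nearest centroid (exact here; the slides use an
    # approximate search to speed this step up).
    nearest = min(centroids, key=lambda c: math.dist(c[0], vec))
    d = math.dist(nearest[0], vec)
    # Far points always start a centroid; closer points do so with
    # probability d / threshold (an assumption for "possibly new cluster").
    if d > threshold or rng.random() < d / threshold:
        centroids.append([list(vec), weight])
    else:
        total = nearest[1] + weight
        nearest[0] = [(c * nearest[1] + x * weight) / total
                      for c, x in zip(nearest[0], vec)]
        nearest[1] = total


def streaming_kmeans(points, max_centroids, threshold=1.0, growth=1.5, seed=0):
    """One pass over `points`, returning ([vector, weight] centroids, threshold)."""
    if max_centroids < 1 or threshold <= 0:
        raise ValueError("max_centroids must be >= 1 and threshold > 0")
    rng = random.Random(seed)
    centroids = []
    for p in points:
        _absorb(centroids, p, 1.0, threshold, rng)
        # Too many centroids: raise the threshold and re-cluster the centroids.
        while len(centroids) > max_centroids:
            threshold *= growth
            old, centroids = centroids, []
            for vec, weight in old:
                _absorb(centroids, vec, weight, threshold, rng)
    return centroids, threshold


# Hypothetical data: two Gaussian blobs of 500 points each.
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(500)]
data += [(data_rng.gauss(10, 1), data_rng.gauss(10, 1)) for _ in range(500)]
sketch, final_threshold = streaming_kmeans(data, max_centroids=20)
```

The weights of the returned centroids sum to the number of input points, so the result can be treated as a weighted summary of the data and clustered again. The `min` scan over all centroids is the inner loop that slides 15 and 16 warn about: with many centroids it dominates the running time unless a faster searcher replaces it.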