Boston hug-2012-07


Published on

Describes the state of Apache Mahout with special focus on the upcoming k-nearest neighbor and k-means clustering algorithms.

Published in: Technology
Transcript of "Boston hug-2012-07"

  1. Mahout, New and Improved: Now with Super Fast Clustering (©MapR Technologies - Confidential)
  2. Agenda: What happened in Mahout 0.7
       – less bloat
       – simpler structure
       – general cleanup
  3. To Cut Out Bloat
  4. (image-only slide)
  5-6. Bloat is Leaving in 0.7
     Lots of abandoned code in Mahout
       – average code quality is poor
       – no users
       – no maintainers
       – why do we care?
     Examples
       – old LDA
       – old Naïve Bayes
       – genetic algorithms
     If you care, get on the mailing list
       – oops, too late since 0.7 is already released
  7. Integration of Collections
  8. Nobody Cares about Collections
     We need it; math is built on it
     Pull it into math
     Broke the build (battle of the code expanders)
     Fixed now (thanks to Grant)
  9. Pig Vector
  10-11. What is it?
     Supports Pig access to Mahout functionality
     So far: text vectorization
     And classification
     And model saving
     Kind of works (see pigML from Twitter for better function)
  12. Compile and Install
     Start by compiling and installing Mahout in your local repository:
       cd ~/Apache
       git clone https://github.com/apache/mahout.git
       cd mahout
       mvn install -DskipTests
     Then do the same with pig-vector:
       cd ~/Apache
       git clone git@github.com:tdunning/pig-vector.git
       cd pig-vector
       mvn package
  13. Tokenize and Vectorize Text
     Tokenization is done using a text encoder; you supply:
       – the dimension of the resulting vectors (typically 100,000-1,000,000)
       – a description of the variables to be included in the encoding
       – the schema of the tuples that Pig will pass, together with their data types
     Example:
       define EncodeVector org.apache.mahout.pig.encoders.EncodeVector(
           '10', 'x+y+1', 'x:numeric, y:word, z:text');
     You can also add a Lucene 3.1 analyzer in parentheses if you want something fancier
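Fixing the vector dimension up front works because the encoder hashes each feature into a fixed-size vector instead of building a dictionary first. A minimal Python sketch of that hashing trick (the function name and tokenizer are illustrative assumptions, not Mahout's actual encoder):

```python
import re
import zlib

def hash_encode(text, dim):
    """Encode free text as a fixed-dimension count vector by hashing
    each token to an index. Collisions are simply tolerated; a larger
    dim trades memory for a lower collision rate."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec
```

Because the index is a pure function of the token, no dictionary has to be shipped along with the model, which is what lets the dimension be declared before any data is seen.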
  14-15. The Formula
     Not normal arithmetic
     Describes which variables to use and whether an offset is included
     Also describes which interactions to use
       – but that doesn't do anything yet!
  16. Load and Encode Data
     Load the data:
       a = load '/Users/tdunning/Downloads/NNBench.csv' using PigStorage(',')
           as (x1:int, x2:int, x3:int);
     And encode it:
       b = foreach a generate 1 as key, EncodeVector(*) as v;
     Note that the true meaning of * is very subtle
     Now store it:
       store b into 'vectors.dat' using
           com.twitter.elephantbird.pig.store.SequenceFileStorage (
             '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
             '-c com.twitter.elephantbird.pig.util.GenericWritableConverter
              -t org.apache.mahout.math.VectorWritable');
  17. Train a Model
     Pass previously encoded data to a sequential model trainer:
       define train org.apache.mahout.pig.LogisticRegression(
           'iterations=5, inMemory=true, features=100000,
            categories=alt.atheism comp.sys.mac.hardware rec.motorcycles
            sci.electronics talk.politics.guns comp.graphics comp.windows.x
            rec.sport.baseball sci.med talk.politics.mideast
            comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space
            talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt
            soc.religion.christian talk.religion.misc');
     Note that the argument is a string with its own syntax
  18. Reservations and Qualms
     Pig-vector isn't done
     And it is ugly
     And it doesn't quite work
     And it is hard to build
     But there seems to be promise
  19. Potential
     Add a Naïve Bayes model?
     Somehow simplify the syntax?
     Try a recent version of elephant-bird?
     Switch to pigML?
  20. Large-scale k-Means Clustering
  21. Goals
     Cluster very large data sets
     Facilitate large nearest neighbor search
     Allow a very large number of clusters
     Achieve good quality
       – low average distance to nearest centroid on held-out data
     Based on Mahout Math
     Runs on a Hadoop (really MapR) cluster
     FAST
       – cluster tens of millions of points in minutes
  22. Non-goals
     Use map-reduce (but it is there)
     Minimize the number of clusters
     Support metrics other than L2
  23. Anti-goals
     Multiple passes over the original data
     Scale as O(k n)
  24. Why?
  25. K-nearest Neighbor with Super Fast k-means
  26. What's that?
     Find the k nearest training examples
     Use the average value of the target variable from them
     This is easy ... but hard
       – easy because it is so conceptually simple and you have few knobs to turn or models to build
       – hard because of the stunning amount of math
       – also hard because we need the top 50,000 results, not just the single nearest
     Initial prototype was massively too slow
       – 3K queries x 200K examples takes hours
       – needed 20M x 25M in the same time
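The conceptually simple version really does fit in a few lines. This Python sketch (illustrative, not the hackathon code) is exactly the brute-force O(n)-per-query search that the slide calls massively too slow at scale:

```python
import heapq
import math

def knn_predict(query, examples, k):
    """k-nearest-neighbor regression by brute force: average the
    target values of the k training points closest (L2 distance) to
    the query. `examples` is a list of (point, target) pairs. The
    rest of the talk is about making this search fast."""
    nearest = heapq.nsmallest(k, examples,
                              key=lambda ex: math.dist(query, ex[0]))
    return sum(t for _, t in nearest) / len(nearest)
```

Every query scans all n examples, so 20M queries against 25M examples is hopeless without a smarter nearest-neighbor index.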
  27. Modeling with k-nearest Neighbors (figure with panels a, b, c)
  28. Subject to Some Limits
  29. Log Transform Improves Things
  30. Neighbors Depend on Good Presentation
  31-32. How We Did It
     2-week hackathon with 6 developers from a MapR customer (a bank)
     Agile-ish development
     To avoid IP issues
       – all code is Apache licensed (no ownership question)
       – all data is synthetic (no question of private data)
       – all development done on individual machines, hosted on GitHub
       – open is easier than closed (in this case)
     Goal is new open technology to facilitate new closed solutions
     Ambitious goal of ~1,000,000x speedup
       – well, really only 100-1000x after basic hygiene
  33. What We Did
     Mechanism for extending Mahout Vectors
       – DelegatingVector, WeightedVector, Centroid
     Shared memory matrix
       – FileBasedMatrix uses mmap to share very large dense matrices
     Searcher interface
       – Brute, ProjectionSearch, KmeansSearch, LshSearch
     Super-fast clustering
       – Kmeans, StreamingKmeans
  34. Projection Search: java.util.TreeSet!
  35. Projection Search
     Projection onto a line provides a total order on the data
     Nearby points stay nearby
     Some other points also wind up close
     Search the points just before or just after the query point
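A sketch of projection search in Python, using a sorted list plus bisect in place of the java.util.TreeSet mentioned on the previous slide (the names and the single-projection design are illustrative; real implementations use several projections):

```python
import bisect
import random

def build_index(points, seed=42):
    """Project every point onto one random direction and keep the
    (projection, point) pairs sorted; the sorted list stands in for
    a TreeSet keyed by projection value."""
    rng = random.Random(seed)
    dim = len(points[0])
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    proj = lambda p: sum(a * b for a, b in zip(p, direction))
    index = sorted((proj(p), p) for p in points)
    return index, proj

def candidates(index, proj, query, width=3):
    """Return the points whose projections bracket the query's
    projection; nearby points stay nearby on the line, so the true
    nearest neighbor is usually among these candidates."""
    i = bisect.bisect_left(index, (proj(query),))
    lo, hi = max(0, i - width), min(len(index), i + width)
    return [p for _, p in index[lo:hi]]
```

The candidate set can contain far-away points that happen to project close to the query, which is exactly the "some other points also wind up close" caveat; checking true distances on the small candidate set fixes that.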
  36. How Many Projections?
  37. K-means Search
     Simple idea
       – pre-cluster the data
       – to find the nearest points, search the nearest clusters
     Recursive application
       – to search a cluster, use a Searcher!
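The pre-cluster-then-probe idea can be sketched as follows (a hypothetical helper, assuming the centroids and per-cluster point lists have already been computed; the recursive twist would replace the inner scan with another Searcher):

```python
import math

def kmeans_search(centroids, clusters, query, n_probe=2):
    """K-means search sketch: look for the nearest neighbor only
    inside the n_probe clusters whose centroids are closest to the
    query, instead of scanning every point."""
    order = sorted(range(len(centroids)),
                   key=lambda i: math.dist(query, centroids[i]))
    best, best_d = None, float("inf")
    for i in order[:n_probe]:          # probe only the nearest clusters
        for p in clusters[i]:
            d = math.dist(query, p)
            if d < best_d:
                best, best_d = p, d
    return best
```

Probing more clusters raises accuracy at the cost of speed; with n_probe equal to the cluster count this degenerates back to brute force.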
  38-42. (image-only slides)
  43. But This Requires k-means!
     Need a new k-means algorithm to get speed
       – Hadoop is very slow at iterative map-reduce
       – maybe Pregel clones like Giraph would be better
       – or maybe not
     Streaming k-means is
       – one pass (through the original data)
       – very fast (20 μs per data point with threads on one node)
       – very parallelizable
  44. Basic Method
     Use a single pass of k-means with very many clusters
       – output is a bad-ish clustering but a good surrogate
     Use the weighted centroids from step 1 to do in-memory clustering
       – output is a good clustering with fewer clusters
  45. Algorithmic Details
     For each data point x_n:
       compute the distance to the nearest centroid, ∂
       sample u; if u > ∂/β, add x_n to the nearest centroid, else create a new centroid
       if the number of centroids > k log n:
         recursively cluster the centroids
         set β = 1.5 β if the number of centroids did not decrease
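A runnable Python sketch of the loop above. It follows the slide's logic for spawning and folding centroids, but simplifies the "recursively cluster the centroids" step to greedy closest-pair merging and grows β on every collapse; treat it as an illustration of the idea, not Mahout's StreamingKmeans:

```python
import math
import random

def streaming_kmeans(points, k, seed=42):
    """One pass over the data. A point at distance d from its nearest
    centroid becomes a new centroid with probability ~ d/beta;
    otherwise it folds into that centroid as a weighted mean. When
    the centroid count exceeds k*log(n), merge the closest pair of
    centroids until under the limit (simplified from the slide's
    recursive clustering) and grow beta. Returns (centroid, weight)
    pairs: a weighted surrogate for the original distribution."""
    rng = random.Random(seed)
    beta = 1.0
    cents = []  # each entry: [coords, weight]
    for n, x in enumerate(points, start=1):
        if not cents:
            cents.append([list(x), 1.0])
            continue
        j = min(range(len(cents)), key=lambda i: math.dist(x, cents[i][0]))
        d = math.dist(x, cents[j][0])
        if rng.random() < d / beta:
            cents.append([list(x), 1.0])   # far point: spawn a centroid
        else:
            c, w = cents[j]                # near point: weighted merge
            cents[j] = [[(ci * w + xi) / (w + 1) for ci, xi in zip(c, x)],
                        w + 1]
        while len(cents) > k * max(1.0, math.log(n)):
            a, b = min(((i, j2) for i in range(len(cents))
                        for j2 in range(i + 1, len(cents))),
                       key=lambda ij: math.dist(cents[ij[0]][0],
                                                cents[ij[1]][0]))
            (ca, wa), (cb, wb) = cents[a], cents[b]
            cents[a] = [[(p * wa + q * wb) / (wa + wb)
                         for p, q in zip(ca, cb)], wa + wb]
            del cents[b]
            beta *= 1.5
    return [(tuple(c), w) for c, w in cents]
```

The carried weights are what make the second, in-memory clustering pass on the centroids a faithful stand-in for clustering the original data.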
  46. How It Works
     Result is a large set of centroids
       – these provide an approximation of the original distribution
       – we can cluster the centroids to get a close approximation of clustering the original data
       – or we can just use the result directly
  47. Parallel Speedup? (figure: time per point (μs) vs. number of threads, comparing the non-threaded version and the threaded version against perfect scaling)
  48-50. Warning, Recursive Descent
     Inner loop requires finding the nearest centroid
     With lots of centroids, this is slow
     But wait, we have classes to accelerate that!
       (Let's not use the k-means searcher, though)
     Empirically, projection search beats 64-bit LSH by a bit
       – more optimization may change this story
  51. Moving to Ultra Mega Super Scale
     Map-reduce implementation nearly trivial
     Map: rough-cluster input data; output β and weighted centroids
     Reduce:
       – single reducer gets all centroids
       – if too many centroids, merge using recursive clustering
       – optionally do final clustering in-memory
     Combiner possible, but not important
  52. Contact:
       – tdunning@maprtech.com
       – @ted_dunning
     Slides and such:
       – http://info.mapr.com/ted-boston-2012-07
     Hash tags: #boston-hug #mahout #mapr