New Directions for Mahout

I gave this talk at Buzzwords just now to fill in for an ill speaker.

The topics include things that are being added to or taken out of Mahout. These include cruft (out), fast clustering (in), nearest neighbor search (in), Pig bindings for Mahout (who knows).


  1. New Directions in Mahout ©MapR Technologies - Confidential
  2. Cut Out Bloat
  4. Bloat is Leaving in 0.7
       • Lots of abandoned code in Mahout
           – average code quality is poor
           – no users
           – no maintainers
           – why do we care?
       • Examples
           – old LDA
           – old Naïve Bayes
           – genetic algorithms
       • If you care, get on the mailing list
       • 0.7 is about to be released
  5. Integration of Collections
  6. Nobody Cares about Collections
       • We need it, math is built on it
       • Pull it into math
       • Broke the build (battle of the code expanders)
       • Fixed now (thanks Grant)
  7. K-nearest Neighbor with Super Fast k-means
  8. What’s that?
       • Find the k nearest training examples
       • Use the average value of the target variable from them
       • This is easy … but hard
           – easy because it is so conceptually simple and you don’t have knobs to turn or models to build
           – hard because of the stunning amount of math
           – also hard because we need top 50,000 results
       • Initial prototype was massively too slow
           – 3K queries x 200K examples takes hours
           – needed 20M x 25M in the same time
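
The two steps on slide 8 (find the k nearest training examples, then average their target values) can be sketched as a brute-force search. This is only an illustration in Python, not the Mahout code: the function name `knn_predict`, the Euclidean distance, and the toy data are all assumptions made for the example.

```python
import heapq
import math


def knn_predict(query, examples, k):
    """Average the target of the k training examples closest to query.

    examples is a list of (feature_vector, target) pairs. Distance is
    Euclidean here, which is an assumption; the slides do not name a metric.
    """
    nearest = heapq.nsmallest(
        k, examples, key=lambda example: math.dist(query, example[0])
    )
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical toy data: three points near the origin and one far away.
training = [
    ((0.0, 0.0), 1.0),
    ((1.0, 0.0), 2.0),
    ((0.0, 1.0), 3.0),
    ((5.0, 5.0), 100.0),
]
prediction = knn_predict((0.1, 0.1), training, k=3)
```

Every query scans every example, so the cost grows as queries x examples. That is the scaling problem the slide describes and the reason the later slides look for faster search structures.
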
  10. How We Did It
       • 2 week hackathon with 6 developers from customer bank
       • Agile-ish development
       • To avoid IP issues
           – all code is Apache Licensed (no ownership question)
           – all data is synthetic (no question of private data)
           – all development done on individual machines, hosting on Github
           – open is easier than closed (in this case)
       • Goal is new open technology to facilitate new closed solutions
       • Ambitious goal of ~ 1,000,000 x speedup
           – well, really only 100-1000x after basic hygiene
  11. What We Did
       • Mechanism for extending Mahout Vectors
           – DelegatingVector, WeightedVector, Centroid
       • Searcher interface
           – ProjectionSearch, KmeansSearch, LshSearch, Brute
       • Super-fast clustering
           – Kmeans, StreamingKmeans
  12. Projection Search
       • java.util.TreeSet!
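
Slide 12 gives only the name of the technique and the data structure, so the following is a guess at the general idea rather than a description of Mahout's ProjectionSearch: project every point onto a few random directions, keep each set of projections sorted (a sorted list with `bisect` stands in for the TreeSet), and treat the points whose projections fall next to the query's as candidates. The class name, the `window` parameter, and the Gaussian random directions are assumptions for this sketch.

```python
import bisect
import math
import random


class ProjectionSearchSketch:
    """Approximate nearest-neighbor search via sorted 1-D projections."""

    def __init__(self, points, num_projections=3, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.points = points
        self.directions = [
            [rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_projections)
        ]
        # One sorted list of (projection, point index) per direction.
        self.indexes = [
            sorted((self._dot(d, p), i) for i, p in enumerate(points))
            for d in self.directions
        ]

    @staticmethod
    def _dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def search(self, query, k, window=4):
        """Return indexes of up to k candidate points, closest first."""
        candidates = set()
        for direction, index in zip(self.directions, self.indexes):
            pos = bisect.bisect_left(index, (self._dot(direction, query), -1))
            # Points adjacent to the query in projection order are candidates.
            for _, i in index[max(0, pos - window):pos + window]:
                candidates.add(i)
        ranked = sorted(candidates, key=lambda i: math.dist(query, self.points[i]))
        return ranked[:k]


# Hypothetical data: 50 random points in three dimensions.
data_rng = random.Random(1)
points = [tuple(data_rng.random() for _ in range(3)) for _ in range(50)]
searcher = ProjectionSearchSketch(points)
query = (0.5, 0.5, 0.5)
approximate = searcher.search(query, k=5)
```

More projections and a wider window raise the chance of finding the true neighbors at the cost of more distance computations, which is presumably the trade-off behind the "How Many Projections?" slide.
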
  13. How Many Projections?
  14. K-means Search
       • Simple Idea
           – pre-cluster the data
           – to find the nearest points, search the nearest clusters
       • Recursive application
           – to search a cluster, use a Searcher!
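
The idea on slide 14 can be sketched as follows. This is a minimal illustration, not Mahout's KmeansSearch: it assumes the centroids have already been computed by some clustering step, it searches inside each cluster by brute force instead of using a nested Searcher, and the name `clusters_to_probe` is invented for the example.

```python
import math


class KmeansSearchSketch:
    """Nearest-neighbor search that only looks inside the nearest clusters."""

    def __init__(self, points, centroids):
        self.centroids = centroids
        self.clusters = [[] for _ in centroids]
        # Pre-cluster: assign every point to its closest centroid.
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda c: math.dist(point, centroids[c]),
            )
            self.clusters[nearest].append(point)

    def search(self, query, k, clusters_to_probe=2):
        """Return up to k points from the clusters closest to the query."""
        order = sorted(
            range(len(self.centroids)),
            key=lambda c: math.dist(query, self.centroids[c]),
        )
        candidates = [
            point for c in order[:clusters_to_probe] for point in self.clusters[c]
        ]
        return sorted(candidates, key=lambda p: math.dist(query, p))[:k]


# Hypothetical data: two well-separated groups of points.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
searcher = KmeansSearchSketch(points, centroids=[(0, 0), (10, 10)])
neighbors = searcher.search((0.2, 0.1), k=2, clusters_to_probe=1)
```

Probing more than one cluster guards against true neighbors that sit just across a cluster boundary.
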
  20. But This Requires k-means!
       • Need a new k-means algorithm to get speed
           – Hadoop is very slow at iterative map-reduce
           – Maybe Pregel clones like Giraph would be better
           – Or maybe not
       • Streaming k-means is
           – One pass (through the original data)
           – Very fast (20 μs per data point with threads)
           – Very parallelizable
  21. How It Works
       • For each point
           – Find approximately nearest centroid (distance = d)
           – If d > threshold, new centroid
           – Else possibly new cluster
           – Else add to nearest centroid
       • If centroids > K ~ C log N
           – Recursively cluster centroids with higher threshold
       • Result is large set of centroids
           – these provide approximation of original distribution
           – we can cluster centroids to get a close approximation of clustering original
           – or we can just use the result directly
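
The loop on slide 21 can be sketched roughly as below. Several details are assumptions rather than anything stated in the slides: "possibly new cluster" is modeled as starting a new centroid with probability d / threshold, the nearest centroid is found exactly by brute force rather than approximately, the threshold grows by a fixed factor whenever there are too many centroids, and all names are invented for the example.

```python
import math
import random


def _add(centroids, vector, weight, threshold, rng):
    """Merge a weighted point into its nearest centroid, or start a new one."""
    if centroids:
        nearest = min(centroids, key=lambda c: math.dist(vector, c[0]))
        d = math.dist(vector, nearest[0])
        # d > threshold always starts a new centroid; smaller d only sometimes.
        if rng.random() >= d / threshold:
            total = nearest[1] + weight
            nearest[0] = [
                (a * nearest[1] + b * weight) / total
                for a, b in zip(nearest[0], vector)
            ]
            nearest[1] = total
            return
    centroids.append([list(vector), weight])


def streaming_kmeans(points, max_centroids, threshold, growth=1.5, seed=0):
    """One pass over points; returns a list of [centroid, weight] pairs."""
    rng = random.Random(seed)
    centroids = []
    for point in points:
        _add(centroids, point, 1.0, threshold, rng)
        # Too many centroids: raise the threshold and re-cluster the centroids.
        while len(centroids) > max_centroids:
            threshold *= growth
            old, centroids = centroids, []
            for vector, weight in old:
                _add(centroids, vector, weight, threshold, rng)
    return centroids


# Hypothetical data: 500 random points in the unit square.
data_rng = random.Random(42)
data = [(data_rng.random(), data_rng.random()) for _ in range(500)]
sketch = streaming_kmeans(data, max_centroids=20, threshold=0.01)
```

Because merging keeps a running weighted mean, the centroid weights always sum to the number of points seen, so the output can be fed to an ordinary weighted k-means as the slide suggests.
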
  22. Parallel Speedup? [Figure: time per point (μs) against number of threads, comparing the non-threaded version, the threaded version, and perfect scaling]
  24. Warning, Recursive Descent
       • Inner loop requires finding nearest centroid
       • With lots of centroids, this is slow
       • But wait, we have classes to accelerate that!
       • (Let’s not use k-means searcher, though)
  25. Pig Vector
  27. What is it?
       • Supports Pig access to Mahout functions
       • So far text vectorization
       • And classification
       • And model saving
       • Kind of works (see pigML from twitter for better function)
  28. Compile and Install
       • Start by compiling and installing mahout in your local repository:
             cd ~/Apache
             git clone https://github.com/apache/mahout.git
             cd mahout
             mvn install -DskipTests
       • Then do the same with pig-vector:
             cd ~/Apache
             git clone git@github.com:tdunning/pig-vector.git
             cd pig-vector
             mvn package
  29. Tokenize and Vectorize Text
       • Tokenization is done using a text encoder, which is defined by
           – the dimension of the resulting vectors (typically 100,000-1,000,000)
           – a description of the variables to be included in the encoding
           – the schema of the tuples that pig will pass together with their data types
       • Example:
             define EncodeVector org.apache.mahout.pig.encoders.EncodeVector('10', 'x+y+1', 'x:numeric, y:word, z:text');
       • You can also add a Lucene 3.1 analyzer in parentheses if you want something fancier
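
To make the three arguments on slide 29 concrete, here is a sketch of one way such an encoder can turn a tuple into a fixed-dimension vector using feature hashing. Mahout's vector encoders are hash-based, but the slot choice, the MD5 hash, the whitespace tokenizer, and every name below are assumptions for illustration, not the pig-vector implementation.

```python
import hashlib


def _slot(name, dimension):
    """Map a feature name to a vector position with a stable hash."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % dimension


def encode(record, schema, dimension):
    """Encode the variables listed in schema into a list of floats.

    schema is a list of (name, kind) pairs where kind is "numeric",
    "word", or "text", loosely mirroring 'x:numeric, y:word, z:text'.
    """
    vector = [0.0] * dimension
    vector[_slot("(intercept)", dimension)] += 1.0  # the "+1" term of the formula
    for name, kind in schema:
        value = record[name]
        if kind == "numeric":
            vector[_slot(name, dimension)] += float(value)
        elif kind == "word":
            vector[_slot(f"{name}={value}", dimension)] += 1.0
        elif kind == "text":
            for token in str(value).lower().split():
                vector[_slot(f"{name}={token}", dimension)] += 1.0
    return vector


# Hypothetical record encoded with the formula x + y + 1.
encoded = encode(
    {"x": 2.5, "y": "red"}, [("x", "numeric"), ("y", "word")], dimension=1000
)
```

A larger dimension lowers the chance that two features collide in the same slot, which is why the slide suggests 100,000 to 1,000,000 rather than the 10 used in its example.
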
  31. The Formula
       • Not normal arithmetic
       • Describes which variables to use, whether offset is included
       • Also describes which interactions to use
           – but that doesn’t do anything yet!
  32. Load and Encode Data
       • Load the data
             a = load '/Users/tdunning/Downloads/NNBench.csv' using PigStorage(',')
                 as (x1:int, x2:int, x3:int);
       • And encode it
             b = foreach a generate 1 as key, EncodeVector(*) as v;
       • Note that the true meaning of * is very subtle
       • Now store it
             store b into 'vectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage (
                 '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
                 '-c com.twitter.elephantbird.pig.util.GenericWritableConverter -t org.apache.mahout.math.VectorWritable');
  33. Train a Model
       • Pass previously encoded data to a sequential model trainer
             define train org.apache.mahout.pig.LogisticRegression(
                 'iterations=5, inMemory=true, features=100000, categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt soc.religion.christian talk.religion.misc');
       • Note that the argument is a string with its own syntax
  34. Reservations and Qualms
       • Pig-vector isn’t done
       • And it is ugly
       • And it doesn’t quite work
       • And it is hard to build
       • But there seems to be promise
  35. Potential
       • Add Naïve Bayes Model?
       • Somehow simplify the syntax?
       • Try a recent version of elephant-bird?
       • Switch to pigML?
  36. Contact and links
       • Contact:
           – tdunning@maprtech.com
           – @ted_dunning
       • Slides and such:
           – http://info.mapr.com/ted-bbuzz-2012
       • Hash tags: #bbuzz #mahout