New directions for mahout

470 views

Published on

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
470
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
8
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

New directions for mahout

  1. 1. 1©MapR Technologies - Confidential New Directions in Mahout
  2. 2. 2©MapR Technologies - Confidential Cut Out Bloat
  3. 3. 3©MapR Technologies - Confidential
  4. 4. 4©MapR Technologies - Confidential Bloat is Leaving in 0.7  Lots of abandoned code in Mahout – average code quality is poor – no users – no maintainers – why do we care?  Examples – old LDA – old Naïve Bayes – genetic algorithms  If you care, get on the mailing list  0.7 is about to be released
  5. 5. 5©MapR Technologies - Confidential Integration of Collections
  6. 6. 6©MapR Technologies - Confidential Nobody Cares about Collections  We need it, math is built on it  Pull it into math  Broke the build (battle of the code expanders)  Fixed now (thanks Grant)
  7. 7. 7©MapR Technologies - Confidential K-nearest Neighbor with Super Fast k-means
  8. 8. 8©MapR Technologies - Confidential What’s that?  Find the k nearest training examples  Use the average value of the target variable from them  This is easy … but hard – easy because it is so conceptually simple and you don’t have knobs to turn or models to build – hard because of the stunning amount of math – also hard because we need top 50,000 results  Initial prototype was massively too slow – 3K queries x 200K examples takes hours – needed 20M x 25M in the same time
  9. 9. 9©MapR Technologies - Confidential How We Did It  2 week hackathon with 6 developers from customer bank  Agile-ish development  To avoid IP issues – all code is Apache Licensed (no ownership question) – all data is synthetic (no question of private data) – all development done on individual machines, hosting on Github – open is easier than closed (in this case)  Goal is new open technology to facilitate new closed solutions  Ambitious goal of ~ 1,000,000 x speedup
  10. 10. 10©MapR Technologies - Confidential How We Did It  2 week hackathon with 6 developers from customer bank  Agile-ish development  To avoid IP issues – all code is Apache Licensed (no ownership question) – all data is synthetic (no question of private data) – all development done on individual machines, hosting on Github – open is easier than closed (in this case)  Goal is new open technology to facilitate new closed solutions  Ambitious goal of ~ 1,000,000 x speedup – well, really only 100-1000x after basic hygiene
  11. 11. 11©MapR Technologies - Confidential What We Did  Mechanism for extending Mahout Vectors – DelegatingVector, WeightedVector, Centroid  Searcher interface – ProjectionSearch, KmeansSearch, LshSearch, Brute  Super-fast clustering – Kmeans, StreamingKmeans
  12. 12. 12©MapR Technologies - Confidential Projection Search java.lang.TreeSet!
  13. 13. 13©MapR Technologies - Confidential How Many Projections?
  14. 14. 14©MapR Technologies - Confidential K-means Search  Simple Idea – pre-cluster the data – to find the nearest points, search the nearest clusters  Recursive application – to search a cluster, use a Searcher!
  15. 15. 15©MapR Technologies - Confidential
  16. 16. 16©MapR Technologies - Confidential x
  17. 17. 17©MapR Technologies - Confidential
  18. 18. 18©MapR Technologies - Confidential
  19. 19. 19©MapR Technologies - Confidential x
  20. 20. 20©MapR Technologies - Confidential But This Require k-means!  Need a new k-means algorithm to get speed – Hadoop is very slow at iterative map-reduce – Maybe Pregel clones like Giraph would be better – Or maybe not  Streaming k-means is – One pass (through the original data) – Very fast (20 us per data point with threads) – Very parallelizable
  21. 21. 21©MapR Technologies - Confidential How It Works  For each point – Find approximately nearest centroid (distance = d) – If d > threshold, new centroid – Else possibly new cluster – Else add to nearest centroid  If centroids > K ~ C log N – Recursively cluster centroids with higher threshold  Result is large set of centroids – these provide approximation of original distribution – we can cluster centroids to get a close approximation of clustering original – or we can just use the result directly
  22. 22. 22©MapR Technologies - Confidential Parallel Speedup? 1 2 3 4 5 20 10 100 20 30 40 50 200 Threads Timeperpoint(μs) 2 3 4 5 6 8 10 12 14 16 Threaded version Non- threaded Perfect Scaling ✓
  23. 23. 23©MapR Technologies - Confidential Warning, Recursive Descent  Inner loop requires finding nearest centroid  With lots of centroids, this is slow  But wait, we have classes to accelerate that!
  24. 24. 24©MapR Technologies - Confidential Warning, Recursive Descent  Inner loop requires finding nearest centroid  With lots of centroids, this is slow  But wait, we have classes to accelerate that! (Let’s not use k-means searcher, though)
  25. 25. 25©MapR Technologies - Confidential Pig Vector
  26. 26. 26©MapR Technologies - Confidential What is it?  Supports Pig access to Mahout functions  So far text vectorization  And classification  And model saving
  27. 27. 27©MapR Technologies - Confidential What is it?  Supports Pig access to Mahout functions  So far text vectorization  And classification  And model saving  Kind of works (see pigML from twitter for better function)
  28. 28. 28©MapR Technologies - Confidential Compile and Install  Start by compiling and installing mahout in your local repository: cd ~/Apache git clone https://github.com/apache/mahout.git cd mahout mvn install -DskipTests  Then do the same with pig-vector cd ~/Apache git clone git@github.com:tdunning/pig-vector.git cd pig-vector mvn package
  29. 29. 29©MapR Technologies - Confidential Tokenize and Vectorize Text  Tokenized is done using a text encoder – the dimension of the resulting vectors (typically 100,000-1,000,000 – a description of the variables to be included in the encoding – the schema of the tuples that pig will pass together with their data types  Example: define EncodeVector org.apache.mahout.pig.encoders.EncodeVector ('10','x+y+1', 'x:numeric, y:word, z:text');  You can also add a Lucene 3.1 analyzer in parentheses if you want something fancier
  30. 30. 30©MapR Technologies - Confidential The Formula  Not normal arithmetic  Describes which variables to use, whether offset is included  Also describes which interactions to use
  31. 31. 31©MapR Technologies - Confidential The Formula  Not normal arithmetic  Describes which variables to use, whether offset is included  Also describes which interactions to use – but that doesn’t do anything yet!
  32. 32. 32©MapR Technologies - Confidential Load and Encode Data  Load the data a = load '/Users/tdunning/Downloads/NNBench.csv' using PigStorage(',') as (x1:int, x2:int, x3:int);  And encode it b = foreach a generate 1 as key, EncodeVector(*) as v;  Note that the true meaning of * is very subtle  Now store it store b into 'vectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage ( '-c com.twitter.elephantbird.pig.util.IntWritableConverter’, '-c com.twitter.elephantbird.pig.util.GenericWritableConverter -t org.apache.mahout.math.VectorWritable’);
  33. 33. 33©MapR Technologies - Confidential Train a Model  Pass previously encoded data to a sequential model trainer define train org.apache.mahout.pig.LogisticRegression( 'iterations=5, inMemory=true, features=100000, categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt soc.religion.christian talk.religion.misc');  Note that the argument is a string with its own syntax
  34. 34. 34©MapR Technologies - Confidential Reservations and Qualms  Pig-vector isn’t done  And it is ugly  And it doesn’t quite work  And it is hard to build  But there seems to be promise
  35. 35. 35©MapR Technologies - Confidential Potential  Add Naïve Bayes Model?  Somehow simplify the syntax?  Try a recent version of elephant-bird?  Switch to pigML?
  36. 36. 36©MapR Technologies - Confidential  Contact: – tdunning@maprtech.com – @ted_dunning  Slides and such: – http://info.mapr.com/ted-bbuzz-2012 Hash tags: #bbuzz #mahout

×