New directions for mahout
Published in: Technology


Transcript

  • 1. New Directions in Mahout (©MapR Technologies - Confidential)
  • 2. Cut Out Bloat
  • 3. [No text on this slide]
  • 4. Bloat is Leaving in 0.7
      • Lots of abandoned code in Mahout
          – average code quality is poor
          – no users
          – no maintainers
          – why do we care?
      • Examples
          – old LDA
          – old Naïve Bayes
          – genetic algorithms
      • If you care, get on the mailing list
      • 0.7 is about to be released
  • 5. Integration of Collections
  • 6. Nobody Cares about Collections
      • We need it, math is built on it
      • Pull it into math
      • Broke the build (battle of the code expanders)
      • Fixed now (thanks Grant)
  • 7. K-nearest Neighbor with Super Fast k-means
  • 8. What’s that?
      • Find the k nearest training examples
      • Use the average value of the target variable from them
      • This is easy … but hard
          – easy because it is so conceptually simple and you don’t have knobs to turn or models to build
          – hard because of the stunning amount of math
          – also hard because we need top 50,000 results
      • Initial prototype was massively too slow
          – 3K queries x 200K examples takes hours
          – needed 20M x 25M in the same time
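The two steps on this slide (find the k nearest training examples, average their target values) can be sketched in a few lines. This is a brute-force Python illustration, not Mahout code: the function name `knn_predict`, the Euclidean distance, and the toy data are all assumptions made for the example.

```python
import heapq
import math


def knn_predict(query, examples, k):
    """Average the target values of the k training examples nearest to query.

    examples is a non-empty list of (vector, target) pairs. Euclidean
    distance is an assumption; the slides do not name a metric.
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Brute force: one distance computation per training example.
    nearest = heapq.nsmallest(k, examples, key=lambda ex: distance(query, ex[0]))
    return sum(target for _, target in nearest) / len(nearest)


# Toy data, invented for illustration.
training = [((0.0, 0.0), 1.0), ((0.0, 1.0), 2.0), ((5.0, 5.0), 10.0), ((1.0, 0.0), 3.0)]
prediction = knn_predict((0.1, 0.1), training, k=3)
```

Every query touches every training example, which is why the cost grows with queries times examples and why the prototype sizes quoted on the slide took hours.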
  • 9-10. How We Did It
      • 2 week hackathon with 6 developers from customer bank
      • Agile-ish development
      • To avoid IP issues
          – all code is Apache Licensed (no ownership question)
          – all data is synthetic (no question of private data)
          – all development done on individual machines, hosting on Github
          – open is easier than closed (in this case)
      • Goal is new open technology to facilitate new closed solutions
      • Ambitious goal of ~ 1,000,000 x speedup
          – well, really only 100-1000x after basic hygiene
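The ~1,000,000x figure appears to follow from the workload sizes quoted earlier in the talk (3K queries x 200K examples for the prototype, 20M x 25M required in the same time). A quick check, assuming cost scales with the number of query-example pairs:

```python
# Sizes quoted on the slides.
prototype_pairs = 3_000 * 200_000          # 3K queries x 200K examples
required_pairs = 20_000_000 * 25_000_000   # 20M x 25M

# Ratio of work to be done in the same wall-clock time.
speedup_needed = required_pairs / prototype_pairs
```

The ratio is a little over 800,000, which rounds to the order of magnitude given on the slide.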
  • 11. What We Did
      • Mechanism for extending Mahout Vectors
          – DelegatingVector, WeightedVector, Centroid
      • Searcher interface
          – ProjectionSearch, KmeansSearch, LshSearch, Brute
      • Super-fast clustering
          – Kmeans, StreamingKmeans
  • 12. Projection Search
      • java.util.TreeSet!
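The slide names only a sorted set as the data structure. One plausible reading, sketched here in Python rather than Mahout's Java, is that vectors are ordered by their projection onto a random direction, and a query examines only its neighbors in that order before ranking them by true distance. The class name `ProjectionIndex`, the `window` parameter, and the use of a single projection are assumptions for illustration; this is not the ProjectionSearch implementation. It needs Python 3.8 or later for `math.dist`.

```python
import bisect
import math
import random


class ProjectionIndex:
    """Approximate nearest-neighbor search using one random projection."""

    def __init__(self, vectors, seed=0):
        rng = random.Random(seed)
        dim = len(vectors[0])
        self.direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        self.vectors = vectors
        # Keep (projection, index) pairs in sorted order, as a sorted set would.
        self.entries = sorted((self._project(v), i) for i, v in enumerate(vectors))
        self.keys = [p for p, _ in self.entries]

    def _project(self, v):
        return sum(a * b for a, b in zip(self.direction, v))

    def search(self, query, k, window):
        # Find the query's position in projection order and take up to
        # `window` candidates on each side of it.
        pos = bisect.bisect_left(self.keys, self._project(query))
        lo, hi = max(0, pos - window), min(len(self.entries), pos + window)
        candidates = [i for _, i in self.entries[lo:hi]]
        # Rank the candidates by true distance and keep the best k.
        candidates.sort(key=lambda i: math.dist(query, self.vectors[i]))
        return candidates[:k]


index = ProjectionIndex([(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (-3.0, 2.0)])
hits = index.search((0.9, 1.1), k=1, window=4)
```

A small `window` trades accuracy for speed; using several projections and merging their candidates would be one way to recover accuracy.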
  • 13. How Many Projections?
  • 14. K-means Search
      • Simple Idea
          – pre-cluster the data
          – to find the nearest points, search the nearest clusters
      • Recursive application
          – to search a cluster, use a Searcher!
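The idea on this slide can be sketched as follows. This is a Python illustration under stated assumptions, not the KmeansSearch class: the centroids are taken as given, the names `ClusterSearch`, `nearest`, and `probes` are invented for the example, and each cluster is searched by brute force where the slide suggests plugging in another Searcher. It needs Python 3.8 or later for `math.dist`.

```python
import math


def nearest(query, points, k):
    """Brute-force search: the k points closest to query."""
    return sorted(points, key=lambda p: math.dist(query, p))[:k]


class ClusterSearch:
    def __init__(self, centroids, points):
        self.centroids = centroids
        # Pre-cluster: assign every point to its closest centroid.
        self.members = {c: [] for c in centroids}
        for p in points:
            self.members[nearest(p, centroids, 1)[0]].append(p)

    def search(self, query, k, probes):
        # Search only the `probes` clusters whose centroids are closest.
        candidates = []
        for c in nearest(query, self.centroids, probes):
            candidates.extend(self.members[c])
        return nearest(query, candidates, k)


# Toy data, invented for illustration. Centroids are tuples so they can be dict keys.
searcher = ClusterSearch(
    centroids=[(0.0, 0.0), (10.0, 10.0)],
    points=[(0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (10.0, 11.0)],
)
found = searcher.search((8.0, 8.0), k=1, probes=1)
```

With `probes=1` only one cluster is examined, so a true neighbor sitting just across a cluster boundary can be missed; raising `probes` reduces that risk at the cost of more distance computations.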
  • 15-19. [Figure slides; no text recoverable from the transcript]
  • 20. But This Requires k-means!
      • Need a new k-means algorithm to get speed
          – Hadoop is very slow at iterative map-reduce
          – Maybe Pregel clones like Giraph would be better
          – Or maybe not
      • Streaming k-means is
          – One pass (through the original data)
          – Very fast (20 μs per data point with threads)
          – Very parallelizable
  • 21. How It Works
      • For each point
          – Find approximately nearest centroid (distance = d)
          – If d > threshold, new centroid
          – Else possibly new cluster
          – Else add to nearest centroid
      • If centroids > K ~ C log N
          – Recursively cluster centroids with higher threshold
      • Result is large set of centroids
          – these provide approximation of original distribution
          – we can cluster centroids to get a close approximation of clustering original
          – or we can just use the result directly
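The steps on this slide can be sketched as below. This is a Python illustration under stated assumptions, not Mahout's StreamingKmeans: the slide does not say how "possibly new cluster" is decided, so the sketch assumes a new centroid is started with probability d / threshold; it uses an exact nearest-centroid search where the slide calls for an approximate one; and the names `streaming_kmeans`, `max_centroids`, and `growth` are invented. It expects `threshold > 0` and `max_centroids >= 1`, and needs Python 3.8 or later for `math.dist`.

```python
import math
import random


def streaming_kmeans(points, max_centroids, threshold, growth=1.5, seed=0):
    """One-pass clustering sketch. Returns a list of (centroid, weight) pairs."""
    rng = random.Random(seed)
    weighted = [(tuple(p), 1.0) for p in points]
    return _cluster(weighted, max_centroids, threshold, growth, rng)


def _cluster(weighted_points, max_centroids, threshold, growth, rng):
    centroids = []  # each entry is [vector, weight]
    for point, weight in weighted_points:
        if not centroids:
            centroids.append([point, weight])
            continue
        # Exact nearest-centroid search (the slides use an approximate one).
        best = min(centroids, key=lambda c: math.dist(c[0], point))
        d = math.dist(best[0], point)
        # Far points always start a centroid; nearer ones do so with
        # probability d / threshold (an assumed rule, see the text above).
        if d > threshold or rng.random() < d / threshold:
            centroids.append([point, weight])
        else:
            # Fold the point into the nearest centroid as a weighted mean.
            total = best[1] + weight
            best[0] = tuple((best[1] * c + weight * x) / total
                            for c, x in zip(best[0], point))
            best[1] = total
    if len(centroids) > max_centroids:
        # Too many centroids: recluster them with a higher threshold.
        return _cluster([(c, w) for c, w in centroids], max_centroids,
                        threshold * growth, growth, rng)
    return [(c, w) for c, w in centroids]


# Toy data, invented for illustration: two well-separated groups.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
result = streaming_kmeans(pts, max_centroids=3, threshold=1.0)
```

The weights record how many original points each centroid absorbed, so the output can itself be clustered as a weighted data set, which is what the last bullet group on the slide describes.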
  • 22. Parallel Speedup? [Figure: time per point (μs) versus number of threads, comparing the threaded version, the non-threaded version, and perfect scaling]
  • 23-24. Warning, Recursive Descent
      • Inner loop requires finding nearest centroid
      • With lots of centroids, this is slow
      • But wait, we have classes to accelerate that! (Let’s not use k-means searcher, though)
  • 25. Pig Vector
  • 26-27. What is it?
      • Supports Pig access to Mahout functions
      • So far text vectorization
      • And classification
      • And model saving
      • Kind of works (see pigML from twitter for better function)
  • 28. Compile and Install
      • Start by compiling and installing mahout in your local repository:
            cd ~/Apache
            git clone https://github.com/apache/mahout.git
            cd mahout
            mvn install -DskipTests
      • Then do the same with pig-vector:
            cd ~/Apache
            git clone git@github.com:tdunning/pig-vector.git
            cd pig-vector
            mvn package
  • 29. Tokenize and Vectorize Text
      • Tokenization is done using a text encoder, which is configured with
          – the dimension of the resulting vectors (typically 100,000-1,000,000)
          – a description of the variables to be included in the encoding
          – the schema of the tuples that pig will pass, together with their data types
      • Example:
            define EncodeVector org.apache.mahout.pig.encoders.EncodeVector('10', 'x+y+1', 'x:numeric, y:word, z:text');
      • You can also add a Lucene 3.1 analyzer in parentheses if you want something fancier
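To make the three encoder arguments concrete, here is a Python sketch of a hashed feature encoder that takes the same three strings as the example (dimension, formula, schema). It is an assumption about how such an encoder could behave, not a description of EncodeVector's internals: the function `make_encoder`, the CRC32 hashing, the `(offset)` key, and the handling of each type are all invented for illustration.

```python
import zlib


def make_encoder(dimension, formula, schema):
    """Build a function that maps a tuple of field values to a hashed vector."""
    # 'x:numeric, y:word' -> [('x', 'numeric'), ('y', 'word')]
    fields = [tuple(part.strip().split(":")) for part in schema.split(",")]
    # 'x+y+1' -> ['x', 'y', '1']; '1' stands for the constant offset.
    terms = [t.strip() for t in formula.split("+")]

    def slot(key):
        # Deterministic hash of a feature name into [0, dimension).
        return zlib.crc32(key.encode("utf-8")) % dimension

    def encode(values):
        row = {name: (kind, value) for (name, kind), value in zip(fields, values)}
        vector = [0.0] * dimension
        for term in terms:
            if term == "1":
                vector[slot("(offset)")] += 1.0
                continue
            kind, value = row[term]
            if kind == "numeric":
                vector[slot(term)] += float(value)
            elif kind == "word":
                vector[slot(term + "=" + value)] += 1.0
            elif kind == "text":
                for word in value.split():
                    vector[slot(term + "=" + word)] += 1.0
        return vector

    return encode


encode = make_encoder(10, "x+y+1", "x:numeric, y:word, z:text")
v = encode((2.5, "red", "unused because z is not in the formula"))
```

Because features are hashed into a fixed number of slots, the vector length is set up front and collisions simply add together, which is why a large dimension is typical for text.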
  • 30-31. The Formula
      • Not normal arithmetic
      • Describes which variables to use, whether offset is included
      • Also describes which interactions to use
          – but that doesn’t do anything yet!
  • 32. Load and Encode Data
      • Load the data
            a = load '/Users/tdunning/Downloads/NNBench.csv' using PigStorage(',') as (x1:int, x2:int, x3:int);
      • And encode it
            b = foreach a generate 1 as key, EncodeVector(*) as v;
      • Note that the true meaning of * is very subtle
      • Now store it
            store b into 'vectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage (
              '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
              '-c com.twitter.elephantbird.pig.util.GenericWritableConverter -t org.apache.mahout.math.VectorWritable');
  • 33. Train a Model
      • Pass previously encoded data to a sequential model trainer
            define train org.apache.mahout.pig.LogisticRegression('iterations=5, inMemory=true, features=100000, categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt soc.religion.christian talk.religion.misc');
      • Note that the argument is a string with its own syntax
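The slide does not spell out the grammar of that argument string. Judging only from the example, it looks like comma-separated `key=value` pairs in which `categories` holds a space-separated list. A Python sketch of a parser for that inferred shape (the function `parse_options` is hypothetical and is not part of pig-vector):

```python
def parse_options(spec):
    """Parse 'key=value, key=value' into a dict.

    Values stay strings, except 'categories', which is split on
    whitespace into a list. The grammar is inferred from one example.
    """
    options = {}
    for item in spec.split(","):
        key, _, value = item.strip().partition("=")
        options[key] = value.split() if key == "categories" else value
    return options


opts = parse_options(
    "iterations=5, inMemory=true, features=100000, "
    "categories=alt.atheism comp.graphics sci.space"
)
```

A format like this breaks as soon as a value needs to contain a comma, which is one reason an embedded mini-syntax of this kind tends to be fragile.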
  • 34. Reservations and Qualms
      • Pig-vector isn’t done
      • And it is ugly
      • And it doesn’t quite work
      • And it is hard to build
      • But there seems to be promise
  • 35. Potential
      • Add Naïve Bayes Model?
      • Somehow simplify the syntax?
      • Try a recent version of elephant-bird?
      • Switch to pigML?
  • 36.
      • Contact:
          – tdunning@maprtech.com
          – @ted_dunning
      • Slides and such:
          – http://info.mapr.com/ted-bbuzz-2012
      • Hash tags: #bbuzz #mahout
