New Directions in Mahout
© MapR Technologies - Confidential
New directions for mahout

  1. New Directions in Mahout
  2. Cut Out Bloat
  4. Bloat is Leaving in 0.7
     • Lots of abandoned code in Mahout
       – average code quality is poor
       – no users
       – no maintainers
       – why do we care?
     • Examples
       – old LDA
       – old Naïve Bayes
       – genetic algorithms
     • If you care, get on the mailing list
     • 0.7 is about to be released
  5. Integration of Collections
  6. Nobody Cares about Collections
     • We need it, math is built on it
     • Pull it into math
     • Broke the build (battle of the code expanders)
     • Fixed now (thanks Grant)
  7. K-nearest Neighbor with Super Fast k-means
  8. What’s that?
     • Find the k nearest training examples
     • Use the average value of the target variable from them
     • This is easy … but hard
       – easy because it is so conceptually simple and you don’t have knobs to turn or models to build
       – hard because of the stunning amount of math
       – also hard because we need top 50,000 results
     • Initial prototype was massively too slow
       – 3K queries × 200K examples takes hours
       – needed 20M × 25M in the same time
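
The two steps on this slide (find the k nearest training examples, then average their target values) can be sketched as a brute-force search. This is only an illustration of the idea, not the code from the talk; the function names and the toy data are made up here, and Euclidean distance is an assumption.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(query, examples, k):
    """Average the targets of the k examples nearest to `query`.

    `examples` is a list of (vector, target) pairs. Every example is
    compared against the query, which is why this approach is slow at scale.
    """
    nearest = heapq.nsmallest(k, examples, key=lambda ex: euclidean(query, ex[0]))
    return sum(target for _, target in nearest) / len(nearest)


# Toy data: two examples near the origin and one far away.
training = [((0.0, 0.0), 1.0), ((0.1, 0.0), 3.0), ((5.0, 5.0), 100.0)]
prediction = knn_predict((0.05, 0.0), training, k=2)
```

The cost is one distance computation per (query, example) pair, which is what makes the 20M × 25M target infeasible without a faster search structure.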
  10. How We Did It
     • 2 week hackathon with 6 developers from customer bank
     • Agile-ish development
     • To avoid IP issues
       – all code is Apache Licensed (no ownership question)
       – all data is synthetic (no question of private data)
       – all development done on individual machines, hosting on GitHub
       – open is easier than closed (in this case)
     • Goal is new open technology to facilitate new closed solutions
     • Ambitious goal of ~1,000,000× speedup
       – well, really only 100-1000× after basic hygiene
  11. What We Did
     • Mechanism for extending Mahout Vectors
       – DelegatingVector, WeightedVector, Centroid
     • Searcher interface
       – ProjectionSearch, KmeansSearch, LshSearch, Brute
     • Super-fast clustering
       – Kmeans, StreamingKmeans
  12. Projection Search
     • java.util.TreeSet!
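
The slide gives only the name and the data structure, so the following is one plausible reading rather than the actual ProjectionSearch code: project every point onto a few random directions, keep each set of projections sorted (the role a TreeSet would play in Java), and for a query inspect only the points whose projections fall near the query's projection. The class name, the `window` parameter, and the use of a sorted list with `bisect` in place of a tree are all assumptions made for this sketch.

```python
import bisect
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ProjectionIndex:
    """Approximate nearest-neighbor search via sorted random projections."""

    def __init__(self, points, num_projections=4, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.points = points
        self.directions = [[rng.gauss(0, 1) for _ in range(dim)]
                           for _ in range(num_projections)]
        # One sorted list of (projection value, point index) per direction.
        self.tables = [sorted((dot(d, p), i) for i, p in enumerate(points))
                       for d in self.directions]

    def search(self, query, k, window=8):
        """Return indices of up to k candidates, ranked by true distance."""
        candidates = set()
        for d, table in zip(self.directions, self.tables):
            pos = bisect.bisect_left(table, (dot(d, query), -1))
            lo, hi = max(0, pos - window), min(len(table), pos + window)
            candidates.update(i for _, i in table[lo:hi])
        return sorted(candidates, key=lambda i: dist(query, self.points[i]))[:k]


rng = random.Random(1)
points = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
index = ProjectionIndex(points)
hits = index.search(points[17], k=3)
```

More projections or a wider window raise the chance of finding the true neighbors at the cost of more distance computations, which is presumably the trade-off behind the "How Many Projections?" slide that follows.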
  13. How Many Projections?
  14. K-means Search
     • Simple Idea
       – pre-cluster the data
       – to find the nearest points, search the nearest clusters
     • Recursive application
       – to search a cluster, use a Searcher!
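
A minimal sketch of the idea on this slide, assuming the centroids have already been computed: assign each point to its nearest centroid, then answer a query by scanning only the members of the few clusters closest to it. The class name and the `probes` parameter are hypothetical, and the scan inside each cluster is plain brute force here, whereas the slide suggests plugging in another Searcher at that step.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ClusterSearch:
    """Nearest-neighbor search restricted to the clusters nearest the query."""

    def __init__(self, points, centroids):
        self.centroids = centroids
        self.members = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: dist(p, centroids[c]))
            self.members[nearest].append(p)

    def search(self, query, k, probes=2):
        # Rank clusters by centroid distance and look inside the best few.
        order = sorted(range(len(self.centroids)),
                       key=lambda c: dist(query, self.centroids[c]))
        candidates = [p for c in order[:probes] for p in self.members[c]]
        return sorted(candidates, key=lambda p: dist(query, p))[:k]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0)]
centroids = [(0.3, 0.3), (10.3, 10.3), (20, 0)]
searcher = ClusterSearch(points, centroids)
result = searcher.search((0.2, 0.1), k=2, probes=1)
```

Probing more than one cluster guards against the case where the true neighbor sits just across a cluster boundary.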
  20. But This Requires k-means!
     • Need a new k-means algorithm to get speed
       – Hadoop is very slow at iterative map-reduce
       – Maybe Pregel clones like Giraph would be better
       – Or maybe not
     • Streaming k-means is
       – One pass (through the original data)
       – Very fast (20 μs per data point with threads)
       – Very parallelizable
  21. How It Works
     • For each point
       – Find approximately nearest centroid (distance = d)
       – If d > threshold, new centroid
       – Else possibly new cluster
       – Else add to nearest centroid
     • If centroids > K ~ C log N
       – Recursively cluster centroids with higher threshold
     • Result is large set of centroids
       – these provide approximation of original distribution
       – we can cluster centroids to get a close approximation of clustering the original data
       – or we can just use the result directly
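
The steps above can be sketched in a few lines. This is a simplified illustration, not Mahout's StreamingKmeans: the nearest centroid is found exactly rather than approximately, "possibly new cluster" is interpreted as a coin flip with probability d / threshold, and the `growth` factor for raising the threshold is an arbitrary choice. All three are assumptions.

```python
import math
import random


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def streaming_kmeans(points, max_centroids, threshold, growth=1.5, seed=0):
    """One pass over `points`; returns a list of [centroid, weight] pairs."""
    rng = random.Random(seed)
    centroids = []

    def absorb(vector, weight):
        if not centroids:
            centroids.append([list(vector), weight])
            return
        nearest = min(centroids, key=lambda c: dist(vector, c[0]))
        d = dist(vector, nearest[0])
        if d > threshold or rng.random() < d / threshold:
            # Far away (or unlucky): start a new centroid.
            centroids.append([list(vector), weight])
        else:
            # Close enough: fold into the nearest centroid (weighted mean).
            total = nearest[1] + weight
            nearest[0] = [(c * nearest[1] + x * weight) / total
                          for c, x in zip(nearest[0], vector)]
            nearest[1] = total

    for p in points:
        absorb(p, 1.0)
        while len(centroids) > max_centroids:
            # Too many centroids: raise the threshold and re-cluster them.
            threshold *= growth
            old = centroids[:]
            centroids.clear()
            for vector, weight in old:
                absorb(vector, weight)
    return centroids


rng = random.Random(42)
data = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
        for cx, cy in [(0, 0), (10, 10)] for _ in range(150)]
sketch = streaming_kmeans(data, max_centroids=20, threshold=0.5)
```

Because each centroid carries a weight, the resulting sketch can be handed to an ordinary weighted k-means to produce the final clusters, as the last bullet group describes.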
  22. Parallel Speedup?
     [Figure: time per point (μs) versus number of threads, with series for the threaded version, the non-threaded version, and perfect scaling]
  24. Warning, Recursive Descent
     • Inner loop requires finding nearest centroid
     • With lots of centroids, this is slow
     • But wait, we have classes to accelerate that!
       (Let’s not use k-means searcher, though)
  25. Pig Vector
  27. What is it?
     • Supports Pig access to Mahout functions
     • So far text vectorization
     • And classification
     • And model saving
     • Kind of works (see pigML from Twitter for better function)
  28. Compile and Install
     • Start by compiling and installing Mahout in your local repository:
         cd ~/Apache
         git clone https://github.com/apache/mahout.git
         cd mahout
         mvn install -DskipTests
     • Then do the same with pig-vector:
         cd ~/Apache
         git clone git@github.com:tdunning/pig-vector.git
         cd pig-vector
         mvn package
  29. Tokenize and Vectorize Text
     • Tokenization is done using a text encoder, which is configured with
       – the dimension of the resulting vectors (typically 100,000-1,000,000)
       – a description of the variables to be included in the encoding
       – the schema of the tuples that Pig will pass, together with their data types
     • Example:
         define EncodeVector org.apache.mahout.pig.encoders.EncodeVector('10', 'x+y+1', 'x:numeric, y:word, z:text');
     • You can also add a Lucene 3.1 analyzer in parentheses if you want something fancier
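
To show how a fixed vector dimension can hold numeric, word, and text variables at once, here is a small sketch based on feature hashing. The slide does not say how EncodeVector works internally, so hashing is an assumption here, as are the function names, the slot-naming scheme, and the choice of CRC32 as the hash.

```python
import zlib


def slot(name, dim):
    """Map a feature name to a vector index (CRC32 is deterministic across runs)."""
    return zlib.crc32(name.encode("utf-8")) % dim


def encode(record, schema, dim, intercept=True):
    """Encode one record into a dense list of length `dim`."""
    vector = [0.0] * dim
    if intercept:
        # Plays the role of the "+1" offset term in a formula such as x+y+1.
        vector[slot("intercept", dim)] += 1.0
    for name, kind in schema.items():
        value = record[name]
        if kind == "numeric":
            vector[slot(name, dim)] += float(value)
        elif kind == "word":
            vector[slot(f"{name}:{value}", dim)] += 1.0
        elif kind == "text":
            for token in str(value).lower().split():
                vector[slot(f"{name}:{token}", dim)] += 1.0
    return vector


schema = {"x": "numeric", "y": "word"}
encoded = encode({"x": 3.0, "y": "hello"}, schema, dim=10)
```

With a dimension as small as 10, different features will often collide in the same slot; the much larger dimensions quoted on the slide make such collisions rare.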
  31. The Formula
     • Not normal arithmetic
     • Describes which variables to use, whether offset is included
     • Also describes which interactions to use
       – but that doesn’t do anything yet!
  32. Load and Encode Data
     • Load the data
         a = load '/Users/tdunning/Downloads/NNBench.csv' using PigStorage(',')
             as (x1:int, x2:int, x3:int);
     • And encode it
         b = foreach a generate 1 as key, EncodeVector(*) as v;
     • Note that the true meaning of * is very subtle
     • Now store it
         store b into 'vectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage (
             '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
             '-c com.twitter.elephantbird.pig.util.GenericWritableConverter -t org.apache.mahout.math.VectorWritable');
  33. Train a Model
     • Pass previously encoded data to a sequential model trainer
         define train org.apache.mahout.pig.LogisticRegression(
             'iterations=5, inMemory=true, features=100000, categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt soc.religion.christian talk.religion.misc');
     • Note that the argument is a string with its own syntax
  34. Reservations and Qualms
     • Pig-vector isn’t done
     • And it is ugly
     • And it doesn’t quite work
     • And it is hard to build
     • But there seems to be promise
  35. Potential
     • Add Naïve Bayes Model?
     • Somehow simplify the syntax?
     • Try a recent version of elephant-bird?
     • Switch to pigML?
  36. Contact:
       – tdunning@maprtech.com
       – @ted_dunning
     • Slides and such:
       – http://info.mapr.com/ted-bbuzz-2012
     • Hash tags: #bbuzz #mahout