New directions for mahout

© MapR Technologies - Confidential
New Directions in Mahout

Cut Out Bloat

Bloat is Leaving in 0.7
 Lots of abandoned code in Mahout
– average code quality is poor
– no users
– no maintainers
– why do we care?
 Examples
– old LDA
– old Naïve Bayes
– genetic algorithms
 If you care, get on the mailing list
 0.7 is about to be released

Integration of
Collections

Nobody Cares about Collections
 We need it, math is built on it
 Pull it into math
 Broke the build (battle of the code expanders)
 Fixed now (thanks Grant)

K-nearest Neighbor with
Super Fast k-means

What’s that?
 Find the k nearest training examples
 Use the average value of the target variable from them
 This is easy … but hard
– easy because it is so conceptually simple and you don’t have knobs to turn
or models to build
– hard because of the stunning amount of math
– also hard because we need the top 50,000 results
 Initial prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
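
The two steps above (find the k nearest training examples, then average their target values) can be sketched as a brute-force search. This is an illustrative Python sketch, not Mahout code; the name `knn_predict`, the Euclidean distance, and the toy data are assumptions made for the example.

```python
import heapq
import math


def knn_predict(query, examples, k):
    """Average the target of the k training examples nearest to query.

    examples is a non-empty list of (vector, target) pairs.
    Distance is Euclidean (an assumption; the slides do not name a metric).
    """
    nearest = heapq.nsmallest(k, examples, key=lambda ex: math.dist(query, ex[0]))
    return sum(target for _, target in nearest) / len(nearest)


# Toy training set: two points near the origin, two far away.
train = [
    ((0.0, 0.0), 1.0),
    ((0.0, 1.0), 2.0),
    ((5.0, 5.0), 10.0),
    ((6.0, 5.0), 12.0),
]
print(knn_predict((0.1, 0.2), train, k=2))  # prints 1.5
```

Every query scans every training example, which is why the brute-force approach is simple to write but does not scale to the sizes quoted above.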

How We Did It
 2-week hackathon with 6 developers from a customer bank
 Agile-ish development
 To avoid IP issues
– all code is Apache Licensed (no ownership question)
– all data is synthetic (no question of private data)
– all development done on individual machines, hosting on GitHub
– open is easier than closed (in this case)
 Goal is new open technology to facilitate new closed solutions
 Ambitious goal of ~ 1,000,000 x speedup
– well, really only 100-1000x after basic hygiene
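
The ~1,000,000x figure appears to follow from the workload sizes quoted earlier (3K x 200K for the prototype, 20M x 25M for the target). The short calculation below assumes that cost is proportional to queries times examples and that the target must run in the same time as the prototype; both are assumptions made for illustration.

```python
# Query-example pairs a brute-force search would have to compare.
prototype_pairs = 3_000 * 200_000        # prototype workload
target_pairs = 20_000_000 * 25_000_000   # required workload

ratio = target_pairs / prototype_pairs
print(f"{ratio:,.0f}")  # prints 833,333
```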

What We Did
 Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid
 Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute
 Super-fast clustering
– Kmeans, StreamingKmeans

Projection Search
java.util.TreeSet!
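
A rough sketch of how a sorted set can drive projection search: project every point onto a few random directions, keep each projection sorted, look up the query's neighbors in each sorted order, and rank that small candidate set by true distance. This is a Python illustration, not the Mahout implementation. A sorted list with `bisect` stands in for the TreeSet, and the class name and the `num_projections` and `window` parameters are hypothetical.

```python
import bisect
import math
import random


class ProjectionSearchSketch:
    """Approximate nearest-neighbor search via random 1-D projections."""

    def __init__(self, points, num_projections=3, seed=42):
        rng = random.Random(seed)
        dim = len(points[0])
        self.points = points
        # Random directions; each one gives a 1-D ordering of the points.
        self.directions = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_projections)
        ]
        # One sorted list of (projection, point index) per direction.
        self.indexes = [
            sorted((self._dot(d, p), i) for i, p in enumerate(points))
            for d in self.directions
        ]

    @staticmethod
    def _dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def search(self, query, k, window=10):
        """Return indices of up to k candidate points, closest first."""
        candidates = set()
        for d, index in zip(self.directions, self.indexes):
            # Position where the query's projection would be inserted.
            pos = bisect.bisect_left(index, (self._dot(d, query),))
            for _, i in index[max(0, pos - window):pos + window]:
                candidates.add(i)
        # Rank the candidate set by true Euclidean distance.
        ranked = sorted(candidates, key=lambda i: math.dist(query, self.points[i]))
        return ranked[:k]


data_rng = random.Random(7)
points = [(data_rng.random(), data_rng.random()) for _ in range(50)]
searcher = ProjectionSearchSketch(points)
print(searcher.search((0.5, 0.5), k=3))
```

More projections and a wider window raise the chance that the true neighbors land in the candidate set, at the cost of more distance computations.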

How Many Projections?

K-means Search
 Simple Idea
– pre-cluster the data
– to find the nearest points, search the nearest clusters
 Recursive application
– to search a cluster, use a Searcher!
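
The idea above can be sketched as follows. This is a Python illustration under simplifying assumptions: the centroids are taken as given (the pre-clustering step is not shown), and each cluster is scanned linearly where the slide suggests a nested Searcher. The names `ClusterSearchSketch` and `clusters_to_probe` are hypothetical.

```python
import math


def nearest_centroids(query, centroids, n):
    """Indices of the n centroids closest to query."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    return order[:n]


class ClusterSearchSketch:
    def __init__(self, points, centroids):
        self.centroids = centroids
        # Assign every point to its nearest centroid once, up front.
        self.members = [[] for _ in centroids]
        for p in points:
            self.members[nearest_centroids(p, centroids, 1)[0]].append(p)

    def search(self, query, k, clusters_to_probe=2):
        """Search only the clusters whose centroids are nearest the query."""
        candidates = []
        for c in nearest_centroids(query, self.centroids, clusters_to_probe):
            # A real implementation could delegate to another searcher here.
            candidates.extend(self.members[c])
        return sorted(candidates, key=lambda p: math.dist(query, p))[:k]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = [(0, 0), (10, 10)]
searcher = ClusterSearchSketch(points, centroids)
print(searcher.search((0.2, 0.1), k=2, clusters_to_probe=1))
```

Probing more than one cluster guards against a query that sits near a cluster boundary, where the true neighbors may belong to an adjacent cluster.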

But This Requires k-means!
 Need a new k-means algorithm to get speed
– Hadoop is very slow at iterative map-reduce
– Maybe Pregel clones like Giraph would be better
– Or maybe not
 Streaming k-means is
– One pass (through the original data)
– Very fast (20 μs per data point with threads)
– Very parallelizable

How It Works
 For each point
– Find approximately nearest centroid (distance = d)
– If d > threshold, new centroid
– Else possibly new centroid anyway (random choice)
– Else add to nearest centroid
 If number of centroids > K ≈ C log N
– Recursively cluster centroids with higher threshold
 Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering original
– or we can just use the result directly
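
The loop above can be sketched in a few lines. This is a single-threaded Python illustration, not the Mahout code. Several details are assumptions: the "possibly new" step is taken to mean a random choice with probability d / threshold, the threshold grows by a fixed `growth` factor on each collapse, and the nearest centroid is found by exact linear scan instead of an approximate searcher.

```python
import math
import random


def _insert(centroids, point, weight, threshold, rng):
    """Add a weighted point: open a new centroid or merge into the nearest one."""
    if not centroids:
        centroids.append([list(point), weight])
        return
    nearest = min(centroids, key=lambda c: math.dist(point, c[0]))
    d = math.dist(point, nearest[0])
    # Far points always start a centroid; nearer ones do so at random (assumed rule).
    if d > threshold or rng.random() < d / threshold:
        centroids.append([list(point), weight])
    else:
        total = nearest[1] + weight
        nearest[0] = [(a * nearest[1] + b * weight) / total
                      for a, b in zip(nearest[0], point)]
        nearest[1] = total


def streaming_kmeans(points, max_centroids, threshold=0.05, growth=1.5, seed=0):
    """One pass over points; returns a list of [centroid vector, weight]."""
    rng = random.Random(seed)
    centroids = []
    for p in points:
        _insert(centroids, p, 1.0, threshold, rng)
        # Too many centroids: re-cluster the centroids with a higher threshold.
        while len(centroids) > max_centroids:
            threshold *= growth
            old, centroids = centroids, []
            for vec, w in old:
                _insert(centroids, vec, w, threshold, rng)
    return centroids


data_rng = random.Random(1)
data = [(data_rng.random(), data_rng.random()) for _ in range(200)]
sketch = streaming_kmeans(data, max_centroids=20)
print(len(sketch), sum(w for _, w in sketch))
```

The weights record how many original points each centroid absorbed, which is what allows the centroids to be clustered again later as a stand-in for the original data.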

Parallel Speedup?
[Figure: time per point (μs) versus number of threads, for the threaded version, the non-threaded version, and perfect scaling]

Warning, Recursive Descent
 Inner loop requires finding nearest centroid
 With lots of centroids, this is slow
 But wait, we have classes to accelerate that!
(Let’s not use k-means searcher, though)

Pig Vector

What is it?
 Supports Pig access to Mahout functions
 So far text vectorization
 And classification
 And model saving
 Kind of works (see pigML from Twitter for better functionality)

Compile and Install
 Start by compiling and installing mahout in your local repository:
cd ~/Apache
git clone https://github.com/apache/mahout.git
cd mahout
mvn install -DskipTests
 Then do the same with pig-vector
cd ~/Apache
git clone git@github.com:tdunning/pig-vector.git
cd pig-vector
mvn package

Tokenize and Vectorize Text
 Tokenization is done using a text encoder, which is defined with
– the dimension of the resulting vectors (typically 100,000-1,000,000)
– a description of the variables to be included in the encoding
– the schema of the tuples that Pig will pass together with their data types
 Example:
define EncodeVector
org.apache.mahout.pig.encoders.EncodeVector
('10','x+y+1', 'x:numeric, y:word, z:text');
 You can also add a Lucene 3.1 analyzer in parentheses if you want
something fancier
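
To make the role of the dimension and the schema concrete, here is a sketch of hashed feature encoding, the general technique behind Mahout's feature encoders. It is a Python illustration, not the pig-vector code. The function names, the CRC32 hash, and the whitespace tokenizer are assumptions, and the formula string is replaced by a plain list of variable names.

```python
import zlib


def _slot(key, dimension):
    """Map a feature name to a vector position with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % dimension


def encode(record, schema, variables, dimension):
    """Encode the chosen variables of a record into a fixed-size vector.

    schema maps variable name -> 'numeric', 'word' or 'text'.
    """
    vector = [0.0] * dimension
    for name in variables:
        kind, value = schema[name], record[name]
        if kind == "numeric":
            # One slot per variable, holding the numeric value.
            vector[_slot(name, dimension)] += float(value)
        elif kind == "word":
            # One slot per (variable, value) pair.
            vector[_slot(f"{name}:{value}", dimension)] += 1.0
        elif kind == "text":
            # One slot per token; a real analyzer would do more than split().
            for token in str(value).lower().split():
                vector[_slot(f"{name}:{token}", dimension)] += 1.0
        else:
            raise ValueError(f"unknown type {kind!r} for {name}")
    return vector


schema = {"x": "numeric", "y": "word", "z": "text"}
record = {"x": 2.5, "y": "red", "z": "fast red car"}
v = encode(record, schema, variables=["x", "y", "z"], dimension=10)
print(v)
```

With a dimension as small as 10, different features often collide in the same slot, which is why the typical dimensions quoted above are so much larger.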

The Formula
 Not normal arithmetic
 Describes which variables to use, whether offset is included
 Also describes which interactions to use
– but that doesn’t do anything yet!

Load and Encode Data
 Load the data
a = load '/Users/tdunning/Downloads/NNBench.csv' using PigStorage(',')
as (x1:int, x2:int, x3:int);
 And encode it
b = foreach a generate 1 as key, EncodeVector(*) as v;
 Note that the true meaning of * is very subtle
 Now store it
store b into 'vectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage(
'-c com.twitter.elephantbird.pig.util.IntWritableConverter',
'-c com.twitter.elephantbird.pig.util.GenericWritableConverter -t org.apache.mahout.math.VectorWritable');

Train a Model
 Pass previously encoded data to a sequential model trainer
define train org.apache.mahout.pig.LogisticRegression(
'iterations=5, inMemory=true, features=100000, categories=alt.atheism
comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns
comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast
comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space
talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt
soc.religion.christian talk.religion.misc');
 Note that the argument is a string with its own syntax

Reservations and Qualms
 Pig-vector isn’t done
 And it is ugly
 And it doesn’t quite work
 And it is hard to build
 But there seems to be promise

Potential
 Add Naïve Bayes Model?
 Somehow simplify the syntax?
 Try a recent version of elephant-bird?
 Switch to pigML?

 Contact:
– tdunning@maprtech.com
– @ted_dunning
 Slides and such:
– http://info.mapr.com/ted-bbuzz-2012
Hash tags: #bbuzz #mahout
