Boston hug-2012-07

Mahout, New and Improved
Now with Super Fast Clustering

©MapR Technologies - Confidential

Agenda

 What happened in Mahout 0.7
– less bloat
– simpler structure
– general cleanup


To Cut Out Bloat


Bloat is Leaving in 0.7

 Lots of abandoned code in Mahout
– average code quality is poor
– no users
– no maintainers
– why do we care?
 Examples
– old LDA
– old Naïve Bayes
– genetic algorithms
 If you care, get on the mailing list
– oops, too late since 0.7 is already released


Integration of
Collections


Nobody Cares about Collections

 We need it, math is built on it

 Pull it into math

 Broke the build (battle of the code expanders)

 Fixed now (thanks to Grant)


Pig Vector


What is it?

 Supports Pig access to Mahout functions

 So far: text vectorization

 And classification

 And model saving

 Kind of works (see PigML from Twitter for better functionality)


Compile and Install

 Start by compiling and installing mahout in your local repository:
cd ~/Apache
git clone https://github.com/apache/mahout.git
cd mahout
mvn install -DskipTests

 Then do the same with pig-vector
cd ~/Apache
git clone git@github.com:tdunning/pig-vector.git
cd pig-vector
mvn package


Tokenize and Vectorize Text

 Tokenization is done using a text encoder, defined by
– the dimension of the resulting vectors (typically 100,000-1,000,000)
– a description of the variables to be included in the encoding
– the schema of the tuples that Pig will pass, together with their data types
 Example:
define EncodeVector
org.apache.mahout.pig.encoders.EncodeVector
('10','x+y+1', 'x:numeric, y:word, z:text');

 You can also add a Lucene 3.1 analyzer in parentheses if you want
something fancier
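
To make the three encoder arguments concrete, here is a minimal Python sketch of a hashed encoder. It assumes the encoder hashes each (variable, value) pair into a fixed-size vector, in the style of Mahout's hashed feature encoders; the names slot and encode are hypothetical and are not part of pig-vector.

```python
import hashlib


def slot(name, value, dim):
    """Hash a (variable, value) pair to a stable index in [0, dim)."""
    digest = hashlib.md5(f"{name}={value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % dim


def encode(record, schema, variables, dim, offset=True):
    """Encode the chosen variables of one record as a list of length dim."""
    vector = [0.0] * dim
    if offset:
        vector[slot("offset", "", dim)] += 1.0  # constant term (the "1" in the formula)
    for name in variables:
        kind, value = schema[name], record[name]
        if kind == "numeric":
            vector[slot(name, "", dim)] += float(value)  # one slot per variable
        elif kind == "word":
            vector[slot(name, value, dim)] += 1.0  # one slot per distinct value
        elif kind == "text":
            for token in str(value).lower().split():  # naive whitespace tokenizer (assumption)
                vector[slot(name, token, dim)] += 1.0
        else:
            raise ValueError(f"unknown type {kind!r} for variable {name!r}")
    return vector


# Mirrors the slide's example: dimension 10, formula x+y+1, schema x, y, z.
schema = {"x": "numeric", "y": "word", "z": "text"}
record = {"x": 2.5, "y": "red", "z": "the quick brown fox"}
v = encode(record, schema, ["x", "y"], dim=10)
```

Collisions simply add together, which is why a small dimension such as 10 still works, only with more interference between features.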


The Formula

 Not normal arithmetic

 Describes which variables to use, whether offset is included

 Also describes which interactions to use
– but that doesn’t do anything yet!
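
As an illustration of how a formula such as 'x+y+1' can be read, the following Python sketch splits it into a variable list and an offset flag. The function parse_formula is hypothetical; the slides do not show the interaction syntax, so the sketch rejects anything other than plain variable names and the constant 1.

```python
def parse_formula(formula):
    """Return (variables, include_offset) for a formula such as 'x+y+1'."""
    variables = []
    include_offset = False
    for term in formula.split("+"):
        term = term.strip()
        if term == "1":
            include_offset = True  # "+1" asks for the constant (offset) term
        elif term.isidentifier():
            variables.append(term)  # a plain variable name
        else:
            # Interaction terms are not handled in this sketch.
            raise ValueError(f"unsupported term: {term!r}")
    return variables, include_offset


variables, include_offset = parse_formula("x+y+1")
```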


Load and Encode Data

 Load the data
a = load '/Users/tdunning/Downloads/NNBench.csv' using PigStorage(',')
as (x1:int, x2:int, x3:int);
 And encode it
b = foreach a generate 1 as key, EncodeVector(*) as v;
 Note that the true meaning of * is very subtle
 Now store it
store b into 'vectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage(
'-c com.twitter.elephantbird.pig.util.IntWritableConverter',
'-c com.twitter.elephantbird.pig.util.GenericWritableConverter -t org.apache.mahout.math.VectorWritable');


Train a Model

 Pass previously encoded data to a sequential model trainer
define train org.apache.mahout.pig.LogisticRegression(
'iterations=5, inMemory=true, features=100000, categories=alt.atheism
comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns
comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast
comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space
talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt
soc.religion.christian talk.religion.misc');
 Note that the argument is a string with its own syntax
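
The option string above appears to be a comma-separated list of key=value pairs, with categories given as a whitespace-separated list. The Python sketch below parses it under that assumption, which is inferred only from the single example on this slide; parse_options is a hypothetical helper, not the pig-vector parser.

```python
def parse_options(spec):
    """Parse 'key=value, key=value, ...' into a dict; categories become a list."""
    options = {}
    for part in spec.split(","):
        key, _, value = part.partition("=")
        # Collapse line breaks and repeated spaces inside the value.
        options[key.strip()] = " ".join(value.split())
    if "categories" in options:
        options["categories"] = options["categories"].split()
    return options


opts = parse_options(
    "iterations=5, inMemory=true, features=100000, "
    "categories=alt.atheism\ncomp.graphics sci.med"
)
```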


Reservations and Qualms

 Pig-vector isn’t done

 And it is ugly

 And it doesn’t quite work

 And it is hard to build

 But there seems to be promise


Potential

 Add Naïve Bayes Model?

 Somehow simplify the syntax?

 Try a recent version of elephant-bird?

 Switch to pigML?


Large-scale k-Means Clustering


Goals

 Cluster very large data sets
 Facilitate large nearest neighbor search
 Allow very large number of clusters
 Achieve good quality
– low average distance to nearest centroid on held-out data
 Based on Mahout Math
 Runs on Hadoop (really MapR) cluster
 FAST – cluster tens of millions in minutes


Non-goals

 Use map-reduce (but it is there)
 Minimize the number of clusters
 Support metrics other than L2


Anti-goals

 Multiple passes over original data
 Scale as O(k n)


Why?


K-nearest Neighbor with
Super Fast k-means


What’s that?

 Find the k nearest training examples
 Use the average value of the target variable from them

 This is easy … but hard
– easy because it is so conceptually simple and you have few knobs to turn
or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results, not just single nearest

 Initial prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
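
A brute-force version of the idea fits in a few lines of Python, which is the "easy" part. The names l2 and knn_predict are illustrative only. The "hard" part is the cost: 3K x 200K is 6 x 10^8 distance computations, while 20M x 25M is 5 x 10^14, roughly 800,000 times more.

```python
import heapq
import math


def l2(a, b):
    """Euclidean (L2) distance between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def knn_predict(query, examples, k):
    """Average the target values of the k training examples nearest to query.

    examples is a non-empty list of (vector, target) pairs.
    """
    nearest = heapq.nsmallest(k, examples, key=lambda ex: l2(query, ex[0]))
    return sum(target for _, target in nearest) / len(nearest)


examples = [((0.0, 0.0), 1.0), ((1.0, 0.0), 3.0), ((10.0, 10.0), 100.0)]
prediction = knn_predict((0.2, 0.0), examples, k=2)  # averages the targets 1.0 and 3.0
```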


Modeling with k-nearest Neighbors


Subject to Some Limits


Log Transform Improves Things


Neighbors Depend on Good Presentation


How We Did It

 2 week hackathon with 6 developers from customer bank
 Agile-ish development
 To avoid IP issues
– all code is Apache Licensed (no ownership question)
– all data is synthetic (no question of private data)
– all development done on individual machines, hosting on Github
– open is easier than closed (in this case)
 Goal is new open technology to facilitate new closed solutions

 Ambitious goal of ~ 1,000,000 x speedup
– well, really only 100-1000x after basic hygiene


What We Did

 Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid

 Shared memory matrix
– FileBasedMatrix uses mmap to share very large dense matrices

 Searcher interface
– Brute, ProjectionSearch, KmeansSearch, LshSearch

 Super-fast clustering
– Kmeans, StreamingKmeans


Projection Search

java.util.TreeSet!


Projection Search

 Projection onto a line provides a total order on data
 Nearby points stay nearby
 Some other points also wind up close

 Search points just before or just after the query point
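
The idea can be sketched in Python with a sorted list and bisect standing in for the TreeSet. This is an illustration under assumptions, not the Mahout ProjectionSearch code: the class name is borrowed from the slides, while the parameters n_projections and window, their defaults, and the use of Gaussian random directions are all assumed.

```python
import bisect
import math
import random


def l2(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


class ProjectionSearch:
    def __init__(self, data, n_projections=4, window=8, seed=0):
        rng = random.Random(seed)
        self.data = data
        self.window = window
        self.directions = []
        self.orders = []
        for _ in range(n_projections):
            d = [rng.gauss(0, 1) for _ in range(len(data[0]))]
            self.directions.append(d)
            # Each projection gives a total order: (projected value, row index).
            self.orders.append(sorted((dot(d, x), i) for i, x in enumerate(data)))

    def search(self, query, k=1):
        candidates = set()
        for d, order in zip(self.directions, self.orders):
            pos = bisect.bisect_left(order, (dot(d, query),))
            lo = max(0, pos - self.window)
            hi = min(len(order), pos + self.window)
            # Points just before or just after the query in this ordering.
            candidates.update(i for _, i in order[lo:hi])
        # Rank the small candidate set by true distance.
        return sorted(candidates, key=lambda i: l2(query, self.data[i]))[:k]
```

More projections and a wider window raise the chance of catching the true nearest neighbors, at the price of more exact distance computations per query.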


How Many Projections?


K-means Search

 Simple Idea
– pre-cluster the data
– to find the nearest points, search the nearest clusters

 Recursive application
– to search a cluster, use a Searcher!
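
A minimal Python sketch of the simple idea follows. It assumes the centroids are already available from some earlier clustering step; the class name ClusterSearch and the probes parameter are hypothetical. The final scan inside each cluster is brute force here, and that scan is the place where another searcher could be applied recursively.

```python
import math


def l2(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


class ClusterSearch:
    def __init__(self, data, centroids):
        self.data = data
        self.centroids = centroids
        # Pre-cluster: assign every point to its nearest centroid.
        self.members = [[] for _ in centroids]
        for i, x in enumerate(data):
            nearest = min(range(len(centroids)), key=lambda c: l2(x, centroids[c]))
            self.members[nearest].append(i)

    def search(self, query, k=1, probes=2):
        # Visit only the few clusters whose centroids are closest to the query.
        order = sorted(range(len(self.centroids)),
                       key=lambda c: l2(query, self.centroids[c]))
        candidates = [i for c in order[:probes] for i in self.members[c]]
        candidates.sort(key=lambda i: l2(query, self.data[i]))
        return candidates[:k]


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
searcher = ClusterSearch(data, centroids=[(0, 0), (10, 10)])
hits = searcher.search((9.8, 10.1), k=2, probes=1)
```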


But This Requires k-means!

 Need a new k-means algorithm to get speed
– Hadoop is very slow at iterative map-reduce
– Maybe Pregel clones like Giraph would be better
– Or maybe not

 Streaming k-means is
– One pass (through the original data)
– Very fast (20 μs per data point with threads on one node)
– Very parallelizable


Basic Method

 Use a single pass of k-means with very many clusters
– output is a bad-ish clustering but a good surrogate
 Use weighted centroids from step 1 to do in-memory clustering
– output is a good clustering with fewer clusters


Algorithmic Details

for each data point xn:
    compute the distance to the nearest centroid, ∂
    sample u uniformly from [0, 1]
    if u > ∂/ß, add xn to the nearest centroid
    else create a new centroid at xn

    if number of centroids > k log n:
        recursively cluster the centroids
        set ß = 1.5 ß if the number of centroids did not decrease
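
The pseudocode above can be turned into a short runnable Python sketch. Several details are assumptions: u is drawn uniformly from [0, 1), beta stands for ß, the centroid limit has a floor of k so that the first few points are not collapsed, and the recursive step reuses the same rule without scaling the threshold by weight. It is not the Mahout StreamingKmeans code.

```python
import math
import random


def l2(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def add_point(centroids, x, w, beta, rng):
    """Merge (x, w) into its nearest centroid, or start a new centroid."""
    if centroids:
        i = min(range(len(centroids)), key=lambda j: l2(centroids[j][0], x))
        c, cw = centroids[i]
        if rng.random() > l2(c, x) / beta:  # nearby points usually merge
            total = cw + w
            centroids[i] = ([(a * cw + b * w) / total for a, b in zip(c, x)], total)
            return
    centroids.append((list(x), w))  # distant points usually seed a new centroid


def streaming_kmeans(points, k, beta, seed=0):
    """One pass over points; returns (weighted centroids, final beta)."""
    rng = random.Random(seed)
    centroids = []
    for n, x in enumerate(points, start=1):
        add_point(centroids, x, 1.0, beta, rng)
        limit = k * max(1.0, math.log(n))  # floor of k for small n (assumption)
        if len(centroids) > limit:
            before = len(centroids)
            old, centroids = centroids, []
            for c, w in old:  # cluster the centroids themselves
                add_point(centroids, c, w, beta, rng)
            if len(centroids) >= before:
                beta *= 1.5  # loosen the threshold when nothing collapsed
    return centroids, beta
```

Each centroid carries the total weight of the points merged into it, so the output can be passed to an in-memory weighted k-means for the final clustering.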


How It Works

 Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering original
– or we can just use the result directly


Parallel Speedup?

[Figure: time per point (μs) versus number of threads, showing the non-threaded baseline, the threaded version, and the perfect-scaling line]


Warning, Recursive Descent

 Inner loop requires finding nearest centroid

 With lots of centroids, this is slow

 But wait, we have classes to accelerate that!

(Let’s not use k-means searcher, though)

 Empirically, projection search beats 64-bit LSH by a bit
– More optimization may change this story


Moving to Ultra Mega Super Scale

 Map-reduce implementation nearly trivial

 Map: rough-cluster input data, output ß, weighted centroids

 Reduce:
– single reducer gets all centroids
– if too many centroids, merge using recursive clustering
– optionally do final clustering in-memory

 Combiner possible, but not important
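
The map and reduce steps can be simulated in plain Python without Hadoop. This is a rough sketch under several assumptions: the mapper keeps ß (beta here) fixed rather than adapting it, the reducer starts from the largest beta any mapper reported and multiplies it by 1.5 on every merge round, and max_centroids must be at least 1. The names rough_cluster, map_split, and reduce_all are hypothetical.

```python
import math
import random


def l2(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def rough_cluster(weighted_points, beta, rng):
    """Single pass: merge each (x, w) into a nearby centroid or start a new one."""
    centroids = []
    for x, w in weighted_points:
        if centroids:
            i = min(range(len(centroids)), key=lambda j: l2(centroids[j][0], x))
            c, cw = centroids[i]
            if rng.random() > l2(c, x) / beta:
                total = cw + w
                centroids[i] = ([(a * cw + b * w) / total for a, b in zip(c, x)], total)
                continue
        centroids.append((list(x), w))
    return centroids


def map_split(split, beta, seed):
    """Map: rough-cluster one input split; emit beta and the weighted centroids."""
    rng = random.Random(seed)
    return beta, rough_cluster([(x, 1.0) for x in split], beta, rng)


def reduce_all(mapper_outputs, max_centroids, seed=0):
    """Reduce: a single reducer gathers every centroid and merges if there are too many."""
    rng = random.Random(seed)
    beta = max(b for b, _ in mapper_outputs)
    centroids = [c for _, cs in mapper_outputs for c in cs]
    while len(centroids) > max_centroids:
        beta *= 1.5  # loosen the threshold on each merge round (assumption)
        centroids = rough_cluster(centroids, beta, rng)
    return centroids
```

A final in-memory clustering of the reducer output is optional, as the slide notes, and is left out of the sketch.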


 Contact:
– tdunning@maprtech.com
– @ted_dunning

 Slides and such:
– http://info.mapr.com/ted-boston-2012-07

Hash tags: #boston-hug #mahout #mapr

