Sdforum 11-04-2010

Apache Mahout
Thursday, November 4, 2010

Apache Mahout
Now with extra whitening and classification powers!

• Mahout intro
• Scalability in general
• Supervised learning recap
• The new SGD classifiers

Mahout?
• Hebrew for “essence”
• Hindi for a guy who drives an elephant

Mahout!
• Scalable data-mining and recommendations
• Not all data-mining
• Not the fanciest data-mining
• Just some of the scalable stuff
• Not a competitor for R or Weka

General Areas
• Recommendations
  • lots of support, lots of flexibility, production ready
• Unsupervised learning (clustering)
  • lots of options, lots of flexibility, production ready (ish)

General Areas
• Supervised learning (classification)
  • multiple architectures, fair number of options, somewhat inter-operable
  • production ready (for the right definition of production and ready)
• Large scale SVD
  • larger scale coming, beware sharp edges

Scalable?
• Scalable means
  • Time is proportional to problem size divided by resource size
  • Does not imply Hadoop or parallel

t ∝ |P| / |R|

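[Figure: wall-clock time versus number of training examples, comparing a non-scalable algorithm with a scalable algorithm. One region is labeled "Traditional datamining works here", the other "Scalable solutions required", where the scalable algorithm is marked "Mahout wins!".]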

Scalable means ...
• One unit of work requires about a unit of time
• Not like the company store (bit.ly/22XVa4)

t ∝ |P| / |R|
|P| = O(1) ⟹ t = O(1)
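
As a worked illustration of the proportionality above, the following minimal sketch estimates run time from problem size and resource size. The class name, method name, and the `secondsPerUnit` constant are hypothetical and chosen only for illustration; real systems rarely scale this cleanly.

```java
/** Hypothetical back-of-envelope model of t ∝ |P| / |R|. */
final class ScalingSketch {
    /**
     * @param problemSize    |P|, e.g. number of training examples
     * @param resources      |R|, e.g. number of cores or machines
     * @param secondsPerUnit assumed cost of one unit of work on one unit of resource
     */
    static double estimateSeconds(double problemSize, double resources, double secondsPerUnit) {
        if (resources <= 0) {
            throw new IllegalArgumentException("resources must be positive");
        }
        return secondsPerUnit * problemSize / resources;
    }
}
```

Under this model, doubling both the data and the resources leaves the estimated time unchanged, which is the sense of "scalable" used on these slides.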

[Figure: wall-clock time versus number of training examples, comparing a sequential algorithm with a parallel algorithm. One region is labeled "Sequential algorithm preferred", the other "Parallel algorithm preferred".]

Toy Example

Training Data Sample
Filled?   x coordinate   y coordinate   shape
yes       0.92           0.01           circle
no        0.30           0.41           square

Target variable: Filled?
Predictor variables: x coordinate, y coordinate, shape

What matters most?

SGD Classification
• Supervised learning of logistic regression
• Stochastic gradient descent (SGD); sequential, not parallel
• Highly optimized for high-dimensional sparse data, possibly with interactions
• Scalable, real dang fast to train
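
To make the idea concrete, here is a minimal sketch of binary logistic regression trained by SGD over sparse feature vectors. It is not Mahout's implementation: the class name, the map-based sparse vector, the fixed learning rate, and the L2 penalty are all assumptions made for illustration. Only the method names `train` and `classifyScalar` echo the API shown later in this deck.

```java
import java.util.Map;

/** Illustrative SGD logistic regression; not Mahout's OnlineLogisticRegression. */
final class SgdLogisticSketch {
    private final double[] beta;        // one weight per feature index
    private final double learningRate;  // fixed step size (assumption; Mahout adapts it)
    private final double lambda;        // L2 penalty strength (assumption)

    SgdLogisticSketch(int numFeatures, double learningRate, double lambda) {
        this.beta = new double[numFeatures];
        this.learningRate = learningRate;
        this.lambda = lambda;
    }

    /** Returns the estimated probability that the example belongs to class 1. */
    double classifyScalar(Map<Integer, Double> features) {
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : features.entrySet()) {
            dot += beta[e.getKey()] * e.getValue();
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    /** One SGD step; only the non-zero features of this example are touched. */
    void train(int actual, Map<Integer, Double> features) {
        double error = actual - classifyScalar(features);
        for (Map.Entry<Integer, Double> e : features.entrySet()) {
            int j = e.getKey();
            beta[j] += learningRate * (error * e.getValue() - lambda * beta[j]);
        }
    }
}
```

Each update costs time proportional to the number of non-zero features in the example, which is why a sequential learner of this kind can still be fast on high-dimensional sparse data.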

Supervised Learning
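[Diagram: labeled training examples (T, x1 ... xn) feed a learning algorithm, which produces a model. The model is then applied to unlabeled examples (?, x1 ... xn) to predict their targets T. The learning step is annotated "Sequential but fast"; the model application step is annotated "Stateless, parallel".]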

Small example
• On 20 newsgroups
  • converges in < 10,000 training examples (less than one pass through the data)
  • accuracy comparable to SVM, Naive Bayes, Complementary Naive Bayes
  • learning rate, regularization set automagically on held-out data

System Structure
AdaptiveLogisticRegression
    EvolutionaryProcess ep
    void train(target, features)

  1 AdaptiveLogisticRegression : 20 CrossFoldLearner

CrossFoldLearner
    OnlineLogisticRegression folds
    void train(target, tracking, features)
    double auc()

  1 CrossFoldLearner : 5 OnlineLogisticRegression

OnlineLogisticRegression
    Matrix beta
    void train(target, features)
    double classifyScalar(features)
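
The middle layer of this structure can be sketched as follows: each example is held out from one fold, which scores it, and is used to train the others. This is a simplified illustration rather than Mahout's CrossFoldLearner. The class names, the dense `double[]` vectors, the fixed learning rate, and the modulo rule for picking the held-out fold are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative cross-fold wrapper around several small online learners. */
final class CrossFoldSketch {
    /** Tiny dense logistic learner standing in for OnlineLogisticRegression. */
    static final class Learner {
        final double[] beta;
        final double rate;
        Learner(int numFeatures, double rate) {
            this.beta = new double[numFeatures];
            this.rate = rate;
        }
        double classifyScalar(double[] x) {
            double dot = 0.0;
            for (int i = 0; i < x.length; i++) dot += beta[i] * x[i];
            return 1.0 / (1.0 + Math.exp(-dot));
        }
        void train(int actual, double[] x) {
            double error = actual - classifyScalar(x);
            for (int i = 0; i < x.length; i++) beta[i] += rate * error * x[i];
        }
    }

    private final List<Learner> folds = new ArrayList<>();
    private long scored = 0;
    private long correct = 0;

    CrossFoldSketch(int numFolds, int numFeatures, double rate) {
        for (int i = 0; i < numFolds; i++) folds.add(new Learner(numFeatures, rate));
    }

    /** The tracking key decides which fold never trains on this example. */
    void train(long trackingKey, int actual, double[] x) {
        int heldOut = (int) Math.floorMod(trackingKey, (long) folds.size());
        for (int i = 0; i < folds.size(); i++) {
            Learner m = folds.get(i);
            if (i == heldOut) {
                // Score with the one model that has not seen this example.
                int predicted = m.classifyScalar(x) >= 0.5 ? 1 : 0;
                scored++;
                if (predicted == actual) correct++;
            } else {
                m.train(actual, x);
            }
        }
    }

    /** Fraction (0 to 1) of held-out predictions that were correct. */
    double percentCorrect() {
        return scored == 0 ? 0.0 : (double) correct / scored;
    }
}
```

Because every prediction is scored by a model that never trained on that example, the running accuracy is a held-out estimate that comes for free during training.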

Training API
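public interface OnlineLearner {
    // actual is the target category; instance is the encoded feature vector
    void train(int actual, Vector instance);
    // trackingKey identifies the example, e.g. for assigning it to a cross-validation fold
    void train(long trackingKey, int actual, Vector instance);
    void train(long trackingKey, String groupKey, int actual, Vector instance);
    // call when training is finished
    void close();
}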

Classification API
// Signature summary; method bodies omitted
public class AdaptiveLogisticRegression implements OnlineLearner {
    public AdaptiveLogisticRegression(int numCategories, int numFeatures,
                                      PriorFunction prior);
    public void train(int actual, Vector instance);
    public void train(long trackingKey, int actual, Vector instance);
    public void train(long trackingKey, String groupKey, int actual,
                      Vector instance);
    public void close();
    public double auc();
    public State<Wrapper> getBest();
}

// Usage: learningAlgorithm is an AdaptiveLogisticRegression that has been trained
CrossFoldLearner model = learningAlgorithm.getBest().getPayload().getLearner();
double averageCorrect = model.percentCorrect();
double averageLL = model.logLikelihood();
double p = model.classifyScalar(features);

Speed?
• Encoding API for hashed feature vectors
• String, byte[] or double interfaces
  • String allows simple parsing
  • byte[] and double allow speed
• Abstract interactions supported
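
The hashed encoding idea can be sketched as below: each (name, value) pair is hashed straight to an index in a fixed-size vector, so no dictionary is needed. This is an illustration only and not Mahout's encoder classes. The class and method names are hypothetical, and FNV-1a is used here as a simple stand-in hash function.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative hashed feature encoder; names and hash choice are assumptions. */
final class HashedEncoderSketch {
    private final int numFeatures;

    HashedEncoderSketch(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    /** Adds weight for the feature name=value at its hashed position. */
    void addToVector(String name, String value, double weight, double[] vector) {
        vector[index(name + ":" + value)] += weight;
    }

    /** Encodes an interaction of two features as a single hashed feature. */
    void addInteraction(String nameA, String valueA, String nameB, String valueB,
                        double[] vector) {
        vector[index(nameA + ":" + valueA + "&" + nameB + ":" + valueB)] += 1.0;
    }

    /** 32-bit FNV-1a hash of the key, reduced to the range [0, numFeatures). */
    int index(String key) {
        int h = 0x811c9dc5;                 // FNV offset basis
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;                // FNV prime
        }
        return Math.floorMod(h, numFeatures);
    }
}
```

Different features can collide at the same index. With a large enough vector this usually costs little accuracy, and in exchange the vector size is fixed before any data has been seen.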

Speed!
• Parsing and encoding dominate the run time of a single learner
• Moderate optimization allows 1 million training examples with 200 features to be encoded in 14 seconds on a single core
• 20 million examples with mixed text and categorical features and many interactions learned in ~ 1 hour

More Speed!
• Evolutionary optimization of learning
parameters allows simple operation
• 20x threading allows high machine use
• 20 newsgroup test completes in less time
on single node with SGD than on Hadoop
with Complementary Naive Bayes
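
A much-simplified sketch of evolutionary tuning of the learning rate follows. It is not Mahout's EvolutionaryProcess: the one-dimensional model, the population handling, the log-normal mutation, and every name here are assumptions for illustration. Each candidate rate is scored by progressive validation, meaning every example is scored before the model trains on it.

```java
import java.util.Random;

/** Illustrative evolutionary search over an SGD learning rate. */
final class EvolveRateSketch {
    /** Average log loss of 1-D logistic SGD, scoring each example before training on it. */
    static double loss(double rate, double[] xs, int[] ys) {
        double beta = 0.0;
        double total = 0.0;
        for (int i = 0; i < xs.length; i++) {
            double p = 1.0 / (1.0 + Math.exp(-beta * xs[i]));
            double clipped = Math.min(Math.max(p, 1e-9), 1.0 - 1e-9);
            total += ys[i] == 1 ? -Math.log(clipped) : -Math.log(1.0 - clipped);
            beta += rate * (ys[i] - p) * xs[i];
        }
        return total / xs.length;
    }

    /** Returns the best learning rate found across all generations. */
    static double evolve(double[] xs, int[] ys, int population, int generations, long seed) {
        Random rnd = new Random(seed);
        double[] rates = new double[population];
        for (int i = 0; i < population; i++) {
            rates[i] = Math.pow(10.0, -4.0 + 4.0 * rnd.nextDouble()); // 1e-4 up to 1
        }
        double best = rates[0];
        double bestLoss = loss(best, xs, ys);
        for (int g = 0; g < generations; g++) {
            for (double r : rates) {
                double l = loss(r, xs, ys);
                if (l < bestLoss) {
                    bestLoss = l;
                    best = r;
                }
            }
            // Next generation: random multiplicative mutations of the current best.
            for (int i = 0; i < population; i++) {
                rates[i] = best * Math.exp(0.5 * rnd.nextGaussian());
            }
        }
        return best;
    }
}
```

The candidates within a generation are independent of each other, so they could be evaluated on separate threads, which is roughly the role of the 20x threading mentioned above.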

Summary
• Mahout provides early production quality scalable data-mining
• New classification systems allow industrial scale classification

Contact Info
Ted Dunning
tdunning@maprtech.com
or tdunning@apache.com
