Apache Mahout
Thursday, November 4, 2010
Apache Mahout
Now with extra whitening and classification powers!
Thursday, November 4, 2010
• Mahout intro
• Scalability in general
• Supervised learning recap
• The new SGD classifiers
Thursday, November 4, 2010
Mahout?
• Hebrew for “essence”
• Hindi for a guy who drives an elephant
Thursday, November 4, 2010
Mahout?
• Hebrew for “essence”
• Hindi for a guy who drives an elephant
Thursday, November 4, 2010
Mahout?
• Hebrew for “essence”
• Hindi for a guy who drives an elephant
Thursday, November 4, 2010
Mahout!
• Scalable data-mining and recommendations
• Not all data-mining
• Not the fanciest data-mining
• Just some of the...
General Areas
• Recommendations
• lots of support, lots of flexibility,
production ready
• Unsupervised learning (clusterin...
General Areas
• Supervised learning (classification)
• multiple architectures, fair number of
options, somewhat inter-opera...
Scalable?
• Scalable means
• Time is proportional to problem size by
resource size
• Does not imply Hadoop or parallel
THE...
Wall
Clock
Time
# of Training Examples
Scalable Algorithm
(Mahout wins!)
Traditional
Datamining
Works here
Scalable Soluti...
Scalable means ...
• One unit of work requires about a unit of
time
• Not like the company store (bit.ly/22XVa4)
t ∝
|P|
|...
Wall
Clock
Time
# of Training Examples
Parallel Algorithm
Sequential
Algorithm
Preferred
Parallel Algorithm Preferred
Sequ...
Toy Example
Thursday, November 4, 2010
Training Data Sample
yes
no 0.92 0.01 circle
0.30 0.41 square
Filled?
x coordinate y coordinate
shape
predictor
variables
...
What matters most?
!
!
!
!
!
!
!
!
!
!
Thursday, November 4, 2010
SGD Classification
• Supervised learning of logistic regression
• Sequential gradient descent, not parallel
• Highly optimi...
Supervised Learning
T x1 ... xn
T x1 ... xn
T x1 ... xn
T x1 ... xn
T x1 ... xn
Model
Model
T
T
T
T
T
Learning
Algorithm
?...
Supervised Learning
T x1 ... xn
T x1 ... xn
T x1 ... xn
T x1 ... xn
T x1 ... xn
Model
Model
T
T
T
T
T
Learning
Algorithm
?...
Supervised Learning
T x1 ... xn
T x1 ... xn
T x1 ... xn
T x1 ... xn
T x1 ... xn
Model
Model
T
T
T
T
T
Learning
Algorithm
?...
Small example
• On 20 newsgroups
• converges in < 10,000 training examples
(less than one pass through the data)
• accurac...
System Structure
EvolutionaryProcess ep
void train(target, features)
AdaptiveLogisticRegression
20
1
OnlineLogisticRegress...
Training API
public interface OnlineLearner {
void train(int actual, Vector instance);
void train(long trackingKey, int ac...
Classification API
public class AdaptiveLogisticRegression implements OnlineLearner {
public AdaptiveLogisticRegression(int...
Speed?
• Encoding API for hashed feature vectors
• String, byte[] or double interfaces
• String allows simple parsing
• by...
Speed!
• Parsing and encoding dominate single
learner
• Moderate optimization allows 1 million
training examples with 200 ...
More Speed!
• Evolutionary optimization of learning
parameters allows simple operation
• 20x threading allows high machine...
Summary
• Mahout provides early production quality
scalable data-mining
• New classification systems allow industrial
scale...
Contact Info
Ted Dunning
tdunning@maprtech.com
Thursday, November 4, 2010
Contact Info
Ted Dunning
tdunning@maprtech.com
or tdunning@apache.com
Thursday, November 4, 2010
Upcoming SlideShare
Loading in …5
×

Sdforum 11-04-2010

8,722 views

Published on

Published in: Technology, Education
0 Comments
9 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
8,722
On SlideShare
0
From Embeds
0
Number of Embeds
276
Actions
Shares
0
Downloads
108
Comments
0
Likes
9
Embeds 0
No embeds

No notes for slide

Sdforum 11-04-2010

  1. 1. Apache Mahout Thursday, November 4, 2010
  2. 2. Apache Mahout Now with extra whitening and classification powers! Thursday, November 4, 2010
  3. 3. • Mahout intro • Scalability in general • Supervised learning recap • The new SGD classifiers Thursday, November 4, 2010
  4. 4. Mahout? • Hebrew for “essence” • Hindi for a guy who drives an elephant Thursday, November 4, 2010
  5. 5. Mahout? • Hebrew for “essence” • Hindi for a guy who drives an elephant Thursday, November 4, 2010
  6. 6. Mahout? • Hebrew for “essence” • Hindi for a guy who drives an elephant Thursday, November 4, 2010
  7. 7. Mahout! • Scalable data-mining and recommendations • Not all data-mining • Not the fanciest data-mining • Just some of the scalable stuff • Not a competitor for R or Weka Thursday, November 4, 2010
  8. 8. General Areas • Recommendations • lots of support, lots of flexibility, production ready • Unsupervised learning (clustering) • lots of options, lots of flexibility, production ready (ish) Thursday, November 4, 2010
  9. 9. General Areas • Supervised learning (classification) • multiple architectures, fair number of options, somewhat inter-operable • production ready (for the right definition of production and ready) • Large scale SVD • larger scale coming, beware sharp edges Thursday, November 4, 2010
  10. 10. Scalable? • Scalable means • Time is proportional to problem size by resource size • Does not imply Hadoop or parallel THE AUTHOR t ∝ |P| |R| Thursday, November 4, 2010
  11. 11. Wall Clock Time # of Training Examples Scalable Algorithm (Mahout wins!) Traditional Datamining Works here Scalable Solutions Required Non-scalable Algorithm Thursday, November 4, 2010
  12. 12. Scalable means ... • One unit of work requires about a unit of time • Not like the company store (bit.ly/22XVa4) t ∝ |P| |R| |P| = O(1) =⇒ t = O(1) Thursday, November 4, 2010
  13. 13. Wall Clock Time # of Training Examples Parallel Algorithm Sequential Algorithm Preferred Parallel Algorithm Preferred Sequential Algorithm Thursday, November 4, 2010
  14. 14. Toy Example Thursday, November 4, 2010
  15. 15. Training Data Sample yes no 0.92 0.01 circle 0.30 0.41 square Filled? x coordinate y coordinate shape predictor variables target variable Thursday, November 4, 2010
  16. 16. What matters most? ! ! ! ! ! ! ! ! ! ! Thursday, November 4, 2010
  17. 17. SGD Classification • Supervised learning of logistic regression • Sequential gradient descent, not parallel • Highly optimized for high dimensional sparse data, possibly with interactions • Scalable, real dang fast to train Thursday, November 4, 2010
  18. 18. Supervised Learning T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn Model Model T T T T T Learning Algorithm ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn Thursday, November 4, 2010
  19. 19. Supervised Learning T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn Model Model T T T T T Learning Algorithm ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn Sequential but fast Thursday, November 4, 2010
  20. 20. Supervised Learning T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn Model Model T T T T T Learning Algorithm ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn Sequential but fast Stateless, parallel Thursday, November 4, 2010
  21. 21. Small example • On 20 newsgroups • converges in < 10,000 training examples (less than one pass through the data) • accuracy comparable to SVM, Naive Bayes, Complementary Naive Bayes • learning rate, regularization set automagically on held-out data Thursday, November 4, 2010
  22. 22. System Structure EvolutionaryProcess ep void train(target, features) AdaptiveLogisticRegression 20 1 OnlineLogisticRegression folds void train(target, tracking, features) double auc() CrossFoldLearner 5 1 Matrix beta void train(target, features) double classifyScalar(features) OnlineLogisticRegression Thursday, November 4, 2010
  23. 23. Training API public interface OnlineLearner { void train(int actual, Vector instance); void train(long trackingKey, int actual, Vector instance); void train(long trackingKey, String groupKey, int actual, Vector instance); void close(); } Thursday, November 4, 2010
  24. 24. Classification API public class AdaptiveLogisticRegression implements OnlineLearner { public AdaptiveLogisticRegression(int numCategories, int numFeatures, PriorFunction prior); public void train(int actual, Vector instance); public void train(long trackingKey, int actual, Vector instance); public void train(long trackingKey, String groupKey, int actual, Vector instance); public void close(); public double auc(); public State<Wrapper> getBest(); } CrossFoldLearner model = learningAlgorithm.getBest().getPayload().getLearner(); double averageCorrect = model.percentCorrect(); double averageLL = model.logLikelihood(); double p = model.classifyScalar(features); Thursday, November 4, 2010
  25. 25. Speed? • Encoding API for hashed feature vectors • String, byte[] or double interfaces • String allows simple parsing • byte[] and double allows speed • Abstract interactions supported Thursday, November 4, 2010
  26. 26. Speed! • Parsing and encoding dominate single learner • Moderate optimization allows 1 million training examples with 200 features to be encoded in 14 seconds in a single core • 20 million mixed text, categorical features with many interactions learned in ~ 1 hour Thursday, November 4, 2010
  27. 27. More Speed! • Evolutionary optimization of learning parameters allows simple operation • 20x threading allows high machine use • 20 newsgroup test completes in less time on single node with SGD than on Hadoop with Complementary Naive Bayes Thursday, November 4, 2010
  28. 28. Summary • Mahout provides early production quality scalable data-mining • New classification systems allow industrial scale classification Thursday, November 4, 2010
  29. 29. Contact Info Ted Dunning tdunning@maprtech.com Thursday, November 4, 2010
  30. 30. Contact Info Ted Dunning tdunning@maprtech.com or tdunning@apache.com Thursday, November 4, 2010

×