SD Forum 11 04-2010

420 views

Published on

SD Forum November 4, 2010

Published in: Technology, Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
420
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
2
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

SD Forum 11 04-2010

  1. 1. Apache Mahout Thursday, November 4, 2010
  2. 2. Apache Mahout Now with extra whitening and classification powers! Thursday, November 4, 2010
  3. 3. • Mahout intro • Scalability in general • Supervised learning recap • The new SGD classifiers Thursday, November 4, 2010
  4. 4. Mahout? • Hebrew for “essence” • Hindi for a guy who drives an elephant Thursday, November 4, 2010
  5. 5. Mahout? • Hebrew for “essence” • Hindi for a guy who drives an elephant Thursday, November 4, 2010
  6. 6. Mahout? • Hebrew for “essence” • Hindi for a guy who drives an elephant Thursday, November 4, 2010
  7. 7. Mahout! • Scalable data-mining and recommendations • Not all data-mining • Not the fanciest data-mining • Just some of the scalable stuff • Not a competitor for R or Weka Thursday, November 4, 2010
  8. 8. General Areas • Recommendations • lots of support, lots of flexibility, production ready • Unsupervised learning (clustering) • lots of options, lots of flexibility, production ready (ish) Thursday, November 4, 2010
  9. 9. General Areas • Supervised learning (classification) • multiple architectures, fair number of options, somewhat inter-operable • production ready (for the right definition of production and ready) • Large scale SVD • larger scale coming, beware sharp edges Thursday, November 4, 2010
  10. 10. Scalable? • Scalable means • Time is proportional to problem size by resource size • Does not imply Hadoop or parallel THE AUTHOR t ∝ |P| |R| Thursday, November 4, 2010
  11. 11. Wall Clock Time # of Training Examples Scalable Algorithm (Mahout wins!) Traditional Datamining Works here Scalable Solutions Required Non-scalable Algorithm Thursday, November 4, 2010
  12. 12. Scalable means ... • One unit of work requires about a unit of time • Not like the company store (bit.ly/22XVa4) t ∝ |P| |R| |P| = O(1) =⇒ t = O(1) Thursday, November 4, 2010
  13. 13. Wall Clock Time # of Training Examples Parallel Algorithm Sequential Algorithm Preferred Parallel Algorithm Preferred Sequential Algorithm Thursday, November 4, 2010
  14. 14. Toy Example Thursday, November 4, 2010
  15. 15. Training Data Sample yes no 0.92 0.01 circle 0.30 0.41 square Filled? x coordinate y coordinate shape predictor variables target variable Thursday, November 4, 2010
  16. 16. What matters most? ! ! ! ! ! ! ! ! ! ! Thursday, November 4, 2010
  17. 17. SGD Classification • Supervised learning of logistic regression • Sequential gradient descent, not parallel • Highly optimized for high dimensional sparse data, possibly with interactions • Scalable, real dang fast to train Thursday, November 4, 2010
  18. 18. Supervised Learning T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn Model Model T T T T T Learning Algorithm ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn Thursday, November 4, 2010
  19. 19. Supervised Learning T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn Model Model T T T T T Learning Algorithm ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn Sequential but fast Thursday, November 4, 2010
  20. 20. Supervised Learning T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn Model Model T T T T T Learning Algorithm ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn Sequential but fast Stateless, parallel Thursday, November 4, 2010
  21. 21. Small example • On 20 newsgroups • converges in < 10,000 training examples (less than one pass through the data) • accuracy comparable to SVM, Naive Bayes, Complementary Naive Bayes • learning rate, regularization set automagically on held-out data Thursday, November 4, 2010
  22. 22. System Structure EvolutionaryProcess ep void train(target, features) AdaptiveLogisticRegression 20 1 OnlineLogisticRegression folds void train(target, tracking, features) double auc() CrossFoldLearner 5 1 Matrix beta void train(target, features) double classifyScalar(features) OnlineLogisticRegression Thursday, November 4, 2010
  23. 23. Training API public interface OnlineLearner { void train(int actual, Vector instance); void train(long trackingKey, int actual, Vector instance); void train(long trackingKey, String groupKey, int actual, Vector instance); void close(); } Thursday, November 4, 2010
  24. 24. Classification API public class AdaptiveLogisticRegression implements OnlineLearner { public AdaptiveLogisticRegression(int numCategories, int numFeatures, PriorFunction prior); public void train(int actual, Vector instance); public void train(long trackingKey, int actual, Vector instance); public void train(long trackingKey, String groupKey, int actual, Vector instance); public void close(); public double auc(); public State<Wrapper> getBest(); } CrossFoldLearner model = learningAlgorithm.getBest().getPayload().getLearner(); double averageCorrect = model.percentCorrect(); double averageLL = model.logLikelihood(); double p = model.classifyScalar(features); Thursday, November 4, 2010
  25. 25. Speed? • Encoding API for hashed feature vectors • String, byte[] or double interfaces • String allows simple parsing • byte[] and double allows speed • Abstract interactions supported Thursday, November 4, 2010
  26. 26. Speed! • Parsing and encoding dominate single learner • Moderate optimization allows 1 million training examples with 200 features to be encoded in 14 seconds in a single core • 20 million mixed text, categorical features with many interactions learned in ~ 1 hour Thursday, November 4, 2010
  27. 27. More Speed! • Evolutionary optimization of learning parameters allows simple operation • 20x threading allows high machine use • 20 newsgroup test completes in less time on single node with SGD than on Hadoop with Complementary Naive Bayes Thursday, November 4, 2010
  28. 28. Summary • Mahout provides early production quality scalable data-mining • New classification systems allow industrial scale classification Thursday, November 4, 2010
  29. 29. Contact Info Ted Dunning tdunning@maprtech.com Thursday, November 4, 2010
  30. 30. Contact Info Ted Dunning tdunning@maprtech.com or tdunning@apache.com Thursday, November 4, 2010

×