Sdforum 11-04-2010

  • 7,509 views
Transcript

  • 1. Apache Mahout Thursday, November 4, 2010
  • 2. Apache Mahout Now with extra whitening and classification powers!
  • 3. • Mahout intro • Scalability in general • Supervised learning recap • The new SGD classifiers
  • 4. Mahout? • Hebrew for “essence” • Hindi for a guy who drives an elephant
  • 7. Mahout! • Scalable data-mining and recommendations • Not all data-mining • Not the fanciest data-mining • Just some of the scalable stuff • Not a competitor for R or Weka
  • 8. General Areas • Recommendations • lots of support, lots of flexibility, production ready • Unsupervised learning (clustering) • lots of options, lots of flexibility, production ready (ish)
  • 9. General Areas • Supervised learning (classification) • multiple architectures, fair number of options, somewhat inter-operable • production ready (for the right definition of production and ready) • Large scale SVD • larger scale coming, beware sharp edges
  • 10. Scalable? • Scalable means t ∝ |P| / |R| • Time is proportional to problem size divided by resource size • Does not imply Hadoop or parallel
  • 11. [Chart: wall clock time vs. number of training examples. The non-scalable algorithm's time grows faster than the scalable algorithm's (Mahout wins!); traditional data mining works at small scale, scalable solutions are required at large scale.]
  • 12. Scalable means ... t ∝ |P| / |R|, so |P| / |R| = O(1) ⟹ t = O(1) • One unit of work requires about a unit of time • Not like the company store (bit.ly/22XVa4)
  • 13. [Chart: wall clock time vs. number of training examples. The sequential algorithm is preferred at small scale; the parallel algorithm is preferred at large scale.]
  • 14. Toy Example
  • 15. Training Data Sample
        x coordinate   y coordinate   shape    filled?
        0.92           0.01           circle   no
        0.30           0.41           square   yes
        (filled? is the target variable; x coordinate, y coordinate, and shape are the predictor variables)
  • 16. What matters most? [scatter plot of the sample shapes]
  • 17. SGD Classification • Supervised learning of logistic regression • Sequential gradient descent, not parallel • Highly optimized for high dimensional sparse data, possibly with interactions • Scalable, real dang fast to train
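The slide names the technique but not its mechanics. As a toy illustration (not Mahout's implementation, which adds regularization, learning-rate annealing, and lazy sparse updates), the core SGD update for logistic regression can be sketched in plain Java: for each example, compute the logistic prediction, then nudge each weight along the gradient of the log likelihood.

```java
// Toy stochastic gradient descent for binary logistic regression.
// A sketch of the update rule only; class and method names are illustrative.
public class SgdSketch {
    private final double[] beta;      // one weight per feature
    private final double learningRate;

    public SgdSketch(int numFeatures, double learningRate) {
        this.beta = new double[numFeatures];
        this.learningRate = learningRate;
    }

    // p(y = 1 | x) under the current weights
    public double classifyScalar(double[] x) {
        double dot = 0;
        for (int i = 0; i < x.length; i++) {
            dot += beta[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    // one SGD step on a single (target, features) example
    public void train(int actual, double[] x) {
        double error = actual - classifyScalar(x);
        for (int i = 0; i < x.length; i++) {
            beta[i] += learningRate * error * x[i];
        }
    }

    public static void main(String[] args) {
        SgdSketch model = new SgdSketch(2, 0.5);
        // linearly separable toy data: label is 1 when x0 dominates x1
        double[][] xs = {{1, 0}, {0, 1}, {0.9, 0.2}, {0.1, 0.8}};
        int[] ys = {1, 0, 1, 0};
        for (int pass = 0; pass < 200; pass++) {
            for (int i = 0; i < xs.length; i++) {
                model.train(ys[i], xs[i]);
            }
        }
        System.out.println(model.classifyScalar(new double[]{1, 0}) > 0.9);
        System.out.println(model.classifyScalar(new double[]{0, 1}) < 0.1);
    }
}
```

Because each step touches only one example, training cost scales linearly with the number of examples, which is what makes the sequential approach fast enough to skip parallelism.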
  • 18. Supervised Learning [diagram: labeled training examples (T, x1 ... xn) feed a learning algorithm, which produces a model; the model then labels unlabeled examples (?, x1 ... xn)]
  • 19. Supervised Learning [same diagram; the training path is annotated "Sequential but fast"]
  • 20. Supervised Learning [same diagram; the classification path is annotated "Stateless, parallel"]
  • 21. Small example • On 20 newsgroups • converges in < 10,000 training examples (less than one pass through the data) • accuracy comparable to SVM, Naive Bayes, Complementary Naive Bayes • learning rate, regularization set automagically on held-out data
  • 22. System Structure [class diagram: AdaptiveLogisticRegression — void train(target, features) — holds 1 EvolutionaryProcess (ep); the EvolutionaryProcess holds 20 CrossFoldLearners — void train(target, tracking, features), double auc(); each CrossFoldLearner holds 5 OnlineLogisticRegression folds — Matrix beta, void train(target, features), double classifyScalar(features)]
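The 1-to-5 CrossFoldLearner relationship above suggests how online cross-validation can work: each example, keyed by its tracking key, trains most folds and is scored against a held-out one, so every fold's quality estimate comes from data it never trained on. A hypothetical sketch of that routing bookkeeping (the modulo scheme and names are assumptions, not Mahout's actual code):

```java
// Toy illustration of cross-fold routing for online learning: an example
// keyed by trackingKey is held out of exactly one of k folds, so each
// fold's score (e.g. AUC) is computed only on examples it never saw.
public class FoldRouting {
    public static void main(String[] args) {
        int folds = 5;
        int[] trainCount = new int[folds];
        int[] evalCount = new int[folds];

        for (long trackingKey = 0; trackingKey < 100; trackingKey++) {
            int heldOut = (int) (trackingKey % folds);  // fold that only evaluates
            for (int fold = 0; fold < folds; fold++) {
                if (fold == heldOut) {
                    evalCount[fold]++;    // would call the fold's evaluator here
                } else {
                    trainCount[fold]++;   // would call the fold's train() here
                }
            }
        }
        // with 100 keys and 5 folds, each fold trains on 80 and evaluates on 20
        System.out.println(trainCount[0] + " " + evalCount[0]);
    }
}
```

Keying on the tracking key rather than on arrival order keeps the split stable if the same example is seen twice.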
  • 23. Training API
        public interface OnlineLearner {
          void train(int actual, Vector instance);
          void train(long trackingKey, int actual, Vector instance);
          void train(long trackingKey, String groupKey, int actual, Vector instance);
          void close();
        }
  • 24. Classification API
        public class AdaptiveLogisticRegression implements OnlineLearner {
          public AdaptiveLogisticRegression(int numCategories, int numFeatures, PriorFunction prior);
          public void train(int actual, Vector instance);
          public void train(long trackingKey, int actual, Vector instance);
          public void train(long trackingKey, String groupKey, int actual, Vector instance);
          public void close();
          public double auc();
          public State<Wrapper> getBest();
        }

        CrossFoldLearner model = learningAlgorithm.getBest().getPayload().getLearner();
        double averageCorrect = model.percentCorrect();
        double averageLL = model.logLikelihood();
        double p = model.classifyScalar(features);
  • 25. Speed? • Encoding API for hashed feature vectors • String, byte[] or double interfaces • String allows simple parsing • byte[] and double allow speed • Abstract interactions supported
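The encoding slide leans on the hashing trick, which can be sketched in a few lines (a toy version with illustrative names, not Mahout's encoder classes, which use better hash functions and multiple probes): each feature name/value pair is hashed into a slot of a fixed-width vector, so dimensionality stays bounded no matter how many distinct strings appear.

```java
// Toy hashed feature encoding ("hashing trick"): map arbitrary string
// features into a fixed-size double[] by hashing. Illustrative only.
public class HashedEncoderSketch {
    private final int numFeatures;

    public HashedEncoderSketch(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    // add one categorical feature (e.g. "shape=circle") with weight 1.0
    public void addToVector(String feature, double[] vector) {
        int slot = Math.floorMod(feature.hashCode(), numFeatures);
        vector[slot] += 1.0;
    }

    public static void main(String[] args) {
        HashedEncoderSketch enc = new HashedEncoderSketch(16);
        double[] v = new double[16];
        enc.addToVector("shape=circle", v);
        enc.addToVector("filled=no", v);
        // the vector stays 16-dimensional regardless of vocabulary size
        double sum = 0;
        for (double d : v) sum += d;
        System.out.println(v.length + " " + sum);
    }
}
```

Hashing avoids maintaining a dictionary, which is why encoding speed ends up dominated by parsing rather than by vocabulary lookups.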
  • 26. Speed! • Parsing and encoding dominate single learner cost • Moderate optimization allows 1 million training examples with 200 features to be encoded in 14 seconds on a single core • 20 million mixed text and categorical features with many interactions learned in ~1 hour
  • 27. More Speed! • Evolutionary optimization of learning parameters allows simple operation • 20x threading allows high machine use • 20 newsgroup test completes in less time on a single node with SGD than on Hadoop with Complementary Naive Bayes
  • 28. Summary • Mahout provides early production quality scalable data-mining • New classification systems allow industrial scale classification
  • 30. Contact Info Ted Dunning tdunning@maprtech.com or tdunning@apache.com