Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Like this presentation? Why not share!

No Downloads

Total views

8,722

On SlideShare

0

From Embeds

0

Number of Embeds

276

Shares

0

Downloads

108

Comments

0

Likes

9

No embeds

No notes for slide

- 1. Apache Mahout Thursday, November 4, 2010
- 2. Apache Mahout Now with extra whitening and classiﬁcation powers! Thursday, November 4, 2010
- 3. • Mahout intro • Scalability in general • Supervised learning recap • The new SGD classiﬁers Thursday, November 4, 2010
- 4. Mahout? • Hebrew for “essence” • Hindi for a guy who drives an elephant Thursday, November 4, 2010
- 5. Mahout? • Hebrew for “essence” • Hindi for a guy who drives an elephant Thursday, November 4, 2010
- 6. Mahout? • Hebrew for “essence” • Hindi for a guy who drives an elephant Thursday, November 4, 2010
- 7. Mahout! • Scalable data-mining and recommendations • Not all data-mining • Not the fanciest data-mining • Just some of the scalable stuff • Not a competitor for R or Weka Thursday, November 4, 2010
- 8. General Areas • Recommendations • lots of support, lots of ﬂexibility, production ready • Unsupervised learning (clustering) • lots of options, lots of ﬂexibility, production ready (ish) Thursday, November 4, 2010
- 9. General Areas • Supervised learning (classiﬁcation) • multiple architectures, fair number of options, somewhat inter-operable • production ready (for the right deﬁnition of production and ready) • Large scale SVD • larger scale coming, beware sharp edges Thursday, November 4, 2010
- 10. Scalable? • Scalable means • Time is proportional to problem size by resource size • Does not imply Hadoop or parallel THE AUTHOR t ∝ |P| |R| Thursday, November 4, 2010
- 11. Wall Clock Time # of Training Examples Scalable Algorithm (Mahout wins!) Traditional Datamining Works here Scalable Solutions Required Non-scalable Algorithm Thursday, November 4, 2010
- 12. Scalable means ... • One unit of work requires about a unit of time • Not like the company store (bit.ly/22XVa4) t ∝ |P| |R| |P| = O(1) =⇒ t = O(1) Thursday, November 4, 2010
- 13. Wall Clock Time # of Training Examples Parallel Algorithm Sequential Algorithm Preferred Parallel Algorithm Preferred Sequential Algorithm Thursday, November 4, 2010
- 14. Toy Example Thursday, November 4, 2010
- 15. Training Data Sample yes no 0.92 0.01 circle 0.30 0.41 square Filled? x coordinate y coordinate shape predictor variables target variable Thursday, November 4, 2010
- 16. What matters most? ! ! ! ! ! ! ! ! ! ! Thursday, November 4, 2010
- 17. SGD Classiﬁcation • Supervised learning of logistic regression • Sequential gradient descent, not parallel • Highly optimized for high dimensional sparse data, possibly with interactions • Scalable, real dang fast to train Thursday, November 4, 2010
- 18. Supervised Learning T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn Model Model T T T T T Learning Algorithm ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn Thursday, November 4, 2010
- 19. Supervised Learning T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn Model Model T T T T T Learning Algorithm ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn Sequential but fast Thursday, November 4, 2010
- 20. Supervised Learning T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn Model Model T T T T T Learning Algorithm ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn Sequential but fast Stateless, parallel Thursday, November 4, 2010
- 21. Small example • On 20 newsgroups • converges in < 10,000 training examples (less than one pass through the data) • accuracy comparable to SVM, Naive Bayes, Complementary Naive Bayes • learning rate, regularization set automagically on held-out data Thursday, November 4, 2010
- 22. System Structure EvolutionaryProcess ep void train(target, features) AdaptiveLogisticRegression 20 1 OnlineLogisticRegression folds void train(target, tracking, features) double auc() CrossFoldLearner 5 1 Matrix beta void train(target, features) double classifyScalar(features) OnlineLogisticRegression Thursday, November 4, 2010
- 23. Training API public interface OnlineLearner { void train(int actual, Vector instance); void train(long trackingKey, int actual, Vector instance); void train(long trackingKey, String groupKey, int actual, Vector instance); void close(); } Thursday, November 4, 2010
- 24. Classiﬁcation API public class AdaptiveLogisticRegression implements OnlineLearner { public AdaptiveLogisticRegression(int numCategories, int numFeatures, PriorFunction prior); public void train(int actual, Vector instance); public void train(long trackingKey, int actual, Vector instance); public void train(long trackingKey, String groupKey, int actual, Vector instance); public void close(); public double auc(); public State<Wrapper> getBest(); } CrossFoldLearner model = learningAlgorithm.getBest().getPayload().getLearner(); double averageCorrect = model.percentCorrect(); double averageLL = model.logLikelihood(); double p = model.classifyScalar(features); Thursday, November 4, 2010
- 25. Speed? • Encoding API for hashed feature vectors • String, byte[] or double interfaces • String allows simple parsing • byte[] and double allows speed • Abstract interactions supported Thursday, November 4, 2010
- 26. Speed! • Parsing and encoding dominate single learner • Moderate optimization allows 1 million training examples with 200 features to be encoded in 14 seconds in a single core • 20 million mixed text, categorical features with many interactions learned in ~ 1 hour Thursday, November 4, 2010
- 27. More Speed! • Evolutionary optimization of learning parameters allows simple operation • 20x threading allows high machine use • 20 newsgroup test completes in less time on single node with SGD than on Hadoop with Complementary Naive Bayes Thursday, November 4, 2010
- 28. Summary • Mahout provides early production quality scalable data-mining • New classiﬁcation systems allow industrial scale classiﬁcation Thursday, November 4, 2010
- 29. Contact Info Ted Dunning tdunning@maprtech.com Thursday, November 4, 2010
- 30. Contact Info Ted Dunning tdunning@maprtech.com or tdunning@apache.com Thursday, November 4, 2010

No public clipboards found for this slide

×
### Save the most important slides with Clipping

Clipping is a handy way to collect and organize the most important slides from a presentation. You can keep your great finds in clipboards organized around topics.

Be the first to comment