MAHOUT classifier tour

  1. Mahout (Wednesday, March 16, 2011)
  2. Mahout: Scalable Data Mining for Everybody
  3. What is Mahout
     • Recommendations (people who x this also x that)
     • Clustering (segment data into groups of)
     • Classification (learn decision making from examples)
     • Stuff (LDA, SVD, frequent item-set, math)
  4. What is Mahout?
     • Recommendations (people who x this also x that)
     • Clustering (segment data into groups of)
     • Classification (learn decision making from examples)
     • Stuff (LDA, SVM, frequent item-set, math)
  5. Classification in Detail
     • Naive Bayes Family
       • Hadoop based training
     • Decision Forests
       • Hadoop based training
     • Logistic Regression (aka SGD)
       • fast on-line (sequential) training
  11. So What?
      • big starts here
      • Online training has low overhead for small and moderate size data-sets
  12. An Example
  19. And Another
      From: Dr. Paul Acquah
      Dear Sir,
      Re: Proposal for over-invoice Contract Benevolence
      Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur thirty-three million one hundred thousand euros, only) into a foreign companys bank account for our favor. ...
  20. And Another
      Date: Thu, May 20, 2010 at 10:51 AM
      From: George <george@fumble-tech.com>
      Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
  23. Mahout’s SGD
      • Learns on-line per example
        • O(1) memory
        • O(1) time per training example
      • Sequential implementation
        • fast, but not parallel
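The per-example update behind these bullets can be illustrated with a minimal sketch. This is not Mahout's OnlineLogisticRegression: the class name, the fixed learning rate, and the dense `double[]` feature vector are assumptions made for illustration. It only shows why memory stays fixed and why the cost of one update does not depend on how many examples have been seen.

```java
// Minimal sketch of online logistic regression trained by SGD.
// NOT Mahout's implementation; all names here are hypothetical.
final class OnlineSgdSketch {
    private final double[] weights;     // fixed size: memory does not grow with examples seen
    private final double learningRate;  // assumed constant here; no annealing

    OnlineSgdSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** Returns the estimated probability that x belongs to class 1. */
    double classifyScalar(double[] x) {
        double dot = 0.0;
        for (int i = 0; i < weights.length; i++) {
            dot += weights[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    /** One SGD step on one example; the cost depends only on the feature count. */
    void train(int label, double[] x) {
        double error = label - classifyScalar(x);
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * error * x[i];
        }
    }
}
```

Each call to `train` touches one example and then discards it, which is what makes the learner sequential rather than parallel.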
  24. Special Features
      • Hashed feature encoding
      • Per-term annealing
        • learn the boring stuff once
      • Auto-magical learning knob turning
        • learns correct learning rate, learns correct learning rate for learning learning rate, ...
  25. Feature Encoding
  27. Hashed Encoding
  28. Feature Collisions
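The slides on encoding survive only as titles, so the following is a generic sketch of hashed feature encoding rather than a reconstruction of what the slides showed. The class name, the use of `String.hashCode`, and the `probes` parameter (hashing each feature to more than one slot so that a single collision does less damage) are all assumptions for illustration, not Mahout's encoder API.

```java
// Generic sketch of hashed feature encoding into a fixed-size vector.
// NOT Mahout's encoder classes; names and the hash function are hypothetical.
final class HashedEncoderSketch {
    private final int numFeatures;  // size of the output vector, fixed up front
    private final int probes;       // number of slots each feature is hashed to

    HashedEncoderSketch(int numFeatures, int probes) {
        this.numFeatures = numFeatures;
        this.probes = probes;
    }

    /** Adds weight to every slot that the (name, word) pair hashes to. */
    void addToVector(String name, String word, double weight, double[] vector) {
        for (int probe = 0; probe < probes; probe++) {
            int hash = (name + ":" + word + "#" + probe).hashCode();
            // floorMod keeps the index non-negative even for negative hash codes.
            int index = Math.floorMod(hash, numFeatures);
            vector[index] += weight;
        }
    }
}
```

No dictionary is needed, so the vector size is known before any data is read. The price is that two different features can land in the same slot, which is the collision the slide title refers to.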
  29. Learning Rate Annealing (plot: learning rate vs. # training examples seen)
  30. Per-term Annealing (plot: learning rate vs. # training examples seen)
  31. Per-term Annealing: Common Feature (plot: learning rate vs. # training examples seen)
  32. Per-term Annealing: Rare Feature (plot: learning rate vs. # training examples seen)
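The plots themselves are not recoverable, but the idea of per-term annealing can be sketched under assumptions. Here each feature keeps its own update count and its learning rate decays with that count, so a common feature cools quickly while a rare one keeps a high rate. The inverse-square-root schedule and all names are hypothetical; the source does not give Mahout's actual schedule.

```java
// Sketch of per-term (per-feature) learning rate annealing.
// The decay schedule baseRate / sqrt(1 + count) is an assumption for illustration.
final class PerTermAnnealingSketch {
    private final double baseRate;
    private final int[] updateCounts;  // how often each feature has been non-zero

    PerTermAnnealingSketch(int numFeatures, double baseRate) {
        this.baseRate = baseRate;
        this.updateCounts = new int[numFeatures];
    }

    /** Learning rate for feature j; shrinks as that feature is seen more often. */
    double rateFor(int j) {
        return baseRate / Math.sqrt(1.0 + updateCounts[j]);
    }

    /** Records one training example: only its non-zero features are counted. */
    void recordExample(double[] x) {
        for (int j = 0; j < x.length; j++) {
            if (x[j] != 0.0) {
                updateCounts[j]++;
            }
        }
    }
}
```

With a single global schedule, a feature that first appears late in training would meet an already tiny learning rate. Keeping the count per feature avoids that.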
  33. General Structure
      • OnlineLogisticRegression
        • Traditional logistic regression
        • Stochastic Gradient Descent
        • Per term annealing
        • Too fast (for the disk + encoder)
  34. Next Level
      • CrossFoldLearner
        • contains multiple primitive learners
        • online cross validation
        • 5x more work
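One way to read "online cross validation" is sketched below, assuming five folds to match the "5x more work" bullet. Each incoming example is scored by exactly one learner that does not train on it, and trains all the others. The class names, the round-robin fold assignment, and the tiny inner learner are hypothetical stand-ins, not Mahout's CrossFoldLearner.

```java
// Sketch of online cross-validation over several primitive learners.
// NOT Mahout's CrossFoldLearner; all names and the fold rule are assumptions.
final class CrossFoldSketch {
    /** Tiny stand-in for a primitive learner: logistic regression with fixed-rate SGD. */
    static final class Learner {
        private final double[] w;
        Learner(int n) { w = new double[n]; }
        double score(double[] x) {
            double dot = 0.0;
            for (int i = 0; i < w.length; i++) { dot += w[i] * x[i]; }
            return 1.0 / (1.0 + Math.exp(-dot));
        }
        void train(int label, double[] x) {
            double error = label - score(x);
            for (int i = 0; i < w.length; i++) { w[i] += 0.5 * error * x[i]; }
        }
    }

    private final Learner[] folds;
    private long seen;
    private double logLikelihoodSum;  // accumulated on held-out examples only

    CrossFoldSketch(int numFolds, int numFeatures) {
        folds = new Learner[numFolds];
        for (int i = 0; i < numFolds; i++) { folds[i] = new Learner(numFeatures); }
    }

    /** Scores the example with one held-out learner and trains the rest on it. */
    void train(int label, double[] x) {
        int heldOut = (int) (seen % folds.length);
        for (int i = 0; i < folds.length; i++) {
            if (i == heldOut) {
                double p = folds[i].score(x);
                p = Math.min(Math.max(p, 1e-12), 1.0 - 1e-12);  // avoid log(0)
                logLikelihoodSum += Math.log(label == 1 ? p : 1.0 - p);
            } else {
                folds[i].train(label, x);
            }
        }
        seen++;
    }

    /** Average held-out log-likelihood so far (0 if nothing has been seen). */
    double heldOutLogLikelihood() {
        return seen == 0 ? 0.0 : logLikelihoodSum / seen;
    }
}
```

Because the held-out score is updated as data streams past, a quality estimate is available at any point without a separate validation pass over the data.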
  35. And again
      • AdaptiveLogisticRegression
        • 20 x CrossFoldLearner
        • evolves good learning and regularization rates
        • 100 x more work than basic learner
        • still faster than disk + encoding
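The slide does not say how the rates are evolved, so the following is only a generic sketch of one evolutionary step over a population of learning rates: keep the fitter half, replace the rest with randomly perturbed copies of the survivors. The class name, the log-normal mutation, and the caller-supplied `fitness` function (which in the slide's setting would presumably come from each learner's held-out score) are assumptions, and regularization rates are left out for brevity.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

// Generic sketch of evolving a population of learning rates.
// NOT Mahout's AdaptiveLogisticRegression; the selection and mutation rules are assumptions.
final class RateEvolutionSketch {
    /**
     * Runs one generation. The fitter half of the population is kept unchanged
     * (best first); each remaining slot gets a mutated copy of a survivor.
     */
    static double[] evolve(double[] rates, DoubleUnaryOperator fitness, Random rng) {
        Double[] ranked = Arrays.stream(rates).boxed().toArray(Double[]::new);
        // Sort by fitness, highest first.
        Arrays.sort(ranked, (a, b) ->
                Double.compare(fitness.applyAsDouble(b), fitness.applyAsDouble(a)));
        int survivors = Math.max(1, ranked.length / 2);
        double[] next = new double[ranked.length];
        for (int i = 0; i < ranked.length; i++) {
            if (i < survivors) {
                next[i] = ranked[i];
            } else {
                // Multiplicative mutation keeps the rate positive.
                next[i] = ranked[i % survivors] * Math.exp(0.5 * rng.nextGaussian());
            }
        }
        return next;
    }
}
```

Because survivors are carried over unchanged, the best candidate found so far is never lost between generations.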
  36. A comparison
      • Traditional view
        • 400 x (read + OLR)
      • Revised Mahout view
        • 1 x (read + mu x 100 x OLR) x eta
        • mu = efficiency from killing losers early
        • eta = efficiency from stopping early
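The two cost expressions on this slide can be written out symbolically. The symbols $R$ (cost of one pass of reading and encoding the data) and $L$ (cost of one OnlineLogisticRegression training pass) are names introduced here for illustration; $\mu$ and $\eta$ are the slide's efficiency factors, assumed to lie between 0 and 1.

```latex
\begin{align*}
C_{\text{traditional}} &= 400\,(R + L) \\
C_{\text{Mahout}}      &= \eta\,\bigl(R + 100\,\mu\,L\bigr)
\end{align*}
```

Under the further assumption that reading dominates, that is $R \gg 100\,\mu\,L$ (consistent with the earlier remark that the learner is faster than disk plus encoding), the revised cost is roughly $\eta R$ against $400\,R$ for the traditional view.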
  37. Deployment
      • Training
        • ModelSerializer.writeBinary(..., model)
      • Deployment
        • m = ModelSerializer.readBinary(...)
        • r = m.classifyScalar(featureVector)
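The train, write, read, classify round trip on this slide can be mimicked with the standard library alone. The sketch below is a stand-in, not Mahout's ModelSerializer: the class name, the byte layout (an int count followed by doubles), and the static `classifyScalar` helper are all hypothetical, and the slide's elided arguments are not filled in.

```java
import java.nio.ByteBuffer;

// Stand-in for the write/read/classify round trip; NOT Mahout's ModelSerializer.
// The binary layout here (int length, then one double per weight) is an assumption.
final class ModelRoundTripSketch {
    /** Training side: turn the learned weights into a compact binary blob. */
    static byte[] toBytes(double[] weights) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + Double.BYTES * weights.length);
        buf.putInt(weights.length);
        for (double w : weights) {
            buf.putDouble(w);
        }
        return buf.array();
    }

    /** Deployment side: rebuild the weights from the blob. */
    static double[] fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        double[] weights = new double[buf.getInt()];
        for (int i = 0; i < weights.length; i++) {
            weights[i] = buf.getDouble();
        }
        return weights;
    }

    /** Scores one feature vector with the restored weights (logistic model). */
    static double classifyScalar(double[] weights, double[] featureVector) {
        double dot = 0.0;
        for (int i = 0; i < weights.length; i++) {
            dot += weights[i] * featureVector[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }
}
```

The point of the pattern is that the serving process needs only the blob and a scoring function, not the training code or the training data.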
  38. The Upshot
      • One machine can go fast
        • SITM trains on 2 billion examples in 3 hours
      • Deployability pays off big
        • simple sample server farm