MAHOUT classifier tour

  1. Mahout (Wednesday, March 16, 2011)
  2. Mahout: Scalable Data Mining for Everybody
  3. What is Mahout?
     • Recommendations (people who x this also x that)
     • Clustering (segment data into groups of)
     • Classification (learn decision making from examples)
     • Stuff (LDA, SVD, frequent item-set, math)
  5. Classification in Detail
     • Naive Bayes Family
       • Hadoop based training
     • Decision Forests
       • Hadoop based training
     • Logistic Regression (aka SGD)
       • fast on-line (sequential) training
  11. So What? Online training has low overhead for small and moderate size data-sets. [Figure annotation: "big starts here"]
  12. An Example
  19. And Another
      From: Dr. Paul Acquah
      Dear Sir,
      Re: Proposal for over-invoice Contract Benevolence
      Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur (thirty-three million one hundred thousand euros, only) into a foreign companys bank account for our favor. ...
  20. And Another
      Date: Thu, May 20, 2010 at 10:51 AM
      From: George <george@fumble-tech.com>
      Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
  23. Mahout’s SGD
      • Learns on-line per example
        • O(1) memory
        • O(1) time per training example
      • Sequential implementation
        • fast, but not parallel
  24. Special Features
      • Hashed feature encoding
      • Per-term annealing
        • learn the boring stuff once
      • Auto-magical learning knob turning
        • learns correct learning rate, learns correct learning rate for learning learning rate, ...
  25. Feature Encoding
  27. Hashed Encoding
  28. Feature Collisions
  29. Learning Rate Annealing [Plot: learning rate vs. # training examples seen]
  30. Per-term Annealing [Plot: learning rate vs. # training examples seen; successive builds add the labels "Common Feature" and "Rare Feature"]
  33. General Structure
      • OnlineLogisticRegression
        • Traditional logistic regression
        • Stochastic Gradient Descent
        • Per term annealing
        • Too fast (for the disk + encoder)
  34. Next Level
      • CrossFoldLearner
        • contains multiple primitive learners
        • online cross validation
        • 5x more work
  35. And again
      • AdaptiveLogisticRegression
        • 20 x CrossFoldLearner
        • evolves good learning and regularization rates
        • 100 x more work than basic learner
        • still faster than disk + encoding
  36. A comparison
      • Traditional view
        • 400 x (read + OLR)
      • Revised Mahout view
        • 1 x (read + mu x 100 x OLR) x eta
        • mu = efficiency from killing losers early
        • eta = efficiency from stopping early
  37. Deployment
      • Training
        • ModelSerializer.writeBinary(..., model)
      • Deployment
        • m = ModelSerializer.readBinary(...)
        • r = m.classifyScalar(featureVector)
  38. The Upshot
      • One machine can go fast
        • SITM trains on 2 billion examples in 3 hours
      • Deployability pays off big
        • simple sample server farm
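
Editor's sketch: the slides on Mahout's SGD, hashed feature encoding, and per-term annealing describe the ideas without showing them together. The plain-Java class below is one minimal way those pieces might fit, and it is not Mahout code. The class name HashedSgdSketch, the slot method, the baseRate parameter, and the 1/sqrt(count) decay schedule are all assumptions made for illustration; Mahout's OnlineLogisticRegression and its feature encoders have their own APIs and schedules.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: tiny online logistic regression over hashed features. Not Mahout code. */
public class HashedSgdSketch {
    private final double[] weights; // fixed size, so memory does not grow with examples seen
    private final int[] updates;    // per-slot update counts, used for per-term annealing
    private final double baseRate;  // hypothetical initial learning rate

    public HashedSgdSketch(int numSlots, double baseRate) {
        this.weights = new double[numSlots];
        this.updates = new int[numSlots];
        this.baseRate = baseRate;
    }

    /** Hashed encoding: map a token directly to a slot without a dictionary. Collisions can occur. */
    public int slot(String token) {
        return Math.floorMod(token.hashCode(), weights.length);
    }

    /** Estimated probability of the positive class for a bag of tokens. */
    public double classify(List<String> tokens) {
        double sum = 0.0;
        for (String t : tokens) {
            sum += weights[slot(t)];
        }
        return 1.0 / (1.0 + Math.exp(-sum));
    }

    /** One SGD step on a single example; label must be 0 or 1. */
    public void train(int label, List<String> tokens) {
        double error = label - classify(tokens);
        for (String t : tokens) {
            int i = slot(t);
            // Per-term annealing (assumed schedule): a slot's step size shrinks with its own
            // update count, so common features settle early while rare ones keep learning.
            double rate = baseRate / Math.sqrt(1.0 + updates[i]);
            weights[i] += rate * error;
            updates[i]++;
        }
    }

    public static void main(String[] args) {
        HashedSgdSketch model = new HashedSgdSketch(1 << 10, 0.5);
        List<String> spamLike = Arrays.asList("confidential", "business", "deal", "transfer", "euros");
        List<String> hamLike = Arrays.asList("lunch", "tomorrow", "hadoop", "user", "group");
        for (int pass = 0; pass < 50; pass++) {
            model.train(1, spamLike); // one example at a time, constant work per example
            model.train(0, hamLike);
        }
        System.out.println("score for spam-like tokens: " + model.classify(spamLike));
        System.out.println("score for ham-like tokens:  " + model.classify(hamLike));
    }
}
```

Each call to train touches only the slots of the tokens in that example, which is what keeps time and memory per example constant. The price of skipping the dictionary is that two tokens can share a slot, the situation the "Feature Collisions" slide refers to.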
