MAHOUT classifier tour
Presentation Transcript

  • Mahout: Scalable Data Mining for Everybody (Wednesday, March 16, 2011)
  • What is Mahout?
      • Recommendations (people who x this also x that)
      • Clustering (segment data into groups)
      • Classification (learn decision making from examples)
      • Stuff (LDA, SVD, frequent item-set, math)
  • Classification in Detail
      • Naive Bayes Family
          • Hadoop based training
      • Decision Forests
          • Hadoop based training
      • Logistic Regression (aka SGD)
          • fast on-line (sequential) training
  • So What? Online training has low overhead for small and moderate size data-sets [figure annotation: "big starts here"]
  • An Example
  • And Another
    From: Dr. Paul Acquah
    Dear Sir, Re: Proposal for over-invoice Contract Benevolence
    Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur thirty-three million one hundred thousand euros, only) into a foreign companys bank account for our favor. ...
  • And Another
    Date: Thu, May 20, 2010 at 10:51 AM
    From: George <george@fumble-tech.com>
    Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
  • Mahout’s SGD
      • Learns on-line per example
      • O(1) memory
      • O(1) time per training example
      • Sequential implementation
          • fast, but not parallel
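The on-line, constant-memory behavior listed on this slide can be illustrated with a minimal sketch. This is not Mahout's code: the class name `TinySgdLogistic`, the fixed learning rate, and the toy features are all assumptions made for illustration. The weight vector has a fixed size, and each training step touches only the non-zero features of one example.

```python
import math


class TinySgdLogistic:
    """Binary logistic regression trained one example at a time (hypothetical sketch)."""

    def __init__(self, num_features, learning_rate=0.1):
        # Fixed-size state: memory does not grow with the number of examples seen.
        self.weights = [0.0] * num_features
        self.learning_rate = learning_rate

    def predict(self, features):
        """features is a sparse dict {index: value}; returns P(label == 1)."""
        score = sum(self.weights[i] * v for i, v in features.items())
        score = max(-30.0, min(30.0, score))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-score))

    def train(self, features, label):
        """One stochastic gradient step; cost depends only on this example's features."""
        error = label - self.predict(features)
        for i, v in features.items():
            self.weights[i] += self.learning_rate * error * v


# Toy stream: feature 0 is shared, feature 1 marks positives, feature 2 marks negatives.
model = TinySgdLogistic(num_features=4)
for _ in range(200):
    model.train({0: 1.0, 1: 1.0}, 1)
    model.train({0: 1.0, 2: 1.0}, 0)
```

Because examples are consumed one at a time, the same loop works whether the stream holds a few hundred examples or far more; only the running time changes.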
  • Special Features
      • Hashed feature encoding
      • Per-term annealing
          • learn the boring stuff once
      • Auto-magical learning knob turning
          • learns correct learning rate, learns correct learning rate for learning learning rate, ...
  • Feature Encoding
  • Hashed Encoding
  • Feature Collisions
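The slides on hashed encoding and feature collisions survive only as titles, so the following is a generic sketch of the hashing trick rather than Mahout's encoder. The function name `hashed_encode`, the use of CRC32, and the bucket count are assumptions. Each token is hashed to an index in a fixed-length vector, so no dictionary is needed, and distinct tokens that land on the same index collide and share a weight.

```python
import zlib


def hashed_encode(tokens, num_buckets=16):
    """Map tokens into a fixed-length count vector by hashing (hypothetical sketch)."""
    vector = [0.0] * num_buckets
    for token in tokens:
        # CRC32 is used only because it is deterministic across runs.
        index = zlib.crc32(token.encode("utf-8")) % num_buckets
        vector[index] += 1.0
    return vector


# The vector length is fixed no matter how large the vocabulary grows.
email = hashed_encode(["hadoop", "user", "group", "lunch", "hadoop"])

# With more distinct tokens than buckets, at least two tokens must collide.
crowded = hashed_encode([str(i) for i in range(17)], num_buckets=16)
```

A larger vector makes collisions rarer at the cost of more memory; the trade-off is chosen up front because the vector size never changes during training.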
  • Learning Rate Annealing [plot: learning rate vs. # training examples seen]
  • Per-term Annealing [plot: learning rate vs. # training examples seen, with curves labeled "Common Feature" and "Rare Feature"]
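The idea behind the plot can be sketched as follows: each feature keeps its own update count, so the learning rate for a common feature decays quickly while a rare feature keeps a high rate until it has actually been seen. The class name `PerTermAnnealer` and the `1 / sqrt(1 + count)` schedule are assumptions for illustration and are not taken from the slides.

```python
import math
from collections import defaultdict


class PerTermAnnealer:
    """Tracks a separate, decaying learning rate per feature (hypothetical sketch)."""

    def __init__(self, base_rate=1.0):
        self.base_rate = base_rate
        self.update_counts = defaultdict(int)

    def rate(self, feature):
        """Rate shrinks with how often this feature, not the whole stream, was seen."""
        count = self.update_counts.get(feature, 0)
        return self.base_rate / math.sqrt(1 + count)

    def observe(self, features):
        """Record one training example containing the given features."""
        for feature in features:
            self.update_counts[feature] += 1


annealer = PerTermAnnealer()
for _ in range(99):
    annealer.observe(["the"])      # common feature, seen in every example

common_rate = annealer.rate("the")     # decayed after 99 updates
rare_rate = annealer.rate("hadoop")    # never seen, still at the base rate
```

With a single global schedule, a rare feature first seen late in training would get almost no weight update; the per-feature count avoids that.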
  • General Structure
      • OnlineLogisticRegression
          • Traditional logistic regression
          • Stochastic Gradient Descent
          • Per term annealing
          • Too fast (for the disk + encoder)
  • Next Level
      • CrossFoldLearner
          • contains multiple primitive learners
          • online cross validation
          • 5x more work
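One way to read "online cross validation" is sketched below; it is an illustration under stated assumptions, not Mahout's `CrossFoldLearner`. Several primitive learners run side by side. For each incoming example, one learner is held out and scored on it before ever seeing it, and the remaining learners train on it. The names `CrossFoldSketch` and `RunningMeanLearner`, the round-robin fold assignment, and the squared-error metric are all hypothetical.

```python
class RunningMeanLearner:
    """Trivial stand-in learner: predicts the mean label seen so far."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def predict(self, features):
        return self.total / self.count if self.count else 0.5

    def train(self, features, label):
        self.total += label
        self.count += 1


class CrossFoldSketch:
    """Online cross validation over several primitive learners (hypothetical sketch)."""

    def __init__(self, make_learner, folds=5):
        self.learners = [make_learner() for _ in range(folds)]
        self.seen = 0
        self.squared_error = 0.0

    def train(self, features, label):
        held_out = self.seen % len(self.learners)  # round-robin fold assignment
        for k, learner in enumerate(self.learners):
            if k == held_out:
                # Score the held-out learner on data it has never trained on.
                error = label - learner.predict(features)
                self.squared_error += error * error
            else:
                learner.train(features, label)
        self.seen += 1

    def mean_squared_error(self):
        return self.squared_error / self.seen if self.seen else 0.0


cv = CrossFoldSketch(RunningMeanLearner, folds=5)
for _ in range(10):
    cv.train({}, 1)
```

Every example is handled by all five learners, which matches the roughly 5x cost noted on the slide, and a held-out error estimate is available at any point in the stream without a separate validation pass.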
  • And again
      • AdaptiveLogisticRegression
          • 20 x CrossFoldLearner
          • evolves good learning and regularization rates
          • 100 x more work than basic learner
          • still faster than disk + encoding
  • A comparison
      • Traditional view
          • 400 x (read + OLR)
      • Revised Mahout view
          • 1 x (read + mu x 100 x OLR) x eta
          • mu = efficiency from killing losers early
          • eta = efficiency from stopping early
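The two cost expressions on this slide can be written out directly. The function names are hypothetical and the input values are made-up placeholders, not measurements from the talk; `read` and `olr` stand for the per-pass cost of reading the data and of running one OnlineLogisticRegression learner, in whatever unit is convenient.

```python
def traditional_cost(read, olr, runs=400):
    """Traditional view: every run re-reads the data and trains one learner."""
    return runs * (read + olr)


def adaptive_cost(read, olr, mu, eta, learners=100):
    """Revised view: one read feeds many learners; mu and eta scale the work down."""
    return (read + mu * learners * olr) * eta


# Placeholder numbers only: reading is assumed 10x the cost of one learner.
slow = traditional_cost(read=10.0, olr=1.0)
fast = adaptive_cost(read=10.0, olr=1.0, mu=0.5, eta=0.5)
```

The main saving comes from reading the data once instead of once per run, so the gap widens as reading gets more expensive relative to the learner itself.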
  • Deployment
      • Training
          • ModelSerializer.writeBinary(..., model)
      • Deployment
          • m = ModelSerializer.readBinary(...)
          • r = m.classifyScalar(featureVector)
  • The Upshot
      • One machine can go fast
          • SITM trains on 2 billion examples in 3 hours
      • Deployability pays off big
          • simple sample server farm