Machine Learning with Hadoop
 


Sangchul Song and Thu Kyaw discuss machine learning at AOL, and the challenges and solutions they encountered when trying to train a large number of machine learning models using Hadoop. Algorithms including SVM and packages like Mahout are discussed. Finally, they discuss their analytics pipeline, which includes some custom components used to interoperate with a range of machine learning libraries, as well as integration with the query language Pig.



    Machine Learning with Hadoop: Presentation Transcript

    • Training on a pluggable machine learning platform
      Machine Learning on Hadoop at Huffington Post | AOL
    • A Little Bit about Us
      Core Services Team at HPMG | AOL
      Thu Kyaw (thu.kyaw@teamaol.com)
      Principal Software Engineer
      Worked on machine learning, data mining, and natural language processing
      Sang Chul Song, Ph.D. (sangchul.song@teamaol.com)
      Senior Software Engineer
      Worked on data intensive computing – data archiving / information retrieval
    • Machine Learning: Supervised Classification
      1. Learning Phase: labeled training documents are used to train a model.
      2. Classifying Phase: the trained model classifies new text (e.g. “capital gains to be taxed …”) into a label such as “Business”, “Entertainment”, or “Politics”.
    • Two Machine Learning Use Cases at HuffPost | AOL
      Comment Moderation
      Evaluate All New HuffPost User Comments Every Day
      Identify Abusive / Aggressive Comments
      Auto Delete / Publish ~25% of Comments Every Day
      Article Classification
      Tag Articles for Advertising
      E.g.: scary, salacious, …
    • Our Classification Tasks
      Comment Moderation: label each comment abusive or non-abusive.
      Article Classification: tag each article with labels such as scary or sexy.
    • In Order to Meet Our Needs, We Require…
      Support for important algorithms, including
      SVM
      Perceptron / Winnow
      Bayesian
      Decision Tree
      AdaBoost …
      Ability to build tons of models on a regular basis, and pick the best
      Because, in general, it’s difficult to know in advance what algorithm / parameter set will work best
    • However,
      N algorithms, K parameters each, L values per parameter → N × L^K combinations, which is often too many to train sequentially.
      For example, N = 5, K = 5, L = 10 → 5 × 10^5 = 500K combinations.
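The combinatorics above can be checked directly. A minimal illustrative sketch (not from the deck): enumerate one algorithm's parameter grid with `itertools.product`, giving L^K parameter sets per algorithm and N × L^K overall.

```python
# Parameter-grid size: N algorithms, K parameters each, L values per parameter.
from itertools import product

N, K, L = 5, 5, 10
total = N * L**K          # 5 * 10^5 = 500,000 combinations in total

# One algorithm's grid: every K-tuple of candidate values (names hypothetical).
values = range(L)
grid = list(product(values, repeat=K))   # L^K = 100,000 parameter sets
```

This is why sequential training is infeasible and each parameter set is farmed out as an independent Hadoop task.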
    • So, we parallelize on Hadoop
      Good news:
      Mahout, a parallel machine learning tool, is already available.
      Packages like Mallet, libsvm, and Weka support the necessary algorithms.
      Bad news:
      Mahout doesn’t support all the necessary algorithms yet.
      The other packages do not run natively on Hadoop.
    • Therefore, we do…
      We build a flexible ML platform running on Hadoop that supports a wide range of algorithms, leveraging publicly available implementations.
      On top of our platform, we generate / test hundreds of thousands of models, and choose the best.
      We use Pig for the Hadoop implementation.
    • Our Approach
      CONVENTIONAL: a train request runs sequentially over the training data and returns a single model.
      OUR APPROACH: more algorithms (AdaBoost, SVM, Decision Tree, Bayesian, and a lot of others), thus better models, and faster parallel processing: 1000s of models are trained (one for each parameter set) and the best model is selected.
    • What Parallelization?
      Many independent training tasks, one per parameter set, run in parallel.
    • General Processing Flow
      TrainingDocs → Preprocess → VectorizedDocs → Train → Model
      Preprocess Parameters: stopword use, n-gram size, stemming, etc.
      Train Parameters: algorithm and algorithm-specific parameters (e.g. for SVM: C, ε, and other kernel parameters)
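The preprocess step above (stopword removal, stemming, n-gram extraction) can be sketched as follows. This is an illustrative toy, not the deck's code: `STOPWORDS` and `naive_stem` are hypothetical stand-ins for a real stopword list and a Snowball/Porter stemmer.

```python
# Toy preprocessing pipeline: tokenize -> stopword filter -> stem -> n-grams.
STOPWORDS = {"are", "as", "to", "be", "the"}   # illustrative stopword list

def naive_stem(tok):
    # toy stand-in for a Snowball/Porter stemmer: strip a trailing "s"
    return tok[:-1] if tok.endswith("s") else tok

def preprocess(text, ngram=1, stem=True, stopwords=True):
    toks = text.lower().split()
    if stopwords:
        toks = [t for t in toks if t not in STOPWORDS]
    if stem:
        toks = [naive_stem(t) for t in toks]
    # emit all n-grams of the requested size as features
    return [" ".join(toks[i:i + ngram]) for i in range(len(toks) - ngram + 1)]

feats = preprocess("Investments are taxed as capital gains", ngram=2)
```

Each distinct (n-gram size, stemmer, stopword flag) parameter set yields a different vectorized view of the same training documents, which is exactly what the parameter-set-per-line preprocessing request drives.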
    • Our Parallel Processing Flow
      TrainingDocs are preprocessed into multiple sets of Vectorized Docs (one per preprocessing parameter set), and each set is then trained into many Models (one per training parameter set), all in parallel.
    • Preprocessing on Hadoop
      (see next slide)
      Training Data:
      business  Investments are taxed as capital gains .....
      business  It was the overleveraged and underregulated banks …
      none      I am afraid we may be headed for …
      none      In the famous words of Homer Simpson, “it takes 2 to lie …”
      Preprocessing Request (a parameter set per line):
      279 68 ngram_stem_stopword 1 snowball true
      279 68 ngram_stem_stopword 2 snowball true
      279 68 ngram_stem_stopword 3 snowball true
      279 68 ngram_stem_stopword 1 porter true
      279 68 ngram_stem_stopword 2 porter true
      279 68 ngram_stem_stopword 3 none false
      Output: Vector 1, Vector 2, …, Vector k (one per parameter set)
    • Preprocessing on Hadoop: Big Picture
      Pig invokes RunPreprocess() through a UDF call:
      par = LOAD param_file AS par1, par2, …;
      run = FOREACH par GENERATE RunPreprocess(par1, par2, …);
      STORE run ..;
      RunPreprocess() drives the preprocessors (pluggable pipes): Tokenizer, Stemmer, StopwordFilter, Vectorizer, FeatureSelector.
      Each call emits a vector (Vector 1 … Vector k).
    • Training on Hadoop
      (see next slide)
      Vectors:
      010101101020101100010101110100010101011100…
      (… more vector rows …)
      Train Request (a parameter set per line):
      73 923 balanced_winnow 5 1 10 …
      73 923 balanced_winnow 5 2 10 …
      73 923 balanced_winnow 5 3 10 …
      73 923 balanced_winnow 5 1 20 …
      73 923 balanced_winnow 5 2 20 …
      73 923 balanced_winnow 5 3 20 …
      Output: Model 1 … Model k, trained with Mahout, Weka, Mallet, or libsvm
    • Training on Hadoop: Big Picture
      Pig invokes RunTrainer() through a UDF call:
      par = LOAD param_file AS par1, par2, …;
      run = FOREACH par GENERATE RunTrainer(par1, par2, …);
      STORE run ..;
      RunTrainer() dispatches to the underlying libraries:
      Mallet: AdaBoost (M2), Bagging, Balanced Winnow, C45, Decision Tree
      Mahout: Bayesian, Logistic Regression
      Weka: AdaBoostM1, Bagging, Additive Regression
      libsvm: SVM
      Each call produces a model (Model 1 … Model k).
    • Training on Hadoop: Trick #1
      Each model can be generated independently → an easy parallelization problem (aka ‘embarrassingly parallel’)
      But how do we achieve parallelism with Pig? The naive version offers no reduce-side parallelism:
      par = LOAD param_file AS par1, par2, …;
      run = FOREACH par GENERATE RunTrainer(par1, par2, …);
      STORE run ...;
      Grouping by the parameter set forces a reduce phase, and PARALLEL 50 spreads the UDF calls across 50 reducers:
      par = LOAD param_file AS par1, par2, …;
      grp = GROUP par BY (par1, par2, …) PARALLEL 50;
      fltn = FOREACH grp GENERATE group.par1 AS par1, …;
      run = FOREACH fltn GENERATE RunTrainer(par1, …);
      STORE run …;
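What the GROUP … PARALLEL trick buys can be pictured outside Pig as well: distinct parameter sets hash to different reducers, so each reducer trains its own share of models concurrently. A rough Python analogue (an assumption for illustration, not the deck's code) of that hash partitioning:

```python
# Hash-partition parameter sets into 50 buckets, as GROUP ... PARALLEL 50
# would distribute them across 50 reducers.
NUM_REDUCERS = 50

def partition(param_sets, n=NUM_REDUCERS):
    buckets = [[] for _ in range(n)]
    for p in param_sets:
        buckets[hash(p) % n].append(p)   # each bucket = one reducer's work
    return buckets

# 1000 hypothetical balanced_winnow parameter sets
params = [("balanced_winnow", 5, i, 10) for i in range(1000)]
buckets = partition(params)
```

Without the GROUP, the FOREACH runs map-side, and a tiny parameter file gives only as many mappers as input splits; the explicit reduce phase is what lets the 50-way parallelism take effect.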
    • Training on Hadoop: Trick #2
      We call ML functions from a UDF.
      Some functions can take too long to return, and Hadoop will kill the task if it reports no progress in the meantime.
      So RunTrainer() runs a “Pig Heartbeat” thread alongside the main thread to keep reporting progress while training runs.
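The heartbeat trick can be sketched as follows. This is a hedged illustration, not the deck's Java UDF: `report_progress` is a hypothetical stand-in for the Hadoop/Pig progress-reporting call, and the pattern shown is simply a daemon thread that keeps signalling liveness while the main thread blocks inside a long ML library call.

```python
# A background thread reports progress periodically so the task is not
# killed for inactivity while a long-running training call blocks.
import threading

def report_progress():
    pass  # hypothetical stand-in for the Hadoop reporter / Pig heartbeat

def train_with_heartbeat(train_fn, *args, interval=1.0):
    done = threading.Event()

    def heartbeat():
        while not done.is_set():
            report_progress()
            done.wait(interval)          # sleep, but wake early when done

    t = threading.Thread(target=heartbeat, daemon=True)
    t.start()
    try:
        return train_fn(*args)           # the long-running training call
    finally:
        done.set()                       # stop the heartbeat cleanly
        t.join()

result = train_with_heartbeat(lambda x: x * 2, 21)
```

Using an `Event` rather than `time.sleep` lets the heartbeat thread exit promptly once training returns, instead of lingering for a full interval.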
    • As a result, we now see…
      We are now able to build tens of thousands of models within an hour and choose the best.
      Previously, the same task took us days.
      As we can generate more models more frequently, we become more adaptive to the fast-changing Internet community, catching up with newly-coined terms, etc.
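The final "choose the best" step reduces to scoring every trained model and keeping the top one. A minimal sketch under assumed inputs (the metric, field names, and scores are hypothetical, not from the deck):

```python
# Pick the best model from a batch by a held-out evaluation score.
def select_best(models, score_fn):
    return max(models, key=score_fn)

models = [
    {"algo": "svm", "C": 1.0, "f1": 0.81},
    {"algo": "balanced_winnow", "f1": 0.77},
    {"algo": "adaboost", "f1": 0.84},
]
best = select_best(models, lambda m: m["f1"])
```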
    • Useful Resources
      Mahout: http://mahout.apache.org/
      Mallet: http://mallet.cs.umass.edu/
      Weka: http://www.cs.waikato.ac.nz/ml/weka/
      libsvm: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
      OpenNLP: http://incubator.apache.org/opennlp/
      Pig: http://pig.apache.org/
    • Thank You!