
Machine Learning with Hadoop


Sangchul Song and Thu Kyaw discuss machine learning at AOL, and the challenges and solutions they encountered when trying to train a large number of machine learning models using Hadoop. Algorithms including SVM and packages like Mahout are discussed. Finally, they discuss their analytics pipeline, which includes some custom components used to interoperate with a range of machine learning libraries, as well as integration with the query language Pig.

Published in: Technology, Education

  1. 1. Training on a pluggable machine learning platform<br />Machine Learning on Hadoop at Huffington Post | AOL<br />
  2. 2. A Little Bit about Us<br />Core Services Team at HPMG | AOL<br />Thu Kyaw<br />Principal Software Engineer<br />Worked on machine learning, data mining, and natural language processing<br />Sang Chul Song, Ph.D.<br />Senior Software Engineer<br />Worked on data intensive computing – data archiving / information retrieval<br />
  3. 3. Machine Learning: Supervised Classification<br />1. Learning Phase: labeled documents (“Business”, “Entertainment”, “Politics”) → Train → Model<br />2. Classifying Phase: “capital gains to be taxed …” → Classify (using the Model) → Result<br />
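The two phases on this slide can be sketched with a tiny multinomial Naive Bayes text classifier in plain Python. This is only an illustration: the training sentences, the function names `train` and `classify`, and the choice of Naive Bayes are assumptions, not the speakers' implementation.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Learning phase: count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for label, text in labeled_docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return {"words": word_counts, "labels": label_counts, "vocab": vocab}

def classify(model, text):
    """Classifying phase: pick the label with the highest log-probability."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, doc_count in model["labels"].items():
        counts = model["words"][label]
        total_words = sum(counts.values())
        score = math.log(doc_count / total_docs)
        for token in text.lower().split():
            # Add-one smoothing so an unseen word does not zero out a label.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set (not from the talk).
TRAINING = [
    ("business", "investments are taxed as capital gains"),
    ("business", "banks report gains on investments"),
    ("entertainment", "the movie premiere drew famous actors"),
    ("politics", "the senate vote on the election bill"),
]
model = train(TRAINING)
label = classify(model, "capital gains to be taxed")
```

With this toy data, `label` comes out as `"business"`, because the business documents are the only ones that contain "capital", "gains", and "taxed".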
  4. 4. Two Machine Learning Use Cases at HuffPost | AOL<br />Comment Moderation<br />Evaluate All New HuffPost User Comments Every Day<br />Identify Abusive / Aggressive Comments<br />Auto Delete / Publish ~25% of Comments Every Day<br />Article Classification<br />Tag Articles for Advertising<br />E.g.: scary, salacious, …<br />
  5. 5. Our Classification Tasks<br />Comment Moderation: each comment is labeled abusive or non-abusive<br />Article Classification: each article is tagged with labels such as scary or sexy<br />
  6. 6. In Order to Meet Our Needs, We Require…<br />Support for important algorithms, including<br />SVM<br />Perceptron / Winnow<br />Bayesian<br />Decision Tree<br />AdaBoost …<br />Ability to build tons of models on a regular basis, and pick the best<br />Because, in general, it’s difficult to know in advance what algorithm / parameter set will work best<br />
  7. 7. However,<br />N algorithms, K parameters each, L values in each parameter → There are N × L^K combinations, which is often too many to deal with sequentially.<br />For example, N=5, K=5, L=10 → 5 × 10^5 = 500K<br />
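The count on this slide can be checked with a few lines of Python. The sketch below computes N × L^K directly and also enumerates a small grid to confirm the formula; the function names and the toy algorithm and parameter names are hypothetical.

```python
from itertools import product

def count_combinations(n_algorithms, k_parameters, l_values):
    """Closed form: each algorithm has l_values ** k_parameters settings."""
    return n_algorithms * l_values ** k_parameters

def enumerate_grid(algorithms, parameter_values):
    """Yield (algorithm, setting) pairs; parameter_values maps name -> list of values."""
    names = sorted(parameter_values)
    for algorithm in algorithms:
        for values in product(*(parameter_values[name] for name in names)):
            yield algorithm, dict(zip(names, values))

# The slide's example: N=5 algorithms, K=5 parameters, L=10 values each.
total = count_combinations(5, 5, 10)

# A small hypothetical grid: 2 algorithms, 2 parameters, 3 values each.
small_grid = list(enumerate_grid(["a1", "a2"], {"p1": [0, 1, 2], "p2": [0, 1, 2]}))
```

Here `total` is 500000, matching the slide, and `small_grid` has 2 × 3^2 = 18 entries.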
  8. 8. So, we parallelize on Hadoop<br />Good news: <br />Mahout, a parallel machine learning tool, is already available.<br />Mallet, libsvm, Weka, … support the necessary algorithms.<br />Bad news: <br />Mahout doesn’t support the necessary algorithms yet. <br />The other implementations do not run natively on Hadoop.<br />
  9. 9. Therefore, we do…<br />We build a flexible ML platform running on Hadoop that supports a wide range of algorithms, leveraging publicly available implementations.<br />On top of our platform, we generate / test hundreds of thousands of models, and choose the best.<br />We use Pig for the Hadoop implementation.<br />
  10. 10. Our Approach<br />OUR APPROACH: more algorithms (thus a better model) and faster parallel processing<br />AdaBoost, SVM, Decision Tree, Bayesian, and many others<br />Train Request<br />Return<br />CONVENTIONAL<br />1000s of Models (one for each param set)<br />Best Model<br />Training Data<br />Select<br />Train (sequential)<br />
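The "train many models, then select the best" idea can be sketched locally with a thread pool standing in for the Hadoop cluster. Everything here is illustrative: `train_and_score` is a hypothetical stub with a made-up scoring rule, and a real version would call an actual trainer and evaluate on held-out data.

```python
from concurrent.futures import ThreadPoolExecutor

def train_and_score(param_set):
    """Hypothetical stand-in for training one model and measuring its quality."""
    algorithm, c = param_set
    # Toy score that peaks at c == 10 and gives "svm" a small bonus.
    score = 1.0 / (1.0 + abs(c - 10)) + (0.05 if algorithm == "svm" else 0.0)
    return param_set, score

def select_best(param_sets, max_workers=4):
    """Train every parameter set independently, then keep the best-scoring one."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(train_and_score, param_sets))
    return max(results, key=lambda item: item[1])

# One entry per model to build: (algorithm, value of a parameter c).
PARAM_SETS = [(alg, c) for alg in ("svm", "winnow") for c in (1, 10, 100)]
best_params, best_score = select_best(PARAM_SETS)
```

Because each call to `train_and_score` depends only on its own parameter set, the work can be spread over any number of workers without coordination, which is the property the talk relies on.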
  11. 11. What Parallelization?<br />Training Task<br />Training Task<br />Training Task<br />Training Task<br />Training Task<br />
  12. 12. General Processing Flow<br />Training Docs → Preprocess → Vectorized Docs → Train → Model<br />Preprocess Parameters<br />Stopword use, n-gram size, stemming, etc.<br />Train Parameters<br />Algorithm and algorithm-specific parameters<br />(e.g. for SVM: C, ε, and other kernel parameters)<br />
  13. 13. Our Parallel Processing Flow<br />Training Docs → several sets of Vectorized Docs → several Models per set of Vectorized Docs<br />
  14. 14. Preprocessing on Hadoop<br />Training Data<br />business Investments are taxed as capital gains.....<br />business It was the overleveraged and underregulated banks …<br />none I am afraid we may be headed for …<br />none In the famous words of Homer Simpson, “it takes 2 to lie …”<br />…<br />Preprocessing Request (a parameter set per line)<br />279 68 ngram_stem_stopword 1 snowball true<br />279 68 ngram_stem_stopword 2 snowball true<br />279 68 ngram_stem_stopword 3 snowball true<br />279 68 ngram_stem_stopword 1 porter true<br />279 68 ngram_stem_stopword 2 porter true<br />279 68 ngram_stem_stopword 3 none false<br />…<br />Preprocessing on Hadoop (see next slide)<br />Vector 1<br />Vector 2<br />Vector 3<br />Vector 4<br />Vector 5<br />…<br />Vector k<br />
  15. 15. Preprocessing on Hadoop: Big Picture<br />par = LOAD 'param_file' AS (par1, par2, …);<br />run = FOREACH par GENERATE RunPreprocess(par1, par2, …);<br />STORE run ..;<br />Through UDF Call: RunPreprocess()<br />Preprocessors (Pluggable Pipes)<br />Stemmer<br />Tokenizer<br />StopwordFilter<br />Vectorizer<br />FeatureSelector<br />Vector 1<br />Vector 2<br />…<br />Vector k<br />
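The "pluggable pipes" idea can be sketched as a list of small functions assembled from one parameter set. This is a minimal sketch under assumptions: the stopword list, the function names, and the bag-of-words output format are hypothetical, and the stemmer and feature selector from the slide are left out to keep it short.

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "as", "are", "is", "of", "to"}  # hypothetical list

def tokenize(text):
    return text.lower().split()

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def make_ngrams(n):
    """Return a pipe that joins every run of n consecutive tokens."""
    def pipe(tokens):
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return pipe

def vectorize(tokens):
    """Bag-of-words vector as a term -> count mapping."""
    return dict(Counter(tokens))

def build_pipeline(ngram_size, use_stopwords):
    """Assemble the pipes for one preprocessing parameter set."""
    pipes = [tokenize]
    if use_stopwords:
        pipes.append(remove_stopwords)
    pipes.append(make_ngrams(ngram_size))
    pipes.append(vectorize)
    return pipes

def run_preprocess(text, ngram_size, use_stopwords):
    """Feed the text through each pipe in order and return the final vector."""
    data = text
    for pipe in build_pipeline(ngram_size, use_stopwords):
        data = pipe(data)
    return data

vector = run_preprocess("Investments are taxed as capital gains", 2, True)
```

Each parameter set yields a different pipeline, so one training document can produce many different vectors, one per line of the preprocessing request.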
  16. 16. Training on Hadoop<br />Vectors<br />010101101020101100010101110100010101011100…<br />010111010100010100100010101011100110110101…<br />011101011010101011101011011010001010010101…<br />010010111010100010101010001010111010101010…<br />111010110001110101011010100101011010001011…<br />Train Request (a parameter set per line)<br />73 923 balanced_winnow 5 1 10 …<br />73 923 balanced_winnow 5 2 10 …<br />73 923 balanced_winnow 5 3 10 …<br />73 923 balanced_winnow 5 1 20 …<br />73 923 balanced_winnow 5 2 20 …<br />73 923 balanced_winnow 5 3 20 …<br />…<br />Training on Hadoop (see next slide): Mahout, Weka, Mallet, or libsvm<br />Model 1<br />Model 2<br />Model 3<br />Model 4<br />Model 5<br />…<br />Model k<br />
  17. 17. Training on Hadoop: Big Picture<br />par = LOAD 'param_file' AS (par1, par2, …);<br />run = FOREACH par GENERATE RunTrainer(par1, par2, …);<br />STORE run ..;<br />Through UDF Call: RunTrainer()<br />Mallet: AdaBoost (M2), Bagging, Balanced Winnow, C45, Decision Tree, …<br />Mahout: Bayesian, Logistic Regression, …<br />Weka: AdaBoostM1, Bagging, Additive Regression, …<br />libsvm: SVM<br />Model 1<br />Model 2<br />…<br />Model k<br />
  18. 18. Training on Hadoop: Trick #1<br />Each model can be generated independently → an easy parallelization problem (aka ‘embarrassingly parallel’)<br />But, how do we achieve parallelism with Pig?<br />Original script:<br />par = LOAD 'param_file' AS (par1, par2, …);<br />run = FOREACH par GENERATE RunTrainer(par1, par2, …);<br />STORE run ...;<br />Parallelized script:<br />par = LOAD 'param_file' AS (par1, par2, …);<br />grp = GROUP par BY (par1, par2, …) PARALLEL 50;<br />fltn = FOREACH grp GENERATE group.par1 AS par1, …;<br />run = FOREACH fltn GENERATE RunTrainer(par1, …);<br />STORE run …;<br />
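The effect of grouping by the full parameter tuple with `PARALLEL 50` can be pictured as hashing each parameter set to one of 50 reducers, so the trainer calls are spread across them instead of running in a single task. The Python sketch below simulates that assignment locally. The CRC32 hash is an assumption chosen for repeatability (it is not Hadoop's actual partitioner), and the parameter sets are hypothetical.

```python
import zlib
from collections import defaultdict

def assign_reducer(key_fields, num_reducers):
    """Map a group key to a reducer index with a stable hash (CRC32 here)."""
    key = "\t".join(str(field) for field in key_fields)
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition(param_sets, num_reducers):
    """Group parameter sets by the reducer that would receive them."""
    buckets = defaultdict(list)
    for param_set in param_sets:
        buckets[assign_reducer(param_set, num_reducers)].append(param_set)
    return buckets

# Hypothetical request lines: (algorithm, parameter a, parameter b).
PARAM_SETS = [("balanced_winnow", a, b) for a in range(1, 11) for b in (10, 20, 30)]
buckets = partition(PARAM_SETS, num_reducers=50)
```

Because every parameter tuple is its own group key, each group holds exactly one training request, and the number of reducers sets the upper bound on how many run at once.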
  28. 28. Training on Hadoop: Trick #2<br />We call ML functions from a UDF.<br />Some functions can take too long to return, and Hadoop will kill the job if they do.<br />So RunTrainer() runs the ML function in the Main Thread while a separate “Pig Heartbeat” Thread keeps reporting progress.<br />
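The heartbeat trick can be sketched in Python with a helper thread that reports progress at a fixed interval while the main thread runs the long call. This is a sketch of the general pattern only: `run_with_heartbeat`, `report_progress`, and `slow_training` are hypothetical names, and in an actual Java Pig UDF the callback would presumably be the UDF's progress-reporting hook, which the slides do not show.

```python
import threading
import time

def run_with_heartbeat(task, report_progress, interval_seconds):
    """Run task() on the calling thread while a helper thread calls
    report_progress() every interval_seconds until the task finishes."""
    done = threading.Event()

    def beat():
        # wait() returns False on timeout (task still running), True once done is set.
        while not done.wait(interval_seconds):
            report_progress()

    heartbeat = threading.Thread(target=beat, daemon=True)
    heartbeat.start()
    try:
        return task()
    finally:
        # Stop the heartbeat whether the task returned or raised.
        done.set()
        heartbeat.join()

beats = []

def slow_training():
    time.sleep(0.3)  # hypothetical stand-in for a long-running trainer call
    return "model"

result = run_with_heartbeat(slow_training, lambda: beats.append(time.time()), 0.05)
```

Using an event rather than a plain sleep loop lets the heartbeat thread stop as soon as the trainer returns, so it never outlives the call it is protecting.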
  29. 29. As a result, we now see…<br />We are now able to build tens of thousands of models within an hour and choose the best.<br />Previously, the same task took us days.<br />As we can generate more models more frequently, we become more adaptive to the fast-changing Internet community, catching up with newly-coined terms, etc.<br />
  30. 30. Useful Resources<br />Mahout<br />Mallet<br />Weka<br />libsvm<br />OpenNLP<br />Pig<br />
  31. 31. Thank You!<br />