Knitting boar atl_hug_jan2013_v2


  1. KNITTING BOAR: Machine Learning, Mahout, and Parallel Iterative Algorithms. Josh Patterson, Principal Solutions Architect
  2. ✛ Josh Patterson > Master’s Thesis: self-organizing mesh networks ∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm > Conceived, built, and led Hadoop integration for the openPDC project at Tennessee Valley Authority (TVA) > Twitter: @jpatanooga > Email: josh@floe.tv
  3. ✛ Introduction to Machine Learning ✛ Mahout ✛ Knitting Boar and YARN ✛ Parting Thoughts
  4. Introduction to MACHINE LEARNING
  5. ✛ What is Data Mining? > “the process of extracting patterns from data” ✛ Why are we interested in Data Mining? > Raw data is essentially useless ∗ Data is simply recorded facts ∗ Information is the patterns underlying the data ✛ Machine Learning > Algorithms for acquiring structural descriptions from data “examples” ∗ Process of learning “concepts”
  6. ✛ Information Retrieval > Draws on information science, information architecture, cognitive psychology, linguistics, and statistics ✛ Natural Language Processing > Grounded in machine learning, especially statistical machine learning ✛ Statistics > Math and stuff ✛ Machine Learning > Considered a branch of artificial intelligence
  7. ✛ ETL ✛ Joining multiple disparate data sources ✛ Filtering data ✛ Aggregation ✛ Cube materialization “Descriptive Statistics”
  8. ✛ Data collection performed with Flume ✛ Data cleansing / ETL performed with Hive or Pig ✛ ML work performed with > SAS > SPSS > R > Mahout
  9. Introduction to MAHOUT
  10. ✛ Classification > “Fraud detection” ✛ Recommendation > “Collaborative Filtering” ✛ Clustering > “Segmentation” ✛ Frequent Itemset Mining. Copyright 2010 Cloudera Inc. All rights reserved
  11. ✛ Stochastic Gradient Descent > Single process > Logistic Regression Model Construction ✛ Naïve Bayes > MapReduce-based > Text Classification ✛ Random Forests > MapReduce-based
  12. ✛ An algorithm that looks at a user’s past actions and suggests > Products > Services > People ✛ Advertisement > Cloudera has a great Data Science training course on this topic > http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html
  13. ✛ Cluster words across docs to identify topics ✛ Latent Dirichlet Allocation
  14. ✛ Why Machine Learning? > Growing interest in predictive modeling ✛ Linear Models are Simple, Useful > Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression ✛ Building Models Is Still Time Consuming > The “Need for speed” > “More data beats a cleverer algorithm”
  15. Introducing KNITTING BOAR
  16. ✛ Parallelize Mahout’s Stochastic Gradient Descent > With as few extra dependencies as possible ✛ Wanted to explore parallel iterative algorithms using YARN > Wanted a first-class Hadoop YARN citizen > Work through dev progressions towards a stable state > Worry about “frameworks” later
  17. ✛ We Need > Hypothesis about data > Cost function > Update function ✛ Basic Algorithm: Andrew Ng’s Tutorial: https://class.coursera.org/ml/lecture/preview_view/1117
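For readers who want the three ingredients on this slide written out, here is one common formulation for logistic regression, the model the rest of the deck trains. The symbols are assumptions introduced for illustration and do not come from the slides: $\theta$ is the parameter vector, $x^{(i)}$ and $y^{(i)} \in \{0, 1\}$ are the $i$-th example and its label, $m$ is the number of examples, and $\alpha$ is the learning rate. The three lines are the hypothesis, the cost function, and the stochastic update applied to one example at a time.

```latex
\begin{align*}
h_\theta(x) &= \sigma(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}} \\
J(\theta) &= -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big] \\
\theta &\leftarrow \theta + \alpha \big( y^{(i)} - h_\theta(x^{(i)}) \big)\, x^{(i)}
\end{align*}
```

The update follows from the per-example gradient of the cost, which is $(h_\theta(x^{(i)}) - y^{(i)})\, x^{(i)}$; stepping against it gives the plus sign shown.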
  18. ✛ Training > Simple gradient descent procedure > Loss function needs to be convex ✛ Prediction > Logistic Regression: ∗ Sigmoid function using parameter vector (dot) example as exponential parameter [Diagram labels: Training Data, SGD, Model]
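A minimal sketch of the training and prediction steps on this slide, in plain Python rather than Mahout's Java API. The function names, the learning rate, the number of passes, and the toy data are all assumptions made up for illustration; this is not the code the deck benchmarks.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(theta, x):
    """Probability of the positive class: sigmoid of (theta dot x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def sgd_train(examples, n_features, learning_rate=0.1, passes=50, seed=0):
    """Train logistic regression with one gradient step per example."""
    rng = random.Random(seed)
    theta = [0.0] * n_features
    data = list(examples)
    for _ in range(passes):
        rng.shuffle(data)  # visit the examples in a new order each pass
        for x, y in data:
            error = y - predict(theta, x)
            for j in range(n_features):
                theta[j] += learning_rate * error * x[j]
    return theta


# Toy data (made up): the first feature is a constant bias term and the
# label is 1 when the second feature is positive.
TOY = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
theta = sgd_train(TOY, n_features=2)
```

Calling `predict(theta, [1.0, 2.0])` afterwards returns a probability above 0.5, and `predict(theta, [1.0, -2.0])` one below it.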
  19. Current Limitations ✛ Sequential algorithms on a single node only go so far ✛ The “Data Deluge” > Presents algorithmic challenges when combined with large data sets > Need to design algorithms that are able to perform in a distributed fashion ✛ MapReduce only fits certain types of algorithms
  20. Distributed Learning Strategies ✛ Langford, 2007 > Vowpal Wabbit ✛ McDonald, 2010 > Distributed Training Strategies for the Structured Perceptron
  21. [Diagram: Input feeds a row of processors in Superstep 1, then a row of processors in Superstep 2, and so on to Output; labels: Map, Reduce]
  22. “Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.” (Lin, 2012)
  23. ✛ Parallel Iterative implementation of SGD on YARN ✛ Workers > Work on partitions of the data > Stay active over supersteps ✛ Master > Performs superstep > Averages parameter vector
  24. ✛ Collects all parameter vectors at each pass / superstep ✛ Produces new global parameter vector > By averaging workers’ vectors ✛ Sends update to all workers > Workers replace local parameter vector with new global parameter vector
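The master's role described on this slide can be sketched in a few lines of Python. The function names and the sample vectors are hypothetical, and parameter vectors are shown as plain lists; the real project is Java and runs on YARN, so treat this only as an illustration of the averaging step.

```python
def average_parameter_vectors(worker_vectors):
    """Return the element-wise mean of the workers' parameter vectors."""
    if not worker_vectors:
        raise ValueError("need at least one worker vector")
    n_features = len(worker_vectors[0])
    if any(len(v) != n_features for v in worker_vectors):
        raise ValueError("all worker vectors must have the same length")
    n_workers = len(worker_vectors)
    return [sum(v[j] for v in worker_vectors) / n_workers
            for j in range(n_features)]


def broadcast(global_vector, n_workers):
    """Give every worker its own copy of the new global parameter vector."""
    return [list(global_vector) for _ in range(n_workers)]


# Made-up local vectors from three workers at the end of one superstep.
local_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
global_vector = average_parameter_vectors(local_vectors)
local_vectors = broadcast(global_vector, len(local_vectors))
```

Each worker receives a separate copy, so later local updates on one worker do not leak into another's vector.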
  25. ✛ Each given a split of the total dataset > Similar to a map task ✛ Performs local logistic regression run ✛ Local parameter vector sent to master at superstep
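To make the worker side concrete, the sketch below simulates the whole loop in a single Python process: each worker runs a local SGD pass over its own split, the vectors are averaged, and every worker replaces its local vector with the average. The `Worker` class, the `train` function, the learning rate, the superstep count, and the data are all assumptions for illustration. Workers run one after another here, whereas the deck describes them running in parallel on YARN.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


class Worker:
    """Holds one split of the data and a local parameter vector."""

    def __init__(self, split, n_features, learning_rate=0.1):
        self.split = split
        self.theta = [0.0] * n_features
        self.learning_rate = learning_rate

    def run_local_pass(self):
        """One SGD pass over this worker's split; returns the local vector."""
        for x, y in self.split:
            p = sigmoid(sum(t * xi for t, xi in zip(self.theta, x)))
            for j, xj in enumerate(x):
                self.theta[j] += self.learning_rate * (y - p) * xj
        return list(self.theta)

    def replace_vector(self, global_theta):
        """Overwrite the local vector with the master's global vector."""
        self.theta = list(global_theta)


def train(splits, n_features, supersteps=20):
    """Simulate master and workers; returns the final global vector."""
    workers = [Worker(s, n_features) for s in splits]
    global_theta = [0.0] * n_features
    for _ in range(supersteps):
        # Sequential stand-in for workers running in parallel.
        local = [w.run_local_pass() for w in workers]
        # Master step: element-wise average of the workers' vectors.
        global_theta = [sum(col) / len(local) for col in zip(*local)]
        for w in workers:
            w.replace_vector(global_theta)
    return global_theta


# Two made-up splits; the first feature is a constant bias term.
SPLITS = [
    [([1.0, -2.0], 0), ([1.0, 1.5], 1)],
    [([1.0, -1.0], 0), ([1.0, 2.0], 1)],
]
model = train(SPLITS, n_features=2)
```

Averaging once per pass keeps communication low, at the cost of workers drifting apart between supersteps.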
  26. [Diagram: OnlineLogisticRegression (Training Data, OnlineLogisticRegression, Model) beside Knitting Boar’s POLR (Split 1, Split 2, Split 3; Worker 1, Worker 2, …, Worker N; a Partial Model per worker; Master; Global Model)]
  27. [Chart: Input Size vs Processing Time. x-axis: input size in MB, 4.1 to 41; y-axis: seconds, 0 to 300; series: OLR, POLR]
  28. Knitting Boar PARTING THOUGHTS
  29. ✛ Parallel SGD > The Boar is temperamental, experimental ∗ Linear speedup (roughly) ✛ Developing YARN Applications > More complex than just MapReduce > Requires lots of “plumbing” ✛ IterativeReduce > Great native-Hadoop way to implement algorithms > Easy to use and well integrated
  30. ✛ Knitting Boar > https://github.com/jpatanooga/KnittingBoar > 100% Java > ASF 2.0 Licensed > Quick Start ∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start ✛ IterativeReduce > https://github.com/emsixteeen/IterativeReduce > 100% Java > ASF 2.0 Licensed
  31. ✛ Machine Learning is hard > Don’t believe the hype > Do the work ✛ Model development takes time > Lots of iterations > Speed is key here Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg
  32. ✛ “Parallel Linear Regression on Iterative Reduce and YARN” ✛ Hadoop Summit Europe 2013 > March 20, 21 > http://hadoopsummit.org/amsterdam/
  33. ✛ Strata / Hadoop World 2012 Slides > http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html ✛ McDonald, 2010 > http://dl.acm.org/citation.cfm?id=1858068 ✛ Lin, 2012 > MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! > http://arxiv.org/pdf/1209.2191v1.pdf
  34. ✛ http://eteamjournal.files.wordpress.com/2011/03/photos-of-mount-everest-pictures.jpg ✛ http://images.fineartamerica.com/images-medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg ✛ http://freewallpaper.in/wallpaper2/2202-2-2001_space_odyssey_-_5.jpg
