Strata + Hadoop World 2012: Knitting Boar

In this session, we will introduce “Knitting Boar”, an open-source Java library for performing distributed online learning on a Hadoop cluster under YARN. We will give an overview of how Knitting Boar works and examine the lessons learned from YARN application construction.

  1. KNITTING BOAR: Building Machine Learning Tools with Hadoop's YARN. Josh Patterson, Principal Solutions Architect; Michael Katzenellenbogen, Principal Solutions Architect
  2. ✛ Josh Patterson - josh@cloudera.com > Master's Thesis: self-organizing mesh networks ∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm > Conceived, built, and led Hadoop integration for the openPDC project at the Tennessee Valley Authority (TVA) ✛ Michael Katzenellenbogen - michael@cloudera.com > Principal Solutions Architect @ Cloudera > Systems Guy ('nuff said)
  3. ✛ Intro / Background ✛ Introducing Knitting Boar ✛ Integrating Knitting Boar and YARN ✛ Results and Lessons Learned
  4. Background and INTRODUCTION
  5. ✛ Why Machine Learning? > Growing interest in predictive modeling ✛ Linear Models are Simple, Useful > Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression ✛ Building Models is Still Time Consuming > The "need for speed" > "More data beats a cleverer algorithm"
  6. ✛ Parallelize Mahout's Stochastic Gradient Descent > With as few extra dependencies as possible ✛ Wanted to explore parallel iterative algorithms using YARN > Wanted a first-class Hadoop/YARN citizen > Work through dev progressions towards a stable state > Worry about "frameworks" later
  7. ✛ Training > Simple gradient descent procedure > Loss function needs to be convex ✛ Prediction > Logistic Regression: ∗ Sigmoid function using (parameter vector dot example) as the exponent [Slide diagram: Training Data → SGD → Model (parameters)] (a plain-Java sketch of this update follows slide 32 below)
  8. ✛ Currently Single Process > Multi-threaded parallel, but not cluster parallel > Runs locally, not deployed to the cluster ✛ Defined in: > https://cwiki.apache.org/MAHOUT/logistic-regression.html (a usage sketch of Mahout's OnlineLogisticRegression follows slide 32 below)
  9. Current Limitations ✛ Sequential algorithms on a single node only go so far ✛ The "Data Deluge" > Presents algorithmic challenges when combined with large data sets > Need to design algorithms that are able to perform in a distributed fashion ✛ MapReduce only fits certain types of algorithms
  10. Distributed Learning Strategies ✛ Langford, 2007 > Vowpal Wabbit ✛ McDonald, 2010 > Distributed Training Strategies for the Structured Perceptron ✛ Dekel, 2010 > Optimal Distributed Online Prediction Using Mini-Batches
  11. [Slide diagram: superstep-style processing. Input flows through a set of processors in superstep 1 (Map), their results flow through processors in superstep 2 (Reduce), and so on until the Output.]
  12. "Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc." (Lin, 2012)
  13. Introducing KNITTING BOAR
  14. ✛ Parallel iterative implementation of SGD on YARN ✛ Workers work on partitions of the data ✛ Master keeps a global copy of the merged parameter vector
  15. ✛ Each worker is given a split of the total dataset > Similar to a map task ✛ Uses a modified OLR > Processes N samples in a batch (a subset of the split) ✛ Batched gradient accumulation updates are sent to the master node > The gradient influences future model vectors towards better predictions
  16. ✛ Accumulates gradient updates > From batches of worker OLR runs ✛ Produces a new global parameter vector > By averaging the workers' vectors (a sketch of this averaging step follows slide 32 below) ✛ Sends the update to all workers > Workers replace their local parameter vector with the new global parameter vector
  17. [Slide diagram: OnlineLogisticRegression vs. Knitting Boar's POLR. The training data is divided into splits 1..N; Worker 1 through Worker N each run OnlineLogisticRegression on their split to produce a partial model, and the master merges the partial models into the global model.]
  18. Integrating Knitting Boar with YARN
  19. ✛ Yet Another Resource Negotiator ✛ Framework for scheduling distributed applications ✛ Typically runs on top of an HDFS cluster > Though not required, nor is it coupled to HDFS ✛ MRv2 is now a distributed application [Slide diagram: YARN architecture. Clients submit jobs to the ResourceManager, NodeManagers report node status, and each application's ApplicationMaster requests resources and runs its containers (e.g. MapReduce) on the NodeManagers.]
  20. ✛ High setup / teardown costs ✛ Not designed for super-step operations ✛ Need to refactor the problem to fit MapReduce > With YARN, we can now just launch a distributed application
  21. ✛ Designed specifically for parallel iterative algorithms on Hadoop > Implemented directly on top of YARN ✛ Intrinsic Parallelism > Easier to focus on the problem > Not on the distributed application part
  22. ✛ ComputableMaster > Setup() > Compute() > Complete() ✛ ComputableWorker > Setup() > Compute() [Slide diagram: in each superstep the workers compute on their data and the master merges their results, repeated iteration after iteration.] (a sketch of these interfaces follows slide 32 below)
  23. ✛ Client > Launches the YARN ApplicationMaster (a submission sketch follows slide 32 below) ✛ Master > Computes required resources > Obtains resources from YARN > Launches Workers ✛ Workers > Compute on partial data (input split) > Synchronize with the Master
  24. [Slide diagram: the stack. Pig, Hive, Scala, Java, and Crunch algorithms sit on top of execution frameworks (MapReduce, IterativeReduce, BranchReduce, Giraph, …), which run on HDFS / YARN.]
  25. Knitting Boar: PERFORMANCE, SCALING, AND RESULTS
  26. [Slide chart: Input Size vs Processing Time, comparing OLR and POLR.]
  27. ✛ Parallel SGD > The Boar is temperamental, experimental ∗ Linear speedup (roughly) ✛ Developing YARN Applications > More complex than just MapReduce > Requires lots of "plumbing" ✛ IterativeReduce > A great native-Hadoop way to implement algorithms > Easy to use and well integrated
  28. ✛ Knitting Boar > 100% Java > ASF 2.0 licensed > https://github.com/jpatanooga/KnittingBoar > Quick Start ∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start ✛ IterativeReduce > [ coming soon ]
  29. The Road Ahead ✛ SGD > More testing > Demo use cases ✛ IterativeReduce > Reliability > Durability (Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg)
  30. ✛ Mahout's SGD implementation > http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf ✛ Hadoop AllReduce and Terascale Learning > http://hunch.net/?p=2094 ✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail! > http://arxiv.org/pdf/1209.2191v1.pdf
  31. ✛ Langford > http://hunch.net/~vw/ ✛ Zinkevich, 2011 > http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf ✛ McDonald, 2010 > http://dl.acm.org/citation.cfm?id=1858068 ✛ Dekel, 2010 > http://arxiv.org/pdf/1012.1367.pdf
  32. ✛ http://eteamjournal.files.wordpress.com/2011/03/photos-of-mount-everest-pictures.jpg ✛ http://images.fineartamerica.com/images-medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg ✛ http://agileknitter.com/wp-content/uploads/2010/06/Pictures_-_Misc_-_Knitting_Needles.jpg
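
For readers who want the mechanics behind slide 7 spelled out, here is a minimal, self-contained Java sketch of online logistic regression trained with stochastic gradient descent: prediction is the sigmoid of the parameter vector dotted with the example, and training nudges the weights along the gradient of the log loss. This is illustrative plain Java, not code from Knitting Boar or Mahout; the class name and learning-rate handling are arbitrary.

    // Minimal sketch of online logistic regression via SGD (illustrative only).
    public class SgdLogisticRegressionSketch {

      private final double[] weights;     // model parameter vector
      private final double learningRate;

      public SgdLogisticRegressionSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
      }

      /** Prediction: sigmoid of (parameter vector dot example), as on slide 7. */
      public double predict(double[] features) {
        double dot = 0.0;
        for (int i = 0; i < weights.length; i++) {
          dot += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
      }

      /** Training: one gradient step on a single example whose label is 0 or 1. */
      public void train(double[] features, int label) {
        double error = label - predict(features);           // gradient of the log loss w.r.t. the dot product
        for (int i = 0; i < weights.length; i++) {
          weights[i] += learningRate * error * features[i]; // move the weights toward a better prediction
        }
      }

      /** The current parameter vector (used by the averaging sketch further down). */
      public double[] weights() {
        return weights;
      }
    }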
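
Slide 8 points at Mahout's OnlineLogisticRegression, the single-process baseline that Knitting Boar parallelizes. The snippet below is a rough usage sketch against the Mahout 0.7-era SGD API; the feature values, category count, and tuning constants are placeholders, and the exact builder methods should be checked against the Mahout version in use.

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class MahoutOlrSketch {
      public static void main(String[] args) {
        // Two categories, three features, L1 prior; the tuning values are placeholders.
        OnlineLogisticRegression olr =
            new OnlineLogisticRegression(2, 3, new L1())
                .learningRate(0.1)
                .lambda(1.0e-4);

        // Train on one encoded example; the label is the category index.
        Vector example = new DenseVector(new double[] {1.0, 0.5, -0.3});
        olr.train(1, example);

        // Score a new example: probability that it belongs to category 1.
        double p = olr.classifyScalar(new DenseVector(new double[] {0.9, 0.4, -0.2}));
        System.out.println("P(category = 1) = " + p);
      }
    }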
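
Slides 15 and 16 describe the worker/master cycle: each worker trains a local model on a mini-batch of its split, and the master averages the workers' parameter vectors into a new global vector that is broadcast back. The sketch below shows only the master's averaging step (w_global = (1/k) * sum of the k worker vectors); the class and method names are hypothetical and are not taken from the Knitting Boar source.

    import java.util.List;

    /** Hypothetical sketch of the master's averaging step from slide 16. */
    public class ParameterAveragingSketch {

      /** Average the workers' parameter vectors: w_global = (1/k) * sum(w_i). */
      static double[] average(List<double[]> workerVectors) {
        int numFeatures = workerVectors.get(0).length;
        double[] global = new double[numFeatures];
        for (double[] w : workerVectors) {
          for (int i = 0; i < numFeatures; i++) {
            global[i] += w[i] / workerVectors.size();
          }
        }
        // The master sends this vector to every worker; each worker then replaces
        // its local parameter vector with the new global one before the next batch.
        return global;
      }
    }

On the worker side, an instance of the SgdLogisticRegressionSketch class above could play the role of the local model, with its weights() vector shipped to the master after every batch.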
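
Slide 22 names the two IterativeReduce contracts, ComputableMaster and ComputableWorker. Only the interface and method names below come from the slide; the generic type T (the per-superstep update, e.g. a parameter vector) and the master's argument are assumptions, so the real signatures in the KnittingBoar repository may differ.

    import java.util.Collection;

    /** Plausible shape of the IterativeReduce master contract named on slide 22. */
    interface ComputableMaster<T> {
      void setup();                            // initialize global state before the first superstep
      T compute(Collection<T> workerUpdates);  // merge the workers' partial updates into a new global update
      void complete();                         // finish up once the iterations are done
    }

    /** Plausible shape of the IterativeReduce worker contract named on slide 22. */
    interface ComputableWorker<T> {
      void setup();                            // open this worker's input split
      T compute();                             // process the next mini-batch and return a partial update
    }

In Knitting Boar's case, a worker's compute() would run the modified OLR over one batch and return the partial model, while the master's compute() would perform the averaging shown above.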
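
Slide 23 says the client's main job is to launch the YARN ApplicationMaster. As orientation for what that plumbing looks like, here is a bare-bones submission sketch written against the later, stable Hadoop 2.x YarnClient API (the 2012-era alpha API used at the time of the talk differs); the ApplicationMaster class name is hypothetical, and a real client must also ship the AM jar as a local resource and set up its classpath and environment, which is omitted here.

    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.util.Records;

    public class YarnSubmitSketch {
      public static void main(String[] args) throws Exception {
        // Connect to the ResourceManager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // Ask for a new application and fill in its submission context.
        ApplicationSubmissionContext appContext =
            yarnClient.createApplication().getApplicationSubmissionContext();
        appContext.setApplicationName("knitting-boar-style-app");

        // Describe the container that will run the (hypothetical) ApplicationMaster.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
            "java com.example.HypotheticalApplicationMaster"));
        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(512 /* MB */, 1 /* vcore */));

        // Hand the application to YARN; the ResourceManager schedules the AM container,
        // and the AM then requests the worker containers itself.
        ApplicationId appId = yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);
      }
    }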

Editor's Notes

  • Vowpal Wabbit: doesn't natively run on Hadoop. Spark: Scala, overhead, integration issues.
  • "Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems." (Bottou, 2010) SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners with no loss in model accuracy.
  • The most important additions in Mahout's SGD are: confidence-weighted learning rates per term; evolutionary tuning of hyper-parameters; mixed ranking and regression; grouped AUC. The implication of it being local is that you are limited to the compute capacity of the local machine, as opposed to even a single machine on the cluster.
  • At current disk bandwidth and capacity (2 TB at 100 MB/s throughput, i.e. about 20,000 seconds), it takes roughly 6 hours just to read the contents of a single hard drive.
  • Bottou's approach in the 2010 paper is similar to Xu (2010).
  • Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data, such as iterative algorithms (many in machine learning).
  • No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
  • Some of these are in progress towards being ready on YARN, some are not; we wanted to focus on OLR and not the framework for now.
  • “say hello to my leeeeetle friend….”
  • POLR: Parallel Online Logistic Regression. Talking points: we wanted to start with a tool known to the Hadoop community, with expected characteristics; Mahout's SGD is well known, so we used that as a base point.
  • Segue into YARN.
  • Performance still largely dependent on implementation of algo
  • 3 major costs of BSP-style computations: max unit compute time; cost of global communication; cost of barrier sync at the end of a superstep.
  • Multi-dimensional: need to constantly think about the Client, the Master, and the Worker, how they interact and the implications of failures, etc.
  • Basecamp: use story of how we get to basecamp to see how to climb some more
