Strata + Hadoop World 2012: Knitting Boar


In this session, we will introduce “Knitting Boar”, an open-source Java library for performing distributed online learning on a Hadoop cluster under YARN. We will give an overview of how Knitting Boar works and examine the lessons learned from YARN application construction.

Published in: Technology
  • Vowpal Wabbit: doesn’t natively run on Hadoop. Spark: Scala, overhead, integration issues.
  • “Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” (Bottou, 2010). SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss of model accuracy.
  • The most important additions in Mahout’s SGD are: confidence-weighted learning rates per term; evolutionary tuning of hyper-parameters; mixed ranking and regression; grouped AUC. The implication of it being local is that you are limited to the compute capacity of the local machine, as opposed to even a single machine on the cluster.
  • At current disk bandwidth and capacity (2 TB at 100 MB/s throughput), it takes roughly 6 hours to read the contents of a single hard drive.
  • Bottou similar to Xu2010 in the 2010 paper
  • Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data, such as iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
  • Some of these are in progress towards being ready on YARN, some not; wanted to focus on OLR and not framework for now
  • “say hello to my leeeeetle friend….”
  • POLR: Parallel Online Logistic Regression. Talking points: wanted to start with a tool known to the Hadoop community, with expected characteristics. Mahout’s SGD is well known, so we used that as a base point.
  • Segue into YARN
  • Performance still largely dependent on implementation of algo
  • 3 major costs of BSP-style computations: max unit compute time; cost of global communication; cost of barrier sync at end of super step.
  • Multi-dimensional: need to constantly think about the Client, the Master, and the Worker, how they interact and the implications of failures, etc.
  • Basecamp: use story of how we get to basecamp to see how to climb some more
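
The three BSP costs listed in the notes above can be combined into a rough per-superstep estimate. This is a minimal sketch assuming the standard BSP cost model (a superstep takes as long as its slowest worker, plus communication, plus the barrier); the function names and the numbers in the usage note are hypothetical and do not come from the talk.

```python
def superstep_cost(compute_times, comm_cost, barrier_cost):
    """Estimated wall-clock time of one BSP superstep (hypothetical helper).

    compute_times: per-worker compute time for this superstep
    comm_cost:     cost of global communication
    barrier_cost:  cost of the barrier sync at the end of the superstep
    """
    # The barrier makes every worker wait for the slowest one.
    return max(compute_times) + comm_cost + barrier_cost


def total_cost(supersteps):
    """Sum the estimated cost over (compute_times, comm_cost, barrier_cost) tuples."""
    return sum(superstep_cost(c, g, b) for c, g, b in supersteps)
```

For example, `superstep_cost([4.0, 7.5, 5.0], 1.0, 0.5)` is dominated by the 7.5 time-unit straggler, which is why uneven input splits hurt this style of computation.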

    1. KNITTING BOAR: Building Machine Learning Tools with Hadoop’s YARN. Josh Patterson, Principal Solutions Architect; Michael Katzenellenbogen, Principal Solutions Architect
    2. ✛ Josh Patterson > Master’s Thesis: self-organizing mesh networks ∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm > Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA) ✛ Michael Katzenellenbogen > Principal Solutions Architect @ Cloudera > Systems Guy (’nuff said)
    3. ✛ Intro / Background ✛ Introducing Knitting Boar ✛ Integrating Knitting Boar and YARN ✛ Results and Lessons Learned
    4. Background and INTRODUCTION
    5. ✛ Why Machine Learning? > Growing interest in predictive modeling ✛ Linear Models are Simple, Useful > Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression ✛ Building Models is Still Time Consuming > The “Need for speed” > “More data beats a cleverer algorithm”
    6. ✛ Parallelize Mahout’s Stochastic Gradient Descent > With as few extra dependencies as possible ✛ Wanted to explore parallel iterative algorithms using YARN > Wanted a first-class Hadoop YARN citizen > Work through dev progressions towards a stable state > Worry about “frameworks” later
    7. ✛ Training > Simple gradient descent procedure > Loss function needs to be convex ✛ Prediction > Logistic Regression: ∗ Sigmoid function using parameter vector (dot) example as exponential parameter [Diagram: Training Data → SGD → Model]
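
The training and prediction steps on this slide can be sketched in a few lines. This is a minimal plain-SGD illustration of logistic regression, not Mahout’s implementation: it assumes a fixed learning rate and dense feature lists, omits Mahout’s per-term learning rates and regularization, and all function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights, example):
    """Prediction: sigmoid of (parameter vector dot example)."""
    return sigmoid(sum(w * x for w, x in zip(weights, example)))


def sgd_step(weights, example, label, learning_rate=0.1):
    """One online update: move the weights along the log-loss gradient."""
    error = label - predict(weights, example)
    return [w + learning_rate * error * x for w, x in zip(weights, example)]


def train(data, n_features, epochs=200, learning_rate=0.1):
    """Sequential training: visit one (example, label) pair at a time."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for example, label in data:
            weights = sgd_step(weights, example, label, learning_rate)
    return weights
```

Because the log loss is convex, this simple procedure moves towards a global optimum; the first feature in each example can be fixed at 1.0 to act as a bias term.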
    8. ✛ Currently Single Process > Multi-threaded parallel, but not cluster parallel > Runs locally, not deployed to the cluster ✛ Defined in: > regression.html
    9. Current Limitations ✛ Sequential algorithms on a single node only go so far ✛ The “Data Deluge” > Presents algorithmic challenges when combined with large data sets > Need to design algorithms that are able to perform in a distributed fashion ✛ MapReduce only fits certain types of algorithms
    10. Distributed Learning Strategies ✛ Langford, 2007 > Vowpal Wabbit ✛ McDonald 2010 > Distributed Training Strategies for the Structured Perceptron ✛ Dekel 2010 > Optimal Distributed Online Prediction Using Mini-Batches
    11. [Diagram: a MapReduce flow (Input → Map, Map, Map → Reduce, Reduce → Output) shown beside a parallel iterative flow in which Processors run Superstep 1, Superstep 2, …]
    12. “Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.” (Lin, 2012)
    13. Introducing KNITTING BOAR
    14. ✛ Parallel Iterative implementation of SGD on YARN ✛ Workers work on partitions of the data ✛ Master keeps global copy of merged parameter vector
    15. ✛ Each worker is given a split of the total dataset > Similar to a map task ✛ Using a modified OLR > Process N samples in a batch (subset of split) ✛ Batched gradient accumulation updates sent to master node > Gradient influences future model vectors towards better predictions
    16. ✛ Accumulates gradient updates > From batches of worker OLR runs ✛ Produces new global parameter vector > By averaging workers’ vectors ✛ Sends update to all workers > Workers replace local parameter vector with new global parameter vector
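
The worker and master roles described on the last three slides can be simulated in a single process. This is a simplified sketch under stated assumptions, not Knitting Boar’s code: workers run one after another instead of in parallel, each worker sends back its updated local parameter vector instead of an accumulated gradient, and every name here is hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights, example):
    return sigmoid(sum(w * x for w, x in zip(weights, example)))


def train_batch(weights, batch, learning_rate):
    """Worker side: run online logistic regression over one batch of a split."""
    local = list(weights)  # start from the current global vector
    for example, label in batch:
        error = label - predict(local, example)
        local = [w + learning_rate * error * x for w, x in zip(local, example)]
    return local


def average_vectors(vectors):
    """Master side: element-wise average of the workers' parameter vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def polr_train(splits, n_features, batch_size=2, passes=200, learning_rate=0.1):
    """Each split plays the role of one worker; one loop body is one superstep."""
    global_w = [0.0] * n_features
    n_batches = max(math.ceil(len(s) / batch_size) for s in splits)
    for _ in range(passes):
        for b in range(n_batches):
            local_vectors = [
                train_batch(global_w, split[b * batch_size:(b + 1) * batch_size],
                            learning_rate)
                for split in splits
            ]
            # Workers then replace their local vector with this new global one.
            global_w = average_vectors(local_vectors)
    return global_w
```

A worker whose split runs out of batches simply returns the global vector unchanged, so it still takes part in the average for that superstep.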
    17. [Diagram: OnlineLogisticRegression (Training Data → OnlineLogisticRegression → Model) compared with Knitting Boar’s POLR (Training Data → Split 1, Split 2, Split 3 → Worker 1, Worker 2, … Worker N → Partial Models → Master → Global Model)]
    18. Integrating Knitting Boar with YARN
    19. ✛ Yet Another Resource Negotiator ✛ Framework for scheduling distributed applications ✛ Typically runs on top of an HDFS cluster > Though not required, nor is it coupled to HDFS ✛ MRv2 is now a distributed application [Diagram: Clients, Resource Manager, and Node Managers hosting App Mstr and Containers, connected by MapReduce Status, Job Submission, Node Status, and Resource Request]
    20. ✛ High setup / teardown costs ✛ Not designed for super-step operations ✛ Need to refactor the problem to fit MapReduce > We can now just launch a distributed application
    21. ✛ Designed specifically for parallel iterative algorithms on Hadoop > Implemented directly on top of YARN ✛ Intrinsic Parallelism > Easier to focus on problem > Not focusing on the distributed application part
    22. ✛ ComputableMaster > Setup() > Compute() > Complete() ✛ ComputableWorker > Setup() > Compute() [Diagram: groups of Workers feeding a Master, repeated for each superstep]
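
The master and worker roles on this slide can be illustrated with a toy in-process driver. The method names `setup`, `compute`, and `complete` mirror the slide; everything else (the argument lists, the `run_supersteps` driver, and the mean-computing example classes) is a hypothetical Python sketch and should not be read as the library’s actual Java signatures.

```python
from abc import ABC, abstractmethod


class ComputableWorker(ABC):
    @abstractmethod
    def setup(self, split):
        """Receive this worker's partition of the input."""

    @abstractmethod
    def compute(self, global_state):
        """Do one superstep of local work and return an update for the master."""


class ComputableMaster(ABC):
    @abstractmethod
    def setup(self):
        """Initialise the global state."""

    @abstractmethod
    def compute(self, worker_updates):
        """Merge all worker updates into a new global state."""

    @abstractmethod
    def complete(self):
        """Return the final result after the last superstep."""


class MeanWorker(ComputableWorker):
    def setup(self, split):
        self.split = list(split)

    def compute(self, global_state):
        return (sum(self.split), len(self.split))  # partial sum and count


class MeanMaster(ComputableMaster):
    def setup(self):
        self.mean = 0.0

    def compute(self, worker_updates):
        total = sum(s for s, _ in worker_updates)
        count = sum(n for _, n in worker_updates)
        self.mean = total / count
        return self.mean

    def complete(self):
        return self.mean


def run_supersteps(master, workers, splits, n_supersteps):
    """Toy driver: every worker computes, then the master merges (the barrier)."""
    master.setup()
    for worker, split in zip(workers, splits):
        worker.setup(split)
    state = None
    for _ in range(n_supersteps):
        updates = [worker.compute(state) for worker in workers]
        state = master.compute(updates)
    return master.complete()
```

Calling `run_supersteps(MeanMaster(), [MeanWorker(), MeanWorker()], [[1, 2, 3], [4, 5]], 1)` merges the two partial sums into the global mean; the algorithm author only writes the two small classes, and the driver owns the distributed plumbing.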
    23. ✛ Client > Launches the YARN ApplicationMaster ✛ Master > Computes required resources > Obtains resources from YARN > Launches Workers ✛ Workers > Computation on partial data (input split) > Synchronizes with Master
    24. [Stack diagram, top to bottom: Pig, Hive, Scala, Java, Crunch; Algorithms; MapReduce, IterativeReduce, BranchReduce, Giraph, …; HDFS / YARN]
    25. Knitting Boar PERFORMANCE, SCALING, AND RESULTS
    26. [Chart: Input Size vs Processing Time, comparing OLR and POLR]
    27. ✛ Parallel SGD > The Boar is temperamental, experimental ∗ Linear speedup (roughly) ✛ Developing YARN Applications > More complex than just MapReduce > Requires lots of “plumbing” ✛ IterativeReduce > Great native-Hadoop way to implement algorithms > Easy to use and well integrated
    28. ✛ Knitting Boar > 100% Java > ASF 2.0 Licensed > Quick Start ✛ IterativeReduce > [ coming soon ]
    29. The Road Ahead ✛ SGD > More testing > Demo use cases ✛ IterativeReduce > Reliability > Durability
    30. ✛ Mahout’s SGD implementation > gression.pdf ✛ Hadoop AllReduce and Terascale Learning ✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!
    31. ✛ Langford ✛ Zinkevich, 2011 > h11Parallelized.pdf ✛ McDonald, 2010 ✛ Dekel, 2010
    32. ✛ photos-of-mount-everest-pictures.jpg ✛ medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg ✛ content/uploads/2010/06/Pictures_-_Misc_-_Knitting_Needles.jpg