Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Strata + Hadoop World 2012: Knitting Boar

4,785 views

Published on

In this session, we will introduce “Knitting Boar”, an open-source Java library for performing distributed online learning on a Hadoop cluster under YARN. We will give an overview of how Woven Wabbit works and examine the lessons learned from YARN application construction.

Published in: Technology
  • //DOWNLOAD THIS BOOKS INTO AVAILABLE FORMAT ......................................................................................................................... ......................................................................................................................... //DOWNLOAD PDF EBOOK here { https://tinyurl.com/y8nn3gmc } ......................................................................................................................... //DOWNLOAD EPUB Ebook here { https://tinyurl.com/y8nn3gmc } ......................................................................................................................... //DOWNLOAD doc Ebook here { https://tinyurl.com/y8nn3gmc } ......................................................................................................................... //DOWNLOAD PDF EBOOK here { https://tinyurl.com/y8nn3gmc } ......................................................................................................................... //DOWNLOAD EPUB Ebook here { https://tinyurl.com/y8nn3gmc } ......................................................................................................................... //DOWNLOAD doc Ebook here { https://tinyurl.com/y8nn3gmc } ......................................................................................................................... ......................................................................................................................... ......................................................................................................................... .............. Browse by Genre Available eBooks ......................................................................................................................... Art, Biography, Business, Chick Lit, Children's, Christian, Classics, Comics, Contemporary, Cookbooks, Crime, Ebooks, Fantasy, Fiction, Graphic Novels, Historical Fiction, History, Horror, Humor And Comedy, Manga, Memoir, Music, Mystery, Non Fiction, Paranormal, Philosophy, Poetry, Psychology, Religion, Romance, Science, Science Fiction, Self Help, Suspense, Spirituality, Sports, Thriller, Travel, Young Adult,
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Strata + Hadoop World 2012: Knitting Boar

  1. 1. KNITTING BOAR Building Machine Learning Tools with Hadoop‟s YARN Josh Patterson Principal Solutions Architect Michael Katzenellenbogen Principal Solutions Architect1
  2. 2. ✛ Josh Patterson - josh@cloudera.com > Master‟s Thesis: self-organizing mesh networks ∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm > Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA)✛ Michael Katzenellenbollen - michael@cloudera.com > Principal Solutions Architect @ Cloudera > Systems Guy („nuff said)
  3. 3. ✛ Intro / Background✛ Introducing Knitting Boar✛ Integrating Knitting Boar and YARN✛ Results and Lessons Learned
  4. 4. Background and INTRODUCTION4
  5. 5. ✛ Why Machine Learning? > Growing interest in predictive modeling✛ Linear Models are Simple, Useful > Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression✛ Building Models Still is Time Consuming > The “Need for speed” > “More data beats a cleverer algorithm”
  6. 6. ✛ Parallelize Mahout’s Stochastic Gradient Descent > With as few extra dependencies as possible✛ Wanted to explore parallel iterative algorithms using YARN > Wanted a first class Hadoop-Yarn citizen > Work through dev progressions towards a stable state > Worry about “frameworks” later
  7. 7. ✛ Training Training Data > Simple gradient descent procedure > Loss functions needs to be convex ✛ Prediction SGD > Logistic Regression: ∗ Sigmoid function using parameter vector (dot) example as exponential Model parameter7
  8. 8. ✛ Currently Single Process > Multi-threaded parallel, but not cluster parallel > Runs locally, not deployed to the cluster ✛ Defined in: > https://cwiki.apache.org/MAHOUT/logistic- regression.html8
  9. 9. Current Limitations ✛ Sequential algorithms on a single node only goes so far ✛ The “Data Deluge” > Presents algorithmic challenges when combined with large data sets > need to design algorithms that are able to perform in a distributed fashion ✛ MapReduce only fits certain types of algorithms9
  10. 10. Distributed Learning Strategies ✛ Langford, 2007 > Vowpal Wabbit ✛ McDonald 2010 > Distributed Training Strategies for the Structured Perceptron ✛ Dekel 2010 > Optimal Distributed Online Prediction Using Mini- Batches10
  11. 11. Input Processor Processor Processor Superstep 1 Map Map Map Processor Processor Processor Reduce Reduce Superstep 2 . . . Output11
  12. 12. “Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.” –– Lin, 201212
  13. 13. IntroducingKNITTING BOAR 13
  14. 14. ✛ Parallel Iterative implementation of SGD on YARN ✛ Workers work on partitions of the data ✛ Master keeps global copy of merged parameter vector14
  15. 15. ✛ Each given a split of the total dataset > Similar to a map task ✛ Using a modified OLR > process N samples in a batch (subset of split) ✛ Batched gradient accumulation updates sent to master node > Gradient influences future models vectors towards better predictions15
  16. 16. ✛ Accumulates gradient updates > From batches of worker OLR runs ✛ Produces new global parameter vector > By averaging workers‟ vectors ✛ Sends update to all workers > Workers replace local parameter vector with new global parameter vector16
  17. 17. OnlineLogisticRegression Knitting Boar‟s POLR Split 1 Split 2 Split 3 Training Data Worker 1 Worker 2 … Worker N Partial Model Partial Model Partial Model OnlineLogisticRegression Master Model Global Model17
  18. 18. Integrating Knitting Boar withYARN18
  19. 19. ✛ Yet Another Resource Negotiator ✛ Framework for scheduling distributed applications ✛ Typically runs on top of an HDFS cluster > Though not required, nor is it coupled to HDFS Node Manager ✛ MRv2 is now a Container App Mstr distributed application Client Resource Node Manager Manager Client App Mstr Container MapReduce Status Node Manager Job Submission Node Status Resource Request Container Container19
  20. 20. ✛ High setup / teardown costs ✛ Not designed for super-step operations ✛ Need to refactor the problem to fit MapReduce > We can now just launch a distributed application20
  21. 21. ✛ Designed specifically for parallel iterative algorithms on Hadoop > Implemented directly on top of YARN ✛ Intrinsic Parallelism > Easier to focus on problem > Not focusing on the distributed application part21
  22. 22. ✛ ComputableMaster Worker Worker Worker > Setup() > Compute() Master > Complete() ✛ ComputableWorker Worker Worker Worker > Setup() Master > Compute() . . .22
  23. 23. ✛ Client > Launches the YARN ApplicationMaster ✛ Master > Computes required resources > Obtains resources from YARN > Launches Workers ✛ Workers > Computation on partial data (input split) > Synchronizes with Master23
  24. 24. Pig, Hive, Scala, Java, Crunch Algorithms MapReduce IterativeReduce BranchReduce Giraph … HDFS / YARN24
  25. 25. Knitting Boar PERFORMANCE, SCALING, AND RESULTS25
  26. 26. 300 250 200 150 OLR POLR 100 50 0 4.1 8.2 12.3 16.4 20.5 24.6 28.7 32.8 36.9 41 Input Size vs Processing Time26
  27. 27. ✛ Parallel SGD > The Boar is temperamental, experimental ∗ Linear speedup (roughly) ✛ Developing YARN Applications > More complex the just MapReduce > Requires lots of “plumbing” ✛ IterativeReduce > Great native-Hadoop way to implement algorithms > Easy to use and well integrated27
  28. 28. ✛ Knitting Boar > 100% Java > ASF 2.0 Licensed > https://github.com/jpatanooga/KnittingBoar > Quick Start ∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start ✛ IterativeReduce > [ coming soon ]28
  29. 29. The Road Ahead ✛ SGD > More testing > Demo use cases ✛ IterativeReduce > Reliability > Durability Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg29
  30. 30. ✛ Mahout‟s SGD implementation > http://lingpipe.files.wordpress.com/2008/04/lazysgdre gression.pdf ✛ Hadoop AllReduce and Terascale Learning > http://hunch.net/?p=2094 ✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That‟s Not a Nail! > http://arxiv.org/pdf/1209.2191v1.pdf30
  31. 31. ✛ Langford > http://hunch.net/~vw/ ✛ Zinkevick, 2011 > http://www.research.rutgers.edu/~lihong/pub/Zinkevic h11Parallelized.pdf ✛ McDonald, 2010 > http://dl.acm.org/citation.cfm?id=1858068 ✛ Dekel, 2010 > http://arxiv.org/pdf/1012.1367.pdf31
  32. 32. ✛ http://eteamjournal.files.wordpress.com/2011/03/ photos-of-mount-everest-pictures.jpg ✛ http://images.fineartamerica.com/images- medium-large/-say-hello-to-my-little-friend--luis- ludzska.jpg ✛ http://agileknitter.com/wp- content/uploads/2010/06/Pictures_-_Misc_- _Knitting_Needles.jpg32

×