Strata + Hadoop World 2012: Knitting Boar

In this session, we will introduce “Knitting Boar”, an open-source Java library for performing distributed online learning on a Hadoop cluster under YARN. We will give an overview of how Woven Wabbit works and examine the lessons learned from YARN application construction.

Speaker Notes

  • Vorpal: doesn't natively run on Hadoop. Spark: Scala, overhead, integration issues.
  • “Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” (Bottou, 2010.) SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss in model accuracy.
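
For reference, the update rule that note refers to is the textbook SGD step (standard formulation, nothing Knitting Boar specific): at step t, draw a single example (x_t, y_t) and set

    \[
    w_{t+1} \;=\; w_t \;-\; \eta_t \, \nabla_w \, \ell(w_t;\, x_t, y_t),
    \]

where \(\eta_t\) is the learning rate and \(\ell\) is the per-example loss. Each step touches one example instead of the whole dataset, so the per-iteration cost is independent of dataset size; that is where the orders-of-magnitude speedup over batch learners comes from.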
  • The most important additions in Mahout's SGD are: confidence-weighted learning rates per term, evolutionary tuning of hyper-parameters, mixed ranking and regression, and grouped AUC. The implication of it being local is that you are limited to the compute capacity of the local machine, as opposed to even a single machine on the cluster.
  • At current disk bandwidth and capacity (2 TB at 100 MB/s throughput), it takes roughly 6 hours just to read the contents of a single hard drive (2 × 10⁶ MB ÷ 100 MB/s = 20,000 s ≈ 5.6 h).
  • Bottou, in the 2010 paper, is similar to Xu (2010).
  • Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data, such as iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
  • Some of these are in progress towards being ready on YARN, some are not; we wanted to focus on OLR and not on a framework for now.
  • “say hello to my leeeeetle friend….”
  • POLR: Parallel Online Logistic Regression. Talking points: we wanted to start with a tool known to the Hadoop community, with expected characteristics; Mahout's SGD is well known, so we used that as a base point.
  • Segue into YARN.
  • Performance is still largely dependent on the implementation of the algorithm.
  • Three major costs of BSP-style computations: the maximum unit compute time, the cost of global communication, and the cost of the barrier sync at the end of each superstep (written out below).
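
Written out (this formalization is mine, not from the deck), the wall-clock time of one superstep across W workers is

    \[
    T_{\text{superstep}} \;=\; \max_{1 \le i \le W} T_{\text{compute}}(i) \;+\; T_{\text{comm}} \;+\; T_{\text{barrier}},
    \]

so the slowest worker dictates the compute term of every superstep, which is why BSP-style systems are sensitive to skewed partitions and stragglers.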
  • Multi-dimensional: you need to constantly think about the Client, the Master, and the Worker, how they interact, and the implications of failures, etc.
  • Basecamp: use the story of how we get to base camp to see how to climb some more.

Presentation Transcript

  • KNITTING BOAR: Building Machine Learning Tools with Hadoop's YARN. Josh Patterson, Principal Solutions Architect; Michael Katzenellenbogen, Principal Solutions Architect
  • ✛ Josh Patterson - josh@cloudera.com > Master's Thesis: self-organizing mesh networks ∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm > Conceived, built, and led Hadoop integration for the openPDC project at Tennessee Valley Authority (TVA) ✛ Michael Katzenellenbogen - michael@cloudera.com > Principal Solutions Architect @ Cloudera > Systems Guy ('nuff said)
  • ✛ Intro / Background ✛ Introducing Knitting Boar ✛ Integrating Knitting Boar and YARN ✛ Results and Lessons Learned
  • Background and INTRODUCTION
  • ✛ Why Machine Learning? > Growing interest in predictive modeling ✛ Linear Models are Simple, Useful > Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression ✛ Building Models Is Still Time Consuming > The “need for speed” > “More data beats a cleverer algorithm”
  • ✛ Parallelize Mahout's Stochastic Gradient Descent > With as few extra dependencies as possible ✛ Wanted to explore parallel iterative algorithms using YARN > Wanted a first-class Hadoop-YARN citizen > Work through dev progressions towards a stable state > Worry about “frameworks” later
  • ✛ Training > Simple gradient descent procedure > Loss function needs to be convex ✛ Prediction > Logistic Regression: ∗ Sigmoid function using (parameter vector · example) as the exponent (slide diagram: Training Data → SGD → Model parameter; see the sketch below)
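
A minimal sketch in Java of what that slide describes: prediction is the sigmoid of the dot product of the parameter vector and the example, and training is one gradient step on the (convex) log-loss. The class and method names here (LogisticRegressionSketch, predict, trainOne) are illustrative only, not Knitting Boar's or Mahout's API.

    /** Illustrative sketch only; not the actual Knitting Boar or Mahout API. */
    public final class LogisticRegressionSketch {
        private final double[] w;   // model parameter vector
        private final double eta;   // learning rate

        LogisticRegressionSketch(int numFeatures, double eta) {
            this.w = new double[numFeatures];
            this.eta = eta;
        }

        /** Prediction: sigmoid of (parameter vector . example). */
        double predict(double[] x) {
            double dot = 0.0;
            for (int i = 0; i < w.length; i++) dot += w[i] * x[i];
            return 1.0 / (1.0 + Math.exp(-dot));   // probability in (0, 1)
        }

        /** One SGD step on the convex log-loss for a single labeled example. */
        void trainOne(double[] x, double label) {  // label is 0.0 or 1.0
            double err = label - predict(x);
            for (int i = 0; i < w.length; i++) w[i] += eta * err * x[i];
        }
    }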
  • ✛ Currently Single Process > Multi-threaded parallel, but not cluster parallel > Runs locally, not deployed to the cluster ✛ Defined in: > https://cwiki.apache.org/MAHOUT/logistic-regression.html
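
For concreteness, using that class from the Mahout wiki page looks roughly like this (assuming the Mahout 0.7-era API; the feature values and hyper-parameters below are made up for illustration):

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public final class MahoutOlrExample {
        public static void main(String[] args) {
            // 2 categories (binary), 3 features, L1 prior for regularization
            OnlineLogisticRegression olr =
                new OnlineLogisticRegression(2, 3, new L1())
                    .learningRate(0.1)
                    .lambda(1.0e-4);

            // One labeled training example: target in {0, 1}, dense features
            Vector x = new DenseVector(new double[] {1.0, 0.5, -0.2});
            olr.train(1, x);                   // a single online update

            double p = olr.classifyScalar(x);  // P(category == 1)
            System.out.println("p = " + p);
        }
    }

Note this runs in one process, which is exactly the limitation the next slides address.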
  • Current Limitations ✛ Sequential algorithms on a single node only go so far ✛ The “Data Deluge” > Presents algorithmic challenges when combined with large data sets > Need to design algorithms that are able to perform in a distributed fashion ✛ MapReduce only fits certain types of algorithms
  • Distributed Learning Strategies ✛ Langford, 2007 > Vowpal Wabbit ✛ McDonald, 2010 > Distributed Training Strategies for the Structured Perceptron ✛ Dekel, 2010 > Optimal Distributed Online Prediction Using Mini-Batches
  • (Diagram: BSP-style execution. Input flows to a row of Processors (Map) in Superstep 1, then to Processors (Reduce) in Superstep 2, and so on to the Output.)
  • “Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.” – Lin, 2012
  • Introducing KNITTING BOAR
  • ✛ Parallel iterative implementation of SGD on YARN ✛ Workers work on partitions of the data ✛ Master keeps a global copy of the merged parameter vector
  • ✛ Each worker is given a split of the total dataset > Similar to a map task ✛ Uses a modified OLR > Processes N samples in a batch (a subset of its split) ✛ Batched gradient accumulation updates are sent to the master node > The gradient influences future model vectors towards better predictions
  • ✛ Accumulates gradient updates > From batches of worker OLR runs ✛ Produces a new global parameter vector > By averaging the workers' vectors ✛ Sends the update to all workers > Workers replace their local parameter vector with the new global parameter vector (sketched in code below)
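
A minimal single-JVM sketch of the exchange the last two slides describe; all names here (PolrSketch, workerPass, masterAverage) are illustrative, and the real Knitting Boar runs this across YARN containers rather than in one process.

    import java.util.List;

    /** Illustrative sketch of the POLR worker/master exchange; not Knitting Boar's API. */
    final class PolrSketch {
        static final int DIM = 4;        // number of features
        static final double ETA = 0.1;   // learning rate

        /** Worker: OLR over one batch of its split, starting from the global model. */
        static double[] workerPass(double[] globalW, double[][] batchX, double[] batchY) {
            double[] w = globalW.clone();
            for (int n = 0; n < batchX.length; n++) {      // one online update per sample
                double dot = 0.0;
                for (int i = 0; i < DIM; i++) dot += w[i] * batchX[n][i];
                double err = batchY[n] - 1.0 / (1.0 + Math.exp(-dot));
                for (int i = 0; i < DIM; i++) w[i] += ETA * err * batchX[n][i];
            }
            return w;                                      // sent to the master after the batch
        }

        /** Master: average the workers' vectors into the new global parameter vector. */
        static double[] masterAverage(List<double[]> workerVectors) {
            double[] avg = new double[DIM];
            for (double[] w : workerVectors)
                for (int i = 0; i < DIM; i++) avg[i] += w[i] / workerVectors.size();
            return avg;                                    // broadcast back to every worker
        }
    }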
  • (Diagram: Knitting Boar's POLR. Training Data is divided into Split 1, Split 2, Split 3; Worker 1, Worker 2 … Worker N each run OnlineLogisticRegression over their split to produce a Partial Model; the OnlineLogisticRegression Master merges the partial models into the Global Model.)
  • Integrating Knitting Boar with YARN
  • ✛ Yet Another Resource Negotiator ✛ Framework for scheduling distributed applications ✛ Typically runs on top of an HDFS cluster > Though not required, nor is it coupled to HDFS ✛ MRv2 is now a distributed application (Diagram: YARN architecture. Clients submit jobs to the Resource Manager; Node Managers host App Masters and Containers; arrows show MapReduce status, job submission, node status, and resource requests.)
  • ✛ High setup / teardown costs ✛ Not designed for super-step operations ✛ Need to refactor the problem to fit MapReduce > We can now just launch a distributed application
  • ✛ Designed specifically for parallel iterative algorithms on Hadoop > Implemented directly on top of YARN ✛ Intrinsic Parallelism > Easier to focus on the problem > Not focusing on the distributed application part
  • ✛ ComputableMaster > Setup() > Compute() > Complete() ✛ ComputableWorker > Setup() > Compute() (Diagram: rows of Workers synchronizing through a Master on each pass, repeated …; see the interface sketch below)
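
Based on the method names on that slide, the two abstractions might look roughly like this in Java; the generic parameter and the signatures are guesses for illustration, not IterativeReduce's actual declarations.

    import java.util.List;

    /** @param <T> an application-defined update, e.g. a parameter vector */
    interface ComputableMaster<T> {
        void setup();                     // one-time initialization
        T compute(List<T> workerUpdates); // merge worker results into new global state
        void complete();                  // emit the final model when training ends
    }

    interface ComputableWorker<T> {
        void setup();                     // e.g. open this worker's input split
        T compute(T globalState);         // one pass over a batch, return an update
    }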
  • ✛ Client > Launches the YARN ApplicationMaster ✛ Master > Computes required resources > Obtains resources from YARN > Launches Workers ✛ Workers > Computation on partial data (input split) > Synchronizes with the Master
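
For context, launching an ApplicationMaster from a client looks roughly like this with the stock YARN client API (shown as in later Hadoop 2.x releases, which may differ from the 2012-era API the talk used; the AM command string and resource sizes are placeholders, not Knitting Boar's actual client code):

    import java.util.Collections;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    public final class ClientSketch {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // Ask the ResourceManager for a new application id
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
            ctx.setApplicationName("KnittingBoar");

            // Describe the container that will run the ApplicationMaster
            ContainerLaunchContext amContainer =
                Records.newRecord(ContainerLaunchContext.class);
            amContainer.setCommands(Collections.singletonList(
                "$JAVA_HOME/bin/java com.example.AppMaster"));  // placeholder command
            ctx.setAMContainerSpec(amContainer);

            // Resources the AM container needs
            Resource capability = Records.newRecord(Resource.class);
            capability.setMemorySize(512);  // MB (Hadoop 2.8+ setter)
            capability.setVirtualCores(1);
            ctx.setResource(capability);

            yarnClient.submitApplication(ctx);  // the RM then launches the AM
        }
    }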
  • (Diagram: the stack. Pig, Hive, Scala, Java, Crunch on top; an Algorithms layer; then MapReduce, IterativeReduce, BranchReduce, Giraph, … side by side; HDFS / YARN at the bottom.)
  • Knitting Boar: PERFORMANCE, SCALING, AND RESULTS
  • (Chart: Input Size vs Processing Time. Y-axis: processing time, 0 to 300; X-axis: input size, 4.1 through 41; two series, OLR and POLR.)
  • ✛ Parallel SGD > The Boar is temperamental, experimental ∗ Linear speedup (roughly) ✛ Developing YARN Applications > More complex than just MapReduce > Requires lots of “plumbing” ✛ IterativeReduce > Great native-Hadoop way to implement algorithms > Easy to use and well integrated
  • ✛ Knitting Boar > 100% Java > ASF 2.0 Licensed > https://github.com/jpatanooga/KnittingBoar > Quick Start ∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start ✛ IterativeReduce > [ coming soon ]
  • The Road Ahead ✛ SGD > More testing > Demo use cases ✛ IterativeReduce > Reliability > Durability (Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg)
  • ✛ Mahout's SGD implementation > http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf ✛ Hadoop AllReduce and Terascale Learning > http://hunch.net/?p=2094 ✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail! > http://arxiv.org/pdf/1209.2191v1.pdf
  • ✛ Langford > http://hunch.net/~vw/ ✛ Zinkevich, 2011 > http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf ✛ McDonald, 2010 > http://dl.acm.org/citation.cfm?id=1858068 ✛ Dekel, 2010 > http://arxiv.org/pdf/1012.1367.pdf
  • ✛ http://eteamjournal.files.wordpress.com/2011/03/photos-of-mount-everest-pictures.jpg ✛ http://images.fineartamerica.com/images-medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg ✛ http://agileknitter.com/wp-content/uploads/2010/06/Pictures_-_Misc_-_Knitting_Needles.jpg