Strata + Hadoop World 2012: Knitting Boar

Cloudera, Inc.
KNITTING BOAR
    Building Machine Learning Tools with Hadoop's YARN




    Josh Patterson
    Principal Solutions Architect

    Michael Katzenellenbogen
    Principal Solutions Architect




✛ Josh Patterson - josh@cloudera.com
   > Master's Thesis: self-organizing mesh networks
       ∗   Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
   > Conceived, built, and led Hadoop integration for the openPDC project
     at the Tennessee Valley Authority (TVA)


✛ Michael Katzenellenbogen - michael@cloudera.com
   > Principal Solutions Architect @ Cloudera
   > Systems Guy ('nuff said)
✛ Intro / Background
✛ Introducing Knitting Boar
✛ Integrating Knitting Boar and YARN
✛ Results and Lessons Learned
Background and
    INTRODUCTION




✛   Why Machine Learning?
    >   Growing interest in predictive modeling

✛   Linear Models are Simple, Useful
    >   Stochastic Gradient Descent is a very popular tool for
        building linear models like Logistic Regression

✛   Building Models Is Still Time Consuming
    >   The “Need for speed”
    >   “More data beats a cleverer algorithm”
✛ Parallelize Mahout’s Stochastic Gradient Descent
  >   With as few extra dependencies as possible

✛ Wanted to explore parallel iterative algorithms
   using YARN
  >   Wanted a first-class Hadoop YARN citizen
  >   Work through dev progressions towards a stable state
  >   Worry about “frameworks” later
✛ Training
  > Simple gradient descent procedure
  > Loss function needs to be convex
✛ Prediction
  > Logistic Regression:
    ∗ Sigmoid function using (parameter vector · example) as the
      exponential parameter

[Diagram: Training Data → SGD → Model]
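The training and prediction steps above can be sketched in plain Java. This is a hypothetical minimal illustration of SGD for logistic regression, not Mahout's actual OnlineLogisticRegression; the class and method names (`LogisticSGD`, `step`) are made up:

```java
// Minimal sketch: prediction is the sigmoid of (parameter vector dot example),
// and each training step nudges the parameters along the gradient of the
// (convex) log loss.
public class LogisticSGD {
    public static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static double dot(double[] w, double[] x) {
        double s = 0.0;
        for (int i = 0; i < w.length; i++) s += w[i] * x[i];
        return s;
    }

    /** One stochastic gradient step for logistic regression. */
    public static void step(double[] w, double[] x, double label, double learningRate) {
        double predicted = sigmoid(dot(w, x));
        double error = label - predicted;          // gradient of log loss w.r.t. the margin
        for (int i = 0; i < w.length; i++) {
            w[i] += learningRate * error * x[i];   // move toward the loss minimum
        }
    }

    public static void main(String[] args) {
        double[] w = new double[2];
        // Toy data: second feature is a bias term; label 1 when x[0] > 0.
        double[][] xs = {{1, 1}, {2, 1}, {-1, 1}, {-2, 1}};
        double[] ys = {1, 1, 0, 0};
        for (int epoch = 0; epoch < 1000; epoch++)
            for (int i = 0; i < xs.length; i++)
                step(w, xs[i], ys[i], 0.1);
        System.out.println(sigmoid(dot(w, xs[0])) > 0.5);  // prints true
    }
}
```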
✛ Currently Single Process
      > Multi-threaded parallel, but not cluster parallel
      > Runs locally, not deployed to the cluster
    ✛ Defined in:
      > https://cwiki.apache.org/MAHOUT/logistic-regression.html
Current Limitations
    ✛ Sequential algorithms on a single node only
      goes so far
    ✛ The “Data Deluge”
      > Presents algorithmic challenges when combined with
        large data sets
      > Need to design algorithms that are able to perform in
        a distributed fashion
    ✛ MapReduce only fits certain types of algorithms
Distributed Learning Strategies
 ✛ Langford, 2007
    > Vowpal Wabbit
 ✛ McDonald 2010
   > Distributed Training Strategies for the Structured
     Perceptron
 ✛ Dekel 2010
   > Optimal Distributed Online Prediction Using Mini-Batches
[Diagram: MapReduce data flow (Input → Map → Reduce → Output) contrasted with
BSP-style computation (rows of Processors synchronized at Superstep 1,
Superstep 2, …)]
“Are the gains gotten from using X worth the integration costs incurred in
     building the end-to-end solution?

     If no, then operationally, we can consider the Hadoop stack …

     there are substantial costs in knitting together a patchwork of different
     frameworks, programming models, etc.”

     –– Lin, 2012




Introducing
KNITTING BOAR




✛ Parallel Iterative implementation of SGD on
     YARN

 ✛ Workers work on partitions of the data
 ✛ Master keeps global copy of merged parameter
     vector




✛ Each given a split of the total dataset
   > Similar to a map task
 ✛ Using a modified OLR
   > process N samples in a batch (subset of split)
 ✛ Batched gradient accumulation updates sent to
     master node
     > Gradient influences future model vectors towards
       better predictions
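The worker's role described above can be sketched as follows. This is a hedged illustration, not Knitting Boar's actual API; the class and method names (`BatchedWorker`, `drainBatch`, `replaceParams`) are invented for this example:

```java
import java.util.Arrays;

// Sketch: a worker runs online logistic regression over its input split,
// accumulating the gradient contributions of a batch of N samples before
// shipping the accumulated update to the master.
public class BatchedWorker {
    private final double[] localParams;
    private final double[] accumulatedGradient;

    public BatchedWorker(int dim) {
        this.localParams = new double[dim];
        this.accumulatedGradient = new double[dim];
    }

    private static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    /** Train on one sample; the gradient is applied locally AND accumulated. */
    public void train(double[] x, double label, double learningRate) {
        double margin = 0.0;
        for (int i = 0; i < x.length; i++) margin += localParams[i] * x[i];
        double error = label - sigmoid(margin);
        for (int i = 0; i < x.length; i++) {
            double g = learningRate * error * x[i];
            localParams[i] += g;            // local online update
            accumulatedGradient[i] += g;    // batched update destined for the master
        }
    }

    /** After N samples, the batch of accumulated updates is sent to the master. */
    public double[] drainBatch() {
        double[] out = Arrays.copyOf(accumulatedGradient, accumulatedGradient.length);
        Arrays.fill(accumulatedGradient, 0.0);
        return out;
    }

    /** The master's new global vector replaces the worker's local one. */
    public void replaceParams(double[] global) {
        System.arraycopy(global, 0, localParams, 0, localParams.length);
    }

    public double[] params() { return localParams.clone(); }
}
```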
✛ Accumulates gradient updates
   > From batches of worker OLR runs
✛ Produces new global parameter vector
   > By averaging workers' vectors
✛ Sends update to all workers
   > Workers replace local parameter vector with new
     global parameter vector
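The master's merge step is just an element-wise average. A small illustrative sketch (again, not the actual Knitting Boar code):

```java
// Sketch of the master's role: average the workers' parameter vectors into a
// new global vector, which is then broadcast back so every worker replaces
// its local copy.
public class ParameterAveragingMaster {
    /** Element-wise average of the workers' parameter vectors. */
    public static double[] average(double[][] workerVectors) {
        int dim = workerVectors[0].length;
        double[] global = new double[dim];
        for (double[] v : workerVectors)
            for (int i = 0; i < dim; i++)
                global[i] += v[i];
        for (int i = 0; i < dim; i++)
            global[i] /= workerVectors.length;
        return global;
    }
}
```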
[Diagram: OnlineLogisticRegression (Training Data → Model) vs. Knitting Boar's
POLR (Split 1 … Split 3 → Worker 1 … Worker N → Partial Models → Master →
Global Model)]
Integrating Knitting Boar with
YARN




✛ Yet Another Resource Negotiator

✛ Framework for scheduling distributed applications
✛ Typically runs on top of an HDFS cluster
   > Though not required, nor is it coupled to HDFS
✛ MRv2 is now a distributed application

[Diagram: YARN architecture — Clients submit jobs to the Resource Manager;
Node Managers host Containers and App Masters; arrows show MapReduce status,
job submission, node status, and resource requests]
✛ High setup / teardown costs
 ✛ Not designed for super-step operations
 ✛ Need to refactor the problem to fit MapReduce
   > We can now just launch a distributed application




✛ Designed specifically for parallel iterative
     algorithms on Hadoop
     > Implemented directly on top of YARN
 ✛ Intrinsic Parallelism
    > Easier to focus on problem
    > Not focusing on the distributed application part




✛ ComputableMaster
   > Setup()
   > Compute()
   > Complete()
✛ ComputableWorker
   > Setup()
   > Compute()

[Diagram: Workers report to the Master each pass; the Worker → Master cycle
repeats]
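The two call-back interfaces named above can be sketched as below, with a toy implementation. The method signatures and generics are guesses for illustration, not Knitting Boar's exact API:

```java
import java.util.List;

// Sketch: the framework calls setup() once, compute() every pass, and (on the
// master) complete() when iteration finishes.
interface ComputableMaster<T> {
    void setup();
    T compute(List<T> workerUpdates);  // merge the workers' partial results
    void complete();
}

interface ComputableWorker<T> {
    void setup();
    T compute();                       // produce this pass's partial result
}

// Toy example: each worker reports a partial sum; the master adds them up.
class SummingMaster implements ComputableMaster<Integer> {
    public void setup() {}
    public Integer compute(List<Integer> workerUpdates) {
        int total = 0;
        for (int u : workerUpdates) total += u;
        return total;
    }
    public void complete() {}
}
```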
✛ Client
   > Launches the YARN ApplicationMaster
 ✛ Master
   > Computes required resources
   > Obtains resources from YARN
   > Launches Workers
 ✛ Workers
   > Computation on partial data (input split)
   > Synchronizes with Master



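A toy, self-contained simulation of this Client/Master/Worker loop (no actual YARN calls; all names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Each "worker" computes on its partial data (here: the mean of its split),
// the "master" merges the worker results, and the merged state is pushed
// back to every worker at the end of each pass.
public class SyncLoopSketch {
    public static double[] run(double[][] partitions, int passes) {
        double[] globalState = {0.0};                 // shared state broadcast each pass
        for (int pass = 0; pass < passes; pass++) {
            List<double[]> updates = new ArrayList<>();
            for (double[] split : partitions) {       // Workers: compute on input split
                double sum = 0.0;
                for (double v : split) sum += v;
                updates.add(new double[]{sum / split.length});
            }
            double merged = 0.0;                      // Master: merge worker results
            for (double[] u : updates) merged += u[0];
            globalState[0] = merged / partitions.length;  // synchronize with workers
        }
        return globalState;
    }
}
```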
[Diagram: the stack — Pig, Hive, Scala, Java, Crunch at the top; Algorithms
below; MapReduce, IterativeReduce, BranchReduce, Giraph, … as execution
frameworks; HDFS / YARN at the base]
Knitting Boar
     PERFORMANCE, SCALING, AND RESULTS




[Chart: Input Size vs Processing Time, comparing OLR and POLR over input
sizes 4.1 through 41, with processing times on a 0 to 300 scale]
✛ Parallel SGD
   > The Boar is temperamental, experimental
       ∗ Linear speedup (roughly)

 ✛ Developing YARN Applications
   > More complex than just MapReduce
   > Requires lots of “plumbing”
 ✛ IterativeReduce
    > Great native-Hadoop way to implement algorithms
    > Easy to use and well integrated



✛ Knitting Boar
   > 100% Java
   > ASF 2.0 Licensed
   > https://github.com/jpatanooga/KnittingBoar
   > Quick Start
       ∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start

 ✛ IterativeReduce
    > [ coming soon ]




The Road Ahead

                  ✛ SGD
                    > More testing
                    > Demo use cases
                  ✛ IterativeReduce
                     > Reliability
                     > Durability



                  Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg



✛ Mahout's SGD implementation
   > http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
✛ Hadoop AllReduce and Terascale Learning
   > http://hunch.net/?p=2094
✛ MapReduce is Good Enough? If All You Have is
   a Hammer, Throw Away Everything That's Not a
   Nail!
   > http://arxiv.org/pdf/1209.2191v1.pdf
✛ Langford
   > http://hunch.net/~vw/
✛ Zinkevich, 2011
   > http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf
✛ McDonald, 2010
   > http://dl.acm.org/citation.cfm?id=1858068
✛ Dekel, 2010
   > http://arxiv.org/pdf/1012.1367.pdf
✛ http://eteamjournal.files.wordpress.com/2011/03/photos-of-mount-everest-pictures.jpg
✛ http://images.fineartamerica.com/images-medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg
✛ http://agileknitter.com/wp-content/uploads/2010/06/Pictures_-_Misc_-_Knitting_Needles.jpg

Editor's Notes

  1. Vowpal: doesn't natively run on Hadoop. Spark: Scala, overhead, integration issues
  2. "Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems." (Bottou, 2010). SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss in model accuracy.
  3. The most important additions in Mahout's SGD are: confidence-weighted learning rates per term, evolutionary tuning of hyper-parameters, mixed ranking and regression, and grouped AUC. The implication of it being local is that you are limited to the compute capacity of the local machine, as opposed to even a single machine on the cluster.
  4. At current disk bandwidth and capacity (2TB at 100MB/s throughput), it takes 6 hours to read the contents of a single HD
  5. Bottou is similar to Xu2010 in the 2010 paper
  6. Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data: iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
  7. Some of these are in progress towards being ready on YARN, some not; wanted to focus on OLR and not the framework for now
  8. "say hello to my leeeeetle friend…."
  9. POLR: Parallel Online Logistic Regression. Talking points: wanted to start with a tool known to the Hadoop community, with expected characteristics. Mahout's SGD is well known, and so we used that as a base point.
  10. Segue into YARN
  11. Performance is still largely dependent on the implementation of the algorithm
  12. 3 major costs of BSP-style computations: max unit compute time, cost of global communication, and cost of the barrier sync at the end of a superstep
  13. Multi-dimensional: need to constantly think about the Client, the Master, and the Worker, how they interact, and the implications of failures, etc.
  14. Basecamp: use the story of how we get to basecamp to see how to climb some more