MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN


Online learning techniques, such as Stochastic Gradient Descent (SGD), are powerful when applied to risk minimization and convex games on large problems. However, their sequential design prevents them from taking advantage of newer distributed frameworks such as Hadoop/MapReduce. In this session, we look at how we parallelize parameter estimation for linear models on the next-gen YARN framework IterativeReduce and the parallel machine learning library Metronome. We also look at non-linear modeling with the introduction of parallel neural network training in Metronome.

Published in: Technology, Education

  • Talk about how you normally would use the Normal equation, notes from Andrew Ng
  • “Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” (Bottou, 2010). SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss in model accuracy.
  • The most important additions in Mahout’s SGD are: confidence-weighted learning rates per term; evolutionary tuning of hyper-parameters; mixed ranking and regression; grouped AUC. The implication of it being local is that you are limited to the compute capacity of the local machine, as opposed to even a single machine on the cluster.
  • Bottou similar to Xu2010 in the 2010 paper
  • Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data, such as iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
  • POLR: Parallel Online Logistic Regression. Talking points: we wanted to start with a tool known to the Hadoop community, with expected characteristics. Mahout’s SGD is well known, so we used that as a base point.
  • 3 major costs of BSP-style computations: max unit compute time; cost of global communication; cost of barrier sync at end of superstep.
  • TODO: add in diagram of biological neuron

    1. 1. Josh Patterson. Email: Twitter: @jpatanooga Github: atanooga. Past: published in IAAI-09, “TinyTermite: A Secure Routing Algorithm”; grad work in meta-heuristics and ant algorithms; Tennessee Valley Authority (TVA), Hadoop and the Smartgrid; Cloudera, Principal Solution Architect. Today: Consultant.
    2. 2. Sections 1. Parallel Iterative Algorithms 2. Parallel Neural Networks 3. Future Directions
    3. 3. Machine Learning and Optimization. Direct methods: Normal Equation. Iterative methods: Newton’s Method, Quasi-Newton, Gradient Descent. Heuristics: AntNet, PSO, Genetic Algorithms.
    4. 4. Linear Regression. In linear regression, data is modeled using linear predictor functions, and unknown model parameters are estimated from the data. We use optimization techniques like Stochastic Gradient Descent to find the coefficients in the model Y = (1*x0) + (c1*x1) + … + (cN*xN).
    5. 5. Stochastic Gradient Descent. Hypothesis about data; cost function; update function. Andrew Ng’s Tutorial: /11
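For the linear regression case used elsewhere in the deck, the three pieces named on this slide are commonly written as below. The notation (θ for the coefficients, α for the learning rate, m for the number of training examples, a superscript (i) for the i-th example) is an assumption taken from the standard textbook treatment, not from the slides themselves.

```latex
% Hypothesis: linear combination of coefficients and inputs
h_\theta(x) = \sum_{j=0}^{N} \theta_j x_j

% Cost function: mean squared error over the m training examples
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

% Stochastic update: applied after each single example (x^{(i)}, y^{(i)})
\theta_j \leftarrow \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```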
    6. 6. Stochastic Gradient Descent. Training: a simple gradient descent procedure; the loss function needs to be convex (with exceptions). Linear regression: the loss function is the squared error of the prediction; the prediction is a linear combination of coefficients and input variables. [Diagram labels: Training Data, SGD, Model.]
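A minimal sketch of the training procedure this slide describes, for linear regression with squared-error loss. This is not Mahout's or Metronome's code: the class and method names are hypothetical, the toy data is made up, and examples are visited in a fixed order (rather than shuffled) so that the run is deterministic.

```java
class SgdLinearRegressionSketch {

    // Prediction: linear combination of coefficients and input variables.
    static double predict(double[] coef, double[] x) {
        double sum = 0.0;
        for (int j = 0; j < coef.length; j++) {
            sum += coef[j] * x[j];
        }
        return sum;
    }

    // One SGD pass: the coefficients are updated after every single example.
    static void sgdEpoch(double[] coef, double[][] xs, double[] ys, double learningRate) {
        for (int i = 0; i < xs.length; i++) {
            // Gradient of the per-example loss 0.5 * (prediction - y)^2.
            double error = predict(coef, xs[i]) - ys[i];
            for (int j = 0; j < coef.length; j++) {
                coef[j] -= learningRate * error * xs[i][j];
            }
        }
    }

    static double[] train(double[][] xs, double[] ys, int epochs, double learningRate) {
        double[] coef = new double[xs[0].length]; // start from all zeros
        for (int e = 0; e < epochs; e++) {
            sgdEpoch(coef, xs, ys, learningRate);
        }
        return coef;
    }

    public static void main(String[] args) {
        // Toy data generated from y = 2 + 3 * x1; x0 = 1 is the intercept input.
        double[][] xs = {{1, 0}, {1, 1}, {1, 2}, {1, 3}};
        double[] ys = {2, 5, 8, 11};
        double[] coef = train(xs, ys, 2000, 0.05);
        System.out.printf("c0=%.3f c1=%.3f%n", coef[0], coef[1]);
    }
}
```

The learning rate of 0.05 was picked by hand for this toy data; too large a value makes the updates diverge.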
    7. 7. Mahout’s SGD. Currently single process: multi-threaded parallel, but not cluster parallel. Runs locally, not deployed to the cluster. Tied to the logistic regression implementation.
    8. 8. Distributed Learning Strategies. McDonald, 2010: Distributed Training Strategies for the Structured Perceptron. Langford, 2007: Vowpal Wabbit. Jeff Dean’s work on parallel SGD: DownPour SGD.
    9. 9. MapReduce vs. Parallel Iterative. [Diagram: a MapReduce job (Input, three Map tasks, Reduce, Output) beside a parallel iterative job in which a set of Processors runs Superstep 1, Superstep 2, and so on.]
    10. 10. YARN: Yet Another Resource Negotiator. Framework for scheduling distributed applications. Allows any type of parallel application to run natively on Hadoop. MRv2 is now a distributed application. [Diagram: Clients, Resource Manager, and Node Managers hosting Containers and App Masters; arrows labeled MapReduce Status, Job Submission, Node Status, Resource Request.]
    11. 11. IterativeReduce API. ComputableMaster: Setup(), Compute(), Complete(). ComputableWorker: Setup(), Compute(). [Diagram: Workers feeding a Master, repeated across iterations.]
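    12. 12. SGD: Serial vs Parallel. [Diagram: in the serial case, Training Data produces a Model; in the parallel case, Training Data is divided into Split 1, Split 2, Split 3, and Worker 1, Worker 2, … Worker N each produce a Partial Model that the Master combines into a Global Model.]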
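The worker/master split on this slide can be sketched in a single process as below. This is an illustration under stated assumptions, not the IterativeReduce API: the class, `workerCompute`, `masterCompute`, and `train` are hypothetical names, the workers run one after another instead of in parallel, and the master combines partial models by plain parameter averaging (the deck's "Lessons Learned" slide mentions parameter averaging variations).

```java
import java.util.ArrayList;
import java.util.List;

class ParameterAveragingSketch {

    // Worker side: start from the global model, run SGD over the local split,
    // and hand back the resulting partial model.
    static double[] workerCompute(double[] globalModel, double[][] xs, double[] ys, double lr) {
        double[] partial = globalModel.clone();
        for (int i = 0; i < xs.length; i++) {
            double prediction = 0.0;
            for (int j = 0; j < partial.length; j++) {
                prediction += partial[j] * xs[i][j];
            }
            double error = prediction - ys[i];
            for (int j = 0; j < partial.length; j++) {
                partial[j] -= lr * error * xs[i][j];
            }
        }
        return partial;
    }

    // Master side: average the partial models into the next global model.
    static double[] masterCompute(List<double[]> partials) {
        double[] global = new double[partials.get(0).length];
        for (double[] partial : partials) {
            for (int j = 0; j < global.length; j++) {
                global[j] += partial[j];
            }
        }
        for (int j = 0; j < global.length; j++) {
            global[j] /= partials.size();
        }
        return global;
    }

    // Each superstep: every worker computes on its split, then the master averages.
    static double[] train(double[][][] splitXs, double[][] splitYs, int supersteps, double lr) {
        double[] global = new double[splitXs[0][0].length];
        for (int s = 0; s < supersteps; s++) {
            List<double[]> partials = new ArrayList<>();
            for (int w = 0; w < splitXs.length; w++) {
                // Sequential stand-in for workers that would run on separate hosts.
                partials.add(workerCompute(global, splitXs[w], splitYs[w], lr));
            }
            global = masterCompute(partials);
        }
        return global;
    }
}
```

Running the workers sequentially changes only the wall-clock time, not the result, because each worker reads the same global model and touches only its own split.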
    13. 13. Parallel Iterative Algorithms on YARN. Based directly on work we did with Knitting Boar (parallel logistic regression), and then added parallel linear regression and parallel neural networks. Packaged in a new suite of parallel iterative algorithms called Metronome. 100% Java, ASF 2.0 licensed, on github.
    14. 14. Linear Regression Results. [Chart: “Linear Regression - Parallel vs Serial”. Y axis: total processing time. X axis: total megabytes processed (64 to 320). Series: Parallel Runs, Serial Runs.]
    15. 15. Logistic Regression: 20Newsgroups. [Chart: “Input Size vs Processing Time”. Series: OLR, POLR.]
    16. 16. Convergence Testing. Debugging parallel iterative algorithms during testing is hard: processes on different hosts are difficult to observe. Using the unit test framework IRUnit, we can simulate the IterativeReduce framework. We know the plumbing of message passing works. This allows us to focus on parallel algorithm design and testing while still using standard debugging tools.
    17. 17. What are Neural Networks? Inspired by nervous systems in biological systems. Models layers of neurons in the brain. Can learn non-linear functions. Recently enjoying a surge in popularity.
    18. 18. Multi-Layer Perceptron. First layer has input neurons. Last layer has output neurons. Each neuron in a layer is connected to all neurons in the next layer. Each neuron has an activation function, typically sigmoid / logistic. Input to a neuron is the sum of weight * input over its incoming connections.
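The forward computation described on this slide can be sketched as follows. The names are hypothetical and this is not Metronome's network code; the per-neuron bias term is an assumption added for completeness (the slide does not mention it) and can be set to zero.

```java
class MlpForwardSketch {

    // Sigmoid / logistic activation function.
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // One fully connected layer: weights[j][i] is the weight on the connection
    // from input i to neuron j. The bias per neuron is an assumption.
    static double[] layerForward(double[][] weights, double[] biases, double[] inputs) {
        double[] activations = new double[weights.length];
        for (int j = 0; j < weights.length; j++) {
            double sum = biases[j];
            for (int i = 0; i < inputs.length; i++) {
                sum += weights[j][i] * inputs[i]; // sum of weight * input
            }
            activations[j] = sigmoid(sum);
        }
        return activations;
    }

    // Feed the input through every layer in order; the last layer's
    // activations are the network output.
    static double[] forward(double[][][] weights, double[][] biases, double[] inputs) {
        double[] activations = inputs;
        for (int layer = 0; layer < weights.length; layer++) {
            activations = layerForward(weights[layer], biases[layer], activations);
        }
        return activations;
    }
}
```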
    19. 19. Backpropagation Learning. Calculates the gradient of the error of the network with respect to the network's modifiable weights. Intuition: run a forward pass of the example through the network, computing activations and output. Then, iterating from the output layer back to the input layer (backwards), for each neuron in the layer: compute the node’s responsibility for the error, and update the weights on its connections.
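A minimal sketch of one backpropagation step, following the intuition on this slide. It assumes a deliberately tiny network (one hidden layer, a single sigmoid output, no bias terms) and a squared-error loss; the class and method names are hypothetical and this is not Metronome's implementation.

```java
class BackpropSketch {
    final double[][] hiddenWeights; // [hidden neuron][input]
    final double[] outputWeights;   // [hidden neuron], feeding one output neuron

    BackpropSketch(double[][] hiddenWeights, double[] outputWeights) {
        this.hiddenWeights = hiddenWeights;
        this.outputWeights = outputWeights;
    }

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    double[] hiddenActivations(double[] x) {
        double[] h = new double[hiddenWeights.length];
        for (int j = 0; j < h.length; j++) {
            double sum = 0.0;
            for (int i = 0; i < x.length; i++) {
                sum += hiddenWeights[j][i] * x[i];
            }
            h[j] = sigmoid(sum);
        }
        return h;
    }

    double output(double[] h) {
        double sum = 0.0;
        for (int j = 0; j < h.length; j++) {
            sum += outputWeights[j] * h[j];
        }
        return sigmoid(sum);
    }

    // One step on one example; returns the error measured before the update.
    double trainStep(double[] x, double target, double learningRate) {
        // Forward pass: compute activations and output.
        double[] h = hiddenActivations(x);
        double out = output(h);

        // Output neuron's responsibility for the error 0.5 * (out - target)^2.
        double deltaOut = (out - target) * out * (1.0 - out);

        // Hidden neurons' responsibility, computed with the pre-update weights.
        double[] deltaHidden = new double[h.length];
        for (int j = 0; j < h.length; j++) {
            deltaHidden[j] = deltaOut * outputWeights[j] * h[j] * (1.0 - h[j]);
        }

        // Update weights on connections, output layer first, then hidden layer.
        for (int j = 0; j < h.length; j++) {
            outputWeights[j] -= learningRate * deltaOut * h[j];
            for (int i = 0; i < x.length; i++) {
                hiddenWeights[j][i] -= learningRate * deltaHidden[j] * x[i];
            }
        }
        return 0.5 * (out - target) * (out - target);
    }
}
```

Calling `trainStep` repeatedly on the same example with a small learning rate should drive the returned error down, which is a quick sanity check for the gradient signs.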
    20. 20. Parallelizing Neural Networks. Dean (NIPS, 2012). First steps: focus on linear convex models, calculating distributed gradient. Model parallelism must be combined with distributed optimization that leverages data parallelization: simultaneously process distinct training examples in each of the many model replicas, and periodically combine their results to optimize our objective function. Single-pass frameworks such as MapReduce are “ill-suited”.
    21. 21. Costs of Neural Network Training. Connection count explodes quickly as neurons and layers increase. Example: a {784, 450, 10} network has 357,300 connections. Need a fast iterative framework. Example: with a 30 sec MR setup cost and 10k epochs, 30s x 10,000 == 300,000 seconds of setup time (5,000 minutes, or 83 hours). 3 ways to speed up training: subdivide the dataset between workers (data parallelism); max out the transfer rate of disks and use vector caching to maximize data throughput; minimize inter-epoch setup times with a proper iterative framework.
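The two figures on this slide can be reproduced with a few lines of arithmetic. The sketch below uses hypothetical method names and counts only weights between adjacent fully connected layers (no bias terms), which is the reading that matches the slide's 357,300.

```java
class TrainingCostSketch {

    // Connections in a fully connected network: product of adjacent layer sizes.
    static long connectionCount(int[] layerSizes) {
        long total = 0;
        for (int l = 0; l + 1 < layerSizes.length; l++) {
            total += (long) layerSizes[l] * layerSizes[l + 1];
        }
        return total;
    }

    // Total job setup overhead when every epoch pays the setup cost again.
    static long totalSetupSeconds(long setupSecondsPerEpoch, long epochs) {
        return setupSecondsPerEpoch * epochs;
    }

    public static void main(String[] args) {
        long connections = connectionCount(new int[]{784, 450, 10});
        long seconds = totalSetupSeconds(30, 10_000);
        System.out.println(connections + " connections");  // 357300 connections
        System.out.println(seconds + " seconds of setup");  // 300000 seconds of setup
        System.out.println(seconds / 60 + " minutes");      // 5000 minutes
    }
}
```

784 x 450 contributes 352,800 connections and 450 x 10 contributes 4,500; 300,000 seconds is 5,000 minutes, or roughly 83 hours.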
    22. 22. Vector In-Memory Caching. Since we make lots of passes over the same dataset, in-memory caching makes sense here. Once a record is vectorized, it is cached in memory on the worker node. Speedup (single pass, “no cache” vs “cached”): ~12x.
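The caching idea can be illustrated with a small map-backed cache. Everything here is an assumption made for the sketch: the class name, the comma-separated record format, and keying the cache by the raw record text are hypothetical choices, and the slides do not say how Metronome keys or stores its cached vectors.

```java
import java.util.HashMap;
import java.util.Map;

class VectorCacheSketch {
    private final Map<String, double[]> cache = new HashMap<>();
    private int vectorizeCalls = 0;

    // Hypothetical vectorizer: parses a comma-separated line of numbers.
    static double[] vectorize(String record) {
        String[] parts = record.split(",");
        double[] vector = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            vector[i] = Double.parseDouble(parts[i].trim());
        }
        return vector;
    }

    // Returns the cached vector if this record was seen in an earlier pass;
    // otherwise vectorizes it once and keeps the result in memory.
    double[] get(String record) {
        double[] vector = cache.get(record);
        if (vector == null) {
            vector = vectorize(record);
            vectorizeCalls++;
            cache.put(record, vector);
        }
        return vector;
    }

    // How many times parsing actually ran; useful for checking cache hits.
    int getVectorizeCalls() {
        return vectorizeCalls;
    }
}
```

On the second and later epochs every lookup is a hit, so the parsing cost is paid once per record rather than once per record per epoch.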
    23. 23. Neural Networks Parallelization Speedup. [Chart: training speedup factor (multiple) versus number of parallel processing units (1 to 5). Series: UCI Iris, UCI Lenses, UCI Wine, UCI Dermatology, NIST Handwriting Downsample.]
    24. 24. Lessons Learned. Linear scale continues to be achieved with parameter averaging variations. Tuning is critical: need to be good at selecting a learning rate.
    25. 25. Future Directions. Adagrad (SGD adaptive learning rates). Parallel quasi-Newton methods: L-BFGS, Conjugate Gradient. More neural network learning refinement. Training progressively larger networks.
    26. 26. Github: IterativeReduce, Metronome
    27. 27. Unit Testing and IRUnit. Simulates the IterativeReduce parallel framework. Uses the same file that YARN applications do. Examples: a/tv/floe/metronome/linearregression/iterativereduce/TestSimulat ava/com/cloudera/knittingboar/sgd/iterativereduce/TestKnittingB