Parallel Linear Regression in Iterative Reduce and YARN



Online learning techniques, such as Stochastic Gradient Descent (SGD), are powerful when applied to risk minimization and convex games on large problems. However, their sequential design prevents them from taking advantage of newer distributed frameworks such as Hadoop/MapReduce. In this session, we will take a look at how we parallelized linear regression parameter optimization on the next-gen YARN framework Iterative Reduce.

Published in: Technology


  • Reference some thoughts on attribution pipelines
  • Talk about how you normally would use the Normal equation, notes from Andrew Ng
  • “Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” (Bottou, 2010). SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss in model accuracy.
  • The most important additions in Mahout’s SGD are: confidence-weighted learning rates per term, evolutionary tuning of hyper-parameters, mixed ranking and regression, and grouped AUC. The implication of it being local is that you are limited to the compute capacity of the local machine, as opposed to even a single machine on the cluster.
  • At current disk bandwidth and capacity (2TB at 100MB/s throughput), it takes about 6 hours to read the contents of a single hard drive.
  • Bottou similar to Xu2010 in the 2010 paper
  • Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data, such as iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
  • Performance still largely dependent on implementation of algo
  • POLR: Parallel Online Logistic Regression. Talking points: we wanted to start with a tool known to the Hadoop community, with expected characteristics. Mahout’s SGD is well known, so we used that as a base point.
  • Parallel Linear Regression in Iterative Reduce and YARN

    1. Josh Patterson. Twitter: @jpatanooga. Past: published in IAAI-09 (“TinyTermite: A Secure Routing Algorithm”); grad work in meta-heuristics and ant algorithms; Tennessee Valley Authority (TVA), Hadoop and the Smartgrid; Cloudera, Principal Solution Architect. Today: independent consultant.
    2. Sections: (1) Modern Data Analytics, (2) Parallel Linear Regression, (3) Performance and Results.
    3. The World as Optimization. Data tells us about our model/engine/product. We take this data and evolve our product towards a state of minimal market error. WSJ Special Section, Monday March 11, 2013: Zynga changing games based off player behavior; UPS cut fuel consumption by 8.4MM gallons; Ford used sentiment analysis to look at how new car features would be received.
    4. The Modern Data Landscape. Apps are coming, but they need platforms, components, and workflows. Lots of investment in Hadoop in this space: lots of ETL pipelines, lots of descriptive statistics, and growing interest in machine learning.
    5. Hadoop as The Linux of Data. Hadoop has won the Cycle. Gartner: Hadoop will be in 2/3s of advanced analytics products by 2015 [1]. “Hadoop is the kernel of a distributed operating system, and all the other components around the kernel are now arriving on this stage” (Doug Cutting).
    6. Today’s Hadoop ML Pipeline. Data cleansing / ETL performed with Hive or Pig. Data processed in place: Mahout, R, custom MapReduce algorithm. Or externally processed: SAS, SPSS, KXEN, Weka.
    7. As Focus Shifts to Applications. Data rates have been climbing fast. Speed at scale becomes the new killer app. Companies will want to leverage the Big Data infrastructure they’ve already been working with: Hadoop, with HDFS as the main storage system. A drive to validate big data investments with results. Emergence of applications which create “data products”.
    8. Patterson’s Law: “As the percent of your total data held in a storage system approaches 100%, the amount of in-system processing and analytics also approaches 100%.”
    9. Tools Will Move onto Hadoop. Already seeing this with vendors: who hasn’t announced a SQL engine on Hadoop lately? The trend will continue with machine learning tools. Mahout was the beginning; more are following. But what about parallel iterative algorithms?
    10. Distributed Systems Are Hard. Lots of moving parts, especially as these applications become more complicated. Machine learning can be a non-trivial operation. We need great building blocks that work well together. I agree with Jimmy Lin [3]: “keep it simple” and “make sure costs don’t outweigh benefits”. Minimize “Yet Another Tool To Learn” (YATTL) as much as we can!
    11. To Summarize. Data is moving into Hadoop everywhere (Patterson’s Law). Focus on Hadoop and build around the next-gen “Linux of data”. We need simple components to build next-gen data base apps. They should work cleanly with the cluster that the Fortune 500 has: Hadoop. They should also be easy to integrate into Hadoop and with the Hadoop tool ecosystem. Minimize YATTL.
    12. Linear Regression. In linear regression, data is modeled using linear predictor functions, and unknown model parameters are estimated from the data. We use optimization techniques like Stochastic Gradient Descent to find the coefficients in the model Y = (1*x0) + (c1*x1) + … + (cN*xN).
    13. Machine Learning and Optimization. Algorithms: (convergent) iterative methods such as Newton’s Method, Quasi-Newton, and Gradient Descent. Heuristics: AntNet, PSO, Genetic Algorithms.
    14. Stochastic Gradient Descent. Three parts: a hypothesis about the data, a cost function, and an update function. Reference: Andrew Ng’s tutorial.
    15. Stochastic Gradient Descent. Training: a simple gradient descent procedure; the loss function needs to be convex (with exceptions). Linear regression: the loss function is the squared error of prediction, and the prediction is a linear combination of model coefficients and input variables. (Diagram: training data feeds SGD, which produces the model.)
    16. Mahout’s SGD. Currently single process: multi-threaded parallel, but not cluster parallel. Runs locally, not deployed to the cluster. Tied to the logistic regression implementation.
    17. Current Limitations. Sequential algorithms on a single node only go so far. The “Data Deluge” presents algorithmic challenges when combined with large data sets; we need to design algorithms that are able to perform in a distributed fashion. MapReduce only fits certain types of algorithms.
    18. Distributed Learning Strategies. McDonald, 2010: Distributed Training Strategies for the Structured Perceptron. Langford, 2007: Vowpal Wabbit. Jeff Dean’s work on parallel SGD: DownPour SGD, Sandblaster.
    19. MapReduce vs. Parallel Iterative. (Diagram: on the MapReduce side, input flows through Map tasks and then Reduce tasks to output; on the parallel iterative side, a set of processors runs Superstep 1, Superstep 2, and so on.)
    20. YARN: Yet Another Resource Negotiator. Framework for scheduling distributed applications. Allows for any type of parallel application to run natively on Hadoop. MRv2 is now a distributed application. (Diagram: Clients, Resource Manager, Node Managers, App Masters, and Containers, connected by job submission, node status, MapReduce status, and resource request flows.)
    21. IterativeReduce. Designed specifically for parallel iterative algorithms on Hadoop. Implemented directly on top of YARN. Intrinsic parallelism: easier to focus on the problem, not on the distributed application part.
    22. IterativeReduce API. ComputableMaster: Setup(), Compute(), Complete(). ComputableWorker: Setup(), Compute(). (Diagram: a group of workers reports to the master, and the pattern repeats for each superstep.)
    23. SGD Master. Collects all parameter vectors at each pass / superstep. Produces a new global parameter vector by averaging the workers’ vectors. Sends the update to all workers. Workers replace their local parameter vector with the new global parameter vector.
    24. SGD Worker. Each is given a split of the total dataset, similar to a map task. Performs a local SGD pass. Local parameter vector is sent to the master at each superstep. Stays active/resident between iterations.
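    25. SGD: Serial vs Parallel. (Diagram: in the serial case, the training data produces a single model; in the parallel case, Split 1, Split 2, and Split 3 go to Worker 1, Worker 2, … Worker N, each producing a partial model that the master combines into the global model.)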
    26. Parallel Linear Regression with IterativeReduce. Based directly on work we did with Knitting Boar (parallel logistic regression). Scales linearly with input size. Can produce a linear regression model off large amounts of data. Packaged in a new suite of parallel iterative algorithms called Metronome: 100% Java, ASF 2.0 licensed, on github.
    27. Unit Testing and IRUnit. Simulates the IterativeReduce parallel framework. Uses the same file that YARN applications do. Examples: rc/test/java/tv/floe/metronome/linearregression/iterativereduce/TestSimulateLinearRegressionIterativeReduce.java and /src/test/java/com/cloudera/knittingboar/sgd/iterativereduce/
    28. Running the Job via YARN. Build with Maven. Copy the jar to a host with cluster access. Copy the dataset to HDFS. Run the job: yarn jar iterativereduce-0.1-SNAPSHOT.jar
    29. Results. Linear Regression, Parallel vs Serial. (Chart: total processing time, on a 0 to 200 axis, against total megabytes processed, from 64 to 320, for parallel runs and serial runs.)
    30. Lessons Learned. Linear scale continues to be achieved with parameter averaging variations. Tuning is critical: you need to be good at selecting a learning rate. YARN is still experimental and has caveats; container allocation is still slow. Metronome continues to be experimental.
    31. Special Thanks. Michael Katzenellenbollen. Dr. James Scott, University of Texas at Austin. Dr. Jason Baldridge, University of Texas at Austin.
    32. Future Directions. More testing and stability. Cache vectors in memory for speed. Metronome: take on properties of LibLinear; pluggable optimization and general linear models; YARN-centric, first-class Hadoop citizen; focus on being a complement to Mahout; K-means and PageRank implementations.
    33. Github: IterativeReduce, Metronome, Knitting Boar.
    34. References. [1] intelligence/gartner-hadoop-will-be-in-two-thirds-of-advanced-analytics-products-2015-211475 [2] regression.html [3] Jimmy Lin, “MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!”
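
Slides 14 and 15 describe SGD for linear regression as a hypothesis (a linear combination of coefficients and inputs), a squared-error cost, and a per-example update. The following is a minimal serial sketch of that procedure in Python. It is not Metronome's Java code, and the function names, toy data, and learning rate are assumptions chosen for illustration.

```python
def predict(coefficients, features):
    """Hypothesis: linear combination of model coefficients and input variables."""
    return sum(c * x for c, x in zip(coefficients, features))


def sgd_pass(coefficients, examples, learning_rate):
    """One SGD pass over (features, target) pairs using squared-error loss."""
    updated = list(coefficients)
    for features, target in examples:
        error = predict(updated, features) - target
        # The gradient of 0.5 * error**2 with respect to coefficient j is error * x_j,
        # so each coefficient moves a small step against that gradient.
        updated = [c - learning_rate * error * x for c, x in zip(updated, features)]
    return updated


# Hypothetical toy data generated from y = 1 + 2*x; x0 = 1 acts as the intercept input.
examples = [([1.0, x], 1.0 + 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]

coefficients = [0.0, 0.0]
for _ in range(2000):
    coefficients = sgd_pass(coefficients, examples, learning_rate=0.05)
```

Because the toy data is noise-free, the coefficients settle near the generating values 1 and 2. On real data the learning rate usually needs tuning, which matches the "tuning is critical" lesson on slide 30.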
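
Slides 23 to 25 describe the parallel scheme: each worker runs a local SGD pass over its split, the master averages the workers' parameter vectors, and the averaged vector replaces every worker's local vector before the next superstep. The sketch below simulates that loop in a single Python process. It does not use the IterativeReduce or YARN APIs, and the function names, data, and settings are hypothetical.

```python
def local_sgd_pass(global_vector, split, learning_rate):
    """Worker step: one SGD pass over this worker's split, starting from the global vector."""
    local = list(global_vector)
    for features, target in split:
        error = sum(c * x for c, x in zip(local, features)) - target
        local = [c - learning_rate * error * x for c, x in zip(local, features)]
    return local


def average_vectors(vectors):
    """Master step: element-wise mean of the workers' parameter vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train(splits, num_features, supersteps, learning_rate):
    """Superstep loop, run sequentially here; on a cluster the workers would run concurrently."""
    global_vector = [0.0] * num_features
    for _ in range(supersteps):
        # Every worker starts the superstep from the same global vector.
        partial_models = [local_sgd_pass(global_vector, s, learning_rate) for s in splits]
        global_vector = average_vectors(partial_models)
    return global_vector


# Hypothetical toy data from y = 1 + 2*x, dealt round-robin into three splits.
data = [([1.0, float(x)], 1.0 + 2.0 * x) for x in range(6)]
splits = [data[i::3] for i in range(3)]
model = train(splits, num_features=2, supersteps=3000, learning_rate=0.02)
```

In this simulation the only state that crosses the worker/master boundary is one parameter vector per worker per superstep, which is what lets workers stay resident with their split between iterations.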