Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Josh Patterson's Hadoop Summit EU 2013 talk on parallel linear regression on IterativeReduce and YARN.

  1. Josh Patterson. Email: josh@floe.tv; Twitter: @jpatanooga; Github: https://github.com/jpatanooga. Past: Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm”; Grad work in Meta-heuristics, Ant-algorithms; Tennessee Valley Authority (TVA): Hadoop and the Smartgrid; Cloudera: Principal Solution Architect. Today: Independent Consultant.
  2. Sections: 1. Modern Data Analytics; 2. Parallel Linear Regression; 3. Performance and Results
  3. The World as Optimization: Data tells us about our model/engine/product. We take this data and evolve our product towards a state of minimal market error. WSJ Special Section, Monday March 11, 2013: Zynga changing games based off player behavior; UPS cut fuel consumption by 8.4MM gallons; Ford used sentiment analysis to look at how new car features would be received.
  4. The Modern Data Landscape: Apps are coming but they need platforms, components, and workflows. Lots of investment in Hadoop in this space: lots of ETL pipelines, lots of descriptive statistics, growing interest in machine learning.
  5. Hadoop as The Linux of Data: Hadoop has won the Cycle. Gartner: Hadoop will be in 2/3s of advanced analytics products by 2015 [1]. “Hadoop is the kernel of a distributed operating system, and all the other components around the kernel are now arriving on this stage” (Doug Cutting).
  6. Today’s Hadoop ML Pipeline: Data cleansing / ETL performed with Hive or Pig. Data processed in place: Mahout, R, custom MapReduce algorithm. Or externally processed: SAS, SPSS, KXEN, Weka.
  7. As Focus Shifts to Applications: Data rates have been climbing fast; speed at scale becomes the new killer app. Companies will want to leverage the Big Data infrastructure they’ve already been working with: Hadoop, with HDFS as main storage system. A drive to validate big data investments with results. Emergence of applications which create “data products”.
  8. Patterson’s Law: “As the percent of your total data held in a storage system approaches 100%, the amount of in-system processing and analytics also approaches 100%.”
  9. Tools Will Move onto Hadoop: Already seeing this with vendors. Who hasn’t announced a SQL engine on Hadoop lately? Trend will continue with machine learning tools: Mahout was the beginning, and more are following. But what about parallel iterative algorithms?
  10. Distributed Systems Are Hard: Lots of moving parts, especially as these applications become more complicated. Machine learning can be a non-trivial operation. We need great building blocks that work well together. I agree with Jimmy Lin [3]: “keep it simple” and “make sure costs don’t outweigh benefits”. Minimize “Yet Another Tool To Learn” (YATTL) as much as we can!
  11. To Summarize: Data is moving into Hadoop everywhere (Patterson’s Law). Focus on Hadoop, build around the next-gen “Linux of data”. Need simple components to build next-gen data base apps. They should work cleanly with the cluster that the Fortune 500 has: Hadoop. They should also be easy to integrate into Hadoop and with the Hadoop tool ecosystem. Minimize YATTL.
  12. Linear Regression: In linear regression, data is modeled using linear predictor functions, and unknown model parameters are estimated from the data. We use optimization techniques like Stochastic Gradient Descent to find the coefficients in the model Y = (1*x0) + (c1*x1) + … + (cN*xN)
  13. Machine Learning and Optimization Algorithms: (Convergent) iterative methods: Newton’s Method, Quasi-Newton, Gradient Descent. Heuristics: AntNet, PSO, Genetic Algorithms.
  14. Stochastic Gradient Descent: Hypothesis about data; cost function; update function. Andrew Ng’s Tutorial: https://class.coursera.org/ml/lecture/preview_view/11
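For linear regression, the three pieces named on this slide can be written out as follows. This is a sketch in assumed notation: the coefficients are called c_j as on the Linear Regression slide, the intercept is treated as a coefficient c_0 on a constant input x_0 = 1, and the learning-rate symbol α is introduced here and does not appear in the slides.

```latex
\begin{aligned}
h_c(x) &= \sum_{j=0}^{N} c_j x_j && \text{(hypothesis)} \\
J(c) &= \tfrac{1}{2}\bigl(h_c(x) - y\bigr)^2 && \text{(cost for one example } (x, y)\text{)} \\
c_j &\leftarrow c_j - \alpha\,\bigl(h_c(x) - y\bigr)\,x_j && \text{(update, applied for each } j\text{)}
\end{aligned}
```

The update is the gradient step on J for a single training example, which is what makes the method stochastic rather than batch.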
  15. Stochastic Gradient Descent: Training Data → Training → Model. Simple gradient descent procedure; the loss function needs to be convex (with exceptions). Linear Regression SGD: the loss function is the squared error of the prediction, and the prediction is a linear combination of coefficients and input variables.
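A minimal serial sketch of the procedure described on this slide, written in Python for brevity (Metronome itself is Java, and this is not its code). The function names `predict` and `sgd_pass`, the synthetic data, and the learning rate of 0.1 are all assumptions made for illustration.

```python
import random


def predict(coefs, x):
    """Prediction: linear combination of coefficients and input variables."""
    return sum(c * xi for c, xi in zip(coefs, x))


def sgd_pass(coefs, examples, learning_rate):
    """One SGD pass over (x, y) examples using squared-error loss."""
    coefs = list(coefs)  # work on a copy so the caller's vector is untouched
    for x, y in examples:
        error = predict(coefs, x) - y
        for j, xj in enumerate(x):
            # Gradient of 0.5 * error^2 with respect to coefs[j] is error * xj.
            coefs[j] -= learning_rate * error * xj
    return coefs


# Hypothetical noise-free data drawn from y = 2 + 3*x1, with x0 = 1 as the intercept input.
rng = random.Random(0)
data = []
for _ in range(200):
    x1 = rng.uniform(-1.0, 1.0)
    data.append(([1.0, x1], 2.0 + 3.0 * x1))

coefs = [0.0, 0.0]
for _ in range(50):
    rng.shuffle(data)
    coefs = sgd_pass(coefs, data, learning_rate=0.1)
```

Because the synthetic data has no noise, the coefficients settle very close to the generating values 2 and 3; with real data the result would depend on the learning rate and number of passes.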
  16. Mahout’s SGD: Currently single process. Multi-threaded parallel, but not cluster parallel. Runs locally, not deployed to the cluster. Tied to logistic regression implementation.
  17. Current Limitations: Sequential algorithms on a single node only go so far. The “Data Deluge” presents algorithmic challenges when combined with large data sets: we need to design algorithms that are able to perform in a distributed fashion. MapReduce only fits certain types of algorithms.
  18. Distributed Learning Strategies: McDonald, 2010: Distributed Training Strategies for the Structured Perceptron. Langford, 2007: Vowpal Wabbit. Jeff Dean’s work on parallel SGD: DownPour SGD, Sandblaster.
  19. MapReduce vs. Parallel Iterative (diagram): MapReduce flows Input → Map, Map, Map → Reduce, Reduce → Output; the parallel iterative model runs a set of processors through Superstep 1, Superstep 2, and so on.
  20. YARN: Yet Another Resource Negotiator. Framework for scheduling distributed applications. Allows for any type of parallel application to run natively on Hadoop. MRv2 is now a distributed application. (Diagram: Clients, Resource Manager, and Node Managers hosting App Mstr and Containers, with arrows labeled MapReduce Status, Job Submission, Node Status, and Resource Request.)
  21. IterativeReduce: Designed specifically for parallel iterative algorithms on Hadoop. Implemented directly on top of YARN. Intrinsic parallelism: easier to focus on the problem, not on the distributed application part.
  22. IterativeReduce API: ComputableMaster: Setup(), Compute(), Complete(). ComputableWorker: Setup(), Compute(). (Diagram: rounds of Worker, Worker, Worker feeding the Master, repeated.)
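The master/worker split on this slide can be sketched as two small interfaces plus a driver loop. This is a Python illustration only: the slide gives just the method names Setup(), Compute(), and Complete(), so every argument and return type below is an assumption, and `MeanMaster`, `HalfwayWorker`, and `run_supersteps` are hypothetical names, not part of IterativeReduce.

```python
from abc import ABC, abstractmethod


class ComputableMaster(ABC):
    @abstractmethod
    def setup(self, config): ...

    @abstractmethod
    def compute(self, worker_updates): ...

    @abstractmethod
    def complete(self): ...


class ComputableWorker(ABC):
    @abstractmethod
    def setup(self, config): ...

    @abstractmethod
    def compute(self, master_update): ...


class MeanMaster(ComputableMaster):
    """Toy master: the global value is the mean of the workers' values."""

    def setup(self, config):
        self.value = config.get("initial", 0.0)

    def compute(self, worker_updates):
        self.value = sum(worker_updates) / len(worker_updates)
        return self.value

    def complete(self):
        return self.value


class HalfwayWorker(ComputableWorker):
    """Toy worker: moves halfway from the global value towards a local target."""

    def __init__(self, target):
        self.target = target

    def setup(self, config):
        self.value = config.get("initial", 0.0)

    def compute(self, master_update):
        self.value = master_update + 0.5 * (self.target - master_update)
        return self.value


def run_supersteps(master, workers, config, supersteps):
    """Sequential stand-in for the framework: one worker round, then the master."""
    master.setup(config)
    for worker in workers:
        worker.setup(config)
    global_value = config.get("initial", 0.0)
    for _ in range(supersteps):
        updates = [worker.compute(global_value) for worker in workers]
        global_value = master.compute(updates)
    return master.complete()


result = run_supersteps(
    MeanMaster(),
    [HalfwayWorker(1.0), HalfwayWorker(2.0), HalfwayWorker(3.0)],
    {"initial": 0.0},
    supersteps=20,
)
```

In this toy setup the global value approaches 2.0, the mean of the three local targets. The point of the split is that an algorithm author writes only the two `compute` bodies while the framework owns the loop.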
  23. SGD Master: Collects all parameter vectors at each pass / superstep. Produces new global parameter vector by averaging workers’ vectors. Sends update to all workers. Workers replace local parameter vector with new global parameter vector.
  24. SGD Worker: Each is given a split of the total dataset, similar to a map task. Performs local SGD pass. Local parameter vector sent to master at superstep. Stays active/resident between iterations.
  25. SGD: Serial vs Parallel (diagram): Serial: Training Data → Model. Parallel: Split 1, Split 2, Split 3 → Worker 1, Worker 2, … Worker N → Partial Models → Master → Global Model.
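The master and worker roles on the SGD slides above amount to parameter averaging, which can be simulated in a single process. The sketch below is an illustration under assumptions, not Metronome's implementation: the workers run one after another rather than in YARN containers, and the names `local_sgd_pass`, `average`, and `train_parallel`, the synthetic data, and the learning rate are all hypothetical.

```python
import random


def local_sgd_pass(coefs, split, learning_rate):
    """Worker step: one local SGD pass over its split, starting from the global vector."""
    coefs = list(coefs)
    for x, y in split:
        error = sum(c * xi for c, xi in zip(coefs, x)) - y
        for j, xj in enumerate(x):
            coefs[j] -= learning_rate * error * xj
    return coefs


def average(vectors):
    """Master step: element-wise average of the workers' parameter vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train_parallel(splits, n_features, supersteps, learning_rate):
    global_coefs = [0.0] * n_features
    for _ in range(supersteps):
        # Every worker starts the superstep from the same global vector.
        partial_models = [
            local_sgd_pass(global_coefs, split, learning_rate) for split in splits
        ]
        global_coefs = average(partial_models)
    return global_coefs


# Hypothetical noise-free data from y = 2 + 3*x1, divided into three splits.
rng = random.Random(1)
examples = []
for _ in range(300):
    x1 = rng.uniform(-1.0, 1.0)
    examples.append(([1.0, x1], 2.0 + 3.0 * x1))
splits = [examples[0:100], examples[100:200], examples[200:300]]

model = train_parallel(splits, n_features=2, supersteps=30, learning_rate=0.1)
```

On this noise-free data the averaged model ends up very close to the generating coefficients 2 and 3. On noisy or unevenly distributed splits the averaged vector is only an approximation of what a serial pass would produce.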
  26. Parallel Linear Regression with IterativeReduce: Based directly on work we did with Knitting Boar (parallel logistic regression). Scales linearly with input size. Can produce a linear regression model off large amounts of data. Packaged in a new suite of parallel iterative algorithms called Metronome: 100% Java, ASF 2.0 licensed, on GitHub.
  27. Unit Testing and IRUnit: Simulates the IterativeReduce parallel framework. Uses the same app.properties file that YARN applications do. Examples: https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/linearregression/iterativereduce/TestSimulateLinearRegressionIterativeReduce.java and https://github.com/jpatanooga/KnittingBoar/blob/master/src/test/java/com/cloudera/knittingboar/sgd/iterativereduce/TestKnittingBoar_IRUnitSim.java
  28. Running the Job via YARN: Build with Maven. Copy the jar to a host with cluster access. Copy the dataset to HDFS. Run the job: yarn jar iterativereduce-0.1-SNAPSHOT.jar app.properties
  29. Results: Linear Regression, Parallel vs Serial (chart): Total Processing Time (y-axis, 0 to 200) against Megabytes Processed Total (x-axis, 64 to 320), with one series for Parallel Runs and one for Serial Runs.
  30. Lessons Learned: Linear scale continues to be achieved with parameter averaging variations. Tuning is critical: need to be good at selecting a learning rate. YARN is still experimental and has caveats: container allocation is still slow. Metronome continues to be experimental.
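The point about learning-rate selection can be seen even on a one-coefficient problem. The following Python sketch is a toy illustration with assumed values: a single training example (x = 1, y = 3) and two hand-picked learning rates. It is not taken from the talk's experiments.

```python
def fit_single_coefficient(learning_rate, steps=100):
    """Repeated squared-error gradient steps on one example with x = 1.0, y = 3.0."""
    x, y = 1.0, 3.0
    c = 0.0
    for _ in range(steps):
        c -= learning_rate * (c * x - y) * x
    return c


small = fit_single_coefficient(0.1)  # error shrinks by a factor of 0.9 per step
large = fit_single_coefficient(2.5)  # error grows by a factor of 1.5 per step
```

With the small rate the coefficient approaches 3.0, while the large rate overshoots by more on every step and diverges. The usable range depends on the scale of the inputs.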
  31. Special Thanks: Michael Katzenellenbollen; Dr. James Scott, University of Texas at Austin; Dr. Jason Baldridge, University of Texas at Austin.
  32. Future Directions: More testing, stability. Cache vectors in memory for speed. Metronome: take on properties of LibLinear (pluggable optimization, general linear models); YARN-centric, first-class Hadoop citizen; focus on being a complement to Mahout; K-means, PageRank implementations.
  33. GitHub: IterativeReduce: https://github.com/emsixteeen/IterativeReduce; Metronome: https://github.com/jpatanooga/Metronome; Knitting Boar: https://github.com/jpatanooga/KnittingBoar
  34. References: [1] http://www.infoworld.com/d/business-intelligence/gartner-hadoop-will-be-in-two-thirds-of-advanced-analytics-products-2015-211475 [2] https://cwiki.apache.org/MAHOUT/logistic-regression.html [3] “MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!”, http://arxiv.org/pdf/1209.2191.pdf