Josh Patterson, Principal at Patterson Consulting: Introduction to Parallel Iterative Machine Learning Algorithms on Hadoop's Next-Generation YARN Framework
5. Machine Learning and Optimization
Direct Methods
Normal Equation
Iterative Methods
Newton’s Method
Quasi-Newton
Gradient Descent
Heuristics
AntNet
PSO
Genetic Algorithms
6. Linear Regression
In linear regression, data is modeled using linear predictor functions, and the unknown model parameters are estimated from the data.
We use optimization techniques like Stochastic Gradient Descent to find the coefficients in the model (see the prediction sketch below):
Y = (c0*x0) + (c1*x1) + … + (cN*xN), where x0 = 1 is the intercept term
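A minimal sketch of that predictor function in Java, assuming a plain coefficient array; the class and method names are hypothetical and are not Metronome's or Mahout's API:

```java
// Hypothetical sketch: a linear model's prediction is the dot product of the
// coefficient vector c0..cN with the input vector, where x[0] = 1 serves as
// the intercept term.
public class LinearModel {
    private final double[] coefficients; // c0 .. cN

    public LinearModel(double[] coefficients) {
        this.coefficients = coefficients;
    }

    // Prediction: linear combination of coefficients and input variables.
    public double predict(double[] x) {
        double y = 0.0;
        for (int i = 0; i < coefficients.length; i++) {
            y += coefficients[i] * x[i];
        }
        return y;
    }
}
```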
8. Stochastic Gradient Descent
Simple gradient descent procedure
Loss function needs to be convex (with exceptions)
Linear Regression via SGD:
Loss function: squared error of prediction
Prediction: linear combination of coefficients and input variables (see the update sketch below)
[Diagram: Training Data → Training (SGD) → Model]
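A minimal sketch of one SGD step for linear regression with squared-error loss; the names here are hypothetical, not the Mahout or Metronome API. For each training example, every coefficient is nudged against the gradient of the loss at that single example:

```java
// Hypothetical sketch: one stochastic gradient descent update for linear
// regression with squared-error loss.
public class SgdTrainer {
    private final double[] c;          // model coefficients
    private final double learningRate;

    public SgdTrainer(int numFeatures, double learningRate) {
        this.c = new double[numFeatures];
        this.learningRate = learningRate;
    }

    // One SGD update on a single (x, y) training example.
    public void step(double[] x, double y) {
        // Prediction: linear combination of coefficients and input variables.
        double prediction = 0.0;
        for (int i = 0; i < c.length; i++) {
            prediction += c[i] * x[i];
        }
        // For squared-error loss, the gradient w.r.t. c[i] is (prediction - y) * x[i].
        double error = prediction - y;
        for (int i = 0; i < c.length; i++) {
            c[i] -= learningRate * error * x[i];
        }
    }
}
```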
9. Mahout's SGD
Currently Single Process
Multi-threaded parallel, but not cluster parallel
Runs locally, not deployed to the cluster
Tied to logistic regression implementation
10. Distributed Learning Strategies
McDonald, 2010: Distributed Training Strategies for the Structured Perceptron
Langford, 2007: Vowpal Wabbit
Jeff Dean's work on Parallel SGD: DownPour SGD
12. YARN
Yet Another Resource Negotiator
Framework for scheduling distributed applications
Allows for any type of parallel application to run natively on Hadoop
MRv2 is now a distributed application
[Architecture diagram: Clients submit jobs to the Resource Manager; Node Managers report node status and host Containers and App Masters; App Masters send resource requests and MapReduce status back to the Resource Manager]
14. SGD: Serial vs Parallel
[Diagram: the Training Data is divided into Split 1, Split 2, … Split N; Worker 1 … Worker N each compute a Partial Model on their split; the Master merges the partial models into a Global Model (see the averaging sketch below)]
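A minimal sketch of the merge step on the master, assuming simple parameter averaging over the workers' coefficient vectors; this is illustrative only and not the IterativeReduce API:

```java
import java.util.List;

// Hypothetical sketch: the master averages the partial coefficient vectors
// produced by each worker on its data split into a single global model,
// which is then sent back to the workers for the next pass.
public class ParameterAverager {

    public static double[] average(List<double[]> partialModels) {
        int n = partialModels.get(0).length;
        double[] global = new double[n];
        for (double[] partial : partialModels) {
            for (int i = 0; i < n; i++) {
                global[i] += partial[i];
            }
        }
        for (int i = 0; i < n; i++) {
            global[i] /= partialModels.size();
        }
        return global; // the global model for the next iteration
    }
}
```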
15. Parallel Iterative Algorithms on YARN
Based directly on work we did with Knitting Boar
Parallel logistic regression
And then added
Parallel linear regression
Parallel Neural Networks
Packaged in a new suite of parallel iterative algorithms called Metronome
100% Java, ASF 2.0 licensed, on GitHub
16. Linear Regression Results
[Chart: Total Processing Time vs. Megabytes Processed Total (64.0, 128.0, 192.0, 256.0, 320.0), comparing Series 1 and Series 2]
18. Convergence Testing
Debugging parallel iterative algorithms during testing is hard
Processes on different hosts are difficult to observe
Using the unit test framework IRUnit, we can simulate the IterativeReduce framework
We know the plumbing of message passing works
Allows us to focus on parallel algorithm design/testing while still using standard debugging tools
20. What are Neural Networks?
Inspired by the nervous systems of biological organisms
Models layers of neurons in the brain
Can learn non-linear functions
Recently enjoying a surge in popularity
21. Multi-Layer Perceptron
First layer has input neurons
Last layer has output neurons
Each neuron in a layer is connected to all neurons in the next layer
Each neuron has an activation function, typically sigmoid / logistic
Input to a neuron is the sum of weight * input over its incoming connections (see the sketch below)
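A minimal sketch of a single neuron's activation under those rules; the class and field names are hypothetical, not Metronome's neural network classes:

```java
// Hypothetical sketch: a single neuron in a multi-layer perceptron. The net
// input is the sum of weight * input over its incoming connections (plus a
// bias), squashed by a sigmoid/logistic activation function.
public class Neuron {
    private final double[] weights;
    private final double bias;

    public Neuron(double[] weights, double bias) {
        this.weights = weights;
        this.bias = bias;
    }

    public double activate(double[] inputs) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * inputs[i];
        }
        return sigmoid(sum);
    }

    private static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }
}
```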
22. Backpropagation Learning
Calculates the gradient of the network's error with respect to the network's modifiable weights
Intuition:
Run a forward pass of the example through the network
Compute activations and output
Iterate from the output layer back to the input layer (backwards)
For each neuron in the layer:
Compute the node's responsibility for the error
Update the weights on its connections (see the sketch below)
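A minimal sketch of the core update for a sigmoid output neuron under squared-error loss; method names are hypothetical and this is not the Metronome implementation:

```java
// Hypothetical sketch: backpropagation's per-connection update. "delta" is the
// neuron's responsibility for the error; for a sigmoid output neuron under
// squared-error loss it is (output - target) * output * (1 - output), and each
// incoming weight moves against the gradient.
public class Backprop {

    // Error responsibility of a sigmoid output neuron.
    public static double outputDelta(double output, double target) {
        return (output - target) * output * (1.0 - output);
    }

    // Gradient-descent update for one connection weight.
    public static double updateWeight(double weight, double delta,
                                      double inputActivation, double learningRate) {
        return weight - learningRate * delta * inputActivation;
    }
}
```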
23. Parallelizing Neural Networks
Dean et al. (NIPS, 2012)
First steps: focus on linear convex models, calculating a distributed gradient
Model parallelism must be combined with distributed optimization that leverages data parallelization
Simultaneously process distinct training examples in each of the many model replicas
Periodically combine their results to optimize the objective function
Single-pass frameworks such as MapReduce are "ill-suited"
24. Costs of Neural Network Training
Connection count explodes quickly as neurons and layers increase
Example: a {784, 450, 10} network has 357,300 connections
Need a fast iterative framework
Example: a 30-second MapReduce setup cost over 10,000 epochs: 30s x 10,000 = 300,000 seconds of setup time
That is 5,000 minutes, or roughly 83 hours
3 ways to speed up training:
Subdivide the dataset between workers (data parallelism)
Maximize disk transfer rate and use vector caching to maximize data throughput
Minimize inter-epoch setup times with a proper iterative framework
25. Vector In-Memory Caching
Since we make lots of passes over the same dataset, in-memory caching makes sense here
Once a record is vectorized, it is cached in memory on the worker node (see the sketch below)
Speedup (single pass, "no cache" vs. "cached"): ~12x
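A minimal sketch of that idea, assuming records are simple CSV lines; the class and parsing logic are hypothetical, not Metronome's cache:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: vectorize each record once on the worker and keep the
// resulting vector in memory, since the same dataset is scanned every epoch.
public class VectorCache {
    private final List<double[]> cached = new ArrayList<>();

    // First pass: parse/vectorize each raw record and cache the vector.
    public void add(String rawRecord) {
        cached.add(vectorize(rawRecord));
    }

    // Subsequent epochs read straight from memory instead of re-parsing disk input.
    public List<double[]> vectors() {
        return cached;
    }

    private double[] vectorize(String rawRecord) {
        String[] fields = rawRecord.split(",");
        double[] v = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            v[i] = Double.parseDouble(fields[i]);
        }
        return v;
    }
}
```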
28. Lessons Learned
Linear scaling continues to be achieved with parameter-averaging variations
Tuning is critical
Need to be good at selecting a learning rate
31. Unit Testing and IRUnit
Simulates the IterativeReduce parallel framework
Uses the same app.properties file that YARN applications do
Examples:
https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/linearregression/iterativereduce/TestSimulateLinearRegressionIterativeReduce.java
https://github.com/jpatanooga/KnittingBoar/blob/master/src/test/java/com/cloudera/knittingboar/sgd/iterativereduce/TestKnittingBoar_IRUnitSim.java