MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN

Josh Patterson
Email:
josh@floe.tv

Twitter:
@jpatanooga

Github:
https://github.com/jpatanooga

Past
Published in IAAI-09:
“TinyTermite: A Secure Routing Algorithm”

Grad work in Meta-heuristics, Ant Algorithms

Tennessee Valley Authority
(TVA)
Hadoop and the Smartgrid

Cloudera
Principal Solution Architect

Today: Consultant

Sections
1. Parallel Iterative Algorithms
2. Parallel Neural Networks

3. Future Directions


Machine Learning and Optimization
Direct Methods
Normal Equation

Iterative Methods
Newton’s Method
Quasi-Newton

Gradient Descent

Heuristics
AntNet
PSO
Genetic Algorithms

Linear Regression
In linear regression, data is modeled using linear predictor functions,
and the unknown model parameters are estimated from the data.

We use optimization techniques like Stochastic Gradient Descent to
find the coefficients in the model

Y = (1*x0) + (c1*x1) + … + (cN*xN)


Stochastic Gradient Descent
Hypothesis about data
Cost function
Update function

Andrew Ng’s Tutorial:
https://class.coursera.org/ml/lecture/preview_view/11
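
The three ingredients above can be written out for linear regression as follows. This is a sketch under assumptions not stated on the slide: every input x_i, including x_0, is given its own coefficient c_i, the cost is taken per training example (x, y), and the learning rate symbol \alpha is introduced here for illustration.

```latex
\begin{aligned}
\text{Hypothesis:}\quad & h_c(x) = \sum_{i=0}^{N} c_i x_i \\
\text{Cost:}\quad & J(c) = \tfrac{1}{2}\bigl(h_c(x) - y\bigr)^2 \\
\text{Update:}\quad & c_i \leftarrow c_i - \alpha \,\bigl(h_c(x) - y\bigr)\, x_i \qquad \text{for } i = 0, \dots, N
\end{aligned}
```

The update is the cost's partial derivative with respect to c_i, scaled by \alpha and subtracted from the current coefficient.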


Stochastic Gradient Descent
Training is a simple gradient descent procedure
Loss function needs to be convex (with exceptions)

Linear Regression
Loss function: squared error of prediction
Prediction: linear combination of coefficients and input variables

[Diagram: Training Data feeds SGD (Training), which produces the Model]
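
A minimal serial sketch of the procedure described above, assuming a squared-error loss and a fixed learning rate. The function names, the learning rate, the epoch count, and the toy data are illustrative assumptions; this is not Mahout's or Metronome's implementation.

```python
import random


def predict(coeffs, x):
    """Prediction: linear combination of coefficients and input variables."""
    return sum(c * xi for c, xi in zip(coeffs, x))


def sgd_linear_regression(examples, learning_rate=0.01, epochs=200, seed=0):
    """Fit coefficients with per-example gradient steps on squared error."""
    rng = random.Random(seed)
    n_features = len(examples[0][0])
    coeffs = [0.0] * n_features
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new random order each epoch
        for x, y in data:
            error = predict(coeffs, x) - y
            for i in range(n_features):
                # Gradient of 0.5 * error^2 with respect to coeffs[i] is error * x[i].
                coeffs[i] -= learning_rate * error * x[i]
    return coeffs


# Toy data (assumed): y = 2 + 3*v, with a constant 1.0 as the first input.
training = [([1.0, float(v)], 2.0 + 3.0 * v) for v in range(-5, 6)]
model = sgd_linear_regression(training)
```

On this noise-free toy data the fitted coefficients should end up close to 2 and 3.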


Mahout’s SGD
Currently Single Process
Multi-threaded parallel, but not cluster parallel
Runs locally, not deployed to the cluster
Tied to logistic regression implementation


Distributed Learning Strategies
McDonald, 2010
Distributed Training Strategies for the Structured
Perceptron

Langford, 2007
Vowpal Wabbit

Jeff Dean’s Work on Parallel SGD
DownPour SGD


MapReduce vs. Parallel Iterative
[Diagram: on the MapReduce side, Input feeds three Map tasks, then Reduce, then Output. On the Parallel Iterative side, a group of Processors runs Superstep 1, then Superstep 2, and so on.]


YARN
Yet Another Resource Negotiator
Framework for scheduling distributed applications
Allows any type of parallel application to run natively on Hadoop
MRv2 is now a distributed application

[Diagram: YARN architecture with Clients, a Resource Manager, and Node Managers hosting Containers and App Mstrs. Legend: MapReduce Status, Job Submission, Node Status, Resource Request.]


IterativeReduce API
ComputableMaster
Setup()
Compute()
Complete()

ComputableWorker
Setup()
Compute()

[Diagram: a group of Workers reports to the Master, then the next group of Workers runs and reports to the Master, and so on.]
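
The master/worker contract above can be mimicked in a few lines to show the control flow of a superstep. This is a hypothetical Python mirror of the method names on the slide, not the actual Java IterativeReduce API: the argument lists, the run_supersteps driver, and the toy SumWorker and MeanMaster classes are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class ComputableWorker(ABC):
    @abstractmethod
    def setup(self, split): ...

    @abstractmethod
    def compute(self, global_state): ...


class ComputableMaster(ABC):
    @abstractmethod
    def setup(self): ...

    @abstractmethod
    def compute(self, worker_updates): ...

    @abstractmethod
    def complete(self): ...


def run_supersteps(master, workers, splits, supersteps):
    """Hypothetical in-process driver: one worker per split, run in sequence."""
    master.setup()
    for worker, split in zip(workers, splits):
        worker.setup(split)
    state = None
    for _ in range(supersteps):
        updates = [worker.compute(state) for worker in workers]
        state = master.compute(updates)  # result is sent back to the workers
    return master.complete()


class SumWorker(ComputableWorker):
    """Toy worker: reports the sum and count of the numbers in its split."""

    def setup(self, split):
        self.split = split

    def compute(self, global_state):
        return (sum(self.split), len(self.split))


class MeanMaster(ComputableMaster):
    """Toy master: combines the partial sums into a global mean."""

    def setup(self):
        self.mean = None

    def compute(self, worker_updates):
        total = sum(s for s, _ in worker_updates)
        count = sum(n for _, n in worker_updates)
        self.mean = total / count
        return self.mean

    def complete(self):
        return self.mean


splits = [[1, 2], [3, 4], [5, 6]]
result = run_supersteps(MeanMaster(), [SumWorker() for _ in splits], splits, 1)
```

A real deployment would run each worker in its own YARN container and pass updates over the network. The sequential loop here only illustrates the order of calls.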


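SGD: Serial vs Parallel
[Diagram: serial SGD trains a single Model from the Training Data. In the parallel version the Training Data is divided into Split 1, Split 2, Split 3; Worker 1 through Worker N each produce a Partial Model, and the Master combines them into the Global Model.]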
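
One common way for the master to combine partial models is to average their parameters, which the Lessons Learned slide also mentions. The sketch below simulates that for linear regression in a single process. The split scheme, learning rate, superstep count, and all function names are assumptions for illustration, and workers run one after another rather than in parallel.

```python
def predict(coeffs, x):
    return sum(c * xi for c, xi in zip(coeffs, x))


def sgd_pass(global_model, split, learning_rate):
    """Worker step: start from the global model, make one SGD pass over a split."""
    local = list(global_model)
    for x, y in split:
        error = predict(local, x) - y
        for i in range(len(local)):
            local[i] -= learning_rate * error * x[i]
    return local  # the partial model


def average(partial_models):
    """Master step: element-wise mean of the partial models."""
    n = len(partial_models)
    return [sum(column) / n for column in zip(*partial_models)]


def parallel_sgd(splits, n_features, learning_rate=0.01, supersteps=500):
    global_model = [0.0] * n_features
    for _ in range(supersteps):
        partial_models = [sgd_pass(global_model, s, learning_rate) for s in splits]
        global_model = average(partial_models)
    return global_model


# Toy data (assumed): y = 2 + 3*v, divided round-robin into three splits.
data = [([1.0, float(v)], 2.0 + 3.0 * v) for v in range(-6, 6)]
splits = [data[i::3] for i in range(3)]
model = parallel_sgd(splits, n_features=2)
```

Each superstep costs one pass over a third of the data per worker, which is where the parallel speedup would come from once the workers run on separate hosts.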

Parallel Iterative Algorithms on YARN
Based directly on work we did with Knitting Boar
Parallel logistic regression

And then added
Parallel linear regression
Parallel Neural Networks

Packaged in a new suite of parallel iterative algorithms called Metronome
100% Java, Apache License 2.0, on GitHub

Linear Regression Results
[Figure: Linear Regression - Parallel vs Serial. Y-axis: Total Processing Time (0 to 200); x-axis: Megabytes Processed Total (64 to 320); series: Parallel Runs, Serial Runs.]


Logistic Regression: 20Newsgroups
[Figure: Input Size vs Processing Time. Y-axis: 0 to 300; x-axis: 4.1 to 41; series: OLR, POLR.]

Convergence Testing
Debugging parallel iterative algorithms during
testing is hard
Processes on different hosts are difficult to observe

Using the Unit Test framework IRUnit we can
simulate the IterativeReduce framework
We know the plumbing of message passing works
Allows us to focus on parallel algorithm design/testing
while still using standard debugging tools

What are Neural Networks?
Inspired by biological nervous systems
Models layers of neurons in the brain

Can learn non-linear functions
Recently enjoying a surge in popularity

Multi-Layer Perceptron
First layer has input neurons
Last layer has output neurons
Each neuron in a layer is connected to all neurons in the next layer
Each neuron has an activation function, typically sigmoid / logistic
Input to a neuron is the sum of weight * input over its incoming connections
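
The layer structure above translates into a short forward pass. This is a minimal sketch with assumptions: the network is a list of weight matrices (one row per neuron in the receiving layer), input neurons pass their values through unchanged, and there are no bias terms because the slide does not mention them.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, inputs):
    """Propagate inputs through fully connected layers.

    weights[k][j][i] is the weight on the connection from neuron i in
    layer k to neuron j in layer k + 1.
    """
    activations = list(inputs)
    for layer in weights:
        activations = [
            # Input to each neuron: sum of weight * input over its connections.
            sigmoid(sum(w * a for w, a in zip(row, activations)))
            for row in layer
        ]
    return activations


# A 2-3-1 network with all weights zero: every neuron receives 0 as input.
zero_network = [
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],  # input layer to hidden layer
    [[0.0, 0.0, 0.0]],                     # hidden layer to output layer
]
output = forward(zero_network, [1.0, 2.0])
```

With all weights at zero, each neuron's summed input is 0 and the sigmoid returns 0.5, so output is [0.5].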

Backpropagation Learning
Calculates the gradient of the network's error
with respect to the network's modifiable weights
Intuition
Run forward pass of example through network
Compute activations and output

Iterate from the output layer back to the input layer (backward pass)
For each neuron in the layer
Compute node’s responsibility for error
Update weights on connections
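
The intuition above can be sketched for a small fully connected network. Assumptions made here and not stated on the slide: squared-error loss, sigmoid activations in every non-input layer, no bias terms, and a fixed learning rate. The names, the 2-3-1 layout, and the single training example are illustrative.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward_all(weights, x):
    """Forward pass that keeps the activations of every layer, input first."""
    activations = [list(x)]
    for layer in weights:
        previous = activations[-1]
        activations.append(
            [sigmoid(sum(w * a for w, a in zip(row, previous))) for row in layer]
        )
    return activations


def backprop_step(weights, x, target, learning_rate):
    """One training step; returns the squared error measured before the update."""
    activations = forward_all(weights, x)
    output = activations[-1]
    # Each output neuron's responsibility for the error (its delta).
    deltas = [(o - t) * o * (1.0 - o) for o, t in zip(output, target)]
    for index in range(len(weights) - 1, -1, -1):
        layer, inputs = weights[index], activations[index]
        lower_deltas = None
        if index > 0:
            # Push responsibility back through the weights as they were
            # before this step's update.
            lower_deltas = [
                sum(layer[j][i] * deltas[j] for j in range(len(layer)))
                * inputs[i] * (1.0 - inputs[i])
                for i in range(len(inputs))
            ]
        for j, row in enumerate(layer):
            for i in range(len(row)):
                row[i] -= learning_rate * deltas[j] * inputs[i]
        deltas = lower_deltas
    return 0.5 * sum((o - t) ** 2 for o, t in zip(output, target))


rng = random.Random(42)
sizes = [2, 3, 1]
network = [
    [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    for n_in, n_out in zip(sizes, sizes[1:])
]
losses = [backprop_step(network, [1.0, 0.5], [0.9], 0.5) for _ in range(500)]
```

Repeating the step on one example should drive the recorded error down over time, which is a quick sanity check that the gradient signs are right.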

Parallelizing Neural Networks
Dean et al. (NIPS, 2012)
First steps: focus on linear convex models, calculating a distributed gradient
Model parallelism must be combined with distributed optimization that leverages data parallelism
Simultaneously process distinct training examples in each of the many model replicas
Periodically combine their results to optimize our objective function

Single-pass frameworks such as MapReduce are “ill-suited”

Costs of Neural Network Training

Connection count grows quickly as neurons and layers increase
Example: {784, 450, 10} network has 357,300 connections

Need fast iterative framework
Example: 30 sec MR setup cost over 10k epochs: 30s x 10,000 = 300,000 seconds of setup time
That is 5,000 minutes, or about 83 hours

3 ways to speed up training
Subdivide dataset between workers (data parallelism)
Use the max transfer rate of disks and vector caching to maximize data throughput
Minimize inter-epoch setup times with proper iterative framework
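
The two examples above can be checked with a few lines of arithmetic. The helper name connection_count is an assumption, and the count covers only connections between adjacent, fully connected layers (no bias terms).

```python
def connection_count(layer_sizes):
    """Connections between each pair of adjacent fully connected layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


# 784 * 450 + 450 * 10 = 352,800 + 4,500
connections = connection_count([784, 450, 10])

# 30 seconds of MapReduce job setup paid once per epoch, for 10,000 epochs.
setup_seconds = 30 * 10_000
setup_minutes = setup_seconds / 60
setup_hours = setup_seconds / 3600
```

The setup overhead alone comes to roughly three and a half days before any training work is counted.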

Vector In-Memory Caching
Since we make lots of passes over the same dataset,
in-memory caching makes sense here
Once a record is vectorized, it is cached in memory
on the worker node

Speedup (single pass, “no cache” vs “cached”):
~12x
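
The caching idea can be sketched as a small wrapper that vectorizes each record once and serves later epochs from memory. The class name, the record ids, and the CSV parser are hypothetical and are not taken from Metronome.

```python
class CachingVectorizer:
    """Vectorize each record once and keep the result in memory."""

    def __init__(self, vectorize):
        self._vectorize = vectorize
        self._cache = {}
        self.vectorize_calls = 0  # counts how often real parsing happens

    def get(self, record_id, raw_record):
        if record_id not in self._cache:
            self.vectorize_calls += 1
            self._cache[record_id] = self._vectorize(raw_record)
        return self._cache[record_id]


def parse_csv(line):
    """Hypothetical vectorizer: turn a comma-separated line into floats."""
    return [float(field) for field in line.split(",")]


records = ["1.0,2.0,3.0", "4.0,5.0,6.0"]
cache = CachingVectorizer(parse_csv)
for epoch in range(10):
    vectors = [cache.get(i, line) for i, line in enumerate(records)]
```

After ten epochs over two records the parser has run only twice. Later passes read the cached vectors, which is the effect the speedup figure above refers to.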

Neural Networks Parallelization Speedup
[Figure: Y-axis: Training Speedup Factor (Multiple), 1.00 to 6.00; x-axis: Number of Parallel Processing Units, 1 to 5; series: UCI Iris, UCI Lenses, UCI Wine, UCI Dermatology, NIST Handwriting Downsample.]

Lessons Learned
Linear scaling continues to be achieved with
parameter averaging variations
Tuning is critical
Need to be good at selecting a learning rate

Future Directions
Adagrad (SGD Adaptive Learning Rates)
Parallel Quasi-Newton Methods
L-BFGS

Conjugate Gradient

More Neural Network Learning Refinement
Training progressively larger networks

Github
IterativeReduce
https://github.com/emsixteeen/IterativeReduce

Metronome
https://github.com/jpatanooga/Metronome

Unit Testing and IRUnit
Simulates the IterativeReduce parallel framework
Uses the same app.properties file that YARN applications do

Examples
https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/linearregression/iterativereduce/TestSimulateLinearRegressionIterativeReduce.java
https://github.com/jpatanooga/KnittingBoar/blob/master/src/test/java/com/cloudera/knittingboar/sgd/iterativereduce/TestKnittingBoar_IRUnitSim.java
