2. Motivation
• Given a trained model, prediction (ML inference) is easy to distribute
• Not a full-blown “Big Data” problem
• What about model training in the face of Big (training) Data?
• Distributed training needed!
• Under the hood: ML as optimisation
3. ML and optimisation
‘Big Data’ ML:
• high training sample volumes
• high-dimensional data
• distributed data collection and storage
Methods are based on optimisation:
• write ML as a (typically convex) optimisation problem
• optimise.
4. Problem formalization
Problem:
• minimize J(𝜃), 𝜃 ∈ ℝd
• subject to Ji(𝜃) ≤ bi, i = 1,…,m
with
• 𝜃 = (𝜃1,…,𝜃d) ∈ ℝd the optimisation variable
• J : ℝd → ℝ the objective function
• Ji : ℝd → ℝ, i = 1,…,m the constraint functions
• constants b1,…,bm the bounds for the constraints.
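A minimal Python/NumPy sketch of an unconstrained instance (m = 0) of this problem; the least-squares objective, data, and dimensions are illustrative, not from the slides:

import numpy as np

# Hypothetical unconstrained instance: least-squares regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # 100 training samples, d = 3
y = X @ np.array([1.0, -2.0, 0.5])          # targets from a known theta

def J(theta):
    # Objective J : R^d -> R (mean squared error).
    return np.mean((X @ theta - y) ** 2)

def grad_J(theta):
    # Gradient of J w.r.t. theta (R^d -> R^d).
    return 2.0 * X.T @ (X @ theta - y) / len(y)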
5. Gradient descent
• Update the parameters in the opposite direction of the gradient of
the objective function, ∇𝜃J(𝜃), w.r.t. the parameters.
• The learning rate 𝜂 determines the size of the steps we take to reach
a (local) minimum.
• We follow the direction of the slope of the surface created by the
objective function downhill until we reach a valley.
[NOTE: heavily based on Sebastian Ruder’s “An overview of
gradient descent optimization algorithms”, 19 Jan 2016]
6. Batch gradient descent
• Idea: compute the gradient of the objective over the entire training
set for each update; depending on the amount of data, this trades off
the accuracy of the parameter update against the time it takes to
perform an update.
• Update: 𝜃 = 𝜃 - 𝜂 ∙ ∇𝜃J(𝜃)
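A minimal NumPy sketch of the batch update above, reusing the illustrative least-squares setup from the earlier sketch (learning rate and iteration count are assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

def grad_J(theta):
    # Gradient of the mean-squared-error objective over the FULL dataset.
    return 2.0 * X.T @ (X @ theta - y) / len(y)

theta = np.zeros(3)
eta = 0.1                                   # learning rate
for _ in range(500):
    theta = theta - eta * grad_J(theta)     # theta <- theta - eta * grad J(theta)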
7. Stochastic gradient descent
• Idea: perform a parameter update for each training example x(i) and
label y(i)
• Update: 𝜃 = 𝜃 - 𝜂 ∙ ∇𝜃J(𝜃; x(i), y(i))
• Avoids the redundant computations batch gradient descent performs
on large datasets (gradients recomputed for similar examples before
each update)
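A matching sketch of SGD on the same illustrative least-squares problem, performing one update per training example (step size and epoch count are assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

def grad_J_i(theta, x_i, y_i):
    # Gradient of the loss on a single example (x_i, y_i).
    return 2.0 * x_i * (x_i @ theta - y_i)

theta = np.zeros(3)
eta = 0.01
for epoch in range(50):
    for i in rng.permutation(len(y)):       # shuffle examples each epoch
        theta = theta - eta * grad_J_i(theta, X[i], y[i])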
9. Nesterov accelerated gradient
• Idea: 1. make a big jump in the direction of the previously
accumulated gradient and measure the gradient there, then 2. make a
correction.
• Update:
• vt = 𝛾 vt-1 + 𝜂 ∙ ∇𝜃J(𝜃 - 𝛾 vt-1)
• 𝜃 = 𝜃 - vt
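A sketch of the NAG update on the same illustrative problem; 𝛾 and 𝜂 below are typical values, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

def grad_J(theta):
    return 2.0 * X.T @ (X @ theta - y) / len(y)

theta, v = np.zeros(3), np.zeros(3)
eta, gamma = 0.05, 0.9                      # learning rate, momentum term
for _ in range(200):
    # 1. jump along the accumulated velocity and measure the gradient
    #    at the look-ahead point theta - gamma * v; 2. correct.
    v = gamma * v + eta * grad_J(theta - gamma * v)
    theta = theta - v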
10. Adagrad
• Idea: larger updates for infrequent and smaller updates for frequent
parameters.
• Update: let gt,i = ∇𝜃J(𝜃t,i) and 𝜃t+1 = 𝜃t + 𝛥𝜃t. Then:
• SGD: 𝛥𝜃t = - 𝜂 ∙ gt
• Adagrad: 𝛥𝜃t = - 𝜂 / √(Gt + ϵ) ⊙ gt
with Gt ∈ ℝd⨉d a diagonal matrix where each diagonal element i,i is the
sum of the squares of the gradients w.r.t. 𝜃i up to time step t, and ⊙ the
element-wise matrix-vector multiplication.
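A sketch of Adagrad on the same illustrative problem; the diagonal of Gt is kept as a vector G, and 𝜂, ϵ are typical values, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

def grad_J(theta):
    return 2.0 * X.T @ (X @ theta - y) / len(y)

theta = np.zeros(3)
G = np.zeros(3)                             # diagonal of G_t: running sum of g^2
eta, eps = 0.5, 1e-8
for _ in range(500):
    g = grad_J(theta)
    G += g ** 2                             # accumulate squared gradients
    theta = theta - eta / np.sqrt(G + eps) * g   # per-parameter step sizes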
11. Adadelta
• Idea: Instead of accumulating all past squared gradients, restrict the
window of accumulated past gradients to some fixed size w.
• The running average of squared gradients is defined recursively as a
decaying average of all past squared gradients:
E[g²]t = 𝛾 E[g²]t-1 + (1-𝛾) gt²
and analogously for squared parameter updates:
E[𝛥𝜃²]t = 𝛾 E[𝛥𝜃²]t-1 + (1-𝛾) 𝛥𝜃t²
• Update: replace the diagonal matrix Gt with the decaying average
over past squared gradients E[g²]t; writing RMS[·]t = √(E[·²]t + ϵ),
𝛥𝜃t = - RMS[𝛥𝜃]t-1 / RMS[g]t ⊙ gt
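A sketch of the Adadelta update on the same illustrative problem; note there is no learning rate, and 𝛾, ϵ below are typical values, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

def grad_J(theta):
    return 2.0 * X.T @ (X @ theta - y) / len(y)

theta = np.zeros(3)
Eg2 = np.zeros(3)                           # E[g^2]_t: decaying avg of squared gradients
Edx2 = np.zeros(3)                          # E[dtheta^2]_t: decaying avg of squared updates
gamma, eps = 0.9, 1e-6
for _ in range(500):
    g = grad_J(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g   # RMS[dtheta]_{t-1}/RMS[g]_t * g
    Edx2 = gamma * Edx2 + (1 - gamma) * dx ** 2
    theta = theta + dx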
13. Visualization and comparison
Adagrad, Adadelta, and RMSprop
almost immediately head off in the
right direction and converge
similarly fast, while Momentum and
NAG are led off-track, evoking the
image of a ball rolling down the hill.
NAG, however, quickly corrects its course thanks to the increased
responsiveness it gains by looking ahead, and heads toward the
minimum.
14. Conclusions
• Big Data ML requires (scalable, distributed) algorithms to process
training points in small batches, performing effective incremental
updates to the model
• Final objective: a closed loop that trains models and compares them
recursively
• Key challenge: evaluation metrics in the face of available resources
(including data)