2. Motivation
• Given a trained model, prediction (ML inference) is easy to distribute
• Not a full-blown “Big Data” problem
• What about model training in the face of Big (training) Data?
• Distributed training needed!
• Under the hood: ML as optimisation
3. ML and optimisation
‘Big Data’ ML:
• high training sample volumes
• high-dimensional data
• distributed data collection and storage
Methods are based on optimisation:
• write ML as a (typically convex) optimisation problem
• optimise.
4. Problem formalization
Problem:
• minimize J(𝜃), 𝜃 ∈ ℝd
• subject to Ji(𝜃) ≤ bi, i = 1,…,m
with
• 𝜃 = (𝜃1,…,𝜃d) ∈ ℝd the optimisation variable
• J : ℝd → ℝ the objective function
• Ji : ℝd → ℝ, i = 1,…,m the constraint functions
• constants b1,…,bm the bounds for the constraints.
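A minimal Python/NumPy sketch of an unconstrained instance (m = 0) of this problem; the least-squares objective, data, and dimensions are illustrative, not from the slides:

import numpy as np

# Hypothetical unconstrained instance: least-squares regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # 100 training samples, d = 3
y = X @ np.array([1.0, -2.0, 0.5])          # targets from a known theta

def J(theta):
    # Objective J : R^d -> R (mean squared error).
    return np.mean((X @ theta - y) ** 2)

def grad_J(theta):
    # Gradient of J w.r.t. theta (R^d -> R^d).
    return 2.0 * X.T @ (X @ theta - y) / len(y)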
5. Gradient descent
• Update the parameters in the opposite direction of the gradient of
the objective function, ∇𝜃J(𝜃), w.r.t. the parameters.
• The learning rate 𝜂 determines the size of the steps we take to reach
a (local) minimum.
• We follow the direction of the slope of the surface created by the
objective function downhill until we reach a valley.
[NOTE: heavily based on Sebastian Ruder’s “An overview of
gradient descent optimization algorithms”, 19 Jan 2016]
6. Batch gradient descent
• Idea: compute the gradient of the objective over the entire training
set for each update; depending on the amount of data, this trades off
the accuracy of the parameter update against the time it takes to
perform an update.
• Update: 𝜃 = 𝜃 - 𝜂 ∙ ∇𝜃J(𝜃)
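A minimal NumPy sketch of the batch update above, reusing the illustrative least-squares setup from the earlier sketch (learning rate and iteration count are assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

def grad_J(theta):
    # Gradient of the mean-squared-error objective over the FULL dataset.
    return 2.0 * X.T @ (X @ theta - y) / len(y)

theta = np.zeros(3)
eta = 0.1                                   # learning rate
for _ in range(500):
    theta = theta - eta * grad_J(theta)     # theta <- theta - eta * grad J(theta)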
7. Stochastic gradient descent
• Idea: perform a parameter update for each training example x(i) and
label y(i)
• Update: 𝜃 = 𝜃 - 𝜂 ∙ ∇𝜃J(𝜃; x(i), y(i))
• Avoids the redundant computations batch gradient descent performs
on large datasets (gradients recomputed for similar examples before
each update)
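A matching sketch of SGD on the same illustrative least-squares problem, performing one update per training example (step size and epoch count are assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

def grad_J_i(theta, x_i, y_i):
    # Gradient of the loss on a single example (x_i, y_i).
    return 2.0 * x_i * (x_i @ theta - y_i)

theta = np.zeros(3)
eta = 0.01
for epoch in range(50):
    for i in rng.permutation(len(y)):       # shuffle examples each epoch
        theta = theta - eta * grad_J_i(theta, X[i], y[i])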
9. Nesterov accelerated gradient
• Idea: 1. make a big jump in the direction of the previously
accumulated gradient and measure the gradient there, then 2. make a
correction.
• Update:
• vt = 𝛾 vt-1 + 𝜂 ∙ ∇𝜃J(𝜃 - 𝛾 vt-1)
• 𝜃 = 𝜃 - vt
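A sketch of the NAG update on the same illustrative problem; 𝛾 and 𝜂 below are typical values, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

def grad_J(theta):
    return 2.0 * X.T @ (X @ theta - y) / len(y)

theta, v = np.zeros(3), np.zeros(3)
eta, gamma = 0.05, 0.9                      # learning rate, momentum term
for _ in range(200):
    # 1. jump along the accumulated velocity and measure the gradient
    #    at the look-ahead point theta - gamma * v; 2. correct.
    v = gamma * v + eta * grad_J(theta - gamma * v)
    theta = theta - v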
10. Adagrad
• Idea: larger updates for infrequent and smaller updates for frequent
parameters.
• Update: let gt,i = ∇𝜃J(𝜃t,i) and 𝜃t+1 = 𝜃t + 𝛥𝜃t. Then:
• SGD: 𝛥𝜃t = - 𝜂 ∙ gt
• Adagrad: 𝛥𝜃t = - 𝜂 / √(Gt + ϵ) ⊙ gt
with Gt ∈ ℝd⨉d a diagonal matrix where each diagonal element i,i is the
sum of the squares of the gradients w.r.t. 𝜃i up to time step t, and ⊙ the
element-wise matrix-vector multiplication.
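A sketch of Adagrad on the same illustrative problem; the diagonal of Gt is kept as a vector G, and 𝜂, ϵ are typical values, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

def grad_J(theta):
    return 2.0 * X.T @ (X @ theta - y) / len(y)

theta = np.zeros(3)
G = np.zeros(3)                             # diagonal of G_t: running sum of g^2
eta, eps = 0.5, 1e-8
for _ in range(500):
    g = grad_J(theta)
    G += g ** 2                             # accumulate squared gradients
    theta = theta - eta / np.sqrt(G + eps) * g   # per-parameter step sizes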
11. Adadelta
• Idea: Instead of accumulating all past squared gradients, restrict the
window of accumulated past gradients to some fixed size w.
• The running average of squared gradients is defined recursively as a
decaying average of all past squared gradients:
E[g²]t = 𝛾 E[g²]t-1 + (1-𝛾) gt²
and analogously for squared parameter updates:
E[𝛥𝜃²]t = 𝛾 E[𝛥𝜃²]t-1 + (1-𝛾) 𝛥𝜃t²
• Update: replace the diagonal matrix Gt with the decaying average
over past squared gradients E[g²]t; writing RMS[·]t = √(E[·²]t + ϵ),
𝛥𝜃t = - RMS[𝛥𝜃]t-1 / RMS[g]t ⊙ gt
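A sketch of the Adadelta update on the same illustrative problem; note there is no learning rate, and 𝛾, ϵ below are typical values, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

def grad_J(theta):
    return 2.0 * X.T @ (X @ theta - y) / len(y)

theta = np.zeros(3)
Eg2 = np.zeros(3)                           # E[g^2]_t: decaying avg of squared gradients
Edx2 = np.zeros(3)                          # E[dtheta^2]_t: decaying avg of squared updates
gamma, eps = 0.9, 1e-6
for _ in range(500):
    g = grad_J(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g   # RMS[dtheta]_{t-1}/RMS[g]_t * g
    Edx2 = gamma * Edx2 + (1 - gamma) * dx ** 2
    theta = theta + dx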
13. Visualization and comparison
Adagrad, Adadelta, and RMSprop
almost immediately head off in the
right direction and converge
similarly fast, while Momentum and
NAG are led off-track, evoking the
image of a ball rolling down the hill.
NAG, however, quickly corrects its course thanks to the increased
responsiveness it gains by looking ahead, and heads toward the
minimum.
14. Conclusions
• Big Data ML requires (scalable, distributed) algorithms to process
training points in small batches, performing effective incremental
updates to the model
• Final objective: a closed loop that trains models and compares them
recursively
• Key challenge: evaluation metrics in the face of available resources
(including data)