Distributed optimization
mquartulli@vicomtech.org
Motivation
• Given a trained model, ML / prediction is easy to distribute
• Not a full-blown “Big Data” problem
• What about model training in the face of Big (training) Data?
• Distributed training needed!
• Under the hood: ML as optimisation
ML and optimisation
‘Big Data’ ML:
• high training sample volumes
• high-dimensional data
• distributed data: collection, storage
Methods are based on optimisation:
• write ML as a (typically convex) optimisation problem
• optimise.
Problem formalization
Problem:
• minimize J(𝜃), 𝜃 ∈ ℝ^d
• subject to J_i(𝜃) ≤ b_i, i = 1,…,m
with
• 𝜃 = (𝜃_1,…, 𝜃_d) ∈ ℝ^d the optimisation variable
• J : ℝ^d → ℝ the objective function
• J_i : ℝ^d → ℝ, i = 1,…,m the constraint functions
• constants b_1,…, b_m the bounds for the constraints.
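As a concrete illustration of this formalization, here is a minimal sketch. The two-dimensional objective, the single constraint and the helper name `is_feasible` are hypothetical and not taken from the slides.

```python
# Hypothetical instance (not from the slides):
#   minimise   J(theta)   = theta_1^2 + theta_2^2
#   subject to J_1(theta) = 1 - theta_1 - theta_2 <= b_1 = 0

def objective(theta):
    """Objective function J : R^d -> R."""
    return sum(t * t for t in theta)

constraints = [lambda theta: 1.0 - theta[0] - theta[1]]  # the J_i
bounds = [0.0]                                           # the b_i

def is_feasible(theta):
    """True when every constraint J_i(theta) <= b_i holds."""
    return all(J_i(theta) <= b_i for J_i, b_i in zip(constraints, bounds))
```

A point such as (0.5, 0.5) satisfies the constraint, while the unconstrained minimiser (0, 0) does not.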
Gradient descent
• Update the parameters in the opposite direction of the gradient of
the objective function ∇_𝜃 J(𝜃) w.r.t. the parameters.
• The learning rate 𝜂 determines the size of the steps we take to reach
a (local) minimum.
• We follow the direction of the slope of the surface created by the
objective function downhill until we reach a valley. 
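The downhill walk described above can be sketched in a few lines. The one-dimensional objective J(𝜃) = (𝜃 - 3)^2 and the default values for 𝜂 and the number of steps are assumptions chosen for illustration only.

```python
def gradient_descent(grad, theta, eta=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate eta."""
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# Hypothetical objective J(theta) = (theta - 3)^2, whose gradient is 2 (theta - 3).
theta_min = gradient_descent(lambda th: 2.0 * (th - 3.0), theta=0.0)
```

With these settings the distance to the minimiser 𝜃 = 3 shrinks by a factor of 0.8 per step, so `theta_min` ends up very close to 3.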



[NOTE: heavily based on Sebastian Ruder’s “An overview of
gradient descent optimization algorithms”, 19 Jan 2016]
Batch gradient descent
• Idea: compute the gradient of the objective over the entire training
set for every update. Depending on the amount of data, this trades off
the accuracy of the parameter update against the time it takes to
perform an update.
• Update: 𝜃 = 𝜃 - 𝜂 ∙ ∇_𝜃 J(𝜃)
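A minimal sketch of the batch update, assuming a hypothetical linear model y ≈ w·x + b with a mean-squared-error objective (the slides do not fix a model). Every update touches the whole dataset.

```python
def batch_gradient(theta, xs, ys):
    """Gradient of J(theta) = 1/(2n) * sum((w*x + b - y)^2) over ALL examples."""
    w, b = theta
    n = len(xs)
    errs = [w * x + b - y for x, y in zip(xs, ys)]
    return [sum(e * x for e, x in zip(errs, xs)) / n, sum(errs) / n]

def batch_gd(xs, ys, eta=0.1, epochs=2000):
    theta = [0.0, 0.0]
    for _ in range(epochs):
        g = batch_gradient(theta, xs, ys)                   # one full pass per update
        theta = [t - eta * gi for t, gi in zip(theta, g)]   # theta = theta - eta * grad
    return theta

# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = batch_gd(xs, ys)
```

On this toy data the fitted `w` and `b` should approach 2 and 1.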
Stochastic gradient descent
• Idea: perform a parameter update for each training example x^(i) and
label y^(i)
• Update: 𝜃 = 𝜃 - 𝜂 ∙ ∇_𝜃 J(𝜃; x^(i), y^(i))
• Avoids the redundant gradient computations that batch gradient
descent performs on large datasets, at the cost of higher-variance updates
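A sketch of the per-example update, assuming a hypothetical linear model y ≈ w·x + b with squared error per example. The shuffling, seed and hyperparameters are illustrative choices, not prescribed by the slides.

```python
import random

def sgd(xs, ys, eta=0.05, epochs=2000, seed=0):
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)              # visit the examples in random order
        for x, y in data:
            err = w * x + b - y        # gradient of 0.5 * err^2 is err * [x, 1]
            w -= eta * err * x         # one update per training example
            b -= eta * err
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd(xs, ys)
```

Because this toy data is noise-free, the per-example updates agree on a single solution and the iterates settle near w = 2, b = 1. On noisy data a constant 𝜂 would leave the iterates fluctuating around the minimum.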
Momentum gradient descent
• Idea: dampen oscillations across ravines by adding a momentum term 𝛾
• Update:
• v_t = 𝛾 v_{t-1} + 𝜂 ∙ ∇_𝜃 J(𝜃)
• 𝜃 = 𝜃 - v_t
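The two-line momentum update can be sketched as follows. The ravine-shaped quadratic J(𝜃) = 0.5 (𝜃_1^2 + 25 𝜃_2^2) and the values of 𝜂 and 𝛾 are assumptions for illustration.

```python
def momentum_gd(grad, theta, eta=0.03, gamma=0.9, steps=500):
    v = [0.0] * len(theta)                                    # velocity v_0 = 0
    for _ in range(steps):
        g = grad(theta)
        v = [gamma * vi + eta * gi for vi, gi in zip(v, g)]   # v_t = gamma v_{t-1} + eta grad
        theta = [ti - vi for ti, vi in zip(theta, v)]         # theta = theta - v_t
    return theta

def ravine_grad(theta):
    """Gradient of the hypothetical ravine J = 0.5 * (theta_1^2 + 25 * theta_2^2)."""
    return [theta[0], 25.0 * theta[1]]

theta = momentum_gd(ravine_grad, [1.0, 1.0])
```

The velocity accumulates along the shallow direction while the alternating gradients along the steep direction partly cancel.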
Nesterov accelerated gradient
• Idea: 1. make a big jump in the direction of the previously accumulated
gradient and measure the gradient there, then 2. make a correction.
• Update:
• v_t = 𝛾 v_{t-1} + 𝜂 ∙ ∇_𝜃 J(𝜃 - 𝛾 v_{t-1})
• 𝜃 = 𝜃 - v_t
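A sketch of the look-ahead update. The only difference from plain momentum is that the gradient is evaluated at 𝜃 - 𝛾 v_{t-1}. The quadratic test objective and the hyperparameters are assumptions.

```python
def nesterov_gd(grad, theta, eta=0.03, gamma=0.9, steps=500):
    v = [0.0] * len(theta)
    for _ in range(steps):
        # 1. jump along the accumulated velocity and measure the gradient there
        lookahead = [ti - gamma * vi for ti, vi in zip(theta, v)]
        g = grad(lookahead)
        # 2. correct: v_t = gamma v_{t-1} + eta grad(theta - gamma v_{t-1})
        v = [gamma * vi + eta * gi for vi, gi in zip(v, g)]
        theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta

def ravine_grad(theta):
    """Gradient of the hypothetical ravine J = 0.5 * (theta_1^2 + 25 * theta_2^2)."""
    return [theta[0], 25.0 * theta[1]]

theta = nesterov_gd(ravine_grad, [1.0, 1.0])
```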
Adagrad
• Idea: larger updates for infrequent and smaller updates for frequent
parameters.
• Update: let g_{t,i} = ∇_𝜃 J(𝜃_{t,i}) and 𝜃_{t+1} = 𝜃_t + 𝛥𝜃_t. Then:
• SGD: 𝛥𝜃_t = - 𝜂 ∙ g_t
• Adagrad: 𝛥𝜃_t = - 𝜂 / √(G_t + ϵ) ⊙ g_t
with G_t ∈ ℝ^{d⨉d} a diagonal matrix in which each diagonal element (i,i) is
the sum of the squares of the gradients w.r.t. 𝜃_i up to time step t, ϵ a small
smoothing term that avoids division by zero, and ⊙ element-wise matrix-vector
multiplication.
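A sketch of the Adagrad update that stores only the diagonal of G_t as a list. The quadratic test objective and the values of 𝜂 and ϵ are assumptions.

```python
import math

def adagrad(grad, theta, eta=0.5, eps=1e-8, steps=200):
    G = [0.0] * len(theta)            # diagonal of G_t: running sum of squared gradients
    for _ in range(steps):
        g = grad(theta)
        G = [Gi + gi * gi for Gi, gi in zip(G, g)]
        # per-parameter step: -eta / sqrt(G_ii + eps) * g_i
        theta = [ti - eta / math.sqrt(Gi + eps) * gi
                 for ti, Gi, gi in zip(theta, G, g)]
    return theta

def ravine_grad(theta):
    """Gradient of the hypothetical ravine J = 0.5 * (theta_1^2 + 25 * theta_2^2)."""
    return [theta[0], 25.0 * theta[1]]

theta = adagrad(ravine_grad, [1.0, 1.0])
```

Since each coordinate is divided by the root of its own accumulated squared gradients, the steep and shallow directions of the ravine take steps of nearly the same size.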
Adadelta
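• Idea: instead of accumulating all past squared gradients, restrict the
window of accumulated past gradients to some fixed size w.
• Instead of storing w past gradients, the sum of gradients is recursively
defined as a decaying average of all past squared gradients:
E[g^2]_t = 𝛾 E[g^2]_{t-1} + (1-𝛾) g_t^2
• Update: we replace the diagonal matrix G_t with the decaying
average over past squared gradients E[g^2]_t, and the learning rate 𝜂
with a decaying average over past squared updates:
E[𝛥𝜃^2]_t = 𝛾 E[𝛥𝜃^2]_{t-1} + (1-𝛾) 𝛥𝜃_t^2
𝛥𝜃_t = - RMS[𝛥𝜃]_{t-1} / RMS[g]_t ⊙ g_t
with RMS[x]_t = √(E[x^2]_t + ϵ).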
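A sketch of the Adadelta update with both decaying averages kept per parameter, taking RMS[x]_t = √(E[x^2]_t + ϵ). The test objective, 𝛾 = 0.9, ϵ = 1e-6 and the step count are assumptions.

```python
import math

def adadelta(grad, theta, gamma=0.9, eps=1e-6, steps=50):
    avg_sq_g = [0.0] * len(theta)     # E[g^2]
    avg_sq_dx = [0.0] * len(theta)    # E[dtheta^2]
    for _ in range(steps):
        g = grad(theta)
        avg_sq_g = [gamma * a + (1.0 - gamma) * gi * gi
                    for a, gi in zip(avg_sq_g, g)]
        # dtheta_t = - RMS[dtheta]_{t-1} / RMS[g]_t * g_t  (no learning rate)
        dx = [-math.sqrt(d + eps) / math.sqrt(a + eps) * gi
              for d, a, gi in zip(avg_sq_dx, avg_sq_g, g)]
        avg_sq_dx = [gamma * d + (1.0 - gamma) * di * di
                     for d, di in zip(avg_sq_dx, dx)]
        theta = [ti + di for ti, di in zip(theta, dx)]
    return theta

def ravine_grad(theta):
    """Gradient of the hypothetical ravine J = 0.5 * (theta_1^2 + 25 * theta_2^2)."""
    return [theta[0], 25.0 * theta[1]]

theta = adadelta(ravine_grad, [1.0, 1.0])
```

Because E[𝛥𝜃^2]_0 = 0, the first steps are on the order of √ϵ, so with this ϵ the iterate moves only a short way towards the minimum in the first 50 steps.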
RMSprop
• Idea: use the first update vector of Adadelta, i.e. divide the learning
rate by a decaying average of squared gradients
• Update:
• E[g^2]_t = 0.9 E[g^2]_{t-1} + 0.1 g_t^2
• 𝛥𝜃_t = - 𝜂 / √(E[g^2]_t + ϵ) ⊙ g_t
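A sketch of the RMSprop update with the 0.9 / 0.1 decay from the slide. The test objective and the values of 𝜂, ϵ and the step count are assumptions.

```python
import math

def rmsprop(grad, theta, eta=0.01, decay=0.9, eps=1e-8, steps=1000):
    avg_sq = [0.0] * len(theta)       # E[g^2]
    for _ in range(steps):
        g = grad(theta)
        avg_sq = [decay * a + (1.0 - decay) * gi * gi
                  for a, gi in zip(avg_sq, g)]
        # dtheta_t = - eta / sqrt(E[g^2]_t + eps) * g_t
        theta = [ti - eta / math.sqrt(a + eps) * gi
                 for ti, a, gi in zip(theta, avg_sq, g)]
    return theta

def ravine_grad(theta):
    """Gradient of the hypothetical ravine J = 0.5 * (theta_1^2 + 25 * theta_2^2)."""
    return [theta[0], 25.0 * theta[1]]

theta = rmsprop(ravine_grad, [1.0, 1.0])
```

With a constant 𝜂 each step has magnitude at most 𝜂 / √0.1, so on this objective the iterate ends up within a few multiples of 𝜂 of the minimum rather than converging to it exactly.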
Visualization and comparison
Adagrad, Adadelta, and RMSprop almost immediately head off in the
right direction and converge similarly fast, while Momentum and NAG
are led off-track, evoking the image of a ball rolling down the hill.
NAG, however, quickly corrects its course, thanks to the increased
responsiveness it gains by looking ahead, and heads to the minimum.
Conclusions
• Big Data ML requires (scalable, distributed) algorithms to process
training points in small batches, performing effective incremental
updates to the model
• Final objective: a closed loop that trains models and compares them
recursively
• Key challenge: evaluation metrics in the face of available resources
(including data)
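The first conclusion, processing training points in small batches with incremental updates, can be sketched as mini-batch SGD. The linear model, batch size and hyperparameters are assumptions, and the sketch runs on a single machine rather than a distributed system.

```python
import random

def minibatch_sgd(xs, ys, batch_size=2, eta=0.1, epochs=2000, seed=0):
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            errs = [w * x + b - y for x, y in batch]
            # average the gradient over the batch, then update incrementally
            grad_w = sum(e * x for e, (x, _y) in zip(errs, batch)) / len(batch)
            grad_b = sum(errs) / len(batch)
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
```

Only one batch needs to be in memory per update, which is what makes this pattern a natural starting point for training on data that is too large or too spread out to process in one pass.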
