Parallel Optimization in Machine Learning
Fabian Pedregosa
December 19, 2017, Huawei Paris Research Center
About me
• Engineer (2010-2012), Inria Saclay
(scikit-learn kickstart).
• PhD (2012-2015), Inria Saclay.
• Postdoc (2015-2016),
Dauphine–ENS–Inria Paris.
• Postdoc (2017-present), UC Berkeley
- ETH Zurich (Marie-Curie fellowship,
European Commission)
Hacker at heart ... trapped in a
researcher’s body.
1/32
Motivation
Computer ad in 1993 vs. computer ad in 2006.
What has changed?
The 2006 ad no longer mentions the speed of the processor.
Its primary feature: the number of cores.
2/32
40 years of CPU trends
• Speed of CPUs has stagnated since 2005.
• Multi-core architectures are here to stay.
Parallel algorithms needed to take advantage of modern CPUs.
3/32
Parallel optimization
Parallel algorithms can be divided into two large categories:
synchronous and asynchronous. Image credits: (Peng et al. 2016)
Synchronous methods
✓ Easy to implement (well-developed software packages exist).
✓ Well understood.
✗ Limited speedup due to synchronization costs.
Asynchronous methods
✓ Faster, typically larger speedups.
✗ Not well understood, large gap between theory and practice.
✗ No mature software solutions.
4/32
Outline
Synchronous methods
• Synchronous (stochastic) gradient descent.
Asynchronous methods
• Asynchronous stochastic gradient descent (Hogwild) (Niu et al.
2011)
• Asynchronous variance-reduced stochastic methods (Leblond, P.,
and Lacoste-Julien 2017), (Pedregosa, Leblond, and
Lacoste-Julien 2017).
• Analysis of asynchronous methods.
• Codes and implementation aspects.
Leaving out many parallel synchronous methods: ADMM (Glowinski
and Marroco 1975), CoCoA (Jaggi et al. 2014), DANE (Shamir, Srebro,
and Zhang 2014), to name a few.
5/32
Outline
Most of the following is joint work with Rémi Leblond and Simon
Lacoste-Julien
Rémi Leblond Simon Lacoste–Julien
6/32
Synchronous algorithms
Optimization for machine learning
A large part of the problems in machine learning can be framed as
optimization problems of the form
   minimize_x f(x) := (1/n) ∑_{i=1}^{n} fi(x)
Gradient descent (Cauchy 1847). Descend along the steepest direction (−∇f(x)):
   x⁺ = x − γ∇f(x)
Stochastic gradient descent (SGD) (Robbins and Monro 1951). Select a
random index i and descend along −∇fi(x):
   x⁺ = x − γ∇fi(x)
images source: Francis Bach
7/32
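To make the two update rules concrete, here is a minimal NumPy sketch of gradient descent and SGD on a least-squares objective (the quadratic fi and all function names are illustrative assumptions, not code from the talk):

    import numpy as np

    def gradient_descent(A, b, gamma=0.01, n_iter=100):
        """Full gradient descent on f(x) = 1/(2n) ||Ax - b||^2."""
        n, p = A.shape
        x = np.zeros(p)
        for _ in range(n_iter):
            grad = A.T @ (A @ x - b) / n       # exact gradient: uses all n samples
            x = x - gamma * grad               # x+ = x - gamma * grad f(x)
        return x

    def sgd(A, b, gamma=0.01, n_iter=1000, seed=0):
        """Stochastic gradient descent: one random sample per update."""
        rng = np.random.default_rng(seed)
        n, p = A.shape
        x = np.zeros(p)
        for _ in range(n_iter):
            i = rng.integers(n)                # uniform sampling of the index i
            grad_i = A[i] * (A[i] @ x - b[i])  # partial gradient of f_i
            x = x - gamma * grad_i             # x+ = x - gamma * grad f_i(x)
        return x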
Parallel synchronous gradient descent
Computation of the gradient is distributed among k workers
• Workers can be different computers, CPUs or GPUs.
• Popular frameworks: Spark, TensorFlow, PyTorch, Hadoop.
8/32
Parallel synchronous gradient descent
1. Choose n1, . . . , nk that sum to n.
2. Distribute the computation of ∇f(x) among the k nodes:
   ∇f(x) = (1/n) ∑_{i=1}^{n} ∇fi(x)
         = (1/k) [ (1/n1) ∑_{i ∈ block 1} ∇fi(x)  (done by worker 1)
                 + · · · +
                   (1/nk) ∑_{i ∈ block k} ∇fi(x)  (done by worker k) ]
   where block j contains nj of the samples.
3. A master node performs the gradient descent update
   x⁺ = x − γ∇f(x)
✓ Trivial parallelization, same analysis as gradient descent.
✗ Synchronization step at every iteration (step 3).
9/32
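A minimal sketch of this scheme with Python's multiprocessing, splitting the samples into k blocks (the least-squares fi and the helper names are assumptions made for illustration):

    import numpy as np
    from multiprocessing import Pool

    def block_gradient(args):
        """Average gradient of one block of samples (least-squares f_i for illustration)."""
        A_block, b_block, x = args
        return A_block.T @ (A_block @ x - b_block) / len(b_block)

    def parallel_gradient_step(A, b, x, gamma=0.1, k=4):
        # 2. distribute the gradient computation among k workers
        blocks = [(Ab, bb, x) for Ab, bb in zip(np.array_split(A, k), np.array_split(b, k))]
        with Pool(k) as pool:
            partial_grads = pool.map(block_gradient, blocks)   # synchronization barrier
        grad = np.mean(partial_grads, axis=0)                  # master averages the block gradients
        # 3. master performs the update
        return x - gamma * grad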
Parallel synchronous SGD
Can also be extended to stochastic gradient descent.
1. Select k samples i1, . . . , ik uniformly at random.
2. Compute ∇fit(x) on worker t, in parallel.
3. Perform the (mini-batch) stochastic gradient descent update
   x⁺ = x − γ (1/k) ∑_{t=1}^{k} ∇fit(x)
✓ Trivial parallelization, same analysis as (mini-batch) stochastic
gradient descent.
✓ This is the kind of parallelization implemented in deep learning
libraries (TensorFlow, PyTorch, Theano, etc.).
✗ Synchronization step at every iteration (step 3).
10/32
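A sketch of a single synchronous mini-batch step with a thread pool, one worker per sampled gradient (again with an illustrative least-squares fi):

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def minibatch_sgd_step(A, b, x, gamma=0.1, k=8, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.integers(len(b), size=k)              # 1. select k samples uniformly at random
        grad_i = lambda i: A[i] * (A[i] @ x - b[i])     # partial gradient of f_i
        with ThreadPoolExecutor(max_workers=k) as pool:
            grads = list(pool.map(grad_i, idx))         # 2. one worker per sample; implicit barrier
        return x - gamma * np.mean(grads, axis=0)       # 3. mini-batch SGD update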
Asynchronous algorithms
Asynchronous SGD
Synchronization is the bottleneck. What if we just ignore it?
Hogwild (Niu et al. 2011): each core runs SGD in parallel, without
synchronization, and updates the same vector of coefficients.
In theory: convergence under very strong assumptions.
In practice: it just works.
11/32
Hogwild in more detail
Each core follows the same procedure:
1. Read the current coefficients x̂ from shared memory.
2. Sample i ∈ {1, . . . , n} uniformly at random.
3. Compute the partial gradient ∇fi(x̂).
4. Write the SGD update to shared memory: x ← x − γ∇fi(x̂).
12/32
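A toy, lock-free sketch of this loop with Python threads sharing a NumPy array; because of CPython's GIL it only mimics the memory-access pattern (inconsistent reads, unsynchronized writes), not the speed, of a real C/C++ Hogwild implementation, and the least-squares fi is again an assumption:

    import numpy as np
    import threading

    def hogwild(A, b, gamma=0.01, n_updates=10_000, n_cores=4):
        n, p = A.shape
        x = np.zeros(p)                              # shared vector of coefficients

        def worker(seed):
            rng = np.random.default_rng(seed)
            for _ in range(n_updates):
                x_hat = x.copy()                     # 1. read shared memory (no lock)
                i = rng.integers(n)                  # 2. sample i uniformly at random
                g = A[i] * (A[i] @ x_hat - b[i])     # 3. partial gradient at x_hat
                x[:] -= gamma * g                    # 4. write the update, no lock

        threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_cores)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return x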
Hogwild is fast
Hogwild can be very fast. But it's still SGD...
• With a constant step size, it bounces around the optimum.
• With a decreasing step size, convergence is slow.
• There are better alternatives (Emilie already mentioned some).
13/32
Looking for excitement? ...
analyze asynchronous methods!
Analysis of asynchronous methods
Simple things become counter-intuitive, e.g., how do we name the iterates?
The iterates will change depending on the speed of the processors.
14/32
Naming scheme in Hogwild
Simple, intuitive and wrong
Each time a core has finished writing to shared memory, increment the
iteration counter.
⇐⇒ x̂t = (t + 1)-th successful update to shared memory.
The values of x̂t and it are not determined until the iteration has finished.
=⇒ x̂t and it are not necessarily independent.
15/32
Unbiased gradient estimate
SGD-like algorithms crucially rely on the unbiasedness property
   Ei[∇fi(x)] = ∇f(x).
For synchronous algorithms, this follows from the uniform sampling of i:
   Ei[∇fi(x)] = ∑_{i=1}^{n} Proba(selecting i) ∇fi(x)
              = ∑_{i=1}^{n} (1/n) ∇fi(x)      (uniform sampling)
              = ∇f(x)
16/32
A problematic example
This labeling scheme is incompatible with the unbiasedness assumption
used in the proofs.
Illustration: a problem with two samples and two cores, f = ½ (f1 + f2).
Computing ∇f1 is much more expensive than computing ∇f2.
Start at x0. Because of the random sampling there are 4 possible
scenarios:
1. Core 1 selects f1, Core 2 selects f1 =⇒ x1 = x0 − γ∇f1(x0)
2. Core 1 selects f1, Core 2 selects f2 =⇒ x1 = x0 − γ∇f2(x0)
3. Core 1 selects f2, Core 2 selects f1 =⇒ x1 = x0 − γ∇f2(x0)
4. Core 1 selects f2, Core 2 selects f2 =⇒ x1 = x0 − γ∇f2(x0)
So we have
   Ei[∇fi] = ¼ ∇f1 + ¾ ∇f2 ≠ ½ ∇f1 + ½ ∇f2 !!
17/32
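A small simulation of this two-core example, counting which partial gradient ends up being the first write to shared memory; the compute-time constants are arbitrary assumptions that make ∇f1 the slow one:

    import random
    from collections import Counter

    COST = {"f1": 1.0, "f2": 0.1}      # assumed compute times: grad f1 is 10x slower than grad f2

    def first_update(rng):
        """Which partial gradient gets written to shared memory first?"""
        picks = [rng.choice(["f1", "f2"]) for _core in range(2)]
        return min(picks, key=lambda f: COST[f])   # the cheaper gradient finishes (and is written) first

    rng = random.Random(0)
    counts = Counter(first_update(rng) for _ in range(100_000))
    print({f: round(c / 100_000, 3) for f, c in counts.items()})
    # ~ {'f2': 0.75, 'f1': 0.25}: the first iterate uses grad f1 with probability 1/4, not 1/2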
The Art of Naming Things
A new labeling scheme
A new way to name the iterates.
“After read” labeling (Leblond, P., and Lacoste-Julien 2017). Increment the
counter each time we read the vector of coefficients from shared
memory.
✓ No dependency between it and the cost of computing ∇fit.
✓ Full analysis of Hogwild and other asynchronous methods in
“Improved parallel stochastic optimization analysis for incremental
methods”, Leblond, P., and Lacoste-Julien (submitted).
18/32
Asynchronous SAGA
The SAGA algorithm
Setting:
   minimize_x (1/n) ∑_{i=1}^{n} fi(x)
The SAGA algorithm (Defazio, Bach, and Lacoste-Julien 2014).
Select i ∈ {1, . . . , n} uniformly at random and compute (x⁺, α⁺) as
   x⁺ = x − γ(∇fi(x) − αi + ᾱ) ;  αi⁺ = ∇fi(x) ,
where ᾱ is the average of the memory terms αj.
• Like SGD, the update is unbiased, i.e., Ei[∇fi(x) − αi + ᾱ] = ∇f(x).
• Unlike SGD, because of the memory terms α, the variance → 0.
• Unlike SGD, it converges with a fixed step size (γ = 1/(3L)).
Super easy to use in scikit-learn (see the snippet below).
19/32
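For example, the SAGA solver ships with scikit-learn's logistic regression; the synthetic data below is only for illustration:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)
    clf = LogisticRegression(solver="saga", penalty="l1", C=1.0, max_iter=200)
    clf.fit(X, y)
    print(clf.score(X, y))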
Sparse SAGA
Need for a sparse variant of SAGA
• A large part of large-scale datasets are sparse.
• For sparse datasets and generalized linear models (e.g., least
squares, logistic regression, etc.), the partial gradients ∇fi are sparse too.
• Asynchronous algorithms work best when updates are sparse.
The SAGA update is inefficient for sparse data:
   x⁺ = x − γ(∇fi(x) [sparse] − αi [sparse] + ᾱ [dense!]) ;  αi⁺ = ∇fi(x)
[scikit-learn uses many tricks to make it efficient that we cannot use
in the asynchronous version]
20/32
Sparse SAGA
Sparse variant of SAGA. Relies on:
• Diagonal matrix Pi = projection onto the support of ∇fi.
• Diagonal matrix D defined as Dj,j = n / (number of times ∇jfi is nonzero).
Sparse SAGA algorithm (Leblond, P., and Lacoste-Julien 2017)
   x⁺ = x − γ(∇fi(x) − αi + PiDᾱ) ;  αi⁺ = ∇fi(x)
• All operations are sparse: the cost per iteration is O(number of nonzeros in ∇fi).
• Same convergence properties as SAGA, but with cheaper iterations
in the presence of sparsity.
• Crucial property: Ei[PiD] = I.
21/32
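A sketch of one Sparse SAGA update with SciPy sparse rows, keeping the running average ᾱ up to date incrementally; the dense (n × p) storage of the memory table and the least-squares fi are simplifying assumptions for illustration:

    import numpy as np
    import scipy.sparse as sp

    def sparse_saga_step(A, b, x, alpha, alpha_bar, d, i, gamma):
        """One Sparse SAGA update for least squares, A a scipy.sparse CSR matrix.
        alpha is the (n, p) table of memory terms, alpha_bar its average,
        d[j] = n / (number of samples whose partial gradient touches feature j)."""
        n = A.shape[0]
        row = A.getrow(i)
        support, vals = row.indices, row.data        # support and values of sample i
        grad_i = (row.dot(x)[0] - b[i]) * vals       # partial gradient, restricted to its support

        # x+ = x - gamma (grad f_i(x) - alpha_i + P_i D alpha_bar): touches only the support
        x[support] -= gamma * (grad_i - alpha[i, support] + d[support] * alpha_bar[support])

        # update the memory term alpha_i and its running average alpha_bar
        alpha_bar[support] += (grad_i - alpha[i, support]) / n
        alpha[i, support] = grad_i
        return x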
Asynchronous SAGA (ASAGA)
• Each core runs an instance of Sparse SAGA.
• All cores update the same vector of coefficients x and memory terms α, ᾱ
in shared memory.
Theory: under standard assumptions (bounded delays), same
convergence rate as the sequential version
=⇒ theoretical linear speedup with respect to the number of cores.
22/32
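A lock-free sketch of ASAGA built on the sparse_saga_step helper sketched above (same illustrative assumptions; Python threads stand in for cores, so this shows the structure rather than the speedup):

    import numpy as np
    import threading

    def asaga(A, b, d, gamma, n_updates=100_000, n_cores=4):
        n, p = A.shape
        x = np.zeros(p)                  # shared iterate
        alpha = np.zeros((n, p))         # shared memory terms alpha_i
        alpha_bar = np.zeros(p)          # shared running average of the alpha_i

        def worker(seed):
            rng = np.random.default_rng(seed)
            for _ in range(n_updates):
                i = rng.integers(n)
                # inconsistent reads/writes of x, alpha, alpha_bar: no locks anywhere
                sparse_saga_step(A, b, x, alpha, alpha_bar, d, i, gamma)

        threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_cores)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return x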
Experiments
• Improved convergence of variance-reduced methods wrt SGD.
• Significant improvement between 1 and 10 cores.
• Speedup is significant, but far from ideal.
23/32
Non-smooth problems
Composite objective
The previous methods assume that the objective function is smooth.
They cannot be applied to the Lasso, Group Lasso, box constraints, etc.
Objective: minimize a composite objective function
   minimize_x (1/n) ∑_{i=1}^{n} fi(x) + ∥x∥1
where fi is smooth (and ∥ · ∥1 is not). For simplicity we take the
nonsmooth term to be the ℓ1 norm, but this generalizes to any convex
function for which we have access to its proximal operator.
24/32
(Prox)SAGA
The ProxSAGA update is inefficient:
   x⁺ = prox_{γh}( x − γ(∇fi(x) [sparse] − αi [sparse] + ᾱ [dense!]) ) ;  αi⁺ = ∇fi(x)
and the proximal step itself is dense.
=⇒ a sparse variant is needed as a prerequisite for a practical
parallel method.
25/32
Sparse Proximal SAGA
Sparse Proximal SAGA (Pedregosa, Leblond, and Lacoste-Julien 2017).
Extension of Sparse SAGA to composite optimization problems.
Like SAGA, it relies on an unbiased gradient estimate and a proximal step:
   vi = ∇fi(x) − αi + DPiᾱ ;  x⁺ = prox_{γφi}(x − γvi) ;  αi⁺ = ∇fi(x)
where Pi, D are as in Sparse SAGA and φi(x) := ∑_{j=1}^{d} (PiD)j,j |xj|.
φi has two key properties: i) support of φi = support of ∇fi (sparse
updates) and ii) Ei[φi(x)] = ∥x∥1 (unbiasedness).
Convergence: same linear convergence rate as SAGA, with cheaper
updates in the presence of sparsity.
26/32
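A sketch of one Sparse Proximal SAGA step for the ℓ1-regularized case, reusing the quantities from the Sparse SAGA sketch; the soft-thresholding prox, the regularization strength lam and the least-squares fi are illustrative assumptions:

    import numpy as np

    def soft_threshold(z, t):
        """Proximal operator of t * |.|, applied coordinate-wise."""
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def sparse_prox_saga_step(A, b, x, alpha, alpha_bar, d, i, gamma, lam):
        n = A.shape[0]
        row = A.getrow(i)
        support, vals = row.indices, row.data
        grad_i = (row.dot(x)[0] - b[i]) * vals                  # partial gradient on its support

        # v_i = grad f_i(x) - alpha_i + D P_i alpha_bar (nonzero only on the support)
        v = grad_i - alpha[i, support] + d[support] * alpha_bar[support]

        # x+ = prox_{gamma phi_i}(x - gamma v_i), where phi_i weights |x_j| by (P_i D)_{j,j}
        x[support] = soft_threshold(x[support] - gamma * v, gamma * lam * d[support])

        # update the memory term alpha_i and the running average alpha_bar
        alpha_bar[support] += (grad_i - alpha[i, support]) / n
        alpha[i, support] = grad_i
        return x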
Proximal Asynchronous SAGA (ProxASAGA)
Each core runs Sparse Proximal SAGA asynchronously, without locks,
and updates x, α and ᾱ in shared memory.
All read/write operations to shared memory are inconsistent, i.e.,
there are no performance-destroying vector-level locks while reading/writing.
Convergence: under sparsity assumptions, ProxASAGA converges
with the same rate as the sequential algorithm =⇒ theoretical
linear speedup with respect to the number of cores.
27/32
Empirical results
ProxASAGA vs. competing methods on 3 large-scale datasets,
ℓ1-regularized logistic regression:

Dataset     | n           | p          | density  | L     | ∆
KDD 2010    | 19,264,097  | 1,163,024  | 10⁻⁶     | 28.12 | 0.15
KDD 2012    | 149,639,105 | 54,686,452 | 2 × 10⁻⁷ | 1.25  | 0.85
Criteo      | 45,840,617  | 1,000,000  | 4 × 10⁻⁵ | 1.25  | 0.89

[Figure: objective minus optimum vs. time (in minutes) on the KDD10, KDD12
and Criteo datasets, for ProxASAGA, AsySPCD and FISTA with 1 and 10 cores each.]
28/32
Empirical results - Speedup
Speedup = (time to 10⁻¹⁰ suboptimality on one core) /
          (time to the same suboptimality on k cores)

[Figure: time speedup vs. number of cores (1 to 20) on the KDD10, KDD12 and
Criteo datasets, comparing the ideal speedup, ProxASAGA, AsySPCD and FISTA.]

• ProxASAGA achieves speedups between 6x and 12x on a 20-core
architecture.
• As predicted by theory, there is a high correlation between the
degree of sparsity and the speedup.
29/32
Perspectives
• Scale above 20 cores.
• Asynchronous optimization on the GPU.
• Acceleration.
• Software development.
30/32
Codes
Code is on GitHub: https://github.com/fabianp/ProxASAGA.
The computational code is C++ (it uses atomic types), wrapped in Python.
A very efficient implementation of SAGA can be found in the
scikit-learn and lightning
(https://github.com/scikit-learn-contrib/lightning) libraries.
31/32
References
Cauchy, Augustin (1847). “Méthode générale pour la résolution des systèmes d’équations
simultanées”. In: Comp. Rend. Sci. Paris.
Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien (2014). “SAGA: A fast incremental gradient
method with support for non-strongly convex composite objectives”. In: Advances in Neural
Information Processing Systems.
Glowinski, Roland and A Marroco (1975). “Sur l’approximation, par éléments finis d’ordre un, et la
résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires”. In:
Revue française d’automatique, informatique, recherche opérationnelle. Analyse numérique.
Jaggi, Martin et al. (2014). “Communication-Efficient Distributed Dual Coordinate Ascent”. In:
Advances in Neural Information Processing Systems 27.
Leblond, Rémi, Fabian P., and Simon Lacoste-Julien (2017). “ASAGA: asynchronous parallel SAGA”. In:
Proceedings of the 20th International Conference on Artificial Intelligence and Statistics
(AISTATS 2017).
Niu, Feng et al. (2011). “Hogwild: A lock-free approach to parallelizing stochastic gradient descent”.
In: Advances in Neural Information Processing Systems.
31/32
Pedregosa, Fabian, Rémi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth
Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural
Information Processing Systems 30.
Peng, Zhimin et al. (2016). “ARock: an algorithmic framework for asynchronous parallel coordinate
updates”. In: SIAM Journal on Scientific Computing.
Robbins, Herbert and Sutton Monro (1951). “A Stochastic Approximation Method”. In: Ann. Math.
Statist.
Shamir, Ohad, Nati Srebro, and Tong Zhang (2014). “Communication-efficient distributed
optimization using an approximate newton-type method”. In: International conference on
machine learning.
32/32
Supervised Machine Learning
Data: n observations (ai, bi) ∈ Rᵖ × R
Prediction function: h(a, x) ∈ R
Motivating examples:
• Linear prediction: h(a, x) = xᵀa
• Neural networks: h(a, x) = xmᵀ σ(xm−1 σ(· · · x2ᵀ σ(x1ᵀ a)))
[Figure: feed-forward network with an input layer (a1, . . . , a5), a hidden
layer and an output layer.]
Minimize some distance (e.g., quadratic) between the prediction and the observation:
   minimize_x (1/n) ∑_{i=1}^{n} ℓ(bi, h(ai, x)) =: (1/n) ∑_{i=1}^{n} fi(x)
where popular examples of ℓ are
• Squared loss: ℓ(bi, h(ai, x)) := (bi − h(ai, x))²
• Logistic (softmax) loss: ℓ(bi, h(ai, x)) := log(1 + exp(−bi h(ai, x)))
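As a quick sanity check on these definitions, a NumPy sketch of the two losses and the partial gradient of the logistic one for a linear model h(a, x) = xᵀa (the helper names are hypothetical):

    import numpy as np

    def squared_loss(b_i, a_i, x):
        return (b_i - a_i @ x) ** 2

    def logistic_loss(b_i, a_i, x):
        # labels b_i are assumed to be in {-1, +1}
        return np.log1p(np.exp(-b_i * (a_i @ x)))

    def logistic_grad(b_i, a_i, x):
        """Partial gradient of f_i(x) = log(1 + exp(-b_i x^T a_i))."""
        z = -b_i * (a_i @ x)
        return -b_i * a_i / (1.0 + np.exp(-z))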
Sparse Proximal SAGA
For step size γ = 1/(5L) and f µ-strongly convex (µ > 0), Sparse Proximal
SAGA converges geometrically in expectation. At iteration t we have
   E∥xt − x*∥² ≤ (1 − (1/5) min{1/n, 1/κ})ᵗ C0 ,
with C0 = ∥x0 − x*∥² + (1/(5L²)) ∑_{i=1}^{n} ∥αi⁰ − ∇fi(x*)∥² and κ = L/µ (the
condition number).
Implications
• Same convergence rate as SAGA, with cheaper updates.
• In the “big data” regime (n ≥ κ): rate in O(1/n).
• In the “ill-conditioned” regime (n ≤ κ): rate in O(1/κ).
• Adaptivity to strong convexity, i.e., no need to know the strong
convexity parameter to obtain linear convergence.
Convergence ProxASAGA
Suppose τ ≤ 1/(10√∆). Then:
• If κ ≥ n, then with step size γ = 1/(36L), ProxASAGA converges
geometrically with rate factor Ω(1/κ).
• If κ < n, then using the step size γ = 1/(36nµ), ProxASAGA converges
geometrically with rate factor Ω(1/n).
In both cases, the convergence rate is the same as for Sparse Proximal
SAGA =⇒ ProxASAGA is linearly faster, up to a constant factor. In both
cases the step size does not depend on τ.
If τ ≤ 6κ, a universal step size of Θ(1/L) achieves a rate similar to that of
Sparse Proximal SAGA, making it adaptive to local strong convexity
(knowledge of κ not required).
ASAGA algorithm
ProxASAGA algorithm
Atomic vs non-atomic

More Related Content

What's hot

Introduction of "TrailBlazer" algorithm
Introduction of "TrailBlazer" algorithmIntroduction of "TrailBlazer" algorithm
Introduction of "TrailBlazer" algorithm
Katsuki Ohto
 
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOMEEuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
HONGJOO LEE
 
Profiling in Python
Profiling in PythonProfiling in Python
Profiling in Python
Fabian Pedregosa
 
Scalable Link Discovery for Modern Data-Driven Applications
Scalable Link Discovery for Modern Data-Driven ApplicationsScalable Link Discovery for Modern Data-Driven Applications
Scalable Link Discovery for Modern Data-Driven Applications
Holistic Benchmarking of Big Linked Data
 
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
The Statistical and Applied Mathematical Sciences Institute
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational Autoencoder
Mark Chang
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data Science
Albert Bifet
 
Meta-learning and the ELBO
Meta-learning and the ELBOMeta-learning and the ELBO
Meta-learning and the ELBO
Yoonho Lee
 
VRP2013 - Comp Aspects VRP
VRP2013 - Comp Aspects VRPVRP2013 - Comp Aspects VRP
VRP2013 - Comp Aspects VRP
Victor Pillac
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
Europy17_dibernardo
Europy17_dibernardoEuropy17_dibernardo
Europy17_dibernardo
GIUSEPPE DI BERNARDO
 
Accelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnAccelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-Learn
Gilles Louppe
 
010_20160216_Variational Gaussian Process
010_20160216_Variational Gaussian Process010_20160216_Variational Gaussian Process
010_20160216_Variational Gaussian Process
Ha Phuong
 
Multiclass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark ExamplesMulticlass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark Examples
Marjan Sterjev
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
The Statistical and Applied Mathematical Sciences Institute
 
Internet of Things Data Science
Internet of Things Data ScienceInternet of Things Data Science
Internet of Things Data Science
Albert Bifet
 
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
Nesreen K. Ahmed
 
Sampling from Massive Graph Streams: A Unifying Framework
Sampling from Massive Graph Streams: A Unifying FrameworkSampling from Massive Graph Streams: A Unifying Framework
Sampling from Massive Graph Streams: A Unifying Framework
Nesreen K. Ahmed
 
Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...
Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...
Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...
The Statistical and Applied Mathematical Sciences Institute
 

What's hot (20)

Introduction of "TrailBlazer" algorithm
Introduction of "TrailBlazer" algorithmIntroduction of "TrailBlazer" algorithm
Introduction of "TrailBlazer" algorithm
 
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOMEEuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
 
Profiling in Python
Profiling in PythonProfiling in Python
Profiling in Python
 
Scalable Link Discovery for Modern Data-Driven Applications
Scalable Link Discovery for Modern Data-Driven ApplicationsScalable Link Discovery for Modern Data-Driven Applications
Scalable Link Discovery for Modern Data-Driven Applications
 
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational Autoencoder
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data Science
 
Meta-learning and the ELBO
Meta-learning and the ELBOMeta-learning and the ELBO
Meta-learning and the ELBO
 
VRP2013 - Comp Aspects VRP
VRP2013 - Comp Aspects VRPVRP2013 - Comp Aspects VRP
VRP2013 - Comp Aspects VRP
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Europy17_dibernardo
Europy17_dibernardoEuropy17_dibernardo
Europy17_dibernardo
 
Accelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnAccelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-Learn
 
010_20160216_Variational Gaussian Process
010_20160216_Variational Gaussian Process010_20160216_Variational Gaussian Process
010_20160216_Variational Gaussian Process
 
Multiclass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark ExamplesMulticlass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark Examples
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Internet of Things Data Science
Internet of Things Data ScienceInternet of Things Data Science
Internet of Things Data Science
 
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
 
Sampling from Massive Graph Streams: A Unifying Framework
Sampling from Massive Graph Streams: A Unifying FrameworkSampling from Massive Graph Streams: A Unifying Framework
Sampling from Massive Graph Streams: A Unifying Framework
 
Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...
Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...
Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...
 

Similar to Parallel Optimization in Machine Learning

Asynchronous Stochastic Optimization, New Analysis and Algorithms
Asynchronous Stochastic Optimization, New Analysis and AlgorithmsAsynchronous Stochastic Optimization, New Analysis and Algorithms
Asynchronous Stochastic Optimization, New Analysis and Algorithms
Fabian Pedregosa
 
Chapter 1 - Introduction
Chapter 1 - IntroductionChapter 1 - Introduction
Chapter 1 - Introduction
Charles Deledalle
 
Stack squeues lists
Stack squeues listsStack squeues lists
Stack squeues lists
James Wong
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
Harry Potter
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
Luis Goldster
 
Stacksqueueslists
StacksqueueslistsStacksqueueslists
Stacksqueueslists
Fraboni Ec
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
Young Alista
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
Tony Nguyen
 
Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep models
Young-Geun Choi
 
Metaheuristic Algorithms: A Critical Analysis
Metaheuristic Algorithms: A Critical AnalysisMetaheuristic Algorithms: A Critical Analysis
Metaheuristic Algorithms: A Critical Analysis
Xin-She Yang
 
Nature-Inspired Optimization Algorithms
Nature-Inspired Optimization Algorithms Nature-Inspired Optimization Algorithms
Nature-Inspired Optimization Algorithms
Xin-She Yang
 
Jindřich Libovický - 2017 - Attention Strategies for Multi-Source Sequence-...
Jindřich Libovický - 2017 - Attention Strategies for Multi-Source Sequence-...Jindřich Libovický - 2017 - Attention Strategies for Multi-Source Sequence-...
Jindřich Libovický - 2017 - Attention Strategies for Multi-Source Sequence-...
Association for Computational Linguistics
 
50120140503004
5012014050300450120140503004
50120140503004
IAEME Publication
 
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Universitat Politècnica de Catalunya
 
XGBoostLSS - An extension of XGBoost to probabilistic forecasting, Alexander ...
XGBoostLSS - An extension of XGBoost to probabilistic forecasting, Alexander ...XGBoostLSS - An extension of XGBoost to probabilistic forecasting, Alexander ...
XGBoostLSS - An extension of XGBoost to probabilistic forecasting, Alexander ...
Erlangen Artificial Intelligence & Machine Learning Meetup
 
How to easily find the optimal solution without exhaustive search using Genet...
How to easily find the optimal solution without exhaustive search using Genet...How to easily find the optimal solution without exhaustive search using Genet...
How to easily find the optimal solution without exhaustive search using Genet...
Viach Kakovskyi
 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel application
Geoffrey Fox
 
Differential game theory for Traffic Flow Modelling
Differential game theory for Traffic Flow ModellingDifferential game theory for Traffic Flow Modelling
Differential game theory for Traffic Flow Modelling
Serge Hoogendoorn
 
Paper Study: Melding the data decision pipeline
Paper Study: Melding the data decision pipelinePaper Study: Melding the data decision pipeline
Paper Study: Melding the data decision pipeline
ChenYiHuang5
 
Deep learning Unit1 BasicsAllllllll.pptx
Deep learning Unit1 BasicsAllllllll.pptxDeep learning Unit1 BasicsAllllllll.pptx
Deep learning Unit1 BasicsAllllllll.pptx
FreefireGarena30
 

Similar to Parallel Optimization in Machine Learning (20)

Asynchronous Stochastic Optimization, New Analysis and Algorithms
Asynchronous Stochastic Optimization, New Analysis and AlgorithmsAsynchronous Stochastic Optimization, New Analysis and Algorithms
Asynchronous Stochastic Optimization, New Analysis and Algorithms
 
Chapter 1 - Introduction
Chapter 1 - IntroductionChapter 1 - Introduction
Chapter 1 - Introduction
 
Stack squeues lists
Stack squeues listsStack squeues lists
Stack squeues lists
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
 
Stacksqueueslists
StacksqueueslistsStacksqueueslists
Stacksqueueslists
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
 
Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep models
 
Metaheuristic Algorithms: A Critical Analysis
Metaheuristic Algorithms: A Critical AnalysisMetaheuristic Algorithms: A Critical Analysis
Metaheuristic Algorithms: A Critical Analysis
 
Nature-Inspired Optimization Algorithms
Nature-Inspired Optimization Algorithms Nature-Inspired Optimization Algorithms
Nature-Inspired Optimization Algorithms
 
Jindřich Libovický - 2017 - Attention Strategies for Multi-Source Sequence-...
Jindřich Libovický - 2017 - Attention Strategies for Multi-Source Sequence-...Jindřich Libovický - 2017 - Attention Strategies for Multi-Source Sequence-...
Jindřich Libovický - 2017 - Attention Strategies for Multi-Source Sequence-...
 
50120140503004
5012014050300450120140503004
50120140503004
 
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
 
XGBoostLSS - An extension of XGBoost to probabilistic forecasting, Alexander ...
XGBoostLSS - An extension of XGBoost to probabilistic forecasting, Alexander ...XGBoostLSS - An extension of XGBoost to probabilistic forecasting, Alexander ...
XGBoostLSS - An extension of XGBoost to probabilistic forecasting, Alexander ...
 
How to easily find the optimal solution without exhaustive search using Genet...
How to easily find the optimal solution without exhaustive search using Genet...How to easily find the optimal solution without exhaustive search using Genet...
How to easily find the optimal solution without exhaustive search using Genet...
 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel application
 
Differential game theory for Traffic Flow Modelling
Differential game theory for Traffic Flow ModellingDifferential game theory for Traffic Flow Modelling
Differential game theory for Traffic Flow Modelling
 
Paper Study: Melding the data decision pipeline
Paper Study: Melding the data decision pipelinePaper Study: Melding the data decision pipeline
Paper Study: Melding the data decision pipeline
 
Deep learning Unit1 BasicsAllllllll.pptx
Deep learning Unit1 BasicsAllllllll.pptxDeep learning Unit1 BasicsAllllllll.pptx
Deep learning Unit1 BasicsAllllllll.pptx
 

More from Fabian Pedregosa

Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4
Fabian Pedregosa
 
Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3
Fabian Pedregosa
 
Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2
Fabian Pedregosa
 
Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1
Fabian Pedregosa
 
Average case acceleration through spectral density estimation
Average case acceleration through spectral density estimationAverage case acceleration through spectral density estimation
Average case acceleration through spectral density estimation
Fabian Pedregosa
 
Adaptive Three Operator Splitting
Adaptive Three Operator SplittingAdaptive Three Operator Splitting
Adaptive Three Operator Splitting
Fabian Pedregosa
 
Sufficient decrease is all you need
Sufficient decrease is all you needSufficient decrease is all you need
Sufficient decrease is all you need
Fabian Pedregosa
 
Lightning: large scale machine learning in python
Lightning: large scale machine learning in pythonLightning: large scale machine learning in python
Lightning: large scale machine learning in python
Fabian Pedregosa
 

More from Fabian Pedregosa (8)

Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4
 
Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3
 
Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2
 
Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1
 
Average case acceleration through spectral density estimation
Average case acceleration through spectral density estimationAverage case acceleration through spectral density estimation
Average case acceleration through spectral density estimation
 
Adaptive Three Operator Splitting
Adaptive Three Operator SplittingAdaptive Three Operator Splitting
Adaptive Three Operator Splitting
 
Sufficient decrease is all you need
Sufficient decrease is all you needSufficient decrease is all you need
Sufficient decrease is all you need
 
Lightning: large scale machine learning in python
Lightning: large scale machine learning in pythonLightning: large scale machine learning in python
Lightning: large scale machine learning in python
 

Recently uploaded

Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
PRIYANKA PATEL
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
İsa Badur
 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
University of Maribor
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
vluwdy49
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
by6843629
 
Cytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptxCytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptx
Hitesh Sikarwar
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
Leonel Morgado
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
LengamoLAppostilic
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
AbdullaAlAsif1
 
Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
Aditi Bajpai
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
yqqaatn0
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
RitabrataSarkar3
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
IshaGoswami9
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
pablovgd
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
Vandana Devesh Sharma
 
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Leonel Morgado
 

Recently uploaded (20)

Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
 
Cytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptxCytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptx
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
 
Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
 
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
 

Parallel Optimization in Machine Learning

  • 1. Parallel Optimization in Machine Learning Fabian Pedregosa December 19, 2017 Huawei Paris Research Center
  • 2. About me • Engineer (2010-2012), Inria Saclay (scikit-learn kickstart). • PhD (2012-2015), Inria Saclay. • Postdoc (2015-2016), Dauphine–ENS–Inria Paris. • Postdoc (2017-present), UC Berkeley - ETH Zurich (Marie-Curie fellowship, European Commission) Hacker at heart ... trapped in a researcher’s body. 1/32
  • 3. Motivation Computer add in 1993 Computer add in 2006 What has changed? 2/32
  • 4. Motivation Computer add in 1993 Computer add in 2006 What has changed? 2006 = no longer mentions to speed of processors. 2/32
  • 5. Motivation Computer add in 1993 Computer add in 2006 What has changed? 2006 = no longer mentions to speed of processors. Primary feature: number of cores. 2/32
  • 6. 40 years of CPU trends • Speed of CPUs has stagnated since 2005. 3/32
  • 7. 40 years of CPU trends • Speed of CPUs has stagnated since 2005. • Multi-core architectures are here to stay. 3/32
  • 8. 40 years of CPU trends • Speed of CPUs has stagnated since 2005. • Multi-core architectures are here to stay. 3/32
  • 9. 40 years of CPU trends • Speed of CPUs has stagnated since 2005. • Multi-core architectures are here to stay. Parallel algorithms needed to take advantage of modern CPUs. 3/32
  • 10. Parallel optimization Parallel algorithms can be divided into two large categories: synchronous and asynchronous. Image credits: (Peng et al. 2016) Synchronous methods  Easy to implement (i.e., developed software packages).  Well understood.  Limited speedup due to synchronization costs. Asynchronous methods  Faster, typically larger speedups.  Not well understood, large gap between theory and practice.  No mature software solutions. 4/32
  • 11. Outline Synchronous methods • Synchronous (stochastic) gradient descent. Asynchronous methods • Asynchronous stochastic gradient descent (Hogwild) (Niu et al. 2011) • Asynchronous variance-reduced stochastic methods (Leblond, P., and Lacoste-Julien 2017), (Pedregosa, Leblond, and Lacoste-Julien 2017). • Analysis of asynchronous methods. • Codes and implementation aspects. Leaving out many parallel synchronous methods: ADMM (Glowinski and Marroco 1975), CoCoA (Jaggi et al. 2014), DANE (Shamir, Srebro, and Zhang 2014), to name a few. 5/32
  • 12. Outline Most of the following is joint work with Rémi Leblond and Simon Lacoste-Julien Rémi Leblond Simon Lacoste–Julien 6/32
  • 14. Optimization for machine learning Large part of problems in machine learning can be framed as optimization problems of the form minimize x f(x) def = 1 n n∑ i=1 fi(x) Gradient descent (Cauchy 1847). Descend along steepest direction (−∇f(x)) x+ = x − γ∇f(x) Stochastic gradient descent (SGD) (Robbins and Monro 1951). Select a random index i and descent along − ∇fi(x): x+ = x − γ∇fi(x) images source: Francis Bach 7/32
  • 15. Parallel synchronous gradient descent Computation of gradient is distributed among k workers • Workers can be: different computers, CPUs or GPUs • Popular frameworks: Spark, Tensorflow, PyTorch, neHadoop. 8/32
  • 16. Parallel synchronous gradient descent 1. Choose n1, . . . nk that sum to n. 2. Distribute computation of ∇f(x) among k nodes ∇f(x) = 1 n ∑ i=1 ∇fi(x) = 1 k ( 1 n1 n1∑ i=1 ∇fi(x) done by worker 1 + . . . + 1 n1 nk∑ i=nk−1 ∇fi(x) done by worker k ) 3. Perform the gradient descent update by a master node x+ = x − γ∇f(x) 9/32
  • 17. Parallel synchronous gradient descent 1. Choose n1, . . . nk that sum to n. 2. Distribute computation of ∇f(x) among k nodes ∇f(x) = 1 n ∑ i=1 ∇fi(x) = 1 k ( 1 n1 n1∑ i=1 ∇fi(x) done by worker 1 + . . . + 1 n1 nk∑ i=nk−1 ∇fi(x) done by worker k ) 3. Perform the gradient descent update by a master node x+ = x − γ∇f(x)  Trivial parallelization, same analysis as gradient descent.  Synchronization step every iteration (3.). 9/32
  • 18. Parallel synchronous SGD Can also be extended to stochastic gradient descent. 1. Select k samples i0, . . . , ik uniformly at random. 2. Compute in parallel ∇fit on worker t 3. Perform the (mini-batch) stochastic gradient descent update x+ = x − γ 1 k k∑ t=1 ∇fit (x) 10/32
  • 19. Parallel synchronous SGD Can also be extended to stochastic gradient descent. 1. Select k samples i0, . . . , ik uniformly at random. 2. Compute in parallel ∇fit on worker t 3. Perform the (mini-batch) stochastic gradient descent update x+ = x − γ 1 k k∑ t=1 ∇fit (x)  Trivial parallelization, same analysis as (mini-batch) stochastic gradient descent.  The kind of parallelization that is implemented in deep learning libraries (tensorflow, PyTorch, Thano, etc.).  Synchronization step every iteration (3.). 10/32
  • 21. Asynchronous SGD Synchronization is the bottleneck.  What if we just ignore it? 11/32
  • 22. Asynchronous SGD Synchronization is the bottleneck.  What if we just ignore it? Hogwild (Niu et al. 2011): each core runs SGD in parallel, without synchronization, and updates the same vector of coefficients. In theory: convergence under very strong assumptions. In practice: just works. 11/32
  • 23. Hogwild in more detail Each core follows the same procedure 1. Read the information from shared memory ˆx. 2. Sample i ∈ {1, . . . , n} uniformly at random. 3. Compute partial gradient ∇fi(ˆx). 4. Write the SGD update to shared memory x = x − γ∇fi(ˆx). 12/32
  • 24. Hogwild is fast Hogwild can be very fast. But its still SGD... • With constant step size, bounces around the optimum. • With decreasing step size, slow convergence. • There are better alternatives (Emilie already mentioned some) 13/32
  • 25. Looking for excitement? ... analyze asynchronous methods!
  • 26. Analysis of asynchronous methods Simple things become counter-intuitive, e.g, how to name the iterates?  Iterates will change depending on the speed of processors 14/32
  • 27. Naming scheme in Hogwild Simple, intuitive and wrong Each time a core has finished writing to shared memory, increment iteration counter. ⇐⇒ ˆxt = (t + 1)-th succesfull update to shared memory. Value of ˆxt and it are not determined until the iteration has finished. =⇒ ˆxt and it are not necessarily independent. 15/32
  • 28. Unbiased gradient estimate SGD-like algorithms crucially rely on the unbiased property Ei[∇fi(x)] = ∇f(x). For synchronous algorithms, follows from the uniform sampling of i Ei[∇fi(x)] = n∑ i=1 Proba(selecting i)∇fi(x) uniform sampling = n∑ i=1 1 n ∇fi(x) = ∇f(x) 16/32
  • 29. A problematic example This labeling scheme is incompatible with unbiasedness assumption used in proofs. 17/32
  • 30. A problematic example This labeling scheme is incompatible with unbiasedness assumption used in proofs. Illustration: problem with two samples and two cores f = 1 2 (f1 + f2). Computing ∇f1 is much expensive than ∇f2. 17/32
  • 31. A problematic example This labeling scheme is incompatible with unbiasedness assumption used in proofs. Illustration: problem with two samples and two cores f = 1 2 (f1 + f2). Computing ∇f1 is much expensive than ∇f2. Start at x0. Because of the random sampling there are 4 possible scenarios: 1. Core 1 selects f1, Core 2 selects f1 =⇒ x1 = x0 − γ∇f1(x) 2. Core 1 selects f1, Core 2 selects f2 =⇒ x1 = x0 − γ∇f2(x) 3. Core 1 selects f2, Core 2 selects f1 =⇒ x1 = x0 − γ∇f2(x) 4. Core 1 selects f2, Core 2 selects f2 =⇒ x1 = x0 − γ∇f2(x) So we have Ei [∇fi] = 1 4 f1 + 3 4 f2 ̸= 1 2 f1 + 1 2 f2 !! 17/32
  • 32. The Art of Naming Things
  • 33. A new labeling scheme  New way to name iterates. “After read” labeling (Leblond, P., and Lacoste-Julien 2017). Increment counter each time we read the vector of coefficients from shared memory. 18/32
  • 34. A new labeling scheme  New way to name iterates. “After read” labeling (Leblond, P., and Lacoste-Julien 2017). Increment counter each time we read the vector of coefficients from shared memory.  No dependency between it and the cost of computing ∇fit .  Full analysis of Hogwild and other asynchronous methods in “Improved parallel stochastic optimization analysis for incremental methods”, Leblond, P., and Lacoste-Julien (submitted). 18/32
  • 36. The SAGA algorithm Setting: minimize x 1 n n∑ i=1 fi(x) The SAGA algorithm (Defazio, Bach, and Lacoste-Julien 2014). Select i ∈ {1, . . . , n} and compute (x+ , α+ ) as x+ = x − γ(∇fi(x) − αi + α) ; α+ i = ∇fi(x) • Like SGD, update is unbiased, i.e., Ei[∇fi(x) − αi + α)] = ∇f(x). • Unlike SGD, because of memory terms α, variance → 0. • Unlike SGD, converges with fixed step size (1/3L) 19/32
  • 37. The SAGA algorithm Setting: minimize x 1 n n∑ i=1 fi(x) The SAGA algorithm (Defazio, Bach, and Lacoste-Julien 2014). Select i ∈ {1, . . . , n} and compute (x+ , α+ ) as x+ = x − γ(∇fi(x) − αi + α) ; α+ i = ∇fi(x) • Like SGD, update is unbiased, i.e., Ei[∇fi(x) − αi + α)] = ∇f(x). • Unlike SGD, because of memory terms α, variance → 0. • Unlike SGD, converges with fixed step size (1/3L) Super easy to use in scikit-learn 19/32
  • 38. Sparse SAGA Need for a sparse variant of SAGA • Many large-scale datasets are sparse. • For sparse datasets and generalized linear models (e.g., least squares, logistic regression, etc.), the partial gradients ∇fi are sparse too. • Asynchronous algorithms work best when updates are sparse. The SAGA update is inefficient for sparse data: x+ = x − γ(∇fi(x) [sparse] − αi [sparse] + ᾱ [dense!]) ; α+_i = ∇fi(x) [scikit-learn uses many tricks to make it efficient that we cannot use in the asynchronous version] 20/32
  • 39. Sparse SAGA Sparse variant of SAGA. Relies on • the diagonal matrix Pi = projection onto the support of ∇fi • the diagonal matrix D defined by D_{j,j} = n / (number of i for which ∇_j fi is nonzero). Sparse SAGA algorithm (Leblond, P., and Lacoste-Julien 2017) x+ = x − γ(∇fi(x) − αi + Pi D ᾱ) ; α+_i = ∇fi(x) • All operations are sparse, the cost per iteration is O(number of nonzeros in ∇fi). • Same convergence properties as SAGA, but with cheaper iterations in the presence of sparsity. • Crucial property: Ei[Pi D] = I. 21/32
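A sketch of the sparse update, assuming a SciPy CSR design matrix for a least-squares model: Pi is realized implicitly by indexing the support of ∇fi, and D by the precomputed inverse column frequencies. Step size, data and names are my own illustrative choices.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, p = 1000, 500
A = sp.random(n, p, density=0.01, format="csr", random_state=0)
b = rng.standard_normal(n)

# D_jj = n / (number of samples whose gradient touches coordinate j)
counts = np.maximum(np.bincount(A.indices, minlength=p), 1)
D = n / counts

gamma = 0.01                  # illustrative step size
x = np.zeros(p)
alpha = np.zeros((n, p))      # memory kept dense for simplicity of the sketch
alpha_bar = np.zeros(p)

for _ in range(n):
    i = rng.integers(n)
    supp = A.indices[A.indptr[i]:A.indptr[i + 1]]   # support of grad f_i (row i of A)
    a_i = A.data[A.indptr[i]:A.indptr[i + 1]]
    g_supp = (a_i @ x[supp] - b[i]) * a_i           # grad f_i restricted to its support
    # sparse SAGA step: x_j is only modified for j in the support of grad f_i
    x[supp] -= gamma * (g_supp - alpha[i, supp] + D[supp] * alpha_bar[supp])
    alpha_bar[supp] += (g_supp - alpha[i, supp]) / n
    alpha[i, supp] = g_supp
```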
  • 40. Asynchronous SAGA (ASAGA) • Each core runs an instance of Sparse SAGA. • All cores update the same shared vectors x, α and ᾱ. Theory: under standard assumptions (bounded delays), same convergence rate as the sequential version. =⇒ theoretical linear speedup with respect to the number of cores. 22/32
  • 41. Experiments • Improved convergence of variance-reduced methods wrt SGD. • Significant improvement between 1 and 10 cores. • Speedup is significant, but far from ideal. 23/32
  • 43. Composite objective The previous methods assume the objective function is smooth, so they cannot be applied to the Lasso, Group Lasso, box constraints, etc. Objective: minimize the composite objective function minimize_x (1/n) Σ_{i=1}^n fi(x) + ∥x∥₁, where the fi are smooth (and ∥·∥₁ is not). For simplicity we take the nonsmooth term to be the ℓ1 norm, but everything generalizes to any convex function for which we have access to the proximal operator. 24/32
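For concreteness, the proximal operator of γ∥·∥₁ is the coordinate-wise soft-thresholding operator; a minimal NumPy version (the function name is my own):

```python
import numpy as np

def prox_l1(z, gamma):
    """prox of gamma * ||.||_1: sign(z) * max(|z| - gamma, 0), applied coordinate-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

# e.g. prox_l1(np.array([1.5, -0.2, 0.0]), 0.5) -> array([ 1., -0.,  0.])
```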
  • 44. (Prox)SAGA The ProxSAGA update is inefficient: x+ = prox_{γh} [dense!] (x − γ(∇fi(x) [sparse] − αi [sparse] + ᾱ [dense!])) ; α+_i = ∇fi(x) =⇒ a sparse variant is needed as a prerequisite for a practical parallel method. 25/32
  • 45. Sparse Proximal SAGA Sparse Proximal SAGA (Pedregosa, Leblond, and Lacoste-Julien 2017). Extension of Sparse SAGA to composite optimization problems. 26/32
  • 46. Sparse Proximal SAGA Sparse Proximal SAGA (Pedregosa, Leblond, and Lacoste-Julien 2017). Extension of Sparse SAGA to composite optimization problems. Like SAGA, it relies on an unbiased gradient estimate vi = ∇fi(x) − αi + D Pi ᾱ ; 26/32
  • 47. Sparse Proximal SAGA Sparse Proximal SAGA (Pedregosa, Leblond, and Lacoste-Julien 2017). Extension of Sparse SAGA to composite optimization problems. Like SAGA, it relies on an unbiased gradient estimate and a proximal step: vi = ∇fi(x) − αi + D Pi ᾱ ; x+ = prox_{γφi}(x − γ vi) ; α+_i = ∇fi(x) 26/32
  • 48. Sparse Proximal SAGA Sparse Proximal SAGA (Pedregosa, Leblond, and Lacoste-Julien 2017). Extension of Sparse SAGA to composite optimization problems. Like SAGA, it relies on an unbiased gradient estimate and a proximal step: vi = ∇fi(x) − αi + D Pi ᾱ ; x+ = prox_{γφi}(x − γ vi) ; α+_i = ∇fi(x) Here Pi, D are as in Sparse SAGA and φi(x) := Σ_{j=1}^d (Pi D)_{j,j} |x_j|. φi has two key properties: i) support of φi = support of ∇fi (sparse updates) and ii) Ei[φi(x)] = ∥x∥₁ (unbiasedness). 26/32
  • 49. Sparse Proximal SAGA Sparse Proximal SAGA (Pedregosa, Leblond, and Lacoste-Julien 2017). Extension of Sparse SAGA to composite optimization problems. Like SAGA, it relies on an unbiased gradient estimate and a proximal step: vi = ∇fi(x) − αi + D Pi ᾱ ; x+ = prox_{γφi}(x − γ vi) ; α+_i = ∇fi(x) Here Pi, D are as in Sparse SAGA and φi(x) := Σ_{j=1}^d (Pi D)_{j,j} |x_j|. φi has two key properties: i) support of φi = support of ∇fi (sparse updates) and ii) Ei[φi(x)] = ∥x∥₁ (unbiasedness). Convergence: same linear convergence rate as SAGA, with cheaper updates in the presence of sparsity. 26/32
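A sketch of one Sparse Proximal SAGA step as a function, continuing the notation of the Sparse SAGA sketch above; `lam` is the ℓ1 regularization strength and all names are mine, not from the slides.

```python
import numpy as np

def soft_threshold(z, t):
    """Coordinate-wise prox of t * |.|: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_prox_saga_step(x, alpha, alpha_bar, i, supp, g_supp, D, gamma, lam, n):
    """One Sparse Proximal SAGA step; only coordinates in `supp` (the support of
    grad f_i) are read and written."""
    v = g_supp - alpha[i, supp] + D[supp] * alpha_bar[supp]        # unbiased estimate v_i
    x[supp] = soft_threshold(x[supp] - gamma * v,
                             gamma * lam * D[supp])                # prox of gamma * phi_i
    alpha_bar[supp] += (g_supp - alpha[i, supp]) / n               # maintain the average
    alpha[i, supp] = g_supp                                        # update the memory term
```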
  • 50. Proximal Asynchronous SAGA (ProxASAGA) Each core runs Sparse Proximal SAGA asynchronously, without locks, and updates x, α and ᾱ in shared memory.  All read/write operations to shared memory are inconsistent, i.e., there are no performance-destroying vector-level locks while reading/writing. Convergence: under sparsity assumptions, ProxASAGA converges at the same rate as the sequential algorithm =⇒ theoretical linear speedup with respect to the number of cores. 27/32
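A heavily simplified lock-free worker sketch using Python threads and shared NumPy arrays, reusing the data setup (CSR matrix A, D, γ, λ) of the sketches above. This only illustrates the structure (inconsistent reads, lock-free writes of increments); the reference implementation linked below is C++ with atomics, and Python's GIL would serialize these threads in practice.

```python
import threading
import numpy as np

def worker(x, alpha, alpha_bar, A, b, D, gamma, lam, n_steps, seed):
    """One ProxASAGA-style worker repeating the Sparse Proximal SAGA step on shared arrays."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    for _ in range(n_steps):
        i = rng.integers(n)
        supp = A.indices[A.indptr[i]:A.indptr[i + 1]]   # support of grad f_i (CSR row i)
        a_i = A.data[A.indptr[i]:A.indptr[i + 1]]
        x_hat = x[supp].copy()                          # inconsistent read of shared x
        g = (a_i @ x_hat - b[i]) * a_i                  # grad f_i at the read iterate
        v = g - alpha[i, supp] + D[supp] * alpha_bar[supp]
        z = x_hat - gamma * v
        prox_z = np.sign(z) * np.maximum(np.abs(z) - gamma * lam * D[supp], 0.0)
        x[supp] += prox_z - x_hat                       # lock-free write of the increment
        alpha_bar[supp] += (g - alpha[i, supp]) / n
        alpha[i, supp] = g

# threads = [threading.Thread(target=worker,
#            args=(x, alpha, alpha_bar, A, b, D, gamma, lam, 5 * A.shape[0], s))
#            for s in range(4)]
# for t in threads: t.start()
# for t in threads: t.join()
```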
  • 51. Empirical results ProxASAGA vs competing methods on 3 large-scale datasets, ℓ1-regularized logistic regression.
  Dataset   | n           | p          | density  | L     | ∆
  KDD 2010  | 19,264,097  | 1,163,024  | 10⁻⁶     | 28.12 | 0.15
  KDD 2012  | 149,639,105 | 54,686,452 | 2 × 10⁻⁷ | 1.25  | 0.85
  Criteo    | 45,840,617  | 1,000,000  | 4 × 10⁻⁵ | 1.25  | 0.89
  [Figure: objective minus optimum vs. time (in minutes) on the KDD10, KDD12 and Criteo datasets, comparing ProxASAGA, AsySPCD and FISTA, each on 1 and 10 cores.] 28/32
  • 52. Empirical results - Speedup Speedup = (time to 10⁻¹⁰ suboptimality on one core) / (time to the same suboptimality on k cores). [Figure: time speedup vs. number of cores (1–20) on the KDD10, KDD12 and Criteo datasets, for ProxASAGA, AsySPCD and FISTA, against the ideal linear speedup.] 29/32
  • 53. Empirical results - Speedup Speedup = (time to 10⁻¹⁰ suboptimality on one core) / (time to the same suboptimality on k cores). [Figure: as above.] • ProxASAGA achieves speedups between 6x and 12x on a 20-core architecture. 29/32
  • 54. Empirical results - Speedup Speedup = (time to 10⁻¹⁰ suboptimality on one core) / (time to the same suboptimality on k cores). [Figure: as above.] • ProxASAGA achieves speedups between 6x and 12x on a 20-core architecture. • As predicted by theory, there is a high correlation between the degree of sparsity and the speedup. 29/32
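The speedup figure is simply a ratio of wall-clock times to the same suboptimality; a trivial helper with made-up example numbers:

```python
def speedup(time_one_core_min, time_k_cores_min):
    """Speedup = time to reach 1e-10 suboptimality on 1 core / time on k cores."""
    return time_one_core_min / time_k_cores_min

# e.g. a hypothetical run taking 60 min on 1 core and 6 min on 20 cores: speedup(60, 6) == 10.0
```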
  • 55. Perspectives • Scale above 20 cores. • Asynchronous optimization on the GPU. • Acceleration. • Software development. 30/32
  • 56. Codes  Code is on GitHub: https://github.com/fabianp/ProxASAGA. The computational code is C++ (using atomic types), wrapped in Python. A very efficient implementation of SAGA can be found in the scikit-learn and lightning (https://github.com/scikit-learn-contrib/lightning) libraries. 31/32
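For the sequential SAGA solver mentioned earlier ("super easy to use in scikit-learn"), a minimal usage sketch; the synthetic dataset and parameter values are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10_000, n_features=200, random_state=0)
# l1-regularized logistic regression, solved with the (sequential) SAGA solver
clf = LogisticRegression(solver="saga", penalty="l1", C=1.0, max_iter=100)
clf.fit(X, y)
print(np.mean(clf.predict(X) == y))   # training accuracy
```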
  • 57. References Cauchy, Augustin (1847). “Méthode générale pour la résolution des systèmes d'équations simultanées”. In: Comp. Rend. Sci. Paris. Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien (2014). “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives”. In: Advances in Neural Information Processing Systems. Glowinski, Roland and A. Marroco (1975). “Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires”. In: Revue française d'automatique, informatique, recherche opérationnelle. Analyse numérique. Jaggi, Martin et al. (2014). “Communication-Efficient Distributed Dual Coordinate Ascent”. In: Advances in Neural Information Processing Systems 27. Leblond, Rémi, Fabian P., and Simon Lacoste-Julien (2017). “ASAGA: Asynchronous Parallel SAGA”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017). Niu, Feng et al. (2011). “Hogwild: A lock-free approach to parallelizing stochastic gradient descent”. In: Advances in Neural Information Processing Systems. 31/32
  • 58. Pedregosa, Fabian, Rémi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural Information Processing Systems 30. Peng, Zhimin et al. (2016). “ARock: An algorithmic framework for asynchronous parallel coordinate updates”. In: SIAM Journal on Scientific Computing. Robbins, Herbert and Sutton Monro (1951). “A Stochastic Approximation Method”. In: Ann. Math. Statist. Shamir, Ohad, Nati Srebro, and Tong Zhang (2014). “Communication-Efficient Distributed Optimization Using an Approximate Newton-Type Method”. In: International Conference on Machine Learning. 32/32
  • 59. Supervised Machine Learning Data: n observations (a_i, b_i) ∈ R^p × R. Prediction function: h(a, x) ∈ R. Motivating examples: • Linear prediction: h(a, x) = xᵀa • Neural networks: h(a, x) = x_mᵀ σ(x_{m−1}ᵀ σ(· · · x_2ᵀ σ(x_1ᵀ a))) [Figure: a feed-forward network with an input layer, a hidden layer and an output layer.]
  • 60. Supervised Machine Learning Data: n observations (a_i, b_i) ∈ R^p × R. Prediction function: h(a, x) ∈ R. Motivating examples: • Linear prediction: h(a, x) = xᵀa • Neural networks: h(a, x) = x_mᵀ σ(x_{m−1}ᵀ σ(· · · x_2ᵀ σ(x_1ᵀ a))) Minimize some distance (e.g., quadratic) between the prediction and the observed value: minimize_x (1/n) Σ_{i=1}^n ℓ(b_i, h(a_i, x)) =: (1/n) Σ_{i=1}^n fi(x), where popular examples of ℓ are • the squared loss, ℓ(b_i, h(a_i, x)) := (b_i − h(a_i, x))² • the logistic (softmax) loss, ℓ(b_i, h(a_i, x)) := log(1 + exp(−b_i h(a_i, x)))
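The two example losses as small NumPy functions (labels b_i ∈ {−1, +1} for the logistic loss, as in the formula above; function names are mine):

```python
import numpy as np

def squared_loss(b, pred):
    """(b - h(a, x))^2 with pred = h(a, x)."""
    return (b - pred) ** 2

def logistic_loss(b, pred):
    """log(1 + exp(-b * pred)); np.logaddexp avoids overflow for large |pred|."""
    return np.logaddexp(0.0, -b * pred)
```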
  • 61. Sparse Proximal SAGA For step size γ = 1/(5L) and f μ-strongly convex (μ > 0), Sparse Proximal SAGA converges geometrically in expectation. At iteration t we have E∥x_t − x*∥² ≤ (1 − (1/5) min{1/n, 1/κ})ᵗ C₀, with C₀ = ∥x₀ − x*∥² + (1/(5L²)) Σ_{i=1}^n ∥α⁰_i − ∇fi(x*)∥² and κ = L/μ (the condition number). Implications • Same convergence rate as SAGA, with cheaper updates. • In the “big data” regime (n ≥ κ): rate in O(1/n). • In the “ill-conditioned” regime (n ≤ κ): rate in O(1/κ). • Adaptivity to strong convexity, i.e., no need to know the strong convexity parameter to obtain linear convergence.
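A back-of-the-envelope helper that turns the geometric rate above into an iteration count: to reach E∥x_t − x*∥² ≤ ε one needs t ≥ log(C₀/ε) / (−log(1 − (1/5) min{1/n, 1/κ})). The numerical values of n, κ, C₀ below are placeholders, not results from the slides.

```python
import math

def iters_to_eps(n, kappa, C0, eps):
    """Smallest t with (1 - 0.2 * min(1/n, 1/kappa))**t * C0 <= eps."""
    rho = 1.0 - 0.2 * min(1.0 / n, 1.0 / kappa)
    return math.ceil(math.log(C0 / eps) / -math.log(rho))

# e.g. n = 1e6 samples, kappa = 1e4 (big data regime, n >= kappa):
# roughly 5 * n * log(C0 / eps) iterations
print(iters_to_eps(10**6, 10**4, 1.0, 1e-10))
```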
  • 62. Convergence ProxASAGA Suppose τ ≤ 1/(10√∆). Then: • If κ ≥ n, then with step size γ = 1/(36L), ProxASAGA converges geometrically with rate factor Ω(1/κ). • If κ < n, then with step size γ = 1/(36nμ), ProxASAGA converges geometrically with rate factor Ω(1/n). In both cases the convergence rate is the same as for Sparse Proximal SAGA =⇒ ProxASAGA is linearly faster, up to a constant factor. In both cases the step size does not depend on τ. If τ ≤ 6κ, a universal step size of Θ(1/L) achieves a rate similar to Sparse Proximal SAGA, making the algorithm adaptive to local strong convexity (knowledge of κ not required).