This document provides an overview of asynchronous stochastic optimization methods and algorithms. It discusses asynchronous parallel stochastic gradient descent (SGD) and how it can minimize idle time. It also introduces asynchronous variance-reduced optimization methods like asynchronous SAGA that provide faster convergence than SGD. The document analyzes the convergence properties of asynchronous optimization methods and presents empirical results demonstrating the speedups achieved by asynchronous proximal SAGA (ProxASAGA) on large datasets.
Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization (Fabian Pedregosa)
The document proposes a new parallel method called Proximal Asynchronous SAGA (ProxASAGA) for solving composite optimization problems. ProxASAGA extends SAGA to handle nonsmooth objectives using proximal operators, and runs asynchronously in parallel without locks. It is shown theoretically to converge at the same linear rate as the sequential algorithm, and in practice achieves speedups of 6-12x on a 20-core machine on large datasets, with greater speedups on sparser problems, as predicted by theory.
The document discusses building robust machine learning systems that can handle concept drift. It introduces the challenges of concept drift when the underlying data distribution changes over time. It proposes using Gaussian process classifiers with an adaptive training window approach. The approach monitors for concept drift and retrains the model if detected. It tests the approach on artificial data streams with different drift scenarios and finds the adaptive approach performs better than a static model at handling concept drift. Future work could explore other drift detection methods and ensembles of adaptive Gaussian process classifiers.
Learning to discover Monte Carlo algorithm on spin ice manifold (Kai-Wen Zhao)
A global-update Monte Carlo sampler can be discovered naturally by a trained machine using the policy gradient method on a topologically constrained environment.
Dictionary Learning for Massive Matrix Factorization (recsysfr)
The document presents a new algorithm called Subsampled Online Dictionary Learning (SODL) for solving very large matrix factorization problems with missing values efficiently. SODL adapts an existing online dictionary learning algorithm to handle missing values by only using the known ratings for each user, allowing it to process large datasets with billions of ratings in linear time with respect to the number of known ratings. Experiments on movie rating datasets show that SODL achieves similar prediction accuracy as the fastest existing solver but with a speed up of up to 6.8 times on the largest Netflix dataset tested.
Bayesian Nonparametrics: Models Based on the Dirichlet Process (Alessandro Panella)
This document summarizes an introduction to Bayesian nonparametric models presented by Alessandro Panella. It discusses Bayesian learning and De Finetti's theorem, which shows that any exchangeable sequence of random variables can be represented as conditionally independent given a random variable. Finite mixture models are introduced as a Bayesian approach to clustering. Dirichlet process mixture models provide a nonparametric generalization that allows for an unbounded number of clusters.
The document discusses recommender systems and sequential recommendation problems. It covers several key points:
1) Matrix factorization and collaborative filtering techniques are commonly used to build recommender systems, but they have limitations, such as cold-start problems and difficulty incorporating additional constraints.
2) Sequential recommendation problems can be framed as multi-armed bandit problems, where past recommendations influence future recommendations.
3) Various bandit algorithms like UCB, Thompson sampling, and LinUCB can be applied, but extending guarantees to models like matrix factorization is challenging. Offline evaluation on real-world datasets is important.
Topic of presentation: Variational autoencoders for speech processing
The main points of the presentation: Variational autoencoders (or VAEs) have become one of the most popular unsupervised learning techniques for modelling complex data distributions, such as images and audio. In this talk I'll begin with a general introduction to VAEs and then review a recent technique called VQ-VAE, which is capable of learning a rudimentary phoneme-level language model from raw audio without any supervision.
http://dataconf.com.ua/speaker-page/dmytro-bielievtsov.php
https://www.youtube.com/watch?v=euYSAL-aKMI&list=PL5_LBM8-5sLjbRFUtXaUpg84gtJtyc4Pu&t=0s&index=9
This document presents a dissertation on improving the baby step giant step algorithm for solving the elliptic curve discrete logarithm problem. It begins with an overview of cryptography, symmetric and asymmetric encryption, and elliptic curve cryptography. It then discusses the elliptic curve discrete logarithm problem and surveys the existing literature. The proposed approach improves the baby step giant step algorithm by using a smaller baby step set size. Experimental results on two examples show that the proposed approach has a faster runtime than the previous method. A complexity analysis is also presented.
This document summarizes results on analyzing stochastic gradient descent (SGD) algorithms for minimizing convex functions. It shows that a continuous-time version of SGD (SGD-c) can strongly approximate the discrete-time version (SGD-d) under certain conditions. It also establishes that SGD achieves the minimax optimal convergence rate of O(t^-1/2) for α=1/2 by using an "averaging from the past" procedure, closing the gap between previous lower and upper bound results.
We provide a review of the recent literature on statistical risk bounds for deep neural networks. We also discuss some theoretical results that compare the performance of deep ReLU networks to other methods such as wavelets and spline-type methods. The talk will moreover highlight some open problems and sketch possible new directions.
Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal... (Kohei Hayashi)
1) The document presents a new method called generalized factorized asymptotic Bayesian inference (gFAB) that extends previous work on factorized asymptotic Bayesian inference (FAB) to allow it to be applied to general latent variable models, not just binary latent variable models.
2) gFAB involves defining a new criterion called the generalized factorized information criterion (gFIC) that can be used for model selection. gFIC approximates the marginal likelihood and adds a penalty term involving the Hessian of the log joint distribution with respect to the model parameters.
3) gFAB can be optimized using an alternating updating procedure similar to expectation-maximization (EM) and provides an asymptotically accurate approximation to the marginal likelihood.
This document summarizes and analyzes first-order meta-learning algorithms. It discusses first-order MAML (FOMAML), which approximates the MAML objective using only first-order information; FOMAML is equivalent to applying the last gradient of the inner loop to the initial parameters. Reptile is also analyzed, which simply averages the parameter updates. In expectation, the gradients of MAML, FOMAML, and Reptile all depend on the average gradient and the average inner product between gradients. Experiments show similar performance for FOMAML and Reptile. The analysis suggests SGD may generalize well because it approximates MAML.
A Gentle Introduction to Bayesian Nonparametrics (Julyan Arbel)
The document provides an introduction to Bayesian nonparametrics and the Dirichlet process. It explains that Bayesian nonparametrics aims to fit models that can adapt their complexity based on the data, without strictly imposing a fixed structure. The Dirichlet process is described as a prior distribution on the space of all probability distributions, allowing the model to utilize an infinite number of parameters. Nonparametric mixture models using the Dirichlet process provide a flexible approach to density estimation and clustering.
This document discusses macrocanonical models for texture synthesis. It begins by introducing the goal of texture synthesis and providing a brief history. It then describes the parametric question of combining randomness and structure in images. Specifically, it discusses maximizing entropy under geometric constraints. The document goes on to discuss links to statistical physics, defining microcanonical and macrocanonical models. It focuses on studying the macrocanonical model, describing how to find optimal parameters through gradient descent and how to sample from the model using Langevin dynamics. The document provides examples of texture synthesis and compares results to other methods.
This document provides an overview of the key topics covered in Lecture 9 of an Artificial Intelligence course on fuzzy logic. The lecture introduces fuzzy sets and membership functions as a way to represent ambiguous or uncertain values. It covers fuzzy set operations, fuzzy numbers, fuzzy rules for reasoning, and fuzzy inference. An example is provided to illustrate how fuzzy logic can be applied to control the speed of a vehicle based on road curvature. The homework assignments involve problems working with the concepts introduced in the lecture.
A discussion on sampling graphs to approximate network classification functions (LARCA UPC)
The problem of network classification consists of assigning a finite set of labels to the nodes of a graph; the underlying assumption is that nodes with the same label tend to be connected via strong paths in the graph. This is similar to the assumptions made by graph-based semi-supervised learning algorithms, which build an artificial graph from vectorial data. Such semi-supervised algorithms are based on label propagation principles, and their accuracy relies heavily on the structure (presence of edges) of the graph.
In this talk I will discuss ideas on how to perform sampling in the network graph, thus sparsifying the structure in order to apply semi-supervised algorithms and compute the classification function on the network efficiently. I will show very preliminary experiments indicating that the sampling technique has an important effect on the final results, and discuss open theoretical and practical questions that are yet to be solved.
This document outlines Hadi Sinaee's seminar on Restricted Boltzmann Machines (RBMs) from scratch. The seminar covers:
1. Unsupervised learning and using Markov Random Fields (MRFs) to learn unknown data distributions.
2. Maximum likelihood estimation cannot be done analytically for MRFs, so numerical approximation is required.
3. Introducing latent variables in the form of hidden units allows modeling high-dimensional distributions like images.
4. Computing the log-likelihood gradient involves taking expectations that require summing over all possible latent variable assignments, so approximation is needed.
New Insights and Perspectives on the Natural Gradient Method (Yoonho Lee)
The document discusses the natural gradient method for optimizing neural networks. It explains that the natural gradient finds the direction of steepest descent in function space rather than parameter space. The natural gradient is invariant to reparameterization. For most neural networks, natural gradient descent is equivalent to a second-order optimization method called the generalized Gauss-Newton method. The natural gradient takes into account the geometry of the parameter space defined by the Fisher information matrix.
RuleML2015: Learning Characteristic Rules in Geographic Information Systems (RuleML)
We provide a general framework for learning characterization rules of a set of objects in Geographic Information Systems (GIS), relying on the definition of distance-quantified paths. Such expressions specify how to navigate between the different layers of the GIS, starting from the target set of objects to characterize. We have defined a generality relation between quantified paths and proved that it is monotone with respect to the notion of coverage, thus allowing us to develop an interactive and effective algorithm to explore the search space of possible rules. We describe GISMiner, an interactive system that we have developed based on our framework. Finally, we present our experimental results from a real GIS about mineral exploration.
The document presents algorithms for finding the largest induced q-colorable subgraph of a given graph G. It first describes a randomized algorithm that runs in time proportional to enumerating maximal independent sets and a polynomial in n and q. For perfect graphs, where maximum independent sets can be found efficiently, it gives a deterministic algorithm running in similar time. It also shows that the problem does not admit a polynomial kernel when parameterized by the solution size for split and perfect graphs under standard assumptions.
We consider the problem of model estimation in episodic Block MDPs. In these MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states. We are interested in estimating the latent state decoding function (the mapping from the observations to latent states) based on data generated under a fixed behavior policy. We derive an information-theoretical lower bound on the error rate for estimating this function and present an algorithm approaching this fundamental limit. In turn, our algorithm also provides estimates of all the components of the MDP.
We apply our results to the problem of learning near-optimal policies in the reward-free setting. Based on our efficient model estimation algorithm, we show that we can infer a policy converging (as the number of collected samples grows large) to the optimal policy at the best possible asymptotic rate. Our analysis provides necessary and sufficient conditions under which exploiting the block structure yields improvements in the sample complexity for identifying near-optimal policies. When these conditions are met, the sample complexity in the minimax reward-free setting is improved by a multiplicative factor $n$, where $n$ is the number of contexts.
This document discusses Latent Dirichlet Allocation (LDA), a probabilistic topic modeling technique. It begins with an introduction to topic models and their use in understanding large collections of documents. It then describes LDA's generative process using Dirichlet distributions to represent document-topic and topic-term distributions. Approximate inference methods for LDA like Gibbs sampling are also summarized. The document concludes by outlining the implementation of an LDA model, including preprocessing of documents and collapsed Gibbs sampling.
Asynchronous parallel algorithms are developed to solve massive optimization problems in distributed data systems; they can be run in parallel on multiple nodes with little or no synchronization. Recently they have been successfully implemented to solve a range of difficult problems in practice. However, the existing theories are mostly based on fairly restrictive assumptions on the delays, and cannot explain the convergence and speedup properties of such algorithms. In this talk we will give an overview of distributed optimization, and discuss some new theoretical results on the convergence of the asynchronous parallel stochastic gradient algorithm with unbounded delays. Simulated and real data will be used to demonstrate the practical implications of these theoretical results.
Metaheuristic Algorithms: A Critical Analysis (Xin-She Yang)
The document discusses metaheuristic algorithms and their application to optimization problems. It provides an overview of several nature-inspired algorithms including particle swarm optimization, firefly algorithm, harmony search, and cuckoo search. It describes how these algorithms were inspired by natural phenomena like swarming behavior, flashing fireflies, and bird breeding. The document also discusses applications of these algorithms to engineering design problems like pressure vessel design and gear box design optimization.
Reading review of "Inferring Multiple Graphical Structures" (tuxette)
This document summarizes and reviews methods for inferring gene co-expression networks from gene expression data, as presented in related articles including Chiquet et al. It describes various statistical approaches implemented in packages like GeneNet and glasso, including graphical Gaussian models using shrinkage and sparse linear regression. It compares the resulting network densities produced by different methods.
This document discusses applying deep learning techniques like variational autoencoders to cyber security and anomaly detection in network traffic. It notes that while deep learning has made progress in related areas, modeling categorical network flow data poses unique challenges. It proposes using variational inference with a Gumbel softmax relaxation to train a generative model on network flows in an unsupervised manner. The trained model could then be used for tasks like anomaly detection based on the model's predictions or a sample's reconstruction error.
This document proposes a method for linear regression on symbolic data where each observation is represented by a Gaussian distribution. It derives the likelihood function for such "Gaussian symbols" and shows that it can be maximized using gradient descent. Simulation results demonstrate that the maximum likelihood estimator performs better than a naive least squares regression on the mean of each symbol. The method extends classical linear regression to the symbolic data setting.
We approach the screening problem - i.e. detecting which inputs of a computer model significantly impact the output - from a formal Bayesian model selection point of view. That is, we place a Gaussian process prior on the computer model and consider the $2^p$ models that result from assuming that each of the subsets of the $p$ inputs affect the response. The goal is to obtain the posterior probabilities of each of these models. In this talk, we focus on the specification of objective priors on the model-specific parameters and on convenient ways to compute the associated marginal likelihoods. These two problems that normally are seen as unrelated, have challenging connections since the priors proposed in the literature are specifically designed to have posterior modes in the boundary of the parameter space, hence precluding the application of approximate integration techniques based on e.g. Laplace approximations. We explore several ways of circumventing this difficulty, comparing different methodologies with synthetic examples taken from the literature.
Authors: Gonzalo Garcia-Donato (Universidad de Castilla-La Mancha) and Rui Paulo (Universidade de Lisboa)
This document provides an overview of automated theorem proving. It discusses:
1) The history and background of automated theorem proving, from Hobbes and Leibniz proposing algorithmic logic to modern computer-based approaches.
2) The theoretical limitations of automated reasoning due to results like Gödel's incompleteness theorems, but also practical applications like verifying mathematics and computer systems.
3) How automated reasoning involves expressing statements formally and then manipulating those expressions algorithmically, as anticipated by Leibniz centuries ago.
The document provides an introduction to deep learning, including the following key points:
- Deep learning uses neural networks inspired by the human brain to perform machine learning tasks. The basic unit is an artificial neuron that takes weighted inputs and applies an activation function.
- Popular deep learning libraries and frameworks include TensorFlow, Keras, PyTorch, and Caffe. Common activation functions are sigmoid, tanh, and ReLU.
- Neural networks are trained using forward and backpropagation. Forward propagation feeds inputs through the network while backpropagation calculates errors to update weights.
- Convolutional neural networks are effective for image and visual data tasks due to their use of convolutional and pooling layers. Recurrent neural networks can process sequential data due to their recurrent connections, which carry information across time steps.
Cuckoo Search Algorithm: An Introduction (Xin-She Yang)
This presentation explains the fundamental ideas of the standard Cuckoo Search (CS) algorithm. It also contains links to free Matlab code at the MathWorks File Exchange and animations of numerical simulations (video on YouTube). An example of multi-objective cuckoo search (MOCS) is also given, with a link to the Matlab code.
The document discusses data structures and algorithms. It defines key concepts like primitive data types, data structures, static vs dynamic structures, abstract data types, algorithm design, analysis of time and space complexity, recursion, stacks and common stack operations like push and pop. Examples are provided to illustrate factorial calculation using recursion and implementation of a stack.
The document discusses data structures and algorithms. It defines key concepts like primitive data types, data structures, static vs dynamic structures, abstract data types, algorithm design, analysis of time and space complexity, and recursion. It provides examples of algorithms and data structures like stacks and using recursion to calculate factorials. The document covers fundamental topics in data structures and algorithms.
The document discusses data structures and algorithms. It defines key concepts like primitive data types, data structures, static vs dynamic structures, abstract data types, algorithm design, analysis of time and space complexity, and recursion. It provides examples of algorithms and data structures like arrays, stacks and the factorial function to illustrate recursive and iterative implementations. Problem solving techniques like defining the problem, designing algorithms, analyzing and testing solutions are also covered.
The document discusses data structures and algorithms. It defines key concepts like primitive data types, data structures, static vs dynamic structures, abstract data types, algorithm analysis including time and space complexity, and common algorithm design techniques like recursion. It provides examples of algorithms and data structures like stacks and using recursion to calculate factorials. The document covers fundamental topics in data structures and algorithms.
This document discusses data structures and algorithms. It begins by defining data structures as the logical organization of data and primitive data types like integers that hold single pieces of data. It then discusses static versus dynamic data structures and abstract data types. The document outlines the main steps in problem solving as defining the problem, designing algorithms, analyzing algorithms, implementing, testing, and maintaining solutions. It provides examples of space and time complexity analysis and discusses analyzing recursive algorithms through repeated substitution and telescoping methods.
Learning for Optimization: EDAs, probabilistic modelling, or ... (butest)
Marcus Gallagher gave a talk on explicit modelling in metaheuristic optimization. He discussed estimation of distribution algorithms which use probabilistic models to represent promising regions of the search space. He provided examples of modelling approaches like PBIL, MIMIC, COMIT and BOA. Finally, he summarized that EDAs take an explicit modelling approach to optimization using existing statistical models and can solve challenging problems by visualizing the model.
MLPfit is a tool for designing and training multi-layer perceptrons (MLPs) for tasks like function approximation and classification. It implements stochastic minimization as well as more powerful methods like conjugate gradients and BFGS. MLPfit is designed to be simple, precise, fast and easy to use for both standalone and integrated applications. Documentation and source code are available online.
The document discusses portfolio methods for optimization problems with uncertainty. It introduces noisy optimization problems where the objective function includes random variables. It then discusses various optimization criteria and methods for noisy optimization problems, including resampling methods to reduce noise. The document also covers portfolio approaches that combine or select among multiple optimization solvers to handle uncertainty.
Similar to Asynchronous Stochastic Optimization, New Analysis and Algorithms (20)
Random Matrix Theory and Machine Learning - Part 4 (Fabian Pedregosa)
Deep learning models with millions or billions of parameters should overfit according to classical theory, but they do not. The emerging theory of double descent seeks to explain why larger neural networks can generalize well. Random matrix theory provides a tractable framework to model double descent through random feature models, where the number of random features controls model capacity. In the high-dimensional limit, the test error of random feature regression exhibits a double descent shape that can be computed analytically.
Random Matrix Theory and Machine Learning - Part 3 (Fabian Pedregosa)
ICML 2021 tutorial on random matrix theory and machine learning.
Part 3 covers: 1. Motivation: Average-case versus worst-case in high dimensions 2. Algorithm halting times (runtimes) 3. Outlook
Random Matrix Theory and Machine Learning - Part 1 (Fabian Pedregosa)
This document provides an introduction to random matrix theory and its applications in machine learning. It discusses several classical random matrix ensembles like the Gaussian Orthogonal Ensemble (GOE) and Wishart ensemble. These ensembles are used to model phenomena in fields like number theory, physics, and machine learning. Specifically, the GOE is used to model Hamiltonians of heavy nuclei, while the Wishart ensemble relates to the Hessian of least squares problems. The tutorial will cover applications of random matrix theory to analyzing loss landscapes, numerical algorithms, and the generalization properties of machine learning models.
Average case acceleration through spectral density estimation (Fabian Pedregosa)
We develop a framework for designing optimal quadratic optimization methods in terms of their average-case runtime. This yields a new class of methods that achieve acceleration through a model of the Hessian's expected spectral density. We develop explicit algorithms for the uniform, Marchenko-Pastur, and exponential distributions. These methods are momentum-based gradient algorithms whose hyper-parameters can be estimated without knowledge of the Hessian's smallest singular value, in contrast with classical accelerated methods like Nesterov acceleration and Polyak momentum. Empirical results on quadratic, logistic regression and neural networks show the proposed methods always match and in many cases significantly improve over classical accelerated methods.
Full paper: https://arxiv.org/pdf/1804.02339.pdf
We propose and analyze a novel adaptive step size variant of the Davis-Yin three operator splitting, a method that can solve optimization problems composed of a sum of a smooth term for which we have access to its gradient and an arbitrary number of potentially non-smooth terms for which we have access to their proximal operator. The proposed method leverages local information of the objective function, allowing for larger step sizes while preserving the convergence properties of the original method. It only requires two extra function evaluations per iteration and does not depend on any step size hyperparameter besides an initial estimate. We provide a convergence rate analysis of this method, showing sublinear convergence rate for general convex functions and linear convergence under stronger assumptions, matching the best known rates of its non adaptive variant. Finally, an empirical comparison with related methods on 6 different problems illustrates the computational advantage of the adaptive step size strategy.
This document discusses an adaptive step-size method for the Frank-Wolfe algorithm that eliminates the need for a manually selected step-size parameter. It presents the standard Frank-Wolfe algorithm and the Demyanov-Rubinov variant that uses a step-size based on sufficient decrease. It then proposes an adaptive Frank-Wolfe algorithm that replaces the global Lipschitz constant L with a local constant Lt, allowing for potentially larger step sizes. This adaptive approach is shown to maintain sufficient decrease and can be extended to other Frank-Wolfe variants like away-steps Frank-Wolfe.
Hyperparameter optimization with approximate gradient (Fabian Pedregosa)
This document discusses hyperparameter optimization using approximate gradients. It introduces the problem of optimizing hyperparameters along with model parameters. While model parameters can be estimated from data, hyperparameters require methods like cross-validation. The document proposes using approximate gradients to optimize hyperparameters more efficiently than costly methods like grid search. It derives the gradient of the objective with respect to hyperparameters and presents an algorithm called HOAG that approximates this gradient using inexact solutions. The document analyzes HOAG's convergence and provides experimental results comparing it to other hyperparameter optimization methods.
Lightning: large scale machine learning in python (Fabian Pedregosa)
Lightning is a Python library for large-scale machine learning that incorporates recent advances in optimization algorithms. It is compatible with scikit-learn and supports both dense and sparse data as well as structured sparsity penalties. Lightning scales to large datasets using stochastic optimization methods like SGD, SVRG, SDCA, and SAGA. It also efficiently handles large feature spaces using coordinate descent algorithms. The API is similar to scikit-learn but is based on optimization algorithms rather than machine learning models. Lightning is part of the scikit-learn-contrib project.
Profiling in Python: a concise summary of key profiling tools.
cProfile and line_profiler profile execution time and identify slow lines of code. memory_profiler profiles memory usage with line-by-line or time-based outputs. YEP extends profiling to compiled C/C++ extensions like Cython modules, which are not covered by the standard Python profilers.
2. Where I Come From
ML / Optimization / Software Guy
• Engineer (2010–2012): first contact with ML, developing the ML library scikit-learn.
• ML and Neuroscience (2012–2015): PhD applying ML to neuroscience.
• ML and Optimization (2015–): stochastic / parallel / constrained / hyperparameter optimization.
3. Outline
Goal: Review recent work in asynchronous parallel optimization for machine learning¹,².
1. Asynchronous parallel optimization, Asynchronous SGD.
2. Asynchronous variance-reduced optimization.
3. Analysis of asynchronous methods: what we can prove.
¹ Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien (2018). "Improved asynchronous parallel optimization analysis for stochastic incremental methods". In: to appear in Journal of Machine Learning Research.
² Fabian Pedregosa, Rémi Leblond, and Simon Lacoste-Julien (2017). "Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization". In: Advances in Neural Information Processing Systems 30 (NIPS).
8. 40 years of CPU trends
• The speed of CPUs has stagnated since 2005.
• At the same time, the number of cores increases exponentially.
Parallel algorithms are needed to take advantage of modern CPUs.
9. Parallel Optimization: Not a new topic
• Most of the principles and methods already appear in (Bertsekas and Tsitsiklis, 1989).
• For linear systems it can be traced back even earlier (Arrow and Hurwicz, 1958).
10. Asynchronous vs Synchronous methods
Synchronous methods
• Wait for the slowest worker.
• Limited speedup due to synchronization cost.
Asynchronous methods
• Workers receive work as needed.
• Minimize idle time.
• Challenging analysis.
(Figure: timelines of four workers. Synchronous: workers sit idle between synchronization points t0, t1, t2. Asynchronous: updates land at t0 through t8 with no idle time.)
11. Optimization for machine learning
Many problems in machine learning can be framed as
$\min_{x \in \mathbb{R}^p} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)$
Gradient descent (Cauchy, 1847). Descend along the steepest direction:
$x^+ = x - \gamma \nabla f(x)$
Stochastic gradient descent (SGD) (Robbins and Monro, 1951). Select a random index $i$ and descend along $-\nabla f_i(x)$:
$x^+ = x - \gamma \nabla f_i(x)$
(Figure source: Francis Bach)
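To make the two update rules concrete, here is a minimal sketch (not from the talk; the synthetic data and step-size choices are illustrative) running both on a least-squares objective:

```python
import numpy as np

# Synthetic least-squares instance: f_i(x) = 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, p = 1000, 20
A = rng.standard_normal((n, p))
b = A @ rng.standard_normal(p)

def full_grad(x):
    """Gradient of f(x) = (1/n) * sum_i f_i(x)."""
    return A.T @ (A @ x - b) / n

def grad_i(x, i):
    """Gradient of a single f_i."""
    return (A[i] @ x - b[i]) * A[i]

gamma_gd = n / np.linalg.norm(A, 2) ** 2        # ~ 1/L for the averaged objective
gamma_sgd = 1.0 / np.sum(A ** 2, axis=1).max()  # ~ 1/max_i L_i, safe for SGD

x_gd, x_sgd = np.zeros(p), np.zeros(p)
for t in range(5000):
    x_gd = x_gd - gamma_gd * full_grad(x_gd)      # gradient descent: one full pass per step
    i = rng.integers(n)
    x_sgd = x_sgd - gamma_sgd * grad_i(x_sgd, i)  # SGD: one random f_i per step
```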
12. Example: Asynchronous SGD (Tsitsiklis, Bertsekas, and Athans, 1986)
Recent revival due to applications in machine learning (Niu et al., 2011; Dean et al., 2012). Other names: Downpour SGD, Hogwild.
Problem: $\min_x f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)$
General Algorithm. All workers do in parallel:
1. Read the information in shared memory ($\hat{x}$).
2. Sample $i \in \{1, \ldots, n\}$ and compute $\nabla f_i(\hat{x})$.
3. Perform the SGD update on shared memory: $x = x - \gamma \nabla f_i(\hat{x})$.
Note that $x$ and $\hat{x}$ might be different.
14. Asynchronous SGD
• The write is performed with an old version of the coefficients.
• The update requires a lock on the vector of coefficients.
15. Hogwild! (Niu et al., 2011): Lock-free Async. SGD
Algorithm 1 Hogwild
loop
  $\hat{x}$ = inconsistent read of $x$
  Sample $i$ uniformly in $\{1, \ldots, n\}$
  Let $S_i$ be $f_i$'s support
  $[\delta x]_{S_i} := -\gamma \nabla f_i(\hat{x})$
  for $v$ in $S_i$ do
    $[x]_v \leftarrow [x]_v + [\delta x]_v$  // atomic
  end for
end loop
• All read/write operations to shared memory are inconsistent, i.e., no vector-level locks are taken while updating shared memory.
• Key assumption: sparse gradients ($|S_i| \ll$ dimension).
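The control flow above can be sketched in a few lines of Python; the snippet below (a didactic toy with made-up data and step size, not the paper's implementation) mimics the inconsistent reads and per-coordinate writes with threads. Note that CPython's GIL prevents any real parallel speedup here; actual Hogwild implementations run in C/C++ or OpenMP.

```python
import threading

import numpy as np
from scipy import sparse

# Toy sparse least-squares instance, so each update touches few coordinates.
rng = np.random.default_rng(0)
A = sparse.random(5000, 200, density=0.01, format="csr", random_state=0)
b = rng.standard_normal(5000)
x = np.zeros(200)   # shared iterate: no lock is ever taken on it
gamma = 0.01

def worker(n_steps, seed):
    local_rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        i = local_rng.integers(A.shape[0])
        row = A.getrow(i)              # support S_i = row.indices
        x_hat = x[row.indices]         # inconsistent read (no lock)
        delta = -gamma * (row.data @ x_hat - b[i]) * row.data
        x[row.indices] += delta        # per-coordinate writes, not atomic in
                                       # CPython -- precisely the Hogwild gamble

threads = [threading.Thread(target=worker, args=(2000, s)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```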
16. Hogwild: when does it converge?
Sparse $f_i$: is this a reasonable assumption?
• If $f_i(x) = \varphi(a_i^T x)$ then $\nabla f_i(x) = a_i\, \varphi'(a_i^T x)$.
• Gradients are sparse whenever the data $a_i$ are sparse.
• This is the case for generalized linear models (least squares, logistic regression, linear SVMs, etc.).
In this class of models, Hogwild enjoys almost linear speedups.
[Figure 1: Speedup of Hogwild. Image source: (Niu et al., 2011)]
17. Hogwild is fast
Hogwild can be very fast. But it's still SGD...
• With a constant step size, it bounces around the optimum.
• With a decreasing step size, convergence is slow.
• There are better alternatives.
20. Variance-reduced Stochastic Optimization
Problem: finite sum
$\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} f_i(x)$, where $n < \infty$.
The SAGA algorithm (Defazio, Bach, and Lacoste-Julien, 2014). Sample uniformly $i \in \{1, \ldots, n\}$ and compute $(x^+, \alpha^+)$ as
$x^+ = x - \gamma \underbrace{\big(\nabla f_i(x) - \alpha_i + \bar{\alpha}\big)}_{\text{gradient estimate}} \quad ; \quad \alpha_i^+ = \nabla f_i(x)$,
where $\bar{\alpha} = \frac{1}{n}\sum_{i=1}^{n} \alpha_i$ is the average of the memory terms.
This variance-reduction technique is known under different names, e.g., control variates in Monte Carlo methods.
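A minimal sequential sketch of this update (the function names and signatures are illustrative, not from the talk) shows the bookkeeping that keeps each iteration at $O(p)$ cost:

```python
import numpy as np

def saga(grad_i, n, p, gamma, n_steps, seed=0):
    """Minimal SAGA loop. alpha[i] stores the last gradient evaluated for f_i;
    alpha_bar tracks their average so each update costs O(p), not O(n*p)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(p)
    alpha = np.zeros((n, p))
    alpha_bar = np.zeros(p)
    for _ in range(n_steps):
        i = rng.integers(n)
        g = grad_i(x, i)
        x = x - gamma * (g - alpha[i] + alpha_bar)  # unbiased, variance-reduced estimate
        alpha_bar = alpha_bar + (g - alpha[i]) / n  # keep the running average exact
        alpha[i] = g
    return x
```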
21. The SAGA Algorithm
Theory: linear (i.e., exponential) convergence on strongly convex problems.
Practical algorithm: converges with a fixed step size $1/(3L)$.
[Figure: function suboptimality vs. time for SAGA, SGD with constant step size, and SGD with decreasing step size.]
Already used in scikit-learn.
24. Asynchronous SAGA
Motivation: can we design an asynchronous version of SAGA?
The SAGA update is inefficient (without tricks) for sparse gradients:
$x^+ = x - \gamma\big(\underbrace{\nabla f_i(x)}_{\text{sparse}} - \underbrace{\alpha_i}_{\text{sparse}} + \underbrace{\bar{\alpha}}_{\text{dense!}}\big)$
Need for a sparse variant of SAGA:
• Many large-scale datasets are sparse.
• Asynchronous algorithms work best when updates are sparse.
25. Sparse SAGA
We can get away with “sparsifying” the gradient estimate.
• Let $P_i$ be the projection onto $\mathrm{supp}(\nabla f_i)$.
• Let $D_i = P_i \big/ \big(\frac{1}{n}\sum_{i=1}^{n} P_i\big)$.
• Crucial property: $\mathbb{E}_i[D_i] = I$.
Sparse SAGA algorithm³. Sample uniformly $i \in \{1, \ldots, n\}$ and compute $(x^+, \alpha^+)$ as
$x^+ = x - \gamma\big(\nabla f_i(x) - \alpha_i + D_i \bar{\alpha}\big) \quad ; \quad \alpha_i^+ = \nabla f_i(x)$
³ Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien (2017). “ASAGA: Asynchronous Parallel SAGA”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017).
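A sketch of how the $D_i$ enter in practice, assuming a generalized linear model so that $\mathrm{supp}(\nabla f_i)$ is the nonzero pattern of the data row $a_i$ (the least-squares loss and helper names are illustrative):

```python
import numpy as np
from scipy import sparse

def support_scaling(A):
    """Diagonal of D = ((1/n) * sum_i P_i)^{-1}, where P_i projects onto the
    nonzeros of row a_i (= supp(grad f_i) for generalized linear models)."""
    n = A.shape[0]
    counts = np.asarray((A != 0).sum(axis=0)).ravel()  # rows touching coordinate j
    return n / np.maximum(counts, 1)                   # guard never-touched coordinates

def sparse_saga_step(x, alpha, alpha_bar, A, b, d, gamma, rng):
    """One Sparse SAGA update for f_i(x) = 0.5 * (a_i^T x - b_i)^2; every
    operation touches only the support S of grad f_i."""
    i = rng.integers(A.shape[0])
    row = A.getrow(i)
    S = row.indices
    g = (row.data @ x[S] - b[i]) * row.data                  # gradient restricted to S
    x[S] -= gamma * (g - alpha[i, S] + d[S] * alpha_bar[S])  # D_i applied to alpha_bar
    alpha_bar[S] += (g - alpha[i, S]) / A.shape[0]           # average stays exact
    alpha[i, S] = g
```

Since $\frac{1}{n}\sum_i P_i$ is diagonal with entries equal to the fraction of gradients touching each coordinate, dividing by it on the support gives $\mathbb{E}_i[D_i] = I$, which is what keeps the update unbiased.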
28. Sparse SAGA
• All operations are sparse; the cost per iteration is $O(\text{nonzeros in } \nabla f_i)$.
• Same convergence properties as SAGA, but with cheaper iterations in the presence of sparsity.
30. Proximal Sparse SAGA
Problem: composite finite sum
$\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} f_i(x) + g(x)$, where
• $g$ is potentially nonsmooth (think $\lambda \|\cdot\|_1$ or an indicator function), but we have access to $\mathrm{prox}_{\gamma g}(x) = \arg\min_z \big\{\gamma g(z) + \tfrac{1}{2}\|x - z\|^2\big\}$.
• For some $g$, the proximal operator is available in closed form. Examples: $\ell_1$ norm (soft thresholding), indicator function (projection).
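The two closed-form examples just mentioned, in code (a standard sketch, not tied to the paper's implementation):

```python
import numpy as np

def prox_l1(x, step):
    """prox of g = lambda * ||.||_1, evaluated with step = gamma * lambda:
    coordinate-wise soft thresholding (valid because g is separable)."""
    return np.sign(x) * np.maximum(np.abs(x) - step, 0.0)

def prox_box(x, lo, hi):
    """prox of the indicator of the box [lo, hi]^p: Euclidean projection."""
    return np.clip(x, lo, hi)

print(prox_l1(np.array([1.5, -0.2, 0.7]), 0.5))  # [1.0, -0.0, 0.2]
```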
31. Sparse Proximal SAGA
We can extend Sparse SAGA to incorporate the proximal term.
• Assume $g$ is separable: $g(x) = \sum_{j=1}^{p} g_j(x_j)$.
• Let $\varphi_i = \sum_j (D_i)_{jj}\, g_j(x_j)$.
• Crucial properties: $\mathbb{E}_i[D_i] = I$ and $\mathbb{E}_i[\varphi_i] = g$.
Sparse Proximal SAGA algorithm⁴. Sample uniformly $i \in \{1, \ldots, n\}$ and compute $(x^+, \alpha^+)$ as
$x^+ = \mathrm{prox}_{\gamma \varphi_i}\big(x - \gamma(\nabla f_i(x) - \alpha_i + D_i \bar{\alpha})\big) \quad ; \quad \alpha_i^+ = \nabla f_i(x)$
⁴ Fabian Pedregosa, Rémi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural Information Processing Systems 30 (NIPS).
33. Sparse Proximal SAGA
As SAGA, linear convergence under strong convexity.
Theorem. For step size $\gamma = \frac{1}{5L}$ and $f$ $L$-smooth and $\mu$-strongly convex ($\mu > 0$), at iteration $t$ we have
$\mathbb{E}\,\|x_t - x^*\|^2 \le \big(1 - \tfrac{1}{5}\min\{\tfrac{1}{n}, \tfrac{\mu}{L}\}\big)^t\, C_0$,
with $C_0 = \|x_0 - x^*\|^2 + \frac{1}{5L^2}\sum_{i=1}^{n} \|\alpha_i^0 - \nabla f_i(x^*)\|^2$.
Implications
• Same convergence rate as SAGA, with cheaper updates in the presence of sparsity.
• Adaptivity to strong convexity: no need to know the strong convexity parameter to obtain linear convergence.
34. Asynchronous Proximal SAGA
ProxASAGA (Pedregosa, Leblond, and Lacoste-Julien, 2017)
1. Read the information in shared memory ($\hat{x}$, $\hat{\alpha}$, $\hat{\bar{\alpha}}$).
2. Sample $i$ and compute $\nabla f_i(\hat{x})$.
3. Perform the Sparse Proximal SAGA update on shared memory:
$x = \mathrm{prox}_{\gamma \varphi_i}\big(x - \gamma(\nabla f_i(\hat{x}) - \hat{\alpha}_i + D_i \hat{\bar{\alpha}})\big) \quad ; \quad \alpha_i = \nabla f_i(\hat{x})$
• As in Hogwild!, reads and writes are inconsistent.
• Same convergence rate as the sequential version under sparsity of the gradients (delays $\le \frac{1}{10\sqrt{\text{sparsity}}}$).
37. Empirical Results - Speedup
$\text{Speedup} = \dfrac{\text{Time to } 10^{-10} \text{ suboptimality on one core}}{\text{Time to the same suboptimality on } k \text{ cores}}$
[Figure: time speedup vs. number of cores (1–20) on the KDD10, KDD12, and Criteo datasets, comparing Ideal, ProxASAGA, AsySPCD, and FISTA.]
• ProxASAGA achieves speedups between 6x and 12x on a 20-core architecture.
• As predicted by theory, there is a high correlation between the degree of sparsity and the speedup.
41. Analysis
Active Research Topic
• Lock-free asynchronous SGD: Hogwild! (Niu et al., 2011)
• Stochastic approximation (Duchi, Chaturapruek, and Ré, 2015)
• Nonconvex losses (De Sa et al., 2015; Lian et al., 2015)
• Variance-reduced stochastic methods (Reddi et al., 2015)
Claim #1: there are fundamental flaws in these analyses.
43. Analysis
Analyzing an optimization algorithm requires proving progress from one iterate to the next. How do we define an iterate?
Asynchronous SGD. All workers do in parallel:
1. Read the information in shared memory ($\hat{x}$).
2. Sample $i$ and compute $\nabla f_i(\hat{x})$.
3. Perform the SGD update on shared memory: $x = x - \gamma \nabla f_i(\hat{x})$.
45. Naming Scheme and Unbiasedness Assumption
“After write” labeling (Niu et al., 2011). Each time a worker has finished writing to shared memory, increment the iteration counter.
⇔ $\hat{x}_t$ = the $(t+1)$-th successful update to shared memory.
Unbiasedness assumption. Asynchronous SGD-like algorithms crucially rely on the unbiasedness property
$\mathbb{E}_i[\nabla f_i(x)] = \nabla f(x)$.
Issue: the naming scheme and the unbiasedness assumption are incompatible.
48. A Problematic Example
Problem: $\min_x \frac{1}{2}(f_1(x) + f_2(x))$ with 2 workers.
Suppose $f_1$ takes less time to compute than $f_2$. What is $\mathbb{E}_{i_0}[\nabla f_{i_0}(\hat{x}_0)]$?
The four equally likely assignments of samples to workers, and the gradient that gets written first in each case:

worker 1 draws   worker 2 draws   first gradient written
f_1              f_1              ∇f_1(x̂_0)
f_1              f_2              ∇f_1(x̂_0)
f_2              f_1              ∇f_1(x̂_0)
f_2              f_2              ∇f_2(x̂_0)

In all, $\mathbb{E}_{i_0}[\nabla f_{i_0}(\hat{x}_0)] = \frac{3}{4}\nabla f_1(\hat{x}_0) + \frac{1}{4}\nabla f_2(\hat{x}_0) \ne \nabla f(\hat{x}_0)$.
• This scheme does not satisfy the crucial unbiasedness condition.
• Can we fix it?
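A quick simulation (under the example's assumption that $f_1$ always finishes first and both workers draw independently and uniformly) confirms the 3/4 vs. 1/4 split:

```python
import numpy as np

rng = np.random.default_rng(0)
draws = rng.integers(1, 3, size=(1_000_000, 2))  # each worker samples i in {1, 2}
# f_1 is cheaper, so the first gradient written is for f_1 unless *both*
# workers happened to draw i = 2.
i0 = np.where((draws == 2).all(axis=1), 2, 1)
print((i0 == 1).mean())  # ~0.75: "after write" labeling makes i_0 biased
```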
55. A New Labeling Scheme
“After read” labeling scheme⁵. Each time a worker has finished reading from shared memory, increment the iteration counter.
⇔ $\hat{x}_t$ = the $(t+1)$-th successful read from shared memory.
With this scheme there is no dependency between $i_t$ and the cost of computing $\nabla f_{i_t}$.
Full analysis of Hogwild, asynchronous SVRG, and asynchronous SAGA in⁵.
⁵ Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien (2018). “Improved asynchronous parallel optimization analysis for stochastic incremental methods”. In: to appear in Journal of Machine Learning Research.
57. Convergence results – preliminaries
Some notation:
• $\Delta = \max_{j \in \{1,\ldots,p\}} |\{i : j \in \mathrm{supp}(\nabla f_i)\}| / n$, the maximum fraction of gradients that touch any given coordinate. We always have $1/n \le \Delta \le 1$.
• $\tau$ = number of updates between the time the vector of coefficients is read from memory and the time the update is finished.
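For generalized linear models, $\Delta$ can be read directly off the nonzero pattern of the data matrix; a small sketch (illustrative data) follows:

```python
import numpy as np
from scipy import sparse

def delta(A):
    """Delta = max_j |{i : j in supp(grad f_i)}| / n, using the nonzero
    pattern of the data matrix A as the gradient supports."""
    counts = np.asarray((A != 0).sum(axis=0)).ravel()  # rows touching coordinate j
    return counts.max() / A.shape[0]

A = sparse.random(10_000, 500, density=1e-3, format="csr", random_state=0)
print(delta(A))  # small for uniformly sparse data (mean column density ~ 1e-3)
```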
58. A rigorous analysis of Hogwild (Niu et al., 2011)
• Inconsistent reads.
• Unlike (Niu et al., 2011), allows for inconsistent writes.
• Unlike (Niu et al., 2011; Mania et al., 2017), no global bound on the gradient.
Main result for Hogwild (hand-waving). Let $f$ be $\mu$-strongly convex and $L$-smooth, and assume (for simplicity) $\sqrt{\Delta} \le \mu/L$. Then Hogwild converges at the same rate as SGD with step size $\gamma = a/L$, with
$a \le \min\Big\{\dfrac{1}{5(1 + 2\tau\sqrt{\Delta})},\ \dfrac{L}{\mu\Delta}\Big\}$.
⇒ theoretical linear speedup.
59. Main result for ASAGA
Main result for ASAGA (hand-waving). Let $f$ be $\mu$-strongly convex and $L$-smooth, and assume (for simplicity) $\sqrt{\Delta} \le \mu/L$. Then ASAGA converges at the same rate as SAGA with step size $\gamma = a/L$, with
$a \le \dfrac{1}{32(1 + \tau\sqrt{\Delta})}$.
⇒ theoretical linear speedup, with a step size independent of $\mu$.
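To get a feel for the constants (illustrative numbers, not from the talk): with sparsity measure $\Delta = 10^{-4}$ and delay bound $\tau = 20$, we get $\tau\sqrt{\Delta} = 0.2$, so the condition reads $a \le \frac{1}{32(1 + 0.2)} \approx \frac{1}{38}$, i.e., a step size of roughly $\frac{1}{38L}$. This is within a constant factor of the sequential SAGA step size, which is why the rate, and hence the linear speedup, is preserved.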
60. Perspectives
• Better scalability ⇔ communication efficiency.
• Tighter analysis with better constants / step size independent of $\Delta$.
• Large gap between theory and practice.
• Interplay with generalization and momentum.
Thanks for your attention!
61. References
Arrow, Kenneth Joseph and Leonid Hurwicz (1958). Decentralization and computation in resource allocation. Stanford University, Department of Economics.
Bertsekas, Dimitri P. and John N. Tsitsiklis (1989). Parallel and Distributed Computation: Numerical Methods. Athena Scientific.
Cauchy, Augustin (1847). “Méthode générale pour la résolution des systèmes d'équations simultanées”. In: Comp. Rend. Sci. Paris.
De Sa, Christopher M et al. (2015). “Taming the wild: A unified analysis of Hogwild-style algorithms”. In: Advances in Neural Information Processing Systems.
Dean, Jeffrey et al. (2012). “Large scale distributed deep networks”. In: Advances in Neural Information Processing Systems.
Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien (2014). “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives”. In: Advances in Neural Information Processing Systems.
Duchi, John C, Sorathan Chaturapruek, and Christopher Ré (2015). “Asynchronous stochastic convex optimization”. In: arXiv preprint arXiv:1508.00882.
Leblond, Rémi, Fabian Pedregosa, and Simon Lacoste-Julien (2017). “ASAGA: Asynchronous Parallel SAGA”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017).
Leblond, Rémi, Fabian Pedregosa, and Simon Lacoste-Julien (2018). “Improved asynchronous parallel optimization analysis for stochastic incremental methods”. In: to appear in Journal of Machine Learning Research.
Lian, Xiangru et al. (2015). “Asynchronous parallel stochastic gradient for nonconvex optimization”. In: Advances in Neural Information Processing Systems.
Mania, Horia et al. (2017). “Perturbed iterate analysis for asynchronous stochastic optimization”. In: SIAM Journal on Optimization.
Niu, Feng et al. (2011). “Hogwild: A lock-free approach to parallelizing stochastic gradient descent”. In: Advances in Neural Information Processing Systems.
Pedregosa, Fabian, Rémi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural Information Processing Systems 30 (NIPS).
Reddi, Sashank J et al. (2015). “On variance reduction in stochastic gradient descent and its asynchronous variants”. In: Advances in Neural Information Processing Systems.
Robbins, Herbert and Sutton Monro (1951). “A Stochastic Approximation Method”. In: Ann. Math. Statist.
Tsitsiklis, John, Dimitri Bertsekas, and Michael Athans (1986). “Distributed asynchronous deterministic and stochastic gradient optimization algorithms”. In: IEEE Transactions on Automatic Control.
64. Supervised Machine Learning
Data: $n$ observations $(a_i, b_i) \in \mathbb{R}^p \times \mathbb{R}$.
Prediction function: $h(a, x) \in \mathbb{R}$.
Motivating examples:
• Linear prediction: $h(a, x) = x^T a$.
• Neural networks: $h(a, x) = x_m^T\, \sigma\big(x_{m-1}\, \sigma(\cdots x_2^T\, \sigma(x_1^T a))\big)$.
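For concreteness, a small sketch of the two prediction functions (the choice of ReLU for $\sigma$ and the layer shapes are illustrative):

```python
import numpy as np

def h_linear(a, x):
    """Linear prediction h(a, x) = x^T a."""
    return x @ a

def h_mlp(a, weights):
    """Neural-network prediction h(a, x) = x_m^T s(x_{m-1} s(... s(x_1^T a))),
    with s = ReLU chosen here for illustration."""
    z = a
    for W in weights[:-1]:
        z = np.maximum(W @ z, 0.0)   # sigma applied elementwise
    return weights[-1] @ z

rng = np.random.default_rng(0)
a = rng.standard_normal(5)
weights = [rng.standard_normal((4, 5)),  # x_1
           rng.standard_normal((3, 4)),  # x_2
           rng.standard_normal(3)]       # x_m (output layer)
print(h_linear(a, rng.standard_normal(5)), h_mlp(a, weights))
```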
65. Sparse Proximal SAGA
For step size $\gamma = \frac{1}{5L}$, with $f$ having $L$-Lipschitz gradient and being $\mu$-strongly convex ($\mu > 0$), Sparse Proximal SAGA converges geometrically in expectation. At iteration $t$ we have
$\mathbb{E}\,\|x_t - x^*\|^2 \le \big(1 - \tfrac{1}{5}\min\{\tfrac{1}{n}, \tfrac{1}{\kappa}\}\big)^t\, C_0$,
with $C_0 = \|x_0 - x^*\|^2 + \frac{1}{5L^2}\sum_{i=1}^{n} \|\alpha_i^0 - \nabla f_i(x^*)\|^2$ and $\kappa = L/\mu$ (the condition number).
Implications
• Same convergence rate as SAGA, with cheaper updates.
• In the “big data” regime ($n \ge \kappa$): rate in $O(1/n)$.
• In the “ill-conditioned” regime ($n \le \kappa$): rate in $O(1/\kappa)$.