Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization

Fabian Pedregosa Rémi Leblond Simon Lacoste–Julien

Motivation
• Since 2005, the speed of processors has stagnated.
• The number of cores has increased.
• Development of parallel asynchronous variants of stochastic gradient algorithms
SGD → Hogwild (Niu et al. 2011).
SVRG → Kromagnon (Reddi et al. 2015; Mania et al. 2017).
SAGA → ASAGA (Leblond, Pedregosa, and Lacoste-Julien 2017).

Composite objective
• These methods assume that the objective function is smooth, so they cannot be applied to Lasso, Group Lasso, box constraints, etc.
Objective: minimize a composite objective function:
$$\min_{x} \; f(x) + h(x), \quad \text{with } f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x),$$
where each $f_i$ is smooth and $h$ is a block-separable (i.e., $h(x) = \sum_B h([x]_B)$) convex function for which we have access to its proximal operator.
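
To make "access to its proximal operator" concrete, here is a minimal Python sketch of the proximal operator for the three penalties named on the slide (Lasso, Group Lasso, box constraints). The function names, the regularization weight `lam`, and the use of plain lists are illustrative assumptions, not part of the talk.

```python
import math


def prox_l1(x, step, lam):
    """prox of step * lam * ||.||_1: coordinate-wise soft thresholding."""
    t = step * lam
    return [v - t if v > t else v + t if v < -t else 0.0 for v in x]


def prox_group_lasso(x, blocks, step, lam):
    """prox of step * lam * sum_B ||[x]_B||_2: each block is shrunk toward zero."""
    t = step * lam
    out = list(x)
    for block in blocks:  # block = list of coordinate indices
        norm = math.sqrt(sum(x[j] ** 2 for j in block))
        scale = max(1.0 - t / norm, 0.0) if norm > 0 else 0.0
        for j in block:
            out[j] = scale * x[j]
    return out


def prox_box(x, lower, upper):
    """prox of the indicator of [lower, upper]^p: projection onto the box."""
    return [min(max(v, lower), upper) for v in x]
```

In all three cases the operator acts block by block, which is what block-separability buys: an update that touches only a few blocks needs to evaluate the proximal operator only on those blocks.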

Sparse Proximal SAGA
Contribution 1: Sparse Proximal SAGA. A variant of SAGA (Defazio, Bach, and Lacoste-Julien 2014) that is particularly efficient when the gradients $\nabla f_i$ are sparse.
Like SAGA, it relies on an unbiased gradient estimate and a proximal step:
$$v_i = \nabla f_i(x) - \alpha_i + D_i \bar{\alpha}; \qquad x^+ = \mathrm{prox}_{\gamma \varphi_i}(x - \gamma v_i); \qquad \alpha_i^+ = \nabla f_i(x)$$
Unlike SAGA, $D_i$ and $\varphi_i$ are designed to give sparse updates while satisfying unbiasedness conditions.
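
One way to see how sparse updates and unbiasedness can coexist is the following sketch. It relies on definitions that do not appear on the slides and are stated here as assumptions: $P_i$ is the projection onto the blocks that intersect the support of $\nabla f_i$, $D_i = P_i (\mathbb{E}_j P_j)^{-1}$ (which requires every block to appear in at least one support), $\bar{\alpha} = \frac{1}{n} \sum_j \alpha_j$, and $\varphi_i(x) = \sum_{B \in T_i} d_B \, h([x]_B)$, where $T_i$ is the set of blocks in the support of $\nabla f_i$, $d_B = n / n_B$, and $n_B$ is the number of indices $j$ with $B \in T_j$. Taking the expectation over a uniformly sampled index $i$:

```latex
\begin{align*}
\mathbb{E}_i[D_i] &= \Big(\tfrac{1}{n} \sum_{i=1}^{n} P_i\Big) \big(\mathbb{E}_j P_j\big)^{-1} = I, \\
\mathbb{E}_i[v_i] &= \nabla f(x) - \bar{\alpha} + \mathbb{E}_i[D_i] \, \bar{\alpha} = \nabla f(x), \\
\mathbb{E}_i[\varphi_i(x)] &= \sum_{B} \frac{n_B}{n} \, d_B \, h([x]_B) = h(x).
\end{align*}
```

Under these assumptions, $D_i \bar{\alpha}$ and $\varphi_i$ only involve the blocks in $T_i$, so each iteration reads and writes only those blocks, yet on average the gradient estimate and the penalty match $\nabla f$ and $h$.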
Convergence: same linear convergence rate as SAGA, with cheaper updates in the presence of sparsity.
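
The update rule of this section can be sketched in a few lines of Python. This is a minimal illustration under assumptions that are not on the slides: a squared loss $f_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2$, the penalty $h = \lambda \|\cdot\|_1$ with one coordinate per block, $D_i$ acting as multiplication by $d_j = n / n_j$ on the support of $a_i$ (with $n_j$ the number of samples whose support contains coordinate $j$), and $\varphi_i$ the $\ell_1$ penalty reweighted by $d_j$ on that support. The names `sparse_proximal_saga`, `rows` and `targets`, the dict-based sparse format, and the hand-picked step size are all hypothetical.

```python
import random


def soft_threshold(z, t):
    """Proximal operator of t * |.| at the scalar z."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def sparse_proximal_saga(rows, targets, dim, lam, step, n_epochs=100, seed=0):
    """Sketch: minimize (1/n) sum_i 0.5 * (a_i . x - b_i)^2 + lam * ||x||_1.

    rows[i] is a dict {coordinate: value} with the nonzeros of a_i
    (an assumed sparse format), targets[i] is b_i.
    """
    rng = random.Random(seed)
    n = len(rows)

    # d[j] = n / n_j, with n_j the number of samples whose support contains j.
    counts = [0] * dim
    for row in rows:
        for j in row:
            counts[j] += 1
    d = [n / c if c else 0.0 for c in counts]

    x = [0.0] * dim
    # For this loss grad f_i(x) = r_i * a_i, so alpha_i is stored as the scalar r_i.
    memory = [0.0] * n
    alpha_bar = [0.0] * dim  # running average of the alpha_i

    for _ in range(n_epochs * n):
        i = rng.randrange(n)
        row = rows[i]
        residual = sum(val * x[j] for j, val in row.items()) - targets[i]
        for j, val in row.items():  # only the support of a_i is touched
            grad_diff = (residual - memory[i]) * val
            # v_i = grad f_i(x) - alpha_i + D_i alpha_bar, on coordinate j
            v_j = grad_diff + d[j] * alpha_bar[j]
            # x+ = prox_{step * phi_i}(x - step * v_i), on coordinate j
            x[j] = soft_threshold(x[j] - step * v_j, step * lam * d[j])
            # keep alpha_bar equal to the average of the updated memory
            alpha_bar[j] += grad_diff / n
        memory[i] = residual  # alpha_i+ = grad f_i(x)
    return x


# Toy usage: two samples with disjoint supports and a hand-picked step size.
x_hat = sparse_proximal_saga([{0: 1.0}, {1: 1.0}], [3.0, 0.1],
                             dim=2, lam=0.25, step=1.0 / 3.0)
```

Each iteration costs time proportional to the number of nonzeros of the sampled $a_i$ rather than to the full dimension, which is where the cheaper updates come from in this sketch.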

Proximal Asynchronous SAGA (ProxASAGA)
Contribution 2: Proximal Asynchronous SAGA (ProxASAGA). Each core runs Sparse Proximal SAGA asynchronously without locks and updates $x$, $\alpha$ and $\bar{\alpha}$ in shared memory.
• All read/write operations to shared memory are inconsistent, i.e., there are no performance-destroying vector-level locks while reading or writing.
Convergence: under sparsity assumptions, ProxASAGA converges at the same rate as the sequential algorithm, which implies a theoretical linear speedup with respect to the number of cores.

Empirical results
ProxASAGA vs competing methods on 3 large-scale datasets,
ℓ1-regularized logistic regression
Dataset     n            p           density     L      ∆
KDD 2010    19,264,097   1,163,024   10^-6       28.12  0.15
KDD 2012    149,639,105  54,686,452  2 × 10^-7   1.25   0.85
Criteo      45,840,617   1,000,000   4 × 10^-5   1.25   0.89
[Figure: objective minus optimum (log scale) versus time (in minutes) on the KDD10, KDD12 and Criteo datasets, comparing ProxASAGA, AsySPCD and FISTA, each run on 1 core and on 10 cores.]

Empirical results - Speedup
$$\text{Speedup} = \frac{\text{Time to } 10^{-10} \text{ suboptimality on one core}}{\text{Time to same suboptimality on } k \text{ cores}}$$
[Figure: time speedup versus number of cores (2 to 20) on the KDD10, KDD12 and Criteo datasets, comparing ProxASAGA, AsySPCD and FISTA against the ideal speedup.]
• ProxASAGA achieves speedups between 6x and 12x on a 20-core architecture.
• As predicted by theory, there is a high correlation between
degree of sparsity and speedup.
Thanks for your attention, see you at poster #159.

References
Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien (2014). “SAGA: A fast incremental gradient
method with support for non-strongly convex composite objectives”. In: Advances in Neural
Information Processing Systems.
Leblond, Rémi, Fabian Pedregosa, and Simon Lacoste-Julien (2017). “ASAGA: asynchronous parallel
SAGA”. In: Proceedings of the 20th International Conference on Artificial Intelligence and
Statistics (AISTATS 2017).
Mania, Horia et al. (2017). “Perturbed iterate analysis for asynchronous stochastic optimization”. In:
SIAM Journal on Optimization.
Niu, Feng et al. (2011). “Hogwild: A lock-free approach to parallelizing stochastic gradient descent”.
In: Advances in Neural Information Processing Systems.
Pedregosa, Fabian, Rémi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth
Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural
Information Processing Systems 30.
Reddi, Sashank J et al. (2015). “On variance reduction in stochastic gradient descent and its
asynchronous variants”. In: Advances in Neural Information Processing Systems.