CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Systems: An Overview and Some Theoretical Results - Zhengyuan Zhu, Feb 13, 2018
Asynchronous parallel algorithms are developed to solve massive optimization problems in distributed data systems; they can be run in parallel on multiple nodes with little or no synchronization. Recently they have been successfully implemented to solve a range of difficult problems in practice. However, the existing theories are mostly based on fairly restrictive assumptions on the delays, and cannot explain the convergence and speedup properties of such algorithms. In this talk we will give an overview of distributed optimization and discuss some new theoretical results on the convergence of the asynchronous parallel stochastic gradient algorithm with unbounded delays. Simulated and real data will be used to demonstrate the practical implications of these theoretical results.
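As a toy illustration of the delay phenomenon discussed above, the sketch below (the quadratic objective, step size, and delay model are my own illustrative choices, not from the talk) runs gradient descent where each update is computed at a randomly stale iterate:

```python
import random

# Toy sequential simulation of asynchronous SGD with random gradient delays
# on f(x) = x^2 / 2, so grad f(x) = x. With a small enough step size the
# iterates still converge even though every gradient may be stale.
random.seed(0)
x, step = 10.0, 0.05
history = []
for k in range(500):
    history.append(x)                         # iterates seen so far
    delay = random.randint(0, min(k, 10))     # a worker read a stale iterate
    stale_x = history[k - delay]
    x = x - step * stale_x                    # update with the delayed gradient
print(f"final |x| = {abs(x):.2e}")            # converges despite staleness
```

Increasing the maximum delay or the step size eventually destabilizes the recursion, which is the regime the delay assumptions in the convergence theory are meant to control.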
An improved SPFA algorithm for the single source shortest path problem - International Journal of Managing Information Technology (IJMIT)
We present an improved SPFA algorithm for the single source shortest path problem. For a random graph, the empirical average time complexity is O(|E|), where |E| is the number of edges of the input network. SPFA maintains a queue of candidate vertices and adds a vertex to the queue only if that vertex is relaxed. In the improved SPFA, the MinPoP principle is employed to improve the quality of the queue. We theoretically analyse the advantage of this new algorithm and experimentally demonstrate that it is efficient.
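The queue-based relaxation loop that SPFA performs can be sketched as follows; this is the baseline algorithm only, and the paper's MinPoP queue-ordering improvement is not reproduced here:

```python
from collections import deque

# Baseline SPFA (queue-based Bellman-Ford). A vertex enters the queue only
# when one of its incoming edges has just been relaxed.
def spfa(graph, source):
    """graph: dict vertex -> list of (neighbor, weight); returns distances."""
    dist = {v: float("inf") for v in graph}
    in_queue = {v: False for v in graph}
    dist[source] = 0
    queue = deque([source])
    in_queue[source] = True
    while queue:
        u = queue.popleft()
        in_queue[u] = False
        for v, w in graph[u]:
            if dist[u] + w < dist[v]:     # edge (u, v) can be relaxed
                dist[v] = dist[u] + w
                if not in_queue[v]:       # enqueue v only on relaxation
                    queue.append(v)
                    in_queue[v] = True
    return dist

g = {"a": [("b", 2), ("c", 5)], "b": [("c", 1)], "c": []}
print(spfa(g, "a"))  # {'a': 0, 'b': 2, 'c': 3}
```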
This is a presentation that I gave to my research group. It is about probabilistic extensions to Principal Components Analysis, as proposed by Tipping and Bishop.
The variational Gaussian process (VGP) is a Bayesian nonparametric model which adapts its shape to match complex posterior distributions. The VGP generates approximate posterior samples by generating latent inputs and warping them through random non-linear mappings; the distribution over random mappings is learned during inference, enabling the transformed outputs to adapt to varying complexity.
The fuzzy clustering algorithm cannot obtain a good clustering effect when the sample characteristics are not obvious, and it requires the number of clusters to be determined in advance. For this reason, this paper proposes an adaptive fuzzy kernel clustering algorithm. The algorithm first uses an adaptive cluster-number function to calculate the optimal number of clusters; the samples in the input space are then mapped to a high-dimensional feature space using a Gaussian kernel and clustered in that feature space. Matlab simulation results confirm that the algorithm's performance is greatly improved over classical clustering algorithms, with faster convergence and more accurate clustering results.
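A minimal sketch of kernelized fuzzy c-means with a Gaussian kernel is given below. It assumes prototypes kept in the input space and a hand-fixed cluster number on toy data; the paper's adaptive cluster-number function is not reproduced:

```python
import numpy as np

# Kernelized fuzzy c-means sketch: feature-space distances are computed
# through the Gaussian kernel, so ||phi(x)-phi(v)||^2 = 2 * (1 - K(x, v)).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(3.0, 0.3, (30, 2))])
c, m, sigma = 2, 2.0, 1.0
V = X[[0, 30]].copy()                       # one initial prototype per blob

def gauss_kernel(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

for _ in range(30):
    K = gauss_kernel(X, V)                        # n x c kernel evaluations
    d2f = np.clip(2.0 * (1.0 - K), 1e-12, None)   # feature-space distance^2
    inv = d2f ** (-1.0 / (m - 1.0))
    U = inv / inv.sum(axis=1, keepdims=True)      # fuzzy membership update
    W = (U ** m) * K
    V = (W.T @ X) / W.sum(axis=0)[:, None]        # prototype update

labels = U.argmax(axis=1)
print(labels[:3], labels[-3:])              # the two blobs get distinct labels
```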
In this talk I will show that standard graph features, such as the degree distribution of the transaction graph, may not be sufficient to capture network dynamics and their potential impact on fluctuations of the Bitcoin price. In contrast, new topological features of the graph, computed using the tools of persistent homology, are found to exhibit high utility for predicting Bitcoin price dynamics. Using the proposed persistent-homology-based techniques, I will present the ChainNet platform, a new, elegant, easily extendable and computationally light approach for graph representation learning on Blockchain.
This talk builds on recent empirical work addressing the extent to which the transaction graph serves as an early-warning indicator for large financial losses. By identifying certain sub-graphs ('chainlets') with a causal effect on price movements, we demonstrate the impact of extreme transaction graph activity on the intraday volatility of the Bitcoin price series. In particular, we infer the loss distributions conditional on extreme chainlet activity. Armed with this empirical representation, we propose a modeling approach to explore conditions under which the market is stabilized by transaction-graph-aware agents.
We approach the screening problem - i.e. detecting which inputs of a computer model significantly impact the output - from a formal Bayesian model selection point of view. That is, we place a Gaussian process prior on the computer model and consider the $2^p$ models that result from assuming that each of the subsets of the $p$ inputs affects the response. The goal is to obtain the posterior probabilities of each of these models. In this talk, we focus on the specification of objective priors on the model-specific parameters and on convenient ways to compute the associated marginal likelihoods. These two problems, normally seen as unrelated, have challenging connections, since the priors proposed in the literature are specifically designed to have posterior modes on the boundary of the parameter space, hence precluding the application of approximate integration techniques based on e.g. Laplace approximations. We explore several ways of circumventing this difficulty, comparing different methodologies on synthetic examples taken from the literature.
Authors: Gonzalo Garcia-Donato (Universidad de Castilla-La Mancha) and Rui Paulo (Universidade de Lisboa)
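As a hedged illustration of scoring the $2^p$ subset models, the sketch below ranks subsets by the closed-form GP log marginal likelihood, using toy data, fixed kernel hyperparameters, and a uniform model prior; none of this reflects the objective priors or the marginal-likelihood computations actually discussed in the talk:

```python
import numpy as np
from itertools import chain, combinations

# Score each subset of inputs by the GP log marginal likelihood of a fixed
# RBF kernel restricted to those inputs, then normalize to posterior model
# probabilities under a uniform prior over the 2^p models.
rng = np.random.default_rng(1)
n, p = 40, 3
X = rng.uniform(-1.0, 1.0, (n, p))
y = np.sin(3.0 * X[:, 0]) + 0.05 * rng.normal(size=n)   # only input 0 matters

def log_marginal(active):
    Xa = X[:, list(active)]
    d2 = ((Xa[:, None, :] - Xa[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / 0.5) + 0.05 ** 2 * np.eye(n)       # RBF Gram + noise
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * n * np.log(2.0 * np.pi))

subsets = list(chain.from_iterable(combinations(range(p), r)
                                   for r in range(p + 1)))
scores = np.array([log_marginal(s) for s in subsets])
post = np.exp(scores - scores.max())
post /= post.sum()                       # posterior model probabilities
best = subsets[int(post.argmax())]
print(best)                              # a subset containing input 0
```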
2019 GDRR: Blockchain Data Analytics - Modeling Cryptocurrency Markets with T...
Design and Implementation of Variable Radius Sphere Decoding Algorithm - csandit
The Sphere Decoding (SD) algorithm is a decoding algorithm based on the Zero Forcing (ZF) algorithm in the real number field. The classical SD algorithm is famous for its outstanding Bit Error Rate (BER) performance and decoding strategy. The algorithm obtains its maximum likelihood solution by recursively shrinking the search radius. However, the method of gradually shrinking the search radius is too complicated to use in a ground communication system. This paper proposes a Variable Radius Sphere Decoding (VR-SD) algorithm based on the ZF algorithm in order to simplify the complex search steps. We demonstrate the advantages of the VR-SD algorithm both through the derivation of mathematical formulas and through simulation of the BER performance of the SD and VR-SD algorithms.
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS - csandit
The ability to automatically mine and extract useful information from large datasets has been a common concern for organizations over the last few decades. Data on the internet is growing steadily, and consequently the capacity to collect and store very large data is increasing significantly. Existing clustering algorithms are not always efficient and accurate in solving clustering problems for large datasets, and the development of accurate and fast data classification algorithms for very large scale datasets remains a challenge. In this paper, various algorithms and techniques, especially an approach using a non-smooth optimization formulation of the clustering problem, are proposed for solving the minimum sum-of-squares clustering problem in very large datasets. This research also develops an accurate and real-time L2-DC algorithm, based on the incremental approach, to solve the minimum sum-of-squares clustering problem.
In this work, we propose to apply trust region optimization to deep reinforcement
learning using a recently proposed Kronecker-factored approximation to
the curvature. We extend the framework of natural policy gradient and propose
to optimize both the actor and the critic using Kronecker-factored approximate
curvature (K-FAC) with trust region; hence we call our method Actor Critic using
Kronecker-Factored Trust Region (ACKTR). To the best of our knowledge, this
is the first scalable trust region natural gradient method for actor-critic methods.
It is also a method that learns non-trivial tasks in continuous control as well as
discrete control policies directly from raw pixel inputs. We tested our approach
across discrete domains in Atari games as well as continuous domains in the MuJoCo
environment. With the proposed methods, we are able to achieve higher
rewards and a 2- to 3-fold improvement in sample efficiency on average, compared
to previous state-of-the-art on-policy actor-critic methods. Code is available at
https://github.com/openai/baselines.
Asynchronous Stochastic Optimization, New Analysis and Algorithms - Fabian Pedregosa
As datasets continue to increase in size and multi-core computer architectures are developed, asynchronous parallel optimization algorithms become more and more essential to the field of Machine Learning. In this talk I will describe two of our recent contributions to this topic. First, we highlight an important technical issue present in a large fraction of the recent convergence proofs for asynchronous parallel optimization algorithms and propose a new framework that resolves it [1]. Second, we propose a novel asynchronous variant of SAGA, a stochastic method that combines the low cost per iteration of SGD with the fast convergence rates of gradient descent [2].
[1] Leblond, R., Pedregosa, F., & Lacoste-Julien, S. (2018). Improved asynchronous parallel optimization analysis for stochastic incremental methods. arXiv:1801.03749, https://arxiv.org/pdf/1801.03749.pdf
[2] Pedregosa, F., Leblond, R., & Lacoste-Julien, S. (2017). Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization. In Advances in Neural Information Processing Systems, http://papers.nips.cc/paper/6611-breaking-the-nonsmooth-barrier-a-scalable-parallel-method-for-composite-optimization.pdf
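A sequential sketch of the SAGA update referenced in [2]: a table stores the last gradient seen for each sample, and each step combines the fresh gradient with a variance-reduction correction from the table. The toy least-squares problem, sizes, and step size are illustrative; the papers' contribution is the asynchronous, lock-free execution of these updates, which is not shown here.

```python
import numpy as np

# Sequential SAGA on a least-squares problem: per-sample gradient table plus
# a running average, giving SGD-cost iterations with reduced variance.
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.01 * rng.normal(size=n)   # noisy linear data

x = np.zeros(d)
grads = np.array([a * (a @ x - bi) for a, bi in zip(A, b)])  # gradient table
avg = grads.mean(axis=0)
step = 0.01
for _ in range(10000):
    i = rng.integers(n)
    g_new = A[i] * (A[i] @ x - b[i])        # fresh per-sample gradient
    x -= step * (g_new - grads[i] + avg)    # SAGA update direction
    avg += (g_new - grads[i]) / n           # keep the table average in sync
    grads[i] = g_new
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))  # small relative residual
```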
We consider the problem of model estimation in episodic Block MDPs. In these MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states. We are interested in estimating the latent state decoding function (the mapping from the observations to latent states) based on data generated under a fixed behavior policy. We derive an information-theoretical lower bound on the error rate for estimating this function and present an algorithm approaching this fundamental limit. In turn, our algorithm also provides estimates of all the components of the MDP.
We apply our results to the problem of learning near-optimal policies in the reward-free setting. Based on our efficient model estimation algorithm, we show that we can infer a policy converging (as the number of collected samples grows large) to the optimal policy at the best possible asymptotic rate. Our analysis provides necessary and sufficient conditions under which exploiting the block structure yields improvements in the sample complexity for identifying near-optimal policies. When these conditions are met, the sample complexity in the minimax reward-free setting is improved by a multiplicative factor $n$, where $n$ is the number of contexts.
After applying the stochastic Galerkin method to solve a stochastic PDE and solving the resulting large linear system, we obtain a stochastic solution (a random field) represented in the Karhunen-Loève and PCE bases. No sampling error is involved, only algebraic truncation error. We would now like to escape the classical MCMC path to computing the posterior, and we develop a Bayesian update formula for the KLE-PCE coefficients.
USING LEARNING AUTOMATA AND GENETIC ALGORITHMS TO IMPROVE THE QUALITY OF SERV... - IJCSEA Journal
A hybrid learning automata-genetic algorithm (HLGA) is proposed to solve the QoS routing optimization problem of next generation networks, an NP-complete problem. The algorithm combines the advantages of the Learning Automata (LA) algorithm and the Genetic Algorithm (GA). It first uses the good global search capability of LA to generate the initial population needed by GA, and then uses GA to improve the Quality of Service (QoS) and to acquire the optimized tree through new crossover and mutation operators. In the proposed algorithm, the connectivity matrix of edges is used for genotype representation. Some novel heuristics are also proposed for mutation, crossover, and the creation of random individuals. We evaluate the performance and efficiency of the proposed HLGA-based algorithm in comparison with other existing heuristic and GA-based algorithms by simulation. Simulation results demonstrate that the proposed algorithm not only has fast calculating speed and high accuracy but also improves the efficiency of QoS routing in Next Generation Networks, outperforming the previous algorithms in the literature.
Geoid height determination is one of the major problems of geodesy, because the use of satellite techniques in geodesy is increasing. Geoid heights can be determined using different methods according to the available data. Soft computing methods such as fuzzy logic and neural networks have become so popular that they are used to solve many engineering problems. Fuzzy logic theory and later developments in uncertainty assessment have enabled us to develop more precise models for our requirements. In this study, how to construct the best fuzzy model is examined. For this purpose, three different data sets were taken, and two different kinds of fuzzy models (two inputs, one output; and three inputs, one output) were formed for the calculation of geoid heights in Istanbul (Turkey). The fuzzy model results were compared with geoid heights obtained by GPS/levelling methods, and the fuzzy approximation models were tested on the test points.
Recently, the machine learning community has expressed strong interest in applying latent variable modeling strategies to causal inference problems with unobserved confounding. Here, I discuss one of the big debates that occurred over the past year, and how we can move forward. I will focus specifically on the failure of point identification in this setting, and discuss how this can be used to design flexible sensitivity analyses that cleanly separate identified and unidentified components of the causal model.
I will discuss paradigmatic statistical models of inference and learning from high dimensional data, such as sparse PCA and the perceptron neural network, in the sub-linear sparsity regime. In this limit the underlying hidden signal, i.e., the low-rank matrix in PCA or the neural network weights, has a number of non-zero components that scales sub-linearly with the total dimension of the vector. I will provide explicit low-dimensional variational formulas for the asymptotic mutual information between the signal and the data in suitable sparse limits. In the setting of support recovery these formulas imply sharp 0-1 phase transitions for the asymptotic minimum mean-square-error (or generalization error in the neural network setting). A similar phase transition was analyzed recently in the context of sparse high-dimensional linear regression by Reeves et al.
Many different measurement techniques are used to record neural activity in the brains of different organisms, including fMRI, EEG, MEG, lightsheet microscopy and direct recordings with electrodes. Each of these measurement modes has its advantages and disadvantages concerning the resolution of the data in space and time, the directness of measurement of the neural activity, and which organisms it can be applied to. For some of these modes and for some organisms, significant amounts of data are now available in large standardized open-source datasets. I will report on our efforts to apply causal discovery algorithms to, among others, fMRI data from the Human Connectome Project, and to lightsheet microscopy data from zebrafish larvae. In particular, I will focus on the challenges we have faced both in terms of the nature of the data and the computational features of the discovery algorithms, as well as the modeling of experimental interventions.
Bayesian Additive Regression Trees (BART) has been shown to be an effective framework for modeling nonlinear regression functions, with strong predictive performance in a variety of contexts. The BART prior over a regression function is defined by independent prior distributions on tree structure and leaf or end-node parameters. In observational data settings, Bayesian Causal Forests (BCF) has successfully adapted BART for estimating heterogeneous treatment effects, particularly in cases where standard methods yield biased estimates due to strong confounding.
We introduce BART with Targeted Smoothing, an extension which induces smoothness over a single covariate by replacing independent Gaussian leaf priors with smooth functions. We then introduce a new version of the Bayesian Causal Forest prior, which incorporates targeted smoothing for modeling heterogeneous treatment effects which vary smoothly over a target covariate. We demonstrate the utility of this approach by applying our model to a timely women's health and policy problem: comparing two dosing regimens for an early medical abortion protocol, where the outcome of interest is the probability of a successful early medical abortion procedure at varying gestational ages, conditional on patient covariates. We discuss the benefits of this approach in other women’s health and obstetrics modeling problems where gestational age is a typical covariate.
Difference-in-differences is a widely used evaluation strategy that draws causal inference from observational panel data. Its causal identification relies on the assumption of parallel trends, which is scale-dependent and may be questionable in some applications. A common alternative is a regression model that adjusts for the lagged dependent variable, which rests on the assumption of ignorability conditional on past outcomes. In the context of linear models, Angrist and Pischke (2009) show that the difference-in-differences and lagged-dependent-variable regression estimates have a bracketing relationship. Namely, for a true positive effect, if ignorability is correct, then mistakenly assuming parallel trends will overestimate the effect; in contrast, if the parallel trends assumption is correct, then mistakenly assuming ignorability will underestimate the effect. We show that the same bracketing relationship holds in general nonparametric (model-free) settings. We also extend the result to semiparametric estimation based on inverse probability weighting.
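The bracketing relationship can be seen in a small simulation. The data-generating process below (my own toy choice) satisfies ignorability given the lagged outcome, with treated units starting lower, so the lagged-dependent-variable (LDV) regression recovers the true effect while mistakenly assuming parallel trends overestimates it:

```python
import numpy as np

# Toy simulation of the Angrist-Pischke bracketing: under ignorability given
# the lagged outcome, DID overshoots a true positive effect while LDV is
# consistent. Here E[DID] = tau + (rho - 1) * (pre-period mean gap) = 1.5.
rng = np.random.default_rng(0)
n, tau, rho = 20000, 1.0, 0.5
d = np.repeat([1.0, 0.0], n)                      # treatment indicator
y_pre = rng.normal(-1.0 * d, 1.0)                 # treated start 1 unit lower
y_post = rho * y_pre + tau * d + rng.normal(0.0, 1.0, 2 * n)

did = ((y_post[d == 1].mean() - y_pre[d == 1].mean())
       - (y_post[d == 0].mean() - y_pre[d == 0].mean()))

Z = np.column_stack([np.ones(2 * n), d, y_pre])   # intercept, treatment, lag
beta, *_ = np.linalg.lstsq(Z, y_post, rcond=None)
ldv = beta[1]
print(round(did, 2), round(ldv, 2))   # DID near 1.5 > tau; LDV near tau = 1.0
```

Reversing the sign of the pre-period gap (or making parallel trends hold exactly) flips which estimator is biased, which is the other half of the bracketing.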
We develop sensitivity analyses for weak nulls in matched observational studies while allowing unit-level treatment effects to vary. In contrast to randomized experiments and paired observational studies, we show for general matched designs that over a large class of test statistics, any valid sensitivity analysis for the weak null must be unnecessarily conservative if Fisher's sharp null of no treatment effect for any individual also holds. We present a sensitivity analysis valid for the weak null, and illustrate why it is conservative if the sharp null holds through connections to inverse probability weighted estimators. An alternative procedure is presented that is asymptotically sharp if treatment effects are constant, and is valid for the weak null under additional assumptions which may be deemed reasonable by practitioners. The methods may be applied to matched observational studies constructed using any optimal without-replacement matching algorithm, allowing practitioners to assess robustness to hidden bias while allowing for treatment effect heterogeneity.
The world of health care is full of policy interventions: a state expands eligibility rules for its Medicaid program, a medical society changes its recommendations for screening frequency, a hospital implements a new care coordination program. After a policy change, we often want to know, “Did it work?” This is a causal question; we want to know whether the policy CAUSED outcomes to change. One popular way of estimating causal effects of policy interventions is a difference-in-differences study. In this controlled pre-post design, we measure the change in outcomes of people who are exposed to the new policy, comparing average outcomes before and after the policy is implemented. We contrast that change to the change over the same time period in people who were not exposed to the new policy. The differential change in the treated group’s outcomes, compared to the change in the comparison group’s outcomes, may be interpreted as the causal effect of the policy. To do so, we must assume that the comparison group’s outcome change is a good proxy for the treated group’s (counterfactual) outcome change in the absence of the policy. This conceptual simplicity and wide applicability in policy settings makes difference-in-differences an appealing study design. However, the apparent simplicity belies a thicket of conceptual, causal, and statistical complexity. In this talk, I will introduce the fundamentals of difference-in-differences studies and discuss recent innovations including key assumptions and ways to assess their plausibility, estimation, inference, and robustness checks.
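The controlled pre-post arithmetic described above reduces, in the simplest two-group two-period case, to the following (the outcome means are made up for illustration):

```python
# Minimal 2x2 difference-in-differences with illustrative outcome means:
# under parallel trends, the differential change is attributed to the policy.
treated_pre, treated_post = 10.0, 14.0    # exposed group, before / after
control_pre, control_post = 8.0, 9.0      # comparison group, before / after

change_treated = treated_post - treated_pre     # 4.0
change_control = control_post - control_pre     # 1.0 (proxy counterfactual)
did_effect = change_treated - change_control
print(did_effect)  # 3.0
```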
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Systems: An Overview and Some Theoretical Results - Zhengyuan Zhu, Feb 13, 2018
1. Distributed Optimization: An Overview and Some Theoretical Results
Zhengyuan Zhu
Joint work with Xin Zhang and Jia Kevin Liu
Department of Statistics and
Center for Survey Statistics Methodology
Iowa State University
2/13/2018
2. Introduction
1 Introduction
2 Asynchronous Stochastic Gradient Descent
3 Convergence Analysis
4 Numerical Study
5 Conclusion
6 Preliminary Work
Zhang & Liu & Zhu Asyn-SGD 2 / 35
3. Introduction
Introduction
Connection to the remote sensing workshop:
My research interests: spatial statistics, spatial sampling design, survey statistics.
National Resources Inventory survey: remote sensing data help improve survey estimates of agricultural statistics and natural resources.
Massive spatial-temporal imputation (gap-filling): a computationally efficient functional approach, with applications to Landsat and MODIS data.
Massive imputation for hyperspectral satellite data (OCO-2), which are sparse in space-time.
Unmixing problems for the SMOS and OCO-2 data.
Original title: Asynchronous Stochastic Gradient Descent with Unbounded Delay on Nonconvex Problems
Actual title: Distributed Optimization: An Overview and Some Theoretical Results
4. Introduction
Distributed Computation
Problem: datasets are becoming extremely large (in both features and sample size), and they may be collected and/or stored in a distributed system.
With Moore's law coming to an end in a few years (2025?), we can no longer rely on hardware improvements alone.
Distributed computation studies how to divide a 'big' problem into several small parts, allocate these parts to many computers, and then combine the 'local' results to obtain the final result;
Bertsekas and Tsitsiklis (1999) provided a general framework for parallel and distributed computation.
Micro chip level: multi-core CPU/GPU
Macro data center level: networked cloud computing
6. Introduction
Some issues relevant to the theory of data systems
Centralized vs. local computation: local computation reduces data transfer costs and has fewer issues with data privacy and confidentiality.
Synchronous vs. asynchronous methods: synchronization can involve significant communication overhead, and server variability may lead to inefficiency; asynchronous methods may have convergence issues depending on the delay distribution and the algorithm.
Data homogeneity vs. heterogeneity:
Homogeneous: databases Ξ1, ..., Ξk are shared, i.i.d., or stationary. The objective function computed at each local machine is unbiased;
Heterogeneous: databases Ξ1, ..., Ξk are not i.i.d. or stationary, i.e., they could come from different sources or be collected with different methods. The objective function at each local machine may be biased;
Trade-offs in computation, communication, and inference precision.
7. Introduction
Distributed Optimization Algorithms
Some of the well-studied algorithms for distributed optimization:
Stochastic Gradient Descent (SGD): Bottou (1998, 2011), theory and application to large-scale machine learning; Recht et al. (2011), the asynchronous SGD algorithm HOGWILD!; Lian et al. (2015), convergence rate for nonconvex problems with bounded delay;
Alternating Direction Method of Multipliers (ADMM): Gabay and Mercier (1976), Boyd et al. (2010), Chang et al. (2015), Hong (2017);
Distributed quasi-Newton methods for faster rates of convergence: Eisen et al. (2017) use gradients to estimate the curvature; Mansoori (2017) used a matrix splitting technique to compute the Newton direction in a distributed way.
8. Introduction
Applications in ML
Distributed optimization, and in particular distributed SGD, has become a very popular way to speed up machine learning algorithms. Some successful examples:
In ?, a parallel system is used to train SVMs, which saves computational time and avoids running out of memory;
? designed a parallelizable method, called CCD++, for matrix factorization in large-scale recommender systems;
Distributed deep learning: ? proposed two distributed algorithms, Downpour SGD and Sandblaster, to train DNNs; Abadi et al. (2016) introduced TensorFlow for large-scale machine learning.
10. Asynchronous Stochastic Gradient Descent
Overview of Stochastic Gradient Descent (SGD)
Our work focuses on the feasibility of a distributed asynchronous optimization algorithm, Asynchronous Stochastic Gradient Descent, under unbounded delays.
Also referred to as stochastic approximation in the literature;
First introduced in ? and ?;
The idea: simply use a noisy unbiased gradient in place of the unknown true gradient in the gradient descent algorithm;
Stochastic gradient descent works as follows. To solve the optimization problem
min_{x ∈ R^d} f(x) = E[F(x; ξ)],  (1)
let x_{k+1} = x_k − γ_k G(x_k), where x_k denotes the parameter at the k-th iteration and G(x_k) is a noisy unbiased gradient evaluated at x_k;
11. Asynchronous Stochastic Gradient Descent
Asynchronous Stochastic Gradient Descent (Asyn-SGD)
Asyn-SGD is an extension of SGD. It can be implemented as follows:
For workers:
compute a gradient G at the current parameter x with a random sample ξ;
report the gradient to the server;
For the server:
collect a fixed number (M) of gradients from the workers;
update the current parameter with these gradients;
12. Asynchronous Stochastic Gradient Descent
Asynchronous Stochastic Gradient Descent (Asyn-SGD)
Algorithm 1 Asynchronous Stochastic Gradient Descent (Asyn-SGD)
Require: database Ξ, step size {γ_k}, initial point x_0, batch size M;
Ensure: x_k;
At parameter server:
1: for i = 1, 2, ..., k do
2:   Collect M gradients G(x_{i−τ_{i,m}}; ξ_{i,m}) from workers;
3:   Update x_{i+1} = x_i − γ_i Σ_{m=1}^{M} G(x_{i−τ_{i,m}}; ξ_{i,m});
4: end for
At workers:
5: Receive the current parameter x* from the parameter server;
6: Randomly select a sample ξ from the database;
7: Compute the stochastic gradient G(x*; ξ) and report it to the server;
Here τ_{i,m} is the delay of the m-th gradient in the i-th iteration.
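The server/worker loop of Algorithm 1 can be mimicked in a single-process toy simulation. This is a sketch under our own assumptions (a one-dimensional quadratic objective, uniformly bounded delays, and illustrative constants), not the authors' implementation:

```python
import random

# Toy simulation of Asyn-SGD (Algorithm 1) on f(x) = E[(x - xi)^2 / 2]
# with xi ~ N(1, 0.2^2), so the minimizer is x* = 1 and G(x; xi) = x - xi
# is an unbiased stochastic gradient. Worker staleness is mimicked by
# evaluating each reported gradient at a randomly chosen past iterate.

random.seed(0)

M = 4                  # gradients the server collects per update
MAX_DELAY = 5          # tau_{i,m} ~ Uniform{0, ..., MAX_DELAY}
history = [5.0]        # history[i] = x_i, starting from x_0 = 5

for i in range(1, 2001):
    gamma = 1.0 / (10 + i)      # O(1/k) step size: unsummable, square-summable
    grad_sum = 0.0
    for _ in range(M):
        tau = random.randint(0, MAX_DELAY)
        stale_x = history[max(0, len(history) - 1 - tau)]  # x_{i - tau_{i,m}}
        xi = random.gauss(1.0, 0.2)                        # fresh data sample
        grad_sum += stale_x - xi                           # G(x_{i - tau}; xi)
    history.append(history[-1] - gamma * grad_sum)         # server step 3

print(history[-1])  # settles near the minimizer x* = 1
```

Despite the stale gradients, the iterates still reach x* here because the delays are bounded; the analysis below extends this to unbounded delays satisfying a condition on their distribution.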
13. Asynchronous Stochastic Gradient Descent
Asynchronous Stochastic Gradient Descent with Incremental Batch Size (Asyn-SGDI)
A modified version of Asyn-SGD increases the batch size when determining the update direction. With a larger batch size the variance of the gradient noise decreases, which can lead to faster convergence.
Algorithm 2 Asyn-SGD with incremental batch size (Asyn-SGDI)
Require: database Ξ, step size {γ_k}, initial point x_0, increasing batch sizes {M_i = n_i M};
Ensure: x_k;
At parameter server:
1: for i = 1, 2, ..., k do
2:   Collect M_i gradients G(x_{i−τ_{i,m}}; ξ_{i,m}) from workers;
3:   Update x_{i+1} = x_i − (γ_i/n_i) Σ_{m=1}^{M_i} G(x_{i−τ_{i,m}}; ξ_{i,m});
4: end for
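Under the same toy assumptions as before, the incremental-batch variant of Algorithm 2 can be sketched as follows; taking n_i = i² makes Σ 1/n_i finite, in line with the condition used in the convergence analysis (worker delays are omitted for brevity, and all constants are illustrative):

```python
import random

# Sketch of Asyn-SGDI (Algorithm 2) on f(x) = E[(x - xi)^2 / 2],
# xi ~ N(1, 0.2^2). The batch grows as M_i = n_i * M with n_i = i^2
# (so sum_i 1/n_i < infinity) and the step size is constant; the 1/n_i
# factor in the update averages away the noise of the growing batch.

random.seed(1)

M, gamma = 2, 0.05
x = 5.0
for i in range(1, 101):
    n_i = i * i
    grad_sum = sum(x - random.gauss(1.0, 0.2) for _ in range(n_i * M))
    x -= (gamma / n_i) * grad_sum   # x_{i+1} = x_i - (gamma_i / n_i) * sum_m G

print(x)  # approaches the minimizer x* = 1
```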
14. Convergence Analysis
1 Introduction
2 Asynchronous Stochastic Gradient Descent
3 Convergence Analysis
4 Numerical Study
5 Conclusion
6 Preliminary Work
15. Convergence Analysis
General Assumptions
Assumption
(Lower-bounded objective function) For the objective function f, there exists an optimal point x*, s.t. ∀x, f(x) ≥ f(x*).
Assumption
(Lipschitz continuous gradient) The objective function f satisfies ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, ∀x, y.
Assumption
(Unbiased gradients with bounded variance) The stochastic gradient G(x; ξ) is unbiased with bounded variance, that is:
1. E[G(x; ξ)] = ∇f(x), ∀x;
2. E[‖G(x; ξ) − ∇f(x)‖²] ≤ σ², ∀x;
16. Convergence Analysis
Restriction on the probabilities of the delay variables
Assumption
There exists a sequence {c_i} such that
c_{j+1} + (γ_k M L²/2) Σ_{i=j}^{k} i P(τ_k = i) ≤ c_j, ∀ j, k,  (2)
where τ_k denotes the maximum delay in the k-th iteration, τ_k = max_m τ_{k,m}, and γ_k is the step size.
Here, {c_i} is the weight sequence in the asynchronicity error.
17. Convergence Analysis
Convergence Analysis for Asyn-SGD
Now we can give the convergence result for Asyn-SGD:
Theorem
Assume the above assumptions hold and the step size {γ_k} satisfies
1. γ_k ≤ 1/(2Mc_1 + ML), ∀k;
2. {γ_k} is unsummable but {γ_k²} is summable;
where M is the fixed batch size, L is the Lipschitz constant in Assumption 2, and c_1 comes from the sequence in Assumption 4. Then
E[Σ_{k=1}^{∞} γ_k ‖∇f(x_k)‖²] < ∞, and E[‖∇f(x_k)‖²] → 0.
Corollary
If the step size γ_k = O(1/(k^{1/2} log k)), then the asymptotic convergence rate for Asyn-SGD is E(‖∇f(x_K)‖²) = o(1/√K).
18. Convergence Analysis
Convergence Analysis for Asyn-SGD with Incremental Batch Size
Similarly, we can obtain the convergence result for Asyn-SGD with incremental batch size:
Theorem
Assume the above assumptions hold and the size of the database is infinite. Set the batch sizes {M_k := n_k M} such that Σ_{k=1}^{∞} 1/n_k < ∞ and the step size {γ_k} such that γ_k ≤ 1/(2M_1 c_1 + M_1 L), ∀k. Then
E[Σ_{k=1}^{∞} γ_k ‖∇f(x_k)‖²] < ∞, and E[‖∇f(x_k)‖²] → 0.
Corollary
For ε > 0 and 1/n_k = o(1/k^{1+ε}), with a fixed step size satisfying the requirement in Theorem 3.2, we have E(‖∇f(x_k)‖²) = o(1/K).
19. Convergence Analysis
Bounded Delay Variables
First we consider a simple case, in which the delay variables are bounded.
Corollary
(Bounded Delay Variables) If the delay variables {τ_k} are bounded, then {c_i} exists.
This is a very common case, as discussed in ?, ?, etc. The scenario is reasonable as long as all the workers run at comparable speeds.
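To see why bounded delays suffice, {c_i} can be constructed explicitly; a sketch in our notation, assuming for simplicity a common delay distribution with τ ≤ D and a uniform step-size bound γ_k ≤ γ:

```latex
% With P(\tau = i) = 0 for i > D, the inner sum in (2) is empty for j > D,
% so we may set c_j = 0 there and work backwards with equality:
c_j = \frac{\gamma M L^2}{2} \sum_{l=j}^{D} \sum_{i=l}^{D} i\, P(\tau = i)
    \;\le\; \frac{\gamma M L^2}{2}\, D \sum_{i=1}^{D} i \;<\; \infty,
% so that c_j - c_{j+1} = \frac{\gamma M L^2}{2} \sum_{i=j}^{D} i P(\tau = i),
% which dominates the term required by (2) for every k, since the sum up to
% k never exceeds the sum up to D.
```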
20. Convergence Analysis
I.I.D. Delay Variables
The second case assumes that the delays {τ_k} are i.i.d. and that their common distribution has a finite second moment. This scenario is reasonable when the iteration number is very large and the system has reached stationarity.
Corollary
(I.I.D. Delay Variables) If the delays {τ_k} are i.i.d. copies of τ and τ has a finite second moment, then {c_i} exists.
21. Convergence Analysis
Uniform Upper Bound
Third case: the delay variables can have different distributions as long as they are uniformly bounded by a sequence with a finite second moment. This is a more general case.
Corollary
(Uniformly Upper Bounded Probability Series) Consider the probability series of the delay variables {τ_k}_{k=1}^{∞}. If there exists a series {a_i}_{i=1}^{∞} s.t.
1. Σ_{i=1}^{∞} i² a_i < ∞;
2. P(τ_k = i) ≤ a_i, ∀k;
then {c_i} exists.
22. Numerical Study
1 Introduction
2 Asynchronous Stochastic Gradient Descent
3 Convergence Analysis
4 Numerical Study
5 Conclusion
6 Preliminary Work
23. Numerical Study
Example 1: MLE for the MVN Covariance Matrix
First, we consider maximum likelihood estimation of the covariance matrix of a multivariate normal distribution. The problem can be formulated as:
min_{Σ ∈ R^{d×d}} ln|Σ| + (1/n) Σ_{i=1}^{n} (x_i − µ)^T Σ^{-1} (x_i − µ)  (3)
subject to Σ ≻ 0,
where Σ is the covariance matrix, µ is the mean vector, and the x_i are samples. The gradient for this problem has been derived in ?.
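The gradient referred to above follows from standard matrix calculus: for the objective in (3), ∇f(Σ) = Σ⁻¹ − Σ⁻¹SΣ⁻¹ with S the sample covariance about µ, so it vanishes exactly at the MLE Σ = S. A minimal sketch (our own synthetic data, with the covariance (10, 3; 3, 5) borrowed from the next slide) that also checks the formula against a finite difference:

```python
import numpy as np

# f(Sigma) = ln|Sigma| + (1/n) sum_i (x_i - mu)^T Sigma^{-1} (x_i - mu)
#          = ln|Sigma| + tr(S Sigma^{-1}),  S = (1/n) sum_i (x_i - mu)(x_i - mu)^T,
# grad f(Sigma) = Sigma^{-1} - Sigma^{-1} S Sigma^{-1}  (deterministic gradient;
# the talk's algorithms use noisy sampled versions of it).

rng = np.random.default_rng(0)
mu = np.zeros(2)
true_cov = np.array([[10.0, 3.0], [3.0, 5.0]])   # as in the numerical study
x = rng.multivariate_normal(mu, true_cov, size=2000)
S = (x - mu).T @ (x - mu) / len(x)               # sample covariance about mu

def f(Sigma):
    return np.log(np.linalg.det(Sigma)) + np.trace(S @ np.linalg.inv(Sigma))

def grad(Sigma):
    inv = np.linalg.inv(Sigma)
    return inv - inv @ S @ inv

# Sanity checks: the gradient vanishes at Sigma = S, and it matches a
# central finite difference along an arbitrary symmetric direction E.
Sigma0 = np.array([[2.0, 0.5], [0.5, 1.5]])
E = np.array([[1.0, -0.3], [-0.3, 0.7]])
h = 1e-6
fd = (f(Sigma0 + h * E) - f(Sigma0 - h * E)) / (2 * h)
print(np.linalg.norm(grad(S)), abs(fd - np.trace(grad(Sigma0) @ E)))
```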
24. Numerical Study
Example 1: MLE for the MVN Covariance Matrix
We randomly generate data from a multivariate normal distribution with mean (0, 0) and covariance matrix (10, 3; 3, 5).
(a) uses a bounded delay variable with upper bound 50; (b) uses a Poisson delay with parameter 30; in (c), we simulate a virtual system in which the working time t of each worker follows the same model, t ∼ Exp(λ) with λ ∼ Gamma(2, 1).
The green solid line shows the convergence of Asyn-SGD with O(1/k) step size, the orange dotted line Asyn-SGD with O(1/(k^{1/2} log k)) step size, and the purple dashed line Asyn-SGDI.
25. Numerical Study
Example 1: MLE for MVN Covariance Matrix
(a) bounded by 50 (b) Poi(50) (c) System delay
Figure: Convergence for Asyn-SGD and Asyn-SGDI
In all three cases, the ℓ2 norm of the gradient goes to zero, with Asyn-SGDI the fastest and Asyn-SGD with step size O(1/k) the slowest.
26. Numerical Study
Example 1: MLE for MVN Covariance Matrix
We consider an extreme case where the delay variable follows a discrete uniform distribution (equal probabilities).
Figure: A counter example when Asyn-SGD fails
27. Numerical Study
Example 1: MLE for MVN Covariance Matrix
We also compare the computation time of Syn-SGD, Asyn-SGD and Asyn-SGDI on this problem. The step size for Syn-SGD and Asyn-SGD is O(1/k), and the step size for Asyn-SGDI is constant.
Figure: Computation time for three algorithms: the red line is for Syn-SGD; blue dotted line is
for Asyn-SGD; black dotdash line is for Asyn-SGDI
28. Numerical Study
Example 2: Low-Rank Matrix Completion
This problem is to find the lowest-rank matrix X that matches the expectation of the observed symmetric matrices, E[A]. It can be mathematically formulated as follows:
min E[‖A − YY^T‖_F²]  (4)
subject to Y ∈ R^{n×p},
where X = YY^T. Using SGD to solve this problem has been discussed in many works, including ? and ?.
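Problem (4) lends itself to SGD with a simple stochastic gradient: for symmetric A, the gradient of ‖A − YYᵀ‖_F² with respect to Y is −4(A − YYᵀ)Y. A hedged sketch with synthetic data (dimensions, noise level, and step sizes are ours, not from the talk):

```python
import numpy as np

# SGD sketch for problem (4): min_Y E||A - Y Y^T||_F^2, where each step sees
# a noisy symmetric observation A_t = X + W_t with E[A_t] = X = Y* Y*^T.
# For symmetric A_t the stochastic gradient w.r.t. Y is -4 (A_t - Y Y^T) Y.

rng = np.random.default_rng(0)
n, p = 8, 2
Y_star = rng.normal(size=(n, p))
X = Y_star @ Y_star.T                 # rank-p ground truth, E[A_t] = X

Y = rng.normal(size=(n, p))           # random initialization
for k in range(1, 5001):
    W = rng.normal(scale=0.1, size=(n, n))
    A_t = X + (W + W.T) / 2           # noisy symmetric observation
    grad = -4 * (A_t - Y @ Y.T) @ Y
    Y = Y - grad / (300 + k)          # O(1/k) step size

print(np.linalg.norm(Y @ Y.T - X))    # small once Y Y^T ≈ X
```

Note that Y itself is only identified up to an orthogonal rotation, so the check is on YYᵀ rather than on Y.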
29. Numerical Study
Example 2: Low Rank Matrix Completion
(a) bounded by 50 (b) Poi(30) (c) System delay
Figure: Convergence for Asyn-SGD and Asyn-SGDI
30. Conclusion
1 Introduction
2 Asynchronous Stochastic Gradient Descent
3 Convergence Analysis
4 Numerical Study
5 Conclusion
6 Preliminary Work
31. Conclusion
Conclusion
In our work, we analyze the convergence of Asyn-SGD on nonconvex optimization problems with unbounded delays;
We propose a new Lyapunov function, which consists of the classical error and an asynchronicity error;
A sufficient condition on the delay variables is given to guarantee the convergence of Asyn-SGD;
With a proper step size, the asymptotic convergence rate is o(1/√k) for Asyn-SGD and o(1/k) for Asyn-SGDI.
The algorithm requires the local gradients to be unbiased. For the heterogeneous case, we are working on an ADMM-based asynchronous solution.
32. Preliminary Work
1 Introduction
2 Asynchronous Stochastic Gradient Descent
3 Convergence Analysis
4 Numerical Study
5 Conclusion
6 Preliminary Work
33. Preliminary Work
Distributed Computing and ADMM
Consider the following problem:
Data are distributed across several machines, say Ξ1, Ξ2, ..., Ξk;
The objective function is
min_x L(x; Ξ1, Ξ2, ..., Ξk) = Σ_{i=1}^{k} L_i(x; Ξi);  (5)
Communication is too expensive, so each machine can only "see" its local objective function L_i(x; Ξi);
The data are biased, which means x_i = arg min L_i(x; Ξi) is not consistent.
34. Preliminary Work
Problem Formulation
Reformulating the problem:
min_x L(x; Ξ1, Ξ2, ..., Ξk) = Σ_{i=1}^{k} L_i(x; Ξi)  (6)
⇒ min_x Σ_{i=1}^{k} L_i(x_i; Ξi), s.t. x_i = x, ∀ i  (7)
The corresponding augmented Lagrangian function:
L({x_i}, x; y) = Σ_{i=1}^{k} L_i(x_i; Ξi) + Σ_{i=1}^{k} ⟨y_i, x_i − x⟩ + Σ_{i=1}^{k} (ρ_i/2)‖x_i − x‖²;  (8)
Thus, x and y_i can be updated on the central server, and x_i can be updated on local machine i. Only x, x_i and y_i are transferred between the central server and the local machines.
35. Preliminary Work
ADMM based parallel computing framework
Algorithm 3 ADMM based parallel computing framework
Require: databases {Ξi}, {ρi}, initial point;
Ensure: x_T;
At parameter server:
1: for t = 1, 2, ..., T do
2:   Collect x_i^t from the local machines;
3:   Update x^{t+1} = arg min_x Σ_{i=1}^{K} ⟨y_i^t, x_i^t − x⟩ + Σ_{i=1}^{K} (ρ_i/2)‖x_i − x‖²;
4:   Update y_i^{t+1} = y_i^t + ρ_i(x_i^{t+1} − x^{t+1});
5: end for
At local machine i:
6: Receive the current y_i^{t+1} and x^{t+1} from the parameter server;
7: Update x_i^{t+1} = arg min_{x_i} L_i(x_i; Ξi) + ⟨y_i^{t+1}, x_i − x^{t+1}⟩ + (ρ_i/2)‖x_i − x^{t+1}‖²;
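For a concrete feel of Algorithm 3, here is a minimal consensus-ADMM sketch with quadratic local objectives L_i(x_i) = ½‖x_i − b_i‖², for which every update in the algorithm has a closed form. It uses the standard consensus-ADMM ordering (local, then server, then dual) and illustrative constants; the biased local targets b_i stand in for heterogeneous data:

```python
import numpy as np

# Consensus ADMM sketch for problem (7) with K quadratic local objectives
# L_i(x_i) = 0.5 * ||x_i - b_i||^2; their sum is minimized at mean(b_i),
# which consensus recovers even though each local minimizer x_i = b_i is biased.

K, rho = 4, 1.0
b = [np.array([1.0, 2.0]), np.array([3.0, 0.0]),
     np.array([-1.0, 4.0]), np.array([5.0, 2.0])]

x = np.zeros(2)                          # global variable at the server
xs = [np.zeros(2) for _ in range(K)]     # local variables x_i
ys = [np.zeros(2) for _ in range(K)]     # dual variables y_i

for t in range(100):
    # Local update (step 7): argmin_{x_i} L_i + <y_i, x_i - x> + rho/2 ||x_i - x||^2
    # has the closed form below for quadratic L_i.
    xs = [(b[i] + rho * x - ys[i]) / (1 + rho) for i in range(K)]
    # Server update (step 3): the minimizer of the coupling terms is an average.
    x = sum(xs[i] + ys[i] / rho for i in range(K)) / K
    # Dual update (step 4).
    ys = [ys[i] + rho * (xs[i] - x) for i in range(K)]

print(x)  # converges to the mean of the b_i, here [2.0, 2.0]
```

Only x, the x_i, and the y_i would cross the network here, matching the communication pattern described on the previous slide.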