Random Matrix Theory and Machine Learning - Part 3, by Fabian Pedregosa
ICML 2021 tutorial on random matrix theory and machine learning.
Part 3 covers: 1. Motivation: Average-case versus worst-case in high dimensions 2. Algorithm halting times (runtimes) 3. Outlook
Random Matrix Theory and Machine Learning - Part 1, by Fabian Pedregosa
ICML 2021 tutorial on random matrix theory and machine learning. Part 1 covers: 1. A brief history of Random Matrix Theory, 2. Classical Random Matrix Ensembles (basic building blocks)
Tutorial on Belief Propagation in Bayesian Networks, by Anmol Dwivedi
The goal of this mini-project is to implement belief propagation algorithms for posterior probability inference and most probable explanation (MPE) inference for the Bayesian Network with binary values in which the Conditional Probability Table for each random-variable/node is given.
Individualized treatment rules (ITR) assign treatments according to different patients' characteristics. Despite recent advances in the estimation of ITRs, much less attention has been given to uncertainty assessments for the estimated rules. We propose a hypothesis testing procedure for the estimated ITRs from a general framework that directly optimizes overall treatment benefit equipped with sparse penalties. Specifically, we construct a local test for testing low-dimensional components of high-dimensional linear decision rules. The procedure can apply to observational studies by taking into account the additional variability from the estimation of the propensity score. Theoretically, our test extends the decorrelated score test proposed in Ning and Liu (2017, Ann. Stat.) and is valid whether or not model selection consistency for the true parameters holds. The proposed methodology is illustrated with numerical studies and a real data example on electronic health records of patients with Type-II Diabetes.
Linear Discriminant Analysis (LDA) Under f-Divergence Measures, by Anmol Dwivedi
For more details, please have a look at:
1. https://www.mdpi.com/1099-4300/24/2/188
2. https://ieeexplore.ieee.org/document/9518004
Abstract:
In statistical inference, the information-theoretic performance limits can often be expressed in terms of a notion of divergence between the underlying statistical models (e.g., in binary hypothesis testing, the total error probability is equal to the total variation between the models). As the data dimension grows, computing the statistics involved in decision-making and the attendant performance limits (divergence measures) face complexity and stability challenges. Dimensionality reduction addresses these challenges at the expense of compromising the performance (divergence reduces due to the data processing inequality for divergence). This paper considers linear dimensionality reduction such that the divergence between the models is maximally preserved. Specifically, the paper focuses on the Gaussian models and characterizes an optimal projection of the data onto a lower-dimensional subspace with respect to four $f$-divergence measures (Kullback-Leibler, $\chi^2$, Hellinger, and total variation). There are two key observations. First, projections are not necessarily along the dominant modes of the covariance matrix of the data, and even in some situations, they can be along the least dominant modes. Secondly, under specific regimes, the optimal design of subspace projection is identical under all the $f$-divergence measures considered, rendering a degree of universality to the design independent of the inference problem of interest.
The granting process of all credit institutions rejects applicants who seem risky regarding the repayment of their debt. A credit score is calculated and associated with a cut-off value beneath which an applicant is rejected. Developing a new scorecard, i.e. a correspondence table between a client's characteristics and his/her score, implies having a learning dataset in which the response variable good/bad borrower is known, so that rejects are de facto excluded from the learning process.
This may have deep consequences for the scorecard's relevance, as the learning population has been financed and considered good by the previous model. Previous works from the literature on this matter consisted mostly of empirical methods to exploit rejected applicants' data in the scorecard development process and experiments. We propose a rational criterion to evaluate the quality of a scoring model. We review each existing method in light of our criterion and dig out their implicit mathematical hypotheses.
We show that, up to now, no Reject Inference method can guarantee an improved credit scorecard. To support these theoretical findings, we added experiments on simulated and real data from the French branch of Crédit Agricole Consumer Finance.
These slides were presented as part of a talk I gave at the 49èmes Journées de Statistique, 29th of May, in Avignon, France.
Presentation on stochastic control problem with financial applications (Merto..., by Asma Ben Slimene
This is an introduction to optimal stochastic control theory with two applications in finance: the Merton portfolio problem and an investment/consumption problem, with numerical results using a finite-difference approach.
Abstract : For many years, Machine Learning has focused on a key issue: the design of input features to solve prediction tasks. In this presentation, we show that many learning tasks from structured output prediction to zero-shot learning can benefit from an appropriate design of output features, broadening the scope of regression. As an illustration, I will briefly review different examples and recent results obtained in my team.
The Internet has become the most extensive communication medium in all of history. It constitutes a source of information resources and knowledge shared on a worldwide scale. The Internet makes it possible to establish cooperation and collaboration among a large number of communities and general or special-interest groups distributed across the planet, whether via twitter, facebook, instagram, or websites (change.org).
Data fusion is the process of combining data from different sources to enhance the utility of the combined product. In remote sensing, input data sources are typically massive, noisy, and have different spatial supports and sampling characteristics. We take an inferential approach to this data fusion problem: we seek to infer a true but not directly observed spatial (or spatio-temporal) field from heterogeneous inputs. We use a statistical model to make these inferences, but like all models it is at least somewhat uncertain. In this talk, we will discuss our experiences with the impacts of these uncertainties and some potential ways of addressing them.
Bayesian inference for mixed-effects models driven by SDEs and other stochast..., by Umberto Picchini
An important, and well studied, class of stochastic models is given by stochastic differential equations (SDEs). In this talk, we consider Bayesian inference based on measurements from several individuals, to provide inference at the "population level" using mixed-effects modelling. We consider the case where dynamics are expressed via SDEs or other stochastic (Markovian) models. Stochastic differential equation mixed-effects models (SDEMEMs) are flexible hierarchical models that account for (i) the intrinsic random variability in the latent state dynamics, (ii) the variability between individuals, and (iii) measurement error. This flexibility gives rise to methodological and computational difficulties.
Fully Bayesian inference for nonlinear SDEMEMs is complicated by the typical intractability of the observed data likelihood, which motivates the use of sampling-based approaches such as Markov chain Monte Carlo. A Gibbs sampler is proposed to target the marginal posterior of all parameters of interest. The algorithm is made computationally efficient through careful use of blocking strategies, particle filters (sequential Monte Carlo) and correlated pseudo-marginal approaches. The resulting methodology is flexible and general, and is able to deal with a large class of nonlinear SDEMEMs [1]. In a more recent work [2], we also explored ways to make inference even more scalable to an increasing number of individuals, while also dealing with state-space models driven by stochastic dynamic models other than SDEs, e.g. Markov jump processes and nonlinear solvers typically used in systems biology.
[1] S. Wiqvist, A. Golightly, AT McLean, U. Picchini (2020). Efficient inference for stochastic differential mixed-effects models using correlated particle pseudo-marginal algorithms, CSDA, https://doi.org/10.1016/j.csda.2020.107151
[2] S. Persson, N. Welkenhuysen, S. Shashkova, S. Wiqvist, P. Reith, G. W. Schmidt, U. Picchini, M. Cvijovic (2021). PEPSDI: Scalable and flexible inference framework for stochastic dynamic single-cell models, bioRxiv doi:10.1101/2021.07.01.450748.
Possible applications of low-rank tensors in statistics and UQ (my talk in Bo..., by Alexander Litvinenko
Just some ideas on how low-rank matrices/tensors can be useful in spatial and environmental statistics, where one usually has to deal with very large data.
Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-..., by Frank Nielsen
Slides for the paper:
On the Chi Square and Higher-Order Chi Distances for Approximating f-Divergences
published in IEEE SPL:
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6654274
Integration with kernel methods, Transported meshfree methods, by Mercier Jean-Marc
I have made public for discussion a first version (subject to changes) of a talk that will be given at the Particles 2019 conference.
The main points of this presentation are the following:
1) We present new (to my knowledge) sharp estimates for Monte Carlo-type methods.
2) These estimates can be used in a wide variety of contexts to perform a sharp error analysis.
3) We present a class of numerical methods that we refer to as Transported Meshfree Methods. This class of methods can be used for a wide variety of problems based on Partial Differential Equations, among which artificial-intelligence problems belong.
4) Thanks to the error analysis, we can guarantee a worst-case error while computing with transported meshfree methods. We can also check that this error matches the optimal convergence rate.
Rao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithms, by Christian Robert
Aggregate of three different papers on Rao-Blackwellisation, from Casella & Robert (1996), to Douc & Robert (2010), to Banterle et al. (2015), presented during an OxWaSP workshop on MCMC methods, Warwick, Nov 20, 2015
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R..., by Chiheb Ben Hammouda
In biochemically reactive systems with small copy numbers of one or more reactant molecules, the dynamics are dominated by stochastic effects. To approximate those systems, discrete state-space and stochastic simulation approaches have been shown to be more relevant than continuous state-space and deterministic ones. These stochastic models constitute the theory of Stochastic Reaction Networks (SRNs). In systems characterized by having simultaneously fast and slow timescales, existing discrete state-space stochastic path simulation methods, such as the stochastic simulation algorithm (SSA) and the explicit tau-leap (explicit-TL) method, can be very slow. In this talk, we propose a novel implicit scheme, split-step implicit tau-leap (SSI-TL), to improve numerical stability and provide efficient simulation algorithms for those systems. Furthermore, to estimate statistical quantities related to SRNs, we propose a novel hybrid Multilevel Monte Carlo (MLMC) estimator in the spirit of the work by Anderson and Higham (SIAM Multiscale Model. Simul. 10(1), 2012). This estimator uses the SSI-TL scheme at levels where the explicit-TL method is not applicable due to numerical stability issues, and then, starting from a certain interface level, it switches to the explicit scheme. We present numerical examples that illustrate the achieved gains of our proposed approach in this context.
IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb), by Charles Deledalle
Moving averages. Finite differences and edge detectors. Gradient, Sobel and Laplacian. Linear translations invariant filters, cross-correlation and convolution. Adaptive and non-linear filters. Median filters. Morphological filters. Local versus global filters. Sigma filter. Bilateral filter. Patches and non-local means. Applications to image denoising.
On learning statistical mixtures maximizing the complete likelihood
1. Distributed algorithms for convex optimization
David Mateos-Núñez, Jorge Cortés
University of California, San Diego
{dmateosn,cortes}@ucsd.edu
Thesis Defense
Committee: Tara Javidi, Philip E. Gill, Maurício de Oliveira, and Miroslav Krstić
UC San Diego, August 20, 2015
3. Distributed paradigm
Users Tara, Philip, Mauricio, and Miroslav rate movies (Toy Story, Jurassic Park, ...); the ratings form the columns of
$W = [\, w_1 \;\; w_2 \;\; w_3 \;\; w_4 \,]$
$\min_{w_i \in \mathbb{R}^d} \sum_{i=1}^{4} f^i(w_i) + \gamma\, \| [\, w_1 \cdots w_4 \,] \|_*$
Ratings $\{w_i\}$ are private
4. Distributed algorithms for four classes of problems
Noisy channels:
$\min_{x \in \mathbb{R}^d} \sum_{i=1}^{N} f^i(x)$
Online optimization:
$R^j_T := \sum_{t=1}^{T} \sum_{i=1}^{N} f^i_t(x^j_t) - \min_{x \in \mathbb{R}^d} \sum_{t=1}^{T} \sum_{i=1}^{N} f^i_t(x)$
Multi-task learning:
$\min_{w_i \in \mathbb{R}^d} \sum_{i=1}^{N} f^i(w_i) + \gamma\, \|W\|_*$
Constraints:
$\min_{w_i \in \mathcal{W}_i,\; D \in \mathcal{D}} \sum_{i=1}^{N} f^i(w_i, D)$ s.t. $\sum_{i=1}^{N} g^i(w_i, D) \leq 0$
5. Part I: Noise-to-state stable coordination
Distributed optimization with noisy communication
6. Review of (consensus-based) distributed convex optimization
Local information
Laplacian matrix
Disagreement feedback
Pioneering works
Lagrangian approach
7. Using local information
$x^* \in \arg\min_{x \in \mathbb{R}^d} \sum_{i=1}^{N} f^i(x)$
Agent $i$ has access to $f^i$
Agent $i$ can share its estimate $x^i_t$ with "neighboring" agents
Example: five agents with objectives $f^1, \dots, f^5$ and adjacency matrix $A$ with nonzero weights $a_{13}, a_{14}, a_{21}, a_{25}, a_{31}, a_{32}, a_{41}, a_{52}$
Parallel computations: Tsitsiklis 84, Bertsekas and Tsitsiklis 95
Consensus: Jadbabaie et al. 03, Olfati-Saber, Murray 04, Boyd et al. 05
8. The Laplacian matrix
$L = \operatorname{diag}(A\mathbf{1}) - A = \begin{pmatrix} a_{13}+a_{14} & 0 & -a_{13} & -a_{14} & 0 \\ -a_{21} & a_{21}+a_{25} & 0 & 0 & -a_{25} \\ -a_{31} & -a_{32} & a_{31}+a_{32} & 0 & 0 \\ -a_{41} & 0 & 0 & a_{41} & 0 \\ 0 & -a_{52} & 0 & 0 & a_{52} \end{pmatrix}$
Nullspace is agreement ⇔ spanning tree
The mirror graph has Laplacian $\tfrac{1}{2}(L + L^\top)$
Consensus via feedback on disagreement: proportional, or proportional-integral feedback
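As a minimal sketch (not from the slides; unit weights are an assumption), the Laplacian construction above can be checked numerically: build $L = \operatorname{diag}(A\mathbf{1}) - A$ for the 5-agent digraph and verify that the agreement direction spans its nullspace.

```python
import numpy as np

N = 5
A = np.zeros((N, N))
# nonzero weights a13, a14, a21, a25, a31, a32, a41, a52 (0-indexed below),
# all set to 1.0 for illustration
A[0, 2] = A[0, 3] = 1.0
A[1, 0] = A[1, 4] = 1.0
A[2, 0] = A[2, 1] = 1.0
A[3, 0] = 1.0
A[4, 1] = 1.0

L = np.diag(A @ np.ones(N)) - A

# L @ 1 = 0: the agreement subspace always lies in the nullspace
assert np.allclose(L @ np.ones(N), 0.0)

# since this digraph contains a spanning tree, the nullspace is exactly
# the agreement subspace, i.e. rank(L) = N - 1
print(np.linalg.matrix_rank(L))  # 4
```

Changing the weights only rescales the rows; the rank property depends on the graph's connectivity, not on the particular positive weights.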
9. Disagreement feedback
"Honeybee navigation en route to the goal: visual flight control and odometry," M. Srinivasan, S. Zhang, M. Lehrer, and T. Collett, J. Exp. Biol., 1996
Relative positions and angles
$g_1(w_1) + \cdots + g_N(w_N) = 0$
Agreement on the multiplier
10. Pioneering works
Local minimization versus local consensus
A. Nedić and A. Ozdaglar, TAC, 09:
$x^i_{k+1} = -\underbrace{\eta_k}_{\text{decreasing}}\, g^i_k + \sum_{j=1}^{N} a_{ij}\, x^j_k$,
where $g^i_k \in \partial f^i(x^i_k)$ and $A = (a_{ij})$ is doubly stochastic
J. C. Duchi, A. Agarwal, and M. J. Wainwright, TAC, 12
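A toy sketch of the Nedić-Ozdaglar update above (my assumptions, not from the slides: 4 agents on a ring with doubly stochastic weights, scalar quadratics $f^i(x) = \tfrac{1}{2}(x - c_i)^2$): each agent mixes its neighbors' estimates and takes a local gradient step with decreasing step size $\eta_k = 1/k$.

```python
import numpy as np

c = np.array([1.0, 2.0, 3.0, 6.0])        # local targets; optimum = mean(c)
A = np.array([[0.5, 0.25, 0.0, 0.25],     # doubly stochastic mixing weights
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]])

x = np.zeros(4)
for k in range(1, 5001):
    g = x - c                 # gradient of each local objective at x^i_k
    x = A @ x - g / k         # consensus step + decreasing step size 1/k
print(x)  # all entries approach mean(c) = 3.0
```

Double stochasticity is what makes the network-wide average of the estimates follow a centralized gradient step, while the mixing matrix contracts the disagreement between agents.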
11. The Lagrangian approach
$\min_{\substack{x^1, \dots, x^N \in \mathbb{R}^d \\ x^1 = \dots = x^N}} \sum_{i=1}^{N} f^i(x^i) = \min_{\substack{x \in (\mathbb{R}^d)^N \\ L x = 0}} \tilde{f}(x)$
Convex-concave augmented Lagrangian
$F(x, z) := \underbrace{\tilde{f}(x)}_{\text{network objective}} + \underbrace{z^\top L x}_{\text{multiplier} \times \text{constraint}} + \underbrace{\tfrac{\tilde{\gamma}}{2}\, x^\top L x}_{\text{augmented Lagrangian}}$
Saddle-point dynamics for symmetric Laplacian $L$:
$\dot{x} = -\frac{\partial F(x,z)}{\partial x} = -\nabla\tilde{f}(x) - \tilde{\gamma}\, L x - L z$
$\dot{z} = \frac{\partial F(x,z)}{\partial z} = L x$
J. Wang and N. Elia, Allerton, 10. And when $L \neq L^\top$?
12. Lagrangian approach for directed graphs
Lagrangian $F(x, z) := \tilde{f}(x) + z^\top L x + \tfrac{\tilde{\gamma}}{2}\, x^\top (L + L^\top)\, x$
still convex-concave because
weight-balanced ⇔ $\mathbf{1}^\top L = 0$ ⇔ $L + L^\top \succeq 0$
$\dot{x} = -\frac{\partial F(x,z)}{\partial x} = -\nabla\tilde{f}(x) - \tilde{\gamma}\, \tfrac{1}{2}(L + L^\top)\, x - L^\top z$ (Not distributed!)
$\dot{z} = \frac{\partial F(x,z)}{\partial z} = L x$
B. Gharesifard and J. Cortés, CDC, 12
13. Lagrangian approach for directed graphs
Lagrangian $F(x, z) := \tilde{f}(x) + z^\top L x + \tfrac{\tilde{\gamma}}{2}\, x^\top (L + L^\top)\, x$
still convex-concave because
weight-balanced ⇔ $\mathbf{1}^\top L = 0$ ⇔ $L + L^\top \succeq 0$
$\dot{x} = -\frac{\partial F(x,z)}{\partial x} = -\nabla\tilde{f}(x) - \tilde{\gamma}\, \tfrac{1}{2}(L + L^\top)\, x - L^\top z$ (Not distributed!)
changed to $-\nabla\tilde{f}(x) - \tilde{\gamma}\, L x - L z$
$\dot{z} = \frac{\partial F(x,z)}{\partial z} = L x$
B. Gharesifard and J. Cortés, CDC, 12
14. Now... what happens when there is noise?
Previous work and limitations
Our coordination algorithm under noise
Theorem of exponential noise-to-state stability
Simulations
Outline of the proof
15. Previous work with noisy communication channels
S. S. Ram, A. Nedić, and V. V. Veeravalli, 10
K. Srivastava and A. Nedić, 11
$x^i_{k+1} = x^i_k - \eta^i_k\, (g^i_k + \text{noise}) + \sigma^i_k \sum_{j=1}^{N} a_{ij,k}\, (x^j_k + \text{noise} - x^i_k)$
where $g^i_k \in \partial f^i(x^i_k)$ and $A_k = (a_{ij,k})$ is an adjacency matrix
Limitations:
vanishing $\{\eta^i_k\}_{k \geq 1}$ and $\{\sigma^i_k\}_{k \geq 1}$
sublinear convergence even in the strongly convex case
noise of fixed variance
require a projection step or bounded subgradients
16. Our algorithm: proportional-integral feedback + noise
$\dot{x}^i(t) = -\nabla f_i(x^i(t)) + \text{noise} + \tilde{\gamma} \sum_{j=1, j\neq i}^{N} a_{ij}\, \big(x^j(t) + \text{noise} - x^i(t)\big) + \sum_{j=1, j\neq i}^{N} a_{ij}\, \big(z^j(t) + \text{noise} - z^i(t)\big)$
$\dot{z}^i(t) = -\sum_{j=1, j\neq i}^{N} a_{ij}\, \big(x^j(t) + \text{noise} - x^i(t)\big)$
Stochastic differential equation in compact form:
$dx = -\nabla\tilde{f}(x)\, dt - \tilde{\gamma}\, L x\, dt - \underbrace{L z\, dt}_{\text{structured integral feedback}} + \underbrace{G_1(x, z, t)\, \Sigma_1(t)\, dB(t)}_{\text{noise}}$
$dz = L x\, dt + \underbrace{G_2(x, z, t)\, \Sigma_2(t)\, dB(t)}_{\text{noise}}$
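A rough Euler-Maruyama sketch of the noisy proportional-integral dynamics above (illustration only, not the authors' code; my assumptions: scalar agent states, quadratic objectives $f_i(x) = \tfrac{1}{2}(x - c_i)^2$, a weight-balanced directed ring, and additive noise of constant size `sig` on both $x$ and $z$). Despite the persistent noise, the states hover near the consensus optimizer, consistent with noise-to-state stability.

```python
import numpy as np

rng = np.random.default_rng(0)
c = np.array([4.0, -1.0, 2.0, 3.0])     # x* = 1 * mean(c) = 1 * 2.0
L = np.array([[ 1., -1.,  0.,  0.],     # directed ring, weight-balanced
              [ 0.,  1., -1.,  0.],
              [ 0.,  0.,  1., -1.],
              [-1.,  0.,  0.,  1.]])
gamma, sig, dt, T = 2.0, 0.02, 1e-3, 20.0

x = np.zeros(4)
z = np.zeros(4)
for _ in range(int(T / dt)):
    grad = x - c                                  # gradient of f~ at x
    dB = rng.normal(size=(2, 4)) * np.sqrt(dt)    # Brownian increments
    x = x + (-grad - gamma * (L @ x) - L @ z) * dt + sig * dB[0]
    z = z + (L @ x) * dt + sig * dB[1]

# the states stay near the optimizer 2.0, up to a residual set by sig
print(x)
```

Note that $z$ along the agreement direction performs a random walk, but it never affects $x$ because it enters only through $Lz$.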
17. Theorem (DMN-JC, 2nd moment noise-to-state exponential stability)
If
the digraph is strongly connected and weight-balanced,
each $f_i$ is convex and twice differentiable with bounded Hessian, and
at least one $f_{i_0}$ is strongly convex,
then
$\mathbb{E}\Big[ \underbrace{\|x(t) - \overbrace{x^*}^{\mathbf{1} \otimes x_{\min}}\|_2^2}_{\text{distance to optimizer}} + \underbrace{\|z(t) - z^*\|_{L_K}^2}_{\text{distance modulo agreement}} \Big] \leq \underbrace{C_\mu \big( \|x_0 - x^*\|_2^2 + \|z_0 - z^*\|_{L_K}^2 \big)}_{\text{initial conditions}}\, e^{-D_\mu (t - t_0)} + \underbrace{C_\theta \sup_{t_0 \leq \tau \leq t} \Big\| \begin{bmatrix} \Sigma_1(\tau) \\ \Sigma_2(\tau) \end{bmatrix} \Big\|_F^2}_{\text{ultimate bound}}$
where $\|\Sigma(t)\|_F := \sqrt{\operatorname{trace}\big(\Sigma(t)\, \Sigma(t)^\top\big)}$ gives the size of the noise
18. Simulations
[Figures: evolution of the agents' estimates $\{[x^i_1(t), x^i_2(t)]\}$; evolution of the second moment $\mathbb{E}\, \|x(t) - \mathbf{1} \otimes x_{\min}\|_2^2$ for $\tilde{\gamma} = 2, 3, 4$; empirical noise gain (ultimate bound for $\mathbb{E}\, \|x(t) - \mathbf{1} \otimes x_{\min}\|_2^2$ at $t = 17$) versus the size of the noise $s$]
$N = 4$ agents communicating over a directed ring with local objectives
$f_1(x_1, x_2) = \tfrac{1}{2}\big((x_1 - 4)^2 + (x_2 - 3)^2\big)$
$f_2(x_1, x_2) = x_1 + 3 x_2 - 2$
$f_3(x_1, x_2) = \log\big(e^{x_1 + 3} + e^{x_2 + 1}\big)$
$f_4(x_1, x_2) = (x_1 + 2 x_2 + 5)^2 + (x_1 - x_2 - 4)^2$
19. Outline of the proof
KKT conditions versus equilibria of the underlying modified saddle-point ODE
Existence & uniqueness of solutions of the SDEs
Lyapunov analysis of SDEs
pth moment NSS Lyapunov functions
"pth moment noise-to-state stability of stochastic differential equations with persistent noise," DMN-JC
Strategy for directed graphs
"Distributed continuous-time convex optimization on weight-balanced digraphs," B. Gharesifard and J. Cortés
$\delta$-cocoercivity of $F := (I_{Nd} + P)\big(\nabla\tilde{f}(x) + L x\big)$:
$(x - x')^\top \big(F(x) - F(x')\big) \geq \delta\, \|F(x) - F(x')\|_2^2$
20. End of Part I
Contributions:
2nd moment noise-to-state exponentially stable coordination
First proof of exponential convergence with only one objective strongly convex
Novel general Lyapunov stability analysis for SDEs
Open problems:
Coordination algorithms with communication delays
21. Part II: Distributed online optimization
Distributed optimization meets online optimization
Case study: medical diagnosis
22. Review of online convex optimization
Motivation: medical diagnosis
Sequential decision making
Regret
Classical results
23. Medical diagnosis
Medical findings, symptoms:
age factor
amnesia before impact
deterioration in GCS score
open skull fracture
loss of consciousness
vomiting
Does the patient need a computerized tomography (CT) scan?
"The Canadian CT Head Rule for patients with minor head injury"
24. A different kind of optimization: sequential decision making
In the diagnosis example, each round $t \in \{1, \dots, T\}$:
question (features, medical findings): $w_t$
decision (about using CT): $h(x_t, w_t)$
outcome (by CT findings / follow-up of patient): $y_t$
loss: $l(y_t\, h(x_t, w_t))$
Choose $x_t$ & incur loss $f_t(x_t) := l(y_t\, h(x_t, w_t))$
Goal: sublinear regret,
$R(u, T) := \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u) \in o(T)$,
using "historical observations"
25. Significance of sublinear regret
If the regret is sublinear,
$\sum_{t=1}^{T} f_t(x_t) \leq \sum_{t=1}^{T} f_t(u) + o(T)$,
then
$\lim_{T \to \infty} \underbrace{\frac{1}{T} \sum_{t=1}^{T} f_t(x_t)}_{\text{loss of decisions "on the fly"}} \leq \lim_{T \to \infty} \underbrace{\frac{1}{T} \sum_{t=1}^{T} f_t(u)}_{\text{loss of any decision in hindsight}}$
In temporal average, the online decisions $\{x_t\}_{t=1}^{T}$ perform as well as the best decision in hindsight
"No regrets, my friend"
26. Some classical results
Projected gradient descent:
$x_{t+1} = \Pi_S\big(x_t - \eta_t\, \nabla f_t(x_t)\big)$,  (1)
where $\Pi_S$ is a projection onto a compact set $S \subseteq \mathbb{R}^d$, and $\|\nabla f\|_2 \leq H$
Follow-the-Regularized-Leader:
$x_{t+1} = \arg\min_{y \in S} \Big\{ \sum_{s=1}^{t} f_s(y) + \psi(y) \Big\}$
Martin Zinkevich, 03: Alg. (1) achieves $O(\sqrt{T})$ regret under convexity with $\eta_t = \frac{1}{\sqrt{t}}$
Elad Hazan, Amit Agarwal, and Satyen Kale, 07: Alg. (1) achieves $O(\log T)$ regret under $p$-strong convexity with $\eta_t = \frac{1}{p t}$
Others: Online Newton Step, Follow the Regularized Leader, etc.
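A small sketch of Zinkevich's projected online gradient descent (1) with $\eta_t = 1/\sqrt{t}$, on made-up scalar losses $f_t(x) = |x - y_t|$ (my choice of losses and decision set, for illustration): the average regret against the best fixed decision in hindsight shrinks as $O(1/\sqrt{T})$.

```python
import numpy as np

rng = np.random.default_rng(1)
S = (-1.0, 1.0)                          # compact decision set
T = 2000
y = rng.choice([-0.5, 1.0], size=T)      # adversary's targets

x, xs = 0.0, []
for t in range(1, T + 1):
    xs.append(x)
    g = np.sign(x - y[t - 1])            # subgradient of |x - y_t|
    x = np.clip(x - g / np.sqrt(t), *S)  # gradient step + projection onto S

losses = np.abs(np.array(xs) - y)
# best fixed decision in hindsight, approximated over a grid of S
grid = np.linspace(S[0], S[1], 201)
best = min(np.sum(np.abs(u - y)) for u in grid)
regret = losses.sum() - best
print(regret / T)  # small: average regret decays like O(1/sqrt(T))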
27. Combination of the distributed and online frameworks
Motivation
Previous work and limitations
Our coordination algorithm in the online case
Theorems of $O(\sqrt{T})$ & $O(\log T)$ agent regrets
Simulations
Outline of the proof
28. Resuming the diagnosis example
Health center $i \in \{1, \dots, N\}$ takes care of a set of patients at time $t$
Goal: sublinear agent regret,
$R^j(u, T) := \sum_{t=1}^{T} \sum_{i=1}^{N} f^i_t(x^j_t) - \sum_{t=1}^{T} \sum_{i=1}^{N} f^i_t(u) \in o(T)$,
using "local information" & "historical observations"
29. Challenge: coordinate hospitals
Five health centers with time-varying objectives $f^1_t, \dots, f^5_t$, $t \in \{1, \dots, T\}$
Need to design distributed online algorithms
Private patient data; share only prediction models
30. Previous work on consensus-based online algorithms
F. Yan, S. Sundaram, S. V. N. Vishwanathan, and Y. Qi, TAC: projected subgradient descent
$\log(T)$ regret (local strong convexity & bounded subgradients)
$\sqrt{T}$ regret (convexity & bounded subgradients)
Both analyses require a projection onto a compact set
S. Hosseini, A. Chapman, and M. Mesbahi, CDC, 13: dual averaging
$\sqrt{T}$ regret (convexity & bounded subgradients)
General regularized projection onto a closed convex set
K. I. Tsianos and M. G. Rabbat, arXiv, 12: projected subgradient descent
Empirical risk as opposed to regret analysis
Limitation: the communication digraph in all cases is fixed and strongly connected
31. Our contributions (informally)
time-varying communication digraphs under $B$-joint connectivity & weight-balance
unconstrained optimization (no projection step onto a bounded set)
$\log T$ regret (local strong convexity & bounded subgradients)
$\sqrt{T}$ regret (convexity & $\beta$-centrality with $\beta \in (0, 1]$ & bounded subgradients)
Definition: $f^i_t$ is $\beta$-central in $Z \subseteq \mathbb{R}^d \setminus X^*$, where $X^* = \{x^* : 0 \in \partial f^i_t(x^*)\}$, if $\forall x \in Z$, $\exists x^* \in X^*$ such that
$\frac{-\xi_x^\top (x^* - x)}{\|\xi_x\|_2\, \|x^* - x\|_2} \geq \beta$, for each $\xi_x \in \partial f^i_t(x)$
32. Our algorithm under jointly connected topologies
$x^i_{t+1} = x^i_t - \eta_t\, g_{x^i_t}$
Subgradient descent on previous local objectives, $g_{x^i_t} \in \partial f^i_t(x^i_t)$
33. Our algorithm under jointly connected topologies
$x^i_{t+1} = x^i_t - \eta_t\, g_{x^i_t} + \sigma\, \tilde{\gamma} \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t)$
Proportional feedback on disagreement
34. Our algorithm under jointly connected topologies
$x^i_{t+1} = x^i_t - \eta_t\, g_{x^i_t} + \sigma \Big( \tilde{\gamma} \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t) + \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (z^j_t - z^i_t) \Big)$
$z^i_{t+1} = z^i_t - \sigma \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t)$
Integral feedback on disagreement
35. Our algorithm under jointly connected topologies
$x^i_{t+1} = x^i_t - \eta_t\, g_{x^i_t} + \sigma \Big( \tilde{\gamma} \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t) + \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (z^j_t - z^i_t) \Big)$
$z^i_{t+1} = z^i_t - \sigma \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t)$
The union of the graphs over intervals of length $B$ is strongly connected
[Figure: communication graphs over agents $f^1_t, \dots, f^5_t$ at times $t$, $t+1$ (with edges $a_{14,t+1}$, $a_{41,t+1}$), and $t+2$]
36. Our algorithm under jointly connected topologies
$x^i_{t+1} = x^i_t - \eta_t\, g_{x^i_t} + \sigma \Big( \tilde{\gamma} \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t) + \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (z^j_t - z^i_t) \Big)$
$z^i_{t+1} = z^i_t - \sigma \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t)$
Compact representation:
$\begin{bmatrix} x_{t+1} \\ z_{t+1} \end{bmatrix} = \begin{bmatrix} x_t \\ z_t \end{bmatrix} - \sigma \begin{bmatrix} \tilde{\gamma} L_t & L_t \\ -L_t & 0 \end{bmatrix} \begin{bmatrix} x_t \\ z_t \end{bmatrix} - \eta_t \begin{bmatrix} \tilde{g}_{x_t} \\ 0 \end{bmatrix}$
37. Our algorithm under jointly connected topologies
$x^i_{t+1} = x^i_t - \eta_t\, g_{x^i_t} + \sigma \Big( \tilde{\gamma} \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t) + \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (z^j_t - z^i_t) \Big)$
$z^i_{t+1} = z^i_t - \sigma \sum_{j=1, j\neq i}^{N} a_{ij,t}\, (x^j_t - x^i_t)$
Compact representation & generalization:
$\begin{bmatrix} x_{t+1} \\ z_{t+1} \end{bmatrix} = \begin{bmatrix} x_t \\ z_t \end{bmatrix} - \sigma \begin{bmatrix} \tilde{\gamma} L_t & L_t \\ -L_t & 0 \end{bmatrix} \begin{bmatrix} x_t \\ z_t \end{bmatrix} - \eta_t \begin{bmatrix} \tilde{g}_{x_t} \\ 0 \end{bmatrix}$
$v_{t+1} = (I - \sigma\, E \otimes L_t)\, v_t - \eta_t\, g_t$
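A simplified numerical sketch of the proportional-integral online update above (illustration only, with my own assumptions: a fixed weight-balanced digraph instead of time-varying $B$-jointly-connected ones, scalar states, made-up strongly convex quadratic losses, and ad hoc constants $\sigma$, $\tilde{\gamma}$). With $\eta_t = 1/(pt)$, each agent tracks the minimizer of the network-wide time-varying objective using only neighbor information.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 4, 3000
L = np.array([[ 1., -1.,  0.,  0.],
              [ 0.,  1., -1.,  0.],
              [ 0.,  0.,  1., -1.],
              [-1.,  0.,  0.,  1.]])   # weight-balanced directed ring
# gamma >= 2 keeps the eigenvalues of E = [[gamma, 1], [-1, 0]] real and
# positive (one of the theorem's assumptions); sigma chosen small enough
gamma, sigma = 3.0, 0.2
p = 1.0                                # strong convexity modulus

x = rng.normal(size=N)
z = np.zeros(N)
for t in range(1, T + 1):
    # time-varying local losses f^i_t(x) = p (x - c^i_t)^2 / 2,
    # with c^i_t noisy around i, so the network optimum is near mean(0..3)
    c = rng.normal(loc=np.arange(N, dtype=float), scale=0.1)
    g = p * (x - c)                                    # local (sub)gradients
    x_new = x - g / (p * t) - sigma * (gamma * L @ x + L @ z)
    z_new = z + sigma * (L @ x)
    x, z = x_new, z_new

print(x)  # each agent's estimate is near the average target, mean(0..3) = 1.5
```

Note the PI structure: the $-\sigma(\tilde{\gamma} Lx + Lz)$ term is exactly $\sigma$ times the proportional and integral disagreement feedbacks written agent-wise on the slide.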
38. Theorem (DMN-JC, sublinear agent regret)
Assume that
$\{f^1_t, \dots, f^N_t\}_{t=1}^{T}$ are convex functions in $\mathbb{R}^d$ with $H$-bounded subgradient sets, nonempty and uniformly bounded sets of minimizers, and $p$-strongly convex in a sufficiently large neighborhood of their minimizers;
the sequence of weight-balanced communication digraphs is $\delta$-nondegenerate and $B$-jointly-connected;
$E \in \mathbb{R}^{K \times K}$ is diagonalizable with positive real eigenvalues.
For $\tilde{\delta}' \in (0, 1)$ and $\tilde{\delta} := \min\Big\{ \tilde{\delta}',\; (1 - \tilde{\delta}')\, \frac{\lambda_{\min}(E)\, \delta}{\lambda_{\max}(E)\, d_{\max}} \Big\}$,
choosing $\sigma \in \Big[ \frac{\tilde{\delta}}{\lambda_{\min}(E)\, \delta},\; \frac{1 - \tilde{\delta}}{\lambda_{\max}(E)\, d_{\max}} \Big]$ and $\eta_t = \frac{1}{p\, t}$,
$R^j(u, T) \leq C\, \big( \|u\|_2^2 + 1 + \log T \big)$,
for any $j \in \{1, \dots, N\}$ and $u \in \mathbb{R}^d$
39. Theorem (DMN-JC, sublinear agent regret)
Assume that
$\{f^1_t, \dots, f^N_t\}_{t=1}^{T}$ are convex functions in $\mathbb{R}^d$ with $H$-bounded subgradient sets and nonempty, uniformly bounded sets of minimizers;
the sequence of weight-balanced communication digraphs is $\delta$-nondegenerate and $B$-jointly-connected;
$E \in \mathbb{R}^{K \times K}$ is diagonalizable with positive real eigenvalues.
Substituting strong convexity by $\beta$-centrality, for $\beta \in (0, 1]$,
$R^j(u, T) \leq C\, \|u\|_2^2\, \sqrt{T}$,
for any $j \in \{1, \dots, N\}$ and $u \in \mathbb{R}^d$
41. Simulations
[Figures: agents' estimates $\{x^i_{t,7}\}_{i=1}^{N}$ and $\{x^i_{t,8}\}_{i=1}^{N}$ versus the centralized solution; average regret $\max_j \frac{1}{T} R^j\big(x^*_T, \{\sum_i f^i_t\}_{t=1}^{T}\big)$ against the time horizon $T$ for proportional-integral disagreement feedback, proportional disagreement feedback, and centralized]
$f^i_t(x) = \sum_{s \in P^i_t} l\big(y_s\, h(x, w_s)\big) + \tfrac{1}{10}\, \|x\|_2^2$,
where $l(m) = \log\big(1 + e^{-2m}\big)$
42. Outline of the proof
Analysis of the network regret
$R_N(u, T) := \sum_{t=1}^{T} \sum_{i=1}^{N} f^i_t(x^i_t) - \sum_{t=1}^{T} \sum_{i=1}^{N} f^i_t(u)$
Input-to-state stability with respect to agreement:
$\|\hat{L}_K v_t\|_2 \leq C_I\, \|v_1\|_2\, \Big(1 - \frac{\tilde{\delta}}{4 N^2}\Big)^{\frac{t-1}{B}} + C_U \max_{1 \leq s \leq t-1} \|d_s\|_2$
Agent regret bound in terms of $\{\eta_t\}_{t \geq 1}$ and initial conditions
Boundedness of online estimates under $\beta$-centrality ($\beta \in (0, 1]$)
Doubling trick to obtain $O(\sqrt{T})$ agent regret
Local strong convexity ⇒ $\beta$-centrality; $\eta_t = \frac{1}{p t}$ to obtain $O(\log T)$ agent regret
43. End of Part II
Contributions
Extension of classical $\sqrt{T}$- and $\log(T)$-regret bounds to the multi-agent scenario
B-joint connectivity of communication topologies
Projections onto compact sets shown unnecessary
Proportional-integral feedback on disagreement
Relevant for regression & classification
Future work
Refine guarantees under a model for the evolution of the objective functions
44. Part III: Distributed optimization with nuclear norm
Distributed optimization + nuclear norm regularization
45. Review of nuclear norm regularization
Definition
Application to matrix completion
46. Coupling through nuclear norm
Given a matrix $W = [\, w_1 \ \cdots \ w_N \,] \in \mathbb{R}^{d \times N}$,

    $\|W\|_* = $ sum of the singular values of $W$ $ = \operatorname{trace}\sqrt{WW^\top} = \operatorname{trace}\sqrt{\textstyle\sum_{i=1}^N w_i w_i^\top}$

The nuclear norm regularization

    $\min_{w_i \in \mathcal{W}_i} \ \sum_{i=1}^N f^i(w_i) + \gamma\,\|W\|_*^2$

favors vectors $\{w_i\}_{i=1}^N$ belonging to a low-dimensional subspace
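The identity between the sum of singular values and the trace of the matrix square root can be checked on random data; the sketch below computes $\operatorname{trace}\sqrt{WW^\top}$ through the eigenvalues of $WW^\top$ (which are the squared singular values of $W$).

```python
import numpy as np

# Check that sum of singular values of W equals trace sqrt(W W^T)
rng = np.random.default_rng(0)
d, N = 5, 8
W = rng.standard_normal((d, N))

nuclear = np.linalg.svd(W, compute_uv=False).sum()

eigs = np.linalg.eigvalsh(W @ W.T)               # eigenvalues of W W^T = sigma_i^2
trace_sqrt = np.sqrt(np.clip(eigs, 0.0, None)).sum()

print(abs(nuclear - trace_sqrt))  # numerically zero
```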
48. Low-rank matrix completion
Figure: incomplete rating matrix $Z$ (movies such as Toy Story and Jurassic Park rated by users Tara, Philip, Mauricio, Miroslav), assumed low-rank, with columns $W = [\, W_{:,1} \ W_{:,2} \ W_{:,3} \ W_{:,4} \,]$
To estimate the missing entries,

    $\min_{W \in \mathbb{R}^{d \times N}} \ \sum_{(i,j) \in \Gamma} (W_{ij} - Z_{ij})^2 + \gamma \|W\|_*$

where $\Gamma$ indexes the revealed entries of the rating matrix $Z$
$\gamma$ depends on the application, dimensions... Regularization, not penalty
Netflix: users $\sim 10^7$, movies $\sim 10^5$
Why make it distributed? Because users may not want to share their ratings
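For reference, this centralized estimator is commonly computed by proximal gradient iterations, where the prox of the nuclear norm soft-thresholds the singular values (singular value thresholding). A minimal sketch on synthetic data — the sizes, $\gamma$, step size, and iteration count are illustrative assumptions, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, r = 20, 15, 2
Z = rng.standard_normal((d, r)) @ rng.standard_normal((r, N))  # low-rank "ratings"
mask = rng.random((d, N)) < 0.6            # Gamma: the revealed entries

def svt(A, tau):
    """Prox of tau * nuclear norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

gamma, eta = 1.0, 0.5                      # eta = 1/L for the smooth part (L = 2)
W = np.zeros((d, N))
for _ in range(300):
    grad = 2.0 * mask * (W - Z)            # gradient of the masked squared error
    W = svt(W - eta * grad, eta * gamma)   # proximal gradient step

rel_err = np.linalg.norm(mask * (W - Z)) / np.linalg.norm(mask * Z)
print(rel_err)  # small residual on the revealed entries
```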
49. Distributed formulation of nuclear norm
Useful characterizations and challenges
Our dynamics for distributed optimization with nuclear norm
Theorem of convergence
Simulations
50. 1st characterization of the nuclear norm
For any $W \in \mathbb{R}^{d \times N}$,

    $2\,\|W\|_* \ \big(= 2\operatorname{trace}\sqrt{WW^\top}\big)$
    $= \min_{D \in \mathbb{S}^d_{\succeq 0},\ C(D) \supseteq C(W)} \ \operatorname{trace}\big(D^\dagger\, WW^\top\big) + \operatorname{trace}(D)$
    $= \min_{D \in \mathbb{S}^d_{\succeq 0},\ w_i \in C(D)} \ \sum_{i=1}^N w_i^\top D^\dagger w_i + \operatorname{trace}(D)$

($C(\cdot)$ denotes the column space, so the constraint relates the column spaces of $D$ and $W$; $D^\dagger$ is the pseudoinverse)
Used in A. Argyriou, T. Evgeniou, and M. Pontil, NIPS 06
Minimizer $D^* = \sqrt{WW^\top}$
For $\|W\|_*^2$, use the hard constraint $\operatorname{trace}(D) \le 1$. Then $D^* = \frac{\sqrt{WW^\top}}{\operatorname{trace}\sqrt{WW^\top}}$
Challenges related to the pseudoinverse:
Optimizer $D^* = \sqrt{WW^\top}$ might be rank deficient
Pseudoinverse $D^\dagger$ might be discontinuous when the rank of $D$ changes
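A quick numerical check of this characterization: form the candidate minimizer $D^* = \sqrt{WW^\top}$ from an eigendecomposition, verify the optimal value is $2\|W\|_*$, and verify a feasible perturbation does no better.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 4, 6
W = rng.standard_normal((d, N))

def objective(D):
    # trace(D^dagger W W^T) + trace(D)
    return np.trace(np.linalg.pinv(D) @ (W @ W.T)) + np.trace(D)

# Candidate minimizer D* = sqrt(W W^T), via the eigendecomposition of W W^T
eigval, U = np.linalg.eigh(W @ W.T)
D_star = (U * np.sqrt(np.clip(eigval, 0.0, None))) @ U.T

two_nuclear = 2.0 * np.linalg.svd(W, compute_uv=False).sum()
gap = abs(objective(D_star) - two_nuclear)     # ~0 at the minimizer
worse = objective(D_star + 0.1 * np.eye(d))    # a feasible PSD perturbation
print(gap, worse >= two_nuclear)
```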
51. Convexity property of $w^\top D^\dagger w$
Point-wise supremum of a function linear in $w$ and $D$:

    $\sup_{x \in \mathbb{R}^d} -(x^\top D x + 2 w^\top x) = \begin{cases} w^\top D^\dagger w & \text{if } D \in \mathbb{S}^d_{\succeq 0},\ w \in C(D), \\ \infty & \text{otherwise} \end{cases}$

Hence $w^\top D^\dagger w$ is jointly convex on $\{w \in \mathbb{R}^d,\ D \in \mathbb{S}^d_{\succeq 0} : w \in C(D)\}$
Here the pseudoinverse is “hidden” in the optimizer
Good for iterative algorithms
Curiosity: explicit Fenchel conjugate of $q(x) := x^\top D x$, for a fixed $D$ (up to a factor, $q^*(-2w)$)
Proof in the appendix of “Convex Optimization”, by Boyd and Vandenberghe
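The variational identity can be checked directly for a positive definite $D$, where the pseudoinverse is the ordinary inverse and the concave quadratic is maximized at $x^* = -D^{-1}w$.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
A = rng.standard_normal((d, d))
D = A @ A.T + 0.5 * np.eye(d)      # positive definite, so C(D) = R^d
w = rng.standard_normal(d)

# x -> -(x^T D x + 2 w^T x) is maximized at x* = -D^{-1} w,
# where it attains the value w^T D^{-1} w
x_star = -np.linalg.solve(D, w)
sup_val = -(x_star @ D @ x_star + 2.0 * w @ x_star)
print(abs(sup_val - w @ np.linalg.solve(D, w)))   # numerically zero

# Any other x attains a smaller value
x_other = rng.standard_normal(d)
print(-(x_other @ D @ x_other + 2.0 * w @ x_other) <= sup_val)
```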
52. 2nd characterization of the nuclear norm
The latter expression allows us to write

    $2\,\|W\|_* = \min_{D \in \mathbb{S}^d_{\succeq 0},\ w_i \in C(D)} \ \underbrace{\sum_{i=1}^N \sup_{x_i \in \mathbb{R}^d} -(x_i^\top D x_i + 2 w_i^\top x_i)}_{\sum_{i=1}^N w_i^\top D^\dagger w_i} + \operatorname{trace}(D)$

Sum of convex functions in the vectors $\{w_i\}_{i=1}^N$ (as well as in the auxiliary vectors $\{x_i\}_{i=1}^N$)
Common auxiliary matrix $D \in \mathbb{R}^{d \times d}$, ideally $d \ll N$
Can be “distributed” as a min-max problem with local copies of $D$ subject to agreement
53. Approximate regularization to guarantee $D^* \succeq \sqrt{\varepsilon}\, I_d$
Challenges related to the constraint $w \in C(D)$:
The set $\{w \in \mathbb{R}^d,\ D \in \mathbb{S}^d_{\succeq 0} : w \in C(D)\}$ is convex but not closed
(also, $\mathbb{S}^d_{\succ 0}$ is open)
$DD^\dagger w$ projects $w_i$ onto $C(D)$ but is state dependent
Instead, pad $W$ with a scaled identity, for a scalar $\varepsilon > 0$:

    $\big\|[\, W \mid \sqrt{\varepsilon}\, I_d \,]\big\|_* = \operatorname{trace}\sqrt{WW^\top + \varepsilon I_d}$
    $= \min_{D \in \mathbb{S}^d_{\succeq 0},\ C(W) \subseteq C(D)} \ \operatorname{trace}\big(D^\dagger (WW^\top + \varepsilon I_d)\big) + \operatorname{trace}(D)$

Minimizer $D^* = \sqrt{WW^\top + \varepsilon I_d} \succeq \sqrt{\varepsilon}\, I_d$
Replace the constraint $w \in C(D)$ by the dummy constraint $D \succeq \sqrt{\varepsilon}\, I_d$
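The padded-matrix identity can be checked numerically. The scalar ε below stands in for the regularization parameter that appears under the square roots on this slide (its value here is an illustrative assumption): the singular values of $[\,W \mid \sqrt{\varepsilon}\,I_d\,]$ are the square roots of the eigenvalues of $WW^\top + \varepsilon I_d$.

```python
import numpy as np

rng = np.random.default_rng(4)
d, N, eps = 4, 6, 0.1
W = rng.standard_normal((d, N))

# Nuclear norm of the padded matrix [ W | sqrt(eps) I_d ]
padded = np.hstack([W, np.sqrt(eps) * np.eye(d)])
lhs = np.linalg.svd(padded, compute_uv=False).sum()

# trace sqrt(W W^T + eps I), via the eigenvalues of the PSD matrix
rhs = np.sqrt(np.linalg.eigvalsh(W @ W.T + eps * np.eye(d))).sum()

print(abs(lhs - rhs))  # numerically zero
```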
54. Min-max problem with explicit agreement

    $\min_{w_i \in \mathcal{W}_i} \ \sum_{i=1}^N f^i(w_i) + \gamma \big\|[\, W \mid \sqrt{\varepsilon}\, I_d \,]\big\|_*$
    $= \min_{\substack{w_i \in \mathcal{W}_i,\ D_i \in \mathbb{R}^{d \times d} \\ D_i = D_j\ \forall i,j}} \ \sup_{\substack{x_i \in \mathbb{R}^d \\ Y_i \in \mathbb{R}^{d \times d}}} \ \sum_{i=1}^N F_i(\underbrace{w_i, D_i}_{\text{convex}}, \underbrace{x_i, Y_i}_{\text{concave}})$

with

    $F_i(w, D, x, Y) := f_i(w) + \gamma \operatorname{trace}\big(D\,(\underbrace{-xx^\top - N\, YY^\top}_{\text{approx. reg.}})\big) - 2\gamma\, w^\top x - \underbrace{\tfrac{2\gamma}{N} \operatorname{trace}(Y)}_{\text{approx. reg.}} + \tfrac{1}{N} \operatorname{trace}(D)$
55. Distributed saddle-point dynamics for nuclear optimization

    $w_i(k+1) = P_{\mathcal{W}}\big[\, w_i(k) - \eta_k \big( g_i(k) - 2\gamma\, x_i(k) \big) \,\big]$
    $D_i(k+1) = P_{\mathcal{D}}\big[\, D_i(k) - \eta_k \gamma \big( -x_i x_i^\top - N\, Y_i Y_i^\top + \tfrac{1}{N} I_d \big) + \sigma \sum_{j=1}^N a_{ij,k} \big( D_j(k) - D_i(k) \big) \,\big]$
    $x_i(k+1) = x_i(k) + \eta_k \gamma \big( -2 D_i(k) x_i(k) - 2 w_i(k) \big)$
    $Y_i(k+1) = Y_i(k) + \eta_k \gamma \big( -\tfrac{2}{N} D_i(k) Y_i(k) - \tfrac{2}{N} I_d \big)$

User $i$ does not need to share $w_i$ with its neighbors!
and $D_i \to \sqrt{\sum_{i=1}^N w_i w_i^\top + \varepsilon I_d}$
Complexity: $1/\sqrt{k}$ saddle-point error × projection onto $\{D \succeq c\, I_d\}$
PD(D) := (D − cId )2 + cId (e.g., using Newton)
48 / 68
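The projection onto $\{D \succeq c I_d\}$ amounts to clipping eigenvalues at $c$. The slide suggests Newton iterations (for the matrix square root behind the positive part); the sketch below simply uses an eigendecomposition, which gives the same map.

```python
import numpy as np

def proj_cone(D, c):
    """Project a symmetric D onto {X : X >= c I} by clipping eigenvalues at c.
    This equals (D - c I)_+ + c I, with (A)_+ the PSD part of A."""
    eigval, U = np.linalg.eigh(D)
    return (U * np.maximum(eigval, c)) @ U.T

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
D = (A + A.T) / 2.0                 # a symmetric, generally indefinite matrix
P = proj_cone(D, c=0.2)

print(np.linalg.eigvalsh(P).min())            # >= 0.2: the result is feasible
print(np.allclose(proj_cone(P, 0.2), P))      # projecting again changes nothing
```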
56. Theorem (DMN-JC Distributed saddle-point nuclear optimization)
Assume that $f_i$ is convex and the set $\mathcal{W}_i$ is convex and compact.
Also, let the sequence of weight-balanced communication digraphs be
nondegenerate, and
B-jointly-connected
Then, for a suitable choice of the consensus stepsize $\sigma$ and the learning rates $\{\eta_k\}$,
the saddle-point evaluation error of the running-time averages decreases as $1/\sqrt{k}$
Proof follows from developments in the next section.
57. Privacy preserving matrix completion
Figure: distributed saddle-point dynamics versus centralized subgradient descent over iterations $k$, tracking
the matrix fitting error $\frac{\|W(k) - Z\|_F}{\|Z\|_F}$,
the cost with nuclear regularization $\sum_{i=1}^N \sum_{j \in \Upsilon_i} (W_{ij}(k) - Z_{ij})^2 + \gamma \big\|[\, W(k) \mid \sqrt{\varepsilon}\, I_d \,]\big\|_*$, and
the disagreement of the local matrices $\big( \sum_{i=1}^N \big\| D_i(k) - \tfrac{1}{N} \sum_{i=1}^N D_i(k) \big\|_F^2 \big)^{1/2}$
58. End of Part III
Contributions
First multi-agent treatment of nuclear norm regularization
Privacy preserving matrix completion
Future work
Inexact projections
Compressed matrix communication
59. Part IV (and last): Saddle-point problems
Distributed optimization + constraints
60. Organization
Distributed constrained optimization
Previous work and limitations
The magic of agreement on the multiplier
Saddle-point problems with explicit agreement
Nuclear norm min-max characterization
Lagrangians
Our distributed saddle-point dynamics with Laplacian averaging
Theorem for our dynamics: $1/\sqrt{t}$ saddle-point evaluation error
Outline of the proof
61. Distributed constrained optimization

    $\min_{w_i \in \mathcal{W}_i\ \forall i,\ D \in \mathcal{D}} \ \sum_{i=1}^N f^i(w_i, D)$
    $\text{s.t.}\ \ g^1(w_1, D) + \dots + g^N(w_N, D) \le 0$

$\{w_i\}_{i=1}^N$ are local decision vectors, $D$ is a global decision vector
Applications:
Traffic and routing
Resource allocation
Optimal control
Network formation
62. Previous work by type of constraint
All agents know the constraint:

    $\min \ \sum_{i=1}^N f^i(x) \quad \text{s.t.}\ g(x) \le 0$

2011 D. Yuan, S. Xu, and H. Zhao
2012 M. Zhu and S. Martínez
D. Yuan, D. W. C. Ho, and S. Xu (to appear)
Agent $i$ knows only $g^i$:

    $\min_{w_i \in \mathcal{W}_i} \ \sum_{i=1}^N f^i(w_i) \quad \text{s.t.}\ \sum_{i=1}^N g^i(w_i) \le 0$

2010 D. Mosk-Aoyama, T. Roughgarden and D. Shah (only linear constraints)
2014 T.-H. Chang, A. Nedić and A. Scaglione (primal-dual perturbation methods)
63. Distributing the constraint via agreement on multipliers

    $\min_{w_i \in \mathcal{W}_i,\ D \in \mathcal{D}} \ \sum_{i=1}^N f^i(w_i, D) \quad \text{s.t.}\ g^1(w_1, D) + \dots + g^N(w_N, D) \le 0$

Dual decomposition, if strong duality holds:

    $\min_{w_i \in \mathcal{W}_i,\ D \in \mathcal{D}} \ \max_{z \in \mathbb{R}^m_{\ge 0}} \ \sum_{i=1}^N f^i(w_i, D) + z^\top \sum_{i=1}^N g^i(w_i, D)$
    $= \min_{w_i \in \mathcal{W}_i,\ D \in \mathcal{D}} \ \max_{\substack{z_i \in \mathbb{R}^m_{\ge 0} \\ z_i = z_j\ \forall i,j}} \ \sum_{i=1}^N \big( f^i(w_i, D) + z_i^\top g^i(w_i, D) \big)$
    $= \min_{\substack{w_i \in \mathcal{W}_i,\ D_i \in \mathcal{D} \\ D_i = D_j\ \forall i,j}} \ \max_{\substack{z_i \in \mathbb{R}^m_{\ge 0} \\ z_i = z_j\ \forall i,j}} \ \sum_{i=1}^N \big( f^i(w_i, D_i) + z_i^\top g^i(w_i, D_i) \big)$
64. Saddle-point problems with explicit agreement
A more general framework:

    $\min_{\substack{w \in \mathcal{W} \\ (D_1, \dots, D_N) \in \mathcal{D}^N \\ D_i = D_j\ \forall i,j}} \ \max_{\substack{\mu \in \mathcal{M} \\ (z_1, \dots, z_N) \in \mathcal{Z}^N \\ z_i = z_j\ \forall i,j}} \ \phi\big( \underbrace{w, (D_1, \dots, D_N)}_{\text{convex}},\ \underbrace{\mu, (z_1, \dots, z_N)}_{\text{concave}} \big)$

Problem unstudied in the literature
Fits the min-max formulation of nuclear norm regularization (the concave part is quadratic)
Fits convex-concave functions arising from Lagrangians (the concave part is linear)
65. Our dynamics
Subgradient saddle-point dynamics with Laplacian averaging
(provided a saddle point exists under agreement)

    $\hat{w}_{t+1} = w_t - \eta_t\, g_{w_t}$
    $\hat{D}_{t+1} = D_t \underbrace{- \sigma L_t D_t}_{\text{Lap. aver.}} - \eta_t\, g_{D_t}$
    $\hat{\mu}_{t+1} = \mu_t + \eta_t\, g_{\mu_t}$
    $\hat{z}_{t+1} = z_t \underbrace{- \sigma L_t z_t}_{\text{Lap. aver.}} + \eta_t\, \hat{g}_{z_t}$
    $(w_{t+1}, D_{t+1}, \mu_{t+1}, z_{t+1}) = P_{\mathcal{S}}\big( \hat{w}_{t+1}, \hat{D}_{t+1}, \hat{\mu}_{t+1}, \hat{z}_{t+1} \big)$  (orthogonal projection)

$g_{w_t}, g_{D_t}, g_{\mu_t}, \hat{g}_{z_t}$ are subgradients and supergradients of $\phi(w, D, \mu, z)$,
and $\mathcal{S} = \mathcal{W} \times \mathcal{D}^N \times \mathcal{M} \times \mathcal{Z}^N$
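To make the dynamics concrete, here is a toy two-agent instance of saddle-point subgradient dynamics with Laplacian averaging on the local multipliers: minimize $\sum_i (w_i - 1)^2$ subject to $w_1 + w_2 \le 1$, whose KKT point is $w_i = 1/2$, $z = 1$. The objective, graph, and constant stepsizes are illustrative assumptions (the theorem below uses the Doubling Trick schedule), and this sketch omits the $D$ and $\mu$ blocks.

```python
import numpy as np

# phi(w, z) = sum_i [ (w_i - 1)^2 + z_i * (w_i - 1/2) ], with agreement z_1 = z_2
L = np.array([[1.0, -1.0], [-1.0, 1.0]])   # Laplacian of the 2-agent graph
w = np.zeros(2)
z = np.zeros(2)
eta, sigma = 0.05, 0.3

for _ in range(5000):
    grad_w = 2.0 * (w - 1.0) + z           # subgradient in the primal variables
    grad_z = w - 0.5                       # supergradient in the multipliers
    w = w - eta * grad_w                   # descent step
    # ascent step + Laplacian averaging, projected onto z >= 0
    z = np.maximum(z - sigma * (L @ z) + eta * grad_z, 0.0)

print(w, z)  # -> approximately [0.5, 0.5] and [1.0, 1.0]
```

Linearizing around the saddle point gives the contraction factor $1 - \eta$ per step, so the iterates converge here even without averaging.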
66. Theorem (DMN-JC Distributed saddle-point approximation)
Assume that
$\phi(w, D, \mu, z)$ is convex in $(w, D) \in \mathcal{W} \times \mathcal{D}^N$ and concave in $(\mu, z) \in \mathcal{M} \times \mathcal{Z}^N$
The dynamics is bounded
The sequence of weight-balanced communication digraphs is
nondegenerate, and
B-jointly-connected
For $\tilde\delta' \in (0, 1)$ and $\tilde\delta := \min\big\{ \tilde\delta',\ (1 - \tilde\delta')\,\tfrac{\delta}{d_{\max}} \big\}$,
choosing $\sigma \in \big[ \tfrac{\tilde\delta}{\delta},\ \tfrac{1 - \tilde\delta}{d_{\max}} \big]$ and $\{\eta_t\}$ based on the Doubling Trick scheme,
then for any saddle point $(w^*, D^*, \mu^*, z^*)$ of $\phi$ within agreement,

    $-\frac{\alpha}{\sqrt{t-1}} \ \le\ \phi(w^{\mathrm{av}}_t, D^{\mathrm{av}}_t, \mu^{\mathrm{av}}_t, z^{\mathrm{av}}_t) - \phi(w^*, D^*, \mu^*, z^*) \ \le\ \frac{\alpha}{\sqrt{t-1}}$

where $w^{\mathrm{av}}_t := \frac{1}{t} \sum_{s=1}^t w_s$, etc.
67. Outline of the proof
Inequality techniques from A. Nedić and A. Ozdaglar, 2009
Saddle-point evaluation error

    $t\,\phi(w^{\mathrm{av}}_{t+1}, D^{\mathrm{av}}_{t+1}, \mu^{\mathrm{av}}_{t+1}, z^{\mathrm{av}}_{t+1}) - t\,\phi(w^*, D^*, \mu^*, z^*)$  (2)

at running-time averages, $w^{\mathrm{av}}_{t+1} := \frac{1}{t} \sum_{s=1}^t w_s$, etc.
Bound for (2) in terms of
initial conditions
bounds on the subgradients and states of the dynamics
disagreement
sum of learning rates
Input-to-state stability with respect to agreement

    $\|L_K D_t\|_2 \le C_I \|D_1\|_2 \big(1 - \tfrac{\tilde\delta}{4N^2}\big)^{(t-1)/B} + C_U \max_{1 \le s \le t-1} \|d_s\|_2$

Doubling Trick scheme: for $m = 0, 1, 2, \dots, \lceil \log_2 t \rceil$, take $\eta_s = \frac{1}{\sqrt{2^m}}$ in each period of $2^m$ rounds $s = 2^m, \dots, 2^{m+1} - 1$
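The Doubling Trick schedule can be written down directly; below is a small helper (the function name and the decision to truncate the final partial period are my own choices).

```python
import math

def doubling_trick_stepsizes(T):
    """eta_s = 1/sqrt(2^m) for rounds s = 2^m, ..., 2^(m+1)-1, m = 0, 1, 2, ..."""
    etas = []
    m = 0
    while 2 ** m <= T:
        etas.extend([1.0 / math.sqrt(2 ** m)] * (2 ** m))  # constant within a period
        m += 1
    return etas[:T]   # truncate the last (possibly partial) period

etas = doubling_trick_stepsizes(10)
print(etas)  # 1 step at 1, 2 steps at 1/sqrt(2), 4 at 1/2, then 1/sqrt(8)
```

Within each period the stepsize is constant, so the standard constant-stepsize bound applies per period, and summing the geometric series over periods yields the $O(\sqrt{T})$ total.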
68. End of Part IV
Contributions
Distributed subgradient methods for saddle-point problems
Particularize to
saddle points of Lagrangians coming from separable constraints
the min-max characterization of the nuclear norm
Future work
Bounds on Lagrange vectors and matrices in a distributed way
Boundedness properties of the dynamics without truncated projections
Semidefinite constraints with chordal sparsity, where agents update the entries corresponding to maximal cliques subject to agreement on the intersections
71. “Noise-to-state exponentially stable distributed convex optimization on weight-balanced digraphs,” SIAM Journal on Control and Optimization, under revision
“pth moment noise-to-state stability of stochastic differential equations with persistent noise,” SIAM Journal on Control and Optimization 52 (4) (2014), 2399-2421
“Distributed online convex optimization over jointly connected digraphs,” IEEE Transactions on Network Science and Engineering 1 (1) (2014), 23-37
And conference versions presented in ACC 13, CDC 13, MTNS 14
72. “Distributed optimization for multi-task learning via nuclear-norm approximation,” IFAC Workshop on Distributed Estimation and Control in Networked Systems
“Distributed subgradient methods for saddle-point problems,” CDC 15 (in preparation for Transactions on Automatic Control)
73. Next steps
Algorithmic principles in biology
Homeostasis in neocortex:
noisy computational units
distribution of firing rates
recovery after disruptions
Differentiation in populations of cells:
distribution of phenotypes
recovery after injury or disease
cancer, stem cells
74. Beyond biology
Control and security of cyber-physical systems:
cooperative optimization + distributed trust
heterogeneous agents' dynamics & incentives
malicious minority