Distributed algorithms for convex optimization
David Mateos-Núñez, Jorge Cortés
University of California, San Diego
{dmateosn,cortes}@ucsd.edu
Thesis Defense
Committee:
Tara Javidi, Philip E. Gill, Maurício de Oliveira, and Miroslav Krstić
UC San Diego, August 20, 2015
1 / 68
Context: decentralized, peer-to-peer systems
Sensor networks Medical diagnosis
Recommender systems Formation control
2 / 68
Distributed paradigm
Ratings table: users Tara, Philip, Mauricio, Miroslav; movies Toy Story, Jurassic Park, ...

W = [ w^1 w^2 w^3 w^4 ]   (column w^i collects user i's ratings)

min_{w^i ∈ R^d} Σ_{i=1}^4 f^i(w^i) + γ ‖[w^1 · · · w^4]‖_*

Ratings {w^i} are private
3 / 68
Distributed algorithms for four classes of problems
Noisy channels
min_{x ∈ R^d} Σ_{i=1}^N f^i(x)

Online optimization
R^j_T := Σ_{t=1}^T Σ_{i=1}^N f^i_t(x^j_t) − min_{x ∈ R^d} Σ_{t=1}^T Σ_{i=1}^N f^i_t(x)

Multi-task learning
min_{w^i ∈ R^d} Σ_{i=1}^N f^i(w^i) + γ ‖W‖_*

Constraints
min_{w^i ∈ W^i, D ∈ D} Σ_{i=1}^N f^i(w^i, D)   s.t.   Σ_{i=1}^N g^i(w^i, D) ≤ 0
4 / 68
Part I: Noise-to-state stable coordination
Distributed optimization Noisy communication
5 / 68
Review of (consensus-based) distributed convex optimization
Local information
Laplacian matrix
Disagreement feedback
Pioneering works
Lagrangian approach
6 / 68
Using local information
x* ∈ argmin_{x ∈ R^d} Σ_{i=1}^N f^i(x)

Agent i has access to f^i
Agent i can share its estimate x^i_t with “neighboring” agents

Local objectives f^1, ..., f^5 assigned to the nodes of a digraph with adjacency matrix

A = [ 0      0      a_13   a_14   0
      a_21   0      0      0      a_25
      a_31   a_32   0      0      0
      a_41   0      0      0      0
      0      a_52   0      0      0    ]
Parallel computations: Tsitsiklis 84, Bertsekas and Tsitsiklis 95
Consensus: Jadbabaie et al. 03, Olfati-Saber, Murray 04, Boyd et al. 05
7 / 68
The Laplacian matrix
L = diag(A1) − A =
[ a_13+a_14   0           −a_13        −a_14   0
  −a_21       a_21+a_25   0            0       −a_25
  −a_31       −a_32       a_31+a_32    0       0
  −a_41       0           0            a_41    0
  0           −a_52       0            0       a_52 ]

Nullspace is agreement ⇔ spanning tree
The mirror graph has Laplacian (L + Lᵀ)/2

Consensus via feedback on disagreement
proportional
proportional-integral feedback
8 / 68
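A minimal numpy sketch (not from the slides) of the two facts above, assuming the 5-agent digraph drawn on the previous slide with unit weights on its edges: L = diag(A1) − A annihilates the agreement direction, and its nullspace is exactly the agreement space (rank N − 1) when the digraph contains a spanning tree.

```python
import numpy as np

# Hypothetical unit weights a_ij = 1 on the edges of the 5-agent digraph above.
A = np.zeros((5, 5))
for (i, j) in [(1, 3), (1, 4), (2, 1), (2, 5), (3, 1), (3, 2), (4, 1), (5, 2)]:
    A[i - 1, j - 1] = 1.0

L = np.diag(A @ np.ones(5)) - A           # L = diag(A 1) - A

# The agreement direction 1 is always in the nullspace ...
print(np.allclose(L @ np.ones(5), 0))      # True
# ... and the nullspace equals it (rank N-1) iff the digraph has a spanning tree.
print(np.linalg.matrix_rank(L) == 4)       # True for this graph
```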
Disagreement feedback
M. Srinivasan, S. Zhang, M. Lehrer, and T. Collett, “Honeybee navigation en route to the goal: visual flight control and odometry”, J. Exp. Biol., 1996
Relative positions and angles
g1(w1) + · · · + gN(wN) = 0
Agreement
on the multiplier
9 / 68
Pioneering works
local minimization versus local consensus

A. Nedić and A. Ozdaglar, TAC, 09
x^i_{k+1} = Σ_{j=1}^N a_ij x^j_k − η_k g^i_k,   with η_k decreasing,
where g^i_k ∈ ∂f^i(x^i_k) and A = (a_ij) is doubly stochastic

J. C. Duchi, A. Agarwal, and M. J. Wainwright, TAC, 12
10 / 68
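A minimal sketch (not the authors' code) of the consensus + subgradient update above, assuming a doubly stochastic weight matrix over a complete graph and hypothetical quadratic local objectives f^i(x) = ½(x − c_i)²:

```python
import numpy as np

N, d, iters = 4, 1, 200
c = np.array([4.0, -2.0, 1.0, 3.0])          # hypothetical local data
A = np.full((N, N), 1.0 / N)                 # doubly stochastic weights (complete graph)
x = np.zeros((N, d))

for k in range(1, iters + 1):
    eta = 1.0 / np.sqrt(k)                   # decreasing stepsize eta_k
    g = x - c[:, None]                       # local (sub)gradients of f^i at x^i_k
    x = A @ x - eta * g                      # x^i_{k+1} = sum_j a_ij x^j_k - eta_k g^i_k

print(x.ravel(), "vs optimizer", c.mean())   # estimates cluster around mean(c) = 1.5
```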
The Lagrangian approach
min_{x^1,...,x^N ∈ R^d, x^1 = ··· = x^N} Σ_{i=1}^N f^i(x^i) = min_{x ∈ (R^d)^N, Lx = 0} f̃(x)

Convex-concave augmented Lagrangian
F(x, z) := f̃(x)  [network objective]  + zᵀ(L x)  [multiplier × constraint]  + (γ̃/2) xᵀ L x  [augmented Lagrangian]

Saddle-point dynamics for symmetric Laplacian L:
ẋ = −∂F(x, z)/∂x = −∇f̃(x) − γ̃ Lx − Lz
ż =  ∂F(x, z)/∂z = Lx

J. Wang and N. Elia, Allerton, 10      And when L ≠ Lᵀ?
11 / 68
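A minimal sketch (my own discretization, not the slides'): forward-Euler integration of the saddle-point dynamics above on an undirected ring, assuming hypothetical quadratic local objectives f^i(x) = ½(x − c_i)² so that ∇f̃(x) = x − c:

```python
import numpy as np

N = 4
c = np.array([4.0, -2.0, 1.0, 3.0])
A = np.zeros((N, N))                   # undirected ring -> symmetric Laplacian
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
L = np.diag(A.sum(1)) - A

gamma, h = 1.0, 0.01                   # augmentation gain and Euler stepsize
x, z = np.zeros(N), np.zeros(N)
for _ in range(20000):
    grad = x - c                       # gradient of f~(x) = sum_i f^i(x_i)
    x, z = x + h * (-grad - gamma * L @ x - L @ z), z + h * (L @ x)

print(x, "->", c.mean())               # each x_i approaches mean(c) = 1.5
```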
Lagrangian approach for directed graphs
Lagrangian F(x, z) := f̃(x) + zᵀ L x + (γ̃/2) xᵀ(L + Lᵀ) x
still convex-concave because
Weight-balanced ⇔ 1ᵀL = 0 ⇔ L + Lᵀ ⪰ 0

ẋ = −∂F(x, z)/∂x = −∇f̃(x) − γ̃ (1/2)(L + Lᵀ) x − Lᵀz   (Not distributed!)
ż =  ∂F(x, z)/∂z = Lx

B. Gharesifard and J. Cortés, CDC, 12
12 / 68
Lagrangian approach for directed graphs
Lagrangian F(x, z) := f̃(x) + zᵀ L x + (γ̃/2) xᵀ(L + Lᵀ) x
still convex-concave because
Weight-balanced ⇔ 1ᵀL = 0 ⇔ L + Lᵀ ⪰ 0

ẋ = −∂F(x, z)/∂x = −∇f̃(x) − γ̃ (1/2)(L + Lᵀ) x − Lᵀz   (Not distributed!)
      changed to   −∇f̃(x) − γ̃ Lx − Lz
ż =  ∂F(x, z)/∂z = Lx

B. Gharesifard and J. Cortés, CDC, 12
12 / 68
Now... what happens when there is noise?
Previous work and limitations
our coordination algorithm under noise
theorem of exponential noise-to-state stability
simulations
outline of the proof
13 / 68
Previous work with noisy communication channels
S. S. Ram, A. Nedić, and V. V. Veeravalli, 10
K. Srivastava and A. Nedić, 11

x^i_{k+1} = x^i_k − η^i_k (g^i_k + noise) + σ^i_k Σ_{j=1}^N a_{ij,k} (x^j_k + noise − x^i_k)

where g^i_k ∈ ∂f^i(x^i_k) and A_k = (a_{ij,k}) is an adjacency matrix

Limitations:
vanishing {η^i_k}_{k≥1} and {σ^i_k}_{k≥1}
sublinear convergence even in the strongly convex case
noise of fixed variance
projection step or bounded subgradients required
14 / 68
Our algorithm: proportional-integral feedback + noise

ẋ^i(t) = −∇f_i(x^i(t)) + noise
          + γ̃ Σ_{j=1, j≠i}^N a_ij ( x^j(t) + noise − x^i(t) ) + Σ_{j=1, j≠i}^N a_ij ( z^j(t) + noise − z^i(t) )
ż^i(t) = −Σ_{j=1, j≠i}^N a_ij ( x^j(t) + noise − x^i(t) )

Stochastic differential equation in compact form
dx = −∇f̃(x) dt − γ̃ Lx dt − Lz dt  [structured integral feedback]  + G₁(x, z, t) Σ₁(t) dB(t)  [noise]
dz = Lx dt + G₂(x, z, t) Σ₂(t) dB(t)  [noise]
15 / 68
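A minimal sketch (my own, not the paper's code): Euler-Maruyama simulation of the noisy proportional-integral dynamics, assuming hypothetical quadratic local objectives f^i(x) = ½(x − c_i)² and additive noise of size s (a special case of the G·Σ dB terms above):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
c = np.array([4.0, -2.0, 1.0, 3.0])
A = np.zeros((N, N))                          # directed ring: weight-balanced,
for i in range(N):                            # strongly connected
    A[i, (i + 1) % N] = 1.0
L = np.diag(A.sum(1)) - A

gamma, h, s, T = 2.0, 1e-3, 0.1, 20.0
x, z = np.zeros(N), np.zeros(N)
for _ in range(int(T / h)):
    dB = np.sqrt(h) * rng.standard_normal(N)  # Brownian increment
    x, z = (x + h * (-(x - c) - gamma * L @ x - L @ z) + s * dB,
            z + h * (L @ x) + s * dB)

# Noise-to-state stability: the second moment stays within a noise-dependent ball.
print(np.mean((x - c.mean()) ** 2))
```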
Theorem (DMN-JC 2nd moment noise-to-state exponential stability)
If
the digraph is strongly connected and weight-balanced,
each f_i is convex, twice differentiable with bounded Hessian, and
at least one f_{i0} is strongly convex,
then
E[ ‖x(t) − x*‖₂²  (distance to the optimizer, x* = 1 ⊗ x_min)  +  ‖z(t) − z*‖_{L_K}²  (distance modulo agreement) ]
  ≤ C_μ ( ‖x₀ − x*‖₂² + ‖z₀ − z*‖_{L_K}² ) e^{−D_μ (t − t₀)}   (initial conditions)
    + C_θ sup_{t₀ ≤ τ ≤ t} ‖ [Σ₁(τ); Σ₂(τ)] ‖_F²   (ultimate bound)

‖Σ(t)‖_F := √(trace(Σ(t)Σ(t)ᵀ)) gives the size of the noise
16 / 68
Simulations
Figure: evolution of the agents' estimates {[x^i_1(t), x^i_2(t)]} over time t

Figure: evolution of the second moment E‖x(t) − 1 ⊗ x_min‖₂² over time, for γ̃ = 2, 3, 4

Figure: empirical noise gain; ultimate bound for E‖x(t) − 1 ⊗ x_min‖₂² (at t = 17) versus the size of the noise s, for γ̃ = 2, 3, 4

N = 4 agents communicating over a directed ring with local objectives
f₁(x₁, x₂) = ½((x₁ − 4)² + (x₂ − 3)²)
f₂(x₁, x₂) = x₁ + 3x₂ − 2
f₃(x₁, x₂) = log(e^{x₁+3} + e^{x₂+1})
f₄(x₁, x₂) = (x₁ + 2x₂ + 5)² + (x₁ − x₂ − 4)²
17 / 68
Outline of the proof
KKT conditions VERSUS equilibria of the underlying modified saddle-point ODE
Existence & uniqueness of solutions of the SDEs
Lyapunov analysis of SDEs
  pth moment NSS Lyapunov functions
  “pth moment noise-to-state stability of stochastic differential equations with persistent noise”, DMN-JC
Strategy for directed graphs
  “Distributed continuous-time convex optimization on weight-balanced digraphs”, B. Gharesifard and J. Cortés
δ-cocoercivity of F := (I_{Nd} + P)(∇f̃(x) + Lx):
  (x − x′)ᵀ(F(x) − F(x′)) ≥ δ ‖F(x) − F(x′)‖₂²
18 / 68
End of Part I
Contributions
2nd-moment noise-to-state exponentially stable coordination algorithm
First proof of exponential convergence with only one strongly convex objective
Novel general Lyapunov stability analysis for SDEs
Open problems
Coordination algorithms
with
communication delays
19 / 68
Part II: Distributed online optimization
Distributed optimization Online optimization
Case study: medical diagnosis
20 / 68
Review of
online convex optimization
Motivation: medical diagnosis
Sequential decision making
Regret
Classical results
21 / 68
Medical diagnosis
Medical findings, symptoms:
age factor
amnesia before impact
deterioration in GCS score
open skull fracture
loss of consciousness
vomiting
Does the patient need a computerized tomography (CT) scan?
“The Canadian CT Head Rule for patients with minor head injury”
22 / 68
Different kind of optimization: sequential decision making
In the diagnosis example:
Each round t ∈ {1, . . . , T}
question (features, medical findings): wt
decision (about using CT): h(xt, wt)
outcome (by CT findings/follow up of patient): yt
loss: l(yt h(xt, wt))
Choose xt & Incur loss ft(xt) := l(yt h(xt, wt))
Goal: sublinear regret
R(u, T) := Σ_{t=1}^T f_t(x_t) − Σ_{t=1}^T f_t(u) ∈ o(T)
using “historical observations”
23 / 68
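A minimal illustrative sketch (not from the slides) of computing R(u, T) for a stream of scalar quadratic losses f_t(x) = (x − y_t)² with hypothetical observations y_t, using the follow-the-leader decision x_t = mean of the past observations:

```python
import numpy as np

def regret(losses, xs, u):
    """R(u, T) = sum_t f_t(x_t) - sum_t f_t(u)."""
    return sum(f(x) for f, x in zip(losses, xs)) - sum(f(u) for f in losses)

rng = np.random.default_rng(1)
ys = rng.normal(0.5, 0.1, size=100)                 # hypothetical outcomes y_t
losses = [lambda x, y=y: (x - y) ** 2 for y in ys]
xs = np.zeros(100)
xs[1:] = np.cumsum(ys)[:-1] / np.arange(1, 100)      # x_t = mean of y_1..y_{t-1}

print(regret(losses, xs, u=ys.mean()))               # small: grows sublinearly in T
```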
Significance of sublinear regret
If the regret is sublinear,
  Σ_{t=1}^T f_t(x_t) ≤ Σ_{t=1}^T f_t(u) + o(T),
then
  lim_{T→∞} (1/T) Σ_{t=1}^T f_t(x_t)   [loss of decisions “on the fly”]
  ≤ lim_{T→∞} (1/T) Σ_{t=1}^T f_t(u)   [loss of any decision in hindsight]

In temporal average, the online decisions {x_t}_{t=1}^T perform as well as the best decision in hindsight
“No regrets, my friend”
24 / 68
Some classical results
Projected gradient descent:
  x_{t+1} = Π_S(x_t − η_t ∇f_t(x_t)),   (1)
where Π_S is the projection onto a compact set S ⊆ R^d, and ‖∇f‖₂ ≤ H

Follow-the-Regularized-Leader:
  x_{t+1} = argmin_{y∈S} Σ_{s=1}^t f_s(y) + ψ(y)

Martin Zinkevich, 03
  Alg. (1) achieves O(√T) regret under convexity with η_t = 1/√t
Elad Hazan, Amit Agarwal, and Satyen Kale, 07
  Alg. (1) achieves O(log T) regret under p-strong convexity with η_t = 1/(pt)

Others: Online Newton Step, Follow-the-Regularized-Leader, etc.
25 / 68
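A minimal sketch (the standard Zinkevich-style online projected gradient descent, not tied to the slides' experiments), assuming η_t = 1/√t and S taken as a Euclidean ball of radius R:

```python
import numpy as np

def project_ball(x, R=1.0):
    """Euclidean projection onto S = {x : ||x||_2 <= R}."""
    n = np.linalg.norm(x)
    return x if n <= R else (R / n) * x

def online_pgd(grads, d, R=1.0):
    """grads[t](x) returns the (sub)gradient of f_t at x; returns x_1, ..., x_T."""
    x = np.zeros(d)
    decisions = []
    for t, grad in enumerate(grads, start=1):
        decisions.append(x.copy())
        x = project_ball(x - grad(x) / np.sqrt(t), R)   # x_{t+1} = Pi_S(x_t - eta_t grad)
    return decisions
```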
Combination of the distributed and online frameworks
Motivation
Previous work and limitations
our coordination algorithm in the online case
theorems of O(
√
T) & O(log T) agent regrets
simulations
outline of the proof
26 / 68
Resuming the diagnosis example:
Health center i ∈ {1, . . . , N} takes care of a set of patients at time t
Goal: sublinear agent regret
R^j(u, T) := Σ_{t=1}^T Σ_{i=1}^N f^i_t(x^j_t) − Σ_{t=1}^T Σ_{i=1}^N f^i_t(u) ∈ o(T)
using “local information” & “historical observations”
27 / 68
Challenge: Coordinate hospitals
Local online objectives f^1_t, ..., f^5_t assigned to the health centers, for t ∈ {1, . . . , T}
Need to design distributed online algorithms
Private patient data; share only prediction models
28 / 68
Previous work on consensus-based online algorithms
F. Yan, S. Sundaram, S. V. N. Vishwanathan, and Y. Qi, TAC
  Projected Subgradient Descent
  log(T) regret (local strong convexity & bounded subgradients)
  √T regret (convexity & bounded subgradients)
  Both analyses require a projection onto a compact set
S. Hosseini, A. Chapman, and M. Mesbahi, CDC, 13
  Dual Averaging
  √T regret (convexity & bounded subgradients)
  General regularized projection onto a closed convex set
K. I. Tsianos and M. G. Rabbat, arXiv, 12
  Projected Subgradient Descent
  Empirical risk as opposed to regret analysis
Limitation: the communication digraph in all cases is fixed and strongly connected
29 / 68
Our contributions (informally)
time-varying, weight-balanced communication digraphs under B-joint connectivity
unconstrained optimization (no projection step onto a bounded set)
log T regret (local strong convexity & bounded subgradients)
√T regret (convexity & β-centrality with β ∈ (0, 1] & bounded subgradients)

f^i_t is β-central in Z ⊆ R^d \ X*, where X* = { x* : 0 ∈ ∂f^i_t(x*) }, if for all x ∈ Z there exists x* ∈ X* such that
  ( −ξ_xᵀ (x* − x) ) / ( ‖ξ_x‖₂ ‖x* − x‖₂ ) ≥ β,   for each ξ_x ∈ ∂f^i_t(x)
30 / 68
Our algorithm under jointly connected topologies
x^i_{t+1} = x^i_t − η_t g_{x^i_t}

Subgradient descent on previous local objectives, g_{x^i_t} ∈ ∂f^i_t
31 / 68
Our algorithm under jointly connected topologies
x^i_{t+1} = x^i_t − η_t g_{x^i_t} + σ γ̃ Σ_{j=1, j≠i}^N a_{ij,t} ( x^j_t − x^i_t )

Proportional feedback on disagreement
31 / 68
Our algorithm under jointly connected topologies
x^i_{t+1} = x^i_t − η_t g_{x^i_t} + σ ( γ̃ Σ_{j=1, j≠i}^N a_{ij,t} ( x^j_t − x^i_t ) + Σ_{j=1, j≠i}^N a_{ij,t} ( z^j_t − z^i_t ) )
z^i_{t+1} = z^i_t − σ Σ_{j=1, j≠i}^N a_{ij,t} ( x^j_t − x^i_t )

Integral feedback on disagreement
31 / 68
Our algorithm under jointly connected topologies
x^i_{t+1} = x^i_t − η_t g_{x^i_t} + σ ( γ̃ Σ_{j=1, j≠i}^N a_{ij,t} ( x^j_t − x^i_t ) + Σ_{j=1, j≠i}^N a_{ij,t} ( z^j_t − z^i_t ) )
z^i_{t+1} = z^i_t − σ Σ_{j=1, j≠i}^N a_{ij,t} ( x^j_t − x^i_t )

Union of graphs over intervals of length B is strongly connected.

Figure: local objectives f^1_t, ..., f^5_t over time-varying digraphs at times t, t + 1 (with edges a_{14,t+1}, a_{41,t+1}), and t + 2
31 / 68
Our algorithm under jointly connected topologies
x^i_{t+1} = x^i_t − η_t g_{x^i_t} + σ ( γ̃ Σ_{j=1, j≠i}^N a_{ij,t} ( x^j_t − x^i_t ) + Σ_{j=1, j≠i}^N a_{ij,t} ( z^j_t − z^i_t ) )
z^i_{t+1} = z^i_t − σ Σ_{j=1, j≠i}^N a_{ij,t} ( x^j_t − x^i_t )

Compact representation
[ x_{t+1} ; z_{t+1} ] = [ x_t ; z_t ] − σ [ γ̃L_t  L_t ; −L_t  0 ] [ x_t ; z_t ] − η_t [ g̃_{x_t} ; 0 ]
31 / 68
Our algorithm under jointly connected topologies
x^i_{t+1} = x^i_t − η_t g_{x^i_t} + σ ( γ̃ Σ_{j=1, j≠i}^N a_{ij,t} ( x^j_t − x^i_t ) + Σ_{j=1, j≠i}^N a_{ij,t} ( z^j_t − z^i_t ) )
z^i_{t+1} = z^i_t − σ Σ_{j=1, j≠i}^N a_{ij,t} ( x^j_t − x^i_t )

Compact representation & generalization
[ x_{t+1} ; z_{t+1} ] = [ x_t ; z_t ] − σ [ γ̃L_t  L_t ; −L_t  0 ] [ x_t ; z_t ] − η_t [ g̃_{x_t} ; 0 ]

v_{t+1} = (I − σ E ⊗ L_t) v_t − η_t g_t
31 / 68
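A minimal sketch (my own, not the authors' code) of one round of the proportional-integral distributed online update in its compact matrix form, for scalar states and a time-varying Laplacian L_t; the digraph, stepsizes, and local losses below are hypothetical:

```python
import numpy as np

def pi_online_step(x, z, L_t, g_x, eta_t, sigma=0.1, gamma=1.0):
    """One step of x_{t+1} = x_t - sigma*(gamma*L_t x_t + L_t z_t) - eta_t g_x,
    z_{t+1} = z_t + sigma*L_t x_t."""
    x_new = x - sigma * (gamma * L_t @ x + L_t @ z) - eta_t * g_x
    z_new = z + sigma * (L_t @ x)
    return x_new, z_new

# Hypothetical example: N = 3 agents on a directed cycle at time t,
# local subgradients of f^i_t(x) = |x - y^i_t|.
L_t = np.array([[1., -1., 0.], [0., 1., -1.], [-1., 0., 1.]])
x, z = np.array([0., 1., -1.]), np.zeros(3)
g_x = np.sign(x - np.array([0.2, 0.4, 0.1]))
x, z = pi_online_step(x, z, L_t, g_x, eta_t=0.5)
print(x, z)
```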
Theorem (DMN-JC Sublinear agent regret)
Assume that
  {f^1_t, ..., f^N_t}_{t=1}^T are convex functions on R^d with H-bounded subgradient sets, nonempty and uniformly bounded sets of minimizers, and p-strongly convex in a sufficiently large neighborhood of their minimizers
  The sequence of weight-balanced communication digraphs is δ-nondegenerate and B-jointly connected
  E ∈ R^{K×K} is diagonalizable with positive real eigenvalues
For δ̃′ ∈ (0, 1) and δ̃ := min{ δ̃′, (1 − δ̃′) λ_min(E) δ / (λ_max(E) d_max) },
choosing σ ∈ [ δ̃ / (λ_min(E) δ), (1 − δ̃) / (λ_max(E) d_max) ] and η_t = 1/(p t),
  R^j(u, T) ≤ C ( ‖u‖₂² + 1 + log T ),
for any j ∈ {1, . . . , N} and u ∈ R^d
32 / 68
Theorem (DMN-JC Sublinear agent regret)
Assume that
  {f^1_t, ..., f^N_t}_{t=1}^T are convex functions on R^d with H-bounded subgradient sets, nonempty and uniformly bounded sets of minimizers, and p-strongly convex in a sufficiently large neighborhood of their minimizers
  The sequence of weight-balanced communication digraphs is δ-nondegenerate and B-jointly connected
  E ∈ R^{K×K} is diagonalizable with positive real eigenvalues
Substituting strong convexity by β-centrality, for β ∈ (0, 1],
  R^j(u, T) ≤ C ‖u‖₂² √T,
for any j ∈ {1, . . . , N} and u ∈ R^d
32 / 68
Simulations:
acute brain finding revealed on Computerized Tomography
33 / 68
Simulations
Figure: agents' estimates {x^i_{t,7}}_{i=1}^N and {x^i_{t,8}}_{i=1}^N versus time t, compared with the centralized solution

Figure: average regret max_j (1/T) R^j( x*_T, {Σ_i f^i_t}_{t=1}^T ) versus the time horizon T, for proportional-integral disagreement feedback, proportional disagreement feedback, and the centralized algorithm

f^i_t(x) = Σ_{s ∈ P^i_t} l(y_s h(x, w_s)) + (1/10) ‖x‖₂²,   where l(m) = log(1 + e^{−2m})
34 / 68
Outline of the proof
Analysis of network regret
  R_N(u, T) := Σ_{t=1}^T Σ_{i=1}^N f^i_t(x^i_t) − Σ_{t=1}^T Σ_{i=1}^N f^i_t(u)
Input-to-state stability with respect to agreement
  ‖L̂_K v_t‖₂ ≤ C_I ‖v_1‖₂ (1 − δ̃/(4N²))^{(t−1)/B} + C_U max_{1≤s≤t−1} ‖d_s‖₂
Agent regret bound in terms of {η_t}_{t≥1} and initial conditions
Boundedness of online estimates under β-centrality (β ∈ (0, 1])
  Doubling trick to obtain O(√T) agent regret
Local strong convexity ⇒ β-centrality
  η_t = 1/(p t) to obtain O(log T) agent regret
35 / 68
End of Part II
Contributions
Extension of classical √T and log(T) regret bounds to the multi-agent scenario
B-joint connectivity of communication topologies
No projection steps onto compact sets required
Proportional-integral feedback on disagreement
Relevant for regression & classification
Future work
Refine guarantees under a model for the evolution of the objective functions
36 / 68
Part III: Distributed optimization with nuclear norm
Distributed optimization Nuclear norm regularization
37 / 68
Review of
nuclear norm regularization
Definition
Application to matrix completion
38 / 68
Coupling through nuclear norm
Given a matrix W = [ w^1 · · · w^N ] ∈ R^{d×N},
  ‖W‖_* = sum of the singular values of W = trace(√(WWᵀ)) = trace(√(Σ_{i=1}^N w^i (w^i)ᵀ))

The nuclear norm regularization
  min_{w^i ∈ W^i} Σ_{i=1}^N f^i(w^i) + γ ‖W‖_*
favors vectors {w^i}_{i=1}^N belonging to a low-dimensional subspace
39 / 68
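A minimal numerical check (illustrative, random data) that the nuclear norm equals both the sum of singular values and trace(√(WWᵀ)):

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 8))
nuc_svd = np.linalg.svd(W, compute_uv=False).sum()     # sum of singular values
nuc_trace = np.trace(sqrtm(W @ W.T)).real              # trace(sqrt(W W^T))
print(np.isclose(nuc_svd, nuc_trace))                  # True
print(np.isclose(nuc_svd, np.linalg.norm(W, 'nuc')))   # True
```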
Application of nuclear norm
Matrix completion
for recommender systems
40 / 68
Low-rank matrix completion
Ratings table: users Tara, Philip, Mauricio, Miroslav; movies Toy Story, Jurassic Park, ...
W = [ W_{:,1} W_{:,2} W_{:,3} W_{:,4} ]
Incomplete rating matrix Z assumed low-rank

To estimate the missing entries,
  min_{W ∈ R^{d×N}} Σ_{(i,j)∈Γ} (W_{ij} − Z_{ij})² + γ ‖W‖_*
where Γ indexes the revealed entries of the rating matrix Z
γ depends on the application, dimensions... Regularization, not penalty
Netflix: users ∼ 10⁷, movies ∼ 10⁵
Why make it distributed? Because users may not want to share their ratings
41 / 68
Distributed formulation of nuclear norm
Useful characterizations and challenges
Our dynamics for distributed optimization with nuclear norm
theorem of convergence
simulations
42 / 68
1st characterization of the nuclear norm
For any W ∈ R^{d×N},
  2 ‖W‖_* = 2 trace(√(WWᵀ))
          = min_{D ∈ S^d_{⪰0}, C(D) ⊇ C(W)} trace(D† WWᵀ) + trace(D)     [C(·): column space, D†: pseudoinverse]
          = min_{D ∈ S^d_{⪰0}, w^i ∈ C(D)} Σ_{i=1}^N (w^i)ᵀ D† w^i + trace(D)

Used in A. Argyriou, T. Evgeniou, and M. Pontil, NIPS, 06
Minimizer D* = √(WWᵀ)
For ‖W‖_*² use the hard constraint trace(D) ≤ 1. Then D* = √(WWᵀ) / trace(√(WWᵀ))

Challenges related to the pseudoinverse:
  Optimizer D* = √(WWᵀ) might be rank deficient
  Pseudoinverse D† might be discontinuous when the rank of D changes
43 / 68
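A minimal numerical check (illustrative, random full-rank data): at the claimed minimizer D* = √(WWᵀ), the value trace(D† WWᵀ) + trace(D) equals 2‖W‖_*:

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 6))                     # full row rank almost surely
D_star = sqrtm(W @ W.T).real                        # claimed minimizer
value = np.trace(np.linalg.pinv(D_star) @ W @ W.T) + np.trace(D_star)
print(np.isclose(value, 2 * np.linalg.norm(W, 'nuc')))   # True
```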
Convexity property of wᵀ D† w

Point-wise supremum of a function that is linear in (w, D):
  sup_{x ∈ R^d} −( xᵀDx + 2wᵀx ) = { wᵀD†w   if D ∈ S^d_{⪰0}, w ∈ C(D);   ∞ otherwise }

Hence wᵀD†w is jointly convex on {w ∈ R^d, D ∈ S^d_{⪰0} : w ∈ C(D)}
Here the pseudoinverse is “hidden” in the optimizer — good for iterative algorithms
Curiosity: explicit Fenchel conjugate of q(x) := xᵀDx, for a fixed D (up to a factor, q*(−2w))
Proof in the appendix of “Convex Optimization”, by Boyd and Vandenberghe
44 / 68
2nd characterization of the nuclear norm
The latter expression allows us to write
  2 ‖W‖_* = min_{D ∈ S^d_{⪰0}, w^i ∈ C(D)}  sup_{x^i ∈ R^d}  Σ_{i=1}^N −( (x^i)ᵀ D x^i + 2 (w^i)ᵀ x^i )   [the sup equals Σ_{i=1}^N (w^i)ᵀ D† w^i]   + trace(D)

Sum of convex functions in the vectors {w^i}_{i=1}^N
(as well as in the auxiliary vectors {x^i}_{i=1}^N)
Common auxiliary matrix D ∈ R^{d×d}, ideally d ≪ N
Can be “distributed” as a min-max problem with local copies of D subject to agreement
45 / 68
Approximate regularization to guarantee D* ⪰ cI

Challenges related to the constraint w ∈ C(D):
  The set {w ∈ R^d, D ∈ S^d_{⪰0} : w ∈ C(D)} is convex but not closed; also, S^d_{≻0} is open
  DD†w projects w^i onto C(D), but it is state dependent

‖[ W | √I_d ]‖_* = trace(√(WWᵀ + I_d))
                 = min_{D ∈ S^d_{⪰0}, C(W) ⊆ C(D)} trace(D† (WWᵀ + I_d)) + trace(D)

Minimizer D* = √(WWᵀ + I_d) ⪰ √I_d
Replace the constraint w ∈ C(D) by the dummy constraint D ⪰ √I_d
46 / 68
Min-max problem with explicit agreement
min_{w^i ∈ W^i} Σ_{i=1}^N f^i(w^i) + ‖[ W | √I_d ]‖_*
  = min_{w^i ∈ W^i, D^i ∈ R^{d×d}, D^i = D^j ∀i,j}  sup_{x^i ∈ R^d, Y^i ∈ R^{d×d}}  Σ_{i=1}^N F^i( w^i, D^i  [convex block],  x^i, Y^i  [concave block] )

with
F^i(w, D, x, Y) := f^i(w) + γ trace( D(−xxᵀ − N YYᵀ  [approx. reg.]) ) − 2γ wᵀx − (2γ/N) trace(Y)  [approx. reg.]  + (1/N) trace(D)
47 / 68
Distributed saddle-point dynamics for nuclear optimization
w^i(k+1) = P_W( w^i(k) − η_k ( g^i(k) − 2γ x^i(k) ) )
D^i(k+1) = P_D( D^i(k) − η_k ( γ ( −x^i(x^i)ᵀ − N Y^i(Y^i)ᵀ ) + (1/N) I_d ) + σ Σ_{j=1}^N a_{ij,t} ( D^j(k) − D^i(k) ) )
x^i(k+1) = x^i(k) + η_k γ ( −2 D^i(k) x^i(k) − 2 w^i(k) )
Y^i(k+1) = Y^i(k) + η_k γ ( −(2/N) D^i(k) Y^i(k) − (2/N) I_d )

User i does not need to share w^i with its neighbors!
and D^i → √( Σ_{i=1}^N w^i(w^i)ᵀ + I_d )

Complexity: 1/√k saddle-point error × projection onto {D ⪰ c I_d}
  P_D(D) := ½( D − c I_d + √((D − c I_d)²) ) + c I_d   (e.g., using Newton)
48 / 68
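A minimal sketch of the projection P_D onto {D ⪰ cI} via eigendecomposition (clipping eigenvalues at c), rather than the Newton-based matrix square root suggested on the slide:

```python
import numpy as np

def project_D(D, c=1e-3):
    """Orthogonal (Frobenius) projection of a symmetric D onto {D : D >= c*I}."""
    D = 0.5 * (D + D.T)                        # symmetrize for safety
    vals, vecs = np.linalg.eigh(D)
    return (vecs * np.maximum(vals, c)) @ vecs.T   # clip eigenvalues at c

D = np.array([[0.5, 2.0], [2.0, 0.5]])         # indefinite example
print(np.linalg.eigvalsh(project_D(D)))        # eigenvalues >= c
```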
Theorem (DMN-JC Distributed saddle-point nuclear optimization)
Assume that f_i is convex and the set W_i is convex and compact.
Also, let the sequence of weight-balanced communication digraphs be
  nondegenerate, and
  B-jointly connected.
Then, for a suitable choice of the consensus stepsize σ and the learning rates {η_k},
the saddle-point evaluation error of the running-time averages decreases as 1/√k.

Proof follows from developments in the next section.
49 / 68
Privacy preserving matrix completion
Figure: distributed saddle-point dynamics versus centralized subgradient descent over the iterations k, showing
  the matrix fitting error ‖W(k) − Z‖_F / ‖Z‖_F,
  the objective with nuclear regularization Σ_{i=1}^N Σ_{j ∈ Υ_i} (W_{ij}(k) − Z_{ij})² + γ ‖[ W(k) | √I_d ]‖_*, and
  the disagreement of the local matrices ( Σ_{i=1}^N ‖D^i(k) − (1/N) Σ_{i=1}^N D^i(k)‖_F² )^{1/2}
50 / 68
End of Part III
Contributions
First multi-agent treatment of nuclear norm regularization
Privacy preserving matrix completion
Future work
Inexact projections,
Compressed matrix
communication
51 / 68
Part IV (and last): Saddle-point problems
Distributed optimization Constraints
52 / 68
Organization
Distributed constrained optimization
Previous work and limitations
The magic of agreement on the multiplier
Saddle-point problems with explicit agreement
Nuclear norm min-max characterization
Lagrangians
Our distributed saddle-point dynamics with Laplacian averaging
theorem for our dynamics: 1/√t saddle-point evaluation error
outline of the proof
53 / 68
Distributed constrained optimization
min_{w^i ∈ W^i ∀i, D ∈ D} Σ_{i=1}^N f^i(w^i, D)
s.t. g^1(w^1, D) + · · · + g^N(w^N, D) ≤ 0

{w^i}_{i=1}^N are local decision vectors, D is a global decision vector
Applications:
Traffic and routing
Resource allocation
Optimal control
Network formation
54 / 68
Previous work by type of constraint
min Σ_{i=1}^N f^i(x)   s.t. g(x) ≤ 0   (all agents know the constraint)
  2011 D. Yuan, S. Xu, and H. Zhao
  2012 M. Zhu and S. Martínez
  D. Yuan, D. W. C. Ho, and S. Xu (to appear)

min_{w^i ∈ W^i} Σ_{i=1}^N f^i(w^i)   s.t. Σ_{i=1}^N g^i(w^i) ≤ 0   (agent i knows only g^i)
  2010 D. Mosk-Aoyama, T. Roughgarden, and D. Shah (only linear constraints)
  2014 T.-H. Chang, A. Nedić, and A. Scaglione (primal-dual perturbation methods)
55 / 68
Distributing the constraint via agreement on multipliers
min_{w^i ∈ W^i, D ∈ D} Σ_{i=1}^N f^i(w^i, D)
s.t. g^1(w^1, D) + · · · + g^N(w^N, D) ≤ 0

Dual decomposition, if strong duality holds:
  min_{w^i ∈ W^i, D ∈ D}  max_{z ∈ R^m_{≥0}}  Σ_{i=1}^N f^i(w^i, D) + zᵀ Σ_{i=1}^N g^i(w^i, D)
  = min_{w^i ∈ W^i, D ∈ D}  max_{z^i ∈ R^m_{≥0}, z^i = z^j ∀i,j}  Σ_{i=1}^N ( f^i(w^i, D) + (z^i)ᵀ g^i(w^i, D) )
  = min_{w^i ∈ W^i, D^i ∈ D, D^i = D^j ∀i,j}  max_{z^i ∈ R^m_{≥0}, z^i = z^j ∀i,j}  Σ_{i=1}^N ( f^i(w^i, D^i) + (z^i)ᵀ g^i(w^i, D^i) )
56 / 68
Saddle-point problems with explicit agreement
A more general framework

min_{w ∈ W, (D^1,...,D^N) ∈ D^N, D^i = D^j ∀i,j}  max_{μ ∈ M, (z^1,...,z^N) ∈ Z^N, z^i = z^j ∀i,j}  φ( w, (D^1, . . . , D^N)  [convex block],  μ, (z^1, . . . , z^N)  [concave block] )

Problem unstudied in the literature
Fits the min-max formulation of nuclear norm regularization (the concave part is quadratic)
Fits convex-concave functions arising from Lagrangians (the concave part is linear)
57 / 68
Our dynamics
Subgradient saddle-point dynamics with Laplacian averaging
(provided a saddle point exists under agreement)

ŵ_{t+1} = w_t − η_t g_{w_t}
D̂_{t+1} = D_t − σ L_t D_t  [Laplacian averaging]  − η_t g_{D_t}
μ̂_{t+1} = μ_t + η_t g_{μ_t}
ẑ_{t+1} = z_t − σ L_t z_t  [Laplacian averaging]  + η_t ĝ_{z_t}
(w_{t+1}, D_{t+1}, μ_{t+1}, z_{t+1}) = P_S( ŵ_{t+1}, D̂_{t+1}, μ̂_{t+1}, ẑ_{t+1} )   [orthogonal projection]

g_{w_t}, g_{D_t}, g_{μ_t}, ĝ_{z_t} are subgradients and supergradients of φ(w, D, μ, z),
and S = W × D^N × M × Z^N
58 / 68
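A minimal sketch (my own, not the paper's code) of one iteration of the projected saddle-point dynamics with Laplacian averaging, for generic (sub/super)gradient oracles; here D and z stack the agents' local copies row-wise (one row per agent), so L_t @ D implements the neighbor averaging, and the projections P_W, P_D, P_M, P_Z are user-supplied:

```python
import numpy as np

def saddle_step(w, D, mu, z, grads, L_t, eta_t, sigma, projections):
    """One iteration: descent in (w, D), ascent in (mu, z), Laplacian averaging
    on the local copies D and z, followed by projection onto S."""
    g_w, g_D, g_mu, g_z = grads(w, D, mu, z)
    P_W, P_D, P_M, P_Z = projections
    w_new = P_W(w - eta_t * g_w)                        # descent in w
    D_new = P_D(D - sigma * (L_t @ D) - eta_t * g_D)    # Laplacian averaging + descent
    mu_new = P_M(mu + eta_t * g_mu)                     # ascent in mu
    z_new = P_Z(z - sigma * (L_t @ z) + eta_t * g_z)    # Laplacian averaging + ascent
    return w_new, D_new, mu_new, z_new
```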
Theorem (DMN-JC Distributed saddle-point approximation)
Assume that
  φ(w, D, μ, z) is convex in (w, D) ∈ W × D^N and concave in (μ, z) ∈ M × Z^N
  The dynamics is bounded
  The sequence of weight-balanced communication digraphs is nondegenerate and B-jointly connected
For δ̃′ ∈ (0, 1) and δ̃ := min{ δ̃′, (1 − δ̃′) δ / d_max },
choosing σ ∈ [ δ̃/δ, (1 − δ̃)/d_max ] and {η_t} based on the Doubling Trick scheme,
then for any saddle point (w*, D*, μ*, z*) of φ within agreement,
  −α/√(t − 1) ≤ φ(w^av_t, D^av_t, z^av_t, μ^av_t) − φ(w*, D*, z*, μ*) ≤ α/√(t − 1)

where w^av_t := (1/t) Σ_{s=1}^t w_s, etc.
59 / 68
Outline of the proof
Inequality techniques from A. Nedić and A. Ozdaglar, 2009
Saddle-point evaluation error
  t φ(w^av_{t+1}, D^av_{t+1}, μ^av_{t+1}, z^av_{t+1}) − t φ(w*, D*, μ*, z*)   (2)
at running-time averages, w^av_{t+1} := (1/t) Σ_{s=1}^t w_s, etc.
Bound for (2) in terms of
  initial conditions
  bound on subgradients and states of the dynamics
  disagreement
  sum of learning rates
Input-to-state stability with respect to agreement
  ‖L_K D_t‖₂ ≤ C_I ‖D_1‖₂ (1 − δ̃/(4N²))^{(t−1)/B} + C_U max_{1≤s≤t−1} ‖d_s‖₂
Doubling Trick scheme: for m = 0, 1, 2, . . . , ⌈log₂ t⌉, take η_s = 1/√(2^m) in each period of 2^m rounds, s = 2^m, . . . , 2^{m+1} − 1
60 / 68
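A minimal sketch of the Doubling Trick schedule for {η_t}: within the m-th period of 2^m rounds the stepsize is held at 1/√(2^m):

```python
import numpy as np

def doubling_trick_stepsizes(T):
    """Stepsizes eta_1, ..., eta_T: eta_s = 1/sqrt(2^m) for s = 2^m, ..., 2^{m+1}-1."""
    etas, m = [], 0
    while len(etas) < T:
        etas.extend([1.0 / np.sqrt(2 ** m)] * (2 ** m))
        m += 1
    return np.array(etas[:T])

print(doubling_trick_stepsizes(10))   # [1, 0.707, 0.707, 0.5, 0.5, 0.5, 0.5, 0.354, ...]
```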
End of Part IV
Contributions
Distributed subgradient methods for saddle-point problems
Particularize to
Saddle-points of Lagrangians coming from separable constraints
Min-max characterization of nuclear norm
Future work
Bounds on Lagrange vectors and matrices in a distributed way
Boundedness properties of the dynamics without truncated
projections
Semidefinite constraints with chordal sparsity, where agents update
the entries corresponding to maximal cliques subject to agreement
on the intersections
61 / 68
Summary
&
next steps
62 / 68
Summary
Global decision vectors
  min_{x ∈ R^d} Σ_{i=1}^N f^i(x)
  R^j_T := Σ_{t=1}^T Σ_{i=1}^N f^i_t(x^j_t) − min_{x ∈ R^d} Σ_{t=1}^T Σ_{i=1}^N f^i_t(x)

Local decision vectors
  min_{w^i ∈ R^d} Σ_{i=1}^N f^i(w^i) + γ ‖[w^1 . . . w^N]‖_*
  min_{w^i ∈ W^i, D ∈ D} Σ_{i=1}^N f^i(w^i, D)   s.t.   Σ_{i=1}^N g^i(w^i, D) ≤ 0
63 / 68
Noise-to-state exponentially stable distributed convex
optimization on weight-balanced digraphs,
SIAM Journal on Control and Optimization, under revision
pth moment noise-to-state stability of stochastic differential
equations with persistent noise,
SIAM Journal on Control and Optimization 52 (4) (2014), 2399-2421
Distributed online convex optimization over jointly connected
digraphs,
IEEE Transactions on Network Science and Engineering 1 (1) (2014),
23-37
And conference versions presented in ACC 13, CDC 13, MTNS 14
64 / 68
Distributed optimization for multi-task learning via nuclear-norm
approximation,
IFAC Workshop on Distributed Estimation and Control in Networked
Systems
Distributed subgradient methods for saddle-point problems,
CDC 15
(In preparation for Transactions on Automatic Control)
65 / 68
Next steps
Algorithmic principles in biology
homeostasis in neocortex
noisy computational units
distribution of firing rates
recovery after disruptions
differentiation in populations
of cells
distribution of phenotypes
recovery after injury or
disease
cancer, stem cells
66 / 68
Beyond biology
Control and security of cyber-physical systems:
cooperative optimization +
distributed trust
heterogeneous
agents’ dynamics & incentives
malicious minority
67 / 68
Thank you for listening!
68 / 68
More Related Content

What's hot

Quasistatic Fracture using Nonliner-Nonlocal Elastostatics with an Analytic T...
Quasistatic Fracture using Nonliner-Nonlocal Elastostatics with an Analytic T...Quasistatic Fracture using Nonliner-Nonlocal Elastostatics with an Analytic T...
Quasistatic Fracture using Nonliner-Nonlocal Elastostatics with an Analytic T...
Patrick Diehl
 
Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2
Fabian Pedregosa
 
SIAM CSE 2017 talk
SIAM CSE 2017 talkSIAM CSE 2017 talk
SIAM CSE 2017 talk
Fred J. Hickernell
 
Predictive mean-matching2
Predictive mean-matching2Predictive mean-matching2
Predictive mean-matching2
Jae-kwang Kim
 
MNAR
MNARMNAR
Propensity albert
Propensity albertPropensity albert
Propensity albert
Jae-kwang Kim
 
MCQMC 2020 talk: Importance Sampling for a Robust and Efficient Multilevel Mo...
MCQMC 2020 talk: Importance Sampling for a Robust and Efficient Multilevel Mo...MCQMC 2020 talk: Importance Sampling for a Robust and Efficient Multilevel Mo...
MCQMC 2020 talk: Importance Sampling for a Robust and Efficient Multilevel Mo...
Chiheb Ben Hammouda
 
Machine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & InsuranceMachine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & Insurance
Arthur Charpentier
 
Tutorial on Belief Propagation in Bayesian Networks
Tutorial on Belief Propagation in Bayesian NetworksTutorial on Belief Propagation in Bayesian Networks
Tutorial on Belief Propagation in Bayesian Networks
Anmol Dwivedi
 
Tulane March 2017 Talk
Tulane March 2017 TalkTulane March 2017 Talk
Tulane March 2017 Talk
Fred J. Hickernell
 
Chapter2: Likelihood-based approach
Chapter2: Likelihood-based approach Chapter2: Likelihood-based approach
Chapter2: Likelihood-based approach
Jae-kwang Kim
 
PMED Opening Workshop - Inference on Individualized Treatment Rules from Obse...
PMED Opening Workshop - Inference on Individualized Treatment Rules from Obse...PMED Opening Workshop - Inference on Individualized Treatment Rules from Obse...
PMED Opening Workshop - Inference on Individualized Treatment Rules from Obse...
The Statistical and Applied Mathematical Sciences Institute
 
Lausanne 2019 #2
Lausanne 2019 #2Lausanne 2019 #2
Lausanne 2019 #2
Arthur Charpentier
 
Linear Discriminant Analysis (LDA) Under f-Divergence Measures
Linear Discriminant Analysis (LDA) Under f-Divergence MeasuresLinear Discriminant Analysis (LDA) Under f-Divergence Measures
Linear Discriminant Analysis (LDA) Under f-Divergence Measures
Anmol Dwivedi
 
Reject Inference in Credit Scoring
Reject Inference in Credit ScoringReject Inference in Credit Scoring
Reject Inference in Credit Scoring
Adrien Ehrhardt
 
Numerical smoothing and hierarchical approximations for efficient option pric...
Numerical smoothing and hierarchical approximations for efficient option pric...Numerical smoothing and hierarchical approximations for efficient option pric...
Numerical smoothing and hierarchical approximations for efficient option pric...
Chiheb Ben Hammouda
 
Slides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processingSlides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processing
Frank Nielsen
 
Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...
Asma Ben Slimene
 
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof..."Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
Paris Women in Machine Learning and Data Science
 

What's hot (19)

Quasistatic Fracture using Nonliner-Nonlocal Elastostatics with an Analytic T...
Quasistatic Fracture using Nonliner-Nonlocal Elastostatics with an Analytic T...Quasistatic Fracture using Nonliner-Nonlocal Elastostatics with an Analytic T...
Quasistatic Fracture using Nonliner-Nonlocal Elastostatics with an Analytic T...
 
Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2
 
SIAM CSE 2017 talk
SIAM CSE 2017 talkSIAM CSE 2017 talk
SIAM CSE 2017 talk
 
Predictive mean-matching2
Predictive mean-matching2Predictive mean-matching2
Predictive mean-matching2
 
MNAR
MNARMNAR
MNAR
 
Propensity albert
Propensity albertPropensity albert
Propensity albert
 
MCQMC 2020 talk: Importance Sampling for a Robust and Efficient Multilevel Mo...
MCQMC 2020 talk: Importance Sampling for a Robust and Efficient Multilevel Mo...MCQMC 2020 talk: Importance Sampling for a Robust and Efficient Multilevel Mo...
MCQMC 2020 talk: Importance Sampling for a Robust and Efficient Multilevel Mo...
 
Machine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & InsuranceMachine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & Insurance
 
Tutorial on Belief Propagation in Bayesian Networks
Tutorial on Belief Propagation in Bayesian NetworksTutorial on Belief Propagation in Bayesian Networks
Tutorial on Belief Propagation in Bayesian Networks
 
Tulane March 2017 Talk
Tulane March 2017 TalkTulane March 2017 Talk
Tulane March 2017 Talk
 
Chapter2: Likelihood-based approach
Chapter2: Likelihood-based approach Chapter2: Likelihood-based approach
Chapter2: Likelihood-based approach
 
PMED Opening Workshop - Inference on Individualized Treatment Rules from Obse...
PMED Opening Workshop - Inference on Individualized Treatment Rules from Obse...PMED Opening Workshop - Inference on Individualized Treatment Rules from Obse...
PMED Opening Workshop - Inference on Individualized Treatment Rules from Obse...
 
Lausanne 2019 #2
Lausanne 2019 #2Lausanne 2019 #2
Lausanne 2019 #2
 
Linear Discriminant Analysis (LDA) Under f-Divergence Measures
Linear Discriminant Analysis (LDA) Under f-Divergence MeasuresLinear Discriminant Analysis (LDA) Under f-Divergence Measures
Linear Discriminant Analysis (LDA) Under f-Divergence Measures
 
Reject Inference in Credit Scoring
Reject Inference in Credit ScoringReject Inference in Credit Scoring
Reject Inference in Credit Scoring
 
Numerical smoothing and hierarchical approximations for efficient option pric...
Numerical smoothing and hierarchical approximations for efficient option pric...Numerical smoothing and hierarchical approximations for efficient option pric...
Numerical smoothing and hierarchical approximations for efficient option pric...
 
Slides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processingSlides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processing
 
Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...
 
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof..."Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
 

Viewers also liked

Bachelor_European_Transcript
Bachelor_European_TranscriptBachelor_European_Transcript
Bachelor_European_TranscriptDavid Mateos
 
1228 fianl project
1228 fianl project1228 fianl project
1228 fianl project
庭安 林
 
vimal pathak - visual cv
vimal pathak - visual cvvimal pathak - visual cv
vimal pathak - visual cv
vimalpathak11111
 
Caras del mundo
Caras del mundoCaras del mundo
Caras del mundo
shadowpac66
 
Sahara
SaharaSahara
Sahara
shadowpac66
 
Fotos, almuerzo en el rascacielos
Fotos, almuerzo en el rascacielosFotos, almuerzo en el rascacielos
Fotos, almuerzo en el rascacielos
shadowpac66
 
Internet expandeix societat
Internet expandeix societatInternet expandeix societat
Internet expandeix societat
Raquel Casanovas de la Cruz
 
2015 2016. 3ºeso mat. temas 7 y9
2015 2016. 3ºeso mat. temas 7 y92015 2016. 3ºeso mat. temas 7 y9
2015 2016. 3ºeso mat. temas 7 y9
José María Gutiérrez Maté
 
The Hideaways Club Deb12
The Hideaways Club Deb12The Hideaways Club Deb12
The Hideaways Club Deb12Stuart Langrish
 
Caras del mundo
Caras del mundoCaras del mundo
Caras del mundo
shadowpac66
 
Dka+hhs
Dka+hhsDka+hhs
Dka+hhs
Murad Aamar
 
Gerencia
GerenciaGerencia
Gerencia
edwaard7
 

Viewers also liked (20)

Bachelor_European_Transcript
Bachelor_European_TranscriptBachelor_European_Transcript
Bachelor_European_Transcript
 
AUCEI 2013 DaithiOMurchu
AUCEI 2013 DaithiOMurchuAUCEI 2013 DaithiOMurchu
AUCEI 2013 DaithiOMurchu
 
1228 fianl project
1228 fianl project1228 fianl project
1228 fianl project
 
vimal pathak - visual cv
vimal pathak - visual cvvimal pathak - visual cv
vimal pathak - visual cv
 
Caras del mundo
Caras del mundoCaras del mundo
Caras del mundo
 
2015.2016. mat. 3ºeso. 3ºtri. 2ºexa
2015.2016. mat. 3ºeso. 3ºtri. 2ºexa2015.2016. mat. 3ºeso. 3ºtri. 2ºexa
2015.2016. mat. 3ºeso. 3ºtri. 2ºexa
 
Golf - no cover
Golf - no coverGolf - no cover
Golf - no cover
 
Sahara
SaharaSahara
Sahara
 
Fotos, almuerzo en el rascacielos
Fotos, almuerzo en el rascacielosFotos, almuerzo en el rascacielos
Fotos, almuerzo en el rascacielos
 
Internet expandeix societat
Internet expandeix societatInternet expandeix societat
Internet expandeix societat
 
2015 2016. 3ºeso mat. temas 7 y9
2015 2016. 3ºeso mat. temas 7 y92015 2016. 3ºeso mat. temas 7 y9
2015 2016. 3ºeso mat. temas 7 y9
 
The Hideaways Club Deb12
The Hideaways Club Deb12The Hideaways Club Deb12
The Hideaways Club Deb12
 
Documents Controller
Documents ControllerDocuments Controller
Documents Controller
 
Caras del mundo
Caras del mundoCaras del mundo
Caras del mundo
 
S.T. Dupont Deb12
S.T. Dupont Deb12S.T. Dupont Deb12
S.T. Dupont Deb12
 
Small Business Radio Network 2014
Small Business Radio Network 2014Small Business Radio Network 2014
Small Business Radio Network 2014
 
AUCEi Final Final 2015
AUCEi Final Final 2015AUCEi Final Final 2015
AUCEi Final Final 2015
 
Thiramas
ThiramasThiramas
Thiramas
 
Dka+hhs
Dka+hhsDka+hhs
Dka+hhs
 
Gerencia
GerenciaGerencia
Gerencia
 

Similar to main

Computational Information Geometry on Matrix Manifolds (ICTP 2013)
Computational Information Geometry on Matrix Manifolds (ICTP 2013)Computational Information Geometry on Matrix Manifolds (ICTP 2013)
Computational Information Geometry on Matrix Manifolds (ICTP 2013)
Frank Nielsen
 
MUMS Opening Workshop - Model Uncertainty in Data Fusion for Remote Sensing -...
MUMS Opening Workshop - Model Uncertainty in Data Fusion for Remote Sensing -...MUMS Opening Workshop - Model Uncertainty in Data Fusion for Remote Sensing -...
MUMS Opening Workshop - Model Uncertainty in Data Fusion for Remote Sensing -...
The Statistical and Applied Mathematical Sciences Institute
 
Bayesian inference for mixed-effects models driven by SDEs and other stochast...
Bayesian inference for mixed-effects models driven by SDEs and other stochast...Bayesian inference for mixed-effects models driven by SDEs and other stochast...
Bayesian inference for mixed-effects models driven by SDEs and other stochast...
Umberto Picchini
 
Possible applications of low-rank tensors in statistics and UQ (my talk in Bo...
Possible applications of low-rank tensors in statistics and UQ (my talk in Bo...Possible applications of low-rank tensors in statistics and UQ (my talk in Bo...
Possible applications of low-rank tensors in statistics and UQ (my talk in Bo...
Alexander Litvinenko
 
Pattern learning and recognition on statistical manifolds: An information-geo...
Pattern learning and recognition on statistical manifolds: An information-geo...Pattern learning and recognition on statistical manifolds: An information-geo...
Pattern learning and recognition on statistical manifolds: An information-geo...
Frank Nielsen
 
Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approa...
Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approa...Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approa...
Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approa...
Gota Morota
 
Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...
Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...
Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...
Frank Nielsen
 
Bayesian inference on mixtures
Bayesian inference on mixturesBayesian inference on mixtures
Bayesian inference on mixtures
Christian Robert
 
Identifiability in Dynamic Casual Networks
Identifiability in Dynamic Casual NetworksIdentifiability in Dynamic Casual Networks
Identifiability in Dynamic Casual Networks
Graph-TA
 
Can we estimate a constant?
Can we estimate a constant?Can we estimate a constant?
Can we estimate a constant?
Christian Robert
 
Integration with kernel methods, Transported meshfree methods
Integration with kernel methods, Transported meshfree methodsIntegration with kernel methods, Transported meshfree methods
Integration with kernel methods, Transported meshfree methods
Mercier Jean-Marc
 
Dependent processes in Bayesian Nonparametrics
Dependent processes in Bayesian NonparametricsDependent processes in Bayesian Nonparametrics
Dependent processes in Bayesian Nonparametrics
Julyan Arbel
 
CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...
CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...
CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...
The Statistical and Applied Mathematical Sciences Institute
 
Patch Matching with Polynomial Exponential Families and Projective Divergences
Patch Matching with Polynomial Exponential Families and Projective DivergencesPatch Matching with Polynomial Exponential Families and Projective Divergences
Patch Matching with Polynomial Exponential Families and Projective Divergences
Frank Nielsen
 
Rao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithms
Rao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithmsRao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithms
Rao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithms
Christian Robert
 
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...
Chiheb Ben Hammouda
 
IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb)
IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb) IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb)
IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb)
Charles Deledalle
 
Unbiased Markov chain Monte Carlo
Unbiased Markov chain Monte CarloUnbiased Markov chain Monte Carlo
Unbiased Markov chain Monte Carlo
JeremyHeng10
 
Slides: Jeffreys centroids for a set of weighted histograms
Slides: Jeffreys centroids for a set of weighted histogramsSlides: Jeffreys centroids for a set of weighted histograms
Slides: Jeffreys centroids for a set of weighted histograms
Frank Nielsen
 
On learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihoodOn learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihood
Frank Nielsen
 

Similar to main (20)

Computational Information Geometry on Matrix Manifolds (ICTP 2013)
Computational Information Geometry on Matrix Manifolds (ICTP 2013)Computational Information Geometry on Matrix Manifolds (ICTP 2013)
Computational Information Geometry on Matrix Manifolds (ICTP 2013)
 
MUMS Opening Workshop - Model Uncertainty in Data Fusion for Remote Sensing -...
MUMS Opening Workshop - Model Uncertainty in Data Fusion for Remote Sensing -...MUMS Opening Workshop - Model Uncertainty in Data Fusion for Remote Sensing -...
MUMS Opening Workshop - Model Uncertainty in Data Fusion for Remote Sensing -...
 
Bayesian inference for mixed-effects models driven by SDEs and other stochast...
Bayesian inference for mixed-effects models driven by SDEs and other stochast...Bayesian inference for mixed-effects models driven by SDEs and other stochast...
Bayesian inference for mixed-effects models driven by SDEs and other stochast...
 
Possible applications of low-rank tensors in statistics and UQ (my talk in Bo...
Possible applications of low-rank tensors in statistics and UQ (my talk in Bo...Possible applications of low-rank tensors in statistics and UQ (my talk in Bo...
Possible applications of low-rank tensors in statistics and UQ (my talk in Bo...
 
Pattern learning and recognition on statistical manifolds: An information-geo...
Pattern learning and recognition on statistical manifolds: An information-geo...Pattern learning and recognition on statistical manifolds: An information-geo...
Pattern learning and recognition on statistical manifolds: An information-geo...
 
Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approa...
Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approa...Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approa...
Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approa...
 
Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...
Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...
Slides: On the Chi Square and Higher-Order Chi Distances for Approximating f-...
 
Bayesian inference on mixtures
Bayesian inference on mixturesBayesian inference on mixtures
Bayesian inference on mixtures
 
Identifiability in Dynamic Casual Networks
Identifiability in Dynamic Casual NetworksIdentifiability in Dynamic Casual Networks
Identifiability in Dynamic Casual Networks
 
Can we estimate a constant?
Can we estimate a constant?Can we estimate a constant?
Can we estimate a constant?
 
Integration with kernel methods, Transported meshfree methods
Integration with kernel methods, Transported meshfree methodsIntegration with kernel methods, Transported meshfree methods
Integration with kernel methods, Transported meshfree methods
 
Dependent processes in Bayesian Nonparametrics
Dependent processes in Bayesian NonparametricsDependent processes in Bayesian Nonparametrics
Dependent processes in Bayesian Nonparametrics
 
CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...
CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...
CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...
 
Patch Matching with Polynomial Exponential Families and Projective Divergences
Patch Matching with Polynomial Exponential Families and Projective DivergencesPatch Matching with Polynomial Exponential Families and Projective Divergences
Patch Matching with Polynomial Exponential Families and Projective Divergences
 
Rao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithms
Rao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithmsRao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithms
Rao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithms
 
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...
 
IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb)
IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb) IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb)
IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb)
 
Unbiased Markov chain Monte Carlo
Unbiased Markov chain Monte CarloUnbiased Markov chain Monte Carlo
Unbiased Markov chain Monte Carlo
 
Slides: Jeffreys centroids for a set of weighted histograms
Slides: Jeffreys centroids for a set of weighted histogramsSlides: Jeffreys centroids for a set of weighted histograms
Slides: Jeffreys centroids for a set of weighted histograms
 
On learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihoodOn learning statistical mixtures maximizing the complete likelihood
On learning statistical mixtures maximizing the complete likelihood
 

main

  • 1. Distributed algorithms for convex optimization David Mateos-N´u˜nez Jorge Cort´es University of California, San Diego {dmateosn,cortes}@ucsd.edu Thesis Defense Committee: Tara Javidi, Philip E. Gill, Maur´ıcio de Oliveira, and Miroslav Krsti´c UC San Diego, August 20, 2015 1 / 68
  • 2. Context: decentralized, peer-to-peer systems Sensor networks Medical diagnosis Recommender systems Formation control 2 / 68
  • 3. Distributed paradigm Tara Philip Mauricio Miroslav Toy Story Jurasic Park · · · · · · · · · · · · · · · W = | w1 | | w2 | | w3 | | w4 | min wi ∈Rd 4 i=1 f i (wi ) + γ [w1 · · · w4] ∗ Ratings {wi } are private 3 / 68
  • 4. Distributed algorithms for four classes of problems Noisy channels min x∈Rd N i=1 f i (x) Online optimization Rj T := T t=1 N i=1 f i t (xj t ) − min x∈Rd T t=1 N i=1 f i t (x) Multi-task learning min wi ∈Rd N i=1 f i (wi ) + γ W ∗ Constraints min wi ∈Wi D∈D N i=1 f i (wi , D) s.t. N i=1 gi (wi , D) ≤ 0 4 / 68
  • 5. Part I: Noise-to-state stable coordination Distributed optimization Noisy communication 5 / 68
  • 6. Review of (consensus based) distributed convex optimization Local information Laplacian matrix Disagreement feedback Pioneering works Lagrangian approach 6 / 68
  • 7. Using local information x∗ ∈ arg min x∈Rd N i=1 f i (x) Agent i has access to f i Agent i can share its estimate xi t with “neighboring” agents f 1 f 2 f 3 f 4 f 5 A =       a13 a14 a21 a25 a31 a32 a41 a52       Parallel computations: Tsitsiklis 84, Bertsekas and Tsitsiklis 95 Consensus: Jadbabaie et al. 03, Olfati-Saber, Murray 04, Boyd et al. 05 7 / 68
  • 8. The Laplacian matrix L = diag(A1) − A =       a13 + a14 −a13 −a14 −a21 a21 + a25 −a25 −a31 −a32 a31 + a32 −a41 a41 −a52 a52       Nullspace is agreement ⇔ spanning tree The mirror graph has Laplacian L Consensus via feedback on disagreement proportional proportional-integral feedback 8 / 68
  • 9. Disagreement feedback Honeybee navigation en route to the goal: visual flight control and odometry 1996 M. Srinivasan, S. Zhang, M. Lehrer, and T. Collett, J Exp. Biol Relative positions and angles g1(w1) + · · · + gN(wN) = 0 Agreement on the multiplier 9 / 68
  • 10. Pioneering works local minimization versus local consensus A. Nedi´c and A. Ozdaglar, TAC, 09 xi k+1 = − ηk decreasing gi k + N j=1 aij xj k, where gi k ∈ ∂f i (xi k) and A = (aij ) is doubly stochastic J. C. Duchi, A. Agarwal, and M. J. Wainwright, TAC, 12 10 / 68
  • 11. The Lagrangian approach min x1,...,xN ∈Rd x1=...=xN N i=1 f i (xi ) = min x∈(Rd )N Lx=0 ˜f (x) Convex-concave augmented Lagrangian F(x, z) := ˜f (x) Network objective + z (L x) Multiplier × constraint + ˜γ 2 x L x Augmented Lagrangian Saddle-point dynamics for symmetric Laplacian L ˙x = − ∂F(x, z) ∂x = − ˜f (x) − ˜γ Lx − Lz ˙z = ∂F(x, z) ∂z = Lx J. Wang and N. Elia, Allerton, 10 And when L = L ? 11 / 68
  • 12. Lagrangian approach for directed graphs Lagrangian F(x, z) := ˜f (x) + z L x + ˜γ 2 x (L + L ) x still convex-concave because Weight-balanced ⇔ 1 L = 0 ⇔ L + L 0 ˙x = − ∂F(x, z) ∂x = − ˜f (x) − ˜γ 1 2 L + L x − L z (Non distributed!) ˙z = ∂F(x, z) ∂z = Lx B. Gharesifard and J. Cort´es, CDC, 12 12 / 68
  • 13. Lagrangian approach for directed graphs Lagrangian F(x, z) := ˜f (x) + z L x + ˜γ 2 x (L + L ) x still convex-concave because Weight-balanced ⇔ 1 L = 0 ⇔ L + L 0 ˙x = − ∂F(x, z) ∂x = − ˜f (x) − ˜γ 1 2 L + L x − L z (Non distributed!) changed to − ˜f (x) − ˜γ Lx − Lz ˙z = ∂F(x, z) ∂z = Lx B. Gharesifard and J. Cort´es, CDC, 12 12 / 68
  • 14. Now... what happens when there is noise? Previous work and limitations our coordination algorithm under noise theorem of exponential noise-to-state stability simulations outline of the proof 13 / 68
  • 15. Previous work with noisy communication channels S. S. Ram and A. Nedi´c and V. V. Veeravalli, 10 K. Srivastava and A. Nedi´c, 11 xi k+1 = xi k − ηi k gi k + noise + σi k N j=1 aij,k xj k + noise − xi k where gi k ∈ ∂f i (xi k) and Ak = (aij,k) is an adjacency matrix Limitations: vanishing {ηi k}k≥1 and {σi k}k≥1 Sublinear convergence even in strongly convex case noise of fixed variance require projection step or bounded subgradients 14 / 68
  • 16. Our algorithm: proprortional-integral feedback + noise ˙xi (t) = − fi (xi (t)) + noise + ˜γ N j=1,j=i aij xj (t) + noise − xi (t) + N j=1,j=i aij zj (t) + noise − zi (t) ˙zi (t) = − N j=1,j=i aij xj (t) + noise − xi (t) Stochastic differential equation in compact form dx = − ˜f (x) dt − ˜γ Lx dt −Lz dt structured integral feedback + G1 (x, z, t)Σ1 (t)dB(t) Noise dz = Lx dt + G2 (x, z, t)Σ2 (t)dB(t) Noise 15 / 68
  • 17. Theorem (DMN-JC 2nd moment noise-to-state exponential stability) If the digraph is strongly connected and weight-balanced, each fi is convex, twice-differentiable with bounded Hessian, and at least one fi0 is strongly convex, then E x(t) − x∗ 1⊗xmin 2 2 Distance to optimizer + (z(t) − z∗ ) 2 LK Distance modulus agreement ≤ Cµ x0 − x∗ 2 2 + z0 − z∗ 2 LK 2 Initial conditions e−Dµ (t−t0) + Cθ sup t0≤τ≤t Σ1 (τ) Σ2 (τ) F 2 Ultimate bound Σ(t) F := trace (Σ(t)Σ(t) ) gives the size of the noise 16 / 68
  • 18. Simulations 0 2 4 6 8 10 12 14 16 18 −2.74 0 1.1 Evolution of the agents’s estimates time, t {[xi 1(t), xi 2(t)]} 0 5 10 15 20 0 2 4 6 8 10 12 14 time, t E x(t) − 1 ⊗ xmin 2 2 Evolution of the second moment ˜γ = 2 ˜γ = 4 ˜γ = 3 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0 0.5 1 1.5 2 2.5 3 3.5 4 Empirical noise gain size of the noise, s Ultimate bound for E x(t) − 1 ⊗ xmin 2 2 (t = 17) ˜γ = 2 ˜γ = 3 ˜γ = 4 N = 4 agents communicating over a directed ring with local objectives f1(x1, x2) = 1 2((x1 − 4)2 + (x2 − 3)2 ) f2(x1, x2) = x1 + 3x2 − 2 f3(x1, x2) = log(ex1+3 + ex2+1 ) f4(x1, x2) = (x1 + 2x2 + 5)2 + (x1 − x2 − 4)2 17 / 68
  • 19. Outline of the proof KKT conditions VERSUS Equilibria of underlying modified saddle-point ODE Existence & uniqueness of solutions of the SDEs Lyapunov analysis of SDEs Pth moment NSS Lyapunov functions “pth moment noise-to-state stability of stochastic differential equations with persistent noise”, DMN-JC Strategy for directed graphs “Distributed continuous-time convex optimization on weight-balanced digraphs”, B. Gharesifard and J. Cort´es δ-Cocoercivity of F := (INd + P)( ˜f(x) + Lx) (x − x ) (F(x) − F(x )) ≥ δ F(x) − F(x ) 2 2 18 / 68
  • 20. End of Part I Contributions 2nd moment Noise-to-state exponentially stable First proof of exponential convergence only one objective strongly convex Novel general Lyapunov stability analysis for SDEs Open problems Coordination algorithms with communication delays 19 / 68
  • 21. Part II: Distributed online optimization Distributed optimization Online optimization Case study: medical diagnosis 20 / 68
  • 22. Review of online convex optimization Motivation: medical diagnosis Sequential decision making Regret Classical results 21 / 68
  • 23. Medical diagnosis Medical findings, symptoms: age factor amnesia before impact deterioration in GCS score open skull fracture loss of consciousness vomiting Does the patient need a computerized Tomography? “The Canadian CT Head Rule for patients with minor head injury” 22 / 68
  • 24. Different kind of optimization: sequential decision making In the diagnosis example: Each round t ∈ {1, . . . , T} question (features, medical findings): wt decision (about using CT): h(xt, wt) outcome (by CT findings/follow up of patient): yt loss: l(yt h(xt, wt)) Choose xt & Incur loss ft(xt) := l(yt h(xt, wt)) Goal: sublinear regret R(u, T) := T t=1 ft(xt) − T t=1 ft(u) ∈ o(T) using “historical observations” 23 / 68
  • 25. Significance of sublinear regret If the regret is sublinear, T t=1 ft(xt) ≤ T t=1 ft(u) + o(T), then lim T→∞ 1 T T t=1 ft(xt) Loss of decisions “on the fly” ≤ lim T→∞ 1 T T t=1 ft(u) Loss of any decision in hindsight In temporal average, online decisions {xt}T t=1 perform as well as best decision in hindsight “No regrets, my friend” 24 / 68
  • 26. Some classical results Projected gradient descent: xt+1 = ΠS(xt − ηt ft(xt)), (1) where ΠS is a projection onto a compact set S ⊆ Rd , & f 2 ≤ H Follow-the-Regularized-Leader: xt+1 = arg min y∈S t s=1 fs(y) + ψ(y) Martin Zinkevich, 03 Alg. (1) achieves O( √ T) regret under convexity with ηt = 1√ t Elad Hazan, Amit Agarwal, and Satyen Kale, 07 (1) achieves O(log T) regret under p-strong convexity with ηt = 1 pt Others: Online Newton Step, Follow the Regularized Leader, etc. 25 / 68
  • 27. Combination of the distributed and online frameworks Motivation Previous work and limitations our coordination algorithm in the online case theorems of O( √ T) & O(log T) agent regrets simulations outline of the proof 26 / 68
  • 28. Resuming the diagnosis example: Health center i ∈ {1, . . . , N} takes care of a set of patients at time t Goal: sublinear agent regret Rj (u, T) := T t=1 N i=1 f i t (xj t ) − T t=1 N i=1 f i t (u) ≤ o(T) using “local information” & “historical observations” 27 / 68
  • 29. Challenge: Coordinate hospitals f 1 t f 2 t f 3 t f 4 t f 5 t t ∈ {1, . . . , T} Need to design distributed online algorithms Private patient data; share only prediction models 28 / 68
  • 30. Previous work on consensus-based online algorithms F. Yan, S. Sundaram, S. V. N. Vishwanathan and Y. Qi, TAC Projected Subgradient Descent log(T) regret (local strong convexity & bounded subgradients)√ T regret (convexity & bounded subgradients) Both analysis require a projection onto a compact set S. Hosseini, A. Chapman and M. Mesbahi, CDC, 13 Dual Averaging √ T regret (convexity & bounded subgradients) General regularized projection onto a convex closed set. K. I. Tsianos and M. G. Rabbat, arXiv, 12 Projected Subgradient Descent Empirical risk as opposed to regret analysis Limitations: Communication digraph in all cases is fixed and strongly connected 29 / 68
  • 31. Our contributions (informally) time-varying communication digraphs under B-joint connectivity & weight-balanced unconstrained optimization (no projection step onto a bounded set) log T regret (local strong convexity & bounded subgradients) √ T regret (convexity & β-centrality with β ∈ (0, 1] & bounded subgradients) x∗ x −ξx f i t is β-central in Z ⊆ Rd X∗, where X∗ = x∗ : 0 ∈ ∂f i t (x∗) , if ∀x ∈ Z, ∃x∗ ∈ X∗ such that −ξx (x∗ − x) ξx 2 x∗ − x 2 ≥ β, for each ξx ∈ ∂f i t (x). 30 / 68
  • 32. Our algorithm under jointly connected topologies x^i_{t+1} = x^i_t − ηt g^i_t Subgradient descent on previous local objectives, g^i_t ∈ ∂f i_t(x^i_t) 31 / 68
  • 33. Our algorithm under jointly connected topologies x^i_{t+1} = x^i_t − ηt g^i_t + σ γ̃ Σ_{j≠i} a_{ij,t} (x^j_t − x^i_t) Proportional feedback on disagreement 31 / 68
  • 34. Our algorithm under jointly connected topologies x^i_{t+1} = x^i_t − ηt g^i_t + σ ( γ̃ Σ_{j≠i} a_{ij,t} (x^j_t − x^i_t) + Σ_{j≠i} a_{ij,t} (z^j_t − z^i_t) ), z^i_{t+1} = z^i_t − σ Σ_{j≠i} a_{ij,t} (x^j_t − x^i_t) Integral feedback on disagreement 31 / 68
  • 35. Our algorithm under jointly connected topologies x^i_{t+1} = x^i_t − ηt g^i_t + σ ( γ̃ Σ_{j≠i} a_{ij,t} (x^j_t − x^i_t) + Σ_{j≠i} a_{ij,t} (z^j_t − z^i_t) ), z^i_{t+1} = z^i_t − σ Σ_{j≠i} a_{ij,t} (x^j_t − x^i_t) The union of the graphs over each interval of length B is strongly connected. (Illustration: network snapshots with objectives f 1_t, . . . , f 5_t at times t, t + 1 (edges a14,t+1, a41,t+1 active), and t + 2.) 31 / 68
  • 36. Our algorithm under jointly connected topologies x^i_{t+1} = x^i_t − ηt g^i_t + σ ( γ̃ Σ_{j≠i} a_{ij,t} (x^j_t − x^i_t) + Σ_{j≠i} a_{ij,t} (z^j_t − z^i_t) ), z^i_{t+1} = z^i_t − σ Σ_{j≠i} a_{ij,t} (x^j_t − x^i_t) Compact representation: [xt+1; zt+1] = [xt; zt] − σ [γ̃Lt, Lt; −Lt, 0] [xt; zt] − ηt [g̃xt; 0] 31 / 68
  • 37. Our algorithm under jointly connected topologies x^i_{t+1} = x^i_t − ηt g^i_t + σ ( γ̃ Σ_{j≠i} a_{ij,t} (x^j_t − x^i_t) + Σ_{j≠i} a_{ij,t} (z^j_t − z^i_t) ), z^i_{t+1} = z^i_t − σ Σ_{j≠i} a_{ij,t} (x^j_t − x^i_t) Compact representation & generalization: [xt+1; zt+1] = [xt; zt] − σ [γ̃Lt, Lt; −Lt, 0] [xt; zt] − ηt [g̃xt; 0], generalized as vt+1 = (I − σ E ⊗ Lt) vt − ηt gt 31 / 68
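A minimal sketch of one round of this update for d-dimensional estimates stacked per agent; the function and variable names are illustrative, and the subgradients of the previous local objectives are assumed to be passed in as an (N, d) array.

```python
import numpy as np

def online_pi_consensus_step(x, z, grads, L_t, eta_t, sigma, gamma_tilde):
    """One round of subgradient descent + proportional-integral disagreement feedback.

    x, z  : (N, d) arrays of primal estimates and integral (auxiliary) states
    grads : (N, d) array, row i a subgradient of f^i_t at x^i_t
    L_t   : (N, N) Laplacian of the weight-balanced digraph at round t
    """
    x_new = x - eta_t * grads - sigma * (gamma_tilde * (L_t @ x) + L_t @ z)
    z_new = z + sigma * (L_t @ x)          # integral feedback on disagreement
    return x_new, z_new
```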
  • 38. Theorem (DMN-JC Sublinear agent regret) Assume that {f 1_t, ..., f N_t}, t = 1, . . . , T, are convex functions on Rd with H-bounded subgradient sets, nonempty and uniformly bounded sets of minimizers, and p-strongly convex in a sufficiently large neighborhood of their minimizers; the sequence of weight-balanced communication digraphs is δ-nondegenerate and B-jointly connected; and E ∈ R^{K×K} is diagonalizable with positive real eigenvalues. For δ̃′ ∈ (0, 1) and δ̃ := min{ δ̃′, (1 − δ̃′) λmin(E) δ / (λmax(E) dmax) }, choosing σ ∈ [ δ̃/(λmin(E) δ), (1 − δ̃)/(λmax(E) dmax) ] and ηt = 1/(p t), Rj(u, T) ≤ C (‖u‖2² + 1 + log T), for any j ∈ {1, . . . , N} and u ∈ Rd 32 / 68
  • 39. Theorem (DMN-JC Sublinear agent regret) Assume that {f 1_t, ..., f N_t}, t = 1, . . . , T, are convex functions on Rd with H-bounded subgradient sets and nonempty, uniformly bounded sets of minimizers; the sequence of weight-balanced communication digraphs is δ-nondegenerate and B-jointly connected; and E ∈ R^{K×K} is diagonalizable with positive real eigenvalues. Substituting strong convexity by β-centrality, for β ∈ (0, 1], Rj(u, T) ≤ C ‖u‖2² √T, for any j ∈ {1, . . . , N} and u ∈ Rd 32 / 68
  • 40. Simulations: acute brain finding revealed on Computerized Tomography 33 / 68
  • 41. Simulations Figure: agents’ estimates {x^i_{t,7}} and {x^i_{t,8}} versus time t, compared with the centralized solution. Figure: average regret max_j (1/T) Rj(x∗_T, Σ_i f i_t) versus time horizon T, for proportional-integral disagreement feedback, proportional disagreement feedback, and the centralized method. Local objectives: f i_t(x) = Σ_{s∈P i_t} l(ys h(x, ws)) + (1/10) ‖x‖2², where l(m) = log(1 + e^{−2m}) 34 / 68
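A sketch of the local objective used in this simulation, assuming the predictor h(x, w) is the linear model x·w (the slides do not spell h out); the function and argument names are illustrative.

```python
import numpy as np

def local_online_loss(x, W_patients, y, reg=0.1):
    """f^i_t(x) = sum_s log(1 + exp(-2 y_s h(x, w_s))) + 0.1 ||x||_2^2,
    with the (assumed) linear predictor h(x, w) = <x, w>."""
    margins = y * (W_patients @ x)            # y_s * <x, w_s> for each patient s
    return np.sum(np.log1p(np.exp(-2.0 * margins))) + reg * np.dot(x, x)

# Toy usage: 3 patients with 4 medical features each, labels in {-1, +1}.
rng = np.random.default_rng(0)
print(local_online_loss(rng.standard_normal(4), rng.standard_normal((3, 4)),
                        np.array([1.0, -1.0, 1.0])))
```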
  • 42. Outline of the proof Analysis of network regret RN(u, T) := Σ_{t=1}^T Σ_{i=1}^N f i_t(x^i_t) − Σ_{t=1}^T Σ_{i=1}^N f i_t(u) Input-to-state stability with respect to agreement: ‖L̂K vt‖2 ≤ CI ‖v1‖2 (1 − δ̃/(4N²))^{(t−1)/B} + CU max_{1≤s≤t−1} ‖ds‖2 Agent regret bound in terms of {ηt}_{t≥1} and initial conditions Boundedness of online estimates under β-centrality (β ∈ (0, 1]) Doubling trick to obtain O(√T) agent regret Local strong convexity ⇒ β-centrality; ηt = 1/(p t) to obtain O(log T) agent regret 35 / 68
  • 43. End of Part II Contributions Extension of classical √T- and log(T)-regret bounds to the multi-agent scenario B-joint connectivity of communication topologies No projections onto compact sets needed Proportional-integral feedback on disagreement Relevant for regression & classification Future work Refine guarantees under a model for the evolution of the objective functions 36 / 68
  • 44. Part III: Distributed optimization with nuclear norm Distributed optimization Nuclear norm regularization 37 / 68
  • 45. Review of nuclear norm regularization Definition Application to matrix completion 38 / 68
  • 46. Coupling through nuclear norm Given a matrix W = [w1 · · · wN] ∈ Rd×N, ‖W‖∗ = sum of singular values of W = trace √(W Wᵀ) = trace √(Σ_{i=1}^N wi wiᵀ). The nuclear norm regularization min_{wi∈Wi} Σ_{i=1}^N f i(wi) + γ ‖W‖∗ favors vectors {wi}_{i=1}^N belonging to a low-dimensional subspace 39 / 68
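A short numerical check of this definition, a sketch using numpy's SVD; the function name is illustrative.

```python
import numpy as np

def nuclear_norm(W):
    """||W||_* = sum of singular values of W = trace(sqrt(W W^T))."""
    return np.linalg.svd(W, compute_uv=False).sum()

# Columns drawn from a 1-dimensional subspace give rank 1, so ||W||_* equals
# the single nonzero singular value, i.e. the Frobenius norm of W:
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 1)) @ rng.standard_normal((1, 8))
print(nuclear_norm(W), np.linalg.norm(W))    # the two values agree (rank 1)
```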
  • 47. Application of nuclear norm Matrix completion for recommender systems 40 / 68
  • 48. Low-rank matrix completion (Rating table: users Tara, Philip, Mauricio, Miroslav; movies Toy Story, Jurassic Park, . . . ) W = [W:,1 W:,2 W:,3 W:,4] Incomplete rating matrix Z assumed low-rank. To estimate the missing entries, min_{W∈Rd×N} Σ_{(i,j)∈Γ} (Wij − Zij)² + γ ‖W‖∗, where Γ indexes the revealed entries of the rating matrix Z; γ depends on application, dimensions... Regularization, not penalty. Netflix: users ∼ 10⁷, movies ∼ 10⁵. Why make it distributed? Because users may not want to share their ratings 41 / 68
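For reference, a minimal centralized baseline for this objective (not the distributed method of Part III): proximal gradient descent, where the prox of the nuclear-norm term soft-thresholds the singular values. A sketch under the assumptions stated in the comments; all names are illustrative.

```python
import numpy as np

def svt_matrix_completion(Z, mask, gamma, step=0.25, iters=500):
    """Centralized proximal gradient on
       sum_{(i,j) in Gamma} (W_ij - Z_ij)^2 + gamma ||W||_*.

    Z    : (d, N) matrix, arbitrary values at unrevealed entries
    mask : (d, N) 0-1 array, 1 on the revealed index set Gamma
    """
    W = np.zeros_like(Z, dtype=float)
    for _ in range(iters):
        grad = 2.0 * mask * (W - Z)                  # gradient of the data-fit term
        U, s, Vt = np.linalg.svd(W - step * grad, full_matrices=False)
        # prox of step*gamma*||.||_*: soft-threshold the singular values
        W = U @ np.diag(np.maximum(s - step * gamma, 0.0)) @ Vt
    return W
```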
  • 49. Distributed formulation of nuclear norm Useful characterizations and challenges Our dynamics for distributed optimization with nuclear norm theorem of convergence simulations 42 / 68
  • 50. 1st characterization of the nuclear norm For any W ∈ Rd×N, 2 ‖W‖∗ = 2 trace √(W Wᵀ) = min_{D ∈ Sd⪰0, C(D) ⊇ C(W)} trace(D† W Wᵀ) + trace(D) = min_{D ∈ Sd⪰0, wi ∈ C(D)} Σ_{i=1}^N wiᵀ D† wi + trace(D), where C(·) denotes column space and D† the pseudoinverse. Used in A. Argyriou, T. Evgeniou, and M. Pontil, NIPS, 06. Minimizer: D∗ = √(W Wᵀ). For ‖W‖∗², use the hard constraint trace(D) ≤ 1; then D∗ = √(W Wᵀ) / trace √(W Wᵀ). Challenges related to the pseudoinverse: the optimizer D∗ = √(W Wᵀ) might be rank deficient; the pseudoinverse D† might be discontinuous when the rank of D changes 43 / 68
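A quick numerical check of this identity evaluated at its claimed minimizer D∗ = √(W Wᵀ); a sketch assuming numpy/scipy, not taken from the slides.

```python
import numpy as np
from scipy.linalg import sqrtm, pinv

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 6))

D_star = np.real(sqrtm(W @ W.T))                 # claimed minimizer sqrt(W W^T)
value = np.trace(pinv(D_star) @ W @ W.T) + np.trace(D_star)
two_nuclear = 2 * np.linalg.svd(W, compute_uv=False).sum()
print(value, two_nuclear)                        # both equal 2 ||W||_* up to round-off
```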
  • 51. Convexity property of wᵀ D† w It is the point-wise supremum of functions linear in (w, D): sup_{x∈Rd} −(xᵀ D x + 2 wᵀ x) = wᵀ D† w if D ∈ Sd⪰0 and w ∈ C(D), and ∞ otherwise. Hence wᵀ D† w is jointly convex on {w ∈ Rd, D ∈ Sd⪰0 : w ∈ C(D)}. Here the pseudoinverse is “hidden” in the optimizer; good for iterative algorithms. Curiosity: this is, up to a factor, the Fenchel conjugate q∗(−2w) of q(x) := xᵀ D x for fixed D; a proof appears in the appendix of “Convex Optimization” by Boyd and Vandenberghe 44 / 68
  • 52. 2nd characterization of the nuclear norm The latter expression allows us to write 2 ‖W‖∗ = min_{D ∈ Sd⪰0, wi ∈ C(D)} Σ_{i=1}^N sup_{xi∈Rd} −(xiᵀ D xi + 2 wiᵀ xi) + trace(D), where the inner sum equals Σ_{i=1}^N wiᵀ D† wi. Sum of convex functions in the vectors {wi}_{i=1}^N (as well as in the auxiliary vectors {xi}_{i=1}^N) Common auxiliary matrix D ∈ Rd×d, ideally d ≪ N Can be “distributed” as a min-max problem with local copies of D subject to agreement 45 / 68
  • 53. Approximate regularization to guarantee D∗ ⪰ c I Challenges related to the constraint w ∈ C(D): the set {w ∈ Rd, D ∈ Sd⪰0 : w ∈ C(D)} is convex but not closed; also, Sd≻0 is open; D D† wi projects wi onto C(D) but is state dependent. Instead use ‖[W | √Id]‖∗ = trace √(W Wᵀ + Id) = min_{D ∈ Sd⪰0, C(W) ⊆ C(D)} trace(D† (W Wᵀ + Id)) + trace(D), with minimizer D∗ = √(W Wᵀ + Id) ⪰ √Id. Replace the constraint w ∈ C(D) by the dummy constraint D ⪰ √Id 46 / 68
  • 54. Min-max problem with explicit agreement min_{wi∈Wi} Σ_{i=1}^N f i(wi) + ‖[W | √Id]‖∗ = min_{wi∈Wi, Di∈Rd×d, Di=Dj ∀i,j} sup_{xi∈Rd, Yi∈Rd×d} Σ_{i=1}^N Fi(wi, Di, xi, Yi), convex in (wi, Di) and concave in (xi, Yi), with Fi(w, D, x, Y) := fi(w) + γ trace(D(−x xᵀ − (1/N) Y Yᵀ)) − 2γ wᵀ x − (2γ/N) trace(Y) + (1/N) trace(D), where the Y terms implement the approximate regularization 47 / 68
  • 55. Distributed saddle-point dynamics for nuclear optimization wi(k+1) = PW( wi(k) − ηk (gi(k) − 2γ xi(k)) ) Di(k+1) = PD( Di(k) − ηk ( γ(−xi(k) xi(k)ᵀ − (1/N) Yi(k) Yi(k)ᵀ) + (1/N) Id ) + σ Σ_{j=1}^N aij,t (Dj(k) − Di(k)) ) xi(k+1) = xi(k) + ηk γ ( −2 Di(k) xi(k) − 2 wi(k) ) Yi(k+1) = Yi(k) + ηk γ ( −(2/N) Di(k) Yi(k) − (2/N) Id ) User i does not need to share wi with its neighbors! The local matrices converge, Di → √(Σ_{i=1}^N wi wiᵀ + Id) Complexity: 1/√k saddle-point error × projection onto {D ⪰ c Id}, PD(D) := √((D − c Id)²) + c Id (e.g., using Newton) 48 / 68
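One way to evaluate the map PD above is through an eigendecomposition rather than a Newton iteration for the matrix square root; a sketch, with an illustrative function name.

```python
import numpy as np

def retract_to_psd_cone(D, c):
    """The map P_D(D) = sqrt((D - c*I)^2) + c*I from the slide, computed via an
    eigendecomposition. It leaves any D with D >= c*I unchanged and always
    returns a symmetric matrix satisfying D >= c*I."""
    n = D.shape[0]
    S = (D + D.T) / 2.0                                  # symmetrize against round-off
    vals, vecs = np.linalg.eigh(S - c * np.eye(n))
    return vecs @ np.diag(np.abs(vals)) @ vecs.T + c * np.eye(n)
```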
  • 56. Theorem (DMN-JC Distributed saddle-point nuclear optimization) Assume that fi is convex and the set Wi is convex and compact. Also, let the sequence of weight-balanced communication digraphs be nondegenerate and B-jointly connected. Then, for a suitable choice of the consensus stepsize σ and the learning rates {ηk}, the saddle-point evaluation error of the running-time averages decreases as 1/√k. The proof follows from the developments in the next section. 49 / 68
  • 57. Privacy preserving matrix completion Simulation (distributed saddle-point dynamics vs. centralized subgradient descent), plotted against the iteration k: matrix fitting error ‖W(k) − Z‖F / ‖Z‖F; objective with nuclear regularization Σ_{i=1}^N Σ_{j∈Υi} (Wij(k) − Zij)² + γ ‖[W(k) | √Id]‖∗; disagreement of the local matrices (Σ_{i=1}^N ‖Di(k) − (1/N) Σ_{i=1}^N Di(k)‖F²)^{1/2} 50 / 68
  • 58. End of Part III Contributions First multi-agent treatment of nuclear norm regularization Privacy-preserving matrix completion Future work Inexact projections; compressed matrix communication 51 / 68
  • 59. Part IV (and last): Saddle-point problems Distributed optimization Constraints 52 / 68
  • 60. Organization Distributed constrained optimization Previous work and limitations The magic of agreement on the multiplier Saddle-point problems with explicit agreement Nuclear norm min-max characterization Lagrangians Our distributed saddle-point dynamics with Laplacian averaging theorem for our dynamics: 1/√t saddle-point evaluation error outline of the proof 53 / 68
  • 61. Distributed constrained optimization min_{wi∈Wi ∀i, D∈D} Σ_{i=1}^N f i(wi, D) s.t. g1(w1, D) + · · · + gN(wN, D) ≤ 0 {wi}_{i=1}^N are local decision vectors, D is a global decision vector Applications: Traffic and routing Resource allocation Optimal control Network formation 54 / 68
  • 62. Previous work by type of constraint min_x Σ_{i=1}^N f i(x) s.t. g(x) ≤ 0 (all agents know the constraint): 2011 D. Yuan, S. Xu, and H. Zhao; 2012 M. Zhu and S. Martínez; D. Yuan, D. W. C. Ho, and S. Xu (to appear). min_{wi∈Wi} Σ_{i=1}^N f i(wi) s.t. Σ_{i=1}^N gi(wi) ≤ 0 (agent i knows only gi): 2010 D. Mosk-Aoyama, T. Roughgarden and D. Shah (only linear constraints); 2014 T.-H. Chang, A. Nedić and A. Scaglione, primal-dual perturbation methods 55 / 68
  • 63. Distributing the constraint via agreement on multipliers min_{wi∈Wi, D∈D} Σ_{i=1}^N f i(wi, D) s.t. g1(w1, D) + · · · + gN(wN, D) ≤ 0 Dual decomposition, if strong duality holds: min_{wi∈Wi, D∈D} max_{z∈Rm≥0} Σ_{i=1}^N f i(wi, D) + zᵀ Σ_{i=1}^N gi(wi, D) = min_{wi∈Wi, D∈D} max_{zi∈Rm≥0, zi=zj ∀i,j} Σ_{i=1}^N ( f i(wi, D) + ziᵀ gi(wi, D) ) = min_{wi∈Wi, Di∈D, Di=Dj ∀i,j} max_{zi∈Rm≥0, zi=zj ∀i,j} Σ_{i=1}^N ( f i(wi, Di) + ziᵀ gi(wi, Di) ) 56 / 68
  • 64. Saddle-point problems with explicit agreement A more general framework: min_{w∈W, (D1,...,DN)∈DN, Di=Dj ∀i,j} max_{µ∈M, (z1,...,zN)∈ZN, zi=zj ∀i,j} φ(w, (D1, . . . , DN), µ, (z1, . . . , zN)), convex in (w, D1, . . . , DN) and concave in (µ, z1, . . . , zN). Problem unstudied in the literature Fits the min-max formulation of nuclear norm regularization (the concave part is quadratic) Fits convex-concave functions arising from Lagrangians (the concave part is linear) 57 / 68
  • 65. Our dynamics Subgradient saddle-point dynamics with Laplacian averaging (provided a saddle point exists under agreement): ŵt+1 = wt − ηt gwt; D̂t+1 = Dt − σ Lt Dt (Laplacian averaging) − ηt gDt; µ̂t+1 = µt + ηt gµt; ẑt+1 = zt − σ Lt zt (Laplacian averaging) + ηt ĝzt; (wt+1, Dt+1, µt+1, zt+1) = PS(ŵt+1, D̂t+1, µ̂t+1, ẑt+1) (orthogonal projection). Here gwt, gDt, gµt, ĝzt are subgradients and supergradients of φ(w, D, µ, z) and S = W × DN × M × ZN 58 / 68
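A minimal sketch of one iteration of these dynamics, assuming generic callbacks for the sub/supergradients and for the projection onto S; all names are illustrative, and the local copies are stored with the agent index as the first array axis.

```python
import numpy as np

def saddle_point_step(w, D, mu, z, grads, L_t, eta_t, sigma, project_S):
    """One step of subgradient saddle-point dynamics with Laplacian averaging.

    w, mu : global primal / dual variables (arrays)
    D, z  : local copies, shape (N, ...); agreement is driven by -sigma * L_t (.)
    grads : dict with subgradients 'w', 'D' and supergradients 'mu', 'z' of phi
    """
    lap = lambda V: np.tensordot(L_t, V, axes=1)     # apply L_t across the agent axis
    w_hat  = w  - eta_t * grads["w"]
    D_hat  = D  - sigma * lap(D) - eta_t * grads["D"]
    mu_hat = mu + eta_t * grads["mu"]
    z_hat  = z  - sigma * lap(z) + eta_t * grads["z"]
    return project_S(w_hat, D_hat, mu_hat, z_hat)    # orthogonal projection onto S
```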
  • 66. Theorem (DMN-JC Distributed saddle-point approximation) Assume that φ(w, D, µ, z) is convex in (w, D) ∈ W × DN and concave in (µ, z) ∈ M × ZN; the dynamics is bounded; and the sequence of weight-balanced communication digraphs is nondegenerate and B-jointly connected. For δ̃′ ∈ (0, 1) and δ̃ := min{ δ̃′, (1 − δ̃′) δ / dmax }, choosing σ ∈ [ δ̃/δ, (1 − δ̃)/dmax ] and {ηt} based on the Doubling Trick scheme, then for any saddle point (w∗, D∗, µ∗, z∗) of φ within agreement, −α/√(t − 1) ≤ φ(wav_t, Dav_t, µav_t, zav_t) − φ(w∗, D∗, µ∗, z∗) ≤ α/√(t − 1), where wav_t := (1/t) Σ_{s=1}^t ws, etc. 59 / 68
  • 67. Outline of the proof Inequality techniques from A. Nedić and A. Ozdaglar, 2009 Saddle-point evaluation error (2): t φ(wav_{t+1}, Dav_{t+1}, µav_{t+1}, zav_{t+1}) − t φ(w∗, D∗, µ∗, z∗), at the running-time averages wav_{t+1} := (1/t) Σ_{s=1}^t ws, etc. Bound for (2) in terms of: initial conditions; bound on subgradients and states of the dynamics; disagreement; sum of learning rates Input-to-state stability with respect to agreement: ‖LK Dt‖2 ≤ CI ‖D1‖2 (1 − δ̃/(4N²))^{(t−1)/B} + CU max_{1≤s≤t−1} ‖ds‖2 Doubling Trick scheme: for m = 0, 1, 2, . . . , ⌈log2 t⌉, take ηs = 1/√(2^m) in each period of 2^m rounds, s = 2^m, . . . , 2^{m+1} − 1 60 / 68
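A sketch of the Doubling Trick stepsize schedule described above; the function name is illustrative.

```python
import math

def doubling_trick_stepsize(s):
    """eta_s = 1/sqrt(2^m) while round s lies in the period {2^m, ..., 2^(m+1)-1}."""
    m = int(math.floor(math.log2(s)))      # index of the period containing round s
    return 1.0 / math.sqrt(2.0 ** m)

# eta_1 = 1, eta_2 = eta_3 = 1/sqrt(2), eta_4 = ... = eta_7 = 1/2, ...
print([round(doubling_trick_stepsize(s), 3) for s in range(1, 9)])
```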
  • 68. End of Part IV Contributions Distributed subgradient methods for saddle-point problems Particularize to Saddle-points of Lagrangians coming from separable constraints Min-max characterization of nuclear norm Future work Bounds on Lagrange vectors and matrices in a distributed way Boundedness properties of the dynamics without truncated projections Semidefinite constraints with chordal sparsity, where agents update the entries corresponding to maximal cliques subject to agreement on the intersections 61 / 68
  • 70. Summary Global decision vectors: min_{x∈Rd} Σ_{i=1}^N f i(x) and Rj_T := Σ_{t=1}^T Σ_{i=1}^N f i_t(x^j_t) − min_{x∈Rd} Σ_{t=1}^T Σ_{i=1}^N f i_t(x) Local decision vectors: min_{wi∈Rd} Σ_{i=1}^N f i(wi) + γ ‖[w1 . . . wN]‖∗ and min_{wi∈Wi, D∈D} Σ_{i=1}^N f i(wi, D) s.t. Σ_{i=1}^N gi(wi, D) ≤ 0 63 / 68
  • 71. Noise-to-state exponentially stable distributed convex optimization on weight-balanced digraphs, SIAM Journal on Control and Optimization, under revision pth moment noise-to-state stability of stochastic differential equations with persistent noise, SIAM Journal on Control and Optimization 52 (4) (2014), 2399-2421 Distributed online convex optimization over jointly connected digraphs, IEEE Transactions on Network Science and Engineering 1 (1) (2014), 23-37 And conference versions presented in ACC 13, CDC 13, MTNS 14 64 / 68
  • 72. Distributed optimization for multi-task learning via nuclear-norm approximation, IFAC Workshop on Distributed Estimation and Control in Networked Systems Distributed subgradient methods for saddle-point problems, CDC 15 (In preparation for Transactions on Automatic Control) 65 / 68
  • 73. Next steps Algorithmic principles in biology: homeostasis in neocortex (noisy computational units, distribution of firing rates, recovery after disruptions); differentiation in populations of cells (distribution of phenotypes, recovery after injury or disease, cancer, stem cells) 66 / 68
  • 74. Beyond biology Control and security of cyber-physical systems: cooperative optimization + distributed trust; heterogeneous agents’ dynamics & incentives; malicious minority 67 / 68
  • 75. Thank you for listening! 68 / 68