COCOA: Communication-Efficient Coordinate Ascent

Virginia Smith

Martin Jaggi, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, & Michael I. Jordan
LARGE-SCALE OPTIMIZATION
COCOA
RESULTS
Machine Learning with Large Datasets
image/music/video tagging 
document categorization 
item recommendation 
click-through rate prediction 
sequence tagging 
protein structure prediction 
sensor data prediction 
spam classification 
fraud detection
Machine Learning Workflow

DATA & PROBLEM
classification, regression, collaborative filtering, …

MACHINE LEARNING MODEL
logistic regression, lasso, support vector machines, …

OPTIMIZATION ALGORITHM
gradient descent, coordinate descent, Newton’s method, …
Example: SVM Classification

[Figure: max-margin linear classifier; decision boundary wᵀx − b = 0 with margin boundaries wᵀx − b = ±1, separated by width 2/||w||]

$$\min_{w \in \mathbb{R}^d} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_{\mathrm{hinge}}(y_i\, w^T x_i)$$
Descent algorithms and line search methods 
Acceleration, momentum, and conjugate gradients 
Newton and Quasi-Newton methods 
Coordinate descent 
Stochastic and incremental gradient methods 
SMO 
SVMlight 
LIBLINEAR
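To make the objective concrete, here is a minimal NumPy sketch (not from the talk; the toy data and the regularization value lam are hypothetical) that evaluates the hinge-loss primal objective P(w) for a given weight vector:

```python
import numpy as np

# A minimal sketch (illustrative) evaluating the SVM primal objective
# P(w) = (lambda/2)||w||^2 + (1/n) * sum_i hinge(y_i * w^T x_i).

def svm_primal(w, X, y, lam):
    margins = y * (X @ w)                    # y_i * w^T x_i for all i
    hinge = np.maximum(0.0, 1.0 - margins)   # hinge loss per example
    return 0.5 * lam * (w @ w) + hinge.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                # toy data: n=100, d=5
y = np.sign(rng.normal(size=100))
w = np.zeros(5)
print(svm_primal(w, X, y, lam=0.01))         # at w=0 the objective is exactly 1.0
```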
Machine Learning Workflow

DATA & PROBLEM
classification, regression, collaborative filtering, …

MACHINE LEARNING MODEL
logistic regression, lasso, support vector machines, …

OPTIMIZATION ALGORITHM
gradient descent, coordinate descent, Newton’s method, …

SYSTEMS SETTING
multi-core, cluster, cloud, supercomputer, …

Open Problem: efficiently solving the objective when data is distributed
Distributed Optimization

“always communicate”
reduce: $w \leftarrow w - \alpha \sum_k \Delta w_k$
✔ convergence guarantees
✗ high communication

“never communicate”
average: $w := \frac{1}{K} \sum_k w_k$ [ZDWJ, 2012]
✔ low communication
✗ convergence not guaranteed
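The trade-off between the two extremes shows up even in a tiny simulation. The sketch below is illustrative only (the ridge-regression setup, shard scalings, round count, and step size are assumptions, not from the talk): "always communicate" performs one gradient reduce per round, while "never communicate" solves each shard exactly and averages once, and both are scored on the pooled objective:

```python
import numpy as np

# Toy contrast of the two communication extremes on ridge regression
# with heterogeneous (non-IID) data shards across K machines.

rng = np.random.default_rng(0)
K, n_k, d, lam = 4, 200, 10, 1.0
Xs = [(k + 1) * rng.normal(size=(n_k, d)) for k in range(K)]  # shard k scaled by k+1
w_true = rng.normal(size=d)
ys = [X @ w_true + 0.1 * rng.normal(size=n_k) for X in Xs]

def pooled_objective(w):
    loss = sum(np.sum((X @ w - y) ** 2) for X, y in zip(Xs, ys)) / (2 * K * n_k)
    return loss + 0.5 * lam * (w @ w)

# "always communicate": one gradient reduce per round
w = np.zeros(d)
for _ in range(300):
    g = sum(X.T @ (X @ w - y) / n_k for X, y in zip(Xs, ys)) / K + lam * w
    w -= 0.01 * g                               # reduce: w = w - alpha * g

# "never communicate": exact local ridge solutions, averaged once at the end
locals_ = [np.linalg.solve(X.T @ X / n_k + lam * np.eye(d), X.T @ y / n_k)
           for X, y in zip(Xs, ys)]
w_avg = sum(locals_) / K

print("always-communicate objective:", pooled_objective(w))
print("one-shot average objective  :", pooled_objective(w_avg))
```

Because gradient averaging optimizes the pooled objective directly, it reaches the true minimizer, while the one-shot average of heterogeneous local solutions lands at a different (worse) point.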
Mini-batch: a natural middle ground

reduce: $w \leftarrow w - \frac{\alpha}{|b|} \sum_{i \in b} \Delta w_i$
✔ convergence guarantees
✔ tunable communication
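A minimal sketch of this reduce step (illustrative; hinge-loss SVM on toy data, with the batch size and step size chosen arbitrarily):

```python
import numpy as np

# Mini-batch SGD for the hinge-loss SVM: each round, |b| per-example
# subgradients are computed (in parallel on a cluster) and reduced with
# w = w - (alpha/|b|) * sum of the per-example updates.

rng = np.random.default_rng(0)
n, d, lam, alpha = 1000, 20, 0.01, 0.25
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))

w = np.zeros(d)
for _ in range(500):
    b = rng.integers(n, size=64)                  # mini-batch of size |b| = 64
    active = (y[b] * (X[b] @ w)) < 1              # examples with nonzero hinge subgrad
    subgrads = lam * w - (y[b] * active)[:, None] * X[b]   # per-example updates
    w -= (alpha / len(b)) * subgrads.sum(axis=0)           # the reduce step
print("train accuracy:", np.mean(np.sign(X @ w) == y))
```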
Are we done?
Mini-batch Limitations
1. METHODS BEYOND SGD
2. STALE UPDATES
3. AVERAGE OVER BATCH SIZE
LARGE-SCALE OPTIMIZATION
COCOA
RESULTS
Communication-Efficient Distributed Dual Coordinate Ascent (CoCoA)

1. METHODS BEYOND SGD
Use Primal-Dual Framework

2. STALE UPDATES
Immediately apply local updates

3. AVERAGE OVER BATCH SIZE
Average over K ≪ batch size
1. Primal-Dual Framework

PRIMAL ≥ DUAL

PRIMAL: $\min_{w \in \mathbb{R}^d} \; \Big[ P(w) := \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_i(w^T x_i) \Big]$

DUAL: $\max_{\alpha \in \mathbb{R}^n} \; \Big[ D(\alpha) := -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^{n} \ell_i^*(-\alpha_i) \Big]$, where $A_i = \frac{1}{\lambda n} x_i$

Stopping criteria given by duality gap
Good performance in practice
Default in software packages, e.g. LIBLINEAR
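As a concrete illustration of the duality-gap stopping criterion, here is a small sketch for the hinge loss (a standard SVM parameterization in which the label is folded into the dual variable, so each entry lies in [0, 1]; the toy data and λ are assumptions). For any feasible α and the corresponding w = Aα, weak duality gives P(w) − D(α) ≥ 0, and the gap certifies how suboptimal the current iterate is:

```python
import numpy as np

# Duality gap for the hinge-loss SVM: alpha_i in [0,1] (label folded in),
# w = A @ alpha with A_i = x_i / (lam * n). Weak duality: P(w) >= D(alpha).

rng = np.random.default_rng(0)
n, d, lam = 500, 10, 0.1
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))

def primal(w):
    return 0.5 * lam * (w @ w) + np.mean(np.maximum(0, 1 - y * (X @ w)))

def dual(alpha):
    w = X.T @ (alpha * y) / (lam * n)   # w = A @ alpha
    return -0.5 * lam * (w @ w) + alpha.mean(), w

alpha = rng.uniform(0, 1, size=n)       # any dual-feasible point
d_val, w = dual(alpha)
gap = primal(w) - d_val
print(f"P(w)={primal(w):.4f}  D(alpha)={d_val:.4f}  gap={gap:.4f} (>= 0)")
```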
2. Immediately Apply Updates

STALE:
for i ∈ b
  Δw ← Δw − α ∇_i P(w)
end
w ← w + Δw

FRESH:
for i ∈ b
  Δw ← −α ∇_i P(w)
  w ← w + Δw
end
Output: $\Delta\alpha_{[k]}$ and $\Delta w := A_{[k]}\Delta\alpha_{[k]}$
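The difference matters once coordinates interact. The toy sketch below (illustrative; a 2-D coupled quadratic, not the talk's SVM objective) runs exact coordinate minimization both ways: the stale variant recomputes every coordinate against the round's starting point (Jacobi-style), while the fresh variant applies each update immediately (Gauss-Seidel-style) and converges noticeably faster:

```python
import numpy as np

# STALE vs FRESH coordinate updates on f(w) = 0.5 w'Qw - b'w with
# strongly coupled coordinates.

Q = np.array([[2.0, 1.5], [1.5, 2.0]])      # SPD, off-diagonal coupling
b = np.array([1.0, 1.0])
w_star = np.linalg.solve(Q, b)

def step_i(w, i):                            # exact minimization in coordinate i
    return (b[i] - Q[i] @ w + Q[i, i] * w[i]) / Q[i, i]

w_stale, w_fresh = np.zeros(2), np.zeros(2)
for _ in range(20):
    # STALE: both coordinate steps computed from the same starting w
    w_stale = np.array([step_i(w_stale, i) for i in range(2)])
    # FRESH: each step sees the effect of the previous one
    for i in range(2):
        w_fresh[i] = step_i(w_fresh, i)

print("stale error:", np.linalg.norm(w_stale - w_star))
print("fresh error:", np.linalg.norm(w_fresh - w_star))
```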
3. Average over K

reduce: $w \leftarrow w + \frac{1}{K}\sum_k \Delta w_k$
CoCoA

Algorithm 1: CoCoA
Input: $T \geq 1$, scaling parameter $1 \leq \beta_K \leq K$ (default: $\beta_K := 1$)
Data: $\{(x_i, y_i)\}_{i=1}^{n}$ distributed over $K$ machines
Initialize: $\alpha^{(0)}_{[k]} \leftarrow 0$ for all machines $k$, and $w^{(0)} \leftarrow 0$
for $t = 1, 2, \ldots, T$
  for all machines $k = 1, 2, \ldots, K$ in parallel
    $(\Delta\alpha_{[k]}, \Delta w_k) \leftarrow \mathrm{LocalDualMethod}(\alpha^{(t-1)}_{[k]}, w^{(t-1)})$
    $\alpha^{(t)}_{[k]} \leftarrow \alpha^{(t-1)}_{[k]} + \frac{\beta_K}{K}\,\Delta\alpha_{[k]}$
  end
  reduce: $w^{(t)} \leftarrow w^{(t-1)} + \frac{\beta_K}{K}\sum_{k=1}^{K} \Delta w_k$
end

Procedure A: LocalDualMethod: Dual algorithm on machine k
Input: Local $\alpha_{[k]} \in \mathbb{R}^{n_k}$, and $w \in \mathbb{R}^d$ consistent with the other coordinate blocks of $\alpha$ s.t. $w = A\alpha$
Data: Local $\{(x_i, y_i)\}_{i=1}^{n_k}$
Output: $\Delta\alpha_{[k]}$ and $\Delta w := A_{[k]}\Delta\alpha_{[k]}$

✔ 10 lines of code in Spark
✔ primal-dual framework allows for any internal optimization method
✔ local updates applied immediately
✔ average over K
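A single-process simulation of Algorithm 1 fits in a page. The sketch below is an illustration under assumptions, not the authors' 10-line Spark implementation: it uses closed-form SDCA steps for the hinge loss as LocalDualMethod (label folded into α, so each α_i ∈ [0, 1]), partitions the coordinates across K simulated machines, and applies the β_K = 1 scaling; the toy data, T, and H are arbitrary:

```python
import numpy as np

# Minimal CoCoA simulation: H SDCA coordinate steps per machine per round,
# local updates applied immediately, deltas averaged over K (beta_K = 1).

def local_sdca(X, y, alpha, w, lam, n, H, rng):
    """H randomized dual coordinate steps on one machine's block.
    Reads the shared w; returns (delta_alpha, delta_w)."""
    n_k = len(y)
    d_alpha = np.zeros(n_k)
    d_w = np.zeros(X.shape[1])
    for _ in range(H):
        i = rng.integers(n_k)
        margin = y[i] * (X[i] @ (w + d_w))         # fresh local view of w
        step = (1.0 - margin) * lam * n / (X[i] @ X[i])
        a = alpha[i] + d_alpha[i]                  # current dual value in [0,1]
        delta = np.clip(a + step, 0.0, 1.0) - a    # projected coordinate step
        d_alpha[i] += delta
        d_w += delta * y[i] * X[i] / (lam * n)     # keep local w = A @ alpha
    return d_alpha, d_w

def cocoa(X, y, K=4, T=20, H=100, lam=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    blocks = np.array_split(np.arange(n), K)       # coordinate partition
    alpha, w = np.zeros(n), np.zeros(d)
    for _ in range(T):
        d_ws = []
        for k in range(K):                          # "in parallel" on a cluster
            bk = blocks[k]
            d_a, d_w = local_sdca(X[bk], y[bk], alpha[bk], w, lam, n, H, rng)
            alpha[bk] += d_a / K                    # beta_K = 1: scale by 1/K
            d_ws.append(d_w)
        w += sum(d_ws) / K                          # reduce: average over K
    return w, alpha

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 20))
    y = np.sign(X @ rng.normal(size=20))
    w, alpha = cocoa(X, y)
    print("training accuracy:", np.mean(np.sign(X @ w) == y))
```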
Convergence

✔ it converges!
✔ convergence rate matches current state-of-the-art mini-batch methods

$$\mathbb{E}\big[D(\alpha^*) - D(\alpha^{(T)})\big] \;\leq\; \left(1 - (1-\Theta)\,\frac{1}{K}\,\frac{\lambda n}{\sigma + \lambda n}\right)^{T} \Big(D(\alpha^*) - D(\alpha^{(0)})\Big)$$

(the same rate applies also to the duality gap)

Assumptions:
- the losses $\ell_i$ are $(1/\gamma)$-smooth
- LocalDualMethod makes improvement $\Theta$ per step; e.g. for SDCA, $\Theta = \left(1 - \frac{\lambda n}{1 + \lambda n}\,\frac{1}{\tilde n}\right)^{H}$
- $\sigma$ measures the difficulty of the data partition across the $K$ machines, with $0 \leq \sigma \leq n/\lambda$
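To get a feel for the bound, here is a toy numeric illustration (all values are invented for the example, not taken from the paper's experiments): it plugs the SDCA expression for Θ into the per-round contraction factor and solves for the number of outer rounds T needed for a target dual suboptimality:

```python
import math

# Per-round contraction: r = 1 - (1 - Theta) * (1/K) * lam*n / (sigma + lam*n),
# with Theta from H local SDCA steps on the largest block of size n_tilde.

lam, n, K, sigma = 1e-4, 500_000, 8, 1e3        # hypothetical example values
H = 100_000                                     # local steps per outer round
n_tilde = n // K                                # largest local block size
theta = (1 - (lam * n) / (1 + lam * n) / n_tilde) ** H
r = 1 - (1 - theta) * (1 / K) * (lam * n) / (sigma + lam * n)
eps = 1e-6                                      # target relative suboptimality
T = math.ceil(math.log(eps) / math.log(r))
print(f"Theta={theta:.3f}, per-round factor r={r:.6f}, rounds T={T}")
```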
LARGE-SCALE OPTIMIZATION
COCOA
RESULTS
Empirical Results in Spark

Dataset    Training (n)   Features (d)   Sparsity   Workers (K)
Cov        522,911        54             22.22%     4
Rcv1       677,399        47,236         0.16%      8
Imagenet   32,751         160,000        100%       32
[Figure: log primal suboptimality vs. time (s), one panel per dataset. Imagenet: COCOA (H=1e3), mini-batch-CD (H=1), local-SGD (H=1e3), mini-batch-SGD (H=10). Cov: COCOA (H=1e5), mini-batch-CD (H=100), local-SGD (H=1e5), batch-SGD (H=1). RCV1: COCOA (H=1e5), mini-batch-CD (H=100), local-SGD (H=1e4), batch-SGD (H=100).]
COCOA Take-Aways

A framework for distributed optimization
Uses state-of-the-art primal-dual methods
Reduces communication:
- applies updates immediately
- averages over # of machines
Strong empirical results & convergence guarantees

To appear in NIPS ’14
www.berkeley.edu/~vsmith