Speaker: Kangwook Lee (Postdoctoral Researcher, KAIST)
Date: May 2017
Kangwook Lee is a postdoctoral scholar in the School of EE at KAIST, working with Prof. Changho Suh. He received his PhD degree in 2016 from the EECS department at UC Berkeley under the supervision of Prof. Kannan Ramchandran. He also obtained his MS degree in EECS from UC Berkeley in 2012, and BS degree in EE from KAIST in 2010.
Outline:
1. Coded Computation
2. Coded Shuffling
2. Microsoft’s data center in Dublin, Ireland
# of servers > 1,000,000
• 300,000 for Xbox
• 700,000 for ?
Estimated cost > $2.5B
Size ~= Large Football Stadiums
4. "The scale and complexity of modern Web services make it infeasible to eliminate all latency variability." - Jeff Dean, Google
29. Replication-based Algorithm
[Diagram: the master M splits A into two blocks A'_1, A'_2 and replicates each on two workers; W1 and W2 compute A'_1 b, W3 and W4 compute A'_2 b, and M assembles Ab from the faster replica of each block.]
$T_{\text{replication}} = \max\left[\min(T'_1, T'_2),\ \min(T'_3, T'_4)\right]$
Design param.: the replication factor.
[LPR, IEEE/ACM ToN'16]
[SLR, IEEE ToC'16]
[D. Wang, G. Joshi, G. Wornell, ACM Sigmetrics'15]
[K. Gardner, S. Zbarsky, S. Doroudi, M. Harchol-Balter, E. Hyytia, A. Scheller-Wolf, ACM Sigmetrics'15]
Replication is the most popular choice in practice and is well studied in theory.
30. Coded Algorithm
[Diagram: the master M splits A into three blocks A''_1, A''_2, A''_3 and encodes them with a (4, 3) MDS code; W1, W2, W3 compute A''_1 b, A''_2 b, A''_3 b, and W4 computes the parity (A''_1 + A''_2 + A''_3) b. M recovers Ab from the first three of the four results.]
$T_{(4,3)\ \text{MDS-coded}} = 3\text{rd}\min(T''_1, T''_2, T''_3, T''_4)$
Design param.: the # of subblocks. More parities yield a larger coding gain, but the per-worker workload gets heavier.
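To make this concrete, here is a minimal numpy sketch (my illustration, not code from the talk) of the (4, 3) single-parity scheme above: any three of the four worker results suffice to reconstruct Ab.

```python
# Sketch of a (4, 3) single-parity coded matrix-vector multiply:
# three systematic blocks plus one parity block tolerate one straggler.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))        # toy sizes; row count divisible by 3
b = rng.standard_normal(4)

A1, A2, A3 = np.split(A, 3)            # systematic blocks A''_1, A''_2, A''_3
A4 = A1 + A2 + A3                      # parity block assigned to W4

results = [Ai @ b for Ai in (A1, A2, A3, A4)]   # one block per worker

# Suppose W2 (holding A''_2 b) straggles: decode its result from the rest.
A2b = results[3] - results[0] - results[2]      # parity minus the systematic results
Ab = np.concatenate([results[0], A2b, results[2]])
assert np.allclose(Ab, A @ b)
```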
31. Coded Algorithm (same diagram as the previous slide)
Q1. Given a latency distribution, what are the optimal parameters for these algorithms?
Q2. Can the 'coded algorithm' achieve the optimal latency scaling?
32. Coded Computation for Linear Operations
Theorem: $E[T_{\text{uncoded}}] = \Theta\left(\frac{\log n}{n}\right)$
Assumptions:
§ n workers
§ k-way parallelizable: $x = f(x_1, x_2, \ldots, x_k)$
§ computing time of each subtask = constant + exponential RV (i.i.d.)
§ average computing time is proportional to 1/k
[Figure: E[T] versus the design parameter k; the uncoded scheme sits at k = n, replication improves on it, and the coded curve attains its minimum at an optimal k*.]
$E[T^*_{\text{replication}}] = \Theta\left(\frac{\log n}{n}\right)$
$E[T^*_{\text{MDS-coded}}] = \Theta\left(\frac{1}{n}\right)$
[LLPPR, NIPS workshop'15]
[LLPPR, IEEE ISIT'16]
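The scaling in this theorem is easy to check numerically. Below is a hedged Monte Carlo sketch under the slide's runtime model (constant plus exponential, with mean scaled by 1/k); the shift, rate, and replication factor of 2 are illustrative choices of mine, not parameters from the talk.

```python
# Compare mean latencies of uncoded, 2-replicated, and (n, n/2) MDS-coded schemes.
import numpy as np

rng = np.random.default_rng(1)
n, trials = 1000, 2000
shift = 1.0                            # constant part of the runtime (arbitrary)

def runtimes(k, num_workers):
    # per-worker runtime when the task is split into k parts
    return (shift + rng.exponential(1.0, size=(trials, num_workers))) / k

T_uncoded = runtimes(n, n).max(axis=1)                     # wait for all n workers
rep = runtimes(n // 2, n)                                  # n/2 blocks, 2 replicas each
T_rep = np.minimum(rep[:, ::2], rep[:, 1::2]).max(axis=1)  # max over blocks of min over replicas
k = n // 2                                                 # (n, n/2) MDS code
T_mds = np.sort(runtimes(k, n), axis=1)[:, k - 1]          # k-th fastest worker suffices

print(T_uncoded.mean(), T_rep.mean(), T_mds.mean())
# Uncoded and replication both carry the log(n)/n factor; MDS-coded stays Theta(1/n).
```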
38. Distributed Matrix-Matrix Multiplication
$A \times B = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_n \end{pmatrix} \times B = \begin{pmatrix} A_1 \times B \\ A_2 \times B \\ \vdots \\ A_n \times B \end{pmatrix}$
When both A and B scale, this task does not fit in a single worker.
39. Encode "A" and Multiply with B
$\begin{pmatrix} A_1 \\ A_2 \\ A_1 + A_2 \\ A_1 + 2A_2 \end{pmatrix} \times \begin{pmatrix} B_1 & B_2 \end{pmatrix} = \begin{pmatrix} A_1B_1 & A_1B_2 \\ A_2B_1 & A_2B_2 \\ (A_1 + A_2)B_1 & (A_1 + A_2)B_2 \\ (A_1 + 2A_2)B_1 & (A_1 + 2A_2)B_2 \end{pmatrix}$
The workers computing $AB_1$ (left column) and those computing $AB_2$ (right column) form separate groups: no coding across groups of workers!
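A small numpy sketch of this encode-A-only scheme (my illustration; the generator rows [1, 1] and [1, 2] are taken from the slide): within each column group, any two of the four worker results decode that group's half of the product.

```python
# Encode A with a (4, 2) MDS code; coding stays within each column group of B.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3)); B = rng.standard_normal((3, 2))
A1, A2 = np.split(A, 2)
B1, B2 = B[:, :1], B[:, 1:]

coded_A = [A1, A2, A1 + A2, A1 + 2 * A2]       # four coded row blocks
group1 = [Ai @ B1 for Ai in coded_A]           # 4 workers handle B1
group2 = [Ai @ B2 for Ai in coded_A]           # 4 more workers handle B2

# Say only workers 3 and 4 of group 1 finish: their results are
# G @ [A1; A2] @ B1 with G = [[1, 1], [1, 2]], so invert G to decode.
G = np.array([[1.0, 1.0], [1.0, 2.0]])
stacked = np.stack([group1[2], group1[3]])
decoded = np.tensordot(np.linalg.inv(G), stacked, axes=1)   # undo the coding
AB1 = np.concatenate([decoded[0], decoded[1]])
assert np.allclose(AB1, A @ B1)
```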
40. Encode over $\{A_iB_j\}$
Alternatively, code across all workers with a random linear code over the four products: each worker computes one combination such as
$A_1B_1 + 2A_1B_2 + 3A_2B_1 + 4A_2B_2$, $A_1B_1 + 2A_1B_2 + 5A_2B_1 + 3A_2B_2$, $3A_1B_1 + A_1B_2 + 2A_2B_1 + 4A_2B_2$, $A_1B_1 + 5A_1B_2 + 4A_2B_1 + 2A_2B_2$.
This buys coding across all workers at the price of 4x the computation, and hence 4x the average latency, per worker.
41. Encode both "A" and "B"
$\begin{pmatrix} A_1 \\ A_2 \\ A_1 + A_2 \end{pmatrix} \times \begin{pmatrix} B_1 & B_2 & B_1 + B_2 \end{pmatrix} = \begin{pmatrix} A_1B_1 & A_1B_2 & A_1(B_1 + B_2) \\ A_2B_1 & A_2B_2 & A_2(B_1 + B_2) \\ (A_1 + A_2)B_1 & (A_1 + A_2)B_2 & \end{pmatrix}$
(No worker is assigned the bottom-right corner.)
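Here is a brief numpy sketch of this product-coded table (my illustration, using the single-parity row and column codes on the slide), including recovery of one straggling entry from the other entries in its row.

```python
# Product-coded matrix-matrix multiplication: parity on both A and B.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3)); B = rng.standard_normal((3, 2))
A1, A2 = np.split(A, 2)
B1, B2 = B[:, :1], B[:, 1:]

rows = [A1, A2, A1 + A2]               # coded row blocks
cols = [B1, B2, B1 + B2]               # coded column blocks

# 8 workers, one per table entry; the bottom-right corner is not computed.
table = [[Ai @ Bj for Bj in cols] for Ai in rows]
table[2][2] = None

# If the worker computing A1 @ B2 straggles, its row recovers it:
# A1 @ B2 = A1 @ (B1 + B2) - A1 @ B1.
recovered = table[0][2] - table[0][0]
assert np.allclose(recovered, A1 @ B2)
```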
44.-46. Encode both "A" and "B" (same product table as above): a missing entry is recovered by column decoding from the other two entries in its column, another by row decoding, and so on, alternating until the table is complete.
47. Product-Coded Computation (same product table as above)
Product Codes = coding across all workers!
48. Simulation Results
[Figure: E[T] (0 to 1.6) versus the number of workers N (400 to 2400) for the MDS-coded, product-coded, and replication schemes, together with the lower bound.]
49. Runtime Analysis
Theorem: With $k^2 + tk$ workers,
Lower bound: $E[T] \geq \frac{1}{\mu}\log\left(\frac{k+t}{t}\right) + o(1)$
MDS-coded: $E[T_{\text{MDS-coded}}] \approx \frac{1}{\mu}\log\left(\frac{k+t}{t}\right) + \frac{1}{\mu t}\sqrt{2(t+1)\log k}$
Product-coded: $E[T_{\text{product-coded}}] \approx \frac{1}{\mu}\log\left(\frac{k+t/2}{c_{t/2+1}}\right)$
55.-58. Pf: Product-Coded Computation
[Figures, over four slides: the product table with row codes R1, R2, R3 and column codes C1, C2, C3 viewed as a bipartite graph; straggling entries are peeled off wherever a row or column contains a single erasure, until a pattern remains in which every involved row and column has at least two erasures.]
A 2-core exists!
Lemma: An erasure pattern is decodable iff the corresponding bipartite graph does not have a 2-core.
Theorem: The emergence of a 2-core has a sharp threshold.
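The lemma can be illustrated with a small peeling decoder. The sketch below (my code, not the paper's) treats erased table entries as edges between row and column indices; with one parity per row and per column, any row or column containing a single erasure can be decoded, and decoding fails exactly when a 2-core remains.

```python
# Decodability check for a product code with one row parity and one column parity.
from collections import Counter

def decodable(erasures):
    """erasures: set of (row, col) pairs of missing table entries."""
    erasures = set(erasures)
    while erasures:
        row_deg = Counter(r for r, _ in erasures)
        col_deg = Counter(c for _, c in erasures)
        # An entry is decodable if it is the only erasure in its row or column.
        peelable = {(r, c) for (r, c) in erasures
                    if row_deg[r] == 1 or col_deg[c] == 1}
        if not peelable:
            return False   # every remaining row/column has >= 2 erasures: a 2-core
        erasures -= peelable
    return True

print(decodable({(0, 0), (1, 1), (2, 2)}))           # True: each erasure is alone in its row
print(decodable({(0, 0), (0, 1), (1, 0), (1, 1)}))   # False: a 4-cycle forms a 2-core
```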
59. Challenges
• Matrix-Matrix multiplication
• Nonlinear functions
• Random Sparse Linear Code
• Gradient Coding
[Lee, Pedarsani, Papailiopoulos, Ramchandran, IEEE ISIT’17]
75. Coded Shuffling for PSGD
[Diagram: parallel SGD workers each run SGD on their shard of the data, and the master merges the models x(t); between epochs the data is randomly shuffled across workers.]
PSGD with shuffling converges faster, but shuffling involves communication cost* [Recht and Re, 2013], [Bottou, 2012], [Zhang and Re, 2014], [Gurbuzbalaban et al., 2015], [Ioffe and Szegedy, 2015], [Zhang et al., 2015]
79. Coded Shuffling Algorithm
[Diagram: data points 1, 2, 3, 4. In epoch 1, W1 stores (x_1, x_2) and W2 stores (x_3, x_4); epoch 2 assigns (x_1, x_3) to W1 and (x_2, x_4) to W2. Instead of unicasting x_3 to W1 and x_2 to W2, the master M broadcasts the single coded packet x_2 + x_3; W1 cancels x_2 to recover x_3, and W2 cancels x_3 to recover x_2.]
Coding opportunity increases as each worker can store more data points!
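A minimal sketch of this broadcast trick, with hypothetical toy data points x_1, ..., x_4 standing in for real training examples:

```python
# One broadcast of x2 + x3 replaces two unicasts during the shuffle.
import numpy as np

x = {i: np.full(4, float(i)) for i in (1, 2, 3, 4)}   # toy data points

w1_cache = {1: x[1], 2: x[2]}     # W1 after epoch 1
w2_cache = {3: x[3], 4: x[4]}     # W2 after epoch 1

packet = x[2] + x[3]              # single coded broadcast from the master

x3_at_w1 = packet - w1_cache[2]   # W1 cancels x2, which it already stores
x2_at_w2 = packet - w2_cache[3]   # W2 cancels x3, which it already stores

assert np.allclose(x3_at_w1, x[3]) and np.allclose(x2_at_w2, x[2])
# With more cache per worker, more such coded packets can be formed.
```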
80. Coded Shuffling Algorithm
Thm: Let q = # of data points, α = memory overhead, i.e. α = (# of data points stored by each worker) / (q/n) with 1 ≤ α ≤ n, and n = # of workers. Then,
$T_{\text{uncoded}} = q\left(1 - \frac{\alpha}{n}\right)$
$T_{\text{coded}} = \frac{T_{\text{uncoded}}}{\alpha + 1}$
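As a numeric illustration of the theorem (using my reconstruction of the formulas above; treat the exact expressions as an assumption), with q = 1000 and n = 10:

```python
# Per-epoch communication under the reconstructed coded-shuffling formulas.
def t_uncoded(q, n, alpha):
    return q * (1 - alpha / n)

def t_coded(q, n, alpha):
    return t_uncoded(q, n, alpha) / (alpha + 1)

q, n = 1000, 10
for alpha in (1, 2, 5):
    print(alpha, t_uncoded(q, n, alpha), t_coded(q, n, alpha))
# e.g. alpha = 2: 800 transmissions uncoded vs. ~266.7 coded per epoch
```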
82. Experiments: Low-rank Matrix Completion (10M x 10M)
Tested on 25 EC2 instances: roughly a 35% gain from coded shuffling.
83. Conclusion
• Coding theory for distributed computing
• Stragglers slow down distributed computing => Coded Computation
• Data needs to be shuffled between distributed nodes => Coded Shuffling