Speaker: Kangwook Lee (Postdoctoral Researcher, KAIST)
Date: May 2017
Kangwook Lee is a postdoctoral scholar in the School of EE at KAIST, working with Prof. Changho Suh. He received his PhD degree in 2016 from the EECS department at UC Berkeley under the supervision of Prof. Kannan Ramchandran. He also obtained his MS degree in EECS from UC Berkeley in 2012, and BS degree in EE from KAIST in 2010.
Outline:
1. Coded Computation
2. Coded Shuffling
2. Microsoft’s data center in Dublin, Ireland
# of servers > 1,000,000
• 300,000 for Xbox
• 700,000 for ?
Estimated cost > $2.5B
Size ~= Large Football Stadiums
4. "The scale and complexity of modern Web services make it infeasible to eliminate all latency variability." - Jeff Dean, Google
29. Replication-based Algorithm
[Diagram: the master M splits A into two blocks A'_1, A'_2 and replicates each on two workers; W1 and W2 compute A'_1 b, W3 and W4 compute A'_2 b, and M assembles Ab from the faster replica of each block.]
$T_{\text{replication}} = \max\left[\min(T'_1, T'_2),\ \min(T'_3, T'_4)\right]$
Design param.: the replication factor.
[LPR, IEEE/ACM ToN'16]
[SLR, IEEE ToC'16]
[D. Wang, G. Joshi, G. Wornell, ACM Sigmetrics'15]
[K. Gardner, S. Zbarsky, S. Doroudi, M. Harchol-Balter, E. Hyytia, A. Scheller-Wolf, ACM Sigmetrics'15]
Replication is the most popular choice in practice and is well studied in theory.
30. Coded Algorithm
[Diagram: the master M splits A into three blocks A''_1, A''_2, A''_3 and encodes them with a (4, 3) MDS code; W1, W2, W3 compute A''_1 b, A''_2 b, A''_3 b, and W4 computes the parity (A''_1 + A''_2 + A''_3) b. M recovers Ab from the first three of the four results.]
$T_{(4,3)\ \text{MDS-coded}} = 3\text{rd}\min(T''_1, T''_2, T''_3, T''_4)$
Design param.: the # of subblocks. More parities yield a larger coding gain, but the per-worker workload gets heavier.
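To make this concrete, here is a minimal numpy sketch (my illustration, not code from the talk) of the (4, 3) single-parity scheme above: any three of the four worker results suffice to reconstruct Ab.

```python
# Sketch of a (4, 3) single-parity coded matrix-vector multiply:
# three systematic blocks plus one parity block tolerate one straggler.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))        # toy sizes; row count divisible by 3
b = rng.standard_normal(4)

A1, A2, A3 = np.split(A, 3)            # systematic blocks A''_1, A''_2, A''_3
A4 = A1 + A2 + A3                      # parity block assigned to W4

results = [Ai @ b for Ai in (A1, A2, A3, A4)]   # one block per worker

# Suppose W2 (holding A''_2 b) straggles: decode its result from the rest.
A2b = results[3] - results[0] - results[2]      # parity minus the systematic results
Ab = np.concatenate([results[0], A2b, results[2]])
assert np.allclose(Ab, A @ b)
```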
31. Coded Algorithm (same diagram as the previous slide)
Q1. Given a latency distribution, what are the optimal parameters for these algorithms?
Q2. Can the 'coded algorithm' achieve the optimal latency scaling?
32. Coded Computation for Linear Operations
Theorem: $E[T_{\text{uncoded}}] = \Theta\left(\frac{\log n}{n}\right)$
Assumptions:
§ n workers
§ k-way parallelizable: $x = f(x_1, x_2, \ldots, x_k)$
§ computing time of each subtask = constant + exponential RV (i.i.d.)
§ average computing time is proportional to 1/k
[Figure: E[T] versus the design parameter k; the uncoded scheme sits at k = n, replication improves on it, and the coded curve attains its minimum at an optimal k*.]
$E[T^*_{\text{replication}}] = \Theta\left(\frac{\log n}{n}\right)$
$E[T^*_{\text{MDS-coded}}] = \Theta\left(\frac{1}{n}\right)$
[LLPPR, NIPS workshop'15]
[LLPPR, IEEE ISIT'16]
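The scaling in this theorem is easy to check numerically. Below is a hedged Monte Carlo sketch under the slide's runtime model (constant plus exponential, with mean scaled by 1/k); the shift, rate, and replication factor of 2 are illustrative choices of mine, not parameters from the talk.

```python
# Compare mean latencies of uncoded, 2-replicated, and (n, n/2) MDS-coded schemes.
import numpy as np

rng = np.random.default_rng(1)
n, trials = 1000, 2000
shift = 1.0                            # constant part of the runtime (arbitrary)

def runtimes(k, num_workers):
    # per-worker runtime when the task is split into k parts
    return (shift + rng.exponential(1.0, size=(trials, num_workers))) / k

T_uncoded = runtimes(n, n).max(axis=1)                     # wait for all n workers
rep = runtimes(n // 2, n)                                  # n/2 blocks, 2 replicas each
T_rep = np.minimum(rep[:, ::2], rep[:, 1::2]).max(axis=1)  # max over blocks of min over replicas
k = n // 2                                                 # (n, n/2) MDS code
T_mds = np.sort(runtimes(k, n), axis=1)[:, k - 1]          # k-th fastest worker suffices

print(T_uncoded.mean(), T_rep.mean(), T_mds.mean())
# Uncoded and replication both carry the log(n)/n factor; MDS-coded stays Theta(1/n).
```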
38. Distributed Matrix-Matrix Multiplication
$A \times B = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_n \end{pmatrix} \times B = \begin{pmatrix} A_1 \times B \\ A_2 \times B \\ \vdots \\ A_n \times B \end{pmatrix}$
When both A and B scale, this task does not fit in a single worker.
39. Encode "A" and Multiply with B
$\begin{pmatrix} A_1 \\ A_2 \\ A_1 + A_2 \\ A_1 + 2A_2 \end{pmatrix} \times \begin{pmatrix} B_1 & B_2 \end{pmatrix} = \begin{pmatrix} A_1B_1 & A_1B_2 \\ A_2B_1 & A_2B_2 \\ (A_1 + A_2)B_1 & (A_1 + A_2)B_2 \\ (A_1 + 2A_2)B_1 & (A_1 + 2A_2)B_2 \end{pmatrix}$
The workers computing $AB_1$ (left column) and those computing $AB_2$ (right column) form separate groups: no coding across groups of workers!
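A small numpy sketch of this encode-A-only scheme (my illustration; the generator rows [1, 1] and [1, 2] are taken from the slide): within each column group, any two of the four worker results decode that group's half of the product.

```python
# Encode A with a (4, 2) MDS code; coding stays within each column group of B.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3)); B = rng.standard_normal((3, 2))
A1, A2 = np.split(A, 2)
B1, B2 = B[:, :1], B[:, 1:]

coded_A = [A1, A2, A1 + A2, A1 + 2 * A2]       # four coded row blocks
group1 = [Ai @ B1 for Ai in coded_A]           # 4 workers handle B1
group2 = [Ai @ B2 for Ai in coded_A]           # 4 more workers handle B2

# Say only workers 3 and 4 of group 1 finish: their results are
# G @ [A1; A2] @ B1 with G = [[1, 1], [1, 2]], so invert G to decode.
G = np.array([[1.0, 1.0], [1.0, 2.0]])
stacked = np.stack([group1[2], group1[3]])
decoded = np.tensordot(np.linalg.inv(G), stacked, axes=1)   # undo the coding
AB1 = np.concatenate([decoded[0], decoded[1]])
assert np.allclose(AB1, A @ B1)
```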
40. Encode over $\{A_iB_j\}$
Alternatively, code across all workers with a random linear code over the four products: each worker computes one combination such as
$A_1B_1 + 2A_1B_2 + 3A_2B_1 + 4A_2B_2$, $A_1B_1 + 2A_1B_2 + 5A_2B_1 + 3A_2B_2$, $3A_1B_1 + A_1B_2 + 2A_2B_1 + 4A_2B_2$, $A_1B_1 + 5A_1B_2 + 4A_2B_1 + 2A_2B_2$.
This buys coding across all workers at the price of 4x the computation, and hence 4x the average latency, per worker.
41. Encode both "A" and "B"
$\begin{pmatrix} A_1 \\ A_2 \\ A_1 + A_2 \end{pmatrix} \times \begin{pmatrix} B_1 & B_2 & B_1 + B_2 \end{pmatrix} = \begin{pmatrix} A_1B_1 & A_1B_2 & A_1(B_1 + B_2) \\ A_2B_1 & A_2B_2 & A_2(B_1 + B_2) \\ (A_1 + A_2)B_1 & (A_1 + A_2)B_2 & \end{pmatrix}$
(No worker is assigned the bottom-right corner.)
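Here is a brief numpy sketch of this product-coded table (my illustration, using the single-parity row and column codes on the slide), including recovery of one straggling entry from the other entries in its row.

```python
# Product-coded matrix-matrix multiplication: parity on both A and B.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3)); B = rng.standard_normal((3, 2))
A1, A2 = np.split(A, 2)
B1, B2 = B[:, :1], B[:, 1:]

rows = [A1, A2, A1 + A2]               # coded row blocks
cols = [B1, B2, B1 + B2]               # coded column blocks

# 8 workers, one per table entry; the bottom-right corner is not computed.
table = [[Ai @ Bj for Bj in cols] for Ai in rows]
table[2][2] = None

# If the worker computing A1 @ B2 straggles, its row recovers it:
# A1 @ B2 = A1 @ (B1 + B2) - A1 @ B1.
recovered = table[0][2] - table[0][0]
assert np.allclose(recovered, A1 @ B2)
```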
44.-46. Encode both "A" and "B" (same product table as above): a missing entry is recovered by column decoding from the other two entries in its column, another by row decoding, and so on, alternating until the table is complete.
47. Product-Coded Computation (same product table as above)
Product Codes = coding across all workers!
48. Simulation Results
[Figure: E[T] (0 to 1.6) versus the number of workers N (400 to 2400) for the MDS-coded, product-coded, and replication schemes, together with the lower bound.]
49. Runtime Analysis
Theorem: With $k^2 + tk$ workers,
Lower bound: $E[T] \geq \frac{1}{\mu}\log\left(\frac{k+t}{t}\right) + o(1)$
MDS-coded: $E[T_{\text{MDS-coded}}] \approx \frac{1}{\mu}\log\left(\frac{k+t}{t}\right) + \frac{1}{\mu t}\sqrt{2(t+1)\log k}$
Product-coded: $E[T_{\text{product-coded}}] \approx \frac{1}{\mu}\log\left(\frac{k+t/2}{c_{t/2+1}}\right)$
55.-58. Pf: Product-Coded Computation
[Figures, over four slides: the product table with row codes R1, R2, R3 and column codes C1, C2, C3 viewed as a bipartite graph; straggling entries are peeled off wherever a row or column contains a single erasure, until a pattern remains in which every involved row and column has at least two erasures.]
A 2-core exists!
Lemma: An erasure pattern is decodable iff the corresponding bipartite graph does not have a 2-core.
Theorem: The emergence of a 2-core has a sharp threshold.
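The lemma can be illustrated with a small peeling decoder. The sketch below (my code, not the paper's) treats erased table entries as edges between row and column indices; with one parity per row and per column, any row or column containing a single erasure can be decoded, and decoding fails exactly when a 2-core remains.

```python
# Decodability check for a product code with one row parity and one column parity.
from collections import Counter

def decodable(erasures):
    """erasures: set of (row, col) pairs of missing table entries."""
    erasures = set(erasures)
    while erasures:
        row_deg = Counter(r for r, _ in erasures)
        col_deg = Counter(c for _, c in erasures)
        # An entry is decodable if it is the only erasure in its row or column.
        peelable = {(r, c) for (r, c) in erasures
                    if row_deg[r] == 1 or col_deg[c] == 1}
        if not peelable:
            return False   # every remaining row/column has >= 2 erasures: a 2-core
        erasures -= peelable
    return True

print(decodable({(0, 0), (1, 1), (2, 2)}))           # True: each erasure is alone in its row
print(decodable({(0, 0), (0, 1), (1, 0), (1, 1)}))   # False: a 4-cycle forms a 2-core
```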
59. Challenges
• Matrix-Matrix multiplication
• Nonlinear functions
• Random Sparse Linear Code
• Gradient Coding
[Lee, Pedarsani, Papailiopoulos, Ramchandran, IEEE ISIT’17]
75. Coded Shuffling for PSGD
[Diagram: parallel SGD workers each run SGD on their shard of the data, and the master merges the models x(t); between epochs the data is randomly shuffled across workers.]
PSGD with shuffling converges faster, but shuffling involves communication cost* [Recht and Re, 2013], [Bottou, 2012], [Zhang and Re, 2014], [Gurbuzbalaban et al., 2015], [Ioffe and Szegedy, 2015], [Zhang et al., 2015]
79. Coded Shuffling Algorithm
[Diagram: data points 1, 2, 3, 4. In epoch 1, W1 stores (x_1, x_2) and W2 stores (x_3, x_4); epoch 2 assigns (x_1, x_3) to W1 and (x_2, x_4) to W2. Instead of unicasting x_3 to W1 and x_2 to W2, the master M broadcasts the single coded packet x_2 + x_3; W1 cancels x_2 to recover x_3, and W2 cancels x_3 to recover x_2.]
Coding opportunity increases as each worker can store more data points!
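A minimal sketch of this broadcast trick, with hypothetical toy data points x_1, ..., x_4 standing in for real training examples:

```python
# One broadcast of x2 + x3 replaces two unicasts during the shuffle.
import numpy as np

x = {i: np.full(4, float(i)) for i in (1, 2, 3, 4)}   # toy data points

w1_cache = {1: x[1], 2: x[2]}     # W1 after epoch 1
w2_cache = {3: x[3], 4: x[4]}     # W2 after epoch 1

packet = x[2] + x[3]              # single coded broadcast from the master

x3_at_w1 = packet - w1_cache[2]   # W1 cancels x2, which it already stores
x2_at_w2 = packet - w2_cache[3]   # W2 cancels x3, which it already stores

assert np.allclose(x3_at_w1, x[3]) and np.allclose(x2_at_w2, x[2])
# With more cache per worker, more such coded packets can be formed.
```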
80. Coded Shuffling Algorithm
Thm: Let q = # of data points, α = memory overhead, i.e. α = (# of data points stored by each worker) / (q/n) with 1 ≤ α ≤ n, and n = # of workers. Then,
$T_{\text{uncoded}} = q\left(1 - \frac{\alpha}{n}\right)$
$T_{\text{coded}} = \frac{T_{\text{uncoded}}}{\alpha + 1}$
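As a numeric illustration of the theorem (using my reconstruction of the formulas above; treat the exact expressions as an assumption), with q = 1000 and n = 10:

```python
# Per-epoch communication under the reconstructed coded-shuffling formulas.
def t_uncoded(q, n, alpha):
    return q * (1 - alpha / n)

def t_coded(q, n, alpha):
    return t_uncoded(q, n, alpha) / (alpha + 1)

q, n = 1000, 10
for alpha in (1, 2, 5):
    print(alpha, t_uncoded(q, n, alpha), t_coded(q, n, alpha))
# e.g. alpha = 2: 800 transmissions uncoded vs. ~266.7 coded per epoch
```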
82. Experiments: Low-rank Matrix Completion (10M x 10M)
Tested on 25 EC2 instances: roughly a 35% gain from coded shuffling.
83. Conclusion
• Coding theory for distributed computing
• Stragglers slow down distributed computing => Coded Computation
• Data needs to be shuffled between distributed nodes => Coded Shuffling