Stale Synchronous Parallel
Iterations on Flink
TRAN Nam-Luc / Engineer @EURA NOVA Research & Development
FLINK FORWARD 2015
BERLIN, GERMANY
OCTOBER 2015
EURA NOVA?
OUR INNOVATION-DRIVEN MODEL & DISRUPTIVE CULTURE
“EURA NOVA is a team of passionate IT
experts devoted to providing knowledge &
skills to people with great ideas”
Data Science, Distributed computing,
Software engineering, Big Data.
KEY FIGURES
Our people:
40 employees, from business
engineers to data scientists
7 freelancers
3 founding partners
Our research, since 2009:
2 PhD theses & 18 master's theses
with 4 renowned universities
20 publications presented at conferences
4 large R&D projects
3 open-source products
[Figure: ten workers (Worker 1 – Worker 10) iterating in parallel; one of them is a STRAGGLER]
THE BIG PICTURE
4
Bulk Synchronous Parallelism synchronizes threads after each iteration.
There are always stragglers in a cluster.
In large clusters, that leaves a lot of workers waiting!
CONTRIBUTION
6
1. STALE SYNCHRONOUS PARALLEL ITERATIONS
Tackling the straggler problem within Flink
2. DISTRIBUTED FRANK-WOLFE ALGORITHM
Applied to LASSO regression as a use case
PART 1: STALE SYNCHRONOUS PARALLEL ITERATIONS
ON FLINK
THE STRAGGLER PROBLEM
8
There are stragglers in distributed processing frameworks…
→ Hardware heterogeneity
→ Skewed data distribution
→ Garbage collection
…especially in the context of data center operating systems.
Stragglers are not predictable and are costly to reschedule!
BULK VS STALE SYNCHRONOUS
9
Distribution of iterative-convergent algorithms:
[Diagram: bulk synchronous execution, with an explicit synchronization barrier after every iteration, compared with stale synchronous execution, where workers may keep iterating on stale state within a bound]
PARAMETER SERVER
10
Without an explicit synchronization barrier, how to keep workers up-to-date?
[Diagram: workers exchange their (possibly stale) updates through a parameter server]
INTEGRATION WITH FLINK
11
What does Flink need to enable SSP?
1. SSP iteration control model
2. Parameter server
ITERATION CONTROL MODEL IN FLINK
12
Worker p_i:
  if clock_i <= cluster-wide clock + staleness:
    do iteration
    ++clock_i, then send clock_i to the clock synchronization sink
  else:
    wait until clock_i <= cluster-wide clock + staleness
Clock Synchronization Sink:
  store clock_i in C
  cluster-wide clock = min(C)
  broadcast the cluster-wide clock if it changed
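To make the control flow above concrete, here is a minimal, hypothetical sketch of the worker-side clock check in Java. The class and member names (SSPClock, clusterWideClock, awaitTurn) are illustrative only and are not taken from the Flink code base or the pull request.

// Hypothetical sketch of the SSP clock check on the worker side.
// The actual PR integrates this logic into Flink's iteration head task.
public final class SSPClock {
    private final int staleness;           // maximum allowed clock gap
    private volatile int clusterWideClock; // min over all worker clocks, broadcast by the sink
    private int localClock;                // this worker's iteration counter

    public SSPClock(int staleness) {
        this.staleness = staleness;
    }

    /** Called when the sink broadcasts a new cluster-wide clock; wakes up blocked workers. */
    public synchronized void onClusterClock(int newClock) {
        clusterWideClock = newClock;
        notifyAll();
    }

    /** Blocks while this worker is beyond the staleness bound, then bumps its clock. */
    public synchronized int awaitTurn() throws InterruptedException {
        while (localClock > clusterWideClock + staleness) {
            wait(); // too far ahead of the slowest worker: wait for the next broadcast
        }
        return ++localClock; // proceed with the next iteration and report the new clock value
    }
}

As described on the slide above, the new clock value is then sent to the clock synchronization sink, which stores it in C, recomputes min(C), and broadcasts the cluster-wide clock whenever it changes.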
ITERATION CONTROL MODEL IN FLINK
13
[Diagram: Flink's bulk synchronous iterations. Each parallel pipeline is an IterationHead → IterationIntermediate → IterationTail chain closed by a backchannel. Every IterationHead reports "worker done" to the IterationSynchronizationTask, which answers with "all workers done" to every head before the next superstep starts.]
ITERATION CONTROL MODEL IN FLINK
14
[Diagram: the SSP variant. Each IterationHead → IterationIntermediate → IterationTail pipeline (with its backchannel) sends its clock p_i to the ClockSynchronizationTask, which broadcasts the cluster-wide clock back to every head instead of a blocking "all workers done" event.]
ITERATION CONTROL MODEL IN FLINK
15
Bulk synchronous component → stale synchronous counterpart:
IterationHeadPACTTask → SSPIterationHeadPACTTask
SuperstepBarrier → ClockHolder
SyncEventHandler → ClockSyncEventHandler
IterationSynchronizationTask → ClockSynchronizationTask
CONVERGENCE CHECK
16
BULK SYNCHRONOUS PARALLEL: convergence is determined at the synchronization barrier.
STALE SYNCHRONOUS PARALLEL: convergence is reached when no worker can improve the solution any further.
STALE SYNCHRONOUS API
17
Simple API:
Bulk synchronous: dataSet.Iterate(nIterations)
Stale synchronous: dataSet.IterateWithSSP(nIterations, staleness)
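A rough usage sketch, assuming the SSP entry point from the pull request mirrors Flink's existing iterate()/closeWith() pattern; the lower-case spelling iterateWithSSP, the element type, and the step function below are illustrative assumptions, not the confirmed API.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

public class SspIterationSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Double> initial = env.fromElements(0.0);

        int nIterations = 100;
        int staleness = 3; // a worker may run at most 3 clocks ahead of the slowest one

        // Assumed SSP entry point (name and arguments taken from the slide):
        IterativeDataSet<Double> loop = initial.iterateWithSSP(nIterations, staleness);

        // The step function stays an ordinary transformation, as with bulk iterations.
        DataSet<Double> next = loop.map(new MapFunction<Double, Double>() {
            @Override
            public Double map(Double x) {
                return x + 1.0;
            }
        });

        DataSet<Double> result = loop.closeWith(next);
        result.print();
    }
}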
PARAMETER SERVER
18
Architecture: [Diagram: workers read and write a SHARED MODEL held in a DATA GRID]
RichMapFunctionWithParameterServer extends RichMapFunction {
  update(id, clock, parameter)
  get(id)
}
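A minimal sketch of how such a function might look in use, assuming RichMapFunctionWithParameterServer exposes get(id) and update(id, clock, parameter) roughly as shown above; the generic types, the key, the clock handling, and the step size are illustrative assumptions rather than the actual API.

// Sketch only: the base class and method signatures are assumed from the slide,
// not taken from the actual pull request.
public class SspModelStep
        extends RichMapFunctionWithParameterServer<double[], double[]> {

    private static final String MODEL_ID = "model"; // hypothetical parameter id
    private int clock = 0;                          // local iteration clock

    @Override
    public double[] map(double[] localGradient) {
        // Read the (possibly stale) shared model from the parameter server.
        double[] model = (double[]) get(MODEL_ID);

        // Apply the local contribution to a copy of the model.
        double[] updated = model.clone();
        for (int i = 0; i < updated.length; i++) {
            updated[i] -= 0.1 * localGradient[i]; // illustrative step size
        }

        // Push the new value back, tagged with this worker's clock.
        update(MODEL_ID, ++clock, updated);
        return updated;
    }
}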
PART 2: DISTRIBUTED FRANK-WOLFE ALGORITHM
DISTRIBUTED FRANK-WOLFE ALGORITHM
20
Solving the following optimization problem: [equation on the slide; its solution is a linear combination of atoms with sparse coefficients]
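The formula itself did not survive extraction. For reference, the sparse-learning problem typically targeted by distributed Frank-Wolfe in this setting (Bellet et al. 2015), consistent with the LASSO use case later in the deck, has the form below; treat it as a reconstruction rather than the slide's exact notation.

\min_{\alpha \in \mathbb{R}^n} f(A\alpha) \quad \text{s.t.} \quad \|\alpha\|_1 \le \beta,
\qquad A = [a_1, \dots, a_n] \ \text{(atoms)}, \quad A\alpha = \sum_{j=1}^{n} \alpha_j a_j .

For the LASSO use case, f(A\alpha) = \tfrac{1}{2}\|A\alpha - y\|_2^2.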
DISTRIBUTED FRANK-WOLFE ALGORITHM
21-25
Distributed version (Bellet et al. 2015):
[Diagram: the atoms Atom1, Atom2, …, Atomn are partitioned across workers W1, W2, W3; the solution is a linear combination of atoms with sparse coefficients]
1. Local selection of atoms
2. Global consensus
3. Update of the α coefficients
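To fill in the step between "selection" and "coefficients update": in the standard Frank-Wolfe iteration over the ℓ1 ball (not spelled out on the slides, so treat this as background rather than the paper's exact algorithm), each worker scans its local atoms for the largest-magnitude gradient coordinate, the consensus step keeps the overall winner j*, and the coefficients move toward the corresponding vertex:

g^{(t)} = A^{\top} \nabla f\big(A\alpha^{(t)}\big), \qquad
j^{*} = \arg\max_{j} \big|g^{(t)}_{j}\big|, \qquad
s^{(t)} = -\beta \,\operatorname{sign}\big(g^{(t)}_{j^{*}}\big)\, e_{j^{*}},

\alpha^{(t+1)} = (1-\gamma_t)\,\alpha^{(t)} + \gamma_t\, s^{(t)},
\qquad \gamma_t = \frac{2}{t+2}.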
DISTRIBUTED FRANK-WOLFE ALGORITHM
26-30
Stale synchronous version:
[Diagram: the atoms Atom1, Atom2, …, Atomn are partitioned across workers W1, W2, W3, which exchange the α coefficients through the parameter server]
1. Get the α coefficients from the parameter server
2. Local selection of atoms
3. Compute the α coefficients from the locally selected atoms
4. Update the α coefficients to the parameter server
Repeat while within the staleness bounds.
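Putting the four steps together, here is a hypothetical sketch of one worker's loop under SSP. All names (ParameterServerClient, AtomPartition, SspClockGuard) and the loop structure are illustrative assumptions layered on the steps listed above, not code from the paper or the pull request.

// Hypothetical worker loop for the stale synchronous distributed Frank-Wolfe
// steps shown above. The interfaces are invented for illustration only.
interface ParameterServerClient {
    double[] getCoefficients(String key);                        // step 1
    void updateCoefficients(String key, int clock, double[] a);  // step 4
}

interface AtomPartition {
    int selectBestLocalAtom(double[] coefficients);              // step 2
    double[] computeUpdate(double[] coefficients, int atomIdx);  // step 3
}

interface SspClockGuard {
    /** Blocks until this worker is within the staleness bound; false when iterations are exhausted. */
    boolean awaitWithinStalenessBound() throws InterruptedException;
    int tick();
}

final class SspFrankWolfeWorker {
    private static final String KEY = "alpha";

    void run(ParameterServerClient ps, AtomPartition atoms, SspClockGuard clock)
            throws InterruptedException {
        while (clock.awaitWithinStalenessBound()) {
            double[] alpha = ps.getCoefficients(KEY);             // 1. possibly stale read
            int best = atoms.selectBestLocalAtom(alpha);          // 2. local atom selection
            double[] updated = atoms.computeUpdate(alpha, best);  // 3. local Frank-Wolfe step
            ps.updateCoefficients(KEY, clock.tick(), updated);    // 4. push back to the server
        }
    }
}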
DISTRIBUTED FRANK-WOLFE ALGORITHM
31
See our full paper for:
∙ full implementation details
∙ properties
∙ application to LASSO regression
∙ convergence proof
N.-L. Tran, T. Peel, S. Skhiri, “Distributed Frank-Wolfe under Pipelined Stale
Synchronous Parallelism,” Proceedings of IEEE BigData 2015, Santa Clara,
November 2015.
EXPERIMENTS
32
Application to LASSO regression
Random sparse 1,000 × 10,000 matrices
Sparsity ratio = 0.001
5 nodes, 2 GHz, 3 GB RAM
Generated load: at any time, 1 random node is put under 100% load
for 12 seconds
RESULTS
33
[Plot: convergence of the objective function]
RECAP
34
Stragglers in a cluster are an issue.
Mitigate them with Stale Synchronous Parallel Iterations.
WANNA TRY IT OUT?
35
Pull request #967: Stale Synchronous Parallel iterations + API
Pull request #1101: Frank-Wolfe algorithm + LASSO regression
THANK YOU!
Do you have any questions?
namluc.tran@euranova.eu
AGENDA
37
1. STALE SYNCHRONOUS PARALLEL ITERATIONS
∙ The straggler problem
∙ BSP vs SSP
∙ Integration with Flink
∙ Iteration control model
∙ API
2. DISTRIBUTED FRANK-WOLFE ALGORITHM
∙ Problem statement
∙ Application: LASSO regression
∙ Experiments
RESULTS
38
[Plot: sparsity of the coefficients]
PARAMETER SERVER
39
The parameter server keeps track of the intermediate results:
→ Key-object store
→ Distributed, with local caching
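As a purely illustrative sketch of the "key-object store, distributed, with local caching" idea (none of the names below come from the implementation): a worker-side client keeps a local cache and falls back to the distributed data grid on a miss.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of a key-object parameter store with worker-local caching.
// DistributedGrid stands in for whatever data grid backs the parameter server.
interface DistributedGrid {
    Object fetch(String key);
    void put(String key, Object value);
}

final class CachingParameterClient {
    private final DistributedGrid grid;
    private final Map<String, Object> localCache = new ConcurrentHashMap<>();

    CachingParameterClient(DistributedGrid grid) {
        this.grid = grid;
    }

    /** Returns the locally cached value if present, otherwise reads it from the grid. */
    Object get(String key) {
        return localCache.computeIfAbsent(key, grid::fetch);
    }

    /** Writes through to the grid and refreshes the local copy. */
    void update(String key, Object value) {
        grid.put(key, value);
        localCache.put(key, value);
    }
}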
