Online performance analysis of distributed dataflow systems (O'Reilly Velocity 2017)

ONLINE
PERFORMANCE
ANALYSIS
OF DISTRIBUTED DATAFLOW SYSTEMS
Vasia Kalavri
kalavriv@inf.ethz.ch
Support:
O’Reilly Velocity London
20 October 2017

ABOUT ME
▸ Postdoc at ETH Zürich
▸ Systems Group: https://www.systems.ethz.ch/
▸ PMC member of Apache Flink
▸ Research interests
▸ Large-scale graph processing
▸ Streaming dataﬂow engines
▸ Current project
▸ Predictive datacenter analytics and
management
2
@vkalavri

33
traces, conﬁguration,
topology updates, …
Datacenter
Strymon
queries, complex analytics,
simulations, …
policy enforcement,
what-if scenarios, …
STRYMON: ONLINE DATACENTER ANALYTICS AND MANAGEMENT
Datacenter
model
event streams
strymon.systems.ethz.ch

STRYMON IS BUILT ON TIMELY
4
▸ A steaming framework for data-parallel computations
▸ Arbitrary cyclic dataflows
▸ Logical timestamps (epochs)
▸ Asynchronous execution
▸ Low latency
D. Murray, F. McSherry, M. Isard, R. Isaacs, P. Barham, M. Abadi.
Naiad: A Timely Dataflow System. In SOSP, 2013.
4
https://github.com/frankmcsherry/timely-dataflow

What-if scenarios
Real-time
datacenter analytics
Incremental
network routing
Online critical
path analysis
Explaining
simulations
Strymon
5

Apache Flink
SnailTrail: Strymon’s
component for online
performance analysis of
distributed dataﬂows
6
What-if scenarios
Real-time
datacenter analytics
Incremental
network routing
Online critical
path analysis
Explaining
simulations
Strymon

DATAFLOW PROGRAMMING
7
sources
sinks
operators
data exchange

UNDERSTANDING THE PERFORMANCE OF DISTRIBUTED DATAFLOWS
Hard to troubleshoot
▸ long-running, dynamic
workloads
▸ many tasks, activities,
operators, dependencies
▸ bottleneck causes are
usually not isolated but
span multiple processes
8
client
W1
W1
scheduler

PERFORMANCE METRICS
9
Dataﬂow graph

PERFORMANCE METRICS
9
Duration
Dataﬂow graph

PERFORMANCE METRICS
9
Duration
Aggregate data exchange
Dataﬂow graph

PERFORMANCE METRICS
9
Duration
Aggregate data exchange
Custom
aggregate metrics
Dataﬂow graph

W1
W2
W3
serialization processing
waiting
deserialization
10
processing
message
A PARALLEL EXECUTION

W1
W2
W3
waiting
deserialization
10
processing
messageProcessing is the most time-consuming activity

W1
W2
W3
waiting
deserialization
11
processing
What if we optimize it?

W1
W2
W3
serialization
waiting
deserialization
12

13
W1
W2
W3
serialization
waiting
deserialization

13
W1
W2
W3
serialization
waiting
deserialization
No performance beneﬁt
for the parallel execution!

CONVENTIONAL
PROFILING CAN
BE MISLEADING

THE PROGRAM ACTIVITY GRAPH (PAG)
16
W1
W2
W3
a b
c d
t=k t=k+1
x u v z

16
W1
W2
W3
a b
c d
t=k t=k+1
x u v z
Nodes are timestamped events:
start or end of a worker activity

16
W1
W2
W3
a b
c d
t=k t=k+1
x u v z
Nodes are timestamped events:
start or end of a worker activity
u = { 
timestamp: k+1, 
worker: 2
}

17
W1
W2
W3
a b
c d
t=k t=k+1
x u v z
Edges represent activities
annotated with a type and duration

17
W1
W2
W3
a b
c d
t=k t=k+1
x u v z
Edges represent activities
annotated with a type and duration
(u, v) = { 
type: serialization 
duration: 1
}

W1
W2
W3
The Program Activity
Graph captures
computational
dependencies among
parallel workers
▸ Which activities delay the overall execution?
▸ i.e. which activities lie on the critical path of execution?
18

CRITICAL PATH
19
W1
W2
W3
a b
c d
The longest path in the execution history
(not considering waiting activities)

20
W1
W2
W3
a b
c d
CRITICAL PATH

CRITICAL PATH
21
W1
W2
W3
a b
c d

CRITICAL PATH
21
W1
W2
W3
a b
c d
Reduced execution time

POST-MORTEM CRITICAL PATH ANALYSIS IS EASY
1. Collect traces during execution
job start job end
proﬁler
22

POST-MORTEM CRITICAL PATH ANALYSIS IS EASY
1. Collect traces during execution
job start job end
proﬁler
22
2. Analyze traces ofﬂine
analyzer
performance
summaries

How to compute the critical path for
continuously running,
distributed streaming applications,
with potentially unbounded input?
23

How to compute the critical path for
continuously running,
distributed streaming applications,
with potentially unbounded input?
▸ There might be no “job end”
▸ The PAG and critical path are continuously evolving
▸ Stale proﬁling information is not useful
23

ONLINE ANALYSIS OF TRACE SNAPSHOTS
25
input stream output stream

25
periodic
snapshot

25
periodic
snapshot
trace snapshot
stream
analyzer
performance
summaries
stream

PROGRAM ACTIVITY GRAPH SNAPSHOT
26
W1
W2
W3
a b
c d
t=k t=k+1
x u v z
ts te

W1
W2
W3
a b
c d
x u v z
ts te
27

▸ All paths have the same length: te - ts
W1
W2
W3
a b
c d
x u v z
ts te
27

W1
W2
W3
a b
c d
x u v z
ts te
28

W1
W2
W3
a b
c d
x u v z
ts te
29

W1
W2
W3
a b
c d
x u v z
ts te
▸ Choosing a random path might miss critical activities
30

W1
W2
W3
a b
c d
x u v z
ts te
31

▸ Enumerating all paths is impractical
W1
W2
W3
a b
c d
x u v z
ts te
32

W1
W2
W3
a b
c d
x u v z
ts te
How to rank activities with regard to criticality?
All paths are
potentially part
of the evolving
critical path
33

W1
W2
W3
a b
c d
x u v z
ts te
How to rank activities with regard to criticality?
All paths are
potentially part
of the evolving
critical path
Intuition: the more paths an activity appears on
the more probable it is that this activity is critical
33

1
2
3
4
5
6
7
8
9
W1
W2
W3
a b
c d
x u v z
ts te
34

1
2
3
4
5
6
7
8
9
W1
W2
W3
a b
c d
x u v z
ts te
9
0
0
6 6
35

36
CRITICAL PARTICIPATION (CP METRIC)
An estimation of the activity’s participation in the critical path
total number of paths
in the snapshot
activity duration: edge weight
centrality: the number of
paths this activity appears on
tion 8. Transient Path Centrality: Let P = {~p1, ~p2, ...~pN}
set of N transient paths of snapshot G[ts,te]. The tran-
path centrality of an edge e 2 G[ts,te] is deﬁned as
c(e) =
NX
i=1
ci(e), where ci(e) =
8
>><
>>:
0 : e < ~pi
1 : e 2 ~pi
e following holds:
CPa =
TPC(a) · aw
N(te ts)
(3)
Sp
di↵er
ysis:
graph
whos
ers (t
graph
all wo
tions
1 We p
4

36
CRITICAL PARTICIPATION (CP METRIC)
An estimation of the activity’s participation in the critical path
total number of paths
in the snapshot
activity duration: edge weight
centrality: the number of
paths this activity appears on
Can be computed
without path
enumeration!
tion 8. Transient Path Centrality: Let P = {~p1, ~p2, ...~pN}
set of N transient paths of snapshot G[ts,te]. The tran-
path centrality of an edge e 2 G[ts,te] is deﬁned as
c(e) =
NX
i=1
ci(e), where ci(e) =
8
>><
>>:
0 : e < ~pi
1 : e 2 ~pi
e following holds:
CPa =
TPC(a) · aw
N(te ts)
(3)
Sp
di↵er
ysis:
graph
whos
ers (t
graph
all wo
tions
1 We p
4

ONLINE PERFORMANCE ANALYSIS WITH
SNAILTRAIL

38
reference application SnailTrail
Timely
Trace ingestion
CP-based
performance
summaries
PAG construction
CP computation and
activity ranking
trace streams
Proﬁling
Trace generation
Apache Flink,
Apache Spark,
TensorFlow,
Heron,
Timely Dataﬂow, ...
SNAILTRAIL IN ACTION

EXAMPLE: TASK SCHEDULING IN APACHE SPARK
39
DRIVER
W1
W2
W3
Venkataraman, Shivaram, et al. "Drizzle: Fast and adaptable stream processing at scale." Spark Summit (2016).

SCHEDULING BOTTLENECK IN APACHE SPARK
40
0 5 10 15
Snapshot
0.0
0.2
0.4
0.6
0.8
CP
0 5 10 15
Snapshot
%weight
Processing Scheduling
Conventional ProﬁlingSnailTrail Proﬁling
Apache Spark: Yahoo! Streaming Benchmark, 16 workers, 8s snapshots

SNAILTRAIL CP-BASED SUMMARIES
▸ Activity Summary
▸ which activity type is a bottleneck?
41

0 5 10 15
Snapshot
0.0
0.2
0.4
0.6
0.8
1.0
CP
DataMessage
Unknown
Buffer
Deserialization
Serialization
Processing
Activity Summary
Apache Flink: Dhalion WordCount Benchmark, 4 workers, 1s snapshots
42

0 5 10 15
Snapshot
0.0
0.2
0.4
0.6
0.8
1.0
CP
DataMessage
Unknown
Buffer
Deserialization
Serialization
Processing
Activity Summary
Optimize
serialization!
42

▸ Straggler Summary
▸ which worker is a bottleneck?
43

0 5 10 15
Snapshot
0.00
0.05
0.10
0.15
CP
W1
W2
W3
W4
Straggler Summary
44

0 5 10 15
Snapshot
0.00
0.05
0.10
0.15
CP
W1
W2
W3
W4
Straggler Summary
Steal work from W1
44

▸ Operator Summary
▸ which operator is a bottleneck?
45

Operator Summary
46

Operator Summary
Increase
flatmap’s parallelism!
46

▸ Operator Summary
▸ which operator is a bottleneck?
▸ Communication Summary
▸ which communication channels are bottlenecks?
47

Communication Criticality
Communication Summary
48
0 1 2 3 4 5 6 7 8 9 10 11 12
Worker
0
1
2
3
4
5
6
7
8
9
10
11
12
Worker
1 32 54 86 7 109 1211
Worker
Worker
1
2
3
4
5
6
7
8
9
10
11
12

Communication Criticality
Communication Summary
48
0 1 2 3 4 5 6 7 8 9 10 11 12
Worker
0
1
2
3
4
5
6
7
8
9
10
11
12
Worker
1 32 54 86 7 109 1211
Worker
Worker
1
2
3
4
5
6
7
8
9
10
11
12
Collocate worker 5
with 2, 9, 10, 11

SNAILTRAIL PERFORMANCE
▸ Low instrumentation overhead
▸ < 10% for all reference systems
▸ High throughput
▸ 1.2 million events per second
▸ Always online
▸ 1s of traces in 6ms, 256s of traces in < 25s
49
Traces from Apache Flink Sessionization, 48 workers, 1s-256s snapshots
SnailTrail on Intel Xeon E5-4640, 2.40GHz, 32 cores, 512GB RAM

RECAP
50
Strymon: online datacenter analytics
and management

RECAP
50
and management
0 5 10 15
Snapshot
0.0
0.2
0.4
0.6
0.8
CP
0 5 10 15
Snapshot
%weight
Conventional proﬁling is misleading

RECAP
50
and management
0 5 10 15
Snapshot
0.0
0.2
0.4
0.6
0.8
CP
0 5 10 15
Snapshot
%weight
CP-metric: online critical path analysis

RECAP
50
and management
0 5 10 15
Snapshot
0.0
0.2
0.4
0.6
0.8
CP
0 5 10 15
Snapshot
%weight
CP-metric: online critical path analysis SnailTrail: online CP-based summaries

THE STRYMON TEAM & FRIENDS
51
Vasiliki Kalavri
Zaheer Chothia
Andrea Lattuada
Prof. Timothy Roscoe
Sebastian Wicki
Moritz Hoffmann
Desislava Dimitrova
John Liagouris
Frank McSherry

51
Vasiliki Kalavri
Zaheer Chothia
Andrea Lattuada
Sebastian Wicki
Moritz Hoffmann
Desislava Dimitrova
John Liagouris
? ?
Frank McSherry

51
Vasiliki Kalavri
Zaheer Chothia
Andrea Lattuada
Sebastian Wicki
Moritz Hoffmann
Desislava Dimitrova
John Liagouris
? ?
Frank McSherry
IT COULD BE YOU!
strymon.systems.ethz.ch
www.systems.ethz.ch/positions

Online performance analysis of distributed dataflow systems (O'Reilly Velocity 2017)

Recommended

Recommended

More Related Content

Similar to Online performance analysis of distributed dataflow systems (O'Reilly Velocity 2017)

Similar to Online performance analysis of distributed dataflow systems (O'Reilly Velocity 2017) (20)

More from Vasia Kalavri

More from Vasia Kalavri (17)

Recently uploaded

Recently uploaded (20)

Online performance analysis of distributed dataflow systems (O'Reilly Velocity 2017)