Understanding the performance of distributed dataflow systems like Apache Spark, Apache Flink, and Tensorflow is hard. Parallel computation is interleaved with data and control communication, and execution dependencies typically span multiple system components. In such environments, bottleneck detection is cumbersome and currently relies heavily on humans. After decades of systems research, state-of-the-art performance analysis techniques are commonly based on offline trace processing and thus are only suitable for batch computations and postmortem reports.
Vasia Kalavri offers an overview of Strymon, a system for predictive data center analytics, and its online critical path analysis module. Strymon analyzes live traces from distributed dataflow systems to predict bottlenecks and provide insights on streaming application performance—leveraging logging and monitoring pipelines of modern production data centers to ingest cross-layer events in a streaming fashion and predict possible effects of such events in what-if sc
2. ABOUT ME
▸ Postdoc at ETH Zürich
▸ Systems Group: https://www.systems.ethz.ch/
▸ PMC member of Apache Flink
▸ Research interests
▸ Large-scale graph processing
▸ Streaming dataflow engines
▸ Current project
▸ Predictive datacenter analytics and
management
2
@vkalavri
4. STRYMON IS BUILT ON TIMELY
4
▸ A steaming framework for data-parallel computations
▸ Arbitrary cyclic dataflows
▸ Logical timestamps (epochs)
▸ Asynchronous execution
▸ Low latency
D. Murray, F. McSherry, M. Isard, R. Isaacs, P. Barham, M. Abadi.
Naiad: A Timely Dataflow System. In SOSP, 2013.
4
https://github.com/frankmcsherry/timely-dataflow
8. UNDERSTANDING THE PERFORMANCE OF DISTRIBUTED DATAFLOWS
Hard to troubleshoot
▸ long-running, dynamic
workloads
▸ many tasks, activities,
operators, dependencies
▸ bottleneck causes are
usually not isolated but
span multiple processes
8
client
W1
W1
scheduler
25. THE PROGRAM ACTIVITY GRAPH (PAG)
16
W1
W2
W3
a b
c d
t=k t=k+1
x u v z
Nodes are timestamped events:
start or end of a worker activity
26. THE PROGRAM ACTIVITY GRAPH (PAG)
16
W1
W2
W3
a b
c d
t=k t=k+1
x u v z
Nodes are timestamped events:
start or end of a worker activity
u = {
timestamp: k+1,
worker: 2
}
27. THE PROGRAM ACTIVITY GRAPH (PAG)
17
W1
W2
W3
a b
c d
t=k t=k+1
x u v z
Edges represent activities
annotated with a type and duration
28. THE PROGRAM ACTIVITY GRAPH (PAG)
17
W1
W2
W3
a b
c d
t=k t=k+1
x u v z
Edges represent activities
annotated with a type and duration
(u, v) = {
type: serialization
duration: 1
}
29. W1
W2
W3
The Program Activity
Graph captures
computational
dependencies among
parallel workers
▸ Which activities delay the overall execution?
▸ i.e. which activities lie on the critical path of execution?
18
40. POST-MORTEM CRITICAL PATH ANALYSIS IS EASY
1. Collect traces during execution
job start job end
profiler
22
41. POST-MORTEM CRITICAL PATH ANALYSIS IS EASY
1. Collect traces during execution
job start job end
profiler
22
2. Analyze traces offline
analyzer
performance
summaries
42. How to compute the critical path for
continuously running,
distributed streaming applications,
with potentially unbounded input?
23
43. How to compute the critical path for
continuously running,
distributed streaming applications,
with potentially unbounded input?
▸ There might be no “job end”
▸ The PAG and critical path are continuously evolving
▸ Stale profiling information is not useful
23
50. ▸ All paths have the same length: te - ts
W1
W2
W3
a b
c d
x u v z
ts te
27
51. W1
W2
W3
a b
c d
x u v z
ts te
▸ All paths have the same length: te - ts
28
52. W1
W2
W3
a b
c d
x u v z
ts te
▸ All paths have the same length: te - ts
29
53. W1
W2
W3
a b
c d
x u v z
ts te
▸ All paths have the same length: te - ts
▸ Choosing a random path might miss critical activities
30
54. W1
W2
W3
a b
c d
x u v z
ts te
▸ All paths have the same length: te - ts
▸ Choosing a random path might miss critical activities
31
55. W1
W2
W3
a b
c d
x u v z
ts te
▸ All paths have the same length: te - ts
▸ Choosing a random path might miss critical activities
31
56. ▸ All paths have the same length: te - ts
▸ Choosing a random path might miss critical activities
▸ Enumerating all paths is impractical
W1
W2
W3
a b
c d
x u v z
ts te
32
57. W1
W2
W3
a b
c d
x u v z
ts te
How to rank activities with regard to criticality?
All paths are
potentially part
of the evolving
critical path
33
58. W1
W2
W3
a b
c d
x u v z
ts te
How to rank activities with regard to criticality?
All paths are
potentially part
of the evolving
critical path
Intuition: the more paths an activity appears on
the more probable it is that this activity is critical
33
61. 36
CRITICAL PARTICIPATION (CP METRIC)
An estimation of the activity’s participation in the critical path
total number of paths
in the snapshot
activity duration: edge weight
centrality: the number of
paths this activity appears on
tion 8. Transient Path Centrality: Let P = {~p1, ~p2, ...~pN}
set of N transient paths of snapshot G[ts,te]. The tran-
path centrality of an edge e 2 G[ts,te] is defined as
c(e) =
NX
i=1
ci(e), where ci(e) =
8
>><
>>:
0 : e < ~pi
1 : e 2 ~pi
e following holds:
CPa =
TPC(a) · aw
N(te ts)
(3)
Sp
di↵er
ysis:
graph
whos
ers (t
graph
all wo
tions
1 We p
4
62. 36
CRITICAL PARTICIPATION (CP METRIC)
An estimation of the activity’s participation in the critical path
total number of paths
in the snapshot
activity duration: edge weight
centrality: the number of
paths this activity appears on
Can be computed
without path
enumeration!
tion 8. Transient Path Centrality: Let P = {~p1, ~p2, ...~pN}
set of N transient paths of snapshot G[ts,te]. The tran-
path centrality of an edge e 2 G[ts,te] is defined as
c(e) =
NX
i=1
ci(e), where ci(e) =
8
>><
>>:
0 : e < ~pi
1 : e 2 ~pi
e following holds:
CPa =
TPC(a) · aw
N(te ts)
(3)
Sp
di↵er
ysis:
graph
whos
ers (t
graph
all wo
tions
1 We p
4
73. SNAILTRAIL CP-BASED SUMMARIES
▸ Activity Summary
▸ which activity type is a bottleneck?
▸ Straggler Summary
▸ which worker is a bottleneck?
▸ Operator Summary
▸ which operator is a bottleneck?
45
76. SNAILTRAIL CP-BASED SUMMARIES
▸ Activity Summary
▸ which activity type is a bottleneck?
▸ Straggler Summary
▸ which worker is a bottleneck?
▸ Operator Summary
▸ which operator is a bottleneck?
▸ Communication Summary
▸ which communication channels are bottlenecks?
47
79. SNAILTRAIL PERFORMANCE
▸ Low instrumentation overhead
▸ < 10% for all reference systems
▸ High throughput
▸ 1.2 million events per second
▸ Always online
▸ 1s of traces in 6ms, 256s of traces in < 25s
49
Traces from Apache Flink Sessionization, 48 workers, 1s-256s snapshots
SnailTrail on Intel Xeon E5-4640, 2.40GHz, 32 cores, 512GB RAM
85. THE STRYMON TEAM & FRIENDS
51
Vasiliki Kalavri
Zaheer Chothia
Andrea Lattuada
Prof. Timothy Roscoe
Sebastian Wicki
Moritz Hoffmann
Desislava Dimitrova
John Liagouris
Frank McSherry
86. THE STRYMON TEAM & FRIENDS
51
Vasiliki Kalavri
Zaheer Chothia
Andrea Lattuada
Prof. Timothy Roscoe
Sebastian Wicki
Moritz Hoffmann
Desislava Dimitrova
John Liagouris
? ?
Frank McSherry
87. THE STRYMON TEAM & FRIENDS
51
Vasiliki Kalavri
Zaheer Chothia
Andrea Lattuada
Prof. Timothy Roscoe
Sebastian Wicki
Moritz Hoffmann
Desislava Dimitrova
John Liagouris
? ?
Frank McSherry
IT COULD BE YOU!
strymon.systems.ethz.ch
www.systems.ethz.ch/positions