A Benchmark Suite for
Distributed Stream Processing
Systems
Maycon Viana Bordin
Claudio Geyer
Advisor
April, 2017
HUGE amounts of data
are being generated in
real time
500M tweets
are sent per day
4.75B shares
4.5B likes
420M status updates
300M photos
EVERY DAY.
They need to process…
large volumes of data
in real time
continuously
producing actionable information
Stream Processing
Data Stream
Data from the stream source may or
may not be structured
The amount of data is usually
unbounded in size
The input rate is variable and
typically unpredictable
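These properties (possibly unstructured records, unbounded size, unpredictable input rate) mean an operator has to update its result incrementally, one tuple at a time, instead of waiting for the input to end. A minimal Python sketch of that idea follows. The `running_word_count` helper and the sample data are hypothetical and are not taken from any platform covered in this talk; a short list stands in for the stream source.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def running_word_count(lines: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Yield an updated snapshot of the word counts after every input line.

    `lines` may be an endless generator: state is updated per tuple and
    nothing here requires the stream to finish.
    """
    counts: Counter = Counter()
    for line in lines:
        # Unstructured input: split on whitespace; empty lines add nothing.
        counts.update(line.lower().split())
        yield dict(counts)


# A finite list stands in for the (normally unbounded) stream source.
sample_stream = ["to be or not", "", "to be"]
snapshots = list(running_word_count(sample_stream))
print(snapshots[-1])
```

Because the function is a generator, a consumer can read results as they are produced and is never forced to hold the whole stream in memory.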
There are many platforms
on the market
Problem:
How do we know which platform
is best for a specific type
of application?
Problem:
Current stream processing
benchmarks are composed
mostly of synthetic
applications.
Problem:
Benchmarks for other Big
Data platforms, e.g.
BigDataBench and HiBench,
use more real-world applications.
Goals:
Specific Goals:
Benchmarks for
Stream Processing
Linear Road Benchmark [Ara04]
StreamBench [Lu14]
Yahoo Streaming Benchmark
BigDataBench [Wan14]
StreamBench [Wan16]
RIoTBench [Wan17]
HiBench [Hua10]
Comparison
Benchmark
Architecture
API
Metrics
Scripts for automation…
Benchmark
Applications
Benchmark
Metrics
$\mathit{Latency} = T_{end} - T_{start}$
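As an illustration of the latency metric, the sketch below computes per-tuple latency from a pair of timestamps. It assumes latency is the time a tuple finishes processing minus the time it entered the system; the `tuple_latencies` helper and the sample timestamps are hypothetical and do not come from the benchmark's code.

```python
from statistics import mean
from typing import List, Tuple


def tuple_latencies(timestamps: List[Tuple[float, float]]) -> List[float]:
    """Return end - start, in seconds, for each (start, end) pair."""
    latencies = []
    for start, end in timestamps:
        if end < start:
            # A negative latency means the clocks or the records are wrong.
            raise ValueError("end timestamp precedes start timestamp")
        latencies.append(end - start)
    return latencies


# Hypothetical (start, end) timestamps in seconds for three tuples.
samples = [(0.0, 0.5), (1.0, 1.25), (2.0, 2.75)]
latencies = tuple_latencies(samples)
print(latencies, mean(latencies))
```

In a distributed deployment the two timestamps may be taken on different machines, so clock skew would have to be accounted for before such differences are meaningful.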
$\mathit{Throughput} = \dfrac{\text{Num. Processed Tuples}}{\text{Runtime}}$
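The throughput ratio can be sketched the same way. The `throughput` function and the numbers below are hypothetical, chosen only to show the arithmetic (tuples divided by runtime in seconds); they are not measurements from this work.

```python
def throughput(processed_tuples: int, runtime_seconds: float) -> float:
    """Return processed tuples per second over the whole run."""
    if runtime_seconds <= 0:
        raise ValueError("runtime must be positive")
    return processed_tuples / runtime_seconds


# Hypothetical run: one million tuples processed in 50 seconds.
rate = throughput(1_000_000, 50.0)
print(rate)
```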
Comparison with the other
Benchmarks
Results
Set-up
n1_x1_x5_x6_x3    n1_x1_x2_x1_x4_x2    n1_x4_x2_x2
n1_x2_x5_x6_x3    n1_x2_x2_x1_x4_x2    n4_x2_x2_x2
n1_x3_x5_x6_x3    n1_x4_x2_x1_x4_x2    n4_x8_x2_x2
n2_x1_x5_x6_x3    n1_x8_x2_x1_x4_x2
n2_x2_x5_x6_x3    n2_x1_x2_x1_x4_x2
n2_x3_x5_x6_x3    n2_x2_x2_x1_x4_x2
n4_x1_x5_x6_x3    n2_x4_x2_x1_x4_x2
n4_x2_x5_x6_x3    n2_x8_x2_x1_x4_x2
n4_x3_x5_x6_x3    n4_x1_x2_x1_x4_x2
n8_x1_x5_x6_x3    n4_x2_x2_x1_x4_x2
n8_x2_x5_x6_x3    n4_x4_x2_x1_x4_x2
n8_x3_x5_x6_x3    n4_x8_x2_x1_x4_x2
                  n8_x1_x2_x1_x4_x2
                  n8_x2_x2_x1_x4_x2
                  n8_x4_x2_x1_x4_x2
                  n8_x8_x2_x1_x4_x2
Results
Word Count: Storm
n8_x4
n4_x2
n4_x2_x10_x12_x6
n2_x1_x5_x6_x3
n1_x2_x5_x6_x3
Results
Word Count: Spark
Results
Log Processing: Storm
n8_x3
n4_x1_x2_x1_x4_x2
n2_x1_x2_x1_x4_x2
n1_x4_x2_x1_x4_x2
Results
Log Processing: Spark
Results
Traffic Monitoring: Storm
Results
Traffic Monitoring: Spark
Conclusion
Future Work
Publications