Presentation by Steffen Zeuch, Researcher at German Research Center for Artificial Intelligence (DFKI) and Post-Doc at TU Berlin (Germany), at the FogGuru Boot Camp training in September 2018.
Big Fast Data
✤ Data is growing and can be evaluated
✤ Tweets, social networks (statuses, check-ins, shared content), blogs, click streams, various logs, …
✤ Facebook: > 845M active users, > 8B messages/day
✤ Twitter: > 140M active users, > 340M tweets/day
✤ Everyone is interested!
More Examples
✤ Autonomous driving:
✤ Requires rich navigation information and sensor data readings
✤ ~1 GB of data per minute per car (all sensors combined)
✤ Traffic monitoring:
✤ High event rates: millions of events/sec
✤ High query rates: thousands of queries/sec
✤ Queries: filtering, notifications, analytics
✤ Pre-processing of sensor data:
✤ CERN experiments generate ~1 PB of measurements per second
✤ Infeasible to store or process directly; fast pre-processing is a must
Streaming Definitions
✤ A stream is a conceptually infinite, ever-growing set of data items / events
✤ An operator is a continuous stream transformer: each operator transforms its input streams into its output streams
✤ Operators may or may not have state, i.e., data that the operator remembers between firings
✤ The selectivity of an operator is its data rate, measured in output data items per input data item
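The definitions above can be sketched in a few lines of code. This is a minimal illustration, not from the slides; the class and function names are invented for the example. It shows a stateful operator (a running average that remembers data between firings, selectivity 1) and a stateless one (a filter, selectivity ≤ 1).

```python
# Hypothetical sketch of the streaming definitions above; names are illustrative.

class RunningAverage:
    """A stateful operator: total/count are remembered between firings."""
    def __init__(self):
        self.total = 0.0  # state kept across firings
        self.count = 0

    def fire(self, item):
        """Transform one input item into a list of output items."""
        self.total += item
        self.count += 1
        return [self.total / self.count]  # one output per input: selectivity 1

class EvenFilter:
    """A stateless operator: emits at most one output per input (selectivity <= 1)."""
    def fire(self, item):
        return [item] if item % 2 == 0 else []

# Drive both operators over a finite prefix of a (conceptually infinite) stream.
stream = [1, 2, 3, 4]
avg, flt = RunningAverage(), EvenFilter()
averages = [out for item in stream for out in avg.fire(item)]
outputs = [out for item in stream for out in flt.fire(item)]
# EvenFilter emitted 2 outputs for 4 inputs, i.e. a selectivity of 0.5.
```

In a real stream processing engine (SPE), operators like these are wired into a dataflow graph and fired continuously as items arrive, rather than driven by a list comprehension.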
Parallelism
a) Pipeline parallelism: concurrent execution of a producer A with a consumer B.
b) Task parallelism: concurrent execution of different operators D and E that do not constitute a pipeline.
c) Data parallelism: concurrent execution of multiple replicas of the same operator G on different portions of the same data (SPMD: single program, multiple data).
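Variant (c) can be illustrated with a short sketch. This is an assumed, simplified example (not from the slides): the same program (a word count) runs as two replicas, each on its own hash partition of the stream, and the partial results are merged at the end.

```python
# Hypothetical sketch of data parallelism (SPMD): hash-partition a stream
# across replicas of the same operator, then merge the partial results.
from concurrent.futures import ThreadPoolExecutor

def word_count(partition):
    """The single program that every replica runs on its own data portion."""
    counts = {}
    for word in partition:
        counts[word] = counts.get(word, 0) + 1
    return counts

stream = ["fog", "edge", "fog", "cloud", "edge", "fog"]
n_replicas = 2

# Route each item to a replica by key hash, so equal keys land on the same replica.
partitions = [[] for _ in range(n_replicas)]
for word in stream:
    partitions[hash(word) % n_replicas].append(word)

# Run all replicas concurrently on their partitions.
with ThreadPoolExecutor(max_workers=n_replicas) as pool:
    partials = list(pool.map(word_count, partitions))

# Merge: key sets are disjoint because partitioning is by key hash.
merged = {}
for counts in partials:
    merged.update(counts)
```

Key-based partitioning is what makes the merge trivial here; with arbitrary (round-robin) partitioning, the merge step would have to sum counts across replicas instead.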
3. Stream Processing Research
Analyzing Efficient Stream
Processing on Modern
Hardware
Steffen Zeuch, Bonaventura Del Monte, Jeyhun Karimov, Clemens Lutz,
Manuel Renz, Jonas Traub, Sebastian Breß, Tilmann Rabl, Volker Markl
Analysis Outline
✤ Analyze state-of-the-art streaming systems and identify sources of inefficiency
✤ Investigate the data-related design space:
✤ Data ingestion and passing
✤ Investigate the processing-related design space:
✤ Processing model, parallelization strategies, windowing mechanism
✤ Derive design changes for streaming systems to exploit modern hardware more efficiently
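One of the processing-related design points above, the windowing mechanism, can be sketched briefly. This is a generic illustration under assumed names, not the mechanism of any particular system analyzed in the talk: a tumbling count window that groups every `size` items and emits one aggregate per completed window.

```python
# Hypothetical sketch of one windowing mechanism: a tumbling count window.
def tumbling_count_windows(stream, size):
    """Group every `size` consecutive items and emit one aggregate per window."""
    window = []
    for item in stream:
        window.append(item)
        if len(window) == size:
            yield sum(window)  # per-window aggregate, here a sum
            window = []        # tumbling: windows do not overlap

results = list(tumbling_count_windows([1, 2, 3, 4, 5, 6, 7], size=3))
# Windows [1,2,3] and [4,5,6] complete; the partial window [7] is never emitted.
```

Real SPEs typically offer richer variants (time-based, sliding, session windows), where windows may overlap and a single item can contribute to several aggregates.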
Types of Streaming Systems
✤ Scale-out optimized SPEs
✤ Goal: massively parallelize the workload among many small to medium-sized nodes
✤ Systems: Flink, Spark, Storm
✤ Scale-up optimized SPEs
✤ Goal: exploit the capabilities of a single high-end machine efficiently
✤ Systems: Saber, Streambox, Trill
Data Ingestion
Common wisdom revisited: "local first" does not hold anymore
Source: following Binnig et al., The End of Slow Networks: It's Time for a Redesign. VLDB 2016.
Summary
✤ Analyze state-of-the-art streaming systems and identify sources of inefficiency
✤ Investigate the data-related design space:
✤ Data ingestion and passing
✤ Investigate the processing-related design space:
✤ Processing model, parallelization strategies, windowing mechanism
✤ Derive design changes for streaming systems to exploit modern hardware more efficiently
Interesting Topics for ESRs
✤ Long-running queries => lots of optimization potential and time to "try out" things
✤ Modern hardware and compilation-based approaches are very worthwhile for performance
✤ Fog computing involves very different ingestion rates (up to InfiniBand speeds): how to exploit them?
This training material is part of the FogGuru project that has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 765452. The information and views set out in this material are those of the author(s) and do not necessarily reflect the official opinion of the European Union. Neither the European Union institutions and bodies nor any person acting on their behalf may be held responsible for the use which may be made of the information contained therein.