Presentation by Steffen Zeuch, Researcher at German Research Center for Artificial Intelligence (DFKI) and Post-Doc at TU Berlin (Germany), at the FogGuru Boot Camp training in September 2018.
Big Fast Data
✤ Data is growing and can be evaluated
✤ Tweets, social networks (statuses, check-ins, shared content), blogs, click streams, various logs, …
✤ Facebook: > 845M active users, > 8B messages/day
✤ Twitter: > 140M active users, > 340M tweets/day
✤ Everyone is interested!
More Examples
✤ Autonomous driving:
✤ Requires rich navigation information and sensor data readings
✤ ~1 GB of data per minute per car (all sensors combined)
✤ Traffic monitoring:
✤ High event rates: millions of events/sec
✤ High query rates: thousands of queries/sec
✤ Queries: filtering, notifications, analytics
✤ Pre-processing of sensor data:
✤ CERN experiments generate ~1 PB of measurements per second
✤ Infeasible to store or process directly; fast pre-processing is a must
Streaming Definitions
✤ A stream is a conceptually infinite, ever-growing set of data items / events
✤ An operator is a continuous stream transformer: each operator transforms its input streams into its output streams
✤ Operators may or may not have state, i.e., data that the operator remembers between firings
✤ The selectivity of an operator is its data rate, measured in output data items per input data item
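The definitions above can be sketched in a few lines of code. This is a minimal illustration, not from the slides; the class and function names are invented for the example. It shows a stateful operator (a running average that remembers data between firings, selectivity 1) and a stateless one (a filter, selectivity ≤ 1).

```python
# Hypothetical sketch of the streaming definitions above; names are illustrative.

class RunningAverage:
    """A stateful operator: total/count are remembered between firings."""
    def __init__(self):
        self.total = 0.0  # state kept across firings
        self.count = 0

    def fire(self, item):
        """Transform one input item into a list of output items."""
        self.total += item
        self.count += 1
        return [self.total / self.count]  # one output per input: selectivity 1

class EvenFilter:
    """A stateless operator: emits at most one output per input (selectivity <= 1)."""
    def fire(self, item):
        return [item] if item % 2 == 0 else []

# Drive both operators over a finite prefix of a (conceptually infinite) stream.
stream = [1, 2, 3, 4]
avg, flt = RunningAverage(), EvenFilter()
averages = [out for item in stream for out in avg.fire(item)]
outputs = [out for item in stream for out in flt.fire(item)]
# EvenFilter emitted 2 outputs for 4 inputs, i.e. a selectivity of 0.5.
```

In a real stream processing engine (SPE), operators like these are wired into a dataflow graph and fired continuously as items arrive, rather than driven by a list comprehension.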
Parallelism
a) Pipeline parallelism: concurrent execution of a producer A with a consumer B.
b) Task parallelism: concurrent execution of different operators D and E that do not constitute a pipeline.
c) Data parallelism: concurrent execution of multiple replicas of the same operator G on different portions of the same data (SPMD: single program, multiple data).
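Variant (c) can be illustrated with a short sketch. This is an assumed, simplified example (not from the slides): the same program (a word count) runs as two replicas, each on its own hash partition of the stream, and the partial results are merged at the end.

```python
# Hypothetical sketch of data parallelism (SPMD): hash-partition a stream
# across replicas of the same operator, then merge the partial results.
from concurrent.futures import ThreadPoolExecutor

def word_count(partition):
    """The single program that every replica runs on its own data portion."""
    counts = {}
    for word in partition:
        counts[word] = counts.get(word, 0) + 1
    return counts

stream = ["fog", "edge", "fog", "cloud", "edge", "fog"]
n_replicas = 2

# Route each item to a replica by key hash, so equal keys land on the same replica.
partitions = [[] for _ in range(n_replicas)]
for word in stream:
    partitions[hash(word) % n_replicas].append(word)

# Run all replicas concurrently on their partitions.
with ThreadPoolExecutor(max_workers=n_replicas) as pool:
    partials = list(pool.map(word_count, partitions))

# Merge: key sets are disjoint because partitioning is by key hash.
merged = {}
for counts in partials:
    merged.update(counts)
```

Key-based partitioning is what makes the merge trivial here; with arbitrary (round-robin) partitioning, the merge step would have to sum counts across replicas instead.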
3. Stream Processing Research
Analyzing Efficient Stream
Processing on Modern
Hardware
Steffen Zeuch, Bonaventura Del Monte, Jeyhun Karimov, Clemens Lutz,
Manuel Renz, Jonas Traub, Sebastian Breß, Tilmann Rabl, Volker Markl
Analysis Outline
✤ Analyze state-of-the-art streaming systems and identify sources of inefficiency
✤ Investigate the data-related design space:
✤ Data ingestion and passing
✤ Investigate the processing-related design space:
✤ Processing model, parallelization strategies, windowing mechanism
✤ Derive design changes for streaming systems to exploit modern hardware more efficiently
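One of the processing-related design points above, the windowing mechanism, can be sketched briefly. This is a generic illustration under assumed names, not the mechanism of any particular system analyzed in the talk: a tumbling count window that groups every `size` items and emits one aggregate per completed window.

```python
# Hypothetical sketch of one windowing mechanism: a tumbling count window.
def tumbling_count_windows(stream, size):
    """Group every `size` consecutive items and emit one aggregate per window."""
    window = []
    for item in stream:
        window.append(item)
        if len(window) == size:
            yield sum(window)  # per-window aggregate, here a sum
            window = []        # tumbling: windows do not overlap

results = list(tumbling_count_windows([1, 2, 3, 4, 5, 6, 7], size=3))
# Windows [1,2,3] and [4,5,6] complete; the partial window [7] is never emitted.
```

Real SPEs typically offer richer variants (time-based, sliding, session windows), where windows may overlap and a single item can contribute to several aggregates.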
Types of Streaming Systems
✤ Scale-out optimized SPEs
✤ Goal: massively parallelize the workload among many small to medium-sized nodes
✤ Systems: Flink, Spark, Storm
✤ Scale-up optimized SPEs
✤ Goal: exploit the capabilities of a single high-end machine efficiently
✤ Systems: Saber, Streambox, Trill
Data Ingestion
Common wisdom revisited: "local first" does not hold anymore
Source: following Binnig et al., The End of Slow Networks: It's Time for a Redesign. VLDB 2016.
Summary
✤ Analyze state-of-the-art streaming systems and identify sources of inefficiency
✤ Investigate the data-related design space:
✤ Data ingestion and passing
✤ Investigate the processing-related design space:
✤ Processing model, parallelization strategies, windowing mechanism
✤ Derive design changes for streaming systems to exploit modern hardware more efficiently
Interesting Topics for ESRs
✤ Long-running queries => lots of optimization potential and time to "try out" things
✤ Modern hardware and compilation-based approaches are very worthwhile for performance
✤ Fog computing involves very different ingestion rates (up to InfiniBand speeds): how to exploit them?
This training material is part of the FogGuru project that has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 765452. The information and views set out in this material are those of the author(s) and do not necessarily reflect the official opinion of the European Union. Neither the European Union institutions and bodies nor any person acting on their behalf may be held responsible for the use which may be made of the information contained therein.