1. Comparative Evaluation of Spark and Flink Stream Processing
Ehab Qadah
Supervisor: PD Dr. Michael Mock.
Lab: MA-INF 4306 - Data Science and Big Data
University of Bonn
2. Motivation
Which platform is superior to the other?
● To answer that:
➢ We provide a performance comparison (latency and throughput) of
stream processing in Apache Spark and Apache Flink.
➢ We cover some key aspects of real-time stream applications and
how they are handled in the two frameworks.
➢ We developed two stream processing workloads for the evaluation, over
datasets of aircraft trajectories provided by the DatAcron project [1].
3. Outline
● Introduction
➢ What is a data stream?
➢ Apache Spark
➢ Apache Flink
➢ Apache Kafka
● Data Stream Setup
● General Comparison Aspects
● Statistics Computation Workload
➢ Implementation in Spark and Flink
➢ Discussion and Performance Results
● Sector Change Detection Workload
➢ Implementation in Spark and Flink
➢ Discussion and Performance Results
● Conclusion
4. Introduction
● What is a data stream?
➢ “A data stream is a real-time, continuous, ordered (implicitly by arrival
time or explicitly by timestamp) sequence of items.” [2]
➢ Massive volumes of data, items arrive at a high rate.
● Applications of stream processing:
➢ Alerting on stream data from the Internet of Things (IoT) devices
➢ Log analysis and statistics on web traffic
➢ Network monitoring
➢ Financial analysis (e.g., stock price trends)
5. Introduction
● Apache Spark:
➢ is an open source project that provides a general framework for large-scale data processing [3].
➢ offers programming APIs in Java, Scala, Python and R.
➢ Resilient Distributed Datasets (RDDs) & Discretized Stream (DStream)
are the main data abstractions.
Software stack of Apache Spark [3].
6. Introduction
● Stream processing model of Spark:
➢ Spark Streaming processes the continuous stream of data by dividing it into
micro-batches that are processed by the Spark engine.
➢ The updateStateByKey operation is used to manage the state between the
micro-batches.
Process flow of Spark Streaming [3].
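To make the micro-batch model concrete, here is a minimal, self-contained sketch of a stateful Spark Streaming job. The socket source and the running count per aircraft ID are illustrative stand-ins, not the workload code from [6]:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch-sketch")
    // Batch interval: the continuous stream is cut into 1-second micro-batches.
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/checkpoint") // required for stateful operations

    // Hypothetical input: lines of "aircraftId,..." from a socket.
    val pairs = ssc.socketTextStream("localhost", 9999)
      .map(line => (line.split(",")(0), 1L))

    // updateStateByKey carries a running count per key across micro-batches.
    val counts = pairs.updateStateByKey[Long] {
      (newValues: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + newValues.sum)
    }
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```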
7. Introduction
Software stack of Apache Flink [4].
● Apache Flink:
➢ is an open source project that provides a large-scale, distributed stream
processing platform [4].
➢ offers programming APIs in Java and Scala.
➢ Flink treats batch processing as a special case of streaming applications (i.e., a finite stream).
➢ The DataStream and DataSet are the main data abstractions.
8. Introduction
● Stream processing model of Flink:
➢ Flink's core is a distributed streaming dataflow engine; each Flink
program is represented as a dataflow graph.
An example of a dataflow graph in Flink [4].
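As a minimal illustration of such a dataflow graph, the sketch below chains a source, a transformation, a keyed stateful operator, and a sink. The socket source and the max-altitude aggregation are illustrative stand-ins:

```scala
import org.apache.flink.streaming.api.scala._

object DataflowSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Hypothetical source: lines of "aircraftId,altitude" from a socket.
    // Each operator below becomes a node in Flink's dataflow graph.
    val maxAltitude = env
      .socketTextStream("localhost", 9999)             // source
      .map { line =>
        val f = line.split(",")
        (f(0), f(1).toDouble)                          // transformation
      }
      .keyBy(_._1)                                     // stream partitioning
      .reduce((a, b) => (a._1, math.max(a._2, b._2)))  // stateful operator

    maxAltitude.print()                                // sink
    env.execute("dataflow-sketch")
  }
}
```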
9. Introduction
Distribution of stream partitions across consumer groups [5].
● Apache Kafka:
➢ is a scalable, fault-tolerant and distributed streaming framework [5].
➢ allows publishing and subscribing to data streams.
➢ manages the stream records in different categories (i.e., topics) that are
partitioned and distributed over the servers of the Kafka cluster.
➢ balances the stream partitions among the members of a consumer group.
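A minimal consumer sketch (using the Kafka Java client from Scala) shows the group mechanism: every instance started with the same group.id receives a disjoint subset of the topic's partitions. The topic name adsb-messages and all connection settings are placeholders:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object ConsumerGroupSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    // Consumers sharing a group.id split the topic's partitions among them;
    // starting several instances of this program shows the rebalancing.
    props.put("group.id", "adsb-consumers")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Collections.singletonList("adsb-messages"))

    while (true) {
      val records = consumer.poll(1000L)
      for (record <- records.asScala)
        println(s"partition=${record.partition} offset=${record.offset} value=${record.value}")
    }
  }
}
```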
10. Data Stream Setup
● We use datasets of Automatic Dependent Surveillance–Broadcast (ADS-B)
messages that represent the positions of aircraft over time.
● Each message comprises 22 data fields, such as aircraft ID, message
generation date, longitude, latitude, and altitude.
● The datasets (2.4 GB) contain around 26 million messages.
The setup of the Data Stream Producer and Kafka Cluster.
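A hedged sketch of such a data stream producer: it replays a CSV file of ADS-B messages into Kafka, keying each record by aircraft ID so that one aircraft's messages land in a single partition and keep their order. The file name, the topic name, and the assumption that the aircraft ID is the first field are illustrative:

```scala
import java.util.Properties
import scala.io.Source
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object AdsbProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // Replay the ADS-B dataset line by line; the aircraft ID (assumed to be
    // the first field) is used as the record key, so one aircraft's messages
    // stay ordered within a single partition.
    for (line <- Source.fromFile("adsb_messages.csv").getLines()) {
      val aircraftId = line.split(",")(0)
      producer.send(new ProducerRecord[String, String]("adsb-messages", aircraftId, line))
    }
    producer.close()
  }
}
```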
11. General Comparison Aspects
● Handling parallel input streams (e.g., Kafka streams).
● Aggregating the state of an input stream.
● Managing the order of stream records.
● Providing and updating a global data model in a stream processing task.
● Evaluating performance by measuring latency and throughput.
12. Statistics Computation per Trajectory Workload
● Compute and aggregate statistics for each new position in a trajectory.
● Statistics such as the mean speed, mean location coordinates, and minimum
and maximum altitude (a sketch of the incremental update follows below).
● This workload covers:
➢ Parallel receiving of an input data stream
➢ Stateful aggregation over a data stream
➢ Preserving the correct order of the records of a stream
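As a concrete example of the per-position aggregation, the running statistics can be updated incrementally without keeping the full position history. The field names below are illustrative, not taken from [6]:

```scala
// Running statistics for one trajectory, updated per incoming position.
case class TrajStats(count: Long, speedMean: Double, minAlt: Double, maxAlt: Double) {
  // Incremental mean: m_n = m_{n-1} + (x_n - m_{n-1}) / n
  def update(speed: Double, alt: Double): TrajStats =
    TrajStats(count + 1,
      speedMean + (speed - speedMean) / (count + 1),
      math.min(minAlt, alt),
      math.max(maxAlt, alt))
}
```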
13. Statistics Computation in Spark Streaming
Create multiple Kafka streams (DStreams) and union them.
Filter out irrelevant messages by applying a filter transformation.
Construct a stream of trajectories
(tuples of ID and list of positions) by
using the groupByKey transformation.
For each micro-batch, sort the new list of positions, then calculate and
aggregate the statistics within the updateStateByKey function.
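The four steps above can be sketched as follows, using the receiver-based Kafka connector (KafkaUtils.createStream from the spark-streaming-kafka 0.8 artifact); the topic, ZooKeeper address, group ID, and parsing logic are placeholders, not the code from [6]:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Minimal stand-ins: a timestamped position and the running statistics.
case class Position(ts: Long, altitude: Double)
case class Stats(count: Long, minAlt: Double, maxAlt: Double) {
  def add(p: Position): Stats =
    Stats(count + 1, math.min(minAlt, p.altitude), math.max(maxAlt, p.altitude))
}

object StatsSparkSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("stats-spark"), Seconds(1))
    ssc.checkpoint("/tmp/stats-checkpoint")

    // Step 1: one receiver-based Kafka stream per core, unioned together.
    val streams = (1 to 4).map { _ =>
      KafkaUtils.createStream(ssc, "localhost:2181", "stats-group",
        Map("adsb-messages" -> 1)).map(_._2)
    }
    val lines = ssc.union(streams)

    // Steps 2-3: parse and filter, then group positions per aircraft ID.
    val grouped = lines.map(_.split(","))
      .filter(_.length >= 3)                         // stand-in relevance filter
      .map(f => (f(0), Position(f(1).toLong, f(2).toDouble)))
      .groupByKey()

    // Step 4: inside updateStateByKey, sort the batch's positions by
    // timestamp and fold them into the per-trajectory state.
    val stats = grouped.updateStateByKey[Stats] {
      (batch: Seq[Iterable[Position]], state: Option[Stats]) =>
        val ordered = batch.flatten.sortBy(_.ts)
        Some(ordered.foldLeft(state.getOrElse(Stats(0, Double.MaxValue, Double.MinValue)))(_.add(_)))
    }
    stats.print()
    ssc.start(); ssc.awaitTermination()
  }
}
```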
14. Statistics Computation in Flink
Parse the Kafka Stream records to build
tuples of (ID, position) using a map
transformation.
Construct a KeyedStream of trajectories
by using the KeyBy operation
(ID of the tuple as the key).
A reduce transformation is used to calculate the statistics for each newly
arriving position of a trajectory, using the aggregated statistics of the
previous positions.
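A corresponding Flink sketch follows; note that the Kafka connector class name varies with the Flink release (e.g., FlinkKafkaConsumer09 in older versions), and the topic and parsing logic are placeholders:

```scala
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

// Minimal stand-in for a position together with its aggregated statistics.
case class PositionStats(id: String, ts: Long, count: Long, minAlt: Double, maxAlt: Double)

object StatsFlinkSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")
    props.setProperty("group.id", "stats-flink")

    // The Kafka source is parallel out of the box: Flink assigns the
    // topic's partitions to the source's parallel subtasks itself.
    val lines = env.addSource(
      new FlinkKafkaConsumer[String]("adsb-messages", new SimpleStringSchema(), props))

    val stats = lines
      .map { line =>
        val f = line.split(",")
        PositionStats(f(0), f(1).toLong, 1L, f(2).toDouble, f(2).toDouble)
      }
      .keyBy(_.id) // per-trajectory KeyedStream; the reduce below is stateful
      .reduce((acc, cur) => PositionStats(acc.id, cur.ts, acc.count + 1,
        math.min(acc.minAlt, cur.minAlt), math.max(acc.maxAlt, cur.maxAlt)))

    stats.print()
    env.execute("stats-flink-sketch")
  }
}
```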
15. Ehab Qadah 15
Differences between the two solutions:Differences between the two solutions:
Flink:
● Handles the parallel consumers of the Kafka stream implicitly.
● Operations over the KeyedStream are stateful.
● No sort is required, since the reduce transformation processes the stream records item by item.
Spark:
● Multiple DStreams must be created and unioned to obtain parallel receivers.
● The updateStateByKey operation must be used to manage the state between the micro-batches.
● A sort action is required inside the state update function to preserve the correct order of the position messages.
16. Performance Results
Latency: (end of processing time – streaming time)
18. Air Sector Change Detection Workload
● Detect when an aircraft enters or leaves an air sector (i.e., its transition from one sector to another).
● Using a dataset of 20,000 sectors.
● This workload covers:
➢ Parallel receiving of an input data stream
➢ Stateful aggregation over a data stream
➢ Preserving the correct order of the records
of a stream
➢ How to provide and update a global data model
(the sectors dataset) in a stream processing task
19. Air Sector Change Detection in Spark Streaming
Create multiple Kafka streams
(DStreams) and union them.
Filter messages of type 3 & 2 by applying
a filter transformation.
Construct a stream of trajectories
(tuples of ID and list of positions) by
using the groupByKey transformation.
For each micro-batch, sort the new list
of positions and assign the corresponding
sector, using the Broadcast feature,
within the updateStateByKey function.
Detect the change of
sectors between two
consecutive positions
by applying a filter
transformation.
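A hedged sketch of this Broadcast-based variant; the box-shaped sector model, the socket source (standing in for the unioned Kafka DStreams), and all data are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Stand-in sector model: axis-aligned boxes instead of real sector geometries.
case class Sector(name: String, minLon: Double, maxLon: Double, minLat: Double, maxLat: Double) {
  def contains(lon: Double, lat: Double): Boolean =
    lon >= minLon && lon <= maxLon && lat >= minLat && lat <= maxLat
}
case class Pos(ts: Long, lon: Double, lat: Double)

object SectorSparkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new org.apache.spark.SparkContext(new SparkConf().setAppName("sectors-spark"))
    val ssc = new StreamingContext(sc, Seconds(1))
    ssc.checkpoint("/tmp/sectors-checkpoint")

    // Broadcast the global sector dataset once; every executor can then
    // read it inside the state update function without reshipping it.
    val sectors = sc.broadcast(Seq(
      Sector("A", 0, 10, 40, 50), Sector("B", 10, 20, 40, 50))) // placeholder data

    def sectorOf(p: Pos): String =
      sectors.value.find(_.contains(p.lon, p.lat)).map(_.name).getOrElse("none")

    // Stand-in source; the actual workload unions several Kafka DStreams here.
    val positions = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(f => (f(0), Pos(f(1).toLong, f(2).toDouble, f(3).toDouble)))
      .groupByKey()

    // State per aircraft: (previous sector, current sector).
    val transitions = positions.updateStateByKey[(String, String)] {
      (batch: Seq[Iterable[Pos]], state: Option[(String, String)]) =>
        val ordered = batch.flatten.sortBy(_.ts)
        Some(ordered.foldLeft(state.getOrElse(("none", "none"))) {
          case ((_, cur), p) => (cur, sectorOf(p))
        })
    }.filter { case (_, (prev, cur)) => prev != cur } // keep only sector changes

    transitions.print()
    ssc.start(); ssc.awaitTermination()
  }
}
```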
20. Air Sector Change Detection in Flink
Parse the Kafka Stream records to build
tuples of (ID, position) using a map
transformation.
Construct a KeyedStream of trajectories
by using the KeyBy operation
(ID of the tuple as the key).
A reduce transformation is used to
assign the sector of each newly arriving
position and to carry over the previous
sector from the old tuple; the sector
dataset is provided manually.
A filter transformation
is used to detect the
tuples whose current and
previous sectors differ.
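A corresponding Flink sketch; the sector dataset is baked into the program (captured by the operator closures), which is exactly why updating it requires a redeployment, as the next slide notes. All types and data are illustrative:

```scala
import org.apache.flink.streaming.api.scala._

// Stand-in types: a box-shaped sector and a per-aircraft record that
// carries the previous and current sector.
case class Sector(name: String, minLon: Double, maxLon: Double, minLat: Double, maxLat: Double) {
  def contains(lon: Double, lat: Double): Boolean =
    lon >= minLon && lon <= maxLon && lat >= minLat && lat <= maxLat
}
case class Track(id: String, prevSector: String, curSector: String)

object SectorFlinkSketch {
  // The global data model is provided "manually": it ships with the job,
  // so changing it means redeploying the program.
  val sectors = Seq(Sector("A", 0, 10, 40, 50), Sector("B", 10, 20, 40, 50))
  def sectorOf(lon: Double, lat: Double): String =
    sectors.find(_.contains(lon, lat)).map(_.name).getOrElse("none")

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Stand-in source; the actual workload reads from Kafka.
    val changes = env.socketTextStream("localhost", 9999)
      .map { line =>
        val f = line.split(",")
        Track(f(0), "none", sectorOf(f(1).toDouble, f(2).toDouble))
      }
      .keyBy(_.id)
      // Carry the previous record's sector alongside the new one.
      .reduce((old, cur) => Track(cur.id, old.curSector, cur.curSector))
      // Keep only the records where the sector actually changed.
      .filter(t => t.prevSector != t.curSector)

    changes.print()
    env.execute("sectors-flink-sketch")
  }
}
```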
21. Differences between the two solutions
Flink:
● The global data model must be provided manually to the reduce transformation.
● The program must be redeployed to update the sector data.
Spark:
● Offers the Broadcast feature to provide the global data model (sectors).
● The sectors can be updated in the driver program by calling unpersist on the broadcast variable and then re-broadcasting the updated data (see the sketch below).
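The re-broadcast pattern mentioned for Spark might look like the following sketch, assuming a mutable reference to the broadcast variable held in the driver:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Hedged sketch of the re-broadcast pattern: hold the broadcast variable in
// a mutable reference, unpersist the old value, and broadcast the new one.
object SectorBroadcast {
  @volatile private var sectorsBc: Broadcast[Seq[String]] = _

  def init(sc: SparkContext, sectors: Seq[String]): Unit =
    sectorsBc = sc.broadcast(sectors)

  def refresh(sc: SparkContext, newSectors: Seq[String]): Unit = {
    sectorsBc.unpersist(blocking = false) // drop cached copies on executors
    sectorsBc = sc.broadcast(newSectors)  // reship the updated dataset
  }

  def current: Seq[String] = sectorsBc.value
}
```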
22. Performance Results
Latency: (end of processing time – streaming time)
24. Conclusion
● The results show that Flink outperforms Spark Streaming in
terms of processing latency.
● Spark Streaming provides higher throughput rates than
Flink when the batch duration is increased.
● Flink achieves a throughput similar to Spark Streaming
with small batch durations.
● Flink's processing model is well suited to stream
processing tasks (stateful, low latency, item-by-item
processing, no batch interval).
25. References
[1] DatAcron project. Available: http://www.datacron-project.eu/
[2] Golab, Lukasz, and M. Tamer Özsu. "Issues in data stream management." ACM SIGMOD Record 32.2 (2003): 5-14.
[3] Apache Spark. Available: https://spark.apache.org/
[4] Apache Flink. Available: https://flink.apache.org/
[5] Apache Kafka. Available: https://kafka.apache.org/intro.html
[6] Source code of the workloads. Available: https://github.com/ehabqadah/Spark_vs_Flink/