Comparative Evaluation of Spark and Flink Stream Processing
Ehab Qadah
Supervisor: PD Dr. Michael Mock.
Lab: MA-INF 4306 - Data Science and Big Data
University of Bonn

Motivation
Which platform is superior?
● To answer that:
➢ We provide a performance comparison (latency and throughput) of
stream processing in Apache Spark and Apache Flink.
➢ We cover some key aspects of real-time stream applications and
how they are handled in the two frameworks.
➢ We developed two stream processing workloads for the evaluation, over
datasets of aircraft trajectories provided by the DatAcron
project [1].

Outline
● Introduction
➢ What is a data stream
➢ Apache Spark
➢ Apache Flink
➢ Apache Kafka
● Data Stream Setup
● General Comparison Aspects
● Statistics Computation Workload
➢ Implementation in Spark and Flink
➢ Discussion and Performance Results
● Sector Change Detection Workload
➢ Implementation in Spark and Flink
➢ Discussion and Performance Results
● Conclusion

Introduction
● What is a data stream:
➢ “A data stream is a real-time, continuous, ordered (implicitly by arrival
time or explicitly by timestamp) sequence of items.” [2]
➢ Massive volumes of data; items arrive at a high rate.
● Applications of stream processing:
➢ Alerting on stream data from Internet of Things (IoT) devices
➢ Log analysis and statistics on web traffic
➢ Network monitoring
➢ Financial analysis (e.g., stock price trends)

Introduction
● Apache Spark:
➢ is an open source project that provides a general framework for large-
scale data processing [3].
➢ offers programming APIs in Java, Scala, Python and R.
➢ Resilient Distributed Datasets (RDDs) and Discretized Streams (DStreams)
are the main data abstractions.
Software stack of Apache Spark [3].

Introduction
● Stream processing model of Spark:
➢ Spark Streaming processes the continuous stream of data by dividing it into
micro-batches that are processed by the Spark engine.
➢ The updateStateByKey operation is used to manage the state between the
micro-batches.
Process flow of Spark Streaming [3].
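The micro-batch model can be illustrated with a small sketch in plain Python. It does not use the Spark API; the names micro_batches, update_state_by_key and running_count are hypothetical, and the records are made-up (timestamp, key, value) tuples. As a simplification, the update function here is only called for keys that appear in the current batch, and empty intervals produce no batch.

```python
from collections import defaultdict


def micro_batches(records, batch_interval):
    """Split (timestamp, key, value) records into batches by time bucket."""
    buckets = defaultdict(list)
    for timestamp, key, value in records:
        buckets[timestamp // batch_interval].append((key, value))
    # Return the batches in time order.
    return [buckets[b] for b in sorted(buckets)]


def update_state_by_key(state, batch, update_fn):
    """Call update_fn(new_values, old_state) once per key present in the batch."""
    grouped = defaultdict(list)
    for key, value in batch:
        grouped[key].append(value)
    for key, new_values in grouped.items():
        state[key] = update_fn(new_values, state.get(key))
    return state


def running_count(new_values, old_count):
    # The old state is None the first time a key is seen.
    return (old_count or 0) + len(new_values)


records = [(0, "a", 1), (1, "b", 1), (2, "a", 1), (5, "a", 1)]
batches = micro_batches(records, batch_interval=2)
state = {}
for batch in batches:
    state = update_state_by_key(state, batch, running_count)
```

The state dictionary is the only thing that survives from one batch to the next, which is the role the slide assigns to updateStateByKey.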

Introduction
Software stack of Apache Flink [4].
● Apache Flink:
➢ is an open source project that provides a large-scale, distributed stream
processing platform [4].
➢ offers programming APIs in Java and Scala.
➢ Flink treats batch processing as a special case of stream
processing (i.e., a finite stream).
➢ DataStream and DataSet are
the main data abstractions.

Introduction
● Stream processing model of Flink:
➢ Flink's core is a distributed streaming dataflow engine; each
Flink program is represented by a dataflow graph.
An example of data flow graph in Flink [4].

Introduction
Distribution of a stream's partitions among
consumer groups [5].
● Apache Kafka:
➢ is a scalable, fault-tolerant and distributed streaming framework [5].
➢ allows applications to publish and subscribe to data streams.
➢ manages the stream records in different categories (i.e., topics) that are
partitioned and distributed over the servers of the Kafka cluster.
➢ balances the stream partitions
among the members of a consumer
group.
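The balancing idea can be sketched in plain Python as follows. This is not Kafka's actual assignment strategy, only a round-robin illustration of the property that, within one consumer group, each partition is read by exactly one member. The function name assign_partitions and the consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


two_members = assign_partitions([0, 1, 2, 3], ["c1", "c2"])
# With more consumers than partitions, the extra member stays idle.
three_members = assign_partitions([0, 1], ["c1", "c2", "c3"])
```

One consequence visible in the sketch: adding consumers beyond the number of partitions does not increase parallelism.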

Data Stream Setup
● We use datasets of Automatic Dependent Surveillance - Broadcast (ADS-B)
messages that represent the position of aircraft over time.
● Each message comprises 22 fields of data such as aircraft ID, date message
generated, longitude, latitude and altitude.
● The datasets (2.4 GB) contain around 26 million messages.
The setup of the Data Stream Producer and Kafka Cluster.

General Comparison Aspects
● Handling parallel input streams (e.g., Kafka streams).
● Aggregating the state of an input stream.
● Managing the order of stream records.
● Providing and updating a global data model in a stream processing task.
● Evaluating performance by measuring latency and throughput.

Statistics Computation per Trajectory Workload
● Compute and aggregate statistics for each new position in a trajectory.
● Statistical quantities such as mean speed, mean location coordinates, and
minimum and maximum altitude.
● This workload covers:
➢ Parallel receiving of an input data stream
➢ Stateful aggregation over a data stream
➢ Preserving the correct order of the records of a stream

Statistics Computation in Spark Streaming
Create multiple Kafka streams
(DStreams) and union them.
Filter out irrelevant messages by applying
a filter transformation.
Construct a stream of trajectories
(tuples of ID and list of positions) by
using the groupByKey transformation.
For each micro-batch, sort the new list
of positions, then calculate and aggregate
the statistics within the
updateStateByKey function.
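The per-batch steps above (filter, group by key, sort, update state) can be sketched in plain Python. This is an illustration only and does not call the Spark API. The message fields id, ts and alt, the filter criterion in is_relevant, and the chosen statistics are assumptions made for the example.

```python
from collections import defaultdict


def is_relevant(message):
    # Hypothetical filter criterion: keep messages that carry an altitude.
    return message.get("alt") is not None


def process_micro_batch(state, batch):
    """Update per-aircraft statistics with one micro-batch of messages."""
    # groupByKey step: collect the new positions of each aircraft.
    positions_by_id = defaultdict(list)
    for message in filter(is_relevant, batch):
        positions_by_id[message["id"]].append(message)

    for aircraft_id, positions in positions_by_id.items():
        # Records inside a batch may be out of order, so sort by timestamp.
        positions.sort(key=lambda m: m["ts"])
        stats = state.get(aircraft_id)
        for m in positions:
            if stats is None:
                stats = {"count": 1, "min_alt": m["alt"],
                         "max_alt": m["alt"], "last_ts": m["ts"]}
            else:
                stats = {
                    "count": stats["count"] + 1,
                    "min_alt": min(stats["min_alt"], m["alt"]),
                    "max_alt": max(stats["max_alt"], m["alt"]),
                    "last_ts": m["ts"],
                }
        state[aircraft_id] = stats
    return state


state = {}
batch_1 = [
    {"id": "A1", "ts": 2, "alt": 1200},
    {"id": "A1", "ts": 1, "alt": 1000},
    {"id": "A1", "ts": 3, "alt": None},
]
batch_2 = [{"id": "A1", "ts": 4, "alt": 900}]
for batch in (batch_1, batch_2):
    state = process_micro_batch(state, batch)
```

The sort only restores order inside one batch; the state carried between batches holds the aggregate of everything seen earlier.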

Statistics Computation in Flink
Parse the Kafka Stream records to build
tuples of (ID, position) using a map
transformation.
Construct a KeyedStream of trajectories
by using the keyBy operation
(ID of the tuple as the key).
A reduce transformation is used
to calculate the statistics for each
newly arriving trajectory position, using
the aggregated statistics of the previous position.
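The item-by-item reduce can be sketched in plain Python. This is not the Flink API; keyed_reduce, init_stats and reduce_position are hypothetical names, and the fields id, speed and alt are assumed for the example. Each new position is combined with the aggregate carried by the previous result for the same key, so no list of positions is kept.

```python
def init_stats(position):
    """Turn the first position of a trajectory into an aggregate."""
    return {**position, "n": 1, "mean_speed": position["speed"],
            "min_alt": position["alt"], "max_alt": position["alt"]}


def reduce_position(prev, new):
    """Combine the aggregate carried by prev with one new position."""
    n = prev["n"] + 1
    return {
        **new,
        "n": n,
        # Incremental mean: no need to store earlier speeds.
        "mean_speed": prev["mean_speed"] + (new["speed"] - prev["mean_speed"]) / n,
        "min_alt": min(prev["min_alt"], new["alt"]),
        "max_alt": max(prev["max_alt"], new["alt"]),
    }


def keyed_reduce(stream):
    """Apply the rolling reduce per key and emit one result per input item."""
    state, results = {}, []
    for position in stream:
        key = position["id"]
        if key in state:
            state[key] = reduce_position(state[key], position)
        else:
            state[key] = init_stats(position)
        results.append(state[key])
    return results


stream = [
    {"id": "A1", "speed": 100.0, "alt": 1000},
    {"id": "B2", "speed": 50.0, "alt": 500},
    {"id": "A1", "speed": 200.0, "alt": 1500},
    {"id": "A1", "speed": 300.0, "alt": 800},
]
results = keyed_reduce(stream)
```

Every input item yields an output immediately, which is the difference from the batch-oriented version: there is no interval to wait for.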

Differences between the two solutions:
Flink:
● Handles the parallel consumers of a
Kafka stream implicitly.
● Operations over the KeyedStream are
stateful.
● Sorting is not required, because the
reduce transformation processes
the stream records item by item.
Spark:
● Multiple DStreams must be created and
unioned to obtain parallel receivers.
● The updateStateByKey operation must be
used to manage the state between the
micro-batches.
● A sort action is required inside the
state update function to preserve
the correct order of the position
messages.

Performance Results
Latency: (end of processing time – streaming time)
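The latency definition can be written as a short plain-Python sketch. The field names streamed_at_ms and processed_at_ms and the sample values are hypothetical; reading "streaming time" as the moment the producer emitted the message is an assumption.

```python
from statistics import mean


def latency_ms(record):
    """Latency = end of processing time minus streaming time."""
    return record["processed_at_ms"] - record["streamed_at_ms"]


records = [
    {"streamed_at_ms": 1_000, "processed_at_ms": 1_040},
    {"streamed_at_ms": 2_000, "processed_at_ms": 2_100},
    {"streamed_at_ms": 3_000, "processed_at_ms": 3_010},
]
latencies = [latency_ms(r) for r in records]
mean_latency = mean(latencies)
```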

Performance Results
Throughput: (# processed messages / minute)
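The throughput definition can be sketched in plain Python by counting processed messages per one-minute bucket. The function name throughput_per_minute and the sample timestamps (milliseconds since the start of the run) are hypothetical.

```python
from collections import Counter


def throughput_per_minute(processed_at_ms):
    """Count processed messages in each one-minute bucket."""
    counts = Counter(t // 60_000 for t in processed_at_ms)
    return dict(sorted(counts.items()))


processed_at_ms = [0, 30_000, 59_999, 60_000, 125_000]
per_minute = throughput_per_minute(processed_at_ms)
```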

Air Sector Change Detection Workload
● Detect when an aircraft leaves one air sector and enters another.
● Uses a dataset of 20,000 sectors.
● This workload covers:
➢ Parallel receiving of an input data stream
➢ Stateful aggregation over a data stream
➢ Preserving the correct order of the records
of a stream
➢ Providing and updating a global data model
(sectors dataset) in a stream processing task

Air Sector Change Detection in Spark Streaming
Create multiple Kafka streams
(DStreams) and union them.
Filter messages of type 3 and 2 by applying
a filter transformation.
Construct a stream of trajectories
(tuples of ID and list of positions) by
using the groupByKey transformation.
For each micro-batch, sort the new list
of positions and assign the corresponding
sector, using the Broadcast feature,
within the updateStateByKey function.
Detect the change of
sectors between two
consecutive positions
by applying a filter
transformation.

Air Sector Change Detection in Flink
Parse the Kafka Stream records to build
tuples of (ID, position) using a map
transformation.
Construct a KeyedStream of trajectories
by using the keyBy operation
(ID of the tuple as the key).
A reduce transformation is used to
assign the sector to each newly arriving
trajectory position of a tuple, together with
the previous sector taken from the old tuple;
the sectors are provided manually.
A filter transformation
is used to detect the
tuples whose current
and previous sectors
differ.
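The detection logic (assign a sector to each position, remember the previous sector per aircraft, keep only the tuples where the two differ) can be sketched in plain Python. This is not the Flink API. Modelling sectors as axis-aligned bounding boxes is an assumption made to keep the example short, and sector_of and detect_sector_changes are hypothetical names.

```python
def sector_of(position, sectors):
    """Return the name of the first box containing the position, else None."""
    lon, lat = position
    for name, (min_lon, min_lat, max_lon, max_lat) in sectors.items():
        if min_lon <= lon < max_lon and min_lat <= lat < max_lat:
            return name
    return None


def detect_sector_changes(stream, sectors):
    """Emit (aircraft_id, previous_sector, current_sector) on each change."""
    last_sector = {}
    changes = []
    for aircraft_id, position in stream:
        current = sector_of(position, sectors)
        # The first position of an aircraft has no previous sector to compare.
        if aircraft_id in last_sector and last_sector[aircraft_id] != current:
            changes.append((aircraft_id, last_sector[aircraft_id], current))
        last_sector[aircraft_id] = current
    return changes


sectors = {"S1": (0, 0, 10, 10), "S2": (10, 0, 20, 10)}
stream = [
    ("A1", (1, 1)),
    ("A1", (5, 5)),
    ("A1", (12, 5)),
    ("B2", (15, 5)),
    ("A1", (25, 5)),
]
changes = detect_sector_changes(stream, sectors)
```

A linear scan over the boxes is used only for brevity; with a sector dataset of the size mentioned in the slides, a spatial index would be the more likely choice.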

Differences between the two solutions:
Flink:
● The global data model is manually
provided to the reduce
transformation.
● The program must be reloaded to
update the sectors data.
Spark:
● Offers the Broadcast feature to provide
the global data model (sectors).
● The sectors can be updated in the driver
program by using the unpersist
function and then updating the broadcast data.
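The contrast between the two ways of supplying the global data model can be sketched in plain Python. Neither class nor function below is a Spark or Flink API; SectorModel loosely stands in for a value the driver can replace between batches, and make_fixed_lookup for a model handed over once when the job is built. All names and data are hypothetical.

```python
class SectorModel:
    """Stand-in for a shared value that can be replaced while the job runs."""

    def __init__(self, sectors):
        self._sectors = dict(sectors)
        self.version = 1

    def value(self):
        return self._sectors

    def update(self, sectors):
        # Drop the old copy and publish the new one.
        self._sectors = dict(sectors)
        self.version += 1


def make_fixed_lookup(sectors):
    """Capture a snapshot of the model; later changes are not visible."""
    snapshot = dict(sectors)
    return lambda name: snapshot.get(name)


initial = {"S1": (0, 0, 10, 10)}
model = SectorModel(initial)
fixed_lookup = make_fixed_lookup(initial)

# A new sector becomes available while the job is running.
model.update({**initial, "S2": (10, 0, 20, 10)})
refreshed = model.value().get("S2")
stale = fixed_lookup("S2")
```

The snapshot only sees the new sector after make_fixed_lookup is called again, which corresponds to reloading the program.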

Performance Results
Latency: (end of processing time – streaming time)

Performance Results
Throughput: (# processed messages / minute)

Conclusion
● Results show that Flink outperforms Spark Streaming in
terms of processing latency.
● Spark Streaming achieves higher throughput
than Flink when the batch duration is increased.
● Flink's throughput is similar to that of Spark Streaming
with small batch durations.
● Flink's processing model is well suited to
stream processing tasks (stateful, low latency, item by item, no
batch interval).

References
[1] DatAcron project. Available: http://www.datacron-project.eu/
[2] Golab, Lukasz, and M. Tamer Özsu. "Issues in data stream management."
ACM SIGMOD Record 32.2 (2003): 5-14.
[3] Apache Spark. Available: https://spark.apache.org/.
[4] Apache Flink. Available: https://flink.apache.org/.
[5] Apache Kafka. Available: https://kafka.apache.org/intro.html.
[6] Source code of the workloads. Available: https://github.com/ehabqadah/Spark_vs_Flink/
THANK YOU FOR YOUR
ATTENTION

More Related Content

What's hot

Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsApache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsThomas Weise
 
Introduction to Apache Apex
Introduction to Apache ApexIntroduction to Apache Apex
Introduction to Apache ApexApache Apex
 
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop PlatformApache Apex
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingApache Apex
 
Building your first aplication using Apache Apex
Building your first aplication using Apache ApexBuilding your first aplication using Apache Apex
Building your first aplication using Apache ApexYogi Devendra Vyavahare
 
Low Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache ApexLow Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache ApexApache Apex
 
Ingestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexIngestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexApache Apex
 
Smart Partitioning with Apache Apex (Webinar)
Smart Partitioning with Apache Apex (Webinar)Smart Partitioning with Apache Apex (Webinar)
Smart Partitioning with Apache Apex (Webinar)Apache Apex
 
Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...
Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...
Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...Flink Forward
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream ProcessingGyula Fóra
 
An Introduction to Distributed Data Streaming
An Introduction to Distributed Data StreamingAn Introduction to Distributed Data Streaming
An Introduction to Distributed Data StreamingParis Carbone
 
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...Flink Forward
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon PresentationGyula Fóra
 
Marton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream ProcessingMarton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream ProcessingFlink Forward
 
How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...
How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...
How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...InfluxData
 

What's hot (20)

Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsApache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications
 
Introduction to Apache Apex
Introduction to Apache ApexIntroduction to Apache Apex
Introduction to Apache Apex
 
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
 
Apex as yarn application
Apex as yarn applicationApex as yarn application
Apex as yarn application
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark Streaming
 
Building your first aplication using Apache Apex
Building your first aplication using Apache ApexBuilding your first aplication using Apache Apex
Building your first aplication using Apache Apex
 
Low Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache ApexLow Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache Apex
 
Ingestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexIngestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache Apex
 
Smart Partitioning with Apache Apex (Webinar)
Smart Partitioning with Apache Apex (Webinar)Smart Partitioning with Apache Apex (Webinar)
Smart Partitioning with Apache Apex (Webinar)
 
Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...
Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...
Flink Forward San Francisco 2018: Stefan Richter - "How to build a modern str...
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream Processing
 
Apache flink
Apache flinkApache flink
Apache flink
 
An Introduction to Distributed Data Streaming
An Introduction to Distributed Data StreamingAn Introduction to Distributed Data Streaming
An Introduction to Distributed Data Streaming
 
The Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache FlinkThe Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache Flink
 
Unified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache FlinkUnified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache Flink
 
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon Presentation
 
Marton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream ProcessingMarton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream Processing
 
How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...
How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...
How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...
 

Viewers also liked

What Makes Great Infographics
What Makes Great InfographicsWhat Makes Great Infographics
What Makes Great InfographicsSlideShare
 
Masters of SlideShare
Masters of SlideShareMasters of SlideShare
Masters of SlideShareKapost
 
STOP! VIEW THIS! 10-Step Checklist When Uploading to Slideshare
STOP! VIEW THIS! 10-Step Checklist When Uploading to SlideshareSTOP! VIEW THIS! 10-Step Checklist When Uploading to Slideshare
STOP! VIEW THIS! 10-Step Checklist When Uploading to SlideshareEmpowered Presentations
 
10 Ways to Win at SlideShare SEO & Presentation Optimization
10 Ways to Win at SlideShare SEO & Presentation Optimization10 Ways to Win at SlideShare SEO & Presentation Optimization
10 Ways to Win at SlideShare SEO & Presentation OptimizationOneupweb
 
How To Get More From SlideShare - Super-Simple Tips For Content Marketing
How To Get More From SlideShare - Super-Simple Tips For Content MarketingHow To Get More From SlideShare - Super-Simple Tips For Content Marketing
How To Get More From SlideShare - Super-Simple Tips For Content MarketingContent Marketing Institute
 
2015 Upload Campaigns Calendar - SlideShare
2015 Upload Campaigns Calendar - SlideShare2015 Upload Campaigns Calendar - SlideShare
2015 Upload Campaigns Calendar - SlideShareSlideShare
 
What to Upload to SlideShare
What to Upload to SlideShareWhat to Upload to SlideShare
What to Upload to SlideShareSlideShare
 
How to Make Awesome SlideShares: Tips & Tricks
How to Make Awesome SlideShares: Tips & TricksHow to Make Awesome SlideShares: Tips & Tricks
How to Make Awesome SlideShares: Tips & TricksSlideShare
 
Getting Started With SlideShare
Getting Started With SlideShareGetting Started With SlideShare
Getting Started With SlideShareSlideShare
 

Viewers also liked (10)

What Makes Great Infographics
What Makes Great InfographicsWhat Makes Great Infographics
What Makes Great Infographics
 
Masters of SlideShare
Masters of SlideShareMasters of SlideShare
Masters of SlideShare
 
STOP! VIEW THIS! 10-Step Checklist When Uploading to Slideshare
STOP! VIEW THIS! 10-Step Checklist When Uploading to SlideshareSTOP! VIEW THIS! 10-Step Checklist When Uploading to Slideshare
STOP! VIEW THIS! 10-Step Checklist When Uploading to Slideshare
 
You Suck At PowerPoint!
You Suck At PowerPoint!You Suck At PowerPoint!
You Suck At PowerPoint!
 
10 Ways to Win at SlideShare SEO & Presentation Optimization
10 Ways to Win at SlideShare SEO & Presentation Optimization10 Ways to Win at SlideShare SEO & Presentation Optimization
10 Ways to Win at SlideShare SEO & Presentation Optimization
 
How To Get More From SlideShare - Super-Simple Tips For Content Marketing
How To Get More From SlideShare - Super-Simple Tips For Content MarketingHow To Get More From SlideShare - Super-Simple Tips For Content Marketing
How To Get More From SlideShare - Super-Simple Tips For Content Marketing
 
2015 Upload Campaigns Calendar - SlideShare
2015 Upload Campaigns Calendar - SlideShare2015 Upload Campaigns Calendar - SlideShare
2015 Upload Campaigns Calendar - SlideShare
 
What to Upload to SlideShare
What to Upload to SlideShareWhat to Upload to SlideShare
What to Upload to SlideShare
 
How to Make Awesome SlideShares: Tips & Tricks
How to Make Awesome SlideShares: Tips & TricksHow to Make Awesome SlideShares: Tips & Tricks
How to Make Awesome SlideShares: Tips & Tricks
 
Getting Started With SlideShare
Getting Started With SlideShareGetting Started With SlideShare
Getting Started With SlideShare
 

Similar to Comparative Evaluation of Spark and Flink Stream Processing

Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using KafkaKnoldus Inc.
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformApache Apex
 
Intro to Apache Apex @ Women in Big Data
Intro to Apache Apex @ Women in Big DataIntro to Apache Apex @ Women in Big Data
Intro to Apache Apex @ Women in Big DataApache Apex
 
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...HostedbyConfluent
 
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to StreamingBravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to StreamingYaroslav Tkachenko
 
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...ucelebi
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at ScaleSean Zhong
 
Stream processing - Apache flink
Stream processing - Apache flinkStream processing - Apache flink
Stream processing - Apache flinkRenato Guimaraes
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfishFei Dong
 
Stream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaStream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaItai Yaffe
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application Apache Apex
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured StreamingKnoldus Inc.
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications Comsysto Reply GmbH
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and visionStephan Ewen
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationYi Pan
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingApache Apex
 
Event Driven Microservices
Event Driven MicroservicesEvent Driven Microservices
Event Driven MicroservicesFabrizio Fortino
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
 

Similar to Comparative Evaluation of Spark and Flink Stream Processing (20)

Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
Intro to Apache Apex @ Women in Big Data
Intro to Apache Apex @ Women in Big DataIntro to Apache Apex @ Women in Big Data
Intro to Apache Apex @ Women in Big Data
 
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
 
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to StreamingBravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
 
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
 
Streams on wires
Streams on wiresStreams on wires
Streams on wires
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
 
Stream processing - Apache flink
Stream processing - Apache flinkStream processing - Apache flink
Stream processing - Apache flink
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfish
 
Stream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaStream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and Kafka
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application
 
SparkNet presentation
SparkNet presentationSparkNet presentation
SparkNet presentation
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and vision
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentation
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
 
Event Driven Microservices
Event Driven MicroservicesEvent Driven Microservices
Event Driven Microservices
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 

Recently uploaded

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Comparative Evaluation of Spark and Flink Stream Processing

  • 1. Comparative Evaluation of Spark and Flink Stream Processing Ehab Qadah Supervisor: PD Dr. Michael Mock. Lab: MA-INF 4306 - Data Science and Big Data University of Bonn
  • 2. Motivation
      Which platform is superior?
      ● To answer that:
        ➢ We provide a performance comparison (latency and throughput) of stream processing in Apache Spark and Apache Flink.
        ➢ We cover some key aspects of real-time stream applications and how they are handled in the two frameworks.
        ➢ We developed two evaluation stream processing workloads over datasets of aircraft trajectories provided by the DatAcron project [1].
  • 3. Outline
      ● Introduction
      ● What is a data stream
      ● Apache Spark
      ● Apache Flink
      ● Apache Kafka
      ● Data Stream Setup
      ● General Comparison Aspects
      ● Statistics Computation Workload
      ● Implementation in Spark and Flink
      ● Discussion and Performance Results
      ● Sector Change Detection Workload
      ● Implementation in Spark and Flink
      ● Discussion and Performance Results
      ● Conclusion
  • 4. Introduction
      ● What is a data stream:
        ➢ “A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items.” [2]
        ➢ Massive volumes of data; items arrive at a high rate.
      ● Applications of stream processing:
        ➢ Alerting on stream data from Internet of Things (IoT) devices
        ➢ Log analysis and statistics on web traffic
        ➢ Network monitoring
        ➢ Financial analysis (e.g., stock price trends)
  • 5. Introduction
      ● Apache Spark:
        ➢ is an open source project that provides a general framework for large-scale data processing [3].
        ➢ offers programming APIs in Java, Scala, Python and R.
        ➢ Resilient Distributed Datasets (RDDs) and Discretized Streams (DStreams) are the main data abstractions.
      [Figure: Software stack of Apache Spark [3].]
  • 6. Introduction
      ● Stream processing model of Spark:
        ➢ Spark Streaming processes the continuous stream of data by dividing it into micro-batches that are processed by the Spark engine.
        ➢ The updateStateByKey operation is used to manage the state between the micro-batches.
      [Figure: Process flow of Spark Streaming [3].]
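The micro-batch model described on this slide can be illustrated without a cluster. The following is a minimal plain-Python sketch, not the Spark API: `micro_batches`, `update_state_by_key` and `count_update` are hypothetical helper names, and the batch interval and records are made-up values. It only mimics the idea of cutting a stream into time intervals and carrying per-key state from one batch to the next.

```python
from collections import defaultdict

def micro_batches(records, batch_interval):
    """Group (timestamp, key, value) records into consecutive time intervals."""
    batches = defaultdict(list)
    for ts, key, value in records:
        batches[ts // batch_interval].append((key, value))
    # Emit the batches in time order, one list per interval.
    return [batches[i] for i in sorted(batches)]

def update_state_by_key(state, batch, update):
    """Call update(new_values, old_state) once for every key seen in the batch."""
    grouped = defaultdict(list)
    for key, value in batch:
        grouped[key].append(value)
    for key, values in grouped.items():
        state[key] = update(values, state.get(key))
    return state

def count_update(new_values, old_count):
    """Example state: number of messages seen so far for a key."""
    return (old_count or 0) + len(new_values)

# Hypothetical stream: (arrival time in seconds, aircraft ID, altitude).
stream = [(0, "A", 100), (1, "B", 300), (2, "A", 150), (5, "A", 120), (6, "B", 310)]

state = {}
for batch in micro_batches(stream, batch_interval=5):
    state = update_state_by_key(state, batch, count_update)

print(state)
```

The state is only refreshed once per batch, so a record waits for the end of its interval before it affects the result. This is the source of the latency cost of micro-batching that the later slides measure.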
  • 7. Introduction
      ● Apache Flink:
        ➢ is an open source project that provides a large-scale, distributed stream processing platform [4].
        ➢ offers programming APIs in Java and Scala.
        ➢ Flink treats batch processing as a special case of streaming applications (i.e., a finite stream).
        ➢ DataStream and DataSet are the main data abstractions.
      [Figure: Software stack of Apache Flink [4].]
  • 8. Introduction
      ● Stream processing model of Flink:
        ➢ Flink's core is a distributed streaming dataflow engine; each Flink program is represented by a dataflow graph.
      [Figure: An example of a dataflow graph in Flink [4].]
  • 9. Introduction
      ● Apache Kafka:
        ➢ is a scalable, fault-tolerant and distributed streaming framework [5].
        ➢ allows applications to publish and subscribe to data streams.
        ➢ manages the stream records in different categories (i.e., topics) that are partitioned and distributed over the servers of the Kafka cluster.
        ➢ balances the stream partitions among the members of a consumer group.
      [Figure: Distribution of stream partitions among consumer groups [5].]
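To make the partition balancing concrete, here is a small plain-Python sketch. It is an illustration only: `assign_partitions` is a hypothetical function, and the round-robin rule is an assumption chosen for simplicity, since Kafka supports several assignment strategies. The point is that each partition has exactly one owner inside a group, and that the partitions are redistributed when a member joins.

```python
def assign_partitions(partitions, consumers):
    """Spread topic partitions over the members of one consumer group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment

# Hypothetical topic with 6 partitions, read by a group of 2 and then 3 consumers.
before = assign_partitions(range(6), ["c1", "c2"])
after = assign_partitions(range(6), ["c1", "c2", "c3"])

print(before)
print(after)
```

Because a partition is never shared inside a group, the number of partitions is an upper bound on the number of consumers that can read in parallel.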
  • 10. Data Stream Setup
      ● We use datasets of Automatic Dependent Surveillance - Broadcast (ADS-B) messages that represent the position of aircraft over time.
      ● Each message comprises 22 fields of data, such as aircraft ID, message generation date, longitude, latitude and altitude.
      ● The datasets (2.4 GB) contain around 26 million messages.
      [Figure: The setup of the Data Stream Producer and Kafka Cluster.]
  • 11. General Comparison Aspects
      ● How to handle parallel input streams (e.g., a Kafka stream).
      ● How to aggregate the state of an input stream.
      ● How to manage the order of stream records.
      ● How to provide and update a global data model in a stream processing task.
      ● How the two frameworks perform, measured by latency and throughput.
  • 12. Statistics Computation per Trajectory Workload
      ● Compute and aggregate statistics for each new position in a trajectory.
      ● The statistics include mean speed, mean of the location coordinates, min and max altitude, etc.
      ● This workload covers:
        ➢ Parallel receiving of an input data stream
        ➢ Stateful aggregation over a data stream
        ➢ Preserving the correct order of the records of a stream
  • 13. Statistics Computation in Spark Streaming
      ● Create multiple Kafka streams (DStreams) and union them.
      ● Filter out irrelevant messages by applying a filter transformation.
      ● Construct a stream of trajectories (tuples of ID and list of positions) by using the groupByKey transformation.
      ● For each micro-batch, sort the new list of positions, then calculate and aggregate the statistics within the updateStateByKey function.
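The grouping, sorting and state update steps of this slide can be sketched in plain Python. This is not the Spark API and not the author's implementation: `update_stats` and `process_batch` are hypothetical names, the position layout `(timestamp, speed, altitude)` is an assumption, and only a few of the statistics are shown.

```python
from collections import defaultdict

def update_stats(new_positions, stats):
    """State update for one trajectory: sort the batch, then fold it into the stats."""
    if stats is None:
        stats = {"count": 0, "mean_speed": 0.0, "min_alt": float("inf"),
                 "max_alt": float("-inf"), "last_ts": None}
    else:
        stats = dict(stats)  # do not mutate the previous state in place
    # Positions inside one micro-batch may be out of order, so sort by timestamp.
    for ts, speed, alt in sorted(new_positions):
        stats["count"] += 1
        # Incremental mean, so the full history does not have to be stored.
        stats["mean_speed"] += (speed - stats["mean_speed"]) / stats["count"]
        stats["min_alt"] = min(stats["min_alt"], alt)
        stats["max_alt"] = max(stats["max_alt"], alt)
        stats["last_ts"] = ts
    return stats

def process_batch(state, batch):
    """Group one micro-batch by aircraft ID, then update the state of each key."""
    grouped = defaultdict(list)
    for aircraft_id, position in batch:
        grouped[aircraft_id].append(position)
    for aircraft_id, positions in grouped.items():
        state[aircraft_id] = update_stats(positions, state.get(aircraft_id))
    return state

# Hypothetical micro-batches of (aircraft ID, (timestamp, speed, altitude)).
batch_1 = [("A1", (2, 200.0, 1000)), ("A1", (1, 100.0, 900))]
batch_2 = [("A1", (3, 300.0, 1100))]

state = {}
for batch in (batch_1, batch_2):
    state = process_batch(state, batch)

print(state["A1"])
```

The sort only restores order inside a single batch. Any statistic that depends on consecutive positions relies on it, which is why the sort appears inside the state update function.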
  • 14. Statistics Computation in Flink
      ● Parse the Kafka stream records to build tuples of (ID, position) using a map transformation.
      ● Construct a KeyedStream of trajectories by using the keyBy operation (the ID of the tuple is the key).
      ● A reduce transformation calculates the statistics for each newly arriving position of a trajectory, using the aggregated statistics of the previous positions.
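The map, keyBy and reduce chain above can be mimicked in plain Python to show the item-by-item behaviour. This is a sketch under assumptions, not Flink code: `to_aggregate`, `reduce_stats` and `keyed_rolling_reduce` are hypothetical names, and the message layout `(ID, speed, altitude)` is made up. As in a rolling reduce, both inputs and the output of `reduce_stats` have the same type, and one updated result is emitted per incoming record.

```python
def to_aggregate(aircraft_id, speed, alt):
    """Map step: wrap one position as an aggregate over a single record."""
    return {"id": aircraft_id, "count": 1, "mean_speed": speed,
            "min_alt": alt, "max_alt": alt}

def reduce_stats(old, new):
    """Reduce step: merge the running aggregate with the newly arrived one."""
    count = old["count"] + new["count"]
    total_speed = old["mean_speed"] * old["count"] + new["mean_speed"] * new["count"]
    return {"id": old["id"], "count": count, "mean_speed": total_speed / count,
            "min_alt": min(old["min_alt"], new["min_alt"]),
            "max_alt": max(old["max_alt"], new["max_alt"])}

def keyed_rolling_reduce(records, key, reduce_fn):
    """Keep one state per key and emit an updated aggregate for every record."""
    state = {}
    for record in records:
        k = key(record)
        state[k] = reduce_fn(state[k], record) if k in state else record
        yield state[k]

# Hypothetical messages: (aircraft ID, speed, altitude).
stream = [("A1", 100.0, 900), ("B2", 400.0, 5000), ("A1", 200.0, 1000), ("A1", 300.0, 1100)]
aggregates = (to_aggregate(*message) for message in stream)
results = list(keyed_rolling_reduce(aggregates, key=lambda a: a["id"], reduce_fn=reduce_stats))

print(results[-1])
```

No batch boundary and no sort appear here: each record updates the state of its key as soon as it arrives, in arrival order.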
  • 15. Differences between the two solutions:
      Flink:
      ● Handles the parallel consumers of a Kafka stream implicitly.
      ● Operations over the KeyedStream are stateful.
      ● Sorting is not required, because the reduce transformation processes the stream records item by item.
      Spark:
      ● Multiple DStreams must be created and unioned to obtain parallel receivers.
      ● The updateStateByKey operation must be used to manage the state between the micro-batches.
      ● A sort is required inside the state update function to preserve the correct order of the position messages.
  • 16. Performance Results
      Latency: (end of processing time - streaming time)
  • 17. Performance Results
      Throughput: (# processed messages / minute)
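The two metrics defined on these slides are simple to compute once each message carries its streaming time and its end-of-processing time. A minimal sketch follows; the function names and all numbers are made up for illustration and are not measurements from the evaluation.

```python
from statistics import mean

def latencies_ms(log):
    """Latency per message: end-of-processing time minus streaming time."""
    return [processed_at - streamed_at for streamed_at, processed_at in log]

def throughput_per_minute(message_count, elapsed_seconds):
    """Number of processed messages, normalised to one minute."""
    return message_count * 60 / elapsed_seconds

# Hypothetical log of (streamed_at, processed_at) timestamps in milliseconds.
log = [(0, 40), (1000, 1030), (2000, 2050)]

per_message = latencies_ms(log)
mean_latency = mean(per_message)

# Hypothetical run: 1200 messages processed in 30 seconds.
rate = throughput_per_minute(1200, elapsed_seconds=30)

print(per_message, mean_latency, rate)
```

Both timestamps must come from synchronised clocks (or the same machine), otherwise the subtraction in `latencies_ms` also measures clock skew.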
  • 18. Air Sector Change Detection Workload
      ● Detect when an aircraft leaves one air sector and enters another.
      ● Uses a dataset of 20,000 sectors.
      ● This workload covers:
        ➢ Parallel receiving of an input data stream
        ➢ Stateful aggregation over a data stream
        ➢ Preserving the correct order of the records of a stream
        ➢ How to provide and update a global data model (the sectors dataset) in a stream processing task
  • 19. Air Sector Change Detection in Spark Streaming
      ● Create multiple Kafka streams (DStreams) and union them.
      ● Filter messages of type 3 & 2 by applying a filter transformation.
      ● Construct a stream of trajectories (tuples of ID and list of positions) by using the groupByKey transformation.
      ● For each micro-batch, sort the new list of positions and assign the corresponding sector, using the Broadcast feature within the updateStateByKey function.
      ● Detect the change of sectors between two consecutive positions by applying a filter transformation.
  • 20. Air Sector Change Detection in Flink
      ● Parse the Kafka stream records to build tuples of (ID, position) using a map transformation.
      ● Construct a KeyedStream of trajectories by using the keyBy operation (the ID of the tuple is the key).
      ● A reduce transformation assigns the sector to each newly arriving position of a trajectory and carries over the sector of the previous tuple; the sectors dataset is provided to it manually.
      ● A filter transformation detects the tuples whose current and previous sectors differ.
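The detection logic shared by both implementations (assign a sector to each position, then compare it with the previous sector of the same aircraft) can be sketched in plain Python. Everything below is an assumption made for illustration: the sectors are simplified to rectangles in (longitude, latitude), the names `find_sector` and `detect_sector_changes` are hypothetical, and the positions are invented.

```python
def find_sector(sectors, lon, lat):
    """Return the name of the first sector containing the point, or None."""
    for name, (min_lon, min_lat, max_lon, max_lat) in sectors.items():
        if min_lon <= lon < max_lon and min_lat <= lat < max_lat:
            return name
    return None

def detect_sector_changes(positions, sectors):
    """Yield (aircraft ID, previous sector, current sector) whenever the sector changes."""
    last_sector = {}  # per-aircraft state: sector of the previous position
    for aircraft_id, lon, lat in positions:
        current = find_sector(sectors, lon, lat)
        if aircraft_id in last_sector and last_sector[aircraft_id] != current:
            yield (aircraft_id, last_sector[aircraft_id], current)
        last_sector[aircraft_id] = current

# Hypothetical global data model: two adjacent rectangular sectors.
sectors = {"S1": (0, 0, 10, 10), "S2": (10, 0, 20, 10)}

# Hypothetical ordered positions: (aircraft ID, longitude, latitude).
positions = [("A1", 2, 5), ("A1", 8, 5), ("B2", 15, 5), ("A1", 12, 5), ("A1", 25, 5)]

changes = list(detect_sector_changes(positions, sectors))
print(changes)
```

The `sectors` dictionary plays the role of the global data model. In this sketch it is simply passed as an argument, so replacing it means restarting the loop. A linear scan over all sectors is also only workable for a toy example; with 20,000 sectors a spatial index would be the natural next step.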
  • 21. Differences between the two solutions:
      Flink:
      ● The global data model is manually provided to the reduce transformation.
      ● The program must be reloaded to update the sectors data.
      Spark:
      ● Offers the Broadcast feature to provide the global data model (sectors).
      ● The sectors can be updated in the driver program by calling the unpersist function and then updating the broadcast data.
  • 22. Performance Results
      Latency: (end of processing time - streaming time)
  • 23. Performance Results
      Throughput: (# processed messages / minute)
  • 24. Conclusion
      ● Results show that Flink outperforms Spark Streaming in terms of processing latency.
      ● Spark Streaming provides higher throughput rates than Flink when the batch duration is increased.
      ● Flink gives a throughput similar to that of Spark Streaming with small batch durations.
      ● Flink's processing model is well suited to stream processing tasks (stateful, low latency, item by item, no batch interval).
  • 25. References
      [1] DatAcron project. Available: http://www.datacron-project.eu/
      [2] Golab, Lukasz, and M. Tamer Özsu. "Issues in data stream management." ACM SIGMOD Record 32.2 (2003): 5-14.
      [3] Apache Spark. Available: https://spark.apache.org/
      [4] Apache Flink. Available: https://flink.apache.org/
      [5] Apache Kafka. Available: https://kafka.apache.org/intro.html
      [6] Source code of the workloads. Available: https://github.com/ehabqadah/Spark_vs_Flink/
  • 26. THANK YOU FOR YOUR ATTENTION