Comparative Evaluation of
Spark and Flink Stream
Processing
Ehab Qadah
Supervisor: PD Dr. Michael Mock.
Lab: MA-INF 4306 - Data Science and Big Data
University of Bonn

Motivation
Which platform performs better for stream processing?
● To answer that:
➢ We provide a performance comparison (latency and throughput) of
stream processing in Apache Spark and Apache Flink.
➢ We cover some key aspects of real-time stream applications and
how they are handled in the two frameworks.
➢ We developed two stream processing workloads for the evaluation, over
datasets of aircraft trajectories provided by the DatAcron
project [1].

Outline
● Introduction
➢ What is a data stream
➢ Apache Spark
➢ Apache Flink
➢ Apache Kafka
● Data Stream Setup
● General Comparison Aspects
● Statistics Computation Workload
➢ Implementation in Spark and Flink
➢ Discussion and Performance Results
● Sector Change Detection Workload
➢ Implementation in Spark and Flink
➢ Discussion and Performance Results
● Conclusion

Introduction
● What is a data stream?
➢ “A data stream is a real-time, continuous, ordered (implicitly by arrival
time or explicitly by timestamp) sequence of items.” [2]
➢ Massive volumes of data; items arrive at a high rate.
● Applications of stream processing:
➢ Alerting on stream data from Internet of Things (IoT) devices
➢ Log analysis and statistics on web traffic
➢ Network monitoring
➢ Financial analysis (e.g., stock price trends)

Introduction
● Apache Spark:
➢ is an open-source project that provides a general framework for
large-scale data processing [3].
➢ offers programming APIs in Java, Scala, Python and R.
➢ Resilient Distributed Datasets (RDDs) and Discretized Streams (DStreams)
are the main data abstractions.
Figure: Software stack of Apache Spark [3].

Introduction
● Stream processing model of Spark:
➢ Spark Streaming processes the continuous stream of data by dividing it into
micro-batches that are processed by the Spark engine.
➢ The updateStateByKey operation is used to manage the state between the
micro-batches.
Figure: Process flow of Spark Streaming [3].
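The micro-batch model can be illustrated with a minimal sketch in plain Python. It does not use the Spark API: the helpers discretize, update_state_by_key and running_count are hypothetical names that only mimic the idea of cutting a stream into batches and carrying per-key state from one batch to the next. As a simplification, intervals with no records are skipped and keys without new values are not revisited.

```python
from collections import defaultdict

def discretize(records, batch_interval):
    """Group (timestamp, key, value) records into micro-batches by time interval."""
    batches = defaultdict(list)
    for ts, key, value in records:
        batches[int(ts // batch_interval)].append((key, value))
    # Return the batches in time order (empty intervals are skipped here).
    return [batches[i] for i in sorted(batches)]

def update_state_by_key(state, batch, update_fn):
    """Call update_fn(new_values, old_state) once per key seen in the batch."""
    grouped = defaultdict(list)
    for key, value in batch:
        grouped[key].append(value)
    for key, values in grouped.items():
        state[key] = update_fn(values, state.get(key))
    return state

def running_count(new_values, old_count):
    """State update: number of records seen so far for a key."""
    return (old_count or 0) + len(new_values)

records = [(0.2, "A", 1), (0.7, "B", 1), (1.1, "A", 1), (1.9, "A", 1), (2.5, "B", 1)]
state = {}
for batch in discretize(records, batch_interval=1.0):
    update_state_by_key(state, batch, running_count)
print(state)  # {'A': 3, 'B': 2}
```

The state dictionary is the only thing that survives between batches, which is the role the slides attribute to updateStateByKey.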

Introduction
● Apache Flink:
➢ is an open-source project that provides a large-scale, distributed stream
processing platform [4].
➢ offers programming APIs in Java and Scala.
➢ Flink treats batch processing as a special case of streaming
(i.e., processing a finite stream).
➢ DataStream and DataSet are the main data abstractions.
Figure: Software stack of Apache Flink [4].

Introduction
● Stream processing model of Flink:
➢ Flink's core is a distributed streaming dataflow engine; each
Flink program is represented by a dataflow graph.
Figure: An example of a dataflow graph in Flink [4].
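As a rough illustration of a dataflow graph, the sketch below chains operators as Python generators so that each record flows through the whole graph before the next one is read. This is plain Python, not the Flink API; the operator names source, map_op, filter_op and sink are assumptions made for the example.

```python
def source(items):
    """Source operator: emit records one at a time."""
    yield from items

def map_op(fn, upstream):
    """Map operator: transform each record as it arrives."""
    for record in upstream:
        yield fn(record)

def filter_op(predicate, upstream):
    """Filter operator: forward only the records that satisfy the predicate."""
    for record in upstream:
        if predicate(record):
            yield record

def sink(upstream):
    """Sink operator: pull records through the graph and collect them."""
    return list(upstream)

# Graph: source -> map (square) -> filter (even) -> sink
graph = filter_op(lambda x: x % 2 == 0, map_op(lambda x: x * x, source(range(1, 6))))
results = sink(graph)
print(results)  # [4, 16]
```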

Introduction
● Apache Kafka:
➢ is a scalable, fault-tolerant and distributed streaming framework [5].
➢ allows applications to publish and subscribe to data streams.
➢ manages the stream records in different categories (i.e., topics) that are
partitioned and distributed over the servers of the Kafka cluster.
➢ balances the stream partitions among the members of a consumer group.
Figure: Distribution of stream partitions across consumer groups [5].
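The balancing of partitions within a consumer group can be pictured with a simplified round-robin sketch. This is not Kafka's actual assignment logic; assign_partitions and the consumer names are hypothetical and only show that each partition goes to exactly one member and that the spread changes when a member joins.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the members of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

two_members = assign_partitions(range(4), ["c1", "c2"])
print(two_members)    # {'c1': [0, 2], 'c2': [1, 3]}

# A third member joins the group: the partitions are redistributed.
three_members = assign_partitions(range(4), ["c1", "c2", "c3"])
print(three_members)  # {'c1': [0, 3], 'c2': [1], 'c3': [2]}
```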

Data Stream Setup
● We use datasets of Automatic Dependent Surveillance - Broadcast (ADS-B)
messages that represent the position of aircraft over time.
● Each message comprises 22 fields of data such as aircraft ID, message
generation date, longitude, latitude and altitude.
● The datasets (2.4 GB) contain around 26 million messages.
Figure: The setup of the Data Stream Producer and Kafka Cluster.

General Comparison Aspects
● Handling parallel input streams (e.g., a Kafka stream).
● Aggregating the state of an input stream.
● Managing the order of stream records.
● Providing and updating a global data model in a stream processing task.
● Evaluating the performance by measuring latency and throughput.

Statistics Computation per Trajectory Workload
● Compute and aggregate statistics for each new position in a trajectory.
● Statistical quantities such as mean speed, mean of location coordinates, and
min and max altitude.
● This workload covers:
➢ Parallel receiving of an input data stream
➢ Stateful aggregation over a data stream
➢ Preserving the correct order of the records of a stream
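Aggregating statistics for each new position means the state has to be updatable from one record at a time. A minimal sketch of such an incremental update follows; the TrajectoryStats fields and the update function are assumptions for illustration and are not taken from the workload's source code.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryStats:
    count: int = 0
    mean_speed: float = 0.0
    min_altitude: float = float("inf")
    max_altitude: float = float("-inf")

def update(stats, speed, altitude):
    """Fold one new position into the aggregated statistics of a trajectory."""
    n = stats.count + 1
    return TrajectoryStats(
        count=n,
        # Incremental mean: no need to keep the earlier speeds around.
        mean_speed=stats.mean_speed + (speed - stats.mean_speed) / n,
        min_altitude=min(stats.min_altitude, altitude),
        max_altitude=max(stats.max_altitude, altitude),
    )

stats = TrajectoryStats()
for speed, altitude in [(400.0, 30000.0), (420.0, 31000.0), (440.0, 30500.0)]:
    stats = update(stats, speed, altitude)
print(stats)
```

Because only the aggregate is kept, the state per trajectory stays constant in size no matter how many positions arrive.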

Statistics Computation in Spark Streaming
● Create multiple Kafka streams (DStreams) and union them.
● Filter out irrelevant messages by applying a filter transformation.
● Construct a stream of trajectories (tuples of ID and list of positions) by
using the groupByKey transformation.
● For each micro-batch, sort the new list of positions, then calculate and
aggregate the statistics within the updateStateByKey function.

Statistics Computation in Flink
● Parse the Kafka stream records to build tuples of (ID, position) using a map
transformation.
● Construct a KeyedStream of trajectories by using the keyBy operation
(ID of the tuple as the key).
● A reduce transformation calculates the statistics for each newly arriving
position of a trajectory, using the aggregated statistics of the previous
positions.
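The keyed, item-by-item reduce described above can be sketched in plain Python as follows. It is not Flink code: keyed_reduce is a hypothetical helper, and the record layout (ID, count, max altitude) is an assumption chosen so that the input and the aggregate share one type, as a reduce requires.

```python
def keyed_reduce(stream, key_fn, reduce_fn):
    """Emit the running per-key reduction after every single record."""
    state = {}
    for record in stream:
        key = key_fn(record)
        if key in state:
            state[key] = reduce_fn(state[key], record)
        else:
            state[key] = record  # first record of a key is passed through
        yield state[key]

def merge(old, new):
    """Combine the aggregate so far with the newly arrived record."""
    return (old[0], old[1] + new[1], max(old[2], new[2]))

# Hypothetical records: (aircraft ID, count, max altitude)
stream = [("A", 1, 100), ("B", 1, 300), ("A", 1, 250), ("A", 1, 200)]
outputs = list(keyed_reduce(stream, key_fn=lambda r: r[0], reduce_fn=merge))
print(outputs)
# [('A', 1, 100), ('B', 1, 300), ('A', 2, 250), ('A', 3, 250)]
```

Each incoming record produces an updated result immediately, with no batch boundary in between.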

Differences between the two solutions:
Flink:
● Handles the parallel consumers of the Kafka stream implicitly.
● Operations over the KeyedStream are stateful.
● Sorting is not required, because the reduce transformation processes
the stream records item by item.
Spark:
● Multiple DStreams must be created and unioned to obtain parallel receivers.
● The updateStateByKey operation must be used to manage the state between
the micro-batches.
● A sort action is required inside the state update function to preserve
the correct order of the position messages.

Performance Results
Latency: (end of processing time – streaming time)

Performance Results
Throughput: (# processed messages / minute)
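The two metrics, latency as end of processing time minus streaming time and throughput as processed messages per minute, can be computed from per-message timestamps as in the sketch below. The event tuples and their values are made-up sample data in seconds, not measurements from the evaluation, and the function names are assumptions.

```python
from collections import Counter

def latencies(events):
    """events: (streaming_time, end_of_processing_time) pairs, in seconds."""
    return [done - sent for sent, done in events]

def throughput_per_minute(events):
    """Count the processed messages in each minute of end-of-processing time."""
    return dict(Counter(int(done // 60) for _, done in events))

# Hypothetical sample data, for illustration only.
events = [(0, 2), (10, 13), (50, 55), (61, 70), (100, 118)]

lat = latencies(events)
print(lat)                            # [2, 3, 5, 9, 18]
print(sum(lat) / len(lat))            # 7.4
print(throughput_per_minute(events))  # {0: 3, 1: 2}
```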

Air Sector Change Detection Workload
● Detect when an aircraft leaves one air sector and enters another.
● Uses a dataset of 20,000 sectors.
● This workload covers:
➢ Parallel receiving of an input data stream
➢ Stateful aggregation over a data stream
➢ Preserving the correct order of the records of a stream
➢ Providing and updating a global data model (sectors dataset) in a
stream processing task

Air Sector Change Detection in Spark Streaming
● Create multiple Kafka streams (DStreams) and union them.
● Filter messages of type 3 and 2 by applying a filter transformation.
● Construct a stream of trajectories (tuples of ID and list of positions) by
using the groupByKey transformation.
● For each micro-batch, sort the new list of positions and assign the
corresponding sector using the Broadcast feature within the
updateStateByKey function.
● Detect the change of sectors between two consecutive positions by applying
a filter transformation.
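The sort, sector assignment and change filter can be sketched in plain Python as follows. The sketch makes several assumptions that are not in the slides: sectors are modeled as rectangular boxes in a read-only dictionary (standing in for a broadcast lookup table), and the names SECTORS, find_sector and sector_changes are hypothetical.

```python
# Hypothetical rectangular sectors: (min_lon, min_lat, max_lon, max_lat).
SECTORS = {
    "S1": (0.0, 0.0, 10.0, 10.0),
    "S2": (10.0, 0.0, 20.0, 10.0),
}

def find_sector(lon, lat, sectors):
    """Return the name of the first sector containing the point, or None."""
    for name, (min_lon, min_lat, max_lon, max_lat) in sectors.items():
        if min_lon <= lon < max_lon and min_lat <= lat < max_lat:
            return name
    return None

def sector_changes(positions, sectors):
    """positions: time-ordered (timestamp, lon, lat) tuples of one aircraft."""
    changes = []
    previous = None
    for i, (ts, lon, lat) in enumerate(positions):
        current = find_sector(lon, lat, sectors)
        # Keep only consecutive positions whose sectors differ.
        if i > 0 and current != previous:
            changes.append((ts, previous, current))
        previous = current
    return changes

# Positions may arrive out of order within a batch, so sort by timestamp first.
positions = [(3, 12.0, 5.0), (1, 2.0, 5.0), (2, 8.0, 5.0), (4, 25.0, 5.0)]
changes = sector_changes(sorted(positions), SECTORS)
print(changes)  # [(3, 'S1', 'S2'), (4, 'S2', None)]
```

Without the sort, the first record would be compared against an earlier position and a spurious change from S2 back to S1 would be reported.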

Air Sector Change Detection in Flink
● Parse the Kafka stream records to build tuples of (ID, position) using a map
transformation.
● Construct a KeyedStream of trajectories by using the keyBy operation
(ID of the tuple as the key).
● A reduce transformation assigns the sector to each newly arriving position
of a trajectory and carries over the previous sector from the old tuple; the
sectors are provided to it manually.
● A filter transformation detects the tuples whose current and previous
sectors differ.

Differences between the two solutions:
Flink:
● The global data model is manually provided to the reduce
transformation.
● The program must be reloaded to update the sectors data.
Spark:
● Offers the Broadcast feature to provide the global data model (sectors).
● The sectors can be updated in the driver program by calling the unpersist
function on the broadcast variable and then updating it.

Performance Results
Latency: (end of processing time – streaming time)

Performance Results
Throughput: (# processed messages / minute)

Conclusion
● Results show that Flink outperforms Spark Streaming in
terms of processing latency.
● Spark Streaming achieves higher throughput than Flink
when the batch duration is increased.
● Flink's throughput is similar to that of Spark Streaming
with small batch durations.
● Flink's processing model is well suited to stream
processing tasks (stateful, low latency, item by item, no
batch interval).

References
[1] DatAcron project. Available: http://www.datacron-project.eu/
[2] Golab, Lukasz, and M. Tamer Özsu. "Issues in data stream management."
ACM SIGMOD Record 32.2 (2003): 5-14.
[3] Apache Spark. Available: https://spark.apache.org/
[4] Apache Flink. Available: https://flink.apache.org/
[5] Apache Kafka. Available: https://kafka.apache.org/intro.html
[6] Source code of the workloads. Available:
https://github.com/ehabqadah/Spark_vs_Flink/
THANK YOU FOR YOUR
ATTENTION

saastr
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
Session 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdfSession 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdf
UiPathCommunity
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
saastr
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 

Recently uploaded (20)

[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
Session 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdfSession 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdf
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 

Comparative Evaluation of Spark and Flink Stream Processing

  • 1. Comparative Evaluation of Spark and Flink Stream Processing. Ehab Qadah. Supervisor: PD Dr. Michael Mock. Lab: MA-INF 4306 - Data Science and Big Data, University of Bonn.
  • 2. Motivation: Which platform is superior to the other? ● To answer that: ➢ We provide a performance comparison (latency and throughput) of stream processing in Apache Spark and Apache Flink. ➢ We cover some key aspects of real-time stream applications and how they are handled in the two frameworks. ➢ We developed two evaluation stream processing workloads over datasets of aircraft trajectories provided by the DatAcron project [1].
  • 3. Outline: ● Introduction ● What is a data stream ● Apache Spark ● Apache Flink ● Apache Kafka ● Data Stream Setup ● General Comparison Aspects ● Statistics Computation Workload ● Implementation in Spark and Flink ● Discussion and Performance Results ● Sector Change Detection Workload ● Implementation in Spark and Flink ● Discussion and Performance Results ● Conclusion
  • 4. Introduction: ● What is a data stream: ➢ “A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items.” [2] ➢ Massive volumes of data; items arrive at a high rate. ● Applications of stream processing: ➢ Alerting on stream data from Internet of Things (IoT) devices ➢ Log analysis and statistics on web traffic ➢ Network monitoring ➢ Financial analysis (e.g., stock price trends)
  • 5. Introduction: ● Apache Spark: ➢ is an open source project that provides a general framework for large-scale data processing [3]. ➢ offers programming APIs in Java, Scala, Python and R. ➢ Resilient Distributed Datasets (RDDs) and Discretized Streams (DStreams) are the main data abstractions. Figure: Software stack of Apache Spark [3].
  • 6. Introduction: ● Stream processing model of Spark: ➢ Spark Streaming processes the continuous stream of data by dividing it into micro-batches that are processed by the Spark engine. ➢ The updateStateByKey operation is used to manage the state between the micro-batches. Figure: Process flow of Spark Streaming [3].
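The micro-batch model with a per-key state update can be illustrated with a small pure-Python simulation. This is only a sketch of the idea, not Spark code: the function names `micro_batches`, `update_state` and `run`, the record layout `(timestamp, key, value)`, and the choice of a simple count as state are all assumptions made for illustration. It also simplifies the real behaviour by updating only keys that received new values in a batch.

```python
from collections import defaultdict

def micro_batches(records, batch_interval):
    """Split (timestamp, key, value) records into consecutive batches by arrival time."""
    batches = defaultdict(list)
    for ts, key, value in records:
        batches[ts // batch_interval].append((key, value))
    return [batches[i] for i in sorted(batches)]

def update_state(new_values, old_state):
    """Per-key update function: fold one batch's values into the running state (a count)."""
    return (old_state or 0) + len(new_values)

def run(records, batch_interval):
    """Process the stream batch by batch, carrying the per-key state across batches."""
    state = {}
    for batch in micro_batches(records, batch_interval):
        grouped = defaultdict(list)
        for key, value in batch:
            grouped[key].append(value)
        for key, values in grouped.items():
            state[key] = update_state(values, state.get(key))
    return state

# Hypothetical records: (arrival time in seconds, aircraft ID, payload)
records = [(0, "A1", 10), (1, "B2", 20), (5, "A1", 11), (6, "A1", 12)]
print(run(records, batch_interval=5))
```

With a batch interval of 5 the four records fall into two batches, and the state for "A1" is carried from the first batch into the second.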
  • 7. Introduction: ● Apache Flink: ➢ is an open source project that provides a large-scale, distributed stream processing platform [4]. ➢ offers programming APIs in Java and Scala. ➢ Flink treats batch processing as a special case of streaming (i.e., a finite stream). ➢ DataStream and DataSet are the main data abstractions. Figure: Software stack of Apache Flink [4].
  • 8. Introduction: ● Stream processing model of Flink: ➢ Flink's core is a distributed streaming dataflow engine; each Flink program is represented by a dataflow graph. Figure: An example of a dataflow graph in Flink [4].
  • 9. Introduction: ● Apache Kafka: ➢ is a scalable, fault-tolerant and distributed streaming framework [5]. ➢ allows applications to publish and subscribe to data streams. ➢ manages the stream records in different categories (i.e., topics) that are partitioned and distributed over the servers of the Kafka cluster. ➢ balances the stream partitions among the members of a consumer group. Figure: Distribution of stream partitions among consumer groups [5].
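How partitions might be balanced among the members of a consumer group can be sketched as follows. This is a simplified round-robin illustration in plain Python, not Kafka's own assignment logic, which is configurable and more involved; the function name `assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group in round-robin order."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

# Five partitions of one topic shared by a group of two consumers
print(assign_partitions([0, 1, 2, 3, 4], ["c1", "c2"]))
```

Each partition is read by exactly one member of the group, so adding consumers (up to the number of partitions) increases read parallelism.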
  • 10. Data Stream Setup: ● We use datasets of Automatic Dependent Surveillance - Broadcast (ADS-B) messages that represent the position of aircraft over time. ● Each message comprises 22 fields of data, such as aircraft ID, message generation date, longitude, latitude and altitude. ● The datasets (2.4 GB) contain around 26 million messages. Figure: The setup of the Data Stream Producer and Kafka Cluster.
  • 11. General Comparison Aspects: ● Handling parallel input streams (e.g., Kafka streams). ● Aggregating the state of an input stream. ● Managing the order of stream records. ● Providing and updating a global data model in a stream processing task. ● Evaluating the performance by measuring latency and throughput.
  • 12. Statistics Computation per Trajectory Workload: ● Compute and aggregate statistics for each new position in a trajectory. ● Statistical quantities such as mean speed, mean location coordinates, minimum and maximum altitude, etc. ● This workload covers: ➢ Parallel receiving of an input data stream ➢ Stateful aggregation over a data stream ➢ Preserving the correct order of the records of a stream
  • 13. Statistics Computation in Spark Streaming: Create multiple Kafka streams (DStreams) and union them. Filter out irrelevant messages by applying a filter transformation. Construct a stream of trajectories (tuples of ID and list of positions) by using the groupByKey transformation. For each micro-batch, sort the new list of positions, then calculate and aggregate the statistics within the updateStateByKey function.
  • 14. Statistics Computation in Flink: Parse the Kafka stream records to build tuples of (ID, position) using a map transformation. Construct a KeyedStream of trajectories by using the keyBy operation (ID of the tuple as the key). A reduce transformation is used to calculate the statistics for each newly arriving position of a trajectory, using the aggregated statistics of the previous position.
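The item-by-item keyed aggregation described above can be sketched in plain Python. This is not Flink code: the names `Stats`, `reduce_stats` and `keyed_reduce`, and the reduced position layout `(speed, altitude)`, are assumptions for illustration. For brevity the sketch folds a raw position into an accumulator, whereas a reduce in Flink combines two values of the same type.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stats:
    count: int
    mean_speed: float
    min_alt: float
    max_alt: float

def reduce_stats(acc, position):
    """Fold one new (speed, altitude) position into the aggregated statistics."""
    speed, alt = position
    if acc is None:
        return Stats(1, speed, alt, alt)
    n = acc.count + 1
    # Incremental mean: no need to keep the earlier positions around
    mean = acc.mean_speed + (speed - acc.mean_speed) / n
    return Stats(n, mean, min(acc.min_alt, alt), max(acc.max_alt, alt))

def keyed_reduce(stream):
    """Keep one accumulator per aircraft ID and emit it after every record."""
    state = {}
    for aircraft_id, position in stream:
        state[aircraft_id] = reduce_stats(state.get(aircraft_id), position)
        yield aircraft_id, state[aircraft_id]

# Hypothetical stream of (aircraft ID, (speed, altitude)) tuples
stream = [("A1", (100.0, 1000.0)), ("A1", (200.0, 900.0)), ("B2", (50.0, 500.0))]
for aircraft_id, stats in keyed_reduce(stream):
    print(aircraft_id, stats)
```

Because each record updates the state as soon as it arrives, an updated result is available per record rather than per batch interval.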
  • 15. Differences between the two solutions: Flink: ● Handles the parallel consumers of the Kafka stream implicitly. ● Operations over the KeyedStream are stateful. ● Sorting is not required, because the reduce transformation processes the stream records item by item. Spark: ● Multiple DStreams must be created and unioned to obtain parallel receivers. ● The updateStateByKey operation must be used to manage the state between the micro-batches. ● A sort is required inside the state update function to preserve the correct order of the position messages.
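Why the sort matters in the micro-batch case can be shown with an order-dependent statistic. The sketch below is plain Python rather than Spark code; the function name `update_state`, the position layout `(timestamp, altitude)`, and the choice of accumulated altitude change as the statistic are assumptions for illustration.

```python
def update_state(new_positions, state):
    """State update for one trajectory.

    new_positions: list of (timestamp, altitude) from the current micro-batch,
                   possibly out of order because several receivers were unioned.
    state:         (last_altitude, total_altitude_change) or None for a new key.
    """
    last_alt, total = state if state is not None else (None, 0.0)
    for _, alt in sorted(new_positions):  # restore timestamp order first
        if last_alt is not None:
            total += abs(alt - last_alt)
        last_alt = alt
    return last_alt, total

# The first batch arrives out of order; sorting gives 100 -> 200 -> 300
state = update_state([(3, 300.0), (1, 100.0), (2, 200.0)], None)
print(state)
state = update_state([(5, 250.0), (4, 350.0)], state)
print(state)
```

Without the sort, the first batch would be read as 300, 100, 200 and the accumulated change would come out as 300.0 instead of 200.0.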
  • 16. Performance Results: Latency (end of processing time - streaming time).
  • 17. Performance Results: Throughput (number of processed messages per minute).
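The two metrics can be computed from per-message timestamps as in the following sketch. The event layout `(streaming_time, end_of_processing_time)` in seconds and the function names are assumptions, as is measuring the throughput window from the first streaming time to the last end-of-processing time; the sample values are made up and are not results from the evaluation.

```python
def mean_latency(events):
    """Average of (end of processing time - streaming time) over all messages."""
    return sum(end - start for start, end in events) / len(events)

def throughput_per_minute(events):
    """Processed messages per minute over the whole observed window."""
    first_start = min(start for start, _ in events)
    last_end = max(end for _, end in events)
    return len(events) / ((last_end - first_start) / 60.0)

# Made-up sample: (streaming time, end of processing time) in seconds
events = [(0.0, 2.0), (10.0, 11.0), (27.0, 30.0)]
print(mean_latency(events))           # seconds
print(throughput_per_minute(events))  # messages per minute
```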
  • 18. Air Sector Change Detection Workload: ● Detect when an aircraft leaves one air sector and enters another. ● Uses a dataset of 20,000 sectors. ● This workload covers: ➢ Parallel receiving of an input data stream ➢ Stateful aggregation over a data stream ➢ Preserving the correct order of the records of a stream ➢ Providing and updating a global data model (the sectors dataset) in a stream processing task
  • 19. Air Sector Change Detection in Spark Streaming: Create multiple Kafka streams (DStreams) and union them. Filter messages of type 3 and 2 by applying a filter transformation. Construct a stream of trajectories (tuples of ID and list of positions) by using the groupByKey transformation. For each micro-batch, sort the new list of positions and assign the corresponding sector using the Broadcast feature within the updateStateByKey function. Detect the change of sectors between two consecutive positions by applying a filter transformation.
  • 20. Air Sector Change Detection in Flink: Parse the Kafka stream records to build tuples of (ID, position) using a map transformation. Construct a KeyedStream of trajectories by using the keyBy operation (ID of the tuple as the key). A reduce transformation assigns the sector of each newly arriving position to its tuple, together with the previous sector taken from the old tuple; the sectors are provided to it manually. A filter transformation is used to detect the tuples whose current and previous sectors differ.
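The detection logic (assign a sector to each new position, remember the previous sector per aircraft, keep only records where the two differ) can be sketched in plain Python. This is not the workload's implementation: the function names are hypothetical, and modelling each sector as an axis-aligned bounding box `(min_lon, min_lat, max_lon, max_lat)` is an assumption made to keep the lookup short.

```python
def find_sector(sectors, lon, lat):
    """Return the name of the first sector whose bounding box contains the point."""
    for name, (min_lon, min_lat, max_lon, max_lat) in sectors.items():
        if min_lon <= lon < max_lon and min_lat <= lat < max_lat:
            return name
    return None

def detect_sector_changes(stream, sectors):
    """Yield (aircraft ID, previous sector, current sector) whenever the sector changes."""
    previous = {}
    for aircraft_id, lon, lat in stream:
        current = find_sector(sectors, lon, lat)
        if aircraft_id in previous and previous[aircraft_id] != current:
            yield aircraft_id, previous[aircraft_id], current
        previous[aircraft_id] = current

# Two made-up adjacent sectors and a short stream of (ID, longitude, latitude)
sectors = {"S1": (0, 0, 10, 10), "S2": (10, 0, 20, 10)}
stream = [("A1", 5, 5), ("A1", 9, 5), ("A1", 12, 5), ("B2", 15, 5)]
print(list(detect_sector_changes(stream, sectors)))
```

Here the `sectors` dictionary plays the role of the global data model; a linear scan over 20,000 sectors would be slow, so a spatial index would be a natural replacement for `find_sector`.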
  • 21. Differences between the two solutions: Flink: ● The global data model is manually provided to the reduce transformation. ● The program must be reloaded to update the sectors data. Spark: ● Offers the Broadcast feature to provide the global data model (sectors). ● The sectors can be updated in the driver program by using the unpersist function and then updating the broadcast data.
  • 22. Performance Results: Latency (end of processing time - streaming time).
  • 23. Performance Results: Throughput (number of processed messages per minute).
  • 24. Conclusion: ● Results show that Flink outperforms Spark Streaming in terms of processing latency. ● Spark Streaming provides higher throughput rates than Flink when the batch duration is increased. ● Flink gives a throughput similar to that of Spark Streaming with small batch durations. ● Flink's processing model is well-suited to stream processing tasks (stateful, low latency, item by item, no batch interval).
  • 25. References: [1] DatAcron project. Available: http://www.datacron-project.eu/ [2] Golab, Lukasz, and M. Tamer Özsu. “Issues in data stream management.” ACM SIGMOD Record 32.2 (2003): 5-14. [3] Apache Spark. Available: https://spark.apache.org/ [4] Apache Flink. Available: https://flink.apache.org/ [5] Apache Kafka. Available: https://kafka.apache.org/intro.html [6] Source code of the workloads. Available: https://github.com/ehabqadah/Spark_vs_Flink/
  • 26. Thank you for your attention.