Comparison of Open-Source Data Stream
Processing Engines: Spark Streaming, Flink and
Storm
Darshankumar Vinubhai Gorasiya (x18134751)
School of Computing (Programming for Data Analytics)
National College of Ireland (NCI)
Dublin, Ireland
x18134751@student.ncirl.ie
Abstract—The constant rise in businesses reliance on digital
technologies has brought exponential growth in the number of sources
producing continuous streams of data. The Hadoop stack, together with
the MapReduce framework, addressed many of the challenges of storing
and processing huge volumes of data, famously termed the Big Data
problem. However, the surge in applications ranging from IoT devices
to mission-critical monitoring, which generate unbounded continuous
streams of data that require real-time or near real-time processing
as they are produced, led to further industrial and academic study in
the field. As a result, numerous streaming data processing platforms
such as Apache Storm, Flink, Spark Streaming, Kafka Streams, and Samza
have been built on top of existing distributed parallel computing
frameworks to satisfy the needs of streaming applications, where low
latency, tolerance to failures, and high throughput are highly desired.
The architectural complexity and implementation challenges of these
engines in real-world scenarios have caused confusion across the
business community and made previous benchmarking outcomes
inconsistent, as a minor change in low-level environmental properties
can lead to entirely different results. Few independent benchmarking
studies not only differentiate performance measures but also present
the conceptual and architectural distinctions alongside them. This
review paper aims to do so for three major streaming engines, Apache
Storm, Spark Streaming, and Flink, while critically evaluating the
performance comparisons of previous benchmarking studies, to help
businesses make an informed decision on the adoption of these platforms.
Index Terms—Data Streaming, Apache Spark, Storm, Real-
Time Big-Data Processing, Apache Flink
I. INTRODUCTION
The global digital transformation, with automation and broad
applications of artificial intelligence across domains, has seen
rapid growth in the past decade. By the end of 2020, the world of
digital devices is expected to generate 44 ZB of data. [1] Managing
information produced on an unprecedented scale and from a variety of
sources was a challenge for industry, but the distributed architecture
of Hadoop and MapReduce tackled a significant share of the difficulties
in managing and processing data-at-rest by using the power of parallel
computing on commodity hardware. [2] However, that system has not been
able to cope efficiently with the growing demand from real-time
big-data applications. IoT devices, online gaming, the automotive
industry, sensor recording, smart cities, real-time threat detection,
and financial fraud detection, to name just a few, require a large
continuous stream of information to be processed while it is in motion.
This required a new architectural approach, because the work is
time-sensitive and data must be processed as it is produced while
preserving state, high fault tolerance, and service certainty, as
opposed to a batch architecture, where information is stored first and
later processed periodically in large batches for further knowledge
extraction. [3]
The open-source community and industry-driven research support have
brought to life numerous stream processing engines such as Apache
Storm, Flink, Kafka Streams, Samza, and Spark Structured Streaming.
Differences in latency, throughput, and in-memory processing
architecture between the engines have left industry users unsure which
is best suited to an individual implementation, owing to the lack of a
cross-industry benchmarking methodology and studies. [4] The majority
of previous studies are use-case specific and do not simulate
real-world application properties, resulting in inaccurate
assessments. [5]
Apache Flink, Spark, and Storm are currently the most popular
streaming platforms, owing to their fault-tolerant architecture and
support for scalability in stream processing. [6] The three are based
on different processing architectures: the Spark Streaming engine is
based on the concept of micro-batching, while Flink and Storm are
native streaming engines. The objective of this study is to show the
conceptual differences between the open-source platforms Apache Flink,
Storm, and Spark Streaming, and then to compare the existing
benchmarking studies and assess them critically.
The remainder of this paper is organized as follows. Section II
defines models for data stream processing. Section III details the
characteristics and architectural distinctions of the three major
platforms. Previous benchmarking studies are presented in Section IV.
Finally, Section V concludes the review paper.
II. STREAM COMPUTATION MODELS
Services that handle unbounded data streams fall largely into two
models:
A. Native Streaming:
Native streaming models are designed around the needs of real-time
applications. Fig. 1 shows a data stream obtained from producer
sources over time, in which each record is processed individually on
an ongoing basis. This architecture helps to decrease latency, because
a record waits less before it enters the system. Apache Flink and
Storm, with their directed-graph data flow, adhere to this
architecture, resulting in lower latency than the
micro-batching-oriented Apache Spark. [6] [7]
Fig. 1. Data Stream Processing Flow: Native Streaming
Because stream data is handled record by record rather than in
batches, throughput is lower than with micro-batching. However,
different studies show configurable back-end implementation approaches
that better handle this trade-off to satisfy streaming application
requirements. [3] [1]
B. Micro-Batching:
Fig. 2. Data Stream Processing Flow: Micro-batching
A micro-batching architecture takes a continuous input data stream
from multiple sources, as shown in Fig. 2, and splits the stream into
small batches or groups. Sets of those batches are then processed in
parallel at short time intervals by the processing engine. Apache
Spark Streaming follows this architecture, managing streams in small
batches. All the sources and stream processing nodes together form a
Directed Acyclic Graph (DAG). [6] At the core of Apache Spark, batches
are processed following this model as Discretized Streams (D-Streams)
made of Resilient Distributed Datasets (RDDs). [8] [9]
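The difference between the two models of Section II can be illustrated
with a minimal sketch. It is a toy model only and does not use the API
of any of the engines discussed; the function names, the record layout
(arrival time, payload), and the window alignment to multiples of the
batch interval are all assumptions made for illustration. Records are
assumed to arrive in time order.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record layout: (arrival time in seconds, payload).
Record = Tuple[float, str]


def native_stream(records: Iterable[Record]) -> Iterator[List[Record]]:
    """Native streaming: hand each record downstream as soon as it arrives."""
    for record in records:
        yield [record]


def micro_batches(records: Iterable[Record], interval: float) -> Iterator[List[Record]]:
    """Micro-batching: group records into consecutive windows of `interval` seconds.

    Windows are aligned to multiples of `interval` (an assumption of this
    sketch); windows that receive no records produce no batch.
    """
    batch: List[Record] = []
    current_window = None
    for arrival, payload in records:
        window = int(arrival // interval)
        if current_window is not None and window != current_window:
            yield batch  # the previous window is complete
            batch = []
        current_window = window
        batch.append((arrival, payload))
    if batch:
        yield batch  # flush the last, possibly partial, window


events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.7, "d"), (2.9, "e")]
for group in micro_batches(events, interval=1.0):
    print([payload for _, payload in group])
```

In this sketch a record that arrives early in a window is held until the
window closes, which is the source of the extra latency attributed to
micro-batching, while processing several records per batch is what
allows per-batch overheads to be shared.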
III. PLATFORM CHARACTERISTICS
Fault-Tolerance: As the system is vulnerable to failures owing to
network or software errors, fault tolerance is of primary importance
in streaming applications. Spark Streaming applies its fault tolerance
mechanisms to individual batches, whereas fault tolerance is expensive
for native streaming systems such as Storm and Flink because it is
enforced at the level of each record. Spark provides an 'Exactly-Once'
processing guarantee in case of failures by continuously replicating
state to other worker nodes, so that after a failure the state can be
retrieved from another node and processing can be restarted. [10] [6]
Similarly, Flink provides an 'Exactly-Once' processing guarantee by
keeping track of distributed snapshots and checkpoints for failure
recovery. [9] [6] [2] Storm, however, does not provide state
management; in case of application failure, it restarts the entire
process on a different node, giving an 'At-Least-Once' guarantee.
[3] [6]
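The practical difference between the two guarantees can be sketched
with a small simulation. This is a toy model under stated assumptions,
not the recovery protocol of Storm, Flink, or Spark: the function
`run`, its parameters, and the single simulated crash are hypothetical.
A word counter reads from a replayable source and periodically saves
its read offset; after the crash it resumes from the saved offset. If
the counts are not restored together with the offset, replayed records
are counted twice (at-least-once); if offset and counts are restored
from the same snapshot, each record affects the result once
(exactly-once).

```python
from typing import Dict, List


def run(words: List[str], crash_at: int, every: int, restore_state: bool) -> Dict[str, int]:
    """Count words with one simulated crash just before record `crash_at`.

    every:         save a snapshot (offset and counts) every `every` records.
    restore_state: True  -> roll counts back to the snapshot (exactly-once effect);
                   False -> keep current counts and only rewind the offset
                            (at-least-once: replayed records are re-counted).
    """
    counts: Dict[str, int] = {}
    saved_offset, saved_counts = 0, {}
    crashed = False
    i = 0
    while i < len(words):
        if i == crash_at and not crashed:
            crashed = True
            i = saved_offset  # the source replays from the saved offset
            if restore_state:
                counts = dict(saved_counts)
            continue
        counts[words[i]] = counts.get(words[i], 0) + 1
        i += 1
        if i % every == 0:
            saved_offset, saved_counts = i, dict(counts)
    return counts


words = ["a", "b", "a", "c", "b", "a"]
print(run(words, crash_at=5, every=3, restore_state=False))
print(run(words, crash_at=5, every=3, restore_state=True))
```

With the inputs shown, the first call over-counts the two records that
were processed after the last snapshot and then replayed, while the
second call returns the true counts.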
State-Management: To manage state, a separate thread is required to
continuously update and preserve the existing state of records. This
is not natively available in Storm; however, it can be implemented
with the help of ZooKeeper. [2] [4] State management in Spark
Streaming is associated with RDDs and involves updating each batch
even when the state has not changed, which makes it far less efficient
than Flink. [10] Flink provides efficient support for integrating
state management, using a distributed file system to keep track of
state with snapshots. [1]
Performance (Latency vs. Throughput): Latency is the time a record in
the stream has to wait after it is produced, and throughput is the
number of records processed by the system per unit of time. Studies
show that the Spark Streaming micro-batching model leads to higher
latency and high throughput, whereas native streaming platforms such
as Storm and Flink process records continuously, giving low latency.
[9] [7] Certain recent studies also focus on network latency, which
has grown with the increase in cloud-based infrastructure, and propose
the use of Edge and Fog computing to reduce it. [11] Further
performance benchmarking studies are discussed in Section IV.
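The two measures defined above can be computed from per-record
timestamps, as in the following minimal sketch. The function name and
the input layout (a list of produced/processed timestamp pairs in
seconds) are assumptions made for illustration, and measuring
throughput over the span from the first production to the last
completion is only one of several reasonable conventions; the
benchmarking studies cited here each define their own measurement
procedure.

```python
from statistics import mean
from typing import List, Tuple


def latency_and_throughput(events: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (mean latency in seconds, throughput in records per second).

    events: (produced_at, processed_at) timestamp pairs, one per record.
    """
    if not events:
        raise ValueError("at least one record is required")
    latencies = [processed - produced for produced, processed in events]
    # Observation span: first record produced to last record processed.
    span = max(done for _, done in events) - min(produced for produced, _ in events)
    if span <= 0:
        raise ValueError("timestamps must span a positive interval")
    return mean(latencies), len(events) / span


avg_latency, throughput = latency_and_throughput([(0.0, 0.5), (1.0, 1.5), (2.0, 4.0)])
print(avg_latency, throughput)
```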
IV. TOOLS BENCHMARKING
The independent benchmarking of these services is crucial for
business, as it helps decision makers to decide on the basis of
statistical evidence. Yahoo! has contributed substantially by
providing benchmarking tools such as YCSB. Using YCSB, the researchers
in [12] probed the breaking points of Flink, Spark, and Storm with a
variety of cluster sizes, with Redis as the back end, Kafka as the
messaging system, and ZooKeeper to provide delivery assurance of
records. Similarly, the independent study [6] experiments on an
architecture of one master and seven worker nodes, measuring these
services in the case of node failures. Both studies conclude that
Spark is robust to node failures but lags in latency compared with
Storm and Flink. However, both studies are limited in terms of complex
event arrival and do not produce the same amount of workload as a
production environment.
To better simulate a real-world environment, studies [7] [5] present
benchmarking results for the threat detection and advertisement
industries using 20 and 30 nodes. One study [7] used Kafka as the
messaging service, whereas the other [5] avoided any messaging service
in order to eliminate network delay with a large volume of data. Both
point towards similar results: Spark is faster at processing events,
even when data is skewed, whereas in applications where the data
fluctuates Flink comes out ahead of Spark and Storm. Researchers are
also exploring newer ways to perform benchmarking. [4] presented a
unique way to do so using a one-way highway approach for Flink, in
which a highway-like environment is simulated for real-time streaming
event processing, using a windowing design to better manage traffic
and the out-of-order arrival of records. Although these studies guide
specific use cases, there is still a need for better cross-industry
benchmarking of streaming services. Table I highlights the key
differences amongst these services.
TABLE I
OVERALL COMPARISON OF SPARK, FLINK & STORM
Tools            Streaming Services
Characteristic   Storm              Flink              Spark
Assurance        At-Least-Once      Exactly-Once       Exactly-Once
State-fullness   No                 Yes                Yes
Flow of Data     DAG                CDG                DAG
Community        Selective          Growing            Wide
Streaming Type   Native-Streaming   Native-Streaming   Micro-Batches
API              Compositional      Compositional      Declarative
Scaling          Manual             Manual             Auto
Language         Java, Clojure      Java, Scala, Py    Scala, Java, Py
Data Carrier     Tuple              DataStream         DStream
V. CONCLUSION
Streaming services are growing their application base in various
industries. This paper presents the conceptual and architectural
differences between Flink, Spark Streaming, and Storm. Many studies
have been proposed to help differentiate the performance of each;
however, there is no clear winner. The studies are use-case specific
and rely on default parameters. The majority conclude that Spark works
best, with high throughput, when the incoming volume is huge and
latency is not a priority. With small volumes, Storm also performs
well in terms of latency, similar to Flink, but Flink does better than
Storm in fault tolerance. These studies are based on simulation and do
not replicate a real-world environment; hence there is a further need
for a better benchmarking approach that helps differentiate across
use cases.
REFERENCES
[1] P. Carbone, S. Ewen, G. Fóra, S. Haridi, S. Richter, and K. Tzoumas,
“State management in Apache Flink®,” Proceedings of the VLDB
Endowment, vol. 10, no. 12, pp. 1718–1729, 2017. [Online]. Available:
http://dl.acm.org/citation.cfm?doid=3137765.3137777
[2] O. C. Marcu, A. Costan, G. Antoniu, and M. S. Pérez-Hernández,
“Spark versus flink: Understanding performance in big data analytics
frameworks,” Proceedings - IEEE International Conference on Cluster
Computing, ICCC, pp. 433–442, 2016.
[3] M. Hussain Iqbal and T. Rahim Soomro, “Big Data Analysis: Apache
Storm Perspective,” International Journal of Computer Trends and
Technology, vol. 19, no. 1, pp. 9–14, 2015.
[4] M. Hanif, H. Yoon, and C. Lee, “Benchmarking Tool for Modern
Distributed Stream Processing Engines,” 2019 International Conference
on Information Networking (ICOIN), no. 2017, pp. 393–395, 2019.
[5] J. Karimov, T. Rabl, A. Katsifodimos, R. Samarev, H. Heiskanen, and
V. Markl, “Benchmarking distributed stream data processing systems,”
Proceedings - IEEE 34th International Conference on Data Engineering,
ICDE 2018, pp. 1519–1530, 2018.
[6] M. A. Lopez, A. G. P. Lobato, and O. C. M. Duarte, “A performance
comparison of open-source stream processing platforms,” in 2016 IEEE
Global Communications Conference, GLOBECOM 2016 - Proceedings,
2016.
[7] S. Chintapalli, D. Dagit, B. Evans, R. Farivar, T. Graves,
M. Holderbaugh, Z. Liu, K. Nusbaum, K. Patil, B. J. Peng, and P. Poulosky,
“Benchmarking streaming computation engines: Storm, flink and spark
streaming,” Proceedings - 2016 IEEE 30th International Parallel and
Distributed Processing Symposium, IPDPS 2016, pp. 1789–1792, 2016.
[8] F. Gurcan and M. Berigel, “Real-Time Processing of Big Data Streams:
Lifecycle, Tools, Tasks, and Challenges,” ISMSIT 2018 - 2nd International
Symposium on Multidisciplinary Studies and Innovative Technologies,
Proceedings, 2018.
[9] D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera, “A
comparison on scalability for batch big data processing on Apache
Spark and Apache Flink,” Big Data Analytics, vol. 2, no. 1, pp. 1–11,
2017. [Online]. Available: http://dx.doi.org/10.1186/s41044-016-0020-2
[10] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave,
X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi,
and J. Gonzalez, “Apache Spark: A Unified Engine for Big Data
Processing,” Communications of the ACM, vol. 59, no. 11, pp. 56–65,
2016.
[11] V. Cardellini, G. Mencagli, D. Talia, and M. Torquati, “New Landscapes
of the Data Stream Processing in the era of Fog Computing,” Future
Generation Computer Systems, no. xxxx, 2019. [Online]. Available:
https://doi.org/10.1016/j.future.2019.03.027
[12] Z. Karakaya, A. Yazici, and M. Alayyoub, “A Comparison of Stream
Processing Frameworks,” 2017 International Conference on Computer
and Applications, ICCA 2017, pp. 1–12, 2017.

'Drowning Earth' - Magazine-style report on Climate Change. - Data Visualization
 
Consumer-To-Consumer Food Delivery System on Salesforce.
Consumer-To-Consumer Food Delivery System on Salesforce.Consumer-To-Consumer Food Delivery System on Salesforce.
Consumer-To-Consumer Food Delivery System on Salesforce.
 
Diabetic Retinopathy Detection using CNN image classification algorithm
Diabetic Retinopathy Detection using CNN image classification algorithmDiabetic Retinopathy Detection using CNN image classification algorithm
Diabetic Retinopathy Detection using CNN image classification algorithm
 
Quantitative Performance Evaluation of Cloud-Based MySQL (Relational) Vs. Mon...
Quantitative Performance Evaluation of Cloud-Based MySQL (Relational) Vs. Mon...Quantitative Performance Evaluation of Cloud-Based MySQL (Relational) Vs. Mon...
Quantitative Performance Evaluation of Cloud-Based MySQL (Relational) Vs. Mon...
 

Recently uploaded

dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 

Recently uploaded (20)

dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 

Comparison of Open-Source Data Stream Processing Engines: Spark Streaming, Flink and Storm

Darshankumar Vinubhai Gorasiya (x18134751)
School of Computing (Programming for Data Analytics)
National College of Ireland (NCI)
Dublin, Ireland
x18134751@student.ncirl.ie

Abstract—The constant rise in businesses' reliance on digital technologies has brought exponential growth in the number of sources producing continuous streams of data. The Hadoop stack, together with the MapReduce framework, addressed many of the challenges that come with storing and processing huge volumes of data, famously termed the Big Data problem. However, the surge in applications ranging from IoT devices to mission-critical monitoring, which generate unbounded continuous streams of data that require real-time or near-real-time processing as they are produced, led to further industrial and academic studies in the field. As a result, numerous streaming data processing platforms such as Apache Storm, Flink, Spark Streaming, Kafka Streams, and Samza have been built on top of existing distributed parallel computing frameworks to satisfy the needs of streaming applications, where low latency, fault tolerance, and high throughput are highly desired. The architectural complexity and implementation challenges of these engines in real-world scenarios have caused confusion across the business community and made previous benchmarking outcomes inconsistent, as a minor change in low-level environmental properties leads to entirely different results. Few independent benchmarking studies go beyond performance measures to also present the conceptual and architectural distinctions. This review paper aims to do so for three major streaming engines, Apache Storm, Spark Streaming, and Flink, while critically evaluating the performance comparisons of previous benchmarking studies to help businesses make an informed decision on adopting these platforms.
Index Terms—Data Streaming, Apache Spark, Storm, Real-Time Big-Data Processing, Apache Flink

I. INTRODUCTION

Global digital transformation, driven by automation and broad applications of artificial intelligence across domains, has grown rapidly in the past decade. By the end of 2020, the world's digital devices are expected to generate 44 ZB of data. [1] Managing information produced at this unprecedented scale and from a variety of sources was a challenge for industry, but the distributed architecture of Hadoop and MapReduce tackled a significant share of the difficulties in managing and processing data-at-rest by using the power of parallel computing on commodity hardware. [2] However, that system has not been able to cope efficiently with the increasing demand of real-time big-data applications. IoT devices, online gaming, the automotive industry, sensor recording, smart cities, real-time threat detection, and financial fraud detection, to name just a few, require a large continuous stream of information to be processed while it is in motion. This called for a new architectural approach: the workload is time-sensitive, and data must be processed as it is produced while preserving state, high fault tolerance, and service certainty, as opposed to a batch architecture in which information is stored first and later processed periodically in large batches for further knowledge extraction. [3]

The open-source community and industry-driven research have brought to life numerous stream processing engines such as Apache Storm, Flink, Kafka Streams, Samza, and Spark Structured Streaming. Differences in latency, throughput, and in-memory processing architecture between the engines have left industry users unsure which is best suited to an individual implementation, owing to the lack of a cross-industry benchmarking methodology and studies.
[4] The majority of previous studies are use-case specific and do not simulate real-world application properties, resulting in inaccurate assessments. [5]

Apache Flink, Spark, and Storm are currently the most popular streaming platforms, owing to their fault-tolerant architectures and support for scalability in stream processing. [6] The three are based on different processing architectures: the Spark Streaming engine is based on the concept of micro-batching, while Flink and Storm are native streaming engines. The objective of this study is to show the conceptual differences between the open-source platforms Apache Flink, Storm, and Spark Streaming, and then to compare the existing benchmarking studies and assess them critically.

The remainder of this paper is organized as follows. Section II defines models for data stream processing. Section III details the characteristics and architectural distinctions of the three major platforms. Previous benchmarking studies are presented in Section IV. Finally, Section V concludes the review paper.

II. STREAM COMPUTATION MODELS

Services that handle unbounded data streams are largely categorized into two frameworks:
A. Native Streaming:

Native streaming models are designed around the needs of real-time applications. Fig. 1 shows a data stream obtained from producer sources over time, in which each record is processed individually on an ongoing basis. This architecture reduces latency because records wait less before entering the system. Apache Flink and Storm, with their directed-graph data flow, adhere to this architecture, which gives them lower latency than the micro-batching-oriented Apache Spark. [6] [7]

Fig. 1. Data Stream Processing Flow: Native Streaming

Because stream data is handled record by record rather than in batches, throughput is lower than with micro-batching. However, different studies show configurable back-end implementation approaches that handle this trade-off better to satisfy streaming application requirements. [3] [1]

B. Micro-Batching:

Fig. 2. Data Stream Processing Flow: Micro-batching

A micro-batching architecture takes a continuous input data stream from multiple sources, as shown in Fig. 2, and splits the stream into small batches or groups. Sets of those batches are then processed in parallel at short time intervals by the processing engine. Apache Spark Streaming follows this architecture, managing streams in small batches. All the sources and stream processing nodes together form a Directed Acyclic Graph (DAG). [6] At the core of Apache Spark, batches are processed following this model as Discretized Streams (D-Streams) made of Resilient Distributed Datasets (RDDs). [8] [9]

III. PLATFORM CHARACTERISTICS

Fault-Tolerance: Because the system is vulnerable to failures caused by network or software errors, fault tolerance is of primary importance in streaming applications. Spark Streaming applies fault-tolerance mechanisms to individual batches, whereas fault tolerance is expensive for native streaming systems such as Storm and Flink because it is enforced at the level of each record.
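
The difference between batch-level and record-level tracking can be made concrete with a minimal Python sketch. It is not taken from any of the engines discussed: it assumes a toy model in which every unit tracked for recovery (a single record in the native case, a whole batch in the micro-batching case) costs exactly one bookkeeping operation. The function names and the batch size of four are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator, List


def micro_batches(records: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group a record stream into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[int] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def tracking_ops_per_record(records: Iterable[int]) -> int:
    """Toy model of record-level tracking: one bookkeeping operation per record."""
    return sum(1 for _ in records)


def tracking_ops_per_batch(records: Iterable[int], batch_size: int) -> int:
    """Toy model of batch-level tracking: one bookkeeping operation per batch."""
    return sum(1 for _ in micro_batches(records, batch_size))


if __name__ == "__main__":
    stream = range(10)  # ten records standing in for an unbounded stream
    print(list(micro_batches(stream, 4)))     # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
    print(tracking_ops_per_record(stream))    # 10
    print(tracking_ops_per_batch(stream, 4))  # 3
```

Under this simplified assumption, ten records cost ten tracking operations when handled one at a time but only three when grouped into batches of four. Real engines use more elaborate recovery mechanisms, so the counts indicate the trend only.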
Spark provides an 'Exactly-Once' processing guarantee in case of failure by continuously replicating state to other worker nodes, so that after a failure the state can be retrieved from another node and processing can be restarted. [10] [6] Similarly, Flink provides an 'Exactly-Once' processing guarantee by keeping track of distributed snapshots and checkpoints for failure recovery. [9] [6] [2] Storm, however, does not provide state management; in case of application failure it restarts the entire process on a different node, giving an 'At-Least-Once' guarantee. [3] [6]

State-Management: Managing state requires a separate thread that continuously updates and preserves the existing state of records. This is not natively available in Storm, although it can be implemented with the help of ZooKeeper. [2] [4] State management in Spark Streaming is tied to RDDs and involves updating each batch even when the state has not changed, which makes it extremely inefficient compared to Flink. [10] Flink provides efficient support for state management, using a distributed file system to keep track of state with snapshots. [1]

Performance (Latency vs. Throughput): Latency is the time a record in the stream has to wait after it is produced before it is processed, and throughput is the number of records processed by the system per unit of time. Studies show that Spark Streaming's micro-batching model leads to higher latency and high throughput, whereas native streaming platforms such as Storm and Flink process records continuously, giving low latency. [9] [7] Some recent studies also focus on the network latency introduced by the growth of cloud-based infrastructure and propose using Edge and Fog computing to reduce it. [11] Further performance benchmarking studies are discussed in Section IV.

IV. TOOLS BENCHMARKING

Independent benchmarking of these services is crucial for businesses, as it lets decision makers decide on the basis of statistical evidence. Yahoo! has contributed significantly by providing benchmarking tools such as YCSB. Using YCSB, the researchers in [12] probed the breaking points of Flink, Spark, and Storm with a variety of cluster sizes, with Redis as the backend, Kafka as the messaging system, and ZooKeeper to provide delivery assurance of records. Similarly, the independent study [6] experiments on an architecture of one master and seven worker nodes, measuring these services under node failures. Both studies conclude that Spark is robust to node failures but lags behind Storm and Flink in latency. Both are limited, however, in the complexity of event arrival and do not produce the same workload as a production environment.

To better simulate a real-world environment, studies [7] and [5] present benchmarking results for the threat detection and advertisement industries using 20 and 30 nodes. One study [7] used Kafka as the messaging service, whereas the other [5] avoided any messaging service in order to eliminate network delay with a large volume of data. Both pointed towards similar results: Spark processes events faster even when data is skewed, whereas in applications where the data fluctuates Flink comes out ahead of Spark and Storm. Researchers are also exploring newer ways to perform benchmarking. The authors of [4] presented a distinctive approach for Flink, a one-way highway in which a highway-like environment is simulated for real-time streaming event processing, using a windowing design to better manage traffic and the out-of-order arrival of records. Although these studies offer guidance for specific use cases, there is still a need for better cross-industry benchmarking of streaming services. Table I highlights the key differences among these services.

TABLE I
OVERALL COMPARISON OF SPARK, FLINK & STORM

Characteristic    Storm              Flink              Spark
Assurance         At-Least-Once      Exactly-Once       Exactly-Once
Statefulness      No                 Yes                Yes
Flow of Data      DAG                CDG                DAG
Community         Selective          Growing            Wide
Streaming Type    Native-Streaming   Native-Streaming   Micro-Batches
API               Compositional      Compositional      Declarative
Scaling           Manual             Manual             Auto
Language          Java, Clojure      Java, Scala, Py    Scala, Java, Py
Data Carrier      Tuple              DataStream         DStream

V. CONCLUSION

Streaming services are expanding their application base across various industries. This paper presents the conceptual and architectural differences between Flink, Spark Streaming, and Storm. Many studies have sought to differentiate the engines in terms of performance, but there is no clear winner. The studies are use-case specific and run on default parameters. The majority conclude that Spark works best, with high throughput, when the incoming volume is huge and latency is not a priority; with small volumes, Storm achieves latency similar to Flink, but Flink does better than Storm on fault tolerance.
These studies are based on simulation and do not replicate a real-world environment; hence there is a further need for a better benchmarking approach that helps differentiate across use cases.

REFERENCES

[1] P. Carbone, S. Ewen, G. Fóra, S. Haridi, S. Richter, and K. Tzoumas, "State management in Apache Flink®," Proceedings of the VLDB Endowment, vol. 10, no. 12, pp. 1718–1729, 2017. [Online]. Available: http://dl.acm.org/citation.cfm?doid=3137765.3137777
[2] O. C. Marcu, A. Costan, G. Antoniu, and M. S. Pérez-Hernández, "Spark versus Flink: Understanding performance in big data analytics frameworks," Proceedings - IEEE International Conference on Cluster Computing, ICCC, pp. 433–442, 2016.
[3] M. Hussain Iqbal and T. Rahim Soomro, "Big Data Analysis: Apache Storm Perspective," International Journal of Computer Trends and Technology, vol. 19, no. 1, pp. 9–14, 2015.
[4] M. Hanif, H. Yoon, and C. Lee, "Benchmarking Tool for Modern Distributed Stream Processing Engines," 2019 International Conference on Information Networking (ICOIN), no. 2017, pp. 393–395, 2019.
[5] J. Karimov, T. Rabl, A. Katsifodimos, R. Samarev, H. Heiskanen, and V. Markl, "Benchmarking distributed stream data processing systems," Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018, pp. 1519–1530, 2018.
[6] M. A. Lopez, A. G. P. Lobato, and O. C. M. Duarte, "A performance comparison of open-source stream processing platforms," in 2016 IEEE Global Communications Conference, GLOBECOM 2016 - Proceedings, 2016.
[7] S. Chintapalli, D. Dagit, B. Evans, R. Farivar, T. Graves, M. Holderbaugh, Z. Liu, K. Nusbaum, K. Patil, B. J. Peng, and P. Poulosky, "Benchmarking streaming computation engines: Storm, Flink and Spark Streaming," Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016, pp. 1789–1792, 2016.
[8] F. Gurcan and M. Berigel, "Real-Time Processing of Big Data Streams: Lifecycle, Tools, Tasks, and Challenges," ISMSIT 2018 - 2nd International Symposium on Multidisciplinary Studies and Innovative Technologies, Proceedings, 2018.
[9] D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera, "A comparison on scalability for batch big data processing on Apache Spark and Apache Flink," Big Data Analytics, vol. 2, no. 1, pp. 1–11, 2017. [Online]. Available: http://dx.doi.org/10.1186/s41044-016-0020-2
[10] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, and J. Gonzalez, "Apache Spark: A Unified Engine for Big Data Processing," Communications of the ACM, vol. 59, no. 11, pp. 56–65, 2016.
[11] V. Cardellini, G. Mencagli, D. Talia, and M. Torquati, "New Landscapes of the Data Stream Processing in the era of Fog Computing," Future Generation Computer Systems, no. xxxx, 2019. [Online]. Available: https://doi.org/10.1016/j.future.2019.03.027
[12] Z. Karakaya, A. Yazici, and M. Alayyoub, "A Comparison of Stream Processing Frameworks," 2017 International Conference on Computer and Applications, ICCA 2017, pp. 1–12, 2017.