Comparison of Open-Source Data Stream Processing Engines: Spark Streaming, Flink and Storm

Darshankumar Vinubhai Gorasiya (x18134751)
School of Computing (Programming for Data Analytics)
National College of Ireland (NCI)
Dublin, Ireland
x18134751@student.ncirl.ie
Abstract—The steady rise in businesses' reliance on digital
technologies has brought exponential growth in the number of sources
producing continuous streams of data. The Hadoop stack, together with
the MapReduce framework, addressed many of the challenges of storing
and processing huge volumes of data, famously termed the Big Data
problem. However, the surge in applications ranging from IoT devices
to mission-critical monitoring, which generate unbounded continuous
streams of data that must be processed in real time or near real time
as they are produced, has led to further industrial and academic study
in the field. As a result, numerous stream processing platforms such
as Apache Storm, Flink, Spark Streaming, Kafka Streams, and Samza have
been built on top of existing distributed parallel computing
frameworks to satisfy the needs of streaming applications, where low
latency, tolerance to failures, and high throughput are highly
desired. The architectural complexity of these engines and the
challenges of implementing them in real-world scenarios have caused
confusion across the business community and have made previous
benchmarking outcomes inconsistent, as a minor change in low-level
environmental properties can lead to entirely different results. Few
independent benchmarking studies compare performance measures and also
present the conceptual and architectural distinctions alongside them.
This review paper aims to do so for three major streaming engines,
Apache Storm, Spark Streaming, and Flink, while critically evaluating
the performance comparisons of previous benchmarking studies, to help
businesses make an informed decision on the adoption of these
platforms.
Index Terms—Data Streaming, Apache Spark, Storm, Real-Time
Big-Data Processing, Apache Flink
I. INTRODUCTION
The global digital transformation, with automation and broad
applications of artificial intelligence across domains, has seen rapid
growth in the past decade. By the end of 2020, the world of digital
devices is expected to generate 44 ZB of data. [1] Managing
information produced on this unprecedented scale and from a variety of
sources was a challenge for industry, but the distributed architecture
of Hadoop and MapReduce tackled many of the difficulties in managing
and processing data at rest by using the power of parallel computing
on commodity hardware. [2] However, that system could not cope
efficiently with the growing demand from real-time big data
applications. IoT devices, online gaming, the automotive industry,
sensor recording, smart cities, real-time threat detection, and
financial fraud detection, to name just a few, require a large
continuous stream of information to be processed while it is in
motion. This called for a new architectural approach: the workload is
time-sensitive, and data must be processed as it is produced while
preserving state, high fault tolerance, and service guarantees, as
opposed to a batch architecture, where information is stored first and
later processed periodically in large batches for further knowledge
extraction. [3]
The open-source community and industry-driven research have brought to
life numerous stream processing engines such as Apache Storm, Flink,
Kafka Streams, Samza, and Spark Structured Streaming. Differences in
latency, throughput, and in-memory processing architecture among these
engines, combined with the lack of a cross-industry benchmarking
methodology and studies, have left industry users unsure which engine
is best suited to a given implementation. [4] The majority of previous
studies are use-case specific and do not simulate the properties of
real-world applications, which results in inaccurate assessments. [5]
Apache Flink, Spark, and Storm are currently the most popular
streaming platforms, owing to their fault-tolerant architecture and
support for scalability in stream processing. [6] The three are based
on different processing architectures: the Spark Streaming engine is
based on the concept of micro-batching, while Flink and Storm are
native streaming engines. The objective of this study is to show the
conceptual differences between the open-source platforms Apache Flink,
Storm, and Spark Streaming, and then to compare existing benchmarking
studies and assess them critically.
The remainder of this paper is organized as follows. Section
II defines models for data stream processing. Section III
illustrates the characteristics and architectural distinctions
of the three major platforms. Previous benchmarking studies are
presented in Section IV. Finally, Section V concludes the
review paper.
II. STREAM COMPUTATION MODELS
Services that handle unbounded data streams are broadly
categorized into two models: native streaming and micro-batching.

A. Native Streaming:
Native streaming models are designed around the needs of real-time
applications. Fig. 1 shows a data stream arriving from producer
sources over time, with each record processed individually on an
ongoing basis. This architecture decreases latency because a record
spends less time waiting before it enters the system. Apache Flink
and Storm, with their directed-graph data flow, adhere to this
architecture, which results in lower latency than the micro-batching
model of Apache Spark. [6] [7]
Fig. 1. Data Stream Processing Flow: Native Streaming
Because stream records are handled individually rather than in
batches, throughput is lower than with micro-batching. However,
different studies show configurable back-end implementation
approaches that better handle this trade-off to satisfy the
requirements of streaming applications. [3] [1]
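The record-at-a-time behaviour described above can be illustrated with a minimal Python sketch. It is a toy model and not the API of Flink or Storm: the record shape `(arrival_ms, value)`, the function name `process_record_at_a_time`, and the fixed `service_time_ms` are assumptions made for illustration, and events are assumed to be sorted by arrival time.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical record shape: (arrival_time_ms, value).
Event = Tuple[int, int]


def process_record_at_a_time(
    events: Iterable[Event],
    operator: Callable[[int], int],
    service_time_ms: int = 2,
) -> List[Tuple[int, int]]:
    """Process each record as soon as it arrives and the operator is free.

    Returns (latency_ms, result) pairs, where latency is the time from a
    record's arrival to the end of its processing. Events are assumed to
    be sorted by arrival time.
    """
    outputs = []
    operator_free_at = 0  # time at which the single operator becomes idle
    for arrival_ms, value in events:
        # A record starts immediately unless the operator is still busy.
        start_ms = max(arrival_ms, operator_free_at)
        finish_ms = start_ms + service_time_ms
        operator_free_at = finish_ms
        outputs.append((finish_ms - arrival_ms, operator(value)))
    return outputs


events = [(0, 1), (400, 2), (401, 3)]
for latency_ms, result in process_record_at_a_time(events, lambda v: v * 10):
    print(f"result={result} latency={latency_ms} ms")
```

In this model a record's latency is its service time plus any queueing behind earlier records; it never waits for a batch boundary.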
B. Micro-Batching:
Fig. 2. Data Stream Processing Flow: Micro-batching
A micro-batching architecture takes a continuous input data stream
from multiple sources, as shown in Fig. 2, and splits the stream into
small batches or groups. Sets of those batches are then processed in
parallel at short time intervals by the processing engine. Apache
Spark Streaming follows this architecture for managing streams in
small batches. All the sources and stream processing nodes together
form a Directed Acyclic Graph (DAG). [6] At the core of Apache Spark,
batches are processed following this model as Discretized Streams
(D-Streams) made of Resilient Distributed Datasets (RDDs). [8] [9]
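As a rough illustration of how a stream is split into small batches, the following Python sketch groups records by a fixed batch interval and computes how long each record waits for its batch to close. It does not use Spark's API; the record shape `(arrival_ms, value)` and the names `split_into_micro_batches` and `waiting_times_ms` are hypothetical, and processing time after the batch closes is ignored.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (arrival_time_ms, value).
Event = Tuple[int, int]


def split_into_micro_batches(
    events: Iterable[Event], batch_interval_ms: int
) -> Dict[int, List[Event]]:
    """Group records into consecutive fixed-length time intervals."""
    batches: Dict[int, List[Event]] = defaultdict(list)
    for arrival_ms, value in events:
        # Integer division maps an arrival time to its batch index.
        batches[arrival_ms // batch_interval_ms].append((arrival_ms, value))
    return dict(batches)


def waiting_times_ms(
    batches: Dict[int, List[Event]], batch_interval_ms: int
) -> List[int]:
    """Time each record waits until its batch closes and can be processed."""
    waits: List[int] = []
    for batch_index in sorted(batches):
        batch_close_ms = (batch_index + 1) * batch_interval_ms
        waits.extend(
            batch_close_ms - arrival_ms for arrival_ms, _ in batches[batch_index]
        )
    return waits


events = [(0, 1), (400, 2), (900, 3), (1200, 4)]
batches = split_into_micro_batches(events, batch_interval_ms=1000)
print(batches)
print(waiting_times_ms(batches, batch_interval_ms=1000))
```

Records that arrive early in an interval wait longest, which is the source of the added latency in this model, while processing a whole batch at once amortizes per-record overhead.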
III. PLATFORM CHARACTERISTICS
Fault-Tolerance: Because the system is vulnerable to failures caused
by network or software errors, fault tolerance is of primary
importance in streaming applications. Spark Streaming applies its
fault-tolerance mechanisms to individual batches, whereas fault
tolerance is expensive for native streaming systems such as Storm and
Flink because it is enforced at the level of each record. Spark
provides an ‘Exactly-Once’ processing guarantee in case of failures by
continuously replicating state to other worker nodes, so that after a
failure the state can be recovered from another node and processing
can be restarted. [10] [6] Similarly, Flink provides an ‘Exactly-Once’
processing guarantee by keeping track of distributed snapshots and
checkpoints for failure recovery. [9] [6] [2] Storm, however, does not
provide state management; in case of application failure, it restarts
the entire process on a different node, giving an ‘At-Least-Once’
guarantee. [3] [6]
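The difference between the two guarantees can be sketched with a toy replay model in Python. This is a simplified illustration under stated assumptions, not the recovery mechanism of any of the three engines: the function `run_with_replay`, its parameters, and the single injected crash are all hypothetical. The operator sums a list of records; after the crash the source rewinds to the last checkpointed offset, and the `restore_state` flag decides whether the operator state is rolled back together with that offset.

```python
from typing import List


def run_with_replay(
    records: List[int],
    checkpoint_every: int,
    fail_at: int,
    restore_state: bool,
) -> int:
    """Sum `records` with one simulated crash just before index `fail_at`.

    restore_state=False: only the source offset is rewound, so replayed
    records affect the sum again (at-least-once effect).
    restore_state=True: offset and state are restored from the same
    snapshot, so each record affects the sum once (exactly-once effect).
    """
    total = 0
    offset = 0
    checkpoint = (0, 0)  # (offset, total) captured at the last snapshot
    crashed = False
    while offset < len(records):
        if offset == fail_at and not crashed:
            crashed = True
            saved_offset, saved_total = checkpoint
            offset = saved_offset  # the source replays from the snapshot
            if restore_state:
                total = saved_total  # state is rolled back consistently
            continue
        total += records[offset]
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, total)
    return total


records = [1, 2, 3, 4, 5]
print(run_with_replay(records, checkpoint_every=2, fail_at=3, restore_state=False))
print(run_with_replay(records, checkpoint_every=2, fail_at=3, restore_state=True))
```

With these inputs the first call counts the record at index 2 twice, while the second call returns the true sum, because state and offset are restored together.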
State-Management: Managing state requires a separate thread that
continuously updates and preserves the existing state of records. It
is not natively available in Storm; however, it can be implemented
with the help of ZooKeeper. [2] [4] State management in Spark
Streaming is associated with RDDs and involves updating the state for
each batch even when the state has not changed, which makes it
extremely inefficient compared to Flink. [10] Flink provides efficient
support for integrating state management, using a distributed file
system to keep track of state with snapshots. [1]
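The cost difference between rewriting state for every batch and updating only the keys that change can be sketched in Python as follows. This is an illustrative model only and does not reproduce the internals of Spark or Flink; the functions `update_by_full_rewrite` and `update_incrementally`, and the use of a per-key counter as the state, are assumptions.

```python
from typing import Dict, List, Tuple

State = Dict[str, int]  # hypothetical keyed state: key -> running count


def update_by_full_rewrite(state: State, batch: List[str]) -> Tuple[State, int]:
    """Build a fresh state object per batch, copying every existing key."""
    new_state = dict(state)  # copies all keys, changed or not
    keys_written = len(state)
    for key in batch:
        new_state[key] = new_state.get(key, 0) + 1
        keys_written += 1
    return new_state, keys_written


def update_incrementally(state: State, batch: List[str]) -> Tuple[State, int]:
    """Update state in place, writing only the keys seen in the batch."""
    keys_written = 0
    for key in batch:
        state[key] = state.get(key, 0) + 1
        keys_written += 1
    return state, keys_written


state = {"a": 1, "b": 2, "c": 3}
print(update_by_full_rewrite(state, ["a"]))
print(update_incrementally(dict(state), ["a"]))
```

Both functions produce the same resulting state, but the full rewrite performs one write per existing key plus one per incoming record, so its cost grows with the size of the state rather than with the size of the batch.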
Performance (Latency vs. Throughput): Latency is the time a record in
the stream has to wait after it is produced, and throughput is the
number of records processed by the system per unit of time. Studies
show that the micro-batching model of Spark Streaming leads to higher
latency and high throughput, whereas native streaming platforms such
as Storm and Flink process records continuously, giving low latency.
[9] [7] Some recent studies also focus on the network latency that
comes with the growth of cloud-based infrastructure and propose the
use of Edge and Fog computing to reduce it. [11] Further performance
benchmarking studies are discussed in Section IV.
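The two metrics defined above can be computed from per-record timestamps, as in the following Python sketch. The record shape `(produced_ms, processed_ms)` and the function names are hypothetical, and the sample values are made up purely to exercise the formulas.

```python
from typing import List, Tuple

# Hypothetical measurement: (produced_ms, processed_ms) for each record.
Record = Tuple[int, int]


def mean_latency_ms(records: List[Record]) -> float:
    """Average time between a record being produced and being processed."""
    return sum(done - produced for produced, done in records) / len(records)


def throughput_per_second(records: List[Record]) -> float:
    """Records processed per second over the observed time span."""
    first_produced = min(produced for produced, _ in records)
    last_processed = max(done for _, done in records)
    span_seconds = (last_processed - first_produced) / 1000
    return len(records) / span_seconds


records = [(0, 40), (100, 160), (200, 250)]
print(mean_latency_ms(records))        # 50.0
print(throughput_per_second(records))  # 12.0
```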
IV. TOOLS BENCHMARKING
Independent benchmarking of these services is crucial for businesses,
as it helps decision makers decide on the basis of statistical
evidence. Yahoo! has contributed substantially by providing
benchmarking tools such as YCSB. Using YCSB, the researchers in [12]
explored the breaking points of Flink, Spark, and Storm with a variety
of node sizes, with Redis as the backend, Kafka as the messaging
system, and ZooKeeper to provide delivery assurance of records.
Similarly, the independent study [6] experiments on an architecture of
one master and seven worker nodes, measuring these services under node
failures. Both studies conclude that Spark is robust to node failures
but lags in latency compared to Storm and Flink. However, both studies
are limited in terms of complex event arrival and do not produce the
same amount of workload as a production environment.
To better simulate a real-world environment, studies [7] and [5]
present benchmarking results for the threat detection and
advertisement industries using 20 and 30 nodes. One study [7] used
Kafka as the messaging service, whereas the other [5] avoided using
any messaging service in order to eliminate network delay with a large
volume of data. Both pointed towards similar results: Spark processes
events faster even when the data is skewed, whereas in applications
where the data fluctuates, Flink comes out ahead of Spark and Storm.
Researchers are also exploring newer ways to perform benchmarking. [4]
presented a unique approach for Flink, a one-way highway model, in
which a highway-like environment is simulated for real-time streaming
event processing, using a windowing design to better manage traffic
and the out-of-order arrival of records. Although these studies offer
guidance for specific use cases, there is still a need for better
cross-industry benchmarking of streaming services. Table I highlights
the key differences among these services.
TABLE I
OVERALL COMPARISON OF SPARK, FLINK & STORM

                  Streaming Services
Characteristic    Storm              Flink                 Spark
Assurance         At-Least-Once      Exactly-Once          Exactly-Once
Statefulness      No                 Yes                   Yes
Flow of Data      DAG                CDG                   DAG
Community         Selective          Growing               Wide
Streaming Type    Native Streaming   Native Streaming      Micro-Batches
API               Compositional      Compositional         Declarative
Scaling           Manual             Manual                Auto
Language          Java, Clojure      Java, Scala, Python   Scala, Java, Python
Data Carrier      Tuple              DataStream            DStream
V. CONCLUSION
Streaming services are expanding their application base across
various industries. This paper presents the conceptual and
architectural differences between Flink, Spark Streaming, and Storm.
Many studies have been proposed to help differentiate the performance
of each; however, there is no clear winner. The studies are use-case
specific and rely on default parameters. The majority conclude that
Spark works best, with high throughput, when the incoming volume is
huge and latency is not a priority. With small volumes, Storm also
performs well in terms of latency, similar to Flink, but Flink does
better than Storm in fault tolerance. These studies are based on
simulation and do not replicate a real-world environment; hence, there
is a further need for a better benchmarking approach that helps
differentiate across use cases.
REFERENCES
[1] P. Carbone, S. Ewen, G. Fóra, S. Haridi, S. Richter, and K. Tzoumas,
“State management in Apache Flink®,” Proceedings of the VLDB
Endowment, vol. 10, no. 12, pp. 1718–1729, 2017. [Online]. Available:
http://dl.acm.org/citation.cfm?doid=3137765.3137777
[2] O. C. Marcu, A. Costan, G. Antoniu, and M. S. Pérez-Hernández,
“Spark versus flink: Understanding performance in big data analytics
frameworks,” Proceedings - IEEE International Conference on Cluster
Computing, ICCC, pp. 433–442, 2016.
[3] M. Hussain Iqbal and T. Rahim Soomro, “Big Data Analysis: Apache
Storm Perspective,” International Journal of Computer Trends and
Technology, vol. 19, no. 1, pp. 9–14, 2015.
[4] M. Hanif, H. Yoon, and C. Lee, “Benchmarking Tool for Modern
Distributed Stream Processing Engines,” 2019 International Conference
on Information Networking (ICOIN), no. 2017, pp. 393–395, 2019.
[5] J. Karimov, T. Rabl, A. Katsifodimos, R. Samarev, H. Heiskanen, and
V. Markl, “Benchmarking distributed stream data processing systems,”
Proceedings - IEEE 34th International Conference on Data Engineering,
ICDE 2018, pp. 1519–1530, 2018.
[6] M. A. Lopez, A. G. P. Lobato, and O. C. M. Duarte, “A performance
comparison of open-source stream processing platforms,” in 2016 IEEE
Global Communications Conference, GLOBECOM 2016 - Proceedings,
2016.
[7] S. Chintapalli, D. Dagit, B. Evans, R. Farivar, T. Graves,
M. Holderbaugh, Z. Liu, K. Nusbaum, K. Patil, B. J. Peng, and P. Poulosky,
“Benchmarking streaming computation engines: Storm, flink and spark
streaming,” Proceedings - 2016 IEEE 30th International Parallel and
Distributed Processing Symposium, IPDPS 2016, pp. 1789–1792, 2016.
[8] F. Gurcan and M. Berigel, “Real-Time Processing of Big Data Streams:
Lifecycle, Tools, Tasks, and Challenges,” ISMSIT 2018 - 2nd Interna-
tional Symposium on Multidisciplinary Studies and Innovative Technolo-
gies, Proceedings, 2018.
[9] D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera, “A
comparison on scalability for batch big data processing on Apache
Spark and Apache Flink,” Big Data Analytics, vol. 2, no. 1, pp. 1–11,
2017. [Online]. Available: http://dx.doi.org/10.1186/s41044-016-0020-2
[10] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave,
X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi,
and J. Gonzalez, “Apache Spark: A Unified Engine for Big Data
Processing,” Communications of the ACM, vol. 59, no. 11, pp. 56–65,
2016.
[11] V. Cardellini, G. Mencagli, D. Talia, and M. Torquati, “New Landscapes
of the Data Stream Processing in the era of Fog Computing,” Future
Generation Computer Systems, no. xxxx, 2019. [Online]. Available:
https://doi.org/10.1016/j.future.2019.03.027
[12] Z. Karakaya, A. Yazici, and M. Alayyoub, “A Comparison of Stream
Processing Frameworks,” 2017 International Conference on Computer
and Applications, ICCA 2017, pp. 1–12, 2017.
