Real-time Text Stream Processing: A Dynamic and Distributed
NLP Pipeline
Mohammad Arshi Saloot
MIMOS Berhad
Kuala Lumpur, Malaysia
+60187007981
arshi.saloot@yahoo.com
Duc Nghia Pham
MIMOS Berhad
Kuala Lumpur, Malaysia
+60389955000
nghia.pham@mimos.my
ABSTRACT
In recent years, the need for flexible and instant Natural Language
Processing (NLP) pipelines has become increasingly crucial. The existence
of real-time data sources, such as Twitter, necessitates using real-
time text analysis platforms. In addition, due to the existence of a
wide range of NLP toolkits and libraries in a variety of
programming languages, a streaming platform is required to
combine and integrate different modules of various NLP toolkits.
This study proposes a real-time architecture that uses Apache
Storm and Apache Kafka to apply different NLP tasks on streams
of textual data. The architecture allows developers to inject NLP
modules into it using different programming languages. To compare
the performance of the architecture, a series of experiments are
conducted to handle OpenNLP, Fasttext, and SpaCy modules for
Bahasa Malaysia and English. The results show that Apache Storm
achieved the lowest latency, compared with the Trident and baseline
experiments.
CCS Concepts
Computer systems organization ~ Real-time systems ~ Real-
time system architecture
Keywords
Real-time, Natural Language Processing, Streaming, Pipeline,
Kafka, Storm
1. INTRODUCTION
NLP toolkits often offer the following NLP components:
tokenization, part-of-speech (PoS) tagging, chunking, named
entity recognition (NER), and sentiment analysis. Currently, there
is a wide range of NLP tools and libraries in different programming
languages, and there is an ongoing competition between them in terms of
accuracy and performance. For example, in 2017, an experiment compared
four state-of-the-art NLP libraries, namely Google’s SyntaxNet, the
Stanford CoreNLP suite, the NLTK Python library, and spaCy, on publicly
available software artifacts [1]. The authors found that NLTK achieved
the highest tokenization accuracy among these toolkits, while the
accuracy of its PoS tagger was the lowest [1]. Therefore, because of
this diversity of software artifacts in the NLP field, linking and
merging NLP modules that use different techniques into a single NLP
pipeline is an important concern for NLP engineers and researchers [2].
A sheer volume of textual data is generated daily in many
domains, such as medicine, sports, legal, and education. For
example, a law institute generates a large amount of research
notes, legal transaction documents, emails, reference books, etc.
Thus, NLP becomes an essential factor in getting the best results out
of descriptive or predictive analysis. As a result, AI and NLP
are vital tools for legal practice, contributing to the growth of
technologies that assist lawyers or “think like a lawyer”.
Therefore, traditional data processing techniques are being substituted
with big data analytics approaches to solve real-life problems
[3].
Big-data processing (i.e. batch processing) focuses on batch, a
posteriori processing. Recently, with the explosion of sensors and
applications needing immediate action, interest has shifted
towards Fast-data (i.e. stream processing), which focuses on real-time
processing. Batch big-data processing techniques encounter
many challenges when it comes to analyzing real-time streams of
data. Data streaming is useful for data sources that
send data in small sizes (often in kilobytes) in a continuous flow
as the data is generated. This may include a wide variety of data
sources such as telemetry, log files, e-commerce transactions,
social network data, or geospatial services. Thus, many real-time
generated data sources, such as Tweets, need a real-time data
analysis pipeline. The output of real-time pipelines should be
generated with low latency, and any incoming data must be
processed within seconds or milliseconds [4]. Therefore, research
efforts should be directed towards developing scalable
frameworks and algorithms that accommodate the data stream
computing mode, effective resource allocation strategies, and
parallelization issues to cope with the ever-growing size and
complexity of data [4]. The objective of this work is to examine
different frameworks in order to propose a platform that achieves the
following aims:
• To encapsulate NLP modules: adding and removing
multilingual NLP modules to the pipeline without disturbing the
architecture of the system.
• To compute distributedly: scalable by adding or
removing parallel processes as well as worker nodes.
• To process real-time streams: real-time processing and
analysis of incoming streams of textual data with different lengths
and frequencies.
• To have a configurable topology: manipulate the
processing topology (i.e. workflow or network of implemented
NLP modules) at runtime without interrupting the running
services.
• To allow user interaction with the system: Provide
RESTful APIs to end users.
Section 2 reflects on state-of-the-art NLP systems and libraries
as well as distributed stream processing platforms. Section 3
describes the proposed architecture, Section 4 presents the
experimental results, and Section 5 summarizes the paper and
suggests future research directions.
2. LITERATURE REVIEW
2.1 NLP Toolkits
In 2014, a short survey of NLP toolkits [5] recognized NLTK [6]
as the most well-known and comprehensive NLP toolkit. NLTK is
written in Python and provides essential NLP modules, including
tokenization, sentence splitting, statistical analysis of corpora,
classification, and clustering. Although NLTK does not provide
any neural network tool, it can be combined with Gensim [7] to
provide word embeddings.
Apache OpenNLP [8] offers pre-trained models for the most
common NLP modules, such as tokenization, sentence
segmentation, PoS tagging, NER, chunking, parsing, and co-
reference resolution, for a variety of languages. OpenNLP is an
Apache licensed cross-platform Java library, which uses machine
learning methods.
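As a minimal illustration of how such pre-trained models are typically consumed, the Java sketch below loads OpenNLP sentence, tokenizer, and PoS models and chains them over an input text. The model file names (en-sent.bin, en-token.bin, en-pos-maxent.bin) are assumptions standing in for whatever models are available locally; this is not the pipeline implementation described later in the paper.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class OpenNlpExample {
    public static void main(String[] args) throws Exception {
        // Pre-trained model files; the names below are assumptions for locally downloaded models.
        try (InputStream sentIn = new FileInputStream("en-sent.bin");
             InputStream tokIn = new FileInputStream("en-token.bin");
             InputStream posIn = new FileInputStream("en-pos-maxent.bin")) {

            SentenceDetectorME sentenceDetector = new SentenceDetectorME(new SentenceModel(sentIn));
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));
            POSTaggerME posTagger = new POSTaggerME(new POSModel(posIn));

            String text = "Apache OpenNLP ships pre-trained models. They cover several languages.";
            for (String sentence : sentenceDetector.sentDetect(text)) {
                String[] tokens = tokenizer.tokenize(sentence);
                String[] tags = posTagger.tag(tokens);
                for (int i = 0; i < tokens.length; i++) {
                    System.out.println(tokens[i] + "/" + tags[i]);
                }
            }
        }
    }
}
```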
The Stanford NLP Toolkit [9], written in the Java programming
language, provides tokenization, sentence splitting, PoS tagging,
NER, parsing, sentiment analysis, temporal expression tagging, and
word embeddings. OpenNLP and Stanford NLP use Maximum Entropy
models for their PoS taggers. Stanford NLP uses different
approaches for different tasks; for instance, Conditional Random
Fields (CRF) are used for its NER module.
Fasttext [10], [11] and spaCy [12] are a new generation of
NLP libraries that emphasize neural networks. SpaCy is one of
the most advanced multi-language NLP libraries and is written in
Python and Cython [12]. SpaCy developers have put effort into the
speed of their library to provide a suitable NLP solution for
industrial and commercial applications. There are several studies
comparing different NLP libraries using different domains
and datasets [13]. For example, a study in 2017 [1] found that
spaCy achieved the most promising combined accuracy of PoS
tagging and tokenization, compared to Google’s SyntaxNet, Stanford
NLP, and the NLTK Python library, when tested on Stack Overflow
data. In addition, spaCy supports neural network modeling while
OpenNLP lacks deep learning support.
In 2016, Facebook Research released Fasttext as an open-
source NLP library. Similar to spaCy, efficiency is vital in
Fasttext. Although Fasttext is written in C++, wrappers are
available for other languages such as Python and Java.
Fasttext provides pre-trained word embedding models for 294
languages. Fasttext is considered a generic tool in the NLP
field because it does not provide any specific NLP module, such
as NER or sentiment analysis. Instead, it provides a text
classification library that can be used as the engine for many NLP tasks.
Finally, UIMA is the most reliable and well-known
framework for combining different tasks from different libraries into a
single text annotation pipeline [14]. Although UIMA itself does
not provide any NLP module, it offers flexible pipelines that can
be configured by writing an XML description or using a GUI tool.
Instead of using a specific annotation format, the annotation
formats in UIMA are interoperable through XML Metadata
Interchange (XMI), an interchange standard. To pass the
right types of input to the next component, UIMA validates the
output formats of components based on predefined Types.
Consequently, many frameworks, such as Text Imager [15], have
been developed based on UIMA.
2.2 Real-time Stream Processing Frameworks
This study compares the main features of the six most popular
stream processing frameworks, namely Spark Streams [16], Flink
[17], Akka Streams [18], Kafka Streams [19], Samza [20], and
Apache Storm [21]. Spark Streams is a library in the Spark
framework for processing continuously flowing stream data,
powered by Spark RDDs. Flink provides stream processing for
large-volume data and also lets you handle batch analytics with
one technology. Akka Streams is an implementation of the
Reactive Streams specification, built on top of Akka Actors to do
asynchronous, non-blocking stream processing. Apache Samza is
another open-source near-realtime, asynchronous computational
framework for stream processing, developed in Scala and Java.
Apache Storm accepts large volumes of data arriving extremely fast,
possibly from various sources, analyzes it, and publishes real-time
updates to other places, without storing any actual data.
Apache offers two different streaming frameworks with similar
names: 1) Kafka Streams is a library for writing complex logic for
stream processing jobs; 2) Apache Kafka (referred to here as Kafka
Topics) is a distributed streaming platform [22]. The Kafka Streams
API is used to develop stream applications, which may consume
from Kafka Topics and produce back into Kafka Topics. Results
from any of these tools are usually written back to new Kafka
topics for downstream consumption, as shown in Figure 1.
Figure 1. Kafka Topics
2.2.1 Kafka Topics
Apache Kafka is a distributed log, which stores
messages sequentially. In Kafka terminology, consumers
consume/read data from topics, and producers produce/write
data into topics. As shown in Figure 2, a Kafka cluster typically
consists of multiple brokers to maintain the load balance. Kafka
broker election is handled by ZooKeeper [23]. Kafka
provides high scalability and resiliency, so it is an excellent
integration tool between data producers and consumers. As
depicted in Figure 3, peer-to-peer spaghetti integration quickly
becomes unmanageable as the number of services grows.
Therefore, Kafka Topics provide a single backbone that is used
by all services. Although Kafka is not fundamentally a queue, it can be
utilized as a FIFO queue. Producers always write to the end of the
log, while consumers can read from any log offset they want, from
the beginning or end of the queue [22].
Figure 2. Kafka Architecture
Figure 3. Peer-to-peer Architecture
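To make the producer/consumer terminology concrete, the following minimal Java sketch writes one message to a topic and reads it back from the committed offset; the broker address, topic name, and consumer group id are illustrative assumptions rather than values used in this study.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaTopicExample {
    public static void main(String[] args) {
        // Producer: appends a message to the end of the "raw-text" topic log.
        Properties prodProps = new Properties();
        prodProps.put("bootstrap.servers", "localhost:9092"); // broker address (assumption)
        prodProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        prodProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(prodProps)) {
            producer.send(new ProducerRecord<>("raw-text", "tweet-1", "Sample incoming text"));
        }

        // Consumer: reads from its committed offset onwards.
        Properties consProps = new Properties();
        consProps.put("bootstrap.servers", "localhost:9092");
        consProps.put("group.id", "nlp-pipeline");
        consProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consProps)) {
            consumer.subscribe(Collections.singletonList("raw-text"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n", record.offset(), record.key(), record.value());
            }
        }
    }
}
```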
2.2.2 Distributed Processing Comparison
There are three categories of message delivery reliability:
• At-most-once delivery: for each input message, that
message is delivered zero or one time; in other words, a message
may be lost.
• At-least-once delivery: for each input message,
potentially multiple attempts are made at delivering it; in other
words, a message may be duplicated but not lost.
• Exactly-once delivery: for each input message, one
delivery is made to the recipient; in other words, the message can
neither be lost nor duplicated.
Kafka Topics, Kafka Streams, Spark Streams, Apache Storm, and
Flink support exactly-once and at-least-once delivery semantics.
However, Akka Streams and Samza are unable to guarantee
exactly-once delivery, as shown in Table 1.
Spark Streams and Flink are similar and often compared with
each other because they run as distributed services that execute any
submitted jobs. They provide similar, very rich analytics based on
Apache Beam. Apache Beam [24] is an advanced unified
programming model for implementing batch and streaming data
processing jobs that run on any Beam runner.
Spark Streams and Flink manage the issues of process
scheduling: after jobs are submitted, they
handle scalability, failover, load balancing, etc. Another
advantage of Spark Streams and Flink is that they have a big
community and ongoing updates and improvements because they are
widely adopted by big companies at scale. A drawback of Spark
Streams and Flink is their restricted programming model: jobs
must be written using their APIs and conform to their
programming model. Furthermore, integration with other services
usually requires running the engines separately from the
microservices and exchanging data through Kafka topics or other
means. This adds some latency and more running applications at
the system level. In addition, the overhead of these systems makes
them less ideal for smaller data streams. In Spark Streams, data is
captured in fixed time intervals and then processed as a “mini batch.”
The drawback is longer latency (100 milliseconds or more per interval).
Akka Streams, Kafka Streams, Samza, and Storm are similar
because they run as libraries that can be embedded in
microservices, providing greater flexibility in how to integrate
analytics with other processes. Akka Streams is very flexible in
terms of deployment and configuration options compared to
Spark and Flink, and it offers many interoperation capabilities.
When Akka Streams uses Kafka Topics
to exchange data, consumer lag (i.e., queue depth) should be watched
carefully, as it is a source of latency. Figure 4 shows the
spectrum of microservices. Microservices are not always record
oriented; it is a spectrum because we might take some events and
also route them through a data pipeline. Compared to Kafka
Streams, Akka Streams is more generic microservice oriented
and less data-analytics oriented. Although both Akka and Kafka
Streams can cover most of the spectrum, Akka emerged in the
world of building Reactive microservices, and Kafka Streams is
effectively a dataflow API.
Figure 4. Microservices spectrum
In Kafka Streams, there must always be a persistent buffer
between Stream applications, as shown in Figure 5. Another
disadvantage of Kafka Streams is that all the nodes (processors) in
one topology must be written in one programming language.
Figure 5. Kafka Streams
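For illustration, the minimal Kafka Streams sketch below (in Java) consumes from one Topic, applies a placeholder processing step, and produces into the next Topic, which then acts as the persistent buffer before the next Streams application. The application id, topic names, and broker address are assumptions, and the mapValues step merely stands in for a real NLP operation.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class TextProcessingStreamApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "text-processing-0"); // application id (assumption)
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // broker address (assumption)
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Consume from one Topic, apply a placeholder step, and produce into the next Topic,
        // which then serves as the persistent buffer before the next Streams application.
        KStream<String, String> input = builder.stream("text-topic-0"); // topic names are illustrative
        input.mapValues(text -> text.trim().toLowerCase())
             .to("text-topic-1");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```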
As displayed in Table 1, Samza currently only provides an at-least-
once delivery guarantee; exactly-once is going to be added in its
future releases. On top of that, Samza currently only supports
Java and Scala for its high- and low-level APIs. All in all,
Apache Storm is one of the best distributed stream processing
frameworks because: 1) it is not restricted to any specific type of
programming model or data structure, 2) it supports at-least-once and
exactly-once message delivery semantics, 3) it provides real-time
message processing with low latency, 4) it can easily be integrated with
other external platforms such as Apache Kafka, and 5) it supports
many programming languages including Java, Scala, Ruby,
Python, JavaScript, and Perl.
Table 1. Platform Comparison

Spark Streams
• Programming model: Apache Beam
• Semantic guarantees: at-least-once, exactly-once
• Latency: High
• Real-time processing: mini batches
• Integration with other services: requires extra engines
• Supported languages: JVM, Python, R, SparkSQL

Flink
• Programming model: Apache Beam
• Semantic guarantees: at-least-once, exactly-once
• Latency: Medium
• Real-time processing: real-time
• Integration with other services: requires extra engines
• Supported languages: JVM

Akka Streams
• Programming model: high- and low-level APIs
• Semantic guarantees: at-most-once, at-least-once
• Latency: Low
• Real-time processing: real-time
• Integration with other services: integratable
• Supported languages: JVM

Kafka Streams
• Programming model: high- and low-level APIs
• Semantic guarantees: at-most-once, at-least-once, exactly-once
• Latency: Low
• Real-time processing: real-time
• Integration with other services: best integration platform
• Supported languages: JVM, Python, KSQL

Samza
• Programming model: high- and low-level APIs + Apache Beam
• Semantic guarantees: at-least-once
• Latency: Low
• Real-time processing: real-time
• Integration with other services: integratable
• Supported languages: JVM, SamzaSQL

Apache Storm
• Programming model: model free
• Semantic guarantees: at-least-once, exactly-once
• Latency: Low
• Real-time processing: real-time
• Integration with other services: integratable
• Supported languages: JVM, Ruby, Python, JavaScript, Perl
2.2.3 Apache Storm
An arrangement of Spouts and Bolts is called a topology. A
Spout is a source of data in a topology that fetches data from an
external source and emits it into the Bolts. A Bolt performs the
actual data processing [21]. At the core of Apache
Storm is a Thrift definition [25] for defining and
submitting topologies. As shown in Figure 6, since Thrift can be
utilized in any language, topologies can be defined and submitted
from any language. Apache Storm is designed to be usable with
any programming language: Spouts and Bolts can be defined in
any language, and non-JVM Spouts and Bolts communicate with
Apache Storm over stdin/stdout.
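As a minimal illustration of how a topology is defined and submitted, the Java sketch below wires a toy Spout into a toy Bolt. Both classes are simple stand-ins (in a real pipeline the Spout would be a KafkaSpout and the Bolt would call an NLP library), and submission assumes a running Storm cluster; this is not the topology used in this study.

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class NlpTopologySketch {

    // Toy Spout standing in for a KafkaSpout that would read the input Topic.
    public static class FixedSentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] sentences = {"storm processes streams", "bolts do the actual work"};
        private int index = 0;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            collector.emit(new Values(sentences[index]));
            index = (index + 1) % sentences.length;
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("text"));
        }
    }

    // Toy Bolt standing in for an NLP step such as tokenization.
    public static class TokenCountBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String text = input.getStringByField("text");
            collector.emit(new Values(text, text.split("\\s+").length));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("text", "tokens"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("text-spout", new FixedSentenceSpout(), 1);
        builder.setBolt("token-count", new TokenCountBolt(), 2).shuffleGrouping("text-spout");

        Config conf = new Config();
        conf.setNumWorkers(2); // two worker processes across the cluster
        StormSubmitter.submitTopology("nlp-sketch", conf, builder.createTopology());
    }
}
```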
Figure 6. Apache Storm
To parallelize the processing, each Spout or Bolt is executed as
many Tasks across a Storm cluster. Executors are the processing
threads in a Storm worker node that run one or more Tasks of the
same Spout/Bolt. Figure 7 displays a topology deployed on
two worker machines; it contains four threads (Executors),
where each thread consists of two Tasks.
The number of Executors is always less than or equal to the
number of Tasks. The number of Executors can be changed
without downtime, while the number of Tasks is fixed. When the
number of Tasks is greater than the number of Executors, the Tasks
inside an Executor run serially. For example, only four Tasks can be
active concurrently in Figure 7.
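Continuing the topology sketch above, the fragment below sets the parallelism hint (the number of Executors) and the fixed number of Tasks for a Bolt, and notes how the Executor count can later be changed without downtime. Component and topology names follow the earlier sketch and are illustrative.

```java
// Continuing NlpTopologySketch: two Executors (threads) running four Tasks of the same Bolt.
builder.setBolt("token-count", new TokenCountBolt(), 2) // parallelism hint = initial number of Executors
       .setNumTasks(4)                                  // number of Tasks is fixed at submission time
       .shuffleGrouping("text-spout");

// The Executor count can later be changed without downtime from the Storm CLI, e.g.:
//   storm rebalance nlp-sketch -e token-count=4
```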
Stream grouping decides how a stream should be divided
among a Bolt's Tasks. Apache Storm supports eight types of
stream grouping, four of which are more important and
practical. Shuffle grouping is the most popular: to
distribute data in a uniform and arbitrary way across the Bolts,
shuffle grouping should be used. Fields grouping controls how
each message is sent to Bolts based on the content of each
message. All grouping is a special grouping that sends a message
to all the Bolts and is often used to send signals to Bolts. Global
grouping is used only to combine results from previous Bolts in
the topology into a single Bolt; it sends all the
messages to the single Task with the lowest ID.
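Continuing the same sketch, these four groupings map onto the following wiring calls; the TokenCountBolt instances are reused purely as stand-ins for whatever downstream Bolts a real topology would use.

```java
// Continuing NlpTopologySketch (TokenCountBolt is a stand-in for any downstream Bolt).
builder.setBolt("tokenizer", new TokenCountBolt(), 4)
       .shuffleGrouping("text-spout");                    // uniform, arbitrary distribution
builder.setBolt("per-key", new TokenCountBolt(), 2)
       .fieldsGrouping("tokenizer", new Fields("text"));  // same field value always reaches the same Task
builder.setBolt("signal-listener", new TokenCountBolt(), 2)
       .allGrouping("text-spout");                        // every Task receives a copy of every tuple
builder.setBolt("aggregator", new TokenCountBolt(), 1)
       .globalGrouping("tokenizer");                      // all tuples go to the single Task with the lowest ID
```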
Divide-and-conquer is one of the most important concepts in
big data. All batch big-data processing platforms, such as
Hadoop and Spark, use the map-reduce technique, which is based on
divide-and-conquer logic [26]. Trident is an extension to Apache
Storm that provides divide-and-conquer logic for real-time
stream processing applications [27]. Using Trident, a message can
be divided into many pieces, distributed between many
Storm Tasks, and merged back into one message. To this end,
Trident offers join, aggregation, and grouping operations.
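As a hedged sketch of this divide-and-conquer style, the Trident fragment below splits each message into sentences, groups them, and aggregates counts. It mirrors the standard Trident word-count pattern rather than the exact pipeline of this paper; the spout contents are dummy data and the class names are illustrative.

```java
import org.apache.storm.generated.StormTopology;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TridentSketch {

    // Divide step: split each incoming message into one tuple per sentence.
    public static class SplitSentences extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            for (String sentence : tuple.getString(0).split("\\.")) {
                collector.emit(new Values(sentence.trim()));
            }
        }
    }

    public static StormTopology build() {
        // Dummy batch spout with two hard-coded messages.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("message"), 3,
                new Values("First sentence. Second sentence."),
                new Values("Another message. With two sentences."));
        spout.setCycle(false);

        TridentTopology topology = new TridentTopology();
        topology.newStream("messages", spout)
                .each(new Fields("message"), new SplitSentences(), new Fields("sentence")) // divide
                .groupBy(new Fields("sentence"))                                           // group
                .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                        new Fields("count"));                                              // aggregate/merge
        return topology.build();
    }
}
```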
Figure 7. Storm Topology
3. PROPOSED ARCHITECTURE
The importance of Kafka Topics in real-time platforms is
explained in Section 2. As Kafka Topics provide persistent
data storage with very low latency, they are an essential part of our
architecture. The default behavior of a Kafka consumer is to send
an acknowledgement message to the Kafka Brokers after
successfully receiving a message. However, sending an
acknowledgement can be delayed until a specific point in time. Figure 8
displays three different high-level designs for a real-time platform.
Figure 8-a refers to the most common way of using Kafka Topics
when dealing with stream processing; it is used in this work as a
baseline experiment to be compared with the proposed
architecture. In Figure 8-a, the output of each processor is
stored in Kafka Topics, which guarantees that the output of each
processor will not be lost in case of processor failure. In Figure 8-b,
only the output of the last processor is stored in a
Topic. Since the first processor sends the acknowledgement to the
input Topic, the message will be lost if a later processor is down.
Figure 8-c is the best option for our platform because, in case of a
failure in the processors, no message will be lost: although only the
output of the last processor is stored in a Topic, the
acknowledgement is sent by the last processor instead of the
first one.
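A minimal Java sketch of this delayed-acknowledgement behaviour at the Kafka level is shown below: auto-commit is disabled and the offset is committed only after the whole processing chain has finished, so a crash before the commit means the message is re-delivered rather than lost. The broker address, topic name, and the runWholePipeline placeholder are assumptions, not parts of the implementation described in this paper.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DelayedAckConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // broker address (assumption)
        props.put("group.id", "nlp-pipeline");            // consumer group (assumption)
        props.put("enable.auto.commit", "false");         // do not acknowledge on receipt
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("input1")); // topic name is illustrative
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    runWholePipeline(record.value()); // stand-in for the full chain of processors
                    consumer.commitSync();            // acknowledge only after the last step succeeds
                }
            }
        }
    }

    private static void runWholePipeline(String text) {
        // Hypothetical placeholder for sentence detection, tokenization, and PoS tagging.
        System.out.println("processed: " + text);
    }
}
```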
Figure 8. High Level Design
One of the main challenges of NLP tasks is handling different
languages. In the proposed architecture, a separate stream of data
is created for each language, so the dataflow can be routed
to language-specific Bolts. Figure 9 shows how a language
identifier Bolt can divide a stream of Tweets into two different
data streams. Most NLP pipelines are required to handle a few
languages; for example, to analyze Tweets from Malaysia, the
English, Bahasa Malaysia, Chinese, and Tamil languages need to
be supported in the NLP pipeline. Since Apache Storm allows
different data streams inside a topology, each language is
treated as one stream. Figure 10 displays the proposed
architecture. Input data can come from any source, including
real-time streams and RDBMS. A Kafka Producer connects to the
input source via a Kafka connector, fetches data
from the source, and pushes it to a Kafka Topic. Then a Spout reads
data from the Kafka Topic and sends it to the first Bolt (i.e. a set of
Tasks). Each Bolt sends data to the next Bolt based on the
assigned streams. The last Bolt writes data to a Kafka Topic.
Finally, a Kafka Consumer reads the results from the Kafka Topic
and sends them to a real-time application or writes them to a more
persistent database.
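The following minimal Bolt sketch (Java) illustrates the stream-splitting idea of Figure 9: it declares one output stream per language and emits each tuple to the stream of its detected language. The language check here is a toy placeholder, not the OpenNLP detector used in the implementation.

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Minimal sketch of a language-identifier Bolt that splits the dataflow into per-language streams.
public class LanguageSplitterBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String text = input.getStringByField("text");
        String stream = looksMalay(text) ? "ms" : "en";  // toy check standing in for a real detector
        collector.emit(stream, input, new Values(text)); // anchored emit to the language-specific stream
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declareStream("en", new Fields("text"));
        declarer.declareStream("ms", new Fields("text"));
    }

    private boolean looksMalay(String text) {
        return text.toLowerCase().contains("yang"); // naive heuristic, for illustration only
    }
}
```

A downstream Bolt would then subscribe to one stream only, for example builder.setBolt("pos-en", new EnglishPosTaggerBolt()).shuffleGrouping("lang-splitter", "en"), where the component and class names are hypothetical.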
Figure 11 displays an embodiment of the proposed
architecture for processing English and Bahasa Malaysia texts.
OpenNLP is used in the language and sentence detection Bolts. Since
Bahasa Malaysia uses the same Latin script as English, it is assumed
that the OpenNLP sentence detector and the Fasttext tokenizer can handle
both languages. There are two different Bolts for PoS tagging: an
English and a Bahasa Malaysia PoS tagger. The language detector
Bolt acts as a stream splitter, creating different data streams based
on the detected languages. Finally, the Kafka writer Bolt converts the
results into key and value pairs and writes them
into a Kafka topic. In Figure 11, the Bolts are implemented using
different programming languages: the Bolt that uses SpaCy is written in
Python, while the other Bolts are written in Java.
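Since Storm's multilang protocol runs non-JVM components over stdin/stdout, a Python-based Bolt (for instance one wrapping spaCy) is declared on the Java side roughly as follows. The script name spacy_pos_bolt.py is hypothetical: it would have to implement the multilang protocol (e.g. via Storm's Python multilang helper) and be packaged in the topology's resources directory.

```java
import java.util.Map;

import org.apache.storm.task.ShellBolt;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;

// Java-side declaration of a Bolt whose logic runs in a Python process over Storm's
// multilang protocol (stdin/stdout). "spacy_pos_bolt.py" is a hypothetical script that
// would load spaCy, tag the incoming text, and emit the result back to Storm.
public class SpacyPosBolt extends ShellBolt implements IRichBolt {

    public SpacyPosBolt() {
        super("python3", "spacy_pos_bolt.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("text", "pos_tags"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
```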
Figure 9. Stream Branching
Figure 10. Proposed Architecture
Figure 11. Sample Implementation
4. EXPERIMENT RESULTS
100,000 messages are pushed to Apache Kafka to be processed by
the proposed architecture as well as the baseline experiment. All
messages are at least 450 characters long and contain at least two
sentences. To compare the performance of Storm with other
platforms, three different experiments are conducted:
• Baseline (Kafka): the baseline refers to running NLP
tasks as separate Java applications. As shown in Figure
8-a, each Text Processor (i.e. Java application) is
responsible for reading from a Kafka Topic, performing
an NLP task, and writing to a particular Kafka Topic.
• Apache Storm: a series of Bolts are arranged inside
a topology to perform different NLP tasks. A
Spout reads from the Kafka Topic and sends the data to the
sentence detection Bolt, as shown in Figure 10 and
Figure 11. As shown in Figure 8-c, an
acknowledgement message is sent to Apache
Kafka after a message has been processed by all the Bolts.
• Storm Trident: messages are divided into multiple
messages based on the sentences detected inside each
message. After the sentences are processed by all NLP Bolts,
they are merged back together based on the
message ID.
Regarding hardware resources, two sets of experiments are
conducted:
• Standalone: the experiments are conducted using a
single Virtual Machine (VM) with 8 GB of RAM and four Intel
CPU cores at 2299 MHz.
• Cluster: the experiments are conducted using a cluster
of three VMs with the same specification and configuration; each
VM has 8 GB of RAM and four Intel CPU cores at 2299 MHz.
Figure 12 displays the final results, where Apache Storm achieved
the best result: processing 100,000 messages in 8 minutes in
Cluster mode. Although Trident outperforms the baseline (i.e.
Kafka) experiment, it cannot reach the performance of Apache
Storm because there is some overhead in breaking a message into
separate messages and merging them back together. Figure 13
shows the processing trend for the Standalone experiment and
Figure 14 displays the trends in the Cluster mode. The fluctuation
of the lines in Figure 13 indicates that Trident and Storm cannot reach
their maximum efficiency because of the lack of resources: each
spike is followed by a dramatic drop. This problem is
resolved in the Cluster mode, where, at its optimal point, Storm
was able to process about 22,000 messages in a minute.
Figure 12. Evaluation Result
Figure 13. Apache Kafka vs Apache Storm vs Trident (Standalone)
Figure 14. Apache Kafka vs Apache Storm vs Trident (Cluster)
5. CONCLUSION
The existence of many NLP modules from different sources, as
well as the increasing trend of real-time data, necessitates NLP
pipelines that are flexible and fast. Several real-time
generated data sources require a data analysis pipeline. Another
challenge of NLP tasks is handling multiple languages in real-
time processing. Therefore, in recent years, there has been a shift of
focus from batch data processing towards stream processing.
Data streaming is a useful method for sending small pieces
of data in a continuous flow. Although Apache Kafka is one
of the important platforms in real-time processing, it does not
provide distributed computation in the way other stream
processing platforms, such as Akka, Flink, and Apache Storm, do.
Apache Kafka is nevertheless a vital part of stream processing because it
provides a rapid way to store, read, and write data in
persistent data stores (i.e. Kafka Topics). Among Spark
Streams, Flink, Akka Streams, Kafka Streams, Samza, and
Apache Storm, Apache Storm is selected for this study because it
is not restrained by any programming model, data structure, or
programming language. Moreover, Apache Storm supports at-
least-once and exactly-once message delivery semantics, and
it is easily integrable with other data sources, especially Apache
Kafka. This study examines the latency of Apache Storm while
handling NLP tasks.
A distributed architecture is proposed to handle OpenNLP,
Fasttext, and SpaCy modules for the Bahasa Malaysia and English
languages. The architecture is implemented using a mixture of the
Java and Python programming languages, and its input and output
are connected to Kafka Topics. A total of 100,000
messages, each at least 450 characters long and containing at least
two sentences, were used to test our proposed architecture. The results
show that Apache Storm outperforms Trident and the baseline
experiment by processing the 100,000 messages in 8 minutes in the
cluster mode.
ACKNOWLEDGMENTS
This research was conducted at the Artificial Intelligence Lab, MIMOS
Berhad.
REFERENCES
[1] F. N. A. Al Omran and C. Treude, “Choosing an NLP
Library for Analyzing Software Documentation: A
Systematic Literature Review and a Series of Experiments,”
in 2017 IEEE/ACM 14th International Conference on Mining
Software Repositories (MSR), 2017, pp. 187–197.
[2] R. de Castilho and I. Gurevych, “A broad-coverage
collection of portable NLP components for building
shareable analysis pipelines,” in Proceedings of the
Workshop on Open Infrastructures and Analysis Frameworks
for HLT, 2014, pp. 1–11, doi: 10.3115/v1/W14-5201.
[3] Z. Xiang, Z. Schwartz, J. H. Gerdes, and M. Uysal, “What
can big data and text analytics tell us about hotel guest
experience and satisfaction?,” Int. J. Hosp. Manag., vol. 44,
pp. 120–130, 2015, doi:
https://doi.org/10.1016/j.ijhm.2014.10.013.
[4] T. Kolajo, O. Daramola, and A. Adebiyi, “Big data stream
analysis: a systematic literature review,” J. Big Data, vol. 6,
no. 1, p. 47, 2019, doi: 10.1186/s40537-019-0210-7.
[5] L. B. Krithika and K. V. Akondi, “Survey on Various
Natural Language Processing Toolkits,” 2014.
[6] E. Loper and S. Bird, “NLTK: The Natural Language
Toolkit,” in Proceedings of the ACL-02 Workshop on
Effective Tools and Methodologies for Teaching Natural
Language Processing and Computational Linguistics -
Volume 1, 2002, pp. 63–70, doi: 10.3115/1118108.1118117.
[7] R. Rehurek and P. Sojka, “Software Framework for Topic
Modelling with Large Corpora,” in Proceedings of the LREC
2010 Workshop on New Challenges for NLP Frameworks,
2010, pp. 45–50.
[8] Apache Software Foundation, “openNLP Natural Language
Processing Library.” 2014.
[9] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J.
Bethard, and D. McClosky, “The Stanford CoreNLP Natural
Language Processing Toolkit,” in Association for
Computational Linguistics (ACL) System Demonstrations,
2014, pp. 55–60.
[10] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov,
“Enriching Word Vectors with Subword Information,” arXiv
Prepr. arXiv1607.04606, 2016.
[11] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of
Tricks for Efficient Text Classification,” arXiv Prepr.
arXiv1607.01759, 2016.
[12] M. Honnibal and I. Montani, “spaCy 2: Natural language
understanding with Bloom embeddings, convolutional neural
networks and incremental parsing,” 2017.
[13] J. D. Choi, J. Tetreault, and A. Stent, “It Depends:
Dependency Parser Comparison Using A Web-based
Evaluation Tool,” in Proceedings of the 53rd Annual
Meeting of the Association for Computational Linguistics
and the 7th International Joint Conference on Natural
Language Processing (Volume 1: Long Papers), 2015, pp.
387–396, doi: 10.3115/v1/P15-1038.
[14] G. Wilcock, “Text Annotation with OpenNLP and UIMA,”
in Proceedings of 17th Nordic Conference on Computational
Linguistics, NODALIDA, 2009, pp. 7–8.
[15] W. Hemati, T. Uslu, and A. Mehler, “Text Imager: a
Distributed UIMA-based System for NLP,” in Proceedings
of COLING 2016, the 26th International Conference on
Computational Linguistics: System Demonstrations, 2016,
pp. 59–63.
[16] M. Zaharia et al., “Apache Spark: A Unified Engine for Big
Data Processing,” Commun. ACM, vol. 59, no. 11, pp. 56–65,
Oct. 2016, doi: 10.1145/2934664.
[17] E. Friedman and K. Tzoumas, Introduction to Apache Flink:
Stream Processing for Real Time and Beyond, 1st ed.
O’Reilly Media, Inc., 2016.
[18] A. L. Davis, Reactive Streams in Java: Concurrency with
RxJava, Reactor, and Akka Streams, 1st ed. USA: Apress,
2018.
[19] S. Ehrenstein, “Scalability Benchmarking of Kafka Streams
Applications,” Institut für Informatik, 2020.
[20] S. A. Noghabi et al., “Samza: Stateful Scalable Stream
Processing at LinkedIn,” Proc. VLDB Endow., vol. 10, no.
12, pp. 1634–1645, Aug. 2017, doi:
10.14778/3137765.3137770.
[21] J. S. van der Veen, B. van der Waaij, E. Lazovik, W.
Wijbrandi, and R. J. Meijer, “Dynamically Scaling Apache
Storm for the Analysis of Streaming Data,” in Proceedings of
the 2015 IEEE First International Conference on Big Data
Computing Service and Applications, 2015, pp. 154–161,
doi: 10.1109/BigDataService.2015.56.
[22] N. Garg, Apache Kafka. Packt Publishing, 2013.
[23] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed,
“ZooKeeper: Wait-Free Coordination for Internet-Scale
Systems,” in Proceedings of the 2010 USENIX Conference
on USENIX Annual Technical Conference, 2010, p. 11.
[24] H. Karau, “Unifying the open big data world: The
possibilities∗ of apache BEAM,” in 2017 IEEE International
Conference on Big Data (Big Data), 2017, p. 3981, doi:
10.1109/BigData.2017.8258410.
[25] A. Agarwal, M. Slee, and M. Kwiatkowski, “Thrift: Scalable
Cross-Language Services Implementation,” 2007.
[26] A. B. Patel, M. Birla, and U. Nair, “Addressing big data
problem using Hadoop and Map Reduce,” in 2012 Nirma
University International Conference on Engineering
(NUiCONE), 2012, pp. 1–5, doi:
10.1109/NUICONE.2012.6493198.
[27] A. Jain, Mastering Apache Storm: Real-Time Big Data
Streaming Using Kafka, Hbase and Redis. Packt Publishing, 2017.
The design and implementation of trade finance application based on hyperledg...The design and implementation of trade finance application based on hyperledg...
The design and implementation of trade finance application based on hyperledg...
 
Unified theory of acceptance and use of technology of e government services i...
Unified theory of acceptance and use of technology of e government services i...Unified theory of acceptance and use of technology of e government services i...
Unified theory of acceptance and use of technology of e government services i...
 
Towards predictive maintenance for marine sector in malaysia
Towards predictive maintenance for marine sector in malaysiaTowards predictive maintenance for marine sector in malaysia
Towards predictive maintenance for marine sector in malaysia
 
The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...
The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...
The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...
 
Searchable symmetric encryption security definitions
Searchable symmetric encryption security definitionsSearchable symmetric encryption security definitions
Searchable symmetric encryption security definitions
 
Study on performance of capacitor less ldo with different types of resistor
Study on performance of capacitor less ldo with different types of resistorStudy on performance of capacitor less ldo with different types of resistor
Study on performance of capacitor less ldo with different types of resistor
 
Stil test pattern generation enhancement in mixed signal design
Stil test pattern generation enhancement in mixed signal designStil test pattern generation enhancement in mixed signal design
Stil test pattern generation enhancement in mixed signal design
 
On premise ai platform - from dc to edge
On premise ai platform - from dc to edgeOn premise ai platform - from dc to edge
On premise ai platform - from dc to edge
 
Review of big data analytics (bda) architecture trends and analysis
Review of big data analytics (bda) architecture   trends and analysis Review of big data analytics (bda) architecture   trends and analysis
Review of big data analytics (bda) architecture trends and analysis
 

Recently uploaded

Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 

Recently uploaded (20)

Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 

Real time text stream processing - a dynamic and distributed nlp pipeline

as the data is generated. This may include a wide variety of data sources such as telemetry, log files, e-commerce transactions, social network data, or geospatial services. Thus, many real-time data sources, such as Tweets, need a real-time data analysis pipeline. The output of real-time pipelines should be generated with low latency, and any incoming data must be processed within seconds or milliseconds [4]. Therefore, research efforts should be directed towards developing scalable frameworks and algorithms that accommodate the data stream computing mode, effective resource allocation strategies, and parallelization issues to cope with the ever-growing size and complexity of data [4].

The objective of this work is to examine different frameworks in order to propose a platform that achieves the following aims:
• To encapsulate NLP modules: add and remove multilingual NLP modules to the pipeline without disturbing the architecture of the system.
• To compute distributedly: scale by adding or removing parallel processes as well as worker nodes.
• To process real-time streams: process and analyze incoming streams of textual data with different lengths and frequencies in real time.
• To have a configurable topology: manipulate the processing topology (i.e. the workflow or network of implemented NLP modules) at runtime without interrupting the running services.
• To allow user interaction with the system: provide RESTful APIs to end users.

Section 2 reflects on state-of-the-art NLP systems and libraries as well as distributed stream processing platforms. Section 3 describes the proposed architecture, Section 4 presents the experimental results, and Section 5 summarizes the paper and suggests future research directions.

2. LITERATURE REVIEW
2.1 NLP Toolkits
In 2014, a short survey of NLP toolkits [5] recognized NLTK [6] as the most well-known and comprehensive NLP toolkit. NLTK is written in Python and provides essential NLP modules, including tokenization, sentence splitting, statistical analysis of corpora, classification, and clustering. Although NLTK does not provide any neural network tool, it can be combined with Gensim [7] to provide word embeddings.

Apache OpenNLP [8] offers pre-trained models for the most common NLP modules, such as tokenization, sentence segmentation, PoS tagging, NER, chunking, parsing, and co-reference resolution, for a variety of languages. OpenNLP is an Apache-licensed, cross-platform Java library that uses machine learning methods. The Stanford NLP Toolkit [9], written in Java, provides tokenization, sentence splitting, PoS tagging, NER, parsing, sentiment analysis, temporal expression tagging, and word embeddings. OpenNLP and Stanford NLP use Maximum Entropy models for their PoS taggers. Stanford NLP uses different approaches for different tasks; for instance, Conditional Random Fields (CRF) are used for its NER library.

Fasttext [10], [11] and spaCy [12] are a new generation of NLP libraries that emphasize neural networks. spaCy is one of the most advanced multi-language NLP libraries and is written in Python and Cython [12]. The spaCy developers put effort into the speed of their library to provide a suitable NLP solution for industrial and commercial applications. Several studies compare different NLP libraries across domains and datasets [13]. For example, a study in 2017 [1] found that spaCy achieved the most promising combined accuracy of PoS tagging and tokenization, compared to Google’s SyntaxNet, Stanford NLP, and the NLTK Python library, when tested on Stack Overflow data. In addition, spaCy supports neural network modeling, while OpenNLP lacks deep learning support.

In 2016, Facebook Research released Fasttext as an open-source NLP library. Similar to spaCy, efficiency is vital in Fasttext. Although Fasttext is written in C++, wrappers are available for other languages such as Python and Java. Fasttext provides pre-trained word embedding models for 294 languages. Fasttext is considered a generic tool in the NLP field because it does not provide specific NLP modules, such as NER and sentiment analysis. Instead, it provides a text classification library that can be used as the engine for many NLP tasks.

Finally, UIMA is the most reliable and well-known framework for combining tasks from different libraries into a single text annotation pipeline [14]. Although UIMA itself does not provide any NLP module, it offers flexible pipelines that can be configured by writing an XML description or using a GUI tool. Instead of using a specific annotation format, the annotation formats in UIMA are made interoperable using XML Metadata Interchange (XMI), an interchange standard. To pass the right types of input to the next component, UIMA validates the output formats of components against predefined Types. Consequently, many frameworks, such as TextImager [15], have been developed on top of UIMA.
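To illustrate the kind of module such a pipeline wraps, the following minimal Java sketch runs OpenNLP sentence detection and tokenization over a piece of raw text; the pre-trained model file names (en-sent.bin, en-token.bin) and their locations are illustrative assumptions, not part of the original paper.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class OpenNlpExample {
        public static void main(String[] args) throws Exception {
            // Load the pre-trained English sentence and token models (paths are assumptions).
            try (InputStream sentIn = new FileInputStream("en-sent.bin");
                 InputStream tokIn = new FileInputStream("en-token.bin")) {
                SentenceDetectorME sentenceDetector = new SentenceDetectorME(new SentenceModel(sentIn));
                TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));

                String text = "Apache Storm processes streams. Kafka stores the messages.";
                // Split the text into sentences, then tokenize each sentence.
                for (String sentence : sentenceDetector.sentDetect(text)) {
                    String[] tokens = tokenizer.tokenize(sentence);
                    System.out.println(String.join(" | ", tokens));
                }
            }
        }
    }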
2.2 Real-time Stream Processing Frameworks
This study compares the main features of the six most popular stream processing frameworks, namely Spark Streams [16], Flink [17], Akka Streams [18], Kafka Streams [19], Samza [20], and Apache Storm [21]. Spark Streams is a library in the Spark framework for processing continuously flowing stream data, powered by Spark RDDs. Flink provides stream processing for large-volume data and also handles batch analytics with the same technology. Akka Streams is an implementation of the Reactive Streams specification, built on top of Akka Actors to perform asynchronous, non-blocking stream processing. Apache Samza is another open-source, near-real-time, asynchronous computational framework for stream processing, developed in Scala and Java. Apache Storm accepts large amounts of data arriving at very high rates, possibly from various sources, analyzes it, and publishes real-time updates to other systems, without storing any actual data. Apache offers two different streaming frameworks with similar names: 1) Kafka Streams is a library for writing complex stream processing logic, and 2) Apache Kafka (referred to here as Kafka Topics) is a distributed streaming platform [22]. The Kafka Streams API is used to develop stream applications that consume from Kafka Topics and produce back into Kafka Topics. Results from any of these tools are usually written back to new Kafka Topics for downstream consumption, as shown in Figure 1.

Figure 1. Kafka Topics

2.2.1 Kafka Topics
Apache Kafka is a distributed log that stores messages sequentially. In Kafka terminology, consumers consume (read) data from topics, and producers produce (write) data into topics. As shown in Figure 2, a Kafka cluster typically consists of multiple brokers to maintain the load balance. Kafka broker election can be handled by ZooKeeper [23].

Figure 2. Kafka Architecture

Kafka provides high scalability and resiliency, so it is an excellent integration tool between data producers and consumers. As depicted in Figure 3, peer-to-peer spaghetti integration quickly becomes unmanageable as the number of services grows. Therefore, Kafka Topics provide a single backbone that is used by all services. Although Kafka is not fundamentally a queue, it can be utilized as a FIFO queue: producers always write to the end of the log, and consumers can read from whichever log offset they want, from the beginning or the end of the queue [22].

Figure 3. Peer-to-peer Architecture
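As a concrete illustration of the producer side described in this subsection, the sketch below publishes text messages into a Kafka Topic using the standard Java client; the broker address (localhost:9092) and topic name (nlp-input) are illustrative assumptions.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TextProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Each message is appended to the end of the topic's log.
                producer.send(new ProducerRecord<>("nlp-input", "msg-1",
                        "Kafka stores messages sequentially and serves them to consumers."));
                producer.flush();
            }
        }
    }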
2.2.2 Distributed Processing Comparison
There are three categories of message delivery reliability:
• At-most-once delivery: for each input message, the message is delivered zero or one time; in other words, a message may be lost.
• At-least-once delivery: for each input message, potentially multiple delivery attempts are made; in other words, a message may be duplicated but not lost.
• Exactly-once delivery: for each input message, exactly one delivery is made to the recipient; in other words, the message can neither be lost nor duplicated.

Kafka Topics, Kafka Streams, Spark Streams, Apache Storm, and Flink support exactly-once and at-least-once delivery semantics. However, Akka Streams and Samza are unable to guarantee exactly-once delivery, as shown in Table 1. Spark Streams and Flink are similar and often compared with each other because they run as distributed services that execute any submitted jobs. They provide similarly rich analytics, based on Apache Beam. Apache Beam [24] is an advanced unified programming model for implementing batch and streaming data processing jobs that run on any Beam runner. Spark Streams and Flink manage all scheduling issues: after jobs are submitted, they handle scalability, failover, load balancing, and so on. Another advantage of Spark Streams and Flink is that they have a large community with ongoing updates and improvements, because they are widely adopted by big companies at scale. A drawback of Spark Streams and Flink is their restricted programming model: jobs have to be written using APIs that conform to that model. Furthermore, integration with other services usually requires running the engines separately from the microservices and exchanging data through Kafka Topics or other means. This adds some latency and more running applications at the system level. In addition, the overhead of these systems makes them less ideal for smaller data streams. In Spark Streams, data is captured in fixed time intervals and then processed as a “mini batch”; the drawback is that longer latencies are required (100 milliseconds or longer for the intervals).

Akka Streams, Kafka Streams, Samza, and Storm are similar in that they run as libraries that can be embedded in microservices, providing greater flexibility in how analytics are integrated with other processes. Akka Streams is very flexible in terms of deployment and configuration options, compared to Spark and Flink, and offers many interoperation capabilities. When Akka Streams uses Kafka Topics to exchange data, consumer lag (i.e., queue depth) should be watched carefully, as it is a source of latency. Figure 4 shows the spectrum of microservices. Microservices are not always record oriented; it is a spectrum, because we might take some events and also route them through a data pipeline.
Compared to Kafka Streams, Akka Streams is more generic microservice oriented and less data-analytics oriented. Although both Akka and Kafka Streams can cover most of the spectrum, Akka emerged in the world of building Reactive microservices, whereas Kafka Streams is effectively a dataflow API.

Figure 4. Microservices spectrum
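To make the delivery-semantics distinction concrete, the following sketch shows a Kafka consumer configured for at-least-once processing: auto-commit is disabled and offsets are committed only after the records are processed, so a crash before the commit leads to redelivery rather than loss. The topic name and the process() step are hypothetical placeholders.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class AtLeastOnceConsumer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
            props.put("group.id", "nlp-pipeline");
            props.put("enable.auto.commit", "false");            // acknowledge manually
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("nlp-input"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record.value());                  // hypothetical NLP step
                    }
                    // Commit only after the batch is processed: at-least-once delivery.
                    consumer.commitSync();
                }
            }
        }

        private static void process(String text) {
            System.out.println("processing: " + text);
        }
    }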
In Kafka Streams, there must always be a persistent buffer between stream applications, as shown in Figure 5. Another disadvantage of Kafka Streams is that all the nodes (processors) in one topology must be written in the same programming language.

Figure 5. Kafka Streams

As displayed in Table 1, Samza currently only provides an at-least-once delivery guarantee; exactly-once is planned for future releases. On top of that, Samza currently only supports Java and Scala for its high- and low-level APIs. All in all, Apache Storm is one of the best distributed stream processing frameworks because: 1) it is not restricted to any specific type of programming model or data structure, 2) it supports at-least-once and exactly-once message delivery semantics, 3) it provides real-time message processing with low latency, 4) it can easily be integrated with other external platforms such as Apache Kafka, and 5) it supports many programming languages, including Java, Scala, Ruby, Python, JavaScript, and Perl.

Table 1. Platform Comparison
• Spark Streams: programming model Apache Beam; semantic guarantees at-least-once, exactly-once; latency high; processing in mini batches; integration requires extra engines; supported languages JVM, Python, R, SparkSQL.
• Flink: programming model Apache Beam; semantic guarantees at-least-once, exactly-once; latency medium; real-time processing; integration requires extra engines; supported languages JVM.
• Akka Streams: high- and low-level APIs; semantic guarantees at-most-once, at-least-once; latency low; real-time processing; integratable; supported languages JVM.
• Kafka Streams: high- and low-level APIs; semantic guarantees at-most-once, at-least-once, exactly-once; latency low; real-time processing; best integration platform; supported languages JVM, Python, KSQL.
• Samza: high- and low-level APIs plus Apache Beam; semantic guarantee at-least-once; latency low; real-time processing; integratable; supported languages JVM, SamzaSQL.
• Apache Storm: model free; semantic guarantees at-least-once, exactly-once; latency low; real-time processing; integratable; supported languages JVM, Ruby, Python, JavaScript, Perl.

2.2.3 Apache Storm
An arrangement of Spouts and Bolts is called a topology. A Spout is a source of data in a topology; it fetches data from an external source and emits it into the Bolts. A Bolt performs the actual data processing [21]. At the core of Apache Storm is a Thrift definition [25] for defining and submitting topologies. As shown in Figure 6, since Thrift can be utilized in any language, topologies can be defined and submitted from any language. Apache Storm is designed to be usable with any programming language: Spouts and Bolts can be defined in any language, and non-JVM Spouts and Bolts communicate with Apache Storm over stdin/stdout.
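The sketch below shows how such a topology might be wired together with Storm's TopologyBuilder. The Kafka spout uses the storm-kafka-client API with an assumed broker and topic, while SentenceDetectorBolt and TokenizerBolt are hypothetical placeholder classes standing in for the NLP Bolts of this pipeline; the parallelism hints and groupings are illustrative choices, not the paper's configuration.

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.kafka.spout.KafkaSpout;
    import org.apache.storm.kafka.spout.KafkaSpoutConfig;
    import org.apache.storm.topology.TopologyBuilder;

    public class NlpTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // Spout reading raw text from a Kafka Topic (broker and topic names are assumptions).
            builder.setSpout("kafka-reader",
                    new KafkaSpout<>(KafkaSpoutConfig.builder("localhost:9092", "nlp-input").build()), 1);

            // NLP Bolts (hypothetical classes); shuffle grouping spreads tuples
            // uniformly and arbitrarily across each Bolt's Tasks.
            builder.setBolt("sentence-detector", new SentenceDetectorBolt(), 2)
                   .shuffleGrouping("kafka-reader");
            builder.setBolt("tokenizer", new TokenizerBolt(), 2)
                   .shuffleGrouping("sentence-detector");

            Config conf = new Config();
            conf.setNumWorkers(2);
            StormSubmitter.submitTopology("nlp-pipeline", conf, builder.createTopology());
        }
    }

The parallelism hint passed to setBolt corresponds to the number of Executors discussed in the next paragraphs, which can be rebalanced at runtime without downtime.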
Figure 6. Apache Storm

To parallelize the processing, each Spout or Bolt is executed as many Tasks across a Storm cluster. Executors are the processing threads in a Storm worker node; each Executor runs one or more Tasks of the same Spout or Bolt. Figure 7 displays a topology deployed on two worker machines. It contains four threads (Executors), where each thread consists of two Tasks. The number of Executors is always less than or equal to the number of Tasks. The number of Executors can be changed without downtime, while the number of Tasks is fixed. When the number of Tasks is greater than the number of Executors, the Tasks inside an Executor run serially. For example, only four Tasks can be active concurrently in Figure 7.

Stream grouping decides how a stream should be divided among a Bolt's Tasks. Apache Storm supports eight types of stream grouping, four of which are the most important and practical. Shuffle grouping is the most popular: it distributes data in a uniform and arbitrary way across the Bolt's Tasks. Field grouping routes each message to a Task based on the value of a chosen field of the message. All grouping is a special grouping that sends a message to all the Bolt's Tasks and is often used to send signals to Bolts. Global grouping is used only to combine results from previous Bolts of the topology in a single Bolt; it sends all the messages to the single Task with the lowest ID.

Divide-and-conquer is one of the most important paradigms in big data. All batch big-data processing platforms, such as Hadoop and Spark, use the map-reduce technique, which is based on divide-and-conquer logic [26]. Trident is an extension to Apache Storm that provides divide-and-conquer logic for real-time stream processing applications [27]. Using Trident, a message can be divided into many pieces, distributed between many Storm Tasks, and merged back into one message. Therefore, Trident offers join, aggregation, and grouping Bolts.

Figure 7. Storm Topology

3. PROPOSED ARCHITECTURE
The importance of Kafka Topics in real-time platforms is explained in Section 2. As Kafka Topics provide persistent data storage with the lowest latency, they are an essential part of our architecture. The default behavior of a Kafka consumer is to send an acknowledgement to the Kafka brokers after successfully receiving a message. However, sending the acknowledgement can be delayed until a later point in time. Figure 8 displays three different high-level designs for a real-time platform. Figure 8-a refers to the most common way of using Kafka Topics for stream processing and is used in this work as a baseline experiment to be compared with the proposed architecture. In Figure 8-a, the output of each processor is stored in Kafka Topics, which guarantees that the output of each processor will not be lost in case of processor failure. In Figure 8-b, only the output of the last processor is stored in a Topic; since the first processor sends the acknowledgement to the input Topic, the message will be lost if any of the other processors goes down. Figure 8-c is the best option for our platform because no message will be lost in case of a processor failure: although only the output of the last processor is stored in a Topic, the acknowledgement is sent by the last processor instead of the first one.

Figure 8. High Level Design
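The Figure 8-c design maps naturally onto Storm's tuple-acknowledgement mechanism. The sketch below is a rough illustration, not the paper's actual code: a Bolt anchors its output to the input tuple and acks it only after processing, so that, with an acking Kafka spout, the offset is effectively acknowledged only once the whole tuple tree has been handled. The PosTaggerBolt name and the tagging step are hypothetical.

    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class PosTaggerBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            String sentence = input.getStringByField("sentence");
            String tagged = tag(sentence);                       // hypothetical PoS tagging step

            // Anchoring the emitted tuple to the input ties downstream failures to this tuple,
            // so the spout replays the original message if anything downstream fails.
            collector.emit(input, new Values(input.getStringByField("id"), tagged));

            // Ack only after the work is done; the acknowledgement towards Kafka is therefore
            // driven by the end of the pipeline, as in Figure 8-c.
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "tagged"));
        }

        private String tag(String sentence) {
            return sentence; // placeholder
        }
    }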
One of the main challenges of NLP tasks is to handle different languages. In the proposed architecture, a separate stream of data is created for each language, so the dataflow can be routed to language-specific Bolts. Figure 9 shows how a language identifier Bolt can divide a stream of Tweets into two different data streams. Most NLP pipelines are required to handle a few languages. For example, to analyze Tweets from Malaysia, the English, Bahasa Malaysia, Chinese, and Tamil languages need to be supported in the NLP pipeline. Since Apache Storm allows multiple data streams inside a topology, each language is treated as one stream.

Figure 10 displays the proposed architecture. Input data can come from any source, including real-time streams and RDBMSs. A Kafka Producer connects to the input source via a Kafka connector, fetches data from the source, and pushes it into a Kafka Topic. A Spout then reads data from the Kafka Topic and sends it to the first Bolt (i.e. a set of Storm Tasks). Each Bolt sends data to the next Bolt based on the assigned streams. The last Bolt writes data to a Kafka Topic. Finally, a Kafka Consumer reads the results from the Kafka Topic and sends them to a real-time application or writes them to a more persistent database.

Figure 11 displays an embodiment of the proposed architecture to process English and Bahasa Malaysia texts. OpenNLP is used in the language and sentence detection Bolts. Since Bahasa Malaysia uses the English writing system, it is assumed that the OpenNLP sentence detector and the Fasttext tokenizer can handle both languages. There are two different PoS tagger Bolts: one for English and one for Bahasa Malaysia. The language detector Bolt acts as a stream splitter to create different data streams based on the detected languages. Finally, the Kafka writer Bolt converts the results into key and value pairs and writes them into a Kafka Topic. In Figure 11, the Bolts are implemented using different programming languages: the SpaCy Bolt is written in Python and the other Bolts are written in Java.

Figure 9. Stream Branching

Figure 10. Proposed Architecture

Figure 11. Sample Implementation
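A minimal sketch of such a stream-splitting Bolt is given below, assuming two named output streams (english and malay) and a hypothetical detectLanguage() helper standing in for the OpenNLP language detector; a downstream PoS tagger Bolt would subscribe to one stream, e.g. shuffleGrouping("language-detector", "malay").

    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class LanguageDetectorBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            String id = input.getStringByField("id");
            String text = input.getStringByField("text");

            // Route the tuple to a language-specific stream, as in Figure 9.
            String stream = "ms".equals(detectLanguage(text)) ? "malay" : "english";
            collector.emit(stream, input, new Values(id, text));
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // One named stream per supported language.
            declarer.declareStream("english", new Fields("id", "text"));
            declarer.declareStream("malay", new Fields("id", "text"));
        }

        private String detectLanguage(String text) {
            // Hypothetical stand-in for the OpenNLP language detector model.
            return text.matches(".*\\b(dan|yang|untuk)\\b.*") ? "ms" : "en";
        }
    }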
4. EXPERIMENT RESULTS
100,000 messages are pushed to Apache Kafka to be processed by the proposed architecture as well as by the baseline experiment. All messages are at least 450 characters long and contain at least two sentences. To compare the performance of Storm with other platforms, three different experiments are conducted:
• Baseline (Kafka): the baseline refers to running the NLP tasks as separate Java applications. As shown in Figure 8-a, each Text Processor (i.e. Java application) is responsible for reading from a Kafka Topic, performing an NLP task, and writing to a particular Kafka Topic.
• Apache Storm: a series of Bolts are arranged inside a topology to perform the different NLP tasks. A Spout reads from a Kafka Topic and sends the messages to the sentence detection Bolt, as shown in Figure 10 and Figure 11. As shown in Figure 8-c, an acknowledgement is sent to Apache Kafka only after a message has been processed by all the Bolts.
• Storm Trident: messages are divided into multiple messages based on the detected sentences inside each message. After the sentences are processed by all NLP Bolts, they are merged back together based on the message ID.

Regarding the hardware resources, two sets of experiments are conducted:
• Standalone: the experiments are conducted using a single Virtual Machine (VM). The VM has 8 GB RAM and 4 Intel CPU cores at 2299 MHz.
• Cluster: the experiments are conducted using a cluster of three VMs with the same specification and configuration. Each VM has 8 GB RAM and 4 Intel CPU cores at 2299 MHz.

Figure 12 displays the final results, where Apache Storm achieved the best result: processing 100,000 messages in 8 minutes in Cluster mode.
Although Trident outperforms the baseline (i.e. Kafka) experiment, it cannot reach the performance of Apache Storm because of the overhead of breaking a message into separate messages and merging them back together. Figure 13 shows the processing trend for the Standalone experiment, and Figure 14 displays the trends in Cluster mode. The fluctuation of the lines in Figure 13 shows that Trident and Storm cannot reach their maximum efficiency because of the lack of resources, where each hike is followed by a dramatic drop. This problem is resolved in Cluster mode, where there is an optimal point at which Storm is able to process about 22,000 messages per minute.

Figure 12. Evaluation Result

Figure 13. Apache Kafka vs Apache Storm vs Trident (Standalone)
Figure 14. Apache Kafka vs Apache Storm vs Trident (Cluster)

5. CONCLUSION
The existence of many NLP modules from different sources, as well as the increasing amount of real-time data, necessitates NLP pipelines that are flexible and rapid. Several real-time data sources require a real-time data analysis pipeline, and another challenge of NLP tasks is to handle multiple languages in real-time processing. Therefore, in recent years, there has been a shift of focus from batch data processing towards stream processing. Data streaming is a useful method for sending small pieces of data in a continuous flow. Although Apache Kafka is one of the important platforms in real-time processing, it does not provide distributed computation in the way that stream processing platforms such as Akka, Flink, and Apache Storm do. Apache Kafka is nevertheless a vital part of stream processing because it provides a rapid approach to store, read, and write data from persistent data sources (i.e. Kafka Topics). Among Spark Streams, Flink, Akka Streams, Kafka Streams, Samza, and Apache Storm, Apache Storm is selected for this study because it is not restricted to any specific programming model, data structure, or programming language. Moreover, Apache Storm supports at-least-once and exactly-once message delivery semantics, and it is easily integrable with other data sources, especially Apache Kafka.

This study examines the latency of Apache Storm while handling NLP tasks. A distributed architecture is proposed to handle OpenNLP, Fasttext, and SpaCy modules for the Bahasa Malaysia and English languages. The architecture is implemented using a mixture of the Java and Python programming languages, and its input and output are connected to Kafka Topics. A total of 100,000 messages, each at least 450 characters long and containing at least two sentences, is used to test the proposed architecture. The results show that Apache Storm outperforms Trident and the baseline experiment by processing the 100,000 messages in 8 minutes in Cluster mode.

ACKNOWLEDGMENTS
This research was done under the Artificial Intelligence Lab, MIMOS BERHAD.

REFERENCES
[1] F. N. A. Al Omran and C. Treude, “Choosing an NLP Library for Analyzing Software Documentation: A Systematic Literature Review and a Series of Experiments,” in 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), 2017, pp. 187–197.
[2] R. de Castilho and I. Gurevych, “A broad-coverage collection of portable NLP components for building shareable analysis pipelines,” in Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT, 2014, pp. 1–11, doi: 10.3115/v1/W14-5201.
[3] Z. Xiang, Z. Schwartz, J. H. Gerdes, and M. Uysal, “What can big data and text analytics tell us about hotel guest experience and satisfaction?,” Int. J. Hosp. Manag., vol. 44, pp. 120–130, 2015, doi: 10.1016/j.ijhm.2014.10.013.
[4] T. Kolajo, O. Daramola, and A. Adebiyi, “Big data stream analysis: a systematic literature review,” J. Big Data, vol. 6, no. 1, p. 47, 2019, doi: 10.1186/s40537-019-0210-7.
[5] L. B. Krithika and K. V. Akondi, “Survey on Various Natural Language Processing Toolkits,” 2014.
[6] E. Loper and S. Bird, “NLTK: The Natural Language Toolkit,” in Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, 2002, pp. 63–70, doi: 10.3115/1118108.1118117.
[7] R. Rehurek and P. Sojka, “Software Framework for Topic Modelling with Large Corpora,” in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010, pp. 45–50.
[8] Apache Software Foundation, “openNLP Natural Language Processing Library.” 2014.
[9] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, “The Stanford CoreNLP Natural Language Processing Toolkit,” in Association for Computational Linguistics (ACL) System Demonstrations, 2014, pp. 55–60.
[10] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching Word Vectors with Subword Information,” arXiv preprint arXiv:1607.04606, 2016.
[11] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of Tricks for Efficient Text Classification,” arXiv preprint arXiv:1607.01759, 2016.
[12] M. Honnibal and I. Montani, “spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing,” 2017.
[13] J. D. Choi, J. Tetreault, and A. Stent, “It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015, pp. 387–396, doi: 10.3115/v1/P15-1038.
[14] G. Wilcock, “Text Annotation with OpenNLP and UIMA,” in Proceedings of the 17th Nordic Conference on Computational Linguistics, NODALIDA, 2009, pp. 7–8.
[15] W. Hemati, T. Uslu, and A. Mehler, “TextImager: a Distributed UIMA-based System for NLP,” in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, 2016, pp. 59–63.
[16] M. Zaharia et al., “Apache Spark: A Unified Engine for Big Data Processing,” Commun. ACM, vol. 59, no. 11, pp. 56–65, Oct. 2016, doi: 10.1145/2934664.
[17] E. Friedman and K. Tzoumas, Introduction to Apache Flink: Stream Processing for Real Time and Beyond, 1st ed. O’Reilly Media, Inc., 2016.
[18] A. L. Davis, Reactive Streams in Java: Concurrency with RxJava, Reactor, and Akka Streams, 1st ed. USA: Apress, 2018.
[19] S. Ehrenstein, “Scalability Benchmarking of Kafka Streams Applications,” Institut für Informatik, 2020.
[20] S. A. Noghabi et al., “Samza: Stateful Scalable Stream Processing at LinkedIn,” Proc. VLDB Endow., vol. 10, no. 12, pp. 1634–1645, Aug. 2017, doi: 10.14778/3137765.3137770.
[21] J. S. van der Veen, B. van der Waaij, E. Lazovik, W. Wijbrandi, and R. J. Meijer, “Dynamically Scaling Apache Storm for the Analysis of Streaming Data,” in Proceedings of the 2015 IEEE First International Conference on Big Data Computing Service and Applications, 2015, pp. 154–161, doi: 10.1109/BigDataService.2015.56.
[22] N. Garg, Apache Kafka. Packt Publishing, 2013.
[23] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed, “ZooKeeper: Wait-Free Coordination for Internet-Scale Systems,” in Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference, 2010, p. 11.
[24] H. Karau, “Unifying the open big data world: The possibilities of Apache Beam,” in 2017 IEEE International Conference on Big Data (Big Data), 2017, p. 3981, doi: 10.1109/BigData.2017.8258410.
[25] A. Agarwal, M. Slee, and M. Kwiatkowski, “Thrift: Scalable Cross-Language Services Implementation,” 2007.
[26] A. B. Patel, M. Birla, and U. Nair, “Addressing big data problem using Hadoop and Map Reduce,” in 2012 Nirma University International Conference on Engineering (NUiCONE), 2012, pp. 1–5, doi: 10.1109/NUICONE.2012.6493198.
[27] A. Jain, Mastering Apache Storm: Real-Time Big Data Streaming Using Kafka, Hbase and Redis. Packt Publishing, 2017.