Real time big data stream processing

Real-time stream processing for Big Data
Presented by Luay AL-Assadi

INTRODUCTION
Rise of Web 2.0 and the Internet of Things.
 • Huge amounts of data (e.g., sensors, social media, online marketing).
 • Tracking all kinds of information that is only valuable for a short time and therefore has to be processed immediately.
 • Monitoring user activity to optimize product or video recommendations for the current user context.
Traditional batch-oriented approaches.
 • Complex Event Processing (CEP) engines and DBMSs.
Distributed data processing.
 • MapReduce.

Real-time analytics: Big Data in motion
 • Real-time data infrastructure:
   • Built from distributed components.
   • Components communicate over an asynchronous network.
   • Engineered on top of the JVM (Java Virtual Machine).
 • Real-time Big Data basic architecture model:
   • Collecting data from various places.
   • Moving data to the streaming layer.
   • Analyzing data in the stream processor.
   • Forwarding outputs to the serving layer.
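The four stages of the basic architecture model can be sketched as a chain of small functions. This is a minimal in-memory illustration only: the function names (`collect`, `to_stream`, `process`, `serve`) and the sample events are assumptions made for this example, and a real deployment would use distributed components for each stage.

```python
from collections import deque

def collect(sources):
    """Collecting: gather raw events from several places into one iterable."""
    for source in sources:
        yield from source

def to_stream(events):
    """Streaming layer: buffer events in arrival order (stand-in for a message queue)."""
    return deque(events)

def process(stream):
    """Stream processor: count events per key, one item at a time."""
    counts = {}
    while stream:
        key = stream.popleft()
        counts[key] = counts.get(key, 0) + 1
    return counts

def serve(results, key):
    """Serving layer: answer queries from the computed results."""
    return results.get(key, 0)

# Hypothetical sample data from two sources.
sensor_events = ["temp", "temp", "humidity"]
social_events = ["click", "temp"]
results = process(to_stream(collect([sensor_events, social_events])))
```

Calling `serve(results, "temp")` then answers a query from the serving layer without touching the raw events again.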

 • Big Data Architecture Model: Lambda Architecture
   [Diagram: Collecting Data → Streaming Data → Batch processing (Store Data) and Stream processing → Serving Layer]
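In the Lambda Architecture the serving layer combines a view computed by batch processing over stored data with a view computed by stream processing over recent data. A minimal sketch of that merge follows, assuming simple per-key counts; the names `batch_view`, `realtime_view`, and `query`, and the sample data, are hypothetical.

```python
from collections import Counter

def batch_view(master_dataset):
    """Batch processing: recompute counts over all stored data."""
    return Counter(master_dataset)

def realtime_view(recent_events):
    """Stream processing: counts for events not yet covered by the batch view."""
    return Counter(recent_events)

def query(batch, realtime, key):
    """Serving layer: merge both views to answer a query."""
    return batch[key] + realtime[key]

# Hypothetical data: stored history plus events that arrived after the last batch run.
master_dataset = ["page_a", "page_b", "page_a"]
recent_events = ["page_a", "page_c"]
batch = batch_view(master_dataset)
speed = realtime_view(recent_events)
```

The merge in `query` is what lets results stay current between batch runs, at the cost of maintaining two processing paths.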

 • Big Data Architecture Model: Kappa Architecture
   [Diagram: Collecting Data → Streaming Data (Store, retain Data) → Stream processing → Serving Layer]

Real-time streamers
 • RabbitMQ.
   • Broker-centric, with message acknowledgement.
   • Focused on delivery guarantees between producers and consumers.
   • Can fall over if consumers are too slow.
   [Diagram: Producer, Broker, Consumer, with Message and Ack flows]
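The acknowledgement idea can be illustrated with a toy in-memory broker. This is a sketch of the concept only, not the RabbitMQ client API: the `Broker` class and its methods are hypothetical names, and a real broker tracks acknowledgements per channel and consumer.

```python
from collections import deque

class Broker:
    """Toy broker: a message stays tracked until the consumer acknowledges it."""

    def __init__(self):
        self.queue = deque()
        self.unacked = {}   # delivery tag -> message awaiting an ack
        self.next_tag = 0

    def publish(self, message):
        self.queue.append(message)

    def deliver(self):
        """Hand the next message to a consumer and remember it as unacknowledged."""
        if not self.queue:
            return None
        tag = self.next_tag
        self.next_tag += 1
        message = self.queue.popleft()
        self.unacked[tag] = message
        return tag, message

    def ack(self, tag):
        """Consumer confirms processing; the broker can forget the message."""
        del self.unacked[tag]

    def requeue_unacked(self):
        """Consumer was lost: put unacknowledged messages back on the queue."""
        for tag in sorted(self.unacked):
            self.queue.append(self.unacked[tag])
        self.unacked.clear()

broker = Broker()
broker.publish("order-1")
broker.publish("order-2")
tag1, msg1 = broker.deliver()
broker.ack(tag1)                 # processed successfully
tag2, msg2 = broker.deliver()    # consumer crashes before acking
broker.requeue_unacked()
tag3, msg3 = broker.deliver()    # the same message is delivered again
```

Because the broker holds every unacknowledged message, slow consumers make its backlog grow, which is the weakness noted above.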

Real-time streamers
 • Kafka.
   • Producer-centric.
   • Supports both online and offline consumers.
   • Uses ZooKeeper to reliably maintain state across the cluster.
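One way to picture online and offline consumers is an append-only log in which each consumer keeps its own read position. The sketch below is a toy model under that assumption and does not use the Kafka client API; `Log` and `Consumer` are hypothetical names.

```python
class Log:
    """Toy append-only log: records are kept after being read."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1   # offset of the new record

    def read(self, offset, max_records=10):
        return self.records[offset:offset + max_records]

class Consumer:
    """Each consumer tracks its own offset, independent of other consumers."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.read(self.offset)
        self.offset += len(batch)
        return batch

log = Log()
online = Consumer(log)
offline = Consumer(log)

log.append("a")
first = online.poll()        # online consumer reads as data arrives
log.append("b")
log.append("c")
second = online.poll()
catch_up = offline.poll()    # offline consumer reads everything later
```

Since the log does not depend on consumers acknowledging messages, a consumer that was offline can catch up from its own offset without slowing the producer.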

Real-time processors: Latency, Throughput & Efficiency
 • Low latency: handling data items immediately as they arrive.
 • High throughput: buffering data items and processing them in batches for increased efficiency.
 • Stream (low latency): STORM, SAMZA
 • Micro-batch: Trident (groups tuples into batches), SPARK Streaming (restricts batch size)
 • Batch (high throughput): SPARK
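The micro-batch idea, grouping incoming tuples into batches of restricted size, can be sketched with a small generator. The function name `micro_batches` and the `batch_size` parameter are assumptions for this example, and real systems typically bound batches by time interval as well as by count.

```python
def micro_batches(stream, batch_size):
    """Group items from a stream into lists of at most batch_size items."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch

batches = list(micro_batches(range(7), batch_size=3))
one_at_a_time = list(micro_batches(range(3), batch_size=1))
```

A `batch_size` of 1 corresponds to one-at-a-time stream processing, while larger values trade latency for throughput.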

Real-time processors
 • STORM
   • Storm was developed by Nathan Marz at BackType, which was acquired by Twitter in 2011.
   • Initially promoted as the “Hadoop of real-time”.
   • A vital part of a Storm deployment is a ZooKeeper cluster for reliable coordination.

 • STORM
   Topology:
     a network made of spouts and bolts, similar to a Hadoop MapReduce job.
   Stream:
     an unbounded sequence of tuples.
   Spouts & bolts:
     spouts receive data continuously, transform it into a stream of tuples, and send them to the bolts to be processed.
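The spout-and-bolt structure can be illustrated with chained Python generators. This is a conceptual sketch only and not the Storm API (Storm topologies are normally defined on the JVM); the names `sentence_spout`, `split_bolt`, and `count_bolt` are hypothetical.

```python
from collections import Counter

def sentence_spout(lines):
    """Spout: turn incoming data into a stream of tuples."""
    for line in lines:
        yield (line,)

def split_bolt(tuples):
    """Bolt: split each sentence tuple into one tuple per word."""
    for (sentence,) in tuples:
        for word in sentence.split():
            yield (word,)

def count_bolt(tuples):
    """Bolt: keep a running count per word and emit (word, count)."""
    counts = Counter()
    for (word,) in tuples:
        counts[word] += 1
        yield (word, counts[word])

# The "topology": spout -> split bolt -> count bolt.
lines = ["storm processes streams", "streams of tuples"]
emitted = list(count_bolt(split_bolt(sentence_spout(lines))))
```

Each tuple flows through the whole chain as soon as the spout emits it, which mirrors one-at-a-time processing.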

 • STORM
   Nodes
   Master node:
     runs a daemon called ‘Nimbus’, which is similar to the ‘JobTracker’ of a Hadoop cluster.
     • Assigns jobs.
     • Monitors performance.

 • STORM
   Nodes
   Worker node:
     runs a daemon called ‘Supervisor’, which runs one or more worker processes on its node.
   Apache ZooKeeper facilitates communication between Nimbus and the Supervisors with the help of message acknowledgements and processing status.

 • SAMZA
   • Initially created at LinkedIn and submitted to the Apache Incubator in July 2013.
   • Samza was co-developed with the queueing system Kafka.
   • Samza requires a little more work to deploy than Storm, as it not only depends on a ZooKeeper cluster but also runs on top of Hadoop YARN.

 • SAMZA - YARN
   YARN is a cluster scheduler. It allows you to allocate a number of containers (processes) in a cluster of machines and execute arbitrary commands on them. The Samza client uses YARN to run a Samza job.
   • NodeManager: responsible for launching processes on its machine.
   • ResourceManager: talks to all of the NodeManagers to tell them what to run.
   • ApplicationMaster: responsible for managing the application’s workload, asking for containers, and handling notifications when one of its containers fails.

 • SAMZA
   • Decouples individual processing steps.
   • Buffering data between processing steps makes (intermediate) results available to unrelated parties.
   • Prevents data loss by periodically checkpointing current progress and, after a failure, reprocessing all data from the last checkpoint.
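Checkpointing and reprocessing can be sketched as follows. This is a simplified model, not Samza's API: the function `process_from`, the dictionary-based checkpoint, and the simulated crash are assumptions for illustration, and the sketch saves the running state together with the offset so that replayed messages are not double-counted.

```python
def process_from(log, checkpoint, checkpoint_every=2, fail_at=None):
    """Sum the log starting at the checkpointed offset, saving progress periodically."""
    offset, total = checkpoint["offset"], checkpoint["total"]
    while offset < len(log):
        if offset == fail_at:
            raise RuntimeError("simulated crash")
        total += log[offset]
        offset += 1
        if offset % checkpoint_every == 0:
            # Persist both the position and the state reached at that position.
            checkpoint["offset"], checkpoint["total"] = offset, total
    checkpoint["offset"], checkpoint["total"] = offset, total
    return total

log = [1, 2, 3, 4, 5]
checkpoint = {"offset": 0, "total": 0}

try:
    process_from(log, checkpoint, fail_at=3)   # crashes after processing offsets 0..2
except RuntimeError:
    pass

saved_offset = checkpoint["offset"]            # progress survived up to the last checkpoint
total = process_from(log, checkpoint)          # restart: replay from the checkpoint
```

The message at offset 2 is processed twice here, once before the crash and once after the restart, which is why this style of recovery gives at-least-once processing.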

 • SPARK
   • A batch-processing framework that is often mentioned as the unofficial successor of Hadoop, as it offers several benefits in comparison.
   • Significant performance improvements through in-memory caching.
   • Provides a variety of machine learning algorithms out of the box through the MLlib library.

 • SPARK – Architecture

Discussion
                      STORM            SAMZA                       SPARK
Achievable latency    << 100 ms        < 100 ms                    < 1 s
Processing model      one-at-a-time    one-at-a-time               micro-batch
Ordering guarantees   No               within stream partitions    between batches
Elasticity            Yes              No                          Yes
All these different systems show that low latency is involved in a
number of trade-offs with other desirable properties such as
throughput, fault-tolerance, reliability (processing guarantees) and
ease of development.

References
• https://www.quora.com/What-are-the-differences-between-Apache-Spark-Storm-Samza-Flink-Beam-Apex
• https://www.quora.com/What-are-the-differences-between-batch-processing-and-stream-processing-systems
• https://samza.apache.org/learn/documentation/0.10/introduction/architecture.html
• https://dzone.com/articles/streaming-big-data-storm-spark
• Paper: Real-time stream processing for Big Data
