Real-time stream processing
for Big Data
Presented by Luay AL-Assadi
INTRODUCTION
Rise of Web 2.0 and the Internet of Things.
• Huge amounts of data (e.g. sensors, social media, online marketing).
• Much of this information is only valuable for a short time and therefore has to be processed immediately.
• Example: monitoring user activity to optimize product or video recommendations for the current user context.
Traditional batch-oriented approaches.
• Complex Event Processing (CEP) engines and DBMSs.
Distributed data processing.
• MapReduce.
Real-time analytics: Big Data in motion
• Real-time data infrastructure:
  • Built from distributed components.
  • Components communicate via asynchronous networks.
  • Engineered on top of the JVM (Java Virtual Machine).
• Basic architecture model for real-time Big Data:
  • Collect data from various sources.
  • Move the data to a streaming layer.
  • Analyze the data in a stream processor.
  • Forward the outputs to a serving layer.
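The four steps of the basic architecture model can be sketched in a few lines of Python. This is only an illustration: the in-memory `deque` stands in for a real streaming layer, the list stands in for a serving layer, and the names `collect`, `run_pipeline`, `sensor` and `clicks` are made up for this sketch.

```python
from collections import deque


def collect(sources):
    """Collection step: gather raw events from several (hypothetical) sources."""
    for source in sources:
        yield from source


def run_pipeline(sources, process):
    streaming_layer = deque()   # stands in for a message queue
    serving_layer = []          # stands in for a queryable store

    # Steps 1 and 2: collect the data and move it to the streaming layer.
    for event in collect(sources):
        streaming_layer.append(event)

    # Steps 3 and 4: analyze each event and forward the output.
    while streaming_layer:
        event = streaming_layer.popleft()
        serving_layer.append(process(event))
    return serving_layer


sensor = [{"sensor": "t1", "celsius": 21.5}]
clicks = [{"user": "u1", "page": "/home"}]
results = run_pipeline([sensor, clicks], lambda e: {**e, "processed": True})
```

In a real deployment each step would be a separate distributed component communicating over the network; here they run in one process purely to show the data flow.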
Real-time analytics: Big Data in motion
• Big Data Architecture Model: Lambda Architecture
Diagram: Collecting Data → Streaming Data → Batch processing (with Store Data) and Stream processing side by side → Serving Layer.
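A minimal sketch of the Lambda idea, assuming a simple page-view counter as the workload: every event goes both to a stored master dataset (batch path) and to an incremental real-time view (stream path), and the serving layer merges the two at query time. The class and method names are hypothetical.

```python
from collections import Counter


class LambdaStore:
    def __init__(self):
        self.master_dataset = []        # stored raw events, input of the batch layer
        self.batch_view = Counter()     # recomputed from scratch by the batch job
        self.realtime_view = Counter()  # updated per event by the stream processor

    def ingest(self, event):
        self.master_dataset.append(event)       # batch path: store the data
        self.realtime_view[event["page"]] += 1  # stream path: update immediately

    def run_batch_job(self):
        # Recompute the batch view over all stored data, then drop the
        # real-time results that the batch view now covers.
        self.batch_view = Counter(e["page"] for e in self.master_dataset)
        self.realtime_view.clear()

    def query(self, page):
        # Serving layer: merge batch and real-time results.
        return self.batch_view[page] + self.realtime_view[page]


store = LambdaStore()
store.ingest({"page": "/home"})
store.ingest({"page": "/home"})
store.run_batch_job()
store.ingest({"page": "/home"})   # arrives after the last batch run
home_views = store.query("/home")
```

The price of this design is that the same logic (here, counting) exists twice, once in the batch job and once in the stream path.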
Real-time analytics: Big Data in motion
• Big Data Architecture Model: Kappa Architecture
Diagram: Collecting Data → Streaming Data (Store, retain Data) → Stream processing → Serving Layer.
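The Kappa variant can be sketched as follows, assuming the streaming layer retains all events so that a view can be rebuilt by replaying them through new processing code. There is no separate batch path. `KappaPipeline`, `count_by_page` and `count_by_user` are illustrative names only.

```python
class KappaPipeline:
    def __init__(self):
        self.log = []   # retained, replayable stream of events

    def append(self, event):
        self.log.append(event)

    def build_view(self, process):
        """Replay the whole retained log through one stream-processing function."""
        view = {}
        for event in self.log:
            process(view, event)
        return view


def count_by_page(view, event):
    view[event["page"]] = view.get(event["page"], 0) + 1


def count_by_user(view, event):
    view[event["user"]] = view.get(event["user"], 0) + 1


pipeline = KappaPipeline()
pipeline.append({"user": "u1", "page": "/home"})
pipeline.append({"user": "u2", "page": "/home"})
pipeline.append({"user": "u1", "page": "/cart"})

view_v1 = pipeline.build_view(count_by_page)   # original logic
view_v2 = pipeline.build_view(count_by_user)   # changed logic, same retained data
```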
Real-time streamers
• RabbitMQ.
  • Broker-centric, with message acknowledgement.
  • Focused on delivery guarantees between producers and consumers.
  • Can fall over if consumers are too slow.
Diagram: Producer → Broker → Consumer, with a Message sent forward and an Ack returned.
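The broker-centric acknowledgement pattern can be illustrated with a toy in-memory broker. This is not the RabbitMQ API; the `Broker` class and its methods are assumptions made for this sketch. The broker tracks every delivered message until the consumer acknowledges it, and puts unacknowledged messages back on the queue.

```python
from collections import deque


class Broker:
    def __init__(self):
        self.ready = deque()   # messages waiting for delivery
        self.unacked = {}      # delivery tag -> message awaiting an ack
        self.next_tag = 1

    def publish(self, body):
        self.ready.append(body)

    def deliver(self):
        """Hand the next message to a consumer and remember it until acked."""
        if not self.ready:
            return None
        tag = self.next_tag
        self.next_tag += 1
        body = self.ready.popleft()
        self.unacked[tag] = body
        return tag, body

    def ack(self, tag):
        del self.unacked[tag]

    def requeue_unacked(self):
        """Called when a consumer disappears without acknowledging."""
        for tag in sorted(self.unacked):
            self.ready.append(self.unacked.pop(tag))


broker = Broker()
broker.publish("a")
broker.publish("b")
tag_a, _ = broker.deliver()
tag_b, _ = broker.deliver()
broker.ack(tag_a)             # "a" was processed
broker.requeue_unacked()      # consumer died before acking "b"
redelivered = broker.deliver()
```

Because the broker holds all per-message state, a slow consumer makes `ready` and `unacked` grow without bound, which is one way to read the warning on the slide.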
Real-time streamers
• Kafka.
  • Producer-centric.
  • Supports both online and offline consumers.
  • Uses ZooKeeper to reliably maintain its state across a cluster.
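Support for online and offline consumers follows from a log-based design: producers append to a retained log and each consumer only tracks how far it has read. The sketch below is a toy model, not the Kafka client API; `Log`, `append` and `poll` are hypothetical names, and the offsets live in a plain dict rather than in a coordination service.

```python
class Log:
    def __init__(self):
        self.records = []   # append-only, retained records
        self.offsets = {}   # consumer name -> next offset to read

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1   # offset of the new record

    def poll(self, consumer, max_records=10):
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch


log = Log()
log.append("e1")
online_first = log.poll("online")     # reads as data arrives
log.append("e2")
log.append("e3")
online_second = log.poll("online")
offline_all = log.poll("offline")     # connects later and catches up
```

Consumers do not affect each other or the producer: a consumer that is offline simply finds more records waiting when it next polls.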
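Real-time processors:
Latency vs. Throughput & Efficiency
• Low latency: handling data items immediately as they arrive.
• High throughput: buffering data items and processing them in batches increases efficiency.
Diagram: a spectrum from Stream (low latency) through Micro-Batch to Batch (high throughput).
  • Stream: STORM, SAMZA
  • Micro-Batch: Trident, SPARK Streaming
  • Batch: SPARK
• Micro-batching is reached from the stream side by grouping tuples into batches, and from the batch side by restricting the batch size.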
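The difference between one-at-a-time and micro-batch processing can be shown with a small generator that groups a stream into batches of bounded size. The function name `micro_batches` and the doubling and summing workloads are illustrative choices, not taken from any of the frameworks.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an (unbounded) stream into lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


events = range(7)

# One-at-a-time: each item is handled as soon as it arrives (low latency).
one_at_a_time = [e * 2 for e in events]

# Micro-batch: items are buffered, then handled together (higher throughput,
# but an item waits until its batch is full or the stream ends).
batched = [sum(batch) for batch in micro_batches(events, 3)]
```

A smaller `batch_size` moves the behaviour toward the stream end of the spectrum, a larger one toward the batch end.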
Real-time processors
• STORM
  • Developed by Nathan Marz at BackType, which was acquired by Twitter in 2011.
  • Initially promoted as the “Hadoop of real-time”.
  • The vital parts of a Storm deployment are a ZooKeeper cluster for reliable coordination, a master node (Nimbus) and worker nodes (Supervisors).
Real-time processors
• STORM
  • Topology: a network made of spouts and bolts, similar to a Hadoop MapReduce job.
  • Stream: an unbounded sequence of tuples.
  • Spouts: receive data continuously, transform it into a stream of tuples and send the tuples to the bolts.
  • Bolts: process the tuples they receive.
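The spout and bolt roles can be mimicked with chained Python generators. This is a conceptual sketch only and does not use Storm's API; the word-count workload and the function names `sentence_spout`, `split_bolt` and `count_bolt` are assumptions for the example.

```python
from collections import Counter


def sentence_spout(lines):
    """Spout: turns incoming data into a stream of tuples."""
    for line in lines:
        yield (line,)


def split_bolt(tuples):
    """Bolt: splits each sentence tuple into one tuple per word."""
    for (sentence,) in tuples:
        for word in sentence.split():
            yield (word,)


def count_bolt(tuples):
    """Bolt: keeps a running count and emits (word, count) per input tuple."""
    counts = Counter()
    for (word,) in tuples:
        counts[word] += 1
        yield (word, counts[word])


# Topology: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout(["to be", "or not to be"])))
final_counts = dict(topology)   # last emitted count per word
```

In Storm the spout would never finish and each component would run in parallel on several machines; the generators here only show how tuples flow through the network.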
Real-time processors
• STORM nodes
  • Master node: runs a daemon called ‘Nimbus’, which is similar to the JobTracker of a Hadoop cluster.
    • Assigns jobs.
    • Monitors performance.
Real-time processors
• STORM nodes
  • Worker node: runs a daemon called ‘Supervisor’, which runs one or more worker processes on its node.
  • Apache ZooKeeper facilitates communication between Nimbus and the Supervisors by holding message acknowledgements and processing status.
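The division of labour between Nimbus, the Supervisors and ZooKeeper can be modelled roughly as follows. This is a simplified sketch under the assumption that the master and the workers never talk directly and only read and write shared coordination state; the dict `coordination_state` stands in for ZooKeeper, and the function names and round-robin assignment are invented for illustration.

```python
# Stands in for the ZooKeeper cluster: shared state both sides can see.
coordination_state = {"assignments": {}, "status": {}}


def nimbus_assign(tasks, supervisors):
    """Master node: distribute tasks round-robin and record the assignment."""
    for index, task in enumerate(tasks):
        supervisor = supervisors[index % len(supervisors)]
        coordination_state["assignments"].setdefault(supervisor, []).append(task)


def supervisor_sync(name):
    """Worker node: report status and read which tasks to run locally."""
    coordination_state["status"][name] = "alive"
    return coordination_state["assignments"].get(name, [])


nimbus_assign(["spout-0", "bolt-0", "bolt-1"], ["sup-a", "sup-b"])
tasks_a = supervisor_sync("sup-a")
tasks_b = supervisor_sync("sup-b")
```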
Real-time processors
• SAMZA
  • Initially created at LinkedIn and submitted to the Apache Incubator in July 2013.
  • Co-developed with the queueing system Kafka.
  • Requires a little more work to deploy than Storm: it not only depends on a ZooKeeper cluster but also runs on top of Hadoop YARN.
Real-time processors
• SAMZA - YARN
  • YARN is a cluster scheduler. It allows you to allocate a number of containers (processes) in a cluster of machines and execute arbitrary commands on them. The Samza client uses YARN to run a Samza job.
  • NodeManager: responsible for launching processes on its machine.
  • ResourceManager: talks to all of the NodeManagers to tell them what to run.
  • ApplicationMaster: responsible for managing the application’s workload, asking for containers, and handling notifications when one of its containers fails.
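The three YARN roles can be illustrated with a toy model. None of this is the YARN API: the classes only mirror the responsibilities listed on the slide, and details such as the per-node `capacity` and the first-fit placement are assumptions made for this sketch.

```python
class NodeManager:
    """Launches processes (containers) on one machine."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.running = []

    def launch(self, command):
        self.running.append(command)


class ResourceManager:
    """Talks to all NodeManagers to tell them what to run."""

    def __init__(self, node_managers):
        self.node_managers = {nm.name: nm for nm in node_managers}

    def allocate(self, command):
        for nm in self.node_managers.values():
            if len(nm.running) < nm.capacity:   # first node with free capacity
                nm.launch(command)
                return nm.name
        return None

    def release(self, node_name, command):
        self.node_managers[node_name].running.remove(command)


class ApplicationMaster:
    """Manages one application's workload and reacts to container failures."""

    def __init__(self, resource_manager):
        self.rm = resource_manager
        self.containers = {}   # command -> node it runs on

    def start(self, commands):
        for command in commands:
            self.containers[command] = self.rm.allocate(command)

    def on_container_failed(self, command):
        self.rm.release(self.containers[command], command)
        self.containers[command] = self.rm.allocate(command)   # ask again


rm = ResourceManager([NodeManager("n1", 1), NodeManager("n2", 2)])
am = ApplicationMaster(rm)
am.start(["task-0", "task-1", "task-2"])
placement_before = dict(am.containers)
am.on_container_failed("task-0")
```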
Real-time processors
• SAMZA
  • Decouples individual processing steps.
  • Buffering data between processing steps makes (intermediate) results available to unrelated parties.
  • Prevents data loss by periodically checkpointing current progress and, after a failure, reprocessing all data from the last checkpoint.
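Checkpoint-and-replay can be sketched as below. This is a simplified model rather than Samza's implementation: the input is a plain list, the checkpoint is a dict holding an offset and a copy of the state, and `run_task`, `checkpoint_every` and `fail_at` are hypothetical names. After a crash the task restarts from the last checkpoint, so the items processed since then are handled a second time.

```python
def run_task(log, state, checkpoint, checkpoint_every=3, fail_at=None):
    """Count items in log, checkpointing progress every few items."""
    # Resume from the last checkpoint, discarding any later progress.
    offset = checkpoint["offset"]
    state.clear()
    state.update(checkpoint["state"])

    while offset < len(log):
        if offset == fail_at:
            raise RuntimeError("simulated crash")
        item = log[offset]
        state[item] = state.get(item, 0) + 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint["offset"] = offset
            checkpoint["state"] = dict(state)


log = ["a", "b", "a", "c", "a"]
checkpoint = {"offset": 0, "state": {}}
state = {}

try:
    run_task(log, state, checkpoint, fail_at=4)   # crashes before the last item
except RuntimeError:
    pass

run_task(log, state, checkpoint)   # restart: "c" is reprocessed from offset 3
```

No input is lost, but an item between the last checkpoint and the failure is seen twice, which is why downstream effects of such a task need to tolerate repeats.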
Real-time processors
• SPARK
  • A batch-processing framework that is often mentioned as the unofficial successor of Hadoop, as it offers several benefits in comparison.
  • Significant performance improvements through in-memory caching.
  • Provides a variety of machine learning algorithms out of the box through the MLlib library.
Real-time processors
• SPARK – Architecture
Discussion
                      STORM           SAMZA                      SPARK
Achievable latency    << 100 ms       < 100 ms                   < 1 s
Processing model      one-at-a-time   one-at-a-time              micro-batch
Ordering guarantees   No              within stream partitions   between batches
Elasticity            Yes             No                         Yes
Comparing these systems shows that low latency has to be traded off against
other desirable properties such as throughput, fault tolerance,
reliability (processing guarantees) and ease of development.
References
• https://www.quora.com/What-are-the-differences-between-Apache-Spark-Storm-Samza-Flink-Beam-Apex
• https://www.quora.com/What-are-the-differences-between-batch-processing-and-stream-processing-systems
• https://samza.apache.org/learn/documentation/0.10/introduction/architecture.html
• https://dzone.com/articles/streaming-big-data-storm-spark
• Paper: Real-time stream processing for Big Data
THANKS
