STORM
    COMPARISON – INTRODUCTION - CONCEPTS




PRESENTATION BY KASPER MADSEN
MARCH - 2012
HADOOP                              VS             STORM
     Batch processing                            Real-time processing
     Jobs run to completion                    Topologies run forever
     JobTracker is a SPOF*                   No single point of failure
     Stateful nodes                                    Stateless nodes
     Scalable                                                 Scalable
     Guarantees no data loss                  Guarantees no data loss
     Open source                                          Open source




* Hadoop 0.21 added some checkpointing
 SPOF: Single Point Of Failure
COMPONENTS
     The Nimbus daemon is the master; it is comparable to the Hadoop JobTracker.
     The Supervisor daemon spawns workers; it is comparable to the Hadoop TaskTracker.
     A worker is spawned by a supervisor, one per port defined in the storm.yaml configuration.
     A task runs as a thread inside a worker.
     Zookeeper* is a distributed system used to store metadata. The Nimbus and
     Supervisor daemons are fail-fast and stateless; all state is kept in Zookeeper.




         Notice that all communication between Nimbus and
           the Supervisors is done through Zookeeper.

      On a cluster with 2k+1 Zookeeper nodes, the system
          can recover as long as at most k nodes fail.




* Zookeeper is an Apache top-level project
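The 2k+1 rule above is simple majority arithmetic; the following stand-alone snippet (an illustration, not part of STORM or Zookeeper) shows how many node failures an ensemble of a given size tolerates.

```java
// Illustrative arithmetic only: a ZooKeeper ensemble of n = 2k+1 nodes keeps
// a majority quorum while at most k = (n-1)/2 nodes are down.
public class ZkQuorum {
    // Maximum number of failed nodes an ensemble of size n can tolerate.
    static int maxFailures(int n) {
        return (n - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 3, 5, 7}) {
            System.out.println(n + " nodes tolerate " + maxFailures(n) + " failure(s)");
        }
    }
}
```

Note that adding one node to an odd-sized ensemble (e.g. going from 3 to 4) does not increase the number of tolerated failures, which is why ensembles are typically sized 2k+1.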
STREAMS
A stream is an unbounded sequence of tuples.
A topology is a graph where each node is a spout or a bolt, and the edges indicate
which bolts subscribe to which streams.
•   A spout is a source of a stream
•   A bolt consumes a stream (and possibly emits a new one)
•   An edge represents a grouping

[Diagram: two spouts are the sources of streams A and B. One bolt subscribes to A
and emits C, another subscribes to A and emits D, a third subscribes to C & D,
and a fourth subscribes to A & B.]
GROUPINGS
Each spout or bolt runs X instances in parallel, called tasks.
Groupings decide which task in the subscribing bolt each tuple is sent to.
Shuffle grouping     distributes tuples randomly
Fields grouping      groups by field value, so equal values go to the same task
All grouping         replicates every tuple to all tasks
Global grouping      makes all tuples go to one task
None grouping        makes the bolt run in the same thread as the bolt/spout it subscribes to
Direct grouping      lets the producer (the task that emits) decide which consumer task receives the tuple
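The fields grouping can be illustrated with a small stand-alone sketch (not Storm's actual routing code): hashing the grouped field's value and taking it modulo the task count routes equal values to the same task.

```java
import java.util.Objects;

// Illustrative sketch only (not Storm's implementation): a fields grouping
// picks the destination task from a hash of the grouped field's value, so
// tuples with equal values always reach the same task.
public class FieldsGroupingDemo {
    // Choose a task index in [0, numTasks) from the grouping field's value.
    static int chooseTask(Object fieldValue, int numTasks) {
        return Math.floorMod(Objects.hashCode(fieldValue), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        // The same word maps to the same task on every call.
        System.out.println("storm  -> task " + chooseTask("storm", numTasks));
        System.out.println("storm  -> task " + chooseTask("storm", numTasks));
        System.out.println("hadoop -> task " + chooseTask("hadoop", numTasks));
    }
}
```

This is exactly what makes fields grouping useful for aggregations such as word counts: all tuples for a given word end up at the task holding that word's counter.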
[Diagram: TestWordSpout → ExclamationBolt → ExclamationBolt, with the task count
shown for each component.]

    EXAMPLE
     TopologyBuilder builder = new TopologyBuilder();

     // Create a stream called "words", running 10 tasks
     builder.setSpout("words", new TestWordSpout(), 10);

     // Create a stream called "exclaim1", running 3 tasks,
     // subscribing to stream "words" with a shuffle grouping
     builder.setBolt("exclaim1", new ExclamationBolt(), 3)
                 .shuffleGrouping("words");

     // Create a stream called "exclaim2", running 2 tasks,
     // subscribing to stream "exclaim1" with a shuffle grouping
     builder.setBolt("exclaim2", new ExclamationBolt(), 2)
                 .shuffleGrouping("exclaim1");


        A bolt can subscribe to an unlimited number of
               streams by chaining groupings.


The source code for this example is part of the storm-starter project on GitHub.

EXAMPLE – 1
TestWordSpout
public void nextTuple() {
     Utils.sleep(100);
     final String[] words = new String[] {"nathan", "mike", "jackson", "golda", "bertels"};
     final Random rand = new Random();
     final String word = words[rand.nextInt(words.length)];
     _collector.emit(new Values(word));
}



The TestWordSpout emits a random string from the
       words array every 100 milliseconds.

EXAMPLE – 2
ExclamationBolt

OutputCollector _collector;

// prepare is called when the bolt is created
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
      _collector = collector;
}

// execute is called for each tuple
public void execute(Tuple tuple) {
     _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
     _collector.ack(tuple);
}

// declareOutputFields is called when the bolt is created
public void declareOutputFields(OutputFieldsDeclarer declarer) {
     declarer.declare(new Fields("word"));
}


declareOutputFields declares streams and their schemas. It is possible to
declare several streams and to specify which stream to use when emitting
tuples in the emit call.
FAULT TOLERANCE
Zookeeper stores metadata in a very robust way.
Nimbus and Supervisor are stateless and only need the metadata from ZK to work/restart.
When a node dies
   • The tasks will time out and be reassigned to other workers by Nimbus.
When a worker dies
     • The supervisor will restart the worker.
     • Nimbus will reassign the worker to another supervisor if no heartbeats are sent.
     • If that is not possible (no free ports), the tasks will be run on other workers in the
       topology. If more capacity is added to the cluster later, STORM will
       automatically initialize a new worker and spread out the tasks.
When Nimbus or a Supervisor dies
     •   Workers will continue to run
     •   Workers cannot be reassigned without Nimbus
     •   Nimbus and Supervisor should be run under a process monitoring tool that
         restarts them automatically if they fail.
AT-LEAST-ONCE PROCESSING
STORM guarantees at-least-once processing of tuples.
A message id is assigned to a tuple when it is emitted from a spout or bolt; it is 64 bits long.
The tree of tuples is the set of tuples generated (directly and indirectly) from a spout tuple.
Ack is called on the spout when the tree of tuples for a spout tuple is fully processed.
Fail is called on the spout if one of the tuples in the tree fails, or if the tree of
tuples is not fully processed within a specified timeout (default 30 seconds).
It is possible to specify the message id when emitting a tuple. This can be useful for
replaying tuples from a queue.




                   The ack/fail method is called when the tree of
                  tuples has been fully processed, or has
                             failed / timed out.
AT-LEAST-ONCE PROCESSING – 2
Anchoring copies the spout tuple message id(s) to the newly generated tuples.
In this way, every tuple knows the message id(s) of all its spout tuples.
Multi-anchoring is when multiple tuples are anchored. If the tuple tree fails,
multiple spout tuples will be replayed. This is useful for streaming joins and more.
Ack called from a bolt indicates the tuple has been processed as intended.
Fail called from a bolt replays the spout tuple(s).
Every tuple must be acked or failed, or the task will eventually run out of memory.




_collector.emit(tuple, new Values(word));    // uses anchoring

_collector.emit(new Values(word));           // does NOT use anchoring
AT-LEAST-ONCE PROCESSING – 3
Acker tasks track the tree of tuples for every spout tuple.
     •   The acker task responsible for a given spout tuple is determined by taking the
         message id modulo the number of acker tasks. Since all tuples carry all spout
         tuple message ids, it is easy to call the correct acker tasks.
     •   An acker task stores a map of the form {spoutMsgId, {spoutTaskId, "ack val"}}.
     •   "ack val" represents the state of the entire tree of tuples: it is the XOR of
         all tuple message ids created and acked in the tree.
     •   When "ack val" is 0, the tuple tree is fully processed.
     •   Since message ids are random 64-bit numbers, the chance of "ack val"
         becoming 0 by accident is extremely small.




               It is important to set the number of acker tasks in the topology when
                  processing large amounts of tuples (it defaults to 1).
AT-LEAST-ONCE PROCESSING – 4
    Example

[Diagram: the Spout (task 1) emits "hey" with msgId 10 to a Bolt (task 2).
Bolt task 2 emits "h" (spoutIds: 10, msgId: 2) to a Bolt (task 3) and
"ey" (spoutIds: 10, msgId: 3) to a Bolt (task 4).]
This shows what happens in the acker task for one spout tuple. Format: {spoutMsgId, {spoutTaskId, "ack val"}}.
(Real message ids are 64 bits; short values are used here for readability.)
1. After emit "hey": {10, {1, 0000 XOR 1010 = 1010}}
2. After emit "h":   {10, {1, 1010 XOR 0010 = 1000}}
3. After emit "ey":  {10, {1, 1000 XOR 0011 = 1011}}
4. After ack "hey":  {10, {1, 1011 XOR 1010 = 0001}}
5. After ack "h":    {10, {1, 0001 XOR 0010 = 0011}}
6. After ack "ey":   {10, {1, 0011 XOR 0011 = 0000}}
7. Since "ack val" is 0, the spout tuple with id 10 must be fully processed. Ack is called on the spout (task 1).
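The XOR bookkeeping can be sketched in a few lines of stand-alone Java (an illustration, not Storm's acker code), using the same short ids as the example above.

```java
// Illustrative sketch only (not Storm's acker implementation): XOR every
// tuple id into "ack val" once when the tuple is created and once when it is
// acked. The value returns to 0 exactly when every created tuple is acked.
public class AckValDemo {
    long ackVal = 0;

    void created(long tupleId) { ackVal ^= tupleId; }
    void acked(long tupleId)   { ackVal ^= tupleId; }

    public static void main(String[] args) {
        AckValDemo acker = new AckValDemo();
        long[] ids = {0b1010L, 0b0010L, 0b0011L};   // "hey", "h", "ey" from the example

        for (long id : ids) acker.created(id);
        System.out.println("after creates, ack val = " + Long.toBinaryString(acker.ackVal));  // 1011
        for (long id : ids) acker.acked(id);
        System.out.println("after acks, ack val = " + Long.toBinaryString(acker.ackVal));     // 0
    }
}
```

Because XOR is commutative and associative, the order in which creates and acks arrive at the acker does not matter; only the multiset of ids does.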
AT-LEAST-ONCE PROCESSING – 5
A tuple isn't acked because its task died:
The spout tuple(s) at the root of the tree of tuples will time out and be replayed.
An acker task dies:
All the spout tuples the acker was tracking will time out and be replayed.
A spout task dies:
In this case the source the spout talks to is responsible for replaying the
messages. For example, queues like Kestrel and RabbitMQ will place all pending
messages back on the queue when a client disconnects.
AT-LEAST-ONCE PROCESSING – 6
At-least-once processing might process a tuple more than once.
Example

[Diagram: the Spout (task 1) emits to Bolt (task 2) and Bolt (task 3) via an all grouping.]

1. A spout tuple is emitted to tasks 2 and 3
2. The worker responsible for task 3 fails
3. The supervisor restarts the worker
4. The spout tuple is replayed and emitted to tasks 2 and 3
5. Task 2 will now have executed the same bolt twice

Consider why the all grouping is not important in this example.
EXACTLY-ONCE-PROCESSING
Transactional topologies (TT) are an abstraction built on STORM primitives.
TT guarantees exactly-once processing of tuples.
Acking is optimized in TT; there is no need to do anchoring or acking manually.
Bolts execute as new instances per attempt at processing a batch.


Example

[Diagram: the Spout (task 1) emits to Bolt (task 2) and Bolt (task 3) via an all grouping.]

1. A spout tuple is emitted to tasks 2 and 3
2. The worker responsible for task 3 fails
3. The supervisor restarts the worker
4. The spout tuple is replayed and emitted to tasks 2 and 3
5. Tasks 2 and 3 initiate new bolt instances because of the new attempt
6. Now there is no problem
EXACTLY-ONCE-PROCESSING – 2
For efficiency, batch processing of tuples is introduced in TT.
A batch has two states: processing or committing.
Many batches can be in the processing state concurrently.
Only one batch can be in the committing state, and a strong ordering is imposed: batch 1
will always be committed before batch 2, and so on.
Types of bolts in TT: BasicBolt, BatchBolt, and BatchBolt marked as a committer.
A BasicBolt processes one tuple at a time.
A BatchBolt processes batches; finishBatch is called when all tuples of a batch have been executed.
A BatchBolt marked as a committer calls finishBatch only when the batch is in the
committing state.
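The strong commit ordering can be sketched with a small stand-alone simulation (illustrative only, not Storm's implementation): batches may finish their processing phase in any order, but commits are released strictly in batch order.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative simulation only: batches finish processing out of order, but
// batch i commits only after batches 1..i-1 have committed.
public class CommitOrdering {
    int nextToCommit = 1;
    Set<Integer> doneProcessing = new HashSet<>();
    List<Integer> commitLog = new ArrayList<>();

    // Called when a batch finishes its processing phase.
    void finishedProcessing(int batchId) {
        doneProcessing.add(batchId);
        // Commit every batch that is both finished AND next in line.
        while (doneProcessing.remove(nextToCommit)) {
            commitLog.add(nextToCommit);
            nextToCommit++;
        }
    }

    public static void main(String[] args) {
        CommitOrdering topology = new CommitOrdering();
        topology.finishedProcessing(2);   // batch 2 finishes first, but must wait for 1
        topology.finishedProcessing(3);
        topology.finishedProcessing(1);   // releases commits 1, 2, 3 in order
        System.out.println(topology.commitLog);   // [1, 2, 3]
    }
}
```

This is why TT can pipeline the (parallel, slow) processing phase while keeping the (serialized) committing phase short.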
EXACTLY-ONCE-PROCESSING – 3
[Diagram: a transactional spout, which has the capability to replay exact batches
of tuples, feeds a pipeline of batch bolts A → B → C → D; bolts B and D are
marked as committers.]
BATCH IS IN THE PROCESSING STATE
Bolt A:   execute is called for all tuples received from the spout;
          finishBatch is called when the first batch has been received.
Bolt B:   execute is called for all tuples received from bolt A;
          finishBatch is NOT called, because the batch is in the processing state.
Bolt C:   execute is called for all tuples received from bolt A (and B);
          finishBatch is NOT called, because bolt B has not called finishBatch.
Bolt D:   execute is called for all tuples received from bolt C;
          finishBatch is NOT called, because the batch is in the processing state.
THE BATCH CHANGES TO THE COMMITTING STATE
Bolt B:   finishBatch is called.
Bolt C:   finishBatch is called, because we now know we have all tuples from bolt B.
Bolt D:   finishBatch is called, because we now know we have all tuples from bolt C.
EXACTLY-ONCE-PROCESSING – 4
    Transactional spout

[Diagram: the transactional spout consists of a coordinator (a regular spout with a
parallelism of 1 and two defined streams, "batch" and "commit") and emitter tasks
(regular bolts with a parallelism of P, subscribed to the batch stream with all
groupings).]

When a batch should enter the processing state:
•  The coordinator emits a tuple with the TransactionAttempt and the metadata for that
   transaction to the "batch" stream.
•  All emitter tasks receive the tuple and begin to emit their portion of tuples for
   the given batch.

When the processing phase of the batch is done (determined by the acker task):
•  Ack is called on the coordinator.

When ack is called on the coordinator and all prior transactions have committed:
•  The coordinator emits a tuple with the TransactionAttempt to the "commit" stream.
•  All bolts marked as committers subscribe to the commit stream of the
   coordinator using an all grouping.
•  Bolts marked as committers now know the batch is in the committing phase.

When the batch is fully processed again (determined by the acker task):
•  Ack is called on the coordinator.
•  The coordinator now knows the batch is committed.
STORM LIBRARIES
STORM uses a lot of libraries. The most prominent are
Clojure    a Lisp dialect on the JVM. Crash-course follows
Jetty      an embedded web server, used to host the UI of Nimbus
Kryo       a fast serialization library, used when sending tuples
Thrift     a framework for building services. Nimbus is a Thrift daemon
ZeroMQ     a very fast transport layer
Zookeeper  a distributed system for storing metadata
LEARN MORE
Wiki (https://github.com/nathanmarz/storm/wiki)
Storm-starter (https://github.com/nathanmarz/storm-starter)
Mailing list (http://groups.google.com/group/storm-user)
#storm-user room on freenode





 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 

Storm vs Hadoop comparison and introduction to concepts

  • 1. STORM: COMPARISON - INTRODUCTION - CONCEPTS. Presentation by Kasper Madsen, March 2012
  • 2. HADOOP VS STORM

      HADOOP                            STORM
      Batch processing                  Real-time processing
      Jobs run to completion            Topologies run forever
      JobTracker is SPOF*               No single point of failure
      Stateful nodes                    Stateless nodes
      Scalable                          Scalable
      Guarantees no data loss           Guarantees no data loss
      Open source                       Open source

      * Hadoop 0.21 added some checkpointing. SPOF: Single Point Of Failure.
  • 3. COMPONENTS
      Nimbus daemon is comparable to the Hadoop JobTracker; it is the master.
      Supervisor daemon spawns workers; it is comparable to the Hadoop TaskTracker.
      Worker is spawned by a supervisor, one per port defined in the storm.yaml configuration.
      Task runs as a thread inside a worker.
      Zookeeper* is a distributed system used to store metadata. The Nimbus and Supervisor daemons are fail-fast and stateless; all state is kept in Zookeeper.
      Notice that all communication between Nimbus and the Supervisors is done through Zookeeper.
      On a cluster with 2k+1 Zookeeper nodes, the system can recover when at most k nodes fail.
      * Zookeeper is an Apache top-level project.
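The 2k+1 quorum rule can be sketched in a couple of lines; a minimal illustration of the arithmetic, assuming hypothetical helper names (this is not Storm or Zookeeper code, just the majority-quorum math the slide refers to):

```java
// Hypothetical helper illustrating Zookeeper-style quorum math.
public class QuorumMath {
    // An ensemble of n nodes stays available while a majority is alive,
    // so it tolerates floor((n - 1) / 2) failures.
    public static int maxFailures(int n) {
        return (n - 1) / 2;
    }

    // Smallest ensemble that tolerates k failures: 2k + 1 nodes.
    public static int ensembleFor(int k) {
        return 2 * k + 1;
    }
}
```

This is why Zookeeper ensembles are usually sized with an odd number of nodes: going from 3 to 4 nodes adds cost without increasing the number of tolerated failures.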
  • 4. STREAMS
      A stream is an unbounded sequence of tuples.
      A topology is a graph where each node is a spout or a bolt, and the edges indicate which bolts subscribe to which streams.
      • A spout is a source of a stream.
      • A bolt consumes a stream (and possibly emits a new one).
      • An edge represents a grouping.
      (Diagram: two spouts are sources of streams A and B; bolts subscribe to A and emit C and D; downstream bolts subscribe to C & D and to A & B.)
  • 5. GROUPINGS
      Each spout or bolt runs X instances in parallel (called tasks). Groupings are used to decide which task in the subscribing bolt a tuple is sent to.
      Shuffle grouping is a random grouping.
      Fields grouping groups by value, such that equal values result in the same task.
      All grouping replicates to all tasks.
      Global grouping makes all tuples go to one task.
      None grouping makes the bolt run in the same thread as the bolt/spout it subscribes to.
      Direct grouping: the producer (the task that emits) controls which consumer will receive the tuple.
      (Diagram: a topology whose components run 4, 3, 2, and 2 tasks.)
  • 6. EXAMPLE
      (Topology: TestWordSpout -> ExclamationBolt -> ExclamationBolt)

      TopologyBuilder builder = new TopologyBuilder();
      // Create a stream called "words", run 10 tasks
      builder.setSpout("words", new TestWordSpout(), 10);
      // Create a stream called "exclaim1", run 3 tasks,
      // subscribe to stream "words" using shuffle grouping
      builder.setBolt("exclaim1", new ExclamationBolt(), 3)
             .shuffleGrouping("words");
      // Create a stream called "exclaim2", run 2 tasks,
      // subscribe to stream "exclaim1" using shuffle grouping
      builder.setBolt("exclaim2", new ExclamationBolt(), 2)
             .shuffleGrouping("exclaim1");

      A bolt can subscribe to an unlimited number of streams, by chaining groupings.
      The source code for this example is part of the storm-starter project on GitHub.
  • 7. EXAMPLE – 1: TestWordSpout
      (Topology: TestWordSpout -> ExclamationBolt -> ExclamationBolt)

      public void nextTuple() {
          Utils.sleep(100);
          final String[] words = new String[] {"nathan", "mike", "jackson", "golda", "bertels"};
          final Random rand = new Random();
          final String word = words[rand.nextInt(words.length)];
          _collector.emit(new Values(word));
      }

      The TestWordSpout emits a random string from the words array every 100 milliseconds.
  • 8. EXAMPLE – 2: ExclamationBolt
      (Topology: TestWordSpout -> ExclamationBolt -> ExclamationBolt)

      OutputCollector _collector;

      // prepare is called when the bolt is created
      public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
          _collector = collector;
      }

      // execute is called for each tuple
      public void execute(Tuple tuple) {
          _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
          _collector.ack(tuple);
      }

      // declareOutputFields is called when the bolt is created
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
          declarer.declare(new Fields("word"));
      }

      declareOutputFields is used to declare streams and their schemas. It is possible to declare several streams and to specify which stream to use when emitting tuples in the emit call.
  • 9. FAULT TOLERANCE
      Zookeeper stores metadata in a very robust way. Nimbus and Supervisor are stateless and only need metadata from Zookeeper to work/restart.
      When a node dies:
      • The tasks will time out and be reassigned to other workers by Nimbus.
      When a worker dies:
      • The supervisor will restart the worker.
      • Nimbus will reassign the worker to another supervisor if no heartbeats are sent.
      • If that is not possible (no free ports), the tasks will run on other workers in the topology. If more capacity is added to the cluster later, Storm will automatically initialize a new worker and spread out the tasks.
      When Nimbus or a Supervisor dies:
      • Workers will continue to run.
      • Workers cannot be reassigned without Nimbus.
      • Nimbus and Supervisor should be run under a process monitoring tool that restarts them automatically if they fail.
  • 10. AT-LEAST-ONCE PROCESSING
      Storm guarantees at-least-once processing of tuples.
      A message id is assigned to a tuple when it is emitted from a spout or bolt; it is 64 bits long.
      The tree of tuples is the set of tuples generated (directly and indirectly) from a spout tuple.
      Ack is called on the spout when the tree of tuples for a spout tuple is fully processed.
      Fail is called on the spout if one of the tuples in the tree fails, or if the tree is not fully processed within a specified timeout (default is 30 seconds).
      It is possible to specify the message id when emitting a tuple. This can be useful for replaying tuples from a queue.
  • 11. AT-LEAST-ONCE PROCESSING – 2
      Anchoring copies the spout tuple message id(s) to the newly generated tuples. In this way, every tuple knows the message id(s) of all its spout tuples.
      Multi-anchoring is when multiple tuples are anchored. If the tuple tree fails, multiple spout tuples will be replayed. Useful for streaming joins and more.
      Ack called from a bolt indicates the tuple has been processed as intended.
      Fail called from a bolt replays the spout tuple(s).
      Every tuple must be acked or failed, or the task will run out of memory at some point.

      _collector.emit(tuple, new Values(word));   // uses anchoring
      _collector.emit(new Values(word));          // does NOT use anchoring
  • 12. AT-LEAST-ONCE PROCESSING – 3
      Acker tasks track the tree of tuples for every spout tuple.
      • The acker task responsible for a given spout tuple is determined by a modulo on the message id. Since all tuples carry all spout tuple message ids, it is easy to call the correct acker task.
      • An acker task stores a map of the form {spoutMsgId, {spoutTaskId, "ack val"}}.
      • The "ack val" represents the state of the entire tree of tuples: it is the XOR of all tuple message ids created and acked in the tree.
      • When the "ack val" is 0, the tuple tree is fully processed.
      • Since message ids are random 64-bit numbers, the chance of the "ack val" becoming 0 by accident is extremely small.
      It is important to set the number of acker tasks in the topology when processing large amounts of tuples (defaults to 1).
  • 13. AT-LEAST-ONCE PROCESSING – 4: Example
      (Diagram: Spout task 1 emits "hey" with msgId 10; Bolt task 2 emits "h" with msgId 2 and "ey" with msgId 3, both carrying spoutId 10; Bolt tasks 3 and 4 consume them.)
      What happens in the acker task for one spout tuple. Format: {spoutMsgId, {spoutTaskId, "ack val"}}. (4-bit ids are shown for readability; in reality ids are 64 bits.)
      1. After emit "hey": {10, {1, 0000 XOR 1010 = 1010}}
      2. After emit "h":   {10, {1, 1010 XOR 0010 = 1000}}
      3. After emit "ey":  {10, {1, 1000 XOR 0011 = 1011}}
      4. After ack "hey":  {10, {1, 1011 XOR 1010 = 0001}}
      5. After ack "h":    {10, {1, 0001 XOR 0010 = 0011}}
      6. After ack "ey":   {10, {1, 0011 XOR 0011 = 0000}}
      7. Since the "ack val" is 0, the spout tuple with id 10 must be fully processed. Ack is called on the spout (task 1).
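The XOR bookkeeping above fits in a few lines of code; a minimal sketch of an acker's "ack val" (hypothetical class, not Storm's actual implementation; the example uses the small ids from the slide, while real Storm uses random 64-bit ids):

```java
// Minimal sketch of the acker's XOR bookkeeping.
public class AckVal {
    private long val = 0;

    // XOR in an id when a tuple is created (emitted) ...
    public void emitted(long msgId) { val ^= msgId; }

    // ... and XOR the same id in again when that tuple is acked.
    public void acked(long msgId) { val ^= msgId; }

    // Every id is XORed in exactly twice once its tuple is acked,
    // so the tree is fully processed exactly when the value is back to 0.
    public boolean fullyProcessed() { return val == 0; }
}
```

Replaying the slide's example: emitted(10), emitted(2), emitted(3) followed by acked(10), acked(2), acked(3) brings the value back to 0, regardless of the order in which emits and acks interleave.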
  • 14. AT-LEAST-ONCE PROCESSING – 5
      A tuple isn't acked because the task died: the spout tuple(s) at the root of the tree of tuples will time out and be replayed.
      An acker task dies: all the spout tuples the acker was tracking will time out and be replayed.
      A spout task dies: the source the spout talks to is responsible for replaying the messages. For example, queues like Kestrel and RabbitMQ will place all pending messages back on the queue when a client disconnects.
  • 15. AT-LEAST-ONCE PROCESSING – 6
      At-least-once processing might process a tuple more than once. Example (all grouping; Spout task 1 feeds Bolt tasks 2 and 3):
      1. A spout tuple is emitted to tasks 2 and 3.
      2. The worker responsible for task 3 fails.
      3. The supervisor restarts the worker.
      4. The spout tuple is replayed and emitted to tasks 2 and 3.
      5. Task 2 will now have executed the same bolt twice.
      Consider why the all grouping is not important in this example.
  • 16. EXACTLY-ONCE PROCESSING
      Transactional topologies (TT) are an abstraction built on Storm primitives. TT guarantees exactly-once processing of tuples.
      Acking is optimized in TT; there is no need to do anchoring or acking manually.
      Bolts execute as new instances per attempt at processing a batch.
      Example (all grouping; Spout task 1 feeds Bolt tasks 2 and 3):
      1. A spout tuple is emitted to tasks 2 and 3.
      2. The worker responsible for task 3 fails.
      3. The supervisor restarts the worker.
      4. The spout tuple is replayed and emitted to tasks 2 and 3.
      5. Tasks 2 and 3 initiate new bolts because of the new attempt.
      6. Now there is no problem.
  • 17. EXACTLY-ONCE PROCESSING – 2
      For efficiency, batch processing of tuples is introduced in TT.
      A batch has two states: processing or committing.
      Many batches can be in the processing state concurrently.
      Only one batch can be in the committing state, and a strong ordering is imposed: batch 1 will always be committed before batch 2, and so on.
      Types of bolts for TT: BasicBolt, BatchBolt, and BatchBolt marked as committer.
      BasicBolt processes one tuple at a time.
      BatchBolt processes batches; finishBatch is called when all tuples of the batch have been executed.
      A BatchBolt marked as committer calls finishBatch only when the batch is in the committing state.
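The strong commit ordering described above can be sketched as a small bookkeeping class; an illustrative model only (hypothetical names, not Storm's coordinator code): batches may finish processing in any order, but commits advance strictly in batch order.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of TT commit ordering: batches finish processing out of
// order, but only consecutive batches starting from 1 get committed.
public class CommitOrdering {
    private final Set<Integer> processed = new HashSet<>();
    private int nextToCommit = 1;

    // A batch has left the processing state and is ready to commit.
    public void finishedProcessing(int batchId) {
        processed.add(batchId);
    }

    // Commit as many consecutive batches as are ready, in order,
    // and return the highest committed batch id (0 if none yet).
    public int commitReady() {
        while (processed.contains(nextToCommit)) {
            processed.remove(nextToCommit);
            nextToCommit++;
        }
        return nextToCommit - 1;
    }
}
```

So if batch 2 finishes before batch 1, it simply waits: nothing commits until batch 1 is done, after which both commit in order.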
  • 18. EXACTLY-ONCE PROCESSING – 3
      A transactional spout has the capability to replay exact batches of tuples.
      (Diagram: a chain of BatchBolts A -> B -> C -> D, where B and D are marked as committers.)
      BATCH IS IN PROCESSING STATE
      Bolt A: execute is called for all tuples received from the spout; finishBatch is called when the first batch is received.
      Bolt B: execute is called for all tuples received from bolt A; finishBatch is NOT called, because the batch is in the processing state.
      Bolt C: execute is called for all tuples received from bolt A (and B); finishBatch is NOT called, because bolt B has not called finishBatch.
      Bolt D: execute is called for all tuples received from bolt C; finishBatch is NOT called, because the batch is in the processing state.
      BATCH CHANGES TO COMMITTING STATE
      Bolt B: finishBatch is called.
      Bolt C: finishBatch is called, because we know we have all tuples from bolt B now.
      Bolt D: finishBatch is called, because we know we have all tuples from bolt C now.
  • 19. EXACTLY-ONCE PROCESSING – 4
      (Diagram: a transactional spout consists of a coordinator, which is a regular spout with parallelism 1, and emitters, which are regular bolts with parallelism P. Defined streams: batch & commit. All groupings connect the coordinator to the emitters and committer bolts.)
      When a batch should enter the processing state:
      • The coordinator emits a tuple with the TransactionAttempt and the metadata for that transaction to the "batch" stream.
      • All emitter tasks receive the tuple and begin to emit their portion of tuples for the given batch.
      When the processing phase of the batch is done (determined by the acker task):
      • Ack gets called on the coordinator.
      When ack gets called on the coordinator and all prior transactions have committed:
      • The coordinator emits a tuple with the TransactionAttempt to the "commit" stream.
      • All bolts marked as committers subscribe to the commit stream of the coordinator using an all grouping.
      • Bolts marked as committers now know the batch is in the committing phase.
      When the batch is fully processed again (determined by the acker task):
      • Ack gets called on the coordinator.
      • The coordinator knows the batch is now committed.
  • 20. STORM LIBRARIES
      Storm uses a lot of libraries. The most prominent are:
      Clojure - a Lisp dialect (crash course follows).
      Jetty - an embedded web server, used to host the Nimbus UI.
      Kryo - a fast serializer, used when sending tuples.
      Thrift - a framework for building services; Nimbus is a Thrift daemon.
      ZeroMQ - a very fast transport layer.
      Zookeeper - a distributed system for storing metadata.
  • 21. LEARN MORE
      Wiki (https://github.com/nathanmarz/storm/wiki)
      Storm-starter (https://github.com/nathanmarz/storm-starter)
      Mailing list (http://groups.google.com/group/storm-user)
      #storm-user room on freenode
      Image from: http://www.cupofjoe.tv/2010/11/learn-lesson.html