Storm Anatomy
Eiichiro Uchiumi
http://www.eiichiro.org/
Introducing Storm's concepts, programming model, and internal architecture

Published in: Technology
Storm Anatomy

  1. 1. Storm Anatomy Eiichiro Uchiumi http://www.eiichiro.org/
  2. 2. About Me
     Eiichiro Uchiumi
     • A solutions architect working in emerging enterprise technologies
       - Cloud transformation
       - Enterprise mobility
       - Information optimization (big data)
     https://github.com/eiichiro
     @eiichirouchiumi
     http://www.facebook.com/eiichiro.uchiumi
  3. 3. What is Stream Processing?
     Stream processing is a paradigm for processing a high-volume, unbounded sequence of tuples in real time.
     • Algorithmic trading
     • Sensor data monitoring
     • Continuous analytics
     (Diagram: a stream source feeding a stream processor)
  4. 4. What is Storm?
     Storm is a distributed realtime computation system that is
     • Fast & scalable
     • Fault-tolerant
     • Guaranteed to process messages
     • Easy to set up & operate
     • Free & open source
       - Originally developed by Nathan Marz at BackType (acquired by Twitter)
       - Written in Java and Clojure
  5. 5. Conceptual View
     (Diagram: spouts emitting tuples into a graph of bolts)
     • Spout: Source of streams
     • Bolt: Consumer of streams that does some processing and possibly emits new tuples
     • Stream: Unbounded sequence of tuples
     • Tuple: List of name-value pairs
     • Topology: Graph of computation with spouts and bolts as the nodes and streams as the edges
  6. 6. Physical View
     (Diagram: Nimbus, a ZooKeeper cluster, and Supervisors on worker nodes; each worker node runs N worker processes, each worker runs N executors, and each executor runs N tasks)
     • Nimbus: Master daemon process responsible for
       - distributing code
       - assigning tasks
       - monitoring failures
     • ZooKeeper: Stores cluster operational state
     • Supervisor: Worker daemon process that listens for work assigned to its node
     • Worker: Java process that executes a subset of a topology
     • Executor: Java thread spawned by a worker that runs one or more tasks of the same component
     • Task: Component (spout/bolt) instance that performs the actual data processing
  7. 7. Spout

         import java.util.Map;
         import java.util.Random;

         import backtype.storm.spout.SpoutOutputCollector;
         import backtype.storm.task.TopologyContext;
         import backtype.storm.topology.OutputFieldsDeclarer;
         import backtype.storm.topology.base.BaseRichSpout;
         import backtype.storm.tuple.Fields;
         import backtype.storm.tuple.Values;
         import backtype.storm.utils.Utils;

         public class RandomSentenceSpout extends BaseRichSpout {
             SpoutOutputCollector collector;
             Random random;

             // Called once when the spout task is initialized.
             @Override
             public void open(Map conf, TopologyContext context,
                     SpoutOutputCollector collector) {
                 this.collector = collector;
                 random = new Random();
             }

             // Called repeatedly by Storm; emits one random sentence per call.
             @Override
             public void nextTuple() {
                 String[] sentences = new String[] {
                         "the cow jumped over the moon",
                         "an apple a day keeps the doctor away",
                         "four score and seven years ago",
                         "snow white and the seven dwarfs",
                         "i am at two with nature"
                 };
                 String sentence = sentences[random.nextInt(sentences.length)];
                 collector.emit(new Values(sentence));
             }

  8. 8. Spout (continued)

             // Names the single field of every tuple this spout emits.
             @Override
             public void declareOutputFields(OutputFieldsDeclarer declarer) {
                 declarer.declare(new Fields("sentence"));
             }

             // No-ops here: tuples are emitted without a message id,
             // so Storm does not track them.
             @Override
             public void ack(Object msgId) {}

             @Override
             public void fail(Object msgId) {}
         }
  9. 9. Bolt

         import java.util.Map;

         import backtype.storm.task.OutputCollector;
         import backtype.storm.task.TopologyContext;
         import backtype.storm.topology.OutputFieldsDeclarer;
         import backtype.storm.topology.base.BaseRichBolt;
         import backtype.storm.tuple.Fields;
         import backtype.storm.tuple.Tuple;
         import backtype.storm.tuple.Values;

         public class SplitSentenceBolt extends BaseRichBolt {
             OutputCollector collector;

             // Called once when the bolt task is initialized.
             @Override
             public void prepare(Map stormConf, TopologyContext context,
                     OutputCollector collector) {
                 this.collector = collector;
             }

             // Splits the incoming sentence on whitespace and emits one tuple per word.
             @Override
             public void execute(Tuple input) {
                 for (String s : input.getString(0).split("\\s+")) {
                     collector.emit(new Values(s));
                 }
             }

             @Override
             public void declareOutputFields(OutputFieldsDeclarer declarer) {
                 declarer.declare(new Fields("word"));
             }
         }
  10. 10. Topology

          import backtype.storm.Config;
          import backtype.storm.LocalCluster;
          import backtype.storm.StormSubmitter;
          import backtype.storm.topology.TopologyBuilder;
          import backtype.storm.tuple.Fields;

          public class WordCountTopology {
              public static void main(String[] args) throws Exception {
                  TopologyBuilder builder = new TopologyBuilder();
                  // Spout with parallelism hint 2.
                  builder.setSpout("sentence", new RandomSentenceSpout(), 2);
                  // Parallelism hint 4, running 8 tasks; sentences are shuffled across tasks.
                  builder.setBolt("split", new SplitSentenceBolt(), 4)
                          .shuffleGrouping("sentence")
                          .setNumTasks(8);
                  // Parallelism hint 6; tuples are grouped by the "word" field.
                  builder.setBolt("count", new WordCountBolt(), 6)
                          .fieldsGrouping("split", new Fields("word"));

                  Config config = new Config();
                  config.setNumWorkers(4);

                  StormSubmitter.submitTopology("wordcount", config, builder.createTopology());

                  // Local testing
                  // LocalCluster cluster = new LocalCluster();
                  // cluster.submitTopology("wordcount", config, builder.createTopology());
                  // Thread.sleep(10000);
                  // cluster.shutdown();
              }
          }
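The topology above wires "split" to "count" with a fields grouping on "word". A minimal plain-Java sketch of the idea follows, assuming the target task is chosen by hashing the grouping values modulo the number of tasks. The class and method names are hypothetical, and Storm's actual hashing is an implementation detail not shown in the slides.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not Storm's implementation.
final class GroupingSketch {
    /** Fields grouping: equal grouping values always map to the same task index. */
    static int fieldsGroupingTask(List<Object> groupingValues, int numTasks) {
        // Math.floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 6; // the "count" bolt in the slide runs 6 tasks
        for (String word : "the cow jumped over the moon".split("\\s+")) {
            int task = fieldsGroupingTask(Arrays.<Object>asList(word), numTasks);
            System.out.println(word + " -> task " + task);
        }
    }
}
```

Both occurrences of "the" land on the same task index, which is what lets a counting bolt keep a per-word total in local memory. A shuffle grouping, by contrast, makes no such promise and only spreads tuples evenly.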
  11. 11. Starting Topology
      (Diagram: StormSubmitter, Nimbus Thrift server, ZooKeeper)
      StormSubmitter (> bin/storm jar)
      - Uploads topology JAR to Nimbus’ inbox with dependencies
      - Submits topology configuration as JSON and structure as Thrift
      Nimbus
      - Copies topology JAR, configuration and structure into local file system
      - Sets up static information for topology
      - Makes assignment
      - Starts topology
  12. 12. Starting Topology (continued)
      (Diagram: Nimbus Thrift server, ZooKeeper, Supervisor, Worker, Executor, Task)
      Supervisor
      - Downloads topology JAR, configuration and structure
      - Writes assignment on its node into local file system
      - Starts worker based on the assignment
      Worker
      - Refreshes connections
      - Makes executors
      - Makes tasks
      - Starts processing
  14. 14. Extremely Significant Performance
  15. 15. Parallelism
      Component            Parallelism hint   Number of tasks
      RandomSentenceSpout  2                  Not specified = same as parallelism hint = 2
      SplitSentenceBolt    4                  8
      WordCountBolt        6                  Not specified = 6

      • Number of topology workers = 4
      • Number of worker slots / node = 4
      • Number of worker nodes = 2
      • Number of executor threads = 2 + 4 + 6 = 12
      • Number of component instances = 2 + 8 + 6 = 16
      (Diagram: two worker nodes, each running two worker processes; the executor threads for the spout and both bolts are spread across the four processes)
      A topology can be spread out manually, without downtime, when a worker node is added.
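The arithmetic on this slide can be reproduced with a few lines of plain Java. This is only a sketch of the counting rules stated above (tasks default to the parallelism hint unless set explicitly); the class and method names are made up for illustration.

```java
// Hypothetical helper that reproduces the slide's arithmetic.
final class ParallelismSketch {
    /** Tasks default to the parallelism hint when no task count is given. */
    static int effectiveTasks(int parallelismHint, Integer numTasks) {
        return numTasks == null ? parallelismHint : numTasks;
    }

    static int sum(int... values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] hints = {2, 4, 6}; // spout, split bolt, count bolt
        int[] tasks = {
            effectiveTasks(2, null), // RandomSentenceSpout
            effectiveTasks(4, 8),    // SplitSentenceBolt, setNumTasks(8)
            effectiveTasks(6, null)  // WordCountBolt
        };
        System.out.println("executor threads    = " + sum(hints)); // 12
        System.out.println("component instances = " + sum(tasks)); // 16
        // 8 tasks over 4 executors: each SplitSentenceBolt executor runs 2 tasks.
        System.out.println("split tasks per executor = " + tasks[1] / hints[1]); // 2
    }
}
```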
  16. 16. Message Passing
      (Diagram: inside a worker process, a receive thread takes tuples from other workers and feeds each executor's receiver queue; executors write to their transfer queues, and a transfer thread sends tuples from the internal transfer queue to other workers)
      • Interprocess communication is mediated by ZeroMQ
        - Outside transfer is done with Kryo serialization
      • Local communication is mediated by LMAX Disruptor
        - Inside transfer is done with no serialization
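The local-versus-remote decision described on this slide can be sketched in plain Java. Everything here is a stand-in: the class name is hypothetical, ordinary queues replace the Disruptor and ZeroMQ, and UTF-8 bytes replace Kryo serialization. The only point shown is that a tuple bound for a task in the same worker is handed over as an object, while a tuple bound for another worker is serialized first.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration of the routing decision; not Storm's code.
final class TransferSketch {
    final Set<Integer> localTasks;
    // Tuples for tasks in this worker: passed as objects, no serialization.
    final Queue<String> localReceiveQueue = new ArrayDeque<>();
    // Tuples for other workers: serialized before leaving the process.
    final Queue<byte[]> outboundTransferQueue = new ArrayDeque<>();

    TransferSketch(Integer... localTaskIds) {
        localTasks = new HashSet<>(Arrays.asList(localTaskIds));
    }

    void send(int targetTask, String tuple) {
        if (localTasks.contains(targetTask)) {
            localReceiveQueue.add(tuple);
        } else {
            // Stand-in for Kryo serialization.
            outboundTransferQueue.add(tuple.getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) {
        TransferSketch worker = new TransferSketch(1, 2);
        worker.send(1, "cow");  // task 1 lives in this worker
        worker.send(7, "moon"); // task 7 lives elsewhere
        System.out.println("local: " + worker.localReceiveQueue.size()
                + ", outbound: " + worker.outboundTransferQueue.size());
    }
}
```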
  17. 17. LMAX Disruptor
      A large concurrent "magic" ring buffer that can be used like a blocking queue.
      (Diagram: producer and consumer moving around the ring)
      • Consumer can easily keep up with producer by batching
      • CPU cache friendly
        - The ring is implemented as an array, so the entries can be preloaded
      • GC safe
        - The entries are preallocated up front and live forever
      6 million orders per second can be processed on a single thread at LMAX.
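The three properties listed above (array-backed ring, preallocated entries, batched consumption) can be illustrated with a small single-producer, single-consumer ring buffer. This is a simplified sketch under those assumptions, not the Disruptor library's API; the real library adds sequence barriers, wait strategies and multi-producer support, and all names below are made up.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

// Hypothetical single-producer, single-consumer ring buffer sketch.
final class RingBufferSketch {
    /** Mutable entry, allocated once up front and reused forever. */
    static final class Entry {
        long value;
    }

    private final Entry[] ring;
    private final int mask;
    private final AtomicLong published = new AtomicLong(-1); // last sequence written
    private final AtomicLong consumed = new AtomicLong(-1);  // last sequence read

    RingBufferSketch(int sizePowerOfTwo) {
        if (Integer.bitCount(sizePowerOfTwo) != 1) {
            throw new IllegalArgumentException("size must be a power of two");
        }
        ring = new Entry[sizePowerOfTwo];
        for (int i = 0; i < ring.length; i++) {
            ring[i] = new Entry(); // preallocate so no garbage is created later
        }
        mask = sizePowerOfTwo - 1;
    }

    /** Producer side: returns false when the ring is full. */
    boolean offer(long value) {
        long next = published.get() + 1;
        if (next - consumed.get() > ring.length) {
            return false; // would overwrite an entry the consumer has not read
        }
        ring[(int) (next & mask)].value = value;
        published.set(next); // publish only after the entry is written
        return true;
    }

    /** Consumer side: handles everything published so far as one batch. */
    int drain(LongConsumer handler) {
        long available = published.get();
        int count = 0;
        for (long seq = consumed.get() + 1; seq <= available; seq++) {
            handler.accept(ring[(int) (seq & mask)].value);
            count++;
        }
        consumed.set(available);
        return count;
    }

    public static void main(String[] args) {
        RingBufferSketch ring = new RingBufferSketch(4);
        for (long i = 1; i <= 5; i++) {
            System.out.println("offer(" + i + ") = " + ring.offer(i));
        }
        int batch = ring.drain(v -> System.out.println("consumed " + v));
        System.out.println("batch size = " + batch);
    }
}
```

The consumer reads the published sequence once and then processes every entry up to it, which is the batching effect the slide refers to: a consumer that falls behind catches up in a single pass.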
  19. 19. Fault-tolerance: Cluster works normally
      (Diagram: ZooKeeper, Nimbus, Supervisor, Worker)
      - Monitoring cluster state
      - Synchronizing assignment
      - Sending heartbeat
      - Reading worker heartbeat from local file system
      - Sending executor heartbeat
  20. 20. Fault-tolerance: Nimbus goes down
      (Same diagram and activities as slide 19)
      Processing still continues, but topology lifecycle operations and the reassignment facility are lost.
  21. 21. Fault-tolerance: Worker node goes down
      (Same diagram and activities as slide 19, with a second Supervisor and Worker)
      Nimbus reassigns the tasks to other machines and processing continues.
  22. 22. Fault-tolerance: Supervisor goes down
      (Same diagram and activities as slide 19)
      Processing still continues, but the assignment is never synchronized.
  23. 23. Fault-tolerance: Worker process goes down
      (Same diagram and activities as slide 19)
      The Supervisor restarts the worker process and processing continues.
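All of the recovery paths above depend on heartbeats: a component is treated as down once its last heartbeat is older than some timeout, and then it is restarted or its work is reassigned. A minimal sketch of that detection step follows. The class name, worker ids and the 30-second timeout are assumptions chosen for illustration, not values taken from the slides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical heartbeat monitor; not Storm's implementation.
final class HeartbeatMonitorSketch {
    private final long timeoutMillis;
    private final Map<String, Long> lastBeat = new HashMap<>();

    HeartbeatMonitorSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records the time of the latest heartbeat from a worker. */
    void heartbeat(String workerId, long nowMillis) {
        lastBeat.put(workerId, nowMillis);
    }

    /** Returns the workers whose last heartbeat is older than the timeout. */
    List<String> timedOut(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastBeat.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch(30_000);
        monitor.heartbeat("worker-1", 0);
        monitor.heartbeat("worker-2", 20_000);
        // At t = 40 s only worker-1 has been silent for more than 30 s.
        System.out.println("restart or reassign: " + monitor.timedOut(40_000));
    }
}
```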
  25. 25. Reliability API

          public class RandomSentenceSpout extends BaseRichSpout {
              public void nextTuple() {
                  ...;
                  UUID msgId = getMsgId();
                  // Emitting tuple with message id
                  collector.emit(new Values(sentence), msgId);
              }

              public void ack(Object msgId) {
                  // Do something with acked message id.
              }

              public void fail(Object msgId) {
                  // Do something with failed message id.
              }
          }

          public class SplitSentenceBolt extends BaseRichBolt {
              public void execute(Tuple input) {
                  for (String s : input.getString(0).split("\\s+")) {
                      // Anchoring incoming tuple to outgoing tuples
                      collector.emit(input, new Values(s));
                  }
                  // Sending ack
                  collector.ack(input);
              }
          }

      Tuple tree: "the cow jumped over the moon" -> "the", "cow", "jumped", "over", "the", "moon"
  26. 26. Acking Framework
      (Diagram: RandomSentenceSpout, SplitSentenceBolt and WordCountBolt send "Acker init", "Acker ack" and "Acker fail" messages to the acker implicit bolt while tuples A, B and C flow through the topology)
      For each spout tuple id, the acker implicit bolt keeps the spout task id and a 64-bit number called the "ack val":
      • Emitted tuple A, XOR tuple A id with ack val
      • Emitted tuple B, XOR tuple B id with ack val
      • Emitted tuple C, XOR tuple C id with ack val
      • Acked tuple A, XOR tuple A id with ack val
      • Acked tuple B, XOR tuple B id with ack val
      • Acked tuple C, XOR tuple C id with ack val
      When the ack val becomes 0, the acker implicit bolt knows the tuple tree has been completed.
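The XOR bookkeeping on this slide works because XOR is its own inverse: each tuple id is folded into the ack val once when the tuple is emitted and once when it is acked, so the value returns to 0 only when every emitted tuple has also been acked. A minimal sketch follows, with a hypothetical class name and fixed tuple ids chosen for readability (real ids are 64-bit numbers).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the ack val bookkeeping; not Storm's acker code.
final class AckerSketch {
    // spout tuple id -> ack val
    private final Map<Long, Long> ackVals = new HashMap<>();

    /**
     * Folds a tuple id into the ack val of its tuple tree. The same call is
     * used for "emitted" and "acked" events. Returns true when the ack val
     * has become 0, meaning the tree is complete.
     */
    boolean update(long spoutTupleId, long tupleId) {
        long ackVal = ackVals.getOrDefault(spoutTupleId, 0L) ^ tupleId;
        if (ackVal == 0L) {
            ackVals.remove(spoutTupleId);
            return true;
        }
        ackVals.put(spoutTupleId, ackVal);
        return false;
    }

    public static void main(String[] args) {
        AckerSketch acker = new AckerSketch();
        long spoutTupleId = 42L;
        long a = 0x1111L, b = 0x2222L, c = 0x4444L; // illustrative tuple ids

        // Same order as the slide: three emits, then three acks.
        System.out.println("emit A: " + acker.update(spoutTupleId, a)); // false
        System.out.println("emit B: " + acker.update(spoutTupleId, b)); // false
        System.out.println("emit C: " + acker.update(spoutTupleId, c)); // false
        System.out.println("ack  A: " + acker.update(spoutTupleId, a)); // false
        System.out.println("ack  B: " + acker.update(spoutTupleId, b)); // false
        System.out.println("ack  C: " + acker.update(spoutTupleId, c)); // true
    }
}
```

The state per tuple tree is a single 64-bit value no matter how many tuples the tree contains, which keeps tracking memory constant per spout tuple.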
  28. 28. Cluster Setup
      • Set up a ZooKeeper cluster
      • Install dependencies on Nimbus and worker machines
        - ZeroMQ 2.1.7 and JZMQ
        - Java 6 and Python 2.6.6
        - unzip
      • Download and extract a Storm release to Nimbus and worker machines
      • Fill in the mandatory configuration in storm.yaml
      • Launch daemons under supervision using the “storm” script
  29. 29. Cluster Summary
  30. 30. Topology Summary
  31. 31. Component Summary
  33. 33. Basic Resources
      • Storm is available under Eclipse Public License 1.0 at
        - http://storm-project.net/
        - https://github.com/nathanmarz/storm
      • Get help on
        - http://groups.google.com/group/storm-user
        - #storm-user freenode room
      • Follow
        - @stormprocessor and @nathanmarz for updates on the project
  34. 34. Many Contributions
      • Community repository for modules to use Storm at
        - https://github.com/nathanmarz/storm-contrib
        including integration with Redis, Kafka, MongoDB, HBase, JMS, Amazon SQS and so on
      • Good articles for understanding Storm internals
        - http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
        - http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
      • Good slides for understanding real-life examples
        - http://www.slideshare.net/DanLynn1/storm-as-deep-into-realtime-data-processing-as-you-can-get-in-30-minutes
        - http://www.slideshare.net/KrishnaGade2/storm-at-twitter
  35. 35. Features on Deck
      • Current release: 0.8.2 as of 6/28/2013
      • Work in progress (older): 0.8.3-wip3
        - Some bug fixes
      • Work in progress (newest): 0.9.0-wip19
        - SLF4J and Logback
        - Pluggable tuple serialization and Blowfish encryption
        - Pluggable interprocess messaging and Netty implementation
        - Some bug fixes
        - And more
  36. 36. Advanced Topics
      • Distributed RPC
      • Transactional topologies
      • Trident
      • Using non-JVM languages with Storm
      • Unit testing
      • Patterns
      These are not described in this presentation. Check them out yourself, or in my upcoming session if I get the chance :)
  37. 37. Thank You