Introduction to Storm

Published in: Technology
Transcript of "Introduction to Storm"

  1. Storm
     Distributed and fault-tolerant realtime computation system
     Chandler@PyHug, previa [at] gmail.com
  2. Outline
     • Background
     • Why Storm
     • Components
     • Topology
     • Storm & DRPC
     • Multilang Protocol
     • Experience
  3. Background
  4. Background
     • Created by Nathan Marz @ BackType/Twitter
       – Analyzes tweets, links, and users on Twitter
     • Open-sourced in Sep 2011
       – Eclipse Public License 1.0
       – Storm 0.5.2
       – 16k Java and 7k Clojure LOC
       – Current stable release 0.8.2
         • 0.9.0: major core improvements
  5. Background
     • Active user group
       – https://groups.google.com/group/storm-user
       – https://github.com/nathanmarz/storm
       – Most-watched Java repo on GitHub (>4k watchers)
       – Used by over 30 companies
         • Twitter, Groupon, Alibaba, GumGum, ...
  6. Why Storm?
  7. Before Storm
  8. Problems
     • Scaling is painful
     • Poor fault tolerance
       – Hadoop is stateful
     • Coding is tedious
     • Batch processing
       – Long latency
       – No realtime
  9. Storm
     • Scalable and robust
       – No persistent layer
     • Guarantees no data loss
     • Fault-tolerant
     • Programming-language agnostic
     • Use cases
       – Stream processing
       – Distributed RPC
       – Continuous computation
  10. Components
  11. Based on
      • Apache ZooKeeper
        – Distributed system, used to store metadata
      • ØMQ
        – Asynchronous message transport layer
      • Apache Thrift
        – Cross-language bridge, RPC
      • LMAX Disruptor
        – High-performance queue shared by threads
      • Kryo
        – Serialization framework
  12. System architecture
  13. System architecture
      • Nimbus
        – Like the JobTracker in Hadoop
      • Supervisor
        – Manages workers
      • ZooKeeper
        – Stores metadata
      • UI
        – Web UI
  14. Topology
  15. Topology
      • Tuples
        – Ordered list of elements
        – ("user", "link", "event", "10/3/12 17:50")
      • Streams
        – Unbounded sequence of tuples
  16. Spouts
      • Source of streams
      • Example
        – Read from logs, API calls, event data, queues, …
  17. Spout
      • Interface ISpout
        – BaseRichSpout, ClojureSpout, DRPCSpout, FeederSpout,
          FixedTupleSpout, MasterBatchCoordinator, NoOpSpout,
          RichShellSpout, RichSpoutBatchTriggerer, ShellSpout,
          SpoutTracker, TestPlannerSpout, TestWordSpout,
          TransactionalSpoutCoordinator
  18. Topology
      • Bolts
        – Process input streams and produce new streams
        – Example
          • Stream joins, DBs, APIs, filters, aggregation, …
  19. Bolts
      • Interface IBolt
        – BaseRichBolt, BasicBoltExecutor, BatchBoltExecutor, BoltTracker,
          ClojureBolt, CoordinatedBolt, JoinResult, KeyedFairBolt,
          NonRichBoltTracker, ReturnResults, BaseShellBolt, ShellBolt,
          TestAggregatesCounter, TestGlobalCount, TestPlannerBolt,
          TransactionalSpoutBatchExecutor, TridentBoltExecutor,
          TupleCaptureBolt
  20. Topology
      • Topology
        – A directed graph of Spouts and Bolts
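
The "directed graph of Spouts and Bolts" idea can be illustrated with a toy model. The Python sketch below is not Storm's API: `ToyTopology`, `set_spout`, `set_bolt`, and `run` are hypothetical names chosen for illustration, and everything runs in a single process. It only shows how tuples flow from a source, through subscribed processing steps, to the end of the graph.

```python
class ToyTopology:
    """Toy, single-process model of a topology (hypothetical, not Storm's API)."""

    def __init__(self):
        self.spouts = {}  # name -> callable returning an iterable of tuples
        self.bolts = {}   # name -> callable(tuple) returning a list of tuples
        self.edges = {}   # source component name -> names of subscribed bolts

    def set_spout(self, name, source):
        self.spouts[name] = source

    def set_bolt(self, name, process, inputs):
        self.bolts[name] = process
        for src in inputs:
            self.edges.setdefault(src, []).append(name)

    def run(self):
        """Push every spout tuple through the graph; collect what sink bolts emit."""
        results = {}

        def deliver(component, tup):
            for target in self.edges.get(component, []):
                for out in self.bolts[target](tup):
                    if self.edges.get(target):
                        deliver(target, out)  # forward to downstream bolts
                    else:
                        results.setdefault(target, []).append(out)

        for name, source in self.spouts.items():
            for tup in source():
                deliver(name, tup)
        return results


topology = ToyTopology()
topology.set_spout("events", lambda: [("alice", "http://a"),
                                      ("bob", "http://b"),
                                      ("alice", "http://c")])
# Filter bolt: keep only tuples whose first field is "alice".
topology.set_bolt("only_alice", lambda t: [t] if t[0] == "alice" else [],
                  inputs=["events"])
# Projection bolt: keep only the link field.
topology.set_bolt("links", lambda t: [(t[1],)], inputs=["only_alice"])
output = topology.run()
```

In a real cluster each component runs as many parallel tasks spread over worker processes; the sketch deliberately ignores that.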
  21. Tasks
      • Instances of Spouts and Bolts
      • Managed by Supervisor
        – http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
  22. Stream grouping
      • All grouping
        – Send to all tasks
      • Global grouping
        – Pick the task with the lowest ID
      • Shuffle grouping
        – Pick a random task
      • Fields grouping
        – Consistent hashing on a subset of tuple fields
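
The four groupings above can be sketched as functions that pick the target task IDs for one tuple. This is a minimal Python illustration with hypothetical function names, not Storm code. One substitution to note: the fields grouping here uses a plain hash-modulo over the selected fields as a stand-in for the consistent hashing mentioned on the slide; the property it demonstrates is only that equal field values always reach the same task.

```python
import random
import zlib


def all_grouping(task_ids, tup):
    """Every task receives a copy of the tuple."""
    return list(task_ids)


def global_grouping(task_ids, tup):
    """The whole stream goes to the task with the lowest ID."""
    return [min(task_ids)]


def shuffle_grouping(task_ids, tup, rng=random):
    """One randomly chosen task receives the tuple."""
    return [rng.choice(list(task_ids))]


def fields_grouping(task_ids, tup, field_indexes):
    """Tuples with equal values in the chosen fields go to the same task.

    Assumption: hash-modulo is used instead of true consistent hashing.
    zlib.crc32 is used because it is stable across interpreter runs.
    """
    key = repr(tuple(tup[i] for i in field_indexes)).encode("utf-8")
    ordered = sorted(task_ids)
    return [ordered[zlib.crc32(key) % len(ordered)]]


tasks = [4, 5, 6, 7]
first = fields_grouping(tasks, ("alice", "http://a", "click"), [0])
second = fields_grouping(tasks, ("alice", "http://z", "view"), [0])
# first and second name the same task, because both tuples share the user field.
```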
  23. Storm fault-tolerance
      • Reliability API
        – Spout tuple creation
          • collector.emit(values, msgID);
        – Child tuple creation (Bolts)
          • collector.emit(parentTuples, values);
        – Tuple end of processing
          • collector.ack(tuple);
        – Tuple failed to process
          • collector.fail(tuple);
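
To make the emit/ack/fail cycle above more concrete, here is a toy Python model of the bookkeeping behind it. It follows the XOR trick that Storm's documentation describes for its acker tasks: every tuple gets a random 64-bit ID, the IDs of emitted and acked tuples are XORed into one value per spout message, and the value returns to zero exactly when every tuple in the tree has been acked. The class and method names (`ToyAcker`, `emit_from_spout`, `emit_anchored`) are hypothetical; this is a single-process sketch, not Storm's implementation.

```python
import random


class ToyAcker:
    """Tracks tuple trees per spout message ID (illustrative only)."""

    def __init__(self):
        self.pending = {}    # msg_id -> XOR of IDs emitted but not yet acked
        self.completed = []  # msg_ids whose whole tuple tree was acked
        self.failed = []     # msg_ids that had at least one failed tuple

    def _new_id(self):
        return random.getrandbits(64) or 1  # avoid the all-zero ID

    def emit_from_spout(self, msg_id):
        """Like collector.emit(values, msgID): starts a new tuple tree."""
        tuple_id = self._new_id()
        self.pending[msg_id] = tuple_id
        return (msg_id, tuple_id)

    def emit_anchored(self, parent):
        """Like collector.emit(parentTuples, values): adds a child to the tree."""
        msg_id, _ = parent
        tuple_id = self._new_id()
        self.pending[msg_id] ^= tuple_id
        return (msg_id, tuple_id)

    def ack(self, tup):
        """Like collector.ack(tuple): the tree is done when the XOR reaches 0."""
        msg_id, tuple_id = tup
        if msg_id not in self.pending:
            return  # tree already completed or failed
        self.pending[msg_id] ^= tuple_id
        if self.pending[msg_id] == 0:
            del self.pending[msg_id]
            self.completed.append(msg_id)

    def fail(self, tup):
        """Like collector.fail(tuple): the whole spout message is marked failed."""
        msg_id, _ = tup
        if msg_id in self.pending:
            del self.pending[msg_id]
            self.failed.append(msg_id)


acker = ToyAcker()
root = acker.emit_from_spout("msg-1")
child_a = acker.emit_anchored(root)
child_b = acker.emit_anchored(root)
acker.ack(root)
acker.ack(child_a)
still_pending = "msg-1" in acker.pending  # child_b has not been acked yet
acker.ack(child_b)                        # now the tree is complete

root2 = acker.emit_from_spout("msg-2")
child2 = acker.emit_anchored(root2)
acker.fail(child2)                        # a spout would replay msg-2
acker.ack(root2)                          # ignored: the tree already failed
```

The appeal of this design is constant memory per spout message regardless of how large the tuple tree grows.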
  24. Storm fault-tolerance
      • Disable reliability API
        – Globally
          • Set Config.TOPOLOGY_ACKER_EXECUTORS to 0
        – On topology level
          • Emit from the spout without a message ID: collector.emit(values);
        – For a single tuple
          • Emit from the bolt without anchoring to parent tuples: collector.emit(values);
  25. Storm & DRPC
  26. Distributed RPC
  27. Multilang Protocol
  28. Multilang protocol
      • Uses ShellSpout/ShellBolt
      • Processes communicate over standard input/output
      • Messages are encoded as JSON or lines of plain text
  29. Three steps
      • Initiate a handshake
        – Keep track of the process ID
        – Storm sends a JSON object to the process's standard input at startup
        – Contains
          • Storm configuration, topology context, PID directory
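
The handshake step above might look roughly like this on the script side. The sketch assumes the message framing and field names used by Storm's documented multilang protocol in the 0.8.x line (a JSON object followed by a line containing `end`; setup keys `conf`, `context`, `pidDir`; a `{"pid": ...}` reply); treat these as assumptions to check against the version in use. `read_msg` is named after the helper mentioned in the slides, while `send_msg` and `handshake` are hypothetical names. In-memory streams stand in for the real stdin/stdout so the example can run on its own.

```python
import io
import json
import os
import tempfile


def read_msg(stream):
    """Read one message: JSON text terminated by a line containing only 'end'."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            return json.loads("\n".join(lines))
        lines.append(line)
    return None  # stream closed before a complete message arrived


def send_msg(stream, msg):
    """Write one message using the same framing."""
    stream.write(json.dumps(msg) + "\nend\n")
    stream.flush()


def handshake(stdin, stdout):
    """Receive the setup info, record our PID, and report it back."""
    setup = read_msg(stdin)
    pid = os.getpid()
    # An empty file named after the PID lets the supervisor track this process.
    open(os.path.join(setup["pidDir"], str(pid)), "w").close()
    send_msg(stdout, {"pid": pid})
    return setup["conf"], setup["context"]


# Simulated exchange; the conf and context values are made up for illustration.
pid_dir = tempfile.mkdtemp()
setup_info = {"conf": {"topology.debug": False},
              "context": {"taskid": 3},
              "pidDir": pid_dir}
fake_stdin = io.StringIO(json.dumps(setup_info) + "\nend\n")
fake_stdout = io.StringIO()
conf, context = handshake(fake_stdin, fake_stdout)
```

In a real ShellBolt or ShellSpout script, `sys.stdin` and `sys.stdout` would be passed instead of the `io.StringIO` objects.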
  30. Three steps
      • Start looping
        – storm_sync would expect storm_ack
      • Read or write tuples
        – Follow the defined structure
        – Implement read_msg(), storm_emit(), …
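
The read/write loop above can be sketched for a bolt that splits sentences into words. As with any multilang script, the details depend on the Storm version: the message shapes used here (`id` and `tuple` on incoming tuples, `emit` with `anchors`, `ack` with `id`, messages terminated by an `end` line, task-ID lists sent back after an emit) follow the documented protocol as I understand it and should be treated as assumptions. `read_msg` is the helper name from the slide; `send_msg` and `run_split_bolt` are hypothetical.

```python
import io
import json


def read_msg(stream):
    """Read one message: JSON text terminated by a line containing only 'end'."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            return json.loads("\n".join(lines))
        lines.append(line)
    return None  # stream closed


def send_msg(stream, msg):
    stream.write(json.dumps(msg) + "\nend\n")
    stream.flush()


def run_split_bolt(stdin, stdout):
    """Loop: read a tuple, emit one anchored tuple per word, then ack the input."""
    while True:
        msg = read_msg(stdin)
        if msg is None:
            break
        if isinstance(msg, list):
            continue  # a list of task IDs sent in reply to an earlier emit
        for word in msg["tuple"][0].split():
            send_msg(stdout, {"command": "emit",
                              "anchors": [msg["id"]],  # ties the word to its parent
                              "tuple": [word]})
        send_msg(stdout, {"command": "ack", "id": msg["id"]})


# Simulated run with in-memory streams instead of the real stdin/stdout.
incoming = {"id": "42", "comp": "sentences", "stream": "default",
            "task": 1, "tuple": ["hello storm"]}
fake_stdin = io.StringIO(json.dumps(incoming) + "\nend\n")
fake_stdout = io.StringIO()
run_split_bolt(fake_stdin, fake_stdout)

# Parse what the bolt wrote, using the same framing.
replies = io.StringIO(fake_stdout.getvalue())
sent = []
while True:
    reply = read_msg(replies)
    if reply is None:
        break
    sent.append(reply)
```

Anchoring each emitted word to the input tuple's ID is what lets the reliability API replay the sentence if any word fails downstream.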
  31. Experience
  32. Experience
      • Not hard to set up, but
        – Beware of certain versions of ZooKeeper
        – Wait a while after a topology is deployed
      • Fast
        – Better to use Fabric
      • Stable
        – But beware of memory leaks
  35. Q/A
  36. Thanks