Storm
Distributed and fault-tolerant realtime computation system




                                               Chandler@PyHug
                                            previa [at] gmail.com
Outline
•   Background
•   Why Storm
•   Components
•   Topology
•   Storm & DRPC
•   Multilang Protocol
•   Experience
Background
Background
• Created by Nathan Marz @ BackType/Twitter
  – Analyze tweets, links, users on Twitter

• Open-sourced in Sep 2011
  – Eclipse Public License 1.0
  – Storm 0.5.2
  – 16k Java and 7k Clojure LOC
  – Current stable release 0.8.2
     • 0.9.0 major core improvement
Background
• Active user group
  – https://groups.google.com/group/storm-user
  – https://github.com/nathanmarz/storm

  – Most watched Java repo on GitHub (>4k watchers)
  – Used by over 30 companies
     • Twitter, Groupon, Alibaba, GumGum, …
Why Storm ?
Before Storm
Problems
• Scale is painful
• Poor fault-tolerance
  – Hadoop is stateful
• Coding is tedious
• Batch processing
  – Long latency
  – No realtime results
Storm
• Scalable and robust
    – No persistent layer
•   Guarantees no data loss
•   Fault-tolerant
•   Programming language agnostic
•   Use case
    – Stream processing
    – Distributed RPC
    – Continuous computation
Components
Based on
• Apache Zookeeper
  – Distributed coordination service, used to store metadata
• ØMQ
  – Asynchronous message transport layer
• Apache Thrift
  – Cross-language bridge, RPC
• LMAX Disruptor
  – High-performance queue shared between threads
• Kryo
  – Serialization framework
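
A small, hedged sketch of where Kryo shows up in practice: custom classes carried inside tuples can be registered on the topology Config so Kryo knows how to serialize them. This is a fragment; the Event class is made up for illustration.

    Config conf = new Config();
    conf.registerSerialization(Event.class);   // Event is a hypothetical class carried in tuple fields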
System architecture
System architecture
• Nimbus
  – Like the JobTracker in Hadoop
• Supervisor
  – Manages workers
• Zookeeper
  – Stores metadata
• UI
  – Web-UI
Topology
Topology
• Tuples
  – ordered list of elements
  – (“user”, “link”, “event”, “10/3/12 17:50”)



• Streams
   – unbounded sequence of tuples
Spouts
• Source of streams
• Example
     • Read from logs, API calls, event data, queues, …
Spout
• Interface ISpout
  –   BaseRichSpout, ClojureSpout, DRPCSpout,
      FeederSpout, FixedTupleSpout, MasterBatchCoordinator, NoOpSpout, RichShellSpout, RichSpoutBatchTriggerer, ShellS
      pout, SpoutTracker, TestPlannerSpout, TestWordSpout, TransactionalSpoutCoordinator
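
For a concrete picture, a minimal spout sketch in Java (class, field, and word list are made up for illustration, not from the talk): it extends BaseRichSpout and emits one-field tuples from nextTuple().

    import java.util.Map;
    import java.util.Random;

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import backtype.storm.utils.Utils;

    public class RandomWordSpout extends BaseRichSpout {
        private static final String[] WORDS = {"storm", "spout", "bolt", "tuple"};
        private SpoutOutputCollector collector;
        private Random random;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;            // keep the collector to emit tuples later
            this.random = new Random();
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);                      // don't spin when there is nothing to do
            String word = WORDS[random.nextInt(WORDS.length)];
            collector.emit(new Values(word));      // simple, untracked emit (see reliability later)
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));  // name of the single output field
        }
    }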
Topology
• Bolts
  – Processes input streams and produces new
    streams
  – Example
     • Stream Joins, DBs, APIs, Filters, Aggregation, …
Bolts
• Interface IBolt
  – Implementations: BaseRichBolt, BasicBoltExecutor, BatchBoltExecutor, BoltTracker,
    ClojureBolt, CoordinatedBolt, JoinResult, KeyedFairBolt, NonRichBoltTracker,
    ReturnResults, BaseShellBolt, ShellBolt, TestAggregatesCounter, TestGlobalCount,
    TestPlannerBolt, TransactionalSpoutBatchExecutor, TridentBoltExecutor, TupleCaptureBolt
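
Again purely as a sketch (names are made up): a bolt that upper-cases each incoming word. BaseBasicBolt is used here because it handles anchoring and acking automatically.

    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class UppercaseBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String word = input.getStringByField("word");
            collector.emit(new Values(word.toUpperCase()));  // anchoring and acking handled for us
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("upper_word"));
        }
    }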
Topology
• Topology
  – A directed graph of Spouts and Bolts
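
A hedged sketch of wiring the two hypothetical components above into such a graph and running it on an in-process LocalCluster; names and the sleep time are illustrative only.

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.utils.Utils;

    public class WordTopology {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("words", new RandomWordSpout());                        // graph source
            builder.setBolt("upper", new UppercaseBolt()).shuffleGrouping("words");  // edge: words -> upper

            // Run in-process for testing; StormSubmitter.submitTopology would
            // instead send the topology to Nimbus on a real cluster.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-topology", new Config(), builder.createTopology());
            Utils.sleep(10000);   // let it run for ten seconds
            cluster.shutdown();
        }
    }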
Tasks
• Instances of Spouts and Bolts
• Managed by Supervisor
  –   http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
Stream grouping
• All grouping
  – Send to all tasks
• Global grouping
  – Pick task with lowest id
• Shuffle grouping
  – Pick a random task
• Fields grouping
  – Consistent hashing on a subset of tuple fields
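
The groupings map to calls on the bolt declarer. A fragment-style sketch, reusing the hypothetical RandomWordSpout / UppercaseBolt from the earlier sketches (component names and parallelism hints are made up):

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("events", new RandomWordSpout(), 2);       // 2 executors for the spout

    builder.setBolt("audit", new UppercaseBolt(), 4)
           .allGrouping("events");                              // every task receives every tuple

    builder.setBolt("totals", new UppercaseBolt(), 4)
           .globalGrouping("events");                           // everything goes to one task (lowest id)

    builder.setBolt("balance", new UppercaseBolt(), 4)
           .shuffleGrouping("events");                          // tuples spread randomly across tasks

    builder.setBolt("by-word", new UppercaseBolt(), 4)
           .fieldsGrouping("events", new Fields("word"));       // same "word" always hits the same task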
Storm fault-tolerance
• Reliability API
   – Spout tuple creation
        • collector.emit(values, msgID);
   – Child tuple creation (Bolts)
        • collector.emit(parentTuples, values);
   – Tuple end of processing
        • collector.ack(tuple);
   – Tuple failed to process
        • collector.fail(tuple);
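
A sketch of the reliability API on the bolt side (class and field names are illustrative): child tuples are anchored to the input, which is then acked on success or failed to trigger a replay. On the spout side, the tuple would be emitted with a message ID, i.e. collector.emit(values, msgId), and the spout's ack()/fail() callbacks receive that ID back.

    import java.util.Map;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class ReliableUppercaseBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            try {
                String word = input.getStringByField("word");
                collector.emit(input, new Values(word.toUpperCase()));  // anchor child to parent
                collector.ack(input);                                   // tell Storm this tuple succeeded
            } catch (Exception e) {
                collector.fail(input);                                  // have the spout tuple replayed
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("upper_word"));
        }
    }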
Storm fault-tolerance
• Disable the reliability API
  – Globally
     • Config.TOPOLOGY_ACKER_EXECUTORS = 0
  – Per spout tuple
     • collector.emit(values);  (no message ID, so the tuple tree is not tracked)
  – For a single child tuple
     • collector.emit(values);  (unanchored: no parent tuple, so failures don't propagate)
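
A fragment-style sketch of the three levels; spoutCollector and boltCollector are hypothetical variable names standing for the spout's and bolt's collectors.

    // 1) Globally: run zero acker executors, so nothing is tracked.
    Config conf = new Config();
    conf.setNumAckers(0);                            // same effect as TOPOLOGY_ACKER_EXECUTORS = 0

    // 2) Per spout tuple: emit without a message ID, so this tuple tree is not tracked.
    spoutCollector.emit(new Values(word));

    // 3) Per child tuple: emit unanchored (no parent tuple), so a downstream failure
    //    does not cause the original spout tuple to be replayed.
    boltCollector.emit(new Values(word.toUpperCase()));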
Storm & DRPC
Distributed RPC
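
The deck shows DRPC only as a diagram; purely as a hedged sketch, this is roughly what a DRPC topology and client call could look like with LinearDRPCTopologyBuilder (the function name, bolt, host, and port are assumptions for illustration).

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.drpc.LinearDRPCTopologyBuilder;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;
    import backtype.storm.utils.DRPCClient;

    public class UppercaseDrpcTopology {
        // The DRPC request id arrives as the first tuple field; results must be
        // emitted as [id, result] so Storm can route the answer back to the caller.
        public static class UppercaseDrpcBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple input, BasicOutputCollector collector) {
                Object id = input.getValue(0);
                String arg = input.getString(1);
                collector.emit(new Values(id, arg.toUpperCase()));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("id", "result"));
            }
        }

        public static void main(String[] args) throws Exception {
            LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("uppercase");
            builder.addBolt(new UppercaseDrpcBolt(), 3);

            StormSubmitter.submitTopology("uppercase-drpc", new Config(),
                                          builder.createRemoteTopology());

            // Client side (normally a separate process, once the topology is up):
            DRPCClient client = new DRPCClient("drpc-host", 3772);
            System.out.println(client.execute("uppercase", "hello storm"));
        }
    }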
Multilang Protocol
Multilang protocol
• Uses ShellSpout / ShellBolt
• Processes communicate over standard in/out
• Messages are encoded as JSON / lines of plain text
Three steps
• Initiate a handshake
  – Keep track of the subprocess id
  – Storm sends a JSON object to the subprocess's standard input at startup
  – It contains
     • Storm configuration, topology context, PID directory
Three steps
• Start looping
   – storm_sync expects a storm_ack in reply

• Read or write tuples
   – Follow the defined message structure
   – Implement read_msg(), storm_emit(), …
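
On the Java side, a component that speaks this protocol is declared by extending ShellSpout/ShellBolt. A sketch, where the external Python script name is an assumption for illustration; Storm launches it as a subprocess and exchanges the JSON messages above with it over stdin/stdout.

    import java.util.Map;

    import backtype.storm.task.ShellBolt;
    import backtype.storm.topology.IRichBolt;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.tuple.Fields;

    public class SplitSentence extends ShellBolt implements IRichBolt {
        public SplitSentence() {
            super("python", "splitsentence.py");    // command used to launch the subprocess
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));   // fields the external script will emit
        }

        @Override
        public Map<String, Object> getComponentConfiguration() {
            return null;                            // no per-component config overrides
        }
    }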
Experience
Experience
• Not hard to set up, but
  – Beware of certain versions of Zookeeper
  – Wait a while after the topology is deployed

• Fast
  – Better to use Fabric

• Stable
  – But beware of memory leaks
Reference
Reference
• “Getting Started with Storm”, O’Reilly

• Twitter Storm
   – Sergey Lukjanov@slideshare
   – http://www.slideshare.net/lukjanovsv/twitter-storm

• Storm
   – nathanmarz@slideshare
   – http://www.slideshare.net/nathanmarz/storm-11164672

• Realtime Analytics with Storm and Hadoop
   – Hadoop_Summit@slideshare
   – http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
Q/A
Thanks
