Realtime Computation with Storm
Brad Anderson
banderson@maprtech.com
@boorad
Definition & Overview
Interoperability
Use Cases
Stream Processing
CEP
Distributed RPC
Before Storm
[Diagram: Queues, Workers]
Example (simplified)
[Diagram]
Storm
Guaranteed data processing
Horizontal scalability
Fault-tolerance
No intermediate message brokers!
Higher level abstraction than message passing
Concepts
streams
[Diagram: Tuple, Tuple, Tuple, Tuple, Tuple, Tuple, Tuple]
Unbounded sequence of tuples
spouts



Source of streams
spouts
public interface ISpout extends Serializable {
  void open(Map conf,
         TopologyContext context,
         SpoutOutputCollector collector);
  void close();
  void nextTuple();
  void ack(Object msgId);
  void fail(Object msgId);
}
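The ack and fail callbacks are what make replay possible: a spout that remembers each emitted message by id can forget it on ack and queue it again on fail. A minimal plain-Java sketch of that bookkeeping follows. It does not use Storm classes, and the names ReplayBuffer, offer, next, pendingCount, queuedCount, and pendingMessage are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Storm: tracks in-flight messages by id
// the way a reliable spout might between nextTuple(), ack() and fail().
class ReplayBuffer {
    private final Deque<String> queue = new ArrayDeque<>();    // waiting to be emitted
    private final Map<Long, String> pending = new HashMap<>(); // emitted, not yet acked
    private long nextId = 0;

    void offer(String message) {
        queue.addLast(message);
    }

    // Mirrors nextTuple(): take one message and remember it under a fresh id.
    // Returns the id, or -1 when there is nothing to emit.
    long next() {
        String message = queue.pollFirst();
        if (message == null) {
            return -1;
        }
        long id = nextId++;
        pending.put(id, message);
        return id;
    }

    // Mirrors ack(msgId): the message was fully processed, so forget it.
    void ack(long id) {
        pending.remove(id);
    }

    // Mirrors fail(msgId): put the message back so it is emitted again.
    void fail(long id) {
        String message = pending.remove(id);
        if (message != null) {
            queue.addFirst(message);
        }
    }

    int pendingCount() { return pending.size(); }
    int queuedCount() { return queue.size(); }
    String pendingMessage(long id) { return pending.get(id); }
}
```

In a real spout the same bookkeeping would sit behind nextTuple, ack, and fail, with the message id passed along when the tuple is emitted.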
bolts



Processes input streams and produces new streams
bolts
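import java.util.Map;

// Package names are from Storm 0.x; Storm 1.0+ moved them to org.apache.storm.
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class DoubleAndTripleBolt extends BaseRichBolt {
  private OutputCollector _collector;

  @Override
  public void prepare(Map conf,
                      TopologyContext context,
                      OutputCollector collector) {
    _collector = collector;
  }

  @Override
  public void execute(Tuple input) {
    int val = input.getInteger(0);
    // Anchor the output to the input tuple so a downstream failure replays it.
    _collector.emit(input, new Values(val * 2, val * 3));
    // Mark the input tuple as handled by this bolt.
    _collector.ack(input);
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("double", "triple"));
  }
}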
topologies



Network of spouts and bolts
Trident
Cascading for Storm
TridentTopology topology = new TridentTopology();
TridentState wordCounts =
    topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new MemoryMapState.Factory(),
                             new Count(),
                             new Fields("count"))
        .parallelismHint(6);
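To make the data flow concrete, here is roughly what the Trident pipeline above computes, written as a plain-Java pass over an in-memory list. It is only an analogy: the each step becomes a split, groupBy plus Count becomes a grouping counter, and the returned map stands in for the MemoryMapState. The names WordCountSketch and countWords are hypothetical, and splitting on whitespace is an assumption about what Split does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical single-process analogy of the word-count pipeline; no Storm classes.
class WordCountSketch {
    static Map<String, Long> countWords(List<String> sentences) {
        return sentences.stream()
            // like each(..., new Split(), ...): one sentence in, many words out
            .flatMap(sentence -> Arrays.stream(sentence.trim().split("\\s+")))
            .filter(word -> !word.isEmpty())
            // like groupBy(word) followed by Count(): one count per distinct word
            .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }
}
```

Unlike this sketch, the Trident version runs continuously over an unbounded stream and keeps updating the counts as new sentences arrive.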
Interoperability
spouts
• Kafka (with transactions)
• Kestrel
• JMS
• AMQP
• Beanstalkd
bolts
• Functions
• Filters
• Aggregation
• Joins
• Talk to databases, Hadoop write-behind
Storm
[Diagram: Raw Data, Queue, realtime processes, Hadoop, batch processes, Apps, Business Value]
Storm
[Diagram: Raw Data, Parallel Cluster Ingest, Queue, realtime processes, Hadoop, batch processes, Apps, Business Value]
Storm
[Diagram: Raw Data, Queue, realtime processes, Hadoop, batch processes, Apps, Business Value]
Storm
[Diagram: Raw Data, realtime processes, Hadoop, batch processes, Apps, Business Value]
Use Cases
Twitter
[Diagram: URL, Tweeter (x3), Follower (x6), Distinct follower (x3), Reach]
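The reach diagram can be read as four steps: look up who tweeted the URL, look up each tweeter's followers, de-duplicate the followers, and count them. A minimal single-process sketch is below. In-memory maps stand in for the tweeter and follower lookups, and all names (ReachSketch, reach, tweetersByUrl, followersByTweeter) are hypothetical rather than taken from the talk.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical single-process version of the reach computation; no Storm classes.
class ReachSketch {
    static int reach(String url,
                     Map<String, List<String>> tweetersByUrl,
                     Map<String, List<String>> followersByTweeter) {
        // A set drops followers that appear under more than one tweeter.
        Set<String> distinctFollowers = new HashSet<>();
        for (String tweeter : tweetersByUrl.getOrDefault(url, Collections.emptyList())) {
            distinctFollowers.addAll(
                followersByTweeter.getOrDefault(tweeter, Collections.emptyList()));
        }
        return distinctFollowers.size();
    }
}
```

The de-duplication step is why reach is not simply the sum of follower counts: a follower shared by two tweeters is counted once.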
Heartbyte
Fleet Logistics
http://github.com/{tdunning | boorad}/mapr-spout


                                    Brad Anderson
                          banderson@maprtech.com
                                         @boorad
Thank you.

Editor's Notes

  • #3 C - Best accessible distributed realtime computation system going; A - Learn about and start using Storm; B - You will get a great new tool in your technology stack - interesting uses
  • #4 CEP - continuous; not HFT-grade
  • #6 scaling is painful; poor fault tolerance; coding is hard
  • #9 tweets, stock ticks, manufacturing machine data, sensor messages
  • #14 DAG; runs continuously
  • #15 abstractions like Cascading, Hive, Pig make MR approachable; code size reduction
  • #18 kestrel - via thrift; kafka - transactional topologies, idempotency, process only once; activemq
  • #20 current architecture; data ingest tool for hadoop (avoid Flume madness)
  • #21 new architecture
  • #23 Trending Topics (stream processing of the firehose); computing the ‘reach’ of a URL (Dist RPC)
  • #25 Android devices, sampling geo every 5 seconds; route optimization; road tax reduction; idle alerts
  • #26 C - Exciting times, much like Hadoop/NoSQL beginning; A - Start tinkering with Storm, integrate into your workflows; B - be more responsive in turning data into information