Streams processing with Storm

  1. Mariusz Gil: Data streams processing with Storm
  2. data expires fast. very fast
  3. realtime processing?
  4. Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
  5. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
  6. concept architecture
  7. Stream: an unbounded sequence of tuples, e.g. (val1, val2), (val3, val4), (val5, val6); a short sketch of the tuple API follows
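     A tuple is a named list of values: a component declares its field names once, and every emitted tuple is a value list matching that schema. A minimal sketch using Storm's tuple classes (the field names and values here are illustrative, not from the deck):

     import backtype.storm.tuple.Fields;
     import backtype.storm.tuple.Values;

     public class TupleSketch {
         public static void main(String[] args) {
             // The stream schema: an ordered list of field names, declared once.
             Fields schema = new Fields("word", "count");
             // Every tuple on the stream is a value list matching that schema.
             Values tuple = new Values("storm", 42);
             System.out.println(schema.get(0) + " = " + tuple.get(0)); // prints: word = storm
         }
     }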
  8. Spouts: source of streams
  9. Reliable and unreliable spouts: replay or forget about a tuple (see the sketch below)
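     The difference is whether the spout tags each emit with a message ID: tagged tuples can be acked or failed (and replayed), untagged ones are fire-and-forget. A minimal sketch, assuming a hypothetical queue as the source (the message ID and payload below are placeholders):

     import java.util.Map;
     import backtype.storm.spout.SpoutOutputCollector;
     import backtype.storm.task.TopologyContext;
     import backtype.storm.topology.OutputFieldsDeclarer;
     import backtype.storm.topology.base.BaseRichSpout;
     import backtype.storm.tuple.Fields;
     import backtype.storm.tuple.Values;

     // Sketch of a *reliable* spout: emits carry a message ID, so Storm
     // can call ack()/fail() with that ID once the tuple tree completes.
     public class ReliableQueueSpout extends BaseRichSpout {
         private SpoutOutputCollector collector;

         @Override
         public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
             this.collector = collector;
         }

         @Override
         public void nextTuple() {
             String msgId = "msg-42";         // hypothetical: ID taken from the source queue
             String payload = "hello storm";  // hypothetical payload
             // Emitting WITH a message ID makes the tuple trackable (reliable).
             // An unreliable spout would just call collector.emit(new Values(payload)).
             collector.emit(new Values(payload), msgId);
         }

         @Override
         public void ack(Object msgId) {
             // tuple tree fully processed: safe to delete the message from the source
         }

         @Override
         public void fail(Object msgId) {
             // processing failed or timed out: replay the message
         }

         @Override
         public void declareOutputFields(OutputFieldsDeclarer declarer) {
             declarer.declare(new Fields("payload"));
         }
     }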
  10. Spouts: source of streams (Storm-Kafka)
  11. Spouts: source of streams (Storm-Kestrel)
  12. Spouts: source of streams (Storm-AMQP-Spout)
  13. Spouts: source of streams (Storm-JMS)
  14. Spouts: source of streams (Storm-PubSub*)
  15. Spouts: source of streams (Storm-Beanstalkd-Spout)
  16. Bolts: process input streams and produce new streams
  17. Bolts: process input streams and produce new streams (a bolt can subscribe to many streams and feed many downstream bolts)
  18. Topologies: network of spouts and bolts, e.g. TextSpout -> [sentence] -> SplitSentenceBolt -> [word] -> WordCountBolt -> [word, count]
  19. Topologies: network of spouts and bolts; several spouts and bolts can be wired together (two TextSpouts feeding SplitSentenceBolts, with WordCountBolt and xyzBolt downstream)
  20. servers architecture
  21. Nimbus: process responsible for distributing processing across the cluster
  22. Supervisors: worker processes responsible for executing a subset of a topology
  23. ZooKeeper: coordination layer between Nimbus and the Supervisors
  24. fail fast: Nimbus and the Supervisors are stateless; cluster state is stored locally or in ZooKeeper, so the daemons can be killed and restarted without affecting running topologies
  25. sample code
  26. Spouts:

     public class RandomSentenceSpout extends BaseRichSpout {
         SpoutOutputCollector _collector;
         Random _rand;

         @Override
         public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
             _collector = collector;
             _rand = new Random();
         }

         @Override
         public void nextTuple() {
             Utils.sleep(100);
             String[] sentences = new String[] {
                 "the cow jumped over the moon",
                 "an apple a day keeps the doctor away",
                 "four score and seven years ago",
                 "snow white and the seven dwarfs",
                 "i am at two with nature" };
             String sentence = sentences[_rand.nextInt(sentences.length)];
             _collector.emit(new Values(sentence));
         }

         @Override
         public void ack(Object id) { }

         @Override
         public void fail(Object id) { }

         @Override
         public void declareOutputFields(OutputFieldsDeclarer declarer) {
             // the spout emits whole sentences, so declare the field accordingly
             declarer.declare(new Fields("sentence"));
         }
     }
  27. Bolts:

     public static class WordCount extends BaseBasicBolt {
         Map<String, Integer> counts = new HashMap<String, Integer>();

         @Override
         public void execute(Tuple tuple, BasicOutputCollector collector) {
             String word = tuple.getString(0);
             Integer count = counts.get(word);
             if (count == null) count = 0;
             count++;
             counts.put(word, count);
             collector.emit(new Values(word, count));
         }

         @Override
         public void declareOutputFields(OutputFieldsDeclarer declarer) {
             declarer.declare(new Fields("word", "count"));
         }
     }
  28. Bolts:

     public static class ExclamationBolt implements IRichBolt {
         OutputCollector _collector;

         public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
             _collector = collector;
         }

         public void execute(Tuple tuple) {
             _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
             _collector.ack(tuple); // IRichBolt acks manually, unlike BaseBasicBolt
         }

         public void cleanup() { }

         public void declareOutputFields(OutputFieldsDeclarer declarer) {
             declarer.declare(new Fields("word"));
         }

         public Map getComponentConfiguration() {
             return null;
         }
     }
  29. Topology:

     public class WordCountTopology {
         public static void main(String[] args) throws Exception {
             TopologyBuilder builder = new TopologyBuilder();
             builder.setSpout("spout", new RandomSentenceSpout(), 5);
             builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
             builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

             Config conf = new Config();
             conf.setDebug(true);

             if (args != null && args.length > 0) {
                 // submit to a real cluster
                 conf.setNumWorkers(3);
                 StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
             } else {
                 // run in-process for local testing
                 conf.setMaxTaskParallelism(3);
                 LocalCluster cluster = new LocalCluster();
                 cluster.submitTopology("word-count", conf, builder.createTopology());
                 Thread.sleep(10000);
                 cluster.shutdown();
             }
         }
     }
  30. Bolts (multilang): the bolt body runs as a Python subprocess via Storm's multilang protocol

     public static class SplitSentence extends ShellBolt implements IRichBolt {
         public SplitSentence() {
             super("python", "splitsentence.py");
         }

         public void declareOutputFields(OutputFieldsDeclarer declarer) {
             declarer.declare(new Fields("word"));
         }
     }

     # splitsentence.py
     import storm

     class SplitSentenceBolt(storm.BasicBolt):
         def process(self, tup):
             words = tup.values[0].split(" ")
             for word in words:
                 storm.emit([word])

     SplitSentenceBolt().run()
  31. github.com/nathanmarz/storm-starter
  32. streams grouping
  33. Topology: the same WordCountTopology as above; the grouping is chosen where each bolt subscribes to its input stream:

     builder.setSpout("spout", new RandomSentenceSpout(), 5);
     builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
     builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
  34. Groupings: shuffle, fields, all, global, none, direct, local-or-shuffle (see the sketch below)
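     Each grouping in the list above maps to a method on the bolt's input declarer. A minimal sketch of the wiring (MyBolt and the upstream component "words" are hypothetical placeholders, not from the deck):

     import backtype.storm.topology.TopologyBuilder;
     import backtype.storm.tuple.Fields;

     public class GroupingSketch {
         // MyBolt stands in for any IRichBolt/IBasicBolt implementation.
         static void wire(TopologyBuilder builder) {
             builder.setBolt("a", new MyBolt()).shuffleGrouping("words");                   // tuples spread randomly and evenly
             builder.setBolt("b", new MyBolt()).fieldsGrouping("words", new Fields("word")); // same "word" value always hits the same task
             builder.setBolt("c", new MyBolt()).allGrouping("words");                        // every task gets a copy of every tuple
             builder.setBolt("d", new MyBolt()).globalGrouping("words");                     // the entire stream goes to a single task
             builder.setBolt("e", new MyBolt()).noneGrouping("words");                       // "don't care"; currently behaves like shuffle
             builder.setBolt("f", new MyBolt()).directGrouping("words");                     // the emitter chooses the receiving task
             builder.setBolt("g", new MyBolt()).localOrShuffleGrouping("words");             // prefer tasks in the same worker process
         }
     }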
  35. distributed rpc
  36. Distributed RPC: the client sends [request-id, arguments] into the topology and gets [request-id, results] back
  37. RPC:

     public static class ExclaimBolt extends BaseBasicBolt {
         public void execute(Tuple tuple, BasicOutputCollector collector) {
             // field 0 is the request id, field 1 is the argument string
             String input = tuple.getString(1);
             collector.emit(new Values(tuple.getValue(0), input + "!"));
         }

         public void declareOutputFields(OutputFieldsDeclarer declarer) {
             declarer.declare(new Fields("id", "result"));
         }
     }

     public static void main(String[] args) throws Exception {
         LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclamation");
         builder.addBolt(new ExclaimBolt(), 3);

         Config conf = new Config();
         LocalDRPC drpc = new LocalDRPC();
         LocalCluster cluster = new LocalCluster();
         cluster.submitTopology("drpc-demo", conf, builder.createLocalTopology(drpc));

         System.out.println("Results for 'hello': " + drpc.execute("exclamation", "hello"));

         cluster.shutdown();
         drpc.shutdown();
     }
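     Against a real cluster, the same call goes through a DRPC server instead of LocalDRPC. A minimal sketch, assuming a DRPC server is reachable (the host name is illustrative; 3772 is Storm's default DRPC port):

     import backtype.storm.utils.DRPCClient;

     public class DrpcCall {
         public static void main(String[] args) throws Exception {
             // "drpc1.example.com" is a hypothetical DRPC server host.
             DRPCClient client = new DRPCClient("drpc1.example.com", 3772);
             // Same function name as registered by LinearDRPCTopologyBuilder("exclamation").
             System.out.println(client.execute("exclamation", "hello"));
         }
     }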
  38. realtime analytics, personalization, search, revenue optimization, monitoring
  39. content search, realtime analytics, generating feeds; integrated with Elasticsearch, HBase, Hadoop and HDFS
  40. realtime scoring, moments generation; integrated with Kafka queues and HDFS storage
  41. Storm-YARN enables Storm applications to utilize the computational resources of a Hadoop cluster along with accessing Hadoop storage resources such as HBase and HDFS
  42. thanks! mail: mariusz@mariuszgil.pl twitter: @mariuszgil