Data streams processing with Storm

Transcript

  • 1. Mariusz Gil: Data streams processing with STORM
  • 2. data expires fast. very fast
  • 3. realtime processing?
  • 4. Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
  • 5. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
  • 6. conceptual architecture
  • 7. Stream: an unbounded sequence of tuples, e.g. (val1, val2), (val3, val4), (val5, val6)
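
    To make the tuple/schema relationship concrete, a minimal sketch (the component, its class name, and the field names val1/val2 are illustrative, not from the deck; a bolt base class is used only as a convenient host for the example, bolts themselves are introduced on slide 16). A component declares the schema of its output stream once, and every Values(...) it emits is matched to those field names positionally:

        import backtype.storm.topology.BasicOutputCollector;
        import backtype.storm.topology.OutputFieldsDeclarer;
        import backtype.storm.topology.base.BaseBasicBolt;
        import backtype.storm.tuple.Fields;
        import backtype.storm.tuple.Tuple;
        import backtype.storm.tuple.Values;

        // Illustrative component: the declared Fields name the positions of
        // every Values(...) this component emits.
        public class PairEmitterBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector collector) {
                // emits the tuple (val1="a", val2=1); positions match the declaration below
                collector.emit(new Values("a", 1));
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("val1", "val2"));
            }
        }
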
  • 8. Spouts: source of streams
  • 9. Reliable and unreliable spouts: replay or forget about tuples
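
    The deck's RandomSentenceSpout (slide 26) is unreliable: it emits tuples without message ids. A hedged sketch of the reliable variant, where the class name and the in-memory source queue are assumptions for illustration; the key point is that emitting with a message id makes the tuple trackable, so Storm can later call ack or fail with that id:

        import java.util.*;
        import backtype.storm.spout.SpoutOutputCollector;
        import backtype.storm.task.TopologyContext;
        import backtype.storm.topology.OutputFieldsDeclarer;
        import backtype.storm.topology.base.BaseRichSpout;
        import backtype.storm.tuple.Fields;
        import backtype.storm.tuple.Values;

        // Sketch of a reliable spout: emit with a message id, replay on fail, drop on ack.
        public class ReliableSentenceSpout extends BaseRichSpout {
            private SpoutOutputCollector _collector;
            private Queue<String> _source = new LinkedList<String>(Arrays.asList(
                "the cow jumped over the moon",
                "an apple a day keeps the doctor away"));
            private Map<UUID, String> _pending = new HashMap<UUID, String>(); // in-flight tuples

            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                _collector = collector;
            }

            public void nextTuple() {
                String sentence = _source.poll();
                if (sentence == null) return;
                UUID msgId = UUID.randomUUID();
                _pending.put(msgId, sentence);
                _collector.emit(new Values(sentence), msgId); // the msgId makes this tuple trackable
            }

            public void ack(Object msgId) {
                _pending.remove(msgId); // fully processed, safe to forget
            }

            public void fail(Object msgId) {
                _source.add(_pending.remove(msgId)); // replay the failed message
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("sentence"));
            }
        }
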
  • 10. Spouts: source of streams (Storm-Kafka)
  • 11. Spouts: source of streams (Storm-Kestrel)
  • 12. Spouts: source of streams (Storm-AMQP-Spout)
  • 13. Spouts: source of streams (Storm-JMS)
  • 14. Spouts: source of streams (Storm-PubSub*)
  • 15. Spouts: source of streams (Storm-Beanstalkd-Spout)
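
    As a taste of how one of these adapters is wired in, a sketch using storm-kafka; the host, topic, ZooKeeper root, and consumer id are placeholders, and the SpoutConfig constructor shown is from one version of the adapter and may differ in others:

        import backtype.storm.topology.TopologyBuilder;
        import storm.kafka.BrokerHosts;
        import storm.kafka.KafkaSpout;
        import storm.kafka.SpoutConfig;
        import storm.kafka.ZkHosts;

        public class KafkaSpoutExample {
            public static void main(String[] args) {
                // ZooKeeper quorum the Kafka brokers register with (placeholder host)
                BrokerHosts hosts = new ZkHosts("zk-host.example.com:2181");
                // topic, ZK root for offset storage, and consumer id are placeholders
                SpoutConfig spoutConfig =
                    new SpoutConfig(hosts, "sentences", "/storm-kafka", "sentence-reader");

                TopologyBuilder builder = new TopologyBuilder();
                builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 2);
            }
        }
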
  • 16. Bolts: process input streams and produce new streams
  • 17. Bolts: process input streams and produce new streams; a single bolt can consume many input streams and emit many output streams
  • 18. Topologies: network of spouts and bolts, e.g. TextSpout [sentence] -> SplitSentenceBolt [word] -> WordCountBolt [word, count]
  • 19. Topologies: network of spouts and bolts; several spouts and bolts can feed the same downstream bolt, e.g. two TextSpouts [sentence] -> two SplitSentenceBolts [word] -> WordCountBolt [word, count] -> xyzBolt
  • 20. server architecture
  • 21. Nimbus: the master process, responsible for distributing processing across the cluster
  • 22. Supervisors: worker processes, each responsible for executing a subset of a topology
  • 23. ZooKeeper: the coordination layer between Nimbus and the Supervisors
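
    A hedged sketch of the storm.yaml entries that tie these daemons together (hosts, paths, and ports are placeholders; nimbus.host is the key used by Storm releases of this era):

        # storm.yaml (sketch)
        storm.zookeeper.servers:            # the ZooKeeper ensemble coordinating the cluster
            - "zk1.example.com"
            - "zk2.example.com"
        nimbus.host: "nimbus.example.com"   # where the Nimbus daemon runs
        storm.local.dir: "/var/storm"       # local state for Nimbus and Supervisors
        supervisor.slots.ports:             # one worker slot per port on each Supervisor
            - 6700
            - 6701
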
  • 24. fail fast: cluster state is stored locally or in ZooKeeper, so the Storm daemons can crash and be restarted without losing it
  • 25. sample code
  • 26. Spouts:

        public class RandomSentenceSpout extends BaseRichSpout {
            SpoutOutputCollector _collector;
            Random _rand;

            @Override
            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                _collector = collector;
                _rand = new Random();
            }

            @Override
            public void nextTuple() {
                Utils.sleep(100);
                String[] sentences = new String[] {
                    "the cow jumped over the moon",
                    "an apple a day keeps the doctor away",
                    "four score and seven years ago",
                    "snow white and the seven dwarfs",
                    "i am at two with nature" };
                String sentence = sentences[_rand.nextInt(sentences.length)];
                _collector.emit(new Values(sentence));
            }

            @Override
            public void ack(Object id) { }

            @Override
            public void fail(Object id) { }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // the spout emits whole sentences (the slide declared "word" here)
                declarer.declare(new Fields("sentence"));
            }
        }
  • 27. Bolts:

        public static class WordCount extends BaseBasicBolt {
            Map<String, Integer> counts = new HashMap<String, Integer>();

            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                String word = tuple.getString(0);
                Integer count = counts.get(word);
                if (count == null) count = 0;
                count++;
                counts.put(word, count);
                collector.emit(new Values(word, count));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word", "count"));
            }
        }
  • 28. Bolts:

        public static class ExclamationBolt implements IRichBolt {
            OutputCollector _collector;

            public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
                _collector = collector;
            }

            public void execute(Tuple tuple) {
                // anchor the new tuple to the input, then ack the input
                _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
                _collector.ack(tuple);
            }

            public void cleanup() { }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word"));
            }

            public Map getComponentConfiguration() {
                return null;
            }
        }
  • 29. Topology:

        public class WordCountTopology {
            public static void main(String[] args) throws Exception {
                TopologyBuilder builder = new TopologyBuilder();
                builder.setSpout("spout", new RandomSentenceSpout(), 5);
                builder.setBolt("split", new SplitSentence(), 8)
                       .shuffleGrouping("spout");
                builder.setBolt("count", new WordCount(), 12)
                       .fieldsGrouping("split", new Fields("word"));

                Config conf = new Config();
                conf.setDebug(true);

                if (args != null && args.length > 0) {
                    conf.setNumWorkers(3);
                    StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
                } else {
                    conf.setMaxTaskParallelism(3);
                    LocalCluster cluster = new LocalCluster();
                    cluster.submitTopology("word-count", conf, builder.createTopology());
                    Thread.sleep(10000);
                    cluster.shutdown();
                }
            }
        }
  • 30. Bolts (multilang):

        public static class SplitSentence extends ShellBolt implements IRichBolt {
            public SplitSentence() {
                super("python", "splitsentence.py");
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word"));
            }
        }

        # splitsentence.py
        import storm

        class SplitSentenceBolt(storm.BasicBolt):
            def process(self, tup):
                words = tup.values[0].split(" ")
                for word in words:
                    storm.emit([word])

        SplitSentenceBolt().run()
  • 31. github.com/nathanmarz/storm-starter
  • 32. stream groupings
  • 33. Topology (the same WordCountTopology as slide 29; the groupings are the point):

        builder.setSpout("spout", new RandomSentenceSpout(), 5);
        builder.setBolt("split", new SplitSentence(), 8)
               .shuffleGrouping("spout");                       // distribute tuples randomly
        builder.setBolt("count", new WordCount(), 12)
               .fieldsGrouping("split", new Fields("word"));    // same word always hits the same task
  • 34. Groupings: shuffle, fields, all, global, none, direct, local-or-shuffle
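
    A sketch of how each grouping is declared, reusing the deck's RandomSentenceSpout and WordCount; the bolt ids a through g and the parallelism hints are illustrative:

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new RandomSentenceSpout(), 5);

        builder.setBolt("a", new WordCount(), 4).shuffleGrouping("spout");                        // random, even distribution
        builder.setBolt("b", new WordCount(), 4).fieldsGrouping("spout", new Fields("sentence")); // same field value -> same task
        builder.setBolt("c", new WordCount(), 4).allGrouping("spout");                            // every task gets every tuple
        builder.setBolt("d", new WordCount(), 4).globalGrouping("spout");                         // everything to a single task
        builder.setBolt("e", new WordCount(), 4).noneGrouping("spout");                           // "don't care" (behaves like shuffle)
        builder.setBolt("f", new WordCount(), 4).directGrouping("spout");                         // the producer picks the consumer task
        builder.setBolt("g", new WordCount(), 4).localOrShuffleGrouping("spout");                 // prefer tasks in the same worker
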
  • 35. distributed rpc
  • 36. Distributed RPC: the topology consumes a stream of [request-id, arguments] and emits [request-id, results]
  • 37. Distributed RPC:

        public static class ExclaimBolt extends BaseBasicBolt {
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                String input = tuple.getString(1);
                collector.emit(new Values(tuple.getValue(0), input + "!"));
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("id", "result"));
            }
        }

        public static void main(String[] args) throws Exception {
            LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclamation");
            builder.addBolt(new ExclaimBolt(), 3);

            Config conf = new Config(); // missing from the original slide
            LocalDRPC drpc = new LocalDRPC();
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("drpc-demo", conf, builder.createLocalTopology(drpc));
            System.out.println("Results for 'hello': " + drpc.execute("exclamation", "hello"));
            cluster.shutdown();
            drpc.shutdown();
        }
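
    Against a real cluster the caller side is a DRPCClient rather than LocalDRPC; a sketch with a placeholder host (3772 is the default DRPC server port):

        // Sketch: invoking the "exclamation" DRPC function on a running cluster.
        DRPCClient client = new DRPCClient("drpc-host.example.com", 3772);
        String result = client.execute("exclamation", "hello"); // blocks until the topology answers
        System.out.println(result); // prints "hello!"
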
  • 38. realtime analytics, personalization, search, revenue optimization, monitoring
  • 39. content search, realtime analytics, feed generation; integrated with Elasticsearch, HBase, Hadoop and HDFS
  • 40. realtime scoring, moments generation; integrated with Kafka queues and HDFS storage
  • 41. Storm-YARN enables Storm applications to utilize the computational resources of a Hadoop cluster, along with accessing Hadoop storage resources such as HBase and HDFS
  • 42. thanks! mail: mariusz@mariuszgil.pl twitter: @mariuszgil