Realtime Processing with Storm (presentation)


Storm is a distributed, reliable, fault-tolerant system for processing streams of data.

In this talk we introduce the Storm framework, explain some design concepts and considerations, and walk through real-world examples of using it to process large amounts of data in real time, in a distributed environment. We also describe how this solution scales easily as more data needs to be processed.

We cover everything you need to get started with Storm, along with tips on getting your Spouts, Bolts, and Topologies up and running in the cloud.


  1. Realtime Processing with Storm
  2. Who are we? Dario Simonassi (@ldsimonassi), Gabriel Eisbruch (@geisbruch), Jonathan Leibiusky (@xetorthio)
  3. Agenda ● The problem ● Storm - Basic concepts ● Trident ● Deploying ● Some other features ● Contributions
  4. Tons of information arriving every second [image: http://necromentia.files.wordpress.com/2011/01/20070103213517_iguazu.jpg]
  5. Heavy workload [image: http://www.computerkochen.de/fun/bilder/esel.jpg]
  6. Do it Realtime! [image: http://www.asimon.co.il/data/images/a10689.jpg]
  7. [diagram: an Information Source feeds queues, which feed workers; workers feed further queues and workers, finally writing to a DB. A complex web of queues and workers.]
  8. What is Storm? ● It is a highly distributed realtime computation system. ● Provides general primitives to do realtime computation. ● Can be used with any programming language. ● It's scalable and fault-tolerant.
  9. Example - Main Concepts https://github.com/geisbruch/strata2012
  10. Main Concepts - Example "Count tweets related to strataconf and show hashtag totals."
  11. Main Concepts - Spout ● The first thing we need is to read all tweets related to strataconf from Twitter ● A Spout is a special component that is a source of messages to be processed. [diagram: Tweets → Spout → Tuples]
  12. Main Concepts - Spout

      public class ApiStreamingSpout extends BaseRichSpout {
          SpoutOutputCollector collector;
          TwitterReader reader;

          public void nextTuple() {
              Tweet tweet = reader.getNextTweet();
              if (tweet != null)
                  collector.emit(new Values(tweet));
          }

          public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
              reader = new TwitterReader(conf.get("track"), conf.get("user"), conf.get("pass"));
              this.collector = collector;
          }

          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("tweet"));
          }
      }

      [diagram: Tweets → Spout → Tuples]
  13. Main Concepts - Bolt ● For every tweet, we find its hashtags ● A Bolt is a component that does the processing. [diagram: Spout → (tweet stream) → HashtagsSplitter → hashtags]
  14. Main Concepts - Bolt

      public class HashtagsSplitter extends BaseBasicBolt {
          public void execute(Tuple input, BasicOutputCollector collector) {
              Tweet tweet = (Tweet) input.getValueByField("tweet");
              for (String hashtag : tweet.getHashTags()) {
                  collector.emit(new Values(hashtag));
              }
          }

          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("hashtag"));
          }
      }

      [diagram: Spout → (tweet stream) → HashtagsSplitter → hashtags]
  15. Main Concepts - Bolt

      public class HashtagsCounterBolt extends BaseBasicBolt {
          private Jedis jedis;

          public void execute(Tuple input, BasicOutputCollector collector) {
              String hashtag = input.getStringByField("hashtag");
              jedis.hincrBy("hashs", hashtag, 1); // increment this hashtag's counter
          }

          public void prepare(Map conf, TopologyContext context) {
              jedis = new Jedis("localhost");
          }

          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              // terminal bolt: it emits nothing, so there are no output fields to declare
          }
      }

      [diagram: Spout → (tweet stream) → HashtagsSplitter → (hashtag) → HashtagsCounterBolt]
  16. Main Concepts - Topology

      public static StormTopology createTopology() {
          TopologyBuilder builder = new TopologyBuilder();
          builder.setSpout("tweets-collector", new ApiStreamingSpout(), 1);
          builder.setBolt("hashtags-splitter", new HashtagsSplitter(), 2)
                 .shuffleGrouping("tweets-collector");
          builder.setBolt("hashtags-counter", new HashtagsCounterBolt(), 2)
                 .fieldsGrouping("hashtags-splitter", new Fields("hashtag"));
          return builder.createTopology();
      }

      [diagram: Spout → (shuffle) → HashtagsSplitter x2 → (fields) → HashtagsCounterBolt x2]
  17. Example Overview - Read Tweet [diagram: Spout → Splitter → Counter → Redis; an incoming tweet reads "This tweet has #hello #world hashtags!!!"]
  18. Example Overview - Split Tweet [diagram: the Splitter extracts #hello and #world from the tweet]
  19. Example Overview - Split Tweet [diagram: the Counter calls incr(#hello) and incr(#world) on Redis; #hello = 1, #world = 1]
  20. Example Overview - Split Tweet [diagram: a second tweet arrives, "This tweet has #bye #world hashtags"; Redis holds #hello = 1, #world = 1]
  21. Example Overview - Split Tweet [diagram: the Splitter emits #bye and #world; Redis still holds #hello = 1, #world = 1]
  22. Example Overview - Split Tweet [diagram: the Counter calls incr(#bye) and incr(#world); Redis now holds #hello = 1, #world = 2, #bye = 1]
  23. Main Concepts - In summary: ○ Topology ○ Spout ○ Stream ○ Bolt
  24. Guaranteeing Message Processing [image: http://www.steinborn.com/home_warranty.php]
  25. Guaranteeing Message Processing [diagram: Spout → HashtagsSplitter x2 → HashtagsCounterBolt x2]
  26. Guaranteeing Message Processing [diagram: same topology; each bolt acknowledges the tuples it processes with _collector.ack(tuple);]
  27-28. Guaranteeing Message Processing [same diagram, repeated over two animation slides]
  29. Guaranteeing Message Processing [diagram: same topology; a failed tuple is reported back to the spout through void fail(Object msgId);]
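The slides above show the ack and fail calls in isolation. Below is a minimal sketch, not from the original deck, of where they live: the spout tags each emitted tuple with a message id, downstream bolts anchor their emits to the input tuple and ack it, and Storm calls ack or fail back on the spout once the whole tuple tree succeeds or fails. Tweet comes from the deck's example; TweetQueue and its methods are hypothetical placeholders, and Storm API imports are omitted as in the deck's own code.

      public class ReliableTweetSpout extends BaseRichSpout {
          SpoutOutputCollector collector;
          TweetQueue queue; // hypothetical source that can replay a tweet by id

          public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
              this.collector = collector;
              this.queue = new TweetQueue();
          }

          public void nextTuple() {
              Tweet tweet = queue.next();
              if (tweet != null)
                  collector.emit(new Values(tweet), tweet.getId()); // the message id makes the tuple trackable
          }

          public void ack(Object msgId) { queue.remove(msgId); }   // fully processed: forget it
          public void fail(Object msgId) { queue.requeue(msgId); } // failed somewhere: replay it later

          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("tweet"));
          }
      }

      public class ReliableSplitter extends BaseRichBolt {
          OutputCollector collector;

          public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
              this.collector = collector;
          }

          public void execute(Tuple input) {
              Tweet tweet = (Tweet) input.getValueByField("tweet");
              for (String hashtag : tweet.getHashTags())
                  collector.emit(input, new Values(hashtag)); // anchored to the input tuple
              collector.ack(input); // this step of the tuple tree is done
          }

          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("hashtag"));
          }
      }

BaseBasicBolt, used earlier in the deck, performs this anchoring and acking automatically; BaseRichBolt makes both explicit.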
  30. Trident [image: http://www.raytekmarine.com/images/trident.jpg]
  31. Trident - What is Trident? ● A high-level abstraction on top of Storm. ● Makes it easier to build topologies. ● Comparable to what Pig is for Hadoop
  32. Trident - Topology

      TridentTopology topology = new TridentTopology();

      [diagram: Spout → (tweet stream) → HashtagsSplitter → hashtags → Memory]
  33. Trident - Stream

      Stream tweetsStream = topology
          .newStream("tweets", new ApiStreamingSpout());

      [diagram: Spout → tweet stream]
  34. Trident

      Stream hashtagsStream = tweetsStream.each(new Fields("tweet"), new BaseFunction() {
          public void execute(TridentTuple tuple, TridentCollector collector) {
              Tweet tweet = (Tweet) tuple.getValueByField("tweet");
              for (String hashtag : tweet.getHashtags())
                  collector.emit(new Values(hashtag));
          }
      }, new Fields("hashtag"));

      [diagram: Spout → (tweet stream) → HashtagsSplitter → hashtags]
  35-38. Trident - States (built up over four animation slides)

      TridentState hashtagCounts = hashtagsStream
          .groupBy(new Fields("hashtag"))
          .persistentAggregate(new MemoryMapState.Factory(),
                               new Count(), new Fields("count"));

      [diagram: Spout → (tweet stream) → HashtagsSplitter → hashtags → Memory (State)]
  39. Trident - Complete example

      TridentState hashtags = topology
          .newStream("tweets", new ApiStreamingSpout())
          .each(new Fields("tweet"), new HashTagSplitter(), new Fields("hashtag"))
          .groupBy(new Fields("hashtag"))
          .persistentAggregate(new MemoryMapState.Factory(),
                               new Count(), new Fields("count"));

      [diagram: Spout → (tweet stream) → HashtagsSplitter → hashtags → Memory (State)]
  40. Transactions
  41. Transactions ● Exactly-once message processing semantics
  42. Transactions ● The intermediate processing may be executed in parallel
  43. Transactions ● But persistence should be strictly sequential
  44. Transactions ● Spouts and persistence should be designed to be transactional
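To make the sequential-persistence requirement concrete, here is a minimal sketch, not from the deck, of the idea behind transactional state: every batch carries a transaction id, the store remembers the last txid applied per key, and a replayed batch is skipped instead of double-counted. CountStore is a hypothetical in-memory stand-in, not a Storm API.

      import java.util.HashMap;
      import java.util.Map;

      // Hypothetical store illustrating idempotent, strictly ordered persistence.
      public class CountStore {
          private final Map<String, Long> counts = new HashMap<String, Long>();
          private final Map<String, Long> lastTxid = new HashMap<String, Long>();

          // Apply one batch's increment for a key, exactly once.
          public void increment(String key, long delta, long txid) {
              Long seen = lastTxid.get(key);
              if (seen != null && seen >= txid)
                  return; // this batch was already applied (a replay): skip it
              Long current = counts.get(key);
              counts.put(key, (current == null ? 0L : current) + delta);
              lastTxid.put(key, txid); // record progress so replays are detected
          }
      }

Because updates for a key must arrive in txid order for the replay check to hold, batch N+1 can only be persisted after batch N, which is why persistence stays sequential even though the processing before it runs in parallel.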
  45. Trident - Queries We want to know, online, the count of tweets with the hashtag hadoop and the count of tweets with the hashtag strataconf.
  46. Trident - DRPC
  47. Trident - Query

      DRPCClient drpc = new DRPCClient("MyDRPCServer", "MyDRPCPort"); // placeholders; the real constructor takes a host string and an integer port
      drpc.execute("counts", "strataconf hadoop");
  48. Trident - Query

      Stream drpcStream = topology.newDRPCStream("counts", drpc);
  49. Trident - Query

      Stream singleHashtag = drpcStream.each(new Fields("args"),
          new Split(), new Fields("hashtag"));

      [diagram: DRPC Server → DRPC stream → Hashtag Splitter]
  50. Trident - Query

      Stream hashtagsCountQuery = singleHashtag.stateQuery(hashtagCountMemoryState,
          new Fields("hashtag"), new MapGet(), new Fields("count"));

      [diagram: DRPC Server → DRPC stream → Hashtag Splitter → Hashtag Count Query]
  51. Trident - Query

      Stream drpcCountAggregate = hashtagsCountQuery
          .groupBy(new Fields("hashtag"))
          .aggregate(new Fields("count"), new Sum(), new Fields("sum"));

      [diagram: DRPC Server → DRPC stream → Hashtag Splitter → Hashtag Count Query → Aggregate Count]
  52. Trident - Query - Complete Example

      topology.newDRPCStream("counts", drpc)
          .each(new Fields("args"), new Split(), new Fields("hashtag"))
          .stateQuery(hashtags, new Fields("hashtag"), new MapGet(), new Fields("count"))
          .each(new Fields("count"), new FilterNull())
          .groupBy(new Fields("hashtag"))
          .aggregate(new Fields("count"), new Sum(), new Fields("sum"));
  53-57. Trident - Query - Complete Example [diagram, animated over five slides: DRPC Server → DRPC stream → Hashtag Splitter → Hashtag Count Query → Aggregate Count. The client calls the DRPC server; the server holds the request while the topology executes; when execution finishes, the result returns to the DRPC server, which sends the response back to the client.]
  58. Running Topologies
  59. Running Topologies - Local Cluster

      Config conf = new Config();
      conf.put("some_property", "some_value");
      LocalCluster cluster = new LocalCluster();
      cluster.submitTopology("twitter-hashtag-summarizer", conf, builder.createTopology());

      ● The whole topology runs on a single machine ● Behaves much like a production cluster ● Very useful for testing and development ● Configurable parallelism ● Same methods as StormSubmitter
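One detail the slide leaves implicit: a LocalCluster keeps running until it is stopped, so tests typically let the topology process for a while and then tear everything down. A minimal sketch of that pattern (the sleep duration is arbitrary):

      LocalCluster cluster = new LocalCluster();
      cluster.submitTopology("twitter-hashtag-summarizer", conf, builder.createTopology());
      Utils.sleep(10000);                                 // let it process for ~10 seconds
      cluster.killTopology("twitter-hashtag-summarizer"); // stop the topology
      cluster.shutdown();                                 // tear down the local cluster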
  60. Running Topologies - Production

      package twitter.streaming;

      public class Topology {
          public static void main(String[] args) {
              // ...build conf and the topology with TopologyBuilder as before...
              StormSubmitter.submitTopology("twitter-hashtag-summarizer", conf, builder.createTopology());
          }
      }

      mvn package
      storm jar Strata2012-0.0.1.jar twitter.streaming.Topology "$REDIS_HOST" $REDIS_PORT "strata,strataconf" $TWITTER_USER $TWITTER_PASS
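An operational note not on the slide: a submitted topology runs until it is explicitly stopped, using the same storm CLI with the name that was passed to submitTopology:

      storm kill twitter-hashtag-summarizer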
  61. Running Topologies - Cluster Layout ● Single master process (Nimbus) ● No SPOF ● Automatic task distribution ● Amount of tasks per process is configurable ● Automatic task re-distribution when a supervisor fails ● Work continues if Nimbus fails [diagram: Storm Cluster: Nimbus (the master process) and several Supervisors (where the processing is done); a Zookeeper cluster is used only for coordination and cluster state]
  62-64. Running Topologies - Cluster Architecture - No SPOF [diagram, repeated over three animation slides: Nimbus (the master process) and three Supervisors (where the processing is done)]
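The layout slide says the amount of work per process is configurable. As a minimal sketch of where those knobs live in the Java API (the numbers are arbitrary examples): setNumWorkers controls worker processes spread across the supervisors, the parallelism hint on setBolt controls executors, and setNumTasks controls tasks shared among them.

      Config conf = new Config();
      conf.setNumWorkers(4); // worker processes spread across the supervisors

      TopologyBuilder builder = new TopologyBuilder();
      builder.setSpout("tweets-collector", new ApiStreamingSpout(), 1);
      builder.setBolt("hashtags-splitter", new HashtagsSplitter(), 2) // 2 executors
             .setNumTasks(4)                                          // 4 tasks shared among them
             .shuffleGrouping("tweets-collector");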
  65. Other features
  66. Other features - Non-JVM languages ● Use non-JVM languages ● Simple protocol ● Uses stdin & stdout for communication ● Many implementations and abstraction layers done by the community

      {
          "conf": {
              "topology.message.timeout.secs": 3,
              // etc
          },
          "context": {
              "task->component": {
                  "1": "example-spout",
                  "2": "__acker",
                  "3": "example-bolt"
              },
              "taskid": 3
          },
          "pidDir": "..."
      }
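On the JVM side, a non-JVM component is declared by wrapping the external process in a ShellBolt. A minimal sketch, assuming a hypothetical splitsentence.py script that speaks the multilang protocol (for example via the storm.py helper that ships with Storm):

      // Java-side declaration of a bolt implemented in Python.
      public static class SplitSentence extends ShellBolt implements IRichBolt {
          public SplitSentence() {
              super("python", "splitsentence.py"); // command and script, run as a subprocess
          }

          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("word")); // output fields are declared on the Java side
          }

          public Map<String, Object> getComponentConfiguration() {
              return null; // no extra per-component configuration
          }
      }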
  67. Contribution ● A collection of community-developed spouts, bolts, serializers, and other goodies. ● You can find it at: https://github.com/nathanmarz/storm-contrib ● Some of those goodies are: ○ cassandra: integrates Storm and Cassandra by writing tuples to a Cassandra column family. ○ mongodb: provides a spout to read documents from MongoDB, and a bolt to write documents to MongoDB. ○ SQS: provides a spout to consume messages from Amazon SQS. ○ hbase: provides a spout and a bolt to read and write from HBase. ○ kafka: provides a spout to read messages from Kafka. ○ rdbms: provides a bolt to write tuples to a table. ○ redis-pubsub: provides a spout to read messages from Redis pub/sub. ○ And more...
  68. Questions?
