
storm at twitter

Talk given at facebook's analytics@webscale conference. Covers storm basics, system overview, architecture at twitter and current use-cases.

Published in: Technology

Transcript of "storm at twitter"

  1. storm: stream processing @twitter. Krishna Gade, Twitter, @krishnagade. Sunday, June 16, 13
  2. what is storm? storm is a platform for doing analysis on streams of data as they come in, so you can react to data as it happens.
  3. storm v hadoop: storm & hadoop are complementary!
     hadoop => big batch processing
     storm => fast, reactive, real time processing
  4. origins
     • originated at backtype, acquired by twitter in 2011.
     • to vastly simplify dealing with queues & workers.
  5. queue-worker model [diagram: queues and workers]
  6. typical workflow [diagram: queues, workers and a datastore]
  7. problems
     • scaling is painful - queue partitioning & worker deploy.
     • operational overhead - worker failures & queue backups.
     • no guarantees on data processing.
  8. storm
  9. what does storm provide?
     • at least once message processing.
     • horizontal scalability.
     • no intermediate queues.
     • less operational overhead.
     • “just works”.
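A minimal sketch of what "at least once" on slide 9 implies, written in plain Java with no Storm dependency: a message that is not acknowledged gets replayed, so the processing code may see it more than once. The class `AtLeastOnceSketch`, the method `deliver` and the boolean-returning processor are hypothetical names chosen for illustration; they are not Storm's API and say nothing about how Storm tracks tuples internally.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical illustration only; this is not how Storm is implemented.
final class AtLeastOnceSketch {
    // Hands each message to the processor until the processor reports
    // success (an "ack"). Returns the number of attempts per message.
    // A processor that never succeeds for some message would loop forever.
    static Map<String, Integer> deliver(List<String> messages, Predicate<String> processor) {
        Map<String, Integer> attempts = new LinkedHashMap<>();
        Deque<String> pending = new ArrayDeque<>(messages);
        while (!pending.isEmpty()) {
            String message = pending.poll();
            attempts.merge(message, 1, Integer::sum);
            if (!processor.test(message)) {
                pending.add(message); // not acked: queue it for replay
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        // Processor that fails the first time it sees "b", then succeeds.
        Set<String> failedOnce = new HashSet<>();
        Map<String, Integer> attempts = deliver(
            Arrays.asList("a", "b", "c"),
            m -> !m.equals("b") || !failedOnce.add(m));
        System.out.println(attempts); // "b" is attempted twice, the others once
    }
}
```

Because a replayed message is processed again, any side effect of the first attempt (a counter increment, a database write) can happen twice, which is the practical difference between at-least-once and exactly-once processing.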
  10. storm primitives
      • streams
      • spouts
      • bolts
      • topologies
  11. streams: unbounded sequence of tuples [diagram: a row of tuples T]
  12. spouts: source of streams [diagram: two streams, A and B]
  13. typical spouts
      • read from a kestrel/kafka queue. {tuples = events}
      • read from a http server log. {tuples = http requests}
      • read from twitter streaming api. {tuples = tweets}
  14. bolts: process input stream - A, produce output stream - B [diagram: stream A in, stream B out]
  15. bolts
      • filtering tuples in a stream.
      • aggregation of tuples.
      • joining multiple streams.
      • arbitrary functions on streams.
      • communication with external caches/dbs.
  16. topology: directed-acyclic-graph of spouts and bolts. [diagram: spouts s1, s2 and bolts b1 to b5]
  17. storm cluster [diagram: nimbus (master node); ZK; two supervisors (slave nodes), each with workers w1 to w4; labels: topology submission, topology map, sync code]
  18. nimbus
      • master node.
      • manages the topologies.
      • analogous to the job tracker in hadoop.
      $ storm jar myapp.jar com.twitter.MyTopology demo
  19. supervisor
      • runs on slave nodes.
      • co-ordinates with zookeeper.
      • manages workers.
  20. worker [diagram: a jvm process containing executors, which run tasks]
  21. recap
      • worker - process that executes a subset of a topology.
      • executor - a thread spawned by a worker.
      • task - performs the actual data processing.
  22. stream grouping
      • shuffle grouping - random distribution of tuples.
      • fields grouping - groups tuples by a field.
      • all grouping - replicates to all tasks.
      • global grouping - sends the entire stream to one task.
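The four groupings on slide 22 can be read as four rules for picking the receiving task(s) of a tuple. The sketch below states those rules in plain Java; `GroupingSketch` and its methods are hypothetical names, tasks are assumed to be numbered 0 to numTasks - 1, and the hash-modulo rule and the choice of task 0 for global grouping are simplifying assumptions rather than Storm's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Storm: picks target task indexes
// for one tuple under each grouping named on the slide.
final class GroupingSketch {
    // shuffle grouping: a random task, so load evens out on average.
    static int shuffle(Random rand, int numTasks) {
        return rand.nextInt(numTasks);
    }

    // fields grouping: hash the grouping field, so equal values
    // always land on the same task.
    static int fields(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // all grouping: every task receives a copy of the tuple.
    static List<Integer> all(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int task = 0; task < numTasks; task++) {
            targets.add(task);
        }
        return targets;
    }

    // global grouping: the entire stream goes to one task
    // (assumed here to be task 0).
    static int global() {
        return 0;
    }

    public static void main(String[] args) {
        int numTasks = 4;
        System.out.println("shuffle -> task " + shuffle(new Random(), numTasks));
        System.out.println("fields  -> task " + fields("storm", numTasks)
            + " (same task every time for \"storm\")");
        System.out.println("all     -> tasks " + all(numTasks));
        System.out.println("global  -> task " + global());
    }
}
```

Fields grouping is the one that matters for stateful bolts: because equal field values always reach the same task, that task can keep per-key state in local memory.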
  23. streaming word-count
        TopologyBuilder builder = new TopologyBuilder();
        // the last argument of setSpout/setBolt is the parallelism hint
        builder.setSpout("tweet_spout", new RandomTweetSpout(), 5);
        builder.setBolt("parse_bolt", new ParseTweetBolt(), 8)
               .shuffleGrouping("tweet_spout")
               .setNumTasks(2);
        // fields grouping on "word": equal words go to the same count task
        builder.setBolt("count_bolt", new WordCountBolt(), 12)
               .fieldsGrouping("parse_bolt", new Fields("word"));
        Config config = new Config();
        config.setNumWorkers(3);
        StormSubmitter.submitTopology("demo", config, builder.createTopology());
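  24. tweet spout
        // imports use the backtype.storm.* packages of Storm releases from this period
        import java.util.Random;
        import backtype.storm.spout.SpoutOutputCollector;
        import backtype.storm.topology.base.BaseRichSpout;
        import backtype.storm.tuple.Values;
        import backtype.storm.utils.Utils;

        class RandomTweetSpout extends BaseRichSpout {
            SpoutOutputCollector collector;
            Random rand;
            String[] tweets = new String[] {
                "@jkrums: There’s a plane in the Hudson. I’m on the ferry to pick up people. Crazy",
                "@barackobama: Four more years. pic.twitter.com/bAJE6Vom",
                ...
            };
            ....
            @Override
            public void nextTuple() {
                // emit one randomly chosen tweet roughly every 100 ms
                Utils.sleep(100);
                String tweet = tweets[rand.nextInt(tweets.length)];
                collector.emit(new Values(tweet));
            }
        }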
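  25. parse bolt
        // imports use the backtype.storm.* packages of Storm releases from this period
        import backtype.storm.topology.BasicOutputCollector;
        import backtype.storm.topology.OutputFieldsDeclarer;
        import backtype.storm.topology.base.BaseBasicBolt;
        import backtype.storm.tuple.Fields;
        import backtype.storm.tuple.Tuple;
        import backtype.storm.tuple.Values;

        class ParseTweetBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                // split the tweet on spaces and emit one tuple per word
                String tweet = tuple.getString(0);
                for (String word : tweet.split(" ")) {
                    collector.emit(new Values(word));
                }
            }
            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word"));
            }
        }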
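  26. word count bolt
        // imports use the backtype.storm.* packages of Storm releases from this period
        import java.util.HashMap;
        import java.util.Map;
        import backtype.storm.topology.BasicOutputCollector;
        import backtype.storm.topology.OutputFieldsDeclarer;
        import backtype.storm.topology.base.BaseBasicBolt;
        import backtype.storm.tuple.Fields;
        import backtype.storm.tuple.Tuple;
        import backtype.storm.tuple.Values;

        class WordCountBolt extends BaseBasicBolt {
            // running count per word, held in memory by this bolt instance
            Map<String, Integer> counts = new HashMap<String, Integer>();
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                String word = tuple.getString(0);
                Integer count = counts.get(word);
                count = (count == null) ? 1 : count + 1;
                counts.put(word, count);
                // emit the updated count for this word
                collector.emit(new Values(word, count));
            }
            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word", "count"));
            }
        }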
  27. word-count topology: RandomTweetSpout -> (shuffle grouping) -> ParseTweetBolt -> (fields grouping) -> WordCountBolt
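The word-count topology keeps its counts in a per-instance map, which only gives correct totals if every occurrence of a word reaches the same count task. The following plain-Java simulation is a sketch of that data flow without any Storm classes; `WordCountSketch`, `run`, `tasksHolding` and `total` are hypothetical names, and hashing the word modulo the task count is an assumed stand-in for fields grouping on "word".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the topology above; no Storm classes are used.
final class WordCountSketch {
    // Splits each tweet into words (the parse step), routes every word to a
    // count task by hashing it (a stand-in for fields grouping on "word"),
    // and increments that task's private counter (the count step).
    static List<Map<String, Integer>> run(String[] tweets, int numCountTasks) {
        List<Map<String, Integer>> countsPerTask = new ArrayList<>();
        for (int i = 0; i < numCountTasks; i++) {
            countsPerTask.add(new HashMap<>());
        }
        for (String tweet : tweets) {
            for (String word : tweet.split(" ")) {
                int task = Math.floorMod(word.hashCode(), numCountTasks);
                countsPerTask.get(task).merge(word, 1, Integer::sum);
            }
        }
        return countsPerTask;
    }

    // Number of tasks that hold a counter for the given word.
    static int tasksHolding(List<Map<String, Integer>> countsPerTask, String word) {
        int holders = 0;
        for (Map<String, Integer> counts : countsPerTask) {
            if (counts.containsKey(word)) {
                holders++;
            }
        }
        return holders;
    }

    // Sum of the word's counters over all tasks.
    static int total(List<Map<String, Integer>> countsPerTask, String word) {
        int sum = 0;
        for (Map<String, Integer> counts : countsPerTask) {
            sum += counts.getOrDefault(word, 0);
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] tweets = {"four more years", "four more tweets"};
        List<Map<String, Integer>> counts = run(tweets, 3);
        System.out.println("tasks holding 'four': " + tasksHolding(counts, "four")); // 1
        System.out.println("count of 'four': " + total(counts, "four"));             // 2
    }
}
```

With shuffle grouping into the count step instead, occurrences of one word could be spread over several tasks, and each task would emit only a partial count.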
  28. how do we run storm @twitter?
  29. storm on mesos: we run multiple instances of storm on the same cluster via mesos. mesos provides efficient resource isolation and sharing across distributed frameworks such as storm. [diagram: storm (production) and storm (dev) on mesos, over four nodes]
  30. topology isolation: isolation scheduler solves the problem of multi-tenancy – avoiding resource contention between topologies, by providing full isolation between topologies.
  31. topology isolation
      • shared pool - multiple topologies can run on the same host.
      • isolated pool - dedicated set of hosts to run a single topology.
  32. topology isolation [diagram: storm cluster; shared pool]
  33. topology isolation [diagram: storm cluster; shared pool; isolated pools; joe’s topology]
  34. topology isolation [diagram: storm cluster; shared pool; isolated pools; joe’s topology, jane’s topology]
  35. topology isolation [diagram: storm cluster; shared pool; isolated pools; joe’s topology, jane’s topology, dave’s topology]
  36. topology isolation [same diagram, with a host marked X; label: host failure]
  37. topology isolation [same diagram; labels: repair host, add host]
  38. topology isolation [same diagram; label: add to shared pool]
  39. numbers
      • benchmarked at a million tuples processed per second per node.
      • running 30 topologies in a 200 node cluster.
      • processing 50 billion messages a day with an average complete latency under 50 ms.
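A back-of-the-envelope conversion of the daily figure on slide 39 into per-second rates, as a small Java sketch. It assumes the load is spread evenly over the day and over all 200 nodes, which the slide does not state, and it treats "messages" as the unit throughout; `ThroughputSketch` and its methods are hypothetical names.

```java
// Hypothetical helper for the arithmetic only; the inputs come from the slide.
final class ThroughputSketch {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    // Average messages per second for a given daily volume.
    static double perSecond(double messagesPerDay) {
        return messagesPerDay / SECONDS_PER_DAY;
    }

    // Average messages per second per node, assuming an even spread.
    static double perNodePerSecond(double messagesPerDay, int nodes) {
        return perSecond(messagesPerDay) / nodes;
    }

    public static void main(String[] args) {
        double messagesPerDay = 50e9; // 50 billion messages a day
        int nodes = 200;              // 200 node cluster
        System.out.printf("%.0f messages/s cluster-wide%n", perSecond(messagesPerDay));
        System.out.printf("%.0f messages/s per node%n", perNodePerSecond(messagesPerDay, nodes));
    }
}
```

Under those assumptions the average comes to roughly 579,000 messages per second across the cluster, or about 2,900 per node. The benchmark figure on the same slide is stated in tuples rather than messages, so the two numbers are not directly comparable.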
  40. storm use-cases @twitter
  41. stream processing applications [diagram labels: tweets; favorites, retweets; impressions; twitter streams; storm; spout, bolt, bolt; $$$$; realtime dashboards; new features]
  42. current use-cases
      • discovery of emerging topics/stories.
      • online learning of tweet features for search result ranking.
      • realtime analytics for ads.
      • internal log processing.
  43. tweet scoring pipeline [diagram labels: data streams: tweets, impressions, interactions; storm topology; graph store; metadata store; join: tweets, impressions; join: tweets, interactions; last 7 days of: tweet -> feature_val, feature_type, timestamp; persistent store: tweet -> feature_val, feature_type, timestamp; thrift service; cassandra; twemcache; input: tweet id; output: score; write tweet features]
  44. road ahead
      • auto scaling.
      • persistent bolts.
      • better grouping schemes.
      • replicated computation.
      • higher-level abstractions.
  45. companies using storm
  46. questions? krishna@twitter.com
      project: https://storm-project.net
      mailing-list: http://groups.google.com/group/storm-user