storm at twitter


Talk given at Facebook's analytics@webscale conference. Covers Storm basics, a system overview, the architecture at Twitter, and current use cases.


Transcript

  • 1. storm
    stream processing @twitter
    Krishna Gade
    Twitter
    @krishnagade
    Sunday, June 16, 13
  • 2. what is storm?
    storm is a platform for doing analysis on streams of data as they come in, so you can react to data as it happens.
  • 3. storm v hadoop
    storm & hadoop are complementary!
    hadoop => big batch processing
    storm => fast, reactive, real time processing
  • 4. origins
    • originated at backtype, acquired by twitter in 2011.
    • to vastly simplify dealing with queues & workers.
  • 5. queue-worker model
    [diagram labels: queues, workers]
  • 6. typical workflow
    [diagram labels: queues, workers, queues, workers, datastore]
  • 7. problems
    • scaling is painful - queue partitioning & worker deploy.
    • operational overhead - worker failures & queue backups.
    • no guarantees on data processing.
  • 8. storm
  • 9. what does storm provide?
    • at least once message processing.
    • horizontal scalability.
    • no intermediate queues.
    • less operational overhead.
    • “just works”.
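The "at least once" guarantee on the slide above can be illustrated with a minimal plain-Java sketch. It does not use Storm's API and does not reproduce Storm's actual acking mechanism; the class `AtLeastOnceSketch`, its `process` method and the simulated one-time failure are hypothetical names chosen for illustration. A message stays pending until processing reports success, so a failed attempt is replayed: a message may be processed more than once, but it is never silently dropped.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AtLeastOnceSketch {
    // Records every processing attempt, including attempts that fail.
    final List<String> attempts = new ArrayList<>();
    // Hypothetical failure model: the first attempt for these messages fails.
    final Set<String> failOnce = new HashSet<>();

    // Returns true when the message was processed successfully ("acked"),
    // false when processing failed and the message must be replayed.
    boolean process(String msg) {
        attempts.add(msg);
        // remove() returns true only the first time the message is seen here.
        return !failOnce.remove(msg);
    }

    // Feeds messages through process(), re-queueing any that fail.
    void run(List<String> messages) {
        Deque<String> pending = new ArrayDeque<>(messages);
        while (!pending.isEmpty()) {
            String msg = pending.poll();
            if (!process(msg)) {
                pending.add(msg); // replay later instead of dropping
            }
        }
    }

    public static void main(String[] args) {
        AtLeastOnceSketch sketch = new AtLeastOnceSketch();
        sketch.failOnce.add("b");
        sketch.run(Arrays.asList("a", "b", "c"));
        System.out.println(sketch.attempts); // "b" appears twice
    }
}
```

Because a replayed message is seen twice, downstream logic in a system with this guarantee generally needs to tolerate duplicates.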
  • 10. storm primitives
    • streams
    • spouts
    • bolts
    • topologies
  • 11. streams
    unbounded sequence of tuples
    [diagram: T T T T T T T T T T T T T T T]
  • 12. spouts
    source of streams
    [diagram: one stream of A tuples, one stream of B tuples]
  • 13. typical spouts
    • read from a kestrel/kafka queue. {tuples = events}
    • read from an http server log. {tuples = http requests}
    • read from twitter streaming api. {tuples = tweets}
  • 14. bolts
    process input stream - A
    produce output stream - B
    [diagram: A A A A A A A A -> B B B B B B B B]
  • 15. bolts
    • filtering tuples in a stream.
    • aggregation of tuples.
    • joining multiple streams.
    • arbitrary functions on streams.
    • communication with external caches/dbs.
  • 16. topology
    directed-acyclic-graph of spouts and bolts.
    [diagram labels: spouts s1, s2; bolts b1, b2, b3, b4, b5]
  • 17. storm cluster
    [diagram labels: nimbus (master node); ZK; supervisors, each with workers w1 w2 w3 w4 (slave nodes); topology submission; topology map; sync code]
  • 18. nimbus
    • master node.
    • manages the topologies.
    • comparable to the job tracker in hadoop.
    $ storm jar myapp.jar com.twitter.MyTopology demo
  • 19. supervisor
    • runs on slave nodes.
    • co-ordinates with zookeeper.
    • manages workers.
  • 20. worker
    jvm process
    [diagram: a worker containing executors, each executor running one or more tasks]
  • 21. recap
    • worker - process that executes a subset of a topology.
    • executor - a thread spawned by a worker.
    • task - performs the actual data processing.
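To make the worker / executor / task nesting in the recap concrete, here is a small plain-Java sketch that spreads a component's tasks over its executors (threads). The class `TaskAssignmentSketch`, the method `assign` and the even, contiguous split are assumptions made for illustration; the slides do not say how Storm itself distributes tasks.

```java
import java.util.ArrayList;
import java.util.List;

class TaskAssignmentSketch {
    // Splits task ids 0..numTasks-1 into numExecutors contiguous groups whose
    // sizes differ by at most one. Illustrative scheme only, not necessarily
    // what Storm's scheduler does.
    static List<List<Integer>> assign(int numTasks, int numExecutors) {
        List<List<Integer>> executors = new ArrayList<>();
        int base = numTasks / numExecutors;   // minimum tasks per executor
        int extra = numTasks % numExecutors;  // first `extra` executors get one more
        int next = 0;
        for (int e = 0; e < numExecutors; e++) {
            int size = base + (e < extra ? 1 : 0);
            List<Integer> tasks = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                tasks.add(next++);
            }
            executors.add(tasks);
        }
        return executors;
    }

    public static void main(String[] args) {
        // Five tasks shared by two executor threads.
        System.out.println(assign(5, 2)); // [[0, 1, 2], [3, 4]]
    }
}
```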
  • 22. stream grouping
    • shuffle grouping - random distribution of tuples.
    • fields grouping - groups tuples by a field.
    • all grouping - replicates to all tasks.
    • global grouping - sends the entire stream to one task.
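The four groupings listed above can be sketched as functions that choose the target task(s) for a tuple. This is a plain-Java illustration with no Storm dependency; `GroupingSketch` and its method names are hypothetical, and hashing the field value modulo the task count is an assumed way to get the "same field value, same task" behaviour, not a statement about Storm's internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class GroupingSketch {
    // shuffle grouping: pick a target task at random.
    static int shuffle(Random rand, int numTasks) {
        return rand.nextInt(numTasks);
    }

    // fields grouping: equal field values always map to the same task.
    static int fields(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // all grouping: every task receives a copy of the tuple.
    static List<Integer> all(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int t = 0; t < numTasks; t++) {
            targets.add(t);
        }
        return targets;
    }

    // global grouping: the whole stream goes to a single task (task 0 here).
    static int global() {
        return 0;
    }

    public static void main(String[] args) {
        int numTasks = 4;
        Random rand = new Random();
        System.out.println("shuffle -> task " + shuffle(rand, numTasks));
        System.out.println("fields(\"storm\") -> task " + fields("storm", numTasks));
        System.out.println("fields(\"storm\") -> task " + fields("storm", numTasks)); // same task again
        System.out.println("all -> tasks " + all(numTasks));
        System.out.println("global -> task " + global());
    }
}
```

The fields grouping is what makes per-key state possible: a bolt task that keeps a counter per word can only be correct if every occurrence of that word reaches the same task.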
  • 23. streaming word-count
        TopologyBuilder builder = new TopologyBuilder();
        // spout with a parallelism hint of 5
        builder.setSpout("tweet_spout", new RandomTweetSpout(), 5);
        // parse bolt: tweets arrive from the spout via shuffle grouping
        builder.setBolt("parse_bolt", new ParseTweetBolt(), 8)
               .shuffleGrouping("tweet_spout")
               .setNumTasks(2);
        // count bolt: tuples with the same "word" go to the same task
        builder.setBolt("count_bolt", new WordCountBolt(), 12)
               .fieldsGrouping("parse_bolt", new Fields("word"));
        Config config = new Config();
        config.setNumWorkers(3);
        StormSubmitter.submitTopology("demo", config, builder.createTopology());
  • 24. tweet spout
        class RandomTweetSpout extends BaseRichSpout {
            SpoutOutputCollector collector;
            Random rand;
            String[] tweets = new String[] {
                "@jkrums:There’s a plane in the Hudson. I’m on the ferry to pick up people. Crazy",
                "@barackobama: Four more years. pic.twitter.com/bAJE6Vom",
                ...
            };
            ....
            @Override
            public void nextTuple() {
                Utils.sleep(100);
                // emit one randomly chosen tweet per call
                String tweet = tweets[rand.nextInt(tweets.length)];
                collector.emit(new Values(tweet));
            }
        }
  • 25. parse bolt
        class ParseTweetBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                String tweet = tuple.getString(0);
                // emit one tuple per space-separated word
                for (String word : tweet.split(" ")) {
                    collector.emit(new Values(word));
                }
            }
            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word"));
            }
        }
  • 26. word count bolt
        class WordCountBolt extends BaseBasicBolt {
            // running count per word, held in memory by this task
            Map<String, Integer> counts = new HashMap<String, Integer>();
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                String word = tuple.getString(0);
                Integer count = counts.get(word);
                count = (count == null) ? 1 : count + 1;
                counts.put(word, count);
                collector.emit(new Values(word, count));
            }
            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word", "count"));
            }
        }
  • 27. word-count topology
    RandomTweetSpout -> (shuffle grouping) -> ParseTweetBolt -> (fields grouping) -> WordCountBolt
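To see what the word-count topology above would emit without running a cluster, the parse and count steps can be imitated in plain Java. This is a sketch, not Storm code: `WordCountSketch`, `parse`, `count` and `run` are hypothetical names, the `word=count` strings stand in for the `(word, count)` tuples, and a single in-memory map replaces the per-task maps that the count bolt's tasks would each hold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordCountSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    // Mirrors the parse step: split a tweet on single spaces.
    static String[] parse(String tweet) {
        return tweet.split(" ");
    }

    // Mirrors the count step: bump the running count for the word and
    // return a "word=count" string in place of the emitted tuple.
    String count(String word) {
        int count = counts.merge(word, 1, Integer::sum);
        return word + "=" + count;
    }

    // Pushes every tweet through parse and count, collecting what is emitted.
    static List<String> run(List<String> tweets) {
        WordCountSketch counter = new WordCountSketch();
        List<String> emitted = new ArrayList<>();
        for (String tweet : tweets) {
            for (String word : parse(tweet)) {
                emitted.add(counter.count(word));
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("four more years", "four more")));
        // [four=1, more=1, years=1, four=2, more=2]
    }
}
```

Note that an updated count is emitted for every incoming word, so consumers see a stream of running totals rather than one final number per word.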
  • 28. how do we run storm @twitter?
  • 29. storm on mesos
    we run multiple instances of storm on the same cluster via mesos.
    mesos provides efficient resource isolation and sharing across distributed frameworks such as storm.
    [diagram labels: storm (production) and storm (dev) running on mesos across several nodes]
  • 30. topology isolation
    the isolation scheduler solves the problem of multi-tenancy (avoiding resource contention between topologies) by providing full isolation between topologies.
  • 31. topology isolation
    • shared pool - multiple topologies can run on the same host.
    • isolated pool - dedicated set of hosts to run a single topology.
  • 32. topology isolation
    [diagram labels: storm cluster; shared pool]
  • 33. topology isolation
    [diagram labels: storm cluster; shared pool; isolated pools: joe’s topology]
  • 34. topology isolation
    [diagram labels: storm cluster; shared pool; isolated pools: joe’s topology, jane’s topology]
  • 35. topology isolation
    [diagram labels: storm cluster; shared pool; isolated pools: joe’s topology, jane’s topology, dave’s topology]
  • 36. topology isolation
    [diagram labels: storm cluster; shared pool; isolated pools: joe’s topology, jane’s topology, dave’s topology; X marks a host failure]
  • 37. topology isolation
    [diagram labels: storm cluster; shared pool; isolated pools: joe’s topology, jane’s topology, dave’s topology; repair host; add host]
  • 38. topology isolation
    [diagram labels: storm cluster; shared pool; isolated pools: joe’s topology, jane’s topology, dave’s topology; add to shared pool]
  • 39. numbers
    • benchmarked at a million tuples processed per second per node.
    • running 30 topologies in a 200 node cluster.
    • processing 50 billion messages a day with an average complete latency under 50ms.
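For scale, the daily figure on the slide above can be converted into average rates. The sketch below assumes the load is spread evenly over the day and over all 200 nodes, which the slides do not claim, so the results are only rough averages; `ThroughputSketch` and its method names are hypothetical. Under those assumptions, 50 billion messages a day is roughly 579 thousand messages per second cluster-wide, or about 2.9 thousand per node per second.

```java
class ThroughputSketch {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    // Average cluster-wide rate, assuming an even spread over the day.
    static double perSecond(long messagesPerDay) {
        return (double) messagesPerDay / SECONDS_PER_DAY;
    }

    // Average per-node rate, additionally assuming an even spread over nodes.
    static double perNodePerSecond(long messagesPerDay, int nodes) {
        return perSecond(messagesPerDay) / nodes;
    }

    public static void main(String[] args) {
        long perDay = 50_000_000_000L; // figure from the slide
        int nodes = 200;               // figure from the slide
        System.out.printf("%.0f messages/s cluster-wide%n", perSecond(perDay));
        System.out.printf("%.0f messages/s per node%n", perNodePerSecond(perDay, nodes));
    }
}
```

These averages are not directly comparable with the benchmark of a million tuples per second per node: one message may fan out into many tuples inside a topology, and a benchmark measures peak capacity rather than average load.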
  • 40. storm use-cases @twitter
  • 41. stream processing applications
    [diagram labels: twitter streams (tweets; favorites, retweets; impressions); storm (spout, bolt, bolt); $$$$; realtime dashboards; new features]
  • 42. current use-cases
    • discovery of emerging topics/stories.
    • online learning of tweet features for search result ranking.
    • realtime analytics for ads.
    • internal log processing.
  • 43. tweet scoring pipeline
    [diagram labels:
      data streams: tweets, impressions, interactions
      storm topology: join: tweets, impressions; join: tweets, interactions
      graph store; metadata store
      write tweet features
      last 7 days of: tweet -> feature_val, feature_type, timestamp
      persistent store: tweet -> feature_val, feature_type, timestamp
      thrift service; cassandra; twemcache
      input: tweet id; output: score]
  • 44. road ahead
    • auto scaling.
    • persistent bolts.
    • better grouping schemes.
    • replicated computation.
    • higher-level abstractions.
  • 45. companies using storm
  • 46. questions?
    krishna@twitter.com
    project: https://storm-project.net
    mailing-list: http://groups.google.com/group/storm-user