storm at twitter

Talk given at facebook's analytics@webscale conference. Covers storm basics, system overview, architecture at twitter and current use-cases.


Published in: Technology
Transcript

  • 1. storm
    stream processing @twitter
    Krishna Gade, Twitter, @krishnagade
    Sunday, June 16, 13
  • 2. what is storm?
    storm is a platform for doing analysis on streams of data as they come in, so you can react to data as it happens.
  • 3. storm v hadoop
    storm & hadoop are complementary!
    hadoop => big batch processing
    storm => fast, reactive, real time processing
  • 4. origins
      • originated at backtype, acquired by twitter in 2011.
      • to vastly simplify dealing with queues & workers.
  • 5. queue-worker model
    [diagram: queues, workers; a a a a a]
  • 6. typical workflow
    [diagram: queues, queues, workers, workers, datastore]
  • 7. problems
      • scaling is painful - queue partitioning & worker deploy.
      • operational overhead - worker failures & queue backups.
      • no guarantees on data processing.
  • 8. storm
  • 9. what does storm provide?
      • at least once message processing.
      • horizontal scalability.
      • no intermediate queues.
      • less operational overhead.
      • “just works”.
  • 10. storm primitives
      • streams
      • spouts
      • bolts
      • topologies
  • 11. streams
    unbounded sequence of tuples
    [diagram: T T T T T T T T T T T T T T T]
  • 12. spouts
    source of streams
    [diagram: A A A A A A A A A A A A / B B B B B B B B B B B B]
  • 13. typical spouts
      • read from a kestrel/kafka queue. {tuples = events}
      • read from a http server log. {tuples = http requests}
      • read from twitter streaming api. {tuples = tweets}
  • 14. bolts
    process input stream - A
    produce output stream - B
    [diagram: A A A A A A A A / B B B B B B B B]
  • 15. bolts
      • filtering tuples in a stream.
      • aggregation of tuples.
      • joining multiple streams.
      • arbitrary functions on streams.
      • communication with external caches/dbs.
  • 16. topology
    directed-acyclic-graph of spouts and bolts.
    [diagram: s1 s2 b1 b2 b3 b4 b5]
  • 17. storm cluster
    [diagram: nimbus (master node); ZK; two supervisors (slave nodes), each with workers w1 w2 w3 w4; labels: topology submission, topology map, sync code]
  • 18. nimbus
      • master node.
      • manages the topologies.
      • analogous to the job tracker in hadoop.
        $ storm jar myapp.jar com.twitter.MyTopology demo
  • 19. supervisor
      • runs on slave nodes.
      • co-ordinates with zookeeper.
      • manages workers.
  • 20. worker
    jvm process
    [diagram: executor, task task task task, executor, executor]
  • 21. recap
      • worker - process that executes a subset of a topology.
      • executor - a thread spawned by a worker.
      • task - performs the actual data processing.
  • 22. stream grouping
      • shuffle grouping - random distribution of tuples.
      • fields grouping - groups tuples by a field.
      • all grouping - replicates to all tasks.
      • global grouping - sends the entire stream to one task.
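
The four groupings on slide 22 can be sketched in plain Java with no Storm dependency. Everything below is a hypothetical illustration: the class name `GroupingSketch`, the hash-modulo rule for fields grouping, the purely random pick for shuffle grouping, and the choice of task 0 for global grouping are assumptions made for this example and are not taken from Storm's source. The sketch only shows which task indexes a tuple could be routed to under each grouping.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative only: which task indexes a tuple could go to under each grouping. */
public class GroupingSketch {
    private static final Random RANDOM = new Random();

    /** shuffle grouping: one task, picked at random (assumed uniform). */
    static List<Integer> shuffle(int numTasks) {
        return Collections.singletonList(RANDOM.nextInt(numTasks));
    }

    /** fields grouping: equal field values always map to the same task (assumed hash mod). */
    static List<Integer> fields(Object fieldValue, int numTasks) {
        return Collections.singletonList(Math.floorMod(fieldValue.hashCode(), numTasks));
    }

    /** all grouping: every task receives a copy of the tuple. */
    static List<Integer> all(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int task = 0; task < numTasks; task++) {
            targets.add(task);
        }
        return targets;
    }

    /** global grouping: the whole stream goes to one task (index 0 here, an assumption). */
    static List<Integer> global() {
        return Collections.singletonList(0);
    }

    public static void main(String[] args) {
        int numTasks = 12;
        System.out.println("shuffle         -> " + shuffle(numTasks));
        System.out.println("fields(\"storm\") -> " + fields("storm", numTasks));
        System.out.println("all             -> " + all(numTasks));
        System.out.println("global          -> " + global());
    }
}
```

The property that matters for the word-count example is the one `fields` shows: the same word always lands on the same task, so a per-task in-memory counter sees every occurrence of that word.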
  • 23. streaming word-count
        TopologyBuilder builder = new TopologyBuilder();
        // spout emitting random tweets, parallelism hint of 5
        builder.setSpout("tweet_spout", new RandomTweetSpout(), 5);
        // tweets are distributed randomly across the parse bolt
        builder.setBolt("parse_bolt", new ParseTweetBolt(), 8)
               .shuffleGrouping("tweet_spout")
               .setNumTasks(2);
        // the same word always reaches the same count task
        builder.setBolt("count_bolt", new WordCountBolt(), 12)
               .fieldsGrouping("parse_bolt", new Fields("word"));

        Config config = new Config();
        config.setNumWorkers(3);
        StormSubmitter.submitTopology("demo", config, builder.createTopology());
  • 24. tweet spout
        class RandomTweetSpout extends BaseRichSpout {
            SpoutOutputCollector collector;
            Random rand;
            String[] tweets = new String[] {
                "@jkrums:There’s a plane in the Hudson. I’m on the ferry to pick up people. Crazy",
                "@barackobama: Four more years. pic.twitter.com/bAJE6Vom",
                ...
            };
            ....
            @Override
            public void nextTuple() {
                Utils.sleep(100);
                // emit one randomly chosen tweet per call
                String tweet = tweets[rand.nextInt(tweets.length)];
                collector.emit(new Values(tweet));
            }
        }
  • 25. parse bolt
        class ParseTweetBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                String tweet = tuple.getString(0);
                // emit one tuple per space-separated word
                for (String word : tweet.split(" ")) {
                    collector.emit(new Values(word));
                }
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word"));
            }
        }
  • 26. word count bolt
        class WordCountBolt extends BaseBasicBolt {
            // running counts, held in memory by this task
            Map<String, Integer> counts = new HashMap<String, Integer>();

            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                String word = tuple.getString(0);
                Integer count = counts.get(word);
                count = (count == null) ? 1 : count + 1;
                counts.put(word, count);
                collector.emit(new Values(word, count));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word", "count"));
            }
        }
  • 27. word-count topology
    [diagram: RandomTweetSpout -> (shuffle grouping) -> ParseTweetBolt -> (fields grouping) -> WordCountBolt]
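
To see why the count bolt is wired with a fields grouping on "word", the data flow of slides 23 to 27 can be imitated in plain Java, without Storm. This is a minimal sketch under stated assumptions: `WordCountFlowSketch`, `parse`, and `count` are hypothetical names, and routing by `hashCode` modulo the task count stands in for whatever the fields grouping actually does internally. Each map in the returned list plays the role of the `counts` map inside one WordCountBolt task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: word-count data flow with one in-memory map per count task. */
public class WordCountFlowSketch {

    /** Mirrors ParseTweetBolt: split a tweet into words on single spaces. */
    static List<String> parse(String tweet) {
        return new ArrayList<>(Arrays.asList(tweet.split(" ")));
    }

    /** Routes each word to one task by hash (assumed rule) and counts it there. */
    static List<Map<String, Integer>> count(List<String> tweets, int numCountTasks) {
        List<Map<String, Integer>> taskCounts = new ArrayList<>();
        for (int task = 0; task < numCountTasks; task++) {
            taskCounts.add(new HashMap<>());
        }
        for (String tweet : tweets) {
            for (String word : parse(tweet)) {
                int task = Math.floorMod(word.hashCode(), numCountTasks);
                taskCounts.get(task).merge(word, 1, Integer::sum);
            }
        }
        return taskCounts;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList("four more years", "four more tweets");
        List<Map<String, Integer>> taskCounts = count(tweets, 3);
        for (int task = 0; task < taskCounts.size(); task++) {
            System.out.println("count task " + task + " -> " + taskCounts.get(task));
        }
    }
}
```

Because a given word always maps to the same task, its total appears in exactly one map. With a shuffle grouping in front of the count bolt, occurrences of one word would be spread over several tasks and each task would hold only a partial count.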
  • 28. how do we run storm @twitter?
  • 29. storm on mesos
    we run multiple instances of storm on the same cluster via mesos.
    mesos provides efficient resource isolation and sharing across distributed frameworks such as storm.
    [diagram: storm (production), storm (dev); mesos; node node node node]
  • 30. topology isolation
    isolation scheduler solves the problem of multi-tenancy (avoiding resource contention between topologies) by providing full isolation between topologies.
  • 31. topology isolation
      • shared pool - multiple topologies can run on the same host.
      • isolated pool - dedicated set of hosts to run a single topology.
  • 32. topology isolation
    [diagram: storm cluster; shared pool]
  • 33. topology isolation
    [diagram: storm cluster; shared pool; isolated pools: joe’s topology]
  • 34. topology isolation
    [diagram: storm cluster; shared pool; isolated pools: joe’s topology, jane’s topology]
  • 35. topology isolation
    [diagram: storm cluster; shared pool; isolated pools: joe’s topology, jane’s topology, dave’s topology]
  • 36. topology isolation
    [diagram: storm cluster; shared pool; isolated pools: joe’s topology, jane’s topology, dave’s topology; host failure (X)]
  • 37. topology isolation
    [diagram: storm cluster; shared pool; isolated pools: joe’s topology, jane’s topology, dave’s topology; repair host, add host]
  • 38. topology isolation
    [diagram: storm cluster; shared pool; isolated pools: joe’s topology, jane’s topology, dave’s topology; add to shared pool]
  • 39. numbers
      • benchmarked at a million tuples processed per second per node.
      • running 30 topologies in a 200 node cluster.
      • processing 50 billion messages a day with an average complete latency under 50 ms.
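
The daily figure on slide 39 is easier to compare with the per-node benchmark once it is converted to a per-second rate. The sketch below does that arithmetic. It assumes, purely for illustration, that the 50 billion messages are spread evenly over the day and over the 200 node cluster mentioned on the same slide, which the slide does not state, and it treats "messages" and "tuples" as comparable units. `ThroughputSketch` and its method names are hypothetical.

```java
/** Illustrative only: convert the slide's daily volume into average per-second rates. */
public class ThroughputSketch {
    static final long SECONDS_PER_DAY = 24L * 60 * 60;

    /** Average messages per second for a given daily volume. */
    static double perSecond(long messagesPerDay) {
        return (double) messagesPerDay / SECONDS_PER_DAY;
    }

    /** Average messages per second per node, assuming an even spread across nodes. */
    static double perNodePerSecond(long messagesPerDay, int nodes) {
        return perSecond(messagesPerDay) / nodes;
    }

    public static void main(String[] args) {
        long messagesPerDay = 50_000_000_000L; // from the slide
        int nodes = 200;                       // from the slide; even spread is an assumption
        System.out.printf("cluster average:  %.0f messages/sec%n", perSecond(messagesPerDay));
        System.out.printf("per-node average: %.0f messages/sec%n",
                perNodePerSecond(messagesPerDay, nodes));
    }
}
```

Under these assumptions the average comes to roughly 579 thousand messages per second across the cluster, or about 2,900 per node, well below the benchmarked million tuples per second per node. Averages hide peaks, so this is only an order-of-magnitude comparison.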
  • 40. storm use-cases @twitter
  • 41. stream processing applications
    [diagram labels: tweets; favorites, retweets; impressions; twitter; storm; streams; spout; bolt; bolt; $$$$; realtime dashboards; new features]
  • 42. current use-cases
      • discovery of emerging topics/stories.
      • online learning of tweet features for search result ranking.
      • realtime analytics for ads.
      • internal log processing.
  • 43. tweet scoring pipeline
    [diagram labels: tweets; data streams; impressions; interactions; storm topology; graph store; metadata store; join: tweets, impressions; join: tweets, interactions; last 7 days of: tweet -> feature_val, feature_type, timestamp; persistent store: tweet -> feature_val, feature_type, timestamp; thrift service; cassandra; twemcache; input: tweet id; output: score; write tweet features]
  • 44. road ahead
      • auto scaling.
      • persistent bolts.
      • better grouping schemes.
      • replicated computation.
      • higher-level abstractions.
  • 45. companies using storm
  • 46. questions?
    krishna@twitter.com
    project: https://storm-project.net
    mailing-list: http://groups.google.com/group/storm-user
