Storm - As deep into real-time data processing as you can get in 30 minutes.

My slides from GlueCon 2013

  1. Storm. Dan Lynn, dan@fullcontact.com, @danklynn. As deep into real-time data processing as you can get** in 30 minutes.
  2. Keeps Contact Information Current and Complete. Based in Denver, Colorado. CTO & Co-Founder. dan@fullcontact.com, @danklynn
  3. Turn Partial Contacts Into Full Contacts
  4. Storm
  5. Storm: distributed and fault-tolerant real-time computation
  6. Storm: distributed and fault-tolerant real-time computation
  7. Storm: distributed and fault-tolerant real-time computation
  8. Storm: distributed and fault-tolerant real-time computation
  9. THE HARD WAY: Queues and Workers
  10. THE HARD WAY
  11. Key Concepts
  12. Tuples: an ordered list of elements
  13. Tuples: an ordered list of elements, e.g. ("search-01384", "e:dan@fullcontact.com")
  14. Streams: an unbounded sequence of tuples
  15. Streams: an unbounded sequence of tuples (Tuple, Tuple, Tuple, Tuple, Tuple, Tuple)
  16. Spouts: a source of streams
  17. Spouts: a source of streams
  18. Spouts: a source of streams (Tuple, Tuple, Tuple, Tuple, Tuple, Tuple)
  19. Spouts can talk with: queues, web logs, API calls, event data (some images from http://commons.wikimedia.org)
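
For reference, a spout is just a class that emits tuples when Storm asks for them. A minimal sketch (not from the deck; it uses the 0.8/0.9-era backtype.storm API, and the body is an assumption, though the class name matches the RandomSentenceSpout used later in the word-count example):

      // Hypothetical minimal spout: emits one random sentence per nextTuple() call.
      import backtype.storm.spout.SpoutOutputCollector;
      import backtype.storm.task.TopologyContext;
      import backtype.storm.topology.OutputFieldsDeclarer;
      import backtype.storm.topology.base.BaseRichSpout;
      import backtype.storm.tuple.Fields;
      import backtype.storm.tuple.Values;

      import java.util.Map;
      import java.util.Random;

      public class RandomSentenceSpout extends BaseRichSpout {
          private SpoutOutputCollector collector;
          private Random random;
          private static final String[] SENTENCES = {
              "the cow jumped over the moon",
              "an apple a day keeps the doctor away"
          };

          @Override
          public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
              this.collector = collector;
              this.random = new Random();
          }

          @Override
          public void nextTuple() {
              // Called repeatedly by Storm; emit one tuple per invocation.
              String sentence = SENTENCES[random.nextInt(SENTENCES.length)];
              collector.emit(new Values(sentence));
          }

          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("sentence"));
          }
      }
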
  20. Bolts: process tuples and create new streams
  21. Bolts: apply functions/transforms, filter, aggregate, do streaming joins, access DBs, APIs, etc. (some images from http://commons.wikimedia.org)
  22. Bolts (diagram: incoming streams of tuples are transformed into new streams of tuples; some images from http://commons.wikimedia.org)
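
As an illustration of the "filter" case above, a bolt can simply drop tuples it does not care about. This is a sketch, not from the deck; the class name and the length threshold are made up:

      // Hypothetical filter bolt: passes through only tuples whose "word"
      // field is longer than three characters.
      import backtype.storm.topology.BasicOutputCollector;
      import backtype.storm.topology.OutputFieldsDeclarer;
      import backtype.storm.topology.base.BaseBasicBolt;
      import backtype.storm.tuple.Fields;
      import backtype.storm.tuple.Tuple;
      import backtype.storm.tuple.Values;

      public class LongWordFilterBolt extends BaseBasicBolt {
          @Override
          public void execute(Tuple tuple, BasicOutputCollector collector) {
              String word = tuple.getStringByField("word");
              if (word.length() > 3) {
                  collector.emit(new Values(word));  // interesting words flow downstream
              }
              // Short words are simply dropped; BaseBasicBolt acks the input automatically.
          }

          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("word"));
          }
      }
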
  23. Topologies: a directed graph of Spouts and Bolts
  24. This is a Topology (diagram; some images from http://commons.wikimedia.org)
  25. This is also a topology (diagram; some images from http://commons.wikimedia.org)
  26. Tasks: execute Spouts or Bolts
  27. Running a Topology: $ storm jar my-code.jar com.example.MyTopology arg1 arg2
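
The class named on the command line is expected to build and submit the topology in its main method. A sketch (not from the deck) of what com.example.MyTopology might look like, assuming the 0.8/0.9-era StormSubmitter and LocalCluster APIs and the common pattern of passing the topology name as the first argument:

      import backtype.storm.Config;
      import backtype.storm.LocalCluster;
      import backtype.storm.StormSubmitter;
      import backtype.storm.topology.TopologyBuilder;

      public class MyTopology {
          public static void main(String[] args) throws Exception {
              TopologyBuilder builder = new TopologyBuilder();
              // ... setSpout(...) / setBolt(...) wiring goes here ...

              Config conf = new Config();
              conf.setNumWorkers(4);

              if (args.length > 0) {
                  // Submit to the cluster under the name given on the command line.
                  StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
              } else {
                  // In-process cluster for local development and testing.
                  LocalCluster cluster = new LocalCluster();
                  cluster.submitTopology("my-topology", conf, builder.createTopology());
              }
          }
      }
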
  28. Storm Cluster (diagram by Nathan Marz)
  29. Storm Cluster (diagram by Nathan Marz): if this were Hadoop...
  30. Storm Cluster (diagram by Nathan Marz): if this were Hadoop, this would be the Job Tracker
  31. Storm Cluster (diagram by Nathan Marz): if this were Hadoop, these would be the Task Trackers
  32. Storm Cluster (diagram by Nathan Marz): this piece coordinates everything. But it's not Hadoop.
  33. Example: Streaming Word Count
  34. Streaming Word Count:
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new RandomSentenceSpout(), 5);
        builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("sentences");
        builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
  35. Streaming Word Count (same code as the previous slide)
  36. Streaming Word Count (SplitSentence.java):
        public static class SplitSentence extends ShellBolt implements IRichBolt {
            public SplitSentence() {
                super("python", "splitsentence.py");
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word"));
            }

            @Override
            public Map<String, Object> getComponentConfiguration() {
                return null;
            }
        }
  37. Streaming Word Count (the same SplitSentence.java, with splitsentence.py called out)
  38. Streaming Word Count (the same SplitSentence.java)
  39. Streaming Word Count (the same TopologyBuilder wiring as slide 34, in Java)
  40. Streaming Word Count (WordCount.java):
        public static class WordCount extends BaseBasicBolt {
            Map<String, Integer> counts = new HashMap<String, Integer>();

            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                String word = tuple.getString(0);
                Integer count = counts.get(word);
                if (count == null) count = 0;
                count++;
                counts.put(word, count);
                collector.emit(new Values(word, count));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word", "count"));
            }
        }
  41. Streaming Word Count (the same TopologyBuilder wiring, in Java): groupings control how tuples are routed
  42. Shuffle grouping: tuples are randomly distributed across all of the tasks running the bolt
  43. Fields grouping: groups tuples by specific named fields and routes them to the same task
  44. Fields grouping: groups tuples by specific named fields and routes them to the same task. Analogous to Hadoop's partitioning behavior.
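
In code, the grouping is chosen where a bolt subscribes to a stream. This sketch continues the word-count wiring from slide 34 (not from the deck; the "metrics" bolt and MetricsBolt class are made-up examples added to show two other built-in groupings):

      TopologyBuilder builder = new TopologyBuilder();
      builder.setSpout("sentences", new RandomSentenceSpout(), 5);

      // Shuffle grouping: tuples are spread randomly across the 8 "split" tasks.
      builder.setBolt("split", new SplitSentence(), 8)
             .shuffleGrouping("sentences");

      // Fields grouping: every tuple with the same "word" value goes to the same
      // "count" task, so that task's in-memory count for the word stays complete.
      builder.setBolt("count", new WordCount(), 12)
             .fieldsGrouping("split", new Fields("word"));

      // Other built-ins: allGrouping replicates each tuple to every task of a
      // bolt; globalGrouping sends the entire stream to a single task.
      builder.setBolt("metrics", new MetricsBolt(), 1)
             .globalGrouping("count");
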
  45. Trending Topics
  46. Twitter Trending Topics (topology diagram): TwitterStreamingTopicSpout, parallelism = 1 (unless you use GNip); RollingCountsBolt, parallelism = n; IntermediateRankingsBolt, parallelism = n; TotalRankingsBolt, parallelism = 1; RankingsReportBolt, parallelism = 1. The streams between them carry (tweets), (word), (word, count), (rankings), and (JSON rankings).
  47. Live Coding!
  48. Twitter Trending Topics (the same topology diagram as slide 46)
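
One plausible way to wire the components from that diagram together (a sketch, not the live-coded version from the talk; the constructors, stream names, parallelism numbers, and groupings here are assumptions):

      TopologyBuilder builder = new TopologyBuilder();
      // Assume the spout emits one topic word per tuple, extracted from the tweet stream.
      builder.setSpout("tweets", new TwitterStreamingTopicSpout(), 1);
      // Count words over a rolling window; the same word always hits the same task.
      builder.setBolt("rolling-counts", new RollingCountsBolt(), 4)
             .fieldsGrouping("tweets", new Fields("word"));
      // Each task keeps a partial top-N ranking of the words it sees.
      builder.setBolt("intermediate-rankings", new IntermediateRankingsBolt(), 4)
             .fieldsGrouping("rolling-counts", new Fields("word"));
      // Merge all partial rankings in a single task.
      builder.setBolt("total-rankings", new TotalRankingsBolt(), 1)
             .globalGrouping("intermediate-rankings");
      // Emit the final rankings (e.g. as JSON) from one reporting task.
      builder.setBolt("report", new RankingsReportBolt(), 1)
             .globalGrouping("total-rankings");
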
  49. Tips
  50. Use a log aggregator: loggly.com, Graylog2, logstash
  51. Rolling Deploys: name each topology "$topologyName-$buildNumber"
  52. Rolling Deploys: 1. Launch the new topology. 2. Wait for it to be healthy. 3. Kill the old one.
  53. Rolling Deploys: these are under active development
  54. Tune your parallelism (the same TopologyBuilder wiring, in Java). See: https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
  55. Tune your parallelism (diagram): a Supervisor runs Worker Processes (JVMs); each Worker Process runs Executors (threads); each Executor runs one or more Tasks. Parallelism hints control the number of Executors.
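
Concretely, the knobs described on that wiki page look like this in code (a sketch, not from the deck; the numbers are arbitrary):

      Config conf = new Config();
      conf.setNumWorkers(2);                                        // 2 worker JVMs for the topology

      TopologyBuilder builder = new TopologyBuilder();
      builder.setSpout("sentences", new RandomSentenceSpout(), 2);  // 2 executors
      builder.setBolt("split", new SplitSentence(), 4)              // 4 executors...
             .setNumTasks(8)                                        // ...sharing 8 tasks
             .shuffleGrouping("sentences");
      builder.setBolt("count", new WordCount(), 6)                  // 6 executors
             .fieldsGrouping("split", new Fields("word"));
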
  56. Anchor your tuples (or not):
        collector.emit(new Values(word, count));         // unanchored
        collector.emit(tuple, new Values(word, count));  // anchored to the input tuple
      See: https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
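
In a BaseRichBolt, where acking is manual, the anchored form ties each output tuple into Storm's ack tree, so a downstream failure makes the spout replay the original tuple. A sketch, not from the deck; the class name is made up:

      import backtype.storm.task.OutputCollector;
      import backtype.storm.task.TopologyContext;
      import backtype.storm.topology.OutputFieldsDeclarer;
      import backtype.storm.topology.base.BaseRichBolt;
      import backtype.storm.tuple.Fields;
      import backtype.storm.tuple.Tuple;
      import backtype.storm.tuple.Values;

      import java.util.Map;

      public class AnchoredSplitBolt extends BaseRichBolt {
          private OutputCollector collector;

          @Override
          public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
              this.collector = collector;
          }

          @Override
          public void execute(Tuple tuple) {
              for (String word : tuple.getString(0).split(" ")) {
                  collector.emit(tuple, new Values(word));  // anchored to the input tuple
              }
              collector.ack(tuple);  // tell Storm this input has been fully processed
          }

          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("word"));
          }
      }
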
  57. But Dan, you left out Trident!
  58. if (storm == hadoop) { trident = pig / cascading }
  59. A little taste of Trident:
        TridentState urlToTweeters =
            topology.newStaticState(getUrlToTweetersState());
        TridentState tweetersToFollowers =
            topology.newStaticState(getTweeterToFollowersState());

        topology.newDRPCStream("reach")
            .stateQuery(urlToTweeters, new Fields("args"), new MapGet(), new Fields("tweeters"))
            .each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter"))
            .shuffle()
            .stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(), new Fields("followers"))
            .parallelismHint(200)
            .each(new Fields("followers"), new ExpandList(), new Fields("follower"))
            .groupBy(new Fields("follower"))
            .aggregate(new One(), new Fields("one"))
            .parallelismHint(20)
            .aggregate(new Count(), new Fields("reach"));
      https://github.com/nathanmarz/storm/wiki/Trident-tutorial
  60. Thanks: @stormprocessor, http://github.com/nathanmarz/storm; Nathan Marz, @nathanmarz; Michael Noll, @miguno, http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/; Michael Rose, @xorlev, http://github.com/xorlev
  61. Questions? dan@fullcontact.com, https://github.com/danklynn/storm-starter/tree/gluecon2013
