Introduction to Twitter Storm

Talk held at the FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany

Agenda:
- Why Twitter Storm?
- What is Twitter Storm?
- What to do with Twitter Storm?


  1. Introduction to Twitter Storm (uweseiler)
  2. About me: Big Data Nerd, Travelpirate, Photography Enthusiast, Hadoop Trainer, MongoDB Author
  3. About us: a bunch of Big Data Nerds, Agile Ninjas, Continuous Delivery Gurus, Enterprise Java Specialists, and Performance Geeks. Join us!
  4. Agenda: • Why Twitter Storm? • What is Twitter Storm? • What to do with Twitter Storm?
  5. The 3 V's of Big Data: Volume, Variety, Velocity
  6. Velocity
  7. Why Twitter Storm?
  8. Batch vs. real-time processing
      • Batch processing: gathering of data and processing it as a group at one time.
      • Real-time processing: processing of data that takes place as the information is being entered.
  9. Lambda architecture
  10. Bridging the gap
      A batch workflow is too slow, so the batch views are always out of date: at any moment, only the last few hours of data have not yet been absorbed into the batch views.
  11. Storm vs. Hadoop
      Storm: real-time processing, topologies run forever, no SPOF, stateless nodes.
      Hadoop: batch processing, jobs run to completion, the NameNode is a SPOF, stateful nodes.
      Both: scalable, guarantee no data loss, open source.
  12. Stream processing
      Stream processing is a technical paradigm for processing large volumes of unbounded sequences of tuples in real time.
      Typical applications: algorithmic trading, sensor data monitoring, continuous analytics.
  13. Example: stream of tweets (https://github.com/colinsurprenant/tweitgeist)
  14. Agenda: • Why Twitter Storm? • What is Twitter Storm? • What to do with Twitter Storm?
  15. Welcome, Twitter Storm!
      • Created by Nathan Marz @ BackType to analyze tweets, links, and users on Twitter
      • Open sourced on 19th September, 2011 (Eclipse Public License 1.0, Storm v0.5.2)
      • Latest updates: current stable release v0.8.2, released on 11th January, 2013; major core improvements planned for v0.9.0; Storm will [soon] become an Apache project
  16. Storm under the hood
      • Java & Clojure
      • Apache Thrift: cross-language bridge, RPC, framework to build services
      • ZeroMQ: asynchronous message transport layer
      • Kryo: serialization framework
      • Jetty: embedded web server
  17. Conceptual view
      • Spout: source of streams
      • Bolt: consumer of streams; processes tuples and possibly emits new tuples
      • Tuple: list of name-value pairs
      • Stream: unbounded sequence of tuples
      • Topology: network with spouts & bolts as the nodes and streams as the edges
  18. Physical view
      • Nimbus: master daemon process; responsible for distributing code, assigning tasks, and monitoring failures
      • ZooKeeper: stores the operational cluster state
      • Supervisor: worker daemon process listening for work assigned to its node
      • Worker: Java process executing a subset of a topology
      • Executor: Java thread spawned by a worker; runs one or more tasks of the same component
      • Task: component (spout/bolt) instance; performs the actual data processing
  19. A simple example: WordCount
      FileReaderSpout reads shakespeare.txt and emits one tuple per line; WordSplitBolt splits each line into words; WordCountBolt counts the words and produces a sorted list (the: 27730, and: 26099, i: 19540, to: 18763, of: 18126).
  20. FileReaderSpout I

      package de.codecentric.storm.wordcount.spouts;

      import java.io.BufferedReader;
      import java.io.FileNotFoundException;
      import java.io.FileReader;
      import java.util.Map;

      import backtype.storm.spout.SpoutOutputCollector;
      import backtype.storm.task.TopologyContext;
      import backtype.storm.topology.OutputFieldsDeclarer;
      import backtype.storm.topology.base.BaseRichSpout;
      import backtype.storm.tuple.Fields;
      import backtype.storm.tuple.Values;

      public class FileReaderSpout extends BaseRichSpout {

          private SpoutOutputCollector collector;
          private FileReader fileReader;
          private boolean completed = false;

          public void ack(Object msgId) {
              System.out.println("OK:" + msgId);
          }

          public void fail(Object msgId) {
              System.out.println("FAIL:" + msgId);
          }
  21. FileReaderSpout II

          /**
           * Declare the output field "line".
           */
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("line"));
          }

          /**
           * Open the file and keep a reference to the collector object.
           */
          public void open(Map conf, TopologyContext context,
                           SpoutOutputCollector collector) {
              try {
                  this.fileReader = new FileReader(conf.get("wordsFile").toString());
              } catch (FileNotFoundException e) {
                  throw new RuntimeException(
                          "Error reading file [" + conf.get("wordsFile") + "]");
              }
              this.collector = collector;
          }

          public void close() {
          }
  22. FileReaderSpout III

          /**
           * Emit one tuple per file line. nextTuple() is called forever,
           * so once the file has been read we simply return.
           */
          public void nextTuple() {
              if (completed) {
                  return;
              }
              String str;
              // Open the reader
              BufferedReader reader = new BufferedReader(fileReader);
              try {
                  // Read all lines and emit each line as a value,
                  // using the line itself as the message id
                  while ((str = reader.readLine()) != null) {
                      this.collector.emit(new Values(str), str);
                  }
              } catch (Exception e) {
                  throw new RuntimeException("Error reading tuple", e);
              } finally {
                  completed = true;
              }
          }
      }
  23. WordSplitBolt I

      package de.codecentric.storm.wordcount.bolts;

      import backtype.storm.topology.BasicOutputCollector;
      import backtype.storm.topology.OutputFieldsDeclarer;
      import backtype.storm.topology.base.BaseBasicBolt;
      import backtype.storm.tuple.Fields;
      import backtype.storm.tuple.Tuple;
      import backtype.storm.tuple.Values;

      public class WordSplitBolt extends BaseBasicBolt {

          public void cleanup() {
          }

          /**
           * The bolt only emits the field "word".
           */
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("word"));
          }
  24. WordSplitBolt II

          /**
           * Receive a line from the words file and split it into words.
           */
          public void execute(Tuple input, BasicOutputCollector collector) {
              String sentence = input.getString(0);
              String[] words = sentence.split(" ");
              for (String word : words) {
                  word = word.trim();
                  if (!word.isEmpty()) {
                      word = word.toLowerCase();
                      collector.emit(new Values(word));
                  }
              }
          }
  25. WordCountBolt I

      package de.codecentric.storm.wordcount.bolts;

      import java.util.Comparator;
      import java.util.HashMap;
      import java.util.Map;
      import java.util.SortedSet;
      import java.util.TreeSet;

      import backtype.storm.task.TopologyContext;
      import backtype.storm.topology.BasicOutputCollector;
      import backtype.storm.topology.OutputFieldsDeclarer;
      import backtype.storm.topology.base.BaseBasicBolt;
      import backtype.storm.tuple.Tuple;

      public class WordCountBolt extends BaseBasicBolt {

          private static final long serialVersionUID = 1L;

          Integer id;
          String name;
          Map<String, Integer> counters;
  26. WordCountBolt II

          /**
           * On create.
           */
          @Override
          public void prepare(Map stormConf, TopologyContext context) {
              this.counters = new HashMap<String, Integer>();
              this.name = context.getThisComponentId();
              this.id = context.getThisTaskId();
          }

          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
          }

          @Override
          public void execute(Tuple input, BasicOutputCollector collector) {
              String str = input.getString(0);
              // If the word is not yet in the map, create an entry for it;
              // otherwise add 1 to its count
              if (!counters.containsKey(str)) {
                  counters.put(str, 1);
              } else {
                  Integer c = counters.get(str) + 1;
                  counters.put(str, c);
              }
          }
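The counting logic in execute() is plain Java and can be exercised without any Storm dependencies; a minimal stand-alone sketch (class name and sample words are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone sketch of WordCountBolt's counting logic
public class CounterSketch {

    // Same logic as in execute(): create an entry on first
    // sight of a word, otherwise add 1 to the existing count
    static Map<String, Integer> count(String[] words) {
        Map<String, Integer> counters = new HashMap<String, Integer>();
        for (String word : words) {
            if (!counters.containsKey(word)) {
                counters.put(word, 1);
            } else {
                counters.put(word, counters.get(word) + 1);
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        Map<String, Integer> c =
                count(new String[] { "to", "be", "or", "not", "to", "be" });
        System.out.println(c.get("to")); // prints 2
        System.out.println(c.get("or")); // prints 1
    }
}
```

Because the fields grouping sends every occurrence of a word to the same task, each task's map holds complete counts for its share of the vocabulary.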
  27. WordCountBolt III

          /**
           * At shutdown of the cluster we print the word counters.
           */
          @Override
          public void cleanup() {
              // Sort the map by count
              SortedSet<Map.Entry<String, Integer>> sortedCounts =
                      entriesSortedByValues(counters);
              System.out.println("-- Word Counter [" + name + "-" + id + "] --");
              for (Map.Entry<String, Integer> entry : sortedCounts) {
                  System.out.println(entry.getKey() + ": " + entry.getValue());
              }
          }

          …
      }
  28. WordCountTopology

      public class WordCountTopology {

          public static void main(String[] args) throws InterruptedException {
              // Topology definition
              TopologyBuilder builder = new TopologyBuilder();
              builder.setSpout("word-reader", new FileReaderSpout());
              builder.setBolt("word-normalizer", new WordSplitBolt())
                     .shuffleGrouping("word-reader");
              builder.setBolt("word-counter", new WordCountBolt(), 1)
                     .fieldsGrouping("word-normalizer", new Fields("word"));

              // Configuration
              Config conf = new Config();
              conf.put("wordsFile", args[0]);
              conf.setDebug(false);
              conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);

              // Run the topology
              LocalCluster cluster = new LocalCluster();
              cluster.submitTopology("word-count-topology", conf,
                      builder.createTopology());

              // You don't do this on a regular topology
              Utils.sleep(10000);
              cluster.killTopology("word-count-topology");
              cluster.shutdown();
          }
      }
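The sleep/killTopology tail above is only for local testing. On a real cluster the topology is submitted to Nimbus instead of a LocalCluster; a hedged sketch against the 0.8.x API (class and method names here are made up, conf and topology are built exactly as above):

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.generated.StormTopology;

public class RemoteSubmit {

    // Replaces the LocalCluster block of WordCountTopology.main():
    // StormSubmitter locates Nimbus via storm.yaml, uploads the jar,
    // and the topology then runs until it is explicitly killed.
    static void submit(Config conf, StormTopology topology) throws Exception {
        StormSubmitter.submitTopology("word-count-topology", conf, topology);
    }
}
```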
  29. Stream grouping
      Each spout or bolt may run n instances in parallel. Groupings decide to which task in the subscribing bolt (group) a tuple is sent.
      • Shuffle: random grouping
      • Fields: grouped by value, such that equal values end up in the same task
      • All: replicates to all tasks
      • Global: makes all tuples go to one task
      • None: makes the bolt run in the same thread as the bolt/spout it subscribes to
      • Direct: the producer (the task that emits) controls which consumer task will receive the tuple
      • Local: if the target bolt has one or more tasks in the same worker process, tuples are shuffled to just those in-process tasks
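A grouping is chosen when a bolt is wired to its upstream component. A sketch against the 0.8.x TopologyBuilder API, reusing the WordCount components from above (the ReportBolt in the commented line is hypothetical and not part of the example code):

```java
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class GroupingSketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("word-reader", new FileReaderSpout());

        // Shuffle: lines are distributed randomly over the splitter tasks
        builder.setBolt("word-normalizer", new WordSplitBolt(), 4)
               .shuffleGrouping("word-reader");

        // Fields: equal values of "word" always reach the same counter
        // task, which is what makes per-word counting correct
        builder.setBolt("word-counter", new WordCountBolt(), 6)
               .fieldsGrouping("word-normalizer", new Fields("word"));

        // Global: all tuples would go to a single task of a
        // hypothetical ReportBolt
        // builder.setBolt("report", new ReportBolt(), 1)
        //        .globalGrouping("word-counter");
    }
}
```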
  30. Key features of Twitter Storm
      Storm is • Fast & scalable • Fault-tolerant • Guarantees message processing • Easy to set up & operate • Free & open source
  32. Extremely performant
  33. Parallelism
      Example setup: number of worker nodes = 2, worker slots per node = 4, number of topology workers = 4.
      • FileReaderSpout: parallelism_hint = 2; number of tasks not specified, so same as the parallelism hint
      • WordSplitBolt: parallelism_hint = 4; number of tasks = 8
      • WordCountBolt: parallelism_hint = 6; number of tasks not specified = 6
      Number of component instances = 2 + 8 + 6 = 16; number of executor threads = 2 + 4 + 6 = 12.
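The numbers on this slide correspond to settings along the following lines (a sketch against the 0.8.x API; `setNumTasks` fixes the task count independently of the executor count given by the parallelism hint):

```java
import backtype.storm.Config;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class ParallelismSketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // parallelism_hint = 2 -> 2 executors; tasks default to 2
        builder.setSpout("word-reader", new FileReaderSpout(), 2);

        // parallelism_hint = 4 -> 4 executors running 8 tasks
        builder.setBolt("word-normalizer", new WordSplitBolt(), 4)
               .setNumTasks(8)
               .shuffleGrouping("word-reader");

        // parallelism_hint = 6 -> 6 executors; tasks default to 6
        builder.setBolt("word-counter", new WordCountBolt(), 6)
               .fieldsGrouping("word-normalizer", new Fields("word"));

        Config conf = new Config();
        // Number of topology workers = 4
        conf.setNumWorkers(4);
        // -> 2 + 8 + 6 = 16 component instances (tasks) and
        //    2 + 4 + 6 = 12 executor threads across 4 worker processes
    }
}
```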
  34. Message passing
      Each worker process runs a receive thread and a transfer thread; each executor has its own receive queue and internal transfer queue, and the worker keeps a transfer queue to other workers. Interprocess communication is mediated by ZeroMQ, and transfers between workers use Kryo serialization. Local communication within a worker is mediated by the LMAX Disruptor, and in-process transfers involve no serialization.
  35. Key features of Twitter Storm (recap)
  36. Fault tolerance: cluster works normally
      Nimbus monitors the cluster state via ZooKeeper; Supervisors synchronize assignments and send heartbeats through ZooKeeper and read worker heartbeats from the local file system; workers send executor heartbeats.
  37. Fault tolerance: Nimbus goes down
      Processing still continues, but topology lifecycle operations and the reassignment facility are lost.
  38. Fault tolerance: a worker node goes down
      Nimbus reassigns the tasks to other machines and processing continues.
  39. Fault tolerance: the Supervisor goes down
      Processing still continues, but assignments are no longer synchronized.
  40. Fault tolerance: a worker process goes down
      The Supervisor restarts the worker process and processing continues.
  41. Key features of Twitter Storm (recap)
  42. Reliability API

      public class FileReaderSpout extends BaseRichSpout {

          public void nextTuple() {
              …
              // Emitting the tuple with a message ID makes it trackable
              UUID messageID = getMsgID();
              collector.emit(new Values(line), messageID);
          }

          public void ack(Object msgId) {
              // Do something with the acked message id
          }

          public void fail(Object msgId) {
              // Do something with the failed message id
          }
      }

      // Explicit anchoring and acking need the rich bolt API; the
      // BasicOutputCollector of a BaseBasicBolt does both implicitly
      public class WordSplitBolt extends BaseRichBolt {

          // collector obtained in prepare(), omitted here
          public void execute(Tuple input) {
              for (String s : input.getString(0).split(" ")) {
                  // Anchoring the incoming tuple to each outgoing tuple
                  // builds the tuple tree: "This is a line" -> four words
                  collector.emit(input, new Values(s));
              }
              // Sending the ack
              collector.ack(input);
          }
      }
  43. ACKing framework
      For every spout tuple, the ACKer (an implicit bolt) keeps an entry of spout tuple ID, spout task ID, and a 64-bit ack val; spouts and bolts report to it via ACKer init, ACKer ack, and ACKer fail messages.
      • Emitted tuple A: XOR tuple A's id into the ack val
      • Emitted tuple B: XOR tuple B's id into the ack val
      • Emitted tuple C: XOR tuple C's id into the ack val
      • Acked tuple A: XOR tuple A's id into the ack val
      • Acked tuple B: XOR tuple B's id into the ack val
      • Acked tuple C: XOR tuple C's id into the ack val
      When the ack val becomes 0, the ACKer implicit bolt knows the tuple tree has been completed.
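The XOR bookkeeping is easy to verify in plain Java: XOR-ing each tuple id into the ack val once when the tuple is emitted and once when it is acked drives the value back to 0, regardless of order. A self-contained sketch (no Storm involved; the ids are made up):

```java
public class AckValDemo {

    // XOR a sequence of 64-bit tuple ids into an ack val,
    // as the ACKer bolt does for each spout tuple
    static long xor(long ackVal, long... ids) {
        for (long id : ids) {
            ackVal ^= id;
        }
        return ackVal;
    }

    public static void main(String[] args) {
        long a = 0x1111L, b = 0x2222L, c = 0x3333L; // made-up tuple ids

        // Emitted tuples A, B, C ...
        long ackVal = xor(0L, a, b, c);
        // ... then acked A, B, C (order does not matter)
        ackVal = xor(ackVal, c, a, b);

        // Every id was XOR-ed in twice, so the ack val is 0:
        // the tuple tree is complete
        System.out.println(ackVal); // prints 0
    }
}
```

This is why a single 64-bit value per spout tuple suffices to track an arbitrarily large tuple tree.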
  44. Key features of Twitter Storm (recap)
  45. Cluster setup
      • Set up a ZooKeeper cluster
      • Install the dependencies on Nimbus and worker machines: ZeroMQ 2.1.7 and JZMQ, Java 6, Python 2.6.6, unzip
      • Download and extract a Storm release to Nimbus and worker machines
      • Fill in the mandatory configuration in storm.yaml
      • Launch the daemons under supervision using the storm scripts
      • Start a topology: storm jar <path_topology_jar> <main_class> <arg1> … <argN>
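A minimal storm.yaml along these lines covers the mandatory keys (host names and paths below are placeholders; check the setup guide of the release you deploy):

```yaml
# ZooKeeper servers the cluster uses
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"

# Host the Nimbus master runs on
nimbus.host: "nimbus.example.com"

# Local directory for Nimbus/Supervisor state (jars, confs)
storm.local.dir: "/var/storm"

# Worker slots on each Supervisor node (one port per slot)
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
```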
  46. Cluster Summary
  47. Topology Summary
  48. Component Summary
  49. Key features of Twitter Storm (recap)
  50. Basic resources
      • Storm is available at http://storm-project.net/ and https://github.com/nathanmarz/storm under the Eclipse Public License 1.0
      • Get help on http://groups.google.com/group/storm-user and in the #storm-user room on freenode
      • Follow @stormprocessor and @nathanmarz
  51. Many contributions
      • Community repository of modules for use with Storm at https://github.com/nathanmarz/storm-contrib, including integrations with Redis, Kafka, MongoDB, HBase, JMS, Amazon SQS, …
      • Good articles for understanding Storm internals:
        http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
        http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
      • Good slides for understanding real-life examples:
        http://www.slideshare.net/DanLynn1/storm-as-deep-into-realtime-data-processing-as-you-can-get-in-30-minutes
        http://www.slideshare.net/KrishnaGade2/storm-at-twitter
  52. Coming next…
      • Current release: 0.8.2
      • Work in progress (newest): 0.9.0-wip21 with SLF4J and Logback, pluggable tuple serialization and Blowfish encryption, pluggable interprocess messaging with a Netty implementation, some bug fixes, and more
      • Storm on YARN
  53. Agenda: • Why Twitter Storm? • What is Twitter Storm? • What to do with Twitter Storm?
  54. One example: webshop
      • Web-tracking component with no predefined page impression
      • Page impressions are identified from the Varnish logs of the click-stream data
      • A page consists of different fragments: body, article description, recommendation box, …
      • Session data is also of interest
  55. One example: webshop
      • Custom solution using J2EE and MongoDB
      • Export into comScore DAx and the enterprise DWH
      • The solution currently works but is not scalable; what about performance?
  56. Topology architecture
