Storm real-time processing



  1. Storm: real-time computation made easy (Michael Vogiatzis)
  2. What's Storm?
     • Distributed real-time computation system
     • Fault tolerant
     • Fast
     • Scalable
     • Guaranteed message processing
     • Open source
     • Multilang capabilities
  3. Purpose: OK, but why?
  4. Motivation
     • Queues and workers paradigm
     • Scaling is hard
     • The system is not robust
     • Coding is not fun!
        – No abstraction
        – Low-level message passing
        – Intermediate message brokers
  5. Use cases
     • Stream processing: consume a stream, update a DB, etc.
     • Distributed RPC: run an intensive function on top of Storm
     • Continuous computation: e.g. computing music trends on Twitter
  6. Architecture
  7. Elements
     • Streams: unbounded sequences of tuples
     • Spouts: sources of streams
     • Bolts: application logic (functions, streaming aggregations, joins, DB operations)
  8. Topology
  9. Storm UI
  10. Demo: unshorten URLs
  11. Evil shorteners
  12. Demo
  13. Trident
     • Higher-level abstraction on top of Storm
     • Batch processing
     • Keeps state using your persistence store, e.g. DBs, Memcached, etc.
     • Exactly-once semantics
     • Tuples can be replayed!
     • API similar to Pig / Cascading
  14. Trident operations (diagram: an operation consumes input fields and emits function fields)
  15. Trident operations
     • Joins
     • Aggregations
     • Grouping
     • Functions
     • Filtering
     • Sorting
  16. Trident state
     • Solid API for reading from and writing to stateful sources
     • State updates are idempotent
     • Different kinds of fault tolerance, depending on the spout implementation
  17. Learn by example: compute the male/female count on a particular topic on Twitter, over time
  18. Trident gender
     1. Stream of incoming tweets
     2. Filter out tweets not relevant to the topic
     3. Determine gender from the first name
     4. Update either the male or the female counter
  19. Input (spout implementation)
     • Receives the public stream (~1% of tweets) and emits the tweets into the system

        List<Object> tweets;

        public void emitBatch(long batchId, TridentCollector collector) {
            for (Object o : tweets)
                collector.emit(new Values(o));
        }
  20. Filter
     • Implement a Filter class called FilterWords:

        .each(new Fields("status"), new FilterWords(interestingWords))

        String[] words = {"instagram", "flickr", "pinterest", "picasa"};

        public boolean isKeep(TridentTuple tuple) {
            Tweet t = (Tweet) tuple.getValue(0);
            // is this tweet an interesting one?
            for (String word : words)
                if (t.getText().toLowerCase().contains(word))
                    return true;
            return false;
        }
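The keep/drop decision above can be run outside Storm as a few lines of plain Java. In this sketch the class name is illustrative and the tweet is reduced to its text for simplicity; the substring check is the same as on the slide:

```java
// Stand-alone sketch of the FilterWords keep/drop decision from the slide,
// with the tweet reduced to its text for simplicity.
public class FilterWordsSketch {
    static final String[] WORDS = {"instagram", "flickr", "pinterest", "picasa"};

    // Keep the tweet only if it mentions one of the interesting words.
    static boolean isKeep(String tweetText) {
        for (String word : WORDS)
            if (tweetText.toLowerCase().contains(word))
                return true;
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isKeep("Check my photo on Flickr!")); // true
        System.out.println(isKeep("Nice weather today"));        // false
    }
}
```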
  21–22. Function
     • Implement a function class:

        .each(new Fields("status"), new ExpandName(), new Fields("name"))

     • Tuple before:
        [{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London."}]
     • Tuple after:
        [{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London."}, "Iris"]
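What ExpandName does to the tuple, appending "Iris" from "Iris HappyWorker", can be sketched without Storm. The class and method names here are hypothetical helpers, not part of the slide's code:

```java
// Stand-alone sketch of what ExpandName adds to the tuple: the first name
// extracted from the tweet's full name.
public class ExpandNameSketch {
    // "Iris HappyWorker" -> "Iris"
    static String firstName(String fullname) {
        return fullname.trim().split("\\s+")[0];
    }

    public static void main(String[] args) {
        System.out.println(firstName("Iris HappyWorker")); // Iris
    }
}
```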
  23–25. State query
     • Implement a QueryFunction to query the persistence storage:

        .stateQuery(genderDB, new Fields("name"), new QueryGender(), new Fields("gender"))

        public List<String> batchRetrieve(GenderDB state, List<TridentTuple> tuples) {
            List<String> batchToQuery = new ArrayList<String>();
            for (TridentTuple t : tuples) {
                String name = t.getStringByField("name");
                batchToQuery.add(name);
            }
            return state.getGenders(batchToQuery);
        }

     • Tuple before:
        [{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London."}, "Iris"]
     • Tuple after:
        [{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London."}, "Iris", "Female"]
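The batched-lookup idea behind batchRetrieve can be sketched with an in-memory map standing in for the real GenderDB; all names and entries here are illustrative:

```java
import java.util.*;

// Sketch of the batchRetrieve idea: resolve a whole batch of first names
// against a gender store in one call. A HashMap stands in for GenderDB.
public class QueryGenderSketch {
    static final Map<String, String> GENDER_DB = new HashMap<>();
    static {
        GENDER_DB.put("Iris", "Female");
        GENDER_DB.put("Michael", "Male");
    }

    // One lookup per batch, not one per tuple.
    static List<String> batchRetrieve(List<String> names) {
        List<String> genders = new ArrayList<>();
        for (String name : names)
            genders.add(GENDER_DB.getOrDefault(name, "Unknown"));
        return genders;
    }

    public static void main(String[] args) {
        System.out.println(batchRetrieve(Arrays.asList("Iris", "Michael", "Lena")));
        // [Female, Male, Unknown]
    }
}
```

Batching the query is the point: one round-trip to the store serves every tuple in the batch.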
  26. Grouping
     • .groupBy(new Fields("gender"))
     • Groups tuples containing the same gender value together
     • Re-partitions the stream
     • Tuples are sent over the network
  27–28. Grouping
     • Tuples before:
        1st partition: [{TweetJson1}, "Iris", "Female"]
        1st partition: [{TweetJson2}, "Michael", "Male"]
        2nd partition: [{TweetJson3}, "Lena", "Female"]
     • After grouping by gender:
        new 1st partition: [{TweetJson1}, "Iris", "Female"]
        new 1st partition: [{TweetJson3}, "Lena", "Female"]
        new 2nd partition: [{TweetJson2}, "Michael", "Male"]
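The repartitioning above can be simulated in plain Java; this hypothetical groupByGender collects tuples (modelled as string arrays of tweet JSON, name, gender) into per-gender partitions:

```java
import java.util.*;

// Sketch of what groupBy(new Fields("gender")) achieves: tuples with the
// same gender value land in the same partition.
public class GroupBySketch {
    static Map<String, List<String[]>> groupByGender(List<String[]> tuples) {
        Map<String, List<String[]>> partitions = new TreeMap<>();
        for (String[] tuple : tuples) {
            String gender = tuple[2]; // tuple layout: [tweetJson, name, gender]
            partitions.computeIfAbsent(gender, g -> new ArrayList<>()).add(tuple);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String[]> tuples = Arrays.asList(
            new String[]{"{TweetJson1}", "Iris", "Female"},
            new String[]{"{TweetJson2}", "Michael", "Male"},
            new String[]{"{TweetJson3}", "Lena", "Female"});
        Map<String, List<String[]>> grouped = groupByGender(tuples);
        System.out.println(grouped.get("Female").size()); // 2
        System.out.println(grouped.get("Male").size());   // 1
    }
}
```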
  29. Aggregators (general case)
     • Run the init() method before processing each batch
     • Aggregate over a number of tuples (usually grouped beforehand) and emit one or more results from the aggregate method

        public interface Aggregator<T> extends Operation {
            T init(Object batchId, TridentCollector collector);
            void aggregate(T state, TridentTuple tuple, TridentCollector collector);
            void complete(T state, TridentCollector collector);
        }
  30. CombinerAggregator
     • Runs init(TridentTuple tuple) on every tuple
     • Runs the combine method on pairs of values until no tuples are left, then returns a single value

        public class Count implements CombinerAggregator<Long> {
            public Long init(TridentTuple tuple) {
                return 1L;
            }
            public Long combine(Long val1, Long val2) {
                return val1 + val2;
            }
            public Long zero() {
                return 0L;
            }
        }
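Outside Storm, the Count combiner's semantics amount to a fold; in this sketch (class and method names are illustrative), every tuple is initialized to 1 and partial counts are combined, with zero() covering empty batches:

```java
import java.util.*;

// Simulates the Count combiner's semantics: init() maps each tuple to 1,
// combine() adds partial counts, zero() handles an empty batch.
public class CombinerCountSketch {
    static long init(Object tuple) { return 1L; }
    static long combine(long a, long b) { return a + b; }
    static long zero() { return 0L; }

    // Fold a batch of tuples the way the combiner would.
    static long count(List<?> batch) {
        long result = zero();
        for (Object tuple : batch)
            result = combine(result, init(tuple));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("t1", "t2", "t3"))); // 3
        System.out.println(count(Collections.emptyList()));        // 0
    }
}
```

Because combine is associative, Trident can apply it to partial results before the network transfer, which is what makes combiners cheaper than general aggregators.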
  31. ReducerAggregator
     • Runs init() to get an initial value
     • Iterates over the tuples, folding each one into the current value, to emit a single result

        public interface ReducerAggregator<T> extends Serializable {
            T init();
            T reduce(T curr, TridentTuple tuple);
        }
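The same fold can be written in ReducerAggregator style: a single init() value, then each tuple reduced into it in turn. This stand-alone sketch (names illustrative) sums numbers standing in for tuples:

```java
import java.util.*;

// Simulates a ReducerAggregator fold: start from init(), then reduce each
// tuple into the running value. The "tuples" here are numbers being summed.
public class ReducerSumSketch {
    static long init() { return 0L; }
    static long reduce(long curr, long tupleValue) { return curr + tupleValue; }

    static long aggregate(List<Long> tuples) {
        long curr = init();
        for (long v : tuples)
            curr = reduce(curr, v);
        return curr;
    }

    public static void main(String[] args) {
        System.out.println(aggregate(Arrays.asList(1L, 2L, 3L))); // 6
    }
}
```

Unlike a combiner, a reducer is inherently sequential over the batch, so it cannot be pre-aggregated before repartitioning.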
  32–33. Back to the example
     • For each gender batch, run the Count() aggregator
     • Not only aggregate, but also store the value in memory
     • Why? To keep the count over time

        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
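The "count over time" behaviour of persistentAggregate can be imitated with a long-lived map standing in for MemoryMapState; this sketch (all names illustrative) merges each batch's grouped counts into it:

```java
import java.util.*;

// Sketch of the idea behind persistentAggregate: per-gender counts from
// each batch are merged into long-lived state (a HashMap standing in for
// MemoryMapState) instead of being discarded after the batch.
public class PersistentCountSketch {
    static final Map<String, Long> state = new HashMap<>();

    // Merge one batch's grouped counts into the running totals.
    static void applyBatch(Map<String, Long> batchCounts) {
        for (Map.Entry<String, Long> e : batchCounts.entrySet())
            state.merge(e.getKey(), e.getValue(), Long::sum);
    }

    public static void main(String[] args) {
        applyBatch(Map.of("Female", 2L, "Male", 1L)); // first batch
        applyBatch(Map.of("Female", 1L));             // later batch
        System.out.println(state.get("Female")); // 3
        System.out.println(state.get("Male"));   // 1
    }
}
```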
  34. Putting it all together

        TridentState genderDB = topology.newStaticState(new GenderDBFactory());
        Stream gender = topology.newStream("spout", spout)
            .each(new Fields("status"), new Filter(topicWords))
            .each(new Fields("status"), new ExpandName(), new Fields("name"))
            .parallelismHint(4)
            .stateQuery(genderDB, new Fields("name"), new QueryGender(), new Fields("gender"))
            .parallelismHint(10)
            .groupBy(new Fields("gender"))
            .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
            .newValuesStream();
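The whole pipeline can be approximated in a few lines of plain Java, with maps standing in for the tweet stream and the gender store; every name here is an illustrative stand-in for the Storm/Trident version, useful only to check the logic end to end:

```java
import java.util.*;

// End-to-end sketch of the slide's pipeline: filter tweets by topic,
// extract the first name, look up gender, then group and count.
public class PipelineSketch {
    static final Map<String, String> GENDER_DB =
        Map.of("Iris", "Female", "Michael", "Male");

    // tweets maps full name -> tweet text; returns gender -> count.
    static Map<String, Long> run(Map<String, String> tweets, String topicWord) {
        Map<String, Long> counts = new HashMap<>();
        for (Map.Entry<String, String> t : tweets.entrySet()) {
            String fullname = t.getKey(), text = t.getValue();
            if (!text.toLowerCase().contains(topicWord)) continue;    // Filter
            String name = fullname.split("\\s+")[0];                  // ExpandName
            String gender = GENDER_DB.getOrDefault(name, "Unknown");  // QueryGender
            counts.merge(gender, 1L, Long::sum);                      // groupBy + Count
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> tweets = Map.of(
            "Iris HappyWorker", "Sharing on instagram today",
            "Michael Vogiatzis", "instagram filters are fun",
            "Lena Smith", "just coffee");
        Map<String, Long> counts = run(tweets, "instagram");
        System.out.println(counts.get("Female")); // 1
        System.out.println(counts.get("Male"));   // 1
    }
}
```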
  35. Demo: gender count
  36. Some downsides
     • Hard debugging
        – pseudo-distributed mode helps, but still...
     • Object serialization
        – an issue when using 3rd-party libraries
        – register your own serializers for better performance, e.g. with Kryo
  37. What I didn't tackle
     • Reliability: guaranteed message processing
     • A distributed RPC example
     • The storm-deploy companion: one-click automated Storm cluster deployment, e.g. on EC2
  38. Contributions
  39. Overall
     • Express your real-time needs naturally
     • Growing community
     • The system is rapidly improving
     • Not a Hadoop/MapReduce competitor
     • Fun to use
  40. Resources
     • Storm Unshortening example
     • Understanding the Storm Parallelism
     • http://storm
  41. The end. Michael Vogiatzis. Follow me: @mvogiatzis
  42. Q & A