Storm real-time processing
Published in Technology
  • 1. Storm
    Real-time computation made easy
    Michael Vogiatzis
  • 2. What’s Storm?
    • Distributed real-time computation system
    • Fault tolerant
    • Fast
    • Scalable
    • Guaranteed message processing
    • Open source
    • Multilang capabilities
  • 3. Purpose
    OK, but why?
  • 4. Motivation
    • Queues – Workers paradigm
    • Scaling is hard
    • System is not robust
    • Coding is not fun!
        – No abstraction
        – Low-level message passing
        – Intermediate message brokers
  • 5. Use cases
    • Stream processing
        – Consume stream, update DB, etc.
    • Distributed RPC
        – Intense function on top of Storm
    • Ongoing computation
        – Computing music trends on Twitter
  • 6. Architecture
  • 7. Elements
    • Streams
        – Set of tuples
        – Unbounded sequence of data
    • Spout
        – Source of streams
    • Bolts
        – Application logic
        – Functions
        – Streaming aggregations, joins, DB ops
  • 8. Topology
  • 9. Storm UI
  • 10. Demo
    Unshorten URLs
  • 11. Evil Shorteners
  • 12. Demo
  • 13. Trident
    ● Higher level of abstraction on top of Storm
    ● Batch processing
    ● Keeps state using your persistence store, e.g. DBs, Memcached, etc.
    ● Exactly-once semantics
    ● Tuples can be replayed!
    ● Similar API to Pig / Cascading
  • 14. Trident operations
    [Diagram labels: Operation, Input fields, Function fields]
  • 15. Trident operations
    ● Joins
    ● Aggregations
    ● Grouping
    ● Functions
    ● Filtering
    ● Sorting
  • 16. Trident State
    ● Solid API for reading / writing to stateful sources
    ● State updates are idempotent
    ● Different kinds of fault tolerance depending on the Spout implementation
  • 17. Learn by example
    Compute Male – Female count on a particular topic on Twitter over time
  • 18. Trident Gender
    1. Stream of incoming tweets
    2. Filter out the tweets not relevant to the topic
    3. Check gender by checking first name
    4. Update either male or female counter
  • 19. Input (Spout impl.)
    ● Receives the public stream (~1% of tweets) and emits the tweets into the system
    List<Object> tweets;
    public void emitBatch(long batchId, TridentCollector collector) {
        for (Object o : tweets)
            collector.emit(new Values(o));
    }
  • 20. Filter
    Implement a Filter class called FilterWords
    .each(new Fields("status"), new FilterWords(interestingWords))
    String[] words = {"instagram", "flickr", "pinterest", "picasa"};
    public boolean isKeep(TridentTuple tuple) {
        Tweet t = (Tweet) tuple.getValue(0);
        // is the tweet an interesting one?
        for (String word : words)
            if (t.getText().toLowerCase().contains(word))
                return true;
        return false;
    }
  • 21. Function
    Implement a function class
    .each(new Fields("status"), new ExpandName(), new Fields("name"))
    Tuple before:
        [{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London."}]
  • 22. Function
    Implement a function class
    .each(new Fields("status"), new ExpandName(), new Fields("name"))
    Tuple before:
        [{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London."}]
    Tuple after:
        [{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London."}, "Iris"]
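The slides show how ExpandName is wired into the stream but not what it does internally. A minimal sketch of the name-extraction step, assuming the first name is simply the first whitespace-separated token of the "fullname" field, might look like the following. The class and method names (NameExpander, firstName) are hypothetical and this is plain Java, not the Trident API; in a real Trident function this logic would presumably sit inside the function's execute method, which then emits the extracted name as the new "name" field.

```java
/** Hypothetical helper; a plain-Java sketch, not part of Storm or Trident. */
final class NameExpander {
    private NameExpander() {}

    /**
     * Returns the first whitespace-separated token of fullName,
     * or an empty string when the input is null or blank.
     */
    static String firstName(String fullName) {
        if (fullName == null) {
            return "";
        }
        String trimmed = fullName.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        // Split on any run of whitespace and keep the first token.
        return trimmed.split("\\s+")[0];
    }
}
```

For the tuple on the slide, NameExpander.firstName("Iris HappyWorker") yields the appended value "Iris".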
  • 23. State Query
    Implement a QueryFunction to query the persistence storage.
    .stateQuery(genderDB, new Fields("name"), new QueryGender(), new Fields("gender"))
    public List<String> batchRetrieve(GenderDB state, List<TridentTuple> tuples) {
        List<String> batchToQuery = new ArrayList<String>();
        for (TridentTuple t : tuples) {
            String name = t.getStringByField("name");
            batchToQuery.add(name);
        }
        return state.getGenders(batchToQuery);
    }
  • 24. State Query
    Tuple before:
        [{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London."}, "Iris"]
  • 25. State Query
    Tuple before:
        [{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London."}, "Iris"]
    Tuple after:
        [{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London."}, "Iris", "Female"]
  • 26. Grouping
    ● .groupBy(new Fields("gender"))
    ● Groups the tuples containing the same gender value together
    ● Re-partitions the stream
    ● Tuples are sent over the network
  • 27. Grouping
    ● Tuples before:
        1st Partition: [{TweetJson1}, "Iris", "Female"]
        1st Partition: [{TweetJson2}, "Michael", "Male"]
        2nd Partition: [{TweetJson3}, "Lena", "Female"]
  • 28. Grouping
    ● Tuples before:
        1st Partition: [{TweetJson1}, "Iris", "Female"]
        1st Partition: [{TweetJson2}, "Michael", "Male"]
        2nd Partition: [{TweetJson3}, "Lena", "Female"]
    Group By Gender
    ● Tuples after:
        new 1st Partition: [{TweetJson1}, "Iris", "Female"]
        new 1st Partition: [{TweetJson3}, "Lena", "Female"]
        new 2nd Partition: [{TweetJson2}, "Michael", "Male"]
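The re-partitioning on the grouping slides can be modelled in a few lines of plain Java. The sketch below assumes the target partition is chosen by hashing the group field modulo the number of partitions, which is how hash partitioning is usually described; it is an illustration of the idea, not Trident's implementation, and the names GroupBySketch, Row, partitionOf and repartitionByGender are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of a group-by re-partition; not the Trident API. */
final class GroupBySketch {
    /** A tweet reduced to the two derived fields used on the slides. */
    static final class Row {
        final String name;
        final String gender;

        Row(String name, String gender) {
            this.name = name;
            this.gender = gender;
        }
    }

    /** Assumed routing rule: hash of the group key modulo the partition count. */
    static int partitionOf(String gender, int targetPartitions) {
        return Math.floorMod(gender.hashCode(), targetPartitions);
    }

    /** Moves every row to the partition selected by its gender value. */
    static List<List<Row>> repartitionByGender(List<List<Row>> partitions, int targetPartitions) {
        List<List<Row>> out = new ArrayList<>();
        for (int i = 0; i < targetPartitions; i++) {
            out.add(new ArrayList<>());
        }
        for (List<Row> partition : partitions) {
            for (Row row : partition) {
                out.get(partitionOf(row.gender, targetPartitions)).add(row);
            }
        }
        return out;
    }
}
```

Because the routing depends only on the gender value, all rows with the same gender end up in the same output partition, which is why tuples may have to cross the network. Which partition index a given gender maps to depends on the hash and is not meaningful in itself.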
  • 29. Aggregators (general case)
    ● Run the init() function before processing the batch
    ● Aggregate through a number of tuples (usually "grouped-by" before) and emit one or more results based on the aggregate method.
    public interface Aggregator<T> extends Operation {
        T init(Object batchId, TridentCollector collector);
        void aggregate(T state, TridentTuple tuple, TridentCollector collector);
        void complete(T state, TridentCollector collector);
    }
  • 30. Combiner Aggregator
    ● Run init(TridentTuple t) on every tuple
    ● Run combine on the resulting values until no tuples are left, then return a single value.
    public class Count implements CombinerAggregator<Long> {
        public Long init(TridentTuple tuple) {
            return 1L;
        }
        public Long combine(Long val1, Long val2) {
            return val1 + val2;
        }
        public Long zero() {
            return 0L;
        }
    }
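To make the init / combine / zero contract concrete, here is a minimal plain-Java model of how such a combiner could be driven over a batch. The Combiner interface, CountCombiner and CombinerDemo are hypothetical stand-ins written for this illustration; they mirror the shape of the slide's Count class but do not use Trident types.

```java
import java.util.List;

/** Hypothetical stand-in for the three-method combiner contract on the slide. */
interface Combiner<I, T> {
    T init(I tuple);       // value contributed by a single tuple
    T combine(T a, T b);   // merges two partial results
    T zero();              // result for an empty batch
}

/** Counts tuples: every tuple contributes 1 and partial counts are summed. */
final class CountCombiner implements Combiner<Object, Long> {
    public Long init(Object tuple) { return 1L; }
    public Long combine(Long a, Long b) { return a + b; }
    public Long zero() { return 0L; }
}

final class CombinerDemo {
    private CombinerDemo() {}

    /** Folds a batch: start from zero(), then combine in init(tuple) for each tuple. */
    static <I, T> T run(Combiner<I, T> combiner, List<I> tuples) {
        T acc = combiner.zero();
        for (I tuple : tuples) {
            acc = combiner.combine(acc, combiner.init(tuple));
        }
        return acc;
    }
}
```

Since combine only merges two partial results, partial counts computed separately (for example one per partition) can be merged with the same method, which is the property that makes this style of aggregator attractive in a distributed setting.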
  • 31. Reducer Aggregator
    ● Run init() to get an initial value
    ● Iterate over the tuples, folding each one into the current value, to emit a single result
    public interface ReducerAggregator<T> extends Serializable {
        T init();
        T reduce(T curr, TridentTuple tuple);
    }
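The reducer contract is a plain left fold: start from init() and pass each tuple through reduce. A small sketch under that reading follows, using a hypothetical Reducer interface that mirrors the two methods on the slide with an ordinary Java type in place of TridentTuple. CountReducer and ReducerDemo are likewise illustrative names, not Trident classes.

```java
import java.util.List;

/** Hypothetical stand-in for the two-method reducer contract on the slide. */
interface Reducer<I, T> {
    T init();                  // initial accumulator value
    T reduce(T curr, I tuple); // folds one tuple into the accumulator
}

/** Counts tuples by incrementing the accumulator once per tuple. */
final class CountReducer implements Reducer<Object, Long> {
    public Long init() { return 0L; }
    public Long reduce(Long curr, Object tuple) { return curr + 1; }
}

final class ReducerDemo {
    private ReducerDemo() {}

    /** Sequential left fold over the batch. */
    static <I, T> T run(Reducer<I, T> reducer, List<I> tuples) {
        T acc = reducer.init();
        for (I tuple : tuples) {
            acc = reducer.reduce(acc, tuple);
        }
        return acc;
    }
}
```

Unlike the combiner model, there is no method here for merging two partial results, so this sketch can only process a batch sequentially.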
  • 32. Back to the example
    ● For each gender batch run the Count() aggregator
    ● Not only aggregate, but also store the value in memory
    ● Why?
    ● "Over time count"
  • 33. Back to the example
    ● For each gender batch run the Count() aggregator
    ● Not only aggregate, but also store the value in memory
    ● Why?
    ● "Over time count"
    persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
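The point of storing the aggregate is that the count must survive from one batch to the next. The plain-Java sketch below models that "over time" behaviour with an in-memory map keyed by gender; RunningCounts and its methods are hypothetical names for illustration. It deliberately ignores batch ids and replays, so it does not model the exactly-once guarantees mentioned earlier in the deck.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory running count per key; a simplified model, not MemoryMapState. */
final class RunningCounts {
    private final Map<String, Long> state = new HashMap<>();

    /** Adds one to the stored count for every gender value in the batch. */
    void applyBatch(List<String> genders) {
        for (String gender : genders) {
            state.merge(gender, 1L, Long::sum);
        }
    }

    /** Returns the count accumulated so far, or 0 for an unseen key. */
    long get(String gender) {
        return state.getOrDefault(gender, 0L);
    }
}
```

Applying a first batch of Female, Male, Female and then a second batch containing only Female leaves the stored counts at 3 for Female and 1 for Male, because the second batch updates the existing state instead of starting from zero.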
  • 34. Putting it all together
    TridentState genderDB = topology.newStaticState(new GenderDBFactory());
    Stream gender = topology.newStream("spout", spout)
        .each(new Fields("status"), new Filter(topicWords))
        .each(new Fields("status"), new ExpandName(), new Fields("name"))
        .parallelismHint(4)
        .stateQuery(genderDB, new Fields("name"), new QueryGender(), new Fields("gender"))
        .parallelismHint(10)
        .groupBy(new Fields("gender"))
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
        .newValuesStream();
  • 35. Demo
    Gender count
  • 36. Some minuses
    • Hard debugging
        ➢ Pseudo-distributed mode, but still...
    • Object serialization
        ➢ When using 3rd-party libraries
        ➢ Register your own serializers for better performance, e.g. Kryo
  • 37. I didn’t tackle
    • Reliability
        – Guaranteed message processing
    • Distributed RPC example
    • Storm-deploy companion
        – One-click automated Storm cluster deploy, e.g. on EC2
  • 38. Contributions
  • 39. Overall
    • Express your realtime needs naturally
    • Growing community
    • System rapidly improving
    • Not a Hadoop/MR competitor
    • Fun to use
  • 40. Resources
    ● Storm Unshortening example ­unshortening
    ● Understanding the Storm Parallelism
    ● http://storm­
  • 41. The End
    Michael Vogiatzis
    Follow me @mvogiatzis
  • 42. Q & A