Storm real-time processing

Published in: Technology



  • 1. Storm: Real-time computation made easy. Michael Vogiatzis
  • 2. What’s Storm?
      • Distributed real-time computation system
      • Fault tolerant
      • Fast
      • Scalable
      • Guaranteed message processing
      • Open source
      • Multilang capabilities
  • 3. Purpose: OK, but why?
  • 4. Motivation
      • Queues and workers paradigm
      • Scaling is hard
      • System is not robust
      • Coding is not fun!
          – No abstraction
          – Low-level message passing
          – Intermediate message brokers
  • 5. Use cases
      • Stream processing
          – Consume a stream, update a DB, etc.
      • Distributed RPC
          – Run an intensive function on top of Storm
      • Ongoing computation
          – Computing music trends on Twitter
  • 6. Architecture
  • 7. Elements
      • Streams
          – Set of tuples
          – Unbounded sequence of data
      • Spout
          – Source of streams
      • Bolts
          – Application logic
          – Functions
          – Streaming aggregations, joins, DB ops
  • 8. Topology
  • 9. Storm UI
  • 10. Demo: Unshorten URLs
  • 11. Evil Shorteners
  • 12. Demo
  • 13. Trident
      • Higher level of abstraction on top of Storm
      • Batch processing
      • Keeps state using your persistence store, e.g. DBs, Memcached, etc.
      • Exactly-once semantics
      • Tuples can be replayed!
      • Similar API to Pig / Cascading
  • 14. Trident operations [diagram: an operation with its input fields and function fields]
  • 15. Trident operations
      • Joins
      • Aggregations
      • Grouping
      • Functions
      • Filtering
      • Sorting
  • 16. Trident State
      • Solid API for reading from / writing to stateful sources
      • State updates are idempotent
      • The kind of fault tolerance depends on the Spout implementation
  • 17. Learn by example: compute the male / female count on a particular topic on Twitter over time
  • 18. Trident Gender
      1. Stream of incoming tweets
      2. Filter out the tweets that are not relevant to the topic
      3. Determine gender by checking the first name
      4. Update either the male or the female counter
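One common way to make state updates idempotent is to store the id of the last applied batch next to the value, so that a replayed batch can be detected and skipped. The following is a minimal plain-Java sketch of that idea, not Trident's actual State API: the class `IdempotentCounterStore` and its methods are hypothetical, and it assumes batches are applied in increasing order of batch id.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not a Storm class): a counter store that remembers, per
// key, the id of the last batch applied, so a replayed batch is not counted twice.
class IdempotentCounterStore {
    private static final class Entry {
        long value;
        long lastBatchId = -1; // no batch applied yet
    }

    private final Map<String, Entry> entries = new HashMap<>();

    // Adds delta to the counter for key unless this batch was already applied.
    // Assumption: batches arrive in increasing batch-id order.
    void update(String key, long batchId, long delta) {
        Entry e = entries.computeIfAbsent(key, k -> new Entry());
        if (batchId <= e.lastBatchId) {
            return; // replayed batch: already reflected in the stored value
        }
        e.value += delta;
        e.lastBatchId = batchId;
    }

    // Returns the current counter value, or 0 for an unseen key.
    long get(String key) {
        Entry e = entries.get(key);
        return e == null ? 0L : e.value;
    }
}
```

Applying batch 1 twice and then batch 2 leaves the counter at the sum of the two distinct batches, which is the behaviour needed when tuples can be replayed.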
  • 19. Input (Spout impl.)
      • Receives the public stream (~1% of tweets) and emits the tweets into the system

        List<Object> tweets;

        public void emitBatch(long batchId, TridentCollector collector) {
            // emit every buffered tweet as a one-field tuple
            for (Object o : tweets)
                collector.emit(new Values(o));
        }
  • 20. Filter
      • Implement a Filter class called FilterWords

        .each(new Fields("status"), new FilterWords(interestingWords))

        String[] words = {"instagram", "flickr", "pinterest", "picasa"};

        public boolean isKeep(TridentTuple tuple) {
            Tweet t = (Tweet) tuple.getValue(0);
            // is the tweet an interesting one?
            for (String word : words)
                if (t.getText().toLowerCase().contains(word))
                    return true;
            return false;
        }
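The keyword check in the filter can be tried out without a Storm cluster. This is a minimal plain-Java sketch of the same logic; `KeywordFilter` is a hypothetical stand-in for `FilterWords` that takes the tweet text directly instead of a `TridentTuple`, and it assumes the keywords are given in lower case.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the FilterWords logic, without any Storm types.
class KeywordFilter {
    private final List<String> words;

    // Assumption: keywords are passed in lower case.
    KeywordFilter(String... words) {
        this.words = Arrays.asList(words);
    }

    // Returns true when the text mentions at least one keyword, ignoring case.
    boolean isKeep(String tweetText) {
        String lower = tweetText.toLowerCase(Locale.ROOT);
        for (String word : words) {
            if (lower.contains(word)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, `new KeywordFilter("instagram", "flickr").isKeep("Posted on Instagram")` keeps the tweet, while a text with none of the keywords is dropped.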
  • 22. Function
      • Implement a function class

        .each(new Fields("status"), new ExpandName(), new Fields("name"))

      • Tuple before: [{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London."}]
      • Tuple after: [{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London."}, "Iris"]
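The slides do not show the body of `ExpandName`. Judging from the before/after tuples, it appends the first name taken from the `fullname` field. A minimal sketch of that extraction step in plain Java follows; the helper `FirstNames.firstName` is hypothetical, and it assumes the first whitespace-separated token is the first name.

```java
// Hypothetical helper illustrating the name-extraction step; not the author's code.
class FirstNames {
    // Returns the first whitespace-separated token of fullName, or "" if blank.
    static String firstName(String fullName) {
        String trimmed = fullName.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        return trimmed.split("\\s+")[0];
    }
}
```

In a Trident function, the returned value would be emitted as the new `name` field, which is appended to the incoming tuple.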
  • 23. State Query
      • Implement a QueryFunction to query the persistence storage

        .stateQuery(genderDB, new Fields("name"), new QueryGender(), new Fields("gender"))

        public List<String> batchRetrieve(GenderDB state, List<TridentTuple> tuples) {
            // collect the names of the whole batch and look them up in one call
            List<String> batchToQuery = new ArrayList<String>();
            for (TridentTuple t : tuples) {
                String name = t.getStringByField("name");
                batchToQuery.add(name);
            }
            return state.getGenders(batchToQuery);
        }
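The `GenderDB` state itself is not shown in the slides. A minimal in-memory sketch of a batched lookup is given here in plain Java; `InMemoryGenderDB`, its `put` method, and the `"Unknown"` fallback are assumptions for illustration. The point it shows is that the result list has one entry per queried name, in the same order, so each result can be matched back to its tuple.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the GenderDB state used in the slides.
class InMemoryGenderDB {
    private final Map<String, String> genderByName = new HashMap<>();

    void put(String name, String gender) {
        genderByName.put(name, gender);
    }

    // Returns one result per input name, in input order.
    // Assumption: names missing from the store map to "Unknown".
    List<String> getGenders(List<String> names) {
        List<String> result = new ArrayList<>(names.size());
        for (String name : names) {
            result.add(genderByName.getOrDefault(name, "Unknown"));
        }
        return result;
    }
}
```

Querying the whole batch in one call is what lets a real store answer with a single round trip instead of one request per tuple.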
  • 25. State Query
      • Tuple before: [{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London."}, "Iris"]
      • Tuple after: [{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London."}, "Iris", "Female"]
  • 26. Grouping
      • .groupBy(new Fields("gender"))
      • Groups the tuples containing the same gender value together
      • Re-partitions the stream
      • Tuples are sent over the network
  • 28. Grouping
      • Tuples before:
          – 1st partition: [{TweetJson1}, "Iris", "Female"]
          – 1st partition: [{TweetJson2}, "Michael", "Male"]
          – 2nd partition: [{TweetJson3}, "Lena", "Female"]
      • Group by gender
      • Tuples after:
          – new 1st partition: [{TweetJson1}, "Iris", "Female"]
          – new 1st partition: [{TweetJson3}, "Lena", "Female"]
          – new 2nd partition: [{TweetJson2}, "Michael", "Male"]
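The re-partitioning step can be pictured as hashing the grouping key to choose a target partition, so that tuples with equal keys always land together. The sketch below is plain Java and only illustrates the idea; `GroupByPartitioner` is hypothetical, rows are simplified to `{name, gender}` arrays, and hash-modulo is an assumed partitioning scheme rather than a statement about Storm's internals.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of key-based re-partitioning; not Storm code.
class GroupByPartitioner {
    // Picks a partition for a key; equal keys always map to the same partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Redistributes {name, gender} rows so that rows with the same gender
    // end up in the same partition.
    static List<List<String[]>> repartition(List<String[]> rows, int numPartitions) {
        List<List<String[]>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String[] row : rows) {
            String gender = row[1];
            partitions.get(partitionFor(gender, numPartitions)).add(row);
        }
        return partitions;
    }
}
```

Two different keys may still share a partition under this scheme; the guarantee is only that one key is never split across partitions.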
  • 29. Aggregators (general case)
      • Run the init() function before processing the batch
      • Aggregate over a number of tuples (usually grouped beforehand) and emit one or more results based on the aggregate method

        public interface Aggregator<T> extends Operation {
            T init(Object batchId, TridentCollector collector);
            void aggregate(T state, TridentTuple tuple, TridentCollector collector);
            void complete(T state, TridentCollector collector);
        }
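The init / aggregate / complete lifecycle can be sketched without Storm as follows. This is a simplification under stated assumptions: the `BatchAggregator` interface is hypothetical, it drops the collector and batch id, and `complete` returns a single result instead of emitting tuples. The example aggregator computes the average string length of a batch.

```java
import java.util.List;

// Hypothetical, simplified version of the aggregator lifecycle; not Storm code.
class AggregatorSketch {
    interface BatchAggregator<S, T, R> {
        S init();                        // create fresh state for a batch
        void aggregate(S state, T tuple); // fold one tuple into the state
        R complete(S state);             // produce the batch result
    }

    // Drives one batch through the lifecycle.
    static <S, T, R> R runBatch(BatchAggregator<S, T, R> agg, List<T> batch) {
        S state = agg.init();
        for (T tuple : batch) {
            agg.aggregate(state, tuple);
        }
        return agg.complete(state);
    }

    // Example: average length of the strings in a batch.
    // State is {sum of lengths, number of tuples}.
    static final BatchAggregator<long[], String, Double> AVG_LENGTH =
        new BatchAggregator<long[], String, Double>() {
            public long[] init() {
                return new long[] {0L, 0L};
            }
            public void aggregate(long[] state, String tuple) {
                state[0] += tuple.length();
                state[1] += 1;
            }
            public Double complete(long[] state) {
                return state[1] == 0 ? 0.0 : (double) state[0] / state[1];
            }
        };
}
```

Because the state is mutable and arbitrary, this general form can express aggregations that a simple pairwise combine cannot, such as an average.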
  • 30. Combiner Aggregator
      • Run init(TridentTuple t) on every tuple
      • Apply combine to the resulting values until a single value remains, then return it; zero() is the result when there are no tuples

        public class Count implements CombinerAggregator<Long> {
            public Long init(TridentTuple tuple) {
                return 1L;
            }
            public Long combine(Long val1, Long val2) {
                return val1 + val2;
            }
            public Long zero() {
                return 0L;
            }
        }
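How the three methods of a combiner fit together can be shown with a small driver loop. The sketch is plain Java: the `Combiner` interface and `run` method are hypothetical, and starting the fold from `zero()` assumes that `zero()` is the identity for `combine`, as it is for counting.

```java
import java.util.List;

// Hypothetical driver showing how init, combine and zero interact; not Storm code.
class CombinerSketch {
    interface Combiner<T, V> {
        V init(T tuple);       // value contributed by a single tuple
        V combine(V a, V b);   // merge two partial values
        V zero();              // result when there are no tuples
    }

    // Folds all tuples into one value.
    // Assumption: zero() is an identity element for combine.
    static <T, V> V run(Combiner<T, V> agg, List<T> tuples) {
        V acc = agg.zero();
        for (T tuple : tuples) {
            acc = agg.combine(acc, agg.init(tuple));
        }
        return acc;
    }

    // Counting combiner, mirroring the Count class on the slide.
    static final Combiner<String, Long> COUNT = new Combiner<String, Long>() {
        public Long init(String tuple) { return 1L; }
        public Long combine(Long a, Long b) { return a + b; }
        public Long zero() { return 0L; }
    };
}
```

Since `combine` only needs two partial values, partial counts can be computed on each partition first and merged afterwards, which reduces the data sent over the network.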
  • 31. Reducer Aggregator
      • Run init() to get an initial value
      • Iterate over the tuples, folding each one into the current value, to produce a single result

        public interface ReducerAggregator<T> extends Serializable {
            T init();
            T reduce(T curr, TridentTuple tuple);
        }
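A reducer is a left fold over the tuples of a batch. The following plain-Java sketch mirrors the two methods above; the `Reducer` interface, the `run` driver, and the `LONGEST` example are hypothetical names introduced for illustration.

```java
import java.util.List;

// Hypothetical sketch of the reducer pattern as a left fold; not Storm code.
class ReducerSketch {
    interface Reducer<T, V> {
        V init();                  // initial value
        V reduce(V curr, T tuple); // fold one tuple into the current value
    }

    // Applies reduce to every tuple in order, starting from init().
    static <T, V> V run(Reducer<T, V> agg, List<T> tuples) {
        V curr = agg.init();
        for (T tuple : tuples) {
            curr = agg.reduce(curr, tuple);
        }
        return curr;
    }

    // Example reducer: length of the longest string seen so far.
    static final Reducer<String, Integer> LONGEST = new Reducer<String, Integer>() {
        public Integer init() { return 0; }
        public Integer reduce(Integer curr, String tuple) {
            return Math.max(curr, tuple.length());
        }
    };
}
```

Unlike a combiner, the reducer sees the raw tuple at every step, so it does not need a separate per-tuple `init`.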
  • 33. Back to the example
      • For each gender batch, run the Count() aggregator
      • Not only aggregate, but also store the value in memory
      • Why? Because we want the count "over time"

        persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
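The reason for persisting the aggregate is that each batch only sees its own tuples, so a running total has to live outside the batch. A minimal plain-Java sketch of that idea follows; `RunningGenderCount` is a hypothetical stand-in for the combination of the count aggregator and an in-memory map state, not the Trident implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: totals survive across batches because they are kept in a map.
class RunningGenderCount {
    private final Map<String, Long> counts = new HashMap<>();

    // Folds one batch of gender values into the stored totals.
    void applyBatch(List<String> genders) {
        for (String gender : genders) {
            counts.merge(gender, 1L, Long::sum);
        }
    }

    // Returns the total seen so far for a gender, or 0 if none.
    long count(String gender) {
        return counts.getOrDefault(gender, 0L);
    }
}
```

After two batches, the stored totals reflect every tuple seen so far, which is the "over time" count the example is after.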
  • 34. Putting it all together

        TridentState genderDB = topology.newStaticState(new GenderDBFactory());

        Stream gender = topology.newStream("spout", spout)
            .each(new Fields("status"), new FilterWords(topicWords))
            .each(new Fields("status"), new ExpandName(), new Fields("name"))
            .parallelismHint(4)
            .stateQuery(genderDB, new Fields("name"), new QueryGender(), new Fields("gender"))
            .parallelismHint(10)
            .groupBy(new Fields("gender"))
            .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
            .newValuesStream();

  • 35. Demo: Gender count
  • 36. Some downsides
      • Debugging is hard
          – There is a pseudo-distributed mode, but still...
      • Object serialization
          – An issue when using third-party libraries
          – Register your own serializers (e.g. with Kryo) for better performance
  • 37. I didn’t tackle
      • Reliability
          – Guaranteed message processing
      • Distributed RPC example
      • Storm-deploy companion
          – One-click automated deployment of a Storm cluster, e.g. on EC2
  • 38. Contributions
  • 39. Overall
      • Express your real-time needs naturally
      • Growing community
      • System rapidly improving
      • Not a Hadoop/MR competitor
      • Fun to use
  • 40. Resources
      • Storm Unshortening example­unshortening
      • Understanding the Storm Parallelism
      • http://storm­
  • 41. The End. Michael Vogiatzis. Follow me @mvogiatzis
  • 42. Q & A