Storm real-time processing

    Storm real-time processing: Presentation Transcript

    • Storm
      Real-time computation made easy
      Michael Vogiatzis
    • What’s Storm?
      • Distributed real-time computation system
      • Fault tolerant
      • Fast
      • Scalable
      • Guaranteed message processing
      • Open source
      • Multilang capabilities
    • Purpose
      OK, but why?
    • Motivation
      • Queues and workers paradigm
      • Scaling is hard
      • System is not robust
      • Coding is not fun!
        – No abstraction
        – Low-level message passing
        – Intermediate message brokers
    • Use cases
      • Stream processing
        – Consume a stream, update a DB, etc.
      • Distributed RPC
        – Run an intense function on top of Storm
      • Ongoing computation
        – Computing music trends on Twitter
    • Architecture
    • Elements
      • Streams
        – Set of tuples
        – Unbounded sequence of data
      • Spout
        – Source of streams
      • Bolts
        – Application logic
        – Functions
        – Streaming aggregations, joins, DB ops
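To make the roles above concrete, here is a minimal plain-Java sketch of tuples flowing from a spout through two bolts. It does not use Storm: `MiniSpout`, `MiniBolt`, `tuple` and `run` are hypothetical stand-ins invented for illustration, and the "stream" is a short finite list rather than an unbounded one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; these are NOT Storm's own interfaces.
final class MiniTopology {
    /** A spout is a source of tuples (a finite list here; real streams are unbounded). */
    interface MiniSpout { List<List<Object>> tuples(); }

    /** A bolt holds application logic: it consumes one tuple and emits zero or more. */
    interface MiniBolt { List<List<Object>> execute(List<Object> tuple); }

    /** Builds a tuple as an ordered list of values. */
    static List<Object> tuple(Object... values) {
        return new ArrayList<>(Arrays.asList(values));
    }

    /** Pushes every spout tuple through the bolts in order and collects the output. */
    static List<List<Object>> run(MiniSpout spout, List<MiniBolt> bolts) {
        List<List<Object>> current = new ArrayList<>(spout.tuples());
        for (MiniBolt bolt : bolts) {
            List<List<Object>> next = new ArrayList<>();
            for (List<Object> t : current) {
                next.addAll(bolt.execute(t));
            }
            current = next;
        }
        return current;
    }

    static List<List<Object>> demo() {
        MiniSpout sentences = () -> Arrays.asList(tuple("storm is fast"), tuple("storm scales"));
        // First bolt: split each sentence into one tuple per word.
        MiniBolt splitWords = t -> {
            List<List<Object>> out = new ArrayList<>();
            for (String w : ((String) t.get(0)).split(" ")) {
                out.add(tuple(w));
            }
            return out;
        };
        // Second bolt: upper-case each word.
        MiniBolt upperCase = t -> Arrays.asList(tuple(((String) t.get(0)).toUpperCase()));
        return run(sentences, Arrays.asList(splitWords, upperCase));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In a real topology the bolts would run in parallel on different workers; this sketch only shows the data flow.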
    • Topology
    • Storm UI
    • Demo: Unshorten URLs
    • Evil Shorteners
    • Demo
    • Trident
      ● Higher level of abstraction on top of Storm
      ● Batch processing
      ● Keeps state using your persistence store, e.g. DBs, Memcached, etc.
      ● Exactly-once semantics
      ● Tuples can be replayed!
      ● Similar API to Pig / Cascading
    • Trident operations
      [Diagram labels: Operation, Input fields, Function fields]
    • Trident operations
      ● Joins
      ● Aggregations
      ● Grouping
      ● Functions
      ● Filtering
      ● Sorting
    • Trident State
      ● Solid API for reading from / writing to stateful sources
      ● State updates are idempotent
      ● Different kinds of fault tolerance, depending on the Spout implementation
    • Learn by example
      Compute the male / female count on a particular topic on Twitter over time
    • Trident Gender
      1. Stream of incoming tweets
      2. Filter out the tweets not relevant to the topic
      3. Determine gender by checking the first name
      4. Update either the male or the female counter
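One common way to make a state update idempotent is to store the id of the last applied batch next to the value and skip an update that carries the same id. The sketch below shows that idea in plain Java. `TxCountStore` and its methods are hypothetical names, not Storm classes, and the sketch assumes batches arrive in order and that a replayed batch has the same contents as the original.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory store for illustration; not a Storm class.
final class TxCountStore {
    private static final class Entry {
        long count;
        long lastTxId = -1; // no batch applied yet
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Adds delta for key unless this batch (txId) was already applied to it. */
    void applyBatch(long txId, String key, long delta) {
        Entry e = entries.computeIfAbsent(key, k -> new Entry());
        if (e.lastTxId == txId) {
            return; // replayed batch: the update was already recorded
        }
        e.count += delta;
        e.lastTxId = txId;
    }

    long get(String key) {
        Entry e = entries.get(key);
        return e == null ? 0L : e.count;
    }

    static long demo() {
        TxCountStore store = new TxCountStore();
        store.applyBatch(1, "Female", 2);
        store.applyBatch(2, "Female", 3);
        store.applyBatch(2, "Female", 3); // replay of batch 2 after a failure
        return store.get("Female");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Replaying batch 2 leaves the count unchanged, which is what lets replayed tuples coexist with exactly-once results.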
    • Input (Spout impl.)
      ● Receives the public stream (~1% of tweets) and emits the tweets into the system

          List<Object> tweets;

          public void emitBatch(long batchId, TridentCollector collector) {
              // Emit every buffered tweet as a one-field tuple
              for (Object o : tweets)
                  collector.emit(new Values(o));
          }
    • Filter
      Implement a Filter class called FilterWords

          .each(new Fields("status"), new FilterWords(interestingWords))

          String[] words = {"instagram", "flickr", "pinterest", "picasa"};

          public boolean isKeep(TridentTuple tuple) {
              Tweet t = (Tweet) tuple.getValue(0);
              // Is the tweet an interesting one?
              for (String word : words)
                  if (t.getText().toLowerCase().contains(word))
                      return true;
              return false;
          }
    • Function
      Implement a function class

          .each(new Fields("status"), new ExpandName(), new Fields("name"))

      Tuple before:
          [{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6"}]
      Tuple after:
          [{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6"}, "Iris"]
    • State Query
      Implement a QueryFunction to query the persistence storage.

          .stateQuery(genderDB, new Fields("name"), new QueryGender(), new Fields("gender"))

          public List<String> batchRetrieve(GenderDB state, List<TridentTuple> tuples) {
              // Collect the names of the whole batch so the store is queried once
              List<String> batchToQuery = new ArrayList<String>();
              for (TridentTuple t : tuples) {
                  String name = t.getStringByField("name");
                  batchToQuery.add(name);
              }
              return state.getGenders(batchToQuery);
          }
    • State Query
      Tuple before:
          [{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6"}, "Iris"]
      Tuple after:
          [{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6"}, "Iris", "Female"]
    • Grouping
      ● .groupBy(new Fields("gender"))
      ● Groups the tuples containing the same gender value together
      ● Re-partitions the stream
      ● Tuples are sent over the network
    • Grouping
      ● Tuples before:
          1st Partition: [{TweetJson1}, "Iris", "Female"]
          1st Partition: [{TweetJson2}, "Michael", "Male"]
          2nd Partition: [{TweetJson3}, "Lena", "Female"]
      Group By Gender
      ● Tuples after:
          new 1st Partition: [{TweetJson1}, "Iris", "Female"]
          new 1st Partition: [{TweetJson3}, "Lena", "Female"]
          new 2nd Partition: [{TweetJson2}, "Michael", "Male"]
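A rough plain-Java sketch of the re-partitioning step: each tuple is routed to a partition chosen from the hash of its grouping field, so equal gender values always end up together. `GroupBySketch`, `partitionFor`, `groupBy` and `isGrouped` are hypothetical names, and hashing is an assumption about how the routing could work, not a statement about Storm's internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not a Storm class.
final class GroupBySketch {
    /** Picks a partition from the key's hash; equal keys always map to the same partition. */
    static int partitionFor(Object key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    /** Re-partitions tuples by the value found at fieldIndex. */
    static List<List<List<Object>>> groupBy(List<List<Object>> tuples, int fieldIndex, int partitions) {
        List<List<List<Object>>> out = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            out.add(new ArrayList<>());
        }
        for (List<Object> t : tuples) {
            out.get(partitionFor(t.get(fieldIndex), partitions)).add(t);
        }
        return out;
    }

    /** Returns true when no key value appears in more than one partition. */
    static boolean isGrouped(List<List<List<Object>>> parts, int fieldIndex) {
        Map<Object, Integer> seen = new HashMap<>();
        for (int p = 0; p < parts.size(); p++) {
            for (List<Object> t : parts.get(p)) {
                Integer prev = seen.putIfAbsent(t.get(fieldIndex), p);
                if (prev != null && prev != p) {
                    return false;
                }
            }
        }
        return true;
    }

    static List<List<List<Object>>> demo() {
        List<List<Object>> tuples = new ArrayList<>();
        tuples.add(Arrays.<Object>asList("{TweetJson1}", "Iris", "Female"));
        tuples.add(Arrays.<Object>asList("{TweetJson2}", "Michael", "Male"));
        tuples.add(Arrays.<Object>asList("{TweetJson3}", "Lena", "Female"));
        return groupBy(tuples, 2, 2); // group on the gender field, two partitions
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With hash routing, two different keys may share a partition; the only guarantee is that one key never spans two partitions.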
    • Aggregators (general case)
      ● Run the init() function before processing the batch
      ● Aggregate over a number of tuples (usually grouped beforehand) and emit one or more results based on the aggregate method

          public interface Aggregator<T> extends Operation {
              T init(Object batchId, TridentCollector collector);
              void aggregate(T state, TridentTuple tuple, TridentCollector collector);
              void complete(T state, TridentCollector collector);
          }
    • Combiner Aggregator
      ● Run init(TridentTuple t) on every tuple
      ● Apply combine to the resulting values until no tuples are left, then return a single value

          public class Count implements CombinerAggregator<Long> {
              public Long init(TridentTuple tuple) {
                  return 1L;
              }
              public Long combine(Long val1, Long val2) {
                  return val1 + val2;
              }
              public Long zero() {
                  return 0L;
              }
          }
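The following plain-Java sketch shows how a caller might drive the three methods above. The local `Combiner` interface only mirrors the method shape on the slide, with tuples simplified to strings; it is not Storm's `CombinerAggregator`, and `aggregate` is a hypothetical driver.

```java
import java.util.Arrays;
import java.util.List;

final class CombinerSketch {
    /** Local stand-in mirroring the slide's method shape; NOT Storm's interface. */
    interface Combiner<T> {
        T init(String tuple);
        T combine(T a, T b);
        T zero();
    }

    /** Counts tuples: each tuple contributes 1 and partial counts are added. */
    static final class Count implements Combiner<Long> {
        public Long init(String tuple) { return 1L; }
        public Long combine(Long a, Long b) { return a + b; }
        public Long zero() { return 0L; }
    }

    /** Starts from zero() and combines in init(tuple) for every tuple. */
    static <T> T aggregate(Combiner<T> c, List<String> tuples) {
        T result = c.zero();
        for (String t : tuples) {
            result = c.combine(result, c.init(t));
        }
        return result;
    }

    static long demo() {
        Count count = new Count();
        // Partial results from two partitions can be combined into the overall total.
        Long left = aggregate(count, Arrays.asList("Iris", "Lena"));
        Long right = aggregate(count, Arrays.asList("Michael"));
        return count.combine(left, right);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because combine works on partial results, each partition can be aggregated locally before anything is sent over the network.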
    • Reducer Aggregator
      ● Run init() to get an initial value
      ● Iterate over the tuples, folding each one into the current value, to emit a single result

          public interface ReducerAggregator<T> extends Serializable {
              T init();
              T reduce(T curr, TridentTuple tuple);
          }
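As an illustration of the reducer shape, this plain-Java sketch finds the length of the longest name in a batch. The local `Reducer` interface only imitates the two methods shown on the slide, with tuples simplified to strings; `MaxLength` and `aggregate` are hypothetical names and nothing here is Storm code.

```java
import java.util.Arrays;
import java.util.List;

final class ReducerSketch {
    /** Local stand-in mirroring the slide's method shape; NOT Storm's interface. */
    interface Reducer<T> {
        T init();
        T reduce(T curr, String tuple);
    }

    /** Keeps the length of the longest tuple text seen so far. */
    static final class MaxLength implements Reducer<Integer> {
        public Integer init() { return 0; }
        public Integer reduce(Integer curr, String tuple) {
            return Math.max(curr, tuple.length());
        }
    }

    /** Starts from init() and folds every tuple into the running value. */
    static <T> T aggregate(Reducer<T> r, List<String> tuples) {
        T curr = r.init();
        for (String t : tuples) {
            curr = r.reduce(curr, t);
        }
        return curr;
    }

    static int demo() {
        return aggregate(new MaxLength(), Arrays.asList("Iris", "Michael", "Lena"));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```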
    • Back to the example
      ● For each gender batch, run the Count() aggregator
      ● Not only aggregate, but also store the value in memory
      ● Why?
      ● “Over time count”

          .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
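The reason for storing the value can be sketched in plain Java: totals must survive from one batch to the next, otherwise each batch would only report its own count. `RunningGenderCount` is a hypothetical class whose in-memory map merely stands in for the state store; it is not how `MemoryMapState` is implemented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class for illustration; the map stands in for the persistent state.
final class RunningGenderCount {
    private final Map<String, Long> state = new TreeMap<>();

    /** Counts one batch of gender values and folds the result into the stored totals. */
    void processBatch(List<String> genders) {
        for (String g : genders) {
            state.merge(g, 1L, Long::sum);
        }
    }

    Map<String, Long> totals() {
        return state;
    }

    static String demo() {
        RunningGenderCount counter = new RunningGenderCount();
        counter.processBatch(Arrays.asList("Female", "Male", "Female"));
        counter.processBatch(Arrays.asList("Male"));
        return counter.totals().toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

After the second batch the totals reflect both batches, which is the "over time" count the example asks for.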
    • Putting it all together

          TridentState genderDB = topology.newStaticState(new GenderDBFactory());

          Stream gender = topology.newStream("spout", spout)
              .each(new Fields("status"), new Filter(topicWords))
              .each(new Fields("status"), new ExpandName(), new Fields("name"))
              .parallelismHint(4)
              .stateQuery(genderDB, new Fields("name"), new QueryGender(), new Fields("gender"))
              .parallelismHint(10)
              .groupBy(new Fields("gender"))
              .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
              .newValuesStream();
    • Demo: Gender count
    • Some minuses
      • Hard debugging
        ➢ There is a pseudo-distributed mode, but still...
      • Object serialization
        ➢ When using 3rd-party libraries
        ➢ Register your own serializers for better performance, e.g. Kryo
    • I didn’t tackle
      • Reliability
        – Guaranteed message processing
      • Distributed RPC example
      • Storm-deploy companion
        – One-click automated deploy of a Storm cluster, e.g. on EC2
    • Contributions
    • Overall
      • Express your real-time needs naturally
      • Growing community
      • System rapidly improving
      • Not a Hadoop/MR competitor
      • Fun to use
    • Resources
      ● Storm Unshortening example: https://github.com/mvogiatzis/storm-unshortening
      ● Understanding the Storm Parallelism: http://bit.ly/RCx4Ln
      ● http://storm-project.net/
      ● https://github.com/nathanmarz/storm
    • The End
      Michael Vogiatzis
      Follow me @mvogiatzis
    • Q & A