Realtime processing with storm presentation
 

Storm is a distributed, reliable, fault-tolerant system for processing streams of data.

In this track we will introduce the Storm framework, explain some design concepts and considerations, and show some real-world examples of how to use it to process large amounts of data in real time, in a distributed environment. We will describe how this solution can scale very easily as more data needs to be processed.

We will explain all you need to know to get started with Storm and some tips on how to get your Spouts, Bolts and Topologies up and running in the cloud.


    Presentation Transcript

    • Realtime Processing with Storm
    • Who are we? Dario Simonassi @ldsimonassi Gabriel Eisbruch @geisbruch Jonathan Leibiusky @xetorthio
    • Agenda ● The problem ● Storm - Basic concepts ● Trident ● Deploying ● Some other features ● Contributions
    • Tons of information arriving every second (image: http://necromentia.files.wordpress.com/2011/01/20070103213517_iguazu.jpg)
    • Heavy workload (image: http://www.computerkochen.de/fun/bilder/esel.jpg)
    • Do it Realtime! (image: http://www.asimon.co.il/data/images/a10689.jpg)
    • [Diagram: an Information Source feeding queues of workers, chained through further queues into a DB — a complex queue/worker pipeline]
    • What is Storm?
      ● A highly distributed realtime computation system.
      ● Provides general primitives to do realtime computation.
      ● Can be used with any programming language.
      ● It is scalable and fault-tolerant.
    • Example - Main Concepts: https://github.com/geisbruch/strata2012
    • Main Concepts - Example "Count tweets related to strataconf and show hashtag totals."
    • Main Concepts - Spout
      ● First thing we need is to read all tweets related to strataconf from twitter
      ● A Spout is a special component that is a source of messages to be processed.
      [Diagram: Tweets → Spout → Tuples]
    • Main Concepts - Spout

      public class ApiStreamingSpout extends BaseRichSpout {
          SpoutOutputCollector collector;
          TwitterReader reader;

          public void nextTuple() {
              Tweet tweet = reader.getNextTweet();
              if (tweet != null)
                  collector.emit(new Values(tweet));
          }

          public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
              reader = new TwitterReader(conf.get("track"), conf.get("user"), conf.get("pass"));
              this.collector = collector;
          }

          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("tweet"));
          }
      }

      [Diagram: Tweets → Spout → Tuples]
    • Main Concepts - Bolt
      ● For every tweet, we find hashtags
      ● A Bolt is a component that does the processing.
      [Diagram: Spout → tweet stream → HashtagsSplitter → hashtags stream]
    • Main Concepts - Bolt

      public class HashtagsSplitter extends BaseBasicBolt {
          public void execute(Tuple input, BasicOutputCollector collector) {
              Tweet tweet = (Tweet) input.getValueByField("tweet");
              for (String hashtag : tweet.getHashTags()) {
                  collector.emit(new Values(hashtag));
              }
          }

          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("hashtag"));
          }
      }

      [Diagram: Spout → tweet stream → HashtagsSplitter → hashtags stream]
    • Main Concepts - Bolt

      public class HashtagsCounterBolt extends BaseBasicBolt {
          private Jedis jedis;

          public void execute(Tuple input, BasicOutputCollector collector) {
              String hashtag = input.getStringByField("hashtag");
              jedis.hincrBy("hashs", hashtag, 1);
          }

          public void prepare(Map conf, TopologyContext context) {
              jedis = new Jedis("localhost");
          }
      }

      [Diagram: Spout → tweet stream → HashtagsSplitter → hashtag stream → HashtagsCounterBolt]
    • Main Concepts - Topology

      public static StormTopology createTopology() {
          TopologyBuilder builder = new TopologyBuilder();
          builder.setSpout("tweets-collector", new ApiStreamingSpout(), 1);
          builder.setBolt("hashtags-splitter", new HashtagsSplitter(), 2)
                 .shuffleGrouping("tweets-collector");
          builder.setBolt("hashtags-counter", new HashtagsCounterBolt(), 2)
                 .fieldsGrouping("hashtags-splitter", new Fields("hashtag"));
          return builder.createTopology();
      }

      [Diagram: Spout → shuffle → HashtagsSplitter → fields → HashtagsCounterBolt]
    • Example Overview - Read Tweet
      [Spout reads: "This tweet has #hello #world hashtags!!!"]
    • Example Overview - Split Tweet
      [Splitter emits: #hello, #world]
    • Example Overview - Count Hashtags
      [Counter runs incr(#hello), incr(#world); Redis: #hello = 1, #world = 1]
    • Example Overview - Read Tweet
      [Spout reads: "This tweet has #bye #world hashtags"; Redis: #hello = 1, #world = 1]
    • Example Overview - Split Tweet
      [Splitter emits: #bye, #world]
    • Example Overview - Count Hashtags
      [Counter runs incr(#bye), incr(#world); Redis: #hello = 1, #world = 2, #bye = 1]
    • Main Concepts - In summary
      ○ Topology
      ○ Spout
      ○ Stream
      ○ Bolt
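    The dataflow in the example above can be sketched without Storm at all. The following is a minimal, framework-free Java simulation of the spout → splitter → counter pipeline; class and method names are illustrative (a HashMap stands in for Redis), not part of the Storm API.

```java
import java.util.*;

public class HashtagPipelineSketch {

    // Splitter bolt logic: extract #hashtags from a tweet's text.
    static List<String> splitHashtags(String tweet) {
        List<String> tags = new ArrayList<>();
        for (String word : tweet.split("\\s+")) {
            if (word.startsWith("#")) tags.add(word);
        }
        return tags;
    }

    // Counter bolt logic: increment a running total per hashtag.
    static Map<String, Integer> countHashtags(List<String> tweets) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tweet : tweets) {                 // spout: one tuple per tweet
            for (String tag : splitHashtags(tweet)) { // splitter: one tuple per hashtag
                counts.merge(tag, 1, Integer::sum);   // counter: incr(hashtag)
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
            "This tweet has #hello #world hashtags!!!",
            "This tweet has #bye #world hashtags");
        // #world is seen twice, #hello and #bye once each
        System.out.println(countHashtags(tweets));
    }
}
```

    In the real topology each stage runs as separately parallelizable tasks connected by streams; this sketch only mirrors the logic.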
    • Guaranteeing Message Processing http://www.steinborn.com/home_warranty.php
    • Guaranteeing Message Processing
      [Diagram: Spout → HashtagsSplitter → HashtagsCounterBolt tuple tree]
      ● Each bolt acks the tuples it has finished processing: _collector.ack(tuple);
      ● The spout is notified when a tuple fails or times out: void fail(Object msgId);
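    Behind ack and fail, Storm's acker tracks each spout tuple's tree with a single 64-bit value: the id of every anchored tuple and every acked tuple is XORed in, and when the value returns to zero the whole tree has been processed. The sketch below is a toy model of that idea with fixed ids (Storm uses random 64-bit ids); names are illustrative, not Storm's internals.

```java
public class AckerSketch {
    long ackVal = 0;

    void anchored(long tupleId) { ackVal ^= tupleId; } // a tuple was emitted/anchored
    void acked(long tupleId)    { ackVal ^= tupleId; } // a tuple was acked

    // Every id XORed in twice cancels out, so zero means the tree is done.
    boolean fullyProcessed() { return ackVal == 0; }

    public static void main(String[] args) {
        AckerSketch acker = new AckerSketch();
        long spoutTuple = 1001L, child = 2002L;       // fixed ids for the demo
        acker.anchored(spoutTuple);                   // spout emits the root tuple
        acker.anchored(child);                        // bolt emits an anchored child
        acker.acked(spoutTuple);                      // splitter acks the root
        System.out.println(acker.fullyProcessed());   // false: child still pending
        acker.acked(child);                           // counter acks the child
        System.out.println(acker.fullyProcessed());   // true: tree complete
    }
}
```

    If the value has not reached zero within the topology's timeout, the spout's fail method is called with the original message id so the tuple can be replayed.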
    • Trident http://www.raytekmarine.com/images/trident.jpg
    • Trident - What is Trident?
      ● High-level abstraction on top of Storm.
      ● Makes it easier to build topologies.
      ● Similar to Pig in the Hadoop world
    • Trident - Topology

      TridentTopology topology = new TridentTopology();

      [Diagram: Spout → tweet stream → HashtagsSplitter → hashtags → Memory]
    • Trident - Stream

      Stream tweetsStream = topology
          .newStream("tweets", new ApiStreamingSpout());

      [Diagram: Spout → tweet stream]
    • Trident - Stream

      Stream hashtagsStream = tweetsStream.each(new Fields("tweet"), new BaseFunction() {
          public void execute(TridentTuple tuple, TridentCollector collector) {
              Tweet tweet = (Tweet) tuple.getValueByField("tweet");
              for (String hashtag : tweet.getHashtags())
                  collector.emit(new Values(hashtag));
          }
      }, new Fields("hashtag"));

      [Diagram: Spout → tweet stream → HashtagsSplitter → hashtags stream]
    • Trident - States

      TridentState hashtagCounts = hashtagsStream
          .groupBy(new Fields("hashtag"))
          .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

      [Diagram: Spout → tweet stream → HashtagsSplitter → hashtags → Memory (State)]
    • Trident - Complete example

      TridentState hashtags = topology
          .newStream("tweets", new ApiStreamingSpout())
          .each(new Fields("tweet"), new HashTagSplitter(), new Fields("hashtag"))
          .groupBy(new Fields("hashtag"))
          .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

      [Diagram: Spout → tweet stream → HashtagsSplitter → hashtags → Memory (State)]
    • Transactions
      ● Exactly-once message processing semantics
      ● The intermediate processing may be executed in parallel
      ● But persistence must be strictly sequential
      ● Spouts and persistence must be designed to be transactional
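    The core trick behind exactly-once semantics can be shown in a few lines: store each update together with the transaction id that produced it, so a replayed batch is detected and skipped. The sketch below uses an in-memory map standing in for a real database; all names are illustrative, not Trident's actual API.

```java
import java.util.HashMap;
import java.util.Map;

public class TransactionalCounterSketch {
    // value stored per key: [count, txid of the last applied update]
    private final Map<String, long[]> store = new HashMap<>();

    // Apply a batch of increments under a transaction id.
    void applyBatch(long txid, Map<String, Integer> increments) {
        for (Map.Entry<String, Integer> e : increments.entrySet()) {
            long[] cur = store.getOrDefault(e.getKey(), new long[]{0, -1});
            if (cur[1] == txid) continue; // replayed batch: already applied, skip
            store.put(e.getKey(), new long[]{cur[0] + e.getValue(), txid});
        }
    }

    long count(String key) {
        long[] cur = store.get(key);
        return cur == null ? 0 : cur[0];
    }

    public static void main(String[] args) {
        TransactionalCounterSketch state = new TransactionalCounterSketch();
        Map<String, Integer> batch = Map.of("#hadoop", 3);
        state.applyBatch(1, batch);
        state.applyBatch(1, batch); // replay after a failure: no double counting
        System.out.println(state.count("#hadoop")); // 3
    }
}
```

    Note that this only works because batches are persisted in strict txid order, which is exactly why the slides insist that persistence must be sequential even when the intermediate processing is parallel.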
    • Trident - Queries: we want to know, online, the count of tweets with the hashtag hadoop and the count of tweets with the hashtag strataconf
    • Trident - DRPC
    • Trident - Query

      DRPCClient drpc = new DRPCClient("MyDRPCServer", "MyDRPCPort");
      drpc.execute("counts", "strataconf hadoop");
    • Trident - Query

      Stream drpcStream = topology.newDRPCStream("counts", drpc);
    • Trident - Query

      Stream singleHashtag = drpcStream.each(new Fields("args"), new Split(), new Fields("hashtag"));

      [Diagram: DRPC Server → DRPC stream → Hashtag Splitter]
    • Trident - Query

      Stream hashtagsCountQuery = singleHashtag.stateQuery(hashtagCountMemoryState,
          new Fields("hashtag"), new MapGet(), new Fields("count"));

      [Diagram: DRPC Server → DRPC stream → Hashtag Splitter → Hashtag Count Query]
    • Trident - Query

      Stream drpcCountAggregate = hashtagsCountQuery.groupBy(new Fields("hashtag"))
          .aggregate(new Fields("count"), new Sum(), new Fields("sum"));

      [Diagram: DRPC Server → DRPC stream → Hashtag Splitter → Hashtag Count Query → Aggregate Count]
    • Trident - Query - Complete Example

      topology.newDRPCStream("counts", drpc)
          .each(new Fields("args"), new Split(), new Fields("hashtag"))
          .stateQuery(hashtags, new Fields("hashtag"), new MapGet(), new Fields("count"))
          .each(new Fields("count"), new FilterNull())
          .groupBy(new Fields("hashtag"))
          .aggregate(new Fields("count"), new Sum(), new Fields("sum"));
    • Trident - Query - Complete Example
      [Diagram: Client → DRPC Server → DRPC stream → Hashtag Splitter → Hashtag Count Query → Aggregate Count → DRPC Server]
      ● The client calls the DRPC server
      ● The DRPC server holds the request while the topology executes it
      ● The request stays held; on finish, the result returns to the DRPC server
      ● The DRPC server returns the response to the client
    • Running Topologies
    • Running Topologies - Local Cluster

      Config conf = new Config();
      conf.put("some_property", "some_value");
      LocalCluster cluster = new LocalCluster();
      cluster.submitTopology("twitter-hashtag-summarizer", conf, builder.createTopology());

      ● The whole topology will run in a single machine.
      ● Similar to running in a production cluster
      ● Very useful for testing and development
      ● Configurable parallelism
      ● Similar methods to StormSubmitter
    • Running Topologies - Production

      package twitter.streaming;

      public class Topology {
          public static void main(String[] args) {
              StormSubmitter.submitTopology("twitter-hashtag-summarizer", conf, builder.createTopology());
          }
      }

      mvn package
      storm jar Strata2012-0.0.1.jar twitter.streaming.Topology "$REDIS_HOST" $REDIS_PORT "strata,strataconf" $TWITTER_USER $TWITTER_PASS
    • Running Topologies - Cluster Layout
      ● Single master process: Nimbus
      ● No SPOF
      ● Automatic task distribution
      ● Amount of tasks configurable by process
      ● Automatic task re-distribution when supervisors fail
      ● Work continues if Nimbus fails
      [Diagram: Storm Cluster — Nimbus (the master process) and Supervisors (where the processing is done), coordinated through a Zookeeper Cluster (only for coordination and cluster state)]
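    As a rough sketch of how such a cluster is wired together, a minimal storm.yaml might look like the following. Key names are from the Storm 0.8-era configuration; all hostnames and port numbers are placeholders.

```yaml
# ZooKeeper ensemble: used only for coordination and cluster state
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"

# The master process
nimbus.host: "nimbus.example.com"

# Worker slots on each supervisor: one port per worker process,
# so this machine can run up to four workers
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
```

    Each supervisor reads this file at startup; adding machines with the same configuration is how the cluster scales out.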
    • Running Topologies - Cluster Architecture - No SPOF
      [Diagram: Storm Cluster — Nimbus (the master process) and three Supervisors (where the processing is done); any node can fail without stopping the cluster]
    • Other features
    • Other features - Non JVM languages
      ● Use non-JVM languages
      ● Simple protocol
      ● Uses stdin & stdout for communication
      ● Many implementations and abstraction layers done by the community

      {
        "conf": {
          "topology.message.timeout.secs": 3,
          // etc
        },
        "context": {
          "task->component": {
            "1": "example-spout",
            "2": "__acker",
            "3": "example-bolt"
          },
          "taskid": 3
        },
        "pidDir": "..."
      }
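    The framing of that protocol is simple: each side writes a JSON message to its stream followed by a line containing only "end". The sketch below builds such a frame by hand to stay dependency-free; a real non-JVM shell component would use a JSON library, and the emit message shown is only an illustrative example.

```java
public class MultilangFramingSketch {

    // Frame one multilang message: the JSON body, then a line with "end".
    static String frame(String json) {
        return json + "\n" + "end" + "\n";
    }

    public static void main(String[] args) {
        // e.g. a non-JVM bolt emitting a tuple back to Storm over stdout
        System.out.print(frame("{\"command\": \"emit\", \"tuple\": [\"#hello\"]}"));
    }
}
```

    Reading works the same way in reverse: accumulate stdin lines until "end", then parse the accumulated JSON.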
    • Contribution
      ● A collection of community developed spouts, bolts, serializers, and other goodies.
      ● You can find it in: https://github.com/nathanmarz/storm-contrib
      ● Some of those goodies are:
        ○ cassandra: Integrates Storm and Cassandra by writing tuples to a Cassandra Column Family.
        ○ mongodb: Provides a spout to read documents from mongo, and a bolt to write documents to mongo.
        ○ SQS: Provides a spout to consume messages from Amazon SQS.
        ○ hbase: Provides a spout and a bolt to read and write from HBase.
        ○ kafka: Provides a spout to read messages from kafka.
        ○ rdbms: Provides a bolt to write tuples to a table.
        ○ redis-pubsub: Provides a spout to read messages from redis pubsub.
        ○ And more...
    • Questions?