Data streams processing with STORM
Mariusz Gil
Transcript of "Streams processing with Storm"

  1. Mariusz Gil: Data streams processing with STORM
  2. data expire fast. very fast
  3. realtime processing?
  4. Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
  5. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
  6. concept architecture
  7. Stream: an unbounded sequence of tuples, e.g. (val1, val2), (val3, val4), (val5, val6)
  8. Spouts: source of streams
  9. Reliable and unreliable Spouts: replay or forget about a tuple
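The replay-or-forget contract on this slide can be sketched in a few lines of Python (the deck itself drops into Python for multilang bolts). This is a toy model of the guarantee, not the real Storm spout API; all names here are hypothetical:

```python
import itertools

class ReliableSpout:
    # Toy model of a reliable spout: a tuple emitted with a message id
    # stays pending until acked; fail() puts it back on the queue for
    # replay. Hypothetical names, not the actual Storm API.
    def __init__(self, messages):
        self.queue = list(messages)
        self.pending = {}                 # msg_id -> tuple awaiting ack
        self._ids = itertools.count()

    def next_tuple(self):
        if not self.queue:
            return None
        msg_id = next(self._ids)
        tup = self.queue.pop(0)
        self.pending[msg_id] = tup        # remember until the topology acks
        return msg_id, tup

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)    # fully processed: forget the tuple

    def fail(self, msg_id):
        # processing failed downstream: schedule the tuple for replay
        self.queue.append(self.pending.pop(msg_id))

spout = ReliableSpout(["(val1, val2)"])
msg_id, tup = spout.next_tuple()
spout.fail(msg_id)        # tuple goes back on the queue for replay
```

An unreliable spout is simply one that never keeps a `pending` map, so there is nothing to replay after emitting.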
  10. Spouts: source of streams (Storm-Kafka)
  11. Spouts: source of streams (Storm-Kestrel)
  12. Spouts: source of streams (Storm-AMQP-Spout)
  13. Spouts: source of streams (Storm-JMS)
  14. Spouts: source of streams (Storm-PubSub*)
  15. Spouts: source of streams (Storm-Beanstalkd-Spout)
  16. Bolts: process input streams and produce new streams
  17. Bolts: process input streams and produce new streams
  18. Topologies: network of spouts and bolts. TextSpout -> [sentence] -> SplitSentenceBolt -> [word] -> WordCountBolt -> [word, count]
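The TextSpout -> SplitSentenceBolt -> WordCountBolt dataflow above can be followed end to end with a single-process Python sketch. The generator functions stand in for the bolts and are illustrative only, not Storm code:

```python
from collections import Counter

def split_sentence_bolt(sentence):
    # [sentence] -> [word]
    for word in sentence.split():
        yield word

def word_count_bolt(words):
    # [word] -> [word, count], emitting a running count per word,
    # exactly the shape the WordCount bolt on slide 27 emits
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]

sentences = ["the cow jumped over the moon"]      # what TextSpout would emit
words = (w for s in sentences for w in split_sentence_bolt(s))
for word, count in word_count_bolt(words):
    print(word, count)
```

Note that the running count for "the" reaches 2 by the end of the sentence: each tuple updates task-local state, which is why the real topology routes words with fields grouping.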
  19. Topologies: network of spouts and bolts. A larger graph: two TextSpout -> SplitSentenceBolt chains, with WordCountBolt and an xyzBolt consuming the resulting streams.
  20. servers architecture
  21. Nimbus: process responsible for distributing processing across the cluster
  22. Supervisors: worker processes responsible for executing a subset of a topology
  23. ZooKeepers: coordination layer between Nimbus and Supervisors
  24. fail fast: CLUSTER STATE IS STORED LOCALLY OR IN ZOOKEEPERS
  25. sample code
  26. Spouts:

      public class RandomSentenceSpout extends BaseRichSpout {
          SpoutOutputCollector _collector;
          Random _rand;

          @Override
          public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
              _collector = collector;
              _rand = new Random();
          }

          @Override
          public void nextTuple() {
              Utils.sleep(100);
              String[] sentences = new String[] {
                  "the cow jumped over the moon",
                  "an apple a day keeps the doctor away",
                  "four score and seven years ago",
                  "snow white and the seven dwarfs",
                  "i am at two with nature"};
              String sentence = sentences[_rand.nextInt(sentences.length)];
              _collector.emit(new Values(sentence));
          }

          @Override
          public void ack(Object id) { }

          @Override
          public void fail(Object id) { }

          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("word"));
          }
      }
  27. Bolts:

      public static class WordCount extends BaseBasicBolt {
          Map<String, Integer> counts = new HashMap<String, Integer>();

          @Override
          public void execute(Tuple tuple, BasicOutputCollector collector) {
              String word = tuple.getString(0);
              Integer count = counts.get(word);
              if (count == null) count = 0;
              count++;
              counts.put(word, count);
              collector.emit(new Values(word, count));
          }

          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("word", "count"));
          }
      }
  28. Bolts:

      public static class ExclamationBolt implements IRichBolt {
          OutputCollector _collector;

          public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
              _collector = collector;
          }

          public void execute(Tuple tuple) {
              _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
              _collector.ack(tuple);
          }

          public void cleanup() { }

          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("word"));
          }

          public Map getComponentConfiguration() {
              return null;
          }
      }
  29. Topology:

      public class WordCountTopology {
          public static void main(String[] args) throws Exception {
              TopologyBuilder builder = new TopologyBuilder();
              builder.setSpout("spout", new RandomSentenceSpout(), 5);
              builder.setBolt("split", new SplitSentence(), 8)
                     .shuffleGrouping("spout");
              builder.setBolt("count", new WordCount(), 12)
                     .fieldsGrouping("split", new Fields("word"));

              Config conf = new Config();
              conf.setDebug(true);

              if (args != null && args.length > 0) {
                  conf.setNumWorkers(3);
                  StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
              } else {
                  conf.setMaxTaskParallelism(3);
                  LocalCluster cluster = new LocalCluster();
                  cluster.submitTopology("word-count", conf, builder.createTopology());
                  Thread.sleep(10000);
                  cluster.shutdown();
              }
          }
      }
  30. Bolts:

      public static class SplitSentence extends ShellBolt implements IRichBolt {
          public SplitSentence() {
              super("python", "splitsentence.py");
          }

          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("word"));
          }
      }

      import storm

      class SplitSentenceBolt(storm.BasicBolt):
          def process(self, tup):
              words = tup.values[0].split(" ")
              for word in words:
                  storm.emit([word])

      SplitSentenceBolt().run()
  31. github.com/nathanmarz/storm-starter
  32. streams grouping
  33. Topology:

      public class WordCountTopology {
          public static void main(String[] args) throws Exception {
              TopologyBuilder builder = new TopologyBuilder();
              builder.setSpout("spout", new RandomSentenceSpout(), 5);
              builder.setBolt("split", new SplitSentence(), 8)
                     .shuffleGrouping("spout");
              builder.setBolt("count", new WordCount(), 12)
                     .fieldsGrouping("split", new Fields("word"));

              Config conf = new Config();
              conf.setDebug(true);

              if (args != null && args.length > 0) {
                  conf.setNumWorkers(3);
                  StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
              } else {
                  conf.setMaxTaskParallelism(3);
                  LocalCluster cluster = new LocalCluster();
                  cluster.submitTopology("word-count", conf, builder.createTopology());
                  Thread.sleep(10000);
                  cluster.shutdown();
              }
          }
      }
  34. Grouping: shuffle, fields, all, global, none, direct, local or shuffle
  35. distributed rpc
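The practical difference between the two groupings used in the WordCountTopology can be sketched by routing tuples to task indexes. The hash-modulo routing below illustrates the idea only; it is not Storm's actual partitioner:

```python
import random

def shuffle_grouping(tup, num_tasks):
    # any task will do: tuples are spread across tasks evenly at random
    return random.randrange(num_tasks)

def fields_grouping(tup, field, num_tasks):
    # tuples with the same value of `field` always reach the same task,
    # which is why WordCount can keep per-word counters in a task-local map
    return hash(tup[field]) % num_tasks

tasks = 4
same = (fields_grouping({"word": "storm"}, "word", tasks)
        == fields_grouping({"word": "storm"}, "word", tasks))
print(same)   # True: "storm" is always routed to one task
```

Shuffle grouping balances load but scatters state; fields grouping trades perfect balance for the co-location that stateful bolts need.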
  36. Distributed RPC: the client sends [request-id, arguments] into the topology and receives [request-id, results] back.
  37. Distributed RPC:

      public static class ExclaimBolt extends BaseBasicBolt {
          public void execute(Tuple tuple, BasicOutputCollector collector) {
              String input = tuple.getString(1);
              collector.emit(new Values(tuple.getValue(0), input + "!"));
          }

          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("id", "result"));
          }
      }

      public static void main(String[] args) throws Exception {
          LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclamation");
          builder.addBolt(new ExclaimBolt(), 3);

          Config conf = new Config();
          LocalDRPC drpc = new LocalDRPC();
          LocalCluster cluster = new LocalCluster();
          cluster.submitTopology("drpc-demo", conf, builder.createLocalTopology(drpc));

          System.out.println("Results for 'hello': " + drpc.execute("exclamation", "hello"));

          cluster.shutdown();
          drpc.shutdown();
      }
  38. Use cases: realtime analytics, personalization, search, revenue optimization, monitoring
  39. content search, realtime analytics, generating feeds; integrated with Elasticsearch, HBase, Hadoop and HDFS
  40. realtime scoring, moments generation; integrated with Kafka queues and HDFS storage
  41. Storm-YARN enables Storm applications to utilize the computational resources in a Hadoop cluster along with accessing Hadoop storage resources such as HBase and HDFS
  42. thanks! mail: mariusz@mariuszgil.pl twitter: @mariuszgil