BWB Meetup: Storm - distributed realtime computation system
Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!


Usage Rights: CC Attribution License

Presentation Transcript

  • Storm: an overview of distributed and fault-tolerant realtime computation. Backend Web Berlin
  • Storm www.storm-project.net Storm is a free and open source distributed realtime computation system. September BWB Meetup
  • Use cases: distributed RPC, continuous computations, stream processing
  • Overview • free and open source • integrates with any queuing and database system • distributed and scalable • fault-tolerant • supports multiple languages
  • Scalable Storm topologies are inherently parallel and run across a cluster of machines. Different parts of the topology can be scaled individually by tweaking their parallelism. The "rebalance" command of the "storm" command line client can adjust the parallelism of running topologies on the fly.
  • Fault tolerant When workers die, Storm will automatically restart them. If a node dies, the worker will be restarted on another node. The Storm daemons, Nimbus and the Supervisors, are designed to be stateless and fail-fast.
  • Guarantees data processing Storm guarantees every tuple will be fully processed. One of Storm's core mechanisms is the ability to track the lineage of a tuple as it makes its way through the topology in an extremely efficient way. Messages are only replayed when there are failures. Storm's basic abstractions provide an at-least-once processing guarantee, the same guarantee you get when using a queueing system.
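The at-least-once guarantee described above can be illustrated with a toy replay loop (a concept sketch in plain Python, not Storm's actual tuple-tree tracking): a tuple that fails is re-enqueued, so every tuple is eventually fully processed, possibly more than once.

```python
from collections import deque

def process_at_least_once(items, handler):
    """Replay-on-failure loop: each item is re-enqueued until the
    handler succeeds, so every item is processed at least once."""
    queue = deque(items)
    processed = []
    while queue:
        item = queue.popleft()
        try:
            processed.append(handler(item))
        except RuntimeError:
            queue.append(item)  # replay the failed tuple at the back of the queue

    return processed

# A hypothetical handler that fails once for item 2, then succeeds on replay.
failures = {2}
def flaky_handler(x):
    if x in failures:
        failures.discard(x)
        raise RuntimeError("transient failure")
    return x * 10

result = process_at_least_once([1, 2, 3], flaky_handler)
# item 2 is replayed after its transient failure, so it finishes last
```

Note that item 2 is processed after 1 and 3 because of the replay, which is exactly why at-least-once systems give no ordering guarantee across failures.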
  • Use with many languages Storm was designed from the ground up to be usable with any programming language. Similarly, spouts and bolts can be defined in any language. Non-JVM spouts and bolts communicate to Storm over a JSON-based protocol over stdin/stdout. Adapters that implement this protocol exist for Ruby, Python, Javascript, Perl, and PHP.
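The JSON-over-stdin/stdout protocol mentioned above can be sketched as follows (a simplified illustration; real adapters such as storm.py also handle the initial handshake, acks, and heartbeats). Messages are JSON objects terminated by a line containing "end":

```python
import json

def encode_emit(values, stream=None, anchors=None):
    """Serialize an 'emit' command the way a non-JVM component would
    write it to stdout: a JSON object plus an 'end' terminator line."""
    msg = {"command": "emit", "tuple": values}
    if stream is not None:
        msg["stream"] = stream
    if anchors is not None:
        msg["anchors"] = anchors
    return json.dumps(msg) + "\nend\n"

def decode_message(raw):
    """Parse one protocol message back into a Python dict."""
    body, terminator, _rest = raw.partition("\nend\n")
    assert terminator, "message must be terminated by an 'end' line"
    return json.loads(body)

wire = encode_emit(["hello", 42], stream="item_copy")
msg = decode_message(wire)
```

The field names (`command`, `tuple`, `stream`, `anchors`) follow the multilang protocol; the helper names here are illustrative, not part of any adapter.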
  • How Storm works? A Storm cluster consists of a Nimbus daemon, a ZooKeeper ensemble (e.g. three nodes) for coordination, and a set of Supervisor daemons, one per worker node.
  • How Storm works? Basic concepts. Topology: a graph of computation; a topology runs forever, or until you kill it. Stream: an unbounded sequence of tuples. Spout: a source of streams. Bolt: where calculations are done; bolts can run functions, filter tuples, do streaming aggregations and joins, talk to databases, etc.
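These abstractions can be mimicked in a few lines of plain Python (a concept sketch only; real Storm spouts and bolts run distributed and asynchronously): a spout produces a stream of tuples, one bolt runs a function over each tuple, and another does a streaming aggregation.

```python
def word_spout():
    """Spout: a source of a stream of tuples (finite here for the demo)."""
    for word in ["storm", "is", "fun"]:
        yield (word,)

def upper_bolt(stream):
    """Bolt: runs a function over each tuple."""
    for (word,) in stream:
        yield (word.upper(),)

def count_bolt(stream):
    """Bolt: a streaming aggregation (running total of characters seen)."""
    total = 0
    for (word,) in stream:
        total += len(word)
        yield (word, total)

# Topology: word_spout -> upper_bolt -> count_bolt
results = list(count_bolt(upper_bolt(word_spout())))
```

Chaining the generators mirrors how a topology wires a spout's output stream into downstream bolts.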
  • How Storm works? Basic concepts Worker process A worker process executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more components (spouts or bolts) of this topology. Executor (thread) Executor is a thread that is spawned by a worker process. It may run 1+ tasks for the same component. It always has 1 thread that it uses for all of its tasks. Task Task performs the actual data processing – each spout or bolt that you implement in your code executes as many tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology.
  • How Storm works? Basic concepts (task layout from the example topology): Spout: 2 tasks; BoltA: 3 tasks; BoltB: 2 tasks; BoltC: 6 tasks; BoltD: 3 tasks; BoltE: 2 tasks; BoltF: 1 task.
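The executor/task split above can be made concrete with a small sketch (Storm's real scheduler may place tasks differently; this only illustrates the counts): a component's tasks are spread over its executor threads, so one executor can run several tasks.

```python
def assign_tasks(num_tasks, num_executors):
    """Distribute task ids round-robin over executors, illustrating how
    a component's tasks are spread across its executor threads."""
    executors = [[] for _ in range(num_executors)]
    for task_id in range(num_tasks):
        executors[task_id % num_executors].append(task_id)
    return executors

# BoltC from the example topology: parallelism hint 2 (executors), 6 tasks.
boltc_executors = assign_tasks(6, 2)
# Each of BoltC's 2 executor threads ends up running 3 of its 6 tasks.
```

Because the number of tasks is fixed for the topology's lifetime while executors can be rebalanced, this ratio (tasks per executor) is what changes when you scale a running topology.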
  • How Storm works? Topology Example

    class DemoTopology {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("Spout", new DemoSpout(), 2).setNumTasks(2)
               .declareDefaultStream("uid", "item")
               .declareStream("item_copy", "uid", "item");
        builder.setBolt("BoltA", new BoltA(), 2).setNumTasks(3)
               .shuffleGrouping("Spout", "item_copy");
        builder.setBolt("BoltB", new BoltB(), 2).setNumTasks(2)
               .shuffleGrouping("Spout")
               .declareDefaultStream("uid", "fromB");
        builder.setBolt("BoltC", new BoltC(), 2).setNumTasks(6)
               .shuffleGrouping("BoltA")
               .declareDefaultStream("uid", "fromC");
        builder.setBolt("BoltD", new BoltD(), 3).setNumTasks(3)
               .shuffleGrouping("BoltC")
               .fieldsGrouping("BoltC", new Fields("uid"))
               .fieldsGrouping("BoltB", new Fields("uid"))
               .declareStream("forE", "uid", "text")
               .declareStream("forF", "uid", "text", "ne");
        builder.setBolt("BoltE", new BoltE(), 1).setNumTasks(2)
               .shuffleGrouping("BoltD", "forE");
        builder.setBolt("BoltF", new BoltF(), 1).setNumTasks(1)
               .shuffleGrouping("BoltD", "forF");
        StormSubmitter.submitTopology("demoTopology", conf, builder.createTopology());
    }
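The fieldsGrouping on "uid" in the example guarantees that tuples with the same uid always reach the same task. Conceptually this is hash-based routing, sketched below (an illustration; Storm's actual implementation hashes the selected tuple fields internally — CRC32 is used here only as a stable stand-in):

```python
import zlib

def fields_grouping(uid, num_tasks):
    """Route a tuple to a task index via a stable hash of the grouping
    field, so equal uids always map to the same task."""
    return zlib.crc32(str(uid).encode("utf-8")) % num_tasks

num_tasks = 3
a = fields_grouping("user-42", num_tasks)
b = fields_grouping("user-42", num_tasks)
# a == b: the same uid is always routed to the same task
```

This is what makes per-uid stateful bolts (like BoltD's joins on "uid") correct: all tuples for one uid are processed by one task.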
  • How Storm works? Spout Example

    public class DemoSpout extends BaseRichSpout {
        private SpoutOutputCollector _collector;
        private MyFavoriteQueue<String> _queue;  // placeholder queue implementation

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            _collector = collector;
            _queue = new MyFavoriteQueue<String>();
        }

        @Override
        public void nextTuple() {
            String nextItem = _queue.poll();
            _collector.emit(new Values(nextItem));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("item"));
        }
    }
  • How Storm works? Bolt Example

    public class BoltA extends BaseRichBolt {
        private OutputCollector _collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            _collector = collector;
        }

        @Override
        public void execute(Tuple tuple) {
            Object obj = tuple.getValue(0);
            String capitalizedItem = capitalize((String) obj);
            _collector.emit(tuple, new Values(capitalizedItem));  // anchored to the input tuple
            _collector.ack(tuple);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("item"));
        }
    }
  • Storm UI
  • Read More about Storm
    • Storm: http://storm-project.net/
    • Example Storm Topologies: https://github.com/nathanmarz/storm-starter
    • Implementing Real-Time Trending Topics With a Distributed Rolling Count Algorithm: http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
    • Understanding the Internal Message Buffers of Storm: http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
    • Understanding the Parallelism of a Storm Topology: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
  • Storm in our company ferret-go.com
  • Ferret go GmbH Trend & Media Analytics ferret-go.com
  • Our data flow (simplified): Twitter, Facebook, Google+, blogs, comments, online media, offline media, and reviews feed into processing, classification, and analyzing stages, each backed by ElasticSearch.
  • Problem overview
    • we have a number of streams that spout items
    • for every item we do different calculations
    • at the end of the calculations we save the item into storage(s): ElasticSearch, PostgreSQL, etc.
    • if processing fails because of environment issues, we want to re-queue the item easily
    • some of our calculations can be done in parallel
  • Solution
    • Redis-based queues for spouting
    • 1-2 spouts per topology
    • 1 bulk bolt per worker for storage writing
    • Storm cluster with 2 nodes: 32 GB RAM, 4-core i7 CPU, Java 7, Ubuntu 12.04
    • ~20 items per second (could be increased)
    • 3 slots per worker, 198 tasks, 68 executors
  • Thank you! 30.09.2013 September BWB Meetup Andrii Gakhov