BWB Meetup: Storm - distributed realtime computation system

Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!

Published in: Technology
  • Transcript of "BWB Meetup: Storm - distributed realtime computation system"

    1. Storm: overview. Distributed and fault-tolerant realtime computation. Backend Web Berlin
    2. Storm (www.storm-project.net). Storm is a free and open source distributed realtime computation system. September BWB Meetup
    3. Use cases: distributed RPC, continuous computations, stream processing
    4. Overview
        • free and open source
        • integrates with any queuing and database system
        • distributed and scalable
        • fault-tolerant
        • supports multiple languages
    5. Scalable. Storm topologies are inherently parallel and run across a cluster of machines. Different parts of the topology can be scaled individually by tweaking their parallelism. The "rebalance" command of the "storm" command line client can adjust the parallelism of running topologies on the fly.
    6. Fault tolerant. When workers die, Storm will automatically restart them. If a node dies, the worker will be restarted on another node. The Storm daemons, Nimbus and the Supervisors, are designed to be stateless and fail-fast.
    7. Guarantees data processing. Storm guarantees every tuple will be fully processed. One of Storm's core mechanisms is the ability to track the lineage of a tuple as it makes its way through the topology in an extremely efficient way. Messages are only replayed when there are failures. Storm's basic abstractions provide an at-least-once processing guarantee, the same guarantee you get when using a queueing system.
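
The lineage tracking on this slide can be illustrated with a small sketch. Storm's acker keeps one 64-bit value per spout tuple: the XOR of every tuple id added to the tree and every tuple id acked. When the value returns to zero, the tree counts as fully processed. The class and method names below (`AckTrackerSketch`, `anchor`, `ack`) are hypothetical, and the code is plain Java for illustration, not Storm's implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative XOR-based tracker for tuple trees; names are hypothetical, not Storm API. */
final class AckTrackerSketch {
    // rootId -> XOR of every tuple id that was anchored but not yet acked.
    private final Map<Long, Long> ackVals = new HashMap<>();

    /** Records a new tuple in the tree of the given spout tuple (tupleId must be non-zero). */
    void anchor(long rootId, long tupleId) {
        ackVals.merge(rootId, tupleId, (a, b) -> a ^ b);
    }

    /** Acks one tuple; returns true when the whole tree has been acked. */
    boolean ack(long rootId, long tupleId) {
        long val = ackVals.merge(rootId, tupleId, (a, b) -> a ^ b);
        if (val == 0L) {
            ackVals.remove(rootId);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        AckTrackerSketch tracker = new AckTrackerSketch();
        tracker.anchor(1L, 0x5L);
        tracker.anchor(1L, 0x9L);
        System.out.println(tracker.ack(1L, 0x5L)); // one tuple still outstanding
        System.out.println(tracker.ack(1L, 0x9L)); // tree complete
    }
}
```

The point of the design is constant memory per spout tuple: the tracker never stores the tree itself, only one number, however many tuples the tree contains.
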
    8. Use with many languages. Storm was designed from the ground up to be usable with any programming language. Spouts and bolts can be defined in any language. Non-JVM spouts and bolts communicate with Storm through a JSON-based protocol over stdin/stdout. Adapters that implement this protocol exist for Ruby, Python, JavaScript, Perl, and PHP.
    9. How Storm works: Storm cluster. (Diagram: one Nimbus node, three Zookeeper nodes, five Supervisor nodes.)
    10. How Storm works: Basic concepts
        • Topology: a graph of computation. A topology runs forever, or until you kill it.
        • Stream: an unbounded sequence of tuples.
        • Spout: a source of streams.
        • Bolt: the place where calculations are done. Bolts can run functions, filter tuples, do streaming aggregations and joins, talk to databases, etc.
    11. How Storm works: Basic concepts
        • Worker process: executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more components (spouts or bolts) of this topology.
        • Executor (thread): a thread that is spawned by a worker process. It may run one or more tasks for the same component, and it always uses a single thread for all of its tasks.
        • Task: performs the actual data processing. Each spout or bolt that you implement in your code runs as a number of tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology.
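
The relation between executors and tasks comes down to simple arithmetic, sketched below under the assumption that tasks are spread over executors as evenly as possible. The class and method names are hypothetical. Which task ids land on which executor is an internal detail of Storm that this sketch does not model.

```java
import java.util.Arrays;

/** Count arithmetic only; names are hypothetical and this is not Storm's scheduler. */
final class TaskLayoutSketch {
    /** Spreads numTasks over numExecutors so that the counts differ by at most one. */
    static int[] tasksPerExecutor(int numTasks, int numExecutors) {
        if (numExecutors <= 0 || numTasks < numExecutors) {
            throw new IllegalArgumentException("need at least one task per executor");
        }
        int[] counts = new int[numExecutors];
        int base = numTasks / numExecutors;   // every executor gets at least this many
        int extra = numTasks % numExecutors;  // the first 'extra' executors get one more
        for (int i = 0; i < numExecutors; i++) {
            counts[i] = base + (i < extra ? 1 : 0);
        }
        return counts;
    }

    public static void main(String[] args) {
        // 3 tasks on 2 executors, then 6 tasks on 2 executors
        System.out.println(Arrays.toString(tasksPerExecutor(3, 2)));
        System.out.println(Arrays.toString(tasksPerExecutor(6, 2)));
    }
}
```

Because each executor is a single thread, an executor holding several tasks runs them one after another; adding tasks without adding executors does not add parallelism.
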
    12. How Storm works: Basic concepts. (Diagram: tasks per component. Spout: 2; BoltA: 3; BoltB: 2; BoltC: 6; BoltD: 3; BoltE: 2; BoltF: 1.)
    13. How Storm works: Topology example
        import backtype.storm.Config;
        import backtype.storm.StormSubmitter;
        import backtype.storm.topology.TopologyBuilder;
        import backtype.storm.tuple.Fields;

        class DemoTopology {
            public static void main(String[] args) throws Exception {
                Config conf = new Config();
                TopologyBuilder builder = new TopologyBuilder();
                // Output streams are listed in comments: Storm declares them in each
                // component's declareOutputFields(), not on the builder.
                // Spout: default stream (uid, item); stream "item_copy" (uid, item).
                builder.setSpout("Spout", new DemoSpout(), 2).setNumTasks(2);
                builder.setBolt("BoltA", new BoltA(), 2).setNumTasks(3)
                       .shuffleGrouping("Spout", "item_copy");
                // BoltB: default stream (uid, fromB).
                builder.setBolt("BoltB", new BoltB(), 2).setNumTasks(2)
                       .shuffleGrouping("Spout");
                // BoltC: default stream (uid, fromC).
                builder.setBolt("BoltC", new BoltC(), 2).setNumTasks(6)
                       .shuffleGrouping("BoltA");
                // BoltD: stream "forE" (uid, text); stream "forF" (uid, text, ne).
                builder.setBolt("BoltD", new BoltD(), 3).setNumTasks(3)
                       .shuffleGrouping("BoltC")
                       .fieldsGrouping("BoltC", new Fields("uid"))
                       .fieldsGrouping("BoltB", new Fields("uid"));
                builder.setBolt("BoltE", new BoltE(), 1).setNumTasks(2)
                       .shuffleGrouping("BoltD", "forE");
                builder.setBolt("BoltF", new BoltF(), 1).setNumTasks(1)
                       .shuffleGrouping("BoltD", "forF");
                StormSubmitter.submitTopology("demoTopology", conf, builder.createTopology());
            }
        }
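
The two groupings used in the topology example differ in how a tuple is routed to one of the consuming bolt's tasks. The sketch below is a plain-Java illustration with hypothetical names. It assumes that fields grouping behaves like "hash of the field value modulo the number of tasks" and approximates shuffle grouping with round-robin; Storm's actual hashing and shuffling are internal and may differ.

```java
import java.util.Objects;

/** Illustrative routing rules; names are hypothetical and this is not Storm's code. */
final class GroupingSketch {
    private int nextTask = 0;

    /** Fields grouping: equal field values always map to the same task index. */
    static int fieldsGroupingTask(Object fieldValue, int numTasks) {
        return Math.floorMod(Objects.hashCode(fieldValue), numTasks);
    }

    /** Shuffle grouping, approximated here by round-robin over the task indices. */
    int shuffleGroupingTask(int numTasks) {
        int task = nextTask;
        nextTask = (nextTask + 1) % numTasks;
        return task;
    }

    public static void main(String[] args) {
        // The same uid is routed to the same task every time.
        System.out.println(fieldsGroupingTask("uid-1", 3) == fieldsGroupingTask("uid-1", 3));
        GroupingSketch shuffle = new GroupingSketch();
        for (int i = 0; i < 4; i++) {
            System.out.println(shuffle.shuffleGroupingTask(3));
        }
    }
}
```

Fields grouping is the usual choice when a bolt keeps per-key state (here, per `uid`), since all tuples for one key reach the same task; shuffle grouping only balances load.
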
    14. How Storm works: Spout example
        public class DemoSpout extends BaseRichSpout {
            // …
            private SpoutOutputCollector _collector;
            private MyFavoritQueue<String> _queue;

            @Override
            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                _collector = collector;
                _queue = new MyFavoritQueue<String>();
            }

            @Override
            public void nextTuple() {
                String nextItem = _queue.poll();
                if (nextItem != null) {
                    _collector.emit(new Values(nextItem));
                }
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("item"));
            }
        }
    15. How Storm works: Bolt example
        public class BoltA extends BaseRichBolt {
            private OutputCollector _collector;

            @Override
            public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
                _collector = collector;
            }

            @Override
            public void execute(Tuple tuple) {
                Object obj = tuple.getValue(0);
                // capitalize() is a helper that the slide does not show
                String capitalizedItem = capitalize((String) obj);
                _collector.emit(tuple, new Values(capitalizedItem)); // anchored to the input tuple
                _collector.ack(tuple);
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("item"));
            }
        }
    16. Storm UI
    17. Read more about Storm
        • Storm: http://storm-project.net/
        • Example Storm Topologies: https://github.com/nathanmarz/storm-starter
        • Implementing Real-Time Trending Topics With a Distributed Rolling Count Algorithm: http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
        • Understanding the Internal Message Buffers of Storm: http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
        • Understanding the Parallelism of a Storm Topology: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
    18. Storm in our company (ferret-go.com)
    19. Ferret go GmbH: Trend & Media Analytics (ferret-go.com)
    20. Our data flow (simplified). Diagram labels: sources (Twitter, Facebook, Google+, blogs, comments, online media, offline media, reviews); stages (processing, classification, analyzing); storage (ElasticSearch, shown three times)
    21. Problem overview
        • we have a number of streams that spout items
        • for every item we do different calculations
        • at the end of the calculations we save the item into storage(s): ElasticSearch, PostgreSQL, etc.
        • if processing fails because of environment issues, we want to re-queue the item easily
        • some of our calculations can be done in parallel
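
The re-queue requirement maps onto the ack/fail callbacks that Storm invokes on a spout for each message id it emitted. A minimal sketch of that bookkeeping follows. It uses an in-memory deque as a stand-in for the real queue, and all names (`RequeueSketch`, `emitNext`, `queued`, `inFlight`) are hypothetical. It shows the idea only and is not the setup described in the talk.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for a queue-backed spout; names are hypothetical. */
final class RequeueSketch {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Map<Long, String> pending = new HashMap<>(); // emitted, not yet acked
    private long nextId = 1L;

    void offer(String item) {
        queue.addLast(item);
    }

    /** Takes the next item and returns its message id, or null if the queue is empty. */
    Long emitNext() {
        String item = queue.pollFirst();
        if (item == null) {
            return null;
        }
        long id = nextId++;
        pending.put(id, item);
        return id;
    }

    /** Processing succeeded: forget the item. */
    void ack(long id) {
        pending.remove(id);
    }

    /** Processing failed: put the item back at the end of the queue. */
    void fail(long id) {
        String item = pending.remove(id);
        if (item != null) {
            queue.addLast(item);
        }
    }

    int queued() {
        return queue.size();
    }

    int inFlight() {
        return pending.size();
    }
}
```

A failed item gets a fresh message id when it is emitted again, so a late ack for the old id cannot remove the retried item.
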
    22. Solution
        • Redis-based queues for spouting
        • 1-2 spouts per topology
        • 1 bulk bolt for storage writing per worker
        • Storm cluster with 2 nodes: 32 GB, 4-core i7 CPU, Java 7, Ubuntu 12.04
        • ~20 items per second (could be increased)
        • 3 slots per worker, 198 tasks, 68 executors
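
One way to picture the "bulk bolt" is as a buffer that collects items and hands them to storage in batches. The sketch below is an assumption about the general pattern, not the code used in the talk: the class name, the batch size, and the `Consumer` sink that stands in for the storage client are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Batching buffer; the sink stands in for a bulk write to storage. Names are hypothetical. */
final class BulkBufferSketch {
    private final int batchSize;
    private final Consumer<List<String>> sink;
    private final List<String> buffer = new ArrayList<>();

    BulkBufferSketch(int batchSize, Consumer<List<String>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers one item and writes a batch as soon as batchSize items are collected. */
    void add(String item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Writes whatever is buffered; does nothing when the buffer is empty. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // pass a copy so the sink can keep it
        buffer.clear();
    }
}
```

In a real bolt the input tuples would be acked only after their batch has been written, so that a failed write leads to a replay rather than to lost items.
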
    23. Thank you! 30.09.2013, September BWB Meetup, Andrii Gakhov