Nathan Marz
Twitter
Distributed and fault-tolerant realtime computation
Storm
Basic info
• Open sourced September 19th
• Implementation is 12,000 lines of code
• Used by over 25 companies
• >2280 watc...
Before Storm
Queues Workers
Example
(simplified)
Example
Workers schemify tweets
and append to Hadoop
Example
Workers update statistics on URLs by
incrementing counters in Cassandra
Problems
• Scaling is painful
• Poor fault-tolerance
• Coding is tedious
What we want
• Guaranteed data processing
• Horizontal scalability
• Fault-tolerance
• No intermediate message brokers!
• ...
Storm
Guaranteed data processing
Horizontal scalability
Fault-tolerance
No intermediate message brokers!
Higher level abst...
Stream
processing
Continuous
computation
Distributed
RPC
Use cases
Storm Cluster
Storm Cluster
Master node (similar to Hadoop JobTracker)
Storm Cluster
Used for cluster coordination
Storm Cluster
Run worker processes
Starting a topology
Killing a topology
Concepts
• Streams
• Spouts
• Bolts
• Topologies
Streams
Unbounded sequence of tuples
Tuple Tuple Tuple Tuple Tuple Tuple Tuple
Spouts
Source of streams
Spout examples
• Read from Kestrel queue
• Read from Twitter streaming API
Bolts
Processes input streams and produces new streams
Bolts
• Functions
• Filters
• Aggregation
• Joins
• Talk to databases
Topology
Network of spouts and bolts
Tasks
Spouts and bolts execute as
many tasks across the cluster
Stream grouping
When a tuple is emitted, which task does it go to?
Stream grouping
• Shuffle grouping: pick a random task
• Fields grouping: mod hashing on a
subset of tuple fields
• All grou...
Topology
shuffle
[“url”]
shuffle
shuffle
[“id1”,“id2”]
all
Streaming word count
TopologyBuilder is used to construct topologies in Java
Streaming word count
Define a spout in the topology with parallelism of 5 tasks
Streaming word count
Split sentences into words with parallelism of 8 tasks
Consumer decides what data it receives and how it gets grouped
Streaming word count
Split sentences into words with parall...
Streaming word count
Create a word count stream
Streaming word count
splitsentence.py
Streaming word count
Streaming word count
Submitting topology to a cluster
Streaming word count
Running topology in local mode
Demo
Distributed RPC
Data flow for Distributed RPC
DRPC Example
Computing “reach” of a URL on the fly
Reach
Reach is the number of unique people
exposed to a URL on Twitter
Computing reach
URL
Tweeter
Tweeter
Tweeter
Follower
Follower
Follower
Follower
Follower
Follower
Distinct
follower
Distin...
Reach topology
Reach topology
Reach topology
Reach topology
Keep set of followers for
each request id in memory
Reach topology
Update followers set when
receive a new follower
Reach topology
Emit partial count after
receiving all followers for a
request id
Demo
Guaranteeing message
processing
“Tuple tree”
Guaranteeing message
processing
• A spout tuple is not fully processed until all
tuples in the tree have been completed
Guaranteeing message
processing
• If the tuple tree is not completed within a
specified timeout, the spout tuple is replayed
Guaranteeing message
processing
Reliability API
Guaranteeing message
processing
“Anchoring” creates a new edge in the tuple tree
Guaranteeing message
processing
Marks a single node in the tree as complete
Guaranteeing message
processing
• Storm tracks tuple trees for you in an
extremely efficient way
Transactional topologies
How do you do idempotent counting with an
at least once delivery guarantee?
Won’t you overcount?
Transactional topologies
Transactional topologies solve this problem
Transactional topologies
Built completely on top of Storm’s primitives
of streams, spouts, and bolts
Transactional topologies
Batch 1 Batch 2 Batch 3
Transactional topologies
Process small batches of tuples
Batch 1 Batch 2 Batch 3
Transactional topologies
If a batch fails, replay the whole batch
Batch 1 Batch 2 Batch 3
Transactional topologies
Once a batch is completed, commit the batch
Batch 1 Batch 2 Batch 3
Transactional topologies
Bolts can optionally implement “commit” method
Commit 1
Transactional topologies
Commits are ordered. If there’s a failure during
commit, the whole batch + commit is ret...
Example
Example
New instance of this object
for every transaction attempt
Example
Aggregate the count for
this batch
Example
Only update database if
transaction ids differ
Example
This enables idempotency since
commits are ordered
Example
(Credit goes to Kafka guys
for figuring out this trick)
Transactional topologies
Multiple batches can be processed in parallel,
but commits are guaranteed to be ordered
• Will be available in next version of Storm
(0.7.0)
• Requires a source queue that can replay
identical batches of messag...
Storm UI
Storm on EC2
https://github.com/nathanmarz/storm-deploy
One-click deploy tool
Starter code
https://github.com/nathanmarz/storm-starter
Example topologies
Documentation
Ecosystem
• Scala, JRuby, and Clojure DSL’s
• Kestrel,AMQP, JMS, and other spout adapters
• Serializers
• Multilang adapte...
Questions?
http://github.com/nathanmarz/storm
Future work
• State spout
• Storm on Mesos
• “Swapping”
• Auto-scaling
• Higher level abstractions
Upcoming SlideShare
Loading in...5
×

Jan 2012 HUG: Storm

2,665

Published on

Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it’s fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language. Storm was open-sourced by Twitter in September of 2011 and has since been adopted by many companies around the world.

Storm has a wide range of use cases, from stream processing to continuous computation to distributed RPC. In this talk I'll introduce Storm and show how easy it is to use for realtime computation.

Published in: Technology
0 Comments
9 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
2,665
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
0
Comments
0
Likes
9
Embeds 0
No embeds

No notes for slide

Jan 2012 HUG: Storm

  1. 1. Nathan Marz Twitter Distributed and fault-tolerant realtime computation Storm
  2. 2. Basic info • Open sourced September 19th • Implementation is 12,000 lines of code • Used by over 25 companies • >2280 watchers on Github (most watched JVM project) • Very active mailing list • >1700 messages • >520 members
  3. 3. Before Storm Queues Workers
  4. 4. Example (simplified)
  5. 5. Example Workers schemify tweets and append to Hadoop
  6. 6. Example Workers update statistics on URLs by incrementing counters in Cassandra
  7. 7. Problems • Scaling is painful • Poor fault-tolerance • Coding is tedious
  8. 8. What we want • Guaranteed data processing • Horizontal scalability • Fault-tolerance • No intermediate message brokers! • Higher level abstraction than message passing • “Just works”
  9. 9. Storm Guaranteed data processing Horizontal scalability Fault-tolerance No intermediate message brokers! Higher level abstraction than message passing “Just works”
  10. 10. Stream processing Continuous computation Distributed RPC Use cases
  11. 11. Storm Cluster
  12. 12. Storm Cluster Master node (similar to Hadoop JobTracker)
  13. 13. Storm Cluster Used for cluster coordination
  14. 14. Storm Cluster Run worker processes
  15. 15. Starting a topology
  16. 16. Killing a topology
  17. 17. Concepts • Streams • Spouts • Bolts • Topologies
  18. 18. Streams Unbounded sequence of tuples Tuple Tuple Tuple Tuple Tuple Tuple Tuple
  19. 19. Spouts Source of streams
  20. 20. Spout examples • Read from Kestrel queue • Read from Twitter streaming API
  21. 21. Bolts Processes input streams and produces new streams
  22. 22. Bolts • Functions • Filters • Aggregation • Joins • Talk to databases
  23. 23. Topology Network of spouts and bolts
  24. 24. Tasks Spouts and bolts execute as many tasks across the cluster
  25. 25. Stream grouping When a tuple is emitted, which task does it go to?
  26. 26. Stream grouping • Shuffle grouping: pick a random task • Fields grouping: mod hashing on a subset of tuple fields • All grouping: send to all tasks • Global grouping: pick task with lowest id
  27. 27. Topology shuffle [“url”] shuffle shuffle [“id1”,“id2”] all
  28. 28. Streaming word count TopologyBuilder is used to construct topologies in Java
  29. 29. Streaming word count Define a spout in the topology with parallelism of 5 tasks
  30. 30. Streaming word count Split sentences into words with parallelism of 8 tasks
  31. 31. Consumer decides what data it receives and how it gets grouped Streaming word count Split sentences into words with parallelism of 8 tasks
  32. 32. Streaming word count Create a word count stream
  33. 33. Streaming word count splitsentence.py
  34. 34. Streaming word count
  35. 35. Streaming word count Submitting topology to a cluster
  36. 36. Streaming word count Running topology in local mode
  37. 37. Demo
  38. 38. Distributed RPC Data flow for Distributed RPC
  39. 39. DRPC Example Computing “reach” of a URL on the fly
  40. 40. Reach Reach is the number of unique people exposed to a URL on Twitter
  41. 41. Computing reach URL Tweeter Tweeter Tweeter Follower Follower Follower Follower Follower Follower Distinct follower Distinct follower Distinct follower Count Reach
  42. 42. Reach topology
  43. 43. Reach topology
  44. 44. Reach topology
  45. 45. Reach topology Keep set of followers for each request id in memory
  46. 46. Reach topology Update followers set when receive a new follower
  47. 47. Reach topology Emit partial count after receiving all followers for a request id
  48. 48. Demo
  49. 49. Guaranteeing message processing “Tuple tree”
  50. 50. Guaranteeing message processing • A spout tuple is not fully processed until all tuples in the tree have been completed
  51. 51. Guaranteeing message processing • If the tuple tree is not completed within a specified timeout, the spout tuple is replayed
  52. 52. Guaranteeing message processing Reliability API
  53. 53. Guaranteeing message processing “Anchoring” creates a new edge in the tuple tree
  54. 54. Guaranteeing message processing Marks a single node in the tree as complete
  55. 55. Guaranteeing message processing • Storm tracks tuple trees for you in an extremely efficient way
  56. 56. Transactional topologies How do you do idempotent counting with an at least once delivery guarantee?
  57. 57. Won’t you overcount? Transactional topologies
  58. 58. Transactional topologies solve this problem Transactional topologies
  59. 59. Built completely on top of Storm’s primitives of streams, spouts, and bolts Transactional topologies
  60. 60. Batch 1 Batch 2 Batch 3 Transactional topologies Process small batches of tuples
  61. 61. Batch 1 Batch 2 Batch 3 Transactional topologies If a batch fails, replay the whole batch
  62. 62. Batch 1 Batch 2 Batch 3 Transactional topologies Once a batch is completed, commit the batch
  63. 63. Batch 1 Batch 2 Batch 3 Transactional topologies Bolts can optionally implement “commit” method
  64. 64. Commit 1 Transactional topologies Commits are ordered. If there’s a failure during commit, the whole batch + commit is retried Commit 1 Commit 2 Commit 3 Commit 4 Commit 4
  65. 65. Example
  66. 66. Example New instance of this object for every transaction attempt
  67. 67. Example Aggregate the count for this batch
  68. 68. Example Only update database if transaction ids differ
  69. 69. Example This enables idempotency since commits are ordered
  70. 70. Example (Credit goes to Kafka guys for figuring out this trick)
  71. 71. Transactional topologies Multiple batches can be processed in parallel, but commits are guaranteed to be ordered
  72. 72. • Will be available in next version of Storm (0.7.0) • Requires a source queue that can replay identical batches of messages • Aiming for first TransactionalSpout implementation to use Kafka Transactional topologies
  73. 73. Storm UI
  74. 74. Storm on EC2 https://github.com/nathanmarz/storm-deploy One-click deploy tool
  75. 75. Starter code https://github.com/nathanmarz/storm-starter Example topologies
  76. 76. Documentation
  77. 77. Ecosystem • Scala, JRuby, and Clojure DSL’s • Kestrel,AMQP, JMS, and other spout adapters • Serializers • Multilang adapters • Cassandra, MongoDB integration
  78. 78. Questions? http://github.com/nathanmarz/storm
  79. 79. Future work • State spout • Storm on Mesos • “Swapping” • Auto-scaling • Higher level abstractions

×