Jan 2012 HUG: Storm

Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it’s fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language. Storm was open-sourced by Twitter in September of 2011 and has since been adopted by many companies around the world.

Storm has a wide range of use cases, from stream processing to continuous computation to distributed RPC. In this talk I'll introduce Storm and show how easy it is to use for realtime computation.


Transcript

  • 1. Nathan Marz Twitter Distributed and fault-tolerant realtime computation Storm
  • 2. Basic info • Open sourced September 19th • Implementation is 12,000 lines of code • Used by over 25 companies • >2280 watchers on GitHub (most watched JVM project) • Very active mailing list • >1700 messages • >520 members
  • 3. Before Storm Queues Workers
  • 4. Example (simplified)
  • 5. Example Workers schemify tweets and append to Hadoop
  • 6. Example Workers update statistics on URLs by incrementing counters in Cassandra
  • 7. Problems • Scaling is painful • Poor fault-tolerance • Coding is tedious
  • 8. What we want • Guaranteed data processing • Horizontal scalability • Fault-tolerance • No intermediate message brokers! • Higher level abstraction than message passing • “Just works”
  • 9. Storm Guaranteed data processing Horizontal scalability Fault-tolerance No intermediate message brokers! Higher level abstraction than message passing “Just works”
  • 10. Stream processing Continuous computation Distributed RPC Use cases
  • 11. Storm Cluster
  • 12. Storm Cluster Master node (similar to Hadoop JobTracker)
  • 13. Storm Cluster Used for cluster coordination
  • 14. Storm Cluster Run worker processes
  • 15. Starting a topology
  • 16. Killing a topology
  • 17. Concepts • Streams • Spouts • Bolts • Topologies
  • 18. Streams Unbounded sequence of tuples Tuple Tuple Tuple Tuple Tuple Tuple Tuple
  • 19. Spouts Source of streams
  • 20. Spout examples • Read from Kestrel queue • Read from Twitter streaming API
  • 21. Bolts Process input streams and produce new streams
  • 22. Bolts • Functions • Filters • Aggregation • Joins • Talk to databases
  • 23. Topology Network of spouts and bolts
  • 24. Tasks Spouts and bolts execute as many tasks across the cluster
  • 25. Stream grouping When a tuple is emitted, which task does it go to?
  • 26. Stream grouping • Shuffle grouping: pick a random task • Fields grouping: mod hashing on a subset of tuple fields • All grouping: send to all tasks • Global grouping: pick task with lowest id
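The four groupings above can be sketched as functions that map an emitted tuple to the consumer task(s) it is sent to. This is an illustrative sketch, not Storm's actual implementation; the tuple and task representations are simplified stand-ins.

```python
import random

# Illustrative sketch of Storm's stream groupings: each grouping maps
# an emitted tuple to one or more consumer task ids.

def shuffle_grouping(tup, tasks):
    """Shuffle grouping: pick a random task."""
    return [random.choice(tasks)]

def fields_grouping(tup, tasks, fields):
    """Fields grouping: mod-hash on a subset of tuple fields, so the
    same field values always route to the same task."""
    key = tuple(tup[f] for f in fields)
    return [tasks[hash(key) % len(tasks)]]

def all_grouping(tup, tasks):
    """All grouping: send the tuple to every task."""
    return list(tasks)

def global_grouping(tup, tasks):
    """Global grouping: always pick the task with the lowest id."""
    return [min(tasks)]
```

The fields grouping is what makes stateful bolts like the word counter in the next slides correct: every tuple with the same word lands on the same task, so that task's counter sees all occurrences.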
  • 27. Topology (diagram: edges between spouts and bolts labeled with their groupings — shuffle, fields [“url”], fields [“id1”, “id2”], all)
  • 28. Streaming word count TopologyBuilder is used to construct topologies in Java
  • 29. Streaming word count Define a spout in the topology with parallelism of 5 tasks
  • 30. Streaming word count Split sentences into words with parallelism of 8 tasks
  • 31. Streaming word count Consumer decides what data it receives and how it gets grouped
  • 32. Streaming word count Create a word count stream
  • 33. Streaming word count splitsentence.py
  • 34. Streaming word count
  • 35. Streaming word count Submitting topology to a cluster
  • 36. Streaming word count Running topology in local mode
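The word-count slides refer to Java `TopologyBuilder` code that is not reproduced in this transcript. As a stand-in, here is a single-process Python sketch of the same dataflow: a split bolt that emits one tuple per word, and a count bolt that keeps running totals (which a fields grouping on “word” would partition correctly across tasks). All names and data are illustrative.

```python
from collections import defaultdict

def split_sentence(sentence):
    """Split bolt: emit one tuple per word."""
    for word in sentence.split():
        yield word

class WordCount:
    """Count bolt: keeps running totals. In the real topology a fields
    grouping on "word" ensures each word always reaches the same task."""
    def __init__(self):
        self.counts = defaultdict(int)

    def execute(self, word):
        self.counts[word] += 1
        return (word, self.counts[word])

counter = WordCount()
for sentence in ["the cow jumped over the moon",
                 "the man went to the store"]:
    for word in split_sentence(sentence):
        counter.execute(word)
```

In the real deck, the split bolt is implemented as `splitsentence.py` via Storm's multilang protocol, which is what lets topologies mix languages.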
  • 37. Demo
  • 38. Distributed RPC Data flow for Distributed RPC
  • 39. DRPC Example Computing “reach” of a URL on the fly
  • 40. Reach Reach is the number of unique people exposed to a URL on Twitter
  • 41. Computing reach (diagram: URL → tweeters → followers → distinct followers → count → reach)
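The reach pipeline on this slide can be sketched end to end: find everyone who tweeted the URL, gather their followers, deduplicate, and count. The lookup tables below are hypothetical stand-ins for the database-backed bolts in the real topology.

```python
# reach(url) = number of distinct followers across everyone who
# tweeted the url. TWEETERS and FOLLOWERS are hypothetical data.

TWEETERS = {"http://example.com": ["alice", "bob"]}   # url -> tweeters
FOLLOWERS = {
    "alice": ["carol", "dave"],
    "bob":   ["dave", "erin", "frank"],
}                                                     # user -> followers

def reach(url):
    distinct_followers = set()
    for tweeter in TWEETERS.get(url, []):
        # the real topology partitions followers across many tasks,
        # each holding a partial set, then sums the partial counts
        distinct_followers.update(FOLLOWERS.get(tweeter, []))
    return len(distinct_followers)
```

Note that “dave” follows both tweeters but is counted once — the deduplication step is why reach cannot be computed by simply summing follower counts.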
  • 42. Reach topology
  • 43. Reach topology
  • 44. Reach topology
  • 45. Reach topology Keep set of followers for each request id in memory
  • 46. Reach topology Update followers set when receive a new follower
  • 47. Reach topology Emit partial count after receiving all followers for a request id
  • 48. Demo
  • 49. Guaranteeing message processing “Tuple tree”
  • 50. Guaranteeing message processing • A spout tuple is not fully processed until all tuples in the tree have been completed
  • 51. Guaranteeing message processing • If the tuple tree is not completed within a specified timeout, the spout tuple is replayed
  • 52. Guaranteeing message processing Reliability API
  • 53. Guaranteeing message processing “Anchoring” creates a new edge in the tuple tree
  • 54. Guaranteeing message processing Marks a single node in the tree as complete
  • 55. Guaranteeing message processing • Storm tracks tuple trees for you in an extremely efficient way
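The “extremely efficient” tracking comes from an acker trick described in Storm's documentation: every tuple gets a random 64-bit id, and an acker task XORs each id into a running value twice — once when the edge is anchored and once when the tuple is acked. The value returns to zero exactly when the whole tree is complete, so tracking takes constant space per spout tuple regardless of tree size. A minimal sketch:

```python
import random

class Acker:
    """Constant-space tuple-tree tracking via XOR. Each tuple id is
    XORed in once on anchor and once on ack; the running value is zero
    iff every anchored tuple has been acked."""
    def __init__(self):
        self.val = 0

    def anchor(self, tuple_id):
        self.val ^= tuple_id   # new edge in the tuple tree

    def ack(self, tuple_id):
        self.val ^= tuple_id   # node marked complete

    def tree_complete(self):
        return self.val == 0
```

(There is a vanishing 1-in-2^64 chance of a false zero from colliding ids, which Storm accepts as negligible.)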
  • 56. Transactional topologies How do you do idempotent counting with an at least once delivery guarantee?
  • 57. Won’t you overcount? Transactional topologies
  • 58. Transactional topologies solve this problem Transactional topologies
  • 59. Built completely on top of Storm’s primitives of streams, spouts, and bolts Transactional topologies
  • 60. Transactional topologies Process small batches of tuples
  • 61. Transactional topologies If a batch fails, replay the whole batch
  • 62. Transactional topologies Once a batch is completed, commit the batch
  • 63. Transactional topologies Bolts can optionally implement “commit” method
  • 64. Transactional topologies Commits are ordered. If there’s a failure during commit, the whole batch + commit is retried
  • 65. Example
  • 66. Example New instance of this object for every transaction attempt
  • 67. Example Aggregate the count for this batch
  • 68. Example Only update database if transaction ids differ
  • 69. Example This enables idempotency since commits are ordered
  • 70. Example (Credit goes to Kafka guys for figuring out this trick)
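The idempotency trick from slides 66–70 can be sketched directly: store the transaction id alongside each count, and only apply an update when the stored txid differs from the current one. Because commits are ordered, seeing the same txid again means this batch was already applied, so a replayed commit is a no-op. The in-memory dict is a hypothetical stand-in for the real database.

```python
# db row stores (count, txid) together; stand-in for a real database
db = {}

def commit_count(word, batch_count, txid):
    """Idempotent commit: only update the row if the stored txid
    differs from this batch's txid. Ordered commits guarantee that a
    matching txid means the batch was already applied."""
    count, last_txid = db.get(word, (0, None))
    if txid != last_txid:
        db[word] = (count + batch_count, txid)

commit_count("storm", 5, txid=1)
commit_count("storm", 5, txid=1)   # replayed commit: no double count
commit_count("storm", 3, txid=2)
```

After the three calls above, "storm" has count 8, not 13 — the replay of batch 1 was skipped.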
  • 71. Transactional topologies Multiple batches can be processed in parallel, but commits are guaranteed to be ordered
  • 72. • Will be available in next version of Storm (0.7.0) • Requires a source queue that can replay identical batches of messages • Aiming for first TransactionalSpout implementation to use Kafka Transactional topologies
  • 73. Storm UI
  • 74. Storm on EC2 https://github.com/nathanmarz/storm-deploy One-click deploy tool
  • 75. Starter code https://github.com/nathanmarz/storm-starter Example topologies
  • 76. Documentation
  • 77. Ecosystem • Scala, JRuby, and Clojure DSLs • Kestrel, AMQP, JMS, and other spout adapters • Serializers • Multilang adapters • Cassandra, MongoDB integration
  • 78. Questions? http://github.com/nathanmarz/storm
  • 79. Future work • State spout • Storm on Mesos • “Swapping” • Auto-scaling • Higher level abstractions
