Storm: distributed and fault-tolerant realtime computation

  1. Storm: Distributed and fault-tolerant realtime computation (Nathan Marz, Twitter)
  2. Storm at Twitter: Twitter Web Analytics
  3. Before Storm: Queues and Workers
  4. Example (simplified)
  5. Example: Workers schemify tweets and append them to Hadoop
  6. Example: Workers update statistics on URLs by incrementing counters in Cassandra
  7. Example: Distribute tweets randomly across multiple queues
  8. Example: Workers share the load of schemifying tweets
  9. Example: We want all updates for the same URL to go to the same worker
  10. Message locality • Needed because there are no transactions in Cassandra (and there were no atomic increments at the time) • Enables more effective batching of updates
  11. Implementing message locality • Have a queue for each consuming worker • Choose the queue for a URL using consistent hashing
  12. Example: Workers choose which queue to enqueue to using a hash/mod of the URL
  13. Example: All updates for the same URL are guaranteed to go to the same worker
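To make slides 11-13 concrete, here is a minimal Java sketch of the hash/mod queue choice from slide 12 (the class and field names are illustrative, not from the deck). Note that plain hash/mod is the simple form of the idea; because the modulus changes whenever the worker count changes, it is exactly what makes adding a worker painful in the next two slides, which is why slide 11 suggests consistent hashing instead.

```java
import java.util.List;

// Sketch of the pre-Storm approach: pick the queue for a URL by hash/mod
// so every update for the same URL lands on the same consuming worker.
public class QueueChooser {
    private final List<String> queues; // one queue per consuming worker

    public QueueChooser(List<String> queues) {
        this.queues = queues;
    }

    public String queueFor(String url) {
        // Math.abs(Integer.MIN_VALUE) is still negative, so mask the
        // sign bit instead of calling Math.abs on the hash code.
        int bucket = (url.hashCode() & Integer.MAX_VALUE) % queues.size();
        return queues.get(bucket);
    }
}
```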
  14. Adding a worker
  15. Adding a worker: deploy, then reconfigure/redeploy
  16. Problems • Scaling is painful • Poor fault-tolerance • Coding is tedious
  17. What we want • Guaranteed data processing • Horizontal scalability • Fault-tolerance • No intermediate message brokers! • Higher-level abstraction than message passing • "Just works"
  18. Storm • Guaranteed data processing • Horizontal scalability • Fault-tolerance • No intermediate message brokers! • Higher-level abstraction than message passing • "Just works"
  19. Use cases: stream processing, distributed RPC, continuous computation
  20. Storm Cluster
  21. Storm Cluster: Master node (similar to the Hadoop JobTracker)
  22. Storm Cluster: ZooKeeper, used for cluster coordination
  23. Storm Cluster: Worker nodes run the worker processes
  24. Starting a topology
  25. Killing a topology
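Slides 24 and 25 showed these operations on screen. For reference, both are driven by the `storm` command-line client: `storm jar path/to/topology.jar com.example.MainClass args...` starts a topology (the main class submits it to the cluster), and `storm kill topology-name` shuts one down, waiting out the topology's message timeout so in-flight tuples can finish processing.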
  26. Concepts • Streams • Spouts • Bolts • Topologies
  27. Streams: An unbounded sequence of tuples
  28. Spouts: Sources of streams
  29. Spout examples • Read from a Kestrel queue • Read from the Twitter streaming API
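To make the spout concept concrete, here is a minimal sketch against the Storm API of this era (backtype.storm.*). The class name and the canned sentence are illustrative; a real Kestrel or streaming-API spout would poll its source inside nextTuple().

```java
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Minimal spout sketch: emits one-field tuples containing a sentence.
public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // In a real spout: pull the next message from Kestrel or the
        // streaming API. Passing a message id as a second argument to
        // emit() enables the reliability tracking shown later in the deck.
        collector.emit(new Values("the cow jumped over the moon"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}
```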
  30. Bolts: Process input streams and produce new streams
  31. Bolts • Functions • Filters • Aggregation • Joins • Talk to databases
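And a matching minimal bolt sketch: a "function"-style bolt that splits sentences into words. BaseBasicBolt anchors and acks input tuples automatically; the explicit reliability API appears later in the deck.

```java
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Minimal bolt sketch: consumes "sentence" tuples, emits one "word" tuple
// per word. Acking is handled by BaseBasicBolt.
public class SplitSentence extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        for (String word : tuple.getString(0).split(" ")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```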
  32. Topology: A network of spouts and bolts
  33. Tasks: Spouts and bolts execute as many tasks across the cluster
  34. Stream grouping: When a tuple is emitted, which task does it go to?
  35. Stream grouping • Shuffle grouping: pick a random task • Fields grouping: consistent hashing on a subset of tuple fields • All grouping: send to all tasks • Global grouping: pick the task with the lowest id
  36. Topology: example wiring whose edges are labeled shuffle, ["id1", "id2"], shuffle, ["url"], shuffle, and all
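Groupings are declared on the consuming bolt when the topology is wired. A fragment-only sketch of the four groupings from slide 35 (it assumes a TopologyBuilder named `builder` and hypothetical bolt classes; a full runnable topology follows the word-count slides below):

```java
// Fragment: one line per grouping; "producer" is the upstream component id.
builder.setBolt("a", new BoltA(), 8).shuffleGrouping("producer");                   // random task
builder.setBolt("b", new BoltB(), 8).fieldsGrouping("producer", new Fields("url")); // hash on fields
builder.setBolt("c", new BoltC(), 4).allGrouping("producer");                       // copy to every task
builder.setBolt("d", new BoltD(), 1).globalGrouping("producer");                    // lowest-id task only
```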
  37. Streaming word count: TopologyBuilder is used to construct topologies in Java
  38. Streaming word count: Define a spout in the topology with a parallelism of 5 tasks
  39. Streaming word count: Split sentences into words with a parallelism of 8 tasks
  40. Streaming word count: The consumer decides what data it receives and how it gets grouped
  41. Streaming word count: Create a word count stream
  42. Streaming word count: splitsentence.py
  43. Streaming word count
  44. Streaming word count: Submitting the topology to a cluster
  45. Streaming word count: Running the topology in local mode
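Slides 37-45 step through code screenshots that do not survive the transcript. Below is a consolidated sketch of the topology they describe, reusing the SentenceSpout and SplitSentence sketches above. In the actual deck the split bolt was splitsentence.py, run through Storm's multilang ShellBolt mechanism; a pure-Java equivalent is used here.

```java
import java.util.HashMap;
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class WordCountTopology {
    // Rolling word count, kept in task-local memory.
    public static class WordCount extends BaseBasicBolt {
        Map<String, Integer> counts = new HashMap<String, Integer>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            Integer count = counts.get(word);
            if (count == null) count = 0;
            counts.put(word, ++count);
            collector.emit(new Values(word, count));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Spout with a parallelism of 5 tasks (slide 38)
        builder.setSpout("sentences", new SentenceSpout(), 5);
        // Split bolt with a parallelism of 8 tasks, shuffle-grouped (slide 39)
        builder.setBolt("split", new SplitSentence(), 8)
               .shuffleGrouping("sentences");
        // Fields grouping on "word" so the same word always reaches the
        // same counting task (slide 41)
        builder.setBolt("count", new WordCount(), 12)
               .fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        if (args.length > 0) {
            // Submitting to a cluster (slide 44)
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            // Local mode: the whole cluster simulated in-process (slide 45)
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", conf, builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }
}
```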
  46. Demo
  47. Traditional data processing
  48. Traditional data processing: Intense processing (Hadoop, databases, etc.)
  49. Traditional data processing: Light processing on a single machine to resolve queries
  50. Distributed RPC: Distributed RPC lets you do intense processing at query time
  51. Game changer
  52. Distributed RPC: Data flow for Distributed RPC
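From the client's perspective, the data flow on this slide collapses into a plain blocking call. A minimal sketch of invoking a DRPC function (the hostname is illustrative; 3772 is Storm's default DRPC port):

```java
import backtype.storm.utils.DRPCClient;

// Sketch: call the "reach" DRPC function defined by a running topology.
public class ReachClient {
    public static void main(String[] args) throws Exception {
        DRPCClient client = new DRPCClient("drpc.example.com", 3772);
        // Blocks until the topology has computed the result for this URL
        String reach = client.execute("reach", "http://example.com/page");
        System.out.println("Reach: " + reach);
    }
}
```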
  53. DRPC example: Computing the "reach" of a URL on the fly
  54. Reach: The number of unique people exposed to a URL on Twitter
  55. Computing reach: URL → tweeters of the URL → followers of each tweeter → distinct followers → count → reach
  56. Reach topology
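The reach topology slide is a screenshot in the original. A fragment-only sketch of how such a linear DRPC pipeline is wired with LinearDRPCTopologyBuilder, following the stages on slide 55; GetTweeters, GetFollowers, PartialUniquer, and CountAggregator stand for the lookup and aggregation bolts, which are named but not implemented here.

```java
import backtype.storm.drpc.LinearDRPCTopologyBuilder;
import backtype.storm.tuple.Fields;

// Fragment: wiring only; the four bolt classes are assumed to exist.
LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("reach");
builder.addBolt(new GetTweeters(), 3);                 // URL -> people who tweeted it
builder.addBolt(new GetFollowers(), 12)
       .shuffleGrouping();                             // tweeter -> that tweeter's followers
builder.addBolt(new PartialUniquer(), 6)
       .fieldsGrouping(new Fields("id", "follower"));  // dedupe followers per request
builder.addBolt(new CountAggregator(), 2)
       .fieldsGrouping(new Fields("id"));              // sum partial counts -> reach
```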
  57. Guaranteeing message processing: the "tuple tree"
  58. Guaranteeing message processing • A spout tuple is not fully processed until all tuples in the tree have been completed
  59. Guaranteeing message processing • If the tuple tree is not completed within a specified timeout, the spout tuple is replayed
  60. Guaranteeing message processing: the reliability API
  61. Guaranteeing message processing • "Anchoring" creates a new edge in the tuple tree
  62. Guaranteeing message processing • Acking marks a single node in the tree as complete
  63. Guaranteeing message processing • Storm tracks tuple trees for you in an extremely efficient way
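Tying slides 61 and 62 to code: with the lower-level bolt API you anchor by passing the input tuple to emit(), and ack the input when you are done with it. A minimal sketch:

```java
import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Explicit reliability API: anchoring builds the tuple tree, acking
// marks this node of the tree as complete.
public class ReliableSplitSentence extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        for (String word : tuple.getString(0).split(" ")) {
            // Anchoring: each emitted word becomes a child of the input
            // tuple, adding an edge to the tuple tree
            collector.emit(tuple, new Values(word));
        }
        // Acking: this node is complete; the spout tuple is fully
        // processed once every node in its tree has been acked
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```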
  64. Storm UI
  65. Storm UI
  66. Storm UI
  67. Storm on EC2: https://github.com/nathanmarz/storm-deploy (one-click deploy tool)
  68. Documentation
  69. State spout (almost done): Synchronize a large amount of frequently changing state into a topology
  70. State spout (almost done): Optimizing the reach topology by eliminating the database calls
  71. State spout (almost done): Each GetFollowers task keeps a synchronous cache of a subset of the social graph
  72. State spout (almost done): This works because GetFollowers repartitions the social graph the same way it partitions GetTweeters' stream
  73. Future work • Storm on Mesos • "Swapping" • Auto-scaling • Higher-level abstractions
  74. Questions? http://github.com/nathanmarz/storm
  75. What Storm does • Distributes code and configurations • Robust process management • Monitors topologies and reassigns failed tasks • Provides reliability by tracking tuple trees • Routing and partitioning of streams • Serialization • Fine-grained performance stats of topologies
