Lessons Learned Building Storm

Video and slides synchronized; mp3 and slide downloads available at http://bit.ly/162zy81.

Nathan Marz shares lessons learned building Storm, an open-source, distributed, real-time computation system. Filmed at qconnewyork.com.

Nathan Marz is currently working on a new startup. He was the lead engineer at BackType before the company was acquired by Twitter in 2011. At Twitter, he started the streaming compute team, which provides and develops shared infrastructure to support many critical real-time applications throughout the company. Nathan is the creator of many open source projects, including Cascalog and Storm.


  1. Lessons learned building Storm. Nathan Marz (@nathanmarz)
  2. InfoQ.com: News & Community Site. 750,000 unique visitors/month; published in 4 languages (English, Chinese, Japanese, and Brazilian Portuguese); posts content from our QCon conferences: news 15-20/week, articles 3-4/week, presentations (videos) 12-15/week, interviews 2-3/week, books 1/month. Watch the video with slide synchronization on InfoQ.com: http://www.infoq.com/presentations/storm-lessons
  3. Presented at QCon New York (www.qconnewyork.com). Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation. Strategy: a practitioner-driven conference designed for YOU, influencers of change and innovation in your teams; speakers and topics driving the evolution and innovation; connecting and catalyzing the influencers and innovators. Highlights: attended by more than 12,000 delegates since 2007; held in 9 cities worldwide.
  4. Software development in theory
  5. Software development in practice
  6. Storm: widely used stream processing system
  7. Storm: fully distributed and scalable; strong processing guarantees; high performance; multi-tenant
  8. Storm architecture
  9. Storm architecture: master node (similar to the Hadoop JobTracker)
  10. Storm architecture: used for cluster coordination
  11. Storm architecture: run worker processes
  12. Lesson #1: No such thing as a long-lived process
  13. If the JobTracker dies, all jobs die
  14. =
  15. Your processes will crash
  16. Your code is not correct
  17. Your code is not correct
  18. Solution: design the system to be fault-tolerant to process restart
  19. Implications: the program can be kill -9’d at any point; all state must be external to the process; state modification might be aborted at any point
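The implications on slide 19 amount to an atomic-persistence discipline: keep state outside the process, and make every write all-or-nothing so an aborted modification leaves the old state intact. A minimal Python sketch of that discipline, using the write-to-temp-then-rename pattern (the file path and JSON format are illustrative; Storm itself keeps this kind of state in Zookeeper, not local files):

```python
import json
import os
import tempfile

# Hypothetical location for externalized state; illustrative only.
STATE_PATH = os.path.join(tempfile.gettempdir(), "worker-state.json")

def save_state(state):
    """Write state atomically: a kill -9 mid-write leaves either the old
    file or the new one on disk, never a half-written mix."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(STATE_PATH))
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())       # data is durable before we publish it
    os.rename(tmp, STATE_PATH)     # atomic replace on POSIX filesystems

def load_state(default=None):
    """On restart, recover from the last fully completed write."""
    try:
        with open(STATE_PATH) as f:
            return json.load(f)
    except FileNotFoundError:
        return {} if default is None else default
```

Because every restart begins with `load_state()`, a process that is killed at any point simply resumes from the last committed state.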
  20. Other benefits for Storm: easy to reconfigure; can upgrade Nimbus without touching apps (e.g. with bug fixes)
  21. Lesson #2: Use state machines to express intricate behavior
  22. Nimbus has intricate behavior
  23. Killing a topology: 1. stop emitting new data into the topology; 2. wait to let the topology finish processing in-transit messages; 3. shut down the workers; 4. clean up state
  24. Killing a topology: asynchronous; must be process fault-tolerant; don’t allow activate/deactivate/rebalance on a killed topology; should be able to kill a killed topology with a smaller wait time
  25. Originally lots of (buggy) conditional logic
  26. Rewrote Nimbus to be an asynchronous, process fault-tolerant state machine
  27. Example of a general solution being easier to understand than the specific solution
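The rewrite described on slides 25-26 can be illustrated with a table-driven state machine: transitions are data, so every (state, event) pair is handled explicitly instead of through scattered conditional logic, and invalid operations (like activating a killed topology) are rejected by construction. This is a hypothetical Python sketch, not Nimbus's actual code; the states and events are simplified from the topology lifecycle the slides describe:

```python
# Allowed (state, event) -> next-state transitions. Anything absent
# from this table is an invalid operation, matching slide 24's rules.
TRANSITIONS = {
    ("active", "kill"): "killed",
    ("killed", "kill"): "killed",          # re-kill allowed, e.g. with a smaller wait
    ("active", "deactivate"): "inactive",
    ("inactive", "activate"): "active",
    ("inactive", "kill"): "killed",
    ("active", "rebalance"): "rebalancing",
    ("rebalancing", "do-rebalance"): "active",
}

def transition(state, event):
    """Return the next state, or raise if the event is not allowed."""
    nxt = TRANSITIONS.get((state, event))
    if nxt is None:
        raise ValueError(f"event {event!r} not allowed in state {state!r}")
    return nxt
```

Because the current state can itself be persisted externally (Lesson #1), a restarted Nimbus can pick up an in-flight kill exactly where it left off.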
  28. Lesson #3: Every feature will be abused
  29. Example: logging
  30. One rogue topology uses up all the disk space on the cluster
  31. Solution: switch from log4j to logback so that the size of logs can be limited
  32. Example: Storm’s “reportError” method
  33. Used to show errors in the Storm UI
  34. Error info is stored in Zookeeper
  35. What happens when a user deploys code like this?
  36. Denial-of-service on Zookeeper, and the cluster goes down
  37. Solution: rate-limit how many errors/sec can be written to Zookeeper
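One common way to implement the limit on slide 37 is a token bucket: each error report spends a token, and tokens refill at the permitted rate, so a tight `reportError` loop gets throttled instead of flooding Zookeeper. A Python sketch of the idea (the names and drop policy are illustrative, not Storm's actual implementation):

```python
import time

class RateLimiter:
    """Token-bucket limiter: allow at most `rate` events/sec, with a
    burst allowance of `burst` tokens; excess events are refused."""

    def __init__(self, rate, burst=None, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(burst if burst is not None else rate)
        self.tokens = self.capacity
        self.clock = clock          # injectable for testing
        self.last = clock()

    def allow(self):
        """Refill tokens for elapsed time, then try to spend one."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The caller then guards the expensive operation: `if limiter.allow(): write_error_to_zookeeper(err)`, and otherwise drops or counts the suppressed report.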
  38. This is a general principle
  39. Lesson #4: Isolate to avoid cascading failure
  40. Originally one giant shared Zookeeper cluster for all services within Twitter
  41. If one service abused ZK, that could lead to cascading failure
  42. Zookeeper is not a multi-tenant system
  43. Solution: Storm got a dedicated Zookeeper cluster
  44. Lesson #5: Minimize dependencies
  45. Your code is not correct
  46. Other people’s code is not correct
  47. Fewer dependencies = less possibility for failure
  48. Example: Storm’s usage of Zookeeper
  49. Worker locations are stored in Zookeeper
  50. All workers must know the locations of other workers to send messages
  51. Two ways to get location updates
  52. 1. Poll Zookeeper
  53. 2. Use Zookeeper’s “watch” feature to get push notifications
  54. Method 2 is faster but relies on another feature
  55. Storm uses both methods
  56. If the watch feature fails, locations still propagate via polling
  57. Eliminating the dependence was justified by the small amount of code required
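The poll-plus-watch design of slides 52-56 can be sketched as one cache with two update paths feeding the same refresh routine: a push callback (fired by a Zookeeper watch) and a periodic poll. Either path alone keeps the cache current, so a broken watch feature degrades to polling latency instead of failing. An illustrative Python sketch, not Storm's code; `fetch` stands in for the actual Zookeeper read:

```python
import threading

class WorkerLocations:
    """Cache of worker locations with redundant update paths."""

    def __init__(self, fetch, poll_secs=10.0):
        self._fetch = fetch          # callable that reads locations from ZK
        self.poll_secs = poll_secs   # slow-path interval
        self._lock = threading.Lock()
        self.locations = {}

    def on_watch_fired(self):
        """Fast path: a push notification arrived; refresh immediately."""
        self._refresh()

    def poll_once(self):
        """Slow path: called every poll_secs regardless of watches."""
        self._refresh()

    def _refresh(self):
        data = self._fetch()
        with self._lock:
            self.locations = data
```

The design choice is that the watch is an optimization, not a dependency: correctness rests entirely on the poll loop, which is a few lines of code.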
  58. Lesson #6: Monitor everything
  59. Monitoring is a prerequisite to operating robust software
  60. Storm makes monitoring topologies very easy
  61. Lots of stats are monitored automatically
  62. Simple API to monitor custom stats
  63. Automatically integrates with the visualization stack
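In the spirit of slide 62, a custom stat can be as simple as a counter that topology code increments and the framework periodically drains into the visualization stack. A hypothetical Python sketch of that pattern (Storm's real metrics API is Java and differs in its details):

```python
class CountMetric:
    """Minimal custom stat: user code increments it; the framework
    calls get_value_and_reset() once per reporting interval and ships
    the value downstream."""

    def __init__(self):
        self._value = 0

    def incr(self, amount=1):
        self._value += amount

    def get_value_and_reset(self):
        """Return the count for this interval and start a fresh one."""
        value, self._value = self._value, 0
        return value
```

The drain-and-reset contract is what makes the numbers per-interval rates rather than ever-growing totals, which is what dashboards usually want.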
  64. Lesson #7: How we solved the resource management problem
  65. Resource management. Multi-tenancy: topologies don’t affect each other. Capacity management: converting $$$ into topologies
  66. Initially treated these as separate problems
  67. Multi-tenancy attempt #1: resource isolation using Mesos
  68. (Diagram: Storm running on Mesos across multiple machines)
  69. Each machine runs workers from many topologies
  70. Resource isolation is an extraordinary claim
  71. Extraordinary claims require extraordinary evidence!
  72. Ran into massive variance problems
  73. The at-least resource model of Mesos made capacity measurement impossible
  74. Almost went down the route of resource isolation with resource capping
  75. But what about hardware threading and caching?
  76. Conclusion: sharing a single machine between independent applications is fundamentally complex
  77. Capacity management attempt #1: 1. provide a shared Storm cluster; 2. measure capacity usage in aggregate; 3. always have some % of the cluster free; 4. grow the cluster as needed according to usage
  78. People would deploy topologies with more workers than slots on the cluster
  79. People only care about getting their application working and will twist any knob possible
  80. Problems: production topologies starved; bloated resource usage because there was no incentive to optimize; no process for making $$$ decisions
  81. Requirements: 1. production topologies get priority to resources; 2. one topology cannot affect the performance of another topology; 3. incentives for people to optimize resource usage; 4. a process for making $$$ decisions on machines; 5. the ability to measure how much capacity a topology needs (for 3 and 4)
  82. The more complex the problem, the simpler the solution must be
  83. Solution: isolation scheduler
  84. Isolation scheduler
  85. Isolation scheduler: Nimbus configuration
  86. Isolation scheduler: configurable only by the cluster administrator
  87. Isolation scheduler: map from topology name to # of machines
  88. Isolation scheduler: topologies listed are production topologies
  89. Isolation scheduler: these topologies are guaranteed dedicated access to that # of machines
  90. Isolation scheduler: remaining machines are used for failover and for running development topologies
  91. Benefits: resource contention issue of Mesos completely avoided; takes advantage of the process fault-tolerance of Nimbus; simple to use and understand; easy to do capacity measurements; distinguishes production from in-development
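The scheduler behavior of slides 85-90 reduces to partitioning machines according to the administrator-owned map: topologies named in the config get dedicated machines, and whatever is left over serves as failover headroom and runs development topologies. An illustrative Python sketch of that core allocation (names and structure are hypothetical, not Storm's actual scheduler code):

```python
def isolation_schedule(machines, isolation_config):
    """Partition `machines` per the admin config.

    machines: list of machine identifiers.
    isolation_config: map from production topology name -> # of
    dedicated machines it is guaranteed.

    Returns (assignments, free_pool), where free_pool holds the
    remaining machines for failover and development topologies.
    """
    pool = list(machines)
    assignments = {}
    for topology, count in isolation_config.items():
        if count > len(pool):
            raise ValueError(f"not enough machines to isolate {topology!r}")
        assignments[topology] = pool[:count]   # dedicated, never shared
        pool = pool[count:]
    return assignments, pool
```

Because no production topology ever shares a machine, the per-machine resource-contention problem Mesos was meant to solve simply never arises, and measuring a topology's capacity is as easy as looking at its dedicated machines.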
  92. Topology productionization process: 1. test the topology on the cluster as a development topology; 2. when ready, work with admins to do a capacity measurement; 3. submit a capacity proposal for approval by a VP; 4. allocate machines immediately from the failover machines; 5. backfill capacity when new machines arrive 4-6 weeks later
  93. Benefits: incentives to optimize resource usage; backfill allows immediate productionization; human process integrated with the technical solution
  94. Questions?
  95. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/storm-lessons
