Lessons Learned Building Storm
Video and slides synchronized, mp3 and slide download available at URL …

Nathan Marz shares lessons learned building Storm, an open-source, distributed, real-time computation system. Filmed at

Nathan Marz is currently working on a new startup. He was the lead engineer at BackType before the company was acquired by Twitter in 2011. At Twitter, he started the streaming compute team, which provides and develops shared infrastructure to support many critical real-time applications throughout the company. Nathan is the creator of many open source projects, including Cascalog and Storm.

  • 1. Lessons learned building Storm Nathan Marz @nathanmarz
  • 2. News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on! /storm-lessons
  • 3. Presented at QCon New York Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Software development in theory
  • 5. Software development in practice
  • 6. Storm Widely used stream processing system
  • 7. Storm Fully distributed and scalable Strong processing guarantees High performance Multi-tenant
  • 8. Storm architecture
  • 9. Storm architecture Master node (similar to Hadoop JobTracker)
  • 10. Storm architecture Used for cluster coordination
  • 11. Storm architecture Run worker processes
  • 12. Lesson #1: No such thing as a long-lived process
  • 13. If JobTracker dies, all jobs die
  • 14. [image slide]
  • 15. Your processes will crash
  • 16. Your code is not correct
  • 17. Your code is not correct
  • 18. Solution: Design system to be fault-tolerant to process restart
  • 19. Implications Program can be kill -9’d at any point All state must be external to process State modification might be aborted at any point
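The "state modification might be aborted at any point" implication above is usually handled with atomic swaps: never mutate state in place, write the new version somewhere else and switch over in one atomic step. A minimal sketch of that pattern (hypothetical illustration, not Storm's actual code), using a temp file plus an atomic rename:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

// Sketch: persisting state so that a process killed (kill -9) at any
// instant never leaves a partially written file behind. The write goes
// to a temp file first, then an atomic rename swaps it in, so readers
// see either the old state or the new state, never a torn write.
public class CrashSafeState {
    private final Path stateFile;

    public CrashSafeState(Path stateFile) {
        this.stateFile = stateFile;
    }

    public void save(String state) throws IOException {
        Path tmp = stateFile.resolveSibling(stateFile.getFileName() + ".tmp");
        Files.write(tmp, state.getBytes(StandardCharsets.UTF_8));
        // Atomic on POSIX filesystems -- even if we are kill -9'd here,
        // the state file is either the complete old or complete new version.
        Files.move(tmp, stateFile, StandardCopyOption.ATOMIC_MOVE);
    }

    public String load() throws IOException {
        return new String(Files.readAllBytes(stateFile), StandardCharsets.UTF_8);
    }
}
```

The same idea applies to state kept in Zookeeper, where individual znode writes are atomic.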
  • 20. Other benefits for Storm Easy to reconfigure Can upgrade Nimbus without touching apps (e.g. with bug fixes)
  • 21. Lesson #2: Use state machines to express intricate behavior
  • 22. Nimbus has intricate behavior
  • 23. Killing a topology 1. Stop emitting new data into topology 2. Wait to let topology finish processing in-transit messages 3. Shutdown workers 4. Cleanup state
  • 24. Killing a topology Asynchronous Must be process fault-tolerant Don’t allow activate/deactivate/rebalance to a killed topology Should be able to kill a killed topology with a smaller wait time
  • 25. Originally lots of (buggy) conditional logic
  • 26. Rewrote Nimbus to be an asynchronous, process fault-tolerant state machine
  • 27. Example of general solution being easier to understand than the specific solution
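The behavior on slides 23-24 can be made concrete with an explicit state machine. This is a hypothetical sketch of the idea, not Nimbus's real code: invalid events, like rebalancing a killed topology, are rejected in one place instead of through scattered conditional logic, and re-killing is only allowed with a smaller wait time.

```java
// Sketch: topology lifecycle as an explicit state machine.
public class TopologyStateMachine {
    enum State { ACTIVE, INACTIVE, KILLED }

    private State state = State.ACTIVE;
    private int killWaitSecs = -1;

    public synchronized void activate() {
        require(state != State.KILLED, "cannot activate a killed topology");
        state = State.ACTIVE;
    }

    public synchronized void deactivate() {
        require(state != State.KILLED, "cannot deactivate a killed topology");
        state = State.INACTIVE;
    }

    public synchronized void rebalance() {
        require(state != State.KILLED, "cannot rebalance a killed topology");
    }

    // Killing an already-killed topology is legal, but only with a
    // smaller wait time (the behavior described on slide 24).
    public synchronized void kill(int waitSecs) {
        if (state == State.KILLED) {
            require(waitSecs < killWaitSecs, "wait time must shrink on re-kill");
        }
        state = State.KILLED;
        killWaitSecs = waitSecs;
    }

    public synchronized State state() { return state; }

    private static void require(boolean ok, String msg) {
        if (!ok) throw new IllegalStateException(msg);
    }
}
```

In the real system each transition would also be persisted externally before taking effect, so a restarted Nimbus resumes the machine where it left off.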
  • 28. Lesson #3: Every feature will be abused
  • 29. Example: Logging
  • 30. One rogue topology uses up all disk space on the cluster
  • 31. Solution: Switch from log4j to logback so that size of logs can be limited
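The logback fix can be expressed as a rolling-appender configuration. This is an illustrative fragment, not Storm's exact shipped config; file names and limits here are made up:

```xml
<!-- Illustrative logback.xml fragment: cap a worker's log at a fixed
     number of fixed-size files so one topology cannot fill the disk. -->
<appender name="WORKER" class="ch.qos.logback.core.rolling.RollingFileAppender">
  <file>worker.log</file>
  <rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
    <fileNamePattern>worker.log.%i</fileNamePattern>
    <minIndex>1</minIndex>
    <maxIndex>9</maxIndex>   <!-- keep at most 9 rolled files -->
  </rollingPolicy>
  <triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
    <maxFileSize>100MB</maxFileSize>  <!-- roll after 100 MB -->
  </triggeringPolicy>
  <encoder>
    <pattern>%d %-5level %logger{36} - %msg%n</pattern>
  </encoder>
</appender>
```

With this setup disk usage per worker is bounded at roughly maxIndex × maxFileSize, regardless of how much a rogue topology logs.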
  • 32. Example: Storm’s “reportError” method
  • 33. Used to show errors in the Storm UI
  • 34. Error info is stored in Zookeeper
  • 35. What happens when a user deploys code like this?
  • 36. Denial-of-service on Zookeeper and cluster goes down
  • 37. Solution: Rate-limit how many errors/sec can be written to Zookeeper
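The rate-limiting fix can be sketched as a simple per-second window check in front of the Zookeeper write (a simplified illustration, not Storm's actual implementation): excess error reports are dropped locally instead of hammering Zookeeper.

```java
// Sketch: allow at most N error reports per second to reach Zookeeper;
// reports beyond the limit are dropped.
public class ErrorRateLimiter {
    private final int maxPerSecond;
    private long windowStartMs;
    private int countInWindow;

    public ErrorRateLimiter(int maxPerSecond) {
        this.maxPerSecond = maxPerSecond;
    }

    // Returns true if this error may be written to Zookeeper now.
    public synchronized boolean tryAcquire(long nowMs) {
        if (nowMs - windowStartMs >= 1000) {   // start a new one-second window
            windowStartMs = nowMs;
            countInWindow = 0;
        }
        if (countInWindow < maxPerSecond) {
            countInWindow++;
            return true;
        }
        return false;  // drop: protects Zookeeper from a reportError loop
    }
}
```

A tight `while (true) reportError(...)` loop then costs at most `maxPerSecond` Zookeeper writes per second instead of taking the cluster down.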
  • 38. This is a general principle
  • 39. Lesson #4: Isolate to avoid cascading failure
  • 40. Originally one giant shared Zookeeper cluster for all services within Twitter
  • 41. If one service abused ZK, that could lead to cascading failure
  • 42. Zookeeper is not a multi-tenant system
  • 43. Solution: Storm got a dedicated Zookeeper cluster
  • 44. Lesson #5: Minimize dependencies
  • 45. Your code is not correct
  • 46. Other people’s code is not correct
  • 47. Fewer dependencies Less possibility for failure=
  • 48. Example: Storm’s usage of Zookeeper
  • 49. Worker locations stored in Zookeeper
  • 50. All workers must know locations of other workers to send messages
  • 51. Two ways to get location updates
  • 52. 1. Poll Zookeeper
  • 53. 2. Use Zookeeper “watch” feature to get push notifications
  • 54. Method 2 is faster but relies on another feature
  • 55. Storm uses both methods
  • 56. If watch feature fails, locations still propagate via polling
  • 57. Eliminating the dependence was justified by the small amount of code required
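The dual-path design on slides 52-56 can be sketched as follows (hypothetical code, not Storm's): a periodic poll always refreshes worker locations, and a push notification (Zookeeper's watch) merely triggers the same refresh early, so a silent watch failure only adds latency.

```java
import java.util.Map;
import java.util.concurrent.*;
import java.util.function.Supplier;

// Sketch: locations propagate via polling no matter what; watches are
// an optimization, not a dependency.
public class LocationTracker {
    private final Supplier<Map<String, String>> readLocations; // e.g. read from ZK
    private volatile Map<String, String> locations = Map.of();
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });

    public LocationTracker(Supplier<Map<String, String>> readLocations,
                           long pollSecs) {
        this.readLocations = readLocations;
        // Polling path: always runs, independent of watches.
        timer.scheduleAtFixedRate(this::refresh, pollSecs, pollSecs,
                                  TimeUnit.SECONDS);
    }

    // Push path: called from a Zookeeper watch callback when it works.
    public void onWatchFired() { refresh(); }

    private void refresh() { locations = readLocations.get(); }

    public Map<String, String> locations() { return locations; }
}
```

Workers read `locations()` when routing messages; both paths converge on the same `refresh()` so there is only one code path to get right.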
  • 58. Lesson #6: Monitor everything
  • 59. Monitoring is a prerequisite to operating robust software
  • 60. Storm makes monitoring topologies very easy
  • 61. Lots of stats monitored automatically
  • 62. Simple API to monitor custom stats
  • 63. Automatically integrates with visualization stack
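The "simple API to monitor custom stats" can be illustrated with a registry of named metrics that a collector snapshots and resets on a timer. This standalone sketch mirrors the shape of a register-then-snapshot metrics API; it is not Storm's actual code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: components register named metrics; the monitoring system
// periodically snapshots (and resets) them and ships the values to the
// visualization stack.
public class Metrics {
    public interface Metric { Object getValueAndReset(); }

    public static class Counter implements Metric {
        private long n;
        public synchronized void incr() { n++; }
        public synchronized Object getValueAndReset() {
            long v = n;
            n = 0;
            return v;
        }
    }

    private final Map<String, Metric> registry = new LinkedHashMap<>();

    public <T extends Metric> T register(String name, T metric) {
        registry.put(name, metric);
        return metric;
    }

    // Called on a timer by the monitoring system.
    public Map<String, Object> snapshot() {
        Map<String, Object> out = new LinkedHashMap<>();
        registry.forEach((k, m) -> out.put(k, m.getValueAndReset()));
        return out;
    }
}
```

User code only ever touches `register` and `incr`; collection, reset, and shipping to dashboards happen automatically.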
  • 64. Lesson #7: How we solved the resource management problem
  • 65. Resource management Multi-tenancy: topologies don’t affect each other Capacity management: converting $$$ into topologies
  • 66. Initially treated these as separate problems
  • 67. Multi-tenancy attempt #1 Resource isolation using Mesos
  • 68. Machine Machine Machine Machine Mesos Storm
  • 69. Each machine runs workers from many topologies
  • 70. Resource isolation is an extraordinary claim
  • 71. Extraordinary claims require extraordinary evidence!
  • 72. Ran into massive variance problems
  • 73. At-least resource model of Mesos made capacity measurement impossible
  • 74. Almost went down route of resource isolation with resource capping
  • 75. But what about hardware threading and caching?
  • 76. Conclusion: Sharing a single machine for independent applications is fundamentally complex
  • 77. Capacity management attempt #1 1. Provide shared Storm cluster 2. Measure capacity usage in aggregate 3. Always have some % of cluster free 4. Grow cluster as needed according to usage
  • 78. People would deploy topologies with more workers than slots on cluster
  • 79. People only care about getting their application working and will twist any knob possible
  • 80. Problems Production topologies starved Bloated resource usage because no incentive to optimize No process for making $$$ decisions
  • 81. Requirements 1. Production topologies get priority to resources 2. One topology cannot affect the performance of another topology 3. Incentives for people to optimize resource usage 4. Process for making $$$ decisions on machines 5. Ability to measure how much capacity a topology needs for 3 and 4
  • 82. The more complex the problem, the simpler the solution must be
  • 83. Solution Isolation scheduler
  • 84. Isolation scheduler
  • 85. Isolation scheduler Nimbus configuration
  • 86. Isolation scheduler Configurable only by cluster administrator
  • 87. Isolation scheduler Map from topology name to # of machines
  • 88. Isolation scheduler Topologies listed are production topologies
  • 89. Isolation scheduler These topologies guaranteed dedicated access to that # of machines
  • 90. Isolation scheduler Remaining machines used for failover and for running development topologies
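The configuration described on slides 85-89 amounts to a small map in the cluster's `storm.yaml`. An illustrative fragment (topology names are hypothetical; check the class name and keys against your Storm version's documentation):

```yaml
# storm.yaml fragment: enable the isolation scheduler and dedicate
# machines to production topologies by name.
storm.scheduler: "org.apache.storm.scheduler.IsolationScheduler"
isolation.scheduler.machines:
    "ads-analytics": 10
    "url-stats": 6
```

Only the cluster administrator edits this file; any topology not listed runs as a development topology on the remaining machines.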
  • 91. Benefits Resource contention issue of Mesos completely avoided Takes advantage of process fault-tolerance of Nimbus Simple to use and understand Easy to do capacity measurements Distinguishes production from in-development
  • 92. Topology productionization process 1. Test topology on cluster as a development topology 2. When ready, work with admins to do capacity measurement 3. Submit capacity proposal for approval by VP 4. Allocate machines immediately from failover machines 5. Backfill capacity when machines arrive 4-6 weeks later
  • 93. Benefits Incentives to optimize resource usage Backfill allows immediate productionization Human process integrated with technical solution
  • 94. Questions?
  • 95. Watch the video with slide synchronization on! lessons