Lessons learned building Storm
Nathan Marz
@nathanmarz
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
Presented at QCon New York
www.qconnewyork.com
Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation
Software development in theory
Software development in practice
Storm
Widely used stream processing system
Fully distributed and scalable
Strong processing guarantees
High performance
Multi-tenant
Storm architecture
Nimbus: master node (similar to the Hadoop JobTracker)
Zookeeper: used for cluster coordination
Supervisors: run worker processes
Lesson #1:
No such thing as a long-lived process
If the JobTracker dies, all jobs die
Your processes will crash
Your code is not correct
Solution:
Design the system to be fault-tolerant to process restart
Implications
The program can be kill -9’d at any point
All state must be external to the process
State modification might be aborted at any point
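These implications can be sketched concretely. The following is a hypothetical illustration, not Storm's mechanism (Storm keeps such state in Zookeeper, and the file name and structure here are made up): state lives outside the process, and every modification is an atomic replace, so a kill -9 at any point leaves either the old state or the new state, never a half-written one.

```python
import json
import os
import tempfile

# Hypothetical external-state location (illustration only).
STATE_PATH = os.path.join(tempfile.gettempdir(), "worker_state.json")

def load_state():
    """Recover state after a restart; a missing file means a fresh start."""
    try:
        with open(STATE_PATH) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"processed": 0}

def save_state(state):
    # Write to a temp file, then atomically rename it over the real one.
    # A kill -9 mid-write leaves the previous file intact.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(STATE_PATH))
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, STATE_PATH)  # atomic on POSIX

# Each run resumes from whatever was last durably recorded.
state = load_state()
state["processed"] += 1
save_state(state)
```

The same discipline applies regardless of the store: record progress durably before acting on it, and make each write all-or-nothing.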
Other benefits for Storm
Easy to reconfigure
Nimbus can be upgraded without touching apps (e.g. to ship bug fixes)
Lesson #2:
Use state machines to express intricate behavior
Nimbus has intricate behavior
Killing a topology
1. Stop emitting new data into the topology
2. Wait for the topology to finish processing in-transit messages
3. Shut down workers
4. Clean up state
This sequence is asynchronous and must be process fault-tolerant
Don’t allow activate/deactivate/rebalance on a killed topology
Should be able to kill a killed topology with a smaller wait time
Originally lots of (buggy) conditional logic
Rewrote Nimbus to be an asynchronous, process fault-tolerant state machine
An example of a general solution being easier to understand than the specific one
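As a sketch of the idea (the state names and in-memory store here are made up, not Nimbus's actual ones), the kill sequence can be expressed as an explicit state machine whose current state is persisted externally before each side effect, so a restarted process simply resumes from the recorded state instead of relying on tangled conditionals:

```python
# Hypothetical kill-sequence states, mirroring the four steps above.
TRANSITIONS = {
    "active": "deactivated",        # stop emitting new data
    "deactivated": "draining",      # wait for in-transit messages
    "draining": "workers-stopped",  # shut down workers
    "workers-stopped": "removed",   # clean up state
}

class KillStateMachine:
    def __init__(self, store):
        self.store = store  # stands in for an external store like Zookeeper

    def step(self, topology):
        state = self.store.get(topology, "active")
        nxt = TRANSITIONS.get(state)
        if nxt is None:
            # Terminal state: killing an already-killed topology is a no-op.
            return state
        self.store[topology] = nxt  # persist the transition before acting
        return nxt

store = {}
sm = KillStateMachine(store)
while sm.step("my-topology") != "removed":
    pass  # each iteration would perform one step's side effects
```

Because every transition is recorded before its side effects run, a crash between any two steps leaves the machine resumable, which is exactly the process fault-tolerance the rewrite was after.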
Lesson #3:
Every feature will be abused
Example: Logging
One rogue topology can use up all the disk space on the cluster
Solution:
Switch from log4j to logback so that the size of logs can be limited
Example:
Storm’s “reportError” method
Used to show errors in the Storm UI
Error info is stored in Zookeeper
What happens when a user’s code calls reportError in a tight loop?
Denial-of-service on Zookeeper, and the cluster goes down
Solution:
Rate-limit how many errors per second can be written to Zookeeper
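A minimal sketch of such rate limiting, assuming a token-bucket policy (an assumption for illustration; Storm's actual mechanism may differ):

```python
import time

class RateLimiter:
    """Token bucket: allow at most `rate` events per second, dropping
    the rest. Sketches capping reportError writes to Zookeeper."""

    def __init__(self, rate, clock=time.monotonic):
        self.rate = rate          # sustained events/sec, also burst size
        self.tokens = rate        # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = RateLimiter(rate=5)
# A tight loop of 100 error reports: the initial burst passes,
# then most calls are rejected until tokens refill.
results = [limiter.allow() for _ in range(100)]
```

A caller would drop or locally aggregate the rejected errors rather than forwarding them, keeping Zookeeper write traffic bounded no matter how badly a topology misbehaves.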
This is a general principle
Lesson #4:
Isolate to avoid cascading failure
Originally there was one giant shared Zookeeper cluster for all services within Twitter
If one service abused ZK, that could lead to cascading failure
Zookeeper is not a multi-tenant system
Solution:
Storm got a dedicated Zookeeper cluster
Lesson #5:
Minimize dependencies
Your code is not correct
Other people’s code is not correct
Fewer dependencies = less possibility of failure
Example:
Storm’s usage of Zookeeper
Worker locations stored in Zookeeper
All workers must know the locations of other workers to send messages
Two ways to get location updates:
1. Poll Zookeeper
2. Use Zookeeper’s “watch” feature to get push notifications
Method 2 is faster but relies on another feature
Storm uses both methods
If the watch feature fails, locations still propagate via polling
Eliminating the dependence was justified by the small amount of code required
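The push-plus-poll pattern can be sketched as follows, with hypothetical names and an in-memory dict standing in for Zookeeper. Both channels funnel into the same update path, so a broken watch only adds latency rather than stopping propagation:

```python
import threading
import time

class LocationTracker:
    """Sketch: worker locations arrive via both push notifications
    (on_watch_event) and a periodic polling fallback. Illustrative
    only; not Storm's actual code."""

    def __init__(self, read_locations, poll_interval=0.05):
        self.read_locations = read_locations  # reads the external store
        self.poll_interval = poll_interval
        self.locations = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def _apply(self, locs):
        with self._lock:
            self.locations = dict(locs)

    def on_watch_event(self):
        # Push path: called when a watch notification fires.
        self._apply(self.read_locations())

    def _poll_loop(self):
        # Pull path: runs regardless of whether watches work.
        while not self._stop.wait(self.poll_interval):
            self._apply(self.read_locations())

    def start(self):
        threading.Thread(target=self._poll_loop, daemon=True).start()

    def stop(self):
        self._stop.set()

store = {"worker-1": ("host-a", 6700)}
tracker = LocationTracker(lambda: store)
tracker.start()
store["worker-2"] = ("host-b", 6701)  # simulate a lost watch: no push fires
time.sleep(0.2)                        # the polling fallback still picks it up
tracker.stop()
```

The design choice is the point of the lesson: the watch feature becomes an optimization rather than a dependency, at the cost of a small polling loop.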
Lesson #6:
Monitor everything
Monitoring is a prerequisite to operating robust software
Storm makes monitoring topologies very easy
Lots of stats are monitored automatically
Simple API to monitor custom stats
Automatically integrates with the visualization stack
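Storm's actual metrics API is Java; purely as a language-neutral sketch of what a custom stat involves (all names here are illustrative, not Storm's API), a counter is incremented from application code and drained by a reporter on a fixed interval:

```python
import threading

class CountMetric:
    """Hypothetical custom stat: thread-safe increments, with a
    get-and-reset hook that a periodic reporter would call to ship
    the value to the visualization stack."""

    def __init__(self):
        self._count = 0
        self._lock = threading.Lock()

    def incr(self, n=1):
        with self._lock:
            self._count += n

    def get_and_reset(self):
        # Drain the counter so each reporting interval is independent.
        with self._lock:
            value, self._count = self._count, 0
            return value

reported = []
metric = CountMetric()
for _ in range(7):
    metric.incr()          # application code records events
reported.append(metric.get_and_reset())  # the reporter ships this value
```

The get-and-reset shape is what lets a dashboard show per-interval rates without the application doing any windowing itself.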
Lesson #7:
How we solved the resource management problem
Resource management
Multi-tenancy: topologies don’t affect each other
Capacity management: converting $$$ into topologies
We initially treated these as separate problems
Multi-tenancy attempt #1:
Resource isolation using Mesos
[Diagram: Storm workers scheduled by Mesos across many machines]
Each machine runs workers from many topologies
Resource isolation is an extraordinary claim
Extraordinary claims require extraordinary evidence!
Ran into massive variance problems
The at-least resource model of Mesos made capacity measurement impossible
Almost went down the route of resource isolation with resource capping
But what about hardware threading and caching?
Conclusion:
Sharing a single machine between independent applications is fundamentally complex
Capacity management attempt #1
1. Provide a shared Storm cluster
2. Measure capacity usage in aggregate
3. Always have some % of the cluster free
4. Grow the cluster as needed according to usage
People would deploy topologies with more workers than there were slots on the cluster
People only care about getting their application working and will twist any knob possible
Problems
Production topologies starved
Bloated resource usage, because there was no incentive to optimize
No process for making $$$ decisions
Requirements
1. Production topologies get priority access to resources
2. One topology cannot affect the performance of another topology
3. Incentives for people to optimize resource usage
4. A process for making $$$ decisions on machines
5. The ability to measure how much capacity a topology needs (for 3 and 4)
The more complex the problem, the simpler the solution must be
Solution:
Isolation scheduler
Part of the Nimbus configuration
Configurable only by the cluster administrator
A map from topology name to # of machines
The topologies listed are production topologies
These topologies are guaranteed dedicated access to that number of machines
Remaining machines are used for failover and for running development topologies
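The scheduler's core decision can be sketched as follows, with made-up topology names and a plain function standing in for the real scheduler; this illustrates the idea, not Storm's implementation:

```python
def isolate(machines, isolation_config):
    """Give each listed production topology dedicated machines;
    everything left over is the shared failover/development pool."""
    free = list(machines)
    assignments = {}
    for topology, count in isolation_config.items():
        if count > len(free):
            raise ValueError(f"not enough machines for {topology}")
        assignments[topology] = free[:count]  # dedicated, never shared
        free = free[count:]
    return assignments, free

machines = [f"machine-{i}" for i in range(10)]
# Administrator-owned map from topology name to # of machines
# (topology names are hypothetical).
config = {"ad-pipeline": 4, "trends": 3}
assignments, shared = isolate(machines, config)
```

Because dedicated sets never overlap, one production topology cannot touch another's machines, and capacity measurement reduces to counting whole machines per topology.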
Benefits
The resource contention issues of Mesos are completely avoided
Takes advantage of Nimbus’s process fault-tolerance
Simple to use and understand
Easy to do capacity measurements
Distinguishes production topologies from in-development ones
Topology productionization process
1. Test the topology on the cluster as a development topology
2. When ready, work with admins to do a capacity measurement
3. Submit a capacity proposal for approval by a VP
4. Allocate machines immediately from the failover machines
5. Backfill capacity when the new machines arrive 4-6 weeks later
Benefits
Incentives to optimize resource usage
Backfill allows immediate productionization
A human process integrated with the technical solution
Questions?
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/storm-lessons
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/162zy81.

Nathan Marz shares lessons learned building Storm, an open-source, distributed, real-time computation system. Filmed at qconnewyork.com.

Nathan Marz is currently working on a new startup. He was the lead engineer at BackType before being acquired by Twitter in 2011. At Twitter, he started the streaming compute team which provides and develops shared infrastructure to support many critical real-time applications throughout the company. Nathan is the creator of many open source projects, including projects such as Cascalog and Storm.

