
Maintaining reliability in an unreliable world



  1. Jeremy Edberg (Tweet @jedberg with feedback!)
  2. We live in an unreliable world
  3. (image slide)
  4. (image slide)
  5. (image slide)
  6. Maintaining reliability in an unreliable world
  7. Netflix
  8. reddit
  9. A quick rant • I hate the term “Cloud Computing” • I use it because everyone else does and no one would understand me otherwise • One alternative is “Virtualized Computing” • I don’t have a better alternative
  10. Agenda • Money • Architecture • Process (or lack thereof)
  11. Reliability and $$
  12. Monthly Page Views and Costs for reddit (chart: page views on one axis, 200M to 1,300M, monthly cost on the other, $20,000 to $130,000, plotted from March through the following March)
  13. The control and complexity spectrum (scale from “lots” to “little”)
  14. (image slide)
  15. Why did Netflix move out of the datacenter? • They didn’t know how fast the streaming service would grow and needed something that could grow with them • They wanted more redundancy than just the two datacenters they had • Autoscaling helps a lot too
  16. Netflix autoscaling (chart: traffic plotted against peak)
  17. Agenda • Money • Architecture • Process (or lack thereof)
  18. Making sure we are building for survival
  19. Building for redundancy
  20–21. 1 > 2 > 3: Going from two to three is hard
  22–23. 1 > 2 > 3: Going from one to two is harder
  24–25. Build for Three: If possible, plan for 3 or more from the beginning.
  26–27. “Build for three” is the secret to success in a virtual environment
  28. All system choices should assume that some part will fail at some point.
  29. Database Resiliency with Sharding
  30. Sharding • reddit split writes across four master databases • Links/Accounts/Subreddits, Comments, Votes, and Misc • Each has at least one slave in another zone • Avoid reading from the master if possible • Wrote their own database access layer, called the “thing” layer (see the sketch below)
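A minimal sketch of what a routing layer like that can look like: writes go to the master for a shard, reads prefer a replica in another zone. This is not reddit's actual “thing” layer; the shard names, hostnames, and psycopg2 connection details are assumptions made for illustration (reddit did run PostgreSQL).

```python
# Illustrative shard router, not reddit's real "thing" layer.
# Shard names, hostnames, and credentials are made up.
import random
import psycopg2  # the masters were PostgreSQL

SHARDS = {
    "links":    {"master": "pg-links-master",    "replicas": ["pg-links-replica-az2"]},
    "comments": {"master": "pg-comments-master", "replicas": ["pg-comments-replica-az2"]},
    "votes":    {"master": "pg-votes-master",    "replicas": ["pg-votes-replica-az2"]},
    "misc":     {"master": "pg-misc-master",     "replicas": ["pg-misc-replica-az2"]},
}

def _connect(host):
    return psycopg2.connect(host=host, dbname="reddit", user="app")

def writer(shard):
    # Writes always go to the shard's master.
    return _connect(SHARDS[shard]["master"])

def reader(shard):
    # Prefer a replica in another zone; fall back to the master only if needed.
    replicas = SHARDS[shard]["replicas"]
    return _connect(random.choice(replicas)) if replicas else writer(shard)
```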
  31. Cassandra
  32. Cassandra • 20 node cluster • ~40GB per node • Dynamo model
  33. Cassandra
  34. How it works • Replication factor • Quorum reads / writes • Bloom filter for fast negative lookups • Immutable files for fast writes • Seed nodes • Multi-region (see the sketch below)
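As a sketch of how replication factor and quorum reads/writes fit together, here is an example using the DataStax Python driver (cassandra-driver). It is not what reddit ran at the time (this deck predates CQL); the seed node addresses, keyspace, and table are made up.

```python
# Quorum reads/writes sketch with the DataStax Python driver.
# Seed node addresses, keyspace, and table are made up.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

cluster = Cluster(["10.0.0.11", "10.0.0.12"])   # seed nodes
session = cluster.connect()

# Replication factor 3: every row lives on three nodes.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.set_keyspace("demo")
session.execute("CREATE TABLE IF NOT EXISTS kv (k text PRIMARY KEY, v text)")

# QUORUM on both sides: with RF=3, two replicas must acknowledge each write
# and two must answer each read, so reads overlap the latest write even
# with one node down.
write = SimpleStatement("INSERT INTO kv (k, v) VALUES (%s, %s)",
                        consistency_level=ConsistencyLevel.QUORUM)
read = SimpleStatement("SELECT v FROM kv WHERE k = %s",
                       consistency_level=ConsistencyLevel.QUORUM)
session.execute(write, ("greeting", "hello"))
print(session.execute(read, ("greeting",)).one().v)
```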
  35. Cassandra Benefits • Fast writes • Fast negative lookups • Easy incremental scalability • Distributed -- No SPoF
  36. I love memcache • I make heavy use of memcached
  37. Second-class users • Logged out users always get cached content • Akamai bears the brunt of reddit’s traffic • Logged out users are about 80% of the traffic (see the sketch below)
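One way to make “logged out users always get cached content” concrete is to mark anonymous responses as publicly cacheable so the CDN (Akamai here) absorbs them. A hedged sketch using Flask; reddit's actual stack was Pylons, and the cookie name and TTLs below are invented.

```python
# Hedged illustration (Flask; cookie name and TTLs invented), not reddit's code.
from flask import Flask, request, make_response

app = Flask(__name__)

@app.route("/r/<subreddit>")
def listing(subreddit):
    resp = make_response(f"<html><body>hot links in r/{subreddit}</body></html>")
    if request.cookies.get("reddit_session"):
        # Logged in: personalized page, so the CDN must not cache it.
        resp.headers["Cache-Control"] = "private, max-age=0"
    else:
        # Logged out (~80% of traffic): identical for everyone, so the CDN
        # can serve it straight from edge cache.
        resp.headers["Cache-Control"] = "public, max-age=60"
    return resp
```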
  38. (diagram: items 1, 2, 3 spread across cache nodes A, B, C)
  39. (diagram: the same items after a fourth node, D, is added)
  40. Get - Set - Get (see the sketch below)
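One reading of the get - set - get pattern, sketched with python-memcached: try the cache, compute and set on a miss, then read back so every caller ends up with whatever value actually landed in the cache. The key names, TTL, and client setup are assumptions for illustration.

```python
# Read-through cache sketch with python-memcached; one interpretation of
# "get - set - get". Keys, TTL, and server address are made up.
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def cached(key, compute, ttl=300):
    value = mc.get(key)                # get: fast path
    if value is None:
        value = compute()              # expensive work (DB query, render, ...)
        mc.set(key, value, ttl)        # set: best effort, failures are fine
        value = mc.get(key) or value   # get again: use whatever actually won
    return value

# Usage: cached("listing:front", lambda: "expensive render", ttl=60)
```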
  41. Caching is a good way to hide your failures
  42–45. reddit’s Caching (built up over four slides) • Render cache (5GB): partially and fully rendered items • Data cache (15GB): chunks of data from the database • Permacache (10GB): precomputed queries
  46. Sometimes users notice your data inconsistency
  47. Queues are your friend • Votes • Comments • Thumbnail scraper • Precomputed queries • Spam processing and corrections (see the sketch below)
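A sketch of moving that work off the request path with a queue. reddit used RabbitMQ for this; the queue name and message shape below are invented, and pika is just one client choice.

```python
# Enqueue votes instead of processing them inline; workers drain the queue.
# Queue name and payload are invented; RabbitMQ via pika.
import json
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
channel.queue_declare(queue="vote_q", durable=True)

def enqueue_vote(user_id, link_id, direction):
    # The web request returns immediately; a worker pool applies the vote
    # and recomputes the affected precomputed listings later.
    channel.basic_publish(
        exchange="",
        routing_key="vote_q",
        body=json.dumps({"user": user_id, "link": link_id, "dir": direction}),
    )
```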
  48. Going multi-zone
  49. Benefits of Amazon’s Zones • Loosely connected • Low latency between zones • 99.95% uptime guarantee per zone
  50. Going Multi-region
  51. Leveraging Multi-region • 100% uptime is theoretically possible • You have to replicate your data • This will cost money
  52. Other options • Backup datacenter • Backup provider
  53. Example (diagram)
  54. Example 2 (diagram)
  55. Agenda • Money • Architecture • Process (or lack thereof)
  56. Monitoring • reddit uses Ganglia • Backed by RRD • Netflix uses RRD too • Makes good rollup graphs • Gives a great way to visually and programmatically detect errors (see the sketch below)
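For the “programmatically detect errors” part, a hedged sketch: Ganglia's gmond daemon serves its current metrics as XML on TCP port 8649 by default, so an alert check can poll it and compare a value against a threshold. The host, metric name, and threshold here are invented.

```python
# Poll a Ganglia gmond for one metric and alert on a threshold.
# Host, metric name, and threshold are invented; 8649 is gmond's default port.
import socket
import xml.etree.ElementTree as ET

def fetch_metric(host, metric, port=8649):
    with socket.create_connection((host, port), timeout=5) as sock:
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    root = ET.fromstring(b"".join(chunks))
    node = root.find(f".//METRIC[@NAME='{metric}']")
    return float(node.get("VAL")) if node is not None else None

rate = fetch_metric("app01.example.com", "app_error_rate")
if rate is not None and rate > 5.0:
    print(f"ALERT: error rate {rate}/s over threshold")  # hand off to paging here
```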
  57. Alert Systems (architecture diagram: per-service and per-team agents expose an API to a CORE alerting gateway, which sends paging events, e.g. via Amazon SES)
  58–71. Anatomy of an outage: life cycle of the common Streaming Production Problem (Productio Problematis genus), built up step by step across these slides. Timeline: something bad happens → customer impact → someone notices (probably CS, hopefully our alerts) → prod alert → determine impact → determine (non-root) cause → figure out fix → deploy fix → recover service → go back to sleep. The intervals are labeled TTD(etect), TTC(ausation), TTr(epair), and TTR(ecover), and together they make up the outage time. (Slides courtesy of @royrapoport)
  72. Automate all the things!
  73. Automate all the things • Application startup • Configuration • Code deployment • System deployment
  74. Automation • Standard base image • Tools to manage all the systems • Automated code deployment
  75. The best ops folks of course already know this.
  76. Netflix has moved the granularity from the instance to the cluster
  77. The next level • Everything is “built for three” • Fully automated build tools to test and make packages • Fully automated machine image bakery • Fully automated image deployment
  78. (image slide)
  79. The Monkey Theory • Simulate things that go wrong • Find things that are different
  80. The simian army • Chaos -- Kills random instances • Latency -- Slows the network down • Conformity -- Looks for outliers • Doctor -- Looks for passing health checks • Janitor -- Cleans up unused resources • Howler -- Yells about bad things (see the sketch below)
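A toy version of the Chaos Monkey idea, to make the “kills random instances” bullet concrete. This is not Netflix's Simian Army (that is its own open-source project); it uses boto3, the opt-in tag name is invented, and DryRun=True keeps AWS from actually terminating anything.

```python
# Toy chaos monkey: pick one random opted-in instance and (dry-run) terminate it.
# Tag name is invented; assumes AWS credentials are configured for boto3.
import random
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2")

def pick_victim():
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": "tag:chaos-opt-in", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    ids = [i["InstanceId"]
           for page in pages
           for res in page["Reservations"]
           for i in res["Instances"]]
    return random.choice(ids) if ids else None

victim = pick_victim()
if victim:
    try:
        # The surviving cluster should keep serving when any one instance disappears.
        ec2.terminate_instances(InstanceIds=[victim], DryRun=True)
    except ClientError as err:
        # With DryRun=True, AWS replies "DryRunOperation" instead of acting.
        print(victim, err.response["Error"]["Code"])
```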
  81. War Stories
  82. April EBS outage
  83. August region failure
  84. Hurricane Irene: the outage that never was
  85. Questions?
  86. Getting in touch • Email: jedberg@gmail.com • Twitter: @jedberg • Web: www.jedberg.net • Facebook: facebook.com/jedberg • LinkedIn: www.linkedin.com/in/jedberg • reddit: www.reddit.com/user/jedberg
  87. (image slide)
  88. (image slide)
