Maintaining reliability in an unreliable world
  • My name is Jeremy. Thanks for coming to see me this evening.
  • We live in an unreliable world. Things never go as we expect.
  • As the great Murphy said, if it can go wrong, it will.
  • The world of cloud computing was supposed to be our shining light and solve all of our problems.
  • But it isn’t all fluffy and white. Sometimes the clouds break too.
  • Which is why I’m here today: to tell you about how we maintain reliability in this unreliable world. So who am I?
  • I work for Netflix as the lead site reliability engineer. Public studies say that we are responsible for as much as 30% of all internet traffic on a Saturday night. We run almost the entire streaming service from the cloud, save for a few legacy systems that we haven’t moved from the DC yet.
  • I used to work for reddit. reddit is a community where people come together to share and discuss interesting things on the internet, such as links to other stuff, or to create their own content. It does more than 2 billion pageviews a month, all from EC2.
  • Uptime and money go hand in hand. With infinite money, you can probably get perfect uptime. But if you don’t have infinite money, you have to find the right uptime for your budget.
  • Luckily, the cloud makes it pretty easy to find the right balance, because you can leverage the providers’ economies of scale and the ease with which you can start and stop instances. Here are reddit’s costs and pageviews from when I was there, which is the last data I have.
  • At one end of the spectrum you have lots of control and lots of complexity, with App Engine and Heroku you have little, and Amazon (and Rackspace, etc.) is in the middle.
  • This was reddit’s server farm in 2008. Mostly I just like to brag about how clean it was.
  • With autoscaling, we pay for only the resources that we actually need.
  • At Netflix we use autoscaling to help manage reliability and cost. Here is one of our clusters scaling up and down. We are tuning for the holidays, so you can see parts where we are doing squeeze tests and adjusting the scaling speed and values.
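    To make the idea concrete, here is a minimal sketch of an autoscaling decision: pick a desired instance count from the current load, bounded by a floor and a ceiling. The 1,000 req/s target, the bounds, and the function name are illustrative assumptions, not Netflix’s actual policy.

        # Hypothetical scaling rule: size the cluster from requests per second,
        # never dropping below the "build for three" floor.
        def desired_capacity(total_rps, target_rps_per_instance=1000,
                             min_instances=3, max_instances=100):
            needed = -(-total_rps // target_rps_per_instance)  # ceiling division
            return max(min_instances, min(max_instances, needed))

        # 42,000 req/s at ~1,000 req/s per instance -> 42 instances.
        print(desired_capacity(42000))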
  • We need to make sure that we are doing everything we can to ensure our survival.
  • The key to surviving any outage is redundancy in your systems, be it the cloud or a datacenter. My teammate points out that this slide shows recursion, not redundancy.
  • Going from two (keypress) to three is hard.
  • Going from one (keypress) to two is harder. What do I mean by that? Anywhere you will need more than one of something (application process, database, cache, queue, whatever), it will be harder to go from one to two than from two to three, and so on. This is especially relevant in a cloud setting, since getting more resources is so easy.
  • (wait for animation) If possible, plan for three or more from the beginning. Sometimes your development cycle doesn’t allow it, but at least keep it in mind.
  • By building for three, you can reasonably lose one of your instances and still be stable.
  • And now, some database scaling.
  • We use four master databases. They are split up as Links/Accounts/Subreddits on the main DB, then separate DBs for the comments, votes, and everything else. Each has at least one slave; the comments have four slaves. When they get busy, we add a slave. Thanks to EC2, we don’t have to requisition servers! Keep writes fast by not reading from the master when it’s safe to read elsewhere. And lastly our data access layer, the “thing” layer (creative, isn’t it?); we wrote it because there was no good database ORM at the time.
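    As a rough illustration of that split (this is not the actual “thing” layer; the shard names and connection strings are made up), writes always go to a shard’s master while reads spread across its slaves:

        import random

        MASTERS = {
            "main":     "postgresql://main-master",      # links, accounts, subreddits
            "comments": "postgresql://comments-master",
            "votes":    "postgresql://votes-master",
            "misc":     "postgresql://misc-master",
        }
        SLAVES = {
            "main":     ["postgresql://main-slave-1"],
            "comments": ["postgresql://comments-slave-%d" % i for i in range(1, 5)],
            "votes":    ["postgresql://votes-slave-1"],
            "misc":     ["postgresql://misc-slave-1"],
        }

        def connection_for(shard, write=False):
            """Writes hit the master; reads are spread across that shard's slaves."""
            if write or not SLAVES[shard]:
                return MASTERS[shard]
            return random.choice(SLAVES[shard])

        print(connection_for("comments"))           # one of the four comment slaves
        print(connection_for("votes", write=True))  # the votes master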
  • Dynamo is the name of a database at Amazon that divides data in a way that gives high fault tolerance and durability.
  • Replication factor; quorum reads / writes.
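    The quorum rule is easy to state in code. This is a sketch of the Dynamo-style arithmetic only, not any particular client API: with replication factor N, reading R replicas and writing W replicas must overlap, so reads see the latest write, whenever R + W > N.

        def overlaps(n, r, w):
            """True when any R read replicas must intersect the W written replicas."""
            return r + w > n

        N = 3
        QUORUM = N // 2 + 1                      # 2 of 3

        print(overlaps(N, QUORUM, QUORUM))       # True: quorum reads + quorum writes
        print(overlaps(N, 1, 1))                 # False: fast, but a read may miss the newest write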
  • Did I mention caching is good?
  • Another kind of caching is CDN caching. Since a logged-out user isn’t voting or commenting, the pages look the same for all of them, so we render and cache the full page every 30 seconds. Then Akamai grabs it and caches it for 30 seconds as well. Akamai also accelerates the connection for our logged-in users.
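    A minimal sketch of that 30-second render cache, assuming the python-memcached client and a local memcached; render_frontpage() is a stand-in for the real page renderer, and the CDN side is only hinted at in a comment:

        import memcache

        mc = memcache.Client(["127.0.0.1:11211"])

        def render_frontpage():
            return "<html>fully rendered logged-out front page</html>"  # stand-in

        def frontpage_for_logged_out():
            html = mc.get("rendered:frontpage")
            if html is None:
                html = render_frontpage()                    # expensive full render
                mc.set("rendered:frontpage", html, time=30)  # expire after 30 seconds
            # Serve with a short max-age so the CDN can also cache it for ~30 seconds.
            return html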
  • Use consistent key hashing.
  • Only one chunk of data changes places.
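    A toy consistent-hash ring shows why: when a cache node is added, only the keys that now hash to the new node move. (Real clients, e.g. ketama-style hashing, use many virtual nodes per server; the node names below are made up.)

        import bisect, hashlib

        def h(value):
            return int(hashlib.md5(value.encode()).hexdigest(), 16)

        class Ring:
            def __init__(self, nodes):
                self.points = sorted((h(n), n) for n in nodes)
                self.keys = [p for p, _ in self.points]

            def node_for(self, key):
                i = bisect.bisect(self.keys, h(key)) % len(self.points)
                return self.points[i][1]

        before = Ring(["cache-a", "cache-b", "cache-c"])
        after = Ring(["cache-a", "cache-b", "cache-c", "cache-d"])
        moved = sum(before.node_for(k) != after.node_for(k)
                    for k in ("link:%d" % i for i in range(10000)))
        print("keys that moved:", moved)   # only the keys now owned by cache-d; the rest stay put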
  • Here’s a memcache tip (or Cassandra, or whatever eventually consistent system you want) for locking. To get reasonable locking: get the data, set the data, wait for whatever your 95th-percentile latency is, then do another get. If it is still your lock, then you can reasonably say you have the lock.
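    Here is that get-set-get trick as a sketch, again assuming python-memcached; the 50 ms “p95” wait and the 30-second TTL are illustrative values, not a recommendation:

        import time, uuid, memcache

        mc = memcache.Client(["127.0.0.1:11211"])

        def try_lock(name, p95_latency=0.05, ttl=30):
            """Return a lock token if we won the race, else None."""
            token = uuid.uuid4().hex
            if mc.get("lock:" + name):               # get: someone already holds it
                return None
            mc.set("lock:" + name, token, time=ttl)  # set: stake our claim
            time.sleep(p95_latency)                  # wait out the propagation window
            if mc.get("lock:" + name) == token:      # get: did our claim stick?
                return token
            return None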
  • (step through) Render cache (5GB): we store bits of HTML in here, like the HTML for a single link in a listing, with holes for filling in the custom info like the arrow direction and points, as well as fully rendered pages for non-logged-in users. Data cache (15GB): any time we need data from the database, we check the data cache first, and if we don’t find it, we put it there. That means that the data for all of the popular items will always be in the cache, which is great for a site like ours, where certain data is usually popular for a day and then falls out of favor for newer data. We also put memoized results in the data cache. Permacache (10GB): this is where we store the results of long database queries, such as all of the listings, profile pages, and such. At Netflix we have far more cache than that, and use it heavily.
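    The data cache described above is the classic check-then-fill pattern. A minimal sketch (python-memcached again; fetch_from_db() is a placeholder for the real query):

        import memcache

        mc = memcache.Client(["127.0.0.1:11211"])

        def fetch_from_db(thing_id):
            return {"id": thing_id, "title": "placeholder row"}   # stand-in query

        def get_thing(thing_id):
            key = "thing:%s" % thing_id
            thing = mc.get(key)
            if thing is None:              # miss: hit the database once...
                thing = fetch_from_db(thing_id)
                mc.set(key, thing)         # ...then keep it hot for the next reader
            return thing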
  • But usually they don’t, which makes hiding the failures easier.
  • Every vote generates a queue item for later processing. We just update the render cache with the vote until it is processed. It also puts a job in the precompute queue to recalculate that listing. That is why it sometimes seems like the vote totals aren’t accurate: they are consistent in the database, it is just a rendering issue. Every time a comment is written, it generates a job to have the comment tree for that page recalculated, which is then stored in the cache. Every time a link is submitted, it generates a job to go out and get a thumbnail, a job to recalculate the listing, and a job for the spam filter. When a moderator bans or unbans a link, it generates a job to train the spam filter on that data. Queues are extra important in a virtualized environment, because if you lose an instance you don’t have to worry about the lost work, as long as your queue is redundant.
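    A sketch of that vote path with hypothetical helper names; Python’s stdlib queue stands in here for a durable, redundant broker:

        import queue

        vote_queue = queue.Queue()

        def update_render_cache(link_id, user_id, direction):
            pass  # stand-in: patch the cached HTML so the arrow flips immediately

        def cast_vote(user_id, link_id, direction):
            update_render_cache(link_id, user_id, direction)   # user sees it right away
            vote_queue.put({"user": user_id, "link": link_id, "dir": direction})

        def vote_worker():
            while True:
                job = vote_queue.get()
                # persist the vote, recompute the affected listing, re-run spam checks
                vote_queue.task_done()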
  • Amazon will help you as well. One way they do this is by providing zones. Each zone is like an island that is loosely connected to the other zones, but mostly distinct.
  • So how do you get better than 99.95% uptime? Multiple zones! By spreading your systems out across multiple zones, you should be able to withstand the failure of one zone. In a little bit, I’ll go over how reddit and Netflix used a multi-zone strategy to survive outages.
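    A toy illustration of the spread (zone names are examples only): place instances round-robin across three zones, so losing any one zone still leaves two thirds of the capacity serving traffic.

        from itertools import cycle

        ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

        def place_instances(count):
            placement = {zone: [] for zone in ZONES}
            for i, zone in zip(range(count), cycle(ZONES)):
                placement[zone].append("app-%02d" % i)
            return placement

        for zone, hosts in place_instances(9).items():
            print(zone, hosts)   # 3 per zone; any single zone failure leaves 6 running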
  • Amazon, as well as other providers, offers multiple regions as well. Regions are essentially like separate providers with the same feature set. Your data does not get shared across regions.
  • Multiple zones; some DB slaves are in different zones from the master for redundancy. Monolithic, highly cached.
  • Service-oriented architecture is basically just a bunch of small reddits all talking to each other. It is easier for larger groups of devs; more scalable, since you just scale out the services that are overloaded; more focused optimizations, since you tune just the ones that are biggest; and easier to scale down.
  • The gateway classifies and routes events based on severity and the systems involved. The gateway currently processes around 48K events a day.
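    In the spirit of that gateway (this is not Netflix’s implementation; the team map, severities, and channels below are invented for illustration), classification and routing can be as simple as:

        SEVERITIES_THAT_PAGE = {"critical", "major"}       # these wake someone up
        TEAM_FOR_SYSTEM = {
            "streaming-api": "api-team",
            "billing":       "billing-team",
        }

        def route(event):
            """Return (team, channel) for an incoming monitoring event."""
            team = TEAM_FOR_SYSTEM.get(event["system"], "core-sre")
            channel = "pager" if event["severity"] in SEVERITIES_THAT_PAGE else "email"
            return team, channel

        print(route({"system": "billing", "severity": "critical"}))     # ('billing-team', 'pager')
        print(route({"system": "new-service", "severity": "warning"}))  # ('core-sre', 'email')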
  • Automate as much as you can.
  • The more automated things are, the easier it is to be a sysadmin: application startup, configuration, code deployment, full system deployment. The more automated things are, the easier it is to scale, especially in a virtualized environment with autoscaling. And virtualized computing added the last bit: the ability to automate system deployment. (OK, that’s not entirely true, but watch me wave my hands and say it is.)
  • In most places, you have this: a standard image, with tools to manage the systems and the deployment.
  • It’s manifested every day by writing scripts and programs to do your repetitive tasks for you. People basically figured out how to do this with whole computers instead. In case you’re wondering, that picture is what came up when I googled for “lazy sysadmin”.
  • In most systems, you worry about the software and installing it on an OS. At Netflix, the smallest thing we worry about is the instance image, which lives in a cluster. We’ve essentially built a platform for doing automated deployment of Java code (and some Python too!).
  • My friends Joe and Carl already told you about NAC and our build system. This allows the devs to take control of their deployment. Each team is responsible for their own deployments and uptime. When something breaks, we have a system that lets us page a team, who then gets on and fixes their stuff. Each team is responsible for their own destiny. So how do we stay reliable when we have no control? Information.
  • This is an example of the task list from NAC. It shows me and everyone else all of the actions people have taken in our Amazon infrastructure. This lets everyone know when it is safe to deploy, what is going on, etc. Right now my team is building additional tools to provide information so other teams can make good decisions.
  • In mid-August, a hurricane slammed the East Coast. Amazon warned of a possible zone outage.
  • I’m here for you to learn. If you have any questions, please jump in; this will be boring for both of us otherwise. And I’ve got a couple of hours to fill anyway.
  • You can contact me in one of these ways, or ask your question now. Thank you.
  • Speaking of BttF, here is a DeLorean towing a DeLorean.

Maintaining reliability in an unreliable world: Presentation Transcript

  • Jeremy Edberg (Tweet @jedberg with feedback!)
  • We live in an unreliable world
  • Maintaining reliability in an unreliable world
  • Netflix
  • reddit
  • A quick rant • I hate the term “Cloud Computing” • I use it because everyone else does and no one would understand me otherwise • One alternative is “Virtualized Computing” • I don’t have a better alternative
  • Agenda • Money • Architecture • Process (or lack thereof)
  • Reliability and $$
  • Monthly Page Views and Costs for reddit (chart: pageviews from 200M to 1,300M per month against costs from $20,000 to $130,000 per month, March through the following March)
  • The control and complexity spectrum (from lots to little)
  • Why did Netflix move out of the datacenter? • They didn’t know how fast the streaming service would grow and needed something that could grow with them • They wanted more redundancy than just the two datacenters they had • Autoscaling helps a lot too
  • Netflix autoscaling (chart: traffic peaks and the cluster scaling up and down)
  • Agenda • Money • Architecture • Process (or lack thereof)
  • Making sure we are building for survival
  • Building for redundancy
  • 1 > 2 > 3: Going from two to three is hard
  • 1 > 2 > 3: Going from one to two is harder
  • Build for Three: If possible, plan for 3 or more from the beginning.
  • “Build for three” is the secret to success in a virtual environment
  • All system choices assume some part will fail at some point.
  • Database Resiliency with Sharding
  • Sharding • reddit split writes across four master databases • Links/Accounts/Subreddits, Comments, Votes, and Misc • Each has at least one slave in another zone • Avoid reading from the master if possible • Wrote their own database access layer, called the “thing” layer
  • Cassandra
  • Cassandra • 20-node cluster • ~40GB per node • Dynamo model
  • Cassandra
  • How it works • Replication factor • Quorum reads / writes • Bloom filter for fast negative lookups • Immutable files for fast writes • Seed nodes • Multi-region
  • Cassandra Benefits • Fast writes • Fast negative lookups • Easy incremental scalability • Distributed -- no SPoF
  • I love memcache: I make heavy use of memcached
  • Second class users • Logged out users always get cached content. • Akamai bears the brunt of reddit’s traffic • Logged out users are about 80% of the traffic
  • (diagram: keys 1, 2, 3 hashed onto cache nodes A, B, C)
  • (diagram: the same ring after adding node D; only one key moves)
  • Get - Set - Get
  • Caching is a good way to hide your failures
  • reddit’s Caching • Render cache (5GB): partially and fully rendered items • Data cache (15GB): chunks of data from the database • Permacache (10GB): precomputed queries
  • Sometimes users notice your data inconsistency
  • Queues are your friend • Votes • Comments • Thumbnail scraper • Precomputed queries • Spam (processing, corrections)
  • Going multi-zone
  • Benefits of Amazon’s Zones • Loosely connected • Low latency between zones • 99.95% uptime guarantee per zone
  • Going Multi-region
  • Leveraging Multi-region • 100% uptime is theoretically possible. • You have to replicate your data • This will cost money
  • Other options • Backup datacenter • Backup provider
  • Example (diagram)
  • Example 2 (diagram)
  • Agenda • Money • Architecture • Process (or lack thereof)
  • Monitoring • reddit uses Ganglia • Backed by RRD • Netflix uses RRD too • Makes good rollup graphs • Gives a great way to visually and programmatically detect errors
  • Alert Systems (diagram: an alerting gateway connecting the CORE paging event service, Amazon SES, the CORE Amazon agent, CORE API agents, and other teams’ agents)
  • Anatomy of an outage: life cycle of the common Streaming Production Problem (Productio Problematis genus). Timeline: something bad happens → customer impact → someone notices (probably CS, hopefully our alerts) → prod alert → determine impact → determine (non-root) cause → figure out fix → deploy fix → recover service → go back to sleep. The intervals in between are TTD (detect), TTC (causation), TTr (repair), and TTR (recover); together they make up the outage time. (Slide courtesy of @royrapoport)
  • Automate all the things!
  • Automate all the things • Application startup • Configuration • Code deployment • System deployment
  • Automation • Standard base image • Tools to manage all the systems • Automated code deployment
  • The best ops folks of course already know this.
  • Netflix has moved the granularity from the instance to the cluster
  • The next level • Everything is “built for three” • Fully automated build tools to test and make packages • Fully automated machine image bakery • Fully automated image deployment
  • The Monkey Theory • Simulate things that go wrong • Find things that are different
  • The simian army • Chaos -- kills random instances • Latency -- slows the network down • Conformity -- looks for outliers • Doctor -- looks for passing health checks • Janitor -- cleans up unused resources • Howler -- yells about bad things
  • War Stories
  • April EBS outage
  • August region failure
  • Hurricane Irene: the outage that never was
  • Questions?
  • Getting in touch • Email: jedberg@gmail.com • Twitter: @jedberg • Web: www.jedberg.net • Facebook: facebook.com/jedberg • LinkedIn: www.linkedin.com/in/jedberg • reddit: www.reddit.com/user/jedberg