Who here either works for or has used AWS? RightScale? Who has read and understood the full post-mortem for the April outage? (Note to self: post slides somewhere, make them available, and mention that in the preso.)
'What a country': my dad always says this, and I like it. So, one of our principal investors, Brad Feld, was in our offices recently, asking how AWS was working out for us. I replied very much in the positive, with a few exceptions regarding their support services. That night at dinner, Brad was talking about how resilient our culture is for entrepreneurs; how we can fail and retry here in the United States, doing things that folks might get strung up for in other countries. The following night, I found myself exploring analogies between that idea and computing systems, and wound up pulling out my phone and typing up a Twitter post.
It went something like this. This was going to be the brilliant culmination of my Twitter career to date. I was almost ready to hit the send button when I started getting alerts from our systems. The alerts were appearing literally right above what I had written: 'system DOWN'. Oh, the irony. Wish I had a screenshot from my phone.
That was the evening of 4/20, morning of 4/21 – the AWS outage. As you can see, it made the NYT. It lasted for a number of days; our API was intermittently affected for about 12 hours, and that could have been mitigated. That outage totally sucked, for many reasons. I'm hoping that by sharing some of my experience with AWS, you'll gain some insights that may help you prepare adequately. I'm also hoping this can turn into a conversation toward the end, so you can share your experiences as well.
So before I go on, a bit about me: my name is Jeff Malek. Grew up in Colorado, graduated in '93 from CU Boulder after 6 long years and a suspension, during which time I hitch-hiked around the country, winding up in Hawaii. Graduated, moved around, met some great friends, helped to start up a company. Was at Zango for 10 years, responsible for engineering, QA and product development teams distributed across three countries: 50+ people who built and maintained the high-transaction system that resulted in $75M yearly revenue at its peak. It leveraged the client-side software I wrote in C, which talked to backend systems built on Windows technology (.NET, IIS, MS SQL Server, etc.) sitting on co-located, purchased hardware.
BigDoor: over 2 years old. Funded by Foundry and Brad Feld in 2010. If you're familiar with airline point systems, you're familiar with loyalty programs. BigDoor provides a platform that powers social loyalty programs and game mechanics for digital communities. Think of it in terms of sharing your points with your friends and leveling up in the process. A free RESTful API that you can brand any way you want. I did a tech-stack pivot; we're built in the cloud on AWS using 99.99% free/open-source software – our backend systems are primarily Django/Python. Even after the outage, I'm still a huge fan of AWS, generally very impressed with what they've built and their speed of innovation. When was the last time you got a newsletter letting you know that a vendor's pricing was going down?
So what happened? Here's some quick background. AWS Regions are areas geographically separated by large distances, and contain Zones. In the US there are two Regions, us-east and us-west. Zones are a euphemism for 'data center'; each Region contains four data centers, in separate buildings.
Here's the region with four zones again. Within a zone, you can allocate block-level replicated storage that's optimized for consistency and low-latency read/write access to/from EC2 instances – otherwise known as EBS (Elastic Block Store). These EBS volumes are stored and replicated between nodes within a cluster, multiple times, for durability and availability. If one replica becomes unavailable or out of sync, a new replica is provisioned automatically. This is called re-mirroring, and while it's happening, access to that data is blocked for consistency. Old replicated blocks aren't released until the new replica is confirmed. Within a cluster, nodes are connected to each other via two networks: one high-bandwidth backplane, and a lower-bandwidth overflow-capacity network. The four zones, or data centers, are connected via control-plane services that coordinate user requests for EBS resources.
During scaling maintenance to upgrade primary network capacity, it's standard practice to shift traffic away from the primary to another router, but someone routed traffic to the lower-capacity network, essentially flooding it. Many nodes got disconnected from other nodes in the cluster and couldn't connect to their replicas. While the network was down, EBS API requests were queuing up, exacerbated by the fact that you can set a 'wait-timeout' on API requests. Then the primary network was restored. Affected nodes began trying to create replicas – the start of the 're-mirroring storm'. There was a bug that caused nodes to crash when closing large volumes of requests, resulting in more needing to re-mirror; on top of that, nothing was metering back these requests as they failed repeatedly – no exponential back-off. This exhausted the capacity of the cluster, putting about 13% of all volumes into a 'stuck' state. The regional control-plane services then became overloaded, and that's what made EBS services unavailable region-wide.
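That missing back-off is worth a sketch. Exponential back-off with jitter spaces out retries so a storm of failing requests doesn't hammer a recovering cluster in lockstep. This is a minimal illustration of the general technique, not anything from the AWS post-mortem; the function name and defaults are mine.

```python
import random

def backoff_delays(max_retries=5, base=0.5, cap=60.0, seed=None):
    """Exponential back-off with full jitter: retry N waits a random
    amount up to min(cap, base * 2**N), so repeated failures spread
    out instead of synchronizing into a retry storm."""
    rng = random.Random(seed)
    return [rng.uniform(0, min(cap, base * (2 ** attempt)))
            for attempt in range(max_retries)]

# Upper bounds grow 0.5s, 1s, 2s, 4s, 8s; actual waits are jittered below them.
delays = backoff_delays(seed=42)
assert all(0 <= d <= 0.5 * 2 ** i for i, d in enumerate(delays))
```

Had the re-mirroring requests been metered this way, the failures would have spread out instead of piling up.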
So that’s what I’m here to talk about: fault tolerance in the cloud
I want to talk about all of this in the context of our startup, of course. Ultimately, the AWS outage didn't result in any major changes to the way we do things. While there were a few smaller things that we bumped up the priority chain, there's a certain level of risk that a startup is willing to live with.
My girlfriend Jenny and I got Ruger as a puppy, right when BigDoor started. We raised him while building out our operational infrastructure, working out of our house. He's a great dog; love him to death. So he's kind of our mascot, and to help put things in context, I came up with a formula: the Ruger Fault Equivalency.
In other words, given a low tolerance for risk, you can create a highly fault-tolerant system if you have a lot of time and/or money. That's not BigDoor. Conversely, executing with a higher tolerance for risk gets you to market faster with less money, but with lower fault tolerance. For us, scalability is more important than extremely high fault tolerance. For a startup, time^2 is low (little time and money). So, fun and interesting, but what does it mean in the context of BigDoor system design? (TODO: add another pic; movie of play dead?)
I designed the BigDoor systems at a high level with this philosophy in mind. A bit more regarding our context: Django/Python. Week-long sprints that end in a production code release. A 260G+ and growing transactional database, so still not that big. Peak so far: 18MM API requests/day, so still a ways to go. Response times need to be faster than 500ms.
OK, given that context – how do you achieve the right amount of fault tolerance in the cloud? Three basic tenets, in the context of the AWS outage: the first sets a foundation for fault tolerance; the second leverages the first to improve fault tolerance; and the third will help keep your customers around when you're in crisis mode, ultimately also improving fault tolerance.
AMIs (Amazon Machine Images – install images; OS blueprints) are used to start new server instances. Leverage pre-built AMIs. Debian has great package management: packages are verified and tested before making it into the main line, so there's less to think about. Thank you, Eric Hammond. A good best practice: use a single master AMI, rebuilt regularly via automation with new software, new package patches (apt), and your application code. We then tag per environment (test, staging, production) and switch services (Apache, MySQL) on and off during boot via init scripts. Another good practice: all app code and software config is checked out via SVN and baked into the AMI; an svn up during boot via init scripts enables fast initialization during auto-scaling activities.
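A minimal sketch of that boot-time init-script idea: one master AMI, with an environment tag deciding which services come up on the instance. Everything here is hypothetical (the tag-file location, paths, and service lists are illustrative, not our actual scripts).

```shell
#!/bin/sh
# Sketch: boot-time service selection for a single master AMI tagged
# per environment. Tag file location is hypothetical.
ENV_TAG_FILE="${ENV_TAG_FILE:-/etc/bigdoor-env}"

services_for_env() {
  # Map an environment tag to the services that should run there.
  case "$1" in
    production)   echo "apache2" ;;         # app tier only; DB lives elsewhere
    staging|test) echo "apache2 mysql" ;;   # full stack on one box
    *)            echo "" ;;
  esac
}

boot() {
  env_tag="$(cat "$ENV_TAG_FILE" 2>/dev/null || echo test)"
  # Freshen the app code baked into the AMI (path hypothetical):
  #   svn up /var/www/app
  for svc in $(services_for_env "$env_tag"); do
    echo "would start: $svc"    # real script: /etc/init.d/$svc start
  done
}
```

The point is that the same image boots into any role; only the tag differs, which keeps auto-scaled instances fast to initialize.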
AWS has CloudFormation. They came out with that a few days after I'd finished pretty much doing the same thing: I wrapped the AWS command-line tools in shell scripts. Since we're a Python shop, we're likely going to be using boto (which has matured quite a bit in the last two years) and Fabric.
Nothing to see here, move along
Nothing to see here, move along
Who knows what this is a picture of? That's a picture of the IBM RAMAC, built in 1956, which had 5MB of storage and weighed a ton. We've come a long way, baby!
For anyone unfamiliar: if a system stops working when a part of it fails, that part is a single point of failure. So in every system there's potential for many single points of failure, proportional to system complexity. Because of the Ruger Fault Equivalency, the idea is to pick the right SPOFs and eliminate (or at least mitigate) them. I used the word 'elimination' here hoping it would make some folks chuckle; it's really not possible to eliminate all SPOFs. You can mitigate them, though. So here are some examples, and I'll drill into which ones are critical in our context.
If your cloud provider goes out of business, you're hosed. SPOF. In AWS, a region is… etc. If a region disappears, you're hosed. SPOF. Within regions are zones. If an entire zone fails, you're hosed. SPOF. Same with load balancers, application servers, databases. And even Fred: if Fred is the only guy who knows your operational systems, and he trips over the extension cord, knocking himself out in the process – you're hosed. SPOF. The critical ones in our context, and likely in many others: zones and everything below.
Should you attempt to achieve high fault tolerance through cloud-to-cloud failover? The Ruger Fault Equivalency says: cost-prohibitive (time squared). RightScale, who provide a very cool cloud management system, apparently have some of this functionality, and will likely be the place to go for cloud-to-cloud fail-over in the future.
The Ruger Fault Equivalency says: ditto – cost-prohibitive. If you try to migrate an ELB-balanced tech stack from one region to the next, you'll learn that: your ELB won't be able to route traffic between regions; EIPs can't be re-pointed from an instance in one region to another; your custom us-east (for example) AMI can't be used in the new region; your us-east security groups can't be used in the new region; and your snapshots can't be used to create new volumes in the new region. Do set up a DB replica in another region, if possible.
Ruger says: yes, even in light of the recent outage that affected the entire us-east region. It's not cost-prohibitive, and you get data-center fail-over. What about the recent AWS outage? A human error caused a major problem in one zone that had a ripple effect into the other zones. But ultimately, the downtime you suffered was in proportion to how well you were already leveraging other zones, and how dependent you were on EBS volumes. If all of your eggs were in the wrong zone, or you didn't have the right backup strategy in place – totally screwed. Otherwise – not so bad!
Our zone scenario, and why we were down intermittently for 12 hours during the AWS outage: before the outage, we had auto-scaling groups in two zones within a single region. At some point I brought everything into a single zone while debugging odd performance between the two, and conscientiously de-prioritized revisiting that in light of other priorities, figuring the single-zone group would at least scale with traffic. But I'd configured the group with a trigger to auto-scale when CPU spiked. Over time our application grew more resource-efficient, which meant CPU wasn't spiking, which meant we weren't scaling with traffic. That led to the lesson that it's better to scale on network I/O – or, now that AWS supports them, custom scaling triggers. We're in multiple zones again now; we recently saw the effects of an entire zone's application-server group going dark.
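The CPU-vs-network lesson can be sketched as a tiny decision function. The thresholds and metric names here are hypothetical, not our actual trigger config; the point is that a network-in signal keeps tracking request volume even after the app becomes CPU-efficient.

```python
def should_scale_out(cpu_pct, net_in_bytes_per_min,
                     cpu_threshold=80.0, net_threshold=50 * 1024 * 1024):
    """Return True if the auto-scaling group should add an instance.

    Scaling only on CPU misses traffic growth once the app becomes
    CPU-efficient; network-in keeps tracking request volume.
    """
    return cpu_pct >= cpu_threshold or net_in_bytes_per_min >= net_threshold

# An efficient app under heavy traffic: CPU looks calm, requests pour in.
assert should_scale_out(cpu_pct=35.0, net_in_bytes_per_min=120 * 1024 * 1024)
# A CPU-only policy would have returned False here -- our failure mode.
assert not should_scale_out(cpu_pct=35.0, net_in_bytes_per_min=10 * 1024 * 1024)
```

In AWS terms, this is the difference between a CloudWatch alarm on CPUUtilization alone and one that also watches NetworkIn (or a custom metric you publish yourself).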
Ruger says: don't even think about not doing it. What's generally worked for us: ELBs for same-region traffic distribution; auto-scaling groups to allow application-server fail-over, within a zone and across zones; and replication to put secondary fail-over database servers in other zones within a region.
What about Fred? Cut Fred some slack for tripping over the extension cord; we all make mistakes. You need Fred – assuming, that is, that he communicates what happened widely. If he doesn't, he's going to suffer the wrath of his internal and external customers.
Customers don't need a ton of detail; they need status updates and anything actionable. Does open communication increase fault tolerance? I'd argue yes: your customers will be more tolerant of your faults if you're open and clear about them.
At BigDoor, if there’s a crisis, our standard operating procedure identifies a single person responsible for stopping the team on an hourly basis to get status and determine what should be communicated externally, if anything. As much as we love him, we don’t involve our lawyer in that conversation, by the way.
In summary, these are the three tenets that I'm hoping will help you achieve the right amount of fault tolerance in the cloud: scripted repeatability, SPOF elimination, and clear-cut communication. All three of these are mentioned by AWS in one way or another in their post-mortem as things they planned to do to mitigate this for themselves going forward, by the way – including the better communication. Thanks again, WTIA; I'll be around if anyone wants to talk more about this stuff later. I also have some notes that describe the good and bad about AWS, available online here: TODO
AWS outage root-cause analysis: http://aws.amazon.com/message/65648/

Net effects:
- hours of high EBS API error and latency rates: 11
- days before affected data made available again in affected zone:
- peak 'stuck' volumes in other zones: 0.07%
- ultimately, 0.07% of volumes couldn't be restored, due to hardware failures
- 45% of RDS single-zone instances affected at peak; 0.04% unrecoverable
- 2.5% of multi-zone RDS didn't fail over, due to another bug

Tools: the good and the bad

ELBs
Good:
- quick to configure, auto-scaling load balancers
- can be used for fail-over within a region
Bad:
- no logging
- return 503s on error – you won't know unless you can monitor every request end to end (e.g. if there aren't instances that can service requests)
- name servers disregarding TTLs + auto-scaling = traffic-routing issues
- best practice: return custom HTTP headers in your responses so that you can distinguish calls during support incidents
- can't be used for fail-over between AWS regions; you need a separate DNS solution for funneling traffic

AMIs (Amazon Machine Images – install images; OS blueprints)
Good:
- leveraged a pre-built Debian AMI; Debian has great package management, which can be scripted – packages are verified and tested before making it into the main line, so there's less to think about
- thank you, Eric Hammond: http://alestic.com/
- scripted repeatability: script the non-interactive install of your tools
- can be used to stand up instances within a region
- best practice: a single master AMI built on top of a pre-existing one, rebuilt regularly with new software, app code and patches, via automation. Tagged.
- best practice: put app code and package configuration into SVN and include them in your AMI; svn up regularly or during instance start-up – faster for things like auto-scaling
Bad:
- can't copy/port AMIs from region to region easily
- not having the entire process scripted from the kernel up means a loss of flexibility (regional AMIs) and security
- pitfall: easy to get off track.
Didn't start out with a single script that installs everything, or stay diligent about including everything? Have fun re-doing all of that!

Security tools
- great article: http://trust.cased.de/AMID
- AMID script: http://code.google.com/p/amid/downloads/detail?name=AMID.py&can=2&q=

EC2 instances
Good:
- leverages AMIs
- obviously, scriptable, automated instance creation
- EIPs allow for easy, dependable service re-routing from one instance to another
- security groups are an easy way to firewall (and tag, before tags came out)
- zones allow easy fail-over within a geographic region (most of the time)
- regions provide the promise of fail-over between data centers that are more geographically separated (Virginia vs. California)
- init scripts allow you to create/update on a per-instance basis
Bad:
- security groups can't be added to or removed from an instance once it's running
- best practice: use a different group for each narrower category; e.g. instead of a 'database group', create groups for 'primary transactional DB server in production', 'replica...', etc.
- best practice: use a group that whitelists trusted IPs to give access to otherwise-unneeded ports and services
- regions don't allow easy fail-over; EIPs can't be mapped between them (at least not programmatically)
- can't port AMIs from region to region easily, so setting up to fail over region-to-region is difficult

EBS
Good:
- provides redundant storage for instances that can be snapshotted for easy backup and volume duplication within a region
Bad:
- volumes can't be created from snapshots across regions
- data loss: it happened (not to us, fortunately), so be prepared, and apply the amount of resources your risk tolerance allows
- poor I/O in general, specifically writes; typically this has only been an issue for us on our primary transactional DB servers
- best practice: RAID 0 array for the MySQL data directory, but make sure it's replicated and backed up

Auto-scaling
Good:
- n scaling groups in 1-4 zones behind an ELB provide same-region fail-over
- n instances in a scaling group
- CloudWatch monitors provide great stats
- previously, only limited scaling triggers were provided; the latest integrate CloudWatch much better, including custom metrics you define
Bad:
- lesson learned: we had no baselines for when to scale on anything other than CPU utilization, which at the time was easy to differentiate – we spiked; application improvements fixed the spikes, which in turn stopped the auto-scaling triggers
- need monitoring/alerting via Nagios or another tool? figure out how to (de-)register new instances during scaling activities
- this is changing – CloudWatch is getting better.
Do you trust Amazon's monitoring/alerting on Amazon's monitoring/alerting?

EMR
Good:
- great for async log analysis
- what's worked for us: centralized log hosts; Apache logs rotated via logrotate and rsync'd via cron, pre-processed, synced to S3, and drawn into an EMR/Hive cluster for aggregations and reporting
- Hive/HQL is very similar to SQL
Bad:
- asynchronous; takes a fair amount of time to munge data

S3
Good:
- available from anywhere, any region
- s3cmd is a great tool, for the most part
Bad:
- no full support for standard paths and directories
- …TBD

CloudWatch
Good:
- can monitor various services and trigger/alert when thresholds are crossed (e.g. ELB network-in)
- new: auto-scaling can leverage triggers more broadly, plus custom metrics (new)
Bad:
- no built-in ability to trigger/alert based on % change from a previous measurement
- console reports/graphs need a decoder tool and, most recently, appear buggy – but they've made big steps forward

AWS APIs
Good:
- API wrappers provided; allow for command-line scripting
- DRY: can (and should) script most things that repeat
- all done via scripts. A bit about our process and how the cloud fits well: 1-week sprints – lockdown Tuesdays, test overnight (uTest), release Wednesdays; test-first methodology; system tests for backend and other big changes, and for our API changes. Tested a new version of MySQL (Percona – recommended): http://screencast.com/t/yVf5RnaUN9 , http://screencast.com/t/WJaL2qiSR . Performance, integrity, load, and capacity tests require full-stack stand-up/tear-down, including a 230G+ DB backend.
Bad:
- keep your eye out for library updates (why not open-source these things? verify they're not already…)
- scripts and wrappers trail AWS innovation, which is fast; Bash isn't as well-known or readable as Python, for example – maintainability
- scripted stuff bakes you in a bit; no way around this without baking yourself into RightScale or some other solution anyway, though
- API key management: not straightforward
- API keys aren't portable between regions; region-to-region fail-over isn't as easy as it sounds. Not rocket science, either: bake region 1's keys into region 2's new AMI.

APIs – general
- build things test-first; run integrity tests before pushing out changes to your API
- don't version; make it backwards-compatible
- we try to keep away from anything that's going to lock us in too much
- we continue to shy away from SQS (Simple Queue Service), RDS (Relational Database Service), and SimpleDB (non-relational datastore)
- SQS and SimpleDB are proprietary; we'd prefer to avoid lock-in for these, and the need hasn't been high enough for us yet
- RDS doesn't provide enough flexibility for us; we'd love to use it as a replica pool for reads/reporting, though – can't
- multi-zone RDS suffered one of the biggest hits during the recent AWS outage

What we're looking forward to leveraging:
- new CloudWatch status, PUTs, and scaling triggers from them
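The centralized-log pipeline mentioned under EMR (rotate, rsync, pre-process, sync to S3, aggregate in Hive) hinges on the pre-processing step. Here's a minimal sketch, assuming Apache combined-format logs and a tab-separated layout for a Hive table; the field choice and function name are illustrative, not our actual pipeline.

```python
import re

# Apache combined log format, e.g.:
# 1.2.3.4 - - [21/Apr/2011:06:15:00 -0700] "GET /api/points HTTP/1.1" 200 512 "-" "ua"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<bytes>\S+)'
)

def to_hive_row(line):
    """Turn one access-log line into a tab-separated row for a Hive table,
    or None if the line doesn't parse. Hive/HQL then aggregates by path,
    status, etc., much like SQL."""
    m = LOG_RE.match(line)
    if not m:
        return None
    return "\t".join(m.group("ts", "ip", "method", "path", "status", "bytes"))

row = to_hive_row('1.2.3.4 - - [21/Apr/2011:06:15:00 -0700] '
                  '"GET /api/points HTTP/1.1" 200 512 "-" "ua"')
assert row == "21/Apr/2011:06:15:00 -0700\t1.2.3.4\tGET\t/api/points\t200\t512"
```

Running something like this over the rsync'd logs before the S3 sync is what makes the Hive aggregation step cheap: the cluster reads clean, delimited rows instead of raw log text.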
Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11
Retrospective from a startup built in the cloud: top 3 big lessons from the AWS outage on 04.21.2011 (plus 4,369 other smaller ones)
6/22/2011
What a country: entrepreneurial resiliency
(true story)
"robust systems: highly fault-tolerant, on or off grid. e.g.: our culture wrt entrepreneurs, AWS, the BD API"
Tenet #1: prepare a fault-tolerant foundation with scripted repeatability (aka automation)
Tenet #1: scripted repeatability. From the start: script the non-interactive install of your tools and OS; custom AMI; Debian: great package management; based on Eric Hammond's work: http://alestic.com/
Tenet #1: scripted repeatability, which will allow you to script the setup/tear-down of your stack
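The scripted stand-up/tear-down idea boils down to an ordered list of tiers. A toy sketch of the ordering logic (tier names hypothetical; the real scripts would invoke AWS tooling instead of returning strings):

```python
# Bring-up order for a simple three-tier stack; tear-down reverses it
# so nothing keeps serving traffic to a backend that's already gone.
STACK = ["database", "app-servers", "load-balancer"]

def stand_up(stack=STACK):
    # Start from the bottom of the stack upward.
    return [f"start {tier}" for tier in stack]

def tear_down(stack=STACK):
    # Stop in reverse order of bring-up.
    return [f"stop {tier}" for tier in reversed(stack)]

assert stand_up() == ["start database", "start app-servers", "start load-balancer"]
assert tear_down() == ["stop load-balancer", "stop app-servers", "stop database"]
```

Once the ordering lives in a script, a full-stack stand-up for a system test is one command, which is what makes the test counts on the next slide practical.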
Tenet #1: scripted repeatability, which will allow you to script system tests: integrity (3-4K tests), performance (30-40K tests), load and capacity (2-4M requests)
Tenet #1: scripted repeatability. A/B system test results: MySQL Percona upgrade
That's how 1 person set up and managed a network comprised of 90 +/- server instances for 1.5 years, while serving various other roles, without having to leave their chair. Try that with real hardware!
Tenet #2: SPOF elimination. We don't need no stinkin' single points of failure.
Tenet #2: SPOF elimination. Zone fail-over best practices: are you using auto-scaling? No: distribute server instances evenly between 2 or more zones. Yes: trigger scaling on network I/O or custom metrics.
Tenet #2: SPOF elimination. So it's actually all about reduction of the right SPOFs for your business context. Just adding the ability to fail over and have backups within a region is huge! Probably enough for most. What about Fred?