Many images in this deck are used under Creative Commons Attribution, NonCommercial, ShareAlike licenses.
This presentation is a condensed version of the workshop arranged by Brent Stineman and Mike Wood for the 2013 CodeMash Precompiler days. The original was 4 hours long and included some discussion time. This version is scaled down and is mostly presentation only.

Comic: http://www.xkcdefg.com/?id=23 – XKCDEFG – Used under Creative Commons Attribution, Non Commercial license
Adam Savage from Mythbusters (http://dsc.discovery.com/tv-shows/mythbusters) has been quoted as saying, “Failure is ALWAYS an option.” While in his context he is referring to tests having unintended results, cloud developers can take that to heart as well. There WILL be failures in your cloud solution, either through your own code or in the services your solution relies on. The trick to dealing with failures is having a plan of action for when they occur and learning from past issues.

Image: From Discovery.com, used here under Fair Use.
Ultimately, what we’re after is protection from outages. These can be hardware failures, data corruption, network failures, or even a disaster that results in the loss of facilities. This also raises the distinction between being accessible and being available. A service that is experiencing an outage and isn’t accessible is the worst possible scenario: there’s no way for it to communicate with its end users that there is an issue. But if we instead make sure it’s accessible, even if not fully available, then a service running in a degraded state at least has options.

Images: Microsoft Office Clip Art. Godzilla is copyright of Godzilla Releasing Corp (as far as I can tell) and is used here under the Fair Use clause.
Image: From FOX, used here under Fair Use.
Architecting resilient solutions. Resilient solutions are capable of adapting to an outage situation: they not only recover when normal operating conditions return, but can also change their functionality, reducing what they do to, at worst, a bare-function mode.

Image: Free to use from MorgueFile - http://www.morguefile.com/license/morguefile - http://www.morguefile.com/archive/display/141413
Every choice you make in support of the goal of resiliency comes at a cost. You’ll need to weigh the cost of implementing a way to handle each failure against the likelihood of that failure occurring, your ability to handle it, and the risk it poses to your business. You may find that some decisions are very easy and you will accept the risk; at other times you’ll discover that you will pay quite a bit to ensure that you can overcome certain failures. In terms of uptime, like five nines, the more nines you add, the more zeros it will cost you.

Just keep in mind through the rest of the presentation that the patterns and solutions we talk about are options for you. You may think they are overkill in some situations, but that really comes down to the risk of your site or services being down and the cost of that downtime to the business. There are companies that will lose millions an hour if their service is offline, and there are companies that could disappear off the internet for days without affecting their bottom line in any measurable way.

Image: Office Clip Art
Image: Image taken and provided by NASA. Public Domain. - http://en.wikipedia.org/wiki/File:Mission_control_center.jpg
You’ve probably seen science fiction where a scan is taken of a person and a real-time representation of the injury is displayed on a screen or in a hologram, giving the doctors a wealth of information about the state of their patient. Just imagine how much better healthcare would be if your body was constantly producing information about how it is doing. We’d be able to detect oncoming illness, find tumors before they become a problem, and even see how changes to outside stimulus and medications affect the body as they are introduced.

Quite frankly, this isn’t really science fiction for your applications. All it takes is some effort to provide functional transparency into your system. You need to properly instrument your applications and your hardware or platform.
- Properly instrument your applications
- These are remote machines
- Visibility into failures may be reduced
- Allow remote interaction, tweaks to instrumentation behavior

Image: Office Clip Art
Find tools that can help you make sense of the vast amount of data you are going to be pulling in. Tools like AzureWatch, Cerebrata Azure Management Studio (in the framed image), New Relic, and others provide a way to view everything that’s going on in a solution. This level of “functional transparency” into not just what’s happening but how can be critical in helping you diagnose what’s happening in your solution before an outage occurs. It’s also helpful in identifying how things behave when conditions are optimal, letting you tune things to further improve the system. There are tons of different options out there, so find one that fits your platform and requirements well, and start gathering metrics.
Capturing the data is only a small part of monitoring. You need to develop ways to analyze the data to help detect issues as, or preferably BEFORE, they happen. Also, keep key application indicators over time so you can make comparisons. For example, I used to work for a company called Cumulux, which produced a service for monitoring applications hosted in Windows Azure. While the product team was developing the application they were also dog-fooding it. They started to notice that the memory usage on their worker was steadily increasing compared to past metrics. Since they kept enough data around to do analysis, they were able to track back to when the increase started and which build introduced the leak. Then it was just a matter of tracking down the likely changes in the code, which led to the correction of the memory leak.

Use this data to also drive your decisions on where resiliency is needed most. If there is a feature of your site that is used more during a certain month, day, or hour, provide more resources to maintain that feature during those periods.

Image: Image taken and provided by NASA. Public Domain. - http://spaceflight.nasa.gov/gallery/images/station/crew-31/html/jsc2012e054351.html
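The Cumulux story boils down to comparing recent metric values against a historical baseline. A minimal sketch of that idea in Python (the metric values, window size, and threshold here are invented for illustration):

```python
from statistics import mean

def detect_drift(history, window=5, threshold=1.2):
    """Flag when the recent average of a metric exceeds the
    long-term baseline by more than `threshold` times."""
    if len(history) < window * 2:
        return False  # not enough history to compare yet
    baseline = mean(history[:-window])  # everything before the recent window
    recent = mean(history[-window:])    # the most recent samples
    return recent > baseline * threshold

# Steady memory usage (MB): no alert.
steady = [100, 101, 99, 100, 102, 100, 101, 99, 100, 101]
# A slow leak: recent samples creep well above the baseline.
leaking = [100, 101, 100, 102, 105, 120, 135, 150, 170, 190]
print(detect_drift(steady))   # False
print(detect_drift(leaking))  # True
```

Real monitoring tools do far more (seasonality, percentiles, alert routing), but keeping enough history to compute a baseline is what made the leak traceable to a specific build.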
Image: Microsoft Office Clip Art
Now traditionally we’ve been able to depend on the uptime of hardware. But in an increasingly complex world, where things are interconnected, we’re learning that this isn’t always enough. We need to understand the different points of failure and how to address them.
- Machine/App crashes – have multiple copies and redirect traffic
- Throttling – know the limits of the services/resources you are using and how to handle the errors that occur
- Connectivity/Network – resource connectivity is much more fluid, so you need to know how to adjust as things move around
- External Service Dependencies – as you build dependencies on external services, what happens when they fail? Learn to adjust and move on.

Image: Microsoft Office Clip Art
Now if you’ve slung code, you can likely guess what this code snippet is doing. I know I’ve written this kind of exception handling block hundreds if not thousands of times. But this approach addresses the symptom, not the problem. The solution doesn’t react to the exception or take any action on it aside from perhaps logging the issue. This is where resiliency comes into play…
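The snippet on the slide isn’t reproduced in these notes, but the contrast can be sketched in Python: instead of catching, logging, and moving on, the handler reacts by degrading to a fallback. The service name, exception type, and fallback values below are hypothetical stand-ins:

```python
import logging

class TransientServiceError(Exception):
    """Raised when a downstream dependency is temporarily unavailable."""

def call_recommendation_service():
    # Stand-in for a real remote call; it always fails here so the
    # fallback path below is exercised.
    raise TransientServiceError("service throttled")

FALLBACK_RECOMMENDATIONS = ["top-seller-1", "top-seller-2"]

def get_recommendations():
    try:
        return call_recommendation_service()
    except TransientServiceError as ex:
        # React, don't just record: log the failure AND degrade to a
        # cached/static result so the caller still gets something useful.
        logging.warning("recommendations unavailable: %s", ex)
        return FALLBACK_RECOMMENDATIONS

print(get_recommendations())  # ['top-seller-1', 'top-seller-2']
```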
Steer clear of having a monolithic executable. Don’t have a web site that does everything; don’t have a processing role that performs all processing. Monolithic solutions are hard to scale and hard to make resilient. The smaller the pieces, the more composable they become. Think about LEGOs and how you can use the same pieces to produce vastly different results. This also lets you break apart your solution so that it can be scaled separately, made redundant in the right places, etc.

Image: Michael Wood, used here under Creative Commons Attribution, Non Commercial, Share Alike license
Story regarding water towers keeping water on hand both for pressure and for capacity buffering.

Another often overlooked method of helping avoid capacity-based request throttling is using various types of caching strategies. We can help overcome temporary capacity constraints by buffering content: offloading delivery work to content delivery networks, using caches to store frequently accessed materials, or even leveraging local disk-based caches for large files so we’re not constantly retrieving them from other storage systems. We’ll talk more about CDNs during our Scale and Reach discussion as well.

Image: Free to use from MorgueFile - http://www.morguefile.com/license/morguefile - http://www.morguefile.com/archive/display/162891
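As a hedged illustration of the local disk cache idea, here is a Python sketch that keeps a copy of fetched content on disk so repeat requests skip the origin. The fetch function is a stand-in for a real blob-storage or origin call:

```python
import hashlib
import pathlib
import tempfile

CACHE_DIR = pathlib.Path(tempfile.gettempdir()) / "content-cache"

def fetch_from_origin(name):
    # Stand-in for a slow fetch from blob storage or another origin.
    return ("contents of %s" % name).encode()

def get_content(name):
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(name.encode()).hexdigest()  # filesystem-safe name
    path = CACHE_DIR / key
    if path.exists():
        return path.read_bytes()        # cache hit: no trip to the origin
    data = fetch_from_origin(name)      # cache miss: fetch once...
    path.write_bytes(data)              # ...then keep a local copy
    return data

first = get_content("big-video.mp4")
second = get_content("big-video.mp4")   # this one is served from local disk
```

A real implementation would also need expiry and size limits; the point is simply that a temporarily unreachable or throttled origin stops being a hard dependency for content you already have.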
Carrying extra capacity can also help in the case of an outage. If you have two clusters, each running some extra capacity, when one fails you can redirect its traffic to the remaining cluster. While that cluster may be overutilized and running in a degraded state, you are at least able to keep running while you work to increase your capacity. Yet again, this is capacity buffering.

Image: By Kevin Rosseel – From MorgueFile – Free to use - http://mrg.bz/VHrLhJ
Ever been to an amusement park? The rides at amusement parks are one of the greatest examples of request buffering you’ll find. Disney handles lines efficiently. Early in the morning, or at less crowded times, the lines are short, both time-wise and physically. As more and more people show up, you’ll see them roping off more areas and making people snake through longer lines. This is request buffering: they’re forming a queue. For those who can’t wait, they even have what is called Fast Pass. This lets you grab a ticket for a ride during a certain timeframe and come back at that time to get into a much shorter line.

We see the same type of patterns in handling requests when we get too many, or have finite resources to process them. We create queues to store up requests and respond when we can, or we have callers retry.

Another thing we need to learn to address is how to handle transient, temporary issues. These can be caused by temporarily exceeding capacity, or just momentary losses in connection. By implementing approaches that allow for request buffering, such as throttling back and disconnected designs that let work be queued up and processed as capacity becomes available, we can ride these issues out. One approach is to implement retry policies using things like the Windows Azure Transient Fault Handling Application Block. This allows errors to be retried and even supports backoff policies so we can slow down the frequency with which we retry. Another approach is to go entirely asynchronous and simply log the request so we can process it later. The techniques can even be combined.

Note: If you aren’t investing in designing asynchronous workflows, especially for long running processes, start looking into it now.

Image: Joe Shlabotnik - http://www.flickr.com/photos/joeshlabotnik/3424253515 – Attribution, NonCommercial, Share Alike license.
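The Transient Fault Handling Application Block mentioned above is a .NET library, but the retry-with-backoff pattern it implements can be sketched in Python. The flaky operation, attempt count, and delay values are invented for illustration:

```python
import random
import time

def with_retries(operation, max_attempts=4, base_delay=0.1):
    """Retry an operation on transient errors, doubling the delay each
    attempt (exponential backoff) plus a little jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of retries: surface the error to the caller
            delay = base_delay * (2 ** (attempt - 1))
            delay += random.uniform(0, base_delay)  # jitter spreads out retry storms
            time.sleep(delay)

# A stand-in operation that fails twice, then succeeds.
calls = {"count": 0}
def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient glitch")
    return "ok"

print(with_retries(flaky))  # prints "ok" after two retried failures
```

The backoff (and especially the jitter) matters: if every client retries at the same instant, the retries themselves become the next capacity spike.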
So another approach to increasing availability is running additional copies of your solution, usually somewhere else: increased redundancy. These copies are spun up completely isolated (in another datacenter, even) with no common points of failure. They can exist in different ready states, depending on the financial investment you want to make. The hotter the ready state, the more costly the solution is likely to be, so you’ll want to pick a scenario that meets your requirements.

Question: why wouldn’t we want to run a redundant copy in the same datacenter?
Answer: this creates an aggregate view, since both solutions now share common points of failure.

Title: Firesign Theatre reference
Image: by Mr. White – MorgueFile – Free to use - http://mrg.bz/GtwqPu
Calculating a redundant SLA. Let’s say I have completely separate copies of my solution that are in no way dependent on each other, perhaps in different datacenters. In this example, each copy would have 95% availability (using our previous figure). But because a failure in one doesn’t impact the other, an outage to our overall availability is based on the probability of outages occurring in both silos at the same time. This can be represented by the formula 5/100 * 5/100, which gives us a probability of a complete outage of 0.25%, or 99.75% availability. Now there’s an important “but” to this figure: since we’re arriving at it via a probability calculation, we’re actually gambling on the chance that an outage will impact both solutions at once.
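The arithmetic above generalizes to any number of independent silos; a small Python sketch:

```python
def combined_availability(silo_availability, silos):
    """Availability of N independent silos: the whole system is down
    only when every silo happens to be down at the same time."""
    chance_all_down = (1 - silo_availability) ** silos
    return 1 - chance_all_down

print(combined_availability(0.95, 2))  # the 99.75% from the notes
print(combined_availability(0.95, 4))  # roughly 99.999% ("five nines")
```

The caveat from the notes applies to the code too: this formula is only valid if the silos really are independent, which is exactly why a redundant copy in the same datacenter doesn’t buy you this math.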
When calculating how fast you can be “back online”, you need to take all the activities into account. If you have to navigate a complex, manual process, this will slow things down. But if your solutions can react in an automated manner, you reduce outages and in some cases even hide some of them entirely.

By introducing the proper outage behaviors into your solutions and taking the proper steps to ensure that your organization’s processes support, and don’t hinder, reacting to issues, you can help minimize the downtime. This session focuses on what you as a solution architect can do.

Image: Microsoft Office Clip Art
Addresses are important when distributing components. As services and resources move around, make sure you have provided a mechanism to easily change their addresses for consumers of those resources. For example, make sure that your storage account URI or connection string is easily changed so that if you need to switch over to your secondary account you can do so quickly. The same goes for queue paths and databases.

Image: Free to use from MorgueFile - http://www.morguefile.com/license/morguefile - http://www.morguefile.com/archive/display/160055
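A minimal Python sketch of the idea, reading the endpoint from configuration so a failover needs no code change. The setting name and URLs are hypothetical; a real Windows Azure solution would read its service configuration instead of environment variables:

```python
import os

# Hypothetical default; in practice this would live in your platform's
# configuration system so it can change without a redeploy.
DEFAULT_STORAGE_ENDPOINT = "https://primary.example.com/"

def get_storage_endpoint():
    # Prefer the configured value so operators can point the solution
    # at a secondary storage account without touching code.
    return os.environ.get("STORAGE_ENDPOINT", DEFAULT_STORAGE_ENDPOINT)

print(get_storage_endpoint())           # the primary, by default
os.environ["STORAGE_ENDPOINT"] = "https://secondary.example.com/"
print(get_storage_endpoint())           # now the secondary, no code change
```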
NASA is a huge believer in redundant systems. Think of all the moving parts on a space shuttle, rocket, or rover. There are literally thousands of things that could go wrong, and each one could mean the end of a mission, or a life. A great number of redundant systems are put in place and failover systems are used. Even if the failover system doesn’t provide 100% of the functionality of the primary, some usage is still possible.

The space shuttle had four synchronized general purpose computers on board, all running the exact same code. Each answer arrived at by one of the computers was compared to the other three for verification before moving on to the next instruction. If one of the computers went out, there were still three others to keep going.

Note that redundancy doesn’t protect you from everything. During a training session in the simulator, astronauts were practicing a transatlantic landing sequence, and as part of that procedure they dumped fuel. When they initiated that operation, all four computers locked up and a big red X appeared on the screens. All four computers ran the same instruction, which pointed to an empty location in memory, and thus no redundancy could have fixed that. Now, the shuttle did have a fifth computer which ran a completely different set of software and could only help in takeoff and landing scenarios. This degraded experience is much better than an unrecoverable failure.

Image: Michael Wood, used here under Creative Commons Attribution, Non Commercial, Share Alike license
Virtualization provides us a greater amount of control and flexibility to help design, scale, and harden our solutions. The automated recovery systems built into platforms like Windows Azure and Amazon Web Services are based on the idea that you can move the solution around to healthy systems simply because it is a virtualized load. This is great because it gives our systems “self-healing”.

Image Credit: Gizmodo.
Having scripts that can help automate your processes and recovery measures will help reduce the mean time to recover from failures. However, we want to be careful about having automation without checks and balances. After all, we’re not out to create the next SkyNet. Image: Image property of Orion Pictures, used here under Fair Use act.
On leap day in 2012, Windows Azure experienced its largest and most visible service disruption to date. Outages happen, but what’s important is that in this case the automated recovery systems had a “human intervention” or “hi” point. When the systems reached this point, they stopped taking action and alerted the support teams that something was dramatically wrong.

Image: Office Clip Art
Highscalability.com -> every site will hit a point where it fails. Know what that point is and handle it gracefully. Focus less on server uptime.

As you scale your system you need to identify possible points of failure. For each point of failure, assess a risk level and determine if and how you will deal with it. Think about the recent landing of the Mars Curiosity rover by NASA. That landing process was extremely complex, with any number of things that could go wrong. Each one was reviewed and analyzed, and a conscious decision was made to either deal with the problem or accept the risk. In some cases the full cost of the mission ($2.5 billion) was risked because the cost to provide a backup was either deemed too high or simply impossible to manage.

The best advice I can give you right now for dealing with failure is to actually deal with it. Have plans in place. Do failure assessments on your designs to find all the holes and possible points of failure. Then, just like NASA, assess each issue and decide how you would fix it (if you even can) and how much effort that will be. Then, using the same risk-versus-cost discussion we’ve been bringing up, decide whether you plan on addressing it or not. Finally, document the recovery plans for how to deal with the failure.

Image: NASA, public domain
Try as you might, you won’t be successful in rooting out all the potential failure points in your solution. There will always be some surprise that comes up. When this happens, take the time to really dig deep into what went wrong and then determine a course of action to help mitigate the issue in the future. You’ll notice that after significant outages in both Windows Azure and Amazon EC2/S3, the vendors publish root cause analyses. Read these and become familiar with them. Ask yourself: is this something that could happen to my code? Or, if your solution is based on some of those services (whether you were affected or not), what should we do if that happens to us?

Put something in place to deal with the issue in the future. For example, don’t be the Empire, who had a flaw only 2 meters wide in V1 of their product, then in the second version made spaces big enough to fly ships through.

If you have users or customers, it’s probably best to be quick and very forthright about what happened. Share your root cause analysis with them.

In February of 2012 Windows Azure suffered a severe outage in many of its services, including the management API. The issue boiled down to a simple code error around date calculation. Someone did something bad in code. It might be fun to laugh and say, “wow, can’t they get date math right?”, but then again, did you do a sweep of your own code looking for the same possible problem? I think you’d be really surprised what little gems haunt even the code of “senior” developers. When you see someone has made a mistake in code, in a design, etc., learn from it and make sure you won’t suffer the same fate.
What are your questions that you want answered today?Image: Creative Commons License: by Eleaf : http://www.flickr.com/photos/eleaf/2536358399/
Content Delivery Networks
Distributed Application Cache
Local Content Cache
during outages or
spikes in load
Always carry a spare
0% Capacity, half of all load
100% Capacity, 150% Capacity
75% of load, half of our load
50% more capacity than needed
Over allocated, but still functioning
• Can absorb spikes without failing
• Degrade, if temporary
• Time to react if need to add capacity
Image: Kevin Rosseel
Dept. of Redundancy
Have a backup, somewhere else
More than one? Cost to
Hot = full capacity
Warm = scaled down, but
ready to grow
Cold = mothballed, starts
Image: Mr. White
Redundancy - It's about probability
1 box : 5% downtime or 438 hrs per year
(that’s 18¼ days!)
2 boxes : 5/100 * 5/100 = 25/10,000 = 0.25% downtime or 22hrs per year
4 boxes : 5/100 * 5/100 * 5/100 * 5/100 = 625/100,000,000
0.000625% downtime or 3.285 MINUTES per year
Total Outage duration =
Time to Detect
+ Time to Diagnose
+ Time to Decide
+ Time to Act
Image: Office ClipArt
“Don't be too proud of this technological
terror you've constructed…”
• Root cause analysis
• Read other root cause analysis
• Plan for failure
• Your Solution WILL fail at some point
• You can learn from others just as
well as yourself
• Get cocky
• Stick your head in the sand
Images: LucasFilm, Fair Use