
Prepare for failure (fail fast, isolate, shed load)



Use timeouts, circuit breakers, and bulkheads to help shield your application from system failures - consider libraries that employ these patterns (like Hystrix [JVM] and Mjolnir [.NET]) when coding integration points between applications.


  1. Prepare for Failure Fail fast. Isolate. Shed load.
  2. @robhruska
  3. Webapp
  4. Webapp Mongo SQL Cache Rabbit Stripe Twilio
  5. Webapp Webapp Webapp
  6. Webapp
  7. Webapp Mobile Push Recruit BBall
  8. Webapp Mobile Push Recruit BBall
  9. Webapp Mobile Push Recruit BBall
  10. Webapp Mobile Push Recruit BBall
  11. Webapp Mobile Push Recruit BBall
  12. Webapp Mobile Push Recruit BBall
  13. Webapp Mobile Push Recruit BBall
  14. Webapp
  15. Webapp
  16. Webapp
  17. Webapp
  18.
  19.
  20.
  21.
  22.
  23. Timeouts
  24. Timeouts Bulkheads
  25. Timeouts Bulkheads Circuit Breakers
  26. Timeouts
  27. ∞ Timeouts System.Net.Http.HttpClient 100s ∞ org.apache.commons.httpclient.HttpClient
  28. Timeouts ~15s Set High Observe Peak 99.5% Adjust Down
  29. Timeouts 1250ms
  30. Bulkheads
  31. Bulkheads
  32. Bulkheads
  33. Bulkheads
  34. Thread Pools
  35. Thread Pools
  36. Thread Pools
  37. Thread Pools
  38. Thread Pools
  39. Thread Pools
  40. Thread Pools
  41. Thread Pools
  42. Thread Pools
  43. 1/20 20/20 4/10 4/20 Semaphores
  44. Circuit Breakers
  45. Circuit Breakers
  46. Circuit Breakers 34 1% Operations Error
  47. Circuit Breakers 29 75% Operations Error
  48. Circuit Breakers
  49. Circuit Breakers
  50. Circuit Breakers
  51. Circuit Breakers
  52. Circuit Breakers
  53. +/-
  54. +/-
  55. Timeouts Bulkheads Circuit Breakers
  56. Webapp
  57. Webapp
  58. users/get-user … … … …
  59. users/get-user … … … …
  60. users/get-user … … … …
  61. A C G B
  62. A C G B
  63. A C G B
  64. A C G B
  65. A C G B
  66. Resources @robhruska

Editor's Notes

  • My name’s Rob. I currently work on Hudl’s Platform squad, and we’ve recently made some large structural changes to how we develop and deploy our web application. We’ve moved from a large, monolithic application to more of a distributed system - and I’m going to cover some of the coding tools and patterns we use to make that distributed system a bit more fault tolerant and resistant to failure.

    To start, a little background.
  • The majority of our developers do a lot, if not all, of their work on our website. Coaches use Hudl to upload game and practice film, break it down, and share it out with athletes. Athletes watch and analyze game film to become better at their positions, and also do things like create highlights for awesome moments. Coaches can also do a lot of team management, build out and share playbooks, script their practices, stat games, etc. It’s become pretty feature-rich, and these features get used by millions of people.
  • And as of the beginning of this year, all of those features were crammed into one big project - the git repository, unpacked, was some 6GB. All that code got compiled into a single deployment that ran on each of our webservers.
  • The webapp worked with a handful of other services - a number of our own databases and cache servers, and also some external services like Stripe for credit card processing.
  • Each of these circles actually represents multiple machines - we currently have a couple hundred webservers up here, and most of these other services are actually clusters of several machines. Most of that’s for scalability, and with the exception of one or two of these gray circles, each node in a cluster is pretty independent and doesn’t communicate with the rest of the cluster or other services.

    This is still a pretty straightforward system. There are really only two layers of dependencies between applications. It’s fairly easy to reason about relationships between nodes and know what kinds of problems might happen.

    Additionally, these downstream nodes don’t change very frequently. Most are very stable, “core” components like databases. We aren’t deploying to them or upgrading them near as frequently as the webapp code we’re writing - and since they’re not changing very much, they’re a little bit less likely to fail.

    We could keep this system running pretty smoothly, and quickly identify and recover from most incidents.
  • However, that simplicity came with costs. We’ve grown our product team pretty rapidly, and still have aggressive goals for where we’re going.

    Every new developer we hired and every new feature we introduced meant that we piled more and more code onto that one web application. That meant longer checkouts and longer app startup times for local development. It meant increased build and deploy times, which are important to us because we like to deploy changes frequently, sometimes 15-20 a day.

    It also meant that if one of our squads accidentally deployed some code with issues into production, it had a much higher chance to cripple or break the entire site, because everything runs in the same application.
  • And that’s not a made-up scenario - we’ve seen all kinds of those things in production already: memory leaks, obscure circular references, super-aggressive loops that eat up CPU - all these things that are really difficult to find during development and testing, but become nasty things when you throw real production traffic at them.
  • So we set out to solve these problems. We decided to move to a more distributed, service-oriented architecture.

    This meant taking that big webapp and splitting off services that individual squads could work with independently.

    We slowly rolled out separate applications for our college recruiting product, our basketball platform, mobile device push notifications, and a handful of others.

  • Our intent is to have dozens of these in the medium term.
  • And even hundreds of services in the long term.
  • So look at how that changes the graph of how the components in the system interact. Even with just a few new services, we start adding a lot more dependencies between each of the nodes.
  • Recruit and Basketball are going to need to send mobile push notifications out to users’ devices.
  • Recruit will need to ask Basketball for game film on recruitable Basketball athletes.
  • All three of these are still going to have to work with what we affectionately call “the monolith”, the original webapp that’s inevitably going to stick around for a while as we continue to move pieces off of it.

    Seems straightforward, but it gets more complex.
  • Remember that the webapp had a bunch of other backend services it talked to. Databases, caches, queues.
  • On top of that, each of those new services can have their own databases, caches, queues, and other internal and external dependencies.

    And just to really make a mess of things...
  • each of these services is clustered, and has several of each of these nodes running in it.

    And this is with only four services. Imagine what it looks like with a hundred.

    Where are the failures going to happen here? Previously, we had a relatively linear, predictable set of places where things could go wrong. We knew what most of them were, and how to react. But here? Here, problems in one small part can cascade upward and outward in ways we’re not able to predict and handle.
  • An error all the way down here in this database can cause problems...
  • its corresponding application...
  • ...which can propagate upstream, crashing applications all the way back up. I’ll show you with a little more detail just how that can happen in a bit.

    These are all network calls, and the network is a hostile place; when it’s most inconvenient for you, it’s probably going to lose connectivity or become super slow. Network and server hardware can fail, packets can get dropped, there will be unscheduled maintenance, developers will push bad code. These and other failures are all but guaranteed to happen, especially if you’re running on commodity hardware in the cloud, where you’re not in as much control of your own systems.

    We can probably take a good crack at preventing these problems. It’d take a lot of money, time, and humans, but not a lot of our organizations have lots of money, time, and humans to dedicate, especially if we want to keep competing, innovating, and moving forward with the product that we actually deliver to our customers. So without those resources, your applications have to be tolerant of these failures and prevent them from branching out through your system.
  • So what does that mean? How do we anticipate those failures and become more tolerant? There are a number of patterns out there that help us solve these problems. The approach we took was inspired by a couple different sources.

    Cascading failure is a big topic in Michael Nygard’s book, Release It! - this is a really excellent book. Even if you don’t work in complex or distributed systems, it helps you think about how systems can fail, which is useful to keep in mind when you’re designing and coding. Last time I checked, the Kindle edition was around 16 or 17 bucks on Amazon, and that’s well worth it.

    And when we approach architectural problems, we have some engineering team role models that we look to to see what they’ve done in similar situations. One of those is Netflix. We both run on Amazon Web Services, so they tend to encounter situations similar to ours when it comes to how AWS works and behaves, or misbehaves. Netflix has a library called Hystrix, which takes some concepts from Nygard’s book and applies them to solutions for these problems.

    Hystrix is written in Java, since Netflix works on the JVM. There’s not really a strong Hystrix equivalent for .NET, which is our primary platform - there’s a port of the library, but it doesn’t get much activity, and we also didn’t think that porting Hystrix directly to C# fit the way we wanted to approach it. C# has some great asynchronous language features that don’t have 1:1 equivalents in Java, and we wanted to use them.

    We also wanted to understand the problems and solutions a little better ourselves, so we wrote a library similar to Hystrix for .NET, called Mjolnir.

    So what are Hystrix and Mjolnir? Let’s check out some code.
  • Both have an abstract class called Command that you can inherit from.

    Within these commands, you put some code that might be dangerous - that might be susceptible to those problems that we saw in the example.

    Here’s an example from Hystrix.
  • You can see the run() method down here - whatever you put in here is what gets protected.

    In the majority of cases, this is probably code that does I/O over the network. Imagine that we’re grabbing an HTTP client in this run method and making a GET request.

    Inter-cluster service calls, database calls, calls to systems outside of your application. Those sorts of things. It *could* be code that’s heavily memory-bound, or code that interacts with something on disk, but those cases are a lot rarer than network communication.
  • To execute the command, you just create a new one and call execute(). It’s fairly straightforward. Hystrix also has equivalent methods to execute() that let you get a Future or an Observable if you want to work with the result in a more asynchronous way.
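    Hystrix’s actual Command class isn’t reproduced here, but the shape of the pattern can be sketched in plain Java. This is a minimal, hypothetical stand-in - SimpleCommand and its members are invented for illustration, and only the timeout layer is shown (Hystrix’s Command also adds the bulkhead and circuit breaker layers covered below):

```java
import java.util.concurrent.*;

// Illustrative sketch of the Command shape: potentially dangerous work
// (usually a network call) goes in run(), and execute() runs it on a
// worker thread with an enforced timeout. Names here are invented.
abstract class SimpleCommand<T> {
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true); // don't keep the JVM alive for idle workers
        return t;
    });

    private final long timeoutMillis;

    protected SimpleCommand(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Subclasses put the dangerous code here (e.g. an HTTP GET).
    protected abstract T run() throws Exception;

    // Runs run() and aborts if it exceeds the timeout.
    public T execute() {
        Future<T> future = POOL.submit(this::run);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker
            throw new RuntimeException("command timed out", e);
        } catch (Exception e) {
            throw new RuntimeException("command failed", e);
        }
    }
}
```

    A concrete command would subclass this, put its network call in run(), and callers would create an instance and call execute() - mirroring the Hystrix usage described here, minus Hystrix’s Future/Observable variants.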
  • Here’s the equivalent with Mjolnir in C#.

    It looks pretty similar. Some slight differences, but the structure is the same. Potentially dangerous code goes into the ExecuteAsync() method.
  • To use the Command, you create a new one and call InvokeAsync(). That’s the async version - Mjolnir’s also got a synchronous equivalent, Invoke().
  • What are these Commands? When you put code into that overridden method and run the Command, what actually happens?

    The Command applies a few protective patterns around it.
  • One thing they do is enforce Timeouts. Timeouts are kind of an obvious thing, but it’s easy to forget about them.
  • They use Bulkheads, which are a way to isolate pieces of your system from the rest of it, and keep failures contained.
  • They also employ the idea of Circuit Breakers, which help track failure rates and fail fast when things seem to be going wrong.

    We’ll go into a little more detail about what each of these does.
  • First, timeouts.
  • Timeouts are pretty simple, right? You configure a timeout duration, and the operation will abort if it takes any longer.

    But when you’re deep in the bits, coding up a feature, you’re not typically thinking about how long it’s going to take to make a network call. You’re just trying to put things together and get the feature to work, right? It’s so easy to forget to set the timeout.

    When you start wrapping your code in Commands, they make timeouts required. They let you configure timeouts per-Command, but also have default global timeouts that get used if you don’t explicitly set one. They make you put some thought into what a reasonable execution time should be.

    Does anyone know what the default timeout for a .NET HttpClient call is? [ADVANCE] 100 seconds

    When was the last time you waited 100 seconds for a page to load? Maybe once in a while. Maybe.

    [ADVANCE] What about this one? Anyone know the default timeout? [ADVANCE] Infinite.
    [ADVANCE] Apache Commons HttpClient is also infinite.

    It’s pretty obvious that defaults aren’t going to cut it. You can’t have these multi-minute or infinite requests sitting around in your application, blocking and tying up threads. Plus, your users aren’t going to wait that long, anyway.
  • So what *should* your timeouts be?

    It depends, and you’ll have to experiment.

    Start with something reasonably high - that depends on how aggressive you want to be. Netflix starts with 1 second timeouts for new commands.

    We’ve been a bit more generous with ours. Given that we’ve recently introduced this into our system, we need to be sure we’re not being too aggressive and affecting our users’ experience. So we default to 10-15 seconds for our timeouts.

    After you get your Commands out there, you’ll want to Observe how long the command takes. This might be a day, a week, a month - ideally it’s whatever will take you through peak traffic.

    This is another reason we set ours pretty high - our peak cycle is different from Netflix. I assume theirs is relatively consistent week-to-week, but we see our peak loads in September, so what works for us right now may be too aggressive then. Once we go through that peak, we’ll tune them again more accurately, and try to get down to that one second mark.

    And when you tune them, you’ll want to adjust them to a timeout value that covers around 99.5% of their requests.
  • For example, this is one of our commands’ elapsed times over one day. For this, we’d set our timeout around the 1250 to 1300 millisecond mark, which covers almost every single request.

    In this particular case I think I determined we’d end up rejecting just one or two requests out of around 85 thousand.

    It’s important to tune them as soon as you can and as low as you can.

    Our 15-second default is still pretty high. There’s likely more than one command execution on any given page request, so timeouts can still stack up to something well beyond 15 seconds, and the user’s probably not going to wait that long.

    As an aside, timeouts are something you can do even without integrating one of these libraries - it’s a smart thing to get used to setting them whenever the API you’re working with supports it.

    But timeouts alone aren’t enough. They’ll help last a little bit longer in a failure situation, but we need more.
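    Setting timeouts whenever the API supports it is worth making a habit. With the JDK’s java.net.http client (Java 11+), for example, both the connect timeout and the per-request timeout can be set explicitly - the values and URL below are illustrative (1250 ms echoes the percentile example above):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

class TimeoutExample {
    public static void main(String[] args) {
        // Connect timeout bounds establishing the TCP connection...
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();

        // ...and the per-request timeout caps the whole exchange.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/users/get-user"))
                .timeout(Duration.ofMillis(1250))
                .GET()
                .build();

        System.out.println(request.timeout().get()); // PT1.25S
    }
}
```

    Without that .timeout(...) call, an unanswered request can block for far longer than any user will wait - which is exactly the default-timeout trap described above.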
  • We need to isolate problems with Bulkheads.
  • For those unfamiliar with how bulkheads work in naval vessels, here’s an awesome picture of a boat that I made.
  • Some larger ships are divided up into compartments, with these upright walls in between called bulkheads.

    The bulkheads are watertight and fire-resistant, so ...
  • If the hull gets punctured or a fire breaks out in one of the compartments, the bulkhead isolates the problem to that one part of the ship.

    The rest of the compartments aren’t affected, and the ship can stay afloat and continue operating.

    Hystrix and Mjolnir take this concept and apply it to code using thread pools and counting semaphores.
  • Let’s look at an example. This is another look at that cascading failure I described earlier, but with a little more detail.

    On top here in gray is the application we’re trying to protect, and below are four applications it communicates with. Imagine it’s making frequent HTTP calls out to each of these systems.
  • Visualize the application in terms of something like threads or sockets.

    There are a finite number it can handle, and that number is constrained. It might be by the amount of memory on the machine or the memory the app was allocated when it started up, or possibly by a capped thread pool size that’s managed by the application container.
  • When everything’s working normally, those HTTP requests are firing off to other applications and responding pretty quickly - say, 10 milliseconds.

    So imagine that each of these squares turns a color for that 10 milliseconds and then frees up the resource, turning the square white again.
  • But let’s say that one of those downstream applications starts responding really slowly. Maybe our basketball squad decided to run some batch data import, but forgot to throttle it, and cranked their application’s CPU up to 100%.

    Requests are still getting accepted, but now instead of 10 milliseconds, the application might take 10 or 20 seconds to respond depending on how overloaded that cluster is.
  • This is where Timeouts alone won’t save you. If you’ve got a 15 second timeout, and requests are coming in faster than they’re timing out, the application’s going to have more and more threads tied up, blocking, waiting for those requests to get a response back.

    For a while, it’s fine - they’re probably just taking up other unused threads in the application container’s thread pool, and the pool might start resizing itself and adding more threads if it does that sort of thing.
  • But eventually, those slow requests are going to start starving out the other calls.
  • And if that continues, the application will max out its threads, and every one of them will be blocking on HTTP requests, leaving you completely hosed.

    If other things depend on you, you’ve also turned into another bad node in the system. This is the cascade we saw earlier, where a problem downstream can propagate upward and do damage to systems that depend on it, and systems that depend on those systems - all the way up.

    That’s where the bulkheads help out.
  • Instead of letting Commands run wild through your application’s thread pool, they get passed through their own thread pool that has a fixed maximum number of threads, typically around 10.

    These individual thread pools are the bulkheads.
  • If one Command starts to act up, it can only use up as many threads as are in that pool.

    When it consumes them all, new executions of that Command will simply be rejected. The library will throw an exception or execute a predefined, safer fallback method.

    The rest of the Commands in the system continue working normally, and aren’t affected by the outage that’s going on with that one node, which is good, because there’s a good chance they don’t even need to interact with that service.

    The main job of the thread pools is to limit the number of concurrent calls for a single set of operations. We can also do something similar by using counting semaphores.
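    A thread-pool bulkhead can be sketched with the JDK’s own executors. This is a generic illustration, not either library’s implementation - the pool size of 2 is just for brevity (the talk mentions pools of around 10):

```java
import java.util.concurrent.*;

class ThreadPoolBulkheadDemo {
    public static void main(String[] args) throws Exception {
        // A bulkhead: at most 2 concurrent calls for this one dependency.
        // SynchronousQueue + AbortPolicy means an over-limit submit is
        // rejected immediately instead of queueing and blocking callers.
        ExecutorService bulkhead = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(),
                new ThreadPoolExecutor.AbortPolicy());

        CountDownLatch hold = new CountDownLatch(1);
        // Two "slow" calls fill the bulkhead...
        for (int i = 0; i < 2; i++) {
            bulkhead.submit(() -> { hold.await(); return "done"; });
        }
        // ...so the next one fails fast rather than tying up a thread.
        try {
            bulkhead.submit(() -> "third call");
            System.out.println("accepted");
        } catch (RejectedExecutionException e) {
            System.out.println("rejected: bulkhead full");
        }
        hold.countDown();
        bulkhead.shutdown();
    }
}
```

    The RejectedExecutionException here is the moment a real library would throw back to the caller or run the predefined fallback - the rest of the application’s threads never get involved.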
  • We could replace the thread pool with a semaphore. The counting semaphore is just a lock that can have up to a specific number of concurrent lock holders. The lock acquisition is done in a try/acquire way, which means that if we can’t immediately access one of the spots in the lock, we don’t block, but instead return false.

    When the semaphore is at its maximum, new Commands won’t be able to acquire that lock and will get immediately rejected, just like with the thread pool.

    Hystrix supports semaphores, Mjolnir doesn’t yet, but will at some point.

    So these thread pool and semaphore bulkheads are pretty important - they do a great job of mitigating that failure cascade and preventing it from spreading and taking over everything.
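    That try/acquire behavior maps directly onto the JDK’s counting semaphore. A minimal sketch - the class and method names are invented, and the limit of 10 just mirrors the typical pool size mentioned above:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Semaphore bulkhead: at most 10 concurrent executions of this operation.
class SemaphoreBulkhead {
    private final Semaphore permits = new Semaphore(10);

    // Returns the result, or null if the bulkhead is full. (A real
    // library would throw or run a fallback instead of returning null.)
    public String tryCall(Supplier<String> operation) {
        if (!permits.tryAcquire()) { // non-blocking: reject immediately
            return null;
        }
        try {
            return operation.get();
        } finally {
            permits.release();       // always free the slot
        }
    }

    public static void main(String[] args) {
        SemaphoreBulkhead bulkhead = new SemaphoreBulkhead();
        System.out.println(bulkhead.tryCall(() -> "ok")); // ok
    }
}
```

    Unlike the thread-pool variant, the operation runs on the caller’s own thread - the semaphore only caps concurrency, it can’t interrupt a call that’s already in flight.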
  • On to our remaining protection layer. We’ve covered Timeouts and Bulkheads, let’s talk about Circuit Breakers. They’re pretty straightforward.
  • Every Command gets passed through a circuit breaker.
  • Each circuit breaker maintains a rolling count of the operations that have come through it, and whether or not each of those was a success or a failure.
  • A traditional electrical circuit breaker trips and opens if too much current is drawn through it. One of our circuit breakers trips if it sees too many errors.

    Once a breaker is tripped, any Command that would have gone through it is instead immediately rejected.

    The breaker will stay tripped for a configured period. Maybe 10 or 30 seconds or so.
  • Once the wait period ends, the breaker will allow a single operation through.
  • If that operation succeeds, the breaker is considered fixed and closed, allowing all operations through again.

    But if the single test operation fails...
  • ...the breaker remains tripped for another waiting period, and the process repeats.
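    That trip / wait / single-test-request cycle can be sketched as a small state machine. This is an illustrative simplification, not Hystrix’s or Mjolnir’s implementation - real breakers track a rolling error percentage rather than a simple failure streak:

```java
// Minimal circuit breaker sketch: closed until failureThreshold consecutive
// failures, then open for cooldownMillis; after the cooldown a single test
// request is let through (half-open). Success closes the breaker; failure
// re-opens it for another waiting period.
class CircuitBreaker {
    private final int failureThreshold;
    private final long cooldownMillis;
    private int consecutiveFailures = 0;
    private long openedAt = -1;          // -1 means the breaker is closed
    private boolean probeInFlight = false;

    CircuitBreaker(int failureThreshold, long cooldownMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
    }

    synchronized boolean allowRequest() {
        if (openedAt < 0) return true;               // closed: allow all
        if (!probeInFlight
                && System.currentTimeMillis() - openedAt >= cooldownMillis) {
            probeInFlight = true;                    // half-open: one test call
            return true;
        }
        return false;                                // open: reject immediately
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        probeInFlight = false;
        openedAt = -1;                               // considered fixed: close
    }

    synchronized void recordFailure() {
        consecutiveFailures++;
        probeInFlight = false;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = System.currentTimeMillis();   // trip (or re-trip)
        }
    }
}
```

    Callers would check allowRequest() before running the command and report recordSuccess() or recordFailure() afterward; a false from allowRequest() means rejecting the call without ever touching the network.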
  • Circuit breakers serve two important purposes. They help us fail fast back to the caller. We already know that our operation is having problems, we might as well not make our users or client applications wait just to eventually tell them what we already know.
  • They also help shed load from systems that are already under stress.

    When downstream applications are having problems, sending them more traffic certainly isn’t going to help them.

    And if you think about it, when you hit a site (say, GitHub) and it either goes really slowly or dumps an error back on you, what’s one of the things you usually do?

    You try to figure out if it’s you or them, so you hit F5 with a page reload, which sends off another request and is only going to burden the crippled system even more.

    We just need to back off until the application’s back up and working correctly. We also save a few threads and some processing on our calling side.

  • One really nice thing is that most of timeout, thread pool, and circuit breaker behavior is all configurable at runtime. If you find that a timeout’s just a little too aggressive, you can bump it up without an application restart. Need to adjust the error percentage that a circuit breaker trips at? Done.
  • Another thing that can be useful in practice is grouping several types of operations together so that they all use the same thread pool or circuit breaker.

    If you think about it, if you’ve got a service that lets you update user names, change user passwords, and delete users, there’s a good chance that if you start seeing elevated error rates or timeouts on the user name updates, you’ll start seeing the same problems with the password updates and the deletes. Not all of the time - sometimes there may be bugs in just one - but if problems are happening with the user database, they’re likely to affect everything user-related.

    So what we do, and what these libraries allow, is group similar operations together - all of our user service operations flow through the same circuit breaker, which means that they’re all going to hit a tripped breaker together and fail together. That gives us the opportunity to fail a little bit faster, instead of waiting for a circuit breaker for each different type of operation to trip.

    How you might group operations is kind of a judgment call, and you’ll probably have to try things out and adjust them if the fit isn’t quite right. We keep our groups of service methods pretty granular - we have focused groups of services for users, roles, video clips, playbook plays, and highlights. We roughly tie our groups to individual SQL tables or Mongo collections. Some of our groups might be a little wrong, and might cause some operations to run into a tripped breaker when they actually didn’t need to. We’re okay with that - we’ll recognize those situations and can move our groupings around if we find that we need to.
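    One hypothetical way to express that grouping in code is to key breakers by group name, so every user-related operation shares a single breaker instance. The Breaker class here is an empty stand-in for whatever breaker implementation you use:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class BreakerGroups {
    // Stand-in for a real breaker; only identity matters for this sketch.
    static final class Breaker { }

    private final Map<String, Breaker> byGroup = new ConcurrentHashMap<>();

    // All operations that pass the same group name share one breaker,
    // so e.g. the "users" operations trip and recover together.
    Breaker forGroup(String group) {
        return byGroup.computeIfAbsent(group, g -> new Breaker());
    }
}
```

    With this shape, update-name, change-password, and delete-user would all call forGroup("users") and hit (or clear) the same breaker at the same time.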
  • So by using all three of these patterns, you can do a lot to improve your applications’ fault tolerance and stop that failure cascade.
  • If you look at the cascading failure we saw earlier, but with these added protections, that database error only ever really affects...
  • ...small parts of the apps that depend on it, leaving the rest of the app chugging along.
  • Here’s an example of what these protections look like in our production environment. For this particular case, we had a small incident where a couple of our recruit servers had trouble connecting to our main webapp.

    This is a snapshot of one of our metrics charts of the elapsed round trip time between clusters for each group of Commands we have. You can see at the end there, about half past noon, a bunch of them jumped up to 15 seconds round trip, up from at most around 1 second.

  • From our logs, this is what one set of Commands from one of our recruit servers looked like right when that started happening.

    In blue is the volume of canceled commands. A canceled command means it hit its timeout and we aborted the call.

    In yellow is the volume of rejected commands, which means the circuit breaker immediately prevented them from even happening.

    This bottom chart shows two individual events, the first is when the circuit breaker tripped, and the second is when it fixed itself.

    So you can see, for this particular set of Commands, there was a brief period up front where users would have been waiting 15 seconds for errors to occur, and then when the breaker tripped, we started immediately returning with an error instead of making them wait 15 seconds and *then* fail.

    This is a pretty small example with what was a fairly small incident. It didn’t hit any of our thread pool limits, but did help us fail faster and take a little load off while we waited for the problem to resolve.
  • One thing that’s useful to keep in mind is that these bulkheads and circuit breakers are just patterns. They’re used by Hystrix and Mjolnir, but you can also employ them on your own, and we’ve done that ourselves as well.

    Within every application we have a transport layer that uses HTTP to send API requests from one cluster to another.
  • That transport layer helps locate service endpoints - it holds onto a mapping of routes to other applications that we’ll round-robin through when sending these HTTP requests.

    Our transport layer watches every request that goes to each individual machine, and looks specifically for socket errors. We figure that if we encounter a socket error, it means that there’s something more fundamentally wrong with connectivity between us and that other machine.
  • If we see more than three consecutive socket errors from us to a single machine, we’ll pull it from that internal mapping and mark it as unhealthy - note that we only do that within the application that observed it - we don’t broadcast that out to other nodes in our cluster or other applications - I’ll come back to that in a second.

    We’ll then, in the background, send a ping request to it every 5 seconds until we get a successful response, at which point we’ll put it back into that mapping as an eligible endpoint.

    This is just another example of the circuit breaker pattern, but in a more focused and specialized case. We’re monitoring operations for their failure rates, reacting by preventing them from happening altogether, and then self-healing when we see that things are better.

    Coming back to what I hinted at earlier, this brings us to an interesting observation.
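    The consecutive-socket-error bookkeeping just described can be sketched like this - the class and method names are invented, and the real transport layer also handles the route mapping and the background 5-second pinging:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Tracks consecutive socket errors per endpoint. After more than three
// in a row, the endpoint should be pulled from the routing table and
// marked unhealthy (a background pinger would restore it later).
class EndpointHealth {
    private static final int MAX_CONSECUTIVE_ERRORS = 3;
    private final Map<String, Integer> errorStreaks = new ConcurrentHashMap<>();

    void recordSuccess(String endpoint) {
        errorStreaks.remove(endpoint);   // any success resets the streak
    }

    // Returns true when the endpoint should now be marked unhealthy.
    boolean recordSocketError(String endpoint) {
        int streak = errorStreaks.merge(endpoint, 1, Integer::sum);
        return streak > MAX_CONSECUTIVE_ERRORS;
    }
}
```

    This is the same circuit breaker idea in miniature: watch failures per target, stop sending traffic when a threshold trips, and self-heal when a later probe succeeds.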
  • Note that all of the patterns and behavior that I’ve talked about here are happening right within one application on one machine. There’s not some global authority or monitoring application that watches the whole system and tells nodes about the state of the rest of the system.

    In fact, if you try to do that, it’s got the potential to be pretty inaccurate.
  • Imagine this orange circle over here is some sort of global arbiter system that monitors all of our servers, and its responsibility is to detect when systems become unavailable and tell the rest of the systems about that. I’ll call it system G (for “global”).
  • In some cases, this may work. If G can’t get to system C down here, it’s possible that system C dropped completely off the network or became unresponsive, in which case it’s fine to tell A and B that C is no longer available, and that they should stop sending traffic to it.
  • But what’s also possible is that system G is able to connect to C just fine, while system B actually can’t. This might be because B goes through a different switch, and that switch is having problems. It might be that something’s wrong with the way system B is receiving DNS information. It could be any number of reasons.

    In that case, G doesn’t really work well as an arbiter. It’s not going to be able to reliably tell B to stop sending traffic to C.
  • It also doesn’t really work if G can’t connect to systems A or B to tell them about system C being gone, even though A and B might still be able to communicate fine with each other and with the other, unmarked gray circle on the right. This global arbiter model falls apart in a number of other similar situations.
  • So all of our observations about the system are done by individual nodes.

    A, B, and C make decisions for themselves, and don’t really involve G very much. System A has a much better idea about the other applications that it can see and successfully communicate with.

    There can still be some need for global state. For us, we keep a global service registry that lists all of our servers and a little bit of information about them. Applications, about every 15 seconds or so, grab that state to refresh their worldview - bring new servers into their service registries, those sorts of things.

    That matters the most when an application starts up, because it gives the application the initial information it needs to build up its route mappings and those sorts of things.

    But outside of application startup, they don’t rely on that state to get things done. If they happen to lose connectivity to G for a bit, that’s fine. They’ll continue functioning. G just serves as a way to make sure things stay relatively in-sync.

    So the point here is that we give individual nodes the responsibility to make decisions about how they communicate.
  • Another important takeaway here is that you need to design and write your user interfaces in a way that gracefully handles problems.

    It’s difficult to do, and we haven’t mastered it yet, but we’re getting better at it. It takes a bit of a mental shift.

    Here’s a screenshot of one of our pages where an athlete can manage his or her profile information.

    Part of that profile management involves uploading academic documents like transcripts for recruiters to view.

    This page itself is served up by one of our monolith servers, but all of the academic document data is managed by our recruit cluster. That means that when we load this page, we have to make a call out to a recruit server to grab the user’s academic documents.

    If recruit’s running slowly, there’s a chance that we’ll run into a tripped breaker when we grab this information. Instead of just blowing up and throwing a 500 back to the user, we just drop a small error message into the documents section - the rest of the page continues to render and behave just fine.

    So it takes a little bit of thought, because you need to figure out if you can still give your user a decent experience if parts of your underlying system are having troubles. There are going to be some obvious situations where a full-on error page is inevitable - if recruit’s down and we’re trying to serve up a very recruit-heavy page, we’re probably not going to be able to do much. But there are a lot of places like this, where that outside interaction is just a small part of the user’s experience, and you can let them continue to do other things that aren’t related to the failures.

    In some cases, you might even consider building in an automatic ajax retry or putting in a “try again” button. Maybe this was just a really quick, transient network blip that fixes itself in a few seconds. I know I scoffed at the fact that users are going to retry and cause you more problems, but that only causes problems if you’re not protected by circuit breakers. If your breaker is still tripped, it’s going to be a really quick request for your server to process, and it’s not going to push that request downstream to wherever the problem is.
  • Finally, and this is solid advice regardless of what you’re doing: monitor your systems.

    Use logging and aggregators like Splunk, Kibana, LogStash. Hook up metrics services like Nagios, statsd, Riemann, or a host of other things. Send yourself alerts via email or something fancier like PagerDuty when breakers trip or error rates get high.

    You need to know how your systems behave and when they misbehave.

    The charts you saw earlier came from Splunk and Graphite, which are a couple tools we use, but you should be able to plug in logging and monitoring pretty easily.
  • So I encourage you to give these frameworks a look. Hystrix is pretty far along, and has a number of other nifty features like request caching. Their wiki documentation is great, and they’ve got some good diagrams and dashboards that can give you more insight.

    I hope I’ve passed on a few valuable things to everyone today, even if it was just to build some awareness on how applications can interact and fail in unpredictable ways. These situations are more likely to happen under higher traffic volumes and in more distributed systems, but the patterns and ideas have applications within systems of any size and any load.

    Check out Michael Nygard’s book, Release It!, which sets the foundation for a lot of these ideas.

    I’ll get this slide deck pushed up and tweet it out and it’ll also probably end up on one of Mjolnir’s wiki pages.

    If you have any questions, I’d be happy to field them - or you can find me during one of the breaks if you think of something later.