Fault Tolerance in a
High Volume, Distributed System
Ben Christensen
Software Engineer – API Platform at Netflix
@benjchristensen
http://www.linkedin.com/in/benjchristensen




Dozens of dependencies.

   One going down takes everything down.


(99.99%)^30       = 99.7% uptime
     0.3% of 1 billion = 3,000,000 failures

            2+ hours downtime/month
even if all dependencies have excellent uptime.

          Reality is generally worse.
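
The arithmetic on this slide can be checked with a short sketch. It assumes 30 dependencies that fail independently, each with 99.99% uptime, and a 30-day month; the class and method names are illustrative only and do not come from the talk.

```java
final class CompoundAvailability {

    /** Probability that all dependencies are up at once, assuming independent failures. */
    static double combined(double perDependencyUptime, int dependencies) {
        return Math.pow(perDependencyUptime, dependencies);
    }

    public static void main(String[] args) {
        double uptime = combined(0.9999, 30);          // close to 0.997
        double failureRate = 1.0 - uptime;             // close to 0.003
        long failuresPerBillion = Math.round(failureRate * 1_000_000_000L);
        double hoursDownPerMonth = failureRate * 30 * 24; // 30-day month assumed

        System.out.printf("uptime: %.2f%%%n", uptime * 100);
        System.out.printf("failures per 1 billion requests: %,d%n", failuresPerBillion);
        System.out.printf("downtime: %.1f hours/month%n", hoursDownPerMonth);
    }
}
```

The slide rounds the failure rate to 0.3%, which is where the 3,000,000 figure comes from; the unrounded count is slightly lower.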


No single dependency should
 take down the entire app.

          Fail fast.
         Fail silent.
         Fallback.

        Shed load.



Options

Aggressive Network Timeouts

   Semaphores (Tryable)

     Separate Threads

      Circuit Breaker



Semaphores (Tryable): Limited Concurrency


TryableSemaphore executionSemaphore = getExecutionSemaphore();
// acquire a permit
if (executionSemaphore.tryAcquire()) {
    try {
        return executeCommand();
    } finally {
        executionSemaphore.release();
    }
} else {
    circuitBreaker.markSemaphoreRejection();
    // permit not available so return fallback
    return getFallback();
}
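
The same pattern can be tried out with the JDK alone. The sketch below is not the Netflix TryableSemaphore; it substitutes java.util.concurrent.Semaphore, whose tryAcquire() returns immediately instead of blocking. SemaphoreGuard and its method names are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class SemaphoreGuard {
    private final Semaphore permits;

    SemaphoreGuard(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the command if a permit is free; otherwise returns the fallback without waiting. */
    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (permits.tryAcquire()) {          // non-blocking attempt
            try {
                return command.get();
            } finally {
                permits.release();           // always hand the permit back
            }
        }
        return fallback.get();               // at the concurrency limit: shed load
    }

    public static void main(String[] args) {
        SemaphoreGuard guard = new SemaphoreGuard(1);
        // The inner call runs while the only permit is held, so it is rejected.
        String result = guard.execute(
                () -> guard.execute(() -> "inner", () -> "fallback"),
                () -> "outer-fallback");
        System.out.println(result);          // prints "fallback"
    }
}
```

No extra thread is involved: the caller's own thread does the work, so the only cost is the permit check.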




Separate Threads: Limited Concurrency


try {
    if (!threadPool.isQueueSpaceAvailable()) {
        // we are at the property-defined max, so throw RejectedExecutionException ourselves to
        // simulate reaching the real max and go through the same codepath and behavior
        throw new RejectedExecutionException(
            "Rejected command because thread-pool queueSize is at rejection threshold.");
    }
    // ... define Callable that performs executeCommand() ...
    // submit the work to the thread-pool
    return threadPool.submit(command);
} catch (RejectedExecutionException e) {
    circuitBreaker.markThreadPoolRejection();
    // rejected so return fallback
    return getFallback();
}
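
A self-contained sketch of the rejection path, using only JDK classes. It assumes a fixed-size ThreadPoolExecutor with a bounded queue and the default AbortPolicy, which throws RejectedExecutionException once both the threads and the queue are full. BoundedPool and its methods are hypothetical names, not part of the Netflix implementation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedPool {

    /** Fixed number of threads plus a bounded queue; extra work is rejected, not queued forever. */
    static ThreadPoolExecutor newPool(int threads, int queueSize) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize));
    }

    /** Submits the command, or returns an already-completed fallback if the pool is saturated. */
    static <T> Future<T> submitOrFallback(ExecutorService pool, Callable<T> command, T fallback) {
        try {
            return pool.submit(command);
        } catch (RejectedExecutionException e) {
            return CompletableFuture.completedFuture(fallback);
        }
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = newPool(1, 1);
        CountDownLatch release = new CountDownLatch(1);
        Callable<String> slow = () -> { release.await(); return "ok"; };

        Future<String> first = submitOrFallback(pool, slow, "fallback");  // occupies the thread
        Future<String> second = submitOrFallback(pool, slow, "fallback"); // fills the queue
        Future<String> third = submitOrFallback(pool, slow, "fallback");  // rejected

        System.out.println(third.get());                      // prints "fallback"
        release.countDown();                                  // let the slow calls finish
        System.out.println(first.get() + " " + second.get()); // prints "ok ok"
        pool.shutdown();
    }
}
```

The caller gets its answer for the third request immediately, while the dependency's slowness stays confined to one thread and one queue slot.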



Separate Threads: Timeout

                                  Override of Future.get()

public K get() throws CancellationException, InterruptedException, ExecutionException {
    try {
        long timeout =
            getCircuitBreaker().getCommandTimeoutInMilliseconds();
        return get(timeout, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
        // report timeout failure
        circuitBreaker.markTimeout(
            System.currentTimeMillis() - startTime);

        // retrieve the fallback
        return getFallback();
    }
}
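
The timeout idea can be sketched without subclassing Future. The helper below wraps Future.get(timeout, unit) and returns a fallback when the deadline passes. Cancelling the timed-out future is an addition that the slide does not show, and TimedCall is a hypothetical name.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedCall {

    /** Waits at most timeoutMillis for the result, then gives up and returns the fallback. */
    static <T> T getWithTimeout(Future<T> future, long timeoutMillis, T fallback)
            throws InterruptedException, ExecutionException {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Assumption: interrupt the worker so its thread can be reused.
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> slow = pool.submit(() -> {
                Thread.sleep(5_000);         // simulated latent dependency
                return "late";
            });
            System.out.println(getWithTimeout(slow, 50, "fallback")); // prints "fallback"
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Because the call runs on a separate thread, the caller can walk away at the deadline even if the client library itself never times out.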




Circuit Breaker




if (circuitBreaker.allowRequest()) {
    return executeCommand();
} else {
    // short-circuit and go directly to fallback
    circuitBreaker.markShortCircuited();
    return getFallback();
}
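
What allowRequest() might decide internally is not shown on the slide. A minimal sketch under simplifying assumptions: the circuit opens after a fixed number of consecutive failures and lets requests through again once a sleep window has elapsed. The rule, the class, and the injected clock are all hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long sleepWindowMillis;
    private final LongSupplier clock;        // injected so the example needs no real waiting
    private int consecutiveFailures = 0;
    private long openedAt = -1;              // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    synchronized boolean allowRequest() {
        if (openedAt < 0) {
            return true;                     // closed: everything passes
        }
        // open: allow trial requests only after the sleep window has passed
        return clock.getAsLong() - openedAt >= sleepWindowMillis;
    }

    synchronized void markSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;                       // a success closes the circuit
    }

    synchronized void markFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong();    // trip, or re-trip after a failed trial
        }
    }

    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get();           // short-circuit straight to the fallback
        }
        try {
            T result = command.get();
            markSuccess();
            return result;
        } catch (RuntimeException e) {
            markFailure();
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        long[] now = {0};
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 1_000, () -> now[0]);
        Supplier<String> failing = () -> { throw new IllegalStateException("down"); };

        breaker.execute(failing, () -> "fallback");
        breaker.execute(failing, () -> "fallback");
        System.out.println(breaker.allowRequest());   // prints "false"
        now[0] = 1_000;                               // sleep window elapsed
        System.out.println(breaker.execute(() -> "ok", () -> "fallback")); // prints "ok"
    }
}
```

This version may admit several trial requests once the window has passed; limiting it to a single trial is left out for brevity.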




Netflix uses all 4 in combination




Tryable semaphores for “trusted” clients and fallbacks

       Separate threads for “untrusted” clients

 Aggressive timeouts on threads and network calls
             to “give up and move on”

       Circuit breakers as the “release valve”



Benefits of Separate Threads

         Protection from client libraries

    Lower risk to accept new/updated clients

           Quick recovery from failure

             Client misconfiguration

Client service performance characteristic changes

              Built-in concurrency
Drawbacks of Separate Threads

    Some computational overhead

Load on machine can be pushed too far

                 ...

    Benefits outweigh drawbacks
    when clients are “untrusted”


Visualizing Circuits in Realtime
      (generally sub-second latency)




      Video available at
https://vimeo.com/33576628




Rolling 10 second counter – 1 second granularity

      Median Mean 90th 99th 99.5th

       Latent Error Timeout Rejected

                 Error Percentage
            (error+timeout+rejected)/
(success+latent success+error+timeout+rejected).
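
One way to keep such a counter is a ring of ten one-second buckets. The sketch below is an assumption about how the formula above could be computed, not the Netflix implementation; RollingErrorRate and its Event enum are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic.

```java
import java.util.Arrays;

final class RollingErrorRate {
    enum Event { SUCCESS, LATENT_SUCCESS, ERROR, TIMEOUT, REJECTED }

    private static final int BUCKETS = 10;   // 10-second window, 1-second granularity
    private final long[][] counts = new long[BUCKETS][Event.values().length];
    private final long[] bucketSecond = new long[BUCKETS]; // second each slot currently holds

    RollingErrorRate() {
        Arrays.fill(bucketSecond, -1);       // -1 marks a slot that was never used
    }

    synchronized void record(Event event, long nowMillis) {
        long second = nowMillis / 1000;
        int slot = (int) (second % BUCKETS);
        if (bucketSecond[slot] != second) {  // slot holds an older second: reuse it
            Arrays.fill(counts[slot], 0);
            bucketSecond[slot] = second;
        }
        counts[slot][event.ordinal()]++;
    }

    /** (error+timeout+rejected) / (success+latent success+error+timeout+rejected), as a percentage. */
    synchronized double errorPercentage(long nowMillis) {
        long second = nowMillis / 1000;
        long failed = 0;
        long total = 0;
        for (int slot = 0; slot < BUCKETS; slot++) {
            long age = second - bucketSecond[slot];
            if (bucketSecond[slot] < 0 || age < 0 || age >= BUCKETS) {
                continue;                    // unused slot or outside the 10-second window
            }
            for (Event e : Event.values()) {
                long n = counts[slot][e.ordinal()];
                total += n;
                if (e == Event.ERROR || e == Event.TIMEOUT || e == Event.REJECTED) {
                    failed += n;
                }
            }
        }
        return total == 0 ? 0.0 : 100.0 * failed / total;
    }

    public static void main(String[] args) {
        RollingErrorRate stats = new RollingErrorRate();
        for (int i = 0; i < 8; i++) {
            stats.record(Event.SUCCESS, 100);
        }
        stats.record(Event.ERROR, 200);
        stats.record(Event.TIMEOUT, 300);
        System.out.println(stats.errorPercentage(500));    // prints "20.0"
        System.out.println(stats.errorPercentage(10_500)); // prints "0.0" (window has rolled past)
    }
}
```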

Netflix DependencyCommand Implementation





              Fallbacks

               Cache
         Eventual Consistency
            Stubbed Data
           Empty Response




Rolling Number
Realtime Stats and
 Decision Making




Request Collapsing
Take advantage of resiliency to improve efficiency




Fail fast.
Fail silent.
Fallback.

Shed load.




Questions & More Information


Fault Tolerance in a High Volume, Distributed System
    http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html



         Making the Netflix API More Resilient
   http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html




                        Ben Christensen
                          @benjchristensen
                 http://www.linkedin.com/in/benjchristensen


