Fault Tolerance in a High Volume, Distributed System

More information can be found at http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html

Presentation Transcript

  • Fault Tolerance in a High Volume, Distributed System. Ben Christensen, Software Engineer – API Platform at Netflix. @benjchristensen http://www.linkedin.com/in/benjchristensen 1
  • Dozens of dependencies. One going down takes everything down. 99.99%^30 = 99.7% uptime. 0.3% of 1 billion = 3,000,000 failures. 2+ hours downtime/month even if all dependencies have excellent uptime. Reality is generally worse. 2
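The arithmetic on this slide can be checked with a few lines of Java. This is a minimal sketch: the class and method names are illustrative, and the 30-day month is an assumption used only to turn the failure rate into hours. It shows that 30 dependencies at 99.99% each compound to roughly 99.7%, which the slide rounds to 3,000,000 failures per billion requests and a little over two hours of downtime per month.

```java
// Illustrative sketch of the slide's arithmetic; names are hypothetical.
final class AvailabilityMath {

    /** Availability of a request that needs every one of the dependencies to be up. */
    static double combinedAvailability(double perDependency, int dependencies) {
        return Math.pow(perDependency, dependencies);
    }

    /** Expected number of failed requests out of totalRequests. */
    static double expectedFailures(double availability, double totalRequests) {
        return (1.0 - availability) * totalRequests;
    }

    /** Expected downtime, in hours, over a period of periodHours. */
    static double downtimeHours(double availability, double periodHours) {
        return (1.0 - availability) * periodHours;
    }

    public static void main(String[] args) {
        double combined = combinedAvailability(0.9999, 30);
        System.out.printf("combined availability: %.2f%%%n", combined * 100);
        System.out.printf("failures per 1 billion requests: %.0f%n",
                expectedFailures(combined, 1e9));
        // Assumption: a 30-day month (720 hours).
        System.out.printf("downtime per month: %.2f hours%n",
                downtimeHours(combined, 30 * 24));
    }
}
```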
  • 3
  • 4
  • 5
  • No single dependency should take down the entire app. Fail fast. Fail silent. Fallback. Shed load. 6
  • Options: Aggressive Network Timeouts, Semaphores (Tryable), Separate Threads, Circuit Breaker 7
  • Semaphores (Tryable): Limited Concurrency 10
        TryableSemaphore executionSemaphore = getExecutionSemaphore();
        // acquire a permit
        if (executionSemaphore.tryAcquire()) {
            try {
                return executeCommand();
            } finally {
                executionSemaphore.release();
            }
        } else {
            circuitBreaker.markSemaphoreRejection();
            // permit not available so return fallback
            return getFallback();
        }
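The slide's snippet depends on Netflix classes that are not shown (TryableSemaphore, the circuit breaker). A self-contained sketch of the same idea can be built on the standard java.util.concurrent.Semaphore. SemaphoreGuard and its methods are hypothetical names for illustration, and the rejection counter merely stands in for markSemaphoreRejection().

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Illustrative sketch only: SemaphoreGuard is a hypothetical name, not Netflix code.
final class SemaphoreGuard {
    private final Semaphore permits;
    private final AtomicInteger rejections = new AtomicInteger();

    SemaphoreGuard(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the command if a permit is free right now; otherwise returns the fallback. */
    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (permits.tryAcquire()) {       // never blocks the calling thread
            try {
                return command.get();
            } finally {
                permits.release();        // always hand the permit back
            }
        }
        rejections.incrementAndGet();     // stands in for markSemaphoreRejection()
        return fallback.get();
    }

    int rejections() {
        return rejections.get();
    }

    public static void main(String[] args) {
        SemaphoreGuard guard = new SemaphoreGuard(1);
        System.out.println(guard.execute(() -> "live result", () -> "fallback"));
    }
}
```

Because tryAcquire() returns immediately, a saturated dependency costs the caller nothing more than the fallback; the calling thread is never parked waiting for a permit.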
  • Options: Aggressive Network Timeouts, Semaphores (Tryable), Separate Threads, Circuit Breaker 13
  • Separate Threads: Limited Concurrency 14
        try {
            if (!threadPool.isQueueSpaceAvailable()) {
                // we are at the property defined max so want to throw the RejectedExecutionException to simulate
                // reaching the real max and go through the same codepath and behavior
                throw new RejectedExecutionException("Rejected command because thread-pool queueSize is at rejection threshold.");
            }
            ... define Callable that performs executeCommand() ...
            // submit the work to the thread-pool
            return threadPool.submit(command);
        } catch (RejectedExecutionException e) {
            circuitBreaker.markThreadPoolRejection();
            // rejected so return fallback
            return getFallback();
        }
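A runnable approximation of this pattern needs only the standard ThreadPoolExecutor with a bounded queue. In the sketch below, BoundedPoolCommand is a hypothetical name, the rejection counter stands in for markThreadPoolRejection(), and the pool relies on the executor's default AbortPolicy to throw RejectedExecutionException rather than on the property-driven isQueueSpaceAvailable() check from the slide.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Illustrative sketch only: BoundedPoolCommand is a hypothetical name.
final class BoundedPoolCommand {
    private final ThreadPoolExecutor pool;
    private final AtomicInteger rejections = new AtomicInteger();

    BoundedPoolCommand(int threads, int queueSize) {
        // The default rejection handler (AbortPolicy) throws
        // RejectedExecutionException once all threads and the queue are full.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize));
    }

    /** Submits the command, or returns an already-completed fallback if the pool is full. */
    <T> Future<T> submit(Callable<T> command, Supplier<T> fallback) {
        try {
            return pool.submit(command);
        } catch (RejectedExecutionException e) {
            rejections.incrementAndGet(); // stands in for markThreadPoolRejection()
            return CompletableFuture.completedFuture(fallback.get());
        }
    }

    int rejections() {
        return rejections.get();
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

With one thread and a queue of one, a third concurrent submission is rejected immediately and the caller receives the fallback without waiting.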
  • Separate Threads: Timeout Override of Future.get() 17
        public K get() throws CancellationException, InterruptedException, ExecutionException {
            try {
                long timeout = getCircuitBreaker().getCommandTimeoutInMilliseconds();
                return get(timeout, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // report timeout failure
                circuitBreaker.markTimeout(System.currentTimeMillis() - startTime);
                // retrieve the fallback
                return getFallback();
            }
        }
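The same "wait a bounded time, then fall back" behaviour can be sketched against any standard Future. TimedCall is a hypothetical helper, not the Netflix class. Unlike the slide, which propagates interruption and execution errors to the caller, this simplified version also returns the fallback in those cases, and it cancels the timed-out task so the worker thread is interrupted.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Illustrative sketch only: TimedCall is a hypothetical helper.
final class TimedCall {
    private TimedCall() {}

    /** Waits at most timeoutMillis for the result, then gives up and returns the fallback. */
    static <T> T getWithTimeout(Future<T> future, long timeoutMillis, Supplier<T> fallback) {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);                 // interrupt the worker and move on
            return fallback.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt flag
            return fallback.get();
        } catch (ExecutionException e) {
            return fallback.get();               // simplification: the slide rethrows this
        }
    }
}
```

The caller's latency is then bounded by the timeout regardless of how slowly the dependency responds.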
  • Options: Aggressive Network Timeouts, Semaphores (Tryable), Separate Threads, Circuit Breaker 20
  • Circuit Breaker 21
        if (circuitBreaker.allowRequest()) {
            return executeCommand();
        } else {
            // short-circuit and go directly to fallback
            circuitBreaker.markShortCircuited();
            return getFallback();
        }
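The slide shows only the call site, not how allowRequest() decides. The sketch below is one possible implementation, under assumptions that are not in the slides: the breaker opens after a fixed number of consecutive failures and lets requests through again once a sleep window has elapsed. The consecutive-failure rule, the sleep window, the injectable clock and the name SimpleCircuitBreaker are all illustrative.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Illustrative sketch only: the opening rule and sleep window are assumptions.
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long sleepWindowMillis;
    private final LongSupplier clock;   // injectable millisecond clock, for testability
    private int consecutiveFailures;
    private boolean open;
    private long openedAt;
    private int shortCircuited;

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    /** True when closed, or when open but the sleep window has elapsed. */
    synchronized boolean allowRequest() {
        return !open || clock.getAsLong() - openedAt >= sleepWindowMillis;
    }

    synchronized void markSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    synchronized void markFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = clock.getAsLong();  // (re)start the sleep window
        }
    }

    synchronized void markShortCircuited() {
        shortCircuited++;
    }

    synchronized int shortCircuitedCount() {
        return shortCircuited;
    }

    /** Mirrors the slide's flow: run the command if allowed, otherwise go to the fallback. */
    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (!allowRequest()) {
            markShortCircuited();
            return fallback.get();
        }
        try {
            T result = command.get();
            markSuccess();
            return result;
        } catch (RuntimeException e) {
            markFailure();
            return fallback.get();
        }
    }
}
```

A production breaker would more likely decide from rolling error statistics than from a simple consecutive-failure count; the counter here only keeps the sketch short.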
  • Netflix uses all 4 in combination 24
  • 25
  • Tryable semaphores for “trusted” clients and fallbacks. Separate threads for “untrusted” clients. Aggressive timeouts on threads and network calls to “give up and move on”. Circuit breakers as the “release valve”. 26
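The slides give no code for network timeouts. As a rough illustration of "give up and move on" at the network layer, the standard HttpURLConnection accepts explicit connect and read timeouts. The helper name and the timeout values in the usage note are hypothetical, and the slides do not say which HTTP client Netflix uses.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Illustrative sketch only: AggressiveTimeouts is a hypothetical helper.
final class AggressiveTimeouts {
    private AggressiveTimeouts() {}

    /** Prepares (but does not open) a connection with explicit timeouts. */
    static HttpURLConnection open(String url, int connectTimeoutMillis, int readTimeoutMillis)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(connectTimeoutMillis); // max time to establish the TCP connection
        conn.setReadTimeout(readTimeoutMillis);       // max time to wait for data once connected
        return conn;
    }
}
```

For example, open(url, 250, 1000) would abandon a connection attempt after 250 ms and a stalled read after one second; when either limit is hit the connection throws java.net.SocketTimeoutException, which the caller can route to a fallback.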
  • 27
  • 28
  • 29
  • Benefits of Separate Threads: Protection from client libraries. Lower risk to accept new/updated clients. Quick recovery from failure. Client misconfiguration. Client service performance characteristic changes. Built-in concurrency. 30
  • Drawbacks of Separate Threads: Some computational overhead. Load on machine can be pushed too far. ... Benefits outweigh drawbacks when clients are “untrusted”. 31
  • 32
  • Visualizing Circuits in Realtime (generally sub-second latency). Video available at https://vimeo.com/33576628 33
  • Rolling 10 second counter – 1 second granularity. Median, Mean, 90th, 99th, 99.5th. Latent, Error, Timeout, Rejected. Error Percentage: (error+timeout+rejected)/(success+latent success+error+timeout+rejected) 34
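A rolling counter with this shape can be sketched as ten one-second buckets arranged in a ring, with the error percentage computed from the formula on the slide. RollingCounter, its Event enum and the injectable clock are hypothetical names; the slides do not show how Netflix implements its rolling number, so the bucket layout is an assumption.

```java
import java.util.Arrays;
import java.util.function.LongSupplier;

// Illustrative sketch only: bucket layout and names are assumptions.
final class RollingCounter {
    enum Event { SUCCESS, LATENT_SUCCESS, ERROR, TIMEOUT, REJECTED }

    private static final int BUCKETS = 10;           // 10-second window
    private static final long BUCKET_MILLIS = 1000;  // 1-second granularity

    private final long[][] counts = new long[BUCKETS][Event.values().length];
    private final long[] bucketSecond = new long[BUCKETS]; // second each slot currently holds
    private final LongSupplier clock;                // non-negative millisecond clock

    RollingCounter(LongSupplier clock) {
        this.clock = clock;
        Arrays.fill(bucketSecond, -1L);              // -1 marks an unused slot
    }

    synchronized void increment(Event event) {
        long second = clock.getAsLong() / BUCKET_MILLIS;
        int slot = (int) (second % BUCKETS);
        if (bucketSecond[slot] != second) {          // slot holds an older second: recycle it
            Arrays.fill(counts[slot], 0L);
            bucketSecond[slot] = second;
        }
        counts[slot][event.ordinal()]++;
    }

    /** Sum of one event type over the current second and the nine before it. */
    synchronized long sum(Event event) {
        long current = clock.getAsLong() / BUCKET_MILLIS;
        long total = 0;
        for (int i = 0; i < BUCKETS; i++) {
            if (bucketSecond[i] >= 0 && current - bucketSecond[i] < BUCKETS) {
                total += counts[i][event.ordinal()];
            }
        }
        return total;
    }

    /** (error+timeout+rejected) / (success+latent success+error+timeout+rejected), as a percentage. */
    synchronized double errorPercentage() {
        long failed = sum(Event.ERROR) + sum(Event.TIMEOUT) + sum(Event.REJECTED);
        long total = failed + sum(Event.SUCCESS) + sum(Event.LATENT_SUCCESS);
        return total == 0 ? 0.0 : 100.0 * failed / total;
    }
}
```

Old buckets drop out of the sum as the clock advances, so the percentage reflects only the most recent ten seconds.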
  • Netflix DependencyCommand Implementation 35
  • Netflix DependencyCommand Implementation. Fallbacks: Cache, Eventual Consistency, Stubbed Data, Empty Response 41
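The slide names the fallback options without showing them. Two of them, a cached value and an empty response, can be sketched as a small wrapper around a dependency call. FallbackLookup is a hypothetical class, not the Netflix DependencyCommand, and the order in which the fallbacks are tried is an assumption.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative sketch only: FallbackLookup is a hypothetical class.
final class FallbackLookup {
    private final Map<String, List<String>> lastKnownGood = new ConcurrentHashMap<>();
    private final Function<String, List<String>> dependency;

    FallbackLookup(Function<String, List<String>> dependency) {
        this.dependency = dependency;
    }

    /** Returns fresh data when the dependency works, otherwise degrades gracefully. */
    List<String> get(String key) {
        try {
            List<String> fresh = dependency.apply(key);
            lastKnownGood.put(key, fresh);   // remember for later failures
            return fresh;
        } catch (RuntimeException e) {
            // Fallback 1: possibly stale cached value.
            // Fallback 2: empty response, so callers can still render something.
            return lastKnownGood.getOrDefault(key, Collections.emptyList());
        }
    }
}
```

Stubbed data would slot into the same place as the empty list, returning a fixed default instead of nothing.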
  • Rolling Number: Realtime Stats and Decision Making 44
  • Request Collapsing: Take advantage of resiliency to improve efficiency 45
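The transcript keeps only the title of this slide. Assuming that request collapsing here means batching individual requests into a single call to the dependency, a minimal sketch might look like the following. RequestCollapser is a hypothetical name, and flush() is called by hand to keep the example deterministic; a real collapser would presumably trigger it from a short timer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Illustrative sketch only: an assumed reading of "request collapsing".
final class RequestCollapser<K, V> {
    private final Function<List<K>, Map<K, V>> batchCall;
    private final Map<K, CompletableFuture<V>> pending = new LinkedHashMap<>();

    RequestCollapser(Function<List<K>, Map<K, V>> batchCall) {
        this.batchCall = batchCall;
    }

    /** Queues a request; duplicate keys share one future. */
    synchronized CompletableFuture<V> submit(K key) {
        return pending.computeIfAbsent(key, k -> new CompletableFuture<>());
    }

    /** Sends every queued key in one batch call and completes the waiting futures. */
    void flush() {
        Map<K, CompletableFuture<V>> batch;
        synchronized (this) {
            if (pending.isEmpty()) {
                return;
            }
            batch = new LinkedHashMap<>(pending);
            pending.clear();
        }
        try {
            Map<K, V> results = batchCall.apply(new ArrayList<>(batch.keySet()));
            batch.forEach((key, future) -> future.complete(results.get(key)));
        } catch (RuntimeException e) {
            batch.values().forEach(future -> future.completeExceptionally(e));
        }
    }
}
```

Three submissions for two distinct keys result in one batch call carrying two keys, which is where the efficiency gain would come from.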
  • 47
  • Fail fast. Fail silent. Fallback. Shed load. 48
  • Questions & More Information. Fault Tolerance in a High Volume, Distributed System: http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html Making the Netflix API More Resilient: http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html Ben Christensen, @benjchristensen, http://www.linkedin.com/in/benjchristensen 49