Fault Tolerance in a High Volume, Distributed System

Ben Christensen
Software Engineer – API Platform at Netflix
@benjchristensen
http://www.linkedin.com/in/benjchristensen

Dozens of dependencies.

One going down takes everything down.

99.99%^30 ≈ 99.7% uptime
0.3% of 1 billion = 3,000,000 failures

2+ hours downtime/month
even if all dependencies have excellent uptime.

Reality is generally worse.
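
The arithmetic on this slide can be reproduced with a few lines of Java. This is only a sketch: it assumes 30 independent dependencies at 99.99% availability each and a 30-day month, and the class and method names are made up for illustration. The unrounded failure count comes out just under 3 million; the slide rounds the failure rate to 0.3%.

```java
final class UptimeMath {
    /** Combined availability when all n independent dependencies must be up. */
    static double combinedAvailability(double perDependency, int n) {
        return Math.pow(perDependency, n);
    }

    public static void main(String[] args) {
        double combined = combinedAvailability(0.9999, 30);
        double failureRate = 1.0 - combined;
        System.out.printf("combined availability: %.2f%%%n", combined * 100);
        System.out.printf("failures per 1 billion requests: %,.0f%n", failureRate * 1_000_000_000L);
        // assumption: a 30-day month, i.e. 720 hours
        System.out.printf("downtime hours per month: %.1f%n", failureRate * 30 * 24);
    }
}
```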

No single dependency should
take down the entire app.

Fail fast.
Fail silent.
Fallback.

Shed load.

Options

Aggressive Network Timeouts

Semaphores (Tryable)

Separate Threads

Circuit Breaker

Semaphores (Tryable): Limited Concurrency

TryableSemaphore executionSemaphore = getExecutionSemaphore();
// acquire a permit
if (executionSemaphore.tryAcquire()) {
    try {
        return executeCommand();
    } finally {
        executionSemaphore.release();
    }
} else {
    circuitBreaker.markSemaphoreRejection();
    // permit not available so return fallback
    return getFallback();
}
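
For readers without access to TryableSemaphore, the same pattern can be sketched with the standard java.util.concurrent.Semaphore. SemaphoreGuard and its execute method are hypothetical names, not the Netflix implementation, and the circuit breaker bookkeeping is left out.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class SemaphoreGuard {
    private final Semaphore permits;

    SemaphoreGuard(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        // tryAcquire() never blocks: it returns false at once when no permit is free
        if (permits.tryAcquire()) {
            try {
                return command.get();
            } finally {
                permits.release();
            }
        }
        // concurrency limit reached, so shed load by returning the fallback
        return fallback.get();
    }

    public static void main(String[] args) {
        SemaphoreGuard guard = new SemaphoreGuard(1);
        // The nested call runs while the outer call still holds the only permit,
        // so the inner call is rejected and returns its fallback.
        String outer = guard.execute(
                () -> "outer saw: " + guard.execute(() -> "inner result", () -> "fallback"),
                () -> "fallback");
        System.out.println(outer); // prints "outer saw: fallback"
    }
}
```

The calling thread does the work itself, so there is no thread handoff, but a call that hangs cannot be abandoned. That trade-off is why this option suits "trusted" clients.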

Separate Threads: Limited Concurrency

try {
    if (!threadPool.isQueueSpaceAvailable()) {
        // we are at the property defined max so want to throw the RejectedExecutionException to simulate
        // reaching the real max and go through the same codepath and behavior
        throw new RejectedExecutionException(
                "Rejected command because thread-pool queueSize is at rejection threshold.");
    }
    ... define Callable that performs executeCommand() ...
    // submit the work to the thread-pool
    return threadPool.submit(command);
} catch (RejectedExecutionException e) {
    circuitBreaker.markThreadPoolRejection();
    // rejected so return fallback
}
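
A runnable approximation of this idea, using only the JDK: a ThreadPoolExecutor with a bounded queue rejects work once its threads and queue are full, and the caller turns the rejection into a fallback. IsolatedPool is a hypothetical name. Unlike the slide, this sketch relies on the executor's own rejection (the default AbortPolicy) rather than a property-defined threshold check.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class IsolatedPool {
    private final ThreadPoolExecutor pool;

    IsolatedPool(int threads, int queueSize) {
        // Default AbortPolicy: submit() throws RejectedExecutionException
        // when every thread is busy and the queue is full.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize));
    }

    <T> Future<T> submit(Callable<T> command, T fallback) {
        try {
            return pool.submit(command);
        } catch (RejectedExecutionException e) {
            // rejected, so hand back an already-completed Future holding the fallback
            return CompletableFuture.completedFuture(fallback);
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) throws Exception {
        IsolatedPool isolated = new IsolatedPool(1, 1);
        CountDownLatch release = new CountDownLatch(1);
        Callable<String> slow = () -> { release.await(); return "real"; };

        Future<String> first = isolated.submit(slow, "fallback");  // occupies the only thread
        Future<String> second = isolated.submit(slow, "fallback"); // waits in the queue
        Future<String> third = isolated.submit(slow, "fallback");  // rejected immediately
        System.out.println(third.get()); // prints "fallback"
        release.countDown();
        System.out.println(first.get() + ", " + second.get()); // prints "real, real"
        isolated.shutdown();
    }
}
```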

Separate Threads: Timeout

Override of Future.get()

public K get() throws CancellationException, InterruptedException, ExecutionException {
    try {
        long timeout = getCircuitBreaker().getCommandTimeoutInMilliseconds();
        return get(timeout, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
        // report timeout failure
        circuitBreaker.markTimeout(System.currentTimeMillis() - startTime);

        // retrieve the fallback
    }
}
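
The same "give up and move on" behavior can be sketched outside a Future subclass with the standard Future.get(timeout, unit). TimedCall and callWithTimeout are hypothetical names. Cancelling the task on timeout is a choice made for this sketch; the slide does not say what happens to the running thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedCall {
    static <T> T callWithTimeout(ExecutorService executor, Callable<T> command,
                                 long timeoutMillis, T fallback) throws Exception {
        Future<T> future = executor.submit(command);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // the caller stops waiting; interrupt the worker and return the fallback
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            String fast = callWithTimeout(executor, () -> "real", 1000, "fallback");
            String slow = callWithTimeout(executor,
                    () -> { Thread.sleep(5000); return "real"; }, 50, "fallback");
            // the fast call returns its real value; the slow call times out
            System.out.println(fast + " / " + slow);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The timeout only works because the command runs on a separate thread: the caller can stop waiting even if the client library never returns.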

Circuit Breaker

if (circuitBreaker.allowRequest()) {
} else {
    // short-circuit and go directly to fallback
    circuitBreaker.markShortCircuited();
}
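
The slide does not show how allowRequest() decides. One common approach, sketched here under assumptions, is to open the circuit after a number of consecutive failures and let trial requests through again after a sleep window. SimpleCircuitBreaker, its thresholds, and the injected clock are all illustrative; the dashboard slides suggest the Netflix version trips on an error percentage instead, and this sketch does not limit how many trial requests pass once the window has elapsed.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long sleepWindowMillis;
    private final LongSupplier clock; // injected so the behavior can be tested without sleeping
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    synchronized boolean allowRequest() {
        if (consecutiveFailures < failureThreshold) {
            return true; // circuit closed
        }
        // circuit open: allow a trial request only after the sleep window has elapsed
        return clock.getAsLong() - openedAt >= sleepWindowMillis;
    }

    synchronized void markSuccess() {
        consecutiveFailures = 0;
    }

    synchronized void markFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong(); // (re)start the sleep window
        }
    }

    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // short-circuit and go directly to fallback
        }
        try {
            T result = command.get();
            markSuccess();
            return result;
        } catch (RuntimeException e) {
            markFailure();
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        long[] now = {0L};
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 1000, () -> now[0]);
        Supplier<String> failing = () -> { throw new IllegalStateException("dependency down"); };
        breaker.execute(failing, () -> "fallback");
        breaker.execute(failing, () -> "fallback");
        System.out.println(breaker.allowRequest()); // prints "false": the circuit is open
        now[0] = 1000; // the sleep window has elapsed
        System.out.println(breaker.execute(() -> "real", () -> "fallback")); // prints "real"
    }
}
```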

Netflix uses all 4 in combination

Tryable semaphores for “trusted” clients and fallbacks

Separate threads for “untrusted” clients

Aggressive timeouts on threads and network calls
to “give up and move on”

Circuit breakers as the “release valve”

Benefits of Separate Threads

Protection from client libraries

Lower risk to accept new/updated clients

Quick recovery from failure

Client misconfiguration

Client service performance characteristic changes

Built-in concurrency

Drawbacks of Separate Threads

Some computational overhead

Load on machine can be pushed too far

...

Benefits outweigh drawbacks
when clients are “untrusted”

Visualizing Circuits in Realtime
(generally sub-second latency)

Video available at
https://vimeo.com/33576628

Rolling 10 second counter – 1 second granularity

Median Mean 90th 99th 99.5th

Latent Error Timeout Rejected

Error Percentage =
(error + timeout + rejected) /
(success + latent success + error + timeout + rejected)
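
As a worked instance of the formula above, the sketch below computes the percentage from five counters. HealthCounts and its parameter names are illustrative only, and returning 0 for an empty window is an assumption.

```java
final class HealthCounts {
    /** Error percentage as defined on the dashboard legend. */
    static double errorPercentage(long success, long latentSuccess,
                                  long error, long timeout, long rejected) {
        long failures = error + timeout + rejected;
        long total = success + latentSuccess + failures;
        // assumption: an empty window counts as 0% errors
        return total == 0 ? 0.0 : 100.0 * failures / total;
    }

    public static void main(String[] args) {
        // 90 successes, 5 latent successes, 2 errors, 2 timeouts, 1 rejection:
        // 5 failures out of 100 requests
        System.out.println(errorPercentage(90, 5, 2, 2, 1)); // prints "5.0"
    }
}
```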

Netflix DependencyCommand Implementation

Fallbacks

Cache
Eventual Consistency
Stubbed Data
Empty Response

Rolling Number
Realtime Stats and
Decision Making
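
A rolling counter like the "rolling 10 second counter, 1 second granularity" on the dashboard can be sketched as a ring of per-second buckets. RollingCounter is a hypothetical class, not the Netflix Rolling Number. Timestamps are passed in explicitly (and assumed non-negative) so the example is deterministic.

```java
import java.util.Arrays;

final class RollingCounter {
    private final long[] counts;
    private final long[] seconds; // the second each slot currently represents

    RollingCounter(int windowSeconds) {
        counts = new long[windowSeconds];
        seconds = new long[windowSeconds];
        Arrays.fill(seconds, Long.MIN_VALUE); // no slot holds data yet
    }

    synchronized void increment(long nowMillis) {
        long second = nowMillis / 1000;
        int slot = (int) (second % counts.length);
        if (seconds[slot] != second) {
            // the slot still holds a bucket from an earlier lap of the ring: reuse it
            seconds[slot] = second;
            counts[slot] = 0;
        }
        counts[slot]++;
    }

    synchronized long sum(long nowMillis) {
        // only buckets inside the window ending at the current second are counted
        long oldest = nowMillis / 1000 - counts.length + 1;
        long total = 0;
        for (int i = 0; i < counts.length; i++) {
            if (seconds[i] >= oldest) {
                total += counts[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        RollingCounter errors = new RollingCounter(10);
        errors.increment(0);
        errors.increment(500);
        errors.increment(9_500);
        System.out.println(errors.sum(9_999));  // prints "3": all events are inside the window
        System.out.println(errors.sum(10_000)); // prints "1": the bucket for second 0 has expired
    }
}
```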

Request Collapsing
Take advantage of resiliency to improve efficiency
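
The slides give no implementation detail for request collapsing, so the following is only a sketch of the general idea: callers ask for single items, and pending requests are served by one batch call. RequestCollapser, submit, and flush are hypothetical names. A real collapser would presumably flush on a short timer; here flush() is called by hand to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

final class RequestCollapser<K, V> {
    private final Function<List<K>, Map<K, V>> batchCommand;
    private final Map<K, CompletableFuture<V>> pending = new LinkedHashMap<>();

    RequestCollapser(Function<List<K>, Map<K, V>> batchCommand) {
        this.batchCommand = batchCommand;
    }

    /** Queue a single-item request; duplicate keys share one future. */
    synchronized CompletableFuture<V> submit(K key) {
        return pending.computeIfAbsent(key, k -> new CompletableFuture<>());
    }

    /** Run one batch call for everything queued so far. */
    void flush() {
        Map<K, CompletableFuture<V>> batch;
        synchronized (this) {
            batch = new LinkedHashMap<>(pending);
            pending.clear();
        }
        if (batch.isEmpty()) {
            return;
        }
        try {
            Map<K, V> results = batchCommand.apply(new ArrayList<>(batch.keySet()));
            batch.forEach((key, future) -> future.complete(results.get(key)));
        } catch (RuntimeException e) {
            // a failed batch fails every request that was collapsed into it
            batch.values().forEach(future -> future.completeExceptionally(e));
        }
    }

    public static void main(String[] args) {
        int[] batchCalls = {0};
        RequestCollapser<Integer, String> collapser = new RequestCollapser<>(ids -> {
            batchCalls[0]++;
            Map<Integer, String> out = new HashMap<>();
            for (Integer id : ids) {
                out.put(id, "item-" + id);
            }
            return out;
        });
        CompletableFuture<String> a = collapser.submit(1);
        CompletableFuture<String> b = collapser.submit(2);
        CompletableFuture<String> c = collapser.submit(1); // same key, same future as a
        collapser.flush();
        // three requests are answered by a single batch call
        System.out.println(a.join() + " " + b.join() + " " + c.join()
                + " via " + batchCalls[0] + " batch call");
    }
}
```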

Fail fast.
Fail silent.
Fallback.

Shed load.

Questions & More Information

Fault Tolerance in a High Volume, Distributed System
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html

Making the Netflix API More Resilient
http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html

Ben Christensen
@benjchristensen
http://www.linkedin.com/in/benjchristensen
