Brought to you by
Fail Fast, Retry Soon
Omar Elgabry
Software Engineer at Square
Omar Elgabry
Software Engineer at Square
■ A software engineer, a writer, a hackathon winner, with a polymorphic personality.
■ Born in Egypt, lived and worked in India, Turkey, and currently in Canada (Vancouver
and Toronto).
■ Other jobs I like to do: teaching, farming, gardening, wood work, and babysitting!
■ Blog: https://medium.com/@OmarElgabry
Intro
In distributed systems, a service consists of a fleet of nodes that function as one unit. It is
not uncommon for some nodes to go down, usually for a short time. When this occurs,
failures surface on the client side and can lead to wide-ranging problems.
To build resilient systems, reduce the probability of failure, and increase the app
performance, we’re going to talk about:
■ Timeouts
■ Retries
■ Backoff
■ Jitters
■ Adaptive retries
timeouts
A timeout is the maximum amount of time a client will wait for something to
happen, e.g. for a request to complete.
■ And why should we use timeouts?
● No (or long) timeouts eat resources. While a client is waiting for a request to complete, it holds on
to limited resources (memory, threads, connections) until the response arrives.
● The server can run out of these resources if many client requests hold on to them for a
long time.
■ Timeouts are a best practice not only for remote calls but also for internal calls
across processes on the same machine.
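For example, a minimal Go sketch of a per-request timeout using context.WithTimeout; the HTTP call and the specific budget are illustrative, not from the talk:

    import (
        "context"
        "net/http"
        "time"
    )

    // fetchWithTimeout gives a single request a hard deadline; when it expires,
    // the request is cancelled and its resources (goroutine, connection) are freed.
    func fetchWithTimeout(url string, timeout time.Duration) error {
        ctx, cancel := context.WithTimeout(context.Background(), timeout)
        defer cancel()

        req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
        if err != nil {
            return err
        }
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return err // includes context.DeadlineExceeded when the timeout fires
        }
        return resp.Body.Close()
    }

Usage would look like fetchWithTimeout("https://example.com", 1*time.Second), with the caller deciding how to handle the returned error.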
timeouts
■ What timeout to set?
● Too high → not useful, almost like having no timeout
● Too low → terminates requests early, increases the error rate
One approach is to use the p99 latency of the downstream service as a starting point for
the client's timeout. But, …
● p99 might fluctuate and mightn't be consistent
● p99/max can be much higher than p95/p99 due to a few outliers
● p99 might be almost the same as p95
● client network latency can be high
Goal: Reduce the % of timed-out requests that could eventually have succeeded
(false timeouts)
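To get that starting point, the p99 can be computed from a window of observed latencies. A minimal sketch, assuming we already collect per-request latency samples (the percentile math here is illustrative, not from the talk):

    import (
        "sort"
        "time"
    )

    // p99 returns the 99th-percentile latency from a window of observed samples.
    // Using it as the starting point for the client timeout is a heuristic, not a rule.
    func p99(samples []time.Duration) time.Duration {
        if len(samples) == 0 {
            return 0
        }
        sorted := append([]time.Duration(nil), samples...) // copy so the caller's slice is untouched
        sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
        idx := len(sorted) * 99 / 100
        if idx >= len(sorted) {
            idx = len(sorted) - 1
        }
        return sorted[idx]
    }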
timeouts
■ Why is the request timing out?
● Maybe it's not because the client timeout is too short or the downstream service is taking longer
● Maybe the code is establishing a new connection on each request (see the sketch below)
■ Timeouts might reduce long-hanging requests, and thereby reduce the
consumption of limited resources and overall latency, but timeouts don't
reduce the error rate.
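A minimal sketch of avoiding a new connection per request by sharing one http.Client with a connection pool; the pool sizes and timeout below are illustrative defaults, not recommendations from the talk:

    import (
        "net/http"
        "time"
    )

    // A single shared client reuses keep-alive connections across requests,
    // instead of paying the TCP/TLS handshake cost on every call.
    var client = &http.Client{
        Timeout: 1 * time.Second, // whole-request timeout as a backstop
        Transport: &http.Transport{
            MaxIdleConns:        100,
            MaxIdleConnsPerHost: 10,
            IdleConnTimeout:     90 * time.Second,
        },
    }

Reusing the client across requests keeps the connection-setup cost out of each call's latency budget, so timeouts measure the downstream service rather than handshakes.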
retries
■ Retrying the same (failed) request again often succeeds
● Behind the scenes, systems usually don't fail as a single unit. Instead, partial or transient
failures are more common.
■ Retrying is less useful for deterministic errors, where retrying the
request will almost always fail (see the sketch below).
● In eventually consistent systems, however, a request that errored may succeed if retried later, as
the system state propagates.
■ Retrying is only safe if the operation is idempotent.
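A minimal sketch of separating transient (retryable) from deterministic failures, assuming HTTP status codes; the classification is illustrative, not exhaustive:

    import (
        "context"
        "errors"
        "net/http"
    )

    // isRetryable reports whether a failed call is worth retrying: transient
    // server-side failures and throttling usually are; deterministic client
    // errors (bad request, unauthorized, not found) are not.
    func isRetryable(statusCode int, err error) bool {
        if errors.Is(err, context.DeadlineExceeded) {
            return true // a timeout is usually transient
        }
        switch statusCode {
        case http.StatusTooManyRequests, // 429: throttled, retry after backing off
            http.StatusInternalServerError, // 500
            http.StatusBadGateway, // 502
            http.StatusServiceUnavailable, // 503
            http.StatusGatewayTimeout: // 504
            return true
        }
        return false // e.g. 400/401/403/404 will almost always fail again
    }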
timeouts+retries
A real production use case where the DB max latency went down from >10s to ~500ms and success rates increased
after employing timeouts + retries.
source: https://medium.com/textnowengineering/the-whacking-game-ee3af79c6e13
retries
When partial and transient failures are rare, and the overall number of retried
requests is small, timeouts + retries can improve availability, reduce latency, and
increase the success rate.
But these are the same things that retries can put at risk if not used wisely.
■ Retries consume resources
● Retries trade off limited server resources (memory, CPU, connections) for higher success rates.
● In almost all cases, we should limit the number of client retries.
■ Retries increase load on the downstream service
● … as a result of retrying the failed and timed-out requests. If failures are due to the service being
overloaded, retrying can delay recovery by keeping the downstream service under high load
for a long time.
retries
■ Retries increase load on the downstream service (continued). Examples:
● Hot partition
■ Retrying failures mightn't work, as we're still overwhelming the hot partition
● Multiple service layers
■ When the backend consists of multiple layers of microservices, each retrying
independently, retries multiply: e.g. 3^4 = 81 attempts for 4 layers each retrying 3 times.
● Rate limiting
■ Services such as AWS S3 and Cloudflare have rate limits, so excessive requests will be
throttled.
backoff
A solution to firing retries in rapid succession at a service that is failing because it's overloaded.
Instead of retrying immediately and aggressively, the client waits for some period
between retries.
■ What is the benefit?
● Retrying immediately, when the likely outcome is another failure, wastes resources.
● Backoff gives the downstream service some breathing room to heal when it is already overloaded –
so it is not flooded
■ How long should we wait?
● The most common algorithm is exponential backoff, where the wait time increases
exponentially after every retry.
● Implementations typically cap the backoff at a maximum value to avoid very long waits (a minimal sketch follows).
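A minimal sketch of capped exponential backoff in Go; the 100ms base delay and 10s cap are illustrative values, not from the talk:

    import "time"

    // backoffFor returns the exponential backoff for a given attempt, capped so
    // a long retry sequence never sleeps unreasonably long.
    func backoffFor(attempt int) time.Duration {
        const base = 100 * time.Millisecond
        const maxBackoff = 10 * time.Second
        d := base << attempt          // base * 2^attempt
        if d <= 0 || d > maxBackoff { // the <= 0 check also guards against overflow
            return maxBackoff
        }
        return d
    }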
backoff
■ Backoff just "delays" the retries
● Backoff is insufficient when a service is under constant overload, or in cases of contention.
● If failed requests all back off to the same times, they cause contention or overload again when
they are retried.
jitter
Adds randomness to the backoff (wait time) when retrying a request to spread out the load
and reduce contention.
■ What jitter to use? Two common options, compared below (a small code sketch of both follows):

● Add jitter to the backoff value (most common)
■ Sleep duration: (2^retries * delay) +/- random_number, or (2^retries * delay) * randomization_interval
■ Resource utilization: less resources, because work is spread out due to randomization
■ Time to complete: takes longer to complete, has longer sleep durations
■ When to use: if backing-off retries helps give the downstream service time to heal

● Between zero and the backoff value
■ Sleep duration: random_between(0, 2^retries * delay)
■ Resource utilization: less resources, because work is spread out due to randomization
■ Time to complete: takes less time to complete, since the sleep duration's minimum is 0, i.e. the range is [0, backoff]
■ When to use: if most failures are due to contention and spreading out retries is just what we need
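A minimal sketch of both options in Go; the [0.5x, 1.5x] range for the first option matches the code slides later in this deck:

    import (
        "math/rand"
        "time"
    )

    // Option 1: jitter added around the backoff value – sleep in [backoff*0.5, backoff*1.5].
    func jitterAroundBackoff(backoff time.Duration) time.Duration {
        return backoff/2 + time.Duration(rand.Int63n(int64(backoff)+1))
    }

    // Option 2: "full" jitter between zero and the backoff value – sleep in [0, backoff].
    func jitterFull(backoff time.Duration) time.Duration {
        return time.Duration(rand.Int63n(int64(backoff) + 1))
    }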
jitter
■ Jitter isn't only for retries
● It also spreads out spikes of work from periodic jobs, or any repeated work scheduled at regular
intervals, e.g. cache keys that would otherwise expire around the same time.
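For instance, a minimal sketch of jittering a cache TTL so a batch of keys doesn't expire all at once; the 10% spread is an illustrative choice:

    import (
        "math/rand"
        "time"
    )

    // jitteredTTL spreads expirations by +/-10% around the base TTL, so keys
    // written together don't all expire (and get re-computed) at the same moment.
    func jitteredTTL(base time.Duration) time.Duration {
        spread := int64(base) / 10                // 10% of the base TTL
        delta := rand.Int63n(2*spread+1) - spread // uniform in [-spread, +spread]
        return base + time.Duration(delta)
    }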
Code: timeouts, retries, backoff + jitter
retries := 0
for {
if retries > MaxRetries { return Err } // exceeded max retries
// execute the operation with a timeout (e.g. 1 second)
if err = operation(timeoutCtx); err == nil { return Success } // operation succeeded
// handle failed request
if isPermanentErr(err) { return Err } // stop on permanent errors (unretryable)
// calculate sleep duration (milliseconds)
// sleep = (2^retries * delay) * randomization_interval
exponentialFn := 1 << retries // 2^retries
backoff := exponentialFn * 100 // retry delay = 100ms
Code: timeouts, retries, backoff + jitter
// … to be continued
// get a random sleep duration in the interval [backoff*0.5, backoff*1.5], in ms
// e.g. if backoff = 100ms, sleep is any number in the range [50ms, 150ms]
minInterval := backoff / 2 // backoff*0.5
maxInterval := backoff + (backoff / 2) // backoff*1.5
// rand.Intn(N) returns a random number from 0 to N (exclusive), so we +1
sleep := time.Millisecond *
time.Duration(minInterval + rand.Intn(maxInterval - minInterval + 1))
time.Sleep(sleep) // wait until retry sleep duration has elapsed
retries++ // increment retries for the next retry attempt
}
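Putting the fragments above together: a minimal, runnable sketch under the same assumptions (100ms base delay, [0.5x, 1.5x] jitter). The operation, the max-retries value, and the 1-second timeout are placeholders for illustration:

    package main

    import (
        "context"
        "errors"
        "fmt"
        "math/rand"
        "time"
    )

    const (
        maxRetries  = 3
        baseDelay   = 100 * time.Millisecond
        callTimeout = 1 * time.Second
    )

    var errPermanent = errors.New("permanent error") // placeholder for unretryable failures

    // callWithRetries runs op with a per-attempt timeout, retrying transient
    // failures with exponential backoff plus jitter in [backoff*0.5, backoff*1.5].
    func callWithRetries(op func(ctx context.Context) error) error {
        for retries := 0; ; retries++ {
            if retries > maxRetries {
                return errors.New("exceeded max retries")
            }

            ctx, cancel := context.WithTimeout(context.Background(), callTimeout)
            err := op(ctx)
            cancel()
            if err == nil {
                return nil // operation succeeded
            }
            if errors.Is(err, errPermanent) {
                return err // stop on permanent (unretryable) errors
            }

            backoff := baseDelay << retries // baseDelay * 2^retries
            sleep := backoff/2 + time.Duration(rand.Int63n(int64(backoff)+1)) // [0.5x, 1.5x]
            time.Sleep(sleep)
        }
    }

    func main() {
        err := callWithRetries(func(ctx context.Context) error {
            // placeholder operation: fails transiently most of the time
            if rand.Intn(4) != 0 {
                return errors.New("transient failure")
            }
            return nil
        })
        fmt.Println("result:", err)
    }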
adaptive retries
adaptive
When a large percentage of requests are failing and retries are unsuccessful, as in
longer-running issues, the techniques we've talked about so far aren't sufficient.
This is a signal that future retries are not currently welcome, and that we need to
throttle those retries for some period of time.
■ How to do that?
● We use the token bucket algorithm! This algorithm is widely used in rate limiting to determine
when it is safe to transmit data while complying with rate limits.
● We'll also compare the token bucket algorithm vs the circuit breaker.
adaptive
Token bucket (standard) algorithm
Algorithm: An in-memory bucket holds tokens (just a counter), and periodically a fixed
number of tokens is added to the bucket (by increasing the counter).
■ On each request, the client removes token(s) from the bucket, then completes
the request.
■ If there aren't sufficient tokens, it throttles the request and either drops it or
waits until there are enough tokens to make the request.
Goal: Rate limit the "total number of requests" to the downstream service, i.e. when the error
rate is high, retries drain the token bucket, and future requests are throttled until the bucket
slowly begins to refill.
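A minimal sketch of the standard token bucket described above, assuming a simple single-threaded refill-on-read design (the capacity and refill rate would be tuned per service; this is not production-grade):

    import "time"

    // TokenBucket holds up to capacity tokens and refills at refillRate tokens/sec.
    // Callers remove one token per request; if none are available, the request is throttled.
    type TokenBucket struct {
        capacity   float64
        tokens     float64
        refillRate float64 // tokens added per second
        lastRefill time.Time
    }

    func NewTokenBucket(capacity, refillRate float64) *TokenBucket {
        return &TokenBucket{capacity: capacity, tokens: capacity, refillRate: refillRate, lastRefill: time.Now()}
    }

    // Allow refills the bucket based on elapsed time, then tries to take one token.
    func (b *TokenBucket) Allow() bool {
        now := time.Now()
        b.tokens += now.Sub(b.lastRefill).Seconds() * b.refillRate
        if b.tokens > b.capacity {
            b.tokens = b.capacity
        }
        b.lastRefill = now
        if b.tokens >= 1 {
            b.tokens--
            return true
        }
        return false // throttle: drop the request, or wait and try again later
    }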
adaptive
Token bucket variation algorithm
Algorithm: Instead of adding a fixed number of tokens periodically, we add token(s) on
successful attempts.
■ The client can make initial requests regardless of token availability.
■ If the request succeeds, it adds part of a token into the bucket, say 0.1 of a token.
■ If the call fails, retry up to N times, as long as there are one or more (whole)
token(s) in the bucket.
Goal: Rate limit "retries" when the error rate is above a threshold, by throttling "retries" that
exceed that threshold, i.e. the max number of retries = only 10% of successful attempts.
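A minimal sketch of this variation: successes credit a fraction of a token, and a retry is only allowed while at least one whole token remains (the 0.1 fraction comes from the slide; the bucket size is an illustrative parameter):

    // RetryBucket gates retries (not first attempts) on tokens earned by successes.
    type RetryBucket struct {
        tokens     float64
        maxTokens  float64
        successAdd float64 // fraction of a token earned per success, e.g. 0.1
    }

    func NewRetryBucket(maxTokens, successAdd float64) *RetryBucket {
        return &RetryBucket{tokens: maxTokens, maxTokens: maxTokens, successAdd: successAdd}
    }

    // OnSuccess credits the bucket; with successAdd = 0.1, retries are
    // limited to roughly 10% of successful attempts over time.
    func (b *RetryBucket) OnSuccess() {
        b.tokens += b.successAdd
        if b.tokens > b.maxTokens {
            b.tokens = b.maxTokens
        }
    }

    // AllowRetry spends a whole token per retry; first attempts never call this.
    func (b *RetryBucket) AllowRetry() bool {
        if b.tokens >= 1 {
            b.tokens--
            return true
        }
        return false // error rate is too high: throttle this retry
    }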
adaptive
■ Circuit breaker (CB)
● suffers from modality – it's either retrying or not retrying – and can add extra time to recovery.
● adds no additional load at high failure rates, but has lower success rates past the threshold because it stops all future retries.
■ Token bucket (TB)
● adds some (tunable) additional load at high failure rates, but has higher success rates because its bucket doesn't empty
immediately, so some retries still get through.
■ Both behave like N retries (without throttling) under low error rates.
adaptive
Can we design a better algorithm?
Client libraries have inconsistent behaviour for retries and rate limits across different languages:
■ Rate limits
● The client relies on its own limited knowledge (which requests succeeded or failed) to guess the best action to take.
Yet, the server knows a bit more.
■ Error rate
● The client doesn't know the true failure rate; it relies on its local sampling of failures, which may differ
from the true rate on the server, e.g. in serverless and container-based applications, where clients are short-lived
and each sends fewer requests.
How can we expose some of that server knowledge to clients so that they can make informed
decisions, and thereby behave consistently, without increasing complexity? I'll leave that
exercise for you!
Recap
■ Timeouts prevent client requests from hanging for a long time while holding on to
limited resources.
■ Retries can survive partial and transient failures, and therefore increase the
success rate.
■ Backoff + jitter can improve resource utilization and reduce congestion.
■ Adaptive retries dynamically adjust the request rate in response to high error
rates and unsuccessful retries.
Final Words
What seemed to be an easy problem turned out to be quite hard in distributed
systems, and really depends on the nature and the requirements of the system.
Getting the happy path working is the easy part, but going beyond that is when the
REAL ENGINEERING WORK BEGINS!
Thank You!
Brought to you by
Omar Elgabry
omar.elgabry.93@gmail.com
LinkedIn/omarelgabry
