Fail Fast, Retry Soon
Omar Elgabry
Software Engineer at Square
■ A software engineer, a writer, a hackathon winner, with a polymorphic personality.
■ Born in Egypt, lived and worked in India, Turkey, and currently in Canada (Vancouver
and Toronto).
■ Other jobs I like to do: teaching, farming, gardening, wood work, and babysitting!
■ Blog: https://medium.com/@OmarElgabry
Intro
In distributed systems, a service consists of a fleet of nodes that function as one unit. It is
not uncommon for some nodes to go down, usually for a short time. When this occurs,
failures can surface on the client side and lead to wide-ranging problems.
To build resilient systems, reduce the probability of failure, and improve application
performance, we're going to talk about:
■ Timeouts
■ Retries
■ Backoff
■ Jitters
■ Adaptive retries
timeouts
A timeout is the maximum amount of time that a client will wait for something to
happen, e.g. for a request to complete.
■ Why should we use timeouts?
● Missing or long timeouts eat resources. While a client waits for a request to complete, it holds on
to limited resources (memory, threads, connections).
● A server can run out of these resources if many client requests hold on to them for a
long time.
■ Timeouts are a best practice not only for remote calls but also for
internal calls across processes on the same machine.
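A minimal Go sketch of a client-side timeout, using the standard `context` package. The `slowCall` helper is hypothetical and only stands in for a remote call; the durations are illustrative.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowCall is a hypothetical stand-in for a remote call that needs `work` to finish.
// It returns early with ctx.Err() if the context is cancelled or its deadline passes.
func slowCall(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// callWithTimeout bounds slowCall by timeout, so the caller never holds on to
// its resources for longer than that.
func callWithTimeout(work, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel() // release the timer as soon as the call returns
	return slowCall(ctx, work)
}

// timedOut reports whether err was caused by the deadline expiring.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	// A fast call completes well within its timeout.
	fmt.Println("fast call error:", callWithTimeout(10*time.Millisecond, 200*time.Millisecond))

	// A slow call is cut off once the timeout elapses.
	err := callWithTimeout(500*time.Millisecond, 20*time.Millisecond)
	fmt.Println("slow call error:", err, "timed out:", timedOut(err))
}
```

Passing the context down to the call is what lets the callee stop working, and free its own resources, once the caller has given up.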
timeouts
■ What timeout to set?
● Too high → not useful, almost like having no timeout
● Too low → requests are terminated early, which increases the error
rate
One approach is to use the p99 latency of the downstream
service as a starting point for the client's
timeout. But, …
● p99 might fluctuate and might not be consistent
● p99/max can be much higher than p95/p99 due to some
outliers
● p99 may be almost the same as p95
● client network latency can be high
Goal: Reduce the % of timed-out requests
that could eventually have succeeded (false timeouts)
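As a rough illustration of picking a p99 starting point, the sketch below computes percentiles from recorded latencies with the nearest-rank method. The sample values and the `fromMillis` helper are made up for the example; in practice the numbers would come from the downstream service's metrics.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// fromMillis converts millisecond values into durations (helper for the example).
func fromMillis(ms ...int64) []time.Duration {
	out := make([]time.Duration, len(ms))
	for i, m := range ms {
		out[i] = time.Duration(m) * time.Millisecond
	}
	return out
}

// percentile returns the p-th percentile (1..100) of samples using the
// nearest-rank method. It returns 0 for empty input or an out-of-range p.
func percentile(samples []time.Duration, p int) time.Duration {
	if len(samples) == 0 || p < 1 || p > 100 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...) // copy, keep the input untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := (p*len(sorted) + 99) / 100 // integer form of ceil(p * N / 100)
	return sorted[rank-1]
}

func main() {
	// Nine typical requests and one outlier.
	latencies := fromMillis(12, 15, 11, 14, 13, 16, 12, 18, 17, 900)
	fmt.Println("p50:", percentile(latencies, 50))
	fmt.Println("p90:", percentile(latencies, 90))
	fmt.Println("p99:", percentile(latencies, 99))
}
```

With a single outlier in a small sample, p99 lands on the outlier itself, which is one way the p99 can end up far above the p90 or p95.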
timeouts
■ Why is the request timing out?
● Maybe it is not because the client timeout is too short or the downstream service is taking longer
● Maybe the code is establishing a new connection on each request
■ Timeouts can reduce long-hanging requests, and thereby reduce
consumption of limited resources and overall latency, but timeouts don't
reduce the error rate.
retries
■ Retrying the same (failed) request often succeeds
● Behind the scenes, systems don't often fail as a single unit. Instead, partial or transient
failures are more common.
■ Retrying is less useful for deterministic errors, where the retried
request will almost always fail again.
● In eventually consistent systems, however, a client error might succeed if retried later, as
system state propagates.
■ Retrying is only safe if the operation is idempotent.
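The rules above can be sketched as a small retry decision in Go. Everything here is an assumption for illustration: the `request` type, the two sentinel errors, and the convention that a non-idempotent call becomes safe to retry when it carries an idempotency key. The list of idempotent HTTP methods follows the HTTP specification.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error classes; a real client would map status codes or
// driver errors onto something like these.
var (
	errTransient = errors.New("transient failure")
	errPermanent = errors.New("permanent failure")
)

// request is a hypothetical description of an outgoing call.
type request struct {
	Method         string
	IdempotencyKey string // assumed convention for making POST-like calls retry-safe
}

// isIdempotent reports whether repeating the request is safe.
func isIdempotent(r request) bool {
	switch r.Method {
	case "GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE":
		return true
	}
	return r.IdempotencyKey != ""
}

// shouldRetry allows a retry only for transient errors on idempotent requests.
func shouldRetry(r request, err error) bool {
	return errors.Is(err, errTransient) && isIdempotent(r)
}

func main() {
	fmt.Println(shouldRetry(request{Method: "GET"}, errTransient))
	fmt.Println(shouldRetry(request{Method: "POST"}, errTransient))
	fmt.Println(shouldRetry(request{Method: "POST", IdempotencyKey: "order-123"}, errTransient))
	fmt.Println(shouldRetry(request{Method: "GET"}, errPermanent))
}
```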
timeouts+retries
A real production use case where the DB max latency went down from > 10s to ~500ms and success rates increased
after employing timeouts+retries.
source: https://medium.com/textnowengineering/the-whacking-game-ee3af79c6e13
retries
When partial and transient failures are rare, and the overall number of retried
requests is small, timeouts+retries can improve availability, reduce latency, and
increase success rate.
But these are the same things that retries can put at risk if not used wisely.
■ Retries consume resources
● Retries trade limited server resources (memory, CPU, connections) for higher success rates.
● In almost all cases, we should limit the number of client retries.
■ Retries increase load on the downstream service
● … as a result of retrying the failed and timed-out requests. If failures are due to the service being
overloaded, retrying can delay recovery by keeping the downstream service under high load
for longer.
retries
■ Retries increase load on the downstream service (continued). Examples:
● Hot partition
■ Retrying failures might not work, as we are still overwhelming the hot partition
● Multiple service layers
■ When the backend consists of multiple layers of microservices, each retrying
independently, the load multiplies, e.g. 81x (3^4) for 4 layers each making up to 3 attempts.
● Rate limiting
■ Services such as AWS S3 and Cloudflare have rate limits, so excessive requests will be
throttled.
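The multi-layer amplification is plain arithmetic: if every layer makes up to the same number of attempts per incoming request, the worst case at the bottom layer is attempts^layers. A small Go sketch, assuming "3 times" means 3 attempts in total per layer (the function name is made up for the example):

```go
package main

import "fmt"

// worstCaseAttempts returns how many requests can reach the bottom layer for a
// single top-level request when each of `layers` layers makes up to
// `attemptsPerLayer` attempts and every attempt fails.
func worstCaseAttempts(attemptsPerLayer, layers int) int {
	total := 1
	for i := 0; i < layers; i++ {
		total *= attemptsPerLayer
	}
	return total
}

func main() {
	for layers := 1; layers <= 4; layers++ {
		fmt.Printf("%d layer(s): up to %d requests at the bottom layer\n",
			layers, worstCaseAttempts(3, layers))
	}
}
```

One common way to avoid this multiplication is to retry at a single layer only and let the others fail fast.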
backoff
A solution to retrying in quick succession against a service that is failing because it's overloaded.
Instead of retrying immediately and aggressively, the client waits for some period
between retries.
■ What is the benefit?
● Retrying immediately when the likely outcome is another failure wastes resources.
● Backoff gives an already overloaded downstream service some breathing room to heal,
so it is not flooded
■ How long should we wait?
● The most common algorithm is the exponential backoff, where the wait time increases
exponentially after every retry.
● Implementations typically cap their backoff to a maximum value to avoid long backoff times.
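A minimal sketch of capped exponential backoff in Go. The base delay of 100ms and the 5s cap are illustrative values, not recommendations.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 100 * time.Millisecond // wait before the first retry (illustrative)
	maxDelay  = 5 * time.Second        // cap on any single wait (illustrative)
)

// backoffFor returns baseDelay * 2^retries, capped at maxDelay.
func backoffFor(retries int) time.Duration {
	if retries < 0 {
		retries = 0
	}
	if retries >= 30 {
		return maxDelay // avoid overflowing the shift for very large retry counts
	}
	d := baseDelay << uint(retries)
	if d > maxDelay {
		return maxDelay
	}
	return d
}

func main() {
	for retries := 0; retries < 8; retries++ {
		fmt.Printf("retry %d: wait %v\n", retries, backoffFor(retries))
	}
}
```

Without the cap, the wait after 10 retries would already be 100ms * 2^10, i.e. more than 100 seconds.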
backoff
■ Backoff just "delays" the retries
● Backoff is insufficient when a service is under constant overload or in case of contention.
● When failed requests back off to the same time, they cause contention or overload again when
they are retried.
jitter
Adds randomness to the backoff (wait time) when retrying a request to spread out the load
and reduce contention.
■ What jitter to use? There are two options:
● (A) Add jitter to the backoff value (most common)
● (B) Pick a value between zero and the backoff value
■ Sleep duration
● (A) (2^retries * delay) +/- random_number, or (2^retries * delay) * randomization_interval
● (B) random_between(0, 2^retries * delay)
■ Resource utilization
● fewer resources because work is spread out due to randomization
■ Time to complete
● (A) takes longer to complete, has longer sleep durations
● (B) takes less time to complete; the minimum sleep duration is 0, i.e. the range is [0, backoff]
■ When to use
● (A) if backing off retries helps give the downstream service time to heal
● (B) if most failures are due to contention and spreading out retries is just what we need.
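The two jitter options can be sketched side by side in Go. The function names are made up for the example, and the +/- 50% width of the first variant is an assumption; both use `math/rand`, which is sufficient for spreading load.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitterAround adds jitter to the backoff value: it returns a random duration
// in [backoff/2, backoff/2 + backoff], i.e. roughly backoff +/- 50%.
func jitterAround(backoff time.Duration) time.Duration {
	if backoff <= 0 {
		return 0
	}
	return backoff/2 + time.Duration(rand.Int63n(int64(backoff)+1))
}

// fullJitter picks a value between zero and the backoff value: it returns a
// random duration in [0, backoff].
func fullJitter(backoff time.Duration) time.Duration {
	if backoff <= 0 {
		return 0
	}
	return time.Duration(rand.Int63n(int64(backoff) + 1))
}

func main() {
	const delay = 100 * time.Millisecond
	for retries := 0; retries < 4; retries++ {
		backoff := delay << uint(retries) // 2^retries * delay
		fmt.Printf("retry %d: backoff=%v around=%v full=%v\n",
			retries, backoff, jitterAround(backoff), fullJitter(backoff))
	}
}
```

The first variant never sleeps less than half the backoff, so it still gives the downstream service time to heal; the second can return values close to zero, which spreads retries more widely.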
jitter
■ Jitter isn't only for retries
● Spreads out spikes of work by periodic jobs, or any repeated work scheduled at regular
intervals, e.g. expiring cache keys around the same time.
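For the cache example, one way to apply jitter is to randomize each key's TTL around a base value so that keys written together do not all expire together. A sketch in Go; the function name and the 10% spread are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredTTL returns a TTL chosen uniformly from
// [base*(1-spread), base*(1+spread)]. A spread of 0.1 means +/- 10%.
func jitteredTTL(base time.Duration, spread float64) time.Duration {
	delta := time.Duration(float64(base) * spread)
	if delta <= 0 {
		return base // no spread requested (or base too small to spread)
	}
	return base - delta + time.Duration(rand.Int63n(2*int64(delta)+1))
}

func main() {
	// Five keys cached at the same moment get slightly different expiry times.
	for i := 0; i < 5; i++ {
		fmt.Printf("key-%d expires in %v\n", i, jitteredTTL(time.Minute, 0.1))
	}
}
```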
Code: timeouts, retries, backoff + jitter
package retry

import (
	"context"
	"errors"
	"math/rand"
	"time"
)

const (
	maxRetries = 3   // retries allowed after the first attempt (illustrative value)
	delayMs    = 100 // base retry delay = 100ms
)

// errPermanent is a placeholder for whatever the application treats as unretryable.
var errPermanent = errors.New("permanent error")

func isPermanentErr(err error) bool { return errors.Is(err, errPermanent) }

// Do runs operation with a per-attempt timeout and retries it with backoff + jitter.
func Do(ctx context.Context, operation func(context.Context) error) error {
	for retries := 0; ; retries++ {
		// execute the operation with a timeout (e.g. 1 second)
		timeoutCtx, cancel := context.WithTimeout(ctx, time.Second)
		err := operation(timeoutCtx)
		cancel()
		if err == nil {
			return nil // operation succeeded
		}
		// stop on permanent (unretryable) errors, or once max retries is reached
		if isPermanentErr(err) || retries >= maxRetries {
			return err
		}
		// sleep = (2^retries * delay) * randomization_interval, in milliseconds
		backoff := (1 << retries) * delayMs
		// random sleep in [backoff*0.5, backoff*1.5], e.g. backoff = 100ms gives [50ms, 150ms]
		minInterval := backoff / 2
		maxInterval := backoff + backoff/2
		// rand.Intn(n) returns a number in [0, n), so we +1 to include maxInterval
		sleep := time.Millisecond * time.Duration(minInterval+rand.Intn(maxInterval-minInterval+1))
		time.Sleep(sleep) // wait until the retry sleep duration has elapsed
	}
}
adaptive retries
adaptive
When a large percentage of requests are failing and retries are unsuccessful, as
in cases of longer-running issues, the techniques we talked about aren't sufficient.
This signals that further retries are not currently welcome, and that we need to
throttle any unwelcome retries for some period of time.
■ How to do that?
● We use the token bucket algorithm! This algorithm is widely used in rate limiting to determine
when it is safe to transmit data in compliance with the rate limits.
● We'll also compare the token bucket algorithm with the circuit breaker.
adaptive
Token bucket (standard) algorithm
Algorithm: An in-memory bucket holds tokens (just a counter), and periodically a fixed
number of tokens is added to the bucket (by increasing the counter).
■ On each request, the client removes token(s) from the bucket and completes
the request.
■ If there aren't sufficient tokens, it throttles the request and either drops it or
waits until there are enough tokens to make the request.
Goal: Rate limit the "total number of requests" to the downstream service, i.e. when the error
rate is high, retries drain the token bucket and future requests are throttled until the bucket
slowly begins to refill.
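A minimal sketch of the standard token bucket in Go. To keep the example deterministic, refilling is an explicit `Refill` call; the assumption is that a real client would call it periodically, e.g. from a `time.Ticker`. The type and method names are made up for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// tokenBucket is a counter of available tokens, capped at capacity.
type tokenBucket struct {
	mu       sync.Mutex
	tokens   int
	capacity int
	refill   int // tokens added on every Refill call
}

// newTokenBucket returns a bucket that starts full.
func newTokenBucket(capacity, refill int) *tokenBucket {
	return &tokenBucket{tokens: capacity, capacity: capacity, refill: refill}
}

// Take removes n tokens and returns true, or returns false (throttle the
// request) if fewer than n tokens are available.
func (b *tokenBucket) Take(n int) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens < n {
		return false
	}
	b.tokens -= n
	return true
}

// Refill adds the fixed refill amount, never exceeding capacity.
func (b *tokenBucket) Refill() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tokens += b.refill
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
}

func main() {
	bucket := newTokenBucket(2, 1)
	for i := 1; i <= 3; i++ {
		fmt.Printf("request %d allowed: %v\n", i, bucket.Take(1))
	}
	bucket.Refill() // one period later
	fmt.Println("after refill allowed:", bucket.Take(1))
}
```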
adaptive
Token bucket variation algorithm
Algorithm: Instead of adding a fixed number of tokens periodically, we add token(s) on
successful attempts.
■ The client can make initial requests regardless of token availability.
■ If a request succeeds, it adds part of a token to the token bucket, say 0.1 token.
■ If the call fails, retry up to N times as long as there are one or more (whole)
token(s) in the bucket.
Goal: Rate limit "retries" when the error rate is above a threshold by throttling "retries" that
exceed that threshold, i.e. max number of retries = only 10% of successful attempts.
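A sketch of this variation in Go. It assumes that each retry consumes one whole token, which is what makes retries settle at 10% of successes; tokens are stored as integer thousandths to avoid floating-point drift when adding 0.1 repeatedly. All names and constants are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

const (
	unitsPerToken   = 1000 // one whole token
	unitsPerSuccess = 100  // each success earns 0.1 token
)

// retryBudget tracks how many retries the recent successes have paid for.
type retryBudget struct {
	mu       sync.Mutex
	units    int
	maxUnits int
}

// newRetryBudget returns an empty budget holding at most maxTokens tokens.
func newRetryBudget(maxTokens int) *retryBudget {
	return &retryBudget{maxUnits: maxTokens * unitsPerToken}
}

// OnSuccess credits a fraction of a token for a successful attempt.
func (b *retryBudget) OnSuccess() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.units += unitsPerSuccess
	if b.units > b.maxUnits {
		b.units = b.maxUnits
	}
}

// AllowRetry spends one whole token if available; otherwise the retry is throttled.
// Initial (non-retry) requests never call this and are always sent.
func (b *retryBudget) AllowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.units < unitsPerToken {
		return false
	}
	b.units -= unitsPerToken
	return true
}

func main() {
	budget := newRetryBudget(10)
	fmt.Println("retry with empty budget:", budget.AllowRetry())
	for i := 0; i < 10; i++ {
		budget.OnSuccess()
	}
	fmt.Println("retry after 10 successes:", budget.AllowRetry())
	fmt.Println("second retry:", budget.AllowRetry())
}
```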
adaptive
■ Circuit breaker (CB)
● suffers from modality: it's either retrying or not retrying, which can add time to recovery.
● adds no load at high failure rates, but has lower success rates past the threshold as it stops all future retries.
■ Token Bucket (TB)
● adds some (tunable) load at high failure rates, but has higher success rates as it doesn't deplete its bucket
fast enough to stop all retries.
■ Both behave like N retries (without throttling) under low error rates.
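To make the modality concrete, here is a deliberately simplified circuit breaker in Go. It is an assumption-laden sketch: it opens after a number of consecutive failures and closes again after a fixed cooldown, with no half-open state, and it is not safe for concurrent use. The `fakeClock` type exists only so the example runs without sleeping.

```go
package main

import (
	"fmt"
	"time"
)

// fakeClock is a controllable clock used instead of time.Now in the example.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time          { return c.t }
func (c *fakeClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

// breaker opens after `threshold` consecutive failures and rejects every
// call until `cooldown` has elapsed.
type breaker struct {
	threshold int
	cooldown  time.Duration
	now       func() time.Time
	failures  int
	open      bool
	openedAt  time.Time
}

func newBreaker(threshold int, cooldown time.Duration, now func() time.Time) *breaker {
	return &breaker{threshold: threshold, cooldown: cooldown, now: now}
}

// Allow reports whether a call (or retry) may be attempted right now.
func (b *breaker) Allow() bool {
	if b.open && b.now().Sub(b.openedAt) >= b.cooldown {
		b.open = false // cooldown is over: let traffic through again
		b.failures = 0
	}
	return !b.open
}

// Record feeds the outcome of a call back into the breaker.
func (b *breaker) Record(success bool) {
	if success {
		b.failures = 0
		return
	}
	b.failures++
	if b.failures >= b.threshold {
		b.open = true
		b.openedAt = b.now()
	}
}

func main() {
	clock := &fakeClock{}
	cb := newBreaker(3, 10*time.Second, clock.Now)
	for i := 0; i < 3; i++ {
		cb.Record(false)
	}
	fmt.Println("after 3 failures, allowed:", cb.Allow())
	clock.Advance(10 * time.Second)
	fmt.Println("after cooldown, allowed:", cb.Allow())
}
```

While the breaker is open, nothing is sent even if the downstream service has already recovered, which is the extra recovery time mentioned above.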
adaptive
Can we design a better algorithm?
Client libraries behave inconsistently for retries and rate limits across different languages:
■ Rate limits
● The client relies on its limited knowledge (requests succeeded or failed) to guess the best action to take.
Yet, the server knows a bit more.
■ Error rate
● The client doesn't know the true failure rate. It relies on its local sampling of the failure rate, which may differ
from the true rate on the server, e.g. in serverless and container-based applications, where clients are short-lived
and each sends fewer requests.
How can we expose some of that server knowledge to clients so that they can make informed
decisions, and thereby behave consistently, without increasing complexity? I'll leave that
exercise for you!
Recap
■ Timeouts keep client requests from hanging for a long time while holding on to
limited resources.
■ Retries can survive partial and transient failures, and therefore increase the
success rate.
■ Backoff + Jitter can improve resource utilization and reduce congestion.
■ Adaptive Retries dynamically adjust request rates in response to high error
rates and unsuccessful retries.
Final Words
What seemed to be an easy problem turned out to be quite hard in distributed
systems, and it really depends on the nature and the requirements of the system.
Getting the happy path working is the easy part, but going beyond that is when the
REAL ENGINEERING WORK BEGINS!
Thank You!
Omar Elgabry
omar.elgabry.93@gmail.com
LinkedIn/omarelgabry

More Related Content

What's hot

Everything you always wanted to know about Redis but were afraid to ask
Everything you always wanted to know about Redis but were afraid to askEverything you always wanted to know about Redis but were afraid to ask
Everything you always wanted to know about Redis but were afraid to ask
Carlos Abalde
 
C# as a System Language
C# as a System LanguageC# as a System Language
C# as a System Language
ScyllaDB
 
libSQL
libSQLlibSQL
libSQL
ScyllaDB
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
confluent
 
eBPF - Rethinking the Linux Kernel
eBPF - Rethinking the Linux KerneleBPF - Rethinking the Linux Kernel
eBPF - Rethinking the Linux Kernel
Thomas Graf
 
Analyze Virtual Machine Overhead Compared to Bare Metal with Tracing
Analyze Virtual Machine Overhead Compared to Bare Metal with TracingAnalyze Virtual Machine Overhead Compared to Bare Metal with Tracing
Analyze Virtual Machine Overhead Compared to Bare Metal with Tracing
ScyllaDB
 
Seastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for CephSeastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for Ceph
ScyllaDB
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
ScyllaDB
 
P99 Pursuit: 8 Years of Battling P99 Latency
P99 Pursuit: 8 Years of Battling P99 LatencyP99 Pursuit: 8 Years of Battling P99 Latency
P99 Pursuit: 8 Years of Battling P99 Latency
ScyllaDB
 
Introduction to eBPF
Introduction to eBPFIntroduction to eBPF
Introduction to eBPF
RogerColl2
 
OVS v OVS-DPDK
OVS v OVS-DPDKOVS v OVS-DPDK
OVS v OVS-DPDK
Md Safiyat Reza
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
Dremio Corporation
 
Building an Event Streaming Architecture with Apache Pulsar
Building an Event Streaming Architecture with Apache PulsarBuilding an Event Streaming Architecture with Apache Pulsar
Building an Event Streaming Architecture with Apache Pulsar
ScyllaDB
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)
Brendan Gregg
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationDataWorks Summit
 
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB,  or how we implemented a 10-times faster CassandraSeastar / ScyllaDB,  or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
Tzach Livyatan
 
Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016
Brendan Gregg
 
Improving Kafka at-least-once performance at Uber
Improving Kafka at-least-once performance at UberImproving Kafka at-least-once performance at Uber
Improving Kafka at-least-once performance at Uber
Ying Zheng
 

What's hot (20)

Everything you always wanted to know about Redis but were afraid to ask
Everything you always wanted to know about Redis but were afraid to askEverything you always wanted to know about Redis but were afraid to ask
Everything you always wanted to know about Redis but were afraid to ask
 
C# as a System Language
C# as a System LanguageC# as a System Language
C# as a System Language
 
libSQL
libSQLlibSQL
libSQL
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
 
eBPF - Rethinking the Linux Kernel
eBPF - Rethinking the Linux KerneleBPF - Rethinking the Linux Kernel
eBPF - Rethinking the Linux Kernel
 
Analyze Virtual Machine Overhead Compared to Bare Metal with Tracing
Analyze Virtual Machine Overhead Compared to Bare Metal with TracingAnalyze Virtual Machine Overhead Compared to Bare Metal with Tracing
Analyze Virtual Machine Overhead Compared to Bare Metal with Tracing
 
Seastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for CephSeastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for Ceph
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
 
P99 Pursuit: 8 Years of Battling P99 Latency
P99 Pursuit: 8 Years of Battling P99 LatencyP99 Pursuit: 8 Years of Battling P99 Latency
P99 Pursuit: 8 Years of Battling P99 Latency
 
HBase Low Latency
HBase Low LatencyHBase Low Latency
HBase Low Latency
 
Introduction to eBPF
Introduction to eBPFIntroduction to eBPF
Introduction to eBPF
 
OVS v OVS-DPDK
OVS v OVS-DPDKOVS v OVS-DPDK
OVS v OVS-DPDK
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Building an Event Streaming Architecture with Apache Pulsar
Building an Event Streaming Architecture with Apache PulsarBuilding an Event Streaming Architecture with Apache Pulsar
Building an Event Streaming Architecture with Apache Pulsar
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux Configuration
 
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB,  or how we implemented a 10-times faster CassandraSeastar / ScyllaDB,  or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
 
Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016
 
Improving Kafka at-least-once performance at Uber
Improving Kafka at-least-once performance at UberImproving Kafka at-least-once performance at Uber
Improving Kafka at-least-once performance at Uber
 

Similar to Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique

Overview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practicesOverview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practices
Ashutosh Agarwal
 
Building resilient applications
Building resilient applicationsBuilding resilient applications
Building resilient applications
Nuno Caneco
 
Low latency for high throughput
Low latency for high throughputLow latency for high throughput
Low latency for high throughput
Peter Lawrey
 
Process management in Operating System_Unit-2
Process management in Operating System_Unit-2Process management in Operating System_Unit-2
Process management in Operating System_Unit-2
mohanaps
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slides
Ed Hunter
 
Scylla Summit 2018: Worry-free ingestion - flow-control of writes in Scylla
Scylla Summit 2018: Worry-free ingestion - flow-control of writes in ScyllaScylla Summit 2018: Worry-free ingestion - flow-control of writes in Scylla
Scylla Summit 2018: Worry-free ingestion - flow-control of writes in Scylla
ScyllaDB
 
Amazon builder Library notes
Amazon builder Library notesAmazon builder Library notes
Amazon builder Library notes
Diego Pacheco
 
Diesel load testing tool
Diesel load testing toolDiesel load testing tool
Diesel load testing tool
Syed Zaid Irshad
 
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache KafkaStrata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
confluent
 
Tef con2016 (1)
Tef con2016 (1)Tef con2016 (1)
Tef con2016 (1)
ggarber
 
Async discussion 9_29_15
Async discussion 9_29_15Async discussion 9_29_15
Async discussion 9_29_15
Cheryl Yaeger
 
Flow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECT
Flow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECTFlow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECT
Flow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECT
Sabrina Marechal
 
Bosun Monitoring Talk at LISA14
Bosun Monitoring Talk at LISA14Bosun Monitoring Talk at LISA14
Bosun Monitoring Talk at LISA14
Kyle Brandt
 
Cw13 0.01-final
Cw13 0.01-finalCw13 0.01-final
Cw13 0.01-final
asm100
 
Retransmission Tcp
Retransmission TcpRetransmission Tcp
Retransmission Tcp
Ram Dutt Shukla
 
Resilient service to-service calls in a post-Hystrix world
Resilient service to-service calls in a post-Hystrix worldResilient service to-service calls in a post-Hystrix world
Resilient service to-service calls in a post-Hystrix world
Rares Musina
 
FreeRTOS Slides annotations.pdf
FreeRTOS Slides annotations.pdfFreeRTOS Slides annotations.pdf
FreeRTOS Slides annotations.pdf
AbdElrahmanMostafa87
 
What to do when detect deadlock
What to do when detect deadlockWhat to do when detect deadlock
What to do when detect deadlock
Syed Zaid Irshad
 
Process scheduling
Process schedulingProcess scheduling
Process scheduling
Hao-Ran Liu
 
Scheduling algorithms
Scheduling algorithmsScheduling algorithms
Scheduling algorithms
Paurav Shah
 

Similar to Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique (20)

Overview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practicesOverview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practices
 
Building resilient applications
Building resilient applicationsBuilding resilient applications
Building resilient applications
 
Low latency for high throughput
Low latency for high throughputLow latency for high throughput
Low latency for high throughput
 
Process management in Operating System_Unit-2
Process management in Operating System_Unit-2Process management in Operating System_Unit-2
Process management in Operating System_Unit-2
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slides
 
Scylla Summit 2018: Worry-free ingestion - flow-control of writes in Scylla
Scylla Summit 2018: Worry-free ingestion - flow-control of writes in ScyllaScylla Summit 2018: Worry-free ingestion - flow-control of writes in Scylla
Scylla Summit 2018: Worry-free ingestion - flow-control of writes in Scylla
 
Amazon builder Library notes
Amazon builder Library notesAmazon builder Library notes
Amazon builder Library notes
 
Diesel load testing tool
Diesel load testing toolDiesel load testing tool
Diesel load testing tool
 
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache KafkaStrata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
 
Tef con2016 (1)
Tef con2016 (1)Tef con2016 (1)
Tef con2016 (1)
 
Async discussion 9_29_15
Async discussion 9_29_15Async discussion 9_29_15
Async discussion 9_29_15
 
Flow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECT
Flow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECTFlow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECT
Flow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECT
 
Bosun Monitoring Talk at LISA14
Bosun Monitoring Talk at LISA14Bosun Monitoring Talk at LISA14
Bosun Monitoring Talk at LISA14
 
Cw13 0.01-final
Cw13 0.01-finalCw13 0.01-final
Cw13 0.01-final
 
Retransmission Tcp
Retransmission TcpRetransmission Tcp
Retransmission Tcp
 
Resilient service to-service calls in a post-Hystrix world
Resilient service to-service calls in a post-Hystrix worldResilient service to-service calls in a post-Hystrix world
Resilient service to-service calls in a post-Hystrix world
 
FreeRTOS Slides annotations.pdf
FreeRTOS Slides annotations.pdfFreeRTOS Slides annotations.pdf
FreeRTOS Slides annotations.pdf
 
What to do when detect deadlock
What to do when detect deadlockWhat to do when detect deadlock
What to do when detect deadlock
 
Process scheduling
Process schedulingProcess scheduling
Process scheduling
 
Scheduling algorithms
Scheduling algorithmsScheduling algorithms
Scheduling algorithms
 

More from ScyllaDB

Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
ScyllaDB
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream Processing
ScyllaDB
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
ScyllaDB
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
ScyllaDB
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
ScyllaDB
 
What Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLWhat Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQL
ScyllaDB
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & Pitfalls
ScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
ScyllaDB
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
ScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
ScyllaDB
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
ScyllaDB
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
ScyllaDB
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDB
ScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
ScyllaDB
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
ScyllaDB
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDB
ScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
ScyllaDB
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
ScyllaDB
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
ScyllaDB
 
ScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB Virtual Workshop
ScyllaDB Virtual Workshop
ScyllaDB
 

Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique

  • 1. Brought to you by Fail Fast, Retry Soon Omar Elgabry Software Engineer at Square
  • 2. Omar Elgabry Software Engineer at Square ■ A software engineer, a writer, a hackathon winner, with a polymorphic personality. ■ Born in Egypt, lived and worked in India, Turkey, and currently in Canada (Vancouver and Toronto). ■ Other jobs I like to do: teaching, farming, gardening, wood work, and babysitting! ■ Blog: https://medium.com/@OmarElgabry
  • 3. Intro In distributed systems, services consist of a fleet of nodes that function as one unit. It is not uncommon for some nodes to go down, usually for a short time. When this occurs, failures can happen on the client side and can lead to wide-ranging problems. To build resilient systems, reduce the probability of failure, and increase app performance, we’re going to talk about: ■ Timeouts ■ Retries ■ Backoff ■ Jitter ■ Adaptive retries
  • 5. timeouts A timeout is the maximum amount of time that a client waits for something to happen, e.g. a request to complete. ■ Why should we use timeouts? ● Missing or long timeouts eat resources. While a client is waiting for a request to complete, it holds on to limited resources (memory, threads, connections). ● The server can run out of these resources if many client requests hold on to them for a long time. ■ Timeouts are a best practice not only for remote calls but also for internal calls across processes on the same machine.
  • 6. timeouts ■ What timeout to set? ● Too high → not useful, almost like no timeout ● Too low → terminates requests early, increases the error rate One approach is to use the p99 of the downstream service as a starting point for the client's timeout. But, … ● p99 might fluctuate and might not be consistent ● p99/max can be much higher than p95/p99 due to some outliers ● p99 is almost like p95 ● high client network latency Goal: Reduce the percentage of timed-out requests that could eventually have succeeded (false timeouts)
  • 7. timeouts ■ Why is the request timing out? ● Maybe it is not because the client timeout is too short or the downstream service is taking longer ● Maybe the code is establishing a new connection on each request ■ Timeouts might reduce long-hanging requests, and thereby reduce consumption of limited resources and overall latency, but timeouts don't reduce the error rate.
  • 9. retries ■ Retrying the same (failed) request often succeeds ● Behind the scenes, systems don't often fail as a single unit. Instead, partial or transient failures are more common. ■ Retrying is less useful in cases of deterministic errors, where retrying the request will almost always fail. ● In eventually consistent systems, however, a request that hit a client error might succeed if retried later, as system state propagates. ■ Retrying is only safe if an operation is idempotent.
  • 10. timeouts+retries A real production use case where the DB max latency went down from > 10s to ~500ms and success rates increased after employing timeouts+retries. source: https://medium.com/textnowengineering/the-whacking-game-ee3af79c6e13
  • 11. retries When partial and transient failures are rare, and the overall number of retried requests is small, timeouts+retries can improve availability, reduce latency, and increase success rate. But these are the same things that retries can put at risk if not used wisely. ■ Retries consume resources ● Retries trade limited server resources (memory, CPU, connections) for higher success rates. ● In almost all cases, we should limit the number of client retries. ■ Retries increase load on the downstream service ● … as a result of retrying the failed and timed-out requests. If failures are due to the service being overloaded, retrying can delay recovery by keeping the downstream service under high load for longer.
  • 12. retries ■ Retries increase load on the downstream service (continued). Examples: ● Hot partition ■ Retrying failures might not work, as we are still overwhelming the hot partition ● Multiple service layers ■ When the backend consists of multiple layers of microservices, each retrying independently, the load multiplies, e.g. 81x (3^4) for 4 layers each retrying 3 times. ● Rate limiting ■ Services such as AWS S3 and Cloudflare have rate limits, so excessive requests will be throttled.
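The 81x figure follows from multiplying the attempts made at each layer. The sketch below assumes "retrying 3 times" means each layer makes up to 3 attempts per incoming call; `worstCaseAttempts` is a hypothetical helper for the arithmetic only.

```go
package main

// worstCaseAttempts returns how many calls can reach the deepest service for a
// single user request when each of `layers` layers independently makes up to
// `attemptsPerLayer` attempts per incoming call.
func worstCaseAttempts(attemptsPerLayer, layers int) int {
	total := 1
	for i := 0; i < layers; i++ {
		total *= attemptsPerLayer // every layer multiplies the load below it
	}
	return total
}
```

With 3 attempts per layer and 4 layers this gives 3 * 3 * 3 * 3 = 81, which is why retrying at a single layer, or capping retries per layer, is a common way to limit amplification.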
  • 14. backoff A solution to successive retries against a service that is failing because it’s overloaded. Instead of retrying immediately and aggressively, the client waits for some period between retries. ■ What is the benefit? ● Retrying immediately when the likely outcome is another failure wastes resources. ● Backoff gives the downstream service some breathing time to heal when it is already overloaded, so it is not flooded ■ How long should we wait? ● The most common algorithm is exponential backoff, where the wait time increases exponentially after every retry. ● Implementations typically cap their backoff at a maximum value to avoid long backoff times.
  • 15. backoff ■ Backoff just "delays" the retries ● Backoff is insufficient when a service is under constant overload or in case of contention. ● When failed requests back off for the same duration, they are retried at the same time and cause contention or overload again.
  • 17. jitter Adds randomness to the backoff (wait time) when retrying a request to spread out the load and reduce contention.
      ■ What jitter to use?
        ● Add jitter to the backoff value (most common)
          ■ Sleep duration: (2^retries * delay) +/- random_number, or (2^retries * delay) * randomization_interval
          ■ Time to complete: takes longer to complete, has longer sleep durations
          ■ When to use: if backing off retries helps give the downstream service time to heal
        ● Between zero and the backoff value
          ■ Sleep duration: random_between(0, 2^retries * delay)
          ■ Time to complete: takes less time to complete; the minimum sleep duration is 0, i.e. the range is [0, backoff]
          ■ When to use: if most failures are due to contention and spreading out retries is just what we need
      ■ Resource utilization: fewer resources, because work is spread out due to randomization
  • 18. jitter ■ Jitter isn't only for retries ● Spreads out spikes of work by periodic jobs, or any repeated work scheduled at regular intervals, e.g. expiring cache keys around the same time.
  • 19. Code: timeouts, retries, backoff + jitter
      retries := 0
      for {
          if retries > MaxRetries { return Err } // exceeded max retries

          // execute the operation with a timeout (e.g. 1 second)
          if err = operation(timeoutCtx); err == nil { return Success } // operation succeeded

          // handle failed request
          if isPermanentErr(err) { return Err } // stop on permanent errors (unretryable)

          // calculate sleep duration (milliseconds)
          // sleep = (2^retries * delay) * randomization_interval
          exponentialFn := 1 << retries  // 2^retries
          backoff := exponentialFn * 100 // retry delay = 100ms
  • 20. Code: timeouts, retries, backoff + jitter
          // … to be continued
          // get a random sleep duration between interval [backoff*0.5, backoff*1.5] in ms
          // for e.g. if backoff = 100ms, sleep is any number in the range [50ms, 150ms]
          minInterval := backoff / 2             // backoff*0.5
          maxInterval := backoff + (backoff / 2) // backoff*1.5

          // rand.Intn() returns a rand number from 0 to N (exclusive) so we +1
          sleep := time.Millisecond * time.Duration(minInterval + rand.Intn(maxInterval - minInterval + 1))
          time.Sleep(sleep) // wait until retry sleep duration has elapsed

          retries++ // increment retries for the next retry attempt
      }
  • 22. adaptive When a large percentage of requests are failing and retries are unsuccessful, as in cases of longer-running issues, the techniques we talked about aren't sufficient. This signals that further retries are not currently welcome, and that we need to throttle unwelcome retries for some period of time. ■ How to do that? ● We use the token bucket algorithm! This algorithm is widely used in rate limiting to determine when it is safe to transmit data in compliance with the rate limits. ● We’ll also compare the token bucket algorithm with the circuit breaker.
  • 23. adaptive Token bucket (standard) algorithm Algorithm An in-memory bucket holds tokens (just a counter), and periodically a fixed number of tokens is added to the bucket (by increasing the counter) ■ On each request, the client removes token(s) from the bucket and completes the request. ■ If there aren't sufficient tokens, it throttles the request and either drops it or waits until there are enough tokens to make the request. Goal Rate limit the “total number of requests” to the downstream service, i.e. when the error rate is high, retries drain the token bucket, and future requests are throttled until the bucket slowly begins to refill.
  • 24. adaptive Token bucket variation algorithm Algorithm Instead of adding a fixed amount of tokens periodically, we add token(s) on successful attempts. ■ The client can make initial requests regardless of token availability. ■ If a request succeeds, it adds part of a token to the token bucket, say 0.1 token. ■ If the call fails, it retries up to N times as long as there are one or more (whole) token(s) in the bucket. Goal Rate limit "retries" when the error rate is above a threshold by throttling "retries" that exceed that threshold, i.e. max number of retries = only 10% of successful attempts.
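The variation can be sketched as a retry budget that earns a fraction of a token per success and spends tokens only on retries. This sketch assumes each retry spends one whole token, and it stores the balance in thousandths of a token so that ten deposits of 0.1 add up to exactly 1 (floating-point sums of 0.1 do not). The names are hypothetical, and the type is not safe for concurrent use without a mutex.

```go
package main

// milliPerToken is the number of balance units in one whole token.
const milliPerToken = 1000

// RetryBudget throttles retries only: first attempts are always allowed,
// successes earn a fraction of a token, and each retry spends a whole token.
type RetryBudget struct {
	milli      int // current balance, in thousandths of a token
	maxMilli   int // upper bound on the balance
	perSuccess int // thousandths of a token earned per successful attempt
}

// NewRetryBudget returns an empty budget. perSuccessMilli = 100 corresponds
// to the 0.1 token per success used on the slide.
func NewRetryBudget(maxTokens, perSuccessMilli int) *RetryBudget {
	return &RetryBudget{maxMilli: maxTokens * milliPerToken, perSuccess: perSuccessMilli}
}

// OnSuccess deposits the per-success fraction, capped at the maximum.
func (b *RetryBudget) OnSuccess() {
	b.milli += b.perSuccess
	if b.milli > b.maxMilli {
		b.milli = b.maxMilli
	}
}

// AllowRetry spends one whole token if available and reports whether the
// retry may proceed.
func (b *RetryBudget) AllowRetry() bool {
	if b.milli < milliPerToken {
		return false
	}
	b.milli -= milliPerToken
	return true
}
```

With 0.1 token per success, at most one retry is permitted per ten successes over the long run, which matches the 10% cap described above. Whether the bucket starts empty or pre-filled is a tuning choice; this sketch starts empty.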
  • 25. adaptive ■ Circuit breaker (CB) ● suffers from modality: it's either retrying or not retrying, and can introduce additional time to recovery. ● has no additional load at high failure rates, but lower success rates after the threshold, as it stops all future retries. ■ Token Bucket (TB) ● has some (tunable) additional load at high failure rates, but higher success rates, as it doesn't deplete its bucket fast enough. ■ Both behave like N retries (without throttling) under low error rates.
  • 26. adaptive Can we design a better algorithm? Client libraries have inconsistent behaviour for retries and rate limits across different languages: ■ Rate limits ● The client relies on its limited knowledge (requests succeeded or failed) to guess the best action to take. Yet the server knows a bit more. ■ Error rate ● The client doesn't know the true failure rate; it relies on its local sampling of the failure rate, which may differ from the true rate on the server, e.g. in serverless and container-based applications, where clients are short-lived and each sends fewer requests. How can we expose some of that server knowledge to clients so that clients can make informed decisions, and thereby behave consistently, without increasing complexity? I'll leave that exercise for you!
  • 27. Recap ■ Timeouts keep client requests from hanging for long while holding on to limited resources. ■ Retries can survive partial and transient failures, and therefore increase the success rate. ■ Backoff + Jitter can improve resource utilization and reduce congestion. ■ Adaptive Retries dynamically adjust request rates in response to high error rates and unsuccessful retries.
  • 28. Final Words What seemed to be an easy problem turned out to be quite hard in distributed systems, and the answer really depends on the nature and the requirements of the system. Getting the happy path working is the easy part, but going beyond that is when the REAL ENGINEERING WORK BEGINS!
  • 30. Brought to you by Omar Elgabry omar.elgabry.93@gmail.com LinkedIn/omarelgabry