Hopperx1 Seattle 2019 - Don't Let Clients Get You Down
1. Don't Let Clients Get You Down
Configuring Fault-Tolerant Clients in Resilient Systems
Clare Liguori
Principal Engineer, Amazon Web Services
#Hopperx1Seattle
Intro: I’m Clare
Why do clients matter to resiliency?
Example multi-tier system with microservices
There are lots of clients in this system
Drill into service: load balancer + multiple servers
What happens when a single server has problems?
Hopefully something like health checks realizes it's down and removes it from the load balancer
BUT until then... clients are still going to that server!
And the failure cascades all the way up the system!
So, how can we configure clients to be resilient to failures, so that our entire system is resilient?
Today we’ll talk about three important aspects of client resiliency
Let’s start with retries
Retries are self-explanatory
Retrying likely gets request directed to healthy node
Real world example: Installing npm modules during a cron job
Enable retries in SSH
Set retry limits (3-5), don’t make a bad situation worse
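A minimal sketch of a bounded retry loop in Python (the function name `call_with_retries` and the delay value are illustrative, not from the talk):

```python
import time

def call_with_retries(request_fn, max_attempts=3, delay=1.0):
    """Retry a transiently failing call, capped at max_attempts.

    request_fn stands in for any operation that may fail transiently
    (an HTTP request, an npm install step, etc.).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts:
                raise  # bounded: don't make a bad situation worse
            time.sleep(delay)  # fixed pause; backoff and jitter come later
```

The cap matters: an unbounded retry loop against an overloaded service only adds load.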
What happens when an API doesn’t de-duplicate requests?
For example, if a network connection drops, you don’t have confirmation of whether the request is still alive on that server
The servers may have just been processing requests slowly, and may eventually complete them. Idempotent APIs have a field like an "idempotency token" or "retry token," where you provide a token signifying that only ONE request with this token should be processed.
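A sketch of how a client uses an idempotency token (the `api_call` function, field name, and payload are hypothetical; the actual field name varies by API). The key point is that the token is generated once and reused on every retry:

```python
import uuid

def create_order(api_call, payload, max_attempts=3):
    """Reuse one idempotency token across retries so the server
    processes the order at most once, even if a retried request
    duplicates one that was actually still in flight."""
    token = str(uuid.uuid4())  # generated ONCE, not per attempt
    for attempt in range(max_attempts):
        try:
            return api_call({**payload, "idempotency_token": token})
        except ConnectionError:
            continue  # safe to retry: same token, server de-duplicates
    raise RuntimeError("request may or may not have succeeded")
```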
Timeouts give opportunity to retry in case of networking or load problems where requests are very slow. Timeouts also ensure slow requests don't hog all your threads
Lots of different types of timeouts, varies by client library how much control you get over each of these
Favorite is socket read timeout: many systems have a default timeout of infinity (FOREVER!). Can result in hung threads that stack up over time
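A quick illustration with Python's standard library (timeout values are illustrative): a plain socket defaults to no timeout at all, so an explicit read timeout is what keeps a hung server from hanging your thread.

```python
import socket

# A freshly created socket has no timeout: recv() can block FOREVER.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
assert s.gettimeout() is None  # None means "wait forever"

# Set an explicit timeout so reads fail fast instead of hanging;
# connect()/recv() now raise socket.timeout after 5 seconds.
s.settimeout(5.0)
```

Higher-level clients expose the same knob, e.g. `urllib.request.urlopen(url, timeout=5)`.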
Real world example: curl script would randomly hang forever
Configure timeouts and retries
Too many retries can get you throttled, so add backoff: wait between retries instead of retrying immediately
Simplest backoff: a fixed delay between retries, so you're not hammering the servers
Better: exponential backoff: delay more every time you retry
What happens if every client backs off the same amount?
They all cause high load at the same time
Add jitter: each client randomizes the amount they delay a little bit
Smooths out the request load
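The combination above, exponential backoff plus jitter, can be sketched in a few lines (the "full jitter" variant shown here, and the base/cap values, are illustrative choices, not from the talk):

```python
import random

def backoff_delay(attempt, base=0.1, cap=20.0):
    """Exponential backoff with full jitter.

    Each client sleeps a random amount between 0 and the (capped)
    exponential delay, so clients that failed at the same moment
    don't all retry at the same moment.
    """
    exp = min(cap, base * (2 ** attempt))  # grows 0.1, 0.2, 0.4, ... up to cap
    return random.uniform(0, exp)          # jitter: randomize the delay
```

Usage: `time.sleep(backoff_delay(attempt))` inside the retry loop, where `attempt` starts at 0.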
Use retry-cli to get exponential backoff + randomized delay (i.e. jitter)
We’ve covered retries, timeouts, and backoffs. Go inspect your clients!