10. @jimmydahlqvist
What is resiliency?
The ability of a software solution to handle the impact of problems and recover from turbulent conditions when other parts of the system fail.
What we talked about
• Design for failure
• Buffer and store messages first
• Process asynchronously
• Level the load
• Retry on failures
• Break if integrations are not healthy
Navigating failures! Building resilient serverless workloads!
That is what we are going to talk about here today: failures! We are going to talk about how we can architect and build resilient serverless systems.
We will not talk about preventing failures; instead we will talk about recovering from them, and some good architecture patterns that can help you on your way.
We will start with a serverless workload, and add to it to make it more resilient.
Some of you might now be thinking.
But hey, doesn't serverless on AWS come with built-in availability and resiliency?
Isn't that one of the strengths of serverless? It's resilient out of the box?
And you are absolutely right, serverless services from AWS have availability and resiliency built into them.
This was a short talk then. So, thank you very much for listening and see you around.....
Or is there more to resilient systems than that?
Serverless services do come with high availability and resiliency built in.
But that is on the service level: Lambda is highly available and resilient, Step Functions is highly available and resilient, EventBridge is highly available and resilient.
If all components in our systems were serverless, that would be great. However, that is not the case. In 95% of all systems that I have designed or worked on, there have been components that are not serverless. It can be the need for a relational database, and yes, I do argue that not all data models and search requirements can be designed for DynamoDB. It can be that we need to integrate with a 3rd party, and their API and connections come with quotas and throttling. It could be that we as developers are unaware of certain traits of AWS services that make our serverless systems break.
And sometimes our users are our worst enemy.
When we are building our serverless systems we must always remember that there are components involved that don't scale as fast, or to the same degree, as serverless components do.
Hi! I'm Jimmy!
I have worked with AWS and serverless since 2015, almost a decade now, and I have seen all kinds of strange things.
I’m a true serverless enthusiast, the very first solution I built on AWS was serverless and I have not looked back since.
I have built serverless solutions for a variety of companies, from startups to large enterprises.
I'm the founder of serverless-handbook.com, where you can find all kinds of serverless things that I have built, ranging from workshops to small architecture patterns.
And I have my blog on Jimmydqv.com
As for my day-time job, and yes, I do have a day-time job, I know people have been questioning that:
I work as Head of AWS at Sigma Technology Cloud. We are an advanced services partner with AWS and build all kinds of fun solutions.
If you'd like to know more about us, visit our booth outside....
I'm an AWS Ambassador, an AWS Community Builder, and one of the user group leaders for the Scania user group.
So, when I say serverless I mean services that come with
automatic scaling,
little to no capacity planning,
built-in high availability
pay-for-use billing model.
This is my definition of serverless, and I’m sure many of you can agree with that.
Looking at services from AWS, in the red corner we have the serverless services: API Gateway, Lambda, SQS, Step Functions. Services that we can throw a ton of work at, and they will just happily scale up and down, without any capacity planning, to handle our traffic.
In the blue corner we have the managed services. These are services like Amazon Aurora, Fargate, OpenSearch, and Kinesis Data Streams. These are services that can scale basically to infinity, but they do require some capacity planning for that to happen, and if we plan incorrectly we can be throttled or even have failing requests. And yes, I do categorize Kinesis Data Streams as managed, as we need to plan the number of shards. Kinesis Data Firehose, on the other hand, would be a serverless service.
Then there is the server corner, that would be anything with EC2 instances. We don’t talk about them….
So, what is resiliency? Sometimes it gets confusing, and people mix up resiliency with reliability.
As I said in the beginning, resiliency is not about preventing failures, it's about recovering from them.
It's about making sure our system maintains an acceptable level of service even when other parts of our system are not healthy.
It's about gracefully dealing with failures.
Reliability focuses on the prevention of the failure happening in the first place, while resiliency is about recovering from it.
This is by far one of my favorite quotes, by Dr. Werner Vogels: "Everything fails, all the time." Because this is real life! When you run large distributed systems, everything will eventually fail.
We can have downstream services that are not responding as we expect; they can be having health problems. Or we can be throttled by a 3rd party, or even by our own services.
We need to design our system not with the mindset "what happens if this fails", but instead "how can we keep running and recover WHEN this fails".
It's important that the cracks that form when components in our system fail don't spread. That they don't take down our entire system. We need ways to handle and contain the cracks. That way we can isolate the failures and protect our entire system.
Our serverless systems integrate with non-serverless components. In some cases it is obvious: your system interacts with an Amazon Aurora database. Other times it's not that clear: the system integrates with a 3rd party API, or does encryption using KMS. Both of these scenarios can lead to throttling that affects our system and starts forming cracks if not handled properly.
How does our system handle an integration point that is not responding, especially under a period of high load? This can easily start creating cracks that bring our entire system to a halt, or cause us to start losing data.
When we build serverless systems we must remember that every API in AWS has a limit. We can store application properties in Systems Manager Parameter Store; a few of them might be sensitive and encrypted with KMS. What can now happen is that we get throttled by a different service without realizing it. SSM might have a higher limit, but getting an encrypted value would then be impacted by the KMS limit. If we then don't design our functions correctly, and call SSM in the Lambda handler on every invocation, we would quickly get throttled and build up a hefty bill. Instead we could load properties in the initialization phase.
IF LUC IN AUDIENCE!!! Or if we call Secrets Manager on every invocation, that can quickly throttle and build a huge bill. Luc, where are you my friend? I'm sure Luc can tell you more about that....
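As a sketch of what loading in the initialization phase can look like. The parameter name and the injectable `fetch` function are illustrative assumptions; in a real function `fetch` would wrap boto3's `get_parameter` call with `WithDecryption=True`.

```python
# Cache parameters for the lifetime of the execution environment, so a
# warm Lambda never calls SSM/KMS on the hot path. Module-level code runs
# once per cold start, not once per invocation.
_cache = {}

def get_parameter(name, fetch):
    """Return a cached parameter value; `fetch` is called at most once
    per name. In a real function `fetch` would wrap boto3, e.g.
    boto3.client("ssm").get_parameter(Name=name, WithDecryption=True),
    so only cold starts pay for the SSM call and the KMS decrypt."""
    if name not in _cache:
        _cache[name] = fetch(name)
    return _cache[name]

def handler(event, context):
    # The handler reuses the cached value instead of calling SSM
    # (and therefore KMS) on every single invocation.
    return {"statusCode": 200}
```

The same shape works for Secrets Manager: fetch once during init, reuse across invocations.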
Understanding how AWS services work under the hood, to some extent, is extremely important, so our systems don't fail due to some unknown kink. For example, consuming a Kinesis Data Stream
with a Lambda function: if processing one item in a batch fails, the entire batch fails. The batch would then be sent to the Lambda function over and over again.
TELL ASSA KINESIS STORY!!!!!
What we can do in this case is to bisect batches on Lambda function failures. The failed batch will be split in half and each half sent to the function again. Bisecting continues until only the single failing item is left.
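Lambda does this for you when bisect-on-error is enabled on the event source mapping; the mechanism itself can be sketched in a few lines of plain Python (a toy model, not tied to any AWS API):

```python
def process_batch(batch, handle):
    """Toy model of bisect-on-error: try the whole batch; on failure,
    split it in half and retry each half, recursing until the single
    failing record is isolated. Returns the list of poison records."""
    try:
        for record in batch:
            handle(record)
        return []  # whole batch succeeded
    except Exception:
        if len(batch) == 1:
            return batch  # isolated the failing record
        mid = len(batch) // 2
        # Healthy records in a failing batch get handled again on the
        # retry, so record processing must be idempotent.
        return process_batch(batch[:mid], handle) + process_batch(batch[mid:], handle)
```

Note the comment about idempotency: bisecting re-delivers the healthy records that shared a batch with the poison one.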
Now, I bet everyone in this room runs a multi-environment system: you have your dev, test, pre-prod, and prod environments. With a show of hands, how many in here would say that your QA, staging, pre-prod, or whatever you call it, has an identical setup to your prod environment?
Most of you raised your hand, and that is what I normally see. But now, let's make sure you consider data as well. The amount of data, the difference in user-generated data. How many would now say that your environments are identical?
As I thought, not that many of you. This is an important part when we consider our systems and plan for resiliency testing. Data is different. I have seen systems taken down on multiple occasions due to differences in data, and even in integration points.
With one client, we had an update that had been tested and prepared in all environments. But when deploying to prod, the database went haywire on us. We used Amazon Aurora Serverless, and the database suddenly scaled to max and then couldn't cope anymore. Our entire service was brought down. All due to a SQL query that, because of the amount of data in prod, consumed all database resources.
Or if you have an integration with a 3rd party where that 3rd party's staging environment is different. I had a scenario where, in prod, the 3rd party had IP allow-listing in place, so when we extended our system and got some new IPs, suddenly only 1/3 of our calls were allowed. In staging, this was not in place. That was.... intermittent failures are always the most fun to debug.
A good way to practice and prepare for failures is through resiliency testing, chaos engineering. AWS offers a service around this topic, AWS Fault Injection Service, which you can use to simulate failures and see how your components and system handle them.
Now.... what I'm saying is that when you plan your resiliency testing, start in your QA or staging environment. But don't forget about prod, and do plan to run tests there as well.
Now let's start off with a classic web application with an API. Compute in Lambda and a database in DynamoDB. Now that is one scalable application.
But maybe we actually need an SQL database, as mentioned in the beginning this is still frequently used in many applications.
Or we need to integrate with a 3rd party, and this could be an integration that runs on-prem, or in a different cloud on servers. A compute solution that doesn't scale as fast and flexibly as our serverless solution. With a lot of users, we could quickly overwhelm the 3rd party API, or any downstream service in our solution that doesn't scale as fast.
This application is set up as a classic synchronous request-response, where our client expects a response back immediately.
We wait for this entire process to happen; storing data directly to a database might be very fast, and the blocking isn't that long.
But with more complex integrations, with chained calls and even 3rd party integrations, the time quickly adds up. And if one of the components is down and not responding, we need to fail the entire operation, leaving any form of retry to the calling application.
One question we need to ask when building our APIs is: do our write operations really need an immediate response? Can we make this an asynchronous process?
In a distributed system, does the calling application need to know that we have stored the data already, or can we just hand over the event and expect a response back saying "Hey, I got the message and I will do something with it"?
### Buffer events
With an asynchronous approach we can add a buffer between our calls and the storage of our data. What this will do is protect us and the downstream service. The downstream service will not be overwhelmed, and by that we protect our own system from failures as well.
This can however create an eventual consistency model, where a read after a write does not always give us the same data back.
Let’s return to our application, but we focus only on the API part from now on.
Let's get rid of our Lambda integration completely and instead integrate directly with SQS. This creates one of the most powerful patterns when building resilient serverless systems, and I use this all the time: Storage First!
So instead of having the API Gateway-to-Lambda integration, we move the Lambda function to the other side of the queue, as the consumer.
This takes us to the Storage-first architecture pattern.
The idea behind this architecture pattern is to safely store the messages in durable storage, and then process them in an asynchronous way. This way we can handle them at the pace we see fit, and we can re-process them if they fail.
Basically, we add a buffer to our API.
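The flow can be sketched with an in-memory queue standing in for SQS (in the real architecture API Gateway writes to the queue directly through a service integration, and a Lambda function is the consumer):

```python
from collections import deque

# Stand-in for the SQS queue that sits between the API and the consumer.
queue = deque()

def api_handler(message):
    """Storage first: persist the message, then acknowledge immediately.
    The client gets "I got it and will handle it", not the final result."""
    queue.append(message)
    return {"statusCode": 202, "body": "accepted"}

def worker():
    """Asynchronous consumer draining the buffer at its own pace;
    a message that fails here can be re-processed from the queue."""
    return [f"processed:{queue.popleft()}" for _ in range(len(queue))]
```

The 202 status code is the asynchronous "accepted" response discussed above: the write is durable before we answer, but processing happens later.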
The pattern does come with trade-offs:
• Latency: prioritizing storage might increase the time it takes to process or access the data in real-time scenarios.
• Complexity: designing with storage in mind may lead to intricate architectures, especially when integrating with diverse processing systems.
• Prerequisites: it requires robust, and often expensive, storage solutions to ensure data durability and high availability.
• Data integrity: ensuring stored data is accurate and consistent can pose challenges, especially in high-ingestion systems.
• Potential for over-optimization: there is a risk of over-investing in storage without considering the balance of other architectural needs.
If we circle back to our API solution again: not only do we use the storage-first pattern in the current setup, we have the possibility for other resilience patterns as well.
I have already briefly mentioned this several times without putting a name on it. In this solution we also use the queue load leveling pattern.
Using the queue load leveling pattern we protect the downstream service, and by doing that ourselves as well, by only processing events at a pace that we know the service can handle. Other benefits come with this pattern that might not be that obvious. It can help us control cost, as we can run on subscriptions with lower throughput that cost less, or we can down-scale our database as we don't need to run a huge instance to deal with peaks. The same goes if we don't process the queue with Lambda functions but instead use containers: we can set the scaling to fewer instances, or even build a better auto-scaling solution.
Now! One consideration with this pattern: if our producers are always creating more requests than we can process, we can end up in a situation where we are always trailing behind. For that scenario we either need to scale up the consumers, which might lead to unwanted downstream consequences, or we need at some point to evict and throw away messages. Which one you choose of course comes with the standard architect answer: "It depends...."
So what if we have more than one service that is interested in the request? An SQS queue can only have one consumer; two consumers can't get the same message. In this case we need to create a fan-out, or multicast, system.
So, what we can do in this solution is to replace our queue with EventBridge, which can route the request, or the message, to many different services. It can be SQS queues, Step Functions, Lambda functions, other EventBridge buses, and many, many more. EventBridge is highly scalable, with high availability and resiliency, and a built-in retry mechanism for 24 hours. With the archive feature we can also replay messages in case they failed. And if there is a problem delivering a message to a target, we can set a DLQ to handle that scenario.
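The routing idea can be sketched as a toy in-memory bus (this is not the EventBridge API; rules here match only on the event's `source` field, where real EventBridge patterns can match on much more):

```python
class Bus:
    """Toy event bus: every rule whose source pattern matches the event
    delivers a copy to its target, so one published message fans out to
    any number of consumers (queues, state machines, functions, ...)."""

    def __init__(self):
        self.rules = []  # list of (source pattern, target callable)

    def add_rule(self, source, target):
        self.rules.append((source, target))

    def put_event(self, event):
        delivered = 0
        for source, target in self.rules:
            if event.get("source") == source:
                target(event)  # each matching target gets the event
                delivered += 1
        return delivered
```

Contrast this with the SQS queue above: there, one consumer takes the message; here, every matching rule gets its own copy.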
We must however remember that the DLQ only comes into effect if there is a problem calling the target: lacking IAM permissions or similar. If the target itself has a problem and fails to process the message, it will not end up in the DLQ. Therefore, each of our target services must implement resiliency using the patterns we have been talking about.
Even with a storage-first approach we are of course not protected against failures. They will happen; remember, "Everything fails, all the time."
In the scenarios where our processing does fail, we need to retry. But retries are selfish, and what we don't want to do, in case it's a downstream service that fails, or if we are throttled by the database, is to just retry immediately. Instead we'd like to back off and give the service some breathing room. We would also like to apply exponential backoff: if our second call also fails, we back off a bit more. So the first retry we do after 1 second, then 2, then 4, and so on, until we either time out and give up, or have a success.
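A minimal sketch of that retry loop (the `sleep` parameter is injectable purely so the backoff schedule can be tested without waiting):

```python
import time

def retry_with_backoff(call, max_attempts=5, base=1.0, sleep=time.sleep):
    """Retry `call` with exponential backoff: wait 1s, 2s, 4s, ...
    between attempts. After the last failed attempt the exception is
    re-raised; that is the point where the message would be routed to
    a DLQ."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up: hand the message to the DLQ path
            sleep(base * (2 ** attempt))  # breathing room for the service
```

Services like SQS-to-Lambda and Step Functions give you this behavior through configuration; the sketch just shows what the schedule looks like.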
In the cases where we do give up the processing, we have hit the max number of retries; this is where the DLQ comes in. We route the messages to a DLQ where we can use different retry logic, or even inspect the messages manually. The DLQ also creates a good indicator that something might be wrong, and we can create alarms and alerts based on the number of messages in the DLQ. One message might not be a problem, but when the number of messages starts stacking up, it's a clear indicator that something is wrong.
If we are using SQS as our message buffer, we can connect a DLQ directly to it. We can also use Lambda function failure destinations and set an SQS queue as that destination; in case the function exits with a failure, the message is sent to the destination. If we use Step Functions as our processor, we can send messages to an SQS queue when we reach our retry limit.
One more approach would be to use Step Functions' built-in retry with backoff. However, SQS can't invoke Step Functions, so what we can do is use EventBridge instead of SQS and rely on EventBridge's durability and its archive and replay mechanism.
We add a DLQ where we send events when we give up the calls.
On the topic of retries: there is a study conducted by AWS showing that in a highly distributed system, retries will happen at the same time. If all retries use the same backoff, 1 second, 2 seconds, 4 seconds, and so on, they will eventually line up and happen at the same time. This can then lead to the downstream service crashing directly after becoming healthy, just due to the amount of work that has stacked up and now happens all at once.
It's like an electric grid: after a power failure, all appliances turn on at the same time, creating such a load on the grid that it goes out again, or we blow a fuse. Then we change the fuse, everything turns on at the same time, and the fuse blows again.
Therefore we should also use some form of jitter in our backoff algorithm. This could be that we add a random wait time to the backoff time. It would work like this: we first wait 1 second + a random number of hundreds of milliseconds. The second time we wait 2 seconds + 2x a random number, and so on. By doing that, our services will not line up their retries. How we add the jitter, and how much, well... that depends on your system and implementation.
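The jittered schedule just described can be sketched like this (the 300 ms jitter ceiling is an arbitrary assumption; there are other jitter strategies, such as full jitter, where the whole delay is randomized):

```python
import random

def backoff_with_jitter(attempt, base=1.0, max_jitter=0.3):
    """Delay before retry number `attempt` (0-based): exponential
    backoff plus a jitter term that scales with it, giving the
    1s + r, 2s + 2r, 4s + 4r, ... schedule described above, where r
    is up to `max_jitter` seconds. The random component spreads the
    retries from many clients so they stop lining up."""
    factor = 2 ** attempt
    return base * factor + factor * random.uniform(0, max_jitter)
```

Plugging this in place of a fixed delay in the retry loop means two clients that failed at the same moment almost never retry at the same moment.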
Users are our worst enemy story…….
Retries are all good, but there is no point in sending requests to an integration that is not healthy; they will just keep failing over and over again. So what we can do here is implement circuit breakers.
If you are not familiar with circuit breakers, it's a classic pattern. What it does is make sure we don't send requests to an API or integration that is not healthy and doesn't respond. This way we protect both the integration or API, and ourselves, from doing work we know will fail. Because everything fails all the time, right?
So before we call the API we'll have a status check; if the API is healthy, we'll send the request. This is the closed state of the circuit breaker. Think of it as an electric circuit: when the circuit is closed, electricity can flow and the lights are on.
As we make calls to the API we'll update the status. If we start to get error responses on our requests, we'll open the circuit and stop sending requests. This state is where storage-first shines: we can keep our messages in the storage queue until the integration is healthy again.
But we can't just stop sending requests forever. So what we do is periodically place the circuit in a half-open state, send a few requests through, and update our status with the health from those requests.
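The three states can be sketched as a small class. The failure threshold and cool-down period are illustrative assumptions, and the injectable clock exists only so the state transitions are testable; in a serverless setup the status would typically live in shared storage such as a DynamoDB table rather than in memory.

```python
import time

class CircuitBreaker:
    """Sketch of the three states described above: CLOSED (requests
    flow), OPEN (fail fast, keep messages in the buffer), HALF-OPEN
    (after a cool-down, let a probe request through to test health)."""

    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown:
            return "half-open"  # cool-down elapsed: allow a probe
        return "open"

    def call(self, request):
        if self.state() == "open":
            raise RuntimeError("circuit open: not calling unhealthy integration")
        try:
            result = request()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.state() == "half-open":
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        self.failures = 0
        self.opened_at = None  # healthy response: close the circuit
        return result
```

A successful probe in the half-open state closes the circuit again, and the buffered messages can start flowing; a failed probe re-opens it for another cool-down period.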