Resilient serverless architectures on AWS by Lee Gilmore - Serverless Summit 2021 (17th November 2021)
Three key factors in building resilient serverless architectures
https://www.serverless-summit.io/
3. What are we covering today?
Q. How do we build resilient serverless architectures on AWS?
Event-driven - Event-driven first mindset with Amazon EventBridge
Scalable - Load testing serverless architectures with Artillery
Monitoring - Using synthetic canaries to find issues proactively
A. Scalable + Event-driven + Monitoring
20 minutes
4. Load testing serverless architectures
Key takeaway: Serverless is not a silver bullet for scaling. Understand how your solutions work at unexpected scale.
01.
Scan for deep dive
6. Using Artillery for load testing serverless solutions
Artillery is an open-source load, functional and smoke testing solution, which can
be installed as a dependency of your serverless solution using NPM, configured
using a YML file with an accompanying CSV file for load test data, and run within your
pipelines for regular testing
10. Config
Allows you to pull in load test data
and configure plugins
Environments
This is where you can split out
between Dev/QA/Staging/Prod
Scenarios
This allows you to set up the actual
tests against endpoints
Configuration
11. What are the benefits of Artillery?
Easy to set up and run through NPM scripts
Assert on the expected responses, status codes and headers returned
You can use it for smoke, fuzz and functional testing too (not just load testing)
Can be run very easily in pipelines as part of the CI process
Additional plugins allow for fuzzing and writing the test results to DynamoDB/CloudWatch/SNS
Artillery produces test reports in HTML format which can be saved in pipelines as assets
12. Event-driven first mindset
Key takeaway: Don’t build tightly coupled, brittle architectures, or you will be regularly crying into your coffee at 2am
02.
Scan for deep dive
17. Importance of being event-driven?
There are numerous benefits of event-driven domain services, which are detailed below:
Domain services are individually testable
Domain services are individually deployable
Shared versioned schemas for events
They have their own data stores
Totally decoupled
They can scale independently
18. What is an event anyway?
“By using Event Messages you can easily decouple senders and receivers both in terms of identity (you broadcast events without caring
who responds to them) and time (events can be queued and forwarded when the receiver is ready to process them). Such architectures
offer a great deal for scalability and modifiability due to this loose coupling.” - Martin Fowler
An event is a change of state within a domain (past)
A command is an intent aimed at another domain which results in some output (future)
19. Serverless + Events
“Amazon EventBridge is a serverless event bus that makes it easier to build event-driven applications at scale using events
generated from your applications, integrated Software-as-a-Service (SaaS) applications, and AWS services” - AWS
Amazon EventBridge should be your default for serverless event-driven architectures for the
following reasons:
There are no servers to maintain or manage
Schema discovery and sharing using the registry
Content based filtering
Input transformation
Archive and Replay
20. Areas to consider
When building out your new architectures in an event-driven manner, it is worth planning for the following to ensure your solutions are resilient:
Consider idempotency
Queues, batching and failures
Version events with the Schema Registry
Event-carried state transfer
Potentially use Amazon SNS for low latency/high frequency messages
21. Using canaries to find issues proactively
Key takeaway: “Everything fails all the time” - Werner Vogels
03.
Scan for deep dive
22. Amazon CloudWatch Synthetic Canaries
You can use Amazon CloudWatch Synthetics to create ‘canaries’, which are configurable scripts that
run on a schedule, to monitor your endpoints and APIs.
Canaries follow the same routes and perform the same actions as a customer, which makes it
possible for you to continually verify your customer experience.
By using canaries, you can discover issues before your customers do.
23. CloudWatch Synthetics features
Amazon CloudWatch Synthetics is a powerful, fully serverless way of constantly ensuring that your APIs are working correctly, that there are no broken links in your webpages, performing visual diff checks to make sure your web pages are displaying correctly, and running heartbeat checks.
24. How do canaries work?
Canaries are Lambda functions which run on a schedule
Can be written in Node.js or Python
Offer programmatic access to a headless Google Chrome browser via Puppeteer or Selenium WebDriver
Can check latency of your endpoints and can store load time data and screenshots of the UI
Can check for unauthorized changes from phishing, code injection and cross-site scripting
Can alarm and send alerts based on failures
29. Summary
These are just three examples of the many ways we can make serverless architectures on AWS more
resilient for our customers
Event-driven - Event-driven first mindset with Amazon EventBridge
Scalable - Load testing serverless architectures with Artillery
Monitoring - Using synthetic canaries to find issues proactively
30. Summary
Hey everyone, thank you so much for joining today! My name is Lee Gilmore, and I am a Principal Developer & Cloud Architect at AO.com, which is one of the UK’s largest online retailers. I am also an AWS Community Builder and an active blogger in the serverless industry. I love connecting with like-minded people, so feel free to connect with me on LinkedIn, Twitter or Medium - so let’s get started!
Today I aim to cover as much ground as I can in the 20 minutes we have - helping to answer the question “how do we build resilient serverless architectures on AWS?” Hopefully this will give you a taster of three key areas that I personally consider massively important, specifically focusing on:
Serverless architectures being scalable, and load testing them using Artillery.
Event-driven first mindset using Amazon EventBridge.
And finally the importance of proactive monitoring using CloudWatch Synthetic Canaries.
You will see QR codes to scan with your smartphone cameras as we go along, which link off to detailed articles I have written which cover these areas in a lot more detail! (including code examples in GitHub written in TypeScript and the Serverless Framework typically)
First of all we will cover ‘Load testing serverless architectures’ - and no, I haven’t gone crazy, as serverless architectures should obviously scale!
The key takeaway of this section is ‘Serverless is not a silver bullet for scaling. Understand how your solutions work at unexpected scale’.
Some of the key reasons (but not limited to) that you may see issues with high load are:
1. Lambda functions horizontally scaling out really quickly, opening and closing database connections, which can quickly spike the CPU and memory on your database server and make it crash.
2. Less scalable legacy systems downstream that can’t cope with the sudden scale-out of the Lambda functions - these may be on-premise or 3rd party services, for example.
3. Asynchronous eventually consistent processes taking too long for your customers due to poor batching and configuration, for example waiting for time-bound activation emails to be received.
4. Reserved concurrency causing throttling at high load - think orders going through a system on Black Friday and customers being throttled.
5. Hitting the regional Lambda account concurrency limit at high scale! Much better to speak to AWS prior to an event rather than in the middle of it!
This is where load testing with Artillery comes in. Artillery is an open-source load, functional and smoke testing solution which you can configure using a YML file, pulling in repeatable load test data from an associated CSV file. And it can be run within your pipelines regularly with minimal setup.
There are also additional community led plugins to use alongside Artillery, with the big advantage that you can create your own plugins to extend it however you need!
Now let’s have a look at how you can configure Artillery very easily with a YML file. (I have had load tests running in as quickly as 15 minutes with Artillery)
The first section allows us to configure additional plugins to use alongside the load tests. An example here is the expect plugin, so we can make assertions on the responses from your API requests. It also allows us to set fail limits, for example ensuring at least 95% of the request latency is equal to or under 3 seconds, and failing hard if there are any errors (great for pipelines and functional tests).
Finally it allows us to pull in repeatable load test data from an associated CSV file (which you can then clean up using your delete endpoints or by invoking a Lambda to do this).
The next section of the YML file is to configure your various environments that you want to test against, so typically staging for load tests, but for smoke and functional tests this may be all of your environments.
It allows us to add our target APIs for each environment, as well as configuring the actual phases of the tests.
As you can see from this example, it is going to run for 10 seconds, with one virtual user when the test phase starts, ramping up to no more than 2 virtual users by the end of those ten seconds.
Finally we have the scenarios section, which allows us to configure the actual flows we want to run through, and allowing us to also use the expect plugin we configured at the top to assert responses coming back.
We can see in this example that we are creating an employee with a POST request, pulling through the JSON from the CSV file for the POST body, and asserting the response status code is 201 and the headers are correct.
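Pulling the three sections just described together, a minimal Artillery YML file might look something like the sketch below. The target URL, CSV fields and thresholds are all hypothetical, but the structure (plugins and fail limits in config, phases per environment, and a scenario with expect assertions) follows the layout discussed above.

```yaml
# Illustrative Artillery config - names, targets and limits are hypothetical
config:
  plugins:
    expect: {}                  # enables response assertions in scenarios
  ensure:
    p95: 3000                   # fail if 95th percentile latency > 3 seconds
    maxErrorRate: 0             # fail hard if there are any errors
  payload:
    path: "./employees.csv"     # repeatable load test data
    fields:
      - "firstName"
      - "surname"
  environments:
    staging:
      target: "https://api.staging.example.com"
      phases:
        - duration: 10          # run for 10 seconds
          arrivalRate: 1        # one new virtual user per second
          maxVusers: 2          # never more than 2 virtual users
scenarios:
  - name: "Create an employee"
    flow:
      - post:
          url: "/employees"
          json:
            firstName: "{{ firstName }}"
            surname: "{{ surname }}"
          expect:
            - statusCode: 201
            - contentType: json
```

You would then run this against a chosen environment with something like `artillery run -e staging load-test.yml` from an NPM script.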
In summary for this section of the talk, load testing will give you the confidence that your serverless architectures will scale for your customers, making them more resilient to high load - and this is why I think load testing your serverless solutions is key!
Now we are going to cover why it is so important to have an ‘event-driven first mindset’, with the key takeaway being ‘don’t build tightly coupled, brittle architectures, or you will be regularly crying into your coffee at 2am’ - nobody wants to be getting alerts throughout the night because one of your domain services is having issues!
Getting started with serverless really quickly is one of its main benefits, and it is very easy to spin up domain services, typically using API Gateway, Lambda and DynamoDB.
With a lot of serverless architectures I have seen in the past, it is easy to chain multiple domain services together in an organisation with synchronous requests - but this:
1. Increases the overall latency of calls for the end user as they wait for all requests to resolve.
2. Makes the architecture hugely coupled, like a spider web of links - making it massively brittle!
What we find now is that if one domain service goes down (our example of a database on fire) - then all other domain services are affected as they are so tightly coupled and intrinsically linked, and you have very unhappy customers!
The alternative is to use event-driven architectures which are eventually consistent and asynchronous in nature, where serverless domain services are loosely coupled and interact with each other using events rather than synchronous requests. This way the domain services produce events without the concern of who is consuming those events, with the added benefit that we can utilise dead letter queues to make sure we can replay events if one of the domain services has issues.
You can see from the diagram that all of the domain services remain online other than the one bottom right, but its failed records are safely kept for re-processing, so your customers are not aware of any issues, and they can be reprocessed later when the service is brought online again.
You can test a domain service in isolation without coordinating with several other teams and with multiple dependencies. (for example mocking APIs)
In the same vein as above, you can deploy your domain services in isolation without being dependent on other teams, as long as the agreed event schemas have not changed.
Historically teams would share contracts through Nuget or NPM packages with actual code, whereas now teams can simply share versioned schemas so work can be developed, tested and deployed in a loosely coupled manner. This reduces the overall dependencies between teams.
Domain services should have their own data stores (typically databases) so they don’t have this dependency at a data layer level. If domain services have a shared database they become tightly coupled, risking cross contamination of bugs, deployment issues and security risks.
Domain services should not be aware of each other. A producer can produce events without caring about which consumers are using them. Consumers also don’t care who produced the events.
And finally, domain services can scale independently without the concern and co-ordination between other teams and domain services.
So what is an event anyway? An event is something which has already happened in the past and is immutable, which you produce to allow any other domains that are interested to consume and act upon it. A command, on the other hand, is made with an intent for another domain to do something which results in some kind of output, and this is typically a one-to-one mapping between domains. An example is sending an email, where the producer expects the consumer to deal with errors and retries if there are any.
So now we have covered at a high level why we want to design our serverless architectures to be event-driven, and what events and commands are. Now let’s cover Amazon EventBridge as an enterprise serverless event bus, and why it is so important.
It is completely serverless and allows us to decouple our domain services with the smallest of overheads
Sharing event schemas has historically been difficult; however, the schema registry allows us to easily find and share schema structures between domains and teams
Content based filtering, even at the body level of the event (so data), allows us to only consume events that we are interested in.
Input transformation allows us to transpose the event structure and property names to meet the requirements of our consumers.
Archive and Replay functionality allows us to replay events to hydrate new domain service data stores, or once a bug is fixed we can replay the events on failed records.
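The content-based filtering mentioned above is expressed as an event pattern on an EventBridge rule. As a hedged sketch (source, detail-type and field names are made up for illustration), a rule that only matches UK orders over 100 could look like this - note the filtering reaches into `detail`, i.e. the event body:

```json
{
  "source": ["com.company.orders"],
  "detail-type": ["OrderCreated"],
  "detail": {
    "country": ["UK"],
    "total": [{ "numeric": [">", 100] }]
  }
}
```

Events that don't match the pattern are simply never delivered to that rule's targets, so consumers only ever see the events they care about.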
As you look to use EventBridge it is worth considering the following:
Build your services to be idempotent, so if you get the same event more than once you will always get the same result once actioned. EventBridge guarantees at-least-once delivery, but consumers can get the same events multiple times. You don’t want to be taking additional payments from your customers for example!
When we have issues we need to ensure that we utilise dead letter queues to store failed events, and to remember that a failed record in SQS will force the full batch to be replayed again. This is why idempotency is so important! Also remember that you need DLQs on your event rules in case it can’t route to the targets.
Use the Schema Registry auto discovery mode in development only, as this can be costly if left on in production! And use the Schema Registry to manually upload your own custom schemas, to share with other teams.
The maximum event size for EventBridge is 256 KB, which is typically fine for most applications, but bear this in mind for bigger events - AWS recommend putting the event (or part of it) into S3 and including a link to it in the event.
For architectures which need low latency and high frequency of messages then it may be worth looking at SNS over EventBridge, but this is in exceptional circumstances.
So in summary, this event-driven first approach will help ensure that your serverless architectures are more resilient to issues for your customers!
Lastly we will be covering “using canaries to find issues proactively before our customers do” - with the key takeaway of this section being “Everything fails all the time”, which is a famous quote by the fantastic Werner Vogels - and not a truer word has been said!
Typically you find out that your customers are experiencing issues when either a.) your support line is ringing off the hook or b.) you have monitoring, which could be a paid for service through a 3rd party provider, but this is typically only when there are specific errors alerted on.
CloudWatch Synthetics is an AWS offering that allows you to create ‘canaries’, which are configurable scripts that run regularly to monitor your solutions.
Amazon CloudWatch Synthetics is a powerful, yet largely unknown in my experience, way of monitoring your applications proactively!
They perform the same actions and follow the same routes as your customers, so you can continually verify the customer experience, and proactively find issues before your customers do (even when there are no customers on the system):
1. Canaries can check that your APIs are working correctly.
2. They can check there are no broken links in your web pages by crawling them.
3. They can check the latency of your endpoints, storing the information as HAR files (HTTP ARchive format).
4. Visual diff checks, so you know if a change has broken some webpages.
5. And heartbeat checks, to ensure that your services are up and running correctly.
Canaries are essentially Lambda functions that are invoked via CloudWatch events, and can be written in either Node.js or Python. They offer programmatic access to a headless Google Chrome browser via Puppeteer or Selenium WebDriver, so you can easily navigate your webpages through code as customers would (then verifying the experience).
The canaries will check the latency of your endpoints and store these with other information, alongside any screenshots of your webpages, for 31 days as default (this is configurable).
If you do have an issue with your serverless solutions, you can also set up alerting so you know about issues as they happen (even when there are no customers on your service).
To get started very quickly with CloudWatch Synthetic Canaries you can use the blueprints which have already been created by AWS through the console (as shown in the screenshot)
This also allows you to use the AWS Canary Recorder plugin in your Chrome browser to automatically generate your scripts based on the actions you perform on your web pages, or the workflow builder to generate sequences that your customers would typically perform (for example navigating a page, clicking on buttons, typing into text boxes etc).
The other way to setup Synthetic Canaries if you want to do something more bespoke is to use an IaC tool such as the CDK and your own lambda code, which is very simple to setup and deploy as you can see from the code on the screen.
This example is creating a canary that runs every minute, stores the screenshots in an S3 bucket called ‘assetsBucket’, pulls in the Node JS Lambda code from a local directory, and is using Puppeteer version 3.2. That is the actual infrastructure...
So here is some TypeScript pseudocode for the Lambda itself.
As you can see, it utilises the AWS Synthetics package to do the interesting work: taking the screenshots, setting the variance threshold when comparing each screenshot to the baseline image, and the logging, so you can view the results in the dashboard and alert on them. Super simple to code and set up.
And once this is fully deployed you will be able to view previous runs in the dashboard, view detailed logs on things like latency, view previous screenshots which are stored (and more…)
In summary, Synthetic Canaries, and proactive monitoring as an approach, makes your services more resilient by alerting on issues which could affect customers, potentially before they are even aware of them.
So in closing, there are obviously a lot more factors involved in building resilient serverless architectures on AWS, but I am hoping these three key areas and supporting technologies have piqued your interest to learn more outside of this short talk, in your own time.
And as I said earlier, I have detailed articles and GitHub repos for each of the three areas, so feel free to pull down the code and have a play about!
Thank you so much for taking the time to listen to me today, it has been a real pleasure, and thank you to Marc, Fabian, and the team for inviting me!