AWS Observability Made Simple

AWS Observability
made simple
Eóin Shanaghy - Luciano Mammino
AWS Community Day - November 11th 2021
fth.link/o11y-simple

Hi! I’m Eoin 🙂
CTO
aiasaservicebook.com
@eoins
eoins
✉ Get in touch

👋 Hello, I am Luciano
Senior architect
nodejsdesignpatterns.com
Let’s connect:
🌎 loige.co
🐦 @loige
🎥 loige
🧳 lucianomammino

We are business focused technologists
that deliver.
Accelerated Serverless | AI as a Service | Platform Modernisation
We are hiring! Let’s have a chat 🙂

Check out our new Podcast!
awsbites.com

Observability in the cloud
“A measure of how well internal states of a
system can be inferred from knowledge of its
external outputs”
🪵 Structured Logs
🔍 Tracing
📈 Metrics
🚨 Alarms

A typical case study
⚡ Serverless app
● Distributed system (100s of components)
🔌 HTTP APIs using
● Lambda
● DynamoDB
● API Gateway
● Cognito
🧱 Multiple services / stacks
🏁 Using SLIC Starter (fth.link/slic)
173
resources!

A typical case study
⚽ The goal: know about problems before users do
How?
📝 Structured Logs
📐 Metrics
🔔 Alarms
📊 Dashboards
🗺 Traces (X-Ray)
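
To make the first item concrete, here is a minimal sketch of a structured log line: one JSON object per event, so that tools such as Logs Insights can filter on individual fields. The helper name `log_event`, the field names and the table name are illustrative assumptions, not taken from the case study.

```python
import json
import time


def log_event(level, message, **fields):
    """Build and print one JSON log line; field names are illustrative."""
    record = {
        "level": level,
        "message": message,
        "timestamp": int(time.time() * 1000),  # epoch milliseconds
    }
    record.update(fields)  # extra context becomes queryable keys
    line = json.dumps(record)
    print(line)
    return line


# Hypothetical event: the table name and retry count are made up for the example.
line = log_event("error", "DynamoDB write failed", table="checklists", retry_count=3)
```

Because every line is valid JSON, a query can select on `level` or `table` instead of matching free text.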

Can we test our observability?
We run a stress test
○ Simulate traffic using the integration test
○ Run the test a number of times in parallel (in a loop)
○ Exercises all the APIs with typical use cases (login, CRUD operations, etc.)
🚨 After 10-15 minutes, we started to get alarms...

Initial Hypothesis
🛑 We got throttled (DynamoDB write throttle events)
↪ 🔁 causing AWS SDK retries (in the Lambda function)
↪ ⏱ causing Lambda timeouts
↪ 👎 causing API Gateway 502
🧪 How do we validate this?
1. Check the timeout cause ➡ Lambda metrics/logs
2. Check the Lambda error cause ➡ Lambda logs
3. Identify the source of 5xx errors in API Gateway ➡ X-Ray
4. Check the DynamoDB metrics ➡ Dashboards

Checking timeouts
● Check Lambda timeouts
○ Duration metrics (aggregated data)
○ Logs (individual requests)
● Logs Insights give us duration for each
individual request. We can use this to
isolate the logs for just that request.
● We use stats to see how many executions
are affected.
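
The same idea can be sketched offline: read the per-invocation REPORT lines that Lambda writes, pick out the slow requests by ID, and count how many executions are affected. This is a local approximation of what the Logs Insights query does, not the query itself. The sample lines and the 6000 ms threshold are hypothetical.

```python
import re

# Start of the REPORT line Lambda writes at the end of each invocation.
REPORT_RE = re.compile(r"REPORT RequestId: (\S+)\s+Duration: ([\d.]+) ms")


def slow_requests(log_lines, threshold_ms):
    """Return {request_id: duration_ms} for invocations at or above threshold_ms."""
    slow = {}
    for line in log_lines:
        match = REPORT_RE.search(line)
        if match and float(match.group(2)) >= threshold_ms:
            slow[match.group(1)] = float(match.group(2))
    return slow


# Hypothetical sample lines; request IDs and durations are made up.
sample = [
    "REPORT RequestId: aaa-111\tDuration: 120.50 ms\tBilled Duration: 121 ms",
    "REPORT RequestId: bbb-222\tDuration: 6006.12 ms\tBilled Duration: 6000 ms",
    "REPORT RequestId: ccc-333\tDuration: 6003.40 ms\tBilled Duration: 6000 ms",
]
slow = slow_requests(sample, threshold_ms=6000)
print(f"{len(slow)} of {len(sample)} executions affected: {sorted(slow)}")
```

Each request ID returned can then be used to isolate all the log lines for that one request.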

Conclusions
1. 🌡 Symptom: DynamoDB throttles
   🐞 Problem: Table with low provisioned WCUs (write capacity)
   Resolution: Switch table to PAY_PER_REQUEST. Add throttling in API Gateway to limit potential cost impact
2. 🌡 Symptom: API 502 Errors, Lambda Timeouts
   🐞 Problem: Throttles caused DynamoDB retries with exponential backoff - up to 50 seconds of retry
   Resolution: Change maxRetries to 3 (350ms max retry)
3. 🌡 Symptom: API 500 Errors
   🐞 Problem: Attempt to update a missing record - problem with integration test!
   Resolution: Fix the integration test to ensure deletion occurs after other actions complete. Also improved the API design
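
The retry figures in the second finding can be checked with a little arithmetic. The sketch below assumes a 50 ms base delay that doubles on each retry, a default of 10 retries, and no jitter. These values are assumptions chosen because they reproduce the figures quoted above; the function name is hypothetical.

```python
def max_total_backoff_ms(max_retries, base_ms=50):
    """Worst-case cumulative wait: base * (2^0 + 2^1 + ... + 2^(max_retries - 1))."""
    return sum(base_ms * 2 ** attempt for attempt in range(max_retries))


print(max_total_backoff_ms(10))  # 51150 ms, roughly 50 seconds
print(max_total_backoff_ms(3))   # 350 ms
```

With 10 retries the cumulative wait is far longer than a typical API-facing Lambda timeout, which is how a throttle can surface as a timeout and then as a 502.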

What we have learned so far
● We were able to identify, understand and fix these errors quite quickly
● We didn’t have to change the code to do that
● Nor did we run it locally with a debugger
● All of this was possible because we configured observability tools in
AWS in advance

AWS native o11y = CloudWatch
CloudWatch gives you:
➔ Logs with Insights
➔ Metrics
➔ Dashboards
➔ Alarms
➔ Canaries
➔ Distributed tracing (with X-Ray)

Alternatives outside AWS
Established
New entrants
Roll your own (only for the brave)

CloudWatch out of the box
😍 A toolkit you can use to build
observability
🤩 Metrics are automatically
generated for all services!
😟 Lots of dashboards, but by
service and not by application!
😢 Zero alarms out of the box!

Getting the best out of CloudWatch
CloudWatch can be your friend if you...
📚 Research and understand available metrics
📐 Decide thresholds
📊 Write IaC for application dashboards
⏰ Write IaC for service metric alarms
⏪ Update every time your application changes
📋 Copy and paste for each stack in your application
(a.k.a. A LOT OF WORK!)

Best practices
😇 AWS Well Architected Framework
🏛 5 Pillars
⚙ Operational excellence pillar covers observability
🧐 Serverless lens applies these pillars
👍 Good guidance on metrics to observe
👎 More reading and research + you still have to pick thresholds

CloudFormation for CloudWatch Alarms 😬
"Type": "AWS::CloudWatch::Alarm",
"Properties": {
  "ActionsEnabled": true,
  "AlarmActions": [
    "arn:aws:sns:eu-west-1:665863320777:FTSLICAlarms"
  ],
  "AlarmName": "LambdaThrottles_serverless-test-project-dev-hello",
  "AlarmDescription": "Throttles % for serverless-test-project-dev-hello ..",
  "EvaluationPeriods": 1,
  "ComparisonOperator": "GreaterThanThreshold",
  "Threshold": 0,
  "TreatMissingData": "notBreaching",
  "Metrics": [
    {
      "Id": "throttles_pc",
      "Expression": "(throttles / (throttles + invocations)) * 100",
      "Label": "% Throttles",
      "ReturnData": true
    },
    {
      "Id": "throttles",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Lambda",
          "MetricName": "Throttles",
          "Dimensions": [
            {
              "Name": "FunctionName",
              "Value": "serverless-test-project-dev-hello"
            }
          ]
        },
        "Period": 60,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "invocations",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Lambda",
          "MetricName": "Invocations",
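
The "% Throttles" metric in this alarm is meant to express throttled requests as a share of all attempts. A minimal sketch of that arithmetic, with the parentheses made explicit, is shown here. The function name is hypothetical, and it assumes that throttled requests are not already counted in the invocations figure.

```python
def throttles_percent(throttles, invocations):
    """Throttled requests as a percentage of all attempts (throttled + invoked)."""
    attempts = throttles + invocations
    if attempts == 0:
        return 0.0  # no traffic in the period, so nothing was throttled
    return 100 * throttles / attempts


print(throttles_percent(5, 95))  # 5.0

# Without the inner parentheses, division binds tighter than addition:
# 5 / 5 + 95 evaluates to 96.0, which is not a percentage at all.
```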

Can we automate this?
Magically
generated alarms
and dashboards for
each application!

fth.link/slic-watch
Introducing
SLIC Watch

How SLIC Watch works 🛠
Your app
serverless.yml
sls deploy
CloudFormation stack
very-big.json
SLIC Watch
👀 🛠
CloudFormation stack ++
even-bigger.json
Deploy ☁
📊📈
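
The flow above amounts to reading the generated CloudFormation template and appending alarm resources before deployment. The sketch below illustrates that idea for Lambda error alarms only. It is not SLIC Watch's actual implementation: the function name, the alarm naming scheme, the thresholds and the example ARN are all assumptions.

```python
import copy


def add_error_alarms(template, topic_arn):
    """Return a copy of a CloudFormation template with one Errors alarm per Lambda function."""
    result = copy.deepcopy(template)  # leave the caller's template untouched
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::Lambda::Function":
            continue
        # Hypothetical naming scheme: <function logical ID> + "ErrorsAlarm"
        result["Resources"][f"{logical_id}ErrorsAlarm"] = {
            "Type": "AWS::CloudWatch::Alarm",
            "Properties": {
                "AlarmActions": [topic_arn],
                "Namespace": "AWS/Lambda",
                "MetricName": "Errors",
                "Dimensions": [{"Name": "FunctionName", "Value": {"Ref": logical_id}}],
                "Statistic": "Sum",
                "Period": 60,
                "EvaluationPeriods": 1,
                "Threshold": 0,  # illustrative: alarm on any error
                "ComparisonOperator": "GreaterThanThreshold",
                "TreatMissingData": "notBreaching",
            },
        }
    return result


# Hypothetical input template with one function and one table.
template = {
    "Resources": {
        "HelloLambdaFunction": {"Type": "AWS::Lambda::Function", "Properties": {}},
        "ItemsTable": {"Type": "AWS::DynamoDB::Table", "Properties": {}},
    }
}
bigger = add_error_alarms(template, "arn:aws:sns:eu-west-1:123456789012:ExampleAlarms")
print(sorted(bigger["Resources"]))
```

Only the function gains an alarm; the table is passed through unchanged, and the original template is not modified.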

After SLIC Watch
Check out SLIC Slack

Configuration
🎀 SLIC Watch comes with sane defaults
📝 You can configure what you don’t like
🔌 Or disable specific dashboards or alarms

How to get started
📣 Create an SNS Topic as the alarm destination (optional)
📦 ❯ npm install serverless-slic-watch-plugin --save-dev
✍ Update serverless.yml
⚙ Configure (optional)
🚢 ❯ sls deploy
plugins:
  - serverless-slic-watch-plugin
💡 Check out the complete example project in the repo!

Wrapping up 🎁
★ If your services are failing, you definitely want to know about it!
★ Observability can save you from hundreds of hours of blind debugging!
★ CloudWatch is the go-to tool in AWS, but you have to configure it!
★ Automation can take most of the configuration pain away
★ SLIC Watch can give you this automation
★ You still have control and flexibility
🔬Try it out! 🗣 Give feedback! 🌈 Let’s make it better!
fth.link/slic-watch

Thank you!
fth.link/o11y-simple
Cover picture by Markus Spiske on Unsplash
