© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building observability to increase resiliency
COP343-R
David Yanacek (he/him)
Sr. Principal Engineer
AWS Monitoring and Observability
Navigating the seas of observability
My sailing proficiency
Agenda
Diagnose issues
Uncover hidden issues
Prevent future issues
Learning objectives: Diagnose issues
Find patterns in high-cardinality metrics
Use dimensionality to compute the right metric
Navigate distributed systems using tracing
Learning objectives: Uncover hidden issues
Measure from everywhere
Aggregate metrics in customer-oriented ways
Learning objectives: Prevent future issues
Auto scale and track the utilization of everything
Monitor game days just like production
All aboard!
Diagnose issues
Bad dependency, bad component, bad deployment, traffic spike
A leak has sprung!
A “leak” in the real world
[Figure: example website page made up of Navigation, Product, Search, Browse, and Cart widgets]
(Un)awareness of a problem
[Chart: site-wide error rate (website errors) over time, with the alarm threshold and the problem marked]
Considering dimensionality
“Show me [metric] per [dimension]”
Measurements: Latency, Requests, Errors
Attributes: Website, Webpage
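The “show me [metric] per [dimension]” idea can be sketched in a few lines of Python. This is only an illustration: the record layout, the field names (`latency_ms`, `error`, `website`, `webpage`), and the sample values are assumptions, not anything shown in the talk.

```python
from collections import defaultdict

# Hypothetical request log: one record per request, carrying measurements
# (latency_ms, error) and attributes (website, webpage).
REQUESTS = [
    {"website": "example.com", "webpage": "product", "latency_ms": 120, "error": 0},
    {"website": "example.com", "webpage": "product", "latency_ms": 80, "error": 0},
    {"website": "example.com", "webpage": "cart", "latency_ms": 300, "error": 1},
    {"website": "example.com", "webpage": "cart", "latency_ms": 100, "error": 0},
]


def metric_per_dimension(records, metric, dimension):
    """Average `metric` over the records that share each value of `dimension`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        key = record[dimension]
        totals[key] += record[metric]
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# The same measurement, viewed along two different attributes.
error_rate_per_website = metric_per_dimension(REQUESTS, "error", "website")
error_rate_per_webpage = metric_per_dimension(REQUESTS, "error", "webpage")
latency_per_webpage = metric_per_dimension(REQUESTS, "latency_ms", "webpage")
```

Choosing a different attribute changes which problems are visible: here the per-website view averages the cart failure together with healthy product traffic, while the per-webpage view isolates it.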
A deflated error rate
[Charts: product requests vs. product errors, and cart requests vs. cart errors, over time]
Site-wide error rate = (Cart errors + Product errors) / (Cart requests + Product requests)
Per-API error rate
Cart error rate = Cart errors / Cart requests
Product error rate = Product errors / Product requests
[Chart: per-widget error rate over time, with separate lines for cart errors and product errors]
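A small worked example shows why the site-wide rate can look healthy while one widget is badly broken. The request counts and the 1% alarm threshold below are made-up numbers chosen for illustration.

```python
def error_rate(errors, requests):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


# Hypothetical one-minute counts: the product page dominates traffic,
# and half of the (much rarer) cart requests are failing.
product_requests, product_errors = 9_900, 0
cart_requests, cart_errors = 100, 50

site_wide = error_rate(product_errors + cart_errors,
                       product_requests + cart_requests)
cart = error_rate(cart_errors, cart_requests)
product = error_rate(product_errors, product_requests)

ALARM_THRESHOLD = 0.01  # hypothetical 1% threshold
site_wide_alarms = site_wide > ALARM_THRESHOLD
cart_alarms = cart > ALARM_THRESHOLD
```

With these numbers the site-wide rate is 0.5% and stays under the threshold, while the cart rate is 50% and breaches it.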
Per-API alarms
[Chart: per-widget error rate over time, with cart errors and product errors each compared against their own alarm (cart alarm, product alarm)]
Per-API alarm noise
[Diagram: separate Cart, Product, Browse, Search, and Nav alarms]
Composite alarms
[Diagram: Cart, Product, Browse, Search, and Nav alarms combined into one overall alarm]
Alarms for everything
[Diagram: Cart, Product, Browse, Search, and Nav each have latency, errors, and volume alarms, all combined into one overall alarm]
Diagnose issues: Dimensions and alarms
Split key application health metrics on separate dimensions for each customer use case, like per-webpage or per-widget
Combine many alarm signals together to avoid fatigue by using CloudWatch Composite Alarms
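One way to combine per-widget alarms is to generate the composite alarm's rule expression from the list of child alarms. The sketch below assumes hypothetical alarm names and only builds the rule string; the commented call shows how it might be handed to CloudWatch with boto3's `put_composite_alarm`, which is not executed here.

```python
def composite_alarm_rule(child_alarm_names):
    """Build a rule that is in ALARM when any child alarm is in ALARM."""
    if not child_alarm_names:
        raise ValueError("at least one child alarm is required")
    return " OR ".join(f'ALARM("{name}")' for name in child_alarm_names)


# Hypothetical per-widget alarm names.
WIDGET_ALARMS = [
    "cart-errors",
    "product-errors",
    "browse-errors",
    "search-errors",
    "nav-errors",
]
rule = composite_alarm_rule(WIDGET_ALARMS)

# With boto3 installed and credentials configured, the rule could be applied as:
#   boto3.client("cloudwatch").put_composite_alarm(
#       AlarmName="website-overall",  # hypothetical name
#       AlarmRule=rule,
#   )
```

Paging only on the composite alarm keeps the per-widget alarms available for diagnosis without sending one notification per widget.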
Diagnose issues
Bad dependency
Find the source of the problem
[Figure: example website page made up of Navigation, Product, Search, Browse, and Cart widgets]
Triangulate the problem
Website architecture
[Diagram: Users, Website, Product info service, Product search service, Ordering service, Product DB, Search index, Orders queue]
[Diagram, order path only: Users, Website, PlaceOrder(), Ordering service, Orders queue]
Placing an order: Detailed architecture
[Diagram: Users, a web server stack (Website), and a cart service stack (Ordering service) in the AWS Cloud, each with an Application Load Balancer and instances in Availability Zones 1, 2, and 3, plus the Orders queue]
Trace propagation
[Diagram: the same two stacks, with Trace-Id: 1-5759e988 shown on the request and AWS X-Ray shown alongside the components]
Service map
[Diagram: Users, Website, Product info service, Product search service, Ordering service, Product DB, Search index, Orders queue]
Triangulate the problem
CloudWatch ServiceLens Map using AWS X-Ray
Diagnose issues: Bad dependency
Propagate the incoming trace context to every outbound client call you make to every dependency (Trace-Id: 1-5759e988)
Enable collection of trace segments from all AWS services, and from your apps (AWS X-Ray)
Use service maps derived from traces to triangulate the failing dependency during incidents
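Propagating trace context means copying the identifier from the incoming request onto every outbound call. The sketch below is a minimal hand-rolled version, assuming the `X-Amzn-Trace-Id` header that AWS X-Ray uses; the function name and the sample headers are hypothetical, and in practice a tracing SDK would normally do this for you.

```python
TRACE_HEADER = "X-Amzn-Trace-Id"  # header name used by AWS X-Ray (assumed here)


def outbound_headers(incoming_headers, extra_headers=None):
    """Return headers for a downstream call that carry the caller's trace context."""
    headers = dict(extra_headers or {})
    # HTTP header names are case-insensitive, so match without regard to case.
    for name, value in incoming_headers.items():
        if name.lower() == TRACE_HEADER.lower():
            headers[TRACE_HEADER] = value
            break
    return headers


# Hypothetical incoming request to the website, reusing the slide's trace ID.
incoming = {"Host": "example.com", "x-amzn-trace-id": "1-5759e988"}
to_ordering_service = outbound_headers(incoming, {"Content-Type": "application/json"})
```

If any hop drops the header, the trace breaks at that hop and a service map built from traces can no longer connect the caller to its dependency.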
Diagnose issues
Bad component
A web server process crash
[Diagram: Users reach an Application Load Balancer in front of web servers in Availability Zones 1, 2, and 3 (web server environment, AWS Cloud); site-wide error rate: ~11%]
Health checks to the rescue
[Diagram: the same web server environment behind the Application Load Balancer, shown alongside the site-wide error rate]
The deadlock
[Diagram: the same web server environment, together with the Product info service, Ordering service, and Product search service]
Health checks to the rescue?
[Diagram: the same web server environment behind the Application Load Balancer]
https://aws.amazon.com/builders-library/implementing-health-checks/
Different types of challenges
[Diagram: the same web server environment, shown alongside the site-wide error rate]
Measuring availability
[Chart: site-wide error rate (website errors) over time]
Dimensionality again
“Show me [metric] per [dimension]”
Measurements: Latency, Errors
Attributes: Website, Webpage, Instance Id
Creating an alarm for a gray-failed instance
[Chart: per-instance error rate over time for instances abc123, def456, ghi789, jkl012, …]
CloudWatch Metrics Insights query and alarm
q1 = SELECT SUM(Failure)
     FROM SCHEMA(MyWebsite, InstanceId)
     GROUP BY InstanceId
     ORDER BY SUM() DESC
     LIMIT 10
FIRST(q1) > 0.01
PERIOD = 1 minute
DATAPOINTS = 2
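The logic of this query and alarm can be imitated locally to see what it would flag. The Python below is a simplified simulation, not CloudWatch itself: the sample data are invented, and it treats `DATAPOINTS = 2` as "the two most recent periods must both breach", which is an assumption about the alarm's configuration.

```python
def top_series_by_sum(series_by_instance, limit=10):
    """Mimic GROUP BY InstanceId ORDER BY SUM() DESC LIMIT n: rank series by total."""
    ranked = sorted(series_by_instance.items(),
                    key=lambda item: sum(item[1]), reverse=True)
    return ranked[:limit]


def alarm_breached(series_by_instance, threshold=0.01, datapoints=2):
    """Mimic FIRST(q1) > threshold over the most recent `datapoints` periods."""
    ranked = top_series_by_sum(series_by_instance)
    if not ranked:
        return None, False
    instance_id, series = ranked[0]  # FIRST(q1): the top-ranked series
    recent = series[-datapoints:]
    breached = len(recent) == datapoints and all(v > threshold for v in recent)
    return instance_id, breached


# Hypothetical per-minute Failure values for each instance.
FAILURES = {
    "abc123": [0.0, 0.0, 0.0],
    "def456": [0.0, 0.001, 0.0],
    "ghi789": [0.0, 0.4, 0.5],
    "jkl012": [0.0, 0.0, 0.002],
}
worst_instance, in_alarm = alarm_breached(FAILURES)
```

Because the query always ranks the worst instance first, a single alarm covers the whole fleet and does not need to be recreated as instances come and go.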
Different types of challenges
[Diagram: the same web server environment, plus cache nodes in a product catalog cache cluster]
Dimensionality again
“Show me [metric] per [dimension]”
Measurements: Latency, Errors
Attributes: Website, Webpage, Instance Id, Availability Zone
Recovery with zonal shifts
[Diagram: the same web server environment and product catalog cache cluster, with an operator using Route 53 Application Recovery Controller on the Application Load Balancer]
Diagnose issues: Bad component
Split key application health metrics on separate dimensions for each infrastructure boundary, like EC2 Instance or Availability Zone
Find and alarm on the poorest performing parts of your infrastructure using Metrics Insights queries
Diagnose issues
Bad deployment
How (not) to deploy
Let’s try that again
Scene: Safe deployment
Take: Two
Roll back first, ask questions later
Service in alarm!
Rollback started
Rollback complete
No rollback
[Chart: site-wide error rate (website errors) over time against the alarm, with the one-box deployment, wave 1 deployment, and wave 2, 3 deployments marked]
Manual rollback
[Chart: site-wide error rate (website errors) over time against the alarm, with these events marked: start investigating, realize there’s a deployment, start rollback; the time-to-detect and time-to-mitigate intervals are labeled]
Automatic rollback
[Chart: site-wide error rate (website errors) over time against the alarm, with “auto rollback started” marked and the time-to-detect and time-to-mitigate intervals labeled]
Automatic rollback in CloudFormation
Automatic rollback
[Chart: site-wide error rate (website errors) over time against the alarm, with a question mark]
Dimensionality (yet) again
“Show me [metric] per [dimension]”
Measurements: Latency, Errors
Attributes: Website, Webpage, Instance Id, Availability Zone, Code revision
Faster detection using dimensions
[Chart: error rate over time for new code and overall, each with its own alarm (new alarm, overall alarm), with “auto rollback started” marked and the time-to-detect and time-to-mitigate intervals labeled]
Composite rollback alarms
[Diagram: new code version alarms, overall alarms, per-webpage alarms, infrastructure alarms, and literally all of your alarms, combined into one overall alarm.]
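A composite alarm is defined by a rule expression over its child alarms. As a rough sketch, the helper below builds an expression that goes into ALARM when any child does, using the `ALARM("name")` and `OR` forms of the CloudWatch composite alarm rule syntax. The function name and the alarm names are hypothetical.

```python
def rollback_alarm_rule(alarm_names):
    """Build a composite alarm rule that fires if any child alarm is in ALARM."""
    if not alarm_names:
        raise ValueError("need at least one child alarm")
    return " OR ".join(f'ALARM("{name}")' for name in alarm_names)

# Hypothetical child alarm names, one per category on the slide.
rule = rollback_alarm_rule([
    "new-code-error-rate",
    "overall-error-rate",
    "checkout-page-latency",
    "fleet-cpu",
])
```

The resulting string is what would be supplied as the composite alarm's rule. Pointing the deployment system at that single alarm keeps the rollback trigger in one place as alarms are added.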
Diagnose issues: Bad deployment
Split key application health metrics on logical boundaries like DeploymentId to minimize the time to detect bad changes
Roll back all types of changes automatically to minimize time to recover
Combine literally all of your alarms into a single CloudWatch composite alarm for triggering rollbacks
Diagnose issues
Traffic spike
Troubleshooting
[Dashboard: latency alarm; graphs of fleet size, average latency, fleet CPU, and request volume; deployments list: 4 days ago, 5 days ago, …]
Where/who are these requests coming from?
Request volume
∆?
Dimensions with high cardinality
“Show me [metric] per [dimension]”
Measurements: Requests, Latency, Errors
Attributes: Code revision, Website, Availability Zone, Webpage, Instance Id, Customer Id
Cardinality: (Few of these) or (Many of these)
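Cardinality here means the number of distinct values a dimension takes. A small sketch of that idea, assuming request records are available as Python dicts. The sample data and field names below are invented for illustration.

```python
def cardinality(entries, attribute):
    """Number of distinct values seen for one attribute."""
    return len({entry[attribute] for entry in entries if attribute in entry})

# Hypothetical records: three Availability Zones, a thousand customers.
entries = [
    {"AvailabilityZone": f"az-{i % 3}", "CustomerId": f"customer-{i}"}
    for i in range(1000)
]

few = cardinality(entries, "AvailabilityZone")   # low-cardinality dimension
many = cardinality(entries, "CustomerId")        # high-cardinality dimension
```

A per-Availability-Zone metric here means three time series, while a per-customer metric means a thousand, which is why the high-cardinality case needs a different tool.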
Top-N metrics with CloudWatch Contributor Insights
[Chart: per-customer requests (request count over time)]
[Chart: per-customer requests, top-N (request count over time)]
Telemetry in Amazon CloudWatch
Amazon CloudWatch
Alarms
Logs Metrics
Application
{ "structured":"log" }
… …
Contributor Insights: How it’s made
A request log entry:
{
  "Timestamp": 1574109732004,
  "TraceId": "Root=1-5759e988-...",
  "CustomerId": "abcde12345",
  "ClientIp": "192.168.131.39",
  "InstanceId": "i-0012341EXAMPLE",
  "Operation": "GetProducts",
  "Time": 98,
  "Error": 0,
  "Failure": 0,
  "DB.Time": 49,
  "CacheHit": 0
}
(Things about the request)
(Things that happened during the request)
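A sketch of how an application might emit one such entry per request, as a single line of JSON. The function name and its two-dictionary signature are hypothetical. Which fields count as "about the request" and which as "happened during the request" is my reading of the slide's annotations, not something the slide spells out.

```python
import json
import time

def request_log_entry(attributes, measurements, now_ms=None):
    """Serialize one request as a single-line JSON log entry."""
    entry = {"Timestamp": int(time.time() * 1000) if now_ms is None else now_ms}
    entry.update(attributes)    # things about the request
    entry.update(measurements)  # things that happened during the request
    return json.dumps(entry)

line = request_log_entry(
    {"CustomerId": "abcde12345", "ClientIp": "192.168.131.39",
     "Operation": "GetProducts"},
    {"Time": 98, "Error": 0, "Failure": 0, "DB.Time": 49, "CacheHit": 0},
    now_ms=1574109732004,
)
```

Writing every attribute and measurement into the same entry is what later allows any measurement to be cut by any attribute.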
Contributor Insights: How it’s made
A request log entry:
{
  "Timestamp": 1574109732004,
  "TraceId": "Root=1-5759e988-...",
  "CustomerId": "abcde12345",
  "ClientIp": "192.168.131.39",
  "InstanceId": "i-0012341EXAMPLE",
  "Operation": "GetProducts",
  "Time": 98,
  "Error": 0,
  "Failure": 0,
  "DB.Time": 49,
  "CacheHit": 0
}
A Contributor Insights rule:
{
  "AggregateOn": "Count",
  "Contribution": {
    "Keys": [ "$.CustomerId" ]
  },
  "LogFormat": "JSON",
  "Schema": {
    "Name": "CloudWatchLogRule",
    "Version": 1
  },
  "LogGroupARNs": [ "arn:..." ]
}
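To make the rule's meaning concrete, here is a local simulation in Python of what a rule with `"AggregateOn": "Count"` and `"Keys": [ "$.CustomerId" ]` asks for: count log entries per contributor and keep the largest. This is only an illustration of the semantics under simplifying assumptions (top-level keys only, Count only). It says nothing about how Contributor Insights is implemented, and `top_contributors` is a hypothetical name.

```python
from collections import Counter

def top_contributors(entries, rule, n):
    """Count log entries per contributor key and return the n largest."""
    if rule["AggregateOn"] != "Count":
        raise ValueError("this sketch only handles AggregateOn = Count")
    # "$.CustomerId" -> "CustomerId"; only top-level fields are handled here.
    fields = [key[2:] if key.startswith("$.") else key
              for key in rule["Contribution"]["Keys"]]
    counts = Counter()
    for entry in entries:
        if all(field in entry for field in fields):
            counts[tuple(entry[field] for field in fields)] += 1
    return counts.most_common(n)

rule = {"AggregateOn": "Count", "Contribution": {"Keys": ["$.CustomerId"]}}
entries = [{"CustomerId": c} for c in ["a", "b", "a", "c", "a", "b"]]
top_two = top_contributors(entries, rule, 2)
```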
Top-N metrics with Contributor Insights
On-demand metrics with Logs Insights
filter ClientIp = '192.168.131.39'
| stats count(*) by bin(60s)
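For readers less familiar with the query language, the query above filters to one client IP and counts matching entries in 60-second buckets. A rough Python equivalent over in-memory log entries follows. It assumes each entry carries its own `Timestamp` in epoch milliseconds, as in the sample log entry, and bins on that field. `count_by_bin` is a hypothetical helper.

```python
from collections import Counter

def count_by_bin(entries, client_ip, bin_seconds=60):
    """Count entries from one client IP, grouped into fixed time bins."""
    bin_ms = bin_seconds * 1000
    counts = Counter()
    for entry in entries:
        if entry["ClientIp"] == client_ip:
            bin_start = entry["Timestamp"] - entry["Timestamp"] % bin_ms
            counts[bin_start] += 1
    return dict(sorted(counts.items()))

entries = [
    {"ClientIp": "192.168.131.39", "Timestamp": 0},
    {"ClientIp": "192.168.131.39", "Timestamp": 59_999},
    {"ClientIp": "192.168.131.39", "Timestamp": 60_000},
    {"ClientIp": "10.0.0.1", "Timestamp": 1},
]
per_minute = count_by_bin(entries, "192.168.131.39")
```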
Diagnose issues: Traffic spike
Emit logs that are rich with data so that you can cut your metrics on many dimensions
Record and analyze high-cardinality metrics like per-customer request volume by configuring Contributor Insights
Slice and dice metrics that you didn’t materialize up front by writing Logs Insights queries
Uncover hidden issues
The hidden storm
Uncover hidden issues
External issues Misattributed faults
Uncover hidden issues
External issues
Measuring from where?
Users Website
…
Measuring from where?
[Diagram: Users; AWS Cloud; your web server environment: Application Load Balancer, instances in Availability Zones 1, 2, and 3; (other services)]
An incompatible change
[Diagram: Users; Amazon CloudFront distribution; AWS Cloud; your web server environment: Application Load Balancer, instances; your CI/CD pipeline: Version 1 code; annotation: {}]
An incompatible change
[Diagram: Users; Amazon CloudFront distribution; AWS Cloud; your web server environment: Application Load Balancer, instances; your CI/CD pipeline: Version 1 code, Version 2 code; annotation: {???}]
Measure from everywhere
Measuring from where?
[Diagram: Users; Amazon CloudWatch Real-User Monitoring; AWS Cloud; your web server environment: Application Load Balancer]
A botched migration
[Diagram: Users; Amazon Route 53 (Old: 100%, New: 0%); AWS Cloud; old web server environment; new web server environment; two Application Load Balancers; instances; containers; (other services)]
A botched migration
[Diagram: Users; Amazon Route 53 (Old: 99%, New: 1%); AWS Cloud; old web server environment; new web server environment; two Application Load Balancers; instances; containers; (other services)]
Synthetic workload measurement
Measuring from where?
[Diagram: Users; Amazon CloudWatch Synthetics; AWS Cloud; your web server environment: Application Load Balancer]
Uncover hidden issues: External issues
Look at metrics emitted by synthetic workloads and real clients of your application to catch issues outside of your own measurement
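The idea of a synthetic workload can be sketched as a probe that exercises the site from the outside and records success and latency, independent of the server's own metrics. The Python below is a generic illustration, not the CloudWatch Synthetics API. `run_canary`, the injected `fetch` callable, and the default thresholds are all hypothetical.

```python
import time

def run_canary(fetch, url, expected_status=200, max_latency_ms=1000):
    """Call fetch(url) once and report success and latency in milliseconds."""
    start = time.monotonic()
    try:
        status = fetch(url)
    except Exception:
        status = None  # connection failures count as failed probes
    latency_ms = (time.monotonic() - start) * 1000
    return {
        "url": url,
        "success": status == expected_status and latency_ms <= max_latency_ms,
        "latency_ms": latency_ms,
    }

# A stand-in for a real HTTP client: takes a URL, returns a status code.
def fake_fetch(url):
    return 200

result = run_canary(fake_fetch, "https://example.com/")
```

Because the probe fails when the request never reaches the server at all, it can surface problems that leave the server-side metrics looking quiet.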
Uncover hidden issues
Misattributed faults
Input validation bugs
  public void validate() {
-     if (input.length() > 140) {
+     if (input.length() > 100) {
          throw new ClientError();
      }
  }
(Bug: making input validation stricter)
Client vs server fault
Client fault (4XX)
Server fault (5XX)
An increase due to a bad deployment
[Chart: error rate by type (4XX vs 5XX) over time. Series: client fault, server fault.]
The same increase, but for a different reason
[Chart: per-client fault rate (4XX) over time. Series: Customer 789, Customer 123, Customer 456, Customer 012, …, Customer 345, overall errors.]
Alarming when lots of customers change
[Chart: per-client fault rate (4XX) over time. Series: Customer 789, Customer 123, Customer 456, Customer 012, …, Customer 345, overall errors.]
Metrics from another dimension
Percent of requests with errors = (# of requests with errors) / (# of requests)
Percent of clients with errors = (# of customers with errors) / (# of customers)
Dimensionality and cardinality together
(# of customers with errors) / (# of customers)
= INSIGHT_RULE_METRIC(ErrorsPerCustomer, UniqueContributors) / INSIGHT_RULE_METRIC(RequestsPerCustomer, UniqueContributors)
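The difference between the two ratios shows up clearly on a small example. The sketch below computes both from in-memory request records. The `CustomerId` and `Error` field names follow the sample log entry, while the two traffic scenarios are invented for illustration.

```python
def error_percentages(entries):
    """Return percent of requests and percent of customers that saw errors."""
    customers = {e["CustomerId"] for e in entries}
    customers_with_errors = {e["CustomerId"] for e in entries if e["Error"]}
    requests_with_errors = sum(1 for e in entries if e["Error"])
    return {
        "percent_of_requests": 100 * requests_with_errors / len(entries),
        "percent_of_customers": 100 * len(customers_with_errors) / len(customers),
    }

# Scenario 1 (hypothetical): one noisy customer sends 60 failing requests,
# while four other customers send 10 successful requests each.
one_noisy = [{"CustomerId": "c0", "Error": 1}] * 60 + [
    {"CustomerId": f"c{i}", "Error": 0} for i in range(1, 5) for _ in range(10)
]

# Scenario 2 (hypothetical): five customers send 20 requests each, 12 failing.
many_affected = [
    {"CustomerId": f"c{i}", "Error": 1 if j < 12 else 0}
    for i in range(5) for j in range(20)
]

noisy = error_percentages(one_noisy)
broad = error_percentages(many_affected)
```

Both scenarios show 60% of requests failing, but only the second shows 100% of customers affected, which is the signal that separates one misbehaving client from a fault on the service side.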
A stable, alarmable metric
Percent of clients
with errors
Uncover hidden issues: Misattributed faults
When many customers suddenly see errors that are categorized as their fault, it may actually be your fault
Calculate metrics like “percent of customers” instead of “percent of requests” by using Contributor Insights rules to estimate the cardinality of a dimension
Prevent future issues
Prevent future issues
Run game days
Monitor utilization
Prevent future issues
Monitor utilization
Monitoring resource utilization
Elastic building blocks
[Diagram: Users; AWS Cloud; Application Load Balancer (Elastic Load Balancing); Auto Scaling group (Amazon EC2 Auto Scaling) of web servers (Amazon Elastic Compute Cloud, Amazon EC2) in two Availability Zones, both labeled “Availability Zone 2” on the slide; …]
Monitoring the utilization
CPU, Memory, Thread pools, File system
Alarms, Dashboards, Amazon EC2 Auto Scaling
Monitoring the utilization
Amazon EC2 Auto Scaling
myASG:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MaxSize: '50'
    MinSize: '20'
myScalingPolicy:
  ...
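Monitoring the utilization
Amazon EC2 Auto Scaling
[Chart: number of instances in an Auto Scaling group over time. Series: GroupInServiceInstances, GroupMaxSize.]
GroupInServiceInstances / GroupMaxSize × 100 = percent of instances in an Auto Scaling group
[Chart: percent of instances in an Auto Scaling group over time, with 100% marked on the axis.]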
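The ratio on this slide treats the Auto Scaling group's maximum size as a limit whose utilization can be tracked like any other. A worked sketch in Python, using the MinSize and MaxSize values from the template earlier in the talk. The 80% alarm threshold and the function names are assumptions for illustration.

```python
def percent_of_max_size(in_service_instances, max_size):
    """Percent of an Auto Scaling group's maximum size currently in service."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    return 100 * in_service_instances / max_size

UTILIZATION_ALARM_PERCENT = 80  # assumed threshold, not taken from the talk

def near_capacity(in_service_instances, max_size):
    """True when the group is close enough to its maximum to warrant an alarm."""
    return percent_of_max_size(in_service_instances, max_size) >= UTILIZATION_ALARM_PERCENT

at_minimum = percent_of_max_size(20, 50)   # group at MinSize: 40.0
scaled_out = near_capacity(45, 50)         # 90% of MaxSize
```

An alarm on this percentage warns that scaling is about to hit its ceiling while there is still time to raise it.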
Monitoring utilization across abstractions
AWS Lambda
Amazon EC2
Amazon RDS Amazon DynamoDB
Amazon SQS
Amazon MQ
Monitoring utilization across abstractions
AWS Lambda
Amazon EC2
CPU Memory
File system Thread pools
Concurrency
Automatic usage metrics
Prevent future issues
Configure Auto Scaling on all of your elastic resources to react quickly to changes in load and maintain healthy headroom
Measure the utilization of everything – from CPU, to thread pools, to quotas – by creating alarms and a capacity dashboard
Prevent future issues
Run game days
Game day drills
Properties of a good game day
Regular
Reasoned
Controlled
Realistic
AWS Fault Injection Service
Game day observability
Replicate production observability into test environments
Verify behavior of metrics, alarms, dashboards, logs
Add new instrumentation, metrics, alarms afterward
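Verifying alarm behavior after an experiment can be as simple as comparing the alarms that were expected to fire with the ones that actually did. A minimal sketch, assuming the two lists of alarm names have already been collected by some other means. The function and alarm names are hypothetical.

```python
def verify_alarms(expected, observed):
    """Compare alarms expected to fire in a game day with those that did."""
    expected, observed = set(expected), set(observed)
    return {
        "missed": sorted(expected - observed),      # gaps in detection
        "unexpected": sorted(observed - expected),  # side effects worth a look
    }

report = verify_alarms(
    expected=["overall-error-rate", "checkout-page-latency"],
    observed=["overall-error-rate", "fleet-cpu"],
)
```

Each entry under `missed` is a candidate for the new instrumentation, metrics, or alarms that the slide suggests adding afterward.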
Prevent future issues: Game days
Regularly perform controlled experiments of failure modes by using the AWS Fault Injection Service
Use the same observability tools during experiments as you do in production by including things like alarm and dashboard definitions in your infrastructure as code
AWS Cloud Development Kit (AWS CDK)
Navigating the seas of observability
Recap: Diagnose issues
Find patterns in high-cardinality metrics
Use dimensionality to compute the right metric
Navigate distributed systems using tracing
Recap: Uncover hidden issues
Measure from everywhere
Aggregate metrics in customer-oriented ways
Recap: Prevent future issues
Auto scale and track the utilization of everything
Monitor game days just like production
Further reading: Amazon Builders’ Library
https://aws.amazon.com/builders-library/
Thank you!
Please complete the session
survey in the mobile app
David Yanacek
david-yanacek
@dyanacek
@dyanacek@hachyderm.io
dyanacek.bsky.social

Building observability to increase resiliency.pdf

  • 1. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 2. Building observability to increase resiliency (COP343-R). David Yanacek (he/him), Sr. Principal Engineer, AWS Monitoring and Observability
  • 3. Navigating the seas of observability
  • 4. My sailing proficiency
  • 5. Agenda: Diagnose issues; Uncover hidden issues; Prevent future issues
  • 6. Learning objectives (Diagnose issues): Find patterns in high-cardinality metrics; Use dimensionality to compute the right metric; Navigate distributed systems using tracing
  • 7. Learning objectives (Uncover hidden issues): Measure from everywhere; Aggregate metrics in customer-oriented ways
  • 8. Learning objectives (Prevent future issues): Auto scale and track the utilization of everything; Monitor game days just like production
  • 9. All aboard!
  • 10. Diagnose issues
  • 11. Diagnose issues: Bad dependency; Bad component; Bad deployment; Traffic spike
  • 12. A leak has sprung!
  • 13. A “leak” in the real world: Navigation, Product, Search, Browse, [Your website logo], Cart
  • 14. A “leak” in the real world: Navigation, Product, Search, Browse, [Your website logo], Cart
  • 15. (Un)awareness of a problem [chart “Site-wide error rate”: error rate over time; labels: Website errors, Alarm]
  • 16. (Un)awareness of a problem [chart “Site-wide error rate”: error rate over time; labels: Website errors, Problem, Alarm]
  • 17. (Un)awareness of a problem [chart “Site-wide error rate”: error rate over time; labels: Website errors, Problem, Alarm]
  • 18. Considering dimensionality: “Show me [metric] per [dimension]”
  • 19. Considering dimensionality: “Show me [metric] per [dimension]”. Measurements: Latency, Requests, Errors
  • 20. Considering dimensionality: “Show me [metric] per [dimension]”. Measurements: Latency, Requests, Errors. Attributes: Website
  • 21. Considering dimensionality: “Show me [metric] per [dimension]”. Measurements: Latency, Requests, Errors. Attributes: Website, Webpage
  • 22. Considering dimensionality: “Show me [metric] per [dimension]”. Measurements: Latency, Requests, Errors. Attributes: Website, Webpage
  • 23. A deflated error rate [two charts of requests over time: “Product requests vs errors” (series: Product requests, Product errors) and “Cart requests vs errors” (series: Cart requests, Cart errors)]
  • 24. A deflated error rate: Site-wide error rate = (Cart errors + Product errors) / (Cart requests + Product requests)
  • 25. Per-API error rate: Cart error rate = Cart errors / Cart requests; Product error rate = Product errors / Product requests
  • 26. Per-API error rate [chart “Per-widget error rate”: error rate over time; series: Cart errors, Product errors]
  • 27. Per-API alarms [chart “Per-widget error rate”: error rate over time; series: Cart errors, Product errors; labels: Cart alarm, Product alarm]
  • 28. Per-API alarms [chart “Per-widget error rate”: error rate over time; series: Cart errors, Product errors; labels: Cart alarm, Product alarm]
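
The "deflated error rate" slides can be made concrete with a small calculation. The sketch below is only an illustration: the request and error counts, the 1% threshold, and the names `ApiCounts`, `site_wide_error_rate`, and `breaching_apis` are all hypothetical and do not come from the talk. It shows how a low-volume API (Cart) can fail badly while the site-wide ratio stays under an alarm threshold, and how the per-API ratio surfaces the problem.

```python
from dataclasses import dataclass


@dataclass
class ApiCounts:
    """Request and error counts for one API over one time window."""
    requests: int
    errors: int


def error_rate(counts: ApiCounts) -> float:
    """Errors divided by requests for a single API (0.0 when idle)."""
    return counts.errors / counts.requests if counts.requests else 0.0


def site_wide_error_rate(per_api: dict[str, ApiCounts]) -> float:
    """Sum of all errors divided by sum of all requests."""
    total_requests = sum(c.requests for c in per_api.values())
    total_errors = sum(c.errors for c in per_api.values())
    return total_errors / total_requests if total_requests else 0.0


def breaching_apis(per_api: dict[str, ApiCounts], threshold: float) -> list[str]:
    """Names of APIs whose own error rate exceeds the threshold."""
    return sorted(name for name, c in per_api.items() if error_rate(c) > threshold)


# Hypothetical window: Product is busy and healthy, Cart is quiet and broken.
counts = {
    "Product": ApiCounts(requests=10_000, errors=10),
    "Cart": ApiCounts(requests=100, errors=50),
}
THRESHOLD = 0.01  # assumed 1% alarm threshold

print(f"site-wide: {site_wide_error_rate(counts):.2%}")
for name, c in counts.items():
    print(f"{name}: {error_rate(c):.2%}")
print("breaching:", breaching_apis(counts, THRESHOLD))
```

With these made-up numbers, half of all Cart requests fail, yet the site-wide rate stays below the 1% threshold because Product traffic dominates the denominator.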
  • 29. Per-API alarm noise: Cart alarms, Product alarms, Browse alarms, Search alarms, Nav alarms
  • 30. Per-API alarm noise: Cart alarms, Product alarms, Browse alarms, Search alarms, Nav alarms
  • 31. Per-API alarm noise: Cart alarms, Product alarms, Browse alarms, Search alarms, Nav alarms
  • 32. Composite alarms: Overall alarm combining Cart alarms, Product alarms, Browse alarms, Search alarms, Nav alarms
  • 33. Composite alarms: Overall alarm combining Cart alarms, Product alarms, Browse alarms, Search alarms, Nav alarms
  • 34. Alarms for everything: Overall alarm combining Cart alarms, Product alarms, Browse alarms, Search alarms, Nav alarms, each covering Latency, Errors, Volume
  • 35. Diagnose issues: Dimensions and alarms. Split key application health metrics on separate dimensions for each customer use case, like per-webpage or per-widget.
  • 36. Diagnose issues: Dimensions and alarms. Split key application health metrics on separate dimensions for each customer use case, like per-webpage or per-widget. Combine many alarm signals together to avoid fatigue by using CloudWatch Composite Alarms.
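
As a rough sketch of the composite-alarm idea, the helper below builds a CloudWatch composite alarm rule expression that fires when any child alarm is in the ALARM state. The widget and signal lists mirror the slides, but the alarm naming scheme (`cart-errors` and so on) and the helper name `composite_alarm_rule` are assumptions made for illustration. The code only builds the rule string; the comment at the end indicates where such a string would typically be passed to the CloudWatch API.

```python
def composite_alarm_rule(child_alarm_names: list[str]) -> str:
    """Build a rule that is in ALARM when any child alarm is in ALARM.

    Assumes alarm names contain no double quotes.
    """
    if not child_alarm_names:
        raise ValueError("at least one child alarm is required")
    return " OR ".join(f'ALARM("{name}")' for name in child_alarm_names)


# Hypothetical naming scheme: one metric alarm per widget and signal.
WIDGETS = ["cart", "product", "browse", "search", "nav"]
SIGNALS = ["latency", "errors", "volume"]

# One composite alarm per widget, combining its three signal alarms.
per_widget_rules = {
    f"{widget}-alarm": composite_alarm_rule([f"{widget}-{s}" for s in SIGNALS])
    for widget in WIDGETS
}

# One overall alarm combining the per-widget composite alarms.
overall_rule = composite_alarm_rule(list(per_widget_rules))

print(per_widget_rules["cart-alarm"])
print(overall_rule)

# With boto3 and AWS credentials available, a rule like this could be passed
# as the AlarmRule argument of the CloudWatch client's put_composite_alarm
# call, for example with AlarmName="overall-alarm" (a hypothetical name).
```

Layering the rules this way keeps a single overall signal for paging while the per-widget alarms remain available to show which part of the page is affected.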
  • 37. Diagnose issues: Bad dependency
  • 38. Find the source of the problem: Navigation, Product, Search, Browse, [Your website logo], Cart
  • 39. Triangulate the problem
  • 40. Website architecture: Users, Orders queue, Search index, Product DB, Product info service, Ordering service, Product search service, Website
  • 41. Website architecture: Users, Orders queue, PlaceOrder(), Ordering service, Website
  • 42. Placing an order: Detailed architecture [diagram: AWS Cloud; Web server stack; Cart service stack; Users; Website; Ordering service; Orders queue]
  • 43. Placing an order: Detailed architecture [diagram: Users; AWS Cloud; Web server stack (Application Load Balancer; instances in Availability Zones 1, 2, 3); Cart service stack (Application Load Balancer; instances in Availability Zones 1, 2, 3); Orders queue]
  • 44. Trace propagation [diagram: Users; AWS Cloud; Web server stack (Application Load Balancer; instances in Availability Zones 1, 2, 3); Cart service stack (Application Load Balancer; instances in Availability Zones 1, 2, 3); Orders queue; Trace-Id: 1-5759e988]
  • 45. Trace propagation [diagram: Users; AWS Cloud; Web server stack (Application Load Balancer; instances in Availability Zones 1, 2, 3); Cart service stack (Application Load Balancer; instances in Availability Zones 1, 2, 3); Orders queue; Trace-Id: 1-5759e988]
  • 46. Trace propagation [diagram: Users; AWS Cloud; Web server stack (Application Load Balancer; instances in Availability Zones 1, 2, …); Cart service stack (Application Load Balancer; instances in Availability Zones 1, 2, …); Orders queue; AWS X-Ray]
  • 47. Service map: Users, Orders queue, Search index, Product DB, Product info service, Ordering service, Product search service, Website
  • 48. Triangulate the problem
  • 49. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. CloudWatch ServiceLens Map using AWS X-Ray
  • 50-52. Diagnose issues: Bad dependency. Propagate the incoming trace context (Trace-Id: 1-5759e988) to every outbound client call you make to every dependency. Enable collection of trace segments from all AWS services and from your apps (AWS X-Ray). Use service maps derived from traces to triangulate the failing dependency during incidents.
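A minimal sketch of the first point, propagating incoming trace context to outbound calls. It assumes the AWS X-Ray tracing header `X-Amzn-Trace-Id`; the helper names `find_header`, `new_root_id`, and `outbound_headers` are hypothetical and not from the talk. In practice a tracing SDK would normally handle this for you.

```python
import secrets
import time

TRACE_HEADER = "X-Amzn-Trace-Id"  # tracing header used by AWS X-Ray


def find_header(headers: dict, name: str):
    """Case-insensitive header lookup; returns None when absent."""
    for key, value in headers.items():
        if key.lower() == name.lower():
            return value
    return None


def new_root_id() -> str:
    """Root ID in the X-Ray shape: 1-<8 hex digits of epoch seconds>-<24 random hex digits>."""
    return f"1-{int(time.time()):08x}-{secrets.token_hex(12)}"


def outbound_headers(incoming: dict) -> dict:
    """Headers to attach to every downstream call made while serving this request."""
    trace = find_header(incoming, TRACE_HEADER)
    if trace is None:
        # No upstream context: this service is the first hop, so start a new trace.
        trace = f"Root={new_root_id()}"
    return {TRACE_HEADER: trace}
```

Every client call to a dependency would merge `outbound_headers(request_headers)` into its own headers, so that all segments of one request share the same root ID.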
  • 53. Diagnose issues: Bad component
  • 54-58. A web server process crash [Diagram: Users; Application Load Balancer; AWS Cloud; web server environment with web servers in Availability Zones 1-3]
  • 59. A web server process crash [Same diagram; site-wide error rate ~11%]
  • 60-61. Health checks to the rescue [Same diagram]
  • 62. Health checks to the rescue [Same diagram, with site-wide error rate]
  • 63. The deadlock [Diagram: Users; Application Load Balancer; AWS Cloud; web server environment with three groups of web servers, each labeled Availability Zone 2 in the source; Product info service; Ordering service; Product search service]
  • 64. Health checks to the rescue? [Web server environment diagram as on slide 54]
  • 65. Health checks to the rescue? [Same diagram] https://aws.amazon.com/builders-library/implementing-health-checks/
  • 66. Different types of challenges [Same diagram, with site-wide error rate]
  • 67. Measuring availability [Chart: Site-wide error rate; series: Website errors; axes: Time, Error rate]
  • 68. Dimensionality again: “Show me [metric] per [dimension]”. Measurements: Latency, Errors. Attributes: Website, Webpage, Instance Id.
  • 69-70. Creating an alarm for a gray-failed instance [Chart: Per-instance error rate; series: abc123, def456, ghi789, jkl012, …; axes: Time, Error rate]
  • 71-74. CloudWatch Metrics Insights query and alarm. Query (q1): SELECT SUM(Failure) FROM SCHEMA(MyWebsite, InstanceId) GROUP BY InstanceId ORDER BY SUM() DESC LIMIT 10. Alarm (slides 73-74): FIRST(q1) > 0.01; PERIOD = 1 minute; DATAPOINTS = 2.
  • 75. CloudWatch Metrics Insights query and alarm
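The alarm logic on these slides can be sketched locally. The sketch assumes that `ORDER BY SUM() DESC` ranks instances by their total over the queried window, that `FIRST(q1)` picks the top-ranked series, and that `DATAPOINTS = 2` means the two most recent one-minute datapoints must both breach. The failure counts are invented for illustration; only the instance labels and the 0.01 threshold come from the slides.

```python
from typing import Dict, List


def first_series(series: Dict[str, List[float]]) -> List[float]:
    """Mimic ORDER BY SUM() DESC plus FIRST(): return the series with the largest total."""
    worst_instance = max(series, key=lambda name: sum(series[name]))
    return series[worst_instance]


def in_alarm(datapoints: List[float], threshold: float = 0.01, required: int = 2) -> bool:
    """True when the most recent `required` datapoints all exceed the threshold."""
    recent = datapoints[-required:]
    return len(recent) == required and all(value > threshold for value in recent)


# Hypothetical per-minute SUM(Failure) values for each instance.
failures_per_minute = {
    "abc123": [0, 0, 0, 0],
    "def456": [0, 0, 0, 0],
    "ghi789": [0, 0, 7, 9],
    "jkl012": [0, 1, 0, 0],
}

worst = first_series(failures_per_minute)
alarm_state = in_alarm(worst)
```

Here `ghi789` has the largest total, and its last two datapoints both exceed the threshold, so `alarm_state` is true even though the other three instances are healthy.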
  • 76-80. Different types of challenges [Diagram: Users; Application Load Balancer; AWS Cloud; web server environment with web servers in Availability Zones 1-3; product catalog cache cluster with three groups of cache nodes]
  • 81. Dimensionality again: “Show me [metric] per [dimension]”. Measurements: Latency, Errors. Attributes: Website, Webpage, Instance Id, Availability Zone.
  • 82. Recovery with zonal shifts [Same diagram, plus Route 53 Application Recovery Controller and an operator]
  • 83-84. Diagnose issues: Bad component. Split key application health metrics on separate dimensions for each infrastructure boundary, like EC2 instance or Availability Zone. Find and alarm on the poorest-performing parts of your infrastructure using Metrics Insights queries.
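To illustrate “Show me [metric] per [dimension]”, here is a small sketch that computes an error rate per infrastructure boundary from request records. The `InstanceId` and `Failure` field names follow the request log entry shown later in the deck; `AvailabilityZone`, the zone labels, and the sample records are assumptions made for this example.

```python
from collections import defaultdict


def error_rate_per(records, dimension):
    """Group request records by one dimension and return failures / requests per value."""
    totals = defaultdict(lambda: [0, 0])  # dimension value -> [failures, requests]
    for record in records:
        bucket = totals[record[dimension]]
        bucket[0] += record["Failure"]
        bucket[1] += 1
    return {value: failures / requests for value, (failures, requests) in totals.items()}


# Hypothetical request records; def456 is the gray-failed instance.
records = [
    {"InstanceId": "abc123", "AvailabilityZone": "az1", "Failure": 0},
    {"InstanceId": "abc123", "AvailabilityZone": "az1", "Failure": 0},
    {"InstanceId": "def456", "AvailabilityZone": "az2", "Failure": 1},
    {"InstanceId": "def456", "AvailabilityZone": "az2", "Failure": 1},
    {"InstanceId": "ghi789", "AvailabilityZone": "az2", "Failure": 0},
    {"InstanceId": "jkl012", "AvailabilityZone": "az3", "Failure": 0},
]

by_instance = error_rate_per(records, "InstanceId")
by_zone = error_rate_per(records, "AvailabilityZone")
worst_instance = max(by_instance, key=by_instance.get)
```

The same records yield a different picture per dimension: the per-instance view isolates `def456`, while the per-zone view shows `az2` as degraded but not fully failed.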
  • 85. Diagnose issues: Bad deployment
  • 86-89. How (not) to deploy
  • 90. Let’s try that again. Scene: Safe deployment. Take: Two
  • 91-93. Roll back first, ask questions later
  • 94. Roll back first, ask questions later: Service in alarm!
  • 95. Roll back first, ask questions later: Rollback started
  • 96. Roll back first, ask questions later: Rollback complete
  • 97-100. No rollback [Chart: Site-wide error rate; series: Website errors; axes: Time, Error rate; markers: Alarm; One-box deployment; Wave 1 deployment; Wave 2, 3 deployments]
  • 101-105. Manual rollback [Chart: Site-wide error rate; series: Website errors; axes: Time, Error rate; markers: Alarm; Start investigating; Realize there’s a deployment; Start rollback]
  • 106. Manual rollback [Same chart; spans: Time-to-detect, Time-to-mitigate]
  • 107-108. Automatic rollback [Chart: Site-wide error rate; series: Website errors; axes: Time, Error rate; markers: Alarm; Auto rollback started]
  • 109. Automatic rollback [Same chart; spans: Time-to-detect, Time-to-mitigate]
  • 110. Automatic rollback in CloudFormation
  • 111. Automatic rollback [Same chart; markers: Alarm; “?”]
  • 112. Dimensionality (yet) again: “Show me [metric] per [dimension]”. Measurements: Latency, Errors. Attributes: Website, Webpage, Instance Id, Availability Zone, Code revision.
  • 113-114. Faster detection using dimensions [Chart: Site-wide error rate; series: Overall, New code; axes: Time, Error rate; markers: New alarm; Overall alarm; Auto rollback started]
  • 115. Faster detection using dimensions [Same chart; spans: Time-to-detect, Time-to-mitigate]
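A worked example of why a per-code-revision metric can alarm sooner than the overall metric. All numbers are hypothetical: a one-box deployment where 1 of 100 equally loaded instances runs the new revision, the new code fails half of its requests, and the alarm threshold is 1%.

```python
def error_rate(failures: int, requests: int) -> float:
    """Failures divided by requests, or 0.0 when there were no requests."""
    return failures / requests if requests else 0.0


requests_per_instance = 1000

# 99 instances still run the old revision and serve without errors.
old = {"requests": 99 * requests_per_instance, "failures": 0}
# 1 instance runs the new revision and fails half of its requests.
new = {"requests": 1 * requests_per_instance, "failures": 500}

overall = error_rate(old["failures"] + new["failures"], old["requests"] + new["requests"])
new_code = error_rate(new["failures"], new["requests"])

threshold = 0.01
overall_alarm = overall > threshold    # site-wide rate is diluted by healthy traffic
new_code_alarm = new_code > threshold  # per-revision rate shows the problem directly
```

Under these assumptions the overall error rate is 0.5%, below the threshold, while the new-code error rate is 50%. The per-revision alarm fires during the one-box stage, before later waves push the overall rate over the line.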
  • 116-117. Composite rollback alarms [Diagram: New code version alarms; Overall alarms; Per-webpage alarms; Infrastructure alarms; “Literally all of your alarms”; Overall alarm]
  • 118-120. Diagnose issues: Bad deployment. Split key application health metrics on logical boundaries like DeploymentId to minimize the time to detect bad changes. Roll back all types of changes automatically to minimize time to recover. Combine literally all of your alarms into a single CloudWatch composite alarm for triggering rollbacks.
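One way to express “combine all of your alarms” is to generate the rule expression for a CloudWatch composite alarm, which combines child alarms with `ALARM("name")` terms joined by `OR`. The sketch below only builds that string. The function name and the child alarm names are hypothetical, and alarm names containing double quotes are not handled.

```python
def rollback_alarm_rule(alarm_names):
    """Build a composite alarm rule that fires when any child alarm is in ALARM state."""
    if not alarm_names:
        raise ValueError("at least one child alarm is required")
    return " OR ".join(f'ALARM("{name}")' for name in alarm_names)


# Hypothetical child alarms, one per category named on the slide.
rule = rollback_alarm_rule([
    "new-code-error-rate",
    "overall-error-rate",
    "checkout-page-latency",
    "instance-cpu-high",
])
```

The resulting string would be supplied as the rule of the composite alarm, and the deployment system would watch that single alarm to decide whether to roll back.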
  • 121. Diagnose issues: Traffic spike
  • 122-128. Troubleshooting [Dashboard: Latency alarm; Average latency; Fleet CPU; Fleet size; Request volume; Deployments: 4 days ago, 5 days ago, …]
  • 129. Where/who are these requests coming from? [Chart: Request volume, ∆?]
  • 130-133. Dimensions with high cardinality: “Show me [metric] per [dimension]”. Measurements: Requests, Latency, Errors. Attributes: Code revision, Website, Availability Zone, Webpage, Instance Id, Customer Id. [Cardinality callouts: “(Few of these)” on slide 132, “(Many of these)” on slide 133]
  • 134-135. Top-N metrics with CloudWatch Contributor Insights [Charts: Per-customer requests; Per-customer requests (top-N); axes: Time, Request count]
  • 136. Telemetry in Amazon CloudWatch [Diagram: Application; { "structured":"log" }; Amazon CloudWatch: Logs, Metrics, Alarms]
  • 137-139. Contributor Insights: How it’s made. A request log entry: { "Timestamp": 1574109732004, "TraceId": "Root=1-5759e988-...", "CustomerId": "abcde12345", "ClientIp": "192.168.131.39", "InstanceId": "i-0012341EXAMPLE", "Operation": "GetProducts", "Time": 98, "Error": 0, "Failure": 0, "DB.Time": 49, "CacheHit": 0 } [Callouts: “(Things about the request)”, “(Things that happened during the request)”]
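A minimal sketch of emitting one such log line per request. The split into attributes (“things about the request”) and measurements (“things that happened during the request”) is my reading of the slide callouts; `request_log_entry` is a hypothetical helper, and the field values are copied from the sample entry.

```python
import json
import time


def request_log_entry(attributes: dict, measurements: dict) -> str:
    """Serialize one request as a single JSON log line."""
    entry = {"Timestamp": int(time.time() * 1000)}  # epoch milliseconds
    entry.update(attributes)    # things about the request
    entry.update(measurements)  # things that happened during the request
    return json.dumps(entry)


line = request_log_entry(
    attributes={
        "CustomerId": "abcde12345",
        "ClientIp": "192.168.131.39",
        "InstanceId": "i-0012341EXAMPLE",
        "Operation": "GetProducts",
    },
    measurements={"Time": 98, "Error": 0, "Failure": 0, "DB.Time": 49, "CacheHit": 0},
)
```

Writing one line per request keeps every attribute next to every measurement, which is what makes it possible to cut metrics by any dimension later.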
  • 140-141. Contributor Insights: How it’s made. A request log entry (as on slide 137). A Contributor Insights rule: { "AggregateOn": "Count", "Contribution": { "Keys": [ "$.CustomerId" ] }, "LogFormat": "JSON", "Schema": { "Name": "CloudWatchLogRule", "Version": 1 }, "LogGroupARNs": [ "arn:..." ] }
  • 142. Top-N metrics with Contributor Insights
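What a rule with `"AggregateOn": "Count"` and key `$.CustomerId` computes can be approximated locally: count log entries per customer and keep the top N. This is an illustrative stand-in, not how Contributor Insights is implemented; the function name and the sample log lines are hypothetical.

```python
import json
from collections import Counter


def top_contributors(log_lines, key="CustomerId", n=10):
    """Count JSON log entries per value of `key` and return the n largest as (value, count)."""
    counts = Counter()
    for line in log_lines:
        entry = json.loads(line)
        if key in entry:  # entries without the key do not contribute
            counts[entry[key]] += 1
    return counts.most_common(n)


# Hypothetical request logs: one customer dominates the traffic.
sample_logs = (
    ['{"CustomerId": "abcde12345", "Operation": "GetProducts"}'] * 5
    + ['{"CustomerId": "fghij67890", "Operation": "GetProducts"}'] * 2
    + ['{"Operation": "HealthCheck"}']
)

top = top_contributors(sample_logs, n=2)
```

Tracking only the top contributors is what keeps a high-cardinality dimension such as customer ID manageable: the long tail of small customers never needs its own time series.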
  • 143. On-demand metrics with Logs Insights: filter ClientIp = '192.168.131.39' | stats count(*) by bin(60s)
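The query above filters on one client IP and counts entries in 60-second bins. A local equivalent over parsed log entries might look like the following sketch; it assumes `Timestamp` is in epoch milliseconds, as in the sample log entry, and the entries other than the first are invented.

```python
from collections import Counter


def count_by_bin(entries, client_ip, bin_seconds=60):
    """Count entries from one client IP per time bin, keyed by bin start in epoch ms."""
    bin_ms = bin_seconds * 1000
    counts = Counter()
    for entry in entries:
        if entry["ClientIp"] != client_ip:
            continue
        bin_start = entry["Timestamp"] - entry["Timestamp"] % bin_ms
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


entries = [
    {"ClientIp": "192.168.131.39", "Timestamp": 1574109732004},
    {"ClientIp": "192.168.131.39", "Timestamp": 1574109740000},
    {"ClientIp": "192.168.131.39", "Timestamp": 1574109790000},
    {"ClientIp": "10.0.0.7", "Timestamp": 1574109735000},  # different client, filtered out
]

per_minute = count_by_bin(entries, "192.168.131.39")
```

Because the metric is computed from logs at query time, no per-IP metric had to be defined in advance.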
  • 144-146. Diagnose issues: Traffic spike. Emit logs that are rich with data so that you can cut your metrics on many dimensions. Record and analyze high-cardinality metrics like per-customer request volume by configuring Contributor Insights. Slice and dice metrics that you didn’t materialize up front by writing Logs Insights queries.
  • 147. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Uncover hidden issues
  • 148. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. The hidden storm
  • 149. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. The hidden storm
  • 150. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. The hidden storm
  • 151. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Uncover hidden issues External issues Misattributed faults
  • 152. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved. Uncover hidden issues External issues
  • 153. Measuring from where? (diagram: Users; Website; …)
  • 154. Measuring from where? (diagram: Users; Application Load Balancer; Availability Zone 1 Instances; AWS Cloud; Availability Zone 2 Instances; Availability Zone 3 Instances; (other services); Your web server environment)
  • 155. Measuring from where? (diagram: Users; Application Load Balancer; AWS Cloud; Your web server environment)
  • 156. An incompatible change (diagram: Users; AWS Cloud; Your web server environment; Amazon CloudFront Distribution; Your CI/CD pipeline; Version 1 Code; Instances; Application Load Balancer)
  • 157. An incompatible change (diagram: Users; AWS Cloud; Your web server environment; Amazon CloudFront Distribution; Your CI/CD pipeline; Version 1 Code; Instances; Application Load Balancer; {})
  • 158. An incompatible change (diagram: Users; AWS Cloud; Your web server environment; Amazon CloudFront Distribution; Your CI/CD pipeline; Version 1 Code; Instances; Application Load Balancer; Version 2 Code)
  • 159. An incompatible change (diagram: Users; AWS Cloud; Your web server environment; Amazon CloudFront Distribution; Your CI/CD pipeline; Version 1 Code; Instances; Application Load Balancer; Version 2 Code)
  • 160. An incompatible change (diagram: Users; AWS Cloud; Your web server environment; Amazon CloudFront Distribution; Your CI/CD pipeline; Version 1 Code; Version 2 Code; Instances; Application Load Balancer)
  • 161. An incompatible change (diagram: Users; AWS Cloud; Your web server environment; Amazon CloudFront Distribution; Your CI/CD pipeline; Version 1 Code; Version 2 Code; Instances; Application Load Balancer; {???})
  • 162. Measure from everywhere
  • 163. Measuring from where? (diagram: Users; Application Load Balancer; AWS Cloud; Your web server environment; Amazon CloudWatch Real-User Monitoring)
  • 164. A botched migration (diagram: Application Load Balancer; Instances; AWS Cloud; Containers; (other services); Old web server environment; Users; Application Load Balancer; New web server environment)
  • 165. A botched migration (diagram: Application Load Balancer; Instances; AWS Cloud; Containers; (other services); Old web server environment; Users; Application Load Balancer; New web server environment; Amazon Route 53, Old: 100%, New: 0%)
  • 166. A botched migration (diagram: Application Load Balancer; Instances; AWS Cloud; Containers; (other services); Old web server environment; Users; Application Load Balancer; New web server environment; Amazon Route 53, Old: 100%, New: 0%)
  • 167. A botched migration (diagram: Application Load Balancer; Instances; AWS Cloud; Containers; (other services); Old web server environment; Users; Application Load Balancer; New web server environment; Amazon Route 53, Old: 99%, New: 1%)
  • 168. Synthetic workload measurement
  • 169. Measuring from where? (diagram: Users; Application Load Balancer; AWS Cloud; Your web server environment; Amazon CloudWatch Synthetics)
  • 170. Uncover hidden issues: External issues. Look at metrics emitted by synthetic workloads and real clients of your application to catch issues outside of your own measurement
  • 171. Uncover hidden issues: Misattributed faults
  • 172. Input validation bugs (Bug: Making input validation more strict)
        public void validate() {
      -   if (input.length() > 140) {
      +   if (input.length() > 100) {
            throw new ClientError();
          }
        }
  • 173. Client vs server fault: Client fault (4XX); Server fault (5XX)
  • 174. An increase due to a bad deployment (chart: Error rate by type (4XX vs 5XX); axes: Time, Error rate; series: Server fault)
  • 175. An increase due to a bad deployment (chart: Error rate by type (4XX vs 5XX); axes: Time, Error rate; series: Client fault, Server fault)
  • 176. An increase due to a bad deployment (chart: Error rate by type (4XX vs 5XX); axes: Time, Error rate; series: Client fault, Server fault)
  • 177. An increase due to a bad deployment (chart: Error rate by type (4XX vs 5XX); axes: Time, Error rate; series: Client fault, Server fault)
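A minimal sketch of the "error rate by type" view, assuming faults are classified purely by HTTP status code. The function names and the sample status lists are hypothetical. It shows how a stricter validation check can raise the client fault (4XX) rate while the server fault (5XX) rate stays flat.

```python
def classify(status):
    """Classify an HTTP status code as a client fault, server fault, or ok."""
    if 400 <= status <= 499:
        return "client_fault"
    if 500 <= status <= 599:
        return "server_fault"
    return "ok"


def fault_rates(statuses):
    """Return the fraction of requests that were client or server faults."""
    rates = {"client_fault": 0.0, "server_fault": 0.0}
    total = len(statuses)
    if total == 0:
        return rates
    for status in statuses:
        kind = classify(status)
        if kind in rates:
            rates[kind] += 1
    return {kind: count / total for kind, count in rates.items()}


# Hypothetical traffic before and after a deployment that tightened validation.
BEFORE = [200] * 98 + [400] + [500]
AFTER = [200] * 80 + [400] * 19 + [500]

if __name__ == "__main__":
    print(fault_rates(BEFORE))
    print(fault_rates(AFTER))
```

An alarm that watches only the server fault rate would not fire on the second data set, even though the change on the server side caused the increase.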
  • 178. The same increase, but for a different reason (chart: Per-client fault rate (4XX); axes: Time, Error rate; series: Customer 789, Customer 123, Customer 456, Customer 012, …, Customer 345, Overall errors)
  • 179. Alarming when lots of customers change (chart: Per-client fault rate (4XX); axes: Time, Error rate; series: Customer 789, Customer 123, Customer 456, Customer 012, …, Customer 345, Overall errors)
  • 180. Metrics from another dimension: Percent of requests with errors; Percent of clients with errors
  • 181. Metrics from another dimension. Percent of requests with errors: $\dfrac{\#\text{ of \textbf{requests} with errors}}{\#\text{ of \textbf{requests}}}$
  • 182. Metrics from another dimension. Percent of clients with errors: $\dfrac{\#\text{ of \textbf{customers} with errors}}{\#\text{ of \textbf{customers}}}$
  • 183. Dimensionality and cardinality together: $\dfrac{\#\text{ of \textbf{customers} with errors}}{\#\text{ of \textbf{customers}}}$
  • 184. Dimensionality and cardinality together: $\dfrac{\mathtt{INSIGHT\_RULE\_METRIC(ErrorsPerCustomer,\ UniqueContributors)}}{\mathtt{INSIGHT\_RULE\_METRIC(RequestsPerCustomer,\ UniqueContributors)}}$ for $\dfrac{\#\text{ of \textbf{customers} with errors}}{\#\text{ of \textbf{customers}}}$
  • 185. A stable, alarmable metric: Percent of clients with errors
  • 186. Uncover hidden issues: Misattributed. When many customers suddenly see errors that are categorized as their fault, it may actually be your fault
  • 187. Uncover hidden issues: Misattributed. When many customers suddenly see errors that are categorized as their fault, it may actually be your fault. Calculate metrics like “percent of customers” instead of “percent of requests” by using Contributor Insights rules to estimate the cardinality of a dimension
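The difference between the two ratios can be sketched in a few lines of Python. This computes exact counts in memory, whereas the slides describe estimating cardinality with Contributor Insights rules. The request tuples and customer IDs are hypothetical.

```python
def percent_requests_with_errors(requests):
    """requests: list of (customer_id, is_error) tuples."""
    if not requests:
        return 0.0
    errors = sum(1 for _, is_error in requests if is_error)
    return 100.0 * errors / len(requests)


def percent_customers_with_errors(requests):
    """Share of distinct customers that saw at least one error."""
    customers = {customer for customer, _ in requests}
    if not customers:
        return 0.0
    affected = {customer for customer, is_error in requests if is_error}
    return 100.0 * len(affected) / len(customers)


# Scenario A: one noisy customer sends only bad requests; nine others are fine.
ONE_NOISY_CUSTOMER = [("noisy", True)] * 50 + [
    (f"customer-{i}", False) for i in range(9) for _ in range(5)
]

# Scenario B: ten customers each see one error in ten requests.
MANY_CUSTOMERS_AFFECTED = [
    (f"customer-{i}", j == 0) for i in range(10) for j in range(10)
]

if __name__ == "__main__":
    for name, data in [("A", ONE_NOISY_CUSTOMER), ("B", MANY_CUSTOMERS_AFFECTED)]:
        print(name,
              round(percent_requests_with_errors(data), 1),
              round(percent_customers_with_errors(data), 1))
```

In scenario A the request-level error rate is high but only 10% of customers are affected, which points at one client. In scenario B the request-level rate is modest but every customer is affected, which is the pattern that suggests the fault is on the service side.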
  • 188. Prevent future issues
  • 189. Prevent future issues: Run game days; Monitor utilization
  • 190. Prevent future issues: Monitor utilization
  • 191. Monitoring resource utilization
  • 192. Elastic building blocks (diagram: Application Load Balancer; AWS Cloud; Users; Availability Zone 2 Web servers; Availability Zone 2 Web servers; Auto Scaling Group; …)
  • 193. Elastic building blocks (diagram: Application Load Balancer; AWS Cloud; Users; Availability Zone 2 Web servers; Availability Zone 2 Web servers; Auto Scaling Group; Amazon Elastic Compute Cloud (Amazon EC2); …)
  • 194. Elastic building blocks (diagram: Application Load Balancer; AWS Cloud; Users; Availability Zone 2 Web servers; Availability Zone 2 Web servers; Auto Scaling Group; Elastic Load Balancing; Amazon Elastic Compute Cloud (Amazon EC2); …)
  • 195. Elastic building blocks (diagram: Application Load Balancer; AWS Cloud; Users; Availability Zone 2 Web servers; Availability Zone 2 Web servers; Auto Scaling Group; Elastic Load Balancing; Amazon EC2 Auto Scaling; Amazon Elastic Compute Cloud (Amazon EC2); …)
  • 196. Monitoring the utilization: CPU; Memory; Thread pools; File system
  • 197. Monitoring the utilization: CPU; Memory; Thread pools; File system; Alarms
  • 198. Monitoring the utilization: CPU; Memory; Thread pools; File system; Alarms; Dashboards
  • 199. Monitoring the utilization: CPU; Memory; Thread pools; File system; Amazon EC2 Auto Scaling; Alarms; Dashboards
  • 200. Monitoring the utilization (Amazon EC2 Auto Scaling)
        myASG:
          Type: AWS::AutoScaling::AutoScalingGroup
          Properties:
            MaxSize: '50'
            MinSize: '20'
        myScalingPolicy:
          ...
  • 201. Monitoring the utilization (Amazon EC2 Auto Scaling)
        myASG:
          Type: AWS::AutoScaling::AutoScalingGroup
          Properties:
            MaxSize: '50'
            MinSize: '20'
        myScalingPolicy:
          ...
  • 202. Monitoring the utilization, Amazon EC2 Auto Scaling (chart: Number of instances in an Auto Scaling Group; axes: Time, Instances; series: GroupMaxSize, GroupInServiceInstances)
  • 203. Monitoring the utilization, Amazon EC2 Auto Scaling (chart: Number of instances in an Auto Scaling Group; axes: Time, Instances; series: GroupMaxSize, GroupInServiceInstances; formula: $\dfrac{\text{GroupInServiceInstances}}{\text{GroupMaxSize}} \times 100 =$)
  • 204. Monitoring the utilization, Amazon EC2 Auto Scaling (charts: Number of instances in an Auto Scaling Group, axes Time and Instances, series GroupMaxSize and GroupInServiceInstances; Percent of instances in an Auto Scaling group, axes Time and %, scale up to 100%; formula: $\dfrac{\text{GroupInServiceInstances}}{\text{GroupMaxSize}} \times 100 = \text{percent of instances in an Auto Scaling group}$)
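The percentage on this slide can be written out as a small helper, assuming it is the in-service instance count divided by the group maximum. The sample values reuse `MinSize: '20'` and `MaxSize: '50'` from the template shown earlier. The function name and the 80% alarm threshold are hypothetical.

```python
def asg_utilization_percent(in_service_instances, max_size):
    """Percent of the Auto Scaling group's maximum size currently in service."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    return 100.0 * in_service_instances / max_size


# Hypothetical alarm threshold: warn before the group reaches its maximum.
ALARM_THRESHOLD_PERCENT = 80.0


def should_alarm(in_service_instances, max_size):
    """True when the group is close enough to its maximum to need attention."""
    utilization = asg_utilization_percent(in_service_instances, max_size)
    return utilization >= ALARM_THRESHOLD_PERCENT


if __name__ == "__main__":
    print(asg_utilization_percent(20, 50))  # group at its minimum size
    print(should_alarm(45, 50))             # group close to its maximum
```

Expressing the count as a percentage of the maximum keeps the alarm meaningful when `MaxSize` is raised later, because the threshold does not need to change with it.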
  • 205. Monitoring utilization across abstractions: AWS Lambda; Amazon EC2; Amazon RDS; Amazon DynamoDB; Amazon SQS; Amazon MQ
  • 206. Monitoring utilization across abstractions: AWS Lambda; Amazon EC2; CPU; Memory; File system; Thread pools; Concurrency
  • 207. Automatic usage metrics
  • 208. Prevent future issues: Configure Auto Scaling on all of your elastic resources to react quickly to changes in load and maintain healthy headroom
  • 209. Prevent future issues: Configure Auto Scaling on all of your elastic resources to react quickly to changes in load and maintain healthy headroom. Measure the utilization of everything – from CPU, to thread pools, to quotas – by creating alarms and a capacity dashboard
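One way to picture a capacity dashboard is as a single ranked list of "used versus limit" pairs, whatever the resource type. The sketch below is plain Python under that assumption. The resource names, numbers, and the 75% threshold are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    used: float
    limit: float

    @property
    def utilization_percent(self):
        return 100.0 * self.used / self.limit


def capacity_report(resources, threshold_percent=75.0):
    """Rank resources by utilization and flag those at or above the threshold."""
    ranked = sorted(resources, key=lambda r: r.utilization_percent, reverse=True)
    return [
        (r.name, round(r.utilization_percent, 1),
         r.utilization_percent >= threshold_percent)
        for r in ranked
    ]


# Hypothetical readings that mix hosts, pools, and quotas in one view.
READINGS = [
    Resource("cpu", used=35, limit=100),
    Resource("worker thread pool", used=190, limit=200),
    Resource("function concurrency quota", used=600, limit=1000),
    Resource("file system", used=40, limit=50),
]

if __name__ == "__main__":
    for name, percent, flagged in capacity_report(READINGS):
        print(f"{name}: {percent}%{' (alarm)' if flagged else ''}")
```

Normalizing everything to a percentage of its own limit makes it possible to sort unrelated resources together and see which one will run out of headroom first.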
  • 210. Prevent future issues: Run game days
  • 211. Game day drills
  • 212. Properties of a good game day: Regular; Reasoned; Controlled; Realistic
  • 213. Properties of a good game day: Regular; Reasoned; Controlled; Realistic (AWS Fault Injection Service)
  • 214. Game day observability: Replicate production observability into test environments; Verify behavior of metrics, alarms, dashboards, logs; Add new instrumentation, metrics, alarms afterward
  • 215. Prevent future issues: Game days. Regularly perform controlled experiments of failure modes by using the AWS Fault Injection Service
  • 216. Prevent future issues: Game days. Regularly perform controlled experiments of failure modes by using the AWS Fault Injection Service. Use the same observability tools during experiments as you do in production by including things like alarm and dashboard definitions in your infrastructure as code (AWS Cloud Development Kit (AWS CDK))
  • 217. Navigating the seas of observability
  • 218. Recap: Diagnose issues. Find patterns in high-cardinality metrics; Use dimensionality to compute the right metric; Navigate distributed systems using tracing
  • 219. Recap: Uncover hidden issues. Measure from everywhere; Aggregate metrics in customer-oriented ways
  • 220. Recap: Prevent future issues. Auto scale and track the utilization of everything; Monitor game days just like production
  • 221. Further reading: Amazon Builders’ Library https://aws.amazon.com/builders-library/
  • 222. Thank you! Please complete the session survey in the mobile app. David Yanacek: david-yanacek; @dyanacek; @dyanacek@hachyderm.io; dyanacek.bsky.social