Amazon DevOps Guru for the Serverless Applications at AWS Community Day NL 2023

DevOps Guru for the Serverless applications
Vadym Kazulkin, ip.labs, AWS Community Day NL, September 20 2023

Contact
Vadym Kazulkin
ip.labs GmbH Bonn, Germany
Co-Organizer of the Java User Group Bonn
v.kazulkin@gmail.com
@VKazulkin
https://dev.to/vkazulkin
https://github.com/Vadym79/
https://www.linkedin.com/in/vadymkazulkin
https://www.iplabs.de/

ip.labs
https://www.iplabs.de/

AIOPs
Artificial Intelligence for IT Operations (AIOps) is the process of using machine
learning techniques to solve operational problems. The goal of AIOps is to
reduce human intervention in the IT operations processes.
By using advanced machine learning techniques, you can reduce operational
incidents and increase service quality. AIOps can help you with:
• Increase service quality
• for example, by grouping related incidents based on time and language
or by predicting Knowledge Base articles to solve an incident
• Predict incidents before they happen
• Classify new incidents and insights

What is AWS DevOps Guru
Amazon DevOps Guru offers a fully managed AIOps platform powered by
machine learning (ML) that is designed to make it easy to improve an
application’s operational performance and availability
DevOps Guru helps detect behaviors that deviate from normal operating
patterns so you can identify operational issues long before they impact
your customers
• increased latency
• error rates (timeouts, throttles, CPU, memory and, disk utilization)
• resource constraints (exceeding AWS account limits)
https://aws.amazon.com/devops-guru

Benefits of DevOps Guru

How DevOps Guru work

DevOps Guru is powered by pre-trained
ML models
https://aws.amazon.com/blogs/machine-learning/amazon-devops-guru-is-powered-by-pre-trained-ml-models-that-encode-operational-excellence/
• Built domain-specific, single-purpose models to identify known failure
modes instead of normal metric behavior.
• DevOps Guru relies on a large ensemble of detectors—statistical models
tuned to detect common adverse scenarios in a variety of operational
metrics.
• DevOps Guru detectors don’t need to be trained or configured. They
work instantly as long as enough history is available.
• Individual detectors work in preconfigured ensembles to generate anomalies
on some of the most important metrics operators monitor: error rates,
availability, latency, incoming request rates, CPU, memory, and disk utilization,
among others.

DevOps Guru pre-trained ML detectors
• Detector identified a significant trend
in disk usage, heading for disk
exhaustion within 24 hours.
• The model has identified a significant
trend between the vertical dashed lines.
Extrapolating this trend (diagonal dashed
line), the detector predicts time to
resource exhaustion.
• As the metric breaches the horizontal
red line, which acts as a significance
threshold, the detector notifies operators.

DevOps Guru pre-trained ML detectors
with periodic behaviors
• Many metrics, such as the number of
incoming requests in customer-facing
APIs, exhibit periodic behavior.
• The purpose of the causal convolution
detector is to analyze temporal data
with such patterns and to determine
expected periodic behavior.
• When the detector infers that a metric
is periodic, it adapts normal metric
behavior thresholds to the seasonal
pattern.

DevOps Guru Example Application
https://github.com/Vadym79/DevOpsGuruWorkshopDemo inspired by https://github.com/aws-samples/serverless-java-frameworks-samples

Monitoring & Alerting of the Serverless
Applications

DevOps Guru Set Up with AWS Organizations
https://aws.amazon.com/blogs/mt/how-to-easily-configure-devops-guru-across-your-organization-with-systems-manager-quick-setup/

• Warm up the application (takes between 1 and 24 hours) to create a base line
• Design test experiment to provoke errors and latency increase
• Reduce the service quote of the AWS service (API Gateway, Lambda,
DynamoDB)
• Set very low service quotas for the sake of reducing AWS costs
• Add latency artificially
• Stress test with Hey Tool to run into the operational issues
• See if the DevOps Guru recognized the operational issues
• Remediate the operational issues by increasing service quote, removing the
artificial latency or stopping the stress test
• See whether DevOps Guru closes the incident when it’s resolved
DevOps Guru Examples
| CONFIDENTIAL
19 https://github.com/rakyll/hey

DevOps Guru: Recognize operational issues
in DynamoDB

DevOps Guru Examples: DynamoDB Throttling
hey -q 20 -z 15m -c 20 -H "X-API-Key:
XXXa6XXXX " https://XXX.execute-api.eu-
central 1.amazonaws.com/prod/products/1

DevOps Guru Examples: DynamoDB Throttling

• Table or Account Level read/write capacity for
DynamoDB consumption reaching account limit
• Triggered when the account consumed capacity is
approaching table or account-level limits during a period of time
Other DynamoDB operational issues and the
proactive insights 1/2
| CONFIDENTIAL
34
https://aws.amazon.com/de/blogs/aws/automatically-detect-operational-issues-in-lambda-functions-with-amazon-devops-guru-for-serverless/

• DynamoDB table consumed capacity reaching AutoScaling Maximum parameter
limit
• Triggered when table consumed capacity is reaching
AutoScaling Max parameters limit over a period.
Other DynamoDB operational issues and the
proactive insights 2/2
| CONFIDENTIAL
35
https://aws.amazon.com/de/blogs/aws/automatically-detect-operational-issues-in-lambda-functions-with-amazon-devops-guru-for-serverless/

in API Gateway

DevOps Guru Examples: API Gateway
HTTP 429 „too many requests“ Error
Query to exaust the quota
hey -q 10 -z 1m -c 10 -H "X-API-Key:
XXXa6XXXX" https://XXX.execute-
api.eu
-central-1.amazonaws.com/prod/
products/1

DevOps Guru Examples: API Gateway
HTTP 404 „Not Found“ Error
Query for not existing product id ,e.g. 200
hey -q 1 -z 15m -c 1 -H "X-API-Key: XXXa6XXXX"
https://XXX.execute-api.eu-central-
1.amazonaws.com/prod/products/200

DevOps Guru Examples: API Gateway 4XX Error

in Lambda

DevOps Guru Examples: Lambda Throttling 1
hey -q 5 -z 15m -c 5 -H "X-API-Key: XXXa6XXXX" https://XXX.execute-
api.eu-central-1.amazonaws.com/prod/products/1

DevOps Guru Examples: Lambda Throttling 1

DevOps Guru Examples: Lambda Timeout Error
Add 31 sec latency in the
code of the Lambda function

DevOps Guru Examples: Lambda Init Error
Java runtime requires 256 MB
memory to start and execute
this function

DevOps Guru Examples: Lambda Error

DevOps Guru Examples: Lambda Increased
Latency
Temporary add 28 sec
latency in the code of
the Lambda function

DevOps Guru Examples: Lambda Increased
Latency

in SQS

DevOps Guru Examples: Operational Issues in
SQS
Temporary add 26 sec
latency in the code of
the Lambda function

SQS

in RDS

DevOps Guru Examples: Operational Issues
Lambda -> AWS RDS Postgres w/o RDS Proxy
hey -q 100 -z 15m -c 100 -H "X-API-Key: XXXa6XXXX" https://XXX.execute-api.eu-
central-1.amazonaws.com/prod/productsFromRDS/2

Amazon in Kinesis

Amazon Kinesis Data Stream -> Lambda -> (S3)

in AWS Step Functions

Amazon Step Functions -> Lambda

DevOps Guru Proactive Insights

DevOps Guru Proactive Examples: Lambda
timeout exceeds recommended SQS visibility

DevOps Guru Proactive Examples: SQS
triggered Lambda does not have a DLQ

function consuming DynamoDB/Kinesis
stream without failure destination

function has concurrency spillover
hey -q 1 -z 30m -c 9 -m DELETE -H "X-API-Key: XXXa6XXXX" -H "Content-Type: application/json;charset=utf-8"
https://XXX.execute-api.eu-central-1.amazonaws.com/prod/products/11

DevOps Guru integration in incident
management tools
• AWS OPsCenter (via AWS Systems Manager)
• PagerDuty
• Atlassian Opsgenie

DevOps Guru Integration Settings

DevOps Guru Integration with PagerDuty
https://www.pagerduty.com/docs/guides/amazon-devops-guru-integration-guide/

Enter „Integration
URL“ generated by
PagerDuty

DevOps Guru PagerDuty Incidents

DevOps Guru Supported Services and Pricing
https://aws.amazon.com/de/devops-guru/pricing/

DevOps Guru Supported Services and Pricing
$2,016 per
resource per month
$3,024 per
resource per
month

DevOps Guru Cost Estimator

DevOps Guru Conclusions, Obeservations,
Suggestions
• Most operational issues have been correctly recognized so far
• It took several (at least 7) minutes to create an incident after
anomaly appeared
• Correctly no insights created for the temporary incidents
• Short time Lambda, DynamoDB and API Gateway Throttling
• Recommendations for the insight reason could be more precise
• No differentiation between Lambda throttling because of reaching
individual function concurrency limit or the total AWS account
concurrency limit
• No differentiation between Lambda Timeout and Init Error

DevOps Guru Conclusions,
Obeservations, Suggestions
• Lambda duration anomalous insights (Duration p90)
• took huge time to create (sometimes more than 30
minutes). Maybe because of the medium severity
• DevOps Guru Proactive Insights
• Improved: will be now displayed more quickly
• Missed some important ones, like not used Lambda
Provisioned Concurrency for a long period of time

DevOps Guru Conclusions,
Obeservations, Suggestions
• Log Groups haven’t always been displayed within DevOps Guru
Insight
• Missing link to Tracing (AWS X-Ray, CloudWatch ServiceLens
and integrations to the 3rd observability tools i.e. Lumigo)

DevOps Guru for RDS
https://aws.amazon.com/devops-guru/features/devops-guru-for-rds/ https://aws.amazon.com/de/blogs/devops/leverage-devops-guru-for-rds-to-detect-anomalies-and-resolve-operational-issues/

www.iplabs.de
Accelerate Your Photo Business
Get in Touch

Amazon DevOps Guru for the Serverless Applications at AWS Community Day NL 2023

Recommended

Recommended

More Related Content

Similar to Amazon DevOps Guru for the Serverless Applications at AWS Community Day NL 2023

Similar to Amazon DevOps Guru for the Serverless Applications at AWS Community Day NL 2023 (20)

More from Vadym Kazulkin

More from Vadym Kazulkin (20)

Recently uploaded

Recently uploaded (20)

Amazon DevOps Guru for the Serverless Applications at AWS Community Day NL 2023