In this talk we’ll use a standard Serverless application which uses of API Gateway, Lambda, DynamoDB, SQS, Step Functions (and other AWS managed services) and explore how Amazon DevOps Guru recognizes operational issues like increased latency and error rates (timeouts, throttling and resource limits) and integrate DevOps Guru with PagerDuty for providing even better incident management.
Amazon DevOps Guru analyzes data like application metrics, logs, events, and traces to establish baseline operational behavior and then uses ML to detect anomalies. The service uses pre-trained ML models that are able to identify spikes in application requests, so it knows when to alert and when not to.
5. AIOPs
Artificial Intelligence for IT Operations (AIOps) is the process of using machine
learning techniques to solve operational problems. The goal of AIOps is to
reduce human intervention in the IT operations processes.
By using advanced machine learning techniques, you can reduce operational
incidents and increase service quality. AIOps can help you with:
• Increase service quality
• for example, by grouping related incidents based on time and language
or by predicting Knowledge Base articles to solve an incident
• Predict incidents before they happen
• Classify new incidents and insights
6. What is AWS DevOps Guru
Amazon DevOps Guru offers a fully managed AIOps platform powered by
machine learning (ML) that is designed to make it easy to improve an
application’s operational performance and availability
DevOps Guru helps detect behaviors that deviate from normal operating
patterns so you can identify operational issues long before they impact
your customers
• increased latency
• error rates (timeouts, throttles, CPU, memory and, disk utilization)
• resource constraints (exceeding AWS account limits)
https://aws.amazon.com/devops-guru
9. DevOps Guru is powered by pre-trained
ML models
https://aws.amazon.com/blogs/machine-learning/amazon-devops-guru-is-powered-by-pre-trained-ml-models-that-encode-operational-excellence/
• Built domain-specific, single-purpose models to identify known failure
modes instead of normal metric behavior.
• DevOps Guru relies on a large ensemble of detectors—statistical models
tuned to detect common adverse scenarios in a variety of operational
metrics.
• DevOps Guru detectors don’t need to be trained or configured. They
work instantly as long as enough history is available.
• Individual detectors work in preconfigured ensembles to generate anomalies
on some of the most important metrics operators monitor: error rates,
availability, latency, incoming request rates, CPU, memory, and disk utilization,
among others.
10. DevOps Guru pre-trained ML detectors
https://aws.amazon.com/blogs/machine-learning/amazon-devops-guru-is-powered-by-pre-trained-ml-models-that-encode-operational-excellence/
• Detector identified a significant trend
in disk usage, heading for disk
exhaustion within 24 hours.
• The model has identified a significant
trend between the vertical dashed lines.
Extrapolating this trend (diagonal dashed
line), the detector predicts time to
resource exhaustion.
• As the metric breaches the horizontal
red line, which acts as a significance
threshold, the detector notifies operators.
11. DevOps Guru pre-trained ML detectors
with periodic behaviors
https://aws.amazon.com/blogs/machine-learning/amazon-devops-guru-is-powered-by-pre-trained-ml-models-that-encode-operational-excellence/
• Many metrics, such as the number of
incoming requests in customer-facing
APIs, exhibit periodic behavior.
• The purpose of the causal convolution
detector is to analyze temporal data
with such patterns and to determine
expected periodic behavior.
• When the detector infers that a metric
is periodic, it adapts normal metric
behavior thresholds to the seasonal
pattern.
12. DevOps Guru Example Application
https://github.com/Vadym79/DevOpsGuruWorkshopDemo inspired by https://github.com/aws-samples/serverless-java-frameworks-samples
15. DevOps Guru Set Up with AWS Organizations
https://aws.amazon.com/blogs/mt/how-to-easily-configure-devops-guru-across-your-organization-with-systems-manager-quick-setup/
18. • Warm up the application (takes between 1 and 24 hours) to create a base line
• Design test experiment to provoke errors and latency increase
• Reduce the service quote of the AWS service (API Gateway, Lambda,
DynamoDB)
• Set very low service quotas for the sake of reducing AWS costs
• Add latency artificially
• Stress test with Hey Tool to run into the operational issues
• See if the DevOps Guru recognized the operational issues
• Remediate the operational issues by increasing service quote, removing the
artificial latency or stopping the stress test
• See whether DevOps Guru closes the incident when it’s resolved
DevOps Guru Examples
| CONFIDENTIAL
19 https://github.com/rakyll/hey
31. • Table or Account Level read/write capacity for
DynamoDB consumption reaching account limit
• Triggered when the account consumed capacity is
approaching table or account-level limits during a period of time
Other DynamoDB operational issues and the
proactive insights 1/2
| CONFIDENTIAL
34
https://aws.amazon.com/de/blogs/aws/automatically-detect-operational-issues-in-lambda-functions-with-amazon-devops-guru-for-serverless/
32. • DynamoDB table consumed capacity reaching AutoScaling Maximum parameter
limit
• Triggered when table consumed capacity is reaching
AutoScaling Max parameters limit over a period.
Other DynamoDB operational issues and the
proactive insights 2/2
| CONFIDENTIAL
35
https://aws.amazon.com/de/blogs/aws/automatically-detect-operational-issues-in-lambda-functions-with-amazon-devops-guru-for-serverless/
66. DevOps Guru Supported Services and Pricing
https://aws.amazon.com/de/devops-guru/pricing/
67. DevOps Guru Supported Services and Pricing
https://aws.amazon.com/de/devops-guru/pricing/
$2,016 per
resource per month
$3,024 per
resource per
month
68. DevOps Guru Cost Estimator
https://aws.amazon.com/de/devops-guru/pricing/
69. DevOps Guru Conclusions, Obeservations,
Suggestions
• Most operational issues have been correctly recognized so far
• It took several (at least 7) minutes to create an incident after
anomaly appeared
• Correctly no insights created for the temporary incidents
• Short time Lambda, DynamoDB and API Gateway Throttling
• Recommendations for the insight reason could be more precise
• No differentiation between Lambda throttling because of reaching
individual function concurrency limit or the total AWS account
concurrency limit
• No differentiation between Lambda Timeout and Init Error
70. DevOps Guru Conclusions,
Obeservations, Suggestions
• Lambda duration anomalous insights (Duration p90)
• took huge time to create (sometimes more than 30
minutes). Maybe because of the medium severity
• DevOps Guru Proactive Insights
• Improved: will be now displayed more quickly
• Missed some important ones, like not used Lambda
Provisioned Concurrency for a long period of time
71. DevOps Guru Conclusions,
Obeservations, Suggestions
• Log Groups haven’t always been displayed within DevOps Guru
Insight
• Missing link to Tracing (AWS X-Ray, CloudWatch ServiceLens
and integrations to the 3rd observability tools i.e. Lumigo)
72. DevOps Guru for RDS
https://aws.amazon.com/devops-guru/features/devops-guru-for-rds/ https://aws.amazon.com/de/blogs/devops/leverage-devops-guru-for-rds-to-detect-anomalies-and-resolve-operational-issues/