Rob.Jahn@Dynatrace.com
[Diagram: your app/container running on cloud-native infrastructure]
Proof – autonomous cloud survey (median vs 95th percentile)
Verdict: The majority are not “cloud native” (yet)
Metric – Median vs. 95th percentile:
• Business-impacting deployments: 3 out of 10 vs. 1 out of 10
• Hotfixes per production deployment: 3 vs. 0
• MTTR (mean time to repair): 4.8 days vs. ~4 hours
• Code to production (commit cycle time): 2.5 weeks vs. 2 days
1. Complexity
2. Manual operations
3. Lack of information
4. People
5. Identifying root cause
MULTI-PLATFORM MONITORING → QUALITY SHIFT-LEFT → DEPLOYMENT ENABLE-RIGHT → SELF-HEALING
Key differentiators – unbreakable continuous delivery blueprint
• Automate operations (self-healing) – auto-mitigate bad deployments in production
• Automate deployment (enable-right) – push “monitoring-as-code” and “content events” for auto-validation and auto-alerting
• Automate quality (shift-left) – automate the pipeline and stop bad code changes before they reach prod
• Automated monitoring – full-stack monitoring with context and dependency information for infrastructure and transactions
Monitoring Solution Capabilities
Automated Rollout | Full Stack Monitoring | Tags & Metadata
• Full-stack layers: Applications, Services, Processes, Hosts, Data Centers
• Example tags: Env: production, App: crm, Container: blue, Service: front-end, Namespace: prod, Support-team: alpha-dog, Process-group: tomcat, owner: dev-one, region: eastus
Monitoring Solution Capabilities, continued
Understands Dependencies | Event Context | API
• Event context: 1. Deployments 2. Configuration changes 3. Testing start/stop 4. Maintenance start/stop
• API: 1. Push events 2. Query problems 3. Query topology 4. Query time-series metrics
Blueprint capability – quality shift-left: automate the pipeline and stop bad code changes before they reach prod.
[Diagram: delivery flow – (1) code/config change → CI/CD → (3) staging → (5) end users, with monitoring at steps (2) and (4)]
{
"lowerBound": 100,
"upperBound": 1000,
"_comment": "global configuration environment-wide",
"timeseries": [
{
"timeseriesId": "service.responsetime",
"aggregation": "avg",
"entityIds": "SERVICE-3211ABE8813B9239",
"lowerBound": 20,
"upperBound": 300
}
]
}
The monitoring spec above drives the YES/NO decision at the quality gate before the release reaches end users.
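A gate that consumes a spec like the JSON above can be sketched in a few lines of Python. This is a minimal sketch, not the actual product logic; the helper name `evaluate_spec` and the rule that per-series bounds override the spec-wide defaults are assumptions for illustration.

```python
# Sketch: evaluate observed metric values against a monitoring spec.
# Assumes values were already fetched from the monitoring API.

def evaluate_spec(spec, observed):
    """Return (passed, violations) for observed {timeseriesId: value}."""
    violations = []
    for ts in spec.get("timeseries", []):
        # Per-series bounds override the spec-wide defaults (assumption).
        lower = ts.get("lowerBound", spec.get("lowerBound"))
        upper = ts.get("upperBound", spec.get("upperBound"))
        value = observed.get(ts["timeseriesId"])
        if value is None:
            continue  # no data for this series; skip rather than fail
        if not (lower <= value <= upper):
            violations.append((ts["timeseriesId"], value, lower, upper))
    return (len(violations) == 0, violations)

spec = {
    "lowerBound": 100,
    "upperBound": 1000,
    "timeseries": [
        {
            "timeseriesId": "service.responsetime",
            "aggregation": "avg",
            "entityIds": "SERVICE-3211ABE8813B9239",
            "lowerBound": 20,
            "upperBound": 300,
        }
    ],
}

ok, bad = evaluate_spec(spec, {"service.responsetime": 412})
# 412 ms exceeds the 300 ms upper bound, so the gate answers NO
```

The gate's YES/NO answer is simply whether every observed series stays inside its bounds.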
Pitometer Specfile structure:
#1 Metric source & query
#2 Grading details & metric score
#3 Total scoring across objectives
Example objectives:
• Allocated bytes (from Prometheus): > 2 GB → 0 points; < 2 GB → 20 points
• Conversion rate (Dynatrace): < 2% → 0 points; < 5% → 10 points; > 5% → 20 points
Grader result: value 3 GB → score 0; value 3.9% → score 10. Total score: 10 → result is a fail.
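The scoring behind such a specfile can be sketched as threshold bands that map an observed value to points, summed into a total score. This is a simplification, not the actual Pitometer implementation; the pass mark of 20 is an assumed value for illustration, while the thresholds and example values come from the slide above.

```python
# Sketch of objective-based grading: ordered (upper_limit, points)
# bands; the first band whose limit exceeds the value wins, otherwise
# the fallback points apply.

def grade(value, bands, fallback):
    for limit, points in bands:
        if value < limit:
            return points
    return fallback

# Objective 1: allocated bytes in GB (Prometheus) – lower is better
mem_score = grade(3.0, [(2.0, 20)], fallback=0)          # 3 GB -> 0 points

# Objective 2: conversion rate in % (Dynatrace) – higher is better
conv_score = grade(3.9, [(2.0, 0), (5.0, 10)], fallback=20)  # 3.9% -> 10

total = mem_score + conv_score   # 10
passed = total >= 20             # assumed pass mark: fails, as on the slide
```

With a 3 GB heap and a 3.9% conversion rate the total is 10, so the gate reports a fail, matching the worked example above.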
Build Pipeline
1. Build code
2. Run unit tests
3. Create artifact that contains the perf spec
Release Pipeline
1. Deploy code
2. Performance test
3. Deployment event
4. Pitometer quality gate
[Diagram, steps 1–6: source code (code + perf spec) feeds the pipelines; the Keptn Pitometer service makes API calls against full-stack monitoring data in Dynatrace, collected from the application under test]
https://github.com/dt-demos/pitometer-web-service
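The "deployment event" step above can be sketched as building an event payload the release pipeline would push to the monitoring API. Field names follow the spirit of Dynatrace's event API, but treat the exact shape as illustrative and check your tool's documentation; the helper name and tag values are assumptions, and the network POST is deliberately omitted.

```python
# Sketch: construct a deployment event payload in the release pipeline.
import json

def deployment_event(version, entity_tags, ci_link):
    return {
        "eventType": "CUSTOM_DEPLOYMENT",
        "deploymentName": f"Deploy version {version}",
        "deploymentVersion": version,
        "source": "Azure DevOps release pipeline",
        # attachRules tie the event to monitored entities by tag,
        # e.g. the Env/App/Service tags shown earlier (illustrative).
        "attachRules": {"tagRule": [{
            "meTypes": ["SERVICE"],
            "tags": entity_tags,
        }]},
        "ciBackLink": ci_link,
    }

payload = deployment_event("123", ["app:crm", "env:production"],
                           "https://dev.azure.com/...")
body = json.dumps(payload)
# The pipeline would POST `body` to the monitoring tool's events
# endpoint with an API token header; omitted to keep the sketch offline.
```

Pushing this event gives the monitoring tool the context it needs to correlate a later problem with the deployment that caused it.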
Four checks against the application, cloud infrastructure, and database, driven by the monitoring tool:
Check 1 – Is bad coding leading to higher costs?
• Metrics: memory usage; bytes sent/received; overall CPU; CPU per transaction type
Check 2 – New dependencies? On purpose? Services connecting accurately? Number of container instances needed?
• Metrics: number of incoming/outgoing dependencies; number of instances running on containers
Check 3 – Are we jeopardizing our SLAs? Does load balancing work? Difference between canaries?
• Metrics: response time (percentiles); throughput & performance per instance/canary
Check 4 – Did we introduce new “hidden” exceptions?
• Metrics: total exceptions; exceptions by class & service
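Check 3 – comparing canaries on response-time percentiles – can be sketched as a simple regression test between two sample sets. The sample values, percentile method, and 20% tolerance are all illustrative assumptions, not figures from the deck.

```python
# Sketch: fail the canary check if its p95 response time regresses
# more than a tolerance factor beyond the baseline's p95.

def percentile(samples, p):
    """Nearest-rank percentile; good enough for a gate sketch."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

def canary_regressed(baseline_ms, canary_ms, p=95, tolerance=1.2):
    """True if the canary's p95 exceeds 1.2x the baseline's p95."""
    return percentile(canary_ms, p) > tolerance * percentile(baseline_ms, p)

baseline = [120, 130, 125, 140, 150, 135, 128, 145, 138, 132]
canary   = [120, 132, 260, 280, 300, 140, 150, 290, 310, 135]
regressed = canary_regressed(baseline, canary)
# True: the canary's p95 (310 ms) is far above 1.2 x baseline p95 (150 ms)
```

Percentiles rather than averages matter here: a canary can look fine on mean response time while its tail latency quietly violates the SLA.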
Blueprint capability – deployment enable-right: push “monitoring-as-code” and “content events” for auto-validation and auto-alerting.
[Diagram, steps 1–4: release automation and deployment automation send context – “green” deployment, deploy “version 123”, adjust memory by “100MB” – to the monitoring tools, which show all dependencies and augment the data for richer context]
Integrating the DevOps toolchain: pushed events carry context fields such as “last seen by”, source, app owner, and runbooks.
Blueprint capability – self-healing operations: auto-mitigate bad deployments in production.
Self-healing automation – problem evolution vs. user impact:
1. CPU exhausted? Add a new service instance!
2. High garbage collection? Adjust/revert memory settings!
3. Issue with BLUE only? Switch back to GREEN!
4. Hung threads? Restart service!
Impact mitigated?
5. Still ongoing? Initiate rollback!
Still ongoing? Escalate – mark bad commits, update tickets.
2:00 a.m. alert? Auto-mitigate!
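The escalation ladder above can be sketched as event-driven runbook automation: map a detected problem signature to a mitigation action and escalate when mitigation does not clear the impact. Signature and action names here are illustrative placeholders, not a real runbook API.

```python
# Sketch: dispatch a problem signature to its runbook action, then
# fall back to rollback and on-call escalation if impact persists.

RUNBOOK = [
    ("cpu_exhausted",        "add_service_instance"),
    ("high_garbage_collect", "revert_memory_settings"),
    ("blue_only_issue",      "switch_to_green"),
    ("hung_threads",         "restart_service"),
]

def mitigate(problem, still_impacting):
    """Return the ordered list of actions taken for a problem.

    still_impacting is a callback that re-checks user impact after the
    actions taken so far (e.g. by querying the monitoring API).
    """
    actions = [act for sig, act in RUNBOOK if sig in problem["signatures"]]
    if not actions or still_impacting(actions):
        actions.append("initiate_rollback")
        if still_impacting(actions):
            actions.append("escalate_to_oncall")
    return actions

# A 2 a.m. alert: BLUE deployment misbehaving, mitigation clears it
problem = {"signatures": {"blue_only_issue"}}
taken = mitigate(problem, still_impacting=lambda acts: False)
# -> ["switch_to_green"]
```

The key design point is that a human is paged only at the bottom of the ladder, after the cheap, reversible mitigations have already been tried.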
[Diagram: application / cloud infrastructure monitoring tool] Add the new metric to the quality gate – think automation here, too!
{
  "lowerBound": 100,
  "upperBound": 1000,
  "_comment": "global configuration environment-wide",
  "timeseries": [
    {
      "timeseriesId": "service.responsetime",
      "aggregation": "avg",
      "entityIds": "SERVICE-3211ABE8813B9239",
      "lowerBound": 20,
      "upperBound": 300
    }
  ]
}
Check out Azure Logic Apps and Azure Automation.
https://medium.com/@sashman90/ops-mitigation-triangle-300c81d97df6
Unbreakable delivery pipeline in action – builds #17 and #18 flow through Staging → Approve Staging → Production → Approve Production, each stage with CI/CD. Along the way the pipeline pushes context, runs an auto-quality gate (performance gates as code), pushes context again, auto-validates, and auto-remediates (ops as code: SLA, remediation).
https://cloudblogs.microsoft.com/opensource
keptn.sh – open-source framework for the unbreakable pipeline and more
CORE CAPABILITIES
• Automated multi-stage unbreakable pipelines
• Self-healing blue/green deployments
• Event-driven runbook automation
DESIGN PRINCIPLES
• GitOps-based collaboration
• Operator patterns for all logic components
• Monitoring-and-operations-as-code
• Built on and for Kubernetes
• Event-driven and serverless
• Pluggable tooling
Questions?
N+1 Architectural Anti-Pattern – “works” well within a single process:
• 1 call to Quote Service = 44 calls to Product Service
• 1 call to Quote Service = 87 calls to the Product DB
26k database calls in one transaction (per-statement counts: 809, 3956, 4347, 4773, 3789, 3915, 4999, 33)
Payload Flood Architectural Anti-Pattern – [chart: individual payloads of 18 MB, 21 MB, 20 MB, and 31.6 MB versus a 69 MB total]
Using AI and Automation to Build Resiliency into Azure DevOps
Editor's Notes

  • #3 Feeling the pressure to keep up with customer and competitive demand in the market, or the squeeze to deliver more business value at an increasingly fast pace? While the cloud, containerization, and microservices offer efficiency and scaling, many organizations aren’t quite prepared for the additional complexity of cloud-native technologies and demands. In the context of software delivery pipelines, AI, automation, and monitoring facilitate the management of complex modern delivery platforms and help build resiliency into your deployments for better performance.
  • #4 Review how AI, automation, and modern monitoring can facilitate management of complex modern delivery platforms and help build resiliency into your deployments, increasing your feature velocity. Review key performance indicators and an implementation map to get you there. Demo concepts of automated quality gates.
  • #5 Show of hands: How long does it take from final code commit to reach production? How many hotfixes in a typical release? How long does it take to repair a production problem? Is it improving with cloud technologies? Many organizations aren’t quite prepared for the additional complexity of cloud-native technologies and demands.
  • #6 Results from ACM survey of Dynatrace enterprise customers