2. The journey
Why self-healing applications?
What is needed for self-healing applications?
Auto-remediation as part of a CI/CD pipeline
Build your own auto-remediation
3. On average, a single transaction uses 82 different types of technology
Browser
Multi-geo
Mobile Network
Code
Hosts
Logs
IoT
3rd parties
Services
Cloud SDN
Containers
Applications are getting more complex!
4. Consequences of complexity
Problem
• Not reproducible in Test and cannot be troubleshot with current tooling
• After months of investigation, with customers being impacted, the root cause of the issue cannot be found
Impact
• The issue causes severe slowdowns and timeouts for users, eventually requiring a manual failover to the DR site
• Current alerting misleads the operations team on their investigation path
Consequences
• Poor customer experience drives poor conversion rates
• Recurring issue for months, happening more frequently
• 479 hours lost in war rooms up to today; 6 teams and one 3rd party were involved
• Has cost £23,950 ($32,494) so far
• Brand reputation impacted by bad tweets
9. What is needed for self-healing applications?
Monitoring: know what's going on in your applications
• End-to-end
• Full-stack – fully integrated in production (or even in staging)
Automation/Execution: perform mitigation/remediation actions
• Access to all systems
• Automation system should be isolated from the production system
• APIs
10. Know what's going on in your applications
Monitor your applications – identify the root cause of the problem!
15. Steps to mitigate the bad deployment
1. Fetch information about the event
2. Process the data
3. Select the corresponding remediation action
4. Execute the remediation action
Keep track of all automation steps.
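These four steps map directly onto a small event handler. Below is a minimal sketch in Node.js; the monitoring client, the event shape, and the remediation table are illustrative assumptions, not any specific product's API.

// Minimal sketch of the four-step remediation flow.
// All names (monitoring, REMEDIATIONS, event shape) are illustrative.
const monitoring = {
  // stub – in reality this would call your monitoring tool's REST API
  getProblemDetails: async function (pid) {
    return { rankedEvents: [{ eventType: "BAD_DEPLOYMENT" }] };
  }
};

// hypothetical remediation actions – replace with real runbooks/scripts
async function rollbackToPreviousVersion(details) { /* ... */ }
async function restartProcess(details) { /* ... */ }

// map a detected root-cause type to a remediation action
const REMEDIATIONS = {
  BAD_DEPLOYMENT: rollbackToPreviousVersion,
  HIGH_CPU: restartProcess
};

async function handleProblemEvent(event) {
  const log = [];

  // 1. Fetch information about the event
  const details = await monitoring.getProblemDetails(event.pid);
  log.push("fetched details for " + event.pid);

  // 2. Process the data (pick the most likely root cause)
  const rootCause = details.rankedEvents[0];
  log.push("root cause: " + rootCause.eventType);

  // 3. Select the corresponding remediation action
  const action = REMEDIATIONS[rootCause.eventType];
  if (!action) throw new Error("no remediation defined for " + rootCause.eventType);

  // 4. Execute the remediation action
  await action(details);
  log.push("executed " + action.name);

  // keep track of all automation steps
  return log;
}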
16. Auto-remediation with Ansible (Tower)
• APIs are key to enabling automation
• Ansible Tower makes extensive use of APIs internally and also exposes them externally
• Ansible playbooks are scripts that are executed from a central host against different machines
• Multiple OSes are supported
• Idempotent
• Playbooks can be orchestrated in workflows and job templates
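Because Tower exposes its functionality through a REST API, a monitoring tool can trigger a remediation playbook with a single HTTP call. A minimal sketch, assuming you have a Tower host, an OAuth token, and a preconfigured job template; the host, token variable, and template ID below are placeholders:

// Launch a preconfigured Ansible Tower job template via its REST API.
// TOWER_HOST, TOWER_TOKEN and JOB_TEMPLATE_ID are placeholders.
// Uses the global fetch available in Node 18+.
const TOWER_HOST = "https://tower.example.com";
const JOB_TEMPLATE_ID = 42; // hypothetical job template

async function launchRemediationJob(extraVars) {
  const resp = await fetch(
    TOWER_HOST + "/api/v2/job_templates/" + JOB_TEMPLATE_ID + "/launch/",
    {
      method: "POST",
      headers: {
        "Authorization": "Bearer " + process.env.TOWER_TOKEN,
        "Content-Type": "application/json"
      },
      // extra_vars tells the playbook what to act on
      body: JSON.stringify({ extra_vars: extraVars })
    }
  );
  if (!resp.ok) throw new Error("Tower launch failed: " + resp.status);
  return resp.json(); // contains the new job's ID for status tracking
}

// e.g. launchRemediationJob({ service: "checkout", action: "restart" });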
19. Auto-remediation with serverless approaches
• No need for separate installation/maintenance of a system
• Pay-as-you-go (most often free)
• Support for a variety of languages
• No built-in support for automation tasks
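On AWS Lambda, for example, the glue code is just a handler that receives the monitoring tool's problem notification (e.g. a webhook delivered via API Gateway) and hands it to the remediation logic, like the fragment on the next slide. A minimal skeleton with illustrative names:

// Minimal AWS Lambda skeleton for a problem-notification webhook.
// The event shape and remediateProblem are illustrative – remediateProblem
// stands for remediation logic like the fragment on the next slide.
function remediateProblem(problem, cb) { /* see next slide */ cb(null, "ok"); }

exports.handler = function (event, context, callback) {
  const myProblem = JSON.parse(event.body); // webhook payload from the monitoring tool

  remediateProblem(myProblem, function (err, remediationLog) {
    if (err) {
      return callback(null, { statusCode: 500, body: JSON.stringify(err) });
    }
    callback(null, { statusCode: 200, body: remediationLog });
  });
};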
20.
// remediation
// Note: myEvent is the incoming notification and myProblem the parsed
// problem object; both come from the surrounding handler (not shown).
dtUtils.getProblemDetails(myProblem.pid, function (err, resp) {
  if (err || !resp.ok) {
    console.error("error getProblemDetails for pid " + myEvent.pid + ": " + JSON.stringify(err));
    return callback(err);
  }
  var myRankedEvents = resp.body.result.rankedEvents;
  console.info("rankedEvents: " + JSON.stringify(myRankedEvents));
  if (myRankedEvents != null) {
    var myRootCause = getRootCause(myRankedEvents);
    if (myRootCause != undefined) {
      // root cause found
      console.info("root cause for PID " + myEvent.pid + ": " + JSON.stringify(myRootCause.eventType));
      triggerRemediationAction(myProblem, myRootCause, function (err, res, remediationAction) {
        if (err) {
          // remediation failed: leave a comment on the problem and bail out
          console.error("error for remediation of " + myEvent.pid + " (" + myRootCause.eventType + "): " + JSON.stringify(err));
          addComment(myEvent.pid, "error when performing remediation " + JSON.stringify(err), function (err, res) {
            if (err) {
              return callback(err);
            }
          });
          return callback(err);
        }
        var remediationLog = "Auto-remediation: " + remediationAction.title + " executed:\n" +
          remediationAction.description;
        // (the fragment ends here on the slide; presumably the log is pushed
        // back to the problem as a comment and the callback is invoked)
      });
    }
  }
});
21. Comparison
Automation platforms
• Runbook/playbook automation built in
• Step-by-step instructions (YAML)
• Specialized for deployment, provisioning, configuration management
• Maintenance of the platform needed
Serverless
• Different vendors
• Different languages (JS, Java, Python, …)
• Not limited to runbooks
• No support for typical runbook tasks
24. Embed auto-remediation in your CI/CD pipeline
• Shift-left: break the pipeline earlier
• Path to NoOps: self-healing, …
• Shift-right: tags, deploys, events
• Actionable feedback loops
25. Injecting speed & quality: an automatic gate at test & performance
• Continuous performance validation for daily builds
• Root-cause details automatically pushed to JIRA
• Decisions made to compare, break, or good-to-go
Shift-left: engage Dev with earlier & automated feedback
27. https://github.com/Dynatrace/AWSDevOpsTutorial
Build validation gate:
• pushDynatraceDeploymentEvent – pushes deployment info to Dynatrace entities
• validateBuildDynatraceWorker – compares builds and approves/rejects the pipeline
Production gate:
• pushDynatraceDeploymentEvent – pushes deployment info to Dynatrace entities
• validateBuildDynatraceWorker – validates production and approves/rejects the pipeline
• handleDynatraceProblemNotification – executes auto-remediating actions, e.g. rollback
[Pipeline diagram: Builds 6 and 7 pass through both gates; each gate either auto-approves or auto-rejects the build.]
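The approve/reject decision comes down to comparing a few key metrics of the new build against the previous one. A minimal sketch of such a gate; getBuildMetrics, the metric names, and the threshold are illustrative, not the tutorial's actual implementation:

// Sketch of a build-validation gate: compare the new build's metrics
// against the previous build and approve or reject the pipeline.
const DEGRADATION_LIMIT = 1.2; // reject if a metric got >20% worse (illustrative)

async function getBuildMetrics(build) {
  // stub – in reality query your monitoring tool's metrics API for this build
  return { responseTime: 120, failureRate: 0.5 };
}

async function validateBuild(prevBuild, newBuild) {
  const prev = await getBuildMetrics(prevBuild);
  const next = await getBuildMetrics(newBuild);

  for (const metric of ["responseTime", "failureRate"]) {
    if (next[metric] > prev[metric] * DEGRADATION_LIMIT) {
      return { approved: false, reason: metric + " degraded: " + prev[metric] + " -> " + next[metric] };
    }
  }
  return { approved: true };
}

// The pipeline worker would then report the result to its CI system's
// approve/reject API (e.g. AWS CodePipeline's PutApprovalResult).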
28. How to start?
1. Monitor your environment
2. Define your runbooks
3. Start small and with low-hanging fruit
   What are frequent issues? Of these, which ones are easy to deal with?
4. Build more and more automation along the way
Cultural change!
30. AI to the rescue
• Predefined actions to execute: runbooks, shell scripts, batch files, …
• Automated calling of scripts: Ansible Tower, workflows, …
• Automated selection or generation of solutions: AI, big data, …
That's not going to be easy – container and cloud platforms allow for faster deployments and independent release cycles WHILE increasing operational complexity.
Monolith to microservices: in-memory calls become network calls, and Istio brings more hops and more technologies – overall we see 82 on average!
Applications are incredibly complex.
How does it work end-to-end? Nobody knows all the parts...
Real customer problem in a complex cloud environment.
The problem is not only the money spent on this, but also the time lost and the damage to brand reputation.
Does your enterprise look like this today?
Bob has many layers to look through for problems.
Mean Time to Recovery (MTTR) for application problems could take 72 hours or more.
Can Bob find the problem quickly, let alone fix it?
What about the impact?
In many cases the Mean Time to Discovery (MTTD) takes up two-thirds of the MTTR.
In that time, how many other users or applications may be impacted?
It might not break immediately, but there will be a point in time when your applications will break.
It can be a broken dependency, an infrastructure failure, or a database slowdown severely impacting your service – either way, your application will break.
Murphy's law: whatever can go wrong, will go wrong!
A self-healing robot fixes itself when it experiences trouble.
This could mean freeing up additional resources, restarting things that are not doing well, or rolling back to a state where everything worked perfectly…
Monitoring:
End-to-end means that you have to track the complete path of your requests, so you are not looking at black boxes.
Full-stack means it has to cover your complete application stack, from frontend to backend technologies.
Automation:
Means being able to execute what you would do manually in case of outages.
What we see a lot in customer environments is that the actual root cause of the problem is buried somewhere other than where you would expect it at first sight.
For example, if your services experience a slowdown, the actual problem might be the network or the underlying database of a different service that the one you are looking at depends on.
Let's take a look…
What measures are needed for enabling remediation?
As a prerequisite, we have to make sure we monitor our applications, simply because we need to know what's going on in our application and our environment. We define thresholds that should not be breached. We then look at the dashboards, and once the thresholds are breached, we analyze the problem and hand it over to someone else.
This could be either a human operator or even an automation platform.
We can, for example, employ XXX and trigger a previously defined job that executes a playbook. Basically, it's a sequence of instructions to automate tasks, which can include restarting processes, scaling up the environment, …
We at Dynatrace have automated this process, since the traditional way still means a lot of manual monitoring and looking at dashboards.
We achieve this by using our own monitoring tool and integrating it with 3rd-party vendors.
Also, Dynatrace provides full-stack monitoring to detect issues in every layer of your environment. Automatic baselining further allows anomalies to be detected automatically, without the need to manually define thresholds, since these might differ substantially between applications. Our AI-based root-cause analysis finally detects the real root cause of the problem and sends exactly this notification. Now a 3rd-party tool such as Ansible Tower can take over.
As an example, let's take a look at a simple delivery pipeline.
When deploying a new version, we make sure to carefully test our new build.
However, despite thorough tests in staging, and maybe even in production, errors might occur.
Although the pipeline was built to fail early, this is not always possible.
So it might happen that the error is only discovered in production. If the error occurs on a Saturday night, it might not be possible to inspect it immediately and schedule counter-actions. With auto-remediation in place, we can, for example, automatically roll back to the previous stable version to save the weekend – see the sketch below.
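What such an automatic rollback can look like in practice: a remediation action that simply re-runs the deployment with the last known-good version. A minimal sketch that shells out to an Ansible playbook; the playbook name and variables are illustrative:

// Sketch of an automatic rollback action: re-deploy the last known-good
// version by running a (hypothetical) rollback playbook.
const { execFile } = require("child_process");

function rollbackToPreviousVersion(service, lastGoodVersion, callback) {
  execFile(
    "ansible-playbook",
    [
      "rollback.yml", // illustrative playbook name
      "--extra-vars", JSON.stringify({ service: service, version: lastGoodVersion })
    ],
    function (err, stdout, stderr) {
      if (err) {
        return callback(err); // rollback itself failed – escalate to a human
      }
      callback(null, "rolled back " + service + " to " + lastGoodVersion);
    }
  );
}

// e.g. rollbackToPreviousVersion("frontend", "1.4.2", console.log);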
Do you see the problem in the picture for automation?
As we can see, being able to automate lies at the core of enabling auto-remediation or self-healing.
First, you need to have runbooks or scripts that can kick in every time they are needed.
Next, you can connect your tools of choice to these scripts to enable auto-remediation. However, you still have to have dedicated runbooks for each scenario in place and have to connect the right problems to the right counter-actions.
Finally, with self-healing, we can leverage the power of AI and big data to fully understand the root causes of problems and automatically determine executable remediation steps.