2. The journey
Why self-healing applications?
What is needed for self-healing applications?
Auto-remediation as part of a CI/CD pipeline
Build your own auto-remediation
3. On average, a single transaction uses 82 different types of technology
Browser
Multi-geo
Mobile Network
Code
Hosts
Logs
IoT
3rd parties
Services
Cloud SDN
Containers
Applications are getting more complex!
4. Consequences of complexity
Problem
• Not reproducible in Test and cannot be troubleshot with current tooling
• After months of investigation, with customers being impacted, the root cause of the issue cannot be found
Impact
• The issue causes severe slowdowns and timeouts for users, eventually requiring a manual failover to the DR site
• Current alerting misleads the operations team on their investigation path
Consequences
• Poor customer experience drives poor conversion rates
• Recurring issue for months, happening more frequently
• 479 hours lost in war rooms up to today; 6 teams and one 3rd party were involved
• Has cost £23,950 ($32,494) so far
• Brand reputation impacted by bad tweets
9. What is needed for self-healing applications?
Monitoring: know what's going on in your applications
• End-to-end
• Full-stack – fully integrated in production (or even in staging)
Automation/Execution: perform mitigation/remediation actions
• Access to all systems
• Automation system should be isolated from the production system
• APIs
10. Know what's going on in your applications
Monitor your applications – identify the root cause of the problem!
15. Steps to mitigate the bad deployment
1. Fetch information about the event
2. Process the data
3. Select the corresponding remediation action
4. Execute the remediation action
Keep track of all automation steps.
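These four steps map directly onto a small event handler. Below is a minimal sketch in Node.js; the monitoring client, the event shape, and the remediation table are illustrative assumptions, not any specific product's API.

// Minimal sketch of the four-step remediation flow.
// All names (monitoring, REMEDIATIONS, event shape) are illustrative.
const monitoring = {
  // stub – in reality this would call your monitoring tool's REST API
  getProblemDetails: async function (pid) {
    return { rankedEvents: [{ eventType: "BAD_DEPLOYMENT" }] };
  }
};

// hypothetical remediation actions – replace with real runbooks/scripts
async function rollbackToPreviousVersion(details) { /* ... */ }
async function restartProcess(details) { /* ... */ }

// map a detected root-cause type to a remediation action
const REMEDIATIONS = {
  BAD_DEPLOYMENT: rollbackToPreviousVersion,
  HIGH_CPU: restartProcess
};

async function handleProblemEvent(event) {
  const log = [];

  // 1. Fetch information about the event
  const details = await monitoring.getProblemDetails(event.pid);
  log.push("fetched details for " + event.pid);

  // 2. Process the data (pick the most likely root cause)
  const rootCause = details.rankedEvents[0];
  log.push("root cause: " + rootCause.eventType);

  // 3. Select the corresponding remediation action
  const action = REMEDIATIONS[rootCause.eventType];
  if (!action) throw new Error("no remediation defined for " + rootCause.eventType);

  // 4. Execute the remediation action
  await action(details);
  log.push("executed " + action.name);

  // keep track of all automation steps
  return log;
}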
16. Auto-remediation with Ansible (Tower)
• APIs are key to enabling automation
• Ansible Tower makes extensive use of APIs internally and also exposes them externally
• Ansible playbooks are scripts that are executed from a central host against different machines
• Multiple OSes are supported
• Idempotent
• Playbooks can be orchestrated in workflows and job templates
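Because Tower exposes its functionality through a REST API, a monitoring tool can trigger a remediation playbook with a single HTTP call. A minimal sketch, assuming you have a Tower host, an OAuth token, and a preconfigured job template; the host, token variable, and template ID below are placeholders:

// Launch a preconfigured Ansible Tower job template via its REST API.
// TOWER_HOST, TOWER_TOKEN and JOB_TEMPLATE_ID are placeholders.
// Uses the global fetch available in Node 18+.
const TOWER_HOST = "https://tower.example.com";
const JOB_TEMPLATE_ID = 42; // hypothetical job template

async function launchRemediationJob(extraVars) {
  const resp = await fetch(
    TOWER_HOST + "/api/v2/job_templates/" + JOB_TEMPLATE_ID + "/launch/",
    {
      method: "POST",
      headers: {
        "Authorization": "Bearer " + process.env.TOWER_TOKEN,
        "Content-Type": "application/json"
      },
      // extra_vars tells the playbook what to act on
      body: JSON.stringify({ extra_vars: extraVars })
    }
  );
  if (!resp.ok) throw new Error("Tower launch failed: " + resp.status);
  return resp.json(); // contains the new job's ID for status tracking
}

// e.g. launchRemediationJob({ service: "checkout", action: "restart" });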
19. Auto-remediation with serverless approaches
• No need for separate installation/maintenance of a system
• Pay-as-you-go (most often free)
• Support for a variety of languages
• No built-in support for automation tasks
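On AWS Lambda, for example, the glue code is just a handler that receives the monitoring tool's problem notification (e.g. a webhook delivered via API Gateway) and hands it to the remediation logic, like the fragment on the next slide. A minimal skeleton with illustrative names:

// Minimal AWS Lambda skeleton for a problem-notification webhook.
// The event shape and remediateProblem are illustrative – remediateProblem
// stands for remediation logic like the fragment on the next slide.
function remediateProblem(problem, cb) { /* see next slide */ cb(null, "ok"); }

exports.handler = function (event, context, callback) {
  const myProblem = JSON.parse(event.body); // webhook payload from the monitoring tool

  remediateProblem(myProblem, function (err, remediationLog) {
    if (err) {
      return callback(null, { statusCode: 500, body: JSON.stringify(err) });
    }
    callback(null, { statusCode: 200, body: remediationLog });
  });
};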
20.
// remediation
// Note: myEvent is the incoming notification and myProblem the parsed
// problem object; both come from the surrounding handler (not shown).
dtUtils.getProblemDetails(myProblem.pid, function (err, resp) {
  if (err || !resp.ok) {
    console.error("error getProblemDetails for pid " + myEvent.pid + ": " + JSON.stringify(err));
    return callback(err);
  }
  var myRankedEvents = resp.body.result.rankedEvents;
  console.info("rankedEvents: " + JSON.stringify(myRankedEvents));
  if (myRankedEvents != null) {
    var myRootCause = getRootCause(myRankedEvents);
    if (myRootCause != undefined) {
      // root cause found
      console.info("root cause for PID " + myEvent.pid + ": " + JSON.stringify(myRootCause.eventType));
      triggerRemediationAction(myProblem, myRootCause, function (err, res, remediationAction) {
        if (err) {
          // remediation failed: leave a comment on the problem and bail out
          console.error("error for remediation of " + myEvent.pid + " (" + myRootCause.eventType + "): " + JSON.stringify(err));
          addComment(myEvent.pid, "error when performing remediation " + JSON.stringify(err), function (err, res) {
            if (err) {
              return callback(err);
            }
          });
          return callback(err);
        }
        var remediationLog = "Auto-remediation: " + remediationAction.title + " executed:\n" +
          remediationAction.description;
        // (the fragment ends here on the slide; presumably the log is pushed
        // back to the problem as a comment and the callback is invoked)
      });
    }
  }
});
21. Comparison
Automation platforms
• Runbook/playbook automation built in
• Step-by-step instructions (YAML)
• Specialized for deployment, provisioning, configuration management
• Maintenance of the platform needed
Serverless
• Different vendors
• Different languages (JS, Java, Python, …)
• Not limited to runbooks
• No support for typical runbook tasks
24. Embed auto-remediation in your CI/CD pipeline
• Shift-left: break the pipeline earlier
• Path to NoOps: self-healing, …
• Shift-right: tags, deploys, events
• Actionable feedback loops
25. Injecting speed & quality: an automatic gate at test & performance
• Continuous performance validation for daily builds
• Root-cause details automatically pushed to JIRA
• Decisions made to compare, break, or good-to-go
Shift-left: engage Dev with earlier & automated feedback
27. https://github.com/Dynatrace/AWSDevOpsTutorial
Build validation gate:
• pushDynatraceDeploymentEvent – pushes deployment info to Dynatrace entities
• validateBuildDynatraceWorker – compares builds and approves/rejects the pipeline
Production gate:
• pushDynatraceDeploymentEvent – pushes deployment info to Dynatrace entities
• validateBuildDynatraceWorker – validates production and approves/rejects the pipeline
• handleDynatraceProblemNotification – executes auto-remediating actions, e.g. rollback
[Pipeline diagram: Builds 6 and 7 pass through both gates; each gate either auto-approves or auto-rejects the build.]
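The approve/reject decision comes down to comparing a few key metrics of the new build against the previous one. A minimal sketch of such a gate; getBuildMetrics, the metric names, and the threshold are illustrative, not the tutorial's actual implementation:

// Sketch of a build-validation gate: compare the new build's metrics
// against the previous build and approve or reject the pipeline.
const DEGRADATION_LIMIT = 1.2; // reject if a metric got >20% worse (illustrative)

async function getBuildMetrics(build) {
  // stub – in reality query your monitoring tool's metrics API for this build
  return { responseTime: 120, failureRate: 0.5 };
}

async function validateBuild(prevBuild, newBuild) {
  const prev = await getBuildMetrics(prevBuild);
  const next = await getBuildMetrics(newBuild);

  for (const metric of ["responseTime", "failureRate"]) {
    if (next[metric] > prev[metric] * DEGRADATION_LIMIT) {
      return { approved: false, reason: metric + " degraded: " + prev[metric] + " -> " + next[metric] };
    }
  }
  return { approved: true };
}

// The pipeline worker would then report the result to its CI system's
// approve/reject API (e.g. AWS CodePipeline's PutApprovalResult).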
28. How to start?
1. Monitor your environment
2. Define your runbooks
3. Start small and with low-hanging fruit
   What are frequent issues? Of these, which ones are easy to deal with?
4. Build more and more automation along the way
Cultural change!
30. AI to the rescue
• Predefined actions to execute: runbooks, shell scripts, batch files, …
• Automated calling of scripts: Ansible Tower, workflows, …
• Automated selection or generation of solutions: AI, big data, …
That's not going to be easy – container and cloud platforms allow for faster deployments and independent release cycles WHILE increasing operational complexity.
Monolith to microservices: in-memory calls become network calls, and Istio brings more hops and more technologies – overall we see 82 on average!
Applications are incredibly complex.
How does it work end-to-end? Nobody knows all the parts...
Real customer problem in a complex cloud environment.
The problem is not only the money spent on this, but also the time lost and the damage to brand reputation.
Does your enterprise look like this today?
Bob has many layers to look through for problems.
Mean Time to Recovery (MTTR) for application problems could take 72 hours or more.
Can Bob find the problem quickly, let alone fix it?
What about the impact?
In many cases the Mean Time to Discovery (MTTD) takes up two-thirds of the MTTR.
In that time, how many other users or applications may be impacted?
It might not break immediately, but there will be a point in time when your applications will break.
It can be a broken dependency, an infrastructure failure, or a database slowdown severely impacting your service – either way, your application will break.
Murphy's law: whatever can go wrong, will go wrong!
A self-healing robot fixes itself when it experiences trouble.
This could mean freeing up additional resources, restarting things that are not doing well, or rolling back to a state where everything worked perfectly…
Monitoring:
End-to-end means that you have to track the complete path of your requests, so you are not looking at black boxes.
Full-stack means it has to cover your complete application stack, from frontend to backend technologies.
Automation:
Means being able to execute what you would do manually in case of outages.
What we see a lot in customer environments is that the actual root cause of the problem is buried somewhere other than where you would expect it at first sight.
For example, if your services experience a slowdown, the actual problem might be the network or the underlying database of a different service that the one you are looking at depends on.
Let's take a look…
What measures are needed for enabling remediation?
As a prerequisite, we have to make sure we monitor our applications, simply because we need to know what's going on in our application and our environment. We define thresholds that should not be breached. We then look at the dashboards, and once the thresholds are breached, we analyze the problem and hand it over to someone else.
This could be either a human operator or even an automation platform.
We can, for example, employ XXX and trigger a previously defined job that executes a playbook. Basically, it's a sequence of instructions to automate tasks, which can include restarting processes, scaling up the environment, …
We at Dynatrace have automated this process, since the traditional way still means a lot of manual monitoring and looking at dashboards.
We achieve this by using our own monitoring tool and integrating it with 3rd-party vendors.
Also, Dynatrace provides full-stack monitoring to detect issues in every layer of your environment. Automatic baselining further allows anomalies to be detected automatically, without the need to manually define thresholds, since these might differ substantially between applications. Our AI-based root-cause analysis finally detects the real root cause of the problem and sends exactly this notification. Now a 3rd-party tool such as Ansible Tower can take over.
As an example, let's take a look at a simple delivery pipeline.
When deploying a new version, we make sure to carefully test our new build.
However, despite thorough tests in staging, and maybe even in production, errors might occur.
Although the pipeline was built to fail early, this is not always possible.
So it might happen that the error is only discovered in production. If the error occurs on a Saturday night, it might not be possible to inspect it immediately and schedule counter-actions. With auto-remediation in place, we can, for example, automatically roll back to the previous stable version to save the weekend – see the sketch below.
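What such an automatic rollback can look like in practice: a remediation action that simply re-runs the deployment with the last known-good version. A minimal sketch that shells out to an Ansible playbook; the playbook name and variables are illustrative:

// Sketch of an automatic rollback action: re-deploy the last known-good
// version by running a (hypothetical) rollback playbook.
const { execFile } = require("child_process");

function rollbackToPreviousVersion(service, lastGoodVersion, callback) {
  execFile(
    "ansible-playbook",
    [
      "rollback.yml", // illustrative playbook name
      "--extra-vars", JSON.stringify({ service: service, version: lastGoodVersion })
    ],
    function (err, stdout, stderr) {
      if (err) {
        return callback(err); // rollback itself failed – escalate to a human
      }
      callback(null, "rolled back " + service + " to " + lastGoodVersion);
    }
  );
}

// e.g. rollbackToPreviousVersion("frontend", "1.4.2", console.log);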
Do you see the problem in the picture for automation?
As we can see, being able to automate lies at the core of enabling auto-remediation or self-healing.
First, you need to have runbooks or scripts that can kick in every time they are needed.
Next, you can connect your tools of choice to these scripts to enable auto-remediation. However, you still have to have dedicated runbooks for each scenario in place and have to connect the right problems to the right counter-actions.
Finally, with self-healing, we can leverage the power of AI and big data to fully understand the root causes of problems and automatically determine executable remediation steps.