Jürgen Etzlstorfer
@jetzlstorfer
Technology Strategist
A framework for self-healing applications –
the path to enable auto-remediation
Developer Week Nürnberg, 27th June 2018
confidential
The journey
 Why self-healing applications?
 What is needed for self-healing applications
 Auto-remediation as part of a CI/CD pipeline
 Build your own auto-remediation
On average, a single transaction uses 82 different types of technology
Browser
Multi-geo
Mobile Network
Code
Hosts
Logs
IoT
3rd parties
Services
Cloud SDN
Containers
Applications are getting more complex!
Problem
• Not repeatable in test and cannot be troubleshot with current tooling
• After months of investigation and customers being impacted, the root cause of the issue cannot be found
Impact
• Issue causes severe slowdowns and timeouts for users, eventually requiring a manual failover to the DR site
• Operations team misled by current alerting during their investigation
Consequences
• Poor customer experience drives poor conversion rates
Recurring issue for months
479 hours lost in war room up to today
6 teams and one 3rd party were involved
Happening more frequently
Has cost so far £23,950
Brand reputation impacted by bad tweets
$32,494
Consequences of complexity
MTTD
MTTD vs MTTR
If you write applications,
they will break eventually
~ Murphy’s law
What if you had
something similar to
a self-healing robot?
What is needed for self-healing applications?
 Monitoring: know what’s going on in your applications
   End-to-end
   Full-stack – fully integrated in production (or even in staging)
 Automation/Execution: perform mitigation/remediation actions
   Access to all systems
   Automation system should be isolated from production system
APIs
Know what’s going on in your applications
 Monitor your applications
 Identify the root cause of the problem!
How to enable remediation
Applications are monitored → Thresholds are breached → Problem is analyzed → Problem notification is sent → Event is received → Job is triggered → Playbook is executed → Problem is remediated
Phases: Monitoring → Mitigation
How to automate?
 Automation engines
 Ansible (Tower), Stackstorm, …
 Serverless approaches
 AWS Lambda, Azure Functions, …
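As a rough illustration of the serverless option, the sketch below shows a callback-style function handler in Node.js that receives a problem notification and decides whether to start a job. The payload fields (`state`, `pid`, `impactedEntities`) and the `triggerJob` helper are assumptions made for this example, not a documented notification format; how the function is exported and deployed depends on the chosen platform.

```javascript
// Hypothetical stand-in for the call that starts a remediation job,
// e.g. an automation engine API request or another function invocation.
function triggerJob(problemId, entities, callback) {
  callback(null, { started: true, problemId: problemId, entityCount: entities.length });
}

// Callback-style function handler: (event, context, callback).
// Assumed payload: { state: "OPEN" | "RESOLVED", pid: "...", impactedEntities: [...] }
function handler(event, context, callback) {
  var notification;
  try {
    // HTTP-triggered functions often deliver the payload as a JSON string in event.body.
    notification = typeof event.body === "string" ? JSON.parse(event.body) : event;
  } catch (parseError) {
    return callback(null, { statusCode: 400, body: JSON.stringify({ error: "invalid JSON" }) });
  }

  // Only newly opened problems need remediation; ignore everything else.
  if (!notification || notification.state !== "OPEN") {
    return callback(null, { statusCode: 200, body: JSON.stringify({ action: "ignored" }) });
  }

  triggerJob(notification.pid, notification.impactedEntities || [], function (err, job) {
    if (err) {
      return callback(err);
    }
    callback(null, { statusCode: 200, body: JSON.stringify({ action: "triggered", job: job }) });
  });
}
```

Keeping the decision logic in a plain function like this makes it easy to test locally before wiring it to a real notification endpoint.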
How to enable auto-remediation
Full-stack environment is monitored → Anomalies are detected automatically → Root cause analysis is performed → Problem notification is sent → Event is received → Job is triggered → Playbook is executed → Problem is remediated
Scenario: How to mitigate a bad deployment?
Version 123: Staging → Approve Staging → Production → Approve Production → Up and running
Version 124: Staging → Approve Staging → Production → Approve Production → Remediation → Rollback
Steps to mitigate the bad deployment
1. Fetch information about the event
2. Process the data
3. Select the corresponding remediation action
4. Execute the remediation action
Keep track of all automation steps
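The steps above can be sketched as a small dispatcher. This is a minimal illustration only: the event types, the event fields (`id`, `type`, `service`, `previousVersion`, `host`) and the action catalogue are hypothetical, and the "actions" merely return a description instead of touching a real system.

```javascript
// Hypothetical catalogue: event type -> remediation action.
var REMEDIATION_ACTIONS = {
  DEPLOYMENT_REGRESSION: function (event) {
    return "rollback " + event.service + " to version " + event.previousVersion;
  },
  LOW_DISK_SPACE: function (event) {
    return "clean up temporary files on " + event.host;
  }
};

function remediate(rawEvent) {
  // Every step is appended here so the whole run can be audited later.
  var auditLog = [];

  // 1. Fetch information about the event.
  auditLog.push("received event " + rawEvent.id);

  // 2. Process the data: normalize the type so lookups are predictable.
  var eventType = String(rawEvent.type || "").toUpperCase();
  auditLog.push("processed event type " + eventType);

  // 3. Select the corresponding remediation action.
  if (!Object.prototype.hasOwnProperty.call(REMEDIATION_ACTIONS, eventType)) {
    auditLog.push("no remediation action found");
    return { executed: false, auditLog: auditLog };
  }

  // 4. Execute the remediation action (here it only returns a description).
  var result = REMEDIATION_ACTIONS[eventType](rawEvent);
  auditLog.push("executed: " + result);
  return { executed: true, result: result, auditLog: auditLog };
}
```

Returning the audit log with every outcome, including the case where no action matches, is one way to keep track of all automation steps.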
Auto-remediation with Ansible (Tower)
 APIs are key to enable automation
 Ansible Tower makes extensive use of APIs internally and also exposes them externally
 Ansible playbooks are scripts that are executed from a central host on different machines
 Multiple operating systems are supported
 Idempotent
 Playbooks can be orchestrated in workflows and job templates
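To make the "APIs are key" point concrete, the following sketch builds the HTTP request that would launch a job template through the Tower REST API (`/api/v2/job_templates/<id>/launch/`). The base URL, the template ID and the `extra_vars` content are placeholders, authentication is left out, and the request is only constructed, not sent.

```javascript
// Builds (but does not send) a request that launches a Tower job template.
// towerBaseUrl, jobTemplateId and extraVars are caller-supplied placeholders.
function buildLaunchRequest(towerBaseUrl, jobTemplateId, extraVars) {
  // Strip trailing slashes so the path is joined with exactly one "/".
  var base = towerBaseUrl.replace(/\/+$/, "");
  return {
    method: "POST",
    url: base + "/api/v2/job_templates/" + encodeURIComponent(jobTemplateId) + "/launch/",
    headers: { "Content-Type": "application/json" },
    // extra_vars are handed to the playbook as variables for this run.
    body: JSON.stringify({ extra_vars: extraVars })
  };
}

// Example: pass the problem ID and impacted entities to a remediation playbook.
var launchRequest = buildLaunchRequest("https://tower.example.com/", 42, {
  pid: "P-1",
  impactedEntities: ["SERVICE-1"]
});
```

The resulting object can be handed to whichever HTTP client the monitoring or notification side uses, together with the credentials that the Tower installation requires.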
---
- name: rollback to previous version
  hosts: localhost
  vars:
    # ...
  tasks:
    # Record on the problem that remediation has started
    - name: push comment to dynatrace
      uri:
        url: "{{ dtcommentapiurl }}"
        method: POST
        body_format: json
        body:
          comment: "Remediation playbook started."
          user: "{{ commentuser }}"
          context: "Ansible Tower"

    # Look up the deployment events attached to each impacted entity
    - name: fetch custom deployment events
      uri:
        url: "{{ dtdeploymentapiurl }}"
        return_content: yes
      with_items: "{{ impactedEntities }}"
      register: customproperties
      ignore_errors: no

    - name: parse deployment events
      set_fact:
        deployment_events: "{{ item.json.events }}"
      with_items: "{{ customproperties.results }}"
      register: app_result

    # Invoke the remediation action and keep the result for reporting
    - name: call remediation action
      uri:
        url: "{{ myItem.remediationAction }}"
        method: POST
        body_format: json
        body: "{{ payload | to_json }}"
        return_content: yes
      ignore_errors: yes
      register: result

    - name: push success comment to dynatrace
      uri:
        url: "{{ dtcommentapiurl }}"
        method: POST
        body_format: json
        body:
          comment: "Invoked remediation action successfully executed: {{ result.content }}"
          user: "{{ commentuser }}"
          context: "Ansible Tower"
      when: result.status == 200

    - name: push error comment to dynatrace
      uri:
        # ...
        body:
          comment: "Invoked remediation action failed: {{ result.content }}"
          user: "{{ commentuser }}"
          context: "Ansible Tower"
      when: result.status != 200
Auto-remediation with Serverless approaches
 No need for separate installation / maintenance of a system
 Pay-as-you-go (most often for free)
 Support for a variety of languages
 No built-in support for automation tasks
// remediation
dtUtils.getProblemDetails(myProblem.pid, function (err, resp) {
  if (err || !resp.ok) {
    console.error("error getProblemDetails for pid " + myEvent.pid + ": " + JSON.stringify(err));
    return callback(err);
  }
  var myRankedEvents = resp.body.result.rankedEvents;
  console.info("rankedEvents: " + JSON.stringify(myRankedEvents));
  if (myRankedEvents != null) {
    var myRootCause = getRootCause(myRankedEvents);
    if (myRootCause != undefined) {
      // root cause found
      console.info("root cause for PID " + myEvent.pid + ": " + JSON.stringify(myRootCause.eventType));
      triggerRemediationAction(myProblem, myRootCause, function (err, res, remediationAction) {
        if (err) {
          console.error("error for remediation of " + myEvent.pid + " (" + myRootCause.eventType + "): " +
            JSON.stringify(err));
          // Report the failure on the problem. A failing comment is only logged,
          // so the outer callback is invoked exactly once (below).
          addComment(myEvent.pid, "error when performing remediation " + JSON.stringify(err),
            function (commentErr) {
              if (commentErr) {
                console.error("error adding comment for " + myEvent.pid + ": " + JSON.stringify(commentErr));
              }
            });
          return callback(err);
        }
        var remediationLog = "Auto-remediation: " + remediationAction.title + " executed:\n " +
          remediationAction.description;
        // ...
      });
    }
  }
});
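The slide calls `getRootCause` without showing it. One possible shape, assuming each ranked event carries a boolean `isRootCause` flag (an assumption about the payload, not something shown in the slides), is sketched below; the speaker's actual helper may differ.

```javascript
// Hypothetical helper: returns the first ranked event flagged as root cause,
// or undefined when none is flagged or the input is not an array.
function getRootCause(rankedEvents) {
  if (!Array.isArray(rankedEvents)) {
    return undefined;
  }
  for (var i = 0; i < rankedEvents.length; i++) {
    var rankedEvent = rankedEvents[i];
    if (rankedEvent && rankedEvent.isRootCause === true) {
      return rankedEvent;
    }
  }
  return undefined;
}
```

Returning `undefined` when nothing matches fits the `myRootCause != undefined` check in the slide code, so remediation is skipped when no root cause is known.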
Comparison
 Automation Platforms
 Runbook/Playbook automation built-in
 Step-by-step instructions (yaml)
 Specialized for deployment, provisioning,
configuration management
 Maintenance of platform needed
 Serverless
 Different vendors
 Different languages (js, java, python, …)
 Not limited to runbooks
 No support for typical runbook tasks
Auto-remediation is a safety net
It does not fix your problem
https://blogs.msdn.microsoft.com/visualstudioalmrangers/2017/04/17/set-up-a-cicd-pipeline-for-your-team-services-extension/
Embed auto-remediation in your CI/CD pipeline
Shift-Left: Break Pipeline Earlier
Path to NoOps: Self-Healing, …
Shift-Right: Tags, Deploys, Events
Actionable Feedback Loops
Injecting speed &
quality: automatic gate
at test & performance
• Continuous Performance Validation for daily builds
• Root Cause details automatically pushed to JIRA
• Decisions made to compare, break, or good-to-go
Shift-left: engage Dev with earlier & automated feedback
Shift-right: empower Ops with more context to react faster
https://github.com/Dynatrace/AWSDevOpsTutorial
pushDynatraceDeploymentEvent: pushes deployment info to Dynatrace entities
validateBuildDynatraceWorker: compares builds and approves/rejects the pipeline
pushDynatraceDeploymentEvent: pushes deployment info to Dynatrace entities
validateBuildDynatraceWorker: validates production and approves/rejects the pipeline
handleDynatraceProblemNotification: executes auto-remediating actions, e.g. rollback
Build 6, Build 7 → Production (each gate: Auto-Approve! or Auto-Reject!)
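The approve/reject gate can be illustrated with a small comparison function. This is a simplified sketch and not the logic of `validateBuildDynatraceWorker` from the tutorial: the metric names, the "lower is better" assumption and the regression threshold are all hypothetical.

```javascript
// Compares a candidate build against a baseline build.
// All metrics are assumed to be "lower is better" (response time, failure rate, ...).
// A metric violates the gate if it regresses by more than maxRegressionPercent.
function validateBuild(baseline, candidate, maxRegressionPercent) {
  var violations = [];
  Object.keys(baseline).forEach(function (metric) {
    var before = baseline[metric];
    var after = candidate[metric];
    if (typeof after !== "number") {
      // A missing measurement is treated as a failure rather than silently passing.
      violations.push(metric);
      return;
    }
    var regressionPercent;
    if (before === 0) {
      // Avoid dividing by zero: any increase from zero counts as unbounded regression.
      regressionPercent = after > 0 ? Infinity : 0;
    } else {
      regressionPercent = ((after - before) / before) * 100;
    }
    if (regressionPercent > maxRegressionPercent) {
      violations.push(metric);
    }
  });
  return { approved: violations.length === 0, violations: violations };
}

// Example with made-up numbers: build 7 is clearly slower than build 6.
var build6 = { responseTimeMs: 200, failureRatePercent: 1 };
var build7 = { responseTimeMs: 300, failureRatePercent: 1 };
var verdict = validateBuild(build6, build7, 10);
```

A pipeline step could approve the stage when `verdict.approved` is true and reject it otherwise, listing `verdict.violations` as the reason.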
How to start?
1. Monitor your environment
2. Define your runbooks
3. Start small, with low-hanging fruit
 What are frequent issues?
 Of these, which ones are easy to deal with?
4. Build more and more automation along the way
Cultural Change!
AI to the rescue
 Automated selection or generation of solution: AI, big data, …
 Automated calling of scripts: Ansible Tower, Workflows, …
 Predefined actions to execute: Runbooks, shell scripts, batch files, …
www.dynatrace.com
Jürgen Etzlstorfer
Technology Strategist
juergen.etzlstorfer@dynatrace.com
@jetzlstorfer
Thank you!

More Related Content

What's hot

Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...
Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...
Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...Amazon Web Services
 
FinOps - AWS Cost and Operational Efficiency - Pop-up Loft Tel Aviv
FinOps - AWS Cost and Operational Efficiency - Pop-up Loft Tel AvivFinOps - AWS Cost and Operational Efficiency - Pop-up Loft Tel Aviv
FinOps - AWS Cost and Operational Efficiency - Pop-up Loft Tel AvivAmazon Web Services
 
AWS Networking – Advanced Concepts and new capabilities | AWS Summit Tel Aviv...
AWS Networking – Advanced Concepts and new capabilities | AWS Summit Tel Aviv...AWS Networking – Advanced Concepts and new capabilities | AWS Summit Tel Aviv...
AWS Networking – Advanced Concepts and new capabilities | AWS Summit Tel Aviv...Amazon Web Services
 
Infrastructure is code with the AWS CDK - MAD312 - New York AWS Summit
Infrastructure is code with the AWS CDK - MAD312 - New York AWS SummitInfrastructure is code with the AWS CDK - MAD312 - New York AWS Summit
Infrastructure is code with the AWS CDK - MAD312 - New York AWS SummitAmazon Web Services
 
Monitor All Your Things: Amazon CloudWatch in Action with BBC (DEV302) - AWS ...
Monitor All Your Things: Amazon CloudWatch in Action with BBC (DEV302) - AWS ...Monitor All Your Things: Amazon CloudWatch in Action with BBC (DEV302) - AWS ...
Monitor All Your Things: Amazon CloudWatch in Action with BBC (DEV302) - AWS ...Amazon Web Services
 
AWS Well-Architected Framework: Operational Excellence Pillar
AWS Well-Architected Framework: Operational Excellence PillarAWS Well-Architected Framework: Operational Excellence Pillar
AWS Well-Architected Framework: Operational Excellence PillarJonathan LaCour
 
Accelerating Your Cloud Migration Journey with MAP
Accelerating Your Cloud Migration Journey with MAPAccelerating Your Cloud Migration Journey with MAP
Accelerating Your Cloud Migration Journey with MAPAmazon Web Services
 
Building a well-engaged and secure AWS account access management - FND207-R ...
 Building a well-engaged and secure AWS account access management - FND207-R ... Building a well-engaged and secure AWS account access management - FND207-R ...
Building a well-engaged and secure AWS account access management - FND207-R ...Amazon Web Services
 
Reduce Costs and Build a Strong Operational Foundation with the AWS Migration...
Reduce Costs and Build a Strong Operational Foundation with the AWS Migration...Reduce Costs and Build a Strong Operational Foundation with the AWS Migration...
Reduce Costs and Build a Strong Operational Foundation with the AWS Migration...Amazon Web Services
 
AWS Initiate Day Dublin 2019 – Cost Optimization on AWS
AWS Initiate Day Dublin 2019 – Cost Optimization on AWSAWS Initiate Day Dublin 2019 – Cost Optimization on AWS
AWS Initiate Day Dublin 2019 – Cost Optimization on AWSAmazon Web Services
 
reInvent reCap 2022
reInvent reCap 2022reInvent reCap 2022
reInvent reCap 2022CloudHesive
 
Elastic Load Balancing Deep Dive - AWS Online Tech Talk
Elastic  Load Balancing Deep Dive - AWS Online Tech TalkElastic  Load Balancing Deep Dive - AWS Online Tech Talk
Elastic Load Balancing Deep Dive - AWS Online Tech TalkAmazon Web Services
 
Executing a Large-Scale Migration to AWS
Executing a Large-Scale Migration to AWSExecuting a Large-Scale Migration to AWS
Executing a Large-Scale Migration to AWSAmazon Web Services
 

What's hot (20)

Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...
Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...
Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...
 
DevOps on AWS
DevOps on AWSDevOps on AWS
DevOps on AWS
 
DevOps on AWS
DevOps on AWSDevOps on AWS
DevOps on AWS
 
Amazon API Gateway
Amazon API GatewayAmazon API Gateway
Amazon API Gateway
 
FinOps - AWS Cost and Operational Efficiency - Pop-up Loft Tel Aviv
FinOps - AWS Cost and Operational Efficiency - Pop-up Loft Tel AvivFinOps - AWS Cost and Operational Efficiency - Pop-up Loft Tel Aviv
FinOps - AWS Cost and Operational Efficiency - Pop-up Loft Tel Aviv
 
AWS Networking – Advanced Concepts and new capabilities | AWS Summit Tel Aviv...
AWS Networking – Advanced Concepts and new capabilities | AWS Summit Tel Aviv...AWS Networking – Advanced Concepts and new capabilities | AWS Summit Tel Aviv...
AWS Networking – Advanced Concepts and new capabilities | AWS Summit Tel Aviv...
 
Infrastructure is code with the AWS CDK - MAD312 - New York AWS Summit
Infrastructure is code with the AWS CDK - MAD312 - New York AWS SummitInfrastructure is code with the AWS CDK - MAD312 - New York AWS Summit
Infrastructure is code with the AWS CDK - MAD312 - New York AWS Summit
 
Monitor All Your Things: Amazon CloudWatch in Action with BBC (DEV302) - AWS ...
Monitor All Your Things: Amazon CloudWatch in Action with BBC (DEV302) - AWS ...Monitor All Your Things: Amazon CloudWatch in Action with BBC (DEV302) - AWS ...
Monitor All Your Things: Amazon CloudWatch in Action with BBC (DEV302) - AWS ...
 
AWS Well-Architected Framework: Operational Excellence Pillar
AWS Well-Architected Framework: Operational Excellence PillarAWS Well-Architected Framework: Operational Excellence Pillar
AWS Well-Architected Framework: Operational Excellence Pillar
 
Running Kubernetes on AWS.pdf
Running Kubernetes on AWS.pdfRunning Kubernetes on AWS.pdf
Running Kubernetes on AWS.pdf
 
Accelerating Your Cloud Migration Journey with MAP
Accelerating Your Cloud Migration Journey with MAPAccelerating Your Cloud Migration Journey with MAP
Accelerating Your Cloud Migration Journey with MAP
 
Building a well-engaged and secure AWS account access management - FND207-R ...
 Building a well-engaged and secure AWS account access management - FND207-R ... Building a well-engaged and secure AWS account access management - FND207-R ...
Building a well-engaged and secure AWS account access management - FND207-R ...
 
Reduce Costs and Build a Strong Operational Foundation with the AWS Migration...
Reduce Costs and Build a Strong Operational Foundation with the AWS Migration...Reduce Costs and Build a Strong Operational Foundation with the AWS Migration...
Reduce Costs and Build a Strong Operational Foundation with the AWS Migration...
 
AWS Initiate Day Dublin 2019 – Cost Optimization on AWS
AWS Initiate Day Dublin 2019 – Cost Optimization on AWSAWS Initiate Day Dublin 2019 – Cost Optimization on AWS
AWS Initiate Day Dublin 2019 – Cost Optimization on AWS
 
reInvent reCap 2022
reInvent reCap 2022reInvent reCap 2022
reInvent reCap 2022
 
Elastic Load Balancing Deep Dive - AWS Online Tech Talk
Elastic  Load Balancing Deep Dive - AWS Online Tech TalkElastic  Load Balancing Deep Dive - AWS Online Tech Talk
Elastic Load Balancing Deep Dive - AWS Online Tech Talk
 
Migration Planning
Migration PlanningMigration Planning
Migration Planning
 
Security Architectures on AWS
Security Architectures on AWSSecurity Architectures on AWS
Security Architectures on AWS
 
Executing a Large-Scale Migration to AWS
Executing a Large-Scale Migration to AWSExecuting a Large-Scale Migration to AWS
Executing a Large-Scale Migration to AWS
 
Shift left Observability
Shift left ObservabilityShift left Observability
Shift left Observability
 

Similar to A framework for self-healing applications – the path to enable auto-remediation

How to build your own auto-remediation workflow - Ansible Meetup Munich
How to build your own auto-remediation workflow - Ansible Meetup MunichHow to build your own auto-remediation workflow - Ansible Meetup Munich
How to build your own auto-remediation workflow - Ansible Meetup MunichJürgen Etzlstorfer
 
Self-healing Applications with Ansible
Self-healing Applications with AnsibleSelf-healing Applications with Ansible
Self-healing Applications with AnsibleJürgen Etzlstorfer
 
What is going on - Application diagnostics on Azure - TechDays Finland
What is going on - Application diagnostics on Azure - TechDays FinlandWhat is going on - Application diagnostics on Azure - TechDays Finland
What is going on - Application diagnostics on Azure - TechDays FinlandMaarten Balliauw
 
Overview Of Parallel Development - Ericnel
Overview Of Parallel Development -  EricnelOverview Of Parallel Development -  Ericnel
Overview Of Parallel Development - Ericnelukdpe
 
MTR Troubleshooting
MTR TroubleshootingMTR Troubleshooting
MTR TroubleshootingGraham Walsh
 
Intro To webOS
Intro To webOSIntro To webOS
Intro To webOSfpatton
 
StackStrom: If-This-Than-That for Devops Automation
StackStrom: If-This-Than-That for Devops AutomationStackStrom: If-This-Than-That for Devops Automation
StackStrom: If-This-Than-That for Devops AutomationDmitri Zimine
 
Google Back To Front: From Gears to App Engine and Beyond
Google Back To Front: From Gears to App Engine and BeyondGoogle Back To Front: From Gears to App Engine and Beyond
Google Back To Front: From Gears to App Engine and Beyonddion
 
A miało być tak... bez wycieków
A miało być tak... bez wyciekówA miało być tak... bez wycieków
A miało być tak... bez wyciekówKonrad Kokosa
 
Kogito: cloud native business automation
Kogito: cloud native business automationKogito: cloud native business automation
Kogito: cloud native business automationMario Fusco
 
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...Puppet
 
The Next Generation Application Server – How Event Based Processing yields s...
The Next Generation  Application Server – How Event Based Processing yields s...The Next Generation  Application Server – How Event Based Processing yields s...
The Next Generation Application Server – How Event Based Processing yields s...Guy Korland
 
When Web Services Go Bad
When Web Services Go BadWhen Web Services Go Bad
When Web Services Go BadSteve Loughran
 
Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m...
Flink Forward San Francisco 2018:  David Reniz & Dahyr Vergara - "Real-time m...Flink Forward San Francisco 2018:  David Reniz & Dahyr Vergara - "Real-time m...
Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m...Flink Forward
 
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...confluent
 
Kakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming appKakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming appNeil Avery
 
Security automation simplified: an intro to DIY security automation
Security automation simplified: an intro to DIY security automationSecurity automation simplified: an intro to DIY security automation
Security automation simplified: an intro to DIY security automationMoses Schwartz
 
[Webinar] Camunda Optimize Release 3.0
[Webinar] Camunda Optimize Release 3.0[Webinar] Camunda Optimize Release 3.0
[Webinar] Camunda Optimize Release 3.0camunda services GmbH
 
Integris Security - Hacking With Glue ℠
Integris Security - Hacking With Glue ℠Integris Security - Hacking With Glue ℠
Integris Security - Hacking With Glue ℠Integris Security LLC
 
Unicenter Autosys Job Management
Unicenter Autosys Job ManagementUnicenter Autosys Job Management
Unicenter Autosys Job ManagementVenkata Duvvuri
 

Similar to A framework for self-healing applications – the path to enable auto-remediation (20)

How to build your own auto-remediation workflow - Ansible Meetup Munich
How to build your own auto-remediation workflow - Ansible Meetup MunichHow to build your own auto-remediation workflow - Ansible Meetup Munich
How to build your own auto-remediation workflow - Ansible Meetup Munich
 
Self-healing Applications with Ansible
Self-healing Applications with AnsibleSelf-healing Applications with Ansible
Self-healing Applications with Ansible
 
What is going on - Application diagnostics on Azure - TechDays Finland
What is going on - Application diagnostics on Azure - TechDays FinlandWhat is going on - Application diagnostics on Azure - TechDays Finland
What is going on - Application diagnostics on Azure - TechDays Finland
 
Overview Of Parallel Development - Ericnel
Overview Of Parallel Development -  EricnelOverview Of Parallel Development -  Ericnel
Overview Of Parallel Development - Ericnel
 
MTR Troubleshooting
MTR TroubleshootingMTR Troubleshooting
MTR Troubleshooting
 
Intro To webOS
Intro To webOSIntro To webOS
Intro To webOS
 
StackStrom: If-This-Than-That for Devops Automation
StackStrom: If-This-Than-That for Devops AutomationStackStrom: If-This-Than-That for Devops Automation
StackStrom: If-This-Than-That for Devops Automation
 
Google Back To Front: From Gears to App Engine and Beyond
Google Back To Front: From Gears to App Engine and BeyondGoogle Back To Front: From Gears to App Engine and Beyond
Google Back To Front: From Gears to App Engine and Beyond
 
A miało być tak... bez wycieków
A miało być tak... bez wyciekówA miało być tak... bez wycieków
A miało być tak... bez wycieków
 
Kogito: cloud native business automation
Kogito: cloud native business automationKogito: cloud native business automation
Kogito: cloud native business automation
 
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
 
The Next Generation Application Server – How Event Based Processing yields s...
The Next Generation  Application Server – How Event Based Processing yields s...The Next Generation  Application Server – How Event Based Processing yields s...
The Next Generation Application Server – How Event Based Processing yields s...
 
When Web Services Go Bad
When Web Services Go BadWhen Web Services Go Bad
When Web Services Go Bad
 
Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m...
Flink Forward San Francisco 2018:  David Reniz & Dahyr Vergara - "Real-time m...Flink Forward San Francisco 2018:  David Reniz & Dahyr Vergara - "Real-time m...
Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m...
 
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
 
Kakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming appKakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming app
 
Security automation simplified: an intro to DIY security automation
Security automation simplified: an intro to DIY security automationSecurity automation simplified: an intro to DIY security automation
Security automation simplified: an intro to DIY security automation
 
[Webinar] Camunda Optimize Release 3.0
[Webinar] Camunda Optimize Release 3.0[Webinar] Camunda Optimize Release 3.0
[Webinar] Camunda Optimize Release 3.0
 
Integris Security - Hacking With Glue ℠
Integris Security - Hacking With Glue ℠Integris Security - Hacking With Glue ℠
Integris Security - Hacking With Glue ℠
 
Unicenter Autosys Job Management
Unicenter Autosys Job ManagementUnicenter Autosys Job Management
Unicenter Autosys Job Management
 

Recently uploaded

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 

Recently uploaded (20)

E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
A framework for self-healing applications – the path to enable auto-remediation

  • 1. Jürgen Etzlstorfer @jetzlstorfer Technology Strategist A framework for self-healing applications – the path to enable auto-remediation Developer Week Nürnberg, 27th June 2018
  • 2. confidential The journey
      • Why self-healing applications?
      • What is needed for self-healing applications
      • Auto-remediation as part of a CI/CD pipeline
      • Build your own auto-remediation
  • 3. Applications are getting more complex! On average, a single transaction uses 82 different types of technology: Browser, Multi-geo, Mobile Network, Code, Hosts, Logs, IoT, 3rd parties, Services, Cloud SDN, Containers.
  • 4. Consequences of complexity
      • Problem
          • Not repeatable in Test and cannot be troubleshot with current tooling
          • After months of investigation and customers being impacted, the root cause of the issue cannot be found
      • Impact
          • Issue causes severe slowdowns and timeouts for the users, eventually needing a manual failover to the DR site
          • Operations team misled by current alerting on their investigation path
      • Consequences
          • Poor customer experience drives poor conversion rates
      • Recurring issue for months
      • Happening more frequently
      • 479 hours lost in war room up to today
      • 6 teams and one 3rd party were involved
      • Has cost so far £23,950
      • Brand reputation impacted by bad tweets
      • $32,494
  • 5.
  • 7. If you write applications, they will break eventually ~ Murphy’s law
  • 8. What if you had something similar to a self-healing robot?
  • 9. What is needed for self-healing applications?
      • Monitoring: know what’s going on in your applications
          • End-to-end
          • Full-stack – fully integrated in production (or even in staging)
      • Automation/Execution: perform mitigation/remediation actions
          • Access to all systems
          • Automation system should be isolated from production system
      • APIs
  • 10. Know what’s going on in your applications
      • Monitor your applications
      • Identify the root cause of the problem!
  • 11. How to enable remediation
      • Monitoring: Applications are monitored → Thresholds are breached → Problem is analyzed → Problem notification is sent
      • Mitigation: Event is received → Job is triggered → Playbook is executed → Problem is remediated
  • 12. How to automate?
      • Automation engines: Ansible (Tower), StackStorm, …
      • Serverless approaches: AWS Lambda, Azure Functions, …
  • 13. How to enable auto-remediation: Full-stack environment is monitored → Anomalies are detected automatically → Root cause analysis is performed → Problem notification is sent → Event is received → Job is triggered → Playbook is executed → Problem is remediated
  • 14. Scenario: How to mitigate a bad deployment?
      • Version 123: Staging → Approve Staging → Production → Approve Production → Up and running
      • Version 124: Staging → Approve Staging → Production → Approve Production → Remediation: Rollback
  • 15. Steps to mitigate the bad deployment
      • Fetch information about the event
      • Process the data
      • Select the corresponding remediation action
      • Execute the remediation action
      • Keep track of all automation steps
  • 16. Auto-remediation with Ansible (Tower)
      • APIs are key to enable automation
      • Ansible Tower makes extensive use of APIs internally and also exposes them externally
      • Ansible playbooks are scripts that are executed from a central host on different machines
          • Multiple operating systems are supported
          • Idempotent
      • Playbooks can be orchestrated in workflows and job templates
  • 17.
      ---
      - name: rollback to previous version
        hosts: localhost
        vars:
          ...
        tasks:
          # Tell Dynatrace that the remediation has started.
          # The body is a YAML mapping; with body_format: json it is sent as JSON.
          - name: push comment to dynatrace
            uri:
              url: "{{ dtcommentapiurl }}"
              method: POST
              body_format: json
              body:
                comment: "Remediation playbook started."
                user: "{{ commentuser }}"
                context: "Ansible Tower"

          # Query the deployment events for every impacted entity.
          - name: fetch custom deployment events
            uri:
              url: "{{ dtdeploymentapiurl }}"
              return_content: yes
            with_items: "{{ impactedEntities }}"
            register: customproperties
            ignore_errors: no

          # Extract the events from each API response.
          - name: parse deployment events
            set_fact:
              deployment_events: "{{ item.json.events }}"
            with_items: "{{ customproperties.results }}"
            register: app_result
  • 18.
          # Invoke the remediation action attached to the deployment event.
          - name: call remediation action
            uri:
              url: "{{ myItem.remediationAction }}"
              method: POST
              body_format: json
              body: "{{ payload | to_json }}"
              return_content: yes
            ignore_errors: yes
            register: result

          # Report the outcome back to Dynatrace as a problem comment.
          - name: push success comment to dynatrace
            uri:
              url: "{{ dtcommentapiurl }}"
              method: POST
              body_format: json
              body:
                comment: "Invoked remediation action successfully executed: {{ result.content }}"
                user: "{{ commentuser }}"
                context: "Ansible Tower"
            when: result.status == 200

          - name: push error comment to dynatrace
            ...
              body:
                comment: "Invoked remediation action failed: {{ result.content }}"
                user: "{{ commentuser }}"
                context: "Ansible Tower"
            when: result.status != 200
  • 19. Auto-remediation with Serverless approaches
      • No need for separate installation / maintenance of a system
      • Pay-as-you-go (most often for free)
      • Support for a variety of languages
      • No built-in support for automation tasks
  • 20.
      // remediation
      dtUtils.getProblemDetails(myProblem.pid, function (err, resp) {
        if (err || !resp.ok) {
          console.error("error getProblemDetails for pid " + myEvent.pid + ": " + JSON.stringify(err));
          return callback(err);
        }
        var myRankedEvents = resp.body.result.rankedEvents;
        console.info("rankedEvents: " + JSON.stringify(myRankedEvents));
        if (myRankedEvents != null) {
          var myRootCause = getRootCause(myRankedEvents);
          if (myRootCause != undefined) {
            // root cause found
            console.info("root cause for PID " + myEvent.pid + ": " + JSON.stringify(myRootCause.eventType));
            triggerRemediationAction(myProblem, myRootCause, function (err, res, remediationAction) {
              if (err) {
                console.error("error for remediation of " + myEvent.pid + " (" + myRootCause.eventType + "): " + JSON.stringify(err));
                // record the failure as a comment on the problem
                addComment(myEvent.pid, "error when performing remediation " + JSON.stringify(err), function (err, res) {
                  if (err) {
                    return callback(err);
                  }
                });
                return callback(err);
              }
              var remediationLog = "Auto-remediation: " + remediationAction.title + " executed:\n " + remediationAction.description;
              // ...
  • 21. Comparison
      • Automation Platforms
          • Runbook/Playbook automation built-in
          • Step-by-step instructions (YAML)
          • Specialized for deployment, provisioning, configuration management
          • Maintenance of platform needed
      • Serverless
          • Different vendors
          • Different languages (js, java, python, …)
          • Not limited to runbooks
          • No support for typical runbook tasks
  • 22. Auto-remediation is a safety net. It does not fix your problem.
  • 24. Embed auto-remediation in your CI/CD pipeline
      • Shift-Left: Break Pipeline Earlier
      • Path to NoOps: Self-Healing, …
      • Shift-Right: Tags, Deploys, Events
      • Actionable Feedback Loops
  • 25. Injecting speed & quality: automatic gate at test & performance
      • Continuous Performance Validation for daily builds
      • Root Cause details automatically pushed to JIRA
      • Decisions made to compare, break, or good-to-go
      Shift-left: engage Dev with earlier & automated feedback
  • 27. https://github.com/Dynatrace/AWSDevOpsTutorial
      • pushDynatraceDeploymentEvent: Pushes Deployment Info to Dynatrace Entities
      • validateBuildDynatraceWorker: Compares Builds and Approves/Rejects Pipeline
      • pushDynatraceDeploymentEvent: Pushes Deployment Info to Dynatrace Entities
      • validateBuildDynatraceWorker: Validates Production and Approves/Rejects Pipeline
      • handleDynatraceProblemNotification: Executes Auto-Remediating Actions, e.g. Rollback
      • Diagram labels: Build 6, Build 7, Production, Production, Auto-Approve!, Auto-Reject!, Auto-Approve!, Auto-Reject!
  • 28. How to start?
      1. Monitor your environment
      2. Define your runbooks
      3. Start small and with low-hanging fruit
          • What are frequent issues?
          • Of these, which ones are easy to deal with?
      4. Build more and more automation along the way
      Cultural Change!
  • 30. AI to the rescue
      • Automated selection or generation of solution: AI, big data, …
      • Automated calling of scripts: Ansible Tower, Workflows, …
      • Predefined actions to execute: Runbooks, Shell scripts, batch files, …
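
The steps from slide 15 (fetch information about the event, process the data, select the corresponding remediation action, execute it, and keep track of every step) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the talk: the field names `pid`, `rankedEvents` and `eventType` are borrowed from the JavaScript on slide 20, while the `isRootCause` flag, the event type names, the `ACTIONS` table and both action functions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class RemediationResult:
    action: Optional[str]          # name of the executed action, if any
    success: bool
    log: List[str] = field(default_factory=list)  # audit trail of all steps


def rollback_deployment(event: dict) -> bool:
    """Hypothetical action: a real one would call the deployment tooling."""
    return True


def restart_process(event: dict) -> bool:
    """Hypothetical action: a real one would restart the affected process."""
    return True


# Hypothetical mapping from root-cause event type to remediation action.
ACTIONS: Dict[str, Callable[[dict], bool]] = {
    "DEPLOYMENT_FAILURE": rollback_deployment,
    "PROCESS_CRASHED": restart_process,
}


def handle_problem(notification: dict) -> RemediationResult:
    log: List[str] = []

    # Fetch information about the event.
    problem_id = notification.get("pid", "unknown")
    events = notification.get("rankedEvents", [])
    log.append(f"received problem {problem_id} with {len(events)} event(s)")

    # Process the data: pick the event flagged as root cause (assumed flag).
    root_cause = next((e for e in events if e.get("isRootCause")), None)
    if root_cause is None:
        log.append("no root cause found, nothing to do")
        return RemediationResult(None, False, log)

    # Select the corresponding remediation action.
    event_type = root_cause.get("eventType")
    action = ACTIONS.get(event_type)
    if action is None:
        log.append(f"no remediation action registered for {event_type}")
        return RemediationResult(None, False, log)

    # Execute the remediation action and keep track of the outcome.
    success = action(root_cause)
    log.append(f"{action.__name__} {'succeeded' if success else 'failed'}")
    return RemediationResult(action.__name__, success, log)
```

Keeping the mapping in a plain dictionary means a new runbook can be registered without touching the handler, and the returned log plays the role of the problem comments that the playbook on slides 17 and 18 pushes back to the monitoring tool.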

Editor's Notes

  1. That’s not going to be easy: container and cloud platforms allow for faster deployments and independent release cycles while increasing operational complexity. Monolith to microservice, in-memory call to network call, Istio (more hops, more technologies): overall we see 82 technologies on average! Applications are incredibly complex. How does it all work end-to-end? Nobody knows all the parts ...
  2. Real customer problem in a complex cloud environment. The problem is not only the money spent on this, but also the time and the bad brand reputation – problem was that
  3. Does your Enterprise look like this today?
  4. Bob has many layers to look through for problems. Mean time to Recovery (MTTR) for application problems could take 72 hours or more. Can Bob find the problem quickly let alone fix it? What about the impact? In many cases the Mean Time to Discovery (MTTD) takes up two-thirds of the MTTR. In that time how many other users or applications may be impacted?
  5. It might not break immediately, but there will be a point in time when your applications will break. It can be a broken dependency, it can be an infrastructure failure, it can be a database slowdown severely impacting your service. Either way, your application will break. Murphy’s law: whatever can go wrong, will go wrong!
  6. A self-healing robot fixing itself when it experiences troubles. This could mean freeing up additional resources, restarting things that are not doing well, rolling back to a state where everything worked perfectly…
  7. Monitoring: end-to-end means that you have to track the complete path of your requests so that you are not looking at black boxes. Full-stack: monitoring has to cover your complete application stack, from frontend to backend technologies. Automation: means that you can execute automatically what you would otherwise do manually in case of outages.
  8. What we see a lot in customer environments is that the actual root cause of the problem is buried somewhere other than where you would expect at first sight. For example, if your services experience a slowdown, the actual problem might be the network, or the underlying database of a different service that the service you are looking at depends on.
  9. Let’s take a look: what measures are needed for enabling remediation? As a prerequisite, we have to make sure we monitor our applications, simply because we need to know what’s going on in our application or our environment. We define thresholds that should not be breached. We then look at the dashboards, and once the thresholds are breached we analyze the problem and send it over to someone else. This could be either a human operator or an automation platform. We can, for example, employ XXX and trigger a previously defined job that executes a playbook. Basically, a playbook is a sequence of instructions to automate tasks, which can include restarting processes, scaling up the environment, …
  10. We at Dynatrace have automated this process, since the traditional way still means a lot of manual monitoring and looking at dashboards. We achieve this by using our own monitoring tool and integrating it with 3rd party vendors. Dynatrace also provides full-stack monitoring to detect issues in any layer of your environment. Automatic baselining further allows anomalies to be detected automatically without the need to manually define thresholds, which might differ substantially between applications. Our AI-based root cause analysis finally detects the real root cause of the problem and sends exactly this notification. Now a third-party tool such as Ansible Tower can take over.
  11. As an example, let’s take a look at a simple delivery pipeline. When deploying a new version, we make sure to carefully test our new build. However, despite thorough tests in staging and maybe even in production, errors might occur. Although the pipeline was built to fail early, this is not always possible, so it might happen that the error is only discovered in production. If the error occurs on a Saturday night, it might not be possible to inspect it immediately and schedule counteractions. With auto-remediation in place we can, for example, automatically roll back to the previous stable version to save the weekend.
  12. Do you see the problem in the picture for automation?
  13. As we can see, being able to automate is at the core of enabling auto-remediation or self-healing. First, you need to have runbooks or scripts that can kick in every time they are needed. Next, you can connect your tools of choice to these scripts to enable auto-remediation. However, you still have to have dedicated runbooks in place for each scenario and have to connect the right problems to the right counteractions. Finally, with self-healing we can leverage the power of AI and big data to fully understand the root causes of problems and automatically determine executable steps for remediation.
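
Notes 9 and 10 contrast manually defined thresholds with automatic baselining. The difference can be illustrated with a small Python sketch. It is an assumption-laden toy, not the baselining used by Dynatrace or any other product: the rolling window, the mean plus `k` standard deviations rule, the function names, and the sample response times are all invented for illustration.

```python
from statistics import mean, stdev
from typing import List, Sequence


def static_threshold_breaches(values: Sequence[float], threshold: float) -> List[int]:
    """Indices where a manually chosen, fixed threshold is exceeded."""
    return [i for i, v in enumerate(values) if v > threshold]


def baseline_anomalies(values: Sequence[float], window: int = 5, k: float = 3.0) -> List[int]:
    """Indices where a value exceeds a baseline learned from recent history.

    The baseline is the mean of the previous `window` values plus `k`
    standard deviations, so no per-application threshold has to be set.
    """
    anomalies: List[int] = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        # Guard against a zero deviation on a perfectly flat history.
        limit = mean(history) + k * max(stdev(history), 1e-9)
        if values[i] > limit:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds with one obvious spike.
response_times = [100, 102, 98, 101, 99, 100, 250, 101]

# A threshold tuned for a slower application (300 ms) misses the spike,
# while the learned baseline flags it.
missed = static_threshold_breaches(response_times, threshold=300)
flagged = baseline_anomalies(response_times)
```

The point of the comparison is the one made in note 10: a fixed threshold has to be chosen per application, whereas a baseline derived from the application's own history adapts to whatever is normal for it.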