Self-Healing at Scale: How Adobe Eliminated Critical Service Outages

DOM BALDIN | SENIOR ENGINEER | ADOBE | @DOMBALDIN
Self Heal Your Way to Scale
Eliminate Critical Outages

Agenda
Critical Outages
Self-Healing
Lessons Learned
Q&A

$300K/hr
Average Cost of IT Downtime

TIME
23 minutes to refocus after an interruption

BRAND
It’s your team, and it MATTERS

Mission Critical
DC Migration
96% User Growth
Four 9 Expectation
Scrum & Reporting
Managing Jira
3.5m Issues
1.4k Projects
~400 requests/mo
Manual Server Admin
7 Outages
Basic Monitoring
Paperwork
Retrospectives
Questions
Poor RCA
Support zips only
Difficulty capturing
Manual gathering
2017

Recipe for Trouble

Site
Reliability
Engineering
NATURE SOFTWARE
https://www.reddit.com/r/pics/comments/9jbrwo/this_is_a_real_animal_btw_this_is_an_axolotl/

Self-Healing
Health Monitor
Automated service that analyzes the health of each node, healing nodes that are unresponsive.
Logging
Thread/Heap Dumps captured at exactly the right time.
SRE Focus
Jira DC + Self-Healing is a solid foundation based in Site Reliability Engineering methods.

2017
7 Outages
Poor Uptime
Poor RCA
2018
0 Outages
> 99.99% Uptime
Complete Logging
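To put the 99.99% figure in perspective, the arithmetic below converts an availability target into a yearly downtime budget. It is a small worked example, not something from the talk; the function name and the 365-day year are assumptions made for illustration.

```python
# Worked arithmetic: how much downtime per year a given availability target allows.
# Assumes a 365-day year; the function name is illustrative only.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Maximum downtime (in minutes) that still meets the availability target."""
    return period_minutes * (1 - availability_pct / 100)


four_nines = allowed_downtime_minutes(99.99)
print(f"99.99% allows about {four_nines:.2f} minutes of downtime per year")
```

At four nines the whole year's budget is under an hour, so a single unattended outage can consume it.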

DO MORE WITH LESS
Automated healing ensures your team avoids unplanned outages and can focus on customers.

RESTORE TRUST
Establishing trust in your environment is critical for adoption and scale.

Jira Data Center
Multi-node cluster + spare node (if possible)
Breakdown: JDC, Health Monitor, Triggers

Health Monitor
Checks:
1. Presence of trigger file
2. Health status URL of host
Runs on each host (same Tomcat as Jira)
• Reports: Success, Error, or Maint
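The two checks above can be sketched roughly as follows. This is a minimal illustration, not Adobe's implementation: the function names, the injectable `fetch` parameter, and the mapping of "trigger file present" to `Maint` are assumptions, and the status URL is whatever health endpoint your host exposes.

```python
import os
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


def check_host(trigger_path: str, status_url: str, fetch=http_ok) -> str:
    """Report Maint, Success, or Error for one host.

    Assumption: a present trigger file means the host is in maintenance,
    so the URL is not probed at all in that case.
    """
    if os.path.exists(trigger_path):
        return "Maint"
    return "Success" if fetch(status_url) else "Error"
```

Passing `fetch` as a parameter keeps the decision logic testable without a running Jira node.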

Trigger Files
1. Trigger for LB - Places host in/out of pool
Ex: maintenance.txt
2. “Off” switch for the Health Monitor
Ex: RESTART_IN PROGRESS.txt
3. Trigger for spare
Ex: DOWN_ALERT.txt
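Trigger files work as simple on/off flags: a file's presence is the signal. A minimal sketch of helpers for setting, clearing, and testing such flags follows. The helper names are hypothetical, the file names are the examples from the slide, and the temporary directory stands in for the shared storage location. The "Off" switch file would be handled the same way.

```python
import tempfile
from pathlib import Path

# File names taken from the slide examples; helper names are illustrative.
LB_TRIGGER = "maintenance.txt"    # load balancer: host in/out of pool
SPARE_TRIGGER = "DOWN_ALERT.txt"  # tells the spare node to take over


def set_trigger(directory: str, name: str) -> Path:
    """Create the (empty) trigger file; a no-op if it already exists."""
    path = Path(directory) / name
    path.touch(exist_ok=True)
    return path


def clear_trigger(directory: str, name: str) -> None:
    """Remove the trigger file; a no-op if it is already gone (Python 3.8+)."""
    (Path(directory) / name).unlink(missing_ok=True)


def has_trigger(directory: str, name: str) -> bool:
    """True while the trigger file is present."""
    return (Path(directory) / name).is_file()


# Usage: a temporary directory stands in for the shared storage.
with tempfile.TemporaryDirectory() as shared:
    set_trigger(shared, LB_TRIGGER)
    in_maintenance = has_trigger(shared, LB_TRIGGER)
    clear_trigger(shared, LB_TRIGGER)
    back_in_pool = not has_trigger(shared, LB_TRIGGER)
```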

Diagram: CLUSTER with HEALTH MONITOR on each node, SHARED STORAGE holding TRIGGER FILES, and SPARE node reading TRIGGER FILES

MONITORING SEQUENCE
1st Check: If “Off” switch is present, stop monitor here.
2nd Check: If spare trigger is present, place spare in service.
3rd Check: Curl Jira Health Monitor.
Outcome: 3 Outcomes

OUTCOMES
1. SUCCESS: Jira is alive!
2. NO RESPONSE + Jira NOT running: Start Jira
3. NO RESPONSE + Jira IS running: Start Self Heal
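The three checks and three outcomes can be expressed as one small decision function. The sketch below is an interpretation of the slides, not the original script: the `Action` names and the function signature are hypothetical, and it assumes the monitor continues to the health check after placing the spare in service, which the slides do not state.

```python
from enum import Enum
from typing import List


class Action(Enum):
    STOP = "stop monitor"
    ACTIVATE_SPARE = "place spare in service"
    NONE = "Jira is alive"
    START_JIRA = "start Jira"
    SELF_HEAL = "start self heal"


def monitoring_sequence(off_switch: bool, spare_trigger: bool,
                        responds: bool, jira_running: bool) -> List[Action]:
    """Return the actions one monitoring pass would take, in order."""
    # 1st check: the "Off" switch short-circuits everything else.
    if off_switch:
        return [Action.STOP]
    actions = []
    # 2nd check: the spare trigger brings the spare node into service.
    if spare_trigger:
        actions.append(Action.ACTIVATE_SPARE)
    # 3rd check: probe the health URL, then pick one of the three outcomes.
    if responds:
        actions.append(Action.NONE)
    elif jira_running:
        actions.append(Action.SELF_HEAL)
    else:
        actions.append(Action.START_JIRA)
    return actions
```

For example, `monitoring_sequence(False, False, False, True)` yields `[Action.SELF_HEAL]`: no response while the Jira process is still running.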

Self Heal
Timeline: Thread Dumps, Check GC Logs & Health Monitor, Restart
Thread Dumps Taken
Sleep for 30 seconds after Thread Dump taken
Check GC Logs & Health Monitor
Consecutive GCs
Check GC logs for consecutive full GC runs within the last 2 minutes
False
• Skip to running Health Monitor check
True
• Enable Heap Dump (inject JVM parameter in Jira)
• Sleep for 135 seconds
• After sleep, Health Monitor is rerun
• If Jira responds, Heap Dump is disabled
• If no response, server is restarted
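The GC branch above can be sketched as two small functions. This is an illustration under stated assumptions: GC log parsing is left out because the log format is not given, so events arrive as already-parsed `(timestamp, is_full_gc)` pairs; "consecutive" is taken to mean at least two full GCs back to back; and the step names are descriptive strings, not real commands.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def consecutive_full_gcs(events: Iterable[Tuple[datetime, bool]], now: datetime,
                         window: timedelta = timedelta(minutes=2),
                         min_run: int = 2) -> bool:
    """True if min_run full GCs occur back to back within the window.

    events must be in chronological order: (timestamp, is_full_gc).
    """
    run = 0
    for timestamp, is_full in events:
        if now - timestamp > window:
            continue  # older than the 2-minute window
        run = run + 1 if is_full else 0
        if run >= min_run:
            return True
    return False


def gc_branch_steps(full_gcs_seen: bool, responds_after_wait: bool) -> List[str]:
    """List the steps the slide describes for each branch."""
    if not full_gcs_seen:
        return ["run health monitor check"]
    steps = ["enable heap dump", "sleep 135 seconds", "rerun health monitor"]
    steps.append("disable heap dump" if responds_after_wait else "restart server")
    return steps
```

Enabling the heap dump only when full GCs pile up keeps the expensive dump for the cases where memory is the likely cause.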

Restart
Restart Process
1. Trigger files for LB & spare are created in the shared directory
2. Server is restarted
Jira is online
1. Trigger files for LB & spare are deleted from the shared directory
2. Jira is online and in service
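The restart steps can be sketched as a single function that brackets the restart with trigger files. Everything here is illustrative: `restart` and `is_online` are hypothetical callables standing in for the real restart command and health check, the file names are the slide's examples, and leaving the trigger files in place when Jira does not come back is an assumption about the desired behaviour.

```python
import tempfile
from pathlib import Path
from typing import Callable, Sequence


def restart_with_triggers(shared_dir: str,
                          restart: Callable[[], None],
                          is_online: Callable[[], bool],
                          names: Sequence[str] = ("maintenance.txt", "DOWN_ALERT.txt")) -> bool:
    """Create LB and spare triggers, restart, and clear them once Jira is online."""
    paths = [Path(shared_dir) / name for name in names]
    for path in paths:
        path.touch(exist_ok=True)  # host leaves the pool, spare is signalled
    restart()
    if not is_online():
        # Assumption: keep the triggers so the host stays out of service.
        return False
    for path in paths:
        path.unlink(missing_ok=True)  # host returns to the pool (Python 3.8+)
    return True


# Usage with stand-ins: a temp directory for shared storage, no-op restart.
with tempfile.TemporaryDirectory() as shared:
    ok = restart_with_triggers(shared, restart=lambda: None, is_online=lambda: True)
    leftover = sorted(p.name for p in Path(shared).iterdir())
```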

TIP
Taking thread and heap dumps at exactly the right times can drastically improve the diagnosis of your incidents.

Unresponsive due to:
1. Maxed out DB connections
2. High CPU due to blocked threads
3. Memory shortages
Gaps / Limitations
1. Slow response time
2. Not load-based

Takeaways
Self-Heal to Scale
• Reduces manual toil
• Get to 99.99% uptime
Safety Net
• Restore trust
• Rest easy
Do More with Less
• Better logging = Better RCA
• Extendable & flexible

BIG VISION
Scale your Atlassian systems with Data Center & Self-Healing

Leadership culture is one where everyone thinks like an owner, a CEO or a managing director. It’s one where everyone is entrepreneurial and proactive.
ROBIN S. SHARMA

DOM BALDIN | SENIOR ENGINEER | ADOBE | @DOMBALDIN
Thank you!
