So, you've successfully deployed Jira Data Center, and it's become the most popular tool in the stack. Perfect! All done, right? Not so fast. As companies scale the use of Atlassian tools, dependence on their availability multiplies. Any bit of downtime can have a serious impact on team productivity and delivery.
In this talk, we'll share how our team at Adobe created a system of monitors and scripts that allow Jira to heal itself when things go wrong, making Critical Service Outages a thing of the past.
20. Self-Healing
• Health Monitor: Automated service that analyzes the health of each node, healing nodes that are unresponsive.
• Logging: Thread/Heap Dumps captured at exactly the right time.
• SRE Focus: Jira DC + Self Healing is a solid foundation based in Site Reliability Engineering methods.
29. Health Monitor
Breakdown: JDC | Health Monitor | Triggers
• Runs on each host (same Tomcat as Jira)
• Checks:
1. Presence of trigger file
2. Health status URL of host
• Reports: Success, Error, or Maint
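The two checks and the three report values can be sketched as a small function. This is only an illustration: the precedence (a present trigger file reports Maint), the HTTP 200 success criterion, the timeout, and the names `probe_url` and `report` are assumptions, not details from the talk.

```python
import os
import urllib.error
import urllib.request


def probe_url(url: str, timeout: float = 5.0) -> bool:
    """Return True if the health status URL answers with HTTP 200 (assumed criterion)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status all count as unhealthy.
        return False


def report(trigger_file: str, url: str, probe=probe_url) -> str:
    """Combine both checks into one of the three report values."""
    # Assumption: a present trigger file takes precedence and means maintenance.
    if os.path.exists(trigger_file):
        return "Maint"
    return "Success" if probe(url) else "Error"
```

Passing `probe` as a parameter keeps the decision logic testable without a running Jira node.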
30. Trigger Files
1. Trigger for LB: places host in/out of pool
   Ex: maintenance.txt
2. “Off” switch for the Health Monitor
   Ex: RESTART_IN PROGRESS.txt
3. Trigger for spare
   Ex: DOWN_ALERT.txt
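A minimal sketch of how the three trigger files could be handled as plain marker files. The file names are the slide's examples, copied as printed. Keeping all three in a single directory, the role keys, and the helper names are assumptions made for illustration.

```python
from pathlib import Path

# File names are the examples from the slide; the role keys are hypothetical labels.
TRIGGERS = {
    "lb": "maintenance.txt",                   # places host in/out of the LB pool
    "monitor_off": "RESTART_IN PROGRESS.txt",  # "Off" switch for the Health Monitor
    "spare": "DOWN_ALERT.txt",                 # trigger for the spare
}


def set_trigger(share_dir: Path, role: str) -> Path:
    """Create the (empty) trigger file for the given role."""
    path = share_dir / TRIGGERS[role]
    path.touch()
    return path


def clear_trigger(share_dir: Path, role: str) -> None:
    """Delete the trigger file; a missing file is not an error (Python 3.8+)."""
    (share_dir / TRIGGERS[role]).unlink(missing_ok=True)


def active_triggers(share_dir: Path) -> set:
    """Return the roles whose trigger file is currently present."""
    return {role for role, name in TRIGGERS.items() if (share_dir / name).exists()}
```

Because each trigger is just the presence of a file, any script or person with access to the directory can flip it.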
34. MONITORING SEQUENCE
1st Check: If “Off” switch is present, stop monitor here.
2nd Check: If spare trigger is present, place spare in service.
3rd Check: Curl Jira Health Monitor.
Outcome: 3 Outcomes
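The sequence above can be sketched as one function that runs the three checks in order. This is a hedged illustration: the function name, the callback parameters, and the "STOPPED" label are hypothetical, and it assumes the sequence continues to the third check after the spare has been placed in service.

```python
from typing import Callable


def monitoring_pass(
    off_switch_present: bool,
    spare_trigger_present: bool,
    place_spare_in_service: Callable[[], None],
    curl_health_monitor: Callable[[], bool],
) -> str:
    """Run one pass of the monitoring sequence and return its result."""
    # 1st check: the "Off" switch stops the monitor before anything else happens.
    if off_switch_present:
        return "STOPPED"
    # 2nd check: the spare trigger puts the spare in service.
    # Assumption: the pass then continues with the 3rd check.
    if spare_trigger_present:
        place_spare_in_service()
    # 3rd check: ask the Jira Health Monitor whether Jira responds.
    return "SUCCESS" if curl_health_monitor() else "NO RESPONSE"
```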
39. OUTCOMES
After the three checks, curling the Jira Health Monitor ends in one of three outcomes:
1. SUCCESS: Jira is alive!
2. NO RESPONSE + Jira NOT running: Start Jira
3. NO RESPONSE + Jira IS running: Start Self Heal
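The three outcomes reduce to a two-input decision, sketched below. The enum and function names are hypothetical. How the caller determines whether the Jira process is running (for example, a PID check) is left open because the slides do not say.

```python
from enum import Enum


class Action(Enum):
    NONE = "Jira is alive"
    START_JIRA = "Start Jira"
    SELF_HEAL = "Start Self Heal"


def choose_action(responded: bool, jira_process_running: bool) -> Action:
    """Map the Health Monitor result and process state to the next action."""
    if responded:
        return Action.NONE
    if not jira_process_running:
        # No response and no process: Jira simply needs to be started.
        return Action.START_JIRA
    # No response although the process exists: the node is hung, so self-heal.
    return Action.SELF_HEAL
```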
43. Self Healing
Timeline: Thread Dumps → Check GC Logs & Health Monitor → Restart
Consecutive GCs
Check GC logs for consecutive full GC runs within last 2 minutes
False
• Skip to running Health Monitor check
True
• Enable Heap Dump (inject JVM parameter in Jira)
• Sleep for 135 seconds
• After sleep, Health Monitor is rerun
• If Jira responds, Heap Dump is disabled
• If no response, server is restarted
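A rough sketch of this step follows. The 2-minute window and the 135-second sleep come from the slide. Everything else is an assumption: GC log parsing is left out (events arrive as already-parsed pairs), two back-to-back full GCs count as "consecutive", the False branch is followed by the same responds/restart decision, and all function names are hypothetical.

```python
from typing import Callable, Iterable, Tuple

WINDOW_SECONDS = 120   # "within last 2 minutes" (from the slide)
SLEEP_SECONDS = 135    # from the slide
MIN_CONSECUTIVE = 2    # assumption: two full GCs in a row count as consecutive


def consecutive_full_gcs(events: Iterable[Tuple[float, bool]], now: float) -> bool:
    """events: (timestamp_seconds, is_full_gc) pairs in log order."""
    run = 0
    for ts, is_full in events:
        if now - ts > WINDOW_SECONDS:
            continue  # older than the window, ignore
        run = run + 1 if is_full else 0  # any non-full GC breaks the streak
        if run >= MIN_CONSECUTIVE:
            return True
    return False


def gc_step(
    events: Iterable[Tuple[float, bool]],
    now: float,
    health_ok: Callable[[], bool],
    enable_heap_dump: Callable[[], None],
    disable_heap_dump: Callable[[], None],
    restart: Callable[[], None],
    sleep: Callable[[float], None],
) -> str:
    """Run the GC check, then the Health Monitor check, then restart if needed."""
    heap_dump_enabled = False
    if consecutive_full_gcs(events, now):
        enable_heap_dump()
        heap_dump_enabled = True
        sleep(SLEEP_SECONDS)
    if health_ok():
        if heap_dump_enabled:
            disable_heap_dump()
        return "recovered"
    restart()
    return "restarted"
```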
44. Self Healing
Timeline: Thread Dumps → Check GC Logs & Health Monitor → Restart
Restart Process
1. Trigger files for LB & spare are created in share directory
2. Server is restarted
Jira is online
1. Trigger files for LB & spare are deleted in share directory
2. Jira is online and in service
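The create, restart, delete ordering can be sketched as a context manager. This is an illustration under assumptions: the helper name `out_of_service` is hypothetical, the default file names reuse the slide's trigger examples, and the choice to leave the trigger files in place when the restart fails is a design guess rather than something the talk states.

```python
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def out_of_service(
    share_dir: Path,
    lb_trigger: str = "maintenance.txt",
    spare_trigger: str = "DOWN_ALERT.txt",
):
    """Hold the LB and spare trigger files in the share directory during a restart."""
    paths = [share_dir / lb_trigger, share_dir / spare_trigger]
    for p in paths:
        p.touch()
    yield paths
    # Reached only if the body finished without raising, i.e. Jira is online.
    # On failure the triggers stay, so the node remains out of the pool.
    for p in paths:
        p.unlink(missing_ok=True)
```

The restart and the wait for Jira to come online would go inside the `with out_of_service(share_dir):` body, so the triggers are removed only after both succeed.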
45. TIP
Taking thread and heap dumps at exactly the right times can drastically improve the diagnosis of your incidents.
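One way to capture a short series of thread dumps while a node is still hung is to call the JDK's `jstack` tool from a script. The sketch below assumes `jstack` is on the PATH. The dump count, the interval, the file naming, and the function names are illustrative choices, not values from the talk.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, List, Optional


def jstack_command(pid: int) -> List[str]:
    """Build the jstack call; -l adds lock information to the dump."""
    return ["jstack", "-l", str(pid)]


def capture_thread_dumps(
    pid: int,
    out_dir: Path,
    count: int = 3,
    interval: float = 10.0,
    run: Optional[Callable[[List[str]], str]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Path]:
    """Write `count` thread dumps, `interval` seconds apart, and return their paths."""
    if run is None:
        # Default runner: execute the command and return its standard output.
        def run(cmd: List[str]) -> str:
            return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    written = []
    for i in range(count):
        path = out_dir / f"threaddump_{pid}_{i + 1}.txt"
        path.write_text(run(jstack_command(pid)))
        written.append(path)
        if i < count - 1:
            sleep(interval)  # space the dumps so blocked threads show up across samples
    return written
```

Several dumps a few seconds apart make it easier to tell a thread that is truly stuck from one that was merely busy at the moment of a single snapshot.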
46. Unresponsive due to:
1. Maxed out DB connections
2. High CPU due to blocked threads
3. Memory shortages
Gaps / Limitations
1. Slow response time
2. Not load-based
48. Takeaways
Self-Heal to Scale
• Reduces manual toil
• Get to 99.99% uptime
Safety Net
• Restore trust
• Rest easy
Do More with Less
• Better logging = Better RCA
• Extendable & flexible
52. Leadership culture is one where everyone thinks like an owner, a CEO or a managing director. It’s one where everyone is entrepreneurial and proactive.
ROBIN S. SHARMA
54. DOM BALDIN | SENIOR ENGINEER | ADOBE | @DOMBALDIN
Thank you!