2. Speaker Intro - Todd
● Director of Ops for
● Over 25 years in IT
● Experience with both academic and enterprise computing
● Favorite operating system is Tru64
● Enjoys solving problems...but loves sleep more!
@toddminnella
tminnella@soasta.com
3. Speaker Intro - Matt
● VP of Engineering for
● Started programming with Atari BASIC in elementary school
● Ops on the side :-)
● First Velocity presentation!
@msolnit
msolnit@soasta.com
4. Who are you? :-)
http://www.cliarthut.com/clip-arts/751/who-are-you-clip-art-751173.jpg
5. Agenda (1 of 2)
Part One - Theory
● Distributed Systems Challenges
● Mitigating Failure Impact
● Benefits and Risks
● Testing Requirements
● Methodology
6. Agenda (2 of 2)
Part Two - Practice
● Description of Demo System
● Example #1 - Externally Triggered Full GC
● Example #2 - External System Restart
● Example #3 - System-initiated Support Case
● Tools Demonstrated
● Other Ideas for Automation
8. What makes a distributed system?
● Multiple components
● Different servers
● Different regions (data center or geo)
● A component failure != service or app failure
● Requires systems thinking
9. Challenges faced by dist. systems
● Complexity
● Uncontrollable elements
● Hard to see the whole picture
● Impossible for a single person to manage
10. What can we do about it?
Easy answer:
Add people!
But… easy != correct
12. Benefits of Self-Healing
● Better uptime (at the component level)
● Higher service quality
● Rapid identification of repeating issues
● Improved Ops team morale and productivity
13. Risk of Self-Healing Systems
● Worse uptime (at the component level)
● Lower service quality
● Maintenance complexities
● Degraded Ops team morale and productivity
19. Demo Application
Java App Server Farm (n = 2)
Amazon Linux EC2 Instance
EC2 Elastic IP address
Load Balanced via DNS (Dyn Traffic Director)
Simple Web Application (HTTP/HTTPS)
21. Real-life mPulse example
Started reporting Java statistics to our monitoring tool in 2013.
When investigating outages, we often found an exact correlation with large garbage collections (sound familiar?).
Set up an alert to fire when heap usage went above 70%.
Everybody into the war room!
23. Real-life mPulse example, cont’d
Engineering looks for a possible memory leak.
Eventually someone says, “Just force a GC!”
Most of the time, this would fix it. The JVM isn’t perfect; if we help it, the system remains stable.
Occasionally this didn’t fix it, which would indicate an actual bug.
Engineering fixes, deploys, repeats!
26. Identify the Problem
1. Java isn’t garbage-collecting efficiently.
2. Tuning the JVM is time-consuming and dangerous.
3. Forcing a collection works, but it requires waking someone up.
27. Describe a Solution (1 of 2)
Identify a metric for JVM Heap Use that is
indicative of the problem:
Java VM Old % Used
Start monitoring/reporting this metric.
Specify a threshold for action:
Old % Used > 65%
28. Describe a Solution (2 of 2)
When the threshold is reached, take an action:
Trigger a full garbage collection
After the action, monitor for success:
Old % Used < 65%
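The threshold rule from these two slides can be sketched as a pair of small functions. This is a minimal illustration, not the script used in the talk; the names `needsFullGc` and `gcSucceeded` are hypothetical, and the 65% value is taken from the slides.

```javascript
'use strict';

// Threshold from the slide: act when Old % Used goes above 65%.
const OLD_PCT_THRESHOLD = 65;

// Hypothetical helper: should a full GC be triggered for this reading?
function needsFullGc(oldPctUsed, threshold = OLD_PCT_THRESHOLD) {
  return oldPctUsed > threshold;
}

// Hypothetical helper: did the full GC bring usage back under the threshold?
function gcSucceeded(oldPctAfter, threshold = OLD_PCT_THRESHOLD) {
  return oldPctAfter < threshold;
}

console.log(needsFullGc(72.4)); // true
console.log(gcSucceeded(41.0)); // true
```

A reading of exactly 65% neither triggers the action nor counts as success, matching the strict inequalities on the slides.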
29. Execute by Hand
Trigger the condition that causes the problem
(or be patient and let it happen).
Once monitoring indicates high old % used,
manually execute the full GC.
30. Automate the Solution, Manually Trigger
Write a script to check for Java old % used.
Run the script via cron or similar mechanism.
Report when old % used exceeds threshold.
A DevOps human will trigger the full GC.
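One way the checking script might obtain the metric is by parsing the output of `jstat -gcutil <pid>`, whose `O` column reports old-generation utilization as a percentage. The sketch below covers only parsing and reporting; the function names and the sample output are illustrative assumptions, and the script used in the talk may work differently.

```javascript
'use strict';

// Parse the output of `jstat -gcutil <pid>` (header row + value row) and
// return the old-generation utilization (the "O" column) as a number.
function parseOldPct(jstatOutput) {
  const lines = jstatOutput.trim().split('\n').map((l) => l.trim().split(/\s+/));
  if (lines.length < 2) throw new Error('expected a header row and a value row');
  const [header, values] = lines;
  const idx = header.indexOf('O');
  if (idx === -1) throw new Error('no "O" column in jstat output');
  const pct = Number(values[idx]);
  if (Number.isNaN(pct)) throw new Error('old-generation value is not numeric');
  return pct;
}

// Build the report line a human would act on; returns null when all is well.
function reportIfHigh(oldPct, threshold = 65) {
  return oldPct > threshold
    ? `WARNING: old gen at ${oldPct}% (threshold ${threshold}%)`
    : null;
}

// Illustrative sample, not captured from a real system.
const sample = [
  '  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT',
  '  0.00  42.10  63.50  71.25  95.10  90.02   120    1.234     3    0.987    2.221',
].join('\n');

console.log(reportIfHigh(parseOldPct(sample)));
```

Looking the column up by its header name, rather than by position, keeps the parser working across JVM versions that add or remove columns.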
33. Automate the Solution, Automate the Trigger
Taking the script shown previously, combine
the step that:
Reports that old % used > 65%
with the step that:
Triggers the full GC
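Combining the two steps might look like the sketch below. The metric reader and the GC trigger are passed in as functions so the logic can be exercised without a JVM. In practice the trigger could shell out to something like `jcmd <pid> GC.run`, which is an assumption here rather than the method shown in the talk.

```javascript
'use strict';

// One pass of the self-healing loop: read the metric, trigger a full GC if it
// is over the threshold, then read again to verify. `readOldPct` and
// `triggerFullGc` are hypothetical callbacks supplied by the caller.
function checkAndHeal({ readOldPct, triggerFullGc, threshold = 65, log = console.log }) {
  const before = readOldPct();
  if (!(before > threshold)) {
    return { acted: false, before };
  }
  log(`old gen at ${before}% > ${threshold}%, triggering full GC`);
  triggerFullGc();
  const after = readOldPct();
  const healed = after < threshold;
  if (!healed) {
    // A GC that does not help may indicate a real leak: escalate to a human.
    log(`old gen still at ${after}% after full GC, escalating`);
  }
  return { acted: true, before, after, healed };
}

// Simulated JVM for demonstration: the GC drops usage from 80% to 30%.
let simulatedPct = 80;
const result = checkAndHeal({
  readOldPct: () => simulatedPct,
  triggerFullGc: () => { simulatedPct = 30; },
});
console.log(result);
```

Returning a result object instead of only logging makes it straightforward to feed the outcome back into monitoring.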
35. Watch and adjust
Set up the automated script to run in as many
test environments as are available/applicable.
Review the results (script log, metrics graphs).
Does it work?
Investigate any issues thoroughly.
Potentially, install the script in a dry-run mode
in production.
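A dry-run mode can be as simple as a wrapper that logs the action it would have taken instead of performing it. This is a sketch under that assumption; `withDryRun` is a hypothetical name.

```javascript
'use strict';

// Wrap a remediation action so that, in dry-run mode, it is only logged.
function withDryRun(name, action, { dryRun, log = console.log }) {
  return (...args) => {
    const call = `${name}(${args.map(String).join(', ')})`;
    if (dryRun) {
      log(`[dry-run] would run ${call}`);
      return undefined;
    }
    log(`running ${call}`);
    return action(...args);
  };
}

// Example: the wrapped action is never invoked while dryRun is true.
const calls = [];
const triggerFullGc = withDryRun('triggerFullGc', (pid) => calls.push(pid), { dryRun: true });
triggerFullGc(4242);
console.log(calls.length); // 0
```

Comparing the dry-run log against the metrics graphs shows whether the script would have acted at the right moments before it is allowed to act for real.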
36. Go Live!
We recommend a gradual deployment.
Deploy to a subset of production, then assess.
Expand the subset, assess again.
When all of production is live, enjoy more
sleep!
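One possible way to pick the production subset is to hash each host name into a stable bucket and enable the script only on hosts whose bucket falls below the current rollout percentage. Hosts enabled at a lower percentage then stay enabled as the percentage grows. The host names and the hashing scheme are assumptions for illustration; the talk does not prescribe a mechanism.

```javascript
'use strict';
const crypto = require('crypto');

// Map a host name to a stable bucket in [0, 100).
function bucketFor(hostname) {
  const digest = crypto.createHash('sha256').update(hostname).digest();
  return digest.readUInt32BE(0) % 100;
}

// True when self-healing should be active on this host at the given rollout level.
function isEnabled(hostname, rolloutPercent) {
  return bucketFor(hostname) < rolloutPercent;
}

// Hypothetical host names.
const hosts = ['app-01', 'app-02', 'app-03', 'app-04', 'app-05'];
for (const pct of [0, 25, 100]) {
  console.log(pct, hosts.filter((h) => isEnabled(h, pct)));
}
```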
39. Real-life mPulse example
What is a beacon?
{"timestamp":1392256183739,"drop_code":"crumb:missing","http_method":"GET","http_version":"HTTP/1.1","http_referrer":"","headers":{"host":"localhost:8080","accept":"*/*"},"params":{"nt_dns_end":"1392147897985","nt_load_end":"1392147912182","nt_first_paint":"1392147900.964995","mem.used":"131000000","nt_spdy":"0","nt_unload_end":"1392147898577","nt_dns_st":"1392147897985","nt_con_st":"1392147897985","rt.bmr.conEn":"834.00000000006","rt.bmr.resEn":"2320.0000000001637","mem.total":"199000000","nt_nav_st":"1392147897985","nt_domcontloaded_end":"1392147901891","dom.sz":"58549","rt.tstart":"1392147897985","rt.bmr.domSt":"419.0000000000964","nt_con_end":"1392147897985","nt_domint":"1392147901585","nt_red_end":"0","dom.ln":"939","nt_unload_st":"1392147898574","t_done":"14201","nt_load_st":"1392147912129","t_page":"13638","rt.end":"1392147912186","nt_domloading":"1392147898927","nt_res_end":"1392147898571","t_resp":"563","rt.bmr.domEn":"813.0000000001019","rt.tt":"14201","nt_red_cnt":"0","if":"","nt_fet_st":"1392147897985","nt_res_st":"1392147898548","nt_req_st":"1392147897995","nt_nav_type":"0","mob.ct":"0","dom.img":"16","nt_red_st":"0","rt.ss":"1392147897985","config.timedout":"true","rt.bmr.resSt":"2312.0000000001255","rt.si":"3el0j57fms0885mi-n0uk6y","rt.sl":"1","rt.bmr.fetSt":"16.000000000076398","rt.bmr.conSt":"813.0000000001019","nt_domcomp":"1392147912129","dom.script":"27","v":"0.9.1389663787","rt.bmr.reqSt":"834.00000000006","r":"","rt.bstart":"1392147906107","rt.obo":"0","rt.start":"navigation","nt_domcontloaded_st":"1392147901585"}}
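Since a beacon is plain JSON, it can be read with `JSON.parse`. The sketch below uses a trimmed-down copy of the beacon above and assumes, based on the values shown, that `t_done` is the elapsed time in milliseconds between `rt.tstart` and `rt.end`.

```javascript
'use strict';

// A trimmed-down copy of the beacon shown above (most params omitted).
const raw = '{"timestamp":1392256183739,"http_method":"GET",' +
  '"params":{"t_done":"14201","rt.tstart":"1392147897985","rt.end":"1392147912186"}}';

const beacon = JSON.parse(raw);

// Parameter values arrive as strings, so convert before doing arithmetic.
const tDone = Number(beacon.params.t_done);
const elapsed = Number(beacon.params['rt.end']) - Number(beacon.params['rt.tstart']);

console.log(tDone, elapsed); // 14201 14201
```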
40. Real-life mPulse example, cont’d
Each server processes millions of these per day.
Beacons are logged to disk, eventually
compressed and uploaded to S3.
41. Real-life mPulse example, cont’d
Every so often, the background uploader thread
stops working.
(we don’t know why yet)
When this happens, we get 10-12 hours before
the disk fills up and the server dies.
42. Real-life mPulse example, cont’d
A simple re-start fixes it.
SO...
While developers are investigating, Ops is
getting paged (and woken up) to re-start boxes.
44. Identify the Problem (Demo App)
● Lack of activity indicates a failed thread
● While the issue goes unresolved, data is
delayed (and the disk may fill)
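Detecting "lack of activity" could be sketched as a check on the age of the most recent successful upload. The 30-minute limit and the function name are assumptions for illustration; the setup described in the talk relies on a CloudWatch alarm over log metrics rather than a script like this.

```javascript
'use strict';

// Hypothetical limit: treat the uploader as stalled after 30 minutes of silence.
const MAX_IDLE_MS = 30 * 60 * 1000;

// Returns true when no upload has completed within the allowed idle window.
function uploaderStalled(lastUploadMs, nowMs = Date.now(), maxIdleMs = MAX_IDLE_MS) {
  return nowMs - lastUploadMs > maxIdleMs;
}

const now = Date.UTC(2015, 4, 27, 12, 0, 0);
console.log(uploaderStalled(now - 5 * 60 * 1000, now));  // false
console.log(uploaderStalled(now - 45 * 60 * 1000, now)); // true
```

With 10-12 hours before the disk fills, even a generous idle limit leaves ample time to react.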
45. Describe a Solution
● A restart of the application solves the
problem
● The application server needs to be removed
from service prior to the restart
● The server hosting the application is an
AWS instance, and a reboot is fast and
effective
46. Execute by Hand
1. Take the application out-of-service
2. Restart the application
3. Watch for Self-Check OK
4. Put the application back in-service
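The four manual steps map naturally onto a sequence of asynchronous calls. In this sketch each step is a function supplied by the caller (all names are hypothetical), so the ordering logic can be tested without touching DNS or a real server.

```javascript
'use strict';

// Run the manual procedure in order. If the self-check never passes, the
// application is deliberately left out of service for a human to inspect.
async function restartSafely({ takeOutOfService, restartApp, selfCheckOk, putInService,
                               attempts = 10, delayMs = 1000 }) {
  await takeOutOfService();
  await restartApp();
  for (let i = 0; i < attempts; i++) {
    if (await selfCheckOk()) {
      await putInService();
      return true;
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}

// Demonstration with stand-in steps; the self-check passes on the third poll.
const steps = [];
let polls = 0;
const demo = restartSafely({
  takeOutOfService: async () => steps.push('out-of-service'),
  restartApp: async () => steps.push('restart'),
  selfCheckOk: async () => { steps.push('self-check'); return ++polls >= 3; },
  putInService: async () => steps.push('in-service'),
  delayMs: 1,
});
demo.then((ok) => console.log(ok, steps.join(' -> ')));
```

Leaving a server out of service when the self-check fails is a design choice in this sketch: a broken instance that stays out of rotation is safer than one that is put back automatically.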
47. Automate the Solution, Manually Trigger
● Log metrics go to AWS CloudWatch
● Lack of activity triggers an Alarm
● Alarm triggers an SNS notification
● Human being makes the DNS changes and restarts the server.
50. Automate the Solution, Automate the Trigger
● EC2 and DynECT both have APIs
● DNS changes and reboot can all be
automated
● Todd can sleep!
51. Automate the Solution, Automate the Trigger
AWS Lambda
Upload code to Amazon (Node.js)
Attach it to a listener (SNS)
No instance required!
52. Automate the Solution, Automate the Trigger
Lambda function listens on “logs are not being
uploaded” notification.
Uses Dyn REST API to disable the DNS
record.
Uses EC2 API to re-boot the instance.
53. Automate the Solution, Automate the Trigger
Lambda function listens on “all OK” notification.
Uses Dyn REST API to re-enable the DNS
record.
54. Node.js code snippet
var dynect = require('./dynect_api.js');
var AWS = require('aws-sdk');

exports.cloudwatch_alarm_sns_handler = function(event, context) {
  event.Records.forEach(function(record) {
    var alarm = JSON.parse(record.Sns.Message);

    // Extract the instance status. ALARM means it's down, OK means it's up.
    var instance_up = alarm.NewStateValue !== "ALARM";

    // ...
  });
};
https://github.com/SOASTA/velocity-2015-self-healing-systems
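How the rest of such a handler might dispatch on the alarm state is sketched below. This is not the code from the repository: the `dns` and `compute` objects and their method names are hypothetical stand-ins for the Dyn and EC2 clients, and the layout of the alarm's `Trigger.Dimensions` field is an assumption.

```javascript
'use strict';

// Pull the instance id out of a parsed CloudWatch alarm message. The shape of
// Trigger.Dimensions is assumed here; both key casings are accepted.
function instanceIdFrom(alarm) {
  const dims = (alarm.Trigger && alarm.Trigger.Dimensions) || [];
  const dim = dims.find((d) => (d.name || d.Name) === 'InstanceId');
  return dim ? (dim.value || dim.Value) : null;
}

// ALARM: pull the record from DNS, then reboot. OK: put the record back.
// `dns` and `compute` are hypothetical wrappers injected by the caller.
async function handleAlarm(alarm, { dns, compute }) {
  const instanceId = instanceIdFrom(alarm);
  if (!instanceId) throw new Error('alarm has no InstanceId dimension');
  if (alarm.NewStateValue === 'ALARM') {
    await dns.disableRecord(instanceId);
    await compute.reboot(instanceId);
    return 'disabled-and-rebooted';
  }
  if (alarm.NewStateValue === 'OK') {
    await dns.enableRecord(instanceId);
    return 'enabled';
  }
  return 'ignored'; // any other state, e.g. INSUFFICIENT_DATA
}
```

Disabling the DNS record before the reboot mirrors the manual procedure, where the application is taken out of service first. Unlike the snippet above, this sketch takes no action for states other than `ALARM` and `OK`.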
59. Real-life mPulse example
● Customers configure raw beacon uploads to their own S3 buckets.
● Sometimes they break things (or an AWS access key is changed, etc.)
● We log the error, but we don’t monitor it and don’t notify customers.
60. Identify the Problem
● Another example: a user connecting to a site can’t authenticate successfully
● Assumption is that this is a limited-access site
62. Describe a Solution
● Notify the Customer Support team
● Provide Support with details so that they can
proactively reach out
63. Execute by Hand
● Examine the logs for the error
● Review the situation with Support
● Work with Support to handle a case end-to-end
64. Automate the Solution, Manually Trigger
● Log metrics go to AWS CloudWatch
● Presence of error triggers an Alarm
● Alarm triggers an SNS notification
● Human being can then create a Zendesk case
65. Automate the Solution, Automate the Trigger
● AWS Lambda listens on SNS notification
● Collects information from the notification
● Files a Zendesk case categorized to go to
the correct team
66. AWS Lambda Actions
On Failed Login notification
● Create a Zendesk case with user details
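The Zendesk Tickets API accepts a JSON body of the form `{ "ticket": { ... } }` on `POST /api/v2/tickets.json`. The sketch below only builds that body from a notification; the notification fields (`user`, `site`, `error`) and the group id are hypothetical, and sending the request is left out.

```javascript
'use strict';

// Build a Zendesk ticket payload from a failed-login notification.
// `groupId` routes the case to a team; the value used below is made up.
function buildTicket(notification, groupId) {
  const { user, site, error } = notification;
  return {
    ticket: {
      subject: `Failed login for ${user} on ${site}`,
      comment: {
        body: [
          'Automatically filed by the self-healing system.',
          `User: ${user}`,
          `Site: ${site}`,
          `Error: ${error}`,
        ].join('\n'),
      },
      group_id: groupId,
      tags: ['auto-filed', 'failed-login'],
    },
  };
}

const payload = buildTicket(
  { user: 'jane@example.com', site: 'example.com', error: 'invalid credentials' },
  12345
);
console.log(JSON.stringify(payload, null, 2));
```

Tagging automatically filed cases makes it easier for Ops and Support to review their frequency and outcomes later.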
67. Watch and adjust
● Ops reviews logs
● Ops meets with Support to review case
frequency and outcomes
68. Testing Requirements
● Start small
● Develop (and verify) in stages
● Let it run in a production-like environment
● Verify behavior in “dry-run” mode