2. “Chaos Engineering is the discipline of experimenting on a
distributed system in order to build confidence in the system's
capability to withstand turbulent conditions in production.”
-- http://principlesofchaos.org/
3. Introduction
• Paul Osman - Senior Engineering Manager
• posman@underarmour.com
• Previous Lives: PagerDuty, 500px, SoundCloud
7. Game Days
• Imagine what could fail.
• Figure out how to prevent it from affecting the business, and implement that.
• Cause the failure scenario to happen in production, ideally to demonstrate that the event has no user-visible effect, thereby gaining confidence in the system (see the sketch below).
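For concreteness, here is a minimal Python sketch of the "cause the failure" step, assuming a hypothetical fault-flag endpoint (FLAG_URL) on the service under test; the talk does not prescribe any particular tooling:

    import time
    import requests

    FLAG_URL = "https://weather.internal.example/_chaos/flag"  # hypothetical endpoint

    def run_experiment(duration_s: int = 300) -> None:
        """Flip a fault flag on, observe dashboards/alerts, flip it off."""
        requests.post(FLAG_URL, json={"enabled": True}, timeout=5)
        try:
            # Observation window: watch user-facing behaviour and PagerDuty.
            time.sleep(duration_s)
        finally:
            # Always roll back, even if the observation is interrupted.
            requests.post(FLAG_URL, json={"enabled": False}, timeout=5)

The finally block is the important design choice: the experiment must end with the system back in its normal state, whatever happens during the observation window.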
9. Engineers <> Engineers
This is just a healthy team. A few things I've found build trust on a team:
• Embrace failures. Learn from them.
• Incident Response Process (STAT)
• Practice blame free retrospectives.
• Embrace ownership - engineers own alerts.
10. Engineers <> Managers
What can managers do to build trust?
• Nurture a blame free and just culture.
• Protect time for action items.
11. Engineers <> Non-Engineers
How about building trust between Engineers and Non-Engineering stakeholders (e.g. product, executives, customer support)?
• Metrics that show business impact
• Be Transparent about Incidents
• Talk loudly about Chaos Engineering
12. Operational Maturity Checklist
• Incident Response Process
• Blame Free Retrospectives
• Action Items
• Metrics on Incidents
• Talk Loudly about Resiliency
15. Failure Scenarios
• Scenario A - Weather HTTP Service Unavailable
• Scenario B - Weather MySQL RDS Unavailable
• Scenario C - The Weather Channel API - High Latency
• Scenario D - Workout Service Unavailable
• Scenario E - Weather Async Service Unavailable
17. Scenario A - Weather HTTP Service Unavailable
• Workout still shown, just without weather
• PagerDuty alert? A low-urgency alert should fire (see the sketch below)
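A minimal sketch of that graceful degradation, assuming a hypothetical workout_with_weather helper and the requests library; if the weather call fails, the workout is served without it:

    import logging
    import requests

    log = logging.getLogger(__name__)

    def workout_with_weather(workout: dict, weather_url: str) -> dict:
        """Attach weather to a workout if we can; serve the workout regardless."""
        try:
            resp = requests.get(weather_url, timeout=1.0)
            resp.raise_for_status()
            workout["weather"] = resp.json()
        except requests.RequestException:
            # Weather is optional: log it (this can feed the low-urgency
            # alert) and return the workout without it.
            log.warning("weather service unavailable; serving workout without it")
        return workout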
18. Scenario B - Weather MySQL RDS Unavailable
• Expected 503s when the database was down; the service was actually throwing 504s
• Had to restart the service after the database was brought back up - connections were not being recycled (both fixes sketched below)
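A hedged sketch of both fixes, using Flask and SQLAlchemy as stand-ins (the talk doesn't name the stack): fail fast with a 503 when the database is unreachable, and let the connection pool discard stale connections so a restart isn't needed:

    from flask import Flask, jsonify
    from sqlalchemy import create_engine, text
    from sqlalchemy.exc import OperationalError

    app = Flask(__name__)
    engine = create_engine(
        "mysql+pymysql://user:pass@rds-host/weather",  # illustrative DSN
        pool_pre_ping=True,   # test connections before use; drop dead ones
        pool_recycle=3600,    # recycle connections older than an hour
    )

    @app.route("/weather/<city>")
    def weather(city):
        try:
            with engine.connect() as conn:
                row = conn.execute(
                    text("SELECT temp_c FROM weather WHERE city = :c"),
                    {"c": city},
                ).first()
        except OperationalError:
            # Database unreachable: fail fast with a 503, instead of
            # hanging until a gateway turns it into a 504.
            return jsonify(error="weather database unavailable"), 503
        return jsonify(city=city, temp_c=row.temp_c if row else None)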
19. Scenario C - High Latency from the Weather Channel API
• Requests timeout - should fire low urgency alert
• Action item: audit timeouts (sketched below)
• Expectation: asynchronous tasks are still processed
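One way the timeout audit could look in Python with requests (the URL and budgets are illustrative): every outbound call gets an explicit connect/read timeout, instead of the library default of waiting indefinitely:

    import requests

    CONNECT_TIMEOUT_S = 2.0  # time allowed to establish the connection
    READ_TIMEOUT_S = 5.0     # time allowed to wait for the response

    def fetch_forecast(city: str):
        """Fetch a forecast with an explicit (connect, read) timeout."""
        try:
            resp = requests.get(
                "https://api.weather.example/forecast",  # illustrative URL
                params={"city": city},
                timeout=(CONNECT_TIMEOUT_S, READ_TIMEOUT_S),
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            # A slow or failing upstream times out quickly and degrades,
            # instead of holding the request open until a gateway 504.
            return None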
20. Takeaways!
• We learned a ton!
• Scheduled some valuable action items
• Just thinking about this stuff was worthwhile
• Less alert fatigue!
• Let's do more!
21. Next steps
• More teams doing more game days more frequently
• Build failure injection into our release process (production readiness) - see the sketch below
• Automate automate automate (hi Gremlin!)
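A hedged sketch of what that could look like: a pytest gate run against a staging environment before a release is promoted, reusing the same hypothetical fault flag as earlier; all endpoints are illustrative, not from the talk:

    import requests

    STAGING = "https://staging.example.com"
    FAULT_FLAG = f"{STAGING}/_chaos/weather-down"

    def test_workout_survives_weather_outage():
        """Release gate: workouts must still be served with weather down."""
        requests.post(FAULT_FLAG, json={"enabled": True}, timeout=5)
        try:
            resp = requests.get(f"{STAGING}/workouts/latest", timeout=5)
            assert resp.status_code == 200          # workout still served
            assert "weather" not in resp.json()     # just without weather
        finally:
            # Restore staging no matter how the assertions go.
            requests.post(FAULT_FLAG, json={"enabled": False}, timeout=5)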
22. Resources
• PagerDuty Incident Response Docs - https://response.pagerduty.com/
• Principles of Chaos - https://principlesofchaos.org/
• Fault Injection in Production - https://queue.acm.org/detail.cfm?id=2353017
• Gremlin Blog - https://www.gremlin.com/blog/