War Games - Flight training for DevOps @ TechSummit Amsterdam
1. Jorge Salamero Sanz <jsalamero@serverdensity.com>
TechSummit Amsterdam 2 June 2016
https://joind.in/talk/2e223
5. ● Humans are part of any system
● Initial design, ongoing improvements
● Maintenance
● Upgrades
● Issues, Incident response
Humans in DevOps
6. ● System issues = error rates + SLA + ...
● Human issues = alerts out of hours + interruptions + ...
● System issues = Human issues
Human issues = system issues
7. ● Downtime = loss of users, reputation, revenue
● Downtime caused by unreliable systems
● Unhealthy teams reduce reliability
● Unhealthy teams = loss of users, reputation, revenue
Humans impact business
8. ● Slip
● Lapse
● Mistake
● Violation
● (Always, again, again)
Human risk
12. ● Power failure to half of our servers
● Automated failover unavailable (known failure condition)
● Manual DNS switch required
● Expected impact: 20 min
● Actual impact: 43 min
Incident example
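The gap between expected and actual impact can be worked out explicitly. A minimal sketch, using only the two figures from the slide; the function name `overrun` is hypothetical:

```python
def overrun(expected_min, actual_min):
    """Return (extra minutes, actual/expected ratio) for an incident."""
    extra = actual_min - expected_min
    ratio = actual_min / expected_min
    return extra, ratio

# Figures from the incident example: 20 min expected, 43 min actual.
extra, ratio = overrun(20, 43)
print(f"Overrun: {extra} min ({ratio:.2f}x the expected impact)")
```

The manual step took more than twice as long as planned, which is the kind of gap a rehearsal is meant to expose.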
17. ● First responder, acknowledge alert
● Load incident response checklist
● Log into #ops-war-room in Slack
● Log incident into JIRA
● Begin investigation
General response process
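The response steps above can be sketched as an ordered checklist that records when each step was completed. This is only an illustration: the step texts come from the slide, while the `Checklist` class and its methods are hypothetical and do not describe any tool named in the talk.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Step texts copied from the slide.
RESPONSE_STEPS = [
    "First responder, acknowledge alert",
    "Load incident response checklist",
    "Log into #ops-war-room in Slack",
    "Log incident into JIRA",
    "Begin investigation",
]


@dataclass
class Checklist:
    """Hypothetical helper: an ordered checklist with completion times."""
    steps: list
    done: dict = field(default_factory=dict)  # step -> completion time (UTC)

    def next_step(self):
        """Return the first step not yet completed, or None when finished."""
        for step in self.steps:
            if step not in self.done:
                return step
        return None

    def complete(self, step):
        """Mark a step done; refuse to skip ahead of the next pending step."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"out of order: expected {expected!r}")
        self.done[step] = datetime.now(timezone.utc)


checklist = Checklist(RESPONSE_STEPS)
checklist.complete("First responder, acknowledge alert")
print("Next:", checklist.next_step())
```

Enforcing order is a design choice for this sketch. A team that wants responders to apply judgement rather than follow the list blindly might only warn on a skipped step.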
19. ● The “limits of human memory and attention”
○ Complexity
○ Stress and fatigue
○ Ego
● Pilots, doctors, divers: “Bruce Willis Ruins All Films”
(BWRAF: BCD, weights, releases, air, final)
Pre-flight checklists
20. 1. Extended use of checklists
2. Not to be followed blindly; use knowledge and experience
3. Independent system
4. Searchable
5. List of known issues and documented workarounds/fixes
Documented procedures
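A searchable list of known issues can be very small. The sketch below assumes an in-memory list and a plain keyword match. The first entry is taken from the incident example; the second is a made-up placeholder, and the `search` function is hypothetical.

```python
KNOWN_ISSUES = [
    # From the incident example in this talk.
    {"title": "Automated failover unavailable after power failure",
     "workaround": "Manual DNS switch"},
    # Hypothetical placeholder entry, for illustration only.
    {"title": "Placeholder: disk full on log volume",
     "workaround": "Placeholder: rotate and compress logs"},
]


def search(issues, query):
    """Return issues whose title or workaround contains every query word."""
    words = query.lower().split()
    results = []
    for issue in issues:
        text = f"{issue['title']} {issue['workaround']}".lower()
        if all(word in text for word in words):
            results.append(issue)
    return results


for hit in search(KNOWN_ISSUES, "failover"):
    print(hit["title"], "->", hit["workaround"])
```

To satisfy the "independent system" point, data like this would need to live somewhere that stays reachable when the production infrastructure it documents is down.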
21. ● Realistic replica environment
● or mock command line
● Record actions and timing
● Multiple failures
● Unexpected results
War Games
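The "mock command line" and "record actions and timing" points could be combined roughly as follows. This is a minimal sketch, not the speaker's tooling: the `WarGameSession` class, the canned command, and its output are all hypothetical.

```python
import time


class WarGameSession:
    """Mock command line that records each action and when it was typed."""

    def __init__(self, canned_outputs, clock=time.monotonic):
        self.canned_outputs = canned_outputs  # command -> fake output
        self.clock = clock                    # injectable, so drills can be replayed
        self.start = clock()
        self.log = []                         # (seconds since start, command)

    def run(self, command):
        """Record the command with its elapsed time and return a canned reply."""
        elapsed = self.clock() - self.start
        self.log.append((elapsed, command))
        # Commands without a canned reply get an error string, which lets a
        # drill include unexpected results.
        return self.canned_outputs.get(command, f"{command}: command not found")

    def duration(self):
        """Seconds from session start to the last recorded action."""
        return self.log[-1][0] if self.log else 0.0


# Hypothetical drill: one canned command, one unexpected one.
session = WarGameSession({"dig +short example.com": "192.0.2.10"})
print(session.run("dig +short example.com"))
print(session.run("failover --now"))
print(f"{len(session.log)} actions in {session.duration():.3f} s")
```

Reviewing the recorded log after a drill shows where responders hesitated, which is the same comparison as expected versus actual impact in a real incident.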