Flight training for DevOps & HumanOps - IncontroDevOps 2016

Jorge Salamero Sanz <jsalamero@serverdensity.com>
IncontroDevOps 1 April 2016
War Games - Flight training for DevOps

● Infrastructure automation
● Configuration automation
● Continuous testing
● Continuous deployment / delivery
● Monitoring
● Logs, error handling
● Feedback
● Human Ops
DevOps lifecycle

● Humans are part of any system
● Initial design, ongoing improvements
● Maintenance
● Upgrades
● Issues, Incident response
Humans in DevOps

● System issues = error rates + SLA + ...
● Human issues = alerts out of hours + interruptions + ...
● System issues = Human issues
Human issues = system issues

● System health impacts human health
● Human health impacts system health
Humans impact systems

● Downtime = loss of users, reputation, revenue
● Downtime caused by unreliable systems
● Unhealthy teams reduce reliability
● Unhealthy teams = loss of users, reputation, revenue
Humans impact business

● Slip
● Lapse
● Mistake
● Violation
● (Always, again, again)
Human risk

● Prepare and practice
● Respond
● Postmortem
Expect downtime

Real example
(small war story, won’t be long)

● Power failure to half of our servers
● Automated failover unavailable
(known failure condition)
● Manual DNS switch required
● Expected impact: 20 min
● Actual impact: 43 min
Incident example
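
The gap between the expected and actual impact can be put into numbers with a quick calculation. This is a minimal sketch: the variable names are illustrative, and only the 20 and 43 minute figures come from the slide.

```python
from datetime import timedelta

expected = timedelta(minutes=20)  # expected impact, from the slide
actual = timedelta(minutes=43)    # actual impact, from the slide

overrun = actual - expected       # extra downtime beyond the estimate
ratio = actual / expected         # timedelta / timedelta yields a float

print(f"Overrun: {overrun.total_seconds() / 60:.0f} min")
print(f"Actual impact was {ratio:.2f}x the estimate")
```

The response took more than twice as long as planned, which is the gap the following slides attribute to human factors.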

● Unfamiliarity with the process
● Pressure of a time-sensitive event
(panic effect)
● Escalation introduces delays
The Human Factor

● First responder, acknowledge alert
● Load incident response checklist
● Log into #ops-war-room in Slack
● Log incident into JIRA
● Begin investigation
General response process
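
One way to keep such a process at hand is to hold the checklist as data and record when each step is completed. The sketch below is an assumption, not the tooling used by the speaker: `ChecklistRun` and `RESPONSE_STEPS` are hypothetical names, and enforcing strict step order is a design choice made here for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Steps taken from the slide; the data structure itself is hypothetical.
RESPONSE_STEPS = [
    "Acknowledge alert (first responder)",
    "Load incident response checklist",
    "Log into #ops-war-room in Slack",
    "Log incident into JIRA",
    "Begin investigation",
]


@dataclass
class ChecklistRun:
    steps: list
    completed: list = field(default_factory=list)  # (step, UTC timestamp) pairs

    def next_step(self):
        """Return the next pending step, or None when all are done."""
        if len(self.completed) < len(self.steps):
            return self.steps[len(self.completed)]
        return None

    def complete(self, step):
        """Mark a step done; refuse to skip ahead so the order is preserved."""
        if step != self.next_step():
            raise ValueError(f"Expected {self.next_step()!r}, got {step!r}")
        self.completed.append((step, datetime.now(timezone.utc)))


run = ChecklistRun(RESPONSE_STEPS)
while (step := run.next_step()) is not None:
    run.complete(step)  # in a real incident, a human confirms each step

for step, when in run.completed:
    print(f"{when.isoformat()}  {step}")
```

The timestamps give the postmortem a record of how long each step took, without relying on anyone's memory of the incident.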

1. Extended use of checklists
Documented procedures

● The “limits of human memory and attention”
○ Complexity
○ Stress and fatigue
○ Ego
● Pilots, doctors, divers:
Bruce Willis Ruins All Films
(BCD, weights, releases, air, final)
Pre-flight checklists

1. Extended use of checklists
2. Don't follow blindly: use knowledge and experience
3. Independent system
4. Searchable
5. List of known issues and documented workarounds/fixes
Documented procedures

● Replica environment
● or mock command line
● Record actions and timing
● Multiple failures
● Unexpected results
Realistic scenarios: War Games
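
A mock command line that records actions and timing could look roughly like the sketch below. Everything in it is hypothetical: the `MockShell` class, the command names and the canned outputs are made up for illustration, and nothing is actually executed.

```python
import time


class MockShell:
    """Hypothetical mock command line for a war-game drill.

    Commands are looked up in a table of canned responses instead of
    being executed; each command and its elapsed time is recorded.
    """

    def __init__(self, responses, clock=time.monotonic):
        self.responses = responses
        self.clock = clock
        self.start = clock()
        self.log = []  # (seconds since start, command, output)

    def run(self, command):
        output = self.responses.get(command, "command not found")
        self.log.append((self.clock() - self.start, command, output))
        return output


# Illustrative scenario; the commands and outputs are invented.
shell = MockShell({
    "check-failover": "automated failover unavailable",
    "switch-dns": "DNS switched to secondary",
})
shell.run("check-failover")
shell.run("switch-dns")

for elapsed, command, output in shell.log:
    print(f"{elapsed:6.2f}s  {command} -> {output}")
```

Changing the canned responses between drills is one way to introduce multiple failures or unexpected results, and the log provides the record of actions and timing for the review afterwards.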

● Team and individual test of response
● Run real commands
● Training the people
● Training the procedures
● Training the tools
Results

● Increase confidence
● Reduce panic
● Better coordination
● Trust relationships
● Improves time to resolution
Human results

● Review
● Suggestions for improvements
● Do it again
● Scenario evolves
● People forget
loop(): review and repeat

● On call rotation design
● Alert prioritization
● Notification optimization
What else?

1. Humans are part of the system
2. Humans impact systems
3. Humans impact business
4. Human issues count as system issues
Human Ops principles

meetup.com/humanops-london/
Human Ops Meetup

Jorge Salamero Sanz
Chief Developer Advocate
@bencerillo
@serverdensity
our DevOps stories
blog.serverdensity.com
