Jorge Salamero Sanz <jsalamero@serverdensity.com>
IncontroDevOps 1 April 2016
War Games - Flight training for DevOps
● Infrastructure automation
● Configuration automation
● Continuous testing
● Continuous deployment / delivery
● Monitoring
● Logs, error handling
● Feedback
● Human Ops
DevOps lifecycle
● Humans are part of any system
● Initial design, ongoing improvements
● Maintenance
● Upgrades
● Issues, incident response
Humans in DevOps
● System issues = error rates + SLA + ...
● Human issues = alerts out of hours + interruptions + ...
● System issues = Human issues
Human issues = system issues
● System health impacts human health
● Human health impacts system health
Humans impact systems
● Downtime = loss of users, reputation, revenue
● Downtime caused by unreliable systems
● Unhealthy teams reduce reliability
● Unhealthy teams = loss of users, reputation, revenue
Humans impact business
● Slip
● Lapse
● Mistake
● Violation
● (Always, again, again)
Human risk
● Prepare and practice
● Respond
● Postmortem
Expect downtime
Real example
(small war story, won’t be long)
● Power failure to half of our servers
● Automated failover unavailable
(known failure condition)
● Manual DNS switch required
● Expected impact: 20 min
● Actual impact: 43 min
Incident example
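The gap between the expected and actual impact can be worked out directly from the two figures on the slide. This is only a small arithmetic sketch; the variable names are illustrative and not from the talk.

```python
# Figures taken from the incident slide (minutes of impact).
expected_min = 20
actual_min = 43

# How far the manual DNS switch overran the estimate.
overrun_min = actual_min - expected_min
overrun_pct = 100 * overrun_min / expected_min

print(f"Overrun: {overrun_min} min ({overrun_pct:.0f}% over the estimate)")
```

In other words, the outage lasted more than twice as long as planned.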
Lesson learned?
● Unfamiliarity with the process
● Pressure of time-sensitive event
(panic effect)
● Escalation introduces delays
The Human Factor
Handling the Human factor
● First responder, acknowledge alert
● Load incident response checklist
● Log into #ops-war-room in Slack
● Log incident into JIRA
● Begin investigation
General response process
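The response process above could be captured as an ordered checklist that records when each step was completed. What follows is a minimal sketch under assumptions: the `ChecklistRun` class and `RESPONSE_STEPS` list are hypothetical names made up for illustration, not a tool described in the talk, and nothing here talks to Slack or JIRA.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Steps copied from the slide; the wording is the only part taken from the talk.
RESPONSE_STEPS = [
    "First responder, acknowledge alert",
    "Load incident response checklist",
    "Log into #ops-war-room in Slack",
    "Log incident into JIRA",
    "Begin investigation",
]


@dataclass
class ChecklistRun:
    """Hypothetical helper: walks a checklist in order and timestamps each step."""
    steps: list
    done: list = field(default_factory=list)  # (step, UTC timestamp) pairs

    def next_step(self):
        """Return the next pending step, or None when the checklist is finished."""
        if len(self.done) < len(self.steps):
            return self.steps[len(self.done)]
        return None

    def complete(self, step):
        """Mark a step as done; refuse steps taken out of order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.done.append((step, datetime.now(timezone.utc)))


run = ChecklistRun(RESPONSE_STEPS)
while (step := run.next_step()) is not None:
    run.complete(step)
```

Enforcing the order is a design choice for this sketch only; a real checklist is meant to support judgment, not replace it.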
1. Extended use of checklists
Documented procedures
● The “limits of human memory and attention”
○ Complexity
○ Stress and fatigue
○ Ego
● Pilots, doctors, divers:
Bruce Willis Ruins All Films
(BCD, weights, releases, air, final)
Pre-flight checklists
1. Extended use of checklists
2. Not to follow blindly, use knowledge
and experience
3. Independent system
4. Searchable
5. List of known issues and
documented workarounds/fixes
Documented procedures
● Replica environment
● or mock command line
● Record actions and timing
● Multiple failures
● Unexpected results
Realistic scenarios: War Games
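"Record actions and timing" can be as simple as a log of who did what, and how long after the drill started. The sketch below is an assumption about how such a log might look; `DrillLog` and its methods are hypothetical names, and the participants and actions in the usage lines are invented for illustration.

```python
import time


class DrillLog:
    """Hypothetical war game log: stores (seconds since start, who, action)."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the log can be replayed or tested.
        self._clock = clock
        self._start = clock()
        self.entries = []

    def record(self, who, action):
        """Append an action with its elapsed time and return that time."""
        elapsed = self._clock() - self._start
        self.entries.append((elapsed, who, action))
        return elapsed

    def time_to_resolution(self):
        """Elapsed time of the last recorded action, or None if nothing was logged."""
        return self.entries[-1][0] if self.entries else None


# Example drill with made-up participants and actions.
log = DrillLog()
log.record("responder", "acknowledged alert")
log.record("responder", "switched DNS to secondary site")
```

Comparing `time_to_resolution()` across repeated drills gives a rough measure of whether practice is paying off.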
Results
● Team and individual test of response
● Run real commands
● Training the people
● Training the procedures
● Training the tools
Results
● Increase confidence
● Reduce panic
● Better coordination
● Trust relationships
● Improves time to resolution
Human results
● Review
● Suggestions for improvements
● Do it again
● Scenario evolves
● People forget
loop(): review and repeat
● On-call rotation design
● Alert prioritization
● Notification optimization
What else?
Human Ops
1. Humans are part of the system
2. Humans impact systems
3. Humans impact business
4. Human issues count as system issues
Human Ops principles
meetup.com/humanops-london/
Human Ops Meetup
www.CloudStatusApp.com
Jorge Salamero Sanz
Chief Developer Advocate
@bencerillo
@serverdensity
our DevOps stories
blog.serverdensity.com
