Atmosphere 2016 - Jorge Salamero Sanz - HumanOps, the impact of human health on operations

HumanOps is a set of principles which focus on the human aspects of running infrastructure.

It deliberately highlights the importance of the teams running systems, not just the systems themselves.

The health of your infrastructure is not just about hardware, software, automations and uptime - it also includes the health and wellbeing of your team.

The goal of HumanOps is to improve and maintain the good health of your team: easing communication, reducing fatigue and reducing stress.

Published in: Technology

  1. Jorge Salamero Sanz <jsalamero@serverdensity.com> Atmosphere Conference Krakow May 2016 HumanOps - the impact of health on operations
  2. Jorge Salamero @bencerillo @serverdensity blog.serverdensity.com
  3. www.CloudStatusApp.com
  4. ● Infrastructure automation ● Configuration automation ● Continuous testing ● Continuous deployment / delivery ● Monitoring ● Logs, error handling ● Feedback ● HumanOps
  5. ● Humans are part of any system ● Initial design, ongoing improvements ● Maintenance ● Upgrades ● Issues, incident response
  6. ● System issues = error rates + SLA + ... ● Human issues = alerts out of hours + interruptions + ... ● System issues = Human issues
  7. ● Downtime = loss of users, reputation, revenue ● Downtime caused by unreliable systems ● Unhealthy teams reduce reliability ● Unhealthy teams = loss of users, reputation, revenue
  8. ● Slip ● Lapse ● Mistake ● Violation ● (Always, again, again)
  9. What can we do?
  10. ● Prepare and practice ● Respond ● Postmortem
  11. Real example (small war story, won’t be long)
  12. ● Power failure to half of our servers ● Automated failover unavailable (known failure condition) ● Manual DNS switch required ● Expected impact: 20 min ● Actual impact: 43 min
  13. Lessons learned?
  14. ● Unfamiliarity with the process ● Pressure of time-sensitive event (panic effect) ● Escalation introduces delays
  15. Handling the Human factor
  16. ● First responder, acknowledge alert ● Load incident response checklist ● Log into #ops-war-room in Slack ● Log incident into JIRA ● Begin investigation
  17. 1. Extended use of checklists 2. Do not follow blindly; use knowledge and experience 3. Independent system 4. Searchable 5. List of known issues and documented workarounds/fixes
  18. ● The “limits of human memory and attention” ○ Complexity ○ Stress and fatigue ○ Ego ● Pilots, doctors, divers: “Bruce Willis Ruins All Films” (BCD, weights, releases, air, final)
  19. 1. Extended use of checklists 2. Do not follow blindly; use knowledge and experience 3. Independent system 4. Searchable 5. List of known issues and documented workarounds/fixes
  20. ● Realistic replica environment ● or mock command line ● Record actions and timing ● Multiple failures ● Unexpected results
  21. Results
  22. ● Team and individual test of response ● Run real commands ● Training the people ● Training the procedures ● Training the tools
  23. ● Increase confidence ● Reduce panic ● Better coordination ● Trust relationships ● Improves time to resolution
  24. ● Review ● Suggestions for improvements ● Do it again ● Scenario evolves ● People forget
  25. ● On-call rotation design ● Alert prioritization ● Notification optimization
  26. HumanOps
  27. 1. Humans are part of the system 2. Humans impact systems 3. Humans impact business 4. Human issues count as system issues
  28. meetup.com/humanops-london/ humanops.com
  29. serverdensity.com/conferences DEVOPSDAYS
  30. Jorge Salamero @bencerillo @serverdensity blog.serverdensity.com
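
The slides mention an incident response checklist (slide 16) and recording actions and timing during practice runs (slide 20), but show no tooling. The following is only a minimal sketch of how those two ideas could be combined; it is not the speaker's implementation. The checklist steps are copied from slide 16, while `DrillLog`, `complete`, `missing` and `overrun_minutes` are hypothetical names introduced for illustration.

```python
import time

# Steps copied from slide 16; everything else in this sketch is an assumption.
INCIDENT_CHECKLIST = [
    "First responder, acknowledge alert",
    "Load incident response checklist",
    "Log into #ops-war-room in Slack",
    "Log incident into JIRA",
    "Begin investigation",
]


class DrillLog:
    """Hypothetical helper: records when each checklist step was completed."""

    def __init__(self, checklist, clock=time.monotonic):
        self.checklist = list(checklist)
        self._clock = clock          # injectable so drills can be replayed in tests
        self._start = clock()        # reference point for elapsed times
        self.entries = []            # list of (step, seconds since start)

    def complete(self, step):
        """Mark a step as done and store the elapsed time in seconds."""
        if step not in self.checklist:
            raise ValueError(f"unknown checklist step: {step!r}")
        self.entries.append((step, self._clock() - self._start))

    def missing(self):
        """Return checklist steps that have not been completed, in order."""
        done = {step for step, _ in self.entries}
        return [step for step in self.checklist if step not in done]


def overrun_minutes(expected, actual):
    """Difference between actual and expected impact, in minutes."""
    return actual - expected
```

Applied to the figures on slide 12, `overrun_minutes(20, 43)` gives 23 minutes of impact beyond the expected 20. Reviewing `DrillLog.entries` after a practice run is one possible way to see which steps took longest or were skipped.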
