Normal accidents and outpatient surgeries

Presentation to Box TechOps team in 2011

Published in: Technology, Business

  1. 1. Normal Accidents and Outpatient Surgeries Resilience Engineering Done Right
  2. 2. Safety in a Complex and Changing Environment "...so safety isn't about the absence of something...that you need to count errors or monitor violations. But the presence of something. But the presence of what? When we need to find that things go right under difficult circumstances, it's mostly because of people's adaptive capability; their ability to recognize, adapt to, and absorb changes and disruptions, some of which might fall outside of what the system is designed or trained to handle" -Sidney Dekker
  3. 3. Safety in a Complex and Changing Environment "...so safety isn't about the absence of something...that you need to count errors or monitor violations. But the presence of something. But the presence of what? When we need to find that things go right under difficult circumstances, it's mostly because of people's adaptive capability; their ability to recognize, adapt to, and absorb changes and disruptions, some of which might fall outside of what the system is designed or trained to handle" -Sidney Dekker RESILIENCE
  4. 4. Vocabulary Lesson Continuous Integration: The ability to quickly make sure the system is ready for production.
  5. 5. Vocabulary Lesson Continuous Integration: The ability to quickly make sure the system is ready for production. Resilience: The intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances in order to sustain required operations.
  6. 6. Vocabulary Lesson Continuous Integration: The ability to quickly make sure the system is ready for production. Resilience: The intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances in order to sustain required operations. Maintainability: Characteristic of design and installation which determines the probability that a failed equipment, machine, or system can be restored to its normal state within a given timeframe.
  7. 7. Vocabulary Lesson Continuous Integration: The ability to quickly make sure the system is ready for production. Resilience: The intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances in order to sustain required operations. Maintainability: Characteristic of design and installation which determines the probability that a failed equipment, machine, or system can be restored to its normal state within a given timeframe. The SYSTEM includes all the hardware and software, but also all of the PEOPLE involved.
  8. 8. Maintainability = Uptime Goodness MTTR vs. MTBF
  9. 9. Maintainability = Uptime Goodness MTTR vs. MTBF Low MTTR > High MTBF
  10. 10. Maintainability = Uptime Goodness MTTR vs. MTBF Low MTTR > High MTBF Low MTTR = Better Uptime for most types of F
  11. 11. Maintainability = Uptime Goodness MTTR vs. MTBF Low MTTR > High MTBF Low MTTR = Better Uptime for most types of F Low MTTR Requires: • more useful metrics • intelligent data analysis • pre-planned, purposeful resilience • cooperation between application and infrastructure
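The MTTR vs. MTBF comparison on this slide can be made concrete with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below is illustrative only: the function names and the sample figures (one failure every 30 days, repaired in 4 hours versus 30 minutes) are assumptions, not numbers from the talk.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime per year for the given MTBF and MTTR."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


if __name__ == "__main__":
    mtbf = 30 * 24.0  # hypothetical: one failure every 30 days
    for mttr in (4.0, 0.5):  # hypothetical repair times, in hours
        print(
            f"MTTR {mttr:>4} h: availability {availability(mtbf, mttr):.5f}, "
            f"about {downtime_hours_per_year(mtbf, mttr):.1f} h down per year"
        )
```

With the failure rate held fixed, expected downtime shrinks roughly in proportion to MTTR, which is the sense in which a low MTTR buys better uptime.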
  12. 12. Your Average Operations Engineer
  14. 14. Automation as a Default: "One of the best ways to eliminate human problems is to take the human out of the problem. Machines are very good at doing things repeatedly and doing them the same way every single time. Humans are not good at this. Let the machines do it." Rapid Recovery: "Do we spend an unpredictable amount of time trying to solve some obscure issue, or do we simply recreate the instance providing the service from configuration management?" blog.lusis.org/blog/2011/10/18/deploy-all-the-things/
  15. 15. Automation as a Default: "One of the best ways to eliminate human problems is to take the human out of the problem. Machines are very good at doing things repeatedly and doing them the same way every single time. Humans are not good at this. Let the machines do it." Rapid Recovery: "Do we spend an unpredictable amount of time trying to solve some obscure issue, or do we simply recreate the instance providing the service from configuration management?" blog.lusis.org/blog/2011/10/18/deploy-all-the-things/ PUPPET + KICKSTART + Network Automation
  16. 16. Automation as a Default: "One of the best ways to eliminate human problems is to take the human out of the problem. Machines are very good at doing things repeatedly and doing them the same way every single time. Humans are not good at this. Let the machines do it." Rapid Recovery: "Do we spend an unpredictable amount of time trying to solve some obscure issue, or do we simply recreate the instance providing the service from configuration management?" blog.lusis.org/blog/2011/10/18/deploy-all-the-things/ PUPPET + KICKSTART + Network Automation ESPER + HEALTHCHECK + NAGIOS + SPLUNK + OHSHIT
  17. 17. Comfortable Changes 1) Are Small • Many Small Changes = Fewer Incidents with lower MTTR
  18. 18. Comfortable Changes 1) Are Small • Many Small Changes = Fewer Incidents with lower MTTR 2) Are Reproducible RPM: • Really Peaceful Mornings • Reduce Paging Monitors • Reusable Provisioning Methods
  19. 19. Comfortable Changes 1) Are Small • Many Small Changes = Fewer Incidents with lower MTTR 2) Are Reproducible RPM: • Really Peaceful Mornings • Reduce Paging Monitors • Reusable Provisioning Methods Rule # 81: If you are logging into servers, you are doing it wrong.
  20. 20. Comfortable Changes 3) Are easily understood by your most junior team members
  21. 21. Comfortable Changes 3) Are easily understood by your most junior team members Rule # 4: Keep it Simple, because you are smart. Do not make it overly complex because you can.
  22. 22. Comfortable Changes 3) Are easily understood by your most junior team members Rule # 4: Keep it Simple, because you are smart. Do not make it overly complex because you can. 4) Can be deployed to a subset of production systems
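Point 4, deploying to a subset of production systems, is often done by picking a stable slice of hosts to receive a change first. The following is a minimal sketch of one way to do that, assuming hosts are identified by name; `in_canary`, `canary_hosts`, and the `example.com` host names are hypothetical and not taken from the talk.

```python
import hashlib
from typing import Iterable, List


def in_canary(host: str, percent: int) -> bool:
    """Return True if `host` falls in the first `percent` of 100 hash buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def canary_hosts(hosts: Iterable[str], percent: int) -> List[str]:
    """Select the hosts that should receive a change at this rollout stage."""
    return [h for h in hosts if in_canary(h, percent)]


if __name__ == "__main__":
    fleet = [f"web{i:02d}.example.com" for i in range(20)]  # hypothetical fleet
    for stage in (10, 50, 100):
        print(stage, canary_hosts(fleet, stage))
```

Because the bucket depends only on the host name, the same hosts are chosen on every run, and each larger stage is a superset of the previous one, so a rollout can widen without reshuffling which machines already have the change.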
  23. 23. Comfortable Changes 5) Follow Process
  24. 24. Comfortable Changes 5) Follow Process Change control, deployment processes, peer review, all of these things matter for a world-class OPS organization.
  25. 25. Comfortable Changes 6) Have been approved by a GO / NO-GO process with all relevant parties checking in.
  26. 26. Comfortable Changes 6) Have been approved by a GO / NO-GO process with all relevant parties checking in. Ensure that all teams involved in a change have signed off, including ON-CALL and CUSTOMER SERVICE
  27. 27. Tracking Changes
  28. 28. Small Changes John Allspaw presented these graphs of data gathered at Etsy. More, smaller deployments mean faster recovery (lower MTTR), which means fewer minutes of disruption.
  29. 29. Operations Meta-Metrics When in doubt, COLLECT DATA, Build a Timeline! Things to Monitor: Changes (who/what/when/type) Incidents (Type/Severity/Duration) Responses to Incidents (TTD/TTR) Things to Collect: IRC/Jabber Logs Jira Logs Search your Data: Use HBASE+PIG/HIVE, ESPER, SOLR and SPLUNK Store everything, even stuff you don't yet know how to use.
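"Build a Timeline" amounts to merging events from several sources (changes, incidents, chat logs) into one time-ordered view. Here is a rough sketch of that idea, assuming each source can be reduced to records carrying a timestamp; the `Event` fields, `build_timeline`, and `events_before` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    when: datetime  # when it happened
    source: str     # e.g. "deploy", "incident", "chat"
    who: str        # person or system responsible
    what: str       # free-text description


def build_timeline(*streams: Iterable[Event]) -> List[Event]:
    """Merge any number of event streams into one list ordered by time."""
    merged = [event for stream in streams for event in stream]
    return sorted(merged, key=lambda event: event.when)


def events_before(timeline: List[Event], moment: datetime,
                  window: timedelta) -> List[Event]:
    """Return events in the `window` leading up to (and including) `moment`."""
    return [e for e in timeline if moment - window <= e.when <= moment]


if __name__ == "__main__":
    t0 = datetime(2011, 11, 1, 12, 0)  # hypothetical timestamps
    changes = [Event(t0, "deploy", "alice", "pushed build 123")]
    incidents = [Event(t0 + timedelta(minutes=5), "incident", "nagios", "CPU at 100%")]
    for e in build_timeline(incidents, changes):
        print(e.when.isoformat(), e.source, e.who, e.what)
```

Looking at what changed in the minutes before an incident began is usually the first question a timeline like this answers.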
  30. 30. Tracking Incidents - MTTD 1. Frequency 2. Severity 3. Root Cause: Five Whys Mentality • Why was the website down? The CPU utilization on all our front-end servers went to 100%. • Why did the CPU usage spike? A new bit of code contained an infinite loop! • Why did that code get written? So-and-so made a mistake. • Why did his mistake get checked in? He didn't write a unit test for the feature. • Why didn't he write a unit test? He's a new employee, and he was not properly trained. 4. Time-to-Detect 5. Time-to-Resolve
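Time-to-Detect and Time-to-Resolve become MTTD and MTTR once they are averaged over incidents. The sketch below shows one way to compute them, assuming each incident record carries three timestamps; the `Incident` fields are hypothetical, and measuring time-to-resolve from the start of the incident (rather than from detection) is a choice made here, not something the slides specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass(frozen=True)
class Incident:
    started: datetime   # when the problem actually began
    detected: datetime  # when monitoring or a human noticed it
    resolved: datetime  # when service was confirmed stable again


def mean_time_to_detect(incidents: List[Incident]) -> timedelta:
    """Average gap between an incident starting and being detected."""
    seconds = [(i.detected - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mean_time_to_resolve(incidents: List[Incident]) -> timedelta:
    """Average gap between an incident starting and being resolved (assumed definition)."""
    seconds = [(i.resolved - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


if __name__ == "__main__":
    t0 = datetime(2011, 11, 1, 12, 0)  # hypothetical incidents
    history = [
        Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=30)),
        Incident(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=90)),
    ]
    print("MTTD:", mean_time_to_detect(history))
    print("MTTR:", mean_time_to_resolve(history))
```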
  31. 31. Tracking Incidents - MTTD Rule # 18: Monitor EVERYTHING, alert on actionable items only, record everything else for trend information. Rule # 20: Do not make the monitoring system so noisy it is useless.
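Rules 18 and 20 describe a routing decision: every check result is recorded, but only failing, actionable ones reach a human. A minimal sketch of that split follows, with plain lists standing in for a trend store and a pager; `CheckResult`, `route`, and the `actionable` flag are hypothetical and do not reflect the API of Nagios or any other tool named in the deck.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CheckResult:
    name: str         # which check produced this result
    ok: bool          # did the check pass?
    actionable: bool  # is there something an on-call person can do about it?


def route(result: CheckResult, trend_log: List[CheckResult],
          pager: List[CheckResult]) -> None:
    """Record every result for trending; page only on actionable failures."""
    trend_log.append(result)
    if not result.ok and result.actionable:
        pager.append(result)


if __name__ == "__main__":
    trend_log: List[CheckResult] = []
    pager: List[CheckResult] = []
    results = [
        CheckResult("disk_usage_trend", ok=False, actionable=False),
        CheckResult("frontend_http", ok=False, actionable=True),
        CheckResult("frontend_http", ok=True, actionable=True),
    ]
    for r in results:
        route(r, trend_log, pager)
    print(len(trend_log), "recorded;", len(pager), "paged")
```

Keeping the pager list short while the trend log grows without limit is the point: the data is still there for later analysis, but the on-call engineer only hears about things they can act on.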
  32. 32. Tracking Incidents - MTTD Data Points to source these metrics from: Output from Application, CLOG, Puppet, Jabber, Jira, healthcheck, hardware, Eluna, Nagios....all collectible data
  33. 33. Handling Incident Response - MTTR Detect a Problem Communicate to Support/Community/Executives Begin to take Action Communicate to Support/Community/Executives Coordinate Troubleshooting/Diagnosis Communicate to Support/Community/Executives Confirm Stability, Resolving Steps Communicate to Support/Community/Executives
  34. 34. Handling Incident Response - MTTR Rule # 24: Assign people to be point people for every bit of technology. Rule # 25: Assign Backup People to those People. Rule # 12: Know your bottlenecks, and how to spot them. Rule # 42: Create gigantic poster size drawings of the physical layouts of your data center. Rule # 43: Create gigantic poster size drawings of the logical flows of each part of your product.
  35. 35. XKCD #974: I find that when someone is taking time to do something right in the present, they're a perfectionist with no ability to prioritize, whereas when someone took time to do something right in the past, they're a master artisan of great foresight.