SPOF - Single "Person" of Failure

A talk from the DevOps Days Silicon Valley 2015 conference that describes the signs of having, or being, a single-point-of-failure expert on your system, and ways to solve the problem.

Published in: Technology

  1. Single Point of Failure… Expert. Sasha Rosenbaum, @DivineOps
  2. Who am I? Sasha Rosenbaum: Azure & DevOps consultant at 10th Magnitude for 4 years; co-organizer of the DevOps Days Chicago conference and the Chicago Azure meetup
  3. What is a Single Point of Failure?
  4. A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working
  5. High Availability: achieving redundancy by removing single points of failure; having reliable cross-over capabilities to switch between components; detecting failures as they occur, so that cross-over can be initiated
  6. This is complicated
  7. Architecting for HA
  8. How is the entire system down?
  9. We forgot a dependency!
  10. Oh…
  11. Just imagine buying a server that has an uptime of roughly 16 hours a day, with interruptions; is the single one of its kind; and cannot be replicated!
  12. Humans are NOT highly available
  13. How did we get here? Lack of budget; lack of people; human nature
  14. How to recognize that you have a problem?
  15. 1
  16. Keys to the Kingdom
  17. TO MY PRODUCTION SERVER
  18. Even when the systems are automated there are still humans who manage them
  19. Why is there a single admin? The situation evolved organically from having a small team; someone took over deliberately
  20. Role Based Access: grant access based on a role/group; admin group size > 1; service accounts
  21. Make sure that the person on call has the necessary access to fix the problem
  22. TRUST YOUR PEOPLE!!!
  23. 2
  24. Beware of the Expert!
  25. “This will take 15 minutes to fix, and 8 hours to explain”
  26. We cannot afford the loss of productivity!
  27. Can you afford losing this knowledge?
  28. Delegate to Juniors
  29. Juniors are wonderful people. They ask tough questions.
  30. Your new hires haven’t yet caught the “This is how it’s always been” virus
  31. You are emotionally invested in your code. It is hard not to get protective of it.
  32. Documentation: documents, readme, comments, tests, automation, features
  33. 3
  34. “I cannot afford to take vacation!”
  35. Job security?
  36. Productivity?
  37. Hours / Productivity
  38. Research shows that working longer hours DOES NOT increase productivity
  39. You need rest to be at your best!
  40. Cell phones are the single worst thing that happened to people AND businesses in the last century
  41. If people were actually unreachable we would find a more reliable way to solve problems
  42. Mandatory Vacation
  43. Game Days
  44. Say NO to having a Single PERSON of Failure ;-)
  45. Great job, DoD Silicon Valley!
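
The advice on slides 20 and 21 (grant access by role/group, keep every admin group larger than one person, and make sure the person on call has the access they need) can be sketched as a small audit script. This is only an illustration and not something shown in the talk: the `Group` model, the function names, and the sample users are all hypothetical, and a real check would read group membership from your directory or cloud provider instead of a hard-coded list.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """A role/group that grants access (hypothetical model, not a real API)."""
    name: str
    members: set = field(default_factory=set)
    is_admin: bool = False


def single_person_admin_groups(groups):
    """Return the names of admin groups with fewer than two members."""
    return [g.name for g in groups if g.is_admin and len(g.members) < 2]


def on_call_lacks_access(on_call, required_groups, groups):
    """Return the required group names the on-call person is not a member of."""
    by_name = {g.name: g for g in groups}
    return [
        name for name in required_groups
        if name not in by_name or on_call not in by_name[name].members
    ]


# Sample data (assumed for illustration only).
groups = [
    Group("prod-admins", {"alice"}, is_admin=True),
    Group("db-admins", {"alice", "bob"}, is_admin=True),
    Group("developers", {"alice", "bob", "carol"}),
]

spof_groups = single_person_admin_groups(groups)
missing = on_call_lacks_access("bob", ["prod-admins", "db-admins"], groups)

print("Admin groups with a single person of failure:", spof_groups)
print("Groups the on-call person cannot act in:", missing)
```

Running a check like this on a schedule, or before each on-call handover, would flag the "keys to the kingdom" situation before an incident does.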
