Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

How to be Wrong (or How to be Successful at Being Wrong)

97 views

Published on

In this talk Russ Miles, CEO at ChaosIQ, explores how to turn "Being Wrong" into a super-power through establishing a Resilience Engineering Capability that practices Chaos Engineering.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

How to be Wrong (or How to be Successful at Being Wrong)

  1. 1. How to be Wrong or … “How to be Great at Being Wrong” Russ Miles - @russmiles
  2. 2. Who are you?
  3. 3. @russmiles
  4. 4. Also…
  5. 5. Russ Miles CEO, co-founder Your Host Today Sylvain Hellegouarch CTO, co-founder ChaosIQ
  6. 6. “To support our users in establishing their own 
 Resilience Engineering Capability”
  7. 7. What is “Wrong”?
  8. 8. And why are we scared of it?
  9. 9. “not correct or true; incorrect.”
  10. 10. “an injurious, unfair, or unjust act”…
  11. 11. …“action or conduct inflicting harm without due provocation or just cause”
  12. 12. “a violation or invasion of the legal rights of another”
  13. 13. When are we ever wrong?
  14. 14. “the state of being mistaken or incorrect”
  15. 15. I’ve got some bad news for you…
  16. 16. We’re wrong all the time.
  17. 17. We’re wrong all the time.
  18. 18. Why is wrong scary?
  19. 19. Risk?
  20. 20. Consequences.
  21. 21. Why us?!
  22. 22. Two factors
  23. 23. Feature Velocity
  24. 24. Striving for Reliability
  25. 25. Feature Velocity VS. Reliability
  26. 26. Good news!
  27. 27. No conflict!
  28. 28. Feature Velocity VS. Reliability
  29. 29. Feature Velocity + Reliability
  30. 30. But … Microservices!?
  31. 31. What about tests? gates? pipelines? isolation?
  32. 32. We’re covered…
  33. 33. A Story…
  34. 34. And it gets worse…
  35. 35. Dark Debt…
  36. 36. And over to the business…
  37. 37. “One Hour of Downtime Costs > $100K For 95% of Enterprises” http://itic-corp.com/blog/2013/07/one-hour-of-downtime-costs-100k-for-95-of-enterprises/
  38. 38. “lost revenue and lost end user productivity" http://itic-corp.com/blog/2013/07/one-hour-of-downtime-costs-100k-for-95-of-enterprises/
  39. 39. “not take into account the cost of additional penalties for regulatory non-compliance or “good will” gestures made to the organization’s customers and business partners that were negatively impacted by a system or network failure. In fact, these two conditions can cause downtime costs to skyrocket even further” http://itic-corp.com/blog/2013/07/one-hour-of-downtime-costs-100k-for-95-of-enterprises/
  40. 40. Bad news…
  41. 41. You’re not covered
  42. 42. Microservices-based systems tend to look like…
  43. 43. “To be fully described, there are many details, not few” John Allspaw: https://queue.acm.org/detail.cfm?id=2353017
  44. 44. “The rate of change is high; the systems change before a full description (and therefore understanding) can be completed.” John Allspaw: https://queue.acm.org/detail.cfm?id=2353017
  45. 45. “How components function is partly unknown, as they resonate with each other across varying conditions.” John Allspaw: https://queue.acm.org/detail.cfm?id=2353017
  46. 46. “Processes are heterogeneous and possibly irregular.” John Allspaw: https://queue.acm.org/detail.cfm?id=2353017
  47. 47. Reactions?
  48. 48. Ugly Risk Avoidance!
  49. 49. Blame.
  50. 50. A better reaction?
  51. 51. Being wrong is a 
 key software skill
  52. 52. Get Better At Being Wrong™
  53. 53. Make it Safe(r) to be wrong.
  54. 54. Technical Robustness
  55. 55. Zero Blame
  56. 56. Go Beyond Blame
  57. 57. Remember “Dark Debt”
  58. 58. Deliberately Practice Being Wrong
  59. 59. “prepare for undesirable circumstances” - John Allspaw
  60. 60. Deliberately Practice Being Wrong = Chaos Engineering
  61. 61. Invest in Resilience
  62. 62. Resilience is a Learning Loop
  63. 63. “Normal”
  64. 64. Outage “Normal”
  65. 65. Outage Detection Diagnosis Fix “Normal”
  66. 66. Outage Detection Diagnosis Fix “Normal” Learning
  67. 67. Outage Detection Diagnosis FixImprovement (Robustness) “Normal” Learning
  68. 68. “Never Let an Outage Go To Waste” - Casey Rosenthal
  69. 69. Post-mortem Learning 
 is Good
  70. 70. Pre-mortem Learning 
 is Better!
  71. 71. “Chaos Engineering is the discipline of experimenting on a distributed system
 in order to build confidence in the system’s capability to withstand turbulent conditions in production.” - principlesofchaos.org
  72. 72. “Normal”
  73. 73. Game Day / Automated Chaos Experiment “Normal”
  74. 74. Game Day / Automated Chaos Experiment Detection Diagnosis Fix “Normal”
  75. 75. Game Day / Automated Chaos Experiment Detection Diagnosis FixImprovement (Robustness) “Normal” Learning
  76. 76. Chaos Infrastructure Platform Applications People, Practices & Process Game Days Automated
 Experiments Automated
 Experiments Automated
 Experiments
  77. 77. We can learn after outages… but it’s even better to 
 learn from weaknesses before an outage.
  78. 78. Being Wrong can be a super power, if it leads to learning
  79. 79. Establish a Platform for 
 Pre-mortem, Deliberate Practice “Being Wrong”
  80. 80. Establish a Platform for 
 Pre-mortem, Deliberate Practice “Chaos Engineering”
  81. 81. Automate Chaos Experiments Chaos Engineering Control Plane
  82. 82. Reading Recommendations
  83. 83. Later on today… Sylvain Hellegouarch CTO, co-founder +
  84. 84. Start a conversation… ChaosIQ www.chaosplatform.com contact@chaosiq.io O’Reilly Safari Online Courses Available Upcoming Conferences DevOpsCon Munich KubeCon Seattle O’Reilly Book on
 “Chaos Engineering” 
 Coming Soon! https://www.chaostoolkit.org Slack: https://join.chaostoolkit.org

×