CHAOS ENGINEERING – OR LET'S SHAKE THE TREE
Jimmy Dahlqvist, Knowit
" Failures are a given and
everything will eventually
fail over time "
Werner Vogels
CTO – Amazon.com
TRIBUTE
• Nora Jones
• Adrian Cockcroft
• Adrian Hornsby
History and background
2004 Amazon – Jesse Robbins, "Master of Disaster"
2010 Netflix – Greg Orzell, Chaos Monkey
2012 Netflix – open-sources the Simian Army
2016 Gremlin Inc. is founded
2017 Netflix – Chaos Engineering book. Chaos toolset.
2018 The concept spreads; ChaosConf starts.
" By running experiments on a
regular basis that simulate a
Regional outage, we were able to
identify any systemic weaknesses
early and fix them. "
Netflix Blog
Chaos Engineering is the discipline of
experimenting on a system
in order to build confidence in the system’s
capability to withstand turbulent conditions
in production.
Unit testing
Input -> Component X -> Output
Integration testing
Input -> Component A -> Output / Input -> Component B -> Output
Distributed System
Input -> Distributed System -> Output (corrupt?)
" Chaos doesn't cause problems.
It reveals them. "
Nora Jones
Slack - Head of chaos engineering
and human factors
Before practicing chaos
Socialize
Start small
Use an opt-in model, not an opt-out one.
Only include services whose owners have agreed to take part.
Start with a success!
Don't start in production.
Steady state
Define the steady state. Build a hypothesis about the
steady state: what does our system look like when it's
behaving normally?
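A steady-state hypothesis can be written down as a check that a chosen metric stays close to its normal level while the experiment runs. The sketch below is only an illustration: the function name `steady_state_holds`, the 5 % tolerance, and the sample numbers are assumptions, not values from the talk.

```python
from statistics import mean


def steady_state_holds(baseline, observed, tolerance=0.05):
    """Return True if the observed mean stays within `tolerance`
    (relative deviation) of the baseline mean."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean must be non-zero")
    deviation = abs(mean(observed) - base) / base
    return deviation <= tolerance


# Hypothetical samples of a business metric, before and during an experiment.
baseline_samples = [100, 102, 98, 101, 99]
experiment_samples = [97, 99, 96, 98, 100]

if __name__ == "__main__":
    ok = steady_state_holds(baseline_samples, experiment_samples)
    print("hypothesis holds" if ok else "hypothesis rejected")
```

The tolerance is the part worth discussing with the business side: it states how much deviation from normal behaviour is acceptable before the experiment counts as having found a weakness.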
Monitoring
Understand your key business metrics and KPIs.
Netflix's key business metric is SPS (stream starts per second).
First experiment
Graceful restarts and degradations
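One way to picture a first experiment on graceful degradation: inject a failure into a single dependency and check that the caller falls back instead of failing outright. Everything in this sketch is hypothetical (the `fetch_recommendations` dependency, the `failure_rate` knob, the fallback list); it stands in for whatever service and fault-injection tool is actually used.

```python
import random

DEFAULT_TITLES = ["popular-1", "popular-2", "popular-3"]


class DependencyDown(Exception):
    """Raised when the (simulated) dependency is unavailable."""


def fetch_recommendations(user_id, failure_rate=0.0, rng=random):
    # Hypothetical dependency call. The injected failure stands in for
    # a terminated instance or a dropped connection.
    if rng.random() < failure_rate:
        raise DependencyDown("recommendation service unavailable")
    return [f"personalised-{user_id}-{i}" for i in range(3)]


def home_page(user_id, failure_rate=0.0, rng=random):
    """Degrade to a static list instead of failing the whole page."""
    try:
        return fetch_recommendations(user_id, failure_rate, rng)
    except DependencyDown:
        return DEFAULT_TITLES


if __name__ == "__main__":
    print(home_page(7))                    # dependency healthy
    print(home_page(7, failure_rate=1.0))  # dependency always fails
```

The experiment passes if the page still renders with the fallback content when the failure is injected.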
Design your next experiments
" You have to know the past to
understand the present. "
Carl Sagan
Move to production
Don't forget about your customers!
Don't destroy the customer experience!
Make sure you can abort!
Only run during business hours.
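The points above (be able to abort, run only during business hours) can be sketched as a small experiment runner. This is a minimal illustration under stated assumptions: the 09:00 to 17:00 weekday window, the `is_healthy` and `abort_requested` callbacks, and the step functions are all hypothetical, and a real runner would also roll back whatever it injected.

```python
from datetime import datetime, time


def within_business_hours(now, start=time(9, 0), end=time(17, 0)):
    """Monday to Friday, between start and end (assumed office hours)."""
    return now.weekday() < 5 and start <= now.time() < end


def run_experiment(steps, is_healthy, abort_requested, now):
    """Run injection steps one at a time and stop at the first sign of trouble.

    Returns a tuple (completed_steps, reason).
    """
    if not within_business_hours(now):
        return 0, "outside business hours"
    completed = 0
    for step in steps:
        if abort_requested():          # operator pressed the abort button
            return completed, "aborted by operator"
        step()
        completed += 1
        if not is_healthy():           # monitoring shows a deviation
            return completed, "aborted: steady state violated"
    return completed, "finished"


if __name__ == "__main__":
    steps = [lambda: print("inject latency"), lambda: print("kill instance")]
    print(run_experiment(steps, lambda: True, lambda: False, datetime.now()))
```

Checking `abort_requested` before every step is what makes the abort button testable: it can be exercised in a dry run before any real fault is injected.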
Automate everything
Run often
Automatic safeguards
Percentage of traffic
Netflix Chaos Automation Platform (ChAP)
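Limiting an experiment to a percentage of traffic can be done by hashing a request or customer identifier into a bucket. The sketch below shows that general technique only; it is not a description of how ChAP works, and the function name `in_experiment` and the identifier format are assumptions.

```python
import hashlib


def in_experiment(request_id, percent):
    """Deterministically place roughly `percent` % of ids in the experiment."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


if __name__ == "__main__":
    ids = [f"request-{i}" for i in range(20_000)]
    exposed = sum(in_experiment(r, 1) for r in ids)
    print(f"{exposed} of {len(ids)} requests would see the injected failure")
```

Because the decision depends only on the identifier, the same customer gets the same treatment on every request, and setting `percent` to 0 acts as an immediate safeguard.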
Change of mindset
From "What happens IF this fails?"
to "What happens WHEN this fails?"
Lessons learnt
Takeaways
Everyone can be doing Chaos Engineering
Chaos Engineering is a learning opportunity
Be conscious about customers, involve business
Thank you!


Editor's Notes

  • #34  Do run it during business hours, when everyone is at the office. Monitor! Monitor! Monitor! At the first sign of a problem, abort the experiment. Make sure you can abort, and validate beforehand that aborting works: when you hit the abort button, the experiment must actually stop and not just keep running.