4. fullstaq.com
About myself
● 20+ years of experience in the IT industry
● Last 8 years doing some DevOps
● Since 2015 active with Chaos Engineering
● Love everything around Observability
● Supporter of Cloud Native
● Currently working as Product Owner @ Albert Heijn
● Responsible for
○ Observability
○ Monitoring
○ Logging
○ Alerting
○ Performance testing
○ Chaos engineering
○ Some Elastic and some Cilium
11.
Historical facts
● Jesse Robbins, the "Master of Disaster", started Game days at Amazon.
● Netflix introduced the term Chaos (2010) to fill the gap in proper resilience testing in the Cloud.
● In 2011 the Simian Army was born, best known for Chaos Monkey.
● In 2016 the Principles of Chaos became publicly available.
● In 2018 the first ChaosConf was organised by Gremlin (Kolton Andrus).
● Since 2020 Chaos Engineering has been part of Well-Architected frameworks. See this as the start of major adoption.
Timeline: 2010, 2011, 2016, 2018, 2020, 2021
https://principlesofchaos.org/
15.
It requires more than just Tools
Observability
SLO/SLI
Game days
Analysis
Evaluation
CI/CD
Testing
Chaos Tools
16.
But how do we start?
● Start with organising an event like a
Game day with the product and other
relevant teams like Incident Commands.
● Ensure that the goals and scope for
experiments are set and agreed.
● Ensure that you have enough time for
creating the hypotheses.
● Run the experiment!
● Evaluate the results and record the
evidence.
In fact, we are building a discipline.
17.
Game days explained
● Creating a culture of
experimentation.
● Repeatable exercises to learn
from failures.
● A well-known method AWS uses to
validate the resilience of its
services and major-incident
process.
● Working on collaborative trust.
18.
Goals made easy by SRE
● As a team, set your Objectives
● Use techniques like SLO/SLI
● Prepare your Observability
systems with the appropriate
Indicators to measure.
● Use these to validate your
expected system behaviour.
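To make the SLO/SLI step concrete, here is a minimal Python sketch of how an indicator could be checked against an objective. The `Slo` class, the `checkout-availability` name, the 99% target and the event counts are all hypothetical; in practice the counts would come from your Observability system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical Service Level Objective for a single indicator."""
    name: str
    target: float  # required fraction of good events, e.g. 0.99


def availability_sli(good_events: int, total_events: int) -> float:
    """SLI as the fraction of good events; treated as 1.0 when there was no traffic."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def meets_slo(sli: float, slo: Slo) -> bool:
    """True when the measured indicator is at or above the objective."""
    return sli >= slo.target


# Illustrative numbers only (assumption, not taken from a real system).
checkout_slo = Slo(name="checkout-availability", target=0.99)
sli = availability_sli(good_events=9_950, total_events=10_000)
print(f"{checkout_slo.name}: SLI={sli:.4f}, within SLO={meets_slo(sli, checkout_slo)}")
```

The same check can be evaluated before, during and after an experiment to validate the expected system behaviour.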
20.
A simple flow that makes the Scientific part work
Flow: Hypothesis → Experiment → Deviation → Evidence
Supporting elements: SLO/SLI, Game days, Chaos Tools, Observability, Testing, Analysis, Evaluation, CI/CD
21.
Building a hypothesis
● Build a hypothesis around the Steady-state.
● Steady-state is when customers are happy.
● Describe potential or real-world outages that
can happen (or have happened) due to infrastructure,
application or connectivity failures, and their
hard-to-predict cascading effects.
● Get agreement on the Blast radius (scope)
● Describe the fault-injection implementation to
fulfill the experiment.
● Choose a strategy for executing the
fault-injection experiment. Start small.
Failure types: Latency, Routing failures, Unavailability, Connectivity failures, Saturation, Data corruption
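The elements of a hypothesis listed above could be captured in a small record so that nothing is forgotten before the Game day. This is only a sketch: the `Hypothesis` fields, the `ready_to_run` check and the example values are assumptions made for illustration, not a prescribed format.

```python
from dataclasses import dataclass
from enum import Enum


class FaultType(Enum):
    """Failure types named on the slide."""
    LATENCY = "latency"
    ROUTING_FAILURE = "routing failure"
    UNAVAILABILITY = "unavailability"
    CONNECTIVITY_FAILURE = "connectivity failure"
    SATURATION = "saturation"
    DATA_CORRUPTION = "data corruption"


@dataclass(frozen=True)
class Hypothesis:
    steady_state: str          # what "customers are happy" looks like
    outage_scenario: str       # potential or real-world outage being explored
    fault: FaultType           # kind of fault to inject
    blast_radius: str          # scope of the experiment
    blast_radius_agreed: bool  # has the scope been agreed with the teams?
    injection_method: str      # how the fault will be injected

    def ready_to_run(self) -> bool:
        """Ready only when every field is filled in and the scope is agreed."""
        required = (self.steady_state, self.outage_scenario,
                    self.blast_radius, self.injection_method)
        return self.blast_radius_agreed and all(text.strip() for text in required)


# Hypothetical example values, starting small.
hypothesis = Hypothesis(
    steady_state="99% of checkout requests succeed within 500 ms",
    outage_scenario="Payment provider responds slowly",
    fault=FaultType.LATENCY,
    blast_radius="One checkout instance in the test environment",
    blast_radius_agreed=True,
    injection_method="Add 200 ms delay to outgoing payment calls",
)
print(hypothesis.fault.value, "ready:", hypothesis.ready_to_run())
```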
22.
Execute the experiment
● Ensure everybody is well
informed.
● Ensure that Observability
tools are set up and ready.
● Record evidence!
Start learning from Failure!
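As an illustration of running an experiment while recording evidence, the sketch below injects latency into a simulated dependency and always rolls the fault back afterwards. `FakeService`, `inject_latency` and `run_experiment` are hypothetical names; a real experiment would use a Chaos tool and real measurements instead of the simulated 20 ms baseline.

```python
import json
from contextlib import contextmanager
from datetime import datetime, timezone


class FakeService:
    """Hypothetical stand-in for a real dependency; returns a simulated latency in ms."""

    def __init__(self) -> None:
        self.extra_latency_ms = 0

    def call(self) -> int:
        return 20 + self.extra_latency_ms


@contextmanager
def inject_latency(service: FakeService, extra_ms: int):
    """Inject latency for the duration of the block and always roll back."""
    service.extra_latency_ms = extra_ms
    try:
        yield service
    finally:
        service.extra_latency_ms = 0


def run_experiment(service: FakeService, extra_ms: int, samples: int = 5) -> dict:
    """Measure before, during and after the fault, and return the evidence."""
    evidence = {
        "started_at": datetime.now(timezone.utc).isoformat(),
        "fault": f"latency +{extra_ms} ms",
        "baseline_ms": [service.call() for _ in range(samples)],
    }
    with inject_latency(service, extra_ms):
        evidence["during_fault_ms"] = [service.call() for _ in range(samples)]
    evidence["after_rollback_ms"] = [service.call() for _ in range(samples)]
    return evidence


service = FakeService()
evidence = run_experiment(service, extra_ms=200)
print(json.dumps(evidence, indent=2))  # store this record as your evidence
```

Wrapping the fault in a context manager is one way to guarantee the rollback even when the measurement itself fails.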
23.
Learn from Failure results
● Always ensure experiment results are
recorded, (historically) available and
analysed. This is your evidence!
● Look for deviations from the steady-state,
which your Observability system can
reveal.
● Ensure that the whole experiment is
evaluated, preferably using a post-mortem
analysis.
● Extract improvements!
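Looking for deviations from the steady-state can be sketched as a comparison between baseline metrics and the values observed during the experiment. The metric names, the numbers and the 10% tolerance below are assumptions for illustration only.

```python
from __future__ import annotations


def find_deviations(
    steady_state: dict[str, float],
    observed: dict[str, float],
    tolerance: float = 0.10,
) -> dict[str, float]:
    """Return the relative change of every metric that moved more than `tolerance`."""
    deviations = {}
    for metric, baseline in steady_state.items():
        if baseline == 0:
            raise ValueError(f"baseline for {metric!r} must be non-zero")
        change = (observed[metric] - baseline) / baseline
        if abs(change) > tolerance:
            deviations[metric] = change
    return deviations


# Hypothetical values as they might be read from an Observability system.
steady_state = {"p95_latency_ms": 200.0, "error_rate_pct": 0.5, "orders_per_min": 120.0}
observed = {"p95_latency_ms": 210.0, "error_rate_pct": 2.0, "orders_per_min": 118.0}

deviations = find_deviations(steady_state, observed)
for metric, change in deviations.items():
    print(f"{metric}: {change:+.0%} versus steady-state")
```

Each reported deviation is a candidate topic for the post-mortem analysis and a possible source of improvements.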
24.
Key takeaways
● Chaos Engineering is not only about tools.
● Include multiple disciplines and involve other teams to make
the experiment a success.
● Use SRE principles to make experiments easier to
implement and analyse.
● Never do your first experiment on Production.
● When you are confident, ensure that experiments are
automated and periodically gather evidence.
● Experiment results can be used to feed Progressive Delivery.
● Resilience is the ability to recover, but never forget the User
Experience.