Reliability as a Discipline

Chaos Engineering practices
Friday, 2 July 2021

fullstaq.com
Doing things with SRE and Observability
Arnold van Wijnbergen
a.vanwijnbergen@fullstaq.com
linkedin.com/in/IlovIT

Agenda
About myself
Why do we need Chaos Engineering?
Historical facts
Chaos Engineering as a Discipline
Key Takeaways

About myself
● 20+ years of experience in the IT industry
● Last 8 years doing some DevOps
● Active in Chaos Engineering since 2015
● Love everything around Observability
● Supporter of Cloud Native
● Currently working as Product Owner @ Albert Heijn
● Responsible for
○ Observability
○ Monitoring
○ Logging
○ Alerting
○ Performance testing
○ Chaos engineering
○ Some Elastic and some Cilium

Why do we need Chaos Engineering?

Not for restarting nodes or hosts, or destroying our services

Learn from failure and mitigate risks by breaking things on purpose

Experimenting with failures [in Production] in order to reveal weaknesses and build confidence in the system's resilience.

Chaos Engineering is here to prevent Chaos from happening

See it as Disaster Recovery Testing on steroids

Historical facts
● Jesse Robbins, "Master of Disaster", started Game days at Amazon.
● The term Chaos was introduced by Netflix (2010) to fill the gap in proper resilience testing in the Cloud.
● In 2011 the Simian Army was born, best known for Chaos Monkey.
● In 2016 the Principles of Chaos became publicly available.
● In 2018 the first ChaosConf was organised by Gremlin (Kolton Andrus).
● Since 2020 Chaos Engineering has been part of Well-Architected frameworks. See this as the start of major adoption.
[Timeline: 2010, 2011, 2016, 2018, 2020, 2021]
https://principlesofchaos.org/

Common reality with Distributed Systems
Just your ordinary Grocery store

Reliability becomes a product Feature
Innovation
Reliability

SREs love Chaos Engineering

It requires more than just Tools
Observability
SLO/SLI
Game days
Analysis
Evaluation
CI/CD
Testing
Chaos Tools

But how do we start?
● Start by organising an event like a Game day with the product team and other relevant teams, such as Incident Command.
● Ensure that the goals and scope of the experiments are set and agreed.
● Ensure that you have enough time to create the hypotheses.
● Run the experiment!
● Evaluate the results and record the evidence.
In fact, we are building a discipline.

Game days explained
● Creating a culture of experimentation.
● Repeatable exercises to learn from failures.
● A well-known method that AWS uses to validate the resilience of its services and its major-incident process.
● Working on collaborative trust.

Goals made easy by SRE
● As a team, set your Objectives.
● Use techniques like SLO/SLI.
● Prepare your Observability systems with the appropriate Indicators to measure.
● Use these to validate your expected system behaviour.
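
To make the SLO/SLI bullets concrete, here is a minimal sketch of an availability indicator and the error budget derived from an objective. The `Slo` class, the function names and the 99% target in the usage note are illustrative assumptions, not something the talk prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A service level objective (name and target are illustrative)."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must be good


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing violated the objective
    return good_requests / total_requests


def error_budget_remaining(slo: Slo, good_requests: int, total_requests: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_bad = (1.0 - slo.target) * total_requests
    actual_bad = total_requests - good_requests
    if allowed_bad == 0:
        # A 100% target or zero traffic leaves no budget to spend.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad
```

With a hypothetical `Slo("checkout-availability", 0.99)` and 995 good requests out of 1000, about half of the error budget is left. A team could agree up front that an experiment stops as soon as this value approaches zero.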

Simple flow that makes the Scientific part work
Hypothesis → Experiment → Deviation → Evidence

Simple flow that makes the Scientific part work, with its supporting practices
Hypothesis → Experiment → Deviation → Evidence
Supporting practices: SLO/SLI, Game days, Chaos Tools, Observability, Testing, Analysis, Evaluation, CI/CD
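
The flow from hypothesis to evidence can be sketched as a small loop: measure the steady state, inject a fault, measure again, always roll back, and keep the result as evidence. All names here (`Evidence`, `run_experiment`, the callback parameters) are hypothetical and only illustrate the four steps; a real setup would delegate injection to a chaos tool and measurement to the observability stack.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Evidence:
    """What we keep after an experiment (field names are illustrative)."""
    hypothesis: str
    baseline: float
    observed: float
    tolerance: float

    @property
    def deviation(self) -> float:
        # Distance between the steady state and what we saw under failure.
        return abs(self.observed - self.baseline)

    @property
    def hypothesis_held(self) -> bool:
        return self.deviation <= self.tolerance


def run_experiment(
    hypothesis: str,
    measure: Callable[[], float],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
    tolerance: float,
) -> Evidence:
    """Hypothesis -> Experiment -> Deviation -> Evidence."""
    baseline = measure()       # steady state before the fault
    inject_fault()             # the experiment itself
    try:
        observed = measure()   # behaviour while the fault is active
    finally:
        rollback()             # always restore the system, even on errors
    return Evidence(hypothesis, baseline, observed, tolerance)
```

The `try`/`finally` is the one deliberate design choice: the rollback runs even when measuring fails, so an aborted experiment does not leave the fault in place.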

Building a hypothesis
● Build a hypothesis around the steady state.
● Steady state is when customers are happy.
● Describe potential or real-world outages that can happen, or have happened, due to infrastructure, application or connectivity failures, including hard-to-predict cascading effects.
● Get agreement on the blast radius (scope).
● Describe the fault-injection implementation needed to carry out the experiment.
● Choose a strategy for executing the fault-injection experiment. Start small.
Failure types: Latency, Routing failures, Unavailability, Connectivity failures, Saturation, Data corruption
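
A hypothesis can be written down as a small record so that the steady state, the failure type and the agreed blast radius are explicit before anything is injected. The sketch below is an assumption about how a team might structure this. The `FaultType` values mirror the failure types named on the slide; `Hypothesis` and `start_small` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class FaultType(Enum):
    LATENCY = "latency"
    ROUTING_FAILURE = "routing failure"
    UNAVAILABILITY = "unavailability"
    CONNECTIVITY_FAILURE = "connectivity failure"
    SATURATION = "saturation"
    DATA_CORRUPTION = "data corruption"


@dataclass(frozen=True)
class Hypothesis:
    steady_state: str            # what "customers are happy" means, measurably
    fault: FaultType             # which failure we inject
    blast_radius_percent: float  # agreed starting scope, e.g. % of traffic
    expected_outcome: str        # what we believe will happen

    def __post_init__(self) -> None:
        if not 0 < self.blast_radius_percent <= 100:
            raise ValueError("blast radius must be in (0, 100]")


def start_small(h: Hypothesis, max_percent: float, factor: float = 2.0) -> List[float]:
    """Blast-radius steps growing from the agreed start up to max_percent."""
    if factor <= 1:
        raise ValueError("factor must be greater than 1")
    if max_percent < h.blast_radius_percent:
        raise ValueError("max_percent is below the starting blast radius")
    steps = []
    current = h.blast_radius_percent
    while current < max_percent:
        steps.append(current)
        current *= factor
    steps.append(max_percent)  # finish exactly at the agreed ceiling
    return steps
```

Starting from a 1% blast radius with a ceiling of 10%, `start_small` yields the steps 1, 2, 4, 8 and 10. Each step would only run after the previous one has been evaluated.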

Execute the experiment
● Ensure everybody is well informed.
● Ensure that Observability tools are set up and ready.
● Record evidence!
Start learning from failure!
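
For a first experiment outside Production, the fault itself can be as simple as a wrapper that delays a fraction of calls. This is a minimal sketch of latency injection at the application level, assuming a hypothetical `with_latency_fault` helper; dedicated chaos tools typically inject faults at the network or platform layer instead.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_latency_fault(
    func: Callable[..., T],
    delay_seconds: float,
    fraction: float,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., T]:
    """Wrap func so that roughly `fraction` of calls are delayed.

    `fraction` acts as the blast radius: 0.0 affects no calls, 1.0 affects all.
    `rng` and `sleep` are injectable so the wrapper can be tested without waiting.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = rng or random.Random()

    def wrapper(*args, **kwargs) -> T:
        # random() returns a value in [0, 1), so fraction=1.0 always delays.
        if rng.random() < fraction:
            sleep(delay_seconds)
        return func(*args, **kwargs)

    return wrapper
```

Wrapping a client call, for example `with_latency_fault(fetch_price, 0.5, 0.05)` with a hypothetical `fetch_price`, would delay about 5% of calls by half a second while the dashboards are being watched.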

Learn from failure results
● Always ensure experiment results are recorded, (historically) available and analysed. This is your evidence!
● Look for deviations from the steady state, which you can learn about from your Observability system.
● Ensure that the whole experiment is evaluated, preferably using a post-mortem analysis.
● Extract improvements!
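
Recording evidence and spotting deviations can start very simply. The sketch below compares samples taken during the experiment with the steady-state mean and serialises the outcome as JSON, so it can be kept for later analysis. The 10% tolerance, the function names and the record fields are assumptions for illustration; in practice the samples would come from the observability system.

```python
import json
from datetime import datetime, timezone
from statistics import mean
from typing import List, Sequence


def find_deviations(
    baseline: Sequence[float],
    experiment: Sequence[float],
    tolerance_ratio: float = 0.1,
) -> List[float]:
    """Experiment samples further than tolerance_ratio from the baseline mean."""
    steady = mean(baseline)
    limit = abs(steady) * tolerance_ratio
    return [s for s in experiment if abs(s - steady) > limit]


def evidence_record(
    name: str,
    baseline: Sequence[float],
    experiment: Sequence[float],
    tolerance_ratio: float = 0.1,
) -> str:
    """Serialise one experiment result as a JSON document for the archive."""
    deviations = find_deviations(baseline, experiment, tolerance_ratio)
    return json.dumps({
        "experiment": name,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "steady_state_mean": mean(baseline),
        "deviations": deviations,
        "hypothesis_held": not deviations,
    })
```

Appending each record to a log file or a document store gives the historical trail that a post-mortem analysis can draw on.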

Key takeaways
● Chaos Engineering is not only about tools.
● Include multiple disciplines and involve other teams to make the experiment a success.
● Use SRE principles, which make things easier to implement and analyse.
● Never do your first experiment in Production.
● When you are confident, ensure that experiments are automated and gather evidence periodically.
● Experiments can be used to feed Progressive Delivery.
● Resilience is the ability to recover, but never forget the User Experience.
