2. What you will get from this talk in exchange for your time:
● Understand the definition of Chaos Engineering (CE)
● Hear a brief history of the field
● Learn the mindset and methodologies of CE
● Know what steps you can take to start doing CE “in the wild”
● Realize the valuable outcomes of having a CE practice at your org
● Be ready to counter common CE myths
● Leave with resources for further investigation of the discipline
3. Who are we in this room?
dev/ops/devops/qa/qe/swe/sre/management
5. Chaos Engineering is the discipline of experimenting on a distributed
system in order to build confidence in the system’s capability to
withstand turbulent conditions in production.
- http://principlesofchaos.org/
6. Bad things will happen (and are happening) to your system, no matter how
well designed it is. You cannot afford to be ignorant of that.
10. A *brief* history of the CE field
● 2010 - Chaos Monkey
● 2011 - Simian Army
● 2012 - Chaos Monkey open-sourced (OSS)
● 2014 - Chaos Engineer role @ Netflix
● 2017 - Chaos Toolkit on GitHub (OSS)
● 2018 - Gremlin hosts the first ChaosConf in SF
● 2018 - CNCF Chaos working group
12. Other disciplines that already engineer for failure:
● Airline industry
○ Air Traffic Control
○ Plane construction
○ Pilot procedures
● Naval Air Operations at Sea
● Electrical Power Systems
● Public Water Systems
● Medical devices
○ Hospitals
○ Implanted devices
● Highway infrastructure
● Car crash safety ratings
15. CE is a discipline
● This implies rigor, in the academic sense
● Each org/person is unique in their implementation
● It’s not a process we can “say we do” and then file into the abyss of “the wiki”
16. Form a hypothesis
● You should know your app/tech stack well
● Whiteboard your entire system with another senior engineer, and always with new hires during onboarding
● Find a domain/service where a failure is likely to exist and start there
17. Test your ideas
● The goal is to either validate or invalidate your failure-case hypothesis
● The act of testing your hypothesis should *not* result in any harm to the user experience! (See the steady-state guard sketched below.)
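A minimal sketch of such a “do no harm” guard in Python, assuming a hypothetical
/healthz endpoint and a hypothetical success threshold; substitute whatever
steady-state signal your own monitoring exposes:

```python
# Steady-state guard: abort the experiment the moment the user-facing
# signal degrades. Endpoint URL and threshold are hypothetical.
import time

import requests

HEALTH_URL = "https://myservice.example.com/healthz"  # hypothetical endpoint
SUCCESS_THRESHOLD = 0.95                              # hypothetical SLO


def steady_state_ok(samples: int = 20) -> bool:
    """Probe the service and report whether it meets the success threshold."""
    ok = 0
    for _ in range(samples):
        try:
            if requests.get(HEALTH_URL, timeout=2).status_code == 200:
                ok += 1
        except requests.RequestException:
            pass  # a failed probe counts against the threshold
        time.sleep(1)
    return ok / samples >= SUCCESS_THRESHOLD


if __name__ == "__main__":
    if not steady_state_ok():
        raise SystemExit("Steady state violated -- abort the experiment!")
    print("Steady state holds; proceed (or conclude the hypothesis survived).")
```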
18. Analyze results
● Lessons learned from the experiment are priceless
● The results and lessons learned should be communicated to the entire team
● If issues were discovered, kick off action items to increase resiliency
21. Level 0 - The Basics
1. You will need team/engineering buy-in.
2. You will need full support from your engineering and business leadership.
3. You will need *observability* into your application/infrastructure/user experience.
Note: if you cannot detect/observe failure states even when not formally doing chaos
engineering, that is the area to focus on before adopting chaos engineering (a sketch
of a minimal check follows below).
4. You will need a fully documented and robust SEV outage procedure (complete with
incident commanders, blameless post-mortems, etc.). Note: this is another area that,
if immature, should be built up before doing chaos engineering.
** Each of these could be an *entire talk* on its own
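As a concrete illustration of that observability bar: if you cannot
programmatically answer “is the service up, and what is its error rate?”, a chaos
experiment will be flying blind. A minimal sketch against the standard Prometheus
HTTP API, with a hypothetical server address and a hypothetical request metric:

```python
# Observability sanity check via the Prometheus instant-query API.
# The server address and the http_requests_total metric are hypothetical.
import requests

PROM_URL = "http://prometheus.example.com:9090"  # hypothetical address


def instant_query(promql: str) -> list:
    """Run an instant query against the standard Prometheus HTTP API."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=5
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]


if __name__ == "__main__":
    checks = {
        "targets up": "up",
        "5xx error rate": 'rate(http_requests_total{status=~"5.."}[5m])',
    }
    for name, promql in checks.items():
        result = instant_query(promql)
        print(f"{name}: {'OK' if result else 'NO DATA -- fix observability first'}")
```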
22. Level 1 - Assemble a team (Time: varies)
Two things are needed before going to Level 2:
- A defined product/domain/service, etc. that you wish to test for failure
- A group of engineers (ops/dev/security/support/business):
- This group should be composed of people who are involved end-to-end with your service
- They need time to attend the pre-game meeting, the experiment, and the follow-up
- Involve/inform as many people as possible in case of a failure during the experiment
- Include senior and junior engineers, and even business people related to the service
- Be sure to set expectations for the level of involvement you need
Example: “We will test our resiliency at the base layer of our infrastructure: compute nodes.”
23. Level 2 - Formulate Hypothesis (Time: 1-2 hours)
Get everyone together and formulate your hypothesis. Whiteboard the entire
service/hypothesis until everyone has a clear and thorough understanding of the
system and of the actions that will be taken to experiment with resiliency.
Also assign the roles and responsibilities each person will carry during the
gameday (a designated documentarian, a QRF (quick-reaction) team, someone whose
only job is to operate the experiment, etc.).
Document all of the above and socialize that documentation to other teams.
Example: “If we delete (lose) a cloud compute node, our Kubernetes cluster will
recover and re-provision, with no downtime or negative user experience.” (A sketch
of this hypothesis as an executable experiment follows below.)
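To make that example concrete, here is a sketch of the hypothesis as an executable
experiment using the official Kubernetes Python client. The victim node name and
the 10-minute recovery window are illustrative assumptions, and deleting the Node
API object only approximates real node loss (on a managed cloud you would
terminate the underlying VM); run nothing like this outside a non-critical cluster:

```python
# Executable form of the Level 2 example hypothesis. The node name and
# recovery window are hypothetical; use only against a non-critical cluster.
import time

from kubernetes import client, config


def ready_node_count(v1: client.CoreV1Api) -> int:
    """Count nodes whose Ready condition is True."""
    count = 0
    for node in v1.list_node().items:
        for cond in node.status.conditions or []:
            if cond.type == "Ready" and cond.status == "True":
                count += 1
    return count


if __name__ == "__main__":
    config.load_kube_config()  # uses your current kubeconfig context
    v1 = client.CoreV1Api()

    before = ready_node_count(v1)
    print(f"Ready nodes before: {before}")

    # The "chaos": delete one node object (hypothetical victim name).
    # Note: this removes the API object; real node loss would mean
    # terminating the cloud VM itself.
    v1.delete_node("worker-node-1")

    # The hypothesis: the cluster re-provisions within 10 minutes.
    deadline = time.time() + 600
    while time.time() < deadline:
        if ready_node_count(v1) >= before:
            print("Hypothesis validated: node count recovered.")
            break
        time.sleep(15)
    else:
        print("Hypothesis invalidated: cluster did not recover in time.")
```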
24. Level 3 - Gameday (Time: 1-4 hours)
Ideally, a gameday looks like a launch at NASA: each assigned person knows their
role, and you can run a pre-launch checklist ensuring each team is ready (see the
go/no-go sketch below).
If there are any issues already impacting the system, or anything the gameday
*might* interfere with or make worse, abort the launch.
If you are ready, initiate the experiment, keeping a keen eye on its progress.
Example: “Our infrastructure is currently not degraded in any way, it is not Black
Friday, and we have SRE, SWE, Support, Security, and a few business folks here. We
will now delete a node and watch the success rates of our APIs while expecting and
monitoring for the node recovery/re-provisioning.”
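A sketch of that NASA-style go/no-go poll in code; every check below is a
placeholder to be wired to your real monitoring, incident tracker, and traffic
calendar:

```python
# Pre-launch go/no-go checklist. Each check is a hypothetical placeholder;
# wire them to your real monitoring, incident tracker, and traffic calendar.
from typing import Callable


def dashboards_green() -> bool:
    return True  # placeholder: query your real monitoring


def no_active_sevs() -> bool:
    return True  # placeholder: query your incident tracker


def outside_peak_traffic() -> bool:
    return True  # placeholder: e.g. refuse to launch on Black Friday


PREFLIGHT: dict[str, Callable[[], bool]] = {
    "observability dashboards green": dashboards_green,
    "no active SEVs or degradations": no_active_sevs,
    "outside peak traffic window": outside_peak_traffic,
}

if __name__ == "__main__":
    not_ready = [name for name, check in PREFLIGHT.items() if not check()]
    if not_ready:
        raise SystemExit(f"ABORT launch -- not ready: {not_ready}")
    print("All stations report GO. Begin the experiment.")
```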
25. Level 4 - Recap Lessons Learned (Time: 30 minutes)
Gather everyone involved and recap what happened. Whether it ended in success or in
failure and remediation, be sure to walk through the timeline of events.
Collect the lessons everyone learned, highlighting what the experiment taught us
that we didn’t know before (this is a good way to demonstrate value).
Plan work for the engineering teams as necessary to close any resiliency gaps the
experiment discovered.
Communicate the value of all that has occurred in this process to the business. This
is work that directly contributes to the bottom line of the company.
26. Gameday Templates!
If you are very new to doing this, Gremlin has a complete set of templates and
checklists to help you get started! (They really are quite excellent!)
https://www.gremlin.com/gameday/
28. 1. Avoid the costs of downtime
Do we really *know* how much downtime costs our enterprise in sales, engineering
time, lost productivity, etc.? (A back-of-envelope sketch follows below.)
User experience will improve!
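A back-of-envelope sketch of that calculation; every figure below is loudly
hypothetical and should be replaced with your own business’s numbers:

```python
# Back-of-envelope downtime cost. All numbers are hypothetical placeholders.
revenue_per_hour = 50_000        # hypothetical: revenue flowing per hour
engineers_paged = 8              # hypothetical: responders per incident
loaded_cost_per_eng_hour = 150   # hypothetical: fully loaded hourly cost
outage_hours = 2                 # hypothetical: duration of one outage

lost_revenue = revenue_per_hour * outage_hours
response_cost = engineers_paged * loaded_cost_per_eng_hour * outage_hours
total = lost_revenue + response_cost  # 100,000 + 2,400 = 102,400

print(f"One {outage_hours}h outage: ~${total:,} before counting churn, "
      "reputation damage, and the feature work that never shipped.")
```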
29. 2. Decrease pages to Ops/Dev/SRE
Do we all like sleep?
Do we track the number of pages our teams get?
The blast radius/cost of an outage event is large (both active responders and lurkers).
30. 3. Increase Productivity
Less time and money spent on outages and reactive work will increase our time and
resources for proactive work/features.
What value could our Ops teams add if they were distracted less?
31. 4. Increase the spread of knowledge throughout your organization
Tired of running into a lack of documentation/runbooks?
Tired of people leaving with *heaps* of “tribal knowledge”?
Tired of people saying “I don’t know... that’s Johnny’s expertise”?
34. Top Chaos Engineering Myths
1. It’s not my job!
2. *Now* what tool do we have to buy & learn?
3. It costs how much??
4. We have too much work to do (i.e. features, bug-fixes, etc.)
5. We can just deal with outages JIT, right!?
6. Our uptime target is 100%, right? Why should we ever introduce “experiments” in production?
7. Why do you think we even have an ops/SRE team?
8. We don’t even have SLOs/SLIs/SLAs in place... even if we wanted to, how could we start?
44. Do we really expect (and employ a strategy of hope) that only Ops/SRE should be doing Chaos Engineering?
45. Chaos Engineering != tooling (necessarily)
Start with preemptible/spot instances for services in lower environments :) (see the sketch below)
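One way to take that first step, sketched with the official Kubernetes Python
client: confirm that a lower-environment service actually lands on preemptible
capacity, so routine preemptions become free chaos experiments. The GKE
preemptible node label is real; the namespace and app label are hypothetical:

```python
# Check which pods of a (hypothetical) staging service run on preemptible
# nodes, so natural preemptions exercise your resiliency for free.
from kubernetes import client, config

config.load_kube_config()  # uses your current kubeconfig context
v1 = client.CoreV1Api()

# Real GKE label for preemptible node pools; other clouds use different labels.
preemptible_nodes = {
    node.metadata.name
    for node in v1.list_node(
        label_selector="cloud.google.com/gke-preemptible=true"
    ).items
}

# Hypothetical namespace and app label -- substitute your own.
pods = v1.list_namespaced_pod("staging", label_selector="app=myservice").items
on_spot = [p.metadata.name for p in pods if p.spec.node_name in preemptible_nodes]

print(f"{len(on_spot)}/{len(pods)} pods are riding preemptible nodes.")
```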
46. What you can do about implementing chaos engineering:
1. Evangelize the idea and principles of chaos engineering to your organization.
2. Ensure that your systems are measurable (you can detect chaos even when it is
unplanned) and that there is a really solid SEV process in place.
3. Start with whiteboarding sessions/high-level discussions about how your
applications/services are architected and how they function; build “herd immunity”
of knowledge.
4. Pick one service or application that is well documented, very observable, not in
a critical production path, etc. to serve as the subject of your first chaos
experiment. Stop immediately if things go wrong.
5. If you need/feel like ramping up quickly, Gremlin may be a good choice.
48. Additional online resources
- Chaos Conf 2018 talks
- Gremlin (Chaos-as-a-service, Documentation, Community Labs, etc.)
- Gremlin Free Edition
- Chaos Slack community - https://slofile.com/slack/chaosengineering
- Talks by: Adrian Cockcroft, Lorin Hochstein, Kolton Andrus, Tammy Butow, John Allspaw
- CNCF Chaos WG (https://github.com/chaoseng/wg-chaoseng)
- Netflix Simian Army (https://github.com/Netflix/SimianArmy)
- Chaos Toolkit (https://github.com/chaostoolkit)
- Kubernetes Chaos Lab (https://github.com/matthewbrahms/kubernetes-chaos-lab)
49. Additional reading
Books for further academic reading:
- Release It! 2nd Edition by Michael Nygard
- Drift into Failure by Sidney Dekker
- Chaos Engineering (O’Reilly)
- The Safety Anarchist by Sidney Dekker