2. What you will get from this talk in exchange for your time:
● Understand the definition of Chaos Engineering (CE)
● Hear a brief history of the field
● Learn the mindset and methodologies of CE
● Know what steps you can take to start doing CE “in the wild”
● Realize the valuable outcomes of having a CE practice at your org
● Be ready to counter common CE myths
● Leave with resources for further investigation of the discipline
3. Who are we in this room?
dev/ops/devops/qa/qe/swe/sre/management
5. Chaos Engineering is the discipline of experimenting on a distributed
system in order to build confidence in the system’s capability to
withstand turbulent conditions in production.
- http://principlesofchaos.org/
6. Bad things will happen (and are happening) to your system, no matter how
well designed it is. You cannot afford to be ignorant of that.
10. A *brief* history of the CE field
● 2010 - Chaos Monkey
● 2011 - Simian Army
● 2012 - Chaos Monkey open-sourced (OSS)
● 2014 - Chaos Engineer role @ Netflix
● 2017 - Chaos Toolkit on GitHub (OSS)
● 2018 - Gremlin hosts the first ChaosConf in SF
● 2018 - CNCF Chaos working group
12. Other disciplines that already engineer for failure:
● Airline industry
○ Air Traffic Control
○ Plane construction
○ Pilot procedures
● Naval Air Operations at Sea
● Electrical Power Systems
● Public Water Systems
● Medical devices
○ Hospitals
○ Implanted devices
● Highway infrastructure
● Car crash safety ratings
15. CE is a discipline
● This implies rigor, in the academic sense
● Each org/person is unique in their implementation
● It’s not a process we can “say we do” and then file into the abyss of “the wiki”
16. Form a hypothesis
● You should know your app/tech stack well
● Whiteboard your entire system with another senior engineer, and always with new hires during onboarding
● Find a domain/service where a failure is likely to exist and start there
17. Test your ideas
● The goal is to either validate or invalidate your failure-case hypothesis
● The act of testing your hypothesis should *not* result in any harm to the user experience! (See the steady-state guard sketched below.)
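A minimal sketch of such a “do no harm” guard in Python, assuming a hypothetical
/healthz endpoint and a hypothetical success threshold; substitute whatever
steady-state signal your own monitoring exposes:

```python
# Steady-state guard: abort the experiment the moment the user-facing
# signal degrades. Endpoint URL and threshold are hypothetical.
import time

import requests

HEALTH_URL = "https://myservice.example.com/healthz"  # hypothetical endpoint
SUCCESS_THRESHOLD = 0.95                              # hypothetical SLO


def steady_state_ok(samples: int = 20) -> bool:
    """Probe the service and report whether it meets the success threshold."""
    ok = 0
    for _ in range(samples):
        try:
            if requests.get(HEALTH_URL, timeout=2).status_code == 200:
                ok += 1
        except requests.RequestException:
            pass  # a failed probe counts against the threshold
        time.sleep(1)
    return ok / samples >= SUCCESS_THRESHOLD


if __name__ == "__main__":
    if not steady_state_ok():
        raise SystemExit("Steady state violated -- abort the experiment!")
    print("Steady state holds; proceed (or conclude the hypothesis survived).")
```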
18. Analyze results
● Lessons learned from the experiment are priceless
● The results and lessons learned should be communicated to the entire team
● If issues were discovered, kick off action items to increase resiliency
21. Level 0 - The Basics
1. You will need team/engineering buy-in.
2. You will need full support from your engineering and business leadership.
3. You will need *observability* into your application/infrastructure/user experience.
Note: if you cannot detect/observe failure states even when not formally doing chaos
engineering, that is the area to focus on before adopting chaos engineering (a sketch
of a minimal check follows below).
4. You will need a fully documented and robust SEV outage procedure (complete with
incident commanders, blameless post-mortems, etc.). Note: this is another area that,
if immature, should be built up before doing chaos engineering.
** Each of these could be an *entire talk* on its own
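As a concrete illustration of that observability bar: if you cannot
programmatically answer “is the service up, and what is its error rate?”, a chaos
experiment will be flying blind. A minimal sketch against the standard Prometheus
HTTP API, with a hypothetical server address and a hypothetical request metric:

```python
# Observability sanity check via the Prometheus instant-query API.
# The server address and the http_requests_total metric are hypothetical.
import requests

PROM_URL = "http://prometheus.example.com:9090"  # hypothetical address


def instant_query(promql: str) -> list:
    """Run an instant query against the standard Prometheus HTTP API."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=5
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]


if __name__ == "__main__":
    checks = {
        "targets up": "up",
        "5xx error rate": 'rate(http_requests_total{status=~"5.."}[5m])',
    }
    for name, promql in checks.items():
        result = instant_query(promql)
        print(f"{name}: {'OK' if result else 'NO DATA -- fix observability first'}")
```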
22. Level 1 - Assemble a team (Time: varies)
Two things are needed before going to Level 2:
- A defined product/domain/service, etc. that you wish to test for failure
- A group of engineers (ops/dev/security/support/business):
- This group should be composed of people who are involved end-to-end with your service
- They need time to attend the pre-game meeting, the experiment, and the follow-up
- Involve/inform as many people as possible in case of a failure during the experiment
- Include senior and junior engineers, and even business people related to the service
- Be sure to set expectations for the level of involvement you need
Example: “We will test our resiliency at the base layer of our infrastructure: compute nodes.”
23. Level 2 - Formulate Hypothesis (Time: 1-2 hours)
Get everyone together and formulate your hypothesis. Whiteboard the entire
service/hypothesis until everyone has a clear and thorough understanding of the
system and of the actions that will be taken to experiment with resiliency.
Also assign the roles and responsibilities each person will carry during the
gameday (a designated documentarian, a QRF (quick-reaction) team, someone whose
only job is to operate the experiment, etc.).
Document all of the above and socialize that documentation to other teams.
Example: “If we delete (lose) a cloud compute node, our Kubernetes cluster will
recover and re-provision, with no downtime or negative user experience.” (A sketch
of this hypothesis as an executable experiment follows below.)
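To make that example concrete, here is a sketch of the hypothesis as an executable
experiment using the official Kubernetes Python client. The victim node name and
the 10-minute recovery window are illustrative assumptions, and deleting the Node
API object only approximates real node loss (on a managed cloud you would
terminate the underlying VM); run nothing like this outside a non-critical cluster:

```python
# Executable form of the Level 2 example hypothesis. The node name and
# recovery window are hypothetical; use only against a non-critical cluster.
import time

from kubernetes import client, config


def ready_node_count(v1: client.CoreV1Api) -> int:
    """Count nodes whose Ready condition is True."""
    count = 0
    for node in v1.list_node().items:
        for cond in node.status.conditions or []:
            if cond.type == "Ready" and cond.status == "True":
                count += 1
    return count


if __name__ == "__main__":
    config.load_kube_config()  # uses your current kubeconfig context
    v1 = client.CoreV1Api()

    before = ready_node_count(v1)
    print(f"Ready nodes before: {before}")

    # The "chaos": delete one node object (hypothetical victim name).
    # Note: this removes the API object; real node loss would mean
    # terminating the cloud VM itself.
    v1.delete_node("worker-node-1")

    # The hypothesis: the cluster re-provisions within 10 minutes.
    deadline = time.time() + 600
    while time.time() < deadline:
        if ready_node_count(v1) >= before:
            print("Hypothesis validated: node count recovered.")
            break
        time.sleep(15)
    else:
        print("Hypothesis invalidated: cluster did not recover in time.")
```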
24. Level 3 - Gameday (Time: 1-4 hours)
Ideally, a gameday looks like a launch at NASA: each assigned person knows their
role, and you can run a pre-launch checklist ensuring each team is ready (see the
go/no-go sketch below).
If there are any issues already impacting the system, or anything the gameday
*might* interfere with or make worse, abort the launch.
If you are ready, initiate the experiment, keeping a keen eye on its progress.
Example: “Our infrastructure is currently not degraded in any way, it is not Black
Friday, and we have SRE, SWE, Support, Security, and a few business folks here. We
will now delete a node and watch the success rates of our APIs while expecting and
monitoring for the node recovery/re-provisioning.”
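A sketch of that NASA-style go/no-go poll in code; every check below is a
placeholder to be wired to your real monitoring, incident tracker, and traffic
calendar:

```python
# Pre-launch go/no-go checklist. Each check is a hypothetical placeholder;
# wire them to your real monitoring, incident tracker, and traffic calendar.
from typing import Callable


def dashboards_green() -> bool:
    return True  # placeholder: query your real monitoring


def no_active_sevs() -> bool:
    return True  # placeholder: query your incident tracker


def outside_peak_traffic() -> bool:
    return True  # placeholder: e.g. refuse to launch on Black Friday


PREFLIGHT: dict[str, Callable[[], bool]] = {
    "observability dashboards green": dashboards_green,
    "no active SEVs or degradations": no_active_sevs,
    "outside peak traffic window": outside_peak_traffic,
}

if __name__ == "__main__":
    not_ready = [name for name, check in PREFLIGHT.items() if not check()]
    if not_ready:
        raise SystemExit(f"ABORT launch -- not ready: {not_ready}")
    print("All stations report GO. Begin the experiment.")
```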
25. Level 4 - Recap Lessons Learned (Time: 30 minutes)
Gather everyone involved and recap what happened. Whether it ended in success or in
failure and remediation, be sure to walk through the timeline of events.
Collect the lessons everyone learned, highlighting what the experiment taught us
that we didn’t know before (this is a good way to demonstrate value).
Plan work for the engineering teams as necessary to close any resiliency gaps the
experiment discovered.
Communicate the value of all that has occurred in this process to the business. This
is work that directly contributes to the bottom line of the company.
26. Gameday Templates!
If you are very new to doing this, Gremlin has a complete set of templates and
checklists to help you get started! (They really are quite excellent!)
https://www.gremlin.com/gameday/
28. 1. Avoid the costs of downtime
Do we really *know* how much downtime costs our enterprise in sales, engineering
time, lost productivity, etc.? (A back-of-envelope sketch follows below.)
User experience will improve!
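A back-of-envelope sketch of that calculation; every figure below is loudly
hypothetical and should be replaced with your own business’s numbers:

```python
# Back-of-envelope downtime cost. All numbers are hypothetical placeholders.
revenue_per_hour = 50_000        # hypothetical: revenue flowing per hour
engineers_paged = 8              # hypothetical: responders per incident
loaded_cost_per_eng_hour = 150   # hypothetical: fully loaded hourly cost
outage_hours = 2                 # hypothetical: duration of one outage

lost_revenue = revenue_per_hour * outage_hours
response_cost = engineers_paged * loaded_cost_per_eng_hour * outage_hours
total = lost_revenue + response_cost  # 100,000 + 2,400 = 102,400

print(f"One {outage_hours}h outage: ~${total:,} before counting churn, "
      "reputation damage, and the feature work that never shipped.")
```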
29. 2. Decrease pages to Ops/Dev/SRE
Do we all like sleep?
Do we track the number of pages our teams get?
The blast radius/cost of an outage event is large (both active responders and lurkers).
30. 3. Increase Productivity
Less time and money spent on outages and reactive work will increase our time and
resources for proactive work/features.
What value could our Ops teams add if they were distracted less?
31. 4. Increase the spread of knowledge throughout your organization
Tired of running into a lack of documentation/runbooks?
Tired of people leaving with *heaps* of “tribal knowledge”?
Tired of people saying “I don’t know... that’s Johnny’s expertise”?
34. Top Chaos Engineering Myths
1. It’s not my job!
2. *Now* what tool do we have to buy & learn?
3. It costs how much??
4. We have too much work to do (i.e. features, bug-fixes, etc.)
5. We can just deal with outages JIT, right!?
6. Our uptime target is 100%, right? Why should we ever introduce “experiments” in production?
7. Why do you think we even have an ops/SRE team?
8. We don’t even have SLOs/SLIs/SLAs in place... even if we wanted to, how could we start?
44. Do we really expect (and employ a strategy of hope) that only Ops/SRE should be doing Chaos Engineering?
45. Chaos Engineering != tooling (necessarily)
Start with preemptible/spot instances for services in lower environments :) (see the sketch below)
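One way to take that first step, sketched with the official Kubernetes Python
client: confirm that a lower-environment service actually lands on preemptible
capacity, so routine preemptions become free chaos experiments. The GKE
preemptible node label is real; the namespace and app label are hypothetical:

```python
# Check which pods of a (hypothetical) staging service run on preemptible
# nodes, so natural preemptions exercise your resiliency for free.
from kubernetes import client, config

config.load_kube_config()  # uses your current kubeconfig context
v1 = client.CoreV1Api()

# Real GKE label for preemptible node pools; other clouds use different labels.
preemptible_nodes = {
    node.metadata.name
    for node in v1.list_node(
        label_selector="cloud.google.com/gke-preemptible=true"
    ).items
}

# Hypothetical namespace and app label -- substitute your own.
pods = v1.list_namespaced_pod("staging", label_selector="app=myservice").items
on_spot = [p.metadata.name for p in pods if p.spec.node_name in preemptible_nodes]

print(f"{len(on_spot)}/{len(pods)} pods are riding preemptible nodes.")
```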
46. What you can do about implementing chaos engineering:
1. Evangelize the idea and principles of chaos engineering to your organization.
2. Ensure that your systems are measurable (you can detect chaos even when it is
unplanned) and that there is a really solid SEV process in place.
3. Start with whiteboarding sessions/high-level discussions about how your
applications/services are architected and how they function; build “herd immunity”
of knowledge.
4. Pick one service or application that is well documented, very observable, not in
a critical production path, etc. to serve as the subject of your first chaos
experiment. Stop immediately if things go wrong.
5. If you need/feel like ramping up quickly, Gremlin may be a good choice.
48. Additional online resources
- Chaos Conf 2018 talks
- Gremlin (Chaos-as-a-service, Documentation, Community Labs, etc.)
- Gremlin Free Edition
- Chaos Slack community - https://slofile.com/slack/chaosengineering
- Talks by: Adrian Cockcroft, Lorin Hochstein, Kolton Andrus, Tammy Butow, John Allspaw
- CNCF Chaos WG (https://github.com/chaoseng/wg-chaoseng)
- Netflix Simian Army (https://github.com/Netflix/SimianArmy)
- Chaos Toolkit (https://github.com/chaostoolkit)
- Kubernetes Chaos Lab (https://github.com/matthewbrahms/kubernetes-chaos-lab)
49. Additional reading
Books for further academic reading:
- Release It! 2nd Edition by Michael Nygard
- Drift into Failure by Sidney Dekker
- Chaos Engineering (O’Reilly)
- The Safety Anarchist by Sidney Dekker