Principles of Chaos Engineering

Chaos Engineering
Hamburg
Marvin Hoffmann | Computer Scientist
15.12.2015

Agenda
1. AWS Basics and Intro
2. Evolution of Chaos Testing
3. Tooling
4. Chaos Engineering

AWS Basics
• Regions: US East (N. Virginia), Europe West (Ireland)
• AZs (Availability Zones)
• Instances

“A way to improve availability is
to install proven hardware and
software, and then leave it alone”
Jim Gray
Why Do Computers Stop and What Can Be Done About It?

Be reliable!
• Systems need to be reliable
• Nuclear weapons arsenal, heart rate monitoring, World of Warcraft servers, streaming business
• Third-party dependencies (software and hardware)

DynamoDB Outage US-East
• “… there was a brief network disruption that impacted a
portion of DynamoDB’s storage servers.”
• 2:19am until 7:10am PDT
• “There are several other AWS services that use
DynamoDB that experienced problems during the event.”
• SQS, EC2 auto scaling, CloudWatch
Source: https://aws.amazon.com/message/5467D2/

It’s not always the infrastructure
• Deployments themselves may cause issues
• Unpredicted behaviour after a change has been rolled out
• Issues during rollback
• Change in client / user behaviour

Do the simplest thing first
• Prepare for your machines to die
• “Cattle, not pets” (Adrian Cockcroft)
• Resilience through redundancy
• Stateless machines

Deal with infrastructure issues
• Latency between instances
• Packet loss
• Ports blocked
• Or even outages of an entire AZ
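
The first two failure modes above can be rehearsed without touching the network at all. The following is a minimal sketch, not a tool from the talk: the names with_faults and InjectedFault are made up for illustration, and the wrapper only delays a call and randomly drops it.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real network error (hypothetical name)."""


def with_faults(call, latency_s=0.0, loss_rate=0.0, rng=random, sleep=time.sleep):
    """Wrap `call` so that it is delayed and sometimes dropped."""
    def wrapped(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)          # simulated latency between instances
        if rng.random() < loss_rate:  # simulated packet loss
            raise InjectedFault("request dropped")
        return call(*args, **kwargs)
    return wrapped


# Usage: roughly one in ten calls fails, every call is 50 ms slower.
flaky_lookup = with_faults(lambda user_id: {"id": user_id}, latency_s=0.05, loss_rate=0.1)
```

Passing sleep and rng as parameters keeps the wrapper easy to test; real packet loss or blocked ports would be injected at the network layer instead.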

Think big!
• Remember that DynamoDB failure?
• Outage of an entire AWS region!
• You’ll need more than one region in the first place
• Re-route all traffic from one region to another
• Any region needs to be able to scale to take the load of two regions
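
The capacity point can be made concrete with a small worked example. This is only a sketch: the request rates are made-up numbers and load_after_failover is a hypothetical helper, not part of any AWS API.

```python
def load_after_failover(load_per_region, failed, takeover):
    """Return the per-region load after all traffic of `failed` moves to `takeover`."""
    after = dict(load_per_region)
    moved = after.pop(failed)   # traffic that has to go somewhere else
    after[takeover] += moved    # the surviving region now carries both loads
    return after


load = {"us-east-1": 1000, "eu-west-1": 800}  # requests per second, invented
after = load_after_failover(load, failed="us-east-1", takeover="eu-west-1")
print(after)  # {'eu-west-1': 1800}
```

With these numbers eu-west-1 has to scale from 800 to 1800 requests per second, which is why each region needs headroom for the load of two.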

Chaos Monkey
Kills random instances in your account
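
The idea can be sketched in a few lines. This is not Netflix's implementation: the grouping of instances, the probability parameter and the injected terminate callback are assumptions made so that the sketch runs without a cloud account.

```python
import random


def pick_victims(groups, probability, rng=random):
    """Pick at most one random instance per group, each group with the given probability."""
    victims = []
    for name, instances in sorted(groups.items()):
        if instances and rng.random() < probability:
            victims.append((name, rng.choice(instances)))
    return victims


def unleash(groups, terminate, probability=1.0, rng=random):
    """Terminate the picked instances via the caller-supplied `terminate` function."""
    for _group, instance_id in pick_victims(groups, probability, rng):
        terminate(instance_id)  # a real run would call the cloud provider's API here


# Usage with invented instance IDs; printing stands in for terminating.
unleash({"web": ["i-1", "i-2"], "api": ["i-3"]}, terminate=print)
```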

Chaos Gorilla
Kills a random AZ in your account

Chaos Kong
Kills an entire AWS region in your account

What’s in it?
• A compilation of scripts
• Scripts mess with your AWS account
• Thus, they are very AWS-specific
• If not on AWS, get inspired and build your toolset around these ideas
• Not a comprehensive toolset

Simian Army
• Latency Monkey
• Conformity Monkey
• Security Monkey
• Doctor Monkey
• 10-18 Monkey

Chaos Engineering
• Systematic approach to Chaos Testing
• Started by Netflix
• Netflix talks about it a lot to attract talent
• Many other companies do similar things in this field
• The goal is to grow a community around it

“Experiment on a distributed system
in order to build confidence in the
system’s capability to withstand
turbulent conditions in production.”
Netflix

Four Principles of
Chaos Engineering

Know your system
• Operational insight
• What is “normal”? What does a failure look like?

Four Principles of
Chaos Engineering
1. Build a hypothesis around steady-state behaviour
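
A steady-state hypothesis can be written down as a check against a known baseline. The following is a minimal sketch under assumptions: the metric (requests per second), the 5 % tolerance and the function name steady_state_holds are chosen for illustration only.

```python
from statistics import mean


def steady_state_holds(baseline, observed, tolerance=0.05):
    """True if the observed mean stays within `tolerance` (relative) of the baseline mean."""
    expected = mean(baseline)
    return abs(mean(observed) - expected) <= tolerance * expected


baseline = [100, 102, 98]  # requests per second on a normal day, invented
print(steady_state_holds(baseline, [97, 99]))  # True: within 5 % of the baseline
print(steady_state_holds(baseline, [80, 85]))  # False: the hypothesis is refuted
```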

The “Happy Path”
• Trace through code where nothing bad happens
• Testing usually happens first on the happy path
• Bad things usually happen off the happy path
Source: https://bethtrissel.files.wordpress.com/2014/06/176869567.jpg

Four Principles of
Chaos Engineering
2. Vary real-world events

Laboratory
• “Works on my machine” (or “works in stage env.”)
Source: http://www.memegasms.com/media/created/vhyfxm.jpg

Four Principles of
Chaos Engineering
3. Run experiments in production

Four Principles of
Chaos Engineering
4. Automate experiments to run continuously
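
An automated experiment can be sketched as a small loop: check the steady state, inject a fault, check again, and always roll back. All names here (run_experiment, run_continuously, the four callbacks) are hypothetical; in practice a scheduler would replace the simple repetition.

```python
def run_experiment(inject, rollback, measure, is_steady):
    """Run one experiment and report 'skipped', 'passed' or 'failed'."""
    if not is_steady(measure()):
        return "skipped"  # never start from a system that is already unhealthy
    inject()
    try:
        return "passed" if is_steady(measure()) else "failed"
    finally:
        rollback()        # undo the fault whatever the outcome


def run_continuously(experiments, rounds):
    """Repeat every named experiment `rounds` times and collect the results."""
    return [
        (name, run_experiment(**callbacks))
        for _ in range(rounds)
        for name, callbacks in experiments.items()
    ]
```

Each experiment is a dict with the keys inject, rollback, measure and is_steady, so new failure scenarios can be added without changing the loop.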

Chaos Engineering Culture
• http://principlesofchaos.com
• More resources:
• https://github.com/Netflix/SimianArmy
• https://github.com/Netflix/atlas
• https://www.youtube.com/watch?v=vq4QZ4_YDok
