Chaos Engineering: Why Breaking Things Should Be Practised.

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Adrian Hornsby, Cloud Architecture Evangelist @ AWS
@adhorn

Been there?

An Area of Complex &
Dynamic Systems

“Failures are a given and everything will eventually fail over time.”
Werner Vogels
CTO – Amazon.com

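System failure rate
[Figure: failure rate over time, showing Early Failures, Random Failures, Wear Out Failures and the resulting Observed Failures.]

System failure rate
For high-velocity deployments
[Figure: the same curves (Early Failures, Random Failures, Wear Out Failures, Observed Failures) for high-velocity deployments.]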

… at the Edge

Building Confidence Through Testing
Unit testing of components:
• Tested in isolation to ensure function meets expectations.
Functional testing of integrations:
• Each execution path tested to assure expected results.
Is it enough???

Jesse Robbins – mid-2000s
GameDay: Creating Resiliency Through Destruction
https://www.youtube.com/watch?v=zoz0ZjfrQ9s

Netflix 2013
https://medium.com/netflix-techblog/active-active-for-multi-regional-resiliency-c47719f6685b

Chaos Monkeys

https://github.com/Netflix/SimianArmy
• Chaos Monkey - Kill instances randomly
• Latency Monkey - Induce latency in services
• Chaos Gorilla - Simulate AZ and region failures
• Conformity Monkey - Make sure instances follow good practices

http://principlesofchaos.org

https://bit.ly/2uKOJMQ

What “really” is Chaos Engineering?

“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
http://principlesofchaos.org

Break your systems on
purpose.
Find out their weaknesses
and fix them before they
break when least expected.

Failure Injection
• Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attacks!

“CHAOS DOESN’T CAUSE PROBLEMS.
IT REVEALS THEM.”
Nora Jones
Senior Chaos Engineer, Netflix

Before breaking things …

Really!!
Don’t break things before you have done your homework!

People
Application
Network & Data
Infrastructure

Infrastructure

Availability
Availability           Downtime per year     Categories
95% (1-nine)           18 days 6 hours       Batch processing, data extraction, load jobs
99% (2-nines)          3 days 15 hours       Internal tools, project tracking
99.9% (3-nines)        8 hours 45 minutes    Online commerce
99.99% (4-nines)       52 minutes            Video delivery, broadcast systems
99.999% (5-nines)      5 minutes             Telecom industry (ATM transactions)
99.9999% (6-nines)     31 seconds            Answering to my loved one*
* Joke
http://royal.pingdom.com/wp-content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf

System Availability
\[
\text{Availability} = \frac{\text{Normal Operation Time}}{\text{Total Time}} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}
\]
MTTR: Mean Time To Repair
MTBF: Mean Time Between Failures
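
As a rough illustration of the formula above, the sketch below computes availability from MTBF and MTTR and converts it into yearly downtime. The function names and the sample figures (999 hours between failures, 1 hour to repair) are made up for the example and do not come from the slides.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability as a fraction: MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float, hours_per_year: float = 365 * 24) -> float:
    """Expected downtime per year for a given availability fraction."""
    return (1.0 - avail) * hours_per_year


if __name__ == "__main__":
    # Hypothetical service: fails on average every 999 hours, repaired in 1 hour.
    a = availability(mtbf_hours=999.0, mttr_hours=1.0)
    print(f"availability: {a:.3%}")
    print(f"downtime per year: {downtime_hours_per_year(a):.2f} hours")
```

Shortening MTTR raises availability just as effectively as lengthening MTBF, which is one reason recovery time is worth measuring during experiments.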

Availability in Series
Component           Availability        Downtime
X                   99% (2-nines)       3 days 15 hours
Y                   99.99% (4-nines)    52 minutes
X and Y combined    98.99%              3 days 16 hours 33 minutes

Availability in Parallel
Component              Availability          Downtime
X                      99% (2-nines)         3 days 15 hours
Two X in parallel      99.99% (4-nines)      52 minutes
Three X in parallel    99.9999% (6-nines)    31 seconds
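
The series and parallel figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes component failures are independent; the helper names `series` and `parallel` are chosen for the example.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All components must be up: multiply the availabilities."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """At least one component must be up: 1 minus the chance that all are down."""
    return 1.0 - prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    x, y = 0.99, 0.9999
    print(f"X and Y in series:   {series(x, y):.4%}")
    print(f"Two X in parallel:   {parallel(x, x):.4%}")
    print(f"Three X in parallel: {parallel(x, x, x):.4%}")
```

Chaining dependencies in series can only lower availability, while redundancy in parallel raises it quickly, as long as the copies do not fail together.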

Multi-AZ
Support Instance Failure
[Diagram: the Application spread across Availability Zone 1, Availability Zone 2, … Availability Zone n.]

Auto-Scaling
• Compute efficiency
• Node failure
• Traffic spikes
• Performance bugs

Immutable Infrastructure
• No updates on live systems
• Always start from a new instance being provisioned
• Deploy the new software
• Test in different environments (dev, staging)
• Deploy to prod (inactive)
• Change references (DNS or Load Balancer)
• Keep old version around (inactive)
• Fast rollback if things go wrong

Immutable Infrastructure
Immutable components are replaced for every deployment, rather than being updated in-place.

Infrastructure as Code
• Template of the infrastructure in code.
• Version controlled infrastructure.
• Repeatable template.
• Testable infrastructure.
• Automate it!

Network & Data

Read / Write Sharding
[Diagram: three App Instances; one RDS DB Instance as Master (Multi-AZ) and three RDS DB Instances as Read Replicas.]

Database Federation
[Diagram: three App Instances; a Users DB and a Products DB.]

Database Sharding
User      ShardID
002345    A
002346    B
002347    C
002348    B
002349    A
[Diagram: three App Instances; shards A, B and C.]
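
The User to ShardID table on this slide amounts to a lookup directory that the application consults before it talks to a database. A minimal sketch follows, using the slide's sample rows; the endpoint host names and function names are hypothetical.

```python
# Directory taken from the sample rows on the slide.
SHARD_DIRECTORY = {
    "002345": "A",
    "002346": "B",
    "002347": "C",
    "002348": "B",
    "002349": "A",
}

# Hypothetical endpoints, one per shard (not from the slides).
SHARD_ENDPOINTS = {
    "A": "shard-a.example.internal",
    "B": "shard-b.example.internal",
    "C": "shard-c.example.internal",
}


def shard_for(user_id: str) -> str:
    """Return the shard ID that holds the given user."""
    try:
        return SHARD_DIRECTORY[user_id]
    except KeyError:
        raise LookupError(f"no shard assigned for user {user_id}") from None


def endpoint_for(user_id: str) -> str:
    """Return the database endpoint the app instance should connect to."""
    return SHARD_ENDPOINTS[shard_for(user_id)]


if __name__ == "__main__":
    print(endpoint_for("002347"))
```

With this layout, losing one shard affects only the users mapped to it, which makes a single-shard outage a natural candidate for a small first experiment.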

Message passing for async. patterns
[Diagram: A sends to B through a Queue; Pub-Sub variant: A publishes to a Queue and B consumes through a Listener.]

[Diagram: Web Instances, three API Instances, a Queue, two Worker Instances and a Cache.]
API: {DO foo}
PUT JOB: {JobID: 0001, Task: DO foo}
API: {JobID: 0001}
GET JOB: {JobID: 0001, Task: DO foo}
Result: {JobID: 0001, Result: bar}
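
The message labels on this slide suggest a job-ticket pattern: the API enqueues the work and returns a JobID straight away, a worker picks the job up, and the result lands in a cache for later retrieval. Below is a minimal in-process sketch of that reading, with Python's standard `queue` and a dict standing in for the real queue and cache; every function name is an assumption made for the example.

```python
import itertools
import queue
import threading

jobs = queue.Queue()          # stands in for the Queue on the slide
cache = {}                    # stands in for the Cache on the slide
_job_ids = itertools.count(1)


def handle(task):
    """Placeholder for the real work done by a worker instance."""
    return "bar" if task == "DO foo" else f"done: {task}"


def api_submit(task):
    """API side: enqueue the job and return a ticket without waiting."""
    job_id = f"{next(_job_ids):04d}"
    jobs.put({"JobID": job_id, "Task": task})
    return {"JobID": job_id}


def api_result(job_id):
    """API side: return the cached result, or None if it is not ready yet."""
    return cache.get(job_id)


def worker():
    """Worker side: take jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        try:
            if job is None:
                return
            cache[job["JobID"]] = {"JobID": job["JobID"], "Result": handle(job["Task"])}
        finally:
            jobs.task_done()


def demo():
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    ticket = api_submit("DO foo")
    jobs.join()               # wait until the worker has processed the job
    jobs.put(None)            # stop the worker
    thread.join()
    return api_result(ticket["JobID"])


if __name__ == "__main__":
    print(demo())
```

Because the caller only holds a ticket, a slow or crashed worker delays the result instead of blocking the API.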

[Diagram: the same API Instances, Queue, Worker Instances and Cache, with Amazon SNS sending a Push Notification to the User.]

Exponential Backoff
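
A minimal sketch of a retry loop with exponential backoff: the wait doubles after each failed attempt, up to a cap. The random jitter, the default values and the function name are assumptions added for the example; the slide only names the technique.

```python
import random
import time


def call_with_backoff(func, attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call func(); on failure wait up to base * 2**attempt seconds and retry.

    The last failure is re-raised once the attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = min(cap, base * 2 ** attempt)
            # Jitter (an assumption here) spreads retries from many clients apart.
            sleep(random.uniform(0, delay))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        # Hypothetical dependency that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("temporary failure")
        return "ok"

    print(call_with_backoff(flaky))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.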

Circuit Breaker
• Wrap a protected function call in a
circuit breaker object, which
monitors for failures.
• If failures reach a certain threshold,
the circuit breaker trips.
https://martinfowler.com/bliki/CircuitBreaker.html
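
The two bullets above can be sketched as a small wrapper class. This is an illustrative sketch, not a production implementation: the class name, the default threshold and the reset timeout (after which one trial call is allowed through) are assumptions for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, func, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self._func = func
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def __call__(self, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: let one trial call through.
        try:
            result = self._func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # trip, or re-trip after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


if __name__ == "__main__":
    def unreliable():
        raise TimeoutError("dependency not responding")

    guarded = CircuitBreaker(unreliable, failure_threshold=2)
    for _ in range(3):
        try:
            guarded()
        except Exception as exc:
            print(type(exc).__name__)
```

Once tripped, callers fail fast instead of piling up on a dependency that is already struggling.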

Dynamic Routing with Route53
1. Latency Based Routing
2. Geo DNS
3. Weighted Round Robin
4. DNS Failover
[Diagram: Amazon Route53 routing a User in US between Resource A in US and Resource B in EU.]

Dynamic Routing
1. Improve Latency for end-users
2. Disaster Recovery
[Diagram: Users from San Francisco and Users from New York; Applications in US West and Applications in US East, each running Service 1, Service 2, Service 3 and Service 4.]

Application

Stateless Services
[Diagram: a User, an Auto-Scaling Group across AZ1 and AZ2 in one AWS Region, a Data Store and a Cache.]

Transient state
does not belong
in the database.

CAP Theorem
Distributed System
• Consistency: Data is consistent. All nodes see the same state.
• Availability: Every request is non-failing.
• Partition Tolerance: Service still responds as expected if some nodes crash.
In the presence of a network partition, you must choose between consistency and availability!

Eventual Consistency
… if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.
https://en.wikipedia.org/wiki/Eventual_consistency
Distributed System
• Availability: Every request is non-failing.
An eventually consistent system can return any value before it converges!!

Synchronous vs. Asynchronous
[Diagram: two panels, each with Process A and Process B. Labels: Waiting, Working, Continues, Get result, get or fetch result.]

Non-blocking UI
https://medium.com/@sophie_paxtonUX/stop-getting-in-my-way-non-blocking-ux-5cbbfe0f0158

Exception Handling

Service Degradation & Fallbacks

People

“It is not failure itself that holds you back; it
is the fear of failure that paralyses you.”
Brian Tracy

Conway’s Law
“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.”
http://www.melconway.com/Home/Conways_Law.html
[Diagram: User; Siloed Teams (UI Team, Application Team, DBA Team) and Siloed Applications.]

Conway’s Law
[Diagram: Cross-Functional Teams and Services.]

Fire Drills

Phases of Chaos Engineering

GameDay
[Cycle: Steady State, Hypothesis, Design Experiment, Verify & Learn, Fix.]

Steady State

What is Steady State?
• “normal” behavior of your system
https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
• Business Metric
https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a

Business Metrics at work
• Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden).
• Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer).
• Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).

Steady State
Important:
• Know the value range of Healthy State!

Hypothesis

What if…?
“What if this load balancer breaks?”
“What if Redis becomes slow?”
“What if a host on Cassandra goes away?”
“What if latency increases by 300ms?”
“What if the database stops?”
Make it everyone’s problem!

Disclaimer!
Don’t make a hypothesis that you know will break you!

Netflix team members Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri suggest the following inputs for Chaos experiments:
• Simulating the failure of an entire region or datacenter.
• Partially deleting Kafka topics over a variety of instances to recreate an issue that
occurred in production.
• Injecting latency between services for a select percentage of traffic over a
predetermined period of time.
• Function-based chaos (runtime injection): Randomly causing functions to throw
exceptions.
• Code insertion: Adding instructions to the target program and allowing fault injection to
occur prior to certain instructions.
• Time travel: Forcing system clocks out of sync with each other.
• Executing a routine in driver code emulating I/O errors.
• Maxing out CPU cores on an Elasticsearch cluster.
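
As an illustration of the function-based chaos item in the list above, the sketch below wraps a function so that a configurable fraction of calls raises an exception. The decorator name, the exception type and the `enabled` switch are assumptions for the example, not an existing tool.

```python
import functools
import random


class InjectedFault(RuntimeError):
    """Raised on purpose by the chaos decorator."""


def chaos(failure_rate, rng=random.random, enabled=lambda: True):
    """Make roughly `failure_rate` of calls fail while `enabled()` is true."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled() and rng() < failure_rate:
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@chaos(failure_rate=0.1)
def lookup_price(item_id):
    # Hypothetical business function used only for the demo.
    return {"item": item_id, "price": 42}


if __name__ == "__main__":
    failures = 0
    for _ in range(1000):
        try:
            lookup_price("sku-1")
        except InjectedFault:
            failures += 1
    print(f"{failures} of 1000 calls failed on purpose")
```

The `enabled` callable is there so the injection can be switched off at once, in the spirit of an emergency stop.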

Designing Experiment

• Pick hypothesis
• Scope the experiment
• Identify metrics
• Notify the organization

Rules of thumb
• Start very small
• As close as possible to production
• Minimize the blast radius.
• Have an emergency STOP!
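
One way to read the emergency STOP rule is as a guard that watches a steady-state metric during the experiment and aborts as soon as it leaves the healthy range. A minimal sketch follows; the function name, the metric samples and the range are all hypothetical.

```python
from typing import Callable, Iterable, Optional


def run_with_emergency_stop(
    samples: Iterable[float],
    healthy_low: float,
    healthy_high: float,
    stop: Callable[[], None],
) -> Optional[int]:
    """Watch metric samples; call stop() at the first one outside the healthy range.

    Returns the index of the offending sample, or None if all were healthy.
    """
    for index, value in enumerate(samples):
        if not (healthy_low <= value <= healthy_high):
            stop()
            return index
    return None


if __name__ == "__main__":
    # Hypothetical business metric (e.g. successful requests per second).
    observed = [100.0, 98.0, 70.0, 99.0]
    tripped_at = run_with_emergency_stop(
        observed, healthy_low=90.0, healthy_high=110.0,
        stop=lambda: print("STOP: rolling back the experiment"),
    )
    print(f"stopped at sample index: {tripped_at}")
```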

Run the Experiment

Canary deployment
Start with ..
[Diagram: Dynamic Routing (Route53) sends 99% of Users to the Old Version and 1% of Users to the New Version.]
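
The 99% / 1% split on this slide can be mimicked at application level to show the idea. The sketch below hashes a user ID into one of 100 buckets and sends the lowest bucket to the new version. It is a stand-in for illustration only: Route53 weighted routing works at the DNS level, and the function names here are made up.

```python
import hashlib


def bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def route(user_id: str, canary_percent: int = 1) -> str:
    """Return "new" for roughly canary_percent% of users, "old" for the rest."""
    return "new" if bucket(user_id) < canary_percent else "old"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    on_canary = sum(route(u) == "new" for u in users)
    print(f"{on_canary} of {len(users)} users routed to the new version")
```

Hashing keeps each user on the same version between requests, which keeps the blast radius stable while the canary percentage is raised step by step.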

Verify & Learn

DON’T blame that one person …

Quantifying the result of the experiment
• Time to detect?
• Time for notification? And escalation?
• Time to public notification?
• Time for graceful degradation to kick-in?
• Time for self healing to happen?
• Time to recovery – partial and full?
• Time to all-clear and stable?

PostMortems
The 5 WHYs
[Diagram: Outage, followed by a chain of four “Because of …” steps.]

The Conveyor Belt Accident
Question: Why did the associate damage his thumb?
Answer: Because his thumb got caught in the conveyor.
Question: Why did his thumb get caught in the conveyor?
Answer: Because he was chasing his bag, which was on a running conveyor belt.
Question: Why did he chase his bag?
Answer: Because he placed his bag on the conveyor, but it then turned on by surprise.
Question: Why was his bag on the conveyor?
Answer: Because he used the conveyor as a table
Conclusion: So, the likely root cause of the associate’s damaged thumb is that
he simply needed a table, there wasn’t one around, so he used a conveyor as
a table.
https://www.linkedin.com/pulse/use-5-whys-find-root-causes-peter-abilla/

No, seriously. Root Cause is a Fallacy.
• Can you clarify if there were any preceding events?
• Why would they believe acting in this way was the best course of action to
deliver the desired outcome?
• Is there another failure mode that could present here?
• What decisions or events prior to this made this work before?
• Why stop there – are there places to dig deeper that could shine a light
more on this?
• Did others step in to help, to advise, or to intercede?
http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy/

Fix

Big Challenges to Chaos Engineering
Mostly Cultural
• no time or flexibility to simulate disasters.
• teams already spending all of their time fixing things.
• can be very political.
• might force deep conversations.
• deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.

Changing Culture takes time!
Be patient…

More Resources
• https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
• https://www.gremlin.com
• https://queue.acm.org/detail.cfm?id=2353017
• https://softwareengineeringdaily.com/
• https://github.com/dastergon/awesome-sre
• https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
• https://medium.com/@NetflixTechBlog
• http://principlesofchaos.org
• https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp
• https://github.com/adhorn/awesome-chaos-engineering
• https://www.infoq.com/presentations/netflix-chaos-microservices

Twilio Use-Case
Discovering Issues with HTTP/2 via Chaos
Testing
https://www.twilio.com/blog/2017/10/http2-issues.html
“While HTTP/2 provides for a number of improvements over HTTP/1.x, via Chaos Testing we discovered that there are situations where HTTP/2 will perform worse than HTTP/1.”

Thank you!
@adhorn
https://medium.com/@adhorn
