Chaos Engineering: Why Breaking Things Should Be Practised.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Adrian Hornsby
Cloud Architecture Evangelist – Amazon Web Services
@adhorn

Been there?

“Failures are a given and
everything will eventually
fail over time.”
Werner Vogels
CTO – Amazon.com

… at the Edge

Building Confidence Through Testing
Unit testing of components:
• Tested in isolation to ensure function meets expectations.
Functional testing of integrations:
• Each execution path tested to assure expected results.
Is it enough???

Jesse Robbins
GameDay: Creating Resiliency Through Destruction
https://www.youtube.com/watch?v=zoz0ZjfrQ9s

Netflix 2013
https://medium.com/netflix-techblog

Chaos Monkeys
https://github.com/Netflix/SimianArmy

https://bit.ly/2uKOJMQ

Twilio Use-Case
Discovering Issues with HTTP/2 via Chaos Testing
https://www.twilio.com/blog/2017/10/http2-issues.html
“While HTTP/2 provides for a
number of improvements over
HTTP/1.x, via Chaos
Testing we discovered that
there are situations where
HTTP/2 will perform worse than
HTTP/1.”

What “really” is Chaos Engineering?

“Chaos Engineering is the discipline of
experimenting on a distributed system
in order to build confidence in the system’s
capability to withstand turbulent conditions in
production.”
http://principlesofchaos.org

Break your systems on purpose.
Find out their weaknesses and fix
them before they break when
least expected.

Failure Injection
• Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attacks!
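Starting small at the application level can be as simple as wrapping one dependency call so that it is slowed down or made to fail. The following is a minimal sketch, not a real chaos tool: `inject_failure`, `get_price`, and the parameter names are hypothetical and exist only for illustration.

```python
import functools
import random
import time


def inject_failure(latency_s=0.0, error_rate=0.0, rng=random.random):
    """Decorator: delay each call by `latency_s` seconds and make it
    fail with probability `error_rate` (0.0 = never, 1.0 = always)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                time.sleep(latency_s)  # latency attack
            if rng() < error_rate:
                raise ConnectionError("injected failure")  # dependency attack
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical dependency call, slowed down by 10 ms and never failed.
@inject_failure(latency_s=0.01, error_rate=0.0)
def get_price(item_id):
    return {"item": item_id, "price": 42}


print(get_price("book-1"))
```

Raising `error_rate` step by step is one way to widen the experiment only after the smaller one has built confidence.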

“CHAOS DOESN’T CAUSE PROBLEMS.
IT REVEALS THEM.”
Nora Jones
Senior Chaos Engineer, Netflix

Before breaking things …

People
Application
Network & Data
Infrastructure

Build Resilient Architectures
Infrastructure

Availability in Parallel
Component            Availability        Downtime per year
X                    99% (2-nines)       3 days 15 hours
Two X in parallel    99.99% (4-nines)    52 minutes
Three X in parallel  99.9999% (6-nines)  31 seconds
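The figures above follow from a short calculation, sketched here under the assumption that the redundant components fail independently of each other and that downtime is counted over a 365-day year. The function names are illustrative.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components in parallel.

    The group is down only when every copy is down at the same time,
    which assumes that failures are independent.
    """
    return 1.0 - (1.0 - component_availability) ** copies


def yearly_downtime_seconds(availability: float) -> float:
    """Expected downtime over a 365-day year, in seconds."""
    return (1.0 - availability) * 365 * 24 * 60 * 60


for copies in (1, 2, 3):
    a = parallel_availability(0.99, copies)
    print(f"{copies} x in parallel: {a:.4%} available, "
          f"about {yearly_downtime_seconds(a):,.0f} s of downtime per year")
```

Correlated failures (a shared power supply, a shared deployment) break the independence assumption, which is one reason to spread copies across Availability Zones.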

Availability Zone 1 Availability Zone 2 Availability Zone n
Multi-AZ
Support Instance Failure
Application

Auto-Scaling
• Compute efficiency
• Node failure
• Traffic spikes
• Performance bugs

Infrastructure as Code
• Template of the infrastructure in code.
• Version controlled infrastructure.
• Repeatable template.
• Testable infrastructure.
• Automate it!

Immutable Infrastructure
• No updates on live systems
• Always start from a new resource being provisioned
• Deploy the new software
• Test in different environments (dev, staging)
• Deploy to prod (inactive)
• Change references (DNS or Load Balancer)
• Keep old version around (inactive)
• Fast rollback if things go wrong

Network & Data

Read / Write Sharding
[Diagram: three app instances; one RDS DB instance master (Multi-AZ); three RDS DB instance read replicas]

Database Federation
[Diagram: three app instances; a Users DB and a Products DB]

Database Sharding
User     ShardID
002345   A
002346   B
002347   C
002348   B
002349   A
[Diagram: three app instances; shards A, B and C]
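The lookup table above can be read as a directory that maps each user to a shard. The following is a minimal sketch of that idea. The user IDs and shard letters come from the slide, while the host names and function names are hypothetical.

```python
# Directory taken from the slide: user ID -> shard ID.
SHARD_DIRECTORY = {
    "002345": "A",
    "002346": "B",
    "002347": "C",
    "002348": "B",
    "002349": "A",
}

# Hypothetical database endpoints, one per shard (assumption for illustration).
SHARD_HOSTS = {
    "A": "users-a.db.example.internal",
    "B": "users-b.db.example.internal",
    "C": "users-c.db.example.internal",
}


def shard_for(user_id: str) -> str:
    """Return the shard ID that holds this user's data."""
    return SHARD_DIRECTORY[user_id]


def host_for(user_id: str) -> str:
    """Return the database host an app instance should query for this user."""
    return SHARD_HOSTS[shard_for(user_id)]


print(host_for("002347"))
```

With this layout, losing one shard affects only the users mapped to it, which limits the blast radius of a database failure.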

Message passing for async. patterns
[Diagram: A, Queue, B]
[Diagram: Pub-Sub: A, Queue, B, Listener]

[Diagram: web instances; three API instances; a queue; two worker instances; a cache]
API: {DO foo}
PUT JOB: {JobID: 0001, Task: DO foo}
API: {JobID: 0001}
GET JOB: {JobID: 0001, Task: DO foo}
Result: {JobID: 0001, Result: bar}
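The job flow above can be imitated in a single process. This is a sketch only: an in-memory `queue.Queue` and a dictionary stand in for the real queue and cache, the function names are hypothetical, and the worker returns the hard-coded result `bar` from the slide instead of doing real work.

```python
import itertools
import queue
import threading

job_queue = queue.Queue()        # stands in for the queue
result_cache = {}                # stands in for the cache
_job_ids = itertools.count(1)


def api_submit(task):
    """API side: enqueue the job and return its ID without waiting."""
    job_id = f"{next(_job_ids):04d}"
    job_queue.put({"JobID": job_id, "Task": task})
    return {"JobID": job_id}


def api_result(job_id):
    """API side: return the cached result, or None while the job is pending."""
    return result_cache.get(job_id)


def worker():
    """Worker side: take jobs off the queue and store results in the cache."""
    while True:
        job = job_queue.get()
        # Placeholder for the real work; "bar" mirrors the slide.
        result_cache[job["JobID"]] = {"JobID": job["JobID"], "Result": "bar"}
        job_queue.task_done()


threading.Thread(target=worker, daemon=True).start()

ticket = api_submit("DO foo")
job_queue.join()                 # block here only so the demo can print a result
print(api_result(ticket["JobID"]))
```

Because the API returns the job ID at once, a slow or crashed worker delays the result but does not block the caller.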

[Diagram: three API instances; a queue; two worker instances; a cache; Amazon SNS sends a push notification to the user]

Exponential Backoff
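A minimal sketch of retrying with exponential backoff: the wait doubles after each failed attempt, up to a cap, and a random jitter spreads retries from many clients. The function name, defaults, and the catch-all `except Exception` are assumptions for illustration; real code would usually retry only on errors known to be transient.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    Before retry number n (starting at 0) the code sleeps for a random
    time between 0 and min(max_delay, base_delay * 2**n) seconds.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

Usage: `retry_with_backoff(lambda: client.fetch(url))`, where `client.fetch` is a placeholder for any call that may fail transiently.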

Circuit Breaker
• Wrap a protected function call in a
circuit breaker object, which
monitors for failures.
• If failures reach a certain threshold,
the circuit breaker trips.
https://martinfowler.com/bliki/CircuitBreaker.html
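The two bullets above can be sketched as a small wrapper class. This is an illustrative, single-threaded sketch rather than a production implementation. The class name, the thresholds, and the reset timeout (after which one trial call is allowed through) are assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of piling up on a dependency that is already struggling.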

Dynamic Routing with Route53
1. Latency Based Routing
2. Geo DNS
3. Weighted Round Robin
4. DNS Failover
[Diagram: user in US; Amazon Route53; Resource A in US; Resource B in EU]

Application

Stateless Services
[Diagram: user; AWS Region with AZ1 and AZ2; Auto Scaling group; cache; data store]

Transient state does not
belong in the database.

CAP Theorem
Distributed System
• Consistency: Data is consistent. All nodes see the same state.
• Availability: Every request is non-failing.
• Partition Tolerance: Service still responds as expected if some nodes crash.
In the presence of a network partition, you must
choose between consistency and availability!

Eventual Consistency
… if no new updates are
made to a given data item,
eventually all accesses to that
item will return the last
updated value.
https://en.wikipedia.org/wiki/Eventual_consistency
An eventually consistent system can
return any value before it converges!!
Distributed System
• Availability: Every request is non-failing.

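Synchronous vs. Asynchronous
[Diagram: Synchronous: Process A calls Process B and is waiting while B is working, then gets the result. Asynchronous: Process A calls Process B and continues while B is working, then gets or fetches the result.]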
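The asynchronous case can be sketched with Python's `asyncio`: Process A starts Process B, continues with its own work, and fetches the result later. The coroutine names and the 10 ms delay are illustrative assumptions.

```python
import asyncio


async def process_b():
    """Simulated slow work done by Process B."""
    await asyncio.sleep(0.01)
    return "result"


async def process_a():
    # Start B without waiting for it to finish.
    pending = asyncio.create_task(process_b())
    # A continues with its own work while B is working.
    own_work = "A kept working"
    # Later, A fetches the result.
    result = await pending
    return own_work, result


print(asyncio.run(process_a()))
```

In the synchronous variant, A would `await process_b()` directly and do nothing else until B returned.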

Non-blocking UI
https://medium.com/@sophie_paxtonUX/stop-getting-in-my-way-non-blocking-ux-5cbbfe0f0158

Exception Handling

Service Degradation & Fallbacks
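One way to degrade gracefully is to catch the failure of a primary call and answer with a simpler fallback. The sketch below assumes a hypothetical recommendations service that falls back to a static list. All names and values are made up for illustration.

```python
def with_fallback(primary, fallback):
    """Return a function that tries `primary` and, if it raises,
    answers with `fallback` instead of failing the request."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            return fallback(*args, **kwargs)
    return call


def personalised_recommendations(user_id):
    # Hypothetical dependency that is currently down.
    raise TimeoutError("recommendation service timed out")


def popular_items(user_id):
    # Degraded but still useful answer: a static list.
    return ["item-1", "item-2", "item-3"]


recommendations = with_fallback(personalised_recommendations, popular_items)
print(recommendations("user-42"))
```

A chaos experiment can then check that the fallback path really is taken when the dependency is made to fail.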

People

“It is not failure itself that holds you back; it
is the fear of failure that paralyses you.”
Brian Tracy

Fire Drills

Phases of Chaos Engineering

Steady State → Hypothesis → Design Experiment → Verify & Learn → Fix

What is Steady State?
• “normal” behavior of your system
https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero

What is Steady State?
• “normal” behavior of your system
• Business Metric
https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a

Business Metrics at work
Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden).
Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer).
Yahoo!: 400 ms of extra load time caused a 5–9% increase in the
number of people who clicked “back” before the page even loaded (Nicole
Sullivan).

Steady State
Important:
• Know the value range of Healthy State!
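Knowing the value range makes the steady state checkable by a program. This is a minimal sketch; the metric, the bounds, and the sample values are made-up numbers for illustration, not measurements.

```python
def is_steady(samples, low, high, tolerance=0.0):
    """True if at most a `tolerance` fraction of the samples
    falls outside the healthy range [low, high]."""
    if not samples:
        raise ValueError("no samples: steady state cannot be judged")
    outside = sum(1 for value in samples if not low <= value <= high)
    return outside <= tolerance * len(samples)


# Hypothetical business metric: orders per minute, healthy between 90 and 110.
orders_per_minute = [101, 99, 104, 97, 102]
print(is_steady(orders_per_minute, low=90, high=110))
```

The same check can run before the experiment (to confirm a healthy baseline) and during it (to decide whether the hypothesis holds).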

Hypothesis: What if…?
“What if this load balancer breaks?”
“What if Redis becomes slow?”
“What if a host on Cassandra goes away?”
“What if latency increases by 300ms?”
“What if the database stops?”
“What if Paul does not show up today?”
Make it everyone’s problem!

Disclaimer!
Don’t make a hypothesis that you
know will break you!

Designing the Experiment
• Pick hypothesis
• Scope the experiment
• Identify metrics
• Notify the organization

Rules of thumb
• Start very small
• As close as possible to production
• Minimize the blast radius.
• Have an emergency STOP!
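An emergency stop can be built into the experiment itself: watch a steady-state metric while the fault is active, abort as soon as it leaves the healthy range, and always roll back. The following is a sketch under those assumptions. The function and its callbacks (`inject`, `rollback`, `read_metric`) are hypothetical.

```python
def run_experiment(inject, rollback, read_metric, healthy_range, max_checks=10):
    """Inject a fault, watch one metric, and stop early if it goes unhealthy.

    `rollback` runs in every case, including an abort or an exception,
    so it should be safe to call even if the injection did not complete.
    """
    low, high = healthy_range
    try:
        inject()
        for check in range(max_checks):
            value = read_metric()
            if not low <= value <= high:
                # Emergency stop: the metric left the healthy range.
                return {"aborted": True, "check": check, "value": value}
        return {"aborted": False}
    finally:
        rollback()
```

Keeping `max_checks` small is one simple way to limit how long the blast radius stays open.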

Run the Experiment: Canary deployment
[Diagram: dynamic routing (Route53) sends 99% of users to the old version and, to start with, 1% of users to the new version]
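The 99% / 1% split can be illustrated in application code. Note that this is a swapped-in technique: the slide does the split with DNS-level dynamic routing, whereas the sketch below hashes a user ID into a bucket. The function name and the bucket scheme are assumptions.

```python
import hashlib


def route(user_id: str, canary_percent: float = 1.0) -> str:
    """Send roughly `canary_percent` percent of users to the new version.

    Hashing the user ID keeps each user on the same version between requests.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # 0 .. 9999
    return "new" if bucket < canary_percent * 100 else "old"


users = [f"user-{i}" for i in range(20_000)]
share = sum(route(u) == "new" for u in users) / len(users)
print(f"share of users on the new version: {share:.2%}")
```

Setting `canary_percent` to 0 acts as the emergency stop: every user goes back to the old version.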

Verify & Learn: Quantifying the result of the experiment
• Time to detect?
• Time for notification? And escalation?
• Time to public notification?
• Time for graceful degradation to kick in?
• Time for self-healing to happen?
• Time to recovery – partial and full?
• Time to all-clear and stable?
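These questions become numbers once each event in the experiment is timestamped. The following is a minimal sketch; the event names and the timestamps are made up purely for illustration.

```python
from datetime import datetime, timedelta


def durations_since_injection(timeline):
    """Map each recorded event to the time elapsed since the fault was injected."""
    start = timeline["injected"]
    return {
        f"time_to_{event}": moment - start
        for event, moment in timeline.items()
        if event != "injected"
    }


# Illustrative timeline (invented values, not real measurements).
t0 = datetime(2018, 1, 1, 12, 0, 0)
timeline = {
    "injected": t0,
    "detected": t0 + timedelta(seconds=90),
    "notified": t0 + timedelta(minutes=3),
    "recovered": t0 + timedelta(minutes=12),
}

for name, elapsed in durations_since_injection(timeline).items():
    print(f"{name}: {elapsed}")
```

Tracking the same durations across repeated experiments shows whether fixes actually shorten detection and recovery.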

DON’T blame that one person …

PostMortems
The 5 WHYs
Outage
Because of
…
Because of
…
Because of
…
Because of
…

More questions to ask.
• Can you clarify if there were any preceding events?
• Why would they believe acting in this way was the best course of action to
deliver the desired outcome?
• Is there another failure mode that could present here?
• What decisions or events prior to this made this work before?
• Why stop there – are there places to dig deeper that could shine a light more on
this?
• Did others step in to help, to advise, or to intercede?

Fixing the issues!

Big Challenges to Chaos Engineering
Mostly Cultural
• no time or flexibility to simulate disasters.
• teams already spending all of their time fixing things.
• can be very political.
• might force deep conversations.
• deeply invested in a specific technical roadmap (micro-services)
that chaos engineering tests show is not as resilient to failures as
originally predicted.

Changing Culture takes time!
Be patient…

More Resources
• https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
• https://www.gremlin.com
• https://queue.acm.org/detail.cfm?id=2353017
• https://softwareengineeringdaily.com/
• https://github.com/dastergon/awesome-sre
• https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
• https://medium.com/@NetflixTechBlog
• http://principlesofchaos.org
• https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp
• https://github.com/adhorn/awesome-chaos-engineering
• https://www.infoq.com/presentations/netflix-chaos-microservices
• http://royal.pingdom.com/wp-content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf
• http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy

Thank you!
@adhorn
https://medium.com/@adhorn
