Keynote - Chaos Engineering: Why breaking things should be practiced

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
BENGALURU

Chaos Engineering:
Why breaking things should be practiced
Madhusudan Shekar | Oct 6, 2018
@madhushekar23

Selfie Time…
Thank you @adhorn for slides

OLD WORLD IT
Employees at work
Factories + supply chain
Sales channels
Marketing analytics

NEW WORLD IT
Employees at work
Factories + supply chain
IoT connected things
Online marketing
Continuous supply tracking
Just in time production
Online sales + delivery
Social media

Worked in Dev
Ops problem now

“Failures are a given and everything will eventually fail over time.”
Werner Vogels
CTO – Amazon.com

… at the Edge

Building Confidence Through Testing
Unit testing of components:
• Tested in isolation to ensure function meets expectations.
Functional testing of integrations:
• Each execution path tested to assure expected results.
Is it enough???

Jesse Robbins – mid-2000s
GameDay: Creating Resiliency Through Destruction
https://www.youtube.com/watch?v=zoz0ZjfrQ9s

Netflix 2013
https://medium.com/netflix-techblog/active-active-for-multi-regional-resiliency-c47719f6685b

https://bit.ly/2uKOJMQ

Twilio Use-Case
Discovering Issues with HTTP/2 via Chaos Testing
https://www.twilio.com/blog/2017/10/http2-issues.html
“While HTTP/2 provides for a number of improvements over HTTP/1.x, via Chaos Testing we discovered that there are situations where HTTP/2 will perform worse than HTTP/1.”

What “really” is Chaos Engineering?

“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
http://principlesofchaos.org

Break your systems on purpose.
Find out their weaknesses and fix them before they break when least expected.

Failure Injection
• Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attacks!
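A minimal sketch of application-level failure injection, assuming a Python service where a dependency call can be wrapped. The names `inject_failure` and `InjectedFailure` are hypothetical and are not taken from any chaos tool mentioned in this talk.

```python
import random
import time
from functools import wraps


class InjectedFailure(Exception):
    """Raised by the wrapper to simulate a failing dependency (hypothetical name)."""


def inject_failure(failure_rate=0.0, latency_s=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so that its calls can be delayed or made to fail on purpose."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # network-style attack: add latency to every call
            if rng.random() < failure_rate:
                # application-level attack: fail a fraction of the calls
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

Starting with a low `failure_rate` on a single call path is one way to "start small & build confidence" before moving on to host or region level attacks.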

Chaos Changes…
Application
Network & Data
Infrastructure
People

Infrastructure

Availability
Availability          Downtime per year
99% (2-nines)         3 days 15 hours
99.99% (4-nines)      52 minutes
99.999% (5-nines)     5 minutes
99.9999% (6-nines)    31 seconds
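The downtime figures follow from simple arithmetic. A small sketch, assuming a 365-day year (the function name is illustrative only):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_seconds_per_year(availability_percent):
    """Return the yearly downtime budget, in seconds, for an availability level."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


for level in (99, 99.99, 99.999, 99.9999):
    seconds = downtime_seconds_per_year(level)
    print(f"{level}%: {seconds:.1f} s = {seconds / 60:.2f} min = {seconds / 3600:.2f} h")
```

The values on the slide are these results rounded down to whole units.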

Availability in Parallel
Component              Availability          Downtime
X                      99% (2-nines)         3 days 15 hours
Two X in parallel      99.99% (4-nines)      52 minutes
Three X in parallel    99.9999% (6-nines)    31 seconds
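The parallel figures can be reproduced with a short calculation. This sketch assumes the redundant components fail independently, which real deployments only approximate:

```python
def parallel_availability(component_availability, copies):
    """Availability of `copies` redundant components, assuming independent failures.

    The group is unavailable only when every copy is down at the same time.
    """
    return 1 - (1 - component_availability) ** copies


for copies in (1, 2, 3):
    print(copies, f"{parallel_availability(0.99, copies):.6%}")
```

Correlated failures (shared power, shared deployment, shared bugs) reduce the benefit, which is one reason to spread copies across Availability Zones.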

Multi-AZ
Support Instance Failure
[Diagram: Application; Availability Zone 1, Availability Zone 2, Availability Zone n]

Auto-Scaling
• Compute efficiency
• Node failure
• Traffic spikes
• Performance bugs

Immutable Infrastructure
• No updates on live systems
• Always start from a new instance being provisioned
• Deploy the new software
• Test in different environments (dev, staging)
• Deploy to prod (inactive)
• Change references (DNS or Load Balancer)
• Keep old version around (inactive)
• Fast rollback if things go wrong

Infrastructure as Code
• Template of the infrastructure in code.
• Version controlled infrastructure.
• Repeatable template.
• Testable infrastructure.
• Automate it!

Network & Data

Read / Write Sharding
[Diagram: three App Instances; one RDS DB Instance, Master (Multi-AZ); three RDS DB Instance Read Replicas]
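One way an application might split traffic between the master and the read replicas, as a sketch only. The class name and the endpoint strings are hypothetical, and the check for reads is deliberately naive:

```python
import itertools


class ReadWriteRouter:
    """Send writes to the primary and spread reads across replicas (illustrative)."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # simple round robin over replicas

    def endpoint_for(self, sql):
        # Naive assumption: any statement starting with SELECT is a read.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.primary


router = ReadWriteRouter("primary-endpoint", ["replica-1", "replica-2", "replica-3"])
print(router.endpoint_for("SELECT * FROM users"))
print(router.endpoint_for("UPDATE users SET name = 'x'"))
```

Replicas can lag behind the master, so reads that must see the latest write may still need to go to the primary.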

Database Federation
[Diagram: three App Instances; Users DB; Products DB]

Database Sharding
User      ShardID
002345    A
002346    B
002347    C
002348    B
002349    A
[Diagram: three App Instances; shards A, B, C]
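The User to ShardID table can be read as a lookup directory. A minimal sketch using the rows from the slide; the shard endpoint names are hypothetical:

```python
# Directory taken from the slide's User / ShardID table.
SHARD_DIRECTORY = {
    "002345": "A",
    "002346": "B",
    "002347": "C",
    "002348": "B",
    "002349": "A",
}

# Hypothetical endpoints, one database per shard.
SHARD_ENDPOINTS = {"A": "shard-a.example", "B": "shard-b.example", "C": "shard-c.example"}


def endpoint_for_user(user_id):
    """Return the database endpoint holding this user's data."""
    shard_id = SHARD_DIRECTORY[user_id]  # raises KeyError for unknown users
    return SHARD_ENDPOINTS[shard_id]


print(endpoint_for_user("002347"))
```

A directory like this makes it possible to move a user between shards by updating one row, at the cost of an extra lookup per request.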

Message passing for async. patterns
[Diagram: A, Queue, B]
[Diagram, Pub-Sub: A, Queue, B, Listener]

[Diagram: Web Instances; three API Instances; Queue; two Worker Instances; Cache]
API: {DO foo}
PUT JOB: {JobID: 0001, Task: DO foo}
API: {JobID: 0001}
GET JOB: {JobID: 0001, Task: DO foo}
Result: {JobID: 0001, Result: bar}
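The message flow above can be sketched in a single process, assuming an in-memory queue and a dictionary standing in for the real queue service and cache. All function names are illustrative:

```python
import queue

job_queue = queue.Queue()  # stands in for the Queue in the diagram
result_cache = {}          # stands in for the Cache in the diagram


def submit(job_id, task):
    """API instance: enqueue the job and answer immediately with its ID."""
    job_queue.put({"JobID": job_id, "Task": task})
    return {"JobID": job_id}


def work_once(handlers):
    """Worker instance: take one job, run it, and store the result in the cache."""
    job = job_queue.get_nowait()
    result = handlers[job["Task"]]()
    result_cache[job["JobID"]] = {"JobID": job["JobID"], "Result": result}
    job_queue.task_done()


def poll(job_id):
    """API instance: return the cached result, or None while the job is pending."""
    return result_cache.get(job_id)
```

Because the API only enqueues and polls, a slow or crashed worker delays results without blocking the caller.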

Exponential Backoff
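A minimal retry sketch with exponential backoff. The randomized ("full jitter") delay and all default values are assumptions of this example, not something stated on the slide:

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay_s=0.1,
                       max_delay_s=5.0, sleep=time.sleep):
    """Call `operation`, doubling the delay ceiling after each failed attempt."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, ceiling))  # jitter avoids synchronized retries
```

Capping both the delay and the number of attempts keeps a struggling dependency from being hammered by retries.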

Circuit Breaker
• Wrap a protected function call in a circuit breaker object, which monitors for failures.
• If failures reach a certain threshold, the circuit breaker trips.
https://martinfowler.com/bliki/CircuitBreaker.html
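A compact sketch of the pattern, assuming a consecutive-failure threshold and a reset timeout after which one trial call is allowed. The class and parameter names are illustrative, not from a specific library:

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit is open; call skipped")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

While the breaker is open, callers fail fast instead of waiting on a dependency that is already known to be unhealthy.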

Dynamic Routing with Route53
1. Latency Based Routing
2. Geo DNS
3. Weighted Round Robin
4. DNS Failover
[Diagram: User in US; Amazon Route53; Resource A in US; Resource B in EU]

Dynamic Routing
1. Improve Latency for end-users
2. Disaster Recovery
[Diagram: Users from San Francisco; Applications in US West (Service 1 to Service 4); Users from New York; Applications in US East (Service 1 to Service 4)]

Application

Stateless Services
[Diagram: User; AWS Region with AZ1 and AZ2; Auto-Scaling Group; Cache; Data Store]

Transient state does not belong
in the database.

CAP Theorem
Distributed System
Consistency: Data is consistent. All nodes see the same state.
Availability: Every request is non-failing.
Partition Tolerance: Service still responds as expected if some nodes crash.
In the presence of a network partition, you must choose between consistency and availability!

Eventual Consistency
… if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.
https://en.wikipedia.org/wiki/Eventual_consistency
An eventually consistent system can return any value before it converges!!
Distributed System
Availability: Every request is non-failing.

Service Degradation & Fallbacks
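A minimal sketch of a fallback wrapper, assuming the degraded answer (for example a static default) is acceptable to the caller. The helper name is hypothetical:

```python
def with_fallback(primary, fallback):
    """Return a function that tries `primary` and degrades to `fallback` on error."""

    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Degrade gracefully instead of failing the whole request.
            return fallback(*args, **kwargs)

    return call


def personalized(user_id):
    raise TimeoutError("recommendation service is slow")  # simulated failure


recommendations = with_fallback(personalized, lambda user_id: ["bestsellers"])
print(recommendations("user-1"))
```

A chaos experiment can then verify that the fallback path actually triggers when the dependency is broken on purpose.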

People

“It is not failure itself that holds you back; it is the fear of failure that paralyses you.”
Brian Tracy

Conway’s Law
“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.”
http://www.melconway.com/Home/Conways_Law.html
[Diagram: User; UI Team, Application Team, DBA Team; Siloed Teams; Siloed Applications]

Conway’s Law
[Diagram: Cross-Functional Teams; Services]

Fire Drills

Phases of Chaos Engineering

1. Steady State
2. Hypothesis
3. Design Experiment
4. Verify & Learn
5. Fix

What is Steady State?
• “normal” behavior of your system
https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero

Hypothesis …?
“What if this load balancer breaks?”
“What if Redis becomes slow?”
“What if a host on Cassandra goes away?”
“What if latency increases by 300ms?”
“What if the database stops?”
Make it everyone’s problem!

Disclaimer!
Don’t make a hypothesis that you know will break you!

• Pick hypothesis
• Scope the experiment
• Identify metrics
• Notify the organization
Run the Experiment
• Start very small
• As close as possible to production
• Minimize the blast radius.
• Have an emergency STOP!
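The checklist can be sketched as a small experiment runner. Every callable passed in is a placeholder supplied by the team running the experiment; nothing here is tied to a specific chaos tool:

```python
def run_experiment(steady_state_ok, inject, rollback, checks=5,
                   stop_requested=lambda: False):
    """Run one chaos experiment and always roll the injected failure back."""
    if not steady_state_ok():
        return "skipped: system not in steady state"
    inject()
    try:
        for _ in range(checks):
            if stop_requested():  # the emergency STOP
                return "aborted: emergency stop"
            if not steady_state_ok():
                return "hypothesis rejected: steady state lost"
        return "hypothesis held"
    finally:
        rollback()  # runs on every exit path once the failure has been injected
```

Refusing to start when the system is already unhealthy is one simple way to keep the blast radius small.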

DON’T blame that one person …

Quantifying the result of the experiment
• Time to detect?
• Time for notification? And escalation?
• Time to public notification?
• Time for graceful degradation to kick in?
• Time for self-healing to happen?
• Time to recovery – partial and full?
• Time to all-clear and stable?
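These questions can be answered from a timestamped timeline of the experiment. A small sketch; the event names and the timestamps are made up for illustration:

```python
from datetime import datetime, timedelta


def durations_since(timeline, start_event="failure_injected"):
    """Return seconds elapsed from the start event to every other recorded event."""
    start = timeline[start_event]
    return {
        name: (moment - start).total_seconds()
        for name, moment in timeline.items()
        if name != start_event
    }


# Hypothetical timeline, not real measurements.
t0 = datetime(2018, 10, 6, 10, 0, 0)
example_timeline = {
    "failure_injected": t0,
    "detected": t0 + timedelta(seconds=45),
    "team_notified": t0 + timedelta(minutes=2),
    "fully_recovered": t0 + timedelta(minutes=9),
}
print(durations_since(example_timeline))
```

Tracking the same durations across repeated experiments shows whether detection and recovery are actually improving.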

PostMortems
The 5 WHYs
Outage
Because of
…
Because of
…
Because of
…
Because of
…

Big Challenges to Chaos Engineering
Mostly Cultural
• no time or flexibility to simulate disasters.
• teams already spending all of their time fixing things.
• can be very political.
• might force deep conversations.
• deeply invested in a specific technical roadmap (microservices) that chaos engineering tests show is not as resilient to failures as originally predicted.

Changing Culture takes time!
Be patient…

More Resources
• https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
• https://www.gremlin.com
• https://queue.acm.org/detail.cfm?id=2353017
• https://softwareengineeringdaily.com/
• https://github.com/dastergon/awesome-sre
• https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
• https://medium.com/@NetflixTechBlog
• http://principlesofchaos.org
• https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp
• https://github.com/adhorn/awesome-chaos-engineering
• https://www.infoq.com/presentations/netflix-chaos-microservices
• http://royal.pingdom.com/wp-content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf
• http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy

Thank you
@madhushekar23
