SlideShare a Scribd company logo
1 of 72
Download to read offline
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Adrian Hornsby
Cloud Infrastructure – Amazon Web Services
Chaos Engineering:
Why Breaking Things Should Be
Practiced.
@adhorn
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Been there?
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Failures are a given and
everything will eventually
fail over time.
Werner Vogels
CTO – Amazon.com
“ “
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
… at the Edge
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building Confidence Through Testing
Unit testing of components:
• Tested in isolation to ensure function meets expectations.
Functional testing of integrations:
• Each execution path tested to assure expected results.
Is it enough???
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Jesse Robbins
GameDay: Creating Resiliency Through Destruction
https://www.youtube.com/watch?v=zoz0ZjfrQ9s
Netflix 2013
https://medium.com/netflix-techblog
Chaos Monkeys
https://github.com/Netflix/SimianArmy
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
https://bit.ly/2uKOJMQ
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Twilio Use-Case
Discovering Issues with HTTP/2 via Chaos Testing
https://www.twilio.com/blog/2017/10/http2-issues.html
”While HTTP/2 provides for a
number of improvements over
HTTP/1.x, via Chaos
Testing we discovered that
there are situations where
HTTP/2 will perform worse than
HTTP/1.”
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
What “really” is Chaos Engineering?
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
“Chaos Engineering is the discipline of
experimenting on a distributed system
in order to build confidence in the system’s
capability to withstand turbulent conditions in
production.”
http://principlesofchaos.org
Break your systems on
purpose.
Find out their weaknesses and
fix them before they break
when least expected.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Failure Injection
• Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attacks!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
“CHAOS DOESN’T CAUSE PROBLEMS.
IT REVEALS THEM.”
Nora Jones
Senior Chaos Engineer, Netflix
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Before breaking things …
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
People
Application
Network & Data
Infrastructure
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Infrastructure
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Availability
Availability Downtime per year
99% (2-nines) 3 days 15 hours
99.99% (4-nines) 52 minutes
99.999% (5-nines) 5 minutes
99.9999% (6-nines) 31 seconds
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
System Availability
Availability =
Normal Operation Time
Total Time
MTBF**
MTBF** + MTTR*
=
* Mean Time To Repair (MTTR)
**Mean Time Between Failure (MTBF)
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Availability in Parallel
Component Availability Downtime
X 99% (2-nines) 3 days 15 hours
Two X in parallel 99.99% (4-nines) 52 minutes
Three X in parallel 99.9999% (6-nines) 31 seconds
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Availability Zone 1 Availability Zone 2 Availability Zone n
Multi-AZ
Support Instance Failure
Application
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Auto-Scaling • Compute efficiency
• Node failure
• Traffic spikes
• Performance bugs
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Immutable Infrastructure
• No updates on live systems
• Always start from a new resource being provisioned
• Deploy the new software
• Test in different environments (dev, staging)
• Deploy to prod (inactive)
• Change references (DNS or Load Balancer)
• Keep old version around (inactive)
• Fast rollback if things go wrong
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Infrastructure as Code
• Template of the infrastructure in code.
• Version controlled infrastructure.
• Repeatable template.
• Testable infrastructure.
• Automate it!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Network & Data
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Read / Write Sharding
RDS DB Instance
Read Replica
App
Instance
App
Instance
App
Instance
RDS DB Instance
Master (Multi-AZ)
RDS DB Instance
Read Replica
RDS DB Instance
Read Replica
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Database Federation
Users
DB
Products
DB
App
Instance
App
Instance
App
Instance
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Database Sharding
User ShardID
002345 A
002346 B
002347 C
002348 B
002349 A
CBA
App
Instance
App
Instance
App
Instance
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Message passing for async. patterns
A
Queue
B
A
Queue
BListener
Pub-Sub
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Web
Instances
Worker
Instance
Worker
Instance
Queue
API
Instance
API
Instance
API
Instance
API: {DO foo}
PUT JOB: {JobID: 0001, Task: DO foo}
API: {JobID: 0001}
GET JOB: {JobID: 0001, Task: DO foo}
Cache
Result:
{
JobID: 0001,
Result: bar
}
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Worker
Instance
Worker
Instance
Queue
API
Instance
API
Instance
API
Instance
Cache
Amazon SNS
Push Notification
User
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Exponential Backoff
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Circuit Breaker
• Wrap a protected function call in a
circuit breaker object, which
monitors for failures.
• If failures reach a certain
threshold, the circuit breaker trips.
https://martinfowler.com/bliki/CircuitBreaker.html
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Dynamic Routing with Route53
1. Latency Based Routing
2. Geo DNS
3. Weighted Round Robin
4. DNS Failover
Amazon
Route53
Resource A
In US
Resource B
in EU
User in US
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Dynamic Routing
1. Improve Latency for end-users
2. Disaster Recovery
Applications
in US West
Applications
in US East
Users from
San Francisco
Users from
New York
Service 1
Service 2
Service 3
Service 4
Service 1
Service 2
Service 3
Service 4
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Application
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Stateless Services
AZ1
AZ2
AWS Region
Data Store
Cache
Auto-ScalingGroup
User
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Transient state does not
belong in the database.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
CAP Theorem
Consistency Availability Partition Tolerance
Data is consistent.
All nodes see the same state.
Every request is non-failing. Service still responds as expected
if some nodes crash.
Distributed System
In the presence of a network partition, you must
choose between consistency and availability!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Eventual Consistency
… if no new updates are
made to a given data item,
eventually all accesses to that
item will return the last
updated value.
Availability
An eventually consistent system can
return any value before it converges!!
https://en.wikipedia.org/wiki/Eventual_consistency
Distributed System
Every request is non-failing.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Process A Process B Process A Process B
Synchronous Asynchronous
Waiting
Working
Continues
get or fetch resultGet result
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Non-blocking UI
https://medium.com/@sophie_paxtonUX/stop-getting-in-my-way-non-blocking-ux-5cbbfe0f0158
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Exception Handling
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Service Degradation & Fallbacks
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
People
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
“It is not failure itself that holds you back; it
is the fear of failure that paralyses you.”
Brian Tracy
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Conway’s Law
User
UI Team
Application Team
DBA Team
”Any organization that designs a system (defined broadly)
will produce a design whose structure is a copy of the
organization's communication structure.”
http://www.melconway.com/Home/Conways_Law.html
Siloed Teams
Siloed
Applications
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Conway’s Law
”Any organization that designs a system (defined broadly)
will produce a design whose structure is a copy of the
organization's communication structure.”
http://www.melconway.com/Home/Conways_Law.html
Services
Cross-Functional
Teams
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Fire Drills
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Phases of Chaos Engineering
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Steady
State
Hypothesis
Design
Experiment
Verify
& Learn
Fix
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
What is Steady State?
• ”normal” behavior of your system
https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
What is Steady State?
• ”normal” behavior of your system
• Business Metric
https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Business Metrics at work
Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden).
Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer).
Yahoo!: 400 ms of extra load time caused a 5–9% increase in the
number of people who clicked “back” before the page even loaded (Nicole
Sullivan).
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Steady State
Important:
• Know the value range of Healthy State!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hypothesis: What if…?
“What if this load balancer breaks?”
“What if Redis becomes slow?”
“What if a host on Cassandra goes away?”
”What if latency increases by 300ms?”
”What if the database stops?”
Make it everyone’s problem!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Disclaimer!
Don’t make an hypothesis that you
know will break you!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Designing Experiment
• Pick hypothesis
• Scope the experiment
• Identify metrics
• Notify the organization
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Rules of thumbs
• Start with very small
• As close as possible to production
• Minimize the blast radius.
• Have an emergency STOP!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
New Version
Users
Run the Experiment: Canary deployment
Old Version
99%
Users
1%
Users
Start with ..
Dynamic Routing
(Route53)
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Verify & Learn: Quantifying the result of the experiment
• Time to detect?
• Time for notification? And escalation?
• Time to public notification?
• Time for graceful degradation to kick-in?
• Time for self healing to happen?
• Time to recovery – partial and full?
• Time to all-clear and stable?
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DON’T blame that one person …
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
PostMortems
The 5 WHYs
Outage
Because of
…
Because of
…
Because of
…
Because of
…
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
More questions to ask.
• Can you clarify if there were any preceding events?
• Why would they believe acting in this way was the best course of action to
deliver the desired outcome?
• Is there another failure mode that could present here?
• What decisions or events prior to this made this work before?
• Why stop there – are there places to dig deeper that could shine a light more
on this?
• Did others step in to help, to advise, or to intercede?
http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy/
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Fixing the issues!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Big Challenges to Chaos Engineering
Mostly Cultural
• no time or flexibility to simulate disasters.
• teams already spending all of its time fixing things.
• can be very political.
• might force deep conversations.
• deeply invested in a specific technical roadmap (micro-services)
that chaos engineering tests show is not as resilient to failures
as originally predicted.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Changing Culture takes time!
Be patient…
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
More Resources
• https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
• https://www.gremlin.com
• https://queue.acm.org/detail.cfm?id=2353017
• https://softwareengineeringdaily.com/
• https://github.com/dastergon/awesome-sre
• https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
• https://medium.com/@NetflixTechBlog
• http://principlesofchaos.org
• https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp
• https://github.com/adhorn/awesome-chaos-engineering
• https://www.infoq.com/presentations/netflix-chaos-microservices
• http://royal.pingdom.com/wp-
content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf
• http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Thank you!
@adhorn
https://medium.com/@adhorn

More Related Content

What's hot

An Introduction to Chaos Engineering
An Introduction to Chaos EngineeringAn Introduction to Chaos Engineering
An Introduction to Chaos EngineeringGremlin
 
Chaos engineering and chaos testing
Chaos engineering and chaos testingChaos engineering and chaos testing
Chaos engineering and chaos testingjeetendra mandal
 
Choose your own adventure Chaos Engineering - QCon NYC 2017
Choose your own adventure Chaos Engineering - QCon NYC 2017 Choose your own adventure Chaos Engineering - QCon NYC 2017
Choose your own adventure Chaos Engineering - QCon NYC 2017 Nora Jones
 
Chaos Engineering - The Art of Breaking Things in Production
Chaos Engineering - The Art of Breaking Things in ProductionChaos Engineering - The Art of Breaking Things in Production
Chaos Engineering - The Art of Breaking Things in ProductionKeet Sugathadasa
 
Chaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient SystemsChaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient SystemsC4Media
 
Introduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft AzureIntroduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft AzureAna Medina
 
Chaos Engineering with Kubernetes
Chaos Engineering with KubernetesChaos Engineering with Kubernetes
Chaos Engineering with KubernetesArun Gupta
 
From Monolithic to Microservices
From Monolithic to Microservices From Monolithic to Microservices
From Monolithic to Microservices Amazon Web Services
 
Cloud native principles
Cloud native principlesCloud native principles
Cloud native principlesDiego Pacheco
 
Design patterns for microservice architecture
Design patterns for microservice architectureDesign patterns for microservice architecture
Design patterns for microservice architectureThe Software House
 
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and HailoMicroservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailogjuljo
 
Comprehensive Terraform Training
Comprehensive Terraform TrainingComprehensive Terraform Training
Comprehensive Terraform TrainingYevgeniy Brikman
 
Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)Abeer R
 
A Pattern Language for Microservices
A Pattern Language for MicroservicesA Pattern Language for Microservices
A Pattern Language for MicroservicesChris Richardson
 

What's hot (20)

An Introduction to Chaos Engineering
An Introduction to Chaos EngineeringAn Introduction to Chaos Engineering
An Introduction to Chaos Engineering
 
Chaos engineering and chaos testing
Chaos engineering and chaos testingChaos engineering and chaos testing
Chaos engineering and chaos testing
 
Choose your own adventure Chaos Engineering - QCon NYC 2017
Choose your own adventure Chaos Engineering - QCon NYC 2017 Choose your own adventure Chaos Engineering - QCon NYC 2017
Choose your own adventure Chaos Engineering - QCon NYC 2017
 
Chaos Engineering - The Art of Breaking Things in Production
Chaos Engineering - The Art of Breaking Things in ProductionChaos Engineering - The Art of Breaking Things in Production
Chaos Engineering - The Art of Breaking Things in Production
 
Chaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient SystemsChaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient Systems
 
DevOps and AWS
DevOps and AWSDevOps and AWS
DevOps and AWS
 
Introduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft AzureIntroduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft Azure
 
CI/CD on AWS
CI/CD on AWSCI/CD on AWS
CI/CD on AWS
 
infrastructure as code
infrastructure as codeinfrastructure as code
infrastructure as code
 
Introduction to Microservices
Introduction to MicroservicesIntroduction to Microservices
Introduction to Microservices
 
Chaos Engineering with Kubernetes
Chaos Engineering with KubernetesChaos Engineering with Kubernetes
Chaos Engineering with Kubernetes
 
From Monolithic to Microservices
From Monolithic to Microservices From Monolithic to Microservices
From Monolithic to Microservices
 
DevOps on AWS
DevOps on AWSDevOps on AWS
DevOps on AWS
 
Cloud native principles
Cloud native principlesCloud native principles
Cloud native principles
 
Design patterns for microservice architecture
Design patterns for microservice architectureDesign patterns for microservice architecture
Design patterns for microservice architecture
 
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and HailoMicroservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
 
DevOps on AWS
DevOps on AWSDevOps on AWS
DevOps on AWS
 
Comprehensive Terraform Training
Comprehensive Terraform TrainingComprehensive Terraform Training
Comprehensive Terraform Training
 
Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)
 
A Pattern Language for Microservices
A Pattern Language for MicroservicesA Pattern Language for Microservices
A Pattern Language for Microservices
 

Similar to Chaos Engineering

Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.Adrian Hornsby
 
Keynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practicedKeynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practicedAWS User Group Bengaluru
 
Keynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos EngineeringKeynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos EngineeringAmazon Web Services
 
Fully Realizing the Microservices Vision with Service Mesh (DEV312-S) - AWS r...
Fully Realizing the Microservices Vision with Service Mesh (DEV312-S) - AWS r...Fully Realizing the Microservices Vision with Service Mesh (DEV312-S) - AWS r...
Fully Realizing the Microservices Vision with Service Mesh (DEV312-S) - AWS r...Amazon Web Services
 
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018Amazon Web Services
 
[REPEAT 1] Safeguard the Integrity of Your Code for Fast and Secure Deploymen...
[REPEAT 1] Safeguard the Integrity of Your Code for Fast and Secure Deploymen...[REPEAT 1] Safeguard the Integrity of Your Code for Fast and Secure Deploymen...
[REPEAT 1] Safeguard the Integrity of Your Code for Fast and Secure Deploymen...Amazon Web Services
 
Safeguard the Integrity of Your Code for Fast and Secure Deployments (DEV349-...
Safeguard the Integrity of Your Code for Fast and Secure Deployments (DEV349-...Safeguard the Integrity of Your Code for Fast and Secure Deployments (DEV349-...
Safeguard the Integrity of Your Code for Fast and Secure Deployments (DEV349-...Amazon Web Services
 
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018Amazon Web Services
 
Building Global Multi-Region, Active-Active Serverless Backends
Building Global Multi-Region, Active-Active Serverless Backends Building Global Multi-Region, Active-Active Serverless Backends
Building Global Multi-Region, Active-Active Serverless Backends Amazon Web Services
 
Moving 400 Engineers to AWS: Our Journey to Secure Adoption (SEC306-S) - AWS ...
Moving 400 Engineers to AWS: Our Journey to Secure Adoption (SEC306-S) - AWS ...Moving 400 Engineers to AWS: Our Journey to Secure Adoption (SEC306-S) - AWS ...
Moving 400 Engineers to AWS: Our Journey to Secure Adoption (SEC306-S) - AWS ...Amazon Web Services
 
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...Amazon Web Services
 
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...Amazon Web Services
 
Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...
Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...
Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...Amazon Web Services
 
Resiliency and Availability Design Patterns for the Cloud
Resiliency and Availability Design Patterns for the CloudResiliency and Availability Design Patterns for the Cloud
Resiliency and Availability Design Patterns for the CloudAmazon Web Services
 
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018Amazon Web Services
 
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...Amazon Web Services
 
Applying the Twelve-Factor App Methodology to Serverless Applications (SRV218...
Applying the Twelve-Factor App Methodology to Serverless Applications (SRV218...Applying the Twelve-Factor App Methodology to Serverless Applications (SRV218...
Applying the Twelve-Factor App Methodology to Serverless Applications (SRV218...Amazon Web Services
 
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...Amazon Web Services
 
From Monolith to Modern Apps: Best Practices (SRV322-R2) - AWS re:Invent 2018
From Monolith to Modern Apps: Best Practices (SRV322-R2) - AWS re:Invent 2018From Monolith to Modern Apps: Best Practices (SRV322-R2) - AWS re:Invent 2018
From Monolith to Modern Apps: Best Practices (SRV322-R2) - AWS re:Invent 2018Amazon Web Services
 
Life of a Code Change to a Tier 1 Service - AWS Online Tech Talks
Life of a Code Change to a Tier 1 Service - AWS Online Tech TalksLife of a Code Change to a Tier 1 Service - AWS Online Tech Talks
Life of a Code Change to a Tier 1 Service - AWS Online Tech TalksAmazon Web Services
 

Similar to Chaos Engineering (20)

Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.
 
Keynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practicedKeynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practiced
 
Keynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos EngineeringKeynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos Engineering
 
Fully Realizing the Microservices Vision with Service Mesh (DEV312-S) - AWS r...
Fully Realizing the Microservices Vision with Service Mesh (DEV312-S) - AWS r...Fully Realizing the Microservices Vision with Service Mesh (DEV312-S) - AWS r...
Fully Realizing the Microservices Vision with Service Mesh (DEV312-S) - AWS r...
 
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
 
[REPEAT 1] Safeguard the Integrity of Your Code for Fast and Secure Deploymen...
[REPEAT 1] Safeguard the Integrity of Your Code for Fast and Secure Deploymen...[REPEAT 1] Safeguard the Integrity of Your Code for Fast and Secure Deploymen...
[REPEAT 1] Safeguard the Integrity of Your Code for Fast and Secure Deploymen...
 
Safeguard the Integrity of Your Code for Fast and Secure Deployments (DEV349-...
Safeguard the Integrity of Your Code for Fast and Secure Deployments (DEV349-...Safeguard the Integrity of Your Code for Fast and Secure Deployments (DEV349-...
Safeguard the Integrity of Your Code for Fast and Secure Deployments (DEV349-...
 
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
 
Building Global Multi-Region, Active-Active Serverless Backends
Building Global Multi-Region, Active-Active Serverless Backends Building Global Multi-Region, Active-Active Serverless Backends
Building Global Multi-Region, Active-Active Serverless Backends
 
Moving 400 Engineers to AWS: Our Journey to Secure Adoption (SEC306-S) - AWS ...
Moving 400 Engineers to AWS: Our Journey to Secure Adoption (SEC306-S) - AWS ...Moving 400 Engineers to AWS: Our Journey to Secure Adoption (SEC306-S) - AWS ...
Moving 400 Engineers to AWS: Our Journey to Secure Adoption (SEC306-S) - AWS ...
 
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
 
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
 
Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...
Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...
Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...
 
Resiliency and Availability Design Patterns for the Cloud
Resiliency and Availability Design Patterns for the CloudResiliency and Availability Design Patterns for the Cloud
Resiliency and Availability Design Patterns for the Cloud
 
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
 
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...
 
Applying the Twelve-Factor App Methodology to Serverless Applications (SRV218...
Applying the Twelve-Factor App Methodology to Serverless Applications (SRV218...Applying the Twelve-Factor App Methodology to Serverless Applications (SRV218...
Applying the Twelve-Factor App Methodology to Serverless Applications (SRV218...
 
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...
 
From Monolith to Modern Apps: Best Practices (SRV322-R2) - AWS re:Invent 2018
From Monolith to Modern Apps: Best Practices (SRV322-R2) - AWS re:Invent 2018From Monolith to Modern Apps: Best Practices (SRV322-R2) - AWS re:Invent 2018
From Monolith to Modern Apps: Best Practices (SRV322-R2) - AWS re:Invent 2018
 
Life of a Code Change to a Tier 1 Service - AWS Online Tech Talks
Life of a Code Change to a Tier 1 Service - AWS Online Tech TalksLife of a Code Change to a Tier 1 Service - AWS Online Tech Talks
Life of a Code Change to a Tier 1 Service - AWS Online Tech Talks
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Chaos Engineering

  • 1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Adrian Hornsby Cloud Infrastructure – Amazon Web Services Chaos Engineering: Why Breaking Things Should Be Practiced. @adhorn
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Been there?
  • 3. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Failures are a given and everything will eventually fail over time. Werner Vogels CTO – Amazon.com “ “
  • 4. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. … at the Edge
  • 5. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Building Confidence Through Testing Unit testing of components: • Tested in isolation to ensure function meets expectations. Functional testing of integrations: • Each execution path tested to assure expected results. Is it enough???
  • 6. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Jesse Robbins GameDay: Creating Resiliency Through Destruction https://www.youtube.com/watch?v=zoz0ZjfrQ9s
  • 9.
  • 10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. https://bit.ly/2uKOJMQ
  • 11. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Twilio Use-Case Discovering Issues with HTTP/2 via Chaos Testing https://www.twilio.com/blog/2017/10/http2-issues.html ”While HTTP/2 provides for a number of improvements over HTTP/1.x, via Chaos Testing we discovered that there are situations where HTTP/2 will perform worse than HTTP/1.”
  • 12. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. What “really” is Chaos Engineering?
  • 13. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” http://principlesofchaos.org
  • 14. Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.
  • 15. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Failure Injection • Start small & build confidence • Application level • Host failure • Resource attacks (CPU, memory, …) • Network attacks (dependencies, latency, …) • Region attacks!
  • 16.
  • 17. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. “CHAOS DOESN’T CAUSE PROBLEMS. IT REVEALS THEM.” Nora Jones Senior Chaos Engineer, Netflix
  • 18. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Before breaking things …
  • 19. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. People Application Network & Data Infrastructure
  • 20. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Infrastructure
  • 21. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Availability Availability Downtime per year 99% (2-nines) 3 days 15 hours 99.99% (4-nines) 52 minutes 99.999% (5-nines) 5 minutes 99.9999% (6-nines) 31 seconds
  • 22. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. System Availability Availability = Normal Operation Time Total Time MTBF** MTBF** + MTTR* = * Mean Time To Repair (MTTR) **Mean Time Between Failure (MTBF)
  • 23. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Availability in Parallel Component Availability Downtime X 99% (2-nines) 3 days 15 hours Two X in parallel 99.99% (4-nines) 52 minutes Three X in parallel 99.9999% (6-nines) 31 seconds
  • 24. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Availability Zone 1 Availability Zone 2 Availability Zone n Multi-AZ Support Instance Failure Application
  • 25. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Auto-Scaling • Compute efficiency • Node failure • Traffic spikes • Performance bugs
  • 26. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Immutable Infrastructure • No updates on live systems • Always start from a new resource being provisioned • Deploy the new software • Test in different environments (dev, staging) • Deploy to prod (inactive) • Change references (DNS or Load Balancer) • Keep old version around (inactive) • Fast rollback if things go wrong
  • 27. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Infrastructure as Code • Template of the infrastructure in code. • Version controlled infrastructure. • Repeatable template. • Testable infrastructure. • Automate it!
  • 28. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Network & Data
  • 29. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Read / Write Sharding RDS DB Instance Read Replica App Instance App Instance App Instance RDS DB Instance Master (Multi-AZ) RDS DB Instance Read Replica RDS DB Instance Read Replica
  • 30. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Database Federation Users DB Products DB App Instance App Instance App Instance
  • 31. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Database Sharding User ShardID 002345 A 002346 B 002347 C 002348 B 002349 A CBA App Instance App Instance App Instance
  • 32. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Message passing for async. patterns A Queue B A Queue BListener Pub-Sub
  • 33. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Web Instances Worker Instance Worker Instance Queue API Instance API Instance API Instance API: {DO foo} PUT JOB: {JobID: 0001, Task: DO foo} API: {JobID: 0001} GET JOB: {JobID: 0001, Task: DO foo} Cache Result: { JobID: 0001, Result: bar }
  • 34. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Worker Instance Worker Instance Queue API Instance API Instance API Instance Cache Amazon SNS Push Notification User
  • 35. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Exponential Backoff
  • 36. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Circuit Breaker • Wrap a protected function call in a circuit breaker object, which monitors for failures. • If failures reach a certain threshold, the circuit breaker trips. https://martinfowler.com/bliki/CircuitBreaker.html
  • 37. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Dynamic Routing with Route53 1. Latency Based Routing 2. Geo DNS 3. Weighted Round Robin 4. DNS Failover Amazon Route53 Resource A In US Resource B in EU User in US
  • 38. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Dynamic Routing 1. Improve Latency for end-users 2. Disaster Recovery Applications in US West Applications in US East Users from San Francisco Users from New York Service 1 Service 2 Service 3 Service 4 Service 1 Service 2 Service 3 Service 4
  • 39. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Application
  • 40. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Stateless Services AZ1 AZ2 AWS Region Data Store Cache Auto-ScalingGroup User
  • 41. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Transient state does not belong in the database.
  • 42. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. CAP Theorem Consistency Availability Partition Tolerance Data is consistent. All nodes see the same state. Every request is non-failing. Service still responds as expected if some nodes crash. Distributed System In the presence of a network partition, you must choose between consistency and availability!
  • 43. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Eventual Consistency … if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. Availability An eventually consistent system can return any value before it converges!! https://en.wikipedia.org/wiki/Eventual_consistency Distributed System Every request is non-failing.
  • 44. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Process A Process B Process A Process B Synchronous Asynchronous Waiting Working Continues get or fetch resultGet result
  • 45. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Non-blocking UI https://medium.com/@sophie_paxtonUX/stop-getting-in-my-way-non-blocking-ux-5cbbfe0f0158
  • 46. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Exception Handling
  • 47. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Service Degradation & Fallbacks
  • 48. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. People
  • 49. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. “It is not failure itself that holds you back; it is the fear of failure that paralyses you.” Brian Tracy
  • 50. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Conway’s Law User UI Team Application Team DBA Team ”Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.” http://www.melconway.com/Home/Conways_Law.html Siloed Teams Siloed Applications
  • 51. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Conway’s Law ”Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.” http://www.melconway.com/Home/Conways_Law.html Services Cross-Functional Teams
  • 52. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Fire Drills
  • 53. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Phases of Chaos Engineering
  • 54. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Steady State Hypothesis Design Experiment Verify & Learn Fix
  • 55. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. What is Steady State? • ”normal” behavior of your system https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
  • 56. What is Steady State? • ”normal” behavior of your system • Business Metric https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
  • 57. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Business Metrics at work Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden). Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer). Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).
  • 58. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Steady State Important: • Know the value range of Healthy State!
  • 59. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Hypothesis: What if…? “What if this load balancer breaks?” “What if Redis becomes slow?” “What if a host on Cassandra goes away?” ”What if latency increases by 300ms?” ”What if the database stops?” Make it everyone’s problem!
  • 60. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Disclaimer! Don’t make an hypothesis that you know will break you!
  • 61. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Designing Experiment • Pick hypothesis • Scope the experiment • Identify metrics • Notify the organization
  • 62. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Rules of thumbs • Start with very small • As close as possible to production • Minimize the blast radius. • Have an emergency STOP!
  • 63. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. New Version Users Run the Experiment: Canary deployment Old Version 99% Users 1% Users Start with .. Dynamic Routing (Route53)
  • 64. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Verify & Learn: Quantifying the result of the experiment • Time to detect? • Time for notification? And escalation? • Time to public notification? • Time for graceful degradation to kick-in? • Time for self healing to happen? • Time to recovery – partial and full? • Time to all-clear and stable?
  • 65. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DON’T blame that one person …
  • 66. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. PostMortems The 5 WHYs Outage Because of … Because of … Because of … Because of …
  • 67. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. More questions to ask. • Can you clarify if there were any preceding events? • Why would they believe acting in this way was the best course of action to deliver the desired outcome? • Is there another failure mode that could present here? • What decisions or events prior to this made this work before? • Why stop there – are there places to dig deeper that could shine a light more on this? • Did others step in to help, to advise, or to intercede? http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy/
  • 68. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Fixing the issues!
  • 69. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Big Challenges to Chaos Engineering Mostly Cultural • no time or flexibility to simulate disasters. • teams already spending all of its time fixing things. • can be very political. • might force deep conversations. • deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.
  • 70. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Changing Culture takes time! Be patient…
  • 71. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. More Resources • https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf • https://www.gremlin.com • https://queue.acm.org/detail.cfm?id=2353017 • https://softwareengineeringdaily.com/ • https://github.com/dastergon/awesome-sre • https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf • https://medium.com/@NetflixTechBlog • http://principlesofchaos.org • https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp • https://github.com/adhorn/awesome-chaos-engineering • https://www.infoq.com/presentations/netflix-chaos-microservices • http://royal.pingdom.com/wp- content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf • http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy
  • 72. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Thank you! @adhorn https://medium.com/@adhorn