© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark
Adrian Hornsby, Cloud Architecture Evangelist
@adhorn
Chaos Engineering:
Why Breaking Things Should Be Practiced.
https://xkcd.com/1428/
Complex systems
Amazon Twitter Netflix
Partial failure mode
Resiliency: the ability of a system to handle, and
eventually recover from, unexpected conditions
People
Application
Network & Data
Infrastructure
Building confidence through testing
Unit testing of components:
• Tested in isolation to ensure function meets expectations.
Functional testing of integrations:
• Each execution path tested to assure expected results.
Is it enough???
GameDay at Amazon
Creating Resiliency Through Destruction
https://www.youtube.com/watch?v=zoz0ZjfrQ9s
Chaos engineering
https://github.com/Netflix/SimianArmy
Failure injection
• Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attacks
• “Paul” attack
https://www.gremlin.com
https://github.com/Netflix/SimianArmy
https://chaostoolkit.org
“Chaos Engineering is the discipline of
experimenting on a distributed system
in order to build confidence in the system’s
capability to withstand turbulent conditions in
production.”
http://principlesofchaos.org
Break your systems on purpose.
Find out their weaknesses and fix
them before they break when
least expected.
“Chaos doesn’t cause problems.
It reveals them.”
Nora Jones
Senior Chaos Engineer, Netflix
https://bit.ly/2uKOJMQ
Phases of Chaos Engineering
Steady State
Hypothesis
Design & Run Experiment
Verify & Learn
Fix
Steady State
What is steady state?
• “normal” behavior of your system
https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
• Business Metric
https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
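A minimal sketch of a steady-state check, assuming the business metric is sampled as a single number (for example, starts per second) and that a fixed relative tolerance around a baseline counts as “normal”. The function name, the baseline value, and the 5% tolerance are illustrative assumptions, not from the talk.

```python
def within_steady_state(observed: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Return True if the observed metric is within +/- tolerance of the baseline.

    Hypothetical helper: the 5% default is an assumption, not a recommendation.
    """
    return abs(observed - baseline) <= tolerance * baseline


if __name__ == "__main__":
    baseline_sps = 1000.0  # assumed "normal" value of the business metric
    for sample in (980.0, 900.0):
        print(sample, within_steady_state(sample, baseline_sps))
```

During an experiment, a sample that falls outside the band is a signal to stop and investigate.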
Business metrics at work
Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden).
Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer).
Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number
of people who clicked “back” before the page even loaded (Nicole Sullivan).
Hypothesis
What if…?
“What if this load balancer breaks?”
“What if Redis becomes slow?”
“What if a host on Cassandra goes away?”
“What if latency increases by 300ms?”
“What if the database stops?”
Make it everyone’s problem!
Disclaimer!
Don’t make a hypothesis that you know
will break you!
Design & Run Experiment
Designing the experiment
• Pick hypothesis
• Scope the experiment
• Identify metrics
• Notify the organization
Rules of thumb
• Start very small.
• Stay as close as possible to production.
• Minimize the blast radius.
• Have an emergency STOP!
Running a Chaos Experiment
[Diagram: users are split between the normal version (99% of users) and the canary deployment (1% of users)]
Start with ..
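One way to pick a small, stable slice of users for an experiment, sketched under assumptions: hash each user ID into 100 buckets and send the lowest buckets to the canary. The helper name `in_canary`, the `percent` parameter, and the choice of SHA-256 are illustrative and not from the talk.

```python
import hashlib


def in_canary(user_id: str, percent: int = 1) -> bool:
    """Return True if this user belongs to the canary slice (hypothetical helper)."""
    # A stable hash keeps each user in the same group across requests.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket number in 0..99
    return bucket < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    share = sum(in_canary(u) for u in users) / len(users)
    print(f"canary share: {share:.2%}")
```

Because the assignment depends only on the user ID, the blast radius stays limited to the same small group for the whole experiment.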
Verify & Learn
Quantifying the result of the experiment
• Time to detect?
• Time for notification? And escalation?
• Time to public notification?
• Time for graceful degradation to kick in?
• Time for self-healing to happen?
• Time to recovery – partial and full?
• Time to all-clear and stable?
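The questions above can be answered from a timeline of the experiment. A minimal sketch, assuming each milestone is recorded as a timestamp; the event names (`fault_injected`, `detected`, and so on) are hypothetical labels chosen for illustration.

```python
from datetime import datetime, timedelta


def experiment_timings(events: dict) -> dict:
    """Return the elapsed time from fault injection to every other milestone.

    `events` maps a milestone name to a datetime and must contain the
    (hypothetical) key "fault_injected".
    """
    start = events["fault_injected"]
    return {name: ts - start for name, ts in events.items() if name != "fault_injected"}


if __name__ == "__main__":
    t0 = datetime(2018, 11, 6, 12, 0, 0)  # made-up example timeline
    timeline = {
        "fault_injected": t0,
        "detected": t0 + timedelta(minutes=2),
        "notified": t0 + timedelta(minutes=3),
        "recovered": t0 + timedelta(minutes=9),
    }
    for name, elapsed in experiment_timings(timeline).items():
        print(f"time to {name}: {elapsed}")
```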
DON’T blame that one person …
PostMortems – COE (Correction of Errors)
The 5 WHYs
Outage → Because of … → Because of … → Because of … → Because of …
NOT ENOUGH
More questions to ask
• Can you clarify if there were any preceding events?
• Why would they believe acting in this way was the best course of action to deliver the desired outcome?
• Is there another failure mode that could present here?
• What decisions or events prior to this made this work before?
• Why stop there – are there places to dig deeper that could shine a light more on this?
• Did others step in to help, to advise, or to intercede?
Rules to remember!
1. Failure requires multiple faults
2. There is no isolated ‘cause’ of an accident.
3. There are multiple contributors to accidents.
Fix
Patterns for Resilient Architecture
Availability in parallel
A = 1 - (1 - A_x)^2
[Diagram: two instances of Part X in parallel]
Availability in parallel
Component              Availability          Downtime (per year)
X                      99% (2-nines)         3 days 15 hours
Two X in parallel      99.99% (4-nines)      52 minutes
Three X in parallel    99.9999% (6-nines)    31 seconds
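The figures in the table can be reproduced with a short calculation. This sketch generalizes the formula from two to n parallel components, A = 1 - (1 - A_x)^n, which assumes the components fail independently; the function names and the 365-day year are illustrative choices.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def parallel_availability(a_x: float, n: int) -> float:
    """Availability of n independent copies of a component in parallel."""
    # The system is down only when all n copies are down at once.
    return 1 - (1 - a_x) ** n


def yearly_downtime_seconds(availability: float) -> float:
    """Expected downtime per year, in seconds, for a given availability."""
    return (1 - availability) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for copies in (1, 2, 3):
        a = parallel_availability(0.99, copies)
        print(f"{copies} x X: availability {a:.6%}, "
              f"downtime {yearly_downtime_seconds(a):,.1f} s/year")
```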
Multi-AZ architecture
Region
Availability zone a Availability zone b Availability zone c
Application
Stateless Services
[Diagram: an AWS Region with Availability zone 1 and Availability zone 2; Service A runs in an Auto Scaling group in each zone, and Service B runs in each zone]
Auto-Scaling
• Compute efficiency
• Node failure
• Traffic spikes
• Performance bugs
Auto-scaling
[Diagram: an AWS Region with Availability zone 1 and Availability zone 2; Service A runs in an Auto Scaling group in each zone, and Service B runs in each zone]
Decoupling with async pattern
[Diagram labels: Listener, Pub-Sub, Queue; components A and B]
API: {DO foo}
PUT JOB: {JobID: 0001, Task: DO foo}
API: {JobID: 0001}
GET JOB: {JobID: 0001, Task: DO foo}
{JobID: 0001, Result: bar}
[Diagram: API instances, a Queue, Worker instances, and a Cache node]
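The message flow above can be sketched in a single process. This is only an illustration under assumptions: an in-memory `queue.Queue` stands in for the queue, a plain dict stands in for the cache node, and the function names `submit`, `work_once`, and `fetch` are hypothetical.

```python
import queue
import uuid

job_queue = queue.Queue()  # stands in for the queue between API and workers
result_cache = {}          # stands in for the cache node


def submit(task: str) -> str:
    """API side: enqueue the task and return a job ID immediately."""
    job_id = uuid.uuid4().hex
    job_queue.put({"JobID": job_id, "Task": task})
    return job_id


def work_once(handler) -> None:
    """Worker side: take one job, run it, and store the result in the cache."""
    job = job_queue.get()
    result_cache[job["JobID"]] = handler(job["Task"])
    job_queue.task_done()


def fetch(job_id: str):
    """API side: return the result if it is ready, otherwise None."""
    return result_cache.get(job_id)


if __name__ == "__main__":
    jid = submit("DO foo")
    print(fetch(jid))           # result not ready yet
    work_once(lambda task: "bar")
    print(fetch(jid))           # result stored by the worker
```

The caller never waits on the worker: it gets a job ID back at once and checks for the result later.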
Push Notification
[Diagram: User, API instances, Queue, Worker instances, and Cache node; the user receives a push notification and then fetches results]
Degrade & prioritize traffic with queues
[Diagram: API instances, a HighPriorityQueue, a LowPriorityQueue, and Worker instances]
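A minimal sketch of prioritizing and degrading with two queues, assuming workers always drain the high-priority queue first and that low-priority work is rejected once its backlog passes a limit. The function names and the `max_low_backlog` limit are illustrative assumptions, not from the talk.

```python
import queue

high_priority = queue.Queue()
low_priority = queue.Queue()


def enqueue(job, important: bool, max_low_backlog: int = 100) -> bool:
    """Accept a job; shed low-priority work when its backlog is too long."""
    if important:
        high_priority.put(job)
        return True
    if low_priority.qsize() >= max_low_backlog:
        return False  # degrade: reject instead of letting the backlog grow
    low_priority.put(job)
    return True


def next_job():
    """Return the next job, always serving high-priority work first."""
    for q in (high_priority, low_priority):
        try:
            return q.get_nowait()
        except queue.Empty:
            continue
    return None


if __name__ == "__main__":
    enqueue("nightly report", important=False)
    enqueue("checkout", important=True)
    print(next_job(), "|", next_job())
```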
Read / Write Sharding
[Diagram: three application instances; one DB instance with three DB instance read replicas]
Database Federation
[Diagram: three application instances; a Users DB and a Products DB, each a DB instance with a read replica]
Database Sharding
User      ShardID
002345    A
002346    B
002347    C
002348    B
002349    A
[Diagram: three application instances; shards A, B, and C, each a DB instance with a read replica]
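The lookup table on the slide can be read as a shard directory. A minimal sketch, assuming the application resolves a user to a shard and then to a database endpoint; the host names under `example.internal` and the function name are hypothetical.

```python
# User -> shard mapping, copied from the table on the slide.
SHARD_DIRECTORY = {
    "002345": "A",
    "002346": "B",
    "002347": "C",
    "002348": "B",
    "002349": "A",
}

# Shard -> database endpoint (hypothetical host names for illustration).
SHARD_ENDPOINTS = {
    "A": "db-a.example.internal",
    "B": "db-b.example.internal",
    "C": "db-c.example.internal",
}


def endpoint_for(user_id: str) -> str:
    """Return the database endpoint that holds this user's data."""
    shard = SHARD_DIRECTORY[user_id]  # raises KeyError for unknown users
    return SHARD_ENDPOINTS[shard]


if __name__ == "__main__":
    for user in ("002345", "002348"):
        print(user, "->", endpoint_for(user))
```

A failure of one shard then affects only the users mapped to it.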
Transient state does not
belong in the database.
Cascading Failures
Let’s talk about timeouts & retries!
[Diagram: Users → App → Conn Pool → DB, with several INSERT requests in flight]
What happens if the DB “slows down”?
Timeout client side Timeout backend side ??
[Diagram: User 1 → App → Conn Pool → DB; an INSERT is followed by repeated Retry INSERT attempts]
Timeout client side = 10s; timeout backend side = not implemented
ERROR: Failed to get connection from pool
http://docs.python-requests.org/en/master/user/advanced/#timeouts
import requests
import timeout_decorator

@timeout_decorator.timeout(5, timeout_exception=StopIteration)  # abort after 5 seconds
def timed_get(url):
    return requests.get(url)
https://pypi.org/project/timeout-decorator/
Timeouts
How else could we have prevented the error?
[Diagram: User 1 → Conn Pool → DB; an INSERT is followed by repeated Retry INSERT attempts]
ERROR: Failed to get connection from pool
[Diagram: User 1 → Conn Pool → DB]
Timeout client side = 10s; timeout backend side = 10s
INSERT
Wait 2s before retry
Wait 4s before retry
Wait 8s before retry
Wait 16s before retry
Backing off between retries
Releasing connections
Backoff
Simple Exponential Backoff is not enough: Add Jitter
[Plots: no jitter vs. with jitter]
https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
Example: add jitter 0-1000ms
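import random
import time

import requests

def get_item(self, url, n=1):
    MAX_TRIES = 12
    try:
        res = requests.get(url)
    except requests.exceptions.RequestException:
        # Give up once the retry budget is spent.
        if n > MAX_TRIES:
            return None
        n += 1
        # Exponential backoff plus 0-1000 ms of random jitter.
        time.sleep((2 ** n) + (random.randint(0, 1000) / 1000.0))
        return self.get_item(url, n)
    else:
        return res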
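import queue
import backoff

@backoff.on_exception(backoff.expo, queue.Empty, jitter=backoff.full_jitter, max_time=60)
def poll_for_message(q):
    return q.get_nowait()  # raises queue.Empty when no message is ready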
https://pypi.org/project/backoff/
As of version 1.2, the default jitter function backoff.full_jitter implements the ‘Full Jitter’ algorithm as defined in the
AWS Architecture Blog’s Exponential Backoff And Jitter post.
Idempotent operation
No additional effect if it is called more than
once with the same input parameters.
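One common way to make a retried write idempotent is to attach a request ID and remember which IDs were already applied. This is a minimal in-memory sketch under that assumption; the `Account` class and its method names are hypothetical and not from the talk.

```python
class Account:
    """Toy account whose deposits are idempotent per request ID (hypothetical)."""

    def __init__(self):
        self.balance = 0
        self._applied = {}  # request_id -> balance returned for that request

    def deposit(self, request_id: str, amount: int) -> int:
        # A retry with the same request ID returns the earlier result
        # instead of applying the deposit a second time.
        if request_id in self._applied:
            return self._applied[request_id]
        self.balance += amount
        self._applied[request_id] = self.balance
        return self.balance


if __name__ == "__main__":
    account = Account()
    account.deposit("req-1", 10)
    account.deposit("req-1", 10)  # retried call, no additional effect
    print(account.balance)
```

With this property in place, a client can retry after a timeout without knowing whether the first attempt succeeded.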
Service Degradation & Fallbacks
Circuit Breaker
• Wrap a protected function call in a circuit breaker object, which monitors for failures.
• If failures reach a certain threshold, the circuit breaker trips.
[Diagram: Producer, Circuit Breaker, Consumer; labels: Connection, Monitoring, Timeouts, Breaking Circuit]
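The two bullets above can be sketched as a small wrapper class. This is an illustrative sketch, not a production implementation: the class name, the `max_failures` and `reset_after` parameters, and the single trial call after the cool-down are all assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before tripping
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable clock, useful in tests
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open; failing fast")
            # Cool-down elapsed: let one trial call through. The failure count
            # is kept, so a single failure trips the breaker again.
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit fully
        return result
```

A caller would wrap each outbound call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and serve a fallback when `CircuitOpenError` is raised.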
Non-blocking UI
https://medium.com/@sophie_paxtonUX/stop-getting-in-my-way-non-blocking-ux-5cbbfe0f0158
Fire Drills
Big challenges to chaos engineering
Mostly cultural
• No time or flexibility to simulate disasters.
• Teams already spend all of their time fixing things.
• It can be very political.
• It might force deep conversations.
• Teams are deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.
Changing culture takes time!
Be patient…
Thank you!
@adhorn
https://medium.com/@adhorn
Did We Scan Your Badge?
Remember to opt-in to AWS
communications and you will receive a
post-event email with a link to:
• AWS Developer Workshop Slides
• $200 in AWS Credits

More Related Content

What's hot

Chaos engineering and chaos testing
Chaos engineering and chaos testingChaos engineering and chaos testing
Chaos engineering and chaos testingjeetendra mandal
 
Chaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient SystemsChaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient SystemsC4Media
 
Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...
Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...
Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...Ana Medina
 
Principles Of Chaos Engineering - Chaos Engineering Hamburg
Principles Of Chaos Engineering - Chaos Engineering HamburgPrinciples Of Chaos Engineering - Chaos Engineering Hamburg
Principles Of Chaos Engineering - Chaos Engineering HamburgNils Meder
 
Introduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft AzureIntroduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft AzureAna Medina
 
Chaos Engineering with Gremlin Platform
Chaos Engineering with Gremlin PlatformChaos Engineering with Gremlin Platform
Chaos Engineering with Gremlin PlatformAnshul Patel
 
chaos-engineering-Knolx
chaos-engineering-Knolxchaos-engineering-Knolx
chaos-engineering-KnolxKnoldus Inc.
 
Chaos Engineering with Kubernetes
Chaos Engineering with KubernetesChaos Engineering with Kubernetes
Chaos Engineering with KubernetesArun Gupta
 
Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...
Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...
Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...Amazon Web Services
 
CI/CD Best Practices for Building Modern Applications - MAD302 - Anaheim AWS ...
CI/CD Best Practices for Building Modern Applications - MAD302 - Anaheim AWS ...CI/CD Best Practices for Building Modern Applications - MAD302 - Anaheim AWS ...
CI/CD Best Practices for Building Modern Applications - MAD302 - Anaheim AWS ...Amazon Web Services
 
An Introduction to the AWS Well Architected Framework - Webinar
An Introduction to the AWS Well Architected Framework - WebinarAn Introduction to the AWS Well Architected Framework - Webinar
An Introduction to the AWS Well Architected Framework - WebinarAmazon Web Services
 
CI/CD with AWS Developer Tools and Fargate
CI/CD with AWS Developer Tools and FargateCI/CD with AWS Developer Tools and Fargate
CI/CD with AWS Developer Tools and FargateAmazon Web Services
 
How to Design a Successful Test Automation Strategy
How to Design a Successful Test Automation Strategy How to Design a Successful Test Automation Strategy
How to Design a Successful Test Automation Strategy Impetus Technologies
 
AWS Global Infrastructure Foundations
AWS Global Infrastructure Foundations AWS Global Infrastructure Foundations
AWS Global Infrastructure Foundations Amazon Web Services
 
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...Tori Wieldt
 
Deep Dive into Amazon ECS & Fargate
Deep Dive into Amazon ECS & FargateDeep Dive into Amazon ECS & Fargate
Deep Dive into Amazon ECS & FargateAmazon Web Services
 
Chaos Engineering: Injecting Failure for Building Resilience in Systems
Chaos Engineering: Injecting Failure for Building Resilience in SystemsChaos Engineering: Injecting Failure for Building Resilience in Systems
Chaos Engineering: Injecting Failure for Building Resilience in SystemsYury Roa
 

What's hot (20)

Chaos engineering and chaos testing
Chaos engineering and chaos testingChaos engineering and chaos testing
Chaos engineering and chaos testing
 
Chaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient SystemsChaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient Systems
 
Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...
Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...
Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...
 
Principles Of Chaos Engineering - Chaos Engineering Hamburg
Principles Of Chaos Engineering - Chaos Engineering HamburgPrinciples Of Chaos Engineering - Chaos Engineering Hamburg
Principles Of Chaos Engineering - Chaos Engineering Hamburg
 
Introduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft AzureIntroduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft Azure
 
Chaos Engineering with Gremlin Platform
Chaos Engineering with Gremlin PlatformChaos Engineering with Gremlin Platform
Chaos Engineering with Gremlin Platform
 
chaos-engineering-Knolx
chaos-engineering-Knolxchaos-engineering-Knolx
chaos-engineering-Knolx
 
Chaos Engineering with Kubernetes
Chaos Engineering with KubernetesChaos Engineering with Kubernetes
Chaos Engineering with Kubernetes
 
Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...
Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...
Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW...
 
CI/CD Best Practices for Building Modern Applications - MAD302 - Anaheim AWS ...
CI/CD Best Practices for Building Modern Applications - MAD302 - Anaheim AWS ...CI/CD Best Practices for Building Modern Applications - MAD302 - Anaheim AWS ...
CI/CD Best Practices for Building Modern Applications - MAD302 - Anaheim AWS ...
 
An Introduction to the AWS Well Architected Framework - Webinar
An Introduction to the AWS Well Architected Framework - WebinarAn Introduction to the AWS Well Architected Framework - Webinar
An Introduction to the AWS Well Architected Framework - Webinar
 
CI/CD with AWS Developer Tools and Fargate
CI/CD with AWS Developer Tools and FargateCI/CD with AWS Developer Tools and Fargate
CI/CD with AWS Developer Tools and Fargate
 
CI/CD on AWS
CI/CD on AWSCI/CD on AWS
CI/CD on AWS
 
How to Design a Successful Test Automation Strategy
How to Design a Successful Test Automation Strategy How to Design a Successful Test Automation Strategy
How to Design a Successful Test Automation Strategy
 
AWS Global Infrastructure Foundations
AWS Global Infrastructure Foundations AWS Global Infrastructure Foundations
AWS Global Infrastructure Foundations
 
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
 
DevOps on AWS
DevOps on AWSDevOps on AWS
DevOps on AWS
 
Deep Dive into Amazon ECS & Fargate
Deep Dive into Amazon ECS & FargateDeep Dive into Amazon ECS & Fargate
Deep Dive into Amazon ECS & Fargate
 
Chaos Engineering: Injecting Failure for Building Resilience in Systems
Chaos Engineering: Injecting Failure for Building Resilience in SystemsChaos Engineering: Injecting Failure for Building Resilience in Systems
Chaos Engineering: Injecting Failure for Building Resilience in Systems
 
Are you Well-Architected?
Are you Well-Architected?Are you Well-Architected?
Are you Well-Architected?
 

Similar to Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Workshop at Web Summit 2018

Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.Adrian Hornsby
 
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018Amazon Web Services
 
Keynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practicedKeynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practicedAWS User Group Bengaluru
 
Keynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos EngineeringKeynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos EngineeringAmazon Web Services
 
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018Amazon Web Services
 
Resiliency and Availability Design Patterns for the Cloud
Resiliency and Availability Design Patterns for the CloudResiliency and Availability Design Patterns for the Cloud
Resiliency and Availability Design Patterns for the CloudAmazon Web Services
 
Life of a Code Change to a Tier 1 Service - AWS Online Tech Talks
Life of a Code Change to a Tier 1 Service - AWS Online Tech TalksLife of a Code Change to a Tier 1 Service - AWS Online Tech Talks
Life of a Code Change to a Tier 1 Service - AWS Online Tech TalksAmazon Web Services
 
From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...
From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...
From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...Amazon Web Services
 
Applying principles of chaos engineering to serverless (reinvent DVC305)
Applying principles of chaos engineering to serverless (reinvent DVC305)Applying principles of chaos engineering to serverless (reinvent DVC305)
Applying principles of chaos engineering to serverless (reinvent DVC305)Yan Cui
 
Applying Principles of Chaos Engineering to Serverless (DVC305) - AWS re:Inve...
Applying Principles of Chaos Engineering to Serverless (DVC305) - AWS re:Inve...Applying Principles of Chaos Engineering to Serverless (DVC305) - AWS re:Inve...
Applying Principles of Chaos Engineering to Serverless (DVC305) - AWS re:Inve...Amazon Web Services
 
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...Amazon Web Services
 
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018Amazon Web Services
 
Modern Application Delivery on AWS: the Red Hat Way
Modern Application Delivery on AWS: the Red Hat WayModern Application Delivery on AWS: the Red Hat Way
Modern Application Delivery on AWS: the Red Hat WayAmazon Web Services
 
Serverless + Evolutionary Architectures + Safe Deployments = Speed in the Rig...
Serverless + Evolutionary Architectures + Safe Deployments = Speed in the Rig...Serverless + Evolutionary Architectures + Safe Deployments = Speed in the Rig...
Serverless + Evolutionary Architectures + Safe Deployments = Speed in the Rig...Amazon Web Services
 
Building a Recommender System on AWS
Building a Recommender System on AWSBuilding a Recommender System on AWS
Building a Recommender System on AWSAmazon Web Services
 
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...Amazon Web Services
 
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...Amazon Web Services
 
Globalizing Player Accounts at Riot Games While Maintaining Availability (ARC...
Globalizing Player Accounts at Riot Games While Maintaining Availability (ARC...Globalizing Player Accounts at Riot Games While Maintaining Availability (ARC...
Globalizing Player Accounts at Riot Games While Maintaining Availability (ARC...Amazon Web Services
 
Applying the Twelve-Factor App Methodology to Serverless Applications (SRV218...
Applying the Twelve-Factor App Methodology to Serverless Applications (SRV218...Applying the Twelve-Factor App Methodology to Serverless Applications (SRV218...
Applying the Twelve-Factor App Methodology to Serverless Applications (SRV218...Amazon Web Services
 
Building Serverless IoT solutions - EPAM SEC 2018 Minsk
Building Serverless IoT solutions - EPAM SEC 2018 MinskBuilding Serverless IoT solutions - EPAM SEC 2018 Minsk
Building Serverless IoT solutions - EPAM SEC 2018 MinskBoaz Ziniman
 

Similar to Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Workshop at Web Summit 2018 (20)

Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.
 
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
 
Keynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practicedKeynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practiced
 
Keynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos EngineeringKeynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos Engineering
 
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
 



Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Workshop at Web Summit 2018

  • 1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark. Chaos Engineering: Why Breaking Things Should Be Practiced. Adrian Hornsby, Cloud Architecture Evangelist, @adhorn
  • 2. https://xkcd.com/1428/
  • 3. Complex systems: Amazon, Twitter, Netflix
  • 4. Partial failure mode
  • 5. Resiliency: the ability of a system to handle and eventually recover from unexpected conditions
  • 6. People; Application; Network & Data; Infrastructure
  • 7. Building confidence through testing. Unit testing of components: tested in isolation to ensure function meets expectations. Functional testing of integrations: each execution path tested to assure expected results. Is it enough?
  • 8. GameDay at Amazon: Creating Resiliency Through Destruction https://www.youtube.com/watch?v=zoz0ZjfrQ9s
  • 9. Chaos engineering https://github.com/Netflix/SimianArmy
  • 10. Failure injection: start small & build confidence; application level; host failure; resource attacks (CPU, memory, …); network attacks (dependencies, latency, …); region attacks; “Paul” attack. https://www.gremlin.com https://github.com/Netflix/SimianArmy https://chaostoolkit.org
  • 11. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” http://principlesofchaos.org
  • 12. Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.
  • 13.
  • 14. “Chaos doesn’t cause problems. It reveals them.” Nora Jones, Senior Chaos Engineer, Netflix
  • 15. https://bit.ly/2uKOJMQ
  • 16. Phases of Chaos Engineering
  • 17. Steady State; Hypothesis; Design & Run Experiment; Verify & Learn; Fix
  • 18. Steady State
  • 19. What is steady state? The “normal” behavior of your system. https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
  • 20. What is steady state? The “normal” behavior of your system; a business metric. https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
  • 21. Business metrics at work. Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden). Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer). Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).
  • 22. Hypothesis
  • 23. What if…? “What if this load balancer breaks?” “What if Redis becomes slow?” “What if a host on Cassandra goes away?” “What if latency increases by 300ms?” “What if the database stops?” Make it everyone’s problem!
  • 24. Disclaimer! Don’t make a hypothesis that you know will break you!
  • 25. Design & Run Experiment
  • 26. Designing the experiment: pick a hypothesis; scope the experiment; identify metrics; notify the organization.
  • 27. Rules of thumb: start very small; stay as close as possible to production; minimize the blast radius; have an emergency STOP!
  • 28. Running a chaos experiment as a canary deployment: users are split between the normal version (99% of users) and the canary (1% of users). Start with ..
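One way to picture the 99% / 1% split on slide 28 is deterministic bucketing by user ID. The sketch below is an assumption for illustration only: the function name `in_experiment` and the hashing scheme are not from the talk, and real canary routing is usually done at the load balancer or deployment layer.

```python
import hashlib


def in_experiment(user_id: str, percent: float = 1.0) -> bool:
    """Return True if this user falls into the experiment (canary) group.

    Hypothetical helper: hashes the user ID into one of 10,000 buckets so
    the same user always lands in the same group across requests.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # 1.0% -> buckets 0..99


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(20_000)]
    share = sum(in_experiment(u) for u in users) / len(users)
    print(f"share of users in the experiment: {share:.2%}")
```

Hashing keeps the blast radius stable: a given user is either always in the experiment or never in it, which makes the results easier to interpret.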
  • 29. Verify & Learn
  • 30. Quantifying the result of the experiment: Time to detect? Time for notification? And escalation? Time to public notification? Time for graceful degradation to kick in? Time for self-healing to happen? Time to recovery (partial and full)? Time to all-clear and stable?
  • 31. DON’T blame that one person …
  • 32. PostMortems: COE (Correction of Errors). The 5 WHYs: Outage, because of …, because of …, because of …, because of … NOT ENOUGH
  • 33. More questions to ask: Can you clarify if there were any preceding events? Why would they believe acting in this way was the best course of action to deliver the desired outcome? Is there another failure mode that could present here? What decisions or events prior to this made this work before? Why stop there: are there places to dig deeper that could shine more light on this? Did others step in to help, to advise, or to intercede?
  • 34. Rules to remember! 1. Failure requires multiple faults. 2. There is no isolated ‘cause’ of an accident. 3. There are multiple contributors to accidents.
  • 35. Fix
  • 36. Fix
  • 37. Patterns for Resilient Architecture
  • 38. Availability in parallel: $A = 1 - (1 - A_x)^2$ (diagram: Part X in parallel with Part X)
  • 39. Availability in parallel (component: availability, downtime per year). X: 99% (2-nines), 3 days 15 hours. Two X in parallel: 99.99% (4-nines), 52 minutes. Three X in parallel: 99.9999% (6-nines), 31 seconds.
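The figures on slides 38 and 39 can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes independent failures and a 365-day year; the function names are illustrative, not from the talk.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def parallel_availability(a_x: float, n: int = 2) -> float:
    """Availability of n redundant copies of a part with availability a_x.

    Generalizes A = 1 - (1 - A_x)^2 to n copies, assuming the copies
    fail independently of one another.
    """
    return 1.0 - (1.0 - a_x) ** n


def yearly_downtime_seconds(availability: float) -> float:
    """Expected downtime per year, in seconds, for a given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for copies in (1, 2, 3):
        a = parallel_availability(0.99, copies)
        print(copies, f"{a:.6%}", f"{yearly_downtime_seconds(a):.0f} s/year")
```

For a 99% component this gives roughly 87.6 hours, 52.6 minutes, and 31.5 seconds of downtime per year for one, two, and three copies, matching the slide.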
  • 40. Multi-AZ architecture. Diagram labels: Region; Availability zone a, Availability zone b, Availability zone c; Application.
  • 41. Stateless services. Diagram labels: AWS Region; Availability zone 1 and Availability zone 2; Auto Scaling groups for Service A and Service B.
  • 42. Auto-Scaling: compute efficiency; node failure; traffic spikes; performance bugs.
  • 43. Auto-scaling. Diagram labels: AWS Region; Availability zone 1 and Availability zone 2; Auto Scaling groups for Service A and Service B.
  • 44. Decoupling with async pattern. Diagram labels: Listener; Pub-Sub; Queue, Queue; A, A; B, B.
  • 45. Diagram: API: {DO foo}; PUT JOB: {JobID: 0001, Task: DO foo}; API: {JobID: 0001}; GET JOB: {JobID: 0001, Task: DO foo}; {JobID: 0001, Result: bar}. Components: API Instances, Queue, Worker Instances, Cache node.
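The message flow on slide 45 can be sketched in-process with a queue and a dictionary. This is an illustrative reading of the diagram, not the author's implementation: `submit`, `work_once`, and `fetch` are hypothetical names, `queue.Queue` stands in for a managed queue, and a dict stands in for the cache node.

```python
import itertools
import queue

job_queue = queue.Queue()   # stands in for the Queue in the diagram
result_cache = {}           # stands in for the Cache node
_job_ids = itertools.count(1)


def submit(task):
    """API instance: enqueue the job and immediately return its JobID."""
    job_id = f"{next(_job_ids):04d}"
    job_queue.put({"JobID": job_id, "Task": task})
    return {"JobID": job_id}


def work_once(handler):
    """Worker instance: take one job, run it, store the result in the cache."""
    job = job_queue.get()
    result_cache[job["JobID"]] = handler(job["Task"])
    job_queue.task_done()


def fetch(job_id):
    """API instance: return the result if it is ready, otherwise None."""
    return result_cache.get(job_id)


if __name__ == "__main__":
    ticket = submit("DO foo")
    print(fetch(ticket["JobID"]))      # None: the worker has not run yet
    work_once(lambda task: "bar")
    print(fetch(ticket["JobID"]))      # bar
```

The caller never blocks on the worker: a slow or failed worker delays the result but does not tie up the API instance.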
  • 46. Diagram labels: User; Push Notification; Fetch results; API Instances; Queue; Worker Instances; Cache node.
  • 47. Degrade & prioritize traffic with queues. Diagram labels: API Instances; HighPriorityQueue; LowPriorityQueue; Worker Instances.
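A minimal sketch of the two-queue idea on slide 47, under the assumption that workers always drain the high-priority queue first and that low-priority work is shed once its backlog grows too large. The function names and the `low_backlog_limit` parameter are hypothetical.

```python
import queue


def enqueue(job, high_priority, high_q, low_q, low_backlog_limit=100):
    """Accept a job; shed low-priority work when its backlog is too long."""
    if high_priority:
        high_q.put(job)
        return True
    if low_q.qsize() >= low_backlog_limit:
        return False  # degrade: reject low-priority work under load
    low_q.put(job)
    return True


def next_job(high_q, low_q):
    """Worker: take from the high-priority queue first, then the low one."""
    for q in (high_q, low_q):
        try:
            return q.get_nowait()
        except queue.Empty:
            continue
    return None  # both queues are empty


if __name__ == "__main__":
    high_q, low_q = queue.Queue(), queue.Queue()
    enqueue("nightly-report", False, high_q, low_q)
    enqueue("checkout", True, high_q, low_q)
    print(next_job(high_q, low_q))  # checkout
```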
  • 48. Read / Write Sharding. Diagram labels: three Instances; one DB Instance; three DB instance read replicas.
  • 49. Database Federation. Diagram labels: three Instances; Users DB and Products DB, each a DB Instance with a DB instance read replica.
  • 50. Database Sharding. User to ShardID: 002345 → A; 002346 → B; 002347 → C; 002348 → B; 002349 → A. Diagram labels: three Instances; shards A, B, C, each a DB Instance with a DB instance read replica.
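The lookup table on slide 50 suggests directory-based sharding: the application consults a user-to-shard mapping before picking a database. A minimal sketch, assuming the mapping fits in memory; the endpoint host names are invented placeholders.

```python
# User-to-shard directory, taken from the table on slide 50.
SHARD_DIRECTORY = {
    "002345": "A",
    "002346": "B",
    "002347": "C",
    "002348": "B",
    "002349": "A",
}

# Hypothetical endpoints; real host names would come from configuration.
SHARD_ENDPOINTS = {
    "A": "shard-a.db.example.internal",
    "B": "shard-b.db.example.internal",
    "C": "shard-c.db.example.internal",
}


def shard_for(user_id: str) -> str:
    """Return the shard ID that holds this user's rows."""
    try:
        return SHARD_DIRECTORY[user_id]
    except KeyError:
        raise LookupError(f"no shard assigned to user {user_id}") from None


def endpoint_for(user_id: str) -> str:
    """Return the database endpoint to connect to for this user."""
    return SHARD_ENDPOINTS[shard_for(user_id)]
```

With this layout, losing one shard affects only the users mapped to it rather than the whole user base.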
  • 51. Transient state does not belong in the database.
  • 52. Cascading Failures
  • 53. Let’s talk about timeouts & retries!
  • 54. Diagram: Users, App, DB Conn Pool; INSERT, INSERT, INSERT, INSERT. What happens if the DB “slows down”? Timeout client side: ?? Timeout backend side: ??
  • 55. Diagram: User 1, App, DB Conn Pool; INSERT. Timeout client side = 10s. Timeout backend side = not implemented. Retry INSERT, Retry INSERT, Retry. ERROR: Failed to get connection from pool
  • 58. https://pypi.org/project/timeout-decorator/
        import requests
        import timeout_decorator

        # Abort the call if it takes longer than 5 seconds.
        @timeout_decorator.timeout(5, timeout_exception=StopIteration)
        def timed_get(url):
            return requests.get(url)
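For readers without the third-party decorator, the same idea can be sketched with the standard library only. This is an assumption-laden illustration, not the approach from the slide: `call_with_timeout` and `slow_query` are hypothetical names, and the worker thread is not killed on timeout; only the caller's wait is bounded.

```python
import concurrent.futures
import time


def call_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func in a worker thread and wait at most timeout_s seconds.

    Raises concurrent.futures.TimeoutError if the deadline passes. The
    worker keeps running in the background, so this caps the caller's
    wait rather than cancelling the work itself.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    finally:
        executor.shutdown(wait=False)  # do not block on a slow worker


def slow_query(seconds):
    """Hypothetical stand-in for a database call that has slowed down."""
    time.sleep(seconds)
    return "done"


if __name__ == "__main__":
    try:
        call_with_timeout(slow_query, 0.1, 0.5)
    except concurrent.futures.TimeoutError:
        print("gave up after 0.1 s")
```

A backend-side timeout matters for the same reason as the client-side one: without it, a slow dependency holds connections until the pool is exhausted.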
  • 59. Timeouts
  • 60. How else could we have prevented the error? Diagram: User 1, DB Conn Pool; INSERT, Retry INSERT, Retry INSERT, Retry. ERROR: Failed to get connection from pool
  • 61. Backoff. Diagram: User 1, DB Conn Pool; INSERT. Timeout client side = 10s. Timeout backend side = 10s. Wait 2s before retry, wait 4s before retry, wait 8s before retry, wait 16s before retry. Backing off between retries; releasing connections.
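The 2s, 4s, 8s, 16s schedule on slide 61 is plain exponential backoff. A minimal sketch of generating that schedule; the `cap_s` ceiling is an added assumption, not something shown on the slide.

```python
def backoff_delays(base_s=2.0, factor=2.0, max_retries=4, cap_s=60.0):
    """Yield the wait before each retry: base, base*factor, ... up to cap_s."""
    delay = base_s
    for _ in range(max_retries):
        yield min(delay, cap_s)
        delay *= factor


if __name__ == "__main__":
    print(list(backoff_delays()))  # [2.0, 4.0, 8.0, 16.0]
```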
  • 62. Simple exponential backoff is not enough: add jitter. Charts: no jitter vs. with jitter. https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
  • 63. Example: add jitter 0-1000ms
        import random
        import time

        import requests

        def get_item(self, url, n=1):
            MAX_TRIES = 12
            try:
                res = requests.get(url)
            except requests.RequestException:
                if n > MAX_TRIES:
                    return None
                n += 1
                # Exponential wait plus 0-1000 ms of random jitter.
                time.sleep((2 ** n) + (random.randint(0, 1000) / 1000.0))
                return self.get_item(url, n)
            else:
                return res
  • 64. https://pypi.org/project/backoff/
        import backoff

        # Exponential backoff with full jitter; give up after 60 seconds.
        @backoff.on_exception(backoff.expo, Exception, max_time=60, jitter=backoff.full_jitter)
        def poll_for_message(queue):
            return queue.get()
    As of version 1.2, the default jitter function backoff.full_jitter implements the ‘Full Jitter’ algorithm as defined in the AWS Architecture Blog’s Exponential Backoff And Jitter post.
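A standard-library sketch of retrying with full jitter, in the sense of the AWS Architecture Blog post linked on slide 62: each wait is drawn uniformly between zero and the capped exponential delay. The names `full_jitter_delay` and `retry`, and the default values, are assumptions for illustration.

```python
import random
import time


def full_jitter_delay(attempt, base_s=0.1, cap_s=10.0):
    """Random wait in [0, min(cap_s, base_s * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap_s, base_s * 2 ** attempt))


def retry(operation, max_tries=5, base_s=0.1, cap_s=10.0, sleep=time.sleep):
    """Call operation until it succeeds or max_tries attempts have failed.

    The last failure is re-raised. `sleep` is injectable so the wait can
    be replaced in tests.
    """
    for attempt in range(max_tries):
        try:
            return operation()
        except Exception:
            if attempt == max_tries - 1:
                raise
            sleep(full_jitter_delay(attempt, base_s, cap_s))
```

Randomizing the wait spreads out clients that failed at the same moment, so they do not all retry in the same instant.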
  • 65. Idempotent operation: no additional effect if it is called more than once with the same input parameters.
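Idempotency is what makes retries safe. One common way to get it, sketched here as an assumption rather than as the talk's method, is to have the caller send a request ID and to remember the outcome per ID. The `Ledger` class and its method names are hypothetical.

```python
class Ledger:
    """Toy account whose credit operation is idempotent per request ID."""

    def __init__(self):
        self.balance = 0
        self._seen = {}  # request_id -> balance returned for that request

    def credit(self, request_id, amount):
        """Apply the credit once; replays with the same ID change nothing."""
        if request_id in self._seen:
            return self._seen[request_id]
        self.balance += amount
        self._seen[request_id] = self.balance
        return self.balance


if __name__ == "__main__":
    ledger = Ledger()
    ledger.credit("req-1", 50)
    ledger.credit("req-1", 50)  # a retry of the same request
    print(ledger.balance)       # 50
```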
  • 66. Service Degradation & Fallbacks
  • 67. Circuit Breaker: wrap a protected function call in a circuit breaker object, which monitors for failures. If failures reach a certain threshold, the circuit breaker trips. Diagram labels: Producer, Circuit Breaker, Consumer; Connection, Monitoring, Timeouts, Breaking Circuit.
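A minimal circuit breaker along the lines described on slide 67. The threshold, the reset timeout, the single trial call after the timeout, and the injectable `clock` are all illustrative choices, not details from the talk.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        """Invoke func through the breaker, failing fast while it is open."""
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout_s:
                raise CircuitOpenError("circuit is open; failing fast")
            # Reset timeout elapsed: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # trip, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```

While the circuit is open, callers get an immediate error they can turn into a fallback response instead of waiting on a dependency that is already failing.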
  • 68. Non-blocking UI https://medium.com/@sophie_paxtonUX/stop-getting-in-my-way-non-blocking-ux-5cbbfe0f0158
  • 69. Fire Drills
  • 70. Big challenges to chaos engineering are mostly cultural: no time or flexibility to simulate disasters; teams already spending all of their time fixing things; can be very political; might force deep conversations; teams deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.
  • 71. Changing culture takes time! Be patient…
  • 72.
  • 73. Thank you! @adhorn https://medium.com/@adhorn
  • 74. Did We Scan Your Badge? Remember to opt in to AWS communications and you will receive a post-event email with a link to: AWS Developer Workshop Slides; $200 in AWS Credits.