Submit Search
Upload
Chaos Engineering
•
16 likes
•
5,143 views
Amazon Web Services
Follow
Chaos Engineering
Read less
Read more
Report
Share
Report
Share
1 of 72
Download now
Download to read offline
Recommended
Introduction to Chaos Engineering
Introduction to Chaos Engineering
Raymond Adrian (Rad) Butalid
Chaos engineering
Chaos engineering
Alberto Acerbis
Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...
Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...
Amazon Web Services
Chaos Engineering, When should you release the monkeys?
Chaos Engineering, When should you release the monkeys?
Thoughtworks
Chaos Engineering
Chaos Engineering
Yury Roa
Chaos engineering & Gameday on AWS
Chaos engineering & Gameday on AWS
Bilal Aybar
Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...
Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...
Ana Medina
Chaos Engineering with Gremlin Platform
Chaos Engineering with Gremlin Platform
Anshul Patel
Recommended
Introduction to Chaos Engineering
Introduction to Chaos Engineering
Raymond Adrian (Rad) Butalid
Chaos engineering
Chaos engineering
Alberto Acerbis
Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...
Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...
Amazon Web Services
Chaos Engineering, When should you release the monkeys?
Chaos Engineering, When should you release the monkeys?
Thoughtworks
Chaos Engineering
Chaos Engineering
Yury Roa
Chaos engineering & Gameday on AWS
Chaos engineering & Gameday on AWS
Bilal Aybar
Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...
Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetup...
Ana Medina
Chaos Engineering with Gremlin Platform
Chaos Engineering with Gremlin Platform
Anshul Patel
An Introduction to Chaos Engineering
An Introduction to Chaos Engineering
Gremlin
Chaos engineering and chaos testing
Chaos engineering and chaos testing
jeetendra mandal
Choose your own adventure Chaos Engineering - QCon NYC 2017
Choose your own adventure Chaos Engineering - QCon NYC 2017
Nora Jones
Chaos Engineering - The Art of Breaking Things in Production
Chaos Engineering - The Art of Breaking Things in Production
Keet Sugathadasa
Chaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient Systems
C4Media
DevOps and AWS
DevOps and AWS
Shiva Narayanaswamy
Introduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft Azure
Ana Medina
CI/CD on AWS
CI/CD on AWS
Amazon Web Services
infrastructure as code
infrastructure as code
Amazon Web Services
Introduction to Microservices
Introduction to Microservices
Amazon Web Services
Chaos Engineering with Kubernetes
Chaos Engineering with Kubernetes
Arun Gupta
From Monolithic to Microservices
From Monolithic to Microservices
Amazon Web Services
DevOps on AWS
DevOps on AWS
Amazon Web Services
Cloud native principles
Cloud native principles
Diego Pacheco
Design patterns for microservice architecture
Design patterns for microservice architecture
The Software House
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
gjuljo
DevOps on AWS
DevOps on AWS
Amazon Web Services
Comprehensive Terraform Training
Comprehensive Terraform Training
Yevgeniy Brikman
Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)
Abeer R
A Pattern Language for Microservices
A Pattern Language for Microservices
Chris Richardson
Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.
Adrian Hornsby
Keynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practiced
AWS User Group Bengaluru
More Related Content
What's hot
An Introduction to Chaos Engineering
An Introduction to Chaos Engineering
Gremlin
Chaos engineering and chaos testing
Chaos engineering and chaos testing
jeetendra mandal
Choose your own adventure Chaos Engineering - QCon NYC 2017
Choose your own adventure Chaos Engineering - QCon NYC 2017
Nora Jones
Chaos Engineering - The Art of Breaking Things in Production
Chaos Engineering - The Art of Breaking Things in Production
Keet Sugathadasa
Chaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient Systems
C4Media
DevOps and AWS
DevOps and AWS
Shiva Narayanaswamy
Introduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft Azure
Ana Medina
CI/CD on AWS
CI/CD on AWS
Amazon Web Services
infrastructure as code
infrastructure as code
Amazon Web Services
Introduction to Microservices
Introduction to Microservices
Amazon Web Services
Chaos Engineering with Kubernetes
Chaos Engineering with Kubernetes
Arun Gupta
From Monolithic to Microservices
From Monolithic to Microservices
Amazon Web Services
DevOps on AWS
DevOps on AWS
Amazon Web Services
Cloud native principles
Cloud native principles
Diego Pacheco
Design patterns for microservice architecture
Design patterns for microservice architecture
The Software House
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
gjuljo
DevOps on AWS
DevOps on AWS
Amazon Web Services
Comprehensive Terraform Training
Comprehensive Terraform Training
Yevgeniy Brikman
Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)
Abeer R
A Pattern Language for Microservices
A Pattern Language for Microservices
Chris Richardson
What's hot
(20)
An Introduction to Chaos Engineering
An Introduction to Chaos Engineering
Chaos engineering and chaos testing
Chaos engineering and chaos testing
Choose your own adventure Chaos Engineering - QCon NYC 2017
Choose your own adventure Chaos Engineering - QCon NYC 2017
Chaos Engineering - The Art of Breaking Things in Production
Chaos Engineering - The Art of Breaking Things in Production
Chaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient Systems
DevOps and AWS
DevOps and AWS
Introduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft Azure
CI/CD on AWS
CI/CD on AWS
infrastructure as code
infrastructure as code
Introduction to Microservices
Introduction to Microservices
Chaos Engineering with Kubernetes
Chaos Engineering with Kubernetes
From Monolithic to Microservices
From Monolithic to Microservices
DevOps on AWS
DevOps on AWS
Cloud native principles
Cloud native principles
Design patterns for microservice architecture
Design patterns for microservice architecture
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
DevOps on AWS
DevOps on AWS
Comprehensive Terraform Training
Comprehensive Terraform Training
Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)
A Pattern Language for Microservices
A Pattern Language for Microservices
Similar to Chaos Engineering
Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.
Adrian Hornsby
Keynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practiced
AWS User Group Bengaluru
Keynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos Engineering
Amazon Web Services
Fully Realizing the Microservices Vision with Service Mesh (DEV312-S) - AWS r...
Fully Realizing the Microservices Vision with Service Mesh (DEV312-S) - AWS r...
Amazon Web Services
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
Amazon Web Services
[REPEAT 1] Safeguard the Integrity of Your Code for Fast and Secure Deploymen...
[REPEAT 1] Safeguard the Integrity of Your Code for Fast and Secure Deploymen...
Amazon Web Services
Safeguard the Integrity of Your Code for Fast and Secure Deployments (DEV349-...
Safeguard the Integrity of Your Code for Fast and Secure Deployments (DEV349-...
Amazon Web Services
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
Amazon Web Services
Building Global Multi-Region, Active-Active Serverless Backends
Building Global Multi-Region, Active-Active Serverless Backends
Amazon Web Services
Moving 400 Engineers to AWS: Our Journey to Secure Adoption (SEC306-S) - AWS ...
Moving 400 Engineers to AWS: Our Journey to Secure Adoption (SEC306-S) - AWS ...
Amazon Web Services
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
Amazon Web Services
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
Amazon Web Services
Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...
Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...
Amazon Web Services
Resiliency and Availability Design Patterns for the Cloud
Resiliency and Availability Design Patterns for the Cloud
Amazon Web Services
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Amazon Web Services
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...
Amazon Web Services
Applying the Twelve-Factor App Methodology to Serverless Applications (SRV218...
Applying the Twelve-Factor App Methodology to Serverless Applications (SRV218...
Amazon Web Services
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...
Amazon Web Services
From Monolith to Modern Apps: Best Practices (SRV322-R2) - AWS re:Invent 2018
From Monolith to Modern Apps: Best Practices (SRV322-R2) - AWS re:Invent 2018
Amazon Web Services
Life of a Code Change to a Tier 1 Service - AWS Online Tech Talks
Life of a Code Change to a Tier 1 Service - AWS Online Tech Talks
Amazon Web Services
Similar to Chaos Engineering
(20)
Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.
Keynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos Engineering
Fully Realizing the Microservices Vision with Service Mesh (DEV312-S) - AWS r...
Fully Realizing the Microservices Vision with Service Mesh (DEV312-S) - AWS r...
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
[REPEAT 1] Safeguard the Integrity of Your Code for Fast and Secure Deploymen...
[REPEAT 1] Safeguard the Integrity of Your Code for Fast and Secure Deploymen...
Safeguard the Integrity of Your Code for Fast and Secure Deployments (DEV349-...
Safeguard the Integrity of Your Code for Fast and Secure Deployments (DEV349-...
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
Building Global Multi-Region, Active-Active Serverless Backends
Building Global Multi-Region, Active-Active Serverless Backends
Moving 400 Engineers to AWS: Our Journey to Secure Adoption (SEC306-S) - AWS ...
Moving 400 Engineers to AWS: Our Journey to Secure Adoption (SEC306-S) - AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...
Modernizing Media Supply Chains with AWS Serverless (API301) - AWS re:Invent ...
Resiliency and Availability Design Patterns for the Cloud
Resiliency and Availability Design Patterns for the Cloud
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Build Your Own Log Analytics Solutions on AWS (ANT323-R) - AWS re:Invent 2018
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...
Applying the Twelve-Factor App Methodology to Serverless Applications (SRV218...
Applying the Twelve-Factor App Methodology to Serverless Applications (SRV218...
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...
From Monolith to Modern Apps: Best Practices (SRV322-R2) - AWS re:Invent 2018
From Monolith to Modern Apps: Best Practices (SRV322-R2) - AWS re:Invent 2018
Life of a Code Change to a Tier 1 Service - AWS Online Tech Talks
Life of a Code Change to a Tier 1 Service - AWS Online Tech Talks
More from Amazon Web Services
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Amazon Web Services
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Amazon Web Services
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
Amazon Web Services
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
Amazon Web Services
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
Amazon Web Services
Open banking as a service
Open banking as a service
Amazon Web Services
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Amazon Web Services
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
Amazon Web Services
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Amazon Web Services
Computer Vision con AWS
Computer Vision con AWS
Amazon Web Services
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
Amazon Web Services
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Amazon Web Services
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
Amazon Web Services
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Amazon Web Services
Tools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
How to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
Building a web application without servers
Building a web application without servers
Amazon Web Services
Fundraising Essentials
Fundraising Essentials
Amazon Web Services
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
Amazon Web Services
More from Amazon Web Services
(20)
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
Open banking as a service
Open banking as a service
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Computer Vision con AWS
Computer Vision con AWS
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Tools for building your MVP on AWS
Tools for building your MVP on AWS
How to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Building a web application without servers
Building a web application without servers
Fundraising Essentials
Fundraising Essentials
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
Chaos Engineering
1.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Adrian Hornsby Cloud Infrastructure – Amazon Web Services Chaos Engineering: Why Breaking Things Should Be Practiced. @adhorn
2.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Been there?
3.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Failures are a given and everything will eventually fail over time. Werner Vogels CTO – Amazon.com “ “
4.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. … at the Edge
5.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Building Confidence Through Testing Unit testing of components: • Tested in isolation to ensure function meets expectations. Functional testing of integrations: • Each execution path tested to assure expected results. Is it enough???
6.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Jesse Robbins GameDay: Creating Resiliency Through Destruction https://www.youtube.com/watch?v=zoz0ZjfrQ9s
7.
Netflix 2013 https://medium.com/netflix-techblog
8.
Chaos Monkeys https://github.com/Netflix/SimianArmy
9.
10.
© 2017, Amazon
Web Services, Inc. or its Affiliates. All rights reserved. https://bit.ly/2uKOJMQ
11.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Twilio Use-Case Discovering Issues with HTTP/2 via Chaos Testing https://www.twilio.com/blog/2017/10/http2-issues.html ”While HTTP/2 provides for a number of improvements over HTTP/1.x, via Chaos Testing we discovered that there are situations where HTTP/2 will perform worse than HTTP/1.”
12.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. What “really” is Chaos Engineering?
13.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” http://principlesofchaos.org
14.
Break your systems
on purpose. Find out their weaknesses and fix them before they break when least expected.
15.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Failure Injection • Start small & build confidence • Application level • Host failure • Resource attacks (CPU, memory, …) • Network attacks (dependencies, latency, …) • Region attacks!
16.
17.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. “CHAOS DOESN’T CAUSE PROBLEMS. IT REVEALS THEM.” Nora Jones Senior Chaos Engineer, Netflix
18.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Before breaking things …
19.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. People Application Network & Data Infrastructure
20.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Infrastructure
21.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Availability Availability Downtime per year 99% (2-nines) 3 days 15 hours 99.99% (4-nines) 52 minutes 99.999% (5-nines) 5 minutes 99.9999% (6-nines) 31 seconds
22.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. System Availability Availability = Normal Operation Time Total Time MTBF** MTBF** + MTTR* = * Mean Time To Repair (MTTR) **Mean Time Between Failure (MTBF)
23.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Availability in Parallel Component Availability Downtime X 99% (2-nines) 3 days 15 hours Two X in parallel 99.99% (4-nines) 52 minutes Three X in parallel 99.9999% (6-nines) 31 seconds
24.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Availability Zone 1 Availability Zone 2 Availability Zone n Multi-AZ Support Instance Failure Application
25.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Auto-Scaling • Compute efficiency • Node failure • Traffic spikes • Performance bugs
26.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Immutable Infrastructure • No updates on live systems • Always start from a new resource being provisioned • Deploy the new software • Test in different environments (dev, staging) • Deploy to prod (inactive) • Change references (DNS or Load Balancer) • Keep old version around (inactive) • Fast rollback if things go wrong
27.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Infrastructure as Code • Template of the infrastructure in code. • Version controlled infrastructure. • Repeatable template. • Testable infrastructure. • Automate it!
28.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Network & Data
29.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Read / Write Sharding RDS DB Instance Read Replica App Instance App Instance App Instance RDS DB Instance Master (Multi-AZ) RDS DB Instance Read Replica RDS DB Instance Read Replica
30.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Database Federation Users DB Products DB App Instance App Instance App Instance
31.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Database Sharding User ShardID 002345 A 002346 B 002347 C 002348 B 002349 A CBA App Instance App Instance App Instance
32.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Message passing for async. patterns A Queue B A Queue BListener Pub-Sub
33.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Web Instances Worker Instance Worker Instance Queue API Instance API Instance API Instance API: {DO foo} PUT JOB: {JobID: 0001, Task: DO foo} API: {JobID: 0001} GET JOB: {JobID: 0001, Task: DO foo} Cache Result: { JobID: 0001, Result: bar }
34.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Worker Instance Worker Instance Queue API Instance API Instance API Instance Cache Amazon SNS Push Notification User
35.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Exponential Backoff
36.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Circuit Breaker • Wrap a protected function call in a circuit breaker object, which monitors for failures. • If failures reach a certain threshold, the circuit breaker trips. https://martinfowler.com/bliki/CircuitBreaker.html
37.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Dynamic Routing with Route53 1. Latency Based Routing 2. Geo DNS 3. Weighted Round Robin 4. DNS Failover Amazon Route53 Resource A In US Resource B in EU User in US
38.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Dynamic Routing 1. Improve Latency for end-users 2. Disaster Recovery Applications in US West Applications in US East Users from San Francisco Users from New York Service 1 Service 2 Service 3 Service 4 Service 1 Service 2 Service 3 Service 4
39.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Application
40.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Stateless Services AZ1 AZ2 AWS Region Data Store Cache Auto-ScalingGroup User
41.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Transient state does not belong in the database.
42.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. CAP Theorem Consistency Availability Partition Tolerance Data is consistent. All nodes see the same state. Every request is non-failing. Service still responds as expected if some nodes crash. Distributed System In the presence of a network partition, you must choose between consistency and availability!
43.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Eventual Consistency … if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. Availability An eventually consistent system can return any value before it converges!! https://en.wikipedia.org/wiki/Eventual_consistency Distributed System Every request is non-failing.
44.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Process A Process B Process A Process B Synchronous Asynchronous Waiting Working Continues get or fetch resultGet result
45.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Non-blocking UI https://medium.com/@sophie_paxtonUX/stop-getting-in-my-way-non-blocking-ux-5cbbfe0f0158
46.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Exception Handling
47.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Service Degradation & Fallbacks
48.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. People
49.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. “It is not failure itself that holds you back; it is the fear of failure that paralyses you.” Brian Tracy
50.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Conway’s Law User UI Team Application Team DBA Team ”Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.” http://www.melconway.com/Home/Conways_Law.html Siloed Teams Siloed Applications
51.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Conway’s Law ”Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.” http://www.melconway.com/Home/Conways_Law.html Services Cross-Functional Teams
52.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Fire Drills
53.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Phases of Chaos Engineering
54.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Steady State Hypothesis Design Experiment Verify & Learn Fix
55.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. What is Steady State? • ”normal” behavior of your system https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
56.
What is Steady
State? • ”normal” behavior of your system • Business Metric https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
57.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Business Metrics at work Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden). Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer). Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).
58.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Steady State Important: • Know the value range of Healthy State!
59.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Hypothesis: What if…? “What if this load balancer breaks?” “What if Redis becomes slow?” “What if a host on Cassandra goes away?” ”What if latency increases by 300ms?” ”What if the database stops?” Make it everyone’s problem!
60.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Disclaimer! Don’t make an hypothesis that you know will break you!
61.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Designing Experiment • Pick hypothesis • Scope the experiment • Identify metrics • Notify the organization
62.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Rules of thumbs • Start with very small • As close as possible to production • Minimize the blast radius. • Have an emergency STOP!
63.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. New Version Users Run the Experiment: Canary deployment Old Version 99% Users 1% Users Start with .. Dynamic Routing (Route53)
64.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Verify & Learn: Quantifying the result of the experiment • Time to detect? • Time for notification? And escalation? • Time to public notification? • Time for graceful degradation to kick-in? • Time for self healing to happen? • Time to recovery – partial and full? • Time to all-clear and stable?
65.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. DON’T blame that one person …
66.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. PostMortems The 5 WHYs Outage Because of … Because of … Because of … Because of …
67.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. More questions to ask. • Can you clarify if there were any preceding events? • Why would they believe acting in this way was the best course of action to deliver the desired outcome? • Is there another failure mode that could present here? • What decisions or events prior to this made this work before? • Why stop there – are there places to dig deeper that could shine a light more on this? • Did others step in to help, to advise, or to intercede? http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy/
68.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Fixing the issues!
69.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Big Challenges to Chaos Engineering Mostly Cultural • no time or flexibility to simulate disasters. • teams already spending all of its time fixing things. • can be very political. • might force deep conversations. • deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.
70.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Changing Culture takes time! Be patient…
71.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. More Resources • https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf • https://www.gremlin.com • https://queue.acm.org/detail.cfm?id=2353017 • https://softwareengineeringdaily.com/ • https://github.com/dastergon/awesome-sre • https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf • https://medium.com/@NetflixTechBlog • http://principlesofchaos.org • https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp • https://github.com/adhorn/awesome-chaos-engineering • https://www.infoq.com/presentations/netflix-chaos-microservices • http://royal.pingdom.com/wp- content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf • http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy
72.
© 2018, Amazon
Web Services, Inc. or its affiliates. All rights reserved. Thank you! @adhorn https://medium.com/@adhorn
Download now