SlideShare a Scribd company logo
1 of 64
Download to read offline
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
BENGALURU
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Chaos Engineering:
Why breaking things should be practiced
Madhusudan Shekar | Oct 6, 2018
@madhushekar23
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Selfie Time…
Thank you @adhorn for slides
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Chaos Engineering:
Why breaking things should be practiced
Madhusudan Shekar | Oct 6, 2018
@madhushekar23
OLD WORLD IT
Employees at work
Factories + supply chainSales channels
Marketing analytics
Employees at work
Factories + supply chainSales channels
Marketing analytics
OLD WORLD IT
NEW WORLD IT
NEW WORLD IT Employees at work
Factories +
supply chain
IoT connected
things
Online
marketing
Continuous
supply tracking
Just in time
production
Online sales
+ delivery
Social media
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Worked in Dev
Ops problem now
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Failures are a given and
everything will eventually
fail over time.
Werner Vogels
CTO – Amazon.com
“ “
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
… at the Edge
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Unit testing of components:
• Tested in isolation to ensure function meets
expectations.
Functional testing of integrations:
• Each execution path tested to assure expected results.
Building Confidence Through Testing
Is it enough???
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jesse Robbins
GameDay: Creating Resiliency Through Destruction
https://www.youtube.com/watch?v=zoz0ZjfrQ9s
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
https://www.youtube.com/watch?v=zoz0ZjfrQ9s
Jesse Robbins – mid 2000’s
GameDay: Creating Resiliency Through Destruction
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Netflix 2013
https://medium.com/netflix-techblog/active-active-for-multi-regional-resiliency-c47719f6685b
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
https://bit.ly/2uKOJMQ
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Twilio Use-Case
Discovering Issues with HTTP/2 via Chaos
Testing
https://www.twilio.com/blog/2017/10/http2-issues.html
”While HTTP/2 provides for a
number of improvements over
HTTP/1.x, via Chaos
Testing we discovered that
there are situations where
HTTP/2 will perform worse than
HTTP/1.”
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What “really” is Chaos Engineering?
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
“Chaos Engineering is the discipline
of experimenting on a distributed
system
in order to build confidence in the
system’s capability to withstand
turbulent conditions in production.”
http://principlesofchaos.org
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Break your systems on
purpose.
Find out their weaknesses and
fix them before they break
when least expected.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Failure Injection
• Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attacks!
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Chaos Changes…
Application
Network & Data
Infrastructure
People
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Infrastructure
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Availability Downtime per year
99% (2-nines) 3 days 15 hours
99.99% (4-nines) 52 minutes
99.999% (5-nines) 5 minutes
99.9999% (6-nines) 31 seconds
Availability
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Availability in Parallel
Component Availability Downtime
X 99% (2-nines) 3 days 15 hours
Two X in parallel 99.99% (4-nines) 52 minutes
Three X in parallel 99.9999% (6-nines) 31 seconds
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Availability Zone 1 Availability Zone 2 Availability Zone n
Multi-AZ
Support Instance Failure
Application
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Auto-Scaling
• Compute efficiency
• Node failure
• Traffic spikes
• Performance bugs
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
• No updates on live systems
• Always start from a new instance being
provisioned
• Deploy the new software
• Test in different environments (dev, staging)
• Deploy to prod (inactive)
• Change references (DNS or Load Balancer)
• Keep old version around (inactive)
• Fast rollback if things go wrong
Immutable Infrastructure
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
• Template of the infrastructure in code.
• Version controlled infrastructure.
• Repeatable template.
• Testable infrastructure.
• Automate it!
Infrastructure as Code
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Network & Data
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Read / Write Sharding
RDS DB Instance
Read Replica
App
Instance
App
Instance
App
Instance
RDS DB Instance
Master (Multi-AZ)
RDS DB Instance
Read Replica
RDS DB Instance
Read Replica
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Database Federation
Users
DB
Products
DB
App
Instance
App
Instance
App
Instance
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Database Sharding User ShardID
002345 A
002346 B
002347 C
002348 B
002349 A
CBA
App
Instance
App
Instance
App
Instance
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Message passing for async. patterns
A
Queue
B
A
Queue
BListener
Pub-Sub
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Web
Instances
Worker
Instance
Worker
Instance
Queue
API
Instance
API
Instance
API
Instance
API: {DO foo}
PUT JOB: {JobID: 0001, Task: DO foo}
API: {JobID: 0001}
GET JOB: {JobID: 0001, Task: DO foo}
Cache
Result:
{
JobID: 0001,
Result: bar
}
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Exponential Backoff
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
• Wrap a protected function call in a
circuit breaker object, which monitors
for failures.
• If failures reach a certain threshold,
the circuit breaker trips.
Circuit Breaker
https://martinfowler.com/bliki/CircuitBreaker.html
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
1. Latency Based Routing
2. Geo DNS
3. Weighted Round Robin
4. DNS Failover
Dynamic Routing with Route53
Amazon
Route53
Resource A
In US
Resource B
in EU
User in US
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
1. Improve Latency for end-users
2. Disaster Recovery
Dynamic Routing
Applications in
US West
Applications in
US East
Users from
San
Francisco
Users from
New York
Service 1
Service 2
Service 3
Service 4
Service 1
Service 2
Service 3
Service 4
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Application
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Stateless Services
AZ1
AZ2
AWS Region
Data Store
Cache
Auto-ScalingGroup
User
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Transient state does not belong
in the database.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
CAP Theorem
Consistency Availability Partition Tolerance
Data is consistent.
All nodes see the same state.
Every request is non-failing. Service still responds as expected
if some nodes crash.
Distributed System
In the presence of a network partition, you must
choose between consistency and availability!
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
… if no new updates are made to
a given data item, eventually all
accesses to that item will return
the last updated value.
Eventual Consistency
Availability
An eventually consistent system can
return any value before it converges!!
https://en.wikipedia.org/wiki/Eventual_consistency
Distributed System
Every request is non-failing.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Service Degradation & Fallbacks
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
People
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
“It is not failure itself that holds you back; it
is the fear of failure that paralyses you.”
Brian Tracy
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Conway’s Law
User
UI Team
Application Team
DBA Team
”Any organization that designs a system (defined broadly)
will produce a design whose structure is a copy of the
organization's communication structure.”
http://www.melconway.com/Home/Conways_Law.html
Siloed Teams
Siloed
Applications
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Conway’s Law
”Any organization that designs a system (defined broadly)
will produce a design whose structure is a copy of the
organization's communication structure.”
http://www.melconway.com/Home/Conways_Law.html
Services
Cross-Functional
Teams
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Fire Drills
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Phases of Chaos Engineering
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Steady
State
Hypothesis
Design
Experiment
Verify
& Learn
Fix
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What is Steady State?
• ”normal” behavior of your system
https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Hypothesis …?
“What if this load balancer breaks?”
“What if Redis becomes slow?”
“What if a host on Cassandra goes away?”
”What if latency increases by 300ms?”
”What if the database stops?”
Make it everyone’s problem!
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Disclaimer!
Don’t make an hypothesis that you
know will break you!
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
• Pick hypothesis
• Scope the experiment
• Identify metrics
• Notify the organization
Run the Experiment
• Start with very small
• As close as possible to
production
• Minimize the blast radius.
• Have an emergency STOP!
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
DON’T blame that one person …
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Quantifying the result of the experiment
• Time to detect?
• Time for notification? And escalation?
• Time to public notification?
• Time for graceful degradation to kick-in?
• Time for self healing to happen?
• Time to recovery – partial and full?
• Time to all-clear and stable?
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
PostMortems
The 5 WHYs
Outage
Because of
…
Because of
…
Because of
…
Because of
…
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Big Challenges to Chaos Engineering
Mostly Cultural
• no time or flexibility to simulate disasters.
• teams already spending all of its time fixing things.
• can be very political.
• might force deep conversations.
• deeply invested in a specific technical roadmap (micro-
services) that chaos engineering tests show is not as
resilient to failures as originally predicted.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Changing Culture takes time!
Be patient…
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
More Resources
• https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
• https://www.gremlin.com
• https://queue.acm.org/detail.cfm?id=2353017
• https://softwareengineeringdaily.com/
• https://github.com/dastergon/awesome-sre
• https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
• https://medium.com/@NetflixTechBlog
• http://principlesofchaos.org
• https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp
• https://github.com/adhorn/awesome-chaos-engineering
• https://www.infoq.com/presentations/netflix-chaos-microservices
• http://royal.pingdom.com/wp-
content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf
• http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Thank you
@madhushekar23

More Related Content

What's hot

Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...Amazon Web Services
 
Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018
Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018
Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018Amazon Web Services
 
AWS IoT Core Workshop (IOT305-R1) - AWS re:Invent 2018
AWS IoT Core Workshop (IOT305-R1) - AWS re:Invent 2018AWS IoT Core Workshop (IOT305-R1) - AWS re:Invent 2018
AWS IoT Core Workshop (IOT305-R1) - AWS re:Invent 2018Amazon Web Services
 
Reserve Amazon EC2 On-Demand Capacity for Any Duration with On-Demand Capacit...
Reserve Amazon EC2 On-Demand Capacity for Any Duration with On-Demand Capacit...Reserve Amazon EC2 On-Demand Capacity for Any Duration with On-Demand Capacit...
Reserve Amazon EC2 On-Demand Capacity for Any Duration with On-Demand Capacit...Amazon Web Services
 
Predicting Hospital Readmissions Using Amazon Machine Learning (HLC304) - AWS...
Predicting Hospital Readmissions Using Amazon Machine Learning (HLC304) - AWS...Predicting Hospital Readmissions Using Amazon Machine Learning (HLC304) - AWS...
Predicting Hospital Readmissions Using Amazon Machine Learning (HLC304) - AWS...Amazon Web Services
 
Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...
Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...
Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...Amazon Web Services
 
Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...
Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...
Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...Amazon Web Services
 
Migrating Workloads from Oracle to Amazon Redshift: Best Practices with Pfize...
Migrating Workloads from Oracle to Amazon Redshift: Best Practices with Pfize...Migrating Workloads from Oracle to Amazon Redshift: Best Practices with Pfize...
Migrating Workloads from Oracle to Amazon Redshift: Best Practices with Pfize...Amazon Web Services
 
Access Control in AWS Glue Data Catalog (ANT376) - AWS re:Invent 2018
Access Control in AWS Glue Data Catalog (ANT376) - AWS re:Invent 2018Access Control in AWS Glue Data Catalog (ANT376) - AWS re:Invent 2018
Access Control in AWS Glue Data Catalog (ANT376) - AWS re:Invent 2018Amazon Web Services
 
Machine Learning for the Enterprise, ft. Sony Interactive Entertainment (ENT2...
Machine Learning for the Enterprise, ft. Sony Interactive Entertainment (ENT2...Machine Learning for the Enterprise, ft. Sony Interactive Entertainment (ENT2...
Machine Learning for the Enterprise, ft. Sony Interactive Entertainment (ENT2...Amazon Web Services
 
使用 AWS Step Functions 靈活調度 AWS Lambda (Level:200)
使用 AWS Step Functions 靈活調度 AWS Lambda (Level:200)使用 AWS Step Functions 靈活調度 AWS Lambda (Level:200)
使用 AWS Step Functions 靈活調度 AWS Lambda (Level:200)Amazon Web Services
 
[NEW LAUNCH!] Introducing AWS App Mesh – service mesh on AWS (CON367) - AWS r...
[NEW LAUNCH!] Introducing AWS App Mesh – service mesh on AWS (CON367) - AWS r...[NEW LAUNCH!] Introducing AWS App Mesh – service mesh on AWS (CON367) - AWS r...
[NEW LAUNCH!] Introducing AWS App Mesh – service mesh on AWS (CON367) - AWS r...Amazon Web Services
 
Create a Virtual Concierge Using Sumerian Hosts (ARV201) - AWS re:Invent 2018
Create a Virtual Concierge Using Sumerian Hosts (ARV201) - AWS re:Invent 2018Create a Virtual Concierge Using Sumerian Hosts (ARV201) - AWS re:Invent 2018
Create a Virtual Concierge Using Sumerian Hosts (ARV201) - AWS re:Invent 2018Amazon Web Services
 
Accelerate Innovation and Maximize Business Value with Serverless Application...
Accelerate Innovation and Maximize Business Value with Serverless Application...Accelerate Innovation and Maximize Business Value with Serverless Application...
Accelerate Innovation and Maximize Business Value with Serverless Application...Amazon Web Services
 
How Rovio Uses ML to Acquire, Retain, and Monetize Users (GAM304) - AWS re:In...
How Rovio Uses ML to Acquire, Retain, and Monetize Users (GAM304) - AWS re:In...How Rovio Uses ML to Acquire, Retain, and Monetize Users (GAM304) - AWS re:In...
How Rovio Uses ML to Acquire, Retain, and Monetize Users (GAM304) - AWS re:In...Amazon Web Services
 
Architecting for the Cloud (ARC213-R2) - AWS re:Invent 2018
Architecting for the Cloud (ARC213-R2) - AWS re:Invent 2018Architecting for the Cloud (ARC213-R2) - AWS re:Invent 2018
Architecting for the Cloud (ARC213-R2) - AWS re:Invent 2018Amazon Web Services
 
Machine Learning & Analytics on Snowball Edge (WPS314) - AWS re:Invent 2018
Machine Learning & Analytics on Snowball Edge (WPS314) - AWS re:Invent 2018Machine Learning & Analytics on Snowball Edge (WPS314) - AWS re:Invent 2018
Machine Learning & Analytics on Snowball Edge (WPS314) - AWS re:Invent 2018Amazon Web Services
 
Modernizing .NET Applications on AWS (GPSCT204) - AWS re:Invent 2018
Modernizing .NET Applications on AWS (GPSCT204) - AWS re:Invent 2018Modernizing .NET Applications on AWS (GPSCT204) - AWS re:Invent 2018
Modernizing .NET Applications on AWS (GPSCT204) - AWS re:Invent 2018Amazon Web Services
 
Computing at the Edge with AWS Greengrass and Amazon FreeRTOS, ft. General El...
Computing at the Edge with AWS Greengrass and Amazon FreeRTOS, ft. General El...Computing at the Edge with AWS Greengrass and Amazon FreeRTOS, ft. General El...
Computing at the Edge with AWS Greengrass and Amazon FreeRTOS, ft. General El...Amazon Web Services
 
Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...
Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...
Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...Amazon Web Services
 

What's hot (20)

Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
 
Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018
Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018
Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018
 
AWS IoT Core Workshop (IOT305-R1) - AWS re:Invent 2018
AWS IoT Core Workshop (IOT305-R1) - AWS re:Invent 2018AWS IoT Core Workshop (IOT305-R1) - AWS re:Invent 2018
AWS IoT Core Workshop (IOT305-R1) - AWS re:Invent 2018
 
Reserve Amazon EC2 On-Demand Capacity for Any Duration with On-Demand Capacit...
Reserve Amazon EC2 On-Demand Capacity for Any Duration with On-Demand Capacit...Reserve Amazon EC2 On-Demand Capacity for Any Duration with On-Demand Capacit...
Reserve Amazon EC2 On-Demand Capacity for Any Duration with On-Demand Capacit...
 
Predicting Hospital Readmissions Using Amazon Machine Learning (HLC304) - AWS...
Predicting Hospital Readmissions Using Amazon Machine Learning (HLC304) - AWS...Predicting Hospital Readmissions Using Amazon Machine Learning (HLC304) - AWS...
Predicting Hospital Readmissions Using Amazon Machine Learning (HLC304) - AWS...
 
Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...
Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...
Better, Faster, Cheaper – Cost Optimizing Compute with Amazon EC2 Fleet #savi...
 
Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...
Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...
Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...
 
Migrating Workloads from Oracle to Amazon Redshift: Best Practices with Pfize...
Migrating Workloads from Oracle to Amazon Redshift: Best Practices with Pfize...Migrating Workloads from Oracle to Amazon Redshift: Best Practices with Pfize...
Migrating Workloads from Oracle to Amazon Redshift: Best Practices with Pfize...
 
Access Control in AWS Glue Data Catalog (ANT376) - AWS re:Invent 2018
Access Control in AWS Glue Data Catalog (ANT376) - AWS re:Invent 2018Access Control in AWS Glue Data Catalog (ANT376) - AWS re:Invent 2018
Access Control in AWS Glue Data Catalog (ANT376) - AWS re:Invent 2018
 
Machine Learning for the Enterprise, ft. Sony Interactive Entertainment (ENT2...
Machine Learning for the Enterprise, ft. Sony Interactive Entertainment (ENT2...Machine Learning for the Enterprise, ft. Sony Interactive Entertainment (ENT2...
Machine Learning for the Enterprise, ft. Sony Interactive Entertainment (ENT2...
 
使用 AWS Step Functions 靈活調度 AWS Lambda (Level:200)
使用 AWS Step Functions 靈活調度 AWS Lambda (Level:200)使用 AWS Step Functions 靈活調度 AWS Lambda (Level:200)
使用 AWS Step Functions 靈活調度 AWS Lambda (Level:200)
 
[NEW LAUNCH!] Introducing AWS App Mesh – service mesh on AWS (CON367) - AWS r...
[NEW LAUNCH!] Introducing AWS App Mesh – service mesh on AWS (CON367) - AWS r...[NEW LAUNCH!] Introducing AWS App Mesh – service mesh on AWS (CON367) - AWS r...
[NEW LAUNCH!] Introducing AWS App Mesh – service mesh on AWS (CON367) - AWS r...
 
Create a Virtual Concierge Using Sumerian Hosts (ARV201) - AWS re:Invent 2018
Create a Virtual Concierge Using Sumerian Hosts (ARV201) - AWS re:Invent 2018Create a Virtual Concierge Using Sumerian Hosts (ARV201) - AWS re:Invent 2018
Create a Virtual Concierge Using Sumerian Hosts (ARV201) - AWS re:Invent 2018
 
Accelerate Innovation and Maximize Business Value with Serverless Application...
Accelerate Innovation and Maximize Business Value with Serverless Application...Accelerate Innovation and Maximize Business Value with Serverless Application...
Accelerate Innovation and Maximize Business Value with Serverless Application...
 
How Rovio Uses ML to Acquire, Retain, and Monetize Users (GAM304) - AWS re:In...
How Rovio Uses ML to Acquire, Retain, and Monetize Users (GAM304) - AWS re:In...How Rovio Uses ML to Acquire, Retain, and Monetize Users (GAM304) - AWS re:In...
How Rovio Uses ML to Acquire, Retain, and Monetize Users (GAM304) - AWS re:In...
 
Architecting for the Cloud (ARC213-R2) - AWS re:Invent 2018
Architecting for the Cloud (ARC213-R2) - AWS re:Invent 2018Architecting for the Cloud (ARC213-R2) - AWS re:Invent 2018
Architecting for the Cloud (ARC213-R2) - AWS re:Invent 2018
 
Machine Learning & Analytics on Snowball Edge (WPS314) - AWS re:Invent 2018
Machine Learning & Analytics on Snowball Edge (WPS314) - AWS re:Invent 2018Machine Learning & Analytics on Snowball Edge (WPS314) - AWS re:Invent 2018
Machine Learning & Analytics on Snowball Edge (WPS314) - AWS re:Invent 2018
 
Modernizing .NET Applications on AWS (GPSCT204) - AWS re:Invent 2018
Modernizing .NET Applications on AWS (GPSCT204) - AWS re:Invent 2018Modernizing .NET Applications on AWS (GPSCT204) - AWS re:Invent 2018
Modernizing .NET Applications on AWS (GPSCT204) - AWS re:Invent 2018
 
Computing at the Edge with AWS Greengrass and Amazon FreeRTOS, ft. General El...
Computing at the Edge with AWS Greengrass and Amazon FreeRTOS, ft. General El...Computing at the Edge with AWS Greengrass and Amazon FreeRTOS, ft. General El...
Computing at the Edge with AWS Greengrass and Amazon FreeRTOS, ft. General El...
 
Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...
Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...
Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...
 

Similar to Keynote - Chaos Engineering: Why breaking things should be practiced

Keynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos EngineeringKeynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos EngineeringAmazon Web Services
 
Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.Adrian Hornsby
 
Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...
Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...
Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...Amazon Web Services
 
Resiliency and Availability Design Patterns for the Cloud
Resiliency and Availability Design Patterns for the CloudResiliency and Availability Design Patterns for the Cloud
Resiliency and Availability Design Patterns for the CloudAmazon Web Services
 
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...Amazon Web Services
 
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...Amazon Web Services
 
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...Amazon Web Services
 
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018Amazon Web Services
 
Amazon CI/CD Practices for Software Development Teams - SRV320 - Atlanta AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Atlanta AWS ...Amazon CI/CD Practices for Software Development Teams - SRV320 - Atlanta AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Atlanta AWS ...Amazon Web Services
 
Applying principles of chaos engineering to serverless (reinvent DVC305)
Applying principles of chaos engineering to serverless (reinvent DVC305)Applying principles of chaos engineering to serverless (reinvent DVC305)
Applying principles of chaos engineering to serverless (reinvent DVC305)Yan Cui
 
Applying Principles of Chaos Engineering to Serverless (DVC305) - AWS re:Inve...
Applying Principles of Chaos Engineering to Serverless (DVC305) - AWS re:Inve...Applying Principles of Chaos Engineering to Serverless (DVC305) - AWS re:Inve...
Applying Principles of Chaos Engineering to Serverless (DVC305) - AWS re:Inve...Amazon Web Services
 
Building Global Multi-Region, Active-Active Serverless Backends
Building Global Multi-Region, Active-Active Serverless Backends Building Global Multi-Region, Active-Active Serverless Backends
Building Global Multi-Region, Active-Active Serverless Backends Amazon Web Services
 
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018Amazon Web Services
 
AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018
AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018
AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018Amazon Web Services Korea
 
Chaos Engineering with Kubernetes
Chaos Engineering with KubernetesChaos Engineering with Kubernetes
Chaos Engineering with KubernetesArun Gupta
 
Best Practices for Building Multi-Region, Active-Active Serverless Applicatio...
Best Practices for Building Multi-Region, Active-Active Serverless Applicatio...Best Practices for Building Multi-Region, Active-Active Serverless Applicatio...
Best Practices for Building Multi-Region, Active-Active Serverless Applicatio...Amazon Web Services
 
From Monolithic to Microservices
From Monolithic to Microservices From Monolithic to Microservices
From Monolithic to Microservices Amazon Web Services
 
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...Amazon Web Services
 
Amazon CI-CD Practices for Software Development Teams
Amazon CI-CD Practices for Software Development Teams Amazon CI-CD Practices for Software Development Teams
Amazon CI-CD Practices for Software Development Teams Amazon Web Services
 

Similar to Keynote - Chaos Engineering: Why breaking things should be practiced (20)

Keynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos EngineeringKeynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos Engineering
 
Chaos Engineering
Chaos EngineeringChaos Engineering
Chaos Engineering
 
Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.Chaos Engineering: Why Breaking Things Should Be Practised.
Chaos Engineering: Why Breaking Things Should Be Practised.
 
Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...
Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...
Chaos Engineering: Why Breaking Things Should Be Practiced - AWS Developer Wo...
 
Resiliency and Availability Design Patterns for the Cloud
Resiliency and Availability Design Patterns for the CloudResiliency and Availability Design Patterns for the Cloud
Resiliency and Availability Design Patterns for the Cloud
 
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Anaheim AWS ...
 
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Chicago AWS ...
 
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...
Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310...
 
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
Chaos Engineering and Scalability at Audible.com (ARC308) - AWS re:Invent 2018
 
Amazon CI/CD Practices for Software Development Teams - SRV320 - Atlanta AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Atlanta AWS ...Amazon CI/CD Practices for Software Development Teams - SRV320 - Atlanta AWS ...
Amazon CI/CD Practices for Software Development Teams - SRV320 - Atlanta AWS ...
 
Applying principles of chaos engineering to serverless (reinvent DVC305)
Applying principles of chaos engineering to serverless (reinvent DVC305)Applying principles of chaos engineering to serverless (reinvent DVC305)
Applying principles of chaos engineering to serverless (reinvent DVC305)
 
Applying Principles of Chaos Engineering to Serverless (DVC305) - AWS re:Inve...
Applying Principles of Chaos Engineering to Serverless (DVC305) - AWS re:Inve...Applying Principles of Chaos Engineering to Serverless (DVC305) - AWS re:Inve...
Applying Principles of Chaos Engineering to Serverless (DVC305) - AWS re:Inve...
 
Building Global Multi-Region, Active-Active Serverless Backends
Building Global Multi-Region, Active-Active Serverless Backends Building Global Multi-Region, Active-Active Serverless Backends
Building Global Multi-Region, Active-Active Serverless Backends
 
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
Releasing Mission-Critical Software at Amazon (DEV209-R1) - AWS re:Invent 2018
 
AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018
AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018
AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018
 
Chaos Engineering with Kubernetes
Chaos Engineering with KubernetesChaos Engineering with Kubernetes
Chaos Engineering with Kubernetes
 
Best Practices for Building Multi-Region, Active-Active Serverless Applicatio...
Best Practices for Building Multi-Region, Active-Active Serverless Applicatio...Best Practices for Building Multi-Region, Active-Active Serverless Applicatio...
Best Practices for Building Multi-Region, Active-Active Serverless Applicatio...
 
From Monolithic to Microservices
From Monolithic to Microservices From Monolithic to Microservices
From Monolithic to Microservices
 
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...
The Quest for Continuous ATO: A Case Study Featuring the US Intelligence Comm...
 
Amazon CI-CD Practices for Software Development Teams
Amazon CI-CD Practices for Software Development Teams Amazon CI-CD Practices for Software Development Teams
Amazon CI-CD Practices for Software Development Teams
 

More from AWS User Group Bengaluru

Lessons learnt building a Distributed Linked List on S3
Lessons learnt building a Distributed Linked List on S3Lessons learnt building a Distributed Linked List on S3
Lessons learnt building a Distributed Linked List on S3AWS User Group Bengaluru
 
Building Efficient, Scalable and Resilient Front-end logging service with AWS
Building Efficient, Scalable and Resilient Front-end logging service with AWSBuilding Efficient, Scalable and Resilient Front-end logging service with AWS
Building Efficient, Scalable and Resilient Front-end logging service with AWSAWS User Group Bengaluru
 
Exploring opportunities with communities for a successful career
Exploring opportunities with communities for a successful careerExploring opportunities with communities for a successful career
Exploring opportunities with communities for a successful careerAWS User Group Bengaluru
 
Slack's transition away from a single AWS account
Slack's transition away from a single AWS accountSlack's transition away from a single AWS account
Slack's transition away from a single AWS accountAWS User Group Bengaluru
 
Building Efficient, Scalable and Resilient Front-end logging service with AWS
Building Efficient, Scalable and Resilient Front-end logging service with AWSBuilding Efficient, Scalable and Resilient Front-end logging service with AWS
Building Efficient, Scalable and Resilient Front-end logging service with AWSAWS User Group Bengaluru
 
Medlife's journey with AWS from 0(zero) orders to 6 digit mark
Medlife's journey with AWS from 0(zero) orders to 6 digit markMedlife's journey with AWS from 0(zero) orders to 6 digit mark
Medlife's journey with AWS from 0(zero) orders to 6 digit markAWS User Group Bengaluru
 
Exploring opportunities with communities for a successful career
Exploring opportunities with communities for a successful careerExploring opportunities with communities for a successful career
Exploring opportunities with communities for a successful careerAWS User Group Bengaluru
 
Lessons learnt building a Distributed Linked List on S3
Lessons learnt building a Distributed Linked List on S3Lessons learnt building a Distributed Linked List on S3
Lessons learnt building a Distributed Linked List on S3AWS User Group Bengaluru
 
Decentralized enterprise architecture using Blockchain & AWS
Decentralized enterprise architecture using Blockchain & AWSDecentralized enterprise architecture using Blockchain & AWS
Decentralized enterprise architecture using Blockchain & AWSAWS User Group Bengaluru
 

More from AWS User Group Bengaluru (20)

Demystifying identity on AWS
Demystifying identity on AWSDemystifying identity on AWS
Demystifying identity on AWS
 
AWS Secrets for Best Practices
AWS Secrets for Best PracticesAWS Secrets for Best Practices
AWS Secrets for Best Practices
 
Cloud Security
Cloud SecurityCloud Security
Cloud Security
 
Lessons learnt building a Distributed Linked List on S3
Lessons learnt building a Distributed Linked List on S3Lessons learnt building a Distributed Linked List on S3
Lessons learnt building a Distributed Linked List on S3
 
Medlife journey with AWS
Medlife journey with AWSMedlife journey with AWS
Medlife journey with AWS
 
Building Efficient, Scalable and Resilient Front-end logging service with AWS
Building Efficient, Scalable and Resilient Front-end logging service with AWSBuilding Efficient, Scalable and Resilient Front-end logging service with AWS
Building Efficient, Scalable and Resilient Front-end logging service with AWS
 
Exploring opportunities with communities for a successful career
Exploring opportunities with communities for a successful careerExploring opportunities with communities for a successful career
Exploring opportunities with communities for a successful career
 
Slack's transition away from a single AWS account
Slack's transition away from a single AWS accountSlack's transition away from a single AWS account
Slack's transition away from a single AWS account
 
Log analytics with ELK stack
Log analytics with ELK stackLog analytics with ELK stack
Log analytics with ELK stack
 
Serverless Culture
Serverless CultureServerless Culture
Serverless Culture
 
Refactoring to serverless
Refactoring to serverlessRefactoring to serverless
Refactoring to serverless
 
Amazon EC2 Spot Instances Workshop
Amazon EC2 Spot Instances WorkshopAmazon EC2 Spot Instances Workshop
Amazon EC2 Spot Instances Workshop
 
Building Efficient, Scalable and Resilient Front-end logging service with AWS
Building Efficient, Scalable and Resilient Front-end logging service with AWSBuilding Efficient, Scalable and Resilient Front-end logging service with AWS
Building Efficient, Scalable and Resilient Front-end logging service with AWS
 
Medlife's journey with AWS from 0(zero) orders to 6 digit mark
Medlife's journey with AWS from 0(zero) orders to 6 digit markMedlife's journey with AWS from 0(zero) orders to 6 digit mark
Medlife's journey with AWS from 0(zero) orders to 6 digit mark
 
AWS Secrets for Best Practices
AWS Secrets for Best PracticesAWS Secrets for Best Practices
AWS Secrets for Best Practices
 
Exploring opportunities with communities for a successful career
Exploring opportunities with communities for a successful careerExploring opportunities with communities for a successful career
Exploring opportunities with communities for a successful career
 
Lessons learnt building a Distributed Linked List on S3
Lessons learnt building a Distributed Linked List on S3Lessons learnt building a Distributed Linked List on S3
Lessons learnt building a Distributed Linked List on S3
 
Cloud Security
Cloud SecurityCloud Security
Cloud Security
 
Cost Optimization in AWS
Cost Optimization in AWSCost Optimization in AWS
Cost Optimization in AWS
 
Decentralized enterprise architecture using Blockchain & AWS
Decentralized enterprise architecture using Blockchain & AWSDecentralized enterprise architecture using Blockchain & AWS
Decentralized enterprise architecture using Blockchain & AWS
 

Recently uploaded

THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023
THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023
THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023Joshua Flannery
 
Tecnogravura, Cylinder Engraving for Rotogravure
Tecnogravura, Cylinder Engraving for RotogravureTecnogravura, Cylinder Engraving for Rotogravure
Tecnogravura, Cylinder Engraving for RotogravureAntonio de Llamas
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessWSO2
 
Deliver Latency Free Customer Experience
Deliver Latency Free Customer ExperienceDeliver Latency Free Customer Experience
Deliver Latency Free Customer ExperienceOpsTree solutions
 
Dynamical Context introduction word sensibility orientation
Dynamical Context introduction word sensibility orientationDynamical Context introduction word sensibility orientation
Dynamical Context introduction word sensibility orientationBuild Intuit
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxAna-Maria Mihalceanu
 
Dublin_mulesoft_meetup_API_specifications.pptx
Dublin_mulesoft_meetup_API_specifications.pptxDublin_mulesoft_meetup_API_specifications.pptx
Dublin_mulesoft_meetup_API_specifications.pptxKunal Gupta
 
Introduction-to-Wazuh-and-its-integration.pptx
Introduction-to-Wazuh-and-its-integration.pptxIntroduction-to-Wazuh-and-its-integration.pptx
Introduction-to-Wazuh-and-its-integration.pptxmprakaash5
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Transcript: Green paths: Learning from publishers’ sustainability journeys - ...
Transcript: Green paths: Learning from publishers’ sustainability journeys - ...Transcript: Green paths: Learning from publishers’ sustainability journeys - ...
Transcript: Green paths: Learning from publishers’ sustainability journeys - ...BookNet Canada
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfROWELL MARQUINA
 
Tetracrom printing process for packaging with CMYK+
Tetracrom printing process for packaging with CMYK+Tetracrom printing process for packaging with CMYK+
Tetracrom printing process for packaging with CMYK+Antonio de Llamas
 
Why Agile? - A handbook behind Agile Evolution
Why Agile? - A handbook behind Agile EvolutionWhy Agile? - A handbook behind Agile Evolution
Why Agile? - A handbook behind Agile EvolutionDEEPRAJ PATHAK
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Software Security in the Real World w/Kelsey Hightower
Software Security in the Real World w/Kelsey HightowerSoftware Security in the Real World w/Kelsey Hightower
Software Security in the Real World w/Kelsey HightowerAnchore
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
The Critical Role of Spatial Data in Today's Data Ecosystem
The Critical Role of Spatial Data in Today's Data EcosystemThe Critical Role of Spatial Data in Today's Data Ecosystem
The Critical Role of Spatial Data in Today's Data EcosystemSafe Software
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 

Recently uploaded (20)

THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023
THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023
THE STATE OF STARTUP ECOSYSTEM - INDIA x JAPAN 2023
 
Tecnogravura, Cylinder Engraving for Rotogravure
Tecnogravura, Cylinder Engraving for RotogravureTecnogravura, Cylinder Engraving for Rotogravure
Tecnogravura, Cylinder Engraving for Rotogravure
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Deliver Latency Free Customer Experience
Deliver Latency Free Customer ExperienceDeliver Latency Free Customer Experience
Deliver Latency Free Customer Experience
 
Dynamical Context introduction word sensibility orientation
Dynamical Context introduction word sensibility orientationDynamical Context introduction word sensibility orientation
Dynamical Context introduction word sensibility orientation
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
Dublin_mulesoft_meetup_API_specifications.pptx
Dublin_mulesoft_meetup_API_specifications.pptxDublin_mulesoft_meetup_API_specifications.pptx
Dublin_mulesoft_meetup_API_specifications.pptx
 
Introduction-to-Wazuh-and-its-integration.pptx
Introduction-to-Wazuh-and-its-integration.pptxIntroduction-to-Wazuh-and-its-integration.pptx
Introduction-to-Wazuh-and-its-integration.pptx
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Transcript: Green paths: Learning from publishers’ sustainability journeys - ...
Transcript: Green paths: Learning from publishers’ sustainability journeys - ...Transcript: Green paths: Learning from publishers’ sustainability journeys - ...
Transcript: Green paths: Learning from publishers’ sustainability journeys - ...
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdf
 
BoSEU24 | Bill Thompson | Talk From Another Century
BoSEU24 | Bill Thompson | Talk From Another CenturyBoSEU24 | Bill Thompson | Talk From Another Century
BoSEU24 | Bill Thompson | Talk From Another Century
 
Tetracrom printing process for packaging with CMYK+
Tetracrom printing process for packaging with CMYK+Tetracrom printing process for packaging with CMYK+
Tetracrom printing process for packaging with CMYK+
 
Why Agile? - A handbook behind Agile Evolution
Why Agile? - A handbook behind Agile EvolutionWhy Agile? - A handbook behind Agile Evolution
Why Agile? - A handbook behind Agile Evolution
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Software Security in the Real World w/Kelsey Hightower
Software Security in the Real World w/Kelsey HightowerSoftware Security in the Real World w/Kelsey Hightower
Software Security in the Real World w/Kelsey Hightower
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
The Critical Role of Spatial Data in Today's Data Ecosystem
The Critical Role of Spatial Data in Today's Data EcosystemThe Critical Role of Spatial Data in Today's Data Ecosystem
The Critical Role of Spatial Data in Today's Data Ecosystem
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 

Keynote - Chaos Engineering: Why breaking things should be practiced

  • 1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. BENGALURU
  • 2. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Chaos Engineering: Why breaking things should be practiced Madhusudan Shekar | Oct 6, 2018 @madhushekar23
  • 3. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Selfie Time… Thank you @adhorn for slides
  • 4. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Chaos Engineering: Why breaking things should be practiced Madhusudan Shekar | Oct 6, 2018 @madhushekar23
  • 5. OLD WORLD IT Employees at work Factories + supply chainSales channels Marketing analytics
  • 6. Employees at work Factories + supply chainSales channels Marketing analytics OLD WORLD IT NEW WORLD IT
  • 7. NEW WORLD IT Employees at work Factories + supply chain IoT connected things Online marketing Continuous supply tracking Just in time production Online sales + delivery Social media
  • 8. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Worked in Dev Ops problem now
  • 9. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Failures are a given and everything will eventually fail over time. Werner Vogels CTO – Amazon.com “ “
  • 10. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. … at the Edge
  • 11. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Unit testing of components: • Tested in isolation to ensure function meets expectations. Functional testing of integrations: • Each execution path tested to assure expected results. Building Confidence Through Testing Is it enough???
  • 12. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jesse Robbins GameDay: Creating Resiliency Through Destruction https://www.youtube.com/watch?v=zoz0ZjfrQ9s
  • 13. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. https://www.youtube.com/watch?v=zoz0ZjfrQ9s Jesse Robbins – mid 2000’s GameDay: Creating Resiliency Through Destruction
  • 14. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Netflix 2013 https://medium.com/netflix-techblog/active-active-for-multi-regional-resiliency-c47719f6685b
  • 15. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 16. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. https://bit.ly/2uKOJMQ
  • 17. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Twilio Use-Case Discovering Issues with HTTP/2 via Chaos Testing https://www.twilio.com/blog/2017/10/http2-issues.html ”While HTTP/2 provides for a number of improvements over HTTP/1.x, via Chaos Testing we discovered that there are situations where HTTP/2 will perform worse than HTTP/1.”
  • 18. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What “really” is Chaos Engineering?
  • 19. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” http://principlesofchaos.org
  • 20. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.
  • 21. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Failure Injection • Start small & build confidence • Application level • Host failure • Resource attacks (CPU, memory, …) • Network attacks (dependencies, latency, …) • Region attacks!
  • 22. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 23. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Chaos Changes… Application Network & Data Infrastructure People
  • 24. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Infrastructure
  • 25. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Availability Downtime per year 99% (2-nines) 3 days 15 hours 99.99% (4-nines) 52 minutes 99.999% (5-nines) 5 minutes 99.9999% (6-nines) 31 seconds Availability
  • 26. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Availability in Parallel Component Availability Downtime X 99% (2-nines) 3 days 15 hours Two X in parallel 99.99% (4-nines) 52 minutes Three X in parallel 99.9999% (6-nines) 31 seconds
  • 27. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Availability Zone 1 Availability Zone 2 Availability Zone n Multi-AZ Support Instance Failure Application
  • 28. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Auto-Scaling • Compute efficiency • Node failure • Traffic spikes • Performance bugs
  • 29. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • No updates on live systems • Always start from a new instance being provisioned • Deploy the new software • Test in different environments (dev, staging) • Deploy to prod (inactive) • Change references (DNS or Load Balancer) • Keep old version around (inactive) • Fast rollback if things go wrong Immutable Infrastructure
  • 30. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Template of the infrastructure in code. • Version controlled infrastructure. • Repeatable template. • Testable infrastructure. • Automate it! Infrastructure as Code
  • 31. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Network & Data
  • 32. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Read / Write Sharding RDS DB Instance Read Replica App Instance App Instance App Instance RDS DB Instance Master (Multi-AZ) RDS DB Instance Read Replica RDS DB Instance Read Replica
  • 33. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Database Federation Users DB Products DB App Instance App Instance App Instance
  • 34. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Database Sharding User ShardID 002345 A 002346 B 002347 C 002348 B 002349 A CBA App Instance App Instance App Instance
  • 35. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Message passing for async. patterns A Queue B A Queue BListener Pub-Sub
  • 36. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Web Instances Worker Instance Worker Instance Queue API Instance API Instance API Instance API: {DO foo} PUT JOB: {JobID: 0001, Task: DO foo} API: {JobID: 0001} GET JOB: {JobID: 0001, Task: DO foo} Cache Result: { JobID: 0001, Result: bar }
  • 37. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Exponential Backoff
  • 38. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Wrap a protected function call in a circuit breaker object, which monitors for failures. • If failures reach a certain threshold, the circuit breaker trips. Circuit Breaker https://martinfowler.com/bliki/CircuitBreaker.html
  • 39. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 1. Latency Based Routing 2. Geo DNS 3. Weighted Round Robin 4. DNS Failover Dynamic Routing with Route53 Amazon Route53 Resource A In US Resource B in EU User in US
  • 40. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 1. Improve Latency for end-users 2. Disaster Recovery Dynamic Routing Applications in US West Applications in US East Users from San Francisco Users from New York Service 1 Service 2 Service 3 Service 4 Service 1 Service 2 Service 3 Service 4
  • 41. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Application
  • 42. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Stateless Services AZ1 AZ2 AWS Region Data Store Cache Auto-ScalingGroup User
  • 43. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Transient state does not belong in the database.
  • 44. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. CAP Theorem Consistency Availability Partition Tolerance Data is consistent. All nodes see the same state. Every request is non-failing. Service still responds as expected if some nodes crash. Distributed System In the presence of a network partition, you must choose between consistency and availability!
  • 45. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. … if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. Eventual Consistency Availability An eventually consistent system can return any value before it converges!! https://en.wikipedia.org/wiki/Eventual_consistency Distributed System Every request is non-failing.
  • 46. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Service Degradation & Fallbacks
  • 47. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. People
  • 48. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. “It is not failure itself that holds you back; it is the fear of failure that paralyses you.” Brian Tracy
  • 49. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Conway’s Law User UI Team Application Team DBA Team ”Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.” http://www.melconway.com/Home/Conways_Law.html Siloed Teams Siloed Applications
  • 50. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Conway’s Law ”Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.” http://www.melconway.com/Home/Conways_Law.html Services Cross-Functional Teams
  • 51. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Fire Drills
  • 52. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Phases of Chaos Engineering
  • 53. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Steady State Hypothesis Design Experiment Verify & Learn Fix
  • 54. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is Steady State? • ”normal” behavior of your system https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
  • 55. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Hypothesis …? “What if this load balancer breaks?” “What if Redis becomes slow?” “What if a host on Cassandra goes away?” ”What if latency increases by 300ms?” ”What if the database stops?” Make it everyone’s problem!
  • 56. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Disclaimer! Don’t make an hypothesis that you know will break you!
  • 57. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Pick hypothesis • Scope the experiment • Identify metrics • Notify the organization Run the Experiment • Start with very small • As close as possible to production • Minimize the blast radius. • Have an emergency STOP!
  • 58. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. DON’T blame that one person …
  • 59. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Quantifying the result of the experiment • Time to detect? • Time for notification? And escalation? • Time to public notification? • Time for graceful degradation to kick-in? • Time for self healing to happen? • Time to recovery – partial and full? • Time to all-clear and stable?
  • 60. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. PostMortems The 5 WHYs Outage Because of … Because of … Because of … Because of …
  • 61. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Big Challenges to Chaos Engineering Mostly Cultural • no time or flexibility to simulate disasters. • teams already spending all of its time fixing things. • can be very political. • might force deep conversations. • deeply invested in a specific technical roadmap (micro- services) that chaos engineering tests show is not as resilient to failures as originally predicted.
  • 62. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Changing Culture takes time! Be patient…
  • 63. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. More Resources • https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf • https://www.gremlin.com • https://queue.acm.org/detail.cfm?id=2353017 • https://softwareengineeringdaily.com/ • https://github.com/dastergon/awesome-sre • https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf • https://medium.com/@NetflixTechBlog • http://principlesofchaos.org • https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp • https://github.com/adhorn/awesome-chaos-engineering • https://www.infoq.com/presentations/netflix-chaos-microservices • http://royal.pingdom.com/wp- content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf • http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy
  • 64. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Thank you @madhushekar23

Editor's Notes

  1. With the rise of microservices and distributed cloud architectures, the web has grown increasingly complex. As a result, “random” failures have grown difficult to predict. At the same time, our dependence on these systems has only increased.
  2. Traditionally, these sensible measures to gain confidence are taken before systems or applications reach production. Once in production, the traditional approach is to rely on monitoring and logging to confirm that everything is working correctly. If it is behaving as expected, then you don't have a problem. If it is not, and it requires human intervention (troubleshooting, triage, resolution, etc.), then you need to react to the incident and get things working again as fast as possible. This implies that once a system is in production, "Don't touch it!"—except, of course, when it's broken, in which case touch it all you want, under the time pressure inherent in an outage response. https://queue.acm.org/detail.cfm?id=2353017
  3. GameDays were coined by Jesse Robbins when he worked at Amazon and was responsible for availability. Jesse created GameDays with the goal of increasing reliability by purposefully creating major failures on a regular basis. 
  4. Super power with Docker (Dockerfiles) instead of Chef or Puppet.
  5. Invest time to save time
  6. Write and updates Counters!!!! Not on the DB – redis!!
  7. Database Federation is where we break up the database by function. In our example, we have broken out the Forums DB from the User DB from the Products DB Of course, cross functional queries are harder to do and you may need to do your joins at the application layer for these types of queries This will reduce our database footprint for a while and the great thing is, this does prevent you from having to shard until much further down the line. This isn’t going to help for single large tables; for this we will need to shard.
  8. Sharding is where we break up that single large database into multiple DBs. We might need to do this because of database or table size or potentially for high write IOPs as well. Here is an example of us breaking up a database with a large table into 3 databases. Above we show where each userID is located, but the easiest way to describe how this would work would be to use the example of all users with A-H go into one DB, and I – M go in another, and N – Z go into the third DB. Typically this is done by key space and your application has to be aware of where to read from, update and write to for a particular record. ORM support can help here. This does create operation complexity so if you can federate first, do that. This can be done with SQL or NoSQL, and DynamoDB does this for you under the covers on the backend as your data size increases and the reads / writes per second scale.
  9. Route your website visitors to an alternate location to avoid site outages
  10. Does a region Fail? Full region: no Individual services can fail region-wide Most of the time, configuration issue Leading to cascading failures.
  11. Eventual consistency, also called optimistic replication,[2] is widely deployed in distributed systems, and has origins in early mobile computing projects.[3] A system that has achieved eventual consistency is often said to have converged, or achieved replica convergence.[4] Eventual consistency is a weak guarantee – most stronger models, like linearizability are trivially eventually consistent, but a system that is merely eventually consistent does not usually fulfill these stronger constraints.
  12. The stronger the relashionship between the metric and the business outcome you care about, the stronger the signal you have for making actionable decisions.