SlideShare a Scribd company logo
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Adrian Hornsby, Cloud Architecture Evangelist @ AWS
@adhorn
Chaos Engineering:
Why Breaking Things Should Be Practiced.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Been there?
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
An Area of Complex &
Dynamic Systems
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Failures are a given and
everything will eventually
fail over time.
Werner Vogels
CTO – Amazon.com
“ “
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
System failure rate
Early Failures
Wear Out Failures
Observed Failures
Random Failures
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
System failure rate
For high-velocity deployments
Early Failures
Wear Out Failures
Observed Failures
Random Failures
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
… at the Edge
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Building Confidence Through Testing
Unit testing of components:
• Tested in isolation to ensure function meets expectations.
Functional testing of integrations:
• Each execution path tested to assure expected results.
Is it enough???
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jesse Robbins – mid 2000’s
GameDay: Creating Resiliency Through Destruction
https://www.youtube.com/watch?v=zoz0ZjfrQ9s
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
https://www.youtube.com/watch?v=zoz0ZjfrQ9s
Jesse Robbins – mid 2000’s
GameDay: Creating Resiliency Through Destruction
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Netflix 2013
https://medium.com/netflix-techblog/active-active-for-multi-regional-resiliency-c47719f6685b
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Chaos Monkeys
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
https://github.com/Netflix/SimianArmy
• Chaos Monkey - Kill instances randomly
• Latency Monkey - Induce latency in services
• Chaos Gorilla - Simulates AZ and regions failure
• Conformity Monkey - Make sure instances follow good
practices
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
http://principlesofchaos.org
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
https://bit.ly/2uKOJMQ
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What “really” is Chaos Engineering?
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
“Chaos Engineering is the
discipline of experimenting on a
distributed system
in order to build confidence in the
system’s capability to withstand
turbulent conditions in
production.”http://principlesofchaos.org
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Break your systems on
purpose.
Find out their weaknesses
and fix them before they
break when least expected.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Failure Injection
• Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attacks!
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
“CHAOS DOESN’T CAUSE PROBLEMS.
IT REVEALS THEM.”
Nora Jones
Senior Chaos Engineer, Netflix
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Before breaking things …
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Really!!
Don’t break things
before you have done
your home work!
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
People
Application
Network & Data
Infrastructure
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Infrastructure
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Availability
Availability Downtime per year Categories
95% (1-nine) 18 days 6 hours
Batch processing, Data extraction,
Load jobs.
99% (2-nines) 3 days 15 hours Internal Tools, Project Tracking
99.9% (3-nines) 8 hours 45 minutes Online Commerce
99.99% (4-nines) 52 minutes Video Delivery, Broadcast systems
99.999% (5-nines) 5 minutes Telecom Industry (ATM Transactions)
99.9999% (6-nines) 31 seconds Answering to my loved one*
* Joke 
http://royal.pingdom.com/wp-content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
System Availability
Availability =
Normal Operation Time
Total Time
MTBF**
MTBF** + MTTR*
=
* Mean Time To Repair (MTTR)
**Mean Time Between Failure (MTBF)
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Availability in Series
Component Availability Downtime
X 99% (2-nines) 3 days 15 hours
Y 99.99% (4-nines) 52 minutes
X and Y Combined 98.99% 3 days 16 hours 33 minutes
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Availability in Parallel
Component Availability Downtime
X 99% (2-nines) 3 days 15 hours
Two X in parallel 99.99% (4-nines) 52 minutes
Three X in parallel 99.9999% (6-nines) 31 seconds
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Availability Zone 1 Availability Zone 2 Availability Zone n
Multi-AZ
Support Instance Failure
Application
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Auto-Scaling • Compute efficiency
• Node failure
• Traffic spikes
• Performance bugs
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Immutable Infrastructure
• No updates on live systems
• Always start from a new instance being provisioned
• Deploy the new software
• Test in different environments (dev, staging)
• Deploy to prod (inactive)
• Change references (DNS or Load Balancer)
• Keep old version around (inactive)
• Fast rollback if things go wrong
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Immutable components are replaced for
every deployment, rather than being updated in-
place.
Immutable Infrastructure
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Infrastructure as Code
• Template of the infrastructure in code.
• Version controlled infrastructure.
• Repeatable template.
• Testable infrastructure.
• Automate it!
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Network & Data
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Read / Write Sharding
RDS DB Instance
Read Replica
App
Instance
App
Instance
App
Instance
RDS DB Instance
Master (Multi-AZ)
RDS DB Instance
Read Replica
RDS DB Instance
Read Replica
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Database Federation
Users
DB
Products
DB
App
Instance
App
Instance
App
Instance
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Database Sharding
User ShardID
002345 A
002346 B
002347 C
002348 B
002349 A
CBA
App
Instance
App
Instance
App
Instance
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Message passing for async. patterns
A
Queue
B
A
Queue
BListener
Pub-Sub
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Web
Instances
Worker
Instance
Worker
Instance
Queue
API
Instance
API
Instance
API
Instance
API: {DO foo}
PUT JOB: {JobID: 0001, Task: DO foo}
API: {JobID: 0001}
GET JOB: {JobID: 0001, Task: DO foo}
Cache
Result:
{
JobID: 0001,
Result: bar
}
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Worker
Instance
Worker
Instance
Queue
API
Instance
API
Instance
API
Instance
Cache
Amazon SNS
Push Notification
User
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Exponential Backoff
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Circuit Breaker
• Wrap a protected function call in a
circuit breaker object, which
monitors for failures.
• If failures reach a certain threshold,
the circuit breaker trips.
https://martinfowler.com/bliki/CircuitBreaker.html
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Dynamic Routing with Route53
1. Latency Based Routing
2. Geo DNS
3. Weighted Round Robin
4. DNS Failover
Amazon
Route53
Resource A
In US
Resource B
in EU
User in US
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Dynamic Routing
1. Improve Latency for end-users
2. Disaster Recovery
Applications in
US West
Applications in
US East
Users from
San
Francisco
Users from
New York
Service 1
Service 2
Service 3
Service 4
Service 1
Service 2
Service 3
Service 4
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Application
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Stateless Services
AZ1
AZ2
AWS Region
Data Store
Cache
Auto-ScalingGroup
User
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Transient state
does not belong
in the database.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
CAP Theorem
Consistency Availability Partition Tolerance
Data is consistent.
All nodes see the same state.
Every request is non-failing. Service still responds as expected
if some nodes crash.
Distributed System
In the presence of a network partition, you must
choose between consistency and availability!
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Eventual Consistency
… if no new updates are
made to a given data item,
eventually all accesses to that
item will return the last
updated value.
Availability
An eventually consistent system can
return any value before it converges!!
https://en.wikipedia.org/wiki/Eventual_consistency
Distributed System
Every request is non-failing.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Process A Process B Process A Process B
Synchronous Asynchronous
Waiting
Working
Continues
get or fetch resultGet result
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Non-blocking UI
https://medium.com/@sophie_paxtonUX/stop-getting-in-my-way-non-blocking-ux-5cbbfe0f0158
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Exception Handling
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Service Degradation & Fallbacks
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
People
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
“It is not failure itself that holds you back; it
is the fear of failure that paralyses you.”
Brian Tracy
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Conway’s Law
User
UI Team
Application Team
DBA Team
”Any organization that designs a system (defined broadly)
will produce a design whose structure is a copy of the
organization's communication structure.”
http://www.melconway.com/Home/Conways_Law.html
Siloed Teams
Siloed
Applications
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Conway’s Law
”Any organization that designs a system (defined broadly)
will produce a design whose structure is a copy of the
organization's communication structure.”
http://www.melconway.com/Home/Conways_Law.html
Services
Cross-Functional
Teams
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Fire Drills
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Phases of Chaos Engineering
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Steady
State
Hypothesis
Design
Experiment
Verify
& Learn
Fix
GameDay
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Steady State
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What is Steady State?
• ”normal” behavior of your system
https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
What is Steady State?
• ”normal” behavior of your system
• Business Metric
https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Business Metrics at work
Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg
Linden).
Google: 500 ms of extra load time caused 20% fewer searches (Marissa
Mayer).
Yahoo!: 400 ms of extra load time caused a 5–9% increase in the
number of people who clicked “back” before the page even loaded (Nicole
Sullivan).
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Steady State
Important:
• Know the value range of Healthy State!
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Hypothesis
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What if…?
“What if this load balancer breaks?”
“What if Redis becomes slow?”
“What if a host on Cassandra goes away?”
”What if latency increases by 300ms?”
”What if the database stops?”
Make it everyone’s problem!
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Disclaimer!
Don’t make an hypothesis that you know will
break you!
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Netflix team members Casey Rosenthal, Lorin Hochstein, Aaron
Blohowiak, Nora Jones, and Ali Basiri, suggest the following inputs for
Chaos experiments:
• Simulating the failure of an entire region or datacenter.
• Partially deleting Kafka topics over a variety of instances to recreate an issue that
occurred in production.
• Injecting latency between services for a select percentage of traffic over a
predetermined period of time.
• Function-based chaos (runtime injection): Randomly causing functions to throw
exceptions.
• Code insertion: Adding instructions to the target program and allowing fault injection to
occur prior to certain instructions.
• Time travel: Forcing system clocks out of sync with each other.
• Executing a routine in driver code emulating I/O errors.
• Maxing out CPU cores on an Elasticsearch cluster.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Designing Experiment
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
• Pick hypothesis
• Scope the experiment
• Identify metrics
• Notify the organization
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Rules of thumbs
• Start with very small
• As close as possible to production
• Minimize the blast radius.
• Have an emergency STOP!
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Run the Experiment
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
New Version
Users
Canary deployment
Old Version
99%
Users
1%
Users
Start with ..
Dynamic Routing
(Route53)
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Verify & Learn
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
DON’T blame that one person …
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Quantifying the result of the experiment
• Time to detect?
• Time for notification? And escalation?
• Time to public notification?
• Time for graceful degradation to kick-in?
• Time for self healing to happen?
• Time to recovery – partial and full?
• Time to all-clear and stable?
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
PostMortems
The 5 WHYs
Outage
Because of
…
Because of
…
Because of
…
Because of
…
The Conveyor Belt Accident
Question: Why did the associate damage his thumb?
Answer: Because his thumb got caught in the conveyor.
Question: Why did his thumb get caught in the conveyor?
Answer: Because he was chasing his bag, which was on a running conveyor
belt.
Question: Why did he chase his bag?
Answer: Because he placed his bag on the conveyor, but it then turned-on by
surprise
Question: Why was his bag on the conveyor?
Answer: Because he used the conveyor as a table
Conclusion: So, the likely root cause of the associate’s damaged thumb is that
he simply needed a table, there wasn’t one around, so he used a conveyor as
a table.
https://www.linkedin.com/pulse/use-5-whys-find-root-causes-peter-abilla/
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
No, seriously. Root Cause is a Fallacy.
http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy/
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
No, seriously. Root Cause is a Fallacy.
• Can you clarify if there were any preceding events?
• Why would they believe acting in this way was the best course of action to
deliver the desired outcome?
• Is there another failure mode that could present here?
• What decisions or events prior to this made this work before?
• Why stop there – are there places to dig deeper that could shine a light
more on this?
• Did others step in to help, to advise, or to intercede?
http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy/
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Fix
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Big Challenges to Chaos Engineering
Mostly Cultural
• no time or flexibility to simulate disasters.
• teams already spending all of its time fixing things.
• can be very political.
• might force deep conversations.
• deeply invested in a specific technical roadmap (micro-
services) that chaos engineering tests show is not as resilient
to failures as originally predicted.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Changing Culture takes time!
Be patient…
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
More Resources
• https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
• https://www.gremlin.com
• https://queue.acm.org/detail.cfm?id=2353017
• https://softwareengineeringdaily.com/
• https://github.com/dastergon/awesome-sre
• https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
• https://medium.com/@NetflixTechBlog
• http://principlesofchaos.org
• https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp
• https://github.com/adhorn/awesome-chaos-engineering
• https://www.infoq.com/presentations/netflix-chaos-microservices
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Twilio Use-Case
Discovering Issues with HTTP/2 via Chaos
Testing
https://www.twilio.com/blog/2017/10/http2-issues.html
”While HTTP/2 provides for a
number of improvements over
HTTP/1.x, via Chaos
Testing we discovered that
there are situations where
HTTP/2 will perform worse than
HTTP/1.”
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Thanks you!
@adhorn
https://medium.com/@adhorn

More Related Content

What's hot

NEW LAUNCH! Amazon Neptune Overview and Customer Use Cases - DAT319 - re:Inve...
NEW LAUNCH! Amazon Neptune Overview and Customer Use Cases - DAT319 - re:Inve...NEW LAUNCH! Amazon Neptune Overview and Customer Use Cases - DAT319 - re:Inve...
NEW LAUNCH! Amazon Neptune Overview and Customer Use Cases - DAT319 - re:Inve...
Amazon Web Services
 
Protect Your Web Applications from Common Attack Vectors Using AWS WAF - SID3...
Protect Your Web Applications from Common Attack Vectors Using AWS WAF - SID3...Protect Your Web Applications from Common Attack Vectors Using AWS WAF - SID3...
Protect Your Web Applications from Common Attack Vectors Using AWS WAF - SID3...
Amazon Web Services
 
Building the Organisation of the Future: Leveraging Artificial Intelligence a...
Building the Organisation of the Future: Leveraging Artificial Intelligence a...Building the Organisation of the Future: Leveraging Artificial Intelligence a...
Building the Organisation of the Future: Leveraging Artificial Intelligence a...
Amazon Web Services
 
Lessons Learned Scaling Your Talent Transformation
Lessons Learned Scaling Your Talent TransformationLessons Learned Scaling Your Talent Transformation
Lessons Learned Scaling Your Talent Transformation
Amazon Web Services
 
Scaling up to and beyond 10M users
Scaling up to and beyond 10M usersScaling up to and beyond 10M users
Scaling up to and beyond 10M users
Amazon Web Services
 
The AWS Philosophy of Security - SID322 - re:Invent 2017
The AWS Philosophy of Security - SID322 - re:Invent 2017The AWS Philosophy of Security - SID322 - re:Invent 2017
The AWS Philosophy of Security - SID322 - re:Invent 2017
Amazon Web Services
 
NEW LAUNCH! Introducing Amazon SageMaker - MCL365 - re:Invent 2017
NEW LAUNCH! Introducing Amazon SageMaker - MCL365 - re:Invent 2017NEW LAUNCH! Introducing Amazon SageMaker - MCL365 - re:Invent 2017
NEW LAUNCH! Introducing Amazon SageMaker - MCL365 - re:Invent 2017
Amazon Web Services
 
Introducing Amazon SageMaker - AWS Online Tech Talks
Introducing Amazon SageMaker - AWS Online Tech TalksIntroducing Amazon SageMaker - AWS Online Tech Talks
Introducing Amazon SageMaker - AWS Online Tech Talks
Amazon Web Services
 
SecOps 2021 Today: Using AWS Services to Deliver SecOps - SID304 - re:Invent ...
SecOps 2021 Today: Using AWS Services to Deliver SecOps - SID304 - re:Invent ...SecOps 2021 Today: Using AWS Services to Deliver SecOps - SID304 - re:Invent ...
SecOps 2021 Today: Using AWS Services to Deliver SecOps - SID304 - re:Invent ...
Amazon Web Services
 
Building the Largest Repo for Serverless Compliance-as-Code - SID205 - re:Inv...
Building the Largest Repo for Serverless Compliance-as-Code - SID205 - re:Inv...Building the Largest Repo for Serverless Compliance-as-Code - SID205 - re:Inv...
Building the Largest Repo for Serverless Compliance-as-Code - SID205 - re:Inv...
Amazon Web Services
 
Keynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos EngineeringKeynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos Engineering
Amazon Web Services
 
Are you ready for a cloud pentest? AWS re:Inforce 2019
Are you ready for a cloud pentest? AWS re:Inforce 2019Are you ready for a cloud pentest? AWS re:Inforce 2019
Are you ready for a cloud pentest? AWS re:Inforce 2019
Teri Radichel
 
Getting Started with Containers in the Cloud: AWS Developer Workshop at Web S...
Getting Started with Containers in the Cloud: AWS Developer Workshop at Web S...Getting Started with Containers in the Cloud: AWS Developer Workshop at Web S...
Getting Started with Containers in the Cloud: AWS Developer Workshop at Web S...
Amazon Web Services
 
NEW LAUNCH! AWS Serverless Application Repository - SRV215 - re:Invent 2017
NEW LAUNCH! AWS Serverless Application Repository - SRV215 - re:Invent 2017NEW LAUNCH! AWS Serverless Application Repository - SRV215 - re:Invent 2017
NEW LAUNCH! AWS Serverless Application Repository - SRV215 - re:Invent 2017
Amazon Web Services
 
Keys to Successfully Monitoring and Optimizing Innovative and Sophisticated C...
Keys to Successfully Monitoring and Optimizing Innovative and Sophisticated C...Keys to Successfully Monitoring and Optimizing Innovative and Sophisticated C...
Keys to Successfully Monitoring and Optimizing Innovative and Sophisticated C...
Amazon Web Services
 
FSV306_Getting to Yes—Minimal Viable Cloud with Maximum Security
FSV306_Getting to Yes—Minimal Viable Cloud with Maximum SecurityFSV306_Getting to Yes—Minimal Viable Cloud with Maximum Security
FSV306_Getting to Yes—Minimal Viable Cloud with Maximum Security
Amazon Web Services
 
What’s New for Amazon DynamoDB - 2018 Q1 Update - AWS Online Tech Talks
What’s New for Amazon DynamoDB - 2018 Q1 Update - AWS Online Tech TalksWhat’s New for Amazon DynamoDB - 2018 Q1 Update - AWS Online Tech Talks
What’s New for Amazon DynamoDB - 2018 Q1 Update - AWS Online Tech Talks
Amazon Web Services
 
Building CI/CD Pipelines for Serverless Applications - SRV302 - re:Invent 2017
Building CI/CD Pipelines for Serverless Applications - SRV302 - re:Invent 2017Building CI/CD Pipelines for Serverless Applications - SRV302 - re:Invent 2017
Building CI/CD Pipelines for Serverless Applications - SRV302 - re:Invent 2017
Amazon Web Services
 
NEW LAUNCH! Graph-based Approaches for Cyber Investigative Analytics Using GP...
NEW LAUNCH! Graph-based Approaches for Cyber Investigative Analytics Using GP...NEW LAUNCH! Graph-based Approaches for Cyber Investigative Analytics Using GP...
NEW LAUNCH! Graph-based Approaches for Cyber Investigative Analytics Using GP...
Amazon Web Services
 
Introducing Amazon EKS
Introducing Amazon EKSIntroducing Amazon EKS
Introducing Amazon EKS
Amazon Web Services
 

What's hot (20)

NEW LAUNCH! Amazon Neptune Overview and Customer Use Cases - DAT319 - re:Inve...
NEW LAUNCH! Amazon Neptune Overview and Customer Use Cases - DAT319 - re:Inve...NEW LAUNCH! Amazon Neptune Overview and Customer Use Cases - DAT319 - re:Inve...
NEW LAUNCH! Amazon Neptune Overview and Customer Use Cases - DAT319 - re:Inve...
 
Protect Your Web Applications from Common Attack Vectors Using AWS WAF - SID3...
Protect Your Web Applications from Common Attack Vectors Using AWS WAF - SID3...Protect Your Web Applications from Common Attack Vectors Using AWS WAF - SID3...
Protect Your Web Applications from Common Attack Vectors Using AWS WAF - SID3...
 
Building the Organisation of the Future: Leveraging Artificial Intelligence a...
Building the Organisation of the Future: Leveraging Artificial Intelligence a...Building the Organisation of the Future: Leveraging Artificial Intelligence a...
Building the Organisation of the Future: Leveraging Artificial Intelligence a...
 
Lessons Learned Scaling Your Talent Transformation
Lessons Learned Scaling Your Talent TransformationLessons Learned Scaling Your Talent Transformation
Lessons Learned Scaling Your Talent Transformation
 
Scaling up to and beyond 10M users
Scaling up to and beyond 10M usersScaling up to and beyond 10M users
Scaling up to and beyond 10M users
 
The AWS Philosophy of Security - SID322 - re:Invent 2017
The AWS Philosophy of Security - SID322 - re:Invent 2017The AWS Philosophy of Security - SID322 - re:Invent 2017
The AWS Philosophy of Security - SID322 - re:Invent 2017
 
NEW LAUNCH! Introducing Amazon SageMaker - MCL365 - re:Invent 2017
NEW LAUNCH! Introducing Amazon SageMaker - MCL365 - re:Invent 2017NEW LAUNCH! Introducing Amazon SageMaker - MCL365 - re:Invent 2017
NEW LAUNCH! Introducing Amazon SageMaker - MCL365 - re:Invent 2017
 
Introducing Amazon SageMaker - AWS Online Tech Talks
Introducing Amazon SageMaker - AWS Online Tech TalksIntroducing Amazon SageMaker - AWS Online Tech Talks
Introducing Amazon SageMaker - AWS Online Tech Talks
 
SecOps 2021 Today: Using AWS Services to Deliver SecOps - SID304 - re:Invent ...
SecOps 2021 Today: Using AWS Services to Deliver SecOps - SID304 - re:Invent ...SecOps 2021 Today: Using AWS Services to Deliver SecOps - SID304 - re:Invent ...
SecOps 2021 Today: Using AWS Services to Deliver SecOps - SID304 - re:Invent ...
 
Building the Largest Repo for Serverless Compliance-as-Code - SID205 - re:Inv...
Building the Largest Repo for Serverless Compliance-as-Code - SID205 - re:Inv...Building the Largest Repo for Serverless Compliance-as-Code - SID205 - re:Inv...
Building the Largest Repo for Serverless Compliance-as-Code - SID205 - re:Inv...
 
Keynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos EngineeringKeynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos Engineering
 
Are you ready for a cloud pentest? AWS re:Inforce 2019
Are you ready for a cloud pentest? AWS re:Inforce 2019Are you ready for a cloud pentest? AWS re:Inforce 2019
Are you ready for a cloud pentest? AWS re:Inforce 2019
 
Getting Started with Containers in the Cloud: AWS Developer Workshop at Web S...
Getting Started with Containers in the Cloud: AWS Developer Workshop at Web S...Getting Started with Containers in the Cloud: AWS Developer Workshop at Web S...
Getting Started with Containers in the Cloud: AWS Developer Workshop at Web S...
 
NEW LAUNCH! AWS Serverless Application Repository - SRV215 - re:Invent 2017
NEW LAUNCH! AWS Serverless Application Repository - SRV215 - re:Invent 2017NEW LAUNCH! AWS Serverless Application Repository - SRV215 - re:Invent 2017
NEW LAUNCH! AWS Serverless Application Repository - SRV215 - re:Invent 2017
 
Keys to Successfully Monitoring and Optimizing Innovative and Sophisticated C...
Keys to Successfully Monitoring and Optimizing Innovative and Sophisticated C...Keys to Successfully Monitoring and Optimizing Innovative and Sophisticated C...
Keys to Successfully Monitoring and Optimizing Innovative and Sophisticated C...
 
FSV306_Getting to Yes—Minimal Viable Cloud with Maximum Security
FSV306_Getting to Yes—Minimal Viable Cloud with Maximum SecurityFSV306_Getting to Yes—Minimal Viable Cloud with Maximum Security
FSV306_Getting to Yes—Minimal Viable Cloud with Maximum Security
 
What’s New for Amazon DynamoDB - 2018 Q1 Update - AWS Online Tech Talks
What’s New for Amazon DynamoDB - 2018 Q1 Update - AWS Online Tech TalksWhat’s New for Amazon DynamoDB - 2018 Q1 Update - AWS Online Tech Talks
What’s New for Amazon DynamoDB - 2018 Q1 Update - AWS Online Tech Talks
 
Building CI/CD Pipelines for Serverless Applications - SRV302 - re:Invent 2017
Building CI/CD Pipelines for Serverless Applications - SRV302 - re:Invent 2017Building CI/CD Pipelines for Serverless Applications - SRV302 - re:Invent 2017
Building CI/CD Pipelines for Serverless Applications - SRV302 - re:Invent 2017
 
NEW LAUNCH! Graph-based Approaches for Cyber Investigative Analytics Using GP...
NEW LAUNCH! Graph-based Approaches for Cyber Investigative Analytics Using GP...NEW LAUNCH! Graph-based Approaches for Cyber Investigative Analytics Using GP...
NEW LAUNCH! Graph-based Approaches for Cyber Investigative Analytics Using GP...
 
Introducing Amazon EKS
Introducing Amazon EKSIntroducing Amazon EKS
Introducing Amazon EKS
 

Similar to Chaos Engineering: Why Breaking Things Should Be Practised.

10 Lessons from 10 Years of AWS
10 Lessons from 10 Years of AWS10 Lessons from 10 Years of AWS
10 Lessons from 10 Years of AWS
Adrian Hornsby
 
Launch Applications the Amazon Way - AWS Online Tech Talks
Launch Applications the Amazon Way - AWS Online Tech TalksLaunch Applications the Amazon Way - AWS Online Tech Talks
Launch Applications the Amazon Way - AWS Online Tech Talks
Amazon Web Services
 
Building Global Serverless Backends powered by Amazon DynamoDB Global Tables
Building Global Serverless Backends powered by Amazon DynamoDB Global TablesBuilding Global Serverless Backends powered by Amazon DynamoDB Global Tables
Building Global Serverless Backends powered by Amazon DynamoDB Global Tables
Amazon Web Services
 
Building Global Multi-Region, Active-Active Serverless Backends I AWS Dev Day...
Building Global Multi-Region, Active-Active Serverless Backends I AWS Dev Day...Building Global Multi-Region, Active-Active Serverless Backends I AWS Dev Day...
Building Global Multi-Region, Active-Active Serverless Backends I AWS Dev Day...
AWS Germany
 
Performing Chaos at Netflix Scale - DEV334 - re:Invent 2017
Performing Chaos at Netflix Scale - DEV334 - re:Invent 2017Performing Chaos at Netflix Scale - DEV334 - re:Invent 2017
Performing Chaos at Netflix Scale - DEV334 - re:Invent 2017
Amazon Web Services
 
ARC207 Monitoring Performance of Enterprise Applications on AWS: Understandin...
ARC207 Monitoring Performance of Enterprise Applications on AWS: Understandin...ARC207 Monitoring Performance of Enterprise Applications on AWS: Understandin...
ARC207 Monitoring Performance of Enterprise Applications on AWS: Understandin...
New Relic
 
Devoxx: Building AI-powered applications on AWS
Devoxx: Building AI-powered applications on AWSDevoxx: Building AI-powered applications on AWS
Devoxx: Building AI-powered applications on AWS
Adrian Hornsby
 
Journey Towards Scaling Your API to 10 Million Users
Journey Towards Scaling Your API to 10 Million UsersJourney Towards Scaling Your API to 10 Million Users
Journey Towards Scaling Your API to 10 Million Users
Adrian Hornsby
 
CON320_Monitoring, Logging and Debugging Containerized Services
CON320_Monitoring, Logging and Debugging Containerized ServicesCON320_Monitoring, Logging and Debugging Containerized Services
CON320_Monitoring, Logging and Debugging Containerized Services
Amazon Web Services
 
DevOps on AWS
DevOps on AWSDevOps on AWS
DevOps on AWS
Amazon Web Services
 
DevOps on AWS
DevOps on AWSDevOps on AWS
DevOps on AWS
Amazon Web Services
 
DEV325_Application Deployment Techniques for Amazon EC2 Workloads with AWS Co...
DEV325_Application Deployment Techniques for Amazon EC2 Workloads with AWS Co...DEV325_Application Deployment Techniques for Amazon EC2 Workloads with AWS Co...
DEV325_Application Deployment Techniques for Amazon EC2 Workloads with AWS Co...
Amazon Web Services
 
Use Amazon Rekognition to Build a Facial Recognition System
Use Amazon Rekognition to Build a Facial Recognition SystemUse Amazon Rekognition to Build a Facial Recognition System
Use Amazon Rekognition to Build a Facial Recognition System
Amazon Web Services
 
ARC207_Monitoring Performance of Enterprise Applications on AWS
ARC207_Monitoring Performance of Enterprise Applications on AWSARC207_Monitoring Performance of Enterprise Applications on AWS
ARC207_Monitoring Performance of Enterprise Applications on AWS
Amazon Web Services
 
re:Invent CON320 Tracing and Debugging for Containerized Services
re:Invent CON320 Tracing and Debugging for Containerized Servicesre:Invent CON320 Tracing and Debugging for Containerized Services
re:Invent CON320 Tracing and Debugging for Containerized Services
Calvin French-Owen
 
Working with Amazon SageMaker Algorithms for Faster Model Training
Working with Amazon SageMaker Algorithms for Faster Model TrainingWorking with Amazon SageMaker Algorithms for Faster Model Training
Working with Amazon SageMaker Algorithms for Faster Model Training
Amazon Web Services
 
DEV209 A Field Guide to Monitoring in the Cloud: From Lift and Shift to AWS L...
DEV209 A Field Guide to Monitoring in the Cloud: From Lift and Shift to AWS L...DEV209 A Field Guide to Monitoring in the Cloud: From Lift and Shift to AWS L...
DEV209 A Field Guide to Monitoring in the Cloud: From Lift and Shift to AWS L...
New Relic
 
Use Amazon Rekognition to Build a Facial Recognition System
Use Amazon Rekognition to Build a Facial Recognition SystemUse Amazon Rekognition to Build a Facial Recognition System
Use Amazon Rekognition to Build a Facial Recognition System
Amazon Web Services
 
Use Amazon Rekognition to Build a Facial Recognition System
Use Amazon Rekognition to Build a Facial Recognition SystemUse Amazon Rekognition to Build a Facial Recognition System
Use Amazon Rekognition to Build a Facial Recognition System
Amazon Web Services
 
Keith Steward - SageMaker Algorithms Infinitely Scalable Machine Learning_VK.pdf
Keith Steward - SageMaker Algorithms Infinitely Scalable Machine Learning_VK.pdfKeith Steward - SageMaker Algorithms Infinitely Scalable Machine Learning_VK.pdf
Keith Steward - SageMaker Algorithms Infinitely Scalable Machine Learning_VK.pdf
Amazon Web Services
 

Similar to Chaos Engineering: Why Breaking Things Should Be Practised. (20)

10 Lessons from 10 Years of AWS
10 Lessons from 10 Years of AWS10 Lessons from 10 Years of AWS
10 Lessons from 10 Years of AWS
 
Launch Applications the Amazon Way - AWS Online Tech Talks
Launch Applications the Amazon Way - AWS Online Tech TalksLaunch Applications the Amazon Way - AWS Online Tech Talks
Launch Applications the Amazon Way - AWS Online Tech Talks
 
Building Global Serverless Backends powered by Amazon DynamoDB Global Tables
Building Global Serverless Backends powered by Amazon DynamoDB Global TablesBuilding Global Serverless Backends powered by Amazon DynamoDB Global Tables
Building Global Serverless Backends powered by Amazon DynamoDB Global Tables
 
Building Global Multi-Region, Active-Active Serverless Backends I AWS Dev Day...
Building Global Multi-Region, Active-Active Serverless Backends I AWS Dev Day...Building Global Multi-Region, Active-Active Serverless Backends I AWS Dev Day...
Building Global Multi-Region, Active-Active Serverless Backends I AWS Dev Day...
 
Performing Chaos at Netflix Scale - DEV334 - re:Invent 2017
Performing Chaos at Netflix Scale - DEV334 - re:Invent 2017Performing Chaos at Netflix Scale - DEV334 - re:Invent 2017
Performing Chaos at Netflix Scale - DEV334 - re:Invent 2017
 
ARC207 Monitoring Performance of Enterprise Applications on AWS: Understandin...
ARC207 Monitoring Performance of Enterprise Applications on AWS: Understandin...ARC207 Monitoring Performance of Enterprise Applications on AWS: Understandin...
ARC207 Monitoring Performance of Enterprise Applications on AWS: Understandin...
 
Devoxx: Building AI-powered applications on AWS
Devoxx: Building AI-powered applications on AWSDevoxx: Building AI-powered applications on AWS
Devoxx: Building AI-powered applications on AWS
 
Journey Towards Scaling Your API to 10 Million Users
Journey Towards Scaling Your API to 10 Million UsersJourney Towards Scaling Your API to 10 Million Users
Journey Towards Scaling Your API to 10 Million Users
 
CON320_Monitoring, Logging and Debugging Containerized Services
CON320_Monitoring, Logging and Debugging Containerized ServicesCON320_Monitoring, Logging and Debugging Containerized Services
CON320_Monitoring, Logging and Debugging Containerized Services
 
DevOps on AWS
DevOps on AWSDevOps on AWS
DevOps on AWS
 
DevOps on AWS
DevOps on AWSDevOps on AWS
DevOps on AWS
 
DEV325_Application Deployment Techniques for Amazon EC2 Workloads with AWS Co...
DEV325_Application Deployment Techniques for Amazon EC2 Workloads with AWS Co...DEV325_Application Deployment Techniques for Amazon EC2 Workloads with AWS Co...
DEV325_Application Deployment Techniques for Amazon EC2 Workloads with AWS Co...
 
Use Amazon Rekognition to Build a Facial Recognition System
Use Amazon Rekognition to Build a Facial Recognition SystemUse Amazon Rekognition to Build a Facial Recognition System
Use Amazon Rekognition to Build a Facial Recognition System
 
ARC207_Monitoring Performance of Enterprise Applications on AWS
ARC207_Monitoring Performance of Enterprise Applications on AWSARC207_Monitoring Performance of Enterprise Applications on AWS
ARC207_Monitoring Performance of Enterprise Applications on AWS
 
re:Invent CON320 Tracing and Debugging for Containerized Services
re:Invent CON320 Tracing and Debugging for Containerized Servicesre:Invent CON320 Tracing and Debugging for Containerized Services
re:Invent CON320 Tracing and Debugging for Containerized Services
 
Working with Amazon SageMaker Algorithms for Faster Model Training
Working with Amazon SageMaker Algorithms for Faster Model TrainingWorking with Amazon SageMaker Algorithms for Faster Model Training
Working with Amazon SageMaker Algorithms for Faster Model Training
 
DEV209 A Field Guide to Monitoring in the Cloud: From Lift and Shift to AWS L...
DEV209 A Field Guide to Monitoring in the Cloud: From Lift and Shift to AWS L...DEV209 A Field Guide to Monitoring in the Cloud: From Lift and Shift to AWS L...
DEV209 A Field Guide to Monitoring in the Cloud: From Lift and Shift to AWS L...
 
Use Amazon Rekognition to Build a Facial Recognition System
Use Amazon Rekognition to Build a Facial Recognition SystemUse Amazon Rekognition to Build a Facial Recognition System
Use Amazon Rekognition to Build a Facial Recognition System
 
Use Amazon Rekognition to Build a Facial Recognition System
Use Amazon Rekognition to Build a Facial Recognition SystemUse Amazon Rekognition to Build a Facial Recognition System
Use Amazon Rekognition to Build a Facial Recognition System
 
Keith Steward - SageMaker Algorithms Infinitely Scalable Machine Learning_VK.pdf
Keith Steward - SageMaker Algorithms Infinitely Scalable Machine Learning_VK.pdfKeith Steward - SageMaker Algorithms Infinitely Scalable Machine Learning_VK.pdf
Keith Steward - SageMaker Algorithms Infinitely Scalable Machine Learning_VK.pdf
 

More from Adrian Hornsby

Model Serving for Deep Learning
Model Serving for Deep LearningModel Serving for Deep Learning
Model Serving for Deep Learning
Adrian Hornsby
 
AI in Finance: Moving forward!
AI in Finance: Moving forward!AI in Finance: Moving forward!
AI in Finance: Moving forward!
Adrian Hornsby
 
Moving Forward with AI
Moving Forward with AIMoving Forward with AI
Moving Forward with AI
Adrian Hornsby
 
AI: State of the Union
AI: State of the UnionAI: State of the Union
AI: State of the Union
Adrian Hornsby
 
re:Invent re:Cap - An overview of Artificial Intelligence and Machine Learnin...
re:Invent re:Cap - An overview of Artificial Intelligence and Machine Learnin...re:Invent re:Cap - An overview of Artificial Intelligence and Machine Learnin...
re:Invent re:Cap - An overview of Artificial Intelligence and Machine Learnin...
Adrian Hornsby
 
re:Invent re:Cap - Big Data & IoT at Any Scale
re:Invent re:Cap - Big Data & IoT at Any Scalere:Invent re:Cap - Big Data & IoT at Any Scale
re:Invent re:Cap - Big Data & IoT at Any Scale
Adrian Hornsby
 
Innovations and the Cloud
Innovations and the CloudInnovations and the Cloud
Innovations and the Cloud
Adrian Hornsby
 
Serverless in Action on AWS
Serverless in Action on AWSServerless in Action on AWS
Serverless in Action on AWS
Adrian Hornsby
 
Innovations and The Cloud
Innovations and The CloudInnovations and The Cloud
Innovations and The Cloud
Adrian Hornsby
 
Developing Sophisticated Serverless Applications with AI
Developing Sophisticated Serverless Applications with AIDeveloping Sophisticated Serverless Applications with AI
Developing Sophisticated Serverless Applications with AI
Adrian Hornsby
 
AWS Startup Day Bangalore: Being Well-Architected in the Cloud
AWS Startup Day Bangalore: Being Well-Architected in the CloudAWS Startup Day Bangalore: Being Well-Architected in the Cloud
AWS Startup Day Bangalore: Being Well-Architected in the Cloud
Adrian Hornsby
 
AWSome Day - Opening Keynote
AWSome Day - Opening KeynoteAWSome Day - Opening Keynote
AWSome Day - Opening Keynote
Adrian Hornsby
 
Building AI-powered Serverless Applications on AWS
Building AI-powered Serverless Applications on AWSBuilding AI-powered Serverless Applications on AWS
Building AI-powered Serverless Applications on AWS
Adrian Hornsby
 
Innovations fueled by IoT and the Cloud
Innovations fueled by IoT and the CloudInnovations fueled by IoT and the Cloud
Innovations fueled by IoT and the Cloud
Adrian Hornsby
 
AWS Batch: Simplifying batch computing in the cloud
AWS Batch: Simplifying batch computing in the cloudAWS Batch: Simplifying batch computing in the cloud
AWS Batch: Simplifying batch computing in the cloud
Adrian Hornsby
 
Being Well Architected in the Cloud (Updated)
Being Well Architected in the Cloud (Updated)Being Well Architected in the Cloud (Updated)
Being Well Architected in the Cloud (Updated)
Adrian Hornsby
 
Deep Dive on Object Storage: Amazon S3 and Amazon Glacier
Deep Dive on Object Storage: Amazon S3 and Amazon GlacierDeep Dive on Object Storage: Amazon S3 and Amazon Glacier
Deep Dive on Object Storage: Amazon S3 and Amazon Glacier
Adrian Hornsby
 
Serverless Streaming Data Processing using Amazon Kinesis Analytics
Serverless Streaming Data Processing using Amazon Kinesis AnalyticsServerless Streaming Data Processing using Amazon Kinesis Analytics
Serverless Streaming Data Processing using Amazon Kinesis Analytics
Adrian Hornsby
 
Introduction to Real-time, Streaming Data and Amazon Kinesis. Streaming Data ...
Introduction to Real-time, Streaming Data and Amazon Kinesis. Streaming Data ...Introduction to Real-time, Streaming Data and Amazon Kinesis. Streaming Data ...
Introduction to Real-time, Streaming Data and Amazon Kinesis. Streaming Data ...
Adrian Hornsby
 
Journey Towards Scaling Your Application to Million Users
Journey Towards Scaling Your Application to Million UsersJourney Towards Scaling Your Application to Million Users
Journey Towards Scaling Your Application to Million Users
Adrian Hornsby
 

More from Adrian Hornsby (20)

Model Serving for Deep Learning
Model Serving for Deep LearningModel Serving for Deep Learning
Model Serving for Deep Learning
 
AI in Finance: Moving forward!
AI in Finance: Moving forward!AI in Finance: Moving forward!
AI in Finance: Moving forward!
 
Moving Forward with AI
Moving Forward with AIMoving Forward with AI
Moving Forward with AI
 
AI: State of the Union
AI: State of the UnionAI: State of the Union
AI: State of the Union
 
re:Invent re:Cap - An overview of Artificial Intelligence and Machine Learnin...
re:Invent re:Cap - An overview of Artificial Intelligence and Machine Learnin...re:Invent re:Cap - An overview of Artificial Intelligence and Machine Learnin...
re:Invent re:Cap - An overview of Artificial Intelligence and Machine Learnin...
 
re:Invent re:Cap - Big Data & IoT at Any Scale
re:Invent re:Cap - Big Data & IoT at Any Scalere:Invent re:Cap - Big Data & IoT at Any Scale
re:Invent re:Cap - Big Data & IoT at Any Scale
 
Innovations and the Cloud
Innovations and the CloudInnovations and the Cloud
Innovations and the Cloud
 
Serverless in Action on AWS
Serverless in Action on AWSServerless in Action on AWS
Serverless in Action on AWS
 
Innovations and The Cloud
Innovations and The CloudInnovations and The Cloud
Innovations and The Cloud
 
Developing Sophisticated Serverless Applications with AI
Developing Sophisticated Serverless Applications with AIDeveloping Sophisticated Serverless Applications with AI
Developing Sophisticated Serverless Applications with AI
 
AWS Startup Day Bangalore: Being Well-Architected in the Cloud
AWS Startup Day Bangalore: Being Well-Architected in the CloudAWS Startup Day Bangalore: Being Well-Architected in the Cloud
AWS Startup Day Bangalore: Being Well-Architected in the Cloud
 
AWSome Day - Opening Keynote
AWSome Day - Opening KeynoteAWSome Day - Opening Keynote
AWSome Day - Opening Keynote
 
Building AI-powered Serverless Applications on AWS
Building AI-powered Serverless Applications on AWSBuilding AI-powered Serverless Applications on AWS
Building AI-powered Serverless Applications on AWS
 
Innovations fueled by IoT and the Cloud
Innovations fueled by IoT and the CloudInnovations fueled by IoT and the Cloud
Innovations fueled by IoT and the Cloud
 
AWS Batch: Simplifying batch computing in the cloud
AWS Batch: Simplifying batch computing in the cloudAWS Batch: Simplifying batch computing in the cloud
AWS Batch: Simplifying batch computing in the cloud
 
Being Well Architected in the Cloud (Updated)
Being Well Architected in the Cloud (Updated)Being Well Architected in the Cloud (Updated)
Being Well Architected in the Cloud (Updated)
 
Deep Dive on Object Storage: Amazon S3 and Amazon Glacier
Deep Dive on Object Storage: Amazon S3 and Amazon GlacierDeep Dive on Object Storage: Amazon S3 and Amazon Glacier
Deep Dive on Object Storage: Amazon S3 and Amazon Glacier
 
Serverless Streaming Data Processing using Amazon Kinesis Analytics
Serverless Streaming Data Processing using Amazon Kinesis AnalyticsServerless Streaming Data Processing using Amazon Kinesis Analytics
Serverless Streaming Data Processing using Amazon Kinesis Analytics
 
Introduction to Real-time, Streaming Data and Amazon Kinesis. Streaming Data ...
Introduction to Real-time, Streaming Data and Amazon Kinesis. Streaming Data ...Introduction to Real-time, Streaming Data and Amazon Kinesis. Streaming Data ...
Introduction to Real-time, Streaming Data and Amazon Kinesis. Streaming Data ...
 
Journey Towards Scaling Your Application to Million Users
Journey Towards Scaling Your Application to Million UsersJourney Towards Scaling Your Application to Million Users
Journey Towards Scaling Your Application to Million Users
 

Recently uploaded

Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
c5vrf27qcz
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
Jason Yip
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Pitangent Analytics & Technology Solutions Pvt. Ltd
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
Principle of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptxPrinciple of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptx
BibashShahi
 
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
Edge AI and Vision Alliance
 

Recently uploaded (20)

Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
Principle of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptxPrinciple of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptx
 
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
 

Chaos Engineering: Why Breaking Things Should Be Practised.

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Adrian Hornsby, Cloud Architecture Evangelist @ AWS @adhorn Chaos Engineering: Why Breaking Things Should Be Practiced.
  • 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Been there?
  • 3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. An Area of Complex & Dynamic Systems
  • 4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Failures are a given and everything will eventually fail over time. Werner Vogels CTO – Amazon.com “ “
  • 5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. System failure rate Early Failures Wear Out Failures Observed Failures Random Failures
  • 6. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. System failure rate For high-velocity deployments Early Failures Wear Out Failures Observed Failures Random Failures
  • 7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. … at the Edge
  • 8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Building Confidence Through Testing Unit testing of components: • Tested in isolation to ensure function meets expectations. Functional testing of integrations: • Each execution path tested to assure expected results. Is it enough???
  • 9. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jesse Robbins – mid 2000’s GameDay: Creating Resiliency Through Destruction https://www.youtube.com/watch?v=zoz0ZjfrQ9s
  • 10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. https://www.youtube.com/watch?v=zoz0ZjfrQ9s Jesse Robbins – mid 2000’s GameDay: Creating Resiliency Through Destruction
  • 11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Netflix 2013 https://medium.com/netflix-techblog/active-active-for-multi-regional-resiliency-c47719f6685b
  • 12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Chaos Monkeys
  • 13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. https://github.com/Netflix/SimianArmy • Chaos Monkey - Kill instances randomly • Latency Monkey - Induce latency in services • Chaos Gorilla - Simulates AZ and regions failure • Conformity Monkey - Make sure instances follow good practices
  • 14. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 15. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. http://principlesofchaos.org
  • 16. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. https://bit.ly/2uKOJMQ
  • 17. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What “really” is Chaos Engineering?
  • 18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”http://principlesofchaos.org
  • 19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.
  • 20. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Failure Injection • Start small & build confidence • Application level • Host failure • Resource attacks (CPU, memory, …) • Network attacks (dependencies, latency, …) • Region attacks!
  • 21. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 22. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. “CHAOS DOESN’T CAUSE PROBLEMS. IT REVEALS THEM.” Nora Jones Senior Chaos Engineer, Netflix
  • 23. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Before breaking things …
  • 24. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Really!! Don’t break things before you have done your home work!
  • 25. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. People Application Network & Data Infrastructure
  • 26. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Infrastructure
  • 27. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Availability Availability Downtime per year Categories 95% (1-nine) 18 days 6 hours Batch processing, Data extraction, Load jobs. 99% (2-nines) 3 days 15 hours Internal Tools, Project Tracking 99.9% (3-nines) 8 hours 45 minutes Online Commerce 99.99% (4-nines) 52 minutes Video Delivery, Broadcast systems 99.999% (5-nines) 5 minutes Telecom Industry (ATM Transactions) 99.9999% (6-nines) 31 seconds Answering to my loved one* * Joke  http://royal.pingdom.com/wp-content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf
  • 28. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. System Availability Availability = Normal Operation Time Total Time MTBF** MTBF** + MTTR* = * Mean Time To Repair (MTTR) **Mean Time Between Failure (MTBF)
  • 29. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Availability in Series Component Availability Downtime X 99% (2-nines) 3 days 15 hours Y 99.99% (4-nines) 52 minutes X and Y Combined 98.99% 3 days 16 hours 33 minutes
  • 30. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Availability in Parallel Component Availability Downtime X 99% (2-nines) 3 days 15 hours Two X in parallel 99.99% (4-nines) 52 minutes Three X in parallel 99.9999% (6-nines) 31 seconds
  • 31. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Availability Zone 1 Availability Zone 2 Availability Zone n Multi-AZ Support Instance Failure Application
  • 32. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Auto-Scaling • Compute efficiency • Node failure • Traffic spikes • Performance bugs
  • 33. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Immutable Infrastructure • No updates on live systems • Always start from a new instance being provisioned • Deploy the new software • Test in different environments (dev, staging) • Deploy to prod (inactive) • Change references (DNS or Load Balancer) • Keep old version around (inactive) • Fast rollback if things go wrong
  • 34. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Immutable components are replaced for every deployment, rather than being updated in- place. Immutable Infrastructure
  • 35. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Infrastructure as Code • Template of the infrastructure in code. • Version controlled infrastructure. • Repeatable template. • Testable infrastructure. • Automate it!
  • 36. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Network & Data
  • 37. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Read / Write Sharding RDS DB Instance Read Replica App Instance App Instance App Instance RDS DB Instance Master (Multi-AZ) RDS DB Instance Read Replica RDS DB Instance Read Replica
  • 38. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Database Federation Users DB Products DB App Instance App Instance App Instance
  • 39. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Database Sharding User ShardID 002345 A 002346 B 002347 C 002348 B 002349 A CBA App Instance App Instance App Instance
  • 40. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Message passing for async. patterns A Queue B A Queue BListener Pub-Sub
  • 41. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Web Instances Worker Instance Worker Instance Queue API Instance API Instance API Instance API: {DO foo} PUT JOB: {JobID: 0001, Task: DO foo} API: {JobID: 0001} GET JOB: {JobID: 0001, Task: DO foo} Cache Result: { JobID: 0001, Result: bar }
  • 42. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Worker Instance Worker Instance Queue API Instance API Instance API Instance Cache Amazon SNS Push Notification User
  • 43. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Exponential Backoff
  • 44. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Circuit Breaker • Wrap a protected function call in a circuit breaker object, which monitors for failures. • If failures reach a certain threshold, the circuit breaker trips. https://martinfowler.com/bliki/CircuitBreaker.html
  • 45. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dynamic Routing with Route53 1. Latency Based Routing 2. Geo DNS 3. Weighted Round Robin 4. DNS Failover Amazon Route53 Resource A In US Resource B in EU User in US
  • 46. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dynamic Routing 1. Improve Latency for end-users 2. Disaster Recovery Applications in US West Applications in US East Users from San Francisco Users from New York Service 1 Service 2 Service 3 Service 4 Service 1 Service 2 Service 3 Service 4
  • 47. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Application
  • 48. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Stateless Services AZ1 AZ2 AWS Region Data Store Cache Auto-ScalingGroup User
  • 49. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Transient state does not belong in the database.
  • 50. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. CAP Theorem Consistency Availability Partition Tolerance Data is consistent. All nodes see the same state. Every request is non-failing. Service still responds as expected if some nodes crash. Distributed System In the presence of a network partition, you must choose between consistency and availability!
  • 51. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Eventual Consistency … if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. Availability An eventually consistent system can return any value before it converges!! https://en.wikipedia.org/wiki/Eventual_consistency Distributed System Every request is non-failing.
  • 52. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Process A Process B Process A Process B Synchronous Asynchronous Waiting Working Continues get or fetch resultGet result
  • 53. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Non-blocking UI https://medium.com/@sophie_paxtonUX/stop-getting-in-my-way-non-blocking-ux-5cbbfe0f0158
  • 54. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Exception Handling
  • 55. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Service Degradation & Fallbacks
  • 56. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. People
  • 57. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. “It is not failure itself that holds you back; it is the fear of failure that paralyses you.” Brian Tracy
  • 58. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Conway’s Law User UI Team Application Team DBA Team ”Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.” http://www.melconway.com/Home/Conways_Law.html Siloed Teams Siloed Applications
  • 59. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Conway’s Law ”Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.” http://www.melconway.com/Home/Conways_Law.html Services Cross-Functional Teams
  • 60. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Fire Drills
  • 61. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Phases of Chaos Engineering
  • 62. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Steady State Hypothesis Design Experiment Verify & Learn Fix GameDay
  • 63. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Steady State
  • 64. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is Steady State? • ”normal” behavior of your system https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
  • 65. What is Steady State? • ”normal” behavior of your system • Business Metric https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
  • 66. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Business Metrics at work Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden). Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer). Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).
  • 67. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Steady State Important: • Know the value range of Healthy State!
  • 68. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Hypothesis
  • 69. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What if…? “What if this load balancer breaks?” “What if Redis becomes slow?” “What if a host on Cassandra goes away?” ”What if latency increases by 300ms?” ”What if the database stops?” Make it everyone’s problem!
  • 70. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Disclaimer! Don’t make an hypothesis that you know will break you!
  • 71. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Netflix team members Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri, suggest the following inputs for Chaos experiments: • Simulating the failure of an entire region or datacenter. • Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production. • Injecting latency between services for a select percentage of traffic over a predetermined period of time. • Function-based chaos (runtime injection): Randomly causing functions to throw exceptions. • Code insertion: Adding instructions to the target program and allowing fault injection to occur prior to certain instructions. • Time travel: Forcing system clocks out of sync with each other. • Executing a routine in driver code emulating I/O errors. • Maxing out CPU cores on an Elasticsearch cluster.
  • 72. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Designing Experiment
  • 73. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Pick hypothesis • Scope the experiment • Identify metrics • Notify the organization
  • 74. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Rules of thumbs • Start with very small • As close as possible to production • Minimize the blast radius. • Have an emergency STOP!
  • 75. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Run the Experiment
  • 76. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. New Version Users Canary deployment Old Version 99% Users 1% Users Start with .. Dynamic Routing (Route53)
  • 77. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Verify & Learn
  • 78. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. DON’T blame that one person …
  • 79. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Quantifying the result of the experiment • Time to detect? • Time for notification? And escalation? • Time to public notification? • Time for graceful degradation to kick-in? • Time for self healing to happen? • Time to recovery – partial and full? • Time to all-clear and stable?
  • 80. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. PostMortems The 5 WHYs Outage Because of … Because of … Because of … Because of …
  • 81. The Conveyor Belt Accident Question: Why did the associate damage his thumb? Answer: Because his thumb got caught in the conveyor. Question: Why did his thumb get caught in the conveyor? Answer: Because he was chasing his bag, which was on a running conveyor belt. Question: Why did he chase his bag? Answer: Because he placed his bag on the conveyor, but it then turned-on by surprise Question: Why was his bag on the conveyor? Answer: Because he used the conveyor as a table Conclusion: So, the likely root cause of the associate’s damaged thumb is that he simply needed a table, there wasn’t one around, so he used a conveyor as a table. https://www.linkedin.com/pulse/use-5-whys-find-root-causes-peter-abilla/
  • 82. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. No, seriously. Root Cause is a Fallacy. http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy/
  • 83. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. No, seriously. Root Cause is a Fallacy. • Can you clarify if there were any preceding events? • Why would they believe acting in this way was the best course of action to deliver the desired outcome? • Is there another failure mode that could present here? • What decisions or events prior to this made this work before? • Why stop there – are there places to dig deeper that could shine a light more on this? • Did others step in to help, to advise, or to intercede? http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy/
  • 84. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Fix
  • 85. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Big Challenges to Chaos Engineering Mostly Cultural • no time or flexibility to simulate disasters. • teams already spending all of its time fixing things. • can be very political. • might force deep conversations. • deeply invested in a specific technical roadmap (micro- services) that chaos engineering tests show is not as resilient to failures as originally predicted.
  • 86. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Changing Culture takes time! Be patient…
  • 87. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. More Resources • https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf • https://www.gremlin.com • https://queue.acm.org/detail.cfm?id=2353017 • https://softwareengineeringdaily.com/ • https://github.com/dastergon/awesome-sre • https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf • https://medium.com/@NetflixTechBlog • http://principlesofchaos.org • https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp • https://github.com/adhorn/awesome-chaos-engineering • https://www.infoq.com/presentations/netflix-chaos-microservices
  • 88. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Twilio Use-Case Discovering Issues with HTTP/2 via Chaos Testing https://www.twilio.com/blog/2017/10/http2-issues.html ”While HTTP/2 provides for a number of improvements over HTTP/1.x, via Chaos Testing we discovered that there are situations where HTTP/2 will perform worse than HTTP/1.”
  • 89. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Thanks you! @adhorn https://medium.com/@adhorn

Editor's Notes

  1. Hands up - how many of you can relate to this story? Great – so this session is dedicated to you 
  2. Short intro on the move from monolith to micro-services.
  3. With the rise of microservices and distributed cloud architectures, the web has grown increasingly complex. As a result, “random” failures have grown difficult to predict. At the same time, our dependence on these systems has only increased.
  4. Traditionally, these sensible measures to gain confidence are taken before systems or applications reach production. Once in production, the traditional approach is to rely on monitoring and logging to confirm that everything is working correctly. If it is behaving as expected, then you don't have a problem. If it is not, and it requires human intervention (troubleshooting, triage, resolution, etc.), then you need to react to the incident and get things working again as fast as possible. This implies that once a system is in production, "Don't touch it!"—except, of course, when it's broken, in which case touch it all you want, under the time pressure inherent in an outage response. https://queue.acm.org/detail.cfm?id=2353017
  5. GameDays were coined by Jesse Robbins when he worked at Amazon and was responsible for availability. Jesse created GameDays with the goal of increasing reliability by purposefully creating major failures on a regular basis. 
  6. Super power with Docker (Dockerfiles) instead of Chef or Puppet.
  7. Invest time to save time
  8. Write and updates Counters!!!! Not on the DB – redis!!
  9. Database Federation is where we break up the database by function. In our example, we have broken out the Forums DB from the User DB from the Products DB Of course, cross functional queries are harder to do and you may need to do your joins at the application layer for these types of queries This will reduce our database footprint for a while and the great thing is, this does prevent you from having to shard until much further down the line. This isn’t going to help for single large tables; for this we will need to shard.
  10. Sharding is where we break up that single large database into multiple DBs. We might need to do this because of database or table size or potentially for high write IOPs as well. Here is an example of us breaking up a database with a large table into 3 databases. Above we show where each userID is located, but the easiest way to describe how this would work would be to use the example of all users with A-H go into one DB, and I – M go in another, and N – Z go into the third DB. Typically this is done by key space and your application has to be aware of where to read from, update and write to for a particular record. ORM support can help here. This does create operation complexity so if you can federate first, do that. This can be done with SQL or NoSQL, and DynamoDB does this for you under the covers on the backend as your data size increases and the reads / writes per second scale.
  11. Route your website visitors to an alternate location to avoid site outages
  12. Does a region Fail? Full region: no Individual services can fail region-wide Most of the time, configuration issue Leading to cascading failures.
  13. Eventual consistency, also called optimistic replication,[2] is widely deployed in distributed systems, and has origins in early mobile computing projects.[3] A system that has achieved eventual consistency is often said to have converged, or achieved replica convergence.[4] Eventual consistency is a weak guarantee – most stronger models, like linearizability are trivially eventually consistent, but a system that is merely eventually consistent does not usually fulfill these stronger constraints.
  14. Eventual consistency
  15. The stronger the relashionship between the metric and the business outcome you care about, the stronger the signal you have for making actionable decisions.
  16. The stronger the relashionship between the metric and the business outcome you care about, the stronger the signal you have for making actionable decisions.
  17. partial release to a subset of production nodes with sticky sessions turned on. That way you can control and minimize the number of users/customers that get impacted if you end up releasing a bad bug.
  18. Go fix it! After running your first experiment, hopefully, there is one of two outcomes. You’ve verified either that your system is resilient to the failure you introduced, or you’ve found a problem you need to fix. Both of these are good outcomes. On one hand, you’ve increased your confidence in the system and its behavior, on the other you’ve found a problem before it caused an outage.