© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Adrian Hornsby
Cloud Architecture Evangelist – Amazon Web Services
Chaos Engineering:
Why Breaking Things Should Be Practiced.
@adhorn
Been there?
“Failures are a given and
everything will eventually
fail over time.”
Werner Vogels
CTO – Amazon.com
… at the Edge
Building Confidence Through Testing
Unit testing of components:
• Tested in isolation to ensure function meets expectations.
Functional testing of integrations:
• Each execution path tested to assure expected results.
Is it enough???
Jesse Robbins
GameDay: Creating Resiliency Through Destruction
https://www.youtube.com/watch?v=zoz0ZjfrQ9s
Netflix 2013
https://medium.com/netflix-techblog
Chaos Monkeys
https://github.com/Netflix/SimianArmy
https://bit.ly/2uKOJMQ
Twilio Use-Case
Discovering Issues with HTTP/2 via Chaos Testing
https://www.twilio.com/blog/2017/10/http2-issues.html
“While HTTP/2 provides for a
number of improvements over
HTTP/1.x, via Chaos
Testing we discovered that
there are situations where
HTTP/2 will perform worse than
HTTP/1.”
What “really” is Chaos Engineering?
“Chaos Engineering is the discipline of
experimenting on a distributed system
in order to build confidence in the system’s
capability to withstand turbulent conditions in
production.”
http://principlesofchaos.org
Break your systems on purpose.
Find out their weaknesses and fix
them before they break when
least expected.
Failure Injection
• Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attacks!
“CHAOS DOESN’T CAUSE PROBLEMS.
IT REVEALS THEM.”
Nora Jones
Senior Chaos Engineer, Netflix
Before breaking things …
People
Application
Network & Data
Infrastructure
Build Resilient Architectures
Infrastructure
Availability in Parallel
Component             Availability          Downtime (per year)
X                     99% (2-nines)         3 days 15 hours
Two X in parallel     99.99% (4-nines)      52 minutes
Three X in parallel   99.9999% (6-nines)    31 seconds
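The figures in this table can be reproduced with a short calculation. The sketch below assumes that component failures are independent and that a year has 365 days; the function names are illustrative and not part of the original slides.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components in parallel.

    Assumes failures are independent: the system is down only when
    every copy is down at the same time.
    """
    return 1.0 - (1.0 - component_availability) ** copies


def downtime_seconds_per_year(availability: float) -> float:
    """Expected downtime over a 365-day year, in seconds."""
    return (1.0 - availability) * 365 * 24 * 60 * 60


for copies in (1, 2, 3):
    a = parallel_availability(0.99, copies)
    print(f"{copies} x in parallel: {a:.4%} available, "
          f"{downtime_seconds_per_year(a):,.1f} s of downtime per year")
```

Real components rarely fail fully independently (shared configuration, shared dependencies), so these numbers are best read as an upper bound on what redundancy alone can deliver.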
Availability Zone 1 Availability Zone 2 Availability Zone n
Multi-AZ
Support Instance Failure
Application
Auto-Scaling
• Compute efficiency
• Node failure
• Traffic spikes
• Performance bugs
Infrastructure as Code
• Template of the infrastructure in code.
• Version controlled infrastructure.
• Repeatable template.
• Testable infrastructure.
• Automate it!
Immutable Infrastructure
• No updates on live systems
• Always start from a new resource being provisioned
• Deploy the new software
• Test in different environments (dev, staging)
• Deploy to prod (inactive)
• Change references (DNS or Load Balancer)
• Keep old version around (inactive)
• Fast rollback if things go wrong
Build Resilient Architectures
Network & Data
Read / Write Sharding
[Diagram: three App Instances; one RDS DB Instance, Master (Multi-AZ); three RDS DB Instance Read Replicas]
Database Federation
[Diagram: three App Instances; a Users DB and a Products DB]
Database Sharding
User      ShardID
002345    A
002346    B
002347    C
002348    B
002349    A
[Diagram: three App Instances; shards A, B, C]
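The User-to-ShardID table above works as a lookup directory: the application resolves a user to a shard before it reads or writes. A minimal sketch of that lookup follows. The directory contents come from the slide, while the host names and function names are hypothetical placeholders.

```python
# Directory from the slide: which shard holds which user.
SHARD_DIRECTORY = {
    "002345": "A",
    "002346": "B",
    "002347": "C",
    "002348": "B",
    "002349": "A",
}

# Hypothetical database endpoints, one per shard (assumption for illustration).
SHARD_ENDPOINTS = {
    "A": "shard-a.db.example.internal",
    "B": "shard-b.db.example.internal",
    "C": "shard-c.db.example.internal",
}


def shard_for_user(user_id: str) -> str:
    """Return the shard ID for a user, or raise if the user is unknown."""
    try:
        return SHARD_DIRECTORY[user_id]
    except KeyError:
        raise LookupError(f"no shard assigned for user {user_id}") from None


def endpoint_for_user(user_id: str) -> str:
    """Resolve the endpoint the application should read from or write to."""
    return SHARD_ENDPOINTS[shard_for_user(user_id)]


print(endpoint_for_user("002346"))
```

A directory like this keeps placement flexible at the cost of one extra lookup; partitioning by key range or by hash avoids the directory but makes rebalancing harder.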
Message passing for async. patterns
[Diagram: A, Queue, B]
[Diagram (Pub-Sub): A, Queue, B, Listener]
[Diagram: Web Instances; three API Instances; Queue; two Worker Instances; Cache]
API: {DO foo}
PUT JOB: {JobID: 0001, Task: DO foo}
API: {JobID: 0001}
GET JOB: {JobID: 0001, Task: DO foo}
Result: {JobID: 0001, Result: bar}
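The message flow on this slide can be approximated in a single process. The sketch below is an assumption-laden stand-in, not the deck's implementation: an in-memory `queue.Queue` plays the queue, a dictionary plays the cache, and a thread plays a worker instance. All function names are illustrative.

```python
import queue
import threading

job_queue = queue.Queue()        # stands in for the message queue
result_cache = {}                # stands in for the cache
cache_lock = threading.Lock()


def submit_job(job_id, task):
    """API side: enqueue the job and return at once with the job ID."""
    job_queue.put({"JobID": job_id, "Task": task})
    return {"JobID": job_id}


def worker():
    """Worker side: take jobs off the queue and store results in the cache."""
    while True:
        job = job_queue.get()
        if job is None:          # sentinel: stop the worker
            break
        # Placeholder task handling, mirroring the slide's "DO foo" -> "bar".
        result = "bar" if job["Task"] == "DO foo" else "unknown task"
        with cache_lock:
            result_cache[job["JobID"]] = {"JobID": job["JobID"], "Result": result}


def get_result(job_id):
    """API side: return the cached result, or None if it is not ready yet."""
    with cache_lock:
        return result_cache.get(job_id)


worker_thread = threading.Thread(target=worker)
worker_thread.start()
print(submit_job("0001", "DO foo"))   # the caller is not blocked by the work
job_queue.put(None)                   # ask the worker to stop after the job
worker_thread.join()
print(get_result("0001"))
```

The point of the pattern is that the API answers with a job ID immediately, so a slow or failed worker does not hold the caller's connection open.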
[Diagram: three API Instances; Queue; two Worker Instances; Cache; Amazon SNS; Push Notification; User]
Exponential Backoff
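The slide names the technique without showing it, so here is one common form as a hedged sketch: retry a failing call, doubling the maximum wait after each attempt and picking a random delay below that ceiling ("full jitter"). The function name, parameters, and default values are assumptions chosen for illustration.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    The wait before retry number n is drawn uniformly from
    [0, min(max_delay, base_delay * 2**n)], so the ceiling doubles
    on every failed attempt.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: surface the error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists only so the delay can be replaced in tests. In real code the `except` clause would usually be narrowed to errors known to be transient, such as timeouts or throttling responses.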
Circuit Breaker
• Wrap a protected function call in a
circuit breaker object, which
monitors for failures.
• If failures reach a certain threshold,
the circuit breaker trips.
https://martinfowler.com/bliki/CircuitBreaker.html
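The two bullets above can be sketched as a small wrapper class. This is a minimal illustration, not a production library: the class name, thresholds, and the reset timeout are assumptions. The "allow a trial call after a timeout" behaviour goes beyond the slide's bullets and follows the half-open state described in the linked article.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None     # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        """Run `func` through the breaker, failing fast while it is open."""
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; call rejected")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()   # trip (or re-trip)
            raise
        # Success: close the circuit and forget earlier failures.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would create one breaker per protected dependency, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a hypothetical remote call, and treat `CircuitOpenError` as the cue to serve a fallback.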
Dynamic Routing with Route53
1. Latency Based Routing
2. Geo DNS
3. Weighted Round Robin
4. DNS Failover
[Diagram: Amazon Route53; Resource A in US; Resource B in EU; User in US]
Build Resilient Architectures
Application
Stateless Services
[Diagram: User; AWS Region with AZ1 and AZ2; Auto Scaling Group; Cache; Data Store]
Transient state does not
belong in the database.
CAP Theorem (Distributed System)
Consistency: Data is consistent. All nodes see the same state.
Availability: Every request is non-failing.
Partition Tolerance: Service still responds as expected if some nodes crash.
In the presence of a network partition, you must
choose between consistency and availability!
Eventual Consistency (Distributed System)
… if no new updates are
made to a given data item,
eventually all accesses to that
item will return the last
updated value.
https://en.wikipedia.org/wiki/Eventual_consistency
Availability: Every request is non-failing.
An eventually consistent system can
return any value before it converges!!
[Diagram: Synchronous vs. Asynchronous calls between Process A and Process B]
Synchronous: A is waiting while B is working, then gets the result.
Asynchronous: A continues while B is working, then gets or fetches the result.
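The difference between the two call styles can be shown with the standard library. In this sketch a thread pool stands in for "Process B"; the function names and the short sleep are assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def process_b(value):
    """Stand-in for the work done by Process B."""
    time.sleep(0.05)
    return value * 2


def synchronous(value):
    """Process A calls B and waits until the result comes back."""
    return process_b(value)


def asynchronous(value, executor):
    """Process A hands the work to B, continues, and fetches the result later."""
    future = executor.submit(process_b, value)
    other_work = "A kept working"      # A is free to do something else here
    return other_work, future.result()


with ThreadPoolExecutor(max_workers=1) as pool:
    print(synchronous(21))
    print(asynchronous(21, pool))
```

In the asynchronous version a slow Process B only delays the moment the result is fetched, not everything Process A does in between.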
Non-blocking UI
https://medium.com/@sophie_paxtonUX/stop-getting-in-my-way-non-blocking-ux-5cbbfe0f0158
Exception Handling
Service Degradation & Fallbacks
Build Resilient Architectures
People
“It is not failure itself that holds you back; it
is the fear of failure that paralyses you.”
Brian Tracy
Fire Drills
Phases of Chaos Engineering
Steady State
Hypothesis
Design Experiment
Verify & Learn
Fix
What is Steady State?
• “normal” behavior of your system
https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
• Business Metric
https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
Business Metrics at work
Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden).
Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer).
Yahoo!: 400 ms of extra load time caused a 5–9% increase in the
number of people who clicked “back” before the page even loaded (Nicole
Sullivan).
Steady State
Important:
• Know the value range of Healthy State!
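Knowing the healthy value range makes the steady state something a program can check during an experiment. The sketch below is one possible shape for such a check; the function name, the sample values, and the tolerance parameter are assumptions, not something the slides prescribe.

```python
def within_steady_state(samples, low, high, tolerance=0.0):
    """Return True if the metric stayed inside the healthy range.

    `samples` are observed values of the chosen metric, `low` and `high`
    bound the healthy range, and `tolerance` is the fraction of samples
    allowed to fall outside it.
    """
    if not samples:
        raise ValueError("no samples to evaluate")
    outside = sum(1 for s in samples if not low <= s <= high)
    return outside / len(samples) <= tolerance


# Hypothetical metric readings taken while an experiment is running.
readings = [102, 98, 105, 97, 101]
print(within_steady_state(readings, low=90, high=110))
```

The same check can serve as the emergency stop condition: if it returns False, the experiment is aborted.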
Hypothesis: What if…?
“What if this load balancer breaks?”
“What if Redis becomes slow?”
“What if a host on Cassandra goes away?”
“What if latency increases by 300ms?”
“What if the database stops?”
“What if Paul does not show up today?”
Make it everyone’s problem!
Disclaimer!
Don’t make a hypothesis that you
know will break you!
Designing the Experiment
• Pick a hypothesis
• Scope the experiment
• Identify metrics
• Notify the organization
Rules of thumb
• Start very small
• As close as possible to production
• Minimize the blast radius.
• Have an emergency STOP!
Run the Experiment: Canary deployment
Start with ..
Dynamic Routing (Route53)
99% Users → Old Version
1% Users → New Version
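The 99% / 1% split can be simulated locally to get a feel for how small the canary's blast radius is. This sketch only imitates weighted selection in plain Python; it does not call Route53, and the function name and weights dictionary are assumptions made for illustration.

```python
import random
from collections import Counter


def choose_version(weights, rng=random):
    """Pick a version at random, in proportion to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


# Weights mirroring the slide: 99% of users on the old version, 1% on the new.
canary_weights = {"old": 99, "new": 1}

rng = random.Random(2018)   # fixed seed so the simulation is repeatable
counts = Counter(choose_version(canary_weights, rng) for _ in range(10_000))
print(counts)
```

Raising the weight of the new version step by step, while the steady-state metric stays in its healthy range, is one way to widen the experiment gradually.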
Verify & Learn: Quantifying the result of the experiment
• Time to detect?
• Time for notification? And escalation?
• Time to public notification?
• Time for graceful degradation to kick in?
• Time for self-healing to happen?
• Time to recovery – partial and full?
• Time to all-clear and stable?
DON’T blame that one person …
PostMortems
The 5 WHYs
Outage
Because of
…
Because of
…
Because of
…
Because of
…
More questions to ask.
• Can you clarify if there were any preceding events?
• Why would they believe acting in this way was the best course of action to
deliver the desired outcome?
• Is there another failure mode that could present here?
• What decisions or events prior to this made this work before?
• Why stop there – are there places to dig deeper that could shine a light more on
this?
• Did others step in to help, to advise, or to intercede?
Fixing the issues!
Big Challenges to Chaos Engineering
Mostly Cultural
• no time or flexibility to simulate disasters.
• teams already spending all of their time fixing things.
• can be very political.
• might force deep conversations.
• deeply invested in a specific technical roadmap (micro-services)
that chaos engineering tests show is not as resilient to failures as
originally predicted.
Changing Culture takes time!
Be patient…
More Resources
• https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
• https://www.gremlin.com
• https://queue.acm.org/detail.cfm?id=2353017
• https://softwareengineeringdaily.com/
• https://github.com/dastergon/awesome-sre
• https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
• https://medium.com/@NetflixTechBlog
• http://principlesofchaos.org
• https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp
• https://github.com/adhorn/awesome-chaos-engineering
• https://www.infoq.com/presentations/netflix-chaos-microservices
• http://royal.pingdom.com/wp-content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf
• http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy
Thank you!
@adhorn
https://medium.com/@adhorn


Editor's Notes

  • #3 Hands up - how many of you can relate to this story? Great – so this session is dedicated to you 
  • #4 We live in an era of complex and dynamic systems. With the rise of microservices and distributed cloud architectures, the web has grown increasingly complex. As a result, “random” failures have grown difficult to predict. At the same time, our dependence on these systems has only increased.
  • #6 Traditionally, these sensible measures to gain confidence are taken before systems or applications reach production. Once in production, the traditional approach is to rely on monitoring and logging to confirm that everything is working correctly. If it is behaving as expected, then you don't have a problem. If it is not, and it requires human intervention (troubleshooting, triage, resolution, etc.), then you need to react to the incident and get things working again as fast as possible. This implies that once a system is in production, "Don't touch it!"—except, of course, when it's broken, in which case touch it all you want, under the time pressure inherent in an outage response. https://queue.acm.org/detail.cfm?id=2353017
  • #7 The term GameDay was coined by Jesse Robbins when he worked at Amazon and was responsible for availability. Jesse created GameDays with the goal of increasing reliability by purposefully creating major failures on a regular basis.
  • #25 Invest time to save time
  • #26 Super power with Docker (Dockerfiles) instead of Chef or Puppet.
  • #28 Writes and updates of counters: not on the DB, use Redis!
  • #29 Database Federation is where we break up the database by function. In our example, we have broken out the Forums DB from the User DB from the Products DB. Of course, cross-functional queries are harder to do, and you may need to do your joins at the application layer for these types of queries. This will reduce our database footprint for a while, and the great thing is that it saves you from having to shard until much further down the line. This isn’t going to help for single large tables; for this we will need to shard.
  • #30 Sharding is where we break up that single large database into multiple DBs. We might need to do this because of database or table size, or potentially for high write IOPS as well. Here is an example of us breaking up a database with a large table into 3 databases. Above we show where each userID is located, but the easiest way to describe how this would work is the example of all users with A-H going into one DB, I-M into another, and N-Z into the third. Typically this is done by key space, and your application has to be aware of where to read from, update and write to for a particular record. ORM support can help here. This does create operational complexity, so if you can federate first, do that. This can be done with SQL or NoSQL, and DynamoDB does this for you under the covers on the backend as your data size increases and the reads / writes per second scale.
  • #36 Route your website visitors to an alternate location to avoid site outages
  • #37 Does a region fail? Full region: no. Individual services can fail region-wide. Most of the time it is a configuration issue, leading to cascading failures.
  • #42 Eventual consistency, also called optimistic replication, is widely deployed in distributed systems, and has origins in early mobile computing projects. A system that has achieved eventual consistency is often said to have converged, or achieved replica convergence. Eventual consistency is a weak guarantee – most stronger models, like linearizability, are trivially eventually consistent, but a system that is merely eventually consistent does not usually fulfill these stronger constraints.
  • #43 Eventual consistency
  • #52 The stronger the relationship between the metric and the business outcome you care about, the stronger the signal you have for making actionable decisions.
  • #53 The stronger the relationship between the metric and the business outcome you care about, the stronger the signal you have for making actionable decisions.
  • #60 Partial release to a subset of production nodes with sticky sessions turned on. That way you can control and minimize the number of users/customers that get impacted if you end up releasing a bad bug.
  • #65 Go fix it! After running your first experiment, hopefully, there is one of two outcomes. You’ve verified either that your system is resilient to the failure you introduced, or you’ve found a problem you need to fix. Both of these are good outcomes. On one hand, you’ve increased your confidence in the system and its behavior, on the other you’ve found a problem before it caused an outage.