How to apply chaos engineering to serverless applications
Abstract:
Chaos engineering is a discipline that focuses on improving system resilience through controlled experiments that expose the inherent chaos and failure modes in our system.
You might have heard about tools such as Netflix's Simian Army or Gremlin, which can inject different failures into your AWS environment to simulate different forms of infrastructure failures. But how can we apply the same principles to a serverless architecture where we have no access to the underlying infrastructure? Can we adapt existing practices to expose the inherent chaos in these systems? What are the limitations and new challenges that we need to consider?
4.
@theburningmonk theburningmonk.com
“the discipline of experimenting on a system in order to build confidence in the
system’s capability to withstand turbulent conditions in production”
principlesofchaos.org
5.
microservices death stars circa 2015
8.
resilience /rɪˈzɪlɪəns/ (noun)
“the capacity to recover quickly from difficulties; toughness.”
it’s not about
preventing failures!
10.
“You don't choose the moment, the moment chooses you!
You only choose how prepared you are when it does.”
Fire Chief Mike Burtch
12.
MURPHY'S LAW: anything that can go wrong, will go wrong.
13.
GOAL: identify weaknesses before they manifest in system-wide, aberrant behaviors
15.
HOW: learn about the system’s behavior by observing it during controlled experiments
game days
failure injection
16.
Yan Cui
http://theburningmonk.com
@theburningmonk
AWS user for 10 years
17.
http://bit.ly/yubl-serverless
18.
Developer Advocate @
19.
Independent Consultant
advise, training, delivery
24.
by Russ Miles @russmiles
source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
26.
Shared Responsibility Model
28.
chaos monkey kills an EC2 instance
latency monkey induces artificial delay in APIs
chaos gorilla kills an AWS Availability Zone
chaos kong kills an entire AWS region
36.
STEP 1.
define steady state
i.e. “what does normal look like”
37.
STEP 2.
hypothesize that the steady state continues in both the control and experimental groups
e.g. “the system stays up if a server dies”
38.
STEP 3.
inject realistic failures
e.g. “slow response from 3rd-party service”
39.
STEP 4.
try to disprove hypothesis
i.e. “look for differences between the control and experimental groups”
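The four steps can be sketched as a comparison between a control group and an experimental group. This is a minimal illustration rather than a tool from the talk: the function names, the success-rate metric used as "steady state" and the 5% tolerance are all assumptions.

```javascript
// Hypothetical sketch: all names and the tolerance value are illustrative.

// STEP 1: steady state is summarised here as the fraction of successful requests.
function successRate(results) {
  if (results.length === 0) throw new Error('no results to measure');
  const ok = results.filter((r) => r === true).length;
  return ok / results.length;
}

// STEPS 2 and 4: the hypothesis is that both groups show the same steady state.
// It is disproved when the experimental group deviates by more than `tolerance`.
function evaluateExperiment(controlResults, experimentalResults, tolerance = 0.05) {
  const control = successRate(controlResults);
  const experimental = successRate(experimentalResults);
  const difference = Math.abs(control - experimental);
  return { control, experimental, difference, hypothesisHolds: difference <= tolerance };
}

// Example: failures injected into the experimental group (STEP 3) caused
// 3 of 10 requests to fail, while the control group stayed healthy.
const controlGroup = new Array(10).fill(true);
const experimentalGroup = [true, true, true, false, true, false, true, true, false, true];
const outcome = evaluateExperiment(controlGroup, experimentalGroup);
console.log(outcome.hypothesisHolds ? 'hypothesis holds' : 'hypothesis disproved');
```

In a real experiment the two result arrays would come from live traffic split between the groups; a single success rate is only one possible definition of steady state.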
40.
latency: inject latency into function invocations
41.
“what if service X has elevated latency?”
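One way to explore this question is to wrap a function handler so that some invocations are delayed before they run. The sketch below is an assumption-laden illustration, not the implementation used in the talk: `withLatency` and its options (`rate`, `minMs`, `maxMs`, `random`, `delay`) are hypothetical names.

```javascript
// Hypothetical sketch: `withLatency` and its options are illustrative names.

// Default delay implementation based on setTimeout.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Wraps an async handler so that a fraction (`rate`) of invocations is
// delayed by a random duration between `minMs` and `maxMs` before it runs.
function withLatency(handler, options = {}) {
  const {
    rate = 1,             // probability that an invocation is delayed
    minMs = 100,
    maxMs = 1000,
    random = Math.random, // injectable so tests can be deterministic
    delay = sleep,        // injectable so tests do not have to wait
  } = options;

  return async (event, context) => {
    if (random() < rate) {
      const ms = Math.floor(minMs + random() * (maxMs - minMs));
      await delay(ms);
    }
    return handler(event, context);
  };
}

// Example: delay roughly half of all invocations by 1-3 seconds.
const handler = withLatency(
  async (event) => ({ statusCode: 200, body: JSON.stringify({ ok: true }) }),
  { rate: 0.5, minMs: 1000, maxMs: 3000 }
);
```

Keeping `rate` configurable makes it possible to start with a small fraction of invocations and widen the experiment only once the results look safe.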
42.
API Gateway Lambda API Gateway Lambda
48.
API Gateway Lambda API Gateway Lambda
502
200
49.
API Gateway Lambda API Gateway Lambda
3s timeout
6s timeout
50.
API Gateway Lambda API Gateway Lambda
max 29s integration
max 15 mins timeout
51.
and then there’s
cold starts…
52.
TIL: most HTTP client libraries have a default timeout of 60s.
API Gateway has an integration timeout of 29s.
Most Lambda functions default to a timeout of 3-6s.
Don’t forget about cold starts!
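Given these mismatched defaults, one option is to derive the timeout for each outbound request from the time left in the current invocation. The sketch below assumes a hypothetical helper, `requestTimeoutMs`, with illustrative defaults for the reserve and the cap; only `context.getRemainingTimeInMillis()` is part of the Lambda Node.js runtime.

```javascript
// Hypothetical sketch: the function name and the default values are illustrative.

// Picks a timeout for an outbound request from the time left in the current
// invocation, keeping `reserveMs` back so the function can still respond
// (for example with a fallback) before Lambda terminates it.
function requestTimeoutMs(remainingMs, reserveMs = 500, capMs = 3000) {
  const budget = remainingMs - reserveMs;
  if (budget <= 0) return 0; // no time left: the caller should fail fast
  return Math.min(budget, capMs);
}

// In a Node.js Lambda handler, the remaining time comes from the context object:
//   const timeout = requestTimeoutMs(context.getRemainingTimeInMillis());
console.log(requestTimeoutMs(6000)); // 3000 (capped)
console.log(requestTimeoutMs(2000)); // 1500 (remaining time minus reserve)
console.log(requestTimeoutMs(400));  // 0 (nothing left)
```

The reserve matters most on cold starts, where part of the caller's own timeout budget is already spent before the handler runs.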
71.
hypothesis: the AWS SDK retries would handle it
72.
result: function times out after 6s
(hypothesis is disproved)
74.
TIL: the JS DynamoDB client defaults to 10 retries with a base delay of 50ms
delay = Math.random() * (Math.pow(2, retryCount) * base)
this is Marc Brooker’s
fav formula!
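To see why those defaults can outlast a 6s function timeout, the formula can be summed across all retries. This is a rough sketch: it assumes `retryCount` runs from 0 to 9 for the 10 retries and counts only the backoff waits, not the time spent on each request. The helper names are hypothetical.

```javascript
// The delay formula from the slide, with the random source made injectable.
function retryDelay(retryCount, base, random = Math.random) {
  return random() * (Math.pow(2, retryCount) * base);
}

// Sums the delay over all retries.
// Assumption: retryCount runs from 0 to maxRetries - 1.
function totalRetryDelay(maxRetries, base, random = Math.random) {
  let total = 0;
  for (let retryCount = 0; retryCount < maxRetries; retryCount++) {
    total += retryDelay(retryCount, base, random);
  }
  return total;
}

// Upper bound: Math.random() is always below 1, so the real total stays under this.
const worstCase = totalRetryDelay(10, 50, () => 1);
// Expected value: Math.random() averages 0.5.
const average = totalRetryDelay(10, 50, () => 0.5);
console.log(worstCase); // 51150
console.log(average);   // 25575
```

Under these assumptions the backoff alone averages about 25.6s and can approach 51s, both far beyond a 6s function timeout, which is consistent with the disproved hypothesis.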
88.
https://theburningmonk.com/hire-me
Advise, Training, Delivery
“Fundamentally, Yan has improved our team by increasing our
ability to derive value from AWS and Lambda in particular.”
Nick Blair
Tech Lead