The document discusses applying principles of chaos engineering to serverless architectures. It describes challenges with chaos engineering in serverless due to the lack of servers to inject failures into. It then discusses approaches to injecting latency and errors into serverless functions and downstream services as part of controlled experiments to test resiliency. The goal is not to break systems but to uncover weaknesses and improve robustness. Containment of experiments is emphasized.
4. Chaos Engineering is the discipline of experimenting on a distributed system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production.
- principlesofchaos.org
5. Smallpox
Earliest evidence of disease in 3rd Century BC Egyptian Mummy.
est. 400K deaths per year in 18th Century Europe.
6. History of vaccination
First vaccine was developed in
1798 by Edward Jenner.
https://en.wikipedia.org/wiki/Edward_Jenner
7. History of vaccination
First vaccine was developed in
1798 by Edward Jenner.
WHO certified global eradication
in 1980.
https://en.wikipedia.org/wiki/Edward_Jenner
10. History of vaccination
Vaccination is the most effective method to prevent infectious diseases.
Vaccines stimulate the immune system to recognise and destroy the
disease before contracting it for real.
12. Chaos engineering
Use controlled experiments to inject failures into our system.
Help us learn about our system’s behaviour and uncover unknown failure
modes, before they manifest like wildfire in production.
13. Chaos engineering
Use controlled experiments to inject failures into our system.
Help us learn about our system’s behaviour and uncover unknown failure
modes, before they manifest like wildfire in production.
Lets us build confidence in its ability to withstand turbulent conditions.
15. Who am I?
Principal Engineer at DAZN.
AWS Serverless Hero.
Author of Production-Ready Serverless* course by Manning.
Blogger**, speaker.
* https://bit.ly/production-ready-serverless
** https://theburningmonk.com
20. About DAZN
Available in 7 countries - Austria, Switzerland, Germany,
Japan, Canada, Italy and USA.
Available on 30+ platforms.
21. About DAZN
Available in 7 countries - Austria, Switzerland, Germany,
Japan, Canada, Italy and USA.
Available on 30+ platforms.
Around 1,000,000 concurrent viewers at peak.
22. follow @dazneng for
updates about the
engineering team
We’re hiring! Visit
engineering.dazn.com
to learn more.
WE’RE HIRING!
26. Chaos engineering has an image problem
Too much emphasis is on breaking things.
Easy to conflate the action of injecting failures with the payback.
34. Hypothesise steady state will
continue in both control group
& the experiment group
ie. you should have a reasonable degree of confidence the
system would handle the failure before you proceed with
the experiment
STEP 2.
36. Chaos in practice
Explore unknown unknowns away from production.
Experiments that graduate to production should be carefully
considered and planned.
You should have reasonable confidence in the system before
running experiments in production.
37. Chaos in practice
Treat production with the care it deserves.
The goal is not to break things.
38. Chaos in practice
If you know the system would break and you did it anyway,
then it’s not a chaos experiment!
It’s called being irresponsible.
39. STEP 3. Inject realistic failures
e.g. server crash, network error, HD
malfunction, etc.
50. Containment
Make the smallest change necessary to prove or disprove hypothesis.
Have a rollback plan.
Stop the experiment right away if things start to go wrong.
54. chaos monkey kills an EC2
instance
latency monkey induces
artificial delay in APIs
chaos gorilla kills an AWS
Availability Zone
chaos kong kills an entire
AWS region
66. Serverless challenges
Unknown failure modes in the infrastructure we don’t control.
Often there’s little we can do when an outage occurs in the platform.
75. Defining steady state
What metrics do you use?
p95/p99 latencies, error count, backlog size, yield*, harvest**
* percentage of requests completed
** completeness of the returned response
76. Hypothesise steady state will
continue in both control group
& the experiment group
ie. you should have a reasonable degree of confidence the
system would handle the failure before you proceed with
the experiment
STEP 2.
83. Request timeouts
Finding the right timeout value is tricky.
Too short : requests not given the best chance to succeed.
84. Request timeouts
Finding the right timeout value is tricky.
Too short : requests not given the best chance to succeed.
Too long : risk timing out the calling function.
85. Request timeouts
Finding the right timeout value is tricky.
Too short : requests not given the best chance to succeed.
Too long : risk timing out the calling function.
Even more complicated when you have multiple integration points.
86. Approach 1: split invocation time equally
(e.g. 3 requests, 6s function timeout = 2s timeout per request)
87. Approach 2: every request is given nearly all the invocation time
(e.g. 3 requests, 6s function timeout = 5s timeout per request)
92. Recovery steps
Log the timeout with as much context as possible.
The API, timeout value, correlation IDs, request object, etc.
93. Recovery steps
Log the timeout with as much context as possible.
The API, timeout value, correlation IDs, request object, etc.
Record custom metrics.
94. Recovery steps
Log the timeout with as much context as possible.
The API, timeout value, correlation IDs, request object, etc.
Record custom metrics.
Use fallbacks.
95.
96.
97.
98. Recovery steps
Be mindful when you sacrifice precision for availability.
User experience is the king.
99. STEP 3. Inject realistic failures
e.g. server crash, network error, HD
malfunction, etc.
107. hypothesis:
All functions have appropriate timeout on their HTTP
communications to this internal API, and can degrade
gracefully when requests are timed out
114. Priming (psychology):
Priming is a technique whereby exposure to one stimulus
influences a response to a subsequent stimulus, without
conscious guidance or intention.
It is a technique in psychology used to train a person's
memory both in positive and negative ways.
115.
116. Use failure injection to programme your colleagues into
thinking about failure modes early.
117. Where to inject latency?
Make X% of all requests slow
in the dev environment.
118. hypothesis:
The client app has appropriate timeout on their HTTP
communication with the server, and can degrade gracefully
when requests are timed out
159. STEP 1. Define “steady state”
aka. what does normal, working
condition looks like?
160. Hypothesise steady state will
continue in both control group
& the experiment group
ie. you should have a reasonable degree of confidence the
system would handle the failure before you proceed with
the experiment
STEP 2.
161. STEP 3. Inject realistic failures
e.g. server crash, network error, HD
malfunction, etc.
162. STEP 4. Disprove hypothesis
i.e. look for difference in steady state
163. there are more inherent chaos and
complexity in a serverless application
164. even without servers, you can still inject
CONTROLLED failures at the application level
165.
166.
167.
168.
169. API Gateway and Kinesis
Authentication & authorisation (IAM, Cognito)
Testing
Running & Debugging functions locally
Log aggregation
Monitoring & Alerting
X-Ray
Correlation IDs
CI/CD
Performance and Cost optimisation
Error Handling
Configuration management
VPC
Security
Leading practices (API Gateway, Kinesis, Lambda)
Canary deployments
http://bit.ly/prod-ready-serverless
get 40% off with:
ytcui
170. @theburningmonk
theburningmonk.com
github.com/theburningmonk
API Gateway and Kinesis
Authentication & authorisation (IAM, Cognito)
Testing
Running & Debugging functions locally
Log aggregation
Monitoring & Alerting
X-Ray
Correlation IDs
CI/CD
Performance and Cost optimisation
Error Handling
Configuration management
VPC
Security
Leading practices (API Gateway, Kinesis, Lambda)
Canary deployments
http://bit.ly/prod-ready-serverless
get 40% off with:
ytcui