Applying principles of chaos engineering to serverless (CodeMesh)

Chaos engineering is a discipline that focuses on improving system resilience through experiments that expose the inherent chaos and failure modes in our system, in a controlled fashion, before these failure modes manifest themselves like a wildfire in production and impact our users.

Netflix is undoubtedly the leader in this field, but many of the publicised tools and articles focus on killing EC2 instances, and the efforts in the serverless community have been largely limited to moving those tools into AWS Lambda functions.

But how can we apply the same principles of chaos to a serverless architecture built around AWS Lambda functions?

These serverless architectures have more inherent chaos and complexity than their serverful counterparts, and we have less control over their runtime behaviour. In short, there are far more unknown unknowns with these systems.

Can we adapt existing practices to expose the inherent chaos in these systems? What are the limitations and new challenges that we need to consider?


  1. 1. APPLYING PRINCIPLES of CHAOS ENGINEERING to SERVERLESS
  2. 2. what is chaos engineering?
  3. 3. Chaos Engineering is the discipline of experimenting on a distributed system
 in order to build confidence in the system’s capability
 to withstand turbulent conditions in production. - principlesofchaos.org
  4. 4. history of Smallpox est. 400K deaths per year in 18th Century Europe. earliest evidence of disease in 3rd Century BC Egyptian Mummy
  5. 5. history of Smallpox est. 400K deaths per year in 18th Century Europe. earliest evidence of disease in 3rd Century BC Egyptian Mummy 1798 first vaccine developed Edward Jenner
  6. 6. 1798 first vaccine developed 1980 history of Smallpox Edward Jenner WHO certified global eradication est. 400K deaths per year in 18th Century Europe. earliest evidence of disease in 3rd Century BC Egyptian Mummy
  7. 7. Vaccination is the most effective method of preventing infectious diseases
  8. 8. stimulates the immune system to recognize and destroy the disease before contracting the disease for real
  9. 9. Chaos Engineering controlled experiments to help us learn about our system’s behaviour and build confidence in its ability to withstand turbulent conditions
  10. 10. chaos engineering is the vaccine to frailties in modern software
  11. 11. Yan Cui http://theburningmonk.com @theburningmonk Principal Engineer @
  12. 12. Yan Cui http://theburningmonk.com @theburningmonk Principal Engineer @
  13. 13. “Netflix for sports” offices in London, Leeds, Katowice and Amsterdam
  14. 14. available in 7 countries, 30+ platforms
  15. 15. ~1,000,000 concurrent viewers
  16. 16. “Netflix for sports” offices in London, Leeds, Katowice and Amsterdam We’re hiring! Visit engineering.dazn.com to learn more. follow @DAZN_ngnrs for updates about the engineering team. WE’RE HIRING!
  17. 17. chaos engineering has an image problem
  18. 18. Why did you break production?
  19. 19. Because I can!
  20. 20. it’s about building confidence, NOT breaking things
  21. 21. http://principlesofchaos.org
  22. 22. STEP 1. define “Steady State” aka. what does a normal, working condition look like?
  23. 23. this is not a steady state
  24. 24. STEP 2. hypothesize steady state will continue in both the control group & the experiment group ie. you should have a reasonable degree of confidence the system would handle the failure before you proceed with the experiment
  25. 25. explore unknown unknowns away from production
  26. 26. treat production with the care it deserves
  27. 27. the goal is NOT, to actually hurt production
  28. 28. If you know the system would break, and you did it anyway… then it’s NOT a chaos experiment. It’s called being IRRESPONSIBLE.
  29. 29. STEP 3. inject realistic failures e.g. server crash, network error, HD malfunction, etc.
  30. 30. https://github.com/Netflix/SimianArmy
  31. 31. https://github.com/Netflix/SimianArmy http://oreil.ly/2tZU1Sn
  32. 32. STEP 4. disprove hypothesis i.e. look for difference with steady state
  33. 33. if a WEAKNESS is uncovered, IMPROVE it before the behaviour manifests in the system at large
  34. 34. Chaos Engineering controlled experiments to help us learn about our system’s behaviour and build confidence in its ability to withstand turbulent conditions
  35. 35. Chaos Engineering controlled experiments to help us learn about our system’s behaviour and build confidence in its ability to withstand turbulent conditions
  36. 36. containment and blast radius should be front and centre of your thinking
  37. 37. communication
  38. 38. ensure everyone knows what you’re doing
  39. 39. ensure everyone knows what you’re doing NO surprises!
  40. 40. communication Timing
  41. 41. run experiments during office hours
  42. 42. AVOID important dates
  43. 43. communication Timing contain Blast radius
  44. 44. smallest change that allows you to detect a signal that steady state is disrupted
  45. 45. rollback at the first sign of TROUBLE!
  46. 46. communication Timing contain Blast radius
  47. 47. don’t try to run before you know how to walk.
  48. 48. by Russ Miles @russmiles source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  49. 49. chaos monkey kills an EC2 instance latency monkey induces artificial delay in APIs chaos gorilla kills an AWS Availability Zone chaos kong kills an entire AWS region
  50. 50. there is no server…
  51. 51. there is no server… that you can kill
  52. 52. there is more inherent chaos and complexity in a Serverless architecture
  53. 53. smaller units of deployment but A LOT more of them!
  54. 54. more difficult to harden around boundaries serverful serverless
  55. 55. ? SNS Kinesis CloudWatch Events CloudWatch Logs IoT DynamoDB S3 SES
  56. 56. ? SNS Kinesis CloudWatch Events CloudWatch Logs IoT DynamoDB S3 SES more intermediary services, and greater variety too
  57. 57. ? SNS Kinesis CloudWatch Events CloudWatch Logs IoT DynamoDB S3 SES more intermediary services, and greater variety too each with its own set of failure modes
  58. 58. serverful serverless more configurations, more opportunities for misconfiguration
  59. 59. more unknown failure modes in infrastructure that we don’t control
  60. 60. often there’s little we can do when an outage occurs in the platform
  61. 61. improperly tuned timeouts
  62. 62. missing error handling
  63. 63. missing fallback when downstream is unavailable
  64. 64. LATENCY INJECTION
  65. 65. STEP 1. define “Steady State” aka. what does a normal, working condition look like?
  66. 66. what metrics do you monitor?
  67. 67. 9X-percentile latency error count yield (% of requests completed) harvest (completeness of results)
  68. 68. STEP 2. hypothesize steady state will continue in both the control group & the experiment group ie. you should have a reasonable degree of confidence the system would handle the failure before you proceed with the experiment
  69. 69. API Gateway
  70. 70. consider the effect of cold-starts & API Gateway overhead
  71. 71. use short timeout for API calls
  72. 72. the goal of a timeout strategy is to give HTTP requests the best chance to succeed, provided that doing so does not cause the calling function itself to err
  73. 73. fixed timeouts are tricky to get right…
  74. 74. fixed timeouts are tricky to get right… too short and you don’t give requests the best chance to succeed
  75. 75. fixed timeouts are tricky to get right… too long and you run the risk of letting the request time out the calling function
  76. 76. and it gets worse when you make multiple API calls in one function…
  77. 77. set the request timeout based on the amount of invocation time left (see the timeout sketch after the transcript)
  78. 78. log the timeout incident with as much context as possible e.g. timeout value, correlation IDs, request object, …
  79. 79. report custom metrics
  80. 80. trade harvest (completeness of response) for yield (availability of response); see the degradation sketch after the transcript
  81. 81. be mindful when you sacrifice precision for availability, user experience is king
  82. 82. STEP 3. inject realistic failures e.g. server crash, network error, HD malfunction, etc.
  83. 83. where to inject latency?
  84. 84. hypothesis: function has appropriate timeout on its HTTP communications and can degrade gracefully when these requests time out
  85. 85. should also be applied to 3rd-party services we depend on, e.g. DynamoDB
  86. 86. what’s the blast radius?
  87. 87. http client public-api-a http client public-api-b internal-api
  88. 88. hypothesis: all functions have appropriate timeout on their HTTP communications to this internal API, and can degrade gracefully when requests time out
  89. 89. large blast radius, risky…
  90. 90. could be effective when used away from production environment, to weed out weaknesses quickly
  91. 91. not priming developers to build more resilient systems
  92. 92. development
  93. 93. development production
  94. 94. Priming (psychology): Priming is a technique whereby exposure to one stimulus influences a response to a subsequent stimulus, without conscious guidance or intention. It is a technique in psychology used to train a person's memory both in positive and negative ways.
  95. 95. make dev environments better resemble the turbulent conditions you should realistically expect your system to survive in production
  96. 96. hypothesis: the client app has appropriate timeout on its HTTP communication with the server, and can degrade gracefully when requests time out
  97. 97. STEP 4. disprove hypothesis i.e. look for difference with steady state
  98. 98. how to inject latency?
  99. 99. static weaver (e.g. AspectJ, PostSharp), or dynamic proxies
  100. 100. https://theburningmonk.com/2015/04/design-for-latency-issues/
  101. 101. manually crafted wrapper library
  102. 102. configured in SSM Parameter Store
  103. 103. no injected latency
  104. 104. with injected latency
  105. 105. factory wrapper function (think bluebird’s promisifyAll function); see the latency-injection sketch after the transcript
  106. 106. ERROR INJECTION
  107. 107. common errors HTTP 5xx DynamoDB throughput exceeded throttled Lambda executions (see the error-injection sketch after the transcript)
  108. 108. hypothesis: Function has appropriate error handling on its HTTP communications and can degrade gracefully when downstream dependencies fail
  109. 109. hypothesis: Function has appropriate error handling on DynamoDB operations and can degrade gracefully when DynamoDB throughput is exceeded
  110. 110. Induce Lambda throttling by temporarily setting reserved concurrency (see the reserved-concurrency sketch after the transcript).
  111. 111. failures are INEVITABLE
  112. 112. the only way to truly know your system’s resilience against failures is to test it through controlled experiments
  113. 113. vaccinate your serverless architecture against failures
  114. 114. @theburningmonk theburningmonk.com github.com/theburningmonk
  115. 115. API Gateway and Kinesis Authentication & authorisation (IAM, Cognito) Testing Running & Debugging functions locally Log aggregation Monitoring & Alerting X-Ray Correlation IDs CI/CD Performance and Cost optimisation Error Handling Configuration management VPC Security Leading practices (API Gateway, Kinesis, Lambda) Canary deployments http://bit.ly/prod-ready-serverless get 40% off with: ytcui
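
Timeout sketch (slides 71-78): a minimal TypeScript sketch of deriving each HTTP request's timeout from the invocation time the function has left, assuming a Node.js 18+ Lambda runtime with the global fetch API; the 500 ms headroom value and the helper name fetchWithDynamicTimeout are illustrative assumptions, not taken from the deck.

```typescript
// Minimal sketch: derive the request timeout from the remaining invocation time,
// so a slow downstream call cannot time out the calling function itself.
interface LambdaContext {
  getRemainingTimeInMillis(): number;
}

async function fetchWithDynamicTimeout(
  url: string,
  context: LambdaContext,
  headroomMs = 500 // reserve time to log, emit metrics and return a fallback (assumed value)
) {
  const timeoutMs = Math.max(context.getRemainingTimeInMillis() - headroomMs, 1);
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { signal: controller.signal });
  } catch (err) {
    // log the timeout incident with as much context as possible
    console.error(JSON.stringify({ msg: 'HTTP request failed or timed out', url, timeoutMs }));
    throw err;
  } finally {
    clearTimeout(timer);
  }
}
```

The headroom is what buys the function enough time to log the incident, report a custom metric and return a degraded response instead of being killed by its own timeout.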
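
Degradation sketch (slides 80-81): a small sketch of trading harvest for yield, assuming a hypothetical page assembled from one essential call and one optional call; both function parameters are made up for illustration.

```typescript
type Json = Record<string, unknown>;

// fetchCore must succeed: without it there is no page to return.
// fetchRecommendations is optional: the page still works without it.
async function getHomePage(
  fetchCore: () => Promise<Json>,
  fetchRecommendations: () => Promise<Json[]>
): Promise<Json> {
  const core = await fetchCore();
  let recommendations: Json[] = [];
  try {
    recommendations = await fetchRecommendations();
  } catch {
    // reduced harvest (no recommendations), but the request still yields a response
  }
  return { ...core, recommendations };
}
```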
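
Latency-injection sketch (slides 98-105): a rough sketch of a hand-rolled wrapper whose chaos configuration lives in SSM Parameter Store, using the AWS SDK v3 SSM client; the parameter name, config shape and function names are assumptions rather than the deck's actual library.

```typescript
import { SSMClient, GetParameterCommand } from '@aws-sdk/client-ssm';

type ChaosConfig = { enabled: boolean; delayMs: number; probability: number };

const ssm = new SSMClient({});

// Load injection settings from a (hypothetical) parameter such as /dev/chaos/latency.
async function loadChaosConfig(paramName: string): Promise<ChaosConfig> {
  const res = await ssm.send(new GetParameterCommand({ Name: paramName }));
  return JSON.parse(res.Parameter?.Value ?? '{"enabled":false,"delayMs":0,"probability":0}');
}

// Factory wrapper (in the spirit of bluebird's promisifyAll): takes any async
// function and returns a version that sometimes sleeps before calling through.
function withLatencyInjection<Args extends unknown[], R>(
  fn: (...args: Args) => Promise<R>,
  config: ChaosConfig
): (...args: Args) => Promise<R> {
  return async (...args: Args) => {
    if (config.enabled && Math.random() < config.probability) {
      await new Promise((resolve) => setTimeout(resolve, config.delayMs));
    }
    return fn(...args);
  };
}
```

The wrapped client would typically only be swapped in away from production, e.g. const chaoticFetch = withLatencyInjection(fetch, await loadChaosConfig('/dev/chaos/latency')).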
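
Error-injection sketch (slides 106-109): the same wrapper idea can inject the common errors listed on slide 107; the error shapes below are simplified stand-ins for real HTTP and DynamoDB SDK errors, not the deck's implementation.

```typescript
// Simplified config; mode picks which failure to simulate.
type ErrorChaosConfig = {
  enabled: boolean;
  probability: number;
  mode: 'http500' | 'dynamodb-throttle';
};

function withErrorInjection<Args extends unknown[], R>(
  fn: (...args: Args) => Promise<R>,
  config: ErrorChaosConfig
): (...args: Args) => Promise<R> {
  return async (...args: Args) => {
    if (config.enabled && Math.random() < config.probability) {
      if (config.mode === 'http500') {
        const err = new Error('injected HTTP 500 from chaos wrapper') as Error & { statusCode?: number };
        err.statusCode = 500; // mimic an HTTP 5xx from a downstream API
        throw err;
      }
      const err = new Error('injected DynamoDB throttling from chaos wrapper');
      err.name = 'ProvisionedThroughputExceededException'; // mimic the DynamoDB error name
      throw err;
    }
    return fn(...args);
  };
}
```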
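
Reserved-concurrency sketch (slide 110): one way to induce Lambda throttling for the duration of an experiment is to set reserved concurrency to zero and remove it afterwards, shown here with the AWS SDK v3 Lambda client; the function name and duration are placeholders.

```typescript
import {
  LambdaClient,
  PutFunctionConcurrencyCommand,
  DeleteFunctionConcurrencyCommand,
} from '@aws-sdk/client-lambda';

const lambda = new LambdaClient({});

// Setting ReservedConcurrentExecutions to 0 causes every invocation to be throttled.
async function throttleFor(functionName: string, durationMs: number): Promise<void> {
  await lambda.send(new PutFunctionConcurrencyCommand({
    FunctionName: functionName,
    ReservedConcurrentExecutions: 0,
  }));
  try {
    // keep the experiment window short and contained
    await new Promise((resolve) => setTimeout(resolve, durationMs));
  } finally {
    // roll back: removing the setting restores the function's normal concurrency
    await lambda.send(new DeleteFunctionConcurrencyCommand({ FunctionName: functionName }));
  }
}
```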
