Applying principles of chaos engineering to serverless (O'Reilly Software Architecture 2018)

Chaos engineering is a discipline that focuses on improving system resilience through experiments that expose the inherent chaos and failure modes in our system, in a controlled fashion, before these failure modes manifest themselves like a wildfire in production and impact our users.

Netflix is undoubtedly the leader in this field, but many of the publicised tools and articles focus on killing EC2 instances, and the efforts in the serverless community have been largely limited to moving those tools into AWS Lambda functions.

But how can we apply the same principles of chaos to a serverless architecture built around AWS Lambda functions?

These serverless architectures have more inherent chaos and complexity than their serverful counterparts, and we have less control over their runtime behaviour. In short, there are far more unknown unknowns with these systems.

Can we adapt existing practices to expose the inherent chaos in these systems? What are the limitations and new challenges that we need to consider?

Published in: Technology

  1. 1. Applying Principles of Chaos Engineering to Serverless Yan Cui @theburningmonk
  2. 2. Agenda ▪What is chaos engineering? ▪New challenges with serverless ▪Applying latency injection to serverless ▪Applying error injection to serverless
  3. 3. After the talk ▪Slides will be available on slideshare ▪Find links to slides and video at https://theburningmonk.com/oreillysa2018
  4. 4. What is chaos engineering?
  5. 5. Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. - principlesofchaos.org
  6. 6. Smallpox Earliest evidence of disease in 3rd Century BC Egyptian Mummy. est. 400K deaths per year in 18th Century Europe.
  7. 7. History of vaccination First vaccine was developed in 1798 by Edward Jenner. https://en.wikipedia.org/wiki/Edward_Jenner
  8. 8. History of vaccination First vaccine was developed in 1798 by Edward Jenner. WHO certified global eradication in 1980. https://en.wikipedia.org/wiki/Edward_Jenner
  9. 9. https://en.wikipedia.org/wiki/Vaccine History of vaccination
  10. 10. History of vaccination Vaccination is the most effective method to prevent infectious diseases.
  11. 11. History of vaccination Vaccination is the most effective method to prevent infectious diseases. Vaccines stimulate the immune system to recognise and destroy the disease before contracting it for real.
  12. 12. Chaos engineering Use controlled experiments to inject failures into our system.
  13. 13. Chaos engineering Use controlled experiments to inject failures into our system. Help us learn about our system’s behaviour and uncover unknown failure modes, before they manifest like wildfire in production.
  14. 14. Chaos engineering Use controlled experiments to inject failures into our system. Help us learn about our system’s behaviour and uncover unknown failure modes, before they manifest like wildfire in production. Lets us build confidence in its ability to withstand turbulent conditions.
  15. 15. Chaos Engineering is the vaccine to frailties in modern software.
  16. 16. Who am I? Principal Engineer at DAZN. AWS Serverless Hero. Author of Production-Ready Serverless* course by Manning. Blogger**, speaker. * https://bit.ly/production-ready-serverless ** https://theburningmonk.com
  17. 17. https://www.ft.com/content/07d375ee-6ee5-11e8-92d3-6c13e5c92914
  18. 18. https://www.theguardian.com/media/2018/may/14/streaming-service-dazn-netflix-sport-us-boxing-eddie-hearn
  19. 19. About DAZN Available in 7 countries - Austria, Switzerland, Germany, Japan, Canada, Italy and USA. Available on 30+ platforms.
  20. 20. About DAZN Around 1,000,000 concurrent viewers at peak.
  21. 21. Chaos engineering has an image problem
  22. 22. Chaos engineering has an image problem
  23. 23. Chaos engineering has an image problem Too much emphasis is on breaking things.
  24. 24. Chaos engineering has an image problem Too much emphasis is on breaking things. Easy to conflate the action of injecting failures with the payback.
  25. 25. Why did you break production?
  26. 26. Because I can!
  27. 27. Chaos engineering has an image problem The goal is to learn about the system and build confidence.
  28. 28. Chaos engineering has an image problem The goal is to learn about the system and build confidence. The goal is not to break things.
  29. 29. Chaos in practice 4 steps to start running chaos experiments yourself.
  30. 30. STEP 1. Define “steady state” aka. what does a normal, working condition look like?
  31. 31. this is not a steady state
  32. 32. STEP 2. Hypothesise steady state will continue in both control group & the experiment group i.e. you should have a reasonable degree of confidence the system would handle the failure before you proceed with the experiment
  33. 33. Chaos in practice Explore unknown unknowns away from production.
  34. 34. Chaos in practice Explore unknown unknowns away from production. Experiments that graduate to production should be carefully considered and planned. You should have reasonable confidence in the system before running experiments in production.
  35. 35. Chaos in practice Treat production with the care it deserves. The goal is not to break things.
  36. 36. Chaos in practice If you know the system would break and you did it anyway, then it’s not a chaos experiment! It’s called being irresponsible.
  37. 37. STEP 3. Inject realistic failures e.g. server crash, network error, HD malfunction, etc.
  38. 38. https://github.com/Netflix/SimianArmy http://oreil.ly/2tZU1Sn
  39. 39. STEP 4. Disprove hypothesis i.e. look for difference in steady state
  40. 40. Chaos in practice Look for evidence that steady state was impacted by the injected failure.
  41. 41. Chaos in practice Look for evidence that steady state was impacted by the injected failure. Address weaknesses before failures happen for real.
  42. 42. Containment Experiments need to be controlled. The goal is not to break things.
  43. 43. Containment Ensure everyone knows what you are doing. Don’t surprise your teammates.
  44. 44. Containment Run experiments during office hours.
  45. 45. Containment Run experiments during office hours. Avoid important dates.
  46. 46. Containment Make the smallest change necessary to prove or disprove hypothesis.
  47. 47. Containment Make the smallest change necessary to prove or disprove hypothesis. Have a rollback plan. Stop the experiment right away if things start to go wrong.
  48. 48. Containment Don’t start in production. Can learn a lot by running experiments in staging.
  49. 49. by Russ Miles @russmiles source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  50. 50. New challenges with serverless
  51. 51. chaos monkey kills an EC2 instance latency monkey induces artificial delay in APIs chaos gorilla kills an AWS Availability Zone chaos kong kills an entire AWS region
  52. 52. Serverless challenges There are no servers that you can access and kill.
  53. 53. There is more inherent chaos and complexity in a serverless architecture.
  54. 54. Serverless challenges Smaller unit of deployment, but a lot more of them.
  55. 55. Serverless challenges serverful serverless
  56. 56. Serverless challenges Every function needs to be correctly configured and secured.
  57. 57. Serverless challenges Kinesis ? SNS CloudWatch Events CloudWatch Logs IoT Core DynamoDB S3 SES Lambda Lambda
  58. 58. Serverless challenges A lot of managed, intermediate services. Each with its own set of failure modes.
  59. 59. Serverless challenges Unknown failure modes in the infrastructure we don’t control.
  60. 60. Serverless challenges Unknown failure modes in the infrastructure we don’t control. Often there’s little we can do when an outage occurs in the platform.
  61. 61. Common weaknesses
  62. 62. Common weaknesses Improperly tuned timeouts.
  63. 63. Common weaknesses Missing error handling.
  64. 64. Common weaknesses Missing fallback.
  65. 65. Common weaknesses Missing regional failover.
  66. 66. Latency injection with serverless
  67. 67. STEP 1. Define “steady state” aka. what does a normal, working condition look like?
  68. 68. Defining steady state What metrics do you use?
  69. 69. Defining steady state What metrics do you use? p95/p99 latencies, error count, backlog size, yield*, harvest** * percentage of requests completed ** completeness of the returned response
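The metrics named on this slide can be computed from raw request samples. The following is a minimal sketch, not the speaker's implementation: the `RequestSample` shape, the nearest-rank percentile method, and the idea of measuring harvest as "fields returned out of fields expected" are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    """Hypothetical record of one request observed during the experiment."""
    latency_ms: float
    completed: bool        # did the request complete at all?
    fields_returned: int   # parts of the response that were populated
    fields_expected: int   # parts a complete response would contain


def percentile(values, pct):
    """Nearest-rank percentile, with pct in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def steady_state(samples):
    """Summarise samples into the metrics listed on the slide."""
    completed = [s for s in samples if s.completed]
    expected = sum(s.fields_expected for s in completed)
    returned = sum(s.fields_returned for s in completed)
    latencies = [s.latency_ms for s in samples]
    return {
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_count": len(samples) - len(completed),
        # yield: percentage of requests completed (as a fraction here)
        "yield": len(completed) / len(samples),
        # harvest: completeness of the returned responses
        "harvest": returned / expected if expected else 0.0,
    }
```

Computing the same summary for the control group and the experiment group gives two dictionaries that can be compared metric by metric.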
  70. 70. STEP 2. Hypothesise steady state will continue in both control group & the experiment group i.e. you should have a reasonable degree of confidence the system would handle the failure before you proceed with the experiment
  71. 71. API Gateway Serverless considerations
  72. 72. Serverless considerations Consider the effect of cold starts. How does it affect your strategy for handling slow responses?
  73. 73. Request timeouts Strategy should:
  74. 74. Request timeouts Strategy should: 1. Give requests the best chance to succeed
  75. 75. Request timeouts Strategy should: 1. Give requests the best chance to succeed 2. Do not allow slow responses to time out the calling function
  76. 76. Request timeouts Finding the right timeout value is tricky.
  77. 77. Request timeouts Finding the right timeout value is tricky. Too short : requests not given the best chance to succeed.
  78. 78. Request timeouts Finding the right timeout value is tricky. Too short : requests not given the best chance to succeed. Too long : risk timing out the calling function.
  79. 79. Request timeouts Finding the right timeout value is tricky. Too short : requests not given the best chance to succeed. Too long : risk timing out the calling function. Even more complicated when you have multiple integration points.
  80. 80. Approach 1: split invocation time equally (e.g. 3 requests, 6s function timeout = 2s timeout per request)
  81. 81. Approach 2: every request is given nearly all the invocation time (e.g. 3 requests, 6s function timeout = 5s timeout per request)
  82. 82. Request timeouts Proposal: set request timeouts dynamically based on invocation time left
  83. 83. Request timeouts
  84. 84. Set timeout based on remaining invocation time
  85. 85. Set timeout based on remaining invocation time
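The code on these two slides did not survive in the transcript. As a stand-in, here is a minimal Python sketch of the proposal, assuming a 500 ms reserve for logging and fallbacks (the reserve value and the `fetch` helper are illustrative, not from the talk). The Lambda context object's `get_remaining_time_in_millis()` method is the only AWS API used.

```python
import urllib.request

# Assumed headroom kept back so the function can still log, record metrics
# and return a fallback response after a request times out.
RESERVE_MS = 500


def request_timeout_ms(remaining_ms, reserve_ms=RESERVE_MS):
    """Timeout for the next outbound request, from invocation time left."""
    return max(remaining_ms - reserve_ms, 0)


def fetch(url, context, opener=urllib.request.urlopen):
    """Call url with a timeout derived from the Lambda context.

    `opener` is injectable so the sketch can be exercised without a network.
    """
    timeout_ms = request_timeout_ms(context.get_remaining_time_in_millis())
    if timeout_ms == 0:
        # Nothing left to spend: fail fast so the caller can use a fallback.
        raise TimeoutError("no invocation time left for " + url)
    with opener(url, timeout=timeout_ms / 1000) as response:
        return response.read()
```

With a 6s function timeout and three sequential requests, the first request gets roughly 5.5s under these assumptions, and each later request gets whatever the earlier ones left unspent, minus the reserve.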
  86. 86. Recovery steps Log the timeout with as much context as possible. The API, timeout value, correlation IDs, request object, etc.
  87. 87. Recovery steps Log the timeout with as much context as possible. The API, timeout value, correlation IDs, request object, etc. Record custom metrics.
  88. 88. Recovery steps Log the timeout with as much context as possible. The API, timeout value, correlation IDs, request object, etc. Record custom metrics. Use fallbacks.
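The three recovery steps can be combined in one helper. This is a sketch under assumptions: `call_with_recovery`, the log field names, and the `record_metric` callback are hypothetical, and only the built-in `TimeoutError` is handled, so a real HTTP client's own timeout exception would need to be caught instead.

```python
import json
import logging

logger = logging.getLogger("recovery")


def call_with_recovery(api, call, fallback, *, timeout_ms, correlation_id,
                       request, record_metric):
    """Run call(); on timeout, log context, count it, and use the fallback."""
    try:
        return call()
    except TimeoutError:
        # Step 1: log the timeout with as much context as possible.
        logger.warning(json.dumps({
            "message": "request timed out",
            "api": api,
            "timeout_ms": timeout_ms,
            "correlation_id": correlation_id,
            "request": request,
        }, default=str))
        # Step 2: record a custom metric (the sink is supplied by the caller).
        record_metric(f"{api}.timeout", 1)
        # Step 3: degrade gracefully instead of failing the invocation.
        return fallback()
```

The fallback might return cached or default data, which trades precision for availability.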
  89. 89. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  90. 90. Recovery steps Be mindful when you sacrifice precision for availability. User experience is king.
  91. 91. STEP 3. Inject realistic failures e.g. server crash, network error, HD malfunction, etc.
  92. 92. Where to inject latency?
  93. 93. Where to inject latency?
  94. 94. hypothesis: Function has appropriate timeout on its HTTP communications and can degrade gracefully when these requests time out
  95. 95. Where to inject latency?
  96. 96. Where to inject latency? Should be applied to 3rd party services too. DynamoDB, Twilio, Auth0, …
  97. 97. Where to inject latency?
  98. 98. Where to inject latency? Be mindful of the blast radius of the experiment. The goal is not to break things.
  99. 99. http client public-api-a http client public-api-b internal-api Where to inject latency?
  100. 100. hypothesis: All functions have appropriate timeout on their HTTP communications to this internal API, and can degrade gracefully when requests are timed out
  101. 101. Where to inject latency?
  102. 102. Where to inject latency?
  103. 103. Large blast radius, can cause cascade failures unintentionally Where to inject latency?
  104. 104. development
  105. 105. development production
  106. 106. Priming (psychology): Priming is a technique whereby exposure to one stimulus influences a response to a subsequent stimulus, without conscious guidance or intention. It is a technique in psychology used to train a person's memory both in positive and negative ways.
  107. 107. Use failure injection to programme your colleagues into thinking about failure modes early.
  108. 108. Where to inject latency? Make X% of all requests slow in the dev environment.
  109. 109. hypothesis: The client app has appropriate timeout on its HTTP communication with the server, and can degrade gracefully when requests are timed out
  110. 110. Where to inject latency?
  111. 111. Where to inject latency?
  112. 112. Where to inject latency?
  113. 113. STEP 4. Disprove hypothesis i.e. look for difference in steady state
  114. 114. How to inject latency?
  115. 115. How to inject latency? Static weavers (e.g. PostSharp, AspectJ). Dynamic proxies.
  116. 116. https://theburningmonk.com/2015/04/design-for-latency-issues/
  117. 117. How to inject latency? Manually crafted wrapper libraries.
  118. 118. configured in SSM Parameter Store
  119. 119. no injected latency
  120. 120. with injected latency
  121. 121. factory wrapper function (think bluebird’s promisifyAll function)
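The slides above showed code that is not in the transcript. A hand-crafted wrapper of this kind might look like the following Python sketch. The config shape (`enabled`, `rate`, `min_ms`, `max_ms`) is an assumption, and `get_config` is a hypothetical callable standing in for whatever loads the settings from SSM Parameter Store.

```python
import functools
import random
import time


def with_injected_latency(fn, get_config, rng=random.random, sleep=time.sleep):
    """Wrap fn so that a configured share of calls is delayed before running."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        config = get_config()  # re-read on every call so it can be toggled live
        if config.get("enabled") and rng() < config.get("rate", 0.0):
            spread = config["max_ms"] - config["min_ms"]
            delay_ms = config["min_ms"] + rng() * spread
            sleep(delay_ms / 1000)
        return fn(*args, **kwargs)
    return wrapper


def inject_latency_all(client, get_config, **kwargs):
    """Factory wrapper: wrap every public method of client in place."""
    for name in dir(client):
        attr = getattr(client, name)
        if callable(attr) and not name.startswith("_"):
            setattr(client, name, with_injected_latency(attr, get_config, **kwargs))
    return client
```

Passing `rng` and `sleep` as parameters keeps the wrapper testable without real delays. Setting `enabled` to false in the config turns the experiment off without a redeploy, which serves as the rollback plan.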
  122. 122. Error injection with serverless
  123. 123. Common errors HTTP 5XX DynamoDB provisioned throughput exceeded Throttled Lambda invocations
  124. 124. hypothesis: Function has appropriate error handling on its HTTP communications and can degrade gracefully when downstream dependencies fail
  125. 125. Where to inject errors?
  126. 126. hypothesis: Function has appropriate error handling on DynamoDB operations and can degrade gracefully when DynamoDB throughputs are exceeded
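Errors can be injected at the application level with the same wrapper idea used for latency. In this sketch, `InjectedFailure`, the config keys, and `get_recommendations` are hypothetical names chosen for illustration; a real experiment would more likely raise the error type the actual client library produces, so that the production error handling path is the one being exercised.

```python
import functools
import random


class InjectedFailure(Exception):
    """Hypothetical marker for a failure raised on purpose by an experiment."""

    def __init__(self, code):
        super().__init__(f"injected failure: {code}")
        self.code = code


def with_injected_errors(fn, get_config, rng=random.random):
    """Wrap fn so that a configured share of calls fails instead of running."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        config = get_config()
        if config.get("enabled") and rng() < config.get("rate", 0.0):
            raise InjectedFailure(config.get("code", "HTTP 500"))
        return fn(*args, **kwargs)
    return wrapper


def get_recommendations(query_table, user_id):
    """Code under test: degrade to an empty list rather than fail the request."""
    try:
        return query_table(user_id)
    except InjectedFailure:
        return []
```

If the hypothesis holds, callers of `get_recommendations` see an empty but valid response while the failure is being injected, and the steady-state metrics stay within their normal range.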
  127. 127. Where to inject errors?
  128. 128. Where to inject errors? Induce Lambda throttling by temporarily setting reserved concurrency.
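One way to keep this experiment contained is to wrap the change in a context manager so the setting is always rolled back. The sketch below assumes `lambda_client` is a boto3 Lambda client (for example `boto3.client("lambda")`); `put_function_concurrency` and `delete_function_concurrency` are its real operations, while the `throttled` helper and its parameters are illustrative.

```python
from contextlib import contextmanager


@contextmanager
def throttled(lambda_client, function_name, reserved=0, restore_to=None):
    """Temporarily cap a function's reserved concurrency, then roll back.

    reserved=0 throttles every invocation. restore_to is the previous
    reserved concurrency, or None if the function had none configured.
    """
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=reserved,
    )
    try:
        yield
    finally:
        # Roll back even if the experiment itself raises.
        if restore_to is None:
            lambda_client.delete_function_concurrency(FunctionName=function_name)
        else:
            lambda_client.put_function_concurrency(
                FunctionName=function_name,
                ReservedConcurrentExecutions=restore_to,
            )
```

The experiment body runs inside the `with throttled(...)` block, for example invoking the upstream function and checking that it handles the throttling errors gracefully.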
  129. 129. Recap
  130. 130. failures are INEVITABLE
  131. 131. the only way to truly know your system’s resilience against failures is to test it through CONTROLLED experiments
  132. 132. the goal of chaos engineering is NOT to actually break production
  133. 133. CONTAINMENT should be front and centre of your thinking
  134. 134. STEP 1. Define “steady state” aka. what does a normal, working condition look like?
  135. 135. STEP 2. Hypothesise steady state will continue in both control group & the experiment group i.e. you should have a reasonable degree of confidence the system would handle the failure before you proceed with the experiment
  136. 136. STEP 3. Inject realistic failures e.g. server crash, network error, HD malfunction, etc.
  137. 137. STEP 4. Disprove hypothesis i.e. look for difference in steady state
  138. 138. there is more inherent chaos and complexity in a serverless application
  139. 139. even without servers, you can still inject CONTROLLED failures at the application level
  140. 140. @theburningmonk theburningmonk.com github.com/theburningmonk
  141. 141. API Gateway and Kinesis Authentication & authorisation (IAM, Cognito) Testing Running & Debugging functions locally Log aggregation Monitoring & Alerting X-Ray Correlation IDs CI/CD Performance and Cost optimisation Error Handling Configuration management VPC Security Leading practices (API Gateway, Kinesis, Lambda) Canary deployments http://bit.ly/prod-ready-serverless get 40% off with: ytcui
