
Applying Principles of Chaos Engineering to Serverless (DVC305) - AWS re:Invent 2018

Chaos engineering focuses on improving system resilience through controlled experiments, exposing the inherent chaos and failure modes in our system before they manifest in production and impact users. However, most of the publicized tools and articles focus on killing Amazon EC2 instances, and the efforts in the serverless community have been largely limited to moving those tools into Lambda functions. How can we apply the same principles of chaos to a serverless architecture built around AWS Lambda functions? Can we adapt existing practices to expose the inherent chaos in these systems? What are the limitations and new challenges that we need to consider? Come to this session and find out.

This session is part of re:Invent Developer Community Day, a series led by AWS enthusiasts who share first-hand, technical insights on trending topics.

  1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Applying Principles of Chaos Engineering to Serverless. Yan Cui, Principal Engineer, DAZN. DVC305
  2. Agenda: What is chaos engineering? New challenges with serverless. Applying latency injection to serverless. Applying error injection to serverless.
  3. After the talk: Slides will be shared on Slideshare. Recording will be posted on YouTube within 48 hours. Find the links on https://theburningmonk.com/reinvent2018
  4. What is chaos engineering?
  5. "Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production." - principlesofchaos.org
  6. Smallpox: Earliest evidence of the disease is in a third century BC Egyptian mummy. Estimated 400K deaths per year in eighteenth century Europe.
  7. History of vaccination: First vaccine was developed in 1798 by Edward Jenner. https://en.wikipedia.org/wiki/Edward_Jenner
  8. History of vaccination: WHO certified the global eradication of smallpox in 1980. https://en.wikipedia.org/wiki/Edward_Jenner
  9. History of vaccination. https://en.wikipedia.org/wiki/Vaccine
  10. History of vaccination: Vaccination is the most effective method to prevent infectious diseases.
  11. History of vaccination: Vaccines stimulate the immune system to recognize and destroy a disease before the body contracts it for real.
  12. Chaos engineering: Use controlled experiments to inject failures into our system.
  13. Chaos engineering: Helps us learn about our system’s behavior and uncover unknown failure modes before they spread like wildfire in production.
  14. Chaos engineering: Lets us build confidence in the system’s ability to withstand turbulent conditions.
  15. Chaos engineering is the vaccine to frailties in modern software.
  16. Who am I? Principal engineer at DAZN. AWS Serverless Hero. Author of the Production-Ready Serverless* course by Manning. Blogger**, speaker. * https://bit.ly/production-ready-serverless ** https://theburningmonk.com
  17. About DAZN: Available in seven countries (Austria, Switzerland, Germany, Japan, Canada, Italy, and USA). Available on 30+ platforms.
  18. About DAZN: Around 1,000,000 concurrent viewers at peak.
  19. Chaos engineering has an image problem.
  20. Chaos engineering has an image problem: Too much emphasis is on breaking things.
  21. Chaos engineering has an image problem: It is easy to conflate the action of injecting failures with the payback.
  22. Chaos engineering has an image problem: The goal is to learn about the system and build confidence.
  23. Chaos engineering has an image problem: The goal is not to break things.
  24. Chaos in practice: Four steps to start running chaos experiments yourself.
  25. STEP 1. Define “steady state”: What does the normal, working condition look like?
  26. This is not a steady state.
  27. STEP 2. Hypothesize that steady state will continue in both the control group and the experiment group. In other words, you should have a reasonable degree of confidence the system would handle the failure before you proceed with the experiment.
  28. Chaos in practice: Explore unknown unknowns away from production.
  29. Chaos in practice: Experiments that graduate to production should be carefully considered and planned. You should have reasonable confidence in the system before running experiments in production.
  30. Chaos in practice: Treat production with the care it deserves. The goal is not to break things.
  31. Chaos in practice: If you knew the system would break and you did it anyway, then it’s not a chaos experiment! It’s called being irresponsible.
  32. STEP 3. Inject realistic failures: For example, server crash, network error, hard drive malfunction, and more.
  33. Chaos in practice: Netflix’s Simian Army: https://github.com/Netflix/SimianArmy Chaos Engineering ebook (O’Reilly): http://oreil.ly/2tZU1Sn
  34. STEP 4. Disprove hypothesis: In other words, look for a difference in steady state.
  35. Chaos in practice: Look for evidence that steady state was impacted by the injected failure.
  36. Chaos in practice: Address weaknesses before failures happen for real.
  37. Containment: Experiments need to be controlled. The goal is not to break things.
  38. Containment: Ensure everyone knows what you are doing. Don’t surprise your teammates.
  39. Containment: Run experiments during office hours.
  40. Containment: Avoid important dates.
  41. Containment: Make the smallest change necessary to prove or disprove the hypothesis.
  42. Containment: Have a rollback plan. Stop the experiment right away if things start to go wrong.
  43. Containment: Don’t start in production. You can learn a lot by running experiments in staging.
  44. By Russ Miles (@russmiles). Source: https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  45. New challenges with serverless
  46. Chaos Monkey kills an Amazon Elastic Compute Cloud (Amazon EC2) instance. Latency Monkey induces artificial delay in APIs. Chaos Gorilla kills an AWS Availability Zone. Chaos Kong kills an entire AWS Region.
  48. Serverless challenges: There are no servers that you can access and kill.
  51. There is more inherent chaos and complexity in a serverless architecture.
  52. Serverless challenges: Smaller units of deployment, but a lot more of them.
  53. Serverless challenges (diagram labels: serverful, serverless).
  54. Serverless challenges: Every function needs to be correctly configured and secured.
  55. Serverless challenges (diagram labels: Kinesis, ?, SNS, CloudWatch Events, CloudWatch Logs, IoT Core, DynamoDB, S3, SES).
  56. Serverless challenges: A lot of managed, intermediate services, each with its own set of failure modes.
  57. Serverless challenges: Unknown failure modes in the infrastructure we don’t control.
  58. Serverless challenges: Often there’s little we can do when an outage occurs in the platform.
  59. Common weaknesses
  60. Common weaknesses: Improperly tuned timeouts.
  61. Common weaknesses: Missing error handling.
  62. Common weaknesses: Missing fallback.
  63. Common weaknesses: Missing regional failover.
  64. Latency injection with serverless
  65. STEP 1. Define “steady state”: What does the normal, working condition look like?
  66. Defining steady state: What metrics do you use?
  67. Defining steady state: p95/p99 latencies, error count, backlog size, yield*, harvest**. * percentage of requests completed ** completeness of the returned response
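
The steady-state metrics on slide 67 can be made concrete with a small calculation. The sketch below is illustrative and not taken from the talk: the `Request` record, its field names, and the nearest-rank percentile method are all assumptions, and backlog size is left out because it depends on the queue or stream in use.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical per-request record; field names are assumptions.
    latency_ms: float
    completed: bool        # did the request return a response at all?
    fields_returned: int   # parts of the response that were populated
    fields_expected: int   # parts a complete response would contain


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def steady_state(requests):
    """Summarize a window of requests (assumes at least one completed request)."""
    completed = [r for r in requests if r.completed]
    latencies = [r.latency_ms for r in completed]
    return {
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_count": len(requests) - len(completed),
        # yield: fraction of requests completed
        "yield": len(completed) / len(requests),
        # harvest: completeness of the responses that were returned
        "harvest": sum(r.fields_returned for r in completed)
        / sum(r.fields_expected for r in completed),
    }
```

Computing the same summary for a control group and an experiment group gives two dictionaries that can be compared directly once a failure has been injected.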
  68. STEP 2. Hypothesize that steady state will continue in both the control group and the experiment group. In other words, you should have a reasonable degree of confidence the system would handle the failure before you proceed with the experiment.
  69. Serverless considerations: API Gateway.
  70. Serverless considerations: Consider the effect of cold starts. How do they affect your strategy for handling slow responses?
  73. Request timeouts: Strategy should: 1. Give requests the best chance to succeed. 2. Not allow a slow response to time out the calling function.
  74. Request timeouts: Finding the right timeout value is tricky.
  75. Request timeouts: Too short: requests are not given the best chance to succeed.
  76. Request timeouts: Too long: risk timing out the calling function.
  77. Request timeouts: Even more complicated when you have multiple integration points.
  78. Approach 1: Split invocation time equally (for example, 3 requests, 6s function timeout = 2s timeout per request).
  79. Approach 2: Every request is given nearly all the invocation time (for example, 3 requests, 6s function timeout = 5s timeout per request).
  80. Request timeouts: Proposal: set request timeouts dynamically based on invocation time left.
  82. Set timeout based on remaining invocation time.
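
The code shown on these slides did not survive in the transcript, so the following is only a minimal sketch of the proposal, written in Python rather than whatever language the talk used. `get_remaining_time_in_millis()` is the method the AWS Lambda Python runtime provides on the context object; the 500 ms recovery buffer, the function names, and `FakeContext` are assumptions made for illustration.

```python
RECOVERY_BUFFER_MS = 500  # assumption: time held back for logging, metrics and a fallback


def request_timeout_ms(remaining_ms, buffer_ms=RECOVERY_BUFFER_MS):
    """Give the next request all the remaining invocation time minus a recovery buffer."""
    return max(0, remaining_ms - buffer_ms)


def timeout_for_next_request(context):
    # The Lambda context object reports how long the invocation has left.
    return request_timeout_ms(context.get_remaining_time_in_millis())


class FakeContext:
    """Stand-in for the Lambda context so the sketch can run outside Lambda."""

    def __init__(self, remaining_ms):
        self.remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self):
        return self.remaining_ms


if __name__ == "__main__":
    # With a 6s function timeout, the first request may use up to 5.5s.
    print(timeout_for_next_request(FakeContext(6000)))
    # If that request took 1s, the second one is capped at 4.5s.
    print(timeout_for_next_request(FakeContext(5000)))
```

Unlike the two fixed approaches, a fast first request leaves more time for later ones, and no request can outlive the function. A result of 0 means there is no time left to try, so the caller should go straight to its fallback.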
  84. Recovery steps: Log the timeout with as much context as possible: the API, timeout value, correlation IDs, request object, and more.
  85. Recovery steps: Record custom metrics.
  86. Recovery steps: Use fallbacks.
  88. Recovery steps: Be mindful when you sacrifice precision for availability. User experience is king.
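
The three recovery steps can be combined in one helper. This is a hedged sketch, not code from the talk: `call_with_recovery`, the in-memory `timeout_counts` dictionary (standing in for a real metrics client), and the assumption that the wrapped call raises Python's built-in `TimeoutError` are all illustrative choices.

```python
import json
import logging

logger = logging.getLogger("recovery")
timeout_counts = {}  # assumption: in-memory stand-in for a custom metrics client


def call_with_recovery(api_name, call, fallback, timeout_s, correlation_id, request):
    """Run call(request, timeout_s); on timeout, log, count, and return a fallback."""
    try:
        return call(request, timeout_s)
    except TimeoutError:
        # 1. Log the timeout with as much context as possible.
        logger.warning(json.dumps({
            "message": "request timed out",
            "api": api_name,
            "timeout_s": timeout_s,
            "correlation_id": correlation_id,
            "request": request,
        }, default=str))
        # 2. Record a custom metric.
        timeout_counts[api_name] = timeout_counts.get(api_name, 0) + 1
        # 3. Use a fallback, trading precision for availability.
        return fallback(request)
```

The fallback might return a cached or default response. Whether that is acceptable depends on what the user would see, which is the point of the last slide in this group.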
  89. STEP 3. Inject realistic failures: For example, server crash, network error, hard drive malfunction, and more.
  90. Where to inject latency?
  91. Hypothesis: The function has appropriate timeouts on its HTTP communications and can degrade gracefully when these requests time out.
  93. Where to inject latency? Should be applied to third-party services too: DynamoDB, Twilio, Auth0 …
  95. Where to inject latency? Be mindful of the blast radius of the experiment. The goal is not to break things.
  96. Where to inject latency? (diagram labels: http client, public-api-a, http client, public-api-b, internal-api)
  97. Hypothesis: All functions have appropriate timeouts on their HTTP communications to this internal API and can degrade gracefully when requests time out.
  100. Where to inject latency? Large blast radius, can cause cascading failures unintentionally.
  102. Priming (psychology): Priming is a technique whereby exposure to one stimulus influences a response to a subsequent stimulus, without conscious guidance or intention. It is a technique in psychology used to train a person's memory in both positive and negative ways.
  103. Use failure injection to program your colleagues into thinking about failure modes early.
  104. Where to inject latency? Make X% of all requests slow in the dev environment.
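
Making X% of requests slow in the dev environment amounts to a probabilistic delay guarded by a stage check. The sketch below is an assumption-laden illustration: the function name, the `stage` string, and the uniform delay range are not from the talk. The random source and the sleep function are injectable so that the behaviour can be tested without waiting.

```python
import random
import time


def maybe_inject_latency(stage, rate, min_ms, max_ms, rng=random, sleep=time.sleep):
    """In dev only, delay roughly `rate` (0.0 to 1.0) of calls; return the delay in ms."""
    if stage != "dev" or rng.random() >= rate:
        return 0.0
    delay_ms = rng.uniform(min_ms, max_ms)
    sleep(delay_ms / 1000)
    return delay_ms


if __name__ == "__main__":
    # Slow down about 10% of requests by 1 to 3 seconds in dev.
    injected = maybe_inject_latency("dev", rate=0.1, min_ms=1000, max_ms=3000)
    print(f"injected {injected:.0f} ms")
```

Calling this at the top of a handler, or just before each outbound request, keeps the blast radius limited to a single environment.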
  105. Hypothesis: The client app has an appropriate timeout on its HTTP communication with the server and can degrade gracefully when requests time out.
  106. Where to inject latency?
  110. STEP 4. Disprove hypothesis: In other words, look for a difference in steady state.
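
Looking for a difference in steady state can be expressed as a comparison between the control group and the experiment group. The following is a minimal sketch under stated assumptions: the metric names, the absolute tolerances, and the function name are hypothetical, and a real experiment would likely need a statistically sounder comparison.

```python
def steady_state_deviations(control, experiment, tolerances):
    """Return the metrics whose experiment value differs from control by more than its tolerance."""
    deviations = {}
    for metric, tolerance in tolerances.items():
        difference = abs(experiment[metric] - control[metric])
        if difference > tolerance:
            deviations[metric] = difference
    return deviations


if __name__ == "__main__":
    control = {"p99_ms": 200, "yield": 0.999}
    experiment = {"p99_ms": 950, "yield": 0.998}
    tolerances = {"p99_ms": 100, "yield": 0.005}
    # A non-empty result is evidence against the hypothesis.
    print(steady_state_deviations(control, experiment, tolerances))
```

An empty result does not prove the system is resilient; it only means this experiment failed to disprove the hypothesis.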
  111. How to inject latency?
  112. How to inject latency? Static weavers (such as PostSharp, AspectJ). Dynamic proxies.
  113. https://theburningmonk.com/2015/04/design-for-latency-issues/
  114. How to inject latency? Manually crafted wrapper libraries.
  120. Configured in SSM Parameter Store.
  122. No injected latency.
  124. With injected latency.
  126. Factory wrapper function (think bluebird’s promisifyAll function).
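
The code on the surrounding slides is not in the transcript. As an illustration of the factory wrapper idea, the sketch below wraps every method of an existing client so that each call first passes through latency injection, in the same spirit as `promisifyAll` wrapping every method of an object. It is written in Python as a dynamic proxy; `with_injected_latency`, the config keys `enabled` and `delay_ms`, and the `get_config` callable are assumptions. In the talk the configuration lives in SSM Parameter Store, so `get_config` could be a cached read from there.

```python
import time


class LatencyProxy:
    """Forwards attribute access to `target`, delaying method calls when enabled."""

    def __init__(self, target, get_config, sleep=time.sleep):
        self._target = target
        self._get_config = get_config
        self._sleep = sleep

    def __getattr__(self, name):
        # Only reached for names not defined on the proxy itself.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def wrapped(*args, **kwargs):
            # Re-read the config on every call so it can be toggled at runtime.
            config = self._get_config()
            if config.get("enabled"):
                self._sleep(config["delay_ms"] / 1000)
            return attr(*args, **kwargs)

        return wrapped


def with_injected_latency(target, get_config, sleep=time.sleep):
    """Factory: return a drop-in replacement for `target` with latency injection."""
    return LatencyProxy(target, get_config, sleep)
```

Because the proxy exposes the same methods as the wrapped client, calling code does not change, and turning the experiment off is a configuration update rather than a deployment.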
  132. Error injection with serverless
  133. Common errors: HTTP 5XX. Amazon DynamoDB provisioned throughput exceeded. Throttled AWS Lambda invocations.
  134. Hypothesis: The function has appropriate error handling on its HTTP communications and can degrade gracefully when downstream dependencies fail.
  135. Where to inject errors?
  136. Hypothesis: The function has appropriate error handling on DynamoDB operations and can degrade gracefully when DynamoDB provisioned throughput is exceeded.
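
Error injection can follow the same wrapper pattern as latency injection: a fraction of calls raise an error instead of reaching the dependency. The sketch below is illustrative only. `InjectedThroughputExceeded` is a stand-in exception defined here, not the error class a real AWS SDK raises, and `inject_errors` and `get_item_or_default` are hypothetical names.

```python
import random


class InjectedThroughputExceeded(Exception):
    """Stand-in for a DynamoDB 'provisioned throughput exceeded' error."""


def inject_errors(func, rate, error_factory, rng=random):
    """Wrap `func` so that roughly `rate` (0.0 to 1.0) of calls raise instead of running."""
    def wrapped(*args, **kwargs):
        if rng.random() < rate:
            raise error_factory()
        return func(*args, **kwargs)
    return wrapped


def get_item_or_default(get_item, key, default):
    """The behaviour under test: degrade gracefully when the read is throttled."""
    try:
        return get_item(key)
    except InjectedThroughputExceeded:
        return default
```

Running the function under test against the wrapped `get_item` shows whether its error handling matches the hypothesis before real throttling ever happens.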
  138. Where to inject errors? Induce Lambda throttling by temporarily setting reserved concurrency.
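
A temporary change is safest when the rollback is automatic. The sketch below assumes a boto3 Lambda client is passed in (for example `boto3.client("lambda")`) and uses its `get_function_concurrency`, `put_function_concurrency`, and `delete_function_concurrency` operations. The context manager name and the default limit of 1 are assumptions, not something shown in the talk.

```python
from contextlib import contextmanager


@contextmanager
def temporary_reserved_concurrency(lambda_client, function_name, reserved=1):
    """Lower a function's reserved concurrency for the duration of the block, then restore it."""
    current = lambda_client.get_function_concurrency(FunctionName=function_name)
    previous = current.get("ReservedConcurrentExecutions")  # None if nothing was set
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=reserved,
    )
    try:
        yield
    finally:
        # Roll back even if the experiment raises.
        if previous is None:
            lambda_client.delete_function_concurrency(FunctionName=function_name)
        else:
            lambda_client.put_function_concurrency(
                FunctionName=function_name,
                ReservedConcurrentExecutions=previous,
            )
```

Wrapping the experiment in a `with temporary_reserved_concurrency(client, "my-function"):` block (the function name is a placeholder) matches the containment advice earlier in the deck: make the smallest change necessary and have a rollback plan.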
  139. Recap
  140. Failures are INEVITABLE.
  141. The only way to truly know your system’s resilience against failures is to test it through CONTROLLED experiments.
  142. The goal of chaos engineering is NOT to actually break production.
  143. CONTAINMENT should be front and centre of your thinking.
  144. STEP 1. Define “steady state”: What does the normal, working condition look like?
  145. STEP 2. Hypothesize that steady state will continue in both the control group and the experiment group. In other words, you should have a reasonable degree of confidence the system would handle the failure before you proceed with the experiment.
  146. STEP 3. Inject realistic failures: For example, server crash, network error, hard drive malfunction, and more.
  147. STEP 4. Disprove hypothesis: In other words, look for a difference in steady state.
  148. There is more inherent chaos and complexity in a serverless application.
  149. Even without servers, you can still inject CONTROLLED failures at the application level.
  154. Thank you! Yan Cui, @theburningmonk, https://theburningmonk.com
  155. Related breakouts: Wednesday, Nov 28: SRV425-R - Best Practices for Building Multi-Region, Active-Active Serverless Applications, 4:00PM – 5:00PM | Venetian, Level 4, Lando 4305. Wednesday, Nov 28: SRV343-R - Best Practices for Safe Deployments on AWS Lambda and Amazon API Gateway, 4:45PM – 5:45PM | MGM, Level 1, South Concourse 105. Thursday, Nov 29: ARC308 - Chaos Engineering and Scalability at Audible.com, 1:00PM – 2:00PM | Aria West, Level 3, Ironwood 5.
  156. Please complete the session survey in the mobile app.