How to bring chaos engineering to serverless

You might have heard about chaos engineering in the context of Netflix and Amazon, and how they kill EC2 servers in production at random to verify that their systems can stay up in the face of infrastructure failures. But did you know that the same ideas can be applied to serverless applications? Yes, despite not having access to the underlying servers, we can still apply principles of chaos engineering to uncover failure modes in our system (and there are plenty!) so we can build a defence against them and make our serverless applications more robust and more resilient!

Published in: Technology

  1. October 20-21-22, 2020. How to bring Chaos Engineering to Serverless. Yan Cui – Developer Advocate, Lumigo
  2. Chaos Engineering?
  3. MUST KILL SERVERS! RAWR!! RAWR!!
  4. @theburningmonk theburningmonk.com “the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production” (principlesofchaos.org)
  5. microservices death stars circa 2015
  7. resilience /rɪˈzɪlɪəns/ (noun): “the capacity to recover quickly from difficulties; toughness.”
  8. resilience /rɪˈzɪlɪəns/ (noun): “the capacity to recover quickly from difficulties; toughness.” It’s not about preventing failures!
  9. everything fails, all the time
  10. “You don't choose the moment, the moment chooses you! You only choose how prepared you are when it does.” (Fire Chief Mike Burtch)
  12. Murphy’s Law: anything that can go wrong, will go wrong.
  13. GOAL: identify weaknesses before they manifest in system-wide, aberrant behaviors
  14. HOW: learn about the system’s behavior by observing it during a controlled experiment
  15. HOW: learn about the system’s behavior by observing it during a controlled experiment (game days, failure injection)
  16. Yan Cui http://theburningmonk.com @theburningmonk AWS user for 10 years
  17. http://bit.ly/yubl-serverless
  18. Developer Advocate @
  19. Independent Consultant: advise, training, delivery
  20. theburningmonk.com/courses
  22. realworldserverless.com
  23. “using serverless reduces the blast radius” (www.buzzsprout.com/877747/4615985)
  24. serverless improves resilience as the platform takes care of infrastructure failures
  26. by Russ Miles @russmiles, source: https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  28. Shared Responsibility Model
  29. by Russ Miles @russmiles, source: https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  30. chaos monkey kills an EC2 instance; latency monkey induces artificial delay in APIs; chaos gorilla kills an AWS Availability Zone; chaos kong kills an entire AWS region
  33. SERVERLESS: there are no servers to kill!
  34. improperly tuned timeouts
  35. missing error handling
  36. missing fallbacks
  38. STEP 1. define steady state, i.e. “what does normal look like”
  39. STEP 2. hypothesize that steady state continues in both the control and the experimental group, e.g. “the system stays up if a server dies”
  40. STEP 3. inject realistic failures, e.g. “slow response from 3rd-party service”
  41. STEP 4. try to disprove the hypothesis, i.e. “look for a difference between the control and the experimental group”
  42. latency: inject latency to function invocation
  43. “what if service X has elevated latency?”
  44. API Gateway → Lambda → API Gateway → Lambda
  46. hypothesis: the API would time out, and our try-catch would handle it and return a default response
  49. result: function times out after 6s (hypothesis is disproved)
  51. API Gateway → Lambda → API Gateway → Lambda (labels: 502, 200)
  52. API Gateway → Lambda → API Gateway → Lambda (labels: 3s timeout, 6s timeout)
  53. API Gateway: max 29s integration timeout; Lambda: max 15 mins timeout
  54. and then there are cold starts…
  55. TIL: most HTTP client libraries have a default timeout of 60s. API Gateway has an integration timeout of 29s. Most Lambda functions default to a timeout of 3-6s. Don’t forget about the cold starts!
  58. https://bit.ly/2Wvfort
  66. outcome: a more resilient system
  67. latency: inject latency to function invocation; exception: throws exception
  68. latency: inject latency to function invocation; exception: throws exception; statuscode: return HTTP status code
  69. latency: inject latency to function invocation; exception: throws exception; statuscode: return HTTP status code; diskspace: fills up /tmp directory
  70. latency: inject latency to function invocation; exception: throws exception; statuscode: return HTTP status code; diskspace: fills up /tmp directory; denylist: loses network connectivity
  71. “what if DynamoDB has an elevated error rate?”
  72. API Gateway → Lambda → DynamoDB
  74. hypothesis: the AWS SDK retries would handle it
  75. result: function times out after 6s (hypothesis is disproved)
  76. TIL: the js DynamoDB client defaults to 10 retries with a base delay of 50ms
  77. TIL: the js DynamoDB client defaults to 10 retries with a base delay of 50ms; delay = Math.random() * (Math.pow(2, retryCount) * base). This is Marc Brooker’s fav formula!
  80. action: set max retry count + fallback
  85. outcome: a more resilient system
  86. latency: inject latency to function invocation; exception: throws exception; statuscode: return HTTP status code; diskspace: fills up /tmp directory; denylist: loses network connectivity
  87. everything fails, all the time
  88. by Russ Miles @russmiles, source: https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  90. https://theburningmonk.com/hire-me Advise, Training, Delivery. “Fundamentally, Yan has improved our team by increasing our ability to derive value from AWS and Lambda in particular.” Nick Blair, Tech Lead
  91. github.com/theburningmonk
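
The first two failure types listed on slides 67-70 (latency and exception) can be sketched as a wrapper around a Lambda-style handler. This is only an illustration of the idea, not the tooling shown in the talk: the function name `withFailureInjection` and the config fields `isEnabled`, `rate`, `type`, `delayMs` and `message` are hypothetical names chosen for this example.

```javascript
// Hypothetical failure-injection wrapper (names are assumptions, not a real library API).
// config.rate is the fraction of invocations that get a failure injected (0 to 1).
function withFailureInjection(handler, config, random = Math.random) {
  return async (event, context) => {
    if (config.isEnabled && random() < config.rate) {
      if (config.type === "latency") {
        // Delay the invocation before running the real handler.
        await new Promise((resolve) => setTimeout(resolve, config.delayMs));
      } else if (config.type === "exception") {
        // Fail the invocation without running the real handler.
        throw new Error(config.message || "injected failure");
      }
    }
    return handler(event, context);
  };
}

// Usage: every invocation fails with the injected exception.
const handler = withFailureInjection(
  async () => ({ statusCode: 200, body: "ok" }),
  { isEnabled: true, rate: 1, type: "exception", message: "boom" }
);
handler({}, {}).catch((err) => console.log(err.message)); // boom
```

Keeping the injection behind an `isEnabled` flag and a `rate` makes it possible to limit the experiment to a small share of invocations and to switch it off quickly.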

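The latency experiment on slides 43-55 found that the function timed out before its try-catch could return a default response. One possible remedy, sketched below under stated assumptions, is to give the downstream call a time budget derived from the invocation's remaining time, so the fallback still has time to run. `context.getRemainingTimeInMillis()` is the method the Lambda Node.js runtime provides on the context object; `withTimeout`, `callWithFallback` and the 500ms reserve are hypothetical choices for this example.

```javascript
// Reject if the promise does not settle within ms milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Call a downstream dependency, leaving reserveMs of the invocation's
// remaining time to return the fallback if the call is slow or fails.
async function callWithFallback(callDownstream, context, fallback, reserveMs = 500) {
  const budgetMs = Math.max(context.getRemainingTimeInMillis() - reserveMs, 0);
  try {
    return await withTimeout(callDownstream(), budgetMs);
  } catch (err) {
    return fallback;
  }
}

// Usage with a stubbed context: 600ms remaining leaves a 100ms budget,
// so the 1000ms downstream call is abandoned and the default is returned.
const fakeContext = { getRemainingTimeInMillis: () => 600 };
const slowCall = () =>
  new Promise((resolve) => setTimeout(() => resolve({ statusCode: 200, body: "live" }), 1000));
callWithFallback(slowCall, fakeContext, { statusCode: 200, body: "default" })
  .then((res) => console.log(res.body)); // default
```

Note that `withTimeout` only stops waiting; it does not cancel the underlying request, so a real HTTP client's own timeout setting would still be worth tuning.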
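The retry formula on slide 77 helps explain why the DynamoDB experiment ended in a function timeout. The sketch below adds up the largest possible delays, assuming `retryCount` starts at 0 and every retry is used; `retryDelay` and `worstCaseTotalDelay` are hypothetical helper names, and only the formula itself comes from the slides.

```javascript
// Delay formula as quoted on the slide:
//   delay = Math.random() * (Math.pow(2, retryCount) * base)
function retryDelay(retryCount, base, random = Math.random) {
  return random() * (Math.pow(2, retryCount) * base);
}

// Upper bound on the total time spent waiting between attempts,
// obtained by replacing the random factor with 1.
function worstCaseTotalDelay(maxRetries, base) {
  let total = 0;
  for (let retryCount = 0; retryCount < maxRetries; retryCount++) {
    total += retryDelay(retryCount, base, () => 1);
  }
  return total;
}

console.log(worstCaseTotalDelay(10, 50)); // 51150
console.log(worstCaseTotalDelay(3, 50)); // 350
```

Under these assumptions, 10 retries with a 50ms base can wait up to about 51 seconds in total (roughly half that on average), which is far beyond a 6s function timeout. Capping the retry count, as slide 80 suggests, brings the bound down to a fraction of a second.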