How to apply chaos engineering to serverless applications

Recording: https://youtu.be/7rZFiglir68

Real-world serverless podcast: https://realworldserverless.com
Learn Lambda best practices: https://lambdabestpractice.com
Blog: https://theburningmonk.com
Consulting services: https://theburningmonk.com/hire-me
Production-Ready Serverless workshop: https://productionreadyserverless.com

Abstract:
Chaos engineering is a discipline that focuses on improving system resilience through controlled experiments that expose the inherent chaos and failure modes in our system.

You might have heard about tools such as Netflix's Simian Army or Gremlin, which can inject different failures into your AWS environment to simulate different forms of infrastructure failures. But how can we apply the same principles to a serverless architecture where we have no access to the underlying infrastructure? Can we adapt existing practices to expose the inherent chaos in these systems? What are the limitations and new challenges that we need to consider?

How to apply chaos engineering to serverless applications

  1. 1. how to apply chaos engineering to Serverless applications Yan Cui, @theburningmonk
  2. 2. Chaos Engineering?
  3. 3. MUST KILL SERVERS! RAWR!! RAWR!!
  4. 4. @theburningmonk theburningmonk.com “the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production” principlesofchaos.org
  5. 5. @theburningmonk theburningmonk.com microservices death stars circa 2015
  7. 7. @theburningmonk theburningmonk.com “the capacity to recover quickly from difficulties; toughness.” resilience /rɪˈzɪlɪəns/ noun
  8. 8. @theburningmonk theburningmonk.com “the capacity to recover quickly from difficulties; toughness.” resilience /rɪˈzɪlɪəns/ noun it’s not about preventing failures!
  9. 9. everything fails, all the time
  10. 10. @theburningmonk theburningmonk.com “You don't choose the moment, the moment chooses you! You only choose how prepared you are when it does.” Fire Chief Mike Burtch
  12. 12. @theburningmonk theburningmonk.com anything that can go wrong, will go wrong. MURPHY’s LAW
  13. 13. @theburningmonk theburningmonk.com identify weaknesses before they manifest in system-wide, aberrant behaviors GOAL
  14. 14. @theburningmonk theburningmonk.com learn about the system’s behavior by observing it during controlled experiments HOW
  15. 15. @theburningmonk theburningmonk.com learn about the system’s behavior by observing it during controlled experiments HOW game days, failure injection
  16. 16. Yan Cui http://theburningmonk.com @theburningmonk AWS user for 10 years
  17. 17. Yan Cui http://theburningmonk.com @theburningmonk http://bit.ly/yubl-serverless
  18. 18. Yan Cui http://theburningmonk.com @theburningmonk Developer Advocate @
  19. 19. Yan Cui http://theburningmonk.com @theburningmonk Independent Consultant advise, training, delivery
  20. 20. theburningmonk.com/courses
  21. 21. theburningmonk.com/courses
  22. 22. realworldserverless.com
  24. 24. @theburningmonk theburningmonk.com by Russ Miles @russmiles source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  25. 25. @theburningmonk theburningmonk.com by Russ Miles @russmiles source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  26. 26. @theburningmonk theburningmonk.com Shared Responsibility Model
  27. 27. @theburningmonk theburningmonk.com by Russ Miles @russmiles source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  28. 28. @theburningmonk theburningmonk.com chaos monkey: kills an EC2 instance; latency monkey: induces artificial delay in APIs; chaos gorilla: kills an AWS Availability Zone; chaos kong: kills an entire AWS region
  31. 31. @theburningmonk theburningmonk.com there are no servers to kill! SERVERLESS
  32. 32. @theburningmonk theburningmonk.com improperly tuned timeouts
  33. 33. @theburningmonk theburningmonk.com missing error handling
  34. 34. @theburningmonk theburningmonk.com missing fallbacks
  36. 36. @theburningmonk theburningmonk.com STEP 1. define steady state i.e. “what does normal look like”
  37. 37. @theburningmonk theburningmonk.com STEP 2. hypothesize that steady state continues in both the control and experimental groups e.g. “the system stays up if a server dies”
  38. 38. @theburningmonk theburningmonk.com STEP 3. inject realistic failures e.g. “slow response from 3rd-party service”
  39. 39. @theburningmonk theburningmonk.com STEP 4. try to disprove hypothesis i.e. “look for differences between the control and experimental groups”
  40. 40. @theburningmonk theburningmonk.com latency inject latency to function invocation
  41. 41. @theburningmonk theburningmonk.com “what if service X has elevated latency?”
  42. 42. @theburningmonk theburningmonk.com API Gateway Lambda API Gateway Lambda
  44. 44. @theburningmonk theburningmonk.com hypothesis: API would time out and our try-catch would handle it and return default response
  46. 46. @theburningmonk theburningmonk.com result: function times out after 6s (hypothesis is disproved)
  48. 48. @theburningmonk theburningmonk.com API Gateway Lambda API Gateway Lambda 502 200
  49. 49. @theburningmonk theburningmonk.com API Gateway Lambda API Gateway Lambda 3s timeout 6s timeout
  50. 50. @theburningmonk theburningmonk.com API Gateway Lambda API Gateway Lambda max 29s integration max 15 mins timeout
  51. 51. @theburningmonk theburningmonk.com and then there’s cold starts…
  52. 52. @theburningmonk theburningmonk.com TIL: most HTTP client libraries have default timeout of 60s. API Gateway has an integration timeout of 29s. Most Lambda functions default to timeout of 3-6s. Don’t forget about the cold starts!
  55. 55. @theburningmonk theburningmonk.com https://bit.ly/2Wvfort
  63. 63. @theburningmonk theburningmonk.com outcome: a more resilient system
  64. 64. @theburningmonk theburningmonk.com latency: injects latency into function invocation; exception: throws an exception
  65. 65. @theburningmonk theburningmonk.com latency: injects latency into function invocation; exception: throws an exception; statuscode: returns an HTTP status code
  66. 66. @theburningmonk theburningmonk.com latency: injects latency into function invocation; exception: throws an exception; statuscode: returns an HTTP status code; diskspace: fills up the /tmp directory
  67. 67. @theburningmonk theburningmonk.com latency: injects latency into function invocation; exception: throws an exception; statuscode: returns an HTTP status code; diskspace: fills up the /tmp directory; blacklist: loses network connectivity
  68. 68. @theburningmonk theburningmonk.com “what if DynamoDB has an elevated error rate?”
  69. 69. @theburningmonk theburningmonk.com API Gateway Lambda DynamoDB
  71. 71. @theburningmonk theburningmonk.com hypothesis: the AWS SDK retries would handle it
  72. 72. @theburningmonk theburningmonk.com result: function times out after 6s (hypothesis is disproved)
  73. 73. @theburningmonk theburningmonk.com TIL: the js DynamoDB client defaults to 10 retries with base delay of 50ms
  74. 74. @theburningmonk theburningmonk.com TIL: the js DynamoDB client defaults to 10 retries with base delay of 50ms; delay = Math.random() * (Math.pow(2, retryCount) * base); this is Marc Brooker’s fav formula!
  77. 77. @theburningmonk theburningmonk.com action: set max retry count + fallback
  82. 82. @theburningmonk theburningmonk.com outcome: a more resilient system
  83. 83. @theburningmonk theburningmonk.com latency: injects latency into function invocation; exception: throws an exception; statuscode: returns an HTTP status code; diskspace: fills up the /tmp directory; blacklist: loses network connectivity
  84. 84. everything fails, all the time
  85. 85. @theburningmonk theburningmonk.com by Russ Miles @russmiles source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  87. 87. @theburningmonk theburningmonk.com homeschool.dev/class/production-ready-serverless
  88. 88. https://theburningmonk.com/hire-me Advise, Training, Delivery “Fundamentally, Yan has improved our team by increasing our ability to derive value from AWS and Lambda in particular.” Nick Blair, Tech Lead
  89. 89. @theburningmonk theburningmonk.com github.com/theburningmonk
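
Slides 40 to 52 describe injecting latency into a function invocation and finding that the function timed out after 6s instead of returning a default response. The sketch below shows one way such an experiment, and a possible fix, might look in JavaScript. It is an illustration under stated assumptions, not the tooling used in the talk. The names `withInjectedLatency`, `callWithTimeout`, and `requestTimeoutMs`, and the options `rate`, `delayMs`, and `reserveMs`, are hypothetical. `context.getRemainingTimeInMillis()` is the method the Lambda Node.js runtime provides on the context object.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Wraps a Lambda-style async handler and delays a fraction of invocations.
// `rate` is the share of invocations to delay (0 to 1), `delayMs` the added
// latency. `rand` is injectable so experiments can be made deterministic.
function withInjectedLatency(handler, { rate, delayMs, rand = Math.random }) {
  return async (event, context) => {
    if (rand() < rate) {
      await sleep(delayMs);
    }
    return handler(event, context);
  };
}

// Derives a timeout for a downstream call from the time left in the current
// invocation, keeping `reserveMs` back so the caller can still respond.
function requestTimeoutMs(context, reserveMs = 500) {
  return Math.max(0, context.getRemainingTimeInMillis() - reserveMs);
}

// Races a downstream call against a timer and returns `fallback` if the timer
// wins. The slow call is not cancelled, and a rejected call still throws.
async function callWithTimeout(call, timeoutMs, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), timeoutMs);
  });
  try {
    return await Promise.race([call(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

A handler under test might call `callWithTimeout(() => callServiceX(), requestTimeoutMs(context), defaultResponse)`, where `callServiceX` and `defaultResponse` stand in for the real downstream call and fallback. Tying the timeout to the remaining invocation time, rather than to a fixed number, is one way to account for the cold starts mentioned on slide 51.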

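Slides 73 and 74 give the JS DynamoDB client defaults (10 retries, 50ms base delay) and the backoff formula. The arithmetic below suggests why relying on SDK retries alone could exhaust a 6s function timeout. It is a sketch that assumes `retryCount` starts at 0 for the first retry and that no other cap applies to the delay; neither is stated on the slides. The function names `retryDelayMs` and `totalDelayMs` are hypothetical.

```javascript
// Backoff formula quoted on slide 74:
//   delay = Math.random() * (Math.pow(2, retryCount) * base)
// `rand` is injectable so the bounds can be computed deterministically.
function retryDelayMs(retryCount, baseMs, rand = Math.random) {
  return rand() * (Math.pow(2, retryCount) * baseMs);
}

// Sums the delays across all retries (assumes retryCount runs from 0).
// This counts only the waiting time between attempts, not the requests.
function totalDelayMs(maxRetries, baseMs, rand = Math.random) {
  let total = 0;
  for (let retryCount = 0; retryCount < maxRetries; retryCount++) {
    total += retryDelayMs(retryCount, baseMs, rand);
  }
  return total;
}

// Defaults from slide 73: 10 retries, 50ms base delay.
const worstCaseMs = totalDelayMs(10, 50, () => 1);  // 51150 (upper bound)
const averageMs = totalDelayMs(10, 50, () => 0.5);  // 25575 (expected value)

// With the retry count capped at 3, the upper bound drops sharply.
const cappedWorstCaseMs = totalDelayMs(3, 50, () => 1);  // 350

console.log({ worstCaseMs, averageMs, cappedWorstCaseMs });
```

Under these assumptions the expected waiting time across all 10 retries is about 25.6s, well beyond a 6s function timeout, which is consistent with the action on slide 77: set a max retry count and add a fallback.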