Applying principles of chaos engineering to Serverless (SRECon)

Chaos engineering is a discipline that focuses on improving system resilience through experiments that expose the inherent chaos and failure modes in our system, in a controlled fashion, before these failure modes manifest themselves like a wildfire in production and impact our users.

Netflix is undoubtedly the leader in this field, but many of the publicised tools and articles focus on killing EC2 instances, and the efforts in the serverless community have been largely limited to moving those tools into AWS Lambda functions.

But how can we apply the same principles of chaos to a serverless architecture built around AWS Lambda functions?

These serverless architectures have more inherent chaos and complexity than their serverful counterparts, and we have less control over their runtime behaviour. In short, there are far more unknown unknowns with these systems.

Can we adapt existing practices to expose the inherent chaos in these systems? What are the limitations and new challenges that we need to consider?

Published in: Technology

  1. 1. Applying principles of chaos engineering to Serverless
  2. 2. history of Smallpox: earliest evidence of disease in 3rd Century BC Egyptian Mummy; est. 400K deaths per year in 18th Century Europe
  3. 3. history of Smallpox: earliest evidence of disease in 3rd Century BC Egyptian Mummy; est. 400K deaths per year in 18th Century Europe; 1798: first vaccine developed (Edward Jenner)
  4. 4. history of Smallpox: earliest evidence of disease in 3rd Century BC Egyptian Mummy; est. 400K deaths per year in 18th Century Europe; 1798: first vaccine developed (Edward Jenner); 1980: WHO certified global eradication
  5. 5. Vaccination is the most effective method of preventing infectious diseases
  6. 6. stimulates the immune system to recognize and destroy the disease before contracting the disease for real
  7. 7. Chaos Engineering: controlled experiments to help us learn about our system’s behaviour and build confidence in its ability to withstand turbulent conditions
  8. 8. Yan Cui http://theburningmonk.com @theburningmonk Principal Engineer @
  10. 10. “Netflix for sports” offices in London, Leeds, Katowice and Amsterdam
  11. 11. available in Austria, Switzerland, Germany, Japan, Canada and Italy; US coming soon ;-)
  12. 12. available on 30+ platforms
  13. 13. ~500,000 concurrent viewers
  14. 14. “Netflix for sports”, offices in London, Leeds, Katowice and Amsterdam. We’re hiring! Visit engineering.dazn.com to learn more. Follow @DAZN_ngnrs for updates about the engineering team.
  15. 15. Why did you break production?
  16. 16. Because I can!
  17. 17. Kolton Andrus, CEO of Gremlin; Russ Miles, CEO of ChaosIQ; Nora Jones, Chaos Engineer at Netflix
  19. 19. it’s about building confidence, NOT breaking things
  20. 20. http://principlesofchaos.org
  21. 21. STEP 1. define “Steady State”, aka. what does a normal, working condition look like?
  22. 22. this is not a steady state
  23. 23. STEP 2. hypothesize that steady state will continue in both the control group & the experiment group, i.e. you should have a reasonable degree of confidence that the system would handle the failure before you proceed with the experiment
  24. 24. explore unknown unknowns away from production
  25. 25. treat production with the care it deserves
  26. 26. the goal is NOT to actually hurt production
  27. 27. If you know the system would break, and you did it anyway… then it’s NOT a chaos experiment. It’s called being IRRESPONSIBLE.
  28. 28. STEP 3. inject realistic failures e.g. server crash, network error, HD malfunction, etc.
  29. 29. https://github.com/Netflix/SimianArmy
  30. 30. https://github.com/Netflix/SimianArmy http://oreil.ly/2tZU1Sn
  31. 31. STEP 4. disprove hypothesis, i.e. look for a difference from steady state
  32. 32. if a WEAKNESS is uncovered, IMPROVE it before the behaviour manifests in the system at large
  33. 33. Chaos Engineering: controlled experiments to help us learn about our system’s behaviour and build confidence in its ability to withstand turbulent conditions
  35. 35. communication
  36. 36. ensure everyone knows what you’re doing
  37. 37. ensure everyone knows what you’re doing. NO surprises!
  38. 38. communication, timing
  39. 39. run experiments during office hours
  40. 40. AVOID important dates
  41. 41. communication, timing, contain blast radius
  42. 42. smallest change that allows you to detect a signal that steady state is disrupted
  43. 43. rollback at the first sign of TROUBLE!
  44. 44. communication, timing, contain blast radius
  45. 45. don’t try to run before you know how to walk.
  46. 46. by Russ Miles (@russmiles), source: https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  47. 47. chaos monkey kills an EC2 instance; latency monkey induces artificial delay in APIs; chaos gorilla kills an AWS Availability Zone; chaos kong kills an entire AWS region
  48. 48. there is no server…
  49. 49. there is no server… that you can kill
  50. 50. there is more inherent chaos and complexity in a Serverless architecture
  51. 51. smaller units of deployment but A LOT more of them!
  52. 52. more difficult to harden around boundaries (serverful vs. serverless)
  53. 53. ? SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES
  54. 54. ? SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES: more intermediary services, and greater variety too
  55. 55. ? SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES: more intermediary services, and greater variety too, each with its own set of failure modes
  56. 56. serverful vs. serverless: more configurations, more opportunities for misconfiguration
  57. 57. more unknown failure modes in infrastructure that we don’t control
  58. 58. often there’s little we can do when an outage occurs in the platform
  59. 59. improperly tuned timeouts
  60. 60. missing error handling
  61. 61. missing fallback when downstream is unavailable
  62. 62. LATENCY INJECTION
  63. 63. STEP 1. define “Steady State”, aka. what does a normal, working condition look like?
  64. 64. what metrics do you monitor?
  65. 65. 9X-percentile latency, error count, yield (% of requests completed), harvest (completeness of results)
  66. 66. STEP 2. hypothesize that steady state will continue in both the control group & the experiment group, i.e. you should have a reasonable degree of confidence that the system would handle the failure before you proceed with the experiment
  67. 67. API Gateway
  68. 68. consider the effect of cold-starts & API Gateway overhead
  69. 69. use short timeout for API calls
  70. 70. the goal of a timeout strategy is to give HTTP requests the best chance to succeed, provided that doing so does not cause the calling function itself to err
  71. 71. fixed timeouts are tricky to get right…
  72. 72. fixed timeouts are tricky to get right… too short, and you don’t give requests the best chance to succeed
  73. 73. fixed timeouts are tricky to get right… too long, and you run the risk of letting the request time out the calling function
  74. 74. and it gets worse when you make multiple API calls in one function…
  75. 75. set the request timeout based on the amount of invocation time left
  76. 76. log the timeout incident with as much context as possible e.g. timeout value, correlation IDs, request object, …
  77. 77. report custom metrics
  78. 78. be mindful when you sacrifice precision for availability; user experience is king
  79. 79. STEP 3. inject realistic failures e.g. server crash, network error, HD malfunction, etc.
  80. 80. where to inject latency?
  81. 81. hypothesis: function has appropriate timeouts on its HTTP communications and can degrade gracefully when these requests time out
  82. 82. should also be applied to 3rd-party services we depend on, e.g. DynamoDB
  83. 83. what’s the blast radius?
  84. 84. public-api-a (http client), public-api-b (http client), internal-api
  85. 85. hypothesis: all functions have appropriate timeouts on their HTTP communications to this internal API, and can degrade gracefully when requests time out
  86. 86. large blast radius, risky…
  87. 87. could be effective when used away from production environment, to weed out weaknesses quickly
  88. 88. not priming developers to build more resilient systems
  89. 89. development
  90. 90. development, production
  91. 91. Priming (psychology): Priming is a technique whereby exposure to one stimulus influences a response to a subsequent stimulus, without conscious guidance or intention. It is a technique in psychology used to train a person's memory both in positive and negative ways.
  92. 92. make dev environments better resemble the turbulent conditions you should realistically expect your system to survive in production
  93. 93. hypothesis: the client app has appropriate timeouts on its HTTP communication with the server, and can degrade gracefully when requests time out
  94. 94. STEP 4. disprove hypothesis, i.e. look for a difference from steady state
  95. 95. how to inject latency?
  96. 96. static weaver (e.g. AspectJ, PostSharp), or dynamic proxies
  97. 97. manually crafted wrapper library
  98. 98. configured in SSM Parameter Store
  99. 99. no injected latency
  100. 100. with injected latency
  101. 101. factory wrapper function (think bluebird’s promisifyAll function)
  102. 102. ERROR INJECTION
  103. 103. failures are INEVITABLE
  104. 104. the only way to truly know your system’s resilience against failures is to test it through controlled experiments
  105. 105. vaccinate your serverless architecture against failures
  106. 106. Yan Cui http://theburningmonk.com @theburningmonk
  107. 107. “Netflix for sports”, offices in London, Leeds, Katowice and Amsterdam. We’re hiring! Visit engineering.dazn.com to learn more. Follow @DAZN_ngnrs for updates about the engineering team.
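
Slide 65 names 9X-percentile latency, error count, yield and harvest as the metrics that describe steady state. The following is a minimal Python sketch of how those four numbers might be computed from a batch of request records. The `Request` fields, the nearest-rank percentile method and the harvest formula (items returned over items expected, across completed requests) are assumptions made for illustration; the slides do not specify them.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record shape, not taken from the talk.
    latency_ms: float     # observed end-to-end latency
    succeeded: bool       # False if the request errored or timed out
    items_returned: int   # how much of the result actually came back
    items_expected: int   # how much a complete answer would contain


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def steady_state(requests, pct=95):
    """Summarise a batch of requests into the four steady-state metrics."""
    if not requests:
        raise ValueError("need at least one request to describe steady state")
    completed = [r for r in requests if r.succeeded]
    expected = sum(r.items_expected for r in completed)
    returned = sum(r.items_returned for r in completed)
    return {
        f"p{pct}_latency_ms": percentile([r.latency_ms for r in requests], pct),
        "error_count": len(requests) - len(completed),
        "yield": len(completed) / len(requests),               # share of requests completed
        "harvest": returned / expected if expected else 0.0,   # completeness of results
    }
```

Computing the same summary for the control group and the experiment group gives the comparison that STEP 4 asks for: a meaningful gap in any of the four values is a sign that steady state was disrupted.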
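
Slides 69 to 76 recommend setting the request timeout from the invocation time left, falling back gracefully, and logging the timeout with context. The sketch below shows one way this might look in a Python Lambda handler. `context.get_remaining_time_in_millis()` is the method on the Lambda context object; everything else (the 500 ms reserve, the 3 s cap, the `get_with_fallback` helper, its `opener` parameter and the log fields) is a hypothetical choice for illustration, not the speaker's implementation.

```python
import json
import logging
import socket
import urllib.error
import urllib.request

logger = logging.getLogger(__name__)

RESERVE_MS = 500        # assumed: time kept back for the fallback path and logging
MAX_TIMEOUT_MS = 3000   # assumed: cap so a single call cannot consume the whole budget


def timeout_from_context(context, reserve_ms=RESERVE_MS, max_ms=MAX_TIMEOUT_MS):
    """Return the request timeout in seconds, or None if no time budget is left."""
    budget_ms = context.get_remaining_time_in_millis() - reserve_ms
    if budget_ms <= 0:
        return None
    return min(budget_ms, max_ms) / 1000.0


def get_with_fallback(url, context, fallback, correlation_id=None,
                      opener=urllib.request.urlopen):
    """GET a JSON document, returning `fallback` instead of failing the function."""
    timeout_s = timeout_from_context(context)
    if timeout_s is None:
        logger.warning("no time budget left, using fallback: %s",
                       json.dumps({"url": url, "correlation_id": correlation_id}))
        return fallback
    try:
        with opener(url, timeout=timeout_s) as response:
            return json.loads(response.read())
    except (socket.timeout, urllib.error.URLError) as err:
        # Log as much context as possible so the incident can be traced later.
        logger.warning("request failed or timed out, using fallback: %s",
                       json.dumps({"url": url, "timeout_s": timeout_s,
                                   "correlation_id": correlation_id,
                                   "error": str(err)}))
        return fallback
```

Because the timeout is recomputed before every call, a function that makes several API calls in sequence hands each later call a smaller budget, which addresses the multiple-call problem raised on slide 74. Reporting a custom metric (slide 77) would sit next to the warning log and is left out here.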
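
Slides 95 to 101 list ways to inject latency: weavers or dynamic proxies, a hand-crafted wrapper library, configuration held in SSM Parameter Store, and a factory wrapper function in the spirit of bluebird’s promisifyAll. The Python sketch below illustrates the factory-wrapper idea only. The config keys (`enabled`, `probability`, `min_delay_ms`, `max_delay_ms`) and the names `with_injected_latency` and `LatencyInjectingProxy` are hypothetical, and `get_config` is a plain callable standing in for whatever reads and caches the value from SSM Parameter Store.

```python
import functools
import random
import time


def with_injected_latency(func, get_config, sleep=time.sleep, rand=random.random):
    """Wrap one callable so it may be delayed before running, as configured."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        config = get_config()  # read on every call so the experiment can be switched off quickly
        if config.get("enabled") and rand() < config.get("probability", 0.0):
            delay_ms = random.uniform(config["min_delay_ms"], config["max_delay_ms"])
            sleep(delay_ms / 1000.0)
        return func(*args, **kwargs)
    return wrapper


class LatencyInjectingProxy:
    """Dynamic proxy: every public method of `target` gets the latency wrapper."""

    def __init__(self, target, get_config, sleep=time.sleep, rand=random.random):
        self._target = target
        self._get_config = get_config
        self._sleep = sleep
        self._rand = rand

    def __getattr__(self, name):
        # Only reached for names not set on the proxy itself, i.e. the target's attributes.
        attr = getattr(self._target, name)
        if callable(attr) and not name.startswith("_"):
            return with_injected_latency(attr, self._get_config, self._sleep, self._rand)
        return attr
```

Wrapping a single HTTP client inside one function keeps the blast radius small; wrapping a client shared by every function corresponds to the larger, riskier experiment described on slides 84 to 87. Setting `enabled` to false in the config is the rollback switch.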
