
You shall not Fail! (in the face of turbulent conditions)

Lambda gives you a lot of scalability and multi-AZ redundancy out of the box, but things can still go wrong in production.

There are region-wide outages, and performance degradation in services your function depends on can cause it to time out or error. And what if you're dealing with downstream systems that just aren't as scalable and can't handle the load you put on them?

The bottom line is that many things can go wrong, and they often do at the worst possible time. The goal of building resilient systems is not to prevent failures, but to build systems that can withstand them. In this talk, we look at a number of practices and architectural patterns that can help you build more resilient serverless applications, such as going multi-region active-active, employing DLQs and surge queues, and using chaos experiments to identify failure modes before they manifest in production.

The recording is available here: https://www.youtube.com/watch?v=elVeOYYtLM0


You shall not Fail! (in the face of turbulent conditions)

  1. 1. @theburningmonk @sarutule 1 "I COME BACK TO YOU NOW AT THE TURN OF THE TIDE"
  2. 2. @theburningmonk @sarutule WHAT IS RESILIENCE? 2
  3. 3. @theburningmonk @sarutule Failures in distributed systems 3
  4. 4. @theburningmonk @sarutule Failures on load: exhaustion of resources 4
  5. 5. @theburningmonk @sarutule Failures on load: exhaustion of resources 5
  6. 6. You Shall Not Fail! in the face of turbulent conditions™: what is RESILIENCE, chaos ENGINEERING, multi-region STRATEGIES, retries & TIMEOUTS, lambda SCALING, decoupled INVOCATION
  7. 7. PRODUCERS: Yan Cui (@theburningmonk), Sara Gerion (@sarutule). SPEAKERS: Yan Cui (@theburningmonk), Sara Gerion (@sarutule). SPEAKING AT: AWS Community Summit Online. SPECIAL THANKS: Phil Horn, Joe Park
  8. 8. @theburningmonk @sarutule 8 Yan Cui http://theburningmonk.com @theburningmonk AWS user for 10 years
  9. 9. @theburningmonk @sarutule 9 Yan Cui http://theburningmonk.com @theburningmonk Developer Advocate @
  10. 10. @theburningmonk @sarutule 10
  11. 11. @theburningmonk @sarutule 11 Yan Cui http://theburningmonk.com @theburningmonk Independent Consultant: advise, training, delivery
  12. 12. SARA GERION Italian living in Amsterdam, The Netherlands Passionate about cloud, scalability, resilience Twitter: @Sarutule Backend engineer at DAZN @dazneng Director of Tech at SheSharp @SheSharpNL
  13. 13. @theburningmonk @sarutule Lambda execution environment 13
  14. 14. @theburningmonk @sarutule Serverless - multiple AZs out of the box 14 Total resources created: 1 API Gateway, 1 Lambda
  15. 15. @theburningmonk @sarutule Load balancing 15
  16. 16. @theburningmonk @sarutule Data replication 16
  17. 17. @theburningmonk @sarutule REST API - Lambda autoscaling 17 Concurrency limits: 3000 – US West (Oregon), US East (N. Virginia), Europe (Ireland); 1000 – Asia Pacific (Tokyo), Europe (Frankfurt); 500 – other regions. Later bursts: 500 new containers each minute
  18. 18. @theburningmonk @sarutule REST API - Lambda autoscaling 18 X number of execution environments pre-initialized (ready to respond to invocations). Note: standard burst concurrency limits apply when over the provisioned capacity. Concurrency limits: 3000 – US West (Oregon), US East (N. Virginia), Europe (Ireland); 1000 – Asia Pacific (Tokyo), Europe (Frankfurt); 500 – other regions. Later bursts: 500 new containers each minute
  19. 19. @theburningmonk @sarutule REST API - Lambda autoscaling 19 Adjustable provisioned capacity based on CloudWatch metrics. X number of execution environments pre-initialized (ready to respond to invocations). Note: standard burst concurrency limits apply when over the provisioned capacity. Concurrency limits: 3000 – US West (Oregon), US East (N. Virginia), Europe (Ireland); 1000 – Asia Pacific (Tokyo), Europe (Frankfurt); 500 – other regions. Later bursts: 500 new containers each minute
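To make slide 19 concrete, here is a minimal sketch (not from the talk) of attaching provisioned concurrency to a function alias and letting Application Auto Scaling adjust it from CloudWatch utilisation metrics, using the AWS SDK for JavaScript; the function name my-api, the alias live and the capacity numbers are placeholders:

```javascript
// Sketch: scale provisioned concurrency with Application Auto Scaling.
// Function name, alias and capacities below are hypothetical placeholders.
const AWS = require('aws-sdk');
const autoscaling = new AWS.ApplicationAutoScaling();

async function configureProvisionedConcurrency() {
  // provisioned concurrency attaches to an alias or version, not $LATEST
  const resourceId = 'function:my-api:live';

  await autoscaling.registerScalableTarget({
    ServiceNamespace: 'lambda',
    ResourceId: resourceId,
    ScalableDimension: 'lambda:function:ProvisionedConcurrency',
    MinCapacity: 10,
    MaxCapacity: 100,
  }).promise();

  // target-tracking on the utilisation of the pre-initialized environments;
  // traffic above MaxCapacity falls back to the standard burst limits
  await autoscaling.putScalingPolicy({
    PolicyName: 'pc-utilisation-tracking',
    ServiceNamespace: 'lambda',
    ResourceId: resourceId,
    ScalableDimension: 'lambda:function:ProvisionedConcurrency',
    PolicyType: 'TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration: {
      TargetValue: 0.7, // aim for ~70% utilisation
      PredefinedMetricSpecification: {
        PredefinedMetricType: 'LambdaProvisionedConcurrencyUtilization',
      },
    },
  }).promise();
}

module.exports = { configureProvisionedConcurrency };
```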

  20. 20. @theburningmonk @sarutule REST API - Lambda limitations & throttling 20
  21. 21. @theburningmonk @sarutule HOW TO SOLVE IT? 21
  22. 22. @theburningmonk @sarutule HOW TO SOLVE IT? IT DEPENDS 22
  23. 23. @theburningmonk @sarutule The importance of retry policies 23
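As a concrete illustration of why retry policies matter, here is a minimal sketch (not from the talk) of a retry wrapper using exponential backoff with full jitter, the same formula slide 83 quotes from the DynamoDB client; the function name and default values are illustrative:

```javascript
// Sketch: retry an async operation with exponential backoff + full jitter.
// maxRetries and base are illustrative defaults.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetries(fn, { maxRetries = 3, base = 50 } = {}) {
  for (let retryCount = 0; ; retryCount++) {
    try {
      return await fn();
    } catch (err) {
      if (retryCount >= maxRetries) {
        throw err; // out of retries: surface the error to the caller's fallback
      }
      // full jitter: sleep anywhere between 0 and the exponential ceiling
      const delay = Math.random() * (Math.pow(2, retryCount) * base);
      await sleep(delay);
    }
  }
}

module.exports = { withRetries };
```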
  24. 24. @theburningmonk @sarutule Scenario: client only needs an acknowledgement 24
  25. 25. @theburningmonk @sarutule If fast acknowledgement not possible…
  26. 26. @theburningmonk @sarutule Scenario: predictable spikes 26
  27. 27. @theburningmonk @sarutule Scenario: predictable spikes 27 Holidays, weekends, celebrations (Black Friday); planned launch of resources (new series available); sports events
  28. 28. @theburningmonk @sarutule Scenario: unpredictable spikes 28 Traffic generated by user actions, e.g. Jennifer Aniston’s first post
  29. 29. @theburningmonk @sarutule Possible mitigations for REST APIs 29 Use 1 Lambda for each endpoint
  30. 30. @theburningmonk @sarutule One Lambda function for each endpoint 30
  31. 31. @theburningmonk @sarutule Possible mitigations for REST APIs 31 Use 1 Lambda for each endpoint; raise limits with an AWS support ticket
  32. 32. @theburningmonk @sarutule Possible mitigations for REST APIs 32 Use 1 Lambda for each endpoint; optimise performance; raise limits with an AWS support ticket
  33. 33. @theburningmonk @sarutule Possible mitigations for REST APIs 33 Use 1 Lambda for each endpoint; optimise performance; offload computing operations to an async flow (SQS, SNS, …); raise limits with an AWS support ticket
  34. 34. @theburningmonk @sarutule Offload computing operations to queues 34
  35. 35. @theburningmonk @sarutule Offload computing operations to queues 35
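A minimal sketch of the offload pattern on slides 34-35, assuming a hypothetical ORDERS_QUEUE_URL environment variable and payload shape: the API-facing function only validates and enqueues, returns an acknowledgement right away (the scenario from slide 24), and lets a queue consumer do the slow work at its own pace:

```javascript
// Sketch: accept the request, enqueue the work, acknowledge immediately.
// The queue URL env var and payload shape are hypothetical.
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

module.exports.handler = async (event) => {
  const order = JSON.parse(event.body);

  // hand the heavy computation to SQS instead of the request path
  await sqs.sendMessage({
    QueueUrl: process.env.ORDERS_QUEUE_URL,
    MessageBody: JSON.stringify(order),
  }).promise();

  // the client only needed an acknowledgement
  return { statusCode: 202, body: JSON.stringify({ accepted: true }) };
};
```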
  36. 36. @theburningmonk @sarutule Possible mitigations for REST APIs 36 Use 1 Lambda for each endpoint; optimise performance; offload computing operations to an async flow (SQS, SNS, …); use provisioned capacity (plus autoscaling); raise limits with an AWS support ticket
  37. 37. @theburningmonk @sarutule Reminder: beware of long timeouts 37 API Gateway integration timeout default: 29s; Lambda timeout max: 15 minutes; SQS visibility timeout default: 30s, min: 0s, max: 12 hours
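One way to act on slide 37's reminder, sketched here as an assumption rather than the talk's prescription: give each outbound call a timeout derived from the invocation's remaining time (via the standard context.getRemainingTimeInMillis()), so a slow dependency fails fast instead of consuming the whole Lambda timeout; the 500ms buffer and the URL are illustrative:

```javascript
// Sketch: bound an HTTP call by the invocation's remaining time budget.
const https = require('https');

const get = (url, timeoutMs) =>
  new Promise((resolve, reject) => {
    const req = https.get(url, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    });
    // abort the request if it outlives its share of the budget
    req.setTimeout(timeoutMs, () => req.destroy(new Error('request timed out')));
    req.on('error', reject);
  });

module.exports.handler = async (event, context) => {
  // keep a buffer so we can still handle the error and respond ourselves
  const budget = context.getRemainingTimeInMillis() - 500;
  const data = await get('https://example.com/slow-dependency', budget);
  return { statusCode: 200, body: data };
};
```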
  38. 38. @theburningmonk @sarutule Single-region architectures 38
  39. 39. @theburningmonk @sarutule Multi-region: active-passive 39
  40. 40. @theburningmonk @sarutule Multi-region: active-active 40
  41. 41. @theburningmonk @sarutule Active-active & data replication 41
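For the data-replication half of active-active, one option (my assumption of what slide 41's architecture implies, not necessarily what it shows) is DynamoDB Global Tables; a sketch using the original CreateGlobalTable API, where the table name and regions are placeholders and identical tables with streams enabled must already exist in each region:

```javascript
// Sketch: turn per-region DynamoDB tables into one multi-master global table.
// Table name and regions are hypothetical.
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB({ region: 'us-east-1' });

async function makeGlobal() {
  await dynamodb.createGlobalTable({
    GlobalTableName: 'orders',
    ReplicationGroup: [
      { RegionName: 'us-east-1' },
      { RegionName: 'eu-west-1' },
    ],
  }).promise();
}

module.exports = { makeGlobal };
```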
  42. 42. @theburningmonk @sarutule Multi-region architecture - benefits & tradeoffs 42 Protection against regional failures
  43. 43. @theburningmonk @sarutule Multi-region architecture - benefits & tradeoffs 43 Protection against regional failures; higher complexity
  44. 44. @theburningmonk @sarutule Multi-region architecture - benefits & tradeoffs 44 Protection against regional failures; higher complexity; very hard to test
  45. 45. @theburningmonk @sarutule CHAOS ENGINEERING 45
  46. 46. @theburningmonk @sarutule 46 MUST KILL SERVERS! RAWR!! RAWR!!
  47. 47. @theburningmonk @sarutule 47 “the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production” principlesofchaos.org
  48. 48. @theburningmonk @sarutule 48 “You don't choose the moment, the moment chooses you! You only choose how prepared you are when it does.” Fire Chief Mike Burtch
  49. 49. @theburningmonk @sarutule 49 identify weaknesses before they manifest in system-wide, aberrant behaviors GOAL
  50. 50. @theburningmonk @sarutule 50 learn about the system’s behavior by observing it during a controlled experiment HOW
  51. 51. @theburningmonk @sarutule 51 learn about the system’s behavior by observing it during a controlled experiment HOW: game days, failure injection
  52. 52. @theburningmonk @sarutule 52 MUST KILL SERVERS! RAWR!! RAWR!! ahhhhhhh!!!! HELP!!! OMG!!! F***!!!
  53. 53. @theburningmonk @sarutule 53 phew!
  54. 54. @theburningmonk @sarutule 54 STEP 1. define steady state i.e. “what does normal look like”
  55. 55. @theburningmonk @sarutule 55 STEP 2. hypothesis that steady state continues in control and experimental group e.g. “the system stays up if a server dies”
  56. 56. @theburningmonk @sarutule 56 STEP 3. inject realistic failures e.g. “slow response from 3rd-party service”
  57. 57. @theburningmonk @sarutule 57 STEP 4. try to disprove hypothesis i.e. “look for difference between control and experimental group”
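To make step 3 concrete, a minimal failure-injection sketch driven by hypothetical environment variables (CHAOS_LATENCY_MS, CHAOS_ERROR_RATE); purpose-built tools such as failure-lambda follow a similar wrapper shape:

```javascript
// Sketch: inject latency and errors into a handler under experiment control.
// CHAOS_LATENCY_MS and CHAOS_ERROR_RATE are hypothetical config names.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const withChaos = (handler) => async (event, context) => {
  // injected latency: simulate a slow 3rd-party or degraded service
  const latency = parseInt(process.env.CHAOS_LATENCY_MS || '0', 10);
  if (latency > 0) await sleep(latency);

  // injected errors: fail a configured fraction of invocations
  const errorRate = parseFloat(process.env.CHAOS_ERROR_RATE || '0');
  if (Math.random() < errorRate) {
    throw new Error('chaos: injected failure');
  }

  return handler(event, context);
};

module.exports.handler = withChaos(async () => {
  return { statusCode: 200, body: 'ok' };
});
```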
  58. 58. @theburningmonk @sarutule DON’T START EXPERIMENTS IN PRODUCTION 58
  59. 59. @theburningmonk @sarutule 59 identify weaknesses before they manifest in system-wide, aberrant behaviors GOAL
  60. 60. @theburningmonk @sarutule 60 “Corporation X lost millions due to a chaos experiment gone wrong that destroyed key infrastructure, resulting in hours of downtime and unrecoverable data loss.”
  61. 61. @theburningmonk @sarutule 61 Chaos Engineering doesn't cause problems. It reveals them. Nora Jones
  62. 62. @theburningmonk @sarutule 62 CONTAINMENT
  63. 63. @theburningmonk @sarutule 63 CONTAINMENT run experiments during office hours
  64. 64. @theburningmonk @sarutule 64 CONTAINMENT run experiments during office hours let others know what you’re doing, no surprises
  65. 65. @theburningmonk @sarutule 65 CONTAINMENT run experiments during office hours let others know what you’re doing, no surprises avoid important dates
  66. 66. @theburningmonk @sarutule 66 CONTAINMENT run experiments during office hours let others know what you’re doing, no surprises avoid important dates make the smallest change possible
  67. 67. @theburningmonk @sarutule 67 CONTAINMENT run experiments during office hours let others know what you’re doing, no surprises avoid important dates make the smallest change possible have a rollback plan before you start
  68. 68. @theburningmonk @sarutule DON’T START EXPERIMENTS IN PRODUCTION 68
  69. 69. @theburningmonk @sarutule 69 by Russ Miles @russmiles source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  70. 70. @theburningmonk @sarutule 70 chaos monkey kills an EC2 instance latency monkey induces artificial delay in APIs chaos gorilla kills an AWS Availability Zone chaos kong kills an entire AWS region
  71. 71. @theburningmonk @sarutule 71
  72. 72. @theburningmonk @sarutule 72 there are no servers to kill! SERVERLESS
  73. 73. @theburningmonk @sarutule 73 by Russ Miles @russmiles source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  74. 74. @theburningmonk @sarutule 74 by Russ Miles @russmiles source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  75. 75. @theburningmonk @sarutule 75 improperly tuned timeouts
  76. 76. @theburningmonk @sarutule 76 missing error handling
  77. 77. @theburningmonk @sarutule 77 missing fallbacks
  78. 78. @theburningmonk @sarutule 78
  79. 79. @theburningmonk @sarutule 79 “what if DynamoDB has an elevated error rate?”
  80. 80. @theburningmonk @sarutule 80 hypothesis: the AWS SDK retries would handle it
  81. 81. @theburningmonk @sarutule 81 runs experiment…
  82. 82. @theburningmonk @sarutule 82 TIL: the js DynamoDB client defaults to 10 retries with base delay of 50ms
  83. 83. @theburningmonk @sarutule 83 TIL: the js DynamoDB client defaults to 10 retries with base delay of 50ms delay = Math.random() * (Math.pow(2, retryCount) * base) this is Marc Brooker’s fav formula!
  84. 84. @theburningmonk @sarutule 84
  85. 85. @theburningmonk @sarutule 85 result: function times out after 6s (hypothesis is disproved)
  86. 86. @theburningmonk @sarutule 86 action: set max retry count + fallback
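Slide 86's action, sketched with the JavaScript SDK: cap the DynamoDB client's retries (down from the default 10 that slide 82 calls out), fail fast at the HTTP layer, and fall back to safe defaults; the table name and fallback values are illustrative:

```javascript
// Sketch: bounded retries + a fallback so the function degrades gracefully
// instead of timing out. Table name and default values are hypothetical.
const AWS = require('aws-sdk');

const dynamodb = new AWS.DynamoDB.DocumentClient({
  maxRetries: 3,                   // the DynamoDB client's default is 10
  retryDelayOptions: { base: 50 }, // base delay fed into the jitter formula
  httpOptions: { connectTimeout: 1000, timeout: 1000 }, // fail fast
});

async function getUserPreferences(userId) {
  try {
    const res = await dynamodb
      .get({ TableName: 'user-preferences', Key: { userId } })
      .promise();
    return res.Item;
  } catch (err) {
    // fallback: serve defaults rather than let the invocation time out
    console.error('DynamoDB unavailable, falling back to defaults', err);
    return { theme: 'default', notifications: true };
  }
}

module.exports = { getUserPreferences };
```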
  87. 87. @theburningmonk @sarutule 87 outcome: a more resilient system
  88. 88. @theburningmonk @sarutule 88 “what if service X has elevated latency?”
  89. 89. @theburningmonk @sarutule 89 hypothesis: our try-catch would handle it
  90. 90. @theburningmonk @sarutule 90 runs experiment…
  91. 91. @theburningmonk @sarutule 91 result: function times out after 6s (hypothesis is disproved)
  92. 92. @theburningmonk @sarutule 92 TIL: most HTTP client libraries have default timeout of 60s. API Gateway has an integration timeout of 29s. Most Lambda functions default to timeout of 3-6s.
  93. 93. @theburningmonk @sarutule 93
  94. 94. @theburningmonk @sarutule 94
  95. 95. @theburningmonk @sarutule 95 https://bit.ly/2Wvfort
  96. 96. @theburningmonk @sarutule 96
  97. 97. @theburningmonk @sarutule 97
  98. 98. @theburningmonk @sarutule 98 outcome: a more resilient system
  99. 99. @theburningmonk @sarutule 99 recap
  100. 100. @theburningmonk @sarutule Failures in distributed systems 100
  101. 101. @theburningmonk @sarutule Serverless - multiple AZs out of the box 101 Total resources created: 1 API Gateway, 1 Lambda
  102. 102. @theburningmonk @sarutule Beware of timeouts 102 API Gateway integration timeout default: 29s; Lambda timeout max: 15 minutes; SQS visibility timeout default: 30s, min: 0s, max: 12 hours
  103. 103. @theburningmonk @sarutule Offload computing operations to queues 103
  104. 104. @theburningmonk @sarutule Active-active 104
  105. 105. @theburningmonk @sarutule 105 “You don't choose the moment, the moment chooses you! You only choose how prepared you are when it does.” Fire Chief Mike Burtch
  106. 106. @theburningmonk @sarutule 106 by Russ Miles @russmiles source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  107. 107. @theburningmonk @sarutule 107 by Russ Miles @russmiles source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  108. 108. @theburningmonk @sarutule 108
