
Patterns and Practices for Building Resilient Serverless Applications

Recording: https://youtu.be/NdPg_AXdQQU

Real-world serverless podcast: https://realworldserverless.com
Learn Lambda best practices: https://lambdabestpractice.com
Blog: https://theburningmonk.com
Consulting services: https://theburningmonk.com/hire-me
Production-Ready Serverless workshop: https://productionreadyserverless.com

Published in: Technology

Patterns and Practices for Building Resilient Serverless Applications

  1. 1. Patterns and Practices for building resilient serverless applications presented by Yan Cui @theburningmonk
  2. 2. @theburningmonk theburningmonk.com
  3. 3. @theburningmonk theburningmonk.com “the capacity to recover quickly from difficulties; toughness.” resilience /rɪˈzɪlɪəns/ noun
  4. 4. @theburningmonk theburningmonk.com “the capacity to recover quickly from difficulties; toughness.” resilience /rɪˈzɪlɪəns/ noun it’s not about preventing failures!
  5. 5. everything fails, all the time
  6. 6. @theburningmonk theburningmonk.com we need to build applications that can withstand failures
  7. 7. @theburningmonk theburningmonk.com
  8. 8. @theburningmonk theburningmonk.com don’t run your application on one server…
  9. 9. @theburningmonk theburningmonk.com entire data centers can go down…
  10. 10. @theburningmonk theburningmonk.com run your application in multiple AZs and regions
  11. 11. @theburningmonk theburningmonk.com Failures on load: exhaustion of resources
  12. 12. @theburningmonk theburningmonk.com Failures on load: exhaustion of resources
  13. 13. @theburningmonk theburningmonk.com latency reqs/s Failures on load: exhaustion of resources CPU saturation
  14. 14. @theburningmonk theburningmonk.com Failures in distributed systems Service A Service B Service C user
  15. 15. @theburningmonk theburningmonk.com Failures in distributed systems Service A Service B Service C user
  16. 16. @theburningmonk theburningmonk.com Failures in distributed systems Service A Service B Service C user HTTP 502
  17. 17. @theburningmonk theburningmonk.com Failures in distributed systems Service A Service B Service C user You suck!
  18. 18. @theburningmonk theburningmonk.com microservices death stars circa 2015
  19. 19. Yan Cui http://theburningmonk.com @theburningmonk AWS user for 10 years
  20. 20. Yan Cui http://theburningmonk.com @theburningmonk http://bit.ly/yubl-serverless
  21. 21. Yan Cui http://theburningmonk.com @theburningmonk Developer Advocate @
  22. 22. Yan Cui http://theburningmonk.com @theburningmonk Independent Consultant: advise, training, delivery
  23. 23. by Uwe Friedrichsen
  24. 24. @theburningmonk theburningmonk.com Lambda execution environment
  25. 25. @theburningmonk theburningmonk.com Serverless - multiple AZ’s out of the box Total resources created: 1 API Gateway 1 Lambda
  26. 26. @theburningmonk theburningmonk.com Serverless - multiple AZ’s out of the box Total resources created: 1 API Gateway 1 Lambda don’t pay for idle redundant resources!
  27. 27. @theburningmonk theburningmonk.com Load balancing
  28. 28. @theburningmonk theburningmonk.com Data replication in different AZ’s DynamoDB Global Tables
  29. 29. @theburningmonk theburningmonk.com There is throttling everywhere!
  30. 30. @theburningmonk theburningmonk.com Beware of timeout mismatch. API Gateway integration timeout (default: 29s) vs. Lambda timeout (max: 15 minutes)
  31. 31. @theburningmonk theburningmonk.com Beware of timeout mismatch. Lambda timeout (max: 15 minutes) vs. SQS visibility timeout (default: 30s, min: 0s, max: 12 hours)
  32. 32. @theburningmonk theburningmonk.com Beware of timeout mismatch. Lambda timeout (max: 15 minutes) vs. SQS visibility timeout (default: 30s, min: 0s, max: 12 hours). Set VisibilityTimeout to 6x the Lambda timeout
  33. 33. @theburningmonk theburningmonk.com Offload computing operations to queues
  34. 34. @theburningmonk theburningmonk.com Offload computing operations to queues
  35. 35. @theburningmonk theburningmonk.com Offload computing operations to queues better absorb downstream problems
  36. 36. @theburningmonk theburningmonk.com Offload computing operations to queues need a way to replay DLQ events
  37. 37. https://www.npmjs.com/package/lumigo-cli
  38. 38. @theburningmonk theburningmonk.com Offload computing operations to queues great for fire-and-forget tasks
  39. 39. @theburningmonk theburningmonk.com “what if the client is waiting for a response?”
  40. 40. @theburningmonk theburningmonk.com “Decoupled Invocation”
  41. 41. @theburningmonk theburningmonk.com task id created at result xxx xxx <null> xxx xxx <null> … … … task results not ready…
  42. 42. @theburningmonk theburningmonk.com task id created at result xxx xxx <null> xxx xxx <null> … … … task results not ready… 202
  43. 43. @theburningmonk theburningmonk.com task id created at result xxx xxx <null> xxx xxx <null> … … … task results reporting for duty!
  44. 44. @theburningmonk theburningmonk.com task id created at result xxx xxx <null> xxx xxx <null> … … … task results working hard… not ready…
  45. 45. @theburningmonk theburningmonk.com task id created at result xxx xxx <null> xxx xxx <null> … … … task results 202 working hard…
  46. 46. @theburningmonk theburningmonk.com task id created at result xxx xxx <null> xxx xxx { … } … … … task results done!
  47. 47. @theburningmonk theburningmonk.com task id created at result xxx xxx <null> xxx xxx { … } … … … task results done!
  48. 48. @theburningmonk theburningmonk.com task id created at result xxx xxx <null> xxx xxx { … } … … … task results 200 { … }
  49. 49. @theburningmonk theburningmonk.com wait…
  50. 50. @theburningmonk theburningmonk.com a distributed transaction!
  51. 51. @theburningmonk theburningmonk.com a distributed transaction! needs rollback
  52. 52. @theburningmonk theburningmonk.com how do you implement distributed transactions?
  53. 53. @theburningmonk theburningmonk.com The Saga pattern A pattern for managing failures where each action has a compensating action for rollback
  54. 54. @theburningmonk theburningmonk.com The Saga pattern https://www.youtube.com/watch?v=xDuwrtwYHu8
  55. 55. @theburningmonk theburningmonk.com The Saga pattern Begin transaction Start book hotel request End book hotel request Start book flight request End book flight request Start book car rental request End book car rental request End transaction
  56. 56. @theburningmonk theburningmonk.com The Saga pattern model both actions and compensating actions as Lambda functions
  57. 57. @theburningmonk theburningmonk.com The Saga pattern use Step Functions as the coordinator for the saga
  58. 58. @theburningmonk theburningmonk.com The Saga pattern Input
  59. 59. @theburningmonk theburningmonk.com The Saga pattern
  60. 60. @theburningmonk theburningmonk.com The Saga pattern
  61. 61. @theburningmonk theburningmonk.com The Saga pattern
  62. 62. @theburningmonk theburningmonk.com no distributed transactions
  63. 63. @theburningmonk theburningmonk.com do the work here
  64. 64. @theburningmonk theburningmonk.com retry-until-success
  65. 65. @theburningmonk theburningmonk.com
  66. 66. @theburningmonk theburningmonk.com 24 hours data retention
  67. 67. @theburningmonk theburningmonk.com 24 hours data retention need alerting to ensure issues are addressed quickly
  68. 68. @theburningmonk theburningmonk.com Mind the poison message
  69. 69. @theburningmonk theburningmonk.com retry-until-success needs to deal with poison messages
  70. 70. @theburningmonk theburningmonk.com Mind the poison message
  71. 71. @theburningmonk theburningmonk.com Mind the poison message 6, 3, 1, 1, 1, 1, …
  72. 72. @theburningmonk theburningmonk.com Mind the poison message 6, 3, 1, 1, 1, 1, … only count the “same” batch
  73. 73. @theburningmonk theburningmonk.com Mind the poison message
  74. 74. @theburningmonk theburningmonk.com Mind the poison message have to fetch from the stream
  75. 75. @theburningmonk theburningmonk.com Mind the poison message have to fetch from the stream do it before they expire from the stream!
  76. 76. @theburningmonk theburningmonk.com Mind the partial failures LambdaSQS
  77. 77. @theburningmonk theburningmonk.com Mind the partial failures LambdaSQS Poller
  78. 78. @theburningmonk theburningmonk.com LambdaSQS Poller Mind the partial failures Delete
  79. 79. @theburningmonk theburningmonk.com Mind the partial failures LambdaSQS Poller Error
  80. 80. @theburningmonk theburningmonk.com Mind the partial failures LambdaSQS Poller Error DLQ
  81. 81. @theburningmonk theburningmonk.com Mind the partial failures LambdaSQS Poller Error DLQ batch fails as a unit
  82. 82. https://lumigo.io/blog/sqs-and-lambda-the-missing-guide-on-failure-modes Mind the partial failures
  83. 83. @theburningmonk theburningmonk.com Mind the partial failures
  84. 84. @theburningmonk theburningmonk.com Mind the partial failures
  85. 85. @theburningmonk theburningmonk.com Mind the partial failures
  86. 86. @theburningmonk theburningmonk.com Mind the retry storm Service A
  87. 87. @theburningmonk theburningmonk.com Mind the retry storm Service A
  88. 88. @theburningmonk theburningmonk.com Mind the retry storm Service A retry retry retry retry
  89. 89. @theburningmonk theburningmonk.com Mind the retry storm Service A
  90. 90. @theburningmonk theburningmonk.com Mind the retry storm Service A
  91. 91. @theburningmonk theburningmonk.com Mind the retry storm Service A
  92. 92. @theburningmonk theburningmonk.com Mind the retry storm Service A
  93. 93. @theburningmonk theburningmonk.com retry storm
  94. 94. @theburningmonk theburningmonk.com circuit breaker pattern After X consecutive timeouts, trip the circuit
  95. 95. @theburningmonk theburningmonk.com circuit breaker pattern After X consecutive timeouts, trip the circuit When circuit is open, fail fast
  96. 96. @theburningmonk theburningmonk.com circuit breaker pattern When circuit is open, fail fast but, allow 1 request through every Y mins After X consecutive timeouts, trip the circuit
  97. 97. @theburningmonk theburningmonk.com circuit breaker pattern When circuit is open, fail fast but, allow 1 request through every Y mins If request succeeds, close the circuit After X consecutive timeouts, trip the circuit
  98. 98. @theburningmonk theburningmonk.com where do I keep the state of the circuit?
  99. 99. @theburningmonk theburningmonk.com in-memory PROS simplicity no dependency on external service CONS takes longer & more requests to stop all traffic new containers would generate more traffic
  100. 100. @theburningmonk theburningmonk.com external service PROS minimizes no. of total requests to trip circuit new containers respect collective decision CONS complexity dependency on an external service
  101. 101. @theburningmonk theburningmonk.com which approach should I use? It depends. Maybe start with the simplest solution first?
  102. 102. @theburningmonk theburningmonk.com multi-region, active-active
  103. 103. @theburningmonk theburningmonk.com us-east-1 API Gateway Lambda DynamoDBRoute53
  104. 104. @theburningmonk theburningmonk.com eu-west-1 us-east-1 us-west-1
  105. 105. @theburningmonk theburningmonk.com eu-west-1 us-east-1 us-west-1 GlobalTable
  106. 106. @theburningmonk theburningmonk.com eu-west-1 us-east-1 us-west-1 GlobalTable
  107. 107. @theburningmonk theburningmonk.com eu-central-1 us-east-1 us-east-1 SQS Lambda DynamoDB Lambda API Gateway SNS SNS
  108. 108. @theburningmonk theburningmonk.com us-east-1 SQS Lambda DynamoDB Lambda API Gateway eu-central-1 us-east-1 SNS SNS
  109. 109. @theburningmonk theburningmonk.com us-east-1 SQS Lambda DynamoDB Lambda API Gateway eu-central-1 us-east-1 SNS SNS
  110. 110. @theburningmonk theburningmonk.com us-east-1 SQS Lambda DynamoDB Lambda API Gateway eu-central-1 us-east-1 SNS SNS dedupe
  111. 111. @theburningmonk theburningmonk.com us-east-1 SQS Lambda DynamoDB Lambda API Gateway us-east-1 SNS eu-central-1 SNS eu-central-1 SQS Lambda DynamoDB Lambda API Gateway Global Table
  112. 112. @theburningmonk theburningmonk.com us-east-1 SQS Lambda DynamoDB Lambda API Gateway us-east-1 SNS eu-central-1 SNS eu-central-1 SQS Lambda DynamoDB Lambda API Gateway Global Table
  113. 113. @theburningmonk theburningmonk.com us-east-1 SQS Lambda DynamoDB Lambda API Gateway us-east-1 SNS eu-central-1 SNS eu-central-1 SQS Lambda DynamoDB Lambda API Gateway Global Table
  114. 114. @theburningmonk theburningmonk.com us-east-1 SQS Lambda DynamoDB Lambda API Gateway us-east-1 SNS eu-central-1 SNS eu-central-1 SQS Lambda DynamoDB Lambda API Gateway Global Table
  115. 115. @theburningmonk theburningmonk.com Multi-region architecture - benefits & tradeoffs: protection against regional failures; higher complexity; very hard to test
  116. 116. CHAOS ENGINEERING
  117. 117. MUST KILL SERVERS! RAWR!! RAWR!!
  118. 118. @theburningmonk theburningmonk.com “the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production” principlesofchaos.org
  119. 119. @theburningmonk theburningmonk.com “You don't choose the moment, the moment chooses you! You only choose how prepared you are when it does.” Fire Chief Mike Burtch
  120. 120. @theburningmonk theburningmonk.com
  121. 121. @theburningmonk theburningmonk.com “what if DynamoDB has an elevated error rate?”
  122. 122. @theburningmonk theburningmonk.com “what if service X has elevated latency?”
  123. 123. @theburningmonk theburningmonk.com identify weaknesses before they manifest in system-wide, aberrant behaviors GOAL
  124. 124. everything fails, all the time
  125. 125. @theburningmonk theburningmonk.com “the capacity to recover quickly from difficulties; toughness.” resilience /rɪˈzɪlɪəns/ noun
  126. 126. @theburningmonk theburningmonk.com
  127. 127. @theburningmonk theburningmonk.com lambdabestpractice.com
  128. 128. https://theburningmonk.com/hire-me Advise, Training, Delivery “Fundamentally, Yan has improved our team by increasing our ability to derive value from AWS and Lambda in particular.” Nick Blair Tech Lead
  129. 129. @theburningmonk theburningmonk.com github.com/theburningmonk
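
Slides 30 to 32 warn about timeout mismatches and suggest setting the SQS visibility timeout to 6x the Lambda timeout. A minimal sketch of that rule as a configuration check might look like the following. The function names and warning messages are hypothetical; the limits are the ones quoted on the slides.

```python
# Limits as quoted on slides 30-32.
API_GATEWAY_INTEGRATION_TIMEOUT_S = 29   # API Gateway integration timeout (default)
LAMBDA_MAX_TIMEOUT_S = 15 * 60           # Lambda timeout (max)
SQS_MAX_VISIBILITY_S = 12 * 60 * 60      # SQS visibility timeout (max)


def recommended_visibility_timeout(lambda_timeout_s):
    """Return 6x the Lambda timeout, capped at the SQS maximum."""
    if not 0 < lambda_timeout_s <= LAMBDA_MAX_TIMEOUT_S:
        raise ValueError("Lambda timeout must be between 1s and 15 minutes")
    return min(6 * lambda_timeout_s, SQS_MAX_VISIBILITY_S)


def timeout_warnings(lambda_timeout_s, visibility_timeout_s=None,
                     behind_api_gateway=False):
    """Collect human-readable warnings about mismatched timeouts."""
    warnings = []
    if behind_api_gateway and lambda_timeout_s > API_GATEWAY_INTEGRATION_TIMEOUT_S:
        # API Gateway gives up before the function does, so the caller
        # sees an error while the function keeps running.
        warnings.append("Lambda timeout exceeds the API Gateway integration timeout")
    if visibility_timeout_s is not None:
        if visibility_timeout_s < recommended_visibility_timeout(lambda_timeout_s):
            # Messages may become visible again while still being processed.
            warnings.append("SQS visibility timeout is below 6x the Lambda timeout")
    return warnings
```

For example, a function with a 30s timeout reading from a queue that still has the default 30s visibility timeout would produce one warning, and the recommended value would be 180s.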
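
Slides 53 to 57 describe the Saga pattern: each action has a compensating action for rollback, with Step Functions as the coordinator and Lambda functions as the steps. The sketch below is not Step Functions. It is plain in-process Python, under the assumption that each step is a pair of zero-argument callables, and it only illustrates the ordering of actions and compensations. `run_saga` is a hypothetical name.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, the compensations of the steps that already
    completed are run in reverse order, and False is returned.
    Returns True when every action succeeds.
    """
    completed = []  # compensations for the steps that have succeeded so far
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Roll back in reverse order: the most recent step is undone first.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True
```

With the booking example from slide 55, a failure while booking the car rental would cancel the flight and then the hotel. One design choice to note: this sketch compensates only the steps that completed. A real coordinator may also need to compensate the failed step itself, because a failure can leave a partial side effect behind, and compensating actions should then be safe to run even when there is nothing to undo.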
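
Slides 76 to 81 point out that with the SQS to Lambda integration a batch fails as a unit, so one bad message causes every message in the batch to be retried. One option that is not mentioned on the slides is Lambda's partial batch response: with `ReportBatchItemFailures` enabled on the event source mapping, the handler can return only the IDs of the messages that failed. The sketch below assumes that setting is enabled; `process_message` is a hypothetical stand-in for real business logic.

```python
import json


def process_message(body):
    """Hypothetical business logic: here it only parses the body as JSON."""
    return json.loads(body)


def handler(event, context):
    """SQS-triggered Lambda handler that reports failures per message.

    Assumes the event source mapping has ReportBatchItemFailures enabled;
    without it, Lambda ignores the return value and a raised error still
    fails the whole batch.
    """
    failures = []
    for record in event["Records"]:
        try:
            process_message(record["body"])
        except Exception:
            # Only this message returns to the queue (and eventually the DLQ).
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Messages that were processed successfully are deleted from the queue, while the reported ones become visible again after the visibility timeout.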
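
Slides 94 to 97 describe the circuit breaker pattern: after X consecutive timeouts, trip the circuit; while it is open, fail fast but allow one request through every Y minutes; if that request succeeds, close the circuit. A minimal sketch of the in-memory variant from slide 99 could look like this. The class and parameter names are hypothetical, Y is expressed in seconds, and only `TimeoutError` counts as a failure.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, max_failures, probe_interval_s, clock=time.monotonic):
        self.max_failures = max_failures          # "X" on the slides
        self.probe_interval_s = probe_interval_s  # "Y" on the slides, in seconds
        self.clock = clock                        # injectable for testing
        self.failures = 0                         # consecutive timeouts so far
        self.opened_at = None                     # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.probe_interval_s:
                raise CircuitOpenError("circuit is open, failing fast")
            # The interval has elapsed: let this one request through and
            # restart the interval so later calls keep failing fast.
            self.opened_at = self.clock()
        try:
            result = fn(*args, **kwargs)
        except TimeoutError:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()     # trip (or re-trip) the circuit
            raise
        # Success: reset the count and close the circuit.
        self.failures = 0
        self.opened_at = None
        return result
```

Because the state lives in memory, every Lambda container keeps its own breaker, which matches the trade-off on slide 99: it is simple and has no external dependency, but it takes more requests to stop all traffic, and new containers start with a closed circuit.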
