Patterns and practices for building resilient Serverless applications

  1. Patterns and Practices for building resilient serverless applications presented by Yan Cui @theburningmonk
  2. @theburningmonk theburningmonk.com
  3. @theburningmonk theburningmonk.com resilience /rɪˈzɪlɪəns/ noun: “the capacity to recover quickly from difficulties; toughness.”
  4. @theburningmonk theburningmonk.com resilience /rɪˈzɪlɪəns/ noun: “the capacity to recover quickly from difficulties; toughness.” it’s not about preventing failures!
  5. everything fails, all the time
  6. @theburningmonk theburningmonk.com we need to build applications that can withstand failures
  7. @theburningmonk theburningmonk.com
  8. @theburningmonk theburningmonk.com don’t run your application on one server…
  9. @theburningmonk theburningmonk.com entire data centers can go down…
  10. @theburningmonk theburningmonk.com run your application in multiple AZs and regions
  11. @theburningmonk theburningmonk.com Failures on load: exhaustion of resources
  12. @theburningmonk theburningmonk.com Failures on load: exhaustion of resources
  13. @theburningmonk theburningmonk.com Failures on load: exhaustion of resources (chart: latency vs. reqs/s; CPU saturation)
  14. @theburningmonk theburningmonk.com Failures in distributed systems Service A Service B Service C user
  15. @theburningmonk theburningmonk.com Failures in distributed systems Service A Service B Service C user
  16. @theburningmonk theburningmonk.com Failures in distributed systems Service A Service B Service C user HTTP 502
  17. @theburningmonk theburningmonk.com Failures in distributed systems Service A Service B Service C user You suck!
  18. @theburningmonk theburningmonk.com microservices death stars circa 2015
  19. Yan Cui http://theburningmonk.com @theburningmonk AWS user for 10 years
  20. Yan Cui http://theburningmonk.com @theburningmonk http://bit.ly/yubl-serverless
  21. Yan Cui http://theburningmonk.com @theburningmonk Developer Advocate @
  22. Yan Cui http://theburningmonk.com @theburningmonk Independent Consultant: advise, training, delivery
  23. by Uwe Friedrichsen
  24. @theburningmonk theburningmonk.com Lambda execution environment
  25. @theburningmonk theburningmonk.com Serverless: multiple AZs out of the box. Total resources created: 1 API Gateway, 1 Lambda
  26. @theburningmonk theburningmonk.com Serverless: multiple AZs out of the box. Total resources created: 1 API Gateway, 1 Lambda. don’t pay for idle redundant resources!
  27. @theburningmonk theburningmonk.com Load balancing
  28. @theburningmonk theburningmonk.com Data replication in different AZs: DynamoDB Global Tables
  29. @theburningmonk theburningmonk.com There is throttling everywhere!
  30. @theburningmonk theburningmonk.com Beware of timeout mismatch. API Gateway integration timeout: default 29s. Lambda timeout: max 15 minutes.
  31. @theburningmonk theburningmonk.com Beware of timeout mismatch. Lambda timeout: max 15 minutes. SQS visibility timeout: default 30s, min 0s, max 12 hours.
  32. @theburningmonk theburningmonk.com Beware of timeout mismatch. Lambda timeout: max 15 minutes. SQS visibility timeout: default 30s, min 0s, max 12 hours. set VisibilityTimeout to 6x Lambda timeout
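The timeout rules of thumb on these slides can be checked with a little arithmetic. The following is a minimal sketch, not tooling from the talk: the function names are hypothetical, and the constants are the figures quoted on the slides (29s API Gateway integration timeout, 12 hour maximum SQS visibility timeout).

```python
# Figures quoted on the slides; the function names are illustrative assumptions.
API_GATEWAY_INTEGRATION_TIMEOUT_S = 29
SQS_MAX_VISIBILITY_TIMEOUT_S = 12 * 60 * 60


def recommended_visibility_timeout(lambda_timeout_s: int, factor: int = 6) -> int:
    """Return factor x the Lambda timeout, capped at the SQS maximum."""
    if lambda_timeout_s <= 0:
        raise ValueError("lambda_timeout_s must be positive")
    return min(lambda_timeout_s * factor, SQS_MAX_VISIBILITY_TIMEOUT_S)


def exceeds_api_gateway_timeout(lambda_timeout_s: int) -> bool:
    """True when the function may still be running after API Gateway gives up."""
    return lambda_timeout_s > API_GATEWAY_INTEGRATION_TIMEOUT_S
```

For example, a function with a 30 second timeout would get a 180 second visibility timeout, and a function behind API Gateway with a 60 second timeout would be flagged as a mismatch.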
  33. @theburningmonk theburningmonk.com Offload computing operations to queues
  34. @theburningmonk theburningmonk.com Offload computing operations to queues
  35. @theburningmonk theburningmonk.com Offload computing operations to queues better absorb downstream problems
  36. @theburningmonk theburningmonk.com Offload computing operations to queues: need a way to replay DLQ events
  37. https://www.npmjs.com/package/lumigo-cli
  38. @theburningmonk theburningmonk.com Offload computing operations to queues great for fire-and-forget tasks
  39. @theburningmonk theburningmonk.com “what if the client is waiting for a response?”
  40. @theburningmonk theburningmonk.com “Decoupled Invocation”
  41. @theburningmonk theburningmonk.com task results table [task id | created at | result]: [xxx | xxx | <null>], [xxx | xxx | <null>], … not ready…
  42. @theburningmonk theburningmonk.com task results table [task id | created at | result]: [xxx | xxx | <null>], [xxx | xxx | <null>], … not ready… 202
  43. @theburningmonk theburningmonk.com task results table [task id | created at | result]: [xxx | xxx | <null>], [xxx | xxx | <null>], … reporting for duty!
  44. @theburningmonk theburningmonk.com task results table [task id | created at | result]: [xxx | xxx | <null>], [xxx | xxx | <null>], … working hard… not ready…
  45. @theburningmonk theburningmonk.com task results table [task id | created at | result]: [xxx | xxx | <null>], [xxx | xxx | <null>], … 202 working hard…
  46. @theburningmonk theburningmonk.com task results table [task id | created at | result]: [xxx | xxx | <null>], [xxx | xxx | { … }], … done!
  47. @theburningmonk theburningmonk.com task results table [task id | created at | result]: [xxx | xxx | <null>], [xxx | xxx | { … }], … done!
  48. @theburningmonk theburningmonk.com task results table [task id | created at | result]: [xxx | xxx | <null>], [xxx | xxx | { … }], … 200 { … }
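The "Decoupled Invocation" flow on these slides (accept the task, answer 202 while the result is still null, answer 200 once it is written) can be sketched as follows. This is an in-memory stand-in for the task results table, not the talk's implementation; `TaskStore` and its method names are hypothetical.

```python
import time
import uuid
from typing import Any, Dict, Optional, Tuple


class TaskStore:
    """In-memory stand-in for the task results table (task id, created at, result)."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Dict[str, Any]] = {}

    def submit(self) -> Tuple[int, str]:
        """Record a new task with a null result and acknowledge it with 202."""
        task_id = str(uuid.uuid4())
        self._tasks[task_id] = {"created_at": time.time(), "result": None}
        return 202, task_id

    def complete(self, task_id: str, result: Any) -> None:
        """Called by the worker once the task has finished."""
        self._tasks[task_id]["result"] = result

    def poll(self, task_id: str) -> Tuple[int, Optional[Any]]:
        """Return 202 while the result is still null, 200 with the result after."""
        task = self._tasks.get(task_id)
        if task is None:
            return 404, None
        if task["result"] is None:
            return 202, None
        return 200, task["result"]
```

The client keeps polling with the task id it received; in this sketch a task whose real result is `None` cannot be told apart from an unfinished one, so a production table would likely carry an explicit status column.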
  49. @theburningmonk theburningmonk.com wait…
  50. @theburningmonk theburningmonk.com a distributed transaction!
  51. @theburningmonk theburningmonk.com a distributed transaction! needs rollback
  52. @theburningmonk theburningmonk.com no distributed transactions
  53. @theburningmonk theburningmonk.com do the work here
  54. @theburningmonk theburningmonk.com retry-until-success
  55. @theburningmonk theburningmonk.com
  56. @theburningmonk theburningmonk.com 24 hours data retention
  57. @theburningmonk theburningmonk.com 24 hours data retention: need alerting to ensure issues are addressed quickly
  58. @theburningmonk theburningmonk.com retry-until-success needs to deal with poison messages
  59. @theburningmonk theburningmonk.com what if you can’t avoid distributed transactions?
  60. @theburningmonk theburningmonk.com The Saga pattern: a pattern for managing failures where each action has a compensating action for rollback
  61. @theburningmonk theburningmonk.com The Saga pattern: https://www.youtube.com/watch?v=xDuwrtwYHu8
  62. @theburningmonk theburningmonk.com The Saga pattern: Begin transaction; Start book hotel request; End book hotel request; Start book flight request; End book flight request; Start book car rental request; End book car rental request; End transaction
  63. @theburningmonk theburningmonk.com The Saga pattern: model both actions and compensating actions as Lambda functions
  64. @theburningmonk theburningmonk.com The Saga pattern: use Step Functions as the coordinator for the saga
  65. @theburningmonk theburningmonk.com The Saga pattern Input
  66. @theburningmonk theburningmonk.com The Saga pattern
  67. @theburningmonk theburningmonk.com The Saga pattern
  68. @theburningmonk theburningmonk.com The Saga pattern
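The saga idea above (every action has a compensating action, and a failure triggers the compensations for the steps that already completed) can be illustrated with a plain Python coordinator. This is only a sketch: the talk uses Step Functions as the coordinator and Lambda functions for the steps, whereas `run_saga` and the `Step` tuple here are hypothetical names for a local stand-in.

```python
from typing import Callable, List, Tuple

# (name, action, compensating action); all three are illustrative assumptions.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order.

    Returns True if every action succeeded, False if the saga was rolled back.
    """
    completed: List[Step] = []
    for step in steps:
        _name, action, _compensate = step
        try:
            action()
        except Exception:
            # Roll back only what actually completed, newest first.
            for _done_name, _done_action, compensate in reversed(completed):
                compensate()
            return False
        completed.append(step)
    return True
```

With the hotel, flight, and car rental example from the slides, a failure while booking the car would cancel the flight and then the hotel. A real coordinator also has to handle compensations that fail themselves, which this sketch does not attempt.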
  69. @theburningmonk theburningmonk.com retry-until-success needs to deal with poison messages: Mind the poison message
  70. @theburningmonk theburningmonk.com Mind the poison message
  71. @theburningmonk theburningmonk.com Mind the poison message
  72. @theburningmonk theburningmonk.com Mind the poison message
  73. @theburningmonk theburningmonk.com Mind the poison message 6, 3, 1, 1, 1, 1, …
  74. @theburningmonk theburningmonk.com Mind the poison message 6, 3, 1, 1, 1, 1, … only count the “same” batch
  75. @theburningmonk theburningmonk.com Mind the poison message
  76. @theburningmonk theburningmonk.com Mind the poison message: have to fetch from the stream
  77. @theburningmonk theburningmonk.com Mind the poison message: have to fetch from the stream; do it before they expire from the stream!
  78. @theburningmonk theburningmonk.com how do you prevent building up an insurmountable backlog?
  79. @theburningmonk theburningmonk.com Load shedding: implement load shedding; prioritize newer messages with a better chance to succeed
  80. @theburningmonk theburningmonk.com Load shedding: excess load is sent to DLQ
  81. @theburningmonk theburningmonk.com Load shedding: process with a delay
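One way to read the load shedding slides is as an age check on each message: recent messages are processed, and anything older is diverted to a DLQ for delayed processing. The sketch below assumes exactly that; the age threshold, the `Message` shape, and the `process` / `send_to_dlq` callbacks are hypothetical and not taken from the talk.

```python
from typing import Callable, Iterable, Tuple

# (enqueued_at in epoch seconds, body); an assumed message shape for illustration.
Message = Tuple[float, str]


def shed_load(
    messages: Iterable[Message],
    now: float,
    max_age_s: float,
    process: Callable[[str], None],
    send_to_dlq: Callable[[str], None],
) -> Tuple[int, int]:
    """Process messages younger than max_age_s and divert the rest to the DLQ.

    Returns (processed_count, shed_count).
    """
    processed = 0
    shed = 0
    for enqueued_at, body in messages:
        if now - enqueued_at > max_age_s:
            send_to_dlq(body)  # excess load: handle later, with a delay
            shed += 1
        else:
            process(body)  # newer message: better chance to succeed
            processed += 1
    return processed, shed
```

Passing `now` in explicitly keeps the function deterministic and easy to test; a handler would typically pass `time.time()`.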
  82. @theburningmonk theburningmonk.com Mind the partial failures: Lambda, SQS
  83. @theburningmonk theburningmonk.com Mind the partial failures: Lambda, SQS, Poller
  84. @theburningmonk theburningmonk.com Mind the partial failures: Lambda, SQS, Poller; Delete
  85. @theburningmonk theburningmonk.com Mind the partial failures: Lambda, SQS, Poller; Error
  86. @theburningmonk theburningmonk.com Mind the partial failures: Lambda, SQS, Poller; Error, DLQ
  87. @theburningmonk theburningmonk.com Mind the partial failures: Lambda, SQS, Poller; Error, DLQ; batch fails as a unit
  88. https://lumigo.io/blog/sqs-and-lambda-the-missing-guide-on-failure-modes Mind the partial failures
  89. @theburningmonk theburningmonk.com Mind the partial failures
  90. @theburningmonk theburningmonk.com Mind the partial failures
  91. @theburningmonk theburningmonk.com Mind the partial failures
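Because an SQS batch fails as a unit, one bad message causes the whole batch to be retried. One option, which is not necessarily the approach shown in the talk, is to report only the failed message ids back to Lambda. The sketch below assumes the SQS event source mapping has `ReportBatchItemFailures` enabled; without that setting the return value is ignored and a normal return is treated as success for the whole batch. `make_handler` and `process_record` are hypothetical names.

```python
import json
from typing import Any, Callable, Dict, List


def make_handler(process_record: Callable[[Dict[str, Any]], None]):
    """Build an SQS-triggered Lambda handler that reports per-message failures."""

    def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, List[Dict[str, str]]]:
        failures: List[Dict[str, str]] = []
        for record in event.get("Records", []):
            try:
                process_record(json.loads(record["body"]))
            except Exception:
                # Only this message becomes visible again; the rest are deleted.
                failures.append({"itemIdentifier": record["messageId"]})
        return {"batchItemFailures": failures}

    return handler
```

Returning an empty `batchItemFailures` list signals that every message in the batch was handled.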
  92. @theburningmonk theburningmonk.com Mind the retry storm Service A
  93. @theburningmonk theburningmonk.com Mind the retry storm Service A
  94. @theburningmonk theburningmonk.com Mind the retry storm Service A retry retry retry retry
  95. @theburningmonk theburningmonk.com Mind the retry storm Service A
  96. @theburningmonk theburningmonk.com Mind the retry storm Service A
  97. @theburningmonk theburningmonk.com Mind the retry storm Service A
  98. @theburningmonk theburningmonk.com Mind the retry storm Service A
  99. @theburningmonk theburningmonk.com retry storm
  100. @theburningmonk theburningmonk.com circuit breaker pattern: After X consecutive timeouts, trip the circuit
  101. @theburningmonk theburningmonk.com circuit breaker pattern: After X consecutive timeouts, trip the circuit. When circuit is open, fail fast
  102. @theburningmonk theburningmonk.com circuit breaker pattern: After X consecutive timeouts, trip the circuit. When circuit is open, fail fast, but allow 1 request through every Y mins
  103. @theburningmonk theburningmonk.com circuit breaker pattern: After X consecutive timeouts, trip the circuit. When circuit is open, fail fast, but allow 1 request through every Y mins. If request succeeds, close the circuit
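The four rules on these slides map onto a small state machine. The following is a minimal sketch that keeps its state in process memory; `CircuitBreaker`, `failure_threshold` (the slides' X), and `probe_interval_s` (the slides' Y, in seconds) are hypothetical names. As a simplification it counts any exception as a failure rather than timeouts only.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised instead of calling the downstream service while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int, probe_interval_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.probe_interval_s = probe_interval_s
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def call(self, fn: Callable[[], Any]) -> Any:
        # Open circuit: fail fast unless it is time to let one probe through.
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.probe_interval_s:
                raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = fn()
        except Exception:
            self._consecutive_failures += 1
            if self.is_open or self._consecutive_failures >= self.failure_threshold:
                # Trip the circuit, or restart the wait after a failed probe.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._consecutive_failures = 0
        self._opened_at = None
        return result
```

The injectable `clock` exists only so the behaviour can be tested without sleeping. Each Lambda container would hold its own instance of this object, so each one trips independently.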
  104. @theburningmonk theburningmonk.com
  105. @theburningmonk theburningmonk.com where do I keep the state of the circuit?
  106. @theburningmonk theburningmonk.com in-memory Service A isOpen: false isOpen: false isOpen: false isOpen: false
  107. @theburningmonk theburningmonk.com in-memory Service A isOpen: true isOpen: false isOpen: true isOpen: false
  108. @theburningmonk theburningmonk.com in-memory PROS: simplicity
  109. @theburningmonk theburningmonk.com in-memory PROS: simplicity; no dependency on external service, which requires another circuit breaker to protect… and brings cost & maintenance overhead (IAM, infra, etc.)
  110. @theburningmonk theburningmonk.com in-memory PROS: simplicity; no dependency on external service. CONS: takes longer & more requests to stop all traffic
  111. @theburningmonk theburningmonk.com in-memory PROS: simplicity; no dependency on external service. CONS: takes longer & more requests to stop all traffic; new containers would generate more traffic
  112. @theburningmonk theburningmonk.com external service: Service A, isOpen: false
  113. @theburningmonk theburningmonk.com external service: Service A, isOpen: true
  114. @theburningmonk theburningmonk.com external service: Service A, isOpen: true
  115. @theburningmonk theburningmonk.com external service PROS: minimizes no. of total requests to trip circuit; new containers respect collective decision. CONS: complexity; dependency on an external service
  116. @theburningmonk theburningmonk.com which approach should I use?
  117. @theburningmonk theburningmonk.com which approach should I use? It depends. Maybe start with the simplest solution first?
  118. @theburningmonk theburningmonk.com Lambda autoscaling. Burst concurrency limits: 3000 – US West (Oregon), US East (N. Virginia), Europe (Ireland); 1000 – Asia Pacific (Tokyo), Europe (Frankfurt); 500 – other Regions. After the initial burst: 500 new instances per minute
  119. @theburningmonk theburningmonk.com Lambda autoscaling. Burst concurrency limits: 3000 – US West (Oregon), US East (N. Virginia), Europe (Ireland); 1000 – Asia Pacific (Tokyo), Europe (Frankfurt); 500 – other Regions. After the initial burst: 500 new instances per minute. Standard burst concurrency limits apply when over the provisioned capacity
  120. @theburningmonk theburningmonk.com Lambda autoscaling. Burst concurrency limits: 3000 – US West (Oregon), US East (N. Virginia), Europe (Ireland); 1000 – Asia Pacific (Tokyo), Europe (Frankfurt); 500 – other Regions. After the initial burst: 500 new instances per minute. Adjustable provisioned capacity based on CloudWatch metrics. Standard burst concurrency limits apply when over the provisioned capacity
  121. @theburningmonk theburningmonk.com Lambda limitations & throttling. Concurrent executions: 1000*; Timeout: 15 minutes; Burst concurrency: 500 - 3000; After the initial burst: 500 new instances per minute. * Can be increased with support ticket
  122. @theburningmonk theburningmonk.com Lambda limitations & throttling: good for spiky traffic, up to a point. Concurrent executions: 1000*; Timeout: 15 minutes; Burst concurrency: 500 - 3000; After the initial burst: 500 new instances per minute. * Can be increased with support ticket
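To get a feel for "up to a point", the scaling figures on these slides can be turned into a rough estimate of how long Lambda needs to reach a target concurrency. This is a back-of-the-envelope sketch using only the numbers quoted on the slides (an initial burst of 500 to 3000, then 500 new instances per minute); the function name is hypothetical, the figures may not match current AWS quotas, and the account-level concurrent executions limit is ignored.

```python
import math


def minutes_to_reach(target_concurrency: int, initial_burst: int, per_minute: int = 500) -> int:
    """Whole minutes needed to scale from the initial burst to the target concurrency."""
    if target_concurrency <= initial_burst:
        return 0  # the initial burst already covers the target
    return math.ceil((target_concurrency - initial_burst) / per_minute)
```

For instance, reaching 5000 concurrent executions in a region with a 3000 burst limit would take about 4 minutes, while the same target in a region with a 500 burst limit would take about 9.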
  123. @theburningmonk theburningmonk.com “what if my traffic is more spiky than that?”
  124. @theburningmonk theburningmonk.com Scenario: predictable spikes. Holidays, weekends, celebrations (Black Friday); planned launch of resources (new series available); sport events
  125. @theburningmonk theburningmonk.com Scenario: predictable spikes scheduled auto-scaling
  126. @theburningmonk theburningmonk.com Scenario: predictable spikes. scheduled auto-scaling; the burst limits still apply, so take the timing into account
  127. @theburningmonk theburningmonk.com Scenario: predictable spikes
  128. @theburningmonk theburningmonk.com Scenario: unpredictable spikes. Traffic generated by user actions (Jennifer Aniston’s first post)
  129. @theburningmonk theburningmonk.com “if Lambda scaling is the problem…”
  130. @theburningmonk theburningmonk.com Client only needs an acknowledgement