Beware the potholes

Looking in from the outside, serverless seems so simple! And yet, many companies are struggling on their journey to serverless. In this talk, I highlight a number of mistakes companies are making when they adopt serverless.

Published in: Technology

  1. 1. MIND THE POTHOLES
  2. 2. Yan Cui http://theburningmonk.com @theburningmonk AWS user for 10 years
  3. 3. http://bit.ly/yubl-serverless
  4. 4. Yan Cui http://theburningmonk.com @theburningmonk Developer Advocate @
  5. 5. Yan Cui http://theburningmonk.com @theburningmonk Independent Consultant
  6. 6. What do you mean by ‘serverless’?
  7. 7. “Serverless”
  8. 8. Gojko Adzic It is serverless the same way WiFi is wireless. http://bit.ly/2yQgwwb
  9. 9. Serverless means… don’t pay for it if no-one uses it; don’t need to worry about scaling; don’t need to provision and manage servers
  10. 10. in other words, it’s a lot like taking a cab
  11. 11. Ownership Fuel Navigate To get there! Focus on getting there!
  12. 12. HW Ownership OS Runtime & Scale Code Focus on getting there! Physical Servers Virtual Machines Containers Serverless
  13. 13. Nano Services Self Managed Cost Paradigm Change Async Dynamic agile env
  14. 14. “why are we failing at this?”
  15. 15. hidden dangers
  16. 16. monolith microservices serverless
  17. 17. monolith microservices serverless observability distributed systems bounded context
  18. 18. monolith microservices serverless observability distributed systems bounded context
  19. 19. monolith microservices serverless observability distributed systems bounded context event driven
  20. 20. monolith serverless missing learnings from microservices
  21. 21. monolith serverless missing learnings from microservices poor decisions
  22. 22. #1 not letting go of legacy thinking
  23. 23. “we’re doing serverless, but why aren’t things going faster?”
  24. 24. Socio Technical
  25. 25. there are no silver bullets
  26. 26. centralised team Team A Team B Team C Team D …
  27. 27. “but the developers don’t understand AWS and how our infrastructure is set up”
  28. 28. “but the developers don’t understand AWS and how our infrastructure is set up” let’s solve this problem instead!
  29. 29. what got you here won’t get you there
  30. 30. Monolithic Functions: if (path == "/user" && method == "GET") { return getUser(…); } else if (path == "/user" && method == "DELETE") { return deleteUser(…); } else if (path == "/user" && method == "POST") { return createUser(…); } else if ….
  31. 31. GET /user POST /user DELETE /user Single-Purposed Functions
  32. 32. author: yan.cui feature: user-api user-api-dev Monolithic Single-Purposed author: yan.cui feature: user-api user-api-dev-get-user author: yan.cui feature: user-api user-api-dev-create-user author: yan.cui feature: user-api user-api-dev-delete-user
  33. 33. author: yan.cui feature: user-api user-api-dev Monolithic Single-Purposed author: yan.cui feature: user-api user-api-dev-get-user author: yan.cui feature: user-api user-api-dev-create-user author: yan.cui feature: user-api user-api-dev-delete-user find related functions by prefix
  34. 34. author: yan.cui feature: user-api user-api-dev Monolithic Single-Purposed author: yan.cui feature: user-api user-api-dev-get-user author: yan.cui feature: user-api user-api-dev-create-user author: yan.cui feature: user-api user-api-dev-delete-user discoverability (without having to dig into the code)
  35. 35. author: yan.cui feature: user-api user-api-dev Monolithic Single-Purposed author: yan.cui feature: user-api user-api-dev-get-user author: yan.cui feature: user-api user-api-dev-create-user author: yan.cui feature: user-api user-api-dev-delete-user what does it do?
  36. 36. author: yan.cui feature: user-api user-api-dev Monolithic Single-Purposed author: yan.cui feature: user-api user-api-dev-get-user author: yan.cui feature: user-api user-api-dev-create-user author: yan.cui feature: user-api user-api-dev-delete-user dynamodb:GetItem dynamodb:PutItem dynamodb:DeleteItem
  37. 37. author: yan.cui feature: user-api user-api-dev Monolithic Single-Purposed author: yan.cui feature: user-api user-api-dev-get-user author: yan.cui feature: user-api user-api-dev-create-user author: yan.cui feature: user-api user-api-dev-delete-user dynamodb:GetItem dynamodb:PutItem dynamodb:DeleteItem no least privilege…
  38. 38. author: yan.cui feature: user-api user-api-dev Monolithic Single-Purposed author: yan.cui feature: user-api user-api-dev-get-user author: yan.cui feature: user-api user-api-dev-create-user author: yan.cui feature: user-api user-api-dev-delete-user require(x) require(y) require(z)
  39. 39. author: yan.cui feature: user-api user-api-dev Monolithic Single-Purposed author: yan.cui feature: user-api user-api-dev-get-user author: yan.cui feature: user-api user-api-dev-create-user author: yan.cui feature: user-api user-api-dev-delete-user require(x) require(y) require(z)
  40. 40. more dependencies equals slower cold start
  41. 41. author: yan.cui feature: user-api user-api-dev Monolithic Single-Purposed author: yan.cui feature: user-api user-api-dev-get-user author: yan.cui feature: user-api user-api-dev-create-user author: yan.cui feature: user-api user-api-dev-delete-user require(x) require(y) require(z) worse cold start performance
  42. 42. keep functions simple, and single-purposed
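
To make the contrast with the monolithic `if/else` router on slide 30 concrete, here is a minimal sketch of single-purposed handlers. It assumes an API Gateway proxy-style event; the in-memory `users` map and the `id` path parameter are illustrative stand-ins that do not appear in the slides. Each handler would be deployed as its own function (for example `user-api-dev-get-user`) and would need only its own permission, such as `dynamodb:GetItem`.

```javascript
// In-memory stand-in for a real data store (assumption, for illustration only).
const users = new Map();

// GET /user: only reads, so it would need only a read permission.
async function getUser(event) {
  const user = users.get(event.pathParameters.id);
  return user
    ? { statusCode: 200, body: JSON.stringify(user) }
    : { statusCode: 404, body: JSON.stringify({ message: "not found" }) };
}

// POST /user: only writes.
async function createUser(event) {
  const user = JSON.parse(event.body);
  users.set(user.id, user);
  return { statusCode: 201, body: JSON.stringify(user) };
}

// DELETE /user: only deletes.
async function deleteUser(event) {
  users.delete(event.pathParameters.id);
  return { statusCode: 204, body: "" };
}
```

Because each function loads only the code it uses, this layout also keeps per-function dependencies small, which is the cold start point made on slides 40 and 41.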
  43. 43. #2 one account that rules them all
  44. 44. mind the shared limits
  45. 45. Resource Limits: no. of DynamoDB tables, no. of API Gateway regional APIs, no. of API Gateway edge-optimized APIs, no. of Kinesis shards, no. of IAM roles, no. of S3 buckets, no. of CloudFormation stacks, no. of SNS subscription filters, no. of SSM parameters, …
  46. 46. Throughput Limits: DynamoDB read & write, API Gateway requests/second, Lambda concurrent executions, SSM parameter ops/second, …
  47. 47. One account per Team per Environment
  48. 48. compartmentalise security breaches
  49. 49. #3 do first, research later
  50. 50. https://einaregilsson.com/serverless-15-percent-slower-and-eight-times-more-expensive/
  51. 51. the platforms need to do better at educating users on how to choose between different services
  52. 52. SNS vs SQS vs Kinesis vs MSK? the platforms need to do better at educating users on how to choose between different services
  53. 53. Kinesis vs SQS vs SNS
      feature        | Kinesis               | SQS                            | SNS
      ordering       | by shard              | none (standard), global (FIFO) | none
      replay events  | up to 7 days          | none                           | none
      mode           | batched               | batched (up to 10)             | singular
      retry          | retried until success | retry + DLQ                    | retry + DLQ
      concurrency    | 1 per shard           | auto-scaled                    | fan-out!!!
      subscribers    | many                  | one-to-one                     | many
  54. 54. https://medium.com/theburningmonk-com/all-my-posts-on-serverless-aws-lambda-43c17a147f91
  55. 55. https://www.jeremydaly.com/newsletter/
  56. 56. #4 not using a deployment toolkit
  57. 57. https://lumigo.io/blog/comparison-of-lambda-deployment-frameworks/
  58. 58. don’t write your own deployment framework
  59. 59. #5 console-driven development
  60. 60. #6 one repo per function
  61. 61. github repo github repo github repo github repo github repo github repo github repo github repo github repo
  62. 62. github repo github repo github repo github repo github repo github repo github repo github repo github repo
  63. 63. monorepo?
  64. 64. github repo
  65. 65. https://lumigo.io/blog/mono-repo-vs-one-per-service/
  66. 66. one repo per service?
  67. 67. github repo github repo github repo github repo user-api timeline-api relationship-api search-api
  68. 68. CI/CD pipeline per service
  69. 69. functions are deployed together, as a stack
  70. 70. #7 unencrypted secrets in env vars
  71. 71. secrets should NEVER be in plain text in env variables
  72. 72. SSM Parameter Store Secret 1 Secret 2 IAM Environment: SECRET_1: … SECRET_2: … Environment: SECRET_1: … SECRET_2: …
  73. 73. SSM Parameter Store Secret 1 Secret 2 IAM Environment: SECRET_1: … SECRET_2: … Environment: SECRET_1: … SECRET_2: … yay!
  74. 74. SSM Parameter Store Secret 1 Secret 2 IAM fetch at cold start, cache, invalidate every x mins
  75. 75. https://github.com/middyjs/middy
  76. 76. SSM Parameter Store Secret 1 Secret 2 IAM switch to Higher Throughput if you need more than 40 ops/s
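
The "fetch at cold start, cache, invalidate every x mins" pattern from slide 74 can be sketched as below. This is a minimal illustration, not a real SSM client: `fetchSecrets` is a hypothetical async function supplied by the caller (in practice it would read from SSM Parameter Store), and `now` is injectable only so that expiry can be exercised without waiting.

```javascript
// Returns a loader that fetches secrets on first use, keeps them in memory,
// and refetches once ttlMs has elapsed.
// `fetchSecrets` is a hypothetical caller-supplied async function.
function createSecretLoader(fetchSecrets, ttlMs, now = Date.now) {
  let cached = null;
  let expiresAt = 0;

  return async function loadSecrets() {
    if (cached === null || now() >= expiresAt) {
      cached = await fetchSecrets(); // first call in a container, or cache expired
      expiresAt = now() + ttlMs;
    }
    return cached;
  };
}
```

Creating the loader at module scope, outside the handler, lets the cache survive across invocations served by the same container, so the secrets never need to sit in plain-text environment variables. Libraries such as middy (slide 75) package this kind of logic as middleware.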
  77. 77. #8 not following least privilege principle
  78. 78. #9 missing DLQs
  79. 79. async: S3, SNS, SES, CloudFormation, CloudWatch Logs, CloudWatch Events, Scheduled Events, CodeCommit, AWS Config; sync: Cognito, Alexa, Lex, API Gateway; pulling: DynamoDB Stream, Kinesis Stream, SQS (http://amzn.to/2v7Kc3b)
  80. 80. async, where Lambda handles retries (twice, then DLQ): S3, SNS, SES, CloudFormation, CloudWatch Logs, CloudWatch Events, Scheduled Events, CodeCommit, AWS Config; sync: Cognito, Alexa, Lex, API Gateway; pulling: DynamoDB Stream, Kinesis Stream, SQS (http://amzn.to/2vs2lIg)
  81. 81. configure DLQ for async functions so you don’t lose failed events
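
The behaviour described on slide 80 (retried twice, then sent to the DLQ) can be modelled locally as follows. This is a simulation for intuition only, not AWS code: `invokeAsync` is a hypothetical name, the handler is synchronous for brevity, and the dead-letter queue is a plain array.

```javascript
// Local model of async invocation: one initial attempt plus `maxRetries`
// retries; if every attempt throws, the event is pushed to `dlq`.
function invokeAsync(handler, event, dlq, maxRetries = 2) {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return { ok: true, attempts: attempt + 1, result: handler(event) };
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  dlq.push({ event, error: String(lastError) });
  return { ok: false, attempts: maxRetries + 1 };
}
```

In this model, if no `dlq` were kept, the failed event would simply disappear after the last retry, which is the loss the slide warns about.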
  82. 82. #10 too much/too little concurrency
  83. 83. “Lambda generates too much load for the downstream system”
  84. 84. one invocation per message SNS Lambda
  85. 85. Downstream System SNS Lambda
  86. 86. Kinesis vs SQS vs SNS
      feature        | Kinesis               | SQS                            | SNS
      ordering       | by shard              | none (standard), global (FIFO) | none
      replay events  | up to 7 days          | none                           | none
      mode           | batched               | batched (up to 10)             | singular
      retry          | retried until success | retry + DLQ                    | retry + DLQ
      concurrency    | 1 per shard           | auto-scaled                    | fan-out!!!
      subscribers    | many                  | one-to-one                     | many
  87. 87. if you want… maximum throughput SNS precise control over throughput Kinesis
  88. 88. if you want… maximum throughput SNS precise control over throughput Kinesis how quickly it scales out
  89. 89. if you want… maximum throughput SNS precise control over throughput Kinesis how quickly it scales out SQS DynamoDB Streams
  90. 90. Kinesis vs SQS vs SNS
      feature        | Kinesis               | SQS                            | SNS
      ordering       | by shard              | none (standard), global (FIFO) | none
      replay events  | up to 7 days          | none                           | none
      mode           | batched               | batched (up to 10)             | singular
      retry          | retried until success | retry + DLQ                    | retry + DLQ
      concurrency    | 1 per shard           | auto-scaled                    | fan-out!!!
      subscribers    | many                  | one-to-one                     | many
  91. 91. #11 cold starts
  92. 92. “cold starts only happen to the first request”
  93. 93. function invocation; concurrent execution (i.e. a container)
  94. 94. function invocation = method call; concurrent execution (i.e. a container) = class instance
  95. 95. Lambda scales the number of concurrent executions based on traffic
  96. 96. existing “containers” are reused where possible
  97. 97-106. [timeline diagram built up across ten slides: axis "time", with a growing number of "invocation" bars]
  107. 107. time invocation invocation ping invocation invocation invocation ping ping
  108. 108. Lambda warmers don’t work when you have > 1 concurrent executions
  109. 109. FREQUENCY DURATION
  110. 110. FREQUENCY DURATION dictated by user traffic, out of your control
  111. 111. cold starts are generally not an issue if you have a steady traffic pattern
  112. 112. time req/s
  113. 113. time req/s El Clásico
  114. 114. time req/s lunch dinner
  115. 115. FREQUENCY DURATION optimize this!
  116. 116. minimise the duration of cold starts so they fall within acceptable latency range
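
The reuse model from slides 93 to 96 explains why "cold starts only happen to the first request" is a myth, and it can be simulated in a few lines. The sketch below rests on simplifying assumptions that are mine, not the deck's: an idle container is always reused, containers never expire, and `countColdStarts` is a hypothetical helper name.

```javascript
// Counts cold starts for a list of invocations, each { start, duration }.
// A new container (a cold start) is created only when every existing
// container is still busy at the moment the invocation arrives.
function countColdStarts(invocations) {
  const busyUntil = []; // one entry per container: the time it becomes free
  let coldStarts = 0;

  const sorted = [...invocations].sort((a, b) => a.start - b.start);
  for (const { start, duration } of sorted) {
    const idle = busyUntil.findIndex((freeAt) => freeAt <= start);
    if (idle === -1) {
      busyUntil.push(start + duration); // nothing idle: cold start
      coldStarts++;
    } else {
      busyUntil[idle] = start + duration; // reuse a warm container
    }
  }
  return coldStarts;
}
```

Under this model, back-to-back requests share one container, while a burst of overlapping requests incurs one cold start per concurrent execution. That is also why a single warmer ping cannot keep more than one container warm.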
  117. 117. #12 RDS connection handling
  118. 118. default RDS configs are bad for Lambda
  119. 119. default RDS configs are bad for Lambda idle connections are not closed too many connections per “container” max open connection is too low
  120. 120. https://www.jeremydaly.com/manage-rds-connections-aws-lambda/
  121. 121. set “wait_timeout” and “interactive_timeout” to 10 mins (default is 8 hours!)
  122. 122. increase “max_connections” setting
  123. 123. set client socket pool size to 1
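
One way to keep to a single connection per container, in the spirit of the pool-size-of-1 advice above, is to open the connection lazily and reuse it across invocations. The sketch below is illustrative only: `createConnection` is a hypothetical injected factory (for example a thin wrapper around a MySQL client), and the `query(sql, params)` shape and the SQL statement are assumptions rather than a specific library's API.

```javascript
// Wraps a handler around a single, lazily created connection.
// `createConnection` is a hypothetical async factory supplied by the caller.
function createHandler(createConnection) {
  let connection = null; // lives for as long as the container does

  return async function handler(event) {
    if (connection === null) {
      connection = await createConnection(); // only on the first invocation
    }
    return connection.query("SELECT name FROM users WHERE id = ?", [event.userId]);
  };
}
```

A production version would also need to detect and replace a connection that the server has closed, which becomes more likely once `wait_timeout` is lowered as suggested on slide 121.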
  124. 124. #13 (lack of) observability
  125. 125. happened system repaired user impact reduce MTTR
  126. 126. Identify & Resolve Issues Understanding costs Visibility
  127. 127. Identify & Resolve Issues Understanding costs Visibility
  128. 128. happened system repaired user impact MTTDiscovery
  129. 129. “What alerts should I have?”
  130. 130. It depends on what you’re building…
  131. 131. But, this is a good starting point
  132. 132. Lambda: error rate %, throttle count, DLQ error count, iterator age, regional concurrency
  133. 133. Lambda: error rate %, throttle count, DLQ error count, iterator age, regional concurrency; API Gateway: p90/95/99 latency, success rate %, 4xx rate %, 5xx rate %
  134. 134. API Gateway: p90/95/99 latency, success rate %, 4xx rate %, 5xx rate %; SQS: message age; Lambda: error rate %, throttle count, DLQ error count, iterator age, regional concurrency
  135. 135. API Gateway: p90/95/99 latency, success rate %, 4xx rate %, 5xx rate %; SQS: message age; Step Functions: failed count, throttle count, timed out count; Lambda: error rate %, throttle count, DLQ error count, iterator age, regional concurrency
  136. 136. SQS: message age; Step Functions: failed count, throttle count, timed out count; API Gateway: p90/95/99 latency, success rate %, 4xx rate %, 5xx rate %; Lambda: error rate %, throttle count, DLQ error count, iterator age, regional concurrency
  137. 137. “Can’t you codify these?”
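
As one possible answer to that question, the Lambda part of the starting list can be expressed as data and expanded into alarm definitions per function. This is a sketch under assumptions: the field names follow the shape of CloudWatch alarm parameters, the metric names are the CloudWatch Lambda metrics that appear to match the slide items, and `buildLambdaAlarms`, the period, and all thresholds are hypothetical values chosen by the caller. Error rate % is left out because it needs a ratio of two metrics rather than a single one.

```javascript
// Slide items mapped to CloudWatch Lambda metrics (mapping is an assumption).
const LAMBDA_ALERTS = [
  { key: "errors", metric: "Errors", statistic: "Sum" },
  { key: "throttles", metric: "Throttles", statistic: "Sum" },
  { key: "dlqErrors", metric: "DeadLetterErrors", statistic: "Sum" },
  { key: "iteratorAge", metric: "IteratorAge", statistic: "Maximum" },
];

// Builds one alarm definition per alert that has a threshold configured.
function buildLambdaAlarms(functionName, thresholds) {
  return LAMBDA_ALERTS
    .filter(({ key }) => thresholds[key] !== undefined)
    .map(({ key, metric, statistic }) => ({
      AlarmName: `${functionName}-${key}`,
      Namespace: "AWS/Lambda",
      MetricName: metric,
      Dimensions: [{ Name: "FunctionName", Value: functionName }],
      Statistic: statistic,
      Period: 60, // seconds; placeholder value
      EvaluationPeriods: 1,
      Threshold: thresholds[key],
      ComparisonOperator: "GreaterThanThreshold",
    }));
}
```

For example, `buildLambdaAlarms("user-api-dev-get-user", { errors: 0, throttles: 0 })` yields two definitions that a deployment step could turn into real alarms for every function in a stack.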
  138. 138. https://theburningmonk.com/hire-me Advise Training Delivery “Fundamentally, Yan has improved our team by increasing our ability to derive value from AWS and Lambda in particular.” Nick Blair Tech Lead
  139. 139. @theburningmonk theburningmonk.com github.com/theburningmonk