How to debug slow Lambda response times

One of the most common performance issues in serverless architectures is elevated latency from external services, such as DynamoDB, Elasticsearch or Stripe.

In this webinar, we will show you how to quickly identify and debug these problems, and share some best practices for dealing with poorly performing third-party services.

  1. 1. How to debug slow Lambda response times Yan Cui @theburningmonk Developer Advocate, Lumigo AWS Serverless Hero Author of Production-Ready Serverless
  2. 2. Lambda autoscales by traffic
  3. 3. multi-AZ by default
  4. 4. MyApiFunction Worker Worker …
  5. 5. overloaded servers are a thing of the past
  6. 6. observation: the majority of performance problems originate from a function’s integration points
  7. 7. macro how well is this service performing in general? micro why did this transaction perform poorly?
  8. 8. macro micro identify systemic issues how well is this service performing in general? why did this transaction perform poorly?
  9. 9. how well is this service performing in general? why did this transaction perform poorly? macro micro why did this user get a bad exp?
  10. 10. In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
  11. 11. In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. what do we need to collect?
  12. 12. Yan Cui http://theburningmonk.com @theburningmonk Developer Advocate @ AWS user since 2009
  13. 13. Yan Cui http://theburningmonk.com @theburningmonk Independent Consultant advise training delivery
  14. 14. API Gateway Lambda DynamoDB
  15. 15. API Gateway Lambda DynamoDB how long did this request take?
  16. 16. what is the state of the world?
  17. 17. In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. what are the most important outputs to collect?
  18. 18. macro micro how well is this service performing in general? why did this transaction perform poorly?
  19. 19. API Gateway API Gateway Lambda Lambda DynamoDB Service A Service B
  20. 20. API Gateway API Gateway Lambda Lambda DynamoDB Service A Service B how long did service B take to respond?
  21. 21. API Gateway API Gateway Lambda Lambda DynamoDB Service A Service B how long did service B take to respond? was DynamoDB slow? was it a cold start? could it be API Gateway?
  22. 22. API Gateway
  23. 23. Lambda
  24. 24. Lambda Duration
  25. 25. Lambda time to create and initialize the worker instance
  26. 26. Lambda bit.ly/2QXNVwc
  27. 27. bit.ly/2WL1uj0 Lambda
  28. 28. Lambda time to create and initialize the worker instance
  29. 29. for API functions, use API Gateway’s IntegrationLatency as a proxy for “total response time from Lambda”
  30. 30. DynamoDB
  31. 31. DynamoDB SuccessfulRequestLatency
  32. 32. “I'm facing this problem now with a lambda that usually takes 25 ms but once a week or so takes > 6000 ms and times out.  The lambda's first step is to load a DynamoDB table that only has 8 items.  I'm at a loss to understand how such a simple query could take so long.”
  33. 33. START
  34. 34. START 1st attempt
  35. 35. START 1st attempt exponential backoff (1)
  36. 36. START 1st attempt exponential backoff (1) 2nd attempt exponential backoff (2)
  37. 37. START 1st attempt exponential backoff (1) 2nd attempt exponential backoff (2) 3rd attempt exponential backoff (3)
  38. 38. START 1st attempt exponential backoff (1) 2nd attempt exponential backoff (2) 3rd attempt exponential backoff (3) 4th attempt success!
  39. 39. START 1st attempt exponential backoff (1) 2nd attempt exponential backoff (2) 3rd attempt exponential backoff (3) 4th attempt success! SuccessfulRequestLatency
  40. 40. JavaScript AWS SDK 10 retries Initial exponential backoff of 50ms delay = Math.random() * (Math.pow(2, retryCount) * base) this is Marc Brooker’s fav formula!
  41. 41. 10 retries Initial exponential backoff of 50ms delay = Math.random() * (Math.pow(2, retryCount) * base) JavaScript AWS SDK
  42. 42. 10 retries Initial exponential backoff of 50ms delay = Math.random() * (Math.pow(2, retryCount) * base) JavaScript AWS SDK danger zone!
  43. 43. Record client-side latency metrics for IO operations
  44. 44. www.youtube.com/watch?v=adtCwnKApWI
  45. 45. Embedded Metric Format (EMF)
  46. 46. Latency [API Gateway]
  47. 47. IntegrationLatency [API Gateway] Latency [API Gateway]
  48. 48. API Gateway’s latency overhead IntegrationLatency [API Gateway] Latency [API Gateway]
  49. 49. Duration [Lambda] API Gateway’s latency overhead IntegrationLatency [API Gateway] Latency [API Gateway]
  50. 50. Duration [Lambda] Lambda’s allocation time API Gateway’s latency overhead IntegrationLatency [API Gateway] Latency [API Gateway]
  51. 51. SuccessfulRequestLatency [DynamoDB] Duration [Lambda] Lambda’s allocation time API Gateway’s latency overhead IntegrationLatency [API Gateway] Latency [API Gateway]
  52. 52. Caller-side DynamoDB latency [custom metric] SuccessfulRequestLatency [DynamoDB] Duration [Lambda] Lambda’s allocation time API Gateway’s latency overhead IntegrationLatency [API Gateway] Latency [API Gateway]
  53. 53. Caller-side retries (mostly) Caller-side DynamoDB latency [custom metric] SuccessfulRequestLatency [DynamoDB] Duration [Lambda] Lambda’s allocation time API Gateway’s latency overhead IntegrationLatency [API Gateway] Latency [API Gateway]
  54. 54. [Chart: latency (ms) over time for Latency, IntegrationLatency, Duration, caller-side DynamoDB latency and SuccessfulRequestLatency]
  55. 55. [Chart: latency (ms) over time for Latency, IntegrationLatency, Duration, caller-side DynamoDB latency and SuccessfulRequestLatency]
  56. 56. how well is this service performing in general? macro
  57. 57. why did this transaction perform poorly? micro
  58. 58. X-Ray
  59. 59. X-Ray
  60. 60. X-Ray can be encapsulated in custom modules
  61. 61. X-Ray doesn’t add latency
  62. 62. X-Ray doesn’t add latency can see “system” overhead (e.g. allocation time)
  63. 63. X-Ray doesn’t add latency can see “system” overhead (e.g. allocation time) built-in sampling
  64. 64. X-Ray doesn’t add latency can see “system” overhead (e.g. allocation time) built-in sampling X-Ray SDK adds significant overhead
  65. 65. X-Ray doesn’t add latency can see “system” overhead (e.g. allocation time) built-in sampling X-Ray SDK adds significant overhead doesn’t trace TCP traffic (RDS/Elasticache)
  66. 66. X-Ray doesn’t add latency can see “system” overhead (e.g. allocation time) built-in sampling X-Ray SDK adds significant overhead doesn’t trace TCP traffic (RDS/Elasticache) poor support for async event sources (only SNS)
  67. 67. X-Ray doesn’t add latency can see “system” overhead (e.g. allocation time) built-in sampling X-Ray SDK adds significant overhead doesn’t trace TCP traffic (RDS/Elasticache) poor support for async event sources (only SNS) doesn’t capture request & response data
  68. 68. X-Ray doesn’t add latency can see “system” overhead (e.g. allocation time) built-in sampling X-Ray SDK adds significant overhead doesn’t trace TCP traffic (RDS/Elasticache) poor support for async event sources (only SNS) doesn’t capture request & response data logs and traces are separate
  69. 69. X-Ray doesn’t add latency can see “system” overhead (e.g. allocation time) built-in sampling X-Ray SDK adds significant overhead doesn’t trace TCP traffic (RDS/Elasticache) poor support for async event sources (only SNS) doesn’t capture request & response data logs and traces are separate difficult to search
  70. 70. X-Ray good enough for simple workloads when you outgrow X-Ray, look for a 3rd-party tool
  71. 71. answer both macro and micro level questions in just a few clicks!
  72. 72. Support async event sources such as Kinesis, DynamoDB streams and SNS
  73. 73. Support TCP traffic - e.g. RDS and Elasticache
  74. 74. platform.lumigo.io/signup
  75. 75. trace 500K invocations per month for FREE with promo code Yan500 platform.lumigo.io/signup
  76. 76. How to mitigate slow dependencies?
  77. 77. it depends…
  78. 78. can you use another service?
  79. 79. if not, a good caching strategy often helps
  80. 80. bit.ly/3h7Bo41
  81. 81. Client Server 1 Server 2
  82. 82. Client Server 1 Server 2 50ms later
  83. 83. Client Server 1 Server 2
  84. 84. tuning required for each service
  85. 85. helps in some cases but can exacerbate the problem in other cases
  86. 86. can you use another service?
  87. 87. platform.lumigo.io/signup trace 500K invocations per month for FREE with promo code Yan500
  88. 88. @theburningmonk theburningmonk.com github.com/theburningmonk yan@lumigo.io
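
The retry sequence on slides 33 to 42 can be made concrete with a small calculation. The sketch below applies the jittered backoff formula quoted on the slides (10 retries, 50 ms base) and sums the worst-case wait. The function names are hypothetical, and it assumes `retryCount` starts at 0 for the first retry, which the slides do not state.

```javascript
// Jittered exponential backoff, using the formula quoted on the slides:
//   delay = Math.random() * (Math.pow(2, retryCount) * base)
// `random` is injectable so the function can be exercised deterministically.
function backoffDelay(retryCount, base = 50, random = Math.random) {
  return random() * (Math.pow(2, retryCount) * base);
}

// Upper bound on the total time spent waiting between attempts:
// the sum of the largest possible delay for each retry (random() close to 1).
// ASSUMPTION: retryCount runs from 0 to maxRetries - 1.
function maxTotalBackoff(maxRetries, base = 50) {
  let total = 0;
  for (let retryCount = 0; retryCount < maxRetries; retryCount++) {
    total += Math.pow(2, retryCount) * base;
  }
  return total;
}

console.log(maxTotalBackoff(10)); // 51150 (ms), about 51 seconds
```

Under these assumptions the backoff alone can add up to about 51 seconds in the worst case (roughly half that on average), well beyond the 6000 ms timeout in the quote on slide 32. As the slides point out, SuccessfulRequestLatency only reflects the attempt that finally succeeded, so this waiting is invisible in that metric.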

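Slides 43 to 45 recommend recording client-side latency for IO operations and mention the Embedded Metric Format (EMF). A minimal sketch of what that could look like is shown below: it times an async operation and prints one EMF-shaped JSON log line. The helper names, the `MyApp/Latency` namespace and the metric and dimension names are all hypothetical and not taken from the talk.

```javascript
// Build a log record in CloudWatch Embedded Metric Format (EMF).
// ASSUMPTION: namespace, metric and dimension names are illustrative only.
function buildEmfRecord(metricName, valueMs, dimensions, timestamp = Date.now()) {
  return {
    _aws: {
      Timestamp: timestamp, // milliseconds since epoch
      CloudWatchMetrics: [
        {
          Namespace: "MyApp/Latency", // hypothetical namespace
          Dimensions: [Object.keys(dimensions)],
          Metrics: [{ Name: metricName, Unit: "Milliseconds" }],
        },
      ],
    },
    ...dimensions,          // dimension values live at the top level
    [metricName]: valueMs,  // so does the metric value itself
  };
}

// Time any async IO call as seen by the caller and log the result as EMF.
// The measurement covers everything that happens inside `operation`,
// including any retries and backoff the client library performs.
async function timed(metricName, dimensions, operation) {
  const start = Date.now();
  try {
    return await operation();
  } finally {
    const elapsedMs = Date.now() - start;
    console.log(JSON.stringify(buildEmfRecord(metricName, elapsedMs, dimensions)));
  }
}
```

A call such as `timed("DynamoDBLatency", { TableName: "my-table" }, () => loadItems())`, where `loadItems` stands in for your own DynamoDB call, would yield the "caller-side DynamoDB latency" custom metric from slide 52. Comparing it with SuccessfulRequestLatency shows how much time went to caller-side retries.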