Troubleshooting serverless applications

In this talk, we will discuss some tips for alerting around your serverless application, and different approaches to troubleshooting issues in your serverless application: using first-party tools from AWS; using custom-built solutions; or using a serverless monitoring solution.

Published in: Technology

  1. 1. Serverless Application Troubleshooting
  2. 2. I watch a lot of TV shows…
  3. 3. protagonist is shot
  4. 4. 3 hours earlier… protagonist is shot
  5. 5. 3 hours earlier… protagonist is shot
  6. 6. 3 hours earlier… protagonist emerges victoriously; protagonist is shot
  7. 7. happened
  8. 8. happened; user impact
  9. 9. happened; system repaired; user impact
  10. 10. happened; system repaired; user impact; goal: to fail without users noticing
  11. 11. happened; system repaired; user impact; reduce MTTR
  12. 12. Yan Cui; http://theburningmonk.com; @theburningmonk; Developer Advocate @; Independent Consultant; AWS user since 2009; since 2018; yan@lumigo.io
  13. 13. What do you mean by ‘serverless’?
  14. 14. “Serverless”
  15. 15. Gojko Adzic It is serverless the same way WiFi is wireless. http://bit.ly/2yQgwwb
  16. 16. Serverless means… don’t pay for it if no-one uses it; don’t need to worry about scaling; don’t need to provision and manage servers
  17. 17. in other words, it’s a lot like taking a cab
  18. 18. Ownership Fuel Navigate To get there! Focus on getting there!
  19. 19. HW Ownership OS Runtime & Scale Code Focus on getting there! Physical Servers Virtual Machines Containers Serverless
  20. 20. Nano Services Self Managed Cost Paradigm Change Async Dynamic agile env
  21. 21. happened; system repaired; user impact; reduce MTTR
  22. 22. Identify & Resolve Issues; Understanding costs; Visibility
  23. 23. Identify & Resolve Issues; Understanding costs; Visibility
  24. 24. happened; system repaired; user impact; MTTDiscovery
  25. 25. “What alerts should I have?”
  26. 26. It depends on what you’re building…
  27. 27. But, this is a good starting point
  28. 28. Lambda: error rate %, throttle count, DLR error count, iterator age, regional concurrency
  29. 29. Lambda: error rate %, throttle count, DLR error count, iterator age, regional concurrency; API Gateway: p90/95/99 latency, success rate %, 4xx rate %, 5xx rate %
  30. 30. API Gateway: p90/95/99 latency, success rate %, 4xx rate %, 5xx rate %; SQS: message age; Lambda: error rate %, throttle count, DLR error count, iterator age, regional concurrency
  31. 31. API Gateway: p90/95/99 latency, success rate %, 4xx rate %, 5xx rate %; SQS: message age; Step Functions: failed count, throttle count, timed out count; Lambda: error rate %, throttle count, DLR error count, iterator age, regional concurrency
  32. 32. SQS: message age; Step Functions: failed count, throttle count, timed out count; API Gateway: p90/95/99 latency, success rate %, 4xx rate %, 5xx rate %; Lambda: error rate %, throttle count, DLR error count, iterator age, regional concurrency
  33. 33. “Can’t you codify these?”
  34. 34. Identify & Resolve Issues; Understanding costs; Visibility
  35. 35. happened; system repaired; user impact; finding root cause
  36. 36. option 1: CloudWatch & friends
  37. 37. https://lumigo.io/blog/getting-the-most-out-of-cloudwatch-logs/
  38. 38. Pros: Out of the box; No overhead; Comparatively cheap; AWS support
  39. 39. Pros: Out of the box; No overhead; Comparatively cheap; AWS support. Cons: Complicated
  40. 40. https://lumigo.io/blog/serverless-applications-automate-chores-cloudwatch-logs/
  41. 41. Pros: Out of the box; No overhead; Comparatively cheap; AWS support. Cons: Complicated; Hard to query* (* Insights improved things drastically, but still a gap to ELK)
  42. 42. https://lumigo.io/blog/how-to-monitor-lambda-with-cloudwatch-metrics/
  43. 43. Pros: Out of the box; Source of truth; No overhead*; Comparatively cheap; AWS support (* unless you record custom metrics synchronously)
  44. 44. Pros: Out of the box; Source of truth; No overhead*; Comparatively cheap; AWS support (* unless you record custom metrics synchronously). Cons: Missing metrics**; Lambda percentile latencies don’t work; Only granular to 1 min; No query language (** can compensate with custom metrics/metric filters, etc.)
  45. 45. Pros: Out of the box; SDK; No overhead; Comparatively cheap; AWS support
  46. 46. Pros: Out of the box; SDK; No overhead; Comparatively cheap; AWS support. Cons: Poor async support
  47. 47. Pros: Out of the box; SDK; No overhead; Comparatively cheap; AWS support. Cons: Poor async support; No auto-instrumentation; Bad DX (for node.js); Poor documentation
  48. 48. option 2: custom built solutions
  49. 49. https://github.com/getndazn/dazn-lambda-powertools
  50. 50. Structured Logging
  51. 51. Structured Logging; Sampling
  52. 52. Structured Logging; Sampling; Correlation IDs
  53. 53. Structured Logging; Sampling; Correlation IDs; Auto “instrumentation”
  54. 54. Structured Logging; Sampling; Correlation IDs; Auto “instrumentation”; Support async events
  55. 55. enrich the usefulness of your logs
  56. 56. https://theburningmonk.com/2017/08/centralised-logging-for-aws-lambda/
  57. 57. https://theburningmonk.com/2018/07/centralised-logging-for-aws-lambda-revised-2018/
  58. 58. Pros: Tailor fit; Free!
  59. 59. Pros: Tailor fit; Free! Cons: Very high-touch; Not all services are supported equally; Tailor fit (for someone else…)
  60. 60. option 3: serverless monitoring solutions
  61. 61. Pros: SaaS; Serverless focus; More than just tracing; Very low touch. Cons: Yet another 3rd party; More than just tracing
  62. 62. Takeaways: Serverless is a game-changer; Serverless has challenges; Options for troubleshooting serverless applications
  63. 63. https://info.lumigo.io/serverless-consulting Start off on the right foot
  64. 64. @theburningmonk theburningmonk.com github.com/theburningmonk yan@lumigo.io
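
Slide 33 asks whether the starting-point alerts on slides 28 to 32 can be codified. The following is a minimal sketch of one way that might look in Python, not the approach used in the talk. The helper name `lambda_alarms`, the defaults table, and every threshold are hypothetical. The metric names `Throttles`, `DeadLetterErrors` and `IteratorAge` are metrics in the `AWS/Lambda` CloudWatch namespace; mapping them to the slide's "throttle count", "DLR error count" and "iterator age" is an assumption. The code only builds dictionaries shaped like the parameters of CloudWatch's `PutMetricAlarm` and does not call AWS.

```python
from typing import Dict, List

# Hypothetical defaults: (metric name, statistic, threshold). Tune per workload.
LAMBDA_ALARM_DEFAULTS = [
    ("Throttles", "Sum", 0.0),
    ("DeadLetterErrors", "Sum", 0.0),
    ("IteratorAge", "Maximum", 60_000.0),  # IteratorAge is reported in milliseconds
]


def lambda_alarms(function_name: str, period_seconds: int = 60) -> List[Dict]:
    """Build one alarm definition per default metric for a single Lambda function."""
    alarms = []
    for metric, statistic, threshold in LAMBDA_ALARM_DEFAULTS:
        alarms.append({
            "AlarmName": f"{function_name}-{metric}",
            "Namespace": "AWS/Lambda",
            "MetricName": metric,
            "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
            "Statistic": statistic,
            "Period": period_seconds,
            "EvaluationPeriods": 1,
            "Threshold": threshold,
            "ComparisonOperator": "GreaterThanThreshold",
            # Assumption: a function with no traffic should not raise an alarm.
            "TreatMissingData": "notBreaching",
        })
    return alarms
```

Each dictionary could then be handed to a CloudWatch client, for example `put_metric_alarm(**alarm)` in boto3. The "error rate %" alert is left out of this sketch because it needs metric math over errors and invocations rather than a single metric.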

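Slides 50 to 54 list structured logging, sampling and correlation IDs as building blocks of a custom-built solution. The sketch below illustrates how those three ideas can fit together in Python. It is not the API of dazn-lambda-powertools; the class `StructuredLogger`, its parameters, and the `x-correlation-id` key used in the usage example are all hypothetical.

```python
import json
import random
import sys
from typing import Any, Dict, Optional, TextIO


class StructuredLogger:
    """Writes one JSON object per line, tagged with correlation IDs."""

    def __init__(self,
                 correlation_ids: Optional[Dict[str, str]] = None,
                 debug_sample_rate: float = 0.01,
                 stream: TextIO = sys.stdout,
                 rng: Optional[random.Random] = None) -> None:
        self.correlation_ids = dict(correlation_ids or {})
        self.stream = stream
        rng = rng or random.Random()
        # Decide once per invocation, so a sampled request keeps all its debug logs.
        self.debug_enabled = rng.random() < debug_sample_rate

    def _write(self, level: str, message: str, fields: Dict[str, Any]) -> None:
        record = {"level": level, "message": message,
                  **self.correlation_ids, **fields}
        self.stream.write(json.dumps(record) + "\n")

    def info(self, message: str, **fields: Any) -> None:
        self._write("INFO", message, fields)

    def debug(self, message: str, **fields: Any) -> None:
        # Debug logs are emitted only for the sampled fraction of invocations.
        if self.debug_enabled:
            self._write("DEBUG", message, fields)

    def outgoing_headers(self) -> Dict[str, str]:
        """Correlation IDs to forward on outgoing calls (HTTP headers, message attributes)."""
        return dict(self.correlation_ids)
```

A handler might create one logger per invocation, for example `StructuredLogger({"x-correlation-id": request_id})`, and attach `outgoing_headers()` to every downstream request so that logs from different functions can be joined on the same ID.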