How to build observability into a serverless application

Serverless introduces a number of challenges to existing observability tools, so we need to adapt our practices to fit this new paradigm. In this talk, we will discuss how we can build observability into a serverless application. We will see how you can implement log aggregation, distributed tracing and correlation IDs through both synchronous and asynchronous events.

Published in: Technology

  1. 1. How to build observability into a serverless application
  2. 2. What do I mean by “observability”?
  3. 3. Monitoring: watching out for known failure modes in the system, e.g. network I/O, CPU, memory usage, …
  4. 4. Observability: being able to debug the system, and gain insights into the system’s behaviour
  5. 5. In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. https://en.wikipedia.org/wiki/Observability
  6. 6. In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs (including non-functional outputs). https://en.wikipedia.org/wiki/Observability
  7. 7. “These are the four pillars of the Observability Engineering team’s charter: • Monitoring • Alerting/Visualization • Distributed systems tracing infrastructure • Log aggregation/analytics” - Observability Engineering at Twitter, http://bit.ly/2DnjyuW
  8. 8. microservices death stars circa 2015
  9. 9. microservices death stars circa 2015 mm… I wonder what’s going on here…
  10. 10. microservices death stars circa 2015 I got this!
  11. 11. Yan Cui http://theburningmonk.com @theburningmonk Independent Consultant
  12. 12. Yan Cui http://theburningmonk.com @theburningmonk Developer Advocate @
  13. 13. Yan Cui http://theburningmonk.com @theburningmonk AWS user since 2009
  15. 15. new challenges
  16. 16. NO ACCESS to underlying OS
  17. 17. NOWHERE to install agents/daemons
  18. 18. new challenges: • nowhere to install agents/daemons
  19. 19. [diagram: user requests, each served by a handler] critical paths: minimise user-facing latency
  20. 20. [diagram: user requests, each served by a handler, with StatsD and rsyslog alongside] critical paths: minimise user-facing latency; background processing: batched, asynchronous, low overhead
  21. 21. [same diagram] NO background processing except what the platform provides
  22. 22. new challenges: • no background processing • nowhere to install agents/daemons
  23. 23. EC2: concurrency used to be handled by your code
  24. 24. [diagram: one EC2 instance vs. many concurrent Lambda executions] now, it’s handled by the AWS Lambda platform
  25. 25. EC2: logs & metrics used to be batched here
  26. 26. [diagram: one EC2 instance vs. many concurrent Lambda executions] now, they are batched in each concurrent execution, at best…
  27. 27. HIGHER concurrency to log aggregation/telemetry system
  28. 28. new challenges: • higher concurrency to telemetry system • nowhere to install agents/daemons • no background processing
  29. 29. Lambda cold start
  30. 30. Lambda: data is batched between invocations
  31. 31. Lambda, then idle: data is batched between invocations
  32. 32. Lambda, then idle, then garbage collection: data is batched between invocations
  33. 33. Lambda, then idle, then garbage collection: data is batched between invocations. HIGH chance of data loss
  34. 34. new challenges: • high chance of data loss (if batching) • nowhere to install agents/daemons • no background processing • higher concurrency to telemetry system
  35. 35. Lambda
  36. 36. my code send metrics
  38. 38. press button → internet → my code (send metrics) → internet → something happens
  39. 39. http://bit.ly/2Dpidje
  40. 40. functions are often chained together via asynchronous invocations
  41. 41. SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES
  42. 42. SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES: tracing ASYNCHRONOUS invocations through so many different event sources is difficult
  43. 43. new challenges: • asynchronous invocations • nowhere to install agents/daemons • no background processing • higher concurrency to telemetry system • high chance of data loss (if batching)
  44. 44. “These are the four pillars of the Observability Engineering team’s charter: • Monitoring • Alerting/Visualization • Distributed systems tracing infrastructure • Log aggregation/analytics” - Observability Engineering at Twitter, http://bit.ly/2DnjyuW
  45. 45. LOGGING
  46. 46. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?
  47. 47. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now? (fields, in order: UTC timestamp, request ID, your log message)
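The three fields on this slide can be pulled apart with a small helper. This is only a sketch: `parseLogLine` is a hypothetical name, and it assumes the fields are separated by whitespace as shown on the slide.

```javascript
// Split a default-format Lambda log line into its three parts.
// Assumption: the timestamp and request ID contain no whitespace, and
// everything after them is the log message.
function parseLogLine(line) {
  const match = /^(\S+)\s+(\S+)\s+([\s\S]*)$/.exec(line);
  if (!match) return null; // not in the expected format
  const [, timestamp, requestId, message] = match;
  return { timestamp, requestId, message };
}

const parsed = parseLogLine(
  "2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?"
);
console.log(parsed);
```

Having the request ID as its own field is what makes it possible to group all the log messages from one invocation.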
  48. 48. one log group per function; one log stream for each concurrent invocation
  49. 49. logs are not easily searchable in CloudWatch Logs
  50. 50. CloudWatch Logs
  51. 51. CloudWatch Logs is an async event source for Lambda
  52. 52. [chart: concurrent executions over time, against the regional max concurrency] functions that are delivering business value
  53. 53. [chart: concurrent executions over time, against the regional max concurrency] functions that are delivering business value, plus the function that ships logs
  54. 54. either set concurrency limit on the log shipping function (and potentially lose logs due to throttling) or…
  55. 55. 1 Kinesis shard = 1 concurrent execution, i.e. control the number of concurrent executions with the number of shards
  57. 57. CloudWatch Logs
  58. 58. https://amzn.to/2DnREgn
  59. 59. https://amzn.to/2uZYmEw
  60. 60. use structured logging with JSON
  61. 61. https://stackify.com/what-is-structured-logging-and-why-developers-need-it/ https://blog.treasuredata.com/blog/2012/04/26/log-everything-as-json/
  62. 62. https://www.loggly.com/blog/8-handy-tips-consider-logging-json/
  63. 63. traditional loggers are too heavy for Lambda
  64. 64. https://github.com/getndazn/dazn-lambda-powertools
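A structured logger for Lambda does not need much. The sketch below is not the API of the library linked above; `createLogger`, the level names and the field names are all assumptions made for illustration. It writes one JSON object per line through `console.log`, which is enough for the output to land in CloudWatch Logs.

```javascript
// Minimal structured logger: one JSON object per log line.
// Level names and numeric values are illustrative assumptions.
const LEVELS = { DEBUG: 20, INFO: 30, WARN: 40, ERROR: 50 };

function createLogger(minLevel = "INFO", write = console.log) {
  const log = (level, message, params = {}) => {
    if (LEVELS[level] < LEVELS[minLevel]) return; // below threshold: skip
    write(JSON.stringify({ ...params, level, message }));
  };
  return {
    debug: (message, params) => log("DEBUG", message, params),
    info: (message, params) => log("INFO", message, params),
    warn: (message, params) => log("WARN", message, params),
    error: (message, params) => log("ERROR", message, params),
  };
}

// Collect lines in an array here so the output is easy to inspect.
const lines = [];
const logger = createLogger("INFO", (line) => lines.push(line));
logger.debug("not written at INFO level");
logger.info("fetched restaurants", { count: 8, latencyMs: 27.4 });
console.log(lines[0]);
```

Because each line is valid JSON, the log aggregation side can index fields such as `count` or `latencyMs` without writing custom parsing rules.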
  65. 65. Writing lots more data to CloudWatch Logs
  66. 66. CloudWatch Logs: $0.50 per GB ingested, $0.03 per GB archived per month
  67. 67. CloudWatch Logs: $0.50 per GB ingested, $0.03 per GB archived per month. 1M invocations of a 128MB function (billed for 100ms each) = $0.000000208 * 1M + $0.20 = $0.408
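To see why log volume matters, the arithmetic on the slide can be set against the ingestion price. This is a rough sketch using the prices quoted above; the 100 ms billed duration and the 1 KB of logs per invocation are assumptions chosen for illustration, not figures from the talk.

```javascript
// Prices quoted on the slides (USD).
const LAMBDA_PER_100MS_AT_128MB = 0.000000208;
const LAMBDA_PER_MILLION_REQUESTS = 0.20;
const CW_LOGS_PER_GB_INGESTED = 0.50;

const invocations = 1e6;

// Assumption: every invocation is billed for a single 100 ms block.
const lambdaCost =
  LAMBDA_PER_100MS_AT_128MB * invocations +
  LAMBDA_PER_MILLION_REQUESTS * (invocations / 1e6);

// Hypothetical log volume: 1 KB written per invocation.
const logBytesPerInvocation = 1024;
const gbIngested = (logBytesPerInvocation * invocations) / 1024 ** 3;
const logsCost = gbIngested * CW_LOGS_PER_GB_INGESTED;

console.log(`Lambda: $${lambdaCost.toFixed(3)}, log ingestion: $${logsCost.toFixed(3)}`);
```

Under these assumptions, ingesting the logs costs more than running the function, before archival is counted.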
  68. 68. DON’T leave debug logging ON in production
  69. 69. have to redeploy ALL the functions along the call path to collect all relevant debug logs
  70. 70. [diagram: one EC2 instance vs. many concurrent Lambda executions] concurrency is handled by the AWS Lambda platform
  71. 71. the sampling decision has to be followed by the entire call chain
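One way to make a whole call chain follow the same sampling decision is to decide once at the edge and pass the decision along. The sketch below assumes a 1% sample rate and hypothetical helper names; the `Debug-Log-Enabled` header name is borrowed from a later slide in this deck.

```javascript
// Hypothetical sample rate: debug logging for roughly 1% of call chains.
const SAMPLE_RATE = 0.01;

// Decide once at the edge of the system. Functions further down the chain
// reuse the decision they receive instead of rolling the dice again.
function resolveDebugLogging(headers = {}, random = Math.random) {
  const inherited = headers["Debug-Log-Enabled"];
  if (inherited !== undefined) return inherited === "true";
  return random() < SAMPLE_RATE;
}

// Header to attach to every outgoing call so the decision travels with it.
function forwardDebugHeader(debugEnabled) {
  return { "Debug-Log-Enabled": String(debugEnabled) };
}

// The random source is injected here only to make the example repeatable.
const atEdge = resolveDebugLogging({}, () => 0.001);
const downstream = resolveDebugLogging(forwardDebugHeader(atEdge), () => 0.999);
console.log(atEdge, downstream);
```

With this shape, either every function in the chain writes debug logs for a given request or none of them do, and no redeploy is needed to get a complete picture.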
  72. 72. Initial Request ID, User ID, Session ID, User-Agent, Order ID, …
  73. 73. [diagram: one EC2 instance vs. many concurrent Lambda executions] concurrency is handled by the AWS Lambda platform
  74. 74. store correlation IDs in a global variable
  75. 75. use middleware to auto-capture incoming correlation IDs
  76. 76. extract correlation IDs from the invocation event, and store them in the correlation-ids module (reset for each invocation)
  77. 77. logger always includes the captured correlation IDs
  78. 78. HTTP and AWS SDK clients auto-forward the correlation IDs
  79. 79. get-index: context.awsRequestId
  80. 80. get-index: context.awsRequestId → x-correlation-id
  81. 81. get-index sends: { "headers": { "x-correlation-id": "…" }, … }
  82. 82. get-index → get-restaurants: { "body": null, "resource": "/restaurants", "headers": { "x-correlation-id": "…" }, … }
  83. 83. get-index → get-restaurants: capture headers["User-Agent"], headers["Debug-Log-Enabled"] and headers["x-correlation-id"] from the function event into global.CONTEXT (x-correlation-id = …, x-correlation-xxx = …); log.info(…) includes them; forward them on outgoing calls
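The capture, log and forward steps can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation behind the library linked in this deck: every function name below is hypothetical, and only HTTP-style events with a `headers` field are handled.

```javascript
// Hypothetical stand-in for a correlation-ids module: module-level state,
// reset at the start of every invocation because containers are reused.
let correlationIds = {};

// Capture step: pull correlation IDs out of the invocation event.
function captureCorrelationIds(event, context) {
  correlationIds = {}; // reset so nothing leaks from the previous invocation
  const headers = (event && event.headers) || {};
  for (const [name, value] of Object.entries(headers)) {
    const key = name.toLowerCase();
    if (key.startsWith("x-correlation-")) correlationIds[key] = value;
  }
  // First function in the chain: its own request ID becomes the correlation ID.
  if (!correlationIds["x-correlation-id"]) {
    correlationIds["x-correlation-id"] = context.awsRequestId;
  }
  for (const name of ["User-Agent", "Debug-Log-Enabled"]) {
    if (headers[name] !== undefined) correlationIds[name] = headers[name];
  }
}

// Logger step: every log line carries the captured IDs.
function logInfo(message, write = console.log) {
  write(JSON.stringify({ ...correlationIds, level: "INFO", message }));
}

// Forward step: headers to attach to outgoing HTTP calls.
function forwardHeaders() {
  return { ...correlationIds };
}

// Middleware: wrap a handler so that capture happens automatically.
const withCorrelationIds = (handler) => (event, context) => {
  captureCorrelationIds(event, context);
  return handler(event, context);
};

// get-index is invoked without any correlation ID...
const getIndex = withCorrelationIds(() => forwardHeaders());
const outgoing = getIndex({ headers: { "User-Agent": "curl/7.64" } }, { awsRequestId: "req-1" });

// ...and get-restaurants receives the headers that get-index forwarded.
const getRestaurants = withCorrelationIds(() => {
  logInfo("loading restaurants");
  return forwardHeaders();
});
const downstream = getRestaurants({ headers: outgoing }, { awsRequestId: "req-2" });
console.log(downstream);
```

Both functions end up logging the same `x-correlation-id`, which is what lets the log aggregation system stitch one user request back together. Asynchronous event sources would need their own capture logic, for example reading message attributes instead of headers.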
  84. 84. https://github.com/getndazn/dazn-lambda-powertools
  85. 85. MONITORING
  86. 86. new challenges: • no background processing • nowhere to install agents/daemons
  87. 87. press button → internet → my code (send metrics) → internet → something happens
  88. 88. those extra 10-20ms for sending custom metrics compound when you have microservices and multiple APIs are called within a single user event
  89. 89. Amazon found every 100ms of latency cost them 1% in sales. http://bit.ly/2EXPfbA
  90. 90. console.log("hydrating yubls from db…"); console.log("fetching user info from user-api"); console.log("MONITORING|1489795335|27.4|latency|user-api-latency"); console.log("MONITORING|1489795335|8|count|yubls-served"); (the first two lines are logs, the last two are metrics; metric format: MONITORING|timestamp|metric value|metric type|metric name)
  91. 91. CloudWatch Logs → AWS Lambda → ELK stack (logs) and CloudWatch (metrics)
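On the receiving side, the function that processes the log stream has to tell metric lines apart from ordinary log lines. A minimal sketch, assuming the pipe-delimited format shown on the slide and a hypothetical `parseMetric` helper:

```javascript
// Parse "MONITORING|<timestamp>|<value>|<type>|<name>" lines.
// Returns null for ordinary log messages.
function parseMetric(line) {
  const parts = line.split("|");
  if (parts.length !== 5 || parts[0] !== "MONITORING") return null;
  const [, timestamp, value, type, name] = parts;
  return { timestamp: Number(timestamp), value: Number(value), type, name };
}

const logLines = [
  "hydrating yubls from db…",
  "fetching user info from user-api",
  "MONITORING|1489795335|27.4|latency|user-api-latency",
  "MONITORING|1489795335|8|count|yubls-served",
];

// Metric lines would be published as custom metrics; the rest are
// forwarded to the log aggregation system as usual.
const metrics = logLines.map(parseMetric).filter(Boolean);
console.log(metrics);
```

The function that emits the metric only pays for a `console.log` call, so the user-facing request sees no extra network round trip.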
  92. 92. https://amzn.to/2YkjgOR
  93. 93. delay, cost, concurrency
  94. 94. delay, cost, concurrency; no latency overhead
  95. 95. API Gateway: send custom metrics asynchronously
  96. 96. API Gateway: send custom metrics asynchronously. SNS, Kinesis, S3, …: send custom metrics as part of function invocation
  97. 97. TRACING
  98. 98. X-Ray
  99. 99. don’t span over async invocations: good for identifying dependencies of a function, but not good enough for tracing the entire call chain as a user request/data flows through the system via async event sources.
  100. 100. don’t span over non-AWS services
  101. 101. write structured logs
  102. 102. instrument your code
  103. 103. make it easy to do the right thing
  104. 104. Yan Cui http://theburningmonk.com @theburningmonk