
How to build observability into a serverless application


Serverless introduces a number of challenges for existing observability tools, so we need to adapt our practices to fit this new paradigm. In this talk, we will discuss how to build observability into a serverless application. We will see how you can implement log aggregation, distributed tracing and correlation IDs through both synchronous and asynchronous events.

Recording of this talk is available at https://www.youtube.com/watch?v=AMKUKKGPJhQ

Published in: Technology


  1. how to build OBSERVABILITY into a Serverless application
  2. @theburningmonk theburningmonk.com What do I mean by “observability”?
  3. Monitoring: watching out for known failure modes in the system, e.g. network I/O, CPU, memory usage, …
  4. Observability: being able to debug the system, and gain insights into the system’s behaviour
  5. In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. https://en.wikipedia.org/wiki/Observability
  6. In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs (including non-functional outputs). https://en.wikipedia.org/wiki/Observability
  7. “These are the four pillars of the Observability Engineering team’s charter: Monitoring; Alerting/Visualization; Distributed systems tracing infrastructure; Log aggregation/analytics” - Observability Engineering at Twitter, http://bit.ly/2DnjyuW
  8. microservices death stars circa 2015
  9. microservices death stars circa 2015: mm… I wonder what’s going on here…
  10. microservices death stars circa 2015: I got this!
  12. Yan Cui http://theburningmonk.com @theburningmonk AWS user for 10 years
  13. http://bit.ly/yubl-serverless
  14. Yan Cui http://theburningmonk.com @theburningmonk Developer Advocate @
  15. Yan Cui http://theburningmonk.com @theburningmonk Independent Consultant: advise, training, delivery
  16. https://theburningmonk.com/courses lambdabestpractice.com
  17. theburningmonk.com/workshops: in your company, flexible dates; 4-week virtual workshop, May 4 - May 29; Amsterdam, Jul 7-8; Helsinki, Aug 20-21; London, Sep 24-25; Berlin, Oct 8-9
  18. new challenges
  19. NO ACCESS to underlying OS
  20. NOWHERE to install agents/daemons
  21. new challenges: nowhere to install agents/daemons
  22. Diagram of user requests and their handlers. Critical paths: minimise user-facing latency
  23. Diagram of user requests and their handlers. Critical paths: minimise user-facing latency. Background processing (StatsD, rsyslog): batched, asynchronous, low overhead
  24. Diagram of user requests and their handlers. Critical paths: minimise user-facing latency. Background processing (StatsD, rsyslog): batched, asynchronous, low overhead. NO background processing except what the platform provides
  25. new challenges: no background processing; nowhere to install agents/daemons
  26. EC2: concurrency used to be handled by your code
  27. EC2 vs. Lambda: now, it’s handled by the AWS Lambda platform
  28. EC2: logs & metrics used to be batched here
  29. EC2 vs. Lambda: now, they are batched in each concurrent execution, at best…
  32. HIGHER concurrency to log aggregation/telemetry system
  33. new challenges: higher concurrency to telemetry system; nowhere to install agents/daemons; no background processing
  34. Lambda: cold start
  35. Lambda: data is batched between invocations
  36. Lambda: idle; data is batched between invocations
  37. Lambda: idle, garbage collection; data is batched between invocations
  38. Lambda: idle, garbage collection; data is batched between invocations; HIGH chance of data loss
  39. new challenges: high chance of data loss (if batching); nowhere to install agents/daemons; no background processing; higher concurrency to telemetry system
  40. Lambda
  41. my code; send metrics
  42. my code; send metrics
  43. Diagram labels: my code; send metrics; internet; internet; press button; something happens
  44. http://bit.ly/2Dpidje
  45. functions are often chained together via asynchronous invocations
  46. SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES
  47. SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES: tracing ASYNCHRONOUS invocations through so many different event sources is difficult
  48. new challenges: asynchronous invocations; nowhere to install agents/daemons; no background processing; higher concurrency to telemetry system; high chance of data loss (if batching)
  49. “These are the four pillars of the Observability Engineering team’s charter: Monitoring; Alerting/Visualization; Distributed systems tracing infrastructure; Log aggregation/analytics” - Observability Engineering at Twitter, http://bit.ly/2DnjyuW
  50. LOGGING
  52. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?
  53. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now? (UTC timestamp, request ID, your log message)
  55. one log group per function; one log stream for each concurrent invocation
  56. “logs are not easily searchable in CloudWatch Logs” - me
  57. CloudWatch Logs
  59. CloudWatch Logs is an async event source for Lambda
  60. Chart of concurrent executions over time against the regional max concurrency: functions that are delivering business value
  61. Chart of concurrent executions over time against the regional max concurrency: functions that are delivering business value; ship logs
  62. either set a concurrency limit on the log shipping function (and potentially lose logs due to throttling) or…
  64. 1 shard = 1 concurrent execution, i.e. control the number of concurrent executions with the number of shards
  65. …
  66. CloudWatch Logs
  70. https://amzn.to/2DnREgn
  71. https://amzn.to/2uZYmEw
  72. use structured logging with JSON
  73. https://stackify.com/what-is-structured-logging-and-why-developers-need-it/ https://blog.treasuredata.com/blog/2012/04/26/log-everything-as-json/
  74. https://www.loggly.com/blog/8-handy-tips-consider-logging-json/
  76. traditional loggers are too heavy for Lambda
  77. https://github.com/getndazn/dazn-lambda-powertools
  83. Writing lots more data to CloudWatch Logs
  84. CloudWatch Logs: $0.50 per GB ingested, $0.03 per GB archived per month
  85. CloudWatch Logs: $0.50 per GB ingested, $0.03 per GB archived per month. 1M invocations of a 128MB function = $0.000000208 * 1M + $0.20 = $0.408
  86. DON’T leave debug logging ON in production
  88. have to redeploy ALL the functions along the call path to collect all relevant debug logs
  91. https://github.com/middyjs/middy
  93. EC2 vs. Lambda: concurrency is handled by the AWS Lambda platform
  100. a sampling decision has to be followed by the entire call chain
  101. Initial Request ID, User ID, Session ID, User-Agent, Order ID, …
  102. EC2 vs. Lambda: concurrency is handled by the AWS Lambda platform
  103. store correlation IDs in a global variable
  106. use middleware to auto-capture incoming correlation IDs
  107. extract correlation IDs from the invocation event, and store them in the correlation-ids module (reset)
  110. logger to always include captured correlation IDs
  111. HTTP and AWS SDK clients to auto-forward correlation IDs on
  112. get-index: context.awsRequestId
  113. get-index: context.awsRequestId, x-correlation-id
  114. get-index: { "headers": { "x-correlation-id": "…" }, … }
  115. get-index, get-restaurants: { "body": null, "resource": "/restaurants", "headers": { "x-correlation-id": "…" }, … }
  116. Diagram labels: get-index, get-restaurants; global.CONTEXT (x-correlation-id = …, x-correlation-xxx = …); headers["User-Agent"], headers["Debug-Log-Enabled"], headers["x-correlation-id"]; capture, forward, function event, log.info(…)
  118. https://github.com/getndazn/dazn-lambda-powertools
  120. MONITORING
  121. new challenges: no background processing; nowhere to install agents/daemons
  122. Diagram labels: my code; send metrics; internet; internet; press button; something happens
  123. those extra 10-20ms for sending custom metrics would compound when you have microservices and multiple APIs are called within one slice of a user event
  124. Amazon found every 100ms of latency cost them 1% in sales. http://bit.ly/2EXPfbA
  128. https://aws.amazon.com/cloudwatch/pricing/
  129. https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html
  130. https://aws.amazon.com/cloudwatch/pricing/
  131. Timeline labels: happened; system repaired; user impact; reduce MTTR
  132. Identify & Resolve Issues; Understanding costs; Visibility
  133. Identify & Resolve Issues; Understanding costs; Visibility
  134. Timeline labels: happened; system repaired; user impact; MTTDiscovery
  136. “What alerts should I have?”
  137. It depends on what you’re building…
  138. But, this is a good starting point
  139. Lambda: error rate %, throttle count, DLR error count, iterator age, regional concurrency
  140. Lambda: error rate %, throttle count, DLR error count, iterator age, regional concurrency. API Gateway: p90/95/99 latency, success rate %, 4xx rate %, 5xx rate %
  141. API Gateway: p90/95/99 latency, success rate %, 4xx rate %, 5xx rate %. SQS: message age. Lambda: error rate %, throttle count, DLR error count, iterator age, regional concurrency
  142. API Gateway: p90/95/99 latency, success rate %, 4xx rate %, 5xx rate %. SQS: message age. Step Functions: failed count, throttle count, timed out count. Lambda: error rate %, throttle count, DLR error count, iterator age, regional concurrency
  143. SQS: message age. Step Functions: failed count, throttle count, timed out count. API Gateway: p90/95/99 latency, success rate %, 4xx rate %, 5xx rate %. Lambda: error rate %, throttle count, DLR error count, iterator age, regional concurrency
  144. monitor and alert on message flow rate for event processing pipelines
  145. “Can’t you codify these?”
  147. TRACING
  148. X-Ray
  149. poor support for async invocations: good for identifying dependencies of a function, but not good enough for tracing the entire call chain as a user request/data flows through the system via async event sources.
  150. traces don’t span non-AWS services
  151. write structured logs
  152. instrument your code
  153. make it easy to do the right thing
  154. https://theburningmonk.com/hire-me Advise, Training, Delivery. “Fundamentally, Yan has improved our team by increasing our ability to derive value from AWS and Lambda in particular.” - Nick Blair, Tech Lead
  155. Production-Ready Serverless
  156. theburningmonk.com/workshops: in your company, flexible dates; 4-week virtual workshop, May 4 - May 29; Amsterdam, Jul 7-8; Helsinki, Aug 20-21; London, Sep 24-25; Berlin, Oct 8-9. smartly-2020: €100 off all my workshops
  157. lambdabestpractice.com bit.ly/complete-guide-to-aws-step-functions smartly-2020: 20% off my courses
  158. github.com/theburningmonk
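
The logging slides (72, 100 to 111) recommend structured JSON logs, correlation IDs that are captured from the invocation event into a global store and reset on every invocation, and a debug-logging sampling decision that the whole call chain follows. The following is a minimal Python sketch of those ideas under stated assumptions, not the dazn-lambda-powertools implementation: the function names, the `sample_rate` parameter and the way `debug-log-enabled` is stored are hypothetical, and `context.aws_request_id` is the Python runtime's counterpart of the `context.awsRequestId` shown on the slides.

```python
import json
import random

# Module-level store. Lambda reuses execution environments between
# invocations, so it must be reset at the start of each one.
_correlation_ids = {}


def capture_correlation_ids(event, context, sample_rate=0.01):
    """Reset the store, then capture x-correlation-* headers from an
    API Gateway style event (hypothetical helper)."""
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    _correlation_ids.clear()
    for name, value in headers.items():
        if name.startswith("x-correlation-"):
            _correlation_ids[name] = value
    # First function in the chain: fall back to the Lambda request ID.
    _correlation_ids.setdefault("x-correlation-id", context.aws_request_id)
    # Follow an upstream sampling decision, otherwise make one here.
    if "debug-log-enabled" in headers:
        decision = headers["debug-log-enabled"]
    else:
        decision = "true" if random.random() < sample_rate else "false"
    _correlation_ids["debug-log-enabled"] = decision


def log(level, message, **fields):
    """Print one JSON log line that always carries the correlation IDs."""
    if level == "DEBUG" and _correlation_ids.get("debug-log-enabled") != "true":
        return None  # debug logging was not sampled for this call chain
    line = json.dumps(
        {"level": level, "message": message, **_correlation_ids, **fields}
    )
    print(line)
    return line


def outgoing_headers():
    """Headers to forward on outbound HTTP calls to the next function."""
    return dict(_correlation_ids)
```

A handler would call `capture_correlation_ids(event, context)` first, log through `log(...)`, and merge `outgoing_headers()` into any HTTP request it makes, so the next function captures the same IDs and the same sampling decision.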

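The cost comparison on slides 84 and 85 can be reproduced as a short calculation. This sketch uses only the prices quoted on the slides; the single 100ms billing unit per invocation is what the slide's arithmetic implies, and the 1,000 bytes of logs per invocation is a hypothetical figure chosen for illustration.

```python
from decimal import Decimal

# Prices as quoted on the slides.
LAMBDA_PRICE_PER_100MS_128MB = Decimal("0.000000208")
LAMBDA_PRICE_PER_1M_REQUESTS = Decimal("0.20")
CW_LOGS_INGESTION_PER_GB = Decimal("0.50")


def lambda_cost(invocations, billed_100ms_units=1):
    """Cost of invoking a 128MB function (billed duration is an assumption)."""
    compute = LAMBDA_PRICE_PER_100MS_128MB * billed_100ms_units * invocations
    requests = LAMBDA_PRICE_PER_1M_REQUESTS * invocations / Decimal(1_000_000)
    return compute + requests


def log_ingestion_cost(invocations, bytes_per_invocation):
    """Cost of ingesting the logs written, taking 1 GB as 10**9 bytes."""
    gigabytes = Decimal(invocations * bytes_per_invocation) / Decimal(10**9)
    return CW_LOGS_INGESTION_PER_GB * gigabytes


invocation_cost = lambda_cost(1_000_000)            # matches the slide: $0.408
logging_cost = log_ingestion_cost(1_000_000, 1000)  # hypothetical log volume
```

Under these assumptions, ingesting the logs already costs more than running the function, which is the motivation for not leaving debug logging on in production.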