How to build observability into Serverless (O'Reilly Velocity 2018)

Serverless introduces a number of challenges to existing observability tools, and we need to adapt our practices to fit this new paradigm. In this talk we will discuss how to build observability into a serverless application, and see how you can implement log aggregation, distributed tracing and correlation IDs across both synchronous and asynchronous events.

  1. 1. How to build observability into Serverless Yan Cui @theburningmonk
  2. 2. Agenda ▪ What do I mean by observability? ▪ New challenges with serverless ▪ Logging ▪ Monitoring ▪ Tracing
  3. 3. After the talk ▪ Slides will be available on SlideShare ▪ Find links to slides and video at https://theburningmonk.com/oreillyvelocity2018
  4. 4. Abraham Wald
  5. 5. Abraham Wald
  6. 6. Abraham Wald
  7. 7. Abraham Wald Wald noted that the study only considered the aircraft that had survived their missions—the bombers that had been shot down were not present for the damage assessment. The holes in the returning aircraft, then, represented areas where a bomber could take damage and still return home safely.
  8. 8. Abraham Wald Wald noted that the study only considered the aircraft that had survived their missions—the bombers that had been shot down were not present for the damage assessment. The holes in the returning aircraft, then, represented areas where a bomber could take damage and still return home safely.
  9. 9. survivor bias in monitoring
  10. 10. survivor bias in monitoring We only focus on the failure modes that we were able to identify through past investigations and postmortems. The bullet holes that shot us down but that we couldn’t identify stay invisible, and will continue to shoot us down.
  11. 11. What do I mean by “observability”?
  12. 12. Monitoring watching out for known failure modes in the system, e.g. network I/O, CPU, memory usage, …
  13. 13. Observability being able to debug the system, and gain insights into the system’s behaviour
  14. 14. In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. https://en.wikipedia.org/wiki/Observability
  15. 15. Known Success
  16. 16. Known Success, Known Errors
  17. 17. Known Success, Known Errors: easy to monitor!
  18. 18. Known Success, Known Errors, Known Unknowns
  19. 19. Known Success, Known Errors, Known Unknowns, Unknown Unknowns
  20. 20. Known Success, Known Errors, Known Unknowns, Unknown Unknowns: invisible bullet holes
  21. 21. Known Success, Known Errors, Known Unknowns, Unknown Unknowns
  22. 22. Known Success, Known Errors, Known Unknowns, Unknown Unknowns: only alert on this
  23. 23. Known Success, Known Errors, Known Unknowns, Unknown Unknowns: alert on the absence of this!
  24. 24. Known Success, Known Errors, Known Unknowns, Unknown Unknowns: what went wrong?
  25. 25. “These are the four pillars of the Observability Engineering team’s charter: • Monitoring • Alerting/Visualization • Distributed systems tracing infrastructure • Log aggregation/analytics” - Observability Engineering at Twitter, http://bit.ly/2DnjyuW
  26. 26. microservices death stars circa 2015
  27. 27. mm… I wonder what’s going on here… microservices death stars circa 2015
  28. 28. I got this! microservices death stars circa 2015
  29. 29. About me ▪ Principal Engineer at DAZN ▪ AWS Serverless Hero ▪ Author of Production-Ready Serverless* by Manning ▪ Blogger** ▪ Speaker * https://bit.ly/production-ready-serverless ** https://theburningmonk.com
  30. 30. https://www.ft.com/content/07d375ee-6ee5-11e8-92d3-6c13e5c92914
  31. 31. https://www.theguardian.com/media/2018/may/14/streaming-service-dazn-netflix-sport-us-boxing-eddie-hearn
  32. 32. About DAZN ▪ Available in 7 countries - Austria, Switzerland, Germany, Japan, Canada, Italy and USA ▪ Available on 30+ platforms
  33. 33. About DAZN ▪~1,000,000 concurrent viewers at peak
  34. 34. follow @dazneng for updates about the engineering team We’re hiring! Visit engineering.dazn.com to learn more. WE’RE HIRING!
  35. 35. new challenges
  36. 36. NO ACCESS to underlying OS
  37. 37. NOWHERE to install agents/daemons
  38. 38. •nowhere to install agents/daemons new challenges
  39. 39. user request user request user request user request user request user request user request critical paths: minimise user-facing latency handler handler handler handler handler handler handler
  40. 40. user request user request user request user request user request user request user request critical paths: minimise user-facing latency StatsD handler handler handler handler handler handler handler rsyslog background processing: batched, asynchronous, low overhead
  41. 41. user request user request user request user request user request user request user request critical paths: minimise user-facing latency StatsD handler handler handler handler handler handler handler rsyslog background processing: batched, asynchronous, low overhead NO background processing except what platform provides
  42. 42. •no background processing •nowhere to install agents/daemons new challenges
  43. 43. EC2 concurrency used to be handled by your code
  44. 44. EC2 Lambda Lambda Lambda Lambda Lambda now, it’s handled by the AWS Lambda platform
  45. 45. EC2 logs & metrics used to be batched here
  46. 46. EC2 Lambda Lambda Lambda Lambda Lambda now, they are batched in each concurrent execution, at best…
  47. 47. HIGHER concurrency to log aggregation/telemetry system
  48. 48. •higher concurrency to telemetry system •nowhere to install agents/daemons •no background processing new challenges
  49. 49. Lambda cold start
  50. 50. Lambda data is batched between invocations
  51. 51. Lambda idle data is batched between invocations
  52. 52. Lambda idle garbage collection data is batched between invocations
  53. 53. Lambda idle garbage collection data is batched between invocations HIGH chance of data loss
  54. 54. •high chance of data loss (if batching) •nowhere to install agents/daemons •no background processing •higher concurrency to telemetry system new challenges
  55. 55. Lambda
  56. 56. my code send metrics
  57. 57. my code send metrics
  58. 58. my code send metrics internet internet press button something happens
  59. 59. http://bit.ly/2Dpidje
  60. 60. ? functions are often chained together via asynchronous invocations
  61. 61. ? SNS Kinesis CloudWatch Events CloudWatch Logs IoT DynamoDB S3 SES
  62. 62. ? SNS Kinesis CloudWatch Events CloudWatch Logs IoT DynamoDB S3 SES tracing ASYNCHRONOUS invocations through so many different event sources is difficult
  63. 63. •asynchronous invocations •nowhere to install agents/daemons •no background processing •higher concurrency to telemetry system •high chance of data loss (if batching) new challenges
  64. 64. “These are the four pillars of the Observability Engineering team’s charter: • Monitoring • Alerting/Visualization • Distributed systems tracing infrastructure • Log aggregation/analytics” - Observability Engineering at Twitter, http://bit.ly/2DnjyuW
  65. 65. LOGGING
  66. 66. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?
  67. 67. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now? UTC Timestamp Request Id your log message
  68. 68. one log group per function one log stream for each concurrent invocation
  69. 69. logs are not easily searchable in CloudWatch Logs me
  70. 70. CloudWatch Logs
  71. 71. CloudWatch Logs is an async event source for Lambda
  72. 72. Concurrent Executions Time regional max concurrency functions that are delivering business value
  73. 73. Concurrent Executions Time regional max concurrency functions that are delivering business value ship logs
  74. 74. either set concurrency limit on the log shipping function (and potentially lose logs due to throttling) or…
  75. 75. 1 shard = 1 concurrent execution i.e. control the no. of concurrent executions with no. of shards
  76. 76.
  77. 77. CloudWatch Logs
  78. 78. CloudWatch Logs
  79. 79. use structured logging with JSON
  80. 80. https://stackify.com/what-is-structured-logging-and-why-developers-need-it/ https://blog.treasuredata.com/blog/2012/04/26/log-everything-as-json/
  81. 81. https://www.loggly.com/blog/8-handy-tips-consider-logging-json/
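The structured-logging advice above can be sketched as a small JSON logger for Node.js. This is a minimal illustration rather than the logger used in the talk: the `createLogger` name, the `LOG_LEVEL` environment variable and the field names (`level`, `message`, `timestamp`) are all assumptions made for the example.

```javascript
// Minimal structured logger: writes one JSON object per line to stdout,
// which Lambda forwards to CloudWatch Logs.
// All names here (createLogger, LOG_LEVEL, field names) are hypothetical.
const LEVELS = { DEBUG: 0, INFO: 1, WARN: 2, ERROR: 3 };

function createLogger(minLevel = process.env.LOG_LEVEL || 'INFO', write = console.log) {
  // Unknown level names fall back to INFO.
  const threshold = LEVELS[minLevel] ?? LEVELS.INFO;
  const logAt = (level) => (message, fields = {}) => {
    if (LEVELS[level] < threshold) return; // below the configured level: drop
    write(JSON.stringify({ ...fields, level, message, timestamp: new Date().toISOString() }));
  };
  return { debug: logAt('DEBUG'), info: logAt('INFO'), warn: logAt('WARN'), error: logAt('ERROR') };
}

// Usage: extra fields become searchable keys instead of being buried in free text.
const log = createLogger();
log.info('fetching user info from user-api', { userId: 'user-123', latencyMs: 27.4 });
```

Because every entry is a single JSON line, a log aggregation system can index fields such as `userId` directly instead of parsing free-form messages.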
  82. 82. traditional loggers are too heavy for Lambda
  83. 83. CloudWatch Logs $0.50 per GB ingested $0.03 per GB archived per month
  84. 84. CloudWatch Logs $0.50 per GB ingested $0.03 per GB archived per month 1M invocations of a 128MB function = $0.000000208 * 1M + $0.20 = $0.408
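The arithmetic on this slide can be reproduced and compared against log ingestion cost. The sketch below uses the prices quoted on the slides; the one billed 100ms unit per invocation matches the slide's formula, while the 1 KB of log output per invocation and the 1 GB = 10^9 bytes conversion are assumptions chosen purely for illustration.

```javascript
// Prices as quoted on the slides (2018); check current pricing before relying on them.
const PRICE_PER_100MS_128MB = 0.000000208; // USD per 100ms of a 128MB function
const PRICE_PER_MILLION_REQUESTS = 0.20;   // USD
const LOG_INGESTION_PER_GB = 0.50;         // USD, CloudWatch Logs

// Invocation cost = compute time + request charge.
function lambdaCost(invocations, billed100msUnits = 1) {
  return invocations * billed100msUnits * PRICE_PER_100MS_128MB
    + (invocations / 1e6) * PRICE_PER_MILLION_REQUESTS;
}

// Assumes 1 GB = 1e9 bytes for simplicity.
function logIngestionCost(invocations, bytesPerInvocation) {
  return ((invocations * bytesPerInvocation) / 1e9) * LOG_INGESTION_PER_GB;
}

console.log(lambdaCost(1e6).toFixed(3));             // 0.408, as on the slide
console.log(logIngestionCost(1e6, 1000).toFixed(3)); // 0.500 for a hypothetical 1 KB per invocation
```

Under these assumptions, writing about 1 KB of logs per invocation already costs more than running the function, which is the motivation for not leaving debug logging on in production.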
  85. 85. DON’T leave debug logging ON in production
  86. 86. have to redeploy ALL the functions along the call path to collect all relevant debug logs
  87. 87. https://github.com/middyjs/middy
  88. 88. EC2 Lambda Lambda Lambda Lambda Lambda Concurrency is handled by the AWS Lambda platform
  89. 89. sampling decision has to be followed by an entire call chain
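One way to make a sampling decision that the entire call chain follows is to decide once at the edge and pass the decision along with each request. The sketch below assumes the decision travels in a `Debug-Log-Enabled` header (the header name appears later in the slides); the function names and the 1% default sample rate are hypothetical.

```javascript
// Case-insensitive header lookup, since header casing can vary between callers.
function findHeader(headers, name) {
  const key = Object.keys(headers).find((k) => k.toLowerCase() === name.toLowerCase());
  return key === undefined ? undefined : headers[key];
}

// Decide whether this invocation should emit debug logs.
// If an upstream function already decided, follow its decision;
// otherwise (first function in the chain) roll the dice.
function decideDebugLogging(incomingHeaders, sampleRate = 0.01, random = Math.random) {
  const upstream = findHeader(incomingHeaders, 'Debug-Log-Enabled');
  if (upstream !== undefined) return upstream === 'true';
  return random() < sampleRate;
}

// Attach the decision to outgoing calls so downstream functions follow it.
function withDebugDecision(debugEnabled, headers = {}) {
  return { ...headers, 'Debug-Log-Enabled': String(debugEnabled) };
}
```

With this arrangement a sampled request produces debug logs in every function it touches, without redeploying any of them.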
  90. 90. Initial Request ID User ID Session ID User-Agent Order ID …
  91. 91. nonintrusive
  92. 92. nonintrusive extensible
  93. 93. nonintrusive extensible consistent
  94. 94. nonintrusive extensible consistent works for streams
  95. 95. EC2 Lambda Lambda Lambda Lambda Lambda Concurrency is handled by the AWS Lambda platform
  96. 96. store correlation IDs in global variable
  97. 97. use middleware to auto-capture incoming correlation IDs
  98. 98. extract correlation IDs from invocation event, and store them in the correlation-ids module reset
  99. 99. logger to always include captured correlation IDs
  100. 100. HTTP and AWS SDK clients to auto-forward correlation IDs on
  101. 101. context.awsRequestId get-index
  102. 102. context.awsRequestId x-correlation-id get-index
  103. 103. { "headers": { "x-correlation-id": "…" }, … } get-index
  104. 104. { "body": null, "resource": "/restaurants", "headers": { "x-correlation-id": "…" }, … } get-index get-restaurants
  105. 105. get-index → get-restaurants: each function captures headers["User-Agent"], headers["Debug-Log-Enabled"] and headers["x-correlation-id"] from its function event into global.CONTEXT (x-correlation-id = …, x-correlation-xxx = …), includes them in log.info(…), and forwards them on the outgoing request
  106. 106. nonintrusive extensible consistent works for streams
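The capture, log and forward steps for correlation IDs can be sketched for the HTTP case as follows. This is an illustration under stated assumptions, not the talk's actual module: the `correlationIds` object, `captureFromHttpEvent`, `logInfo` and `withCorrelationIds` are hypothetical names, and only `x-correlation-*` headers on an API Gateway style event are handled (streams such as Kinesis would need their own capture logic).

```javascript
// Hypothetical stand-in for the "correlation-ids module" on the slides.
// A module-level variable is enough because each Lambda container
// processes one invocation at a time.
let currentIds = {};

const correlationIds = {
  reset: () => { currentIds = {}; },
  set: (name, value) => { currentIds[name] = value; },
  get: () => ({ ...currentIds }),
};

// Capture: copy every x-correlation-* header from the invocation event,
// falling back to the Lambda request ID when the caller sent none.
function captureFromHttpEvent(event, awsRequestId) {
  correlationIds.reset(); // never leak IDs from the previous invocation
  for (const [name, value] of Object.entries(event.headers || {})) {
    const lower = name.toLowerCase();
    if (lower.startsWith('x-correlation-')) correlationIds.set(lower, value);
  }
  if (!correlationIds.get()['x-correlation-id']) {
    correlationIds.set('x-correlation-id', awsRequestId);
  }
}

// Log: every entry carries the captured IDs.
function logInfo(message, fields = {}, write = console.log) {
  write(JSON.stringify({ ...correlationIds.get(), ...fields, level: 'INFO', message }));
}

// Forward: merge the captured IDs into the headers of an outgoing call.
function withCorrelationIds(headers = {}) {
  return { ...headers, ...correlationIds.get() };
}
```

A handler (or a middleware wrapped around it) would call `captureFromHttpEvent(event, context.awsRequestId)` first, so that application code can log and make outgoing calls without handling the IDs explicitly.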
  107. 107. MONITORING
  108. 108. •no background processing •nowhere to install agents/daemons new challenges
  109. 109. my code send metrics internet internet press button something happens
  110. 110. the extra 10-20ms spent sending custom metrics compounds when you have microservices and multiple APIs are called to serve a single user event
  111. 111. Amazon found every 100ms of latency cost them 1% in sales. http://bit.ly/2EXPfbA
  112. 112. console.log("hydrating yubls from db…"); console.log("fetching user info from user-api"); console.log("MONITORING|1489795335|27.4|latency|user-api-latency"); console.log("MONITORING|1489795335|8|count|yubls-served"); (the first two lines are logs, the last two are metrics; metric fields: timestamp | metric value | metric type | metric name)
  113. 113. CloudWatch Logs AWS Lambda ELK stack logs metrics CloudWatch
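The log-processing function in this pipeline has to tell metric lines apart from ordinary log lines. A minimal sketch of that parsing step is shown below, based on the `MONITORING|timestamp|value|type|name` format from the previous slide. The `parseMetric` name is hypothetical, and treating the timestamp as epoch seconds is an assumption inferred from the sample value.

```javascript
// Parse a "MONITORING|timestamp|value|type|name" log message into a metric.
// Returns null for ordinary log lines so the caller can route them elsewhere.
function parseMetric(logMessage) {
  const parts = logMessage.trim().split('|');
  if (parts.length !== 5 || parts[0] !== 'MONITORING') return null;

  const [, timestamp, value, type, name] = parts;
  const metricValue = Number(value);
  const epochSeconds = Number(timestamp);
  if (Number.isNaN(metricValue) || Number.isNaN(epochSeconds)) return null;

  return {
    timestamp: new Date(epochSeconds * 1000), // assumes epoch seconds
    value: metricValue,
    type, // e.g. "latency" or "count"
    name, // e.g. "user-api-latency"
  };
}

// Usage: metrics and logs arrive interleaved in the same log stream.
console.log(parseMetric('MONITORING|1489795335|27.4|latency|user-api-latency'));
console.log(parseMetric('fetching user info from user-api')); // null
```

The parsed metrics could then be published in the background, for example through the CloudWatch PutMetricData API, so the function serving the user request pays no extra latency.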
  114. 114. trade-off delay cost concurrency
  115. 115. trade-off delay cost concurrency no latency overhead
  116. 116. API Gateway send custom metrics asynchronously
  117. 117. SNS Kinesis S3 API Gateway … send custom metrics asynchronously send custom metrics as part of function invocation
  118. 118. TRACING
  119. 119. X-Ray
  120. 120. X-Ray traces don’t span over async invocations: good for identifying the dependencies of a function, but not good enough for tracing the entire call chain as a user request or data flows through the system via async event sources.
  121. 121. X-Ray traces don’t span over non-AWS services
  122. 122. write structured logs
  123. 123. instrument your code
  124. 124. make it easy to do the right thing
  125. 125. API Gateway and Kinesis Authentication & authorisation (IAM, Cognito) Testing Running & Debugging functions locally Log aggregation Monitoring & Alerting X-Ray Correlation IDs CI/CD Performance and Cost optimisation Error Handling Configuration management VPC Security Leading practices (API Gateway, Kinesis, Lambda) Canary deployments http://bit.ly/prod-ready-serverless get 40% off with: ytcui
  126. 126. @theburningmonk theburningmonk.com github.com/theburningmonk
