
The present and future of Serverless observability (Serverless Computing London)


As engineers, we’re empowered by advancements in cloud platforms to build ever more complex systems that can achieve amazing feats at a scale previously only possible for an elite few. Monitoring tools have evolved over the years to accommodate our growing needs with these increasingly complex systems, but the emergence of serverless technologies such as AWS Lambda has shifted the landscape and broken some of the underlying assumptions existing tools are built upon: you can no longer access the underlying host to install monitoring agents/daemons, and it’s no longer feasible to use background threads to send monitoring data outside the critical path.

Furthermore, event-driven architectures have become easily accessible and widely adopted by those embracing serverless technologies. This trend has added another layer of complexity to how we monitor and debug our systems, as it involves tracing executions that flow through asynchronous invocations, often fanned out and fanned in via various event processing patterns.

Join us in this talk as Yan Cui gives us an overview of the challenges with observing a serverless architecture (ephemerality, no access to host OS, no background thread for sending monitoring data, etc.), the tradeoffs to consider, and the state of the tooling for serverless observability.

Published in: Technology

The present and future of Serverless observability (Serverless Computing London)

  1. 1. @theburningmonk#aws #awslambda #serverless the present and future of serverless observability Yan Cui @theburningmonk
  2. 2. Abraham Wald
  5. 5. Abraham Wald Wald noted that the study only considered the aircraft that had survived their missions—the bombers that had been shot down were not present for the damage assessment. The holes in the returning aircraft, then, represented areas where a bomber could take damage and still return home safely.
  7. 7. survivor bias in monitoring
  8. 8. survivor bias in monitoring We only focus on the failure modes that we were able to identify through past investigation and postmortems. The bullet holes that shot us down and went unidentified stay invisible, and will continue to shoot us down.
  9. 9. Yan Cui http://theburningmonk.com @theburningmonk Principal Engineer @
  10. 10. available in Austria, Switzerland, Germany, Japan, Canada, Italy and US
  11. 11. available on 30+ platforms
  12. 12. ~1,000,000 concurrent viewers
  13. 13. We’re hiring! Visit engineering.dazn.com to learn more. follow @dazneng for updates about the engineering team
  14. 14. follow @dazneng for updates about the engineering team We’re hiring! Visit engineering.dazn.com to learn more. WE’RE HIRING!
  15. 15. AWS user since 2009
  17. 17. https://www.youtube.com/watch?v=pptsgV4bKv8
  18. 18. https://bit.ly/production-ready-serverless
  19. 19. http://bit.ly/2C9LwIM
  20. 20. 2017 observability
  21. 21. Monitoring watching out for known failure modes in the system, e.g. network I/O, CPU, memory usage, …
  22. 22. Observability being able to debug the system, and gain insights into the system’s behaviour
  23. 23. In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. https://en.wikipedia.org/wiki/Observability
  24. 24. Known Success
  25. 25. Known Success Known Errors
  26. 26. Known Success Known Errors easy to monitor!
  27. 27. Known Success Known Errors Known Unknowns
  28. 28. Known Success Known Errors Known Unknowns Unknown Unknowns
  29. 29. Known Success Known Errors Known Unknowns Unknown Unknowns invisible bullet holes
  30. 30. Known Success Known Errors Known Unknowns Unknown Unknowns
  31. 31. Known Success Known Errors Known Unknowns Unknown Unknowns only alert on this
  32. 32. Known Success Known Errors Known Unknowns Unknown Unknowns alert on the absence of this!
  33. 33. Known Success Known Errors Known Unknowns Unknown Unknowns what went wrong?
  34. 34. “These are the four pillars of the Observability Engineering team’s charter: • Monitoring • Alerting/visualization • Distributed systems tracing infrastructure • Log aggregation/analytics” - Observability Engineering at Twitter, http://bit.ly/2DnjyuW
  35. 35. microservices death stars circa 2015
  36. 36. microservices death stars circa 2015 I got this!
  37. 37. new challenges
  39. 39. NO ACCESS to underlying OS
  40. 40. NOWHERE to install agents/daemons
  41. 41. •nowhere to install agents/daemons new challenges
  42. 42. user request user request user request user request user request user request user request critical paths: minimise user-facing latency handler handler handler handler handler handler handler
  43. 43. user request user request user request user request user request user request user request critical paths: minimise user-facing latency StatsD handler handler handler handler handler handler handler rsyslog background processing: batched, asynchronous, low overhead
  44. 44. user request user request user request user request user request user request user request critical paths: minimise user-facing latency StatsD handler handler handler handler handler handler handler rsyslog background processing: batched, asynchronous, low overhead NO background processing except what platform provides
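In a Lambda function there is no background thread to drain a telemetry buffer: anything not flushed before the handler returns may never be sent, because the execution environment can be frozen between invocations. A minimal sketch of the resulting pattern (`createMetricsBuffer` and `publish` are illustrative names, not a real library):

```javascript
// Sketch: buffer metrics during an invocation and flush them explicitly
// before the handler returns. `publish` stands in for whatever call ships
// the batch to the telemetry system (HTTP POST, SDK call, etc.).
function createMetricsBuffer(publish) {
  const buffer = [];
  return {
    record(name, value) {
      buffer.push({ name, value, timestamp: Date.now() });
    },
    // Must be awaited at the end of the handler: once the handler returns,
    // the execution environment may be frozen and in-flight work suspended.
    async flush() {
      if (buffer.length === 0) return 0;
      const batch = buffer.splice(0, buffer.length);
      await publish(batch);
      return batch.length;
    },
  };
}
```

The flush happens on the critical path of the invocation, which is exactly the latency trade-off the later slides discuss.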
  45. 45. •no background processing •nowhere to install agents/daemons new challenges
  46. 46. EC2 concurrency used to be handled by your code
  47. 47. EC2 Lambda Lambda Lambda Lambda Lambda now, it’s handled by the AWS Lambda platform
  48. 48. EC2 logs & metrics used to be batched here
  49. 49. EC2 Lambda Lambda Lambda Lambda Lambda now, they are batched in each concurrent execution, at best…
  50. 50. HIGHER concurrency to log aggregation/telemetry system
  51. 51. •higher concurrency to telemetry system •nowhere to install agents/daemons •no background processing new challenges
  52. 52. Lambda cold start
  53. 53. Lambda data is batched between invocations
  54. 54. Lambda idle data is batched between invocations
  55. 55. Lambda idle garbage collectiondata is batched between invocations
  56. 56. Lambda idle garbage collectiondata is batched between invocations HIGH chance of data loss
  57. 57. •high chance of data loss (if batching) •nowhere to install agents/daemons •no background processing •higher concurrency to telemetry system new challenges
  58. 58. Lambda
  59. 59. my code send metrics
  61. 61. my code send metrics internet internet press button something happens
  62. 62. http://bit.ly/2Dpidje
  63. 63. ? functions are often chained together via asynchronous invocations
  64. 64. ? SNS Kinesis CloudWatch Events CloudWatch Logs IoT DynamoDB S3 SES
  65. 65. ? SNS Kinesis CloudWatch Events CloudWatch Logs IoT DynamoDB S3 SES tracing ASYNCHRONOUS invocations through so many different event sources is difficult
  66. 66. •asynchronous invocations •nowhere to install agents/daemons •no background processing •higher concurrency to telemetry system •high chance of data loss (if batching) new challenges
  67. 67. the Present
  68. 68. “These are the four pillars of the Observability Engineering team’s charter: • Monitoring • Alerting/visualization • Distributed systems tracing infrastructure • Log aggregation/analytics” - Observability Engineering at Twitter, http://bit.ly/2DnjyuW
  69. 69. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?
  70. 70. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now? UTC Timestamp Request Id your log message
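The anatomy above can be captured with a small parser. A sketch, assuming the three fields are whitespace-separated as shown on the slide (`parseLambdaLogLine` is an illustrative name):

```javascript
// Sketch: split a Lambda log line of the shape
// "<UTC timestamp> <request id> <message>" into named fields.
function parseLambdaLogLine(line) {
  const m = line.match(/^(\S+)\s+(\S+)\s+([\s\S]*)$/);
  if (m === null) return null; // not in the expected shape
  return { timestamp: m[1], requestId: m[2], message: m[3] };
}
```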
  71. 71. one log group per function; one log stream for each concurrent invocation
  72. 72. “logs are not easily searchable in CloudWatch Logs” - me
  73. 73. CloudWatch Logs
  74. 74. CloudWatch Logs AWS Lambda ELK stack
  76. 76. CloudWatch Logs
  78. 78. •no background processing •nowhere to install agents/daemons new challenges
  79. 79. my code send metrics internet internet press button something happens
  80. 80. those extra 10-20ms for sending custom metrics would compound when you have microservices and multiple APIs are called within one slice of user event
  81. 81. Amazon found every 100ms of latency cost them 1% in sales. http://bit.ly/2EXPfbA
  82. 82. console.log("hydrating yubls from db…"); console.log("fetching user info from user-api"); console.log("MONITORING|1489795335|27.4|latency|user-api-latency"); console.log("MONITORING|1489795335|8|count|yubls-served"); the first two lines are plain logs, the last two encode metrics as timestamp | metric value | metric type | metric name
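A helper that emits metrics in the log format above might look like this (the `MONITORING|` convention is from the slide; the function names are illustrative). A downstream log processor extracts these lines from CloudWatch Logs and publishes them to the metrics system, keeping the cost off the function's critical path:

```javascript
// Sketch: encode a custom metric as a log line in the format
// MONITORING|<timestamp>|<metric value>|<metric type>|<metric name>.
function formatMetric(name, value, type, timestamp = Math.floor(Date.now() / 1000)) {
  return `MONITORING|${timestamp}|${value}|${type}|${name}`;
}

// Writing to stdout is all it takes: the Lambda runtime forwards
// console output to CloudWatch Logs.
function logMetric(name, value, type) {
  console.log(formatMetric(name, value, type));
}
```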
  83. 83. CloudWatch Logs AWS Lambda ELK stack logs metrics CloudWatch
  84. 84. delay cost concurrency
  85. 85. delay cost concurrency no latency overhead
  86. 86. API Gateway send custom metrics asynchronously
  87. 87. SNS KinesisS3API Gateway … send custom metrics asynchronously send custom metrics as part of function invocation
  88. 88. X-Ray
  89. 89. do not span over API Gateway
  90. 90. narrow focus on a function: good for homing in on performance issues for a particular function, but offers little to help you build intuition about how your system operates as a whole.
  91. 91. “However, I would argue that the health of the system no longer matters. We've entered an era where what matters is the health of each individual event, or each individual user's experience, or each shopping cart's experience (or other high cardinality dimensions). With distributed systems you don't care about the health of the system, you care about the health of the event or the slice.” - Charity Majors, http://bit.ly/2E2QngU
  92. 92. follow the data
  93. 93. don’t span over async invocations: good for identifying dependencies of a function, but not good enough for tracing the entire call chain as user request/data flows through the system via async event sources.
  94. 94. don’t span over non-AWS services
  95. 95. static view
  96. 96. our tools need to do more to help us with understanding & debugging our distributed system, not just what happens inside one function
  97. 97. “one user action/vertical slice through the system”
  98. 98. microservices death stars circa 2015
  99. 99. microservices death stars circa 2015 HELP…
  100. 100. WARNING: this is part fiction, part inspired by new tools
  101. 101. DASHBOARDS
  102. 102. different dimensions of service X splattered across the screen
  103. 103. + cold starts + throttled invocations + concurrent executions + estimated cost ($)
  104. 104. SubscriberGetAccount: Req Rate 20,056.0/s; Est Cost $54.0/s; Concurrency 370; last-minute latency percentiles: Median 1ms, Mean 4ms, 90th 10ms, 99th 44ms, 99.5th 61ms; rolling 10-second counters with 1-second granularity: 200,545 successes, 0 cold starts, 19 timeouts, 94 throttled invocations, 0 errors; error and cold start percentages over the last 10 seconds: 0% and 0%. Circle colour and size represent health and traffic volume; 2 minutes of request rate show relative changes in traffic.
  109. 109. birds-eye view of our system as it lives and breathes
  110. 110. user profile-images POST /user process-images resize-images image-tasks Auth0 create-user reformat-images tag-user Face API create-auth0-user
  111. 111. user profile-images POST /user process-images resize-images image-tasks Auth0 create-user reformat-images tag-user Face API trace async invocations create-auth0-user
  112. 112. user profile-images POST /user process-images resize-images image-tasks Auth0 create-user reformat-images tag-user Face API trace non-AWS resources create-auth0-user
  113. 113. user profile-images POST /user process-images resize-images image-tasks Auth0 create-user create-auth0-user reformat-images tag-user Face API. Logs (timestamp | component | level | message): 2018/01/25 20:51:23.188 | POST /user | debug | incoming request…; 2018/01/25 20:51:23.201 | create-user | debug | saving user [theburningmonk] in the [user] table…; 2018/01/25 20:51:23.215 | create-user | debug | saved user [theburningmonk] in the [user] table; 2018/01/25 20:51:23.521 | tag-user | debug | tagging user [theburningmonk] with Azure Face API…
  115. 115. user profile-images POST /user process-images resize-images image-tasks Auth0 create-user create-auth0-user reformat-images tag-user Face API. Logs (timestamp | component | level | message): 2018/01/25 20:51:23.188 | POST /user | debug | incoming request…; expanded fields: request-id 0ae4ba5d-dab1-4f9e-9de7-eace27ebfbc2, start-time 2018/01/25 20:51:23.188, method POST
  116. 116. user profile-images POST /user process-images resize-images image-tasks Auth0 create-user create-auth0-user reformat-images tag-user Face API. Logs (timestamp | component | level | message): 2018/01/25 20:51:23.201 | create-user | debug | saving user [theburningmonk] in the [user] table…; 2018/01/25 20:51:23.215 | create-user | debug | saved user [theburningmonk] in the [user] table; 2018/01/25 20:51:23.585 | create-user | debug | uploading profile image…; 2018/01/25 20:51:23.587 | create-user | debug | tagged user [theburningmonk] with Azure Face API…
  117. 117. user profile-images POST /user process-images resize-images image-tasks Auth0 create-user create-auth0-user reformat-images tag-user Face API. Logs (timestamp | component | level | message): 2018/01/25 20:51:23.201 | create-user | debug | saving user [theburningmonk] in the [user] table…; 2018/01/25 20:51:23.215 | create-user | debug | saved user [theburningmonk] in the [user] table; 2018/01/25 20:51:23.585 | create-user | debug | uploading profile image…; 2018/01/25 20:51:23.587 | create-user | debug | tagged user [theburningmonk] with Azure Face API…; click here to go to code
  118. 118. Logs Input/Output user profile-images POST /user process-images resize-images image-tasks Auth0 create-user reformat-images tag-user Face API input output { "body": "{ "username":"theburningmonk"}", "resource": "/user", "requestContext": { "resourceId": "123456", "apiId": "1234567890", "resourcePath": "/user", { "statusCode": 200 } create-auth0-user
  119. 119. Logs Input/Output user profile-images POST /user process-images resize-images image-tasks Auth0 create-user reformat-images tag-user Face API input output { "Records": [ { "Sns": { "Type": "Notification", "MessageId": "…", "TopicArn": "…", "Message": "…", "Timestamp": "2018/01/25 20:51:24.215", { "error": null, "result": "OK" } create-auth0-user
  120. 120. Logs Input/Output user profile-images POST /user process-images resize-images image-tasks Auth0 create-user reformat-images tag-user Face API input error { "Records": [ { "Sns": { "Type": "Notification", "MessageId": "…", "TopicArn": "…", "Message": "…", "Timestamp": "2018/01/25 20:51:24.215", [com.spaceape.dragon.handler.ReformatProfileImageHandler] Null reference exception *java.lang.NullReferenceException: … * at … * at … * at … create-auth0-user
  121. 121. Logs Input/Output user profile-images POST /user process-images resize-images image-tasks Auth0 create-user create-auth0-user reformat-images tag-user Face API input error { "Records": [ { "Sns": { "Type": "Notification", "MessageId": "…", "TopicArn": "…", "Message": "…", "Timestamp": "2018/01/25 20:51:24.215", [com.spaceape.dragon.handler.ReformatProfileImageHandler] Null reference exception *java.lang.NullReferenceException: … * at … * at … * at … !
  122. 122. All 0 200 400 600 800 create-user …user.insert_user …user.upload_img tag-user create-auth0-user process-images resize-images reformat-images! 837ms 406ms 66ms 114ms 122ms 82ms 240ms 157ms 35ms
  124. 124. Input/Output user profile-images POST /user process-images resize-images image-tasks Auth0 create-user create-auth0-user reformat-images tag-user Face API Logs ! All 0 200 400 600 800 create-user …user.insert_user …user.upload_img tag-user create-auth0-user process-images resize-images reformat-images! 837ms 406ms 66ms 114ms 122ms 82ms 240ms 157ms 35ms
  126. 126. TRACING: all your needs in one place
  127. 127. mmm… it’s a graph
  128. 128. what if we can query it like a graph?
  129. 129. http://amzn.to/2nk7uiW
  130. 130. ability to query based on the relationship between observed components (as well as the components themselves)
  131. 131. root cause analysis
  132. 132. the elevated error rate in service X was caused by DynamoDB table throttling.“ ”
  133. 133. payment was slow last night around 10PM. investigate.
  134. 134. time 95-percentile latency service A service B 10PM
  135. 135. time 95-percentile latency service A service B 10PM causality? or correlation?
  136. 136. user-service USES USES DEPENDS_ON auth-service USES payment-service DEPENDS_ON “payment was slow last night around 10PM” user-table
  137. 137. user-service USES USES DEPENDS_ON auth-service USES DEPENDS_ON payment-service user-table throttled exceptions!
  138. 138. user-table user-stream DEPENDS_ON DEPENDS_ON USES USES USES USES USES DEPENDS_ON DEPENDS_ON DEPENDS_ON PUBLISHES_TO “what else is impacted by the throttled exceptions on user-table?”
  140. 140. wouldn’t that be nice?
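The impact question on the slides above is a graph traversal. A toy sketch, with a hypothetical edge list standing in for a system map built from tracing data; it only walks edges in the reverse-dependency direction, so downstream consumers such as user-stream are out of scope here:

```javascript
// Hypothetical system map: each edge points from a component to the thing
// it uses or depends on, mirroring the USES / DEPENDS_ON labels on the slides.
const edges = [
  { from: 'user-service', to: 'user-table', rel: 'USES' },
  { from: 'auth-service', to: 'user-service', rel: 'DEPENDS_ON' },
  { from: 'payment-service', to: 'user-service', rel: 'DEPENDS_ON' },
  { from: 'user-table', to: 'user-stream', rel: 'PUBLISHES_TO' },
];

// Breadth-first walk: everything that directly or transitively uses or
// depends on `start` is impacted by its throttled exceptions.
function impactedBy(start, graph) {
  const impacted = new Set();
  const queue = [start];
  while (queue.length > 0) {
    const node = queue.shift();
    for (const edge of graph) {
      if (edge.to === node && !impacted.has(edge.from)) {
        impacted.add(edge.from);
        queue.push(edge.from);
      }
    }
  }
  return [...impacted];
}
```

Here `impactedBy('user-table', edges)` finds user-service, auth-service and payment-service; a real observability tool would run the equivalent query against a live graph of observed components.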
  141. 141. MACHINE LEARNING
  142. 142. use ML to auto-detect erroneous or suspicious behaviours, or to suggest possible improvements
  143. 143. ! Function [X] just performed an unexpected write against DynamoDB table [Y]. Should I… ignore it from now on shut it down!!
  144. 144. Observability Bot <bot@bestobservability.com>
  145. 145. Observability Bot <bot@bestobservability.com> don’t bother me about this again
  146. 146. Observability Bot <bot@bestobservability.com> auto-modify IAM role with DENY rule
  147. 147. Function [X]’s performance has degraded since yesterday - 99% latency has gone up by 47% from 100ms to 147ms. !
  148. 148. ! Function [X] can run faster & cheaper if you increase its memory allocation. Should I… ignore it from now on update setting
  149. 149. zzz… the future of… zzz … serverless observability… zzz
  150. 150. Simon Wardley
  151. 151. Simon Wardley context & movement
  152. 152. “However, I would argue that the health of the system no longer matters. We've entered an era where what matters is the health of each individual event, or each individual user's experience, or each shopping cart's experience (or other high cardinality dimensions). With distributed systems you don't care about the health of the system, you care about the health of the event or the slice.” - Charity Majors, http://bit.ly/2E2QngU
  153. 153. “one user action/vertical slice through the system”
  154. 154. movement context movement
  155. 155. The best way to predict the future is to invent it. - Alan Kay
  156. 156. The best way to invent the future is to inception someone else to do it. - me
