
The present and future of serverless observability


As engineers, we’re empowered by advancements in cloud platforms to build ever more complex systems that can achieve amazing feats at a scale previously only possible for the elite few. Monitoring tools have evolved over the years to accommodate our growing needs with these increasingly complex systems, but the emergence of serverless technologies like AWS Lambda has shifted the landscape and broken some of the underlying assumptions that existing tools are built upon - e.g. you can no longer access the underlying host to install monitoring agents/daemons, and it’s no longer feasible to use background threads to send monitoring data outside the critical path.

Furthermore, event-driven architectures have become easily accessible and widely adopted by those adopting serverless technologies, and this trend has added another layer of complexity to how we monitor and debug our systems, as it involves tracing executions that flow through asynchronous invocations, often fanned out and fanned in via various event processing patterns.

Join us in this talk as Yan Cui gives us an overview of the challenges with observing a serverless architecture (ephemerality, no access to host OS, no background thread for sending monitoring data, etc.), the tradeoffs to consider, and the state of the tooling for serverless observability.

Published in: Technology


  1. The present and future of serverless observability
  2. hi, my name is Yan.
  3. I’m a principal engineer at DAZN
  4. available in Austria, Switzerland, Germany, Japan and Canada, on 30+ platforms
  5. coming to the US
  6. We’re hiring! Follow @DAZN_ngnrs for updates about the engineering team, and visit engineering.dazn.com to learn more.
  7. AWS user since 2009
  8. http://bit.ly/yubl-serverless
  9. http://bit.ly/2Cdsai5
  10. 2017: observability
  11. Monitoring: watching out for known failure modes in the system, e.g. network I/O, CPU, memory usage, …
  12. Observability: being able to debug the system, and gain insights into the system’s behaviour
  13. http://bit.ly/2EXKEFZ
  14. mm… I wonder what’s going on here…
  15. “These are the four pillars of the Observability Engineering team’s charter: • Monitoring • Alerting/visualization • Distributed systems tracing infrastructure • Log aggregation/analytics” - Observability Engineering at Twitter, http://bit.ly/2DnjyuW
  16. microservices death stars, circa 2015
  17. microservices death stars, circa 2015. I got this!
  18. new challenges
  20. NO ACCESS to underlying OS
  21. NOWHERE to install agents/daemons
  22. new challenges: • nowhere to install agents/daemons
  23. [diagram: seven user requests, each handled by a handler] critical paths: minimise user-facing latency
  24. [diagram: user requests → handlers on the critical path; StatsD and rsyslog off the critical path] critical paths: minimise user-facing latency; background processing: batched, asynchronous, low overhead
  25. [same diagram] critical paths: minimise user-facing latency; background processing: batched, asynchronous, low overhead. NO background processing except what platform provides
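The background processing the slides refer to is typified by the StatsD line protocol, which a host-level daemon batches and forwards over UDP off the critical path. A sketch of that format (illustrative only; in Lambda there is no host to run the daemon on):

```javascript
// Sketch of the StatsD line protocol that a host-level daemon would
// batch and forward over UDP, off the request's critical path.
// type: 'c' = counter, 'ms' = timing, 'g' = gauge
function formatStatsdLine(name, value, type) {
  return `${name}:${value}|${type}`;
}

// Daemons typically coalesce many metrics into one datagram:
function batchStatsdLines(metrics) {
  return metrics.map((m) => formatStatsdLine(m.name, m.value, m.type)).join('\n');
}
```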
  26. new challenges: • no background processing • nowhere to install agents/daemons
  27. EC2: concurrency used to be handled by your code
  28. EC2 → Lambda (x5): now, it’s handled by the AWS Lambda platform
  29. EC2: logs & metrics used to be batched here
  30. EC2 → Lambda (x5): now, they are batched in each concurrent execution, at best…
  31. HIGHER concurrency to log aggregation/telemetry system
  32. new challenges: • higher concurrency to telemetry system • nowhere to install agents/daemons • no background processing
  33. Lambda: cold start
  34. Lambda: data is batched between invocations
  35. Lambda: idle; data is batched between invocations
  36. Lambda: idle, garbage collection; data is batched between invocations
  37. Lambda: idle, garbage collection; data is batched between invocations. HIGH chance of data loss
  38. new challenges: • high chance of data loss (if batching) • nowhere to install agents/daemons • no background processing • higher concurrency to telemetry system
  39. Lambda
  40. my code → send metrics
  42. my code → internet → send metrics (press button, something happens)
  43. http://bit.ly/2Dpidje
  44. ? functions are often chained together via asynchronous invocations
  45. ? SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES
  46. ? SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES: tracing ASYNCHRONOUS invocations through so many different event sources is difficult
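One way to make asynchronous invocations traceable (a sketch, not something the slides prescribe) is to forward a correlation ID as an SNS message attribute. The shapes below mirror what the AWS SDK's `sns.publish()` accepts and what the Lambda SNS event delivers, but no AWS call is made here:

```javascript
// Attach a correlation ID to outgoing SNS publish parameters.
function withCorrelationId(publishParams, correlationId) {
  return {
    ...publishParams,
    MessageAttributes: {
      ...(publishParams.MessageAttributes || {}),
      'x-correlation-id': { DataType: 'String', StringValue: correlationId },
    },
  };
}

// Recover it in the receiving function from an SNS event record
// (incoming attributes arrive as { Type, Value }).
function extractCorrelationId(snsRecord) {
  const attrs = snsRecord.Sns.MessageAttributes || {};
  const attr = attrs['x-correlation-id'];
  return attr ? attr.Value : undefined;
}
```

The same idea extends to Kinesis and other event sources, but each one carries metadata differently, which is exactly why tracing across them is hard.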
  47. new challenges: • asynchronous invocations • nowhere to install agents/daemons • no background processing • higher concurrency to telemetry system • high chance of data loss (if batching)
  48. the Present
  49. “These are the four pillars of the Observability Engineering team’s charter: • Monitoring • Alerting/visualization • Distributed systems tracing infrastructure • Log aggregation/analytics” - Observability Engineering at Twitter, http://bit.ly/2DnjyuW
  50. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?
  51. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now? (UTC timestamp | request id | your log message)
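The slide above breaks a Lambda log line into a UTC timestamp, a request ID, and the message. A minimal parser for that layout might look like this (an illustrative sketch):

```javascript
// Split a Lambda log line from CloudWatch Logs into its parts,
// per the layout on the slide: timestamp, request ID, message.
function parseLogLine(line) {
  const [timestamp, requestId, ...rest] = line.trim().split(/\s+/);
  return { timestamp, requestId, message: rest.join(' ') };
}
```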
  52. one log group per function, one log stream for each concurrent invocation
  53. logs are not easily searchable in CloudWatch Logs - me
  54. CloudWatch Logs
  55. CloudWatch Logs → AWS Lambda → ELK stack
  56.
  57. CloudWatch Logs
  58. CloudWatch Logs
  59. new challenges: • no background processing • nowhere to install agents/daemons
  60. my code → internet → send metrics (press button, something happens)
  61. those extra 10-20ms for sending custom metrics compound when you have microservices and multiple APIs are called within one user action
  62. Amazon found every 100ms of latency cost them 1% in sales. http://bit.ly/2EXPfbA
  63. console.log(“hydrating yubls from db…”); console.log(“fetching user info from user-api”); console.log(“MONITORING|1489795335|27.4|latency|user-api-latency”); console.log(“MONITORING|1489795335|8|count|yubls-served”); format: MONITORING | timestamp | metric value | metric type | metric name (metrics embedded in logs)
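The `MONITORING|…` convention above embeds metrics in ordinary log lines so they ride along with the log pipeline at no extra latency; a downstream log processor can filter and parse them back out. A minimal sketch of that parsing step:

```javascript
// Recover a metric from a log line written in the
// MONITORING|timestamp|value|type|name convention shown on the slide.
function parseMetricLine(line) {
  if (!line.startsWith('MONITORING|')) return null; // an ordinary log line
  const [, timestamp, value, type, name] = line.split('|');
  return { timestamp: Number(timestamp), value: Number(value), type, name };
}
```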
  64. CloudWatch Logs → AWS Lambda → ELK stack (logs) and CloudWatch (metrics)
  65. delay, cost, concurrency
  66. delay, cost, concurrency: no latency overhead
  67. API Gateway: send custom metrics asynchronously
  68. API Gateway: send custom metrics asynchronously; SNS, Kinesis, S3, …: send custom metrics as part of function invocation
  69. X-Ray
  70. do not span over API Gateway
  71. narrow focus on a function: good for homing in on performance issues for a particular function, but offers little to help you build intuition about how your system operates as a whole.
  72. “However, I would argue that the health of the system no longer matters. We've entered an era where what matters is the health of each individual event, or each individual user's experience, or each shopping cart's experience (or other high cardinality dimensions). With distributed systems you don't care about the health of the system, you care about the health of the event or the slice.” - Charity Majors, http://bit.ly/2E2QngU
  73. follow the data
  74. don’t span over async invocations: good for identifying dependencies of a function, but not good enough for tracing the entire call chain as user request/data flows through the system via async event sources.
  75. don’t span over non-AWS services
  76. static view
  77. our tools need to do more to help us with understanding & debugging our distributed system, not just what happens inside one function
  78. “one user action/vertical slice through the system”
  79. microservices death stars, circa 2015
  80. microservices death stars, circa 2015. HELP…
  81. WARNING: this is part fiction, part inspired by new tools
  82. DASHBOARDS
  83. different dimensions of service X splattered across the screen
  84. +cold starts +throttled invocations +concurrent executions +estimated cost ($)
  85. [dashboard tile for SubscriberGetAccount] circle colour and size represent health and traffic volume; 2 minutes of request rate to show relative changes in traffic; rolling 10-second counters with 1-second granularity (successes: 200,545; cold starts: 0; timeouts: 19; throttled invocations: 94; errors: 0); error percentage of last 10 seconds: 0%; cold start percentage of last 10 seconds: 0%; request rate: 20,056.0/s; estimated cost: $54.0/s; no. of concurrent executions of this function: 370; last minute latency percentiles: median 1ms, mean 4ms, 90th 10ms, 99th 44ms, 99.5th 61ms
  90. bird’s-eye view of our system as it lives and breathes
  91. [diagram: user, profile-images, POST /user, process-images, resize-images, image-tasks, Auth0, create-user, reformat-images, tag-user, Face API, create-auth0-user]
  92. [same diagram] trace async invocations
  93. [same diagram] trace non-AWS resources
  94. [same diagram] Logs (timestamp | component | level | message): 2018/01/25 20:51:23.188 | POST /user | debug | incoming request…; 2018/01/25 20:51:23.201 | create-user | debug | saving user [theburningmonk] in the [user] table…; 2018/01/25 20:51:23.215 | create-user | debug | saved user [theburningmonk] in the [user] table; 2018/01/25 20:51:23.521 | tag-user | debug | tagging user [theburningmonk] with Azure Face API…
  96. [same diagram] Logs: 2018/01/25 20:51:23.188 | POST /user | debug | incoming request… (request-id: 0ae4ba5d-dab1-4f9e-9de7-eace27ebfbc2; start-time: 2018/01/25 20:51:23.188; method: POST)
  97. [same diagram] Logs (timestamp | component | level | message): 2018/01/25 20:51:23.201 | create-user | debug | saving user [theburningmonk] in the [user] table…; 2018/01/25 20:51:23.215 | create-user | debug | saved user [theburningmonk] in the [user] table; 2018/01/25 20:51:23.585 | create-user | debug | uploading profile image…; 2018/01/25 20:51:23.587 | create-user | debug | tagged user [theburningmonk] with Azure Face API…
  98. [same diagram and logs] click here to go to code
  99. [same diagram] Logs, Input/Output. input: { "body": "{ "username":"theburningmonk"}", "resource": "/user", "requestContext": { "resourceId": "123456", "apiId": "1234567890", "resourcePath": "/user", … output: { "statusCode": 200 }
  100. [same diagram] Logs, Input/Output. input: { "Records": [ { "Sns": { "Type": "Notification", "MessageId": "…", "TopicArn": "…", "Message": "…", "Timestamp": "2018/01/25 20:51:24.215", … output: { "error": null, "result": "OK" }
  101. [same diagram] Logs, Input/Output. input: { "Records": [ { "Sns": { "Type": "Notification", "MessageId": "…", "TopicArn": "…", "Message": "…", "Timestamp": "2018/01/25 20:51:24.215", … error: [com.spaceape.dragon.handler.ReformatProfileImageHandler] Null reference exception *java.lang.NullReferenceException: … * at … * at … * at …
  102. [same diagram, with the failing function flagged (!)] Logs, Input/Output. input: { "Records": [ { "Sns": { "Type": "Notification", "MessageId": "…", "TopicArn": "…", "Message": "…", "Timestamp": "2018/01/25 20:51:24.215", … error: [com.spaceape.dragon.handler.ReformatProfileImageHandler] Null reference exception *java.lang.NullReferenceException: … * at … * at … * at …
  103. [trace waterfall, 0-800ms scale] create-user, …user.insert_user, …user.upload_img, tag-user, create-auth0-user, process-images, resize-images, reformat-images (!); durations: 837ms, 406ms, 66ms, 114ms, 122ms, 82ms, 240ms, 157ms, 35ms
  105. [combined view: service diagram, Logs, Input/Output, and the trace waterfall from the previous slides in one place]
  107. TRACING: all your needs in one place
  108. mmm… it’s a graph
  109. what if we can query it like a graph?
  110. http://amzn.to/2nk7uiW
  111. ability to query based on the relationship between observed components (as well as the components themselves)
  112. root cause analysis
  113. “the elevated error rate in service X was caused by DynamoDB table throttling.”
  114. payment was slow last night around 10PM. investigate.
  115. [chart: 95th-percentile latency over time for service A and service B, spiking around 10PM]
  116. [chart: 95th-percentile latency over time for service A and service B, spiking around 10PM] causality? or correlation?
  117. [graph: user-service, auth-service, payment-service, user-table; USES and DEPENDS_ON edges] “payment was slow last night around 10PM”
  118. [graph: user-service, auth-service, payment-service, user-table; USES and DEPENDS_ON edges] throttled exceptions!
  119. [graph: user-table, user-stream and downstream services; USES, DEPENDS_ON and PUBLISHES_TO edges] “what else is impacted by the throttled exceptions on user-table?”
  121. wouldn’t that be nice?
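The "what else is impacted?" question on the preceding slides is essentially a reverse reachability query over the service dependency graph. A toy sketch under that assumption, with illustrative edge data (labels follow the slides):

```javascript
// Service dependencies as labelled edges (illustrative data only).
const edges = [
  { from: 'user-service', to: 'user-table', label: 'DEPENDS_ON' },
  { from: 'auth-service', to: 'user-table', label: 'DEPENDS_ON' },
  { from: 'payment-service', to: 'auth-service', label: 'USES' },
  { from: 'user-table', to: 'user-stream', label: 'PUBLISHES_TO' },
];

// Everything that transitively depends on the given node: walk the
// edges backwards (to → from) with a breadth-first search.
function impactedBy(node) {
  const impacted = new Set();
  const queue = [node];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const e of edges) {
      if (e.to === current && !impacted.has(e.from)) {
        impacted.add(e.from);
        queue.push(e.from);
      }
    }
  }
  return [...impacted].sort();
}
```

A real implementation would run this as a query in a graph database (the slide's http://amzn.to/2nk7uiW link points at AWS's graph offering) rather than scanning an edge list, but the shape of the question is the same.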
  122. MACHINE LEARNING
  123. use ML to auto-detect erroneous or suspicious behaviours, or to suggest possible improvements
  124. ! Function [X] just performed an unexpected write against DynamoDB table [Y]. Should I… [ignore it from now on] [shut it down!!]
  125. Observability Bot <bot@bestobservability.com>
  126. Observability Bot <bot@bestobservability.com> - don’t bother me about this again
  127. Observability Bot <bot@bestobservability.com> - auto-modify IAM role with DENY rule
  128. ! Function [X]’s performance has degraded since yesterday: 99th percentile latency has gone up by 47%, from 100ms to 147ms.
  129. ! Function [X] can run faster & cheaper if you increase its memory allocation. Should I… [ignore it from now on] [update setting]
  130. zzz… the future of… zzz… serverless observability… zzz
  131. Simon Wardley
  132. Simon Wardley: context & movement
  133. “However, I would argue that the health of the system no longer matters. We've entered an era where what matters is the health of each individual event, or each individual user's experience, or each shopping cart's experience (or other high cardinality dimensions). With distributed systems you don't care about the health of the system, you care about the health of the event or the slice.” - Charity Majors, http://bit.ly/2E2QngU
  134. “one user action/vertical slice through the system”
  135. movement, context, movement
  136. The best way to predict the future is to invent it. - Alan Kay
  137. The best way to invent the future is to inception someone else to do it. - me
