The present and future of Serverless observability

As engineers, we’re empowered by advancements in cloud platforms to build ever more complex systems that can achieve amazing feats at a scale previously only possible for the elite few. Monitoring tools have evolved over the years to accommodate our growing needs with these increasingly complex systems, but the emergence of serverless technologies like AWS Lambda has shifted the landscape and broken some of the underlying assumptions that existing tools are built upon: for example, you can no longer access the underlying host to install monitoring agents/daemons, and it’s no longer feasible to use background threads to send monitoring data outside the critical path.

Furthermore, event-driven architectures have become easily accessible and widely adopted by those using serverless technologies. This trend adds another layer of complexity to how we monitor and debug our systems, because it involves tracing executions that flow through asynchronous invocations and that often fan out and fan in via various event processing patterns.
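
One common way to follow an execution across asynchronous invocations is to pass a correlation ID along with every event (the final slide of the deck lists "Correlation IDs" as a topic). The sketch below is only an illustration of that idea: the attribute name `x-correlation-id` and the helper names are assumptions, not something taken from the talk or from any AWS SDK, and the attributes are plain objects rather than real SNS or Kinesis payloads.

```javascript
// Hypothetical helpers: none of these names come from the talk or from an AWS SDK.
const crypto = require("crypto");

const CORRELATION_KEY = "x-correlation-id"; // assumed attribute name

// Reuse the correlation ID that arrived with the event, or start a new trace.
function getCorrelationId(incomingAttributes) {
  const existing = incomingAttributes && incomingAttributes[CORRELATION_KEY];
  return existing || crypto.randomUUID();
}

// Copy the ID onto the attributes of any message this function publishes,
// so the next function in the chain can pick it up.
function withCorrelationId(outgoingAttributes, correlationId) {
  return { ...outgoingAttributes, [CORRELATION_KEY]: correlationId };
}

// Structured log line carrying the ID, so logs from different functions
// can be joined on it later.
function logLine(correlationId, component, message) {
  return JSON.stringify({ correlationId, component, message });
}

const id = getCorrelationId({}); // first hop: no ID yet, so one is generated
const attrs = withCorrelationId({ source: "create-user" }, id);
const nextId = getCorrelationId(attrs); // next hop: the same ID is reused
console.log(logLine(nextId, "tag-user", "tagging user"));
```

Every function still has to read and forward the ID itself, which is part of why tracing through many different event sources is hard without tool support.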

Join us in this talk as Yan Cui gives us an overview of the challenges with observing a serverless architecture (ephemerality, no access to host OS, no background thread for sending monitoring data, etc.), the tradeoffs to consider, and the state of the tooling for serverless observability.
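
To make the "no background thread" trade-off concrete, here is a minimal sketch of one option: buffer metrics during an invocation and flush them before the handler returns. It assumes a hypothetical `send` function standing in for whatever client ships the metrics; `createMetricBuffer` and `withMetrics` are illustrative names, not an API from the talk or from AWS.

```javascript
// Hypothetical sketch: `send` stands in for whatever client ships metrics.
function createMetricBuffer(send) {
  let buffer = [];
  return {
    // Record a data point in memory; nothing is sent yet.
    record(name, value) {
      buffer.push({ name, value });
    },
    // Send everything recorded so far as one batch.
    async flush() {
      const batch = buffer;
      buffer = [];
      if (batch.length > 0) await send(batch);
    },
    size() {
      return buffer.length;
    },
  };
}

// Wrap a handler so the buffer is always flushed before the invocation ends,
// even when the handler throws.
function withMetrics(handler, metrics) {
  return async (event) => {
    try {
      return await handler(event, metrics);
    } finally {
      await metrics.flush();
    }
  };
}

const sent = [];
const metrics = createMetricBuffer(async (batch) => {
  sent.push(batch);
});
const handler = withMetrics(async (event, m) => {
  m.record("yubls-served", 8);
  return { statusCode: 200 };
}, metrics);

handler({}).then((response) => console.log(response, sent));
```

Flushing inside the invocation avoids losing data that would otherwise sit in memory between invocations, but the time spent sending is added to user-facing latency. The slides describe writing metrics to the logs as a way to avoid that overhead.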

Published in: Technology

  1. 1. The present and FUTURE of Serverless OBSERVABILITY
  2. 2. hi, my name is Yan.
  4. 4. hi, my name is Yan. AWS user since 2009
  5. 5. http://bit.ly/yubl-serverless
  6. 6. http://bit.ly/2Cdsai5
  7. 7. 2017 observability
  8. 8. http://bit.ly/2EXQZBj
  9. 9. http://bit.ly/2EXKEFZ
  10. 10. mm… I wonder what’s going on here…
  11. 11. what is observability? how is it different from monitoring?
  12. 12. Monitoring watching out for known failure modes in the system, e.g. network I/O, CPU, memory usage, …
  13. 13. Observability being able to debug the system, and gain insights into the system’s behaviour
  14. 14. “However, I would argue that the health of the system no longer matters. We've entered an era where what matters is the health of each individual event, or each individual user's experience, or each shopping cart's experience (or other high cardinality dimensions). With distributed systems you don't care about the health of the system, you care about the health of the event or the slice.” - Charity Majors, http://bit.ly/2E2QngU
  16. 16. “These are the four pillars of the Observability Engineering team’s charter: • Monitoring • Alerting/visualization • Distributed systems tracing infrastructure • Log aggregation/analytics” - Observability Engineering at Twitter, http://bit.ly/2DnjyuW
  17. 17. Observability is useful even outside of incidents and outages
  18. 18. microservices death stars circa 2015
  19. 19. microservices death stars circa 2015 I got this!
  20. 20. new challenges
  22. 22. NO ACCESS to underlying OS
  23. 23. NOWHERE to install agents/daemons
  24. 24. •nowhere to install agents/daemons new challenges
  25. 25. user request user request user request user request user request user request user request critical paths: minimise user-facing latency handler handler handler handler handler handler handler
  26. 26. user request user request user request user request user request user request user request critical paths: minimise user-facing latency StatsD handler handler handler handler handler handler handler rsyslog background processing: batched, asynchronous, low overhead
  27. 27. user request user request user request user request user request user request user request critical paths: minimise user-facing latency StatsD handler handler handler handler handler handler handler rsyslog background processing: batched, asynchronous, low overhead NO background processing except what platform provides
  28. 28. •no background processing •nowhere to install agents/daemons new challenges
  29. 29. EC2 concurrency used to be handled by your code
  30. 30. EC2 Lambda Lambda Lambda Lambda Lambda now, it’s handled by the AWS Lambda platform
  31. 31. EC2 logs & metrics used to be batched here
  32. 32. EC2 Lambda Lambda Lambda Lambda Lambda now, they are batched in each concurrent execution, at best…
  33. 33. HIGHER concurrency to log aggregation/telemetry system
  34. 34. •higher concurrency to telemetry system •nowhere to install agents/daemons •no background processing new challenges
  35. 35. Lambda cold start
  36. 36. Lambda data is batched between invocations
  37. 37. Lambda idle data is batched between invocations
  38. 38. Lambda idle garbage collection data is batched between invocations
  39. 39. Lambda idle garbage collection data is batched between invocations HIGH chance of data loss
  40. 40. •high chance of data loss (if batching) •nowhere to install agents/daemons •no background processing •higher concurrency to telemetry system new challenges
  41. 41. Lambda
  42. 42. my code send metrics
  44. 44. my code send metrics internet internet press button something happens
  45. 45. http://bit.ly/2Dpidje
  46. 46. ? functions are often chained together via asynchronous invocations
  47. 47. ? SNS Kinesis CloudWatch Events CloudWatch Logs IoT DynamoDB S3 SES
  48. 48. ? SNS Kinesis CloudWatch Events CloudWatch Logs IoT DynamoDB S3 SES tracing ASYNCHRONOUS invocations through so many different event sources is difficult
  49. 49. •asynchronous invocations •nowhere to install agents/daemons •no background processing •higher concurrency to telemetry system •high chance of data loss (if batching) new challenges
  50. 50. the Present
  51. 51. “These are the four pillars of the Observability Engineering team’s charter: • Monitoring • Alerting/visualization • Distributed systems tracing infrastructure • Log aggregation/analytics” - Observability Engineering at Twitter, http://bit.ly/2DnjyuW
  52. 52. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?
  53. 53. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now? UTC Timestamp Request Id your log message
  54. 54. one log group per function one log stream for each concurrent invocation
  55. 55. logs are not easily searchable in CloudWatch Logs me
  56. 56. CloudWatch Logs
  57. 57. CloudWatch Logs AWS Lambda ELK stack
  58. 58.
  59. 59. CloudWatch Logs
  60. 60. •no background processing •nowhere to install agents/daemons new challenges
  61. 61. my code send metrics internet internet press button something happens
  62. 62. those extra 10-20ms for sending custom metrics would compound when you have microservices and multiple APIs are called within one slice of user event
  63. 63. Amazon found every 100ms of latency cost them 1% in sales. http://bit.ly/2EXPfbA
  64. 64. console.log("hydrating yubls from db…"); console.log("fetching user info from user-api"); console.log("MONITORING|1489795335|27.4|latency|user-api-latency"); console.log("MONITORING|1489795335|8|count|yubls-served"); The first two lines are logs; the MONITORING lines are metrics, with fields: timestamp | metric value | metric type | metric name
  65. 65. CloudWatch Logs AWS Lambda ELK stack logs metrics CloudWatch
  66. 66. delay cost concurrency
  67. 67. delay cost concurrency no latency overhead
  68. 68. API Gateway send custom metrics asynchronously
  69. 69. SNS Kinesis S3 API Gateway … send custom metrics asynchronously send custom metrics as part of function invocation
  70. 70. X-Ray
  71. 71. do not span over API Gateway
  72. 72. narrow focus on a function good for homing in on performance issues for a particular function, but offers little to help you build intuition about how your system operates as a whole.
  73. 73. “However, I would argue that the health of the system no longer matters. We've entered an era where what matters is the health of each individual event, or each individual user's experience, or each shopping cart's experience (or other high cardinality dimensions). With distributed systems you don't care about the health of the system, you care about the health of the event or the slice.” - Charity Majors, http://bit.ly/2E2QngU
  74. 74. follow the data
  75. 75. don’t span over async invocations good for identifying dependencies of a function, but not good enough for tracing the entire call chain as user request/data flows through the system via async event sources.
  76. 76. don’t span over non-AWS services
  77. 77. static view
  78. 78. our tools need to do more to help us with understanding & debugging our distributed system, not just what happens inside one function
  79. 79. “one user action/vertical slice through the system”
  80. 80. microservices death stars circa 2015
  81. 81. microservices death stars circa 2015 HELP…
  82. 82. WARNING: this is part fiction, part inspired by new tools
  83. 83. DASHBOARDS
  84. 84. different dimensions of X splattered across the screen
  85. 85. + cold starts + throttled invocations + concurrent executions + estimated cost ($)
  86. 86. [dashboard widget for function SubscriberGetAccount] Circle colour and size represent health and traffic volume. Chart: 2 minutes of request rate to show relative changes in traffic. Req Rate: 20,056.0/s. Est Cost: $54.0/s. Concurrency (no. of concurrent executions of this function): 370. Error percentage of last 10 seconds: 0 %. Cold start percentage of last 10 seconds: 0 %. Last minute latency percentiles: Median 1ms, Mean 4ms, 90th 10ms, 99th 44ms, 99.5th 61ms. Rolling 10 second counters with 1 second granularity: Successes 200,545; Cold starts 0; Timeouts 19; Throttled Invocations 94; Errors 0.
  91. 91. birds-eye view of our system as it lives and breathes
  92. 92. [trace diagram] Nodes: user, profile-images, POST /user, process-images, resize-images, image-tasks, Auth0, create-user, reformat-images, tag-user, Face API, create-auth0-user
  93. 93. [trace diagram from slide 92] trace async invocations
  94. 94. [trace diagram from slide 92] trace non-AWS resources
  95. 95. [trace diagram from slide 92, Logs tab] Logs (timestamp | component | level | message): 2018/01/25 20:51:23.188 | POST /user | debug | incoming request…; 2018/01/25 20:51:23.201 | create-user | debug | saving user [theburningmonk] in the [user] table…; 2018/01/25 20:51:23.215 | create-user | debug | saved user [theburningmonk] in the [user] table; 2018/01/25 20:51:23.521 | tag-user | debug | tagging user [theburningmonk] with Azure Face API…
  97. 97. [trace diagram from slide 92, Logs tab, POST /user selected] Logs (timestamp | component | level | message): 2018/01/25 20:51:23.188 | POST /user | debug | incoming request…; details: request-id 0ae4ba5d-dab1-4f9e-9de7-eace27ebfbc2, start-time 2018/01/25 20:51:23.188, method POST
  98. 98. [trace diagram from slide 92, Logs tab, create-user selected] Logs (timestamp | component | level | message): 2018/01/25 20:51:23.201 | create-user | debug | saving user [theburningmonk] in the [user] table…; 2018/01/25 20:51:23.215 | create-user | debug | saved user [theburningmonk] in the [user] table; 2018/01/25 20:51:23.585 | create-user | debug | uploading profile image…; 2018/01/25 20:51:23.587 | create-user | debug | tagged user [theburningmonk] with Azure Face API…
  99. 99. [same view as slide 98] click here to go to code
  100. 100. [trace diagram from slide 92, Input/Output tab] input: { "body": "{ \"username\":\"theburningmonk\"}", "resource": "/user", "requestContext": { "resourceId": "123456", "apiId": "1234567890", "resourcePath": "/user", … output: { "statusCode": 200 }
  101. 101. [trace diagram from slide 92, Input/Output tab] input: { "Records": [ { "Sns": { "Type": "Notification", "MessageId": "…", "TopicArn": "…", "Message": "…", "Timestamp": "2018/01/25 20:51:24.215", … output: { "error": null, "result": "OK" }
  102. 102. [trace diagram from slide 92, Input/Output tab] input: { "Records": [ { "Sns": { "Type": "Notification", "MessageId": "…", "TopicArn": "…", "Message": "…", "Timestamp": "2018/01/25 20:51:24.215", … error: [com.spaceape.dragon.handler.ReformatProfileImageHandler] Null reference exception *java.lang.NullReferenceException: … * at … * at … * at …
  103. 103. [same view as slide 102, with an error marker (!) on the diagram]
  104. 104. [trace timeline, axis 0 to 800 ms] Spans: All, create-user, …user.insert_user, …user.upload_img, tag-user, create-auto0-user, process-images, resize-images, reformat-images (!). Durations shown: 837ms, 406ms, 66ms, 114ms, 122ms, 82ms, 240ms, 157ms, 35ms.
  106. 106. [trace diagram from slide 92 with Logs and Input/Output tabs and an error marker (!), shown together with the trace timeline from slide 104]
  108. 108. TRACING: all your needs in one place
  109. 109. mmm… it’s a graph
  110. 110. what if we can query it like a graph?
  111. 111. http://amzn.to/2nk7uiW
  112. 112. ability to query based on the relationship between observed components (as well as the components themselves)
  113. 113. root cause analysis
  114. 114. “the elevated error rate in service X was caused by DynamoDB table throttling.”
  115. 115. payment was slow last night around 10PM. investigate.
  116. 116. time 95-percentile latency service A service B 10PM
  117. 117. time 95-percentile latency service A service B 10PM causality? or correlation?
  118. 118. [dependency graph] Nodes: user-service, auth-service, payment-service, user-table; edges labelled USES and DEPENDS_ON. “payment was slow last night around 10PM”
  119. 119. [same dependency graph as slide 118] user-table: throttled exceptions!
  120. 120. [larger dependency graph around user-table and user-stream; edges labelled USES, DEPENDS_ON and PUBLISHES_TO] “what else is impacted by the throttled exceptions on user-table?”
  122. 122. wouldn’t that be nice?
  123. 123. MACHINE LEARNING
  124. 124. use ML to auto-detect erroneous or suspicious behaviours, or to suggest possible improvements
  125. 125. (!) Function [X] just performed an unexpected write against DynamoDB table [Y]. Should I… [ignore it from now on] [shut it down!!]
  126. 126. Observability Bot <bot@bestobservability.com>
  127. 127. Observability Bot <bot@bestobservability.com> don’t bother me about this again
  128. 128. Observability Bot <bot@bestobservability.com> auto-modify IAM role with DENY rule
  129. 129. (!) Function [X]’s performance has degraded since yesterday: 99th percentile latency has gone up by 47%, from 100ms to 147ms.
  130. 130. (!) Function [X] can run faster & cheaper if you increase its memory allocation. Should I… [ignore it from now on] [update setting]
  131. 131. zzz… the future of… zzz … serverless observability… zzz
  132. 132. Simon Wardley
  133. 133. Simon Wardley context & movement
  134. 134. “However, I would argue that the health of the system no longer matters. We've entered an era where what matters is the health of each individual event, or each individual user's experience, or each shopping cart's experience (or other high cardinality dimensions). With distributed systems you don't care about the health of the system, you care about the health of the event or the slice.” - Charity Majors, http://bit.ly/2E2QngU
  135. 135. “one user action/vertical slice through the system”
  136. 136. movement context movement
  137. 137. The best way to predict the future is to invent it. - Alan Kay
  138. 138. Serkan Özal @serkan_ozal
  139. 139. Nitzan Shapira @nitzanshapira Ran Ribenzaft @ranrib
  140. 140. Adam Johnson @adjohn Erica Windisch @ewindisch
  141. 141. Charity Majors @mipsytipsy Cindy Sridharan @copyconstruct Erica Windisch @ewindisch Liz Fong-Jones @lizthegrey JBD @rakyll
  142. 142. API Gateway and Kinesis; Authentication & authorisation (IAM, Cognito); Testing; Running & Debugging functions locally; Log aggregation; Monitoring & Alerting; X-Ray; Correlation IDs; CI/CD; Performance and Cost optimisation; Error Handling; Configuration management; VPC; Security; Leading practices (API Gateway, Kinesis, Lambda); Canary deployments. http://bit.ly/production-ready-serverless get 40% off with code: ytcui
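
As a rough illustration of the log-based metrics idea in slides 53 and 64, the sketch below builds a metric line in the `MONITORING|timestamp|value|type|name` layout and parses it back out of a log line of the form "UTC timestamp, request ID, message". The function names are made up for this example, and it assumes the three log fields are separated by whitespace as they appear on the slide; a real log shipping function would need to match the actual format it receives.

```javascript
// Sketch only: field layout taken from slides 53 and 64; function names are made up.

// Build a metric line in the MONITORING|timestamp|value|type|name layout.
function formatMetric(name, value, type, epochSeconds) {
  return ["MONITORING", epochSeconds, value, type, name].join("|");
}

// Split a log line into UTC timestamp, request ID and message.
// Assumes the three fields are separated by whitespace, as shown on slide 53.
function parseLogLine(line) {
  const match = /^(\S+)\s+(\S+)\s+(.*)$/.exec(line);
  if (!match) return null;
  const [, timestamp, requestId, message] = match;
  return { timestamp, requestId, message };
}

// Turn a MONITORING message into a metric object; return null for ordinary logs.
function parseMetric(message) {
  const parts = message.split("|");
  if (parts.length !== 5 || parts[0] !== "MONITORING") return null;
  const [, timestamp, value, type, name] = parts;
  return { timestamp: Number(timestamp), value: Number(value), type, name };
}

const line =
  "2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae " +
  formatMetric("user-api-latency", 27.4, "latency", 1489795335);
const entry = parseLogLine(line);
const metric = parseMetric(entry.message);
console.log(metric);
```

Because the function only writes to stdout, publishing the metric adds no network call to the invocation; the parsing would run later, in whatever processes the logs.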
