How to build observability into a serverless application

Serverless introduces a number of challenges to existing tools for observability, so we need to adapt our practices to fit this new paradigm. In this talk, we will discuss how we can build observability into a serverless application. We will see how you can implement log aggregation, distributed tracing and correlation IDs through both synchronous and asynchronous events.

  1. How to build observability into a serverless application
  2. What do I mean by “observability”?
  3. Monitoring: watching out for known failure modes in the system, e.g. network I/O, CPU, memory usage, …
  4. Observability: being able to debug the system, and gain insights into the system’s behaviour
  5. “In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” https://en.wikipedia.org/wiki/Observability
  6. “In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” https://en.wikipedia.org/wiki/Observability (including non-functional outputs)
  7. “These are the four pillars of the Observability Engineering team’s charter: • Monitoring • Alerting/Visualization • Distributed systems tracing infrastructure • Log aggregation/analytics” (Observability Engineering at Twitter, http://bit.ly/2DnjyuW)
  8. microservices death stars circa 2015
  9. microservices death stars circa 2015: “mm… I wonder what’s going on here…”
  10. microservices death stars circa 2015: “I got this!”
  11. Yan Cui, http://theburningmonk.com, @theburningmonk: Independent Consultant
  12. Yan Cui, http://theburningmonk.com, @theburningmonk: Developer Advocate @
  13. Yan Cui, http://theburningmonk.com, @theburningmonk: AWS user since 2009
  14. Yan Cui, http://theburningmonk.com, @theburningmonk: AWS user since 2009
  15. new challenges
  16. NO ACCESS to underlying OS
  17. NOWHERE to install agents/daemons
  18. new challenges: • nowhere to install agents/daemons
  19. user request, handler (×7); critical paths: minimise user-facing latency
  20. user request, handler (×7); critical paths: minimise user-facing latency; StatsD, rsyslog; background processing: batched, asynchronous, low overhead
  21. user request, handler (×7); critical paths: minimise user-facing latency; StatsD, rsyslog; background processing: batched, asynchronous, low overhead; NO background processing except what platform provides
  22. new challenges: • no background processing • nowhere to install agents/daemons
  23. EC2: concurrency used to be handled by your code
  24. EC2, Lambda (×5): now, it’s handled by the AWS Lambda platform
  25. EC2: logs & metrics used to be batched here
  26. EC2, Lambda (×5): now, they are batched in each concurrent execution, at best…
  27. HIGHER concurrency to log aggregation/telemetry system
  28. new challenges: • higher concurrency to telemetry system • nowhere to install agents/daemons • no background processing
  29. Lambda: cold start
  30. Lambda: data is batched between invocations
  31. Lambda: idle; data is batched between invocations
  32. Lambda: idle, garbage collection; data is batched between invocations
  33. Lambda: idle, garbage collection; data is batched between invocations; HIGH chance of data loss
  34. new challenges: • high chance of data loss (if batching) • nowhere to install agents/daemons • no background processing • higher concurrency to telemetry system
  35. Lambda
  36. my code, send metrics
  37. my code, send metrics
  38. my code, send metrics; internet, internet; press button, something happens
  39. http://bit.ly/2Dpidje
  40. ? functions are often chained together via asynchronous invocations
  41. ? SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES
  42. ? SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES: tracing ASYNCHRONOUS invocations through so many different event sources is difficult
  43. new challenges: • asynchronous invocations • nowhere to install agents/daemons • no background processing • higher concurrency to telemetry system • high chance of data loss (if batching)
  44. “These are the four pillars of the Observability Engineering team’s charter: • Monitoring • Alerting/Visualization • Distributed systems tracing infrastructure • Log aggregation/analytics” (Observability Engineering at Twitter, http://bit.ly/2DnjyuW)
  45. LOGGING
  46. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?
  47. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now? (UTC Timestamp, Request Id, your log message)
  48. one log group per function; one log stream for each concurrent invocation
  49. logs are not easily searchable in CloudWatch Logs
  50. CloudWatch Logs
  51. CloudWatch Logs is an async event source for Lambda
  52. Concurrent Executions over Time: regional max concurrency; functions that are delivering business value
  53. Concurrent Executions over Time: regional max concurrency; functions that are delivering business value; ship logs
  54. either set concurrency limit on the log shipping function (and potentially lose logs due to throttling) or…
  55. 1 shard = 1 concurrent execution, i.e. control the no. of concurrent executions with no. of shards
  56. …
  57. CloudWatch Logs
  58. https://amzn.to/2DnREgn
  59. https://amzn.to/2uZYmEw
  60. use structured logging with JSON
  61. https://stackify.com/what-is-structured-logging-and-why-developers-need-it/ https://blog.treasuredata.com/blog/2012/04/26/log-everything-as-json/
  62. https://www.loggly.com/blog/8-handy-tips-consider-logging-json/
  63. traditional loggers are too heavy for Lambda
  64. https://github.com/getndazn/dazn-lambda-powertools
  65. Writing lots more data to CloudWatch Logs
  66. CloudWatch Logs: $0.50 per GB ingested, $0.03 per GB archived per month
  67. CloudWatch Logs: $0.50 per GB ingested, $0.03 per GB archived per month; 1M invocations of a 128MB function = $0.000000208 * 1M + $0.20 = $0.408
  68. DON’T leave debug logging ON in production
  69. have to redeploy ALL the functions along the call path to collect all relevant debug logs
  70. EC2, Lambda (×5): concurrency is handled by the AWS Lambda platform
  71. sampling decision has to be followed by an entire call chain
  72. Initial Request ID, User ID, Session ID, User-Agent, Order ID, …
  73. EC2, Lambda (×5): concurrency is handled by the AWS Lambda platform
  74. store correlation IDs in global variable
  75. use middleware to auto-capture incoming correlation IDs
  76. extract correlation IDs from invocation event, and store them in the correlation-ids module; reset
  77. logger to always include captured correlation IDs
  78. HTTP and AWS SDK clients to auto-forward correlation IDs on
  79. get-index: context.awsRequestId
  80. get-index: context.awsRequestId, x-correlation-id
  81. get-index: { "headers": { "x-correlation-id": "…" }, … }
  82. get-index, get-restaurants: { "body": null, "resource": "/restaurants", "headers": { "x-correlation-id": "…" }, … }
  83. get-index, get-restaurants: global.CONTEXT (x-correlation-id = …, x-correlation-xxx = …); headers["User-Agent"], headers["Debug-Log-Enabled"], headers["x-correlation-id"]; capture, forward; function event; log.info(…)
  84. https://github.com/getndazn/dazn-lambda-powertools
  85. MONITORING
  86. new challenges: • no background processing • nowhere to install agents/daemons
  87. my code, send metrics; internet, internet; press button, something happens
  88. those extra 10-20ms for sending custom metrics would compound when you have microservices and multiple APIs are called within one slice of user event
  89. Amazon found every 100ms of latency cost them 1% in sales. http://bit.ly/2EXPfbA
  90. logs: console.log("hydrating yubls from db…"); console.log("fetching user info from user-api"); metrics: console.log("MONITORING|1489795335|27.4|latency|user-api-latency"); console.log("MONITORING|1489795335|8|count|yubls-served"); (fields: timestamp, metric value, metric type, metric name)
  91. CloudWatch Logs, AWS Lambda, ELK stack, CloudWatch; logs, metrics
  92. https://amzn.to/2YkjgOR
  93. delay, cost, concurrency
  94. delay, cost, concurrency; no latency overhead
  95. API Gateway: send custom metrics asynchronously
  96. SNS, Kinesis, S3, API Gateway, …: send custom metrics asynchronously; send custom metrics as part of function invocation
  97. TRACING
  98. X-Ray
  99. don’t span over async invocations: good for identifying dependencies of a function, but not good enough for tracing the entire call chain as user request/data flows through the system via async event sources.
  100. don’t span over non-AWS services
  101. write structured logs
  102. instrument your code
  103. make it easy to do the right thing
  104. Yan Cui, http://theburningmonk.com, @theburningmonk
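
Slide 51 notes that CloudWatch Logs can act as an event source for Lambda, which is what makes a log-shipping function possible. Below is a minimal sketch of the decoding step in such a function, assuming the subscription payload arrives base64-encoded and gzip-compressed under `event.awslogs.data`. The names `decodeLogsEvent`, `toShippableRecords` and `handler` are hypothetical, and the actual forwarding to a log aggregation service is left out.

```javascript
const zlib = require("zlib");

// Decode the payload CloudWatch Logs delivers to a subscribed function:
// base64-encoded, gzip-compressed JSON under event.awslogs.data.
function decodeLogsEvent(event) {
  const compressed = Buffer.from(event.awslogs.data, "base64");
  return JSON.parse(zlib.gunzipSync(compressed).toString("utf8"));
}

// Hypothetical helper: flatten the payload into one record per log line,
// tagged with the log group and log stream it came from.
function toShippableRecords(payload) {
  return payload.logEvents.map(({ timestamp, message }) => ({
    logGroup: payload.logGroup,
    logStream: payload.logStream,
    timestamp,
    message: message.trim(),
  }));
}

// Handler sketch: a real shipper would forward the records to its log
// aggregation service here instead of returning them.
async function handler(event) {
  return toShippableRecords(decodeLogsEvent(event));
}
```

Because this function is itself invoked by Lambda, the concurrency concerns on slides 52 to 55 apply to it as well.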
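
The arithmetic on slides 66 and 67 can be written out as follows. The prices are the ones quoted on the slides and may have changed since. The function names, the single 100ms billing unit per invocation, and the 1 GB = 10^9 bytes conversion are assumptions made for illustration.

```javascript
// Prices as quoted on the slides (USD); treat them as illustrative.
const LAMBDA_PRICE_PER_100MS_AT_128MB = 0.000000208;
const LAMBDA_PRICE_PER_MILLION_REQUESTS = 0.2;
const LOGS_PRICE_PER_GB_INGESTED = 0.5;

// Duration charge plus request charge for a 128MB function.
// Assumption: each invocation is billed for `billed100msUnits` units of 100ms.
function lambdaCost(invocations, billed100msUnits = 1) {
  const duration = invocations * billed100msUnits * LAMBDA_PRICE_PER_100MS_AT_128MB;
  const requests = (invocations / 1e6) * LAMBDA_PRICE_PER_MILLION_REQUESTS;
  return duration + requests;
}

// Ingestion charge for the logs those invocations write.
// Assumption: 1 GB is taken as 10^9 bytes.
function logIngestionCost(invocations, bytesLoggedPerInvocation) {
  const gigabytes = (invocations * bytesLoggedPerInvocation) / 1e9;
  return gigabytes * LOGS_PRICE_PER_GB_INGESTED;
}

console.log(lambdaCost(1e6).toFixed(3)); // 0.408, the figure on slide 67
console.log(logIngestionCost(1e6, 1000).toFixed(3)); // 0.500
```

Under these assumptions, writing about 1 KB of logs per invocation already costs more to ingest than the invocations themselves, which is the concern behind slide 65.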
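
Slides 60, 68 to 71 and 77 together describe a logger that writes JSON, keeps debug logging off by default, and always includes the captured correlation IDs. The sketch below is one way to combine those ideas and is not the dazn-lambda-powertools implementation. The `correlationIds` store, the level values, the `LOG_LEVEL` environment variable and the 1% sample rate are all assumptions.

```javascript
// Hypothetical store for correlation IDs captured from the invocation event.
const correlationIds = {};

const LEVELS = { DEBUG: 20, INFO: 30, WARN: 40, ERROR: 50 };

// DEBUG is normally off in production, but is let through when the captured
// correlation IDs say debug logging was enabled for this call chain.
function isEnabled(level, minLevel = "INFO") {
  if (level === "DEBUG" && correlationIds["Debug-Log-Enabled"] === "true") {
    return true;
  }
  return LEVELS[level] >= (LEVELS[minLevel] ?? LEVELS.INFO);
}

// One JSON object per log line, always including the captured correlation IDs.
function formatLog(level, message, params = {}) {
  return JSON.stringify({ level, message, ...params, ...correlationIds });
}

function log(level, message, params) {
  // LOG_LEVEL is an assumed environment variable; unset means INFO.
  if (isEnabled(level, process.env.LOG_LEVEL)) {
    console.log(formatLog(level, message, params));
  }
}

// Sampling decision, made once at the edge of the call chain and then
// forwarded so that every function downstream follows it.
function shouldEnableDebug(sampleRate = 0.01, random = Math.random) {
  return random() < sampleRate;
}
```

Making the decision once and carrying it in the correlation IDs avoids redeploying every function along the call path just to collect debug logs.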
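
Slides 74 to 83 outline capturing correlation IDs from the invocation event, keeping them in a global, and forwarding them on outgoing calls. A minimal sketch of that flow for an API Gateway style event follows. `CONTEXT` stands in for the `global.CONTEXT` on slide 83, the function names are hypothetical, and falling back to `context.awsRequestId` when no `x-correlation-id` header arrives is an assumption based on slide 80.

```javascript
// Hypothetical stand-in for the global.CONTEXT shown on slide 83.
const CONTEXT = {};

function captureCorrelationIds(event, context) {
  // Reset first: the container, and therefore this global, is reused
  // across invocations.
  for (const key of Object.keys(CONTEXT)) delete CONTEXT[key];

  const headers = event.headers || {};
  for (const [name, value] of Object.entries(headers)) {
    const key = name.toLowerCase();
    if (key.startsWith("x-correlation-")) CONTEXT[key] = value;
  }

  // Assumption: start a new chain with this invocation's request ID when
  // the caller did not pass a correlation ID.
  if (!CONTEXT["x-correlation-id"]) {
    CONTEXT["x-correlation-id"] = context.awsRequestId;
  }

  // Headers slide 83 shows being carried along with the correlation IDs.
  for (const name of ["User-Agent", "Debug-Log-Enabled"]) {
    if (headers[name] !== undefined) CONTEXT[name] = headers[name];
  }
  return CONTEXT;
}

// Middleware sketch: capture before the wrapped handler runs.
const withCorrelationIds = (handler) => async (event, context) => {
  captureCorrelationIds(event, context);
  return handler(event, context);
};

// Merge the captured IDs into the headers of an outgoing HTTP request.
function forwardHeaders(headers = {}) {
  return { ...headers, ...CONTEXT };
}
```

Other event sources such as SNS or Kinesis have no HTTP headers, so the IDs would need to travel in message attributes or the record payload instead. That part is not shown here.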
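
Slide 90 writes custom metrics as specially formatted log lines so that they can be turned into metrics after the invocation, at no latency cost to the caller. A sketch of the parsing side, assuming the pipe-delimited layout on that slide (timestamp, metric value, metric type, metric name), could look like this. `parseMetric` and `extractMetrics` are hypothetical names, and publishing the parsed values to a metrics service is left out.

```javascript
// Parse one log line of the form
//   MONITORING|<timestamp>|<value>|<type>|<name>
// Returns null for ordinary log lines. Anything before the MONITORING
// marker (e.g. a timestamp and request ID prefix) is ignored.
function parseMetric(line) {
  const start = line.indexOf("MONITORING|");
  if (start === -1) return null;

  const parts = line.slice(start).trim().split("|");
  if (parts.length !== 5) return null;

  const [, timestamp, value, type, name] = parts;
  const parsed = { timestamp: Number(timestamp), value: Number(value), type, name };
  return Number.isNaN(parsed.timestamp) || Number.isNaN(parsed.value) ? null : parsed;
}

// Keep only the lines that carry metrics.
function extractMetrics(lines) {
  return lines.map(parseMetric).filter((metric) => metric !== null);
}
```

A function subscribed to the log group could run something like `extractMetrics` over each batch and publish the results, while the remaining lines are shipped as ordinary logs.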