How to build observability into a serverless application

Serverless introduces a number of challenges to existing tools for observability, so we need to adapt our practices to fit this new paradigm. In this talk we will discuss how we can build observability into a serverless application. We will see how you can implement log aggregation, distributed tracing and correlation IDs through both synchronous and asynchronous events.

  1. How to build OBSERVABILITY into a Serverless application
  2-4. Abraham Wald
  5-6. Abraham Wald: Wald noted that the study only considered the aircraft that had survived their missions; the bombers that had been shot down were not present for the damage assessment. The holes in the returning aircraft, then, represented areas where a bomber could take damage and still return home safely.
  7. Survivor bias in monitoring
  8. Survivor bias in monitoring: we only focus on failure modes that we were able to identify through investigation and postmortems in the past. The bullet holes that shot us down but that we couldn’t identify stay invisible, and will continue to shoot us down.
  9. What do I mean by “observability”?
  10. Monitoring: watching out for known failure modes in the system, e.g. network I/O, CPU, memory usage, …
  11. Observability: being able to debug the system, and gain insights into the system’s behaviour
  12. In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. https://en.wikipedia.org/wiki/Observability
  13. Known Success
  14. Known Success | Known Errors
  15. Known Success | Known Errors (annotation: easy to monitor!)
  16. Known Success | Known Errors | Known Unknowns
  17. Known Success | Known Errors | Known Unknowns | Unknown Unknowns
  18. Known Success | Known Errors | Known Unknowns | Unknown Unknowns (annotation: invisible bullet holes)
  19. Known Success | Known Errors | Known Unknowns | Unknown Unknowns
  20. Known Success | Known Errors | Known Unknowns | Unknown Unknowns (annotation: only alert on this)
  21. Known Success | Known Errors | Known Unknowns | Unknown Unknowns (annotation: alert on the absence of this!)
  22. Known Success | Known Errors | Known Unknowns | Unknown Unknowns (annotation: what went wrong?)
  23. “These are the four pillars of the Observability Engineering team’s charter: • Monitoring • Alerting/Visualization • Distributed systems tracing infrastructure • Log aggregation/analytics” (Observability Engineering at Twitter, http://bit.ly/2DnjyuW)
  24. Microservices death stars circa 2015
  25. Microservices death stars circa 2015: mm… I wonder what’s going on here…
  26. Microservices death stars circa 2015: I got this!
  27. Yan Cui, http://theburningmonk.com, @theburningmonk. Principal Engineer @ [logo]; Independent Consultant
  28. Available in Austria, Switzerland, Germany, Japan, Canada, Italy, US and Spain
  29. Available on 30+ platforms
  30. ~1,000,000 concurrent viewers
  31. Follow @dazneng for updates about the engineering team. We’re hiring! Visit engineering.dazn.com to learn more.
  32-33. AWS user since 2009
  34. New challenges
  35. NO ACCESS to underlying OS
  36. NOWHERE to install agents/daemons
  37. New challenges: • nowhere to install agents/daemons
  38. user request (×7), handler (×7). Critical paths: minimise user-facing latency
  39. user request (×7), handler (×7). Critical paths: minimise user-facing latency. StatsD, rsyslog. Background processing: batched, asynchronous, low overhead
  40. user request (×7), handler (×7). Critical paths: minimise user-facing latency. StatsD, rsyslog. Background processing: batched, asynchronous, low overhead. NO background processing except what the platform provides
  41. New challenges: • no background processing • nowhere to install agents/daemons
  42. EC2: concurrency used to be handled by your code
  43. EC2, Lambda (×5): now, it’s handled by the AWS Lambda platform
  44. EC2: logs & metrics used to be batched here
  45. EC2, Lambda (×5): now, they are batched in each concurrent execution, at best…
  46. HIGHER concurrency to log aggregation/telemetry system
  47. New challenges: • higher concurrency to telemetry system • nowhere to install agents/daemons • no background processing
  48. Lambda: cold start
  49. Lambda: data is batched between invocations
  50. Lambda: idle; data is batched between invocations
  51. Lambda: idle, garbage collection; data is batched between invocations
  52. Lambda: idle, garbage collection; data is batched between invocations. HIGH chance of data loss
  53. New challenges: • high chance of data loss (if batching) • nowhere to install agents/daemons • no background processing • higher concurrency to telemetry system
  54. Lambda
  55-56. my code, send metrics
  57. my code, send metrics; internet (×2); press button, something happens
  58. http://bit.ly/2Dpidje
  59. Functions are often chained together via asynchronous invocations
  60. Event sources: SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES
  61. Event sources: SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES. Tracing ASYNCHRONOUS invocations through so many different event sources is difficult
  62. New challenges: • asynchronous invocations • nowhere to install agents/daemons • no background processing • higher concurrency to telemetry system • high chance of data loss (if batching)
  63. “These are the four pillars of the Observability Engineering team’s charter: • Monitoring • Alerting/Visualization • Distributed systems tracing infrastructure • Log aggregation/analytics” (Observability Engineering at Twitter, http://bit.ly/2DnjyuW)
  64. LOGGING
  65. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?
  66. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now? (fields: UTC timestamp, request ID, your log message)
  67. One log group per function; one log stream for each concurrent invocation
  68. Logs are not easily searchable in CloudWatch Logs [image caption: me]
  69. CloudWatch Logs
  70. CloudWatch Logs is an async event source for Lambda
  71. Chart of concurrent executions over time, against the regional max concurrency: functions that are delivering business value
  72. Chart of concurrent executions over time, against the regional max concurrency: functions that are delivering business value; ship logs
  73. Either set a concurrency limit on the log shipping function (and potentially lose logs due to throttling), or…
  74. 1 shard = 1 concurrent execution, i.e. control the number of concurrent executions with the number of shards
  75. [no text]
  76-77. CloudWatch Logs
  78. Use structured logging with JSON
  79. https://stackify.com/what-is-structured-logging-and-why-developers-need-it/ https://blog.treasuredata.com/blog/2012/04/26/log-everything-as-json/
  80. https://www.loggly.com/blog/8-handy-tips-consider-logging-json/
  81. Traditional loggers are too heavy for Lambda
  82. CloudWatch Logs: $0.50 per GB ingested; $0.03 per GB archived per month
  83. CloudWatch Logs: $0.50 per GB ingested; $0.03 per GB archived per month. 1M invocations of a 128MB function = $0.000000208 * 1M + $0.20 = $0.408
  84. DON’T leave debug logging ON in production
  85. Have to redeploy ALL the functions along the call path to collect all relevant debug logs
  86. EC2, Lambda (×5): concurrency is handled by the AWS Lambda platform
  87. Sampling decision has to be followed by the entire call chain
  88. Initial Request ID, User ID, Session ID, User-Agent, Order ID, …
  89. Nonintrusive, extensible, consistent, works for streams
  90. EC2, Lambda (×5): concurrency is handled by the AWS Lambda platform
  91. Store correlation IDs in a global variable
  92. Use middleware to auto-capture incoming correlation IDs
  93. Extract correlation IDs from the invocation event, and store them in the correlation-ids module; reset
  94. Logger to always include captured correlation IDs
  95. HTTP and AWS SDK clients to auto-forward correlation IDs on
  96. get-index: context.awsRequestId
  97. get-index: context.awsRequestId, x-correlation-id
  98. get-index: { "headers": { "x-correlation-id": "…" }, … }
  99. get-index, get-restaurants: { "body": null, "resource": "/restaurants", "headers": { "x-correlation-id": "…" }, … }
  100. Diagram labels: get-index, get-restaurants; function event; headers["User-Agent"], headers["Debug-Log-Enabled"], headers["x-correlation-id"]; capture; global.CONTEXT (x-correlation-id = …, x-correlation-xxx = …); log.info(…); forward
  101. Nonintrusive, extensible, consistent, works for streams
  102. MONITORING
  103. New challenges: • no background processing • nowhere to install agents/daemons
  104. my code, send metrics; internet (×2); press button, something happens
  105. Those extra 10-20ms for sending custom metrics would compound when you have microservices and multiple APIs are called within one slice of user event
  106. Amazon found every 100ms of latency cost them 1% in sales. http://bit.ly/2EXPfbA
  107. console.log("hydrating yubls from db…"); console.log("fetching user info from user-api"); console.log("MONITORING|1489795335|27.4|latency|user-api-latency"); console.log("MONITORING|1489795335|8|count|yubls-served"); (MONITORING fields: timestamp, metric value, metric type, metric name; the first two lines are logs, the MONITORING lines are metrics)
  108. CloudWatch Logs, AWS Lambda: logs go to the ELK stack, metrics go to CloudWatch
  109. Delay, cost, concurrency
  110. Delay, cost, concurrency; no latency overhead
  111. API Gateway: send custom metrics asynchronously
  112. API Gateway: send custom metrics asynchronously. SNS, Kinesis, S3, …: send custom metrics as part of function invocation
  113. TRACING
  114. X-Ray
  115. Traces don’t span over async invocations: good for identifying dependencies of a function, but not good enough for tracing the entire call chain as a user request/data flows through the system via async event sources.
  116. Traces don’t span over non-AWS services
  117. Write structured logs
  118. Instrument your code
  119. Make it easy to do the right thing
  120. Yan Cui, http://theburningmonk.com, @theburningmonk
  121. Follow @dazneng for updates about the engineering team. We’re hiring! Visit engineering.dazn.com to learn more.
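
The cost arithmetic on slides 82-83 can be reproduced with a short sketch. The prices are the ones quoted on the slides and may no longer be current. Two things here are assumptions rather than anything the deck states: each invocation is billed as a single 100ms unit, and each invocation writes 1,000 bytes of logs. The function names are made up for illustration.

```javascript
// Prices as quoted on slides 82-83; treat them as illustrative, not current.
const LAMBDA_PRICE_PER_100MS_AT_128MB = 0.000000208; // USD
const LAMBDA_PRICE_PER_MILLION_REQUESTS = 0.2; // USD
const LOGS_PRICE_PER_GB_INGESTED = 0.5; // USD

// Assumption: every invocation is billed exactly one 100ms unit.
function lambdaCost(invocations) {
  const compute = invocations * LAMBDA_PRICE_PER_100MS_AT_128MB;
  const requests = (invocations / 1e6) * LAMBDA_PRICE_PER_MILLION_REQUESTS;
  return compute + requests;
}

// Assumption: 1 GB = 1e9 bytes, and every invocation logs the same amount.
function logIngestionCost(invocations, bytesPerInvocation) {
  const gigabytes = (invocations * bytesPerInvocation) / 1e9;
  return gigabytes * LOGS_PRICE_PER_GB_INGESTED;
}

const invocations = 1e6;
console.log(lambdaCost(invocations).toFixed(3)); // 0.408
console.log(logIngestionCost(invocations, 1000).toFixed(3)); // 0.500
```

Under these assumptions, ingesting the logs costs more than running the function, which is one way to read the advice on slide 84 about debug logging in production.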

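Slides 91-100 outline a correlation ID scheme: capture the IDs from the invocation event, keep them in a global variable, reset them on every invocation, and have the logger include them automatically. The following is a minimal sketch of that idea for API Gateway-style events only. The names `captureCorrelationIds`, `formatLog`, `log` and `withCorrelationIds` are hypothetical, and the choice to treat `Debug-Log-Enabled: true` as the debug switch is an assumption based on the header names on slide 100.

```javascript
// Module-level store. Containers are reused between invocations, so it is
// reset on every capture to stop IDs leaking from one request to the next.
let correlationIds = {};

// Capture correlation IDs from the headers of an API Gateway-style event.
// Header names are matched case-insensitively.
function captureCorrelationIds(event, context) {
  correlationIds = {};
  const headers = (event && event.headers) || {};
  for (const [name, value] of Object.entries(headers)) {
    const key = name.toLowerCase();
    if (key.startsWith("x-correlation-")) {
      correlationIds[key] = value;
    } else if (key === "user-agent") {
      correlationIds["User-Agent"] = value;
    } else if (key === "debug-log-enabled") {
      correlationIds["Debug-Log-Enabled"] = value;
    }
  }
  // First function in the call chain: fall back to the Lambda request ID.
  if (!correlationIds["x-correlation-id"]) {
    correlationIds["x-correlation-id"] = context.awsRequestId;
  }
  return correlationIds;
}

// Build one JSON log line that always includes the captured IDs.
function formatLog(level, message, params = {}) {
  return JSON.stringify({ ...correlationIds, ...params, level, message });
}

const log = {
  info: (message, params) => console.log(formatLog("INFO", message, params)),
  // Debug lines are only written when the caller asked for them, so the
  // sampling decision made upstream is followed by this function too.
  debug: (message, params) => {
    if (correlationIds["Debug-Log-Enabled"] === "true") {
      console.log(formatLog("DEBUG", message, params));
    }
  },
};

// Middleware: capture IDs before the wrapped handler runs.
function withCorrelationIds(handler) {
  return async (event, context) => {
    captureCorrelationIds(event, context);
    return handler(event, context);
  };
}
```

A handler wrapped with `withCorrelationIds` can then call `log.info("…")` without passing any IDs around. Forwarding the IDs through HTTP and AWS SDK clients (slide 95) and capturing them from stream events are left out of this sketch.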

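Slide 107 shows custom metrics written as specially formatted log lines, so that a log-processing function can pick them up later instead of the handler paying the latency of a metrics API call. A rough sketch of both sides follows, using the `MONITORING|timestamp|value|type|name` layout from the slide. The function names are hypothetical, and what happens to the parsed metrics (for example, publishing them to CloudWatch) is deliberately left out.

```javascript
// Producer side: format a metric as a single log line.
// Layout from slide 107: MONITORING|<epoch seconds>|<value>|<type>|<name>
function formatMetric(name, type, value, epochSeconds) {
  return ["MONITORING", epochSeconds, value, type, name].join("|");
}

// Consumer side: parse one log line, returning null for ordinary logs
// and for malformed metric lines.
function parseMetric(line) {
  const parts = line.split("|");
  if (parts.length !== 5 || parts[0] !== "MONITORING") return null;
  const [, rawTimestamp, rawValue, type, name] = parts;
  if (rawTimestamp.trim() === "" || rawValue.trim() === "") return null;
  const timestamp = Number(rawTimestamp);
  const value = Number(rawValue);
  if (Number.isNaN(timestamp) || Number.isNaN(value)) return null;
  return { timestamp, value, type, name };
}

// Split a batch of log lines into metrics and plain logs.
function splitLogLines(lines) {
  const metrics = [];
  const logs = [];
  for (const line of lines) {
    const metric = parseMetric(line);
    if (metric) {
      metrics.push(metric);
    } else {
      logs.push(line);
    }
  }
  return { metrics, logs };
}

// In the handler, a metric is just another console.log call.
console.log(formatMetric("user-api-latency", "latency", 27.4, 1489795335));
```

The handler only pays for a `console.log`; the trade-offs listed on slides 109-110 (delay, cost, concurrency) apply to the function that processes the logs afterwards.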