How to build observability into a serverless application
Serverless introduces a number of challenges for existing observability tools, so we need to adapt our practices to fit this new paradigm. In this talk we will discuss how to build observability into a serverless application, and see how you can implement log aggregation, distributed tracing and correlation IDs through both synchronous and asynchronous events.
5.
Abraham Wald
Wald noted that the study only considered the aircraft that had survived their missions—the bombers that had been shot down were not present for the damage assessment.
The holes in the returning aircraft, then, represented areas where a bomber could take damage and still return home safely.
8.
survivor bias in monitoring
We only focus on the failure modes that we were able to identify through investigation and postmortems in the past.
The bullet holes that shot us down and that we couldn't identify stay invisible, and will continue to shoot us down.
10.
Monitoring
watching out for known failure modes in the system, e.g. network I/O, CPU, memory usage, …
11.
Observability
being able to debug the system, and gain insights into the system's behaviour
12.
In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
https://en.wikipedia.org/wiki/Observability
17.
Known Success | Known Errors
Known Unknowns | Unknown Unknowns
18.
Known Success | Known Errors
Known Unknowns | Unknown Unknowns
invisible bullet holes
19.
Known Success | Known Errors
Known Unknowns | Unknown Unknowns
20.
Known Success | Known Errors
Known Unknowns | Unknown Unknowns
alert on this
21.
Known Success | Known Errors
Known Unknowns | Unknown Unknowns
alert on the absence of this!
22.
Known Success | Known Errors
Known Unknowns | Unknown Unknowns
what went wrong?
23.
“These are the four pillars of the Observability Engineering team’s charter:
• Monitoring
• Alerting/Visualization
• Distributed systems tracing infrastructure
• Log aggregation/analytics”
- Observability Engineering at Twitter, http://bit.ly/2DnjyuW
41.
new challenges
• nowhere to install agents/daemons
42.
[diagram: each user request is served by its own handler]
critical paths: minimise user-facing latency
43.
[diagram: handlers serve user requests on the critical path; StatsD and rsyslog agents ship telemetry in the background]
critical paths: minimise user-facing latency
background processing: batched, asynchronous, low overhead
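A minimal sketch of this traditional approach, assuming a StatsD daemon listening on localhost:8125 (the function names here are illustrative): the handler fires a UDP datagram at the local agent and carries on, so the telemetry is batched and forwarded off the critical path.

const dgram = require('dgram');
const socket = dgram.createSocket('udp4');

// record a timing metric by sending a fire-and-forget UDP datagram
// to the local StatsD agent, e.g. "user-api-latency:27.4|ms"
function recordLatency(name, ms) {
  const datagram = Buffer.from(`${name}:${ms}|ms`);
  socket.send(datagram, 8125, 'localhost', () => {});
}

// inside a request handler
recordLatency('user-api-latency', 27.4);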
44.
[diagram: the same handlers with StatsD and rsyslog agents]
critical paths: minimise user-facing latency
background processing: batched, asynchronous, low overhead
NO background processing except what platform provides
45.
new challenges
• no background processing
• nowhere to install agents/daemons
46.
EC2
concurrency used to be handled by your code
47.
[diagram: a single EC2 instance vs many concurrent Lambda invocations]
now, it's handled by the AWS Lambda platform
54.
[diagram: Lambda invocations separated by idle periods]
data is batched between invocations
55.
[diagram: Lambda invocations separated by idle periods; garbage collection during idle]
data is batched between invocations
56.
[diagram: Lambda invocations separated by idle periods; garbage collection during idle]
data is batched between invocations
HIGH chance of data loss
57.
new challenges
• high chance of data loss (if batching)
• nowhere to install agents/daemons
• no background processing
• higher concurrency to telemetry system
65.
[diagram: async event sources: SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES]
tracing ASYNCHRONOUS invocations through so many different event sources is difficult
66.
new challenges
• asynchronous invocations
• nowhere to install agents/daemons
• no background processing
• higher concurrency to telemetry system
• high chance of data loss (if batching)
67.
“These are the four pillars of the Observability Engineering team’s charter:
• Monitoring
• Alerting/Visualization
• Distributed systems tracing infrastructure
• Log aggregation/analytics”
- Observability Engineering at Twitter, http://bit.ly/2DnjyuW
107.
new challenges
• no background processing
• nowhere to install agents/daemons
108.
[diagram: press button, something happens; my code sends metrics over the internet]
109.
those extra 10-20ms for sending custom metrics would compound when you have microservices and multiple APIs are called within a single user event
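For contrast, a sketch of the approach the slide is warning about, assuming a Node.js Lambda handler and the aws-sdk v2 CloudWatch client (doWork is a hypothetical placeholder for the business logic): publishing the metric synchronously adds an HTTPS round trip to every invocation, on the user-facing critical path.

const AWS = require('aws-sdk');
const cloudWatch = new AWS.CloudWatch();

module.exports.handler = async (event) => {
  const start = Date.now();
  const response = await doWork(event); // hypothetical business logic

  // synchronous metric publish: an extra internet round trip before responding
  await cloudWatch.putMetricData({
    Namespace: 'my-app',
    MetricData: [{
      MetricName: 'user-api-latency',
      Unit: 'Milliseconds',
      Value: Date.now() - start
    }]
  }).promise();

  return response;
};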
110.
Amazon found every 100ms of latency cost them 1% in sales.
http://bit.ly/2EXPfbA
111.
console.log("hydrating yubls from db…");
console.log("fetching user info from user-api");
console.log("MONITORING|1489795335|27.4|latency|user-api-latency");
console.log("MONITORING|1489795335|8|count|yubls-served");
the first two lines are plain logs; the MONITORING lines are metrics, encoded as
MONITORING|timestamp|metric value|metric type|metric name
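A minimal sketch of a helper that emits metrics in this format (the function name is illustrative, not from the talk): writing the metric as a log line costs only a console.log, and the actual publishing can happen asynchronously from the log aggregation pipeline.

// emit a metric as a structured log line: MONITORING|timestamp|value|type|name
function logMetric(name, value, type) {
  const timestamp = Math.floor(Date.now() / 1000); // epoch seconds, as in the example above
  console.log(`MONITORING|${timestamp}|${value}|${type}|${name}`);
}

logMetric('user-api-latency', 27.4, 'latency');
logMetric('yubls-served', 8, 'count');

A subscription on the function's CloudWatch Logs group (for example another Lambda function or a Kinesis stream) can then pick out the lines with the MONITORING prefix and forward them to the metrics system, keeping the network calls off the critical path.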
119.
don't span over async invocations:
good for identifying the dependencies of a function, but not good enough for tracing the entire call chain as a user request/data flows through the system via async event sources.
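One way to carry the trace through async event sources is to forward a correlation ID with each message and log it on both sides. A minimal sketch using SNS message attributes (the attribute name x-correlation-id and the function names are illustrative):

const AWS = require('aws-sdk');
const sns = new AWS.SNS();

// publisher: attach the correlation ID to the outgoing message
async function publishEvent(correlationId, payload) {
  await sns.publish({
    TopicArn: process.env.TOPIC_ARN,
    Message: JSON.stringify(payload),
    MessageAttributes: {
      'x-correlation-id': { DataType: 'String', StringValue: correlationId }
    }
  }).promise();
}

// subscriber: extract the correlation ID and include it in every log line
module.exports.handler = async (event) => {
  for (const record of event.Records) {
    const attrs = record.Sns.MessageAttributes || {};
    const correlationId = attrs['x-correlation-id']
      ? attrs['x-correlation-id'].Value
      : 'unknown';
    console.log(`${correlationId} processing message`, record.Sns.Message);
  }
};

The same idea extends to Kinesis, S3 and the other event sources, though each one needs its own way of carrying the ID through (record payloads, object metadata, and so on).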