[Figure: trace timeline, axis 0 200 400 600 800, with spans: All, create-user, …user.insert_user, …user.upload_img, tag-user, create-auto0-user, process-images, resize-imag...]
[Figure: service map with Input/Output panel: user, profile-images, POST /user, process-images, resize-images, image-tasks, Auth0, create-user, create-auth0-user, re...]
all your needs in one place
TRACING
mmm… it’s a graph
what if we can query it
like a graph?
http://amzn.to/2nk7uiW
ability to query based on the relationship
between observed components
(as well as the components themselves)
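A minimal sketch of what querying by relationship could look like, using a tiny in-memory graph. The component and relationship names echo the slides, but which component points at which is an assumption made for illustration, and `reachable` is a hypothetical helper rather than any real tool's API.

```javascript
// Hypothetical edges: names come from the slides, the wiring is assumed.
const edges = [
  { from: "payment-service", type: "DEPENDS_ON", to: "user-service" },
  { from: "payment-service", type: "DEPENDS_ON", to: "auth-service" },
  { from: "user-service", type: "USES", to: "user-table" },
  { from: "auth-service", type: "USES", to: "user-table" },
  { from: "user-table", type: "PUBLISHES_TO", to: "user-stream" },
];

// Breadth-first traversal: every component that `start` reaches by
// following edges whose relationship type is listed in `types`.
function reachable(start, types) {
  const seen = new Set();
  const queue = [start];
  while (queue.length > 0) {
    const node = queue.shift();
    for (const edge of edges) {
      if (edge.from === node && types.includes(edge.type) && !seen.has(edge.to)) {
        seen.add(edge.to);
        queue.push(edge.to);
      }
    }
  }
  return [...seen];
}

// "What does payment-service rely on, directly or indirectly?"
console.log(reachable("payment-service", ["DEPENDS_ON", "USES"]));
// [ 'user-service', 'auth-service', 'user-table' ]
```

With the relationships in hand, a question about one slow service becomes a question about everything it reaches, which is where a throttled table several hops away would show up.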
root cause analysis
“the elevated error rate in service X was caused by
DynamoDB table throttling.”
payment was slow last
night around 10PM.
investigate.
[Figure: 95-percentile latency over time for service A and service B, with 10PM marked]
causality? or correlation?
[Figure: graph of user-service, auth-service and payment-service joined by USES and DEPENDS_ON edges, annotated “payment was slow last night around 10PM”]
[Figure: the same graph with user-table added and flagged “throttled exceptions!”]
[Figure: larger graph including user-table and user-stream, joined by DEPENDS_ON, USES and PUBLISHES_TO edges]
wouldn’t that be nice?
MACHINE
LEARNING
use ML to auto-detect erroneous or
suspicious behaviours, or to suggest
possible improvements
!
Function [X] just performed
an unexpected write against
DynamoDB table [Y].
Should I…
ignore it from now on
shut it down...
Observability Bot <bot@bestobservability.com>
Observability Bot <bot@bestobservability.com>
don’t bother me about this again
Observability Bot <bot@bestobservability.com>
auto-modify IAM role with DENY rule
Function [X]’s performance
has degraded since yesterday -
99% latency has gone up by
47% from 100ms to 147ms.
!
!
Function [X] can run faster &
cheaper if you increase its
memory allocation.
Should I…
ignore it from now on
update sett...
zzz… the future of… zzz …
serverless observability… zzz
Simon Wardley
Simon Wardley
context &
movement
However, I would argue that the health of the system no
longer matters. We've entered an era where what matters is
the hea...
“one user action/vertical slice through the system”
movement
context
movement
The best way to predict the future
is to invent it.
- Alan Kay
The best way to invent
the future is to inception
someone else to do it.
- me
The present and future of Serverless observability (Serverless Computing London)

As engineers, we’re empowered by advancements in cloud platforms to build ever more complex systems that can achieve amazing feats at a scale previously only possible for the elite few. Monitoring tools have evolved over the years to accommodate our growing needs with these increasingly complex systems, but the emergence of serverless technologies like AWS Lambda has shifted the landscape and broken some of the underlying assumptions that existing tools are built upon. For example, you can no longer access the underlying host to install monitoring agents/daemons, and it’s no longer feasible to use background threads to send monitoring data outside the critical path.

Furthermore, event-driven architectures have become easily accessible and widely adopted by teams using serverless technologies. This trend adds another layer of complexity to how we monitor and debug our systems, because it involves tracing executions that flow through async invocations and are often fanned out and fanned in via various event processing patterns.

Join us in this talk as Yan Cui gives us an overview of the challenges with observing a serverless architecture (ephemerality, no access to host OS, no background thread for sending monitoring data, etc.), the tradeoffs to consider, and the state of the tooling for serverless observability.

The present and future of Serverless observability (Serverless Computing London)

  1. 1. @theburningmonk #aws #awslambda #serverless the present and future of serverless observability Yan Cui @theburningmonk
  2. 2. Abraham Wald
  5. 5. Abraham Wald Wald noted that the study only considered the aircraft that had survived their missions—the bombers that had been shot down were not present for the damage assessment. The holes in the returning aircraft, then, represented areas where a bomber could take damage and still return home safely.
  7. 7. survivor bias in monitoring
  8. 8. survivor bias in monitoring Only focus on failure modes that we were able to successfully identify through investigation and postmortem in the past. The bullet holes that shot us down and we couldn’t identify stay invisible, and will continue to shoot us down.
  9. 9. Yan Cui http://theburningmonk.com @theburningmonk Principal Engineer @
  10. 10. available in Austria, Switzerland, Germany, Japan, Canada, Italy and US
  11. 11. available on 30+ platforms
  12. 12. ~1,000,000 concurrent viewers
  13. 13. We’re hiring! Visit engineering.dazn.com to learn more. follow @dazneng for updates about the engineering team
  14. 14. follow @dazneng for updates about the engineering team We’re hiring! Visit engineering.dazn.com to learn more. WE’RE HIRING!
  15. 15. AWS user since 2009
  17. 17. https://www.youtube.com/watch?v=pptsgV4bKv8
  18. 18. https://bit.ly/production-ready-serverless
  19. 19. http://bit.ly/2C9LwIM
  20. 20. 2017 observability
  21. 21. Monitoring watching out for known failure modes in the system, e.g. network I/O, CPU, memory usage, …
  22. 22. Observability being able to debug the system, and gain insights into the system’s behaviour
  23. 23. In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. https://en.wikipedia.org/wiki/Observability
  24. 24. Known Success
  25. 25. Known Success | Known Errors
  26. 26. Known Success | Known Errors (callout: easy to monitor!)
  27. 27. Known Success | Known Errors | Known Unknowns
  28. 28. Known Success | Known Errors | Known Unknowns | Unknown Unknowns
  29. 29. Known Success | Known Errors | Known Unknowns | Unknown Unknowns (callout: invisible bullet holes)
  30. 30. Known Success | Known Errors | Known Unknowns | Unknown Unknowns
  31. 31. Known Success | Known Errors | Known Unknowns | Unknown Unknowns (callout: only alert on this)
  32. 32. Known Success | Known Errors | Known Unknowns | Unknown Unknowns (callout: alert on the absence of this!)
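A minimal sketch of alerting on the absence of a known success signal. The function name, the time window and the expected-count threshold are assumptions made for illustration; the slides do not prescribe an implementation.

```javascript
// Hypothetical helper: alert when fewer success signals than expected
// were observed inside the window (nowMs - windowMs, nowMs].
function shouldAlertOnAbsence(successTimestampsMs, nowMs, windowMs, minExpected) {
  const recent = successTimestampsMs.filter(
    (t) => t > nowMs - windowMs && t <= nowMs
  );
  return recent.length < minExpected;
}

// Three successes in the last minute, but at least five are expected.
const now = 1000000;
console.log(
  shouldAlertOnAbsence([now - 5000, now - 20000, now - 59000], now, 60000, 5)
); // true
```

The check never needs to know what failed; it only notices that the evidence of things working has stopped arriving.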
  33. 33. Known Success | Known Errors | Known Unknowns | Unknown Unknowns (callout: what went wrong?)
  34. 34. “These are the four pillars of the Observability Engineering team’s charter: • Monitoring • Alerting/visualization • Distributed systems tracing infrastructure • Log aggregation/analytics” - Observability Engineering at Twitter, http://bit.ly/2DnjyuW
  35. 35. microservices death stars circa 2015
  36. 36. microservices death stars circa 2015 I got this!
  37. 37. new challenges
  39. 39. NO ACCESS to underlying OS
  40. 40. NOWHERE to install agents/daemons
  41. 41. new challenges: • nowhere to install agents/daemons
  42. 42. user request user request user request user request user request user request user request critical paths: minimise user-facing latency handler handler handler handler handler handler handler
  43. 43. user request user request user request user request user request user request user request critical paths: minimise user-facing latency StatsD handler handler handler handler handler handler handler rsyslog background processing: batched, asynchronous, low overhead
  44. 44. user request user request user request user request user request user request user request critical paths: minimise user-facing latency StatsD handler handler handler handler handler handler handler rsyslog background processing: batched, asynchronous, low overhead NO background processing except what platform provides
  45. 45. new challenges: • no background processing • nowhere to install agents/daemons
  46. 46. EC2 concurrency used to be handled by your code
  47. 47. EC2 Lambda Lambda Lambda Lambda Lambda now, it’s handled by the AWS Lambda platform
  48. 48. EC2 logs & metrics used to be batched here
  49. 49. EC2 Lambda Lambda Lambda Lambda Lambda now, they are batched in each concurrent execution, at best…
  50. 50. HIGHER concurrency to log aggregation/telemetry system
  51. 51. new challenges: • higher concurrency to telemetry system • nowhere to install agents/daemons • no background processing
  52. 52. Lambda cold start
  53. 53. Lambda data is batched between invocations
  54. 54. Lambda idle data is batched between invocations
  55. 55. Lambda, idle, garbage collection; data is batched between invocations
  56. 56. Lambda, idle, garbage collection; data is batched between invocations. HIGH chance of data loss
  57. 57. new challenges: • high chance of data loss (if batching) • nowhere to install agents/daemons • no background processing • higher concurrency to telemetry system
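One way to avoid losing data that is batched between invocations is to flush at the end of every invocation, at the cost of extra latency on each call. The sketch below assumes a hypothetical `send` function standing in for whatever client ships metrics to the telemetry system; `createMetricsBuffer` and `withFlush` are illustrative names, not a real library.

```javascript
// Hypothetical in-memory buffer. `send` is an assumed async function that
// delivers a batch of metrics to the telemetry system.
function createMetricsBuffer(send) {
  let buffer = [];
  return {
    record(name, value) {
      buffer.push({ name, value });
    },
    // Ship everything collected so far and reset the buffer.
    async flush() {
      const batch = buffer;
      buffer = [];
      if (batch.length > 0) {
        await send(batch);
      }
    },
  };
}

// Wrap a handler so buffered metrics are flushed before the invocation
// ends, instead of waiting for a next invocation that may never come.
function withFlush(metrics, handler) {
  return async (event) => {
    try {
      return await handler(event);
    } finally {
      await metrics.flush();
    }
  };
}

// Example wiring with a fake `send` that just collects batches.
const sentBatches = [];
const metricsBuffer = createMetricsBuffer(async (batch) => {
  sentBatches.push(batch);
});
const wrappedHandler = withFlush(metricsBuffer, async (event) => {
  metricsBuffer.record("requests", 1);
  return { statusCode: 200 };
});
```

Flushing in `finally` means metrics are still shipped when the handler throws, but the flush time is added to every invocation.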
  58. 58. Lambda
  59. 59. my code send metrics
  61. 61. my code send metrics internet internet press button something happens
  62. 62. http://bit.ly/2Dpidje
  63. 63. ? functions are often chained together via asynchronous invocations
  64. 64. ? SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES
  65. 65. ? SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES: tracing ASYNCHRONOUS invocations through so many different event sources is difficult
  66. 66. new challenges: • asynchronous invocations • nowhere to install agents/daemons • no background processing • higher concurrency to telemetry system • high chance of data loss (if batching)
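A common way to follow a request across asynchronous hops is to carry a correlation ID with the message. This is not shown in the slides; the envelope shape and both function names below are assumptions, kept transport-agnostic so no particular event source's API is implied.

```javascript
// Hypothetical envelope: wrap the payload with a correlation ID before
// publishing it to an event source.
function wrapWithCorrelationId(payload, correlationId) {
  return JSON.stringify({ correlationId, payload });
}

// On the receiving side, recover the ID so the downstream function can
// include it in every log line it writes.
function unwrapMessage(messageBody) {
  const { correlationId, payload } = JSON.parse(messageBody);
  return { correlationId, payload };
}

const body = wrapWithCorrelationId({ username: "theburningmonk" }, "req-123");
const received = unwrapMessage(body);
console.log(received.correlationId); // req-123
```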
  67. 67. the Present
  68. 68. “These are the four pillars of the Observability Engineering team’s charter: • Monitoring • Alerting/visualization • Distributed systems tracing infrastructure • Log aggregation/analytics” - Observability Engineering at Twitter, http://bit.ly/2DnjyuW
  69. 69. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?
  70. 70. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now? (labels, left to right: UTC Timestamp, Request Id, your log message)
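A small sketch of splitting such a log line into its three labelled parts. It assumes the parts are separated by whitespace (tabs or spaces); `parseLambdaLogLine` is a hypothetical helper name.

```javascript
// Split a log line into timestamp, request ID and message.
// Returns null when the line does not have at least three parts.
function parseLambdaLogLine(line) {
  const match = /^(\S+)\s+(\S+)\s+([\s\S]*)$/.exec(line);
  if (!match) return null;
  const [, timestamp, requestId, message] = match;
  return { timestamp, requestId, message };
}

const parsed = parseLambdaLogLine(
  "2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?"
);
console.log(parsed.requestId); // 994f18f9-482b-11e6-8668-53e4eab441ae
```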
  71. 71. one log group per function one log stream for each concurrent invocation
  72. 72. logs are not easily searchable in CloudWatch Logs me
  73. 73. CloudWatch Logs
  74. 74. CloudWatch Logs AWS Lambda ELK stack
  75. 75.
  76. 76. CloudWatch Logs
  78. 78. new challenges: • no background processing • nowhere to install agents/daemons
  79. 79. my code send metrics internet internet press button something happens
  80. 80. those extra 10-20ms for sending custom metrics would compound when you have microservices and multiple APIs are called within one slice of user event
  81. 81. Amazon found every 100ms of latency cost them 1% in sales. http://bit.ly/2EXPfbA
  82. 82. console.log("hydrating yubls from db…"); console.log("fetching user info from user-api"); console.log("MONITORING|1489795335|27.4|latency|user-api-latency"); console.log("MONITORING|1489795335|8|count|yubls-served"); (the first two lines are logs, the last two are metrics; metric fields: timestamp | metric value | metric type | metric name)
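A sketch of how a log-processing function might pick the metric lines out of ordinary log messages, following the field order labelled on the slide. `parseMetricLog` is a hypothetical name, and converting the timestamp and value to numbers is an assumption.

```javascript
// Parse a message in the format
// MONITORING|<timestamp>|<metric value>|<metric type>|<metric name>
// and return null for ordinary log messages.
function parseMetricLog(message) {
  const parts = message.split("|");
  if (parts.length !== 5 || parts[0] !== "MONITORING") return null;
  const [, timestamp, value, type, name] = parts;
  return {
    timestamp: Number(timestamp),
    value: Number(value),
    type,
    name,
  };
}

const messages = [
  "hydrating yubls from db...",
  "MONITORING|1489795335|27.4|latency|user-api-latency",
  "MONITORING|1489795335|8|count|yubls-served",
];
const metricsFromLogs = messages.map(parseMetricLog).filter((m) => m !== null);
console.log(metricsFromLogs.length); // 2
```

Because the function only writes to stdout, the metric is recorded without a network call inside the invocation; the parsing and forwarding happen later, outside the critical path.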
  83. 83. CloudWatch Logs, AWS Lambda, then logs to ELK stack and metrics to CloudWatch
  84. 84. delay cost concurrency
  85. 85. delay cost concurrency no latency overhead
  86. 86. API Gateway send custom metrics asynchronously
  87. 87. API Gateway: send custom metrics asynchronously. SNS, Kinesis, S3, …: send custom metrics as part of function invocation
  88. 88. X-Ray
  89. 89. do not span over API Gateway
  90. 90. narrow focus on a function good for homing in on performance issues for a particular function, but offers little to help you build intuition about how your system operates as a whole.
  91. 91. “However, I would argue that the health of the system no longer matters. We've entered an era where what matters is the health of each individual event, or each individual user's experience, or each shopping cart's experience (or other high cardinality dimensions). With distributed systems you don't care about the health of the system, you care about the health of the event or the slice.” - Charity Majors, http://bit.ly/2E2QngU
  92. 92. follow the data
  93. 93. don’t span over async invocations good for identifying dependencies of a function, but not good enough for tracing the entire call chain as user request/data flows through the system via async event sources.
  94. 94. don’t span over non-AWS services
  95. 95. static view
  96. 96. our tools need to do more to help us with understanding & debugging our distributed system, not just what happens inside one function
  97. 97. “one user action/vertical slice through the system”
  98. 98. microservices death stars circa 2015
  99. 99. microservices death stars circa 2015 HELP…
  100. 100. WARNING: this is part fiction, part inspired by new tools
  101. 101. DASHBOARDS
  102. 102. different dimensions of service X splattered across the screen
  103. 103. + cold starts + throttled invocations + concurrent executions + estimated cost ($)
  104. 104. [Dashboard widget: SubscriberGetAccount] Circle colour and size represent health and traffic volume. Sparkline: 2 minutes of request rate to show relative changes in traffic. Concurrency (no. of concurrent executions of this function): 370. Req Rate: 20,056.0/s. Est Cost: $54.0/s. Error percentage of last 10 seconds: 0 %. Cold start percentage of last 10 seconds: 0 %. Last minute latency percentiles: Median 1ms, Mean 4ms, 90th 10ms, 99th 44ms, 99.5th 61ms. Rolling 10 second counters with 1 second granularity: Successes 200,545; Cold starts 0; Timeouts 19; Throttled Invocations 94; Errors 0.
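A minimal sketch of a rolling 10 second counter with 1 second granularity, as the widget describes. The class and method names are assumptions, and timestamps are passed in explicitly so the behaviour is easy to follow.

```javascript
// Rolling counter made of one-second buckets covering the last
// `windowSeconds` seconds.
class RollingCounter {
  constructor(windowSeconds = 10) {
    this.windowSeconds = windowSeconds;
    this.buckets = new Map(); // whole second -> count
  }

  // Record `count` events at time `nowMs` (milliseconds).
  increment(nowMs, count = 1) {
    const second = Math.floor(nowMs / 1000);
    this.buckets.set(second, (this.buckets.get(second) || 0) + count);
    this.evict(second);
  }

  // Sum of all buckets inside the window ending at `nowMs`.
  total(nowMs) {
    this.evict(Math.floor(nowMs / 1000));
    let sum = 0;
    for (const value of this.buckets.values()) sum += value;
    return sum;
  }

  // Drop buckets that have fallen out of the window.
  evict(currentSecond) {
    const oldest = currentSecond - this.windowSeconds + 1;
    for (const key of this.buckets.keys()) {
      if (key < oldest) this.buckets.delete(key);
    }
  }
}

const timeouts = new RollingCounter(10);
timeouts.increment(0);
timeouts.increment(4000, 2);
console.log(timeouts.total(9000)); // 3
console.log(timeouts.total(12000)); // 2
```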
  109. 109. birds-eye view of our system as it lives and breathes
  110. 110. [service map] user, profile-images, POST /user, process-images, resize-images, image-tasks, Auth0, create-user, reformat-images, tag-user, Face API, create-auth0-user
  111. 111. [service map] user, profile-images, POST /user, process-images, resize-images, image-tasks, Auth0, create-user, reformat-images, tag-user, Face API, create-auth0-user (callout: trace async invocations)
  112. 112. [service map] user, profile-images, POST /user, process-images, resize-images, image-tasks, Auth0, create-user, reformat-images, tag-user, Face API, create-auth0-user (callout: trace non-AWS resources)
  113. 113. [service map] user, profile-images, POST /user, process-images, resize-images, image-tasks, Auth0, create-user, reformat-images, tag-user, Face API, create-auth0-user. Logs (timestamp | component | message | level): 2018/01/25 20:51:23.188 | POST /user | incoming request… | debug; 2018/01/25 20:51:23.201 | create-user | saving user [theburningmonk] in the [user] table… | debug; 2018/01/25 20:51:23.215 | create-user | saved user [theburningmonk] in the [user] table | debug; 2018/01/25 20:51:23.521 | tag-user | tagging user [theburningmonk] with Azure Face API… | debug
  115. 115. [service map] user, profile-images, POST /user, process-images, resize-images, image-tasks, Auth0, create-user, reformat-images, tag-user, Face API, create-auth0-user. Logs (timestamp | component | message | level): 2018/01/25 20:51:23.188 | POST /user | incoming request… | debug. Details: request-id 0ae4ba5d-dab1-4f9e-9de7-eace27ebfbc2; start-time 2018/01/25 20:51:23.188; method POST
  116. 116. [service map] user, profile-images, POST /user, process-images, resize-images, image-tasks, Auth0, create-user, reformat-images, tag-user, Face API, create-auth0-user. Logs (timestamp | component | message | level): 2018/01/25 20:51:23.201 | create-user | saving user [theburningmonk] in the [user] table… | debug; 2018/01/25 20:51:23.215 | create-user | saved user [theburningmonk] in the [user] table | debug; 2018/01/25 20:51:23.585 | create-user | uploading profile image… | debug; 2018/01/25 20:51:23.587 | create-user | tagged user [theburningmonk] with Azure Face API… | debug
  117. 117. [service map] user, profile-images, POST /user, process-images, resize-images, image-tasks, Auth0, create-user, reformat-images, tag-user, Face API, create-auth0-user. Logs (timestamp | component | message | level): 2018/01/25 20:51:23.201 | create-user | saving user [theburningmonk] in the [user] table… | debug; 2018/01/25 20:51:23.215 | create-user | saved user [theburningmonk] in the [user] table | debug; 2018/01/25 20:51:23.585 | create-user | uploading profile image… | debug; 2018/01/25 20:51:23.587 | create-user | tagged user [theburningmonk] with Azure Face API… | debug (callout: click here to go to code)
  118. 118. user profile-images POST /user process-images resize-images image-tasks Auth0 create-user create-auth0-user reformat-images tag-user Face API
       Logs | Input/Output
       input: { "body": "{ \"username\":\"theburningmonk\"}", "resource": "/user", "requestContext": { "resourceId": "123456", "apiId": "1234567890", "resourcePath": "/user", …
       output: { "statusCode": 200 }
  119. 119. user profile-images POST /user process-images resize-images image-tasks Auth0 create-user create-auth0-user reformat-images tag-user Face API
       Logs | Input/Output
       input: { "Records": [ { "Sns": { "Type": "Notification", "MessageId": "…", "TopicArn": "…", "Message": "…", "Timestamp": "2018/01/25 20:51:24.215", …
       output: { "error": null, "result": "OK" }
  120. 120. user profile-images POST /user process-images resize-images image-tasks Auth0 create-user create-auth0-user reformat-images tag-user Face API !
       Logs | Input/Output
       input: { "Records": [ { "Sns": { "Type": "Notification", "MessageId": "…", "TopicArn": "…", "Message": "…", "Timestamp": "2018/01/25 20:51:24.215", …
       error: [com.spaceape.dragon.handler.ReformatProfileImageHandler] Null reference exception *java.lang.NullReferenceException: … * at … * at … * at …
  122. 122. [trace timeline, "All" view, axis from 0 to 800 ms] spans: create-user, …user.insert_user, …user.upload_img, tag-user, create-auth0-user, process-images, resize-images, reformat-images (!); durations shown: 837ms, 406ms, 66ms, 114ms, 122ms, 82ms, 240ms, 157ms, 35ms
  124. 124. [the diagram, the Logs and Input/Output tabs, and the trace timeline from slide 122 shown together in one view]
  126. 126. TRACING: all your needs in one place
  127. 127. mmm… it’s a graph
  128. 128. what if we could query it like a graph?
  129. 129. http://amzn.to/2nk7uiW
  130. 130. ability to query based on the relationship between observed components (as well as the components themselves)
  131. 131. root cause analysis
  132. 132. “the elevated error rate in service X was caused by DynamoDB table throttling.”
  133. 133. payment was slow last night around 10PM. investigate.
  134. 134. [chart: 95th-percentile latency over time for service A and service B, with 10PM marked]
  135. 135. [same chart] causality? or correlation?
  136. 136. “payment was slow last night around 10PM” [graph: user-service, auth-service, payment-service and user-table, linked by USES and DEPENDS_ON edges]
  137. 137. [same graph] user-table: throttled exceptions!
  138. 138. “what else is impacted by the throttled exceptions on user-table?” [graph: user-table and user-stream among a wider set of components, linked by USES, DEPENDS_ON and PUBLISHES_TO edges]
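The question on the slide, "what else is impacted by the throttled exceptions on user-table?", is a reachability query over the relationship graph. A minimal Python sketch follows, using a breadth-first traversal over a small edge list. The edges themselves are hypothetical (the slide does not spell out which component links to which, and `audit-function` is made up), and a real system would more likely use a graph database than an in-memory list.

```python
from collections import deque

# Hypothetical edges, loosely based on the diagram: (source, relationship, target).
EDGES = [
    ("user-service", "DEPENDS_ON", "user-table"),
    ("auth-service", "USES", "user-service"),
    ("payment-service", "USES", "user-service"),
    ("user-table", "PUBLISHES_TO", "user-stream"),
    ("audit-function", "DEPENDS_ON", "user-stream"),  # made-up consumer
]


def impacted_by(resource, edges):
    """Return every component that directly or transitively relies on `resource`."""
    # Assumption: impact flows against DEPENDS_ON/USES edges (from the
    # dependency to its dependant) and along PUBLISHES_TO edges.
    downstream = {}
    for source, relationship, target in edges:
        if relationship == "PUBLISHES_TO":
            downstream.setdefault(source, set()).add(target)
        else:
            downstream.setdefault(target, set()).add(source)

    seen, queue = set(), deque([resource])
    while queue:
        node = queue.popleft()
        for neighbour in downstream.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


print(sorted(impacted_by("user-table", EDGES)))
```

With these example edges the traversal reaches the services that depend on the table as well as whatever consumes its stream, which is the kind of answer a relationship-aware query would give.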
  140. 140. wouldn’t that be nice?
  141. 141. MACHINE LEARNING
  142. 142. use ML to auto-detect erroneous or suspicious behaviours, or to suggest possible improvements
  143. 143. ! Function [X] just performed an unexpected write against DynamoDB table [Y]. Should I… [ignore it from now on] [shut it down!!]
  144. 144. Observability Bot <bot@bestobservability.com>
  145. 145. Observability Bot <bot@bestobservability.com> don’t bother me about this again
  146. 146. Observability Bot <bot@bestobservability.com> auto-modify IAM role with DENY rule
  147. 147. ! Function [X]’s performance has degraded since yesterday - 99th percentile latency has gone up by 47% from 100ms to 147ms.
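The figure in that alert is a plain relative change: (147 - 100) / 100 = 47%. A minimal Python sketch of how such a message could be produced follows. The `latency_alert` helper and its 20% threshold are assumptions for illustration, and the two latencies are taken as already-computed 99th percentile values.

```python
def percent_change(before_ms, after_ms):
    """Relative change from `before_ms` to `after_ms`, in percent."""
    return (after_ms - before_ms) * 100 / before_ms


def latency_alert(function_name, before_ms, after_ms, threshold_pct=20.0):
    """Return an alert message if latency rose by at least `threshold_pct`, else None."""
    change = percent_change(before_ms, after_ms)
    if change < threshold_pct:
        return None
    return (
        f"Function [{function_name}]'s 99th percentile latency has gone up "
        f"by {change:.0f}% from {before_ms:.0f}ms to {after_ms:.0f}ms."
    )


print(latency_alert("X", 100, 147))
```

A fixed threshold is the simplest possible rule; the slides suggest using ML to decide what counts as a suspicious change instead.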
  148. 148. ! Function [X] can run faster & cheaper if you increase its memory allocation. Should I… [ignore it from now on] [update setting]
  149. 149. zzz… the future of… zzz … serverless observability… zzz
  150. 150. Simon Wardley
  151. 151. Simon Wardley context & movement
  152. 152. “However, I would argue that the health of the system no longer matters. We've entered an era where what matters is the health of each individual event, or each individual user's experience, or each shopping cart's experience (or other high cardinality dimensions). With distributed systems you don't care about the health of the system, you care about the health of the event or the slice.” - Charity Majors http://bit.ly/2E2QngU
  153. 153. “one user action/vertical slice through the system”
  154. 154. movement context movement
  155. 155. The best way to predict the future is to invent it. - Alan Kay
  156. 156. The best way to invent the future is to inception someone else to do it. - me
