Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
and FUTUREThe
Serverless
OBSERVABILITY
ofpresent
hi, my name is Yan.
hi, my name is Yan.
hi, my name is Yan.
AWS user since 2009
http://bit.ly/yubl-serverless
http://bit.ly/2Cdsai5
2017
observability
http://bit.ly/2EXQZBj
http://bit.ly/2EXKEFZ
mm… I wonder what’s
going on here…
what is observability?
how is it different from monitoring?
Monitoring
watching out for
known failure modes
in the system,
e.g. network I/O, CPU,
memory usage, …
Observability
being able to debug
the system, and gain
insights into the
system’s behaviour
However, I would argue that the health of the system no
longer matters. We've entered an era where what matters is
the hea...
However, I would argue that the health of the system no
longer matters. We've entered an era where what matters is
the hea...
These are the four pillars of the Observability Engineering
team’s charter:
• Monitoring
• Alerting/visualization
• Distri...
Observability is useful even outside of incidents and outages
microservices death stars circa 2015
microservices death stars circa 2015
I got this!
new
challenges
new
challenges
NO ACCESS
to underlying OS
NOWHERE
to install agents/daemons
•nowhere to install agents/daemons
new challenges
user request
user request
user request
user request
user request
user request
user request
critical paths:
minimise user-f...
user request
user request
user request
user request
user request
user request
user request
critical paths:
minimise user-f...
user request
user request
user request
user request
user request
user request
user request
critical paths:
minimise user-f...
•no background processing
•nowhere to install agents/daemons
new challenges
EC2
concurrency used to be
handled by your code
EC2
Lambda
Lambda
Lambda
Lambda
Lambda
now, it’s handled by the
AWS Lambda platform
EC2
logs & metrics used to be
batched here
EC2
Lambda
Lambda
Lambda
Lambda
Lambda
now, they are batched in each
concurrent execution, at best…
HIGHER concurrency to log
aggregation/telemetry system
•higher concurrency to telemetry system
•nowhere to install agents/daemons
•no background processing
new challenges
Lambda
cold start
Lambda
data is batched between
invocations
Lambda
idle
data is batched between
invocations
Lambda
idle
garbage collectiondata is batched between
invocations
Lambda
idle
garbage collectiondata is batched between
invocations
HIGH chance of data loss
•high chance of data loss (if batching)
•nowhere to install agents/daemons
•no background processing
•higher concurrency t...
Lambda
my code
send metrics
my code
send metrics
my code
send metrics
internet internet
press button something happens
http://bit.ly/2Dpidje
?
functions are often chained together
via asynchronous invocations
?
SNS
Kinesis
CloudWatch
Events
CloudWatch
LogsIoT
DynamoDB
S3 SES
?
SNS
Kinesis
CloudWatch
Events
CloudWatch
LogsIoT
DynamoDB
S3 SES
tracing ASYNCHRONOUS
invocations through so many
differ...
•asynchronous invocations
•nowhere to install agents/daemons
•no background processing
•higher concurrency to telemetry sy...
the Present
These are the four pillars of the Observability Engineering
team’s charter:
• Monitoring
• Alerting/visualization
• Distri...
2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae
GOT is off air, what do I do now?
2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae
GOT is off air, what do I do now?
UTC Timestamp Request Id
y...
one log group per
function
one log stream for each
concurrent invocation
logs are not easily searchable in
CloudWatch Logs
me
CloudWatch Logs
CloudWatch Logs AWS Lambda ELK stack
…
CloudWatch Logs
•no background processing
•nowhere to install agents/daemons
new challenges
my code
send metrics
internet internet
press button something happens
those extra 10-20ms for
sending custom metrics would
compound when you have
microservices and multiple
APIs are called wit...
Amazon found every 100ms of latency cost them 1% in sales.
http://bit.ly/2EXPfbA
console.log(“hydrating yubls from db…”);
console.log(“fetching user info from user-api”);
console.log(“MONITORING|14897953...
CloudWatch Logs AWS Lambda
ELK stack
logs
m
etrics
CloudWatch
delay
cost
concurrency
delay
cost
concurrency
no latency
overhead
API Gateway
send custom metrics
asynchronously
SNS KinesisS3API Gateway
…
send custom metrics
asynchronously
send custom metrics as
part of function invocation
X-Ray
do not span over API Gateway
narrow focus on a function
good for homing in on performance issues
for a particular function, but offers little to
help y...
However, I would argue that the health of the system no
longer matters. We've entered an era where what matters is
the hea...
follow the data
don’t span over async invocations
good for identifying dependencies of a function,
but not good enough for tracing the ent...
don’t span over non-AWS services
static view
our tools need to do more to help us with
understanding & debugging our distributed system,
not just what happens inside o...
“one user action/vertical slice through the system”
microservices death stars circa 2015
microservices death stars circa 2015
HELP…
WARNING: this is part fiction, part inspired by new tools
DASHBOARDS
different dimensions of X splattered
across the screen
+ cold starts
+ throttled invocations
+ concurrent executions
+ estimated cost ($)
SubscriberGetAccount
200,545
0
19
94
0
0 %
0 %
Est Cost:
Req Rate:
$54.0/s
20,056.0/s
Concurrency
Median
Mean 99.5th
99th
...
SubscriberGetAccount
200,545
0
19
94
0
0 %
0 %
Est Cost:
Req Rate:
$54.0/s
20,056.0/s
Concurrency
Median
Mean 99.5th
99th
...
SubscriberGetAccount
200,545
0
19
94
0
0 %
0 %
Est Cost:
Req Rate:
$54.0/s
20,056.0/s
Concurrency
Median
Mean 99.5th
99th
...
SubscriberGetAccount
200,545
0
19
94
0
0 %
0 %
Est Cost:
Req Rate:
$54.0/s
20,056.0/s
Concurrency
Median
Mean 99.5th
99th
...
SubscriberGetAccount
200,545
0
19
94
0
0 %
0 %
Est Cost:
Req Rate:
$54.0/s
20,056.0/s
Concurrency
Median
Mean 99.5th
99th
...
birds-eye view of our system as it lives and breathes
user
profile-images
POST /user
process-images
resize-images
image-tasks
Auth0
create-user
reformat-imagestag-user
Face API
...
user
profile-images
POST /user
process-images
resize-images
image-tasks
Auth0
create-user
reformat-imagestag-user
Face API
...
user
profile-images
POST /user
process-images
resize-images
image-tasks
Auth0
create-user
reformat-imagestag-user
Face API
...
user
profile-images
POST /user
process-images
resize-images
image-tasks
Auth0
create-user
reformat-imagestag-user
Face API
...
user
profile-images
POST /user
process-images
resize-images
image-tasks
Auth0
create-user
reformat-imagestag-user
Face API
...
user
profile-images
POST /user
process-images
resize-images
image-tasks
Auth0
create-user
reformat-imagestag-user
Face API
...
user
profile-images
POST /user
process-images
resize-images
image-tasks
Auth0
create-user
reformat-imagestag-user
Face API
...
user
profile-images
POST /user
process-images
resize-images
image-tasks
Auth0
create-user
reformat-imagestag-user
Face API
...
Logs Input/Output
user
profile-images
POST /user
process-images
resize-images
image-tasks
Auth0
create-user
reformat-images...
Logs Input/Output
user
profile-images
POST /user
process-images
resize-images
image-tasks
Auth0
create-user
reformat-images...
Logs Input/Output
user
profile-images
POST /user
process-images
resize-images
image-tasks
Auth0
create-user
reformat-images...
Logs Input/Output
user
profile-images
POST /user
process-images
resize-images
image-tasks
Auth0
create-user
create-auth0-us...
All
0 200 400 600 800
create-user
…user.insert_user
…user.upload_img
tag-user
create-auto0-user
process-images
resize-imag...
All
0 200 400 600 800
create-user
…user.insert_user
…user.upload_img
tag-user
create-auto0-user
process-images
resize-imag...
Input/Output
user
profile-images
POST /user
process-images
resize-images
image-tasks
Auth0
create-user
create-auth0-user
re...
Input/Output
user
profile-images
POST /user
process-images
resize-images
image-tasks
Auth0
create-user
create-auth0-user
re...
all your needs in one placeTRACING
mmm… it’s a graph
what if we can query it
like a graph?
http://amzn.to/2nk7uiW
ability to query based on the relationship
between observed components
(as well as the components themselves)
root cause analysis
the elevated error rate in service X was caused by
DynamoDB table throttling.“
”
payment was slow last
night around 10PM.
investigate.
time
95-percentile latency
service A
service B
10PM
time
95-percentile latency
service A
service B
10PM
causality? or correlation?
user-service
USESUSES
DEPENDS_ON
auth-serviceUSES
payment-service
DEPENDS_ON
“payment was slow last
night around 10PM”
use...
user-service
USESUSES
DEPENDS_ON
auth-serviceUSES
DEPENDS_ON
payment-service
user-table
throttled exceptions!
user-table
user-stream
DEPENDS_ON
DEPENDS_ON USES
USES
USES
USES
USES
DEPENDS_ON
D
EPEN
D
S_O
N
DEPENDS_ON
PUBLISHES_TO
“w...
user-table
user-stream
DEPENDS_ON
DEPENDS_ON USES
USES
USES
USES
USES
DEPENDS_ON
D
EPEN
D
S_O
N
DEPENDS_ON
PUBLISHES_TO
“w...
wouldn’t that be nice?
MACHINE
LEARNING
use ML to auto-detect erroneous or
suspicious behaviours, or to suggest
possible improvements
!
Function [X] just performed
an unexpected write against
DynamoDB table [Y].
Should I…
ignore it from now on
shut it down...
Observability Bot <bot@bestobservability.com>
Observability Bot <bot@bestobservability.com>
don’t bother me about this again
Observability Bot <bot@bestobservability.com>
auto-modify IAM role with DENY rule
Function [X]’s performance
has degraded since yesterday -
99% latency has gone up by
47% from 100ms to 147ms.
!
!
Function [X] can run faster &
cheaper if you increase its
memory allocation.
Should I…
ignore it from now on
update sett...
zzz… the future of… zzz …
serverless observability… zzz
Simon Wardley
Simon Wardley
context &
movement
However, I would argue that the health of the system no
longer matters. We've entered an era where what matters is
the hea...
“one user action/vertical slice through the system”
movement
context
movement
The best way to predict the future
is to invent it.
- Alan Kay
The best way to invent
the future is to inception
someone else to do it.
- me
Serkan Özal
@serkan_ozal
Adam Johnson
@adjohn
Erica Windisch
@ewindisch
Nitzan Shapira
@nitzanshapira
Ran Ribenzaft
@ranrib
Charity Majors
@mipsytipsy
Cindy Sridharan
@copyconstruct
Erica Windisch
@ewindisch
Liz Fong-Jones
@lizthegrey
JBD
@rakyll
API Gateway and Kinesis
Authentication & authorisation (IAM, Cognito)
Testing
Running & Debugging functions locally
Log ag...
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
The present and future of serverless observability (QCon London)
You’ve finished this document.
Download and read it offline.
Upcoming SlideShare
What to Upload to SlideShare
Next
Upcoming SlideShare
What to Upload to SlideShare
Next
Download to read offline and view in fullscreen.

Share

The present and future of serverless observability (QCon London)

Download to read offline

As engineers, we’re empowered by advancements in cloud platforms to build ever more complex systems that can achieve amazing feats at a scale previously only possible for the elite few. The monitoring tools have evolved over the years to accommodate our growing needs with these increasingly complex systems, but the emergence of serverless technologies like AWS Lambda has shifted the landscape and broken some of the underlying assumptions that existing tools are built upon - eg. you can no longer access the underlying host to install monitoring agents/daemons, and it’s no longer feasible to use background threads to send monitoring data outside the critical path.

Furthermore, event-driven architectures has become easily accessible and widely adopted by those adopting serverless technologies, and this trend has added another layer of complexity with how we monitor and debug our systems as it involves tracing executions that flow through async invocations, and often fan’d-out and fan’d-in via various event processing patterns.

Join us in this talk as Yan Cui gives us an overview of the challenges with observing a serverless architecture (ephemerality, no access to host OS, no background thread for sending monitoring data, etc.), the tradeoffs to consider, and the state of the tooling for serverless observability.

Related Books

Free with a 30 day trial from Scribd

See all

Related Audiobooks

Free with a 30 day trial from Scribd

See all
  • Be the first to like this

The present and future of serverless observability (QCon London)

  1. 1. and FUTUREThe Serverless OBSERVABILITY ofpresent
  2. 2. hi, my name is Yan.
  3. 3. hi, my name is Yan.
  4. 4. hi, my name is Yan. AWS user since 2009
  5. 5. http://bit.ly/yubl-serverless
  6. 6. http://bit.ly/2Cdsai5
  7. 7. 2017 observability
  8. 8. http://bit.ly/2EXQZBj
  9. 9. http://bit.ly/2EXKEFZ
  10. 10. mm… I wonder what’s going on here…
  11. 11. what is observability? how is it different from monitoring?
  12. 12. Monitoring watching out for known failure modes in the system, e.g. network I/O, CPU, memory usage, …
  13. 13. Observability being able to debug the system, and gain insights into the system’s behaviour
  14. 14. However, I would argue that the health of the system no longer matters. We've entered an era where what matters is the health of each individual event, or each individual user's experience, or each shopping cart's experience (or other high cardinality dimensions). With distributed systems you don't care about the health of the system, you care about the health of the event or the slice. ”http://bit.ly/2E2QngU- Charity Majors “
  15. 15. However, I would argue that the health of the system no longer matters. We've entered an era where what matters is the health of each individual event, or each individual user's experience, or each shopping cart's experience (or other high cardinality dimensions). With distributed systems you don't care about the health of the system, you care about the health of the event or the slice. ”http://bit.ly/2E2QngU- Charity Majors “
  16. 16. These are the four pillars of the Observability Engineering team’s charter: • Monitoring • Alerting/visualization • Distributed systems tracing infrastructure • Log aggregation/analytics “ ” http://bit.ly/2DnjyuW- Observability Engineering at Twitter
  17. 17. Observability is useful even outside of incidents and outages
  18. 18. microservices death stars circa 2015
  19. 19. microservices death stars circa 2015 I got this!
  20. 20. new challenges
  21. 21. new challenges
  22. 22. NO ACCESS to underlying OS
  23. 23. NOWHERE to install agents/daemons
  24. 24. •nowhere to install agents/daemons new challenges
  25. 25. user request user request user request user request user request user request user request critical paths: minimise user-facing latency handler handler handler handler handler handler handler
  26. 26. user request user request user request user request user request user request user request critical paths: minimise user-facing latency StatsD handler handler handler handler handler handler handler rsyslog background processing: batched, asynchronous, low overhead
  27. 27. user request user request user request user request user request user request user request critical paths: minimise user-facing latency StatsD handler handler handler handler handler handler handler rsyslog background processing: batched, asynchronous, low overhead NO background processing except what platform provides
  28. 28. •no background processing •nowhere to install agents/daemons new challenges
  29. 29. EC2 concurrency used to be handled by your code
  30. 30. EC2 Lambda Lambda Lambda Lambda Lambda now, it’s handled by the AWS Lambda platform
  31. 31. EC2 logs & metrics used to be batched here
  32. 32. EC2 Lambda Lambda Lambda Lambda Lambda now, they are batched in each concurrent execution, at best…
  33. 33. HIGHER concurrency to log aggregation/telemetry system
  34. 34. •higher concurrency to telemetry system •nowhere to install agents/daemons •no background processing new challenges
  35. 35. Lambda cold start
  36. 36. Lambda data is batched between invocations
  37. 37. Lambda idle data is batched between invocations
  38. 38. Lambda idle garbage collectiondata is batched between invocations
  39. 39. Lambda idle garbage collectiondata is batched between invocations HIGH chance of data loss
  40. 40. •high chance of data loss (if batching) •nowhere to install agents/daemons •no background processing •higher concurrency to telemetry system new challenges
  41. 41. Lambda
  42. 42. my code send metrics
  43. 43. my code send metrics
  44. 44. my code send metrics internet internet press button something happens
  45. 45. http://bit.ly/2Dpidje
  46. 46. ? functions are often chained together via asynchronous invocations
  47. 47. ? SNS Kinesis CloudWatch Events CloudWatch LogsIoT DynamoDB S3 SES
  48. 48. ? SNS Kinesis CloudWatch Events CloudWatch LogsIoT DynamoDB S3 SES tracing ASYNCHRONOUS invocations through so many different event sources is difficult
  49. 49. •asynchronous invocations •nowhere to install agents/daemons •no background processing •higher concurrency to telemetry system •high chance of data loss (if batching) new challenges
  50. 50. the Present
  51. 51. These are the four pillars of the Observability Engineering team’s charter: • Monitoring • Alerting/visualization • Distributed systems tracing infrastructure • Log aggregation/analytics “ ” http://bit.ly/2DnjyuW- Observability Engineering at Twitter
  52. 52. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?
  53. 53. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now? UTC Timestamp Request Id your log message
  54. 54. one log group per function one log stream for each concurrent invocation
  55. 55. logs are not easily searchable in CloudWatch Logs me
  56. 56. CloudWatch Logs
  57. 57. CloudWatch Logs AWS Lambda ELK stack
  58. 58.
  59. 59. CloudWatch Logs
  60. 60. •no background processing •nowhere to install agents/daemons new challenges
  61. 61. my code send metrics internet internet press button something happens
  62. 62. those extra 10-20ms for sending custom metrics would compound when you have microservices and multiple APIs are called within one slice of user event
  63. 63. Amazon found every 100ms of latency cost them 1% in sales. http://bit.ly/2EXPfbA
  64. 64. console.log(“hydrating yubls from db…”); console.log(“fetching user info from user-api”); console.log(“MONITORING|1489795335|27.4|latency|user-api-latency”); console.log(“MONITORING|1489795335|8|count|yubls-served”); timestamp metric value metric type metric namemetrics logs
  65. 65. CloudWatch Logs AWS Lambda ELK stack logs m etrics CloudWatch
  66. 66. delay cost concurrency
  67. 67. delay cost concurrency no latency overhead
  68. 68. API Gateway send custom metrics asynchronously
  69. 69. SNS KinesisS3API Gateway … send custom metrics asynchronously send custom metrics as part of function invocation
  70. 70. X-Ray
  71. 71. do not span over API Gateway
  72. 72. narrow focus on a function good for homing in on performance issues for a particular function, but offers little to help you build intuition about how your system operates as a whole.
  73. 73. However, I would argue that the health of the system no longer matters. We've entered an era where what matters is the health of each individual event, or each individual user's experience, or each shopping cart's experience (or other high cardinality dimensions). With distributed systems you don't care about the health of the system, you care about the health of the event or the slice. ”http://bit.ly/2E2QngU- Charity Majors “
  74. 74. follow the data
  75. 75. don’t span over async invocations good for identifying dependencies of a function, but not good enough for tracing the entire call chain as user request/data flows through the system via async event sources.
  76. 76. don’t span over non-AWS services
  77. 77. static view
  78. 78. our tools need to do more to help us with understanding & debugging our distributed system, not just what happens inside one function
  79. 79. “one user action/vertical slice through the system”
  80. 80. microservices death stars circa 2015
  81. 81. microservices death stars circa 2015 HELP…
  82. 82. WARNING: this is part fiction, part inspired by new tools
  83. 83. DASHBOARDS
  84. 84. different dimensions of X splattered across the screen
  85. 85. + cold starts + throttled invocations + concurrent executions + estimated cost ($)
  86. 86. SubscriberGetAccount 200,545 0 19 94 0 0 % 0 % Est Cost: Req Rate: $54.0/s 20,056.0/s Concurrency Median Mean 99.5th 99th 90th370 1ms 4ms 61ms 44ms 10ms circle colour and size represent health and traffic volume 2 minutes of request rate to show relative changes in traffic no. of concurrent executions of this function Request rate Estimated cost Error percentage of last 10 seconds Cold start percentage last 10 seconds last minute latency percentiles 200,545 0 19 94 0 Rolling 10 second counters with 1 second granularity Successes Cold starts Timeouts Throttled Invocations Errors
  87. 87. SubscriberGetAccount 200,545 0 19 94 0 0 % 0 % Est Cost: Req Rate: $54.0/s 20,056.0/s Concurrency Median Mean 99.5th 99th 90th370 1ms 4ms 61ms 44ms 10ms circle colour and size represent health and traffic volume 2 minutes of request rate to show relative changes in traffic no. of concurrent executions of this function Request rate Estimated cost Error percentage of last 10 seconds Cold start percentage last 10 seconds last minute latency percentiles 200,545 0 19 94 0 Rolling 10 second counters with 1 second granularity Successes Cold starts Timeouts Throttled Invocations Errors
  88. 88. SubscriberGetAccount 200,545 0 19 94 0 0 % 0 % Est Cost: Req Rate: $54.0/s 20,056.0/s Concurrency Median Mean 99.5th 99th 90th370 1ms 4ms 61ms 44ms 10ms circle colour and size represent health and traffic volume 2 minutes of request rate to show relative changes in traffic no. of concurrent executions of this function Request rate Estimated cost Error percentage of last 10 seconds Cold start percentage last 10 seconds last minute latency percentiles 200,545 0 19 94 0 Rolling 10 second counters with 1 second granularity Successes Cold starts Timeouts Throttled Invocations Errors
  89. 89. SubscriberGetAccount 200,545 0 19 94 0 0 % 0 % Est Cost: Req Rate: $54.0/s 20,056.0/s Concurrency Median Mean 99.5th 99th 90th370 1ms 4ms 61ms 44ms 10ms circle colour and size represent health and traffic volume 2 minutes of request rate to show relative changes in traffic no. of concurrent executions of this function Request rate Estimated cost Error percentage of last 10 seconds Cold start percentage last 10 seconds last minute latency percentiles 200,545 0 19 94 0 Rolling 10 second counters with 1 second granularity Successes Cold starts Timeouts Throttled Invocations Errors
  90. 90. SubscriberGetAccount 200,545 0 19 94 0 0 % 0 % Est Cost: Req Rate: $54.0/s 20,056.0/s Concurrency Median Mean 99.5th 99th 90th370 1ms 4ms 61ms 44ms 10ms circle colour and size represent health and traffic volume 2 minutes of request rate to show relative changes in traffic no. of concurrent executions of this function Request rate Estimated cost Error percentage of last 10 seconds Cold start percentage last 10 seconds last minute latency percentiles 200,545 0 19 94 0 Rolling 10 second counters with 1 second granularity Successes Cold starts Timeouts Throttled Invocations Errors
  91. 91. birds-eye view of our system as it lives and breathes
  92. 92. user profile-images POST /user process-images resize-images image-tasks Auth0 create-user reformat-imagestag-user Face API create-auth0-user
  93. 93. user profile-images POST /user process-images resize-images image-tasks Auth0 create-user reformat-imagestag-user Face API trace async invocations create-auth0-user
  94. 94. user profile-images POST /user process-images resize-images image-tasks Auth0 create-user reformat-imagestag-user Face API trace non-AWS resources create-auth0-user
  95. 95. user profile-images POST /user process-images resize-images image-tasks Auth0 create-user reformat-imagestag-user Face API Logs timestamp component message POST /user2018/01/25 20:51:23.188 2018/01/25 20:51:23.201 create-user 2018/01/25 20:51:23.215 create-user 2018/01/25 20:51:23.521 tag-user incoming request… saving user [theburningmonk] in the [user] table… saved user [theburningmonk] in the [user] table level debug debug debug debug tagging user [theburningmonk] with Azure Face API… create-auth0-user
  96. 96. user profile-images POST /user process-images resize-images image-tasks Auth0 create-user reformat-imagestag-user Face API Logs timestamp component message POST /user2018/01/25 20:51:23.188 2018/01/25 20:51:23.201 create-user 2018/01/25 20:51:23.215 create-user 2018/01/25 20:51:23.521 tag-user incoming request… saving user [theburningmonk] in the [user] table… saved user [theburningmonk] in the [user] table level debug debug debug debug tagging user [theburningmonk] with Azure Face API… create-auth0-user
  97. 97. user profile-images POST /user process-images resize-images image-tasks Auth0 create-user reformat-imagestag-user Face API Logs timestamp component message POST /user2018/01/25 20:51:23.188 incoming request… level debug request-id start-time 0ae4ba5d-dab1-4f9e-9de7-eace27ebfbc2 2018/01/25 20:51:23.188 method POST create-auth0-user
  98. 98. user profile-images POST /user process-images resize-images image-tasks Auth0 create-user reformat-imagestag-user Face API Logs timestamp component message 2018/01/25 20:51:23.201 create-user 2018/01/25 20:51:23.215 create-user 2018/01/25 20:51:23.585 saving user [theburningmonk] in the [user] table… saved user [theburningmonk] in the [user] table level debug debug debug uploading profile image… create-user debug tagged user [theburningmonk] with Azure Face API… create-user2018/01/25 20:51:23.587 create-auth0-user
  99. 99. user profile-images POST /user process-images resize-images image-tasks Auth0 create-user reformat-imagestag-user Face API Logs timestamp component message 2018/01/25 20:51:23.201 create-user 2018/01/25 20:51:23.215 create-user 2018/01/25 20:51:23.585 saving user [theburningmonk] in the [user] table… saved user [theburningmonk] in the [user] table level debug debug debug uploading profile image… create-user debug tagged user [theburningmonk] with Azure Face API… create-user2018/01/25 20:51:23.587 click here to go to code create-auth0-user
  100. 100. Logs Input/Output user profile-images POST /user process-images resize-images image-tasks Auth0 create-user reformat-imagestag-user Face API input output { "body": "{ "username":"theburningmonk"}", "resource": "/user", "requestContext": { "resourceId": "123456", "apiId": “1234567890", "resourcePath": "/user", { "statusCode": 200 } create-auth0-user
  101. 101. Logs Input/Output user profile-images POST /user process-images resize-images image-tasks Auth0 create-user reformat-imagestag-user Face API input output { "Records": [ { "Sns": { "Type": "Notification", "MessageId": "…", "TopicArn": "…", "Message": "…", "Timestamp": "2018/01/25 20:51:24.215", { "error": null, "result": "OK" } create-auth0-user
  102. 102. Logs Input/Output user profile-images POST /user process-images resize-images image-tasks Auth0 create-user reformat-imagestag-user Face API input error { "Records": [ { "Sns": { "Type": "Notification", "MessageId": "…", "TopicArn": "…", "Message": "…", "Timestamp": "2018/01/25 20:51:24.215", [com.spaceape.dragon.handler.ReformatProfileImageHandle r] Null reference exception *java.lang.NullReferenceException: … * at … * at … * at … create-auth0-user
  103. 103. Logs Input/Output user profile-images POST /user process-images resize-images image-tasks Auth0 create-user create-auth0-user reformat-imagestag-user Face API input error { "Records": [ { "Sns": { "Type": "Notification", "MessageId": "…", "TopicArn": "…", "Message": "…", "Timestamp": "2018/01/25 20:51:24.215", [com.spaceape.dragon.handler.ReformatProfileImageHandle r] Null reference exception *java.lang.NullReferenceException: … * at … * at … * at … !
  104. 104. All 0 200 400 600 800 create-user …user.insert_user …user.upload_img tag-user create-auto0-user process-images resize-images reformat-images! 837ms 406ms 66ms 114ms 122ms 82ms 240ms 157ms 35ms
  105. 105. All 0 200 400 600 800 create-user …user.insert_user …user.upload_img tag-user create-auto0-user process-images resize-images reformat-images! 837ms 406ms 66ms 114ms 122ms 82ms 240ms 157ms 35ms
  106. 106. Input/Output user profile-images POST /user process-images resize-images image-tasks Auth0 create-user create-auth0-user reformat-imagestag-user Face API Logs ! All 0 200 400 600 800 create-user …user.insert_user …user.upload_img tag-user create-auto0-user process-images resize-images reformat-images! 837ms 406ms 66ms 114ms 122ms 82ms 240ms 157ms 35ms
  107. 107. Input/Output user profile-images POST /user process-images resize-images image-tasks Auth0 create-user create-auth0-user reformat-imagestag-user Face API Logs ! All 0 200 400 600 800 create-user …user.insert_user …user.upload_img tag-user create-auto0-user process-images resize-images reformat-images! 837ms 406ms 66ms 114ms 122ms 82ms 240ms 157ms 35ms
  108. 108. all your needs in one placeTRACING
  109. 109. mmm… it’s a graph
  110. 110. what if we can query it like a graph?
  111. 111. http://amzn.to/2nk7uiW
  112. 112. ability to query based on the relationship between observed components (as well as the components themselves)
  113. 113. root cause analysis
  114. 114. the elevated error rate in service X was caused by DynamoDB table throttling.“ ”
  115. 115. payment was slow last night around 10PM. investigate.
  116. 116. time 95-percentile latency service A service B 10PM
  117. 117. time 95-percentile latency service A service B 10PM causality? or correlation?
  118. 118. user-service USESUSES DEPENDS_ON auth-serviceUSES payment-service DEPENDS_ON “payment was slow last night around 10PM” user-table
  119. 119. user-service USESUSES DEPENDS_ON auth-serviceUSES DEPENDS_ON payment-service user-table throttled exceptions!
  120. 120. user-table user-stream DEPENDS_ON DEPENDS_ON USES USES USES USES USES DEPENDS_ON D EPEN D S_O N DEPENDS_ON PUBLISHES_TO “what else is impacted by the throttled exceptions on user-table?”
  121. 121. user-table user-stream DEPENDS_ON DEPENDS_ON USES USES USES USES USES DEPENDS_ON D EPEN D S_O N DEPENDS_ON PUBLISHES_TO “what else is impacted by the throttled exceptions on user-table?”
  122. 122. wouldn’t that be nice?
  123. 123. MACHINE LEARNING
  124. 124. use ML to auto-detect erroneous or suspicious behaviours, or to suggest possible improvements
  125. 125. ! Function [X] just performed an unexpected write against DynamoDB table [Y]. Should I… ignore it from now on shut it down!!
  126. 126. Observability Bot <bot@bestobservability.com>
  127. 127. Observability Bot <bot@bestobservability.com> don’t bother me about this again
  128. 128. Observability Bot <bot@bestobservability.com> auto-modify IAM role with DENY rule
  129. 129. Function [X]’s performance has degraded since yesterday - 99% latency has gone up by 47% from 100ms to 147ms. !
  130. 130. ! Function [X] can run faster & cheaper if you increase its memory allocation. Should I… ignore it from now on update setting
  131. 131. zzz… the future of… zzz … serverless observability… zzz
  132. 132. Simon Wardley
  133. 133. Simon Wardley context & movement
  134. 134. However, I would argue that the health of the system no longer matters. We've entered an era where what matters is the health of each individual event, or each individual user's experience, or each shopping cart's experience (or other high cardinality dimensions). With distributed systems you don't care about the health of the system, you care about the health of the event or the slice. ”http://bit.ly/2E2QngU- Charity Majors “
  135. 135. “one user action/vertical slice through the system”
  136. 136. movement context movement
  137. 137. The best way to predict the future is to invent it. - Alan Kay
  138. 138. The best way to invent the future is to inception someone else to do it. - me
  139. 139. Serkan Özal @serkan_ozal
  140. 140. Adam Johnson @adjohn Erica Windisch @ewindisch
  141. 141. Nitzan Shapira @nitzanshapira Ran Ribenzaft @ranrib
  142. 142. Charity Majors @mipsytipsy Cindy Sridharan @copyconstruct Erica Windisch @ewindisch Liz Fong-Jones @lizthegrey JBD @rakyll
  143. 143. API Gateway and Kinesis Authentication & authorisation (IAM, Cognito) Testing Running & Debugging functions locally Log aggregation Monitoring & Alerting X-Ray Correlation IDs CI/CD Performance and Cost optimisation Error Handling Configuration management VPC Security Leading practices (API Gateway, Kinesis, Lambda) Canary deployments http://bit.ly/production-ready-serverless get 40% off with code: ytcui

As engineers, we’re empowered by advancements in cloud platforms to build ever more complex systems that can achieve amazing feats at a scale previously only possible for the elite few. The monitoring tools have evolved over the years to accommodate our growing needs with these increasingly complex systems, but the emergence of serverless technologies like AWS Lambda has shifted the landscape and broken some of the underlying assumptions that existing tools are built upon - eg. you can no longer access the underlying host to install monitoring agents/daemons, and it’s no longer feasible to use background threads to send monitoring data outside the critical path. Furthermore, event-driven architectures has become easily accessible and widely adopted by those adopting serverless technologies, and this trend has added another layer of complexity with how we monitor and debug our systems as it involves tracing executions that flow through async invocations, and often fan’d-out and fan’d-in via various event processing patterns. Join us in this talk as Yan Cui gives us an overview of the challenges with observing a serverless architecture (ephemerality, no access to host OS, no background thread for sending monitoring data, etc.), the tradeoffs to consider, and the state of the tooling for serverless observability.

Views

Total views

610

On Slideshare

0

From embeds

0

Number of embeds

19

Actions

Downloads

3

Shares

0

Comments

0

Likes

0

×