Serverless Application
Troubleshooting
I watch a lot of TV shows…
protagonist is shot
3 hours earlier… protagonist is shot
3 hours earlier… protagonist emerges victoriously
protagonist is shot
happened | system repaired | user impact
goal: to fail without users noticing
happened | system repaired | user impact
reduce MTTR
Yan Cui
http://theburningmonk.com
@theburningmonk
Developer Advocate @
Independent Consultant
AWS user since 2009
since 2018
yan@lumigo.io
What do you mean by ‘serverless’?
“Serverless”
Gojko Adzic
It is serverless the same way WiFi is wireless.
http://bit.ly/2yQgwwb
Serverless means…
don’t pay for it if no-one uses it
don’t need to worry about scaling
don’t need to provision and manage servers
in other words, it’s a lot like taking a cab
Ownership
Fuel
Navigate
To get there!
Focus on getting there!
HW Ownership
OS
Runtime & Scale
Code
Focus on getting there!
Physical Servers
Virtual Machines
Containers
Serverless
Nano Services
Self Managed
Cost Paradigm Change
Async
Dynamic agile env
happened | system repaired | user impact
reduce MTTR
Identify & Resolve Issues
Understanding costs
Visibility
happened | system repaired | user impact
MTTDiscovery
“What alerts should I have?”
It depends on what you’re building…
But, this is a good starting point
Lambda
error rate %
throttle count
DLQ error count
iterator age
regional concurrency
API Gateway
p90/95/99 latency
success rate %
4xx rate %
5xx rate %
SQS
message age
Step Functions
failed count
throttle count
timed out count
“Can’t you codify these?”
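One way to codify a few of the Lambda alerts is to generate the alarm definitions from a table and apply them to every function. The sketch below is an assumption-laden illustration, not the speaker's tooling: the thresholds, the function names and the helper names (`lambda_alarm`, `alarms_for`, `apply_alarms`) are hypothetical. The metric names are the standard `AWS/Lambda` CloudWatch metrics, and the dictionaries match the keyword arguments of boto3's `put_metric_alarm`.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical thresholds: tune these for your own workload.
# Each entry is (CloudWatch metric name, statistic, threshold).
LAMBDA_ALARMS: List[Tuple[str, str, float]] = [
    ("Throttles", "Sum", 0),
    ("DeadLetterErrors", "Sum", 0),
    ("IteratorAge", "Maximum", 60_000),  # milliseconds
]


def lambda_alarm(function_name: str, metric: str, stat: str,
                 threshold: float) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": f"{function_name}-{metric}",
        "Namespace": "AWS/Lambda",
        "MetricName": metric,
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": stat,
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


def alarms_for(function_names: List[str]) -> List[Dict[str, Any]]:
    """Expand the alarm table for every function in the list."""
    return [
        lambda_alarm(fn, metric, stat, threshold)
        for fn in function_names
        for (metric, stat, threshold) in LAMBDA_ALARMS
    ]


def apply_alarms(cloudwatch: Any, function_names: List[str]) -> None:
    """Create or update the alarms; `cloudwatch` is a boto3 CloudWatch client."""
    for alarm in alarms_for(function_names):
        cloudwatch.put_metric_alarm(**alarm)
```

With boto3 installed and credentials configured, `apply_alarms(boto3.client("cloudwatch"), ["checkout", "search"])` would create three alarms per function. Rate-based alerts such as error rate % need metric math (errors divided by invocations) and are left out of this sketch.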
happened | system repaired | user impact
finding root cause
option 1: CloudWatch & friends
https://lumigo.io/blog/getting-the-most-out-of-cloudwatch-logs/
https://lumigo.io/blog/serverless-applications-automate-chores-cloudwatch-logs/
Pros
Out of the box
No overhead
Comparatively cheap
AWS support
Cons
Complicated
Hard to query*
* Insights improved things drastically, but still a gap to ELK
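For a sense of what querying with CloudWatch Logs Insights looks like, here is a minimal sketch that builds the arguments for the CloudWatch Logs `StartQuery` API. The helper name `error_query`, the log group name and the `ERROR` pattern are assumptions for illustration; the query uses the standard Insights commands `fields`, `filter`, `sort` and `limit`.

```python
import time
from typing import Any, Dict


def error_query(log_group: str, minutes: int = 15, limit: int = 20) -> Dict[str, Any]:
    """Build keyword arguments for the CloudWatch Logs StartQuery API."""
    now = int(time.time())  # StartQuery expects epoch seconds
    return {
        "logGroupName": log_group,
        "startTime": now - minutes * 60,
        "endTime": now,
        "queryString": (
            "fields @timestamp, @requestId, @message"
            " | filter @message like /ERROR/"
            " | sort @timestamp desc"
            f" | limit {limit}"
        ),
    }
```

With boto3, `boto3.client("logs").start_query(**error_query("/aws/lambda/checkout"))` returns a `queryId`, which is then polled with `get_query_results`.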
https://lumigo.io/blog/how-to-monitor-lambda-with-cloudwatch-metrics/
Pros
Out of the box
Source of truth
No overhead*
Comparatively cheap
AWS support
* unless you record custom metrics synchronously
Cons
Missing metrics**
Lambda percentile latencies don’t work
Only granular to 1 min
No query language
** can compensate with custom metrics/metric filters, etc.
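One way to record custom metrics without a synchronous API call inside the function is to write them to the logs and let CloudWatch extract them. The sketch below uses the CloudWatch embedded metric format, which the slides do not name, so treat it as a swapped-in illustration of the idea. The namespace, metric name, dimension and `emf_record` helper are hypothetical.

```python
import json
import time
from typing import Dict


def emf_record(namespace: str, name: str, value: float, unit: str,
               dimensions: Dict[str, str]) -> str:
    """Serialise one metric in the CloudWatch embedded metric format."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": name, "Unit": unit}],
                }
            ],
        },
        name: value,   # the metric value lives at the top level
        **dimensions,  # so do the dimension values
    }
    return json.dumps(record)


def handler(event, context):
    start = time.time()
    # ... business logic would go here ...
    elapsed_ms = (time.time() - start) * 1000
    # Printing to stdout costs no extra network round trip in the invocation.
    print(emf_record("my-app", "checkout_latency", elapsed_ms,
                     "Milliseconds", {"service": "checkout"}))
    return {"statusCode": 200}
```

The trade-off is that the metric arrives via the log pipeline rather than immediately, in exchange for keeping the extra latency out of the invocation.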
Pros
Out of the box
SDK
No overhead
Comparatively cheap
AWS support
Cons
Poor async support
No auto-instrumentation
Bad DX (for node.js)
Poor documentation
option 2: custom built solutions
https://github.com/getndazn/dazn-lambda-powertools
Structured Logging
Sampling
Correlation IDs
Auto “instrumentation”
Support async events
enrich the usefulness of your logs
https://theburningmonk.com/2017/08/centralised-logging-for-aws-lambda/
https://theburningmonk.com/2018/07/centralised-logging-for-aws-lambda-revised-2018/
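To make structured logging, sampling and correlation IDs concrete, here is a minimal sketch of how the three can fit together. It is not the API of dazn-lambda-powertools. The header names, the 1% sample rate and the `Logger` and `logger_for` names are assumptions for illustration.

```python
import json
import random
import uuid
from typing import Any, Callable, Dict, Optional

DEBUG_SAMPLE_RATE = 0.01  # hypothetical: debug-log roughly 1% of invocations


class Logger:
    """Writes one JSON object per line, always tagged with the correlation ID."""

    def __init__(self, correlation_id: str, debug_enabled: bool):
        self.correlation_id = correlation_id
        self.debug_enabled = debug_enabled

    def _log(self, level: str, message: str, **fields: Any) -> str:
        line = json.dumps({
            "level": level,
            "message": message,
            "x-correlation-id": self.correlation_id,
            **fields,
        })
        print(line)
        return line

    def info(self, message: str, **fields: Any) -> str:
        return self._log("INFO", message, **fields)

    def debug(self, message: str, **fields: Any) -> Optional[str]:
        # Debug lines are only written for sampled invocations.
        if not self.debug_enabled:
            return None
        return self._log("DEBUG", message, **fields)

    def outgoing_headers(self) -> Dict[str, str]:
        """Headers to forward so downstream functions share ID and sampling."""
        return {
            "x-correlation-id": self.correlation_id,
            "x-debug-log-enabled": "true" if self.debug_enabled else "false",
        }


def logger_for(event: Dict[str, Any],
               rng: Callable[[], float] = random.random) -> Logger:
    """Reuse the caller's correlation ID and sampling decision, else create them."""
    headers = event.get("headers") or {}
    correlation_id = headers.get("x-correlation-id") or str(uuid.uuid4())
    debug_enabled = (headers.get("x-debug-log-enabled") == "true"
                     or rng() < DEBUG_SAMPLE_RATE)
    return Logger(correlation_id, debug_enabled)
```

Forwarding the sampling decision along with the correlation ID means a sampled request produces debug logs across the whole call chain, not just in the first function.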
Pros
Tailor fit
Free!
Cons
Very high-touch
Not all services are supported equally
Tailor fit (for someone else…)
option 3: serverless monitoring solutions
Pros
SaaS
Serverless focus
More than just tracing
Very low touch
Cons
Yet another 3rd party
More than just tracing
Takeaways
Serverless is a game-changer
Serverless has challenges
Options for troubleshooting serverless applications
https://info.lumigo.io/serverless-consulting
Start off on the right foot
@theburningmonk
theburningmonk.com
github.com/theburningmonk
yan@lumigo.io
