applying the best parts of Microservices
to Serverless
Yan Cui
http://theburningmonk.com
@theburningmonk
Principal Engineer @
follow @DAZN_ngnrs
for updates about the
engineering team
We’re hiring! Visit
engineering.dazn.com
to learn more.
2006
2010
2016
SQL → NoSQL
OOP → Functional
On Premise → Cloud
Waterfall → Agile
Monoliths → Microservices
Server-ful → Serverless
https://en.wikipedia.org/wiki/Hype_cycle
https://gtnr.it/2KGyGCM
what’s this?
this solves all my problems!
this is rubbish!
I’m starting to get it..
I know what I’m doing
SQL → NoSQL
OOP → Functional
On Premise → Cloud
Waterfall → Agile
Monoliths → Microservices
Server-ful → Serverless
“those who cannot remember the
past are condemned to repeat it”
- George Santayana
what’s this?
this solves all my problems!
this is rubbish!
I’m starting to get it..
I know what I’m doing
lesson 1. don’t fly blind
2017
observability
http://bit.ly/2EXQZBj
http://bit.ly/2EXKEFZ
“These are the four pillars of the Observability Engineering team’s charter:
• Monitoring
• Alerting/visualization
• Distributed systems tracing infrastructure
• Log aggregation/analytics”
- Observability Engineering at Twitter, http://bit.ly/2DnjyuW
NO ACCESS
to underlying OS
NOWHERE
to install agents/daemons
critical paths (user request → handler): minimise user-facing latency
background processing (handler → StatsD, rsyslog): batched, asynchronous, low overhead
with Lambda, NO background processing except what the platform provides
2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae
GOT is off air, what do I do now?
UTC Timestamp | Request Id | your log message
one log group per
function
one log stream for each
concurrent invocation
“logs are not easily searchable in CloudWatch Logs” - me
CloudWatch Logs → AWS Lambda → ELK stack
http://bit.ly/lambda-log-aggregation
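A minimal sketch of the shipping function in the middle of that pipeline, assuming a CloudWatch Logs subscription triggers the Lambda; sendToElasticsearch is a hypothetical helper standing in for the actual delivery to the ELK stack (see the link above for a fuller treatment):

// sketch of a log-shipping function: CloudWatch Logs -> Lambda -> ELK
const zlib = require('zlib');

module.exports.handler = async (event) => {
  // CloudWatch Logs subscriptions deliver a base64-encoded, gzipped payload
  const payload = Buffer.from(event.awslogs.data, 'base64');
  const json = JSON.parse(zlib.gunzipSync(payload).toString('utf8'));

  // json.logEvents is an array of { id, timestamp, message }
  const docs = json.logEvents.map(e => ({
    logGroup: json.logGroup,
    logStream: json.logStream,
    timestamp: e.timestamp,
    message: e.message
  }));

  await sendToElasticsearch(docs); // hypothetical: bulk-index into the ELK stack
};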
“you need to use structured logging” - me
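For example, a minimal structured logger (a sketch, not any particular library) emits one JSON object per log line, so the fields can be indexed and searched downstream:

// minimal structured logger: one JSON object per log line
const log = (level, message, fields = {}) =>
  console.log(JSON.stringify({
    level,
    message,
    functionName: process.env.AWS_LAMBDA_FUNCTION_NAME, // set by the Lambda runtime
    timestamp: new Date().toISOString(),
    ...fields
  }));

log('info', 'hydrating yubls from db', { userId: '42' });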
CloudWatch Logs
$0.50 per GB ingested
$0.03 per GB archived per month
for comparison, 1M invocations of a 128MB function (at 100ms each) =
$0.000000208 * 1M + $0.20 = $0.408
DON’T leave debug logging ON in production
“you need to sample debug logs in production” - me
chart: volume of logs vs. observability
• all debug logs: $$$$$$
• no debug logs: hurts mean time to resolution (MTTR) during a production incident
• sampled debug logs
“always log the invocation event on error” - me
http://bit.ly/lambda-sample-debug-logs
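A sketch of both ideas, assuming a hypothetical SAMPLE_DEBUG_RATE environment variable and a hypothetical doWork function: enable debug logging for a small percentage of invocations, and always log the full invocation event when the handler throws (see the link above for a fuller write-up):

// sample debug logs in production, and log the invocation event on error
// SAMPLE_DEBUG_RATE is a hypothetical env var, e.g. "0.01" for 1% of invocations
const sampleRate = parseFloat(process.env.SAMPLE_DEBUG_RATE || '0.01');

module.exports.handler = async (event, context) => {
  const debugEnabled = Math.random() < sampleRate;
  const debug = (msg) => { if (debugEnabled) console.log(JSON.stringify({ level: 'debug', msg })); };

  try {
    debug('processing event');
    return await doWork(event); // hypothetical business logic
  } catch (err) {
    // always log the invocation event on error, so the failure can be debugged or replayed
    console.error(JSON.stringify({
      level: 'error',
      msg: err.message,
      awsRequestId: context.awsRequestId,
      invocationEvent: event
    }));
    throw err;
  }
};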
“what about metrics?”
diagram: my code → send metrics (over the internet)
press button → something happens
those extra 10-20ms for sending custom metrics compound when you have microservices and multiple APIs are called within a single user interaction
Amazon found every 100ms of latency cost them 1% in sales.
http://bit.ly/2EXPfbA
console.log("hydrating yubls from db…");
console.log("fetching user info from user-api");
console.log("MONITORING|1489795335|27.4|latency|user-api-latency");
console.log("MONITORING|1489795335|8|count|yubls-served");
timestamp | metric value | metric type | metric name
logs + metrics → CloudWatch Logs → AWS Lambda → logs to the ELK stack, metrics to CloudWatch
API Gateway functions: send custom metrics asynchronously
SNS, Kinesis, S3, … functions: send custom metrics as part of the function invocation
http://bit.ly/2Dpidje
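A sketch of the second approach, using the aws-sdk (v2) CloudWatch client to publish a metric during the invocation, which is fine when no caller is waiting on the response; the namespace and metric name here are made up:

// send a custom metric as part of the invocation (SNS/Kinesis/S3-triggered functions)
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

const recordLatency = (name, ms) =>
  cloudwatch.putMetricData({
    Namespace: 'theburningmonk.com', // hypothetical namespace
    MetricData: [{
      MetricName: name,
      Unit: 'Milliseconds',
      Value: ms,
      Dimensions: [{ Name: 'FunctionName', Value: process.env.AWS_LAMBDA_FUNCTION_NAME }]
    }]
  }).promise();

// API Gateway functions would instead console.log a MONITORING|... line (as above)
// and let a background process publish it, to keep user-facing latency down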
functions are often chained together via asynchronous invocations
event sources: SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES
tracing ASYNCHRONOUS invocations through so many different event sources is difficult
X-Ray
traces do not span over API Gateway
narrow focus on a function: good for homing in on performance issues for a particular function, but offers little to help you build intuition about how your system operates as a whole.
traces don’t span over async invocations: good for identifying the dependencies of a function, but not good enough for tracing the entire call chain as user requests/data flow through the system via async event sources.
traces don’t span over non-AWS services
Nitzan Shapira
@nitzanshapira
Ran Ribenzaft
@ranrib
correlation IDs*
* e.g. request-id, user-id, yubl-id, etc.
kinesis client
http client
sns client
http://bit.ly/lambda-correlation-ids
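A sketch of the idea for an API Gateway-triggered function: capture the correlation ID from the incoming request (or mint one), include it in every log line, and forward it on outgoing calls; the x-correlation-id header name is just a convention here, and callUserApi is a hypothetical downstream client:

// capture, log and forward a correlation ID
const { randomUUID } = require('crypto');

module.exports.handler = async (event) => {
  const correlationId =
    (event.headers && event.headers['x-correlation-id']) || randomUUID();

  console.log(JSON.stringify({ level: 'info', msg: 'handling request', correlationId }));

  // forward the correlation ID to downstream services
  await callUserApi({ headers: { 'x-correlation-id': correlationId } }); // hypothetical client

  return { statusCode: 200 };
};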
lesson 2. no shared DBs
shared DBs create TIGHT COUPLING
between services
build loosely-coupled systems through events
diagram: services A, B, C and D, each inside its own bounded context, communicating through events rather than a shared DB
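As a sketch of what “through events” can look like in practice, service A might publish a domain event to an SNS topic that the other services subscribe to, instead of letting them read its tables; the topic env var and event shape here are made up:

// service A publishes a domain event instead of exposing its database
const AWS = require('aws-sdk');
const sns = new AWS.SNS();

const publishUserCreated = (user) =>
  sns.publish({
    TopicArn: process.env.USER_CREATED_TOPIC_ARN, // hypothetical topic owned by service A
    Message: JSON.stringify({
      type: 'user-created',
      userId: user.id,
      createdAt: new Date().toISOString()
    })
  }).promise();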
lesson 3. spiky load between services
service A → service B: downstream systems might not be as scalable
service A → Kinesis → Lambda → service B:
concurrency == no. of shards
retried until success
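A sketch of the consuming side shows why Kinesis smooths out the spikes: records arrive in batches, concurrency is capped by the number of shards, and a failed batch is retried until it succeeds (record format per the standard Kinesis-to-Lambda integration; processRecord is hypothetical):

// Kinesis-triggered consumer in service B
module.exports.handler = async (event) => {
  // records arrive in batches; one concurrent invocation per shard
  for (const record of event.Records) {
    const payload = JSON.parse(
      Buffer.from(record.kinesis.data, 'base64').toString('utf8'));
    await processRecord(payload); // hypothetical business logic
  }
  // throwing here makes Lambda retry the whole batch until it succeeds
};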
lesson 4. failures are inevitable
complex distributed systems fail in.. well, complex, sometimes cascading ways..
the only way to truly know your system’s
resilience against failures is to test it
through controlled experiments
there is more inherent chaos and complexity in a Serverless architecture
smaller units of deployment
but A LOT more of them!
more difficult to harden
around boundaries
serverful vs. serverless
event sources: SNS, Kinesis, CloudWatch Events, CloudWatch Logs, IoT, DynamoDB, S3, SES
more intermediary services, and greater variety too
each with its own set of failure modes
more configurations,
more opportunities for misconfiguration
serverful vs. serverless
more unknown failure modes in
infrastructure that we don’t control
often there’s little we can do when
an outage occurs in the platform
improperly tuned timeouts
missing error handling
missing fallback when downstream is unavailable
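A sketch of the kind of hardening those points call for, with a short timeout on the downstream call and a fallback when it is unavailable; the 2-second budget and fetchRecommendations are illustrative:

// call a downstream service with a deliberately short timeout, and fall back when it fails
const withTimeout = (promise, ms) =>
  Promise.race([
    promise,
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error('timed out')), ms))
  ]);

const getRecommendations = async (userId) => {
  try {
    // leave enough of the invocation’s time budget to still respond gracefully
    return await withTimeout(fetchRecommendations(userId), 2000); // hypothetical downstream call
  } catch (err) {
    console.error(JSON.stringify({ level: 'error', msg: err.message, userId }));
    return []; // fallback: degrade gracefully instead of failing the whole request
  }
};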
FAILURE INJECTION
inject failures
validate failure handling
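A sketch of what injecting failures can look like, using hypothetical environment variables to add latency or throw errors for a configurable fraction of invocations, so the failure handling above can be validated in a controlled experiment:

// failure injection wrapper, driven by hypothetical env vars
// CHAOS_FAILURE_RATE: fraction of invocations to fail, CHAOS_LATENCY_MS: delay to add
const injectFailures = (handler) => async (event, context) => {
  const failureRate = parseFloat(process.env.CHAOS_FAILURE_RATE || '0');
  const latencyMs = parseInt(process.env.CHAOS_LATENCY_MS || '0', 10);

  if (latencyMs > 0) {
    await new Promise(resolve => setTimeout(resolve, latencyMs));
  }
  if (Math.random() < failureRate) {
    throw new Error('injected failure'); // validate that callers time out / fall back correctly
  }
  return handler(event, context);
};

module.exports.handler = injectFailures(async (event) => {
  return { statusCode: 200 }; // the real handler goes here
});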
Recap
Server-ful → Serverless
“those who cannot remember the
past are condemned to repeat it”
- George Santayana
don’t fly blind
no shared DBs
amortize spiky load between services
failures are inevitable
@theburningmonk
theburningmonk.com
github.com/theburningmonk

Apply best parts of microservices to serverless