MIND
THE
POTHOLES
MIND
THE
POTHOLES
Yan Cui
http://theburningmonk.com
@theburningmonk
AWS user for 10 years
http://bit.ly/yubl-serverless
Yan Cui
http://theburningmonk.com
@theburningmonk
Developer Advocate @
Yan Cui
http://theburningmonk.com
@theburningmonk
Independent Consultant
What do you mean
by ‘serverless’?
“Serverless”
Gojko Adzic
It is serverless the same way
WiFi is wireless.
http://bit.ly/2yQgwwb
Serverless means…
don’t pay for it if no-one uses it
don’t need to worry about scaling
don’t need to provision and manage servers
in other words, it’s a lot like taking a cab
Ownership
Fuel
Navigate
To get there!
Focus on
getting there!
HW Ownership
OS
Runtime & Scale
Code
Focus on
getting there!
Physical
Servers
Virtual
Machines
Containers Serverless
Nano Services Self Managed Cost Paradigm
ChangeAsync
Dynamic agile env
“why are we failing at this?”
hidden dangers
monolith microservices serverless
monolith microservices serverless
observability
distributed
systems
bounded
context
monolith microservices serverless
observability
distributed
systems
bounded
context
monolith microservices serverless
observability
distributed
systems
bounded
context
event driven
monolith serverless
missing learnings
from microservices
monolith serverless
missing learnings
from microservices
poor decisions
#1
not letting go of legacy
thinking
“we’re doing serverless,
but why aren’t thing
going faster?”
Socio Technical
there are no silver bullets
centralised team
Team A Team B Team C Team D …
“but the developers don’t understand AWS and how
our infrastructure is set up”
“but the developers don’t understand AWS and how
our infrastructure is set up”
let’s solve this
problem instead!
what got you here won’t get you there
if (path == “/user” && method == “GET”) {
return getUser(…);
} else if (path == “/user” && method == “DELETE”) {
return deleteUser(…);
} else if (path == “/user” && method == “POST”) {
return createUser(…);
} else if ….
Monolithic Functions
GET /user
POST /user
DELETE /user
Single-Purposed Functions
author: yan.cui
feature: user-api
user-api-dev
Monolithic Single-Purposed
author: yan.cui
feature: user-api
user-api-dev-get-user
author: yan.cui
feature: user-api
user-api-dev-create-user
author: yan.cui
feature: user-api
user-api-dev-delete-user
author: yan.cui
feature: user-api
user-api-dev
Monolithic Single-Purposed
author: yan.cui
feature: user-api
user-api-dev-get-user
author: yan.cui
feature: user-api
user-api-dev-create-user
author: yan.cui
feature: user-api
user-api-dev-delete-user
find related
functions by prefix
author: yan.cui
feature: user-api
user-api-dev
Monolithic Single-Purposed
author: yan.cui
feature: user-api
user-api-dev-get-user
author: yan.cui
feature: user-api
user-api-dev-create-user
author: yan.cui
feature: user-api
user-api-dev-delete-user
discoverability
(without having to dig into the code)
author: yan.cui
feature: user-api
user-api-dev
Monolithic Single-Purposed
author: yan.cui
feature: user-api
user-api-dev-get-user
author: yan.cui
feature: user-api
user-api-dev-create-user
author: yan.cui
feature: user-api
user-api-dev-delete-user
what does it do?
author: yan.cui
feature: user-api
user-api-dev
Monolithic Single-Purposed
author: yan.cui
feature: user-api
user-api-dev-get-user
author: yan.cui
feature: user-api
user-api-dev-create-user
author: yan.cui
feature: user-api
user-api-dev-delete-user
dynamodb:GetItem
dynamodb:PutItem
dynamodb:DeleteItem
author: yan.cui
feature: user-api
user-api-dev
Monolithic Single-Purposed
author: yan.cui
feature: user-api
user-api-dev-get-user
author: yan.cui
feature: user-api
user-api-dev-create-user
author: yan.cui
feature: user-api
user-api-dev-delete-user
dynamodb:GetItem
dynamodb:PutItem
dynamodb:DeleteItem
no least privilege…
author: yan.cui
feature: user-api
user-api-dev
Monolithic Single-Purposed
author: yan.cui
feature: user-api
user-api-dev-get-user
author: yan.cui
feature: user-api
user-api-dev-create-user
author: yan.cui
feature: user-api
user-api-dev-delete-user
require(x)
require(y)
require(z)
author: yan.cui
feature: user-api
user-api-dev
Monolithic Single-Purposed
author: yan.cui
feature: user-api
user-api-dev-get-user
author: yan.cui
feature: user-api
user-api-dev-create-user
author: yan.cui
feature: user-api
user-api-dev-delete-user
require(x)
require(y)
require(z)
more dependecies equals
slower cold start
author: yan.cui
feature: user-api
user-api-dev
Monolithic Single-Purposed
author: yan.cui
feature: user-api
user-api-dev-get-user
author: yan.cui
feature: user-api
user-api-dev-create-user
author: yan.cui
feature: user-api
user-api-dev-delete-user
require(x)
require(y)
require(z)
worse cold start
performance
keep functions simple, and single-purposed
#2
one account that rules
them all
mind the shared limits
no. of DynamoDB tables
no. of API Gateway regional APIs
no. of API Gateway edge-optimized APIs
no. of Kinesis shards
no. of IAM roles
no. of S3 buckets
no. of CloudFormation stacks
no. of SNS subscription filters
no. of SSM parameters
…
Resource Limits
DynamoDB read & write
API Gateway requests/second
Lambda concurrent executions
SSM parameter ops/second
…
Throughput Limits
One account per Team per Environment
compartmentalise security breaches
#3
do first, research later
https://einaregilsson.com/serverless-15-percent-slower-and-eight-times-more-expensive/
the platforms need to do better at educating users on
how to choose between different services
SNS vs SQS vs Kinesis vs MKS?
the platforms need to do better at educating users on
how to choose between different services
ordering
replay events
Kinesis SQS SNS
by shard
none (standard)
global (FIFO)
none
up to 7 days none none
mode
retry
batched batched (up to 10) singular
retried until
success
retry + DLQ retry + DLQ
concurrency 1 per shard auto-scaled fan-out!!!
subscribers many one-to-one many
https://medium.com/theburningmonk-com/all-my-posts-on-serverless-aws-lambda-43c17a147f91
https://www.jeremydaly.com/newsletter/
#4
not using a deployment
toolkit
https://lumigo.io/blog/comparison-of-lambda-deployment-frameworks/
don’t write your own deployment framework
#5
console-driven
development
#6
one repo per function
github
repo
github
repo
github
repo
github
repo
github
repo
github
repo
github
repo
github
repo
github
repo
github
repo
github
repo
github
repo
github
repo
github
repo
github
repo
github
repo
github
repo
github
repo
monorepo?
github
repo
https://lumigo.io/blog/mono-repo-vs-one-per-service/
one repo per service?
github
repo
github
repo
github
repo
github
repo
user-api
timeline-api
relationship-api
search-api
CI/CD pipeline per service
functions are deployed together, as a stack
unencrypted secrets
in env vars
#7
secrets should NEVER be in plain text in env variables
SSM Parameter Store
Secret 1
Secret 2
IAM
Environment:
SECRET_1: …
SECRET_2: …
Environment:
SECRET_1: …
SECRET_2: …
SSM Parameter Store
Secret 1
Secret 2
IAM
Environment:
SECRET_1: …
SECRET_2: …
Environment:
SECRET_1: …
SECRET_2: …
yay!
SSM Parameter Store
Secret 1
Secret 2
IAM
fetch at cold start,
cache,
invalidate every x mins
https://github.com/middyjs/middy
SSM Parameter Store
Secret 1
Secret 2
IAM
switch to Higher
Throughput if you need
more than 40 ops/s
not following least
privilege principle
#8
missing DLQs
#9
async sync
S3
SNS
SES
CloudFormation
CloudWatch Logs
CloudWatch Events
Scheduled Events
CodeCommit
AWS Config
http://amzn.to/2v7Kc3b
Cognito
Alexa
Lex
API Gateway
pulling
DynamoDB Stream
Kinesis Stream
SQS
async sync
S3
SNS
SES
CloudFormation
CloudWatch Logs
CloudWatch Events
Scheduled Events
CodeCommit
AWS Config
http://amzn.to/2vs2lIg
Cognito
Alexa
Lex
API Gateway
pulling
DynamoDB Stream
Kinesis Stream
SQS
Lambda handles retries
(twice, then DLQ)
configure DLQ for async functions so you don’t lose failed events
too much/too little
concurrency
#10
“Lambda generates too much load for the downstream system”
one invocation
per message
SNS
Lambda
Downstream
System
SNS
Lambda
ordering
replay events
Kinesis SQS SNS
by shard
none (standard)
global (FIFO)
none
up to 7 days none none
mode
retry
batched batched (up to 10) singular
retried until
success
retry + DLQ retry + DLQ
concurrency 1 per shard auto-scaled fan-out!!!
subscribers many one-to-one many
if you want…
maximum
throughput
SNS
precise control
over throughput
Kinesis
if you want…
maximum
throughput
SNS
precise control
over throughput
Kinesis
how quickly it scales out
if you want…
maximum
throughput
SNS
precise control
over throughput
Kinesis
how quickly it scales out
SQS DynamoDB
Streams
ordering
replay events
Kinesis SQS SNS
by shard
none (standard)
global (FIFO)
none
up to 7 days none none
mode
retry
batched batched (up to 10) singular
retried until
success
retry + DLQ retry + DLQ
concurrency 1 per shard auto-scaled fan-out!!!
subscribers many one-to-one many
cold starts
#11
“cold starts only happen to the first request”
function invocationconcurrent execution
i.e. a container
function invocationconcurrent execution
i.e. a container
class instance method call
Lambda scales the number of concurrent executions
based on traffic
existing “containers” are reused where possible
time
invocation
time
invocation
invocation
time
invocation
invocation
time
invocation
invocation
invocation
invocation
time
invocation
invocation
invocation
invocation
invocation
invocation
time
invocation
invocation
invocation
invocation
invocation
invocation
time
invocation
invocation
invocation
invocation
invocation
invocation
invocation
time
invocation
invocation
invocation
invocation
invocation
invocation
invocation invocation
time
invocation
invocation
invocation
invocation
invocation
invocation
invocation invocation
time
invocation
invocation
invocation
invocation
invocation
invocation
invocation invocation
time
invocation
invocation
ping
invocation
invocation
invocation
ping ping
Lambda warmers don’t work when you have > 1
concurrent executions
FREQUENCY DURATION
FREQUENCY DURATION
dictated by user traffic,
out of your control
cold starts is generally not an issue if you have a
steady traffic pattern
time
req/s
time
req/s
El Classico
time
req/s
lunch dinner
FREQUENCY DURATION
optimize this!
minimise the duration of cold starts so they
fall within acceptable latency range
RDS connection
handling
#12
default RDS configs are bad for Lambda
default RDS configs are bad for Lambda
idle connections are
not closed
too many connections
per “container”
max open connection
is too low
https://www.jeremydaly.com/manage-rds-connections-aws-lambda/
set “wait_timeout” and “interactive_timeout” to 10 mins
(default is 8 hours!)
increase “max_connections” setting
set client socket pool size to 1
(lack of) observability
#13
happened system repaireduser impact
reduce MTTR
Identify & Resolve
Issues
Understanding
costs
Visibility
Identify & Resolve
Issues
Understanding
costs
Visibility
happened system repaireduser impact
MTTDiscovery
“What alerts should I have?”
It depends on what you’re building…
But, this is a good starting point
Lambda
error rate %
throttle count
DLR error count
iterator age
regional concurrency
Lambda
error rate %
throttle count
DLR error count
iterator age
regional concurrency
API Gateway
p90/95/99 latency
success rate %
4xx rate %
5xx rate %
API Gateway
p90/95/99 latency
success rate %
4xx rate %
5xx rate %
SQS
message age
Lambda
error rate %
throttle count
DLR error count
iterator age
regional concurrency
API Gateway
p90/95/99 latency
success rate %
4xx rate %
5xx rate %
SQS
message age
Step Functions
failed count
throttle count
timed out count
Lambda
error rate %
throttle count
DLR error count
iterator age
regional concurrency
SQS
message age
Step Functions
failed count
throttle count
timed out count
API Gateway
p90/95/99 latency
success rate %
4xx rate %
5xx rate %
Lambda
error rate %
throttle count
DLR error count
iterator age
regional concurrency
“Can’t you codify these?”
https://theburningmonk.com/hire-me
AdviseTraining Delivery
“Fundamentally, Yan has improved our team by increasing our
ability to derive value from AWS and Lambda in particular.”
Nick Blair
Tech Lead
@theburningmonk
theburningmonk.com
github.com/theburningmonk

Beware the potholes