Beware the potholes

MIND THE POTHOLES

Yan Cui
http://theburningmonk.com
@theburningmonk
AWS user for 10 years

Developer Advocate @

Independent Consultant

What do you mean
by ‘serverless’?

“It is serverless the same way WiFi is wireless.”
Gojko Adzic
http://bit.ly/2yQgwwb

Serverless means…
don’t pay for it if no-one uses it
don’t need to worry about scaling
don’t need to provision and manage servers

in other words, it’s a lot like taking a cab

Ownership
Fuel
Navigate
To get there!
Focus on
getting there!

HW Ownership
OS
Runtime & Scale
Code
Focus on getting there!
Physical Servers, Virtual Machines, Containers, Serverless

Nano Services
Self Managed
Cost
Paradigm Change
Async
Dynamic agile env

“why are we failing at this?”

monolith microservices serverless

observability
distributed systems
bounded context
event driven

monolith serverless
missing learnings
from microservices
poor decisions

#1 not letting go of legacy thinking

“we’re doing serverless, but why aren’t things going faster?”

centralised team
Team A Team B Team C Team D …

“but the developers don’t understand AWS and how
our infrastructure is set up”
let’s solve this
problem instead!

what got you here won’t get you there

if (path == "/user" && method == "GET") {
  return getUser(…);
} else if (path == "/user" && method == "DELETE") {
  return deleteUser(…);
} else if (path == "/user" && method == "POST") {
  return createUser(…);
} else if …
Monolithic Functions

GET /user
POST /user
DELETE /user
Single-Purposed Functions
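
A minimal sketch of the single-purposed layout, assuming Node.js and API Gateway proxy-style events. The in-memory `users` map is a hypothetical stand-in for a real data store, and the functions are kept synchronous so the sketch runs on its own; in a real deployment each one would live in its own module and be exported as an async handler.

```javascript
// Hypothetical in-memory store standing in for a real database table.
const users = new Map();

// POST /user: only ever writes one item.
function createUser(event) {
  const user = JSON.parse(event.body);
  users.set(user.id, user);
  return { statusCode: 201, body: JSON.stringify(user) };
}

// GET /user: only ever reads one item.
function getUser(event) {
  const user = users.get(event.queryStringParameters.id);
  if (!user) return { statusCode: 404, body: "" };
  return { statusCode: 200, body: JSON.stringify(user) };
}

// DELETE /user: only ever deletes one item.
function deleteUser(event) {
  users.delete(event.queryStringParameters.id);
  return { statusCode: 204, body: "" };
}
```

Each function does one thing, so there is no routing chain to maintain, and the name of the function already says what it does.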

Monolithic
user-api-dev (author: yan.cui, feature: user-api)
Single-Purposed
user-api-dev-get-user (author: yan.cui, feature: user-api)
user-api-dev-create-user (author: yan.cui, feature: user-api)
user-api-dev-delete-user (author: yan.cui, feature: user-api)

find related functions by prefix

discoverability (without having to dig into the code)

what does it do?

user-api-dev
dynamodb:GetItem
dynamodb:PutItem
dynamodb:DeleteItem
no least privilege…
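
With single-purposed functions, each function can be granted only the action it uses. The sketch below illustrates that idea under stated assumptions: the mapping of function names to actions and the `tableArn` argument are hypothetical, and only the shape of the returned object follows the standard IAM policy document format.

```javascript
// Hypothetical mapping from each single-purposed function to the only
// DynamoDB action it needs.
const actionsByFunction = {
  "user-api-dev-get-user": ["dynamodb:GetItem"],
  "user-api-dev-create-user": ["dynamodb:PutItem"],
  "user-api-dev-delete-user": ["dynamodb:DeleteItem"],
};

// Build an IAM policy document for one function.
// tableArn is a placeholder supplied by the caller.
function policyFor(functionName, tableArn) {
  const actions = actionsByFunction[functionName];
  if (!actions) throw new Error(`unknown function: ${functionName}`);
  return {
    Version: "2012-10-17",
    Statement: [{ Effect: "Allow", Action: actions, Resource: tableArn }],
  };
}
```

A compromised `get-user` function could then read items but not write or delete them, whereas the monolithic function would need all three actions.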

more dependencies equals slower cold start

user-api-dev
require(x)
require(y)
require(z)
worse cold start performance

keep functions simple, and single-purposed

#2 one account that rules them all

no. of DynamoDB tables
no. of API Gateway regional APIs
no. of API Gateway edge-optimized APIs
no. of Kinesis shards
no. of IAM roles
no. of S3 buckets
no. of CloudFormation stacks
no. of SNS subscription filters
no. of SSM parameters
…
Resource Limits

DynamoDB read & write
API Gateway requests/second
Lambda concurrent executions
SSM parameter ops/second
…
Throughput Limits

One account per Team per Environment

compartmentalise security breaches

https://einaregilsson.com/serverless-15-percent-slower-and-eight-times-more-expensive/

SNS vs SQS vs Kinesis vs MSK?
the platforms need to do better at educating users on
how to choose between different services

               Kinesis                SQS                             SNS
ordering       by shard               none (standard), global (FIFO)  none
replay events  up to 7 days           none                            none
mode           batched                batched (up to 10)              singular
retry          retried until success  retry + DLQ                     retry + DLQ
concurrency    1 per shard            auto-scaled                     fan-out!!!
subscribers    many                   one-to-one                      many

https://medium.com/theburningmonk-com/all-my-posts-on-serverless-aws-lambda-43c17a147f91

https://www.jeremydaly.com/newsletter/

#4 not using a deployment toolkit

https://lumigo.io/blog/comparison-of-lambda-deployment-frameworks/

don’t write your own deployment framework

github repo (× 9)

https://lumigo.io/blog/mono-repo-vs-one-per-service/

github repo: user-api
github repo: timeline-api
github repo: relationship-api
github repo: search-api

functions are deployed together, as a stack

#7 unencrypted secrets in env vars

secrets should NEVER be in plain text in env variables

SSM Parameter Store
Secret 1
Secret 2
IAM
Environment:
SECRET_1: …
SECRET_2: …
Environment:
SECRET_1: …
SECRET_2: …
yay!

SSM Parameter Store
Secret 1
Secret 2
IAM
fetch at cold start,
cache,
invalidate every x mins
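
The fetch, cache and invalidate pattern can be sketched as below. This is an illustration under stated assumptions, not the middy implementation: `cachedSecret`, `fetchSecret`, `ttlMs` and `now` are hypothetical names, and the loader is injected so the sketch runs without an AWS client.

```javascript
// fetchSecret is any function that loads the secret (for example a call to
// SSM Parameter Store); it is injected so the sketch stays self-contained.
function cachedSecret(fetchSecret, ttlMs, now = Date.now) {
  let value;
  let expiresAt = 0; // forces a fetch on first use after a cold start

  return () => {
    if (now() >= expiresAt) {
      // If fetchSecret returns a promise, the promise itself is cached and
      // callers can await it. Real code should also handle failed fetches.
      value = fetchSecret();
      expiresAt = now() + ttlMs;
    }
    return value;
  };
}
```

Because the cache lives in a variable outside the handler, it survives across invocations of the same container, and the secret never has to appear in an environment variable.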

https://github.com/middyjs/middy

SSM Parameter Store: switch to Higher Throughput if you need more than 40 ops/s

#8 not following least privilege principle

async: S3, SNS, SES, CloudFormation, CloudWatch Logs, CloudWatch Events, Scheduled Events, CodeCommit, AWS Config
Lambda handles retries (twice, then DLQ)
sync: Cognito, Alexa, Lex, API Gateway
pulling: DynamoDB Stream, Kinesis Stream, SQS
http://amzn.to/2v7Kc3b
http://amzn.to/2vs2lIg

configure DLQ for async functions so you don’t lose failed events
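
The retry-then-DLQ behaviour can be pictured with a toy simulation. This is not how a DLQ is configured and it does not call any AWS API; `invokeAsync`, `dlq` and `maxRetries` are hypothetical names used only to show where a failed event ends up.

```javascript
// Simulates an async invocation: one initial attempt plus two retries,
// then the event goes to the dead-letter queue instead of being lost.
function invokeAsync(handler, event, dlq, maxRetries = 2) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return handler(event);
    } catch (err) {
      // Swallow the error and retry; the real service also waits between attempts.
    }
  }
  dlq.push(event); // without a DLQ, the event would simply be dropped here
  return undefined;
}
```

The only difference between losing the event and keeping it is whether something is there to receive it after the last retry.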

#10 too much/too little concurrency

“Lambda generates too much load for the downstream system”

one invocation
per message
SNS
Lambda

if you want…
maximum throughput: SNS
precise control over throughput: Kinesis
how quickly it scales out
SQS, DynamoDB Streams

“cold starts only happen to the first request”

a concurrent execution (i.e. a container) is like a class instance
a function invocation is like a method call

Lambda scales the number of concurrent executions based on traffic

existing “containers” are reused where possible
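
The class-instance analogy can be turned into a toy model. This is a sketch of the mental model only, not of how Lambda is implemented; `Container` and `Scheduler` are hypothetical names.

```javascript
// A "container" is like a class instance: the constructor is the cold start,
// and every invocation is a method call on that same instance.
class Container {
  constructor() {
    this.invocations = 0; // expensive initialisation would run here, once
  }
  invoke() {
    this.invocations += 1;
    return { coldStart: this.invocations === 1 };
  }
}

// Toy scheduler: reuse an idle container where possible, otherwise create one.
class Scheduler {
  constructor() {
    this.idle = [];
  }
  begin() {
    const container = this.idle.pop() || new Container();
    return { container, result: container.invoke() };
  }
  end(container) {
    this.idle.push(container);
  }
}
```

In this model a second request that arrives while the first is still running gets a new container and therefore a cold start, which is why cold starts are not limited to the first request.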

(timeline diagram: overlapping invocations over time, with warming “ping” invocations in the final step)

Lambda warmers don’t work when you have > 1
concurrent executions

cold start FREQUENCY: dictated by user traffic, out of your control

cold starts are generally not an issue if you have a steady traffic pattern

cold start DURATION: optimize this!

minimise the duration of cold starts so they
fall within acceptable latency range

default RDS configs are bad for Lambda
idle connections are not closed
too many connections per “container”
max open connections is too low

https://www.jeremydaly.com/manage-rds-connections-aws-lambda/

set “wait_timeout” and “interactive_timeout” to 10 mins
(default is 8 hours!)

increase “max_connections” setting

set client socket pool size to 1
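
A back-of-the-envelope sketch of why the pool size matters. The function name and the figures in the usage note are illustrative assumptions; only the 10-minute timeout comes from the advice above.

```javascript
// Worst-case number of open database connections: every concurrent
// execution ("container") holds its own connection pool.
function maxOpenConnections(concurrentExecutions, poolSizePerContainer) {
  return concurrentExecutions * poolSizePerContainer;
}

// The timeouts suggested above, expressed in seconds (10 minutes = 600 s).
const suggestedTimeouts = { wait_timeout: 600, interactive_timeout: 600 };
```

For example, 100 concurrent executions with a pool of 10 could hold up to 1000 connections, while a pool size of 1 caps the same load at 100.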

happened, system repaired, user impact
reduce MTTR

Identify & Resolve
Issues
Understanding
costs
Visibility

happened, system repaired, user impact
MTTDiscovery

“What alerts should I have?”

It depends on what you’re building…

But, this is a good starting point

Lambda
error rate %
throttle count
DLQ error count
iterator age
regional concurrency

API Gateway
p90/95/99 latency
success rate %
4xx rate %
5xx rate %

SQS
message age

Step Functions
failed count
throttle count
timed out count

“Can’t you codify these?”
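
One possible way to codify part of the list is to generate alarm definitions from a function name. This is a sketch, not the approach from the talk: `lambdaAlarms` is a hypothetical helper, the thresholds are placeholders, and it uses simple error counts rather than an error rate. The field names mirror the parameters of CloudWatch's PutMetricAlarm, and the metric names are the standard `AWS/Lambda` ones.

```javascript
// Returns plain alarm definitions for one Lambda function.
// Thresholds are placeholders; tune them for your own system.
function lambdaAlarms(functionName) {
  const alarm = (MetricName, Threshold) => ({
    AlarmName: `${functionName}-${MetricName}`,
    Namespace: "AWS/Lambda",
    MetricName,
    Dimensions: [{ Name: "FunctionName", Value: functionName }],
    Statistic: "Sum",
    Period: 60,
    EvaluationPeriods: 1,
    ComparisonOperator: "GreaterThanThreshold",
    Threshold,
  });

  return [
    alarm("Errors", 0),
    alarm("Throttles", 0),
    alarm("DeadLetterErrors", 0),
  ];
}
```

Generating the definitions from the function name means every new function gets the same baseline alerts without anyone having to remember to add them.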

https://theburningmonk.com/hire-me
Advise, Training, Delivery
“Fundamentally, Yan has improved our team by increasing our
ability to derive value from AWS and Lambda in particular.”
Nick Blair
Tech Lead

@theburningmonk
theburningmonk.com
github.com/theburningmonk
