Patterns and practices for building resilient serverless applications

presented by Yan Cui
@theburningmonk
theburningmonk.com

resilience /rɪˈzɪlɪəns/ (noun)
“the capacity to recover quickly from difficulties; toughness.”

it’s not about preventing failures!

everything fails, all the time

we need to build applications that can withstand failures

don’t run your application on one server…

entire data centers can go down…

run your application in multiple AZs and regions

Failures on load: exhaustion of resources
(chart: latency vs. reqs/s; CPU saturation)

Failures in distributed systems
(diagram: user, Service A, Service B, Service C)
the user sees: HTTP 502
the user says: You suck!

microservices death stars circa 2015

Yan Cui
http://theburningmonk.com
@theburningmonk
AWS user for 10 years

http://bit.ly/yubl-serverless
Developer Advocate @
Independent Consultant
advise, training, delivery

Lambda execution environment

Serverless: multiple AZs out of the box
Total resources created:
1 API Gateway
1 Lambda
don’t pay for idle, redundant resources!

Load balancing

Data replication in different AZs
DynamoDB
Global Tables

There is throttling everywhere!

Beware of timeout mismatch
API Gateway integration timeout: default 29s
Lambda timeout: max 15 minutes
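One way to catch this mismatch early is a small check over the function configuration. This is only a sketch: the 29s and 15 minute figures come from the slide, while `check_api_timeout` and its parameters are hypothetical names made up for illustration.

```python
from typing import List

API_GATEWAY_INTEGRATION_TIMEOUT_S = 29  # default from the slide
LAMBDA_MAX_TIMEOUT_S = 15 * 60          # max from the slide


def check_api_timeout(lambda_timeout_s: int,
                      integration_timeout_s: int = API_GATEWAY_INTEGRATION_TIMEOUT_S) -> List[str]:
    """Return warnings for a Lambda function invoked through API Gateway (hypothetical helper)."""
    warnings: List[str] = []
    if not 0 < lambda_timeout_s <= LAMBDA_MAX_TIMEOUT_S:
        warnings.append(f"Lambda timeout {lambda_timeout_s}s is outside 1..{LAMBDA_MAX_TIMEOUT_S}s")
    if lambda_timeout_s > integration_timeout_s:
        warnings.append(
            f"Lambda timeout {lambda_timeout_s}s exceeds the {integration_timeout_s}s "
            "integration timeout: the caller gets an error while the function keeps running"
        )
    return warnings
```

For example, `check_api_timeout(60)` returns one warning, because the function could run for 60s while the API caller gives up at 29s.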

Lambda timeout: max 15 minutes
SQS visibility timeout: default 30s, min 0s, max 12 hours
set VisibilityTimeout to 6x Lambda timeout
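The 6x rule is simple arithmetic, sketched below. The helper name `recommended_visibility_timeout` is hypothetical; the 12 hour cap is the SQS maximum from the slide.

```python
SQS_MAX_VISIBILITY_TIMEOUT_S = 12 * 60 * 60  # max from the slide: 12 hours


def recommended_visibility_timeout(lambda_timeout_s: int, factor: int = 6) -> int:
    """Visibility timeout in seconds for a queue consumed by a Lambda function."""
    if lambda_timeout_s <= 0:
        raise ValueError("lambda_timeout_s must be positive")
    # Cap at the SQS maximum so the value is always accepted by the service.
    return min(lambda_timeout_s * factor, SQS_MAX_VISIBILITY_TIMEOUT_S)
```

Note that the 30s default only satisfies the rule for functions with a timeout of 5s or less; a function with a 30s timeout would need a 180s visibility timeout.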

Offload computing operations to queues

better absorb downstream problems

need a way to replay DLQ events

https://www.npmjs.com/package/lumigo-cli
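For illustration only, replaying a DLQ can be sketched as a receive, send, delete loop. This is not how lumigo-cli is implemented. The `sqs` argument is assumed to behave like a boto3 SQS client (`receive_message`, `send_message`, `delete_message`); `replay_dlq` and `FakeSqs` are hypothetical names, and `FakeSqs` is an in-memory stand-in so the sketch runs without AWS.

```python
def replay_dlq(sqs, dlq_url: str, target_url: str, max_messages: int = 100) -> int:
    """Move up to max_messages from the DLQ back to the target queue."""
    moved = 0
    while moved < max_messages:
        resp = sqs.receive_message(
            QueueUrl=dlq_url,
            MaxNumberOfMessages=min(10, max_messages - moved),  # SQS returns at most 10
            WaitTimeSeconds=1,
        )
        messages = resp.get("Messages", [])
        if not messages:
            break  # nothing left to replay
        for msg in messages:
            # Send first, delete second: a crash in between duplicates a
            # message rather than losing it.
            sqs.send_message(QueueUrl=target_url, MessageBody=msg["Body"])
            sqs.delete_message(QueueUrl=dlq_url, ReceiptHandle=msg["ReceiptHandle"])
            moved += 1
    return moved


class FakeSqs:
    """In-memory stand-in for the three client calls used above (assumption, not boto3)."""

    def __init__(self, queues):
        self.queues = queues  # queue url -> list of message bodies

    def receive_message(self, QueueUrl, MaxNumberOfMessages=1, WaitTimeSeconds=0):
        bodies = self.queues[QueueUrl][:MaxNumberOfMessages]
        if not bodies:
            return {}
        return {"Messages": [{"Body": b, "ReceiptHandle": b} for b in bodies]}

    def send_message(self, QueueUrl, MessageBody):
        self.queues[QueueUrl].append(MessageBody)
        return {}

    def delete_message(self, QueueUrl, ReceiptHandle):
        self.queues[QueueUrl].remove(ReceiptHandle)
        return {}
```

Because a replayed message can be delivered more than once, the consumer on the target queue should be idempotent.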

great for fire-and-forget tasks

“what if the client is waiting for a response?”

“Decoupled Invocation”

task results table (columns: task id, created at, result)
result is <null>: not ready… client gets 202
worker: reporting for duty! working hard… client still gets 202
result is { … }: done! client gets 200 with { … }
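The 202-then-200 flow can be sketched with an in-memory dictionary standing in for the task results table. All names here (`submit_task`, `worker`, `get_result`, `TASK_RESULTS`) are hypothetical, the "work" is a placeholder, and the 404 for unknown task ids is an assumption not shown in the slides.

```python
import uuid
from typing import Any, Callable, Dict, Optional, Tuple

# In-memory stand-in for the task results table (task id -> result or None).
TASK_RESULTS: Dict[str, Optional[Any]] = {}


def submit_task(payload: Any, enqueue: Callable[[dict], None]) -> Tuple[int, dict]:
    """Accept the work, queue it, and return 202 with a task id to poll."""
    task_id = str(uuid.uuid4())
    TASK_RESULTS[task_id] = None  # the result column starts as <null>
    enqueue({"task_id": task_id, "payload": payload})
    return 202, {"task_id": task_id}


def worker(message: dict) -> None:
    """Queue consumer: do the work, then write the result to the table."""
    TASK_RESULTS[message["task_id"]] = {"echo": message["payload"]}  # placeholder work


def get_result(task_id: str) -> Tuple[int, Optional[dict]]:
    """Polling endpoint: 202 while the result is null, 200 once it is ready."""
    if task_id not in TASK_RESULTS:
        return 404, None  # assumption: unknown ids are rejected
    result = TASK_RESULTS[task_id]
    if result is None:
        return 202, None
    return 200, result
```

The client keeps polling `get_result` with the task id it received from `submit_task` until the status changes from 202 to 200.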

wait…

a distributed transaction!
needs rollback

how do you implement distributed transactions?

The Saga pattern
A pattern for managing failures where each action has a compensating action for rollback

The Saga pattern
https://www.youtube.com/watch?v=xDuwrtwYHu8

The Saga pattern
Begin transaction
Start book hotel request
End book hotel request
Start book flight request
End book flight request
Start book car rental request
End book car rental request
End transaction

The Saga pattern
model both actions and compensating actions as Lambda functions
use Step Functions as the coordinator for the saga
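To make the control flow concrete, here is a minimal in-process sketch of a saga coordinator. It stands in for the Step Functions state machine and uses plain Python callables instead of Lambda functions; `run_saga` and the `Step` tuple are hypothetical names.

```python
from typing import Callable, List, Tuple

# (name, action, compensating action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run actions in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"{name}: failed")
            # Roll back what already succeeded, most recent first.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"{done_name}: compensated")
            return False, log
        completed.append(step)
        log.append(f"{name}: done")
    return True, log
```

With the booking example, if the hotel booking succeeds and the flight booking fails, only the hotel booking is cancelled and the car rental is never attempted. Compensating actions can themselves be retried, so they should be safe to run more than once.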

no distributed
transactions

do the work here

retry-until-success

24 hours data retention
need alerting to ensure issues are addressed quickly

Mind the poison message

retry-until-success needs to deal with poison messages

6, 3, 1, 1, 1, 1, …
only count the “same” batch

have to fetch
from the stream
do it before they expire
from the stream!
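The shrinking batch sizes suggest splitting a failing batch until the poison message is isolated. The sketch below shows that idea in-process; it is an illustration under that assumption, not Lambda's own retry logic, and `process_with_bisect` is a hypothetical name.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def process_with_bisect(batch: Sequence[T],
                        handler: Callable[[Sequence[T]], None]) -> Tuple[int, List[T]]:
    """Process a batch; on failure split it in half and retry each half.

    Returns (number of records processed, records that failed on their own).
    """
    if not batch:
        return 0, []
    try:
        handler(batch)
        return len(batch), []
    except Exception:
        if len(batch) == 1:
            return 0, list(batch)  # isolated: this record is the poison message
        mid = len(batch) // 2
        ok_left, bad_left = process_with_bisect(batch[:mid], handler)
        ok_right, bad_right = process_with_bisect(batch[mid:], handler)
        return ok_left + ok_right, bad_left + bad_right
```

Records in a failing batch may be handled more than once before the split isolates the bad one, so the handler should be idempotent. The isolated records still need to be stored or fetched somewhere for later inspection before they expire.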

Mind the partial failures
(diagram: SQS, Poller, Lambda)
on success: the poller deletes the messages
on error: the messages stay in the queue and eventually go to the DLQ
batch fails as a unit

https://lumigo.io/blog/sqs-and-lambda-the-missing-guide-on-failure-modes
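One way to avoid failing the whole batch is to report only the failed messages back to Lambda. The sketch below assumes the SQS event source mapping has `ReportBatchItemFailures` enabled, a Lambda feature the slides do not mention; `make_handler` and `process` are hypothetical names, and message bodies are assumed to be JSON.

```python
import json
from typing import Callable


def make_handler(process: Callable[[dict], None]):
    """Build an SQS-triggered handler that reports failures per message."""

    def handler(event, context=None):
        failures = []
        for record in event.get("Records", []):
            try:
                process(json.loads(record["body"]))
            except Exception:
                # Only this message is retried; the rest of the batch is deleted.
                failures.append({"itemIdentifier": record["messageId"]})
        return {"batchItemFailures": failures}

    return handler
```

An empty `batchItemFailures` list tells Lambda the whole batch succeeded. Without the setting enabled, the return value is ignored and the batch still succeeds or fails as a unit.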

Mind the retry storm
Service A
retry
retry
retry
retry

retry storm

circuit breaker pattern
After X consecutive timeouts, trip the circuit
When circuit is open, fail fast
but, allow 1 request through every Y mins
If request succeeds, close the circuit
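The rules above can be sketched as a small class that keeps the circuit state in memory. The names `CircuitBreaker`, `CircuitOpenError`, `max_timeouts` (X) and `probe_interval_s` (Y, in seconds) are hypothetical, and the clock is injectable so the behaviour can be tested without waiting.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, max_timeouts: int, probe_interval_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_timeouts = max_timeouts          # X in the slide
        self.probe_interval_s = probe_interval_s  # Y in the slide, in seconds
        self.clock = clock
        self.consecutive_timeouts = 0
        self.opened_at: Optional[float] = None    # None means the circuit is closed
        self.last_probe_at = 0.0

    def call(self, fn: Callable[[], object]):
        now = self.clock()
        if self.opened_at is not None:
            if now - self.last_probe_at < self.probe_interval_s:
                raise CircuitOpenError("circuit open, failing fast")
            self.last_probe_at = now              # let this one request through
        try:
            result = fn()
        except TimeoutError:
            self.consecutive_timeouts += 1
            if self.opened_at is None and self.consecutive_timeouts >= self.max_timeouts:
                self.opened_at = now              # trip the circuit
                self.last_probe_at = now
            raise                                 # other exceptions pass through untouched
        self.consecutive_timeouts = 0
        self.opened_at = None                     # success closes the circuit
        return result
```

Wrap each downstream call as `breaker.call(lambda: client.get(...))`, where `client.get` is whatever call can time out. Each instance tracks its own state, so every container trips independently.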

where do I keep the state of the circuit?

in-memory
PROS
simplicity
no dependency on external service
CONS
takes longer & more requests to stop all traffic
new containers would generate more traffic

external service
PROS
minimizes no. of total requests to trip circuit
new containers respect collective decision
CONS
complexity
dependency on an external service

which approach should I use?
It depends. Maybe start with the simplest solution first?

multi-region, active-active
(diagram: us-east-1 with Route53, API Gateway, Lambda, DynamoDB)
(diagram: eu-west-1, us-east-1 and us-west-1 with a Global Table)
(diagram: us-east-1 and eu-central-1 with SQS, Lambda, DynamoDB, API Gateway and SNS)
dedupe
(diagram: us-east-1 and eu-central-1 with SNS and a Global Table)
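The dedupe step is not spelled out in the slides. Assuming it means dropping events that arrive more than once (for example, once per region), a minimal sketch is an idempotency wrapper keyed on a message id. `make_idempotent` and the `id` field are hypothetical, and the in-memory set stands in for a store shared across regions.

```python
from typing import Callable, Optional, Set


def make_idempotent(handler: Callable[[dict], None],
                    seen: Optional[Set[str]] = None) -> Callable[[dict], bool]:
    """Wrap a handler so each message id is processed at most once."""
    seen_ids: Set[str] = set() if seen is None else seen

    def wrapper(message: dict) -> bool:
        message_id = message["id"]  # assumption: every message carries a unique id
        if message_id in seen_ids:
            return False            # duplicate, skip it
        handler(message)
        # Record the id only after success, so a failed message can be retried.
        seen_ids.add(message_id)
        return True

    return wrapper
```

The wrapper returns `True` when the message was processed and `False` when it was skipped as a duplicate.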

Multi-region architecture: benefits & tradeoffs
benefit: protection against regional failures
tradeoffs: higher complexity, very hard to test

MUST KILL SERVERS!
RAWR!!
RAWR!!

“the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production”
principlesofchaos.org

“You don't choose the moment, the moment chooses you!
You only choose how prepared you are when it does.”
Fire Chief Mike Burtch

“what if DynamoDB has an elevated error rate?”

“what if service X has elevated latency?”

GOAL: identify weaknesses before they manifest in system-wide, aberrant behaviors
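The "what if" questions above can be explored by injecting latency and errors around calls to a dependency. This is a minimal sketch, not a specific chaos engineering tool: `inject_failure` and its parameters are hypothetical, and the random source and sleep function are injectable so experiments stay repeatable.

```python
import functools
import random
import time


def inject_failure(latency_s: float = 0.0, error_rate: float = 0.0,
                   rng=random.random, sleep=time.sleep):
    """Decorator that adds latency and random errors to a function call."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)                  # simulate elevated latency
            if rng() < error_rate:
                raise RuntimeError("injected failure")  # simulate elevated error rate
            return fn(*args, **kwargs)
        return wrapper

    return decorator
```

For example, decorating a DynamoDB read with `@inject_failure(error_rate=0.1)` fails roughly one call in ten, which shows whether retries, timeouts and fallbacks behave as expected. Start with a small error rate in a non-production environment.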


@theburningmonk
theburningmonk.com
github.com/theburningmonk
