Patterns and Practices
for building resilient
serverless applications
presented by Yan Cui
@theburningmonk
@theburningmonk theburningmonk.com
@theburningmonk theburningmonk.com
“the capacity to recover quickly from difficulties; toughness.”
resilience
/rɪˈzɪlɪəns/
noun
@theburningmonk theburningmonk.com
“the capacity to recover quickly from difficulties; toughness.”
resilience
/rɪˈzɪlɪəns/
noun
it’s not about
preventing failures!
everything fails, all the time
@theburningmonk theburningmonk.com
we need to build applications that can withstand failures
@theburningmonk theburningmonk.com
@theburningmonk theburningmonk.com
don’t run your application on one server…
@theburningmonk theburningmonk.com
entire data centers can
go down…
@theburningmonk theburningmonk.com
run your application in multiple AZs and regions
@theburningmonk theburningmonk.com
Failures on load: exhaustion of resources
@theburningmonk theburningmonk.com
Failures on load: exhaustion of resources
@theburningmonk theburningmonk.com
latency
reqs/s
Failures on load: exhaustion of resources
CPU saturation
@theburningmonk theburningmonk.com
Failures in distributed systems
Service A Service B Service C
user
@theburningmonk theburningmonk.com
Failures in distributed systems
Service A Service B Service C
user
@theburningmonk theburningmonk.com
Failures in distributed systems
Service A Service B Service C
user
HTTP 502
@theburningmonk theburningmonk.com
Failures in distributed systems
Service A Service B Service C
user
You suck!
@theburningmonk theburningmonk.com
microservices death stars circa 2015
Yan Cui
http://theburningmonk.com
@theburningmonk
AWS user for 10 years
Yan Cui
http://theburningmonk.com
@theburningmonk
http://bit.ly/yubl-serverless
Yan Cui
http://theburningmonk.com
@theburningmonk
Developer Advocate @
Yan Cui
http://theburningmonk.com
@theburningmonk
Independent Consultant
advisetraining delivery
by Uwe Friedrichsen
@theburningmonk theburningmonk.com
Lambda execution environment
@theburningmonk theburningmonk.com
Serverless - multiple AZ’s out of the box
Total resources created:
1 API Gateway
1 Lambda
@theburningmonk theburningmonk.com
Serverless - multiple AZ’s out of the box
Total resources created:
1 API Gateway
1 Lambda
don’t pay for idle
redundant resources!
@theburningmonk theburningmonk.com
Load balancing
@theburningmonk theburningmonk.com
Data replication in different AZ’s
DynamoDB
Global Tables
@theburningmonk theburningmonk.com
There are throttling everywhere!
@theburningmonk theburningmonk.com
Beware of timeout mismatch
API Gateway

Integration timeout 

Default: 29s
Lambda

Timeout
Max: 15 minutes
@theburningmonk theburningmonk.com
Beware of timeout mismatch
Lambda

Timeout
Max: 15 minutes
SQS

Visibility timeout

Default: 30s
Min: 0s
Max: 12 hours
@theburningmonk theburningmonk.com
Beware of timeout mismatch
Lambda

Timeout
Max: 15 minutes
SQS

Visibility timeout

Default: 30s
Min: 0s
Max: 12 hours
set VisibilityTimeout to
6x Lambda timeout
@theburningmonk theburningmonk.com
Offload computing operations to queues
@theburningmonk theburningmonk.com
Offload computing operations to queues
@theburningmonk theburningmonk.com
Offload computing operations to queues
better absorb
downstream problems
@theburningmonk theburningmonk.com
Offload computing operations to queues
need way to replay
DLQ events
https://www.npmjs.com/package/lumigo-cli
@theburningmonk theburningmonk.com
Offload computing operations to queues
great for fire-and-forget tasks
@theburningmonk theburningmonk.com
“what if the client is waiting for a response?”
@theburningmonk theburningmonk.com
“Decoupled Invocation”
@theburningmonk theburningmonk.com
task id created at result
xxx xxx <null>
xxx xxx <null>
… … …
task results
not ready…
@theburningmonk theburningmonk.com
task id created at result
xxx xxx <null>
xxx xxx <null>
… … …
task results
not ready…
202
@theburningmonk theburningmonk.com
task id created at result
xxx xxx <null>
xxx xxx <null>
… … …
task results
reporting for duty!
@theburningmonk theburningmonk.com
task id created at result
xxx xxx <null>
xxx xxx <null>
… … …
task results
working hard…
not ready…
@theburningmonk theburningmonk.com
task id created at result
xxx xxx <null>
xxx xxx <null>
… … …
task results
202
working hard…
@theburningmonk theburningmonk.com
task id created at result
xxx xxx <null>
xxx xxx { … }
… … …
task results
done!
@theburningmonk theburningmonk.com
task id created at result
xxx xxx <null>
xxx xxx { … }
… … …
task results
done!
@theburningmonk theburningmonk.com
task id created at result
xxx xxx <null>
xxx xxx { … }
… … …
task results
200
{ … }
@theburningmonk theburningmonk.com
wait…
@theburningmonk theburningmonk.com
a distributed
transaction!
@theburningmonk theburningmonk.com
a distributed
transaction!
needs rollback
@theburningmonk theburningmonk.com
how do you implement distributed transactions?
@theburningmonk theburningmonk.com
The Saga pattern
A pattern for managing failures where each action
has a compensating action for rollback
@theburningmonk theburningmonk.com
The Saga pattern
https://www.youtube.com/watch?v=xDuwrtwYHu8
@theburningmonk theburningmonk.com
The Saga pattern
Begin transaction
Start book hotel request
End book hotel request
Start book flight request
End book flight request
Start book car rental request
End book car rental request
End transaction
@theburningmonk theburningmonk.com
The Saga pattern
model both actions and
compensating actions as
Lambda functions
@theburningmonk theburningmonk.com
The Saga pattern
use Step Functions as the
coordinator for the saga
@theburningmonk theburningmonk.com
The Saga pattern
Input
@theburningmonk theburningmonk.com
The Saga pattern
@theburningmonk theburningmonk.com
The Saga pattern
@theburningmonk theburningmonk.com
The Saga pattern
@theburningmonk theburningmonk.com
no distributed
transactions
@theburningmonk theburningmonk.com
do the work here
@theburningmonk theburningmonk.com
retry-until-success
@theburningmonk theburningmonk.com
@theburningmonk theburningmonk.com
24 hours data retention
@theburningmonk theburningmonk.com
24 hours data retention
need alerting to ensure
issue are addressed quickly
@theburningmonk theburningmonk.com
Mind the poison message
@theburningmonk theburningmonk.com
retry-until-success
needs to deal with
poinson messages
@theburningmonk theburningmonk.com
Mind the poison message
@theburningmonk theburningmonk.com
Mind the poison message
6, 3, 1, 1, 1, 1, …
@theburningmonk theburningmonk.com
Mind the poison message
6, 3, 1, 1, 1, 1, …
only count the “same” batch
@theburningmonk theburningmonk.com
Mind the poison message
@theburningmonk theburningmonk.com
Mind the poison message
have to fetch
from the stream
@theburningmonk theburningmonk.com
Mind the poison message
have to fetch
from the stream
do it before they expire
from the stream!
@theburningmonk theburningmonk.com
Mind the partial failures
LambdaSQS
@theburningmonk theburningmonk.com
Mind the partial failures
LambdaSQS Poller
@theburningmonk theburningmonk.com
LambdaSQS Poller
Mind the partial failures
Delete
@theburningmonk theburningmonk.com
Mind the partial failures
LambdaSQS Poller
Error
@theburningmonk theburningmonk.com
Mind the partial failures
LambdaSQS Poller
Error
DLQ
@theburningmonk theburningmonk.com
Mind the partial failures
LambdaSQS Poller
Error
DLQ
batch fails as a unit
https://lumigo.io/blog/sqs-and-lambda-the-missing-guide-on-failure-modes
Mind the partial failures
@theburningmonk theburningmonk.com
Mind the partial failures
@theburningmonk theburningmonk.com
Mind the partial failures
@theburningmonk theburningmonk.com
Mind the partial failures
@theburningmonk theburningmonk.com
Mind the retry storm
Service A
@theburningmonk theburningmonk.com
Mind the retry storm
Service A
@theburningmonk theburningmonk.com
Mind the retry storm
Service A
retry
retry
retry
retry
@theburningmonk theburningmonk.com
Mind the retry storm
Service A
@theburningmonk theburningmonk.com
Mind the retry storm
Service A
@theburningmonk theburningmonk.com
Mind the retry storm
Service A
@theburningmonk theburningmonk.com
Mind the retry storm
Service A
@theburningmonk theburningmonk.com
retry storm
@theburningmonk theburningmonk.com
circuit breaker pattern
After X consecutive timeouts, trip the circuit
@theburningmonk theburningmonk.com
circuit breaker pattern
After X consecutive timeouts, trip the circuit
When circuit is open, fail fast
@theburningmonk theburningmonk.com
circuit breaker pattern
When circuit is open, fail fast
but, allow 1 request through every Y mins
After X consecutive timeouts, trip the circuit
@theburningmonk theburningmonk.com
circuit breaker pattern
When circuit is open, fail fast
but, allow 1 request through every Y mins
If request succeeds, close the circuit
After X consecutive timeouts, trip the circuit
@theburningmonk theburningmonk.com
where do I keep the state of the circuit?
@theburningmonk theburningmonk.com
in-memory
PROS
simplicity
no dependency on external service
CONS
takes longer & more requests to stop all traffic
new containers would generate more traffic
@theburningmonk theburningmonk.com
external service
PROS
minimizes no. of total requests to trip circuit
new containers respect collective decision
CONS
complexity
dependency on an external service
@theburningmonk theburningmonk.com
which approach should I use?
It depends. Maybe start with the simplest solution first?
@theburningmonk theburningmonk.com
multi-region, active-active
@theburningmonk theburningmonk.com
us-east-1
API Gateway Lambda DynamoDBRoute53
@theburningmonk theburningmonk.com
eu-west-1
us-east-1
us-west-1
@theburningmonk theburningmonk.com
eu-west-1
us-east-1
us-west-1
GlobalTable
@theburningmonk theburningmonk.com
eu-west-1
us-east-1
us-west-1
GlobalTable
@theburningmonk theburningmonk.com
eu-central-1
us-east-1
us-east-1
SQS Lambda DynamoDB Lambda API Gateway
SNS
SNS
@theburningmonk theburningmonk.com
us-east-1
SQS Lambda DynamoDB Lambda API Gateway
eu-central-1
us-east-1
SNS
SNS
@theburningmonk theburningmonk.com
us-east-1
SQS Lambda DynamoDB Lambda API Gateway
eu-central-1
us-east-1
SNS
SNS
@theburningmonk theburningmonk.com
us-east-1
SQS Lambda DynamoDB Lambda API Gateway
eu-central-1
us-east-1
SNS
SNS
Ddedupe
@theburningmonk theburningmonk.com
us-east-1
SQS Lambda DynamoDB Lambda API Gateway
us-east-1
SNS
eu-central-1
SNS
eu-central-1
SQS Lambda DynamoDB Lambda API Gateway
Global Table
@theburningmonk theburningmonk.com
us-east-1
SQS Lambda DynamoDB Lambda API Gateway
us-east-1
SNS
eu-central-1
SNS
eu-central-1
SQS Lambda DynamoDB Lambda API Gateway
Global Table
@theburningmonk theburningmonk.com
us-east-1
SQS Lambda DynamoDB Lambda API Gateway
us-east-1
SNS
eu-central-1
SNS
eu-central-1
SQS Lambda DynamoDB Lambda API Gateway
Global Table
@theburningmonk theburningmonk.com
us-east-1
SQS Lambda DynamoDB Lambda API Gateway
us-east-1
SNS
eu-central-1
SNS
eu-central-1
SQS Lambda DynamoDB Lambda API Gateway
Global Table
@theburningmonk theburningmonk.com
Multi-region architecture - benefits & tradeoffs
Protection against

regional failures
Higher complexity Very hard to test
CHAOS ENGINEERING
MUST KILL SERVERS!
RAWR!!
RAWR!!
@theburningmonk theburningmonk.com
“the discipline of experimenting on a system in order to build confidence in the
system’s capability to withstand turbulent conditions in production”
principlesofchaos.org
@theburningmonk theburningmonk.com
“You don't choose the moment, the moment chooses you!
You only choose how prepared you are when it does.”
Fire Chief Mike Burtch
@theburningmonk theburningmonk.com
@theburningmonk theburningmonk.com
“what if DynamoDB has an elevated error rate?”
@theburningmonk theburningmonk.com
“what if service X has elevated latency?”
@theburningmonk theburningmonk.com
identify weaknesses before they manifest in system-wide, aberrant behaviors
GOAL
everything fails, all the time
@theburningmonk theburningmonk.com
“the capacity to recover quickly from difficulties; toughness.”
resilience
/rɪˈzɪlɪəns/
noun
@theburningmonk theburningmonk.com
https://theburningmonk.com/hire-me
AdviseTraining Delivery
“Fundamentally, Yan has improved our team by increasing our
ability to derive value from AWS and Lambda in particular.”
Nick Blair
Tech Lead
Learn GraphQL and AppSync by building a
Twitter clone with these technologies
appsyncmasterclass.com
@theburningmonk
theburningmonk.com
github.com/theburningmonk

Patterns and practices for building resilient serverless applications