Patterns and Practices
for building resilient
serverless applications
presented by Yan Cui
@theburningmonk
@theburningmonk theburningmonk.com
“the capacity to recover quickly from difficulties; toughness.”
resilience
/rɪˈzɪlɪəns/
...
@theburningmonk theburningmonk.com
“the capacity to recover quickly from difficulties; toughness.”
resilience
/rɪˈzɪlɪəns/
...
everything fails, all the time
@theburningmonk theburningmonk.com
we need to build applications that can withstand failures
@theburningmonk theburningmonk.com
don’t run your application on one server…
@theburningmonk theburningmonk.com
entire data centers can
go down…
@theburningmonk theburningmonk.com
run your application in multiple AZs and regions
@theburningmonk theburningmonk.com
Failures on load: exhaustion of resources
@theburningmonk theburningmonk.com
Failures on load: exhaustion of resources
@theburningmonk theburningmonk.com
latency
reqs/s
Failures on load: exhaustion of resources
CPU saturation
@theburningmonk theburningmonk.com
Failures in distributed systems
Service A Service B Service C
user
@theburningmonk theburningmonk.com
Failures in distributed systems
Service A Service B Service C
user
@theburningmonk theburningmonk.com
Failures in distributed systems
Service A Service B Service C
user
HTTP 502
@theburningmonk theburningmonk.com
Failures in distributed systems
Service A Service B Service C
user
You suck!
@theburningmonk theburningmonk.com
microservices death stars circa 2015
Yan Cui
http://theburningmonk.com
@theburningmonk
AWS user for 10 years
Yan Cui
http://theburningmonk.com
@theburningmonk
http://bit.ly/yubl-serverless
Yan Cui
http://theburningmonk.com
@theburningmonk
Developer Advocate @
Yan Cui
http://theburningmonk.com
@theburningmonk
Independent Consultant
advise training delivery
by Uwe Friedrichsen
@theburningmonk theburningmonk.com
Lambda execution environment
@theburningmonk theburningmonk.com
Serverless - multiple AZ’s out of the box
Total resources created:
1 API Gateway
1 Lamb...
@theburningmonk theburningmonk.com
Serverless - multiple AZ’s out of the box
Total resources created:
1 API Gateway
1 Lamb...
@theburningmonk theburningmonk.com
Load balancing
@theburningmonk theburningmonk.com
Data replication in different AZ’s
DynamoDB
Global Tables
@theburningmonk theburningmonk.com
There is throttling everywhere!
@theburningmonk theburningmonk.com
Beware of timeout mismatch
API Gateway

Integration timeout 

Default: 29s
Lambda

Time...
@theburningmonk theburningmonk.com
Beware of timeout mismatch
Lambda

Timeout
Max: 15 minutes
SQS

Visibility timeout

Def...
@theburningmonk theburningmonk.com
Beware of timeout mismatch
Lambda

Timeout
Max: 15 minutes
SQS

Visibility timeout

Def...
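The full version of this slide recommends setting the SQS visibility timeout to 6x the Lambda timeout. A minimal sketch of that rule follows; the function names are hypothetical, and the limits are the ones quoted on the slides (Lambda timeout max 15 minutes, visibility timeout max 12 hours).

```python
LAMBDA_MAX_TIMEOUT_S = 15 * 60        # Lambda timeout, max: 15 minutes
SQS_MAX_VISIBILITY_S = 12 * 60 * 60   # SQS visibility timeout, max: 12 hours


def recommended_visibility_timeout(lambda_timeout_s: int, factor: int = 6) -> int:
    """Return factor x the Lambda timeout, capped at the SQS maximum."""
    if not 0 < lambda_timeout_s <= LAMBDA_MAX_TIMEOUT_S:
        raise ValueError("Lambda timeout must be between 1s and 15 minutes")
    return min(lambda_timeout_s * factor, SQS_MAX_VISIBILITY_S)


def is_mismatched(lambda_timeout_s: int, visibility_timeout_s: int) -> bool:
    """True when the queue's visibility timeout is below the recommended value."""
    return visibility_timeout_s < recommended_visibility_timeout(lambda_timeout_s)
```

For example, a function with a 30s timeout reading from a queue left at the default 30s visibility timeout would be flagged as mismatched.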
@theburningmonk theburningmonk.com
Offload computing operations to queues
@theburningmonk theburningmonk.com
Offload computing operations to queues
@theburningmonk theburningmonk.com
Offload computing operations to queues
better absorb
downstream problems
@theburningmonk theburningmonk.com
Offload computing operations to queues
need way to replay
DLQ events
https://www.npmjs.com/package/lumigo-cli
@theburningmonk theburningmonk.com
Offload computing operations to queues
great for fire-and-forget tasks
@theburningmonk theburningmonk.com
“what if the client is waiting for a response?”
@theburningmonk theburningmonk.com
“Decoupled Invocation”
@theburningmonk theburningmonk.com
task id created at result
xxx xxx <null>
xxx xxx <null>
… … …
task results
not ready…
@theburningmonk theburningmonk.com
task id created at result
xxx xxx <null>
xxx xxx <null>
… … …
task results
not ready…
2...
@theburningmonk theburningmonk.com
task id created at result
xxx xxx <null>
xxx xxx <null>
… … …
task results
reporting fo...
@theburningmonk theburningmonk.com
task id created at result
xxx xxx <null>
xxx xxx <null>
… … …
task results
working hard...
@theburningmonk theburningmonk.com
task id created at result
xxx xxx <null>
xxx xxx <null>
… … …
task results
202
working ...
@theburningmonk theburningmonk.com
task id created at result
xxx xxx <null>
xxx xxx { … }
… … …
task results
done!
@theburningmonk theburningmonk.com
task id created at result
xxx xxx <null>
xxx xxx { … }
… … …
task results
done!
@theburningmonk theburningmonk.com
task id created at result
xxx xxx <null>
xxx xxx { … }
… … …
task results
200
{ … }
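The sequence above (acknowledge with 202, keep answering 202 while the result is still null, then answer 200 with the result) can be sketched as follows. This is an illustration only: the in-memory dictionary stands in for the "task results" table, the function names are hypothetical, the "created at" column is omitted, and the 404 for unknown task ids is an assumption not shown on the slides.

```python
import uuid
from typing import Any, Dict, Optional, Tuple

# Stand-in for the "task results" table: task id -> result (None until done).
task_results: Dict[str, Optional[Any]] = {}


def submit_task() -> Tuple[int, dict]:
    """Record a pending task and acknowledge the request with 202."""
    task_id = str(uuid.uuid4())
    task_results[task_id] = None
    # A real system would also enqueue the work for a background worker here.
    return 202, {"taskId": task_id}


def complete_task(task_id: str, result: Any) -> None:
    """Called by the worker once the task has finished."""
    task_results[task_id] = result


def poll_task(task_id: str) -> Tuple[int, Optional[Any]]:
    """Return 202 while the result is still null, 200 with the result once done."""
    if task_id not in task_results:
        return 404, None  # assumption: unknown ids are reported as not found
    result = task_results[task_id]
    if result is None:
        return 202, None
    return 200, result
```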
@theburningmonk theburningmonk.com
wait…
@theburningmonk theburningmonk.com
a distributed
transaction!
@theburningmonk theburningmonk.com
a distributed
transaction!
needs rollback
@theburningmonk theburningmonk.com
no distributed
transactions
@theburningmonk theburningmonk.com
do the work here
@theburningmonk theburningmonk.com
retry-until-success
@theburningmonk theburningmonk.com
24 hours data retention
@theburningmonk theburningmonk.com
24 hours data retention
need alerting to ensure
issues are addressed quickly
@theburningmonk theburningmonk.com
retry-until-success
needs to deal with
poison messages
@theburningmonk theburningmonk.com
what if you can’t avoid distributed transactions?
@theburningmonk theburningmonk.com
The Saga pattern
A pattern for managing failures where each action
has a compensating a...
@theburningmonk theburningmonk.com
The Saga pattern
https://www.youtube.com/watch?v=xDuwrtwYHu8
@theburningmonk theburningmonk.com
The Saga pattern
Begin transaction
Start book hotel request
End book hotel request
Star...
@theburningmonk theburningmonk.com
The Saga pattern
model both actions and
compensating actions as
Lambda functions
@theburningmonk theburningmonk.com
The Saga pattern
use Step Functions as the
coordinator for the saga
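A minimal sketch of the coordination logic, assuming plain Python callables in place of the Lambda functions and a simple loop in place of Step Functions. The `run_saga` name and the choice not to compensate the step that failed are assumptions for illustration.

```python
from typing import Callable, List, Tuple

# Each step pairs an action with its compensating action.
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run actions in order; on failure, undo the completed ones in reverse order."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Roll back everything that succeeded so far, newest first.
            # The failed step itself is assumed to have left nothing to undo.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True
```

With the hotel, flight, and car rental bookings from the example, a failed car rental booking would trigger "cancel flight" and then "cancel hotel".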
@theburningmonk theburningmonk.com
The Saga pattern
Input
@theburningmonk theburningmonk.com
The Saga pattern
@theburningmonk theburningmonk.com
The Saga pattern
@theburningmonk theburningmonk.com
The Saga pattern
@theburningmonk theburningmonk.com
retry-until-success
needs to deal with
poison messages
Mind the poison message
@theburningmonk theburningmonk.com
Mind the poison message
@theburningmonk theburningmonk.com
Mind the poison message
@theburningmonk theburningmonk.com
Mind the poison message
@theburningmonk theburningmonk.com
Mind the poison message
6, 3, 1, 1, 1, 1, …
@theburningmonk theburningmonk.com
Mind the poison message
6, 3, 1, 1, 1, 1, …
only count the “same” batch
@theburningmonk theburningmonk.com
Mind the poison message
@theburningmonk theburningmonk.com
Mind the poison message
have to fetch
from the stream
@theburningmonk theburningmonk.com
Mind the poison message
have to fetch
from the stream
do it before they expire
from the...
@theburningmonk theburningmonk.com
how do you prevent building up an insurmountable backlog?
@theburningmonk theburningmonk.com
Load shedding
implement load shedding
prioritize newer messages with
a better chance to...
@theburningmonk theburningmonk.com
Load shedding
excess load is sent to DLQ
@theburningmonk theburningmonk.com
Load shedding
process with a delay
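One way to sketch the idea: split a batch by message age, process the newer messages now, and divert the rest to a DLQ to be processed later. The message shape (`id`, `sent_at`), the `max_age_s` threshold, and the function name are all hypothetical; the slides do not specify how "excess load" is determined.

```python
from typing import Any, Dict, List, Tuple

Message = Dict[str, Any]  # hypothetical shape: {"id": str, "sent_at": epoch seconds}


def shed_load(
    messages: List[Message], now: float, max_age_s: float
) -> Tuple[List[Message], List[Message]]:
    """Split messages into (process now, send to DLQ) based on their age."""
    fresh: List[Message] = []
    shed: List[Message] = []
    for message in messages:
        if now - message["sent_at"] <= max_age_s:
            fresh.append(message)   # newer message: better chance to succeed
        else:
            shed.append(message)    # excess load: goes to the DLQ for later
    return fresh, shed
```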
@theburningmonk theburningmonk.com
Mind the partial failures
Lambda SQS
@theburningmonk theburningmonk.com
Mind the partial failures
Lambda SQS Poller
@theburningmonk theburningmonk.com
Lambda SQS Poller
Mind the partial failures
Delete
@theburningmonk theburningmonk.com
Mind the partial failures
Lambda SQS Poller
Error
@theburningmonk theburningmonk.com
Mind the partial failures
Lambda SQS Poller
Error
DLQ
@theburningmonk theburningmonk.com
Mind the partial failures
Lambda SQS Poller
Error
DLQ
batch fails as a unit
https://lumigo.io/blog/sqs-and-lambda-the-missing-guide-on-failure-modes
Mind the partial failures
@theburningmonk theburningmonk.com
Mind the partial failures
@theburningmonk theburningmonk.com
Mind the partial failures
@theburningmonk theburningmonk.com
Mind the partial failures
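Because the batch fails as a unit, one bad message causes the whole batch to be retried. One option, sketched below, is to handle each record separately and report only the failed ones back to Lambda. This assumes the SQS event source mapping has `ReportBatchItemFailures` enabled; the slides do not confirm whether the talk uses this approach, and `process` is a hypothetical placeholder for the business logic.

```python
import json
from typing import Any, Callable, Dict


def process(body: Dict[str, Any]) -> None:
    """Hypothetical business logic; raises when a message cannot be handled."""
    if body.get("fail"):
        raise ValueError("cannot process message")


def handler(
    event: Dict[str, Any],
    context: Any = None,
    process_fn: Callable[[Dict[str, Any]], None] = process,
) -> Dict[str, Any]:
    """Process each SQS record on its own and report only the failures."""
    failures = []
    for record in event["Records"]:
        try:
            process_fn(json.loads(record["body"]))
        except Exception:
            # Only this message becomes visible again for retry.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```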
@theburningmonk theburningmonk.com
Mind the retry storm
Service A
@theburningmonk theburningmonk.com
Mind the retry storm
Service A
@theburningmonk theburningmonk.com
Mind the retry storm
Service A
retry
retry
retry
retry
@theburningmonk theburningmonk.com
Mind the retry storm
Service A
@theburningmonk theburningmonk.com
Mind the retry storm
Service A
@theburningmonk theburningmonk.com
Mind the retry storm
Service A
@theburningmonk theburningmonk.com
Mind the retry storm
Service A
@theburningmonk theburningmonk.com
retry storm
@theburningmonk theburningmonk.com
circuit breaker pattern
After X consecutive timeouts, trip the circuit
@theburningmonk theburningmonk.com
circuit breaker pattern
After X consecutive timeouts, trip the circuit
When circuit is ...
@theburningmonk theburningmonk.com
circuit breaker pattern
When circuit is open, fail fast
but, allow 1 request through ev...
@theburningmonk theburningmonk.com
circuit breaker pattern
When circuit is open, fail fast
but, allow 1 request through ev...
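The rules on these slides (trip after X consecutive timeouts, fail fast while open, let one request through every Y minutes, close again when it succeeds) can be sketched as a small class. This is an illustration that keeps state in memory; the class and parameter names are hypothetical, and the injectable `clock` exists only to make the behaviour easy to test.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is rejected without trying."""


class CircuitBreaker:
    def __init__(
        self,
        max_timeouts: int,
        probe_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_timeouts = max_timeouts          # "X" consecutive timeouts
        self.probe_interval_s = probe_interval_s  # "Y", expressed in seconds
        self.clock = clock
        self.consecutive_timeouts = 0
        self.is_open = False
        self.last_attempt_at: Optional[float] = None

    def call(self, fn: Callable[[], Any]) -> Any:
        now = self.clock()
        if self.is_open and now - self.last_attempt_at < self.probe_interval_s:
            raise CircuitOpenError("circuit is open, failing fast")
        # Either the circuit is closed, or it is time to let one request through.
        self.last_attempt_at = now
        try:
            result = fn()
        except TimeoutError:
            self.consecutive_timeouts += 1
            if self.consecutive_timeouts >= self.max_timeouts:
                self.is_open = True  # trip the circuit
            raise
        # Success: reset the count and close the circuit.
        self.consecutive_timeouts = 0
        self.is_open = False
        return result
```

Only `TimeoutError` counts towards tripping the circuit here; other exceptions propagate without changing its state.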
@theburningmonk theburningmonk.com
where do I keep the state of the circuit?
@theburningmonk theburningmonk.com
in-memory
Service A
isOpen: false
isOpen: false
isOpen: false
isOpen: false
@theburningmonk theburningmonk.com
in-memory
Service A
isOpen: true
isOpen: false
isOpen: true
isOpen: false
@theburningmonk theburningmonk.com
in-memory
PROS
simplicity
@theburningmonk theburningmonk.com
in-memory
PROS
simplicity
no dependency on external service
requires another circuit
br...
@theburningmonk theburningmonk.com
in-memory
PROS
simplicity
no dependency on external service
CONS
takes longer & more re...
@theburningmonk theburningmonk.com
in-memory
PROS
simplicity
no dependency on external service
CONS
takes longer & more re...
@theburningmonk theburningmonk.com
external service
Service A isOpen: false
@theburningmonk theburningmonk.com
external service
Service A isOpen: true
@theburningmonk theburningmonk.com
external service
Service A isOpen: true
@theburningmonk theburningmonk.com
external service
PROS
minimizes no. of total requests to trip circuit
new containers re...
@theburningmonk theburningmonk.com
which approach should I use?
@theburningmonk theburningmonk.com
which approach should I use?
It depends. Maybe start with the simplest solution first?
@theburningmonk theburningmonk.com
Lambda autoscaling
Burst concurrency limits:

3000 – US West (Oregon), US East (N.
Virg...
@theburningmonk theburningmonk.com
Lambda autoscaling
Burst concurrency limits:

3000 – US West (Oregon), US East (N.
Virg...
@theburningmonk theburningmonk.com
Lambda autoscaling
Burst concurrency limits:

3000 – US West (Oregon), US East (N.
Virg...
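The scaling figures quoted in the talk (an initial burst of 3000, 1000, or 500 depending on the Region, then 500 new instances each minute) lend themselves to a quick back-of-the-envelope calculation. A sketch follows, with a hypothetical function name and the assumption that the account's concurrency limit is high enough not to get in the way.

```python
import math

SCALE_UP_PER_MINUTE = 500  # new instances each minute after the initial burst


def minutes_to_reach(target_concurrency: int, burst_limit: int) -> int:
    """Minutes of sustained scaling needed after the initial burst is used up."""
    if target_concurrency <= burst_limit:
        return 0
    return math.ceil((target_concurrency - burst_limit) / SCALE_UP_PER_MINUTE)
```

For example, reaching 5000 concurrent executions takes about 4 minutes in a Region with a burst limit of 3000, and about 9 minutes in a Region with a burst limit of 500.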
@theburningmonk theburningmonk.com
Lambda limitations & throttling
Concurrent executions: 1000*

Timeout: 15 minutes

Burs...
@theburningmonk theburningmonk.com
Lambda limitations & throttling
good for spiky
traffic, up to a point
Concurrent execut...
@theburningmonk theburningmonk.com
“what if my traffic is more spiky than that?”
@theburningmonk theburningmonk.com
Scenario: predictable spikes
Holidays, weekends,

celebrations

(Black Friday)
Planned ...
@theburningmonk theburningmonk.com
Scenario: predictable spikes
scheduled auto-scaling
@theburningmonk theburningmonk.com
Scenario: predictable spikes
scheduled auto-scaling
the burst limits still
apply, facto...
@theburningmonk theburningmonk.com
Scenario: predictable spikes
@theburningmonk theburningmonk.com
Scenario: unpredictable spikes
Traffic generated by user
actions



Jennifer Aniston’s fi...
@theburningmonk theburningmonk.com
“if Lambda scaling is the problem…”
@theburningmonk theburningmonk.com
Client only needs an acknowledgement
https://lumigo.io/blog/the-why-when-and-how-of-api-gateway-service-proxies
@theburningmonk theburningmonk.com
multi-region, active-active
@theburningmonk theburningmonk.com
us-east-1
API Gateway Lambda DynamoDB Route53
@theburningmonk theburningmonk.com
eu-west-1
us-east-1
us-west-1
@theburningmonk theburningmonk.com
eu-west-1
us-east-1
us-west-1
GlobalTable
@theburningmonk theburningmonk.com
eu-west-1
us-east-1
us-west-1
GlobalTable
@theburningmonk theburningmonk.com
eu-central-1
us-east-1
us-east-1
SQS Lambda DynamoDB Lambda API Gateway
SNS
SNS
@theburningmonk theburningmonk.com
us-east-1
SQS Lambda DynamoDB Lambda API Gateway
eu-central-1
us-east-1
SNS
SNS
@theburningmonk theburningmonk.com
us-east-1
SQS Lambda DynamoDB Lambda API Gateway
eu-central-1
us-east-1
SNS
SNS
https://lumigo.io/blog/amazon-builders-library-in-focus-5-static-stability-using-availability-zones
@theburningmonk theburningmonk.com
us-east-1
SQS Lambda DynamoDB Lambda API Gateway
eu-central-1
us-east-1
SNS
SNS
dedupe
@theburningmonk theburningmonk.com
us-east-1
SQS Lambda DynamoDB Lambda API Gateway
us-east-1
SNS
eu-central-1
SNS
eu-cent...
@theburningmonk theburningmonk.com
us-east-1
SQS Lambda DynamoDB Lambda API Gateway
us-east-1
SNS
eu-central-1
SNS
eu-cent...
@theburningmonk theburningmonk.com
us-east-1
SQS Lambda DynamoDB Lambda API Gateway
us-east-1
SNS
eu-central-1
SNS
eu-cent...
@theburningmonk theburningmonk.com
us-east-1
SQS Lambda DynamoDB Lambda API Gateway
us-east-1
SNS
eu-central-1
SNS
eu-cent...
@theburningmonk theburningmonk.com
Multi-region architecture - benefits & tradeoffs
Protection against

regional failures
H...
CHAOS ENGINEERING
MUST KILL SERVERS!
RAWR!!
RAWR!!
@theburningmonk theburningmonk.com
“the discipline of experimenting on a system in order to build confidence in the
system’...
@theburningmonk theburningmonk.com
“You don't choose the moment, the moment chooses you!
You only choose how prepared you ...
@theburningmonk theburningmonk.com
identify weaknesses before they manifest in system-wide, aberrant behaviors
GOAL
@theburningmonk theburningmonk.com
learn about the system’s behavior by observing it during controlled experiments
HOW
@theburningmonk theburningmonk.com
learn about the system’s behavior by observing it during controlled experiments
HOW
g...
@theburningmonk theburningmonk.com
MUST KILL SERVERS!
RAWR!!
RAWR!!
ahhhhhhh!!!!
HELP!!!
OMG!!!
F***!!!
@theburningmonk theburningmonk.com
phew!
@theburningmonk theburningmonk.com
STEP 1.
define steady state
i.e. “what does normal look like”
@theburningmonk theburningmonk.com
STEP 2.
hypothesis that steady state continues in control and experimental group
e.g. “...
@theburningmonk theburningmonk.com
STEP 3.
inject realistic failures
e.g. “slow response from 3rd-party service”
@theburningmonk theburningmonk.com
STEP 4.
try to disprove hypothesis
i.e. “look for difference between control and experi...
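For step 3, a slow response from a third-party service can be simulated by wrapping the call that talks to it. The decorator below is a minimal sketch, not a specific chaos library; its name and parameters are hypothetical, and `rng` and `sleep` are injectable so the experiment can be tested deterministically.

```python
import random
import time
from functools import wraps
from typing import Any, Callable


def inject_latency(
    delay_s: float,
    rate: float,
    rng: Callable[[], float] = random.random,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Delay a fraction (rate, 0.0 to 1.0) of calls by delay_s seconds."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if rng() < rate:
                sleep(delay_s)  # simulate the slow dependency
            return fn(*args, **kwargs)

        return wrapper

    return decorator
```

Keeping `rate` low at first limits how many requests are affected while the hypothesis is being tested.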
DON’T START
EXPERIMENTS
IN PRODUCTION
@theburningmonk theburningmonk.com
identify weaknesses before they manifest in system-wide, aberrant behaviors
GOAL
@theburningmonk theburningmonk.com
“Corporation X lost millions due to a
chaos experiment that went wrong and
destroyed key inf...
@theburningmonk theburningmonk.com
Chaos Engineering doesn't cause problems. It reveals them.
Nora Jones
CONTAINMENT
CONTAINMENT
run experiments during office hours
CONTAINMENT
run experiments during office hours
let others know what you’re doing, no surprises
CONTAINMENT
run experiments during office hours
let others know what you’re doing, no surprises
avoid important dates
CONTAINMENT
run experiments during office hours
let others know what you’re doing, no surprises
avoid important dates
make ...
CONTAINMENT
run experiments during office hours
let others know what you’re doing, no surprises
avoid important dates
make ...
DON’T START
EXPERIMENTS
IN PRODUCTION
by Russ Miles @russmiles
source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
@theburningmonk theburningmonk.com
chaos monkey kills an
EC2 instance
latency monkey induces
artificial delay in APIs
chaos...
@theburningmonk theburningmonk.com
there are no servers to kill!
SERVERLESS
by Russ Miles @russmiles
source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
by Russ Miles @russmiles
source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
@theburningmonk theburningmonk.com
improperly tuned timeouts
@theburningmonk theburningmonk.com
missing error handling
@theburningmonk theburningmonk.com
missing fallbacks
@theburningmonk theburningmonk.com
“what if DynamoDB has an elevated error rate?”
@theburningmonk theburningmonk.com
hypothesis: the AWS SDK retries would handle it
DEMO TIME!
@theburningmonk theburningmonk.com
result: function times out after 6s
(hypothesis is disproved)
@theburningmonk theburningmonk.com
TIL: the js DynamoDB client defaults to 10 retries
with base delay of 50ms
@theburningmonk theburningmonk.com
TIL: the js DynamoDB client defaults to 10 retries
with base delay of 50ms
delay = Math...
@theburningmonk theburningmonk.com
action: set max retry count + fallback
DEMO TIME!
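The action above can be sketched generically: cap the number of retries so the function gives up well before its own timeout, then return a fallback instead of failing. This is not the AWS SDK's retry configuration; the helper name and parameters are hypothetical, and the backoff delay between attempts is left out for brevity.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_fallback(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    max_retries: int = 2,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Try the operation up to 1 + max_retries times, then use the fallback."""
    for _ in range(1 + max_retries):
        try:
            return operation()
        except retry_on:
            continue  # a real client would wait (back off) before retrying
    return fallback()
```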
@theburningmonk theburningmonk.com
outcome: a more resilient system
@theburningmonk theburningmonk.com
“what if service X has elevated latency?”
@theburningmonk theburningmonk.com
hypothesis: our try-catch would handle it
DEMO TIME!
@theburningmonk theburningmonk.com
result: function times out after 6s
(hypothesis is disproved)
@theburningmonk theburningmonk.com
TIL: most HTTP client libraries have default timeout of 60s.
API Gateway has an integra...
@theburningmonk theburningmonk.com
https://bit.ly/2Wvfort
@theburningmonk theburningmonk.com
DEMO TIME!
@theburningmonk theburningmonk.com
outcome: a more resilient system
recap
everything fails, all the time
@theburningmonk theburningmonk.com
“the capacity to recover quickly from difficulties; toughness.”
resilience
/rɪˈzɪlɪəns/
...
@theburningmonk theburningmonk.com
Serverless - multiple AZ’s out of the box
Total resources created:
1 API Gateway
1 Lamb...
@theburningmonk theburningmonk.com
Beware of timeouts
API Gateway

Integration timeout 

Default: 29s
Lambda

Timeout
Max:...
@theburningmonk theburningmonk.com
Offload computing operations to queues
@theburningmonk theburningmonk.com
“Decoupled Invocation”
@theburningmonk theburningmonk.com
no distributed
transactions
@theburningmonk theburningmonk.com
retry-until-success
@theburningmonk theburningmonk.com
retry-until-success
needs to deal with
poison messages
@theburningmonk theburningmonk.com
Mind the poison message
6, 3, 1, 1, 1, 1, …
only count the “same” batch
@theburningmonk theburningmonk.com
Load shedding
implement load shedding
prioritize newer messages with
a better chance to...
@theburningmonk theburningmonk.com
circuit breaker pattern
When circuit is open, fail fast
but, allow 1 request through ev...
@theburningmonk theburningmonk.com
The Saga pattern
A pattern for managing failures where each action
has a compensating a...
@theburningmonk theburningmonk.com
Mind the partial failures
@theburningmonk theburningmonk.com
Lambda autoscaling
Burst concurrency limits:

3000 – US West (Oregon), US East (N.
Virg...
@theburningmonk theburningmonk.com
Scenario: predictable spikes
scheduled auto-scaling
the burst limits still
apply, facto...
@theburningmonk theburningmonk.com
eu-west-1
us-east-1
us-west-1
GlobalTable
@theburningmonk theburningmonk.com
“the discipline of experimenting on a system in order to build confidence in the
system’...
by Russ Miles @russmiles
source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
by Russ Miles @russmiles
source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
@theburningmonk theburningmonk.com
https://theburningmonk.com/hire-me
Advise Training Delivery
“Fundamentally, Yan has improved our team by increasing our
abi...
@theburningmonk theburningmonk.com
lambdabestpractice.com bit.ly/complete-guide-to-aws-step-functions
20% off my courses
a...
@theburningmonk
theburningmonk.com
github.com/theburningmonk
Patterns and practices for building resilient Serverless applications

Patterns and practices for building resilient Serverless applications

Recording: https://www.youtube.com/watch?v=pSfKZRv3nhY

Real-world serverless podcast: https://realworldserverless.com
Learn Lambda best practices: https://lambdabestpractice.com
Blog: https://theburningmonk.com
Consulting services: https://theburningmonk.com/hire-me
Production-Ready Serverless workshop: https://productionreadyserverless.com

Patterns and practices for building resilient Serverless applications

  1. 1. Patterns and Practices for building resilient serverless applications presented by Yan Cui @theburningmonk
  2. 2. @theburningmonk theburningmonk.com
  3. 3. @theburningmonk theburningmonk.com “the capacity to recover quickly from difficulties; toughness.” resilience /rɪˈzɪlɪəns/ noun
  4. 4. @theburningmonk theburningmonk.com “the capacity to recover quickly from difficulties; toughness.” resilience /rɪˈzɪlɪəns/ noun it’s not about preventing failures!
  5. 5. everything fails, all the time
  6. 6. @theburningmonk theburningmonk.com we need to build applications that can withstand failures
  7. 7. @theburningmonk theburningmonk.com
  8. 8. @theburningmonk theburningmonk.com don’t run your application on one server…
  9. 9. @theburningmonk theburningmonk.com entire data centers can go down…
  10. 10. @theburningmonk theburningmonk.com run your application in multiple AZs and regions
  11. 11. @theburningmonk theburningmonk.com Failures on load: exhaustion of resources
  12. 12. @theburningmonk theburningmonk.com Failures on load: exhaustion of resources
  13. 13. @theburningmonk theburningmonk.com latency reqs/s Failures on load: exhaustion of resources CPU saturation
  14. 14. @theburningmonk theburningmonk.com Failures in distributed systems Service A Service B Service C user
  15. 15. @theburningmonk theburningmonk.com Failures in distributed systems Service A Service B Service C user
  16. 16. @theburningmonk theburningmonk.com Failures in distributed systems Service A Service B Service C user HTTP 502
  17. 17. @theburningmonk theburningmonk.com Failures in distributed systems Service A Service B Service C user You suck!
  18. 18. @theburningmonk theburningmonk.com microservices death stars circa 2015
  19. 19. Yan Cui http://theburningmonk.com @theburningmonk AWS user for 10 years
  20. 20. Yan Cui http://theburningmonk.com @theburningmonk http://bit.ly/yubl-serverless
  21. 21. Yan Cui http://theburningmonk.com @theburningmonk Developer Advocate @
  22. 22. Yan Cui http://theburningmonk.com @theburningmonk Independent Consultant advise training delivery
  23. 23. by Uwe Friedrichsen
  24. 24. @theburningmonk theburningmonk.com Lambda execution environment
  25. 25. @theburningmonk theburningmonk.com Serverless - multiple AZ’s out of the box Total resources created: 1 API Gateway 1 Lambda
  26. 26. @theburningmonk theburningmonk.com Serverless - multiple AZ’s out of the box Total resources created: 1 API Gateway 1 Lambda don’t pay for idle redundant resources!
  27. 27. @theburningmonk theburningmonk.com Load balancing
  28. 28. @theburningmonk theburningmonk.com Data replication in different AZ’s DynamoDB Global Tables
  29. 29. @theburningmonk theburningmonk.com There is throttling everywhere!
  30. 30. @theburningmonk theburningmonk.com Beware of timeout mismatch. API Gateway integration timeout, default: 29s. Lambda timeout, max: 15 minutes.
  31. 31. @theburningmonk theburningmonk.com Beware of timeout mismatch. Lambda timeout, max: 15 minutes. SQS visibility timeout, default: 30s, min: 0s, max: 12 hours.
  32. 32. @theburningmonk theburningmonk.com Beware of timeout mismatch. Lambda timeout, max: 15 minutes. SQS visibility timeout, default: 30s, min: 0s, max: 12 hours. Set VisibilityTimeout to 6x Lambda timeout.
  33. 33. @theburningmonk theburningmonk.com Offload computing operations to queues
  34. 34. @theburningmonk theburningmonk.com Offload computing operations to queues
  35. 35. @theburningmonk theburningmonk.com Offload computing operations to queues better absorb downstream problems
  36. 36. @theburningmonk theburningmonk.com Offload computing operations to queues need way to replay DLQ events
  37. 37. https://www.npmjs.com/package/lumigo-cli
  38. 38. @theburningmonk theburningmonk.com Offload computing operations to queues great for fire-and-forget tasks
  39. 39. @theburningmonk theburningmonk.com “what if the client is waiting for a response?”
  40. 40. @theburningmonk theburningmonk.com “Decoupled Invocation”
  41. 41. @theburningmonk theburningmonk.com task id created at result xxx xxx <null> xxx xxx <null> … … … task results not ready…
  42. 42. @theburningmonk theburningmonk.com task id created at result xxx xxx <null> xxx xxx <null> … … … task results not ready… 202
  43. 43. @theburningmonk theburningmonk.com task id created at result xxx xxx <null> xxx xxx <null> … … … task results reporting for duty!
  44. 44. @theburningmonk theburningmonk.com task id created at result xxx xxx <null> xxx xxx <null> … … … task results working hard… not ready…
  45. 45. @theburningmonk theburningmonk.com task id created at result xxx xxx <null> xxx xxx <null> … … … task results 202 working hard…
  46. 46. @theburningmonk theburningmonk.com task id created at result xxx xxx <null> xxx xxx { … } … … … task results done!
  47. 47. @theburningmonk theburningmonk.com task id created at result xxx xxx <null> xxx xxx { … } … … … task results done!
  48. 48. @theburningmonk theburningmonk.com task id created at result xxx xxx <null> xxx xxx { … } … … … task results 200 { … }
  49. 49. @theburningmonk theburningmonk.com wait…
  50. 50. @theburningmonk theburningmonk.com a distributed transaction!
  51. 51. @theburningmonk theburningmonk.com a distributed transaction! needs rollback
  52. 52. @theburningmonk theburningmonk.com no distributed transactions
  53. 53. @theburningmonk theburningmonk.com do the work here
  54. 54. @theburningmonk theburningmonk.com retry-until-success
  55. 55. @theburningmonk theburningmonk.com
  56. 56. @theburningmonk theburningmonk.com 24 hours data retention
  57. 57. @theburningmonk theburningmonk.com 24 hours data retention need alerting to ensure issues are addressed quickly
  58. 58. @theburningmonk theburningmonk.com retry-until-success needs to deal with poison messages
  59. 59. @theburningmonk theburningmonk.com what if you can’t avoid distributed transactions?
  60. 60. @theburningmonk theburningmonk.com The Saga pattern A pattern for managing failures where each action has a compensating action for rollback
  61. 61. @theburningmonk theburningmonk.com The Saga pattern https://www.youtube.com/watch?v=xDuwrtwYHu8
  62. 62. @theburningmonk theburningmonk.com The Saga pattern Begin transaction Start book hotel request End book hotel request Start book flight request End book flight request Start book car rental request End book car rental request End transaction
  63. 63. @theburningmonk theburningmonk.com The Saga pattern model both actions and compensating actions as Lambda functions
  64. 64. @theburningmonk theburningmonk.com The Saga pattern use Step Functions as the coordinator for the saga
  65. 65. @theburningmonk theburningmonk.com The Saga pattern Input
  66. 66. @theburningmonk theburningmonk.com The Saga pattern
  67. 67. @theburningmonk theburningmonk.com The Saga pattern
  68. 68. @theburningmonk theburningmonk.com The Saga pattern
  69. 69. @theburningmonk theburningmonk.com retry-until-success needs to deal with poison messages Mind the poison message
  70. 70. @theburningmonk theburningmonk.com Mind the poison message
  71. 71. @theburningmonk theburningmonk.com Mind the poison message
  72. 72. @theburningmonk theburningmonk.com Mind the poison message
  73. 73. @theburningmonk theburningmonk.com Mind the poison message 6, 3, 1, 1, 1, 1, …
  74. 74. @theburningmonk theburningmonk.com Mind the poison message 6, 3, 1, 1, 1, 1, … only count the “same” batch
  75. 75. @theburningmonk theburningmonk.com Mind the poison message
  76. 76. @theburningmonk theburningmonk.com Mind the poison message have to fetch from the stream
  77. 77. @theburningmonk theburningmonk.com Mind the poison message have to fetch from the stream do it before they expire from the stream!
  78. 78. @theburningmonk theburningmonk.com how do you prevent building up an insurmountable backlog?
  79. 79. @theburningmonk theburningmonk.com Load shedding implement load shedding prioritize newer messages with a better chance to succeed
  80. 80. @theburningmonk theburningmonk.com Load shedding excess load is sent to DLQ
  81. 81. @theburningmonk theburningmonk.com Load shedding process with a delay
  82. 82. @theburningmonk theburningmonk.com Mind the partial failures Lambda SQS
  83. 83. @theburningmonk theburningmonk.com Mind the partial failures Lambda SQS Poller
  84. 84. @theburningmonk theburningmonk.com Lambda SQS Poller Mind the partial failures Delete
  85. 85. @theburningmonk theburningmonk.com Mind the partial failures Lambda SQS Poller Error
  86. 86. @theburningmonk theburningmonk.com Mind the partial failures Lambda SQS Poller Error DLQ
  87. 87. @theburningmonk theburningmonk.com Mind the partial failures Lambda SQS Poller Error DLQ batch fails as a unit
  88. 88. https://lumigo.io/blog/sqs-and-lambda-the-missing-guide-on-failure-modes Mind the partial failures
  89. 89. @theburningmonk theburningmonk.com Mind the partial failures
  90. 90. @theburningmonk theburningmonk.com Mind the partial failures
  91. 91. @theburningmonk theburningmonk.com Mind the partial failures
  92. 92. @theburningmonk theburningmonk.com Mind the retry storm Service A
  93. 93. @theburningmonk theburningmonk.com Mind the retry storm Service A
  94. 94. @theburningmonk theburningmonk.com Mind the retry storm Service A retry retry retry retry
  95. 95. @theburningmonk theburningmonk.com Mind the retry storm Service A
  96. 96. @theburningmonk theburningmonk.com Mind the retry storm Service A
  97. 97. @theburningmonk theburningmonk.com Mind the retry storm Service A
  98. 98. @theburningmonk theburningmonk.com Mind the retry storm Service A
  99. 99. @theburningmonk theburningmonk.com retry storm
  100. 100. @theburningmonk theburningmonk.com circuit breaker pattern After X consecutive timeouts, trip the circuit
  101. 101. @theburningmonk theburningmonk.com circuit breaker pattern After X consecutive timeouts, trip the circuit When circuit is open, fail fast
  102. 102. @theburningmonk theburningmonk.com circuit breaker pattern When circuit is open, fail fast but, allow 1 request through every Y mins After X consecutive timeouts, trip the circuit
  103. 103. @theburningmonk theburningmonk.com circuit breaker pattern When circuit is open, fail fast but, allow 1 request through every Y mins If request succeeds, close the circuit After X consecutive timeouts, trip the circuit
  104. 104. @theburningmonk theburningmonk.com
  105. 105. @theburningmonk theburningmonk.com where do I keep the state of the circuit?
  106. 106. @theburningmonk theburningmonk.com in-memory Service A isOpen: false isOpen: false isOpen: false isOpen: false
  107. 107. @theburningmonk theburningmonk.com in-memory Service A isOpen: true isOpen: false isOpen: true isOpen: false
  108. 108. @theburningmonk theburningmonk.com in-memory
 PROS: simplicity
  109. 109. @theburningmonk theburningmonk.com in-memory
 PROS: simplicity; no dependency on external service (which requires another circuit breaker to protect… and adds cost & maintenance overhead (IAM, infra, etc.))
  110. 110. @theburningmonk theburningmonk.com in-memory
 PROS: simplicity; no dependency on external service
 CONS: takes longer & more requests to stop all traffic
  111. 111. @theburningmonk theburningmonk.com in-memory
 PROS: simplicity; no dependency on external service
 CONS: takes longer & more requests to stop all traffic; new containers would generate more traffic
  112. 112. @theburningmonk theburningmonk.com external service Service A isOpen: false
  113. 113. @theburningmonk theburningmonk.com external service Service A isOpen: true
  115. 115. @theburningmonk theburningmonk.com external service
 PROS: minimizes no. of total requests to trip circuit; new containers respect collective decision
 CONS: complexity; dependency on an external service
  116. 116. @theburningmonk theburningmonk.com which approach should I use?
  117. 117. @theburningmonk theburningmonk.com which approach should I use? It depends. Maybe start with the simplest solution first?
  118. 118. @theburningmonk theburningmonk.com Lambda autoscaling
 Burst concurrency limits:
 3000 – US West (Oregon), US East (N. Virginia), Europe (Ireland)
 1000 – Asia Pacific (Tokyo), Europe (Frankfurt)
 500 – Other Regions
 Burst: 500 new instances per minute

  119. 119. @theburningmonk theburningmonk.com Lambda autoscaling
 Burst concurrency limits:
 3000 – US West (Oregon), US East (N. Virginia), Europe (Ireland)
 1000 – Asia Pacific (Tokyo), Europe (Frankfurt)
 500 – Other Regions
 Burst: 500 new instances per minute
 Standard burst concurrency limits when over the provisioned capacity

  120. 120. @theburningmonk theburningmonk.com Lambda autoscaling
 Burst concurrency limits:
 3000 – US West (Oregon), US East (N. Virginia), Europe (Ireland)
 1000 – Asia Pacific (Tokyo), Europe (Frankfurt)
 500 – Other Regions
 Burst: 500 new instances per minute
 Adjustable provisioned capacity based on CloudWatch metrics
 Standard burst concurrency limits when over the provisioned capacity

  121. 121. @theburningmonk theburningmonk.com Lambda limitations & throttling
 Concurrent executions: 1000*
 Timeout: 15 minutes
 Burst concurrency: 500 - 3000
 Burst: 500 new instances / minute
 * Can be increased with support ticket
  122. 122. @theburningmonk theburningmonk.com Lambda limitations & throttling
 good for spiky traffic, up to a point
 Concurrent executions: 1000*
 Timeout: 15 minutes
 Burst concurrency: 500 - 3000
 Burst: 500 new instances / minute
 * Can be increased with support ticket
  123. 123. @theburningmonk theburningmonk.com “what if my traffic is more spiky than that?”
  124. 124. @theburningmonk theburningmonk.com Scenario: predictable spikes
 Holidays, weekends, celebrations (Black Friday)
 Planned launch of resources (new series available)
 Sport events
  125. 125. @theburningmonk theburningmonk.com Scenario: predictable spikes scheduled auto-scaling
  126. 126. @theburningmonk theburningmonk.com Scenario: predictable spikes scheduled auto-scaling the burst limits still apply, so take the ramp-up time into account
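A rough back-of-the-envelope helper for the timing point above, using the figures from the earlier slides (a regional burst limit of 500 to 3000, then 500 new instances per minute). The function name is hypothetical, and it assumes a cold start from zero and an account concurrency limit already raised above the target.

```javascript
// Minutes needed to reach `target` concurrent executions from zero,
// given an initial burst limit and a fixed ramp rate afterwards.
function minutesToReach(target, burstLimit, perMinute = 500) {
  if (target <= burstLimit) return 0; // covered by the initial burst
  return Math.ceil((target - burstLimit) / perMinute);
}

// Reaching 10,000 concurrent executions in a 3000-burst region:
console.log(minutesToReach(10000, 3000)); // 14
// The same target in a 500-burst region:
console.log(minutesToReach(10000, 500)); // 19
```

Under these assumptions, scheduled scaling for a known spike would need to start at least that many minutes ahead of it.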
  127. 127. @theburningmonk theburningmonk.com Scenario: predictable spikes
  128. 128. @theburningmonk theburningmonk.com Scenario: unpredictable spikes
 Traffic generated by user actions
 Jennifer Aniston’s first post
  129. 129. @theburningmonk theburningmonk.com “if Lambda scaling is the problem…”
  130. 130. @theburningmonk theburningmonk.com Client only needs an acknowledgement
  131. 131. https://lumigo.io/blog/the-why-when-and-how-of-api-gateway-service-proxies
  132. 132. @theburningmonk theburningmonk.com multi-region, active-active
  133. 133. @theburningmonk theburningmonk.com us-east-1 API Gateway Lambda DynamoDB Route53
  134. 134. @theburningmonk theburningmonk.com eu-west-1 us-east-1 us-west-1
  135. 135. @theburningmonk theburningmonk.com eu-west-1 us-east-1 us-west-1 GlobalTable
  137. 137. @theburningmonk theburningmonk.com eu-central-1 us-east-1 us-east-1 SQS Lambda DynamoDB Lambda API Gateway SNS SNS
  138. 138. @theburningmonk theburningmonk.com us-east-1 SQS Lambda DynamoDB Lambda API Gateway eu-central-1 us-east-1 SNS SNS
  140. 140. https://lumigo.io/blog/amazon-builders-library-in-focus-5-static-stability-using-availability-zones
  141. 141. @theburningmonk theburningmonk.com us-east-1 SQS Lambda DynamoDB Lambda API Gateway eu-central-1 us-east-1 SNS SNS dedupe
  142. 142. @theburningmonk theburningmonk.com us-east-1 SQS Lambda DynamoDB Lambda API Gateway us-east-1 SNS eu-central-1 SNS eu-central-1 SQS Lambda DynamoDB Lambda API Gateway Global Table
  146. 146. @theburningmonk theburningmonk.com Multi-region architecture - benefits & tradeoffs
 Protection against regional failures
 Higher complexity
 Very hard to test
  147. 147. CHAOS ENGINEERING
  148. 148. MUST KILL SERVERS! RAWR!! RAWR!!
  149. 149. @theburningmonk theburningmonk.com “the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production” principlesofchaos.org
  150. 150. @theburningmonk theburningmonk.com “You don't choose the moment, the moment chooses you! You only choose how prepared you are when it does.” Fire Chief Mike Burtch
  151. 151. @theburningmonk theburningmonk.com identify weaknesses before they manifest in system-wide, aberrant behaviors GOAL
  152. 152. @theburningmonk theburningmonk.com learn about the system’s behavior by observing it during controlled experiments HOW
  153. 153. @theburningmonk theburningmonk.com learn about the system’s behavior by observing it during controlled experiments HOW game days failure injection
  154. 154. @theburningmonk theburningmonk.com MUST KILL SERVERS! RAWR!! RAWR!! ahhhhhhh!!!! HELP!!! OMG!!! F***!!!
  155. 155. @theburningmonk theburningmonk.com phew!
  156. 156. @theburningmonk theburningmonk.com STEP 1. define steady state i.e. “what does normal look like”
  157. 157. @theburningmonk theburningmonk.com STEP 2. hypothesize that steady state continues in both the control and the experimental group e.g. “the system stays up if a server dies”
  158. 158. @theburningmonk theburningmonk.com STEP 3. inject realistic failures e.g. “slow response from 3rd-party service”
  159. 159. @theburningmonk theburningmonk.com STEP 4. try to disprove hypothesis i.e. “look for differences between the control and experimental groups”
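For step 3, one way to inject a "slow response" without touching the real dependency is to wrap the call. This is only a sketch: `withInjectedLatency` and its options are hypothetical names, not a library API, and the random source and sleep function are injectable so the wrapper itself can be tested.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Wrap an async function so that a fraction (`rate`, 0 to 1) of calls
// is delayed by `delayMs` before the real call is made.
function withInjectedLatency(fn, { rate, delayMs, random = Math.random, sleep = defaultSleep }) {
  return async (...args) => {
    if (random() < rate) {
      await sleep(delayMs);
    }
    return fn(...args);
  };
}
```

Applying the wrapper to only a share of requests gives the experimental group, while unwrapped calls serve as the control group for step 4.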
  160. 160. DON’T START EXPERIMENTS IN PRODUCTION
  161. 161. @theburningmonk theburningmonk.com identify weaknesses before they manifest in system-wide, aberrant behaviors GOAL
  162. 162. @theburningmonk theburningmonk.com “Corporation X lost millions due to a chaos experiment that went wrong and destroyed key infrastructure, resulting in hours of downtime and unrecoverable data loss.”
  163. 163. @theburningmonk theburningmonk.com Chaos Engineering doesn't cause problems. It reveals them. Nora Jones
  164. 164. CONTAINMENT
  165. 165. CONTAINMENT run experiments during office hours
  166. 166. CONTAINMENT run experiments during office hours let others know what you’re doing, no surprises
  167. 167. CONTAINMENT run experiments during office hours let others know what you’re doing, no surprises avoid important dates
  168. 168. CONTAINMENT run experiments during office hours let others know what you’re doing, no surprises avoid important dates make the smallest change possible
  169. 169. CONTAINMENT run experiments during office hours let others know what you’re doing, no surprises avoid important dates make the smallest change possible have a rollback plan before you start
  170. 170. DON’T START EXPERIMENTS IN PRODUCTION
  171. 171. by Russ Miles @russmiles source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  172. 172. @theburningmonk theburningmonk.com
 chaos monkey: kills an EC2 instance
 latency monkey: induces artificial delay in APIs
 chaos gorilla: kills an AWS Availability Zone
 chaos kong: kills an entire AWS region
  174. 174. @theburningmonk theburningmonk.com there are no servers to kill! SERVERLESS
  175. 175. by Russ Miles @russmiles source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  177. 177. @theburningmonk theburningmonk.com improperly tuned timeouts
  178. 178. @theburningmonk theburningmonk.com missing error handling
  179. 179. @theburningmonk theburningmonk.com missing fallbacks
  181. 181. @theburningmonk theburningmonk.com “what if DynamoDB has an elevated error rate?”
  182. 182. @theburningmonk theburningmonk.com hypothesis: the AWS SDK retries would handle it
  183. 183. DEMO TIME!
  184. 184. @theburningmonk theburningmonk.com result: function times out after 6s (hypothesis is disproved)
  185. 185. @theburningmonk theburningmonk.com TIL: the js DynamoDB client defaults to 10 retries with base delay of 50ms
  186. 186. @theburningmonk theburningmonk.com TIL: the js DynamoDB client defaults to 10 retries with base delay of 50ms
 delay = Math.random() * (Math.pow(2, retryCount) * base)
 this is Marc Brooker’s fav formula!
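To see why these defaults can outlast a 6s function timeout, the formula can be summed over all retries. A sketch, assuming `retryCount` starts at 0 for the first retry; the function names are illustrative, and the time spent on the requests themselves is not counted.

```javascript
// The slide's formula; `random` is injectable so the bound can be computed.
function retryDelay(retryCount, base, random = Math.random) {
  return random() * (Math.pow(2, retryCount) * base);
}

// Upper bound on the total time spent waiting across `maxRetries` retries
// (every random draw taken as 1).
function worstCaseTotalDelay(maxRetries, base) {
  let total = 0;
  for (let retryCount = 0; retryCount < maxRetries; retryCount++) {
    total += retryDelay(retryCount, base, () => 1);
  }
  return total;
}

console.log(worstCaseTotalDelay(10, 50)); // 51150 (ms)
console.log(worstCaseTotalDelay(3, 50));  // 350 (ms)
```

With 10 retries and a 50ms base, the waiting alone is bounded by about 51 seconds and averages about half of that, so the function timeout is hit long before the retries are exhausted.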
  188. 188. @theburningmonk theburningmonk.com action: set max retry count + fallback
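A generic sketch of this action: cap the number of attempts and return a fallback value instead of failing. `withRetriesAndFallback` is a hypothetical helper, not the code shown in the demo. With the AWS SDK for JavaScript v2, the retry cap itself can be set through the client's `maxRetries` option, in which case only the fallback part would be needed.

```javascript
// Try `operation` up to 1 + maxRetries times; if every attempt fails,
// return `fallback()` instead of throwing.
async function withRetriesAndFallback(operation, { maxRetries, fallback }) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (err) {
      // Swallowed here for brevity; a real handler would log and back off.
    }
  }
  return fallback();
}
```

The fallback could be a cached or default response, depending on what the caller can tolerate.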
  189. 189. DEMO TIME!
  190. 190. @theburningmonk theburningmonk.com outcome: a more resilient system
  191. 191. @theburningmonk theburningmonk.com “what if service X has elevated latency?”
  192. 192. @theburningmonk theburningmonk.com hypothesis: our try-catch would handle it
  193. 193. DEMO TIME!
  194. 194. @theburningmonk theburningmonk.com result: function times out after 6s (hypothesis is disproved)
  195. 195. @theburningmonk theburningmonk.com TIL: most HTTP client libraries have a default timeout of 60s. API Gateway has an integration timeout of 29s. Most Lambda functions default to a timeout of 3-6s.
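Since the client's default timeout is longer than the function's, the try-catch never gets a chance to run. One possible fix is to derive the request timeout from the time left in the invocation. The helper names and the 500ms reserve below are assumptions for illustration.

```javascript
// Time budget for one downstream call: what is left of the invocation,
// minus a reserve for error handling and the fallback.
function requestTimeoutMs(remainingMs, reserveMs = 500) {
  return Math.max(0, remainingMs - reserveMs);
}

// Reject if `promise` has not settled within `ms` milliseconds.
// Note: this does not cancel the underlying request.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Inside a Lambda handler the budget could come from `context.getRemainingTimeInMillis()`, for example `withTimeout(callServiceX(), requestTimeoutMs(context.getRemainingTimeInMillis()))`, where `callServiceX` stands in for the real call.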
  198. 198. @theburningmonk theburningmonk.com https://bit.ly/2Wvfort
  201. 201. DEMO TIME!
  202. 202. @theburningmonk theburningmonk.com outcome: a more resilient system
  203. 203. recap
  204. 204. everything fails, all the time
  205. 205. @theburningmonk theburningmonk.com “the capacity to recover quickly from difficulties; toughness.” resilience /rɪˈzɪlɪəns/ noun
  206. 206. @theburningmonk theburningmonk.com Serverless - multiple AZs out of the box
 Total resources created: 1 API Gateway, 1 Lambda
  207. 207. @theburningmonk theburningmonk.com Beware of timeouts
 API Gateway integration timeout: default 29s
 Lambda timeout: max 15 minutes
 SQS visibility timeout: default 30s, min 0s, max 12 hours
  208. 208. @theburningmonk theburningmonk.com Offload computing operations to queues
  209. 209. @theburningmonk theburningmonk.com “Decoupled Invocation”
  210. 210. @theburningmonk theburningmonk.com no distributed transactions
  211. 211. @theburningmonk theburningmonk.com retry-until-success
  213. 213. @theburningmonk theburningmonk.com retry-until-success needs to deal with poison messages
  214. 214. @theburningmonk theburningmonk.com Mind the poison message 6, 3, 1, 1, 1, 1, … only count the “same” batch
  215. 215. @theburningmonk theburningmonk.com Load shedding
 implement load shedding
 prioritize newer messages with a better chance to succeed
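One simple form of load shedding is to drop messages that have waited too long before doing any work on them. The sketch below assumes SQS records as delivered to Lambda, where `attributes.SentTimestamp` is the send time in epoch milliseconds as a string. The function name and the age threshold are illustrative.

```javascript
// Split SQS records into those still worth processing and those to shed.
function shedStale(records, maxAgeMs, now = Date.now()) {
  const fresh = [];
  const shed = [];
  for (const record of records) {
    const sentAt = Number(record.attributes.SentTimestamp);
    if (now - sentAt <= maxAgeMs) {
      fresh.push(record);
    } else {
      shed.push(record);
    }
  }
  return { fresh, shed };
}
```

Shed messages could be logged or moved to a dead-letter queue rather than silently discarded, depending on the use case.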
  216. 216. @theburningmonk theburningmonk.com circuit breaker pattern
 After X consecutive timeouts, trip the circuit
 When circuit is open, fail fast, but allow 1 request through every Y mins
 If request succeeds, close the circuit
  217. 217. @theburningmonk theburningmonk.com The Saga pattern A pattern for managing failures where each action has a compensating action for rollback
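The idea can be sketched in a few lines: run the actions in order, and if one fails, run the compensating actions of the steps that already completed, in reverse. `runSaga` and the step shape (`action`, `compensate`) are assumptions for illustration, not an API from the talk.

```javascript
// Run steps in order; if one fails, run the compensations of the completed
// steps in reverse order, then rethrow the original error.
async function runSaga(steps) {
  const completed = [];
  for (const step of steps) {
    try {
      await step.action();
      completed.push(step);
    } catch (err) {
      for (const done of completed.reverse()) {
        await done.compensate();
      }
      throw err;
    }
  }
}
```

This sketch does not handle a compensation that itself fails, which is the partial-failure case the next slide warns about.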
  218. 218. @theburningmonk theburningmonk.com Mind the partial failures
  219. 219. @theburningmonk theburningmonk.com Lambda autoscaling
 Burst concurrency limits:
 3000 – US West (Oregon), US East (N. Virginia), Europe (Ireland)
 1000 – Asia Pacific (Tokyo), Europe (Frankfurt)
 500 – Other Regions
 Burst: 500 new instances per minute
 Adjustable provisioned capacity based on CloudWatch metrics
 Standard burst concurrency limits when over the provisioned capacity

  220. 220. @theburningmonk theburningmonk.com Scenario: predictable spikes scheduled auto-scaling the burst limits still apply, so take the ramp-up time into account
  221. 221. @theburningmonk theburningmonk.com eu-west-1 us-east-1 us-west-1 GlobalTable
  222. 222. @theburningmonk theburningmonk.com “the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production” principlesofchaos.org
  223. 223. by Russ Miles @russmiles source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  226. 226. https://theburningmonk.com/hire-me Advise Training Delivery “Fundamentally, Yan has improved our team by increasing our ability to derive value from AWS and Lambda in particular.” Nick Blair, Tech Lead
  227. 227. @theburningmonk theburningmonk.com lambdabestpractice.com bit.ly/complete-guide-to-aws-step-functions 20% off my courses aws-delhi-may2020
  228. 228. @theburningmonk theburningmonk.com github.com/theburningmonk
