SELF-HEALING SERVERLESS APPLICATIONS
Glue Conference 2018
NATE TAGGART
The Promise:
AWS | LAMBDA FEATURES PAGE
"AWS Lambda invokes your code only when needed and automatically scales to support the rate of incoming requests without requiring you to configure anything. There is no limit to the number of requests your code can handle."
SELF-HEALING SERVERLESS APPLICATIONS | PG2
The Reality:
AWS | LAMBDA FEATURES PAGE
"AWS Lambda invokes your code only when needed and automatically scales to support the rate of incoming requests without requiring you to configure anything. There is no limit to the number of requests your code can handle."
(The slide overlays hand-drawn "suggested edits" on this copy, qualifiers such as "sometimes," "certain," "every," "can... but," and "are properly," marking where the promise needs hedging in practice.)
SELF-HEALING SERVERLESS APPLICATIONS | PG3
What to expect when you’re not expecting.
SELF-HEALING SERVERLESS APPLICATIONS | PG4
Common Serverless Failures
FOR LAMBDA-BASED ARCHITECTURES
(The following slides give each failure type's description and default behavior.)

FAILURE TYPES
• Runtime Error:
  • Uncaught Exception
  • Timeout
  • Bad State
• Scaling:
  • Concurrency Limits
  • Spawn Limits
  • Bottlenecking
SELF-HEALING SERVERLESS APPLICATIONS | PG5
FAILURE TYPE: Uncaught Exception (Runtime Error)

DESCRIPTION
An event triggers your Lambda to run, but raises an unhandled exception in your code.

DEFAULT BEHAVIOR
Synchronous invocations:
• Function fails
• Returns error to caller
• Logs timestamp, error message, & stack trace to CloudWatch
Asynchronous invocations:
• Retries up to three times (or more if reading from a stream)
• Caller is unaware of error
• Logs timestamp, error message, & stack trace to CloudWatch
SELF-HEALING SERVERLESS APPLICATIONS | PG6
FAILURE TYPE: Timeout (Runtime Error)

DESCRIPTION
An event triggers your Lambda to run, but execution does not complete within the configured maximum execution time. (Lambda’s default configuration is a 3-second timeout.)

DEFAULT BEHAVIOR
Synchronous invocations:
• Lambda returns error to caller (if client hasn’t timed out)
• Logs timestamp and error message to CloudWatch
Asynchronous invocations:
• Retries up to three times (more if reading from a stream)
• Caller is unaware of error
• Logs timestamp & error message to CloudWatch
SELF-HEALING SERVERLESS APPLICATIONS | PG7
FAILURE TYPE: Bad State (Runtime Error)

DESCRIPTION
An event triggers your Lambda to run, but the message is malformed or state is improperly provided, causing unexpected behavior.

DEFAULT BEHAVIOR
When noisy:
• Behaves as an Uncaught Exception
• Visible in CloudWatch, but may be difficult to diagnose without event visibility
When silent:
• Unexpected application behavior
• Can be lost permanently
• Can tank performance and dramatically spike costs
SELF-HEALING SERVERLESS APPLICATIONS | PG8
FAILURE TYPE: Concurrency Limits (Scaling)

DESCRIPTION
Your application becomes throttled because it requires more concurrently running Lambda instances than AWS allows for your account. Your compute can’t scale high enough.

DEFAULT BEHAVIOR
Unbuffered invocations:
• Fails to invoke
• No retry
• Visible in CloudWatch metrics, but not in logs
Buffered invocations:
• Initially fails to invoke
• Will eventually continue reading from the stream as volume drops
SELF-HEALING SERVERLESS APPLICATIONS | PG9
FAILURE TYPE: Spawn Limits (Scaling)

DESCRIPTION
Your application becomes throttled because it requires new Lambda instances faster than AWS allows them to spawn for your account. Your compute can’t scale fast enough.

DEFAULT BEHAVIOR
Unbuffered invocations:
• Fails to invoke
• No retry
• Visible in CloudWatch metrics, but not in logs (and really non-obvious)
Buffered invocations:
• Initially fails to invoke
• Will eventually continue reading from the stream as volume drops
SELF-HEALING SERVERLESS APPLICATIONS | PG10
FAILURE TYPE: Bottlenecking (Scaling)

DESCRIPTION
Your application is throttled due to throughput pressure upstream or downstream of your Lambda. Your architecture can’t scale enough.

DEFAULT BEHAVIOR
Upstream bottlenecks:
• Fails to invoke
• No retry
• Visible in CloudWatch, as long as you know where to look
Downstream bottlenecks:
• Can throw errors, time out, and/or distribute failures to other functions
• Can cause cascading failures
• Can tank performance and dramatically spike costs
SELF-HEALING SERVERLESS APPLICATIONS | PG11
Introducing:
Self-Healing Serverless Applications
SELF-HEALING SERVERLESS APPLICATIONS | PG12
Self-Healing Design Principles
LEADING PRACTICES FOR RESILIENT SYSTEMS

STANDARDIZE
• Introduce universal instrumentation
• Collect event-centric diagnostics
• Give everyone visibility

PLAN FOR FAILURE
• Identify service limits
• Use self-throttling
• Consider alternative resource types

FAIL GRACEFULLY
• Reroute and unblock
• Automate known solutions
• Notify a human

SELF-HEALING SERVERLESS APPLICATIONS | PG13
Learn to fail.
SELF-HEALING SERVERLESS APPLICATIONS | PG14
Scenario: Uncaught Exceptions
WHEN THINGS BREAK AND YOU DON’T KNOW WHY

PROBLEM
Lambda periodically fails. Error messages and stack traces are visible in CloudWatch logs. Failing events are lost, making reproduction difficult.

KEY PRINCIPLES
• Introduce universal instrumentation
• Collect event-centric diagnostics
• Give everyone visibility

SOLUTION
• Use function wrapper or decorator pattern
• Capture and log events which fail
SELF-HEALING SERVERLESS APPLICATIONS | PG15
Decrease time to resolution by capturing event data.
Event Diagnostics Wrapper Example
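The code screenshot for this slide isn't reproduced in the transcript. Below is a minimal sketch of the idea in Python, assuming a standard Lambda handler: a decorator that logs the full triggering event along with the stack trace whenever the handler raises, so failed events can be replayed later. The handler body is a placeholder.

```python
import functools
import json
import logging
import traceback

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def capture_failed_events(func):
    """Log the full triggering event whenever the wrapped handler raises."""
    @functools.wraps(func)
    def wrapper(event, context):
        try:
            return func(event, context)
        except Exception:
            # Capture the event payload so the failure can be reproduced later.
            logger.error(
                "Unhandled exception.\nEvent: %s\n%s",
                json.dumps(event, default=str),
                traceback.format_exc(),
            )
            raise  # preserve Lambda's default error/retry behavior

    return wrapper

@capture_failed_events
def handler(event, context):
    # ... business logic goes here ...
    return {"statusCode": 200}
```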
SELF-HEALING SERVERLESS APPLICATIONS | PG16
Scenario: Upstream bottleneck
WHEN YOUR LAMBDAS AREN’T GETTING INVOKED

PROBLEM
API Gateway hits throughput limits and fails to invoke Lambda on every request.

KEY PRINCIPLES
• Identify service limits
• Use self-throttling
• Notify a human

SOLUTION
• Implement retries with exponential backoff logic for 429 responses (sketched below)
• Raise an alarm on the 4XXError metric (sketched below)
SELF-HEALING SERVERLESS APPLICATIONS | PG17
Don’t overlook client-side solutions to backend failures.
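The deck doesn't include the retry code itself, so here is a minimal client-side sketch, assuming Python and the requests library (any HTTP client works the same way): calls that come back as 429 (throttled) are retried with exponential backoff plus jitter. The endpoint URL and attempt limit are hypothetical.

```python
import random
import time

import requests  # assumption: the caller is a Python client using requests

def call_with_backoff(url, payload, max_attempts=5):
    """POST to an API Gateway endpoint, backing off when throttled (HTTP 429)."""
    for attempt in range(max_attempts):
        response = requests.post(url, json=payload, timeout=10)
        if response.status_code != 429:
            response.raise_for_status()  # surface other 4XX/5XX errors
            return response.json()
        # Throttled: wait 2^attempt seconds plus jitter before retrying.
        time.sleep((2 ** attempt) + random.uniform(0, 1))
    raise RuntimeError("Still throttled after {} attempts".format(max_attempts))
```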
SELF-HEALING SERVERLESS APPLICATIONS | PG18
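For the "raise an alarm on 4XXError" part of the solution, one possible shape is a CloudWatch alarm on API Gateway's 4XXError metric, sketched here with boto3; the API name, threshold, and SNS topic ARN are placeholders for your own resources.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the API returns an unusual number of 4XX responses in a minute.
cloudwatch.put_metric_alarm(
    AlarmName="api-gateway-4xx-spike",                    # hypothetical alarm name
    Namespace="AWS/ApiGateway",
    MetricName="4XXError",
    Dimensions=[{"Name": "ApiName", "Value": "my-api"}],  # hypothetical API name
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=25,
    ComparisonOperator="GreaterThanThreshold",
    # Notify a human via an SNS topic (placeholder ARN).
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```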
Scenario: Timeouts
WHEN EXECUTION TAKES TOO LONG

PROBLEM
Lambda is periodically timing out.

KEY PRINCIPLES
• Introduce universal instrumentation
• Use self-throttling
• Consider alternative resource types

SOLUTION
• Use function wrapper or decorator pattern
• Evaluate Fargate or alternative long-running resources
SELF-HEALING SERVERLESS APPLICATIONS | PG19
Enforce your own limits.
Timeout Wrapper Example
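The wrapper code from this slide isn't in the transcript either. A minimal sketch of one way to self-enforce a limit in a Python handler: use the Lambda context's get_remaining_time_in_millis() to schedule a SIGALRM slightly before Lambda's hard timeout, so the function can log the offending event before it is killed. The buffer size and handler body are placeholders.

```python
import functools
import logging
import signal

logger = logging.getLogger()
logger.setLevel(logging.INFO)

BUFFER_MS = 500  # bail out this long before Lambda's configured timeout

class SelfTimeout(Exception):
    pass

def _alarm(signum, frame):
    raise SelfTimeout()

def enforce_timeout(func):
    """Raise (and log) our own timeout just before Lambda's hard limit."""
    @functools.wraps(func)
    def wrapper(event, context):
        remaining_ms = context.get_remaining_time_in_millis() - BUFFER_MS
        signal.signal(signal.SIGALRM, _alarm)
        signal.setitimer(signal.ITIMER_REAL, max(remaining_ms, 1) / 1000.0)
        try:
            return func(event, context)
        except SelfTimeout:
            logger.error("Self-imposed timeout reached. Event: %r", event)
            raise
        finally:
            signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the timer
    return wrapper

@enforce_timeout
def handler(event, context):
    # ... potentially slow work goes here ...
    return {"ok": True}
```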
SELF-HEALING SERVERLESS APPLICATIONS | PG20
Scenario: Stream processing gets “stuck”
WHEN FAILURES ARE BLOCKING THE REST OF THE STREAM

PROBLEM
Lambda exceptions and/or timeouts are blocking processing of a Kinesis shard.

KEY PRINCIPLES
• Reroute and unblock
• Automate known solutions
• Consider alternative resource types

SOLUTION
• Introduce state machine-type logic
• Move bad messages to an alternate stream (sketched below)
• Potentially architect with Fargate or SNS
SELF-HEALING SERVERLESS APPLICATIONS | PG21
Small failures are preferable to large ones.
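The deck doesn't show code for this scenario; here is a minimal sketch, assuming Python, boto3, and a hypothetical secondary Kinesis stream used as a dead-letter destination. Records that fail to process are rerouted and logged instead of failing the whole batch, so the shard keeps moving.

```python
import base64
import json
import logging

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

kinesis = boto3.client("kinesis")
DEAD_LETTER_STREAM = "my-app-poison-records"  # hypothetical stream name

def handler(event, context):
    """Process a Kinesis batch; reroute bad records instead of blocking the shard."""
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        try:
            process(json.loads(payload))
        except Exception:
            logger.exception("Rerouting bad record %s",
                             record["kinesis"]["sequenceNumber"])
            kinesis.put_record(
                StreamName=DEAD_LETTER_STREAM,
                Data=payload,
                PartitionKey=record["kinesis"]["partitionKey"],
            )
    # Returning normally marks the batch as processed, so the shard is unblocked.
    return {"processed": len(event["Records"])}

def process(message):
    # ... business logic for a single message goes here ...
    pass
```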
Scenario: Downstream bottleneck
WHEN LAMBDA IS OUT-SCALING YOUR DATABASE

PROBLEM
Your Lambdas have scaled up but are depleting your RDS database connection pools.

KEY PRINCIPLES
• Identify service limits
• Automate known solutions
• Give everyone visibility

SOLUTION
• Always close database connections (sketched below)
• Scale your database
• Map your dependencies
SELF-HEALING SERVERLESS APPLICATIONS | PG22
Scale dependencies, too.
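As a sketch of the "always close database connections" point, assuming a MySQL-flavored RDS instance and the pymysql driver (both assumptions), with connection details read from environment variables: the connection is released in a finally block even when the query fails, so idle Lambda containers don't drain the pool.

```python
import os

import pymysql  # assumption: MySQL-compatible RDS reached via the pymysql driver

def handler(event, context):
    """Open a connection per invocation and always release it, even on failure."""
    connection = pymysql.connect(
        host=os.environ["DB_HOST"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        database=os.environ["DB_NAME"],
        connect_timeout=5,
    )
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
            return {"result": cursor.fetchone()}
    finally:
        # Without this, idle containers keep connections open and exhaust the pool.
        connection.close()
```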
Thank you!
@stackeryio
