SELF-HEALING SERVERLESS APPLICATIONS
Glue Conference 2018
NATE TAGGART
The Promise:
AWS | LAMBDA FEATURES PAGE
"AWS Lambda invokes your code only when needed and automatically scales to support the rate of incoming requests without requiring you to configure anything. There is no limit to the number of requests your code can handle."
SELF-HEALING SERVERLESS APPLICATIONS | PG2
The Reality:
AWS | LAMBDA FEATURES PAGE
"AWS Lambda invokes your code only when needed and automatically scales to support the rate of incoming requests without requiring you to configure anything. There is no limit to the number of requests your code can handle."
(The slide overlays hand-drawn "suggested edits" on this copy, qualifiers such as "sometimes," "certain," "every," "can... but," and "are properly," marking where the promise needs hedging in practice.)
SELF-HEALING SERVERLESS APPLICATIONS | PG3
What to expect when you’re not expecting.
SELF-HEALING SERVERLESS APPLICATIONS | PG4
Common Serverless Failures
FOR LAMBDA-BASED ARCHITECTURES
(The following slides give each failure type's description and default behavior.)

FAILURE TYPES
• Runtime Error:
  • Uncaught Exception
  • Timeout
  • Bad State
• Scaling:
  • Concurrency Limits
  • Spawn Limits
  • Bottlenecking
SELF-HEALING SERVERLESS APPLICATIONS | PG5
FAILURE TYPE: Uncaught Exception (Runtime Error)

DESCRIPTION
An event triggers your Lambda to run, but raises an unhandled exception in your code.

DEFAULT BEHAVIOR
Synchronous invocations:
• Function fails
• Returns error to caller
• Logs timestamp, error message, & stack trace to CloudWatch
Asynchronous invocations:
• Retries up to three times (or more if reading from a stream)
• Caller is unaware of error
• Logs timestamp, error message, & stack trace to CloudWatch
SELF-HEALING SERVERLESS APPLICATIONS | PG6
FAILURE TYPE: Timeout (Runtime Error)

DESCRIPTION
An event triggers your Lambda to run, but execution does not complete within the configured maximum execution time. (Lambda’s default configuration is a 3-second timeout.)

DEFAULT BEHAVIOR
Synchronous invocations:
• Lambda returns error to caller (if client hasn’t timed out)
• Logs timestamp and error message to CloudWatch
Asynchronous invocations:
• Retries up to three times (more if reading from a stream)
• Caller is unaware of error
• Logs timestamp & error message to CloudWatch
SELF-HEALING SERVERLESS APPLICATIONS | PG7
FAILURE TYPE: Bad State (Runtime Error)

DESCRIPTION
An event triggers your Lambda to run, but the message is malformed or state is improperly provided, causing unexpected behavior.

DEFAULT BEHAVIOR
When noisy:
• Behaves as an Uncaught Exception
• Visible in CloudWatch, but may be difficult to diagnose without event visibility
When silent:
• Unexpected application behavior
• Can be lost permanently
• Can tank performance and dramatically spike costs
SELF-HEALING SERVERLESS APPLICATIONS | PG8
FAILURE TYPE: Concurrency Limits (Scaling)

DESCRIPTION
Your application becomes throttled because it requires more concurrently running Lambda instances than AWS allows for your account. Your compute can’t scale high enough.

DEFAULT BEHAVIOR
Unbuffered invocations:
• Fails to invoke
• No retry
• Visible in CloudWatch metrics, but not in logs
Buffered invocations:
• Initially fails to invoke
• Will eventually continue reading from the stream as volume drops
SELF-HEALING SERVERLESS APPLICATIONS | PG9
FAILURE TYPE: Spawn Limits (Scaling)

DESCRIPTION
Your application becomes throttled because it requires new Lambda instances faster than AWS allows them to spawn for your account. Your compute can’t scale fast enough.

DEFAULT BEHAVIOR
Unbuffered invocations:
• Fails to invoke
• No retry
• Visible in CloudWatch metrics, but not in logs (and really non-obvious)
Buffered invocations:
• Initially fails to invoke
• Will eventually continue reading from the stream as volume drops
SELF-HEALING SERVERLESS APPLICATIONS | PG10
FAILURE TYPE: Bottlenecking (Scaling)

DESCRIPTION
Your application is throttled due to throughput pressure upstream or downstream of your Lambda. Your architecture can’t scale enough.

DEFAULT BEHAVIOR
Upstream bottlenecks:
• Fails to invoke
• No retry
• Visible in CloudWatch, as long as you know where to look
Downstream bottlenecks:
• Can throw errors, time out, and/or distribute failures to other functions
• Can cause cascading failures
• Can tank performance and dramatically spike costs
SELF-HEALING SERVERLESS APPLICATIONS | PG11
Introducing:
Self-Healing Serverless Applications
SELF-HEALING SERVERLESS APPLICATIONS | PG12
Self-Healing Design Principles
LEADING PRACTICES FOR RESILIENT SYSTEMS

STANDARDIZE
• Introduce universal instrumentation
• Collect event-centric diagnostics
• Give everyone visibility

PLAN FOR FAILURE
• Identify service limits
• Use self-throttling
• Consider alternative resource types

FAIL GRACEFULLY
• Reroute and unblock
• Automate known solutions
• Notify a human

SELF-HEALING SERVERLESS APPLICATIONS | PG13
Learn to fail.
SELF-HEALING SERVERLESS APPLICATIONS | PG14
Scenario: Uncaught Exceptions
WHEN THINGS BREAK AND YOU DON’T KNOW WHY

PROBLEM
Lambda periodically fails. Error messages and stack traces are visible in CloudWatch logs. Failing events are lost, making reproduction difficult.

KEY PRINCIPLES
• Introduce universal instrumentation
• Collect event-centric diagnostics
• Give everyone visibility

SOLUTION
• Use function wrapper or decorator pattern
• Capture and log events which fail
SELF-HEALING SERVERLESS APPLICATIONS | PG15
Decrease time to resolution by capturing event data.
Event Diagnostics Wrapper Example
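The code screenshot for this slide isn't reproduced in the transcript. Below is a minimal sketch of the idea in Python, assuming a standard Lambda handler: a decorator that logs the full triggering event along with the stack trace whenever the handler raises, so failed events can be replayed later. The handler body is a placeholder.

```python
import functools
import json
import logging
import traceback

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def capture_failed_events(func):
    """Log the full triggering event whenever the wrapped handler raises."""
    @functools.wraps(func)
    def wrapper(event, context):
        try:
            return func(event, context)
        except Exception:
            # Capture the event payload so the failure can be reproduced later.
            logger.error(
                "Unhandled exception.\nEvent: %s\n%s",
                json.dumps(event, default=str),
                traceback.format_exc(),
            )
            raise  # preserve Lambda's default error/retry behavior

    return wrapper

@capture_failed_events
def handler(event, context):
    # ... business logic goes here ...
    return {"statusCode": 200}
```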
SELF-HEALING SERVERLESS APPLICATIONS | PG16
Scenario: Upstream bottleneck
WHEN YOUR LAMBDAS AREN’T GETTING INVOKED

PROBLEM
API Gateway hits throughput limits and fails to invoke Lambda on every request.

KEY PRINCIPLES
• Identify service limits
• Use self-throttling
• Notify a human

SOLUTION
• Implement retries with exponential backoff logic for 429 responses (sketched below)
• Raise an alarm on the 4XXError metric (sketched below)
SELF-HEALING SERVERLESS APPLICATIONS | PG17
Don’t overlook client-side solutions to backend failures.
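The deck doesn't include the retry code itself, so here is a minimal client-side sketch, assuming Python and the requests library (any HTTP client works the same way): calls that come back as 429 (throttled) are retried with exponential backoff plus jitter. The endpoint URL and attempt limit are hypothetical.

```python
import random
import time

import requests  # assumption: the caller is a Python client using requests

def call_with_backoff(url, payload, max_attempts=5):
    """POST to an API Gateway endpoint, backing off when throttled (HTTP 429)."""
    for attempt in range(max_attempts):
        response = requests.post(url, json=payload, timeout=10)
        if response.status_code != 429:
            response.raise_for_status()  # surface other 4XX/5XX errors
            return response.json()
        # Throttled: wait 2^attempt seconds plus jitter before retrying.
        time.sleep((2 ** attempt) + random.uniform(0, 1))
    raise RuntimeError("Still throttled after {} attempts".format(max_attempts))
```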
SELF-HEALING SERVERLESS APPLICATIONS | PG18
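For the "raise an alarm on 4XXError" part of the solution, one possible shape is a CloudWatch alarm on API Gateway's 4XXError metric, sketched here with boto3; the API name, threshold, and SNS topic ARN are placeholders for your own resources.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the API returns an unusual number of 4XX responses in a minute.
cloudwatch.put_metric_alarm(
    AlarmName="api-gateway-4xx-spike",                    # hypothetical alarm name
    Namespace="AWS/ApiGateway",
    MetricName="4XXError",
    Dimensions=[{"Name": "ApiName", "Value": "my-api"}],  # hypothetical API name
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=25,
    ComparisonOperator="GreaterThanThreshold",
    # Notify a human via an SNS topic (placeholder ARN).
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```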
Scenario: Timeouts
WHEN EXECUTION TAKES TOO LONG

PROBLEM
Lambda is periodically timing out.

KEY PRINCIPLES
• Introduce universal instrumentation
• Use self-throttling
• Consider alternative resource types

SOLUTION
• Use function wrapper or decorator pattern
• Evaluate Fargate or alternative long-running resources
SELF-HEALING SERVERLESS APPLICATIONS | PG19
Enforce your own limits.
Timeout Wrapper Example
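The wrapper code from this slide isn't in the transcript either. A minimal sketch of one way to self-enforce a limit in a Python handler: use the Lambda context's get_remaining_time_in_millis() to schedule a SIGALRM slightly before Lambda's hard timeout, so the function can log the offending event before it is killed. The buffer size and handler body are placeholders.

```python
import functools
import logging
import signal

logger = logging.getLogger()
logger.setLevel(logging.INFO)

BUFFER_MS = 500  # bail out this long before Lambda's configured timeout

class SelfTimeout(Exception):
    pass

def _alarm(signum, frame):
    raise SelfTimeout()

def enforce_timeout(func):
    """Raise (and log) our own timeout just before Lambda's hard limit."""
    @functools.wraps(func)
    def wrapper(event, context):
        remaining_ms = context.get_remaining_time_in_millis() - BUFFER_MS
        signal.signal(signal.SIGALRM, _alarm)
        signal.setitimer(signal.ITIMER_REAL, max(remaining_ms, 1) / 1000.0)
        try:
            return func(event, context)
        except SelfTimeout:
            logger.error("Self-imposed timeout reached. Event: %r", event)
            raise
        finally:
            signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the timer
    return wrapper

@enforce_timeout
def handler(event, context):
    # ... potentially slow work goes here ...
    return {"ok": True}
```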
SELF-HEALING SERVERLESS APPLICATIONS | PG20
Scenario: Stream processing gets “stuck”
WHEN FAILURES ARE BLOCKING THE REST OF THE STREAM

PROBLEM
Lambda exceptions and/or timeouts are blocking processing of a Kinesis shard.

KEY PRINCIPLES
• Reroute and unblock
• Automate known solutions
• Consider alternative resource types

SOLUTION
• Introduce state machine-type logic
• Move bad messages to an alternate stream (sketched below)
• Potentially architect with Fargate or SNS
SELF-HEALING SERVERLESS APPLICATIONS | PG21
Small failures are preferable to large ones.
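The deck doesn't show code for this scenario; here is a minimal sketch, assuming Python, boto3, and a hypothetical secondary Kinesis stream used as a dead-letter destination. Records that fail to process are rerouted and logged instead of failing the whole batch, so the shard keeps moving.

```python
import base64
import json
import logging

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

kinesis = boto3.client("kinesis")
DEAD_LETTER_STREAM = "my-app-poison-records"  # hypothetical stream name

def handler(event, context):
    """Process a Kinesis batch; reroute bad records instead of blocking the shard."""
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        try:
            process(json.loads(payload))
        except Exception:
            logger.exception("Rerouting bad record %s",
                             record["kinesis"]["sequenceNumber"])
            kinesis.put_record(
                StreamName=DEAD_LETTER_STREAM,
                Data=payload,
                PartitionKey=record["kinesis"]["partitionKey"],
            )
    # Returning normally marks the batch as processed, so the shard is unblocked.
    return {"processed": len(event["Records"])}

def process(message):
    # ... business logic for a single message goes here ...
    pass
```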
Scenario: Downstream bottleneck
WHEN LAMBDA IS OUT-SCALING YOUR DATABASE

PROBLEM
Your Lambdas have scaled up but are depleting your RDS database connection pools.

KEY PRINCIPLES
• Identify service limits
• Automate known solutions
• Give everyone visibility

SOLUTION
• Always close database connections (sketched below)
• Scale your database
• Map your dependencies
SELF-HEALING SERVERLESS APPLICATIONS | PG22
Scale dependencies, too.
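As a sketch of the "always close database connections" point, assuming a MySQL-flavored RDS instance and the pymysql driver (both assumptions), with connection details read from environment variables: the connection is released in a finally block even when the query fails, so idle Lambda containers don't drain the pool.

```python
import os

import pymysql  # assumption: MySQL-compatible RDS reached via the pymysql driver

def handler(event, context):
    """Open a connection per invocation and always release it, even on failure."""
    connection = pymysql.connect(
        host=os.environ["DB_HOST"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        database=os.environ["DB_NAME"],
        connect_timeout=5,
    )
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
            return {"result": cursor.fetchone()}
    finally:
        # Without this, idle containers keep connections open and exhaust the pool.
        connection.close()
```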
Thank you!
@stackeryio
