Self-Healing Serverless Applications (Stackery @ GlueCon 2018)

Serverless applications increasingly involve distributed systems in which errors and bottlenecks can have significant downstream impact. The ephemeral nature of FaaS offerings compounds the problem, because errors can be difficult to diagnose after the fact. In this session we'll discuss instrumentation and "self-healing" architectural patterns that improve the resiliency of your application and give you better observability and performance.

Published in: Internet

  1. SELF-HEALING SERVERLESS APPLICATIONS
      Glue Conference 2018
      NATE TAGGART
  2. The Promise (AWS | Lambda features page):
      "AWS Lambda invokes your code only when needed and automatically scales to support the rate of incoming requests without requiring you to configure anything. There is no limit to the number of requests your code can handle."
  3. The Reality (AWS | Lambda features page, with suggested edits):
      "AWS Lambda invokes your code only when needed and automatically scales to support the rate of incoming requests without requiring you to configure anything. There is no limit to the number of requests your code can handle."
      Suggested edits marked on the slide (insertion points not preserved): "architecture", "sometimes", "certain", "every", "can", "but", "are properly".
  4. What to expect when you’re not expecting.
  5. Common Serverless Failures
      FOR LAMBDA-BASED ARCHITECTURES
      FAILURE TYPES (each with a DESCRIPTION and DEFAULT BEHAVIOR):
        • Runtime Error:
            • Uncaught Exception
            • Timeout
            • Bad State
        • Scaling:
            • Concurrency Limits
            • Spawn Limits
            • Bottlenecking
  6. Common Serverless Failures: Runtime Error, Uncaught Exception
      DESCRIPTION: An event triggers your Lambda to run, but your code raises an unhandled exception.
      DEFAULT BEHAVIOR, synchronous invocations:
        • Function fails
        • Returns error to caller
        • Logs timestamp, error message, and stack trace to CloudWatch
      DEFAULT BEHAVIOR, asynchronous invocations:
        • Retries twice, for up to three attempts in total (more if reading from a stream)
        • Caller is unaware of error
        • Logs timestamp, error message, and stack trace to CloudWatch
  7. Common Serverless Failures: Runtime Error, Timeout
      DESCRIPTION: An event triggers your Lambda to run, but execution does not complete within the configured maximum execution time. (Lambda’s default configuration is a 3-second timeout.)
      DEFAULT BEHAVIOR, synchronous invocations:
        • Lambda returns error to caller (if client hasn’t timed out)
        • Logs timestamp and error message to CloudWatch
      DEFAULT BEHAVIOR, asynchronous invocations:
        • Retries twice, for up to three attempts in total (more if reading from a stream)
        • Caller is unaware of error
        • Logs timestamp and error message to CloudWatch
  8. Common Serverless Failures: Runtime Error, Bad State
      DESCRIPTION: An event triggers your Lambda to run, but the message is malformed or state is improperly provided, causing unexpected behavior.
      DEFAULT BEHAVIOR, when noisy:
        • Behaves as Uncaught Exception
        • Visible in CloudWatch, but may be difficult to diagnose without event visibility
      DEFAULT BEHAVIOR, when silent:
        • Unexpected application behavior
        • Can be lost permanently
        • Can tank performance and dramatically spike costs
  9. Common Serverless Failures: Scaling, Concurrency Limits
      DESCRIPTION: Your application is throttled because it needs more concurrently running Lambda instances than AWS allows for your account. Your compute can’t scale high enough.
      DEFAULT BEHAVIOR, unbuffered invocations:
        • Fails to invoke
        • No retry
        • Visible in CloudWatch metrics, but not in logs
      DEFAULT BEHAVIOR, buffered invocations:
        • Initially fails to invoke
        • Will eventually continue reading from stream as volume drops
  10. Common Serverless Failures: Scaling, Spawn Limits
      DESCRIPTION: Your application is throttled because it needs to spawn new Lambda instances faster than AWS allows for your account. Your compute can’t scale fast enough.
      DEFAULT BEHAVIOR, unbuffered invocations:
        • Fails to invoke
        • No retry
        • Visible in CloudWatch metrics, nothing in logs (but really non-obvious)
      DEFAULT BEHAVIOR, buffered invocations:
        • Initially fails to invoke
        • Will eventually continue reading from stream as volume drops
  11. Common Serverless Failures: Scaling, Bottlenecking
      DESCRIPTION: Your application is throttled due to throughput pressure upstream or downstream of your Lambda. Your architecture can’t scale enough.
      DEFAULT BEHAVIOR, upstream bottlenecks:
        • Fails to invoke
        • No retry
        • Visible in CloudWatch, as long as you know where to look
      DEFAULT BEHAVIOR, downstream bottlenecks:
        • Can throw error, timeout, and/or distribute failures to other functions
        • Can cause cascading failures
        • Can tank performance and dramatically spike costs
  12. Introducing: Self-Healing Serverless Applications
  13. Self-Healing Design Principles
      LEADING PRACTICES FOR RESILIENT SYSTEMS
      Learn to fail.
      STANDARDIZE
        • Introduce universal instrumentation
        • Collect event-centric diagnostics
        • Give everyone visibility
      PLAN FOR FAILURE
        • Identify service limits
        • Use self-throttling
        • Consider alternative resource types
      FAIL GRACEFULLY
        • Reroute and unblock
        • Automate known solutions
        • Notify a human
  15. Scenario: Uncaught Exceptions
      WHEN THINGS BREAK AND YOU DON’T KNOW WHY
      PROBLEM: Lambda periodically fails. Error messages and stack traces are visible in CloudWatch logs. Failing events are lost, making reproduction difficult.
      KEY PRINCIPLES:
        • Introduce universal instrumentation
        • Collect event-centric diagnostics
        • Give everyone visibility
      SOLUTION:
        • Use function wrapper or decorator pattern
        • Capture and log events which fail
      Decrease time to resolution by capturing event data.
  16. Event Diagnostics Wrapper Example
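A minimal sketch of what an event diagnostics wrapper could look like in Python, not the code shown on the slide. It follows the decorator pattern named in the scenario: when the handler raises, the triggering event is logged next to the stack trace and the exception is re-raised so Lambda's default error handling still applies. The names `capture_failed_events` and `handler`, and the `order_id` field, are hypothetical.

```python
import functools
import json
import logging

logger = logging.getLogger("event_diagnostics")


def capture_failed_events(handler):
    """Log the triggering event whenever the wrapped handler raises."""

    @functools.wraps(handler)
    def wrapper(event, context):
        try:
            return handler(event, context)
        except Exception:
            # Record the event alongside the stack trace so the failure
            # can be reproduced later from the log entry alone.
            logger.exception("handler failed; event=%s", json.dumps(event, default=str))
            raise  # re-raise so the invocation is still reported as failed

    return wrapper


@capture_failed_events
def handler(event, context):
    # Hypothetical business logic: fails on events without an "order_id".
    return {"order_id": event["order_id"]}
```

Because the wrapper only adds logging, it can be applied to every function in an application without changing behavior, which is what makes the instrumentation "universal".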
  17. Scenario: Upstream bottleneck
      WHEN YOUR LAMBDAS AREN’T GETTING INVOKED
      PROBLEM: API Gateway hits throughput limits and fails to invoke Lambda on every request.
      KEY PRINCIPLES:
        • Identify service limits
        • Use self-throttling
        • Notify a human
      SOLUTION:
        • Implement retries with exponential backoff logic for 429 responses
        • Raise alarm on: 4XXError
      Don’t overlook client-side solutions to backend failures.
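The client-side retry logic from this scenario can be sketched as follows, assuming a hypothetical `send` callable that performs one request and returns a `(status_code, body)` pair. The random jitter and the default delays are assumptions added for illustration; the slide only asks for exponential backoff on 429 responses.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call send() until it returns a status other than 429 or attempts run out.

    send is any zero-argument callable returning (status_code, body).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter: wait a random time of up to
            # base_delay * 2**attempt seconds, capped at max_delay.
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
    return status, body  # still throttled after the final attempt
```

Passing `sleep` as a parameter keeps the function easy to exercise without real waiting; a caller that still receives 429 after the last attempt can surface the error to the user.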
  19. Scenario: Timeouts
      WHEN EXECUTION TAKES TOO LONG
      PROBLEM: Lambda is periodically timing out.
      KEY PRINCIPLES:
        • Introduce universal instrumentation
        • Use self-throttling
        • Consider alternative resource types
      SOLUTION:
        • Use function wrapper or decorator pattern
        • Evaluate Fargate or alternative long-running resources
      Enforce your own limits.
  20. Timeout Wrapper Example
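A minimal sketch of a timeout wrapper in Python, not the code shown on the slide. It reads the remaining time from the Lambda context object (`get_remaining_time_in_millis()`) and uses a Unix alarm signal to interrupt the handler shortly before Lambda's own timeout, so the function gets a chance to log or clean up. The names `enforce_timeout` and `SelfImposedTimeout`, the 500 ms margin, and the `work_seconds` field are assumptions; the approach also assumes the handler runs on the main thread of a Unix process.

```python
import functools
import signal
import time


class SelfImposedTimeout(Exception):
    """Raised when the handler exceeds the deadline we set for ourselves."""


def enforce_timeout(margin_ms=500):
    """Abort the handler margin_ms before Lambda's own timeout would hit."""

    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            budget_s = (context.get_remaining_time_in_millis() - margin_ms) / 1000.0
            if budget_s <= 0:
                raise SelfImposedTimeout("no time budget left before invoking handler")

            def on_alarm(signum, frame):
                raise SelfImposedTimeout(f"handler exceeded {budget_s:.3f}s budget")

            previous = signal.signal(signal.SIGALRM, on_alarm)
            signal.setitimer(signal.ITIMER_REAL, budget_s)
            try:
                return handler(event, context)
            finally:
                signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
                signal.signal(signal.SIGALRM, previous)  # restore the old handler

        return wrapper

    return decorator


@enforce_timeout(margin_ms=500)
def handler(event, context):
    time.sleep(event.get("work_seconds", 0))  # stand-in for real work
    return "done"
```

Raising an exception inside the function, rather than letting Lambda kill it, means the failure arrives with a stack trace and can be handled like any other runtime error.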
  21. Scenario: Stream processing gets “stuck”
      WHEN FAILURES ARE BLOCKING THE REST OF THE STREAM
      PROBLEM: Lambda exceptions and/or timeouts are blocking processing of a Kinesis shard.
      KEY PRINCIPLES:
        • Reroute and unblock
        • Automate known solutions
        • Consider alternative resource types
      SOLUTION:
        • Introduce state machine-type logic
        • Move bad messages to alternate stream
        • Potentially architect with Fargate or SNS
      Small failures are preferable to large ones.
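The "move bad messages to alternate stream" step can be sketched as follows, assuming records arrive in the Kinesis event shape that Lambda delivers (`Records[i].kinesis.data` base64-encoded, with a `partitionKey`). `process` and `divert` are hypothetical callables: `process` is the per-message business logic, and `divert` would typically wrap a write to the alternate stream.

```python
import base64
import json


def process_batch(event, process, divert):
    """Process every Kinesis record; hand failures to divert() instead of raising.

    Returning normally lets the batch complete, so one bad record no
    longer blocks the rest of the shard.
    """
    diverted = 0
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        try:
            process(json.loads(payload))
        except Exception as exc:
            # divert is a hypothetical callable, e.g. a thin wrapper that
            # writes the raw payload to the alternate stream for later review.
            divert(payload, record["kinesis"]["partitionKey"], repr(exc))
            diverted += 1
    return {"processed": len(event["Records"]) - diverted, "diverted": diverted}
```

The trade-off is that a diverted message is handled out of order; that is the small failure accepted in exchange for keeping the shard moving.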
  22. Scenario: Downstream bottleneck
      WHEN LAMBDA IS OUT-SCALING YOUR DATABASE
      PROBLEM: Your Lambdas have scaled up but are depleting your RDS database connection pools.
      KEY PRINCIPLES:
        • Identify service limits
        • Automate known solutions
        • Give everyone visibility
      SOLUTION:
        • Always close database connections
        • Scale your database
        • Map your dependencies
      Scale dependencies, too.
  23. Thank you!
  24. @stackeryio
