NORDICS
DGI Byen’s CPH Conference
2024
NORDICS
Building resilient serverless workloads:
Navigating through failures
JIMMY DAHLQVIST | 2024-05-07
Thank You!
@jimmydahlqvist
Hello, I'm
JIMMY DAHLQVIST
Serverless enthusiast
Head of AWS @ Sigma Technology Cloud
Founder of serverless-handbook.com
Blogging on jimmydqv.com
AWS Ambassador | AWS Community Builder | User Group Leader
Resources
• https://serverless-handbook.com
• Architecture patterns
• Solutions
• Workshops
Agenda
• What is serverless and resiliency
• Architecting resilient systems – Good practices
• Summary
What is serverless?
• Automatic and flexible scaling
• No capacity planning
• High Availability
• Pay-for-use billing
What is resiliency?
The ability of a software solution to handle the impact of problems, and recover from
turbulent conditions, when other parts of the system fail.
“Everything fails all the time”
Dr. Werner Vogels, CTO, Amazon.com
Understand AWS Services
• Everything has a limit
• Understand how services work under the hood
Resiliency testing
• Chaos Engineering
• AWS Fault Injection Service
• Start in QA
• Don’t forget about data
Web application
Do we need an immediate response?
Storage-First
• Data-centric design
• Durability and availability
• Scalable System Design
• Asynchronous processing
Storage-First: Things to consider
• Architectural complexity
• Eventual consistency
• Design for idempotency
• Risk of over-optimization
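A minimal sketch of what "design for idempotency" can look like, assuming every message carries a unique `id` field. The class name `IdempotentProcessor` and the in-memory dictionary are illustrative only; in a real serverless workload the processed-ID record would live in a durable store shared by all execution environments.

```python
from typing import Any, Callable, Dict


class IdempotentProcessor:
    """Runs a handler at most once per message id and replays the stored result."""

    def __init__(self, handler: Callable[[Dict[str, Any]], Any]) -> None:
        self._handler = handler
        # Assumption: an in-memory dict stands in for a durable, shared store.
        self._results: Dict[str, Any] = {}

    def process(self, message: Dict[str, Any]) -> Any:
        key = message["id"]  # hypothetical field carrying the idempotency key
        if key in self._results:
            # Duplicate delivery: return the earlier result, skip side effects.
            return self._results[key]
        result = self._handler(message)
        self._results[key] = result
        return result
```

With asynchronous, at-least-once delivery the same message can arrive twice; calling `process` with the same `id` a second time returns the stored result instead of running the handler again.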
Queue Load Leveling
• System stability
• Handle unexpected spikes
• Protect downstream resources
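The effect of load leveling can be illustrated with a small simulation. This is only a sketch: the function name `level_load` and the notion of a "tick" are assumptions made for the example, and the `deque` stands in for a managed queue. A burst of work arrives, but the consumer never processes more than `max_per_tick` items at a time, so the downstream resource sees a flat load while the queue absorbs the spike.

```python
from collections import deque
from typing import List, Tuple


def level_load(arrivals_per_tick: List[int], max_per_tick: int) -> Tuple[List[int], List[int]]:
    """Simulate a queue between a bursty producer and a rate-limited consumer.

    Returns (items processed per tick, backlog left in the queue per tick).
    """
    if max_per_tick <= 0:
        raise ValueError("max_per_tick must be positive")

    queue = deque()
    processed_per_tick: List[int] = []
    backlog_per_tick: List[int] = []
    next_id = 0
    tick = 0

    # Keep ticking until all arrivals are enqueued and the queue is drained.
    while tick < len(arrivals_per_tick) or queue:
        arrivals = arrivals_per_tick[tick] if tick < len(arrivals_per_tick) else 0
        for _ in range(arrivals):
            queue.append(next_id)
            next_id += 1

        # The consumer takes at most max_per_tick items, however big the spike.
        batch = [queue.popleft() for _ in range(min(max_per_tick, len(queue)))]
        processed_per_tick.append(len(batch))
        backlog_per_tick.append(len(queue))
        tick += 1

    return processed_per_tick, backlog_per_tick
```

For example, `level_load([10, 0, 0, 0], max_per_tick=3)` spreads a burst of ten items over four ticks rather than pushing all ten downstream at once.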
Decoupling
Retries
• Selfish
• Exponential backoff
• Users can make it worse
DLQ
Retries with backoff and jitter
Figure panels: No Jitter | With Jitter
Image: Amazon Architecture blog (https://tinyurl.com/y48t2v4h)
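A minimal sketch of retries with exponential backoff and jitter. It assumes the "full jitter" variant, where each delay is drawn uniformly between zero and the capped exponential value. The function names and default values are illustrative, not taken from the talk.

```python
import random
import time
from typing import Any, Callable


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0, rng=random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))


def retry_with_backoff(
    operation: Callable[[], Any],
    max_attempts: int = 5,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng=random,
) -> Any:
    """Call operation until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Randomized delay so many clients do not retry in lockstep.
            sleep(backoff_delay(attempt, base, cap, rng))
```

Without the random component, clients that failed at the same moment would all retry at the same moment again; the jitter spreads those retries out over time.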
Circuit breaker
Circuit breaker
Half Open
Circuit breaker
• Avoid cascading failures
• Protect system resources
• Risk of early circuit break
• Good observability required
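A minimal in-process sketch of the circuit breaker states (closed, open, half open). The class, its thresholds, and the injectable clock are assumptions for illustration. In a serverless setup the state would normally be kept in a store shared between execution environments, since memory is not shared across function instances.

```python
import time
from typing import Any, Callable


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = 0.0
        self.state = "closed"

    def call(self, operation: Callable[[], Any]) -> Any:
        if self.state == "open":
            if self._clock() - self._opened_at >= self.reset_timeout:
                # Timeout elapsed: let a single trial call through.
                self.state = "half_open"
            else:
                raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = operation()
        except Exception:
            self._record_failure()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self.state = "closed"
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed trial call, or too many consecutive failures, opens the circuit.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

The "risk of early circuit break" on the slide maps to `failure_threshold` being set too low, and observability matters because an open circuit silently rejects calls unless state changes are logged or emitted as metrics.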
Put it all together
Notification Service
Payment Service
What we talked about
• Design for failure
• Buffer and store messages first
• Process asynchronously
• Level the load
• Retry on failures
• Break if integrations are not healthy
dahlqvistjimmy
https://serverless-handbook.com
https://jimmydqv.com


Editor's Notes

  1. Navigating failures! Building resilient serverless workloads! That is what we are going to talk about here today: failures! We are going to talk about how we can architect and build resilient serverless systems. We will not talk about preventing failures; instead we will talk about recovering from them, and some good architectures that can help you on your way. We will start with a serverless workload and add to it to make it more resilient.
  2. Some of you might now be thinking: but hey, doesn't serverless on AWS come with built-in availability and resiliency? Isn’t that one of the strengths of serverless? It's resilient out of the box? And you are absolutely right, serverless services from AWS have availability and resiliency built into them.
  3. This was a short talk then. So, thank you very much for listening and see you around... Or is there more to resilient systems than that?
  4. Serverless services do come with high availability and resiliency built in. But that is on the service level: Lambda is highly available and resilient, Step Functions is highly available and resilient, EventBridge is highly available and resilient. If all components in our systems were serverless, that would be great. However, that is not the case. In 95% of all systems that I have designed or worked on there have been components that are not serverless. It can be the need for a relational database, and yes, I do argue that not all data models and search requirements can be designed for DynamoDB. It can be that we need to integrate with a 3rd party, and their API and connections come with quotas and throttling. It could be that we as developers are unaware of certain traits of AWS services that make our serverless services break. And sometimes our users are even our worst enemy. When we are building our serverless systems we must always remember that there are components involved that don't scale as fast, or to the same degree, as serverless components do.
  5. Hi! I'm Jimmy! I have worked with AWS and serverless since 2015, almost a decade now, and I have seen all kinds of strange things. I’m a true serverless enthusiast; the very first solution I built on AWS was serverless and I have not looked back since. I have built serverless solutions for a variety of companies, from startups to large enterprises. I'm the founder of serverless-handbook.com, where you can find all kinds of serverless things that I have built, ranging from workshops to small architecture patterns. And I have my blog on jimmydqv.com. As a daytime job, and yes, I do have a daytime job, I know people have been questioning that, I work as Head of AWS at Sigma Technology Cloud. We are an advanced services partner with AWS and do all kinds of fun solutions. If you would like to know more about us, visit our booth outside... I'm an AWS Ambassador, an AWS Community Builder, and one of the user group leaders for the Scania user group.
  6. So, when I say serverless I mean services that come with automatic scaling, little to no capacity planning, built-in high availability, and a pay-for-use billing model. This is my definition of serverless, and I’m sure many of you can agree with that.
  7. Looking at services from AWS, in the red corner we have the serverless services: API Gateway, Lambda, SQS, Step Functions. These are services that we can throw a ton of work at and they will just happily scale up and down, without any capacity planning, to handle our traffic. In the blue corner we have the managed services. These are services like Amazon Aurora, Fargate, OpenSearch, and Kinesis Data Streams. They can scale basically to infinity, but they do require some capacity planning for that to happen, and if we plan incorrectly we can be throttled or even have failing requests. And yes, I do categorize Kinesis Data Streams as managed, as we need to plan the number of shards. Kinesis Firehose, on the other hand, would be a serverless service. Then there is the server corner; that would be anything with EC2 instances. We don’t talk about them...
  8. So, what is resiliency? Sometimes it gets confusing, and people mix up resiliency with reliability. As I said in the beginning, resiliency is not about preventing failures, it's about recovering from them. It’s about making sure our system maintains an acceptable level of service even when other parts of our system are not healthy. It's about gracefully dealing with failures. Reliability focuses on preventing the failure from happening in the first place, while resiliency is about recovering from it.
  9. This is by far one of my favorite quotes by Dr. Werner Vogels. Because this is real life! When running large distributed systems, everything will eventually fail. We can have downstream services that are not responding as we expect; they can be having health problems. Or we can be throttled by a 3rd party or even by our own services. We need to design our system not with the mindset “what happens if this fails” but instead “how can we keep running and recover WHEN this fails”.
  10. It's important that the cracks that form when components in our system fail don't spread. That they don’t take down our entire system. We need ways to handle and contain the cracks. That way we can isolate and protect our entire system. Cracks often form where our serverless systems integrate with non-serverless components. In some cases it can be obvious: your system interacts with an Amazon Aurora database. Other times it's not that clear: the system integrates with a 3rd party API or does encryption using KMS. Both of these scenarios can lead to throttling that can affect our system and start forming cracks if not handled properly. How does our system handle an integration point that is not responding, especially during a period of high load? This can easily start creating cracks that can bring our entire system to a halt or make us start losing data.
  11. When we build serverless systems we must remember that every API in AWS has a limit. We can store application properties in Systems Manager Parameter Store, and a few of them might be sensitive and encrypted with KMS. What now happens is that we can get throttled by a different service without realizing it. SSM might have a higher limit, but getting an encrypted value would then be impacted by the KMS limit. If we then don't design our functions correctly, and call SSM in the Lambda handler on every invocation, we would quickly get throttled and build up a hefty bill. Instead we could load properties in the initialization phase. IF LUC IN AUDIENCE!!! Or if we call Secrets Manager on every invocation, that can quickly throttle and build a huge bill. Luc where are you my friend? I’m sure Luc can tell you more about that…. Understanding how AWS services work under the hood, to some extent, is extremely important, so our systems don't fail due to some unknown kink. For example, when consuming a Kinesis Data Stream with a Lambda function, if processing an item in a batch fails, the entire batch will fail. The batch would then be sent to the Lambda function over and over again. TELL ASSA KINESIS STORY!!!!! What we can do in this case is to bisect batches on Lambda function failures. The failed batch will be split in half and each half sent to the function again. Bisecting would continue until we only have the single failing item left.
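The bisect idea can be sketched as a small local simulation. This is not how the Lambda event source mapping is implemented; in a real setup bisecting is a setting on the mapping (`BisectBatchOnFunctionError`) and Lambda does the splitting for us. The `process_with_bisect` and `handler` names are made up for the illustration.

```python
from typing import Callable, List


def process_with_bisect(batch: List[str],
                        handler: Callable[[List[str]], None]) -> List[str]:
    """Return the records that still fail once isolated in a batch of one."""
    if not batch:
        return []
    try:
        handler(batch)
        return []  # the whole batch went through
    except Exception:
        if len(batch) == 1:
            return list(batch)  # the single failing item is isolated
        mid = len(batch) // 2
        # Split the failed batch in half and retry each half on its own.
        return (process_with_bisect(batch[:mid], handler)
                + process_with_bisect(batch[mid:], handler))


def handler(records: List[str]) -> None:
    # Hypothetical handler: one bad record fails the entire batch.
    if "poison" in records:
        raise ValueError("cannot process record")


failed = process_with_bisect(["a", "b", "poison", "c"], handler)
print(failed)
```

Without bisecting, the three healthy records would be stuck behind the failing one for as long as the batch keeps being retried.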
  12. Now, I bet everyone in this room runs a multi-environment system: you have your dev, test, pre-prod, and prod environments. With a show of hands, how many in here would say that your QA, staging, pre-prod, or whatever you call it, has an identical setup to your prod environment? Most of you raised your hand and that is what I normally see. But now, let's make sure you consider data as well. The amount of data, the difference in user generated data. How many would now say that your environments are identical? As I thought, not that many of you. This is an important part when we consider our systems and when we plan for resiliency testing. Data is different. I have seen systems being taken down on multiple occasions due to differences in data and even in integration points. With one client, we had an update that had been tested and prepared in all environments. But when deploying to prod, the database went haywire on us. We used Amazon Aurora Serverless and the database suddenly scaled to max and then couldn't cope anymore. Our entire service was brought down. All due to a SQL query that, due to the amount of data in prod, consumed all database resources. Or you have an integration with a 3rd party where that 3rd party's staging environment is different. I had a scenario where in prod the 3rd party had IP allow-listing in place, so when we extended our system and got some new IPs, suddenly only 1/3 of our calls were allowed. In staging, this was not in place. That was.... intermittent failures are always the most fun to debug. A good way to practice and prepare for failures is through resiliency testing, chaos engineering. AWS offers a service around this topic, AWS Fault Injection Service, which you can use to simulate failures and see how your components and system handle them. So what I'm saying is that when you plan for your resiliency testing, start in your QA or staging environment. But don't forget about prod, and do plan to run tests there as well.
  13. Now let's start off with a classic web application with an API. Compute in Lambda and a database in DynamoDB. Now that is one scalable application. But maybe we actually need an SQL database; as mentioned in the beginning, this is still frequently used in many applications. Or we need to integrate with a 3rd party, and this could be an integration that runs on-prem, or it could be running in a different cloud on servers. A compute solution that doesn’t scale as fast and flexibly as our serverless solution. With a lot of users we could quickly overwhelm the 3rd party API or any downstream service in our solution that doesn’t scale as fast. This application is set up as a classic synchronous request-response where our client expects an immediate response to the request. We wait for this entire process to happen. Storing data directly in a database might be very fast, so the blocking isn't that long. But with more complex integrations with chained calls and even 3rd party integrations the time quickly adds up, and if one of the components is down and not responding we need to fail the entire operation, and we leave any form of retry to the calling application.
  14. One question we need to ask when building our APIs is: do our write operations really need an immediate response? Can we make this an asynchronous process? In a distributed system, does the calling application need to know that we have stored the data already, or can we just hand over the event and expect a response back saying "Hey, I got the message and I will do something with it"? Buffer events: with an asynchronous process we can add a buffer between our calls and the storage of our data. What this will do is protect us and the downstream service. The downstream service will not be overwhelmed, and by that we protect our own system from failures as well. This can however create an eventual consistency model, where a read after a write does not always give us the same data back.
  15. Let’s return to our application, but we focus only on the API part from now on. Let's get rid of our Lambda integration completely and instead integrate directly with SQS. This will create one of the most powerful patterns when building resilient serverless systems, and I use this all the time. Storage First! So instead of having this integration API GW to Lambda we move the Lambda function
  16. This takes us to the Storage-first architecture pattern. The idea behind this architecture pattern is to safely store the messages in durable storage, and then process them in an asynchronous way. This way we can handle them at the pace we see fit and we can re-process them if they fail. Basically we add a buffer to our API.
  17. Latency: Prioritizing storage might increase the time it takes to process or access the data in real-time scenarios. Complexity: Designing with storage in mind may lead to intricate architectures, especially when integrating with diverse processing systems. Prerequisites: Requires robust and often expensive storage solutions to ensure data durability and high availability. Data Integrity: Ensuring data stored is accurate and consistent can pose challenges, especially in high ingestion systems. Potential for Over-Optimization: There's a risk of over-investing in storage without considering the balance of other architectural needs.
  18. If we circle back to the API solution again, not only do we use the storage-first pattern in the current setup, we have the possibility for other resiliency patterns as well. I have already briefly mentioned this several times without putting a name on it. In this solution we also use the Queue Load Leveling pattern.
  19. Using the queue load leveling pattern we protect the downstream service, and by doing that ourselves, by only processing events at a pace that we know the service can handle. There are other benefits that come with this pattern that might not be that obvious. It can help us control cost, as we can run on subscriptions with lower throughput that are lower in cost, or we can down-scale our database as we don't need to run a huge instance to deal with peaks. The same goes if we don't process the queue with Lambda functions but instead use containers: we can set the scaling to fewer instances or even build a better auto-scaling solution. Now! One consideration with this pattern is that if our producers are always creating more requests than we can process, we can end up in a situation where we are always trailing behind. For that scenario we either need to scale up the consumers, which might lead to unwanted downstream consequences, or we need to at some point evict and throw away messages. What you choose of course comes with the standard architect answer "It depends...."
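A toy simulation can make the leveling effect visible. It assumes time moves in discrete ticks and that the downstream service can take at most `max_per_tick` requests per tick; `level_load` and its parameters are invented for this sketch and do not map to any AWS API.

```python
from collections import deque
from typing import List


def level_load(arrivals: List[int], max_per_tick: int) -> List[int]:
    """Return how many messages reach the downstream service in each tick."""
    if max_per_tick <= 0:
        raise ValueError("max_per_tick must be positive")
    queue = deque()
    processed_per_tick = []
    tick = 0
    # Keep going until all arrivals are in and the queue has been drained.
    while tick < len(arrivals) or queue:
        if tick < len(arrivals):
            queue.extend(range(arrivals[tick]))  # producers add a burst
        batch = min(max_per_tick, len(queue))    # consumer takes what it can handle
        for _ in range(batch):
            queue.popleft()
        processed_per_tick.append(batch)
        tick += 1
    return processed_per_tick


# A spike of 10 requests, then almost nothing.
result = level_load([10, 0, 0, 1], max_per_tick=3)
print(result)
```

The spike is spread over several ticks instead of hitting the downstream service at once. If the arrivals stay above `max_per_tick` for long, the queue only grows, which is the trailing-behind scenario mentioned above.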
  20. So what if we have more than one service that is interested in the request? With an SQS queue each message is handled by a single consumer; two consumers can't get the same message. In this case we need to create a fan-out or multicast system.
  21. So, what we can do in this solution is replace our queue with EventBridge, which can route the request or the message to many different services. It can be SQS queues, Step Functions, Lambda functions, other EventBridge buses and many many more. EventBridge is highly scalable, with high availability and resiliency, and a built-in retry mechanism for 24 hours. With the archive feature we can also replay messages in case they failed. And if there is a problem delivering a message to a target we can set a DLQ to handle that scenario. We must however remember that the DLQ only comes into effect if there is a problem calling the target, lacking IAM permissions or similar. If the target itself has a problem and fails processing, the message will not end up in the DLQ. Therefore each of our target services must implement resiliency using the patterns we have been talking about.
  22. Even with a storage-first approach we are of course not protected against failures. They will happen, remember "Everything fails all the time". In the scenarios where our processing does fail we need to retry. But retries are selfish, and what we don't want to do, in case it's a downstream service that fails, or if we are throttled by the database, is to just retry again right away. Instead we like to back off and give the service some breathing room. We would also like to apply exponential backoff, so if our second call also fails we back off a bit more. So the first retry we do after 1 second, then 2, then 4, and so on till we either time out and give up or have a success.
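A minimal sketch of retries with exponential backoff, following the 1, 2, 4 second schedule above. The function name, the cap of 60 seconds and the injectable `sleep` are assumptions made for the example, so that it can run without actually waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T],
                       max_retries: int = 5,
                       base_seconds: float = 1.0,
                       cap_seconds: float = 60.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation, waiting 1, 2, 4, ... seconds between failed attempts."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # give up; this is where a DLQ would take over
            # Double the wait for every failed attempt, up to the cap.
            sleep(min(cap_seconds, base_seconds * 2 ** attempt))
    raise RuntimeError("unreachable")


# Usage: an operation that is throttled twice before it succeeds.
waits = []
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("throttled")
    return "ok"


outcome = retry_with_backoff(flaky_call, sleep=waits.append)
print(outcome, waits)
```

Passing `waits.append` as `sleep` only records the delays; with the default `time.sleep` the function would really pause between attempts.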
  23. In the cases where we do give up the processing, when we have hit the max number of retries, this is where the DLQ comes in. We route the messages to a DLQ where we can use a different retry logic or even inspect the messages manually. The DLQ also creates a good indicator that something might be wrong, and we can create alarms and alerts based on the number of messages in the DLQ. One message might not be a problem, but if the number of messages starts stacking up it's a clear indicator that something is wrong. In case we are using SQS as our message buffer we can directly connect a DLQ to it. We can also use Lambda function failure destinations and set an SQS queue as that destination. So in case the function exits with a failure the message is sent to the destination. If we use Step Functions as our processor we can send messages to an SQS queue if we reach our retry limit.
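The give-up-and-route-to-DLQ flow can be sketched with plain in-memory queues. This is loosely modeled on how an SQS redrive policy with a maximum receive count behaves, but it is only a simulation; `drain_with_dlq`, `max_receive_count` and the sample handler are names chosen for the illustration.

```python
from collections import deque
from typing import Callable, List, Tuple


def drain_with_dlq(messages: List[str],
                   handler: Callable[[str], None],
                   max_receive_count: int = 3) -> Tuple[List[str], List[str]]:
    """Process messages; those failing max_receive_count times go to the DLQ."""
    queue = deque((message, 0) for message in messages)
    processed: List[str] = []
    dead_letter_queue: List[str] = []
    while queue:
        message, receives = queue.popleft()
        receives += 1
        try:
            handler(message)
            processed.append(message)
        except Exception:
            if receives >= max_receive_count:
                dead_letter_queue.append(message)  # give up, keep for inspection
            else:
                queue.append((message, receives))  # back on the queue for a retry
    return processed, dead_letter_queue


handled_calls: List[str] = []


def sample_handler(message: str) -> None:
    # Hypothetical handler that can never process the "bad" message.
    handled_calls.append(message)
    if message == "bad":
        raise ValueError("cannot process")


processed, dlq = drain_with_dlq(["ok1", "bad", "ok2"], sample_handler)
print(processed, dlq)
```

An alarm on the length of the dead-letter list would be the equivalent of alerting on the number of messages in the DLQ.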
  24. One more approach would be for us to use the retry with backoff that is built into Step Functions. However, SQS can’t invoke Step Functions directly, so what we can do is use EventBridge instead of SQS and rely on EventBridge's durability and its archive and replay mechanism. We add a DLQ that we send events to when we give up on the calls.
  25. In our retry scenario there is a study conducted by AWS that shows that in a highly distributed system retries will happen at the same time. If all retries happen with the same backoff, 1 second, 2 seconds, 4 seconds and so on, they will eventually line up and happen at the same time. This can then lead to the downstream service crashing directly after becoming healthy, just due to the amount of work that has stacked up and now happens at the same time. It's like in an electric grid: after a power failure, all appliances turn on at the same time, creating such a load on the grid that it goes out again, or we blow a fuse. Then we change the fuse, everything turns on at the same time, and the fuse blows again. Therefore we should also use some form of jitter in our backoff algorithm. This could be that we add a random wait time to the backoff time. It would work so that we first wait 1 second + a random number of hundreds of milliseconds. The second time we wait 2 seconds + 2x a random number, and so on. By doing that, our services will not line up the retries. How we add the jitter and how much, well, that depends on your system and implementation. Users are our worst enemy story…….
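The jitter scheme described above, backoff time plus a random part that grows with each retry, could look like the sketch below. The function name, the 0.5 second jitter ceiling and the injectable random generator are assumptions for the example; how much jitter to use depends on the system.

```python
import random


def backoff_with_jitter(attempt: int,
                        base_seconds: float = 1.0,
                        max_jitter_seconds: float = 0.5,
                        rng=random) -> float:
    """Wait time before retry number attempt (0-based): backoff plus jitter.

    attempt 0 -> 1 s + random part, attempt 1 -> 2 s + 2x random part, ...
    """
    factor = 2 ** attempt
    jitter = rng.uniform(0.0, max_jitter_seconds)  # some hundreds of milliseconds
    return factor * base_seconds + factor * jitter


# Two clients retrying on the same schedule no longer line up exactly.
client_a = random.Random()
client_b = random.Random()
for attempt in range(4):
    wait_a = backoff_with_jitter(attempt, rng=client_a)
    wait_b = backoff_with_jitter(attempt, rng=client_b)
    print(f"retry {attempt + 1}: client A waits {wait_a:.2f}s, client B waits {wait_b:.2f}s")
```

Because the random part is multiplied by the same factor as the backoff, the spread between clients grows with every retry, which is what keeps the retries from lining up.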
  26. Retries are all good, but there is no point in sending requests to an integration that is not healthy; it will just keep failing over and over again. So what we can do here is implement circuit breakers. If you are not familiar with circuit breakers, it is a classic pattern, and what it does is make sure we don't send requests to an API or integration that is not healthy and doesn't respond. This way we can protect both the integration or API and ourselves from doing work we know will fail. Because everything fails all the time, right? So before we call the API we'll have a status check; if the API is all healthy we'll send the request. This is the closed state of the circuit breaker. Think of it as an electric circuit: when the circuit is closed, electricity can flow and the lights are on.
  27. As we make calls to the API we'll update the status; if we start to get error responses on our requests we'll open the circuit and stop sending requests. This state is where storage-first shines: we can keep our messages in the storage queue until the integration is healthy again. But we can't just stop sending requests forever. So what we do is periodically place the circuit in a half-open state, send a few requests, and update our status with the health from these requests.
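The three states can be sketched as a small in-memory class. It assumes a failure threshold and a reset timeout, both picked for the example, and keeps the status in memory; in a serverless setup the status would have to live somewhere the function instances can share, which is left out here. All names are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # time to let a test request through
        return "open"

    def call(self, operation: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            raise CircuitOpenError("integration is marked unhealthy")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed test request, or too many failures, opens the circuit.
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success: the integration is healthy, close the circuit.
        self.failures = 0
        self.opened_at = None
        return result


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0)
print(breaker.state)
```

While the circuit is open, a storage-first consumer would simply leave the messages on the queue instead of calling `breaker.call`.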