8. @jimmydahlqvist
Envelope Wrapper
• Wrap original message in an envelope
• Separation of information
• Use predefined keys
• Improved filtering and debugging
• Additional overhead
15. Storage First Benefits
• Assured Data Durability
• Processing Flexibility
• Level the processing load
• High volume data ingestion
• De-duplication of data
16. Storage First Things to consider
• Potential for increased latency
• Architectural complexity
• Need for robust storage solutions
• Maintaining data integrity
• Risk of over-optimization
20. Circuit Breaker Benefits
• Avoid cascading failures
• Enhance system resilience
• Protect system resources
• Provide failover possibility
• Improve user experience
21. Circuit Breaker Things to consider
• Need for configuration
• Risk of early circuit break
• Good observability required
• System complexity increase
• When to recover after a failure
I started working with the cloud several years ago; I was part of a team that built a new system that was going to handle information coming from end users.
What was very important in this solution was that no messages coming from the end users could be lost; we needed to ensure we processed them all.
We decided to go for a serverless solution and used AWS API Gateway together with an SQS queue, and then processed the messages asynchronously.
At that time I didn't realize that this pattern actually had a name…
This is what we are going to talk about here today: we will look at some of the patterns I use the most and that I think are the most essential.
It's clearly an opinionated talk from that perspective, but by the end we should have put some names to them.
PIZZA EXAMPLES!
Before we start and deep dive into different patterns, we should establish some common ground
and define some terms, definitions, and concepts that I will return to in several of the patterns.
An Event Producer is a system or service that creates and publishes events and commands. AWS services, clients, SaaS applications, and more can be producers.
An Event Router is a service or system that routes events and commands to consumers. This can be queues, event brokers, etc. Several AWS services can act as message routers, such as SQS, SNS, IoT Core, and EventBridge.
An Event Consumer is a system or service that reacts to, and consumes, specific events or commands and carries out work accordingly. Our consumers can be services implemented with AWS services, other SaaS services, API endpoints, and more.
Orchestration
A centralized control pattern where a single orchestrator (often a service or function) dictates the control flow, making decisions about which functions should be executed, in which order, and managing data flow between them.
AWS StepFunctions!
Choreography:
A decentralized control pattern where each service or function knows what to do when an event occurs. There's no central authority directing traffic; rather, services interact in a loosely coupled manner based on events.
AWS EventBridge
Key Points:
Central Control: One service/function dictates the flow.
Predictable Flow: Control flow is predefined and can be visualized easily.
Tight Coupling: The orchestrator is often tightly coupled with services, knowing about their interfaces and data.
Decentralized Control: No single point dictates the flow; services/functions react to events.
Loose Coupling: Services are decoupled, only knowing about the events they produce or consume.
Scalable & Flexible: Easy to add or modify services without changing the entire system.
Self-Managed: Services handle their own failures and compensating actions based on events.
By using the envelope wrapper we wrap the original message in an envelope; that way we can….
We need to use predefined keys, as this gives both the producer and consumer….
The metadata-data pattern… Very popular
Invented back in 2020 by Sheen Brisals at the Lego Group….
The data key is the original message, the payload.
The metadata key gives us the possibility to add additional information ABOUT the message.
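As a minimal sketch of the envelope wrapper in Python, with the metadata and data keys described above (the individual metadata field names here are illustrative, not a standard):

```python
import json
import uuid
from datetime import datetime, timezone

def wrap_in_envelope(payload: dict, event_type: str) -> dict:
    """Wrap the original message in an envelope: the data key holds the
    untouched payload, the metadata key holds information ABOUT it."""
    return {
        "metadata": {
            "id": str(uuid.uuid4()),                        # unique message id
            "type": event_type,                             # what kind of event this is
            "version": "1.0",                               # schema version of the payload
            "created": datetime.now(timezone.utc).isoformat(),
        },
        "data": payload,
    }

envelope = wrap_in_envelope({"orderId": "1234", "pizza": "Margherita"}, "OrderCreated")
print(json.dumps(envelope, indent=2))
```

Consumers can then filter and debug on the predefined metadata keys without ever touching the payload under data.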
The similarities
Use JSON for all messages. This is an opinionated design, but it's good practice for a well-designed messaging system.
So don't use plain text, XML, YAML, or Protobuf.
Now let's move into the realm of resiliency and some patterns that can help us.
Definition: The "Storage First" architecture pattern emphasizes putting persistent storage at the forefront of system design, ensuring data durability and availability before considering other components.
Key Principle: Designing systems around the notion that data storage is the central pillar, optimizing for data retention, retrieval, and resilience.
When to Use:
High data ingestion systems where data loss is critical.
Systems requiring consistent backup and failover capabilities.
Applications where real-time processing is secondary to data capture.
Data Durability: Ensuring data is stored safely reduces risks related to data loss.
Flexibility: Once data is stored, it can be processed, transformed, or analyzed in various ways without worrying about initial capture.
Scalability, leveling the processing load: By prioritizing storage, systems can efficiently handle large volumes of data without immediate processing.
In a high-volume event-driven system, slow consumers can slow down the producers if the consumers process events synchronously. Instead, by storing each event immediately, reporting success, and then processing it on their own time, the consumers avoid slowing down the producers.
By storing the data before processing, the consumers can also implement efficient message de-duplication for data that has been sent twice. In most event-driven architectures, data will be delivered with an "at-least-once" guarantee.
Cost-Efficient: Optimizing for storage can lead to reduced costs in data retrieval and processing.
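The de-duplication point above can be sketched in a few lines; here an in-memory set stands in for what would typically be a DynamoDB table with a conditional write:

```python
# Seen-message ids; in AWS this would typically be a DynamoDB table
# written with a condition expression so the check-and-insert is atomic.
processed_ids = set()

def handle(envelope: dict) -> bool:
    """Process a stored message exactly once despite at-least-once delivery.
    Returns True if work was done, False if it was a duplicate."""
    message_id = envelope["metadata"]["id"]
    if message_id in processed_ids:
        return False  # duplicate delivery, safely ignore
    processed_ids.add(message_id)
    # ... carry out the actual work on envelope["data"] here ...
    return True
```

Because the message is already durably stored, dropping a duplicate is safe: nothing is lost by ignoring a redelivery.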
Latency: Prioritizing storage might increase the time it takes to process or access the data in real-time scenarios.
Complexity: Designing with storage in mind may lead to intricate architectures, especially when integrating with diverse processing systems.
Prerequisites: Requires robust and often expensive storage solutions to ensure data durability and high availability.
Data Integrity: Ensuring data stored is accurate and consistent can pose challenges, especially in high ingestion systems.
Potential for Over-Optimization: There's a risk of over-investing in storage without considering the balance of other architectural needs.
PIZZA ORDER!!
Definition: A design pattern used in software development to improve system stability and prevent cascading failures by detecting faults and halting system operations, much like an electrical circuit breaker.
Key Principle: The circuit breaker monitors requests to a service and "trips" (or opens) to stop sending requests to a failing service, giving it time to recover.
When to Use:
Microservices architectures where failures in one service might cascade to others.
Systems that rely on external services or APIs that might be unreliable.
Applications where preserving system functionality during partial failures is crucial.
System Stability: Reduces the risk of system-wide outages due to a single point of failure.
Resource Protection: Prevents resource exhaustion by halting requests to a failing component.
Enhanced User Experience: By avoiding system hang or timeouts, users receive quicker feedback even during failures.
Facilitates System Recovery: Provides failing components an opportunity to recover without being inundated with requests.
Predictable Failures: System components fail in a predictable manner, allowing for easier troubleshooting and maintenance.
Configuration Overhead: Proper thresholds and timeouts need to be set, which might require fine-tuning.
Risk of False Positives: Might trip during transient failures, causing unnecessary disruption.
Complexity: Introduces additional logic and monitoring into the system.
Dependency on Monitoring: Requires robust monitoring and alerting to function effectively.
Recovery Strategy: Deciding when and how to close (or reset) the circuit breaker can be challenging.
Get the circuit status from DynamoDB
If closed: carry out the work and update the status
If open:
Check if sufficient time has passed for a retry
If so: carry out the work and update the status
If not: no retry
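The steps above can be sketched as a minimal circuit breaker in Python; the status that the talk keeps in DynamoDB is held in memory here:

```python
import time

class CircuitBreaker:
    """Minimal in-memory sketch of the flow above. In the real setup the
    circuit status would be read from and written to DynamoDB."""

    def __init__(self, retry_after_seconds: float = 30.0):
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.retry_after = retry_after_seconds

    def call(self, work):
        # Get circuit status (here: from memory; in the talk: DynamoDB)
        if self.state == "OPEN":
            # If open: check if sufficient time has passed for a retry
            if time.monotonic() - self.opened_at < self.retry_after:
                raise RuntimeError("circuit open, not retrying yet")
        try:
            result = work()  # if closed (or retry time reached): carry out the work
        except Exception:
            # The work failed: trip the circuit and record when it opened
            self.state = "OPEN"
            self.opened_at = time.monotonic()
            raise
        self.state = "CLOSED"  # success: close the circuit again
        return result
```

A closed circuit lets calls through; after a failure the breaker opens and rejects calls until the retry window has passed, at which point one trial call is let through to probe for recovery.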
Now let's look at a combination with Storage First, where we want to process messages in a queue using a Lambda function.
I need to give credit to Christoph Gerkens, who created the first variant that I later modified.
In this setup:
Lambda sends logs and metrics to CloudWatch -> in case of failures this triggers an alarm that invokes a Lambda function, which disables the Lambda integration on the queue.
When the alarm goes back to OK, this invokes a StepFunction that polls a message from the queue and makes a test invoke with that message against the Lambda function doing the work; if this succeeds, the integration is enabled again.
EventBridge on a schedule invokes the StepFunction, which checks if the integration is enabled; if not, it does a test invoke.
This is a very famous quote…
And it's very true: everything fails, all the time, and we need to be able to handle failures and retry.
PIZZA schedule delivery….
Our application needs to handle failures and retry the operations.
And we should not just retry… we need to retry with an exponential backoff.
Meaning that we first retry after 1s, then 2s, 4s, 8s, and so on…
This will:
Reduce system load and strain, and let the failing component breathe…
It creates a better user experience
We can save cost by not burning CPU cycles
There is a very interesting study done by AWS on this topic that shows how retries cluster in a large distributed system.
So we should not just retry with backoff, we should also add jitter.
Meaning that we add a random sleep to each backoff operation.
This avoids synced retries…
It distributes the load on the failing component more evenly; after healing, a synced retry from many clients can cause another failure and outage.
It will increase the success rate, since we distribute the retries.
And it's a very adaptive way of handling retries
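Exponential backoff with full jitter can be sketched like this (the base delay and cap values are illustrative):

```python
import random

def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff (1s, 2s, 4s, 8s, ...) with full jitter:
    sleep a random amount between 0 and the exponential delay, so clients
    that failed at the same moment do not all retry in sync."""
    exponential = min(cap, base * (2 ** attempt))
    return random.uniform(0, exponential)

# Example: delays for the first five attempts
for attempt in range(5):
    print(f"attempt {attempt}: sleep {backoff_with_jitter(attempt):.2f}s")
```

Without the jitter the function would return the plain 1s, 2s, 4s, 8s sequence; the random factor is what spreads the retries out over time.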
Now the same study with jitter added: we can see that the clustering is less frequent.
So how can we now implement retries in a smart way in serverless AWS?
First, let's introduce a retry envelope, based on the envelope wrapper pattern.
Here we add metadata about retries so we can keep track of attempts, when it was last run, etc.
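A minimal sketch of that retry metadata, assuming illustrative field names (attempt, maxAttempts, lastAttempt are not from any standard):

```python
from datetime import datetime, timezone

def record_attempt(envelope: dict, max_attempts: int = 5) -> dict:
    """Track retries inside the envelope's metadata section."""
    retry = envelope.setdefault("metadata", {}).setdefault(
        "retry", {"attempt": 0, "maxAttempts": max_attempts, "lastAttempt": None}
    )
    retry["attempt"] += 1
    retry["lastAttempt"] = datetime.now(timezone.utc).isoformat()
    return envelope

def should_retry(envelope: dict) -> bool:
    """False once the upper limit is reached; the message then goes to a DLQ."""
    retry = envelope["metadata"]["retry"]
    return retry["attempt"] < retry["maxAttempts"]
```

Because the counters travel with the message itself, any consumer that picks it up can decide whether to retry or give up, without shared state.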
So how could an implementation in StepFunctions look?
We can utilize StepFunctions' built-in ability to catch errors. When Lambda fails, we use the retry metadata to calculate a wait time.
We can then wait that amount of time and try again.
If we reach our upper limit on the number of retries, we add the message to a DLQ for manual processing.
So if we look at a StepFunctions visualization
Here is a setup where Lambda gets invoked asynchronously by something.
Credit given to Luc…, who first wrote about this setup.
Lambda fails -> onFailure destination to SQS.
SQS is polled by a Lambda event source.
The retry manager checks the "Retry Metadata", puts the message back on the queue, and sets the visibility timeout.
Next time the message is returned, the manager invokes the "Work Lambda" and the loop continues…
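The visibility-timeout calculation in that loop can be sketched as a pure function; the boto3 call in the comment shows how a retry manager might apply it (the queue URL and receipt handle names are assumptions):

```python
def next_visibility_timeout(attempt: int, base: float = 1.0) -> int:
    """Exponential backoff expressed as an SQS visibility timeout.
    SQS caps the visibility timeout at 12 hours (43200 seconds)."""
    return int(min(43200, base * (2 ** attempt)))

# The retry manager would then hide the message for that long, e.g.:
# sqs.change_message_visibility(
#     QueueUrl=queue_url,
#     ReceiptHandle=receipt_handle,
#     VisibilityTimeout=next_visibility_timeout(attempt),
# )
```

Extending the visibility timeout instead of re-sending the message keeps the retry entirely inside SQS: the message simply reappears after the computed delay.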
That was a few resiliency patterns…
Now let us move over and look at some event-driven and messaging patterns…
A pattern I use very often is the Saga Pattern. Which is a way to manage long running transactions.
It’s possible to use both in an orchestration and choreography scenario, but we will focus on choreography.
In this pattern, each completed transaction will publish a domain event to inform on what just happened, so the next part in the saga can pick up.
With the Saga pattern, even if we're not using traditional ACID transactions, we can still ensure data is consistent across services.
As each service in the pattern is loosely coupled, we gain the flexibility to develop, deploy, and scale services independently.
If one transaction fails, the entire system doesn’t crash. Instead, compensating actions are triggered to rectify the inconsistency.
There might be additional complexity in the system, so it might be hard to track a transaction and where in the chain it is.
The system will be eventually consistent, since a transaction can be in flight at any point.
We must make sure we use the envelope wrapper so we can add an event ID to it and track the saga.
Testing might be hard, since it might require us to run the full saga for every test.
So by using EventBridge, each service can publish events that other services can react to…
So in this example it all starts with a Pizza Order being created…
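A sketch of the domain event that starts the saga, built as an EventBridge put-events entry; the source, detail-type, and bus names are illustrative, not from the talk:

```python
import json

def order_created_event(order: dict) -> dict:
    """Build an EventBridge entry for the OrderCreated domain event
    that kicks off the pizza saga."""
    return {
        "Source": "pizza.orders",          # illustrative source name
        "DetailType": "OrderCreated",      # the domain event type
        "Detail": json.dumps({
            "metadata": {"type": "OrderCreated"},  # envelope wrapper metadata
            "data": order,                          # the original payload
        }),
        "EventBusName": "pizza-saga",      # illustrative bus name
    }

# Publishing with boto3 would then look like:
# boto3.client("events").put_events(Entries=[order_created_event({"orderId": "1234"})])
```

Each subsequent service in the saga consumes this event, does its work, and publishes its own domain event in the same shape, which keeps the services loosely coupled.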
The Data Enricher Pattern is about enhancing the value of data by combining it with other relevant data sources.
At its core, the Data Enricher Pattern aims to elevate the inherent value of raw data by integrating it with additional context. This enhancement transforms simple data into information, making it more actionable.
This richer dataset is a boon for decision-makers, offering a more comprehensive view of the situation.
One of the pattern's strengths is its automation, streamlining data processes and reducing manual errors.
Furthermore, it's a versatile pattern. Whether you're integrating data from APIs, databases, or other systems, the pattern facilitates this combination, offering a more holistic dataset.
Importantly, as your business grows and evolves, so can your data enrichment sources, ensuring your data remains relevant and valuable.
One of the top considerations when adopting the Data Enricher Pattern is maintaining the data's integrity and quality. The enrichment should add value, not distort or degrade the original data.
Reliability is another concern. If you're depending on third-party data sources, their availability and trustworthiness become crucial.
As you integrate diverse data sources, be prepared to handle complex integration scenarios, ensuring smooth data flow.
Also, while the pattern aims to provide enriched data, it's important to be aware of potential latency, especially if real-time data processing is essential.
Lastly, and perhaps most importantly, when dealing with sensitive or third-party data, uphold the highest standards of data privacy and security.
PIZZA membership…. Discounts…
Messages are processed based on their priority rather than their arrival order
High value messages processed first
We can use compute resources efficiently
We can improve system responsiveness for high-priority customers. We can see it as: customers with high member status get processed first.
What we need to consider in this pattern:
How do we define priority? -> It must be clear
There is increased complexity with more queues
Queue starvation can happen… low-priority messages may never get processed.
In this case we can't guarantee the order, and normal-priority and high-priority messages will be intermixed.
Instead we would need to do something like this…
where we check the priority of the message and whether there are high-priority messages in the queue or not.
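That priority check can be sketched with two in-memory queues standing in for two SQS queues:

```python
from collections import deque

# One queue per priority level; in AWS these would be separate SQS queues.
queues = {"high": deque(), "normal": deque()}

def next_message():
    """Always drain the high-priority queue before the normal one.
    Note the starvation risk: if high-priority traffic never stops,
    normal-priority messages are never processed."""
    for priority in ("high", "normal"):
        if queues[priority]:
            return queues[priority].popleft()
    return None
```

A common mitigation for the starvation risk is to occasionally serve the normal queue even when high-priority messages are waiting, for example on every Nth poll.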
The next queue pattern we need to talk about is queue-based load leveling… This is when we use a queue between a high-volume producer and a slow consumer, where the producer has high spikes.
That way we can level the load on the consumer and ensure it doesn't get overwhelmed during spikes.
Increased system stability
Handle spikes
Protect downstream services
What we need to consider when implementing this is that it comes with increased latency -> messages during a spike will sit in the queue longer.
For data integrity, we need to consider that the system will be eventually consistent.
And backpressure? How do we handle the case where the queue keeps growing and the consumer can't keep up?
There are two ways to handle it: either we add more consumers, or we set a TTL on the messages and send old messages to a DLQ or discard them.
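The TTL option can be sketched as a freshness check, assuming the producer stamps an epoch sentAt timestamp into the envelope metadata (that field name is illustrative):

```python
import time

def still_fresh(envelope: dict, ttl_seconds: float = 300.0) -> bool:
    """Return False for messages that have waited in the queue longer
    than the TTL; those would be sent to a DLQ or discarded."""
    age = time.time() - envelope["metadata"]["sentAt"]
    return age <= ttl_seconds
```

The consumer runs this check before doing any work, so a backlog that built up during a spike drains quickly instead of burning compute on messages nobody cares about anymore.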
If you remember the very first story I told… we actually used load leveling with storage first.
This scenario was from Sony, where we processed messages from mobile phones. Every time a phone started up it sent us a message.
So when there was a new software release for the phones, we got huge spikes, since most people actually update their phones at the same time…