Shuffle Shards:
Patterns for
Workload Isolation
DevNexus April 2024
Christopher Curtin
Resilience and Reliability
“Reliability is the outcome cloud service providers strive for – it’s the result.
Resiliency is the ability of a cloud-based service to withstand certain types of failure and
yet remain functional from the customer perspective.
In other words, reliability is the outcome and resilience is the way you achieve the
outcome.”
https://www.microsoft.com/en-us/security/blog/2014/03/24/reliability-series-1-reliability-vs-resilience/
Defensive architecture
Clients can be our worst enemy
How do we protect our product and business systems from their actions,
innocent or malicious?
How do we protect our clients from each other?
Security, multi-tenancy etc are topics for a separate session
Examples
Client tries to make 10,000 API calls in a minute since they
didn’t read our docs about our batch API (RTFM attacks)
Innocent API request fans out to millions of backend
requests (Happy Birthday, here is your coupon!)
https://apnews.com/article/south-korea-age-counting-law-a38a4a6b47c6864bd13433fdac071cec
Client has a defect and is making update requests in a tight
loop
But this is the cloud! Let them do it!
Auto scale!
Use Lambdas!
Use MongoDB!
Cross Cloud Architecture!
Reality
You have a fixed budget per transaction for your platform so your company
can be profitable
The impact of each of those clients on the other clients in your system can lead
to SLA credits
Auto scaling may mean that a client costs you more than they pay you
Example: your business charges $4/month/user, but you get only $0.50, fully
loaded, to provide the service (remember, capex vs opex in the cloud is
different!)
“Industry benchmarks suggest that an acceptable cost per call could
range anywhere between $2.70 – $5.60, including ‘direct labor, indirect
labor, and operational expenses.’”
Minimum First Level of Defense
API throttling at your gateway by API and API Key
Auto scaling to a fiscally responsible level
Fine grained throttling by business process
“Dials and Levers” that can be adjusted without a deployment
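A minimal sketch of throttling by API and API key, with limits that can be changed at runtime (the "dials"). It uses a token bucket, which is one common choice and not necessarily what any particular gateway does. The names `TokenBucket`, `Throttle`, and `set_limit` are hypothetical, and a real gateway would keep this state somewhere shared rather than in process memory.

```python
import time


class TokenBucket:
    """Allows `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now):
        # Refill based on elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class Throttle:
    """One bucket per (api, api_key). Limits are plain data, so they can be
    adjusted without a deployment (hypothetical in-memory version)."""

    def __init__(self, default_rate, default_capacity):
        self.default = (default_rate, default_capacity)
        self.overrides = {}  # (api, api_key) -> (rate, capacity)
        self.buckets = {}

    def set_limit(self, api, api_key, rate, capacity):
        self.overrides[(api, api_key)] = (rate, capacity)
        self.buckets.pop((api, api_key), None)  # rebuild with the new limit

    def allow(self, api, api_key, now=None):
        now = time.monotonic() if now is None else now
        key = (api, api_key)
        if key not in self.buckets:
            rate, capacity = self.overrides.get(key, self.default)
            self.buckets[key] = TokenBucket(rate, capacity, now)
        return self.buckets[key].allow(now)


throttle = Throttle(default_rate=1.0, default_capacity=2)
results = [throttle.allow("/users", "key-1", now=0.0) for _ in range(3)]
print(results)  # [True, True, False]
```

The third call is rejected because the burst of 2 is used up; a different API key gets its own bucket and is not affected.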
Workload Isolation
How do you protect clients from each other’s normal transactions?
How do you protect clients from defects with specific triggers?
How do you feel confident in your SLA with so many unknowns?
One option: sharding of clients
Naive Sharding
Separate your resources into ‘buckets’, then assign clients to the buckets so clients can’t
impact all other clients
“Blast radius” is simple math (100 / # of buckets): 8 buckets = 12.5% of clients could be
impacted
Naive Sharding
Adding more buckets changes the impact %
Getting the bucket assignment right is hard as clients are added, leave, or change usage patterns
High percentage impact might not be acceptable to the business
Shuffle Sharding
Very similar to naive sharding, but rather than each shard owning its own dedicated resources,
each shard gets a subset of the resources, and those subsets overlap between shards
Back to middle school statistics: combinations. 8 resources, with each shard able to
access 3 of them, creates 56 shards to assign clients to, or about 1.8% of your clients per shard
(vs 12.5% with 8 naive shards)
Using 6 here to make the slide visible
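The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the worker names are made up, and it treats each of the 8 resources as one naive bucket for the comparison.

```python
from itertools import combinations
from math import comb

workers = [f"worker-{i}" for i in range(8)]  # hypothetical resource names
shard_size = 3

# Naive sharding: one bucket per resource.
naive_blast_radius = 100 / len(workers)

# Shuffle sharding: every distinct 3-worker combination is a shard.
shuffle_shards = list(combinations(workers, shard_size))
shuffle_blast_radius = 100 / len(shuffle_shards)

assert len(shuffle_shards) == comb(8, 3)

print(f"naive:   {len(workers)} buckets, {naive_blast_radius:.1f}% of clients each")
print(f"shuffle: {len(shuffle_shards)} shards, {shuffle_blast_radius:.2f}% of clients each")
# naive:   8 buckets, 12.5% of clients each
# shuffle: 56 shards, 1.79% of clients each
```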
How does this work?
Each Resource is called a Worker Node (colored balls in previous slide)
Each shard is assigned its own combination of Worker Nodes; shards can overlap, but no two shards have the exact same set
When a request is received with the shard key, the load balancer/Istio routes to one of the
Worker Nodes for that shard
BUT if the requests overwhelm those Worker Nodes, only those clients in the same shard
get impacted
Layering circuit breakers, retries, load balancers, etc. ‘shuffles’ the load to the next
resource in the shard
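A rough sketch of the routing idea, assuming the retry logic lives in application code. In practice this would more likely be configured in the load balancer or service mesh. `WorkerOverloaded`, `route`, and `send` are hypothetical names for illustration.

```python
class WorkerOverloaded(Exception):
    """Raised by `send` when a Worker Node rejects or times out a request."""


def route(shard_workers, request, send, start=0):
    """Try each Worker Node in the shard, beginning at `start`,
    and return the first successful response."""
    n = len(shard_workers)
    for offset in range(n):
        worker = shard_workers[(start + offset) % n]
        try:
            return send(worker, request)
        except WorkerOverloaded:
            continue  # 'shuffle' to the next resource in the shard
    raise WorkerOverloaded(f"all workers in shard unavailable: {shard_workers}")


# Fake transport for the example: worker-1 is overwhelmed, the others are fine.
def send(worker, request):
    if worker == "worker-1":
        raise WorkerOverloaded(worker)
    return f"{worker} handled {request}"


shard = ("worker-1", "worker-4", "worker-6")
print(route(shard, "req-1", send))  # worker-4 handled req-1
```

If every Worker Node in the shard is overloaded, the request fails, but only for clients assigned to that shard.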
What’s going on?
Since none of the other shards have the exact same set of Worker Nodes, the ‘bad actor’
can only consume a percentage of the total
YES the other clients in that shard get impacted, but this isn't about socialism, it is about
reducing the impact of things going wrong
YES the overloaded Worker Nodes are included in other Shards, but with proper retry
logic, the requests are eventually processed in the healthy Worker Node(s) for that Shard
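To make the overlap concrete, this sketch counts how many Worker Nodes every other shard shares with a 'bad actor' shard, using the 8-resource, 3-per-shard example. The numbering of workers is arbitrary.

```python
from collections import Counter
from itertools import combinations

shards = [frozenset(c) for c in combinations(range(8), 3)]
bad = shards[0]  # the shard containing the 'bad actor': workers {0, 1, 2}

# For each other shard, count how many Worker Nodes it shares with the bad one.
overlap = Counter(len(bad & other) for other in shards if other != bad)

for shared in sorted(overlap):
    print(f"{overlap[shared]} shards share {shared} worker(s) with the bad shard")
# 10 shards share 0 worker(s) with the bad shard
# 30 shards share 1 worker(s) with the bad shard
# 15 shards share 2 worker(s) with the bad shard
```

Every one of the 55 other shards still has at least one Worker Node the bad actor cannot reach, which is what the retry logic relies on.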
What is a Worker Node?
Whatever you want it to be
In some scenarios it is a physical (or logical) server such as an EC2 instance, a VM, or real hardware
It can be a Kubernetes virtual service, so there are actually multiple pods providing the
service
K8s virtual services have the benefit of being able to scale up or down within the Worker
Node routing concept
No silver bullets
You have resources sitting around waiting for clients, that could help that 'bad actor'
complete sooner and not impact the others in their Shuffle Shard
We'll talk about some ideas for addressing this later
Does not protect from issues that impact all clients (bad queries, infrastructure outages
etc.)
Shared resources like data stores, queues, S3 etc. across Worker Nodes can be overloaded
Complex if using messaging (SQS, Kafka etc.)
Use case 1: Amazon DNS (Route 53)
AWS routes billions of requests. What if someone launches a DDoS attack against one of their
clients' domains?
Worker Nodes for that client get overwhelmed, other clients in other shards aren’t
impacted
This allows AWS to then add the load shedding services to the Routing to protect that
client
Since load shedding tools are expensive, you don't want to use them all the time
2048 Worker Nodes = 730 Billion possible Shuffle Shards, so they actually assign a
Shuffle Shard to each domain
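One way to give every domain its own shard without storing a lookup table is to derive the shard from a hash of the domain name. This is a sketch of that idea, not a description of how Route 53 does it. The function name `shuffle_shard` is hypothetical, and the shard size of 4 is an assumption; it is the size for which the 730 billion figure works out.

```python
import hashlib
import random
from math import comb


def shuffle_shard(key, total_workers, shard_size):
    """Deterministically pick `shard_size` distinct workers for `key`."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)  # same key -> same seed -> same shard
    return tuple(sorted(rng.sample(range(total_workers), shard_size)))


# Number of possible 4-worker shards out of 2048 workers.
print(comb(2048, 4))  # 730862190080

shard = shuffle_shard("example.com", total_workers=2048, shard_size=4)
assert shard == shuffle_shard("example.com", 2048, 4)  # stable across calls
assert len(set(shard)) == 4
```

With that many possible shards, two domains picked this way are very unlikely to land on the exact same set of Worker Nodes.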
Use Case 2: Happy Birthday!
The traffic from a client is first throttled due to # of requests per second
Requests that make it through then are routed to Worker Nodes to perform the updated age
calculation, then trigger the Drools engine to determine what features of your application the user
can have access to or activate (Birthday coupon email, hey you’re not 18!)
Meanwhile if a second client triggers a similar age update and triggers their Drools rules, the two
won't take all the resources
Then a third client performs their (simpler) request, unrelated to the birthday update, and it
finishes as expected, though the overall volume of traffic for the system is significantly higher
Others using resources through the ‘normal’ patterns don’t see anything
(Ideally you’d flip the lever to not run the profile logic across all clients when you noticed the
pattern)
Use Case 3: Dynamic Shard Assignment
Fixed assignment of clients to shards can actually hurt you without a bad actor
If your clients have predictable use patterns you can re-shard them in anticipation of their load
For example, 50% of your clients consistently use your service Monday and Thursday from 9am-11am, 25% use it Tuesday
and Friday from 3pm-7pm, and the rest are random
Why not spread those 50% of clients across shards every Monday and Thursday at 6am, so their load is spread out, then
rebalance Tuesday/Friday at noon for the other clients?
When you detect an overload, you can move the clients NOT causing the problem out of that shuffle shard into a less used
one (though usually by the time you get this point you might have recovered)
Note this does not work as well if the Worker Nodes are tied to data services (databases, queues etc) that themselves are not
scalable
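A small sketch of the "move the innocent clients out" idea. Everything here is hypothetical: the `evacuate` function, the use of client count as the load measure, and the in-memory dicts standing in for whatever your routing layer uses to map clients to shards.

```python
def evacuate(assignments, load_by_shard, overloaded_shard, offenders):
    """Return a new client -> shard mapping in which clients NOT causing the
    problem are moved off the overloaded shard to the least-loaded shards."""
    candidates = [s for s in load_by_shard if s != overloaded_shard]
    new_assignments = dict(assignments)
    if not candidates:
        return new_assignments  # nowhere to move anyone
    load = dict(load_by_shard)
    for client, shard in assignments.items():
        if shard == overloaded_shard and client not in offenders:
            target = min(candidates, key=lambda s: load[s])
            new_assignments[client] = target
            load[target] += 1  # account for the client we just moved
    return new_assignments


assignments = {"acme": "s1", "bolt": "s1", "core": "s2", "dyna": "s1"}
load_by_shard = {"s1": 3, "s2": 1, "s3": 0}  # clients per shard

print(evacuate(assignments, load_by_shard, "s1", offenders={"acme"}))
# {'acme': 's1', 'bolt': 's3', 'core': 's2', 'dyna': 's2'}
```

The offender stays where it is, so it keeps hurting only itself while the other clients get healthy Worker Nodes.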
Use Case 4: Bad Query mitigation
A bad query or business logic is triggered only for clients who have a specific use case. For
example, NYC fine calculations based on the time of day you parked illegally
If clients are randomly assigned to shuffle shards, you could impact all Worker nodes
So move these clients doing these calculations to 2-3 shuffle shards KNOWING their experience is
going to be bad while you fix it
Blue/Green deploy an emergency fix ONLY for those Worker Nodes since you don’t know 100%
that it will fix the issue
Depending on the issue, you can scale up these Worker Nodes OR deliberately change the # of
data source connections in the pools so they don’t take a lot of resources (you are using Spring
Cloud Config or other tools for deployment-less config changes, right?)
Outage Simulations
A lot of the benefits of workload isolation prevent the outage. But sometimes the outage
happens.
Performing outage simulations allows the response teams to know where to look to see
what is going wrong, identify missing or incorrect run books, dashboards etc.
Then practice dynamic sharding, hot fixing specific worker nodes, etc.
Don’t wait until the outage to script and try these options. Getting them wrong can make it
worse!
Summary
Stuff Happens, plan for it
Perform outage simulations
Shuffle Sharding does not make all clients equal in their access to resources. Instead, it is a pattern to
restrict the impact when things go wrong, giving you time to react
Sits on top of strong patterns for retries, virtual routing, database sharding etc.
Observability and monitoring are critical to identify issues
If the data tier isn't robust, you'll still knock that layer over
Articles
Amazon’s DNS service presentation: https://aws.amazon.com/builders-library/workload-isolation-using-shuffle-sharding/
Great video by AWS TAMs talking about how they use Shuffle Sharding in many AWS services
https://www.youtube.com/watch?v=TVdxPF7KL1c&t=1s
Shuffle Sharding in the context of a Cell topology
https://docs.aws.amazon.com/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/faq.html
2024 DevNexus Patterns for Resiliency: Shuffle shards

  • 1. Shuffle Shards: Patterns for Workload Isolation DevNexus April 2024 Christopher Curtin
  • 2. Resilience and Reliability “Reliability is the outcome cloud service providers strive for – it’s the result. Resiliency is the ability of a cloud-based service to withstand certain types of failure and yet remain functional from the customer perspective. In other words, reliability is the outcome and resilience is the way you achieve the outcome.” https://www.microsoft.com/en-us/security/blog/2014/03/24/reliability-series-1-reliability-vs-resilience/
  • 3. Defensive architecture Clients can be our worst enemy How do we protect our product and business systems from their actions, innocent or malicious? How do we protect our clients from each other? Security, multi-tenancy etc are topics for a separate session
  • 4. Examples Client tries to make 10,000 API calls in a minute since they didn’t read our docs about our batch API (RTFM attacks) Innocent API request fans out to millions of backend requests (Happy Birthday, here is your coupon!) https://apnews.com/article/south-korea-age-counting-law- a38a4a6b47c6864bd13433fdac071cec Client has a defect and is making update requests in a tight loop
  • 5. But this is the cloud!Let them do it! Auto scale! Use Lambdas! Use MongoDB! Cross Cloud Architecture!
  • 6. Reality You have a fixed budget per transaction for your platform so your company can be profitable The impact of each of those clients on the other clients in your system can lead to SLA credits Auto scaling may mean that a client costs you more than they pay you Example, if your business charges $4/month/user, but you get $0.50 fully loaded to provide the service (remember capex vs opex in the cloud is different!) “Industry benchmarks suggest that an acceptable cost per call could range anywhere between $2.70 – $5.60, including “direct labor, indirect labor, and operational expenses.”
  • 7. Minimum First Level of Defense API throttling at your gateway by API and API Key Auto scaling to a fiscally responsible level Fine grained throttling by business process “Dials and Levers” that can be adjusted without a deployment
  • 8. Workload Isolation How do you protect clients from each other’s normal transactions? How do you protect clients from defects with specific triggers? How do you feel confident in your SLA with so many unknowns? One option: sharding of clients
  • 9. Naive Sharding Separate your resources into ‘buckets’, then assign clients to the buckets so clients can’t impact all other clients “Blast radius” is simple math (100/# buckets) 8 buckets= 12.5% of clients could be impacted
  • 10. Naive Sharding Adding more buckets changes the impact % Getting the bucket assignment is hard as clients are added, attrit, change usage patterns High percentage impact might not be acceptable to the business
  • 11. Shuffle Sharding Very similar to the Naive sharding, but rather than each shard sharing the same resources, there is an overlap between which shards access which resources Back to Middle School statistics: combinations - 8 resources with a shard being able to access 3 of them, creates 56 shards to assign clients to, or 1.7% of your clients per shard (vs 12.5% if 8 naive shards) Using 6 here to make the slide visible
  • 12. How does this work? Each Resource is called a Worker Node (colored balls in previous slide) Shards are assigned non-overlapping sets of Worker Nodes When a request is received with the shard key, the load balancer/Istio routes to one of the Worker Nodes for that shard BUT if the requests overwhelm those Worker Nodes, only those clients in the same shard get impacted Layering circuit breakers, retries, load balancers, etc. ‘shuffles’ the load to the next resource in the shard
  • 13.
  • 14. What’s going on? Since none of the other shards have the exact same set of Worker Nodes, the ‘bad actor’ can only consume a percentage of the total. YES, the other clients in that shard get impacted, but this isn't about socialism; it is about reducing the impact of things going wrong. YES, the overloaded Worker Nodes are included in other Shards, but with proper retry logic, the requests are eventually processed in the healthy Worker Node(s) for that Shard.
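To make the overlap argument concrete, the sketch below counts how many Worker Nodes every other shard has in common with the bad actor's shard, reusing the 8-node, 3-per-shard numbers from the earlier slide. The function name `overlap_histogram` is illustrative.

```python
from collections import Counter
from itertools import combinations


def overlap_histogram(shards, bad_shard):
    """Count other shards by how many nodes they share with `bad_shard`."""
    bad = set(bad_shard)
    return Counter(len(bad & set(s)) for s in shards if s != bad_shard)


shards = list(combinations(range(8), 3))  # 56 shards over 8 Worker Nodes
bad_shard = shards[0]                     # (0, 1, 2) holds the 'bad actor'

hist = overlap_histogram(shards, bad_shard)
print(dict(sorted(hist.items())))  # {0: 10, 1: 30, 2: 15}
```

None of the other 55 shards shares all three nodes, so every one of them keeps at least one healthy Worker Node to retry against.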
  • 15. What is a Worker Node? Whatever you want it to be. In some scenarios it is a physical (logical) server such as an EC2 instance, a VM, or real hardware. It can be a Kubernetes virtual service, so there are actually multiple pods providing the service. K8s virtual services have the benefit of being able to scale up or down within the Worker Node routing concept.
  • 16.
  • 17.
  • 18. No silver bullets: You have resources sitting around waiting for clients that could help that 'bad actor' complete sooner and not impact the others in their Shuffle Shard. We'll talk about some ideas for addressing this later. Does not protect from issues that impact all clients (bad queries, infrastructure outages, etc.). Shared resources like data stores, queues, S3, etc. across Worker Nodes can be overloaded. Complex if using messaging (SQS, Kafka, etc.).
  • 19. Use Case 1: Amazon DNS (Route 53). AWS routes billions of requests; what if someone launches a DDoS against one of their clients’ domains? Worker Nodes for that client get overwhelmed, but clients in other shards aren’t impacted. This allows AWS to then add load shedding services to the routing to protect that client. Since load shedding tools are expensive, you don't want to run them all the time. 2048 Worker Nodes = 730 billion possible Shuffle Shards, so they actually assign a Shuffle Shard to each domain.
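The 730 billion figure can be reproduced with the same combinations formula. The slide does not state the shard size; the value of 4 below is an assumption taken from the AWS Builders' Library article on shuffle sharding, and it matches the number quoted.

```python
from math import comb

WORKER_NODES = 2048  # from the slide
SHARD_SIZE = 4       # assumption: not on the slide, per the AWS article

possible_shards = comb(WORKER_NODES, SHARD_SIZE)
print(f"{possible_shards / 1e9:.0f} billion possible shuffle shards")
```

With that many combinations there is room to give every domain its own shard, which is the point the slide makes.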
  • 20. Use Case 2: Happy Birthday! The traffic from a client is first throttled based on # of requests per second. Requests that make it through are then routed to Worker Nodes to perform the updated age calculation, then trigger the Drools engine to determine which features of your application the user can access or activate (birthday coupon email; hey, you’re not 18!). Meanwhile, if a second client triggers a similar age update and triggers their Drools rules, the two won't take all the resources. Then a third client performs their (simpler) request, unrelated to the birthday update, and it finishes as expected, though the overall volume of traffic for the system is significantly higher. Others using resources through the ‘normal’ patterns don’t see anything. (Ideally you’d flip the lever to not run the profile logic across all clients when you noticed the pattern.)
  • 21. Use Case 3: Dynamic Shard Assignment. Fixed assignment of clients to shards can actually hurt you even without a bad actor. If your clients have predictable usage patterns, you can re-shard them in anticipation of their load. For example, 50% of your clients consistently request your service Monday and Thursday from 9 am - 11 am, 25% Tuesday and Friday from 3 pm - 7 pm, and the rest at random. Why not put those 50% of clients into shards every Monday and Thursday at 6 am, so their load is spread, then rebalance Tuesday/Friday at noon for the other clients? When you detect an overload, you can move the clients NOT causing the problem out of that shuffle shard into a less used one (though usually by the time you get to this point you might have recovered). Note this does not work as well if the Worker Nodes are tied to data services (databases, queues, etc.) that themselves are not scalable.
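The "move the clients NOT causing the problem" step could be sketched as below. Everything here is a hypothetical illustration: shards are plain ids, "less used" is approximated by client count, and ties are broken by shard id so the result is deterministic.

```python
from collections import Counter


def evacuate_innocent_clients(assignments, bad_client, all_shards):
    """Return new client -> shard assignments with the bad client's
    shard-mates moved to the least-populated other shards.

    assignments: dict of client id -> shard id (current state)
    all_shards:  every shard id that clients may be moved to
    """
    hot_shard = assignments[bad_client]
    population = Counter({shard: 0 for shard in all_shards})
    population.update(assignments.values())

    updated = dict(assignments)
    for client in sorted(assignments):
        if client == bad_client or assignments[client] != hot_shard:
            continue  # leave the bad actor and unaffected clients in place
        # Pick the least-populated shard other than the overloaded one.
        target = min(
            (s for s in all_shards if s != hot_shard),
            key=lambda s: (population[s], s),
        )
        updated[client] = target
        population[target] += 1
        population[hot_shard] -= 1
    return updated
```

A production version would also weigh actual load rather than client counts, and would avoid target shards that share Worker Nodes with the overloaded shard.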
  • 22. Use Case 4: Bad Query Mitigation. A bad query or business logic is triggered only for clients who have a specific use case, for example NYC fine calculations based on the time of day you parked illegally. If clients are randomly assigned to shuffle shards, you could impact all Worker Nodes. So move the clients doing these calculations to 2-3 shuffle shards, KNOWING their experience is going to be bad while you fix it. Blue/Green deploy an emergency fix ONLY for those Worker Nodes, since you don’t know 100% that it will fix the issue. Depending on the issue, you can scale up these Worker Nodes OR deliberately change the # of data source connections in the pools so they don’t take a lot of resources (you are using Spring Cloud Config or other tools for deployment-less config changes, right?).
  • 23. Outage Simulations: A lot of the benefits of workload isolation prevent the outage. But sometimes the outage happens. Performing outage simulations allows the response teams to know where to look to see what is going wrong and to identify missing or incorrect run books, dashboards, etc. Then practice dynamic sharding, hot fixing specific Worker Nodes, etc. Don’t wait until the outage to script and try these options; getting them wrong can make it worse!
  • 24. Summary: Stuff happens, plan for it. Perform outage simulations. Shuffle Sharding does not make all clients equal in their access to resources; instead it is a pattern to restrict the impact when things go wrong, giving you time to react. It sits on top of strong patterns for retries, virtual routing, database sharding, etc. Observability and monitoring are critical to identify issues. If the data tier isn't robust, you'll still knock that layer over.
  • 25. Articles: Amazon’s DNS service presentation: https://aws.amazon.com/builders-library/workload-isolation-using-shuffle-sharding/ Great video by AWS TAMs talking about how they use Shuffle Sharding in many AWS services: https://www.youtube.com/watch?v=TVdxPF7KL1c&t=1s Shuffle Sharding in the context of a Cell topology: https://docs.aws.amazon.com/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/faq.html