Microservice Resilience Patterns @VoxxedCern'24

The Resilience Patterns
your Microservices Teams
Should Know
Victor Rentea | https://victorrentea.ro | @victorrentea

👉 victorrentea.ro/training-offer
VictorRentea.ro
👋 I'm Victor Rentea 🇷🇴 PhD(CS), Java Champion
18 years of code
10 years training bright developers at 120+ EU companies:
❤ Clean Code, Testing, Architecture
🛠 Java, Spring, Hibernate, Reactive
⚡ Java Performance, Secure Coding
Educative Talks on YouTube.com/vrentea
European Software Crafters Community (6K devs)
👉 Join for free monthly events at victorrentea.ro/community
Life += 👪 + 🐈 + 🌷garden

Benefits of Microservices
✓ Faster Time-to-Market: 😁 Business
✓ Lower Cognitive Load: 😁 Developers
✓ Technology Upgrade/Change
✓ Scalability for the 🔥 hot parts that require it
✓ Availability, tolerance to partial failures

But we're safe.
We're using HTTP between our services.
😎

Fallacies of Distributed Computing
The network is reliable
Latency is zero
Bandwidth is infinite
Transport cost is zero
The network is secure
Fixed topology
Has one administrator
The network is homogeneous

A distributed system is one in which
the failure of a computer
you didn't even know existed
can render your own computer unusable.
Leslie Lamport


$$\text{Availability} = \frac{\text{uptime}}{\text{total\_time}}$$

$$\text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}$$
MTTF = Mean Time To Failure (crash)
MTTR = Mean Time To Recovery (downtime)
⬆ MTTF: write more tests: unit, integration, smoke, end-to-end
Also: load, spike and resilience tests (see Toxiproxy by Shopify)
⬇ MTTR: faster recovery
⬆ Availability
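A worked example of the availability formula. The numbers are assumptions chosen only for illustration: a service that crashes on average once every 30 days (MTTF = 720 h) and needs 1 hour to recover (MTTR = 1 h), compared with the same service recovering in half the time.

```latex
\begin{align*}
\text{Availability} &= \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}
  = \frac{720}{720 + 1} \approx 0.9986 \quad (99.86\%) \\
\text{with MTTR} = 0.5\,\text{h:} \quad
\text{Availability} &= \frac{720}{720 + 0.5} \approx 0.9993 \quad (99.93\%)
\end{align*}
```

Under these assumptions, halving the recovery time roughly halves the unavailability, without changing how often the service fails.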

Resilience Patterns

Resilience
The ability of a system to handle unexpected situations
- without users noticing it (best case), or
- with a graceful degradation of service

Graceful Degradation
👉 A Query reading data can fail over to:
- Return an older value, e.g. cached 5 minutes ago
- Return a lower-quality response, e.g. Netflix: per-country titles instead of client-tailored ones
- Call a slower system, e.g. search in the SQL DB when ES is down
- Send results later: "We'll email you the results when done"
👉 A Command changing data can fall back to:
- Outbox table pattern: insert it in the DB + schedule ⏱ a persistent retry
- Send it as a message instead of HTTP
- Log an error raising an alarm 🔔 calling for manual intervention
- Send an error event to a supervisor for automatic recovery ("saga" pattern)
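A minimal sketch of the first query fallback ("return an older value"), in plain Java. The class name `StaleCacheFallback` and its API are hypothetical, not taken from the talk; a real service would also bound the cache size and the maximum age of a stale value.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical helper: degrades a failing lookup to the last value that was read successfully. */
final class StaleCacheFallback<K, V> {
    private final Function<K, V> primary; // the remote call; assumed to return non-null values
    private final ConcurrentHashMap<K, V> lastGood = new ConcurrentHashMap<>();

    StaleCacheFallback(Function<K, V> primary) {
        this.primary = primary;
    }

    /** Returns the fresh value, or the last known good one if the primary call fails. */
    Optional<V> get(K key) {
        try {
            V fresh = primary.apply(key);
            lastGood.put(key, fresh); // remember it for future failures
            return Optional.of(fresh);
        } catch (RuntimeException e) {
            // Graceful degradation: an older value (or nothing) instead of an error
            return Optional.ofNullable(lastGood.get(key));
        }
    }
}
```

The caller still has to decide what an empty result means, for example showing a "temporarily unavailable" placeholder.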

ISOLATION
LOOSE COUPLING
LATENCY CONTROL
SUPERVISION

Catastrophic Failure
The entire system becomes unavailable
↓
Split the system into separate pieces and
isolate those pieces against each other.

☠ Isolation Anti-Patterns ☠
• Long chains of REST calls A → B → C → D → E
• A request causes the instance to restart & the client keeps retrying
• A "poison pill" message is retried infinitely, blocking the listener
• Unbounded queues kept in memory cause OutOfMemoryError
• Concurrent massive import/export overloads the DB

Bulkhead
Isolated Failure
The ship does not sink! 🎉

Bulkhead (aka Failure Units)
• Core isolation pattern
• Pure design issue: what should be independent of what?
• Used as units of redundancy
• Used as units of scale
Microservices Resilience Patterns by jRebel

Bulkhead Examples
• Key Features: catalog, search, checkout
• Markets / Regions
• Tenants ☁
Isolate them using separate:
• Connection/Thread Pools
• Application Instances
• Databases, Queues

Throttling
• Limit the load on a server to:
- 💥 Prevent a crash: a few 503s are better than an OutOfMemoryError
- ⏱ Preserve response time: an error is better than an OK after 60 seconds
- ⚠ Protect critical endpoints: favor "place order" over "export invoices"
- ⚖ Ensure fairness: return 429 to greedy clients/tenants
- 💲 Limit auto-scaling to fit the budget
• What to throttle:
- Request rate: max 300 requests/second, e.g. via @RateLimiter
- Concurrency: max 3 exports in parallel, e.g. via @Bulkhead
- Traffic: max 1 GB/minute, or cost: max 300 credits/minute
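The "max N in parallel" kind of throttling can be sketched with a plain `java.util.concurrent.Semaphore`. The class `ConcurrencyLimiter` and its `Rejected` exception are hypothetical names for illustration; this is not the implementation behind the `@Bulkhead` annotation mentioned on the slide.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical concurrency throttle: at most maxConcurrent calls run at the same time. */
final class ConcurrencyLimiter {
    /** Thrown when the limit is reached; an HTTP layer could map it to 429 or 503. */
    static final class Rejected extends RuntimeException {
        Rejected(String message) {
            super(message);
        }
    }

    private final Semaphore permits;

    ConcurrencyLimiter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the task if a permit is free; otherwise rejects immediately instead of waiting. */
    <T> T call(Supplier<T> task) {
        if (!permits.tryAcquire()) {
            throw new Rejected("too many concurrent calls");
        }
        try {
            return task.get();
        } finally {
            permits.release(); // always give the permit back, even if the task failed
        }
    }
}
```

Rejecting immediately (rather than blocking on `acquire()`) is a design choice here: it keeps the caller's response time bounded, in line with "an error is better than an OK after 60 seconds".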

Throttling Features
A: untouched, as it's critical
B: throttled (stopped)
C: degraded

Usage Pattern
Spike

Performance Response Curve
[Figure: throughput (requests completed per second by one machine) plotted against load (number of concurrent requests); a "sweet spot" range gives the best performance, and too much load ends in a crash 💥]
Enqueuing excess load can improve overall performance:
response_time = queue_waiting + execution_time
Queues must be monitored and bounded.
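A bounded queue in front of a fixed worker pool can be sketched with the JDK's `ThreadPoolExecutor`. The factory name `BoundedExecutor.create` is hypothetical and the sizes are up to the caller; the point is that excess load is rejected instead of piling up in memory.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical factory for a worker pool whose waiting queue cannot grow without limit. */
final class BoundedExecutor {
    static ThreadPoolExecutor create(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,                         // fixed number of workers
                0L, TimeUnit.MILLISECONDS,                // keep-alive (unused: core == max)
                new ArrayBlockingQueue<>(queueCapacity),  // bounded: at most queueCapacity waiting tasks
                new ThreadPoolExecutor.AbortPolicy());    // queue full: throw RejectedExecutionException
    }
}
```

`getQueue().size()` on the returned executor is one value worth exporting as a metric, so the queue is monitored as well as bounded.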

Full Parameter Check
• Obvious, yet often neglected
• Validate data when you see it: client requests & API responses
• But don't go too far: validate only what you care about 👉

Be conservative in what you do,
but liberal in what you accept from others.
-- Robustness Principle
(aka Postel's Law)

ISOLATION
LOOSE COUPLING
LATENCY CONTROL
SUPERVISION
Patterns: Throttling, Bulkhead, Complete Parameter Checking, Bounded Queues

Latency Control

Timeout
• Set a timeout every time you block 👍
• If too large:
- Impacts my response time
- ⚠ Mind the defaults: RestTemplate=1 minute, WebClient=unbounded 😱
• If too short:
- False errors: the operation might succeed later on the server
- Keep above the API's measured/SLA response time (max or 99th percentile)
- Tailor per endpoint, e.g. longer for /export
• Monitoring + Alarms 🔔
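A minimal sketch of "set a timeout every time you block", using only `CompletableFuture` from the JDK. The helper `Timeouts.callWithTimeout` and its fallback parameter are hypothetical; HTTP clients normally expose their own connect/read timeout settings, which should be preferred when available.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical helper: waits for a call at most timeoutMillis, then gives up with a fallback. */
final class Timeouts {
    static <T> T callWithTimeout(Supplier<T> call, long timeoutMillis, T fallback) {
        // Runs the call on the common ForkJoinPool; a real service would pass its own executor
        CompletableFuture<T> future = CompletableFuture.supplyAsync(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Marks the future as cancelled; it does NOT interrupt the work already running
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException("call failed", e.getCause());
        }
    }
}
```

Note that after a timeout the operation might still complete on the server, which is exactly the "false error" risk of a too-short timeout.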

A call failed or timed out → I want to retry it.
Key questions to ask 🤔
Error cause: 400, 500, timeout, {retryable:false} in response?..
Worth retrying?
Max attempts?
What backoff?
How to monitor?
Is the operation idempotent?
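The questions above map onto the parameters of a retry helper. Below is a minimal sketch with exponential backoff; `Retry.withBackoff` and its parameter names are hypothetical, and production code usually also adds random jitter to the delay and records each attempt in metrics.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical retry helper with exponential backoff (no jitter, for brevity). */
final class Retry {
    static <T> T withBackoff(Supplier<T> call, int maxAttempts, long initialDelayMillis,
                             Predicate<RuntimeException> worthRetrying) {
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                // Give up on the last attempt, or on errors that a retry cannot fix (e.g. a 400)
                if (attempt >= maxAttempts || !worthRetrying.test(e)) {
                    throw e;
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw e; // stop retrying if the thread is being shut down
                }
                delay *= 2; // backoff: wait twice as long before the next attempt
            }
        }
    }
}
```

The helper cannot answer the last question by itself: retrying is only safe if the operation being called is idempotent.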

Idempotency
An operation is idempotent if repeating it doesn't change the state any further
• Get product by id via GET?
- ✅ YES: no state changes on server
• Cancel Payment by id via HTTP DELETE or message
- ✅ YES: canceling it again won't change anything
• Update Product price by id via PUT or message
- ✅ YES: the price stays unchanged on server
• Place Order { items: [..] } via POST or message
- ❌ NO: the retry could create a second order
- ✅ YES, if you detect duplicates via pastHourOrders = Map<customerId, List<hash(order)>>
• Place Order { id: UUID, items: [..] } (client-generated id) via PUT or message
- ✅ YES: a duplicate would cause a PK/UK violation
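The last case (client-generated id) can be sketched in memory as follows. `OrderService` is a hypothetical class for illustration; the map plays the role of the primary-key/unique constraint that a real database would enforce.

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory order store that makes "place order" idempotent by order id. */
final class OrderService {
    // Stands in for a table with a primary key on the client-generated order id
    private final Map<UUID, List<String>> ordersById = new ConcurrentHashMap<>();

    /** Returns true if the order was created, false if this id was already processed (a retry). */
    boolean placeOrder(UUID orderId, List<String> items) {
        // putIfAbsent is atomic: two concurrent retries cannot both create the order
        return ordersById.putIfAbsent(orderId, List.copyOf(items)) == null;
    }

    int orderCount() {
        return ordersById.size();
    }
}
```

A retried request carrying the same id is detected and ignored, so the client can retry freely without creating a second order.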

Happy Retrying...

Over the last 5 minutes,
99% of requests to a system
failed or timed out.
What are the chances a new call will succeed?

If you know you're going to fail,
at least fail fast.

Circuit Breaker
✅ CLOSED
🛑 OPEN = flow is stopped
Introduced by Michael Nygard in his book Release It! https://www.amazon.com/Release-Production-Ready-Software-Pragmatic-Programmers/dp/0978739213

Circuit Breaker
START → ✅ CLOSED: allow all calls, counting OK, errors, timed-out
✅ CLOSED → 🛑 OPEN: when failed % > threshold (e.g. >99%) over the last 100 calls/seconds
🛑 OPEN: stop all calls for a while
🛑 OPEN → 🟡 HALF-OPEN: after a ⏱ delay
🟡 HALF-OPEN: allow only a few calls (e.g. 2), to test if the server is back
🟡 HALF-OPEN → 🛑 OPEN: any request failed
🟡 HALF-OPEN → ✅ CLOSED: all requests = OK
Fail fast to save client resources.
Let server recover.
Hit server gently after recovery.
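The three states can be sketched as a small state machine. This is a simplified illustration, not a library implementation: it opens after a number of consecutive failures (instead of a failure percentage over a sliding window), and it does not cap the number of trial calls in HALF-OPEN. The class, its constructor parameters, and the injected clock are all hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Simplified, hypothetical circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED/OPEN. */
final class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures that open the circuit
    private final long openMillis;      // how long to stay OPEN before probing the server again
    private final LongSupplier clock;   // current time in millis; injectable for tests
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    CircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN; // the delay has passed: let trial calls through
        }
        return state;
    }

    <T> T call(Supplier<T> protectedCall) {
        if (state() == State.OPEN) {
            // Fail fast: the remote system is not even contacted
            throw new IllegalStateException("circuit open: failing fast");
        }
        try {
            T result = protectedCall.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            throw e;
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN; // a failed trial call, or too many failures: stop the flow
            openedAt = clock.getAsLong();
        }
    }
}
```

For example, `new CircuitBreaker(5, 30_000, System::currentTimeMillis)` would open after 5 consecutive failures and probe the server again after 30 seconds.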

Where to Implement These?
Timeout, Retry, Circuit Breaker, Throttling, Fallback
• When calling fragile/slow systems: SAP, Siebel, external APIs
• Between services in different bulkheads
• At API Gateway level (entry point to the ecosystem)

ISOLATION
LOOSE COUPLING
LATENCY CONTROL
SUPERVISION
Patterns: Bounded Queues, Fan out & quickest reply, Retry, Circuit Breaker, Timeouts, Fail Fast, Throttling, Bulkhead, Complete Parameter Checking, Idempotency, SLA

Loose Coupling

Stateless Service
• Keeping state inside a service can impact:
- Consistency: all nodes must sync it
- Availability: failover nodes must replicate it
- Scalability: new instances must copy it
• Move state out 👍
- To databases
- To clients (browser, mobile) if related to UI flow
- Via request tokens if user metadata (email, roles..)

Location Transparency

Asynchronous Communication (MQ)
• Guaranteed delivery, even if the receiver is unavailable now
• Prevents cascading failures between bulkheads
• Breaks the call stack paradigm, so it seems hard at first
• With resilience concerns, REST also becomes scary
• The sender should NOT need a response or status: a design change
• DON'T expect your changes to be applied instantaneously
⇒ Eventual Consistency ⇐

CAP Theorem
You can either be
Consistent or Available
when Partitioned
Consistent = all data is in perfect sync
Available = you get an answer
Partitioned = using multiple machines

The Real World is not Consistent
• Examples:
- One of the items in stock is accidentally broken during handling
- The product page displays stock: 1, but by the time the customer clicks "add to cart", someone else already bought it!
⇒ Relax, and embrace Eventual Consistency
- After you understand the busine$$ impact

Event-Sourcing
• State moves via events → less REST 🙂
• Lossless capture of history
• Ability to time-travel by replaying events
• Allows scaling up the read flows separately (CQRS)
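"Time-travel by replaying events" can be illustrated with a tiny stock example. The event type `StockChanged` and the `replay` function are hypothetical; a real event store would persist the log and usually keep snapshots to avoid replaying from the very beginning.

```java
import java.util.List;

/** Hypothetical projection: the current stock is derived from the event log, never stored directly. */
final class StockProjection {
    /** An immutable fact: positive delta = items received, negative delta = items sold. */
    static final class StockChanged {
        final int delta;

        StockChanged(int delta) {
            this.delta = delta;
        }
    }

    /** Rebuilds the stock level as it was after the first upTo events of the log. */
    static int replay(List<StockChanged> log, int upTo) {
        int stock = 0;
        for (StockChanged event : log.subList(0, upTo)) {
            stock += event.delta;
        }
        return stock;
    }
}
```

Replaying the whole log gives the current state; replaying a prefix gives the state at an earlier point in time.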

Messaging Pitfalls
• Poison pill message ⇒ send to dead-letter or retry queue
• Duplicated messages ⇒ idempotent consumer
• Out-of-order messages ⇒ aggregator, time-based or smart
• Data privacy ⇒ Claim Check pattern
• Event versioning ⇒ backward/forward compatibility
• Misconfiguration of infrastructure
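The first pitfall can be sketched with an in-memory queue: a message that keeps failing is moved aside after a few attempts instead of blocking the listener forever. `DeadLetterListener.drain` is a hypothetical helper; real brokers typically offer dead-letter queues as a configuration option.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical listener loop that parks poison pills in a dead-letter list. */
final class DeadLetterListener {
    /** Processes every queued message; returns those that failed maxAttempts times. */
    static List<String> drain(Queue<String> queue, Consumer<String> handler, int maxAttempts) {
        List<String> deadLetters = new ArrayList<>();
        String message;
        while ((message = queue.poll()) != null) {
            boolean handled = false;
            for (int attempt = 1; attempt <= maxAttempts && !handled; attempt++) {
                try {
                    handler.accept(message);
                    handled = true;
                } catch (RuntimeException e) {
                    // swallow and retry; a real listener would log the error and back off
                }
            }
            if (!handled) {
                deadLetters.add(message); // give up on it so the next messages are not blocked
            }
        }
        return deadLetters;
    }
}
```

Because a message may be handled more than once across retries, the handler itself should be an idempotent consumer, which is the second pitfall on the slide.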

ISOLATION
LOOSE COUPLING
LATENCY CONTROL
SUPERVISION
Patterns: Bounded Queues, Fan out & quickest reply, Retry, Circuit Breaker, Timeouts, Fail Fast, Throttling, Bulkhead, Complete Parameter Checking, Error Handler, Asynchronous Communication, Location Transparency, Idempotency, Event-Driven, Stateless, Eventual Consistency, Monitor, Escalation, SLA, Health Check
Also see: "Patterns of resilience" talk by Uwe Friedrichsen

Clients cannot / should not
handle downstream errors

People
No matter what a problem seems to be at first,
it's always a people problem.
- First Law of Consulting

To build highly-resilient systems,
you need DevOps teams

Own Your Production
See metrics & set alarms: Prometheus + Grafana..
Distributed request tracing: OpenTelemetry
Log aggregation: ELK...
Time span samples: Zipkin...
SHOW STOPPER if lacking these

Key Points
• Identify bulkheads (failure units)
• Sync calls += timeout, retry, throttling, circuit breaker, failover
• Slow responses can cause cascading failures
• Slow down to move faster
• If you fail, at least fail fast
• Embrace eventual consistency and favor async messages

Thank you!
Join my community: