Resilient service-to-service calls in a post-Hystrix world
Rareș Mușină, Tech Lead @N26
@r3sm4n
R.I.P. Hystrix (2012-2018)
Integration Patterns
Sync vs. Eventual Consistency
The anatomy of a cascading failure
Triggering Conditions - What happened?
The anatomy of a cascading failure
New Rollouts
Planned Changes
Traffic Drains
Turndowns
Triggering Conditions - Change
public String getCountry(String userId) {
    try {
        // Try to get latest country to avoid stale info
        UserInfo userInfo = userInfoService.update(userId);
        updateCache(userInfo);
        ...
        return getCountryFromCache(userId);
    } catch (Exception e) {
        // Default to cache if service is down
        return getCountryFromCache(userId);
    }
}
The anatomy of a cascading failure
Triggering Conditions - What happened?
The anatomy of a cascading failure
Triggering Conditions - Throttling
The anatomy of a cascading failure
Triggering Conditions - What happened?
The anatomy of a cascading failure
Burstiness (e.g. scheduled tasks)
DDoSes
Instance Death (gee, thanks Spotinst)
Organic Growth
Request profile changes
Triggering Conditions - Entropy
The anatomy of a cascading failure
CPU
Memory
Network
Disk space
Threads
File descriptors
…
Resource Starvation - Common Resources
The anatomy of a cascading failure
Resource Starvation - Dependencies Between Resources
Poorly tuned Garbage Collection
Slow requests
Increased CPU due to GC
More in-progress requests
More RAM due to queuing
Less RAM for caching
Lower cache hit rate
More requests to backend
🔥🔥🔥
The anatomy of a cascading failure
Server Overload/Meltdown/Crash/Unavailability
:(
CPU/Memory maxed out
Health checks returning 5xx
Endpoints returning 5xx
Timeouts
Increased load on other instances
The anatomy of a cascading failure
Cascading Failures - Load Redistribution
The anatomy of a cascading failure
[Diagram: two load balancers spread requests over instances A and B (500/350 and 100/250); when B becomes unhealthy, all of that traffic (600 + 600) lands on A]
Cascading Failures - Retry Amplification
The anatomy of a cascading failure
Cascading Failures - Latency Propagation
The anatomy of a cascading failure
Cascading Failures - Resource Contention During Recovery
The anatomy of a cascading failure
Strategies for Improving Resilience
Architecture - Orchestration vs Choreography
Orchestration
Choreography
Strategies for Improving Resilience
[Diagram: orchestration — the Signup service directly calls the User service (create user), the Account service (create account), and the Card service (ship card); choreography — the Signup service publishes a "user signup" event, and the User, Account, and Card services subscribe to it]
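A minimal sketch of the choreography variant: the signup service only publishes an event and the other services subscribe. All names (EventBus, UserSignedUp, etc.) are illustrative, not N26's actual code.

// Illustrative sketch of the choreography style: the signup service only publishes
// an event; downstream services react on their own. All names are hypothetical.
record UserSignedUp(String userId, String country) {}

interface EventBus {
    void publish(Object event);
    <T> void subscribe(Class<T> type, java.util.function.Consumer<T> handler);
}

class SignupService {
    private final EventBus bus;
    SignupService(EventBus bus) { this.bus = bus; }

    void signUp(String userId, String country) {
        // ... create the user locally, then announce the fact
        bus.publish(new UserSignedUp(userId, country));
    }
}

class AccountService {
    AccountService(EventBus bus) {
        // Reacts to the event instead of being called synchronously by an orchestrator
        bus.subscribe(UserSignedUp.class, e -> createAccount(e.userId()));
    }
    void createAccount(String userId) { /* ... */ }
}

class CardService {
    CardService(EventBus bus) {
        bus.subscribe(UserSignedUp.class, e -> shipCard(e.userId()));
    }
    void shipCard(String userId) { /* ... */ }
}

If the card service is briefly down, the event waits in the broker instead of failing the entire signup call chain.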
Capacity Planning - Do I need it in the age of the cloud?
Helpful, but not sufficient to protect against cascading failures
Accuracy is overrated and expensive (especially for new services)
It’s (usually) ok (and cheaper) to overprovision at first
Strategies for Improving Resilience
Capacity Planning - More important things
Automate provisioning and deployment (🐄🐄🐄 not 🐕🐈🐹)
Auto-scaling and auto-healing
Robust architecture in the face of growing traffic (pub/sub helps)
Agree on SLIs and SLOs and monitor them closely
Strategies for Improving Resilience
Capacity Planning - If I do need it, then what do I do?
Business requirements
Critical services, and YOLO the rest
⚠ Seasonality 🎄🥚🦃
Use hardware resources to measure capacity instead of Requests Per Second:
● cost of request = CPU time it has consumed
● (on GC platforms) higher memory => higher CPU
Strategies for Improving Resilience
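Illustrative arithmetic for the CPU-time approach (numbers invented for the example): an instance with 4 vCPUs has about 4,000 ms of CPU time to spend per second; if a typical request consumes ~20 ms of CPU, the instance saturates at roughly 4,000 / 20 = 200 requests per second, whatever the nominal RPS figures say.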
“Chaos Engineering is the discipline of experimenting
on a system in order to build confidence in the system’s capability to withstand turbulent
conditions in production.”
Principles of Chaos Engineering
Chaos Testing
Strategies for Improving Resilience
Retrying - What should I retry?
What makes a request retriable?
● ⚠ idempotency
● 🚫 GET with side-effects
● ✅ stateless if you can
Should you retry timeouts?
● Stay tuned to the next slides
Strategies for Improving Resilience
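A hedged sketch of a "safe to retry?" check based on the bullets above; the method and status-code rules are illustrative, not a universal policy.

import java.util.Set;

final class RetrySafety {
    // Idempotent-by-contract HTTP methods; GET only counts if it truly has no side effects.
    private static final Set<String> IDEMPOTENT_METHODS = Set.of("GET", "HEAD", "PUT", "DELETE");

    static boolean isRetriable(String method, boolean hasIdempotencyKey, int statusCode) {
        boolean safeToRepeat = IDEMPOTENT_METHODS.contains(method) || hasIdempotencyKey;
        // Only retry failures that are plausibly transient; never retry 4xx client errors.
        boolean transientFailure = statusCode == 429 || statusCode == 503;
        return safeToRepeat && transientFailure;
    }
}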
Retrying - Backing Off With Jitter
Strategies for Improving Resilience
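A minimal sketch of capped exponential backoff with full jitter (delay drawn uniformly from [0, min(cap, base · 2^attempt)]); the base and cap values are illustrative.

import java.util.concurrent.ThreadLocalRandom;

final class Backoff {
    private static final long BASE_MS = 100;     // illustrative
    private static final long CAP_MS = 10_000;   // illustrative

    /** Delay before retry attempt n (0-based): uniform random in [0, min(cap, base * 2^n)]. */
    static long fullJitterDelayMs(int attempt) {
        long exp = Math.min(CAP_MS, BASE_MS * (1L << Math.min(attempt, 20)));
        return ThreadLocalRandom.current().nextLong(exp + 1);
    }
}

Without jitter, synchronized retries from many clients arrive in waves and feed the retry-amplification pattern described earlier.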
Retrying - Retry Budgets
Per-request retry budget
● Each request retried at most 3x
Per-client retry budget
● Retry requests = at most 10% total requests to upstream
● If > 10% of requests are failing => upstream is likely unhealthy
Strategies for Improving Resilience
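A rough sketch of the per-client retry budget described above, using simple counters; a real implementation would decay the counts over a sliding time window.

import java.util.concurrent.atomic.AtomicLong;

final class RetryBudget {
    private static final double MAX_RETRY_RATIO = 0.10;     // retries <= 10% of total requests
    private static final int MAX_ATTEMPTS_PER_REQUEST = 3;

    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong retries = new AtomicLong();

    void onRequest() { requests.incrementAndGet(); }

    /** True if this attempt may be retried without blowing the per-client budget. */
    boolean canRetry(int attempt) {
        if (attempt >= MAX_ATTEMPTS_PER_REQUEST) return false;
        long total = Math.max(1, requests.get());
        if ((retries.get() + 1.0) / total > MAX_RETRY_RATIO) return false;
        retries.incrementAndGet();
        return true;
    }
}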
Throttling - Timeouts
Nested timeouts are 🔥👿🔥
Retries make ☝ worse
Timing out ≠ cancelling: the upstream service might still be processing the request
Maintain discipline when setting timeouts and propagate timeouts (deadlines) downstream
Strategies for Improving Resilience
[Diagram: a call chain across Services A, B, C and D with per-hop timeouts of 5s, 3s and 2s]
⚠ Avoid circular dependencies at all cost ⚠
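One way to keep nested timeouts disciplined is to propagate a single deadline and give each hop only the remaining budget; a sketch using java.net.http, where the X-Request-Deadline header is a hypothetical convention.

import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.time.Instant;

final class DeadlinePropagation {
    /** Builds a downstream request whose timeout is whatever is left of the caller's deadline. */
    static HttpRequest withRemainingBudget(URI uri, Instant deadline) {
        Duration remaining = Duration.between(Instant.now(), deadline);
        if (remaining.isNegative() || remaining.isZero()) {
            // Fail fast: the caller has already given up, so don't call downstream at all
            throw new IllegalStateException("Deadline exceeded");
        }
        return HttpRequest.newBuilder(uri)
                .timeout(remaining)                                // never wait longer than the caller will
                .header("X-Request-Deadline", deadline.toString()) // hypothetical header for the next hop
                .build();
    }
}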
Throttling - Rate Limiting
Avoid overload caused by individual clients by setting per-client limits (sketch after this slide):
● requests from one calling service can use up to x CPU-seconds per time interval on the upstream
● anything above that will be throttled
● these metrics are aggregated across all instances of a calling service and of the upstream
If this is too complicated => limit based on RPS per customer/endpoint
Strategies for Improving Resilience
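If CPU-seconds accounting is too complicated, the simpler per-client RPS limit can be a token bucket keyed by the calling service; a minimal, single-instance sketch (limits would need to be split across instances or enforced centrally). Values are illustrative.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class PerClientRateLimiter {
    private static final double PERMITS_PER_SECOND = 50;   // illustrative per-client limit
    private static final double BURST = 100;

    private static final class Bucket {
        double tokens = BURST;
        long lastRefillNanos = System.nanoTime();
    }

    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    /** Returns false when the calling service has exhausted its budget and should get a 429. */
    boolean tryAcquire(String callingService) {
        Bucket b = buckets.computeIfAbsent(callingService, id -> new Bucket());
        synchronized (b) {
            long now = System.nanoTime();
            b.tokens = Math.min(BURST, b.tokens + (now - b.lastRefillNanos) / 1e9 * PERMITS_PER_SECOND);
            b.lastRefillNanos = now;
            if (b.tokens < 1) return false;
            b.tokens -= 1;
            return true;
        }
    }
}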
Throttling - Circuit Breaking
Strategies for Improving Resilience
[Diagram 1: circuit breaker state machine — Closed → Open when failures reach the threshold; Open → Half-Open after the reset timeout; Half-Open → Closed on a successful probe call, back to Open on failure]
[Diagram 2: Service A calls Service B through a circuit breaker; repeated timeouts trip the circuit, and subsequent calls fail fast with "circuit open"]
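The state machine from the diagram, as a deliberately minimal sketch with illustrative thresholds; libraries like Hystrix and Resilience4j implement the hardened version of this.

import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }

    private static final int FAILURE_THRESHOLD = 5;                       // illustrative
    private static final Duration RESET_TIMEOUT = Duration.ofSeconds(30); // illustrative

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    synchronized <T> T call(Supplier<T> protectedCall, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (Duration.between(openedAt, Instant.now()).compareTo(RESET_TIMEOUT) < 0) {
                return fallback.get();               // fail fast while the circuit is open
            }
            state = State.HALF_OPEN;                 // reset timeout elapsed: allow one probe call
        }
        try {
            T result = protectedCall.get();
            consecutiveFailures = 0;
            state = State.CLOSED;                    // success closes the circuit again
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= FAILURE_THRESHOLD) {
                state = State.OPEN;                  // probe failed or threshold reached: trip the circuit
                openedAt = Instant.now();
            }
            return fallback.get();
        }
    }
}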
Throttling - Adaptive Concurrency Limits
gradient = RTT_noload / RTT_actual
newLimit = currentLimit × gradient + queueSize
[Diagram: incoming requests queue in front of a concurrency-limited pool of in-flight requests]
Strategies for Improving Resilience
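The formula above as a tiny update loop; constants and smoothing are illustrative (Netflix/concurrency-limits' Gradient/Vegas limits are the production-grade versions of this idea).

final class GradientConcurrencyLimit {
    private static final double MIN_LIMIT = 1, MAX_LIMIT = 1000;
    private static final int QUEUE_SIZE = 4;          // headroom for bursts
    private static final double SMOOTHING = 0.2;      // avoid oscillating on noisy samples

    private double limit = 20;                        // current concurrency limit

    /** Feed in the no-load RTT estimate and the RTT observed for a just-completed request. */
    void onSample(double rttNoLoadMs, double rttActualMs) {
        double gradient = Math.max(0.5, Math.min(1.0, rttNoLoadMs / rttActualMs));
        double newLimit = limit * gradient + QUEUE_SIZE;      // newLimit = currentLimit x gradient + queueSize
        limit = (1 - SMOOTHING) * limit + SMOOTHING * newLimit;
        limit = Math.max(MIN_LIMIT, Math.min(MAX_LIMIT, limit));
    }

    int currentLimit() { return (int) limit; }
}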
Fallbacks and Rejection
Cache
Dead letter queues for writes
Return hard-coded value
Empty Response (“Fail Silent”)
User experience
⚠ Make sure to discuss these with your product owners ⚠
Strategies for Improving Resilience
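A hedged sketch of chaining the options above for a read path: primary call, then cache, then an empty "fail silent" response. All types here are illustrative placeholders.

import java.util.Optional;

final class CountryLookup {
    interface UserInfoService { String getCountry(String userId); }   // hypothetical upstream client
    interface Cache<K, V> { V get(K key); void put(K key, V value); } // hypothetical local cache

    private final UserInfoService userInfoService;
    private final Cache<String, String> cache;

    CountryLookup(UserInfoService userInfoService, Cache<String, String> cache) {
        this.userInfoService = userInfoService;
        this.cache = cache;
    }

    String getCountry(String userId) {
        try {
            String country = userInfoService.getCountry(userId);      // primary path
            cache.put(userId, country);
            return country;
        } catch (RuntimeException upstreamDown) {
            return Optional.ofNullable(cache.get(userId))              // fallback 1: possibly stale cache
                    .orElse("");                                       // fallback 2: fail silent, agreed with product
        }
    }
}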
Tools for Improving Resilience
Tools for Improving Resilience
Hystrix
Resilience4j
Envoy
Netflix/concurrency-limits
Hystrix - Topology
Tools for Improving Resilience
[Diagram: Hystrix runs as a library inside the client process, wrapping calls out to Dependencies A–F]
Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
Retrying   | Retrying                    | 🚫      |              |       |
Throttling | Timeouts                    | ✅      |              |       |
Throttling | Rate Limiting               | 🚫      |              |       |
Throttling | Circuit Breaking            | ✅      |              |       |
Throttling | Adaptive Concurrency Limits | 🚫      |              |       |
Rejection  | Fallbacks                   | ✅      |              |       |
Rejection  | Response Caching            | ✅      |              |       |
Hystrix - Resilience Features
Tools for Improving Resilience
Hystrix - Configuration management
public class GetUserInfoCommand extends HystrixCommand<UserInfo> {

    private final String userId;
    private final UserInfoApi userInfoApi;

    public GetUserInfoCommand(String userId, UserInfoApi userInfoApi) {
        super(HystrixCommand.Setter
            .withGroupKey(HystrixCommandGroupKey.Factory.asKey("UserInfo"))
            .andCommandKey(HystrixCommandKey.Factory.asKey("getUserInfo")));
        this.userId = userId;
        this.userInfoApi = userInfoApi;
    }

    @Override
    protected UserInfo run() {
        // Simplified to fit on a slide - you'd have some exception handling
        return userInfoApi.getUserInfo(userId);
    }

    @Override
    protected String getCacheKey() { // Dragons reside here
        return userId;
    }

    @Override
    protected UserInfo getFallback() {
        return UserInfo.empty();
    }
}
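Calling the command would look roughly like this; note the HystrixRequestContext scope required for request caching (see the testing caveats later). userInfoApi is assumed to be in scope.

HystrixRequestContext context = HystrixRequestContext.initializeContext();
try {
    // The second command with the same cache key is served from the request-scoped cache
    UserInfo first  = new GetUserInfoCommand("user-42", userInfoApi).execute();
    UserInfo second = new GetUserInfoCommand("user-42", userInfoApi).execute();
} finally {
    context.shutdown();
}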
Tools for Improving Resilience
Hystrix - Observability
Tools for Improving Resilience
Hystrix - Testing
Scope:
● Unit tests easy for circuit opening/closing, fallbacks
● Integration tests reasonably easy for caching
Caveats:
● If you are using response caching, DO NOT FORGET to test
HystrixRequestContext
● Depending on the errors thrown by the call, you might need to
test circuit tripping (HystrixRuntimeException vs
HystrixBadRequestException)
● If you’re not careful, you might set the same
HystrixCommandGroupKey and HystrixCommandKey
Tools for Improving Resilience
Hystrix - Adoption Considerations
✅: Observability; Mostly easy to test
🚫: No longer supported; Not language agnostic; Cumbersome to configure
⚠: Forces you towards building thick clients; Tricky to enforce on calling services; HystrixRequestContext
Tools for Improving Resilience
Tools for Improving Resilience
Hystrix
Resilience4j
Envoy
Netflix/concurrency-limits
Resilience4j - Topology
Tools for Improving Resilience
[Diagram: like Hystrix, Resilience4j runs as a library inside the client process, wrapping calls out to Dependencies A–F]
Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
Retrying   | Retrying                    | 🚫      | ✅           |       |
Throttling | Timeouts                    | ✅      | ✅           |       |
Throttling | Rate Limiting               | 🚫      | ✅           |       |
Throttling | Circuit Breaking            | ✅      | ✅           |       |
Throttling | Adaptive Concurrency Limits | 🚫      | 👷           |       |
Rejection  | Fallbacks                   | ✅      | ✅           |       |
Rejection  | Response Caching            | ✅      | ✅           |       |
Resilience4j - Resilience Features
Tools for Improving Resilience
CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .waitDurationInOpenState(Duration.ofMillis(1000))
    .recordExceptions(IOException.class, TimeoutException.class)
    .build();

CircuitBreakerRegistry circuitBreakerRegistry = CircuitBreakerRegistry.of(circuitBreakerConfig);
CircuitBreaker circuitBreaker =
    circuitBreakerRegistry.circuitBreaker("getUserInfo", circuitBreakerConfig);

RetryConfig retryConfig = RetryConfig.custom().maxAttempts(3).build();
Retry retry = Retry.of("getUserInfo", retryConfig);

Supplier<UserInfo> decoratedSupplier = CircuitBreaker.decorateSupplier(circuitBreaker,
    Retry.decorateSupplier(retry,
        () -> userInfoApi.getUserInfo(userId)));

UserInfo result = Try.ofSupplier(decoratedSupplier).getOrElse(UserInfo.empty());
Resilience4j - Configuration management
Tools for Improving Resilience
Resilience4j - Observability
You can subscribe to various events for most of the decorators:
Built-in support for:
● Dropwizard (resilience4j-metrics)
● Prometheus (resilience4j-prometheus)
● Micrometer (resilience4j-micrometer)
● Spring-boot actuator health information (resilience4j-spring-boot2)
circuitBreaker.getEventPublisher()
    .onSuccess(event -> logger.info(...))
    .onError(event -> logger.info(...))
    .onIgnoredError(event -> logger.info(...))
    .onReset(event -> logger.info(...))
    .onStateTransition(event -> logger.info(...));
Tools for Improving Resilience
Resilience4j - Testing
Scope:
● Unit tests easy for composed layers and different scenarios
● Integration tests reasonably easy for caching
Caveats:
● Cache scope is tricky here as well
● Basically similar problems to Hystrix testing
Tools for Improving Resilience
Resilience4j - Adoption Considerations
✅: Observability; Feature rich; Easier to configure than Hystrix; Modularization; Fewer transitive dependencies than Hystrix
🚫: Not language agnostic
⚠: Forces you towards building thick clients; Tricky to enforce on calling services
Tools for Improving Resilience
Tools for Improving Resilience
Hystrix
Resilience4j
Envoy
Netflix/concurrency-limits
Envoy - Topology
[Diagram: Envoy runs as a sidecar next to the services in each service cluster, mediating traffic between service clusters, external services, and service discovery]
Tools for Improving Resilience
Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
Retrying   | Retrying                    | 🚫      | ✅           | ✅    |
Throttling | Timeouts                    | ✅      | ✅           | ✅    |
Throttling | Rate Limiting               | 🚫      | ✅           | ✅    |
Throttling | Circuit Breaking            | ✅      | ✅           | ✅    |
Throttling | Adaptive Concurrency Limits | 🚫      | 👷           | 👷    |
Rejection  | Fallbacks                   | ✅      | ✅           | 🚫    |
Rejection  | Response Caching            | ✅      | ✅           | 🚫    |
Envoy - Resilience Features
Tools for Improving Resilience
Envoy - Configuration management
clusters:
- name: get-cluster
  connect_timeout: 10s
  type: STRICT_DNS
  outlier_detection:
    consecutive_5xx: 5
    interval: 10s
    base_ejection_time: 30s
    max_ejection_percent: 10
  circuit_breakers:
    thresholds:
    - priority: DEFAULT
      max_connections: 3
      max_pending_requests: 3
      max_requests: 3
      max_retries: 3
    - priority: HIGH
      max_connections: 10
      max_pending_requests: 10
      max_requests: 10
      max_retries: 10
  hosts:
  - socket_address:
      address: httpbin-get
      port_value: 8080

static_resources:
  listeners:
  - address:
      socket_address:
        ...
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          ...
          route_config:
            ...
            virtual_hosts:
            - name: backend
              ...
              routes:
              - match:
                  prefix: "/"
                  headers:
                  - exact_match: "GET"
                    name: ":method"
                route:
                  cluster: get-cluster
                  ...
                  retry_policy:
                    retry_on: 5xx
                    num_retries: 2
                    priority: HIGH
Tools for Improving Resilience
Envoy - Configuration Deployment
Static config:
● You will benefit from some scripting/tools to generate this config
● Deploy the generated YAML as a sidecar Docker container using the official Docker image
Dynamic config:
● gRPC APIs for dynamically updating these settings
○ Endpoint Discovery Service (EDS)
○ Cluster Discovery Service (CDS)
○ Route Discovery Service (RDS)
○ Listener Discovery Service (LDS)
○ Secret Discovery Service (SDS)
● Control planes like Istio make this manageable
Tools for Improving Resilience
Envoy - Observability
Data sinks:
● envoy.statsd - built-in StatsD sink (does not support tagged metrics)
● envoy.dog_statsd - emits stats with DogStatsD compatible tags
● envoy.stat_sinks.hystrix - emits stats in text/event-stream formatted stream for use
by Hystrix dashboard
● build your own
(Small) subset of stats:
● downstream_rq_total, downstream_rq_5xx, downstream_rq_timeout,
downstream_rq_time, etc.
Detecting open circuits/throttling:
● x-envoy-overloaded header will be injected in the downstream response
● Detailed metrics: cx_open (connection circuit breaker), rq_open (request circuit
breaker), remaining_rq (remaining requests until circuit will open), etc
Tools for Improving Resilience
Envoy - Testing
Scope:
● E2E-ish 🙂
Caveats:
● Setup can be tricky (boot the side-car in a Docker container, put
a mock server behind it and start simulating requests and
different types of failures)
● Will probably need to test this per route, or at whatever granularity your config has
Tools for Improving Resilience
Envoy - Adoption Considerations
✅: Application language agnostic; Enforcement; Change rollout; Caller/callee resilience; Observability
🚫: Fallbacks; Cache
⚠: Testability; Ownership (SRE vs Dev teams); Configuration complexity; Operational complexity
Tools for Improving Resilience
Tools for Improving Resilience
Hystrix
Resilience4j
Envoy
Netflix/concurrency-limits
Netflix/concurrency-limits - Topology
Tools for Improving Resilience
[Diagram: concurrency-limits also runs in-process as a library, limiting in-flight calls between the client and Dependencies A–F]
Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
Retrying   | Retrying                    | 🚫      | ✅           | ✅    | 🚫
Throttling | Timeouts                    | ✅      | ✅           | ✅    | 🚫
Throttling | Rate Limiting               | 🚫      | ✅           | ✅    | 🚫
Throttling | Circuit Breaking            | ✅      | ✅           | ✅    | 🚫
Throttling | Adaptive Concurrency Limits | 🚫      | 👷           | 👷    | ✅
Rejection  | Fallbacks                   | ✅      | ✅           | 🚫    | 🚫
Rejection  | Response Caching            | ✅      | ✅           | 🚫    | 🚫
Netflix/concurrency-limits - Resilience Features
Tools for Improving Resilience
Netflix/concurrency-limits - Configuration management
ConcurrencyLimitServletFilter(
    ServletLimiterBuilder()
        .limit(VegasLimit.newBuilder().build())
        .metricRegistry(concurrencyLimitMetricRegistry)
        .build())
Tools for Improving Resilience
Netflix/concurrency-limits - Observability
class ConcurrencyLimitMetricRegistry(private val meterRegistry: MeterRegistry) : MetricRegistry {

    override fun registerDistribution(id: String?, vararg tagNameValuePairs: String?): MetricRegistry.SampleListener {
        return MetricRegistry.SampleListener { }
    }

    override fun registerGauge(id: String?, supplier: Supplier<Number>?, vararg tagNameValuePairs: String?) {
        id?.let {
            supplier?.let {
                // Tag names and values arrive as a flat vararg: pair them up as (name, value)
                val tags = tagNameValuePairs.filterNotNull()
                    .chunked(2)
                    .filter { it.size == 2 }
                    .map { Tag.of(it[0], it[1]) }
                meterRegistry.gauge(id, tags, supplier.get())
            }
        }
    }
}
Tools for Improving Resilience
Netflix/concurrency-limits - Testing
Scope:
● E2E-ish 🙂
Caveats:
● Haha good luck with that
Tools for Improving Resilience
Netflix/concurrency-limits - Adoption Considerations
✅: Caller/callee resilience; Does not require manual, per-endpoint config; Easier to enforce on calling services
🚫: Not language agnostic; Less mature than the others; Documentation is quite scarce
⚠: Harder to predict throttling; Observability
Tools for Improving Resilience
What did we go for in the end?
Resilience libraries showdown
Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits | gRPC | Sentinel
Retrying   | Retrying                    | 🚫      | ✅           | ✅    | 🚫                         | 👷   | 🚫
Throttling | Timeouts                    | ✅      | ✅           | ✅    | 🚫                         | ✅   | 🚫
Throttling | Rate Limiting               | 🚫      | ✅           | ✅    | 🚫                         | 👷   | 🚫
Throttling | Circuit Breaking            | ✅      | ✅           | ✅    | 🚫                         | 👷   | ✅
Throttling | Adaptive Concurrency Limits | 🚫      | 👷           | 👷    | ✅                         | 👷   | 🚫
Rejection  | Fallbacks                   | ✅      | ✅           | 🚫    | 🚫                         | 👷   | ✅
Rejection  | Response Caching            | ✅      | ✅           | 🚫    | 🚫                         | 👷   | 🚫
And the winner is… Envoy 🥇🥇🥇
Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits | gRPC | Sentinel
Retrying   | Retrying                    | 🚫      | ✅           | ✅    | 🚫                         | 👷   | 🚫
Throttling | Timeouts                    | ✅      | ✅           | ✅    | 🚫                         | ✅   | 🚫
Throttling | Rate Limiting               | 🚫      | ✅           | ✅    | 🚫                         | 👷   | 🚫
Throttling | Circuit Breaking            | ✅      | ✅           | ✅    | 🚫                         | 👷   | ✅
Throttling | Adaptive Concurrency Limits | 🚫      | 👷           | 👷    | ✅                         | 👷   | 🚫
Rejection  | Fallbacks                   | ✅      | ✅           | 🚫    | 🚫                         | 👷   | ✅
Rejection  | Response Caching            | ✅      | ✅           | 🚫    | 🚫                         | 👷   | 🚫
And the winner is… Envoy 🥇🥇🥇, but why?
Reasons:
● We already have it
● Observability is super strong
● Easy to enforce across all our infrastructure
● Allows us to have thin clients
● Language agnostic
And the runner up is… Resilience4j (kinda) 🥈🥈🥈
● Allowed for retries, caching and fallbacks, but adoption is up to the teams
● We discourage using request caching for the most part
Ask Away
Thank you!
🙇 🙇 🙇
