Resilient service-to-service calls in a post-Hystrix world
Rareș Mușină, Tech Lead @N26
@r3sm4n
R.I.P. Hystrix (2012-2018)
Integration Patterns
Sync vs. Eventual Consistency
The anatomy of a cascading failure
Triggering Conditions - What happened?
The anatomy of a cascading failure
New Rollouts
Planned Changes
Traffic Drains
Turndowns
Triggering Conditions - Change
public String getCountry(String userId) {
    try {
        // Try to get latest country to avoid stale info
        UserInfo userInfo = userInfoService.update(userId);
        updateCache(userInfo);
        ...
        return getCountryFromCache(userId);
    } catch (Exception e) {
        // Default to cache if service is down
        return getCountryFromCache(userId);
    }
}
The anatomy of a cascading failure
Triggering Conditions - What happened?
The anatomy of a cascading failure
Triggering Conditions - Throttling
The anatomy of a cascading failure
Triggering Conditions - What happened?
The anatomy of a cascading failure
Burstiness (e.g. scheduled tasks)
DDoSes
Instance Death (gee, thanks Spotinst)
Organic Growth
Request profile changes
Triggering Conditions - Entropy
The anatomy of a cascading failure
CPU
Memory
Network
Disk space
Threads
File descriptors
…
Resource Starvation - Common Resources
The anatomy of a cascading failure
Resource Starvation - Dependencies Between Resources
Poorly tuned Garbage Collection
Slow requests
Increased CPU due to GC
More in-progress requests
More RAM due to queuing
Less RAM for caching
Lower cache hit rate
More requests to backend
🔥🔥🔥
The anatomy of a cascading failure
Server Overload/Meltdown/Crash/Unavailability
:(
CPU/Memory maxed out
Health checks returning 5xx
Endpoints returning 5xx
Timeouts
Increased load on other instances
The anatomy of a cascading failure
Cascading Failures - Load Redistribution
The anatomy of a cascading failure
[Diagram: two load balancers spread requests over instances A and B (500/350 and 100/250); when B becomes unhealthy, all of that traffic (600 + 600) lands on A]
Cascading Failures - Retry Amplification
The anatomy of a cascading failure
Cascading Failures - Latency Propagation
The anatomy of a cascading failure
Cascading Failures - Resource Contention During Recovery
The anatomy of a cascading failure
Strategies for Improving Resilience
Architecture - Orchestration vs Choreography
Orchestration
Choreography
Strategies for Improving Resilience
[Diagram: orchestration — the Signup service directly calls the User service (create user), the Account service (create account), and the Card service (ship card); choreography — the Signup service publishes a "user signup" event, and the User, Account, and Card services subscribe to it]
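A minimal sketch of the choreography variant: the signup service only publishes an event and the other services subscribe. All names (EventBus, UserSignedUp, etc.) are illustrative, not N26's actual code.

// Illustrative sketch of the choreography style: the signup service only publishes
// an event; downstream services react on their own. All names are hypothetical.
record UserSignedUp(String userId, String country) {}

interface EventBus {
    void publish(Object event);
    <T> void subscribe(Class<T> type, java.util.function.Consumer<T> handler);
}

class SignupService {
    private final EventBus bus;
    SignupService(EventBus bus) { this.bus = bus; }

    void signUp(String userId, String country) {
        // ... create the user locally, then announce the fact
        bus.publish(new UserSignedUp(userId, country));
    }
}

class AccountService {
    AccountService(EventBus bus) {
        // Reacts to the event instead of being called synchronously by an orchestrator
        bus.subscribe(UserSignedUp.class, e -> createAccount(e.userId()));
    }
    void createAccount(String userId) { /* ... */ }
}

class CardService {
    CardService(EventBus bus) {
        bus.subscribe(UserSignedUp.class, e -> shipCard(e.userId()));
    }
    void shipCard(String userId) { /* ... */ }
}

If the card service is briefly down, the event waits in the broker instead of failing the entire signup call chain.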
Capacity Planning - Do I need it in the age of the cloud?
Helpful, but not sufficient to protect against cascading failures
Accuracy is overrated and expensive (especially for new services)
It’s (usually) ok (and cheaper) to overprovision at first
Strategies for Improving Resilience
Capacity Planning - More important things
Automate provisioning and deployment (🐄🐄🐄 not 🐕🐈🐹)
Auto-scaling and auto-healing
Robust architecture in the face of growing traffic (pub/sub helps)
Agree on SLIs and SLOs and monitor them closely
Strategies for Improving Resilience
Capacity Planning - If I do need it, then what do I do?
Business requirements
Critical services, and YOLO the rest
⚠ Seasonality 🎄🥚🦃
Use hardware resources to measure capacity instead of Requests Per Second:
● cost of request = CPU time it has consumed
● (on GC platforms) higher memory => higher CPU
Strategies for Improving Resilience
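Illustrative arithmetic for the CPU-time approach (numbers invented for the example): an instance with 4 vCPUs has about 4,000 ms of CPU time to spend per second; if a typical request consumes ~20 ms of CPU, the instance saturates at roughly 4,000 / 20 = 200 requests per second, whatever the nominal RPS figures say.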
“Chaos Engineering is the discipline of experimenting
on a system in order to build confidence in the system’s capability to withstand turbulent
conditions in production.”
Principles of Chaos Engineering
Chaos Testing
Strategies for Improving Resilience
Retrying - What should I retry?
What makes a request retriable?
● ⚠ idempotency
● 🚫 GET with side-effects
● ✅ stateless if you can
Should you retry timeouts?
● Stay tuned to the next slides
Strategies for Improving Resilience
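A hedged sketch of a "safe to retry?" check based on the bullets above; the method and status-code rules are illustrative, not a universal policy.

import java.util.Set;

final class RetrySafety {
    // Idempotent-by-contract HTTP methods; GET only counts if it truly has no side effects.
    private static final Set<String> IDEMPOTENT_METHODS = Set.of("GET", "HEAD", "PUT", "DELETE");

    static boolean isRetriable(String method, boolean hasIdempotencyKey, int statusCode) {
        boolean safeToRepeat = IDEMPOTENT_METHODS.contains(method) || hasIdempotencyKey;
        // Only retry failures that are plausibly transient; never retry 4xx client errors.
        boolean transientFailure = statusCode == 429 || statusCode == 503;
        return safeToRepeat && transientFailure;
    }
}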
Retrying - Backing Off With Jitter
Strategies for Improving Resilience
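A minimal sketch of capped exponential backoff with full jitter (delay drawn uniformly from [0, min(cap, base · 2^attempt)]); the base and cap values are illustrative.

import java.util.concurrent.ThreadLocalRandom;

final class Backoff {
    private static final long BASE_MS = 100;     // illustrative
    private static final long CAP_MS = 10_000;   // illustrative

    /** Delay before retry attempt n (0-based): uniform random in [0, min(cap, base * 2^n)]. */
    static long fullJitterDelayMs(int attempt) {
        long exp = Math.min(CAP_MS, BASE_MS * (1L << Math.min(attempt, 20)));
        return ThreadLocalRandom.current().nextLong(exp + 1);
    }
}

Without jitter, synchronized retries from many clients arrive in waves and feed the retry-amplification pattern described earlier.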
Retrying - Retry Budgets
Per-request retry budget
● Each request retried at most 3x
Per-client retry budget
● Retry requests = at most 10% total requests to upstream
● If > 10% of requests are failing => upstream is likely unhealthy
Strategies for Improving Resilience
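A rough sketch of the per-client retry budget described above, using simple counters; a real implementation would decay the counts over a sliding time window.

import java.util.concurrent.atomic.AtomicLong;

final class RetryBudget {
    private static final double MAX_RETRY_RATIO = 0.10;     // retries <= 10% of total requests
    private static final int MAX_ATTEMPTS_PER_REQUEST = 3;

    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong retries = new AtomicLong();

    void onRequest() { requests.incrementAndGet(); }

    /** True if this attempt may be retried without blowing the per-client budget. */
    boolean canRetry(int attempt) {
        if (attempt >= MAX_ATTEMPTS_PER_REQUEST) return false;
        long total = Math.max(1, requests.get());
        if ((retries.get() + 1.0) / total > MAX_RETRY_RATIO) return false;
        retries.incrementAndGet();
        return true;
    }
}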
Throttling - Timeouts
Nested timeouts are 🔥👿🔥
Retries make ☝ worse
Timing out ≠ cancelling: the upstream service might still be processing the request
Maintain discipline when setting timeouts and propagate timeouts (deadlines) downstream
Strategies for Improving Resilience
[Diagram: a call chain across Services A, B, C and D with per-hop timeouts of 5s, 3s and 2s]
⚠ Avoid circular dependencies at all cost ⚠
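One way to keep nested timeouts disciplined is to propagate a single deadline and give each hop only the remaining budget; a sketch using java.net.http, where the X-Request-Deadline header is a hypothetical convention.

import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.time.Instant;

final class DeadlinePropagation {
    /** Builds a downstream request whose timeout is whatever is left of the caller's deadline. */
    static HttpRequest withRemainingBudget(URI uri, Instant deadline) {
        Duration remaining = Duration.between(Instant.now(), deadline);
        if (remaining.isNegative() || remaining.isZero()) {
            // Fail fast: the caller has already given up, so don't call downstream at all
            throw new IllegalStateException("Deadline exceeded");
        }
        return HttpRequest.newBuilder(uri)
                .timeout(remaining)                                // never wait longer than the caller will
                .header("X-Request-Deadline", deadline.toString()) // hypothetical header for the next hop
                .build();
    }
}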
Throttling - Rate Limiting
Avoid overload caused by individual clients by setting per-client limits (sketch after this slide):
● requests from one calling service can use up to x CPU-seconds per time interval on the upstream
● anything above that will be throttled
● these metrics are aggregated across all instances of a calling service and of the upstream
If this is too complicated => limit based on RPS per customer/endpoint
Strategies for Improving Resilience
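If CPU-seconds accounting is too complicated, the simpler per-client RPS limit can be a token bucket keyed by the calling service; a minimal, single-instance sketch (limits would need to be split across instances or enforced centrally). Values are illustrative.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class PerClientRateLimiter {
    private static final double PERMITS_PER_SECOND = 50;   // illustrative per-client limit
    private static final double BURST = 100;

    private static final class Bucket {
        double tokens = BURST;
        long lastRefillNanos = System.nanoTime();
    }

    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    /** Returns false when the calling service has exhausted its budget and should get a 429. */
    boolean tryAcquire(String callingService) {
        Bucket b = buckets.computeIfAbsent(callingService, id -> new Bucket());
        synchronized (b) {
            long now = System.nanoTime();
            b.tokens = Math.min(BURST, b.tokens + (now - b.lastRefillNanos) / 1e9 * PERMITS_PER_SECOND);
            b.lastRefillNanos = now;
            if (b.tokens < 1) return false;
            b.tokens -= 1;
            return true;
        }
    }
}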
Throttling - Circuit Breaking
Strategies for Improving Resilience
[Diagram 1: circuit breaker state machine — Closed → Open when failures reach the threshold; Open → Half-Open after the reset timeout; Half-Open → Closed on a successful probe call, back to Open on failure]
[Diagram 2: Service A calls Service B through a circuit breaker; repeated timeouts trip the circuit, and subsequent calls fail fast with "circuit open"]
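The state machine from the diagram, as a deliberately minimal sketch with illustrative thresholds; libraries like Hystrix and Resilience4j implement the hardened version of this.

import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }

    private static final int FAILURE_THRESHOLD = 5;                       // illustrative
    private static final Duration RESET_TIMEOUT = Duration.ofSeconds(30); // illustrative

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    synchronized <T> T call(Supplier<T> protectedCall, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (Duration.between(openedAt, Instant.now()).compareTo(RESET_TIMEOUT) < 0) {
                return fallback.get();               // fail fast while the circuit is open
            }
            state = State.HALF_OPEN;                 // reset timeout elapsed: allow one probe call
        }
        try {
            T result = protectedCall.get();
            consecutiveFailures = 0;
            state = State.CLOSED;                    // success closes the circuit again
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= FAILURE_THRESHOLD) {
                state = State.OPEN;                  // probe failed or threshold reached: trip the circuit
                openedAt = Instant.now();
            }
            return fallback.get();
        }
    }
}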
Throttling - Adaptive Concurrency Limits
gradient = RTT_noload / RTT_actual
newLimit = currentLimit × gradient + queueSize
[Diagram: incoming requests queue in front of a concurrency-limited pool of in-flight requests]
Strategies for Improving Resilience
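The formula above as a tiny update loop; constants and smoothing are illustrative (Netflix/concurrency-limits' Gradient/Vegas limits are the production-grade versions of this idea).

final class GradientConcurrencyLimit {
    private static final double MIN_LIMIT = 1, MAX_LIMIT = 1000;
    private static final int QUEUE_SIZE = 4;          // headroom for bursts
    private static final double SMOOTHING = 0.2;      // avoid oscillating on noisy samples

    private double limit = 20;                        // current concurrency limit

    /** Feed in the no-load RTT estimate and the RTT observed for a just-completed request. */
    void onSample(double rttNoLoadMs, double rttActualMs) {
        double gradient = Math.max(0.5, Math.min(1.0, rttNoLoadMs / rttActualMs));
        double newLimit = limit * gradient + QUEUE_SIZE;      // newLimit = currentLimit x gradient + queueSize
        limit = (1 - SMOOTHING) * limit + SMOOTHING * newLimit;
        limit = Math.max(MIN_LIMIT, Math.min(MAX_LIMIT, limit));
    }

    int currentLimit() { return (int) limit; }
}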
Fallbacks and Rejection
Cache
Dead letter queues for writes
Return hard-coded value
Empty Response (“Fail Silent”)
User experience
⚠ Make sure to discuss these with your product owners ⚠
Strategies for Improving Resilience
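A hedged sketch of chaining the options above for a read path: primary call, then cache, then an empty "fail silent" response. All types here are illustrative placeholders.

import java.util.Optional;

final class CountryLookup {
    interface UserInfoService { String getCountry(String userId); }   // hypothetical upstream client
    interface Cache<K, V> { V get(K key); void put(K key, V value); } // hypothetical local cache

    private final UserInfoService userInfoService;
    private final Cache<String, String> cache;

    CountryLookup(UserInfoService userInfoService, Cache<String, String> cache) {
        this.userInfoService = userInfoService;
        this.cache = cache;
    }

    String getCountry(String userId) {
        try {
            String country = userInfoService.getCountry(userId);      // primary path
            cache.put(userId, country);
            return country;
        } catch (RuntimeException upstreamDown) {
            return Optional.ofNullable(cache.get(userId))              // fallback 1: possibly stale cache
                    .orElse("");                                       // fallback 2: fail silent, agreed with product
        }
    }
}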
Tools for Improving Resilience
Tools for Improving Resilience
Hystrix
Resilience4j
Envoy
Netflix/concurrency-limits
Hystrix - Topology
Tools for Improving Resilience
[Diagram: Hystrix runs as a library inside the client process, wrapping calls out to Dependencies A–F]
Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
Retrying   | Retrying                    | 🚫      |              |       |
Throttling | Timeouts                    | ✅      |              |       |
Throttling | Rate Limiting               | 🚫      |              |       |
Throttling | Circuit Breaking            | ✅      |              |       |
Throttling | Adaptive Concurrency Limits | 🚫      |              |       |
Rejection  | Fallbacks                   | ✅      |              |       |
Rejection  | Response Caching            | ✅      |              |       |
Hystrix - Resilience Features
Tools for Improving Resilience
Hystrix - Configuration management
public class GetUserInfoCommand extends HystrixCommand<UserInfo> {

    private final String userId;
    private final UserInfoApi userInfoApi;

    public GetUserInfoCommand(String userId, UserInfoApi userInfoApi) {
        super(HystrixCommand.Setter
            .withGroupKey(HystrixCommandGroupKey.Factory.asKey("UserInfo"))
            .andCommandKey(HystrixCommandKey.Factory.asKey("getUserInfo")));
        this.userId = userId;
        this.userInfoApi = userInfoApi;
    }

    @Override
    protected UserInfo run() {
        // Simplified to fit on a slide - you'd have some exception handling
        return userInfoApi.getUserInfo(userId);
    }

    @Override
    protected String getCacheKey() { // Dragons reside here
        return userId;
    }

    @Override
    protected UserInfo getFallback() {
        return UserInfo.empty();
    }
}
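Calling the command would look roughly like this; note the HystrixRequestContext scope required for request caching (see the testing caveats later). userInfoApi is assumed to be in scope.

HystrixRequestContext context = HystrixRequestContext.initializeContext();
try {
    // The second command with the same cache key is served from the request-scoped cache
    UserInfo first  = new GetUserInfoCommand("user-42", userInfoApi).execute();
    UserInfo second = new GetUserInfoCommand("user-42", userInfoApi).execute();
} finally {
    context.shutdown();
}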
Tools for Improving Resilience
Hystrix - Observability
Tools for Improving Resilience
Hystrix - Testing
Scope:
● Unit tests easy for circuit opening/closing, fallbacks
● Integration tests reasonably easy for caching
Caveats:
● If you are using response caching, DO NOT FORGET to test
HystrixRequestContext
● Depending on the errors thrown by the call, you might need to
test circuit tripping (HystrixRuntimeException vs
HystrixBadRequestException)
● If you’re not careful, you might set the same
HystrixCommandGroupKey and HystrixCommandKey
Tools for Improving Resilience
Hystrix - Adoption Considerations
✅: Observability; Mostly easy to test
🚫: No longer supported; Not language agnostic; Cumbersome to configure
⚠: Forces you towards building thick clients; Tricky to enforce on calling services; HystrixRequestContext
Tools for Improving Resilience
Tools for Improving Resilience
Hystrix
Resilience4j
Envoy
Netflix/concurrency-limits
Resilience4j - Topology
Tools for Improving Resilience
[Diagram: like Hystrix, Resilience4j runs as a library inside the client process, wrapping calls out to Dependencies A–F]
Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
Retrying   | Retrying                    | 🚫      | ✅           |       |
Throttling | Timeouts                    | ✅      | ✅           |       |
Throttling | Rate Limiting               | 🚫      | ✅           |       |
Throttling | Circuit Breaking            | ✅      | ✅           |       |
Throttling | Adaptive Concurrency Limits | 🚫      | 👷           |       |
Rejection  | Fallbacks                   | ✅      | ✅           |       |
Rejection  | Response Caching            | ✅      | ✅           |       |
Resilience4j - Resilience Features
Tools for Improving Resilience
CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .waitDurationInOpenState(Duration.ofMillis(1000))
    .recordExceptions(IOException.class, TimeoutException.class)
    .build();

CircuitBreakerRegistry circuitBreakerRegistry = CircuitBreakerRegistry.of(circuitBreakerConfig);
CircuitBreaker circuitBreaker =
    circuitBreakerRegistry.circuitBreaker("getUserInfo", circuitBreakerConfig);

RetryConfig retryConfig = RetryConfig.custom().maxAttempts(3).build();
Retry retry = Retry.of("getUserInfo", retryConfig);

Supplier<UserInfo> decoratedSupplier = CircuitBreaker.decorateSupplier(circuitBreaker,
    Retry.decorateSupplier(retry,
        () -> userInfoApi.getUserInfo(userId)));

UserInfo result = Try.ofSupplier(decoratedSupplier).getOrElse(UserInfo.empty());
Resilience4j - Configuration management
Tools for Improving Resilience
Resilience4j - Observability
You can subscribe to various events for most of the decorators:
Built-in support for:
● Dropwizard (resilience4j-metrics)
● Prometheus (resilience4j-prometheus)
● Micrometer (resilience4j-micrometer)
● Spring-boot actuator health information (resilience4j-spring-boot2)
circuitBreaker.getEventPublisher()
    .onSuccess(event -> logger.info(...))
    .onError(event -> logger.info(...))
    .onIgnoredError(event -> logger.info(...))
    .onReset(event -> logger.info(...))
    .onStateTransition(event -> logger.info(...));
Tools for Improving Resilience
Resilience4j - Testing
Scope:
● Unit tests easy for composed layers and different scenarios
● Integration tests reasonably easy for caching
Caveats:
● Cache scope is tricky here as well
● Basically similar problems to Hystrix testing
Tools for Improving Resilience
Resilience4j - Adoption Considerations
✅: Observability; Feature rich; Easier to configure than Hystrix; Modularization; Fewer transitive dependencies than Hystrix
🚫: Not language agnostic
⚠: Forces you towards building thick clients; Tricky to enforce on calling services
Tools for Improving Resilience
Tools for Improving Resilience
Hystrix
Resilience4j
Envoy
Netflix/concurrency-limits
Envoy - Topology
[Diagram: Envoy runs as a sidecar next to the services in each service cluster, mediating traffic between service clusters, external services, and service discovery]
Tools for Improving Resilience
Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
Retrying   | Retrying                    | 🚫      | ✅           | ✅    |
Throttling | Timeouts                    | ✅      | ✅           | ✅    |
Throttling | Rate Limiting               | 🚫      | ✅           | ✅    |
Throttling | Circuit Breaking            | ✅      | ✅           | ✅    |
Throttling | Adaptive Concurrency Limits | 🚫      | 👷           | 👷    |
Rejection  | Fallbacks                   | ✅      | ✅           | 🚫    |
Rejection  | Response Caching            | ✅      | ✅           | 🚫    |
Envoy - Resilience Features
Tools for Improving Resilience
Envoy - Configuration management
clusters:
- name: get-cluster
  connect_timeout: 10s
  type: STRICT_DNS
  outlier_detection:
    consecutive_5xx: 5
    interval: 10s
    base_ejection_time: 30s
    max_ejection_percent: 10
  circuit_breakers:
    thresholds:
    - priority: DEFAULT
      max_connections: 3
      max_pending_requests: 3
      max_requests: 3
      max_retries: 3
    - priority: HIGH
      max_connections: 10
      max_pending_requests: 10
      max_requests: 10
      max_retries: 10
  hosts:
  - socket_address:
      address: httpbin-get
      port_value: 8080

static_resources:
  listeners:
  - address:
      socket_address:
        ...
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          ...
          route_config:
            ...
            virtual_hosts:
            - name: backend
              ...
              routes:
              - match:
                  prefix: "/"
                  headers:
                  - exact_match: "GET"
                    name: ":method"
                route:
                  cluster: get-cluster
                  ...
                  retry_policy:
                    retry_on: 5xx
                    num_retries: 2
                    priority: HIGH
Tools for Improving Resilience
Envoy - Configuration Deployment
Static config:
● You will benefit from some scripting/tools to generate this config
● Deploy the generated YAML as a sidecar Docker container using the official Docker image
Dynamic config:
● gRPC APIs for dynamically updating these settings
○ Endpoint Discovery Service (EDS)
○ Cluster Discovery Service (CDS)
○ Route Discovery Service (RDS)
○ Listener Discovery Service (LDS)
○ Secret Discovery Service (SDS)
● Control planes like Istio make this manageable
Tools for Improving Resilience
Envoy - Observability
Data sinks:
● envoy.statsd - built-in StatsD sink (does not support tagged metrics)
● envoy.dog_statsd - emits stats with DogStatsD compatible tags
● envoy.stat_sinks.hystrix - emits stats in text/event-stream formatted stream for use
by Hystrix dashboard
● build your own
(Small) subset of stats:
● downstream_rq_total, downstream_rq_5xx, downstream_rq_timeout,
downstream_rq_time, etc.
Detecting open circuits/throttling:
● x-envoy-overloaded header will be injected in the downstream response
● Detailed metrics: cx_open (connection circuit breaker), rq_open (request circuit
breaker), remaining_rq (remaining requests until circuit will open), etc
Tools for Improving Resilience
Envoy - Testing
Scope:
● E2E-ish 🙂
Caveats:
● Setup can be tricky (boot the side-car in a Docker container, put
a mock server behind it and start simulating requests and
different types of failures)
● Will probably need to test this per route, or at whatever granularity your config has
Tools for Improving Resilience
Envoy - Adoption Considerations
✅: Application language agnostic; Enforcement; Change rollout; Caller/callee resilience; Observability
🚫: Fallbacks; Cache
⚠: Testability; Ownership (SRE vs Dev teams); Configuration complexity; Operational complexity
Tools for Improving Resilience
Tools for Improving Resilience
Hystrix
Resilience4j
Envoy
Netflix/concurrency-limits
Netflix/concurrency-limits - Topology
Tools for Improving Resilience
[Diagram: concurrency-limits also runs in-process as a library, limiting in-flight calls between the client and Dependencies A–F]
Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
Retrying   | Retrying                    | 🚫      | ✅           | ✅    | 🚫
Throttling | Timeouts                    | ✅      | ✅           | ✅    | 🚫
Throttling | Rate Limiting               | 🚫      | ✅           | ✅    | 🚫
Throttling | Circuit Breaking            | ✅      | ✅           | ✅    | 🚫
Throttling | Adaptive Concurrency Limits | 🚫      | 👷           | 👷    | ✅
Rejection  | Fallbacks                   | ✅      | ✅           | 🚫    | 🚫
Rejection  | Response Caching            | ✅      | ✅           | 🚫    | 🚫
Netflix/concurrency-limits - Resilience Features
Tools for Improving Resilience
Netflix/concurrency-limits - Configuration management
ConcurrencyLimitServletFilter(
    ServletLimiterBuilder()
        .limit(VegasLimit.newBuilder().build())
        .metricRegistry(concurrencyLimitMetricRegistry)
        .build())
Tools for Improving Resilience
Netflix/concurrency-limits - Observability
class ConcurrencyLimitMetricRegistry(private val meterRegistry: MeterRegistry) : MetricRegistry {

    override fun registerDistribution(id: String?, vararg tagNameValuePairs: String?): MetricRegistry.SampleListener {
        return MetricRegistry.SampleListener { }
    }

    override fun registerGauge(id: String?, supplier: Supplier<Number>?, vararg tagNameValuePairs: String?) {
        id?.let {
            supplier?.let {
                // Tag names and values arrive as a flat vararg: pair them up as (name, value)
                val tags = tagNameValuePairs.filterNotNull()
                    .chunked(2)
                    .filter { it.size == 2 }
                    .map { Tag.of(it[0], it[1]) }
                meterRegistry.gauge(id, tags, supplier.get())
            }
        }
    }
}
Tools for Improving Resilience
Netflix/concurrency-limits - Testing
Scope:
● E2E-ish 🙂
Caveats:
● Haha good luck with that
Tools for Improving Resilience
Netflix/concurrency-limits - Adoption Considerations
✅: Caller/callee resilience; Does not require manual, per-endpoint config; Easier to enforce on calling services
🚫: Not language agnostic; Less mature than the others; Documentation is quite scarce
⚠: Harder to predict throttling; Observability
Tools for Improving Resilience
What did we go for in the end?
Resilience libraries showdown
Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits | gRPC | Sentinel
Retrying   | Retrying                    | 🚫      | ✅           | ✅    | 🚫                         | 👷   | 🚫
Throttling | Timeouts                    | ✅      | ✅           | ✅    | 🚫                         | ✅   | 🚫
Throttling | Rate Limiting               | 🚫      | ✅           | ✅    | 🚫                         | 👷   | 🚫
Throttling | Circuit Breaking            | ✅      | ✅           | ✅    | 🚫                         | 👷   | ✅
Throttling | Adaptive Concurrency Limits | 🚫      | 👷           | 👷    | ✅                         | 👷   | 🚫
Rejection  | Fallbacks                   | ✅      | ✅           | 🚫    | 🚫                         | 👷   | ✅
Rejection  | Response Caching            | ✅      | ✅           | 🚫    | 🚫                         | 👷   | 🚫
And the winner is… Envoy 🥇🥇🥇
Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits | gRPC | Sentinel
Retrying   | Retrying                    | 🚫      | ✅           | ✅    | 🚫                         | 👷   | 🚫
Throttling | Timeouts                    | ✅      | ✅           | ✅    | 🚫                         | ✅   | 🚫
Throttling | Rate Limiting               | 🚫      | ✅           | ✅    | 🚫                         | 👷   | 🚫
Throttling | Circuit Breaking            | ✅      | ✅           | ✅    | 🚫                         | 👷   | ✅
Throttling | Adaptive Concurrency Limits | 🚫      | 👷           | 👷    | ✅                         | 👷   | 🚫
Rejection  | Fallbacks                   | ✅      | ✅           | 🚫    | 🚫                         | 👷   | ✅
Rejection  | Response Caching            | ✅      | ✅           | 🚫    | 🚫                         | 👷   | 🚫
And the winner is… Envoy 🥇🥇🥇, but why?
Reasons:
● We already have it
● Observability is super strong
● Easy to enforce across all our infrastructure
● Allows us to have thin clients
● Language agnostic
And the runner up is… Resilience4j (kinda) 🥈🥈🥈
● Allowed for retries, caching and fallbacks, but adoption is up to the teams
● We discourage using request caching for the most part
Ask Away
Thank you!
🙇 🙇 🙇
