
Resilient service-to-service calls in a post-Hystrix world


At N26, we want to make sure we have resilience and fault tolerance built into our backend service-to-service calls. Our services used a combination of Hystrix, Retrofit, Retryer, and other tools to achieve this goal. However, Netflix recently announced that Hystrix is no longer under active development. Therefore, we needed to come up with a replacement solution that maintains the same level of functionality. Since Hystrix provided a large portion of our HTTP client resilience (including circuit breaking, connection thread pool thresholds, easy-to-add fallbacks, response caching, etc.), we used this announcement as a good opportunity to revisit our entire HTTP client resilience stack. We wanted to find a solution that consolidated our fragmented tooling into an easy-to-use and consistent approach.

This talk will share the approach we are currently implementing and the tools we analyzed while making the decision. Its aim is to provide backend devs (primarily working in JVM languages) and SREs with a comprehensive view of the state of the art in service-to-service call tooling (Resilience4j, Envoy, gRPC, Retrofit, etc.), mechanisms to improve service-to-service call resilience (timeouts, circuit breaking, adaptive concurrency limits, outlier detection, rate limiting, etc.), and a discussion of where these mechanisms should be implemented (client side, client-side side-car proxy, server-side side-car proxy, or server side).


Resilient service-to-service calls in a post-Hystrix world

  1. 1. Resilient service-to-service calls in a post-Hystrix world Rareș Mușină, Tech Lead @N26 @r3sm4n
  2. 2. R.I.P. Hystrix (2012-2018)
  3. 3. Integration Patterns Sync vs. Eventual Consistency
  4. 4. The anatomy of a cascading failure
  5. 5. Triggering Conditions - What happened? The anatomy of a cascading failure
  6. 6. Triggering Conditions - Change (New Rollouts, Planned Changes, Traffic Drains, Turndowns) The anatomy of a cascading failure
      public String getCountry(String userId) {
          try {
              // Try to get latest country to avoid stale info
              UserInfo userInfo = userInfoService.update(userId);
              updateCache(userInfo);
              ...
              return getCountryFromCache(userId);
          } catch (Exception e) {
              // Default to cache if service is down
              return getCountryFromCache(userId);
          }
      }
  7. 7. Triggering Conditions - What happened? The anatomy of a cascading failure
  8. 8. Triggering Conditions - Throttling The anatomy of a cascading failure
  9. 9. Triggering Conditions - What happened? The anatomy of a cascading failure
  10. 10. Triggering Conditions - Entropy: Burstiness (e.g. scheduled tasks), DDoSes, Instance Death (gee, thanks Spotinst), Organic Growth, Request profile changes. The anatomy of a cascading failure
  11. 11. Resource Starvation - Common Resources: CPU, Memory, Network, Disk space, Threads, File descriptors, ... The anatomy of a cascading failure
  12. 12. Resource Starvation - Dependencies Between Resources: Poorly tuned Garbage Collection → Slow requests → Increased CPU due to GC → More in-progress requests → More RAM due to queuing → Less RAM for caching → Lower cache hit rate → More requests to backend → 🔥🔥🔥 The anatomy of a cascading failure
  13. 13. Server Overload/Meltdown/Crash/Unavailability :( CPU/Memory maxed out Health checks returning 5xx Endpoints returning 5xx Timeouts Increased load on other instances The anatomy of a cascading failure
  14. 14. Cascading Failures - Load Redistribution The anatomy of a cascading failure (diagram: two instances behind an ELB share the load; when one instance dies, its traffic is redistributed to the surviving instance, pushing it over capacity)
  15. 15. Cascading Failures - Retry Amplification The anatomy of a cascading failure
  16. 16. Cascading Failures - Latency Propagation The anatomy of a cascading failure
  17. 17. Cascading Failures - Resource Contention During Recovery The anatomy of a cascading failure
  18. 18. Strategies for Improving Resilience
  19. 19. Architecture - Orchestration vs Choreography Strategies for Improving Resilience (diagrams: Orchestration - the Signup service calls the User service, Account service, and Card service directly to create the user, create the account, and ship the card; Choreography - the Signup service publishes a user signup event to which the User, Account, and Card services subscribe)
  20. 20. Capacity Planning - Do I need it in the age of the cloud? Helpful, but not sufficient to protect against cascading failures Accuracy is overrated and expensive (especially for new services) It’s (usually) ok (and cheaper) to overprovision at first Strategies for Improving Resilience
  21. 21. Capacity Planning - More important things Automate provisioning and deployment (🐄🐄🐄 not 🐕🐈🐹) Auto-scaling and auto-healing Robust architecture in the face of growing traffic (pub/sub helps) Agree on SLIs and SLOs and monitor them closely Strategies for Improving Resilience
  22. 22. Capacity Planning - If I do need it, then what do I do? Business requirements Critical services, and YOLO the rest ⚠ Seasonality 🎄🥚🦃 Use hardware resources to measure capacity instead of Requests Per Second: ● cost of request = CPU time it has consumed ● (on GC platforms) higher memory => higher CPU Strategies for Improving Resilience
  23. 23. “Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” Principles of Chaos Engineering Chaos Testing Strategies for Improving Resilience
  24. 24. Retrying - What should I retry? What makes a request retriable? ● ⚠ idempotency ● 🚫 GET with side-effects ● ✅ stateless if you can Should you retry timeouts? ● Stay tuned to the next slides Strategies for Improving Resilience
  25. 25. Retrying - Backing Off With Jitter Strategies for Improving Resilience
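      A minimal sketch of the backing-off-with-jitter idea from the slide above; the helper name, attempt count, and delay parameters are illustrative assumptions, not the talk's actual implementation:

      import java.time.Duration;
      import java.util.concurrent.Callable;
      import java.util.concurrent.ThreadLocalRandom;

      public final class RetryWithJitter {

          // Exponential backoff with "full jitter": sleep a random amount between
          // 0 and min(cap, base * 2^attempt) before each retry.
          public static <T> T call(Callable<T> retriableCall, int maxAttempts,
                                   Duration base, Duration cap) throws Exception {
              for (int attempt = 0; attempt < maxAttempts; attempt++) {
                  try {
                      return retriableCall.call();
                  } catch (Exception e) {
                      if (attempt == maxAttempts - 1) {
                          throw e; // out of attempts: propagate the last failure
                      }
                      long backoffMillis = Math.min(cap.toMillis(), base.toMillis() << attempt);
                      Thread.sleep(ThreadLocalRandom.current().nextLong(backoffMillis + 1));
                  }
              }
              throw new IllegalStateException("unreachable");
          }
      }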
  26. 26. Retrying - Retry Budgets Per-request retry budget ● Each request retried at most 3x Per-client retry budget ● Retry requests = at most 10% total requests to upstream ● If > 10% of requests are failing => upstream is likely unhealthy Strategies for Improving Resilience
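      The per-client retry budget can be sketched roughly as below: retries are allowed only while they stay under a fixed fraction of total requests. The class name is made up, and a production version would use a sliding or decaying window rather than raw counters:

      import java.util.concurrent.atomic.AtomicLong;

      public final class RetryBudget {

          private final double maxRetryRatio; // e.g. 0.10 == retries capped at 10% of requests
          private final AtomicLong requests = new AtomicLong();
          private final AtomicLong retries = new AtomicLong();

          public RetryBudget(double maxRetryRatio) {
              this.maxRetryRatio = maxRetryRatio;
          }

          public void onRequest() {
              requests.incrementAndGet();
          }

          public boolean tryAcquireRetry() {
              long totalRequests = Math.max(1, requests.get());
              if ((retries.get() + 1) > maxRetryRatio * totalRequests) {
                  return false; // budget exhausted: the upstream is probably unhealthy, stop retrying
              }
              retries.incrementAndGet();
              return true;
          }
      }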
  27. 27. Throttling - Timeouts Nesting is 🔥👿🔥 Retries make ☝ worse Timing out => upstream service might still be processing request Maintain discipline when setting timeouts/Propagate timeouts ⚠ Avoid circular dependencies at all cost ⚠ Strategies for Improving Resilience (diagram: a chain of nested calls across Services A, B, C and D with mismatched 5s, 3s and 2s timeouts)
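      One way to read "propagate timeouts" is to forward the caller's remaining deadline downstream so no service waits longer than the original caller will. The header name and the use of java.net.http below are assumptions for illustration only:

      import java.net.URI;
      import java.net.http.HttpClient;
      import java.net.http.HttpRequest;
      import java.net.http.HttpResponse;
      import java.time.Duration;
      import java.time.Instant;

      public final class DeadlinePropagation {

          // Hypothetical header carrying the remaining budget in milliseconds.
          static final String DEADLINE_HEADER = "x-request-deadline-ms";

          static HttpResponse<String> callDownstream(HttpClient client, URI uri, Instant deadline)
                  throws Exception {
              long remainingMillis = Duration.between(Instant.now(), deadline).toMillis();
              if (remainingMillis <= 0) {
                  // Fail fast instead of calling downstream with an already-expired budget
                  throw new IllegalStateException("deadline already exceeded");
              }
              HttpRequest request = HttpRequest.newBuilder(uri)
                      .timeout(Duration.ofMillis(remainingMillis)) // never wait longer than the caller will
                      .header(DEADLINE_HEADER, Long.toString(remainingMillis))
                      .GET()
                      .build();
              return client.send(request, HttpResponse.BodyHandlers.ofString());
          }
      }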
  28. 28. Throttling - Rate Limiting Avoid overload by clients and set per-client limits: ● requests from one calling service can use up to x CPU seconds/time interval on the upstream ● anything above that will be throttled ● these metrics are aggregated across all instances of a calling service and upstream If this is too complicated => limit based on RPS/customer/endpoint Strategies for Improving Resilience
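      A much-simplified sketch of per-client throttling using the requests-per-second fallback the slide itself suggests (fixed one-second windows; the class name and limits are illustrative, not the CPU-seconds scheme described above):

      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      // Fixed-window requests-per-second limiter keyed by calling service.
      public final class PerClientRateLimiter {

          private static final class Window {
              long windowStartMillis;
              int count;
          }

          private final int maxRequestsPerSecond;
          private final Map<String, Window> windows = new ConcurrentHashMap<>();

          public PerClientRateLimiter(int maxRequestsPerSecond) {
              this.maxRequestsPerSecond = maxRequestsPerSecond;
          }

          public synchronized boolean tryAcquire(String callingService) {
              long now = System.currentTimeMillis();
              Window w = windows.computeIfAbsent(callingService, k -> new Window());
              if (now - w.windowStartMillis >= 1000) {
                  w.windowStartMillis = now; // start a new one-second window
                  w.count = 0;
              }
              if (w.count >= maxRequestsPerSecond) {
                  return false; // throttle: respond with 429 and let the client back off
              }
              w.count++;
              return true;
          }
      }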
  29. 29. Throttling - Circuit Breaking Strategies for Improving Resilience (diagram: circuit breaker state machine - Closed stays Closed on failures under the threshold and trips to Open when the failure threshold is reached; Open moves to Half-Open after the reset timeout; Half-Open returns to Closed on success or back to Open on failure; while Open, calls fail fast with a circuit-open error. Service A calls Service B through a circuit breaker: repeated timeouts trip the circuit, after which calls are rejected immediately)
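      To make the Closed/Open/Half-Open transitions concrete, here is a minimal hand-rolled circuit breaker; in practice you would reach for one of the libraries compared later rather than this sketch:

      import java.time.Duration;
      import java.time.Instant;
      import java.util.function.Supplier;

      public final class SimpleCircuitBreaker {

          private enum State { CLOSED, OPEN, HALF_OPEN }

          private final int failureThreshold;
          private final Duration resetTimeout;

          private State state = State.CLOSED;
          private int consecutiveFailures = 0;
          private Instant openedAt;

          public SimpleCircuitBreaker(int failureThreshold, Duration resetTimeout) {
              this.failureThreshold = failureThreshold;
              this.resetTimeout = resetTimeout;
          }

          public synchronized <T> T call(Supplier<T> protectedCall) {
              if (state == State.OPEN) {
                  if (Duration.between(openedAt, Instant.now()).compareTo(resetTimeout) < 0) {
                      throw new IllegalStateException("circuit open: failing fast");
                  }
                  state = State.HALF_OPEN; // reset timeout elapsed: allow one trial call
              }
              try {
                  T result = protectedCall.get();
                  consecutiveFailures = 0;
                  state = State.CLOSED; // success closes the circuit again
                  return result;
              } catch (RuntimeException e) {
                  consecutiveFailures++;
                  if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                      state = State.OPEN; // threshold reached (or trial call failed): trip the circuit
                      openedAt = Instant.now();
                  }
                  throw e;
              }
          }
      }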
  30. 30. Throttling - Adaptive Concurrency Limits Strategies for Improving Resilience
      gradient = RTTnoload / RTTactual
      newLimit = currentLimit × gradient + queueSize
      (diagram: request queue and in-flight concurrency)
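      The gradient formula from the slide written out as a small update step; it omits the smoothing and min/max bounds a real implementation would add:

      public final class GradientLimit {

          private double currentLimit;

          public GradientLimit(double initialLimit) {
              this.currentLimit = initialLimit;
          }

          // newLimit = currentLimit * gradient + queueSize, with gradient = rttNoLoad / rttActual.
          // When latency rises above the no-load baseline the gradient drops below 1 and the
          // concurrency limit shrinks; when latency recovers the limit grows back.
          public double update(double rttNoLoadMillis, double rttActualMillis, double queueSize) {
              double gradient = rttNoLoadMillis / rttActualMillis;
              currentLimit = currentLimit * gradient + queueSize;
              return currentLimit;
          }
      }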
  31. 31. Fallbacks and Rejection: Cache, Dead letter queues for writes, Return hard-coded value, Empty Response (“Fail Silent”), User experience ⚠ Make sure to discuss these with your product owners ⚠ Strategies for Improving Resilience
  32. 32. Tools for Improving Resilience
  33. 33. Tools for Improving Resilience Hystrix Resilience4j Envoy Netflix/concurrency-limits
  34. 34. Hystrix - Topology Tools for Improving Resilience (diagram: a client calling Dependencies A through F, with each call wrapped by the Hystrix library inside the client process)
  35. 35. Hystrix - Resilience Features Tools for Improving Resilience
      Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
      Retrying   | Retrying                    | 🚫      |              |       |
      Throttling | Timeouts                    | ✅      |              |       |
                 | Rate Limiting               | 🚫      |              |       |
                 | Circuit Breaking            | ✅      |              |       |
                 | Adaptive Concurrency Limits | 🚫      |              |       |
      Rejection  | Fallbacks                   | ✅      |              |       |
                 | Response Caching            | ✅      |              |       |
  36. 36. Hystrix - Configuration management Tools for Improving Resilience
      public class GetUserInfoCommand extends HystrixCommand<UserInfo> {
          private final String userId;
          private final UserInfoApi userInfoApi;

          public GetUserInfoCommand(String userId, UserInfoApi userInfoApi) {
              super(HystrixCommand.Setter
                  .withGroupKey(HystrixCommandGroupKey.Factory.asKey("UserInfo"))
                  .andCommandKey(HystrixCommandKey.Factory.asKey("getUserInfo")));
              this.userId = userId;
              this.userInfoApi = userInfoApi;
          }

          @Override
          protected UserInfo run() {
              // Simplified to fit on a slide - you'd have some exception handling
              return userInfoApi.getUserInfo(userId);
          }

          @Override
          protected String getCacheKey() {
              // Dragons reside here
              return userId;
          }

          @Override
          protected UserInfo getFallback() {
              return UserInfo.empty();
          }
      }
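      A short usage sketch for the command above (assuming the constructor also receives the UserInfoApi): request caching via getCacheKey() only works when the call happens inside a HystrixRequestContext, which is the same caveat raised on the testing slide below:

      import com.netflix.hystrix.strategy.concurrency.HystrixRequestContext;

      // inside some service method, given a UserInfoApi instance `userInfoApi`
      HystrixRequestContext context = HystrixRequestContext.initializeContext();
      try {
          UserInfo first  = new GetUserInfoCommand("user-42", userInfoApi).execute();
          UserInfo cached = new GetUserInfoCommand("user-42", userInfoApi).execute(); // served from the request cache
      } finally {
          context.shutdown();
      }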
  37. 37. Hystrix - Observability Tools for Improving Resilience
  38. 38. Hystrix - Testing Scope: ● Unit tests easy for circuit opening/closing, fallbacks ● Integration tests reasonably easy for caching Caveats: ● If you are using response caching, DO NOT FORGET to test HystrixRequestContext ● Depending on the errors thrown by the call, you might need to test circuit tripping (HystrixRuntimeException vs HystrixBadRequestException) ● If you’re not careful, you might set the same HystrixCommandGroupKey and HystrixCommandKey Tools for Improving Resilience
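      A sketch of the "circuit opening + fallback" style of unit test mentioned above, forcing the circuit open through command properties; this is an illustrative JUnit example, not N26's actual test code:

      import com.netflix.hystrix.HystrixCommand;
      import com.netflix.hystrix.HystrixCommandGroupKey;
      import com.netflix.hystrix.HystrixCommandProperties;
      import com.netflix.hystrix.strategy.concurrency.HystrixRequestContext;
      import org.junit.Test;

      import static org.junit.Assert.assertEquals;

      public class CircuitOpenFallbackTest {

          // Command whose circuit is forced open, so run() is never invoked.
          static class AlwaysOpenCommand extends HystrixCommand<String> {
              AlwaysOpenCommand() {
                  super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("TestGroup"))
                          .andCommandPropertiesDefaults(
                                  HystrixCommandProperties.Setter().withCircuitBreakerForceOpen(true)));
              }

              @Override
              protected String run() {
                  return "from upstream";
              }

              @Override
              protected String getFallback() {
                  return "fallback";
              }
          }

          @Test
          public void openCircuitReturnsFallback() {
              // Needed as soon as request caching is involved (see the caveat above)
              HystrixRequestContext context = HystrixRequestContext.initializeContext();
              try {
                  assertEquals("fallback", new AlwaysOpenCommand().execute());
              } finally {
                  context.shutdown();
              }
          }
      }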
  39. 39. Hystrix - Adoption Considerations Tools for Improving Resilience
      ✅ Observability; Mostly easy to test
      🚫 No longer supported; Not language agnostic; Cumbersome to configure HystrixRequestContext
      ⚠ Forces you towards building thick clients; Tricky to enforce on calling services
  40. 40. Tools for Improving Resilience Hystrix Resilience4j Envoy Netflix/concurrency-limits
  41. 41. Resilience4j - Topology Tools for Improving Resilience (diagram: same as the Hystrix topology - a client calling Dependencies A through F, with the resilience library running inside the client process)
  42. 42. Resilience4j - Resilience Features Tools for Improving Resilience
      Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
      Retrying   | Retrying                    | 🚫      | ✅           |       |
      Throttling | Timeouts                    | ✅      | ✅           |       |
                 | Rate Limiting               | 🚫      | ✅           |       |
                 | Circuit Breaking            | ✅      | ✅           |       |
                 | Adaptive Concurrency Limits | 🚫      | 👷           |       |
      Rejection  | Fallbacks                   | ✅      | ✅           |       |
                 | Response Caching            | ✅      | ✅           |       |
  43. 43. Resilience4j - Configuration management Tools for Improving Resilience
      CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
          .failureRateThreshold(50)
          .waitDurationInOpenState(Duration.ofMillis(1000))
          .recordExceptions(IOException.class, TimeoutException.class)
          .build();

      CircuitBreakerRegistry circuitBreakerRegistry = CircuitBreakerRegistry.of(circuitBreakerConfig);
      CircuitBreaker circuitBreaker =
          circuitBreakerRegistry.circuitBreaker("getUserInfo", circuitBreakerConfig);

      RetryConfig retryConfig = RetryConfig.custom().maxAttempts(3).build();
      Retry retry = Retry.of("getUserInfo", retryConfig);

      Supplier<UserInfo> decorateSupplier = CircuitBreaker.decorateSupplier(circuitBreaker,
          Retry.decorateSupplier(retry, () -> userInfoApi.getUserInfo(userId)));

      UserInfo result = Try.ofSupplier(decorateSupplier).getOrElse(UserInfo.empty());
  44. 44. Resilience4j - Observability Tools for Improving Resilience
      You can subscribe to various events for most of the decorators. Built in support for:
      ● Dropwizard (resilience4j-metrics)
      ● Prometheus (resilience4j-prometheus)
      ● Micrometer (resilience4j-micrometer)
      ● Spring-boot actuator health information (resilience4j-spring-boot2)

      circuitBreaker.getEventPublisher()
          .onSuccess(event -> logger.info(...))
          .onError(event -> logger.info(...))
          .onIgnoredError(event -> logger.info(...))
          .onReset(event -> logger.info(...))
          .onStateTransition(event -> logger.info(...));
  45. 45. Resilience4j - Testing Scope: ● Unit tests easy for composed layers and different scenarios ● Integration tests reasonably easy for caching Caveats: ● Cache scope is tricky here as well ● Basically similar problems to Hystrix testing Tools for Improving Resilience
  46. 46. Resilience4j - Adoption Considerations Tools for Improving Resilience
      ✅ Observability; Feature rich; Easier to configure than Hystrix; Modularization; Less transitive dependencies than Hystrix
      🚫 Not language agnostic
      ⚠ Forces you towards building thick clients; Tricky to enforce on calling services
  47. 47. Tools for Improving Resilience Hystrix Resilience4j Envoy Netflix/concurrency-limits
  48. 48. Envoy - Topology Tools for Improving Resilience (diagram: services deployed in service clusters with a side-car proxy next to each instance, plus external services and a discovery service)
  49. 49. Envoy - Resilience Features Tools for Improving Resilience
      Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
      Retrying   | Retrying                    | 🚫      | ✅           | ✅    |
      Throttling | Timeouts                    | ✅      | ✅           | ✅    |
                 | Rate Limiting               | 🚫      | ✅           | ✅    |
                 | Circuit Breaking            | ✅      | ✅           | ✅    |
                 | Adaptive Concurrency Limits | 🚫      | 👷           | 👷    |
      Rejection  | Fallbacks                   | ✅      | ✅           | 🚫    |
                 | Response Caching            | ✅      | ✅           | 🚫    |
  50. 50. Envoy - Configuration management Tools for Improving Resilience
      clusters:
      - name: get-cluster
        connect_timeout: 10s
        type: STRICT_DNS
        outlier_detection:
          consecutive_5xx: 5
          interval: 10s
          base_ejection_time: 30s
          max_ejection_percent: 10
        circuit_breakers:
          thresholds:
          - priority: DEFAULT
            max_connections: 3
            max_pending_requests: 3
            max_requests: 3
            max_retries: 3
          - priority: HIGH
            max_connections: 10
            max_pending_requests: 10
            max_requests: 10
            max_retries: 10
        hosts:
        - socket_address:
            address: httpbin-get
            port_value: 8080

      static_resources:
        listeners:
        - address:
            socket_address: ...
          filter_chains:
          - filters:
            - name: envoy.http_connection_manager
              config:
                ...
                route_config:
                  ...
                  virtual_hosts:
                  - name: backend
                    ...
                    routes:
                    - match:
                        prefix: "/"
                        headers:
                        - exact_match: "GET"
                          name: ":method"
                      route:
                        cluster: get-cluster
                        ...
                        retry_policy:
                          retry_on: 5xx
                          num_retries: 2
                        priority: HIGH
  51. 51. Envoy - Configuration Deployment Static config: ● You will benefit from some scripting/tools to generate this config ● Deploy the generated YAML as a Docker container side-car using the official Docker image Dynamic config: ● gRPC APIs for dynamically updating these settings ○ Endpoint Discovery Service (EDS) ○ Cluster Discovery Service (CDS) ○ Route Discovery Service (RDS) ○ Listener Discovery Service (LDS) ○ Secret Discovery Service (SDS) ● Control planes like Istio make this manageable Tools for Improving Resilience
  52. 52. Envoy - Observability Data sinks: ● envoy.statsd - built-in envoy.statsd sink (does not support tagged metrics) ● envoy.dog_statsd - emits stats with DogStatsD compatible tags ● envoy.stat_sinks.hystrix - emits stats in text/event-stream formatted stream for use by Hystrix dashboard ● build your own (Small) subset of stats: ● downstream_rq_total, downstream_rq_5xx, downstream_rq_timeout, downstream_rq_time, etc. Detecting open circuits/throttling: ● x-envoy-overloaded header will be injected in the downstream response ● Detailed metrics: cx_open (connection circuit breaker), rq_open (request circuit breaker), remaining_rq (remaining requests until circuit will open), etc Tools for Improving Resilience
  53. 53. Envoy - Testing Scope: ● E2E-ish 🙂 Caveats: ● Setup can be tricky (boot the side-car in a Docker container, put a mock server behind it and start simulating requests and different types of failures) ● Will probably need to test this per route (or whatever your config granularity is) Tools for Improving Resilience
  54. 54. Envoy - Adoption Considerations Tools for Improving Resilience
      ✅ Application language agnostic; Enforcement; Change rollout; Caller/callee resilience; Observability
      🚫 Fallbacks; Cache
      ⚠ Testability; Ownership (SRE vs Dev teams); Configuration Complexity; Operational Complexity
  55. 55. Tools for Improving Resilience Hystrix Resilience4j Envoy Netflix/concurrency-limits
  56. 56. Netflix/concurrency-limits - Topology Tools for Improving Resilience (diagram: a client calling Dependencies A through F, with the limiter embedded as a library in the application process)
  57. 57. Netflix/concurrency-limits - Resilience Features Tools for Improving Resilience
      Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
      Retrying   | Retrying                    | 🚫      | ✅           | ✅    | 🚫
      Throttling | Timeouts                    | ✅      | ✅           | ✅    | 🚫
                 | Rate Limiting               | 🚫      | ✅           | ✅    | 🚫
                 | Circuit Breaking            | ✅      | ✅           | ✅    | 🚫
                 | Adaptive Concurrency Limits | 🚫      | 👷           | 👷    | ✅
      Rejection  | Fallbacks                   | ✅      | ✅           | 🚫    | 🚫
                 | Response Caching            | ✅      | ✅           | 🚫    | 🚫
  58. 58. Netflix/concurrency-limits - Configuration management Tools for Improving Resilience
      ConcurrencyLimitServletFilter(
          ServletLimiterBuilder()
              .limit(VegasLimit.newBuilder().build())
              .metricRegistry(concurrencyLimitMetricRegistry)
              .build())
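      If the application runs on Spring Boot (an assumption here, suggested only by the earlier resilience4j-spring-boot2 mention), the filter above could be registered along these lines; the configuration class and bean names are made up:

      import javax.servlet.Filter;
      import org.springframework.boot.web.servlet.FilterRegistrationBean;
      import org.springframework.context.annotation.Bean;
      import org.springframework.context.annotation.Configuration;

      @Configuration
      public class ConcurrencyLimitFilterConfig {

          // `concurrencyLimitServletFilter` is assumed to be the filter built on the previous slide.
          @Bean
          public FilterRegistrationBean<Filter> concurrencyLimitFilter(Filter concurrencyLimitServletFilter) {
              FilterRegistrationBean<Filter> registration =
                      new FilterRegistrationBean<>(concurrencyLimitServletFilter);
              registration.addUrlPatterns("/*"); // apply the concurrency limit to all inbound requests
              return registration;
          }
      }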
  59. 59. Netflix/concurrency-limits - Observability Tools for Improving Resilience
      class ConcurrencyLimitMetricRegistry(private val meterRegistry: MeterRegistry) : MetricRegistry {

          override fun registerDistribution(id: String?, vararg tagNameValuePairs: String?): MetricRegistry.SampleListener {
              return MetricRegistry.SampleListener { }
          }

          override fun registerGauge(id: String?, supplier: Supplier<Number>?, vararg tagNameValuePairs: String?) {
              id?.let { nonNullId ->
                  supplier?.let { nonNullSupplier ->
                      // Pair up consecutive (name, value) elements and expose the gauge via Micrometer
                      val tags = tagNameValuePairs.toList().chunked(2).map { Tag.of(it[0], it[1]) }
                      meterRegistry.gauge(nonNullId, tags, nonNullSupplier.get())
                  }
              }
          }
      }
  60. 60. Netflix/concurrency-limits - Testing Scope: ● E2E-ish 🙂 Caveats: ● Haha good luck with that Tools for Improving Resilience
  61. 61. Netflix/concurrency-limits - Adoption Considerations Tools for Improving Resilience
      ✅ Caller/callee resilience; Does not require manual, per-endpoint config; Easier to enforce on calling services
      🚫 Not language agnostic; Less mature than others; Documentation is quite scarce
      ⚠ Harder to predict throttling; Observability
  62. 62. What did we go for in the end?
  63. 63. Resilience libraries showdown
      Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits | gRPC | Sentinel
      Retrying   | Retrying                    | 🚫      | ✅           | ✅    | 🚫                         | 👷   | 🚫
      Throttling | Timeouts                    | ✅      | ✅           | ✅    | 🚫                         | ✅   | 🚫
                 | Rate Limiting               | 🚫      | ✅           | ✅    | 🚫                         | 👷   | 🚫
                 | Circuit Breaking            | ✅      | ✅           | ✅    | 🚫                         | 👷   | ✅
                 | Adaptive Concurrency Limits | 🚫      | 👷           | 👷    | ✅                         | 👷   | 🚫
      Rejection  | Fallbacks                   | ✅      | ✅           | 🚫    | 🚫                         | 👷   | ✅
                 | Response Caching            | ✅      | ✅           | 🚫    | 🚫                         | 👷   | 🚫
  64. 64. And the winner is… Envoy 🥇🥇🥇
      Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits | gRPC | Sentinel
      Retrying   | Retrying                    | 🚫      | ✅           | ✅    | 🚫                         | 👷   | 🚫
      Throttling | Timeouts                    | ✅      | ✅           | ✅    | 🚫                         | ✅   | 🚫
                 | Rate Limiting               | 🚫      | ✅           | ✅    | 🚫                         | 👷   | 🚫
                 | Circuit Breaking            | ✅      | ✅           | ✅    | 🚫                         | 👷   | ✅
                 | Adaptive Concurrency Limits | 🚫      | 👷           | 👷    | ✅                         | 👷   | 🚫
      Rejection  | Fallbacks                   | ✅      | ✅           | 🚫    | 🚫                         | 👷   | ✅
                 | Response Caching            | ✅      | ✅           | 🚫    | 🚫                         | 👷   | 🚫
  65. 65. And the winner is… Envoy 🥇🥇🥇, but why? Reasons: ● We already have it ● Observability is super strong ● Easy to enforce across all our infrastructure ● Allows us to have thin clients ● Language agnostic And the runner up is… Resilience4j (kinda) 🥈🥈🥈 ● Allowed for retries, caching and fallbacks, but it’s up to the teams ● We discourage using request caching for the most part
  66. 66. Ask Away
  67. 67. Thank you! 🙇 🙇 🙇
