
Devoxx Ukraine 2018 "Break me if you can: practical guide to building fault-tolerant systems (with examples from REST and gRPC polyglot stacks)"


  1. Break Me If You Can: Practical Guide to Building Fault-tolerant Systems. Devoxx Ukraine, November 23, 2018. Alex Borysov, Software Engineer @ Google; Mykyta Protsenko, Software Engineer @ Netflix
  2. Who are we? Alex Borysov, Software Engineer @Google (@aiborisov); Mykyta Protsenko, Software Engineer @Netflix (@mykyta_p)
  3. Fault-Tolerance?
  4. Fault vs Error vs Failure
  5. Fault: incorrect internal state. Picture by Bob McMillan. Public domain. See slide #200 for details.
  6. Error: visibly incorrect behaviour. Picture by David Goehring. CC BY 2.0. See slide #200 for details.
  7. Failure: main functionality is broken. Picture by Camerafiend. CC BY-SA 3.0. See slide #200 for details.
  8. RMS Titanic vs Miracle on the Hudson. Willy Stöwer. Public domain. See slide #200 for details. By Greg Lam Pak Ng. CC BY 2.0. See slide #201 for details.
  9. RMS Titanic. Fault: Hitting an iceberg. Error: Water in the hull. Failure: Sinking. Willy Stöwer. Public domain. See slide #200 for details.
  10. Miracle on the Hudson. Fault: Hitting geese at 859 m. Error: Engines shut down. No Failure! By Greg Lam Pak Ng. CC BY 2.0. See slide #201 for details.
  11-12. Fault → Error → Failure
  13. Fault Tolerance: Code and Design Patterns; Product-Driven Decisions; Communication. By Greg Lam Pak Ng. CC BY 2.0. See slide #201 for details.
  14. Dodging Geese
  15. Dodging Geese #вгеймдевенасовсем (Russian hashtag, roughly "in gamedev for good")
  16-23. Dodging Geese Architecture: TOP-5, Geese Service, Clouds Service, Leaderboard Service, API Gateway. See slides ##200, 201 for licensing details.
  24. Leaderboard API (REST): /players/<username>/score {"name": "Jane", "score": 100}; /leaderboard/top/<n> [{"name": "Jane", "score": 100}, {"name": "John", "score": 50}, ...]
  25. gRPC Service Definitions: service GeeseService { /* Return next line of geese. */ rpc GetGeese (GetGeeseRequest) returns (GeeseResponse); }
  26. gRPC Service Definitions: service GeeseService { /* Return next line of geese. */ rpc GetGeese (GetGeeseRequest) returns (GeeseResponse); } service CloudsService { /* Return next line of clouds. */ rpc GetClouds (GetCloudsRequest) returns (CloudsResponse); }
  27. gRPC Gateway Service: service FixtureService { /* Return next line of geese and clouds. */ rpc GetFixture (GetFixtureRequest) returns (FixtureResponse); }
  28. gRPC Gateway Service: service FixtureService { /* Return next line of geese and clouds. */ rpc GetFixture (GetFixtureRequest) returns (FixtureResponse); } Geese + Clouds = Fixture
  29. Gateway Fixture Service: public class FixtureService extends FixtureServiceImplBase {
  30-34. Gateway Fixture Service: API Gateway, Geese Service, Clouds Service
  35. Fixture Latency = Geese Latency + Clouds Latency
  36. Non-Blocking Calls: Don't block. Send requests in parallel. Combine results when ready.
  37. Gateway Service Implementation: public class FixtureService extends FixtureServiceImplBase { private final GeeseServiceFutureStub geeseClient = ...; private final CloudsServiceFutureStub cloudsClient = ...;
  38-39. Gateway Service Implementation: public class FixtureService extends FixtureServiceImplBase { private final GeeseServiceFutureStub geeseClient = ...; private final CloudsServiceFutureStub cloudsClient = ...; @Override public void getFixture(GetFixtureRequest request, StreamObserver<FixtureResponse> response) { ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.allAsList(geese, clouds); ...
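The "send requests in parallel, combine results when ready" idea from slides 35-39 can be sketched with only the JDK. This is a minimal illustration, not the talk's gRPC code: `getGeese`, `getClouds` and the 100 ms delays are hypothetical stand-ins for the remote calls, and `CompletableFuture` replaces Guava's `ListenableFuture`.

```java
import java.util.concurrent.CompletableFuture;

final class ParallelFixtureSketch {
    // Hypothetical stand-in for the remote Geese call.
    static CompletableFuture<String> getGeese() {
        return CompletableFuture.supplyAsync(() -> slowly("geese", 100));
    }

    // Hypothetical stand-in for the remote Clouds call.
    static CompletableFuture<String> getClouds() {
        return CompletableFuture.supplyAsync(() -> slowly("clouds", 100));
    }

    // Start both calls before waiting on either, then combine the results.
    static CompletableFuture<String> getFixture() {
        CompletableFuture<String> geese = getGeese();
        CompletableFuture<String> clouds = getClouds();
        return geese.thenCombine(clouds, (g, c) -> g + "+" + c);
    }

    private static String slowly(String value, long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(getFixture().join());
    }
}
```

Because both futures are created before either is awaited, the combined latency tends toward the slower of the two calls rather than their sum.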
  41. Slow dependencies. Slow upstream services.
  42. Timeouts: Guaranteed latency for integration points
  43. Gateway Service Implementation: public class FixtureService extends FixtureServiceImplBase { ... @Override public void getFixture(GetFixtureRequest request, StreamObserver<FixtureResponse> response) { ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.allAsList(geese, clouds); ...
  44-45. Gateway Service Implementation: public class FixtureService extends FixtureServiceImplBase { ... @Override public void getFixture(GetFixtureRequest request, StreamObserver<FixtureResponse> response) { ListenableFuture<GeeseResponse> geese = geeseClient.withDeadlineAfter(500, MILLISECONDS).getGeese(toGeeseRequest(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.withDeadlineAfter(500, MILLISECONDS).getClouds(toCloudsRequest(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.allAsList(geese, clouds); ...
  46. REST: Non-Blocking Calls: CompletableFuture<List<LeaderboardEntry>> leaderboard = httpClient.get().uri("/top/5").exchange().timeout(Duration.ofMillis(500)).flatMap(cr -> cr.bodyToMono(...)).toFuture();
  47. REST: Non-Blocking Calls with Timeout: CompletableFuture<List<LeaderboardEntry>> leaderboard = httpClient.get().uri("/top/5").exchange().timeout(Duration.ofMillis(500)).flatMap(cr -> cr.bodyToMono(...)).toFuture();
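For readers without the gRPC or WebClient stack at hand, the timeout idea from slides 42-47 can be approximated with `CompletableFuture.orTimeout` (available since Java 9). This is a sketch under assumptions: `slowCall` is a hypothetical stand-in for the leaderboard request, and the string results are placeholders.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutSketch {
    // Hypothetical remote call that answers after delayMs milliseconds.
    static CompletableFuture<String> slowCall(long delayMs) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "leaderboard";
        });
    }

    // Wait for the call, but give up once timeoutMs has elapsed.
    static String callWithTimeout(long delayMs, long timeoutMs) {
        try {
            return slowCall(delayMs)
                    .orTimeout(timeoutMs, TimeUnit.MILLISECONDS)
                    .join();
        } catch (CompletionException e) {
            if (e.getCause() instanceof TimeoutException) {
                return "timed out";
            }
            throw e;
        }
    }
}
```

Note that `orTimeout` only bounds how long the caller waits; unlike a gRPC deadline, it does not tell the callee to stop working.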
  49. Demo
  50. No Geese. No Clouds. Blinking Leaderboard.
  51. Observability. Monitoring: QPS, latency, errors, ...
  52. Observability: gRPC. Monitoring: QPS, latency, errors, ... /* OpenCensus */ RpcViews.registerAllViews();
  53-54. Tracing: gRPC: GrpcTracing grpcTracing = GrpcTracing.create(...); ManagedChannelBuilder ... .intercept(grpcTracing.newClientInterceptor()).build(); ServerBuilder.forPort(8080) ... .intercept(grpcTracing.newServerInterceptor()).build();
  55. Tracing: REST. build.gradle: dependencies { compile '...:spring-cloud-sleuth-zipkin' compile '...:spring-cloud-starter-sleuth' ... } application.properties: spring.zipkin.baseUrl=http://zipkin:9411/ spring.sleuth.sampler.probability=1.0 spring.sleuth.web.enabled=true
  57. Demo
  58. Clouds are slow. Geese are fast. Entire call fails.
  59. Partial Degradation: ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.allAsList(geese, clouds); ...
  60. Partial Degradation: ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.successfulAsList(geese, clouds); ...
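The switch from `allAsList` to `successfulAsList` on slides 59-60 means one failed dependency no longer fails the whole fixture; the failed slot is simply `null`. A rough JDK-only equivalent might look like the sketch below. The helper name mirrors Guava's method for readability, but this is an illustrative reimplementation on `CompletableFuture`, not Guava's code.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

final class PartialDegradationSketch {
    // Returns results in input order; a failed call contributes null instead of failing everything.
    static <T> CompletableFuture<List<T>> successfulAsList(List<CompletableFuture<T>> calls) {
        List<CompletableFuture<T>> tolerant = calls.stream()
                .map(call -> call.exceptionally(error -> null))
                .collect(Collectors.toList());
        return CompletableFuture
                .allOf(tolerant.toArray(new CompletableFuture<?>[0]))
                .thenApply(done -> tolerant.stream()
                        .map(CompletableFuture::join)
                        .collect(Collectors.toList()));
    }
}
```

The caller then has to treat `null` entries as "this part is missing" and render the rest, for example geese without clouds.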
  62. Some L-board calls fail. L-board latency is low. Scores disappear.
  63. Retries: REST: CompletableFuture<List<Leaderboard>> request() { return httpClient.get().uri("/top/5").exchange().timeout(Duration.ofMillis(500)).flatMap(...).toFuture(); }
  64. Retries: REST: CompletableFuture<List<Leaderboard>> request() { return httpClient.get().uri("/top/5").exchange().timeout(Duration.ofMillis(500)).flatMap(...).toFuture(); } RetryPolicy RETRY_POLICY = new RetryPolicy().retryOn(IOException.class).withMaxRetries(MAX_RETRIES); CompletableFuture<List<Leaderboard>> top5 = Failsafe.with(RETRY_POLICY) ... .future(this::httpRequest);
  66. Demo
  67. Retry slow calls? Retry failed calls? Retry network faults?
  68. Retry Storm: API Gateway, Clouds Service
  69. Exponential Backoffs: new RetryPolicy().withBackoff(MIN_DELAY, MAX_DELAY, TimeUnit.MILLISECONDS, 100.0) ...
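To make the retry and backoff slides concrete without depending on Failsafe, here is a small hand-rolled sketch. Everything in it is an assumption for illustration: the `IoCall` interface, the method names, and the growth factor of 2.0 are not taken from the talk or from Failsafe's API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

final class RetryWithBackoffSketch {
    /** A call that may fail with a retryable IOException. */
    interface IoCall<T> {
        T call() throws IOException;
    }

    // Delay before retry number `attempt` (0-based): minDelay * factor^attempt, capped at maxDelay.
    static long backoffMillis(int attempt, long minDelayMs, long maxDelayMs, double factor) {
        double delay = minDelayMs * Math.pow(factor, attempt);
        return (long) Math.min((double) maxDelayMs, delay);
    }

    // Retries only on IOException, at most maxRetries times, sleeping longer after each failure.
    static <T> T retry(IoCall<T> call, int maxRetries, long minDelayMs, long maxDelayMs) {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                if (attempt >= maxRetries) {
                    throw new UncheckedIOException(e);
                }
                sleep(backoffMillis(attempt, minDelayMs, maxDelayMs, 2.0));
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }
}
```

Growing delays give an overloaded service room to recover, which is the point of the "Retry Storm" slide; production setups usually add random jitter as well.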
  70. Fallbacks: Failsafe.with(RETRY_POLICY).withFallback(() -> emptyLeaderboard()) ...
  71. Fallbacks: Failsafe.with(RETRY_POLICY).withFallback(() -> cachedLeaderboard()) ...
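The fallback on slides 70-71 swaps a failed leaderboard call for an empty or cached result. With plain `CompletableFuture` the same shape can be sketched as follows; `CACHED` and `withFallback` are hypothetical names, and the cached entries are made-up sample data.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class FallbackSketch {
    // Hypothetical cache holding the last known good leaderboard.
    static final List<String> CACHED = List.of("Jane: 100", "John: 50");

    // If the remote call fails for any reason, serve the cached leaderboard instead.
    static CompletableFuture<List<String>> withFallback(CompletableFuture<List<String>> remote) {
        return remote.exceptionally(error -> CACHED);
    }
}
```

Whether stale data or an empty list is the better fallback is a product decision as much as a technical one.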
  72. On Error: Retry, Fallback, Fail Fast
  75. High 99%ile latency. 100 requests. Error probability?
  76. High 99%ile latency. 100 requests. Error probability: 1 - 0.99^100 = 63%
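The 63% figure on slide 76 follows from treating each of the 100 requests as independently slow with probability 1%. A tiny sketch of that arithmetic (the class and method names are illustrative only):

```java
final class TailLatencySketch {
    // Probability that at least one of `requests` independent calls is slow.
    static double atLeastOneSlow(double slowProbability, int requests) {
        return 1.0 - Math.pow(1.0 - slowProbability, requests);
    }

    public static void main(String[] args) {
        // 1% slow calls, 100 requests per page.
        System.out.println(atLeastOneSlow(0.01, 100));
    }
}
```

The independence assumption is a simplification; correlated slowness (for example, one overloaded backend) changes the numbers.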
  77. Tail-Tolerance: Request. 200 ms deadline.
  78. Tail-Tolerance: Request. 200 ms deadline. ↓ 100 ms
  79. Tail-Tolerance: Request. 200 ms deadline. ↓ 100 ms. Request.
  80. Tail-Tolerance: Request. 200 ms deadline. ↓ 100 ms. Request. Fastest Response.
  81. Request Hedging: High 99%ile latency. 100 requests.
  82. Request Hedging: High 99%ile latency. 100 requests. Error probability: 63% x 0.01 < 1%
  83-84. Hedging in gRPC (soon): Channel geeseChannel = ManagedChannelBuilder.forAddress(geeseHost, geesePort).enableRetry().maxHedgedAttempts(MAX_HEDGES).build(); GeeseServiceFutureStub geeseStub = GeeseServiceGrpc.newFutureStub(geeseChannel);
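The hedging flow from slides 77-80 (send a request, wait a while, send a second one, take the fastest response) can be sketched independently of gRPC. This is a simplified illustration under assumptions: `hedged` is a hypothetical helper, it sends at most one extra attempt, and the first completion of either kind (value or error) wins.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class HedgingSketch {
    // Sends the call once; if no response arrives within hedgeDelayMs, sends it again
    // and completes with whichever attempt finishes first.
    static <T> CompletableFuture<T> hedged(Supplier<CompletableFuture<T>> call, long hedgeDelayMs) {
        CompletableFuture<T> result = new CompletableFuture<>();
        forward(call.get(), result); // primary attempt
        CompletableFuture.delayedExecutor(hedgeDelayMs, TimeUnit.MILLISECONDS).execute(() -> {
            if (!result.isDone()) {
                forward(call.get(), result); // hedged attempt
            }
        });
        return result;
    }

    // Copies the outcome of an attempt into the shared result; later outcomes are ignored.
    private static <T> void forward(CompletableFuture<T> attempt, CompletableFuture<T> result) {
        attempt.whenComplete((value, error) -> {
            if (error != null) {
                result.completeExceptionally(error);
            } else {
                result.complete(value);
            }
        });
    }
}
```

Hedging is only safe for idempotent requests, since the server may end up processing both attempts.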
  87. High mean latency. 100 requests. Error probability?
  88. High mean latency. 100 requests. Error probability: 1 - 0.50^100 = 99.99...%
  89. Circuit Breaker: CircuitBreaker CIRCUIT_BREAKER = new CircuitBreaker().withFailureThreshold(3, 5); CompletableFuture<...> top5 = Failsafe.with(CIRCUIT_BREAKER).with(RETRY_POLICY) ... .future(this::httpRequest);
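A circuit breaker stops calling a dependency that keeps failing and lets a trial request through after a cool-down. The sketch below is a deliberately simplified version that opens after a number of consecutive failures; it is not how Failsafe implements its breaker, and the slide's `withFailureThreshold(3, 5)` (a ratio-style threshold, as we read it) is not reproduced here. The clock is passed in explicitly so the behaviour is easy to follow.

```java
final class CircuitBreakerSketch {
    private final int failureThreshold; // consecutive failures that open the circuit
    private final long openMillis;      // how long to reject calls before allowing a trial
    private int consecutiveFailures;
    private long openedAt = -1;         // -1 means the circuit is closed

    CircuitBreakerSketch(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    // Closed: always allow. Open: reject until the cool-down has passed, then allow a trial call.
    synchronized boolean allowRequest(long nowMs) {
        if (openedAt < 0) {
            return true;
        }
        return nowMs - openedAt >= openMillis;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    // A failure at or past the threshold (re)opens the circuit and restarts the cool-down.
    synchronized void recordFailure(long nowMs) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = nowMs;
        }
    }
}
```

While the circuit is open, callers fail fast (or fall back) instead of queueing behind a slow dependency.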
  90-91. Error Handling. 100% Error: Fail Fast. Intermittent Slow: Hedging. Intermittent Fast: Retry. ✚ Fallback
  93. Client-driven deadline. Don't process failed calls.
  94. Deadlines: API Gateway. See slides ##200, 201 for licensing details.
  95. Deadlines: API Gateway. Deadline 200 ms →
  96. Deadlines: API Gateway. Deadline 200 ms → Spent 120 ms →
  97-98. Deadlines: API Gateway. Deadline 200 ms → Spent 120 ms → Spent 90 ms → X
  99. Deadlines Propagation: API Gateway. Deadline 200 ms →
  100. Deadlines Propagation: API Gateway. Deadline 200 ms → Spent 120 ms → Deadline 80 ms
  101. Deadlines Propagation: API Gateway. Deadline 200 ms → Spent 120 ms → Deadline 80 ms → Spent 90 ms. X
  102. Deadlines Propagation: API Gateway. Deadline 200 ms → Spent 120 ms → Deadline 80 ms → Spent 90 ms → Deadline -10 ms. X
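The numbers on slides 99-102 are simple budget arithmetic: each hop subtracts the time it spent from the deadline it received and passes the remainder on. A minimal sketch of that bookkeeping (the names are hypothetical; real gRPC deadlines are absolute points in time rather than plain millisecond counters):

```java
final class DeadlinePropagationSketch {
    // Budget left for downstream calls after this hop has spent some time.
    static long remainingMillis(long deadlineMs, long spentMs) {
        return deadlineMs - spentMs;
    }

    // A hop should skip downstream work once the budget is exhausted.
    static boolean worthCalling(long remainingMs) {
        return remainingMs > 0;
    }

    public static void main(String[] args) {
        long afterGateway = remainingMillis(200, 120);          // budget handed to the next service
        long afterService = remainingMillis(afterGateway, 90);  // that service overruns its budget
        System.out.println(afterGateway + " ms, then " + afterService + " ms");
    }
}
```

With a propagated deadline the last service can see that the budget is negative and stop, instead of computing a response nobody is waiting for.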
  104. Throughput has limits. Exceeding limits?
  105-106. REST: new ConcurrencyLimitServletFilter(new ServletLimiterBuilder().partitionByHeader("GEESE_TYPE", c -> c.assign("premium", 0.9).assign("free", 0.1)).limiter(l -> l.limit(newBuilder().initialLimit(1000)...);
  107-108. gRPC: Server: var limiter = new GrpcServerLimiterBuilder().partitionByHeader(GEESE_TYPE).partition("premium", 0.9).partition("free", 0.1).limiter(l -> l.limit(newBuilder().initialLimit(1000)...); ConcurrencyLimitServerInterceptor.newBuilder(limiter).build();
  109. gRPC: Client: new GrpcClientLimiterBuilder().limit(newBuilder().initialLimit(1000).build()).blockOnLimit(false) /* fail-fast */ .build();
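The core of a fail-fast concurrency limit can be shown with a plain `Semaphore`. This sketch uses a fixed limit and hypothetical names; as we understand it, the Netflix concurrency-limits library shown on the slides adjusts its limit dynamically, which is not modelled here.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class ConcurrencyLimitSketch {
    private final Semaphore permits;

    ConcurrencyLimitSketch(int limit) {
        this.permits = new Semaphore(limit);
    }

    // Runs the call if a permit is free; otherwise rejects immediately instead of queueing.
    <T> Optional<T> tryCall(Supplier<T> call) {
        if (!permits.tryAcquire()) {
            return Optional.empty(); // over the limit: fail fast
        }
        try {
            return Optional.of(call.get());
        } finally {
            permits.release();
        }
    }
}
```

Rejecting early keeps latency bounded for the requests that are admitted, which is the load-shedding behaviour `blockOnLimit(false)` asks for on the client slide.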
  111-112. Demo
  113. Monitoring: APM, Service metrics, Distributed tracing, Business metrics. Picture by Alex Borysov. CC BY 2.0. See slide #201 for details.
  114. Code and Design: Timeouts / Deadline Propagation; Retries / Hedging; Proper Fallbacks; Concurrency Limits; Load Shedding; Observability
  115. Request for each response. Requests don't change.
  116-117. Redundant Requests: GeeseRequest, GeeseResponse, GeeseRequest, GeeseResponse, GeeseRequest, GeeseResponse
  118. Streaming: GeeseRequest, GeeseResponse, GeeseResponse, GeeseResponse
  119. gRPC Streaming: service GeeseService { rpc GetGeese (GetGeeseRequest) returns (GeeseResponse); } service CloudsService { rpc GetClouds (GetCloudsRequest) returns (CloudsResponse); }
  120. gRPC Streaming: service GeeseService { rpc GetGeese (GetGeeseRequest) returns (stream GeeseResponse); } service CloudsService { rpc GetClouds (GetCloudsRequest) returns (stream CloudsResponse); }
  122. Server faster than client. Client cannot keep up.
  123. Too Many Streaming Responses: GeeseRequest
  124. Too Many Streaming Responses: GeeseRequest. X
  125. Flow Control: GeeseRequest
  126-127. Flow Control: GeeseRequest. 5
  128-129. Flow Control: GeeseRequest. 5, 3
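The "5" and "3" on the flow-control slides read as the client telling the server how many more responses it is ready to receive. A toy, single-threaded model of such credit-based flow control is sketched below; the class and its methods are hypothetical and are not the gRPC API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

final class FlowControlSketch {
    private final Queue<String> pending = new ArrayDeque<>(); // produced by the server, not yet sent
    private final List<String> delivered = new ArrayList<>(); // received by the client
    private int credits;                                      // responses the client is willing to accept

    // Server side: a new response is ready.
    void produce(String message) {
        pending.add(message);
        drain();
    }

    // Client side: grant permission for n more responses.
    void request(int n) {
        credits += n;
        drain();
    }

    // Send only as many responses as the client has asked for.
    private void drain() {
        while (credits > 0 && !pending.isEmpty()) {
            delivered.add(pending.poll());
            credits--;
        }
    }

    List<String> delivered() {
        return delivered;
    }

    int pendingCount() {
        return pending.size();
    }
}
```

The consumer sets the pace: a fast server fills its own buffer (or slows down) instead of overwhelming a slow client.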
  131. Decouple producer and consumer. Decouple failures.
  132. Reactive Systems: Message-driven, Elastic, Responsive, Resilient
  134. Per instance limits
  135. Door Capacity: Why didn't Rose make room for Jack on the door? Willy Stöwer. Public domain. See slide #200 for details.
  136. Door Capacity: Why didn't Rose make room for Jack on the door? "The answer is very simple because it says on page 147 that Jack dies" (James Cameron). Willy Stöwer. Public domain. See slide #200 for details.
  137-138. Capacity
  139. Autoscaling
  140. Prescaling
  141. Prescaling. See slides ##200, 202 for licensing details.
  143. Services break each other
  144. Free and Premium? Free, Premium. $ $
  145. Free and Premium Outage: Free, Premium. $ $
  146-147. Bulkheads: Free, Premium. $ $ $
  148. Bulkheads: By Request Type, By Client Priority, By Region, By Availability Zone, etc.
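One way to picture a bulkhead by client priority is a separate pool of permits per tier, so that exhausting the free tier cannot starve premium users. The sketch below is an assumption-laden illustration: the tier names come from the slides, while the class, its methods and the fixed limits are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.Semaphore;

final class BulkheadSketch {
    // One independent pool of permits per tier.
    private final Map<String, Semaphore> pools;

    BulkheadSketch(int freeLimit, int premiumLimit) {
        this.pools = Map.of(
                "free", new Semaphore(freeLimit),
                "premium", new Semaphore(premiumLimit));
    }

    // Admits a request only if its own tier still has capacity.
    boolean tryEnter(String tier) {
        Semaphore pool = pools.get(tier);
        return pool != null && pool.tryAcquire();
    }

    // Must be called once for every successful tryEnter.
    void leave(String tier) {
        pools.get(tier).release();
    }
}
```

The same partitioning idea applies to the other axes listed on slide 148, such as request type, region or availability zone.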
  150. Deployments can be risky
  151. Exploding Whale. Engineering solution: Half a ton of dynamite. Illustration by Greg Williams. CC BY 3.0. See slide #203.
  152. Exploding Whale. Engineering solution: Half a ton of dynamite. Ooops! Non-limited blast radius. Learn more at TheExplodingWhale.com. Illustration by Greg Williams. CC BY 3.0. See slide #203.
  154. Testing is hard
  155. Willy Stöwer. Public domain. See slide #200 for details.
  156. ✔ Willy Stöwer. Public domain. See slide #200 for details.
  157. ✔ ✔ Willy Stöwer. Public domain. See slide #200 for details.
  158. ✔ ✔ ✔ Willy Stöwer. Public domain. See slide #200 for details.
  159. ✔ ✔ ✔ ✔ Willy Stöwer. Public domain. See slide #200 for details.
  160-161. ✔ ✔ ✔ ✔ ✔ Willy Stöwer. Public domain. See slide #200 for details.
  167. Chaos Engineering: Break on purpose. Find problems. Prevent catastrophes. PrinciplesOfChaos.org
  169. Demo
  170. Bad user experience. Metrics are not enough.
  171. Prober: TOP-5, API Gateway
  172. Prober: TOP-5, API Gateway. See slides ##200, 202 for licensing details.
  173. Prober: Availability Latency SLO Response verification
  174. Prober: Availability Latency SLO Response verification. CloudProber.org
  178. Technical solutions are not enough
  179-180. Communication
  181-184. Communication Channels: GEESE at 270
  185. Postmortems: Blameless, Constructive
  186. Postmortems: Blameless, Constructive, Social. See slides ##200, 202, 203 for licensing details.
  187. Postmortems: Timeline, Causes, Remedies
  188. Learn from Failure: Blameless postmortems, Alert playbooks, Incident knowledge base
  190. Libraries and Tools. Demo: github.com/break-me-if-you-can; Failsafe: github.com/jhalterman/failsafe; Observability: opencensus.io, opentracing.io; Prober: cloudprober.org; Concurrency Limits: github.com/Netflix/concurrency-limits; Canaries: github.com/spinnaker/kayenta
  191. Demo UI: Yevgen Golubenko. Twitter: @HalloGene_, github.com/HalloGene. Picture by Yevgen Golubenko. Also see slide #203 for licensing details.
  192. Books
  193. Fault-Tolerance: Code & Design Patterns; Product decisions; Communication culture
  194. Please Break Me! If you can
  195. Please Break Me! If you can. Rate
  196. Please Break Me! If you can. Rate Us
  197. Please Break Me! If you can. Rate Us If you enjoyed the talk Or give feedback
  198. Please Break Me! If you can. Rate Us If you enjoyed the talk Or give feedback. 5 STARS!
  200. Images and Licensing. Images of geese, clouds, pilots, plane, arrows, cup, airport traffic control tower are property of Mykyta Protsenko and Alex Borysov, if not stated otherwise (see below). All Rights Reserved. Other images used: Slide #5: commons.wikimedia.org/wiki/File:FEMA_-_16381_-_Photograph_by_Bob_McMillan_taken_on_09-28-2005_in_Texas.jpg - Picture by Bob McMillan, the US federal government work, public domain. Slide #6: www.flickr.com/photos/carbonnyc/3290528875 - Picture by David Goehring. Attribution 2.0 Generic (CC BY 2.0): creativecommons.org/licenses/by/2.0 - changes were made. Slide #7: www.flickr.com/photos/carbonnyc/3290528875 - Picture by Camerafiend. Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0): creativecommons.org/licenses/by-sa/3.0/deed.en - no changes were made. Slides ##8, 9, 135, 136, 155-161: commons.wikimedia.org/wiki/File:Titanic_sinking,_painting_by_Willy_St%C3%B6wer.jpg - Willy Stöwer. Public domain work of art
  201. Images and Licensing. Slides ##8, 10, 13: www.flickr.com/photos/22608787@N00/3200086900 - Picture by Greg Lam Pak Ng. Attribution 2.0 Generic (CC BY 2.0): creativecommons.org/licenses/by/2.0 - no changes were made. Slides ##16-23, 30-34, 68, 77-80, 94-102, 116-118, 123-129, 137-141, 144-147, 171-172: Blue Game Boy Color by kure: piq.codeus.net/picture/31994/Blue-Game-Boy-Color - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made. Slides ##94-102: The Sun by Vinicius615: piq.codeus.net/picture/191706/The-Sun - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made. Slide #113: Picture by Alex Borysov. Attribution 2.0 Generic (CC BY 2.0): creativecommons.org/licenses/by/2.0
  202. Images and Licensing. Slide #141: piq.codeus.net/picture/254492/CVsantahat - Santa hat for CommanderVideo, CVsantahat by anonymous - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - no changes were made. Slide #172: piq.codeus.net/picture/423109/UFO - UFO by anonymous - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - no changes were made. Slides ##186, 187: piq.codeus.net/picture/334023/beer - beer by Investa - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made
  203. Images and Licensing. Slides ##186, 187: piq.codeus.net/picture/444498/Beer-Bottle - Beer Bottle by jacklrj - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made. Slide #191: https://piq.codeus.net/picture/330338/Deal-With-It - Deal With It by Shiro - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made. Slides ##151, 152: https://commons.wikimedia.org/wiki/File:Whale_WikiWorld.png - Cartoon illustration has been created by Greg Williams in cooperation with the Wikimedia Foundation - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made
