
DevNexus 2020 "Break me if you can: practical guide to building fault-tolerant systems" slides

Slides from our DevNexus 2020 talk "Break me if you can: a practical guide to building fault-tolerant systems": https://devnexus.com/presentations/4794/

Published in: Technology

  1. 1. Break Me If You Can Practical Guide to Building Fault-tolerant Systems DevNexus, Atlanta, GA February 20, 2020 Alex Borysov, Software Engineer @ Netflix Mykyta Protsenko, Software Engineer @ Netflix
  2. 2. Who are we? Alex Borysov Software Engineer @Netflix Mykyta Protsenko Software Engineer @Netflix @aiborisov @mykyta_p @WeAreNetflix
  3. 3. Fault-Tolerance? @aiborisov @mykyta_p
  4. 4. Fault vs Error vs Failure @aiborisov @mykyta_p
  5. 5. @aiborisov @mykyta_p Fault @aiborisov @mykyta_p incorrect internal state Picture by Bob McMillan. Public domain. See slides ##178-181 for details
  6. 6. @aiborisov @mykyta_p Error @aiborisov @mykyta_p visibly incorrect behaviour Picture by David Goehring. CC BY 2.0. See slides ##178-181 for details
  7. 7. @aiborisov @mykyta_p Failure @aiborisov @mykyta_p main functionality is broken Picture by Camerafiend. CC BY-SA 3.0. See slides ##178-181 for details
  8. 8. @aiborisov @mykyta_p RMS Titanic vs Miracle on the Hudson @aiborisov @mykyta_p Willy Stöwer. Public domain. See slides ##178-181 for details By Greg Lam Pak Ng. CC BY 2.0. See slides ##178-181 for details
  9. 9. @aiborisov @mykyta_p RMS Titanic @aiborisov @mykyta_p Fault: Hitting an iceberg Error: Water in the hull Failure: Sinking Willy Stöwer. Public domain. See slides ##178-181 for details
  10. 10. @aiborisov @mykyta_p Miracle on the Hudson @aiborisov @mykyta_p Fault: Hitting geese at 859 m Error: Engines shut down No Failure! By Greg Lam Pak Ng. CC BY 2.0. See slides ##178-181 for details
  11. 11. Fault Error Failure @aiborisov @mykyta_p → →
  12. 12. Fault Error Failure @aiborisov @mykyta_p → →
  13. 13. @aiborisov @mykyta_p Fault Tolerance @aiborisov @mykyta_p Code and Design Patterns Product-Driven Decisions Communication By Greg Lam Pak Ng. CC BY 2.0. See slides ##178-181 for details
  14. 14. Dodging Geese @aiborisov @mykyta_p
  15. 15. @aiborisov @mykyta_p Dodging Geese Architecture TOP-5 Geese Service Clouds Service Leaderboard Service API Gateway @aiborisov @mykyta_p See slides ##178-181 for licensing details
  16. 16. @aiborisov @mykyta_p Dodging Geese Architecture TOP-5 Geese Service Clouds Service Leaderboard Service API Gateway @aiborisov @mykyta_p
  17. 17. @aiborisov @mykyta_p Dodging Geese Architecture TOP-5 Geese Service Leaderboard Service API Gateway @aiborisov @mykyta_p Clouds Service
  18. 18. @aiborisov @mykyta_p Dodging Geese Architecture TOP-5 Leaderboard Service API Gateway @aiborisov @mykyta_p Clouds Service Geese Service
  19. 19. @aiborisov @mykyta_p Dodging Geese Architecture Geese Service Clouds Service API Gateway @aiborisov @mykyta_p TOP-5 Leaderboard Service
  20. 20. @aiborisov @mykyta_p Dodging Geese Architecture TOP-5 Geese Service Clouds Service Leaderboard Service API Gateway @aiborisov @mykyta_p
  21. 21. @aiborisov @mykyta_p Dodging Geese Architecture TOP-5 Geese Service Clouds Service Leaderboard Service API Gateway @aiborisov @mykyta_p
  22. 22. @aiborisov @mykyta_p Dodging Geese Architecture TOP-5 Geese Service Clouds Service Leaderboard Service API Gateway @aiborisov @mykyta_p
  23. 23. @aiborisov @mykyta_p Leaderboard API (REST) /players/<username>/score {"name": "Jane", "score": 100} /leaderboard/top/<n> [{"name": "Jane", "score": 100}, {"name": "John", "score": 50}, ...] @aiborisov @mykyta_p
  24. 24. @aiborisov @mykyta_p gRPC Service Definitions @aiborisov @mykyta_p service GeeseService { // Return next line of geese. rpc GetGeese (GetGeeseRequest) returns (GeeseResponse); }
  25. 25. @aiborisov @mykyta_p gRPC Service Definitions @aiborisov @mykyta_p service GeeseService { // Return next line of geese. rpc GetGeese (GetGeeseRequest) returns (GeeseResponse); } service CloudsService { // Return next line of clouds. rpc GetClouds (GetCloudsRequest) returns (CloudsResponse); }
  26. 26. @aiborisov @mykyta_p service FixtureService { // Return next line of geese and clouds. rpc GetFixture (GetFixtureRequest) returns (FixtureResponse); } gRPC Gateway Service @aiborisov @mykyta_p
  27. 27. @aiborisov @mykyta_p service FixtureService { // Return next line of geese and clouds. rpc GetFixture (GetFixtureRequest) returns (FixtureResponse); } + = Fixture gRPC Gateway Service @aiborisov @mykyta_p
  28. 28. @aiborisov @mykyta_p public class FixtureService extends FixtureServiceImplBase { Gateway Fixture Service @aiborisov @mykyta_p
  29. 29. @aiborisov @mykyta_p Gateway Fixture Service API Gateway @aiborisov @mykyta_p Geese Service Clouds Service
  30. 30. @aiborisov @mykyta_p @aiborisov @mykyta_p Non-Blocking Calls Don’t block Send requests in parallel Combine results when ready
  31. 31. @aiborisov @mykyta_p public class FixtureService extends FixtureServiceImplBase { Gateway Service Implementation @aiborisov @mykyta_p private final GeeseServiceFutureStub geeseClient = ...; private final CloudsServiceFutureStub cloudsClient = ...;
  32. 32. @aiborisov @mykyta_p public class FixtureService extends FixtureServiceImplBase { Gateway Service Implementation @aiborisov @mykyta_p private final GeeseServiceFutureStub geeseClient = ...; private final CloudsServiceFutureStub cloudsClient = ...; @Override public void getFixture(GetFixtureRequest request, StreamObserver<FixtureResponse> response) { ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.allAsList(geese, clouds); ...
  33. 33. @aiborisov @mykyta_p public class FixtureService extends FixtureServiceImplBase { Gateway Service Implementation @aiborisov @mykyta_p private final GeeseServiceFutureStub geeseClient = ...; private final CloudsServiceFutureStub cloudsClient = ...; @Override public void getFixture(GetFixtureRequest request, StreamObserver<FixtureResponse> response) { ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.allAsList(geese, clouds); ...
  35. 35. @aiborisov @mykyta_p @aiborisov @mykyta_p Slow dependencies Slow upstream services
  36. 36. @aiborisov @mykyta_p @aiborisov @mykyta_p Timeouts Guaranteed latency for integration points
  37. 37. @aiborisov @mykyta_p public class FixtureService extends FixtureServiceImplBase { ... Gateway Service Implementation @aiborisov @mykyta_p @Override public void getFixture(GetFixtureRequest request, StreamObserver<FixtureResponse> response) { ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.allAsList(geese, clouds); ...
  38. 38. @aiborisov @mykyta_p public class FixtureService extends FixtureServiceImplBase { ... Gateway Service Implementation @aiborisov @mykyta_p @Override public void getFixture(GetFixtureRequest request, StreamObserver<FixtureResponse> response) { ListenableFuture<GeeseResponse> geese = geeseClient.withDeadlineAfter(500, MILLISECONDS).getGeese(toGeeseRequest(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.withDeadlineAfter(500, MILLISECONDS).getClouds(toCloudsRequest(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.allAsList(geese, clouds); ...
  39. 39. @aiborisov @mykyta_p @Override public void getFixture(GetFixtureRequest request, StreamObserver<FixtureResponse> response) { ListenableFuture<GeeseResponse> geese = geeseClient.withDeadlineAfter(500, MILLISECONDS).getGeese(toGeeseRequest(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.withDeadlineAfter(500, MILLISECONDS).getClouds(toCloudsRequest(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.allAsList(geese, clouds); ... public class FixtureService extends FixtureServiceImplBase { ... Gateway Service Implementation @aiborisov @mykyta_p
  40. 40. @aiborisov @mykyta_p REST: Non-Blocking Calls CompletableFuture<List<LeaderboardEntry>> leaderboard = httpClient .get().uri("/top/5") .exchange() .timeout(Duration.ofMillis(500)) .flatMap(cr -> cr.bodyToMono(...)) .toFuture(); @aiborisov @mykyta_p
  41. 41. @aiborisov @mykyta_p REST: Non-Blocking Calls with Timeout CompletableFuture<List<LeaderboardEntry>> leaderboard = httpClient .get().uri("/top/5") .exchange() .timeout(Duration.ofMillis(500)) .flatMap(cr -> cr.bodyToMono(...)) .toFuture(); @aiborisov @mykyta_p
  43. 43. Demo @aiborisov @mykyta_p
  44. 44. @aiborisov @mykyta_p @aiborisov @mykyta_p No Geese No Clouds Blinking Leaderboard
  45. 45. @aiborisov @mykyta_p @aiborisov @mykyta_p Observability Monitoring: QPS, latency, errors, ...
  46. 46. @aiborisov @mykyta_p @aiborisov @mykyta_p Observability: gRPC Monitoring: QPS, latency, errors, ... // OpenCensus RpcViews.registerAllViews();
  47. 47. @aiborisov @mykyta_p @aiborisov @mykyta_p Tracing: gRPC GrpcTracing grpcTracing = GrpcTracing.create(...); ManagedChannelBuilder ... .intercept(grpcTracing.newClientInterceptor()) .build(); ServerBuilder.forPort(8080) ... .intercept(grpcTracing.newServerInterceptor()) .build();
  48. 48. @aiborisov @mykyta_p @aiborisov @mykyta_p Tracing: gRPC GrpcTracing grpcTracing = GrpcTracing.create(...); ManagedChannelBuilder ... .intercept(grpcTracing.newClientInterceptor()) .build(); ServerBuilder.forPort(8080) ... .intercept(grpcTracing.newServerInterceptor()) .build();
  49. 49. @aiborisov @mykyta_p @aiborisov @mykyta_p Tracing: REST build.gradle: dependencies { compile '...:spring-cloud-sleuth-zipkin' compile '...:spring-cloud-starter-sleuth' ... } application.properties: spring.zipkin.baseUrl=http://zipkin:9411/ spring.sleuth.sampler.probability=1.0 spring.sleuth.web.enabled=true
  51. 51. Demo @aiborisov @mykyta_p
  52. 52. @aiborisov @mykyta_p @aiborisov @mykyta_p Clouds are slow Geese are fast Entire call fails
  53. 53. @aiborisov @mykyta_p ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.allAsList(geese, clouds); ... @aiborisov @mykyta_p Partial Degradation
  54. 54. @aiborisov @mykyta_p @aiborisov @mykyta_p Partial Degradation ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request)); ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request)); ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds = Futures.successfulAsList(geese, clouds); ...
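The pattern in slides 30-39 and 53-54 (parallel non-blocking calls, a per-call timeout, and partial degradation) can be sketched with nothing but the JDK. This is a minimal illustration and not the talk's gRPC/Guava code: `FixtureSketch`, `remoteCall`, and `guarded` are hypothetical names, and the simulated delays stand in for real network calls.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

final class FixtureSketch {
    /** Combined result; an empty list means that dependency did not answer in time. */
    static final class Fixture {
        final List<String> geese;
        final List<String> clouds;

        Fixture(List<String> geese, List<String> clouds) {
            this.geese = geese;
            this.clouds = clouds;
        }
    }

    /** Hypothetical stand-in for a remote call that answers after delayMillis. */
    static CompletableFuture<List<String>> remoteCall(String item, long delayMillis) {
        return CompletableFuture.supplyAsync(
                () -> List.of(item),
                CompletableFuture.delayedExecutor(delayMillis, TimeUnit.MILLISECONDS));
    }

    /** Bounds the call with a timeout and degrades to an empty list on any failure. */
    static CompletableFuture<List<String>> guarded(
            CompletableFuture<List<String>> call, long timeoutMillis) {
        return call
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                .exceptionally(error -> List.of());
    }

    static CompletableFuture<Fixture> getFixture(
            long geeseDelayMillis, long cloudsDelayMillis, long timeoutMillis) {
        // Both calls start before either result is awaited, so they run in parallel.
        CompletableFuture<List<String>> geese =
                guarded(remoteCall("goose", geeseDelayMillis), timeoutMillis);
        CompletableFuture<List<String>> clouds =
                guarded(remoteCall("cloud", cloudsDelayMillis), timeoutMillis);
        return geese.thenCombine(clouds, Fixture::new);
    }

    public static void main(String[] args) {
        // Clouds are slow, geese are fast: the fixture still arrives, just without clouds.
        Fixture fixture = getFixture(10, 3_000, 300).join();
        System.out.println("geese=" + fixture.geese + " clouds=" + fixture.clouds);
    }
}
```

Degrading each dependency separately plays the role that `Futures.successfulAsList` plays on slide 54: one slow service no longer fails the whole response.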
  56. 56. @aiborisov @mykyta_p @aiborisov @mykyta_p Some L-board calls fail L-board latency is low Scores disappear
  57. 57. @aiborisov @mykyta_p CompletableFuture<List<Leaderboard>> request() { return httpClient .get().uri("/top/5").exchange() .timeout(Duration.ofMillis(500)) .flatMap(...).toFuture(); } @aiborisov @mykyta_p Retries: REST
  58. 58. @aiborisov @mykyta_p CompletableFuture<List<Leaderboard>> request() { return httpClient .get().uri("/top/5").exchange() .timeout(Duration.ofMillis(500)) .flatMap(...).toFuture(); } RetryPolicy RETRY_POLICY = new RetryPolicy() .retryOn(IOException.class) .withMaxRetries(MAX_RETRIES); CompletableFuture<List<Leaderboard>> top5 = Failsafe.with(RETRY_POLICY) ... .future(this::request); @aiborisov @mykyta_p Retries: REST
  60. 60. Demo @aiborisov @mykyta_p
  61. 61. @aiborisov @mykyta_p @aiborisov @mykyta_p Retry slow calls? Retry failed calls? Retry network faults?
  62. 62. @aiborisov @mykyta_p Retry Storm Clouds Service API Gateway @aiborisov @mykyta_p
  63. 63. @aiborisov @mykyta_p new RetryPolicy() .withBackoff( MIN_DELAY, MAX_DELAY, TimeUnit.MILLISECONDS, 100.0) ... ... @aiborisov @mykyta_p Exponential Backoffs
  64. 64. @aiborisov @mykyta_p Failsafe .with(RETRY_POLICY) .withFallback( () -> emptyLeaderboard()) ... @aiborisov @mykyta_p Fallbacks
  65. 65. @aiborisov @mykyta_p Failsafe .with(RETRY_POLICY) .withFallback( () -> cachedLeaderboard()) ... @aiborisov @mykyta_p Fallbacks
  66. 66. @aiborisov @mykyta_p Retry Fallback Fail Fast @aiborisov @mykyta_p On Error
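The retry, exponential backoff, and fallback steps from slides 57-66 can be sketched in plain Java. This is an illustration of the idea only, not the Failsafe API used on the slides: `RetrySketch`, `IoCall`, `callWithRetries`, and `backoffMillis` are hypothetical names, and the sketch blocks the calling thread between attempts for simplicity.

```java
import java.io.IOException;
import java.util.function.Supplier;

final class RetrySketch {
    /** A call that may fail with an I/O error, the only failure this sketch retries. */
    interface IoCall<T> {
        T call() throws IOException;
    }

    /** Delay before the given retry (1-based): minDelay * factor^(retry - 1), capped at maxDelay. */
    static long backoffMillis(int retry, long minDelayMillis, long maxDelayMillis, double factor) {
        double delay = minDelayMillis * Math.pow(factor, retry - 1);
        return (long) Math.min(delay, (double) maxDelayMillis);
    }

    static <T> T callWithRetries(IoCall<T> call, int maxRetries, long minDelayMillis,
                                 long maxDelayMillis, double factor, Supplier<T> fallback) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (attempt > 0
                    && !sleep(backoffMillis(attempt, minDelayMillis, maxDelayMillis, factor))) {
                break; // interrupted while backing off: stop retrying
            }
            try {
                return call.call();
            } catch (IOException e) {
                // Retryable failure: loop around for the next attempt, if any are left.
            }
        }
        return fallback.get(); // retries exhausted, e.g. serve a cached leaderboard
    }

    private static boolean sleep(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        String result = RetrySketch.<String>callWithRetries(
                () -> { throw new IOException("leaderboard is down"); },
                3, 10, 80, 2.0, () -> "cached leaderboard");
        System.out.println(result);
    }
}
```

Growing the delay between attempts is what keeps many clients from hammering a struggling service at the same moment, which is the retry storm shown on slide 62.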
  69. 69. @aiborisov @mykyta_p @aiborisov @mykyta_p High 99%ile latency 100 requests Error probability?
  70. 70. @aiborisov @mykyta_p @aiborisov @mykyta_p High 99%ile latency 100 requests Error probability: 1 – 0.99^100 = 63%
  71. 71. @aiborisov @mykyta_p Tail-Tolerance @aiborisov @mykyta_p Request 200 ms deadline
  72. 72. @aiborisov @mykyta_p Tail-Tolerance @aiborisov @mykyta_p Request 200 ms deadline ↓ 100 ms
  73. 73. @aiborisov @mykyta_p Tail-Tolerance @aiborisov @mykyta_p Request 200 ms deadline ↓ 100 ms Request
  74. 74. @aiborisov @mykyta_p Tail-Tolerance @aiborisov @mykyta_p Request 200 ms deadline ↓ 100 ms Request Fastest Response
  75. 75. @aiborisov @mykyta_p High 99%ile latency 100 requests @aiborisov @mykyta_p Request Hedging
  76. 76. @aiborisov @mykyta_p High 99%ile latency 100 requests Error probability: 63% x 0.01 < 1% @aiborisov @mykyta_p Request Hedging
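The numbers on slides 70 and 76 can be checked with a few lines of arithmetic. The sketch below assumes that calls are independent and that a hedged call is slow only when both of its attempts are slow; `TailLatencyMath` and its method names are made up for this example. Under these assumptions the hedged figure comes out slightly different from the slide's quick estimate, but it supports the same conclusion of less than 1%.

```java
final class TailLatencyMath {
    /** Probability that at least one of n independent calls is slow, given per-call probability p. */
    static double atLeastOneSlow(double p, int n) {
        return 1.0 - Math.pow(1.0 - p, n);
    }

    /** With one hedge per call, a call is slow only if both attempts are slow: p * p. */
    static double atLeastOneSlowWithHedging(double p, int n) {
        return atLeastOneSlow(p * p, n);
    }

    public static void main(String[] args) {
        // 1% of calls are slow (99th percentile) and a page needs 100 calls.
        System.out.printf("no hedging:   %.4f%n", atLeastOneSlow(0.01, 100));
        System.out.printf("with hedging: %.4f%n", atLeastOneSlowWithHedging(0.01, 100));
    }
}
```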
  77. 77. @aiborisov @mykyta_p In gRPC service config "hedgingPolicy": { "maxAttempts": 3, "hedgingDelay": "100ms" } @aiborisov @mykyta_p Hedging in gRPC
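For code paths where no library provides hedging, the idea from slides 71-77 can be sketched by hand: send the request, and if no response has arrived after the hedging delay, send a second copy and take whichever succeeds first. `HedgingSketch`, `hedged`, and `delayed` are hypothetical names, the sketch is limited to two attempts, and it does not cancel the losing attempt, which a production implementation would want to do.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class HedgingSketch {
    private static final int MAX_ATTEMPTS = 2;

    /** Hypothetical stand-in for a remote call that answers with value after delayMillis. */
    static <T> CompletableFuture<T> delayed(T value, long delayMillis) {
        return CompletableFuture.supplyAsync(
                () -> value,
                CompletableFuture.delayedExecutor(delayMillis, TimeUnit.MILLISECONDS));
    }

    static <T> CompletableFuture<T> hedged(
            Supplier<CompletableFuture<T>> call, long hedgingDelayMillis) {
        CompletableFuture<T> result = new CompletableFuture<>();
        AtomicInteger failures = new AtomicInteger();
        Runnable attempt = () -> call.get().whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value); // first success wins; later ones are ignored
            } else if (failures.incrementAndGet() == MAX_ATTEMPTS) {
                result.completeExceptionally(error); // every attempt failed
            }
        });
        attempt.run(); // primary request
        CompletableFuture.delayedExecutor(hedgingDelayMillis, TimeUnit.MILLISECONDS)
                .execute(() -> {
                    if (!result.isDone()) {
                        attempt.run(); // hedge only if the primary is still outstanding
                    }
                });
        return result;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        String winner = HedgingSketch.<String>hedged(
                () -> calls.incrementAndGet() == 1 ? delayed("slow", 2_000) : delayed("fast", 10),
                100).join();
        System.out.println(winner + " after " + calls.get() + " attempts");
    }
}
```

Hedging only makes sense for requests that are safe to execute more than once, since the server may end up processing both copies.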
  80. 80. @aiborisov @mykyta_p @aiborisov @mykyta_p High mean latency 100 requests Error probability?
  81. 81. @aiborisov @mykyta_p @aiborisov @mykyta_p High mean latency 100 requests Error probability: 1 – 0.50^100 = 99.99...%
  82. 82. @aiborisov @mykyta_p CircuitBreaker CIRCUIT_BREAKER = new CircuitBreaker() .withFailureThreshold(3, 5); CompletableFuture<...> top5 = Failsafe .with(CIRCUIT_BREAKER) .with(RETRY_POLICY) ... .future(this::httpRequest); @aiborisov @mykyta_p Circuit Breaker
  83. 83. @aiborisov @mykyta_p @aiborisov @mykyta_p Error Handling 100% Error Fail Fast Intermittent Slow Hedging Intermittent Fast Retry Fallback✚
  84. 84. @aiborisov @mykyta_p @aiborisov @mykyta_p Error Handling 100% Error Fail Fast Intermittent Slow Hedging Intermittent Fast Retry Fallback✚
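The fail-fast behaviour behind slide 82 can be sketched as a small state machine. This is a simplified illustration rather than Failsafe's `CircuitBreaker`: it opens once at least `failureThreshold` of the last `window` calls have failed, and it leaves out the half-open state and the reset delay that a real breaker needs. `CircuitBreakerSketch` is a hypothetical name and the class is not thread-safe.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

final class CircuitBreakerSketch {
    private final int failureThreshold;
    private final int window;
    private final Deque<Boolean> recentFailures = new ArrayDeque<>();
    private boolean open;

    CircuitBreakerSketch(int failureThreshold, int window) {
        this.failureThreshold = failureThreshold;
        this.window = window;
    }

    boolean isOpen() {
        return open;
    }

    <T> T call(Supplier<T> call, Supplier<T> fallback) {
        if (open) {
            return fallback.get(); // fail fast: do not touch the broken dependency
        }
        try {
            T value = call.get();
            record(false);
            return value;
        } catch (RuntimeException e) {
            record(true);
            return fallback.get();
        }
    }

    /** Remembers the outcome and opens the breaker once the window holds enough failures. */
    private void record(boolean failed) {
        recentFailures.addLast(failed);
        if (recentFailures.size() > window) {
            recentFailures.removeFirst();
        }
        long failures = recentFailures.stream().filter(f -> f).count();
        if (failures >= failureThreshold) {
            open = true;
        }
    }

    public static void main(String[] args) {
        CircuitBreakerSketch breaker = new CircuitBreakerSketch(3, 5);
        for (int i = 0; i < 4; i++) {
            String result = breaker.<String>call(
                    () -> { throw new IllegalStateException("leaderboard is down"); },
                    () -> "cached leaderboard");
            System.out.println(result + " (open=" + breaker.isOpen() + ")");
        }
    }
}
```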
  86. 86. @aiborisov @mykyta_p @aiborisov @mykyta_p Client-driven deadline Don’t process failed calls
  87. 87. @aiborisov @mykyta_p Deadlines API Gateway @aiborisov @mykyta_p See slides ##178-181 for licensing details
  88. 88. @aiborisov @mykyta_p Deadlines API Gateway @aiborisov @mykyta_p Deadline 500 ms →
  89. 89. @aiborisov @mykyta_p Deadlines API Gateway @aiborisov @mykyta_p Deadline 500 ms → Spent 300 ms →
  90. 90. @aiborisov @mykyta_p Deadlines API Gateway @aiborisov @mykyta_p Spent 300 ms → Spent 250 ms Deadline 500 ms → X
  91. 91. @aiborisov @mykyta_p Deadlines API Gateway @aiborisov @mykyta_p Spent 300 ms → Spent 250 ms Deadline 500 ms → X →
  92. 92. @aiborisov @mykyta_p Deadline Propagation API Gateway @aiborisov @mykyta_p Deadline 500 ms →
  93. 93. @aiborisov @mykyta_p Deadline 200 ms Deadline Propagation API Gateway @aiborisov @mykyta_p Deadline 500 ms → Spent 300 ms →
  94. 94. @aiborisov @mykyta_p Deadline 200 ms Deadline Propagation API Gateway @aiborisov @mykyta_p Spent 300 ms → Spent 250 ms Deadline 500 ms → X
  95. 95. @aiborisov @mykyta_p Deadline 200 ms Deadline Propagation API Gateway @aiborisov @mykyta_p Spent 300 ms → Spent 250 ms Deadline -50 ms Deadline 500 ms → X
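The arithmetic on slides 92-95 can be sketched as follows: the gateway fixes an absolute expiry time once, and every hop computes its remaining budget from it instead of starting a fresh timeout. `DeadlineSketch` is a hypothetical class and the injected clock exists only to make the example deterministic; it is not the gRPC deadline API.

```java
import java.util.function.LongSupplier;

final class DeadlineSketch {
    private final long expiresAtMillis;
    private final LongSupplier clock;

    private DeadlineSketch(long expiresAtMillis, LongSupplier clock) {
        this.expiresAtMillis = expiresAtMillis;
        this.clock = clock;
    }

    /** Creates a deadline that expires budgetMillis after the clock's current time. */
    static DeadlineSketch after(long budgetMillis, LongSupplier clock) {
        return new DeadlineSketch(clock.getAsLong() + budgetMillis, clock);
    }

    /** Budget left for downstream calls; negative once the deadline has passed. */
    long remainingMillis() {
        return expiresAtMillis - clock.getAsLong();
    }

    /** A service should skip the work entirely when the caller has already given up. */
    boolean isExpired() {
        return remainingMillis() <= 0;
    }

    public static void main(String[] args) {
        long[] now = {0}; // fake clock, in milliseconds
        DeadlineSketch deadline = after(500, () -> now[0]);
        now[0] += 300; // the first service spends 300 ms
        System.out.println(deadline.remainingMillis()); // 200
        now[0] += 250; // the second service spends 250 ms
        System.out.println(deadline.remainingMillis() + " " + deadline.isExpired()); // -50 true
    }
}
```

Checking `isExpired()` before calling the next service is what prevents the wasted work shown on slides 90-91, where a downstream call continues after the client has already timed out.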
  97. 97. @aiborisov @mykyta_p @aiborisov @mykyta_p Throughput has limits Exceeding limits?
  98. 98. @aiborisov @mykyta_p new ConcurrencyLimitServletFilter( new ServletLimiterBuilder() .partitionByHeader("GEESE_TYPE", c -> c.assign("premium", 0.9) .assign("free", 0.1)) .limiter(l -> l.limit( newBuilder() .initialLimit(1000)...); @aiborisov @mykyta_p REST
  99. 99. @aiborisov @mykyta_p new ConcurrencyLimitServletFilter( new ServletLimiterBuilder() .partitionByHeader("GEESE_TYPE", c -> c.assign("premium", 0.9) .assign("free", 0.1)) .limiter(l -> l.limit( newBuilder() .initialLimit(1000)...); @aiborisov @mykyta_p REST
  100. 100. @aiborisov @mykyta_p var limiter = new GrpcServerLimiterBuilder() .partitionByHeader(GEESE_TYPE) .partition("premium", 0.9) .partition("free", 0.1) .limiter(l -> l.limit( newBuilder() .initialLimit(1000)...); ConcurrencyLimitServerInterceptor .newBuilder(limiter).build(); @aiborisov @mykyta_p gRPC: Server
  101. 101. @aiborisov @mykyta_p var limiter = new GrpcServerLimiterBuilder() .partitionByHeader(GEESE_TYPE) .partition("premium", 0.9) .partition("free", 0.1) .limiter(l -> l.limit( newBuilder() .initialLimit(1000)...); ConcurrencyLimitServerInterceptor .newBuilder(limiter).build(); @aiborisov @mykyta_p gRPC: Server
  102. 102. @aiborisov @mykyta_p new GrpcClientLimiterBuilder() .limit( newBuilder() .initialLimit(1000).build()) .blockOnLimit(false) // fail-fast .build(); @aiborisov @mykyta_p gRPC: Client
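The partitioning shown on slides 98-102 can be illustrated with one semaphore per partition. This sketch uses a fixed limit and is therefore much simpler than the Netflix concurrency-limits library, which adjusts the limit at runtime; `PartitionedLimiterSketch` and its methods are hypothetical names. The fail-fast rejection corresponds to `blockOnLimit(false)` on slide 102.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class PartitionedLimiterSketch {
    private final Map<String, Semaphore> partitions = new HashMap<>();

    /** Splits totalLimit between partitions, e.g. premium=0.9 and free=0.1. */
    PartitionedLimiterSketch(int totalLimit, Map<String, Double> shares) {
        shares.forEach((name, share) ->
                partitions.put(name, new Semaphore((int) Math.round(totalLimit * share))));
    }

    /** Runs the call if the partition has a free slot; otherwise fails fast with rejected. */
    <T> T call(String partition, Supplier<T> call, T rejected) {
        Semaphore slots = partitions.get(partition);
        if (slots == null || !slots.tryAcquire()) {
            return rejected; // shed load instead of queueing
        }
        try {
            return call.get();
        } finally {
            slots.release();
        }
    }

    int availableSlots(String partition) {
        return partitions.get(partition).availablePermits();
    }

    public static void main(String[] args) {
        Map<String, Double> shares = new HashMap<>();
        shares.put("premium", 0.9);
        shares.put("free", 0.1);
        PartitionedLimiterSketch limiter = new PartitionedLimiterSketch(1000, shares);
        System.out.println("premium slots: " + limiter.availableSlots("premium"));
        System.out.println("free slots:    " + limiter.availableSlots("free"));
    }
}
```

Because each partition owns its slots, a flood of free-tier traffic can exhaust only its own share, which is also the bulkhead idea from slides 143-145.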
  104. 104. Demo @aiborisov @mykyta_p
  105. 105. Demo @aiborisov @mykyta_p
  106. 106. @aiborisov @mykyta_p Monitoring @aiborisov @mykyta_p APM Service metrics Distributed tracing Business metrics Picture by Alex Borysov. CC BY 2.0. See slides ##178-181 for details
  107. 107. @aiborisov @mykyta_p @aiborisov @mykyta_p Code and Design Timeouts / Deadline Propagation Retries / Hedging Proper Fallbacks Concurrency Limits Load Shedding Observability
  108. 108. @aiborisov @mykyta_p @aiborisov @mykyta_p Request for each response Requests don’t change
  109. 109. @aiborisov @mykyta_p Redundant Requests @aiborisov @mykyta_p GeeseRequest GeeseResponse GeeseRequest GeeseResponse GeeseRequest GeeseResponse
  110. 110. @aiborisov @mykyta_p Redundant Requests @aiborisov @mykyta_p GeeseRequest GeeseResponse GeeseRequest GeeseResponse GeeseRequest GeeseResponse
  111. 111. @aiborisov @mykyta_p Streaming @aiborisov @mykyta_p GeeseRequest GeeseResponse GeeseResponse GeeseResponse
  112. 112. @aiborisov @mykyta_p service GeeseService { rpc GetGeese (GetGeeseRequest) returns (GeeseResponse); } service CloudsService { rpc GetClouds (GetCloudsRequest) returns (CloudsResponse); } @aiborisov @mykyta_p gRPC Streaming
  113. 113. @aiborisov @mykyta_p service GeeseService { rpc GetGeese (GetGeeseRequest) returns (stream GeeseResponse); } service CloudsService { rpc GetClouds (GetCloudsRequest) returns (stream CloudsResponse); } @aiborisov @mykyta_p gRPC Streaming
  115. 115. @aiborisov @mykyta_p @aiborisov @mykyta_p Server faster than client Client cannot keep up
  116. 116. @aiborisov @mykyta_p Too Many Streaming Responses @aiborisov @mykyta_p GeeseRequest
  117. 117. @aiborisov @mykyta_p Too Many Streaming Responses @aiborisov @mykyta_p GeeseRequest X
  118. 118. @aiborisov @mykyta_p Flow Control @aiborisov @mykyta_p GeeseRequest
  119. 119. @aiborisov @mykyta_p Flow Control @aiborisov @mykyta_p GeeseRequest 5
  120. 120. @aiborisov @mykyta_p Flow Control @aiborisov @mykyta_p GeeseRequest 5
  121. 121. @aiborisov @mykyta_p Flow Control @aiborisov @mykyta_p GeeseRequest 5 3
  122. 122. @aiborisov @mykyta_p Flow Control @aiborisov @mykyta_p GeeseRequest 5 3
  124. 124. @aiborisov @mykyta_p @aiborisov @mykyta_p Decouple producer and consumer Decouple failures
  125. 125. @aiborisov @mykyta_p Message-driven Elastic Responsive Resilient @aiborisov @mykyta_p Reactive Systems
  127. 127. @aiborisov @mykyta_p @aiborisov @mykyta_p Per instance limits
  128. 128. @aiborisov @mykyta_p Door Capacity @aiborisov @mykyta_p Why didn’t Rose make room for Jack on the door? Willy Stöwer. Public domain. See slides ##178-181 for details
  129. 129. @aiborisov @mykyta_p Door Capacity @aiborisov @mykyta_p
  130. 130. @aiborisov @mykyta_p Door Capacity @aiborisov @mykyta_p
  131. 131. @aiborisov @mykyta_p Door Capacity @aiborisov @mykyta_p
  132. 132. @aiborisov @mykyta_p Door Capacity @aiborisov @mykyta_p
  133. 133. @aiborisov @mykyta_p Door Capacity @aiborisov @mykyta_p Why didn’t Rose make room for Jack on the door? “The answer is very simple because it says on page 147 that Jack dies.” James Cameron Willy Stöwer. Public domain. See slides ##178-181 for details
  134. 134. @aiborisov @mykyta_p Capacity @aiborisov @mykyta_p
  135. 135. @aiborisov @mykyta_p Capacity @aiborisov @mykyta_p
  136. 136. @aiborisov @mykyta_p Autoscaling @aiborisov @mykyta_p
  137. 137. @aiborisov @mykyta_p Prescaling @aiborisov @mykyta_p
  138. 138. @aiborisov @mykyta_p Prescaling @aiborisov @mykyta_p See slides ##178-181 for licensing details.
  140. 140. @aiborisov @mykyta_p @aiborisov @mykyta_p Services break each other
  141. 141. @aiborisov @mykyta_p $ Free and Premium? Free Premium $
  142. 142. @aiborisov @mykyta_p Free and Premium Outage Free Premium $ $
  143. 143. @aiborisov @mykyta_p $ $ Bulkheads Free Premium $
  144. 144. @aiborisov @mykyta_p Bulkheads Free Premium $ $ $
  145. 145. @aiborisov @mykyta_p @aiborisov @mykyta_p Bulkheads By Request Type By Client Priority By Region By Availability Zone etc
  147. 147. @aiborisov @mykyta_p @aiborisov @mykyta_p Deployments can be risky
  148. 148. @aiborisov @mykyta_p Exploding Whale Engineering solution Half a ton of dynamite @aiborisov @mykyta_p Illustration by Greg Williams. CC BY 3.0. See slides ##178-181
  149. 149. @aiborisov @mykyta_p Exploding Whale Engineering solution Half a ton of dynamite Ooops! Non-limited blast radius Learn more at TheExplodingWhale.com @aiborisov @mykyta_p Illustration by Greg Williams. CC BY 3.0. See slides ##178-181
  151. 151. Demo @aiborisov @mykyta_p
  152. 152. @aiborisov @mykyta_p @aiborisov @mykyta_p Bad user experience Metrics are not enough
  153. 153. @aiborisov @mykyta_p Prober TOP-5 API Gateway @aiborisov @mykyta_p
  154. 154. @aiborisov @mykyta_p Prober TOP-5 API Gateway @aiborisov @mykyta_p See slides ##178-181 for licensing details.
  155. 155. @aiborisov @mykyta_p @aiborisov @mykyta_p Prober Availability Latency SLO Response verification
  156. 156. @aiborisov @mykyta_p @aiborisov @mykyta_p Prober Availability Latency SLO Response verification CloudProber.org
  160. 160. @aiborisov @mykyta_p @aiborisov @mykyta_p Technical solutions are not enough
  161. 161. @aiborisov @mykyta_p Communication @aiborisov @mykyta_p
  162. 162. @aiborisov @mykyta_p Communication @aiborisov @mykyta_p
  163. 163. @aiborisov @mykyta_p Communication Channels @aiborisov @mykyta_p GEESE at 270
  164. 164. @aiborisov @mykyta_p Communication Channels @aiborisov @mykyta_p GEESE at 270
  165. 165. @aiborisov @mykyta_p GEESE at 270 Communication Channels @aiborisov @mykyta_p
  166. 166. @aiborisov @mykyta_p GEESE at 270 Communication Channels @aiborisov @mykyta_p
  167. 167. @aiborisov @mykyta_p Postmortems @aiborisov @mykyta_p Blameless Constructive
  168. 168. @aiborisov @mykyta_p Postmortems @aiborisov @mykyta_p Blameless Constructive Social See slides ##178-181 for licensing details
  169. 169. @aiborisov @mykyta_p Postmortems @aiborisov @mykyta_p Timeline Causes Remedies
  170. 170. @aiborisov @mykyta_p @aiborisov @mykyta_p Learn from Failure Blameless postmortems Alert playbooks Incident knowledge base
  172. 172. @aiborisov @mykyta_p Libraries and Tools @aiborisov @mykyta_p Demo: github.com/break-me-if-you-can Slides: slides-devnexus.breakit.xyz Failsafe: github.com/jhalterman/failsafe Observability: opencensus.io, opentracing.io Prober: cloudprober.org Concurrency Limits: github.com/Netflix/concurrency-limits
  173. 173. @aiborisov @mykyta_p Demo UI @HalloGene_ Yevgen Golubenko Twitter: @HalloGene_ github.com/HalloGene Picture by Yevgen Golubenko. Also see slides ##178-181 for licensing details
  174. 174. @aiborisov @mykyta_p Books @aiborisov @mykyta_p
  175. 175. @aiborisov @mykyta_p @aiborisov @mykyta_p Fault-Tolerance Code & Design Patterns Product decisions Communication culture
  177. 177. @aiborisov @mykyta_p AMA @ the Netflix Booth (#13) @aiborisov @mykyta_p Thursday, Feb 20th 12:50 PM ask Mykyta Protsenko & Alex Borysov anything about fault-tolerance 3:10 PM ask Nadav Cohen anything about developer tooling Friday, Feb 21st 10:00 AM ask Philip Fisher-Ogden anything about Netflix’s streaming architecture 12:50 PM ask Sangeeta Narayanan anything about cloud native services 3:10 PM ask Vinod Viswanathan anything about media processing
  178. 178. @aiborisov @mykyta_p Images and Licensing Images of geese, clouds, pilots, plane, arrows, cup, airport traffic control tower are property of Mykyta Protsenko and Alex Borysov, if not stated otherwise (see below). All Rights Reserved. Other images used: commons.wikimedia.org/wiki/File:FEMA_-_16381_-_Photograph_by_Bob_McMillan_taken_on_09-28-2005_in_Texas.jpg - Picture by Bob McMillan, the US federal government work, public domain www.flickr.com/photos/carbonnyc/3290528875 - Picture by David Goehring. Attribution 2.0 Generic (CC BY 2.0): creativecommons.org/licenses/by/2.0 - changes were made www.flickr.com/photos/carbonnyc/3290528875 - Picture by Camerafiend. Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0): creativecommons.org/licenses/by-sa/3.0/deed.en - no changes were made commons.wikimedia.org/wiki/File:Titanic_sinking,_painting_by_Willy_St%C3%B6wer.jpg - Willy Stöwer. Public domain work of art
  179. 179. @aiborisov @mykyta_p Images and Licensing www.flickr.com/photos/22608787@N00/3200086900 - Picture by Greg Lam Pak Ng. Attribution 2.0 Generic (CC BY 2.0): creativecommons.org/licenses/by/2.0 - no changes were made piq.codeus.net/picture/31994/Blue-Game-Boy-Color - Blue Game Boy Color by kure - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made piq.codeus.net/picture/191706/The-Sun - The Sun by Vinicius615 - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made Slide #106: - Picture by Alex Borysov. Attribution 2.0 Generic (CC BY 2.0): creativecommons.org/licenses/by/2.0
  180. 180. @aiborisov @mykyta_p Images and Licensing piq.codeus.net/picture/254492/CVsantahat - Santa hat for CommanderVideo, CVsantahat by anonymous - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - no changes were made piq.codeus.net/picture/423109/UFO - UFO by anonymous - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - no changes were made piq.codeus.net/picture/334023/beer - beer by Investa - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made
  181. 181. @aiborisov @mykyta_p Images and Licensing piq.codeus.net/picture/444498/Beer-Bottle - Beer Bottle by jacklrj - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made https://piq.codeus.net/picture/330338/Deal-With-It - Deal With It by Shiro - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made https://commons.wikimedia.org/wiki/File:Whale_WikiWorld.png - Cartoon illustration has been created by Greg Williams in cooperation with the Wikimedia Foundation - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made
