Building Fault Tolerant Microservices

Kristoffer Erlandsson
kristoffer.erlandsson@avanza.se
@kerlandsson

Slides from presentation as held at Jfokus 2016.

Despite all efforts to the contrary, faults will occur in our applications. When building an application we must prepare to minimize the faults' impact. This fact is especially evident in a microservices setting where the failure of one service must not be allowed to cascade to other, possibly more critical, services.
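
The slides put numbers on this claim. As a worked check, assuming each of 1000 service instances is independently available 99.999% of the time and that a request needs all of them (both are simplifying assumptions made here for illustration):

```latex
\begin{align*}
(1 - 0.99999) \times 525\,600\ \text{min/year} &\approx 5.3\ \text{min/year} \\
0.99999^{1000} = e^{1000 \ln 0.99999} &\approx e^{-0.01} \approx 0.99005 \\
(1 - 0.99005) \times 8760\ \text{h/year} &\approx 87\ \text{h/year}
\end{align*}
```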

This presentation will introduce you to stability patterns such as Bulkheads and Circuit Breakers. We will talk about how these patterns make your applications resilient to faults and how you implement them in a microservices architecture.
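
A minimal sketch of the circuit breaker idea from the slides (closed, open, and half-open states, with calls to a broken service failing fast) might look like the following. It assumes a simple consecutive-failure threshold instead of the separate timeout and error thresholds the talk lists, and `SimpleCircuitBreaker` with all its members is a hypothetical name, not code from the presentation.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical, simplified circuit breaker; not the implementation used in the talk. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures that trip the breaker (assumed policy)
    private final long openNanos;       // time to stay open before allowing one trial call
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAtNanos;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openNanos = TimeUnit.MILLISECONDS.toNanos(openMillis);
    }

    synchronized State state() {
        return state;
    }

    /** Runs the call if the breaker allows it; otherwise, or on error, returns the fallback. */
    <T> T call(Supplier<T> serviceCall, Supplier<T> fallback) {
        if (!tryEnter()) {
            return fallback.get(); // open: fail fast without touching the service
        }
        try {
            T result = serviceCall.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized boolean tryEnter() {
        switch (state) {
            case CLOSED:
                return true;
            case OPEN:
                if (System.nanoTime() - openedAtNanos < openNanos) {
                    return false;
                }
                state = State.HALF_OPEN; // let exactly one trial call through
                return true;
            default:
                return false; // HALF_OPEN: a trial call is already in flight
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        if (state == State.HALF_OPEN) {
            state = State.CLOSED; // the trial call succeeded
        }
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAtNanos = System.nanoTime();
        }
    }
}
```

A caller would wrap the remote call, for example `breaker.call(() -> specialOffers.getOffers(), Offers::emptyOffers)`, so that an open breaker degrades to the empty result instead of tying up a request thread.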

Another important component to making your system resilient is the ability to detect and act on anomalies. Hence, we will also discuss monitoring guidelines for a distributed microservices system.
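
One way to collect the per-service numbers the talk recommends monitoring (timeout, rejected, short-circuit, and failure/success rates plus response times) is sketched below. `ServiceCallMetrics` and its methods are hypothetical names, and "rate" is taken here to mean the fraction of all recorded calls rather than calls per second; a real setup would more likely export counters to an existing metrics system.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical per-service call metrics; names are illustrative, not from the slides. */
final class ServiceCallMetrics {
    enum Outcome { SUCCESS, FAILURE, TIMEOUT, REJECTED, SHORT_CIRCUITED }

    private final Map<Outcome, LongAdder> counts = new EnumMap<>(Outcome.class);
    private final LongAdder timedCalls = new LongAdder();
    private final LongAdder totalMillis = new LongAdder();

    ServiceCallMetrics() {
        for (Outcome outcome : Outcome.values()) {
            counts.put(outcome, new LongAdder());
        }
    }

    /** Records a call that reached the service, together with its response time. */
    void recordCall(Outcome outcome, long elapsedMillis) {
        counts.get(outcome).increment();
        timedCalls.increment();
        totalMillis.add(elapsedMillis);
    }

    /** Records a call that never reached the service (bulkhead rejection or open breaker). */
    void recordSkipped(Outcome outcome) {
        counts.get(outcome).increment();
    }

    long count(Outcome outcome) {
        return counts.get(outcome).sum();
    }

    /** Fraction of all recorded calls that ended with the given outcome. */
    double rate(Outcome outcome) {
        long total = 0;
        for (LongAdder counter : counts.values()) {
            total += counter.sum();
        }
        return total == 0 ? 0.0 : (double) count(outcome) / total;
    }

    /** Mean response time over the calls that actually reached the service. */
    double meanResponseMillis() {
        long n = timedCalls.sum();
        return n == 0 ? 0.0 : (double) totalMillis.sum() / n;
    }
}
```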

  1. Building Fault Tolerant Micro Services. Kristoffer Erlandsson, kristoffer.erlandsson@avanza.se, @kerlandsson
  2. Building Fault Tolerant Micro Services: Why? Failure modes. Stability patterns. Monitoring guidelines.
  3. Kristoffer Erlandsson
  4. 250+ services; 1000+ service instances (JVMs); largest actor on the Stockholm Stock Exchange
  5. Acme Books
  6. Database, Web Application
  7. Database, Web Application
  8. Database, Web Application
  9. Database, Web Application
  10. Database, Web Application
  11. Database, Web Application
  12. Products, Payments, Web Application, Special Offers
  13. Products, Payments, Special Offers, Web Application
  14. Products, Payments, Special Offers, Web Application
  15. Products, Payments, Special Offers, Web Application, Database, Purchase History
  16. Cascading Failure: Products, Payments, Special Offers, Web Application, Database, Purchase History
  17. How about increasing availability?
  18. 0.99999 availability: 5 minutes of downtime per year
  19. 1000 service instances?
  20. 0.99999^1000
  21. 0.99999^1000 ≈ 0.99: 87 hours of downtime per year
  22. Design for Failure
  23. Web Application, Thread pool
  24. Special Offers, Web Application, Thread pool
  25. Special Offers, Web Application, Thread pool
  26. Special Offers, Web Application, Thread pool
  27. Web Application, Thread pool, Special Offers, Products, Payments
  28. Special Offers, Web Application, Thread pool, Products, Payments
  29. Special Offers, Web Application, Thread pool, Products, Payments
  30. // Uses java.io.InputStream, java.net.URL, java.net.URLConnection
      URL url = new URL("http://acme-books.com/special-offers");
      URLConnection connection = url.openConnection();
      connection.connect();
      InputStream inputStream = connection.getInputStream();
      // Read response from stream
  31. Use Timeouts: prevents blocked threads
  32. // Uses java.io.InputStream, java.net.URL, java.net.URLConnection
      URL url = new URL("http://acme-books.com/special-offers");
      URLConnection connection = url.openConnection();
      connection.setConnectTimeout(100); // milliseconds
      connection.setReadTimeout(500);    // milliseconds
      connection.connect();
      InputStream inputStream = connection.getInputStream();
      // Read response from stream
  33. Set Aggressive Timeouts
  34. Special Offers, Database, Purchase History: Timeouts here?
  35. Special Offers, Database, Purchase History: Here too!
  36. Terrible response times. Awful throughput.
  37. Web Application, Thread pool, Special Offers: Timeout
  38. Web Application, Thread pool, Special Offers: Many timeouts
  39. Web Application, Thread pool, Special Offers: Throughput lower than number of incoming requests. Many timeouts.
  40. Web Application, Thread pool, Special Offers: Throughput lower than number of incoming requests. Many timeouts.
  41. Web Application, Thread pool, Special Offers: Throughput lower than number of incoming requests. Growing queue! Many timeouts.
  42. Frequently called service: timeouts are not enough
  43. Circuit Breakers: Calls to broken services fail fast. Offloads broken services.
  44. Special Offers, Web Application
  45. Timeout: Web Application, Web Application, Special Offers
  46. Error, Open state: Special Offers, Web Application
  47. Single call, Half open state: Special Offers, Web Application, Error
  48. Closed state: Special Offers, Web Application
  49. Timeouts over threshold. Unhandled errors over threshold. Known irrecoverable error occurs.
  50. Handle service call errors
  51. try {
          return specialOffers.getOffers();
      } catch (Exception e) {
          return Offers.emptyOffers();
      }
  52. Terrible response times. Awful throughput. Again?!?
  53. Web Application, Thread pool, Special Offers: Slow response
  54. Web Application, Thread pool, Special Offers: Slow responses
  55. Web Application, Thread pool, Special Offers: Throughput lower than number of incoming requests (again). Slow responses.
  56. Response time < timeout: timeouts and circuit breakers are not enough
  57. Bulkheads: Isolates components. Prevents cascading.
  58. Limit number of concurrent calls. Upper bound on number of waiting threads.
  59. Special Offers, Web Application, Thread pool, Bulkhead (size=2)
  60. Special Offers, Web Application, Thread pool, Bulkhead (size=2)
  61. Special Offers, Web Application, Thread pool, Error, Error, Bulkhead (size=2)
  62. Special Offers, Web Application, Thread pool, Error, Error, Bulkhead (size=2). Fast page load, no special offers; slow page load, including special offers.
  63. Products, Payments, Web Application, Special Offers, Purchase History: One bulkhead per service
  64. Upper bound on number of waiting threads. Protects very well against cascading failure …
  65. … if bulkhead sizes are significantly smaller than request pool size
  66. Peak load when healthy: 40 requests per second (rps), 0.1 seconds response time
  67. Suitable bulkhead size: 40 rps × 0.1 seconds = 4; + breathing room = 7
  68. Bonus: protects services from overload
  69. Semaphore bulkhead = new Semaphore(2);
      Offers protectedGetOffers() {
          // Non-blocking attempt. Unlike tryAcquire(0, TimeUnit.SECONDS), the
          // no-argument form does not throw the checked InterruptedException.
          if (bulkhead.tryAcquire()) {
              try {
                  return specialOffers.getOffers();
              } finally {
                  bulkhead.release();
              }
          } else {
              throw new RejectedByBulkheadException();
          }
      }
  70. Many threads are waiting. Few available threads: low throughput. All service calls are rejected!
  71. Products, Payments, Web Application, Special Offers
  72. Products, Payments, Web Application, Special Offers
  73. Products, Payments, Web Application, Special Offers: Where have our timeouts gone?!?
  74. Broken Client Library: more protection required
  75. Thread Pool Handovers: Calling threads can always walk away. Generic timeouts.
  76. Web Application, Request thread pool, Service thread pool, Service
  77. Web Application, Request thread pool, Service thread pool, Service
  78. Web Application, Request thread pool, Service thread pool, Service
  79. Web Application, Request thread pool, Service thread pool, Service
  80. Web Application, Request thread pool, Service thread pool, Service
  81. Web Application, Request thread pool, Service thread pool, Service: Timeout
  82. Web Application, Error, Request thread pool, Service thread pool, Service
  83. Web Application, Error, Request thread pool, Service thread pool, Service: Service call timeouts still required
  84. Web Application, Request thread pool, Service thread pool, Service, Error: Bulkhead included
  85. ExecutorService executor = new ThreadPoolExecutor(
              3, 3, 1, TimeUnit.MINUTES, new SynchronousQueue<>());
      // future.get() also throws the checked InterruptedException and
      // ExecutionException, so the method has to declare them.
      Offers protectedGetOffers() throws InterruptedException, ExecutionException {
          try {
              Future<Offers> future = executor.submit(specialOffers::getOffers);
              return future.get(1, TimeUnit.SECONDS);
          } catch (RejectedExecutionException e) {
              throw new RejectedByBulkheadException();
          } catch (TimeoutException e) {
              throw new ServiceCallTimeoutException();
          }
      }
  86. Thread pool handovers are very powerful! What about performance?
  87. Monitor Service Calls
  88. Timeout rate. Rejected call rate. Short circuit rate. Failure/success rate. Response times.
  89. Understand problems before changing configuration
  90. "All this seems like a lot of work!"
  91. class GetOffersCommand extends HystrixCommand<Offers> {
          public GetOffersCommand() {
              super(HystrixCommandGroupKey.Factory.asKey("SpecialOffers"));
          }
          @Override
          protected Offers run() throws Exception {
              return specialOffers.getOffers();
          }
      }
      public Offers getOffers() {
          return new GetOffersCommand().execute();
      }
  92. class GetOffersCommand extends HystrixCommand<Offers> {
          // ...
          @Override
          protected Offers getFallback() {
              return Offers.emptyOffers();
          }
      }
  93. Design for failure. Use timeouts. Circuit Breakers. Bulkheads. Monitor service calls.
  94. https://github.com/Netflix/Hystrix
  95. Image Attributions
      • "Polycelis felina" by Eduard Solà - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:Polycelis_felina.jpg#/media/File:Polycelis_felina.jpg
      • "Old book bindings" by Tom Murphy VII - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:Old_book_bindings.jpg#/media/File:Old_book_bindings.jpg
      • "Cute Snail" by gniyuhs - Own work. Licensed under CC BY-SA 3.0 via deviantart - http://gniyuhs.deviantart.com/art/Cute-Snail-278597934
      • "Cash" by 401(K) 2012 - Own work. Licensed under CC BY-SA 2.0 via Flickr - https://www.flickr.com/photos/68751915@N05/6355816649
      • "Circuit breakers at substation near Denver International Airport, Colorado" by Greg Goebel from Loveland CO, USA - Yipws_2b Uploaded by PDTillman. Licensed under CC BY-SA 2.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Circuit_breakers_at_substation_near_Denver_International_Airport,_Colorado.jpg#/media/File:Circuit_breakers_at_substation_near_Denver_International_Airport,_Colorado.jpg
      • "The control room of the nuclear ship NS Savannah, Baltimore, Maryland, USA" - Own work. Licensed under CC BY-SA 3.0 via Commons - https://en.wikipedia.org/wiki/File:NS_Savannah_control_room_MD1.jpg#/media/File:NS_Savannah_control_room_MD1.jpg
  96. Thank you! Questions? Kristoffer Erlandsson, kristoffer.erlandsson@avanza.se, @kerlandsson
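
The sizing rule of thumb from slides 66 and 67 (peak request rate times healthy response time, plus breathing room) can be sketched as a small helper. The class and method names are hypothetical, and the headroom of 3 used below is only what the slide's own example implies (4 + breathing room = 7).

```java
/** Hypothetical helper, not from the slides: bulkhead size derived from peak load. */
final class BulkheadSizing {
    private BulkheadSizing() {}

    /**
     * Expected number of concurrent calls at peak (request rate times response
     * time, rounded up) plus a caller-chosen headroom.
     */
    static int suggestedSize(int peakRequestsPerSecond, int responseTimeMillis, int headroom) {
        // Integer arithmetic avoids floating-point rounding surprises in the ceiling.
        long inFlight = ((long) peakRequestsPerSecond * responseTimeMillis + 999) / 1000;
        return Math.toIntExact(inFlight) + headroom;
    }
}
```

With the numbers from the slides, `BulkheadSizing.suggestedSize(40, 100, 3)` gives 7, which could then be passed to `new Semaphore(...)` as in slide 69.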
