Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Fault tolerant microservices - LJC Skills Matter 4thNov2014

3,502 views

Published on

Fault tolerant microservices - LJC Skills Matter 4thNov2014

Published in: Technology

Fault tolerant microservices - LJC Skills Matter 4thNov2014

  1. 1. Fault tolerant microservices BSkyB @chbatey
  2. 2. @chbatey Who is this guy? ● Enthusiastic nerd ● Senior software engineer at BSkyB ● Builds a lot of distributed applications ● Apache Cassandra MVP
  3. 3. @chbatey Agenda 1. Setting the scene ○ What do we mean by a fault? ○ What is a microservice? ○ Monolith application vs the micro(ish) service 2. A worked example ○ Identify an issue ○ Reproduce/test it ○ Show how to deal with the issue
  4. 4. So… what do applications look like? @chbatey
  5. 5. So... what do systems look like now? @chbatey
  6. 6. But different things go wrong... @chbatey down slow network slow app 2 second max GC :( missing packets
  7. 7. Fault tolerance 1. Don’t take forever - Timeouts 2. Don’t try if you can’t succeed 3. Fail gracefully 4. Know if it’s your fault 5. Don’t whack a dead horse 6. Turn broken stuff off @chbatey
  8. 8. Time for an example... ● All examples are on github ● Technologies used: @chbatey ○ Dropwizard ○ Spring Boot ○ Wiremock ○ Hystrix ○ Graphite ○ Saboteur
  9. 9. Example: Movie player service @chbatey Shiny App User Service Device Service Pin Service Shiny App Shiny App Shiny App User Se rUvisceer Service Device Service Play Movie
  10. 10. Testing microservices You don’t know a service is fault tolerant if you don’t test faults @chbatey
  11. 11. Isolated service tests Shiny App @chbatey Mocks User Device Pin service Acceptance Play Movie Test Prime
  12. 12. 1 - Don’t take forever @chbatey ● If at first you don’t succeed, don’t take forever to tell someone ● Timeout and fail fast
  13. 13. Which timeouts? ● Socket connection timeout ● Socket read timeout @chbatey
  14. 14. Your service hung for 30 seconds :( @chbatey Customer You :(
  15. 15. Which timeouts? ● Socket connection timeout ● Socket read timeout ● Resource acquisition @chbatey
  16. 16. Your service hung for 10 minutes :( @chbatey
  17. 17. Let’s think about this @chbatey
  18. 18. A little more detail @chbatey
  19. 19. Wiremock + Saboteur + Vagrant ● Vagrant - launches + provisions local VMs ● Saboteur - uses tc, iptables to simulate @chbatey network issues ● Wiremock - used to mock HTTP dependencies ● Cucumber - acceptance tests
  20. 20. I can write an automated test for that? @chbatey Vagrant + Virtual box VM Wiremock User Service Device Service Pin Service Sabot eur Play Movie Service Acceptance Test prime to drop traffic reset
  21. 21. Implementing reliable timeouts ● Homemade: Worker Queue + Thread pool @chbatey (executor)
  22. 22. Implementing reliable timeouts ● Homemade: Worker Queue + Thread pool @chbatey (executor) ● Hystrix
  23. 23. Implementing reliable timeouts ● Homemade: Worker Queue + Thread pool @chbatey (executor) ● Hystrix ● Spring Cloud Netflix
  24. 24. A simple Spring RestController @chbatey @RestController public class Resource { private static final Logger LOGGER = LoggerFactory.getLogger(Resource.class); @Autowired private ScaryDependency scaryDependency; @RequestMapping("/scary") public String callTheScaryDependency() { LOGGER.info("RestContoller: I wonder which thread I am on!"); return scaryDependency.getScaryString(); } }
  25. 25. Scary dependency @chbatey @Component public class ScaryDependency { private static final Logger LOGGER = LoggerFactory.getLogger(ScaryDependency.class); public String getScaryString() { LOGGER.info("Scary dependency: I wonder which thread I am on!"); if (System.currentTimeMillis() % 2 == 0) { return "Scary String"; } else { Thread.sleep(10000); return "Really slow scary string"; } } }
  26. 26. All on the tomcat thread 13:07:32.814 [http-nio-8080-exec-1] INFO info.batey. examples.Resource - RestContoller: I wonder which thread I am on! 13:07:32.896 [http-nio-8080-exec-1] INFO info.batey. examples.ScaryDependency - Scary dependency: I wonder which thread I am on! @chbatey
  27. 27. Seriously this simple now? @chbatey @Component public class ScaryDependency { private static final Logger LOGGER = LoggerFactory.getLogger(ScaryDependency.class); @HystrixCommand public String getScaryString() { LOGGER.info("Scary dependency: I wonder which thread I am on!"); if (System.currentTimeMillis() % 2 == 0) { return "Scary String"; } else { Thread.sleep(10000); return "Really slow scary string"; } } }
  28. 28. What an annotation can do... 13:07:32.814 [http-nio-8080-exec-1] INFO info.batey. examples.Resource - RestController: I wonder which thread I am on! 13:07:32.896 [hystrix-ScaryDependency-1] INFO info. batey.examples.ScaryDependency - Scary Dependency: I wonder which thread I am on! @chbatey
  29. 29. Timeouts take home ● You can’t use network level timeouts for @chbatey SLAs ● Test your SLAs - if someone says you can’t, hit them with a stick ● Scary things happen without network issues
  30. 30. 2 - Don’t try if you can’t succeed @chbatey
  31. 31. Complexity ● When an application grows in complexity it will eventually start sending emails @chbatey
  32. 32. Complexity ● When an application grows in complexity it will eventually start sending emails contain queues and thread pools @chbatey
  33. 33. Don’t try if you can’t succeed ● Executor Unbounded queues :( ○ newFixedThreadPool ○ newSingleThreadExecutor ○ newThreadCachedThreadPool ● Bound your queues and threads ● Fail quickly when the queue / @chbatey maxPoolSize is met ● Know your drivers
  34. 34. This is a functional requirement ● Set the timeout very high ● Use wiremock to add a large delay to the @chbatey requests ● Set queue size and thread pool size to 1 ● Send in 2 requests to use the thread and fill the queue ● What happens on the 3rd request?
  35. 35. 3 - Fail gracefully @chbatey
  36. 36. Expect rubbish ● Expect invalid HTTP ● Expect malformed response bodies ● Expect connection failures ● Expect huge / tiny responses @chbatey
  37. 37. Testing with Wiremock @chbatey stubFor(get(urlEqualTo("/dependencyPath")) .willReturn(aResponse() .withFault(Fault.MALFORMED_RESPONSE_CHUNK))); { "request": { "method": "GET", "url": "/fault" }, "response": { "fault": "RANDOM_DATA_THEN_CLOSE" } } { "request": { "method": "GET", "url": "/fault" }, "response": { "fault": "EMPTY_RESPONSE" } }
  38. 38. 4 - Know if it’s your fault @chbatey
  39. 39. What to record ● Metrics: Timings, errors, concurrent incoming requests, thread pool statistics, connection pool statistics ● Logging: Boundary logging, elasticsearch / @chbatey logstash ● Request identifiers
  40. 40. Graphite + Codahale @chbatey
  41. 41. @chbatey Response times
  42. 42. Separate resource pools ● Don’t flood your dependencies ● Be able to answer the questions: ○ How many connections will you make to dependency X? ○ Are you getting close to your @chbatey max connections?
  43. 43. So easy with Dropwizard + Hystrix @Override public void initialize(Bootstrap<AppConfig> appConfigBootstrap) { HystrixCodaHaleMetricsPublisher metricsPublisher = new HystrixCodaHaleMetricsPublisher(appConfigBootstrap.getMetricRegistry()) HystrixPlugins.getInstance().registerMetricsPublisher(metricsPublisher); @chbatey } metrics: reporters: - type: graphite host: 192.168.10.120 port: 2003 prefix: shiny_app
  44. 44. 5 - Don’t whack a dead horse @chbatey Shiny App User Service Device Service Pin Service Shiny App Shiny App Shiny App User Se rUvisceer Service Device Service Play Movie
  45. 45. What to do.. ● Yes this will happen.. ● Mandatory dependency - fail *really* fast ● Throttling ● Fallbacks @chbatey
  46. 46. Circuit breaker pattern @chbatey
  47. 47. Implementation with Hystrix @chbatey @GET @Timed public String integrate() { LOGGER.info("I best do some integration!"); String user = new UserServiceDependency(userService).execute(); String device = new DeviceServiceDependency(deviceService).execute(); Boolean pinCheck = new PinCheckDependency(pinService).execute(); return String.format("[User info: %s] n[Device info: %s] n[Pin check: %s] n", user, device, pinCheck); }
  48. 48. Implementation with Hystrix public class PinCheckDependency extends HystrixCommand<Boolean> { @chbatey @Override protected Boolean run() throws Exception { HttpGet pinCheck = new HttpGet("http://localhost:9090/pincheck"); HttpResponse pinCheckResponse = httpClient.execute(pinCheck); String pinCheckInfo = EntityUtils.toString(pinCheckResponse.getEntity()); return Boolean.valueOf(pinCheckInfo); } }
  49. 49. Implementation with Hystrix public class PinCheckDependency extends HystrixCommand<Boolean> { @chbatey @Override protected Boolean run() throws Exception { HttpGet pinCheck = new HttpGet("http://localhost:9090/pincheck"); HttpResponse pinCheckResponse = httpClient.execute(pinCheck); String pinCheckInfo = EntityUtils.toString(pinCheckResponse.getEntity()); return Boolean.valueOf(pinCheckInfo); } @Override public Boolean getFallback() { return true; } }
  50. 50. Triggering the fallback ● Error threshold percentage ● Bucket of time for the percentage ● Minimum number of requests to trigger ● Time before trying a request again ● Disable ● Per instance statistics @chbatey
  51. 51. 6 - Turn off broken stuff ● The kill switch @chbatey
  52. 52. To recap 1. Don’t take forever - Timeouts 2. Don’t try if you can’t succeed 3. Fail gracefully 4. Know if it’s your fault 5. Don’t whack a dead horse 6. Turn broken stuff off @chbatey
  53. 53. @chbatey Links ● Examples: ○ https://github.com/chbatey/spring-cloud-example ○ https://github.com/chbatey/dropwizard-hystrix ○ https://github.com/chbatey/vagrant-wiremock-saboteur ● Tech: ○ https://github.com/Netflix/Hystrix ○ https://www.vagrantup.com/ ○ http://wiremock.org/ ○ https://github.com/tomakehurst/saboteur
  54. 54. Questions? ● Thanks for listening! ● http://christopher-batey.blogspot.co.uk/ @chbatey
  55. 55. Developer takeaways ● Learn about TCP ● Love vagrant, docker etc to enable testing ● Don’t trust libraries @chbatey
  56. 56. Hystrix cost - do this yourself @chbatey
  57. 57. Hystrix metrics ● Failure count ● Percentiles from Hystrix @chbatey point of view ● Error percentages
  58. 58. How to test metric publishing? ● Stub out graphite and verify calls? ● Programmatically call graphite and verify @chbatey numbers? ● Make metrics + logs part of the story demo

×