
Fault Tolerance in a High Volume, Distributed System


More information can be found at http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html

Published in: Technology, Business


  1. Fault Tolerance in a High Volume, Distributed System
     Ben Christensen, Software Engineer – API Platform at Netflix
     @benjchristensen, http://www.linkedin.com/in/benjchristensen
  2. Dozens of dependencies. One going down takes everything down.
     99.99%^30 = 99.7% uptime
     0.3% of 1 billion = 3,000,000 failures
     2+ hours downtime/month even if all dependencies have excellent uptime.
     Reality is generally worse.
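The arithmetic on this slide can be reproduced with a short sketch. It assumes the 30 dependencies fail independently and that a month has 30 days; the class and method names are hypothetical and not part of the talk.

```java
final class CompoundAvailability {
    /** Availability of a request that needs every one of n independent dependencies. */
    static double combined(double perDependency, int dependencies) {
        return Math.pow(perDependency, dependencies);
    }

    public static void main(String[] args) {
        double uptime = combined(0.9999, 30);                      // roughly 0.997
        double failureRate = 1.0 - uptime;                         // roughly 0.003
        double failuresPerBillion = failureRate * 1_000_000_000L;  // roughly 3 million
        double downHoursPerMonth = failureRate * 30 * 24;          // assumes a 30-day month
        System.out.printf("uptime=%.2f%% failures=%.0f downtime=%.1f h/month%n",
                uptime * 100, failuresPerBillion, downHoursPerMonth);
    }
}
```

The slide rounds the failure rate to 0.3%, which is where the round figure of 3,000,000 failures comes from.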
  6. No single dependency should take down the entire app. Fail fast. Fail silent. Fallback. Shed load.
  7. Options
     - Aggressive Network Timeouts
     - Semaphores (Tryable)
     - Separate Threads
     - Circuit Breaker
  10. Semaphores (Tryable): Limited Concurrency

          TryableSemaphore executionSemaphore = getExecutionSemaphore();
          // acquire a permit
          if (executionSemaphore.tryAcquire()) {
              try {
                  return executeCommand();
              } finally {
                  executionSemaphore.release();
              }
          } else {
              circuitBreaker.markSemaphoreRejection();
              // permit not available so return fallback
              return getFallback();
          }
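A self-contained version of the tryable-semaphore pattern might look like the following. This is a minimal sketch, not the Netflix implementation: it substitutes the standard `java.util.concurrent.Semaphore` for the slide's `TryableSemaphore`, and `SemaphoreGuard` and its `execute` method are hypothetical names.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class SemaphoreGuard {
    private final Semaphore permits;

    SemaphoreGuard(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the command if a permit is free; otherwise returns the fallback without waiting. */
    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            // No permit available: shed load instead of queueing the caller.
            return fallback.get();
        }
        try {
            return command.get();
        } finally {
            permits.release();
        }
    }
}
```

Because `tryAcquire()` never blocks, a saturated dependency costs the caller only the time needed to produce the fallback.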
  14. Separate Threads: Limited Concurrency

          try {
              if (!threadPool.isQueueSpaceAvailable()) {
                  // we are at the property defined max so want to throw the RejectedExecutionException to simulate
                  // reaching the real max and go through the same codepath and behavior
                  throw new RejectedExecutionException("Rejected command because thread-pool queueSize is at rejection threshold.");
              }
              ... define Callable that performs executeCommand() ...
              // submit the work to the thread-pool
              return threadPool.submit(command);
          } catch (RejectedExecutionException e) {
              circuitBreaker.markThreadPoolRejection();
              // rejected so return fallback
              return getFallback();
          }
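The rejection path above can be approximated with a plain `ThreadPoolExecutor` and a bounded queue. The sketch below is an assumption-laden illustration rather than the code from the talk: `IsolatedPool`, `rejectionCount` and the fallback-as-value signature are hypothetical, and the counter merely stands in for `markThreadPoolRejection()`.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class IsolatedPool {
    private final ThreadPoolExecutor pool;
    private final AtomicInteger rejections = new AtomicInteger();

    IsolatedPool(int threads, int queueSize) {
        // Fixed-size pool with a bounded queue. The default AbortPolicy throws
        // RejectedExecutionException once all threads are busy and the queue is full.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize));
    }

    /** Submits the command, or returns an already-completed fallback if the pool is saturated. */
    <T> Future<T> submit(Callable<T> command, T fallback) {
        try {
            return pool.submit(command);
        } catch (RejectedExecutionException e) {
            rejections.incrementAndGet(); // stand-in for reporting the rejection
            return CompletableFuture.completedFuture(fallback);
        }
    }

    int rejectionCount() {
        return rejections.get();
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

Giving each dependency its own pool of this kind means a slow dependency can exhaust only its own threads and queue, not those of the calling application.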
  17. Separate Threads: Timeout. Override of Future.get()

          public K get() throws CancellationException, InterruptedException, ExecutionException {
              try {
                  long timeout = getCircuitBreaker().getCommandTimeoutInMilliseconds();
                  return get(timeout, TimeUnit.MILLISECONDS);
              } catch (TimeoutException e) {
                  // report timeout failure
                  circuitBreaker.markTimeout(System.currentTimeMillis() - startTime);
                  // retrieve the fallback
                  return getFallback();
              }
          }
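The same "give up and move on" behaviour can be sketched as a helper around any `Future`. This is a simplified illustration with hypothetical names (`TimedCall`, `getOrFallback`). Unlike the slide's `get()`, which falls back only on timeout and propagates the other exceptions, this sketch also returns the fallback on interruption and on command failure.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedCall {
    /** Waits at most timeoutMs for the result; otherwise returns the fallback. */
    static <T> T getOrFallback(Future<T> future, long timeoutMs, T fallback) {
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Stop waiting and ask the worker to stop, so it does not hold its thread.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return fallback;
        } catch (ExecutionException e) {
            // Simplification: the command itself failed, so degrade to the fallback.
            return fallback;
        }
    }
}
```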
  21. Circuit Breaker

          if (circuitBreaker.allowRequest()) {
              return executeCommand();
          } else {
              // short-circuit and go directly to fallback
              circuitBreaker.markShortCircuited();
              return getFallback();
          }
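The slides do not show how `allowRequest()` decides. One common approach, sketched here purely as an assumption, is to open the circuit after a number of consecutive failures and allow trial requests again once a sleep window has elapsed. `SimpleCircuitBreaker`, its parameters and the injected clock are hypothetical. The dashboard slide later in the deck suggests the real decision uses a rolling error percentage, which this sketch does not model.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long sleepWindowMs;
    private final LongSupplier clock; // injected so the time source can be replaced
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMs, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMs = sleepWindowMs;
        this.clock = clock;
    }

    synchronized boolean allowRequest() {
        if (!open) {
            return true;
        }
        // While open, only let requests through once the sleep window has passed.
        return clock.getAsLong() - openedAt >= sleepWindowMs;
    }

    synchronized void markSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    synchronized void markFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = clock.getAsLong(); // (re)start the sleep window
        }
    }

    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // short-circuit straight to the fallback
        }
        try {
            T result = command.get();
            markSuccess();
            return result;
        } catch (RuntimeException e) {
            markFailure();
            return fallback.get();
        }
    }
}
```

A production breaker would typically admit a single trial request after the sleep window; this sketch lets every caller through until one of them fails again.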
  24. Netflix uses all 4 in combination
  26. Tryable semaphores for “trusted” clients and fallbacks
      Separate threads for “untrusted” clients
      Aggressive timeouts on threads and network calls to “give up and move on”
      Circuit breakers as the “release valve”
  30. Benefits of Separate Threads
      - Protection from client libraries
      - Lower risk to accept new/updated clients
      - Quick recovery from failure
      - Client misconfiguration
      - Client service performance characteristic changes
      - Built-in concurrency
  31. Drawbacks of Separate Threads
      - Some computational overhead
      - Load on machine can be pushed too far
      ...
      Benefits outweigh drawbacks when clients are “untrusted”
  33. Visualizing Circuits in Realtime (generally sub-second latency)
      Video available at https://vimeo.com/33576628
  34. Rolling 10 second counter – 1 second granularity
      Median, Mean, 90th, 99th, 99.5th
      Latent, Error, Timeout, Rejected
      Error Percentage = (error + timeout + rejected) / (success + latent success + error + timeout + rejected)
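A rolling counter with ten one-second buckets and the error-percentage formula from this slide could be sketched as follows. The deck does not show the actual data structure, so the ring-buffer layout and every name here (`RollingErrorRate`, `Event`, `record`, `errorPercentage`) are assumptions. Timestamps are passed in explicitly and are assumed to be non-negative and non-decreasing.

```java
import java.util.Arrays;

final class RollingErrorRate {
    enum Event { SUCCESS, LATENT_SUCCESS, ERROR, TIMEOUT, REJECTED }

    private static final int BUCKETS = 10; // 10-second window at 1-second granularity
    private final long[][] counts = new long[BUCKETS][Event.values().length];
    private final long[] bucketSecond = new long[BUCKETS]; // which second each bucket holds

    RollingErrorRate() {
        Arrays.fill(bucketSecond, -1L); // -1 marks a bucket that was never used
    }

    synchronized void record(Event event, long nowMs) {
        long second = nowMs / 1000;
        int idx = (int) (second % BUCKETS);
        if (bucketSecond[idx] != second) {
            // The bucket still holds counts from an older second: reset and reuse it.
            Arrays.fill(counts[idx], 0L);
            bucketSecond[idx] = second;
        }
        counts[idx][event.ordinal()]++;
    }

    /** (error + timeout + rejected) / all events in the last 10 seconds, as a percentage. */
    synchronized double errorPercentage(long nowMs) {
        long second = nowMs / 1000;
        long failed = 0;
        long total = 0;
        for (int i = 0; i < BUCKETS; i++) {
            if (bucketSecond[i] < 0 || second - bucketSecond[i] >= BUCKETS) {
                continue; // unused bucket, or one that has aged out of the window
            }
            for (Event e : Event.values()) {
                long n = counts[i][e.ordinal()];
                total += n;
                if (e == Event.ERROR || e == Event.TIMEOUT || e == Event.REJECTED) {
                    failed += n;
                }
            }
        }
        return total == 0 ? 0.0 : 100.0 * failed / total;
    }
}
```

Old seconds drop out of the percentage as their buckets age out, so a dependency that has recovered stops looking unhealthy within about ten seconds.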
  35. Netflix DependencyCommand Implementation
  41. Netflix DependencyCommand Implementation
      Fallbacks: Cache, Eventual Consistency, Stubbed Data, Empty Response
  44. Rolling Number: Realtime Stats and Decision Making
  45. Request Collapsing: Take advantage of resiliency to improve efficiency
  48. Fail fast. Fail silent. Fallback. Shed load.
  49. Questions & More Information
      Fault Tolerance in a High Volume, Distributed System
      http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
      Making the Netflix API More Resilient
      http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
      Ben Christensen, @benjchristensen, http://www.linkedin.com/in/benjchristensen
