
Performance and Fault Tolerance for the Netflix API


1) How Netflix does resilience engineering to tolerate failures and latency.
2) Changes in the approach to API architecture that allow service endpoints to be optimized for each of the hundreds of unique streaming devices, rather than making all devices use the same one-size-fits-all approach that is optimized for none.

Presented June 28th 2012 at Silicon Valley Cloud Computing Group

http://www.meetup.com/cloudcomputing/events/68006112/

Published in: Technology
  • Comments strip line breaks and the description limits total characters, so to see slide notes on a single page you can go to https://speakerdeck.com/u/benjchristensen/p/performance-and-fault-tolerance-for-the-netflix-api
  • Some of the slides have notes so I recommend clicking the 'Notes on Slide #' tab to view them while viewing the slides as they provide further explanation. (Line breaks apparently aren't supported so \n is shown in them)

Performance and Fault Tolerance for the Netflix API

  1. Performance and Fault Tolerance for the Netflix API. Ben Christensen, Software Engineer – API Platform at Netflix. @benjchristensen http://www.linkedin.com/in/benjchristensen http://techblog.netflix.com/
  2. Netflix API. Dependency A Dependency B Dependency C Dependency D Dependency E Dependency F Dependency G Dependency H Dependency I Dependency J Dependency K Dependency L Dependency M Dependency N Dependency O Dependency P Dependency Q Dependency R
  3. Netflix API. Dependency A Dependency B Dependency C Dependency D Dependency E Dependency F Dependency G Dependency H Dependency I Dependency J Dependency K Dependency L Dependency M Dependency N Dependency O Dependency P Dependency Q Dependency R
  4. Dozens of dependencies. One going bad takes everything down. 99.99%^30 = 99.7% uptime. 0.3% of 1 billion = 3,000,000 failures. 2+ hours downtime/month even if all dependencies have excellent uptime. Reality is generally worse.
  5. No single dependency should take down the entire app. Fallback. Fail silent. Fail fast. Shed load.
  6. Options: Aggressive Network Timeouts, Semaphores (Tryable), Separate Threads, Circuit Breaker.
  7. Tryable semaphores for “trusted” clients and fallbacks. Separate threads for “untrusted” clients. Aggressive timeouts on threads and network calls to “give up and move on”. Circuit breakers as the “release valve”.
  8. 30 rps x 0.2 seconds = 6 + breathing room = 10 threads. Thread-pool queue size: 5-10 (0 doesn't work, but get close to it). Thread-pool Size + Queue Size. Queuing is Not Free.
  9. Cost of Thread @ 75 rps: median, 90th and 99th percentile (time in ms). Chart series: time for thread to execute, time user thread waited.
  10. Netflix DependencyCommand Implementation
  11. Netflix DependencyCommand Implementation. Fallbacks: Cache, Eventual Consistency, Stubbed Data, Empty Response.
  12. Netflix DependencyCommand Implementation
  13. So, how does it work in the real world?
  14. Visualizing Circuits in Near-Realtime (latency is single-digit seconds, generally 1-2). Video available at https://vimeo.com/33576628
  15. Rolling 10 second counters. 1 minute latency percentiles. 2 minute rate change. Circle color and size represent health and traffic volume.
  16. API Daily Incoming vs Outgoing (chart with weekends marked). 8-10 Billion DependencyCommand Executions (threaded). 1.2-1.6 Billion Incoming Requests.
  17. API Hourly Incoming vs Outgoing. Peak at 700M+ threaded DependencyCommand executions (200k+/second). Peak at 100M+ incoming requests (30k+/second).
  18. Fallback. Fail silent. Fail fast. Shed load.
  19. Netflix API. Dependency A Dependency B Dependency C Dependency D Dependency E Dependency F Dependency G Dependency H Dependency I Dependency J Dependency K Dependency L Dependency M Dependency N Dependency O Dependency P Dependency Q Dependency R
  20. Single Network Request from Clients (use LAN instead of WAN). Send Only The Bytes That Matter (optimize responses for each client). Leverage Concurrency (but abstract away its complexity).
  21. Single Network Request from Clients (use LAN instead of WAN). Diagram: Device, Server, Netflix API. Landing page requires ~dozen API requests.
  22. Single Network Request from Clients (use LAN instead of WAN). Some clients are limited in the number of concurrent network connections.
  23. Single Network Request from Clients (use LAN instead of WAN). Network latency makes this even worse (mobile, home, wifi, geographic distance, etc).
  24. Single Network Request from Clients (use LAN instead of WAN). Diagram: Device, Server, Netflix API. Push call pattern to server ...
  25. Single Network Request from Clients (use LAN instead of WAN). Diagram: Device, Server, Netflix API. ... and eliminate redundant calls.
  26. Send Only The Bytes That Matter (optimize responses for each client). Diagram: Netflix API, Device, Server, Client, Client. Part of client now on server.
  27. Send Only The Bytes That Matter (optimize responses for each client). Diagram: Netflix API, Device, Server, Client, Client. Client retrieves and delivers exactly what their device needs in its optimal format.
  28. Send Only The Bytes That Matter (optimize responses for each client). Diagram: Device, Server, Netflix API, Service Layer, Client, Client. Interface is now a Java API that client interacts with at a granular level.
  29. Leverage Concurrency (but abstract away its complexity). Diagram: Device, Server, Netflix API, Service Layer, Client, Client.
  30. Leverage Concurrency (but abstract away its complexity). Diagram: Device, Server, Netflix API, Service Layer, Client, Client. No synchronized, volatile, locks, Futures or Atomic*/Concurrent* classes in client-server code.
  31. Leverage Concurrency (but abstract away its complexity). Service calls are all asynchronous. Functional programming with higher-order functions. Fully asynchronous API: clients can't block.

        def video1Call = api.getVideos(api.getUser(), 123456, 7891234);
        def video2Call = api.getVideos(api.getUser(), 6789543);
        // higher-order functions used to compose asynchronous calls together
        wx.merge(video1Call, video2Call).toList().subscribe([
            onNext: { listOfVideos ->
                for(video in listOfVideos) {
                    response.getWriter().println("video: " + video.id + " " + video.title);
                }
            },
            onError: { exception ->
                response.setStatus(500);
                response.getWriter().println("Error: " + exception.getMessage());
            }
        ])
  32. Diagram: Device, Server, Netflix API. Optimize for each device. Leverage the server.
  33. Netflix is Hiring http://jobs.netflix.com
      Fault Tolerance in a High Volume, Distributed System http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
      Making the Netflix API More Resilient http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
      Why REST Keeps Me Up At Night http://blog.programmableweb.com/2012/05/15/why-rest-keeps-me-up-at-night/
      Ben Christensen @benjchristensen http://www.linkedin.com/in/benjchristensen
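
The arithmetic on slide 4 can be checked with a short Python sketch. It assumes, as the slide appears to, that the 30 dependencies fail independently and that the system is up only when all of them are up; the function name and the 30-day month are illustrative assumptions, not taken from the deck.

```python
def compound_uptime(per_dependency_uptime: float, dependency_count: int) -> float:
    """Uptime of a system that needs every dependency up (independent failures assumed)."""
    return per_dependency_uptime ** dependency_count


uptime = compound_uptime(0.9999, 30)             # 99.99% uptime for each of 30 dependencies
failure_rate = 1 - uptime                        # fraction of requests expected to fail
failures_per_billion = failure_rate * 1_000_000_000
downtime_hours_per_month = failure_rate * 30 * 24  # assumes a 30-day month

print(f"uptime: {uptime:.2%}")
print(f"failures per billion requests: {failures_per_billion:,.0f}")
print(f"downtime per month: {downtime_hours_per_month:.2f} hours")
```

The exact figures come out slightly below the slide's rounded values because the slide rounds the failure rate up to 0.3%.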
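
Slide 8 sizes a thread pool as request rate times latency plus headroom. A minimal Python sketch of that rule follows. The function name and the choice to pass headroom as an explicit number are assumptions; the slide only says "breathing room" and lands on 10 threads.

```python
import math


def thread_pool_size(requests_per_second: int, latency_ms: int, breathing_room: int) -> int:
    """Threads busy at steady state (rate x latency) plus spare capacity for spikes."""
    busy_threads = math.ceil(requests_per_second * latency_ms / 1000)
    return busy_threads + breathing_room


# Slide 8's numbers: 30 rps x 0.2 s = 6 busy threads, plus 4 spare = 10 threads.
print(thread_pool_size(requests_per_second=30, latency_ms=200, breathing_room=4))
```

Latency is taken in whole milliseconds so the multiplication stays exact before rounding up.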
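
The "tryable semaphore" with fallback from slides 6, 7 and 11 might look roughly like the Python sketch below. This is not the Netflix DependencyCommand implementation, which the slides do not show; the class name and constructor parameters are hypothetical. It relies only on the standard `threading.Semaphore.acquire(blocking=False)`, which returns immediately instead of waiting for a permit.

```python
import threading


class TryableSemaphoreCommand:
    """Hypothetical wrapper: reject immediately when saturated, fall back on any failure."""

    def __init__(self, run, fallback, max_concurrent):
        self._run = run
        self._fallback = fallback
        self._permits = threading.Semaphore(max_concurrent)

    def execute(self):
        # Try, never wait: a saturated dependency sheds load instead of queuing callers.
        if not self._permits.acquire(blocking=False):
            return self._fallback()
        try:
            return self._run()
        except Exception:
            # Fail silent: the caller gets the fallback (cache, stub, empty response).
            return self._fallback()
        finally:
            self._permits.release()


def failing_dependency():
    raise RuntimeError("dependency down")


healthy = TryableSemaphoreCommand(lambda: ["live data"], lambda: [], max_concurrent=10)
broken = TryableSemaphoreCommand(failing_dependency, lambda: [], max_concurrent=10)
print(healthy.execute(), broken.execute())
```

Because the call runs on the caller's thread, this variant cannot time out a slow dependency, which fits slide 7's suggestion to reserve it for "trusted" clients.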
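
Slide 7 pairs separate threads with aggressive timeouts so the caller can "give up and move on". A minimal Python sketch using the standard `concurrent.futures` module is below. The function names, the pool size of 10 (borrowed from slide 8) and the timeout values are illustrative assumptions.

```python
import concurrent.futures
import time

# A dedicated pool per dependency keeps a slow dependency from using up caller threads.
_dependency_pool = concurrent.futures.ThreadPoolExecutor(max_workers=10)


def call_with_timeout(run, fallback, timeout_seconds):
    """Run the dependency call on the pool; return the fallback on timeout or error."""
    future = _dependency_pool.submit(run)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # cancel() only stops work that has not started; a running call keeps its thread.
        future.cancel()
        return fallback()
    except Exception:
        return fallback()


def slow_dependency():
    time.sleep(0.3)  # stand-in for a dependency with high latency
    return "late response"


print(call_with_timeout(slow_dependency, lambda: "stubbed data", timeout_seconds=0.05))
```

The caller is released after the timeout even though the worker thread stays busy until the slow call returns, which is why the pool size still has to be bounded.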
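
The circuit breaker that slide 7 calls the "release valve" can be sketched in Python as follows. This is a simplification under stated assumptions: it trips on consecutive failures, whereas slide 15 mentions rolling 10 second counters, and the class name, thresholds and injectable clock are all hypothetical.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries after a pause."""

    def __init__(self, failure_threshold=3, reset_after_seconds=5.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_after_seconds = reset_after_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self._opened_at is None:
            return True
        # Once the pause has elapsed, let a trial request through.
        return self._clock() - self._opened_at >= self._reset_after_seconds

    def call(self, run, fallback):
        if not self.allow_request():
            return fallback()  # fail fast without touching the dependency
        try:
            result = run()
        except Exception:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self._failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # A success closes the circuit and clears the failure count.
        self._consecutive_failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, requests return the fallback without calling the dependency at all, which gives a struggling dependency room to recover.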
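
The composition pattern on slide 31 (start two asynchronous service calls, combine their results, handle success and error in one place) can be approximated in Python with `asyncio`. This is an analogue, not a port: the slide's `wx.merge` may interleave results as they arrive, while `asyncio.gather` returns them in call order. `get_videos`, `render_videos` and the title format are hypothetical stand-ins for the slide's `api` and `response` objects.

```python
import asyncio


async def get_videos(user, *video_ids):
    """Hypothetical non-blocking service call that returns video records."""
    await asyncio.sleep(0)  # stand-in for network I/O
    return [{"id": video_id, "title": f"title-{video_id}"} for video_id in video_ids]


async def render_videos(user):
    """Run both calls concurrently, then build the response lines or an error."""
    try:
        first, second = await asyncio.gather(
            get_videos(user, 123456, 7891234),
            get_videos(user, 6789543),
        )
    except Exception as exception:
        return 500, [f"Error: {exception}"]
    lines = [f"video: {video['id']} {video['title']}" for video in first + second]
    return 200, lines


status, lines = asyncio.run(render_videos("some-user"))
print(status, lines)
```

As on the slide, the calling code contains no locks or thread handling; concurrency is expressed only through how the calls are composed.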
