Performance and Fault Tolerance for the Netflix API
1) How Netflix does resilience engineering to tolerate failures and latency.
2) How the API architecture is changing so that service endpoints can be optimized for each of the hundreds of unique streaming devices, rather than forcing every device onto the same one-size-fits-all API that is optimized for none.

Presented June 28th, 2012 at the Silicon Valley Cloud Computing Group

http://www.meetup.com/cloudcomputing/events/68006112/

  • The Netflix API serves all streaming devices and acts as the broker between backend Netflix systems and the user interfaces running on the 800+ devices that support Netflix streaming. It receives more than 1 billion incoming calls per day, which in turn fan out to several billion outgoing calls (an average ratio of about 1:7) to dozens of underlying subsystems, with peaks of over 200k dependency requests per second.
  • The first half of the presentation discusses the resilience engineering implemented to handle failure and latency at the integration points with the various dependencies.
  • Even when all dependencies are performing well, the aggregate impact of just 0.01% downtime on each of dozens of services equates to potentially hours of downtime a month if the system is not engineered for resilience.
  • High-volume, high-availability applications must build fault and latency tolerance into their own architecture rather than expect infrastructure to solve it for them.
  • Sample of one dependency circuit over 12 hours from a production cluster, at a rate of 75 rps on a single server. Each execution occurs in a separate thread; the first three legend values are the median, 90th and 99th percentile latencies of that thread. The last three legend values are the median, 90th and 99th percentiles for the calling thread. Thus the median cost of the thread is 1.62ms - 1.57ms = 0.05ms, and at the 90th percentile it is 4.57ms - 2.05ms = 2.52ms.
  • The second half of the presentation discusses architectural changes that enable optimizing the API for each Netflix device, as opposed to a generic one-size-fits-all API that treats all devices the same.
  • Netflix has over 800 unique devices that fall into several dozen classes, each with its own user experience, calling patterns, capabilities and needs from the data, and thus from the API.
  • The one-size-fits-all API results in chatty clients, some requiring about a dozen requests to render a page.
  • The client should make a single request and push the 'chatty' part to the server, where low-latency networks and multi-core servers can perform the work far more efficiently.
  • The client now extends over the network barrier and runs a portion of itself in the server. The client sends requests over HTTP to its other half running in the server, which can then use a Java API at a very granular level to access exactly what it needs and return an optimized response suited to the device's exact requirements and user experience.
  • Concurrency is abstracted away behind an asynchronous API, and data is retrieved, transformed and composed using higher-order functions (such as map, mapMany, merge, zip, take, toList, etc). Groovy is used for its closure support, which lends itself well to the functional programming style.
  • The Netflix API is becoming a platform that empowers user-interface teams to build their own API endpoints that are optimized to their client applications and devices.
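
The arithmetic quoted in the notes and on slides 4, 8 and 9 can be reproduced with a few lines of Java. This is only a sketch of the calculations: the class and method names are hypothetical, the 30-day month is an assumption, and the "breathing room" of 4 threads is simply the difference between the 6 and the 10 given on slide 8.

```java
import java.util.Locale;

// Hypothetical helper; reproduces the deck's back-of-the-envelope numbers.
final class ResilienceMath {

    /** Availability of a request that needs all n dependencies to be up. */
    static double combinedUptime(double perDependencyUptime, int dependencies) {
        return Math.pow(perDependencyUptime, dependencies);
    }

    /** Expected number of failed requests out of totalRequests. */
    static double expectedFailures(double uptime, double totalRequests) {
        return (1.0 - uptime) * totalRequests;
    }

    /** Hours of downtime in a month, assuming a 30-day month. */
    static double downtimeHoursPerMonth(double uptime) {
        return (1.0 - uptime) * 30 * 24;
    }

    /** Threads busy on average (rate x latency), plus spare threads. */
    static int poolSize(double requestsPerSecond, double latencySeconds, int breathingRoom) {
        return (int) Math.ceil(requestsPerSecond * latencySeconds) + breathingRoom;
    }

    /** Extra latency the caller sees because the work ran on another thread. */
    static double threadOverheadMillis(double callingThreadMillis, double executionMillis) {
        return callingThreadMillis - executionMillis;
    }

    public static void main(String[] args) {
        double uptime = combinedUptime(0.9999, 30); // 30 dependencies at 99.99% each
        System.out.printf(Locale.ROOT, "combined uptime: %.2f%%%n", uptime * 100);
        System.out.printf(Locale.ROOT, "failures per 1 billion requests: %.0f%n",
                expectedFailures(uptime, 1e9));
        System.out.printf(Locale.ROOT, "downtime hours per month: %.2f%n",
                downtimeHoursPerMonth(uptime));
        System.out.println("pool size for 30 rps at 0.2 s: " + poolSize(30, 0.2, 4));
        System.out.printf(Locale.ROOT, "median thread overhead: %.2f ms%n",
                threadOverheadMillis(1.62, 1.57));
        System.out.printf(Locale.ROOT, "90th percentile thread overhead: %.2f ms%n",
                threadOverheadMillis(4.57, 2.05));
    }
}
```

The slide rounds the combined uptime to 99.7%, which is where its figures of 3,000,000 failures and 2+ hours a month come from; the unrounded values are slightly lower.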
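
The notes describe each dependency call running in a separate thread, and the slides pair that with small thread pools, short queues and aggressive timeouts. The following is a minimal sketch of that idea using only `java.util.concurrent`; it is not the Netflix DependencyCommand implementation, and the class name `IsolatedDependency` and its parameters are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch: one bounded thread pool per dependency.
final class IsolatedDependency implements AutoCloseable {
    private final ThreadPoolExecutor pool;
    private final long timeoutMillis;

    IsolatedDependency(int threads, int queueSize, long timeoutMillis) {
        // Fixed-size pool with a small bounded queue. The default rejection
        // policy throws RejectedExecutionException once both are full.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize));
        this.timeoutMillis = timeoutMillis;
    }

    <T> T execute(Callable<T> call, Supplier<T> fallback) {
        Future<T> future;
        try {
            future = pool.submit(call);
        } catch (RejectedExecutionException e) {
            return fallback.get(); // pool and queue are full: fail fast
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);   // give up and move on
            return fallback.get();
        } catch (ExecutionException e) {
            return fallback.get(); // the dependency call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback.get();
        }
    }

    @Override
    public void close() {
        pool.shutdownNow();
    }
}
```

A caller would create one instance per dependency, for example `new IsolatedDependency(10, 5, 200)` for the 10-thread, queue-of-5 sizing on slide 8 (the 200 ms timeout is an assumed value), so that a slow dependency can exhaust only its own threads.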
Transcript of "Performance and Fault Tolerance for the Netflix API"

    1. Performance and Fault Tolerance for the Netflix API. Ben Christensen, Software Engineer – API Platform at Netflix. @benjchristensen, http://www.linkedin.com/in/benjchristensen, http://techblog.netflix.com/
    2. Diagram: the Netflix API and its dependencies, Dependency A through Dependency R.
    3. Diagram: the Netflix API and its dependencies, Dependency A through Dependency R.
    4. Dozens of dependencies. One going bad takes everything down. 99.99%^30 = 99.7% uptime. 0.3% of 1 billion = 3,000,000 failures. 2+ hours downtime/month even if all dependencies have excellent uptime. Reality is generally worse.
    5. No single dependency should take down the entire app. Fallback. Fail silent. Fail fast. Shed load.
    6. Options: Aggressive Network Timeouts; Semaphores (Tryable); Separate Threads; Circuit Breaker.
    7. Tryable semaphores for “trusted” clients and fallbacks. Separate threads for “untrusted” clients. Aggressive timeouts on threads and network calls to “give up and move on”. Circuit breakers as the “release valve”.
    8. 30 rps x 0.2 seconds = 6 + breathing room = 10 threads. Thread-pool queue size: 5-10 (0 doesn't work, but get close to it). Thread-pool size + queue size. Queuing is not free.
    9. Cost of Thread @ 75rps: median - 90th - 99th (time in ms). Time for thread to execute. Time user thread waited.
    10. Netflix DependencyCommand Implementation
    11. Netflix DependencyCommand Implementation. Fallbacks: Cache, Eventual Consistency, Stubbed Data, Empty Response.
    12. Netflix DependencyCommand Implementation
    13. So, how does it work in the real world?
    14. Visualizing Circuits in Near-Realtime (latency is single-digit seconds, generally 1-2). Video available at https://vimeo.com/33576628
    15. Rolling 10 second counters. 1 minute latency percentiles. 2 minute rate change. Circle color and size represent health and traffic volume.
    16. API Daily Incoming vs Outgoing (chart with weekends marked). 8-10 Billion DependencyCommand Executions (threaded). 1.2 - 1.6 Billion Incoming Requests.
    17. API Hourly Incoming vs Outgoing. Peak at 700M+ threaded DependencyCommand executions (200k+/second). Peak at 100M+ incoming requests (30k+/second).
    18. Fallback. Fail silent. Fail fast. Shed load.
    19. Diagram: the Netflix API and its dependencies, Dependency A through Dependency R.
    20. Single Network Request from Clients (use LAN instead of WAN). Send Only The Bytes That Matter (optimize responses for each client). Leverage Concurrency (but abstract away its complexity).
    21. Single Network Request from Clients (use LAN instead of WAN). Diagram: Device, Server, Netflix API. Landing page requires ~dozen API requests.
    22. Single Network Request from Clients (use LAN instead of WAN). Some clients are limited in the number of concurrent network connections.
    23. Single Network Request from Clients (use LAN instead of WAN). Network latency makes this even worse (mobile, home, wifi, geographic distance, etc).
    24. Single Network Request from Clients (use LAN instead of WAN). Diagram: Device, Server, Netflix API. Push call pattern to server ...
    25. Single Network Request from Clients (use LAN instead of WAN). Diagram: Device, Server, Netflix API. ... and eliminate redundant calls.
    26. Send Only The Bytes That Matter (optimize responses for each client). Diagram: Device (Client), Server (Client, Netflix API). Part of client now on server.
    27. Send Only The Bytes That Matter (optimize responses for each client). Diagram: Device (Client), Server (Client, Netflix API). Client retrieves and delivers exactly what their device needs in its optimal format.
    28. Send Only The Bytes That Matter (optimize responses for each client). Diagram: Device (Client), Server (Client, Netflix API Service Layer). Interface is now a Java API that client interacts with at a granular level.
    29. Leverage Concurrency (but abstract away its complexity). Diagram: Device (Client), Server (Client, Netflix API Service Layer).
    30. Leverage Concurrency (but abstract away its complexity). Diagram: Device (Client), Server (Client, Netflix API Service Layer). No synchronized, volatile, locks, Futures or Atomic*/Concurrent* classes in client-server code.
    31. Leverage Concurrency (but abstract away its complexity). Service calls are all asynchronous. Functional programming with higher-order functions. Fully asynchronous API - clients can’t block.

            def video1Call = api.getVideos(api.getUser(), 123456, 7891234);
            def video2Call = api.getVideos(api.getUser(), 6789543);

            // higher-order functions used to compose asynchronous calls together
            wx.merge(video1Call, video2Call).toList().subscribe([
                onNext: { listOfVideos ->
                    for(video in listOfVideos) {
                        response.getWriter().println("video: " + video.id + " " + video.title);
                    }
                },
                onError: { exception ->
                    response.setStatus(500);
                    response.getWriter().println("Error: " + exception.getMessage());
                }
            ])

    32. Diagram: Device, Server, Netflix API. Optimize for each device. Leverage the server.
    33. Netflix is Hiring: http://jobs.netflix.com. Fault Tolerance in a High Volume, Distributed System: http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html. Making the Netflix API More Resilient: http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html. Why REST Keeps Me Up At Night: http://blog.programmableweb.com/2012/05/15/why-rest-keeps-me-up-at-night/. Ben Christensen, @benjchristensen, http://www.linkedin.com/in/benjchristensen
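
Slides 6 and 7 list tryable semaphores as the lightweight option for "trusted" clients and for fallbacks. A rough sketch of what "tryable" means, assuming a plain `java.util.concurrent.Semaphore`: the caller never waits for a permit, and a rejected or failed call goes straight to the fallback. The class name `TryableSemaphoreGuard` is hypothetical and is not taken from the Netflix code.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical sketch: limit concurrent calls to a dependency without a thread hop.
final class TryableSemaphoreGuard {
    private final Semaphore permits;

    TryableSemaphoreGuard(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        // tryAcquire() returns immediately; it never queues or blocks the caller.
        if (!permits.tryAcquire()) {
            return fallback.get(); // shed load
        }
        try {
            return call.get();
        } catch (RuntimeException e) {
            return fallback.get(); // fail silent with a fallback value
        } finally {
            permits.release();
        }
    }
}
```

Because the call runs on the caller's own thread, this variant cannot enforce a timeout on its own, which is consistent with the slides reserving it for clients that are trusted to return quickly.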

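
Slide 7 describes circuit breakers as the "release valve". The sketch below shows the general pattern only, under simplifying assumptions: it trips after a number of consecutive failures, whereas slide 15 indicates that the production system works from rolling 10 second counters. The class name, the threshold, the sleep window and the injected clock are all hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch of a circuit breaker; not the Netflix implementation.
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long sleepWindowMillis;
    private final LongSupplier clock;   // injected so the breaker is easy to test
    private int consecutiveFailures = 0;
    private long openedAt = -1;         // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    private synchronized boolean allowRequest() {
        if (openedAt < 0) {
            return true;
        }
        // While open, reject until the sleep window has passed, then allow trial calls.
        return clock.getAsLong() - openedAt >= sleepWindowMillis;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong(); // trip, or re-trip after a failed trial call
        }
    }

    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // fail fast without touching the dependency
        }
        try {
            T result = call.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback.get();
        }
    }
}
```

In normal use the clock would be `System::currentTimeMillis`. One simplification to be aware of: once the sleep window has passed, every caller is let through until one of them fails, rather than a single trial request.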