Performance and Fault Tolerance for the Netflix API
1) How Netflix does resilience engineering to tolerate failures and latency.
2) How the API architecture changed so that service endpoints can be optimized for each of the hundreds of unique streaming devices, rather than making all devices use the same one-size-fits-all approach that is optimized for none.

Presented June 28th 2012 at Silicon Valley Cloud Computing Group

http://www.meetup.com/cloudcomputing/events/68006112/

  • The slide notes can also be read on a single page at https://speakerdeck.com/u/benjchristensen/p/performance-and-fault-tolerance-for-the-netflix-api, since SlideShare comments strip line breaks and the description has a character limit.
  • The Netflix API serves all streaming devices and acts as the broker between backend Netflix systems and the user interfaces running on the 800+ devices that support Netflix streaming.
    More than 1 billion incoming calls are received per day, which in turn fan out to several billion outgoing calls (an average ratio of 1:7) to dozens of underlying subsystems, with peaks of over 200k dependency requests per second.
  • The first half of the presentation discusses the resilience engineering implemented to handle failure and latency at the integration points with the various dependencies.
  • Even when all dependencies are performing well, the aggregate impact of even 0.01% downtime on each of dozens of services equates to potentially hours of downtime a month if the system is not engineered for resilience.
  • High-volume, high-availability applications must build fault and latency tolerance into their own architecture rather than expect the infrastructure to solve it for them.
  • Sample of 1 dependency circuit over 12 hours from a production cluster, at a rate of 75rps on a single server.
    Each execution occurs in a separate thread; its median, 90th and 99th percentile latencies are the first 3 legend values.
    The calling thread's median, 90th and 99th percentile latencies are the last 3 legend values.
    Thus, the median cost of using the thread is 1.62ms - 1.57ms = 0.05ms, and at the 90th percentile it is 4.57ms - 2.05ms = 2.52ms.
  • The second half of the presentation discusses architectural changes that allow the API to be optimized for each Netflix device, as opposed to a generic one-size-fits-all API that treats all devices the same.
  • Netflix has over 800 unique devices that fall into several dozen classes, each with its own user experience, calling patterns, capabilities and data needs, and thus different needs from the API.
  • The one-size-fits-all API results in chatty clients, some requiring about a dozen requests to render a page.
  • The client should make a single request and push the 'chatty' part to the server, where low-latency networks and multi-core servers can perform the work far more efficiently.
  • The client now extends over the network barrier and runs a portion of itself in the server. The client sends requests over HTTP to its other half running in the server, which can then use a Java API at a very granular level to access exactly what it needs and return an optimized response suited to the device's exact requirements and user experience.
  • Concurrency is abstracted away behind an asynchronous API, and data is retrieved, transformed and composed using higher-order functions (such as map, mapMany, merge, zip, take, toList, etc.). Groovy is used for its closure support, which lends itself well to the functional programming style.
  • The Netflix API is becoming a platform that empowers user-interface teams to build their own API endpoints that are optimized for their client applications and devices.
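
The availability note above (0.01% downtime on each of dozens of services adding up to hours a month) can be checked with a short calculation. The following is a minimal Java sketch, assuming 30 dependencies that fail independently, a 30-day month and 1 billion requests, figures taken from the slides. The class and method names are made up for illustration.

```java
// Illustrative arithmetic only; CompoundAvailability is a hypothetical name.
final class CompoundAvailability {

    /** Chance that every dependency is up at once, assuming independent failures. */
    static double combinedUptime(double perDependencyUptime, int dependencies) {
        return Math.pow(perDependencyUptime, dependencies);
    }

    /** Expected number of failed requests out of `requests`. */
    static long expectedFailures(double combinedUptime, long requests) {
        return Math.round((1.0 - combinedUptime) * requests);
    }

    /** Equivalent downtime in hours over a period of `days`. */
    static double downtimeHours(double combinedUptime, int days) {
        return (1.0 - combinedUptime) * days * 24.0;
    }

    public static void main(String[] args) {
        double uptime = combinedUptime(0.9999, 30); // 99.99% uptime each, 30 dependencies
        System.out.printf("combined uptime: %.2f%%%n", uptime * 100);
        System.out.printf("failures per 1 billion requests: %d%n",
                expectedFailures(uptime, 1_000_000_000L));
        System.out.printf("downtime per 30 days: %.2f hours%n", downtimeHours(uptime, 30));
    }
}
```

Under these assumptions the combined uptime comes out near 99.7%, which is roughly 3 million failed requests per billion and a little over 2 hours of downtime a month.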

Performance and Fault Tolerance for the Netflix API Presentation Transcript

  • 1. Performance and Fault Tolerance for the Netflix API. Ben Christensen, Software Engineer – API Platform at Netflix. @benjchristensen http://www.linkedin.com/in/benjchristensen http://techblog.netflix.com/
  • 2. Netflix API (diagram): Dependency A through Dependency R.
  • 3. Netflix API (diagram): Dependency A through Dependency R.
  • 4. Dozens of dependencies. One going bad takes everything down. (99.99%)^30 = 99.7% uptime. 0.3% of 1 billion = 3,000,000 failures. 2+ hours downtime/month even if all dependencies have excellent uptime. Reality is generally worse.
  • 5. No single dependency should take down the entire app. Fallback. Fail silent. Fail fast. Shed load.
  • 6. Options: Aggressive Network Timeouts. Semaphores (Tryable). Separate Threads. Circuit Breaker.
  • 7. Tryable semaphores for “trusted” clients and fallbacks. Separate threads for “untrusted” clients. Aggressive timeouts on threads and network calls to “give up and move on”. Circuit breakers as the “release valve”.
  • 8. 30 rps x 0.2 seconds = 6 + breathing room = 10 threads. Thread-pool Queue size: 5-10 (0 doesn't work but get close to it). Thread-pool Size + Queue Size. Queuing is Not Free.
  • 9. Cost of Thread @ 75rps (chart): median - 90th - 99th (time in ms). Series: time for thread to execute; time user thread waited.
  • 10. Netflix DependencyCommand Implementation
  • 11. Netflix DependencyCommand Implementation. Fallbacks: Cache, Eventual Consistency, Stubbed Data, Empty Response.
  • 12. Netflix DependencyCommand Implementation
  • 13. So, how does it work in the real world?
  • 14. Visualizing Circuits in Near-Realtime (latency is single-digit seconds, generally 1-2). Video available at https://vimeo.com/33576628
  • 15. Rolling 10 second counters. 1 minute latency percentiles. 2 minute rate change. Circle color and size represent health and traffic volume.
  • 16. API Daily Incoming vs Outgoing (chart, weekends marked): 8-10 Billion DependencyCommand Executions (threaded). 1.2 - 1.6 Billion Incoming Requests.
  • 17. API Hourly Incoming vs Outgoing (chart): Peak at 700M+ threaded DependencyCommand executions (200k+/second). Peak at 100M+ incoming requests (30k+/second).
  • 18. Fallback. Fail silent. Fail fast. Shed load.
  • 19. Netflix API (diagram): Dependency A through Dependency R.
  • 20. Single Network Request from Clients (use LAN instead of WAN). Send Only The Bytes That Matter (optimize responses for each client). Leverage Concurrency (but abstract away its complexity).
  • 21. Single Network Request from Clients (use LAN instead of WAN). Diagram: Device, Server, Netflix API. Landing page requires ~dozen API requests.
  • 22. Single Network Request from Clients (use LAN instead of WAN). Some clients are limited in the number of concurrent network connections.
  • 23. Single Network Request from Clients (use LAN instead of WAN). Network latency makes this even worse (mobile, home, wifi, geographic distance, etc).
  • 24. Single Network Request from Clients (use LAN instead of WAN). Diagram: Device, Server, Netflix API. Push call pattern to server ...
  • 25. Single Network Request from Clients (use LAN instead of WAN). Diagram: Device, Server, Netflix API. ... and eliminate redundant calls.
  • 26. Send Only The Bytes That Matter (optimize responses for each client). Diagram: Device, Server, Netflix API, Client. Part of client now on server.
  • 27. Send Only The Bytes That Matter (optimize responses for each client). Diagram: Device, Server, Netflix API, Client. Client retrieves and delivers exactly what their device needs in its optimal format.
  • 28. Send Only The Bytes That Matter (optimize responses for each client). Diagram: Device, Server, Netflix API, Service Layer, Client. Interface is now a Java API that client interacts with at a granular level.
  • 29. Leverage Concurrency (but abstract away its complexity). Diagram: Device, Server, Netflix API, Service Layer, Client.
  • 30. Leverage Concurrency (but abstract away its complexity). Diagram: Device, Server, Netflix API, Service Layer, Client. No synchronized, volatile, locks, Futures or Atomic*/Concurrent* classes in client-server code.
  • 31. Leverage Concurrency (but abstract away its complexity). Service calls are all asynchronous. Functional programming with higher-order functions. Fully asynchronous API - Clients can’t block.
        def video1Call = api.getVideos(api.getUser(), 123456, 7891234);
        def video2Call = api.getVideos(api.getUser(), 6789543);
        // higher-order functions used to compose asynchronous calls together
        wx.merge(video1Call, video2Call).toList().subscribe([
            onNext: { listOfVideos ->
                for(video in listOfVideos) {
                    response.getWriter().println("video: " + video.id + " " + video.title);
                }
            },
            onError: { exception ->
                response.setStatus(500);
                response.getWriter().println("Error: " + exception.getMessage());
            }
        ])
  • 32. Diagram: Device, Server, Netflix API. Optimize for each device. Leverage the server.
  • 33. Netflix is Hiring http://jobs.netflix.com
    Fault Tolerance in a High Volume, Distributed System http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
    Making the Netflix API More Resilient http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
    Why REST Keeps Me Up At Night http://blog.programmableweb.com/2012/05/15/why-rest-keeps-me-up-at-night/
    Ben Christensen @benjchristensen http://www.linkedin.com/in/benjchristensen
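
Slides 5 to 8 combine several techniques: run each dependency call on its own bounded thread pool, give up after an aggressive timeout, return a fallback, and trip a circuit breaker when failures pile up. The sketch below shows one way these pieces could fit together in Java. It is not the Netflix DependencyCommand implementation: GuardedDependency, its parameters and the consecutive-failure rule are assumptions made for illustration. The breaker here stays open until reset() is called, whereas a fuller breaker would typically let a trial request through after a pause. The tryable-semaphore option from slide 7 is not covered.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch: NOT the Netflix DependencyCommand implementation.
final class GuardedDependency<T> {
    private final ExecutorService pool;
    private final long timeoutMs;
    private final int failureThreshold;
    private final AtomicInteger consecutiveFailures = new AtomicInteger();

    GuardedDependency(int threads, int queueSize, long timeoutMs, int failureThreshold) {
        // Fixed-size pool with a small bounded queue (queueSize must be >= 1):
        // once both are full, submit() throws and the call is shed.
        this.pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(queueSize),
                runnable -> {
                    Thread t = new Thread(runnable);
                    t.setDaemon(true); // worker threads do not keep the JVM alive
                    return t;
                },
                new ThreadPoolExecutor.AbortPolicy());
        this.timeoutMs = timeoutMs;
        this.failureThreshold = failureThreshold;
    }

    /** Slide 8 sizing rule: peak rps x latency, rounded up, plus breathing room. */
    static int poolSize(int peakRps, int latencyMs, int breathingRoom) {
        return (peakRps * latencyMs + 999) / 1000 + breathingRoom;
    }

    /** The circuit is open after `failureThreshold` failures in a row. */
    boolean circuitOpen() {
        return consecutiveFailures.get() >= failureThreshold;
    }

    /** Manually closes the circuit again. */
    void reset() {
        consecutiveFailures.set(0);
    }

    T execute(Callable<T> call, Supplier<T> fallback) {
        if (circuitOpen()) {
            return fallback.get(); // fail fast without touching the dependency
        }
        Future<T> future;
        try {
            future = pool.submit(call); // isolate the call on a separate thread
        } catch (RejectedExecutionException e) {
            consecutiveFailures.incrementAndGet();
            return fallback.get(); // pool and queue are full: shed load
        }
        try {
            T result = future.get(timeoutMs, TimeUnit.MILLISECONDS);
            consecutiveFailures.set(0); // a success resets the failure streak
            return result;
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true); // give up and move on
            consecutiveFailures.incrementAndGet();
            return fallback.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt
            future.cancel(true);
            return fallback.get();
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

For the figures on slide 8, GuardedDependency.poolSize(30, 200, 4) gives 10 threads, with 4 standing in for the unspecified "breathing room".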