Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Maintaining the Front Door to Netflix : The Netflix API

37,216 views

Published on

This presentation was given to the engineering organization at Zendesk. In this presentation, I talk about the challenges that the Netflix API faces in supporting the 1000+ different device types, millions of users, and billions of transactions. The topics range from resiliency, scale, API design, failure injection, continuous delivery, and more.

Published in: Technology
  • Hello! Get Your Professional Job-Winning Resume Here - Check our website! https://vk.cc/818RFv
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Maintaining the Front Door to Netflix : The Netflix API

  1. Maintaining the Front Door to Netflix Daniel Jacobson @daniel_jacobson http://www.linkedin.com/in/danieljacobson http://www.slideshare.net/danieljacobson
  2. There are copious notes attached to each slide in this presentation. Please read those notes to get the full context of the presentation
  3. Global Streaming Video for TV Shows and Movies
  4. More than 44 Million Subscribers More than 40 Countries
  5. Netflix Accounts for ~33% of Peak Internet Traffic in North America Netflix subscribers are watching more than 1 billion hours a month
  6. Team Focus: Build the Best Global Streaming Product Three aspects of the Streaming Product: • Non-Member • Discovery • Streaming
  7. Key Responsibilities • Broker data between services and UIs • Maintain a resilient front-door • Scale the system vertically and horizontally • Maintain high velocity
  8. But Before Streaming…
  9. Monolithic Application In Netflix Data Centers
  10. The bigger the ship… the slower it turns
  11. Distributed Architecture
  12. 1000+ Device Types
  13. Personaliz ation Engine User Info Movie Metadata Movie Ratings Similar Movies Reviews A/B Test Engine Dozens of Dependencies
  14. Personaliz ation Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
  15. Dependency Relationships
  16. 2,000,000,000 Requests Per Day to the Netflix API
  17. 30 Distinct Dependent Services for the Netflix API
  18. ~500 Dependency jars Slurped into the Netflix API
  19. 14,000,000,000 Netflix API Calls Per Day to those Dependent Services
  20. 0 Dependent Services with 100% SLA
  21. 99.99% = 99.7%30 0.3% of 2B = 6M failures per day 2+ Hours of Downtime Per Month
  22. 99.99% = 99.7%30 0.3% of 2B = 6M failures per day 2+ Hours of Downtime Per Month
  23. 99.9% = 97%30 3% of 2B = 60M failures per day 20+ Hours of Downtime Per Month
  24. Personaliz ation Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
  25. Personaliz ation Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
  26. Personaliz ation Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
  27. Personaliz ation Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
  28. Personaliz ation Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
  29. Circuit Breaker Dashboard
  30. Call Volume and Health / Last 10 Seconds
  31. Call Volume / Last 2 Minutes
  32. Successful Requests
  33. Successful, But Slower Than Expected
  34. Short-Circuited Requests, Delivering Fallbacks
  35. Timeouts, Delivering Fallbacks
  36. Thread Pool & Task Queue Full, Delivering Fallbacks
  37. Exceptions, Delivering Fallbacks
  38. Error Rate # + # + # + # / (# + # + # + # + #) = Error Rate
  39. Status of Fallback Circuit
  40. Requests per Second, Over Last 10 Seconds
  41. SLA Information
  42. Personaliz ation Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
  43. Personaliz ation Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
  44. Personaliz ation Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
  45. Personaliz ation Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine Fallback
  46. Personaliz ation Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine Fallback
  47. Scaling the Distributed System
  48. AWS Cloud
  49. Autoscaling
  50. Autoscaling
  51. Amazon Auto Scaling Limitations • Hard to fit policies to variable traffic patterns (weekday vs weekend) • Limited control over capacity adjustments (absolute value or %)
  52. The Impact of AAS Limitations • Traffic drop can lead to scale downs during outage • Performance degradation between new instance launch and taking traffic • Excess capacity at peak and trough
  53. Scryer : Predictive Auto Scaling Not yet…
  54. Typical Traffic Patterns Over Five Days
  55. Predicted RPS Compared to Actual RPS
  56. Scaling Plan for Predicted Workload
  57. What is Scryer Doing? • Evaluating needs based on historical data – Week over week, month over month metrics • Adjusts instance minimums based on algorithms • Relies on Amazon Auto Scaling for unpredicted events
  58. Results
  59. Results : Load Average Reactive Predictive
  60. Results : Response Latencies Reactive Predictive
  61. Results : Outage Recovery
  62. Results : Outage Recovery
  63. Results : AWS Costs
  64. Scaling Globally
  65. More than 44 Million Subscribers More than 40 Countries
  66. Zuul Gatekeeper for the Netflix Streaming Application
  67. Zuul * • Multi-Region Resiliency • Insights • Stress Testing • Canary Testing • Dynamic Routing • Load Shedding • Security • Static Response Handling • Authentication * Most closely resembles an API proxy
  68. Isthmus
  69. All of these approaches are designed to prevent failures…
  70. But sometimes the best way to prevent failures is to force them!
  71. I randomly terminate instances in production to identify dormant failures. Chaos Monkey
  72. Chaos Gorilla I simulate an outage of an entire Amazon availability zone.
  73. I simulate an outage in an AWS region. Chaos Kong
  74. I find instances that don’t adhere to best practices. Conformity Monkey
  75. I extend Conformity Monkey to find security violations. Security Monkey
  76. I detect unhealthy instances and remove them from service. Doctor Monkey
  77. I clean up the clutter and waste that runs in the cloud. Janitor Monkey
  78. I induce artificial delays and errors into services to determine how upstream services will respond. Latency Monkey
  79. Deployments in the Cloud
  80. Dependency Relationships
  81. Testing Philosophy: Act Fast, React Fast
  82. That Doesn’t Mean We Don’t Test
  83. Automated Delivery Pipeline
  84. Cloud-Based Deployment Techniques
  85. Current Code In Production API Requests from the Internet
  86. Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic) Current Code In Production API Requests from the Internet
  87. Canary Analysis Automation
  88. Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic) Current Code In Production API Requests from the Internet Error!
  89. Current Code In Production API Requests from the Internet
  90. Current Code In Production API Requests from the Internet
  91. Current Code In Production API Requests from the Internet Perfect!
  92. Stress Test with Zuul
  93. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production
  94. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production
  95. Error! Current Code In Production API Requests from the Internet New Code Getting Prepared for Production
  96. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production
  97. Current Code In Production API Requests from the Internet Perfect!
  98. Stress Test with Zuul
  99. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production
  100. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production
  101. API Requests from the Internet New Code Getting Prepared for Production
  102. Brokering Data to 1,000+ Device Types
  103. Screen Real Estate
  104. Controller
  105. Technical Capabilities
  106. One-Size-Fits-All API Request Request Request
  107. Courtesy of South Florida Classical Review
  108. Resource-Based API vs. Experience-Based API
  109. Resource-Based Requests • /users/<id>/ratings/title • /users/<id>/queues • /users/<id>/queues/instant • /users/<id>/recommendations • /catalog/titles/movie • /catalog/titles/series • /catalog/people
  110. REST API RECOMME NDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START- UP RATINGS Network Border Network Border
  111. RECOMME NDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START- UP RATINGS OSFA API Network Border Network Border SERVER CODE CLIENT CODE
  112. RECOMME NDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START- UP RATINGS OSFA API Network Border Network Border DATA GATHERING, FORMATTING, AND DELIVERY USER INTERFACE RENDERING
  113. Experience-Based Requests • /ps3/homescreen
  114. JAVA API Network Border Network Border RECOMME NDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START- UP RATINGS Groovy Layer
  115. RECOMME NDATIONSA ZXSXX C CCC MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START- UP RATINGS JAVA API SERVER CODE CLIENT CODE CLIENT ADAPTER CODE (WRITTEN BY CLIENT TEAMS, DYNAMICALLY UPLOADED TO SERVER) Network Border Network Border
  116. RECOMME NDATIONSA ZXSXX C CCC MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START- UP RATINGS JAVA API DATA GATHERING DATA FORMATTING AND DELIVERY USER INTERFACE RENDERING Network Border Network Border
  117. https://www.github.com/Netflix
  118. Maintaining the Front Door to Netflix Daniel Jacobson @daniel_jacobson http://www.linkedin.com/in/danieljacobson http://www.slideshare.net/danieljacobson

×