The new Netflix API

  1. The new Netflix API: Why more complexity must lead to more simplicity. Katharina Probst, DevNexus 2017
  2. Today’s architecture [Diagram: Gateway; API server (JVM) running Groovy scripts and client libs A to Z; Netflix microservices; network boundaries between tiers; further labels: “JS (mostly)”, “Java”]
  3. What is the Netflix
  4. Raison d’Être
  5. Raison d’Être: Is the API just one gigantic translation layer? Is it a routing layer? If it’s too complex, can we just get rid of it?
  6. Raison d’Être: 1. Orchestration 2. Availability protection 3. Abstraction
  7. 1. Orchestration
  8. Simple example: search
  9. RelatedTerms
  10. People
  11. Titles
  12. Search request → response ● Search service provides related search terms ● Search service provides IDs for videos and people ○ IDs depend on various factors, e.g., different catalogs in different countries ● For each ID, we need metadata ○ Titles ○ Images ○ Names ○ Ratings ○ etc. ● ..., which depends on ○ Country ○ A/B tests the user is in ○ etc. Response: ❏ Hydrated videos ❏ People names ❏ Query suggestions
  13. Orchestration ● Own order of operations ● Provide whatever info clients/services need ○ From other clients/libraries/services ○ From request ● Merge partial results ● Filter results ● Retrieve more info if necessary ● Support mutations (e.g., profile switch) ● Support complex transactions in a limited way
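The search flow on slides 12 and 13 can be sketched roughly as follows. This is only an illustration under stated assumptions: `SearchOrchestrator`, `searchIds`, `relatedTerms`, `titleFor`, and the sample data are hypothetical in-process stand-ins, not Netflix services or APIs. It shows the API owning the order of operations (IDs before metadata), filtering IDs that cannot be hydrated, and merging partial results into one response.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Hypothetical sketch: every name and value here is illustrative.
class SearchOrchestrator {
    // Stand-in for the search service: IDs depend on the country's catalog.
    static List<Integer> searchIds(String query, String country) {
        return "US".equals(country) ? List.of(1, 2, 3) : List.of(1, 3);
    }

    // Stand-in for the search service's related-terms lookup.
    static List<String> relatedTerms(String query) {
        return List.of(query + " movies", query + " series");
    }

    // Stand-in for a metadata service; ID 2 deliberately has no title.
    static final Map<Integer, String> TITLES = Map.of(1, "Title One", 3, "Title Three");

    static String titleFor(int id) {
        return TITLES.get(id);
    }

    static final class SearchResponse {
        final List<String> titles;
        final List<String> suggestions;

        SearchResponse(List<String> titles, List<String> suggestions) {
            this.titles = titles;
            this.suggestions = suggestions;
        }
    }

    static SearchResponse search(String query, String country) {
        // Independent calls start concurrently.
        CompletableFuture<List<String>> suggestions =
            CompletableFuture.supplyAsync(() -> relatedTerms(query));
        CompletableFuture<List<String>> titles =
            CompletableFuture.supplyAsync(() -> searchIds(query, country))
                // Hydration needs the IDs, so it runs after the ID lookup.
                .thenApply(ids -> ids.stream()
                    .map(SearchOrchestrator::titleFor)
                    .filter(title -> title != null) // drop IDs without metadata
                    .collect(Collectors.toList()));
        // Merge the partial results into a single response.
        return titles.thenCombine(suggestions, SearchResponse::new).join();
    }
}
```

Calling `SearchOrchestrator.search("dark", "US")` in this sketch yields two hydrated titles and two query suggestions; a real orchestration layer would also pass request context (country, A/B allocations) to each downstream call.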
  14. 2. Availability protection
  15. Prevent this as much as possible
  16. What do customers want? ● No personalized recommendations, or no ability to stream? ● No search, or no ability to continue watching the movie you started last night? ● No cutting-edge A/B experiment experience, or no ability to stream?
  17. Top priority: customer experience ● Top priority of top priority: customer can stream videos ● This means API cannot go down entirely ○ If it does, we have an outage ● But some services are not critical to this mission ○ A/B - if we don’t know what A/B tests you’re in, you can still get the default experience ○ Search - if you can’t search, you can still browse
  18. Exposure to failures ● As your app grows, your set of dependencies is much more likely to get bigger, not smaller ● Overall uptime = (Dep uptime)^(num deps)
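As a worked instance of the formula on slide 18, assume (hypothetically; these numbers are not from the talk) 30 dependencies that each have 99.99% uptime and fail independently:

```latex
\[
\text{Overall uptime} = (0.9999)^{30} \approx 0.9970
\]
\[
\text{Expected downtime in a 30-day month} \approx (1 - 0.9970) \times 720\ \text{h} \approx 2.2\ \text{h}
\]
```

Even with every dependency at "four nines", the combined service sits at roughly 99.7% unless failures are isolated.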
  19. Hystrix ● Fault-tolerance pattern as a library ● Provides operational insights in real-time ● Automatic load-shedding under pressure
  20-22. Availability protection [Diagram, repeated across three slides: Gateway; API with scripts and client libs (Search, Ratings, Cust, B, N, Z, ...); Search, Ratings, Customers, ... services; network boundaries]
  23. If you don’t plan for failure [Diagram: same Gateway, API, client libs, and Search, Ratings, Customers services as slide 20]
  24. If you do plan for failure: no search results >> no Netflix [Diagram: same layout as slide 20]
  25. Fallbacks: return static or stale rating [Diagram: same layout as slide 20]
  26. How to handle errors: return getRatings(id);
  27. How to handle errors: try { return getRatings(id); } catch (Exception ex) { /* static value */ return null; }
  28. How to handle errors: try { return getRatings(id); } catch (Exception ex) { /* TODO: what to return here? */ }
  29. Handle errors with fallbacks ● Some options for fallbacks ○ Static value ○ Value from in-memory ○ Value from cache ○ Value from network ○ Throw ○ Code ● Make error-handling explicit ● Applications have to work in the presence of either fallbacks or rethrown exceptions
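One way to make the fallback options on slide 29 explicit is sketched below in plain Java. It does not use the Hystrix API; `RatingsClient`, `DEFAULT_RATING`, and the `remote` function are hypothetical names introduced for illustration. The sketch tries the remote call first, then falls back to the last value seen in memory, then to a static value.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntFunction;

// Hypothetical sketch: these names are illustrative, not part of Hystrix
// or any Netflix library.
class RatingsClient {
    static final double DEFAULT_RATING = 3.0; // assumed static fallback value

    private final Map<Integer, Double> lastKnown = new ConcurrentHashMap<>();
    private final IntFunction<Double> remote;  // stand-in for the network call

    RatingsClient(IntFunction<Double> remote) {
        this.remote = remote;
    }

    double getRatings(int id) {
        try {
            double rating = remote.apply(id);
            lastKnown.put(id, rating);         // remember for stale fallbacks
            return rating;
        } catch (RuntimeException ex) {
            return fallback(id);               // explicit, named fallback path
        }
    }

    // Fallback order: stale in-memory value first, then the static default.
    double fallback(int id) {
        return lastKnown.getOrDefault(id, DEFAULT_RATING);
    }
}
```

Keeping the fallback in its own method makes the error-handling decision visible to reviewers, and callers still have to behave sensibly when they receive a stale or default rating.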
  30. Availability protection beyond Hystrix ● Throttling ● Retries ● Timeouts ● Canaries ● Regional rollouts ● Traffic shifting ● Outlier detection (and elimination) ● Advanced load balancing
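Two of the items on slide 30, retries and timeouts, can be combined in a few lines of standard-library Java. This is a minimal sketch, not how Netflix implements them: `Resilient` and `callWithRetry` are hypothetical names, and production code would normally add backoff between attempts and a retry budget.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper, for illustration only.
class Resilient {
    // Runs `call` up to `maxAttempts` times, waiting at most `timeoutMs`
    // for each attempt, and returns `fallback` if every attempt fails.
    static <T> T callWithRetry(Supplier<T> call, int maxAttempts, long timeoutMs, T fallback) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            CompletableFuture<T> future = CompletableFuture.supplyAsync(call);
            try {
                return future.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException | ExecutionException ex) {
                // Stop waiting on this attempt; the underlying task is not interrupted.
                future.cancel(true);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                break;
            }
        }
        return fallback;
    }
}
```

For example, `Resilient.callWithRetry(() -> fetchRating(id), 3, 200, defaultRating)` would try a hypothetical `fetchRating` three times with a 200 ms budget each before returning the default.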
  31. 3. Abstraction
  32. Abstraction goals ● Shield all device teams from every single mid-tier change … at least for a time. Allows us to move more independently ● Shield all device teams from every single platform/infrastructure change ● Provide APIs not provided by downstream services ○ Find all movies that... ○ Length of movie ● Implementation flexibility, e.g., ○ Caching ○ Batch APIs
  33. Abstraction challenges ● Tech debt ● Device teams can have black-box view (“api == cloud”) ● But isn’t the API team the bottleneck? ○ Yes, sometimes. But organizational structure makes this less of a problem than m mid-tier teams dealing with n device teams ● But: separation of concerns
  34. Server-side logic
  35. Reminder: Today’s architecture [Diagram: Gateway; API with scripts and client libs A to Z; Netflix microservices; network boundaries; label “~2100 active”]
  36. Device teams write server-side logic ● Decoupling teams → better velocity ● UI teams are empowered to ○ Change presentation ○ Filter ○ Add users to A/B tests, which then leads to e.g., different layout.
  37. What if we didn’t have an API?
  38. What if? Implications for device teams: Device teams own client-side applications … [Diagram: Gateway; scripts; client libs A to Z; Netflix microservices; network boundaries]
  39. What if? Implications for device teams: ...and groovy scripts [Diagram: same layout as slide 38]
  40. What if? Implications for device teams ● Each device team would have to own ○ Orchestration ○ Frequent dependency updates (currently attempted daily) ○ Higher-level APIs (all movies that…) ○ Fallbacks and other resiliency protection (e.g., timeouts, retries) ● Recent example ○ Library upgrade caused a lot of NPEs -- why? ○ Worked with team to find out why ○ When fixed, no more NPEs, but instead performance degradation ● Should all teams be dealing with this?
  41. What if? Implications for service teams: Service teams own services... [Diagram: Gateway; scripts; client libs A to Z; Netflix microservices; network boundaries]
  42. What if? Implications for service teams: ...and client libraries [Diagram: same layout as slide 41]
  43. What if? Implications for service teams ● Can only make breaking changes if all device teams who use their service upgrade ● Don’t get resiliency protection (e.g., timeouts, load balancing, retries, fallbacks) unless all device teams who use their service provide it ● Should all teams be dealing with this?
  44. What if? Implications for Netflix ● Lower velocity due to tight coupling between many mid-tier teams and many device teams
  45. OR: THE DOWNSIDE OF CENTRALIZATION
  46. Where are we today? ● Principle: don’t repeat logic ○ It’s better to do it once in API than do it n times for n devices. ● Principle is good, but leads to complexity
  47. What complexity challenges do we have?
  48. Complexity challenges ● Frequent (not always canaried) updates to a critical service in production ● Difficulty of debugging (esp. for groovy script writers) ● Slow server startup times ● Lack of operational insights into script resource consumption ● Difficulty of performance profiling ● Lack of feedback loop ● Decoupled code versioning and transitive dependencies
  49. Where are we going next?
  50. Top priorities ● Move groovy scripts out ● Split up API
  51. New architecture: Edge PaaS [Diagram: Gateway; EAS; Node apps on NodeQuark, running on Titus; client libs A to Z (shown twice); Netflix microservices; network boundaries]
  52. New architecture: Edge PaaS [Diagram: same layout as slide 51] Edge Auth Service ● Auth termination ● Centralized place for auth Edge PaaS: ● Platform for node scripts ● Developer tooling for entire SDLC ● Remote API with optimized data access (Falcor)
  53. Two APIs
  54. Two (or more) APIs: split API by function [Diagram: Gateway; EAS; Node apps on NodeQuark, running on Titus; DNA clients and DNA services; PB clients and PB services; shared clients and shared services; network boundaries]
  55. NodeQuark Platform
  56. NodeQuark Platform: platform for node scripts [Diagram: Zuul; EAS; Node apps on NodeQuark, running on Titus; client libs A to Z (shown twice); Netflix microservices; network boundaries; further label: “Java”]
  57. Edge PaaS: Node Platform ● Node apps run in containers on Titus platform ● Node Platform provides ○ Integration into Netflix ecosystem (e.g., discovery) ○ Logging ○ Dashboards, metrics out of the box with option to customize ○ Support for mocking and testing ● Titus provides ○ Scheduling ○ Autoscaling
  58. Developer experience
  59. New architecture: Edge PaaS, developer tooling for entire SDLC [Diagram: Gateway; EAS; Node apps on NodeQuark, running on Titus; client libs A to Z (shown twice); Netflix microservices; network boundaries; further label: “Java”]
  60. Edge PaaS: Developer tooling ● Command line tool for node apps ○ Setup ○ Starting apps ○ Deploying apps ● Local development and debugging of node apps ● UI for lifecycle management, e.g., version management ● One-click rollouts, one-click rollbacks ● Versioning
  61. Remote API
  62. New architecture: Edge PaaS, remote API with optimized data access [Diagram: Zuul; EAS; Node apps on NodeQuark, running on Titus; client libs A to Z (shown twice); Netflix microservices; network boundaries]
  63. Edge PaaS: Remote API ● API still takes care of ○ Orchestration ○ Resiliency protection ○ Abstraction ● Optimized access with Falcor ○ “RESTful composition” with caching ● Binary transport ● Future: channel support
  64. Greater simplicity
  65. Isolated failures: scripts don’t affect each other (usually) [Diagram labels: API; “Temporarily unavailable!”]
  66. Independent root causing [Diagram labels: API; “Latency spike after push: 150ms”; “Average latency: 10ms”]
  67. Independent autoscaling [Diagram label: API]
  68. Independent insights [Diagram labels: API; “Average latency: 50ms”; “Average latency: 10ms”]
  69. Better regression/performance testing: tests not affected by other scripts eating up resources on the same JVM [Diagram label: API]
  70. Conclusion
  71. Complexity and simplicity ● Product has become much more complex ○ Scripts (more scripts, more complex scripts) ○ Features ○ Number of downstream services to integrate ○ More personalization ○ etc. ● Complexity of API service is high → Need to optimize for simplicity now ○ Process isolation ○ Cleaner developer experience
  72. END