1. The new Netflix API aims to provide orchestration of services, availability protection, and abstraction for client libraries and device teams.
2. To address complexity challenges, Netflix plans to move scripts out of the API and split the API into separate services for authentication and an edge platform for scripts.
3. This will reduce complexity, improve debugging and profiling, and allow faster independent development while still providing higher level APIs and resiliency across services.
Introduction to the Netflix API architecture and its complexity challenges, highlighting the need for orchestration and availability.
API's purpose defined by orchestration, availability protection, and abstraction, which are essential for effective service management.
Explanation of search requests, metadata requirements, and orchestration processes for search results.
Strategies for ensuring API reliability, including customer experience, fault tolerance with Hystrix, and handling service dependencies.
Techniques for managing errors in API services using fallbacks and making error-handling explicit to maintain functionality.
Identifying complexity challenges such as debugging and operational insights while discussing future priorities for simplification.
Introduction to the Edge PaaS architecture, emphasizing node applications, enhanced developer tools, and remote API capabilities.
Emphasis on independent failures, monitoring performance metrics, and ensuring better regression testing to enhance API reliability.
Reflection on the increasing complexity of Netflix’s API and the emphasis on simplifying the service architecture.
Search request → response
● Search service provides related search terms
● Search service provides IDs for videos and people
○ IDs depend on various factors, e.g., different catalogs in different countries
● For each ID, we need metadata
○ Titles
○ Images
○ Names
○ Ratings
○ etc.
● Metadata, in turn, depends on
○ Country
○ A/B tests the user is in
○ etc.
Response:
❏ Hydrated videos
❏ People names
❏ Query suggestions
Orchestration
● Own order of operations
● Provide whatever info clients/services need
○ From other clients/libraries/services
○ From request
● Merge partial results
● Filter results
● Retrieve more info if necessary
● Support mutations (e.g., profile switch)
● Support complex transactions in a limited way
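The search flow and the orchestration duties above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual Netflix implementation: `search_service`, `metadata_service`, the stand-in functions, and all response shapes are hypothetical.

```python
from typing import Callable, Dict, List


def orchestrate_search(
    query: str,
    country: str,
    search_service: Callable[[str, str], Dict[str, List]],
    metadata_service: Callable[[List[int], str], Dict[int, dict]],
) -> dict:
    """Run a search, hydrate the IDs with metadata, and merge the partial results."""
    # Step 1: the search service returns IDs and related terms (hypothetical shape).
    raw = search_service(query, country)
    video_ids = raw.get("video_ids", [])

    # Step 2: each ID needs metadata, which depends on the country.
    metadata = metadata_service(video_ids, country)

    # Step 3: filter out IDs for which no metadata came back.
    hydrated = [{"id": vid, **metadata[vid]} for vid in video_ids if vid in metadata]

    # Step 4: merge the partial results into a single response.
    return {
        "videos": hydrated,
        "people": raw.get("people", []),
        "suggestions": raw.get("suggestions", []),
    }


# Hypothetical in-memory stand-ins for the downstream services.
def fake_search(query: str, country: str) -> Dict[str, List]:
    return {"video_ids": [1, 2, 3], "people": ["Some Actor"], "suggestions": [query + " 2"]}


def fake_metadata(ids: List[int], country: str) -> Dict[int, dict]:
    catalog = {1: {"title": "Title One"}, 3: {"title": "Title Three"}}
    return {i: catalog[i] for i in ids if i in catalog}


response = orchestrate_search("example", "US", fake_search, fake_metadata)
```

Passing the services in as callables keeps the order of operations in one place, which is the part the API owns, while the downstream calls stay replaceable.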
What do customers want?
● No personalized recommendations, or no ability to stream?
● No search, or no ability to continue watching the movie you started last night?
● No cutting-edge A/B experiment experience, or no ability to stream?
Top priority: customer experience
● Top priority of top priority: customer can stream videos
● This means API cannot go down entirely
○ If it does, we have an outage
● But some services are not critical to this mission
○ A/B - if we don’t know what A/B tests you’re in, you can still get the default experience
○ Search - if you can’t search, you can still browse
Exposure to failures
● As your app grows, your set of dependencies is much more likely to get bigger, not smaller
● Overall uptime = (Dep uptime)^(num deps)
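A short worked example makes the uptime formula concrete. It assumes that every dependency has the same uptime and that failures are independent; the figures (30 dependencies at 99.99% each) are illustrative only and do not come from the slides.

```python
def overall_uptime(dep_uptime: float, num_deps: int) -> float:
    """Uptime of a service that needs all of its dependencies to be up.

    Assumes every dependency has the same uptime and fails independently.
    """
    return dep_uptime ** num_deps


# Illustrative numbers only: 30 dependencies at 99.99% uptime each.
combined = overall_uptime(0.9999, 30)
print(f"{combined:.2%}")
```

With these numbers the combined uptime drops to roughly 99.7%, which shows how quickly small per-dependency failure rates add up when nothing isolates them.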
Hystrix
● Fault-tolerance pattern as a library
● Provides operational insights in real time
● Automatic load-shedding under pressure
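To illustrate the kind of fault-tolerance pattern described here, the following is a toy circuit breaker. It is a sketch of the general idea only: it is not Hystrix's API or implementation, and the class name, parameters, and thresholds are all assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Toy circuit breaker: stop calling a failing dependency for a cool-down period."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        # While open, shed load: skip the dependency and answer from the fallback.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()
            # Cool-down elapsed: close the circuit and try the dependency again.
            self.opened_at = None
            self.failures = 0
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


# Hypothetical dependency that is always down.
calls = []


def failing_search() -> list:
    calls.append("search")
    raise RuntimeError("search is down")


breaker = CircuitBreaker(max_failures=2, reset_after_s=60.0)
results = [breaker.call(failing_search, lambda: []) for _ in range(5)]
```

In this run the dependency is only called twice; the remaining three requests are answered from the fallback without touching it, which is the load-shedding effect.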
If you don’t plan for failure
[Diagram: gateway and API; scripts in the API use client libraries (search, ratings, customer, ...) to reach the Search, Ratings, and Customers services across network boundaries]
If you do plan for failure
[Diagram: gateway and API; scripts in the API use client libraries (search, ratings, customer, ...) to reach the Search, Ratings, and Customers services across network boundaries]
No search results >> no Netflix
Fallbacks
[Diagram: gateway and API; scripts in the API use client libraries (search, ratings, customer, ...) to reach the Search, Ratings, and Customers services across network boundaries]
Return static or stale rating
Handle errors with fallbacks
● Some options for fallbacks
○ Static value
○ Value from in-memory
○ Value from cache
○ Value from network
○ Throw
○ Code
● Make error-handling explicit
● Applications have to work in the presence of either fallbacks or rethrown exceptions
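The fallback options listed above can be chained. The sketch below tries the primary call, then each fallback in order, and rethrows when nothing works. The helper `with_fallbacks`, the exception class, and the rating functions are hypothetical names chosen for this example.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class NoFallbackAvailable(Exception):
    """Raised when the primary call and every fallback failed."""


def with_fallbacks(primary: Callable[[], T], fallbacks: Sequence[Callable[[], T]]) -> T:
    """Try the primary call, then each fallback in order; rethrow if all fail."""
    last_error: Optional[Exception] = None
    for attempt in (primary, *fallbacks):
        try:
            return attempt()
        except Exception as error:  # explicit: every failure is handled here
            last_error = error
    raise NoFallbackAvailable("all options failed") from last_error


# Hypothetical rating lookup: the network call fails and the cache is empty,
# so the static value is returned.
cache: dict = {}


def rating_from_network() -> float:
    raise TimeoutError("ratings service timed out")


def rating_from_cache() -> float:
    return cache["rating:42"]  # raises KeyError when the entry is missing


def static_rating() -> float:
    return 3.0


rating = with_fallbacks(rating_from_network, [rating_from_cache, static_rating])
```

Callers still have to cope with both outcomes: a degraded value from a fallback, or `NoFallbackAvailable` when the list of fallbacks is exhausted.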
Abstraction goals
● Shield all device teams from every single mid-tier change … at least for a time. Allows us to move more independently
● Shield all device teams from every single platform/infrastructure change
● Provide APIs not provided by downstream services
○ Find all movies that...
○ Length of movie
● Implementation flexibility, e.g.,
○ Caching
○ Batch APIs
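As one illustration of the implementation flexibility an abstraction layer allows, the sketch below puts a cache in front of a batch call, so that callers ask for individual IDs while the layer decides how to fetch them. `CachedBatchLookup` and `fake_fetch_batch` are hypothetical, and the cache never expires entries, which a real one would need to do.

```python
from typing import Callable, Dict, Iterable, List


class CachedBatchLookup:
    """Serve lookups from a local cache and fetch all misses in one batch call."""

    def __init__(self, fetch_batch: Callable[[List[int]], Dict[int, dict]]) -> None:
        self._fetch_batch = fetch_batch  # hypothetical downstream batch API
        self._cache: Dict[int, dict] = {}

    def get_many(self, ids: Iterable[int]) -> Dict[int, dict]:
        ids = list(ids)
        # De-duplicate while keeping order, then keep only the cache misses.
        missing = [i for i in dict.fromkeys(ids) if i not in self._cache]
        if missing:
            # One downstream request for all misses instead of one per ID.
            self._cache.update(self._fetch_batch(missing))
        return {i: self._cache[i] for i in ids if i in self._cache}


# Hypothetical downstream service that records each batch it receives.
batch_calls: List[List[int]] = []


def fake_fetch_batch(ids: List[int]) -> Dict[int, dict]:
    batch_calls.append(ids)
    return {i: {"title": f"Title {i}"} for i in ids}


lookup = CachedBatchLookup(fake_fetch_batch)
first = lookup.get_many([1, 2])
second = lookup.get_many([2, 3])
```

The second lookup only sends ID 3 downstream, because ID 2 is already cached; device teams calling `get_many` would not notice if the caching or batching strategy changed.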
Abstraction challenges
● Tech debt
● Device teams can have black-box view (“api == cloud”)
● But isn’t the API team the bottleneck?
○ Yes, sometimes. But organizational structure makes this less of a problem than m mid-tier teams dealing with n device teams
● But: separation of concerns
Reminder: Today’s architecture
[Diagram: gateway and API; scripts in the API use client libraries (A, B, C, ..., Z) to reach the Netflix microservices across network boundaries; annotation: "~2100 active"]
Device teams write server-side logic
● Decoupling teams → better velocity
● UI teams are empowered to
○ Change presentation
○ Filter
○ Add users to A/B tests, which then leads to, e.g., a different layout
What if? Implications for device teams
[Diagram: gateway, scripts, client libraries (A, B, C, ..., Z), and Netflix microservices across network boundaries; annotation: "Device teams own client-side applications … and groovy scripts"]
● Each device team would have to own
○ Orchestration
○ Frequent dependency updates (currently attempted daily)
○ Implementation of higher-level APIs (all movies that…)
○ Fallbacks and other resiliency protection (e.g., timeouts, retries)
● Recent example
○ Library upgrade caused a lot of NPEs -- why?
○ Worked with team to find out why
○ When fixed, no more NPEs, but instead performance degradation
● Should all teams be dealing with this?
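To give a sense of the resiliency code each device team would have to own in this scenario, here is a minimal retry helper with exponential backoff. It assumes the dependency signals a timeout by raising `TimeoutError`; the function names and the delay values are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a call that raises TimeoutError, doubling the delay each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


# Hypothetical dependency that times out twice and then succeeds.
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("dependency timed out")
    return "ok"


# Record the delays instead of sleeping so the example runs instantly.
delays: list = []
result = call_with_retries(flaky_call, sleep=delays.append)
```

Even this small helper involves choices (how many attempts, how long to wait, which errors to retry) that would otherwise be repeated by every team.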
What if? Implications for service teams
[Diagram: gateway, scripts, client libraries (A, B, C, ..., Z), and Netflix microservices across network boundaries; annotation: "Service teams own services... and client libraries"]
● Can only make breaking changes if all device teams who use their service upgrade
● Don’t get resiliency protection (e.g., timeouts, load balancing, retries, fallbacks) unless all device teams who use their service provide it
● Should all teams be dealing with this?
What if? Implications for Netflix
● Lower velocity due to tight coupling between many mid-tier teams and many device teams
Where are we today?
● Principle: don’t repeat logic
○ It’s better to do it once in API than do it n times for n devices.
● Principle is good, but leads to complexity
Complexity challenges
● Frequent (not always canaried) updates to a critical service in production
● Difficulty of debugging (esp. for Groovy script writers)
● Slow server startup times
● Lack of operational insights into script resource consumption
● Difficulty of performance profiling
● Lack of feedback loop
● Decoupled code versioning and transitive dependencies
New architecture: Edge PaaS
[Diagram: gateway and EAS at the edge; Node apps on NodeQuark, running in Titus; an API with client libraries (A, B, C, ..., Z) in front of the Netflix microservices; components separated by network boundaries]
Edge Auth Service
● Auth termination
● Centralized place for auth
Edge PaaS:
● Platform for node scripts
● Developer tooling for entire SDLC
● Remote API with optimized data access (Falcor)
Two (or more) APIs
[Diagram: gateway and EAS at the edge; Node apps on NodeQuark, running in Titus; the API split into separate groups of clients and services: PB (Clients/Services A, B, C, ..., Z), DNA (Clients/Services A, B, C, ..., Z), and Shared (Clients/Services A, B, C, ..., Z)]
Split API by function
Edge PaaS: Node Platform
● Node apps run in containers on the Titus platform
● Node Platform provides
○ Integration into Netflix ecosystem (e.g., discovery)
○ Logging
○ Dashboards, metrics out of the box with option to customize
○ Support for mocking and testing
● Titus provides
○ Scheduling
○ Autoscaling
New architecture: Edge PaaS
[Diagram: gateway and EAS at the edge; Node apps on NodeQuark, running in Titus; Java API with client libraries (A, B, C, ..., Z) in front of the Netflix microservices; highlighted: "Developer tooling for entire SDLC"]
Edge PaaS: Developer tooling
● Command line tool for node apps
○ Setup
○ Starting apps
○ Deploying apps
● Local development and debugging of node apps
● UI for lifecycle management, e.g., version management
● One-click rollouts, one-click rollbacks
● Versioning
New architecture: Edge PaaS
[Diagram: Zuul and EAS at the edge; Node apps on NodeQuark, running in Titus; API with client libraries (A, B, C, ..., Z) in front of the Netflix microservices; highlighted: "Remote API with optimized data access"]
Edge PaaS: Remote API
● API still takes care of
○ Orchestration
○ Resiliency protection
○ Abstraction
● Optimized access with Falcor
○ “RESTful composition” with caching
● Binary transport
● Future: channel support
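The idea of composing data by path, with shared entities stored once and referenced elsewhere, can be illustrated with a toy resolver. This is not Falcor's API; it only borrows the JSON Graph notion of a reference (`$type: "ref"`), and the graph contents and the `get_path` helper are assumptions made for the example.

```python
from typing import Any, List

# Toy graph in the spirit of a JSON Graph: shared entities live in one place
# and lists point at them through references.
graph = {
    "videosById": {
        "7": {"title": "Title Seven", "rating": 4.5},
    },
    "myList": {
        "0": {"$type": "ref", "value": ["videosById", "7"]},
    },
}


def get_path(root: dict, path: List[str]) -> Any:
    """Walk a path through the graph, following references along the way."""
    node: Any = root
    for key in path:
        node = node[key]
        # Follow references so callers can address data by where it is used.
        while isinstance(node, dict) and node.get("$type") == "ref":
            node = get_path(root, node["value"])
    return node


title = get_path(graph, ["myList", "0", "title"])
```

Because each video is stored once under `videosById`, a client-side cache keyed by path can serve the same entity to every list that references it.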
Complexity and simplicity
● Product has become much more complex
○ Scripts (more scripts, more complex scripts)
○ Features
○ Number of downstream services to integrate
○ More personalization
○ etc.
● Complexity of API service is high → Need to optimize for simplicity now
○ Process isolation
○ Cleaner developer experience