This presentation was given at QCon SF 2011. In these slides I discuss various techniques that we use to scale the API. I also discuss our effort to redesign the API in more detail.
Daniel Jacobson, Director of Engineering - Netflix API at Netflix
1. Techniques for Scaling the
Netflix API
By Daniel Jacobson
@daniel_jacobson
djacobson@netflix.com
QCon SF 2011
2. Techniques for Scaling the Netflix API
Agenda
• History of the Netflix API
• The Cloud
• Development and Testing
• Resiliency
• Future of the Netflix API
3. Techniques for Scaling the Netflix API
Agenda: History of the Netflix API
12. Techniques for Scaling the Netflix API
Agenda: The Cloud
27. Techniques for Scaling the Netflix API
Agenda: Development and Testing
29. That Doesn’t Mean We Don’t Test
• Unit tests
• Functional tests
• Regression scripts
• Continuous integration
• Capacity planning
• Load / Performance tests
30. Development and testing flow (before): Development (Feature & Test) → Run Unit Tests → Continuous Integration → Deploy to Staging Env → Run Functional Tests → Perform Integration Tests → Deploy to Customers
32. [Diagram: API requests from the Internet hit the current code in production, plus a single canary instance to test new code with production traffic (typically around 1-5% of traffic). A problem surfaces on the canary before full rollout.]
41. Development and testing flow (after): Development (Feature & Test) → Run Unit Tests → Continuous Integration → Deploy to Staging Env → Run Functional Tests → Run Integration Tests → Deploy Canary Instance(s) → Perform Canary Analysis → Deploy Black Instances → Perform Black Analysis → Deploy to Customers
42. Techniques for Scaling the Netflix API
Agenda: Resiliency
43. [Diagram: the API sits between client devices and its backend dependencies: Personalization Engine, Movie Metadata, Movie Ratings, Similar Movies, User Info, A/B Test Engine, Reviews]
44-46, 61-63. [Same dependency diagram, repeated as an animation: one dependency fails, the outage threatens to cascade into the API, and the API disconnects from the failing service]
64-65. [Same dependency diagram, with the failed dependency replaced by a Fallback]
66. Techniques for Scaling the Netflix API
Agenda: Future of the Netflix API
76. Improve Efficiency of API Requests
Could it have been 5 billion requests per month? Or less?
(Assuming everything else remained the same)
77. [Chart: Netflix API requests per month, in billions (y-axis 0-35)]
78. [Same chart, with a hypothetical lower-traffic pattern shown alongside the actual growth]
79. API Billionaires Club
13 billion API calls / day (May 2011)
Over 260 billion objects stored in S3 (January 2011)
5 billion API calls / day (April 2010)
5 billion API calls / day (October 2009)
1 billion API calls / day (October 2011)
8 billion API calls / month (Q3 2009)
3.2 billion API-delivered stories / month (October 2011)
3 billion API calls / month (March 2009)
Courtesy of John Musser, ProgrammableWeb
80. API Billionaires Club [same list repeated]
81. Two Major Objectives for API Redesign
• Improve performance for devices
– Minimize network traffic (one API call, if possible)
– Only deliver bytes that are needed
• Improve ability for device teams to rapidly innovate
– Customized request/response patterns per device
– Deployment is independent from API schedules
82. Current Client / Server Interaction: [Sequence diagram: CLIENT APPS make a series of granular calls to the API SERVER: AUTH, SETUP, TIME, QUEUE, LISTS, LIST(i), TITLES]
83. Future Client / Server Interaction: [Diagram: CLIENT APPS talk to the API SERVER through two layers: a CUSTOM SCRIPTING TIER on top of a GENERIC API]
84. Custom Scripting Tier Interaction: [Sequence diagram: the client makes one call to a custom endpoint (e.g., a PS3 Home Screen endpoint, owned and operated by the UI teams) in the custom scripting tier, which performs the AUTH, SETUP, TIME, QUEUE, LISTS, LIST(i), and TITLES calls server-side against the API]
90. Generic API Interaction: [Sequence diagram: the client app makes a simple catalog request to the API; a generic script (owned and operated by the API team) handles AUTH, CONFIG, TIME, QUEUE, LISTS, LIST(i), and TITLES]
91. Technologies: [Diagram: each device uses its own client language; custom scripts are written in Groovy and compiled into the JVM alongside the Java API code; the example request is a simple catalog request (e.g., a LOLOMO) to the API]
93. [Diagram: a script repository (dynamic deployment) holds scripts 1-6 for an endpoint, one of which is marked Active; the API server application code itself still requires API code pushes]
94. [Same diagram: a new script, 7, is published to the repository and arrives inactive]
95. [Same diagram: script 7 is now the Active script]
98. If you are interested in helping us solve these problems, please contact me at:
Daniel Jacobson
djacobson@netflix.com
@daniel_jacobson
http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson
Editor's Notes
When the Netflix API launched three years ago, it was to “let 1,000 flowers bloom”. Today, that API still exists with almost 23,000 flowers.
At that time, it was exclusively a public API.
Some of the apps developed by the 1,000 flowers.
Then streaming started taking off for Netflix, first with computer-based streaming… At that time, it was still experimental and did not draw from the API.
But over time, as we added more devices, they started drawing their metadata from the API. Today, almost all of our devices are powered by the API.
As a result, today’s consumption is almost entirely from private APIs that service the devices. The Netflix devices account for 99.7% of the API traffic while the public API represents only about .3%.
The Netflix API represents the iceberg model for APIs. That is, public APIs represent a relatively small percentage of the value for the company, but they are typically the most visible part of the API program. They equate to the small part of the iceberg that is above water, in open sight. Conversely, the private APIs that drive web sites, mobile phones, device implementations, etc. account for the vast majority of the value for many companies, although people outside of the company often are not aware of them. These APIs equate to the large, hard to see mass of ice underwater. In the API space, most companies get attracted to the tip of the iceberg because that is what they are aware of. As a result, many companies seek to pursue a public API program. Over time, however, after more inspection into the value propositions of APIs, it becomes clear to many that the greatest value is in the private APIs.
As a result, the current emphasis for the Netflix API is on the majority case… supporting the Netflix devices.
There are basically two types of interactions between Netflix customers and our streaming application… Discovery and Streaming.
Discovery is basically any event with a title other than streaming it. That includes browsing titles, looking for something to watch, etc.
It also includes actions such as rating the title, adding it to your instant queue, etc.
Once the customer has identified a title to watch through the Discovery experience, the user can then play that title. Once the Play button is selected, the customer is sent to a different internal service that focuses on handling the streaming. That streaming service also interacts with our CDNs to actually deliver the streaming bits to the device for playback.
The API powers the Discovery experience. The rest of these slides will only focus on Discovery, not Streaming.
As Discovery events grow, so does the growth of the Netflix API. Discovery continues to grow for a variety of reasons, including more devices, more customers, richer UI experiences, etc.
As API traffic grows, so do the infrastructural needs. The more requests, the more servers we need, the more time spent supporting those servers, the higher the costs associated with this support, etc.
And our international expansion will only add complexity and more scaling issues.
The traditional model is to have systems administrators go into server rooms like this one to build out new servers, etc.
Rather than relying on data centers, we have moved everything to the cloud! Enables rapid scaling with relative ease. Adding new servers, in new locations, take minutes. And this is critical when the service needs to grow from 1B requests a month to 1B requests a day in a year.
Instead of going into server rooms, we go into a web page like this one. Within minutes, we can spin up new servers to support growing demands.
Through autoscaling in the cloud, we can also dynamically grow our server farm in concert with the traffic that we receive.
So, instead of buying new servers based on projected spikes in traffic and having systems administrators add them to the farm, the cloud can dynamically and automatically add and remove servers based on need.
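The dynamic add/remove behavior described above can be sketched as a simple threshold rule. This is a minimal illustration, not Netflix's actual policy: the CPU thresholds, the 10% step size, and the class name are all assumed for the example.

```java
// Minimal sketch of a threshold-based autoscaling decision: add capacity
// when average load is high, remove it when load is low, hold steady in
// between. Thresholds and step size are illustrative values only.
public class AutoscalerSketch {
    static final double SCALE_UP_CPU = 0.70;   // grow above 70% average CPU
    static final double SCALE_DOWN_CPU = 0.30; // shrink below 30% average CPU

    /** Returns the change in instance count (positive, negative, or zero). */
    public static int desiredDelta(double avgCpuUtilization, int currentInstances) {
        if (avgCpuUtilization > SCALE_UP_CPU) {
            return Math.max(1, currentInstances / 10);  // grow by ~10%
        }
        if (avgCpuUtilization < SCALE_DOWN_CPU && currentInstances > 1) {
            return -Math.max(1, currentInstances / 10); // shrink by ~10%
        }
        return 0; // within the healthy band: no change
    }

    public static void main(String[] args) {
        System.out.println(desiredDelta(0.85, 100)); // high load  -> +10
        System.out.println(desiredDelta(0.50, 100)); // normal     -> 0
        System.out.println(desiredDelta(0.10, 100)); // low load   -> -10
    }
}
```

In practice a policy like this would be evaluated continuously against aggregated metrics rather than a single sample.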
And as we continue to expand internationally, we can easily scale up in new regions, closer to the customer base that we are trying to serve, as long as Amazon has a location near there.
As a general practice, Netflix focuses on getting code into production as quickly as possible to expose features to new audiences.
That said, we do spend a lot of time testing. We have just adopted some new techniques to help us learn more about what the new code will look like in production.
Prior to these new changes, our flow looked something like this…
That flow has changed with the addition of new techniques, such as canary deployments and what we call red/black deployments.
The canary deployments are comparable to canaries in coal mines. We have many servers in production running the current codebase. We will then introduce a single (or perhaps a few) new servers into production running new code. Monitoring the canary servers will show what the new code will look like in production.
If the canary shows errors, we pull it/them down, re-evaluate the new code, debug it, etc. We will then repeat the process until the analysis of canary servers look good.
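The canary evaluation described above can be sketched as a comparison of the canary's error rate against the production baseline. The 1.2x tolerance and the small absolute slack are assumed values for illustration, not the real analysis criteria.

```java
// Sketch of a canary health check: the canary passes only if its error
// rate is not meaningfully worse than the production baseline.
public class CanaryAnalysis {
    public static double errorRate(long errors, long requests) {
        return requests == 0 ? 0.0 : (double) errors / requests;
    }

    /** True if the canary may proceed toward a full deployment. */
    public static boolean canaryHealthy(long canaryErrors, long canaryRequests,
                                        long prodErrors, long prodRequests) {
        double canary = errorRate(canaryErrors, canaryRequests);
        double prod = errorRate(prodErrors, prodRequests);
        // Allow 20% relative degradation plus a tiny absolute slack,
        // since a canary sees far less traffic than the full farm.
        return canary <= prod * 1.2 + 0.001;
    }

    public static void main(String[] args) {
        // Canary at 0.5% errors vs production at 0.4%: within tolerance.
        System.out.println(canaryHealthy(5, 1000, 400, 100_000));
        // Canary at 5% errors: pull it down and debug the new code.
        System.out.println(canaryHealthy(50, 1000, 400, 100_000));
    }
}
```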
If the new code looks good in the canary, we can then use a technique that we call Red/Black Deployments to launch the code. Start with Red, where production code is running. Fire up a new set of servers (Black) equal to the count in Red with the new code.
Then switch the pointer to have external requests draw from the Black servers.
If a problem is encountered from the Black servers, it is easy to rollback quickly by switching the pointer back to Red. We will then re-evaluate the new code, debug it, etc.
Once we have debugged the code, we will put another canary up to evaluate the new changes in production.
If the new code looks good in the canary, we can then bring up another set of servers with the new code.
Then we will switch production traffic to the new code.
Then switch the pointer to have external requests draw from the Black servers. If everything still looks good, we disable the Red servers and the new code becomes the new red servers.
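The "switch the pointer" idea above can be sketched as a single atomic reference to the active cluster, which is what makes both cutover and rollback effectively instantaneous. The class and cluster names are illustrative, not Netflix's actual routing tooling.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a red/black deployment switch: all traffic follows one pointer
// to the active cluster, so going live or rolling back is one atomic swap.
public class RedBlackSwitch {
    private final AtomicReference<String> activeCluster;

    public RedBlackSwitch(String initial) {
        this.activeCluster = new AtomicReference<>(initial);
    }

    /** The cluster that external requests currently draw from. */
    public String route() { return activeCluster.get(); }

    public void cutOver(String newCluster) { activeCluster.set(newCluster); }

    public void rollBack(String oldCluster) { activeCluster.set(oldCluster); }

    public static void main(String[] args) {
        RedBlackSwitch lb = new RedBlackSwitch("red"); // current code live
        lb.cutOver("black");                           // new code goes live
        System.out.println(lb.route());
        lb.rollBack("red");                            // problem found: instant rollback
        System.out.println(lb.route());
    }
}
```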
So, the development and testing flow now looks more like this…
At Netflix, we have a range of engineering teams who focus on specific problem sets. Some teams focus on creating rich presentation layers on various devices. Others focus on metadata and algorithms. For the streaming application to work, the metadata from the services needs to make it to the devices. That is where the API comes in. The API essentially acts as a broker, moving the metadata from inside the Netflix system to the devices.
Given the position of the API within the overall system, the API depends on a large number of underlying systems (only some of which are represented here). Moreover, a large number of devices depend on the API (only some of which are represented here). Sometimes, one of these underlying systems experiences an outage.
In the past, such an outage could result in an outage in the API.
And if that outage cascades to the API, it is likely to have some kind of substantive impact on the devices. The challenge for the API team is to be resilient against dependency outages, to ultimately insulate Netflix customers from low level system problems.
To achieve this, we implemented a series of circuit breakers for each library that we depend on. Each circuit breaker controls the interaction between the API and that dependency. This image is a view of the dependency monitor that allows us to view the health and activity of each dependency. This dashboard is designed to give a real-time view of what is happening with these dependencies (over the last two minutes). We have other dashboards that provide insight into longer-term trends, day-over-day views, etc.
This is a view of a single circuit.
This circle represents the call volume and health of the dependency over the last 10 seconds. This circle is meant to be a visual indicator for health. The circle is green for healthy, yellow for borderline, and red for unhealthy. Moreover, the size of the circle represents the call volumes, where bigger circles mean more traffic.
The blue line represents the traffic trends over the last two minutes for this dependency.
The green number shows the number of successful calls to this dependency over the last two minutes.
The yellow number shows the number of latent calls into the dependency. These calls ultimately return successful responses, but slower than expected.
The blue number shows the number of calls that were handled by the short-circuited fallback mechanisms. That is, if the circuit gets tripped, the blue number will start to go up.
The orange number shows the number of calls that have timed out, resulting in fallback responses.
The purple number shows the number of calls that fail due to queuing issues, resulting in fallback responses.
The red number shows the number of exceptions, resulting in fallback responses.
The error rate is calculated from the total number of error and fallback responses divided by the total number of calls handled.
If the error rate exceeds a certain number, the circuit to the fallback scenario is automatically opened. When it returns below that threshold, the circuit is closed again.
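The open/close behavior described above can be sketched as follows. This is a deliberately simplified illustration of the pattern, not Netflix's implementation: the 50% threshold is an assumed value, and a real breaker would track a rolling time window rather than all-time counts.

```java
// Sketch of an error-rate circuit breaker: record successes and failures,
// open the circuit (skip the dependency, serve the fallback) when the
// error rate crosses a threshold, and close it again once it recovers.
public class CircuitBreakerSketch {
    private final double threshold;
    private long successes, failures;
    private boolean open;

    public CircuitBreakerSketch(double threshold) { this.threshold = threshold; }

    public void recordSuccess() { successes++; update(); }
    public void recordFailure() { failures++; update(); }

    public double errorRate() {
        long total = successes + failures;
        return total == 0 ? 0.0 : (double) failures / total;
    }

    // Re-evaluate on every observation: the circuit opens above the
    // threshold and closes automatically when the rate drops back below it.
    private void update() { open = errorRate() > threshold; }

    /** When open, callers short-circuit to the fallback response. */
    public boolean isOpen() { return open; }

    public static void main(String[] args) {
        CircuitBreakerSketch cb = new CircuitBreakerSketch(0.5);
        cb.recordSuccess();
        cb.recordFailure();
        cb.recordFailure();
        System.out.println(cb.isOpen()); // 2/3 errors: circuit is open
    }
}
```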
The dashboard also shows host and cluster information for the dependency.
As well as information about our SLAs.
So, going back to the engineering diagram…
If that same service fails today…
We simply disconnect from that service.
And replace it with an appropriate fallback.
Keeping our customers happy, even if the experience may be slightly degraded. It is important to note that different dependency libraries have different fallback scenarios. And some are more resilient than others. But the overall sentiment here is accurate at a high level.
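The fallback idea described above can be sketched as a wrapper around any dependency call: try the dependency, and on failure serve a static, possibly degraded response instead of failing the whole request. The "popular titles" fallback and the method names are illustrative assumptions.

```java
import java.util.List;
import java.util.function.Supplier;

// Sketch of a dependency fallback: a failed call to a backend service
// returns a degraded-but-useful default instead of an error.
public class FallbackSketch {
    public static <T> T withFallback(Supplier<T> dependencyCall, T fallback) {
        try {
            return dependencyCall.get();
        } catch (RuntimeException e) {
            // Degraded experience, but the customer still gets a response.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Hypothetical fallback: generic popular titles instead of a
        // personalized list when the personalization engine is down.
        List<String> fallbackTitles = List.of("Popular Title A", "Popular Title B");
        List<String> titles = withFallback(() -> {
            throw new RuntimeException("personalization engine timeout");
        }, fallbackTitles);
        System.out.println(titles);
    }
}
```

As the notes say, each dependency would define its own fallback, and some fallbacks are richer than others.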
As discussed earlier, the API was originally built for the 1,000 flowers. Accordingly, today’s API design is very much grounded in the same principles for that same audience.
But the audience of the API today is dramatically different.
With the emphasis of the API program being on the large mass underwater – the private API.
As a result, the current API is no longer the right tool for the job. We need a new API, designed for the present and the future. The following slides talk more about the redesign of the Netflix API to better meet the needs of the key audiences.
We already talked about the tremendous growth in API requests…
Metrics like 30B requests per month sound great, don’t they? The reality is that this number is concerning…
For web sites, like NPR, where page views create ad impressions and ad impressions generate revenue, 30B requests would be amazing.
But for systems that yield output that looks like this...
Or this… Ad impressions are not part of the game. As a result, the increase in requests don’t translate into more revenue. In fact, they translate into more expenses. That is, to handle more requests requires more servers, more systems-admins, a potentially different application architecture, etc.
We are challenging ourselves to redesign the API to see if those same 30B requests could have been 5 billion or perhaps even less. Through more targeted API designs based on what we have learned through our metrics, we will be able to reduce our API traffic as Netflix’s overall traffic grows.
Given the same growth charts in the API, it would be great to imagine the traffic patterns being the blue bars instead of the red ones (assuming the customer usage and user experiences remain the same).
Similarly, with lower traffic levels for the same user experience, server and administration complexity and costs go down as well.
To state the goal another way, John Musser maintains a list of the API Billionaires. Netflix has a pretty lofty position in that club.
We aspire to no longer be in that exclusive company. That is one of the things the redesign strives for.
Today, the devices call back to the API in a mostly synchronous way to retrieve granular metadata needed to start up the client UI. That model requires a large number of network transactions which are the most expensive part of the overall interaction.
We want to break the interaction model into two types of interactions. Custom calls and generic calls.
For highly complex or critical interfaces, we want the device to make a single call to the API to a custom endpoint. Behind that endpoint will be a script that the UI teams maintain. That script is the traffic cop for gathering and formatting the metadata needed for that UI. The script will call to backend services to get metadata, but in this model, much of this will be concurrent and progressively rendered to the devices.
One way to think of it is to imagine a full grid of movies/TV shows as part of a Netflix UI. The red box represents the viewable area of the grid when the screen loads. In today’s REST-ful resource model, the device needs to make calls for the individual lists, then populate the individual titles with distinct calls for each title. Granted, in today’s model, there are some asynchronous calls and some of them are also performed in bulk. But this still demonstrates the chatty nature of this REST-ful API design.
Alternatively, we believe that the custom script could easily return the desired viewables for each list in one payload much more efficiently.
Moreover, this model can return any payload, filling out any portions of the grid structure, in that single response. Now, we are not likely going to want to populate a grid in this way, but it is possible given the highly customizable nature of this model.
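The single-payload model described above can be sketched as a server-side fan-out: one device call triggers concurrent backend calls, and the device receives one assembled response. The service names, return shapes, and class name here are illustrative assumptions, not the real endpoint script.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Sketch of a custom endpoint: the chatty per-list, per-title calls move
// server-side, run concurrently, and collapse into a single payload.
public class CustomEndpointSketch {
    static CompletableFuture<String> fetch(String service) {
        // Stand-in for a real backend call (metadata, ratings, user info, ...).
        return CompletableFuture.supplyAsync(() -> service + "-data");
    }

    /** One endpoint call gathers everything the home screen needs. */
    public static Map<String, String> homeScreen() {
        CompletableFuture<String> lists = fetch("lists");
        CompletableFuture<String> titles = fetch("titles");
        CompletableFuture<String> user = fetch("userInfo");
        // The backend calls run concurrently; we only block when
        // assembling the single response payload for the device.
        return Map.of("lists", lists.join(),
                      "titles", titles.join(),
                      "user", user.join());
    }

    public static void main(String[] args) {
        System.out.println(homeScreen());
    }
}
```

The payoff is that the expensive part of the interaction, the network round trips from the device, shrinks to a single call.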
But once we populate the complex start-up screens through the custom scripting tier, the interactions become much more predictable and device-agnostic. If you want to extend the movies in a given row, you don’t need a custom script. That is why we are exposing the Generic API as well.
To populate the grid with more titles for more rows, it is a simple call to get more titles.
The call pattern looks like this for the Generic API. Notice, there is no need for some of the session-start requests when using the Generic API.
For this model, the technology stack is pretty simple. The client apps have their own languages designed for that particular device. The overall API server codebase is Java. And the custom scripts will be written in Groovy and compiled into the same JVM as the backend API code. This should help with overall performance and library sharing for more complex scripting operations.
To publish new scripts to the system, UI engineers will publish the script to Perforce for code management. Then it will be pushed up to a Cassandra cluster in AWS, which acts as a script repository and management system. Every 30 seconds or so, a job will scan the Cassandra cluster looking for new scripts. Any new scripts are then pushed to the full farm of API servers and compiled into the JVM.
From the device perspective, there could be many scripts for a given endpoint, only one of which is active at a given time. In this case, the iPad is running off of script #2.
New scripts can be dynamically added at any time by any team (most often by the UI engineers). The new script (script 7) will arrive in an inactive state. At that time, the script can be tested from a production server before running the device off of it.
When the script looks good and is ready to go live, a configuration change is made and script 7 becomes active immediately across the full server farm.
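The publish/activate model described above can be sketched as a registry that holds many script versions per endpoint plus an active-version pointer, where activation is a single configuration change. The endpoint and method names are illustrative, not the real repository API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of dynamic script deployment: new scripts arrive inactive, and a
// configuration change flips which version serves traffic.
public class ScriptRegistrySketch {
    // endpoint -> (version -> script body)
    private final Map<String, Map<Integer, String>> scripts = new ConcurrentHashMap<>();
    // endpoint -> currently active version
    private final Map<String, Integer> activeVersion = new ConcurrentHashMap<>();

    /** New scripts are published inactive; live traffic is untouched. */
    public void publish(String endpoint, int version, String body) {
        scripts.computeIfAbsent(endpoint, k -> new ConcurrentHashMap<>())
               .put(version, body);
    }

    /** The configuration change that makes a script live. */
    public void activate(String endpoint, int version) {
        activeVersion.put(endpoint, version);
    }

    public String serve(String endpoint) {
        return scripts.get(endpoint).get(activeVersion.get(endpoint));
    }

    public static void main(String[] args) {
        ScriptRegistrySketch repo = new ScriptRegistrySketch();
        repo.publish("ps3/home", 6, "script-6");
        repo.activate("ps3/home", 6);
        repo.publish("ps3/home", 7, "script-7"); // arrives inactive
        System.out.println(repo.serve("ps3/home")); // still script-6
        repo.activate("ps3/home", 7);               // config change
        System.out.println(repo.serve("ps3/home")); // now script-7
    }
}
```

Because activation is just a pointer flip, rolling back a bad script is as cheap as rolling one forward.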
All of these changes in our redesign effort are designed to help the apps and the UI engineers run faster.
And achieving those goals will help us keep our customers happy.