I gave this presentation to the engineering team at PayPal. This presentation discusses the history and future of the Netflix API. It also goes into API design principles as well as concepts behind system scalability and resiliency.
Whether your audience is a bunch of programmers at a Hackday event,
Or a huge population of users who consume your APIs through a range of applications…
Understanding Your Target Audience(s) should be the #1 Influence to API Design Decisions!
To better understand how target audiences influenced the Netflix API, the following slides will provide background on the Netflix API history.
When the Netflix API launched three years ago, it was to “let 1,000 flowers bloom”. Today, that API still exists with about 80,000 flowers.
At the time of launch, it was exclusively a public API, with all of the traffic coming in from third-party developers.
These are some examples of the various flowers that have bloomed from this program.
Then streaming started taking off for Netflix, first with computer-based streaming…
And now the model looks more like this, with hundreds of Netflix-branded device implementations running off of the API. The third-party developers are just one of the many consumers of the API.
In this new model, however, the public API represents only about 0.3% of the total API traffic!
As a result, the emphasis around the API for Netflix is innovation and support for the Netflix-branded device implementations.
For Netflix, our API strategy can be discussed in the form of an iceberg. The public API strategy is the prominent, highly visible part of the iceberg that is above water. It is also the smallest part of the iceberg, in terms of mass. Meanwhile, the large mass of ice underwater that you cannot see is the most critical and biggest part of the iceberg. The visible part of the iceberg represents the public API and the part underwater is the internal strategy. Netflix’s strategy over the years has shifted from public to internal.
To better understand the strategy, I should explain the basics of what the Netflix API supports. There are basically two types of interactions between Netflix customers and our streaming application… Discovery and Streaming.
Discovery is basically any interaction with a title other than streaming it. That includes browsing titles, looking for something to watch, etc.
It also includes actions such as rating the title, adding it to your instant queue, etc.
Once the customer has identified a title to watch through the Discovery experience, the user can then play that title. Once the Play button is selected, the customer is sent to a different internal service that focuses on handling the streaming. That streaming service also interacts with our CDNs to actually deliver the streaming bits to the device for playback.
The API powers the Discovery experience. The rest of these slides will only focus on Discovery, not Streaming.
I like to think of the Netflix engineering teams that support development and innovation for Discovery as being shaped like an hourglass…
In the top end of the hourglass, we have our device and UI teams who build out great user experiences on Netflix-branded devices. There are currently more than 800 different device types that we support. To put that into perspective, there are a few hundred more device types that we support than engineers at Netflix.
At the bottom end of the hourglass, there are several dozen dependency teams who focus on things like metadata, algorithms, authentication services, A/B test engines, etc.
The API is in the skinny part of the hourglass, brokering content and algorithmic output from the dependency layers to the UIs. In this model, each team specializes in solving specific problems for the product pipeline, making each team (and each engineer) highly impactful for the success of the company.
With respect to product resiliency, the API plays a key role. None of our dependency services have SLAs of 100%. Given our unique position in the stack as being the last point just before delivery to our users, the API can serve a critical role in protecting our customers from various failures throughout the system.
In the old world, the system was vulnerable to such failures. For example, if one of our dependency services fails…
Such an outage could have resulted in an outage in the API.
And that outage likely would have cascaded to have some kind of substantive impact on the devices. The challenge for the API team is to be resilient against dependency outages, to ultimately insulate Netflix customers from low-level system problems.
To achieve this, we implemented a series of circuit breakers for each library that we depend on. Each circuit breaker controls the interaction between the API and that dependency. This image is a view of the dependency monitor that allows us to view the health and activity of each dependency. This dashboard is designed to give a real-time view of what is happening with these dependencies (over the last two minutes). We have other dashboards that provide insight into longer-term trends, day-over-day views, etc.
This is a view of a single circuit.
This circle represents the call volume and health of the dependency over the last 10 seconds. This circle is meant to be a visual indicator for health. The circle is green for healthy, yellow for borderline, and red for unhealthy. Moreover, the size of the circle represents the call volumes, where bigger circles mean more traffic.
The blue line represents the traffic trends over the last two minutes for this dependency.
The green number shows the number of successful calls to this dependency over the last two minutes.
The yellow number shows the number of latent calls into the dependency. These calls ultimately return successful responses, but slower than expected.
The blue number shows the number of calls that were handled by the short-circuited fallback mechanisms. That is, if the circuit gets tripped, the blue number will start to go up.
The orange number shows the number of calls that have timed out, resulting in fallback responses.
The purple number shows the number of calls that fail due to queuing issues, resulting in fallback responses.
The red number shows the number of exceptions, resulting in fallback responses.
The error rate is calculated as the total number of error and fallback responses divided by the total number of calls handled.
If the error rate exceeds a certain number, the circuit to the fallback scenario is automatically opened. When it returns below that threshold, the circuit is closed again.
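To make the mechanics concrete, here is a minimal sketch of a circuit breaker driven by an error rate. It is an illustration only, not the Netflix implementation: the class name, the threshold, the window length and the minimum call count are all assumptions. As a simplification, short-circuited calls are not counted toward the error rate, so the circuit closes again once the recorded failures age out of the window.

```python
import time
from collections import deque


class CircuitBreaker:
    """Toy circuit breaker: serves the fallback when the recent error rate is too high."""

    def __init__(self, error_threshold=0.5, window_seconds=10.0,
                 min_calls=5, clock=time.monotonic):
        self.error_threshold = error_threshold  # hypothetical trip point
        self.window_seconds = window_seconds    # hypothetical rolling window
        self.min_calls = min_calls              # avoid tripping on tiny samples
        self.clock = clock
        self.events = deque()                   # (timestamp, was_error) pairs

    def _prune(self):
        # Drop events that have fallen out of the rolling window.
        cutoff = self.clock() - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def error_rate(self):
        # Errors divided by total calls recorded in the window.
        self._prune()
        if not self.events:
            return 0.0
        errors = sum(1 for _, was_error in self.events if was_error)
        return errors / len(self.events)

    def is_open(self):
        self._prune()
        return (len(self.events) >= self.min_calls
                and self.error_rate() > self.error_threshold)

    def call(self, dependency, fallback):
        if self.is_open():
            # Short-circuited: skip the dependency entirely.
            return fallback()
        try:
            result = dependency()
        except Exception:
            self.events.append((self.clock(), True))
            return fallback()
        self.events.append((self.clock(), False))
        return result
```

A caller wraps each dependency call as `breaker.call(fetch_ratings, cached_ratings)`, where both function names are placeholders for a live call and its fallback.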
The dashboard also shows host and cluster information for the dependency.
As well as information about our SLAs.
So, going back to the engineering diagram…
If that same service fails today…
We simply disconnect from that service.
And replace it with an appropriate fallback.
Keeping our customers happy, even if the experience may be slightly degraded. It is important to note that different dependency libraries have different fallback scenarios. And some are more resilient than others. But the overall sentiment here is accurate at a high level.
From a scalability perspective, let’s go back to the devices… We have a growing number of devices…
And a huge population of users who consume your APIs through a range of applications…
We are continually creating richer experiences on these various devices. And our users are spending lots of time on such experiences.
As a result, the API traffic (in terms of incoming requests) is growing at a very high rate. In the last two years, API requests have grown by 70x, going from 600 million requests a month to about 42 billion a month. Scaling a system in those conditions is not a trivial task.
At our current scale, we support more than 1.5 billion incoming requests per day. Each of those requests actually explodes out to an average of 5-6 dependency calls. That means that the API makes about 8-9 billion outgoing calls per day.
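The figures above can be checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted in these notes; the 30-day month is an assumption. Note that 1.5 billion requests at 5 calls each gives 7.5 billion, so "8-9 billion" is a rounded range.

```python
# Monthly request volumes quoted in the notes.
requests_per_month_2010 = 600_000_000
requests_per_month_2012 = 42_000_000_000

growth_factor = requests_per_month_2012 / requests_per_month_2010

# Assumes a 30-day month to compare monthly and daily figures.
approx_requests_per_day = requests_per_month_2012 / 30

# Fan-out: each incoming request triggers 5-6 dependency calls on average.
requests_per_day = 1_500_000_000
outgoing_low = requests_per_day * 5
outgoing_high = requests_per_day * 6

print(f"growth: {growth_factor:.0f}x")
print(f"about {approx_requests_per_day / 1e9:.1f} billion requests per day")
print(f"{outgoing_low / 1e9:.1f} to {outgoing_high / 1e9:.1f} billion outgoing calls per day")
```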
Of course, we are using Amazon Web Services (EC2) to host our systems, which goes a long way to helping us scale (although it is not, in itself, a silver bullet).
So, rather than having a systems team spend a ton of time trying to build new servers, patch and upgrade existing ones, etc., in server rooms such as this one…
We spend time in tools such as this one, created by Netflix staff, to help us manage our instance types and counts.
Another feature afforded to us through AWS to help us scale is Autoscaling. This is the Netflix API request rates over a span of time. The red line represents a potential capacity needed in a data center to ensure that the spikes could be handled without spending a ton more than is needed for the really unlikely scenarios.
With Autoscaling, instead of buying new servers based on projected spikes in traffic and having systems administrators add them to the farm, the cloud can dynamically and automatically add and remove servers based on need.
Similarly, as our overall traffic grows over time…
Our overall server counts can increase as well, commensurate with that growth.
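The idea behind scaling on demand can be sketched as a simple sizing rule. This is not how AWS Auto Scaling is configured or implemented; the function name, the per-instance capacity, the headroom percentage and the minimum instance count are all hypothetical values chosen for illustration.

```python
import math


def desired_instances(requests_per_second, capacity_per_instance,
                      headroom_percent=20, min_instances=2):
    """Return how many servers are needed for the current load (illustrative rule)."""
    # Pad the observed load so short spikes do not overwhelm the fleet.
    padded_load = requests_per_second * (100 + headroom_percent) / 100
    needed = math.ceil(padded_load / capacity_per_instance)
    # Never scale below a floor, so there is always some redundancy.
    return max(min_instances, needed)


# Hypothetical traffic samples over a day, in requests per second.
traffic_samples = [10, 1000, 1050, 5000]
fleet_sizes = [desired_instances(rps, capacity_per_instance=100)
               for rps in traffic_samples]
print(fleet_sizes)
```

With a rule like this the fleet follows the traffic curve, rather than being sized once for the red-line peak.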
Finally, as we continue to expand internationally, we can easily scale up in new regions, closer to the customer base that we are trying to serve, as long as Amazon has a location near there.
Finally, the API team needs to support innovation. This is the most important goal, although if the other two don’t have a good foundation, then this one is not possible.
One of the more common debates in the API marketplace is whether the API should support XML or JSON (or both). Here is ProgrammableWeb’s breakdown of the format war. About a third of the APIs they are aware of use JSON and two-thirds use XML.
The migration to JSON is mostly due to improved language support and the slenderness of JSON’s payload. As a result, the debate seems to be shifting pretty quickly towards JSON.
My thought on this debate is… WHO CARES about the debate itself? Ultimately, it will come down to knowing the audiences of the API and making sure that the API supports those audiences effectively.
Another prominent debate is how to authenticate users, with OAuth being the de facto standard.
My thought on that debate is… WHO CARES about the debate. Again, know your audiences for the API and that will go a long way to making the right decisions. Netflix, for example, started out using OAuth exclusively for the third-party developers. Now, for the streaming device support, many of them use cookie-based authentication because the API team has a different relationship with the device teams than we do with the third-party developers (who still use OAuth).
Another popular debate around APIs is REST vs. SOAP. According to ProgrammableWeb, over the last half-year, REST (which had already gained prominence) has become even more dominant, representing about ¾ of all APIs.
My thought on that debate is… WHO CARES?!?! Again, the right implementation will be based on your audience. I will talk a little more about this particular debate later in the presentation.
The next debate, Versioning!
WHO CARES about this one?
I do care about the bigger question about the need for versioning. If implemented, however, I don’t have a strong stance on how it gets implemented. Ultimately, those questions need to be addressed on an implementation-to-implementation basis, based on the audience needs and the architectural sensibilities for the API team.
The problem with versioning, particularly in supporting as many devices as Netflix does, is that many of these devices cannot be updated. And in the case of TVs, for example, they sit on people’s walls for 7-10 years with limited (if even possible) options for updating the app. As a result, any API version that is published that a TV app calls needs to be supported for that long duration.
Ultimately, you may end up supporting a large, and growing, number of API versions. The more you support, the tougher it is to maintain and the greater the burden it places on your innovation. Right now, Netflix has a 1.0, 1.5 and 2.0 API. You can quickly imagine in the next 10 years the possibility of a 3.0, 4.0, 5.0, 6.0, etc., making the codebase daunting, ugly and brittle.
This is not often possible, but thinking about a versionless API has many benefits, solving many of the problems discussed in the previous slide. It introduces some other problems, but if done right, the benefits could outweigh the challenges.
In typical one-size-fits-all REST-ful APIs, the model is quite simple. A backend repository (or distributed repositories) stores data leveraged by an API application that exposes such data through REST-ful endpoints. Those REST-ful endpoints are typically managed in a highly vertical, resource-oriented way, treating the resources as granular normalized data access points. Problems start to surface with this model, however, when you sit 800+ different device types on top of it. The degree of variability across such a breadth of devices really starts to expose weaknesses in this model.
Another issue that gets exposed in REST-ful models is their chattiness. The Netflix API growth rate is a result of several factors, including more users, more device types, richer UI experiences, more time spent by users, etc.
If this same growth rate continues, the request numbers could start to get very large resulting in corresponding demands in the infrastructure, etc.
Metrics like 1.5B requests per day sound great, don’t they? The reality is that this number is concerning…
For web sites, like NPR, where page views create ad impressions and ad impressions generate revenue, 1.5B requests per day would be amazing.
But for systems that yield output that looks like this...
Or this… Ad impressions are not part of the game. As a result, the increase in requests doesn’t translate into more revenue. In fact, it translates into more expenses. That is, handling more requests requires more servers, more systems admins, a potentially different application architecture, etc.
We are challenging ourselves to redesign the API to see if those same 1.5B requests could have been 300 million a day or perhaps even less. Through more targeted API designs based on what we have learned through our metrics, we will be able to reduce our API traffic as Netflix’s overall traffic grows.
As a result, our REST API is no longer the right tool for the job. We need a new API, designed to handle the same degree of variability that is present in the 800+ device types that Netflix supports (and that many other companies aspire to be on).
Some other solutions have tried to address this issue, including OData, YQL and most recently, eBay’s ql.io. These are query-based APIs that give the requester much more control over the queries sent to the backend, which in turn allows for a much more flexible response.
The SQL-like, or query-based, APIs are considerably better for a large number of devices. They still seem to potentially weaken when dealing with the massive variability of the devices. The issue is that, like the REST APIs, SQL-like APIs are also set up so the server team (or the team providing the API) defines the request and response model. Granted, the SQL-like API offers much more flexibility than the REST API, but at the end of the day, device teams still need to adhere to the server-side rules.
From the Netflix perspective, we are putting the days of the “one API to rule them all” approach behind us. Rather, our API, to support the growing number of devices, needs to let the device teams define their own rules. And the API platform needs to be able to support them, even if they are divergent from each other in format, delivery method, etc.
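One way to picture this is a platform where each device team registers its own endpoint, which gathers whatever it needs in one request and shapes the response for that device. The sketch below is purely illustrative and is not the Netflix design: the registry, the decorator, the device names and the stand-in data functions are all hypothetical.

```python
# Hypothetical stand-ins for calls to the dependency services.
def fetch_user(user_id):
    return {"id": user_id, "name": "Example User"}


def fetch_recommendations(user_id):
    return [
        {"title": "Title A", "synopsis": "Synopsis A", "box_art": "a.jpg"},
        {"title": "Title B", "synopsis": "Synopsis B", "box_art": "b.jpg"},
        {"title": "Title C", "synopsis": "Synopsis C", "box_art": "c.jpg"},
    ]


# The platform only keeps a registry; device teams own the handlers.
ENDPOINTS = {}


def device_endpoint(device_type):
    """Decorator a device team uses to register its own endpoint."""
    def register(handler):
        ENDPOINTS[device_type] = handler
        return handler
    return register


@device_endpoint("phone")
def phone_home(user_id):
    # Small screen: a short list of titles and nothing else.
    recs = fetch_recommendations(user_id)
    return {"titles": [rec["title"] for rec in recs[:2]]}


@device_endpoint("tv")
def tv_home(user_id):
    # Large screen: richer metadata, gathered in a single request.
    user = fetch_user(user_id)
    recs = fetch_recommendations(user_id)
    return {"greeting": "Welcome back, " + user["name"], "row": recs}


def handle_request(device_type, user_id):
    """Platform entry point: dispatch to the handler the device team defined."""
    return ENDPOINTS[device_type](user_id)
```

Because each handler bundles several dependency lookups into one response, a device makes one request where a resource-oriented API would have needed several.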
Netflix API - Presentation to PayPal
API Strategy: Know Your Audience
Daniel Jacobson
Director of Engineering, Netflix API
firstname.lastname@example.org
@daniel_jacobson
http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson
Quick Note
Here is the basic outline for this presentation. Feel free to bounce to the sections that are most interesting to you, especially if you have seen some of this before:
• Introduction
• Netflix API background
• Netflix Product Engineering Organization
• Netflix API Resiliency
• Netflix API Scaling
• Supporting Innovation and API Design Principles
• Netflix API Team is Hiring – Contact Information
Who Am I?
• Director of Engineering, Netflix API
• Previously: Director of App Dev, NPR
  – Responsible for and built custom CMS and API among other things
• Co-Author of O’Reilly’s “APIs: A Strategy Guide”
Who is Netflix?
• Focus is to become the best at streaming TV shows and movies internationally
• Netflix is responsible for 30+% of US internet traffic at peak times
• Netflix streaming can be enjoyed on more than 800 different device types
Growth of Netflix API Requests
[Chart: requests in billions per month: 0.6 (Jan-10), 20.7 (Jan-11), 41.7 (Jan-12). 70x growth in two years!]
[Diagram: the API sitting on top of its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine]
Each API request translates into 5-6 dependency calls on average
That is about 8-9 billion outgoing calls per day for the API
Debate: XML vs. JSON
• This debate is over-simplified! Increased variability in device capabilities and requirements means increased approaches to interface with them
  – Some devices perform better with hierarchical JSON and others with flat object models
  – Different devices may require different XML schemas
  – Some devices prefer full document delivery and others prefer streaming bits
  – Etc.
Debate: XML vs. JSON
• Moreover, this should be a non-issue from an architectural perspective
  – Design the API to separate out the concepts of data gathering from data formatting and delivery
  – Rendering new output formats should be easy
  – This will help tremendously with device proliferation
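Separating data gathering from formatting can be sketched in a few lines. This is an illustration of the principle only, not the Netflix code: `gather_title`, the renderer registry and the sample record are hypothetical. The point is that adding an output format means adding one renderer, with no change to how the data is gathered.

```python
import json
from xml.etree.ElementTree import Element, SubElement, tostring


def gather_title(title_id):
    """Data gathering: hypothetical stand-in for calls to dependency services."""
    return {"id": title_id, "name": "Example Title", "rating": "PG"}


def render_json(data):
    """Formatting: serialize the gathered data as JSON."""
    return json.dumps(data, sort_keys=True)


def render_xml(data):
    """Formatting: serialize the same gathered data as flat XML."""
    root = Element("title")
    for key, value in data.items():
        SubElement(root, key).text = str(value)
    return tostring(root, encoding="unicode")


# New formats are added here without touching gather_title.
RENDERERS = {"json": render_json, "xml": render_xml}


def respond(title_id, output_format):
    """Delivery: gather once, then hand off to whichever renderer was requested."""
    return RENDERERS[output_format](gather_title(title_id))
```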
Debate: OAuth vs. Other
• OAuth is the de facto standard
  – Excellent for unknown developers
  – Somewhat difficult to deal with
  – Chatty
  – (Netflix’s original approach)
• Cookie-based auth
  – Excellent for device implementations under your control
  – Significantly less chatty and complicated
  – Not a good option for unknown developers
  – (Netflix’s new approach)
• Query-based auth
  – Easy to implement
  – Easy to use
  – Not secure
• Etc…
Debate: Versioning
• Directory structure for version number
  – developer.paypal.com/v1/getPaymentOptions
• Query string for version number
  – developer.paypal.com/getPaymentOptions?version=1
• Versionless API
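The three options differ only in where the version lives in the request. As a rough sketch, a server could resolve all three with one helper. The function name is hypothetical, and an `https://` scheme is added to the slide's example URLs so the standard-library parser can split them.

```python
import re
from urllib.parse import parse_qs, urlsplit


def resolve_version(url, default=None):
    """Return the API version requested by a URL, or `default` if versionless."""
    parts = urlsplit(url)

    # Directory structure: /v1/getPaymentOptions
    path_match = re.match(r"^/v(\d+)/", parts.path)
    if path_match:
        return int(path_match.group(1))

    # Query string: /getPaymentOptions?version=1
    query = parse_qs(parts.query)
    if "version" in query:
        return int(query["version"][0])

    # Versionless: the caller decides what "no version" means.
    return default


for example in (
    "https://developer.paypal.com/v1/getPaymentOptions",
    "https://developer.paypal.com/getPaymentOptions?version=1",
    "https://developer.paypal.com/getPaymentOptions",
):
    print(example, "->", resolve_version(example))
```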
Benefit to Thinking Versionless
• If you can achieve it, maintenance will be MUCH simpler
• If you cannot, it instills better practices
  – Reduces lazy programming
  – Results in fewer versions
  – Results in a cleaner, less brittle system
• And keep in mind, adding new features typically does not require a new version…
  – Schematic or structural changes, however, do
So, why don’t I care about most of these debates?
Traditional REST-ful Model
[Diagram: five REST-ful endpoints exposed by an API application, which sits on top of a data repository]
Comparing just these two devices…
• XBox has Kinect features like voice commands and gestures (iPhone does not)
• iPhone has touch features (XBox does not)
• iPhone has small real estate, XBox has a full sized TV – different metadata needs
• XBox has more powerful hard drive with better memory capacity than the iPhone
• iPhone has different connection rates, XBox is connected to stable home wi-fi
• Etc…
What if the API request growth rate looks like this???
[Chart: projected requests in billions, on an axis running from 0 to 160]
Is this good for the long run???
Growth of the Netflix API
1.5 billion requests per day
Exploding out to 8-9 billion outgoing calls per day
Improve Efficiency of API Requests
Could it have been 300 million requests per day? Or less? (Assuming everything else remained the same)
There is a better tool for the job than the One-Size-Fits-All REST API
SQL-Like APIs
+ Much more flexibility than traditional REST models
+ Allows for a one-size-fits-all dev approach for API
- Still has the server teams dictating the rules (although the rules are much more flexible)
----------------------------------------------------------------
= Substantially better for the device world
= May hit scaling problems for increasing number of device types (due to the increasing variability across devices)
We Are Hiring!
If you are interested in helping us solve these problems, you can contact me at:
email@example.com
@daniel_jacobson
http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson