Maintaining the Front Door to Netflix
Daniel Jacobson
@daniel_jacobson
http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson
Global Streaming Video
for TV Shows and Movies
More than 48 Million Subscribers
More than 40 Countries
Netflix Accounts for >34% of Peak
Downstream Traffic in North America
Netflix subscribers are watching more than 1 billion hours a month
Netflix Accounts for >6% of Peak
Upstream Traffic in North America
Netflix subscribers are watching more than 1 billion hours a month
Team Focus:
Build the Best Global Streaming Product
Three aspects of the Streaming Product:
• Non-Member
• Discovery
• Streaming
The Netflix API - Background
Netflix
API
Netflix API Requests by Audience
At Launch In 2008
Netflix Devices
Open API Developers
Netflix
API
Netflix API Requests by Audience
From 2011
Netflix Devices
Open API Developers
Current Emphasis of Netflix API
Netflix Devices
Netflix API : Key Responsibilities
• Broker data between services and Devices
• Provide features and business logic
• Maintain a resilient front-door
• Scale the system
• Maintain high velocity
• Provide detailed insights into the system health
APIs Do
Lots of Things!
Data Gathering
Data Formatting
Data Delivery
Security
Authorization
Authentication
System Scaling
Discoverability
Data Consistency
Translations
Throttling
Orchestration
Definitions
• Data Gathering
– Retrieving the requested data from one or many local or remote data sources
• Data Formatting
– Preparing a structured payload to the requesting agent
• Data Delivery
– Delivering the structured payload to the requesting agent
Meanwhile…
There are two players in APIs
API Provider API Consumer
API Provider
PROVIDES
API Consumer
CONSUMES
Traditional API Interactions
API Provider
PROVIDES
EVERYTHING
API Consumer
CONSUMES
WHAT IS
PROVIDED
Everything means the API Provider does:
• Data Gathering
• Data Formatting
• Data Delivery
Why do most API providers provide
everything?
• API design tends to be easier for teams closer
to the source
• Centralized...
Data Gathering Data Formatting Data Delivery
API Consumer
API Provider
Separation of Concerns
To be a better provider, the...
Data Gathering Data Formatting Data Delivery
API Consumer
Don’t care how data
is gathered, as long
as it is gathered
API P...
Data Gathering Data Formatting Data Delivery
API Consumer
Don’t care how data
is gathered, as long
as it is gathered
Each ...
Because of our separation of
concerns, the Netflix API team is
enabled to focus on different charters
Brokering Data to
1,000+ Device Types
Screen Real Estate
Controller
Technical Capabilities
One-Size-Fits-All
API
Request
Request
Request
Courtesy of South Florida Classical Review
Resource-Based API
vs.
Experience-Based API
Resource-Based Requests
• /users/<id>/ratings/title
• /users/<id>/queues
• /users/<id>/queues/instant
• /users/<id>/recomm...
OSFA API
[Diagram, repeated across three slides: the one-size-fits-all API sits at the network border in front of the Recommendations, Movie Data, Similar Movies, Auth, Member Data, A/B Tests, Start-up, and Ratings services]
Experience-Based Requests
• /ps3/homescreen
JAVA API
[Diagram: the device makes a single request across the network border to a Groovy adapter; server code running the Java API handles data gathering from the Recommendations, Movie Data, Similar Movies, Auth, Member Data, A/B Tests, Start-up, and Ratings services]
Netflix API : Key Responsibilities
• Broker data between services and Devices
• Provide features and business logic
• Maintain a resilient front-door
• Scale the system
• Maintain high velocity
• Provide detailed insights into the system health
1,000+ Device Types
Dozens of Dependencies
[Diagram: the API brokering between 1,000+ device types and dependency services such as the Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, and the A/B Test Engine]
Dependency Relationships
[Diagram: the API at the center, connected to the Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, and the A/B Test Engine]
2,000,000,000
Incoming Requests Per Day
to the Netflix API
30
Distinct Dependent
Services for the Netflix API
~500
Dependency jars Slurped
into the Netflix API
14,000,000,000
Netflix API Outbound Calls
Per Day to those
Dependent Services
0
Dependent Services with
100% SLA
99.99%^30 = 99.7%
0.3% of 2B = 6M failures per day
2+ Hours of Downtime Per Month
99.9%^30 = 97%
3% of 2B = 60M failures per day
20+ Hours of Downtime
Per Month
[Diagram sequence: one dependency service fails, the failure cascades into the API, and from there out to the devices]
Circuit Breaker Dashboard
Call Volume and Health / Last 10 Seconds
Call Volume / Last 2 Minutes
Successful Requests
Successful, But Slower Than Expected
Short-Circuited Requests, Delivering Fallbacks
Timeouts, Delivering Fallbacks
Thread Pool & Task Queue Full, Delivering Fallbacks
Exceptions, Delivering Fallbacks
Error Rate
(Short-Circuited + Timeouts + Rejections + Exceptions) / (Successes + Short-Circuited + Timeouts + Rejections + Exceptions) = Error Rate
Status of Fallback Circuit
Requests per Second, Over Last 10 Seconds
SLA Information
[Diagram sequence: the failing dependency is disconnected from the API and replaced with a Fallback]
Netflix API : Key Responsibilities
• Broker data between services and Devices
• Provide features and business logic
• Maintain a resilient front-door
• Scale the system
• Maintain high velocity
• Provide detailed insights into the system health
Netflix API : Requests Per Month
[Chart: monthly requests in billions, 0 to 35; 50x growth in 18 months]
AWS Cloud
Netflix API : Requests Per Month
[Chart: monthly requests in billions]
Autoscaling
Scryer : Predictive Auto Scaling
Not yet…
Typical Traffic Patterns Over Five Days
Predicted RPS Compared to Actual RPS
Scaling Plan for Predicted Workload
What is Scryer Doing?
• Evaluating needs based on historical data
– Week over week, month over month metrics
• Adjusts ins...
Results
Results : Load Average
Reactive
Predictive
Results : Response Latencies
Reactive
Predictive
Results : Outage Recovery
Results : AWS Costs
Scaling Globally
More than 48 Million Subscribers
More than 40 Countries
Zuul
Gatekeeper for the Netflix Streaming Application
Zuul *
• Multi-Region
Resiliency
• Insights
• Stress Testing
• Canary Testing
• Dynamic Routing
• Load Shedding
• Security...
All of these approaches are
designed to prevent failures…
But sometimes the best way to
prevent failures is to force them!
I randomly
terminate instances
in production to
identify dormant
failures.
Chaos
Monkey
Chaos
Gorilla
I simulate an
outage of an
entire Amazon
availability zone.
I simulate an
outage in an AWS
region.
Chaos
Kong
I find instances that
don’t adhere to
best practices.
Conformity
Monkey
I extend Conformity
Monkey to find
security violations.
Security
Monkey
I detect unhealthy
instances and
remove them
from service.
Doctor
Monkey
I clean up the
clutter and waste
that runs in the
cloud.
Janitor
Monkey
I induce artificial
delays and errors into
services to determine
how upstream services
will respond.
Latency
Monkey
Netflix API : Key Responsibilities
• Broker data between services and Devices
• Provide features and business logic
• Maintain a resilient front-door
• Scale the system
• Maintain high velocity
• Provide detailed insights into the system health
Dependency Relationships
[Diagram: the API at the center, connected to the Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, and the A/B Test Engine]
Testing Philosophy:
Act Fast, React Fast
That Doesn’t Mean We Don’t Test
Automated Delivery Pipeline
Cloud-Based Deployment Techniques
[Diagram sequence: current code in production serves API requests from the Internet; a single canary instance tests new code with production traffic (around 1% or less of traffic); canary analysis is automated; the canary is stress tested with Zuul; new code is prepared for production on a parallel cluster; on an error, traffic rolls back to the current code; once the canary and stress tests pass, traffic switches to the new code]
https://www.github.com/Netflix
Maintaining the Front Door to Netflix
Daniel Jacobson
@daniel_jacobson
http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson
Maintaining the Netflix Front Door - Presentation at Intuit Meetup
This presentation goes into detail on the key principles behind the Netflix API, including design, resiliency, scaling, and deployment. Among other things, I discuss our migration from our REST API to what we call our Experience-Based API design. It also shares several of our open source efforts such as Zuul, Scryer, Hystrix, RxJava and the Simian Army.

  • Netflix strives to be the global streaming video leader for TV shows and movies
  • We now have more than 44 million global subscribers in more than 40 countries
  • Those subscribers consume more than a billion hours of streaming video a month which accounts for about 33% of the peak Internet traffic in the US.
  • Our 44 million Netflix subscribers are watching shows and movies on virtually any device that has a streaming video screen. We are now on more than 1,000 different device types.
  • The subscribers can watch our original shows like Emmy-winning House of Cards.
  • Within this world, the Edge Engineering team focuses on these three aspects of the streaming product.
  • To better understand how target audiences influenced the Netflix API, the following slides will provide background on the Netflix API history.
  • When the Netflix API launched three years ago, it was to “let 1,000 flowers bloom”. It was exclusively a public API.
    Today, that API still exists with about 80,000 flowers.
  • At the time of launch, it was exclusively a public API, with all of the traffic coming in from third-party developers.
  • These are some examples of the various flowers that have bloomed from this program.
  • Then streaming started taking off for Netflix, first with computer-based streaming…
  • And now the model looks more like this, with hundreds of Netflix-branded device implementations running off of the API. The third-party developers are just one of the many consumers of the API.
  • In this new model, however, the public API represents only about 0.3% of the total API traffic!
  • As a result, the emphasis around the API for Netflix is innovation and support for the Netflix-branded device implementations.
  • Most companies focus on a small handful of device implementations, most notably Android and iOS devices.
  • At Netflix, we have more than 1,000 different device types that we support. Across those devices, there is a high degree of variability. As a result, we have seen inefficiencies and problems emerge across our implementations. Those issues also translate into issues with the API interaction.
  • For example, screen size could significantly affect what the API should deliver to the UI. TVs with bigger screens can potentially fit more titles and more metadata per title than a mobile phone. Do we need to send all of the extra bits for fields or items that are not needed, requiring the device itself to drop items on the floor? Or can we optimize the delivery of those bits on a per-device basis?
  • Different devices have different controlling functions as well. For devices with swipe technologies, such as the iPad, do we need to pre-load a lot of extra titles in case a user swipes the row quickly to see the last of 500 titles in their queue? Or for up-down-left-right controllers, would devices be more optimized by fetching a few items at a time when they are needed? Other devices support voice or hand gestures or pointer technologies. How might those impact the user experience and therefore the metadata needed to support them?
  • The technical specs on these devices differ greatly. Some have significant memory space while others do not, impacting how much data can be handled at a given time. Processing power and hard-drive space could also play a role in how the UI performs, in turn potentially influencing the optimal way for fetching content from the API. All of these differences could result in different potential optimizations across these devices.
  • Many UI teams needing metadata means many requests to the API team. In the one-size-fits-all API world, we essentially needed to funnel these requests and then prioritize them. That means that some teams would need to wait for API work to be done. It also meant that, because they all shared the same endpoints, we were often adding variations to the endpoints, resulting in a more complex system as well as a lot of spaghetti code. Making teams wait due to prioritization was exacerbated by the fact that tasks took longer because the technical debt was increasing, causing time to build and test to increase. Moreover, many of the incoming requests were asking us to do more of the same kinds of customizations. This created a spiral that would be very difficult to break out of…
  • Many other companies have seen similar issues and have introduced orchestration layers that enable more flexible interaction models.
  • Odata, HYQL, ql.io, rest.li and others are examples of orchestration layers. They address the same problems that we have seen, but we have approached the solution in a very different way.
  • We evolved our discussion towards what ultimately became a discussion between resource-based APIs and experience-based APIs.
  • The original OSFA API was very resource oriented with granular requests for specific data, delivering specific documents in specific formats.
  • The interaction model looked basically like this, with (in this example) the PS3 making many calls across the network to the OSFA API. The API ultimately called back to dependent services to get the corresponding data needed to satisfy the requests.
  • In this mode, there is a very clear divide between the Client Code and the Server Code. That divide is the network border.
  • And the responsibilities have the same distribution as well. The Client Code handles the rendering of the interface (as well as asking the server for data). The Server Code is responsible for gathering, formatting and delivering the data to the UIs.
  • And ultimately, it works. The PS3 interface looks like this and was populated by this interaction model.
  • But we believe this is not the optimal way to handle it. In fact, assembling a UI through many resource-based API calls is akin to pointillism paintings. The picture looks great when fully assembled, but it is done by assembling many points put together in the right way.
  • We have decided to pursue an experience-based approach instead. Rather than making many API requests to assemble the PS3 home screen, the PS3 will potentially make a single request to a custom, optimized endpoint.
  • In an experience-based interaction, the PS3 can potentially make a single request across the network border to a scripting layer (currently Groovy), in this example to provide the data for the PS3 home screen. The call goes to a very specific, custom endpoint for the PS3 or for a shared UI. The Groovy script then interprets what is needed for the PS3 home screen and triggers a series of calls to the Java API running in the same JVM as the Groovy scripts. The Java API is essentially a series of methods that individually know how to gather the corresponding data from the dependent services. The Java API then returns the data to the Groovy script, which then formats and delivers the very specific data back to the PS3.
  • We also introduced RxJava into this layer to improve our ability to handle concurrency and callbacks. RxJava is open source in our github repository.
  • In this model, the border between Client Code and Server Code is no longer the network border. It is now back on the server. The Groovy is essentially a client adapter written by the client teams.
  • And the distribution of work changes as well. The client teams continue to handle UI rendering, but now are also responsible for the formatting and delivery of content. The API team, in terms of the data side of things, is responsible for the data gathering and hand-off to the client adapters. Of course, the API team does many other things, including resiliency, scaling, dependency interactions, etc. This model is essentially a platform for API development.
  • If resource-based APIs assemble data like pointillism, experience-based APIs assemble data like a photograph. The experience-based approach captures and delivers it all at once.
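  • The contrast above can be sketched in code. Below is a hypothetical, deliberately simplified adapter in the spirit of the experience-based approach; every class, method, and title name is an illustrative invention, not Netflix's actual code:

```java
// Hypothetical sketch of an experience-based endpoint. Instead of the
// device assembling the screen from many resource calls, one server-side
// adapter gathers everything the PS3 home screen needs and returns a
// single, device-shaped payload.
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Ps3HomeScreenAdapter {
    // Stand-ins for Java API methods that wrap the dependent services.
    static List<String> fetchRecommendations(String userId) {
        return List.of("House of Cards", "Orange Is the New Black");
    }

    static List<String> fetchContinueWatching(String userId) {
        return List.of("Arrested Development");
    }

    // One request in (e.g. /ps3/homescreen), one payload out, already
    // shaped for this device, so no unused fields cross the network border.
    static Map<String, Object> homeScreen(String userId) {
        Map<String, Object> payload = new LinkedHashMap<>();
        payload.put("recommendations", fetchRecommendations(userId));
        payload.put("continueWatching", fetchContinueWatching(userId));
        return payload;
    }
}
```

    The point of the design is where the shaping happens: the adapter, owned by the client team, formats the payload on the server, while the underlying data-gathering methods stay generic.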
  • I like to think of this distributed architecture as being shaped like an hourglass…
  • In the top end of the hourglass, we have our device and UI teams who build out great user experiences on Netflix-branded devices. To put that into perspective, there are a few hundred more device types that we support than engineers at Netflix.
  • At the bottom end of the hourglass, there are several dozen dependency teams who focus on things like metadata, algorithms, authentication services, A/B test engines, etc.
  • The API is at the center of the hourglass, acting as a broker of data.
  • Our distributed architecture, with the number of systems involved, can get quite complicated. Each of these systems talks to a large number of other systems within our architecture.
  • Assuming each of the services have SLAs of four nines, that results in more than two hours of downtime per month.
  • And that is if all services maintain four nines!
  • If it degrades as far as to three nines, that is almost one day per month of downtime!
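  • The availability arithmetic above is just compounding: with 30 dependencies that can each fail independently, overall availability is the per-service availability raised to the 30th power. A quick check, assuming a 30-day month:

```java
// Compound availability of a system with n independently failing
// dependencies, each offering the same per-service availability.
public class CompoundAvailability {
    // availability^n, e.g. 0.9999^30 ≈ 0.997
    static double compound(double perService, int n) {
        return Math.pow(perService, n);
    }

    // Unavailability converted to hours, assuming a 30-day month.
    static double downtimeHoursPerMonth(double availability) {
        return (1.0 - availability) * 30 * 24;
    }

    public static void main(String[] args) {
        double fourNines = compound(0.9999, 30);
        double threeNines = compound(0.999, 30);
        System.out.printf("0.9999^30 = %.4f (%.1f h downtime/month)%n",
                fourNines, downtimeHoursPerMonth(fourNines));
        System.out.printf("0.999^30  = %.4f (%.1f h downtime/month)%n",
                threeNines, downtimeHoursPerMonth(threeNines));
    }
}
```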
  • So, back to the hourglass…
  • In the old world, the system was vulnerable to such failures. For example, if one of our dependency services fails…
  • Such a failure could have resulted in an outage in the API.
  • And that outage likely would have cascaded to have some kind of substantive impact on the devices.
  • The challenge for the API team is to be resilient against dependency outages, to ultimately insulate Netflix customers from low level system problems and to keep them happy.
  • To solve this problem, we created Hystrix, a wrapping technology that provides fault tolerance in a distributed environment. Hystrix is also open source and available at our github repository.
  • To achieve this, we implemented a series of circuit breakers for each library that we depend on. Each circuit breaker controls the interaction between the API and that dependency. This image is a view of the dependency monitor that allows us to view the health and activity of each dependency. This dashboard is designed to give a real-time view of what is happening with these dependencies (over the last two minutes). We have other dashboards that provide insight into longer-term trends, day-over-day views, etc.
  • This is a view of a single circuit.
  • This circle represents the call volume and health of the dependency over the last 10 seconds. This circle is meant to be a visual indicator for health. The circle is green for healthy, yellow for borderline, and red for unhealthy. Moreover, the size of the circle represents the call volumes, where bigger circles mean more traffic.
  • The blue line represents the traffic trends over the last two minutes for this dependency.
  • The green number shows the number of successful calls to this dependency over the last two minutes.
  • The yellow number shows the number of latent calls into the dependency. These calls ultimately return successful responses, but slower than expected.
  • The blue number shows the number of calls that were handled by the short-circuited fallback mechanisms. That is, if the circuit gets tripped, the blue number will start to go up.
  • The orange number shows the number of calls that have timed out, resulting in fallback responses.
  • The purple number shows the number of calls that fail due to queuing issues, resulting in fallback responses.
  • The red number shows the number of exceptions, resulting in fallback responses.
  • The error rate is calculated from the total number of error and fallback responses divided by the total number of calls handled.
  • If the error rate exceeds a certain number, the circuit to the fallback scenario is automatically opened. When it returns below that threshold, the circuit is closed again.
  • The dashboard also shows host and cluster information for the dependency.
  • As well as information about our SLAs.
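  • The open/close behavior described above is the core of the circuit-breaker pattern. Here is a deliberately minimal sketch of that pattern, not Hystrix itself: real Hystrix adds thread-pool isolation, timeouts, rolling statistical windows, and automatic half-open probing.

```java
// Minimal circuit-breaker sketch: route calls to a fallback once the
// observed error rate crosses a threshold.
import java.util.function.Supplier;

public class CircuitBreaker {
    private final double errorThreshold; // e.g. 0.5 opens at 50% errors
    private final int minRequests;       // don't judge health on tiny samples
    private long successes;
    private long failures;
    private boolean open;

    public CircuitBreaker(double errorThreshold, int minRequests) {
        this.errorThreshold = errorThreshold;
        this.minRequests = minRequests;
    }

    public boolean isOpen() {
        return open;
    }

    // Run the command through the breaker; serve the fallback when the
    // circuit is open or the command throws.
    public <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (open) {
            return fallback.get(); // short-circuit: skip the failing dependency
        }
        try {
            T result = command.get();
            successes++;
            return result;
        } catch (RuntimeException e) {
            failures++;
            maybeTrip();
            return fallback.get();
        }
    }

    // Open the circuit once the error rate exceeds the threshold.
    private void maybeTrip() {
        long total = successes + failures;
        if (total >= minRequests && (double) failures / total > errorThreshold) {
            open = true;
        }
    }

    // Close the circuit again once the dependency looks healthy (Hystrix
    // does this automatically via a half-open test request).
    public void reset() {
        open = false;
        successes = 0;
        failures = 0;
    }
}
```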
  • So, going back to the engineering diagram…
  • If that same service fails today…
  • We simply disconnect from that service.
  • And replace it with an appropriate fallback. The fallback, ideally, is a slightly degraded but useful offering. If we cannot get that, however, we will quickly provide a 5xx response, which will help the systems shed load rather than queue things up (which could eventually cause the system as a whole to tip over).
  • This will keep our customers happy, even if the experience may be slightly degraded. It is important to note that different dependency libraries have different fallback scenarios. And some are more resilient than others. But the overall sentiment here is accurate at a high level.
  • Similarly, as our overall traffic grows over time…
  • In addition to the migration to a distributed architecture, we also aggressively moved out of data centers…
  • And into the cloud.
  • Instead of spending our time in data centers, we spend it in tools such as Asgard, created by Netflix staff, to help us manage our instance types and counts in AWS. Asgard is available in our open source repository at github.
  • Our overall server counts can increase as well, commensurate with that growth.
  • Another feature afforded to us through AWS to help us scale is Autoscaling. This is the Netflix API request rates over a span of time. The red line represents a potential capacity needed in a data center to ensure that the spikes could be handled without spending a ton more than is needed for the really unlikely scenarios.
  • Through autoscaling, instead of buying new servers based on projected spikes in traffic and having systems administrators add them to the farm, the cloud can dynamically and automatically add and remove servers based on need.
  • To offset these limitations, we created Scryer (not yet open sourced, but in production at Netflix).
  • Instead of reacting to real-time metrics, like load average, to increase/decrease the instance count, we can look at historical patterns in our traffic to figure out what will be needed BEFORE it is needed. We believed we could write algorithms to predict the needs.
  • This is the result of the algorithms we created for the predictions. The prediction closely matches the actual traffic.
  • Based on those predictions, we started triggering scaling events. Those scaling events closely matched the traffic patterns as well, although these scaling events (as opposed to the Amazon auto-scaler) preceded the need.
  • Load average when running Scryer is much smoother.
  • Smoother load average results in more consistent and faster response times.
  • Retries and thundering herd effects are mitigated by consistent provisioning of instances through Scryer.
  • Meanwhile, because we have more predictable scaling needs that can be provisioned more granularly, our AWS costs go down.
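  • The predict-then-provision loop Scryer performs can be illustrated with a toy version. Scryer's real algorithms are far more sophisticated; the simple averaging, the headroom factor, and the per-instance capacity below are made-up stand-ins:

```java
// Toy predictive scaler: estimate traffic for a future time slot from the
// same slot in past weeks, then size the fleet ahead of the need.
public class PredictiveScaler {
    // Week-over-week mean of observed RPS for the same time slot.
    static double predictRps(double[] sameSlotPastWeeks) {
        double sum = 0;
        for (double rps : sameSlotPastWeeks) {
            sum += rps;
        }
        return sum / sameSlotPastWeeks.length;
    }

    // Instances to provision, given per-instance capacity and a safety
    // headroom (e.g. 0.2 = 20% above the prediction).
    static int instancesNeeded(double predictedRps, double rpsPerInstance,
                               double headroom) {
        return (int) Math.ceil(predictedRps * (1 + headroom) / rpsPerInstance);
    }

    public static void main(String[] args) {
        double predicted = predictRps(new double[] {9000, 9500, 10000, 9800});
        int count = instancesNeeded(predicted, 250, 0.2);
        System.out.println(predicted + " RPS predicted -> " + count + " instances");
    }
}
```

    The key difference from reactive autoscaling is that the input is historical traffic rather than current load average, so the instances exist before the spike arrives.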
  • Going global has a different set of scaling challenges. AWS enables us to add instances in new regions that are closer to our customers.
  • To help us manage our traffic across regions, as well as within given regions, we created Zuul. Zuul is open source in our github repository.
  • Zuul does a variety of things for us. Zuul fronts our entire streaming application as well as a range of other services within our system.
  • Hystrix and other techniques throughout our engineering organization help keep things resilient. We also have an army of tools that introduce failures to the system which will help us identify problems before they become really big problems.
  • The army is the Simian Army, which is a fleet of monkeys who are designed to do a variety of things, in an automated way, in our cloud implementation. Chaos Monkey, for example, periodically terminates AWS instances in production to see how the system as a whole will respond once that server disappears. Latency Monkey introduces latencies and errors into a system to see how it responds. The system is too complex to know how things will respond in various circumstances, so the monkeys expose that information to us in a variety of ways. The monkeys are also available in our open source github repository.
  • So, back to the hourglass…
  • Again, the dependency chains in our system are quite complicated.
  • That is a lot of change in the system!
  • As a result, our philosophy is to act fast (i.e., get code into production as quickly as possible), then react fast (i.e., respond to issues quickly as they arise).
  • Two such examples are canary deployments and what we call red/black deployments.
  • The canary deployments are comparable to canaries in coal mines. We have many servers in production running the current codebase. We will then introduce a single (or perhaps a few) new server(s) into production running new code. Monitoring the canary servers will show what the new code will look like in production.
  • If the canary encounters problems, it will register in any number of ways. The problems will be determined based on a comprehensive set of tools that will automatically perform health analysis on the canary.
  • The health of the canary is automated as well, comparing its metrics against the fleet of production servers.
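  • Automated canary health checking boils down to comparing canary metrics against the fleet baseline and flagging large deviations. A hedged sketch of that comparison, with invented metric names and an invented tolerance parameter:

```java
// Sketch of automated canary analysis: a canary is healthy only if every
// baseline metric is present and within a fractional tolerance of the
// production fleet's value.
import java.util.Map;

public class CanaryAnalysis {
    static boolean healthy(Map<String, Double> fleetBaseline,
                           Map<String, Double> canary,
                           double tolerance) {
        for (Map.Entry<String, Double> entry : fleetBaseline.entrySet()) {
            Double observed = canary.get(entry.getKey());
            if (observed == null) {
                return false; // metric missing on the canary
            }
            double baseline = entry.getValue();
            if (Math.abs(observed - baseline) / baseline > tolerance) {
                return false; // deviates too far from the fleet
            }
        }
        return true;
    }
}
```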
  • If the canary shows errors, we pull it/them down, re-evaluate the new code, debug it, etc.
  • We will then repeat the process until the analysis of canary servers look good.
  • We also use Zuul to funnel varying degrees of traffic to the canaries to evaluate how much load the canary can take relative to the current production instances. If the RPS, for example, drops, the canary may fail the Zuul stress test.
  • If the new code looks good in the canary, we can then use a technique that we call red/black deployments to launch the code. Start with red, where production code is running. Fire up a new set of servers (black) equal to the count in red with the new code.
  • Then switch the pointer to have external requests point to the black servers. Sometimes, however, we may find an error in the black cluster that was not detected by the canary. For example, some issues can only be seen with full load.
  • If a problem is encountered from the black servers, it is easy to roll back quickly by switching the pointer back to red. We will then re-evaluate the new code, debug it, etc.
  • Once we have debugged the code, we will put another canary up to evaluate the new changes in production.
  • And we will stress the canary again…
  • If the new code looks good in the canary, we can then bring up another set of servers with the new code.
  • Then we will switch production traffic to the new code.
  • If everything still looks good, we disable the red servers and the new code becomes the new red servers.
  • All of the open source components discussed here, as well as many others, can be found at the Netflix github repository.
  • Transcript of "Maintaining the Netflix Front Door - Presentation at Intuit Meetup"

    1. Maintaining the Front Door to Netflix Daniel Jacobson @daniel_jacobson http://www.linkedin.com/in/danieljacobson http://www.slideshare.net/danieljacobson
    2. Global Streaming Video for TV Shows and Movies
    3. More than 48 Million Subscribers More than 40 Countries
    4. Netflix Accounts for >34% of Peak Downstream Traffic in North America Netflix subscribers are watching more than 1 billion hours a month
    5. Netflix Accounts for >6% of Peak Upstream Traffic in North America Netflix subscribers are watching more than 1 billion hours a month
    6. Team Focus: Build the Best Global Streaming Product Three aspects of the Streaming Product: • Non-Member • Discovery • Streaming
    7. The Netflix API - Background
    8. Netflix API
    9. Netflix API Requests by Audience At Launch In 2008 Netflix Devices Open API Developers
    10. Netflix API
    11. Netflix API Requests by Audience From 2011 Netflix Devices Open API Developers
    12. Current Emphasis of Netflix API Netflix Devices
    13. Netflix API : Key Responsibilities • Broker data between services and Devices • Provide features and business logic • Maintain a resilient front-door • Scale the system • Maintain high velocity • Provide detailed insights into the system health
    14. Netflix API : Key Responsibilities • Broker data between services and Devices • Provide features and business logic • Maintain a resilient front-door • Scale the system • Maintain high velocity • Provide detailed insights into the system health
    15. APIs Do Lots of Things!
    16. Data Gathering Data Formatting Data Delivery Security Authorization Authentication System Scaling Discoverability Data Consistency Translations Throttling Orchestration APIs Do Lots of Things! These are some of the many things APIs do.
    17. Data Gathering Data Formatting Data Delivery Security Authorization Authentication System Scaling Discoverability Data Consistency Translations Throttling Orchestration APIs Do Lots of Things! These three are at the core. All others ultimately support them.
    18. Definitions • Data Gathering – Retrieving the requested data from one or many local or remote data sources • Data Formatting – Preparing a structured payload to the requesting agent • Data Delivery – Delivering the structured payload to the requesting agent
    19. 19. Meanwhile… There are two players in APIs
    20. 20. API Provider API Consumer
    21. 21. API Provider PROVIDES API Consumer CONSUMES Traditional API Interactions
    22. 22. API Provider PROVIDES EVERYTHING API Consumer CONSUMES WHAT IS PROVIDED Everything means, API Provider does: • Data Gathering • Data Formatting • Data Delivery • (among other things) Traditional API Interactions
    23. 23. Why do most API providers provide everything? • API design tends to be easier for teams closer to the source • Centralized API functions makes them easier to support • Many APIs have a large set of unknown and external developers
    24. 24. Why do most API providers provide everything? • API design tends to be easier for teams closer to the source • Centralized API functions makes them easier to support • Many APIs have a large set of unknown and external developers
    25. 25. Data Gathering Data Formatting Data Delivery API Consumer API Provider Separation of Concerns To be a better provider, the API should address the separation of concerns of the three core functions
    26. 26. Data Gathering Data Formatting Data Delivery API Consumer Don’t care how data is gathered, as long as it is gathered API Provider Care a lot about how the data is gathered Separation of Concerns
    27. 27. Data Gathering Data Formatting Data Delivery API Consumer Don’t care how data is gathered, as long as it is gathered Each consumer cares a lot about the format for that specific use API Provider Care a lot about how the data is gathered Only cares about the format to the extent it is easy to support Separation of Concerns
    28. 28. Data Gathering Data Formatting Data Delivery API Consumer Don’t care how data is gathered, as long as it is gathered Each consumer cares a lot about the format for that specific use Each consumer cares a lot about how payload is delivered API Provider Care a lot about how the data is gathered Only cares about the format to the extent it is easy to support Only cares about delivery method to the extent it is easy to support Separation of Concerns
    29. 29. Because of our separation of concerns, the Netflix API team is enabled to focus on different charters
    30. 30. Brokering Data to 1,000+ Device Types
    31. 31. Screen Real Estate
    32. 32. Controller
    33. 33. Technical Capabilities
    34. 34. One-Size-Fits-All API Request Request Request
    35. 35. Courtesy of South Florida Classical Review
    36. 36. Resource-Based API vs. Experience-Based API
    37. 37. Resource-Based Requests • /users/<id>/ratings/title • /users/<id>/queues • /users/<id>/queues/instant • /users/<id>/recommendations • /catalog/titles/movie • /catalog/titles/series • /catalog/people
    38. 38. OSFA API RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS Network Border Network Border
    39. 39. RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS OSFA API Network Border Network Border SERVER CODE CLIENT CODE
    40. 40. RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS OSFA API Network Border Network Border DATA GATHERING, FORMATTING, AND DELIVERY USER INTERFACE RENDERING
    41. 41. Experience-Based Requests • /ps3/homescreen
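The contrast between the resource-based requests on slide 37 and the single experience-based request above can be sketched as one device-specific endpoint that gathers and formats everything at once. The backend functions below are stand-ins for real Netflix services, invented for illustration.

```python
# Hypothetical sketch of an experience-based endpoint like /ps3/homescreen.
# The backend functions are stand-ins, not real Netflix service calls.

def get_member_data(user_id):
    return {"name": "Jane"}            # stand-in for the member service

def get_recommendations(user_id):
    return ["Movie A", "Movie B"]      # stand-in for the recommendation engine

def get_ab_test_cell(user_id):
    return "cell-2"                    # stand-in for the A/B test engine

def ps3_homescreen(user_id):
    """One request gathers and formats everything the PS3 home screen
    needs, instead of the device making many resource-based calls
    (/users/<id>/recommendations, /users/<id>/queues, ...)."""
    member = get_member_data(user_id)
    return {
        "greeting": "Welcome, " + member["name"],
        "rows": [{"title": "Top Picks", "items": get_recommendations(user_id)}],
        "abTestCell": get_ab_test_cell(user_id),
    }
```

The key design point from the slides: data gathering stays on the server, while formatting and delivery move into per-device adapter code owned by the client teams.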
    42. 42. JAVA API Network Border Network Border RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS Groovy Layer
    43. 43. RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS JAVA API SERVER CODE CLIENT CODE CLIENT ADAPTER CODE (WRITTEN BY CLIENT TEAMS, DYNAMICALLY UPLOADED TO SERVER) Network Border Network Border
    44. 44. RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS JAVA API DATA GATHERING DATA FORMATTING AND DELIVERY USER INTERFACE RENDERING Network Border Network Border
    45. 45. Netflix API : Key Responsibilities • Broker data between services and Devices • Provide features and business logic • Maintain a resilient front-door • Scale the system • Maintain high velocity • Provide detailed insights into the system health
    46. 46. 1000+ Device Types
    47. 47. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies Reviews A/B Test Engine Dozens of Dependencies
    48. 48. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
    49. 49. Dependency Relationships
    50. 50. 2,000,000,000 Incoming Requests Per Day to the Netflix API
    51. 51. 30 Distinct Dependent Services for the Netflix API
    52. 52. ~500 Dependency jars Slurped into the Netflix API
    53. 53. 14,000,000,000 Netflix API Outbound Calls Per Day to those Dependent Services
    54. 54. 0 Dependent Services with 100% SLA
    55. 55. 99.99%^30 = 99.7% 0.3% of 2B = 6M failures per day 2+ Hours of Downtime Per Month
    56. 56. 99.99%^30 = 99.7% 0.3% of 2B = 6M failures per day 2+ Hours of Downtime Per Month
    57. 57. 99.9%^30 = 97% 3% of 2B = 60M failures per day 20+ Hours of Downtime Per Month
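The arithmetic behind these slides compounds each dependency's SLA across all 30 services; a quick check (the slides round 2.96% up to 3%):

```python
# Compounded availability across 30 dependencies, 2B requests per day,
# as on the slides above.
deps = 30
requests_per_day = 2_000_000_000

for sla in (0.9999, 0.999):
    overall = sla ** deps                    # every dependency must succeed
    fail_rate = 1 - overall
    failures_per_day = fail_rate * requests_per_day
    downtime_hours = fail_rate * 30 * 24     # over a 30-day month
    print(f"{sla} SLA each -> {overall:.1%} overall, "
          f"{failures_per_day / 1e6:.0f}M failures/day, "
          f"{downtime_hours:.1f}h downtime/month")
```

Four nines per dependency still yields roughly 6M failed requests and over 2 hours of downtime per month; three nines pushes that to tens of millions of failures and 20+ hours, which is why fallbacks matter more than chasing perfect SLAs.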
    58. 58. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
    59. 59. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
    60. 60. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
    61. 61. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
    62. 62. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
    63. 63. Circuit Breaker Dashboard
    64. 64. Call Volume and Health / Last 10 Seconds
    65. 65. Call Volume / Last 2 Minutes
    66. 66. Successful Requests
    67. 67. Successful, But Slower Than Expected
    68. 68. Short-Circuited Requests, Delivering Fallbacks
    69. 69. Timeouts, Delivering Fallbacks
    70. 70. Thread Pool & Task Queue Full, Delivering Fallbacks
    71. 71. Exceptions, Delivering Fallbacks
    72. 72. Error Rate (# + # + # + #) / (# + # + # + # + #) = Error Rate
    73. 73. Status of Fallback Circuit
    74. 74. Requests per Second, Over Last 10 Seconds
    75. 75. SLA Information
    76. 76. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
    77. 77. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
    78. 78. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
    79. 79. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine Fallback
    80. 80. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine Fallback
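The fallback behavior in the dashboard slides above follows the circuit-breaker pattern (Netflix's implementation is Hystrix). A minimal sketch of the idea, greatly simplified; real Hystrix also tracks timeouts, thread-pool rejections, and rolling statistical windows:

```python
# Minimal circuit-breaker sketch in the spirit of Hystrix; simplified
# for illustration, not the actual Hystrix implementation.

class CircuitBreaker:
    def __init__(self, fallback, threshold=3):
        self.fallback = fallback       # static response when dependency fails
        self.threshold = threshold     # consecutive failures before tripping
        self.failures = 0
        self.open = False

    def call(self, fn):
        if self.open:                  # short-circuit: skip the dependency
            return self.fallback()
        try:
            result = fn()
            self.failures = 0          # a healthy call resets the counter
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open = True       # trip: deliver fallbacks from now on
            return self.fallback()
```

The point from the slides: when a dependency like the personalization engine fails, the API keeps serving (e.g. non-personalized rows) instead of failing the whole request.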
    81. 81. Netflix API : Key Responsibilities • Broker data between services and Devices • Provide features and business logic • Maintain a resilient front-door • Scale the system • Maintain high velocity • Provide detailed insights into the system health
    82. 82. Netflix API : Requests Per Month (Requests in Billions) 50x growth in 18 months
    83. 83. AWS Cloud
    84. 84. Netflix API : Requests Per Month (Requests in Billions)
    85. 85. Autoscaling
    86. 86. Autoscaling
    87. 87. Scryer : Predictive Auto Scaling Not yet…
    88. 88. Typical Traffic Patterns Over Five Days
    89. 89. Predicted RPS Compared to Actual RPS
    90. 90. Scaling Plan for Predicted Workload
    91. 91. What is Scryer Doing? • Evaluating needs based on historical data – Week over week, month over month metrics • Adjusts instance minimums based on algorithms • Relies on Amazon Auto Scaling for unpredicted events
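The slide above describes the approach: predict load from historical traffic, set instance minimums ahead of time, and let reactive autoscaling handle surprises. A hypothetical sketch of that idea; the function names and the hour-of-day averaging are assumptions for illustration, not Scryer's actual algorithms:

```python
# Hypothetical sketch in the spirit of Scryer's predictive scaling:
# derive per-hour instance minimums from historical RPS. Amazon's
# reactive autoscaling still covers unpredicted events.
import math

def predict_hourly_rps(history):
    """Average each hour of the day across prior days
    (week-over-week / month-over-month style)."""
    return [sum(day[h] for day in history) / len(history) for h in range(24)]

def scaling_plan(history, rps_per_instance, headroom=1.2):
    """Minimum instance count to keep running for each hour of the day."""
    return [max(1, math.ceil(rps * headroom / rps_per_instance))
            for rps in predict_hourly_rps(history)]
```

Pre-scaling to the predicted load is why the results slides show lower load averages and latencies than purely reactive scaling, which only reacts after traffic has already arrived.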
    92. 92. Results
    93. 93. Results : Load Average Reactive Predictive
    94. 94. Results : Response Latencies Reactive Predictive
    95. 95. Results : Outage Recovery
    96. 96. Results : AWS Costs
    97. 97. Scaling Globally
    98. 98. More than 48 Million Subscribers More than 40 Countries
    99. 99. Zuul Gatekeeper for the Netflix Streaming Application
    100. 100. Zuul * • Multi-Region Resiliency • Insights • Stress Testing • Canary Testing • Dynamic Routing • Load Shedding • Security • Static Response Handling • Authentication * Most closely resembles an API proxy
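Zuul implements capabilities like dynamic routing and canary traffic shifting through filters. A hypothetical illustration of the routing idea; real Zuul filters are Groovy/Java classes with a different API, so this shows only the concept:

```python
# Hypothetical routing filter in the spirit of Zuul's dynamic routing;
# real Zuul filters are Groovy/Java, this illustrates only the idea.
import random

class PercentageRoutingFilter:
    """Send a configurable fraction of requests to the canary cluster."""
    def __init__(self, canary_percent, canary_host, prod_host):
        self.canary_percent = canary_percent   # tunable at runtime
        self.canary_host = canary_host
        self.prod_host = prod_host

    def route(self, request):
        if random.random() * 100 < self.canary_percent:
            return self.canary_host            # funnel load to the canary
        return self.prod_host
```

Dialing `canary_percent` up is how the stress test mentioned earlier works: increase the canary's share of traffic until its RPS or error rate shows it can no longer keep up.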
    101. 101. All of these approaches are designed to prevent failures…
    102. 102. But sometimes the best way to prevent failures is to force them!
    103. 103. I randomly terminate instances in production to identify dormant failures. Chaos Monkey
    104. 104. Chaos Gorilla I simulate an outage of an entire Amazon availability zone.
    105. 105. I simulate an outage in an AWS region. Chaos Kong
    106. 106. I find instances that don’t adhere to best practices. Conformity Monkey
    107. 107. I extend Conformity Monkey to find security violations. Security Monkey
    108. 108. I detect unhealthy instances and remove them from service. Doctor Monkey
    109. 109. I clean up the clutter and waste that runs in the cloud. Janitor Monkey
    110. 110. I induce artificial delays and errors into services to determine how upstream services will respond. Latency Monkey
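The core of Chaos Monkey, the first of the monkeys above, is simple: pick a random instance in each group and terminate it, then verify the system self-heals. A hypothetical sketch, not Netflix's actual Simian Army code:

```python
# Hypothetical sketch of Chaos Monkey's core loop: terminate one random
# instance per group to flush out dormant failures. Not Netflix's code.
import random

def unleash_chaos_monkey(groups, terminate, enabled=True):
    """For each auto-scaling group, kill one randomly chosen instance."""
    victims = []
    if not enabled:                    # e.g. only run during business hours
        return victims
    for name, instances in groups.items():
        if instances:
            victim = random.choice(instances)
            terminate(victim)          # a healthy group should self-heal
            victims.append((name, victim))
    return victims
```

Running this in production, during business hours when engineers are around to respond, is what forces the failures the previous slide talks about.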
    111. 111. Netflix API : Key Responsibilities • Broker data between services and Devices • Provide features and business logic • Maintain a resilient front-door • Scale the system • Maintain high velocity • Provide detailed insights into the system health
    112. 112. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
    113. 113. Dependency Relationships
    114. 114. Testing Philosophy: Act Fast, React Fast
    115. 115. That Doesn’t Mean We Don’t Test
    116. 116. Automated Delivery Pipeline
    117. 117. Cloud-Based Deployment Techniques
    118. 118. Current Code In Production API Requests from the Internet
    119. 119. Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic) Current Code In Production API Requests from the Internet
    120. 120. Canary Analysis Automation
    121. 121. Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic) Current Code In Production API Requests from the Internet Error!
    122. 122. Current Code In Production API Requests from the Internet
    123. 123. Current Code In Production API Requests from the Internet
    124. 124. Current Code In Production API Requests from the Internet Perfect!
    125. 125. Stress Test with Zuul
    126. 126. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production
    127. 127. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production
    128. 128. Error! Current Code In Production API Requests from the Internet New Code Getting Prepared for Production
    129. 129. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production
    130. 130. Current Code In Production API Requests from the Internet Perfect!
    131. 131. Stress Test with Zuul
    132. 132. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production
    133. 133. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production
    134. 134. API Requests from the Internet New Code Getting Prepared for Production
    135. 135. https://www.github.com/Netflix
    136. 136. Maintaining the Front Door to Netflix Daniel Jacobson @daniel_jacobson http://www.linkedin.com/in/danieljacobson http://www.slideshare.net/danieljacobson