Maintaining the Front Door to Netflix : The Netflix API

Daniel Jacobson
Daniel JacobsonDirector of Engineering - Netflix API at Netflix
Maintaining the Front Door to Netflix
Daniel Jacobson
@daniel_jacobson
http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson
There are copious notes attached
to each slide in this presentation.
Please read those notes to get
the full context of the
presentation
Global Streaming Video
for TV Shows and Movies
More than 44 Million Subscribers
More than 40 Countries
Netflix Accounts for ~33% of Peak
Internet Traffic in North America
Netflix subscribers are watching more than 1 billion hours a month
Maintaining the Front Door to Netflix : The Netflix API
Maintaining the Front Door to Netflix : The Netflix API
Team Focus:
Build the Best Global Streaming Product
Three aspects of the Streaming Product:
• Non-Member
• Discovery
• Streaming
Key Responsibilities
• Broker data between services and UIs
• Maintain a resilient front-door
• Scale the system vertically and horizontally
• Maintain high velocity
But Before Streaming…
Maintaining the Front Door to Netflix : The Netflix API
Maintaining the Front Door to Netflix : The Netflix API
Monolithic Application
In Netflix Data Centers
The bigger the ship…
the slower it turns
Distributed Architecture
Maintaining the Front Door to Netflix : The Netflix API
1000+ Device Types
Personaliz
ation
Engine
User Info
Movie
Metadata
Movie
Ratings
Similar
Movies
Reviews
A/B Test
Engine
Dozens of Dependencies
Personaliz
ation
Engine
User Info
Movie
Metadata
Movie
Ratings
Similar
Movies
API
Reviews
A/B Test
Engine
Dependency Relationships
2,000,000,000
Requests Per Day to the
Netflix API
30
Distinct Dependent
Services for the Netflix API
~500
Dependency jars Slurped
into the Netflix API
14,000,000,000
Netflix API Calls Per Day to
those Dependent Services
0
Dependent Services with
100% SLA
99.99% = 99.7%30
0.3% of 2B = 6M failures per day
2+ Hours of Downtime
Per Month
99.99% = 99.7%30
0.3% of 2B = 6M failures per day
2+ Hours of Downtime
Per Month
99.9% = 97%30
3% of 2B = 60M failures per day
20+ Hours of Downtime
Per Month
Personaliz
ation
Engine
User Info
Movie
Metadata
Movie
Ratings
Similar
Movies
API
Reviews
A/B Test
Engine
Personaliz
ation
Engine
User Info
Movie
Metadata
Movie
Ratings
Similar
Movies
API
Reviews
A/B Test
Engine
Personaliz
ation
Engine
User Info
Movie
Metadata
Movie
Ratings
Similar
Movies
API
Reviews
A/B Test
Engine
Personaliz
ation
Engine
User Info
Movie
Metadata
Movie
Ratings
Similar
Movies
API
Reviews
A/B Test
Engine
Personaliz
ation
Engine
User Info
Movie
Metadata
Movie
Ratings
Similar
Movies
API
Reviews
A/B Test
Engine
Maintaining the Front Door to Netflix : The Netflix API
Circuit Breaker Dashboard
Maintaining the Front Door to Netflix : The Netflix API
Call Volume and Health / Last 10 Seconds
Call Volume / Last 2 Minutes
Successful Requests
Successful, But Slower Than Expected
Short-Circuited Requests, Delivering Fallbacks
Timeouts, Delivering Fallbacks
Thread Pool & Task Queue Full, Delivering Fallbacks
Exceptions, Delivering Fallbacks
Error Rate
# + # + # + # / (# + # + # + # + #) = Error Rate
Status of Fallback Circuit
Requests per Second, Over Last 10 Seconds
SLA Information
Personaliz
ation
Engine
User Info
Movie
Metadata
Movie
Ratings
Similar
Movies
API
Reviews
A/B Test
Engine
Personaliz
ation
Engine
User Info
Movie
Metadata
Movie
Ratings
Similar
Movies
API
Reviews
A/B Test
Engine
Personaliz
ation
Engine
User Info
Movie
Metadata
Movie
Ratings
Similar
Movies
API
Reviews
A/B Test
Engine
Personaliz
ation
Engine
User Info
Movie
Metadata
Movie
Ratings
Similar
Movies
API
Reviews
A/B Test
Engine
Fallback
Personaliz
ation
Engine
User Info
Movie
Metadata
Movie
Ratings
Similar
Movies
API
Reviews
A/B Test
Engine
Fallback
Scaling the Distributed System
Maintaining the Front Door to Netflix : The Netflix API
AWS Cloud
Maintaining the Front Door to Netflix : The Netflix API
Autoscaling
Autoscaling
Amazon Auto Scaling Limitations
• Hard to fit policies to variable traffic patterns
(weekday vs weekend)
• Limited control over capacity adjustments
(absolute value or %)
The Impact of AAS Limitations
• Traffic drop can lead to scale downs during
outage
• Performance degradation between new
instance launch and taking traffic
• Excess capacity at peak and trough
Scryer : Predictive Auto Scaling
Not yet…
Typical Traffic Patterns Over Five Days
Predicted RPS Compared to Actual RPS
Scaling Plan for Predicted Workload
What is Scryer Doing?
• Evaluating needs based on historical data
– Week over week, month over month metrics
• Adjusts instance minimums based on
algorithms
• Relies on Amazon Auto Scaling for unpredicted
events
Results
Results : Load Average
Reactive
Predictive
Results : Response Latencies
Reactive
Predictive
Results : Outage Recovery
Results : Outage Recovery
Results : AWS Costs
Scaling Globally
More than 44 Million Subscribers
More than 40 Countries
Zuul
Gatekeeper for the Netflix Streaming Application
Zuul *
• Multi-Region
Resiliency
• Insights
• Stress Testing
• Canary Testing
• Dynamic Routing
• Load Shedding
• Security
• Static Response
Handling
• Authentication
* Most closely resembles an API proxy
Isthmus
Maintaining the Front Door to Netflix : The Netflix API
All of these approaches are
designed to prevent failures…
But sometimes the best way to
prevent failures is to force them!
Maintaining the Front Door to Netflix : The Netflix API
I randomly
terminate instances
in production to
identify dormant
failures.
Chaos
Monkey
Chaos
Gorilla
I simulate an
outage of an
entire Amazon
availability zone.
I simulate an
outage in an AWS
region.
Chaos
Kong
I find instances that
don’t adhere to
best practices.
Conformity
Monkey
I extend Conformity
Monkey to find
security violations.
Security
Monkey
I detect unhealthy
instances and
remove them
from service.
Doctor
Monkey
I clean up the
clutter and waste
that runs in the
cloud.
Janitor
Monkey
I induce artificial
delays and errors into
services to determine
how upstream services
will respond.
Latency
Monkey
Maintaining the Front Door to Netflix : The Netflix API
Deployments in the Cloud
Dependency Relationships
Maintaining the Front Door to Netflix : The Netflix API
Testing Philosophy:
Act Fast, React Fast
That Doesn’t Mean We Don’t Test
Automated Delivery Pipeline
Cloud-Based Deployment Techniques
Current Code
In Production
API Requests from
the Internet
Single Canary Instance
To Test New Code with Production Traffic
(around 1% or less of traffic)
Current Code
In Production
API Requests from
the Internet
Canary Analysis Automation
Single Canary Instance
To Test New Code with Production Traffic
(around 1% or less of traffic)
Current Code
In Production
API Requests from
the Internet
Error!
Current Code
In Production
API Requests from
the Internet
Current Code
In Production
API Requests from
the Internet
Current Code
In Production
API Requests from
the Internet
Perfect!
Stress Test with Zuul
Current Code
In Production
API Requests from
the Internet
New Code
Getting Prepared for Production
Current Code
In Production
API Requests from
the Internet
New Code
Getting Prepared for Production
Error!
Current Code
In Production
API Requests from
the Internet
New Code
Getting Prepared for Production
Current Code
In Production
API Requests from
the Internet
New Code
Getting Prepared for Production
Current Code
In Production
API Requests from
the Internet
Perfect!
Stress Test with Zuul
Current Code
In Production
API Requests from
the Internet
New Code
Getting Prepared for Production
Current Code
In Production
API Requests from
the Internet
New Code
Getting Prepared for Production
API Requests from
the Internet
New Code
Getting Prepared for Production
Brokering Data to
1,000+ Device Types
Maintaining the Front Door to Netflix : The Netflix API
Maintaining the Front Door to Netflix : The Netflix API
Screen Real Estate
Controller
Technical Capabilities
One-Size-Fits-All
API
Request
Request
Request
Courtesy of South Florida Classical Review
Maintaining the Front Door to Netflix : The Netflix API
Resource-Based API
vs.
Experience-Based API
Resource-Based Requests
• /users/<id>/ratings/title
• /users/<id>/queues
• /users/<id>/queues/instant
• /users/<id>/recommendations
• /catalog/titles/movie
• /catalog/titles/series
• /catalog/people
REST API
RECOMME
NDATIONS
MOVIE
DATA
SIMILAR
MOVIES
AUTH
MEMBER
DATA
A/B
TESTS
START-
UP
RATINGS
Network Border Network Border
RECOMME
NDATIONS
MOVIE
DATA
SIMILAR
MOVIES
AUTH
MEMBER
DATA
A/B
TESTS
START-
UP
RATINGS
OSFA API
Network Border Network Border
SERVER CODE
CLIENT CODE
RECOMME
NDATIONS
MOVIE
DATA
SIMILAR
MOVIES
AUTH
MEMBER
DATA
A/B
TESTS
START-
UP
RATINGS
OSFA API
Network Border Network Border
DATA GATHERING,
FORMATTING,
AND DELIVERY
USER INTERFACE
RENDERING
Maintaining the Front Door to Netflix : The Netflix API
Maintaining the Front Door to Netflix : The Netflix API
Experience-Based Requests
• /ps3/homescreen
JAVA API
Network Border Network Border
RECOMME
NDATIONS
MOVIE
DATA
SIMILAR
MOVIES
AUTH
MEMBER
DATA
A/B
TESTS
START-
UP
RATINGS
Groovy Layer
Maintaining the Front Door to Netflix : The Netflix API
RECOMME
NDATIONSA
ZXSXX C
CCC
MOVIE
DATA
SIMILAR
MOVIES
AUTH
MEMBER
DATA
A/B
TESTS
START-
UP
RATINGS
JAVA API
SERVER CODE
CLIENT CODE
CLIENT ADAPTER CODE
(WRITTEN BY CLIENT TEAMS, DYNAMICALLY UPLOADED TO SERVER)
Network Border Network Border
RECOMME
NDATIONSA
ZXSXX C
CCC
MOVIE
DATA
SIMILAR
MOVIES
AUTH
MEMBER
DATA
A/B
TESTS
START-
UP
RATINGS
JAVA API
DATA GATHERING
DATA FORMATTING
AND DELIVERY
USER INTERFACE
RENDERING
Network Border Network Border
Maintaining the Front Door to Netflix : The Netflix API
https://www.github.com/Netflix
Maintaining the Front Door to Netflix
Daniel Jacobson
@daniel_jacobson
http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson
1 of 138

Recommended

Kafka 101 and Developer Best Practices by
Kafka 101 and Developer Best PracticesKafka 101 and Developer Best Practices
Kafka 101 and Developer Best Practicesconfluent
1.4K views58 slides
Apache Kafka in the Airline, Aviation and Travel Industry by
Apache Kafka in the Airline, Aviation and Travel IndustryApache Kafka in the Airline, Aviation and Travel Industry
Apache Kafka in the Airline, Aviation and Travel IndustryKai Wähner
2.5K views60 slides
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d... by
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...Kai Wähner
45.1K views51 slides
MLOps with Azure DevOps by
MLOps with Azure DevOpsMLOps with Azure DevOps
MLOps with Azure DevOpsMarco Parenzan
389 views35 slides
Cassandra Operations at Netflix by
Cassandra Operations at NetflixCassandra Operations at Netflix
Cassandra Operations at Netflixgreggulrich
12.8K views18 slides
Kafka Intro With Simple Java Producer Consumers by
Kafka Intro With Simple Java Producer ConsumersKafka Intro With Simple Java Producer Consumers
Kafka Intro With Simple Java Producer ConsumersJean-Paul Azar
2.9K views62 slides

More Related Content

What's hot

Running Kubernetes Across Multiple AWS Accounts (CON409) - AWS re:Invent 2018 by
Running Kubernetes Across Multiple AWS Accounts (CON409) - AWS re:Invent 2018Running Kubernetes Across Multiple AWS Accounts (CON409) - AWS re:Invent 2018
Running Kubernetes Across Multiple AWS Accounts (CON409) - AWS re:Invent 2018Amazon Web Services
2.7K views19 slides
DataOps: An Agile Method for Data-Driven Organizations by
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsEllen Friedman
2.3K views52 slides
Oracle RAC Virtualized - In VMs, in Containers, On-premises, and in the Cloud by
Oracle RAC Virtualized - In VMs, in Containers, On-premises, and in the CloudOracle RAC Virtualized - In VMs, in Containers, On-premises, and in the Cloud
Oracle RAC Virtualized - In VMs, in Containers, On-premises, and in the CloudMarkus Michalewicz
2.7K views28 slides
Disaster Recovery and High Availability with Kafka, SRM and MM2 by
Disaster Recovery and High Availability with Kafka, SRM and MM2Disaster Recovery and High Availability with Kafka, SRM and MM2
Disaster Recovery and High Availability with Kafka, SRM and MM2Abdelkrim Hadjidj
1.7K views37 slides
Apache Kafka Architecture & Fundamentals Explained by
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explainedconfluent
27.7K views33 slides
Principles of System Observability by
Principles of System Observability Principles of System Observability
Principles of System Observability Janis Orlovs
523 views23 slides

What's hot(20)

Running Kubernetes Across Multiple AWS Accounts (CON409) - AWS re:Invent 2018 by Amazon Web Services
Running Kubernetes Across Multiple AWS Accounts (CON409) - AWS re:Invent 2018Running Kubernetes Across Multiple AWS Accounts (CON409) - AWS re:Invent 2018
Running Kubernetes Across Multiple AWS Accounts (CON409) - AWS re:Invent 2018
Amazon Web Services2.7K views
DataOps: An Agile Method for Data-Driven Organizations by Ellen Friedman
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven Organizations
Ellen Friedman2.3K views
Oracle RAC Virtualized - In VMs, in Containers, On-premises, and in the Cloud by Markus Michalewicz
Oracle RAC Virtualized - In VMs, in Containers, On-premises, and in the CloudOracle RAC Virtualized - In VMs, in Containers, On-premises, and in the Cloud
Oracle RAC Virtualized - In VMs, in Containers, On-premises, and in the Cloud
Markus Michalewicz2.7K views
Disaster Recovery and High Availability with Kafka, SRM and MM2 by Abdelkrim Hadjidj
Disaster Recovery and High Availability with Kafka, SRM and MM2Disaster Recovery and High Availability with Kafka, SRM and MM2
Disaster Recovery and High Availability with Kafka, SRM and MM2
Abdelkrim Hadjidj1.7K views
Apache Kafka Architecture & Fundamentals Explained by confluent
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
confluent27.7K views
Principles of System Observability by Janis Orlovs
Principles of System Observability Principles of System Observability
Principles of System Observability
Janis Orlovs523 views
Netflix Global Cloud Architecture by Adrian Cockcroft
Netflix Global Cloud ArchitectureNetflix Global Cloud Architecture
Netflix Global Cloud Architecture
Adrian Cockcroft71.2K views
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker by Provectus
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerMLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
Provectus274 views
Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ... by Kai Wähner
Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...
Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...
Kai Wähner3.3K views
Building Serverless ETL Pipelines with AWS Glue by Amazon Web Services
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
Amazon Web Services1.5K views
Tuning ML Models: Scaling, Workflows, and Architecture by Databricks
Tuning ML Models: Scaling, Workflows, and ArchitectureTuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and Architecture
Databricks440 views
Apache Flink and what it is used for by Aljoscha Krettek
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
Aljoscha Krettek1.4K views
Prometheus (Prometheus London, 2016) by Brian Brazil
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
Brian Brazil1.2K views
Understanding DataOps and Its Impact on Application Quality by DevOps.com
Understanding DataOps and Its Impact on Application QualityUnderstanding DataOps and Its Impact on Application Quality
Understanding DataOps and Its Impact on Application Quality
DevOps.com534 views
Technical Introduction to PostgreSQL and PPAS by Ashnikbiz
Technical Introduction to PostgreSQL and PPASTechnical Introduction to PostgreSQL and PPAS
Technical Introduction to PostgreSQL and PPAS
Ashnikbiz5.5K views
Data Modeling in Looker by Looker
Data Modeling in LookerData Modeling in Looker
Data Modeling in Looker
Looker2.6K views
Iib v10 performance problem determination examples by MartinRoss_IBM
Iib v10 performance problem determination examplesIib v10 performance problem determination examples
Iib v10 performance problem determination examples
MartinRoss_IBM12K views

Viewers also liked

API Revolutions : Netflix's API Redesign by
API Revolutions : Netflix's API RedesignAPI Revolutions : Netflix's API Redesign
API Revolutions : Netflix's API RedesignDaniel Jacobson
35.7K views74 slides
Netflix Edge Engineering Open House Presentations - June 9, 2016 by
Netflix Edge Engineering Open House Presentations - June 9, 2016Netflix Edge Engineering Open House Presentations - June 9, 2016
Netflix Edge Engineering Open House Presentations - June 9, 2016Daniel Jacobson
4.8K views110 slides
Maintaining the Netflix Front Door - Presentation at Intuit Meetup by
Maintaining the Netflix Front Door - Presentation at Intuit MeetupMaintaining the Netflix Front Door - Presentation at Intuit Meetup
Maintaining the Netflix Front Door - Presentation at Intuit MeetupDaniel Jacobson
5.4K views156 slides
Netflix API - Separation of Concerns by
Netflix API - Separation of ConcernsNetflix API - Separation of Concerns
Netflix API - Separation of ConcernsDaniel Jacobson
10.7K views40 slides
Set Your Content Free! : Case Studies from Netflix and NPR by
Set Your Content Free! : Case Studies from Netflix and NPRSet Your Content Free! : Case Studies from Netflix and NPR
Set Your Content Free! : Case Studies from Netflix and NPRDaniel Jacobson
35.1K views80 slides
RESTful API Design, Second Edition by
RESTful API Design, Second EditionRESTful API Design, Second Edition
RESTful API Design, Second EditionApigee | Google Cloud
180.1K views106 slides

Viewers also liked(20)

API Revolutions : Netflix's API Redesign by Daniel Jacobson
API Revolutions : Netflix's API RedesignAPI Revolutions : Netflix's API Redesign
API Revolutions : Netflix's API Redesign
Daniel Jacobson35.7K views
Netflix Edge Engineering Open House Presentations - June 9, 2016 by Daniel Jacobson
Netflix Edge Engineering Open House Presentations - June 9, 2016Netflix Edge Engineering Open House Presentations - June 9, 2016
Netflix Edge Engineering Open House Presentations - June 9, 2016
Daniel Jacobson4.8K views
Maintaining the Netflix Front Door - Presentation at Intuit Meetup by Daniel Jacobson
Maintaining the Netflix Front Door - Presentation at Intuit MeetupMaintaining the Netflix Front Door - Presentation at Intuit Meetup
Maintaining the Netflix Front Door - Presentation at Intuit Meetup
Daniel Jacobson5.4K views
Netflix API - Separation of Concerns by Daniel Jacobson
Netflix API - Separation of ConcernsNetflix API - Separation of Concerns
Netflix API - Separation of Concerns
Daniel Jacobson10.7K views
Set Your Content Free! : Case Studies from Netflix and NPR by Daniel Jacobson
Set Your Content Free! : Case Studies from Netflix and NPRSet Your Content Free! : Case Studies from Netflix and NPR
Set Your Content Free! : Case Studies from Netflix and NPR
Daniel Jacobson35.1K views
Netflix marketing plan by Evelyne Otto
Netflix marketing plan Netflix marketing plan
Netflix marketing plan
Evelyne Otto286.8K views
BDX 2016- Monal daxini @ Netflix by Ido Shilon
BDX 2016-  Monal daxini  @ NetflixBDX 2016-  Monal daxini  @ Netflix
BDX 2016- Monal daxini @ Netflix
Ido Shilon7.5K views
Culture (Original 2009 version) by Reed Hastings
Culture (Original 2009 version)Culture (Original 2009 version)
Culture (Original 2009 version)
Reed Hastings1M views
A001321289 Strategic Human Resource Development by Gareth Noble
A001321289 Strategic Human Resource DevelopmentA001321289 Strategic Human Resource Development
A001321289 Strategic Human Resource Development
Gareth Noble1.2K views
Allianz x api_management_servic_fabric by Michele Danieli
Allianz x api_management_servic_fabricAllianz x api_management_servic_fabric
Allianz x api_management_servic_fabric
Michele Danieli316 views
Scaling the Netflix API - OSCON by Daniel Jacobson
Scaling the Netflix API - OSCONScaling the Netflix API - OSCON
Scaling the Netflix API - OSCON
Daniel Jacobson7.5K views
APIs for Internal Audiences - Netflix - App Dev Conference by Daniel Jacobson
APIs for Internal Audiences - Netflix - App Dev ConferenceAPIs for Internal Audiences - Netflix - App Dev Conference
APIs for Internal Audiences - Netflix - App Dev Conference
Daniel Jacobson1.7K views
Tv over the internet by Enrico Mosca
Tv over the internetTv over the internet
Tv over the internet
Enrico Mosca409 views
A Network View of Netflix by Joel Samen
A Network View of NetflixA Network View of Netflix
A Network View of Netflix
Joel Samen974 views
Make Your API Catalog Essential with z/OS Connect EE by Teodoro Cipresso
Make Your API Catalog Essential with z/OS Connect EEMake Your API Catalog Essential with z/OS Connect EE
Make Your API Catalog Essential with z/OS Connect EE
Teodoro Cipresso700 views

Similar to Maintaining the Front Door to Netflix : The Netflix API

Scaling the Netflix API - From Atlassian Dev Den by
Scaling the Netflix API - From Atlassian Dev DenScaling the Netflix API - From Atlassian Dev Den
Scaling the Netflix API - From Atlassian Dev DenDaniel Jacobson
12.1K views124 slides
Maintaining the Front Door to Netflix by
Maintaining the Front Door to NetflixMaintaining the Front Door to Netflix
Maintaining the Front Door to NetflixBenjamin Schmaus
2.5K views136 slides
Techniques for Scaling the Netflix API - QCon SF by
Techniques for Scaling the Netflix API - QCon SFTechniques for Scaling the Netflix API - QCon SF
Techniques for Scaling the Netflix API - QCon SFDaniel Jacobson
2.3K views98 slides
Presentation to ESPN about the Netflix API by
Presentation to ESPN about the Netflix APIPresentation to ESPN about the Netflix API
Presentation to ESPN about the Netflix APIDaniel Jacobson
2.2K views37 slides
Netflix API - Presentation to PayPal by
Netflix API - Presentation to PayPalNetflix API - Presentation to PayPal
Netflix API - Presentation to PayPalDaniel Jacobson
25.9K views101 slides
Immutable Infrastructure: Rise of the Machine Images by
Immutable Infrastructure: Rise of the Machine ImagesImmutable Infrastructure: Rise of the Machine Images
Immutable Infrastructure: Rise of the Machine ImagesC4Media
806 views95 slides

Similar to Maintaining the Front Door to Netflix : The Netflix API(20)

Scaling the Netflix API - From Atlassian Dev Den by Daniel Jacobson
Scaling the Netflix API - From Atlassian Dev DenScaling the Netflix API - From Atlassian Dev Den
Scaling the Netflix API - From Atlassian Dev Den
Daniel Jacobson12.1K views
Maintaining the Front Door to Netflix by Benjamin Schmaus
Maintaining the Front Door to NetflixMaintaining the Front Door to Netflix
Maintaining the Front Door to Netflix
Benjamin Schmaus2.5K views
Techniques for Scaling the Netflix API - QCon SF by Daniel Jacobson
Techniques for Scaling the Netflix API - QCon SFTechniques for Scaling the Netflix API - QCon SF
Techniques for Scaling the Netflix API - QCon SF
Daniel Jacobson2.3K views
Presentation to ESPN about the Netflix API by Daniel Jacobson
Presentation to ESPN about the Netflix APIPresentation to ESPN about the Netflix API
Presentation to ESPN about the Netflix API
Daniel Jacobson2.2K views
Netflix API - Presentation to PayPal by Daniel Jacobson
Netflix API - Presentation to PayPalNetflix API - Presentation to PayPal
Netflix API - Presentation to PayPal
Daniel Jacobson25.9K views
Immutable Infrastructure: Rise of the Machine Images by C4Media
Immutable Infrastructure: Rise of the Machine ImagesImmutable Infrastructure: Rise of the Machine Images
Immutable Infrastructure: Rise of the Machine Images
C4Media806 views
Oscon2014 Netflix API - Top 10 Lessons Learned by Sangeeta Narayanan
Oscon2014 Netflix API - Top 10 Lessons LearnedOscon2014 Netflix API - Top 10 Lessons Learned
Oscon2014 Netflix API - Top 10 Lessons Learned
Sangeeta Narayanan846 views
Move Fast;Stay Safe:Developing & Deploying the Netflix API by Sangeeta Narayanan
Move Fast;Stay Safe:Developing & Deploying the Netflix APIMove Fast;Stay Safe:Developing & Deploying the Netflix API
Move Fast;Stay Safe:Developing & Deploying the Netflix API
Sangeeta Narayanan1.4K views
Netflix API : BAPI 2011 Presentation : SF by Daniel Jacobson
Netflix API : BAPI 2011 Presentation : SFNetflix API : BAPI 2011 Presentation : SF
Netflix API : BAPI 2011 Presentation : SF
Daniel Jacobson3K views
AWS re:Invent 2016: Getting Started with Serverless Architectures (CMP211) by Amazon Web Services
AWS re:Invent 2016: Getting Started with Serverless Architectures (CMP211)AWS re:Invent 2016: Getting Started with Serverless Architectures (CMP211)
AWS re:Invent 2016: Getting Started with Serverless Architectures (CMP211)
Amazon Web Services2.1K views
Netapp Michael Galpin by rajivmordani
Netapp Michael GalpinNetapp Michael Galpin
Netapp Michael Galpin
rajivmordani950 views
Pros and Cons of a MicroServices Architecture talk at AWS ReInvent by Sudhir Tonse
Pros and Cons of a MicroServices Architecture talk at AWS ReInventPros and Cons of a MicroServices Architecture talk at AWS ReInvent
Pros and Cons of a MicroServices Architecture talk at AWS ReInvent
Sudhir Tonse18.7K views
Start Up Austin 2017: Production Preview - How to Stop Bad Things From Happening by Amazon Web Services
Start Up Austin 2017: Production Preview - How to Stop Bad Things From HappeningStart Up Austin 2017: Production Preview - How to Stop Bad Things From Happening
Start Up Austin 2017: Production Preview - How to Stop Bad Things From Happening
Starting Your DevOps Journey – Practical Tips for Ops by Dynatrace
Starting Your DevOps Journey – Practical Tips for OpsStarting Your DevOps Journey – Practical Tips for Ops
Starting Your DevOps Journey – Practical Tips for Ops
Dynatrace1.6K views
Petabytes of Data & No Servers: Corteva Scales DNA Analysis to Meet Increasin... by Amazon Web Services
Petabytes of Data & No Servers: Corteva Scales DNA Analysis to Meet Increasin...Petabytes of Data & No Servers: Corteva Scales DNA Analysis to Meet Increasin...
Petabytes of Data & No Servers: Corteva Scales DNA Analysis to Meet Increasin...
Gluecon 2013 netflix api crash course by Benjamin Schmaus
Gluecon 2013   netflix api crash courseGluecon 2013   netflix api crash course
Gluecon 2013 netflix api crash course
Benjamin Schmaus3.2K views

More from Daniel Jacobson

Top 10 Lessons Learned from the Netflix API - OSCON 2014 by
Top 10 Lessons Learned from the Netflix API - OSCON 2014Top 10 Lessons Learned from the Netflix API - OSCON 2014
Top 10 Lessons Learned from the Netflix API - OSCON 2014Daniel Jacobson
29.6K views87 slides
Why API? - Business of APIs Conference by
Why API? - Business of APIs ConferenceWhy API? - Business of APIs Conference
Why API? - Business of APIs ConferenceDaniel Jacobson
4.9K views50 slides
Netflix API: Keynote at Disney Tech Conference by
Netflix API: Keynote at Disney Tech ConferenceNetflix API: Keynote at Disney Tech Conference
Netflix API: Keynote at Disney Tech ConferenceDaniel Jacobson
2.9K views142 slides
Netflix API by
Netflix APINetflix API
Netflix APIDaniel Jacobson
2.7K views111 slides
Redesigning the Netflix API - OSCON by
Redesigning the Netflix API - OSCONRedesigning the Netflix API - OSCON
Redesigning the Netflix API - OSCONDaniel Jacobson
4.8K views45 slides
History and Future of the Netflix API - Mashery Evolution of Distribution by
History and Future of the Netflix API - Mashery Evolution of DistributionHistory and Future of the Netflix API - Mashery Evolution of Distribution
History and Future of the Netflix API - Mashery Evolution of DistributionDaniel Jacobson
3.8K views31 slides

More from Daniel Jacobson(13)

Top 10 Lessons Learned from the Netflix API - OSCON 2014 by Daniel Jacobson
Top 10 Lessons Learned from the Netflix API - OSCON 2014Top 10 Lessons Learned from the Netflix API - OSCON 2014
Top 10 Lessons Learned from the Netflix API - OSCON 2014
Daniel Jacobson29.6K views
Why API? - Business of APIs Conference by Daniel Jacobson
Why API? - Business of APIs ConferenceWhy API? - Business of APIs Conference
Why API? - Business of APIs Conference
Daniel Jacobson4.9K views
Netflix API: Keynote at Disney Tech Conference by Daniel Jacobson
Netflix API: Keynote at Disney Tech ConferenceNetflix API: Keynote at Disney Tech Conference
Netflix API: Keynote at Disney Tech Conference
Daniel Jacobson2.9K views
Redesigning the Netflix API - OSCON by Daniel Jacobson
Redesigning the Netflix API - OSCONRedesigning the Netflix API - OSCON
Redesigning the Netflix API - OSCON
Daniel Jacobson4.8K views
History and Future of the Netflix API - Mashery Evolution of Distribution by Daniel Jacobson
History and Future of the Netflix API - Mashery Evolution of DistributionHistory and Future of the Netflix API - Mashery Evolution of Distribution
History and Future of the Netflix API - Mashery Evolution of Distribution
Daniel Jacobson3.8K views
The future-of-netflix-api by Daniel Jacobson
The future-of-netflix-apiThe future-of-netflix-api
The future-of-netflix-api
Daniel Jacobson233.2K views
NPR Presentation at Wolfram Data Summit 2010 by Daniel Jacobson
NPR Presentation at Wolfram Data Summit 2010NPR Presentation at Wolfram Data Summit 2010
NPR Presentation at Wolfram Data Summit 2010
Daniel Jacobson1.5K views
NPR: Digital Distribution Strategy: OSCON2010 by Daniel Jacobson
NPR: Digital Distribution Strategy: OSCON2010NPR: Digital Distribution Strategy: OSCON2010
NPR: Digital Distribution Strategy: OSCON2010
Daniel Jacobson3.1K views
NPR's Digital Distribution and Mobile Strategy by Daniel Jacobson
NPR's Digital Distribution and Mobile StrategyNPR's Digital Distribution and Mobile Strategy
NPR's Digital Distribution and Mobile Strategy
Daniel Jacobson1.9K views

Recently uploaded

KVM Security Groups Under the Hood - Wido den Hollander - Your.Online by
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineKVM Security Groups Under the Hood - Wido den Hollander - Your.Online
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineShapeBlue
75 views19 slides
Microsoft Power Platform.pptx by
Microsoft Power Platform.pptxMicrosoft Power Platform.pptx
Microsoft Power Platform.pptxUni Systems S.M.S.A.
61 views38 slides
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ... by
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...ShapeBlue
46 views28 slides
Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De... by
Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De...Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De...
Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De...Moses Kemibaro
27 views38 slides
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... by
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...Jasper Oosterveld
27 views49 slides
Network Source of Truth and Infrastructure as Code revisited by
Network Source of Truth and Infrastructure as Code revisitedNetwork Source of Truth and Infrastructure as Code revisited
Network Source of Truth and Infrastructure as Code revisitedNetwork Automation Forum
32 views45 slides

Recently uploaded(20)

KVM Security Groups Under the Hood - Wido den Hollander - Your.Online by ShapeBlue
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineKVM Security Groups Under the Hood - Wido den Hollander - Your.Online
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online
ShapeBlue75 views
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ... by ShapeBlue
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
ShapeBlue46 views
Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De... by Moses Kemibaro
Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De...Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De...
Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De...
Moses Kemibaro27 views
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... by Jasper Oosterveld
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue by ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueWhat’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
ShapeBlue89 views
Five Things You SHOULD Know About Postman by Postman
Five Things You SHOULD Know About PostmanFive Things You SHOULD Know About Postman
Five Things You SHOULD Know About Postman
Postman38 views
HTTP headers that make your website go faster - devs.gent November 2023 by Thijs Feryn
HTTP headers that make your website go faster - devs.gent November 2023HTTP headers that make your website go faster - devs.gent November 2023
HTTP headers that make your website go faster - devs.gent November 2023
Thijs Feryn26 views
Data Integrity for Banking and Financial Services by Precisely
Data Integrity for Banking and Financial ServicesData Integrity for Banking and Financial Services
Data Integrity for Banking and Financial Services
Precisely29 views
Business Analyst Series 2023 - Week 3 Session 5 by DianaGray10
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5
DianaGray10345 views
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker48 views
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue by ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
ShapeBlue25 views
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT by ShapeBlue
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBITUpdates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
ShapeBlue66 views
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P... by ShapeBlue
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
ShapeBlue60 views
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or... by ShapeBlue
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
ShapeBlue64 views
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue by ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
ShapeBlue31 views

Maintaining the Front Door to Netflix : The Netflix API

Editor's Notes

  1. Netflix strives to be the global streaming video leader for TV shows and movies
  2. We now have more than 44 million global subscribers in more than 40 countries
  3. Those subscribers consume more than a billion hours of streaming video a month which accounts for about 33% of the peak Internet traffic in the US.
  4. Our 44 million Netflix subscribers are watching shows and movies on virtually any device that has a streaming video screen. We are now on more than 1,000 different device types.
  5. The subscribers can watch our original shows like Emmy-winning House of Cards.
  6. Within this world, the Edge Engineering team focuses on these three aspects of the streaming product.
  7. Before all of this streaming success, however, our roots were in supporting the DVD business.
  8. The system that ran that business was a large monolithic application that was developed and maintained by many teams.
  9. And that monolithic application ran in data centers that we needed to scale and maintain directly.
  10. And as we all know, the bigger the ship, the slower it turns. That was very much the case for Netflix years ago.
  11. To grow to where we knew we needed to be, Netflix aggressively moved to a distributed architecture.
  12. I like to think of this distributed architecture as being shaped like an hourglass…
  13. In the top end of the hourglass, we have our device and UI teams who build out great user experiences on Netflix-branded devices. To put that into perspective, there are a few hundred more device types that we support than engineers at Netflix.
  14. At the bottom end of the hourglass, there are several dozen dependency teams who focus on things like metadata, algorithms, authentication services, A/B test engines, etc.
  15. The API is at the center of the hourglass, acting as a broker of data.
  16. Our distributed architecture, with the number of systems involved, can get quite complicated. Each of these systems talks to a large number of other systems within our architecture.
  17. Assuming each of the services have SLAs of four nines, that results in more than two hours of downtime per month.
  18. And that is if all services maintain four nines!
  19. If it degrades as far as to three nines, that is almost one day per month of downtime!
  20. So, back to the hourglass…
  21. In the old world, the system was vulnerable to such failures. For example, if one of our dependency services fails…
  22. Such a failure could have resulted in an outage in the API.
  23. And that outage likely would have cascaded to have some kind of substantive impact on the devices.
  24. The challenge for the API team is to be resilient against dependency outages, to ultimately insulate Netflix customers from low level system problems and to keep them happy.
  25. To solve this problem, we created Hystrix, as wrapping technology that provides fault tolerance in a distributed environment. Hystrix is also open source and available at our github repository.
  26. To achieve this, we implemented a series of circuit breakers for each library that we depend on. Each circuit breaker controls the interaction between the API and that dependency. This image is a view of the dependency monitor that allows us to view the health and activity of each dependency. This dashboard is designed to give a real-time view of what is happening with these dependencies (over the last two minutes). We have other dashboards that provide insight into longer-term trends, day-over-day views, etc.
  27. This is a view of asingle circuit.
  28. This circle represents the call volume and health of the dependency over the last 10 seconds. This circle is meant to be a visual indicator for health. The circle is green for healthy, yellow for borderline, and red for unhealthy. Moreover, the size of the circle represents the call volumes, where bigger circles mean more traffic.
  29. The blue line represents the traffic trends over the last two minutes for this dependency.
  30. The green number shows the number of successful calls to this dependency over the last two minutes.
  31. The yellow number shows the number of latent calls into the dependency. These calls ultimately return successful responses, but slower than expected.
  32. The blue number shows the number of calls that were handled by the short-circuited fallback mechanisms. That is, if the circuit gets tripped, the blue number will start to go up.
  33. The orange number shows the number of calls that have timed out, resulting in fallback responses.
  34. The purple number shows the number of calls that fail due to queuing issues, resulting in fallback responses.
  35. The red number shows the number of exceptions, resulting in fallback responses.
  36. The error rate is calculated from the total number of error and fallback responses divided by the total number calls handled.
  37. If the error rate exceeds a certain number, the circuit to the fallback scenario is automatically opened. When it returns below that threshold, the circuit is closed again.
  38. The dashboard also shows host and cluster information for the dependency.
  39. As well as information about our SLAs.
  40. So, going back to the engineering diagram…
  41. If that same service fails today…
  42. We simply disconnect from that service.
  43. And replace it with an appropriate fallback. The fallback, ideally is a slightly degrade, but useful offering. If we cannot get that, however, we will quickly provide a 5xx response which will help the systems shed load rather than queue things up (which could eventually cause the system as a whole to tip over).
  44. This will keep our customers happy, even if the experience may be slightly degraded. It is important to note that different dependency libraries have different fallback scenarios. And some are more resilient than others. But the overall sentiment here is accurate at a high level.
  45. In addition to the migration to a distributed architecture, we also aggressively moved out of data centers…
  46. And into the cloud.
  47. Instead of spending in data centers, we spend out time in tools such as Asgard, created by Netflix staff, to help us manage our instance types and counts in AWS. Asgard is available in our open source repository at github.
  48. Another feature afforded to us through AWS to help us scale is Autoscaling. This is the Netflix API request rates over a span of time. The red line represents a potential capacity needed in a data center to ensure that the spikes could be handled without spending a ton more than is needed for the really unlikely scenarios.
  49. Through autoscaling, instead of buying new servers based on projected spikes in traffic and having systems administrators add them to the farm, the cloud can dynamically and automatically add and remove servers based on need.
  50. To offset these limitations, we created Scryer (not yet open sourced, but in production at Netflix).
  51. Instead of reacting to real-time metrics, like load average, to increase/decrease the instance count, we can look at historical patterns in our traffic to figure out what will be needed BEFORE it is needed. We believed we could write algorithms to predict the needs.
  52. This is the result of the algorithms we created for the predictions. The prediction closely matches the actual traffic.
  53. Based on those predictions, we started triggering scaling events. Those scaling events closely matched the traffic patterns as well, although these scaling events (as opposed to the Amazon auto-scaler) preceded the need.
  54. Load average when running Scryer is much smoother.
  55. Smoother load average results in more consistent and faster response times.
  56. Another benefit is how Scryer protects us from outage recovery.
  57. Retries and thundering herd effects are mitigated by consistent provisioning of instances through Scryer.
  58. Meanwhile, because we have more predictable scaling needs that can be provisioned more granularly, our AWS costs go down.
  59. Going global has a different set of scaling challenges. AWS enables us to add instances in new regions that are closer to our customers.
  60. To help us manage our traffic across regions, as well as within given regions, we created Zuul. Zuul is open source in our github repository.
  61. Zuul does a variety of things for us. Zuul fronts our entire streaming application as well as a range of other services within our system.
  62. Moreover, Zuul is the routing engine that we use for Isthmus, which is designed to marshall traffic between regions, for failover, performance or other reasons.
  63. We also leverage multiple regions to help us fail over one region to another, through an effort we call Active-Active.
  64. Hystrix and other techniques throughout our engineering organization help keep things resilient. We also have an army of tools that introduce failures to the system which will help us identify problems before they become really big problems.
  65. Hystrix and other techniques throughout our engineering organization help keep things resilient. We also have an army of tools that introduce failures to the system which will help us identify problems before they become really big problems.
  66. The army is the Simian Army, which is a fleet of monkeys who are designed to do a variety of things, in an automated way, in our cloud implementation. Chaos Monkey, for example, periodically terminates AWS instances in production to see how the system as a whole will respond once that server disappears. Latency Monkey introduces latencies and errors into a system to see how it responds. The system is too complex to know how things will respond in various circumstances, so the monkeys expose that information to us in a variety of ways. The monkeys are also available in our open source github repository.
  67. Again, the dependency chains in our system are quite complicated.
  68. That is a lot of change in the system!
  69. As a result, our philosophy is to act fast (ie. get code into production as quickly as possible), then react fast (ie. response to issues quickly as they arise).
  70. Two such examples are canary deployments and what we call red/black deployments.
  71. The canary deployments are comparable to canaries in coal mines. We have many servers in production running the current codebase. We will then introduce a single (or perhaps a few) new server(s) into production running new code. Monitoring the canary servers will show what the new code will look like in production.
  72. If the canary encounters problems, it will register in any number of ways. The problems will be determined based on a comprehensive set of tools that will automatically perform health analysis on the canary.
  73. The health of the canary is automated as well, comparing its metrics against the fleet of production servers.
  74. If the canary encounters problems, it will register in any number of ways. The problems will be determined based on a comprehensive set of tools that will automatically perform health analysis on the canary.
  75. If the canary shows errors, we pull it/them down, re-evaluate the new code, debug it, etc.
  76. We will then repeat the process until the analysis of canary servers look good.
  77. We will then repeat the process until the analysis of canary servers look good.
  78. We also use Zuul to funnel varying degrees of traffic to the canaries to evaluate how much load the canary can take relative to the current production instances. If the RPS, for example, drops, the canary may fail the Zuul stress test.
  79. If the new code looks good in the canary, we can then use a technique that we call red/black deployments to launch the code. Start with red, where production code is running. Fire up a new set of servers (black) equal to the count in red with the new code.
  80. Then switch the pointer to have external requests point to the black servers. Sometimes, however, we may find an error in the black cluster that was not detected by the canary. For example, some issues can only be seen with full load.
  81. Then switch the pointer to have external requests point to the black servers. Sometimes, however, we may find an error in the black cluster that was not detected by the canary. For example, some issues can only be seen with full load.
  82. If a problem is encountered from the black servers, it is easy to rollback quickly by switching the pointer back to red. We will then re-evaluate the new code, debug it, etc.
  83. Once we have debugged the code, we will put another canary up to evaluate the new changes in production.
  84. And we will stress the canary again…
  85. If the new code looks good in the canary, we can then bring up another set of servers with the new code.
  86. Then we will switch production traffic to the new code.
  87. If everything still looks good, we disable the red servers and the new code becomes the new red servers.
  88. Most companies focus on a small handful of device implementations, most notably Android and iOS devices.
  89. At Netflix, we have more than 1,000 different device types that we support. Across those devices, there is a high degree of variability. As a result, we have seen inefficiencies and problems emerge across our implementations. Those issues also translate into issues with the API interaction.
  90. For example, screen size could significantly affect what the API should deliver to the UI. TVs with bigger screens that can potentially fit more titles and more metadata per title than a mobile phone. Do we need to send all of the extra bits for fields or items that are not needed, requiring the device itself to drop items on the floor? Or can we optimize the deliver of those bits on a per-device basis?
  91. Different devices have different controlling functions as well. For devices with swipe technologies, such as the iPad, do we need to pre-load a lot of extra titles in case a user swipes the row quickly to see the last of 500 titles in their queue? Or for up-down-left-right controllers, would devices be more optimized by fetching a few items at a time when they are needed? Other devices support voice or hand gestures or pointer technologies. How might those impact the user experience and therefore the metadata needed to support them?
  92. The technical specs on these devices differ greatly. Some have significant memory space while others do not, impacting how much data can be handled at a given time. Processing power and hard-drive space could also play a role in how the UI performs, in turn potentially influencing the optimal way for fetching content from the API. All of these differences could result in different potential optimizations across these devices.
  93. Many UI teams needing metadata means many requests to the API team. In the one-size-fits-all API world, we essentially needed to funnel these requests and then prioritize them. That means that some teams would need to wait for API work to be done. It also meant that, because they all shared the same endpoints, we were often adding variations to the endpoints resulting in a more complex system as well as a lot of spaghetti code. Make teams wait due to prioritization was exacerbated by the fact that tasks took longer because the technical debt was increasing, causing time to build and test to increase. Moreover, many of the incoming requests were asking us to do more of the same kinds of customizations. This created a spiral that would be very difficult to break out of…
  94. Many other companies have seen similar issues and have introduced orchestration layers that enable more flexible interaction models.
  95. Odata, HYQL, ql.io, rest.li and others are examples of orchestration layers. They address the same problems that we have seen, but we have approached the solution in a very different way.
  96. We evolved our discussion towards what ultimately became a discussion between resource-based APIs and experience-based APIs.
  97. The original OSFA API was very resource oriented with granular requests for specific data, delivering specific documents in specific formats.
  98. The interaction model looked basically like this, with (in this example) the PS3 making many calls across the network to the OSFA API. The API ultimately called back to dependent services to get the corresponding data needed to satisfy the requests.
  99. In this mode, there is a very clear divide between the Client Code and the Server Code. That divide is the network border.
  100. And the responsibilities have the same distribution as well. The Client Code handles the rendering of the interface (as well as asking the server for data). The Server Code is responsible of gathering, formatting and delivering the data to the UIs.
  101. And ultimately, it works. The PS3 interface looks like this and was populated by this interaction model.
  102. But we believe this is not the optimal way to handle it. In fact, assembling a UI through many resource-based API calls is akin to pointillism paintings. The picture looks great when fully assembled, but it is done by assembling many points put together in the right way.
  103. We have decided to pursue an experience-based approach instead. Rather than making many API requests to assemble the PS3 home screen, the PS3 will potentially make a single request to a custom, optimized endpoint.
  104. In an experience-based interaction, the PS3 can potentially make asingle request across the network border to a scripting layer (currently Groovy), in this example to provide the data for the PS3 home screen. The call goes to a very specific, custom endpoint for the PS3 or for a shared UI. The Groovy script then interprets what is needed for the PS3 home screen and triggers a series of calls to the Java API running in the same JVM as the Groovy scripts. The Java API is essentially a series of methods that individually know how to gather the corresponding data from the dependent services. The Java API then returns the data to the Groovy script who then formats and delivers the very specific data back to the PS3.
  105. We also introduced RxJava into this layer to improve our ability to handle concurrency and callbacks. RxJava is open source in our github repository.
  106. In this model, the border between Client Code and Server Code is no longer the network border. It is now back on the server. The Groovy is essentially a client adapter written by the client teams.
  107. And the distribution of work changes as well. The client teams continue to handle UI rendering, but now are also responsible for the formatting and delivery of content. The API team, in terms of the data side of things, is responsible for the data gathering and hand-off to the client adapters. Of course, the API team does many other things, including resiliency, scaling, dependency interactions, etc. This model is essentially a platform for API development.
  108. If resource-based APIs assemble data like pointillism, experience-based APIs assemble data like a photograph. The experience-based approach captures and delivers it all at once.
  109. All of the open source components discussed here, as well as many others, can be found at the Netflix github repository.