Scaling the Netflix API - From Atlassian Dev Den

Daniel Jacobson, Director of Engineering - Netflix API at Netflix
Scaling the
Netflix API
Daniel Jacobson
@daniel_jacobson
http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson
Please read the notes
associated with each slide for
the full context of the
presentation
What do I mean by “scale”?
But There Are Many Ways to Scale!
Organization
Systems
Devices
Development
Testing
But first, some background…
Global Streaming Video
for TV Shows and Movies
More than 36 Million Subscribers
More than 40 Countries
Netflix Accounts for 33% of Peak
Internet Traffic in North America
Netflix subscribers are watching more than 1 billion hours a month
2007
Netflix REST API:
One-Size-Fits-All (OSFA)
Solution
Image courtesy of Jay Mac 3 on Flickr
Netflix API Requests by Audience
At Launch In 2008
External
Developers
Image courtesy of Jay Mac 3 on Flickr
Netflix API Requests by Audience
From 2011
External
Developers
Global Streaming Product
Three aspects of the Streaming Product:
• Discovery
• Sign-Up
• Streaming
Member Sign-Up
Discovery
Today, Netflix API Supports Discovery
and Sign-Up
But Soon, Will Support Streaming
Scaling…
Organization
Systems
Devices
Development
Testing
Distributed Architecture
1000+ Device Types
Personalization Engine
User Info
Movie Metadata
Movie Ratings
Similar Movies
Reviews
A/B Test Engine
Dozens of Dependencies
API
Personalization Engine
User Info
Movie Metadata
Movie Ratings
Similar Movies
Reviews
A/B Test Engine
http://www.slideshare.net/reed2001/culture-1798664
Scaling…
Organization
Systems
Devices
Development
Testing
System Resiliency
Distributed Architecture
Dependency Relationships
2,000,000,000
Requests Per Day to the
Netflix API
30
Distinct Dependent
Services for the Netflix API
14,000,000,000
Netflix API Calls Per Day to
those Dependent Services
0
Dependent Services with
100% SLA
99.99%^30 = 99.7%
0.3% of 2B = 6M failures per day
2+ Hours of Downtime
Per Month
99.9%^30 = 97%
3% of 2B = 60M failures per day
20+ Hours of Downtime
Per Month
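The arithmetic on these slides can be reproduced with a short sketch. It assumes that the 30 dependencies fail independently, that every API request needs all of them, and that a month has 30 days; none of these assumptions is stated on the slides, and the function names are illustrative only.

```python
# Assumptions (not from the slides): independent failures, every request
# touches all dependencies, and a 30-day month.
HOURS_PER_MONTH = 30 * 24


def compound_availability(per_service_sla: float, services: int) -> float:
    """Availability of a request that needs every one of `services` dependencies."""
    return per_service_sla ** services


def failures_per_day(availability: float, requests_per_day: int) -> float:
    """Expected number of failed requests per day at the given availability."""
    return (1.0 - availability) * requests_per_day


def downtime_hours_per_month(availability: float) -> float:
    """Equivalent hours of downtime per month at the given availability."""
    return (1.0 - availability) * HOURS_PER_MONTH


if __name__ == "__main__":
    for sla in (0.9999, 0.999):
        overall = compound_availability(sla, services=30)
        print(
            f"{sla:.2%} per service -> {overall:.1%} overall, "
            f"{failures_per_day(overall, 2_000_000_000) / 1e6:.0f}M failures/day, "
            f"{downtime_hours_per_month(overall):.1f} h downtime/month"
        )
```

The slides round the combined availability to 99.7% and 97% before computing failures and downtime, so the unrounded figures from this sketch differ slightly.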
API
Personalization Engine
User Info
Movie Metadata
Movie Ratings
Similar Movies
Reviews
A/B Test Engine
Circuit Breaker Dashboard
Call Volume and Health / Last 10 Seconds
Call Volume / Last 2 Minutes
Successful Requests
Successful, But Slower Than Expected
Short-Circuited Requests, Delivering Fallbacks
Timeouts, Delivering Fallbacks
Thread Pool & Task Queue Full, Delivering Fallbacks
Exceptions, Delivering Fallbacks
Error Rate
(# + # + # + #) / (# + # + # + # + #) = Error Rate
Status of Fallback Circuit
Requests per Second, Over Last 10 Seconds
SLA Information
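One possible reading of the error-rate formula in the dashboard legend is sketched below. It assumes the four terms in the numerator are the four categories that deliver fallbacks (short-circuited, timeouts, thread pool rejections, exceptions) and that successful requests complete the denominator. The class and field names are hypothetical and are not taken from the dashboard's implementation.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts for one dependency over a rolling window (hypothetical names)."""

    success: int = 0
    short_circuited: int = 0
    timeout: int = 0
    rejected: int = 0   # thread pool and task queue full
    exception: int = 0

    def error_rate(self) -> float:
        """Share of requests in the window that ended in a fallback."""
        errors = self.short_circuited + self.timeout + self.rejected + self.exception
        total = errors + self.success
        # An empty window has no errors; avoid dividing by zero.
        return errors / total if total else 0.0


if __name__ == "__main__":
    window = WindowCounts(success=90, short_circuited=2, timeout=3, rejected=1, exception=4)
    print(f"error rate: {window.error_rate():.1%}")
```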
API
Personalization Engine
User Info
Movie Metadata
Movie Ratings
Similar Movies
Reviews
A/B Test Engine
Fallback
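The circuit breaker dashboard and the fallback slides point to a familiar pattern: when a dependency keeps failing, stop calling it for a while and serve a fallback instead. The following is a minimal sketch of that pattern, not Netflix's implementation. The class name, the consecutive-failure threshold, and the reset interval are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Minimal circuit breaker (illustrative; thresholds are assumptions)."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        # While the circuit is open and the reset interval has not elapsed,
        # short-circuit: skip the dependency and deliver the fallback.
        if self.opened_at is not None and self.clock() - self.opened_at < self.reset_after_s:
            return fallback()
        try:
            result = primary()  # closed circuit, or a single trial call after the interval
        except Exception:
            self.consecutive_failures += 1
            # Open (or re-open) the circuit after a failed trial or too many failures.
            if self.opened_at is not None or self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # A success closes the circuit and resets the failure count.
        self.consecutive_failures = 0
        self.opened_at = None
        return result
```

A caller might wrap each dependency in its own breaker, for example `breaker.call(lambda: fetch_ratings(user_id), lambda: {})`, where `fetch_ratings` is a hypothetical client call and the empty dictionary is the fallback response.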
System Infrastructure
AWS Cloud
Autoscaling
Forced Failure
Global System
More than 36 Million Subscribers
More than 40 Countries
Zuul
Gatekeeper for the Netflix Streaming Application
Zuul
• Multi-Region Resiliency
• Insights
• Stress Testing
• Canary Testing
• Dynamic Routing
• Load Shedding
• Security
• Static Response Handling
• Authentication
Isthmus
Scaling…
Organization
Systems
Devices
Development
Testing
Screen Real Estate
Controller
Technical Capabilities
One-Size-Fits-All
API
Request
Request
Request
Scaling…
Organization
Systems
Devices
Development
Testing
Courtesy of South Florida Classical Review
Resource-Based API
vs.
Experience-Based API
Resource-Based Requests
• /users/<id>/ratings/title
• /users/<id>/queues
• /users/<id>/queues/instant
• /users/<id>/recommendations
• /catalog/titles/movie
• /catalog/titles/series
• /catalog/people
REST API
RECOMMENDATIONS
MOVIE DATA
SIMILAR MOVIES
AUTH
MEMBER DATA
A/B TESTS
START-UP
RATINGS
Network Border
RECOMMENDATIONS
MOVIE DATA
SIMILAR MOVIES
AUTH
MEMBER DATA
A/B TESTS
START-UP
RATINGS
OSFA API
Network Border
SERVER CODE
CLIENT CODE
RECOMMENDATIONS
MOVIE DATA
SIMILAR MOVIES
AUTH
MEMBER DATA
A/B TESTS
START-UP
RATINGS
OSFA API
Network Border
DATA GATHERING, FORMATTING, AND DELIVERY
USER INTERFACE RENDERING
Experience-Based Requests
• /ps3/homescreen
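To illustrate the contrast, an experience-based endpoint such as /ps3/homescreen could gather on the server what a device would otherwise fetch through several resource-based requests, then return one payload shaped for that device. The sketch below is only an illustration in Python. The three fetch functions are hypothetical stand-ins for backend services, and the payload shape is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional


# Hypothetical stand-ins for backend calls; real services would go over the network.
def fetch_recommendations(user_id: str) -> List[str]:
    return ["title-1", "title-2", "title-3"]


def fetch_instant_queue(user_id: str) -> List[str]:
    return ["title-4"]


def fetch_ratings(user_id: str) -> Dict[str, int]:
    return {"title-1": 4, "title-4": 5}


def ps3_homescreen(user_id: str, max_titles: int = 2) -> dict:
    """Build one device-specific response from several backend calls."""
    # Gather the data concurrently on the server side of the network border.
    with ThreadPoolExecutor(max_workers=3) as pool:
        recs = pool.submit(fetch_recommendations, user_id)
        queue = pool.submit(fetch_instant_queue, user_id)
        ratings = pool.submit(fetch_ratings, user_id)
        rec_titles, queue_titles, rating_map = recs.result(), queue.result(), ratings.result()

    def row(name: str, titles: List[str]) -> dict:
        # Format only what this screen needs: a trimmed row with ratings attached.
        items: List[Dict[str, Optional[object]]] = [
            {"id": t, "rating": rating_map.get(t)} for t in titles[:max_titles]
        ]
        return {"row": name, "titles": items}

    return {"rows": [row("recommendations", rec_titles), row("instant queue", queue_titles)]}
```

With this shape the device makes a single request and receives data already trimmed and formatted for its screen, rather than composing several generic responses itself.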
JAVA API
Network Border
RECOMMENDATIONS
MOVIE DATA
SIMILAR MOVIES
AUTH
MEMBER DATA
A/B TESTS
START-UP
RATINGS
Groovy Layer
RECOMMENDATIONS
MOVIE DATA
SIMILAR MOVIES
AUTH
MEMBER DATA
A/B TESTS
START-UP
RATINGS
JAVA API
SERVER CODE
CLIENT CODE
CLIENT ADAPTER CODE
(WRITTEN BY CLIENT TEAMS, DYNAMICALLY UPLOADED TO SERVER)
Network Border
RECOMMENDATIONS
MOVIE DATA
SIMILAR MOVIES
AUTH
MEMBER DATA
A/B TESTS
START-UP
RATINGS
JAVA API
DATA GATHERING
DATA FORMATTING AND DELIVERY
USER INTERFACE RENDERING
Network Border
Scaling…
Organization
Systems
Devices
Development
Testing
Dependency Relationships
Testing Philosophy:
Act Fast, React Fast
That Doesn’t Mean We Don’t Test
• Unit tests
• Functional tests
• Regression scripts
• Continuous integration
• Capacity planning
• Load / Performance tests
Cloud-Based Deployment Techniques
Current Code
In Production
API Requests from
the Internet
Single Canary Instance
To Test New Code with Production Traffic
(around 1% or less of traffic)
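A rough sketch of the canary idea: send a small random share of production requests to the canary instance, then compare its error rate with the current code before going further. The 1% fraction comes from the slide. The function names, the comparison rule, and the tolerance value are assumptions made for illustration.

```python
import random
from typing import Dict


def pick_target(rng: random.Random, canary_fraction: float = 0.01) -> str:
    """Route one request: roughly `canary_fraction` of traffic goes to the canary."""
    return "canary" if rng.random() < canary_fraction else "production"


def simulate_routing(requests: int, canary_fraction: float = 0.01, seed: int = 0) -> Dict[str, int]:
    """Count how many simulated requests land on each target."""
    rng = random.Random(seed)
    counts = {"canary": 0, "production": 0}
    for _ in range(requests):
        counts[pick_target(rng, canary_fraction)] += 1
    return counts


def canary_verdict(canary_error_rate: float, baseline_error_rate: float,
                   tolerance: float = 0.001) -> str:
    """Hypothetical rule: roll back if the canary is noticeably worse than the baseline."""
    if canary_error_rate > baseline_error_rate + tolerance:
        return "roll back"
    return "promote"


if __name__ == "__main__":
    print(simulate_routing(100_000))
    print(canary_verdict(canary_error_rate=0.02, baseline_error_rate=0.003))
```

Random per-request routing is only one option; a load balancer that gives one instance out of a large fleet an equal share of traffic would reach a similar fraction without any explicit routing rule.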
Current Code
In Production
API Requests from
the Internet
Error!
Current Code
In Production
API Requests from
the Internet
Current Code
In Production
API Requests from
the Internet
Perfect!
Current Code
In Production
API Requests from
the Internet
New Code
Getting Prepared for Production
Current Code
In Production
API Requests from
the Internet
New Code
Getting Prepared for Production
Error!
Current Code
In Production
API Requests from
the Internet
New Code
Getting Prepared for Production
Current Code
In Production
API Requests from
the Internet
Perfect!
Current Code
In Production
API Requests from
the Internet
New Code
Getting Prepared for Production
API Requests from
the Internet
New Code
Getting Prepared for Production
https://www.github.com/Netflix

Editor's Notes

  1. There are many ways to think of “scale”…
  2. People generally think of scaling as growth in traffic… And traffic growth typically needs to be matched by server growth. System scaling requires a balance between need and capacity.
  3. To have an effective engineering organization, you need to scale in a variety of ways, not just in your systems. This presentation discusses these scaling needs. Of course, I will focus a bit on systems, but that is not the only area that requires focus to be successful.
  4. Netflix strives to be the global streaming video leader for TV shows and movies
  5. We now have more than 36 million global subscribers in more than 40 countries
  6. Those subscribers consume more than a billion hours of streaming video a month, which accounts for about 33% of peak Internet traffic in North America.
  7. Our 36 million Netflix subscribers are watching shows (like House of Cards) and movies on virtually any device that has a streaming video screen. We are now on more than 1,000 different device types.
  8. All of this started, however, with the launch of streaming in 2007. At the time, we were only streaming on computer-based players (i.e., no devices, mobile phones, etc.).
  9. Shortly after streaming launched, in 2008, we launched our REST API. I describe it as a One-Size-Fits-All (OSFA) type of implementation because the API itself sets the rules and requires anyone who interfaces with it to adhere to those rules. Everyone is treated the same.
  10. The OSFA API launched to support the 1,000 flowers model. That is, we would plant the seeds in the ground (by providing access to our content) and see what flowers sprout up in the myriad fields throughout the US. The 1,000 flowers are public API developers. At the launch of the public API, the content was fully liberated and the bird was set free to fly around in the open world.
  11. And at launch, the API was exclusively targeted towards and consumed by the 1,000 flowers (i.e., external developers). So all of the API traffic was coming from them.
  12. Some examples of the flowers…
  13. But as streaming gained more steam…
  14. The API evolved to support more of the devices that were getting built. The 1,000 flowers were still supported as well, but as the devices ramped up, the devices became a bigger focus.
  15. Meanwhile, the balance of requests by audience had completely flipped. Overwhelmingly, the majority of traffic was coming from Netflix-ready devices and a shrinking percentage was from the 1,000 flowers. The flowers now represent less than 0.1% of the total traffic to the Netflix API.
  16. For our devices, there are basically three types of interactions between Netflix customers and our streaming application… Sign-Up, Discovery and Streaming.
  17. Sign-Up covers the ways in which we allow people to become members, as well as how they manage their accounts.
  18. Discovery is basically any event with a title other than actually watching it. That includes browsing titles, looking for something to watch, etc.
  19. It also includes actions such as rating the title, adding it to your instant queue, etc.
  20. The API currently powers the Sign-Up and Discovery experiences.
  21. We are now working to have it support the Streaming parts of the system as well.
  22. To support the growing member base, growing number of devices, and growing feature-set for the application, we have needed to scale in a variety of ways. First, the organization…
  23. Our application used to be more like a monolithic application running out of data centers. As the streaming application started to emerge, and as we started to shift to the cloud, we began to split out different services towards a distributed, SOA-based architecture.
  24. I like to think of this distributed architecture as being shaped like an hourglass…
  25. In the top end of the hourglass, we have our device and UI teams who build out great user experiences on Netflix-branded devices. To put that into perspective, there are a few hundred more device types that we support than engineers at Netflix.
  26. At the bottom end of the hourglass, there are several dozen dependency teams who focus on things like metadata, algorithms, authentication services, A/B test engines, etc.
  27. The API is at the center of the hourglass, acting as a broker of data.
  28. This hourglass architecture allows us to scale horizontally with ease, both in our device integrations and in our backend dependencies.
  29. The glue that helps all of this work as effectively as it does is our engineering culture. We hire great, seasoned engineers with excellent judgment and engineering acumen and enable them to build systems quickly by giving them the right context and helping them work together in a highly aligned way.
  30. Our organization, and therefore our system, is set up to support a distributed architecture.
  31. Our distributed architecture, with the number of systems involved, can get quite complicated. Each of these systems talks to a large number of other systems within our architecture.
  32. Assuming each of the services has an SLA of four nines, the downtime compounds across the dozens of services involved, which results in more than two hours of downtime per month.
  33. And that is if all services maintain four nines!
  34. If each service degrades as far as three nines, that is almost one day per month of downtime!
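
As a rough check on the downtime figures in notes 32 to 34, the sketch below multiplies per-service availabilities together. It assumes 30 dependencies (the notes only say "dozens"), independent failures, and a 30-day month; all three are illustrative assumptions rather than figures from the presentation.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def monthly_downtime_hours(per_service_availability: float, dependencies: int) -> float:
    """Expected downtime per month when every dependency must be up.

    Assumes failures are independent, so the availabilities multiply.
    """
    combined_availability = per_service_availability ** dependencies
    return (1.0 - combined_availability) * HOURS_PER_MONTH


if __name__ == "__main__":
    # 30 dependencies is an assumed count, chosen only for illustration.
    for label, availability in (("four nines", 0.9999), ("three nines", 0.999)):
        hours = monthly_downtime_hours(availability, dependencies=30)
        print(f"{label}: about {hours:.1f} hours of downtime per month")
```

Under these assumptions the four-nines case comes out a little over two hours and the three-nines case a little over 21 hours, which is consistent with the "more than two hours" and "almost one day" figures above.
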
  35. So, back to the hourglass…
  36. In the old world, the system was vulnerable to such failures. For example, if one of our dependency services fails…
  37. Such a failure could have resulted in an outage in the API.
  38. And that outage likely would have cascaded to have some kind of substantive impact on the devices.
  39. The challenge for the API team is to be resilient against dependency outages, to ultimately insulate Netflix customers from low level system problems and to keep them happy.
  40. To solve this problem, we created Hystrix, a wrapping technology that provides fault tolerance in a distributed environment. Hystrix is also open source and available at our GitHub repository.
  41. To achieve this, we implemented a series of circuit breakers for each library that we depend on. Each circuit breaker controls the interaction between the API and that dependency. This image is a view of the dependency monitor that allows us to view the health and activity of each dependency. This dashboard is designed to give a real-time view of what is happening with these dependencies (over the last two minutes). We have other dashboards that provide insight into longer-term trends, day-over-day views, etc.
  42. This is a view of a single circuit.
  43. This circle represents the call volume and health of the dependency over the last 10 seconds. This circle is meant to be a visual indicator for health. The circle is green for healthy, yellow for borderline, and red for unhealthy. Moreover, the size of the circle represents the call volumes, where bigger circles mean more traffic.
  44. The blue line represents the traffic trends over the last two minutes for this dependency.
  45. The green number shows the number of successful calls to this dependency over the last two minutes.
  46. The yellow number shows the number of latent calls into the dependency. These calls ultimately return successful responses, but slower than expected.
  47. The blue number shows the number of calls that were handled by the short-circuited fallback mechanisms. That is, if the circuit gets tripped, the blue number will start to go up.
  48. The orange number shows the number of calls that have timed out, resulting in fallback responses.
  49. The purple number shows the number of calls that fail due to queuing issues, resulting in fallback responses.
  50. The red number shows the number of exceptions, resulting in fallback responses.
  51. The error rate is calculated from the total number of error and fallback responses divided by the total number of calls handled.
  52. If the error rate exceeds a configured threshold, the circuit to the fallback scenario is automatically opened. When it returns below that threshold, the circuit is closed again.
  53. The dashboard also shows host and cluster information for the dependency.
  54. As well as information about our SLAs.
  55. So, going back to the engineering diagram…
  56. If that same service fails today…
  57. We simply disconnect from that service.
  58. And replace it with an appropriate fallback. The fallback, ideally, is a slightly degraded but useful offering. If we cannot get that, however, we will quickly provide a 5xx response, which will help the systems shed load rather than queue things up (which could eventually cause the system as a whole to tip over).
  59. This will keep our customers happy, even if the experience may be slightly degraded. It is important to note that different dependency libraries have different fallback scenarios. And some are more resilient than others. But the overall sentiment here is accurate at a high level.
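
The circuit-breaker behavior described in notes 51 to 58 can be sketched roughly as follows. This is a toy illustration, not Hystrix itself: the class name, the default threshold, the window length, and the minimum call count are all assumptions. Recovery is also simplified, since the circuit here closes once old failures age out of the rolling window, whereas Hystrix lets a trial request through after a sleep window.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Toy breaker: opens when the rolling error rate exceeds a threshold."""

    def __init__(self, error_threshold: float = 0.5, window_seconds: float = 10.0,
                 min_calls: int = 5, clock: Callable[[], float] = time.monotonic) -> None:
        self.error_threshold = error_threshold
        self.window_seconds = window_seconds
        self.min_calls = min_calls  # avoid tripping on a handful of calls
        self.clock = clock
        self._outcomes: Deque[Tuple[float, bool]] = deque()  # (timestamp, failed)

    def _prune(self) -> None:
        """Drop outcomes that have fallen out of the rolling window."""
        cutoff = self.clock() - self.window_seconds
        while self._outcomes and self._outcomes[0][0] < cutoff:
            self._outcomes.popleft()

    def error_rate(self) -> float:
        self._prune()
        if not self._outcomes:
            return 0.0
        failures = sum(1 for _, failed in self._outcomes if failed)
        return failures / len(self._outcomes)

    def is_open(self) -> bool:
        self._prune()
        enough_calls = len(self._outcomes) >= self.min_calls
        return enough_calls and self.error_rate() > self.error_threshold

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            # Short-circuit: skip the dependency entirely and serve the fallback.
            return fallback()
        try:
            result = primary()
        except Exception:
            self._outcomes.append((self.clock(), True))
            return fallback()
        self._outcomes.append((self.clock(), False))
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_ratings, lambda: default_ratings)`, where both names are hypothetical. Returning a fallback immediately while the circuit is open is what lets the system shed load instead of queuing requests behind a failing dependency.
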
  60. Rather than relying on data centers, we have moved everything to the cloud! This enables rapid scaling with relative ease. Adding new servers, in new locations, takes minutes. And this is critical when the service needs to grow from 1B requests a month to 2B requests a day in a relatively short period of time.
  61. That is much more preferable for us than spending our time, money and energy in data centers, adding servers, dealing with power supplies, etc.
  62. Instead, we spend time in tools such as Asgard, created by Netflix staff, to help us manage our instance types and counts in AWS. Asgard is available in our open source repository on GitHub.
  63. Another feature afforded to us through AWS to help us scale is Autoscaling. This is the Netflix API request rates over a span of time. The red line represents a potential capacity needed in a data center to ensure that the spikes could be handled without spending a ton more than is needed for the really unlikely scenarios.
  64. Through autoscaling, instead of buying new servers based on projected spikes in traffic and having systems administrators add them to the farm, the cloud can dynamically and automatically add and remove servers based on need.
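
To make the contrast in notes 63 and 64 concrete, here is a minimal sketch comparing a fleet sized for peak load with one resized as load changes. The sizing rule, the per-instance capacity, the headroom factor, and the load figures are all made up for illustration; this is not how AWS Auto Scaling policies are actually expressed.

```python
import math


def desired_instances(requests_per_second: float, capacity_per_instance: float,
                      headroom: float = 0.25, min_instances: int = 2) -> int:
    """Instances needed to serve the load with some spare headroom."""
    needed = requests_per_second * (1.0 + headroom) / capacity_per_instance
    return max(min_instances, math.ceil(needed))


if __name__ == "__main__":
    load_samples = [4_000, 2_500, 9_000, 20_000, 12_000]  # hypothetical requests per second
    # A fixed fleet has to be sized for the largest spike it might see.
    fixed_fleet = desired_instances(max(load_samples), capacity_per_instance=500)
    for load in load_samples:
        scaled = desired_instances(load, capacity_per_instance=500)
        print(f"load={load:>6} rps  autoscaled={scaled:>3}  fixed={fixed_fleet}")
```

The fixed fleet pays for peak capacity at every sample, while the autoscaled count follows the load and only reaches the peak figure during the spike.
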
  65. Hystrix and other techniques throughout our engineering organization help keep things resilient. We also have an army of tools that introduce failures to the system which will help us identify problems before they become really big problems.
  66. The army is the Simian Army, which is a fleet of monkeys that are designed to do a variety of things, in an automated way, in our cloud implementation. Chaos Monkey, for example, periodically terminates AWS instances in production to see how the system as a whole will respond once that server disappears. Latency Monkey introduces latencies and errors into a system to see how it responds. The system is too complex to know how things will respond in various circumstances, so the monkeys expose that information to us in a variety of ways. The monkeys are also available in our open source GitHub repository.
  67. Going global has a different set of scaling challenges. AWS enables us to add instances in new regions that are closer to our customers.
  68. To help us manage our traffic across regions, as well as within given regions, we created Zuul. Zuul is open source in our GitHub repository.
  69. Zuul does a variety of things for us. Zuul fronts our entire streaming application as well as a range of other services within our system.
  70. Moreover, Zuul is the routing engine that we use for Isthmus, which is designed to marshal traffic between regions, for failover, performance or other reasons.
  71. Most companies focus on a small handful of device implementations, most notably Android and iOS devices.
  72. At Netflix, we have more than 1,000 different device types that we support. Across those devices, there is a high degree of variability. As a result, we have seen inefficiencies and problems emerge across our implementations. Those issues also translate into issues with the API interaction.
  73. For example, screen size could significantly affect what the API should deliver to the UI. TVs with bigger screens can potentially fit more titles and more metadata per title than a mobile phone. Do we need to send all of the extra bits for fields or items that are not needed, requiring the device itself to drop items on the floor? Or can we optimize the delivery of those bits on a per-device basis?
  74. Different devices have different controlling functions as well. For devices with swipe technologies, such as the iPad, do we need to pre-load a lot of extra titles in case a user swipes the row quickly to see the last of 500 titles in their queue? Or for up-down-left-right controllers, would devices be more optimized by fetching a few items at a time when they are needed? Other devices support voice or hand gestures or pointer technologies. How might those impact the user experience and therefore the metadata needed to support them?
  75. The technical specs on these devices differ greatly. Some have significant memory space while others do not, impacting how much data can be handled at a given time. Processing power and hard-drive space could also play a role in how the UI performs, in turn potentially influencing the optimal way for fetching content from the API. All of these differences could result in different potential optimizations across these devices.
  76. Many UI teams needing metadata means many requests to the API team. In the one-size-fits-all API world, we essentially needed to funnel these requests and then prioritize them. That meant that some teams needed to wait for API work to be done. It also meant that, because they all shared the same endpoints, we were often adding variations to the endpoints, resulting in a more complex system as well as a lot of spaghetti code. Making teams wait due to prioritization was exacerbated by the fact that tasks took longer because the technical debt was increasing, causing time to build and test to increase. Moreover, many of the incoming requests were asking us to do more of the same kinds of customizations. This created a spiral that would be very difficult to break out of…
  77. That variability ultimately caused us to do some introspection on our API layer.
  78. Many other companies have seen similar issues and have introduced orchestration layers that enable more flexible interaction models.
  79. OData, HYQL, ql.io, rest.li and others are examples of orchestration layers. They address the same problems that we have seen, but we have approached the solution in a very different way.
  80. We evolved our discussion towards what ultimately became a discussion between resource-based APIs and experience-based APIs.
  81. The original OSFA API was very resource oriented with granular requests for specific data, delivering specific documents in specific formats.
  82. The interaction model looked basically like this, with (in this example) the PS3 making many calls across the network to the OSFA API. The API ultimately called back to dependent services to get the corresponding data needed to satisfy the requests.
  83. In this mode, there is a very clear divide between the Client Code and the Server Code. That divide is the network border.
  84. And the responsibilities have the same distribution as well. The Client Code handles the rendering of the interface (as well as asking the server for data). The Server Code is responsible for gathering, formatting and delivering the data to the UIs.
  85. And ultimately, it works. The PS3 interface looks like this and was populated by this interaction model.
  86. But we believe this is not the optimal way to handle it. In fact, assembling a UI through many resource-based API calls is akin to pointillism paintings. The picture looks great when fully assembled, but it is done by assembling many points put together in the right way.
  87. We have decided to pursue an experience-based approach instead. Rather than making many API requests to assemble the PS3 home screen, the PS3 will potentially make a single request to a custom, optimized endpoint.
  88. In an experience-based interaction, the PS3 can potentially make a single request across the network border to a scripting layer (currently Groovy), in this example to provide the data for the PS3 home screen. The call goes to a very specific, custom endpoint for the PS3 or for a shared UI. The Groovy script then interprets what is needed for the PS3 home screen and triggers a series of calls to the Java API running in the same JVM as the Groovy scripts. The Java API is essentially a series of methods that individually know how to gather the corresponding data from the dependent services. The Java API then returns the data to the Groovy script, which then formats and delivers the very specific data back to the PS3.
  89. We also introduced RxJava into this layer to improve our ability to handle concurrency and callbacks. RxJava is open source in our GitHub repository.
  90. In this model, the border between Client Code and Server Code is no longer the network border. It is now back on the server. The Groovy script is essentially a client adapter written by the client teams.
  91. And the distribution of work changes as well. The client teams continue to handle UI rendering, but now are also responsible for the formatting and delivery of content. The API team, in terms of the data side of things, is responsible for the data gathering and hand-off to the client adapters. Of course, the API team does many other things, including resiliency, scaling, dependency interactions, etc. This model is essentially a platform for API development.
  92. If resource-based APIs assemble data like pointillism, experience-based APIs assemble data like a photograph. The experience-based approach captures and delivers it all at once.
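
The experience-based model in notes 87 to 92 can be sketched as a set of per-device adapters that sit on top of shared data-gathering methods. The sketch uses Python for brevity, although the notes describe Groovy scripts calling a Java API. The endpoint paths, function names, and sample data are hypothetical stand-ins.

```python
from typing import Callable, Dict, List

Title = Dict[str, object]


def get_recommended_titles(user_id: str) -> List[Title]:
    """Hypothetical data-gathering method; a real one would call a dependency."""
    return [
        {"id": 1, "title": "Title A", "synopsis": "Synopsis A", "rating": 4.5},
        {"id": 2, "title": "Title B", "synopsis": "Synopsis B", "rating": 4.0},
        {"id": 3, "title": "Title C", "synopsis": "Synopsis C", "rating": 3.5},
    ]


def tv_home_screen(user_id: str) -> Dict[str, object]:
    """Adapter for a large screen: every title, with full metadata."""
    return {"rows": [{"name": "Recommended", "titles": get_recommended_titles(user_id)}]}


def phone_home_screen(user_id: str) -> Dict[str, object]:
    """Adapter for a small screen: fewer titles, only the fields the UI renders."""
    titles = get_recommended_titles(user_id)[:2]
    slim = [{"id": t["id"], "title": t["title"]} for t in titles]
    return {"rows": [{"name": "Recommended", "titles": slim}]}


# One custom endpoint per device experience (the paths are made up).
ENDPOINTS: Dict[str, Callable[[str], Dict[str, object]]] = {
    "/tv/home": tv_home_screen,
    "/phone/home": phone_home_screen,
}


def handle(path: str, user_id: str) -> Dict[str, object]:
    """Single request in, fully assembled device-specific payload out."""
    return ENDPOINTS[path](user_id)
```

Each device makes one call, such as `handle("/phone/home", "some-user")`, and receives only what its screen needs. The shared `get_recommended_titles` method stays the responsibility of the API team, while the adapters belong to the client teams.
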
  93. Again, the dependency chains in our system are quite complicated.
  94. That is a lot of change in the system!
  95. As a result, our philosophy is to act fast (i.e., get code into production as quickly as possible), then react fast (i.e., respond to issues quickly as they arise).
  96. That said, we do spend a lot of time testing. We just don’t intend to make the system bullet-proof before deploying. Instead, we have employed some techniques to help us learn more about what the new code will look like in production.
  97. Two such examples are canary deployments and what we call red/black deployments.
  98. The canary deployments are comparable to canaries in coal mines. We have many servers in production running the current codebase. We will then introduce a single (or perhaps a few) new server(s) into production running new code. Monitoring the canary servers will show what the new code will look like in production.
  99. If the canary encounters problems, it will register in any number of ways. The problems will be determined based on a comprehensive set of tools that will automatically perform health analysis on the canary.
  100. If the canary shows errors, we pull it/them down, re-evaluate the new code, debug it, etc.
  101. We will then repeat the process until the analysis of the canary servers looks good.
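
The automated health analysis mentioned in notes 98 to 101 is not detailed in these notes. As one minimal illustration, a check might compare the canary's error rate against the baseline fleet's. The metric, the tolerance, and the minimum sample size below are assumptions for the sketch, not the actual analysis tooling.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_is_healthy(baseline: Metrics, canary: Metrics,
                      tolerance: float = 0.005, min_requests: int = 1000) -> bool:
    """Pass the canary only if it has seen enough traffic and its error rate
    is within `tolerance` of the baseline servers' error rate."""
    if canary.requests < min_requests:
        return False  # not enough data to judge the new code yet
    return canary.error_rate <= baseline.error_rate + tolerance
```

A failing check corresponds to pulling the canary down and debugging. A passing check is the signal to move on to the full deployment.
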
  102. If the new code looks good in the canary, we can then use a technique that we call red/black deployments to launch the code. Start with red, where production code is running. Fire up a new set of servers (black) equal to the count in red with the new code.
  103. Then switch the pointer to have external requests point to the black servers.
  104. Sometimes, however, we may find an error in the black cluster that was not detected by the canary. For example, some issues can only be seen with full load.
  105. If a problem is encountered from the black servers, it is easy to roll back quickly by switching the pointer back to red. We will then re-evaluate the new code, debug it, etc.
  106. Once we have debugged the code, we will put another canary up to evaluate the new changes in production.
  107. If the new code looks good in the canary, we can then bring up another set of servers with the new code.
  108. Then we will switch production traffic to the new code.
  109. If everything still looks good, we disable the red servers and the new code becomes the new red servers.
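
The red/black flow in notes 102 to 109 amounts to keeping two clusters alive and moving a single pointer between them. The sketch below models that with an in-memory object. The class and method names are hypothetical, and real deployments would manage actual server groups rather than lists of version strings.

```python
from typing import Dict, List


class RedBlackDeployment:
    """Toy model of red/black: two clusters and a pointer to the live one."""

    def __init__(self, red_servers: List[str]) -> None:
        self.clusters: Dict[str, List[str]] = {"red": list(red_servers)}
        self.live = "red"

    def launch_black(self, version: str) -> None:
        # Bring up as many new servers as are currently serving in red.
        self.clusters["black"] = [version] * len(self.clusters["red"])

    def switch_to_black(self) -> None:
        if "black" not in self.clusters:
            raise RuntimeError("launch the black cluster first")
        self.live = "black"

    def roll_back(self) -> None:
        # Red is still running, so rollback is only a pointer change.
        self.live = "red"

    def promote_black(self) -> None:
        # Retire the old red servers; the new code becomes the new red.
        if self.live != "black":
            raise RuntimeError("black is not serving traffic")
        self.clusters = {"red": self.clusters["black"]}
        self.live = "red"

    def serving(self) -> List[str]:
        return self.clusters[self.live]
```

Because the red servers keep running until `promote_black` is called, a problem that only appears under full load can be undone with `roll_back` without starting any new servers.
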
  110. All of the open source components discussed here, as well as many others, can be found at the Netflix GitHub repository.