Evolving the Netflix API
Katharina Probst
Engineering Manager, API
October 2015
What is Netflix?
> 1000 Devices
Is it significant?
❏ Netflix accounts for almost 35% of peak downstream traffic in the US.
❏ Almost 70 million subscribers worldwide and growing
Source: http://www.sandvine.com/news/global_broadband_trends.asp
We’re going global!
Source: https://help.netflix.com/en/node/14164
Recent additions:
Spain, Portugal, Italy
Current availability
Netflix Originals
Do we need a Netflix API?
[Diagram: the API in front of backend services: Personalization Engine, User Info, Ratings, Similar Movies, A/B Test Engine, ...]
Uses
❏ Discovery
❏ Signup
❏ Playback
❏ Internal teams only
API
Goals
❏ Flexibility
❏ Resiliency
❏ Scalability
❏ Excellent tools
API
Lots of devices, lots of variety
Different interaction models
And just to make things a little more interesting...
❏ A/B tests
❏ profiles
❏ localization
What we felt we had
What we needed
❏ Reduce network chattiness
❏ Support device optimizations
❏ Enable faster development for internal users
Remote API: GET /users/{user_id}/lists
Local method: apiGateway.getLists(userId)
Discrete HTTP requests pay network tax repeatedly
Single, optimized request; pay network tax once
Client data assembly logic pushed to server
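A minimal sketch of the difference, in JavaScript. It counts simulated round trips instead of making real network calls. The in-memory data, the `httpGet` helper, `apiGateway.getTitles`, and the `/lists/{name}/titles` path are assumptions made for illustration; only `GET /users/{user_id}/lists`, `apiGateway.getLists(userId)`, and `/tv/home` come from the slides.

```javascript
// Hypothetical in-memory data standing in for backend services.
const data = {
  lists: { u1: ["continue-watching", "top-picks"] },
  titles: { "continue-watching": ["Title A"], "top-picks": ["Title B", "Title C"] },
};

let roundTrips = 0; // simulated network calls made by the device

// Chatty style: every resource costs the device one HTTP request.
function httpGet(path) {
  roundTrips += 1;
  const [kind, id] = path.split("/").filter(Boolean);
  if (kind === "users") return data.lists[id];
  if (kind === "lists") return data.titles[id];
  throw new Error("unknown path: " + path);
}

function homeScreenChatty(userId) {
  const lists = httpGet(`/users/${userId}/lists`);
  return lists.map((name) => ({ name, titles: httpGet(`/lists/${name}/titles`) }));
}

// Optimized style: the same assembly logic runs on the server as local
// method calls, so the device pays the network tax once.
const apiGateway = {
  getLists: (userId) => data.lists[userId],
  getTitles: (listName) => data.titles[listName], // hypothetical method
};

function tvHomeEndpoint(userId) {
  return apiGateway
    .getLists(userId)
    .map((name) => ({ name, titles: apiGateway.getTitles(name) }));
}

function homeScreenSingle(userId) {
  roundTrips += 1; // one request to a /tv/home-style endpoint
  return tvHomeEndpoint(userId);
}

// Runs a loader and reports how many round trips it cost.
function measure(loader, userId) {
  roundTrips = 0;
  const screen = loader(userId);
  return { screen, roundTrips };
}
```

Both loaders build the same screen; `measure(homeScreenChatty, "u1")` costs one round trip per list plus one for the list of lists, while `measure(homeScreenSingle, "u1")` costs one in total.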
Add server-side scripting capability
❏ Enable independent development & device optimization
❏ Profit
❏ UI (script) changes can happen independently
❏ Script changes can be pushed to running servers, so decoupled from API push schedule
❏ Server+UI changes usually involve API team
Impact on velocity and collaboration
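One way to picture scripts being pushed to running servers is a registry that swaps endpoint handlers at runtime. This is only a sketch under stated assumptions: `ScriptRegistry`, `publish`, `handle`, and the stand-in `serviceLayer` are hypothetical names, and the real platform runs Groovy scripts inside a Java process rather than JavaScript functions.

```javascript
// Hypothetical registry mapping endpoint paths to device-specific scripts.
class ScriptRegistry {
  constructor(serviceLayer) {
    this.serviceLayer = serviceLayer; // stable API owned by the server team
    this.scripts = new Map();
  }

  // Publish or replace a script while the server keeps running.
  publish(path, script) {
    const current = this.scripts.get(path);
    const version = current ? current.version + 1 : 1;
    this.scripts.set(path, { version, script });
    return version;
  }

  // Route a request to whichever script version is currently active.
  handle(path, request) {
    const entry = this.scripts.get(path);
    if (!entry) throw new Error("no script published for " + path);
    return entry.script(request, this.serviceLayer);
  }
}

// Stand-in for the service layer (illustrative data only).
const serviceLayer = {
  getLists: (userId) => ["continue-watching", "top-picks", "new-releases"],
};

const registry = new ScriptRegistry(serviceLayer);

// Version 1 of the TV home script returns every list.
registry.publish("/tv/home", (request, api) => api.getLists(request.userId));
const before = registry.handle("/tv/home", { userId: "u1" });

// The UI team ships version 2 (only two rows) without an API deployment.
const version = registry.publish("/tv/home", (request, api) =>
  api.getLists(request.userId).slice(0, 2)
);
const after = registry.handle("/tv/home", { userId: "u1" });
```

The service layer object never changes between the two versions, which is the decoupling the bullets describe: the script evolves on its own schedule against a stable server-side API.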
[Architecture diagram: Client (application) -> Internet -> Server (ELB, Zuul, scriptable backend with endpoints such as /tv/home, Java service layer with RxJava and Hystrix) -> mid-tier services. Team labels: UI teams, API team, service teams.]
Scriptable Backend + API Layer
Goals
❏ Flexibility
❏ Resiliency
❏ Scalability
❏ Excellent tools
API
https://github.com/Netflix/Hystrix
resilience patterns for distributed systems
Hystrix Primer
❏ Protection from and control over latency and failure from dependencies
❏ Stop cascading failures in a complex distributed system
❏ Fail fast and rapidly recover
❏ Fall back and gracefully degrade
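The ideas in this primer can be sketched in a few lines. The following is not Hystrix's API (Hystrix is a Java library with thread pools, timeouts, and rolling error statistics); `GuardedCall` and its options are hypothetical names for a deliberately simplified illustration of failing fast, falling back, and recovering after a pause.

```javascript
// Simplified circuit breaker with a fallback (illustrative, not Hystrix).
class GuardedCall {
  constructor({ run, fallback, failureThreshold = 3, sleepWindowMs = 5000, now = Date.now }) {
    this.run = run; // the call to the dependency
    this.fallback = fallback; // degraded answer used when the call fails
    this.failureThreshold = failureThreshold;
    this.sleepWindowMs = sleepWindowMs;
    this.now = now; // injectable clock, which keeps tests deterministic
    this.failures = 0; // consecutive failures seen so far
    this.openedAt = null; // time the circuit opened, or null when closed
  }

  execute(...args) {
    const open = this.openedAt !== null && this.now() - this.openedAt < this.sleepWindowMs;
    if (open) {
      // Fail fast: skip the struggling dependency entirely.
      return this.fallback(...args);
    }
    try {
      const result = this.run(...args);
      // Any success closes the circuit again (rapid recovery).
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = this.now(); // open (or re-open) the circuit
      }
      // Degrade gracefully instead of propagating the failure.
      return this.fallback(...args);
    }
  }
}
```

Once the threshold is reached, callers receive the fallback without the dependency being touched; after the sleep window, a single trial call decides whether the circuit closes or opens again.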
[Diagram: the API depends on Personalization Engine, Similar Movies, Movie Metadata, Ratings, User Info, Instant Queue, and A/B Test Engine.]
[Diagram: the API and the same dependencies. Caption: Beware cascading failure!]
[Diagram: the API and the same dependencies, with a fallback response. Caption: Local fallback avoids cascading failure!]
[Diagram: the API and the same dependencies, with a fallback response. Caption: Use FIT to test such failures.]
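A rough sketch of testing a fallback by injecting a failure for one specific request. This is not FIT's actual mechanism or API; the `injectable` wrapper, the `injectedFailures` request field, and the `titlePage` handler are hypothetical names chosen for illustration.

```javascript
// Wraps a dependency call so a test can force it to fail for chosen requests.
function injectable(name, call) {
  return (request, ...args) => {
    if (request.injectedFailures && request.injectedFailures.has(name)) {
      throw new Error("injected failure: " + name);
    }
    return call(request, ...args);
  };
}

// Hypothetical Ratings dependency.
const getRatings = injectable("ratings", (request, titleId) => ({ titleId, stars: 4 }));

// Request handler with a local fallback for the Ratings dependency.
function titlePage(request, titleId) {
  let ratings;
  try {
    ratings = getRatings(request, titleId);
  } catch (err) {
    ratings = { titleId, stars: null, degraded: true }; // fallback response
  }
  return { titleId, ratings };
}

// Normal traffic is untouched; only the tagged request sees the failure.
const normalPage = titlePage({ userId: "u1" }, "t42");
const degradedPage = titlePage(
  { userId: "u1", injectedFailures: new Set(["ratings"]) },
  "t42"
);
```

Scoping the failure to a tagged request lets the fallback path be exercised without taking the dependency down for everyone else.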
Goals
❏ Flexibility
❏ Resiliency
❏ Scalability
❏ Excellent tools
API
Autoscaling & Capacity Management
http://nflx.it/1LvqLUi
AWS Controls: reactive, does not scale up fast enough
Fine-grained Control with Scryer
Complements AWS Controls
❏ Faster scale-up, improved cost
❏ Use reactive policy for organic scale down
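The split between predictive scale-up and reactive scale-down can be illustrated with a toy capacity plan. This is not Scryer's algorithm, which the slides do not describe; the averaging predictor, the headroom figure, and every function name here are assumptions made for the sketch.

```javascript
// Predict load from requests per second seen at the same time on earlier days.
function predictRps(history) {
  return history.reduce((sum, value) => sum + value, 0) / history.length;
}

// Instances needed to serve a load with some spare capacity.
function instancesFor(rps, rpsPerInstance, headroomPercent = 20) {
  return Math.ceil((rps * (100 + headroomPercent)) / (100 * rpsPerInstance));
}

// The predictive step only scales up ahead of demand; scaling down is
// left to the reactive policy, so it never returns fewer than the
// current instance count.
function planCapacity({ history, rpsPerInstance, currentInstances }) {
  const needed = instancesFor(predictRps(history), rpsPerInstance);
  return Math.max(currentInstances, needed);
}

const scaleUp = planCapacity({
  history: [900, 1000, 1100],
  rpsPerInstance: 100,
  currentInstances: 8,
});
const holdSteady = planCapacity({
  history: [900, 1000, 1100],
  rpsPerInstance: 100,
  currentInstances: 20,
});
```

With the sample history the prediction is 1000 requests per second, so a cluster of 8 is raised ahead of the peak while a cluster of 20 is left alone for the reactive policy to shrink.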
Goals
❏ Flexibility
❏ Resiliency
❏ Scalability
❏ Excellent tools
API
Run 1% of your traffic on the new code and see how it does
❏ Response codes: 2xx, 4xx, 5xx
❏ latency
❏ network
❏ busy threads
❏ load
❏ ...
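A canary comparison over metrics like these could look roughly as follows. The metric names, the 10% tolerance, and the `judgeCanary` function are assumptions for illustration; the slides do not say how the real analysis scores a canary.

```javascript
// Compares canary metrics against a control cluster running the old code.
// Every metric in this sketch is "lower is better".
function judgeCanary(control, canary, maxRatio = 1.1) {
  const regressions = Object.keys(control).filter((metric) => {
    const baseline = control[metric];
    const observed = canary[metric];
    if (baseline === 0) return observed > 0; // avoid dividing by zero
    return observed / baseline > maxRatio;
  });
  return { pass: regressions.length === 0, regressions };
}

const control = { errors5xx: 10, latencyMs: 100, busyThreads: 20 };

// Slightly slower but within tolerance on every metric.
const healthy = judgeCanary(control, { errors5xx: 10, latencyMs: 105, busyThreads: 21 });

// Triple the server errors: the canary should not be promoted.
const unhealthy = judgeCanary(control, { errors5xx: 30, latencyMs: 100, busyThreads: 20 });
```

Comparing against a control that receives the same share of traffic, rather than against the whole fleet, keeps the two sides of the ratio comparable.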
So you’ve run a canary. Now what?
Control Canary
Successful canary
red/black push
Continuous Delivery
http://techblog.netflix.com/2015/09/moving-from-asgard-to-spinnaker.html
Quickly see status of all clusters
http://techblog.netflix.com/2015/09/moving-from-asgard-to-spinnaker.html
Script Management
Deployment & Ops
Real-time analysis
http://www.slideshare.net/g9yuayon/qcon-talk-on-netflix-mantis-a-stream-processing-system
Submit a query, see requests in real time.
Looking ahead - current challenges
❏ Breaking up the monolith
❏ Script isolation
❏ Thin client libraries
❏ New interaction models
Looking ahead
Source: http://techcrunch.com/2014/03/08/success-reality-and-the-myth-of-up-and-to-the-right/
Looking ahead
❏ Breaking up the monolith
❏ Script isolation
❏ Thin client libraries
❏ New interaction models
Breaking up the monolith
● > 900 active endpoints
● ~ 30 client libraries
● 78 thread pools
● high memory usage
Script isolation & node
❏ Groovy scripts run as part of API process
❏ UI teams would like to use other languages (in particular node.js)
[Diagram: API remote service layer, service client libraries, UI/device scripts (node)]
Falcor
var response = model.get("todos[0..2]['name','done']");
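To make the path expression concrete, here is a sketch of what a path set such as `todos[0..2]['name','done']` asks for, written without the Falcor library. It is not Falcor's implementation; `keysOf`, `expandPathSet`, `getValues`, and the sample `graph` are hypothetical, and the path set is written in its array form with a `{from, to}` range.

```javascript
// Turn one key set (a key, an array of keys, or a {from, to} range) into keys.
function keysOf(keySet) {
  if (Array.isArray(keySet)) return keySet;
  if (typeof keySet === "object" && keySet !== null) {
    const count = keySet.to - keySet.from + 1;
    return Array.from({ length: count }, (_, i) => keySet.from + i);
  }
  return [keySet];
}

// Expand a path set into every individual path it covers.
function expandPathSet(pathSet) {
  return pathSet.reduce(
    (paths, keySet) =>
      paths.flatMap((path) => keysOf(keySet).map((key) => [...path, key])),
    [[]]
  );
}

// Resolve every expanded path against a plain JSON object.
function getValues(graph, pathSet) {
  return expandPathSet(pathSet).map((path) => ({
    path,
    value: path.reduce((node, key) => (node == null ? undefined : node[key]), graph),
  }));
}

// Illustrative data only.
const graph = {
  todos: [
    { name: "get milk", done: false },
    { name: "withdraw money", done: true },
    { name: "pay rent", done: false },
    { name: "call plumber", done: false },
  ],
};

// Array form of todos[0..2]['name','done'].
const pathSet = ["todos", { from: 0, to: 2 }, ["name", "done"]];
const values = getValues(graph, pathSet);
```

A single expression covers six values (two fields for each of three items), which is how one call can replace several separate requests.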
Thin client libraries
❏ Many client libraries contain a lot of business logic and have a lot of dependencies
❏ Move business logic and dependencies to server
[Diagram: API remote service layer, service client libraries, UI/device scripts (node), Falcor]
Looking ahead
❏ Breaking up the monolith
❏ Script isolation
❏ Thin client libraries
❏ New interaction models
New interaction models
❏ request/response
❏ request/stream
❏ fire-and-forget
❏ event subscription
❏ channel
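The five models differ mainly in how many messages flow in each direction. The in-memory sketch below illustrates those shapes with plain callbacks; it is not the ReactiveSocket protocol or API, and `createConnection`, the handler names, and the sample handlers are all hypothetical.

```javascript
// In-memory stand-in for a connection between a device and the server.
function createConnection(handlers) {
  const subscribers = new Set();
  return {
    // request/response: one message in, one message out.
    requestResponse: (payload) => handlers.requestResponse(payload),
    // request/stream: one message in, many messages out.
    requestStream: (payload, onNext) => {
      for (const item of handlers.requestStream(payload)) onNext(item);
    },
    // fire-and-forget: one message in, nothing comes back.
    fireAndForget: (payload) => {
      handlers.fireAndForget(payload);
    },
    // event subscription: the server pushes events until unsubscribed.
    subscribe: (onEvent) => {
      subscribers.add(onEvent);
      return () => subscribers.delete(onEvent);
    },
    publish: (event) => subscribers.forEach((onEvent) => onEvent(event)),
    // channel: many messages in, many messages out, in both directions.
    channel: (onNext) => ({
      send: (message) => handlers.channel(message, onNext),
    }),
  };
}

// Sample server-side handlers (illustrative only).
const received = [];
const connection = createConnection({
  requestResponse: (titleId) => ({ titleId, title: "Title " + titleId }),
  requestStream: function* (count) {
    for (let i = 1; i <= count; i += 1) yield "row " + i;
  },
  fireAndForget: (event) => received.push(event),
  channel: (message, emit) => {
    emit(message.toUpperCase());
    emit("ack");
  },
});
```

For example, `connection.requestStream(3, console.log)` delivers three rows for one request, while `connection.fireAndForget("play-started")` returns nothing to the caller.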
[Diagram: API remote service layer, service client libraries, UI/device scripts (node), Falcor]
http://reactivesocket.io
In the beginning...
