130 million customers
in over 190 countries
streaming 140 million hrs/day
We use a data-driven approach via A/B testing for most changes to our
product, ensuring every change delights our customers
source: https://www.optimizely.com/optimization-glossary/ab-testing/
1000s of A/B tests a year
Netflix API
http://api.netflix.com
[Diagram: Clients (TV, iOS, Android, Windows, Browsers), Client API, Edge API, Remote Service Layer, Backend Services (Search, MAP, GPS, Playback, …)]
The Netflix API decouples clients from the backend services, providing an
integration point for both services and clients
BFF
The Netflix API uses the BFF (backend for frontend) pattern, where the
BFF is tightly coupled to each device — making it easier to define and
adapt the UI, and streamlining releases
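As a rough illustration of the BFF idea, the sketch below shapes the same backend data differently for two devices. The backend functions (`fetchTitle`, `fetchArtwork`) and the response shapes are hypothetical stand-ins, not Netflix APIs.

```javascript
// Hypothetical backend services shared by every device (not real Netflix services).
const backend = {
  fetchTitle: (id) => ({ id, name: 'Example Title', synopsis: 'A long synopsis for large screens.' }),
  fetchArtwork: (id, size) => ({ url: `/art/${id}/${size}.jpg` }),
};

// One BFF per device: each decides which backend calls to make
// and how to shape the response for its own UI.
const tvBff = {
  titlePage(id) {
    const title = backend.fetchTitle(id);
    return { name: title.name, synopsis: title.synopsis, art: backend.fetchArtwork(id, 'large').url };
  },
};

const phoneBff = {
  titlePage(id) {
    const title = backend.fetchTitle(id);
    // Small screen: drop the synopsis and request smaller artwork.
    return { name: title.name, art: backend.fetchArtwork(id, 'small').url };
  },
};

console.log(tvBff.titlePage(42));
console.log(phoneBff.titlePage(42));
```

Because each BFF owns its response shape, a UI team could change the phone payload without touching the TV one.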
These BFFs are maintained by the UI teams, since they are tightly coupled to
their UI
Netflix API requirements
Velocity
Reliability
Ergonomic
No Operations
Going FaaSter: Function as a Service at Netflix
Yunong Xiao,
Principal Software Engineer, Netflix
FaaS Evolution
[Diagram: who manages each layer (λ, Application, Services, Platform) across Pre-Cloud/On Prem, IaaS, PaaS, and FaaS. Legend: Others Manage / You Manage]
Pros Cons
No-ops Homogenous architecture
Accessible monitoring & debugging
Velocity Netflix stack integration
Reliable service platform Limits: latency, memory,
execution time
Build or buy?
We’ll cover:
Runtime platform architecture
Developer experience
Management & operations
We are almost completely hosted in the cloud using AWS
EC2 makes up the foundation of infrastructure at Netflix
VMs or Containers?
We chose to use containers as the foundation of our FaaS platform, as it
gave us advantages which let us build a platform that is ergonomic,
efficient, with high deployment velocity
Lightweight & Fast Deployments
Portability across environments
Efficient bin packing
We built Titus — our own container management platform — capable of
launching millions of containers a day
We have created a reliable, open source services platform
Service Discovery: Eureka https://github.com/Netflix/eureka
RPC: Ribbon (HTTP), gRPC https://github.com/Netflix/ribbon
Configuration: Archaius https://github.com/Netflix/archaius
Metrics: Atlas https://github.com/Netflix/atlas
Fault tolerance: Hystrix https://github.com/Netflix/hystrix
External LB: Zuul https://github.com/Netflix/zuul
Tracing: Mantis, Salp
…
Assembling these components yourself is time consuming, difficult, and
error prone
You always have to keep components updated to the latest versions
yourself
You have to ensure that metrics and dashboards are created for your
service
You’re on the hook for managing and operating the infrastructure
You shouldn’t have to set everything up from scratch every time when all
you care about is the business logic
We set out to build our runtime FaaS platform that solves these issues
No assembly required
Automatic updates
Observable metrics
Managed operations
The platform is a services container that has been pre-assembled with all
of the components needed for a production ready service
[Diagram: platform container components: Server, Service Discovery Daemon, Metrics Daemon, Log rotation, Service Registration, Configuration, Metrics, Stream Processing, RPC Clients, Auth, Throttling]
All that’s needed is for customers to insert their business logic
[Diagram: the pre-assembled platform container with customer functions added as routes: Route /foo, Route /bar, …]
We package and version the platform as a single entity, so we can upgrade
and test the components once and ensure everyone receives the
upgrade
Because we control the runtime, the platform can emit a consistent set of
application, RPC, and systems metrics for every function
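One way a platform could emit the same metrics for every function is to wrap each handler before registering it. The sketch below is an assumption about the mechanism, not the actual platform code; `withMetrics` and the in-memory `metrics` store are hypothetical names.

```javascript
// In-memory stand-in for a metrics backend (hypothetical).
const metrics = {};

// Wrap a Connect-style handler so the platform, not the function author,
// records request count and latency for every function.
function withMetrics(name, handler) {
  return function instrumented(req, res, next) {
    const start = Date.now();
    const m = metrics[name] || (metrics[name] = { count: 0, totalMs: 0 });
    return handler(req, res, function done(err) {
      m.count += 1;
      m.totalMs += Date.now() - start;
      return next(err);
    });
  };
}

// Usage: the function author writes only the handler.
const foo = withMetrics('/foo', (req, res, next) => next());
foo({}, {}, () => {});
foo({}, {}, () => {});
console.log(metrics['/foo'].count);
```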
We’ll cover:
Runtime platform architecture
Developer experience
Management & operations
{
  "service": {
    "org": "iosui",
    "name": "iphone"
  },
  "platformVersion": "^6.0.0",
  "routes": {
    "routes": {
      "movies": {
        "get": {
          "source": "./lib/endpoints/movies.js"
        }
      },
      "profile": {
        "post": {
          "source": "./lib/endpoints/profile.js"
        }
      }
    }
  },
  "sources": ["./lib"],
  "propertiesPath": "./etc",
  "startupHooks": [
    "./hooks/startupHook.js"
  ]
}
Functions are managed via a configuration API, where most fields are
optional.
Service name ("service")
FaaS platform version ("platformVersion")
Function declarations ("routes")
Additional source code ("sources")
Configuration ("propertiesPath")
Lifecycle management ("startupHooks")
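To make the route declarations concrete, here is a minimal sketch of how a loader might flatten them into a route table. The config values are copied from the slide; `listRoutes` and the convention that a route named `movies` is served at `/movies` are assumptions for illustration, since the real loader is not shown.

```javascript
// Config values copied from the slide; everything below it is hypothetical.
const config = {
  service: { org: 'iosui', name: 'iphone' },
  platformVersion: '^6.0.0',
  routes: {
    routes: {
      movies: { get: { source: './lib/endpoints/movies.js' } },
      profile: { post: { source: './lib/endpoints/profile.js' } },
    },
  },
};

// Flatten the nested declarations into one entry per (path, HTTP method).
function listRoutes(cfg) {
  const declared = (cfg.routes && cfg.routes.routes) || {};
  const table = [];
  for (const [name, methods] of Object.entries(declared)) {
    for (const [method, spec] of Object.entries(methods)) {
      // Assumption: the route name doubles as the URL path.
      table.push({ path: `/${name}`, method: method.toUpperCase(), source: spec.source });
    }
  }
  return table;
}

console.log(listRoutes(config));
```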
Business logic can be implemented as popular Node.js “Connect”-style
middleware functions, which handle requests.
module.exports = function(req, res, next) {
  res.send(200, req.query);
  return next();
};
req: HTTP Request object
res: HTTP Response
next: callback
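The following sketch shows how a Connect-style chain is typically driven: each handler calls `next()` to continue or `next(err)` to stop. The `runChain` helper and the fake `req`/`res` objects are hypothetical stand-ins for the platform's server, which is not shown in the slides.

```javascript
// Run Connect-style handlers in order. A handler calls next() to continue
// or next(err) to stop the chain early.
function runChain(handlers, req, res, done) {
  let i = 0;
  function next(err) {
    if (err || i >= handlers.length) return done(err || null);
    const handler = handlers[i++];
    return handler(req, res, next);
  }
  next();
}

// Minimal fake response that records what the handler sent.
const res = { send(status, body) { this.status = status; this.body = body; } };
const req = { query: { q: 'stranger things' } };

// The handler from the slide, unchanged.
const echoQuery = function (req, res, next) {
  res.send(200, req.query);
  return next();
};

runChain([echoQuery], req, res, (err) => {
  console.log(err, res.status, res.body);
});
```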
Platform components such as metrics, loggers, or RPC clients are
available via the “req” object — providing a full runtime API for
developers
module.exports = function ping(req, res, next) {
  req.log.info('Hello World!');
  req.getRequestContext(); // request context
  req.getAtlas();          // metrics client
  req.getDNAClient();      // RPC client
  req.getProperties();     // configuration client
  req.getEdgar();          // tracing
  req.getMantis();         // stream processing client
  req.getGeo();            // geo location
  req.getPassport();       // auth
  return next();
};
Long lived third party libraries can be managed via startup and shutdown
lifecycle hooks.
"startupHooks": [
"./hooks/startupHook.js"
],
"shutdownHooks": [
"./hooks/shutdownHook.js"
]
Hooks run before the platform starts, have access to all platform
components, and allow third party libraries to be made available on
the request object
// executed before platform starts
module.exports = function startuphook(opts, cb) {
  // access to all platform components
  opts.atlas;
  opts.infrastructureInfo;
  opts.log;
  // ...
  opts.properties;
  opts.serviceInfo;
  // return an object that will be made available
  // to all functions
  return cb(null, { foo: 'bar' });
};
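A platform could run such hooks one after another and merge whatever they return into a single object for later requests. The sketch below assumes that behavior; `runStartupHooks` and the contents of `opts` are hypothetical, and the real platform may order or merge hooks differently.

```javascript
// Run startup hooks sequentially, then merge their results into one object
// that a platform could expose to every function.
function runStartupHooks(hooks, opts, done) {
  const merged = {};
  let i = 0;
  function runNext(err, result) {
    if (err) return done(err);
    Object.assign(merged, result);
    if (i >= hooks.length) return done(null, merged);
    return hooks[i++](opts, runNext);
  }
  runNext(null, {});
}

// Same shape as the hook on the slide.
function startupHook(opts, cb) {
  opts.log.info('warming up');
  return cb(null, { foo: 'bar' });
}

// Fake platform components passed to each hook (hypothetical).
const logged = [];
const opts = { log: { info: (msg) => logged.push(msg) } };

let sharedForRequests;
runStartupHooks([startupHook], opts, (err, result) => {
  if (err) throw err;
  sharedForRequests = result;
  console.log(sharedForRequests);
});
```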
External dependencies can be imported from
Our goal is to create a local function development experience that
improves the software development life cycle for developers
We created a developer workflow tool called NEWT (Netflix Workflow
Toolkit) which simplifies and facilitates common developer tasks:
Development
Debugging
Testing
Publishing
Deployment
One-click setup for a consistent development environment. Installs
dependencies and keeps them updated
We created a development FaaS platform for local development —
enabling engineers to interactively test functions in seconds —
reducing friction and increasing velocity
[Diagram: Dev FaaS platform (same components as the production platform) running local functions with live reload]
Local debugging further increases velocity and reduces friction of the
SDLC
[Diagram: Dev FaaS platform with an attached debugger, local testing, and logs]
The local FaaS platform can be integrated and routed within the Netflix
cloud, enabling seamless end to end testing
[Diagram: Device, Zuul (Auth, SSL, …), local functions on the dev FaaS platform, and backend services]
Teams also want to test functions in isolation without having to connect
to or depend on upstream and downstream services
[Diagram: isolated local functions, shown alongside Device, Zuul (Auth, SSL, …), and backend services]
The FaaS platform provides mocks and unit test APIs which allow teams
to test functions in isolation without having to connect to or depend on
upstream and downstream services
The runtime API shown earlier (req.getAtlas(), req.getDNAClient(), and so on) requires downstream services to be available
// Unit test
// `mocks` comes from the platform's unit test API; `assert` is a Chai-style assert
it('should create all mocks', function(done) {
  mocks.create(function(err, allMocks) {
    if (err) { return done(err); }
    assert.isObject(allMocks);
    assert.isObject(allMocks.log);
    assert.isObject(allMocks.properties);
    // ...
    assert.isObject(allMocks.req);
    assert.isObject(allMocks.res);
    return done();
  });
});
Mocks are available from the unit test API
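The platform's own mock API is not shown beyond `mocks.create`, so the sketch below hand-rolls equivalent mocks to show what isolated testing looks like. `createMocks`, the `increment` method on the metrics client, and the `ping.count` metric name are all hypothetical.

```javascript
// Function under test: logs, records a metric, and responds.
// The `increment` method on the metrics client is an assumption.
function ping(req, res, next) {
  req.log.info('Hello World!');
  req.getAtlas().increment('ping.count');
  res.send(200, { ok: true });
  return next();
}

// Hand-rolled mocks that record calls instead of talking to real services.
function createMocks() {
  const calls = { log: [], metrics: [], sent: [] };
  const req = {
    log: { info: (msg) => calls.log.push(msg) },
    getAtlas: () => ({ increment: (name) => calls.metrics.push(name) }),
  };
  const res = { send: (status, body) => calls.sent.push({ status, body }) };
  return { req, res, calls };
}

const { req, res, calls } = createMocks();
let finished = false;
ping(req, res, () => { finished = true; });
console.log(calls);
```

No upstream or downstream service is contacted: every dependency the function touches is a recorder that the test can inspect afterwards.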
This development platform can also be easily deployed to Jenkins using
NEWT, unlocking CI/CD tests for both the FaaS platform and functions
themselves
We’ll cover:
Runtime platform architecture
Developer experience
Management & operations
Publish
Deploy
Operate
Functions are published using our NEWT tool, and are immutably
versioned and saved in a central registry
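Immutable versioning can be pictured as a registry that refuses to overwrite a published version. This is a minimal in-memory sketch under that assumption; the `FunctionRegistry` class and its methods are hypothetical and do not describe the actual Netflix registry.

```javascript
// In-memory registry where a (name, version) pair can be published only once.
class FunctionRegistry {
  constructor() {
    this.entries = new Map();
  }

  publish(name, version, artifact) {
    const key = `${name}@${version}`;
    if (this.entries.has(key)) {
      throw new Error(`${key} is already published`);
    }
    // Freeze the record so it cannot be changed after publishing.
    this.entries.set(key, Object.freeze({ name, version, artifact }));
    return key;
  }

  get(name, version) {
    return this.entries.get(`${name}@${version}`);
  }
}

const registry = new FunctionRegistry();
console.log(registry.publish('iphone', '1.0.0', 'image-digest-placeholder'));
```

Publishing a fix then means publishing a new version, which keeps every deployed version reproducible.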
Under the hood, a Docker image is created at publish time by
combining the functions and the platform into one image, achieving
immutability
[Diagram: FaaS base platform image + customer functions (/etc/functions: myrepo/config.json, myrepo/foo.js, myrepo/bar.js) = customer function image]
The centralized function registry can be used to manage published
functions
These published functions can be deployed to the cloud via the NEWT
deploy commands
Functions are deployed using Titus, with most functions scheduled in under
a few minutes
[Diagram: Registry, Titus container scheduler, and the scheduled function containers]
Canary deployment and analysis can be used as part of deployment,
minimizing outages and increasing availability
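As a simplified picture of canary analysis, the sketch below compares the error rate of a canary against the current baseline and passes only if the canary is not noticeably worse. The single metric, the `canaryPasses` helper, and the 1% tolerance are illustrative assumptions; real canary analysis usually compares many metrics.

```javascript
// Fraction of requests that failed; an idle deployment counts as healthy.
function errorRate(stats) {
  return stats.requests === 0 ? 0 : stats.errors / stats.requests;
}

// Pass when the canary's error rate does not exceed the baseline's
// by more than `tolerance` (an illustrative threshold).
function canaryPasses(baseline, canary, tolerance = 0.01) {
  return errorRate(canary) <= errorRate(baseline) + tolerance;
}

console.log(canaryPasses({ requests: 1000, errors: 5 }, { requests: 1000, errors: 8 }));
console.log(canaryPasses({ requests: 1000, errors: 5 }, { requests: 1000, errors: 40 }));
```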
Each deployed function version can be managed via the control plane,
with access to detailed runtime information
Detailed historical deployment and management activity is available to aid
debugging
Autoscaling is used to automatically scale the infrastructure for each
function, saving costs and increasing availability. We require an initial
baseline configuration for each function
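The role of the baseline configuration can be sketched with a simple target-tracking rule: size the fleet to the current load, clamped to the function's configured bounds. The field names (`min`, `max`, `targetRpsPerInstance`) and the rule itself are assumptions for illustration, not the platform's actual policy.

```javascript
// Target tracking: aim for roughly `targetRpsPerInstance` requests per second
// on each instance, never going below `min` or above `max`.
function desiredInstances(currentRps, baseline) {
  const needed = Math.ceil(currentRps / baseline.targetRpsPerInstance);
  return Math.min(baseline.max, Math.max(baseline.min, needed));
}

// Hypothetical baseline configuration for one function.
const baseline = { min: 2, max: 20, targetRpsPerInstance: 100 };

console.log(desiredInstances(50, baseline));   // low traffic: stays at the minimum
console.log(desiredInstances(950, baseline));  // scales with load
console.log(desiredInstances(5000, baseline)); // capped at the maximum
```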
Metrics and dashboards are automatically generated for each function
Alerts are automatically generated based on metrics
Real time and historical logs are available
Profiling and post mortem debugging tools are made available
The infrastructure and operations of the platform and application itself are
handled by the centralized API platform team. UI teams are only
responsible for managing their individual functions
Netflix FaaS Platform
Runtime platform architecture
Developer experience
Management & operations
Questions?
@yunongx
yunong@netflix.com
@yunongx
linkedin.com/in/yunongxiao/
