Now Playing on Netflix:
Adventures in a Cloudy Future
CMG November 2013
Adrian Cockcroft
@adrianco @NetflixOSS
http://www.linkedin.com/in/adriancockcroft
Netflix Member Web Site Home Page
Personalization Driven – How Does It Work?
How Netflix Used to Work
[Diagram] Customer devices (PC, PS3, TV…) talked to a monolithic web app and a monolithic streaming app in the Oracle datacenter, backed by Oracle and MySQL. AWS cloud services handled content management and content encoding, feeding CDN edge locations run by the Limelight, Level 3, and Akamai CDNs.
How Netflix Streaming Works Today
[Diagram] Customer devices (PC, PS3, TV…) hit the web site or discovery API and the streaming API (with QoS logging) in AWS cloud services, which hold the user data and personalization. A small DRM datacenter remains. CDN management and steering, plus content encoding, direct devices to OpenConnect CDN boxes at CDN edge locations.
[Chart] Streaming bandwidth, Nov 2012 to March 2013: mean bandwidth up 39% in six months.
Netflix Scale
• Tens of thousands of instances on AWS
– Typically 4 cores, 30 GB RAM, running Java business logic
– Thousands created and removed every day

• Thousands of Cassandra NoSQL storage nodes
– Mostly 8 cores, 60 GB RAM, 2 TB of SSD
– 65 different clusters, over 300 TB of data, triple zone
– Over 40 are multi-region clusters (6, 9, or 12 zones)
– Biggest: 288 nodes, 300K reads/sec, 1.3M writes/sec
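As a back-of-the-envelope sanity check on that last line, the per-node rates for the biggest cluster work out to roughly 1K reads/sec and 4.5K writes/sec per node (the figures are from the slide; the arithmetic is ours):

```python
# Per-node throughput for the largest Cassandra cluster cited above.
nodes = 288
reads_per_sec = 300_000      # 300K reads/sec cluster-wide
writes_per_sec = 1_300_000   # 1.3M writes/sec cluster-wide

print(f"~{reads_per_sec / nodes:,.0f} reads/sec per node")    # ~1,042
print(f"~{writes_per_sec / nodes:,.0f} writes/sec per node")  # ~4,514
```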
Reactions over time
2009 “You guys are crazy! Can’t believe it”
2010 “What Netflix is doing won’t work”

2011 “It only works for ‘Unicorns’ like Netflix”
2012 “We’d like to do that but can’t”
2013 “We’re on our way using Netflix OSS code”
"This is the IT swamp draining manual for anyone who is neck deep in alligators." Adrian Cockcroft, Cloud Architect at Netflix
[Diagram] The eras of IT: Mainframe → Client/Server → Commodity → Cloud → Web-scale.
Goal of Traditional IT: reliable hardware running stable software.
SCALE breaks hardware.
SPEED breaks software.
SPEED at SCALE breaks everything.
Incidents – Impact and Mitigation
• X incidents: public relations / media impact (PR) – Y incidents mitigated by active/active operation and game-day practicing
• XX incidents: high customer service call volume (CS) – YY incidents mitigated by better tools and practices
• XXX incidents: affect A/B test results (metrics impact; feature disabled) – YYY incidents mitigated by better data tagging
• XXXX incidents: no impact – fast retry or automated failover
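The "no impact" class in the list above depends on clients that retry quickly and fail over to another replica automatically. A minimal sketch of that pattern (replica list, timeouts, and backoff values are illustrative, not Netflix's actual client code):

```python
import time

def call_with_failover(replicas, request, timeout=0.25, retries=2):
    """Try each replica with a short timeout; first success wins.

    replicas: list of callables, e.g. one per availability zone.
    """
    last_error = None
    for attempt in range(retries + 1):
        for replica in replicas:
            try:
                return replica(request, timeout=timeout)
            except Exception as err:  # timeout, connection error, 5xx...
                last_error = err      # fall through to the next replica
        time.sleep(0.05 * (attempt + 1))  # brief backoff before another pass
    raise last_error  # every replica failed: surface the error
```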
Web Scale Architecture
[Diagram] DNS automation drives AWS Route53, DynECT DNS, and UltraDNS, steering traffic to regional load balancers in two regions; each region spans three availability zones (A, B, C) with Cassandra replicas in every zone.
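A hedged sketch of what the "DNS automation" box does: shifting a weighted DNS record to move traffic between regions' load balancers. Shown here against Route53 with boto3 for brevity; the zone ID, record, and set identifier are placeholders, and the real tooling also drives DynECT and UltraDNS:

```python
import boto3

def set_region_weight(zone_id, record_name, region_lb_dns, set_id, weight):
    """Adjust the weighted-routing weight for one region's load balancer."""
    boto3.client("route53").change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "SetIdentifier": set_id,  # one record set per region
                    "Weight": weight,         # weight 0 drains the region
                    "TTL": 60,
                    "ResourceRecords": [{"Value": region_lb_dns}],
                },
            }]
        },
    )

# Drain west-coast traffic during a regional incident (illustrative names):
# set_region_weight("Z123EXAMPLE", "api.example.com.", "lb-west.example.com", "us-west", 0)
```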
CIO Says Speed IT Up!
“Get inside your adversaries'
OODA loop to disorient them”
Colonel Boyd, USAF
[Diagram: the OODA loop] Observe: competitive move, customer pain point, measure customers. Orient: analysis, model hypotheses. Decide: plan response, get buy-in, commit resources. Act: implement, deliver, engage customers, land-grab opportunity.
[Diagram: Mainframe-era OODA loop, one-year cycle] Observe: foreign competition, customer pain point, measure revenue. Orient: systems analysis, capacity model. Decide: five-year plan, board-level buy-in, vendor evaluation. Act: customize vendor software, upgrade mainframe, print ad campaign, territory expansion.
80’s Mainframe Innovation Cycle
• Cost $1M to $100M
• Duration 1 to 5 years
• Bet the whole company
• Cost of failure – bankrupt or bought
• Cobol and DB2 on MVS
[Diagram: Client/Server-era OODA loop, three-month cycle] Observe: foreign competition, customer pain point, measure revenue. Orient: data warehouse, capacity estimate. Decide: one-year plan, CIO-level buy-in, vendor evaluation. Act: customize vendor software, install servers, TV advert campaign, territory expansion.
90’s Client Server Innovation Cycle
• Cost $100K to $10M
• Duration 3 – 12 months
• Bet a product line or division
• Cost of failure – revenue hit, CIO’s job
• C++ and Oracle on Solaris
[Diagram: Commodity-era OODA loop, two-week agile train] Observe: competitive moves, customer pain point, measure sales. Orient: data warehouse, capacity estimate. Decide: two-week plan, business buy-in, feature priority. Act: code feature, install capacity, web display ads, territory expansion.
00’s Commodity Agile Innovation Cycle
• Cost $10K to $1M
• Duration 2 – 12 weeks
• Bet a product feature
• Cost of failure – product manager’s reputation
• Java and MySQL on Red Hat Linux
Train Model Process Hand-Off Steps
Product Manager → Developer → QA Integration Team → Operations Deploy Team → BI Analytics Team
What Happened?
The rate of change increased, while the cost, size, and risk of each change fell.
Cloud Native
Construct a highly agile and highly available service from ephemeral, assumed-broken components.
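"Assumed broken" in practice means every dependency call is wrapped in a timeout, a fallback, and a circuit breaker. Netflix open-sourced this pattern as Hystrix; the sketch below is a simplified illustration of the idea, not the Hystrix API:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trip after N consecutive failures,
    then fail fast for a cool-off period before trying again."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                return fallback()             # circuit open: fail fast
            self.opened_at = None             # cool-off over: probe again
        try:
            result = func()
            self.failures = 0                 # success resets the count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # trip the breaker
            return fallback()                 # degrade gracefully
```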
Real Web Server Dependencies Flow
(Netflix home page business transaction as seen by AppDynamics)
[Diagram] Each icon is three to a few hundred instances across three AWS zones. The flow starts at a web service and fans out through the personalization movie group choosers (for US, Canada, and Latam) to Cassandra, memcached, and an S3 bucket.
Continuous Deployment
• No time for hand-off to IT – developer self-service
• Freedom and responsibility – developers run what they wrote, with root access and PagerDuty
• IT is a cloud API – DevOps automation
• GitHub all the things! Leverage social coding
Putting it all together…
[Diagram: continuous-delivery-on-cloud OODA loop] Observe: competitive move, customer pain point, measure customers. Orient: analysis, model hypotheses. Decide: plan response, share plans, JFDI. Act: increment and implement, automatic deploy, launch A/B test, land-grab opportunity.
Continuous Innovation Cycle
• Cost near zero, variable expense
• Duration hours to days
• Bet a decoupled microservice code push
• Cost of failure – near zero, instant rollback
• Clojure/Scala/Python on NoSQL on Cloud
Continuous Deploy Hand-Off Steps
• Product Manager: A/B test setup and enable; self-service hypothesis test results
• Developer: automated test; self-service deploy, on call; self-service analytics
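One common way to make "A/B test setup and enable" self-service is deterministic bucketing, so a customer always lands in the same test cell without any stored state. A sketch under that assumption (not Netflix's actual allocation code):

```python
import hashlib

def ab_cell(customer_id: str, test_name: str, num_cells: int = 2) -> int:
    """Deterministically assign a customer to a test cell (0 = control)."""
    digest = hashlib.sha256(f"{test_name}:{customer_id}".encode()).hexdigest()
    return int(digest, 16) % num_cells

# Same customer and test always map to the same cell:
print(ab_cell("user-42", "new-row-layout", num_cells=3))
```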
Continuous Deploy Automation
• Check in code; Jenkins build
• Bake AMI; launch in test environment
• Functional and performance test
• Production canary test
• Production red/black push
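Those five steps chain into a single automated pipeline. A compressed orchestration sketch, with stub functions standing in for the real Jenkins jobs and the bakery/deploy tooling Netflix later open-sourced (Aminator, Asgard):

```python
# Stubs standing in for the real build, bake, and deploy tooling.
def jenkins_build(commit):            return f"build-{commit}"
def bake_ami(build):                  return f"ami-{build}"
def launch(ami, env, count=3):        return {"ami": ami, "env": env, "count": count}
def tests_pass(cluster):              return True   # functional + performance suite
def canary_healthy(canary, baseline): return True   # compare error/latency metrics
def shift_traffic(to):                print(f"traffic -> {to['ami']}")

def deploy_pipeline(commit):
    """Illustrative end-to-end flow for one code push (red/black)."""
    ami = bake_ami(jenkins_build(commit))        # check in, build, bake AMI
    assert tests_pass(launch(ami, env="test"))   # functional and perf tests

    canary = launch(ami, env="prod", count=1)    # canary beside the old cluster
    if not canary_healthy(canary, baseline="old"):
        raise RuntimeError("bad canary: old cluster left untouched")

    new_cluster = launch(ami, env="prod")        # red/black: full new cluster
    shift_traffic(to=new_cluster)                # old cluster kept for rollback

deploy_pipeline("abc123")
```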
Bad Canary Signature
Happy Canary Signature
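A "bad canary signature" is usually defined relative to the baseline cluster rather than in absolute terms. A minimal scoring sketch under that assumption (metric names and the 20% tolerance are illustrative):

```python
def canary_verdict(canary, baseline, max_ratio=1.2):
    """Compare canary metrics against the production baseline.

    canary / baseline: dicts of metric -> value where lower is better
    (e.g. error rate, p99 latency). Worse than 20% over baseline on
    any metric is a bad signature.
    """
    for metric, base_value in baseline.items():
        if canary[metric] > base_value * max_ratio:
            return f"BAD: {metric} {canary[metric]:.3g} vs baseline {base_value:.3g}"
    return "HAPPY: within tolerance on all metrics"

print(canary_verdict({"error_rate": 0.05, "p99_ms": 210},
                     {"error_rate": 0.01, "p99_ms": 200}))
```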
Global Deploy Automation
Afternoon in California is night-time in Europe: if a push passes the test suite, it is canaried and then deployed, region by region.
[Diagram] West Coast, East Coast, and Europe load balancers each front three zones (A, B, C) with Cassandra replicas in every zone. Each region gets a canary then a deploy: after peak in Europe, the next day on the West Coast, and the next day on the East Coast.
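One way to express that follow-the-sun schedule in code (region names and window descriptions are illustrative; the per-region step is the pipeline sketched earlier):

```python
# Each region is pushed in its local off-peak window, canary first.
ROLLOUT = [
    ("eu-west-1", "after European evening peak"),
    ("us-west-2", "next day, West Coast off-peak"),
    ("us-east-1", "next day, East Coast off-peak"),
]

def global_rollout(ami, deploy_region):
    for region, window in ROLLOUT:
        print(f"{region}: canary then deploy ({window})")
        deploy_region(ami, region)  # per-region canary + red/black push

global_rollout("ami-0123", lambda ami, region: None)  # stub deploy for illustration
```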
Ephemeral Instances
• Largest services are autoscaled
• Average lifetime of an instance is 36 hours

[Chart] Instance counts autoscale up and down with the daily traffic cycle; a push (new code deploy) also replaces instances.
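A hedged sketch of the reactive autoscaling behind that chart, using the AWS Auto Scaling API via boto3 (the group name is a placeholder; Netflix's 2013 tooling drove the same underlying AWS APIs through Asgard):

```python
import boto3

def add_scale_out_policy(group_name="api-prod-v123"):
    """Attach a simple scale-out policy; a CloudWatch alarm (not shown) fires it."""
    boto3.client("autoscaling").put_scaling_policy(
        AutoScalingGroupName=group_name,
        PolicyName="scale-out-on-load",
        AdjustmentType="PercentChangeInCapacity",
        ScalingAdjustment=10,   # grow the group by 10% per alarm
        Cooldown=300,           # wait 5 minutes between adjustments
    )
```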
(New Today!) Predictive Autoscaling
[Charts] Predicted vs. actual traffic over 24 hours and across the week: more morning load, high traffic on Saturday and Sunday, lower load on Wednesday. The prediction drives the AWS autoscaler to plan capacity.
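Predictive autoscaling (Netflix called its system Scryer) acts ahead of demand instead of reacting to it. A toy sketch of the idea: forecast each hour's traffic from the same hour of the week in recent history, then set desired capacity before the load arrives. The forecast method, group name, and sizing numbers here are illustrative:

```python
import boto3

def forecast_next_hour(history, hour_of_week):
    """Toy forecast: average the same hour-of-week over recent weeks.

    history: dict of hour_of_week (0..167) -> list of past request rates.
    """
    samples = history[hour_of_week]
    return sum(samples) / len(samples)

def plan_capacity(history, hour_of_week, rps_per_instance=1000, headroom=1.3):
    predicted_rps = forecast_next_hour(history, hour_of_week)
    desired = int(predicted_rps * headroom / rps_per_instance) + 1
    # Push the plan to the AWS autoscaler ahead of the predicted load.
    boto3.client("autoscaling").set_desired_capacity(
        AutoScalingGroupName="api-prod-v123",  # placeholder group name
        DesiredCapacity=desired,
        HonorCooldown=False,
    )
    return desired
```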
Inspiration
Takeaway
Speed Wins
Assume Broken
Cloud Native Automation
GitHub is your “app store” and resumé
@adrianco @NetflixOSS
http://netflix.github.com

Flowcon keynote talk (extended for CMG) on how Speed Wins and how Netflix does Continuous Delivery