Netflix has over 20 million subscribers in the US and Canada and is expanding internationally. It is moving its operations entirely to the cloud to gain the scalability and flexibility needed to support unpredictable growth. Netflix uses Amazon Web Services extensively to handle its increasing capacity needs, leveraging AWS's large scale and feature set. The cloud allows Netflix to focus on its core business instead of managing infrastructure.
Netflix Cloud Architecture at Qcon Tokyo 2011
1. Netflix Cloud Architecture
Qcon Tokyo, April 12, 2011
Adrian Cockcroft
@adrianco #netflixcloud
http://slideshare.net/adrianco
acockcroft@netflix.com
2. Who, Why, What
Netflix in the Cloud
Cloud Challenges and Learnings
Systems and Operations Architecture
3. Netflix Inc.
With more than 20 million subscribers in the United States and Canada, Netflix, Inc. is the world’s leading Internet subscription service for enjoying movies and TV shows.
International Expansion
We plan to expand into an additional market in the second half of 2011… If the second market meets our expectations… we will continue to invest and expand aggressively in 2012.
Source: http://ir.netflix.com
5. Adrian Cockcroft
• Director, Architecture for Cloud Systems, Netflix Inc.
– Previously Director for Personalization Platform
• Distinguished Availability Engineer, eBay Inc. 2004-7
– Founding member of eBay Research Labs
• Distinguished Engineer, Sun Microsystems Inc. 1988-2004
– 2003-4 Chief Architect High Performance Technical Computing
– 2001 Author: Capacity Planning for Web Services
– 1999 Author: Resource Management
– 1995 & 1998 Author: Sun Performance and Tuning
– 1996 Japanese Edition of Sun Performance and Tuning
• SPARC & Solaris
7. Netflix is Path-finding
The Cloud ecosystem is evolving very fast
Share with and learn from the cloud community
8. We want to use clouds, not build them
Cloud technology should be a commodity
Public cloud and open source for agility and scale
9. Why Use Cloud?
For Better Business Agility
For Unpredictable Business Growth
10. Data Center
Netflix could not build new datacenters fast enough
Capacity growth is accelerating, unpredictable
Product launch spikes - iPhone, Wii, PS3, XBox
11. 20 Million Customers
2010-Q3 year/year +52% Total and +145% Streaming
[Chart: customer count by quarter, 2009Q2 to 2010Q4; vertical axis 0 to 25]
Source: http://ir.netflix.com
12. Out-Growing Data Center
http://techblog.netflix.com/2011/02/redesigning-netflix-api.html
[Chart: 37x Growth Jan 2010-Jan 2011, compared with Datacenter Capacity]
13. Netflix.com is now ~100% Cloud
Account sign-up is currently being moved to cloud
All international product will be cloud based
USA specific logistics remains in the Datacenter
14. Leverage AWS Scale
“the biggest public cloud”
AWS investment in tooling and automation
Use many AWS zones for high availability, scalability
AWS skills are most common on resumes…
15. Leverage AWS Feature Set
“the market leader”
EC2, S3, SDB, SQS, EBS, EMR, ELB, ASG, IAM, RDB, VPC…
http://aws.amazon.com/jp
16. Amazon Cloud Terminology
See http://aws.amazon.com/jp for Japanese
This is not a full list of Amazon Web Service features
• AWS – Amazon Web Services (common name for Amazon cloud)
• AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
• EC2 – Elastic Compute Cloud
– Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
– Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
– Reserved Instances – pre-paid to reduce cost for long term usage
– Availability Zone – datacenter with own power and cooling hosting cloud instances
– Region – group of Availability Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan
• ASG – Auto Scaling Group (instances booting from the same AMI)
• S3 – Simple Storage Service (http access)
• EBS – Elastic Block Storage (network disk filesystem can be mounted on an instance)
• RDB – Relational Data Base (managed MySQL master and slaves)
• SDB – Simple Data Base (hosted http based NoSQL data store)
• SQS – Simple Queue Service (http based message queue)
• SNS – Simple Notification Service (http and email based topics and messages)
• EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
• ELB – Elastic Load Balancer
• EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
• VPC – Virtual Private Cloud (extension of enterprise datacenter network into cloud)
• IAM – Identity and Access Management (fine grain role based security keys)
17. “The cloud lets its users focus on delivering differentiating business value instead of wasting valuable resources on the undifferentiated heavy lifting that makes up most of IT infrastructure.”
Werner Vogels, Amazon CTO
18. We want to use clouds, we don’t have time to build them
Public cloud for agility and scale
AWS because they are big enough to allocate thousands of instances per hour when we need to
19. Netflix EC2 Instances per Account (summer 2010, production is much higher now…)
[Chart: instance counts for Content Encoding, Test and Production, and Log Analysis accounts; vertical scale “Many Thousands”, horizontal scale “Several Months”]
20. Netflix Deployed on AWS
[Diagram: five columns of services]
Content: Video Masters, EC2, S3, CDN
Logs: S3, EMR Hadoop, Hive, Business Intelligence
Play: DRM, CDN routing, Bookmarks, Logging
WWW: Search, Movie Choosing, Ratings, Similars
API: Metadata, Device Config, TV Movie Choosing, Mobile iPhone
21. Cloud Encoding Pipeline
[Diagram: studios’ movie master tapes, network upload to Netflix, master on S3, encode, mezzanine files on S3, encode, 50+ files on S3, copy to CDN origin, CDN, stream to TV]
Licensed content is provided to Netflix as high quality master tapes
Many formats are reduced to a single high quality mezzanine format on S3
Individual formats and speeds are encoded in over 50 combinations
Many formats for older and newer hardware and various game consoles
Many speeds from mobile through standard and high definition
Static files are copied to each Content Delivery Network’s “origin server”
CDNs migrate files to “edge servers” near the end user
Files stream to PC/Mac/iPad or TV over HTTP using “range get” to move chunks
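The “range get” mentioned above refers to HTTP Range requests, where a client asks for a byte span of a static file rather than the whole file. The following is a minimal sketch of how chunk boundaries could be turned into `Range` header values; the function name `byte_ranges` and the chunk size are assumptions made for illustration, not details from the talk.

```python
def byte_ranges(total_size, chunk_size):
    """Yield HTTP Range header values that cover a file of total_size bytes.

    Hypothetical helper: the name and the fixed chunk size are illustrative.
    """
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size must be > 0")
    start = 0
    while start < total_size:
        # The end offset in an HTTP Range header is inclusive.
        end = min(start + chunk_size, total_size) - 1
        yield f"bytes={start}-{end}"
        start = end + 1


# A 2500-byte file fetched in 1000-byte chunks needs three requests.
for header in byte_ranges(2500, 1000):
    print("Range:", header)
```

Because each chunk is an ordinary HTTP request for part of a static file, it can be served by any CDN edge server that supports Range requests.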
23. Product Trade-off
User Experience: Consistent Experience, Low Latency
Implementation: Development complexity, Operational complexity
24. Netflix Cloud Goals
• Faster
– Lower latency than the equivalent datacenter web pages and API calls
– Measured as mean and 99th percentile
– For both first hit (e.g. home page) and in-session hits for the same user
• Scalable
– Avoid needing any more datacenter capacity as subscriber count increases
– No central vertically scaled databases
– Leverage AWS elastic capacity effectively
• Available
– Substantially higher robustness and availability than datacenter services
– Leverage multiple AWS availability zones
– No scheduled down time, no central database schema to change
• Productive
– Optimize agility of a large development team with automation and tools
– Leave behind complex tangled datacenter code base (~8 year old architecture)
– Enforce clean layered interfaces and re-usable components
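The “Faster” goal above is measured as both mean and 99th percentile latency. A small sketch of why both are tracked, assuming a nearest-rank percentile definition and made-up latency samples (neither comes from the talk):

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100.

    Illustrative definition; monitoring tools may interpolate differently.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast calls and 2 slow ones.
latencies_ms = [10] * 98 + [500, 900]
print("mean:", statistics.mean(latencies_ms))
print("p99: ", percentile(latencies_ms, 99))
```

With these invented samples the mean stays close to the typical request while the 99th percentile exposes the slow tail, which is the reason to report the two together.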
25. Old Datacenter vs. New Cloud Arch
Central SQL Database → Distributed Key/Value NoSQL
Sticky In-Memory Session → Shared Memcached Session
Chatty Protocols → Latency Tolerant Protocols
Tangled Service Interfaces → Layered Service Interfaces
Instrumented Code → Instrumented Service Patterns
Fat Complex Objects → Lightweight Serializable Objects
Components as Jar Files → Components as Services
26. Learnings
• Datacenter oriented tools don’t work
– Ephemeral instances
– High rate of change
– Need too much hand-holding and manual setup
• Cloud Tools Don’t Scale for Enterprise
– Too many tools are “Startup” oriented
– Built our own tools for 1000’s of instances
– Drove vendors to be dynamic, scale, add APIs
• Un-modified Datacenter Apps are Fragile
– Too many datacenter oriented assumptions
– We re-wrote our code base!
– (We re-write it continuously anyway)
28. [Architecture diagram. Labels: Netflix Data Center (API, Oracle); Replication; AWS EC2 (Front End Load Balancer, Discovery Service, API Proxy, Load Balancer, API etc., Component Services, SQS, memcached, EBS); AWS Storage (S3, SimpleDB)]
29. Database Migration
• Why SimpleDB?
– No DBA’s in the cloud, Amazon hosted service
– Work started two years ago, fewer viable options
– Worked with Amazon to speed up and scale SimpleDB
• Alternatives?
– Rolling out Cassandra as “upgrade” from SimpleDB
– Need several options to match use cases well
• Detailed NoSQL and SimpleDB Advice
– Sid Anand - QConSF Nov 5th – Netflix’ Transition to High Availability Storage Systems
– Blog - http://practicalcloudcomputing.com/
– Download Paper PDF - http://bit.ly/bhOTLu
30. Cloud Operations
Model Driven Architecture
Capacity Planning & Monitoring
31. Tools and Automation
• Developer and Build Tools
– Jira, Perforce, Eclipse, Jeeves, Ivy, Artifactory
– Builds, creates .war file, .rpm, bakes AMI and launches
• Custom Netflix Application Console
– AWS Features at Enterprise Scale (hide the AWS security keys!)
– Auto Scaler Group is unit of deployment to production
• Open Source + Support
– Apache, Tomcat, Cassandra, Hadoop, OpenJDK, CentOS
• Monitoring Tools
– Keynote – service monitoring and alerting
– AppDynamics – Developer focus for cloud http://appdynamics.com
– EpicNMS – flexible data collection and plots http://epicnms.com
– Nimsoft NMS – ITOps focus for Datacenter + Cloud alerting
32. Model Driven Architecture
• Datacenter Practices
– Lots of unique hand-tweaked systems
– Hard to enforce patterns
• Model Driven Cloud Architecture
– Perforce/Ivy/Jeeves based builds for everything
– Every production instance is a pre-baked AMI
– Every application is managed by an Autoscaler
No exceptions, every change is a new AMI
33. High Availability Zones
• Each zone is a separate datacenter
– Private power, cooling, network connections
– Located close together for low latency
• ASG Instances are distributed over 3 zones
• Data written to one zone appears in all zones
• Netflix can survive total failure of one zone
– Increase capacity of existing zones by 50%
– Small or zero downtime
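The 50% figure above follows from spreading the same load over two zones instead of three: each surviving zone must carry 3/2 of its previous share. A short sketch of that arithmetic, with function names and the instance count chosen purely for illustration:

```python
import math


def required_growth(zones, failed=1):
    """Fractional capacity increase each surviving zone needs after a failure."""
    surviving = zones - failed
    if zones <= 0 or failed < 0 or surviving <= 0:
        raise ValueError("at least one zone must survive")
    return zones / surviving - 1.0


def instances_per_surviving_zone(total_instances, zones, failed=1):
    """Instances each surviving zone must run to keep total capacity constant."""
    surviving = zones - failed
    if zones <= 0 or failed < 0 or surviving <= 0:
        raise ValueError("at least one zone must survive")
    return math.ceil(total_instances / surviving)


# Three zones, one lost: each remaining zone grows by 50%.
print(required_growth(3))                   # 0.5
# Hypothetical group of 30 instances: 10 per zone before, 15 per zone after.
print(instances_per_surviving_zone(30, 3))  # 15
```

The same calculation shows the trade-off in zone count: with two zones the survivor would have to double, while with four zones each survivor grows by about a third.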
34. Region Migration
(Netflix is working to have this in place during 2011, for international roll-out and disaster recovery)
• Data is backed up into a different cloud region
– Cloud bandwidth is much higher than Datacenter
• Restore to a new region
– “A few hours” to load data and create databases
• Create model driven architecture
– “A few hours” to create service instances and test
• Send traffic to new region
– Setup DNS records and start customer service
35. Model Driven Implications
• Automated “Least Privilege” Security
– Tightly specified security groups
– Fine grain IAM keys to access AWS resources
– Performance tools security and integration
• Model Driven Performance Monitoring
– Hundreds of instances appear in a few minutes…
– Tools have to “garbage collect” dead instances
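One way a monitoring tool could “garbage collect” dead instances is to drop any instance that has not reported within a silence window. The sketch below assumes a simple map from instance id to last-report time; the function name, ids, and timeout are hypothetical and do not describe any specific tool named in the talk.

```python
def prune_dead_instances(last_seen, now, max_silence_seconds):
    """Split instances into those still reporting and those considered dead.

    last_seen maps an instance id to the time (in seconds) of its last report.
    Returns (alive, dead_ids), where alive is a new dict and dead_ids is sorted.
    """
    alive = {
        instance_id: seen
        for instance_id, seen in last_seen.items()
        if now - seen <= max_silence_seconds
    }
    dead_ids = sorted(set(last_seen) - set(alive))
    return alive, dead_ids


# Hypothetical ids and timestamps: i-b has been silent for 900 seconds.
reports = {"i-a": 1000, "i-b": 100, "i-c": 990}
alive, dead = prune_dead_instances(reports, now=1000, max_silence_seconds=300)
print(sorted(alive), dead)  # ['i-a', 'i-c'] ['i-b']
```

With ephemeral, autoscaled instances the set of ids changes constantly, so expiry by silence avoids any manual step to unregister a terminated instance.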
39. Capacity Planning in Clouds
(a few things have changed…)
• Capacity is expensive
• Capacity takes time to buy and provision
• Capacity only increases, can’t be shrunk easily
• Capacity comes in big chunks, paid up front
• Planning errors can cause big problems
• Systems are clearly defined assets
• Systems can be instrumented in detail
• Depreciate assets over 3 years (reservations!)
40. Monitoring Issues
• Problem
– Too many tools, each with a good reason to exist
– Hard to get an integrated view of a problem
– Too much manual work building dashboards
– Tools are not discoverable, views are not filtered
• Solution
– Get vendors to add deep linking URLs and APIs
– Integration “portal” ties everything together
– Underlying dependency database
– Dynamic portal generation, relevant data, all tools
41. Data Sources
External Testing
• External URL availability and latency alerts and reports – Keynote
• Stress testing - SOASTA
Request Trace Logging
• Netflix REST calls – Chukwa to DataOven with GUID transaction identifier
• Generic HTTP – AppDynamics service tier aggregation, end to end tracking
Application logging
• Tracers and counters – log4j, tracer central, Chukwa to DataOven
• Trackid and Audit/Debug logging – DataOven, AppDynamics GUID cross reference
JMX Metrics
• Application specific real time – Nimsoft, AppDynamics, Epic
• Service and SLA percentiles – Nimsoft, AppDynamics, Epic, logged to DataOven
Tomcat and Apache logs
• Stdout logs – S3 – DataOven, Nimsoft alerting
• Standard format Access and Error logs – S3 – DataOven, Nimsoft Alerting
JVM
• Garbage Collection – Nimsoft, AppDynamics
• Memory usage, call stacks, resource/call - AppDynamics
Linux
• system CPU/Net/RAM/Disk metrics – AppDynamics, Epic, Nimsoft Alerting
• SNMP metrics – Epic, Network flows - Fastip
AWS
• Load balancer traffic – Amazon Cloudwatch, SimpleDB usage stats
• System configuration - CPU count/speed and RAM size, overall usage - AWS
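The request trace logging entry above tags REST calls with a GUID transaction identifier so that log lines from different services can be cross-referenced. A minimal sketch of that idea follows; the log line format, service names, and helper names are assumptions for illustration and are not the Chukwa or DataOven formats.

```python
import uuid


def new_trace_id():
    """Create a GUID to attach to one incoming request."""
    return str(uuid.uuid4())


def format_log(trace_id, service, message):
    """Build a log line that carries the request's GUID (hypothetical format)."""
    return f"trace={trace_id} service={service} {message}"


def lines_for_trace(lines, trace_id):
    """Select every log line that belongs to one request, across services."""
    token = f"trace={trace_id} "
    return [line for line in lines if line.startswith(token)]


# Two requests pass through hypothetical services; each keeps its own GUID.
first, second = new_trace_id(), new_trace_id()
log = [
    format_log(first, "api", "GET /movies"),
    format_log(second, "api", "GET /ratings"),
    format_log(first, "metadata", "lookup took 12 ms"),
]
for line in lines_for_trace(log, first):
    print(line)
```

Passing the identifier along with each downstream call is what makes the later cross reference possible; a service that generates a fresh GUID instead breaks the chain.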
43. Dashboards Architecture
• Integrated Dashboard View
– Single web page containing content from many tools
– Filtered to highlight most “interesting” data
• Relevance Controller
– Drill in, add and remove content interactively
– Given an application, alert or problem area, dynamically build a dashboard relevant to your role and needs
• Dependency and Incident Model
– Model Driven - Interrogates tools and AWS APIs
– Document store to capture dependency tree and states
45. AppDynamics
How to look deep inside your cloud applications
• Automatic Monitoring
– Base AMI bakes in all monitoring tools
– Outbound calls only – no discovery/polling issues
– Inactive instances removed after a few days
• Incident Alarms (deviation from baseline)
– Business Transaction latency and error rate
– Alarm thresholds discover their own baseline
– Email contains URL to Incident Workbench UI
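An alarm that “discovers its own baseline” can be approximated by comparing each new measurement with the mean and spread of recent history instead of a hand-set threshold. The sketch below uses a mean plus or minus k standard deviations rule; this is a generic technique chosen for illustration and is not a description of how AppDynamics computes its baselines.

```python
import statistics


def is_anomalous(history, value, k=3.0):
    """Flag value if it lies more than k standard deviations from the mean.

    history holds recent measurements (for example latency in ms); k is an
    assumed tuning knob. With fewer than two samples no baseline exists yet.
    """
    if len(history) < 2:
        return False
    baseline = statistics.mean(history)
    spread = statistics.stdev(history)
    return abs(value - baseline) > k * spread


# Hypothetical transaction latencies in milliseconds.
recent = [100, 102, 98, 101, 99]
print(is_anomalous(recent, 103))  # False: within normal variation
print(is_anomalous(recent, 150))  # True: far outside the learned baseline
```

Because the threshold is derived from the data, the same rule can be applied to hundreds of newly launched instances without per-instance configuration.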
47. Point Finger and Assess Impact
(an async S3 write was slow, no big deal)
48. Monitoring Summary
• Broken datacenter oriented tools is a big problem
• Integrating many different tools
– They are not designed to be integrated
– We have “persuaded” vendors to add APIs
• If you can’t see deep inside your app, you’re ☹
50. Implications for IT Operations
• Cloud is run by developer organization
– Our IT department is Amazon Cloud
• Cloud capacity is much bigger than Datacenter
– Datacenter oriented IT staffing is flat
– We have no IT staff working on cloud
– We have moved 3 people out of IT to write code
• Traditional IT Roles are going away
– Don’t need SA, DBA, Storage, Network admins
51. Next Few Years…
• “System of Record” moves to Cloud (now)
– Master copies of data live only in the cloud, with backups
– Cut the datacenter to cloud replication link
• International Expansion – Global Clouds (later in 2011)
– Rapid deployments to new markets
• Cloud Standardization?
– Cloud features and APIs should be a commodity not a differentiator
– Differentiate on scale and quality of service
– Competition also drives cost down
– Higher resilience and scalability
We would prefer to be an insignificant customer in a giant cloud
52. Takeaway
Netflix is path-finding the use of public AWS cloud to replace in-house IT for non-trivial applications with hundreds of developers and thousands of systems.
acockcroft@netflix.com
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud