1. @atseitlin
Netflix Cloud Platform
Netflix's evolution in the cloud
Ariel Tseitlin
http://www.linkedin.com/in/atseitlin
@atseitlin
2. @atseitlin
About Netflix
Netflix is the world’s leading Internet television network with nearly 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series [1]
[1] http://ir.netflix.com/
6. @atseitlin
How Netflix Streaming Works
Customer Device (PC, PS3, TV…)
Web Site or Discovery API
User Data
Personalization
Streaming API
DRM
QoS Logging
OpenConnect CDN Boxes
CDN Management and Steering
Content Encoding
Consumer Electronics
AWS Cloud Services
CDN Edge Locations
Browse
Play
Watch
8. @atseitlin
Web Server Dependencies Flow
Home page business transaction
Start Here
memcached
Cassandra
Web service
S3 bucket
Personalization movie group chooser
Each icon is three to a few hundred instances across three AWS zones
10. @atseitlin
Three Balanced Availability Zones
Test with Chaos Gorilla
Cassandra and EVCache Replicas, Zone A
Cassandra and EVCache Replicas, Zone B
Cassandra and EVCache Replicas, Zone C
Load Balancers
11. @atseitlin
Triple Replicated Persistence
Cassandra maintenance affects individual replicas
Cassandra and EVCache Replicas, Zone A
Cassandra and EVCache Replicas, Zone B
Cassandra and EVCache Replicas, Zone C
Load Balancers
12. @atseitlin
Isolated Regions
Will someday test with Chaos Kong
Cassandra Replicas, Zone A
Cassandra Replicas, Zone B
Cassandra Replicas, Zone C
US-East Load Balancers
Cassandra Replicas, Zone A
Cassandra Replicas, Zone B
Cassandra Replicas, Zone C
EU-West Load Balancers
13. @atseitlin
Failure Modes and Effects
Failure Mode | Probability | Current Mitigation Plan
Application Failure | High | Automatic degraded response
AWS Region Failure | Low | Wait for region to recover
AWS Zone Failure | Medium | Continue to run on 2 out of 3 zones
Datacenter Failure | Medium | Migrate more functions to cloud
Data store failure | Low | Restore from S3 backups
S3 failure | Low | Restore from remote archive
Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…
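The zone-failure mitigation (keep running on 2 out of 3 zones) implies a capacity rule: the surviving zones have to absorb the lost zone's share of traffic. A minimal sketch of that arithmetic, assuming load is spread evenly across zones and capacity is counted in interchangeable instances. The function name and the numbers are illustrative and do not come from the talk.

```python
import math

def instances_per_zone(peak_instances, zones=3, zones_lost=1):
    """Instances to run in each zone so peak load still fits after losing zones.

    Assumes load is balanced evenly and instances are interchangeable.
    """
    surviving = zones - zones_lost
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    # Each surviving zone must carry an equal share of the full peak load.
    return math.ceil(peak_instances / surviving)

per_zone = instances_per_zone(peak_instances=100)
print(per_zone, per_zone * 3)  # 50 per zone, 150 in total
```

Under these assumptions, three zones need 50% headroom over peak to ride out the loss of one zone.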
15. @atseitlin
Run What You Wrote
• Make developers responsible for failures
– Then they learn and write code that doesn’t fail
• Use Incident Reviews to find gaps to fix
– Make sure it’s not about finding “who to blame”
• Keep timeouts short, fail fast
– Don’t let cascading timeouts stack up
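One way to keep timeouts from stacking up across a chain of calls is to give each request a single time budget and derive every downstream timeout from what is left. The sketch below assumes that approach. `Deadline` and its methods are hypothetical helpers, not something named in the talk.

```python
import time

class Deadline:
    """Tracks the time budget left for one request (hypothetical helper)."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires = clock() + budget_s

    def remaining(self):
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires - self._clock())

    def timeout_for_call(self, per_call_cap_s):
        """Timeout for the next downstream call: short cap, bounded by the budget."""
        left = self.remaining()
        if left <= 0.0:
            # Fail fast instead of starting a call that cannot finish in time.
            raise TimeoutError("request budget exhausted")
        return min(per_call_cap_s, left)
```

A caller would create one `Deadline` per incoming request and pass `deadline.timeout_for_call(0.25)` to each downstream client, so the total wait is bounded by the request budget and not by the sum of the individual timeouts.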
16. @atseitlin
Rapid Detection
• If your pilot had no instrument panel, would you ever fly on the plane?
– Never run your service blind
• Monitor services, not instances
– Make instance failure a non-event
• Don’t pay people to watch screens
– Instead pay them to build alerting
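"Monitor services, not instances" can be read as: alert on an aggregate over the whole fleet, so one bad instance does not page anyone unless the service itself degrades. A minimal sketch under that reading. The metric shape, the 5% threshold, and the instance IDs are assumptions made for illustration.

```python
def service_error_rate(instance_stats):
    """Aggregate per-instance (requests, errors) counts into one service-level rate."""
    total_requests = sum(requests for requests, _ in instance_stats.values())
    total_errors = sum(errors for _, errors in instance_stats.values())
    if total_requests == 0:
        return 0.0
    return total_errors / total_requests

def should_alert(instance_stats, threshold=0.05):
    """Alert only when the service as a whole exceeds the error threshold."""
    return service_error_rate(instance_stats) > threshold

stats = {
    "i-aaa": (1000, 2),
    "i-bbb": (1000, 3),
    "i-ccc": (10, 10),  # one failing instance that serves little traffic
}
print(should_alert(stats))  # False
```

The failing instance still shows up in the aggregate, but it triggers an alert only if it moves the service-level rate past the threshold.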
17. @atseitlin
Rapid Rollback
• Use a new Autoscale Group to push code
• Leave existing ASG in place, switch traffic
• If OK, auto-delete old ASG a few hours later
• If “whoops”, switch traffic back in seconds
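The four steps above can be modeled as a small state machine. This is an in-memory sketch only: it makes no AWS calls, and the class, method, and ASG names are hypothetical.

```python
class RedBlackDeployment:
    """In-memory model of switching traffic between two Auto Scaling Groups."""

    def __init__(self, initial_asg):
        self.active = initial_asg  # ASG currently receiving traffic
        self.standby = None        # previous ASG, kept around for rollback

    def push(self, new_asg):
        """Switch traffic to a new ASG while leaving the old one in place."""
        self.standby = self.active
        self.active = new_asg

    def rollback(self):
        """The "whoops" path: point traffic back at the previous ASG."""
        if self.standby is None:
            raise RuntimeError("no previous ASG to roll back to")
        self.active, self.standby = self.standby, self.active

    def retire_old(self):
        """The "OK" path: drop the old ASG once the new one has proven itself."""
        self.standby = None

deploy = RedBlackDeployment("api-v041")
deploy.push("api-v042")
deploy.rollback()
print(deploy.active)  # api-v041
```

Rollback is fast in this model because the old group is never torn down during the push; only the traffic pointer moves.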
21. @atseitlin
Elasticity
• Capacity planning replaced with forecasting
• Dynamic load-based auto-scaling
• New data centers at the click of a button
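Load-based auto-scaling can be sketched as a proportional rule: resize the group so that load per instance returns to a target. The rule, parameter names, and numbers below are illustrative assumptions, not the scaling policy described in the talk.

```python
import math

def desired_instances(current, observed_load, target_load, min_size=1, max_size=100):
    """Proportional rule: size the group so per-instance load returns to target."""
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    wanted = math.ceil(current * observed_load / target_load)
    # Clamp to the group's configured bounds.
    return max(min_size, min(max_size, wanted))

print(desired_instances(current=10, observed_load=90, target_load=60))  # 15
print(desired_instances(current=10, observed_load=20, target_load=60))  # 4
```

Here `observed_load` and `target_load` are per-instance values of the same metric, for example average CPU percent.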
22. @atseitlin
Efficiency
• ~10x trough to peak ratio. Fill trough with batch workloads
• Optimize machine class for each service
• Highly available red/black deployments
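To see how much room a ~10x trough-to-peak ratio leaves for batch work, one can subtract hourly demand from a fixed capacity level. The sketch assumes a pool sized for peak (for example, reserved capacity), which the slide does not state, and the demand curve is invented for illustration.

```python
def idle_capacity(hourly_demand, provisioned):
    """Instances left idle each hour when capacity is held at a fixed level."""
    return [max(0, provisioned - demand) for demand in hourly_demand]

# Hypothetical demand curve with a 10x trough-to-peak ratio.
demand = [100, 100, 400, 1000, 700, 200]
spare = idle_capacity(demand, provisioned=1000)
print(spare)       # [900, 900, 600, 0, 300, 800]
print(sum(spare))  # 3500 instance-hours available for batch workloads
```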
23. @atseitlin
Coming soon to a cloud near you
Billing & Payments, Big Data & Analytics, SaaS
24. @atseitlin
Billing & Payments
• PCI compliance
• Privacy & security
• Intermediate step of cache in the cloud
25. @atseitlin
Big Data & Analytics
• On deck for cloud migration
• ETL already in cloud with EMR (Hadoop)
• Many cloud alternatives but not yet as mature as the old guard
26. @atseitlin
Corporate systems moving to SaaS
• Email (Exchange -> Google Apps)
• Expense Management (Concur -> Workday)
• Document sharing (File Servers -> Box)
• Goal is 100% SaaS
28. @atseitlin
Open Source Projects
Legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon
Priam: Cassandra as a Service
Astyanax: Cassandra client for Java
CassJMeter: Cassandra test suite
Cassandra: Multi-region EC2 datastore support
Aegisthus: Hadoop ETL for Cassandra
Ice: Spend analytics
Governator: Library lifecycle and dependency injection
Odin: Cloud orchestration
Blitz4j: Async logging
Exhibitor: Zookeeper as a Service
Curator: Zookeeper Patterns
EVCache: Memcached as a Service
Eureka / Discovery: Service Directory
Archaius: Dynamic Properties Service
Edda: Config state with history
Denominator
Ribbon: REST Client + mid-tier LB
Karyon: Instrumented REST Base Serve
Servo and Autoscaling Scripts
Genie: Hadoop PaaS
Hystrix: Robust service pattern
RxJava: Reactive Patterns
Asgard: AutoScaleGroup based AWS console
Chaos Monkey: Robustness verification
Latency Monkey
Janitor Monkey
Bakeries / Aminator
30. @atseitlin
Our Current Catalog of Releases
Free code available at http://netflix.github.com
31. @atseitlin
We’re hiring!
• Simian Army
• Cloud Tools
• NetflixOSS
• Cloud Operations
• Reliability Engineering
• Many, many more
jobs.netflix.com
32. @atseitlin
Takeaways
Netflix has built and deployed a scalable, global, and highly available Platform as a Service and open sourced it (NetflixOSS)
The Cloud enables elasticity, efficiency and fine-grained control via APIs
Credit cards, Big Data, and the rest of the corporate systems are next to move to the Cloud
http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/atseitlin
@atseitlin
@NetflixOSS
33. @atseitlin
Thank you!
Any questions?
Ariel Tseitlin
http://www.linkedin.com/in/atseitlin
@atseitlin