Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery
Nov. 10, 2013
The Flowcon keynote was given a few days before CMG, with a few tweaks and some extra content added at the start and end. Opening keynote talk for both conferences on how Speed Wins and how Netflix is doing Continuous Delivery.
Now Playing on Netflix:
Adventures in a Cloudy Future
CMG November 2013
Adrian Cockcroft
@adrianco @NetflixOSS
http://www.linkedin.com/in/adriancockcroft
Netflix Member Web Site Home Page
Personalization Driven – How Does It Work?
How Netflix Used to Work
[Architecture diagram: consumer electronics / customer devices (PC, PS3, TV…), a monolithic web app and a monolithic streaming app in the datacenter backed by Oracle and MySQL, content management and content encoding, AWS cloud services, and Limelight / Level 3 / Akamai CDN edge locations.]
How Netflix Streaming Works Today
[Architecture diagram: consumer electronics / customer devices (PC, PS3, TV…), web site or discovery API and streaming API running as AWS cloud services alongside user data, personalization, DRM and QoS logging; CDN management and steering plus content encoding in the datacenter; OpenConnect CDN boxes at the CDN edge locations.]
Netflix Scale
• Tens of thousands of instances on AWS
– Typically 4 core, 30GByte, Java business logic
– Thousands created/removed every day
• Thousands of Cassandra NoSQL storage nodes
– Mostly 8 core, 60GByte, 2TByte of SSD
– 65 different clusters, over 300TB of data, triple zone
– Over 40 are multi-region clusters (6, 9 or 12 zones)
– Biggest cluster: 288 nodes, 300K reads/sec, 1.3M writes/sec
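The triple-zone and multi-region layouts above map directly onto Cassandra's NetworkTopologyStrategy. A minimal sketch, assuming the DataStax Python driver and hypothetical keyspace, seed and data center names (the EC2 multi-region snitch reports data centers by region), of how a nine-zone cluster (three regions, three replicas each) might be declared:

```python
# Sketch only: hypothetical keyspace, seed host and data center names,
# assuming the DataStax Python driver and the EC2 multi-region snitch.
from cassandra.cluster import Cluster

session = Cluster(["cass-seed.example.internal"]).connect()

# Three replicas per region (one per availability zone) gives the
# "triple zone" layout; three regions gives a 9-zone multi-region cluster.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS member_data
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us-east': 3,
        'us-west': 3,
        'eu-west': 3
    }
""")
```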
Reactions over time
2009 “You guys are crazy! Can’t believe it”
2010 “What Netflix is doing won’t work”
2011 “It only works for ‘Unicorns’ like Netflix”
2012 “We’d like to do that but can’t”
2013 “We’re on our way using Netflix OSS code”
"This is the IT swamp draining manual for anyone who is neck deep in alligators." Adrian Cockcroft, Cloud Architect at Netflix
Incidents – Impact and Mitigation
[Incident impact pyramid]
• X incidents: public relations / media impact (PR); Y incidents mitigated by active-active operation and game day practicing
• XX incidents: high customer service calls (CS); YY incidents mitigated by better tools and practices
• XXX incidents: affect A/B test results, metrics impact – feature disable; YYY incidents mitigated by better data tagging
• XXXX incidents: no impact – fast retry or automated failover
80’s Mainframe Innovation Cycle
• Cost $1M to $100M
• Duration 1 to 5 years
• Bet the whole company
• Cost of failure – bankrupt or bought
• Cobol and DB2 on MVS
90’s Client Server Innovation Cycle
• Cost $100K to $10M
• Duration 3 – 12 months
• Bet a product line or division
• Cost of failure – revenue hit, CIO’s job
• C++ and Oracle on Solaris
00’s Commodity Agile Innovation Cycle
• Cost $10K to $1M
• Duration 2 – 12 weeks
• Bet a product feature
• Cost of failure – product mgr reputation
• Java and MySQL on RedHat Linux
Train Model Process Hand-Off Steps
Product Manager → Developer → QA Integration Team → Operations Deploy Team → BI Analytics Team
Cloud Native
Construct a highly agile and highly available service from ephemeral and assumed broken components
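One way to make "ephemeral and assumed broken" concrete: wrap every call to a dependency so that a failure turns into a fast retry or a degraded fallback instead of an outage. A minimal sketch in plain Python with hypothetical function names (the NetflixOSS version of this pattern is the Hystrix circuit breaker, not shown here):

```python
# Sketch of the "assumed broken" pattern: hypothetical names, not a real
# Netflix API. Retry once quickly, then fall back to a degraded response.
import random
import time

def flaky_personalization():
    """Stand-in for a remote personalization call that sometimes fails."""
    if random.random() < 0.5:
        raise ConnectionError("dependency timed out")
    return ["Top Picks for You", "Trending Now"]

def call_with_fallback(remote_call, fallback, retries=1, delay=0.05):
    for attempt in range(retries + 1):
        try:
            return remote_call()
        except Exception:
            if attempt < retries:
                time.sleep(delay)  # fast retry, ideally hitting another instance
    return fallback()              # degrade gracefully instead of failing the page

# Recommendations degrade to a static row instead of breaking the home page.
rows = call_with_fallback(flaky_personalization, lambda: ["Popular on Netflix"])
print(rows)
```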
Real Web Server Dependencies Flow
(Netflix Home page business transaction as seen by AppDynamics)
[Dependency flow diagram: each icon is three to a few hundred instances across three AWS zones; components include Cassandra, memcached, web services, S3 buckets and the personalization movie group choosers (for US, Canada and Latam); “Start Here” marks the entry point.]
Continuous Innovation Cycle
• Cost near zero, variable expense
• Duration hours to days
• Bet a decoupled microservice code push
• Cost of failure – near zero, instant rollback
• Clojure/Scala/Python on NoSQL on Cloud
Continuous Deploy Hand-Off Steps
• Product Manager – A/B test setup and enable; self service hypothesis test results
• Developer – automated test; self service deploy, on call; self service analytics
Continuous Deploy Automation
• Check in code, Jenkins build
• Bake AMI, launch in test env
• Functional and performance test
• Production canary test
• Production red/black push
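A sketch of those stages as a single script, with every helper stubbed out; all names here are hypothetical stand-ins (the real chain used Jenkins plus NetflixOSS tools such as Aminator for AMI baking and Asgard for the red/black push):

```python
# Sketch only: stubbed stages of the continuous deploy pipeline above.
# Every function is a hypothetical stand-in that just logs what it would do.

def jenkins_build(commit):
    print(f"build {commit}: compile, unit test, package")
    return f"build-{commit}"

def bake_ami(build):
    print(f"bake immutable AMI from {build}")
    return f"ami-{build}"

def deploy(ami, env, instances=1):
    print(f"launch {instances}x {ami} in {env}")
    return {"ami": ami, "env": env, "instances": instances}

def tests_pass(group):
    print(f"functional and performance tests against {group['env']}")
    return True

def canary_healthy(group):
    print("compare canary error rate and latency with current production code")
    return True

def continuous_deploy(commit, prod_capacity=50):
    ami = bake_ami(jenkins_build(commit))
    test_group = deploy(ami, env="test")
    assert tests_pass(test_group), "test env failed - stop the push"

    canary = deploy(ami, env="prod", instances=1)   # one new instance beside old code
    assert canary_healthy(canary), "canary failed - stop the push"

    deploy(ami, env="prod", instances=prod_capacity)
    print("red/black: shift traffic to new group, keep old group for instant rollback")

continuous_deploy("abc123")
```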
Global Deploy Automation
• Afternoon in California, night-time in Europe: if it passes the test suite, canary then deploy
• Canary then deploy next day on the West Coast
• Canary then deploy next day on the East Coast, after peak in Europe
[Diagram: West Coast, East Coast and Europe load balancers, each fronting Zones A, B and C, with Cassandra replicas in every zone of every region.]
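A sketch of that follow-the-sun schedule as code, with hypothetical helpers and a region order assumed from the slide (Europe pushed during the California-afternoon window, then West Coast, then East Coast):

```python
# Sketch: staged, follow-the-sun regional rollout with a canary gate per
# region. All helper functions are hypothetical stand-ins that just log.
import time

ROLLOUT = [
    # (region, low-risk window described on the slide) - order is an assumption
    ("eu-west-1", "California afternoon / Europe night-time"),
    ("us-west-2", "next day, West Coast"),
    ("us-east-1", "next day, East Coast, after peak in Europe"),
]

def canary_ok(region, ami):
    print(f"[{region}] canary {ami}: compare errors and latency with current code")
    return True

def red_black_push(region, ami):
    print(f"[{region}] red/black push of {ami}, old group kept for instant rollback")

def global_deploy(ami, wait_between_regions=0):
    for region, window in ROLLOUT:
        print(f"deploy window: {window}")
        if not canary_ok(region, ami):
            print(f"[{region}] canary failed - halt the global rollout")
            return
        red_black_push(region, ami)
        time.sleep(wait_between_regions)  # roughly a day between regions in practice

global_deploy("ami-12345678")
```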
Ephemeral Instances
• Largest services are autoscaled
• Average lifetime of an instance is 36 hours
[Chart: daily traffic curve showing autoscale up, autoscale down, and a code push.]
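For illustration, a minimal sketch of demand-driven scaling policies on an Auto Scaling group, written with boto3 and hypothetical group and policy names (in 2013 Netflix drove AWS Auto Scaling through Asgard rather than scripts like this):

```python
# Sketch only: hypothetical group and policy names, boto3 used for illustration.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Scale up: add 10% capacity when load rises; a CloudWatch alarm on a metric
# such as request rate (not shown) would trigger this policy.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="api-prod-v123",        # hypothetical group name
    PolicyName="scale-up-10pct",
    AdjustmentType="PercentChangeInCapacity",
    ScalingAdjustment=10,
    Cooldown=300,
)

# Scale down more cautiously so brief dips don't shrink the fleet.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="api-prod-v123",
    PolicyName="scale-down-5pct",
    AdjustmentType="PercentChangeInCapacity",
    ScalingAdjustment=-5,
    Cooldown=600,
)
```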
(New Today!) Predictive Autoscaling
[Chart: 24 hours of predicted traffic vs. actual – more morning load, Sat/Sun high traffic, lower load on Weds.]
Prediction driving the AWS Autoscaler to plan capacity
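One way a traffic forecast can drive the AWS Autoscaler is to turn each predicted point into a scheduled capacity change, so instances are launched ahead of the morning ramp rather than in reaction to it. A sketch with boto3, made-up forecast numbers and a hypothetical group name (Netflix's own predictive engine, announced around this time as Scryer, sat in front of the AWS APIs):

```python
# Sketch only: hypothetical forecast values and group name. Each predicted
# traffic point becomes a scheduled capacity change on the Auto Scaling group.
from datetime import datetime, timedelta, timezone
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

forecast = [4000, 9000, 15000, 14000]   # predicted requests/sec, next few hours
RPS_PER_INSTANCE = 500                  # assumed per-instance headroom
now = datetime.now(timezone.utc)

for hour, rps in enumerate(forecast, start=1):
    desired = max(10, rps // RPS_PER_INSTANCE)
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="api-prod-v123",            # hypothetical
        ScheduledActionName=f"predicted-capacity-h{hour}",
        StartTime=now + timedelta(hours=hour),
        DesiredCapacity=desired,
    )
```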