Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Start with the need to make the business more agile and responsive to opportunities and competitive threats: a large part of the time taken is the development and delivery of web services. Cloud Native architecture delivers speed, scalability and security through automation of continuously delivered, single-function micro-services with a denormalized NoSQL back end. In the case of Netflix, the streaming service is deployed globally using Cassandra to provide cross-zone and cross-region replication. NetflixOSS is a set of open source components that anyone can use to help them adopt a Cloud Native architecture, and there is even a prize for the best open source contributions to NetflixOSS at http://netflix.github.com

1. Cloud Native at Netflix: What Changed? July 2013. Adrian Cockcroft, @adrianco #netflixcloud @NetflixOSS, http://www.linkedin.com/in/adriancockcroft
2. Cloud Native, Netflix Architecture, NetflixOSS
3. Cloud Native: What is it? Why?
4. Engineers: solve hard problems, build amazing and complex things, fix things when they break
5. Strive for perfection: perfect code, perfect hardware, perfectly operated
6. But perfection takes too long… Compromises: time to market vs. quality. Utopia remains out of reach.
7. Where time to market wins big: making a land-grab, disrupting competitors (OODA), anything delivered as web services
8. The OODA loop (Colonel Boyd, USAF): Observe, Orient, Decide, Act. “Get inside your adversaries' OODA loop to disorient them.” Diagram labels: land grab opportunity, competitive move, customer pain point, analysis, get buy-in, plan response, commit resources, implement, deliver, engage customers, research alternatives, measure customers.
9. How Soon? Code features in days instead of months. Get hardware in minutes instead of weeks. Incident response in seconds instead of hours.
10. A new engineering challenge: construct a highly agile and highly available service from ephemeral and assumed broken components.
11. Inspiration
12. How to get to Cloud Native: Freedom and Responsibility for Developers, Decentralize and Automate Ops Activities, Integrate DevOps into the Business Organization.
13. Four Transitions
   • Management: Integrated Roles in a Single Organization – Business, Development, Operations -> BusDevOps
   • Developers: Denormalized Data – NoSQL – decentralized, scalable, available, polyglot
   • Responsibility from Ops to Dev: Continuous Delivery – decentralized small daily production updates
   • Responsibility from Ops to Dev: Agile Infrastructure (Cloud) – hardware in minutes, provisioned directly by developers
14. Netflix BusDevOps Organization (org chart)
   • Chief Product Officer
     – VP Product Management: Directors, Product
     – VP UI Engineering: Directors, Development, Developers + DevOps, UI Data Sources, AWS
     – VP Discovery Engineering: Directors, Development, Developers + DevOps, Discovery Data Sources, AWS
     – VP Platform: Directors, Platform, Developers + DevOps, Platform Data Sources, AWS
   • Code: independently updated, continuous delivery
   • Data: denormalized, independently updated and scaled
   • Infrastructure: cloud, self-service updated and scaled
15. Decentralized Deployment
16. Asgard Developer Portal: http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
17. Ephemeral Instances
   • Largest services are autoscaled
   • Average lifetime of an instance is 36 hours
   • Diagram: push, autoscale up, autoscale down
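As a rough illustration of the mechanics behind that churn, the sketch below uses the AWS SDK for Java to attach scale-up and scale-down policies to an Auto Scaling group. The group name, adjustment sizes and cooldown are invented for the example; Netflix drives this through Asgard and alarm-triggered policies rather than hand-written snippets like this.

import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyRequest;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyResult;

// Minimal sketch: attach scale-up and scale-down policies to an existing
// Auto Scaling group. Group name and adjustment sizes are hypothetical.
public class AutoscalePolicies {
    public static void main(String[] args) {
        AmazonAutoScalingClient autoscaling = new AmazonAutoScalingClient();

        // Add two instances when the scale-up policy fires
        // (e.g. triggered by a CloudWatch alarm on request rate or CPU).
        PutScalingPolicyResult up = autoscaling.putScalingPolicy(
            new PutScalingPolicyRequest()
                .withAutoScalingGroupName("api-prod-v123")   // hypothetical ASG name
                .withPolicyName("scale-up")
                .withAdjustmentType("ChangeInCapacity")
                .withScalingAdjustment(2)
                .withCooldown(300));

        // Remove one instance at a time when load falls off.
        autoscaling.putScalingPolicy(
            new PutScalingPolicyRequest()
                .withAutoScalingGroupName("api-prod-v123")
                .withPolicyName("scale-down")
                .withAdjustmentType("ChangeInCapacity")
                .withScalingAdjustment(-1)
                .withCooldown(300));

        System.out.println("Scale-up policy ARN: " + up.getPolicyARN());
    }
}

Wiring each policy to a metric alarm lets groups grow and shrink with traffic all day, which is consistent with an average instance lifetime in the tens of hours.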
18. Netflix Streaming: a Cloud Native application based on an open source platform
19. Netflix Member Web Site Home Page: personalization driven – how does it work?
20. How Netflix Streaming Works (architecture diagram)
   • Customer device (PC, PS3, TV…), i.e. consumer electronics
   • AWS cloud services: Web Site or Discovery API, User Data, Personalization, Streaming API, DRM, QoS Logging, CDN Management and Steering, Content Encoding
   • CDN edge locations: OpenConnect CDN boxes
21. Streaming bandwidth chart, Nov 2012 to March 2013: mean bandwidth +39% in six months.
22. Real Web Server Dependencies Flow (Netflix home page business transaction as seen by AppDynamics). Start here: memcached, Cassandra, web service, S3 bucket, personalization movie group choosers (for US, Canada and Latam). Each icon is three to a few hundred instances across three AWS zones.
23. Three Balanced Availability Zones – test with Chaos Gorilla. Cassandra and Evcache replicas in Zone A, Zone B and Zone C, behind load balancers.
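Chaos Gorilla (part of the Netflix Simian Army) simulates the loss of an entire availability zone. As a loose sketch of what such a drill amounts to, the code below uses the AWS SDK for Java to find and terminate every running instance in one zone; the zone name is a placeholder, pagination is skipped, and the real tool adds opt-in controls and safeguards omitted here.

import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.DescribeInstancesRequest;
import com.amazonaws.services.ec2.model.DescribeInstancesResult;
import com.amazonaws.services.ec2.model.Filter;
import com.amazonaws.services.ec2.model.Instance;
import com.amazonaws.services.ec2.model.Reservation;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;
import java.util.ArrayList;
import java.util.List;

// Crude zone-outage drill: terminate everything running in one AZ so the
// remaining zones must absorb the traffic. Illustrative only.
public class ZoneOutageDrill {
    public static void main(String[] args) {
        AmazonEC2Client ec2 = new AmazonEC2Client();

        // Find running instances in the target zone (placeholder name);
        // result pagination is omitted for brevity.
        DescribeInstancesResult result = ec2.describeInstances(
            new DescribeInstancesRequest().withFilters(
                new Filter("availability-zone").withValues("us-east-1a"),
                new Filter("instance-state-name").withValues("running")));

        List<String> ids = new ArrayList<String>();
        for (Reservation r : result.getReservations()) {
            for (Instance i : r.getInstances()) {
                ids.add(i.getInstanceId());
            }
        }

        // Terminate them all; a balanced deployment should survive this.
        if (!ids.isEmpty()) {
            ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(ids));
        }
        System.out.println("Terminated " + ids.size() + " instances in us-east-1a");
    }
}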
24. Isolated Regions: US-East load balancers over Cassandra replicas in Zones A, B and C; EU-West load balancers over Cassandra replicas in Zones A, B and C.
25. Cross Region Use Cases
   • Geographic isolation – US to Europe replication of subscriber data; read intensive, low update rate; production use since late 2011
   • Redundancy for regional failover – US East to US West replication of everything; includes write intensive data, high update rate; testing now
26. Benchmarking Global Cassandra – write intensive test of cross region replication capacity
   • 16 x hi1.4xlarge SSD nodes per zone = 96 total; 192 TB of SSD in six locations, up and running Cassandra in 20 min
   • Test load on US-East-1 (Virginia) and US-West-2 (Oregon), plus a validation load; inter-zone and inter-region traffic measured
   • 1 million writes at CL.ONE (wait for one replica to ack); 1 million reads after 500 ms at CL.ONE with no data loss
   • Inter-region traffic up to 9 Gbit/s at 83 ms; 18 TB backups from S3
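Netflix ran this benchmark with its own Astyanax client; purely to illustrate the consistency setting involved, the sketch below uses the more widely known DataStax Java driver to create a keyspace replicated across two regions and issue a write at CL.ONE. The keyspace, table, data-center names and replication factors are invented for the example.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

// Sketch of a cross-region keyspace and a CL.ONE write, in the spirit of the
// benchmark above. Names and replication factors are hypothetical.
public class GlobalCassandraSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
            .addContactPoint("127.0.0.1")   // placeholder seed node
            .build();
        Session session = cluster.connect();

        // Three replicas per region; a local write completes at CL.ONE and
        // replicates to the other region asynchronously.
        session.execute("CREATE KEYSPACE IF NOT EXISTS subscribers WITH replication = "
            + "{'class': 'NetworkTopologyStrategy', 'us-east': 3, 'us-west-2': 3}");
        session.execute("CREATE TABLE IF NOT EXISTS subscribers.profile "
            + "(id text PRIMARY KEY, plan text)");

        // Wait for a single replica to acknowledge, as in the 1M-write test.
        SimpleStatement write = new SimpleStatement(
            "INSERT INTO subscribers.profile (id, plan) VALUES ('user-123', 'streaming')");
        write.setConsistencyLevel(ConsistencyLevel.ONE);
        session.execute(write);

        cluster.close();
    }
}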
27. Managing Multi-Region Availability
   • Regional load balancers in front of Cassandra replicas in Zones A, B and C in each region
   • UltraDNS, DynECT DNS and AWS Route53 direct traffic between regions
   • Denominator – manage traffic via multiple DNS providers with Java code
   • 2013 timeline: concept Jan, code Feb, OSS March, production use May
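To make the "one Java API over several DNS vendors" idea concrete, here is a deliberately hypothetical sketch of my own, not Denominator's actual API: failover logic is written once against a small abstraction, and UltraDNS, DynECT and Route53 would each sit behind an adapter implementing it.

import java.util.Arrays;
import java.util.List;

// Hypothetical abstraction over several DNS vendors; Denominator's real API differs.
interface DnsProvider {
    String name();
    // Point a DNS name at a regional endpoint (e.g. an ELB CNAME).
    void upsertCname(String zone, String name, String target, int ttlSeconds);
}

public class RegionalFailover {
    private final List<DnsProvider> providers;

    public RegionalFailover(DnsProvider... providers) {
        this.providers = Arrays.asList(providers);
    }

    // Shift traffic for a service name to the surviving region by updating
    // every provider, so no single DNS vendor is a point of failure.
    public void failoverTo(String regionElb) {
        for (DnsProvider p : providers) {
            p.upsertCname("example.com.", "api.example.com.", regionElb, 60);
            System.out.println("Updated " + p.name() + " -> " + regionElb);
        }
    }
}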
28. Incidents – Impact and Mitigation
   • PR: X incidents – public relations / media impact
   • CS: XX incidents – high customer service calls
   • Metrics impact, feature disable: XXX incidents – affects A/B test results
   • No impact, fast retry or automated failover: XXXX incidents
   • Y incidents mitigated by Active-Active and game day practicing
   • YY incidents mitigated by better tools and practices
   • YYY incidents mitigated by better data tagging
29. Cloud Security: automated attack surface monitoring, crypto key store management (CloudHSM), scale to resist DDOS attacks. http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned
30. What Changed? “Impossible” deployments are easy. Jointly building code with vendors in public. Highly available and secure despite scale and speed.
31. The DIY Question: why doesn't Netflix build and run its own cloud?
32. Fitting Into Public Scale (diagram): a spectrum from Public through a Grey Area to Private, spanning roughly 1,000 to 100,000 instances, with Startups, Netflix and Facebook placed along it.
33. How big is Public? AWS upper bound estimate based on the number of public IP addresses. Every provisioned instance gets a public IP by default (some VPC instances don't). AWS maximum possible instance count: 4.2 million (May 2013). Growth >10x in three years, >2x per annum. http://bit.ly/awsiprange
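The upper bound is just the sum of the sizes of the published EC2 public IP ranges; the sketch below shows that arithmetic over two placeholder CIDR blocks (the real list behind the 4.2 million figure is the one linked above).

import java.util.Arrays;
import java.util.List;

// Upper-bound instance estimate: every public EC2 range of prefix length p
// contains 2^(32 - p) addresses, and each address could be one instance.
// The CIDR blocks below are placeholders, not the real EC2 ranges.
public class PublicIpEstimate {
    public static void main(String[] args) {
        List<String> ec2Ranges = Arrays.asList("54.224.0.0/12", "23.20.0.0/14");

        long total = 0;
        for (String cidr : ec2Ranges) {
            int prefix = Integer.parseInt(cidr.split("/")[1]);
            total += 1L << (32 - prefix);   // addresses in this block
        }
        // For the /12 + /14 placeholders: 1,048,576 + 262,144 = 1,310,720
        System.out.println("Upper bound on instances: " + total);
    }
}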
34. A Cloud Native Open Source Platform – see netflix.github.com
35. Goals: establish our solutions as best practices / standards; hire, retain and engage top engineers; build up the Netflix technology brand; benefit from a shared ecosystem.
36. Example Application – RSS Reader. Zuul traffic processing and routing.
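Zuul sits at the edge in front of applications like the RSS reader example. A minimal Zuul 1.x filter might look like the sketch below; the header name and filter order are made up, and Netflix's production filters are typically written in Groovy and loaded dynamically rather than compiled in like this.

import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;

// Minimal Zuul 1.x "pre" filter: tag every request before it is routed to an
// origin such as the RSS reader middle tier. Header name is hypothetical.
public class EdgeTaggingFilter extends ZuulFilter {

    @Override
    public String filterType() {
        return "pre";           // run before routing to the origin
    }

    @Override
    public int filterOrder() {
        return 10;              // relative order among "pre" filters
    }

    @Override
    public boolean shouldFilter() {
        return true;            // apply to every request
    }

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        ctx.addZuulRequestHeader("x-edge-example", "rss-reader-demo");
        return null;            // return value is ignored by Zuul 1.x
    }
}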
37. Ice – Detailed AWS “Chargeback”: http://techblog.netflix.com/2013/06/announcing-ice-cloud-spend-and-usage.html
38. Boosting the @NetflixOSS Ecosystem – see netflix.github.com
39. What's Coming Next? More use cases, more features, better portability, higher availability, easier to deploy, contributions from end users, contributions from vendors.
40. Vendor Driven Portability
   • Interest in using NetflixOSS for Enterprise Private Clouds: “It's done when it runs Asgard”
   • Functionally complete, demonstrated March, released June in V3.3
   • Offering a $10K prize for integration work; vendor and end user interest
   • OpenStack “Heat” getting there; Paypal C3 Console based on Asgard
41. Functionality and scale now, portability coming. Moving from parts to a platform in 2013. Netflix is fostering a cloud native ecosystem. Rapid evolution – low MTBIAMSH (Mean Time Between Idea And Making Stuff Happen).
42. Slideshare.net/Netflix Details
   • Meetup S1E3 July – featuring contributors Eucalyptus, IBM, Paypal, Riot Games: http://techblog.netflix.com/2013/07/netflixoss-meetup-series-1-episode-3.html
   • Lightning Talks March S1E2: http://www.slideshare.net/RuslanMeshenberg/netflixoss-meetup-lightning-talks-and-roadmap
   • Lightning Talks Feb S1E1: http://www.slideshare.net/RuslanMeshenberg/netflixoss-open-house-lightning-talks
   • Asgard In Depth Feb S1E1: http://www.slideshare.net/joesondow/asgard-overview-from-netflix-oss-open-house
   • Security Architecture: http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned/
   • Cost Aware Cloud Architectures (with Jinesh Varia of AWS): http://www.slideshare.net/AmazonWebServices/building-costaware-architectures-jinesh-varia-aws-and-adrian-cockroft-netflix
43. What Changed? Speed wins, and Cloud Native helps you get there. NetflixOSS makes it easier for everyone to become Cloud Native. @adrianco #netflixcloud @NetflixOSS