Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013
Upcoming SlideShare
Loading in...5
×

Like this? Share it with your network

Share

Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

  • 7,096 views
Uploaded on

If we start with the need to make the business more agile and responsive to opportunities and competitive threats, a big component of the time taken is in the development and delivery of web......

If we start with the need to make the business more agile and responsive to opportunities and competitive threats, a big component of the time taken is in the development and delivery of web services. Cloud Native architecture delivers speed, scalability and security through automation of continuously delivered single function micro-services with a denormalized NoSQL back end. In the case of Netflix, the streaming service is deployed globally using Cassandra to provide cross zone and cross regional replication. NetflixOSS is a set of open source components that anyone can use to help them adopt Cloud Native architectures, and there is even a prize for the best open source contributions to NetflixOSS at http://netflix.github.com

More in: Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
7,096
On Slideshare
5,878
From Embeds
1,218
Number of Embeds
7

Actions

Shares
Downloads
77
Comments
0
Likes
21

Embeds 1,218

https://twitter.com 933
http://www.scoop.it 246
http://tspace.web.att.com 16
http://www.linkedin.com 12
https://www.linkedin.com 9
http://moderation.local 1
http://mayurshintre.github.io 1

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Cloud Native at Netflix What Changed? July 2013 Adrian Cockcroft @adrianco #netflixcloud @NetflixOSS http://www.linkedin.com/in/adriancockcroft
  • 2. Cloud Native Netflix Architecture NetflixOSS
  • 3. Cloud Native What is it? Why?
  • 4. Engineers Solve hard problems Build amazing and complex things Fix things when they break
  • 5. Strive for perfection Perfect code Perfect hardware Perfectly operated
  • 6. But perfection takes too long… Compromises… Time to market vs. Quality Utopia remains out of reach
  • 7. Where time to market wins big Making a land-grab Disrupting competitors (OODA) Anything delivered as web services
  • 8. Observe Orient Decide Act Land grab opportunity Competitive move Customer Pain Point Analysis Get buy-in Plan response Commit resources Implement Deliver Engage customers Research alternatives Measure customers Colonel Boyd, USAF “Get inside your adversaries' OODA loop to disorient them”
  • 9. How Soon? Code features in days instead of months Get hardware in minutes instead of weeks Incident response in seconds instead of hours
  • 10. A new engineering challenge Construct a highly agile and highly available service from ephemeral and assumed broken components
  • 11. Inspiration
  • 12. How to get to Cloud Native Freedom and Responsibility for Developers Decentralize and Automate Ops Activities Integrate DevOps into the Business Organization
  • 13. Four Transitions • Management: Integrated Roles in a Single Organization – Business, Development, Operations -> BusDevOps • Developers: Denormalized Data – NoSQL – Decentralized, scalable, available, polyglot • Responsibility from Ops to Dev: Continuous Delivery – Decentralized small daily production updates • Responsibility from Ops to Dev: Agile Infrastructure - Cloud – Hardware in minutes, provisioned directly by developers
  • 14. Netflix BusDevOps Organization Chief Product Officer VP Product Management Directors Product VP UI Engineering Directors Development Developers + DevOps UI Data Sources AWS VP Discovery Engineering Directors Development Developers + DevOps Discovery Data Sources AWS VP Platform Directors Platform Developers + DevOps Platform Data Sources AWS Denormalized, independently updated and scaled data Cloud, self service updated & scaled infrastructure Code, independently updated continuous delivery
  • 15. Decentralized Deployment
  • 16. Asgard Developer Portal http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
  • 17. Ephemeral Instances • Largest services are autoscaled • Average lifetime of an instance is 36 hours P u s h Autoscale Up Autoscale Down
  • 18. Netflix Streaming A Cloud Native Application based on an open source platform
  • 19. Netflix Member Web Site Home Page Personalization Driven – How Does It Work?
  • 20. How Netflix Streaming Works Customer Device (PC, PS3, TV…) Web Site or Discovery API User Data Personalization Streaming API DRM QoS Logging OpenConnect CDN Boxes CDN Management and Steering Content Encoding Consumer Electronics AWS Cloud Services CDN Edge Locations
  • 21. Nov 2012 Streaming Bandwidth March 2013 Mean Bandwidth +39% 6mo
  • 22. Real Web Server Dependencies Flow (Netflix Home page business transaction as seen by AppDynamics) Start Here memcached Cassandra Web service S3 bucket Personalization movie group choosers (for US, Canada and Latam) Each icon is three to a few hundred instances across three AWS zones
  • 23. Three Balanced Availability Zones Test with Chaos Gorilla Cassandra and Evcache Replicas Zone A Cassandra and Evcache Replicas Zone B Cassandra and Evcache Replicas Zone C Load Balancers
  • 24. Isolated Regions Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C US-East Load Balancers Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C EU-West Load Balancers
  • 25. Cross Region Use Cases • Geographic Isolation – US to Europe replication of subscriber data – Read intensive, low update rate – Production use since late 2011 • Redundancy for regional failover – US East to US West replication of everything – Includes write intensive data, high update rate – Testing now
  • 26. Benchmarking Global Cassandra Write intensive test of cross region replication capacity 16 x hi1.4xlarge SSD nodes per zone = 96 total 192 TB of SSD in six locations up and running Cassandra in 20 min Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C US-West-2 Region - Oregon Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C US-East-1 Region - Virginia Test Load Test Load Validation Load Inter-Zone Traffic 1 Million writes CL.ONE (wait for one replica to ack) 1 Million reads After 500ms CL.ONE with no Data loss Inter-Region Traffic Up to 9Gbits/s, 83ms 18TB backups from S3
  • 27. Managing Multi-Region Availability Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C Regional Load Balancers Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C Regional Load Balancers UltraDNS DynECT DNS AWS Route53 Denominator – manage traffic via multiple DNS providers with Java code 2013 Timeline - Concept Jan, Code Feb, OSS March, Production use May Denominator
  • 28. Incidents – Impact and Mitigation PR X Incidents CS XX Incidents Metrics impact – Feature disable XXX Incidents No Impact – fast retry or automated failover XXXX Incidents Public Relations Media Impact High Customer Service Calls Affects AB Test Results Y incidents mitigated by Active Active, game day practicing YY incidents mitigated by better tools and practices YYY incidents mitigated by better data tagging
  • 29. Cloud Security Automated attack surface monitoring Crypto key store management (CloudHSM) Scale to resist DDOS attacks http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned
  • 30. What Changed? “Impossible” deployments are easy Jointly building code with vendors in public Highly available and secure despite scale and speed
  • 31. The DIY Question Why doesn’t Netflix build and run its own cloud?
  • 32. Fitting Into Public Scale Public Grey Area Private 1,000 Instances 100,000 Instances Netflix FacebookStartups
  • 33. How big is Public? AWS upper bound estimate based on the number of public IP Addresses Every provisioned instance gets a public IP by default (some VPC don’t) AWS Maximum Possible Instance Count 4.2 Million – May 2013 Growth >10x in Three Years, >2x Per Annum - http://bit.ly/awsiprange
  • 34. A Cloud Native Open Source Platform See netflix.github.com
  • 35. Establish our solutions as Best Practices / Standards Hire, Retain and Engage Top Engineers Build up Netflix Technology Brand Benefit from a shared ecosystem Goals
  • 36. Example Application – RSS Reader Z U U L Zuul Traffic Processing and Routing
  • 37. Ice – Detailed AWS “Chargeback” http://techblog.netflix.com/2013/06/announcing-ice-cloud-spend-and-usage.html
  • 38. Boosting the @NetflixOSS Ecosystem See netflix.github.com
  • 39. More Use Cases More Features Better portability Higher availability Easier to deploy Contributions from end users Contributions from vendors What’s Coming Next?
  • 40. Vendor Driven Portability Interest in using NetflixOSS for Enterprise Private Clouds “It’s done when it runs Asgard” Functionally complete Demonstrated March Released June in V3.3 Offering $10K prize for integration work Vendor and end user interest Openstack “Heat” getting there Paypal C3 Console based on Asgard
  • 41. Functionality and scale now, portability coming Moving from parts to a platform in 2013 Netflix is fostering a cloud native ecosystem Rapid Evolution - Low MTBIAMSH (Mean Time Between Idea And Making Stuff Happen)
  • 42. Slideshare.net/Netflix Details • Meetup S1E3 July – Featuring Contributors Eucalyptus, IBM, Paypal, Riot Games – http://techblog.netflix.com/2013/07/netflixoss-meetup-series-1-episode-3.html • Lightning Talks March S1E2 – http://www.slideshare.net/RuslanMeshenberg/netflixoss-meetup-lightning-talks-and- roadmap • Lightning Talks Feb S1E1 – http://www.slideshare.net/RuslanMeshenberg/netflixoss-open-house-lightning-talks • Asgard In Depth Feb S1E1 – http://www.slideshare.net/joesondow/asgard-overview-from-netflix-oss-open-house • Security Architecture – http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned/ • Cost Aware Cloud Architectures – with Jinesh Varia of AWS – http://www.slideshare.net/AmazonWebServices/building-costaware-architectures-jinesh- varia-aws-and-adrian-cockroft-netflix
  • 43. What Changed? Speed wins, Cloud Native helps you get there NetflixOSS makes it easier for everyone to become Cloud Native @adrianco #netflixcloud @NetflixOSS