Dystopia as a Service


Talk given at the 2013 Linux Foundation Collaboration Summit. San Francisco, April 15th 2013.



  1. 1. Dystopia as a Service. April 2013. Adrian Cockcroft, @adrianco #netflixcloud @NetflixOSS, http://www.linkedin.com/in/adriancockcroft
  2. 2. Dystopia; Cloud Native; NetflixOSS – Cloud Native On-Ramp; Opportunities
  3. 3. Dystopia - Abstract: We have spent years striving to build perfect apps running on perfect kernels on perfect CPUs connected by perfect networks, but this utopia hasn't really arrived. Instead we live in a dystopian world of buggy apps changing several times a day running on JVMs running on an old version of Linux running on Xen running on something I can't see, that only exists for a few hours, connected by a network of unknown topology and operated by many layers of automation. I will discuss the new challenges and demands of living in this dystopian world of cloud based services. I will also give an overview of the Netflix open source cloud platform (see netflix.github.com) that we use to create our own island of utopian agility and availability regardless of what is going on underneath.
  4. 4. We are Engineers: We solve hard problems; We build amazing and complex things; We fix things when they break
  5. 5. We strive for perfection: Perfect code; Perfect hardware; Perfectly operated
  6. 6. But perfection takes too long… So we compromise; Time to market vs. Quality; Utopia remains out of reach
  7. 7. Where time to market wins big: Web services; Agile infrastructure - cloud; Continuous deployment
  8. 8. How Soon? Code features in days instead of months; Hardware in minutes instead of weeks; Incident response in seconds instead of hours
  9. 9. Tipping the Balance: Utopia, Dystopia
  10. 10. A new engineering challenge: Construct a highly agile and highly available service from ephemeral and often broken components
  11. 11. Cloud Native: How does Netflix work?
  12. 12. Netflix Member Web Site Home Page: Personalization Driven – What goes on to make this?
  13. 13. How Netflix Streaming Works (diagram labels): Customer Device (PC, PS3, TV…); Web Site or Discovery API; User Data; Personalization; Streaming API; DRM; QoS Logging; OpenConnect CDN Boxes; CDN Management and Steering; Content Encoding; Consumer Electronics; AWS Cloud Services; CDN Edge Locations
  14. 14. Content Delivery Service: Open Source Hardware Design + FreeBSD, bird, nginx
  15. 15. November 2012 Traffic
  16. 16. Real Web Server Dependencies Flow (Netflix Home page business transaction as seen by AppDynamics). Diagram labels: Start Here; memcached; Cassandra; Web service; S3 bucket; Three Personalization movie group choosers (for US, Canada and Latam); Each icon is three to a few hundred instances across three AWS zones
  17. 17. Cloud Native Architecture (diagram labels): Distributed Quorum NoSQL Datastores; Autoscaled Micro Services; Autoscaled Micro Services; Clients; Things; JVM, JVM, JVM, JVM; Cassandra, Cassandra, Cassandra; Memcached; JVM; Zone A, Zone B, Zone C
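The "Distributed Quorum" label on slide 17, with datastore replicas spread over Zone A, Zone B and Zone C, can be illustrated with a little arithmetic. This is a minimal sketch that assumes one replica per zone and a simple majority quorum; the slide states neither, and the function names `quorum_size` and `survives_zone_loss` are hypothetical.

```python
def quorum_size(replication_factor: int) -> int:
    """Majority quorum: strictly more than half of the replicas."""
    return replication_factor // 2 + 1


def survives_zone_loss(replicas_per_zone: int, zones: int, zones_lost: int) -> bool:
    """True if a majority quorum is still reachable after losing whole zones.

    Assumes replicas are spread evenly across zones (an assumption,
    not something the slide spells out).
    """
    total = replicas_per_zone * zones
    remaining = replicas_per_zone * (zones - zones_lost)
    return remaining >= quorum_size(total)


if __name__ == "__main__":
    print(quorum_size(3))               # 2
    print(survives_zone_loss(1, 3, 1))  # True: two of three replicas remain
    print(survives_zone_loss(1, 3, 2))  # False: one replica is not a majority
```

Under these assumptions, three zones is the smallest layout in which quorum reads and writes keep working through the loss of any single zone.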
  18. 18. New Anti-Fragile Patterns: Micro-services; Chaos engines; Highly available systems composed from ephemeral components
  19. 19. Stateless Micro-Service Architecture (diagram labels): Linux Base AMI (CentOS or Ubuntu); Optional Apache frontend, memcached, non-java apps; Monitoring; Log rotation to S3; AppDynamics machine agent; Epic/Atlas; Java (JDK 6 or 7); AppDynamics app agent monitoring; GC and thread dump logging; Tomcat; Application war file, base servlet, platform, client interface jars, Astyanax; Healthcheck, status servlets, JMX interface, Servo autoscale
  20. 20. Cassandra Instance Architecture (diagram labels): Linux Base AMI (CentOS or Ubuntu); Tomcat and Priam on JDK; Healthcheck, Status; Monitoring; AppDynamics machine agent; Epic/Atlas; Java (JDK 7); AppDynamics app agent monitoring; GC and thread dump logging; Cassandra Server; Local Ephemeral Disk Space – 2TB of SSD or 1.6TB disk holding Commit log and SSTables
  21. 21. Cloud Native: Master copies of data are cloud resident; Everything is dynamically provisioned; All services are ephemeral
  22. 22. Dynamic Scalability
  23. 23. Asgard: http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
  24. 24. Ephemeral Instances: • Largest services are autoscaled • Average lifetime of an instance is 36 hours. Diagram labels: Push; Autoscale Up; Autoscale Down
  25. 25. Managing Multi-Region Availability (diagram labels): Cassandra Replicas Zone A; Cassandra Replicas Zone B; Cassandra Replicas Zone C; Regional Load Balancers; Cassandra Replicas Zone A; Cassandra Replicas Zone B; Cassandra Replicas Zone C; Regional Load Balancers; UltraDNS; DynECT DNS; AWS Route53; Denominator: A portable way to manage multiple DNS providers from Java
  26. 26. A Cloud Native Open Source Platform
  27. 27. Inspiration
  28. 28. Antifragile API Patterns: Functional Reactive with Circuit Breakers and Bulkheads
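The circuit breaker pattern named on slide 28 can be sketched in a few lines. This is an illustrative, single-threaded toy and not the Hystrix API; the class name `CircuitBreaker`, its parameters and the injectable `clock` are all assumptions made for the example, and bulkheads are not covered.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Toy circuit breaker: after repeated failures, stop calling the dependency."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to stay open
        self.clock = clock                # injectable so tests need not sleep
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        # While open, skip the dependency entirely and answer from the fallback.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        self.failures = 0  # a success closes the circuit again
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_recommendations, lambda: [])`, where `fetch_recommendations` stands in for any hypothetical dependency call. The point of the pattern is that a failing dependency is given time to recover while callers get a fast, degraded answer instead of piling up behind timeouts.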
  29. 29. Goals: Establish our solutions as Best Practices / Standards; Hire, Retain and Engage Top Engineers; Build up Netflix Technology Brand; Benefit from a shared ecosystem
  30. 30. NetflixOSS Continuous Build and Deployment (diagram labels): Github NetflixOSS Source; AWS Base AMI; Maven Central; Cloudbees Jenkins; Aminator Bakery; Dynaslave AWS Build Slaves; Asgard (+ Frigga) Console; AWS Baked AMIs; Odin Orchestration API; AWS Account
  31. 31. NetflixOSS Services Scope (diagram labels): AWS Account; Asgard Console; Archaius Config Service; Cross region Priam C*; Explorers Dashboards; Atlas Monitoring; Genie Hadoop Services; Multiple AWS Regions; Eureka Registry; Exhibitor ZK; Edda History; Simian Army; 3 AWS Zones; Application Clusters, Autoscale Groups, Instances; Priam Cassandra Persistent Storage; Evcache Memcached Ephemeral Storage
  32. 32. NetflixOSS Instance Libraries. Initialization: • Baked AMI – Tomcat, Apache, your code • Governator – Guice based dependency injection • Archaius – dynamic configuration properties client • Eureka - service registration client. Service Requests: • Karyon - Base Server for inbound requests • RxJava – Reactive pattern • Hystrix/Turbine – dependencies and real-time status • Ribbon - REST Client for outbound calls. Data Access: • Astyanax – Cassandra client and pattern library • Evcache – Zone aware Memcached client • Curator – Zookeeper patterns • Denominator – DNS routing abstraction. Logging: • Blitz4j – non-blocking logging • Servo – metrics export for autoscaling • Atlas – high volume instrumentation
  33. 33. NetflixOSS Testing and Automation. Test Tools: • CassJmeter – Load testing for Cassandra • Circus Monkey – Test account reservation rebalancing. Maintenance: • Janitor Monkey – Cleans up unused resources • Efficiency Monkey • Doctor Monkey • Howler Monkey – Complains about expiring certs. Availability: • Chaos Monkey – Kills Instances • Chaos Gorilla – Kills Availability Zones • Chaos Kong – Kills Regions • Latency Monkey – Latency and error injection. Security: • Security Monkey • Conformity Monkey
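The "Chaos Monkey – Kills Instances" entry on slide 33 describes deliberate random termination. The sketch below only simulates the selection step over an in-memory fleet; it makes no AWS calls and is not how the Simian Army is implemented. The function name `pick_victims`, the `probability` parameter and the group layout are hypothetical.

```python
import random
from typing import Dict, List, Optional


def pick_victims(groups: Dict[str, List[str]], probability: float,
                 rng: Optional[random.Random] = None) -> Dict[str, str]:
    """For each group, with the given probability, choose one instance to kill.

    Returns a mapping of group name to the chosen instance id.
    Nothing is actually terminated here; this is selection only.
    """
    rng = rng or random.Random()
    victims: Dict[str, str] = {}
    for name, instances in groups.items():
        # Empty groups are skipped; otherwise flip a biased coin per group.
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


if __name__ == "__main__":
    fleet = {"api": ["i-a1", "i-a2", "i-a3"], "web": ["i-b1", "i-b2"]}
    print(pick_victims(fleet, probability=0.5, rng=random.Random(42)))
```

Limiting the choice to one instance per group keeps the experiment small: a service that is autoscaled and stateless should absorb the loss without customers noticing.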
  34. 34. Example Application – RSS Reader
  35. 35. What’s Coming Next? More Use Cases; More Features; Better portability; Higher availability; Easier to deploy; Contributions from end users; Contributions from vendors
  36. 36. Vendor Driven Portability: Interest in using NetflixOSS for Enterprise Private Clouds. “It’s done when it runs Asgard”. Functionally complete; Demonstrated March; Release 3.3 in 2Q13. Some vendor interest; Needs AWS compatible Autoscaler. Some vendor interest; Many missing features; Bait and switch AWS API strategy
  37. 37. Netflix Cloud Prize: Boosting the @NetflixOSS Ecosystem
  38. 38. (Diagram labels) Entrants; Netflix Engineering; Judges; Winners; Nominations; Conforms to Rules; Working Code; Community Traction; Categories; Registration Opened March 13; Github; Apache Licensed Contributions; Github; Close Entries September 15; Github; Award Ceremony Dinner November AWS Re:Invent; Ten Prize Categories; $10K cash; $5K AWS; AWS Re:Invent Tickets; Trophy
  39. 39. Functionality and scale now, portability coming; Moving from parts to a platform in 2013; Netflix is fostering an ecosystem; Rapid Evolution - Low MTBIAMSH (Mean Time Between Idea And Making Stuff Happen)
  40. 40. Opportunities
  41. 41. Monoculture: Replicate “the best” as patterns; Reduce interaction complexity; But… epidemic single point of failure
  42. 42. Pattern Failures: Infrastructure Pattern Failures; Software Stack Pattern Failures; Application Pattern Failures
  43. 43. Infrastructure Pattern Failures: • Device failures – bad batch of disks, PSUs, etc. • CPU failures – cache corruption, math errors • Datacenter failures – power, network, disaster • Routing failures – DNS, Internet/ISP path
  44. 44. Software Stack Pattern Failures: • Time bombs – Counter wrap, memory leak • Date bombs - Leap year, leap second, epoch • Expiration – Certs timing out • Trust revocation – Certificate Authority fails • Security exploit – everything compromised • Language bugs – compilers and runtime
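The "Counter wrap" time bomb on slide 44 is easy to show with a worked example. The sketch below assumes an unsigned 32-bit tick counter, which is an illustration chosen here rather than a case the slides describe; the helper names are hypothetical.

```python
MASK_U32 = 0xFFFFFFFF


def wrap_u32(value: int) -> int:
    """Emulate an unsigned 32-bit counter by discarding the overflow bits."""
    return value & MASK_U32


def days_until_wrap(ticks_per_second: int, bits: int = 32) -> float:
    """How long a free-running counter lasts before it wraps back to zero."""
    return (2 ** bits) / ticks_per_second / 86_400


def elapsed_u32(start: int, now: int) -> int:
    """Wrap-safe elapsed ticks, valid while the true gap is under 2**32."""
    return (now - start) & MASK_U32


if __name__ == "__main__":
    print(wrap_u32(0xFFFFFFFF + 1))            # 0
    print(round(days_until_wrap(1000), 1))     # 49.7
    print(0x10 - 0xFFFFFFF0)                   # naive subtraction goes negative
    print(elapsed_u32(0xFFFFFFF0, 0x10))       # 32
```

A millisecond counter of this width wraps after roughly 49.7 days, so every instance built from the same pattern and started at about the same time reaches the bug together, which is the epidemic failure mode the monoculture slides warn about.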
  45. 45. Application Pattern Failures: • Content bombs – Data dependent failure • Configuration – wrong/bad syntax • Versioning – incompatible mixes • Cascading failures – error handling bugs etc. • Cascading overload – excessive logging etc. • Network bugs – routers, firewalls, protocols
  46. 46. What to do? Automated diversity management; Diversify the automation as well; Efficient vs. Antifragile trade-off
  47. 47. Linux Foundation: • Strengths – Ubiquitous support, open source is the default • Weaknesses – Networking vs. BSD, observability • Opportunities – Optimize for ephemeral dynamic use cases • Threats – Epidemic failure modes – e.g. “leap second”
  48. 48. Takeaway: Netflix is making it easy for everyone to adopt Cloud Native patterns. Optimize for dystopia and diversity. http://netflix.github.com http://techblog.netflix.com http://slideshare.net/Netflix http://www.linkedin.com/in/adriancockcroft @adrianco #netflixcloud @NetflixOSS