Dystopia as a Service

Talk given at the 2013 Linux Foundation Collaboration Summit. San Francisco, April 15th 2013.

  • 1. Dystopia as a Service
    April 2013
    Adrian Cockcroft
    @adrianco #netflixcloud @NetflixOSS
    http://www.linkedin.com/in/adriancockcroft
  • 2. Dystopia
    Cloud Native
    NetflixOSS – Cloud Native On-Ramp
    Opportunities
  • 3. Dystopia - Abstract
    We have spent years striving to build perfect apps running on perfect kernels on perfect CPUs connected by perfect networks, but this utopia hasn't really arrived. Instead we live in a dystopian world of buggy apps changing several times a day, running on JVMs running on an old version of Linux running on Xen running on something I can't see, that only exists for a few hours, connected by a network of unknown topology and operated by many layers of automation.
    I will discuss the new challenges and demands of living in this dystopian world of cloud based services. I will also give an overview of the Netflix open source cloud platform (see netflix.github.com) that we use to create our own island of utopian agility and availability regardless of what is going on underneath.
  • 4. We are Engineers
    We solve hard problems
    We build amazing and complex things
    We fix things when they break
  • 5. We strive for perfection
    Perfect code
    Perfect hardware
    Perfectly operated
  • 6. But perfection takes too long…
    So we compromise
    Time to market vs. quality
    Utopia remains out of reach
  • 7. Where time to market wins big
    Web services
    Agile infrastructure - cloud
    Continuous deployment
  • 8. How Soon?
    Code features in days instead of months
    Hardware in minutes instead of weeks
    Incident response in seconds instead of hours
  • 9. Tipping the Balance
    Utopia vs. Dystopia
  • 10. A new engineering challenge
    Construct a highly agile and highly available service from ephemeral and often broken components
  • 11. Cloud Native
    How does Netflix work?
  • 12. Netflix Member Web Site Home Page
    Personalization driven – what goes on to make this?
  • 13. How Netflix Streaming Works
    Customer device (PC, PS3, TV…) talks to the web site or Discovery API
    AWS cloud services: user data, personalization, streaming API, DRM, QoS logging, CDN management and steering, content encoding
    CDN edge locations: OpenConnect CDN boxes serving consumer electronics devices
  • 14. Content Delivery Service
    Open source hardware design + FreeBSD, bird, nginx
  • 15. November 2012 Traffic
  • 16. Real Web Server Dependencies Flow
    (Netflix home page business transaction as seen by AppDynamics)
    Start here: memcached, Cassandra, web service, S3 bucket
    Three personalization movie group choosers (for US, Canada and Latam)
    Each icon is three to a few hundred instances across three AWS zones
  • 17. Cloud Native Architecture
    Clients and devices call autoscaled micro services (JVMs) spread across Zone A, Zone B and Zone C
    Backed by distributed quorum NoSQL datastores: Cassandra replicas in each zone, plus Memcached
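The arithmetic behind this three-zone layout is worth spelling out: with a replication factor of three (one replica per zone), a quorum read or write needs two replicas, so any single zone can disappear without losing availability. A minimal illustrative sketch (not NetflixOSS or Cassandra code):

```java
// Illustrative sketch: why three replicas across three zones survive the
// loss of any single zone under quorum reads and writes.
public class QuorumMath {
    // Smallest number of replicas that must respond for a quorum.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Replicas that can be unreachable while quorum is still achievable.
    static int tolerableFailures(int replicationFactor) {
        return replicationFactor - quorum(replicationFactor);
    }

    public static void main(String[] args) {
        int rf = 3; // one replica per availability zone
        System.out.println("quorum needed   = " + quorum(rf));            // 2
        System.out.println("zones tolerated = " + tolerableFailures(rf)); // 1
    }
}
```

With RF=3 a quorum write plus a quorum read always overlap on at least one replica, which is what makes the zone-failure story work.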
  • 18. New Anti-Fragile Patterns
    Micro-services
    Chaos engines
    Highly available systems composed from ephemeral components
  • 19. Stateless Micro-Service Architecture
    Linux Base AMI (CentOS or Ubuntu)
    Optional Apache frontend, memcached, non-java apps
    Monitoring: log rotation to S3, AppDynamics machine agent, Epic/Atlas
    Java (JDK 6 or 7): AppDynamics app agent monitoring, GC and thread dump logging
    Tomcat: application war file, base servlet, platform, client interface jars, Astyanax
    Healthcheck, status servlets, JMX interface, Servo autoscale
  • 20. Cassandra Instance Architecture
    Linux Base AMI (CentOS or Ubuntu)
    Tomcat and Priam on JDK: healthcheck, status
    Monitoring: AppDynamics machine agent, Epic/Atlas
    Java (JDK 7): AppDynamics app agent monitoring, GC and thread dump logging
    Cassandra Server
    Local ephemeral disk space – 2TB of SSD or 1.6TB disk holding commit log and SSTables
  • 21. Cloud Native
    Master copies of data are cloud resident
    Everything is dynamically provisioned
    All services are ephemeral
  • 22. Dynamic Scalability
  • 23. Asgard
    http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
  • 24. Ephemeral Instances
    • Largest services are autoscaled
    • Average lifetime of an instance is 36 hours
    Push → Autoscale Up → Autoscale Down
  • 25. Managing Multi-Region Availability
    Cassandra replicas in Zones A, B and C behind regional load balancers, in each region
    DNS providers: UltraDNS, DynECT DNS, AWS Route53
    Denominator: a portable way to manage multiple DNS providers from Java
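The idea behind Denominator can be sketched as one Java interface that hides which DNS provider is actually answering, so cross-region failover logic stays portable. The names below are invented for illustration and are not the real Denominator API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a portable DNS abstraction (invented names, not
// the Denominator API): callers see one interface; implementations wrap
// UltraDNS, DynECT or Route53.
interface DnsProvider {
    void upsertCname(String name, String target); // create or update a CNAME
    String resolveCname(String name);
}

// In-memory stand-in; a real implementation would call a provider's REST API.
class FakeDnsProvider implements DnsProvider {
    private final Map<String, String> records = new HashMap<>();
    public void upsertCname(String name, String target) { records.put(name, target); }
    public String resolveCname(String name) { return records.get(name); }
}

public class RegionalFailover {
    // Point the public name at whichever regional load balancer is healthy.
    static void failover(DnsProvider dns, String publicName, String healthyElb) {
        dns.upsertCname(publicName, healthyElb);
    }

    public static void main(String[] args) {
        DnsProvider dns = new FakeDnsProvider();
        failover(dns, "www.example.com", "elb-us-west-2.example.com");
        System.out.println(dns.resolveCname("www.example.com"));
    }
}
```

Because only the implementation of the interface changes, switching DNS vendors (or running several at once for diversity) does not touch the failover logic.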
  • 26. A Cloud Native Open Source Platform
  • 27. Inspiration
  • 28. Antifragile API Patterns
    Functional Reactive with Circuit Breakers and Bulkheads
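The circuit-breaker pattern named here is what Hystrix implements in the NetflixOSS stack. A minimal sketch of the idea, assuming a simple consecutive-failure threshold (this is illustrative code, not the Hystrix API):

```java
// Minimal circuit breaker sketch: after a threshold of consecutive
// failures the breaker "opens", and callers fail fast to a fallback
// instead of piling requests onto a broken dependency.
public class CircuitBreaker {
    private final int failureThreshold;
    private int consecutiveFailures = 0;

    public CircuitBreaker(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    // False once the breaker is open; callers should use a fallback.
    public boolean allowRequest() {
        return consecutiveFailures < failureThreshold;
    }

    public void recordSuccess() { consecutiveFailures = 0; }
    public void recordFailure() { consecutiveFailures++; }

    public static void main(String[] args) {
        CircuitBreaker breaker = new CircuitBreaker(3);
        for (int i = 0; i < 3; i++) breaker.recordFailure();
        // Open: serve a cached or default response instead of calling out.
        System.out.println("allow further calls? " + breaker.allowRequest());
    }
}
```

A production breaker like Hystrix also "half-opens" after a timeout to probe whether the dependency has recovered, and pairs the breaker with bulkheads (separate thread pools per dependency) so one slow service cannot exhaust all request threads.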
  • 29. Goals
    Establish our solutions as best practices / standards
    Hire, retain and engage top engineers
    Build up Netflix technology brand
    Benefit from a shared ecosystem
  • 30. NetflixOSS Continuous Build and Deployment
    Github: NetflixOSS source
    Build: Maven Central, Cloudbees Jenkins, Dynaslave AWS build slaves
    Aminator bakery turns builds plus the AWS base AMI into baked AMIs
    Asgard (+ Frigga) console and Odin orchestration API deploy the baked AMIs into the AWS account
  • 31. NetflixOSS Services Scope
    AWS account: Asgard console, Archaius config service, cross-region Priam C*, Explorers dashboards, Atlas monitoring, Genie Hadoop services
    Multiple AWS regions: Eureka registry, Exhibitor ZK, Edda history, Simian Army
    3 AWS zones: application clusters, autoscale groups, instances
    Persistent storage: Priam Cassandra
    Ephemeral storage: Evcache Memcached
  • 32. NetflixOSS Instance Libraries
    Initialization:
    • Baked AMI – Tomcat, Apache, your code
    • Governator – Guice based dependency injection
    • Archaius – dynamic configuration properties client
    • Eureka – service registration client
    Service Requests:
    • Karyon – base server for inbound requests
    • RxJava – reactive pattern
    • Hystrix/Turbine – dependencies and real-time status
    • Ribbon – REST client for outbound calls
    Data Access:
    • Astyanax – Cassandra client and pattern library
    • Evcache – zone aware Memcached client
    • Curator – Zookeeper patterns
    • Denominator – DNS routing abstraction
    Logging:
    • Blitz4j – non-blocking logging
    • Servo – metrics export for autoscaling
    • Atlas – high volume instrumentation
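The dynamic-configuration idea behind Archaius can be sketched as a property that is re-read on every access, so a value pushed at runtime takes effect without a redeploy. This is an invented minimal sketch, not the Archaius API (which polls remote configuration sources):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of dynamic configuration properties: callers read
// the property on every use, so an updated value takes effect immediately,
// with no restart or redeploy.
public class DynamicConfig {
    private static final Map<String, String> store = new ConcurrentHashMap<>();

    // In a real system this store would be refreshed by a background poller.
    public static void update(String key, String value) { store.put(key, value); }

    public static String getString(String key, String defaultValue) {
        return store.getOrDefault(key, defaultValue);
    }

    public static void main(String[] args) {
        System.out.println(getString("server.timeoutMs", "1000")); // default
        update("server.timeoutMs", "250"); // pushed at runtime
        System.out.println(getString("server.timeoutMs", "1000")); // new value
    }
}
```

The key property is that nothing caches the value at startup; that is what makes "incident response in seconds" possible when a timeout or feature flag has to change under load.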
  • 33. NetflixOSS Testing and Automation
    Test Tools:
    • CassJmeter – load testing for Cassandra
    • Circus Monkey – test account reservation rebalancing
    Maintenance:
    • Janitor Monkey – cleans up unused resources
    • Efficiency Monkey
    • Doctor Monkey
    • Howler Monkey – complains about expiring certs
    Availability:
    • Chaos Monkey – kills instances
    • Chaos Gorilla – kills availability zones
    • Chaos Kong – kills regions
    • Latency Monkey – latency and error injection
    Security:
    • Security Monkey
    • Conformity Monkey
  • 34. Example Application – RSS Reader
  • 35. What’s Coming Next?
    More use cases, more features
    Better portability, higher availability, easier to deploy
    Contributions from end users
    Contributions from vendors
  • 36. Vendor Driven Portability
    Interest in using NetflixOSS for enterprise private clouds
    “It’s done when it runs Asgard”
    One platform: functionally complete, demonstrated March, Release 3.3 in 2Q13, some vendor interest
    Another: needs an AWS compatible autoscaler, some vendor interest
    Another: many missing features, bait and switch AWS API strategy
  • 37. Netflix Cloud Prize
    Boosting the @NetflixOSS Ecosystem
  • 38. Entrants → Judges → Winners
    Entrants: registration opened March 13 on Github; Apache licensed contributions; entries close September 15
    Judged by Netflix Engineering: nominations must conform to the rules, with working code and community traction, in ten prize categories
    Winners: $10K cash, $5K AWS, AWS Re:Invent tickets and a trophy; award ceremony dinner in November at AWS Re:Invent
  • 39. Functionality and scale now, portability coming
    Moving from parts to a platform in 2013
    Netflix is fostering an ecosystem
    Rapid evolution - low MTBIAMSH (Mean Time Between Idea And Making Stuff Happen)
  • 40. Opportunities
  • 41. Monoculture
    Replicate “the best” as patterns
    Reduce interaction complexity
    But… epidemic single point of failure
  • 42. Pattern Failures
    Infrastructure pattern failures
    Software stack pattern failures
    Application pattern failures
  • 43. Infrastructure Pattern Failures
    • Device failures – bad batch of disks, PSUs, etc.
    • CPU failures – cache corruption, math errors
    • Datacenter failures – power, network, disaster
    • Routing failures – DNS, Internet/ISP path
  • 44. Software Stack Pattern Failures
    • Time bombs – counter wrap, memory leak
    • Date bombs – leap year, leap second, epoch
    • Expiration – certs timing out
    • Trust revocation – Certificate Authority fails
    • Security exploit – everything compromised
    • Language bugs – compilers and runtime
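The "counter wrap" time bomb is easy to demonstrate concretely: a signed 32-bit counter that only ever increments will silently flip negative after about 2.1 billion events, often years after the code shipped. A small self-contained example:

```java
// A concrete software-stack time bomb: signed 32-bit counter wrap.
// Java int overflow wraps silently; no exception is thrown.
public class CounterWrap {
    public static void main(String[] args) {
        int counter = Integer.MAX_VALUE; // 2147483647
        counter++;                       // wraps to Integer.MIN_VALUE
        System.out.println(counter);     // -2147483648

        // Widening to 64 bits pushes the wrap out beyond any realistic
        // event rate for the lifetime of the service.
        long safeCounter = (long) Integer.MAX_VALUE + 1;
        System.out.println(safeCounter); // 2147483648
    }
}
```

Because the wrap is silent, downstream code that assumes the counter is monotonically increasing (sequence numbers, rate calculations, IDs) misbehaves in ways that look like an application bug rather than an arithmetic one, which is what makes this class of failure a "bomb".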
  • 45. Application Pattern Failures
    • Content bombs – data dependent failure
    • Configuration – wrong/bad syntax
    • Versioning – incompatible mixes
    • Cascading failures – error handling bugs etc.
    • Cascading overload – excessive logging etc.
    • Network bugs – routers, firewalls, protocols
  • 46. What to do?
    Automated diversity management
    Diversify the automation as well
    Efficient vs. antifragile trade-off
  • 47. Linux Foundation
    • Strengths – ubiquitous support, open source is the default
    • Weaknesses – networking vs. BSD, observability
    • Opportunities – optimize for ephemeral dynamic use cases
    • Threats – epidemic failure modes, e.g. “leap second”
  • 48. Takeaway
    Netflix is making it easy for everyone to adopt Cloud Native patterns.
    Optimize for dystopia and diversity.
    http://netflix.github.com
    http://techblog.netflix.com
    http://slideshare.net/Netflix
    http://www.linkedin.com/in/adriancockcroft
    @adrianco #netflixcloud @NetflixOSS