3. Dystopia - Abstract
We have spent years striving to build perfect apps running on perfect kernels on perfect CPUs connected by perfect networks, but this utopia hasn't really arrived. Instead we live in a dystopian world of buggy apps changing several times a day running on JVMs running on an old version of Linux running on Xen running on something I can't see, that only exists for a few hours, connected by a network of unknown topology and operated by many layers of automation.
I will discuss the new challenges and demands of living in this dystopian world of cloud-based services. I will also give an overview of the Netflix open source cloud platform (see netflix.github.com) that we use to create our own island of utopian agility and availability regardless of what is going on underneath.
4. We are Engineers
We solve hard problems
We build amazing and complex things
We fix things when they break
5. We strive for perfection
Perfect code
Perfect hardware
Perfectly operated
6. But perfection takes too long…
So we compromise
Time to market vs. Quality
Utopia remains out of reach
7. Where time to market wins big
Web services
Agile infrastructure - cloud
Continuous deployment
8. How Soon?
Code features in days instead of months
Hardware in minutes instead of weeks
Incident response in seconds instead of hours
9. Tipping the Balance
Utopia
Dystopia
10. A new engineering challenge
Construct a highly agile and highly available service from ephemeral and often broken components
11. Cloud Native
How does Netflix work?
12. Netflix Member Web Site Home Page
Personalization Driven – What goes on to make this?
13. How Netflix Streaming Works
[Diagram labels: Customer Device (PC, PS3, TV…); Web Site or Discovery API; User Data; Personalization; Streaming API; DRM; QoS Logging; OpenConnect CDN Boxes; CDN Management and Steering; Content Encoding; Consumer Electronics; AWS Cloud Services; CDN Edge Locations]
16. Real Web Server Dependencies Flow
(Netflix Home page business transaction as seen by AppDynamics)
[Diagram labels: Start Here; memcached; Cassandra; Web service; S3 bucket; Three Personalization movie group choosers (for US, Canada and Latam)]
Each icon is three to a few hundred instances across three AWS zones
17. Cloud Native Architecture
[Diagram labels: Clients; Things; Autoscaled MicroServices (shown twice); JVM; Memcached; Distributed Quorum NoSQL Datastores; Cassandra; Zone A; Zone B; Zone C]
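The "Distributed Quorum" label spread over three zones can be made concrete with a little arithmetic. The sketch below assumes a layout the slide does not spell out: three replicas of each row, one per zone, with reads and writes that each wait for a majority. The function names are hypothetical and this is not the Cassandra or Astyanax API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """Read and write replica sets must overlap, which holds when R + W > N."""
    return r + w > n


def survives_zone_loss(replicas_per_zone: dict, lost_zone: str) -> bool:
    """True if a quorum of replicas is still reachable after one zone disappears."""
    total = sum(replicas_per_zone.values())
    remaining = total - replicas_per_zone[lost_zone]
    return remaining >= quorum(total)


# Assumed layout: one replica in each of three zones.
layout = {"zone-a": 1, "zone-b": 1, "zone-c": 1}
q = quorum(sum(layout.values()))
print("quorum size:", q)
print("quorum reads see quorum writes:", reads_see_latest_write(3, q, q))
print("survives losing zone-b:", survives_zone_loss(layout, "zone-b"))
```

Under these assumptions, losing any single zone still leaves two of three replicas, which is enough for quorum reads and writes to continue.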
18. New Anti-Fragile Patterns
Micro-services
Chaos engines
Highly available systems composed from ephemeral components
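One way to see why ephemeral components can still add up to a highly available system is the standard redundancy calculation. This is a back-of-the-envelope sketch, not a figure from the talk: it assumes instance failures are independent, which correlated failures (a zone outage, a shared bug) will violate. The availability numbers and function names are illustrative.

```python
def combined_availability(instance_availability: float, instances: int) -> float:
    """Probability that at least one of `instances` independent copies is up."""
    return 1.0 - (1.0 - instance_availability) ** instances


def instances_needed(instance_availability: float, target: float) -> int:
    """Fewest independent instances whose combined availability meets `target`."""
    if not (0.0 < instance_availability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("availabilities must be strictly between 0 and 1")
    n = 1
    while combined_availability(instance_availability, n) < target:
        n += 1
    return n


# Illustrative numbers only: each instance is up 95% of the time.
for n in (1, 2, 3):
    print(n, "instance(s):", round(combined_availability(0.95, n), 6))
print("instances needed for 99.9%:", instances_needed(0.95, 0.999))
```

The independence assumption is the weak point, which is why injecting failures on purpose is a useful check that redundancy behaves as calculated.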
19. Stateless Micro-Service Architecture
Linux Base AMI (CentOS or Ubuntu)
Optional Apache frontend, memcached, non-java apps
Monitoring
Log rotation to S3
AppDynamics machine agent
Epic/Atlas
Java (JDK 6 or 7)
AppDynamics app agent monitoring
GC and thread dump logging
Tomcat
Application war file, base servlet, platform, client interface jars, Astyanax
Healthcheck, status servlets, JMX interface, Servo autoscale
20. Cassandra Instance Architecture
Linux Base AMI (CentOS or Ubuntu)
Tomcat and Priam on JDK
Healthcheck, Status
Monitoring
AppDynamics machine agent
Epic/Atlas
Java (JDK 7)
AppDynamics app agent monitoring
GC and thread dump logging
Cassandra Server
Local Ephemeral Disk Space – 2TB of SSD or 1.6TB disk holding Commit log and SSTables
21. Cloud Native
Master copies of data are cloud resident
Everything is dynamically provisioned
All services are ephemeral
32. NetflixOSS Instance Libraries
Initialization
• Baked AMI – Tomcat, Apache, your code
• Governator – Guice based dependency injection
• Archaius – dynamic configuration properties client
• Eureka – service registration client
Service Requests
• Karyon – Base Server for inbound requests
• RxJava – Reactive pattern
• Hystrix/Turbine – dependencies and real-time status
• Ribbon – REST Client for outbound calls
Data Access
• Astyanax – Cassandra client and pattern library
• Evcache – Zone aware Memcached client
• Curator – Zookeeper patterns
• Denominator – DNS routing abstraction
Logging
• Blitz4j – non-blocking logging
• Servo – metrics export for autoscaling
• Atlas – high volume instrumentation
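Hystrix, listed above for managing dependencies, is best known for wrapping calls to other services in a circuit breaker. The sketch below illustrates that general pattern in Python. It is not the Hystrix API: the class, its parameters, and the thresholds are hypothetical, and the clock is injectable only so the behaviour is easy to exercise.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, retries after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the dependency.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("dependency circuit is open")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky_dependency():
    raise RuntimeError("dependency is down")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=30.0)
for _ in range(3):
    print(breaker.call(flaky_dependency, fallback=lambda: "cached response"))
```

The third call in the usage loop never reaches the dependency: the breaker is already open and answers from the fallback, which keeps a broken downstream service from tying up the caller.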
33. NetflixOSS Testing and Automation
Test Tools
• CassJmeter – Load testing for Cassandra
• Circus Monkey – Test account reservation rebalancing
Maintenance
• Janitor Monkey – Cleans up unused resources
• Efficiency Monkey
• Doctor Monkey
• Howler Monkey – Complains about expiring certs
Availability
• Chaos Monkey – Kills Instances
• Chaos Gorilla – Kills Availability Zones
• Chaos Kong – Kills Regions
• Latency Monkey – Latency and error injection
Security
• Security Monkey
• Conformity Monkey
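The idea behind "Chaos Monkey – Kills Instances" can be shown with a toy simulation: terminate a random instance, check the service is still up, and let the autoscaler replace what was lost. This is an in-memory sketch under stated assumptions, not the Chaos Monkey tool or any AWS API. The `Fleet` class, the instance ids, and the fixed random seed are all made up for illustration.

```python
import random


class Fleet:
    """In-memory stand-in for an autoscaled group of stateless instances."""

    def __init__(self, desired: int):
        self.desired = desired
        self.next_id = 0
        self.instances = set()
        self.reconcile()

    def reconcile(self):
        """Play the autoscaler: launch replacements until `desired` are running."""
        while len(self.instances) < self.desired:
            self.instances.add(f"i-{self.next_id:04d}")
            self.next_id += 1

    def is_serving(self) -> bool:
        """The service is up as long as at least one instance remains."""
        return len(self.instances) > 0


def unleash_monkey(fleet: Fleet, rng: random.Random) -> str:
    """Terminate one randomly chosen instance and return its id."""
    victim = rng.choice(sorted(fleet.instances))
    fleet.instances.remove(victim)
    return victim


fleet = Fleet(desired=3)
rng = random.Random(42)  # fixed seed so runs are repeatable
for _ in range(10):
    killed = unleash_monkey(fleet, rng)
    assert fleet.is_serving(), "service went down after losing " + killed
    fleet.reconcile()
print("instances launched in total:", fleet.next_id)
```

Because every instance is stateless and replaceable, each kill costs capacity only until the next reconcile. The larger failure domains on the slide (zones, regions) apply the same idea at a bigger scale.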
34. Example Application – RSS Reader
35. What’s Coming Next?
More Use Cases
More Features
Better portability
Higher availability
Easier to deploy
Contributions from end users
Contributions from vendors
36. Vendor Driven Portability
Interest in using NetflixOSS for Enterprise Private Clouds
“It’s done when it runs Asgard”
Functionally complete
Demonstrated March
Release 3.3 in 2Q13
Some vendor interest
Needs AWS compatible Autoscaler
Some vendor interest
Many missing features
Bait and switch AWS API strategy
37. Netflix Cloud Prize
Boosting the @NetflixOSS Ecosystem
39. Functionality and scale now, portability coming
Moving from parts to a platform in 2013
Netflix is fostering an ecosystem
Rapid Evolution - Low MTBIAMSH (Mean Time Between Idea And Making Stuff Happen)
41. Monoculture
Replicate “the best” as patterns
Reduce interaction complexity
But… epidemic single point of failure
46. What to do?
Automated diversity management
Diversify the automation as well
Efficient vs. Antifragile trade-off
47. Linux Foundation
• Strengths
– Ubiquitous support, open source is the default
• Weaknesses
– Networking vs. BSD, observability
• Opportunities
– Optimize for ephemeral dynamic use cases
• Threats
– Epidemic failure modes – e.g. “leap second”
48. Takeaway
Netflix is making it easy for everyone to adopt Cloud Native patterns.
Optimize for dystopia and diversity.
http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud @NetflixOSS