NetflixOSS for Triangle Devops Oct 2013


My @TriangleDevops talk from 2013-10-17. I covered the work that led us to @NetflixOSS (Acme Air), the work we did on the cloud prize (NetflixOSS on IBM SoftLayer/RightScale) and the @NetflixOSS platform (Karyon, Archaius, Eureka, Ribbon, Asgard, Hystrix, Turbine, Zuul, Servo, Edda, Ice, Denominator, Aminator, Janitor/Conformity/Chaos Monkeys of the Simian Army).

Published in: Technology


  1. 1. Learning about NetflixOSS For Oct 2013 @TriangleDevops Andrew Spyker @aspyker Some content from @ma4jpb
  2. 2. Agenda • How did I get here? • Netflix and Netflix OSS platform overview • Runtime components • Management components • Build components • Automated test and cleanliness components
  3. 3. About me … • IBM STSM of Performance Architect and Strategy • Eleven years in performance in WebSphere – Led the App Server Performance team for years – Small sabbatical focused on IBM XML technology – Work in Emerging Technology Institute and CTO Office – Starting to look at cloud service operations • Email: – Blog: – Linkedin: – Twitter: – Github: • Triangle dad that enjoys technology as well as running, wine and poker
  4. 4. Develop or maintain a service today? • Develop – starting • Maintain – starting • More on this later …. 4
  5. 5. What qualifies me to talk? • My shirt? • Of cloud prize ~ 25 nominees – Personally • Best example mash-up sample – My IBM team • Best portability enhancement – More on this coming … • 5
  6. 6. Seriously, how did I get here? • Plenty of experience with performance and scale on standardized benchmarks (SPEC/TPC) – Not representative of how to (web) scale • Pinning, biggest monolithic DB “wins”, hand tuned for fixed size – Out of date on modern architecture for mobile/cloud • Created Acme Air – • Demonstrated that we could achieve (web) scale runs – 4B+ Mobile/Browser requests/day – With modern mobile and cloud best practices
  7. 7. Demo 7
  8. 8. What was shown? • Peak performance and scale – You betcha! • Operational visibility – Only during the run via nmon collection and post-run visualization • True operational visibility – nope • Devops – nope • HA and DR – nope • Manual and automatic elastic scaling – nope
  9. 9. What next? • Went looking for existing industry best practices around devops and high availability at web scale – Many have documented via research papers and on – Google, Twitter, Facebook, Linkedin, etc. • Why Netflix? – Documented not only on their tech blog, but also released working OSS on github – Also, given dependence on Amazon, they are a clear bellwether of web scale public cloud availability
  10. 10. Steps to NetflixOSS understanding • Recoded Acme Air application to make use of NetflixOSS runtime components • Worked to implement a NetflixOSS devops and high availability setup around Acme Air (on EC2) run at previous levels of scale and performance • Worked to port NetflixOSS runtime and devops/high availability servers to IBM Cloud (SoftLayer) and RightScale • Through public collaboration with Netflix technical team – Google groups, github and meetups 10
  11. 11. Why? • To prove that an advanced cloud high availability and devops platform wasn’t “tied” to Amazon • To understand how we can advance IBM cloud platforms for our customers • To understand how we can host our IBM public cloud services better
  12. 12. Agenda • How did I get here? • Netflix and Netflix OSS platform overview • Runtime components • Management components • Build components • Automated test and cleanliness components
  13. 13. My view of Netflix goals • As a business – Be the best streaming media provider in the world – Make best content deals based on real data/analysis • Technology wise – Have the most availability possible – Measure all things by “stream starts per unit of time” • Any dip in that relates back to the business – Do this at web scale 13
  14. 14. Standing on the shoulders of giants • Public Cloud (Amazon) – When adding streaming, Netflix decided they • Shouldn’t invest in building data centers worldwide • Had to plan for the streaming business to be very big – Embraced cloud architecture paying only for what they need • Open Source – Many parts of runtime depend on open source • Linux, Apache Tomcat, Apache Cassandra, etc. – Realized that Amazon wasn’t enough • Started a cloud platform on top that would eventually be open sourced - NetflixOSS
  15. 15. Failure • What is failing? – Underlying IaaS problems • Instances, racks, availability zones, regions – Software issues • Operating system, servers, application code – Surrounding services • Other application services, DNS, user registries, etc. • How is a component failing? – Fails and disappears altogether – Intermittently fails – Works, but is responding slowly – Works, but is causing users a poor experience
  16. 16. Overview of Amazon EC2 • Amazon launches instances into availability zones – Instances of various sizes (compute, storage, etc.) • Organized into regions and availability zones – Regions independent of each other – Regions only connected over the Internet – Regions contain availability zones – Availability zones are isolated from each other – Availability zones are connected w/ low-latency links • This gives a high level of resilience to outages – Unlikely to affect multiple availability zones or regions • Amazon requires customer be aware of this topology to take advantage of its benefits within their application [Diagram labels: EC2 Region (US East) with three availability zones; EC2 Region (US West) with three availability zones; Internet]
  17. 17. NetflixOSS • “Technical indigestion as a service” - @adrianco • • 30+ OSS projects • Expanding every day 17
  18. 18. NetflixOSS – for today • For today – Focus on mid tier web app and micro service servers – Devops servers and tools – Skipping some just for simplicity • For another time – Big data – Data tier – Caching 18
  19. 19. Agenda • How did I get here? • Netflix and Netflix OSS platform overview • Runtime components • Management components • Build components • Automated test and cleanliness components 19
  20. 20. Acme Air As A Sample [Diagram labels: ELB; Web App Front End (REST services); App Service (Authentication); Data Tier] Greatly simplified …
  21. 21. Micro-services architecture • Decompose system into isolated services that can be developed separately • Why? – They can fail independently vs. fail together monolithically – They can be developed and released with different velocities by different teams • To show this we created a separate “auth service” for Acme Air • In a typical customer facing application any single front end invocation could spawn 20-30 calls to services and data sources
  22. 22. How do services advertise themselves? • Upon web app startup, Karyon server is started – Karyon will configure (via Archaius) the application – Karyon will register the location of the instance with Eureka • Others can know of the existence of the service • Lease based so instances continue to check in updating list of available instances – Karyon will also expose a JMX console, healthcheck URL • Devops can change things about the service via JMX • The system can monitor the health of the instance [Diagram labels: App Service (Authentication); Karyon; Tomcat; name, port, IP address, healthcheck URL; Eureka server(s); remote Archaius stores]
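The lease idea on this slide can be sketched in plain Java. This is a minimal illustration, not the Eureka or Karyon API: `LeaseRegistry`, `Instance`, `heartbeat` and `lookup` are hypothetical names, and time is passed in explicitly so the behavior is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical lease-based registry; an illustration, not the Eureka API.
class LeaseRegistry {
    // What an instance advertises: service name, host, port, health check URL.
    static final class Instance {
        final String service;
        final String host;
        final int port;
        final String healthCheckUrl;

        Instance(String service, String host, int port, String healthCheckUrl) {
            this.service = service;
            this.host = host;
            this.port = port;
            this.healthCheckUrl = healthCheckUrl;
        }

        String key() { return service + "@" + host + ":" + port; }
    }

    private final long leaseMillis;
    private final Map<String, Instance> instances = new ConcurrentHashMap<>();
    private final Map<String, Long> expiry = new ConcurrentHashMap<>();

    LeaseRegistry(long leaseMillis) { this.leaseMillis = leaseMillis; }

    // Registering and renewing are the same operation: push the expiry forward.
    void heartbeat(Instance instance, long nowMillis) {
        instances.put(instance.key(), instance);
        expiry.put(instance.key(), nowMillis + leaseMillis);
    }

    // Only instances whose lease has not expired are returned to consumers.
    List<Instance> lookup(String service, long nowMillis) {
        List<Instance> live = new ArrayList<>();
        for (Instance i : instances.values()) {
            Long expiresAt = expiry.get(i.key());
            if (i.service.equals(service) && expiresAt != null && expiresAt > nowMillis) {
                live.add(i);
            }
        }
        return live;
    }
}
```

An instance that stops checking in simply drops out of `lookup` results once its lease runs out, which is the property the slide relies on for keeping the list of available instances current.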
  23. 23. How do consumers find services? • Service consumers query Eureka at startup and periodically to determine location of dependencies – Can query based on availability zone and cross availability zone [Diagram labels: Web App Front End (REST services); Eureka client; Tomcat; What “auth-service” instances exist?; Eureka server(s)]
  24. 24. Demo 24
  25. 25. How does the consumer call the service? • Protocol implementations have Eureka-aware load balancing support built in – In-client load balancing, does not require separate LB tier • Ribbon – REST client – Pluggable load balancing scheme – Built-in failure recovery support (retry next server, mark instance as failing, etc.) • Other Eureka-enabled clients – memcached (EVCache), Astyanax coming (Priam and Cassandra) [Diagram labels: Web App Front End (REST services); Call “auth-service”; Ribbon REST client; Eureka client; App Service (Authentication) instances]
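To make "in-client load balancing" with "retry next server, mark instance as failing" concrete, here is a minimal plain-Java sketch. It is not the Ribbon API: `RoundRobinClient`, `call` and `isMarkedFailing` are hypothetical names, round robin stands in for whatever pluggable scheme is configured, and the server list is assumed to come from a registry lookup.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical client-side round-robin balancer; an illustration, not the Ribbon API.
class RoundRobinClient {
    private final List<String> servers;   // e.g. addresses returned by a registry lookup
    private final Set<String> failing = ConcurrentHashMap.newKeySet();
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinClient(List<String> servers) { this.servers = servers; }

    // Makes up to servers.size() picks per call, skipping servers marked as failing.
    <T> T call(Function<String, T> request) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < servers.size(); attempt++) {
            String server = servers.get(Math.floorMod(next.getAndIncrement(), servers.size()));
            if (failing.contains(server)) {
                continue;
            }
            try {
                return request.apply(server);
            } catch (RuntimeException e) {
                failing.add(server);   // mark instance as failing, then retry the next server
                last = e;
            }
        }
        throw new IllegalStateException("no healthy server available", last);
    }

    boolean isMarkedFailing(String server) { return failing.contains(server); }
}
```

This sketch never clears the failing set; a real client would re-check marked servers periodically or refresh the list from the registry.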
  26. 26. How to deploy this with HA? Instances? • Deploy across AZs • Using AutoScalingGroups in EC2 managed by Asgard Eureka? • DNS and Elastic IP trickery • Deployed across AZs – ASG manages recovery • For clients to find eureka servers – DNS TXT record for domain lists AZ TXT records – AZ TXT records have list of Eureka servers • For new eureka servers – Look for list of eureka servers IP’s for the AZ it’s coming up in – Look for unassigned elastic IP’s, grab one and assign it to itself – Sync with other already assigned IP’s that likely are hosting Eureka server instances • Simpler configurations with less HA are available
  27. 27. Protect yourself from unhealthy services • Wrap all calls to services with Hystrix command pattern – Hystrix implements circuit breaker pattern – Executes command using semaphore or separate thread pool to guarantee return within finite time to caller – If an unhealthy service is detected, start to call fallback implementation (broken circuit) and periodically check if main implementation works (reset circuit) [Diagram labels: Web App Front End (REST services); Call “auth-service”; Hystrix; Execute auth-service call; Ribbon REST client; App Service (Authentication) instances; Fallback implementation]
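The broken circuit / reset circuit behavior can be sketched without Hystrix itself. The class below is a hypothetical, simplified circuit breaker (`SimpleCircuitBreaker`, `execute` and `isOpen` are illustrative names, not the Hystrix API). It deliberately leaves out the semaphore or thread pool isolation that gives Hystrix its bounded return time.

```java
import java.util.function.Supplier;

// Hypothetical minimal circuit breaker; an illustration, not the Hystrix API.
class SimpleCircuitBreaker {
    private final int failureThreshold;    // consecutive failures before the circuit opens
    private final long retryAfterMillis;   // how long to stay open before probing again
    private int consecutiveFailures = 0;
    private long openedAtMillis = 0;

    SimpleCircuitBreaker(int failureThreshold, long retryAfterMillis) {
        this.failureThreshold = failureThreshold;
        this.retryAfterMillis = retryAfterMillis;
    }

    // Open means: skip the primary call and go straight to the fallback.
    synchronized boolean isOpen(long nowMillis) {
        return consecutiveFailures >= failureThreshold
                && nowMillis - openedAtMillis < retryAfterMillis;
    }

    <T> T execute(Supplier<T> primary, Supplier<T> fallback, long nowMillis) {
        if (isOpen(nowMillis)) {
            return fallback.get();
        }
        try {
            T result = primary.get();   // also the periodic probe once the open window has passed
            synchronized (this) {
                consecutiveFailures = 0;   // success resets the circuit
            }
            return result;
        } catch (RuntimeException e) {
            synchronized (this) {
                consecutiveFailures++;
                if (consecutiveFailures >= failureThreshold) {
                    openedAtMillis = nowMillis;   // (re)open the circuit
                }
            }
            return fallback.get();
        }
    }
}
```

While the circuit is open the unhealthy dependency receives no traffic at all, which protects the caller and also gives the dependency room to recover.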
  28. 28. Does Hystrix do more? • Main reason for Hystrix is to protect yourself from dependencies, but … • Once you have a layer of indirection, take advantage of it; Hystrix can provide – Caching – Visualization • Aggregated via Turbine – Request collapsing • Programming models – Sync, Async, Reactive (RxJava)
  29. 29. Agenda • How did I get here? • Netflix and Netflix OSS platform overview • Runtime components • Management components • Build components • Automated test and cleanliness components 29
  30. 30. Ability to reconfigure - Archaius • Using dynamic properties, can easily change properties across cluster of applications, either – NetflixOSS named props • Hystrix timeouts for example – Custom dynamic props • High throughput achieved by polling approach • HA of configuration source dependent on what source you use – HTTP server, database, etc. [Diagram labels: Application; Runtime; Hierarchy; URL; JMX; Karyon Console; Persisted DB; Application Props; Libraries; Container] DynamicIntProperty prop = DynamicPropertyFactory.getInstance().getIntProperty("myProperty", DEFAULT_VALUE); int value = prop.get(); // value will change over time based on configuration
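One way to read "high throughput achieved by polling approach" is that property reads are served from an in-memory snapshot, and only a background poll touches the configuration source. The sketch below illustrates that reading in plain Java; it is not the Archaius implementation, and `PolledProperties`, `poll` and `getInt` are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical polling-based property store; an illustration, not the Archaius API.
class PolledProperties {
    private final Supplier<Map<String, String>> source;   // e.g. reads a URL or a database table
    private final AtomicReference<Map<String, String>> snapshot = new AtomicReference<>();

    PolledProperties(Supplier<Map<String, String>> source) {
        this.source = source;
        poll();   // load an initial snapshot
    }

    // Intended to run on a schedule; swaps in a fresh snapshot of the source.
    void poll() {
        snapshot.set(source.get());
    }

    // Reads only touch the in-memory snapshot, never the source itself.
    int getInt(String name, int defaultValue) {
        String raw = snapshot.get().get(name);
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return defaultValue;   // malformed values fall back to the default
        }
    }
}
```

In a running service `poll()` would be driven by something like a `ScheduledExecutorService`; a change in the source then becomes visible to every caller after the next poll, without a restart.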
  31. 31. Asgard [Diagram labels: EC2 Region (US East); three availability zones, each running Web App (REST Services) and App Service (Authentication) instances; Tell EC2 to start these instances and keep this many instances running] • Asgard is the missing EC2 console for AutoScalingGroup mgmt. – EC2 only has CLI for ASG management
  32. 32. Asgard creates an “application” • Enforces common practices for deploying code – Common approach to linking auto scaling groups to launch configs, ELB’s, security groups, scaling policies and AMIs • Adds missing concept to the EC2 domain model – “application” – Extends clustering to applications vs. AMI’s • Example – Application: app1 – Cluster: app1-env – Autoscaling group version n: app1-env-v009 – Autoscaling group version n+1: app1-env-v010
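The naming example on this slide (app1, app1-env, app1-env-v009, app1-env-v010) can be expressed as a small helper. This is a sketch derived only from that example, not Asgard's actual naming code. `AsgNames` is a hypothetical class, and both the three-digit suffix rule and the `-v000` starting point are assumptions.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper mirroring the slide's app1-env-v009 example; not Asgard code.
class AsgNames {
    // Assumption: a versioned group name ends in "-v" plus exactly three digits.
    private static final Pattern VERSIONED = Pattern.compile("^(.+)-v(\\d{3})$");

    // The cluster name is the group name without its -vNNN suffix.
    static String clusterOf(String groupName) {
        Matcher m = VERSIONED.matcher(groupName);
        return m.matches() ? m.group(1) : groupName;
    }

    // Name of the group that the next push into the same cluster would create.
    static String nextGroup(String groupName) {
        Matcher m = VERSIONED.matcher(groupName);
        if (!m.matches()) {
            return groupName + "-v000";   // assumption: unversioned names start at v000
        }
        int next = Integer.parseInt(m.group(2)) + 1;
        return String.format(Locale.ROOT, "%s-v%03d", m.group(1), next);
    }
}
```

With these rules, `AsgNames.nextGroup("app1-env-v009")` yields `app1-env-v010`, matching the slide. What happens after v999 is not covered here.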
  33. 33. Asgard devops procedures • Fast rollback • Canary testing • Red/Black pushes • More through REST interfaces – Adhoc processes but enforced through Asgard model • More coming using Glisten and Amazon SWF
  34. 34. Demo 34
  35. 35. Augmenting the ELB tier - Zuul • Zuul adds devops support in the front tier routing – Stress testing (squeeze testing) – Canary testing – Dynamic routing – Load shedding – Debugging • And some common function – Authentication – Security – Static response handling – Multi-region resiliency (DR for ELB tier) – Insight • Through dynamically deployable filters (written in Groovy) • Eureka aware using Ribbon and Archaius, as shown in runtime section [Diagram labels: Amazon ELB; Zuul edge service instances; filters]
  36. 36. Monitoring - Servo • Annotation based publishing through JMX of application metrics • Filters, Observers, and Pollers to publish metrics – Can export metrics to CloudWatch and other monitors • The entire Netflix monitoring infrastructure hasn’t been open sourced due to complexity and priority 36
  37. 37. A note on the next three projects • I haven’t personally worked with these projects • Given the audience, I included them as I believe they will be of interest
  38. 38. Edda • Polls Amazon config and stores the data in a queryable database • Provides a searchable view of Amazon deployments – Searchable in ways not possible from Amazon API’s • Provides a historical view – For correlation of problems to changes – Likely less of an issue in clouds that expose all changes
  39. 39. Ice • Cloud spend and usage analytics • Communicates with billing API to give bird’s-eye view of cloud spend with drill down to region, availability zone, and service team through application groups • Watches on-demand, used and unused reserved instances and instance sizes to help optimize • Not point in time – Shows trends to help predict future optimizations
  40. 40. Denominator • Java Library and CLI for cross DNS configuration • Allows for common, quicker (than using various DNS provider UI) and automated DNS updates • Plugins have been developed by various DNS providers 40
  41. 41. Agenda • How did I get here? • Netflix and Netflix OSS platform overview • Runtime components • Management components • Build components • Automated test and cleanliness components
  42. 42. Get baked! • Caution: Flame/troll bait ahead!! • Netflix takes the approach of baking images as part of build such that – Instance boot-up doesn’t depend on outside servers – Instance boot-up only starts servers already set to run – New code = new instances (never update instances in place) • Why? – Critical when launching hundreds of servers at a time – Goal to reduce the failure points in places where dynamic system configuration doesn’t provide value – Speed of elastic scaling, boot and go – Discourages ad hoc changes to server instances • Criticism – “Netflix is ruining the cloud” – Overhead of AMI’s for every code version – Ties to Amazon AMI’s (would this work for containers – I think yes) 42
  43. 43. AMInator • Starting image/volume – Foundational image created (maybe via loopback), base AMI with common software created/tested independently • Aminator running – Bakery – Bakery obtains a known EBS volume of the base image from a pool – Bakery mounts volume and provisions the application (apt/deb or yum/rpm) – Bakery snapshots and registers snapshot • Recent work to add other provisioning such as chef as plugins • I have used hand built AMI’s thus far, but blog states developers can go through CI builds and have running test instances within 15 minutes of code being checked in 43
  44. 44. Agenda • How did I get here? • Netflix and Netflix OSS platform overview • Runtime components • Management components • Build components • Automated test and cleanliness components
  45. 45. The Simian Army • A bunch of automated “monkeys” that perform automated system administration tasks • Anything that is done by a human more than once can and should be automated • Absolutely necessary at web scale 45
  46. 46. Good Monkeys • Janitor Monkey – Somewhat a mitigation for baking approach – Will mark and sweep unused resources (instances, volumes, snapshots, ASG’s, launch configs, images, etc.) – Owners notified, then removed • Conformity Monkey – Check instances are conforming to rules around security, ASG/ELB, age, status/health check, etc. 46
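The "mark and sweep ... owners notified, then removed" flow can be sketched as two passes over a resource inventory. This is plain Java for illustration, not the Simian Army code. `Janitor`, `mark`, `sweep`, the grace period and the resource ids are hypothetical, and deciding whether a resource is in use is left to the caller.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical mark-and-sweep cleaner; an illustration, not the Simian Army API.
class Janitor {
    private final long gracePeriodMillis;                          // time owners get to react
    private final Map<String, Long> markedAt = new HashMap<>();    // resource id -> time marked

    Janitor(long gracePeriodMillis) { this.gracePeriodMillis = gracePeriodMillis; }

    // Mark pass: flag newly unused resources and return them so owners can be notified.
    // Resources that are in use again are unmarked.
    List<String> mark(Map<String, Boolean> inUseById, long nowMillis) {
        List<String> newlyMarked = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : inUseById.entrySet()) {
            if (e.getValue()) {
                markedAt.remove(e.getKey());
            } else if (!markedAt.containsKey(e.getKey())) {
                markedAt.put(e.getKey(), nowMillis);
                newlyMarked.add(e.getKey());
            }
        }
        return newlyMarked;
    }

    // Sweep pass: return (and forget) resources that stayed marked past the grace period.
    List<String> sweep(long nowMillis) {
        List<String> toDelete = new ArrayList<>();
        for (Map.Entry<String, Long> e : markedAt.entrySet()) {
            if (nowMillis - e.getValue() >= gracePeriodMillis) {
                toDelete.add(e.getKey());
            }
        }
        markedAt.keySet().removeAll(toDelete);
        return toDelete;
    }
}
```

Separating the two passes is what gives owners their notification window: nothing is deleted in the same pass that first notices it is unused.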
  47. 47. Back to high availability • Failure is inevitable. Don’t try to avoid it! • How do you know if your backup is good? – Try to restore from your backup every so often – Better to ensure backup works before you have a crashed system and find out your backup is broken • How do you know if your system is HA? – Try to force failures every so often – Better to force those failures during office hours – Better to ensure HA before you have a down system and angry users – Best to learn from failures and add automated tests 47
  48. 48. Bad Monkeys • Open Sourced – Chaos Monkey – Used to randomly terminate instances – Now block network, burn cpu, kill processes, fail amazon api, fail dns, fail dynamo, fail s3, introduce network errors/latency, detach volumes, fill disk, burn I/O • Not yet open sourced – Chaos Gorilla • Kill all instances in an availability zone – Chaos Kong • Kill all instances in an entire region – Latency Monkey • Introduce latency into service calls directly (ribbon server side) 48
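The original Chaos Monkey behavior, randomly terminating instances, reduces to a random pick per group. The sketch below shows only that selection step in plain Java and is not the Simian Army code. `ChaosPicker`, the per-run probability and the instance ids are hypothetical, and actually terminating the victims is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical chaos picker; an illustration, not the Simian Army API.
class ChaosPicker {
    private final Random random;
    private final double probability;   // chance that a group loses one instance per run

    ChaosPicker(Random random, double probability) {
        this.random = random;
        this.probability = probability;
    }

    // For each non-empty group, maybe pick one random instance to terminate.
    List<String> pickVictims(Map<String, List<String>> instancesByGroup) {
        List<String> victims = new ArrayList<>();
        for (List<String> instances : instancesByGroup.values()) {
            if (instances.isEmpty()) {
                continue;
            }
            if (random.nextDouble() < probability) {
                victims.add(instances.get(random.nextInt(instances.size())));
            }
        }
        return victims;
    }
}
```

Injecting the `Random` makes runs reproducible in tests, and scheduling the run during office hours follows the advice on the earlier high availability slide.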
  49. 49. Agenda • Blah, blah, blah • How can I learn more? • How do I play with this? • Let’s write some code! 49
  50. 50. Want to play? • NetflixOSS blog and github – – • Acme Air, NetflixOSS AMI’s – Try Asgard/Eureka with a real application – • See what we ported to IBM Cloud (video) – • Fork and submit pull requests to Acme Air – 50