
Cloud Services Powered by IBM SoftLayer and NetflixOSS



This presentation covers our work, starting with Acme Air at web scale and moving on to operational lessons learned in HA, automatic recovery, continuous delivery, and operational visibility. It shows the port of the Netflix OSS cloud platform to IBM's cloud (SoftLayer) and the use of RightScale.

Published in: Technology

  1. 1. Public Cloud Services using IBM Cloud and Netflix OSS Jan 2014 Andrew Spyker @aspyker
  2. 2. Agenda • How did I get here? • Netflix overview, Netflix OSS teaser • Cloud services – High Availability – Automatic recovery – Continuous delivery – Operational visibility • Get started yourself
  3. 3. About me … • IBM STSM of Performance Architect and Strategy • Eleven years in performance in WebSphere – Led the App Server Performance team for years – Small sabbatical focused on IBM XML technology – Works in Emerging Technology Institute, CTO Office – Now cloud service operations • Email: – Blog: – Linkedin: – Twitter: – Github: • RTP dad who enjoys technology as well as running, wine and poker
  4. 4. Develop or maintain a service today? • Develop – yes • Maintain – starting • So far – Multiple services inside of IBM – Other services for use in our PaaS environment
  5. 5. What qualifies me to talk? • My monkey? • Of cloud prize ~ 40 entrants – Best example mash-up sample • Nomination and win – Best portability enhancement • Nomination – More on this coming … • Other nominees - • Other winners -
  6. 6. Seriously, how did I get here? • Experience with performance and scale on standardized benchmarks (SPEC/TPC) – Not representative of how to (web) scale • Pinning, biggest monolithic DB “wins”, hand tuned for fixed size – Out of date on modern architecture for mobile/cloud • Created Acme Air – • Demonstrated that we could achieve (web) scale runs – 4B+ mobile/browser requests/day – With modern mobile and cloud best practices
  7. 7. What was shown? • Peak performance and scale – You betcha! • Operational visibility – Only during the run via nmon collection and post-run visualization • True operational visibility – nope • Devops – nope • HA and DR – nope • Manual and automatic elastic scaling – nope
  8. 8. What next? • Went looking for the best industry practices that existed around devops and high availability at web scale – Many have documented via research papers and on – Google, Twitter, Facebook, Linkedin, etc. • Why Netflix? – Documented not only on their tech blog, but have also released working OSS on github – Also, given dependence on Amazon, they are a clear bellwether of web scale public cloud availability
  9. 9. Steps to NetflixOSS understanding • Recoded Acme Air application to make use of NetflixOSS runtime components • Worked to implement a NetflixOSS devops and high availability setup around Acme Air (on EC2) run at previous levels of scale and performance on IBM middleware • Worked to port NetflixOSS runtime and devops/high availability servers to IBM Cloud (SoftLayer) and RightScale • Through public collaboration with Netflix technical team – Google groups, github and meetups
  10. 10. Why? • To prove that an advanced cloud high availability and devops platform wasn’t “tied” to Amazon • To understand how we can advance IBM cloud platforms for our customers • To understand how we can host our IBM public cloud services better
  11. 11. Another cloud portability work of note • This presentation focuses on portability across public clouds • What about applicability to private cloud? • Paypal worked to port the cloud management system to OpenStack and Heat (Project Aurora) – • Additional work required to port runtime aspects as we did in public cloud
  12. 12. Agenda • How did I get here? • Netflix overview, Netflix OSS teaser • Cloud services – High Availability – Automatic recovery – Continuous delivery – Operational visibility • Get started yourself
  13. 13. My view of Netflix goals • As a business – Be the best streaming media provider in the world – Make best content deals based on real data/analysis • Technology wise – Have the most availability possible – “Stream starts per unit of time” is KPI measured for entire business – Deliver features to customers first in market • Requiring high velocity of IT change – Do all of this at web scale • Culture wise – Create a high performance delivery culture that attracts top talent
  14. 14. Standing on the shoulders of giants • Public Cloud (Amazon) – When adding streaming, Netflix decided they • Shouldn’t invest in building data centers worldwide • Had to plan for the streaming business to be very big – Embraced cloud architecture paying only for what they need • Open Source – Many parts of runtime depend on open source • Linux, Apache Tomcat, Apache Cassandra, etc. • Requires top technical talent and OSS committers – Realized that Amazon wasn’t enough • Started a cloud platform on top that would eventually be open sourced - NetflixOSS
  15. 15. NetflixOSS on Github • “Technical indigestion as a service” – Adrian Cockcroft • – 40+ OSS projects – Expanding every day • Focusing more on interactive midtier server technology today …
  16. 16. Agenda • How did I get here? • Netflix overview, Netflix OSS teaser • Cloud services – High Availability – Automatic recovery – Continuous delivery – Operational visibility • Get started yourself
  17. 17. High Availability Thoughts • Three of every part of your architecture – EVERYTHING in your architecture (including IaaS components) – Likely more via clustering/partitioning – One = SPOF – Two = slow active/standby recovery – Three = where you get zero downtime when failures occur • All parts of application should fail independently – No one part should take down entire application – When linked, highest availability is limited to lowest availability component – Apply circuit breaker pattern to isolate systems • If a part of the system results in total end user failure – Use partitioning to ensure only some smaller percentage of users are affected
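
The "one, two, three" argument on the slide above can be made concrete with a small worked formula. This is a sketch under a strong assumption that is not in the original deck: components fail independently, and each replica has the same availability $A$. The 99% figure is purely illustrative.

```latex
% Linked (serial) components: every one must be up, so the chain is
% never more available than its weakest member.
A_{\text{chain}} = \prod_{i=1}^{k} A_i \;\le\; \min_i A_i

% n redundant replicas of one component: the tier is down only if all n are down.
A_{\text{tier}} = 1 - (1 - A)^{n}

% Illustration with A = 0.99:
%   three linked components:  0.99^3 = 0.970299
%   three redundant replicas: 1 - (0.01)^3 = 0.999999
```

Real failures are often correlated (a shared rack, zone or software bug), which is why the slide also asks for independent failure domains and partitioning.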
  18. 18. Failure • What is failing? – Underlying IaaS problems • Instances, racks, availability zones, regions – Software issues • Operating system, servers, application code – Surrounding services • Other application services, DNS, user registries, etc. • How is a component failing? – Fails and disappears altogether – Intermittently fails – Works, but is responding slowly – Works, but is causing users a poor experience
  19. 19. Overview of IaaS HA • Launch instances into availability zones – Instances of various sizes (compute, storage, etc.) • Organized into regions and availability zones – Availability zones are isolated from each other – Availability zones are connected w/ low-latency links – Regions contain availability zones – Regions independent of each other – Regions have higher latency to each other • This gives a high level of resilience to outages – Unlikely to affect multiple availability zones or regions • Cloud providers require customer be aware of this topology to take advantage of its benefits within their application [Diagram: the Internet connects to Region (Dallas) and a second region, each containing three datacenters/availability zones]
  20. 20. Acme Air as a sample [Diagram, greatly simplified: ELB, Web App Front End (REST services), App Service (Authentication), Data Tier]
  21. 21. Micro-services architecture • Decompose system into isolated services that can be developed separately • Why? – They can fail independently vs. fail together monolithically – They can be developed and released with different velocities by different teams • To show this we created a separate “auth service” for Acme Air • In a typical customer facing application any single front end invocation could spawn 20-30 calls to services and data sources
  22. 22. How do services advertise themselves? • Upon web app startup, Karyon server is started – Karyon will configure (via Archaius) the application – Karyon will register the location of the instance with Eureka • Others can know of the existence of the service • Lease based so instances continue to check in updating list of available instances – Karyon will also expose a JMX console, healthcheck URL • Devops can change things about the service via JMX • The system can monitor the health of the instance [Diagram: Karyon in the App Service (Authentication) app server registers name, port, IP address and healthcheck URL with the Eureka server(s); Archaius stores may be remote]
  23. 23. How do consumers find services? • Service consumers query Eureka at startup and periodically to determine location of dependencies – Can query based on availability zone and cross availability zone [Diagram: the Eureka client in the Web App Front End (REST services) app server asks the Eureka server(s) “What ‘auth-service’ instances exist?”]
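
The lease-based registration and lookup described on the two slides above can be sketched in plain Java. This is a minimal illustration of the idea only, not Eureka's API: the class `LeaseRegistrySketch`, its methods, the 30-second lease and the addresses are all hypothetical, and time is passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical lease-based service registry; illustrates the idea, not Eureka's API.
class LeaseRegistrySketch {
    private final long leaseMillis;
    // service name -> (instance address -> time of last check-in)
    private final Map<String, Map<String, Long>> leases = new ConcurrentHashMap<>();

    LeaseRegistrySketch(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    // Called at startup and again on every heartbeat to renew the lease.
    void register(String service, String address, long nowMillis) {
        leases.computeIfAbsent(service, s -> new ConcurrentHashMap<>()).put(address, nowMillis);
    }

    // Returns, in sorted order, only the instances whose lease has not expired.
    List<String> lookup(String service, long nowMillis) {
        Map<String, Long> instances = leases.getOrDefault(service, Collections.emptyMap());
        List<String> live = new ArrayList<>();
        for (Map.Entry<String, Long> entry : instances.entrySet()) {
            if (nowMillis - entry.getValue() <= leaseMillis) {
                live.add(entry.getKey());
            }
        }
        Collections.sort(live);
        return live;
    }

    public static void main(String[] args) {
        LeaseRegistrySketch registry = new LeaseRegistrySketch(30_000);
        registry.register("auth-service", "10.0.0.1:8080", 0);
        registry.register("auth-service", "10.0.0.2:8080", 0);
        registry.register("auth-service", "10.0.0.1:8080", 25_000); // heartbeat renews the lease
        // 10.0.0.2 never checked in again, so its lease has lapsed by t = 40s.
        System.out.println(registry.lookup("auth-service", 40_000));
    }
}
```

An instance that stops checking in simply drops out of lookups once its lease lapses, so consumers need no explicit deregistration step to stop seeing dead instances.
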
  24. 24. Demo
  25. 25. How does the consumer call the service? • Protocol implementations have Eureka-aware load balancing support built in – In-client load balancing, does not require a separate LB tier • Ribbon – REST client – Pluggable load balancing scheme – Built in failure recovery support (retry next server, mark instance as failing, etc.) • Other Eureka enabled clients – Custom code in non-Java or Ribbon enabled systems (Java or pure REST) – More from Netflix • Memcached (EVCache), Astyanax (Cassandra and Priam) coming [Diagram: the Web App Front End (REST services) calls “auth-service” through the Ribbon REST client and Eureka client, which balance across several App Service (Authentication) instances]
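
The "retry next server, mark instance as failing" behaviour mentioned above can be sketched as a tiny in-client load balancer. This is an illustrative stand-in, not Ribbon's API: `ClientSideBalancerSketch`, the round-robin rule and the addresses are assumptions, and a real client would also un-mark servers after a while and refresh its list from the registry.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical in-client load balancer; illustrates the idea, not Ribbon's API.
class ClientSideBalancerSketch {
    private final List<String> servers;
    private final Set<String> markedDown = new HashSet<>();
    private int next = 0;

    ClientSideBalancerSketch(List<String> servers) {
        this.servers = servers;
    }

    // Round-robin over the servers, skipping any marked as failing.
    // On failure, mark the server and retry the request on the next one.
    <T> T call(Function<String, T> request) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < servers.size(); attempt++) {
            String server = servers.get(next);
            next = (next + 1) % servers.size();
            if (markedDown.contains(server)) {
                continue;
            }
            try {
                return request.apply(server);
            } catch (RuntimeException e) {
                markedDown.add(server);
                last = e;
            }
        }
        throw new IllegalStateException("no healthy server available", last);
    }

    public static void main(String[] args) {
        ClientSideBalancerSketch lb =
                new ClientSideBalancerSketch(Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3"));
        Function<String, String> request = server -> {
            if (server.equals("10.0.0.1")) {
                throw new RuntimeException("connection refused");
            }
            return "token from " + server;
        };
        for (int i = 0; i < 3; i++) {
            System.out.println(lb.call(request));
        }
    }
}
```

Because the balancing decision lives in the caller, no separate load balancer tier sits between the web app and the auth service.
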
  26. 26. PS. This is a common pattern • Same idea, but different implementations –’s SmartStack • Zookeeper/Synapse/Nerve/HAProxy –’s clustering • Zookeeper/Nginx
  27. 27. How to deploy this with HA? Instances? • Asgard deploys across AZs • Using auto scaling groups managed by Asgard • More on Asgard later Eureka? • DNS and Elastic IP trickery • Deployed across AZs • For clients to find Eureka servers – DNS TXT record for domain lists AZ TXT records – AZ TXT records have list of Eureka servers • For new Eureka servers – Look for list of Eureka server IP’s for the AZ it’s coming up in – Look for unassigned elastic IP’s, grab one and assign it to itself – Sync with other already assigned IP’s that likely are hosting Eureka server instances • Simpler configurations with less HA are available
  28. 28. Protect yourself from unhealthy services • Wrap all calls to services with Hystrix command pattern – Hystrix implements circuit breaker pattern – Executes command using semaphore or separate thread pool to guarantee return within finite time to caller – If an unhealthy service is detected, start to call fallback implementation (broken circuit) and periodically check if main implementation works (reset circuit) • Hystrix also provides caching, request collapsing with synchronous and asynchronous (reactive via RxJava) invocation [Diagram: the Web App Front End (REST services) executes the “auth-service” call through Hystrix, which either calls the App Service (Authentication) instances via the Ribbon REST client or runs the fallback implementation]
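
The broken-circuit and reset-circuit behaviour described above can be sketched as a small state machine. This is a simplified illustration of the circuit breaker pattern, not Hystrix's API: `CircuitBreakerSketch`, the consecutive-failure threshold and the retry window are assumptions, and the thread-pool or semaphore isolation that bounds call time is deliberately left out.

```java
import java.util.function.Supplier;

// Hypothetical minimal circuit breaker; illustrates the pattern, not Hystrix's API.
class CircuitBreakerSketch {
    private final int failureThreshold;
    private final long retryAfterMillis;
    private int consecutiveFailures = 0;
    private long openedAtMillis = 0;

    CircuitBreakerSketch(int failureThreshold, long retryAfterMillis) {
        this.failureThreshold = failureThreshold;
        this.retryAfterMillis = retryAfterMillis;
    }

    // Open means: too many failures, and the retry window has not yet elapsed.
    boolean isOpen(long nowMillis) {
        return consecutiveFailures >= failureThreshold
                && nowMillis - openedAtMillis < retryAfterMillis;
    }

    <T> T execute(Supplier<T> primary, Supplier<T> fallback, long nowMillis) {
        if (isOpen(nowMillis)) {
            return fallback.get(); // broken circuit: do not touch the unhealthy service
        }
        try {
            T result = primary.get(); // normal call, or a trial call after the retry window
            consecutiveFailures = 0;  // success resets the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAtMillis = nowMillis; // open (or re-open) the circuit
            }
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        CircuitBreakerSketch breaker = new CircuitBreakerSketch(2, 1_000);
        Supplier<String> failing = () -> { throw new RuntimeException("auth-service timeout"); };
        Supplier<String> fallback = () -> "anonymous";
        System.out.println(breaker.execute(failing, fallback, 0));
        System.out.println(breaker.execute(failing, fallback, 1));
        System.out.println("open at t=2ms: " + breaker.isOpen(2));
        System.out.println(breaker.execute(() -> "user-42", fallback, 1_001));
    }
}
```

While the circuit is open the caller gets the fallback immediately, which keeps a slow dependency from tying up the front end's request threads.
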
  29. 29. Denominator • Most (simple) geographic (region) based disaster recovery depends on front end DNS traffic switching • Java library and CLI for cross DNS configuration • Allows for common, quicker (than using various DNS provider UI) and automated DNS updates • Plugins have been developed by various DNS providers
  30. 30. Augmenting the ELB tier - Zuul • Originally developed to do cross region routing for regional HA – Advanced geographic (region) based disaster recovery • Zuul also adds devops support in the front tier routing – Stress testing (squeeze testing) – Canary testing – Dynamic routing – Load shedding – Debugging • And some common function – Authentication – Security – Static response handling – Multi-region resiliency (DR for ELB tier) – Insight • Through dynamically deployable filters (written in Groovy) • Eureka aware using Ribbon and Archaius as shown in runtime section [Diagram: in each of Region 1 and Region 2, load balancers front a Zuul edge service made up of several Zuul instances running filters]
  31. 31. HA in application architecture • Stateless application design – Legacy application design has state – Temporal state should be pushed to caching servers – Durable state should be pushed to partitioned data servers – Trades off peak latency for uptime (sometimes no trade off) • Partitioned data servers – Wealth of NoSQL servers available today – Be careful of oversold “consistency” promises • Look for third party “Jepsen-like” testing – Be ready to deal with compensating approaches – Consider differences in system of record vs. interaction data stores
  32. 32. Agenda • How did I get here? • Netflix overview, Netflix OSS teaser • Cloud services – High Availability – Automatic recovery – Continuous delivery – Operational visibility • Get started yourself
  33. 33. Automatic Recovery Thoughts • Automatic recovery depends on elastic, ephemeral instance cluster design powered by “auto scaling” • If something fails once, it will fail again • No repeated failure should be a pager call – Instead should be email with automated recovery information to be analyzed offline • Test failure on your system before the system tests your failure
  34. 34. Auto Scaling (for the masses) • For many, auto scaling is more auto recovery – Far more important to keep N instances running than be able to scale automatically to 2N, 10N, 100N • For many, automatic scaling isn’t appropriate – First understand how the system can be elastically scaled with operator expertise manually
  35. 35. Asgard • Asgard is the console for automatic scaling and recovery [Diagram: Region (Dallas) with three datacenters/availability zones, each running Web App (REST Services) and App Service (Authentication) instances; Asgard tells the IaaS to start these instances and keep this many instances running]
  36. 36. Asgard creates an “application” • Enforces common practices for deploying code – Common approach to linking auto scaling groups to launch configurations, load balancers, security groups, scaling policies and images • Adds missing concept to the IaaS domain model – “application” – Apps clustering and application lifecycle vs. individually launched and managed images • Example – Application – app1 – Cluster – app1-env – Asgard group version n – app1-env-v009 – Asgard group version n+1 – app1-env-v010
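
The naming example on the slide above (cluster `app1-env`, group versions `app1-env-v009` and `app1-env-v010`) can be sketched as a pair of helper functions. This only mirrors the pattern visible in the slide; `GroupNameSketch` and its methods are hypothetical, not Asgard code, and what happens after `v999` is not covered by the source, so the sketch rejects it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helpers for the "<cluster>-vNNN" naming pattern shown on the slide.
class GroupNameSketch {
    private static final Pattern VERSIONED = Pattern.compile("(.+)-v(\\d{3})");

    private static Matcher parse(String groupName) {
        Matcher m = VERSIONED.matcher(groupName);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a versioned group name: " + groupName);
        }
        return m;
    }

    // "app1-env-v009" -> "app1-env"
    static String clusterOf(String groupName) {
        return parse(groupName).group(1);
    }

    // "app1-env-v009" -> "app1-env-v010"
    static String nextGroupName(String groupName) {
        Matcher m = parse(groupName);
        int next = Integer.parseInt(m.group(2)) + 1;
        if (next > 999) {
            throw new IllegalStateException("version overflow is outside this sketch");
        }
        return String.format("%s-v%03d", m.group(1), next);
    }

    public static void main(String[] args) {
        System.out.println(clusterOf("app1-env-v009"));
        System.out.println(nextGroupName("app1-env-v009"));
    }
}
```

Keeping the old and new groups side by side under one cluster name is what makes the rollback and red/black procedures later in the deck possible.
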
  37. 37. When to test recovery (and HA)? • Failure is inevitable. Don’t try to avoid it! • How do you know if your backup is good? – Try to restore from your backup every so often – Better to ensure backup works before you have a crashed system and find out your backup is broken • How do you know if your system is HA? – Try to force failures every so often – Better to force those failures during office hours – Better to ensure HA before you have a down system and angry users – Best to learn from failures and add automated tests
  38. 38. The Simian Army • A bunch of automated “monkeys” that perform automated system administration tasks • Anything that is done by a human more than once can and should be automated • Absolutely necessary at web scale
  39. 39. Bad Monkeys • Open Sourced – Chaos Monkey – Used to randomly terminate instances – Now block network, burn CPU, kill processes, fail Amazon API, fail DNS, fail Dynamo, fail S3, introduce network errors/latency, detach volumes, fill disk, burn I/O • Not yet open sourced – Chaos Gorilla • Kill datacenter/availability zone instances – Chaos Kong • Kill all instances in an entire region – Latency Monkey • Introduce latency into service calls directly (Ribbon server side) – Split Brain Monkey • Datacenters/availability zones continue to operate, but isolated from each other
  40. 40. Elastic Scale • Basic elastic scale required to achieve high availability – To run three or more of any component • Front tier specific considerations – Will likely need to scale far higher than micro-services – Use distributed caching with TTL where appropriate – Otherwise micro-service architecture could overload data servers • Scaling larger (or Web Scale) will find bottlenecks that require changes to architecture and/or tuning – Iterative process of improvement
  41. 41. Elastic scaling in application architecture • Clusters that replicate data within the cluster must discover new peers (and time out dead ones) • Clusters that connect to other clusters must discover new dependency instances (and time out dead ones) • Many legacy architectures contain static cluster definitions that require “re-starts” to update information – Code changes required to leverage dynamic connectivity
  42. 42. Full Auto Scaling • Eventually web scale will require auto scaling based on policy – Attach policy based on request latency, utilization, queue depth, etc. • Words of caution, be careful to – Design policies to be proactive on scale up or risk scaling that isn’t fast enough to keep up with demand – Design policies to be generous on scale down or risk over-scaling down and immediate need for scale up
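
The two words of caution above amount to an asymmetric policy: add capacity early and in big steps, remove it slowly. A minimal sketch of such a rule follows. The thresholds (60% and 20% average utilization), the step sizes and the name `ScalingPolicySketch` are illustrative assumptions, not values from the talk or from any IaaS auto scaling API.

```java
// Hypothetical utilization-based scaling rule with asymmetric scale-up and scale-down.
class ScalingPolicySketch {
    static final double SCALE_UP_ABOVE = 0.60;   // act well before saturation
    static final double SCALE_DOWN_BELOW = 0.20; // only shrink when clearly idle

    static int desiredInstances(int current, double avgUtilization, int min, int max) {
        int desired = current;
        if (avgUtilization > SCALE_UP_ABOVE) {
            desired = current + Math.max(1, current / 2); // proactive: grow by about 50%
        } else if (avgUtilization < SCALE_DOWN_BELOW) {
            desired = current - 1;                        // cautious: shrink one at a time
        }
        return Math.max(min, Math.min(max, desired));     // stay inside the group's bounds
    }

    public static void main(String[] args) {
        System.out.println(desiredInstances(4, 0.70, 3, 20)); // busy: grow
        System.out.println(desiredInstances(4, 0.40, 3, 20)); // in band: hold
        System.out.println(desiredInstances(4, 0.10, 3, 20)); // idle: shrink slowly
    }
}
```

The minimum of three in the example follows the deck's "three of everything" rule, so scale-down never removes the redundancy that high availability depends on.
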
  43. 43. Scaling Continues to Evolve • Reactive auto scaling is “easy” but naïve – Instances fail – Unexpected spike in demand • What if your traffic is “predictable”? Consider – User population follows a daily pattern – User population known to follow different patterns each day (work days vs. weekends) – End of month influx of work • Scryer is Netflix’s predictive analytics to not wait for reactive scaling – Better end user experience, less over deployment (cheaper), more consistent utilization (cheaper) – Not yet open sourced
  44. 44. Agenda • How did I get here? • Netflix overview, Netflix OSS teaser • How to grade public cloud services – High Availability – Automatic recovery – Continuous delivery – Operational visibility • Get started yourself
  45. 45. Thoughts on Continuous Delivery • Legacy waterfall habits are hard to break – “Leaks” of old world continue to show – Especially if product has to be released in “shrink wrapped” form in parallel • Netflix approach and technology assists breaking these habits – Provide the tools and proof points and the organization will follow
  46. 46. Continuous Delivery Pipeline • Developers – Perform local testing before checking code into continuous build • Continuous build – Builds code, tests code and flags any breaks for immediate attention – Builds packages ready for image installation • Image bakery – Builds images for deployment that then show up in Asgard • Continuous deployment – Images deployed through Asgard – Instances are given image and environmental context from Asgard • Same images should be used in production that are used in test • Due to micro-services (API as contract) approach – No need to co-ordinate typical deployments across teams
  47. 47. Asgard devops procedures • Fast rollback • Canary testing • Red/Black pushes • More through REST interfaces – Adhoc processes allowed, enforced through Asgard model • More coming using Glisten and workflow services
  48. 48. Demo
  49. 49. Ability to reconfigure - Archaius • Using dynamic properties, can easily change properties across cluster of applications, either – NetflixOSS named props • Hystrix timeouts for example – Custom dynamic props • High throughput achieved by polling approach • HA of configuration source dependent on what source you use [Diagram labels: Application, Runtime, Hierarchy, URL, JMX, Karyon Console, Persisted DB, Application Props, Libraries, Container (HTTP server, database, etc.)]
       DynamicIntProperty prop = DynamicPropertyFactory.getInstance().getIntProperty("myProperty", DEFAULT_VALUE);
       int value = prop.get(); // value will change over time based on configuration
  50. 50. Get baked! • Caution: Flame/troll bait ahead!! – Criticism – “Netflix is ruining the cloud” • Overhead of images for every code version • Ties to Amazon AMI’s (have proven this tie can be broken) • Netflix takes the approach of baking images as part of build such that – Instance boot-up doesn’t depend on outside servers – Instance boot-up only starts servers already set to run – New code = new instances (never update instances in place) • Why? – Critical when launching hundreds of servers at a time – Goal to reduce the failure points in places where dynamic system configuration doesn’t provide value – Speed of elastic scaling, boot and go – Discourages ad hoc changes to server instances
  51. 51. AMInator • Starting image/volume – Foundational image created (maybe via loopback), base AMI with common software created/tested independently • Aminator running – Bakery – Bakery obtains a known EBS volume of the base image from a pool – Bakery mounts volume and provisions the application (apt/deb or yum/rpm) – Bakery snapshots and registers snapshot • Recent work to add other provisioning such as chef as plugins
  52. 52. Imaginator • Implementation of Aminator – For IBM SoftLayer cloud • Creates image templates – Starts from base OS and adds deb/rpm’s • Snapshots images for later deployment • Not yet open sourced
  53. 53. Good Monkeys • Janitor Monkey – Somewhat a mitigation for baking approach – Will mark and sweep unused resources (instances, volumes, snapshots, ASG’s, launch configs, images, etc.) – Owners notified, then removed • Conformity Monkey – Check instances are conforming to rules around security, ASG/ELB, age, status/health check, etc.
  54. 54. Agenda • How did I get here? • Netflix overview, Netflix OSS teaser • Cloud services – High Availability – Automatic recovery – Continuous delivery – Operational visibility • Get started yourself
  55. 55. Thoughts on Operational Visibility • Programming model to expose metrics should be simple • Systems need to expose internals in a way that is sensible to the owners and operators • The tools that view the internals need to match the level of abstraction developers care about • The tools must give sufficient context when viewing any single metric or alert
  56. 56. Monitoring - Servo • Annotation based publishing through JMX of application metrics • Gauges, counters, and timers • Filters, Observers, and Pollers to publish metrics – Can export metrics to metric collection servers • Netflix exposes their metrics to Atlas – The entire Netflix monitoring infrastructure hasn’t been open sourced due to complexity and priority
  57. 57. Back to Hystrix • Main reason for Hystrix is to protect yourself from dependencies, but … • Same layer of indirection to services can provide visualization • You can aggregate the view across clusters via Turbine • Other alert systems and dashboards can read from Turbine
  58. 58. Edda • IaaS does not typically provide – Historical views of the state of the system – All views between components an operator might want to see • Edda polls current state and stores the data in a queryable database • Provides an ad hoc queryable view of all deployment aspects • Provides a historical view – For correlation of problems to changes – Becoming a more commonplace feature in cloud
  59. 59. Ice • Cloud spend and usage analytics • Communicates with billing API to give a bird’s-eye view of cloud spend with drill down to region, availability zone, and service team through application groups • Watches differently priced instances and instance sizes to help optimize • Not point in time – Shows trends to help predict future optimizations
  60. 60. Agenda • Blah, blah, blah • How can I learn more? • How do I play with this? • Let’s write some code!
  61. 61. Want to play? • NetflixOSS blog and github – – • NetflixOSS as ported to IBM Cloud – – SoftLayer Image Templates coming soon • Acme Air, NetflixOSS AMI’s – Try Asgard/Eureka with a real application – • See what we ported to IBM Cloud (video) – • Fork and submit pull requests to Acme Air – • Thanks! Questions?