Gluecon Monitoring Microservices and Containers: A Challenge

Problems with monitoring in general and with microservices and containers in particular, plus an explanation of the Spigo and Simianviz simulation tools.

Published in: Technology

  1. 1. Monitoring Microservices & Containers: A Challenge Adrian Cockcroft @adrianco Technology Fellow - Battery Ventures May 2015
  2. 2. Monitoring: update of my monitoring rules from Monitorama 2014
  3. 3. Rule #1: Spend more time working on code that analyzes the meaning of metrics, than code that collects, moves, stores and displays metrics.
  4. 4. Rule #2: Metric-to-display latency needs to be less than the human attention span (~10s)
  5. 5. Rule #3: Validate that your measurement system has enough accuracy and precision. Collect histograms of response time.
  6. 6. Rule #4: Monitoring systems need to be more available and scalable than the systems being monitored.
  7. 7. Rule #5: Optimize for distributed, ephemeral, cloud native, containerized microservices.
  8. 8. Rule #6: Fit metrics to models to understand relationships. (New rule)
  9. 9. Model Infrastructure as a Containment Hierarchy. Diagram labels: Region, Zone/DC, Machine, Instance, Container, Microservice. E.g. a machine failure affects all instances and containers inside it. Many tools use a naming scheme to imply this model, but most can’t reason about the relationships.
  10. 10. Model Applications and Networks as a Dataflow Graph. APM tools often model these as business transactions. Diagram labels: Request, Microservice, Zone/DC, Region.
  11. 11. Model Deployment Ownership and Support. Diagram: Developers.
  12. 12. Model Deployment Ownership and Support. Diagram adds: Microservices.
  13. 13. Model Deployment Ownership and Support. Diagram adds: Monitoring Tools.
  14. 14. Model Deployment Ownership and Support. Diagram adds: more Developers.
  15. 15. Model Deployment Ownership and Support. Diagram adds: Site Reliability. Availability Metrics: 99.95% customer success rate.
  16. 16. Model Deployment Ownership and Support. Diagram adds: Managers.
  17. 17. Model Deployment Ownership and Support. Diagram adds: VP Engineering.
  18. 18. Infrastructure, flow and ownership models are orthogonal and need to be linked to make sense of the metrics
  19. 19. Monitoring Rules by @adrianco
        1. Spend more time on analysis than data collection and display
        2. Reduce key business metric latency to less than 10s
        3. Validate your measurement system, use histograms
        4. Be more available and scalable than the services being monitored
        5. Optimize for distributed, ephemeral cloud native applications
        6. Fit metrics to models to understand relationships
  20. 20. Microservices
  21. 21. Microservices @ideavist
  22. 22. A Microservice Definition: loosely coupled service-oriented architecture with bounded contexts.
  23. 23. A Microservice Definition: loosely coupled service-oriented architecture with bounded contexts. If every service has to be updated at the same time, it’s not loosely coupled.
  24. 24. A Microservice Definition: loosely coupled service-oriented architecture with bounded contexts. If every service has to be updated at the same time, it’s not loosely coupled. If you have to know too much about surrounding services, you don’t have a bounded context. See the Domain Driven Design book by Eric Evans.
  25. 25. Complexity
  26. 26. Monolithic apps have unlimited invisible internal dependencies. Vastly more complex than explicit, visible microservice dependencies.
  27. 27. Speed
  28. 28. Speeding Up Deployments. Datacenter Snowflakes: deploy in months, live for years.
  29. 29. Speeding Up Deployments. Datacenter Snowflakes: deploy in months, live for years. Virtualized and Cloud: deploy in minutes, live for weeks.
  30. 30. Speeding Up Deployments. Datacenter Snowflakes: deploy in months, live for years. Virtualized and Cloud: deploy in minutes, live for weeks. Container Deployments: deploy in seconds, live for minutes/hours.
  31. 31. Speeding Up Deployments. Datacenter Snowflakes: deploy in months, live for years. Virtualized and Cloud: deploy in minutes, live for weeks. Container Deployments: deploy in seconds, live for minutes/hours. AWS Lambda Events: respond in milliseconds, live for seconds.
  32. 32. Speeding Up Deployments. Datacenter Snowflakes: deploy in months, live for years. Virtualized and Cloud: deploy in minutes, live for weeks. Container Deployments: deploy in seconds, live for minutes/hours. AWS Lambda Events: respond in milliseconds, live for seconds. Measuring CPU usage once a minute makes no sense for containers… Coping with rate of change is a big challenge for monitoring tools.
  33. 33. Scale
  34. 34. A Possible Hierarchy (level: how many?). Continents: 3 to 5. Regions: 2-4 per continent. Zones: 1-5 per region. Services: 100’s per zone. Versions: many per service. Containers: 1000’s per version. Instances: 10,000’s. It’s much more challenging than just a large number of machines.
  35. 35. Flow
  36. 36. Some tools can show the request flow across a few services
  37. 37. But interesting architectures have a lot of microservices! Flow visualization is a challenge. See http://www.slideshare.net/LappleApple/gilt-from-monolith-ruby-app-to-micro-service-scala-service-architecture
  38. 38. Failures
  39. 39. Simple NetflixOSS style microservices architecture on three AWS Availability Zones. Tiers: ELB Load Balancer, Zuul API Proxy, Karyon Business Logic, Staash Data Access Layer, Priam Cassandra Datastore.
  40. 40. Simple NetflixOSS style microservices architecture on three AWS Availability Zones. Tiers: ELB Load Balancer, Zuul API Proxy, Karyon Business Logic, Staash Data Access Layer, Priam Cassandra Datastore.
  41. 41. Simple NetflixOSS style microservices architecture on three AWS Availability Zones. Tiers: ELB Load Balancer, Zuul API Proxy, Karyon Business Logic, Staash Data Access Layer, Priam Cassandra Datastore. Zone partition/failure: What should you do? What should monitors show?
  42. 42. Simple NetflixOSS style microservices architecture on three AWS Availability Zones. Tiers: ELB Load Balancer, Zuul API Proxy, Karyon Business Logic, Staash Data Access Layer, Priam Cassandra Datastore. Zone partition/failure: What should you do? What should monitors show? By design, everything works with 2 of 3 zones running. This is not an outage, inform but don’t touch anything! Halt deployments perhaps?
  43. 43. Simple NetflixOSS style microservices architecture on three AWS Availability Zones. Tiers: ELB Load Balancer, Zuul API Proxy, Karyon Business Logic, Staash Data Access Layer, Priam Cassandra Datastore. Zone partition/failure: What should you do? What should monitors show? By design, everything works with 2 of 3 zones running. This is not an outage, inform but don’t touch anything! Halt deployments perhaps? Challenge: understand and communicate common microservice failure patterns.
  44. 44. Testing
  45. 45. Testing monitoring tools at scale gets expensive quickly…
  46. 46. Simulation
  47. 47. Simulated Microservices. Model and visualize microservices. Simulate interesting architectures. Generate large scale configurations. Eventually stress test real tools. See github.com/adrianco/spigo (Simulate Protocol Interactions in Go). Visualize with D3. Diagram: ELB Load Balancer, Zuul API Proxy, Karyon Business Logic, Staash Data Access Layer, Priam Cassandra Datastore, Three Availability Zones.
  48. 48. netflixoss.go architecture
        asgard.Create(cname, asgard.PriamCassandraPkg, regions, priamCassandracount, "eureka", cname)
        asgard.Create(tname, asgard.StaashPkg, regions, staashcount, cname)
        asgard.Create(jname, asgard.KaryonPkg, regions, javacount, tname)
        asgard.Create(nname, asgard.KaryonPkg, regions, nodecount, jname)
        asgard.Create(zuname, asgard.ZuulPkg, regions, zuulcount, nname)
        asgard.Create(elbname, asgard.ElbPkg, regions, 0, zuname)
        asgard.Run(asgard.Create(dns, asgard.DenominatorPkg, 0, 0, elbname), jname) // victimize a javaweb
      Callouts: Tooling; New tier name; Tier package; Region count: 1; Node count; List of tier dependencies.
  49. 49. Run and log results to json
        $ spigo -a netflixoss -d 10 -j
        2015/05/21 00:05:32 netflixoss: scaling to 100%
        2015/05/21 00:05:32 netflixoss.edda: starting
        2015/05/21 00:05:32 netflixoss.us-east-1.zoneA.eureka.eureka.eureka0: starting
        2015/05/21 00:05:32 netflixoss.us-east-1.zoneB.eureka.eureka.eureka1: starting
        2015/05/21 00:05:32 netflixoss.us-east-1.zoneC.eureka.eureka.eureka2: starting
        2015/05/21 00:05:32 netflixoss.*.*.www.denominator.www0 activity rate 10ms
        2015/05/21 00:05:37 chaosmonkey delete: netflixoss.us-east-1.zoneC.javaweb.karyon.javaweb14
        2015/05/21 00:05:42 asgard: Shutdown
        2015/05/21 00:05:42 netflixoss.us-east-1.zoneB.eureka.eureka.eureka1: closing
        2015/05/21 00:05:42 netflixoss.us-east-1.zoneA.eureka.eureka.eureka0: closing
        2015/05/21 00:05:42 netflixoss.us-east-1.zoneC.eureka.eureka.eureka2: closing
        2015/05/21 00:05:42 spigo: complete
        2015/05/21 00:05:42 netflixoss.edda: closing
      Callouts: 10 sec run time; edda.go logs config to json; eureka.go service registry per zone; Chaos monkey victim!
  50. 50. Simianviz from json logs: http://simianviz.divshot.io/netflixoss/1 Diagram callouts: ELB splits traffic over zones in single region; microservices; Cassandra Cluster; Six regions. Big thanks to @kurtiskemple
  51. 51. Why Build Spigo? Generate test microservice configurations at scale. Stress monitoring tools and simulated game day training. Eventually (i.e. not implemented yet): dynamically vary configuration (autoscale, code push); Chaos gorilla for zone, region failures and partitions; websocket connection between spigo and simianviz display.
  52. 52. My challenge to you: Build your architecture in Spigo. Stress monitoring tools with it. Help fix monitoring for microservices! @mgroeniger
  53. 53. Questions? Disclosure: some of the companies mentioned may be Battery Ventures portfolio companies. See www.battery.com for a list of portfolio investments. Topics: Microservices Challenges; Speed and Scale; Flow and Failures; Testing and Simulation. Links: Battery Ventures http://www.battery.com; Adrian’s Tweets @adrianco and Blog http://perfcap.blogspot.com; Slideshare http://slideshare.com/adriancockcroft; Github http://github.com/adrianco/spigo
  54. 54. What does @adrianco do? Technology Due Diligence on Deals; Presentations at Conferences; Presentations at Companies; Technical Advice for Portfolio Companies; Program Committee for Conferences; Networking with Interesting People; Tinkering with Technologies; Maintain Deep Relationship with Cloud Vendors.
  55. 55. Battery Ventures Portfolio Companies for Enterprise IT. Categories shown: Security; Enterprise IT Operations & Management; Big Data; Compute; Networking; Storage. Company named: Palo Alto Networks. Visit http://www.battery.com/our-companies/ for a full list of all portfolio companies in which all Battery Funds have invested.
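
Example for Rule #3 (collect histograms of response time). This is a minimal sketch in Go, not code from the talk or from spigo. The type `Histogram`, the functions `bucketFor`, `Observe` and `Quantile`, and the power-of-two millisecond bucket layout are all assumptions chosen for illustration. It shows why a histogram is worth collecting: one slow request barely moves the median bucket but is clearly visible at the top quantile.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

const numBuckets = 16

// Histogram counts observations in buckets whose upper bounds double
// from 1ms upward (1ms, 2ms, 4ms, ...). The layout is an assumption
// for illustration only.
type Histogram struct {
	counts [numBuckets]uint64
	total  uint64
}

// bucketFor returns the index of the first bucket whose upper bound
// (1ms << index) is >= d. The last bucket also catches anything slower.
func bucketFor(d time.Duration) int {
	bound := time.Millisecond
	for i := 0; i < numBuckets-1; i++ {
		if d <= bound {
			return i
		}
		bound *= 2
	}
	return numBuckets - 1
}

// Observe records one response time.
func (h *Histogram) Observe(d time.Duration) {
	h.counts[bucketFor(d)]++
	h.total++
}

// Quantile returns the upper bound of the bucket that contains the
// q-th quantile (0 < q <= 1). Precision is limited to bucket width.
func (h *Histogram) Quantile(q float64) time.Duration {
	if h.total == 0 {
		return 0
	}
	target := uint64(math.Ceil(q * float64(h.total)))
	if target < 1 {
		target = 1
	}
	var seen uint64
	for i, c := range h.counts {
		seen += c
		if seen >= target {
			return time.Millisecond << uint(i)
		}
	}
	return time.Millisecond << uint(numBuckets-1)
}

func main() {
	h := &Histogram{}
	for i := 0; i < 99; i++ {
		h.Observe(3 * time.Millisecond) // typical request
	}
	h.Observe(500 * time.Millisecond) // one slow outlier
	fmt.Println("median bucket:", h.Quantile(0.5))
	fmt.Println("max bucket:", h.Quantile(1.0))
}
```

With this sample data the mean would be about 8ms, which describes neither the typical request nor the outlier. The bucket bounds also make the measurement precision explicit, which is the other half of Rule #3.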
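
Example for the zone partition/failure slides (39 to 43). The talk says that, by design, everything works with 2 of 3 zones running, so losing one zone should inform people rather than trigger an outage response. A hedged sketch of that decision in Go follows. The names `Action`, `AllClear`, `InformOnly`, `Outage` and `Classify` are hypothetical, and a real monitor would need much more context than a count of healthy zones.

```go
package main

import "fmt"

// Action is what the monitoring system asks humans to do.
// The three levels are a hypothetical simplification.
type Action int

const (
	AllClear   Action = iota // every zone healthy
	InformOnly               // degraded but within design: notify, change nothing
	Outage                   // fewer zones than the design needs
)

func (a Action) String() string {
	return [...]string{"all clear", "inform only", "outage"}[a]
}

// Classify compares the number of healthy zones with the number the
// architecture was designed to run on (2 of 3 in the talk's example).
func Classify(healthy, total, required int) Action {
	switch {
	case healthy >= total:
		return AllClear
	case healthy >= required:
		return InformOnly
	default:
		return Outage
	}
}

func main() {
	// Sample data: zoneC is partitioned away.
	zones := map[string]bool{"zoneA": true, "zoneB": true, "zoneC": false}
	healthy := 0
	for _, up := range zones {
		if up {
			healthy++
		}
	}
	fmt.Println(Classify(healthy, len(zones), 2))
}
```

The point of encoding `required` separately from `total` is that the monitor needs to know the design intent, not just the current state. Whether to halt deployments in the inform-only state is left open here, as it is in the talk.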
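
Example for the containment hierarchy (slide 9) and the spigo log output (slide 49). Slide 9 notes that many tools use a naming scheme to imply the hierarchy but cannot reason about it. The sketch below parses dotted names of the kind shown in the log, then answers a containment question: which instances sit inside a given zone? The field order (architecture, region, zone, service, package, instance) is inferred from the log lines and is an assumption, not spigo's documented schema. `Node`, `Parse` and `InZone` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is one parsed name. Field meanings are inferred from the
// talk's log output and may not match spigo's actual schema.
type Node struct {
	Arch, Region, Zone, Service, Package, Instance string
}

// Parse splits a dotted name into its six assumed fields.
func Parse(name string) (Node, error) {
	p := strings.Split(name, ".")
	if len(p) != 6 {
		return Node{}, fmt.Errorf("want 6 dot-separated fields, got %d in %q", len(p), name)
	}
	return Node{p[0], p[1], p[2], p[3], p[4], p[5]}, nil
}

// InZone returns the instances contained in the given region and zone,
// i.e. everything a failure of that zone would take down.
func InZone(names []string, region, zone string) []string {
	var out []string
	for _, n := range names {
		node, err := Parse(n)
		if err != nil {
			continue // e.g. "netflixoss.edda" has no zone to reason about
		}
		if node.Region == region && node.Zone == zone {
			out = append(out, node.Instance)
		}
	}
	return out
}

func main() {
	names := []string{
		"netflixoss.edda",
		"netflixoss.us-east-1.zoneA.eureka.eureka.eureka0",
		"netflixoss.us-east-1.zoneB.eureka.eureka.eureka1",
		"netflixoss.us-east-1.zoneC.eureka.eureka.eureka2",
		"netflixoss.us-east-1.zoneC.javaweb.karyon.javaweb14",
	}
	fmt.Println(InZone(names, "us-east-1", "zoneC"))
}
```

Once names are parsed into fields, the same data can be grouped by service or by region instead, which is one small step toward linking the infrastructure model to the other models in the talk.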
