Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring Tools


Monitorama opening keynote talk on the challenges of Monitoring in a world where we need to deal with continuous delivery, cloud, and automated control feedback loops.

Published in: Technology, Business


  1. Please, no More Minutes, Milliseconds, Monoliths... or Monitoring Tools! Adrian Cockcroft @adrianco #Monitorama May 2014
  2. Battery Ventures
  3. Enterprise IT Adoption of Cloud, by Simon Wardley. You Are Here.
  4. Why am I at Monitorama?
  5. Twenty Years of Free and Open Source Monitoring
      ● 1994 The “SE Toolkit” and
      ● 1998 Sun Performance Tuning, Java & The Internet Book
      ● 1999 Resource Management Sun Blueprint Book
      ● 2000 Capacity Planning for Web Services Sun Blueprint Book
      ● 2007 A. A. Michelson Award for Outstanding Contribution to Computer Metrics, by the Computer Measurement Group
      ● 2004-2008 Capacity Planning with Free Tools Workshop at CMG
      ● 2014 Monitorama!
  6. State of the Art for Free Tools in 2008
  7. History Lesson: SE is a C interpreter with built-in access to all Solaris metric data sources
  8. Topics for Today: Minutes; Monoliths; Milliseconds; Monitoring tools; Challenges for monitoring; Continuous delivery & microservices; Analysis and closed loop control systems; Tools for developers who operate code in production; Challenges of dynamic, ephemeral, distributed cloud applications
  9. No more monitoring tools?
  10. We have too many of them already… What’s needed is more analysis tools.
  11. #Analysorama?
  12. Rule #1: Spend more time working on code that analyzes the meaning of metrics, than code that collects, moves, stores and displays metrics.
  13. What’s wrong with minutes?
  14. What’s wrong with minutes? Takes too long to see a problem.
      [Chart: Metric vs. Threshold, Minute 1 to Minute 7]
      Something broke at 2m20; 40s of failure didn’t trigger; 1st high metric seen at agent on instance; 1st high metric arrives at monitoring system; 1st high metric processed (maybe); 1st high metric seen on graph; three datapoints on user graph, so looks bad at 8m00.
  15. Whoops! I didn’t mean that! Reverting… Not cool if it takes 5 minutes to see it failed and 5 more to see a fix. No-one notices if it only takes 5 seconds to detect and 5 to see a fix.
  16. Try that again by the second: more confidence more quickly.
      [Chart: Threshold, Minute 1 to Minute 7]
      Something broke at 2m20; measurable in 1s; 1st high metric seen at agent on instance; 1st high metric arrives at monitoring system; 1st high metric processed; 1st high metric seen on graph; three datapoints on user graph, so looks bad at 2m25.
  17. Continuous Delivery and DevOps Implications
      ● Changes are smaller but more frequent
      ● Individual changes more likely to be broken
      ● Changes likely to be deployed by developers
      ● Instant detection and rollback matters much more
  18. SaaS Based Products Show What Can Be Done and Seeing Problems In Seconds
  19. NetflixOSS Hystrix / Turbine Circuit Breaker Monitoring: streaming metrics directly from front end services to a web browser
  20. Rule #2: Metric to display latency needs to be less than human attention span (~10s)
  21. What’s Wrong With Milliseconds?
  22. A Millisecond is a Very Long Time!
      ● Some JVM based tools measure response times in ms
        - Network round trip within a datacenter/zone is less than 1ms
        - SSD access latency is usually less than 1ms
        - Cassandra (a Java app) response times can be less than 1ms
      ● Rounding Errors
        - Quantization loses too much information
        - Automated threshold warning “One is infinitely larger than zero”!
        - JVM does have nanosecond resolution times available
  23. Rule #3: Validate that your measurement system has enough accuracy and precision. Gauge Repeatability and Reproducibility matters, see
  24. Monolithic Monitoring Systems: simple to build and install, but problematic…
      [Diagram labels: Services Being Monitored; Monolithic Monitoring System; Services Being Monitored; Distributed Collection Systems; Analysis / Display; Aggregators]
  25. Monolithic Monitoring Issues
      ● Scalability
        - Problems scaling data collection, analysis and reporting throughput
        - Limitations on number of distinct metrics that can be collected
        - Traffic storms can overload the system and take it down
      ● Availability
        - Monitoring system needs to stay up when everything else dies!
        - Downtime for upgrades is always inconvenient
        - Gaps in the metric history can trigger alarms and lose confidence
  26. In-Band, Out-of-Band, or Both? In-band means deployed using same tools and infrastructure as your services. Dependencies lead to common mode failures that can leave you blind. Best option is both in-house in-band, and external SaaS. Very unlikely to have both fail at the same time.
      [Diagram labels: Services; Monitoring System; Monitoring System; SaaS Based Monitoring; In-Band Monitoring]
  27. Rule #4: Monitoring systems need to be more available and scalable than the systems being monitored.
  28. Continuous Delivery
  29. Issues with Continuous Delivery and Microservices
      ● High rate of change
        - Code pushes can cause floods of new instances and metrics
        - Short baseline for alert threshold analysis – everything looks unusual
      ● Ephemeral Configurations
        - Short lifetimes make it hard to aggregate historical views
        - Hand tweaked monitoring tools take too much work to keep running
      ● Microservices with complex calling patterns
        - End-to-end request flow measurements are very important
        - Request flow visualizations get overwhelmed
  30. Microservice Based Architectures See From a Gilt Groupe Presentation
  31. “Death Star” Architecture Diagrams, as visualized by Appdynamics, and Twitter internal tools
      [Diagrams: Netflix; Gilt Groupe (12 of 450); Twitter]
  32. Closed Loop Control Systems
  33. Autoscaled Ephemeral Instances at Netflix (the old way)
      ● Largest services use autoscaled red/black code pushes
      ● Average lifetime of an instance is 36 hours
      [Chart labels: Push; Autoscale Up; Autoscale Down]
  34. Scryer - Predictive Auto-scaling at Netflix. See and
      [Chart annotations: More morning load; Sat/Sun high traffic; Lower load on Weds; 24 Hours predicted traffic vs. actual]
      FFT based prediction driving AWS Autoscaler to plan minimum capacity
  35. Netflix Automatic Code Deployment Canary - Bad Signature
  36. Happy Canary Signature
  37. Monitoring Tools for Developers
      ● Most monitoring tools are built to be used by operations people
        - Focus on individual systems rather than applications
        - Focus on utilization rather than throughput and response time
        - Fiefdoms of sysadmin, network admin, storage admin, database admin…
        - Hard to integrate and extend
      ● Developer oriented monitoring tools
        - Application Performance Measurement (APM) and Analysis
        - Business transactions, response time, JVM internal metrics
        - Logging business metrics directly (NetflixOSS Servo, Yammer Metrics)
        - APIs for integration, data extraction, deep linking and embedding and
  38. Challenges of Dynamic, Ephemeral, Distributed Cloud Applications
  39. Dynamic and Ephemeral Challenges
      ● Datacenter Assets
        - Arrive infrequently, disappear infrequently
        - Stick around for three years or so before they get retired
        - Have unique IP and MAC addresses
      ● Cloud Assets
        - Arrive in bursts – a Netflix code push creates over a hundred per minute
        - Stick around for a few hours before they get retired
        - Often re-use the IP and MAC address that was just vacated!
        - Use NetflixOSS Edda to record a full history of your configuration
  40. Cloud Native Architectures
  41. Traditional vs. Cloud Native Storage Architectures
      [Diagram labels: Business Logic; Database Master; Fabric; Storage Arrays; Database Slave; Fabric; Storage Arrays; Business Logic; Cassandra Zone A nodes; Cassandra Zone B nodes; Cassandra Zone C nodes; Cloud Object Store Backups]
  42. Distributed Cloud Applications Challenges
      ● Cloud provider data stores don’t have the usual monitoring hooks
        - e.g. no way to install an agent on AWS RDS MySQL, AWS DynamoDB
      ● Dependency on web services as well as code on instances
        - Integration of data sources like CloudWatch, measure use of S3 etc.
      ● Cloud applications span zones and regions
        - Monitoring tools need to span and aggregate zones and regions too!
      ● NoSQL data stores introduce new protocols and metrics
        - e.g. cross zone and cross regions replication traffic for Cassandra
  43. Monitoring “New Rules” by @adrianco
      1. Spend more time on analysis than data collection and display
      2. Reduce key business metric latency to less than 10s
      3. Validate your measurement system precision and accuracy
      4. Be more available and scalable than the services being monitored
      5. Optimize for distributed, ephemeral cloud native applications
  44. Any Questions?
      ● Battery Ventures
      ● Adrian’s Blog
      ● Slideshare
      Appearances by @adrianco
      ● Migrating to Microservices – Qcon London - March 6th, 2014
      ● Monitorama Opening Keynote Portland OR - May 7th, 2014
      ● GOTO Chicago Opening Keynote May 20th, 2014
      ● DevOps Summit at Cloud Expo New York – June 10th, 2014
      ● Qcon New York – June 11th, 2014
      ● GOTO Copenhagen/Aarhus – Denmark – Oct 25th, 2014
      Find me on LinkedIn or Twitter @adrianco
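
The timelines on slides 14 and 16 (a break at 2m20 that looks bad at 8m00 with per-minute metrics, but at 2m25 with per-second metrics) can be reproduced with a little arithmetic. The following is a minimal Python sketch, not something from the talk: the function name `seconds_until_visible` and the `pipeline_intervals=2` delay (agent, monitoring system, graph) are assumptions chosen so that the model matches the figures on both slides.

```python
def seconds_until_visible(break_at_s, interval_s, datapoints_needed=3, pipeline_intervals=2):
    """Estimate when a failure becomes obvious on a graph.

    Assumptions (hypothetical model, not from the talk):
    - a sample only crosses the threshold if its whole interval is in failure
    - the graph needs `datapoints_needed` high samples in a row
    - collection, processing and display add `pipeline_intervals` intervals of delay
    """
    # Start of the first sampling interval that lies entirely after the break
    # (integer ceiling division, then back to seconds).
    first_full_start = -(-break_at_s // interval_s) * interval_s
    # Time at which the last required high sample has been measured.
    last_sample_done = first_full_start + datapoints_needed * interval_s
    return last_sample_done + pipeline_intervals * interval_s


def as_minutes(seconds):
    """Format a number of seconds in the slides' 'XmYY' style."""
    return f"{seconds // 60}m{seconds % 60:02d}"


break_at = 2 * 60 + 20  # something broke at 2m20
per_minute = seconds_until_visible(break_at, interval_s=60)
per_second = seconds_until_visible(break_at, interval_s=1)
print(as_minutes(per_minute))  # 8m00
print(as_minutes(per_second))  # 2m25
```

Under these assumptions the delay scales with the sampling interval, which is the argument behind Rule #2: shrinking the interval is the most direct way to get metric-to-display latency under roughly 10 seconds.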
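
Slide 22's point about rounding errors can be illustrated in a few lines. This is a sketch in Python rather than on the JVM; the latency values are made up for illustration, and `time.perf_counter_ns` stands in here for the nanosecond-resolution clock that the slide says the JVM provides.

```python
import time

# Hypothetical sub-millisecond response times, in nanoseconds (illustrative values only).
latencies_ns = [250_000, 400_000, 900_000, 1_200_000]

# Truncating to whole milliseconds, as a millisecond-resolution measurement would.
latencies_ms = [ns // 1_000_000 for ns in latencies_ns]
print(latencies_ms)  # [0, 0, 0, 1]

# Keeping microseconds preserves the differences between the samples.
latencies_us = [ns // 1_000 for ns in latencies_ns]
print(latencies_us)  # [250, 400, 900, 1200]


def timed_ns(func):
    """Run func once and return its duration in nanoseconds."""
    start = time.perf_counter_ns()
    func()
    return time.perf_counter_ns() - start


elapsed_ns = timed_ns(lambda: sum(range(1000)))
```

In millisecond units three of the four samples collapse to zero, and the jump from 0 to 1 is the kind of change that trips an automated threshold ("one is infinitely larger than zero") even though the underlying latency only moved from 0.9 ms to 1.2 ms.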
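
Slide 39 notes that cloud instances often re-use an IP address that was just vacated, which is why a recorded configuration history matters when interpreting old metrics. The sketch below shows the idea with a small in-memory history. It is not Edda's API; the `Lease` record, the `owner_at` function and the instance IDs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lease:
    """One period during which an instance held an IP address (hypothetical record)."""
    ip: str
    instance_id: str
    start: int            # seconds since some epoch
    end: Optional[int]    # None while the instance is still running


def owner_at(history, ip, timestamp):
    """Return the instance that held `ip` at `timestamp`, or None if nobody did."""
    for lease in history:
        still_held = lease.end is None or timestamp < lease.end
        if lease.ip == ip and lease.start <= timestamp and still_held:
            return lease.instance_id
    return None


history = [
    Lease("10.0.0.7", "i-aaa", start=0, end=3600),
    Lease("10.0.0.7", "i-bbb", start=3605, end=None),  # same IP, re-used 5s later
]

print(owner_at(history, "10.0.0.7", 1800))  # i-aaa
print(owner_at(history, "10.0.0.7", 4000))  # i-bbb
print(owner_at(history, "10.0.0.7", 3602))  # None
```

Keying metrics on IP address alone would merge the two instances into one series; looking the address up against a time-stamped history keeps them apart.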