Please, no More Minutes, Milliseconds,
Monoliths... or Monitoring Tools!
Adrian Cockcroft @adrianco #Monitorama May 2014
2 | Battery Ventures
3 | Battery Ventures
Enterprise IT Adoption of Cloud
By Simon Wardley http://enterpriseitadoption.com/
[Figure: chart with a “You Are Here” marker]
4 | Battery Ventures
Why am I at Monitorama?
5 | Battery Ventures
Twenty Years of Free and Open Source Monitoring
● 1994 The “SE Toolkit” and virtual_adrian.se
● 1998 Sun Performance Tuning, Java & The Internet Book
● 1999 Resource Management Sun Blueprint Book
● 2000 Capacity Planning for Web Services Sun Blueprint Book
● 2007 A. A. Michelson Award for Outstanding Contribution to
Computer Metrics, by the Computer Measurement Group
● 2004-2008 Capacity Planning with Free Tools Workshop at CMG
● 2014 Monitorama!
6 | Battery Ventures
State of the Art for Free Tools in 2008
http://www.slideshare.net/adrianco/capacity-planning-with-free-tools
7 | Battery Ventures
History Lesson
http://sourceforge.net/projects/setoolkit/
SE is a C interpreter with built-in access to all Solaris metric data sources
8 | Battery Ventures
Topics for Today
Minutes
Monoliths
Milliseconds
Monitoring tools
Challenges for monitoring
Continuous delivery & microservices
Analysis and closed loop control systems
Tools for developers who operate code in production
Challenges of dynamic, ephemeral, distributed cloud applications
9 | Battery Ventures
No more monitoring tools?
10 | Battery Ventures
We have too many of them already…
What’s needed is more analysis tools.
11 | Battery Ventures
#Analysorama?
12 | Battery Ventures
Rule #1: Spend more time working on code
that analyzes the meaning of metrics, than
code that collects, moves, stores and
displays metrics.
13 | Battery Ventures
What’s wrong with minutes?
14 | Battery Ventures
What’s wrong with minutes?
Takes too long to see a problem
[Figure: timeline from Minute 1 to Minute 7 plotting a per-minute metric (y-axis 0 to 5) against a threshold. Annotations, in order:]
● Something broke at 2m20
● 40s of failure didn’t trigger
● 1st high metric seen at agent on instance
● 1st high metric arrives at monitoring system
● 1st high metric processed (maybe)
● 1st high metric seen on graph
● Three datapoints on user graph so looks bad at 8m00.
15 | Battery Ventures
Whoops! I didn’t mean that! Reverting…
Not cool if it takes 5 minutes to see it failed and 5 more to see a fix
No-one notices if it only takes 5 seconds to detect and 5 to see a fix
16 | Battery Ventures
Try that again by the second
More confidence more quickly
[Figure: timeline from Minute 1 to Minute 7 plotting a per-second metric (y-axis 0 to 4) against a threshold. Annotations, in order:]
● Something broke at 2m20
● Measurable in 1s
● 1st high metric seen at agent on instance
● 1st high metric arrives at monitoring system
● 1st high metric processed
● 1st high metric seen on graph
● Three datapoints on user graph so looks bad at 2m25.
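The two timelines can be compared with a little arithmetic. The sketch below is a simplified model, not something taken from the talk: it assumes a sample only reads high when the failure covers its whole interval, that the pipeline (agent to monitoring system to graph) adds two sampling intervals of delay, and that three datapoints are needed before a graph looks bad. The function and parameter names are hypothetical.

```python
def seconds_until_visible(break_at, interval, pipeline_intervals=2, datapoints_needed=3):
    """Seconds from t=0 until enough bad datapoints are visible on a graph.

    All parameters are in seconds except the two counts. The defaults are
    illustrative assumptions, not measurements from a real system.
    """
    # A break that lands mid-interval only partly fills that interval,
    # so (by assumption) that sample stays under the threshold.
    intervals_before_first_full = break_at // interval
    if break_at % interval:
        intervals_before_first_full += 1
    first_high_sample = (intervals_before_first_full + 1) * interval

    pipeline_delay = pipeline_intervals * interval
    extra_points = (datapoints_needed - 1) * interval
    return first_high_sample + pipeline_delay + extra_points


def as_minutes(seconds):
    """Format seconds in the slide's 2m20 style."""
    return f"{seconds // 60}m{seconds % 60:02d}"


if __name__ == "__main__":
    broke_at = 2 * 60 + 20  # "Something broke at 2m20"
    print(as_minutes(seconds_until_visible(broke_at, interval=60)))  # 8m00
    print(as_minutes(seconds_until_visible(broke_at, interval=1)))   # 2m25
```

Under these assumptions the same pipeline takes 5m40 to show a problem with one-minute samples and 5 seconds with one-second samples, which is the gap the two charts illustrate.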
17 | Battery Ventures
Continuous Delivery and DevOps Implications
● Changes are smaller but more frequent
● Individual changes more likely to be broken
● Changes likely to be deployed by developers
● Instant detection and rollback matter much more
18 | Battery Ventures
SaaS Based Products Show What Can Be Done
www.vividcortex.com and www.boundary.com
Seeing Problems In Seconds
19 | Battery Ventures
NetflixOSS Hystrix / Turbine Circuit Breaker Monitoring
http://techblog.netflix.com/2012/12/hystrix-dashboard-and-turbine.html
Streaming metrics directly from front end services to a web browser
20 | Battery Ventures
Rule #2: Metric-to-display latency needs to
be less than human attention span (~10s)
21 | Battery Ventures
What’s Wrong With Milliseconds?
22 | Battery Ventures
A Millisecond is a Very Long Time!
● Some JVM based tools measure response times in ms
    Network round trip within a datacenter/zone is less than 1ms
    SSD access latency is usually less than 1ms
    Cassandra (a Java app) response times can be less than 1ms
● Rounding Errors
    Quantization loses too much information
    Automated threshold warning “One is infinitely larger than zero”!
    JVM does have nanosecond resolution times available
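A small sketch of the rounding problem, written in Python rather than Java for brevity. The latency values are made up for illustration and the helper names are hypothetical; `time.perf_counter_ns` plays the role that `System.nanoTime` plays on the JVM.

```python
import time


def to_whole_ms(latency_ns):
    """Quantize a nanosecond latency to whole milliseconds, as some tools do."""
    return latency_ns // 1_000_000


def relative_increase(before, after):
    """Fractional change from before to after, as a naive alert rule might compute."""
    if before == 0:
        # "One is infinitely larger than zero"
        return float("inf") if after > 0 else 0.0
    return (after - before) / before


def timed_ns(fn):
    """Time one call of fn with nanosecond resolution."""
    start = time.perf_counter_ns()
    fn()
    return time.perf_counter_ns() - start


if __name__ == "__main__":
    before_ns, after_ns = 900_000, 1_100_000  # 0.9 ms and 1.1 ms (illustrative)
    # In whole milliseconds this is a jump from 0 to 1: an infinite increase.
    print(relative_increase(to_whole_ms(before_ns), to_whole_ms(after_ns)))
    # In nanoseconds it is a modest increase of roughly 22%.
    print(relative_increase(before_ns, after_ns))
    print(timed_ns(lambda: sum(range(1000))), "ns")
```

Keeping the raw nanosecond value and converting only for display avoids both the lost information and the divide-by-zero style alerts.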
23 | Battery Ventures
Rule #3: Validate that your measurement
system has enough accuracy and precision.
Gauge Repeatability and Reproducibility matters, see
http://en.wikipedia.org/wiki/ANOVA_gauge_R%26R
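One way to start on this rule is to ask how much of the variation in a set of measurements comes from the measuring itself. The sketch below is a deliberately simplified, repeatability-only variance decomposition (a one-way random effects estimate), not the full ANOVA gauge R&R procedure, which also separates out reproducibility across operators. It assumes every measured item has the same number of repeated trials; the function name and the sample data are hypothetical.

```python
from statistics import mean, variance


def repeatability_share(trials_by_item):
    """Fraction of total variance attributable to measurement repeatability.

    trials_by_item: list of lists; each inner list holds repeated measurements
    of the same item (for example, the same request timed several times).
    Returns a value between 0.0 (measurement adds no noise) and 1.0 (all noise).
    """
    trials_per_item = len(trials_by_item[0])

    # Spread among repeated measurements of the same item.
    repeat_var = mean([variance(trials) for trials in trials_by_item])

    # Spread among item means includes a share of the repeatability noise,
    # so subtract that share to estimate the real item-to-item variance.
    means_var = variance([mean(trials) for trials in trials_by_item])
    item_var = max(0.0, means_var - repeat_var / trials_per_item)

    total = repeat_var + item_var
    return repeat_var / total if total else 0.0


if __name__ == "__main__":
    # Illustrative latencies in microseconds: three requests, four trials each.
    latencies = [[410, 395, 402, 408], [650, 662, 641, 655], [298, 305, 310, 301]]
    print(f"{repeatability_share(latencies):.1%} of variance is measurement noise")
```

A high share suggests the measurement system cannot distinguish the things being measured, so thresholds built on it are unlikely to be trustworthy.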
24 | Battery Ventures
Monolithic Monitoring Systems
Simple to build and install, but problematic…
[Diagram: two layouts. Monolithic: Services Being Monitored, Monolithic Monitoring System. Distributed: Services Being Monitored, Distributed Collection Systems, Analysis / Display Aggregators.]
25 | Battery Ventures
Monolithic Monitoring Issues
● Scalability
    Problems scaling data collection, analysis and reporting throughput
    Limitations on number of distinct metrics that can be collected
    Traffic storms can overload the system and take it down
● Availability
    Monitoring system needs to stay up when everything else dies!
    Downtime for upgrades is always inconvenient
    Gaps in the metric history can trigger alarms and lose confidence
26 | Battery Ventures
In-Band, Out-of-Band, or Both?
In-band means deployed using same tools and infrastructure as your services
Dependencies lead to common mode failures that can leave you blind
Best option is both in-house in-band, and external SaaS
[Diagram: Services observed by two monitoring systems, SaaS Based Monitoring and In-Band Monitoring]
Very unlikely to have both fail at the same time
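The "very unlikely" claim is the product rule for independent failures. The sketch below uses made-up availability figures and a hypothetical function name; the key assumption, labeled in the code, is that the two monitoring systems fail independently, which is exactly what shared in-band dependencies would break.

```python
def blind_probability(*unavailabilities):
    """Probability that every monitoring system is down at the same moment.

    ASSUMPTION: failures are independent. Common mode failures (shared
    network, shared deploy tooling) make the real figure worse.
    """
    probability = 1.0
    for unavailable_fraction in unavailabilities:
        probability *= unavailable_fraction
    return probability


if __name__ == "__main__":
    in_band_down = 1 - 0.99    # illustrative: in-band system is 99% available
    saas_down = 1 - 0.999      # illustrative: SaaS system is 99.9% available
    both_down = blind_probability(in_band_down, saas_down)
    print(f"blind fraction: {both_down:.6%}")
    print(f"blind minutes per year: {both_down * 365 * 24 * 60:.1f}")
```

With these figures the combined blind time is a few minutes per year, compared with days per year for the in-band system alone.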
27 | Battery Ventures
Rule #4: Monitoring systems need to be
more available and scalable than the
systems being monitored.
28 | Battery Ventures
Continuous Delivery
29 | Battery Ventures
Issues with Continuous Delivery and Microservices
● High rate of change
    Code pushes can cause floods of new instances and metrics
    Short baseline for alert threshold analysis – everything looks unusual
● Ephemeral Configurations
    Short lifetimes make it hard to aggregate historical views
    Hand tweaked monitoring tools take too much work to keep running
● Microservices with complex calling patterns
    End-to-end request flow measurements are very important
    Request flow visualizations get overwhelmed
30 | Battery Ventures
Microservice Based Architectures
See http://www.slideshare.net/LappleApple/gilt-from-monolith-ruby-app-to-micro-service-scala-service-architecture
From a Gilt Groupe Presentation
31 | Battery Ventures
“Death Star” Architecture Diagrams
As visualized by AppDynamics, Boundary.com and Twitter internal tools
[Figures: Netflix; Gilt Groupe (12 of 450); Twitter]
32 | Battery Ventures
Closed Loop Control Systems
33 | Battery Ventures
Autoscaled Ephemeral Instances at Netflix (the old way)
● Largest services use autoscaled red/black code pushes
● Average lifetime of an instance is 36 hours
[Figure annotations: Push, Autoscale Up, Autoscale Down]
34 | Battery Ventures
Scryer - Predictive Auto-scaling at Netflix
See http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html
and http://techblog.netflix.com/2013/12/scryer-netflixs-predictive-auto-scaling.html
[Figure: 24 hours of predicted traffic vs. actual. Annotations: More morning load, Sat/Sun high traffic, Lower load on Weds]
FFT based prediction driving AWS Autoscaler to plan minimum capacity
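As a rough illustration of frequency-based prediction, the sketch below takes a traffic history, keeps only its strongest frequency components, and extends them forward to plan a minimum instance count. It is not Scryer's algorithm: it uses a plain O(n²) discrete Fourier transform instead of an FFT library, and the function names, the `keep` parameter and the per-instance capacity are all hypothetical.

```python
import cmath
import math


def dft(samples):
    """Plain discrete Fourier transform (O(n^2), fine for short histories)."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def forecast(history, horizon, keep=3):
    """Extend the `keep` strongest periodic components `horizon` steps ahead."""
    n = len(history)
    spectrum = dft(history)
    # Rank the non-negative frequencies by magnitude and keep the strongest.
    strongest = sorted(range(n // 2 + 1), key=lambda k: abs(spectrum[k]), reverse=True)[:keep]
    # Include each mirror frequency so the reconstruction stays real-valued.
    kept = set()
    for k in strongest:
        kept.add(k)
        kept.add((n - k) % n)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in kept).real / n
        for t in range(n, n + horizon)
    ]


def minimum_instances(predicted_load, per_instance_capacity):
    """Instances needed per step, given an assumed capacity per instance."""
    return [math.ceil(load / per_instance_capacity) for load in predicted_load]


if __name__ == "__main__":
    # Illustrative history: two days of hourly load with a daily cycle.
    history = [1000 + 400 * math.sin(2 * math.pi * hour / 24) for hour in range(48)]
    tomorrow = forecast(history, horizon=24, keep=2)
    print(minimum_instances(tomorrow, per_instance_capacity=150))
```

Because the reconstruction is periodic, this only captures repeating patterns such as daily or weekly cycles; trends and one-off events need separate handling.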
35 | Battery Ventures
Netflix Automatic Code Deployment Canary - Bad Signature
36 | Battery Ventures
Happy Canary Signature
37 | Battery Ventures
Monitoring Tools for Developers
● Most monitoring tools are built to be used by operations people
    Focus on individual systems rather than applications
    Focus on utilization rather than throughput and response time
    Fiefdoms of sysadmin, network admin, storage admin, database admin…
    Hard to integrate and extend
● Developer oriented monitoring tools
    Application Performance Measurement (APM) and Analysis
    Business transactions, response time, JVM internal metrics
    Logging business metrics directly (NetflixOSS Servo, Yammer Metrics)
    APIs for integration, data extraction, deep linking and embedding
http://techblog.netflix.com/2012/02/announcing-servo.html and http://metrics.codahale.com/
38 | Battery Ventures
Challenges of Dynamic, Ephemeral,
Distributed Cloud Applications
39 | Battery Ventures
Dynamic and Ephemeral Challenges
● Datacenter Assets
    Arrive infrequently, disappear infrequently
    Stick around for three years or so before they get retired
    Have unique IP and MAC addresses
● Cloud Assets
    Arrive in bursts – a Netflix code push creates over a hundred per minute
    Stick around for a few hours before they get retired
    Often re-use the IP and MAC address that was just vacated!
Use NetflixOSS Edda to record a full history of your configuration
http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html
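Address re-use means an IP alone does not identify an instance; the question has to be "who held this IP at that time?". The sketch below shows one way to keep such a history. It is a hypothetical in-memory structure, not Edda's API or data model, and it assumes an assignment lasts until the next one is recorded for the same address.

```python
import bisect
from collections import defaultdict


class AddressHistory:
    """Time-indexed record of which instance held each IP address."""

    def __init__(self):
        self._starts = defaultdict(list)  # ip -> sorted assignment start times
        self._owners = defaultdict(list)  # ip -> instance ids, same order

    def record(self, ip, instance_id, start_time):
        """Note that instance_id took over ip at start_time."""
        index = bisect.bisect_right(self._starts[ip], start_time)
        self._starts[ip].insert(index, start_time)
        self._owners[ip].insert(index, instance_id)

    def owner_at(self, ip, when):
        """Return the instance that held ip at time `when`, or None."""
        starts = self._starts.get(ip, [])
        index = bisect.bisect_right(starts, when)
        return self._owners[ip][index - 1] if index else None


if __name__ == "__main__":
    history = AddressHistory()
    # Illustrative instance ids and timestamps (seconds).
    history.record("10.0.0.5", "i-aaa", start_time=100)
    history.record("10.0.0.5", "i-bbb", start_time=200)  # same IP, re-used
    print(history.owner_at("10.0.0.5", 150))
    print(history.owner_at("10.0.0.5", 250))
```

Joining metrics and logs against a history like this, rather than against current state, keeps old data attributed to the instance that actually produced it.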
40 | Battery Ventures
Cloud Native Architectures
41 | Battery Ventures
Traditional vs. Cloud Native Storage Architectures
[Diagram: two storage architectures. Traditional: Business Logic, Database Master and Database Slave, each attached through a Fabric to Storage Arrays. Cloud native: Business Logic, Cassandra Zone A nodes, Cassandra Zone B nodes, Cassandra Zone C nodes, Cloud Object Store Backups.]
42 | Battery Ventures
Distributed Cloud Applications Challenges
● Cloud provider data stores don’t have the usual monitoring hooks
    e.g. no way to install an agent on AWS RDS MySQL, AWS DynamoDB
● Dependency on web services as well as code on instances
    Integration of data sources like CloudWatch, measure use of S3 etc.
● Cloud applications span zones and regions
    Monitoring tools need to span and aggregate zones and regions too!
● NoSQL data stores introduce new protocols and metrics
    e.g. cross zone and cross region replication traffic for Cassandra
43 | Battery Ventures
Monitoring “New Rules” by @adrianco
1. Spend more time on analysis than data collection and display
2. Reduce key business metric latency to less than 10s
3. Validate your measurement system precision and accuracy
4. Be more available and scalable than the services being monitored
5. Optimize for distributed, ephemeral cloud native applications
44 | Battery Ventures
Any Questions?
● Battery Ventures http://www.battery.com
● Adrian’s Blog http://perfcap.blogspot.com
● Slideshare http://slideshare.com/adriancockcroft
Appearances by @adrianco
● Migrating to Microservices – QCon London – March 6th, 2014
● Monitorama Opening Keynote – Portland OR – May 7th, 2014
● GOTO Chicago Opening Keynote – May 20th, 2014
● DevOps Summit at Cloud Expo New York – June 10th, 2014
● QCon New York – June 11th, 2014
● GOTO Copenhagen/Aarhus – Denmark – Oct 25th, 2014
Find me on LinkedIn or Twitter @adrianco
