Introduction to Efficient Monitoring

Monitoring is both the process and the set
of tools for finding problems before your
users do, minimizing the monetary impact of
failures, and enabling fast recovery.

Efficient monitoring aims to notify the
right person at the right time (and only at
the right time) with the most precise information.

What monitoring is
• Measure
• Aggregate & Visualize
• Alert

What to Measure? End user experience / performance
[Diagram: Webapp and DB]

End User Monitoring
• Validates that our application is reachable
from the “outside”
• Measures “real user” performance
• Geo-distributed, so it includes real latency
• Many tools offer such solutions
– Measure, visualize, alert

End User Monitoring
• When is a page fully loaded?
• Take care - some tools are biased

End User Monitoring
• Measure yourself
• Using
– Resource Timing API
– User Timing API
– Custom JS
• Send metrics from browsers to your own
sync server
– For all users or for a sample

End User Monitoring
What to measure
• Measure page load time (as you define it)
• Measure loading errors
• Measure number of page views
• Group by Geo & Application
• Group by browser
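
The measurements listed above can be sketched as a small server-side aggregation step. This is only an illustration: the beacon fields (`geo`, `app`, `browser`, `load_ms`, `error`) and the function name are assumptions, not something the slides define.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class PageStats:
    views: int = 0
    errors: int = 0
    load_times_ms: list = field(default_factory=list)


def aggregate_beacons(beacons):
    """Group raw browser beacons by (geo, application, browser)."""
    stats = defaultdict(PageStats)
    for beacon in beacons:
        key = (beacon["geo"], beacon["app"], beacon["browser"])
        group = stats[key]
        group.views += 1
        if beacon.get("error"):
            group.errors += 1
        else:
            # Only successful loads contribute a page load time.
            group.load_times_ms.append(beacon["load_ms"])
    return dict(stats)


# Hypothetical beacons, as a browser might report them.
beacons = [
    {"geo": "US", "app": "editor", "browser": "Chrome", "load_ms": 1200},
    {"geo": "US", "app": "editor", "browser": "Chrome", "load_ms": 1800},
    {"geo": "DE", "app": "editor", "browser": "Firefox", "error": True},
]
stats = aggregate_beacons(beacons)
```

Keeping the raw load times per group, rather than a running average, leaves room to compute medians and percentiles later.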

End User Monitoring
Alert on
• Sudden drop in traffic from a certain geo
• Sudden increase in traffic
• Increase in loading times
• Increase in errors
– From a specific browser
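
One possible way to express the traffic alerts above is to compare the current request rate per geo against a baseline. The thresholds (`max_drop`, `max_rise`) and the function names are illustrative assumptions; real systems usually derive the baseline from history.

```python
def traffic_change(baseline_rpm, current_rpm):
    """Relative change versus baseline (-0.5 means a 50% drop)."""
    if baseline_rpm <= 0:
        return 0.0
    return (current_rpm - baseline_rpm) / baseline_rpm


def traffic_alerts(baseline_by_geo, current_by_geo, max_drop=0.5, max_rise=1.0):
    """Return (geo, kind, change) for every geo whose traffic moved too much."""
    alerts = []
    for geo, baseline in sorted(baseline_by_geo.items()):
        # A geo missing from the current window counts as zero traffic.
        change = traffic_change(baseline, current_by_geo.get(geo, 0))
        if change <= -max_drop:
            alerts.append((geo, "drop", change))
        elif change >= max_rise:
            alerts.append((geo, "increase", change))
    return alerts


baseline = {"US": 1000, "DE": 400, "BR": 200}
current = {"US": 980, "DE": 100, "BR": 500}
alerts = traffic_alerts(baseline, current)
# alerts == [("BR", "increase", 1.5), ("DE", "drop", -0.75)]
```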

What to Measure? Is Alive?
[Diagram: Webapp and DB]

Is Alive
• Measure process liveness
– Is the process running?
• Measure process responsiveness
– Does the process respond to a request?
• Alert on instance down
– And auto restart it
• Alert on all instances down

Is Alive
• A variety of great tools
• Tools that perform “ping” tests
• Tools that call a designated URL for
responsiveness tests
• Is alive != Availability
– Is alive is per host
– Availability is about the system as a whole

What to Measure? Request performance
[Diagram: Webapp and DB]

Request Monitoring
• Measure how your application performs
– Regardless of networking to the user
– Regardless of latency
• Measuring on the server, per server
• Many tools provide such solutions
– Measure, visualize, alerts

Request Monitoring
• But many tools miss the branching point
– Branching point – the point in your code at
which your code decides what branch of
execution to perform for a request
• Issues with aggregation, what is
monitored, alert flexibility
• But still, there are some great tools

Request Monitoring
What to measure
• Measure request rate
• Measure performance histogram
• Measure error rate, by error type and HTTP
response code
• Group by request type (as you define it)
• Group by host, application, data center
• Group by error type (as you define it)
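
As a rough sketch of the measurements above, the recorder below keeps a request count, a latency histogram and error counts per request type. The bucket boundaries, class name and request type names are arbitrary assumptions chosen for illustration.

```python
import bisect
from collections import defaultdict

# Histogram bucket upper bounds in milliseconds (arbitrary example values).
BUCKETS_MS = [50, 100, 250, 500, 1000]


class RequestMetrics:
    def __init__(self):
        self.count = defaultdict(int)
        self.errors = defaultdict(int)  # keyed by (request_type, error_type)
        # One counter per bucket, plus a final overflow bucket.
        self.histogram = defaultdict(lambda: [0] * (len(BUCKETS_MS) + 1))

    def record(self, request_type, duration_ms, error_type=None):
        self.count[request_type] += 1
        # Bucket i holds durations up to and including BUCKETS_MS[i].
        bucket = bisect.bisect_left(BUCKETS_MS, duration_ms)
        self.histogram[request_type][bucket] += 1
        if error_type is not None:
            self.errors[(request_type, error_type)] += 1


metrics = RequestMetrics()
metrics.record("get_page", 40)
metrics.record("get_page", 100)
metrics.record("save", 2000, error_type="http_500")
```

The request type is whatever the application decides at its branching point, which is why the caller supplies it explicitly here.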

Do not use Average
• Don’t use the average for performance
• Instead, use the median and the 95th and
99th percentiles
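
A small worked example of why the average misleads, using made-up latencies and a nearest-rank percentile (one of several common percentile definitions, chosen here for simplicity):

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


# 98 fast requests and 2 very slow ones, in milliseconds (invented data).
latencies_ms = [100] * 98 + [5000] * 2

mean = statistics.mean(latencies_ms)   # 198
median = percentile(latencies_ms, 50)  # 100
p95 = percentile(latencies_ms, 95)     # 100
p99 = percentile(latencies_ms, 99)     # 5000
```

No request actually took 198 ms. The median shows what a typical request looks like, and the 99th percentile exposes the slow tail that the average blurs.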

Request Monitoring
What to Visualize
• Request rate (RPM)
• Request performance
– Median, 95th and 99th percentiles
over a moving window
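
One simple way to compute percentiles over a moving window is to keep timestamped samples and evict the ones that fall out of the window. This sketch assumes samples arrive in timestamp order; the class name and window length are illustrative.

```python
import math
from collections import deque


class MovingWindow:
    def __init__(self, window_s):
        self.window_s = window_s
        self._samples = deque()  # (timestamp_s, latency_ms), oldest first

    def add(self, timestamp_s, latency_ms):
        self._samples.append((timestamp_s, latency_ms))
        # Drop samples that are a full window (or more) older than the newest one.
        while self._samples[0][0] <= timestamp_s - self.window_s:
            self._samples.popleft()

    def percentile(self, pct):
        """Nearest-rank percentile of the samples currently in the window."""
        ordered = sorted(value for _, value in self._samples)
        if not ordered:
            return None
        rank = max(math.ceil(pct * len(ordered) / 100), 1)
        return ordered[rank - 1]


window = MovingWindow(window_s=60)
window.add(0, 900)
window.add(30, 100)
window.add(61, 120)  # evicts the sample recorded at t=0
```

Sorting on every query is fine for a sketch; production systems typically use histograms or streaming quantile estimators instead.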

Request Monitoring
What to Visualize
• Errors
– Rate, percent (compared to request rate)
– Top X errors by percent
– Separate system and application errors
– You will always have application errors
– You should have exactly 0 system errors
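
The error views above can be sketched as a small report function. Which error names count as system errors is an application decision; the `SYSTEM_ERRORS` set and the error names below are hypothetical.

```python
from collections import Counter

# Hypothetical classification: error names treated as system errors.
SYSTEM_ERRORS = {"db_connection_failed", "out_of_memory"}


def percent(part, whole):
    """Share of `whole` as a percentage, rounded to two decimals."""
    return round(100.0 * part / whole, 2) if whole else 0.0


def error_report(request_count, errors, top_n=3):
    counts = Counter(errors)
    total = sum(counts.values())
    system = sum(n for name, n in counts.items() if name in SYSTEM_ERRORS)
    return {
        "error_percent": percent(total, request_count),
        "system_errors": system,
        "application_errors": total - system,
        # Top errors, each as a percent of all requests.
        "top": [(name, percent(n, request_count)) for name, n in counts.most_common(top_n)],
    }


errors = ["validation_failed"] * 30 + ["not_found"] * 15 + ["db_connection_failed"] * 5
report = error_report(request_count=1000, errors=errors)
```

Here `report["system_errors"]` is 5, which under the rule above is already worth an alert even though the overall error percentage is low.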

Request Monitoring
Alert on
• Big changes in traffic
• Increase in response times
• Increase in errors
• System errors

What to Measure? Resource utilization
[Diagram: Webapp and DB]

Resources
• System resources
– CPU, memory, IO, storage, network
• Resource pools
– Database connection pools
– HTTP connection pools
– Thread pools
– Other resource pools

Resource Monitoring
What to measure
• Measure resource utilization
– Percent of resource used
• Measure resource acquisition queue
– Time to acquire
– Acquire Timeouts
– Usage Timeouts
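
A minimal sketch of how a resource pool could expose the measurements above: utilization, time to acquire and acquire timeouts. The `MeasuredPool` class is an assumption for illustration, not a real library API, and usage timeouts (resources held too long) are left out to keep it short.

```python
import threading
import time


class MeasuredPool:
    """A counting pool that records utilization and acquisition metrics."""

    def __init__(self, size):
        self.size = size
        self._slots = threading.Semaphore(size)
        self._lock = threading.Lock()
        self.in_use = 0
        self.acquire_timeouts = 0
        self.acquire_wait_s = []  # time spent waiting for each successful acquire

    def acquire(self, timeout_s):
        start = time.monotonic()
        acquired = self._slots.acquire(timeout=timeout_s)
        waited = time.monotonic() - start
        with self._lock:
            if acquired:
                self.in_use += 1
                self.acquire_wait_s.append(waited)
            else:
                self.acquire_timeouts += 1
        return acquired

    def release(self):
        with self._lock:
            self.in_use -= 1
        self._slots.release()

    def utilization_percent(self):
        with self._lock:
            return 100.0 * self.in_use / self.size
```

For example, with `MeasuredPool(2)`, a third `acquire(0.01)` fails and increments `acquire_timeouts` while utilization stays at 100%.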

Resource Monitoring
What to measure
• Group by resource type and pool
• Group by host, application, data center
• Group by error type (as you define it)
Alert on
• Resource over utilization –
avg usage over XX% in a time window
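
The over-utilization rule can be sketched as an average over the samples of one time window, grouped by host and pool. The slides leave the threshold open (XX%), so it is a required parameter here; the value 80 in the example is arbitrary, as are the host and pool names.

```python
def over_utilized(window_samples, threshold_percent):
    """Return {key: average} for groups whose average usage exceeds the threshold.

    window_samples maps a group key, e.g. (host, pool), to the utilization
    percentages sampled during the time window.
    """
    alerts = {}
    for key, samples in window_samples.items():
        if not samples:
            continue
        average = sum(samples) / len(samples)
        if average > threshold_percent:
            alerts[key] = average
    return alerts


window_samples = {
    ("web-1", "db_pool"): [90, 95, 85],    # sustained high usage
    ("web-1", "http_pool"): [20, 99, 30],  # a single spike
}
alerts = over_utilized(window_samples, threshold_percent=80)
# alerts == {("web-1", "db_pool"): 90.0}
```

Averaging over the window is what keeps the single spike on `http_pool` from paging anyone.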

What to Measure? Database monitoring
[Diagram: Webapp and DB]

Database monitoring
Depends on the database, but typically:
• Storage
• Replication “lag”
• Slow operations
• Resource usage

Precise information
Alert the right person
Automation

Service is alive
• Is my application alive on at least the minimum
number of instances required by my SLA?
• 2 out of 5 instances of my-app are not
responding to isAlive
• my-app requires a minimum of 3 instances
to meet the SLA
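
The example on this slide can be worked through in a few lines. The instance names and the shape of the result are assumptions; the numbers (5 instances, 2 not responding, minimum of 3) come from the slide.

```python
def sla_status(instance_alive, min_required):
    """Summarize is-alive results against the SLA's minimum instance count."""
    alive = sum(1 for ok in instance_alive.values() if ok)
    down = sorted(name for name, ok in instance_alive.items() if not ok)
    return {
        "alive": alive,
        "down": down,
        "sla_met": alive >= min_required,
        # How many more instances can fail before the SLA is breached.
        "spare": alive - min_required,
    }


# 2 out of 5 instances of my-app do not respond to isAlive.
results = {
    "my-app-1": True,
    "my-app-2": False,
    "my-app-3": True,
    "my-app-4": False,
    "my-app-5": True,
}
status = sla_status(results, min_required=3)
# status["sla_met"] is True, but status["spare"] is 0
```

The SLA is still met, yet one more failure breaches it, which is precisely the information the service owner needs in the alert.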

Alert (diagram)
• Nginx: service load balancer, is-alive
• ZooKeeper: planned configuration
• Sensu: queries Nginx, alert & SLA
• Service owner: receives the alert

Precise information
Automation

Service anomalies
• Backend Anomalies
• Identify unhealthy KPIs per endpoint
• Abnormal increase in error rate for
class.method.get
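
To make "abnormal increase" concrete, here is a deliberately simple detector that flags a value far above the recent mean, measured in standard deviations. This is a generic illustration under stated assumptions (the `max_sigma` threshold, the sample data), not the algorithm used by any particular anomaly detection product.

```python
import statistics


def is_abnormal_increase(history, current, max_sigma=3.0):
    """True when `current` is more than max_sigma standard deviations above the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current > mean
    return (current - mean) / stdev > max_sigma


# Error rate (percent) per minute for one endpoint, invented data.
history = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8]
spike = is_abnormal_increase(history, 4.5)   # True
normal = is_abnormal_increase(history, 1.1)  # False
```

Real detectors also account for seasonality and trend, which a fixed mean and standard deviation cannot capture.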

Anomaly Alert (diagram)
• JVM servers: metrics library, metrics / 1m
• statsd: stats aggregation, forwarding metrics
• Anodot: time series anomaly detection, alerts & graphs
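
For readers unfamiliar with statsd, the sketch below shows the plain-text line format it accepts over UDP (`name:value|type`). The servers in this deck are JVM-based and use a metrics library; Python is used here only for illustration, and the metric names, host and port are assumptions.

```python
import socket

# Common statsd metric types: counter, timer (milliseconds), gauge.
METRIC_TYPES = ("c", "ms", "g")


def format_metric(name, value, metric_type):
    """Build one statsd line, e.g. 'requests:1|c'."""
    if metric_type not in METRIC_TYPES:
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; statsd conventionally listens on port 8125."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


error_line = format_metric("my-app.class.method.get.errors", 1, "c")
timing_line = format_metric("my-app.class.method.get.time", 320, "ms")
```

Because the transport is UDP, a slow or missing statsd daemon does not block the application that reports the metric.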

Precise information
Automation

Service anomalies
• Frontend Anomalies
• Browser (client) generated KPIs
• User experience: are users affected or not?
How and where?

Anomaly Alert (diagram)
• Client: JS in browser, sends events
• Logger: flume, forwards events
• Storm & Esper: realtime streaming processing, metrics / 1m
• Anodot: time series anomaly detection, alerts & graphs

Precise information
Alert the right person
Automation

Alert management
• What are the active alerts?
• What is the root cause?
• Is it correlated with a change?
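
The last question can be approached by listing the changes that happened shortly before an alert fired. This is a simplified sketch: the record fields, timestamps and the 30-minute lookback are assumptions, and proximity in time only suggests a candidate cause rather than proving one.

```python
def recent_changes(alert_ts, changes, lookback_s=1800):
    """Changes made within lookback_s seconds before the alert, most recent first."""
    candidates = [c for c in changes if 0 <= alert_ts - c["ts"] <= lookback_s]
    return sorted(candidates, key=lambda c: alert_ts - c["ts"])


# Hypothetical change log, timestamps in seconds.
changes = [
    {"ts": 1000, "kind": "deployment"},
    {"ts": 4000, "kind": "chef upload"},
    {"ts": 4900, "kind": "feature toggle"},
    {"ts": 5100, "kind": "deployment"},  # happened after the alert
]
suspects = recent_changes(alert_ts=5000, changes=changes)
kinds = [c["kind"] for c in suspects]
# kinds == ["feature toggle", "chef upload"]
```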

Alert (diagram)
• Alerts: NewRelic, Sensu, Nagios, PingDom
• Changes: deployments, Chef uploads, A/B, F-Toggle, Exp.
• BigPanda: central alerts & changes, web UI

Precise information
Automation
