5. Monitoring is both the process and the set
of tools for finding problems before your
users do, minimizing the monetary impact of
failure, and enabling fast recovery.
6. Efficient monitoring aims to notify the
right person at the right time (and only at the
right time) with the most precise information.
10. End User Monitoring
• Validates that our application is running,
as seen from “outside”
• Measures “real user” performance
• Geo-Distributed – including real latency
• Many tools offer such solutions
– Measure, visualize, alerts
11. End User Monitoring
• When is a page fully loaded?
• Take care - some tools are biased
14. End User Monitoring
• Measure yourself
• Using
– Resource Timing API
– User Timing API
– Custom JS
• Send metrics from Browsers to your own
sync server
– all users / samples
15. End User Monitoring
What to measure
• Measure page load time (as you define it)
• Measure loading errors
• Measure number of page views
• Group by Geo & Application
• Group by browser
16. End User Monitoring
Alert on
• Sudden drop in traffic from a certain geo
• Sudden increase in traffic
• Increase in loading times
• Increase in errors
– From a specific browser
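One way to picture the "sudden drop in traffic from a certain geo" alert is to compare the latest window against a baseline built from earlier windows. The sketch below is only an illustration: the function name, the input shapes, and the 50% drop ratio are assumptions, not something the slides prescribe.

```python
from statistics import mean

def traffic_drop_alerts(history, current, drop_ratio=0.5):
    """Return (geo, current, baseline) for geos whose traffic dropped sharply.

    history: dict geo -> list of page-view counts from previous windows
    current: dict geo -> page-view count for the latest window
    drop_ratio: hypothetical threshold; alert when current < drop_ratio * baseline
    """
    alerts = []
    for geo, past in sorted(history.items()):
        if not past:
            continue  # no baseline yet for this geo
        baseline = mean(past)
        now = current.get(geo, 0)  # a geo missing from the latest window counts as zero traffic
        if baseline > 0 and now < drop_ratio * baseline:
            alerts.append((geo, now, baseline))
    return alerts

history = {"us": [1000, 1100, 900], "de": [400, 420, 380]}
current = {"us": 980, "de": 120}
print(traffic_drop_alerts(history, current))
```

Here only "de" is reported, since 120 page views is below half of its baseline of 400. The same comparison with the inequality reversed would cover the "sudden increase in traffic" case.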
18. Is Alive
• Measure a process's liveness
– Is the process running?
• Measure a process's responsiveness
– Does the process respond to a request?
• Alert on instance down
– And auto restart it
• Alert on all instances down
19. Is Alive
• A variety of great tools
• Tools that perform “ping” tests
• Tools that call a designated URL for
responsiveness tests
• Is alive != Availability
– Is alive is per host
– Availability is about the system as a whole
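A responsiveness test that calls a designated URL can be sketched with the Python standard library alone. The "/is-alive" path, the host names, and the helper names are hypothetical; the result is per host, so it says nothing yet about availability of the system as a whole.

```python
import http.client
import urllib.error
import urllib.request

def http_probe(url, timeout=2.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False

def check_hosts(hosts, probe=http_probe, path="/is-alive"):
    """Probe every host and return a dict host -> alive (bool)."""
    return {host: probe(f"http://{host}{path}") for host in hosts}

# Usage with a fake probe, so the example runs without any network access.
fake_probe = lambda url: "host-b" not in url
status = check_hosts(["host-a", "host-b"], probe=fake_probe)
print(status)
```

Passing the probe as a parameter keeps the per-host check separate from how the result is collected, which also makes the logic easy to exercise without real servers.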
21. Request Monitoring
• Measure how your application performs
– Regardless of networking to the user
– Regardless of latency
• Measuring on the server, per server
• Many tools provide such solutions
– Measure, visualize, alerts
22. Request Monitoring
• But many tools miss the branching point
– Branching point – the point in your code at
which your code decides what branch of
execution to perform for a request
• Issues with aggregation, what is
monitored, alert flexibility
• But still, there are some great tools
23. Request Monitoring
What to measure
• Measure request rate
• Measure performance histogram
• Measure error rate, by error type, http
response code
• Group by request type (as you define it)
• Group by host, application, data center
• Group by error type (as you define it)
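A performance histogram grouped by request type could look like the sketch below. The bucket boundaries and the request type names are made up for illustration; in practice they would follow your own definition of a request type.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical bucket upper bounds in milliseconds.
BUCKETS_MS = [50, 100, 250, 500, 1000]

def bucket_label(duration_ms):
    """Map a duration to the first bucket whose upper bound is >= the duration."""
    i = bisect_left(BUCKETS_MS, duration_ms)
    if i < len(BUCKETS_MS):
        return f"<={BUCKETS_MS[i]}ms"
    return f">{BUCKETS_MS[-1]}ms"

def histogram_by_type(requests):
    """requests: iterable of (request_type, duration_ms) pairs."""
    hist = {}
    for req_type, duration_ms in requests:
        hist.setdefault(req_type, Counter())[bucket_label(duration_ms)] += 1
    return hist

samples = [("search", 40), ("search", 120), ("search", 130), ("checkout", 1200)]
hist = histogram_by_type(samples)
print(hist)
```

Counting per bucket keeps the full shape of the distribution, which is what later allows percentiles to be estimated rather than a single average.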
24. Do not use Average
• Don’t use the average for performance
• Instead, use the median and the 95th and 99th percentiles
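A small worked example shows why the average misleads. The latency numbers are invented, and the percentile helper uses the simple nearest-rank definition, which is one of several common definitions.

```python
import math
from statistics import mean

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# 95 fast requests and 5 very slow ones (hypothetical data, in milliseconds).
latencies_ms = [100] * 95 + [5000] * 5

print("average:", mean(latencies_ms))          # 345
print("median: ", percentile(latencies_ms, 50))  # 100
print("p95:    ", percentile(latencies_ms, 95))  # 100
print("p99:    ", percentile(latencies_ms, 99))  # 5000
```

No request actually took 345 ms: most users saw 100 ms and a few saw 5000 ms. The median describes the typical request and the 99th percentile exposes the slow tail, while the average describes neither.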
25. Request Monitoring
What to Visualize
• Request rate (RPM)
• Request performance
– Median, 95th and 99th percentiles
on a moving window
26. Request Monitoring
What to Visualize
• Errors
– Rate, percent (compared to request rate)
– Top X errors by percent
– Separate system and application errors
– You will always have application errors
– You should have exactly 0 system errors
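The error views above (rate as a percent of requests, top X errors, and the system vs application split) can be sketched as one summary function. The error type names and the "system" / "application" labels on each error are assumptions about how errors are tagged.

```python
from collections import Counter

def error_summary(total_requests, errors, top_x=3):
    """Summarize errors for one time window.

    errors: list of (kind, error_type), with kind "system" or "application".
    Percentages are relative to the request count of the same window.
    """
    if total_requests <= 0:
        return {"error_percent": 0.0, "top": [], "system_errors": 0}
    by_type = Counter(err_type for _, err_type in errors)
    system_errors = sum(1 for kind, _ in errors if kind == "system")
    top = [(t, 100.0 * n / total_requests) for t, n in by_type.most_common(top_x)]
    return {
        "error_percent": 100.0 * len(errors) / total_requests,
        "top": top,
        "system_errors": system_errors,  # anything above 0 should alert
    }

errors = (
    [("application", "validation")] * 30
    + [("application", "not_found")] * 15
    + [("system", "db_timeout")] * 5
)
summary = error_summary(1000, errors)
print(summary)
```

In this sample window 5% of requests failed, validation errors dominate, and the 5 system errors are the part that should trigger an alert.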
27. Request Monitoring
Alert on
• Big changes in traffic
• Increase in response times
• Increase in errors
• System errors
30. Resource Monitoring
What to measure
• Measure resource utilization
– Percent of resource used
• Measure resource acquisition queue
– Time to acquire
– Acquire Timeouts
– Usage Timeouts
31. Resource Monitoring
What to measure
• Group by resource type and pool
• Group by host, application, data center
• Group by error type (as you define it)
Alert on
• Resource over utilization –
avg usage over XX% in a time window
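The over-utilization alert (average usage above a threshold within a time window) can be sketched with a fixed-size window of samples. The class name, the window size of 3, and the 80% threshold are illustrative assumptions; the slide leaves the actual threshold open.

```python
from collections import deque

class UtilizationWindow:
    """Keeps the last `size` utilization samples for one resource pool."""

    def __init__(self, size, threshold_percent):
        self.samples = deque(maxlen=size)  # old samples fall out automatically
        self.threshold = threshold_percent

    def add(self, used, capacity):
        """Record one sample as a percent of the resource in use."""
        self.samples.append(100.0 * used / capacity)

    def average(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def over_utilized(self):
        # Only judge a full window, so a single early spike does not alert.
        full = len(self.samples) == self.samples.maxlen
        return full and self.average() > self.threshold

pool = UtilizationWindow(size=3, threshold_percent=80)
pool.add(9, 10)
pool.add(7, 10)
before_full = pool.over_utilized()
pool.add(10, 10)
after_full = pool.over_utilized()
print(before_full, after_full)
```

Averaging over the window rather than alerting on each sample trades a little detection delay for far fewer alerts on short bursts.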
36. Service is alive
• Is my application alive on at least the
minimum number of instances required by my SLA?
• 2 out of 5 instances of my-app are not
responding to isAlive
• my-app requires a minimum of 3 instances
to meet the SLA
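The example above can be expressed as a small check over per-instance is-alive results. The function name and the instance names are hypothetical; the numbers (5 instances, 2 not responding, minimum of 3) come from the slide.

```python
def sla_status(instance_alive, min_required):
    """Compare the number of responding instances with the SLA minimum.

    instance_alive: dict instance name -> True if it answered isAlive
    """
    alive = sorted(name for name, ok in instance_alive.items() if ok)
    down = sorted(name for name, ok in instance_alive.items() if not ok)
    return {"alive": len(alive), "down": down, "sla_met": len(alive) >= min_required}

# 2 out of 5 instances of my-app are not responding; the SLA needs 3.
instances = {
    "my-app-1": True,
    "my-app-2": True,
    "my-app-3": True,
    "my-app-4": False,
    "my-app-5": False,
}
status = sla_status(instances, min_required=3)
print(status)
```

With 3 of 5 instances alive the SLA is still met, but there is no headroom left: one more failure breaches it, which is why the service-level view matters in addition to the per-instance alerts.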
38. Alert
[Diagram: Sensu (queries Nginx; alert & SLA), ZooKeeper (planned configuration), Nginx (service load balancer; is-alive), service owner]
• Alert the right person
• Precise information
• Automation
39. Service anomalies
• Backend Anomalies
• Identify unhealthy KPIs per endpoints
• Abnormal increase in error rate for
class.method.get
41. Anomaly Alert
[Diagram: JVM servers (metrics library; metrics / 1m), statsd (stats aggregation; forwarding metrics), Graphs, Anodot (time series anomaly detection; alerts & graphs)]
• Precise information
• Alert the right person
• Automation
42. Service anomalies
• Frontend Anomalies
• Browser (client) generated KPIs
• User experience: are users affected or not?
How and where?
43. Anomaly Alert
[Diagram: Client (JS in browser; events), Logger (Flume; events), Storm & Esper (realtime streaming processing; metrics / 1m), Graphs, Anodot (time series anomaly detection; alerts & graphs)]
44. Anomaly Alert
[Diagram: same pipeline as slide 43]
• Precise information
• Alert the right person
• Automation
45. Alert management
• What are the active alerts?
• What is the root cause?
• Is it correlated to a change?
46. Alert
[Diagram: BigPanda (central alerts & changes; web UI) receives changes (deployments, Chef uploads, A/B, F-Toggle, Exp.) and alerts (NewRelic, Sensu, Nagios, PingDom)]
47. Alert
[Diagram: same setup as slide 46]
• Precise information
• Alert the right person
• Automation