StatsCraft 2015: Introduction to monitoring - Yoav Abrahami and Mark Sonis

Slides of Yoav Abrahami and Mark Sonis talk at StatsCraft 2015

Published in: Technology

  1. Introduction to Monitoring
  2. Monitoring is both the process and the set of tools for finding problems before your users do, minimizing the monetary impact of failure, and enabling fast recovery.
  3. Efficient monitoring aims at notifying the right person at the right time (and at the right time only) with the most precise information.
  4. What monitoring is: Measure • Aggregate & Visualize • Alert
  5. Webapp DB
  6. Webapp DB • What to Measure? End user experience/performance
  7. End User Monitoring • Validates that our application is running from “outside” • Measure “real user” performance • Geo-distributed, including real latency • Many tools offer such solutions – Measure, visualize, alerts
  8. End User Monitoring • When is a page fully loaded? • Take care: some tools are biased
  9. End User Monitoring • Measure yourself • Using – Resource Timing API – User Timing API – Custom JS • Send metrics from browsers to your own sync server – all users / samples
  10. End User Monitoring: What to measure • Measure page load time (as you define it) • Measure loading errors • Measure number of page views • Group by geo & application • Group by browser
  11. End User Monitoring: Alert on • Sudden drop in traffic from a certain geo • Sudden increase in traffic • Increase in loading times • Increase in errors – From a specific browser
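
The "sudden drop in traffic from a certain geo" alert could be sketched roughly as follows. This is only an illustration, not the speakers' implementation: the function name `traffic_drop_alerts`, the 50% `drop_ratio` threshold, and the sample page-view counts are all assumptions made for the example.

```python
from statistics import mean


def traffic_drop_alerts(history, current, drop_ratio=0.5):
    """Return (geo, baseline, now) for geos whose traffic dropped sharply.

    history: dict mapping geo -> list of page-view counts from earlier windows
    current: dict mapping geo -> page-view count in the latest window
    drop_ratio: hypothetical threshold; alert when now < drop_ratio * baseline
    """
    alerts = []
    for geo, past in history.items():
        baseline = mean(past)
        now = current.get(geo, 0)  # a geo missing from `current` counts as zero traffic
        if baseline > 0 and now < drop_ratio * baseline:
            alerts.append((geo, baseline, now))
    return alerts


# Made-up page-view counts per geo, for illustration only.
history = {"US": [1000, 1100, 950], "DE": [400, 420, 410], "BR": [300, 310, 290]}
current = {"US": 1020, "DE": 120, "BR": 305}

alerts = traffic_drop_alerts(history, current)
for geo, baseline, now in alerts:
    print(f"ALERT {geo}: {now} page views vs baseline {baseline:.0f}")
```

Grouping by geo first is what makes the alert precise: a drop in one region can be invisible in the global total.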
  12. Webapp DB • What to Measure? Is Alive?
  13. Is Alive • Measure a process's liveness – Is the process running? • Measure a process's responsiveness – Does the process respond to a request? • Alert on instance down – And auto-restart it • Alert on all instances down
  14. Is Alive • A variety of great tools • Tools that perform “ping” tests • Tools that call a designated URL for responsiveness tests • Is alive != Availability – Is alive is per host – Availability is about the system as a whole
  15. Webapp DB • What to Measure? Request performance
  16. Request Monitoring • Measure how your application performs – Regardless of networking to the user – Regardless of latency • Measuring on the server, per server • Many tools provide such solutions – Measure, visualize, alerts
  17. Request Monitoring • But many tools miss the branching point – Branching point: the point in your code at which it decides which branch of execution to perform for a request • Issues with aggregation, what is monitored, alert flexibility • But still, there are some great tools
  18. Request Monitoring: What to measure • Measure request rate • Measure performance histogram • Measure error rate, by error type, HTTP response code • Group by request type (as you define it) • Group by host, application, data center • Group by error type (as you define it)
  19. Do not use Average • Don’t use the average for performance • Instead, use the median and the 95th and 99th percentiles.
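
A small sketch of why the average misleads for performance data. The latency samples below are invented for illustration, and `latency_summary` is a hypothetical helper; the percentile calculation uses Python's standard `statistics.quantiles`.

```python
import statistics


def latency_summary(samples_ms):
    """Summarize latencies with median, p95 and p99 (mean included only for comparison)."""
    # quantiles(n=100) returns the 99 cut points p1..p99, so p95 is index 94.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.mean(samples_ms),
        "median": statistics.median(samples_ms),
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Made-up data: 90 fast requests, 7 slow ones, 3 very slow ones (milliseconds).
samples = [20] * 90 + [200] * 7 + [3000] * 3
summary = latency_summary(samples)
print(summary)
```

With this data the mean is 122 ms, a latency that no request actually experienced, while the median (20 ms), 95th percentile (200 ms) and 99th percentile (3000 ms) describe the typical request and the tail separately.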
  20. Request Monitoring: What to Visualize • Request rate (RPM) • Request performance – Median, 95th and 99th percentiles on a moving window
  21. Request Monitoring: What to Visualize • Errors – Rate, percent (compared to request rate) – Top X errors by percent – Separate system and application errors – You will always have application errors – You should have exactly 0 system errors
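
The error views above (percent of request rate, top X errors, system versus application split) could be computed along these lines. This is a sketch under assumptions: the `error_report` function, the `"system"`/`"application"` labels, and the error type names are hypothetical, not taken from the talk.

```python
from collections import Counter


def error_report(total_requests, errors, top_x=3):
    """Build the error figures to visualize.

    errors: list of (kind, error_type) tuples, where kind is
            "system" or "application" (labels assumed for this sketch).
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    by_type = Counter(error_type for _, error_type in errors)
    by_kind = Counter(kind for kind, _ in errors)
    return {
        # Percent of all requests that ended in an error.
        "error_percent": 100.0 * len(errors) / total_requests,
        # Top X error types, each as a percent of all requests.
        "top_errors": [
            (error_type, 100.0 * count / total_requests)
            for error_type, count in by_type.most_common(top_x)
        ],
        "system_errors": by_kind["system"],
        "application_errors": by_kind["application"],
    }


# Made-up window: 1000 requests, 50 of them failed.
errors = (
    [("application", "ValidationError")] * 30
    + [("application", "NotFound")] * 15
    + [("system", "DbConnectionTimeout")] * 5
)
report = error_report(1000, errors)
print(report)
if report["system_errors"] > 0:
    print("ALERT: system errors should be exactly 0")
```

Keeping the system count separate lets the alert threshold be zero for system errors while application errors are judged as a percentage.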
  22. Request Monitoring: Alert on • Big changes in traffic • Increase in response times • Increase in errors • System errors
  23. Webapp DB • What to Measure? Resource Utilization
  24. Resources • System resources – CPU, memory, IO, storage, network • Resource pools – Database connection pools – HTTP connection pools – Thread pools – Other resource pools
  25. Resource Monitoring: What to measure • Measure resource utilization – Percent of resource used • Measure resource acquisition queue – Time to acquire – Acquire timeouts – Usage timeouts
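
As a rough illustration of measuring a resource pool, the sketch below wraps a fixed-size pool with counters for utilization, time to acquire, and acquire timeouts. `MeasuredPool` is a hypothetical class written for this example (it tracks slots only, not real connections, and does not cover usage timeouts); it is not an API of any particular pooling library.

```python
import threading
import time


class MeasuredPool:
    """A fixed-size pool of slots that records acquisition metrics."""

    def __init__(self, size):
        self.size = size
        self._slots = threading.BoundedSemaphore(size)
        self._lock = threading.Lock()
        self.in_use = 0
        self.acquire_times = []    # seconds spent waiting, per successful acquire
        self.acquire_timeouts = 0  # acquires that gave up waiting

    def acquire(self, timeout):
        start = time.monotonic()
        ok = self._slots.acquire(timeout=timeout)
        elapsed = time.monotonic() - start
        with self._lock:
            if ok:
                self.in_use += 1
                self.acquire_times.append(elapsed)
            else:
                self.acquire_timeouts += 1
        return ok

    def release(self):
        with self._lock:
            self.in_use -= 1
        self._slots.release()

    def utilization_percent(self):
        with self._lock:
            return 100.0 * self.in_use / self.size


pool = MeasuredPool(size=2)
first = pool.acquire(timeout=0.01)
second = pool.acquire(timeout=0.01)
third = pool.acquire(timeout=0.01)  # pool is exhausted, so this times out
print(pool.utilization_percent(), pool.acquire_timeouts, len(pool.acquire_times))
pool.release()
print(pool.utilization_percent())
```

A pool at 100% utilization with a growing number of acquire timeouts is the kind of signal that the utilization figure alone would not explain.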
  26. Resource Monitoring: What to measure • Group by resource type and pool • Group by host, application, data center • Group by error type (as you define it). Alert on • Resource over-utilization – avg usage over XX% in a time window
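
The "avg usage over XX% in a time window" rule might look like the sketch below. The class name `UtilizationWindow`, the window size of 3 samples, and the 80% threshold are assumptions chosen for the example; the slide deliberately leaves XX unspecified.

```python
from collections import deque


class UtilizationWindow:
    """Alert when the average utilization over the last `size` samples exceeds a threshold."""

    def __init__(self, size, threshold_percent):
        self.samples = deque(maxlen=size)  # old samples fall off automatically
        self.threshold = threshold_percent

    def add(self, percent_used):
        """Record a sample and return True if the window is over-utilized."""
        self.samples.append(percent_used)
        return self.over_utilized()

    def over_utilized(self):
        # Do not alert until the window is full; a single sample is not a trend.
        if len(self.samples) < self.samples.maxlen:
            return False
        return sum(self.samples) / len(self.samples) > self.threshold


window = UtilizationWindow(size=3, threshold_percent=80)
readings = [70, 100, 60, 60, 90, 95]  # made-up CPU percentages
results = [window.add(p) for p in readings]
print(results)
```

Here the isolated spike to 100% does not alert, while the sustained rise at the end does, which is the point of averaging over a window instead of alerting on single samples.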
  27. Webapp DB • What to Measure? Database Monitor
  28. Database monitoring: Depends on the database, but still • Storage • Replication “lag” • Slow operations • Resource usage
  29. Monitoring at Wix
  30. Precise information • Alert the right person • Automation
  31. Service is alive • Is my application alive on the minimum number of instances required by my SLA? • 2 out of 5 instances of my-app are not responding to isAlive • my-app requires a minimum of 3 instances to meet the SLA
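
The my-app example can be worked through in a few lines. This is a hedged sketch: `sla_status` and the instance names are hypothetical, and in the setup described on the following slides the per-instance results and the planned minimum would come from Nginx and ZooKeeper, not from hard-coded dictionaries.

```python
def sla_status(is_alive_by_instance, min_instances):
    """Turn per-instance isAlive results into a system-level availability answer."""
    alive = [name for name, ok in is_alive_by_instance.items() if ok]
    down = [name for name, ok in is_alive_by_instance.items() if not ok]
    return {
        "alive": len(alive),
        "down": down,
        "meets_sla": len(alive) >= min_instances,
        "spare": len(alive) - min_instances,  # instances that can still fail safely
    }


# The slide's scenario: 5 instances of my-app, 2 not responding, SLA minimum of 3.
checks = {
    "my-app-1": True,
    "my-app-2": False,
    "my-app-3": True,
    "my-app-4": False,
    "my-app-5": True,
}
status = sla_status(checks, min_instances=3)
print(status)
```

In this scenario the SLA is still met, but with zero spare instances, so one more failure breaches it. That is the difference between "is alive" (per host) and availability (the system as a whole).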
  32. (Diagram) Alert • Sensu: queries Nginx, alert & SLA • ZooKeeper: planned configuration • Service owner • Nginx: service load balancer • Is-alive
  33. (Diagram) Alert • Sensu: queries Nginx, alert & SLA • ZooKeeper: planned configuration • Service owner • Nginx: service load balancer • Is-alive • Alert the right person • Precise information • Automation
  34. Service anomalies • Backend anomalies • Identify unhealthy KPIs per endpoint • Abnormal increase in error rate for class.method.get
  35. (Diagram) Anomaly Alert • Anodot: time series anomaly detection, alerts & graphs • statsd: stats aggregation, forwarding metrics • JVM servers: metrics library • metrics / 1m • Graphs
  36. (Diagram) Anomaly Alert • Anodot: time series anomaly detection, alerts & graphs • statsd: stats aggregation, forwarding metrics • JVM servers: metrics library • metrics / 1m • Graphs • Precise information • Alert the right person • Automation
  37. Service anomalies • Frontend anomalies • Browser (client) generated KPIs • User experience: are users affected or not? How and where?
  38. (Diagram) Anomaly Alert • Storm & Esper: realtime streaming processing • Metrics / 1m • Client: JS in browser • events • Graphs • Logger • flume • events • Anodot: time series anomaly detection, alerts & graphs
  39. (Diagram) Anomaly Alert • Storm & Esper: realtime streaming processing • Metrics / 1m • Client: JS in browser • events • Graphs • Logger • flume • events • Anodot: time series anomaly detection, alerts & graphs • Precise information • Alert the right person • Automation
  40. Alert management • What are the active alerts? • What is the root cause? • Is it correlated to a change?
  41. (Diagram) Alert • BigPanda: central alerts & changes • Alerts & Changes • Changes: deployments, Chef uploads, A/B, F-Toggle, Exp. • Alerts: NewRelic, Sensu, Nagios, PingDom • Web UI
  42. (Diagram) Alert • BigPanda: central alerts & changes • Alerts & Changes • Changes: deployments, Chef uploads, A/B, F-Toggle, Exp. • Alerts: NewRelic, Sensu, Nagios, PingDom • Web UI • Precise information • Alert the right person • Automation
  43. Questions?
