This document discusses holistic approaches to monitoring systems and applications. It emphasizes the importance of monitoring business metrics, system performance, and failure metrics. It recommends defining metrics to monitor, collecting code-level metrics using tools like StatsD and Graphite, and collecting environment metrics from operating systems and databases. It also stresses the importance of visualization through different types of dashboards and anomaly detection. Action items include creating useful alerts and dashboards, adding anomaly detection, and exploiting failures to improve monitoring.
Holistic Approach To Monitoring
1. Thank you to our Sponsors
A Holistic Approach to Monitoring
Melanie Cey – Yardi Systems Inc.
Media Sponsor:
2. @melaniemj
Systems Analyst in DevOps (Web Operations) @ Yardi
• 5 years Programming
• 3.5 years Team Lead/Project Manager
• 4 years Systems Administration/Analysis
4. Because
• Customers should not alert you to failure
• Business metrics matter
• When something fails you need enough info to know why
• Agile teams release frequently
• No one can afford to be reactive
7. Definition: What to measure
• Business Metrics & Events
- Login/logout
- Sign up, buy something
- Sent email
• System Events, Performance and Utilization Metrics
- Web Service Call details (counter / time taken)
- Deployments
- Cache system (e.g. Redis or other) hits / misses
- Environment performance
• Failure Metrics
- Exceptions, segregated by type / app / server of origin
- Number and type of errors that reached customers
9. Code Collection – Add / Refine Stats
• Developer Friendly Platform
- Developers must be able to add stats ‘without permission’
- Create own dashboards
- Tools with APIs
- Build client library for sending stats
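The "client library" bullet above can be sketched as a minimal StatsD-style sender. This is a hypothetical illustration, not the speaker's actual library: the host, port, prefix, and metric names are placeholders, and the wire format follows StatsD's simple `name:value|type` convention.

```python
import socket

class StatsClient:
    """Minimal StatsD-style client: fire-and-forget counters and timers over UDP."""

    def __init__(self, host="127.0.0.1", port=8125, prefix="myapp"):
        self.addr = (host, port)
        self.prefix = prefix
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, payload: str) -> None:
        # UDP is intentionally lossy: sending a stat must never break the app.
        try:
            self.sock.sendto(f"{self.prefix}.{payload}".encode(), self.addr)
        except OSError:
            pass

    def incr(self, name: str, count: int = 1) -> None:
        self._send(f"{name}:{count}|c")          # counter

    def timing(self, name: str, ms: float) -> None:
        self._send(f"{name}:{ms:.1f}|ms")        # timer in milliseconds

stats = StatsClient()
stats.incr("login.success")
stats.timing("login.duration", 42.5)
```

Because sends are unacknowledged UDP wrapped in a try/except, instrumented code pays almost nothing and cannot fail because the stats daemon is down.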
10. Code Collection – Graphite
• Using Graphite
- (Etsy, 2011) StatsD, a Node.js daemon, collects and aggregates stats received over UDP
- Sends stats (as strings) to Graphite, where they are stored in Whisper (RRD-like) files
- Graphite has a web interface, a URL API (with a JSON output option), and built-in ability to create dashboards
- Can receive stats from anything and is easy to setup
- Open source with lots of industry use
- Plenty of built in functions to help analyze and visualize data
14. Code Collection – Logging
• Metrics – what and when
• Logging – how and why
15. Code Collection – Add / Refine Logging
• Why Log and what to log?
- Log when you record a statistic
• Logging Best Practices
- Log locally
- Don’t log to your production database server
- Don’t fail if you can’t log
- Log in GMT
- Keep your logs, ship them to a central location
- Aggregate recent data in real time if you can
- Log more than you think you need to
- Use a parse friendly format
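Several of the practices above (log in GMT, use a parse-friendly format, don't fail if you can't log) can be combined in one sketch using Python's standard `logging` module. One JSON object per line is an assumption here, not the only parse-friendly choice; key=value pairs work too.

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line: unambiguous to parse, ship, and aggregate."""
    converter = time.gmtime  # timestamps in GMT/UTC, as recommended

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()       # in production: a local file, then ship centrally
handler.setFormatter(JsonFormatter())
log = logging.getLogger("myapp")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("user login")
```

Logging locally first and shipping the files to a central location keeps the hot path cheap; the structured format is what makes near-real-time aggregation practical later.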
24. Action
• Useful dashboards help create useful alerts
• Add / refine anomaly detection & alerting
• Know your own boundaries
• A fuzzy threshold is better than no threshold
• Attach graphs to alerts
• Exploit failures
- Add alerts after a root cause analysis (RCA)
- Theorize other possible causes or conditions
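"Attach graphs to alerts" can be sketched as building the alert body with a link to the relevant Graphite render for the breaching metric, so responders see recent history rather than a single number. The Graphite host here is a placeholder, and the URL parameters (`target`, `from`) are Graphite's standard render API parameters.

```python
from urllib.parse import urlencode

GRAPHITE = "http://graphite.example.com/render"   # placeholder host

def alert_message(metric: str, value: float, threshold: float,
                  minutes: int = 60) -> str:
    """Build an alert body that links to a graph of the metric's recent history."""
    graph_url = f"{GRAPHITE}?{urlencode({'target': metric, 'from': f'-{minutes}min'})}"
    return (f"ALERT {metric}: {value} breached threshold {threshold} "
            f"(last {minutes} min: {graph_url})")

print(alert_message("web.errors.rate", 12.4, 5.0))
```

Even a rough (fuzzy) threshold wired up this way is actionable, and the threshold value can always be tuned after the first few firings.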
Spent the last 4 years strictly working on proactive monitoring measures for various systems
Is your site 100% functional just because you can hit your homepage?
When I came back from my 4 mo mat leave in 2010
Reactive: Bugs vs Features
~9 years ago the first “live” aggregation of stats I saw was 24 hours after the fact, using MS Log Parser and presented via an SSMS report: “slow pages” and “pages that had errors”
- This was better than nothing – and I have seen systems with literally just up/down checks on the home page as their complete monitoring set
Definition: Define what to measure/observe
Code Collection: Add / refine (necessary) stats and logging into your codebase
Environment Collection: Add / refine environmental metrics
Visualization: Build / refine dashboards
Action: Add / refine anomaly detection & alerting
“3 armed sweaters” and “screwed users”
Choose a developer friendly platform
Spend more time analyzing the meaning of the metrics than code that collects, moves, stores and displays metrics
RRD: Round Robin Database
Whisper is a fixed-size database, similar in design to RRD. It provides fast, reliable storage of numerical data over time.
Metrics will only ever tell you part of the story
Note: Hypervisors
How: Performance Counters using WMI
I have the distinct pleasure of living in both worlds so this is part of the information I measure
Linux servers you can use collectd or custom scripts
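The "custom scripts" option for Linux environment metrics can be as small as a cron-driven script that samples the host and emits Graphite-style lines. This sketch uses only the standard library; the `env.` metric prefix is an assumption, and it is Unix-only (`os.getloadavg`).

```python
import os
import shutil
import time

def environment_snapshot(path: str = "/") -> list[str]:
    """Sample basic host metrics and return Carbon plaintext lines (metric value ts)."""
    ts = int(time.time())
    load1, load5, load15 = os.getloadavg()          # Unix only
    disk = shutil.disk_usage(path)
    host = os.uname().nodename.replace(".", "_")    # dots would split the metric path
    return [
        f"env.{host}.load.1min {load1:.2f} {ts}",
        f"env.{host}.disk.used_pct {disk.used / disk.total * 100:.1f} {ts}",
    ]

for line in environment_snapshot():
    print(line)
```

On Windows the equivalent data comes from performance counters via WMI, as noted above; the output format can stay identical so both worlds land in the same Graphite tree.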
12 hours of data, one pixel per minute
“What is normal?”
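One simple, illustrative answer to "what is normal?" is a rolling baseline: flag a value as anomalous when it falls more than `k` standard deviations from the recent window. This is a sketch of the general technique, not the tooling from the talk; window size and `k` are arbitrary placeholders to tune per metric.

```python
import statistics
from collections import deque

class AnomalyDetector:
    """Flag values more than `k` standard deviations from a rolling baseline."""

    def __init__(self, window: int = 60, k: float = 3.0):
        self.history = deque(maxlen=window)
        self.k = k

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to the current window."""
        anomalous = False
        if len(self.history) >= 10:               # need a minimal baseline first
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history)
            anomalous = stdev > 0 and abs(value - mean) > self.k * stdev
        self.history.append(value)
        return anomalous

d = AnomalyDetector()
for v in [10, 11, 10, 12, 11, 10, 11, 12, 10, 11]:
    d.observe(v)
print(d.observe(50))   # far outside the baseline
```

A detector like this is exactly a "fuzzy threshold": it will misfire on seasonal or spiky metrics, but it beats having no threshold at all and can be refined as you learn what normal looks like.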
What’s important changes as application and traffic changes
Add alerts around things that fail
Add and remove dashboard items
A fuzzy threshold is better than no threshold – and can always be changed
Scaling monitoring
Monitoring the monitoring
Auto addition and removal of nodes and stats (environmentals)
Too many monitoring tools, not enough analysis tools