This document discusses holistic approaches to monitoring systems and applications. It emphasizes the importance of monitoring business metrics, system performance, and failure metrics. It recommends defining metrics to monitor, collecting code-level metrics using tools like StatsD and Graphite, and collecting environment metrics from operating systems and databases. It also stresses the importance of visualization through different types of dashboards and anomaly detection. Action items include creating useful alerts and dashboards, adding anomaly detection, and exploiting failures to improve monitoring.
Holistic Approach To Monitoring
1. Thank you to our Sponsors
A Holistic Approach to Monitoring
Melanie Cey – Yardi Systems Inc.
Media Sponsor:
2. @melaniemj
Systems Analyst in DevOps (Web Operations) @ Yardi
• 5 years Programming
• 3.5 years Team Lead/Project Manager
• 4 years Systems Administration/Analysis
4. Because
• Customers should not alert you to failure
• Business metrics matter
• When something fails you need enough info to know why
• Agile teams release frequently
• No one can afford to be reactive
7. Definition: What to measure
• Business Metrics & Events
- Login/logout
- Sign up, buy something
- Sent email
• System Events, Performance and Utilization Metrics
- Web Service Call details (counter / time taken)
- Deployments
- Cache system (e.g. Redis or other) hits / misses
- Environment performance
• Failure Metrics
- Exceptions, segregated by type / app / server of origin
- Number and type of errors that reached customers
9. Code Collection – Add / Refine Stats
• Developer Friendly Platform
- Developers must be able to add stats ‘without permission’
- Create own dashboards
- Tools with APIs
- Build client library for sending stats
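The "client library" bullet above can be sketched as a minimal StatsD-style sender. This is a hypothetical illustration, not the speaker's actual library: the host, port, prefix, and metric names are placeholders, and the wire format follows StatsD's simple `name:value|type` convention.

```python
import socket

class StatsClient:
    """Minimal StatsD-style client: fire-and-forget counters and timers over UDP."""

    def __init__(self, host="127.0.0.1", port=8125, prefix="myapp"):
        self.addr = (host, port)
        self.prefix = prefix
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, payload: str) -> None:
        # UDP is intentionally lossy: sending a stat must never break the app.
        try:
            self.sock.sendto(f"{self.prefix}.{payload}".encode(), self.addr)
        except OSError:
            pass

    def incr(self, name: str, count: int = 1) -> None:
        self._send(f"{name}:{count}|c")          # counter

    def timing(self, name: str, ms: float) -> None:
        self._send(f"{name}:{ms:.1f}|ms")        # timer in milliseconds

stats = StatsClient()
stats.incr("login.success")
stats.timing("login.duration", 42.5)
```

Because sends are unacknowledged UDP wrapped in a try/except, instrumented code pays almost nothing and cannot fail because the stats daemon is down.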
10. Code Collection – Graphite
• Using Graphite
- (Etsy, 2011) StatsD, a Node.js daemon, collects and aggregates stats received over UDP
- Sends stats (as strings) to Graphite, where they are stored in Whisper (RRD-like) files
- Graphite has a web interface, a URL API (with a JSON output option), and built-in ability to create dashboards
- Can receive stats from anything and is easy to setup
- Open source with lots of industry use
- Plenty of built in functions to help analyze and visualize data
14. Code Collection – Logging
• Metrics – what and when
• Logging – how and why
15. Code Collection – Add / Refine Logging
• Why Log and what to log?
- Log when you record a statistic
• Logging Best Practices
- Log locally
- Don’t log to your production database server
- Don’t fail if you can’t log
- Log in GMT
- Keep your logs, ship them to a central location
- Aggregate recent data in real time if you can
- Log more than you think you need to
- Use a parse friendly format
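Several of the practices above (log in GMT, use a parse-friendly format, don't fail if you can't log) can be combined in one sketch using Python's standard `logging` module. One JSON object per line is an assumption here, not the only parse-friendly choice; key=value pairs work too.

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line: unambiguous to parse, ship, and aggregate."""
    converter = time.gmtime  # timestamps in GMT/UTC, as recommended

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()       # in production: a local file, then ship centrally
handler.setFormatter(JsonFormatter())
log = logging.getLogger("myapp")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("user login")
```

Logging locally first and shipping the files to a central location keeps the hot path cheap; the structured format is what makes near-real-time aggregation practical later.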
24. Action
• Useful dashboards help create useful alerts
• Add / refine anomaly detection & alerting
• Know your own boundaries
• A fuzzy threshold is better than no threshold
• Attach graphs to alerts
• Exploit failures
- Add alerts after a root cause analysis (RCA)
- Theorize other possible causes or conditions
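"Attach graphs to alerts" can be sketched as building the alert body with a link to the relevant Graphite render for the breaching metric, so responders see recent history rather than a single number. The Graphite host here is a placeholder, and the URL parameters (`target`, `from`) are Graphite's standard render API parameters.

```python
from urllib.parse import urlencode

GRAPHITE = "http://graphite.example.com/render"   # placeholder host

def alert_message(metric: str, value: float, threshold: float,
                  minutes: int = 60) -> str:
    """Build an alert body that links to a graph of the metric's recent history."""
    graph_url = f"{GRAPHITE}?{urlencode({'target': metric, 'from': f'-{minutes}min'})}"
    return (f"ALERT {metric}: {value} breached threshold {threshold} "
            f"(last {minutes} min: {graph_url})")

print(alert_message("web.errors.rate", 12.4, 5.0))
```

Even a rough (fuzzy) threshold wired up this way is actionable, and the threshold value can always be tuned after the first few firings.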
Spent the last 4 years strictly working on proactive monitoring measures for various systems
Is your site 100% functional just because you can hit your homepage?
When I came back from my 4 mo mat leave in 2010
Reactive: Bugs vs Features
~9 years ago the first “live” aggregation of stats I saw was 24 hours after the fact, using MS Log Parser and presented via an SSMS report: “slow pages” and “pages that had errors”
- This was better than nothing – and I have seen systems with literally just up/down checks on the home page as their complete monitoring set
Definition: Define what to measure/observe
Code Collection: Add / refine (necessary) stats and logging into your codebase
Environment Collection: Add / refine environmental metrics
Visualization: Build / refine dashboards
Action: Add / refine anomaly detection & alerting
“3 armed sweaters” and “screwed users”
Choose a developer friendly platform
Spend more time analyzing the meaning of the metrics than code that collects, moves, stores and displays metrics
RRD: Round Robin Database
Whisper is a fixed-size database, similar in design to RRD. It provides fast, reliable storage of numerical data over time.
Metrics will only ever tell you part of the story
Note: Hypervisors
How: Performance Counters using WMI
I have the distinct pleasure of living in both worlds so this is part of the information I measure
Linux servers you can use collectd or custom scripts
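The "custom scripts" option for Linux environment metrics can be as small as a cron-driven script that samples the host and emits Graphite-style lines. This sketch uses only the standard library; the `env.` metric prefix is an assumption, and it is Unix-only (`os.getloadavg`).

```python
import os
import shutil
import time

def environment_snapshot(path: str = "/") -> list[str]:
    """Sample basic host metrics and return Carbon plaintext lines (metric value ts)."""
    ts = int(time.time())
    load1, load5, load15 = os.getloadavg()          # Unix only
    disk = shutil.disk_usage(path)
    host = os.uname().nodename.replace(".", "_")    # dots would split the metric path
    return [
        f"env.{host}.load.1min {load1:.2f} {ts}",
        f"env.{host}.disk.used_pct {disk.used / disk.total * 100:.1f} {ts}",
    ]

for line in environment_snapshot():
    print(line)
```

On Windows the equivalent data comes from performance counters via WMI, as noted above; the output format can stay identical so both worlds land in the same Graphite tree.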
12 hours of data, one pixel per minute
“What is normal?”
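One simple, illustrative answer to "what is normal?" is a rolling baseline: flag a value as anomalous when it falls more than `k` standard deviations from the recent window. This is a sketch of the general technique, not the tooling from the talk; window size and `k` are arbitrary placeholders to tune per metric.

```python
import statistics
from collections import deque

class AnomalyDetector:
    """Flag values more than `k` standard deviations from a rolling baseline."""

    def __init__(self, window: int = 60, k: float = 3.0):
        self.history = deque(maxlen=window)
        self.k = k

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to the current window."""
        anomalous = False
        if len(self.history) >= 10:               # need a minimal baseline first
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history)
            anomalous = stdev > 0 and abs(value - mean) > self.k * stdev
        self.history.append(value)
        return anomalous

d = AnomalyDetector()
for v in [10, 11, 10, 12, 11, 10, 11, 12, 10, 11]:
    d.observe(v)
print(d.observe(50))   # far outside the baseline
```

A detector like this is exactly a "fuzzy threshold": it will misfire on seasonal or spiky metrics, but it beats having no threshold at all and can be refined as you learn what normal looks like.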
What’s important changes as application and traffic changes
Add alerts around things that fail
Add and remove dashboard items
A fuzzy threshold is better than no threshold – and can always be changed
Scaling monitoring
Monitoring the monitoring
Auto addition and removal of nodes and stats (environmentals)
Too many monitoring tools, not enough analysis tools