Thank you to our Sponsors
A Holistic Approach to Monitoring
Melanie Cey – Yardi Systems Inc.
@melaniemj
Systems Analyst in DevOps (Web Operations) @ Yardi
• 5 years Programming
• 3.5 years Team Lead/Project Manager
• 4 years Systems Administration/Analysis
Because
• Customers should not alert you to failure
• Business metrics matter
• When something fails you need enough info to know why
• Agile teams release frequently
• No one can afford to be reactive
When you release code…
Monitoring Cycles
Definition: What to measure
• Business Metrics & Events
- Login/logout
- Sign up, buy something
- Sent email
• System Events, Performance and Utilization Metrics
- Web Service Call details (counter / time taken)
- Deployments
- Cache system (e.g. Redis or other) hits / misses
- Environment performance
• Failure Metrics
- Exceptions, segregated by type / app / server of origin
- Number and type of errors that reached customers
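One way to keep exceptions "segregated by type / app / server of origin" is to encode those three dimensions in the metric name. The sketch below is only an illustration: the `errors.<app>.<server>.<type>` naming scheme, the `failure_counts` counter and the `billing` / `web01` names are assumptions, not something from the talk.

```python
import socket
from collections import Counter
from typing import Optional

# Hypothetical in-memory counter; a real setup would forward to a stats backend.
failure_counts = Counter()


def failure_metric(app: str, exc: BaseException, server: Optional[str] = None) -> str:
    """Build a dotted metric name: errors.<app>.<server>.<ExceptionType>."""
    host = (server or socket.gethostname()).replace(".", "_")
    return f"errors.{app}.{host}.{type(exc).__name__}"


def record_failure(app: str, exc: BaseException, server: Optional[str] = None) -> str:
    """Count one failure under its app / server / exception-type name."""
    name = failure_metric(app, exc, server)
    failure_counts[name] += 1
    return name


# Usage: count a failure at the point where it is caught.
try:
    int("not a number")
except ValueError as exc:
    record_failure("billing", exc, server="web01")
```

With names built this way, a dashboard can group by any one dimension (all errors for one app, or one exception type across servers) using wildcards.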
Code Collection
Code Collection – Add / Refine Stats
• Developer Friendly Platform
- Developers need to be able to add stats ‘without permission’
- Create own dashboards
- Tools with APIs
- Build client library for sending stats
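A client library for sending stats can be very small. The following is a minimal sketch, assuming a StatsD-style daemon listening for UDP on port 8125 and the usual `name:value|type` line format (`c` for counters, `ms` for timers). The class name, the `myapp` prefix and the host are hypothetical.

```python
import socket
import time
from contextlib import contextmanager


class StatsClient:
    """Fire-and-forget client for a StatsD-style UDP daemon (illustrative only)."""

    def __init__(self, host="127.0.0.1", port=8125, prefix="myapp"):
        self._addr = (host, port)
        self._prefix = prefix
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _format(self, name, value, kind):
        # StatsD line format: <metric name>:<value>|<type>
        return f"{self._prefix}.{name}:{value}|{kind}"

    def _send(self, payload):
        try:
            self._sock.sendto(payload.encode("ascii"), self._addr)
        except OSError:
            pass  # sending a stat must never break the application

    def incr(self, name, count=1):
        """Increment a counter, e.g. logins or emails sent."""
        self._send(self._format(name, count, "c"))

    def timing(self, name, millis):
        """Record a duration in milliseconds."""
        self._send(self._format(name, int(millis), "ms"))

    @contextmanager
    def timer(self, name):
        """Time the wrapped block and report it as a timing stat."""
        start = time.monotonic()
        try:
            yield
        finally:
            self.timing(name, (time.monotonic() - start) * 1000)


stats = StatsClient()
stats.incr("login")                       # business event
with stats.timer("webservice.get_unit"):  # web service call: time taken
    time.sleep(0.01)
```

Because UDP is connectionless, a developer can add a stat without coordinating with anyone, and a missing daemon costs the application nothing.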
Code Collection – Graphite
• Using Graphite
- StatsD (Etsy, 2011): a Node.js daemon that collects stats over UDP and aggregates them
- Sends stats (as strings) to Graphite, where they are stored in Whisper (like RRD) files
- Graphite has a web interface, a URL API (with a JSON output option) and a built-in ability to create dashboards
- Can receive stats from anything and is easy to set up
- Open source with lots of industry use
- Plenty of built-in functions to help analyze and visualize data
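To make "stats as strings" and the URL API concrete, here is a small sketch of both ends. It assumes Graphite's plaintext line format (`<path> <value> <timestamp>`) and its `/render` endpoint with `target`, `from` and `format=json` parameters. The host name and metric path are made up for illustration.

```python
import time
from urllib.parse import urlencode


def graphite_line(path, value, timestamp=None):
    """Format one data point as a Graphite plaintext line."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{path} {value} {ts}\n"


def render_url(base_url, target, from_="-1h", fmt="json"):
    """Build a render URL that returns the target series as JSON."""
    query = urlencode({"target": target, "from": from_, "format": fmt})
    return f"{base_url.rstrip('/')}/render?{query}"


# Hypothetical host and metric name.
print(graphite_line("stats.myapp.login", 3, 1700000000), end="")
print(render_url("http://graphite.example.com", "stats.myapp.login"))
```

The same URL with a different `format` value yields a rendered graph instead of JSON, which is convenient when attaching graphs to alerts.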
Code Collection – Graphite Samples
Code Collection – Logging
• Metrics – what and when
• Logging – how and why
Code Collection – Add / Refine Logging
• Why Log and what to log?
- Log when you record a statistic
• Logging Best Practices
- Log locally
- Don’t log to your production database server
- Don’t fail if you can’t log
- Log in GMT
- Keep your logs, ship them to a central location
- Aggregate recent data in real time if you can
- Log more than you think you need to
- Use a parse-friendly format
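Several of these practices can be combined in a few lines with Python's standard `logging` module. This is a sketch under assumptions: JSON is used as the parse-friendly format, and the field names and the `build_logger` helper are illustrative choices rather than anything prescribed in the talk.

```python
import json
import logging
import time

# Don't fail if you can't log: suppress errors raised inside handlers.
logging.raiseExceptions = False


class JsonUtcFormatter(logging.Formatter):
    """One JSON object per line, timestamps in GMT/UTC."""

    converter = time.gmtime  # log in GMT rather than local time

    def format(self, record):
        entry = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(name, path):
    """Log to a local file; fall back to a no-op handler if it can't be opened."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    try:
        handler = logging.FileHandler(path)  # log locally, ship the file later
    except OSError:
        handler = logging.NullHandler()
    handler.setFormatter(JsonUtcFormatter())
    logger.addHandler(handler)
    return logger
```

A call such as `build_logger("checkout", "checkout.log").info("order %s placed", order_id)` (names hypothetical) then writes one JSON line that a log shipper can forward to a central location.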
Environment Collection
Environment Collection
• Operating Systems
- CPU, free memory, paging, I/O latency (ms), network utilization
• Database Management Systems
- Transactions, blocks
• Application Containers
- Memory utilization, IIS requests (current & queued), restarts, cache statistics, etc.
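As a rough sketch of environment collection, the snippet below samples two operating-system readings using only the Python standard library and formats them as per-host metric lines. It is an assumption-laden illustration: the `servers.<host>.<metric>` naming is invented here, and a real deployment would more likely rely on performance counters or an agent for CPU, memory and container metrics.

```python
import os
import shutil
import socket
import time


def sample_environment(disk_path=os.path.abspath(os.sep)):
    """Take a small snapshot of OS-level readings (stdlib only)."""
    readings = {}
    if hasattr(os, "getloadavg"):  # not available on every platform
        try:
            readings["load.1min"] = os.getloadavg()[0]
        except OSError:
            pass
    usage = shutil.disk_usage(disk_path)
    if usage.total:
        readings["disk.used_pct"] = round(100 * usage.used / usage.total, 1)
    return readings


def to_lines(readings, host=None, now=None):
    """Turn readings into '<path> <value> <timestamp>' lines, one per metric."""
    host = (host or socket.gethostname()).replace(".", "_")
    ts = int(time.time() if now is None else now)
    return [f"servers.{host}.{key} {value} {ts}" for key, value in sorted(readings.items())]


for line in to_lines(sample_environment()):
    print(line)
```

Run on a schedule, the output lines can go to the same store as the application stats, so code and environment metrics can be correlated on one dashboard.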
Visualization
Visualization
• Types of Dashboards
- Feature based
- Resource based (server or container)
- Performance
- Anomaly detection
- Correlation
- Root Cause Analysis
- “Overview”
Visualization – Tasseo
• https://github.com/obfuscurity/tasseo
Visualization – Cubism
• https://github.com/square/cubism
Action: Putting inside knowledge to work
Action
• Useful dashboards help create useful alerts
• Add / refine anomaly detection & alerting
• Know your own boundaries
• A fuzzy threshold is better than no threshold
• Attach graphs to alerts
• Exploit failures
- Add an alert after RCA
- Theorize other possible causes or conditions
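A "fuzzy threshold" can be as simple as a band around recent behaviour. The sketch below is one possible reading, not the method used in the talk: it flags a value that falls outside the mean plus or minus a tolerance times the standard deviation of a recent window. The function names, the tolerance of 3 and the sample numbers are assumptions.

```python
from statistics import mean, pstdev


def fuzzy_bounds(history, tolerance=3.0):
    """Return (low, high) limits derived from recent observations."""
    center = mean(history)
    spread = pstdev(history)
    return center - tolerance * spread, center + tolerance * spread


def is_anomaly(value, history, tolerance=3.0):
    """True when value lies outside the band implied by history."""
    low, high = fuzzy_bounds(history, tolerance)
    return value < low or value > high


# Hypothetical logins-per-minute readings.
recent = [100, 102, 98, 101, 99]
print(is_anomaly(103, recent))  # inside the band
print(is_anomaly(150, recent))  # well outside the band
```

The tolerance is the knob to revisit after each incident: start loose, then tighten it once the dashboards show what normal looks like. A completely flat history gives a zero-width band, so a minimum spread may be worth adding in practice.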
Monitoring Cycles
More?
• http://graphite.readthedocs.org/en/latest/
• http://codeascraft.com/
• http://vimeo.com/monitorama
• Twitter #devops #monitoringlove
• https://github.com/monitoringsucks
• http://www.opsschool.org/en/latest/

Editor's Notes

  1. Spent the last 4 years strictly working on proactive monitoring measures for various systems
  2. Is your site 100% functional just because you can hit your homepage? When I came back from my 4-month maternity leave in 2010
  3. Reactive: bugs vs. features. About 9 years ago, the first “live” aggregation of stats I saw arrived 24 hours after the fact: it used Microsoft Log Parser and was presented via SSMS reports on “slow pages” and “pages that had errors”. This was better than nothing; I have seen systems with literally just up/down checks on the home page as their complete monitoring set.
  4. Definition: define what to measure/observe. Code Collection: add / refine (necessary) stats and logging in your codebase. Environment Collection: add / refine environmental metrics. Visualization: build / refine dashboards. Action: add / refine anomaly detection & alerting.
  5. “3 armed sweaters” and “screwed users”
  6. Definition: define what to measure/observe. Code Collection: add / refine (necessary) stats and logging in your codebase. Environment Collection: add / refine environmental metrics. Visualization: build / refine dashboards. Action: add / refine anomaly detection & alerting.
  7. Choose a developer-friendly platform. Spend more time analyzing the meaning of the metrics than on the code that collects, moves, stores and displays them.
  8. RRD: Round Robin Database. Whisper is a fixed-size database, similar in design to RRD. It provides fast, reliable storage of numerical data over time.
  9. Metrics will only ever tell you part of the story
  10. Definition: define what to measure/observe. Code Collection: add / refine (necessary) stats and logging in your codebase. Environment Collection: add / refine environmental metrics. Visualization: build / refine dashboards. Action: add / refine anomaly detection & alerting.
  11. Note: hypervisors. How: performance counters using WMI. I have the distinct pleasure of living in both worlds, so this is part of the information I measure. On Linux servers you can use collectd or custom scripts.
  12. Definition: define what to measure/observe. Code Collection: add / refine (necessary) stats and logging in your codebase. Environment Collection: add / refine environmental metrics. Visualization: build / refine dashboards. Action: add / refine anomaly detection & alerting.
  13. 12 hours, with each minute as 1 pixel. “What is normal?”
  14. Definition: define what to measure/observe. Code Collection: add / refine (necessary) stats and logging in your codebase. Environment Collection: add / refine environmental metrics. Visualization: build / refine dashboards. Action: add / refine anomaly detection & alerting.
  15. What’s important changes as the application and traffic change. Add alerts around things that fail. Add and remove dashboard items. A fuzzy threshold is better than no threshold, and can always be changed.
  16. Definition: define what to measure/observe. Code Collection: add / refine (necessary) stats and logging in your codebase. Environment Collection: add / refine environmental metrics. Visualization: build / refine dashboards. Action: add / refine anomaly detection & alerting.
  17. Scaling monitoring. Monitoring the monitoring. Auto addition and removal of nodes and stats (environmentals). Too many monitoring tools, not enough analysis tools.