A Holistic Approach to Monitoring
Melanie Cey – Yardi Systems Inc.
@melaniemj
Systems Analyst in DevOps (Web Operations) @ Yardi
• 5 years Programming
• 3.5 years Team Lead/Project Manager
• 4 years Systems Administration/Analysis
Because
• Customers should not alert you to failure
• Business metrics matter
• When something fails you need enough info to know why
• Agile teams release frequently
• No one can afford to be reactive
When you release code…
Monitoring Cycles
Definition: What to measure
• Business Metrics & Events
- Login/logout
- Sign up, buy something
- Sent email
• System Events, Performance and Utilization Metrics
- Web Service Call details (counter / time taken)
- Deployments
- Cache system (e.g. Redis or other) hits / misses
- Environment performance
• Failure Metrics
- Exceptions, segregated by type / app / server of origin
- Number and type of errors that reached customers
Code Collection
Code Collection – Add / Refine Stats
• Developer Friendly Platform
- Developers must be able to add stats ‘without permission’
- Developers can create their own dashboards
- Tools with APIs
- Build client library for sending stats
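A client library for sending stats can be very small. The following is a minimal sketch, assuming a StatsD-style daemon listening for UDP packets on port 8125 and accepting the `name:value|type` line format. The class name `StatsClient`, the `myapp` prefix and the metric names are hypothetical and are not taken from the talk.

```python
import socket


def format_stat(prefix, name, value, kind):
    """Build one StatsD-style line, e.g. 'myapp.login:1|c'."""
    return f"{prefix}.{name}:{value}|{kind}"


class StatsClient:
    """Fire-and-forget stats sender (hypothetical helper for illustration)."""

    def __init__(self, host="127.0.0.1", port=8125, prefix="myapp"):
        self.addr = (host, port)  # assumed StatsD address
        self.prefix = prefix
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, line):
        try:
            self.sock.sendto(line.encode("ascii"), self.addr)
        except OSError:
            pass  # a monitoring failure must never break the application

    def incr(self, name, count=1):
        """Count an event such as a login or a sent email."""
        self._send(format_stat(self.prefix, name, count, "c"))

    def timing(self, name, millis):
        """Record how long something took, in milliseconds."""
        self._send(format_stat(self.prefix, name, int(millis), "ms"))


stats = StatsClient()
stats.incr("login")
stats.timing("webservice.get_tenant", 42)
```

Because UDP is connectionless, a developer can add a new stat with one line of code and no coordination with whoever runs the daemon.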
Code Collection – Graphite
• Using Graphite
- StatsD (Etsy, 2011): a Node.js daemon that collects stats over UDP and aggregates them
- StatsD sends stats (as strings) to Graphite, where they are stored in Whisper (RRD-like) files
- Graphite has a web interface, a URL API (with a JSON output option) and a built-in ability to create dashboards
- Can receive stats from anything and is easy to set up
- Open source with lots of industry use
- Plenty of built-in functions to help analyze and visualize data
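As an illustration of the URL API with JSON output, here is a small sketch that builds a render URL and reads the latest value from a response. The host `graphite.example.com`, the metric name `stats.myapp.login` and the sample payload are assumptions made for the example; only the `/render` endpoint with its `target`, `from` and `format=json` parameters is Graphite's own.

```python
import json
from urllib.parse import urlencode


def render_url(base, target, frm="-1h"):
    """Build a Graphite render URL that asks for JSON output."""
    query = urlencode({"target": target, "from": frm, "format": "json"})
    return f"{base}/render?{query}"


def latest_value(render_json):
    """Return the most recent non-null value of the first series, or None."""
    series = json.loads(render_json)
    if not series:
        return None
    # Each datapoint is a [value, timestamp] pair; empty intervals are null.
    for value, _timestamp in reversed(series[0]["datapoints"]):
        if value is not None:
            return value
    return None


# Hand-written sample in the shape of a render response (not real data).
sample = (
    '[{"target": "stats.myapp.login", "datapoints": '
    "[[3.0, 1400000000], [5.0, 1400000060], [null, 1400000120]]}]"
)

print(render_url("http://graphite.example.com", "stats.myapp.login"))
print(latest_value(sample))
```

The same JSON can feed a custom dashboard or an alerting script, so graphs and alerts read from one source of data.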
Code Collection – Graphite Samples
Code Collection – Logging
• Metrics – what and when
• Logging – how and why
Code Collection – Add / Refine Logging
• Why log, and what to log?
- Log when you record a statistic
• Logging Best Practices
- Log locally
- Don’t log to your production database server
- Don’t fail if you can’t log
- Log in GMT
- Keep your logs, ship them to a central location
- Aggregate recent data in real time if you can
- Log more than you think you need to
- Use a parse-friendly format
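Several of these practices can be combined in a few lines. The sketch below assumes Python's standard `logging` module and uses one JSON object per line as the parse-friendly format; the formatter name `JsonUtcFormatter`, the `fields` convention and the sample values are hypothetical choices for illustration.

```python
import json
import logging
import time


class JsonUtcFormatter(logging.Formatter):
    """Format each record as one JSON object per line, timestamped in GMT/UTC."""

    converter = time.gmtime  # "Log in GMT"

    def format(self, record):
        entry = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Extra context passed as extra={"fields": {...}} (hypothetical convention).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, sort_keys=True, default=str)


# "Don't fail if you can't log": errors raised inside handlers are ignored.
logging.raiseExceptions = False

logger = logging.getLogger("myapp")
handler = logging.StreamHandler()  # swap in a local file handler in practice
handler.setFormatter(JsonUtcFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Log when you record a statistic, with enough detail to explain it later.
logger.info("login recorded", extra={"fields": {"stat": "myapp.login", "user_id": 17}})
```

A log shipper can then forward these lines to a central location without needing a custom parser.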
Environment Collection
• Operating Systems
- CPU, free memory, paging, I/O latency (ms), network utilization
• Database Management Systems
- Transactions, blocks
• Application Containers
- Memory utilization, current and queued IIS requests, restarts, cache statistics, etc.
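Environment readings can go into the same store as application stats. The following sketch assumes Graphite's plaintext listener on TCP port 2003, which accepts one `path value timestamp` line per reading. The prefix `servers.web01` and the chosen readings are hypothetical; a real collector would gather far more than disk space and load average.

```python
import os
import shutil
import socket
import time


def graphite_lines(prefix, readings, timestamp):
    """Turn {name: value} into Graphite plaintext lines: 'path value timestamp'."""
    return [
        f"{prefix}.{name} {value} {int(timestamp)}\n"
        for name, value in sorted(readings.items())
    ]


def collect():
    """Gather a few readings using only the standard library."""
    disk = shutil.disk_usage(os.path.abspath(os.sep))
    readings = {
        "disk.root.free_bytes": disk.free,
        "disk.root.used_bytes": disk.used,
    }
    if hasattr(os, "getloadavg"):  # not available on Windows
        readings["load.1min"] = os.getloadavg()[0]
    return readings


def send(lines, host="127.0.0.1", port=2003):
    """Ship the lines to a Graphite plaintext listener (assumed address)."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Print instead of sending so the sketch runs without a Graphite server.
for line in graphite_lines("servers.web01", collect(), time.time()):
    print(line, end="")
```

Running a script like this from a scheduler once a minute is one simple way to collect environment metrics alongside a dedicated agent.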
Visualization
• Types of Dashboards
- Feature based
- Resource based (server or container)
- Performance
- Anomaly detection
- Correlation
- Root Cause Analysis
- “Overview”
Visualization – Tasseo
• https://github.com/obfuscurity/tasseo
Visualization – Cubism
• https://github.com/square/cubism
Action: Putting inside knowledge to work
Action
• Useful dashboards help create useful alerts
• Add / refine anomaly detection & alerting
• Know your own boundaries
• A fuzzy threshold is better than no threshold
• Attach graphs to alerts
• Exploit failures
- Add alerts after root cause analysis (RCA)
- Theorize other possible causes or conditions
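One way to read "a fuzzy threshold" is a band around recent behaviour rather than a fixed number. The sketch below is one such interpretation, not a method described in the talk: it flags a value that falls more than `tolerance` standard deviations from the mean of recent samples. The function names and the sample data are hypothetical.

```python
from statistics import mean, stdev


def fuzzy_bounds(history, tolerance=3.0):
    """Return (low, high) limits derived from recent samples."""
    centre = mean(history)
    spread = stdev(history)
    return centre - tolerance * spread, centre + tolerance * spread


def is_anomalous(value, history, tolerance=3.0):
    """True when value falls outside the band around recent behaviour."""
    low, high = fuzzy_bounds(history, tolerance)
    return value < low or value > high


# Made-up logins per minute over the last eight minutes.
recent = [48, 52, 50, 47, 53, 51, 49, 50]

print(fuzzy_bounds(recent))      # band derived from the samples above
print(is_anomalous(51, recent))  # an ordinary minute
print(is_anomalous(5, recent))   # logins have nearly stopped
```

The `tolerance` value is the fuzzy part: start with a guess, watch which alerts fire, and adjust it as you learn what normal looks like.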
Monitoring Cycles
More?
• http://graphite.readthedocs.org/en/latest/
• http://codeascraft.com/
• http://vimeo.com/monitorama
• Twitter #devops #monitoringlove
• https://github.com/monitoringsucks
• http://www.opsschool.org/en/latest/

Editor's Notes

  1. Spent the last 4 years strictly working on proactive monitoring measures for various systems
  2. Is your site 100% functional just because you can hit your homepage? When I came back from my 4-month maternity leave in 2010
  3. Reactive: bugs vs. features. About 9 years ago, the first “live” aggregation of stats I saw arrived 24 hours after the fact, built with Microsoft Log Parser and presented via an SSMS report of “slow pages” and “pages that had errors”. This was better than nothing, and I have seen systems with literally just up/down checks on the home page as their complete monitoring set.
  4. Definition: define what to measure/observe. Code Collection: add/refine (necessary) stats and logging in your codebase. Environment Collection: add/refine environmental metrics. Visualization: build/refine dashboards. Action: add/refine anomaly detection and alerting.
  5. “3 armed sweaters” and “screwed users”
  6. Definition: define what to measure/observe. Code Collection: add/refine (necessary) stats and logging in your codebase. Environment Collection: add/refine environmental metrics. Visualization: build/refine dashboards. Action: add/refine anomaly detection and alerting.
  7. Choose a developer-friendly platform. Spend more time analyzing the meaning of the metrics than on the code that collects, moves, stores and displays them.
  8. RRD: Round Robin Database. Whisper is a fixed-size database, similar in design to RRD. It provides fast, reliable storage of numerical data over time.
  9. Metrics will only ever tell you part of the story
  10. Definition: define what to measure/observe. Code Collection: add/refine (necessary) stats and logging in your codebase. Environment Collection: add/refine environmental metrics. Visualization: build/refine dashboards. Action: add/refine anomaly detection and alerting.
  11. Note: hypervisors. How: performance counters using WMI. I have the distinct pleasure of living in both worlds, so this is part of the information I measure. On Linux servers you can use collectd or custom scripts.
  12. Definition: define what to measure/observe. Code Collection: add/refine (necessary) stats and logging in your codebase. Environment Collection: add/refine environmental metrics. Visualization: build/refine dashboards. Action: add/refine anomaly detection and alerting.
  13. 12 hours, with each minute as 1 pixel. “What is normal?”
  14. Definition: define what to measure/observe. Code Collection: add/refine (necessary) stats and logging in your codebase. Environment Collection: add/refine environmental metrics. Visualization: build/refine dashboards. Action: add/refine anomaly detection and alerting.
  15. What’s important changes as the application and traffic change. Add alerts around things that fail. Add and remove dashboard items. A fuzzy threshold is better than no threshold, and it can always be changed.
  16. Definition: define what to measure/observe. Code Collection: add/refine (necessary) stats and logging in your codebase. Environment Collection: add/refine environmental metrics. Visualization: build/refine dashboards. Action: add/refine anomaly detection and alerting.
  17. Scaling monitoring. Monitoring the monitoring. Automatic addition and removal of nodes and stats (environmentals). Too many monitoring tools, not enough analysis tools.