Introduction to Monitoring
Monitoring is both the process and the set
of tools for finding problems before your
users do, minimizing the monetary impact of
failure, and enabling fast recovery.
Efficient Monitoring aims at notifying the
right person at the right time (and only at
the right time) with the most precise information.
What monitoring is
Measure
Aggregate & Visualize
Alert
Webapp DB
Webapp DB
What to Measure?
End user
experience
/performance
End User Monitoring
• Validates that our application is running,
as seen from “outside”
• Measure “real user” performance
• Geo-Distributed – including real latency
• Many tools offer such solutions
– Measure, visualize, alerts
End User Monitoring
• When is a page fully loaded?
• Take care - some tools are biased
End User Monitoring
• Measure yourself
• Using
– Resource Timing API
– User Timing API
– Custom JS
• Send metrics from Browsers to your own
sync server
– all users / samples
End User Monitoring
What to measure
• Measure page load time (as you define it)
• Measure loading errors
• Measure number of page views
• Group by Geo & Application
• Group by browser
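The grouping described above can be sketched as a small aggregation over samples reported by browsers. This is only an illustration: the sample fields (`geo`, `browser`, `load_ms`, `error`) and the function name are assumptions, not something the slides define.

```python
from collections import defaultdict
from statistics import median


def summarize_page_loads(samples):
    """Group browser-reported samples by (geo, browser) and summarize each group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[(sample["geo"], sample["browser"])].append(sample)

    summary = {}
    for key, group in groups.items():
        # Only successful loads contribute to the load-time figure.
        load_times = [s["load_ms"] for s in group if not s["error"]]
        summary[key] = {
            "page_views": len(group),
            "errors": sum(1 for s in group if s["error"]),
            "median_load_ms": median(load_times) if load_times else None,
        }
    return summary


# Hypothetical samples, as a browser-side script might report them.
samples = [
    {"geo": "US", "browser": "Chrome", "load_ms": 1200, "error": False},
    {"geo": "US", "browser": "Chrome", "load_ms": 1800, "error": False},
    {"geo": "US", "browser": "Chrome", "load_ms": 0, "error": True},
    {"geo": "DE", "browser": "Firefox", "load_ms": 900, "error": False},
]
result = summarize_page_loads(samples)
```

Keeping errors out of the load-time figure means a failed load does not drag the median down with a meaningless zero.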
End User Monitoring
Alert on
• Sudden drop in traffic from a certain geo
• Sudden increase in traffic
• Increase in loading times
• Increase in errors
– From a specific browser
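One simple way to express the traffic alerts above is to compare the current window with a baseline per geo. The ratios and the function name below are illustrative assumptions; a real setup would tune them and probably account for time of day.

```python
def traffic_change_alerts(baseline, current, drop_ratio=0.5, spike_ratio=2.0):
    """Compare per-geo page views in the current window against a baseline.

    baseline and current map geo -> page views for windows of equal length.
    The default ratios are arbitrary illustrative thresholds.
    """
    alerts = []
    for geo, expected in sorted(baseline.items()):
        if expected <= 0:
            continue  # no baseline to compare against
        ratio = current.get(geo, 0) / expected
        if ratio <= drop_ratio:
            alerts.append((geo, "drop", ratio))
        elif ratio >= spike_ratio:
            alerts.append((geo, "spike", ratio))
    return alerts


baseline = {"US": 1000, "DE": 400, "BR": 200}
current = {"US": 980, "DE": 120, "BR": 500}
alerts = traffic_change_alerts(baseline, current)
```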
Webapp DB
What to Measure?
Is Alive?
Is Alive
• Measure a process's liveness
– Is the process running?
• Measure a process's responsiveness
– Does the process respond to a request?
• Alert on instance down
– And auto restart it
• Alert on all instances down
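The two checks above differ in what they prove: a running process may still be unable to answer requests. A minimal sketch, assuming a POSIX host and an HTTP is-alive URL (the URL, the timeout, and both function names are hypothetical):

```python
import os
import urllib.error
import urllib.request


def is_process_running(pid):
    """Liveness: does a process with this PID exist? (POSIX only.)"""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_responsive(url, timeout=2.0, opener=urllib.request.urlopen):
    """Responsiveness: does the process answer a request with a 2xx status?

    opener is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False  # refused, timed out, or returned an HTTP error
```

A supervisor could call `is_process_running` to decide on a restart and `is_responsive` to decide whether the instance should receive traffic.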
Is Alive
• A variety of great tools
• Tools that perform “ping” tests
• Tools that call a designated URL for
responsiveness tests
• Is alive != Availability
– Is alive is per host
– Availability is about the system as a whole
Webapp DB
What to Measure?
Request
performance
Request Monitoring
• Measure how your application performs
– Regardless of networking to the user
– Regardless of latency
• Measuring on the server, per server
• Many tools provide such solutions
– Measure, visualize, alerts
Request Monitoring
• But many tools miss the branching point
– Branching point – the point in your code at
which your code decides what branch of
execution to perform for a request
• Issues with aggregation, what is
monitored, alert flexibility
• But still, there are some great tools
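Measuring at the branching point can be sketched as a dispatcher that times each request under the request type it is about to branch on. The class and field names are assumptions made for illustration.

```python
import time
from collections import defaultdict


class RequestTimer:
    """Records handler duration per request type at the branching point."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.timings_ms = defaultdict(list)

    def dispatch(self, request, handlers):
        # Branching point: the request type decides which code path runs,
        # so it is also the label the measurement is recorded under.
        request_type = request["type"]
        handler = handlers[request_type]
        start = self._clock()
        try:
            return handler(request)
        finally:
            elapsed_ms = (self._clock() - start) * 1000.0
            self.timings_ms[request_type].append(elapsed_ms)
```

Because the label comes from the application's own notion of a request type, the metrics can later be grouped "as you define it" rather than by URL pattern.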
Request Monitoring
What to measure
• Measure request rate
• Measure performance histogram
• Measure error rate, by error type, http
response code
• Group by request type (as you define it)
• Group by host, application, data center
• Group by error type (as you define it)
Do not use Average
• Don’t use Average for performance
• Instead, use the median and the 95th and 99th percentiles.
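A small worked example shows why the average misleads. The latency values are made up, and the nearest-rank definition of a percentile is one common choice among several.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies: most requests are fast, a few are very slow.
latencies_ms = [20] * 90 + [40] * 5 + [3000] * 5

average = mean(latencies_ms)        # 170 ms, a value no request actually had
p50 = percentile(latencies_ms, 50)  # 20 ms
p95 = percentile(latencies_ms, 95)  # 40 ms
p99 = percentile(latencies_ms, 99)  # 3000 ms
```

The average suggests every request is somewhat slow, while the percentiles show that most requests are fast and a small tail is very slow.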
Request Monitoring
What to Visualize
• Request rate (RPM)
• Request performance
– Median, 95th and 99th percentiles
on a moving window
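A moving window can be sketched with a queue of timestamped samples that drops anything older than the window. The class name, the 60-second default, and the nearest-rank percentile are assumptions for illustration.

```python
import math
from collections import deque


class MovingWindowLatency:
    """Keeps the samples of the last window_seconds and summarizes them."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self._samples = deque()  # (timestamp, latency_ms), oldest first

    def record(self, timestamp, latency_ms):
        self._samples.append((timestamp, latency_ms))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop samples that have fallen out of the window.
        while self._samples and self._samples[0][0] <= now - self.window_seconds:
            self._samples.popleft()

    def snapshot(self, now):
        self._evict(now)
        values = sorted(latency for _, latency in self._samples)
        if not values:
            return None

        def pick(pct):  # nearest-rank percentile
            return values[max(1, math.ceil(pct * len(values) / 100)) - 1]

        return {
            "rpm": len(values) * 60.0 / self.window_seconds,
            "median": pick(50),
            "p95": pick(95),
            "p99": pick(99),
        }
```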
Request Monitoring
What to Visualize
• Errors
– Rate, percent (compared to request rate)
– Top X errors by percent
– Separate system and application errors
– You will always have application errors
– You should have exactly 0 system errors
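The error views above can be sketched as a small report over one window of requests. The error names and the `system` / `application` labels are hypothetical; how errors are classified is left to the application.

```python
from collections import Counter


def error_report(total_requests, errors, top_n=3):
    """Summarize errors for one window.

    errors is a list of (error_type, kind), kind being "system" or "application".
    """
    by_type = Counter(error_type for error_type, _ in errors)
    by_kind = Counter(kind for _, kind in errors)

    def percent(count):
        return 100.0 * count / total_requests if total_requests else 0.0

    return {
        "error_percent": percent(len(errors)),
        "top_errors": [(t, percent(c)) for t, c in by_type.most_common(top_n)],
        "application_errors": by_kind["application"],
        "system_errors": by_kind["system"],  # anything above 0 deserves an alert
    }


errors = (
    [("validation_failed", "application")] * 30
    + [("not_found", "application")] * 15
    + [("db_timeout", "system")] * 5
)
report = error_report(total_requests=1000, errors=errors, top_n=2)
```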
Request Monitoring
Alert on
• Big changes in traffic
• Increase in response times
• Increase in errors
• System errors
Webapp DB
What to Measure?
Resource
Utilization
Resources
• System resources
– CPU, Memory, IO, Storage, network
• Resource pools
– Database connection pools
– HTTP connection pools
– Thread pools
– Other resource pools
Resource Monitoring
What to measure
• Measure resource utilization
– Percent of resource used
• Measure resource acquisition queue
– Time to acquire
– Acquire Timeouts
– Usage Timeouts
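Pool measurements like these can be collected by wrapping the pool itself. The sketch below uses a plain queue as the pool and assumes hypothetical names throughout; usage timeouts (a resource held too long) are not covered here.

```python
import queue
import time


class InstrumentedPool:
    """A fixed-size resource pool that records acquisition metrics."""

    def __init__(self, resources):
        self._size = len(resources)
        self._free = queue.Queue()
        for resource in resources:
            self._free.put(resource)
        self.acquire_times_ms = []
        self.acquire_timeouts = 0

    def acquire(self, timeout=1.0):
        start = time.perf_counter()
        try:
            resource = self._free.get(timeout=timeout)
        except queue.Empty:
            self.acquire_timeouts += 1  # waited the full timeout and got nothing
            raise
        self.acquire_times_ms.append((time.perf_counter() - start) * 1000.0)
        return resource

    def release(self, resource):
        self._free.put(resource)

    def utilization_percent(self):
        in_use = self._size - self._free.qsize()
        return 100.0 * in_use / self._size
```

Time to acquire is often the earliest warning: it starts to climb before the pool is fully exhausted and timeouts appear.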
Resource Monitoring
What to measure
• Group by resource type and pool
• Group by host, application, data center
• Group by error type (as you define it)
Alert on
• Resource over utilization –
avg usage over XX% in a time window
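The over-utilization rule can be written as a check over timestamped utilization samples. The function name, the sample layout, and the numbers in the example are assumptions; the threshold is the "XX%" the slide leaves open.

```python
def over_utilized(samples, now, window_seconds, threshold_percent):
    """True when average utilization within the window exceeds the threshold.

    samples is a list of (timestamp, utilization_percent).
    """
    recent = [value for ts, value in samples if now - window_seconds < ts <= now]
    if not recent:
        return False  # no data in the window: nothing to alert on
    return sum(recent) / len(recent) > threshold_percent


samples = [(0, 50.0), (60, 95.0), (120, 90.0), (180, 97.0)]
short_window = over_utilized(samples, now=180, window_seconds=180, threshold_percent=90)
long_window = over_utilized(samples, now=180, window_seconds=300, threshold_percent=90)
```

Averaging over a window keeps a single short spike from paging anyone, at the cost of reacting more slowly.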
Webapp DB
What to Measure?
Database
Monitor
Database monitoring
Depends on the database, but typically:
• Storage
• Replication “lag”
• Slow operations
• Resource usage
Monitoring at Wix
Precise information
Alert the right person
Automation
Service is alive
• Is my application alive on at least the minimum
number of instances required by my SLA?
• 2 out of 5 instances of my-app are not
responding to isAlive
• my-app requires a minimum of 3 instances
to meet the SLA
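The example above reduces to counting responding instances against the SLA minimum. A minimal sketch, with hypothetical instance names and function name:

```python
def sla_status(instance_alive, min_instances):
    """Compare the number of responding instances with the SLA minimum.

    instance_alive maps instance name -> result of its is-alive check.
    """
    alive = sorted(name for name, ok in instance_alive.items() if ok)
    down = sorted(name for name, ok in instance_alive.items() if not ok)
    return {
        "alive": len(alive),
        "down": down,
        "sla_met": len(alive) >= min_instances,
    }


# The slide's scenario: 2 of 5 instances of my-app do not respond, minimum is 3.
status = sla_status(
    {"my-app-1": True, "my-app-2": False, "my-app-3": True,
     "my-app-4": False, "my-app-5": True},
    min_instances=3,
)
```

In this scenario the SLA is still met, but with no margin left: one more failed instance breaks it.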
Alert
Sensu
Queries Nginx
Alert & SLA
ZooKeeper
Planned Configuration
Service owner
Nginx
Service Load Balancer
Is-alive
Alert the right person
Precise information
Automation
Service anomalies
• Backend Anomalies
• Identify unhealthy KPIs per endpoint
• Abnormal increase in error rate for
class.method.get
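To make "abnormal increase" concrete, here is a deliberately simple stand-in: a one-sided z-score against recent history. It is not the algorithm of any particular anomaly detection product, and the threshold of 3 standard deviations is an arbitrary assumption.

```python
from statistics import mean, stdev


def is_abnormal_increase(history, latest, threshold=3.0):
    """Flag latest when it exceeds the historical mean by more than
    threshold standard deviations. Only increases are flagged."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest > mu
    return (latest - mu) / sigma > threshold


# Hypothetical per-minute error rates (percent) for one endpoint.
error_rate_history = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8]
normal_minute = is_abnormal_increase(error_rate_history, 1.1)
bad_minute = is_abnormal_increase(error_rate_history, 5.0)
```

A static threshold would need a different value per endpoint; comparing each KPI with its own history avoids that.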
Anomaly Alert
Anodot
Time series anomaly
detection
Alerts & graphs
statsd
Stats aggregation
Forwarding metrics
JVM servers
Metrics library
metrics / 1m
Graphs
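The first hop of this pipeline, from the application to statsd, uses statsd's plain-text line format `name:value|type`. The sketch below is in Python rather than on the JVM, and the metric names, host, and helper names are assumptions; 8125 is statsd's usual default UDP port.

```python
import socket


def statsd_line(name, value, metric_type):
    """Format one metric in statsd's plain-text protocol: name:value|type."""
    if metric_type not in ("c", "ms", "g"):  # counter, timer, gauge
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def send_statsd(lines, host="127.0.0.1", port=8125):
    """Send metrics to a statsd daemon over UDP, one metric per line."""
    payload = "\n".join(lines).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metrics for one handled request.
lines = [
    statsd_line("my-app.class.method.get.requests", 1, "c"),
    statsd_line("my-app.class.method.get.time", 320, "ms"),
]
```

UDP is fire-and-forget, so emitting metrics does not slow the request down or fail it when the metrics daemon is unavailable.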
Precise information
Alert the right person
Automation
Service anomalies
• Frontend Anomalies
• Browser (client) generated KPIs
• User Experience - Users affected or not?
How and where?
Anomaly Alert
Storm & Esper
Realtime streaming
processing
Metrics / 1m
Client
JS in Browser
events Graphs
Logger
flume
events
Anodot
Time series
anomaly detection
Alerts & graphs
Precise information
Alert the right person
Automation
Alert management
• What are the active alerts?
• What is the root cause?
• Is it correlated to a change?
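The last question can be approached by listing the changes that happened shortly before the alert fired. This is a plain time-window lookup with hypothetical names and a 30-minute default, not the correlation logic of any specific alert management tool.

```python
def changes_before_alert(alert_time, changes, lookback_seconds=1800):
    """Return the changes made within lookback_seconds before the alert,
    most recent first, as candidates for the root cause.

    changes is a list of (timestamp, description).
    """
    candidates = [
        (ts, description)
        for ts, description in changes
        if alert_time - lookback_seconds <= ts <= alert_time
    ]
    return sorted(candidates, reverse=True)


# Hypothetical change log: deployments, Chef uploads, feature toggles.
changes = [
    (1000, "deploy my-app"),
    (2500, "chef upload"),
    (2900, "feature toggle on"),
    (3100, "deploy other-app"),
]
suspects = changes_before_alert(alert_time=3000, changes=changes, lookback_seconds=600)
```

Proximity in time is only a hint, but it narrows down where to look first.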
Alert
BigPanda
Central alerts & changes
Alerts & Changes
Changes
Deployments
Chef uploads
A/B, F-Toggle,
Exp.
Alerts
NewRelic
Sensu
Nagios
PingDom
Web UI
Precise information
Alert the right person
Automation
Questions?


Editor's Notes

  1. Built for fast development. Did not know what our business is. We know we will need to replace it. Did not know how hard that will be.