Monitoring
Deeper dive
Who am I?
Robert Kubiś
DevOps Engineer
https://www.linkedin.com/in/robertkubis89
Mikey Dickerson's Hierarchy of Needs
Monitoring
Collecting, processing, aggregating, and displaying real-time quantitative data
about a system, such as query counts and types, error counts and types,
processing times, and server lifetimes.
● White-box monitoring
● Black-box monitoring
● Dashboard
● Alert
● Root cause
● Push
● Node and machine
Why Monitor?
● Analyzing long-term trends
● Comparing over time or experiment groups
● Alerting
● Building dashboards
● Conducting ad hoc retrospective analysis (i.e., debugging)
Please stop using Nagios (Andy Sykes)
So we can die peacefully…
Who uses it?
Why did you choose it?
Advantages:
● Incredibly simple plugin model
● Simple to use
● Many people know it
● Top result on Google, and everybody uses it :)
Disadvantages:
● Doesn't scale - no clustering (the Thruk hack)
● Millions of lines of configuration (the check_mk hack)
● Horrible interface
● Only suited to static infrastructure
● Awkward client format - more hacks
● Perfdata…
● No API (the livestatus hack)
● You always end up hacking around it…
Nagios
When your monitoring sucks...
- Improve the quality of alerts
- Improve the monitoring tools, or even replace them
Wait a minute… Before you start solving these problems...
UNDERSTAND THE PROBLEMS AND MEASURE THEM!!!
“To measure is to know”
“If you cannot measure it,
you cannot improve it”
William Thomson
(Lord Kelvin)
Over-monitoring and alarm fatigue: for whom do the
bells toll? Hospitals in the USA
- Alarm notifications get ignored
- “Yeah, that one's not important”
- 72–99% of alarms are false
- Young parents vs. nurses in hospital
- More monitoring means more money
- More is not better
- Patients could die
- Telemetry as a means of preventing, detecting, and improving
Source:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4926996/
What to do?
So what to use for monitoring?
What should monitoring be?
Actionable
Compatible
Essential - only the alerts that are actually needed
Fully Automated
Proactive - it should predict failures
Easy for operators
State monitoring: what should it be like?
State (black-box) monitoring currently makes the most sense for VMs and bare metal.
What should be monitored with these kinds of tools?
● Health endpoints
● Service states (like systemctl status *)
What could be monitored?
● Specific endpoints (for example via a satellite node) with HTTP/TCP checks
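An HTTP check in the Nagios/Icinga plugin style can be sketched in a few lines: the plugin contract is just an exit code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) and one line of output. The URL and the status-code thresholds below are illustrative assumptions, not a drop-in replacement for check_http.

```python
# Minimal sketch of a Nagios/Icinga-style HTTP check.
# Exit codes follow the plugin convention: 0=OK, 1=WARNING,
# 2=CRITICAL, 3=UNKNOWN. Thresholds here are assumptions.
import urllib.request

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def classify(status_code):
    """Map an HTTP status code to a check state and message."""
    if 200 <= status_code < 300:
        return OK, "OK - HTTP %d" % status_code
    if 300 <= status_code < 400:
        return WARNING, "WARNING - HTTP %d" % status_code
    return CRITICAL, "CRITICAL - HTTP %d" % status_code

def check_http(url):
    """Fetch the URL and return (state, message); network or DNS
    failures count as CRITICAL, as a plugin normally treats them."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return classify(resp.status)
    except Exception as exc:
        return CRITICAL, "CRITICAL - %s" % exc
```

Wrapped in a tiny `main` that prints the message and calls `sys.exit(state)`, this is already something Icinga2 could execute as a check command.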
Icinga2 - a Nagios fork, but rewritten in many places. It has scaling scenarios
(multi-master, with 3 levels of nodes: masters, satellites (e.g. a supervisor per DC),
and clients (check executors)) and plugins such as an InfluxDB metric exporter, livestatus, etc.
What can we get from Icinga2?
● Highly available, distributed setup
● Nice, well-documented REST API
● (Dynamic inventory)
● Less time needed to implement features
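For flavour, a minimal Icinga2 configuration sketch: one host and one HTTP service check using commands from the Icinga Template Library. The hostname, address, and URI are hypothetical.

```
// Hypothetical host and service objects (Icinga2 DSL).
object Host "web01.example.com" {
  address       = "192.0.2.10"
  check_command = "hostalive"
}

object Service "http-health" {
  host_name     = "web01.example.com"
  check_command = "http"
  vars.http_uri = "/health"   // hit the health endpoint, not just /
}
```

In a distributed setup the same objects can be zoned so a satellite per DC schedules the checks and clients execute them.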
Metrics
Metric tools can be used in two ways:
1. Failure prediction
2. Graphing the data for humans - and for humans that means SIMPLE
The first case is quite simple - rules for detecting anomalies, such as more traffic than
usual, alerting when it could impact other clients.
The second case is also simple - just graphs for debugging and a better understanding
of what's happening with the applications.
Not every metric should have an alert (and notifications)!
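The "more traffic than usual" rule above can be sketched as a rolling-baseline check: compare the newest sample against the mean of the recent window plus a few standard deviations. The window size and sigma threshold below are illustrative assumptions, not tuned values.

```python
# Sketch of a "more traffic than usual" anomaly rule:
# flag the newest sample if it deviates from the rolling
# baseline by more than `sigmas` standard deviations.
from statistics import mean, stdev

def is_anomalous(samples, window=10, sigmas=3.0):
    """Return True if the last sample deviates strongly from the
    baseline built from the preceding `window` samples."""
    if len(samples) <= window:
        return False  # not enough history to judge
    baseline = samples[-window - 1:-1]
    mu = mean(baseline)
    sd = stdev(baseline)
    if sd == 0:
        return samples[-1] != mu  # perfectly flat baseline
    return abs(samples[-1] - mu) > sigmas * sd
```

In practice you would run this per client or per endpoint and alert only when the anomaly can impact others, exactly as the slide suggests.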
Prometheus
Circa 120 ready-to-use dashboards in the Grafana repository (e.g. the MySQL
dashboard by Percona)
Many useful features in one tool - Prometheus has a rich query language, an
Alertmanager, support for PagerDuty, etc.
Plenty of exporters (collectors) for standard tools: MySQL, HAProxy, NGINX,
PageSpeed, BIND, Jenkins, scollector
Third-party projects with Prometheus support: GitLab, Kubernetes, etcd, Telegraf,
jmx-exporter, collectd
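The rich query language and Alertmanager come together in rule files. A hedged sketch of an alerting rule for the traffic case discussed earlier; the metric name, threshold, and labels are assumptions:

```yaml
# Hypothetical Prometheus alerting rule: fire when the request
# rate stays above an assumed ceiling for 10 minutes.
groups:
  - name: traffic
    rules:
      - alert: UnusualRequestRate
        expr: sum(rate(http_requests_total[5m])) > 1000
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Request rate above expected baseline"
```

Alertmanager then routes the firing alert, e.g. to PagerDuty, while the same `rate()` expression can back a Grafana panel for debugging.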
Logs
Servers, applications, and network and security devices generate log files.
Errors, problems, and other information are constantly logged and saved for
analysis.
Once an event is detected, the monitoring system sends an alert, either to a
person or to another software/hardware system.
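Event detection over a log stream reduces to pattern matching plus a notification hook. A minimal sketch, where the patterns and the `notify` callback are illustrative assumptions (a real pipeline would sit behind something like the ELK stack):

```python
# Sketch of log-based event detection: scan lines for alert-worthy
# patterns and hand matches to a notification callback.
import re

# Hypothetical patterns; a real deployment would tune these.
ALERT_PATTERNS = [
    re.compile(r"\bERROR\b"),
    re.compile(r"\bOutOfMemoryError\b"),
]

def scan(lines, notify):
    """Call notify(line) for every matching line; return the
    number of alerts raised."""
    hits = 0
    for line in lines:
        if any(p.search(line) for p in ALERT_PATTERNS):
            notify(line)
            hits += 1
    return hits
```

The `notify` callback could just as well post to PagerDuty or another system, matching the "person or software/hardware" split above.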
Elasticsearch stack
Monitoring strategy
Icinga 2 for state monitoring on bare metal, VMs, and VMs in the cloud.
Prometheus for metrics and data from Kubernetes (or other container) clusters.
ELK stack for logs
Is that enough?
What should be next step?
What is PagerDuty?
User Settings
Notification Rules
Schedules
Escalation Policies
Services
Integrations
Integrations list
Connect with any tool that provides incoming event data.
Extensions
Extensions list
Extend the PagerDuty workflow to your existing tools.
Good practices for alerts
● Notify before the accident
● Actionable alarms
● The value of measuring things
● Documentation - not just one-liners in the on-call wiki
● Reduce the number of tools
● Terraform
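The Terraform bullet can be read as: keep the alerting setup itself in code. A hypothetical sketch using the PagerDuty provider; the service name and the referenced escalation policy are assumptions:

```hcl
# Hypothetical sketch: a PagerDuty service managed by Terraform
# instead of being clicked together in the UI.
resource "pagerduty_service" "checkout" {
  name              = "checkout-api"
  escalation_policy = pagerduty_escalation_policy.ops.id
}
```

Versioning this alongside the monitoring checks reduces the number of hand-maintained tools and keeps on-call routing reviewable.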
Let’s say that you’re rich :)
New Relic
STACKDRIVER
● Full-Stack Monitoring, Powered by Google
● For Cloud Platform, AWS, and Hybrid Deployments
● Identify Trends, Prevent Issues
● Reduce Monitoring Overhead
● Improve Signal-to-Noise
● Fix Problems Faster
Stackdriver heatmap
STACKDRIVER MONITORING FEATURES
● Debugger
● Error reporting
● Rapid discovery
● Uptime monitoring
● Integrations
● Smart defaults
● Alerts
● Tracing
● Logging
● Dashboards
● Profiling
MONITORING = PEOPLE
Not only tools...
Team
For changes to the monitoring infrastructure to make sense both now and in the future,
they should be supported by the development and "reacting" teams.
Reacting team:
● People watching the boards 24/7 and reacting to issues, working in shifts
● An incident manager making decisions and investigating how to tune the monitoring
● People with “programming” skills responsible for implementing the IM's proposals
(writing new checks, adding pieces of code)
Plan your work