Continuous monitoring

Hi
My name is Dima
I am 29 years old
And I do monitoring

What are we going to talk about?
What has been happening in monitoring lately
What the architecture looks like
How to build the process

A bit of classification and history
➔ Active / passive
➔ Pull / Push
➔ Source based / GUI

Passive
They collect information and
don't act up
➔ Collectd
➔ Zabbix
➔ Prometheus Exporters
➔ InfluxDB Telegraf

Active
Can also restart things if something goes wrong
➔ Monit
➔ Forever
➔ PM2
➔ Upstart / Systemd

Push
Use their own agents everywhere
➔ Collectd
➔ Zabbix
➔ Sensu

Pull
Rely on proven solutions
➔ SNMP
➔ telnet
➔ ssh
➔ http

Source based
Configs only, hardcore only
➔ Sensu
➔ Collectd
➔ Prometheus

GUI
Even a manager could set it up,
but won't
➔ Zabbix
➔ PRTG

What should you choose?
➔ Code is easier to work with, especially if you are not working alone
➔ Push is very powerful!
➔ The pull model, surprisingly, works very well for microservices
➔ Active never really caught on: everyone is afraid of a restart loop.
◆ And a restart in a cluster can end badly
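
To make the pull model concrete, here is a minimal sketch of a service exposing its own metrics over HTTP so that a collector can scrape them. It uses only the Python standard library. The metric names, the /metrics path, and port 8000 are assumptions chosen for illustration, and the output is the simplest "name value" form of the Prometheus text format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real service would update these as it works.
METRICS = {"http_requests_total": 0, "queue_depth": 0}


def render_metrics(metrics):
    """Render metrics as sorted 'name value' lines, one per metric."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics with the current values; everything else is 404."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block and serve /metrics on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint. The service needs no agent and no knowledge of where the monitoring system lives, which is part of why this model suits microservices.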

Monitoring architecture
Capture
Store
Visualize
Alert
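
The four stages can be sketched as a tiny in-memory pipeline. This is an illustration of how the stages hand data to each other, not a model of any particular tool. All names (Sample, Store, capture, visualize, alert) are hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Sample:
    name: str
    value: float
    timestamp: float


@dataclass
class Store:
    """Store: in-memory stand-in for a time-series database."""
    samples: list = field(default_factory=list)

    def write(self, sample):
        self.samples.append(sample)

    def query(self, name):
        return [s for s in self.samples if s.name == name]


def capture(name, read_value, now=time.time):
    """Capture: take one reading from a sensor callable."""
    return Sample(name, float(read_value()), now())


def visualize(samples, width=20, scale=100.0):
    """Visualize: one text bar per sample, clipped at `scale`."""
    return ["#" * int(round(width * min(s.value, scale) / scale)) for s in samples]


def alert(samples, threshold):
    """Alert: fire when the latest value is at or above the threshold."""
    return bool(samples) and samples[-1].value >= threshold
```

Each stage only depends on the output of the previous one, so any of them can be swapped for a real component independently.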

Capture. Collecting metrics
➔ Statsd ★ Node.js / no Windows
➔ Collectd ★ C / SSC Serv for Windows
➔ Prometheus Exporters ★ Go
➔ InfluxDB Telegraf ★ Go
➔ Elastic Beats ★ Go
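
As an example of what capture looks like on the wire, here is a sketch of pushing a metric in the StatsD line format (name:value|type) over UDP. The metric name is made up, and the host and port are assumptions (8125 is the usual StatsD default).

```python
import socket

# StatsD type suffixes for the three most common metric kinds.
METRIC_TYPES = {"counter": "c", "gauge": "g", "timer": "ms"}


def format_statsd(name, value, kind):
    """Build one StatsD datagram, e.g. b'api.requests:1|c'."""
    if kind not in METRIC_TYPES:
        raise ValueError(f"unknown metric kind: {kind}")
    return f"{name}:{value}|{METRIC_TYPES[kind]}".encode("ascii")


def send_statsd(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; no reply is expected from the server."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

A call such as send_statsd(format_statsd("api.requests", 1, "counter")) is all an application needs. Because UDP is connectionless, a missing collector does not slow the application down.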

Store. Database
➔ Elasticsearch
➔ Graphite
➔ Prometheus
➔ InfluxDB
➔ OpenTSDB

Visualize
➔ Kibana
➔ Grafana
➔ Graphite
➔ InfluxDB Chronograf

Alerts
➔ Grafana alerts
➔ Elasticsearch Watcher
➔ Prometheus Alertmanager
➔ Bosun

Alerts Processing
➔ Riemann
➔ PagerDuty / OpenDuty

Suddenly
➔ Metrics are one of the most important parts of the feedback loop
◆ Analytics is part of monitoring
➔ Hypothesis, implementation, verification, hypothesis
➔ Infrastructure and code are inseparable

Problems
➔ Monitoring is a black box for developers
◆ Read-only model
➔ Sensors are written by Ops
◆ Ops don't know the code, devs don't know the infrastructure

Facilitating developers
➔ Monitoring as code
◆ Storage as a Service
➔ The ability to test
◆ Git flow
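
Once monitoring lives in a repository, it can be checked in CI like any other code. The sketch below validates alert rule definitions before they are merged. The rule schema (name, expr, for, service) is hypothetical and not the format of any specific tool.

```python
import re

# Hypothetical rule schema for illustration; not the format of any specific tool.
REQUIRED_FIELDS = ("name", "expr", "for", "service")
DURATION = re.compile(r"^\d+[smhd]$")


def validate_rule(rule):
    """Return a list of human-readable problems; an empty list means the rule passes."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if not rule.get(key)]
    duration = rule.get("for")
    if duration and not DURATION.match(str(duration)):
        problems.append(f"bad duration: {duration!r}")
    return problems


def validate_all(rules):
    """Map rule name (or index) to its problems, skipping rules that pass."""
    report = {}
    for index, rule in enumerate(rules):
        problems = validate_rule(rule)
        if problems:
            report[rule.get("name") or f"rule #{index}"] = problems
    return report
```

A CI job could fail the merge request whenever validate_all returns a non-empty report, which gives developers the same feedback loop for alerts that they already have for code.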

Monitoring as code
➔ Capture
➔ Store
➔ Visualize
➔ Alert
...as code

Capture as code
➔ Collectd with plugins
➔ Prometheus exporter
➔ Logstash

Collectd
<Plugin "java">
  JVMArg "-Djava.class.path=/usr/share/collectd/java/collectd-api.jar:/usr/share/collectd/java/generic-jmx.jar"
  LoadPlugin "org.collectd.java.GenericJMX"
  <Plugin "GenericJMX">
    # Which MBean attribute to read and how to name the resulting values
    <MBean "memory">
      ObjectName "java.lang:type=Memory"
      <Value>
        Type "memory"
        InstancePrefix "heap-"
        Table true
        Attribute "HeapMemoryUsage"
      </Value>
    </MBean>
    # Which JVM to connect to and which MBean blocks to collect from it
    <Connection>
      Host "localhost"
      ServiceURL "service:jmx:rmi:///jndi/rmi://localhost:31666/jmxrmi"
      Collect "memory"
      User "readOnlyRole"
      Password "securePassword"
      InstancePrefix "localhost"
    </Connection>
  </Plugin>
</Plugin>

Prometheus Exporter
# The exporter documentation treats hostPort and jmxUrl as alternatives; set only one.
hostPort: 127.0.0.1:1234
jmxUrl: service:jmx:rmi:///jndi/rmi://127.0.0.1:1234/jmxrmi
ssl: false
lowercaseOutputName: false
lowercaseOutputLabelNames: false
whitelistObjectNames: ["org.apache.cassandra.metrics:*"]
blacklistObjectNames: ["org.apache.cassandra.metrics:type=ColumnFamily,*"]
rules:
  # Single quotes keep the regex backslashes literal in YAML
  - pattern: '^org.apache.cassandra.metrics<type=(\w+), name=(\w+)><>Value: (\d+)'
    name: cassandra_$1_$2
    value: $3
    valueFactor: 0.001
    labels: {}
    help: "Cassandra metric $1 $2"
    type: GAUGE
    attrNameSnakeCase: false
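
To see what such a rule does, the sketch below applies the same pattern, name template, and value factor to a single line in Python. The input line is an assumed example of the "domain<properties><keys>attribute: value" form the pattern is written against; the real exporter builds that string internally, so this only illustrates the mapping.

```python
import re

# Pattern, name template and factor copied from the exporter config above.
PATTERN = re.compile(r"^org.apache.cassandra.metrics<type=(\w+), name=(\w+)><>Value: (\d+)")
NAME_TEMPLATE = "cassandra_$1_$2"
VALUE_TEMPLATE = "$3"
VALUE_FACTOR = 0.001


def expand(template, match):
    """Replace $1, $2, ... with the corresponding capture groups."""
    return re.sub(r"\$(\d+)", lambda ref: match.group(int(ref.group(1))), template)


def apply_rule(bean_line):
    """Return (metric_name, value) for a matching line, or None if the rule does not apply."""
    match = PATTERN.match(bean_line)
    if match is None:
        return None
    name = expand(NAME_TEMPLATE, match)
    value = float(expand(VALUE_TEMPLATE, match)) * VALUE_FACTOR
    return name, value
```

For the assumed input "org.apache.cassandra.metrics<type=Cache, name=Hits><>Value: 1500" the rule yields the metric name cassandra_Cache_Hits with the raw value scaled by 0.001.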

Logstash
input {
  # Tail the nginx access log
  file {
    type => "nginx_access"
    path => "/var/log/nginx/access.log"
  }
}
filter {
  # Access log lines are expected to be JSON documents
  if [type] == "nginx_access" {
    json {
      source => "message"
    }
  }
}
output {
  # One index per day
  elasticsearch {
    codec => json
    cluster => "logs"
    host => "127.0.0.1"
    embedded => false
    protocol => "transport"
    index => "logstash-%{+YYYY.MM.dd}"
  }
}
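
The two interesting steps of this pipeline, decoding the JSON line and choosing a daily index, can be sketched in a few lines of Python. This is an approximation for illustration and assumes nginx is configured to write JSON access logs; the field names in the usage example are made up.

```python
import json
from datetime import datetime, timezone


def parse_event(line, event_type="nginx_access"):
    """Mimic the json filter: decode the raw line and merge its fields into the event."""
    event = {"type": event_type, "message": line}
    try:
        fields = json.loads(line)
    except json.JSONDecodeError:
        # Logstash tags events it cannot decode instead of dropping them.
        event["tags"] = ["_jsonparsefailure"]
        return event
    if isinstance(fields, dict):
        event.update(fields)
    return event


def index_name(timestamp):
    """Daily index name, the equivalent of logstash-%{+YYYY.MM.dd} evaluated in UTC."""
    return timestamp.astimezone(timezone.utc).strftime("logstash-%Y.%m.%d")
```

For example, parse_event('{"status": 200}') returns an event with a status field, and an event stamped on 17 May 2016 goes to the index logstash-2016.05.17.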

Visualization as code
➔ https://gitlab.com/gitlab-org/grafana-dashboards
➔ https://github.com/jakubplichta/grafana-dashboard-builder
◆ pip install grafana-dashboard-builder
◆ grafana-dashboard-builder -c project.yaml --exporter file.yaml

Grafana dashboard builder
---
- defaultdashboard: &defaultdashboard
    time_options: [1h, 6h, 12h, 24h, 2d, 7d, 14d, 30d]
    refresh_intervals: [5m, 15m, 30m, 1h]
    time:
      from: now-2d
      to: now
- name: overview
  dashboard:
    title: '{dashboard-prefix} Overview'
    tags:
      - tag1
      - tag2
    <<: *defaultdashboard
    rows:
      - row:
          title: '{dashboard-prefix}-row'
          panels:
            - graph:
                span: 3
                title: Frontend
                target: 'aliasByMetric({metric-prefix}.frontend.*)'
            - graph:
                span: 3
                title: Backend
                target: 'aliasByMetric({metric-prefix}.backend.*)'

Alert as code
➔ Prometheus Alertmanager
➔ Bosun

Prometheus Alertmanager Rules
alert.rules: |-
  ALERT HighCPU
    IF ((sum(node_cpu{mode=~"user|nice|system|irq|softirq|steal|idle|iowait"}) by (instance, job))
        - (sum(node_cpu{mode=~"idle|iowait"}) by (instance, job)))
       / (sum(node_cpu{mode=~"user|nice|system|irq|softirq|steal|idle|iowait"}) by (instance, job)) * 100 > 95
    FOR 10m
    LABELS { service = "backend" }
    ANNOTATIONS {
      summary = "High CPU Usage",
      description = "This machine has really high CPU usage for over 10m",
    }
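
The arithmetic inside this rule is: busy share = (all modes - idle - iowait) / all modes * 100. The sketch below reproduces it for one machine. Since node_cpu is a cumulative counter of CPU seconds, the sketch assumes its input is the per-mode seconds spent within one observation window (what rate() would provide), which is an assumption on top of the rule as written. The function names are hypothetical.

```python
# Modes that count as "not busy" in the rule above.
IDLE_MODES = ("idle", "iowait")


def cpu_busy_percent(seconds_by_mode):
    """Percentage of CPU time spent in any mode other than idle or iowait."""
    total = sum(seconds_by_mode.values())
    if total == 0:
        return 0.0
    idle = sum(seconds_by_mode.get(mode, 0.0) for mode in IDLE_MODES)
    return 100.0 * (total - idle) / total


def fires(percent_history, threshold=95.0):
    """FOR semantics: fire only if every sample in the window exceeded the threshold."""
    return bool(percent_history) and all(p > threshold for p in percent_history)
```

With 90 s user, 6 s system, 3 s idle and 1 s iowait, the busy share is 96 percent. The alert only fires if that condition holds for every evaluation in the 10 minute window, so a single spike does not page anyone.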

Prometheus Alertmanager Global
route:
  receiver: 'slack_chatbots'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  routes:
    - match:
        service: frontend
      receiver: slack_chatbots
    - match:
        service: backend
      receiver: pager_duty
receivers:
  - name: 'slack_chatbots'
    slack_configs:
      - send_resolved: true
        api_url: 'https://hooks.slack.com/services/xxxxxxx'
        channel: '#chatbots'
        text: >-
          Summary:
          Description:
          Details:
          - =
          Playbook:
          Graph:
  - name: 'pager_duty'
    pagerduty_configs:
      - service_key: xxxxxxxxxxxxxxxxxx
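
The routing tree above can be read as: walk the child routes in order, take the first one whose match labels all equal the alert's labels, otherwise fall back to the top-level receiver. The sketch below models that logic with plain Python data. It is a simplified illustration that ignores grouping, timing, and the continue option.

```python
# Routing tree mirroring the configuration above, as plain Python data.
ROUTE = {
    "receiver": "slack_chatbots",
    "routes": [
        {"match": {"service": "frontend"}, "receiver": "slack_chatbots"},
        {"match": {"service": "backend"}, "receiver": "pager_duty"},
    ],
}


def matches(labels, match):
    """True when every label required by the route equals the alert's label."""
    return all(labels.get(key) == value for key, value in match.items())


def pick_receiver(labels, route=ROUTE):
    """Return the receiver for an alert: first matching child wins, else the route's own."""
    for child in route.get("routes", []):
        if matches(labels, child.get("match", {})):
            # A child without its own receiver inherits the parent's.
            inherited = {"receiver": route["receiver"], **child}
            return pick_receiver(labels, inherited)
    return route["receiver"]
```

With this tree, an alert labeled service=backend goes to pager_duty, while frontend alerts and anything unmatched land in the Slack channel.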

Bosun Alert Rule
alert cpu.linux {
    template = generic
    $notes = "High CPU load on server for last 30 minutes."
    $series = 100 - graphite("groupByNode(servers.*.cpu.*.percent.idle, 1, 'avg')", "30m", "", "host")
    $q = avg($series)
    warn = $q >= 90
    crit = $q >= 95
    macro = common
    warnNotification = $warnNotification
    critNotification = $critNotification
}

Bosun Alert Global
macro common {
    $warnNotification = slack
    $critNotification = slack
    $kibanaDomainName = kibana.mycompany.com
    $grafanaDomainName = grafana.mycompany.com
}
notification slack {
    post = https://hooks.slack.com/services/xxxxxxx
    body = payload={"username": "bosun", "text": {{.|json}}, "attachments": [{"fallback": "Prod", "color": "#69B", "title": "Prod env"}]}
    next = slack
    timeout = 4h
}

Storage as a Service
➔ Storage on dev, stage, prod
◆ https://github.com/vegasbrianc/prometheus
◆ https://github.com/hopsoft/docker-graphite-statsd
◆ https://github.com/sstarcher/docker-sensu

Conclusion
➔ Monitoring is a continuous process
➔ Read/write model for developers
➔ DevOps provides a service, not sensors

Thank you for your attention
