System monitoring. Why is my cardiologist a rich man?

Wojciech Wójcik

Starring in this episode:
• Lead role: Prometheus
• The ever-important supporting roles: Grafana, Telegraf
• Extras:
  • k8s
  • istio
  • knative
  • rabbitmq
  • mysql
  • cryptocurrency exchanges
  • a forgotten host running Asterisk PBX
  • lame little apps written in Go

Prometheus
• A tool for collecting and processing metrics
• Can pull data from many sources (services, instance data, applications)
• Works with data pushed to it through the Pushgateway (e.g. cron-style jobs, serverless)
• Alert management
• A large number of ready-made integrations, plus the option to build your own

Prometheus-operator
• Easy installation (Helm: https://github.com/helm/charts/tree/master/stable/prometheus-operator)
• Many metrics and alerts configured out of the box
• Plenty of useful ready-made Grafana dashboards
• Several instances can be installed in one cluster (different data, data retention, Thanos storage: https://github.com/improbable-eng/thanos)
• Developers can define their own metrics and alerts, which can feed different Prometheus instances

grafana:
  adminPassword: uszanowankoHura
  extraEmptyDirMounts:
    - name: provisioning-notifiers
      mountPath: /etc/grafana/provisioning/notifiers
  additionalDataSources:
    - name: Mysql-Uszanowanko
      type: mysql
      url: "uszanowanko-mysql.db:3306"
      user: kurnik
      password: Ciapcie
      database: uszanowanko
      isDefault: false
  notifiers:
    notifiers.yaml:
      notifiers:
        - name: prometheus-alertmanager-notifier
          type: prometheus-alertmanager
          uid: prometheus-alertmanager1
          org_id: 1
          is_default: true
          settings:
            url: http://prometheus-operator-alertmanager:9093
  sidecar:
    dashboards:
      searchNamespace: ALL
    datasources:
      searchNamespace: ALL

alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'null'
      routes:
        - match:
            alertname: Watchdog
          receiver: 'null'
        - match:
            alertname: MessagesWaitingInQueues
          receiver: 'slack_general'
        - match:
            alertname: BTCBuyPrice
        - match:
            alertname: BTCSellPrice
    receivers:
      - name: 'null'
      - name: slack_general
        slack_configs:
          - api_url: https://hooks.slack.com/services/fff/ffff/blbbb
            channel: '#uszanowanko'
            icon_url: https://avatars3.githubusercontent.com/u/3380462
            send_resolved: true
            title: '{{ template "custom_title" . }}'
            text: '{{ template "custom_slack_message" . }}'
    templates:
      - '/etc/alertmanager/config/notifications_slack.tmpl'
  templateFiles:
    notifications_slack.tmpl: |-
      {{ define "__single_message_title" }}{{ range .Alerts.Firing }}{{ .Labels.alertname }} @ {{ .Annotations.identifier }}{{ end }}{{ range .Alerts.Resolved }}{{ .Labels.alertname }} @ {{ .Annotations.identifier }}{{ end }}{{ end }}

      {{ define "custom_title" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ if or (and (eq (len .Alerts.Firing) 1) (eq (len .Alerts.Resolved) 0)) (and (eq (len .Alerts.Firing) 0) (eq (len .Alerts.Resolved) 1)) }}{{ template "__single_message_title" . }}{{ end }}{{ end }}

      {{ define "custom_slack_message" }}
      {{ if or (and (eq (len .Alerts.Firing) 1) (eq (len .Alerts.Resolved) 0)) (and (eq (len .Alerts.Firing) 0) (eq (len .Alerts.Resolved) 1)) }}
      {{ range .Alerts.Firing }}{{ .Annotations.message }}{{ end }}{{ range .Alerts.Resolved }}{{ .Annotations.message }}{{ end }}
      {{ else }}
      {{ if gt (len .Alerts.Firing) 0 }}
      *Alerts Firing:*
      {{ range .Alerts.Firing }}- {{ .Annotations.message }}
      {{ end }}{{ end }}
      {{ if gt (len .Alerts.Resolved) 0 }}
      *Alerts Resolved:*
      {{ range .Alerts.Resolved }}- {{ .Annotations.message }}
      {{ end }}{{ end }}
      {{ end }}
      {{ end }}

Adding data to Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nats-exporter
  namespace: monitoring
  labels:
    release: prometheus-operator
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: prometheus-nats-exporter
  endpoints:
    - port: http


---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-operator-kube-prometheus-node-alerting.rules
spec:
  groups:
    - name: kube-prometheus-node-alerting.rules
      rules:
        - alert: NodeDiskRunningFull
          annotations:
            message: Device {{ $labels.device }} of node-exporter {{ $labels.namespace }}/{{ $labels.pod }} will be full within the next 24 hours.
          expr: '(node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[6h], 3600 * 24) < 0)'
          for: 30m
          labels:
            severity: warning
        - alert: NodeDiskRunningFull
          annotations:
            message: Device {{ $labels.device }} of node-exporter {{ $labels.namespace }}/{{ $labels.pod }} will be full within the next 2 hours.
          expr: '(node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[30m], 3600 * 2) < 0)'
          for: 10m
          labels:
            severity: critical
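
The predict_linear() calls in the rules above extrapolate free disk space with a least-squares line. The sketch below is only an illustration of that idea in Python, not Prometheus's implementation: the function names are hypothetical, and it extrapolates from the last sample's timestamp, whereas Prometheus works relative to the evaluation time. The "for:" duration is not modelled.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    return its value `seconds_ahead` after the last sample's timestamp."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


def disk_running_full(usage, avail_samples, horizon_seconds):
    """Mirror the shape of the alert expression:
    (usage > 0.85) and (predicted available space < 0)."""
    return usage > 0.85 and predict_linear(avail_samples, horizon_seconds) < 0


# Hypothetical data: free space drops by 1 GiB per hour over a 6h window.
samples = [(hour * 3600, 10.0 - hour) for hour in range(7)]
print(predict_linear(samples, 3600 * 24))          # predicted GiB free in 24h
print(disk_running_full(0.9, samples, 3600 * 24))  # would the warning condition hold?
```

Both halves of the condition matter: a disk that is filling quickly but is still mostly empty does not trigger, and neither does a nearly full disk whose usage is flat.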

Patient: RabbitMQ
• Scaling based on the number of messages in the queues
• We will use Prometheus metrics for scaling and alerting
• We will draw pretty charts to please the boss, who does not want to dig through the guts
• We will create custom metrics for k8s
• We will use the k8s Horizontal Pod Autoscaler
• We will run into the first problems

RabbitMQ installed (https://github.com/helm/charts/tree/master/stable/rabbitmq)
Config:
metrics:
  enabled: true

ServiceMonitor configured:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rabbitmq-exporter
  namespace: rabbitmq
  labels:
spec:
  selector:
    matchLabels:
      app: rabbitmq
  endpoints:
    - port: metrics

The application
• Sends a lot of jobs to the queue, but processing them takes a long time (e.g. we call a slow API)

Custom metrics
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: prometheus-operator
  name: rabbit-queue-rule
spec:
  groups:
    - name: rabbit.rules
      rules:
        - record: messages_waiting_per_consumer
          expr: avg(avg_over_time(rabbitmq_queue_messages_ready{queue="uszanowanko"}[5m]) / avg_over_time(rabbitmq_queue_consumers{queue="uszanowanko"}[5m]))
          labels:
            namespace: default
            service: uszanowanko-queue
        - alert: MessagesWaitingInQueues
          annotations:
            message: Message waiting time in queue {{ $labels.service }} in namespace {{ $labels.namespace }} is {{ $value }}
          expr: messages_waiting_per_consumer > 2
          for: 5m
          labels:
            severity: info
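
To make the recording rule above easier to read, here is a small Python sketch of the same arithmetic: average the ready messages and the consumers over the window, divide, then average the ratios across the matched series. The function names and the sample numbers are hypothetical, and skipping series with zero consumers is a simplification (PromQL would produce +Inf or NaN there).

```python
from statistics import mean


def messages_waiting_per_consumer(series):
    """series: list of (ready_samples, consumer_samples) pairs, one pair per
    matched time series, each holding the samples from the 5m window."""
    ratios = []
    for ready, consumers in series:
        avg_consumers = mean(consumers)
        if avg_consumers == 0:
            continue  # simplification, see the note above
        ratios.append(mean(ready) / avg_consumers)
    return mean(ratios) if ratios else None


def should_alert(value, threshold=2):
    """Same comparison as the alert expression; the 'for: 5m' hold is not modelled."""
    return value is not None and value > threshold


# Hypothetical window: 10, 20, 30 ready messages and a steady 4 consumers.
value = messages_waiting_per_consumer([([10, 20, 30], [4, 4, 4])])
print(value, should_alert(value))
```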

• How to deliver the data to the k8s Horizontal Pod Autoscaler
• Prometheus-adapter https://github.com/helm/charts/tree/master/stable/prometheus-adapter

Config
prometheus:
  url: http://prometheus-operator-prometheus.monitoring.svc.cluster.local
  port: 9090
rules:
  default: false
  custom:
    - seriesQuery: messages_waiting_per_consumer{namespace!='',service!=''}
      resources:
        overrides:
          namespace: {resource: 'namespace'}
          service: {resource: 'service'}
      name:
        matches: ^(.*)
        as: ${1}
      metricsQuery: <<.Series>>{<<.LabelMatchers>>}

HPA
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: uszanowanko-queue-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rabbitmq-consumer-example
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: Object
      object:
        target:
          kind: Service
          name: uszanowanko-queue
        metricName: messages_waiting_per_consumer
        targetValue: 1
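
How the autoscaler turns the metric into a replica count can be sketched with the formula from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas * currentValue / targetValue), clamped to minReplicas and maxReplicas. The Python below is a simplified illustration with a hypothetical function name; the real controller also handles pod readiness, missing metrics and stabilization, and the 0.1 tolerance is the documented default, assumed here.

```python
import math


def desired_replicas(current_replicas, metric_value, target_value,
                     min_replicas=1, max_replicas=20, tolerance=0.1):
    """Simplified HPA calculation for a single metric."""
    ratio = metric_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target, do nothing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# With targetValue 1 from the manifest: 2 consumers, 5 messages waiting per consumer.
print(desired_replicas(current_replicas=2, metric_value=5, target_value=1))
```

Because the metric is itself divided by the number of consumers, adding replicas lowers the value the autoscaler sees on the next evaluation.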

Read replicas in MySQL
• Tempting, but they can bring us plenty of worries (and costs)
• The MySQL exporter provides many interesting metrics
• When running third-party applications there is no guarantee how they will behave once the replica falls out of sync

An ideal world
Update value: 16
Select result: 16
Update value: 23
Select result: 23
Update value: 31
Select result: 31
Update value: 12
Select result: 12
Update value: 33
Select result: 33

The command that changes the world
STOP SLAVE SQL_THREAD;
CHANGE MASTER TO MASTER_DELAY = 40;
START SLAVE SQL_THREAD;

Ooooo
Select result: 33
Update value: 6
Select result: 33
Update value: 8
Select result: 33
Update value: 13
Select result: 33
Update value: 43
Select result: 33
Update value: 30
Select result: 33
Update value: 47
Select result: 33
Update value: 45
Select result: 33
Update value: 5
Select result: 33
Update value: 38
Select result: 33
Update value: 4
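
The output above can be reproduced with a toy model: the master accepts a write every second, while the replica applies each change only after the configured delay, so every read returns the last value applied before the delay kicked in. This is a minimal Python sketch; DelayedReplica and simulate are hypothetical names, and real MySQL replication is considerably more involved.

```python
from collections import deque


class DelayedReplica:
    """Toy replica that applies each change `delay` seconds after the master committed it."""

    def __init__(self, delay, initial):
        self.delay = delay
        self.value = initial
        self._relay_log = deque()  # pending (commit_time, value) pairs

    def receive(self, commit_time, value):
        self._relay_log.append((commit_time, value))

    def read(self, now):
        # Apply every change whose delay has elapsed, oldest first.
        while self._relay_log and self._relay_log[0][0] + self.delay <= now:
            self.value = self._relay_log.popleft()[1]
        return self.value


def simulate(updates, delay, initial):
    """One write to the master per second, each followed by a read from the replica."""
    replica = DelayedReplica(delay, initial)
    results = []
    for second, value in enumerate(updates, start=1):
        replica.receive(second, value)                 # UPDATE on the master
        results.append((value, replica.read(second)))  # SELECT on the replica
    return results


for written, read in simulate([6, 8, 13, 43, 30], delay=40, initial=33):
    print(f"Update value: {written}")
    print(f"Select result: {read}")
```

With delay=0 the same loop behaves like the ideal world: each read returns the value that was just written.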

You want the meaty stuff?
• MySQL master
• MySQL read replica
• REST API
• An awesome cron job that fires every minute
Queries:
select * from order_status where transaction_date_changed > date_sub(now(), interval 3 minute)
Update order_status set transaction_date_changed=now() where id=1
Update order_status set Status=1 where id=blabla

Pre-paid top-ups
• MySQL master
• MySQL read replica
• REST API
• Many queries with heavy SELECTs
select used from accounts where voucher=132323232
if !used {
  update accounts set balance = (balance+500), used = true where voucher=132323232
}

Update
cost = 100
data = "SELECT balance from account where id=1" (read from the replica)
newBalance = data - cost
"UPDATE account set balance=newBalance where id=1" (sent to the master)
Versus
"UPDATE account set balance=balance-cost where id=1" (sent to the master)
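
The difference between the two variants shows up as soon as the replica is stale. The sketch below uses an in-memory SQLite database as a stand-in for the MySQL master and a plain variable as the stale replica read; the helper names are hypothetical and the table follows the queries on the slide.

```python
import sqlite3


def make_master(initial_balance):
    """In-memory stand-in for the master database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
    db.execute("INSERT INTO account (id, balance) VALUES (1, ?)", (initial_balance,))
    return db


def read_balance(db):
    return db.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]


def charge_read_modify_write(master, replica_balance, cost):
    """Compute the new balance in the application from a replica read."""
    new_balance = replica_balance - cost
    master.execute("UPDATE account SET balance = ? WHERE id = 1", (new_balance,))


def charge_atomic(master, cost):
    """Let the master do the arithmetic on its own, current value."""
    master.execute("UPDATE account SET balance = balance - ? WHERE id = 1", (cost,))


stale_replica_balance = 1000  # the replica has not caught up between the two charges

master = make_master(1000)
charge_read_modify_write(master, stale_replica_balance, 100)
charge_read_modify_write(master, stale_replica_balance, 100)
print("read-modify-write:", read_balance(master))  # one charge is lost

master = make_master(1000)
charge_atomic(master, 100)
charge_atomic(master, 100)
print("atomic update:", read_balance(master))
```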

Delivering data to Prometheus, using crypto exchanges as an example
• A large number of publicly available exporters for popular solutions
• Easy integration in code (e.g. prometheus-client, OpenCensus)
• Metrics can also be produced from, for example, external data
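
What an exporter ultimately hands to Prometheus is plain text, one sample per line. In practice a client library such as prometheus-client would produce it; the standard-library sketch below only shows the shape of a sample line. The metric and label names (exchange_trades, exchange, pair, type, quantile) come from the alert rules in this talk, while the function names, the exchange name and the price are made up for illustration.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes and newlines, as the text exposition format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, labels, value):
    """Render one sample line: name{label="value",...} value"""
    label_text = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{label_text}}} {value}"


# Hypothetical quote fetched from an external exchange API.
line = render_metric(
    "exchange_trades",
    {"exchange": "example", "pair": "BTCUSD", "type": "ask", "quantile": "0.99"},
    5100.5,
)
print(line)
```

Serving such lines over HTTP on a /metrics endpoint and pointing a ServiceMonitor at it is enough for Prometheus to start scraping the data.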



apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: prometheus-operator
  name: prometheus-exchanges-rules
  namespace: monitoring
spec:
  groups:
    - name: btc.rules
      rules:
        - alert: BTCBuyPrice
          annotations:
            message: 'Buy price of {{ $labels.pair }} on {{ $labels.exchange }} is {{ $value }}'
          expr: 'exchange_trades{type="ask", quantile="0.99"} > 5001'
          for: 3m
          labels:
            severity: info
        - alert: BTCSellPrice
          annotations:
            message: 'Sell price of {{ $labels.pair }} on {{ $labels.exchange }} is {{ $value }}'
          expr: 'exchange_trades{type="bid", quantile="0.99"} < 6001'
          for: 3m
          labels:
            severity: info

Man does not live by k8s alone
• Telegraf: a universal tool for collecting and exporting metrics - https://github.com/influxdata/telegraf
• Many input and output formats
• We will collect data about phone calls

Telegraf config
[global_tags]
  team = "uszanowanko"

[[outputs.prometheus_client]]
  listen = ":9273"

[[inputs.docker]]
  endpoint = "unix:///var/run/docker.sock"

[[inputs.snmp]]
  agents = [ "127.0.0.1:161" ]
  version = 2
  community = "public"
  name = "snmp"

  [[inputs.snmp.field]]
    name = "processed_calls"
    oid = "ASTERISK-MIB::astConfigCallsProcessed.0"

Prometheus config
- job_name: 'telegraf'
  scrape_interval: 5s
  static_configs:
    - targets: ['1.1.1.1:9273']

Metrics
# HELP snmp_processed_calls Telegraf collected metric
# TYPE snmp_processed_calls untyped
snmp_processed_calls{agent_host="127.0.0.1",host="static",team="uszanowanko"} 2
# HELP docker_memory_total Telegraf collected metric
# TYPE docker_memory_total untyped
docker_memory_total{engine_host="static",host="static",server_version="0.0.0-20190509050102-e9f60f21b0",team="uszanowanko"} 2.087882752e+09
# HELP docker_n_containers Telegraf collected metric
# TYPE docker_n_containers untyped
docker_n_containers{engine_host="static",host="static",server_version="0.0.0-20190509050102-e9f60f21b0",team="uszanowanko"} 1
# HELP docker_n_containers_paused Telegraf collected metric
# TYPE docker_n_containers_paused untyped
docker_n_containers_paused{engine_host="static",host="static",server_version="0.0.0-20190509050102-e9f60f21b0",team="uszanowanko"} 0
# HELP docker_n_containers_running Telegraf collected metric
# TYPE docker_n_containers_running untyped
docker_n_containers_running{engine_host="static",host="static",server_version="0.0.0-20190509050102-e9f60f21b0",team="uszanowanko"} 0
# HELP docker_n_containers_stopped Telegraf collected metric
# TYPE docker_n_containers_stopped untyped
docker_n_containers_stopped{engine_host="static",host="static",server_version="0.0.0-20190509050102-e9f60f21b0",team="uszanowanko"} 1

What for
• Predictive dialer
• Statistics on the work of call-center agents
• Running all kinds of applications driven by metrics (traffic routing, topping up accounts with operators)
• Fraud detection
• Sending notifications to misbehaving users

Business loves bar charts
• We quickly give the business information about its processes (stock levels, sales)
• The business can define alert thresholds and manage notifications on its own
• Dashboards built for every release or even every PR
• No extra load on the main application from statistical data
• How am I supposed to sell them Grafana? The Prometheus REST API

And now, time for questions.
Go ahead and grill him
