Prometheus casual talk1

Hadoop, Fluentd cluster monitoring
with Prometheus and Grafana
2016/06/14
@wyukawa
Prometheus Casual Talks #1
#prometheuscasual

Agenda
•  Prometheus History
•  Prometheus Feature
•  Prometheus Architecture
•  My use case

History
•  Started in 2012 by ex-Google Site Reliability
Engineers
•  Written in Go
•  Inspired by Google’s Borgmon
– Borgmon monitors Borg
•  Public announcement in January 2015
http://www.slideshare.net/FabianReinartz/prometheus-a-next-gen-monitoring-system-3

Features
•  pull architecture
– easy flow control
– not easy to get through firewall
•  Cloud monitoring-as-a-service products use a push model
•  multi-dimensional data model
•  powerful query language
•  alert

pull architecture
https://prometheus.io/docs/introduction/overview/

node_exporter example
•  http://host:9100/metrics
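
To make the pull model concrete: an exporter answers the scrape with plain-text lines of the form `name{label="value"} number`. The sketch below parses a small hand-written sample of that format. The sample lines and the helper name `parse_line` are illustrative assumptions, not real node_exporter output, and the parser deliberately ignores escaping inside label values and optional timestamps.

```python
import re

# Hypothetical sample in the Prometheus text exposition format.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu{cpu="cpu0",mode="idle"} 12345.6
"""

# metric name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_line(line):
    """Return (name, labels, value), or None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError("unparseable line: " + line)
    name, raw_labels, value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)


samples = [s for s in map(parse_line, SAMPLE.splitlines()) if s is not None]
```

Each label set becomes a separate time series, which is what the multi-dimensional data model refers to.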

multi-dimensional data model
•  metric types
– counter
– gauge
– histogram
– summary
https://prometheus.io/docs/concepts/metric_types/

How to handle counter metrics
•  Do you reset counters?
http://www.robustperception.io/how-does-a-prometheus-counter-work/
No! Use the rate/irate/increase functions!
100 - (avg by (instance)
(irate(node_cpu{job="node",mode="idle"}[5m])) *
100)
http://www.robustperception.io/understanding-machine-cpu-usage/
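
Why no reset is needed: a counter only goes up, and when the process restarts it simply starts again from zero. rate/increase treat any drop as such a restart. The sketch below is a simplified model of that idea, not Prometheus's actual implementation (which also extrapolates to the window boundaries). The function names `counter_increase` and `counter_rate` are assumptions made for illustration.

```python
def counter_increase(values):
    """Total increase across consecutive counter samples, tolerating resets."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # The value dropped, so assume the counter restarted from 0
            # and count everything accumulated since the restart.
            total += cur
    return total


def counter_rate(values, window_seconds):
    """Average per-second increase over the window."""
    return counter_increase(values) / window_seconds


# The counter restarts between 20 and 3.
samples = [10, 15, 20, 3, 8]
increase = counter_increase(samples)
per_second = counter_rate(samples, 60)
```

With these samples the increase is 5 + 5 + 3 + 5 = 18, so the restart does not produce a negative spike.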

powerful query language
sum by (status) (
rate(http_response_status_total[1m])
)
ALERT DiskWillFillIn4Hours
IF predict_linear(node_filesystem_free{job='node'}[1h], 4*3600) < 0
FOR 5m
LABELS {
severity="page"
}
http://www.robustperception.io/reduce-noise-from-disk-space-alerts/
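
The alert above fires when a straight line fitted to the last hour of free-space samples crosses zero within four hours. A minimal sketch of that calculation follows, assuming an ordinary least-squares fit extrapolated from the last sample. The function name `predict_linear_sketch` and the sample data are made up for illustration, and Prometheus itself extrapolates from the evaluation time rather than from the last sample.

```python
def predict_linear_sketch(samples, horizon_seconds):
    """Fit a least-squares line to (timestamp, value) pairs and extrapolate.

    Returns the predicted value `horizon_seconds` after the last sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must differ")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + horizon_seconds)


# Hypothetical data: free space shrinks by 100 units every 10 minutes for 1 hour.
free_space = [(600 * i, 1000 - 100 * i) for i in range(7)]
predicted = predict_linear_sketch(free_space, 4 * 3600)
will_fill = predicted < 0
```

Here the disk still has 400 units free, yet the trend says it runs out well within four hours, so the condition holds before the disk is actually full.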

Alert
•  Alertmanager handles alert notification
•  very young compared to Prometheus itself
•  very promising
•  aim to have as few alerts as possible
– repeat_interval: 4h

My use case
•  At first I maintained file_sd_configs by hand
•  Now I use promgen!
•  Exporters are run by supervisord/systemd
•  Monitoring middleware and machines
– Hadoop
– Fluentd
– Elasticsearch
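
For reference, a file read through file_sd_configs is a JSON (or YAML) list of target groups, each with `targets` and optional `labels`. The sketch below generates such a file. The host names, the `cluster` label, and the helper names are hypothetical, and this is not how promgen does it.

```python
import json
import os
import tempfile


def build_target_groups(hosts, port, labels):
    """Build one file_sd target group for the given hosts."""
    targets = ["%s:%d" % (host, port) for host in hosts]
    return [{"targets": targets, "labels": dict(labels)}]


def write_file_sd(path, groups):
    """Write the groups as JSON, replacing the file in one step.

    Writing to a temporary file first avoids leaving a half-written
    target file behind if the script is interrupted.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.replace(tmp_path, path)


# Hypothetical hosts running node_exporter on port 9100.
groups = build_target_groups(
    ["hadoop-nn1.example.com", "hadoop-nn2.example.com"],
    9100,
    {"cluster": "hadoop"},
)
```

Calling `write_file_sd("targets.json", groups)` would produce a file that a `file_sd_configs` entry can point at. Maintaining many such files by hand is the chore that a generator tool removes.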

monitoring hadoop/hive
•  Developers usually use jmx_exporter to
monitor Java middleware
•  But I implemented my own NameNode/
ResourceManager/jstat exporters because I
don't want to restart the daemons
•  https://github.com/wyukawa/hadoop_exporter
•  https://github.com/wyukawa/jstat_exporter

Namenode block monitoring
Grafana Annotation
Alerts are also Prometheus metrics, so Grafana can show them as annotations

Hiveserver2 jvm monitoring
https://issues.apache.org/jira/browse/HIVE-13374

Fluentd buffer monitoring
•  fluent-plugin-prometheus enables buffer
monitoring

access log count
•  fluent-plugin-prometheus can count access log
lines, but sampling is needed because of high
CPU usage (Flink/Storm/… may be necessary)

HTTP status count
Even when the real 4xx/5xx count is not 0, the graph may show 0
because of sampling
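
The effect can be reproduced with a toy model. The sketch below assumes deterministic sampling (keep every N-th log line and multiply the count by N). The slides do not say how the sampling was actually done, so the method, the function name `sampled_status_counts`, and the data are all assumptions.

```python
from collections import Counter


def sampled_status_counts(statuses, sample_every):
    """Count every `sample_every`-th status and scale up to estimate totals."""
    counts = Counter()
    for index, status in enumerate(statuses):
        if index % sample_every == 0:
            counts[status] += sample_every
    return counts


# Hypothetical traffic: 1000 requests, three of which returned 500.
statuses = ["200"] * 1000
for index in (7, 333, 901):
    statuses[index] = "500"

true_counts = Counter(statuses)
estimate = sampled_status_counts(statuses, 100)
```

None of the three errors lands on a sampled line, so `estimate["500"]` is 0 while `true_counts["500"]` is 3. Rare statuses are exactly the ones that sampling hides.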

fluentd_exporter
•  I implemented fluentd_exporter because I want
to monitor Fluentd CPU usage
http://d.hatena.ne.jp/wyukawa/20160603/1464934228

elasticsearch_exporter
https://github.com/elastic/elasticsearch/issues/18635

My impression
•  Prometheus has a powerful query language, but queries are
sometimes difficult to understand
– sum(rate(accesslog_counts{tag="..."}[1m])) by
(status, job) / ignoring(status) group_left
sum(rate(accesslog_counts{tag="..."}[1m])) by
(job)
•  Grafana is also great, but sharing links is a little
weak
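
The query in the first bullet reads more easily once spelled out: it divides each (status, job) rate by the total rate of the same job, giving the share of each HTTP status per job. A sketch of the same arithmetic follows. The function name `status_share` and the numbers are made up for illustration.

```python
from collections import defaultdict


def status_share(rates):
    """rates maps (status, job) to a per-second rate.

    Returns the fraction of each job's traffic that each status represents,
    mirroring a many-to-one match on `job` (the `group_left` side keeps
    the extra `status` label).
    """
    job_totals = defaultdict(float)
    for (_status, job), rate in rates.items():
        job_totals[job] += rate
    return {
        (status, job): rate / job_totals[job]
        for (status, job), rate in rates.items()
        if job_totals[job] > 0
    }


# Hypothetical per-second rates.
rates = {
    ("200", "web"): 90.0,
    ("500", "web"): 10.0,
    ("200", "api"): 50.0,
}
shares = status_share(rates)
```

Here 10% of the `web` traffic is status 500, and all of the `api` traffic is status 200.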
