Prometheus casual talk1

prometheus

Published in: Data & Analytics

  1. 1. Hadoop, Fluentd cluster monitoring with Prometheus and Grafana 2016/06/14 @wyukawa Prometheus Casual Talks #1 #prometheuscasual
  2. 2. Agenda •  Prometheus History •  Prometheus Features •  Prometheus Architecture •  My use case
  3. 3. History •  Started in 2012 by ex-Google Site Reliability Engineers •  Written in Go •  Inspired by Google’s Borgmon – Borgmon monitors Borg •  Public announcement in January 2015 http://www.slideshare.net/FabianReinartz/prometheus-a-next-gen-monitoring-system-3
  4. 4. Features •  pull architecture – easy flow control – not easy to get through firewall •  Cloud Monitoring as a Service uses push model •  multi-dimensional data model •  powerful query language •  alert
  5. 5. pull architecture https://prometheus.io/docs/introduction/overview/
  6. 6. node_exporter example •  http://host:9100/metrics
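
The /metrics endpoint of an exporter returns plain text, one sample per line. As a rough illustration of what Prometheus reads when it scrapes, here is a minimal Python sketch that parses such text. The sample payload and its values are invented for illustration, and the parser only covers the simple cases (no escaped characters inside label values, no timestamps).

```python
import re

# Hypothetical payload, similar in shape to http://host:9100/metrics (values invented).
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu{cpu="cpu0",mode="idle"} 12345.6
node_cpu{cpu="cpu0",mode="user"} 78.9
"""

# metric_name, optional {labels}, then the value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = LINE_RE.match(line)
        if m is None:
            continue  # ignore anything this simplified parser cannot read
        name, raw_labels, value = m.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Each label set (for example mode="idle" versus mode="user") becomes its own time series under the same metric name.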
  7. 7. multi-dimensional data model •  metric types – counter – gauge – histogram – summary https://prometheus.io/docs/concepts/metric_types/
  8. 8. How to handle counter metrics •  Do you reset counters? http://www.robustperception.io/how-does-a-prometheus-counter-work/ No! Use the rate/irate/increase functions! 100 - (avg by (instance) (irate(node_cpu{job="node",mode="idle"}[5m])) * 100) http://www.robustperception.io/understanding-machine-cpu-usage/
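
To show why a counter never needs to be reset by hand, here is a simplified Python sketch of the idea behind increase and rate: take the differences between consecutive samples, and when the value drops (the process restarted), count from zero again. This is only an approximation written for illustration. The real Prometheus functions also extrapolate to the edges of the time window, which this sketch leaves out.

```python
def increase(samples):
    """Total growth of a counter over the samples.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur < prev:
            # The counter went backwards, so the process restarted
            # and the counter started again from 0.
            total += cur
        else:
            total += cur - prev
    return total


def rate(samples):
    """Average per-second growth, or None if it cannot be computed."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase(samples) / elapsed
```

For example, with the samples (0, 100), (15, 130), (30, 5), (45, 20), the restart between the second and third sample is absorbed and the total increase is 30 + 5 + 15.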
  9. 9. powerful query language sum by(status) (rate(http_response_status_total[1m])) ALERT DiskWillFillIn4Hours IF predict_linear(node_filesystem_free{job='node'}[1h], 4*3600) < 0 FOR 5m LABELS { severity="page" } http://www.robustperception.io/reduce-noise-from-disk-space-alerts/
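
The disk alert above relies on a linear prediction. As a sketch of the underlying idea, the Python below fits a least-squares line through the samples and extends it into the future. It is an illustration under assumptions: the prediction here is anchored at the last sample, whereas Prometheus anchors at the evaluation time, and the sample data in the usage note is invented.

```python
def predict_linear(samples, seconds_ahead):
    """Predict the value seconds_ahead after the last sample.

    samples: list of (timestamp_seconds, value), oldest first.
    Returns None when a line cannot be fitted.
    """
    n = len(samples)
    if n < 2:
        return None
    t_last = samples[-1][0]
    xs = [t - t_last for t, _ in samples]  # time relative to the last sample
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return None  # all samples share one timestamp
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x  # fitted value at the last sample
    return intercept + slope * seconds_ahead
```

If free space falls from 10e9 to 4e9 bytes over one hour, predict_linear(samples, 4 * 3600) is negative, which is the condition the DiskWillFillIn4Hours alert checks.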
  10. 10. Alert •  Alertmanager handles alerting •  very young compared to Prometheus itself •  very promising •  aim to have as few alerts as possible – repeat_interval: 4h
  11. 11. My use case •  At first I maintained file_sd_configs manually •  Now I use promgen! •  Exporters are run by supervisord/systemd •  Monitor middleware and machines – Hadoop – Fluentd – Elasticsearch
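
With file_sd_configs, Prometheus reads its scrape targets from JSON (or YAML) files: a list of groups, each with "targets" and optional "labels". Maintaining those files by hand is what a generator replaces. The Python sketch below shows the general idea only. It is not how promgen works, and the host names and the "cluster" label are hypothetical.

```python
import json


def build_file_sd(groups):
    """Build the file_sd structure.

    groups: dict mapping a cluster name to a list of "host:port" strings.
    The "cluster" label name is an assumption made for this example.
    """
    return [
        {"targets": sorted(targets), "labels": {"cluster": cluster}}
        for cluster, targets in sorted(groups.items())
    ]


def write_file_sd(path, groups):
    """Write the target file that a file_sd_configs entry points at."""
    with open(path, "w") as f:
        json.dump(build_file_sd(groups), f, indent=2)
```

Calling write_file_sd("targets.json", {"hadoop": ["nn1:9100", "dn1:9100"]}) would produce a file Prometheus can pick up without a restart, since file_sd files are re-read when they change.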
  12. 12. monitoring hadoop/hive •  Developers usually use jmx_exporter to monitor Java middleware •  But I implemented namenode/resourcemanager/jstat exporters because I wanted to, and because I don’t want to restart the daemons •  https://github.com/wyukawa/hadoop_exporter •  https://github.com/wyukawa/jstat_exporter
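
An exporter that runs as a separate process avoids touching the monitored daemon: it collects values from outside and serves them as text on /metrics. The Python sketch below shows that shape using only the standard library. It is not the code of hadoop_exporter or jstat_exporter; the collect function and the metric name are hypothetical placeholders.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect():
    """Hypothetical collector. A real exporter would query the daemon here
    (for example over its HTTP/JMX interface or by running jstat)."""
    return {"example_heap_used_bytes": 123456.0}


def render_metrics(values):
    """Render a dict of gauge values in the text exposition format."""
    lines = []
    for name, value in sorted(values.items()):
        lines.append("# TYPE %s gauge" % name)
        lines.append("%s %s" % (name, value))
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port):
    """Serve /metrics until interrupted."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Running serve on a free port under supervisord or systemd would give Prometheus something to scrape, while the monitored daemon keeps running untouched.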
  13. 13. Namenode block monitoring •  Grafana annotation: alerts are also Prometheus metrics, so Grafana can show alerts as annotations
  14. 14. Resourcemanager job monitoring
  15. 15. Hiveserver2 JVM monitoring https://issues.apache.org/jira/browse/HIVE-13374
  16. 16. Fluentd buffer monitoring •  fluent-plugin-prometheus enables buffer monitoring
  17. 17. access log count •  fluent-plugin-prometheus can count access log lines, but sampling is needed because of high CPU usage (Flink/Storm/… may be necessary)
  18. 18. HTTP status count •  Even when the real 4xx/5xx count is not 0, the graph may show 0 because of sampling
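
A small Python sketch makes the sampling effect concrete. It assumes a simple "look at every Nth record and scale up" scheme, which is an assumption for illustration; the slides do not say how the sampling is actually done. The request data is invented.

```python
from collections import Counter


def count_statuses(statuses, sample_every=1):
    """Count status classes (2xx, 5xx, ...) from every Nth record,
    scaling each sampled record up by N to estimate the real total."""
    counts = Counter()
    for i, status in enumerate(statuses):
        if i % sample_every == 0:
            counts["%dxx" % (status // 100)] += sample_every
    return counts


# Hypothetical traffic: 1000 requests, three of which returned 500.
statuses = [200] * 1000
for i in (137, 512, 903):
    statuses[i] = 500
```

Without sampling, count_statuses(statuses) sees all three errors. With sample_every=100, none of the three error records is sampled, so the 5xx count is 0 although errors did occur.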
  19. 19. HTTP status percentage
  20. 20. fluentd_exporter •  I implemented fluentd_exporter because I want to monitor Fluentd CPU usage http://d.hatena.ne.jp/wyukawa/20160603/1464934228
  21. 21. elasticsearch_exporter https://github.com/elastic/elasticsearch/issues/18635
  22. 22. My impression •  Prometheus has a powerful query language, but queries are sometimes difficult to understand – sum(rate(accesslog_counts{tag="..."}[1m])) by (status, job) / ignoring(status) group_left sum(rate(accesslog_counts{tag="..."}[1m])) by (job) •  Grafana is also great, but sharing links is a little weak
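
The group_left query above can be hard to read, so here is the same calculation spelled out as a Python sketch: divide each (job, status) rate by the total rate of its job. The left side has many series per job, the right side has one, and ignoring(status) group_left is what lets Prometheus match them. The input numbers in the usage note are invented.

```python
def status_fraction(rates):
    """Fraction of each job's traffic that each status accounts for.

    rates: dict mapping (job, status) to requests per second.
    """
    # Right-hand side of the query: sum(...) by (job)
    totals = {}
    for (job, _status), value in rates.items():
        totals[job] = totals.get(job, 0.0) + value
    # Left-hand side divided by the matching per-job total
    return {
        (job, status): value / totals[job]
        for (job, status), value in rates.items()
        if totals[job] > 0
    }
```

For example, with {("web", "200"): 9.0, ("web", "500"): 1.0}, the result gives the 500 status a tenth of the web job's traffic, which is what the HTTP status percentage graph plots.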
