
Flink Forward Berlin 2018: Maximilian Bode - "Monitoring Flink with Prometheus"



Prometheus is a cloud-native monitoring system prioritizing reliability and simplicity, and Flink works really well with it! This session will show you how to use the Flink metrics system together with Prometheus to improve the observability of your jobs. There will be a live demo showing how everything fits together. The talk is aimed at people already building and running Flink jobs who would like to gain more insight into them. It is fine if you are not familiar with Prometheus yet, as the basic concepts will be introduced. If you have ever wondered how you could use modern monitoring tools to be alerted in the middle of the night when your Flink job's 99th percentile end-to-end latency degrades, this might just be the talk you are looking for.

Published in: Technology


  1. Monitoring Flink with Prometheus. Flink Forward Berlin 2018, Maximilian Bode
  2. whoami: Software Engineer with a focus on data-intensive applications and site reliability engineering
  3. Prometheus (prometheus/prometheus)
     - open-source, metrics-based monitoring system
     - simple yet powerful data model & query language
     - client libraries in all popular languages
     - high-performance and simple to run
  4. Metrics: time series of 64-bit floating-point numbers
     Labels: key-value pairs associated with time series
     Scrape: act of fetching metrics via HTTP request
     TSDB: Prometheus storage layer (prometheus/tsdb)
     PromQL: query language, used for graphing and alerting
     Example series: flink_jobmanager_job_uptime{job_name="PrometheusExampleJob"}
  5. grafana/grafana, prometheus/alertmanager
     - alert: FlinkJobsMissing
       expr: sum(flink_api_jobs_running) < 2
       for: 3m
       annotations:
         summary: Fewer Flink jobs than expected are running.
  7. PrometheusReporter
     Step 1: Copy the reporter jar into the lib directory
       cp /opt/flink-metrics-prometheus-1.6.0.jar /lib
     Step 2: Configure in conf/flink-conf.yaml
       metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
       metrics.reporter.prom.port: 9999
  8. Step 3: Let Prometheus scrape, either a) statically in prometheus.yml or b) via service discovery
     scrape_configs:
       - job_name: 'jobmanager'
         static_configs:
           - targets: ['jobmanager:9999']
       - job_name: 'taskmanager1'
         static_configs:
           - targets: ['taskmanager1:9999']
       - [...]
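  9. Step 4: Use custom metrics in your jobs
     import org.apache.flink.api.common.functions.RichMapFunction;
     import org.apache.flink.configuration.Configuration;
     import org.apache.flink.metrics.Counter;

     class CountingMap extends RichMapFunction<Integer, Integer> {
       // Counter is not serializable, so it is created in open() on the task
       private transient Counter eventCounter;

       @Override
       public void open(Configuration parameters) {
         // Register a counter named "events" in this operator's metric group
         eventCounter = getRuntimeContext().getMetricGroup().counter("events");
       }

       @Override
       public Integer map(Integer value) {
         eventCounter.inc(); // count every record passing through
         return value;
       }
     }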
  10. Image credit: "Praying squirrel" by Michael Seeley (CC BY 2.0)
  11. Further reading
      - [Flink docs] Debugging & Monitoring / Metrics
      - [Prometheus docs] Introduction / Overview
      - Prometheus Up & Running, Brian Brazil
      - prometheus/pushgateway
      - [Prometheus docs] Remote endpoints & storage
      - improbable-eng/thanos
  12. mbode/flink-prometheus-example, @mxpbode, Maximilian Bode
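
The metric, label, and scrape concepts from slide 4 can be made concrete with a small sketch. It assumes the Prometheus text exposition format, in which a scrape response carries one line per sample of the form name{label="value"} value. The class ExpositionSketch and its methods are hypothetical helpers written for illustration; they are not part of Flink or of any Prometheus client library, and the uptime value is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not a Flink or Prometheus API.
class ExpositionSketch {

    // Renders one sample as: name{label="value",...} value
    static String renderSample(String name, Map<String, String> labels, double value) {
        String labelPart = labels.entrySet().stream()
                .map(e -> e.getKey() + "=\"" + escape(e.getValue()) + "\"")
                .collect(Collectors.joining(","));
        String series = labels.isEmpty() ? name : name + "{" + labelPart + "}";
        return series + " " + value;
    }

    // Escapes backslash, double quote, and newline inside a label value.
    static String escape(String labelValue) {
        return labelValue
                .replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n");
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the labels in insertion order.
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("job_name", "PrometheusExampleJob");

        // The value 42000.0 is an invented uptime, used only as sample data.
        System.out.println(renderSample("flink_jobmanager_job_uptime", labels, 42000.0));
        // prints: flink_jobmanager_job_uptime{job_name="PrometheusExampleJob"} 42000.0
    }
}
```

In a real setup the PrometheusReporter produces these lines for you; the sketch only shows how a metric name, its labels, and a 64-bit floating-point value combine into the single sample that Prometheus stores per scrape.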