Monitoring Kafka w/ Prometheus


  1. Monitoring Kafka w/ Prometheus
     Yuto Kawamura (kawamuray)
  2. About me
     ● Software Engineer @ LINE corp
       ○ Develop & operate Apache HBase clusters
       ○ Design and implement data flow between services with ♥ to Apache Kafka
     ● Recent works
       ○ Playing with Apache Kafka and Kafka Streams
         ■ https://issues.apache.org/jira/browse/KAFKA-3471?jql=project%20%3D%20KAFKA%20AND%20assignee%20in%20(kawamuray)%20OR%20reporter%20in%20(kawamuray)
     ● Past works
       ○ Docker + Checkpoint Restore @ CoreOS meetup http://www.slideshare.net/kawamuray/coreos-meetup
       ○ Real-time error notification by aggregating application logs with Norikra @ Norikra Meetup #1 http://www.slideshare.net/kawamuray/norikra-meetup
       ○ Student @ Google Summer of Code 2013, 2014
     ● https://github.com/kawamuray
  3. How are we (our team) using Prometheus?
     ● To monitor most of our middleware, and clients in Java applications
       ○ Kafka clusters
       ○ HBase clusters
       ○ Kafka clients - producers and consumers
       ○ Stream processing jobs
  4. Overall Architecture
     [Architecture diagram: a Grafana dashboard queries a federating Prometheus, which federates several per-cluster Prometheus instances scraping HBase clusters, a Kafka cluster, and YARN applications (via a Pushgateway); Grafana also issues direct queries to the per-cluster instances.]
  5. Why Prometheus?
     ● In-house monitoring tool wasn’t enough for large-scale + high-resolution metrics collection
     ● Good data model
       ○ Genuine metric identifier + attributes as labels
         ■ http_requests_total{code="200",handler="prometheus",instance="localhost:9090",job="prometheus",method="get"}
     ● Scalable by nature
     ● Simple philosophy
       ○ Metrics exposure interface: GET /metrics => text protocol
       ○ Monolithic server
     ● Flexible but easy PromQL
       ○ Derive aggregated metrics by composing existing metrics
       ○ E.g., total receive throughput (bps) of an entire cluster:
         ■ sum(rate(node_network_receive_bytes{cluster="cluster-A",device="eth0"}[30s]) * 8)
  6. Deployment
     ● Launch
       ○ Official Docker image: https://hub.docker.com/r/prom/prometheus/
       ○ Ansible for dynamic prometheus.yml generation based on inventory, and for container management
     ● Machine spec
       ○ 2.40GHz * 24 CPUs
       ○ 192GB RAM
       ○ 6 SSDs
       ○ Single SSD / single Prometheus instance
       ○ Overkill? => Obviously. We reused existing idle servers; you certainly don’t need a spec this crazy just to run Prometheus.
  7. Kafka monitoring w/ Prometheus overview
     [Overview diagram: the Prometheus server scrapes Kafka brokers through jmx_exporter, Kafka clients in Java applications through the Prometheus Java library + servlet, the YARN ResourceManager through the JSON exporter, stream processing jobs on YARN through the Pushgateway, and consumer offsets through the Kafka consumer group exporter.]
  8. Monitoring Kafka brokers - jmx_exporter
     ● https://github.com/prometheus/jmx_exporter
     ● Run as a standalone process (no -javaagent)
       ○ Just to avoid a cumbersome rolling restart
       ○ Maybe switch to the javaagent at the next opportunity for a rolling restart :p
     ● With a very complicated config.yml
       ○ https://gist.github.com/kawamuray/25136a9ab22b1cb992e435e0ea67eb06
     ● Colocate one instance per broker on the same host
  9. Monitoring Kafka producers in Java applications - prometheus_simpleclient
     ● https://github.com/prometheus/client_java
     ● Official Java client library
  10. prometheus_simpleclient - Basic usage
      private static final Counter queueOutCounter = Counter.build()
          .namespace("kafka_streams")        // Namespace (= application prefix)
          .name("process_count")             // Metric name
          .help("Process calls count")       // Metric description
          .labelNames("processor", "topic")  // Declare labels
          .register();                       // Register to CollectorRegistry.defaultRegistry (the default, global registry)
      ...
      queueOutCounter.labels("Processor-A", "topic-T").inc();   // Increment counter with labels
      queueOutCounter.labels("Processor-B", "topic-P").inc(2.0);

      => kafka_streams_process_count{processor="Processor-A",topic="topic-T"} 1.0
         kafka_streams_process_count{processor="Processor-B",topic="topic-P"} 2.0
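      The same builder pattern applies to the other primitive types listed on slide 12 (Gauge, Histogram, Summary). A minimal sketch, with hypothetical metric names that are not from the deck:

      import io.prometheus.client.Gauge;
      import io.prometheus.client.Histogram;

      public class OtherMetricTypes {
          // Hypothetical metrics, built the same way as the Counter above
          static final Gauge queueSize = Gauge.build()
                  .namespace("kafka_streams")
                  .name("queue_size")
                  .help("Current number of buffered records")
                  .labelNames("processor")
                  .register();

          static final Histogram processLatency = Histogram.build()
                  .namespace("kafka_streams")
                  .name("process_latency_seconds")
                  .help("Latency of process() calls in seconds")
                  .labelNames("processor")
                  .register();

          public static void main(String[] args) {
              queueSize.labels("Processor-A").set(42);             // Gauges can go up and down
              processLatency.labels("Processor-A").observe(0.012); // Histograms record observations into buckets
          }
      }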
  11. Exposing Java application metrics
      ● Through a servlet
        ○ io.prometheus.client.exporter.MetricsServlet from simpleclient_servlet
      ● Add an entry to web.xml, or use embedded Jetty:

      ...
      Server server = new Server(METRICS_PORT);
      ServletContextHandler context = new ServletContextHandler();
      context.setContextPath("/");
      server.setHandler(context);
      context.addServlet(new ServletHolder(new MetricsServlet()), "/metrics");
      server.start();
  12. Monitoring Kafka producers in Java applications - prometheus_simpleclient
      ● Primitive types:
        ○ Counter, Gauge, Histogram, Summary
      ● Kafka’s MetricsReporter interface gives us KafkaMetric instances
      ● How to expose their values?
      ● => Implement a proxy metric type which extends SimpleCollector

      public class PrometheusMetricsReporter implements MetricsReporter {
          ...
          private void registerMetric(KafkaMetric kafkaMetric) {
              ...
              KafkaMetricProxy.build()
                      .namespace("kafka")
                      .name(fqn)
                      .help("Help: " + metricName.description())
                      .labelNames(labelNames)
                      .register();
              ...
          }
          ...
      }
  13.
      public class KafkaMetricProxy extends SimpleCollector<KafkaMetricProxy.Child> {
          public static class Builder extends SimpleCollector.Builder<Builder, KafkaMetricProxy> {
              @Override
              public KafkaMetricProxy create() {
                  return new KafkaMetricProxy(this);
              }
          }

          KafkaMetricProxy(Builder b) {
              super(b);
          }
          ...
          @Override
          public List<MetricFamilySamples> collect() {
              List<MetricFamilySamples.Sample> samples = new ArrayList<>();
              for (Map.Entry<List<String>, Child> entry : children.entrySet()) {
                  List<String> labels = entry.getKey();
                  Child child = entry.getValue();
                  samples.add(new Sample(fullname, labelNames, labels, child.getValue()));
              }
              return Collections.singletonList(new MetricFamilySamples(fullname, Type.GAUGE, help, samples));
          }
      }
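      Not covered in the deck: how such a reporter gets attached to a client. A hedged sketch, assuming the PrometheusMetricsReporter class from slide 12 is on the classpath; metric.reporters is standard Kafka client configuration, and the broker address is made up:

      import java.util.Properties;
      import org.apache.kafka.clients.producer.KafkaProducer;
      import org.apache.kafka.clients.producer.ProducerConfig;
      import org.apache.kafka.common.serialization.StringSerializer;

      public class ProducerWithPrometheusReporter {
          public static void main(String[] args) {
              Properties props = new Properties();
              props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // assumed broker address
              props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
              props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
              // Kafka instantiates the reporter and feeds it KafkaMetric instances through
              // MetricsReporter.init()/metricChange(), where registerMetric() above gets called
              props.put(ProducerConfig.METRIC_REPORTER_CLASSES_CONFIG,
                        PrometheusMetricsReporter.class.getName());

              try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                  // produce records as usual; producer metrics are now exposed via the proxies
              }
          }
      }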
  14. Monitoring YARN jobs - json_exporter
      ● https://github.com/kawamuray/prometheus-json-exporter
        ○ Can export values from JSON by specifying them as JSONPath expressions
      ● http://<rm http address:port>/ws/v1/cluster/apps
        ○ https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API
        ○ https://gist.github.com/kawamuray/c07b03de82bf6ddbdae6508e27d3fb4d
  15. json_exporter
      - name: yarn_application
        type: object
        path: $.apps.app[*]?(@.state == "RUNNING")
        labels:
          application: $.id
          phase: beta
        values:
          alive: 1
          elapsed_time: $.elapsedTime
          allocated_mb: $.allocatedMB
      ...
      +
      {"apps":{"app":[
        {
          "id": "application_1234_0001",
          "state": "RUNNING",
          "elapsedTime": 25196,
          "allocatedMB": 1024,
          ...
        },
        ...
      ]}}
      =>
      yarn_application_alive{application="application_1326815542473_0001",phase="beta"} 1
      yarn_application_elapsed_time{application="application_1326815542473_0001",phase="beta"} 25196
      yarn_application_allocated_mb{application="application_1326815542473_0001",phase="beta"} 1024
  16. Important configurations
      ● -storage.local.retention (default: 15 days)
        ○ TTL for collected values
      ● -storage.local.memory-chunks (default: 1M)
        ○ Practically controls memory allocation of a Prometheus instance
        ○ A lower value can cause ingestion throttling (metric loss)
      ● -storage.local.max-chunks-to-persist (default: 512K)
        ○ A lower value can likewise cause ingestion throttling
        ○ https://prometheus.io/docs/operating/storage/#persistence-pressure-and-rushed-mode
        ○ > Equally important, especially if writing to a spinning disk, is raising the value for the storage.local.max-chunks-to-persist flag. As a rule of thumb, keep it around 50% of the storage.local.memory-chunks value.
      ● -query.staleness-delta (default: 5 mins)
        ○ Resolution for detecting lost metrics
        ○ Can lead to weird behavior in the Prometheus WebUI
  17. Query tips - label_replace function
      ● It’s quite common that two metrics have different label sets
        ○ E.g., a server-side metric and a client-side metric
      ● Say we have a metric like:
        ○ kafka_log_logendoffset{cluster="cluster-A",instance="HOST:PORT",job="kafka",partition="1234",topic="topic-A"}
      ● Introduce a new label from an existing label
        ○ label_replace(..., "host", "$1", "instance", "^([^:]+):.*")
        ○ => kafka_log_logendoffset{...,instance="HOST:PORT",host="HOST"}
      ● Rewrite an existing label with a new value
        ○ label_replace(..., "instance", "$1", "instance", "^([^:]+):.*")
        ○ => kafka_log_logendoffset{...,instance="HOST"}
      ● Even possible to rewrite the metric name… :D
        ○ label_replace(kafka_log_logendoffset, "__name__", "foobar", "__name__", ".*")
        ○ => foobar{...}
  18. Points to improve
      ● Service discovery
        ○ It’s too cumbersome to configure server lists and exporter lists statically
        ○ Pushgateway?
          ■ > The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus - https://github.com/prometheus/pushgateway/blob/master/README.md#prometheus-pushgateway
        ○ file_sd_config? https://prometheus.io/docs/operating/configuration/#<file_sd_config>
          ■ > It reads a set of files containing a list of zero or more <target_group>s. Changes to all defined files are detected via disk watches and applied immediately.
      ● Local time support :(
        ○ They don’t like timezones other than UTC; it makes sense though: https://prometheus.io/docs/introduction/faq/#can-i-change-the-timezone?-why-is-everything-in-utc?
        ○ https://github.com/prometheus/prometheus/issues/500#issuecomment-167560093
        ○ Might still be possible to introduce a toggle in the view
  19. Conclusion
      ● The data model is very intuitive
      ● PromQL is very powerful and relatively easy
        ○ Helps you find the important metrics among hundreds of metrics
      ● A few pitfalls need to be avoided by tuning configurations
        ○ memory-chunks, query.staleness-delta…
      ● Building an exporter is reasonably easy
        ○ Client libraries are officially supported for lots of languages…
        ○ /metrics is the only interface
  20. Questions?
  21. End of Presentation
  22. Metrics naming
      ● APPLICATIONPREFIX_METRICNAME
        ○ https://prometheus.io/docs/practices/naming/#metric-names
        ○ kafka_producer_request_rate
        ○ http_request_duration
      ● Fully utilize labels
        ○ x: kafka_network_request_duration_milliseconds_{max,min,mean}
        ○ o: kafka_network_request_duration_milliseconds{"aggregation"="max|min|mean"}
        ○ Compare all min/max/mean in a single graph: kafka_network_request_duration_milliseconds{instance="HOSTA"}
        ○ Much more flexible than using static names
  23. Alerting
      ● Not using Alertmanager
      ● The in-house monitoring tool has alerting capability
        ○ Has a user directory of alerting targets
        ○ Has well-known expressions to configure alerting
        ○ Tool unification is important and should be respected as much as possible
      ● Then?
        ○ Built a tool to mirror metrics from Prometheus to the in-house monitoring tool
        ○ Set up alerts on the in-house monitoring tool

      /api/v1/query?query=sum(kafka_stream_process_calls_rate{client_id=~"CLIENT_ID.*"}) by (instance)

      {
        "status": "success",
        "data": {
          "resultType": "vector",
          "result": [
            { "metric": { "instance": "HOST_A:PORT" }, "value": [ 1465819064.067, "82317.10280584119" ] },
            { "metric": { "instance": "HOST_B:PORT" }, "value": [ 1465819064.067, "81379.73499610288" ] }
          ]
        }
      }
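      The mirroring tool itself isn’t shown in the deck; below is a rough, hypothetical sketch of its core step. Only the /api/v1/query endpoint and the example PromQL come from the slide; the Prometheus address and class name are made up:

      import java.io.InputStream;
      import java.net.URL;
      import java.net.URLEncoder;
      import java.nio.charset.StandardCharsets;
      import java.util.Scanner;

      public class PrometheusMirror {
          public static void main(String[] args) throws Exception {
              String promql = "sum(kafka_stream_process_calls_rate{client_id=~\"CLIENT_ID.*\"}) by (instance)";
              String url = "http://prometheus-host:9090/api/v1/query?query="
                           + URLEncoder.encode(promql, StandardCharsets.UTF_8.name());
              try (InputStream in = new URL(url).openStream();
                   Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name())) {
                  String json = scanner.useDelimiter("\\A").next();
                  // Parse data.result[] (label set + [timestamp, value]) with your JSON library
                  // of choice, then push each sample to the in-house monitoring tool's API.
                  System.out.println(json);
              }
          }
      }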
  24.
      public class KafkaMetricProxy extends SimpleCollector<KafkaMetricProxy.Child> {
          ...
          public static class Child {
              private KafkaMetric kafkaMetric;

              public void setKafkaMetric(KafkaMetric kafkaMetric) {
                  this.kafkaMetric = kafkaMetric;
              }

              double getValue() {
                  return kafkaMetric == null ? 0 : kafkaMetric.value();
              }
          }

          @Override
          protected Child newChild() {
              return new Child();
          }
          ...
      }
  25. Monitoring Kafka consumer offsets - kafka_consumer_group_exporter
      ● https://github.com/kawamuray/prometheus-kafka-consumer-group-exporter
      ● Exports metrics about Kafka consumer groups by executing the kafka-consumer-groups.sh command (bundled with Kafka)
      ● A specific exporter for a specific use
      ● Better to become familiar with your favorite exporter framework
        ○ Raw use of the official prometheus package: https://github.com/prometheus/client_golang/tree/master/prometheus
        ○ Mine: https://github.com/kawamuray/prometheus-exporter-harness
  26. Query tips - Product set
      ● Calculating with two or more metrics produces the product set of their label sets
      ● metric_A{cluster="A or B"}
      ● metric_B{cluster="A or B",instance="a or b or c"}
      ● metric_A / metric_B
      ● => {} (empty: no matching label sets)
      ● metric_A / sum(metric_B) by (cluster)
      ● => {cluster="A or B"}
      ● x: metric_A{cluster="A"} - sum(metric_B{cluster="A"}) by (cluster)
      ● o: metric_A{cluster="A"} - sum(metric_B) by (cluster) => Same result!
