Monitoring Kafka w/ Prometheus
Yuto Kawamura(kawamuray)
About me
● Software Engineer @ LINE corp
○ Develop & operate Apache HBase clusters
○ Design and implement data flow between services with ♥ to Apache Kafka
● Recent works
○ Playing with Apache Kafka and Kafka Streams
■ https://issues.apache.org/jira/browse/KAFKA-3471?jql=project%20%3D%20KAFKA%20AND%20assignee%20in%20(kawamuray)%20OR%20reporter%20in%20(kawamuray)
● Past works
○ Docker + Checkpoint Restore @ CoreOS meetup http://www.slideshare.net/kawamuray/coreos-meetup
○ Aggregating application logs with Norikra for real-time error notification @ Norikra Meetup #1 http://www.slideshare.net/kawamuray/norikra-meetup
○ Student @ Google Summer of Code 2013, 2014
● https://github.com/kawamuray
How are we (our team) using Prometheus?
● To monitor most of our middleware and the clients in our Java applications
○ Kafka clusters
○ HBase clusters
○ Kafka clients - producer and consumer
○ Stream Processing jobs
Overall Architecture
[Diagram: Grafana (Dashboard, Direct query), Prometheus (Federation), multiple Prometheus instances, HBase clusters, Kafka cluster, YARN Application, Pushgateway]
Why Prometheus?
● In-house monitoring tool wasn’t enough for large-scale, high-resolution metrics collection
● Good data model
○ Genuine metric identifier + attributes as labels
■ http_requests_total{code="200",handler="prometheus",instance="localhost:9090",job="prometheus",method="get"}
● Scalable by nature
● Simple philosophy
○ Metrics exposure interface: GET /metrics => Text Protocol
○ Monolithic server
● Flexible but easy PromQL
○ Derive aggregated metrics by composing existing metrics
○ E.g., total RX bps across the entire cluster
■ sum(rate(node_network_receive_bytes{cluster="cluster-A",device="eth0"}[30s]) * 8)
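As a rough illustration of what the query above computes, the following Java sketch takes the per-second increase of each series' byte counter over a window, converts bytes to bits, and sums the result across series. It is a simplified model and not Prometheus code: the real rate() also handles counter resets and extrapolates to the window boundaries. The names RateSketch, simpleRate and totalBitsPerSecond, and the sample values, are made up for this example.

```java
import java.util.List;

// Simplified model of sum(rate(counter[window]) * 8); not Prometheus code.
class RateSketch {
    // One counter sample: timestamp in seconds and cumulative byte count.
    static final class Sample {
        final double timestampSeconds;
        final double value;

        Sample(double timestampSeconds, double value) {
            this.timestampSeconds = timestampSeconds;
            this.value = value;
        }
    }

    // Per-second increase between the first and last sample in the window.
    // Assumption: at least two samples, no counter resets, no extrapolation.
    static double simpleRate(List<Sample> window) {
        Sample first = window.get(0);
        Sample last = window.get(window.size() - 1);
        return (last.value - first.value) / (last.timestampSeconds - first.timestampSeconds);
    }

    // Sum the per-series byte rates and convert bytes/s to bits/s.
    static double totalBitsPerSecond(List<List<Sample>> seriesWindows) {
        double total = 0.0;
        for (List<Sample> window : seriesWindows) {
            total += simpleRate(window) * 8;
        }
        return total;
    }

    public static void main(String[] args) {
        // Two hypothetical hosts, each with two samples 30 seconds apart.
        List<Sample> hostA = List.of(new Sample(0, 1_000), new Sample(30, 4_000));
        List<Sample> hostB = List.of(new Sample(0, 500), new Sample(30, 2_000));
        System.out.println(totalBitsPerSecond(List.of(hostA, hostB))); // prints 1200.0
    }
}
```

Host A receives 100 bytes/s and host B 50 bytes/s, so the cluster total is 150 bytes/s, or 1200 bits/s.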
Deployment
● Launch
○ Official Docker image: https://hub.docker.com/r/prom/prometheus/
○ Ansible for dynamic prometheus.yml generation based on inventory and container management
● Machine spec
○ 2.40GHz * 24 CPUs
○ 192GB RAM
○ 6 SSDs
○ Single SSD / Single Prometheus instance
○ Overkill? => Obviously. We reused existing idle servers. You definitely don’t need this crazy spec just to use it.
Kafka monitoring w/ Prometheus overview
[Diagram: Kafka broker, Kafka client in Java application, YARN ResourceManager, Stream Processing jobs on YARN, Prometheus Server, Pushgateway, jmx_exporter, Prometheus Java library + Servlet, JSON exporter, Kafka consumer group exporter]
Monitoring Kafka brokers - jmx_exporter
● https://github.com/prometheus/jmx_exporter
● Run as a standalone process (no -javaagent)
○ Just to avoid a cumbersome rolling restart
○ Might switch to the javaagent at the next rolling restart :p
● With a very complicated config.yml
○ https://gist.github.com/kawamuray/25136a9ab22b1cb992e435e0ea67eb06
● Colocate one instance per broker on the same host
Monitoring Kafka producer on Java application - prometheus_simpleclient
● https://github.com/prometheus/client_java
● Official Java client library
prometheus_simpleclient - Basic usage
private static final Counter queueOutCounter =
Counter.build()
.namespace("kafka_streams") // Namespace(= Application prefix?)
.name("process_count") // Metric name
.help("Process calls count") // Metric description
.labelNames("processor", "topic") // Declare labels
.register(); // Register to CollectorRegistry.defaultRegistry (default, global registry)
...
queueOutCounter.labels("Processor-A", "topic-T").inc(); // Increment counter with labels
queueOutCounter.labels("Processor-B", "topic-P").inc(2.0);
=> kafka_streams_process_count{processor="Processor-A",topic="topic-T"} 1.0
kafka_streams_process_count{processor="Processor-B",topic="topic-P"} 2.0
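To make the link between the labeled counter and the exposed lines above more concrete, here is a minimal Java sketch that renders one sample as a line of text: metric name, label pairs in braces, then the value. It is only an illustration and not the client library's implementation. ExpositionSketch, formatSample and escapeLabelValue are hypothetical names, and special values such as NaN or infinity are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: renders one sample line in the text exposition style.
class ExpositionSketch {
    // Escape backslash, double quote and newline inside a label value.
    static String escapeLabelValue(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    // Build: name{label1="value1",label2="value2"} value
    static String formatSample(String name, Map<String, String> labels, double value) {
        String labelPart = labels.entrySet().stream()
                .map(e -> e.getKey() + "=\"" + escapeLabelValue(e.getValue()) + "\"")
                .collect(Collectors.joining(","));
        return name + "{" + labelPart + "} " + value;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the labels in insertion order.
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("processor", "Processor-A");
        labels.put("topic", "topic-T");
        System.out.println(formatSample("kafka_streams_process_count", labels, 1.0));
    }
}
```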
Exposing Java application metrics
● Through servlet
○ io.prometheus.client.exporter.MetricsServlet from simpleclient_servlet
● Add an entry to web.xml, or use embedded Jetty:
Server server = new Server(METRICS_PORT);
ServletContextHandler context = new ServletContextHandler();
context.setContextPath("/");
server.setHandler(context);
context.addServlet(new ServletHolder(new MetricsServlet()), "/metrics");
server.start();
Monitoring Kafka producer on Java application - prometheus_simpleclient
● Primitive types:
○ Counter, Gauge, Histogram, Summary
● Kafka’s MetricsReporter interface gives KafkaMetric instances
● How to expose the value?
● => Implement a proxy metric type which extends SimpleCollector
public class PrometheusMetricsReporter implements MetricsReporter {
...
private void registerMetric(KafkaMetric kafkaMetric) {
...
KafkaMetricProxy.build()
.namespace("kafka")
.name(fqn)
.help("Help: " + metricName.description())
.labelNames(labelNames)
.register();
...
}
...
}
public class KafkaMetricProxy extends SimpleCollector<KafkaMetricProxy.Child> {
public static class Builder extends SimpleCollector.Builder<Builder, KafkaMetricProxy> {
@Override
public KafkaMetricProxy create() {
return new KafkaMetricProxy(this);
}
}
KafkaMetricProxy(Builder b) {
super(b);
}
...
@Override
public List<MetricFamilySamples> collect() {
List<MetricFamilySamples.Sample> samples = new ArrayList<>();
for (Map.Entry<List<String>, Child> entry : children.entrySet()) {
List<String> labels = entry.getKey();
Child child = entry.getValue();
samples.add(new MetricFamilySamples.Sample(fullname, labelNames, labels, child.getValue()));
}
return Collections.singletonList(new MetricFamilySamples(fullname, Type.GAUGE, help, samples));
}
}
Monitoring YARN jobs - json_exporter
● https://github.com/kawamuray/prometheus-json-exporter
○ Can export value from JSON by specifying the value as JSONPath
● http://<rm http address:port>/ws/v1/cluster/apps
○ https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API
○ https://gist.github.com/kawamuray/c07b03de82bf6ddbdae6508e27d3fb4d
json_exporter
- name: yarn_application
type: object
path: $.apps.app[*]?(@.state == "RUNNING")
labels:
application: $.id
phase: beta
values:
alive: 1
elapsed_time: $.elapsedTime
allocated_mb: $.allocatedMB
...
{"apps":{"app":[
{
"id": "application_1326815542473_0001",
"state": "RUNNING",
"elapsedTime": 25196,
"allocatedMB": 1024,
...
},
...
]}}
=>
yarn_application_alive{application="application_1326815542473_0001",phase="beta"} 1
yarn_application_elapsed_time{application="application_1326815542473_0001",phase="beta"} 25196
yarn_application_allocated_mb{application="application_1326815542473_0001",phase="beta"} 1024
Important configurations
● -storage.local.retention(default: 15 days)
○ TTL for collected values
● -storage.local.memory-chunks(default: 1M)
○ Practically controls memory allocation of Prometheus instance
○ Lower value can cause ingestion throttling(metric loss)
● -storage.local.max-chunks-to-persist(default: 512K)
○ Lower value can cause ingestion throttling likewise
○ https://prometheus.io/docs/operating/storage/#persistence-pressure-and-rushed-mode
○ > Equally important, especially if writing to a spinning disk, is raising the value for the storage.local.max-chunks-to-persist flag. As a rule of thumb, keep it around 50% of the storage.local.memory-chunks value.
● -query.staleness-delta(default: 5mins)
○ Resolution to detect lost metrics
○ Can lead to weird behavior on the Prometheus web UI
Query tips - label_replace function
● It’s quite common that two metrics have different label sets
○ E.g., server-side metrics and client-side metrics
● Say we have a metric like:
○ kafka_log_logendoffset{cluster="cluster-A",instance="HOST:PORT",job="kafka",partition="1234",topic="topic-A"}
● Introduce new label from existing label
○ label_replace(..., "host", "$1", "instance", "^([^:]+):.*")
○ => kafka_log_logendoffset{...,instance="HOST:PORT",host="HOST"}
● Rewrite existing label with new value
○ label_replace(..., "instance", "$1", "instance", "^([^:]+):.*")
○ => kafka_log_logendoffset{...,instance="HOST"}
● Even possible to rewrite metric name… :D
○ label_replace(kafka_log_logendoffset, "__name__", "foobar", "__name__", ".*")
○ => foobar{...}
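The label_replace examples above can be mimicked with a plain regular-expression substitution over a label map. The Java sketch below is a hedged approximation of the semantics, not Prometheus code: the regex has to match the whole source value, and on a match the destination label is set to the expanded replacement. The class and method names are hypothetical, and details such as dropping a label whose new value is empty are left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Approximation of label_replace(v, dst, replacement, src, regex) on one label set.
class LabelReplaceSketch {
    static Map<String, String> labelReplace(Map<String, String> labels, String dstLabel,
                                            String replacement, String srcLabel, String regex) {
        Map<String, String> result = new LinkedHashMap<>(labels);
        String source = labels.getOrDefault(srcLabel, "");
        // Anchor the pattern so that it has to match the whole source value.
        Pattern anchored = Pattern.compile("\\A(?:" + regex + ")\\z");
        Matcher matcher = anchored.matcher(source);
        if (matcher.find()) {
            // "$1" in the replacement refers to the first capture group.
            result.put(dstLabel, matcher.replaceFirst(replacement));
        }
        // No match: the label set is returned unchanged.
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("instance", "HOST:PORT");
        labels.put("job", "kafka");
        // Introduce a new "host" label derived from "instance".
        System.out.println(labelReplace(labels, "host", "$1", "instance", "^([^:]+):.*"));
        // Rewrite "instance" in place.
        System.out.println(labelReplace(labels, "instance", "$1", "instance", "^([^:]+):.*"));
    }
}
```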
Points to improve
● Service discovery
○ It’s too cumbersome to configure server list and exporter list statically
○ Pushgateway?
■ > The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus - https://github.com/prometheus/pushgateway/blob/master/README.md#prometheus-pushgateway-
○ file_sd_config? https://prometheus.io/docs/operating/configuration/#<file_sd_config>
■ > It reads a set of files containing a list of zero or more <target_group>s. Changes to all defined files are detected via disk watches and applied immediately.
● Local time support :(
○ They don’t like time zones other than UTC, which makes sense though: https://prometheus.io/docs/introduction/faq/#can-i-change-the-timezone?-why-is-everything-in-utc?
○ https://github.com/prometheus/prometheus/issues/500#issuecomment-167560093
○ It might still be possible to introduce a toggle on the view
Conclusion
● Data model is very intuitive
● PromQL is very powerful and relatively easy
○ Helps you find out important metrics from hundreds of metrics
● A few pitfalls need to be avoided by tuning configurations
○ memory-chunks, query.staleness-delta…
● Building an exporter is reasonably easy
○ Lots of languages are officially supported…
○ /metrics is the only interface
Questions?
End of Presentation
Metrics naming
● APPLICATIONPREFIX_METRICNAME
○ https://prometheus.io/docs/practices/naming/#metric-names
○ kafka_producer_request_rate
○ http_request_duration
● Fully utilize labels
○ x: kafka_network_request_duration_milliseconds_{max,min,mean}
○ o: kafka_network_request_duration_milliseconds{aggregation="max|min|mean"}
○ Compare all min/max/mean in a single graph: kafka_network_request_duration_milliseconds{instance="HOSTA"}
○ Much more flexible than using static names
Alerting
● Not using Alertmanager
● In-house monitoring tool has alerting capability
○ Has a user directory of alerting targets
○ Has a well-known expression format for configuring alerts
○ Tool unification is important and should be respected as much as possible
● Then?
○ Built a tool to mirror metrics from Prometheus to the in-house monitoring tool
○ Set up alerts on the in-house monitoring tool
/api/v1/query?query=sum(kafka_stream_process_calls_rate{client_id=~"CLIENT_ID.*"}) by (instance)
{
"status": "success",
"data": {
"resultType": "vector",
"result": [
{
"metric": {
"instance": "HOST_A:PORT"
},
"value": [
1465819064.067,
"82317.10280584119"
]
},
{
"metric": {
"instance": "HOST_B:PORT"
},
"value": [
1465819064.067,
"81379.73499610288"
]
}
]
}
}
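A mirroring tool like the one described above would need to build instant-query URLs such as the one shown, with the PromQL expression URL-encoded. The Java sketch below shows only that step, under stated assumptions: the host name prometheus.example:9090 and the names QueryUrlSketch and instantQueryUrl are hypothetical, and the code says nothing about how the author's actual tool works. It uses URLEncoder.encode(String, Charset), which requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds an instant-query URL for the /api/v1/query endpoint.
class QueryUrlSketch {
    static String instantQueryUrl(String baseUrl, String promQl) {
        // Braces, quotes, "=~" and spaces in PromQL must be encoded in the query string.
        return baseUrl + "/api/v1/query?query=" + URLEncoder.encode(promQl, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String promQl = "sum(kafka_stream_process_calls_rate{client_id=~\"CLIENT_ID.*\"}) by (instance)";
        // Hypothetical Prometheus address, for illustration only.
        System.out.println(instantQueryUrl("http://prometheus.example:9090", promQl));
    }
}
```

The response can then be parsed with any JSON library. Each entry of data.result carries the label set under "metric" and a [timestamp, value] pair under "value", as in the response above.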
public class KafkaMetricProxy extends SimpleCollector<KafkaMetricProxy.Child> {
...
public static class Child {
private KafkaMetric kafkaMetric;
public void setKafkaMetric(KafkaMetric kafkaMetric) {
this.kafkaMetric = kafkaMetric;
}
double getValue() {
return kafkaMetric == null ? 0 : kafkaMetric.value();
}
}
@Override
protected Child newChild() {
return new Child();
}
...
}
Monitoring Kafka consumer offset - kafka_consumer_group_exporter
● https://github.com/kawamuray/prometheus-kafka-consumer-group-exporter
● Exports some metrics about Kafka consumer groups by executing the kafka-consumer-groups.sh command (bundled with Kafka)
● Specific exporter for specific use
● Better to be familiar with your favorite exporter framework
○ Raw use of official prometheus package: https://github.com/prometheus/client_golang/tree/master/prometheus
○ Mine: https://github.com/kawamuray/prometheus-exporter-harness
Query tips - Product set (intersection)
● The result of an operation on two or more metrics is the intersection (product set) of their label sets
● metric_A{cluster="A or B"}
● metric_B{cluster="A or B",instance="a or b or c"}
● metric_A / metric_B
● => {} (no result, because the label sets do not match)
● metric_A / sum(metric_B) by (cluster)
● => {cluster="A or B"}
● x: metric_A{cluster="A"} - sum(metric_B{cluster="A"}) by (cluster)
● o: metric_A{cluster="A"} - sum(metric_B) by (cluster) => Same result!
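The matching behavior behind metric_A / sum(metric_B) by (cluster) can be sketched in Java as an aggregation followed by a join on the remaining label. This is a simplified model with a single matching label (cluster), not how Prometheus implements vector matching, and all names and values are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of: metric_A / sum(metric_B) by (cluster)
class VectorMatchSketch {
    // One series of metric_B, labeled by cluster and instance.
    static final class Series {
        final String cluster;
        final String instance;
        final double value;

        Series(String cluster, String instance, double value) {
            this.cluster = cluster;
            this.instance = instance;
            this.value = value;
        }
    }

    // sum(...) by (cluster): drop the instance label and add up values per cluster.
    static Map<String, Double> sumByCluster(List<Series> series) {
        Map<String, Double> sums = new LinkedHashMap<>();
        for (Series s : series) {
            sums.merge(s.cluster, s.value, Double::sum);
        }
        return sums;
    }

    // Division keeps only the clusters present on both sides (the intersection).
    static Map<String, Double> divideMatching(Map<String, Double> left, Map<String, Double> right) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : left.entrySet()) {
            Double divisor = right.get(entry.getKey());
            if (divisor != null) {
                result.put(entry.getKey(), entry.getValue() / divisor);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> metricA = new LinkedHashMap<>();
        metricA.put("A", 10.0);
        metricA.put("B", 20.0);
        List<Series> metricB = List.of(
                new Series("A", "a", 2.0),
                new Series("A", "b", 3.0),
                new Series("B", "a", 4.0));
        System.out.println(divideMatching(metricA, sumByCluster(metricB))); // prints {A=2.0, B=5.0}
    }
}
```

Restricting only the left side to cluster A already limits the result to cluster A, which is why filtering the right side as well is redundant.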

 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 

Recently uploaded (20)

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Monitoring Kafka w/ Prometheus

  • 2. About me
    ● Software Engineer @ LINE corp
      ○ Develop & operate Apache HBase clusters
      ○ Design and implement data flow between services with ♥ to Apache Kafka
    ● Recent works
      ○ Playing with Apache Kafka and Kafka Streams
        ■ https://issues.apache.org/jira/browse/KAFKA-3471?jql=project%20%3D%20KAFKA%20AND%20assignee%20in%20(kawamuray)%20OR%20reporter%20in%20(kawamuray)
    ● Past works
      ○ Docker + Checkpoint Restore @ CoreOS meetup http://www.slideshare.net/kawamuray/coreos-meetup
      ○ Aggregating application logs with Norikra for real-time error notification (Norikraでアプリログを集計してリアルタイムエラー通知) @ Norikra Meetup #1 http://www.slideshare.net/kawamuray/norikra-meetup
      ○ Student @ Google Summer of Code 2013, 2014
    ● https://github.com/kawamuray
  • 3. How are we (our team) using Prometheus?
    ● To monitor most of our middleware and the clients in our Java applications
      ○ Kafka clusters
      ○ HBase clusters
      ○ Kafka clients - producer and consumer
      ○ Stream Processing jobs
  • 5. Why Prometheus?
    ● Inhouse monitoring tool wasn't enough for large-scale + high-resolution metrics collection
    ● Good data model
      ○ Genuine metric identifier + attributes as labels
        ■ http_requests_total{code="200",handler="prometheus",instance="localhost:9090",job="prometheus",method="get"}
    ● Scalable by nature
    ● Simple philosophy
      ○ Metrics exposure interface: GET /metrics => Text Protocol
      ○ Monolithic server
    ● Flexible but easy PromQL
      ○ Derive aggregated metrics by composing existing metrics
      ○ E.g., sum of RX bps of the entire cluster
        ■ sum(rate(node_network_receive_bytes{cluster="cluster-A",device="eth0"}[30s]) * 8)
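The data model and the text protocol on the slide above can be illustrated with a minimal Java sketch that renders one sample as a line of the text exposition format (metric name, labels in braces, then the value). The class `ExpositionLine` and its methods are hypothetical names made up for this illustration; they are not part of any Prometheus client library, and a real application would normally use the official client instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration only; not part of any Prometheus library.
final class ExpositionLine {
    private ExpositionLine() {}

    // Escape a label value: backslash, double quote, and newline need escaping.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    // Render one sample as: name{label="value",...} value
    static String render(String name, Map<String, String> labels, double value) {
        StringBuilder sb = new StringBuilder(name);
        if (!labels.isEmpty()) {
            StringJoiner joiner = new StringJoiner(",", "{", "}");
            for (Map.Entry<String, String> e : labels.entrySet()) {
                joiner.add(e.getKey() + "=\"" + escape(e.getValue()) + "\"");
            }
            sb.append(joiner);
        }
        return sb.append(' ').append(value).toString();
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("code", "200");
        labels.put("method", "get");
        // Prints: http_requests_total{code="200",method="get"} 1027.0
        System.out.println(render("http_requests_total", labels, 1027));
    }
}
```

Because every sample is just a name plus a label set, a scraper only needs to read such lines from GET /metrics; no per-metric schema is required.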
  • 6. Deployment
    ● Launch
      ○ Official Docker image: https://hub.docker.com/r/prom/prometheus/
      ○ Ansible for dynamic prometheus.yml generation based on inventory and container management
    ● Machine spec
      ○ 2.40GHz * 24 CPUs
      ○ 192GB RAM
      ○ 6 SSDs
      ○ One SSD per Prometheus instance
      ○ Overkill? => Obviously. We reused existing idle servers; you certainly don't need this crazy spec just to use Prometheus.
  • 7. Kafka monitoring w/ Prometheus overview
    [Diagram components: Kafka broker, Kafka client in Java application, YARN ResourceManager, Stream Processing jobs on YARN, Prometheus Server, Pushgateway, JMX exporter, Prometheus Java library + Servlet, JSON exporter, Kafka consumer group exporter]
  • 8. Monitoring Kafka brokers - jmx_exporter
    ● https://github.com/prometheus/jmx_exporter
    ● Run as a standalone process (no -javaagent)
      ○ Only to avoid a cumbersome rolling restart
      ○ May switch to the javaagent at the next rolling restart :p
    ● With a very complicated config.yml
      ○ https://gist.github.com/kawamuray/25136a9ab22b1cb992e435e0ea67eb06
    ● Colocate one instance per broker on the same host
  • 9. Monitoring Kafka producer on Java application - prometheus_simpleclient
    ● https://github.com/prometheus/client_java
    ● Official Java client library
  • 10. prometheus_simpleclient - Basic usage
      private static final Counter queueOutCounter = Counter.build()
          .namespace("kafka_streams")        // Namespace (= application prefix?)
          .name("process_count")             // Metric name
          .help("Process calls count")       // Metric description
          .labelNames("processor", "topic")  // Declare labels
          .register();                       // Register to CollectorRegistry.defaultRegistry (default, global registry)
      ...
      queueOutCounter.labels("Processor-A", "topic-T").inc();    // Increment counter with labels
      queueOutCounter.labels("Processor-B", "topic-P").inc(2.0);

      =>
      kafka_streams_process_count{processor="Processor-A",topic="topic-T"} 1.0
      kafka_streams_process_count{processor="Processor-B",topic="topic-P"} 2.0
  • 11. Exposing Java application metrics
    ● Through a servlet
      ○ io.prometheus.client.exporter.MetricsServlet from simpleclient_servlet
    ● Add an entry to web.xml or to embedded Jetty
      ..
      Server server = new Server(METRICS_PORT);
      ServletContextHandler context = new ServletContextHandler();
      context.setContextPath("/");
      server.setHandler(context);
      // Serve the default registry at /metrics
      context.addServlet(new ServletHolder(new MetricsServlet()), "/metrics");
      server.start();
  • 12. Monitoring Kafka producer on Java application - prometheus_simpleclient
    ● Primitive types:
      ○ Counter, Gauge, Histogram, Summary
    ● Kafka's MetricsReporter interface gives KafkaMetric instances
    ● How to expose the value?
    ● => Implement a proxy metric type which extends SimpleCollector
      public class PrometheusMetricsReporter implements MetricsReporter {
          ...
          private void registerMetric(KafkaMetric kafkaMetric) {
              ...
              KafkaMetricProxy.build()
                  .namespace("kafka")
                  .name(fqn)
                  .help("Help: " + metricName.description())
                  .labelNames(labelNames)
                  .register();
              ...
          }
          ...
      }
  • 13.
      public class KafkaMetricProxy extends SimpleCollector<KafkaMetricProxy.Child> {
          public static class Builder extends SimpleCollector.Builder<Builder, KafkaMetricProxy> {
              @Override
              public KafkaMetricProxy create() {
                  return new KafkaMetricProxy(this);
              }
          }

          KafkaMetricProxy(Builder b) {
              super(b);
          }
          ...
          // Report the current value of every registered child as a gauge sample
          @Override
          public List<MetricFamilySamples> collect() {
              List<MetricFamilySamples.Sample> samples = new ArrayList<>();
              for (Map.Entry<List<String>, Child> entry : children.entrySet()) {
                  List<String> labels = entry.getKey();
                  Child child = entry.getValue();
                  samples.add(new Sample(fullname, labelNames, labels, child.getValue()));
              }
              return Collections.singletonList(new MetricFamilySamples(fullname, Type.GAUGE, help, samples));
          }
      }
  • 14. Monitoring YARN jobs - json_exporter
    ● https://github.com/kawamuray/prometheus-json-exporter
      ○ Can export a value from JSON by specifying the value as a JSONPath
    ● http://<rm http address:port>/ws/v1/cluster/apps
      ○ https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API
      ○ https://gist.github.com/kawamuray/c07b03de82bf6ddbdae6508e27d3fb4d
  • 15. json_exporter
      - name: yarn_application
        type: object
        path: $.apps.app[*]?(@.state == "RUNNING")
        labels:
          application: $.id
          phase: beta
        values:
          alive: 1
          elapsed_time: $.elapsedTime
          allocated_mb: $.allocatedMB
      ...

      +

      {"apps":{"app":[
        {
          "id": "application_1234_0001",
          "state": "RUNNING",
          "elapsedTime": 25196,
          "allocatedMB": 1024,
          ...
        },
        ...
      ]}}

      =>
      yarn_application_alive{application="application_1234_0001",phase="beta"} 1
      yarn_application_elapsed_time{application="application_1234_0001",phase="beta"} 25196
      yarn_application_allocated_mb{application="application_1234_0001",phase="beta"} 1024
  • 16. Important configurations
    ● -storage.local.retention (default: 15 days)
      ○ TTL for collected values
    ● -storage.local.memory-chunks (default: 1M)
      ○ Practically controls the memory allocation of a Prometheus instance
      ○ A lower value can cause ingestion throttling (metric loss)
    ● -storage.local.max-chunks-to-persist (default: 512K)
      ○ A lower value can likewise cause ingestion throttling
      ○ https://prometheus.io/docs/operating/storage/#persistence-pressure-and-rushed-mode
      ○ > Equally important, especially if writing to a spinning disk, is raising the value for the storage.local.max-chunks-to-persist flag. As a rule of thumb, keep it around 50% of the storage.local.memory-chunks value.
    ● -query.staleness-delta (default: 5 mins)
      ○ Resolution to detect lost metrics
      ○ Can lead to weird behavior on the Prometheus WebUI
  • 17. Query tips - label_replace function
    ● It's quite common that two metrics have different label sets
      ○ E.g., server-side metrics and client-side metrics
    ● Say we have a metric like:
      ○ kafka_log_logendoffset{cluster="cluster-A",instance="HOST:PORT",job="kafka",partition="1234",topic="topic-A"}
    ● Introduce a new label from an existing label
      ○ label_replace(..., "host", "$1", "instance", "^([^:]+):.*")
      ○ => kafka_log_logendoffset{...,instance="HOST:PORT",host="HOST"}
    ● Rewrite an existing label with a new value
      ○ label_replace(..., "instance", "$1", "instance", "^([^:]+):.*")
      ○ => kafka_log_logendoffset{...,instance="HOST"}
    ● Even possible to rewrite the metric name… :D
      ○ label_replace(kafka_log_logendoffset, "__name__", "foobar", "__name__", ".*")
      ○ => foobar{...}
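To make the label_replace behaviour above concrete, here is a rough Java emulation of what the function does to a single series' label set: the regex must match the whole source label value, and the expanded replacement is written to the destination label. `LabelReplaceSketch` is a hypothetical name for this illustration, and the sketch is a simplification under stated assumptions (it relies on Java's regex dialect and ignores special cases such as an empty replacement result), not Prometheus's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical emulation of label_replace on one label set; illustration only.
final class LabelReplaceSketch {
    private LabelReplaceSketch() {}

    static Map<String, String> labelReplace(Map<String, String> labels, String dst,
                                            String replacement, String src, String regex) {
        Map<String, String> out = new LinkedHashMap<>(labels);
        String srcValue = labels.getOrDefault(src, "");
        Matcher m = Pattern.compile(regex).matcher(srcValue);
        // The regex has to match the entire source value; otherwise nothing changes.
        if (!m.matches()) {
            return out;
        }
        // Expand $1, $2, ... in the replacement using the captured groups.
        StringBuffer expanded = new StringBuffer();
        m.appendReplacement(expanded, replacement);
        out.put(dst, expanded.toString());
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("instance", "broker1:9092");
        labels.put("topic", "topic-A");
        // Adds host="broker1" and keeps instance="broker1:9092"
        System.out.println(labelReplace(labels, "host", "$1", "instance", "^([^:]+):.*"));
        // Overwrites instance with "broker1"
        System.out.println(labelReplace(labels, "instance", "$1", "instance", "^([^:]+):.*"));
    }
}
```

Using the same label as both source and destination is what turns "introduce a new label" into "rewrite an existing label".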
  • 18. Points to improve
    ● Service discovery
      ○ It's too cumbersome to configure the server list and exporter list statically
      ○ Pushgateway?
        ■ > The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus - https://github.com/prometheus/pushgateway/blob/master/README.md#prometheus-pushgateway-
      ○ file_sd_config? https://prometheus.io/docs/operating/configuration/#<file_sd_config>
        ■ > It reads a set of files containing a list of zero or more <target_group>s. Changes to all defined files are detected via disk watches and applied immediately.
    ● Local time support :(
      ○ They don't like time zones other than UTC, which makes sense though: https://prometheus.io/docs/introduction/faq/#can-i-change-the-timezone?-why-is-everything-in-utc?
      ○ https://github.com/prometheus/prometheus/issues/500#issuecomment-167560093
      ○ It might still be possible to introduce a toggle on the view
  • 19. Conclusion
    ● Data model is very intuitive
    ● PromQL is very powerful and relatively easy
      ○ Helps you find the important metrics among hundreds of metrics
    ● A few pitfalls need to be avoided by tuning configuration
      ○ memory-chunks, query.staleness-delta…
    ● Building an exporter is reasonably easy
      ○ Official client libraries support lots of languages…
      ○ /metrics is the only interface
  • 22. Metrics naming
    ● APPLICATIONPREFIX_METRICNAME
      ○ https://prometheus.io/docs/practices/naming/#metric-names
      ○ kafka_producer_request_rate
      ○ http_request_duration
    ● Fully utilize labels
      ○ x: kafka_network_request_duration_milliseconds_{max,min,mean}
      ○ o: kafka_network_request_duration_milliseconds{aggregation="max|min|mean"}
      ○ Compare all of min/max/mean in a single graph: kafka_network_request_duration_milliseconds{instance="HOSTA"}
      ○ Much more flexible than using static names
  • 23. Alerting
    ● Not using Alert Manager
    ● Inhouse monitoring tool has alerting capability
      ○ Has a user directory of alerting targets
      ○ Has a well-known expression syntax to configure alerting
      ○ Tool unification is important and should be respected as far as possible
    ● Then?
      ○ Built a tool to mirror metrics from Prometheus to the inhouse monitoring tool
      ○ Set up alerts on the inhouse monitoring tool

      /api/v1/query?query=sum(kafka_stream_process_calls_rate{client_id=~"CLIENT_ID.*"}) by (instance)

      {
        "status": "success",
        "data": {
          "resultType": "vector",
          "result": [
            { "metric": { "instance": "HOST_A:PORT" }, "value": [ 1465819064.067, "82317.10280584119" ] },
            { "metric": { "instance": "HOST_B:PORT" }, "value": [ 1465819064.067, "81379.73499610288" ] }
          ]
        }
      }
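A mirroring tool like the one described above has to turn a PromQL expression into a request against the /api/v1/query endpoint, which means URL-encoding the expression. The following is a minimal sketch of that one step only; `InstantQueryUrl` and the base address `http://localhost:9090` are assumptions made for this example, and the sketch says nothing about how the author's actual tool is built.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for a mirroring tool; illustration only.
final class InstantQueryUrl {
    private InstantQueryUrl() {}

    // baseUrl is an assumed Prometheus address, e.g. "http://localhost:9090".
    static String build(String baseUrl, String promql) {
        String trimmed = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        // Form-style encoding: spaces become '+', braces and quotes are percent-encoded.
        return trimmed + "/api/v1/query?query=" + URLEncoder.encode(promql, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String promql = "sum(kafka_stream_process_calls_rate{client_id=~\"CLIENT_ID.*\"}) by (instance)";
        System.out.println(build("http://localhost:9090", promql));
    }
}
```

The response is the JSON document shown on the slide; each entry of `data.result` carries the label set in `metric` and a `[timestamp, "value"]` pair in `value`, with the value encoded as a string.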
  • 24.
      public class KafkaMetricProxy extends SimpleCollector<KafkaMetricProxy.Child> {
          ...
          public static class Child {
              private KafkaMetric kafkaMetric;

              public void setKafkaMetric(KafkaMetric kafkaMetric) {
                  this.kafkaMetric = kafkaMetric;
              }

              // Report 0 until a KafkaMetric has been attached
              double getValue() {
                  return kafkaMetric == null ? 0 : kafkaMetric.value();
              }
          }

          @Override
          protected Child newChild() {
              return new Child();
          }
          ...
      }
  • 25. Monitoring Kafka consumer offset - kafka_consumer_group_exporter
    ● https://github.com/kawamuray/prometheus-kafka-consumer-group-exporter
    ● Exports some metrics about Kafka consumer groups by executing the kafka-consumer-groups.sh command (bundled with Kafka)
    ● Specific exporter for a specific use
    ● Better to be familiar with your favorite exporter framework
      ○ Raw use of the official prometheus package: https://github.com/prometheus/client_golang/tree/master/prometheus
      ○ Mine: https://github.com/kawamuray/prometheus-exporter-harness
  • 26. Query tips - Product set
    ● A calculation over two or more metrics yields the product set of their label sets
    ● metric_A{cluster="A or B"}
    ● metric_B{cluster="A or B",instance="a or b or c"}
    ● metric_A / metric_B
    ● => {}
    ● metric_A / sum(metric_B) by (cluster)
    ● => {cluster="A or B"}
    ● x: metric_A{cluster="A"} - sum(metric_B{cluster="A"}) by (cluster)
    ● o: metric_A{cluster="A"} - sum(metric_B) by (cluster) => Same result!
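The matching behaviour on the slide above can be reproduced with a small Java model in which a vector is a map from label set to value. A binary operator only pairs samples whose label sets are identical, so `metric_A / metric_B` is empty, while aggregating metric_B by cluster first makes the label sets line up. `VectorMatchSketch` is a hypothetical name, and the model is a simplification (it assumes every sample carries the grouping label and ignores the metric name), not how Prometheus is implemented.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of PromQL one-to-one vector matching; illustration only.
final class VectorMatchSketch {
    private VectorMatchSketch() {}

    // lhs / rhs: only samples with identical label sets are paired.
    static Map<Map<String, String>, Double> divide(Map<Map<String, String>, Double> lhs,
                                                   Map<Map<String, String>, Double> rhs) {
        Map<Map<String, String>, Double> out = new LinkedHashMap<>();
        for (Map.Entry<Map<String, String>, Double> e : lhs.entrySet()) {
            Double divisor = rhs.get(e.getKey());
            if (divisor != null) {
                out.put(e.getKey(), e.getValue() / divisor);
            }
        }
        return out;
    }

    // sum(v) by (label): keep only the given label and add up the values per group.
    static Map<Map<String, String>, Double> sumBy(Map<Map<String, String>, Double> v, String label) {
        Map<Map<String, String>, Double> out = new LinkedHashMap<>();
        for (Map.Entry<Map<String, String>, Double> e : v.entrySet()) {
            Map<String, String> key = Collections.singletonMap(label, e.getKey().getOrDefault(label, ""));
            out.merge(key, e.getValue(), Double::sum);
        }
        return out;
    }

    private static Map<String, String> labels(String cluster, String instance) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("cluster", cluster);
        m.put("instance", instance);
        return m;
    }

    public static void main(String[] args) {
        Map<Map<String, String>, Double> metricA = new LinkedHashMap<>();
        metricA.put(Collections.singletonMap("cluster", "A"), 10.0);
        metricA.put(Collections.singletonMap("cluster", "B"), 20.0);

        Map<Map<String, String>, Double> metricB = new LinkedHashMap<>();
        metricB.put(labels("A", "a"), 2.0);
        metricB.put(labels("A", "b"), 3.0);
        metricB.put(labels("B", "a"), 4.0);

        // metric_A / metric_B: no identical label sets, so the result is empty.
        System.out.println(divide(metricA, metricB));
        // metric_A / sum(metric_B) by (cluster): one result per cluster.
        System.out.println(divide(metricA, sumBy(metricB, "cluster")));
    }
}
```

The same model also shows why filtering one side is enough: if the left-hand side contains only cluster "A", the cluster "B" group on the right simply finds no partner and drops out.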