Monitoring Kafka w/ Prometheus
Yuto Kawamura(kawamuray)
About me
● Software Engineer @ LINE corp
○ Develop & operate Apache HBase clusters
○ Design and implement data flow between services with ♥ to Apache Kafka
● Recent works
○ Playing with Apache Kafka and Kafka Streams
■ https://issues.apache.org/jira/browse/KAFKA-3471?jql=project%20%3D%20KAFKA%20AND%20assignee%20in%20(kawamuray)%20OR%20reporter%20in%20(kawamuray)
● Past works
○ Docker + Checkpoint Restore @ CoreOS meetup http://www.slideshare.net/kawamuray/coreos-meetup
○ Aggregating application logs with Norikra for real-time error notification @ Norikra Meetup #1 http://www.slideshare.net/kawamuray/norikra-meetup
○ Student @ Google Summer of Code 2013, 2014
● https://github.com/kawamuray
How are we (our team) using Prometheus?
● To monitor most of our middleware and the clients inside our Java applications
○ Kafka clusters
○ HBase clusters
○ Kafka clients - producer and consumer
○ Stream Processing jobs
Overall Architecture
[Architecture diagram. Components shown: Grafana (dashboard, direct query), Prometheus (federation), several Prometheus instances, HBase clusters, Kafka cluster, YARN application, Pushgateway.]
Why Prometheus?
● Our in-house monitoring tool wasn't enough for large-scale, high-resolution metrics collection
● Good data model
○ Genuine metric identifier + attributes as labels
■ http_requests_total{code="200",handler="prometheus",instance="localhost:9090",job="prometheus",method="get"}
● Scalable by nature
● Simple philosophy
○ Metrics exposure interface: GET /metrics => Text Protocol
○ Monolithic server
● Flexible but easy PromQL
○ Derive aggregated metrics by composing existing metrics
○ E.g., total received (RX) bits per second across an entire cluster
■ sum(rate(node_network_receive_bytes{cluster="cluster-A",device="eth0"}[30s]) * 8)
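The arithmetic behind the query above can be sketched in plain Java. This is a simplified illustration, not Prometheus's implementation: the real rate() also handles counter resets and extrapolates to the window boundaries. The class name, method names, and sample values are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RateSketch {
    /** Per-second increase between two counter samples (ignores counter resets). */
    static double ratePerSecond(double firstValue, double firstTs, double lastValue, double lastTs) {
        return (lastValue - firstValue) / (lastTs - firstTs);
    }

    /**
     * Mirrors sum(rate(bytes[window]) * 8): each entry maps an instance to
     * {firstValue, firstTimestamp, lastValue, lastTimestamp} inside the window.
     */
    static double totalBitsPerSecond(Map<String, double[]> window) {
        double total = 0;
        for (double[] s : window.values()) {
            total += ratePerSecond(s[0], s[1], s[2], s[3]) * 8; // bytes/s -> bits/s
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, double[]> window = new LinkedHashMap<>();
        window.put("host-a:9100", new double[] {1000, 0, 4000, 30}); // made-up samples
        window.put("host-b:9100", new double[] {0, 0, 1500, 30});
        System.out.println(totalBitsPerSecond(window));
    }
}
```

The multiplication by 8 happens per series before summing, which matches the placement of `* 8` inside `sum(...)` in the query.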
Deployment
● Launch
○ Official Docker image: https://hub.docker.com/r/prom/prometheus/
○ Ansible for dynamic prometheus.yml generation based on inventory, and for container management
● Machine spec
○ 2.40GHz * 24 CPUs
○ 192GB RAM
○ 6 SSDs
○ Single SSD / Single Prometheus instance
○ Overkill? => Obviously. We reused existing idle servers. You certainly don't need such a crazy spec just to use it.
Kafka monitoring w/ Prometheus overview
[Overview diagram. Components shown: Kafka broker, Kafka client in Java application, YARN ResourceManager, stream processing jobs on YARN, Prometheus Server, Pushgateway, JMX exporter, Prometheus Java library + Servlet, JSON exporter, Kafka consumer group exporter.]
Monitoring Kafka brokers - jmx_exporter
● https://github.com/prometheus/jmx_exporter
● Run as a standalone process (no -javaagent)
○ Only to avoid a cumbersome rolling restart
○ Might switch to the javaagent at the next opportunity for a rolling restart :p
● With very complicated config.yml
○ https://gist.github.com/kawamuray/25136a9ab22b1cb992e435e0ea67eb06
● Colocate one instance per broker on the same host
Monitoring Kafka producer on Java application - prometheus_simpleclient
● https://github.com/prometheus/client_java
● Official Java client library
prometheus_simpleclient - Basic usage
private static final Counter queueOutCounter =
    Counter.build()
        .namespace("kafka_streams") // Namespace (= application prefix?)
        .name("process_count") // Metric name
        .help("Process calls count") // Metric description
        .labelNames("processor", "topic") // Declare labels
        .register(); // Register to CollectorRegistry.defaultRegistry (default, global registry)
...
queueOutCounter.labels("Processor-A", "topic-T").inc(); // Increment counter with labels
queueOutCounter.labels("Processor-B", "topic-P").inc(2.0);
=> kafka_streams_process_count{processor="Processor-A",topic="topic-T"} 1.0
   kafka_streams_process_count{processor="Processor-B",topic="topic-P"} 2.0
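To make the mapping from labeled increments to exposed lines concrete, here is a minimal stand-alone sketch that does not use the Prometheus client library. `LabeledCounterSketch` and its methods are hypothetical names for illustration only; the real client additionally handles thread safety, label-value escaping, and HELP/TYPE lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LabeledCounterSketch {
    private final String name;
    private final List<String> labelNames;
    // One running total per distinct combination of label values.
    private final Map<List<String>, Double> children = new LinkedHashMap<>();

    LabeledCounterSketch(String name, String... labelNames) {
        this.name = name;
        this.labelNames = Arrays.asList(labelNames);
    }

    void inc(double amount, String... labelValues) {
        if (labelValues.length != labelNames.size()) {
            throw new IllegalArgumentException("label count mismatch");
        }
        children.merge(Arrays.asList(labelValues), amount, Double::sum);
    }

    /** Renders one text line per child, e.g. name{a="x",b="y"} 1.0 (no escaping). */
    List<String> render() {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<List<String>, Double> e : children.entrySet()) {
            StringBuilder sb = new StringBuilder(name).append('{');
            for (int i = 0; i < labelNames.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(labelNames.get(i)).append("=\"").append(e.getKey().get(i)).append('"');
            }
            lines.add(sb.append("} ").append(e.getValue()).toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        LabeledCounterSketch c =
            new LabeledCounterSketch("kafka_streams_process_count", "processor", "topic");
        c.inc(1.0, "Processor-A", "topic-T");
        c.inc(2.0, "Processor-B", "topic-P");
        c.render().forEach(System.out::println);
    }
}
```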
Exposing Java application metrics
● Through servlet
○ io.prometheus.client.exporter.MetricsServlet from simpleclient_servlet
● Add an entry to web.xml, or use embedded Jetty:
Server server = new Server(METRICS_PORT);
ServletContextHandler context = new ServletContextHandler();
context.setContextPath("/");
server.setHandler(context);
context.addServlet(new ServletHolder(new MetricsServlet()), "/metrics");
server.start();
Monitoring Kafka producer on Java application - prometheus_simpleclient
● Primitive types:
○ Counter, Gauge, Histogram, Summary
● Kafka's MetricsReporter interface gives KafkaMetric instances
● How to expose the value?
● => Implement a proxy metric type which extends SimpleCollector
public class PrometheusMetricsReporter implements MetricsReporter {
    ...
    private void registerMetric(KafkaMetric kafkaMetric) {
        ...
        KafkaMetricProxy.build()
            .namespace("kafka")
            .name(fqn)
            .help("Help: " + metricName.description())
            .labelNames(labelNames)
            .register();
        ...
    }
    ...
}
public class KafkaMetricProxy extends SimpleCollector<KafkaMetricProxy.Child> {
    public static class Builder extends SimpleCollector.Builder<Builder, KafkaMetricProxy> {
        @Override
        public KafkaMetricProxy create() {
            return new KafkaMetricProxy(this);
        }
    }
    KafkaMetricProxy(Builder b) {
        super(b);
    }
    ...
    @Override
    public List<MetricFamilySamples> collect() {
        // Read the current value of every proxied KafkaMetric at scrape time
        List<MetricFamilySamples.Sample> samples = new ArrayList<>();
        for (Map.Entry<List<String>, Child> entry : children.entrySet()) {
            List<String> labels = entry.getKey();
            Child child = entry.getValue();
            samples.add(new Sample(fullname, labelNames, labels, child.getValue()));
        }
        return Collections.singletonList(new MetricFamilySamples(fullname, Type.GAUGE, help, samples));
    }
}
Monitoring YARN jobs - json_exporter
● https://github.com/kawamuray/prometheus-json-exporter
○ Exports values from JSON; each value is specified as a JSONPath
● http://<rm http address:port>/ws/v1/cluster/apps
○ https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API
○ https://gist.github.com/kawamuray/c07b03de82bf6ddbdae6508e27d3fb4d
json_exporter
- name: yarn_application
  type: object
  path: $.apps.app[*]?(@.state == "RUNNING")
  labels:
    application: $.id
    phase: beta
  values:
    alive: 1
    elapsed_time: $.elapsedTime
    allocated_mb: $.allocatedMB
    ...
+
{"apps":{"app":[
  {
    "id": "application_1326815542473_0001",
    "state": "RUNNING",
    "elapsedTime": 25196,
    "allocatedMB": 1024,
    ...
  },
  ...
]}}
=>
yarn_application_alive{application="application_1326815542473_0001",phase="beta"} 1
yarn_application_elapsed_time{application="application_1326815542473_0001",phase="beta"} 25196
yarn_application_allocated_mb{application="application_1326815542473_0001",phase="beta"} 1024
Important configurations
● -storage.local.retention (default: 15 days)
○ TTL for collected values
● -storage.local.memory-chunks (default: 1M)
○ Practically controls the memory allocation of a Prometheus instance
○ A lower value can cause ingestion throttling (metric loss)
● -storage.local.max-chunks-to-persist (default: 512K)
○ A lower value can likewise cause ingestion throttling
○ https://prometheus.io/docs/operating/storage/#persistence-pressure-and-rushed-mode
○ > Equally important, especially if writing to a spinning disk, is raising the value for the storage.local.max-chunks-to-persist flag. As a rule of thumb, keep it around 50% of the storage.local.memory-chunks value.
● -query.staleness-delta (default: 5 mins)
○ Resolution for detecting lost metrics
○ Can lead to weird behavior in the Prometheus web UI
Query tips - label_replace function
● It's quite common for two metrics to have different label sets
○ E.g., server-side metrics and client-side metrics
● Say we have a metric like:
○ kafka_log_logendoffset{cluster="cluster-A",instance="HOST:PORT",job="kafka",partition="1234",topic="topic-A"}
● Introduce a new label derived from an existing label
○ label_replace(..., "host", "$1", "instance", "^([^:]+):.*")
○ => kafka_log_logendoffset{...,instance="HOST:PORT",host="HOST"}
● Rewrite an existing label with a new value
○ label_replace(..., "instance", "$1", "instance", "^([^:]+):.*")
○ => kafka_log_logendoffset{...,instance="HOST"}
● Even possible to rewrite metric name… :D
○ label_replace(kafka_log_logendoffset, "__name__", "foobar", "__name__", ".*")
○ => foobar{...}
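What label_replace does to a single series' labels can be sketched with java.util.regex. This is an illustration under assumptions, not Prometheus code: Prometheus uses RE2 syntax and has extra rules (for example around empty results) that are not modeled here, and `LabelReplaceSketch` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LabelReplaceSketch {
    /**
     * If the whole value of label src matches regex, set label dst to the
     * replacement with $1, $2, ... expanded; otherwise return the labels unchanged.
     */
    static Map<String, String> labelReplace(
            Map<String, String> labels, String dst, String replacement, String src, String regex) {
        Map<String, String> out = new LinkedHashMap<>(labels);
        Matcher m = Pattern.compile(regex).matcher(labels.getOrDefault(src, ""));
        if (m.matches()) { // full-string match, like an anchored regex
            StringBuffer expanded = new StringBuffer();
            m.appendReplacement(expanded, replacement); // expands group references
            out.put(dst, expanded.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("instance", "HOST:9092"); // port number is a made-up example
        labels.put("topic", "topic-A");
        System.out.println(labelReplace(labels, "host", "$1", "instance", "^([^:]+):.*"));
        System.out.println(labelReplace(labels, "instance", "$1", "instance", "^([^:]+):.*"));
    }
}
```

The first call adds a `host` label next to the untouched `instance`; the second overwrites `instance` in place, matching the two bullet examples.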
Points to improve
● Service discovery
○ It's too cumbersome to configure the server list and exporter list statically
○ Pushgateway?
■ > The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus - https://github.com/prometheus/pushgateway/blob/master/README.md#prometheus-pushgateway-
○ file_sd_config? https://prometheus.io/docs/operating/configuration/#<file_sd_config>
■ > It reads a set of files containing a list of zero or more <target_group>s. Changes to all defined files are detected via disk watches and applied immediately.
● Local time support :(
○ They don't like time zones other than UTC, which makes sense though: https://prometheus.io/docs/introduction/faq/#can-i-change-the-timezone?-why-is-everything-in-utc?
○ https://github.com/prometheus/prometheus/issues/500#issuecomment-167560093
○ It still might be possible to introduce a toggle on the view side
Conclusion
● Data model is very intuitive
● PromQL is very powerful and relatively easy
○ Helps you pick out the important metrics from hundreds of metrics
● A few pitfalls need to be avoided by tuning configurations
○ memory-chunks, query.staleness-delta…
● Building an exporter is reasonably easy
○ Lots of languages are officially supported…
○ /metrics is the only interface
Questions?
End of Presentation
Metrics naming
● APPLICATIONPREFIX_METRICNAME
○ https://prometheus.io/docs/practices/naming/#metric-names
○ kafka_producer_request_rate
○ http_request_duration
● Fully utilize labels
○ Bad: kafka_network_request_duration_milliseconds_{max,min,mean}
○ Good: kafka_network_request_duration_milliseconds{aggregation="max|min|mean"}
○ Compare min/max/mean in a single graph: kafka_network_request_duration_milliseconds{instance="HOSTA"}
○ Much more flexible than using static names
Alerting
● Not using Alertmanager
● Our in-house monitoring tool has alerting capability
○ Has a user directory of alerting targets
○ Has a familiar expression format for configuring alerts
○ Tool unification is important and should be respected as far as possible
● Then?
○ Built a tool to mirror metrics from Prometheus to the in-house monitoring tool
○ Set up alerts on the in-house monitoring tool
/api/v1/query?query=sum(kafka_stream_process_calls_rate{client_id=~"CLIENT_ID.*"}) by (instance)
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "instance": "HOST_A:PORT"
        },
        "value": [
          1465819064.067,
          "82317.10280584119"
        ]
      },
      {
        "metric": {
          "instance": "HOST_B:PORT"
        },
        "value": [
          1465819064.067,
          "81379.73499610288"
        ]
      }
    ]
  }
}
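A mirroring tool like the one described has to URL-encode the PromQL expression before calling the instant-query endpoint. The sketch below shows only that step, using the JDK; the base URL is a made-up placeholder and `InstantQueryUrlSketch` is a hypothetical name, not part of the tool from the talk.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class InstantQueryUrlSketch {
    /** Builds an /api/v1/query URL with the PromQL expression form-encoded. */
    static String instantQueryUrl(String baseUrl, String promql) {
        try {
            // URLEncoder escapes {, }, ", =, ~ and turns spaces into '+'
            return baseUrl + "/api/v1/query?query=" + URLEncoder.encode(promql, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        String promql =
            "sum(kafka_stream_process_calls_rate{client_id=~\"CLIENT_ID.*\"}) by (instance)";
        // Placeholder host; substitute the address of a real Prometheus server.
        System.out.println(instantQueryUrl("http://prometheus.example:9090", promql));
    }
}
```

Each element of `data.result` in the response then carries the label set under `metric` and a `[timestamp, "value"]` pair, with the value encoded as a string.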
public class KafkaMetricProxy extends SimpleCollector<KafkaMetricProxy.Child> {
    ...
    public static class Child {
        private KafkaMetric kafkaMetric;
        public void setKafkaMetric(KafkaMetric kafkaMetric) {
            this.kafkaMetric = kafkaMetric;
        }
        double getValue() {
            // Report 0 until a KafkaMetric has been attached
            return kafkaMetric == null ? 0 : kafkaMetric.value();
        }
    }
    @Override
    protected Child newChild() {
        return new Child();
    }
    ...
}
Monitoring Kafka consumer offset - kafka_consumer_group_exporter
● https://github.com/kawamuray/prometheus-kafka-consumer-group-exporter
● Exports some metrics about Kafka consumer groups by executing the kafka-consumer-groups.sh command (bundled with Kafka)
● A specific exporter for a specific use
● It's better to be familiar with your favorite exporter framework
○ Raw use of the official prometheus package: https://github.com/prometheus/client_golang/tree/master/prometheus
○ Mine: https://github.com/kawamuray/prometheus-exporter-harness
Query tips - Product set
● An expression over two or more metrics is evaluated per matching label set (like a product set)
● metric_A{cluster="A or B"} (cluster is either A or B)
● metric_B{cluster="A or B",instance="a or b or c"}
● metric_A / metric_B
● => {} (no result: the label sets do not match)
● metric_A / sum(metric_B) by (cluster)
● => {cluster="A or B"}
● Unnecessary: metric_A{cluster="A"} - sum(metric_B{cluster="A"}) by (cluster)
● Enough: metric_A{cluster="A"} - sum(metric_B) by (cluster) => Same result!
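The matching behavior above can be sketched with plain maps: aggregate metric_B by cluster, then combine only the entries whose label value exists on both sides. This is a simplified model of one-to-one matching on a single label, not Prometheus's evaluator; all names and numbers are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class VectorMatchSketch {
    /** One metric_B sample with labels {cluster, instance}. */
    static final class Sample {
        final String cluster; final String instance; final double value;
        Sample(String cluster, String instance, double value) {
            this.cluster = cluster; this.instance = instance; this.value = value;
        }
    }

    /** sum(metric_B) by (cluster) */
    static Map<String, Double> sumByCluster(List<Sample> samples) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Sample s : samples) out.merge(s.cluster, s.value, Double::sum);
        return out;
    }

    /** left - right, emitted only for clusters present on both sides. */
    static Map<String, Double> subtract(Map<String, Double> left, Map<String, Double> right) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : left.entrySet()) {
            Double r = right.get(e.getKey());
            if (r != null) out.put(e.getKey(), e.getValue() - r);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> metricA = new LinkedHashMap<>();
        metricA.put("A", 10.0); // metric_A{cluster="A"} only: the selector already filtered it

        List<Sample> metricB = new ArrayList<>();
        metricB.add(new Sample("A", "a", 1.0));
        metricB.add(new Sample("A", "b", 2.0));
        metricB.add(new Sample("B", "a", 3.0));

        // The right side is not filtered, yet only cluster A survives the match.
        System.out.println(subtract(metricA, sumByCluster(metricB)));
    }
}
```

Because unmatched entries are dropped, filtering the right-hand side by cluster as well changes nothing in the result.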

More Related Content

What's hot

Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaArvind Kumar G.S
 
Monitoring microservices with Prometheus
Monitoring microservices with PrometheusMonitoring microservices with Prometheus
Monitoring microservices with PrometheusTobias Schmidt
 
My first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdfMy first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdfAlkin Tezuysal
 
VictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - PreviewVictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - PreviewVictoriaMetrics
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink Forward
 
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...ScaleGrid.io
 
Microservices Network Architecture 101
Microservices Network Architecture 101Microservices Network Architecture 101
Microservices Network Architecture 101Cumulus Networks
 
[KubeCon EU 2022] Running containerd and k3s on macOS
[KubeCon EU 2022] Running containerd and k3s on macOS[KubeCon EU 2022] Running containerd and k3s on macOS
[KubeCon EU 2022] Running containerd and k3s on macOSAkihiro Suda
 
DevOps best practices with OpenShift
DevOps best practices with OpenShiftDevOps best practices with OpenShift
DevOps best practices with OpenShiftMichael Lehmann
 
Prometheus design and philosophy
Prometheus design and philosophy   Prometheus design and philosophy
Prometheus design and philosophy Docker, Inc.
 
High Availability PostgreSQL with Zalando Patroni
High Availability PostgreSQL with Zalando PatroniHigh Availability PostgreSQL with Zalando Patroni
High Availability PostgreSQL with Zalando PatroniZalando Technology
 
Monitoring MySQL Replication lag with Prometheus & pt-heartbeat
Monitoring MySQL Replication lag with Prometheus & pt-heartbeatMonitoring MySQL Replication lag with Prometheus & pt-heartbeat
Monitoring MySQL Replication lag with Prometheus & pt-heartbeatJulien Pivotto
 
OpenTelemetry 101 FTW
OpenTelemetry 101 FTWOpenTelemetry 101 FTW
OpenTelemetry 101 FTWNGINX, Inc.
 
Room 3 - 7 - Nguyễn Như Phúc Huy - Vitastor: a fast and simple Ceph-like bloc...
Room 3 - 7 - Nguyễn Như Phúc Huy - Vitastor: a fast and simple Ceph-like bloc...Room 3 - 7 - Nguyễn Như Phúc Huy - Vitastor: a fast and simple Ceph-like bloc...
Room 3 - 7 - Nguyễn Như Phúc Huy - Vitastor: a fast and simple Ceph-like bloc...Vietnam Open Infrastructure User Group
 
Tuning Autovacuum in Postgresql
Tuning Autovacuum in PostgresqlTuning Autovacuum in Postgresql
Tuning Autovacuum in PostgresqlMydbops
 
Kubeflow Pipelines (with Tekton)
Kubeflow Pipelines (with Tekton)Kubeflow Pipelines (with Tekton)
Kubeflow Pipelines (with Tekton)Animesh Singh
 
FreeSWITCH Cluster by K8s
FreeSWITCH Cluster by K8sFreeSWITCH Cluster by K8s
FreeSWITCH Cluster by K8sChien Cheng Wu
 
PGEncryption_Tutorial
PGEncryption_TutorialPGEncryption_Tutorial
PGEncryption_TutorialVibhor Kumar
 
Using CloudStack With Clustered LVM
Using CloudStack With Clustered LVMUsing CloudStack With Clustered LVM
Using CloudStack With Clustered LVMMarcus L Sorensen
 
DCUS17 : Docker networking deep dive
DCUS17 : Docker networking deep diveDCUS17 : Docker networking deep dive
DCUS17 : Docker networking deep diveMadhu Venugopal
 

What's hot (20)

Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
 
Monitoring microservices with Prometheus
Monitoring microservices with PrometheusMonitoring microservices with Prometheus
Monitoring microservices with Prometheus
 
My first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdfMy first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdf
 
VictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - PreviewVictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - Preview
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
 
Microservices Network Architecture 101
Microservices Network Architecture 101Microservices Network Architecture 101
Microservices Network Architecture 101
 
[KubeCon EU 2022] Running containerd and k3s on macOS
[KubeCon EU 2022] Running containerd and k3s on macOS[KubeCon EU 2022] Running containerd and k3s on macOS
[KubeCon EU 2022] Running containerd and k3s on macOS
 
DevOps best practices with OpenShift
DevOps best practices with OpenShiftDevOps best practices with OpenShift
DevOps best practices with OpenShift
 
Prometheus design and philosophy
Prometheus design and philosophy   Prometheus design and philosophy
Prometheus design and philosophy
 
High Availability PostgreSQL with Zalando Patroni
High Availability PostgreSQL with Zalando PatroniHigh Availability PostgreSQL with Zalando Patroni
High Availability PostgreSQL with Zalando Patroni
 
Monitoring MySQL Replication lag with Prometheus & pt-heartbeat
Monitoring MySQL Replication lag with Prometheus & pt-heartbeatMonitoring MySQL Replication lag with Prometheus & pt-heartbeat
Monitoring MySQL Replication lag with Prometheus & pt-heartbeat
 
OpenTelemetry 101 FTW
OpenTelemetry 101 FTWOpenTelemetry 101 FTW
OpenTelemetry 101 FTW
 
Room 3 - 7 - Nguyễn Như Phúc Huy - Vitastor: a fast and simple Ceph-like bloc...
Room 3 - 7 - Nguyễn Như Phúc Huy - Vitastor: a fast and simple Ceph-like bloc...Room 3 - 7 - Nguyễn Như Phúc Huy - Vitastor: a fast and simple Ceph-like bloc...
Room 3 - 7 - Nguyễn Như Phúc Huy - Vitastor: a fast and simple Ceph-like bloc...
 
Tuning Autovacuum in Postgresql
Tuning Autovacuum in PostgresqlTuning Autovacuum in Postgresql
Tuning Autovacuum in Postgresql
 
Kubeflow Pipelines (with Tekton)
Kubeflow Pipelines (with Tekton)Kubeflow Pipelines (with Tekton)
Kubeflow Pipelines (with Tekton)
 
FreeSWITCH Cluster by K8s
FreeSWITCH Cluster by K8sFreeSWITCH Cluster by K8s
FreeSWITCH Cluster by K8s
 
PGEncryption_Tutorial
PGEncryption_TutorialPGEncryption_Tutorial
PGEncryption_Tutorial
 
Using CloudStack With Clustered LVM
Using CloudStack With Clustered LVMUsing CloudStack With Clustered LVM
Using CloudStack With Clustered LVM
 
DCUS17 : Docker networking deep dive
DCUS17 : Docker networking deep diveDCUS17 : Docker networking deep dive
DCUS17 : Docker networking deep dive
 

Viewers also liked

Prometheus casual talk1
Prometheus casual talk1Prometheus casual talk1
Prometheus casual talk1wyukawa
 
Prometheus Overview
Prometheus OverviewPrometheus Overview
Prometheus OverviewBrian Brazil
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesDataWorks Summit
 
Application security as crucial to the modern distributed trust model
Application security as crucial to   the modern distributed trust modelApplication security as crucial to   the modern distributed trust model
Application security as crucial to the modern distributed trust modelLINE Corporation
 
Drawing the Line Correctly: Enough Security, Everywhere
Drawing the Line Correctly:   Enough Security, EverywhereDrawing the Line Correctly:   Enough Security, Everywhere
Drawing the Line Correctly: Enough Security, EverywhereLINE Corporation
 
ゲーム開発を加速させる クライアントセキュリティ
ゲーム開発を加速させる クライアントセキュリティゲーム開発を加速させる クライアントセキュリティ
ゲーム開発を加速させる クライアントセキュリティLINE Corporation
 
Implementing Trusted Endpoints in the Mobile World
Implementing Trusted Endpoints in the Mobile WorldImplementing Trusted Endpoints in the Mobile World
Implementing Trusted Endpoints in the Mobile WorldLINE Corporation
 
FIDO認証で「あんしんをもっと便利に」
FIDO認証で「あんしんをもっと便利に」FIDO認証で「あんしんをもっと便利に」
FIDO認証で「あんしんをもっと便利に」LINE Corporation
 
“Your Security, More Simple.” by utilizing FIDO Authentication
“Your Security, More Simple.” by utilizing FIDO Authentication“Your Security, More Simple.” by utilizing FIDO Authentication
“Your Security, More Simple.” by utilizing FIDO AuthenticationLINE Corporation
 

Viewers also liked (11)

Prometheus casual talk1
Prometheus casual talk1Prometheus casual talk1
Prometheus casual talk1
 
Prometheus on AWS
Prometheus on AWSPrometheus on AWS
Prometheus on AWS
 
Prometheus Overview
Prometheus OverviewPrometheus Overview
Prometheus Overview
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
 
Application security as crucial to the modern distributed trust model
Application security as crucial to   the modern distributed trust modelApplication security as crucial to   the modern distributed trust model
Application security as crucial to the modern distributed trust model
 
FRONTIERS IN CRYPTOGRAPHY
FRONTIERS IN CRYPTOGRAPHYFRONTIERS IN CRYPTOGRAPHY
FRONTIERS IN CRYPTOGRAPHY
 
Drawing the Line Correctly: Enough Security, Everywhere
Drawing the Line Correctly:   Enough Security, EverywhereDrawing the Line Correctly:   Enough Security, Everywhere
Drawing the Line Correctly: Enough Security, Everywhere
 
ゲーム開発を加速させる クライアントセキュリティ
ゲーム開発を加速させる クライアントセキュリティゲーム開発を加速させる クライアントセキュリティ
ゲーム開発を加速させる クライアントセキュリティ
 
Implementing Trusted Endpoints in the Mobile World
Implementing Trusted Endpoints in the Mobile WorldImplementing Trusted Endpoints in the Mobile World
Implementing Trusted Endpoints in the Mobile World
 
FIDO認証で「あんしんをもっと便利に」
FIDO認証で「あんしんをもっと便利に」FIDO認証で「あんしんをもっと便利に」
FIDO認証で「あんしんをもっと便利に」
 
“Your Security, More Simple.” by utilizing FIDO Authentication
“Your Security, More Simple.” by utilizing FIDO Authentication“Your Security, More Simple.” by utilizing FIDO Authentication
“Your Security, More Simple.” by utilizing FIDO Authentication
 

Similar to Monitoring Kafka w/ Prometheus

PostgreSQL Monitoring using modern software stacks
PostgreSQL Monitoring using modern software stacksPostgreSQL Monitoring using modern software stacks
PostgreSQL Monitoring using modern software stacksShowmax Engineering
 
如何透過 Go-kit 快速搭建微服務架構應用程式實戰
如何透過 Go-kit 快速搭建微服務架構應用程式實戰如何透過 Go-kit 快速搭建微服務架構應用程式實戰
如何透過 Go-kit 快速搭建微服務架構應用程式實戰KAI CHU CHUNG
 
Integrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowIntegrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowTatiana Al-Chueyr
 
Introducing Playwright's New Test Runner
Introducing Playwright's New Test RunnerIntroducing Playwright's New Test Runner
Introducing Playwright's New Test RunnerApplitools
 
Php 5.6 From the Inside Out
Php 5.6 From the Inside OutPhp 5.6 From the Inside Out
Php 5.6 From the Inside OutFerenc Kovács
 
Toolbox of a Ruby Team
Toolbox of a Ruby TeamToolbox of a Ruby Team
Toolbox of a Ruby TeamArto Artnik
 
React starter-kitでとっとと始めるisomorphic開発
React starter-kitでとっとと始めるisomorphic開発React starter-kitでとっとと始めるisomorphic開発
React starter-kitでとっとと始めるisomorphic開発Yoichi Toyota
 
BaseX user-group-talk XML Prague 2013
BaseX user-group-talk XML Prague 2013BaseX user-group-talk XML Prague 2013
BaseX user-group-talk XML Prague 2013Andy Bunce
 
Deploying Prometheus stacks with Juju
Deploying Prometheus stacks with JujuDeploying Prometheus stacks with Juju
Deploying Prometheus stacks with JujuJ.J. Ciarlante
 
GopherCon IL 2020 - Web Application Profiling 101
GopherCon IL 2020 - Web Application Profiling 101GopherCon IL 2020 - Web Application Profiling 101
GopherCon IL 2020 - Web Application Profiling 101yinonavraham
 
Full Stack Monitoring with Prometheus and Grafana
Full Stack Monitoring with Prometheus and GrafanaFull Stack Monitoring with Prometheus and Grafana
Full Stack Monitoring with Prometheus and GrafanaJazz Yao-Tsung Wang
 
202107 - Orion introduction - COSCUP
202107 - Orion introduction - COSCUP202107 - Orion introduction - COSCUP
202107 - Orion introduction - COSCUPRonald Hsu
 
Infrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using PrometheusInfrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using PrometheusMarco Pas
 
PaaSTA: Autoscaling at Yelp
PaaSTA: Autoscaling at YelpPaaSTA: Autoscaling at Yelp
PaaSTA: Autoscaling at YelpNathan Handler
 
openATTIC using grafana and prometheus
openATTIC using  grafana and prometheusopenATTIC using  grafana and prometheus
openATTIC using grafana and prometheusAlex Lau
 
Capistrano deploy Magento project in an efficient way
Capistrano deploy Magento project in an efficient wayCapistrano deploy Magento project in an efficient way
Capistrano deploy Magento project in an efficient waySylvain Rayé
 
.NET @ apache.org
 .NET @ apache.org .NET @ apache.org
.NET @ apache.orgTed Husted
 
Load testing in Zonky with Gatling
Load testing in Zonky with GatlingLoad testing in Zonky with Gatling
Load testing in Zonky with GatlingPetr Vlček
 
Testing Django APIs
Testing Django APIsTesting Django APIs
Testing Django APIstyomo4ka
 

Similar to Monitoring Kafka w/ Prometheus (20)

Sprint 17
Sprint 17Sprint 17
Sprint 17
 
PostgreSQL Monitoring using modern software stacks
PostgreSQL Monitoring using modern software stacksPostgreSQL Monitoring using modern software stacks
PostgreSQL Monitoring using modern software stacks
 
如何透過 Go-kit 快速搭建微服務架構應用程式實戰
如何透過 Go-kit 快速搭建微服務架構應用程式實戰如何透過 Go-kit 快速搭建微服務架構應用程式實戰
如何透過 Go-kit 快速搭建微服務架構應用程式實戰
 
Integrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowIntegrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache Airflow
 
Introducing Playwright's New Test Runner
Introducing Playwright's New Test RunnerIntroducing Playwright's New Test Runner
Introducing Playwright's New Test Runner
 
Php 5.6 From the Inside Out
Php 5.6 From the Inside OutPhp 5.6 From the Inside Out
Php 5.6 From the Inside Out
 
Toolbox of a Ruby Team
Toolbox of a Ruby TeamToolbox of a Ruby Team
Toolbox of a Ruby Team
 
React starter-kitでとっとと始めるisomorphic開発
React starter-kitでとっとと始めるisomorphic開発React starter-kitでとっとと始めるisomorphic開発
React starter-kitでとっとと始めるisomorphic開発
 
BaseX user-group-talk XML Prague 2013
BaseX user-group-talk XML Prague 2013BaseX user-group-talk XML Prague 2013
BaseX user-group-talk XML Prague 2013
 
Deploying Prometheus stacks with Juju
Deploying Prometheus stacks with JujuDeploying Prometheus stacks with Juju
Deploying Prometheus stacks with Juju
 
GopherCon IL 2020 - Web Application Profiling 101
GopherCon IL 2020 - Web Application Profiling 101GopherCon IL 2020 - Web Application Profiling 101
GopherCon IL 2020 - Web Application Profiling 101
 
Full Stack Monitoring with Prometheus and Grafana
Full Stack Monitoring with Prometheus and GrafanaFull Stack Monitoring with Prometheus and Grafana
Full Stack Monitoring with Prometheus and Grafana
 
202107 - Orion introduction - COSCUP
202107 - Orion introduction - COSCUP202107 - Orion introduction - COSCUP
202107 - Orion introduction - COSCUP
 
Infrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using PrometheusInfrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using Prometheus
 
PaaSTA: Autoscaling at Yelp
PaaSTA: Autoscaling at YelpPaaSTA: Autoscaling at Yelp
PaaSTA: Autoscaling at Yelp
 
openATTIC using grafana and prometheus
openATTIC using  grafana and prometheusopenATTIC using  grafana and prometheus
openATTIC using grafana and prometheus
 
Capistrano deploy Magento project in an efficient way
Capistrano deploy Magento project in an efficient wayCapistrano deploy Magento project in an efficient way
Capistrano deploy Magento project in an efficient way
 
.NET @ apache.org
 .NET @ apache.org .NET @ apache.org
.NET @ apache.org
 
Load testing in Zonky with Gatling
Load testing in Zonky with GatlingLoad testing in Zonky with Gatling
Load testing in Zonky with Gatling
 
Testing Django APIs
Testing Django APIsTesting Django APIs
Testing Django APIs
 

More from kawamuray

Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINEKafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINEkawamuray
 
Multitenancy: Kafka clusters for everyone at LINE
Multitenancy: Kafka clusters for everyone at LINEMultitenancy: Kafka clusters for everyone at LINE
Multitenancy: Kafka clusters for everyone at LINEkawamuray
 
LINE's messaging service architecture underlying more than 200 million monthl...
LINE's messaging service architecture underlying more than 200 million monthl...LINE's messaging service architecture underlying more than 200 million monthl...
LINE's messaging service architecture underlying more than 200 million monthl...kawamuray
 
Kafka meetup JP #3 - Engineering Apache Kafka at LINE
Kafka meetup JP #3 - Engineering Apache Kafka at LINEKafka meetup JP #3 - Engineering Apache Kafka at LINE
Kafka meetup JP #3 - Engineering Apache Kafka at LINEkawamuray
 
Docker + Checkpoint/Restore
Docker + Checkpoint/RestoreDocker + Checkpoint/Restore
Docker + Checkpoint/Restorekawamuray
 
LXC on Ganeti
LXC on GanetiLXC on Ganeti
LXC on Ganetikawamuray
 
Norikraでアプリログを集計してリアルタイムエラー通知 # Norikra meetup
Norikraでアプリログを集計してリアルタイムエラー通知 # Norikra meetupNorikraでアプリログを集計してリアルタイムエラー通知 # Norikra meetup
Norikraでアプリログを集計してリアルタイムエラー通知 # Norikra meetupkawamuray
 

More from kawamuray (7)

Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINEKafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
 
Multitenancy: Kafka clusters for everyone at LINE
Multitenancy: Kafka clusters for everyone at LINEMultitenancy: Kafka clusters for everyone at LINE
Multitenancy: Kafka clusters for everyone at LINE
 
LINE's messaging service architecture underlying more than 200 million monthl...
LINE's messaging service architecture underlying more than 200 million monthl...LINE's messaging service architecture underlying more than 200 million monthl...
LINE's messaging service architecture underlying more than 200 million monthl...
 
Kafka meetup JP #3 - Engineering Apache Kafka at LINE
Kafka meetup JP #3 - Engineering Apache Kafka at LINEKafka meetup JP #3 - Engineering Apache Kafka at LINE
Kafka meetup JP #3 - Engineering Apache Kafka at LINE
 
Docker + Checkpoint/Restore
Docker + Checkpoint/RestoreDocker + Checkpoint/Restore
Docker + Checkpoint/Restore
 
LXC on Ganeti
LXC on GanetiLXC on Ganeti
LXC on Ganeti
 
Norikraでアプリログを集計してリアルタイムエラー通知 # Norikra meetup
Norikraでアプリログを集計してリアルタイムエラー通知 # Norikra meetupNorikraでアプリログを集計してリアルタイムエラー通知 # Norikra meetup
Norikraでアプリログを集計してリアルタイムエラー通知 # Norikra meetup
 

Monitoring Kafka w/ Prometheus

  • 2. About me
    ● Software Engineer @ LINE Corp
      ○ Develop & operate Apache HBase clusters
      ○ Design and implement data flow between services with ♥ for Apache Kafka
    ● Recent works
      ○ Playing with Apache Kafka and Kafka Streams
        ■ https://issues.apache.org/jira/browse/KAFKA-3471?jql=project%20%3D%20KAFKA%20AND%20assignee%20in%20(kawamuray)%20OR%20reporter%20in%20(kawamuray)
    ● Past works
      ○ Docker + Checkpoint Restore @ CoreOS meetup http://www.slideshare.net/kawamuray/coreos-meetup
      ○ Norikraでアプリログを集計してリアルタイムエラー通知 ("Aggregating application logs with Norikra for real-time error notification") @ Norikra Meetup #1 http://www.slideshare.net/kawamuray/norikra-meetup
      ○ Student @ Google Summer of Code 2013, 2014
    ● https://github.com/kawamuray
  • 3. How are we (our team) using Prometheus?
    ● To monitor most of our middleware and the clients in our Java applications
      ○ Kafka clusters
      ○ HBase clusters
      ○ Kafka clients - producer and consumer
      ○ Stream processing jobs
  • 5. Why Prometheus?
    ● Our in-house monitoring tool wasn't enough for large-scale, high-resolution metrics collection
    ● Good data model
      ○ Genuine metric identifier + attributes as labels
        ■ http_requests_total{code="200",handler="prometheus",instance="localhost:9090",job="prometheus",method="get"}
    ● Scalable by nature
    ● Simple philosophy
      ○ Metrics exposure interface: GET /metrics => Text Protocol
      ○ Monolithic server
    ● Flexible but easy PromQL
      ○ Derive aggregated metrics by composing existing metrics
      ○ E.g., sum of RX bps across the entire cluster
        ■ sum(rate(node_network_receive_bytes{cluster="cluster-A",device="eth0"}[30s]) * 8)
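To make the "GET /metrics => Text Protocol" point concrete, here is a minimal sketch of an endpoint that serves one sample in the text exposition format, using only the JDK's built-in HTTP server. The class name, the port 8080, and the sample value are assumptions for illustration; a real application would normally use the official client library rather than hand-rendering lines like this.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: hand-rendered text format, not the official client library.
final class MetricsEndpointSketch {

    // Renders one sample line: name{label="value",...} value
    static String renderSample(String name, Map<String, String> labels, double value) {
        String labelPart = labels.entrySet().stream()
                .map(e -> e.getKey() + "=\"" + escape(e.getValue()) + "\"")
                .collect(Collectors.joining(","));
        return name + (labelPart.isEmpty() ? "" : "{" + labelPart + "}") + " " + value + "\n";
    }

    // Label values escape backslash, double quote, and newline.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("code", "200");
        labels.put("method", "get");
        String body = "# TYPE http_requests_total counter\n"
                + renderSample("http_requests_total", labels, 1027);

        // Port 8080 is an arbitrary choice for this sketch.
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/metrics", exchange -> {
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/plain; version=0.0.4");
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
        server.start();
    }
}
```

Pointing a Prometheus scrape target at this host and port would be enough for the sample to be collected, which is what makes the pull model simple to adopt.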
  • 6. Deployment
    ● Launch
      ○ Official Docker image: https://hub.docker.com/r/prom/prometheus/
      ○ Ansible for dynamic prometheus.yml generation based on inventory, and for container management
    ● Machine spec
      ○ 2.40GHz * 24 CPUs
      ○ 192GB RAM
      ○ 6 SSDs
      ○ One SSD per Prometheus instance
      ○ Overkill? => Obviously. We reused existing unused servers. You certainly don't need such a crazy spec just to use it.
  • 7. Kafka monitoring w/ Prometheus overview
    [Diagram] Components shown: Kafka broker, Kafka client in Java application, YARN ResourceManager, stream processing jobs on YARN, Prometheus server, Pushgateway, JMX exporter, Prometheus Java library + servlet, JSON exporter, Kafka consumer group exporter
  • 8. Monitoring Kafka brokers - jmx_exporter
    ● https://github.com/prometheus/jmx_exporter
    ● Run as a standalone process (no -javaagent)
      ○ Just to avoid a cumbersome rolling restart
      ○ May switch to the javaagent at the next rolling restart :p
    ● With a very complicated config.yml
      ○ https://gist.github.com/kawamuray/25136a9ab22b1cb992e435e0ea67eb06
    ● Colocate one instance per broker on the same host
  • 9. Monitoring Kafka producer on Java application - prometheus_simpleclient
    ● https://github.com/prometheus/client_java
    ● Official Java client library
  • 10. prometheus_simpleclient - Basic usage
    private static final Counter queueOutCounter = Counter.build()
        .namespace("kafka_streams")        // Namespace (= application prefix?)
        .name("process_count")             // Metric name
        .help("Process calls count")       // Metric description
        .labelNames("processor", "topic")  // Declare labels
        .register();                       // Register to CollectorRegistry.defaultRegistry (default, global registry)
    ...
    queueOutCounter.labels("Processor-A", "topic-T").inc();     // Increment counter with labels
    queueOutCounter.labels("Processor-B", "topic-P").inc(2.0);
    =>
    kafka_streams_process_count{processor="Processor-A",topic="topic-T"} 1.0
    kafka_streams_process_count{processor="Processor-B",topic="topic-P"} 2.0
  • 11. Exposing Java application metrics
    ● Through servlet
      ○ io.prometheus.client.exporter.MetricsServlet from simpleclient_servlet
    ● Add an entry to web.xml or embedded Jetty ..
    Server server = new Server(METRICS_PORT);
    ServletContextHandler context = new ServletContextHandler();
    context.setContextPath("/");
    server.setHandler(context);
    context.addServlet(new ServletHolder(new MetricsServlet()), "/metrics");
    server.start();
  • 12. Monitoring Kafka producer on Java application - prometheus_simpleclient
    ● Primitive types:
      ○ Counter, Gauge, Histogram, Summary
    ● Kafka's MetricsReporter interface gives KafkaMetric instances
    ● How to expose the value?
    ● => Implement a proxy metric type which extends SimpleCollector
    public class PrometheusMetricsReporter implements MetricsReporter {
        ...
        private void registerMetric(KafkaMetric kafkaMetric) {
            ...
            KafkaMetricProxy.build()
                .namespace("kafka")
                .name(fqn)
                .help("Help: " + metricName.description())
                .labelNames(labelNames)
                .register();
            ...
        }
        ...
    }
  • 13.
    public class KafkaMetricProxy extends SimpleCollector<KafkaMetricProxy.Child> {
        public static class Builder extends SimpleCollector.Builder<Builder, KafkaMetricProxy> {
            @Override
            public KafkaMetricProxy create() {
                return new KafkaMetricProxy(this);
            }
        }

        KafkaMetricProxy(Builder b) {
            super(b);
        }
        ...
        @Override
        public List<MetricFamilySamples> collect() {
            List<MetricFamilySamples.Sample> samples = new ArrayList<>();
            for (Map.Entry<List<String>, Child> entry : children.entrySet()) {
                List<String> labels = entry.getKey();
                Child child = entry.getValue();
                samples.add(new Sample(fullname, labelNames, labels, child.getValue()));
            }
            return Collections.singletonList(new MetricFamilySamples(fullname, Type.GAUGE, help, samples));
        }
    }
  • 14. Monitoring YARN jobs - json_exporter
    ● https://github.com/kawamuray/prometheus-json-exporter
      ○ Exports values from JSON, each selected by a JSONPath expression
    ● http://<rm http address:port>/ws/v1/cluster/apps
      ○ https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API
      ○ https://gist.github.com/kawamuray/c07b03de82bf6ddbdae6508e27d3fb4d
  • 15. json_exporter
    Config:
    - name: yarn_application
      type: object
      path: $.apps.app[*]?(@.state == "RUNNING")
      labels:
        application: $.id
        phase: beta
      values:
        alive: 1
        elapsed_time: $.elapsedTime
        allocated_mb: $.allocatedMB
    ...
    +
    ResourceManager response:
    {"apps":{"app":[
      {
        "id": "application_1234_0001",
        "state": "RUNNING",
        "elapsedTime": 25196,
        "allocatedMB": 1024,
        ...
      },
      ...
    ]}}
    =>
    yarn_application_alive{application="application_1234_0001",phase="beta"} 1
    yarn_application_elapsed_time{application="application_1234_0001",phase="beta"} 25196
    yarn_application_allocated_mb{application="application_1234_0001",phase="beta"} 1024
  • 16. Important configurations
    ● -storage.local.retention (default: 15 days)
      ○ TTL for collected values
    ● -storage.local.memory-chunks (default: 1M)
      ○ Practically controls the memory allocation of a Prometheus instance
      ○ A lower value can cause ingestion throttling (metric loss)
    ● -storage.local.max-chunks-to-persist (default: 512K)
      ○ A lower value can likewise cause ingestion throttling
      ○ https://prometheus.io/docs/operating/storage/#persistence-pressure-and-rushed-mode
      ○ > Equally important, especially if writing to a spinning disk, is raising the value for the storage.local.max-chunks-to-persist flag. As a rule of thumb, keep it around 50% of the storage.local.memory-chunks value.
    ● -query.staleness-delta (default: 5 mins)
      ○ Resolution to detect lost metrics
      ○ Can lead to weird behavior in the Prometheus web UI
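The 50% rule of thumb quoted above is simple arithmetic; the sketch below derives a max-chunks-to-persist value from a chosen memory-chunks value and prints both flags. The class name and the 4M example value are assumptions for illustration, not a sizing recommendation.

```java
// Hypothetical helper applying the "around 50% of memory-chunks" rule of thumb.
final class ChunkFlagsSketch {

    // Half of the in-memory chunk budget, per the rule of thumb.
    static long maxChunksToPersist(long memoryChunks) {
        return memoryChunks / 2;
    }

    public static void main(String[] args) {
        long memoryChunks = 4L * 1024 * 1024; // assumed value for a larger host
        System.out.println("-storage.local.memory-chunks=" + memoryChunks);
        System.out.println("-storage.local.max-chunks-to-persist=" + maxChunksToPersist(memoryChunks));
    }
}
```

Applied to the default memory-chunks value of 1M (1048576), the rule gives 524288, which matches the 512K default listed on the slide.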
  • 17. Query tips - label_replace function
    ● It's quite common that two metrics have different label sets
      ○ E.g., server-side metrics and client-side metrics
    ● Say we have a metric like:
      ○ kafka_log_logendoffset{cluster="cluster-A",instance="HOST:PORT",job="kafka",partition="1234",topic="topic-A"}
    ● Introduce a new label from an existing label
      ○ label_replace(..., "host", "$1", "instance", "^([^:]+):.*")
      ○ => kafka_log_logendoffset{...,instance="HOST:PORT",host="HOST"}
    ● Rewrite an existing label with a new value
      ○ label_replace(..., "instance", "$1", "instance", "^([^:]+):.*")
      ○ => kafka_log_logendoffset{...,instance="HOST"}
    ● It's even possible to rewrite the metric name… :D
      ○ label_replace(kafka_log_logendoffset, "__name__", "foobar", "__name__", ".*")
      ○ => foobar{...}
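The label_replace semantics above can be emulated on a plain label map, which may help when reasoning about what a given regex will do. This is a sketch under assumptions: the class and method names are made up, and it uses java.util.regex, whose syntax differs in places from the regex engine Prometheus uses. It follows the documented behavior that the regex must match the whole source value and that a non-matching series is returned unchanged.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical emulation of label_replace on a single series' label set.
final class LabelReplaceSketch {

    static Map<String, String> labelReplace(Map<String, String> labels, String dstLabel,
                                            String replacement, String srcLabel, String regex) {
        Map<String, String> out = new LinkedHashMap<>(labels);
        // A missing source label is treated as an empty string.
        Matcher matcher = Pattern.compile(regex).matcher(labels.getOrDefault(srcLabel, ""));
        if (!matcher.matches()) {
            return out; // no full match: labels are left untouched
        }
        // Expand $1, $2, ... in the replacement against the full match.
        StringBuffer expanded = new StringBuffer();
        matcher.appendReplacement(expanded, replacement);
        if (expanded.length() == 0) {
            out.remove(dstLabel); // an empty label value is the same as no label
        } else {
            out.put(dstLabel, expanded.toString());
        }
        return out;
    }
}
```

For example, calling labelReplace with the labels {instance="HOST:9092"}, destination "host", replacement "$1", source "instance" and the regex from the slide adds host="HOST" while leaving instance as it was.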
  • 18. Points to improve
    ● Service discovery
      ○ It's too cumbersome to configure the server list and exporter list statically
      ○ Pushgateway?
        ■ > The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus - https://github.com/prometheus/pushgateway/blob/master/README.md#prometheus-pushgateway-
      ○ file_sd_config? https://prometheus.io/docs/operating/configuration/#<file_sd_config>
        ■ > It reads a set of files containing a list of zero or more <target_group>s. Changes to all defined files are detected via disk watches and applied immediately.
    ● Local time support :(
      ○ They don't like time zones other than UTC, which makes sense, though: https://prometheus.io/docs/introduction/faq/#can-i-change-the-timezone?-why-is-everything-in-utc?
      ○ https://github.com/prometheus/prometheus/issues/500#issuecomment-167560093
      ○ It might still be possible to introduce a toggle in the view
  • 19. Conclusion
    ● The data model is very intuitive
    ● PromQL is very powerful and relatively easy
      ○ Helps you pick out the important metrics from hundreds of metrics
    ● A few pitfalls need to be avoided by tuning configurations
      ○ memory-chunks, query.staleness-delta…
    ● Building an exporter is reasonably easy
      ○ Client libraries are officially supported for lots of languages…
      ○ /metrics is the only interface
  • 22. Metrics naming
    ● APPLICATIONPREFIX_METRICNAME
      ○ https://prometheus.io/docs/practices/naming/#metric-names
      ○ kafka_producer_request_rate
      ○ http_request_duration
    ● Fully utilize labels
      ○ x: kafka_network_request_duration_milliseconds_{max,min,mean}
      ○ o: kafka_network_request_duration_milliseconds{aggregation="max|min|mean"}
      ○ Compare min/max/mean in a single graph: kafka_network_request_duration_milliseconds{instance="HOSTA"}
      ○ Much more flexible than using static names
  • 23. Alerting
    ● Not using Alertmanager
    ● Our in-house monitoring tool has alerting capability
      ○ Has a user directory of alerting targets
      ○ Has a familiar expression syntax for configuring alerts
      ○ Tool unification is important and should be respected as much as possible
    ● Then?
      ○ Built a tool to mirror metrics from Prometheus to the in-house monitoring tool
      ○ Set up alerts on the in-house monitoring tool
    /api/v1/query?query=sum(kafka_stream_process_calls_rate{client_id=~"CLIENT_ID.*"}) by (instance)
    {
      "status": "success",
      "data": {
        "resultType": "vector",
        "result": [
          { "metric": { "instance": "HOST_A:PORT" }, "value": [ 1465819064.067, "82317.10280584119" ] },
          { "metric": { "instance": "HOST_B:PORT" }, "value": [ 1465819064.067, "81379.73499610288" ] }
        ]
      }
    }
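A mirroring tool like the one described has to issue instant queries against /api/v1/query and forward the results. The sketch below shows only the query side, using the JDK's HttpClient: it builds a correctly encoded URL and returns the raw JSON body. The class name, method names, and base URL are assumptions; the author's actual tool is not shown in the slides, and parsing the response is left out to keep the example free of third-party libraries.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the query half of a Prometheus-to-other-tool mirror.
final class InstantQuerySketch {

    // Builds the instant-query URI; the PromQL expression must be URL-encoded.
    static URI instantQueryUri(String baseUrl, String promql) {
        String encoded = URLEncoder.encode(promql, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/api/v1/query?query=" + encoded);
    }

    // Performs the GET and returns the JSON response body as a string.
    static String fetch(String baseUrl, String promql) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(instantQueryUri(baseUrl, promql)).GET().build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling fetch("http://localhost:9090", "sum(up) by (instance)") on a schedule, then translating each element of data.result into the other tool's format, would be one way to structure such a mirror.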
  • 24.
    public class KafkaMetricProxy extends SimpleCollector<KafkaMetricProxy.Child> {
        ...
        public static class Child {
            private KafkaMetric kafkaMetric;

            public void setKafkaMetric(KafkaMetric kafkaMetric) {
                this.kafkaMetric = kafkaMetric;
            }

            double getValue() {
                return kafkaMetric == null ? 0 : kafkaMetric.value();
            }
        }

        @Override
        protected Child newChild() {
            return new Child();
        }
        ...
    }
  • 25. Monitoring Kafka consumer offset - kafka_consumer_group_exporter
    ● https://github.com/kawamuray/prometheus-kafka-consumer-group-exporter
    ● Exports some metrics about Kafka consumer groups by executing the kafka-consumer-groups.sh command (bundled with Kafka)
    ● A specific exporter for a specific use
    ● It helps to be familiar with your favorite exporter framework
      ○ Raw use of the official prometheus package: https://github.com/prometheus/client_golang/tree/master/prometheus
      ○ Mine: https://github.com/kawamuray/prometheus-exporter-harness
  • 26. Query tips - Product set
    ● An operation on two or more metrics yields results only for the label sets that match on both sides
    ● metric_A{cluster="A or B"}
    ● metric_B{cluster="A or B",instance="a or b or c"}
    ● metric_A / metric_B
    ● => {} (empty: metric_B carries an instance label that metric_A lacks, so no label sets match)
    ● metric_A / sum(metric_B) by (cluster)
    ● => {cluster="A or B"}
    ● x: metric_A{cluster="A"} - sum(metric_B{cluster="A"}) by (cluster)
    ● o: metric_A{cluster="A"} - sum(metric_B) by (cluster) => Same result!
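The matching behavior on this slide can be modeled with plain maps: an instant vector becomes a map from label set to value, and a binary operator produces output only where both sides hold an identical label set. This is a simplified sketch of default one-to-one matching, not Prometheus's implementation; the class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: an instant vector is a map from label set to sample value.
final class VectorMatchSketch {

    // lhs / rhs, emitting a result only for label sets present on both sides.
    static Map<Map<String, String>, Double> divide(Map<Map<String, String>, Double> lhs,
                                                   Map<Map<String, String>, Double> rhs) {
        Map<Map<String, String>, Double> out = new HashMap<>();
        lhs.forEach((labels, value) -> {
            Double other = rhs.get(labels);
            if (other != null) {
                out.put(labels, value / other);
            }
        });
        return out;
    }

    // sum(vector) by (label): collapses every label set down to the one kept label.
    static Map<Map<String, String>, Double> sumBy(Map<Map<String, String>, Double> vector, String label) {
        Map<Map<String, String>, Double> out = new HashMap<>();
        vector.forEach((labels, value) ->
                out.merge(Map.of(label, labels.getOrDefault(label, "")), value, Double::sum));
        return out;
    }
}
```

With metric_A keyed by cluster only and metric_B keyed by cluster and instance, divide(metricA, metricB) comes back empty, while divide(metricA, sumBy(metricB, "cluster")) returns one value per cluster, mirroring the two queries above.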