Full Stack Monitoring with Prometheus and Grafana (Updated)

Build Full Stack Monitoring and Notification with Prometheus
Jazz Yao-Tsung Wang / 王耀聰
Initiator and first Chair of the Taiwan Data Engineering Association (TDEA)
Senior Development Manager, Change Healthcare | Innova Solutions

Hello!
I am Jazz Yao-Tsung Wang
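Security outsider, DevOps beginner, SRE spectator, heterogeneous-cloud wanderer, and data engineering loudspeaker. For the past twelve years I have worked to promote the practical adoption of the Apache Hadoop ecosystem and related data engineering technologies in Taiwan.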
- 11 years (2003/01 ~ 2014/02) Associate Researcher, NCHC
- 2 years (2014/03 ~ 2016/04) AVP of Product Management, Etu (SYSTEX)
- 2 years (2016/04 ~ 2018/06) Data Architect, TenMax
- 2.5 years (2018/07 ~ Now) Sr Development Manager, Change Healthcare
You can find me at @jazzwang_tw or
https://fb.com/groups/dataengineering.tw
https://slideshare.net/jazzwang

1.
Jobs / Pains / Gains
Why do I need Full Stack Monitoring and Notification?
Let's start with Jazz's Jobs / Pains / Gains

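These days it is not only cars that go hybrid; infrastructure does too ...
On-premises data center: servers, network devices, VMs, containers
AWS, Azure, GCP: virtual machines, containers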

Jobs
NetAdmin
Research / Developer
Security
Cloud Ops
SysAdmin
Data Engineer

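Pains
NetAdmin
Research / Developer
Security
Monitoring with Cacti
Server monitoring with NewRelic
Monitoring with OpsCenter
Monitoring with Kafka Manager
Synthetic / APM monitoring with NewRelic
External monitoring with StatusCake
Monitoring with DataDog
Different components are currently monitored and alerted on with a variety of tools
++ Information is scattered ++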

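Pain
▷ Data fragments
  ▷ Different components use different monitoring tools, so they cannot be viewed side by side
  ▷ Custom queries are hard to run
▷ Data retention is too short
  ▷ Free cloud services often keep only 7 days of data
▷ Black box
  ▷ No visibility into how the metrics are produced
  ▷ Custom metrics are hard to add
▷ Vendor lock-in
  ▷ The provider may suddenly terminate the service or change its scope without warning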

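Gain: what improvements do we expect?
▷ Centralized time-series database
  ▷ Easy to run custom queries over a specific time range across different components
▷ Support for multiple alert notification channels
  ▷ Slack, e-mail, SMS …
▷ Self-defined data retention
  ▷ Enterprises often need at least one year of records as supporting evidence for legal traceability
▷ White box
  ▷ Custom metrics are possible, so you know how the metrics are produced
▷ Self-defined dashboards
  ▷ E.g. a monitoring dashboard drawn for every endpoint of a data pipeline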

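The concept of full stack monitoring and alerting is not my own invention
(not a paid placement) ... Inspired by Outlyer ...
https://www.outlyer.com/
Monitoring and alerting should exist
to keep the business running
~~ master your tools; don't be enslaved by them ~~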

2.
Introduction to Prometheus Ecosystem
Features / Pain Relievers / Gain Creators

Concepts
Some terminology and a brief comparison of different ecosystems

Common Building Blocks
[Diagram: Target, Collector / Exporter, Time-Series Database, Rule, Dashboard, Alert Message / Annotation. Metrics are moved by Push or Pull.]

Ranking of Time Series DBMS
https://db-engines.com/en/ranking/time+series+dbms

Comparison of Common Monitoring and Notification Systems

Target | Push / Pull | Collector / Exporter | DBMS | Dashboard | Alert
Network devices (snmpd) | Pull | Cacti - Device (snmpwalk) | RRDTool | Cacti - Graph | Plugin*
Operating system (gmond) | Pull | Ganglia gmetad | RRDTool | Ganglia | with Nagios
Operating system (newrelic-agent) | Push (?) | NewRelic | ?? | NewRelic | NewRelic Alert
Application (statsD) | Push | Carbon / whisper | Graphite | Grafana | via Grafana
Application (Telegraf) | Push | Telegraf | InfluxDB | Grafana | via Grafana
Network devices, operating system, application | Pull, Push* | snmp_exporter, node_exporter, jmx_exporter … | Prometheus | Grafana | AlertManager

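About Prometheus
▷ Official website: https://prometheus.io/
▷ Started in November 2012 at SoundCloud
▷ Written in Go, Apache 2.0 license
▷ Joined the Cloud Native Computing Foundation in 2016 as its second hosted project after Kubernetes; the K8s community uses Prometheus for monitoring
▷ v1.0.0 / 2016-07-18, v2.0.0 / 2017-11-08
▷ Latest version: v2.22.0 / 2020-10-15
▷ Multi-dimensional data model with its own PromQL query language
▷ Built-in simple visualization UI; pair it with Grafana to build dashboards
▷ Ships with AlertManager for alerting
▷ Storage efficiency: v2.0 brought major performance improvements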
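To make the "multi-dimensional data model" point concrete, here is a minimal Python sketch of how one sample looks in the Prometheus text exposition format: a metric name plus a set of label pairs. The metric name `http_requests_total` and its labels are hypothetical examples, and the helper functions are illustrative only, not part of any Prometheus library.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: dict, value) -> str:
    """Render one sample as `name{label="value",...} value`."""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


# Same metric name, different label sets: each combination is its own time series.
samples = [
    render_sample("http_requests_total", {"method": "GET", "code": "200"}, 1027),
    render_sample("http_requests_total", {"method": "POST", "code": "500"}, 3),
]
print("\n".join(samples))
```

A PromQL selector such as `http_requests_total{method="GET"}` would then pick out only the series whose labels match.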

Components of Prometheus
[Architecture diagram: Push, Pull, Query, Alert]
https://prometheus.io/docs/introduction/overview/

Comparison of Time-Series DBMS
Prometheus design philosophy: emphasizes HA
The Prometheus data model has its limitations
https://prometheus.io/docs/introduction/comparison/

Client Libraries
▷ Official Prometheus client libraries
  ▷ Go
  ▷ Java or Scala
  ▷ Python
  ▷ Ruby
▷ Unofficial third-party client libraries
  ▷ Bash
  ▷ C++
  ▷ Common Lisp
  ▷ Elixir
  ▷ Erlang
  ▷ Haskell
  ▷ Lua for Nginx
  ▷ Lua for Tarantool
  ▷ .NET / C#
  ▷ Node.js
  ▷ PHP
  ▷ Rust
https://prometheus.io/docs/instrumenting/clientlibs/
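In practice you would use one of the client libraries above. As a rough illustration of what such a library does for an instrumented application, the following stdlib-only Python sketch keeps a counter in memory and serves it on `/metrics` for Prometheus to pull. The `Counter` class, the metric name `demo_jobs_processed_total`, and `start_metrics_server` are hypothetical stand-ins written for this sketch, not the API of the official Python client.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Counter:
    """A tiny thread-safe counter, standing in for a client-library counter."""

    def __init__(self, name: str, help_text: str):
        self.name = name
        self.help_text = help_text
        self._value = 0.0
        self._lock = threading.Lock()

    def inc(self, amount: float = 1.0) -> None:
        with self._lock:
            self._value += amount

    def render(self) -> str:
        """Return the counter in the text exposition format."""
        with self._lock:
            value = self._value
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {value}\n"
        )


# Hypothetical application metric.
JOBS_PROCESSED = Counter("demo_jobs_processed_total", "Jobs processed by the demo app.")


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = JOBS_PROCESSED.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def start_metrics_server(port: int = 0) -> HTTPServer:
    """Serve /metrics on 127.0.0.1 in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The application would call `JOBS_PROCESSED.inc()` wherever work is done and `start_metrics_server(8000)` once at startup (the port is an arbitrary choice); Prometheus is then configured to scrape that address.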

3.
Live Demo
Let's use Docker Compose
to simulate a full stack environment

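Show me the source code!!
○ https://github.com/jazzwang/prometheus-labs
○ I wrote a Docker Compose based demo cluster that demonstrates monitoring at the network, system, and middleware layers.
○ I have not had time to write an application-layer example yet, so I will only introduce the approach.
If you find this example helpful,
feel free to give it a star.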

Due to time constraints, today I can only share screenshots of the demo dashboards
If you run this on a cloud virtual machine, replace localhost with the public IP

Prometheus built-in simple visualization UI (showing SNMP)

Grafana dashboard showcase

Demo data flow: Data Pipeline
Fluentd (in_dummy → out_kafka) → Kafka → Fluentd (in_kafka_group → out_file)

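Network Layer Monitoring and Alerting
▷ snmp_exporter
○ https://github.com/prometheus/snmp_exporter
○ Most network devices support SNMP for exposing device metrics
○ Common pain point: you have to look up the OIDs in each device's MIB
○ Tip: use the generator shipped with the snmp_exporter project to produce the snmp.yml config file you need
▷ blackbox_exporter
○ https://github.com/prometheus/blackbox_exporter
○ Supports probing over HTTP, HTTPS, DNS, TCP, and ICMP
○ Use cases: periodically checking whether a web service or SSH is reachable, whether reverse DNS lookups are correct, or how quickly a host answers ping can all be done with blackbox_exporter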
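To illustrate the kind of check blackbox_exporter performs for the TCP use case (for example "is SSH reachable?"), here is a minimal Python sketch of a single TCP probe. It is not how blackbox_exporter is implemented; the function `probe_tcp` is hypothetical, and the result keys simply borrow the `probe_success` and `probe_duration_seconds` names that blackbox_exporter exposes.

```python
import socket
import time


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> dict:
    """Try one TCP connection; report success (1 or 0) and elapsed seconds."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            success = 1
    except OSError:
        # Refused, unreachable, timed out, or name resolution failed.
        success = 0
    return {
        "probe_success": success,
        "probe_duration_seconds": time.monotonic() - started,
    }
```

For instance, `probe_tcp("localhost", 22)` would check whether something accepts connections on the SSH port of the local machine.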

System Layer Monitoring and Alerting
▷ node_exporter
○ https://github.com/prometheus/node_exporter
○ Provides very detailed OS-level metrics

Middleware Layer Monitoring and Alerting
▷ jmx_exporter
○ https://github.com/prometheus/jmx_exporter
○ Almost any middleware written in Java can produce Prometheus metrics once you write a suitable YAML config file
○ Example configs are currently provided for:
■ Apache Kafka
■ Apache Cassandra
■ Apache Flink
■ Apache Spark
■ Apache Tomcat
■ Apache ZooKeeper
■ Apache ActiveMQ Artemis 2.x
■ WebLogic
■ WildFly 10

Apache Kafka Monitoring and Alerting
▷ The official recommendation is to use `jmx_exporter` to monitor Kafka and Cassandra
○ Docker example - https://github.com/RobustPerception/docker_examples
▷ kafka_topic_exporter
○ Written in Java with Jetty
○ https://github.com/ogibayashi/kafka-topic-exporter
▷ kafka_zookeeper_exporter
○ Shows the state of each ZooKeeper node, mainly topic_partition
○ https://github.com/cloudflare/kafka_zookeeper_exporter
▷ prometheus-kafka-consumer-group-exporter
○ Implemented in Python; the main metrics are consumer_group_offset and topic_highwater, from which lag can be computed
○ https://github.com/braedon/prometheus-kafka-consumer-group-exporter
▷ burrow_exporter
○ Reads from Burrow, the Kafka lag monitor developed by LinkedIn
(implemented in Go; no need to set thresholds yourself, since alert thresholds are computed from a sliding window over historical data)
○ https://github.com/jirwin/burrow_exporter

Apache Kafka Monitoring and Alerting (continued)
▷ kafka-consumer-group-exporter
○ Implemented in Go, but depends on running kafka-consumer-groups.sh
○ https://github.com/kawamuray/prometheus-kafka-consumer-group-exporter
▷ kafka-prometheus-exporter
○ Implemented in Go; in my experiments it can obtain the consumergroup_lag metric
○ Test result: works with Kafka 0.8 and earlier (ZK)
○ https://github.com/ogibayashi/kafka-topic-exporter
▷ kafka_exporter
○ Implemented in Go; richer metrics
○ Test result: works with Kafka 0.9 and later (KF)
○ https://github.com/danielqsj/kafka_exporter
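Several of the exporters above report a committed consumer offset and a topic high-water mark and leave the lag calculation to you. A minimal Python sketch of that arithmetic follows; the topic name, partition numbers, and offsets are made-up values, and treating a partition with no committed offset as fully unread is an assumption of this sketch.

```python
def consumer_lag(highwater: dict, committed: dict) -> dict:
    """Per-partition lag = topic high-water mark minus committed consumer offset."""
    lag = {}
    for partition, end_offset in highwater.items():
        # No committed offset yet: count the whole partition as lag (sketch assumption).
        offset = committed.get(partition, 0)
        lag[partition] = max(end_offset - offset, 0)
    return lag


# Hypothetical values keyed by (topic, partition).
highwater = {("events", 0): 1500, ("events", 1): 980}
committed = {("events", 0): 1420, ("events", 1): 980}

lag = consumer_lag(highwater, committed)
print(lag)                # per-partition lag
print(sum(lag.values()))  # total lag for the consumer group
```

The same subtraction can of course be written directly in PromQL on the exported metrics; the Python version only spells out the logic.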

CNCF Fluentd Monitoring and Alerting
▷ fluent-agent-lite_exporter
○ Monitors fluent-agent-lite [1], which tagomoris wrote four years ago
○ https://github.com/matsumana/fluent-agent-lite_exporter
○ [1] https://github.com/tagomoris/fluent-agent-lite
▷ fluent-plugin-prometheus
○ fluentd → monitor_agent → fluent-plugin-prometheus
○ http://prometheus:9090/metrics → `fluent-plugin-prometheus` → fluentd
○ https://github.com/fluent/fluent-plugin-prometheus
▷ fluentd_exporter
○ No releases; the code looks functionally incomplete
○ https://github.com/wyukawa/fluentd_exporter
▷ fluentd_exporter
○ http://fluentd:9224/metrics → `fluentd_exporter` (by V3ckt0r) → prometheus
○ https://github.com/wyukawa/fluentd_exporter

Application Layer Monitoring and Alerting
▷ https://www.robustperception.io/exposing-dropwizard-metrics-to-prometheus
http://metrics.dropwizard.io/

4.
Lessons Learned

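Lessons Learned
▷ Lesson #1
White-box monitoring such as Prometheus gives you a high degree of control and customizable queries;
the drawback is that you must spend more time designing the monitoring and clarifying the core goals behind it.
▷ Lesson #2
Full stack monitoring requires a deep understanding of each target's collector
and of the metrics it can gather before you can find a suitable exporter.
○ Step 1: check whether a community-contributed exporter already exists
https://prometheus.io/docs/instrumenting/exporters/
○ Step 2: check which ports will be used
https://github.com/prometheus/prometheus/wiki/Default-port-allocations
○ Step 3: verify that the metrics the exporter produces are the ones you need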
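Step 3 can be partly automated. The following Python sketch takes the text an exporter returns on its `/metrics` endpoint and reports which of the metrics you care about are missing. The helper names are hypothetical and the parser is deliberately simplistic (it only extracts metric names); the sample text is a hand-written excerpt in the style of node_exporter output, not a captured response.

```python
def metric_names(exposition_text: str) -> set:
    """Collect metric names from text-format sample lines, skipping comments."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The name ends at the first '{' (labels) or the first space (value).
        end = len(line)
        for sep in ("{", " "):
            idx = line.find(sep)
            if idx != -1:
                end = min(end, idx)
        names.add(line[:end])
    return names


def missing_metrics(exposition_text: str, wanted: list) -> list:
    """Return the wanted metric names that the exporter did not expose."""
    present = metric_names(exposition_text)
    return [name for name in wanted if name not in present]


# Hand-written sample in the style of node_exporter output.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
"""

print(missing_metrics(SAMPLE, ["node_load1", "node_memory_MemAvailable_bytes"]))
```

In real use the text would come from fetching the exporter's `/metrics` URL rather than from a constant.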

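Lessons Learned
▷ My approach:
○ Read the README in the corresponding GitHub repository to filter out some candidates first
○ Then install the exporter and look at the metrics it produces
○ Try some queries at http://prometheus:9090/graph
○ Build the corresponding dashboard in Grafana
○ Define alert thresholds based on the historical monitoring data in Grafana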
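The last step, deriving an alert threshold from historical data, can be done by eye on a Grafana panel. One possible way to make it repeatable is sketched below in Python: take a high percentile of past values and add some headroom. The function name, the choice of the 99th percentile, and the 20% headroom are all assumptions of this sketch, not a rule from the talk.

```python
import statistics


def suggest_threshold(history: list, percentile: int = 99, headroom: float = 1.2) -> float:
    """Take a high percentile of past values and add headroom on top."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns the 99 cut points P1..P99 of the data.
    cut_points = statistics.quantiles(history, n=100, method="inclusive")
    return cut_points[percentile - 1] * headroom
```

For example, calling `suggest_threshold(values)` on a week of exported latency values would propose a threshold slightly above what the system normally reaches; the result should still be reviewed against the dashboard before it becomes an alert rule.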

Thanks!
Any questions?
You can find me at @jazzwang_tw or
https://fb.com/groups/dataengineering.tw
https://slideshare.net/jazzwang
https://github.com/jazzwang
If you liked today's examples, please give me a star on GitHub *^__^*
