This "Deep Into Prometheus" talk is the second part of our Prometheus & Monitoring series and builds heavily on the content of Part I, the "Dip into Prometheus" webinar. We strongly recommend watching the Part I recording first if you missed it: https://youtu.be/lvogDmRN-Hs
In the "Deep Into Prometheus" meetup, we go deep into the Prometheus metrics data model: how to design high-quality monitoring metrics, how to consume that data using PromQL, and how to avoid common "gotchas" while doing so.
6. So we have metrics
2020-07-28T02:32:06Z
http_requests_total{code=200, process=A, path=/foo} 1107
http_requests_total{code=404, process=A, path=/foo} 12
http_requests_total{code=404, process=A, path=/bar} 120
http_requests_total{code=200, process=B, path=/foo} 1005
http_requests_total{code=404, process=B, path=/foo} 8
http_requests_total{code=200, process=B, path=/bar} 172
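Each line above is its own time series, identified by the metric name plus its full label set. A minimal Python sketch of that idea, using the sample values from this slide; the `sum_by` helper is hypothetical and only imitates what a PromQL aggregation such as `sum by (code) (http_requests_total)` would return:

```python
from collections import defaultdict

# Samples from the slide: (metric name, labels, value).
samples = [
    ("http_requests_total", {"code": "200", "process": "A", "path": "/foo"}, 1107),
    ("http_requests_total", {"code": "404", "process": "A", "path": "/foo"}, 12),
    ("http_requests_total", {"code": "404", "process": "A", "path": "/bar"}, 120),
    ("http_requests_total", {"code": "200", "process": "B", "path": "/foo"}, 1005),
    ("http_requests_total", {"code": "404", "process": "B", "path": "/foo"}, 8),
    ("http_requests_total", {"code": "200", "process": "B", "path": "/bar"}, 172),
]


def sum_by(samples, label):
    """Sum sample values across all series, grouped by one label (hypothetical helper)."""
    totals = defaultdict(float)
    for _name, labels, value in samples:
        totals[labels[label]] += value
    return dict(totals)


by_code = sum_by(samples, "code")        # {"200": 2284.0, "404": 140.0}
by_process = sum_by(samples, "process")  # {"A": 1239.0, "B": 1185.0}
print(by_code)
print(by_process)
```

Grouping by a different label only changes which series get added together; the underlying samples stay the same.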
But there are different kinds of those
7. Meet GAUGE
✘ Represents data that can go up and down
✘ Example metrics:
➢ Temperature
➢ Memory usage
➢ Requests in progress
➢ Queue size
Better use counters
10. Where gauges mislead
Server 1: 2 rps (avg)
Server 2: 10 rps (avg)
Total: 6 rps (avg)
Server 1: 4 requests in 2 seconds
Server 2: 10 requests in 1 second
Total: (4+10) / (2+1) ≈ 4.7 rps
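The gap between the two totals can be reproduced with a few lines of arithmetic. A small Python sketch, using the per-server request counts and durations from this slide (the variable names are illustrative):

```python
# (requests served, seconds observed) per server; numbers from the slide.
observations = {"server_1": (4, 2), "server_2": (10, 1)}

# Gauge-style view: each server reports its own average rate,
# and we average those averages.
per_server_rps = {name: reqs / secs for name, (reqs, secs) in observations.items()}
average_of_averages = sum(per_server_rps.values()) / len(per_server_rps)

# Counter-style view: keep the raw counts and divide once at the end.
total_requests = sum(reqs for reqs, _ in observations.values())
total_seconds = sum(secs for _, secs in observations.values())
rate_from_counts = total_requests / total_seconds

print(f"average of averages: {average_of_averages:.2f} rps")
print(f"rate from raw counts: {rate_from_counts:.2f} rps")
```

The averaged figure gives both servers equal weight regardless of how long each was observed, which is why it drifts away from the figure computed from raw counts.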
11. Meet COUNTER
✘ Can only go up
✘ Example: anything related to production / consumption:
➢ Request rate
➢ Error rate
➢ Latency
✘ We only care about deltas over a time interval
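Since only deltas matter, a consumer has to turn raw counter samples into an increase over a window, and cope with the counter dropping back to zero when a process restarts. A simplified Python sketch of that idea; the helper names and sample data are hypothetical, and Prometheus's own `rate()` additionally extrapolates to the window boundaries, which this sketch does not attempt:

```python
def counter_increase(values):
    """Total increase across consecutive counter samples.

    A drop between two samples is treated as a counter reset (for example
    a process restart): the counter is assumed to have restarted from
    zero, so the new value itself is counted as the increase.
    """
    increase = 0.0
    for prev, curr in zip(values, values[1:]):
        increase += curr - prev if curr >= prev else curr
    return increase


def per_second_rate(samples):
    """Average per-second rate; samples is a list of (timestamp_seconds, value), oldest first."""
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase([value for _, value in samples]) / elapsed


# Hypothetical scrapes every 15 seconds; the counter resets before t=45.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(counter_increase([value for _, value in scrapes]))
print(per_second_rate(scrapes))
```

Note that the absolute counter value (100, 160, 40) never appears in the result; only the differences between samples do.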
12. Gauge vs Counter
Gauge:
✘ Goes up and down
✘ Usable raw values
✘ Non-cumulative*
Counter:
✘ Goes only UP
✘ We care about Deltas
✘ Cumulative
Use Counters!
14. The looks of metrics
$ curl localhost:8081/metrics
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 9.0
# HELP jobs_processed_total Total jobs processed
# TYPE jobs_processed_total counter
jobs_processed_total{cluster="us-east",host="host_a"} 173.0
jobs_processed_total{cluster="us-east",host="host_b"} 203.0
jobs_processed_total{cluster="eu-west",host="host_c"} 190.0
jobs_processed_total{cluster="eu-west",host="host_d"} 169.0
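The output above follows a simple line-based layout: a `# HELP` line, a `# TYPE` line, then one line per labeled series. A minimal Python sketch that renders that layout by hand; `render_metric` is a hypothetical helper for illustration only (a real service would normally rely on a Prometheus client library), and it does not escape special characters in label values:

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family as text exposition lines.

    samples is a list of (labels_dict, value); an empty dict means the
    series has no labels. Label values are NOT escaped in this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{key}="{val}"' for key, val in labels.items())
        series = f"{name}{{{label_str}}}" if labels else name
        lines.append(f"{series} {float(value)}")
    return "\n".join(lines)


gauge_text = render_metric(
    "process_open_fds", "Number of open file descriptors.", "gauge", [({}, 9)]
)
counter_text = render_metric(
    "jobs_processed_total",
    "Total jobs processed",
    "counter",
    [
        ({"cluster": "us-east", "host": "host_a"}, 173),
        ({"cluster": "eu-west", "host": "host_c"}, 190),
    ],
)
print(gauge_text)
print(counter_text)
```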
28. The looks of metrics
$ curl localhost:8081/metrics
# HELP jobs_in_progress Currently active jobs
# TYPE jobs_in_progress gauge
# HELP jobs_processed_total Total jobs processed
# TYPE jobs_processed_total counter
# HELP request_processing_seconds Time spent processing request
# TYPE request_processing_seconds summary
# HELP errors_total Total errors encountered
# TYPE errors_total counter