Deep into Prometheus

hello!
I am Zaar Hai
Staff Cloud Architect at DoiT International
linkedin.com/in/zaar

Missed Part I?
https://youtu.be/lvogDmRN-Hs

Let’s focus on Display & Alert
Instrument → Collect & Store → Display & Alert

So we have metrics
2020-07-28T02:32:06Z
http_requests_total{code=200, process=A, path=/foo} 1107
http_requests_total{code=404, process=A, path=/foo} 12
http_requests_total{code=404, process=A, path=/bar} 120
http_requests_total{code=200, process=B, path=/foo} 1005
http_requests_total{code=404, process=B, path=/foo} 8
http_requests_total{code=200, process=B, path=/bar} 172
But there are different kinds of metrics

Meet GAUGE
✘ Represents data that can go up and down
✘ Example metrics:
➢ Temperature
➢ Memory usage
➢ Requests in progress
➢ Queue size

Better use counters

Where gauges mislead
Server 1: 2 rps (avg)
Server 2: 10 rps (avg)
Total: 6 rps (avg of the averages)
Server 1: 4 requests in 2 seconds
Server 2: 10 requests in 1 second
Total: (4+10) / (2+1) ≈ 4.67 rps
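
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the `Server` tuple and both function names are made up for the example, and the numbers are the ones from the slide (4 requests in 2 seconds, 10 requests in 1 second).

```python
from typing import NamedTuple


class Server(NamedTuple):
    """Hypothetical per-server observation: a request count and the seconds it covers."""
    requests: int
    seconds: float


def average_of_averages(servers: list[Server]) -> float:
    """Average the per-server rates, as you would with pre-averaged gauges."""
    rates = [s.requests / s.seconds for s in servers]
    return sum(rates) / len(rates)


def rate_from_totals(servers: list[Server]) -> float:
    """Divide all requests by all observed seconds, as raw counts allow."""
    return sum(s.requests for s in servers) / sum(s.seconds for s in servers)


servers = [Server(requests=4, seconds=2), Server(requests=10, seconds=1)]
naive = average_of_averages(servers)   # (2 + 10) / 2
weighted = rate_from_totals(servers)   # (4 + 10) / (2 + 1)
print(f"{naive:.2f} rps vs {weighted:.2f} rps")  # 6.00 rps vs 4.67 rps
```

The second number is only computable because the raw counts and durations are still available, which is the argument for exporting counters instead of pre-averaged gauges.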

Meet COUNTER
✘ Can only go up
✘ Example: anything related to production / consumption:
➢ Request rate
➢ Error rate
➢ Latency
✘ We only care about deltas over a time interval

Gauge vs Counter
Gauge:
✘ Goes up and down
✘ Usable raw values
✘ Non-cumulative*
Counter:
✘ Goes only UP
✘ We care about deltas
✘ Cumulative
Use counters!

The looks of metrics
$ curl localhost:8081/metrics
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 9.0
# HELP jobs_processed_total Total jobs processed
# TYPE jobs_processed_total counter
jobs_processed_total{cluster="us-east",host="host_a"} 173.0
jobs_processed_total{cluster="us-east",host="host_b"} 203.0
jobs_processed_total{cluster="eu-west",host="host_c"} 190.0
jobs_processed_total{cluster="eu-west",host="host_d"} 169.0
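
To make the format concrete, here is a minimal Python sketch that reads this kind of output and sums the counter per cluster. It is not the official parser: the regular expressions are an assumption that only covers the simple lines shown above (no escaped label values, no timestamps), and the names `parse` and `per_cluster` are made up for the example.

```python
import re
from collections import defaultdict

EXPOSITION = """\
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 9.0
# HELP jobs_processed_total Total jobs processed
# TYPE jobs_processed_total counter
jobs_processed_total{cluster="us-east",host="host_a"} 173.0
jobs_processed_total{cluster="us-east",host="host_b"} 203.0
jobs_processed_total{cluster="eu-west",host="host_c"} 190.0
jobs_processed_total{cluster="eu-west",host="host_d"} 169.0
"""

# Simplified patterns: metric name, optional {labels}, then a single value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Return (types, samples): metric name -> type, and (name, labels, value) tuples."""
    types, samples = {}, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# TYPE "):
            _, _, name, kind = line.split(maxsplit=3)
            types[name] = kind
        elif line and not line.startswith("#"):
            match = SAMPLE_RE.match(line)
            if match:
                labels = dict(LABEL_RE.findall(match.group("labels") or ""))
                samples.append((match.group("name"), labels, float(match.group("value"))))
    return types, samples


types, samples = parse(EXPOSITION)
per_cluster = defaultdict(float)
for name, labels, value in samples:
    if name == "jobs_processed_total":
        per_cluster[labels["cluster"]] += value

print(types)              # {'process_open_fds': 'gauge', 'jobs_processed_total': 'counter'}
print(dict(per_cluster))  # {'us-east': 376.0, 'eu-west': 359.0}
```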

Data Viz in a Nutshell
Select → Bucket → Aggregate

From Numbers to Graph Dot
[Figure: samples t1..t10 on a timeline for series foo{t=a}, foo{t=b} and bar{}; time axis marked at 1m and 2m15s, with the scrape interval annotated]
Select → Bucket → Aggregate
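
The three steps can be sketched in plain Python. Everything here is illustrative: the sample values are invented, the function names simply mirror the step names, and the fixed-width buckets are a simplification rather than how Prometheus itself windows data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (series name, labels, timestamp in seconds, value).
SAMPLES = [
    ("foo", {"t": "a"}, 0, 1.0),
    ("foo", {"t": "a"}, 30, 3.0),
    ("foo", {"t": "a"}, 60, 5.0),
    ("foo", {"t": "a"}, 90, 7.0),
    ("foo", {"t": "b"}, 0, 10.0),
    ("foo", {"t": "b"}, 30, 10.0),
    ("foo", {"t": "b"}, 60, 20.0),
    ("foo", {"t": "b"}, 90, 20.0),
    ("bar", {}, 0, 99.0),
]


def select(samples, name):
    """Select: keep only the samples of the series we asked for."""
    return [s for s in samples if s[0] == name]


def bucket(samples, width):
    """Bucket: group values per series into fixed-width time buckets."""
    buckets = defaultdict(list)
    for name, labels, ts, value in samples:
        series = (name, tuple(sorted(labels.items())))
        buckets[(ts // width * width, series)].append(value)
    return buckets


def aggregate(buckets):
    """Aggregate: average inside each series bucket, then sum across series."""
    per_bucket = defaultdict(float)
    for (start, _series), values in buckets.items():
        per_bucket[start] += mean(values)
    return dict(sorted(per_bucket.items()))


dots = aggregate(bucket(select(SAMPLES, "foo"), width=60))
print(dots)  # {0: 12.0, 60: 26.0}
```

Each key in `dots` is one point on the graph: the start of a bucket and the aggregated value for it.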

P8s Instant query
GET /api/v1/query?time=t4&query=foo
[Figure: samples t1..t10 of foo{t=a}, foo{t=b} and bar{}; the query selects the foo samples at t4]
➢ query=foo: Select, returns an instant vector
➢ query=sum(foo): sum across the selected series
➢ query=foo[1m]: returns a range vector
➢ query=avg_over_time(foo[1m]): range vector → instant vector, one avg per series
➢ query=sum(avg_over_time(foo[1m])): instant vector, the sum of the per-series averages
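
As a rough sketch, the request URLs for these instant queries can be built with the Python standard library. The `/api/v1/query` path and its `query` and `time` parameters come from the slide; the server address and the timestamp standing in for t4 are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: a Prometheus server reachable on the default local address.
BASE_URL = "http://localhost:9090"


def instant_query_url(query: str, time: float) -> str:
    """Build a GET URL for the instant-query endpoint, /api/v1/query."""
    return f"{BASE_URL}/api/v1/query?{urlencode({'query': query, 'time': time})}"


T4 = 1595903526  # hypothetical Unix timestamp standing in for t4

for expr in ("foo", "sum(foo)", "foo[1m]", "sum(avg_over_time(foo[1m]))"):
    url = instant_query_url(expr, T4)
    # Round-trip the query string to show the expression survives URL encoding.
    decoded = parse_qs(urlsplit(url).query)["query"][0]
    print(url, "->", decoded)
```

`urlencode` percent-encodes the brackets and parentheses, so expressions can be passed as-is without manual escaping.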

P8s Range query - rinse/repeat
GET /api/v1/query_range?start=t1&end=t4&step=60&query=sum(avg_over_time(foo[1m]))
[Figure: the instant query sum(avg_over_time(foo[1m])) evaluated at every step between t1 and t4, over samples t1..t10 of foo{t=a}, foo{t=b} and bar{}]
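
The "rinse/repeat" idea can be emulated in plain Python: a range query is the same instant query evaluated at start, start + step, and so on up to end. The sample data and function names are invented for the example, and the window is assumed to cover (t - 1m, t], which is a modelling choice made here rather than something the slides state.

```python
# Hypothetical samples per series: sorted (timestamp_seconds, value) pairs.
SERIES = {
    "foo{t=a}": [(0, 2.0), (30, 4.0), (60, 6.0), (90, 8.0), (120, 10.0)],
    "foo{t=b}": [(0, 1.0), (30, 1.0), (60, 3.0), (90, 3.0), (120, 5.0)],
}


def avg_over_time(points, at, window):
    """Average of the samples in (at - window, at], or None if there are none."""
    values = [v for ts, v in points if at - window < ts <= at]
    return sum(values) / len(values) if values else None


def instant_query(at, window=60):
    """Emulates sum(avg_over_time(foo[1m])) evaluated at a single point in time."""
    averages = [avg_over_time(points, at, window) for points in SERIES.values()]
    present = [a for a in averages if a is not None]
    return sum(present) if present else None


def range_query(start, end, step, window=60):
    """Rinse/repeat: run the same instant query at start, start + step, ..., end."""
    return [(t, instant_query(t, window)) for t in range(start, end + 1, step)]


result = range_query(start=60, end=120, step=60)
print(result)  # [(60, 7.0), (120, 13.0)]
```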

The looks of metrics
$ curl localhost:8081/metrics
# HELP jobs_in_progress Currently active jobs
# TYPE jobs_in_progress gauge
# HELP jobs_processed_total Total jobs processed
# TYPE jobs_processed_total counter
# HELP request_processing_seconds Time spent processing request
# TYPE request_processing_seconds summary
# HELP errors_total Total errors encountered
# TYPE errors_total counter

Initialize to ZERO
[Figure: counter bar{} over t1..t10, y-axis 1 to 3]
➢ Counter first appears already at 1: Increase[1m] = 0
➢ Counter exported as 0 first, then 1: Increase[1m] ~= 1
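
A small Python sketch shows why the first increment gets lost. It uses a deliberately simplified increase (last sample minus first sample inside the window, with no extrapolation and no reset handling), so it returns exactly 1 where the slide says ~= 1. The timestamps and the function name are made up for the example.

```python
def simple_increase(points, at, window):
    """Last minus first sample inside [at - window, at]; 0.0 with fewer than two samples.

    Simplification: no extrapolation and no counter-reset handling.
    """
    inside = [v for ts, v in points if at - window <= ts <= at]
    return inside[-1] - inside[0] if len(inside) >= 2 else 0.0


# The series only shows up after the first event, already at 1.
not_initialized = [(40, 1.0), (55, 1.0)]
# The series is exported as 0 from startup, then incremented.
initialized = [(10, 0.0), (25, 0.0), (40, 1.0), (55, 1.0)]

print(simple_increase(not_initialized, at=60, window=60))  # 0.0
print(simple_increase(initialized, at=60, window=60))      # 1.0
```

Without the initial 0 there is no earlier sample to take a delta against, so the first event is invisible.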

Counter resets
[Figure: counter bar{} over t1..t10, dropping at a process restart and then "hoisted" back onto its previous trend]
➢ Process restart resets the counter
➢ rate() fixes resets "in flight" ("hoist")
Jumpy graphs
Graph jumps as “now” progresses

P8s, deltas, and fencing posts
[Figure: counter bar{} over t1..t10, y-axis 1 to 3; time axis marked at 1m and 2m15s, with the scrape interval annotated]
Increase[1m] = 0
Increase[1m] = 0
Increase[1m] = ?
Increase[1m] = 1 / 45 * 60 = 1.33 ((over)extrapolation)
✘ Solutions
➢ xrate fork: https://github.com/free/prometheus
➢ Victoria Metrics: https://github.com/VictoriaMetrics/VictoriaMetrics
