Bartek Plotka
Bwplotka Bplotka
Fabian Reinartz
fabxc
Global, durable Prometheus monitoring
Prometheus 2.0
● Reliable operational model
● Powerful query language
● Scraping capabilities beyond casual usage
● Local metric storage
Prometheus at Scale
A dream
Improbable case
[Diagram: Kubernetes clusters 1, 2, …, n, n+1, each running its own Prometheus server; Grafana and Alertmanager in a separate cluster]
● Multiple isolated Kubernetes clusters
● Single Prometheus server per cluster
● Dashboards & Alertmanager in separate cluster
What is missing?
[Diagram: per-cluster Prometheus servers 1, 2, …, n, n+1]
Global View
See everything from a single
place!
Global View
[Diagram: Prometheus servers 1 … n+1 feeding Grafana and Alertmanager, plus a federating Prometheus]
“Alert when 66% of clusters in a region are down”
● How to aggregate data from different clusters?
○ Use hierarchical federation?
Problems: single point of failure · maintenance · which data are federated?
Availability
Where is my sample?
Availability
[Diagram: Prometheus servers 1 … n+1 feeding Grafana and Alertmanager]
Operator error · Hardware failure · Rollout
● Can we ensure no loss of metric data?
○ Add HA replicas?
Open questions: which replica should we query? · cost? · where to put rules and alerts?
Historical Metrics
What exactly happened
X weeks ago?
Metric retention
[Timeline: T-3 years … T-2 years … T-12 months … T]
“Go back to what happened 6 months ago...”
“Good progress! Memory allocs look better than 1.5 years ago!”
“Let’s see user traffic across years!”
Infrastructure retention: 9 days
Metric retention
● Can we have longer retention?
○ Upgrade to Prometheus 2.0
○ Scale SSDs vertically?
[Diagram: Prometheus with an ever-growing local SSD]
Backup? · Maintenance · Cost
Recap
[Diagram: per-cluster Prometheus servers 1 … n+1 with Grafana, Alertmanager and a federating Prometheus]
It is just hard to…
● Have a global view
● Have a HA in place
● Increase retention
Thanos
Additional Goals
● Seamless integration with Prometheus
● Easy deployment model
● Minimal number of dependencies
● Minimal baseline cost
Global View
See everything from a single
place!
Sidecar
[Diagram: Prometheus scraping Targets, storing to local SSD; a Sidecar alongside exposes gRPC (Store API)]
Store API
service Store {
  rpc Series(SeriesRequest) returns (stream SeriesResponse);
  rpc LabelNames(LabelNamesRequest) returns (LabelNamesResponse);
  rpc LabelValues(LabelValuesRequest) returns (LabelValuesResponse);
}

message SeriesRequest {
  int64 min_time = 1;
  int64 max_time = 2;
  repeated LabelMatcher matchers = 3;
}
Sidecar
[Diagram: the Sidecar serves the Store API by proxying requests to Prometheus via remote read]
Querier
[Diagram: Querier exposes the Prometheus HTTP Query API and fans out to the Prometheus + Sidecar pair via the Store API]
Global View
[Diagram: one Querier merges series from multiple Prometheus + Sidecar pairs via the Store API]
Global View + Availability
[Diagram: two Prometheus replicas, labeled “replica”:”1” and “replica”:”2”, each with a Sidecar; the Querier merges and deduplicates their series via the Store API]
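The merge + deduplicate step can be sketched as follows. This is a deliberate simplification: Thanos' real deduplication uses a penalty-based algorithm to handle replicas with gaps, and the replica label name is configurable; here we just group by labels minus the replica label and keep one sample per timestamp.

```python
def deduplicate(series_list, replica_label="replica"):
    """Merge HA replica series: group by labels minus the replica
    label, keep one sample per timestamp (first replica seen wins)."""
    merged = {}
    for labels, samples in series_list:
        key = tuple(sorted(
            (k, v) for k, v in labels.items() if k != replica_label))
        bucket = merged.setdefault(key, {})
        for ts, value in samples:
            bucket.setdefault(ts, value)  # on overlap, keep existing sample
    return {k: sorted(b.items()) for k, b in merged.items()}

# Two replicas of the same series, with one overlapping sample:
r1 = ({"__name__": "up", "replica": "1"}, [(0, 1.0), (30, 1.0)])
r2 = ({"__name__": "up", "replica": "2"}, [(30, 1.0), (60, 1.0)])
result = deduplicate([r1, r2])
# → one series covering timestamps 0, 30 and 60
```

This is why the replica label must differ between HA pairs but match on every other label: only then do the two series collapse into one.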
Thanos
It is just hard to…
● Have a global view
● Have a HA in place
● Increase retention
Historical Metrics
What exactly happened
X weeks ago?
TSDB Layout
[Diagram: time-ordered blocks (Block 1 … Block 4) on a timeline from T-300 to T; each block holds chunk files and an index]
Data saving
[Diagram: the Sidecar uploads completed TSDB blocks from Prometheus’ local SSD to Object Storage]
Store
[Diagram: the Store gateway serves blocks from Object Storage through the Store API, keeping a local cache; the Querier talks to it like any other Store API endpoint]
Store
● A series is made up of one or more “chunks”
● A chunk contains ~120 samples
● Chunks can be retrieved through HTTP byte range queries
Example:
● 1000 series @ 30s scrape interval
● Query 1 year
→ 8.7 million chunks / range queries
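The 8.7 million figure follows directly from the numbers above:

```python
# Back-of-the-envelope check of the example above.
SCRAPE_INTERVAL_S = 30
SERIES = 1000
SAMPLES_PER_CHUNK = 120
YEAR_S = 365 * 24 * 3600  # 31,536,000 seconds

samples_per_series = YEAR_S // SCRAPE_INTERVAL_S             # 1,051,200
chunks_per_series = samples_per_series // SAMPLES_PER_CHUNK  # 8,760
total_chunks = chunks_per_series * SERIES                    # 8,760,000 ≈ 8.7M
```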
Store
Leverage Prometheus’ TSDB file layout:
● Chunks of the same series are aligned sequentially
● Similar series are aligned, e.g. due to the same metric name
Consolidating byte ranges in close proximity reduces the request count by 4-6 orders of magnitude: 8.7 million requests turn into O(20) requests.
Index lookups profit from a similar approach.
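The consolidation idea is plain interval merging over sorted byte offsets. A minimal sketch, not the actual Thanos code; `max_gap` is a hypothetical tuning knob for how much unneeded data we accept fetching in exchange for fewer requests:

```python
def consolidate(ranges, max_gap):
    """Merge (start, end) byte ranges whose gap is at most max_gap,
    so one HTTP range request covers many chunks."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Chunks of similar series sit near each other on disk, so they collapse:
reqs = consolidate([(0, 100), (120, 250), (10_000, 10_050)], max_gap=512)
# → [(0, 250), (10000, 10050)]: three chunk reads become two requests
```

Because the TSDB layout puts related chunks next to each other, millions of tiny ranges collapse into a handful of large sequential reads.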
Compaction
Density matters
Compaction
[Diagram: the Compactor downloads blocks from Object Storage to local disk, compacts many small blocks into one larger block, and uploads the result back]
Thanos
It is just hard to…
● Have a global view
● Have a HA in place
● Increase retention
Downsampling
Let’s just step back a little
Downsampling
Raw: 16 bytes/sample
Compressed: 1.07 bytes/sample
Downsampling
BUT…
Decompressing one sample takes 10-40 nanoseconds
● Times 1000 series @ 30s scrape interval
● Times 1 year
● Over 1 billion samples, i.e. 10-40s for decoding alone
● Plus your actual computation over all those samples, e.g. rate()
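The decode-time estimate checks out:

```python
# Rough check of the decoding-cost estimate above.
SERIES = 1000
SCRAPE_INTERVAL_S = 30
YEAR_S = 365 * 24 * 3600

samples = SERIES * (YEAR_S // SCRAPE_INTERVAL_S)  # 1,051,200,000 samples
decode_fast = samples * 10e-9   # ~10.5 s at 10 ns/sample
decode_slow = samples * 40e-9   # ~42 s at 40 ns/sample
```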
Downsampling
[Diagram: raw block → block @ 5m resolution (10x fewer samples) → block @ 1h resolution (another 12x fewer)]
Downsampling
[Diagram: each raw chunk is reduced to five aggregate chunks: count, sum, min, max, counter]
Downsampling
Which aggregates serve which PromQL queries:
● count → count_over_time(requests_total[1h])
● sum → sum_over_time(requests_total[1h])
● min → min(requests_total), min_over_time(requests_total[1h])
● max → max(requests_total), max_over_time(requests_total[1h])
● counter → rate(requests_total[1h]), increase(requests_total[1h])
● sum + count → avg, e.g. requests_total, avg(requests_total), ...
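Producing those aggregate chunks can be sketched as below. This is simplified: the real counter aggregate also accounts for counter resets within the window, while here we just keep the window's last raw value.

```python
def downsample_chunk(samples):
    """Reduce one window of raw (timestamp, value) samples to the
    five downsampling aggregates."""
    values = [v for _, v in samples]
    return {
        "count": len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "counter": values[-1],  # simplified: no counter-reset handling
    }

aggs = downsample_chunk([(0, 1.0), (30, 4.0), (60, 2.5)])
# count=3, sum=7.5, min=1.0, max=4.0, counter=2.5
```

Keeping all five per window is what lets the Querier answer each PromQL function from the matching aggregate instead of decoding every raw sample.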
Full Architecture
[Diagram: multiple Prometheus + Sidecar pairs and a Store gateway behind replicated Queriers; the Compactor works directly against the object storage Bucket]
Full Architecture
$ thanos sidecar …
$ thanos query …
$ thanos store …
$ thanos compact …
Deployment Models
[Diagram: Clusters A, B and C, each running its own Sidecar + Prometheus (S P) pairs, replicated Queriers, Store and Bucket]
Deployment Models
[Diagram: the same per-cluster setup, with each cluster’s Querier federated through the Store API]
Federation (through Store API)
Deployment Models
[Diagram: a single global Querier layer spanning all clusters; Clusters B and C expose only Sidecar + Prometheus pairs, Stores and Buckets]
Global Scale Thanos Cluster
Cost
● Store + Querier nodes ≈ savings on the Prometheus side (net ±0)
● Less SSD space needed on the Prometheus side (savings)
● Basically: just your data stored in S3/GCS/HDFS + requests
Cost
Example:
● 20 Prometheus servers, each ingesting 100k samples/sec with 500GB of local disk
● 20 x 250GB of new data per month + ~20% overhead for downsampling
● $1440/month for storage after 1 year (72TB of queryable data)
● $100/month for sustained 100 queries/sec against object storage
● $1530/month savings in local SSDs
Thanos Cost: $1540/month · Prometheus Savings: $1530/month
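The storage figure is consistent with the stated data volume; the ~$0.02/GB/month rate below is an assumption based on standard-class S3/GCS pricing at the time, not a number from the slides:

```python
# Sanity check of the $1440/month storage figure.
SERVERS = 20
NEW_DATA_GB_PER_MONTH = 250    # per server
DOWNSAMPLING_OVERHEAD = 1.2    # ~20% extra for downsampled blocks
PRICE_PER_GB_MONTH = 0.02      # assumed object storage rate, $/GB/month

after_one_year_gb = SERVERS * NEW_DATA_GB_PER_MONTH * 12 * DOWNSAMPLING_OVERHEAD
storage_cost = after_one_year_gb * PRICE_PER_GB_MONTH
# after_one_year_gb == 72,000 GB (72TB), storage_cost == $1,440/month
```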
Demo - retention
Demo - deduplication
Any questions?
github.com/improbable-eng/thanos
Fabian Reinartz
fabxc
Bartek Plotka
bwplotka Bplotka

Thanos: Global, durable Prometheus monitoring