Bartek Plotka
Bwplotka Bplotka
Fabian Reinartz
fabxc
Global, durable Prometheus monitoring
Prometheus 2.0
● Reliable operational model
● Powerful query language
● Scraping capabilities beyond casual usage
● Local metric storage
Prometheus at Scale
A dream
Improbable case
[Diagram: Kubernetes clusters 1, 2, …, n, n+1, each running its own Prometheus server; Grafana and Alertmanager in a separate cluster]
● Multiple isolated Kubernetes clusters
● Single Prometheus server per cluster
● Dashboards & Alertmanager in separate cluster
What is missing?
[Diagram: per-cluster Prometheus servers 1, 2, …, n, n+1]
Global View
See everything from a single
place!
Global View
[Diagram: Prometheus servers 1 … n+1 feeding Grafana and Alertmanager, plus a federating Prometheus]
“Alert when 66% of clusters in a region are down”
● How to aggregate data from different clusters?
○ Use hierarchical federation?
Problems: single point of failure · maintenance · which data are federated?
Availability
Where is my sample?
Availability
[Diagram: Prometheus servers 1 … n+1 feeding Grafana and Alertmanager]
Operator error · Hardware failure · Rollout
● Can we ensure no loss of metric data?
○ Add HA replicas?
Open questions: which replica should we query? · cost? · where to put rules and alerts?
Historical Metrics
What exactly happened
X weeks ago?
Metric retention
[Timeline: T-3 years … T-2 years … T-12 months … T]
“Go back to what happened 6 months ago...”
“Good progress! Memory allocs look better than 1.5 years ago!”
“Let’s see user traffic across years!”
Infrastructure retention: 9 days
Metric retention
● Can we have longer retention?
○ Upgrade to Prometheus 2.0
○ Scale SSDs vertically?
[Diagram: Prometheus with an ever-growing local SSD]
Backup? · Maintenance · Cost
Recap
[Diagram: per-cluster Prometheus servers 1 … n+1 with Grafana, Alertmanager and a federating Prometheus]
It is just hard to…
● Have a global view
● Have a HA in place
● Increase retention
Thanos
Additional Goals
● Seamless integration with Prometheus
● Easy deployment model
● Minimal number of dependencies
● Minimal baseline cost
Global View
See everything from a single
place!
Sidecar
[Diagram: Prometheus scraping Targets, storing to local SSD; a Sidecar alongside exposes gRPC (Store API)]
Store API
service Store {
  rpc Series(SeriesRequest) returns (stream SeriesResponse);
  rpc LabelNames(LabelNamesRequest) returns (LabelNamesResponse);
  rpc LabelValues(LabelValuesRequest) returns (LabelValuesResponse);
}

message SeriesRequest {
  int64 min_time = 1;
  int64 max_time = 2;
  repeated LabelMatcher matchers = 3;
}
Sidecar
[Diagram: the Sidecar serves the Store API by proxying requests to Prometheus via remote read]
Querier
[Diagram: Querier exposes the Prometheus HTTP Query API and fans out to the Prometheus + Sidecar pair via the Store API]
Global View
[Diagram: one Querier merges series from multiple Prometheus + Sidecar pairs via the Store API]
Global View + Availability
[Diagram: two Prometheus replicas, labeled “replica”:”1” and “replica”:”2”, each with a Sidecar; the Querier merges and deduplicates their series via the Store API]
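The merge + deduplicate step can be sketched as follows. This is a deliberate simplification: Thanos' real deduplication uses a penalty-based algorithm to handle replicas with gaps, and the replica label name is configurable; here we just group by labels minus the replica label and keep one sample per timestamp.

```python
def deduplicate(series_list, replica_label="replica"):
    """Merge HA replica series: group by labels minus the replica
    label, keep one sample per timestamp (first replica seen wins)."""
    merged = {}
    for labels, samples in series_list:
        key = tuple(sorted(
            (k, v) for k, v in labels.items() if k != replica_label))
        bucket = merged.setdefault(key, {})
        for ts, value in samples:
            bucket.setdefault(ts, value)  # on overlap, keep existing sample
    return {k: sorted(b.items()) for k, b in merged.items()}

# Two replicas of the same series, with one overlapping sample:
r1 = ({"__name__": "up", "replica": "1"}, [(0, 1.0), (30, 1.0)])
r2 = ({"__name__": "up", "replica": "2"}, [(30, 1.0), (60, 1.0)])
result = deduplicate([r1, r2])
# → one series covering timestamps 0, 30 and 60
```

This is why the replica label must differ between HA pairs but match on every other label: only then do the two series collapse into one.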
Thanos
It is just hard to…
● Have a global view
● Have a HA in place
● Increase retention
Historical Metrics
What exactly happened
X weeks ago?
TSDB Layout
[Diagram: time-ordered blocks (Block 1 … Block 4) on a timeline from T-300 to T; each block holds chunk files and an index]
Data saving
[Diagram: the Sidecar uploads completed TSDB blocks from Prometheus’ local SSD to Object Storage]
Store
[Diagram: the Store gateway serves blocks from Object Storage through the Store API, keeping a local cache; the Querier talks to it like any other Store API endpoint]
Store
● A series is made up of one or more “chunks”
● A chunk contains ~120 samples
● Chunks can be retrieved through HTTP byte range queries
Example:
● 1000 series @ 30s scrape interval
● Query 1 year
→ 8.7 million chunks / range queries
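The 8.7 million figure follows directly from the numbers above:

```python
# Back-of-the-envelope check of the example above.
SCRAPE_INTERVAL_S = 30
SERIES = 1000
SAMPLES_PER_CHUNK = 120
YEAR_S = 365 * 24 * 3600  # 31,536,000 seconds

samples_per_series = YEAR_S // SCRAPE_INTERVAL_S             # 1,051,200
chunks_per_series = samples_per_series // SAMPLES_PER_CHUNK  # 8,760
total_chunks = chunks_per_series * SERIES                    # 8,760,000 ≈ 8.7M
```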
Store
Leverage Prometheus’ TSDB file layout:
● Chunks of the same series are aligned sequentially
● Similar series are aligned, e.g. due to the same metric name
Consolidating byte ranges in close proximity reduces the request count by 4-6 orders of magnitude: 8.7 million requests turn into O(20) requests.
Index lookups profit from a similar approach.
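The consolidation idea is plain interval merging over sorted byte offsets. A minimal sketch, not the actual Thanos code; `max_gap` is a hypothetical tuning knob for how much unneeded data we accept fetching in exchange for fewer requests:

```python
def consolidate(ranges, max_gap):
    """Merge (start, end) byte ranges whose gap is at most max_gap,
    so one HTTP range request covers many chunks."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Chunks of similar series sit near each other on disk, so they collapse:
reqs = consolidate([(0, 100), (120, 250), (10_000, 10_050)], max_gap=512)
# → [(0, 250), (10000, 10050)]: three chunk reads become two requests
```

Because the TSDB layout puts related chunks next to each other, millions of tiny ranges collapse into a handful of large sequential reads.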
Compaction
Density matters
Compaction
[Diagram: the Compactor downloads blocks from Object Storage to local disk, compacts many small blocks into one larger block, and uploads the result back]
Thanos
It is just hard to…
● Have a global view
● Have a HA in place
● Increase retention
Downsampling
Let’s just step back a little
Downsampling
Raw: 16 bytes/sample
Compressed: 1.07 bytes/sample
Downsampling
BUT…
Decompressing one sample takes 10-40 nanoseconds
● Times 1000 series @ 30s scrape interval
● Times 1 year
● Over 1 billion samples, i.e. 10-40s for decoding alone
● Plus your actual computation over all those samples, e.g. rate()
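The decode-time estimate checks out:

```python
# Rough check of the decoding-cost estimate above.
SERIES = 1000
SCRAPE_INTERVAL_S = 30
YEAR_S = 365 * 24 * 3600

samples = SERIES * (YEAR_S // SCRAPE_INTERVAL_S)  # 1,051,200,000 samples
decode_fast = samples * 10e-9   # ~10.5 s at 10 ns/sample
decode_slow = samples * 40e-9   # ~42 s at 40 ns/sample
```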
Downsampling
[Diagram: raw block → block @ 5m resolution (10x fewer samples) → block @ 1h resolution (another 12x fewer)]
Downsampling
[Diagram: each raw chunk is reduced to five aggregate chunks: count, sum, min, max, counter]
Downsampling
Which aggregates serve which PromQL queries:
● count → count_over_time(requests_total[1h])
● sum → sum_over_time(requests_total[1h])
● min → min(requests_total), min_over_time(requests_total[1h])
● max → max(requests_total), max_over_time(requests_total[1h])
● counter → rate(requests_total[1h]), increase(requests_total[1h])
● sum + count → avg, e.g. requests_total, avg(requests_total), ...
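Producing those aggregate chunks can be sketched as below. This is simplified: the real counter aggregate also accounts for counter resets within the window, while here we just keep the window's last raw value.

```python
def downsample_chunk(samples):
    """Reduce one window of raw (timestamp, value) samples to the
    five downsampling aggregates."""
    values = [v for _, v in samples]
    return {
        "count": len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "counter": values[-1],  # simplified: no counter-reset handling
    }

aggs = downsample_chunk([(0, 1.0), (30, 4.0), (60, 2.5)])
# count=3, sum=7.5, min=1.0, max=4.0, counter=2.5
```

Keeping all five per window is what lets the Querier answer each PromQL function from the matching aggregate instead of decoding every raw sample.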
Full Architecture
[Diagram: multiple Prometheus + Sidecar pairs and a Store gateway behind replicated Queriers; the Compactor works directly against the object storage Bucket]
Full Architecture
$ thanos sidecar …
$ thanos query …
$ thanos store …
$ thanos compact …
Deployment Models
[Diagram: Clusters A, B and C, each running its own Sidecar + Prometheus (S P) pairs, replicated Queriers, Store and Bucket]
Deployment Models
[Diagram: the same per-cluster setup, with each cluster’s Querier federated through the Store API]
Federation (through Store API)
Deployment Models
[Diagram: a single global Querier layer spanning all clusters; Clusters B and C expose only Sidecar + Prometheus pairs, Stores and Buckets]
Global Scale Thanos Cluster
Cost
● Store + Querier nodes ≈ savings on the Prometheus side (net ±0)
● Less SSD space needed on the Prometheus side (savings)
● Basically: just your data stored in S3/GCS/HDFS + requests
Cost
Example:
● 20 Prometheus servers, each ingesting 100k samples/sec with 500GB of local disk
● 20 x 250GB of new data per month + ~20% overhead for downsampling
● $1440/month for storage after 1 year (72TB of queryable data)
● $100/month for sustained 100 queries/sec against object storage
● $1530/month savings in local SSDs
Thanos Cost: $1540/month · Prometheus Savings: $1530/month
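The storage figure is consistent with the stated data volume; the ~$0.02/GB/month rate below is an assumption based on standard-class S3/GCS pricing at the time, not a number from the slides:

```python
# Sanity check of the $1440/month storage figure.
SERVERS = 20
NEW_DATA_GB_PER_MONTH = 250    # per server
DOWNSAMPLING_OVERHEAD = 1.2    # ~20% extra for downsampled blocks
PRICE_PER_GB_MONTH = 0.02      # assumed object storage rate, $/GB/month

after_one_year_gb = SERVERS * NEW_DATA_GB_PER_MONTH * 12 * DOWNSAMPLING_OVERHEAD
storage_cost = after_one_year_gb * PRICE_PER_GB_MONTH
# after_one_year_gb == 72,000 GB (72TB), storage_cost == $1,440/month
```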
Demo - retention
Demo - deduplication
Any questions?
github.com/improbable-eng/thanos
Fabian Reinartz
fabxc
Bartek Plotka
bwplotka Bplotka

Thanos: Global, durable Prometheus monitoring