MegaEase
How Prometheus Stores the Data
Hao Chen
MegaEase
Self Introduction
l 20+ years of working experience in large-scale distributed system architecture and development. Familiar with Cloud Native computing and high-concurrency / high-availability architecture solutions.
l Working Experiences
l MegaEase – Cloud Native Software products as Founder
l Alibaba – AliCloud, Tmall as principal software engineer.
l Amazon – Amazon.com as senior software manager.
l Thomson Reuters – Real-time system software development Manager.
l IBM Platform – Distributed computing system as software engineer.
Weibo: @左耳朵耗子
Twitter: @haoel
Blog: http://coolshell.cn/
MegaEase
Understanding Time Series
MegaEase
Understanding Time Series Data
l Data scheme
l identifier -> (t0, v0), (t1, v1), (t2, v2), (t3, v3), ....
l Prometheus Data Model
l <metric name>{<label name>=<label value>, ...}
l Typical set of series identifiers
l {__name__=“requests_total”, path=“/status”, method=“GET”, instance=”10.0.0.1:80”} @1434317560938 94355
l {__name__=“requests_total”, path=“/status”, method=“POST”, instance=”10.0.0.3:80”} @1434317561287 94934
l {__name__=“requests_total”, path=“/”, method=“GET”, instance=”10.0.0.2:80”} @1434317562344 96483
l Query
l __name__=“requests_total” - selects all series belonging to the requests_total metric.
l method=“PUT|POST” - selects all series whose method is PUT or POST
Metric Name Labels Timestamp Sample Value
Key - Series Value - Sample
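A minimal Go sketch of this data model, assuming nothing about Prometheus' internal types: a series is keyed by its label set and holds an ordered list of (timestamp, value) samples.

// Illustrative only -- not Prometheus' actual types.
package main

import "fmt"

type Labels map[string]string // e.g. {__name__="requests_total", method="GET"}

type Sample struct {
	Timestamp int64   // milliseconds since epoch
	Value     float64 // the sample value, e.g. a counter reading
}

type Series struct {
	Labels  Labels
	Samples []Sample // append-only, ordered by Timestamp
}

func main() {
	s := Series{
		Labels: Labels{"__name__": "requests_total", "path": "/status", "method": "GET"},
	}
	s.Samples = append(s.Samples, Sample{Timestamp: 1434317560938, Value: 94355})
	fmt.Println(s.Labels["__name__"], s.Samples[0])
}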
MegaEase
2D Data Plane
l Write
l Completely vertical and highly concurrent as samples from each target are ingested independently
l Query
l Data retrieval can be parallelized and batched
series
^
│ . . . . . . . . . . . . . . . . . . . . . . {name=“request_total”, method=“GET”}
│ . . . . . . . . . . . . . . . . . . . . . . {name=“request_total”, method=“POST”}
│ . . . . . . .
│ . . . . . . . . . . . . . . . . . . . ...
│ . . . . . . . . . . . . . . . . . . . . .
│ . . . . . . . . . . . . . . . . . . . . . {name=“errors_total”, method=“POST”}
│ . . . . . . . . . . . . . . . . . {name=“errors_total”, method=“GET”}
│ . . . . . . . . . . . . . .
│ . . . . . . . . . . . . . . . . . . . ...
│ . . . . . . . . . . . . . . . . . . . .
v
<-------------------- time --------------------->
MegaEase
The Fundamental Problem
l Storage problem
l HDD – physically spinning disks are slow at random access
l SSD – write amplification on small writes
l Query is much more complicated than write
l A time series query can cause random reads.
l Ideal Write
l Sequential writes
l Batched writes
l Ideal Read
l Samples of the same time series should be read sequentially
MegaEase
Prometheus Solution
(v1.x - “V2”)
MegaEase
Prometheus Solution (v1.x “V2”)
l One file per time series
l Batch up 1KiB chunks in memory
┌──────────┬─────────┬─────────┬─────────┬─────────┐ series A
└──────────┴─────────┴─────────┴─────────┴─────────┘
┌──────────┬─────────┬─────────┬─────────┬─────────┐ series B
└──────────┴─────────┴─────────┴─────────┴─────────┘
. . .
┌──────────┬─────────┬─────────┬─────────┬─────────┬─────────┐ series XYZ
└──────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
chunk 1 chunk 2 chunk 3 ...
l Dark Sides
l Chunks are held in memory; they can be lost if the application or node crashes.
l With several million files, the filesystem runs out of inodes.
l With several thousand chunks to persist at once, the disk I/O becomes very busy.
l Keeping that many files open for I/O causes very high latency.
l Cleaning up old data causes SSD write amplification.
l Very high CPU/memory/disk resource consumption.
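To make the dark sides above concrete, here is a hedged Go sketch of the V2-style idea only (not Prometheus code): each series gets its own file, and samples are batched into ~1 KiB in-memory chunks that are flushed only when full, so millions of series mean millions of files and an in-memory tail that is lost on a crash.

package main

import (
	"encoding/binary"
	"log"
	"math"
	"os"
)

const chunkSize = 1024 // 1 KiB chunk, as in the old storage

// seriesWriter batches samples for ONE series in memory and appends full
// chunks to that series' own file (one file per series).
type seriesWriter struct {
	file *os.File
	buf  []byte // in-memory chunk being filled; lost if the process crashes
}

func (w *seriesWriter) append(ts int64, v float64) error {
	var rec [16]byte
	binary.LittleEndian.PutUint64(rec[0:8], uint64(ts))
	binary.LittleEndian.PutUint64(rec[8:16], math.Float64bits(v))
	w.buf = append(w.buf, rec[:]...)
	if len(w.buf) >= chunkSize { // chunk full: one write per series per ~64 samples
		if _, err := w.file.Write(w.buf); err != nil {
			return err
		}
		w.buf = w.buf[:0]
	}
	return nil
}

func main() {
	f, err := os.CreateTemp("", "series-A-*.chunks")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	w := &seriesWriter{file: f}
	for i := 0; i < 200; i++ {
		if err := w.append(int64(i)*15000, float64(i)); err != nil {
			log.Fatal(err)
		}
	}
}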
MegaEase
Series Churn
l Definition
l Some time series become INACTIVE
l Some time series become ACTIVE
l Reasons
l Rolling updates of a number of microservices
l Kubernetes scaling the services
series
^
│ . . . . . .
│ . . . . . .
│ . . . . . .
│ . . . . . . .
│ . . . . . . .
│ . . . . . . .
│ . . . . . .
│ . . . . . .
│ . . . . .
│ . . . . .
│ . . . . .
v
<-------------------- time --------------------->
MegaEase
New Prometheus Design
(v2.x - “V3”)
MegaEase
Fundamental Design – V3
l Storage Layout
l 01XXXXXXX… is a data block directory
l ULID – like a UUID, but lexicographically sortable and encoding the creation time
l chunks directory
l contains the raw chunks of data points for the various series (like “V2”)
l no longer a single file per series
l index – the index of the data
l lots of black magic to find the data by labels.
l meta.json – human-readable metadata (see the sketch after the file tree below)
l describes the state of our storage and the data it contains
l tombstones
l deleted data is recorded in this file instead of being removed from the chunk files
l wal – Write-Ahead Log
l the WAL segments are truncated into a “checkpoint.X” directory
l chunks_head – head chunks, memory-mapped from disk
l Notes
l Data is persisted to disk every 2 hours
l The WAL is used for data recovery.
l 2-hour blocks make range queries efficient
$ tree ./data
./data
├── 01BKGV7JBM69T2G1BGBGM6KB12
│ ├── chunks
│ │ ├── 000001
│ │ ├── 000002
│ │ └── 000003
│ ├── index
│ └── meta.json
├── 01BKGTZQ1SYQJTR4PB43C8PD98
│ ├── chunks
│ │ └── 000001
│ ├── index
│ └── meta.json
├── 01BKGTZQ1HHWHV8FBJXW1Y3W0K
│ ├── chunks
│ │ └── 000001
│ ├── index
│ ├── tombstones
│ └── meta.json
├── chunks_head
│ └── 000001
└── wal
├── 000000003
└── checkpoint.00000002
├── 00000000
└── 00000001
https://github.com/prometheus/prometheus/blob/release-2.25/tsdb/docs/format/README.md
File Format
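As an illustration of this layout, the sketch below reads a block's meta.json. The field names (ulid, minTime, maxTime, stats, compaction, version) follow the TSDB format docs linked above, but the code itself is an illustrative example, not part of Prometheus.

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
	"path/filepath"
)

type BlockMeta struct {
	ULID    string `json:"ulid"`
	MinTime int64  `json:"minTime"` // inclusive, milliseconds
	MaxTime int64  `json:"maxTime"` // exclusive, milliseconds
	Stats   struct {
		NumSamples uint64 `json:"numSamples"`
		NumSeries  uint64 `json:"numSeries"`
		NumChunks  uint64 `json:"numChunks"`
	} `json:"stats"`
	Compaction struct {
		Level   int      `json:"level"`
		Sources []string `json:"sources"` // ULIDs of the blocks this one was compacted from
	} `json:"compaction"`
	Version int `json:"version"`
}

func readMeta(blockDir string) (*BlockMeta, error) {
	b, err := os.ReadFile(filepath.Join(blockDir, "meta.json"))
	if err != nil {
		return nil, err
	}
	var m BlockMeta
	if err := json.Unmarshal(b, &m); err != nil {
		return nil, err
	}
	return &m, nil
}

func main() {
	m, err := readMeta("./data/01BKGV7JBM69T2G1BGBGM6KB12")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("block %s covers [%d, %d)\n", m.ULID, m.MinTime, m.MaxTime)
}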
MegaEase
Blocks – Little Database
l Partition the data into non-overlapping blocks
l Each block acts as a fully independent database
l Containing all time series data for its time window
l it has its own index and set of chunk files.
l Every completed block of data is immutable
l Only the current (head) block can still be appended to
l All new data is written to an in-memory database
l To prevent data loss, a write-ahead log (WAL) is also written.
t0 t1 t2 t3 now
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌────────────┐
│ │ │ │ │ │ │ │ ┌────────────┐
│ block │ │ block │ │ block │ │ chunk_head │ <─── write ────┤ Prometheus │
│ │ │ │ │ │ │ │ └────────────┘
└───────────┘ └───────────┘ └───────────┘ └────────────┘ ^
└──────────────┴───────┬──────┴──────────────┘ │
│ query
│ │
merge ─────────────────────────────────────────────────┘
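A minimal Go sketch of the query path in the diagram (not Prometheus' query engine): keep only the blocks whose time window overlaps the query range, query each independently, and merge the partial results.

package main

import "fmt"

type Block struct {
	MinTime, MaxTime int64 // time window covered by the block, in milliseconds
}

// overlaps reports whether the block intersects the query range [mint, maxt).
func (b Block) overlaps(mint, maxt int64) bool {
	return b.MinTime < maxt && mint < b.MaxTime
}

func blocksForRange(blocks []Block, mint, maxt int64) []Block {
	var out []Block
	for _, b := range blocks {
		if b.overlaps(mint, maxt) {
			out = append(out, b) // every other block is skipped entirely
		}
	}
	return out
}

func main() {
	blocks := []Block{{0, 7200000}, {7200000, 14400000}, {14400000, 21600000}}
	fmt.Println(blocksForRange(blocks, 7000000, 8000000)) // overlaps the first two blocks
}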
MegaEase
Tree Concept
Block 1 Block 2 Block 3 Block 4 Block N
chunk1 chunk2 chunk3
time
MegaEase
New Design’s Benefits
l Good for querying a time range
l we can easily ignore all data blocks outside of this range.
l It trivially addresses the problem of series churn by reducing the set of inspected data to begin with
l Good for disk writes
l When completing a block, we can persist the data from our in-memory database by sequentially writing just
a handful of larger files.
l Keep the good property of V2 that recent chunks
l which are queried most, are always hot in memory.
l Flexible for chunk size
l We can pick any size that makes the most sense for the individual data points and chosen compression
format.
l Deleting old data becomes extremely cheap and instantaneous.
l We merely have to delete a single directory. Remember, in the old storage we had to analyze and re-write
up to hundreds of millions of files, which could take hours to converge.
MegaEase
Chunk-head
l A head chunk is cut when it
l fills up to 120 samples, or
l spans the chunk range (2 hours by default)
l Since Prometheus v2.19
l not all chunks are stored in memory
l When a chunk is cut, it is flushed to disk and memory-mapped (mmap); see the sketch below
https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/
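A hedged Go sketch of the cutting rule described above; the constants 120 samples and 2 hours come from the slide and the linked blog post, while the type and helper names are made up for illustration.

package main

import "fmt"

const (
	samplesPerChunk  = 120
	chunkRangeMillis = 2 * 60 * 60 * 1000 // 2 hours
)

type headChunk struct {
	minTime    int64 // timestamp of the first sample in this chunk
	numSamples int
}

// needsCut reports whether appending a sample at time ts should first cut
// (close and mmap to disk) the current head chunk and start a new one.
func (c *headChunk) needsCut(ts int64) bool {
	return c.numSamples >= samplesPerChunk || ts >= c.minTime+chunkRangeMillis
}

func main() {
	c := &headChunk{minTime: 0, numSamples: 119}
	fmt.Println(c.needsCut(60_000)) // false: the 120th sample still fits
	c.numSamples++
	fmt.Println(c.needsCut(75_000)) // true: 120 samples reached, cut the chunk
}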
MegaEase
Chunk head → Block
https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/
l After some time, the chunks reach the threshold
l When the chunks span 3 hrs
l the first 2 hrs of chunks (1, 2, 3, 4) are compacted into a block
l Meanwhile
l the WAL is truncated at this point
l and a “checkpoint” is created!
MegaEase
Large file with “mmap”
l mmap stands for memory-mapped files. It is a
way to read and write files without invoking
system calls.
l It is great when multiple processes access data from the same file in a read-only fashion
l It allows all those processes to share the same physical memory pages, saving a lot of memory.
l It also allows the operating system to optimize paging operations.
[Diagram: a user process reads a file either through read/write system calls via the kernel page cache, through Direct I/O straight to the device, or by mmap-ing the page-cache pages into user space]
Why mmap is faster than system calls
https://sasha-f.medium.com/why-mmap-is-faster-than-system-calls-24718e75ab37
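A small, self-contained Go example (Linux/macOS) of mapping a file with syscall.Mmap instead of issuing read() calls; the file path is just a placeholder.

package main

import (
	"fmt"
	"log"
	"os"
	"syscall"
)

func main() {
	f, err := os.Open("/etc/hostname") // any small, readable, non-empty file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}

	// Map the whole file read-only; no read() system calls are needed afterwards,
	// and other processes mapping the same file share the same page-cache pages.
	data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		log.Fatal(err)
	}
	defer syscall.Munmap(data)

	fmt.Printf("%d bytes mapped: %q\n", len(data), string(data))
}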
MegaEase
Write-Ahead Log(WAL)
l widely used in relational databases to provide durability (D from ACID)
l Persisting every state change as a command to the append only log.
https://martinfowler.com/articles/patterns-of-distributed-systems/wal.html
l Store each state change as a command (a minimal sketch follows this list)
l A single log is appended sequentially
l Each log entry is given a unique identifier
l Roll the log into segments (Segmented Log)
l Clean the log with a Low-Water Mark
l Snapshot-based (ZooKeeper & etcd)
l Time-based (Kafka)
l Support Singular Update Queue
l A work queue
l A single thread
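A minimal Go sketch of a segmented, append-only WAL in the spirit of the patterns above; the segment size, record framing, and names are illustrative assumptions, not Prometheus' WAL format.

package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"os"
	"path/filepath"
)

const segmentSize = 128 * 1024 * 1024 // roll to a new segment at 128 MiB (arbitrary here)

type WAL struct {
	dir     string
	seg     *os.File
	segNum  int
	written int64
}

// Append writes one length-prefixed record to the current segment,
// rolling to a new numbered segment when the current one is full.
func (w *WAL) Append(record []byte) error {
	if w.seg == nil || w.written+int64(len(record))+4 > segmentSize {
		if err := w.roll(); err != nil {
			return err
		}
	}
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(record)))
	if _, err := w.seg.Write(hdr[:]); err != nil {
		return err
	}
	if _, err := w.seg.Write(record); err != nil {
		return err
	}
	w.written += int64(len(record)) + 4
	return w.seg.Sync() // durability: fsync before acknowledging the write
}

func (w *WAL) roll() error {
	if w.seg != nil {
		w.seg.Close()
		w.segNum++
	}
	name := filepath.Join(w.dir, fmt.Sprintf("%08d", w.segNum))
	f, err := os.OpenFile(name, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	w.seg, w.written = f, 0
	return nil
}

func main() {
	dir, _ := os.MkdirTemp("", "wal")
	w := &WAL{dir: dir}
	if err := w.Append([]byte(`sample{series_ref=5} 1434317560938 94355`)); err != nil {
		log.Fatal(err)
	}
}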
MegaEase
Prometheus WAL & Checkpoint
l WAL Records - include the Series and their corresponding Samples.
l The Series record is written only once when we see it for the first time
l The Samples record is written for all write requests that contain a sample.
l WAL Truncation - Checkpoints
l Drops all the series records for series which are no longer in the Head.
l Drops all the samples which are before time T.
l Drops all the tombstone records for time ranges before T.
l Retains the remaining series, samples and tombstone records in the same order as they appear in the WAL.
l WAL Replay
l Replaying the “checkpoint.X”
l Replaying the WAL X+1, X+2,… X+N
l WAL Compression
l The WAL records are compressed with Snappy, a lightweight compression
l Snappy was developed by Google and is based on LZ77
l It aims for very high speeds and reasonable compression, not maximum compression or compatibility.
l It is widely used by many databases – Cassandra, Couchbase, Hadoop, LevelDB, MongoDB, InfluxDB, ...
Source Code : https://github.com/prometheus/prometheus/tree/master/tsdb/wal
data
└── wal
├── 000000
├── 000001
├── 000002
├── 000003
├── 000004
└── 000005
data
└── wal
├── checkpoint.000003
| ├── 000000
| └── 000001
├── 000004
└── 000005
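An illustrative Go sketch of the replay order described above: replay the latest checkpoint.X directory first, then the WAL segments numbered above X, in order. The directory layout and name parsing are simplified assumptions.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strconv"
	"strings"
)

func replayOrder(walDir string) ([]string, error) {
	entries, err := os.ReadDir(walDir)
	if err != nil {
		return nil, err
	}
	checkpoint, checkpointIdx := "", -1
	var segments []int
	for _, e := range entries {
		name := e.Name()
		if strings.HasPrefix(name, "checkpoint.") {
			if n, err := strconv.Atoi(strings.TrimPrefix(name, "checkpoint.")); err == nil && n > checkpointIdx {
				checkpoint, checkpointIdx = name, n // remember the latest checkpoint
			}
			continue
		}
		if n, err := strconv.Atoi(name); err == nil {
			segments = append(segments, n)
		}
	}
	sort.Ints(segments)

	var order []string
	if checkpoint != "" {
		order = append(order, filepath.Join(walDir, checkpoint)) // replayed first
	}
	for _, n := range segments {
		if n > checkpointIdx { // only segments after the checkpoint are replayed
			order = append(order, filepath.Join(walDir, fmt.Sprintf("%06d", n))) // width is illustrative
		}
	}
	return order, nil
}

func main() {
	order, err := replayOrder("./data/wal")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, p := range order {
		fmt.Println("replay:", p)
	}
}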
MegaEase
Block Compaction
l Problem
l When querying multiple blocks, we have to merge their results into an overall result.
l A week-long query would have to merge 80+ partial blocks.
l Compaction
t0 t1 t2 t3 t4 now
┌────────────┐ ┌──────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ 1 │ │ 2 │ │ 3 │ │ 4 │ │ 5 mutable │ before
└────────────┘ └──────────┘ └───────────┘ └───────────┘ └───────────┘
┌─────────────────────────────────────────┐ ┌───────────┐ ┌───────────┐
│ 1 compacted │ │ 4 │ │ 5 mutable │ after (option A)
└─────────────────────────────────────────┘ └───────────┘ └───────────┘
┌──────────────────────────┐ ┌──────────────────────────┐ ┌───────────┐
│ 1 compacted │ │ 3 compacted │ │ 5 mutable │ after (option B)
└──────────────────────────┘ └──────────────────────────┘ └───────────┘
MegaEase
Retention
l Example
l Block 1 can be deleted safely; block 2 has to be kept until it is fully behind the boundary.
l Block Compaction impacts
l Block compaction could make a block so large that it never falls fully behind the retention boundary.
l We need to limit the block size (see the sketch after the diagram below):
Maximum block size = 10% * retention window.
|
┌────────────┐ ┌────┼─────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ 1 │ │ 2 | │ │ 3 │ │ 4 │ │ 5 │ . . .
└────────────┘ └────┼─────┘ └───────────┘ └───────────┘ └───────────┘
|
|
retention boundary
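A minimal Go sketch of the retention rule, assuming millisecond timestamps: a block is deletable only when it is entirely behind the retention boundary, and the block size cap is 10% of the retention window.

package main

import "fmt"

type Block struct {
	MinTime, MaxTime int64 // block time window in milliseconds
}

// deletable reports whether the block lies entirely behind the retention boundary.
func deletable(b Block, now, retention int64) bool {
	boundary := now - retention
	return b.MaxTime <= boundary // fully behind the boundary (block 1, not block 2)
}

// maxBlockSize applies "maximum block size = 10% * retention window".
func maxBlockSize(retention int64) int64 {
	return retention / 10
}

func main() {
	const day = 24 * 60 * 60 * 1000
	retention := int64(15 * day)
	now := int64(100 * day)
	fmt.Println(deletable(Block{80 * day, 82 * day}, now, retention)) // true: fully behind
	fmt.Println(deletable(Block{84 * day, 86 * day}, now, retention)) // false: straddles the boundary
	fmt.Println(maxBlockSize(retention)/(60*60*1000), "hours max block span") // 36 hours
}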
MegaEase
V2 – Chunk Query
MegaEase
V3 - Block Query
MegaEase
V3 - Compaction
MegaEase
V3 - Retention
MegaEase
Index
l Use an inverted index for the label index
l Allocate a unique ID for every series
l Looking up a series by this ID is O(1)
l This ID is the forward index.
l Construct the labels’ index
l If the series with IDs {2, 5, 10, 29} contain the label app=“nginx”
l then the list {2, 5, 10, 29} is the inverted index for the label app=“nginx”
l In short
l The number of labels is significantly smaller than the number of series.
l Walking through all of the labels is not a problem (see the sketch after the example below).
{
__name__=”requests_total”,
pod=”nginx-34534242-abc723”,
job=”nginx”,
path=”/api/v1/status”,
status=”200”,
method=”GET”,
}
status=”200”: 1 2 5 ...
method=”GET”: 2 3 4 5 6 9 ...
ID : 5
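A small Go sketch of the inverted-index idea (concept only, not Prometheus' on-disk index format): series IDs form the forward index, and each label=value pair maps to a sorted posting list of series IDs.

package main

import (
	"fmt"
	"sort"
)

type label struct{ name, value string }

type index struct {
	postings map[label][]uint64 // label pair -> sorted list of series IDs
}

// add registers a series ID under every label pair it carries.
func (ix *index) add(id uint64, labels map[string]string) {
	for n, v := range labels {
		l := label{n, v}
		ix.postings[l] = append(ix.postings[l], id)
		sort.Slice(ix.postings[l], func(i, j int) bool { return ix.postings[l][i] < ix.postings[l][j] })
	}
}

func main() {
	ix := &index{postings: map[label][]uint64{}}
	ix.add(5, map[string]string{"__name__": "requests_total", "method": "GET", "status": "200"})
	ix.add(2, map[string]string{"__name__": "requests_total", "method": "GET"})
	fmt.Println(ix.postings[label{"method", "GET"}]) // [2 5]
}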
MegaEase
Sets Operation
l Considering we have the following query:
l app=“foo” AND __name__=“requests_total”
l How do we intersect two inverted index lists?
l General Algorithm Interview Question
l Given two integer arrays, return their intersection.
l A[] = { 4, 1, 6, 7, 3, 2, 9 }
l B[] = { 11, 30, 2, 70, 9 }
l return { 2, 9 } as their intersection
l Given two integer arrays, return their union.
l A[] = { 4, 1, 6, 7, 3, 2, 9 }
l B[] = { 11, 30, 2, 70, 9 }
l return { 4, 1, 6, 7, 3, 2, 9, 11, 30, 70 } as their union
l Time: O(m*n) - no extra space
MegaEase
Sort The Array
l If we sort the array
__name__="requests_total" -> [ 999, 1000, 1001, 2000000, 2000001, 2000002, 2000003 ]
app="foo" -> [ 1, 3, 10, 11, 12, 100, 311, 320, 1000, 1001, 10002 ]
intersection => [ 1000, 1001 ]
l We can use an efficient algorithm
l O(m+n): two pointers, one for each array.
for idx1 < len1 && idx2 < len2 {
	if a[idx1] > b[idx2] {
		idx2++
	} else if a[idx1] < b[idx2] {
		idx1++
	} else {
		c = append(c, a[idx1]) // equal: take the value and advance both pointers
		idx1++
		idx2++
	}
}
return c
l Series IDs must be easy to sort; using MD5 or UUID is not a good idea (V2 used hash IDs)
l Deleting data could force the index to be rebuilt.
MegaEase
Benchmark
(v1.5.2 vs v2.0)
MegaEase
Benchmark – Memory
l Heap memory usage in GB
l Prometheus 2.0’s memory consumption is reduced by 3-4x
MegaEase
Benchmark – CPU
l CPU usage in cores/second
l Prometheus 2.0 needs 3-10 times fewer CPU resources.
MegaEase
Benchmark – Disk Writes
l Disk writes in MB/second
l Prometheus 2.0 saves 97-99% of disk writes.
l Prometheus 1.5 is prone to wearing out SSDs
MegaEase
Benchmark – Query Latency
l Query P99 latency in seconds
l With Prometheus 1.5, query latency increases over time as more series are stored.
MegaEase
Facebook Paper
Gorilla: A fast, scalable, in-memory time series database
TimeScale
MegaEase
Gorilla Requirements
l 2 billion unique time series identified by a string key.
l 700 million data points (time stamp and value) added per minute.
l Store data for 26 hours.
l More than 40,000 queries per second at peak.
l Reads succeed in under one millisecond.
l Support time series with 15 second granularity (4 points per minute per time series).
l Two in-memory, not co-located replicas (for disaster recovery capacity).
l Always serve reads even when a single server crashes.
l Ability to quickly scan over all in memory data.
l Support at least 2x growth per year.
85% Queries for latest 26 hours data
MegaEase
Key Technology
l Simple Data Model – (string key, int64 timestamp, double value)
l In memory – low latency
l High Data Compression Ratio – saves 90% of space
l Cache first, then Disk – accepts possible data loss
l Stateless – easy to scale
l Hash(key) → Shard → Node (see the sketch below)
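A tiny Go sketch of the Hash(key) → Shard → Node routing step; the node addresses and the FNV hash choice are illustrative assumptions.

package main

import (
	"fmt"
	"hash/fnv"
)

var nodes = []string{"node-0:7000", "node-1:7000", "node-2:7000"} // placeholder addresses

// nodeFor maps a series key to the node that owns it: hash(key) -> shard -> node.
func nodeFor(key string) string {
	h := fnv.New64a()
	h.Write([]byte(key))
	shard := h.Sum64() % uint64(len(nodes))
	return nodes[shard]
}

func main() {
	fmt.Println(nodeFor(`requests_total{method="GET"}`))
}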
MegaEase
Fundamental
l Delta Encoding (aka Delta Compression)
l https://en.wikipedia.org/wiki/Delta_encoding
l Examples
l HTTP RFC 3229 “Delta encoding in HTTP”
l rsync - Delta file copying
l Online backup
l Version Control
MegaEase
Compression of Timestamps
l Delta-of-Delta
MegaEase
Compression Algorithm
Compress Timestamp
D = (t(n) - t(n-1)) - (t(n-1) - t(n-2))
l D = 0, then store a single ‘0’ bit
l D = [-63, 64], ‘10’ : value (7 bits)
l D = [-255, 256], ‘110’ : value (9 bits)
l D = [-2047, 2048], ‘1110’ : value (12 bits)
l Otherwise store ‘1111’ : D (32 bits)
Compress Values (Double float)
X = V(i) XOR V(i-1)
l X = 0, then store a single ‘0’ bit
l X != 0,
First compute the number of leading zeros and trailing zeros of the XOR value. Store ‘1’ as the first (control) bit, and for the second bit:
If the leading zeros and trailing zeros are the same as in the previous XOR value, store ‘0’ as the second bit, followed by the meaningful XOR bits (the XOR value with its leading and trailing zeros stripped).
If the leading zeros and trailing zeros differ from the previous XOR value, store ‘1’ as the second bit, followed by 5 bits for the number of leading zeros, 6 bits for the length of the meaningful XOR bits, and finally the meaningful XOR bits themselves (in this case at least 13 bits of overhead are produced).
(A sketch of the timestamp encoding follows below.)
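A hedged Go sketch of the delta-of-delta timestamp encoding in the table above (the XOR value encoding is analogous). For clarity the bits are accumulated in a string rather than a packed bit stream, and the first two timestamps are assumed to be stored in a block header.

package main

import (
	"fmt"
	"strconv"
)

// encodeBits returns v as a two's-complement bit string of the given width.
func encodeBits(v int64, width int) string {
	mask := uint64(1)<<uint(width) - 1
	s := strconv.FormatUint(uint64(v)&mask, 2)
	for len(s) < width {
		s = "0" + s
	}
	return s
}

// encodeTimestamps emits the control bits and payloads for timestamps after
// the first two, using the bucket boundaries from the table above.
func encodeTimestamps(ts []int64) string {
	out := ""
	for i := 2; i < len(ts); i++ {
		d := (ts[i] - ts[i-1]) - (ts[i-1] - ts[i-2]) // delta of deltas
		switch {
		case d == 0:
			out += "0"
		case d >= -63 && d <= 64:
			out += "10" + encodeBits(d, 7)
		case d >= -255 && d <= 256:
			out += "110" + encodeBits(d, 9)
		case d >= -2047 && d <= 2048:
			out += "1110" + encodeBits(d, 12)
		default:
			out += "1111" + encodeBits(d, 32)
		}
	}
	return out
}

func main() {
	// Scrapes every 15s: each perfectly regular timestamp costs a single '0' bit;
	// the final sample arrives 8s early, costing "10" plus 7 bits.
	ts := []int64{1434317560, 1434317575, 1434317590, 1434317605, 1434317612}
	fmt.Println(encodeTimestamps(ts)) // "00" then "10" + 7 bits
}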
MegaEase
Sample Compression
l Raw : 16 bytes/ sample
l Compressed: 1.37 bytes/sample
MegaEase
Open Source Implementation
l Golang
l https://github.com/dgryski/go-tsz
l Java
l https://github.com/burmanm/gorilla-tsc
l https://github.com/milpol/gorilla4j
l Rust
l https://github.com/jeromefroe/tsz-rs
l https://github.com/mheffner/rust-gorilla-tsdb
MegaEase
Reference
MegaEase
Reference
l Writing a Time Series Database from Scratch by Fabian Reinartz
https://fabxc.org/tsdb/
l Gorilla: A Fast, Scalable, In-Memory Time Series Database
http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
l TSDB format
https://github.com/prometheus-junkyard/tsdb/blob/master/docs/format/README.md
l PromCon 2017: Storing 16 Bytes at Scale - Fabian Reinartz
l video: https://www.youtube.com/watch?v=b_pEevMAC3I
l slides: https://promcon.io/2017-munich/slides/storing-16-bytes-at-scale.pdf
l Ganesh Vernekar Blog - Prometheus TSDB
l (Part 1): The Head Block https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block
l (Part 2): WAL and Checkpoint https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint
l (Part 3): Memory Mapping of Head Chunks from Disk https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk
l (Part 4): Persistent Block and its Index https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index
l (Part 5): Queries https://ganeshvernekar.com/blog/prometheus-tsdb-queries
l Time-series compression algorithms, explained
l https://blog.timescale.com/blog/time-series-compression-algorithms-explained/
MegaEase
Thanks
MegaEase Inc