MegaEase
How Prometheus Stores the Data
Hao Chen
MegaEase
Self Introduction
l 20+ years of working experience in large-scale distributed system architecture and development. Familiar with Cloud Native computing and high-concurrency / high-availability architecture solutions.
l Working Experiences
l MegaEase – Cloud Native Software products as Founder
l Alibaba – AliCloud, Tmall as principal software engineer.
l Amazon – Amazon.com as senior software manager.
l Thomson Reuters – Real-time system software development Manager.
l IBM Platform – Distributed computing system as software engineer.
Weibo: @左耳朵耗子
Twitter: @haoel
Blog: http://coolshell.cn/
MegaEase
Understanding Time Series
MegaEase
Understanding Time Series Data
l Data scheme
l identifier -> (t0, v0), (t1, v1), (t2, v2), (t3, v3), ....
l Prometheus Data Model
l <metric name>{<label name>=<label value>, ...}
l Typical set of series identifiers
l {__name__=“requests_total”, path=“/status”, method=“GET”, instance=”10.0.0.1:80”} @1434317560938 94355
l {__name__=“requests_total”, path=“/status”, method=“POST”, instance=”10.0.0.3:80”} @1434317561287 94934
l {__name__=“requests_total”, path=“/”, method=“GET”, instance=”10.0.0.2:80”} @1434317562344 96483
l Query
l __name__=“requests_total” - selects all series belonging to the requests_total metric.
l method=“PUT|POST” - selects all series whose method is PUT or POST
Metric Name Labels Timestamp Sample Value
Key - Series Value - Sample
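A minimal Go sketch of this data model, assuming nothing about Prometheus' internal types: a series is keyed by its label set and holds an ordered list of (timestamp, value) samples.

// Illustrative only -- not Prometheus' actual types.
package main

import "fmt"

type Labels map[string]string // e.g. {__name__="requests_total", method="GET"}

type Sample struct {
	Timestamp int64   // milliseconds since epoch
	Value     float64 // the sample value, e.g. a counter reading
}

type Series struct {
	Labels  Labels
	Samples []Sample // append-only, ordered by Timestamp
}

func main() {
	s := Series{
		Labels: Labels{"__name__": "requests_total", "path": "/status", "method": "GET"},
	}
	s.Samples = append(s.Samples, Sample{Timestamp: 1434317560938, Value: 94355})
	fmt.Println(s.Labels["__name__"], s.Samples[0])
}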
MegaEase
2D Data Plane
l Write
l Completely vertical and highly concurrent as samples from each target are ingested independently
l Query
l Data retrieval can be parallelized and batched
series
^
│ . . . . . . . . . . . . . . . . . . . . . . {name=“request_total”, method=“GET”}
│ . . . . . . . . . . . . . . . . . . . . . . {name=“request_total”, method=“POST”}
│ . . . . . . .
│ . . . . . . . . . . . . . . . . . . . ...
│ . . . . . . . . . . . . . . . . . . . . .
│ . . . . . . . . . . . . . . . . . . . . . {name=“errors_total”, method=“POST”}
│ . . . . . . . . . . . . . . . . . {name=“errors_total”, method=“GET”}
│ . . . . . . . . . . . . . .
│ . . . . . . . . . . . . . . . . . . . ...
│ . . . . . . . . . . . . . . . . . . . .
v
<-------------------- time --------------------->
MegaEase
The Fundamental Problem
l Storage problem
l HDD – physically spinning disks are slow at random access
l SSD – write amplification on small writes
l Query is much more complicated than write
l A time series query can cause random reads.
l Ideal Write
l Sequential writes
l Batched writes
l Ideal Read
l Samples of the same time series should be read sequentially
MegaEase
Prometheus Solution
(v1.x - “V2”)
MegaEase
Prometheus Solution (v1.x “V2”)
l One file per time series
l Batch up 1KiB chunks in memory
┌──────────┬─────────┬─────────┬─────────┬─────────┐ series A
└──────────┴─────────┴─────────┴─────────┴─────────┘
┌──────────┬─────────┬─────────┬─────────┬─────────┐ series B
└──────────┴─────────┴─────────┴─────────┴─────────┘
. . .
┌──────────┬─────────┬─────────┬─────────┬─────────┬─────────┐ series XYZ
└──────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
chunk 1 chunk 2 chunk 3 ...
l Dark Sides
l Chunks are held in memory; they can be lost if the application or node crashes.
l With several million files, the filesystem runs out of inodes.
l With several thousand chunks to persist at once, the disk I/O becomes very busy.
l Keeping that many files open for I/O causes very high latency.
l Cleaning up old data causes SSD write amplification.
l Very high CPU/memory/disk resource consumption.
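To make the dark sides above concrete, here is a hedged Go sketch of the V2-style idea only (not Prometheus code): each series gets its own file, and samples are batched into ~1 KiB in-memory chunks that are flushed only when full, so millions of series mean millions of files and an in-memory tail that is lost on a crash.

package main

import (
	"encoding/binary"
	"log"
	"math"
	"os"
)

const chunkSize = 1024 // 1 KiB chunk, as in the old storage

// seriesWriter batches samples for ONE series in memory and appends full
// chunks to that series' own file (one file per series).
type seriesWriter struct {
	file *os.File
	buf  []byte // in-memory chunk being filled; lost if the process crashes
}

func (w *seriesWriter) append(ts int64, v float64) error {
	var rec [16]byte
	binary.LittleEndian.PutUint64(rec[0:8], uint64(ts))
	binary.LittleEndian.PutUint64(rec[8:16], math.Float64bits(v))
	w.buf = append(w.buf, rec[:]...)
	if len(w.buf) >= chunkSize { // chunk full: one write per series per ~64 samples
		if _, err := w.file.Write(w.buf); err != nil {
			return err
		}
		w.buf = w.buf[:0]
	}
	return nil
}

func main() {
	f, err := os.CreateTemp("", "series-A-*.chunks")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	w := &seriesWriter{file: f}
	for i := 0; i < 200; i++ {
		if err := w.append(int64(i)*15000, float64(i)); err != nil {
			log.Fatal(err)
		}
	}
}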
MegaEase
Series Churn
l Definition
l Some time series become INACTIVE
l Some time series become ACTIVE
l Reasons
l Rolling updates of a number of microservices
l Kubernetes scaling the services
series
^
│ . . . . . .
│ . . . . . .
│ . . . . . .
│ . . . . . . .
│ . . . . . . .
│ . . . . . . .
│ . . . . . .
│ . . . . . .
│ . . . . .
│ . . . . .
│ . . . . .
v
<-------------------- time --------------------->
MegaEase
New Prometheus Design
(v2.x - “V3”)
MegaEase
Fundamental Design – V3
l Storage Layout
l 01XXXXXXX… is a data block directory
l ULID – like a UUID, but lexicographically sortable and encoding the creation time
l chunks directory
l contains the raw chunks of data points for the various series (like “V2”)
l no longer a single file per series
l index – the index of the data
l lots of black magic to find the data by labels.
l meta.json – human-readable metadata (see the sketch after the file tree below)
l describes the state of our storage and the data it contains
l tombstones
l deleted data is recorded in this file instead of being removed from the chunk files
l wal – Write-Ahead Log
l the WAL segments are truncated into a “checkpoint.X” directory
l chunks_head – head chunks, memory-mapped from disk
l Notes
l Data is persisted to disk every 2 hours
l The WAL is used for data recovery.
l 2-hour blocks make range queries efficient
$ tree ./data
./data
├── 01BKGV7JBM69T2G1BGBGM6KB12
│ ├── chunks
│ │ ├── 000001
│ │ ├── 000002
│ │ └── 000003
│ ├── index
│ └── meta.json
├── 01BKGTZQ1SYQJTR4PB43C8PD98
│ ├── chunks
│ │ └── 000001
│ ├── index
│ └── meta.json
├── 01BKGTZQ1HHWHV8FBJXW1Y3W0K
│ ├── chunks
│ │ └── 000001
│ ├── index
│ ├── tombstones
│ └── meta.json
├── chunks_head
│ └── 000001
└── wal
├── 000000003
└── checkpoint.00000002
├── 00000000
└── 00000001
https://github.com/prometheus/prometheus/blob/release-2.25/tsdb/docs/format/README.md
File Format
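As an illustration of this layout, the sketch below reads a block's meta.json. The field names (ulid, minTime, maxTime, stats, compaction, version) follow the TSDB format docs linked above, but the code itself is an illustrative example, not part of Prometheus.

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
	"path/filepath"
)

type BlockMeta struct {
	ULID    string `json:"ulid"`
	MinTime int64  `json:"minTime"` // inclusive, milliseconds
	MaxTime int64  `json:"maxTime"` // exclusive, milliseconds
	Stats   struct {
		NumSamples uint64 `json:"numSamples"`
		NumSeries  uint64 `json:"numSeries"`
		NumChunks  uint64 `json:"numChunks"`
	} `json:"stats"`
	Compaction struct {
		Level   int      `json:"level"`
		Sources []string `json:"sources"` // ULIDs of the blocks this one was compacted from
	} `json:"compaction"`
	Version int `json:"version"`
}

func readMeta(blockDir string) (*BlockMeta, error) {
	b, err := os.ReadFile(filepath.Join(blockDir, "meta.json"))
	if err != nil {
		return nil, err
	}
	var m BlockMeta
	if err := json.Unmarshal(b, &m); err != nil {
		return nil, err
	}
	return &m, nil
}

func main() {
	m, err := readMeta("./data/01BKGV7JBM69T2G1BGBGM6KB12")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("block %s covers [%d, %d)\n", m.ULID, m.MinTime, m.MaxTime)
}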
MegaEase
Blocks – Little Database
l Partition the data into non-overlapping blocks
l Each block acts as a fully independent database
l Containing all time series data for its time window
l it has its own index and set of chunk files.
l Every completed block of data is immutable
l Only the current (head) block can still be appended to
l All new data is written to an in-memory database
l To prevent data loss, a write-ahead log (WAL) is also written.
t0 t1 t2 t3 now
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌────────────┐
│ │ │ │ │ │ │ │ ┌────────────┐
│ block │ │ block │ │ block │ │ chunk_head │ <─── write ────┤ Prometheus │
│ │ │ │ │ │ │ │ └────────────┘
└───────────┘ └───────────┘ └───────────┘ └────────────┘ ^
└──────────────┴───────┬──────┴──────────────┘ │
│ query
│ │
merge ─────────────────────────────────────────────────┘
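A minimal Go sketch of the query path in the diagram (not Prometheus' query engine): keep only the blocks whose time window overlaps the query range, query each independently, and merge the partial results.

package main

import "fmt"

type Block struct {
	MinTime, MaxTime int64 // time window covered by the block, in milliseconds
}

// overlaps reports whether the block intersects the query range [mint, maxt).
func (b Block) overlaps(mint, maxt int64) bool {
	return b.MinTime < maxt && mint < b.MaxTime
}

func blocksForRange(blocks []Block, mint, maxt int64) []Block {
	var out []Block
	for _, b := range blocks {
		if b.overlaps(mint, maxt) {
			out = append(out, b) // every other block is skipped entirely
		}
	}
	return out
}

func main() {
	blocks := []Block{{0, 7200000}, {7200000, 14400000}, {14400000, 21600000}}
	fmt.Println(blocksForRange(blocks, 7000000, 8000000)) // overlaps the first two blocks
}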
MegaEase
Tree Concept
Block 1 Block 2 Block 3 Block 4 Block N
chunk1 chunk2 chunk3
time
MegaEase
New Design’s Benefits
l Good for querying a time range
l we can easily ignore all data blocks outside of this range.
l It trivially addresses the problem of series churn by reducing the set of inspected data to begin with
l Good for disk writes
l When completing a block, we can persist the data from our in-memory database by sequentially writing just
a handful of larger files.
l Keep the good property of V2 that recent chunks
l which are queried most, are always hot in memory.
l Flexible for chunk size
l We can pick any size that makes the most sense for the individual data points and chosen compression
format.
l Deleting old data becomes extremely cheap and instantaneous.
l We merely have to delete a single directory. Remember, in the old storage we had to analyze and re-write
up to hundreds of millions of files, which could take hours to converge.
MegaEase
Chunk-head
l A head chunk is cut when it
l fills up to 120 samples, or
l spans the chunk range (2 hours by default)
l Since Prometheus v2.19
l not all chunks are stored in memory
l When a chunk is cut, it is flushed to disk and memory-mapped (mmap); see the sketch below
https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/
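A hedged Go sketch of the cutting rule described above; the constants 120 samples and 2 hours come from the slide and the linked blog post, while the type and helper names are made up for illustration.

package main

import "fmt"

const (
	samplesPerChunk  = 120
	chunkRangeMillis = 2 * 60 * 60 * 1000 // 2 hours
)

type headChunk struct {
	minTime    int64 // timestamp of the first sample in this chunk
	numSamples int
}

// needsCut reports whether appending a sample at time ts should first cut
// (close and mmap to disk) the current head chunk and start a new one.
func (c *headChunk) needsCut(ts int64) bool {
	return c.numSamples >= samplesPerChunk || ts >= c.minTime+chunkRangeMillis
}

func main() {
	c := &headChunk{minTime: 0, numSamples: 119}
	fmt.Println(c.needsCut(60_000)) // false: the 120th sample still fits
	c.numSamples++
	fmt.Println(c.needsCut(75_000)) // true: 120 samples reached, cut the chunk
}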
MegaEase
Chunk head → Block
https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/
l After some time, the chunks reach the threshold
l When the chunks span 3 hrs
l the first 2 hrs of chunks (1, 2, 3, 4) are compacted into a block
l Meanwhile
l the WAL is truncated at this point
l and a “checkpoint” is created!
MegaEase
Large file with “mmap”
l mmap stands for memory-mapped files. It is a
way to read and write files without invoking
system calls.
l It is great when multiple processes access data from the same file in a read-only fashion
l It allows all those processes to share the same physical memory pages, saving a lot of memory.
l It also allows the operating system to optimize paging operations.
[Diagram: a user process reads a file either through read/write system calls via the kernel page cache, through Direct I/O straight to the device, or by mmap-ing the page-cache pages into user space]
Why mmap is faster than system calls
https://sasha-f.medium.com/why-mmap-is-faster-than-system-calls-24718e75ab37
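A small, self-contained Go example (Linux/macOS) of mapping a file with syscall.Mmap instead of issuing read() calls; the file path is just a placeholder.

package main

import (
	"fmt"
	"log"
	"os"
	"syscall"
)

func main() {
	f, err := os.Open("/etc/hostname") // any small, readable, non-empty file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}

	// Map the whole file read-only; no read() system calls are needed afterwards,
	// and other processes mapping the same file share the same page-cache pages.
	data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		log.Fatal(err)
	}
	defer syscall.Munmap(data)

	fmt.Printf("%d bytes mapped: %q\n", len(data), string(data))
}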
MegaEase
Write-Ahead Log(WAL)
l widely used in relational databases to provide durability (D from ACID)
l Persisting every state change as a command to the append only log.
https://martinfowler.com/articles/patterns-of-distributed-systems/wal.html
l Store each state change as a command (a minimal sketch follows this list)
l A single log is appended sequentially
l Each log entry is given a unique identifier
l Roll the log into segments (Segmented Log)
l Clean the log with a Low-Water Mark
l Snapshot-based (ZooKeeper & etcd)
l Time-based (Kafka)
l Support Singular Update Queue
l A work queue
l A single thread
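A minimal Go sketch of a segmented, append-only WAL in the spirit of the patterns above; the segment size, record framing, and names are illustrative assumptions, not Prometheus' WAL format.

package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"os"
	"path/filepath"
)

const segmentSize = 128 * 1024 * 1024 // roll to a new segment at 128 MiB (arbitrary here)

type WAL struct {
	dir     string
	seg     *os.File
	segNum  int
	written int64
}

// Append writes one length-prefixed record to the current segment,
// rolling to a new numbered segment when the current one is full.
func (w *WAL) Append(record []byte) error {
	if w.seg == nil || w.written+int64(len(record))+4 > segmentSize {
		if err := w.roll(); err != nil {
			return err
		}
	}
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(record)))
	if _, err := w.seg.Write(hdr[:]); err != nil {
		return err
	}
	if _, err := w.seg.Write(record); err != nil {
		return err
	}
	w.written += int64(len(record)) + 4
	return w.seg.Sync() // durability: fsync before acknowledging the write
}

func (w *WAL) roll() error {
	if w.seg != nil {
		w.seg.Close()
		w.segNum++
	}
	name := filepath.Join(w.dir, fmt.Sprintf("%08d", w.segNum))
	f, err := os.OpenFile(name, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	w.seg, w.written = f, 0
	return nil
}

func main() {
	dir, _ := os.MkdirTemp("", "wal")
	w := &WAL{dir: dir}
	if err := w.Append([]byte(`sample{series_ref=5} 1434317560938 94355`)); err != nil {
		log.Fatal(err)
	}
}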
MegaEase
Prometheus WAL & Checkpoint
l WAL Records - include the Series and their corresponding Samples.
l The Series record is written only once when we see it for the first time
l The Samples record is written for all write requests that contain a sample.
l WAL Truncation - Checkpoints
l Drops all the series records for series which are no longer in the Head.
l Drops all the samples which are before time T.
l Drops all the tombstone records for time ranges before T.
l Retains the remaining series, samples and tombstone records in the same order as they appear in the WAL.
l WAL Replay
l Replaying the “checkpoint.X”
l Replaying the WAL X+1, X+2,… X+N
l WAL Compression
l The WAL records are compressed with Snappy, a lightweight compression
l Snappy was developed by Google and is based on LZ77
l It aims for very high speeds and reasonable compression, not maximum compression or compatibility.
l It is widely used by many databases – Cassandra, Couchbase, Hadoop, LevelDB, MongoDB, InfluxDB, ...
Source Code : https://github.com/prometheus/prometheus/tree/master/tsdb/wal
data
└── wal
├── 000000
├── 000001
├── 000002
├── 000003
├── 000004
└── 000005
data
└── wal
├── checkpoint.000003
| ├── 000000
| └── 000001
├── 000004
└── 000005
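An illustrative Go sketch of the replay order described above: replay the latest checkpoint.X directory first, then the WAL segments numbered above X, in order. The directory layout and name parsing are simplified assumptions.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strconv"
	"strings"
)

func replayOrder(walDir string) ([]string, error) {
	entries, err := os.ReadDir(walDir)
	if err != nil {
		return nil, err
	}
	checkpoint, checkpointIdx := "", -1
	var segments []int
	for _, e := range entries {
		name := e.Name()
		if strings.HasPrefix(name, "checkpoint.") {
			if n, err := strconv.Atoi(strings.TrimPrefix(name, "checkpoint.")); err == nil && n > checkpointIdx {
				checkpoint, checkpointIdx = name, n // remember the latest checkpoint
			}
			continue
		}
		if n, err := strconv.Atoi(name); err == nil {
			segments = append(segments, n)
		}
	}
	sort.Ints(segments)

	var order []string
	if checkpoint != "" {
		order = append(order, filepath.Join(walDir, checkpoint)) // replayed first
	}
	for _, n := range segments {
		if n > checkpointIdx { // only segments after the checkpoint are replayed
			order = append(order, filepath.Join(walDir, fmt.Sprintf("%06d", n))) // width is illustrative
		}
	}
	return order, nil
}

func main() {
	order, err := replayOrder("./data/wal")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, p := range order {
		fmt.Println("replay:", p)
	}
}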
MegaEase
Block Compaction
l Problem
l When querying multiple blocks, we have to merge their results into an overall result.
l A week-long query would have to merge 80+ partial blocks.
l Compaction
t0 t1 t2 t3 t4 now
┌────────────┐ ┌──────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ 1 │ │ 2 │ │ 3 │ │ 4 │ │ 5 mutable │ before
└────────────┘ └──────────┘ └───────────┘ └───────────┘ └───────────┘
┌─────────────────────────────────────────┐ ┌───────────┐ ┌───────────┐
│ 1 compacted │ │ 4 │ │ 5 mutable │ after (option A)
└─────────────────────────────────────────┘ └───────────┘ └───────────┘
┌──────────────────────────┐ ┌──────────────────────────┐ ┌───────────┐
│ 1 compacted │ │ 3 compacted │ │ 5 mutable │ after (option B)
└──────────────────────────┘ └──────────────────────────┘ └───────────┘
MegaEase
Retention
l Example
l Block 1 can be deleted safely; block 2 has to be kept until it is fully behind the boundary.
l Block Compaction impacts
l Block compaction could make a block so large that it never falls fully behind the retention boundary.
l We need to limit the block size (see the sketch after the diagram below):
Maximum block size = 10% * retention window.
|
┌────────────┐ ┌────┼─────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ 1 │ │ 2 | │ │ 3 │ │ 4 │ │ 5 │ . . .
└────────────┘ └────┼─────┘ └───────────┘ └───────────┘ └───────────┘
|
|
retention boundary
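A minimal Go sketch of the retention rule, assuming millisecond timestamps: a block is deletable only when it is entirely behind the retention boundary, and the block size cap is 10% of the retention window.

package main

import "fmt"

type Block struct {
	MinTime, MaxTime int64 // block time window in milliseconds
}

// deletable reports whether the block lies entirely behind the retention boundary.
func deletable(b Block, now, retention int64) bool {
	boundary := now - retention
	return b.MaxTime <= boundary // fully behind the boundary (block 1, not block 2)
}

// maxBlockSize applies "maximum block size = 10% * retention window".
func maxBlockSize(retention int64) int64 {
	return retention / 10
}

func main() {
	const day = 24 * 60 * 60 * 1000
	retention := int64(15 * day)
	now := int64(100 * day)
	fmt.Println(deletable(Block{80 * day, 82 * day}, now, retention)) // true: fully behind
	fmt.Println(deletable(Block{84 * day, 86 * day}, now, retention)) // false: straddles the boundary
	fmt.Println(maxBlockSize(retention)/(60*60*1000), "hours max block span") // 36 hours
}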
MegaEase
V2 – Chunk Query
MegaEase
V3 - Block Query
MegaEase
V3 - Compaction
MegaEase
V3 - Retention
MegaEase
Index
l Use an inverted index for the label index
l Allocate a unique ID for every series
l Looking up a series by this ID is O(1)
l This ID is the forward index.
l Construct the labels’ index
l If the series with IDs {2, 5, 10, 29} contain the label app=“nginx”
l then the list {2, 5, 10, 29} is the inverted index for the label app=“nginx”
l In short
l The number of labels is significantly smaller than the number of series.
l Walking through all of the labels is not a problem (see the sketch after the example below).
{
__name__=”requests_total”,
pod=”nginx-34534242-abc723”,
job=”nginx”,
path=”/api/v1/status”,
status=”200”,
method=”GET”,
}
status=”200”: 1 2 5 ...
method=”GET”: 2 3 4 5 6 9 ...
ID : 5
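A small Go sketch of the inverted-index idea (concept only, not Prometheus' on-disk index format): series IDs form the forward index, and each label=value pair maps to a sorted posting list of series IDs.

package main

import (
	"fmt"
	"sort"
)

type label struct{ name, value string }

type index struct {
	postings map[label][]uint64 // label pair -> sorted list of series IDs
}

// add registers a series ID under every label pair it carries.
func (ix *index) add(id uint64, labels map[string]string) {
	for n, v := range labels {
		l := label{n, v}
		ix.postings[l] = append(ix.postings[l], id)
		sort.Slice(ix.postings[l], func(i, j int) bool { return ix.postings[l][i] < ix.postings[l][j] })
	}
}

func main() {
	ix := &index{postings: map[label][]uint64{}}
	ix.add(5, map[string]string{"__name__": "requests_total", "method": "GET", "status": "200"})
	ix.add(2, map[string]string{"__name__": "requests_total", "method": "GET"})
	fmt.Println(ix.postings[label{"method", "GET"}]) // [2 5]
}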
MegaEase
Sets Operation
l Considering we have the following query:
l app=“foo” AND __name__=“requests_total”
l How do we intersect two inverted index lists?
l General Algorithm Interview Question
l Given two integer arrays, return their intersection.
l A[] = { 4, 1, 6, 7, 3, 2, 9 }
l B[] = { 11, 30, 2, 70, 9 }
l return { 2, 9 } as their intersection
l Given two integer arrays, return their union.
l A[] = { 4, 1, 6, 7, 3, 2, 9 }
l B[] = { 11, 30, 2, 70, 9 }
l return { 4, 1, 6, 7, 3, 2, 9, 11, 30, 70 } as their union
l Time: O(m*n) - no extra space
MegaEase
Sort The Array
l If we sort the array
__name__="requests_total" -> [ 999, 1000, 1001, 2000000, 2000001, 2000002, 2000003 ]
app="foo" -> [ 1, 3, 10, 11, 12, 100, 311, 320, 1000, 1001, 10002 ]
intersection => [ 1000, 1001 ]
l We can use an efficient algorithm
l O(m+n): two pointers, one for each array.
for idx1 < len1 && idx2 < len2 {
	if a[idx1] > b[idx2] {
		idx2++
	} else if a[idx1] < b[idx2] {
		idx1++
	} else {
		c = append(c, a[idx1]) // equal: take the value and advance both pointers
		idx1++
		idx2++
	}
}
return c
l Series IDs must be easy to sort; using MD5 or UUID is not a good idea (V2 used hash IDs)
l Deleting data could force the index to be rebuilt.
MegaEase
Benchmark
(v1.5.2 vs v2.0)
MegaEase
Benchmark – Memory
l Heap memory usage in GB
l Prometheus 2.0’s memory consumption is reduced by 3-4x
MegaEase
Benchmark – CPU
l CPU usage in cores/second
l Prometheus 2.0 needs 3-10 times fewer CPU resources.
MegaEase
Benchmark – Disk Writes
l Disk writes in MB/second
l Prometheus 2.0 saves 97-99% of disk writes.
l Prometheus 1.5 is prone to wearing out SSDs
MegaEase
Benchmark – Query Latency
l Query P99 latency in seconds
l With Prometheus 1.5, query latency increases over time as more series are stored.
MegaEase
Facebook Paper
Gorilla: A fast, scalable, in-memory time series database
TimeScale
MegaEase
Gorilla Requirements
l 2 billion unique time series identified by a string key.
l 700 million data points (time stamp and value) added per minute.
l Store data for 26 hours.
l More than 40,000 queries per second at peak.
l Reads succeed in under one millisecond.
l Support time series with 15 second granularity (4 points per minute per time series).
l Two in-memory, not co-located replicas (for disaster recovery capacity).
l Always serve reads even when a single server crashes.
l Ability to quickly scan over all in memory data.
l Support at least 2x growth per year.
85% Queries for latest 26 hours data
MegaEase
Key Technology
l Simple Data Model – (string key, int64 timestamp, double value)
l In memory – low latency
l High Data Compression Ratio – saves 90% of space
l Cache first, then Disk – accepts possible data loss
l Stateless – easy to scale
l Hash(key) → Shard → Node (see the sketch below)
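A tiny Go sketch of the Hash(key) → Shard → Node routing step; the node addresses and the FNV hash choice are illustrative assumptions.

package main

import (
	"fmt"
	"hash/fnv"
)

var nodes = []string{"node-0:7000", "node-1:7000", "node-2:7000"} // placeholder addresses

// nodeFor maps a series key to the node that owns it: hash(key) -> shard -> node.
func nodeFor(key string) string {
	h := fnv.New64a()
	h.Write([]byte(key))
	shard := h.Sum64() % uint64(len(nodes))
	return nodes[shard]
}

func main() {
	fmt.Println(nodeFor(`requests_total{method="GET"}`))
}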
MegaEase
Fundamental
l Delta Encoding (aka Delta Compression)
l https://en.wikipedia.org/wiki/Delta_encoding
l Examples
l HTTP RFC 3229 “Delta encoding in HTTP”
l rsync - Delta file copying
l Online backup
l Version Control
MegaEase
Compression of Timestamps
l Delta-of-Delta
MegaEase
Compression Algorithm
Compress Timestamp
D = (t(n) - t(n-1)) - (t(n-1) - t(n-2))
l D = 0, then store a single ‘0’ bit
l D = [-63, 64], ‘10’ : value (7 bits)
l D = [-255, 256], ‘110’ : value (9 bits)
l D = [-2047, 2048], ‘1110’ : value (12 bits)
l Otherwise store ‘1111’ : D (32 bits)
Compress Values (Double float)
X = V(i) XOR V(i-1)
l X = 0, then store a single ‘0’ bit
l X != 0,
First compute the number of leading zeros and trailing zeros of the XOR value. Store ‘1’ as the first (control) bit, and for the second bit:
If the leading zeros and trailing zeros are the same as in the previous XOR value, store ‘0’ as the second bit, followed by the meaningful XOR bits (the XOR value with its leading and trailing zeros stripped).
If the leading zeros and trailing zeros differ from the previous XOR value, store ‘1’ as the second bit, followed by 5 bits for the number of leading zeros, 6 bits for the length of the meaningful XOR bits, and finally the meaningful XOR bits themselves (in this case at least 13 bits of overhead are produced).
(A sketch of the timestamp encoding follows below.)
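A hedged Go sketch of the delta-of-delta timestamp encoding in the table above (the XOR value encoding is analogous). For clarity the bits are accumulated in a string rather than a packed bit stream, and the first two timestamps are assumed to be stored in a block header.

package main

import (
	"fmt"
	"strconv"
)

// encodeBits returns v as a two's-complement bit string of the given width.
func encodeBits(v int64, width int) string {
	mask := uint64(1)<<uint(width) - 1
	s := strconv.FormatUint(uint64(v)&mask, 2)
	for len(s) < width {
		s = "0" + s
	}
	return s
}

// encodeTimestamps emits the control bits and payloads for timestamps after
// the first two, using the bucket boundaries from the table above.
func encodeTimestamps(ts []int64) string {
	out := ""
	for i := 2; i < len(ts); i++ {
		d := (ts[i] - ts[i-1]) - (ts[i-1] - ts[i-2]) // delta of deltas
		switch {
		case d == 0:
			out += "0"
		case d >= -63 && d <= 64:
			out += "10" + encodeBits(d, 7)
		case d >= -255 && d <= 256:
			out += "110" + encodeBits(d, 9)
		case d >= -2047 && d <= 2048:
			out += "1110" + encodeBits(d, 12)
		default:
			out += "1111" + encodeBits(d, 32)
		}
	}
	return out
}

func main() {
	// Scrapes every 15s: each perfectly regular timestamp costs a single '0' bit;
	// the final sample arrives 8s early, costing "10" plus 7 bits.
	ts := []int64{1434317560, 1434317575, 1434317590, 1434317605, 1434317612}
	fmt.Println(encodeTimestamps(ts)) // "00" then "10" + 7 bits
}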
MegaEase
Sample Compression
l Raw : 16 bytes/ sample
l Compressed: 1.37 bytes/sample
MegaEase
Open Source Implementation
l Golang
l https://github.com/dgryski/go-tsz
l Java
l https://github.com/burmanm/gorilla-tsc
l https://github.com/milpol/gorilla4j
l Rust
l https://github.com/jeromefroe/tsz-rs
l https://github.com/mheffner/rust-gorilla-tsdb
MegaEase
Reference
MegaEase
Reference
l Writing a Time Series Database from Scratch by Fabian Reinartz
https://fabxc.org/tsdb/
l Gorilla: A Fast, Scalable, In-Memory Time Series Database
http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
l TSDB format
https://github.com/prometheus-junkyard/tsdb/blob/master/docs/format/README.md
l PromCon 2017: Storing 16 Bytes at Scale - Fabian Reinartz
l video: https://www.youtube.com/watch?v=b_pEevMAC3I
l slides: https://promcon.io/2017-munich/slides/storing-16-bytes-at-scale.pdf
l Ganesh Vernekar Blog - Prometheus TSDB
l (Part 1): The Head Block https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block
l (Part 2): WAL and Checkpoint https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint
l (Part 3): Memory Mapping of Head Chunks from Disk https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk
l (Part 4): Persistent Block and its Index https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index
l (Part 5): Queries https://ganeshvernekar.com/blog/prometheus-tsdb-queries
l Time-series compression algorithms, explained
l https://blog.timescale.com/blog/time-series-compression-algorithms-explained/
MegaEase
Thanks
MegaEase Inc