MegaEase
How Prometheus Stores the Data
Hao Chen
Self Introduction
- 20+ years of experience in large-scale distributed system architecture and development. Familiar with Cloud Native computing and high-concurrency / high-availability architecture.
- Working experience:
  - MegaEase – founder, Cloud Native software products
  - Alibaba – AliCloud, Tmall, principal software engineer
  - Amazon – Amazon.com, senior software development manager
  - Thomson Reuters – real-time systems, software development manager
  - IBM Platform – distributed computing systems, software engineer
Weibo: @左耳朵耗子
Twitter: @haoel
Blog: http://coolshell.cn/
Understanding Time Series
Understanding Time Series Data
- Data schema
  - identifier -> (t0, v0), (t1, v1), (t2, v2), (t3, v3), ...
- Prometheus data model
  - <metric name>{<label name>=<label value>, ...}
- Typical set of series identifiers
  - {__name__="requests_total", path="/status", method="GET", instance="10.0.0.1:80"} @1434317560938 94355
  - {__name__="requests_total", path="/status", method="POST", instance="10.0.0.3:80"} @1434317561287 94934
  - {__name__="requests_total", path="/", method="GET", instance="10.0.0.2:80"} @1434317562344 96483
- Query
  - __name__="requests_total" – selects all series belonging to the requests_total metric
  - method=~"PUT|POST" – selects all series whose method is PUT or POST
Each line is metric name + labels, then timestamp, then sample value: the key is the series, the value is the sample.
2D Data Plane
- Write
  - Completely vertical and highly concurrent, as samples from each target are ingested independently
- Query
  - Data retrieval can be parallelized and batched
series
^
│ . . . . . . . . . . . . . . . . . . . . . . {name=“request_total”, method=“GET”}
│ . . . . . . . . . . . . . . . . . . . . . . {name=“request_total”, method=“POST”}
│ . . . . . . .
│ . . . . . . . . . . . . . . . . . . . ...
│ . . . . . . . . . . . . . . . . . . . . .
│ . . . . . . . . . . . . . . . . . . . . . {name=“errors_total”, method=“POST”}
│ . . . . . . . . . . . . . . . . . {name=“errors_total”, method=“GET”}
│ . . . . . . . . . . . . . .
│ . . . . . . . . . . . . . . . . . . . ...
│ . . . . . . . . . . . . . . . . . . . .
v
<-------------------- time --------------------->
The Fundamental Problem
- Storage problems
  - HDD – physically spinning disks, slow random seeks
  - SSD – write amplification
- Queries are much more complicated than writes
  - A time-series query can trigger random reads
- Ideal writes
  - Sequential writes
  - Batched writes
- Ideal reads
  - Samples of the same time series should be laid out sequentially
Prometheus Solution (v1.x – "V2")
Prometheus Solution (v1.x "V2")
- One file per time series
- Batch up 1 KiB chunks in memory
┌──────────┬─────────┬─────────┬─────────┬─────────┐ series A
└──────────┴─────────┴─────────┴─────────┴─────────┘
┌──────────┬─────────┬─────────┬─────────┬─────────┐ series B
└──────────┴─────────┴─────────┴─────────┴─────────┘
. . .
┌──────────┬─────────┬─────────┬─────────┬─────────┬─────────┐ series XYZ
└──────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
chunk 1 chunk 2 chunk 3 ...
- Dark sides
  - Chunks are held in memory; they can be lost if the application or the node crashes
  - With several million files, the filesystem runs out of inodes
  - With several thousand chunks to persist at once, the disk becomes I/O bound
  - Keeping so many files open for I/O causes very high latency
  - Cleaning up old data triggers the SSD's write amplification
  - Very high CPU / memory / disk consumption
Series Churn
- Definition
  - Over time, some time series become INACTIVE while new ones become ACTIVE
- Causes
  - Rolling updates of a number of microservices
  - Kubernetes scaling services up and down
series
^
│ . . . . . .
│ . . . . . .
│ . . . . . .
│ . . . . . . .
│ . . . . . . .
│ . . . . . . .
│ . . . . . .
│ . . . . . .
│ . . . . .
│ . . . . .
│ . . . . .
v
<-------------------- time --------------------->
New Prometheus Design (v2.x – "V3")
Fundamental Design – V3
- Storage layout
  - 01XXXXXXX… – a data block directory
    - named by a ULID – like a UUID, but lexicographically sortable and encoding the creation time
  - chunks
    - contains the raw chunks of data points for various series (like "V2")
    - no longer a single file per series
  - index – index of the data
    - lots of black magic to find the data by labels
  - meta.json – human-readable metadata
    - the state of the storage and the data it contains
  - tombstones
    - deleted data is recorded in this file instead of being removed from the chunk files
  - wal – write-ahead log
    - WAL segments are truncated into a "checkpoint.X" directory
  - chunks_head – in-memory data
- Notes
  - Data is persisted to disk every 2 hours
  - The WAL is used for data recovery
  - 2-hour blocks make range queries efficient
$ tree ./data
./data
├── 01BKGV7JBM69T2G1BGBGM6KB12
│ ├── chunks
│ │ ├── 000001
│ │ ├── 000002
│ │ └── 000003
│ ├── index
│ └── meta.json
├── 01BKGTZQ1SYQJTR4PB43C8PD98
│ ├── chunks
│ │ └── 000001
│ ├── index
│ └── meta.json
├── 01BKGTZQ1HHWHV8FBJXW1Y3W0K
│ ├── chunks
│ │ └── 000001
│ ├── index
│ ├── tombstones
│ └── meta.json
├── chunks_head
│ └── 000001
└── wal
├── 000000003
└── checkpoint.00000002
├── 00000000
└── 00000001
File format: https://github.com/prometheus/prometheus/blob/release-2.25/tsdb/docs/format/README.md
Blocks – Little Databases
- Partition the data into non-overlapping blocks
  - Each block acts as a fully independent database
    - containing all time-series data for its time window
    - with its own index and set of chunk files
- Every completed block of data is immutable
- Only the current block can still be appended to
  - all new data is written to an in-memory database
  - to prevent data loss, a write-ahead log is also written
t0 t1 t2 t3 now
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌────────────┐
│ │ │ │ │ │ │ │ ┌────────────┐
│ block │ │ block │ │ block │ │ chunk_head │ <─── write ────┤ Prometheus │
│ │ │ │ │ │ │ │ └────────────┘
└───────────┘ └───────────┘ └───────────┘ └────────────┘ ^
└──────────────┴───────┬──────┴──────────────┘ │
│ query
│ │
merge ─────────────────────────────────────────────────┘
Tree Concept
[Figure: blocks (Block 1 … Block N) partition the time axis, and each block contains its own chunks (chunk 1, chunk 2, chunk 3, …)]
New Design's Benefits
- Good for querying a time range
  - We can easily ignore all data blocks outside the range
  - It trivially addresses series churn by reducing the set of inspected data to begin with
- Good for disk writes
  - When completing a block, we persist the data from the in-memory database by sequentially writing just a handful of larger files
- Keeps the good property of V2 that recent chunks, which are queried most, stay hot in memory
- Flexible chunk size
  - We can pick whatever size makes the most sense for the individual data points and the chosen compression format
- Deleting old data becomes extremely cheap and instantaneous
  - We merely have to delete a single directory. Remember, in the old storage we had to analyze and rewrite up to hundreds of millions of files, which could take hours to converge.
Chunk Head
- A head chunk is cut when it
  - fills up to 120 samples, or
  - spans 2 hours (by default)
- Since Prometheus v2.19
  - not all head chunks are kept in memory
  - once a chunk is cut, it is flushed to disk and accessed via mmap
https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/
Chunk Head → Block
https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/
- After some time, the chunks reach the compaction threshold
  - when the head covers a 3-hour chunk range
  - the first 2 hours of chunks (1, 2, 3, 4) are compacted into a block
- Meanwhile
  - the WAL is truncated at this point
  - and a "checkpoint" is created
Large Files with mmap
- mmap stands for memory-mapped files: a way to read and write files without invoking a system call per access.
- It is great when multiple processes access data from the same file in a read-only fashion:
  - it allows all those processes to share the same physical memory pages, saving a lot of memory
  - it also allows the operating system to optimize paging operations
[Figure: a user-space process reaching the device either via mmap or read/write through the kernel's file system and page cache, or via direct I/O]
Why mmap is faster than system calls:
https://sasha-f.medium.com/why-mmap-is-faster-than-system-calls-24718e75ab37
Write-Ahead Log (WAL)
- Widely used in relational databases to provide durability (the D in ACID)
- Persists every state change as a command in an append-only log
https://martinfowler.com/articles/patterns-of-distributed-systems/wal.html
- Store each state change as a command
- Append to a single log sequentially
- Give each log entry a unique identifier
- Roll the logs as a Segmented Log
- Clean the log with a Low-Water Mark
  - snapshot-based (ZooKeeper & etcd)
  - time-based (Kafka)
- Pairs with a Singular Update Queue
  - a work queue served by a single thread
Prometheus WAL & Checkpoint
- WAL records include the series and their corresponding samples
  - a Series record is written only once, when the series is first seen
  - a Samples record is written for every write request that contains a sample
- WAL truncation – checkpoints. A checkpoint:
  - drops all series records for series that are no longer in the Head
  - drops all samples from before time T
  - drops all tombstone records for time ranges before T
  - retains the remaining series, sample, and tombstone records in the same order as they appear in the WAL
- WAL replay
  - replay "checkpoint.X"
  - then replay WAL segments X+1, X+2, … X+N
- WAL compression
  - WAL records are compressed with Snappy, which is not a heavy compressor
  - Snappy was developed by Google, based on LZ77
  - it aims for very high speed and reasonable compression, not maximum compression or compatibility
  - it is widely used in many databases – Cassandra, Couchbase, Hadoop, LevelDB, MongoDB, InfluxDB, …
Source code: https://github.com/prometheus/prometheus/tree/master/tsdb/wal
data
└── wal
├── 000000
├── 000001
├── 000002
├── 000003
├── 000004
└── 000005
data
└── wal
├── checkpoint.000003
| ├── 000000
| └── 000001
├── 000004
└── 000005
Block Compaction
- Problem
  - When querying multiple blocks, we have to merge their results into one overall result
  - A week-long query may have to merge 80+ partial blocks
- Compaction merges adjacent blocks:
t0 t1 t2 t3 t4 now
┌────────────┐ ┌──────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ 1 │ │ 2 │ │ 3 │ │ 4 │ │ 5 mutable │ before
└────────────┘ └──────────┘ └───────────┘ └───────────┘ └───────────┘
┌─────────────────────────────────────────┐ ┌───────────┐ ┌───────────┐
│ 1 compacted │ │ 4 │ │ 5 mutable │ after (option A)
└─────────────────────────────────────────┘ └───────────┘ └───────────┘
┌──────────────────────────┐ ┌──────────────────────────┐ ┌───────────┐
│ 1 compacted │ │ 3 compacted │ │ 5 mutable │ after (option B)
└──────────────────────────┘ └──────────────────────────┘ └───────────┘
Retention
- Example
  - Block 1 can be deleted safely; block 2 has to be kept until it lies fully behind the retention boundary
- Impact of block compaction
  - Compaction could make a block too large to delete
  - So we limit the block size: maximum block size = 10% × retention window
|
┌────────────┐ ┌────┼─────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ 1 │ │ 2 | │ │ 3 │ │ 4 │ │ 5 │ . . .
└────────────┘ └────┼─────┘ └───────────┘ └───────────┘ └───────────┘
|
|
retention boundary
V2 – Chunk Query
V3 – Block Query
V3 – Compaction
V3 – Retention
Index
l Using inverted index for label index
l Allocate an unique ID for every series
l Look up the series by this ID, the time complexity is O(1)
l This ID is forward index.
l Construct the labels’ index
l If series ID = {2,5, 10, 29} contains app=“nginx”
l Then, the { 2, 5, 10 ,29} list is the inverted index for label “nginx”
l In Short
l Number of labels is significantly less then the number of series.
l Walking through all of the labels is not problem.
{
__name__=”requests_total”,
pod=”nginx-34534242-abc723
job=”nginx”,
path=”/api/v1/status”,
status=”200”,
method=”GET”,
}
status=”200”: 1 2 5 ...
method=”GET”: 2 3 4 5 6 9 ...
ID : 5
Set Operations
- Consider the following query:
  - app="foo" AND __name__="requests_total"
- How do we intersect two inverted-index lists?
- It is a classic algorithm interview question
  - Given two integer arrays, return their intersection
    - A[] = { 4, 1, 6, 7, 3, 2, 9 }
    - B[] = { 11, 30, 2, 70, 9 }
    - returns { 2, 9 } as their intersection
  - Given two integer arrays, return their union
    - A[] = { 4, 1, 6, 7, 3, 2, 9 }
    - B[] = { 11, 30, 2, 70, 9 }
    - returns { 4, 1, 6, 7, 3, 2, 9, 11, 30, 70 } as their union
- Naive time: O(m*n), with no extra space
Sort the Arrays
- If we sort the arrays:
__name__="requests_total" -> [ 999, 1000, 1001, 2000000, 2000001, 2000002, 2000003 ]
app="foo" -> [ 1, 3, 10, 11, 12, 100, 311, 320, 1000, 1001, 10002 ]
intersection => [ 1000, 1001 ]
- we get an efficient algorithm
  - O(m+n): one pointer per array
while (idx1 < len1 && idx2 < len2) {
    if (a[idx1] > b[idx2]) {
        idx2++
    } else if (a[idx1] < b[idx2]) {
        idx1++
    } else {
        c = append(c, a[idx1])
        idx1++
        idx2++
    }
}
return c
- Series IDs must be cheap to sort, so using MD5 or a UUID is not a good idea (V2 used a hash ID)
- Deleting data can force an index rebuild
Benchmark (v1.5.2 vs v2.0)
Benchmark – Memory
- Heap memory usage in GB
- Prometheus 2.0's memory consumption is reduced by 3–4×
Benchmark – CPU
- CPU usage in cores/second
- Prometheus 2.0 needs 3–10× fewer CPU resources
Benchmark – Disk Writes
- Disk writes in MB/second
- Prometheus 2.0 saves 97–99% of disk writes
- Prometheus 1.5 is prone to wearing out SSDs
Benchmark – Query Latency
- Query P99 latency in seconds
- In Prometheus 1.5, query latency increases over time as more series are stored
Facebook Paper
Gorilla: A Fast, Scalable, In-Memory Time Series Database
(figure source: Timescale)
Gorilla Requirements
- 2 billion unique time series, identified by a string key
- 700 million data points (timestamp and value) added per minute
- Store data for 26 hours
- More than 40,000 queries per second at peak
- Reads succeed in under one millisecond
- Support time series with 15-second granularity (4 points per minute per time series)
- Two in-memory, not co-located replicas (for disaster-recovery capacity)
- Always serve reads, even when a single server crashes
- Ability to quickly scan over all in-memory data
- Support at least 2× growth per year
85% of queries are for the latest 26 hours of data
Key Technology
- Simple data model – (string key, int64 timestamp, double value)
- In memory – low latency
- High data compression ratio – saves 90% of space
- Cache first, then disk – accepts some data loss
- Stateless – easy to scale
  - hash(key) → shard → node
Fundamentals
- Delta encoding (aka delta compression)
  - https://en.wikipedia.org/wiki/Delta_encoding
- Examples
  - HTTP – RFC 3229, "Delta encoding in HTTP"
  - rsync – delta file copying
  - online backup
  - version control
Timestamp Compression
- Delta-of-delta
Compression Algorithm
Compressing timestamps:
D = (t_n − t_{n−1}) − (t_{n−1} − t_{n−2})
- D = 0: store a single '0' bit
- D in [−63, 64]: store '10', then the value (7 bits)
- D in [−255, 256]: store '110', then the value (9 bits)
- D in [−2047, 2048]: store '1110', then the value (12 bits)
- otherwise: store '1111', then D (32 bits)
Compressing values (double floats):
X = V_i XOR V_{i−1}
- X = 0: store a single '0' bit
- X ≠ 0: first count the leading and trailing zeros of the XOR and store '1' as the first bit; the second bit is then:
  - '0' if the leading- and trailing-zero counts are the same as the previous XOR value's, followed by the meaningful XOR bits (leading and trailing zeros stripped)
  - '1' otherwise, followed by 5 bits for the number of leading zeros, 6 bits for the length of the meaningful XOR bits, and finally the meaningful XOR bits themselves (this case adds at least 13 bits of overhead)
Sample Compression
- Raw: 16 bytes per sample
- Compressed: 1.37 bytes per sample (≈ 91% smaller)
Open Source Implementations
- Golang
  - https://github.com/dgryski/go-tsz
- Java
  - https://github.com/burmanm/gorilla-tsc
  - https://github.com/milpol/gorilla4j
- Rust
  - https://github.com/jeromefroe/tsz-rs
  - https://github.com/mheffner/rust-gorilla-tsdb
References
- Writing a Time Series Database from Scratch, Fabian Reinartz
  https://fabxc.org/tsdb/
- Gorilla: A Fast, Scalable, In-Memory Time Series Database
  http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
- TSDB format
  https://github.com/prometheus-junkyard/tsdb/blob/master/docs/format/README.md
- PromCon 2017: Storing 16 Bytes at Scale, Fabian Reinartz
  - video: https://www.youtube.com/watch?v=b_pEevMAC3I
  - slides: https://promcon.io/2017-munich/slides/storing-16-bytes-at-scale.pdf
- Ganesh Vernekar's blog – Prometheus TSDB
  - Part 1: The Head Block – https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block
  - Part 2: WAL and Checkpoint – https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint
  - Part 3: Memory Mapping of Head Chunks from Disk – https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk
  - Part 4: Persistent Block and its Index – https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index
  - Part 5: Queries – https://ganeshvernekar.com/blog/prometheus-tsdb-queries
- Time-series compression algorithms, explained
  https://blog.timescale.com/blog/time-series-compression-algorithms-explained/
Thanks
MegaEase Inc
AmarGB2
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
gestioneergodomus
 
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERSCW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
veerababupersonal22
 
Technical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prismsTechnical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prisms
heavyhaig
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
ydteq
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
manasideore6
 
Basic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparelBasic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparel
top1002
 
PPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testingPPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testing
anoopmanoharan2
 
Building Electrical System Design & Installation
Building Electrical System Design & InstallationBuilding Electrical System Design & Installation
Building Electrical System Design & Installation
symbo111
 

Recently uploaded (20)

The Role of Electrical and Electronics Engineers in IOT Technology.pdf
The Role of Electrical and Electronics Engineers in IOT Technology.pdfThe Role of Electrical and Electronics Engineers in IOT Technology.pdf
The Role of Electrical and Electronics Engineers in IOT Technology.pdf
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
 
14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
Water billing management system project report.pdf
Water billing management system project report.pdfWater billing management system project report.pdf
Water billing management system project report.pdf
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
 
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERSCW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
 
Technical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prismsTechnical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prisms
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
 
Basic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparelBasic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparel
 
PPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testingPPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testing
 
Building Electrical System Design & Installation
Building Electrical System Design & InstallationBuilding Electrical System Design & Installation
Building Electrical System Design & Installation
 

How Prometheus Store the Data

  • 5. 2D Data Plane
    - Write
      - Completely vertical and highly concurrent, as samples from each target are ingested independently
    - Query
      - Data retrieval can be parallelized and batched

      series
        ^
        │ . . . . . . . . . . . . . . . . . . . . . .   {name=“request_total”, method=“GET”}
        │ . . . . . . . . . . . . . . . . . . . . . .   {name=“request_total”, method=“POST”}
        │ . . . . . . .
        │ . . . . . . . . . . . . . . . . . . . ...
        │ . . . . . . . . . . . . . . . . . . . . .
        │ . . . . . . . . . . . . . . . . . . . . .     {name=“errors_total”, method=“POST”}
        │ . . . . . . . . . . . . . . . . .             {name=“errors_total”, method=“GET”}
        │ . . . . . . . . . . . . . .
        │ . . . . . . . . . . . . . . . . . . . ...
        │ . . . . . . . . . . . . . . . . . . . .
        v
          <-------------------- time --------------------->
  • 6. The Fundamental Problem
    - Storage problems
      - HDD (spinning disk) – seek latency from physically moving heads
      - SSD – write amplification
    - Queries are much more complicated than writes
      - A time-series query can trigger random reads.
    - Ideal writes
      - Sequential writes
      - Batched writes
    - Ideal reads
      - Samples of the same time series should be laid out sequentially.
  • 8. Prometheus Solution (v1.x, the “V2” storage)
    - One file per time series
    - Batch up 1 KiB chunks in memory

      ┌──────────┬─────────┬─────────┬─────────┬─────────┐
      └──────────┴─────────┴─────────┴─────────┴─────────┘             series A
      ┌──────────┬─────────┬─────────┬─────────┬─────────┐
      └──────────┴─────────┴─────────┴─────────┴─────────┘             series B
      . . .
      ┌──────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
      └──────────┴─────────┴─────────┴─────────┴─────────┴─────────┘   series XYZ
        chunk 1    chunk 2   chunk 3    ...

    - Dark sides
      - Chunks are held in memory, so data can be lost if the application or node crashes.
      - With several million files, inodes would run out.
      - Persisting several thousand chunks at once keeps the disk I/O very busy.
      - Keeping so many files open for I/O causes very high latency.
      - Cleaning old data triggers SSD write amplification.
      - Very high CPU/memory/disk resource consumption.
  • 9. Series Churn
    - Definition
      - Some time series become INACTIVE
      - Some time series become ACTIVE
    - Reasons
      - Rolling updates of a number of microservices
      - Kubernetes scaling services up and down

      series
        ^
        │ . . . . . .
        │ . . . . . .
        │ . . . . . .
        │             . . . . . . .
        │             . . . . . . .
        │             . . . . . . .
        │                           . . . . . .
        │                           . . . . . .
        │                                         . . . . .
        │                                         . . . . .
        │                                         . . . . .
        v
          <-------------------- time --------------------->
  • 11. Fundamental Design – V3
    - Storage layout
      - 01XXXXXXX… – a data block
        - ULID – like a UUID, but lexicographically sortable and encoding the creation time
      - chunks directory
        - contains the raw chunks of data points for various series (like “V2”)
        - no longer a single file per series
      - index – index of the data
        - lots of black magic to find the data by labels
      - meta.json – human-readable metadata
        - the state of our storage and the data it contains
      - tombstones
        - deleted data is recorded in this file instead of being removed from the chunk files
      - wal – write-ahead log
        - WAL segments are truncated into a “checkpoint.X” directory
      - chunks_head – in-memory data
    - Notes
      - The data is persisted to disk every 2 hours.
      - The WAL is used for data recovery.
      - 2-hour blocks make range queries efficient.

    File format: https://github.com/prometheus/prometheus/blob/release-2.25/tsdb/docs/format/README.md

      $ tree ./data
      ./data
      ├── 01BKGV7JBM69T2G1BGBGM6KB12
      │   ├── chunks
      │   │   ├── 000001
      │   │   ├── 000002
      │   │   └── 000003
      │   ├── index
      │   └── meta.json
      ├── 01BKGTZQ1SYQJTR4PB43C8PD98
      │   ├── chunks
      │   │   └── 000001
      │   ├── index
      │   └── meta.json
      ├── 01BKGTZQ1HHWHV8FBJXW1Y3W0K
      │   ├── chunks
      │   │   └── 000001
      │   ├── index
      │   ├── tombstones
      │   └── meta.json
      ├── chunks_head
      │   └── 000001
      └── wal
          ├── 000000003
          └── checkpoint.00000002
              ├── 00000000
              └── 00000001
  • 12. Blocks – Little Databases
    - Partition the data into non-overlapping blocks
      - Each block acts as a fully independent database, containing all time-series data for its time window.
      - Each block has its own index and set of chunk files.
    - Every persisted block is immutable
      - Only the current block can be appended to.
      - All new data is written to an in-memory database.
      - To prevent data loss, a temporary WAL is also written.

      t0            t1            t2            t3             now
      ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌────────────┐
      │           │ │           │ │           │ │            │              ┌────────────┐
      │   block   │ │   block   │ │   block   │ │ chunk_head │ <─── write ──┤ Prometheus │
      │           │ │           │ │           │ │            │              └────────────┘
      └───────────┘ └───────────┘ └───────────┘ └────────────┘                    ^
            └──────────────┴──────┬──────┴──────────────┘                         │
                                  │                                             query
                                merge ────────────────────────────────────────────┘
  • 13. Tree Concept
    [Diagram: blocks 1…N laid out along the time axis, each block containing chunks 1–3]
  • 14. New Design’s Benefits
    - Good for querying a time range
      - We can easily ignore all data blocks outside of this range.
      - It trivially addresses the problem of series churn by reducing the set of inspected data to begin with.
    - Good for disk writes
      - When completing a block, we persist the data from our in-memory database by sequentially writing just a handful of larger files.
    - Keeps the good property of V2 that recent chunks, which are queried most, are always hot in memory.
    - Flexible chunk size
      - We can pick any size that makes the most sense for the individual data points and the chosen compression format.
    - Deleting old data becomes extremely cheap and instantaneous.
      - We merely have to delete a single directory. Remember, in the old storage we had to analyze and rewrite up to hundreds of millions of files, which could take hours to converge.
  • 15. Chunk Head
    - A chunk is cut when either
      - it fills up to 120 samples, or
      - it spans 2 hours (by default).
    - Since Prometheus v2.19, not all chunks are stored in memory:
      - when a chunk is cut, it is flushed to disk and memory-mapped via mmap.

    https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/
  • 16. Chunk Head → Block
    - After some time, the chunks meet the threshold:
      - when the chunks in the Head span 3 hours,
      - the first 2 hours of chunks (1, 2, 3, 4) are compacted into a block.
    - Meanwhile
      - the WAL is truncated at this point,
      - and a “checkpoint” is created!

    https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/
  • 17. Large Files with “mmap”
    - mmap stands for memory-mapped files. It is a way to read and write files without invoking read/write system calls.
    - It is great when multiple processes access data read-only from the same file:
      - it allows all those processes to share the same physical memory pages, saving a lot of memory,
      - and it also allows the operating system to optimize paging operations.

    [Diagram: a user process accessing a file either via mmap (straight through the page cache) or via read/write system calls; Direct I/O bypasses the page cache to reach the device]

    Why mmap is faster than system calls: https://sasha-f.medium.com/why-mmap-is-faster-than-system-calls-24718e75ab37
  • 18. Write-Ahead Log (WAL)
    - Widely used in relational databases to provide durability (the D in ACID).
    - Persist every state change as a command to an append-only log.
      - Store each state change as a command.
      - A single log is appended sequentially.
      - Each log entry is given a unique identifier.
    - Roll the logs as a Segmented Log.
    - Clean the log with a Low-Water Mark:
      - snapshot based (ZooKeeper & etcd),
      - time based (Kafka).
    - Supports a Singular Update Queue: a work queue drained by a single thread.

    https://martinfowler.com/articles/patterns-of-distributed-systems/wal.html
  • 19. Prometheus WAL & Checkpoint
    - WAL records include the Series and their corresponding Samples.
      - A Series record is written only once, when the series is first seen.
      - A Samples record is written for every write request that contains a sample.
    - WAL truncation – checkpoints
      - Drop all Series records for series which are no longer in the Head.
      - Drop all samples which are before time T.
      - Drop all tombstone records for time ranges before T.
      - Retain the remaining series, samples, and tombstone records in the same order as they appear in the WAL.

      data                        data
      └── wal                     └── wal
          ├── 000000                  ├── checkpoint.000003
          ├── 000001                  │   ├── 000000
          ├── 000002                  │   └── 000001
          ├── 000003                  ├── 000004
          ├── 000004                  └── 000005
          └── 000005

    - WAL replay
      - Replay “checkpoint.X”, then replay WAL segments X+1, X+2, … X+N.
    - WAL compression
      - WAL records are (lightly) compressed with Snappy.
      - Snappy was developed by Google, based on LZ77.
      - It aims for very high speed and reasonable compression, not maximum compression or compatibility.
      - It is widely used by many databases – Cassandra, Couchbase, Hadoop, LevelDB, MongoDB, InfluxDB, ….

    Source code: https://github.com/prometheus/prometheus/tree/master/tsdb/wal
  • 20. Block Compaction
    - Problem
      - When querying multiple blocks, we have to merge their results into an overall result.
      - A week-long query may have to merge 80+ partial blocks.
    - Compaction

      t0             t1          t2           t3           t4           now
      ┌────────────┐ ┌──────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
      │ 1          │ │ 2        │ │ 3         │ │ 4         │ │ 5 mutable │   before
      └────────────┘ └──────────┘ └───────────┘ └───────────┘ └───────────┘

      ┌─────────────────────────────────────────┐ ┌───────────┐ ┌───────────┐
      │ 1              compacted                │ │ 4         │ │ 5 mutable │   after (option A)
      └─────────────────────────────────────────┘ └───────────┘ └───────────┘

      ┌──────────────────────────┐ ┌──────────────────────────┐ ┌───────────┐
      │ 1       compacted        │ │ 3       compacted        │ │ 5 mutable │   after (option B)
      └──────────────────────────┘ └──────────────────────────┘ └───────────┘
  • 21. Retention
    - Example
      - Block 1 can be deleted safely; block 2 has to be kept until it lies fully behind the retention boundary.
    - Block compaction impacts
      - Compaction could make a block too large to delete, so we need to limit the block size:
      - maximum block size = 10% × retention window.

                           |
      ┌────────────┐ ┌────┼─────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
      │ 1          │ │ 2  |     │ │ 3         │ │ 4         │ │ 5         │ . . .
      └────────────┘ └────┼─────┘ └───────────┘ └───────────┘ └───────────┘
                           |
                  retention boundary
  • 26. Index
    - Use an inverted index for the label index.
      - Allocate a unique ID for every series.
      - Looking up a series by this ID is O(1); the ID is the forward index.
    - Construct the labels’ index.
      - If the series with IDs {2, 5, 10, 29} contain app=“nginx”,
      - then the list {2, 5, 10, 29} is the inverted index for the label app=“nginx”.
    - In short
      - The number of labels is significantly smaller than the number of series.
      - Walking through all of the labels is not a problem.

      {                                      ID: 5
        __name__=”requests_total”,
        pod=”nginx-34534242-abc723”,         status=”200”:  1 2 5 ...
        job=”nginx”,                         method=”GET”:  2 3 4 5 6 9 ...
        path=”/api/v1/status”,
        status=”200”,
        method=”GET”,
      }
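A toy version of the forward/inverted index described above (illustrative only; Prometheus' real index is a persisted on-disk structure with compressed postings):

```go
package main

import (
	"fmt"
	"sort"
)

// Index assigns each series an integer ID (the forward index) and maps
// each "name=value" label pair to the sorted list of series IDs that
// carry it (the inverted index, a.k.a. postings).
type Index struct {
	nextID   uint64
	series   map[uint64]map[string]string // forward index: ID -> labels
	postings map[string][]uint64          // inverted index: "name=value" -> IDs
}

func NewIndex() *Index {
	return &Index{
		series:   map[uint64]map[string]string{},
		postings: map[string][]uint64{},
	}
}

// Add registers a series and returns its ID.
func (ix *Index) Add(labels map[string]string) uint64 {
	ix.nextID++
	id := ix.nextID
	ix.series[id] = labels
	for k, v := range labels {
		key := k + "=" + v
		ix.postings[key] = append(ix.postings[key], id)
	}
	return id
}

// Postings returns the sorted series IDs carrying the given label pair.
func (ix *Index) Postings(name, value string) []uint64 {
	ids := append([]uint64(nil), ix.postings[name+"="+value]...)
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	ix := NewIndex()
	ix.Add(map[string]string{"__name__": "requests_total", "app": "nginx"})
	ix.Add(map[string]string{"__name__": "errors_total", "app": "nginx"})
	fmt.Println(ix.Postings("app", "nginx")) // [1 2]
}
```

Keeping the postings lists sorted is what makes the set operations on the next slides cheap.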
  • 27. Set Operations
    - Consider the following query:
      - app=“foo” AND __name__=“requests_total”
    - How do we intersect two inverted-index lists?
    - A classic algorithm interview question:
      - Given two integer arrays, return their intersection.
        - A[] = { 4, 1, 6, 7, 3, 2, 9 }
        - B[] = { 11, 30, 2, 70, 9 }
        - returns { 2, 9 } as their intersection
      - Given two integer arrays, return their union.
        - A[] = { 4, 1, 6, 7, 3, 2, 9 }
        - B[] = { 11, 30, 2, 70, 9 }
        - returns { 4, 1, 6, 7, 3, 2, 9, 11, 30, 70 } as their union
    - Naive time: O(m*n) – no extra space
  • 28. Sort the Array
    - If we sort the arrays:

      __name__="requests_total"  ->  [ 999, 1000, 1001, 2000000, 2000001, 2000002, 2000003 ]
      app="foo"                  ->  [ 1, 3, 10, 11, 12, 100, 311, 320, 1000, 1001, 10002 ]

      intersection               =>  [ 1000, 1001 ]

    - We get an efficient algorithm.
      - O(m+n): one pointer per array.

      while (idx1 < len1 && idx2 < len2) {
          if (a[idx1] > b[idx2]) {
              idx2++
          } else if (a[idx1] < b[idx2]) {
              idx1++
          } else {
              c = append(c, a[idx1])
              idx1++
              idx2++
          }
      }
      return c

    - Series IDs must be easy to sort – using MD5 or a UUID is not a good idea (V2 used hash IDs).
    - Deleting data could force an index rebuild.
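A runnable version of the two-pointer loop on the slide (with both pointers advanced on a match, so the loop always terminates):

```go
package main

import "fmt"

// intersectSorted merges two ascending ID lists in O(m+n) time using
// one pointer per list.
func intersectSorted(a, b []uint64) []uint64 {
	var c []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default: // equal: part of the intersection, advance both sides
			c = append(c, a[i])
			i++
			j++
		}
	}
	return c
}

func main() {
	name := []uint64{999, 1000, 1001, 2000000, 2000001, 2000002, 2000003}
	app := []uint64{1, 3, 10, 11, 12, 100, 311, 320, 1000, 1001, 10002}
	fmt.Println(intersectSorted(name, app)) // [1000 1001]
}
```

The same two-pointer walk, emitting on every step instead of only on matches, yields the union in O(m+n) as well.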
  • 30. Benchmark – Memory
    - Heap memory usage in GB.
    - Prometheus 2.0’s memory consumption is reduced by 3–4x.
  • 31. Benchmark – CPU
    - CPU usage in cores/second.
    - Prometheus 2.0 needs 3–10 times fewer CPU resources.
  • 32. Benchmark – Disk Writes
    - Disk writes in MB/second.
    - Prometheus 2.0 saves 97–99% of disk writes.
    - Prometheus 1.5 is prone to wearing out SSDs.
  • 33. Benchmark – Query Latency
    - Query P99 latency in seconds.
    - With Prometheus 1.5, query latency increases over time as more series are stored.
  • 34. Facebook Paper
    - Gorilla: A fast, scalable, in-memory time series database
    [Figure credit: TimeScale]
  • 35. Gorilla Requirements
    - 2 billion unique time series, identified by a string key.
    - 700 million data points (timestamp and value) added per minute.
    - Store data for 26 hours.
    - More than 40,000 queries per second at peak.
    - Reads succeed in under one millisecond.
    - Support time series with 15-second granularity (4 points per minute per time series).
    - Two in-memory, not co-located replicas (for disaster-recovery capacity).
    - Always serve reads, even when a single server crashes.
    - Ability to quickly scan over all in-memory data.
    - Support at least 2x growth per year.
    - 85% of queries are for the latest 26 hours of data.
  • 36. Key Technology
    - Simple data model – (string key, int64 timestamp, double value)
    - In memory – low latency
    - High data compression ratio – saves 90% of space
    - Cache first, then disk – accepts possible data loss
    - Stateless – easy to scale
      - hash(key) → shard → node
  • 37. Fundamentals
    - Delta encoding (aka delta compression)
      - https://en.wikipedia.org/wiki/Delta_encoding
    - Examples
      - HTTP – RFC 3229 “Delta encoding in HTTP”
      - rsync – delta file copying
      - Online backup
      - Version control
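A minimal delta encode/decode pair illustrating the idea: timestamps at a fixed scrape interval collapse into a run of identical small deltas, which compress far better than the raw values:

```go
package main

import "fmt"

// deltaEncode stores the first value verbatim and then only the
// difference to the previous value.
func deltaEncode(xs []int64) []int64 {
	if len(xs) == 0 {
		return nil
	}
	out := make([]int64, len(xs))
	out[0] = xs[0]
	for i := 1; i < len(xs); i++ {
		out[i] = xs[i] - xs[i-1]
	}
	return out
}

// deltaDecode reverses the encoding by accumulating the deltas.
func deltaDecode(ds []int64) []int64 {
	if len(ds) == 0 {
		return nil
	}
	out := make([]int64, len(ds))
	out[0] = ds[0]
	for i := 1; i < len(ds); i++ {
		out[i] = out[i-1] + ds[i]
	}
	return out
}

func main() {
	// A 15-second scrape interval: the deltas after the first value
	// are all 15 and fit in a few bits each.
	ts := []int64{1434317560, 1434317575, 1434317590, 1434317605}
	fmt.Println(deltaEncode(ts)) // [1434317560 15 15 15]
}
```

Gorilla goes one step further on the next slide: it delta-encodes the deltas themselves (delta-of-delta), so a perfectly regular stream collapses to a run of zeros.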
  • 39. Compression Algorithm
    - Compress timestamps (delta-of-delta): D = (tₙ − tₙ₋₁) − (tₙ₋₁ − tₙ₋₂)
      - D = 0: store a single ‘0’ bit
      - D ∈ [-63, 64]: store ‘10’ followed by the value (7 bits)
      - D ∈ [-255, 256]: store ‘110’ followed by the value (9 bits)
      - D ∈ [-2047, 2048]: store ‘1110’ followed by the value (12 bits)
      - otherwise: store ‘1111’ followed by D (32 bits)
    - Compress values (double float): X = Vᵢ XOR Vᵢ₋₁
      - X = 0: store a single ‘0’ bit
      - X ≠ 0: first count the leading zeros and trailing zeros of the XOR value, and store ‘1’ as the first bit.
        - If the counts of leading and trailing zeros equal those of the previous XOR value, store ‘0’ as the second bit, followed by the meaningful XOR bits (leading and trailing zeros stripped).
        - If they differ, store ‘1’ as the second bit, then 5 bits for the number of leading zeros, then 6 bits for the length of the meaningful XOR bits, and finally the meaningful XOR bits themselves (this case carries at least 13 bits of overhead).
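A back-of-the-envelope sketch of the timestamp table above, counting the bits the scheme would spend per sample (not a real bitstream encoder):

```go
package main

import "fmt"

// dodBits returns how many bits the delta-of-delta timestamp scheme
// spends on one sample, given D = (tn - tn-1) - (tn-1 - tn-2),
// following the ranges on the slide.
func dodBits(d int64) int {
	switch {
	case d == 0:
		return 1 // '0'
	case d >= -63 && d <= 64:
		return 2 + 7 // '10' + 7-bit value
	case d >= -255 && d <= 256:
		return 3 + 9 // '110' + 9-bit value
	case d >= -2047 && d <= 2048:
		return 4 + 12 // '1110' + 12-bit value
	default:
		return 4 + 32 // '1111' + raw 32-bit delta-of-delta
	}
}

// compressedBits sums the cost of a timestamp stream; the first two
// timestamps are stored outside this scheme and are ignored here.
func compressedBits(ts []int64) int {
	total := 0
	for i := 2; i < len(ts); i++ {
		d := (ts[i] - ts[i-1]) - (ts[i-1] - ts[i-2])
		total += dodBits(d)
	}
	return total
}

func main() {
	// A perfectly regular 15s scrape interval: every delta-of-delta is
	// zero, so each timestamp after the second costs a single bit.
	ts := []int64{0, 15, 30, 45, 60, 75}
	fmt.Println(compressedBits(ts)) // 4
}
```

This is why Gorilla (and Prometheus' chunk encoding, which adapts it) averages well under 2 bytes per 16-byte raw sample on regular scrape streams.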
  • 40. Sample Compression
    - Raw: 16 bytes/sample
    - Compressed: 1.37 bytes/sample
  • 41. Open Source Implementations
    - Golang
      - https://github.com/dgryski/go-tsz
    - Java
      - https://github.com/burmanm/gorilla-tsc
      - https://github.com/milpol/gorilla4j
    - Rust
      - https://github.com/jeromefroe/tsz-rs
      - https://github.com/mheffner/rust-gorilla-tsdb
  • 43. References
    - Writing a Time Series Database from Scratch, by Fabian Reinartz: https://fabxc.org/tsdb/
    - Gorilla: A Fast, Scalable, In-Memory Time Series Database: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
    - TSDB format: https://github.com/prometheus-junkyard/tsdb/blob/master/docs/format/README.md
    - PromCon 2017: Storing 16 Bytes at Scale – Fabian Reinartz
      - video: https://www.youtube.com/watch?v=b_pEevMAC3I
      - slides: https://promcon.io/2017-munich/slides/storing-16-bytes-at-scale.pdf
    - Ganesh Vernekar’s blog – Prometheus TSDB
      - (Part 1): The Head Block: https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block
      - (Part 2): WAL and Checkpoint: https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint
      - (Part 3): Memory Mapping of Head Chunks from Disk: https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk
      - (Part 4): Persistent Block and its Index: https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index
      - (Part 5): Queries: https://ganeshvernekar.com/blog/prometheus-tsdb-queries
    - Time-series compression algorithms, explained: https://blog.timescale.com/blog/time-series-compression-algorithms-explained/