I'm one of the maintainers of VictoriaMetrics, an open source time series database written in Go. It is used for APM and Kubernetes monitoring. An average VictoriaMetrics installation processes 2-4 million samples/s on the ingestion path and 20-40 million samples/s on the read path. The biggest installations ingest more than 100 million samples/s in a single cluster. Keeping this efficient and scalable requires being clever with data processing. In this talk, I'll cover the following optimizations for keeping the database fast:
1. String interning for lowering GC pressure. We use string interning for storing time series metadata (aka labels). However, this approach has the downside of increased memory usage. When is it worth using string interning?
2. Metadata processing may require many regular expression matches and string modification operations. Caching the results of such operations saves CPU, but can increase memory usage. Which operations should be cached and which should not?
3. Limiting the number of concurrently running goroutines with CPU-bound load to the number of available CPU cores. This helps to control memory usage during load spikes (a frequent event in monitoring). The limit also improves the processing speed of each goroutine, since it reduces the number of context switches. The downside of the approach is its complexity: it is easy to make a mistake and end up with a deadlock or inefficient resource utilization.
4. A better understanding of `sync.Pool`. For us, `sync.Pool` works best in CPU-bound code, while in IO-bound code it leads to excessive memory usage. CPU-bound code has short ownership over the objects retrieved from the pool. Combined with point 3 (a limited number of goroutines running CPU-bound code), it gives the most efficient processing speed and memory usage, since the chance of getting a "hot" object from the pool is much higher.
1. Writing a TSDB from scratch
performance optimizations
Roman Khavronenko | github.com/hagen1778
2. Roman Khavronenko
Co-founder of VictoriaMetrics
Software engineer with experience in distributed systems,
monitoring and high-performance services.
https://github.com/hagen1778
https://twitter.com/hagen1778
10. Workload pattern for TSDB
● TSDBs process tremendous amounts of data
● They are usually write-heavy applications, optimized for ingestion
● Read load is usually much lower than write load
● Read queries are sporadic and unpredictable
11. How to deal with such workload?
System design oriented for time series data:
1. Log Structured Merge (LSM) data structure
2. Data for each column is stored separately
3. Append-only writes
12. How to deal with such workload?
And some more non-design-specific optimizations:
1. String interning
2. Function results caching
3. Concurrency limiting for CPU-bound operations
4. Sync pool for CPU-bound operations
17. String interning: naive implementation

var internStringsMap = make(map[string]string)

func intern(s string) string {
    m := internStringsMap
    if v, ok := m[s]; ok {
        return v
    }
    m[s] = s
    return s
}
23. String interning: sync.Map
sync.Map is optimized for two common use cases:
1. When the entry for a given key is only ever written once but read
many times
2. When multiple goroutines read, write, and overwrite entries for
disjoint sets of keys.
In these two cases, use of a Map reduces lock contention
and improves performance compared to a Go map paired with a
separate Mutex or RWMutex.
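A minimal sync.Map-based intern, assuming the LoadOrStore pattern the quoted doc suggests (the function and variable names here are ours, not from the deck):

```go
package main

import (
	"fmt"
	"sync"
)

// internStringsMap caches interned strings; sync.Map fits here
// because each key is written once and read many times.
var internStringsMap sync.Map

// intern returns a canonical copy of s, so identical strings
// share a single cached value across goroutines.
func intern(s string) string {
	if v, ok := internStringsMap.Load(s); ok {
		return v.(string)
	}
	// LoadOrStore resolves the race when two goroutines intern
	// the same string at once: only one stored value wins.
	v, _ := internStringsMap.LoadOrStore(s, s)
	return v.(string)
}

func main() {
	fmt.Println(intern("job") == intern("job")) // true
}
```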
27. String interning: gotchas
1. Map will grow over time:
a. Rotate maps once in a while
b. Add TTL logic to purge cold entries
2. Sanity check of arguments:
a. At some point, someone will try to intern byte slice or substring:
*(*string)(unsafe.Pointer(&b)) or str[:n]
b. Make sure to clone received strings:
strings.Clone(s)
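A sketch combining the gotchas above in a single helper; the rotation threshold and the Range-based purge are our assumptions, not the actual VictoriaMetrics implementation:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"sync/atomic"
)

var (
	internMap     sync.Map
	internedCount atomic.Int64
)

const maxInterned = 100000 // hypothetical rotation threshold

func intern(s string) string {
	if v, ok := internMap.Load(s); ok {
		return v.(string)
	}
	// Gotcha 2b: clone, so an unsafe []byte cast or a substring
	// doesn't pin its parent buffer in the map forever.
	s = strings.Clone(s)
	if v, loaded := internMap.LoadOrStore(s, s); loaded {
		return v.(string)
	}
	// Gotcha 1a: rotate - drop all entries once the map grows
	// beyond the threshold, so cold entries don't accumulate.
	if internedCount.Add(1) > maxInterned {
		internMap.Range(func(k, _ any) bool {
			internMap.Delete(k)
			return true
		})
		internedCount.Store(0)
	}
	return s
}

func main() {
	b := []byte("instance")
	fmt.Println(intern(string(b)) == intern("instance")) // true
}
```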
28. String interning: summary
● We use string interning for storing time series metadata (aka labels).
● It helps to reduce memory usage during metadata parsing
● Interning works best for read-intensive workloads with a limited
number of variants and a high hit rate
31. Function results caching: caching Transformer

type Transformer struct {
    m             sync.Map
    transformFunc func(s string) string
}
32. Function results caching: caching Transformer

func (t *Transformer) Transform(s string) string {
    v, ok := t.m.Load(s)
    if ok {
        // Fast path - the transformed s is found in the cache.
        return v.(string)
    }
    // Slow path - transform s and store it in the cache.
    sTransformed := t.transformFunc(s)
    t.m.Store(s, sTransformed)
    return sTransformed
}
33. Function results caching: example

// SanitizeName replaces chars unsupported by Prometheus
// in metric names and label names with _.
func SanitizeName(name string) string {
    return promSanitizer.Transform(name)
}

var promSanitizer = NewTransformer(func(s string) string {
    return unsupportedPromChars.ReplaceAllString(s, "_")
})
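The deck uses NewTransformer and unsupportedPromChars without defining them; a plausible self-contained version follows, where the constructor and the regex are our assumptions based on Prometheus naming rules:

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// Transformer caches results of transformFunc, as on the slides.
type Transformer struct {
	m             sync.Map
	transformFunc func(s string) string
}

// NewTransformer is a plausible constructor (not shown in the deck).
func NewTransformer(fn func(s string) string) *Transformer {
	return &Transformer{transformFunc: fn}
}

func (t *Transformer) Transform(s string) string {
	if v, ok := t.m.Load(s); ok {
		return v.(string) // fast path: cached result
	}
	r := t.transformFunc(s)
	t.m.Store(s, r)
	return r
}

// unsupportedPromChars matches characters that are invalid in
// Prometheus metric and label names (an assumption).
var unsupportedPromChars = regexp.MustCompile(`[^a-zA-Z0-9_:]`)

var promSanitizer = NewTransformer(func(s string) string {
	return unsupportedPromChars.ReplaceAllString(s, "_")
})

func main() {
	fmt.Println(promSanitizer.Transform("http.requests-total"))
	// http_requests_total
}
```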
34. Function results caching: summary
● Helps to save CPU time at the cost of increased memory usage
● Works best for heavy use of string transforms, regex matching, etc.
● And when the number of arguments and their variants is limited
● Doesn't work well when the number of transformations is unlimited or
inconsistent - like query processing
38. Limiting concurrency for CPU intensive operations
+ Makes the system more stable and efficient
+ Helps to control the memory usage on load spikes (which is expected in
monitoring)
+ Improves the processing speed of each goroutine by reducing the number
of context switches
- The downside is complexity - it is easy to make a mistake and end up with
a deadlock or inefficient resource utilization.
39. Limited concurrency: workers

var concurrencyLimit = runtime.NumCPU()

func main() {
    workCh := make(chan work, concurrencyLimit*2)
    for i := 0; i < concurrencyLimit; i++ {
        go func() {
            for {
                processData(<-workCh)
            }
        }()
    }
    // ... produce work items into workCh ...
}
40. Limited concurrency: workers
+ Workers could have scoped buffers, metrics, etc.
- Code becomes complicated: start and stop procedures for workers
- Additional synchronization to distribute work via channels
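The start/stop plumbing the slide alludes to can be sketched with a WaitGroup and a channel close as the stop signal (the work type and counts here are illustrative, not from the deck):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

type work struct{ id int }

// run pushes n work items through a pool of NumCPU workers and
// returns how many were processed.
func run(n int) int64 {
	var processed atomic.Int64
	concurrencyLimit := runtime.NumCPU()
	workCh := make(chan work, concurrencyLimit*2)

	var wg sync.WaitGroup
	for i := 0; i < concurrencyLimit; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// The loop ends when workCh is closed - the stop signal.
			for w := range workCh {
				_ = w // stand-in for CPU-bound processData(w)
				processed.Add(1)
			}
		}()
	}

	for i := 0; i < n; i++ {
		workCh <- work{id: i}
	}
	close(workCh) // stop: no more work will be sent
	wg.Wait()     // wait for workers to drain the channel
	return processed.Load()
}

func main() {
	fmt.Println(run(100)) // 100
}
```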
41. Limited concurrency: channel

var concurrencyLimitCh = make(chan struct{}, runtime.NumCPU())

// This function is CPU-bound and may allocate a lot of memory.
// We limit the number of concurrent calls to limit memory
// usage under high load without sacrificing the performance.
func processData(src, dst []byte) error {
    concurrencyLimitCh <- struct{}{}
    defer func() {
        <-concurrencyLimitCh
    }()
    // heavy processing...
}
42. Limited concurrency: summary
● Works best for CPU-bound operations
● Helps to bound resource usage and process work at optimal speed
instead of wasting resources on context switches
● Helps to prevent excessive memory usage during load spikes
● Do not apply the limit to IO-bound (disk, network) operations
44. sync.Pool is widely used in VM
grep -r "sync.Pool" ./app ./lib | wc -l
118
grep -r "bytesutil.ByteBufferPool" ./app ./lib | wc -l
34
45. sync.Pool for CPU bound operations in one thread
● All processed on a single CPU core
● No object stealing
● Lower number of objects allocated, better pool utilization
● Lower GC pressure
46. sync.Pool for synchronous processing
● Object is retrieved, used and released by different goroutines
● High chances for goroutines to be scheduled to different threads
● High chances of object stealing
47. sync.Pool for IO bound operations
● Objects retrieved from sync.Pool are held for the duration of IO operations
● IO operations are slow and sporadic,
● so sync.Pool can allocate a big number of objects, resulting in
uncontrolled memory usage
● Higher pressure on the GC
48. sync.Pool - lib/bytesutil

type ByteBufferPool struct {
    p sync.Pool
}

// Verify ByteBuffer implements the given interfaces.
var (
    _ io.Writer           = &ByteBuffer{}
    _ fs.MustReadAtCloser = &ByteBuffer{}
    _ io.ReaderFrom       = &ByteBuffer{}
)
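The later slides call bbPool.Get() and bbPool.Put() without showing them; a minimal self-contained sketch follows, where the ByteBuffer layout and Reset are assumptions based on the deck, not the real lib/bytesutil code:

```go
package main

import (
	"fmt"
	"sync"
)

// ByteBuffer wraps a reusable byte slice (minimal stand-in for
// the type from VictoriaMetrics' lib/bytesutil).
type ByteBuffer struct {
	B []byte
}

// Reset empties the buffer while keeping its capacity for reuse.
func (bb *ByteBuffer) Reset() { bb.B = bb.B[:0] }

type ByteBufferPool struct {
	p sync.Pool
}

// Get returns a buffer from the pool, allocating one if empty.
func (bbp *ByteBufferPool) Get() *ByteBuffer {
	if v := bbp.p.Get(); v != nil {
		return v.(*ByteBuffer)
	}
	return &ByteBuffer{}
}

// Put resets the buffer and returns it to the pool.
func (bbp *ByteBufferPool) Put(bb *ByteBuffer) {
	bb.Reset()
	bbp.p.Put(bb)
}

func main() {
	var bbPool ByteBufferPool
	bb := bbPool.Get()
	bb.B = append(bb.B, "hello"...)
	fmt.Println(len(bb.B)) // 5
	bbPool.Put(bb)
}
```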
50. sync.Pool - lib/bytesutil

bb := bbPool.Get() // acquire from pool
bb.B, err = DecompressZSTD(bb.B[:0], src)
if err != nil {
    return nil, fmt.Errorf("cannot decompress: %w", err)
}
// unmarshal from buffer to dst
dst, err = unmarshalInt64NearestDelta(dst, bb.B)
bbPool.Put(bb) // release to pool
51. ByteBuffer pool issues
1. sync.Pool assumes all entries it contains are "the same"
2. In the real world, byte buffers usually have different sizes
3. Mixing big and small byte buffers in a single pool can result in:
   a. Excessive memory usage
   b. Suboptimal object reuse
53. Leveled (bucketized) bytebuffer pool

// pools contains pools for byte slices of various capacities.
//
// pools[0] is for capacities from 0 to 8
// pools[1] is for capacities from 9 to 16
// pools[2] is for capacities from 17 to 32
// ...
// pools[n] is for capacities from 2^(n+2)+1 to 2^(n+3)
//
// Limit the maximum capacity to 2^18, since there are no performance
// benefits in caching byte slices with bigger capacities.
var pools [17]sync.Pool
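The bucket index for a given capacity can be derived from the bit length of the size. A sketch following the scheme in the comment above; the real VictoriaMetrics code may differ:

```go
package main

import (
	"fmt"
	"math/bits"
)

// poolIndex maps a desired capacity to a bucket: pools[0] covers
// 0..8, pools[1] covers 9..16, pools[n] covers 2^(n+2)+1..2^(n+3).
func poolIndex(size int) int {
	if size <= 8 {
		return 0
	}
	idx := bits.Len(uint(size-1)) - 3
	if idx > 16 {
		idx = 16 // clamp; in practice, huge slices may skip pooling
	}
	return idx
}

func main() {
	for _, n := range []int{8, 9, 16, 17, 32, 33} {
		fmt.Println(n, poolIndex(n)) // 0, 1, 1, 2, 2, 3
	}
}
```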
54. Leveled (bucketized) bytebuffer pool

func (sw *scrapeWork) scrape() {
    body := leveledbytebufferpool.Get(sw.prevBodyLen)
    body.B = sw.ReadData(body.B[:0])
    sw.processScrapedData(body)
    leveledbytebufferpool.Put(body)
}
56. Summary
1. String interning for reducing GC pressure and memory usage in
read-intensive workloads
2. Function results caching for reducing CPU usage during string
transformations
3. Concurrency limiting for better performance and predictable
memory usage
4. sync.Pool for reducing GC pressure and improving the performance
of CPU-bound operations