I'm one of the maintainers of VictoriaMetrics, an open source time series database written in Go. It is used for APM and Kubernetes monitoring. An average VictoriaMetrics installation processes 2-4 million samples/s on the ingestion path and 20-40 million samples/s on the read path. The biggest installations ingest more than 100 million samples/s in a single cluster. Keeping this efficient and scalable requires being clever with data processing. In this talk, I'll cover the following optimizations for keeping the database fast:
1. String interning for lowering GC pressure. We use string interning for storing time series metadata (aka labels). However, this approach has the downside of increased memory usage. When is it worth using string interning?
2. Metadata processing may require many regular expression matches and string modification operations. Caching the results of such operations saves CPU, but can increase memory usage. Which operations should be cached and which should not?
3. Limiting the number of concurrently running goroutines with CPU-bound load to the number of available CPU cores. This helps to control memory usage during load spikes (a frequent event in monitoring). The limit also improves the processing speed of each goroutine, since it reduces the number of context switches. The downside of the approach is its complexity: it is easy to make a mistake and end up with a deadlock or inefficient resource utilization.
4. A better understanding of `sync.Pool`. For us, `sync.Pool` works best in CPU-bound code, while in IO-bound code it leads to excessive memory usage. CPU-bound code has short ownership over the objects retrieved from the pool. Combined with point 3 (a limited number of goroutines running CPU-bound code), it gives the most efficient processing speed and memory usage, since the chance of getting a "hot" object from the pool is much higher.
1. Writing a TSDB from scratch
performance optimizations
Roman Khavronenko | github.com/hagen1778
2. Roman Khavronenko
Co-founder of VictoriaMetrics
Software engineer with experience in distributed systems,
monitoring and high-performance services.
https://github.com/hagen1778
https://twitter.com/hagen1778
10. Workload pattern for TSDB
● TSDBs process tremendous amounts of data
● They are usually write-heavy applications, optimized for ingestion
● Read load is usually much lower than write load
● Read queries are sporadic and unpredictable
11. How to deal with such workload?
System design oriented for time series data:
1. Log Structured Merge (LSM) data structure
2. Data for each column is stored separately
3. Append-only writes
12. How to deal with such workload?
And some more non-design-specific optimizations:
1. String interning
2. Function results caching
3. Concurrency limiting for CPU-bound operations
4. Sync pool for CPU-bound operations
17. String interning: naive implementation

var internStringsMap = make(map[string]string)

func intern(s string) string {
    m := internStringsMap
    if v, ok := m[s]; ok {
        return v
    }
    m[s] = s
    return s
}
23. String interning: sync.Map
sync.Map is optimized for two common use cases:
1. When the entry for a given key is only ever written once but read
many times
2. When multiple goroutines read, write, and overwrite entries for
disjoint sets of keys.
In these two cases, use of a Map reduces lock contention
and improves performance compared to a Go map paired with a
separate Mutex or RWMutex.
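A minimal sync.Map-based intern, assuming the LoadOrStore pattern the quoted doc suggests (the function and variable names here are ours, not from the deck):

```go
package main

import (
	"fmt"
	"sync"
)

// internStringsMap caches interned strings; sync.Map fits here
// because each key is written once and read many times.
var internStringsMap sync.Map

// intern returns a canonical copy of s, so identical strings
// share a single cached value across goroutines.
func intern(s string) string {
	if v, ok := internStringsMap.Load(s); ok {
		return v.(string)
	}
	// LoadOrStore resolves the race when two goroutines intern
	// the same string at once: only one stored value wins.
	v, _ := internStringsMap.LoadOrStore(s, s)
	return v.(string)
}

func main() {
	fmt.Println(intern("job") == intern("job")) // true
}
```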
27. String interning: gotchas
1. Map will grow over time:
a. Rotate maps once in a while
b. Add TTL logic to purge cold entries
2. Sanity check of arguments:
a. At some point, someone will try to intern byte slice or substring:
*(*string)(unsafe.Pointer(&b)) or str[:n]
b. Make sure to clone received strings:
strings.Clone(s)
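A sketch combining the gotchas above in a single helper; the rotation threshold and the Range-based purge are our assumptions, not the actual VictoriaMetrics implementation:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"sync/atomic"
)

var (
	internMap     sync.Map
	internedCount atomic.Int64
)

const maxInterned = 100000 // hypothetical rotation threshold

func intern(s string) string {
	if v, ok := internMap.Load(s); ok {
		return v.(string)
	}
	// Gotcha 2b: clone, so an unsafe []byte cast or a substring
	// doesn't pin its parent buffer in the map forever.
	s = strings.Clone(s)
	if v, loaded := internMap.LoadOrStore(s, s); loaded {
		return v.(string)
	}
	// Gotcha 1a: rotate - drop all entries once the map grows
	// beyond the threshold, so cold entries don't accumulate.
	if internedCount.Add(1) > maxInterned {
		internMap.Range(func(k, _ any) bool {
			internMap.Delete(k)
			return true
		})
		internedCount.Store(0)
	}
	return s
}

func main() {
	b := []byte("instance")
	fmt.Println(intern(string(b)) == intern("instance")) // true
}
```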
28. String interning: summary
● We use string interning for storing time series metadata (aka labels).
● It helps to reduce memory usage during metadata parsing
● Interning works best for read-intensive workloads with a limited
number of variants and a high hit rate
31. Function results caching: caching Transformer

type Transformer struct {
    m             sync.Map
    transformFunc func(s string) string
}
32. Function results caching: caching Transformer

func (t *Transformer) Transform(s string) string {
    v, ok := t.m.Load(s)
    if ok {
        // Fast path - the transformed s is found in the cache.
        return v.(string)
    }
    // Slow path - transform s and store it in the cache.
    sTransformed := t.transformFunc(s)
    t.m.Store(s, sTransformed)
    return sTransformed
}
33. Function results caching: example

// SanitizeName replaces chars unsupported by Prometheus
// in metric names and label names with _.
func SanitizeName(name string) string {
    return promSanitizer.Transform(name)
}

var promSanitizer = NewTransformer(func(s string) string {
    return unsupportedPromChars.ReplaceAllString(s, "_")
})
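The deck uses NewTransformer and unsupportedPromChars without defining them; a plausible self-contained version follows, where the constructor and the regex are our assumptions based on Prometheus naming rules:

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// Transformer caches results of transformFunc, as on the slides.
type Transformer struct {
	m             sync.Map
	transformFunc func(s string) string
}

// NewTransformer is a plausible constructor (not shown in the deck).
func NewTransformer(fn func(s string) string) *Transformer {
	return &Transformer{transformFunc: fn}
}

func (t *Transformer) Transform(s string) string {
	if v, ok := t.m.Load(s); ok {
		return v.(string) // fast path: cached result
	}
	r := t.transformFunc(s)
	t.m.Store(s, r)
	return r
}

// unsupportedPromChars matches characters that are invalid in
// Prometheus metric and label names (an assumption).
var unsupportedPromChars = regexp.MustCompile(`[^a-zA-Z0-9_:]`)

var promSanitizer = NewTransformer(func(s string) string {
	return unsupportedPromChars.ReplaceAllString(s, "_")
})

func main() {
	fmt.Println(promSanitizer.Transform("http.requests-total"))
	// http_requests_total
}
```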
34. Function results caching: summary
● Helps to save CPU time at the cost of increased memory usage
● Works best for heavy use of string transforms, regex matching, etc.
● And when the number of arguments and their variants is limited
● Doesn't work well when the number of transformations is unlimited or
inconsistent - like query processing
38. Limiting concurrency for CPU intensive operations
+ Makes the system more stable and efficient
+ Helps to control the memory usage on load spikes (which is expected in
monitoring)
+ Improves the processing speed of each goroutine by reducing the number
of context switches
- The downside is complexity - it is easy to make a mistake and end up with
a deadlock or inefficient resource utilization.
39. Limited concurrency: workers

var concurrencyLimit = runtime.NumCPU()

func main() {
    workCh := make(chan work, concurrencyLimit*2)
    for i := 0; i < concurrencyLimit; i++ {
        go func() {
            for {
                processData(<-workCh)
            }
        }()
    }
    // ... produce work items into workCh ...
}
40. Limited concurrency: workers
+ Workers could have scoped buffers, metrics, etc.
- Code becomes complicated: start and stop procedures for workers
- Additional synchronization to distribute work via channels
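The start/stop plumbing the slide alludes to can be sketched with a WaitGroup and a channel close as the stop signal (the work type and counts here are illustrative, not from the deck):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

type work struct{ id int }

// run pushes n work items through a pool of NumCPU workers and
// returns how many were processed.
func run(n int) int64 {
	var processed atomic.Int64
	concurrencyLimit := runtime.NumCPU()
	workCh := make(chan work, concurrencyLimit*2)

	var wg sync.WaitGroup
	for i := 0; i < concurrencyLimit; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// The loop ends when workCh is closed - the stop signal.
			for w := range workCh {
				_ = w // stand-in for CPU-bound processData(w)
				processed.Add(1)
			}
		}()
	}

	for i := 0; i < n; i++ {
		workCh <- work{id: i}
	}
	close(workCh) // stop: no more work will be sent
	wg.Wait()     // wait for workers to drain the channel
	return processed.Load()
}

func main() {
	fmt.Println(run(100)) // 100
}
```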
41. Limited concurrency: channel

var concurrencyLimitCh = make(chan struct{}, runtime.NumCPU())

// This function is CPU-bound and may allocate a lot of memory.
// We limit the number of concurrent calls to limit memory
// usage under high load without sacrificing the performance.
func processData(src, dst []byte) error {
    concurrencyLimitCh <- struct{}{}
    defer func() {
        <-concurrencyLimitCh
    }()
    // heavy processing...
}
42. Limited concurrency: summary
● Works best for CPU-bound operations
● Helps to bound resource usage and process work at optimal speed
instead of wasting resources on context switches
● Helps to prevent excessive memory usage during load spikes
● Do not apply the limit to IO-bound (disk, network) operations
44. sync.Pool is widely used in VM
grep -r "sync.Pool" ./app ./lib | wc -l
118
grep -r "bytesutil.ByteBufferPool" ./app ./lib | wc -l
34
45. sync.Pool for CPU bound operations in one thread
● All processed on a single CPU core
● No object stealing
● Lower number of objects allocated, better pool utilization
● Lower GC pressure
46. sync.Pool for synchronous processing
● Object is retrieved, used and released by different goroutines
● High chances for goroutines to be scheduled to different threads
● High chances of object stealing
47. sync.Pool for IO bound operations
● Objects retrieved from sync.Pool are held for the duration of IO operations
● IO operations are slow and sporadic,
● so sync.Pool can allocate a big number of objects, resulting in
uncontrolled memory usage
● Higher pressure on the GC
48. sync.Pool - lib/bytesutil

type ByteBufferPool struct {
    p sync.Pool
}

// Verify ByteBuffer implements the given interfaces.
var (
    _ io.Writer           = &ByteBuffer{}
    _ fs.MustReadAtCloser = &ByteBuffer{}
    _ io.ReaderFrom       = &ByteBuffer{}
)
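The later slides call bbPool.Get() and bbPool.Put() without showing them; a minimal self-contained sketch follows, where the ByteBuffer layout and Reset are assumptions based on the deck, not the real lib/bytesutil code:

```go
package main

import (
	"fmt"
	"sync"
)

// ByteBuffer wraps a reusable byte slice (minimal stand-in for
// the type from VictoriaMetrics' lib/bytesutil).
type ByteBuffer struct {
	B []byte
}

// Reset empties the buffer while keeping its capacity for reuse.
func (bb *ByteBuffer) Reset() { bb.B = bb.B[:0] }

type ByteBufferPool struct {
	p sync.Pool
}

// Get returns a buffer from the pool, allocating one if empty.
func (bbp *ByteBufferPool) Get() *ByteBuffer {
	if v := bbp.p.Get(); v != nil {
		return v.(*ByteBuffer)
	}
	return &ByteBuffer{}
}

// Put resets the buffer and returns it to the pool.
func (bbp *ByteBufferPool) Put(bb *ByteBuffer) {
	bb.Reset()
	bbp.p.Put(bb)
}

func main() {
	var bbPool ByteBufferPool
	bb := bbPool.Get()
	bb.B = append(bb.B, "hello"...)
	fmt.Println(len(bb.B)) // 5
	bbPool.Put(bb)
}
```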
50. sync.Pool - lib/bytesutil

bb := bbPool.Get() // acquire from pool
bb.B, err = DecompressZSTD(bb.B[:0], src)
if err != nil {
    return nil, fmt.Errorf("cannot decompress: %w", err)
}
// unmarshal from buffer to dst
dst, err = unmarshalInt64NearestDelta(dst, bb.B)
bbPool.Put(bb) // release to pool
51. ByteBuffer pool issues
1. sync.Pool assumes all entries it contains are "the same"
2. In the real world, byte buffers usually have different sizes
3. Mixing big and small byte buffers in a single pool can result in:
   a. Excessive memory usage
   b. Suboptimal object reuse
53. Leveled (bucketized) bytebuffer pool

// pools contains pools for byte slices of various capacities.
//
// pools[0] is for capacities from 0 to 8
// pools[1] is for capacities from 9 to 16
// pools[2] is for capacities from 17 to 32
// ...
// pools[n] is for capacities from 2^(n+2)+1 to 2^(n+3)
//
// Limit the maximum capacity to 2^18, since there are no performance
// benefits in caching byte slices with bigger capacities.
var pools [17]sync.Pool
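The bucket index for a given capacity can be derived from the bit length of the size. A sketch following the scheme in the comment above; the real VictoriaMetrics code may differ:

```go
package main

import (
	"fmt"
	"math/bits"
)

// poolIndex maps a desired capacity to a bucket: pools[0] covers
// 0..8, pools[1] covers 9..16, pools[n] covers 2^(n+2)+1..2^(n+3).
func poolIndex(size int) int {
	if size <= 8 {
		return 0
	}
	idx := bits.Len(uint(size-1)) - 3
	if idx > 16 {
		idx = 16 // clamp; in practice, huge slices may skip pooling
	}
	return idx
}

func main() {
	for _, n := range []int{8, 9, 16, 17, 32, 33} {
		fmt.Println(n, poolIndex(n)) // 0, 1, 1, 2, 2, 3
	}
}
```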
54. Leveled (bucketized) bytebuffer pool

func (sw *scrapeWork) scrape() {
    body := leveledbytebufferpool.Get(sw.prevBodyLen)
    body.B = sw.ReadData(body.B[:0])
    sw.processScrapedData(body)
    leveledbytebufferpool.Put(body)
}
56. Summary
1. String interning for reducing GC pressure and memory usage in
read-intensive workloads
2. Function results caching for reducing CPU usage during string
transformations
3. Concurrency limiting for better performance and predictable
memory usage
4. sync.Pool for reducing GC pressure and improving the performance
of CPU-bound operations