The document discusses techniques for optimizing the performance of log-structured merge trees (LSM trees), which are commonly used as the storage engine in databases. It describes the basic in-place and out-of-place update approaches, and focuses on LSM trees. Key topics covered include compaction strategies like leveled, tiered, and hybrid approaches; techniques like file partitioning to reduce disk usage and improve concurrency; and efficient handling of deletes using tombstone recycling. The goal is to squeeze the most performance out of LSM tree storage engines by applying state-of-the-art techniques.
Scaling ScyllaDB Storage Engine with State-of-Art Compaction
1. Squeezing the Most Out of the Storage Engine with State of the Art Compaction
Raphael S. Carvalho, Software Engineer
2. Raphael Carvalho
■ Syslinux, suite of bootloaders
■ OSv, an operating system for the cloud
■ Seastar, the framework powering ScyllaDB
■ ScyllaDB, the best database in the world
3. “In order to make good use of the computer
resources, one must organize files intelligently,
making the retrieval process efficient.”
The Ubiquitous B-Tree paper, 1979
4. ■ Short & precise definition from aforementioned paper:
■ “allow users to store, update, and recall”
Storage Engines
5. ■ Two approaches for handling updates
■ In-place structure (B+-tree)
Storage Engines
6. ■ Two approaches for handling updates
■ In-place structure (B+-tree)
Storage Engines
(k1,v1)(k2,v2)
7. ■ Two approaches for handling updates
■ In-place structure (B+-tree)
Storage Engines
(k1,v1)(k2,v2)
(k1, v3)
8. ■ Two approaches for handling updates
■ In-place structure (ex: B+-tree)
Storage Engines
(k1,v3)(k2,v2)
9. ■ Two approaches for handling updates
■ Out-of-place structure (ex: LSM-tree)
Storage Engines
10. ■ Two approaches for handling updates
■ Out-of-place structure (ex: LSM-tree)
Storage Engines
(k1,v1)(k2,v2)
11. ■ Two approaches for handling updates
■ Out-of-place structure (ex: LSM-tree)
Storage Engines
(k1,v1)(k2,v2)
(k1,v3)
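The two update approaches can be contrasted with a minimal sketch. Both classes and their names are hypothetical toy models for illustration, not how a B+-tree or LSM-tree is actually implemented.

```python
# Hypothetical toy stores; names are illustrative only.
class InPlaceStore:
    """Overwrites the previous value where it lives (B+-tree style)."""

    def __init__(self):
        self.cells = {}

    def put(self, key, value):
        self.cells[key] = value  # the old value is lost immediately

    def get(self, key):
        return self.cells.get(key)


class OutOfPlaceStore:
    """Appends every update; the newest entry for a key wins (LSM style)."""

    def __init__(self):
        self.log = []

    def put(self, key, value):
        self.log.append((key, value))  # older versions stay until merged away

    def get(self, key):
        for k, v in reversed(self.log):  # scan newest first
            if k == key:
                return v
        return None


in_place = InPlaceStore()
out_of_place = OutOfPlaceStore()
for store in (in_place, out_of_place):
    store.put("k1", "v1")
    store.put("k2", "v2")
    store.put("k1", "v3")
```

Both stores return v3 for k1, but only the out-of-place store still holds the obsolete (k1, v1) entry.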
15. ■ Out-of-place update isn’t new.
■ 1976 paper “Differential files” shows its applicability in the real world
■ “shown to be an efficient method for storing a large and changing
database”
Storage Engines
16. ■ A good analogy is presented in the paper
Storage Engines
17. ■ The Log-Structured Merge-Tree (LSM-Tree)
paper is then published in 1996
Storage Engines
19. Storage Engines
THE LSM-TREE
C1 is T times bigger than C0.
C(K) is T times bigger than C(K-1).
[Figure: component C0 in MEMORY; components C1, C2, ..., Ck on DISK; merge sort between components]
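The merge step between two components can be sketched as follows. This is a simplified illustration that assumes each component is a sorted list of (key, value) pairs; the function name is hypothetical.

```python
def merge_components(newer, older):
    """Merge two key-sorted (key, value) lists; on equal keys the newer value wins."""
    merged = []
    i = j = 0
    while i < len(newer) and j < len(older):
        if newer[i][0] < older[j][0]:
            merged.append(newer[i])
            i += 1
        elif newer[i][0] > older[j][0]:
            merged.append(older[j])
            j += 1
        else:
            merged.append(newer[i])  # drop the obsolete older version
            i += 1
            j += 1
    merged.extend(newer[i:])
    merged.extend(older[j:])
    return merged


c0 = [("k1", "v3"), ("k4", "v4")]
c1 = [("k1", "v1"), ("k2", "v2")]
new_c1 = merge_components(c0, c1)
```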
20. ■ Immutability of LSM tree components (ex: SSTables) simplifies
■ Concurrency control
■ Recovery
Storage Engines
21. Query on LSM Tree
[Figure: a query for k1 against an LSM tree holding (k1, v2) and (k1, v1) across MEMORY and DISK]
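A minimal sketch of such a query, assuming (k1, v2) is the newer entry held in memory and (k1, v1) the older one on disk. Components are modeled as plain dictionaries and the function name is hypothetical.

```python
def lookup(key, memtable, sstables_newest_first):
    """Return the newest value for key, checking memory before disk components."""
    if key in memtable:
        return memtable[key]
    for sstable in sstables_newest_first:
        if key in sstable:
            return sstable[key]  # first hit is the newest on-disk version
    return None


memtable = {"k1": "v2"}    # in memory (assumed newer)
sstables = [{"k1": "v1"}]  # on disk (assumed older)
result = lookup("k1", memtable, sstables)
```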
22. ■ A compaction policy (or strategy) defines the shape of LSM tree
■ Any policy is composed of 4 primitives
■ Trigger (when to compact)
■ File picking policy (which data to compact)
■ Granularity (how much data at once)
■ Layout (how data is laid out)
LSM-tree compaction policy
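One way to picture the four primitives is as fields of a single policy object. The class, the field types, and the toy thresholds below are all assumptions made for illustration, not ScyllaDB's interface.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CompactionPolicy:
    trigger: Callable[[List[int]], bool]          # when to compact
    pick_files: Callable[[List[int]], List[int]]  # which data to compact
    granularity: int                              # how many files at once
    layout: str                                   # how data is laid out


def next_job(policy, file_sizes):
    """Return the file sizes to compact next, or [] if nothing is due."""
    if not policy.trigger(file_sizes):
        return []
    return policy.pick_files(file_sizes)[: policy.granularity]


# Hypothetical toy policy: compact the 4 smallest files once 4 or more exist.
toy = CompactionPolicy(
    trigger=lambda sizes: len(sizes) >= 4,
    pick_files=lambda sizes: sorted(sizes),
    granularity=4,
    layout="tiered",
)
job = next_job(toy, [8, 1, 2, 1, 1])
```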
23. Pure Leveled in Original LSM Design
ONLY 1 COMPONENT PER LEVEL!
[Figure: component C0 in MEMORY; components C1, C2, ..., Ck on DISK; merge sort between components]
27. ■ Partitions the LSM-tree components into (usually fixed-size) fragments
■ Subset of a level can be merged into the next one (partial merge)
■ Bounds:
■ compaction operation time
■ temporary disk space during compaction lifetime
Partitioning Optimization for Leveled
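A partial merge only needs the fragments of the next level whose key range overlaps the fragment being pushed down. A minimal sketch, assuming fragments are described by inclusive (first_key, last_key) integer ranges; the function name is hypothetical.

```python
def overlapping(fragment, next_level):
    """Return the fragments of next_level whose key range intersects fragment."""
    first, last = fragment
    return [(lo, hi) for (lo, hi) in next_level if lo <= last and hi >= first]


level1_fragment = (30, 45)
level2 = [(0, 19), (20, 39), (40, 59), (60, 79)]
inputs = overlapping(level1_fragment, level2)
```

Only two of the four fragments in the next level take part in the merge, which is what bounds both the operation time and the temporary disk space.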
30. Leveled Policy - Cost Analysis
■ Let T be the size ratio between adjacent levels
■ Let L be the number of levels for a given LSM tree
■ Write amplification: O(T * L)
■ Space amplification: O((T + 1) / T) = ~1.1 (for T = 10)
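The bounds above can be turned into rough numbers. The sketch below simply evaluates the formulas with constants dropped, so the results are orders of magnitude rather than measurements; T = 10 and L = 4 are example values chosen for illustration.

```python
def leveled_write_amp(size_ratio, num_levels):
    """O(T * L): each level may be rewritten up to T times as data moves down."""
    return size_ratio * num_levels


def leveled_space_amp(size_ratio):
    """(T + 1) / T: the last level plus the one above it, relative to the last."""
    return (size_ratio + 1) / size_ratio


write_amp = leveled_write_amp(size_ratio=10, num_levels=4)
space_amp = leveled_space_amp(size_ratio=10)
```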
31. Stepped-Merge Algorithm
■ 1997 paper “Incremental organization for data recording and
warehousing” -> a new approach to LSM tree layout
■ “Our goal is to design a technique that supports both insertion and
queries with reasonable efficiency, and without the delays of periodic
batch processing.”
■ Gives birth to the tiered compaction policy
35. Tiered Policy - Cost Analysis
■ Let T be the size ratio between adjacent levels
■ Let L be the number of levels for a given LSM tree
■ Write amplification: O(L)
■ Space amplification: O(T * L)
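The O(L) write amplification can be checked with a toy simulation of a stepped-merge layout. It assumes unit-sized flushes, no overwrites, and that a level is merged into a single run of the next level as soon as it holds T runs; the function name is hypothetical.

```python
def stepped_merge(num_flushes, size_ratio):
    """Return (levels, units_written) after flushing unit-sized runs."""
    levels = []  # levels[i] is the list of run sizes at level i
    written = 0
    for _ in range(num_flushes):
        written += 1  # the flush itself writes one unit
        carry, i = 1, 0
        while True:
            if i == len(levels):
                levels.append([])
            levels[i].append(carry)
            if len(levels[i]) < size_ratio:
                break
            carry = sum(levels[i])  # merge all T runs of this level
            levels[i] = []
            written += carry        # the merged run is written once
            i += 1
    return levels, written


levels, written = stepped_merge(num_flushes=16, size_ratio=4)
write_amp = written / 16
```

With 16 flushes and T = 4, every unit is written once per level it passes through, giving a write amplification of 3 for 3 levels.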
36. Now the ScyllaDB journey begins
The database inherited all the LSM-tree
improvements described so far…
But they weren’t enough
42. Tiered Policy - Partitioning Optimization
■ Bounds temporary space overhead significantly
■ Allows disk space utilization to rise from 50% to 80% and beyond.
■ Available in ScyllaDB as Incremental Compaction Strategy (ICS)
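A toy model, not ScyllaDB's implementation, of why fragments bound temporary space. It assumes (hypothetically) that each input fragment can be deleted as soon as an equally sized output fragment has been written, and that the output is as large as the input.

```python
def peak_extra_space(input_fragments):
    """Peak extra disk space while an input fragment and its output coexist."""
    extra = peak = 0
    for size in input_fragments:
        extra += size            # output fragment written next to its input
        peak = max(peak, extra)
        extra -= size            # exhausted input fragment is deleted
    return peak


unpartitioned = peak_extra_space([10])      # one 10-unit file
partitioned = peak_extra_space([1] * 10)    # ten 1-unit fragments
```

Under these assumptions the extra space drops from the size of the whole input to the size of one fragment.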
43. LSM tree - Efficiency Space
[Figure: spectrum from SPACE OPTIMIZED to WRITE OPTIMIZED]
44. LSM tree - Efficiency Space
[Figure: spectrum from SPACE OPTIMIZED to WRITE OPTIMIZED, with PURE LEVELED marked]
45. LSM tree - Efficiency Space
[Figure: spectrum from SPACE OPTIMIZED to WRITE OPTIMIZED, with PURE LEVELED and PURE TIERED marked]
46. But the world is not only black and white
There are shades of gray in between…
47. Hybrid LSM-tree data layout
■ Largest level is space optimized
■ Other levels are write optimized
■ Addresses the O(K) space amplification of tiered under overwrite workloads
■ Where K = number of components per level
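The O(K) figure can be illustrated with a small calculation. The sketch models each component as a set of keys and assumes a pure overwrite workload where every component holds the same keys; the function name is hypothetical.

```python
def space_amplification(components):
    """Ratio of stored entries to distinct (live) keys across components."""
    total = sum(len(c) for c in components)
    live = len(set().union(*components))
    return total / live


keys = set(range(100))
tiered_level = [keys, keys, keys, keys]  # K = 4 components, all overwrites
merged_level = [keys]                    # a single space-optimized component

before = space_amplification(tiered_level)
after = space_amplification(merged_level)
```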
48. Hybrid LSM-tree data layout
[Figure: SSTs in levels L0, L1, L2 arranged by FILE SIZE; WRITE OPTIMIZED LEVELS and a SPACE OPTIMIZED LEVEL]
49. Hybrid LSM-tree data layout
[Figure: SSTs in levels L0, L1, L2 arranged by FILE SIZE; WRITE OPTIMIZED LEVELS and a SPACE OPTIMIZED LEVEL]
50. Hybrid LSM - Efficiency Space
[Figure: spectrum from SPACE OPTIMIZED to WRITE OPTIMIZED, with PURE TIERED, PURE LEVELED, and HYBRID marked]
52. Hybrid LSM-tree data layout
■ Reduces space amplification in overwrite-intensive workloads
■ = less space amplification
■ = increased storage density per node
■ = more money in your pocket.
■ Available as the space amplification goal (SAG) option of the Incremental
Compaction Strategy.