Scaling ScyllaDB Storage Engine with State-of-Art Compaction

Squeezing the Most Out of the Storage Engine with State of the Art Compaction
Raphael S. Carvalho, Software Engineer

Raphael Carvalho
■ Syslinux, suite of bootloaders
■ OSv, an operating system for the cloud
■ Seastar, the framework powering ScyllaDB
■ ScyllaDB, the best database in the world

“In order to make good use of the computer
resources, one must organize files intelligently,
making the retrieval process efficient.”
The Ubiquitous B-Tree paper, 1979

■ Short & precise definition from the aforementioned paper:
■ “allow users to store, update, and recall”
Storage Engines

■ Two approaches for handling updates
■ In-place structure (B+-tree)
Storage Engines

■ In-place structure (ex: B+-tree)
Storage Engines
[Diagram, three frames: the store holds (k1,v1)(k2,v2); the update (k1,v3) arrives; the old value is overwritten, leaving (k1,v3)(k2,v2)]

■ Out-of-place structure (ex: LSM-tree)
Storage Engines
[Diagram, three frames: the store holds (k1,v1)(k2,v2); the update (k1,v3) is written to a new location, next to the untouched (k1,v1)(k2,v2)]
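A minimal Python sketch of the two update styles, using the same keys and values as the slides. The `lookup` helper and the variable names are illustrative only; a real engine would not scan a list linearly.

```python
# In-place store: one slot per key, an update overwrites the old value.
in_place = {"k1": "v1", "k2": "v2"}
in_place["k1"] = "v3"

# Out-of-place store: an append-only sequence of (key, value) records.
# An update never touches old records; the newest record for a key wins.
log = [("k1", "v1"), ("k2", "v2")]
log.append(("k1", "v3"))


def lookup(records, key):
    """Scan from newest to oldest and return the first value found."""
    for k, v in reversed(records):
        if k == key:
            return v
    return None
```

Both stores answer `v3` for `k1`, but the out-of-place one still holds the stale `(k1, v1)` record until something cleans it up, which is the job of compaction.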

■ Out-of-place update isn’t new.
■ The 1976 paper “Differential files” shows its applicability in the real world
■ “shown to be an efficient method for storing a large and changing database”
Storage Engines

■ A good analogy is presented in the paper
Storage Engines

■ The Log-Structured Merge-Tree (LSM-Tree) paper was then published in 1996
Storage Engines

Storage Engines
THE LSM-TREE
[Diagram: writes enter C0 in MEMORY; C1, C2, ..., Ck sit on DISK, combined by merge sort]

Storage Engines
THE LSM-TREE
C1 is T times bigger than C0.
C(K) is T times bigger than C(K-1).
[Diagram: C0 in MEMORY; C1, C2, ..., Ck on DISK, combined by merge sort]
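The size rule above is a geometric progression. A small sketch, assuming an arbitrary size unit for C0 (the function name `component_sizes` is made up for this example):

```python
def component_sizes(c0_size, ratio, k):
    """Sizes of C0..Ck when each component is `ratio` times the previous one."""
    return [c0_size * ratio ** i for i in range(k + 1)]


sizes = component_sizes(c0_size=1, ratio=10, k=3)
```

With T = 10 and C0 = 1 unit, the components hold 1, 10, 100 and 1000 units, so the last component dominates the total.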

■ Immutability of LSM tree components (ex: SSTables) simplifies
■ Concurrency control
■ Recovery
Storage Engines

Query on LSM Tree
[Diagram: a Query for k1 reaches components in MEMORY and on DISK; two versions exist, (k1, v2) and (k1, v1)]
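A hedged sketch of the read path the slide illustrates: check the in-memory component first, then the on-disk components from newest to oldest, and stop at the first hit. Plain dicts stand in for the memtable and SSTables here; that is an assumption of this example, not how a real engine stores data.

```python
def query(key, memtable, sstables_newest_first):
    """Return the most recent value for `key`, or None if it is absent."""
    if key in memtable:
        return memtable[key]
    for sstable in sstables_newest_first:
        if key in sstable:
            return sstable[key]
    return None


memtable = {"k1": "v2"}
sstables = [{"k1": "v1", "k2": "x"}]
```

Here `query("k1", memtable, sstables)` returns `v2`: the newer version shadows the older one without the older one being modified.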

■ A compaction policy (or strategy) defines the shape of the LSM tree
■ Any policy is composed of 4 primitives
■ Trigger (when to compact)
■ File picking policy (which data to compact)
■ Granularity (how much data at once)
■ Layout (how data is laid out)
LSM-tree compaction policy
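One way to picture the four primitives is as the fields of a single policy object. This is only a toy model: the class `CompactionPolicy`, its field types, and the example thresholds are assumptions made for illustration, not an API from ScyllaDB or any other engine.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CompactionPolicy:
    trigger: Callable[[List[int]], bool]     # when to compact
    pick: Callable[[List[int]], List[int]]   # which data to compact
    granularity: int                         # how much data at once (max files)
    layout: str                              # how data is laid out

    def next_job(self, file_sizes):
        """Return the file sizes to compact now, or [] if nothing is due."""
        if not self.trigger(file_sizes):
            return []
        return self.pick(file_sizes)[: self.granularity]


# Toy policy: once 4 files exist, compact the 4 smallest ones.
toy = CompactionPolicy(
    trigger=lambda sizes: len(sizes) >= 4,
    pick=lambda sizes: sorted(sizes),
    granularity=4,
    layout="tiered",
)
```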

Pure Leveled in Original LSM Design
ONLY 1 COMPONENT PER LEVEL!
[Diagram: C0 in MEMORY; C1, C2, ..., Ck on DISK, combined by merge sort]

Flexible Leveled in Modern LSM Design
[Diagram: MEMORY above DISK; disk levels L0 and L1]

■ Partitions the LSM-tree components into (usually fixed-size) fragments
■ Subset of a level can be merged into the next one (partial merge)
■ Bounds:
■ compaction operation time
■ temporary disk space during compaction lifetime
Partitioning Optimization for Leveled

Partitioning Optimization for Leveled
[Diagram: MEMORY above DISK levels L1 and L2 along a KEY RANGE axis; each level is partitioned into several SSTs]
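A partial merge only needs the SSTs of the next level whose key range overlaps the SST being pushed down. A minimal sketch, assuming each SST is described by an inclusive `(first_key, last_key)` pair of integers (a simplification chosen for this example):

```python
def overlapping(sst, next_level):
    """Return the SSTs in `next_level` whose key range intersects `sst`."""
    lo, hi = sst
    return [(a, b) for (a, b) in next_level if a <= hi and lo <= b]


l2 = [(0, 9), (10, 19), (20, 29), (30, 39)]
picked = overlapping((12, 25), l2)
```

Only two of the four L2 fragments are rewritten in this case, which is what bounds both the duration of the operation and its temporary disk space.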

Leveled Policy - Cost Analysis
■ Let T be the size ratio between adjacent levels
■ Let L be the number of levels for a given LSM tree
■ Write amplification: O(T * L)
■ Space amplification: O((T + 1) / T) = ~1.1
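To get a feel for the numbers, the two expressions can be evaluated directly. This sketch drops all constant factors, so the results are rough orders of magnitude rather than measurements; the function name is made up for the example. The ~1.1 on the slide is consistent with T = 10.

```python
def leveled_costs(T, L):
    """Evaluate the leveled-policy expressions with constants dropped."""
    write_amp = T * L          # O(T * L)
    space_amp = (T + 1) / T    # O((T + 1) / T)
    return write_amp, space_amp


write_amp, space_amp = leveled_costs(T=10, L=4)
```

For T = 10 and L = 4 this gives a write amplification on the order of 40 and a space amplification of about 1.1.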

Stepped-Merge Algorithm
■ 1997 paper “Incremental organization for data recording and warehousing” -> a new approach to LSM tree layout
■ “Our goal is to design a technique that supports both insertion and queries with reasonable efficiency, and without the delays of periodic batch processing.”
■ Gives birth to the tiered compaction policy

Tiered Compaction Policy
[Diagram, three frames: MEMORY above DISK levels L0 and L1 along a FILE SIZE axis; SSTs accumulate and are then merged into a single larger SST]
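A toy simulation of the tiered idea: SSTs of similar size pile up in a tier, and once T of them exist they are merged into one larger SST in the next tier. The model assumes no overwrites (the merged size is the plain sum of the inputs), and the `flush` function is a hypothetical helper for this sketch.

```python
def flush(tiers, size, T):
    """Add a new SST of `size` to tier 0, then merge any tier holding T SSTs.

    tiers[i] is the list of SST sizes currently sitting in tier i.
    """
    tiers[0].append(size)
    i = 0
    while i < len(tiers):
        if len(tiers[i]) >= T:
            merged = sum(tiers[i])      # assumes no overwritten data
            tiers[i] = []
            if i + 1 == len(tiers):
                tiers.append([])
            tiers[i + 1].append(merged)
        i += 1
    return tiers


tiers = [[]]
for _ in range(16):
    flush(tiers, 1, T=4)
```

After 16 flushes with T = 4 everything has collapsed into a single SST of size 16 in the third tier, and each unit of data was rewritten once per tier it passed through.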

Tiered Policy - Cost Analysis
■ Let T be the size ratio between adjacent levels
■ Let L be the number of levels for a given LSM tree
■ Write amplification: O(L)
■ Space amplification: O(T * L)
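The expressions above can be evaluated the same way as for leveled, again with constants dropped. The formulas are taken from the slide as stated; the function name is an assumption of this sketch.

```python
def tiered_costs(T, L):
    """Evaluate the tiered-policy expressions from the slide, constants dropped."""
    write_amp = L        # O(L)
    space_amp = T * L    # O(T * L), as stated on the slide
    return write_amp, space_amp


tiered_write, tiered_space = tiered_costs(T=10, L=4)
```

For T = 10 and L = 4 the write term is 4, which is T times smaller than the O(T * L) write term given for leveled, while the space term is far larger than leveled's ~1.1.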

Now ScyllaDB journey begins
The database inherited all the LSM-tree
improvements described so far…
But they weren’t enough

Tiered - Temporary Space Problem!
[Diagram, two frames: MEMORY above DISK levels L0 and L1 along a FILE SIZE axis; two SSTs are compacted into a new SST while the inputs are still on disk: 100% TEMP SPACE OVERHEAD]

Partitioning Optimization for Tiered
[Diagram, three frames: MEMORY above DISK levels L0 and L1 along a FILE SIZE axis; each SST is split into fragments, and input fragments are released as output fragments are written]

Tiered Policy - Partitioning Optimization
■ Bounds temporary space overhead significantly
■ Allows disk space utilization to go from 50% to 80% and beyond.
■ Available in ScyllaDB as Incremental Compaction Strategy (ICS)
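A simplified model of why fragments bound temporary space. It assumes the output is as large as the input (no overwrites) and that an equal amount of input is released each time an output fragment is completed; the actual release rules in ICS may differ, and `peak_overhead` is a name invented for this sketch.

```python
def peak_overhead(total_input, fragment):
    """Peak extra disk space used while compacting `total_input` units."""
    live_input, output, peak = total_input, 0, 0
    while live_input > 0:
        step = min(fragment, live_input)
        output += step                       # a new output fragment is written
        peak = max(peak, live_input + output - total_input)
        live_input -= step                   # the matching input is released
    return peak


monolithic = peak_overhead(100, fragment=100)   # inputs kept until the end
incremental = peak_overhead(100, fragment=1)    # inputs released as we go
```

Under these assumptions the monolithic compaction peaks at 100 extra units (100% overhead), while the fragmented one never needs more than one fragment of extra space.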

LSM tree - Efficiency Space
[Diagram, three frames: a spectrum from SPACE OPTIMIZED to WRITE OPTIMIZED; PURE LEVELED sits at the space-optimized end and PURE TIERED at the write-optimized end]

But the world is not only black and white
There are shades of gray in between…

Hybrid LSM-tree data layout
■ Largest level is space optimized
■ Other levels are write optimized
■ Addresses the O(K) space amplification of tiered in overwrite workloads
■ Where K = number of components per level
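A small sketch of where the O(K) comes from. In an overwrite workload each of the K components in a level can hold its own copy of the same keys, so the bytes stored are up to K times the live data. Dicts stand in for SSTables and every entry is counted as one unit; both are simplifications for this example.

```python
def space_amplification(components):
    """Stored entries divided by live (distinct) keys across all components."""
    stored = sum(len(c) for c in components)
    live = len(set().union(*components))
    return stored / live if live else 1.0


keys = range(100)
# K = 4 components, each a full overwrite of the same 100 keys.
tiered_level = [{k: version for k in keys} for version in range(4)]
# After merging into a single component only the newest version remains.
merged_level = [{k: 3 for k in keys}]
```

`space_amplification(tiered_level)` is 4.0 and `space_amplification(merged_level)` is 1.0, which is why keeping the largest level space optimized pays off most.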

[Diagram, two frames: levels L0, L1 and L2 along a FILE SIZE axis; the smaller levels are the WRITE OPTIMIZED LEVELS and the largest is the SPACE OPTIMIZED LEVEL]

Hybrid LSM - Efficiency Space
[Diagram: a spectrum from SPACE OPTIMIZED to WRITE OPTIMIZED, with HYBRID placed between PURE LEVELED and PURE TIERED]

■ Reduces space amplification in overwrite-intensive workloads
■ = less space amplification
■ = increased storage density per node
■ = more money in your pocket.
■ Available as the space amplification goal (SAG) option of Incremental Compaction Strategy.

LSM-tree & tombstones
[Diagram, three frames: MEMORY above DISK levels L0 and L1 along a FILE SIZE axis; KEY A lives in one SST and a TOMBSTONE for KEY A arrives in another]
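A hedged sketch of how a tombstone interacts with compaction. A delete is itself a write: a marker that shadows older data until both meet in the same compaction. The `TOMBSTONE` sentinel and the `merge` helper are illustrative; real engines also check timestamps and grace periods, which this example leaves out.

```python
TOMBSTONE = object()  # sentinel standing in for a deletion marker


def merge(newer, older, drop_tombstones):
    """Merge two SSTables (dicts); entries in `newer` shadow `older`.

    Tombstones may only be dropped when no SSTable outside this merge
    can still hold an older version of the key.
    """
    out = dict(older)
    out.update(newer)
    if drop_tombstones:
        out = {k: v for k, v in out.items() if v is not TOMBSTONE}
    return out


older = {"A": 1, "B": 2}
newer = {"A": TOMBSTONE}
```

Until the SST holding KEY A and the SST holding its tombstone are compacted together, both keep occupying disk space.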

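Suboptimal LSM-tree tombstone handling
[Diagram: MEMORY above DISK levels L0 and L1 along a FILE SIZE axis; two SSTs contain KEY A; GARBAGE COLLECTION]

Efficient LSM-tree tombstone handling
[Diagram: MEMORY above DISK levels L0 and L1 along a FILE SIZE axis; two SSTs contain KEY A; GARBAGE COLLECTION]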

Efficient LSM-tree tombstone handling
■ Piggyback on incremental compaction to bound temporary disk space.
■ Triggers (to avoid write amplification issues):
■ File staleness
■ Tombstone density threshold
■ Available in Incremental Compaction Strategy (ICS) by default.
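The two triggers can be pictured as a simple predicate. Everything here is an assumption made for illustration: the function name, the parameter names, the default values, and the choice to require both conditions at once are not taken from the talk or from ScyllaDB's configuration.

```python
def should_collect(age_seconds, tombstones, entries,
                   min_age_seconds=86_400, density_threshold=0.2):
    """Decide whether an SSTable is worth a garbage-collecting compaction.

    Hypothetical rule: the file must be stale enough AND dense enough in
    tombstones, so that the rewrite reclaims space instead of just adding
    write amplification.
    """
    density = tombstones / entries if entries else 0.0
    return age_seconds >= min_age_seconds and density >= density_threshold
```

A fresh file, or an old file with only a handful of tombstones, is left alone under this rule.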

Thank You
Stay in Touch
Raphael Carvalho
raphaelsc@scylladb.com
@raphael_scarv
raphaelsc
