How Incremental Compaction Reduces Your Storage Footprint
Benny Halevy, Core Storage Group Manager
Presenter
Benny Halevy, Core Storage Group Manager
■ Leads the storage software development team at ScyllaDB.
■ Working on operating systems and distributed file systems for
over 20 years.
■ Before Scylla, led software development for GSI Technology,
providing a hardware/software solution for deep learning and
similarity search using in-memory computing technology.
■ Previously co-founded Tonian (later acquired by Primary Data)
and led it as CTO, developing a distributed file server based on
the pNFS protocol delivering highly scalable performance and
dynamic, out-of-band data placement control.
■ Before Tonian, lead architect of the pNFS protocol at Panasas.
Introduction
Log-structured Storage and Compaction Fundamentals
Log-structured Writes
■ Changes to the data are:
● First, recorded in memory, then
● Flushed into SSTables.
■ Updates accumulate over time
● in different SSTables
● Having several versions of the same cell on disk is called
“space amplification”
[Diagram: updates enter the MemTable, which is flushed into SSTables]
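The write path above can be sketched in a few lines of Python. This is a toy model, not ScyllaDB code: the `Table` class, its `flush_threshold` parameter, and the logical `clock` are assumptions made for illustration, and the commit log is left out.

```python
class Table:
    """Toy log-structured table: a mutable memtable plus immutable SSTables."""

    def __init__(self, flush_threshold=2):
        self.memtable = {}      # key -> (timestamp, value), updated in place
        self.sstables = []      # each SSTable is a sorted, immutable tuple
        self.flush_threshold = flush_threshold  # hypothetical flush trigger
        self.clock = 0          # logical write timestamp

    def write(self, key, value):
        self.clock += 1
        self.memtable[key] = (self.clock, value)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Freeze the memtable into a sorted, immutable SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}


table = Table(flush_threshold=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    table.write(key, value)

# The same cell "a" now lives in more than one SSTable.
versions_of_a = sum(1 for sst in table.sstables for k, _ in sst if k == "a")
```

After the four writes, the table holds two SSTables and two on-disk versions of cell "a", which is the accumulation that compaction later cleans up.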
SSTables
■ Immutable
■ Contain changes to data
● A.k.a. mutations
■ Sorted (“Sorted Strings Table”)
■ Have metadata, like:
● Index, Statistics, Filter
🛈 There is no static view of the database
Reading Data
■ Requires reading all relevant SSTables
● Applying the live mutations
● Bloom filter used to locate those
■ Consolidating mutations from many
SSTables is expensive
● We call that “read amplification”
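A minimal sketch of why reads get more expensive as SSTables pile up. It assumes each SSTable is a dict mapping a key to a `(timestamp, value)` pair and uses a hypothetical `TOMBSTONE` sentinel for deletions; a real engine would also consult the memtable and use Bloom filters to skip SSTables that cannot contain the key.

```python
TOMBSTONE = object()  # hypothetical sentinel marking a deleted cell


def read(key, sstables):
    """Return the newest value for key, or None if absent or deleted."""
    newest = None
    for sstable in sstables:  # every relevant SSTable must be consulted
        if key in sstable:
            timestamp, value = sstable[key]
            if newest is None or timestamp > newest[0]:
                newest = (timestamp, value)
    if newest is None or newest[1] is TOMBSTONE:
        return None
    return newest[1]


sstables = [
    {"a": (1, "old"), "c": (2, "x")},
    {"a": (5, "new")},
    {"c": (7, TOMBSTONE)},
]
value_a = read("a", sstables)   # newest version wins
value_c = read("c", sstables)   # shadowed by a newer tombstone
```

The cost of `read` grows with the number of SSTables that hold a version of the key, which is the read amplification described above.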
Why is Compaction Needed?
■ SSTables are immutable
● We can’t just keep writing updates
● Obsolete data needs to be deleted
● Reduce space amplification
■ Data may be scattered around
● We want to consolidate it
● Reduce read amplification
Compaction Fundamentals
1. Compaction first selects a set of SSTables to process,
● based on the Compaction Strategy.
2. It then reads the SSTables, and
● writes the compacted output
● while eliminating overwritten, deleted, and expired data.
3. Finally, once the output SSTables are
sealed and safely persisted to storage,
● the input SSTables can be deleted.
🛈 Note that compaction requires temporary space,
since input SSTables must not be deleted until their compaction completes.
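The three steps can be sketched as follows. This is an illustrative model only: SSTables are plain dicts of key to `(timestamp, value)`, the function names `compact` and `peak_cells` are made up for this example, and only overwrites are eliminated (tombstones and TTLs are ignored here).

```python
def compact(inputs):
    """Merge input SSTables into one output, keeping the newest version of each key."""
    merged = {}
    for sstable in inputs:
        for key, (timestamp, value) in sstable.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return dict(sorted(merged.items()))


def peak_cells(inputs, output):
    """Cells on disk just before the inputs are deleted (inputs and output coexist)."""
    return sum(len(s) for s in inputs) + len(output)


inputs = [
    {"a": (1, "v1"), "b": (2, "v1")},
    {"a": (3, "v2"), "c": (4, "v1")},
]
output = compact(inputs)        # step 2: read the inputs, write the output
peak = peak_cells(inputs, output)
inputs = []                     # step 3: only now can the inputs be deleted
```

In this run the inputs hold 4 cells and the output holds 3, so 7 cells are on disk at the peak even though only 3 remain afterwards. That gap is the temporary space the note refers to.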
Compaction Fundamentals
■ Which mutations can be eliminated?
● Overwritten
● Expired (by TTL)
● Deleted (by tombstone / column deletion)
● Droppable tombstones
[Diagram: input SSTables are compacted into a single output]
● [a] is overwritten by [a’]
● [b] is newly written
● [c] is deleted by [!c]
● [!d] is a live tombstone
● [!z] is a droppable tombstone (poof!)
🛈 Note that tombstones are kept around for gc_grace_seconds
until they are garbage-collected, to prevent data resurrection.
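The elimination rules can be sketched as a small filter. The `Mutation` type, the `purge` function, and the 10-second grace period are assumptions for illustration. The sketch also simplifies two things: it treats the write timestamp of a tombstone as its deletion time in seconds, and it ignores the requirement that a tombstone may only be dropped when it no longer shadows data in SSTables outside the compaction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mutation:
    key: str
    timestamp: int                    # write time, in seconds
    value: Optional[str] = None       # None marks a tombstone
    expires_at: Optional[int] = None  # TTL expiry time, if any


def purge(mutations, now, gc_grace_seconds=10):
    """Keep only the mutations that must survive compaction."""
    newest = {}
    for m in mutations:               # overwritten and deleted cells lose here
        if m.key not in newest or m.timestamp > newest[m.key].timestamp:
            newest[m.key] = m
    survivors = []
    for m in newest.values():
        if m.value is None:
            if now - m.timestamp >= gc_grace_seconds:
                continue              # droppable tombstone
        elif m.expires_at is not None and m.expires_at <= now:
            continue                  # expired by TTL
        survivors.append(m)
    return sorted(survivors, key=lambda m: m.key)


mutations = [
    Mutation("a", 1, "a"), Mutation("a", 5, "a'"),   # [a] overwritten by [a']
    Mutation("b", 5, "b"),                           # newly written
    Mutation("c", 2, "c"), Mutation("c", 95),        # [c] deleted by [!c]
    Mutation("d", 95),                               # live tombstone
    Mutation("z", 1),                                # droppable tombstone
    Mutation("e", 50, "e", expires_at=60),           # expired by TTL
]
result = purge(mutations, now=100)
```

With `now=100`, the tombstones for "c" and "d" are only 5 seconds old and are kept, while the tombstone for "z" is past the grace period and is dropped.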
Legacy Compaction Strategies - STCS
There is a choice of compaction strategies, for different workloads.
ICS is based on the following two common strategies:
■ Size-Tiered Compaction Strategy (STCS)
● STCS organizes SSTables into tiers,
● based on their size,
● on an exponential scale
■ When compacting several SSTables
● A single SSTable is created
● It may be as large as the union of all of them
■ Then it’s moved to the next tier
● Or it may become much smaller due to deletes and
expirations,
● potentially dropping to a lower tier.
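A rough sketch of size-tiering on an exponential scale. The fan-out of 4 and the threshold of 4 SSTables per tier are assumptions for this example, and real STCS implementations group SSTables by size similarity rather than by fixed powers, so treat `tier_of` and `pick_compaction` as hypothetical helpers.

```python
from collections import defaultdict


def tier_of(size, fanout=4):
    """Tier index on an exponential scale: sizes in [fanout**t, fanout**(t+1)) map to t."""
    tier = 0
    while size >= fanout:
        size //= fanout
        tier += 1
    return tier


def pick_compaction(sizes, fanout=4, min_threshold=4):
    """Return (tier, sizes) for the lowest tier holding enough SSTables, else None."""
    tiers = defaultdict(list)
    for size in sizes:
        tiers[tier_of(size, fanout)].append(size)
    for tier in sorted(tiers):
        if len(tiers[tier]) >= min_threshold:
            return tier, tiers[tier]
    return None


sizes = [1, 1, 1, 1, 5, 20]            # SSTable sizes in arbitrary units
tier, chosen = pick_compaction(sizes)
output_size = sum(chosen)              # worst case: nothing was eliminated
next_tier = tier_of(output_size)
```

Here the four size-1 SSTables in tier 0 are compacted into one SSTable of size 4 at most, which lands in tier 1. If overwrites or deletes shrink the output, it may stay in the same tier or drop lower.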
STCS Space Amplification
■ STCS requires space of at least twice the data size
■ This is called Space amplification
■ The main factors are:
● Temporary space: during compaction.
● Accumulation of updates and deletes
across different tiers
Legacy Compaction Strategies - LCS
Leveled Compaction Strategy (LCS)
■ Compaction is triggered when level i has more than 10^i SSTables
■ LCS picks one SSTable from level i, with size X, to compact
■ It then finds the roughly 10 SSTables in the next level
● overlapping with this SSTable
● and compacts all of them together
■ It writes the resulting run
● to the next level
● Run size is bounded by (1+10)*X
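The arithmetic behind these bullets, as a small sketch. The key ranges, the size of 160 units, and the helper names are made up for illustration; only the fan-out of 10 and the (1+10)*X bound come from the slide.

```python
FANOUT = 10


def level_capacity(level):
    """Number of SSTables level i may hold before compaction is triggered."""
    return FANOUT ** level


def overlapping(candidate, next_level):
    """SSTables in the next level whose key range intersects the candidate's."""
    lo, hi = candidate
    return [(first, last) for (first, last) in next_level if first <= hi and lo <= last]


def max_run_size(x, overlap_count=FANOUT):
    """Upper bound on the output run: the candidate plus its overlapping SSTables."""
    return (1 + overlap_count) * x


# Key ranges are (first_key, last_key) pairs; the next level is non-overlapping.
next_level = [(0, 9), (10, 19), (20, 29), (30, 39), (40, 49)]
picked = overlapping((10, 35), next_level)
bound = max_run_size(160)
```

Every compaction of one SSTable of size X can rewrite up to 11 times X, most of it unchanged data from the next level, which is where the higher write amplification of LCS comes from.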
Legacy Compaction Strategies - LCS
■ While LCS limits space amplification,
■ it results in higher write amplification.
Incremental Compaction Strategy
ICS In a Nutshell
■ We observed problems with legacy compaction strategies:
● STCS has high space amplification (and low write amplification)
● LCS has high write amplification (and low space amplification)
■ We wanted to benefit from both approaches
■ By borrowing SSTable Runs from LCS
■ And applying them over size-tiers
🛈 ICS merely replaces
● increasingly larger SSTables with
● increasingly longer SSTable Runs
SSTable Runs
■ Expansion of the SSTable concept
■ Composed of a sorted set of SSTables
■ The SSTables are non-overlapping
● Those are called “Fragments”
[Diagram: a run of non-overlapping fragments covering keys a through z]
🛈 A run is equivalent to
● a large SSTable
● split into several smaller SSTables
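A minimal sketch of the run concept: one sorted key set split into fixed-size fragments, plus a check that the fragments are sorted and disjoint. The helper names and the fragment size of 3 keys are assumptions for this example.

```python
def split_into_run(sorted_keys, fragment_size):
    """Split one large sorted key set into fragments of at most fragment_size keys."""
    return [sorted_keys[i:i + fragment_size]
            for i in range(0, len(sorted_keys), fragment_size)]


def is_run(ranges):
    """True if (first_key, last_key) ranges are sorted and non-overlapping."""
    return all(prev_last < next_first
               for (_, prev_last), (next_first, _) in zip(ranges, ranges[1:]))


fragments = split_into_run(list("abcdefgh"), fragment_size=3)
ranges = [(fragment[0], fragment[-1]) for fragment in fragments]
valid = is_run(ranges)
invalid = is_run([("a", "d"), ("c", "f")])  # overlapping ranges are not a run
```

Because the fragments never overlap, a lookup for one key touches at most one fragment of the run, just as it would touch one large SSTable.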
How Does ICS Work?
■ Remember that:
● Fragments are disjoint
● and sorted with respect to each other
■ So we scan the runs, fragment-by-fragment
■ and compact them incrementally
● While deleting exhausted SSTables as we go
[Diagram: two runs with fragments A, B, ..., Z and a, b, ..., z are compacted pairwise into output fragments A+a, B+b, ...]
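The space benefit of releasing exhausted fragments as we go can be estimated with a small simulation. This is a sketch under stated assumptions, not a model of the actual implementation: two runs of equal length, fragments aligned pairwise, and no data eliminated, so the output is as large as the input (the worst case). The function name and the sizes are illustrative.

```python
def peak_space(num_fragments, fragment_size, incremental):
    """Peak disk usage while compacting two runs of num_fragments fragments each."""
    used = 2 * num_fragments * fragment_size   # both input runs are on disk
    peak = used
    for _ in range(num_fragments):
        used += 2 * fragment_size              # write the merged output for one pair
        peak = max(peak, used)
        if incremental:
            used -= 2 * fragment_size          # release the exhausted input pair
    return peak


# Two runs of 500 fragments, 1 size unit each: 1000 units of input data.
all_at_once = peak_space(500, 1, incremental=False)
fragment_by_fragment = peak_space(500, 1, incremental=True)
```

Deleting inputs only at the end doubles the footprint at the peak (2000 units for 1000 units of data), while releasing fragments incrementally keeps the overhead to roughly one pair of fragments (1002 units).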
Case Study
Phases:
1. Write 500GB
2. Overwrite repeatedly
3. Compact
■ Clearly shows ICS’s improved space amplification
■ Most notably, the 2X peak of STCS major compaction is gone!
Thank you
Stay in touch
Any questions?
Benny Halevy
bhalevy@scylladb.com

Editor's Notes

  • #5 Changes to data are first recorded in memory and also stored on disk in the commit log.
  • #14 Data updates need to be compacted frequently, along with unchanged data that is merely copied over and over again.