3. Problem
Pools that:
● Have high fragmentation
● Do a lot of random writes
append a small number of entries to all metaslab spacemaps each TXG.
That’s a lot of I/Os for that amount of data.
4. Solution
Keep all changes in memory (2 range trees):
● Unflushed allocations
● Unflushed frees
Move segments from one tree to the other as we allocate and free.
Don’t spend I/Os writing to the spacemap.
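A toy sketch of the two in-memory trees, using plain Python sets of (offset, length) segments in place of real range trees. Class and method names here are illustrative assumptions, not the actual OpenZFS API:

```python
# Two sets stand in for the two range trees of unflushed changes.
class UnflushedChanges:
    def __init__(self):
        self.unflushed_allocs = set()  # allocated since last flush
        self.unflushed_frees = set()   # freed since last flush

    def allocate(self, seg):
        # If the segment was freed earlier (and not yet flushed), the
        # alloc and the pending free cancel out; otherwise record it.
        if seg in self.unflushed_frees:
            self.unflushed_frees.remove(seg)
        else:
            self.unflushed_allocs.add(seg)

    def free(self, seg):
        if seg in self.unflushed_allocs:
            self.unflushed_allocs.remove(seg)
        else:
            self.unflushed_frees.add(seg)

ms = UnflushedChanges()
ms.allocate((0, 512))
ms.allocate((512, 512))
ms.free((0, 512))           # cancels the earlier allocation
print(ms.unflushed_allocs)  # {(512, 512)}
print(ms.unflushed_frees)   # set()
```

The cancellation is what keeps the trees small: a segment allocated and freed between flushes never needs to hit disk at all.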
12. Solution to Long Import Times
Each TXG:
● Flush a few metaslabs in order, from oldest-flushed to most recently-flushed
● Destroy old log spacemaps that only contain obsolete entries
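The per-TXG maintenance above can be modeled in a few lines. This is a simplified toy, assuming each log spacemap is identified by the TXG it was written in and that a log becomes obsolete once every metaslab has been flushed after it; the real OpenZFS structures are more involved:

```python
def txg_maintenance(txg, last_flushed, logs, flushes_per_txg=2):
    """last_flushed: metaslab id -> TXG of its last flush.
       logs: TXGs at which each live log spacemap was written."""
    # Flush a few metaslabs, oldest-flushed first.
    for ms in sorted(last_flushed, key=last_flushed.get)[:flushes_per_txg]:
        last_flushed[ms] = txg
    # A log only holds obsolete entries once every metaslab
    # has been flushed after it was written; destroy those.
    oldest_needed = min(last_flushed.values())
    logs[:] = [t for t in logs if t >= oldest_needed]

last_flushed = {0: 1, 1: 2, 2: 3}
logs = [1, 2, 3, 4]
txg_maintenance(5, last_flushed, logs)
# Metaslabs 0 and 1 (oldest-flushed) get flushed at TXG 5; metaslab 2
# was last flushed at TXG 3, so logs older than TXG 3 are destroyed.
print(logs)  # [3, 4]
```

Flushing in oldest-first order is what lets the tail of the log be reclaimed steadily instead of waiting on one straggler metaslab.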
22. Trade-offs
● The less we flush:
○ Fewer I/Os issued (good)
○ More log blocks accumulate if the incoming rate is high (bad)
● The more we flush:
○ More log spacemaps are destroyed (good)
○ More I/Os issued (bad)
● The problem is workload-dependent … we need to come up with a heuristic
23. Block Limit Heuristic
● Set a limit on the number of log spacemap blocks
○ (acts as an upper bound on import-time overhead)
● When exceeded, flush metaslabs until we get under the limit
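A minimal sketch of this heuristic. The `blocks_freed_per_flush` parameter is a simplifying assumption (in reality the blocks reclaimed per flush depend on which logs a metaslab still has entries in):

```python
def enforce_block_limit(total_log_blocks, limit, blocks_freed_per_flush):
    """Flush metaslabs until the live log blocks drop under the limit.
       blocks_freed_per_flush: assumed log blocks reclaimed per flush."""
    flushes = 0
    while total_log_blocks > limit:
        total_log_blocks -= blocks_freed_per_flush
        flushes += 1
    return flushes, total_log_blocks

flushes, remaining = enforce_block_limit(40, 32, 3)
print(flushes, remaining)  # 3 flushes bring 40 blocks down to 31
```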
30. The Ideal Heuristic
… would take into consideration:
1. The difference between the total number of log blocks and the block limit.
2. The current incoming rate of log spacemap blocks.
3. The distribution of metaslabs flushed over the log spacemap history.
4. The distribution of log blocks over the log spacemap history.
31. Running Sums Heuristic
● Project the incoming rate of the current TXG into the future
● When the limit is exceeded:
○ Check how many blocks over the limit we are
○ Based on that number, determine how many metaslabs we need to flush to get below the limit
○ Then calculate (# of flushes needed) / (# of TXGs in the future)
● Keep projecting until all log spacemaps are theoretically gone.
32. Example
Scenario -
● Currently at TXG 16
● Have 24 log spacemap blocks
● Block Limit is 32 blocks
● Have 4 incoming log spacemap blocks this TXG
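The projection step can be sketched with the example's numbers. This is a back-of-the-envelope reading of the heuristic, and the flush count below is a hypothetical figure, not from the slides:

```python
# Example scenario: TXG 16, 24 live log blocks, limit 32,
# 4 new log blocks arriving per TXG.
current_blocks = 24
block_limit = 32
incoming_per_txg = 4

# At the current rate, how many TXGs until the limit is exceeded?
txgs_until_limit = (block_limit - current_blocks) // incoming_per_txg
print(txgs_until_limit)  # 2 TXGs of headroom

# If, say, 10 metaslab flushes were needed to reclaim enough log
# blocks (hypothetical figure), spread them over that headroom:
flushes_needed = 10
flushes_per_txg = -(-flushes_needed // txgs_until_limit)  # ceil division
print(flushes_per_txg)  # 5 flushes per TXG
```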
40. Sample Simulation Results
A pool with:
● 300 metaslabs
● Block limit: 300 log blocks
● Incoming rate: random(10, 64) log blocks per TXG
Flushed:
● 11 metaslabs on average each TXG
● 24 metaslabs maximum ever
That’s at most 8% of the metaslabs.
44. Considering the Block Limit
● Indirectly controls the flushing behavior (tunable)
● … but what’s a safe default?
○ Driving factor: Want at least 1 block of metaslab spacemap changes per flush
What’s the correlation between log entries and metaslab entries?
45. Obsolete Log Entries
We can assume that about half the entries in the log are obsolete (i.e., already flushed).
This bumps our factor to 2.
46. Metaslab spacemap entries
… are mostly one-word, while log spacemap entries are always two-word.
(another factor of 2)
48. Considering the Block Limit
➔ Make the tunable a factor of the number of metaslabs in the pool
➔ Make that tunable a factor of 4 (e.g. 4 times the # of metaslabs)
62. Metaslabs
Each VDEV is divided into equal-sized chunks.
Each chunk keeps track of its free space using a metaslab.
65. Persistence
Each TXG, write all metaslab changes to a pool-wide spacemap.
If we crash, reconstruct the unflushed state from the log spacemaps.
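A sketch of what that import-time reconstruction could look like: replay the log spacemaps in TXG order, skipping entries already captured by a metaslab's own spacemap flush. The entry format and names here are assumptions for illustration, not the on-disk layout:

```python
def reconstruct_unflushed(log_entries, last_flushed):
    """log_entries: (txg, metaslab, op, segment) tuples in TXG order.
       last_flushed: metaslab id -> TXG of its last flush before crash."""
    unflushed = {}  # metaslab -> {"alloc": set(), "free": set()}
    for txg, ms, op, seg in log_entries:
        if txg <= last_flushed.get(ms, -1):
            continue  # obsolete: already in the metaslab's spacemap
        trees = unflushed.setdefault(ms, {"alloc": set(), "free": set()})
        opposite = "free" if op == "alloc" else "alloc"
        if seg in trees[opposite]:
            trees[opposite].remove(seg)  # alloc/free pair cancels out
        else:
            trees[op].add(seg)
    return unflushed

entries = [
    (10, 0, "alloc", (0, 512)),   # obsolete: metaslab 0 flushed at TXG 12
    (13, 0, "alloc", (512, 512)),
    (14, 0, "free",  (512, 512)), # cancels the TXG 13 allocation
    (14, 1, "alloc", (0, 512)),
]
state = reconstruct_unflushed(entries, {0: 12, 1: 9})
print(state)
```

After the replay, metaslab 1 has one unflushed allocation and metaslab 0 has none, exactly the in-memory state the crash wiped out.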
67. Solution to Memory Pressure
When unflushed changes exceed some set limit, flush them to the metaslab’s spacemap.
74. Need To Decide
● When do we flush metaslabs?
● How many metaslabs do we flush?
77. Average Blocks-Per-Log Heuristic
Heuristic:
    if block limit is exceeded:
        keep flushing until we go below the limit
    else:
        flush X metaslabs, where X = (total # of log blocks) / (# of logs)
Idea → Adjusts to the rate of incoming blocks per TXG.
Problem → Doesn’t consider flushing history.
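A runnable version of the pseudocode above. As before, `blocks_freed_per_flush` is a simplifying assumption used to translate "blocks over the limit" into a flush count:

```python
def flushes_this_txg(total_log_blocks, num_logs, block_limit,
                     blocks_freed_per_flush=1):
    if total_log_blocks > block_limit:
        # Over the limit: flush until we get back under it.
        over = total_log_blocks - block_limit
        return -(-over // blocks_freed_per_flush)  # ceil division
    # Under the limit: flush at the average incoming rate,
    # X = (total # of log blocks) / (# of logs).
    return total_log_blocks // num_logs

print(flushes_this_txg(24, 6, 32))  # under limit: 24/6 = 4 flushes
print(flushes_this_txg(40, 6, 32))  # over limit: 8 blocks over -> 8 flushes
```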