3. Problem
Pools that:
● Have high fragmentation
● Do a lot of random writes
append a small number of entries to all metaslab spacemaps each TXG.
That’s a lot of I/Os for that amount of data.
4. Solution
Keep all changes in memory (2 range trees):
● Unflushed allocations
● Unflushed frees
Move segments from one tree to the other as we allocate and free.
Don’t spend I/Os writing to the spacemap.
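A toy sketch of the two in-memory trees, using plain Python sets of (offset, length) segments in place of real range trees. Class and method names here are illustrative assumptions, not the actual OpenZFS API:

```python
# Two sets stand in for the two range trees of unflushed changes.
class UnflushedChanges:
    def __init__(self):
        self.unflushed_allocs = set()  # allocated since last flush
        self.unflushed_frees = set()   # freed since last flush

    def allocate(self, seg):
        # If the segment was freed earlier (and not yet flushed), the
        # alloc and the pending free cancel out; otherwise record it.
        if seg in self.unflushed_frees:
            self.unflushed_frees.remove(seg)
        else:
            self.unflushed_allocs.add(seg)

    def free(self, seg):
        if seg in self.unflushed_allocs:
            self.unflushed_allocs.remove(seg)
        else:
            self.unflushed_frees.add(seg)

ms = UnflushedChanges()
ms.allocate((0, 512))
ms.allocate((512, 512))
ms.free((0, 512))           # cancels the earlier allocation
print(ms.unflushed_allocs)  # {(512, 512)}
print(ms.unflushed_frees)   # set()
```

The cancellation is what keeps the trees small: a segment allocated and freed between flushes never needs to hit disk at all.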
12. Solution to Long Import Times
Each TXG:
● Flush a few metaslabs in order, from oldest-flushed to most recently-flushed
● Destroy old log spacemaps that only contain obsolete entries
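The per-TXG maintenance above can be modeled in a few lines. This is a simplified toy, assuming each log spacemap is identified by the TXG it was written in and that a log becomes obsolete once every metaslab has been flushed after it; the real OpenZFS structures are more involved:

```python
def txg_maintenance(txg, last_flushed, logs, flushes_per_txg=2):
    """last_flushed: metaslab id -> TXG of its last flush.
       logs: TXGs at which each live log spacemap was written."""
    # Flush a few metaslabs, oldest-flushed first.
    for ms in sorted(last_flushed, key=last_flushed.get)[:flushes_per_txg]:
        last_flushed[ms] = txg
    # A log only holds obsolete entries once every metaslab
    # has been flushed after it was written; destroy those.
    oldest_needed = min(last_flushed.values())
    logs[:] = [t for t in logs if t >= oldest_needed]

last_flushed = {0: 1, 1: 2, 2: 3}
logs = [1, 2, 3, 4]
txg_maintenance(5, last_flushed, logs)
# Metaslabs 0 and 1 (oldest-flushed) get flushed at TXG 5; metaslab 2
# was last flushed at TXG 3, so logs older than TXG 3 are destroyed.
print(logs)  # [3, 4]
```

Flushing in oldest-first order is what lets the tail of the log be reclaimed steadily instead of waiting on one straggler metaslab.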
22. Trade-offs
● The less we flush:
○ Fewer I/Os issued (good)
○ More log blocks accumulate if the incoming rate is high (bad)
● The more we flush:
○ More log spacemaps are destroyed (good)
○ More I/Os issued (bad)
● The problem is workload-dependent … we need to come up with a heuristic
23. Block Limit Heuristic
● Set a limit on the number of log spacemap blocks
○ (acts as an upper bound on import-time overhead)
● When exceeded, flush metaslabs until we get under the limit
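A minimal sketch of this heuristic. The `blocks_freed_per_flush` parameter is a simplifying assumption (in reality the blocks reclaimed per flush depend on which logs a metaslab still has entries in):

```python
def enforce_block_limit(total_log_blocks, limit, blocks_freed_per_flush):
    """Flush metaslabs until the live log blocks drop under the limit.
       blocks_freed_per_flush: assumed log blocks reclaimed per flush."""
    flushes = 0
    while total_log_blocks > limit:
        total_log_blocks -= blocks_freed_per_flush
        flushes += 1
    return flushes, total_log_blocks

flushes, remaining = enforce_block_limit(40, 32, 3)
print(flushes, remaining)  # 3 flushes bring 40 blocks down to 31
```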
30. The Ideal Heuristic
… would take into consideration:
1. The difference between the total number of log blocks and the block limit.
2. The current incoming rate of log spacemap blocks.
3. The distribution of metaslabs flushed over the log spacemap history.
4. The distribution of log blocks over the log spacemap history.
31. Running Sums Heuristic
● Project the incoming rate of the current TXG into the future
● When the limit is exceeded:
○ Check how many blocks over the limit we are
○ Based on that number, determine how many metaslabs we need to flush to get below the limit
○ Then calculate (# of flushes needed) / (# of TXGs in the future)
● Keep projecting until all log spacemaps are theoretically gone.
32. Example
Scenario -
● Currently at TXG 16
● Have 24 log spacemap blocks
● Block Limit is 32 blocks
● Have 4 incoming log spacemap blocks this TXG
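The projection step can be sketched with the example's numbers. This is a back-of-the-envelope reading of the heuristic, and the flush count below is a hypothetical figure, not from the slides:

```python
# Example scenario: TXG 16, 24 live log blocks, limit 32,
# 4 new log blocks arriving per TXG.
current_blocks = 24
block_limit = 32
incoming_per_txg = 4

# At the current rate, how many TXGs until the limit is exceeded?
txgs_until_limit = (block_limit - current_blocks) // incoming_per_txg
print(txgs_until_limit)  # 2 TXGs of headroom

# If, say, 10 metaslab flushes were needed to reclaim enough log
# blocks (hypothetical figure), spread them over that headroom:
flushes_needed = 10
flushes_per_txg = -(-flushes_needed // txgs_until_limit)  # ceil division
print(flushes_per_txg)  # 5 flushes per TXG
```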
40. Sample Simulation Results
A pool with:
● 300 metaslabs
● Block limit: 300 log blocks
● Incoming rate: random(10, 64) log blocks per TXG
Flushed:
● 11 metaslabs on average each TXG
● 24 metaslabs maximum ever
That’s at most 8% of the metaslabs.
44. Considering the Block Limit
● Indirectly controls the flushing behavior (tunable)
● … but what’s a safe default?
○ Driving factor: Want at least 1 block of metaslab spacemap changes per flush
What’s the correlation between log entries and metaslab entries?
45. Obsolete Log Entries
We can assume that about half the entries in the log are obsolete (i.e., already flushed).
This bumps our factor to 2.
46. Metaslab spacemap entries
… are mostly one-word, while log spacemap entries are always two-word.
(another factor of 2)
48. Considering the Block Limit
➔ Make the tunable a factor of the number of metaslabs in the pool
➔ Make that tunable a factor of 4 (e.g. 4 times the # of metaslabs)
62. Metaslabs
Each VDEV is divided into equal-sized chunks.
Each chunk keeps track of its free space using a metaslab.
65. Persistence
Each TXG, write all metaslab changes to a pool-wide spacemap.
If we crash, reconstruct the unflushed state from the log spacemaps.
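A sketch of what that import-time reconstruction could look like: replay the log spacemaps in TXG order, skipping entries already captured by a metaslab's own spacemap flush. The entry format and names here are assumptions for illustration, not the on-disk layout:

```python
def reconstruct_unflushed(log_entries, last_flushed):
    """log_entries: (txg, metaslab, op, segment) tuples in TXG order.
       last_flushed: metaslab id -> TXG of its last flush before crash."""
    unflushed = {}  # metaslab -> {"alloc": set(), "free": set()}
    for txg, ms, op, seg in log_entries:
        if txg <= last_flushed.get(ms, -1):
            continue  # obsolete: already in the metaslab's spacemap
        trees = unflushed.setdefault(ms, {"alloc": set(), "free": set()})
        opposite = "free" if op == "alloc" else "alloc"
        if seg in trees[opposite]:
            trees[opposite].remove(seg)  # alloc/free pair cancels out
        else:
            trees[op].add(seg)
    return unflushed

entries = [
    (10, 0, "alloc", (0, 512)),   # obsolete: metaslab 0 flushed at TXG 12
    (13, 0, "alloc", (512, 512)),
    (14, 0, "free",  (512, 512)), # cancels the TXG 13 allocation
    (14, 1, "alloc", (0, 512)),
]
state = reconstruct_unflushed(entries, {0: 12, 1: 9})
print(state)
```

After the replay, metaslab 1 has one unflushed allocation and metaslab 0 has none, exactly the in-memory state the crash wiped out.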
67. Solution to Memory Pressure
When unflushed changes exceed some set limit, flush them to the metaslab’s spacemap.
74. Need To Decide
● When do we flush metaslabs?
● How many metaslabs do we flush?
77. Average Blocks-Per-Log Heuristic
Heuristic:
    if block limit is exceeded:
        keep flushing until we go below the limit
    else:
        flush X metaslabs, where X = (total # of log blocks) / (# of logs)
Idea → Adjusts to the rate of incoming blocks per TXG.
Problem → Doesn’t consider flushing history.
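A runnable version of the pseudocode above. As before, `blocks_freed_per_flush` is a simplifying assumption used to translate "blocks over the limit" into a flush count:

```python
def flushes_this_txg(total_log_blocks, num_logs, block_limit,
                     blocks_freed_per_flush=1):
    if total_log_blocks > block_limit:
        # Over the limit: flush until we get back under it.
        over = total_log_blocks - block_limit
        return -(-over // blocks_freed_per_flush)  # ceil division
    # Under the limit: flush at the average incoming rate,
    # X = (total # of log blocks) / (# of logs).
    return total_log_blocks // num_logs

print(flushes_this_txg(24, 6, 32))  # under limit: 24/6 = 4 flushes
print(flushes_this_txg(40, 6, 32))  # over limit: 8 blocks over -> 8 flushes
```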