3. Slides plan
Introduction to Fractal Trees and TokuDB
Files
Block files
Fractal tree storage
Cachetable
Recovery and rollback logs
MVCC
Some interesting features
4. Problem
- RAM with fast access but small size M
- disk with slow access but large size
- the whole data set does not fit in RAM
- data is transferred in blocks of size B
- performance is bounded by the number of blocks transferred (CPU costs are ignored)
- all block accesses are assumed to have the same cost
The goal is to minimize the number of block transfers.
5. DAM - Disk Access Machine model
[Diagram: RAM of size M and DISK exchanging blocks of size B]
6. B-tree
- each internal node consists of pivot keys
- a node has a fixed size B, and fetching its pivots as a group saves I/Os
- most leaves are on disk
- inserting into a leaf requires an additional I/O if the leaf is not in memory
[Diagram: B-tree with branching factor B and height log_B N]
7. B-tree: search
- good if the leaf is in memory
- log_B(N) I/Os in the worst case
- one I/O to read the leaf
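The worst-case I/O count above can be checked with a small sketch in the DAM model (the example values of N and B below are arbitrary assumptions for illustration):

```python
def btree_search_ios(n_rows: int, fanout: int) -> int:
    """Worst-case I/Os for a B-tree point search in the DAM model:
    one block transfer per level, and a tree with fanout B over N
    rows has about log_B(N) levels."""
    levels = 0
    capacity = 1
    while capacity < n_rows:
        capacity *= fanout
        levels += 1
    return levels

# Example: a billion rows with fanout 1000 need at most 3 I/Os
# when nothing is cached.
print(btree_search_ios(10**9, 1000))
```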
8. B-tree: fast sequential insert
- most of the nodes are cached
- sequential disk I/O: one disk I/O per leaf, and each leaf contains many rows
[Diagram: upper tree levels are in memory; sequential insertions all go into the rightmost leaf node]
9. B-tree: slow for random inserts
- most leaves are not cached
- most insertions require random I/Os
[Diagram: only the upper tree levels fit in memory; random inserts land on uncached leaves]
10. B-tree: random inserts buffering
The idea is to buffer inserts and merge them when necessary or when the system is idle.
- reduces I/Os, since several changes to the same node can be written at once
- can slow down reads
- performs badly under heavy load when the buffer is full
- leaves still have to be read when changes from the buffer are applied
11. B-tree: pros and cons
- good for sequential inserts
- random inserts can cause heavy I/O load due to cache misses
- for big enough data, most of the leaves are not in cache and random inserts perform poorly
- random insert speed degrades as the tree grows
12. Fractal tree: the idea
- a fractal tree is the same as a B-tree, but with a message buffer in each node
- buffers contain messages
- each message describes a data change
- messages are pushed down when a buffer is full (or when a node merge/split is required)
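The buffered-insert idea can be sketched with a toy two-level tree (a simplified model, not TokuFT's actual node layout; the tiny buffer capacity is an assumption for illustration):

```python
# Toy sketch of buffered inserts: the root holds a message buffer; when
# the buffer fills up, all messages are pushed down to the leaves in one
# batch, so many inserts share the cost of touching a leaf.
BUFFER_CAPACITY = 4  # assumed tiny capacity for illustration

class Leaf:
    def __init__(self):
        self.rows = {}

class Root:
    def __init__(self, pivot):
        self.pivot = pivot               # keys < pivot go to the left child
        self.children = [Leaf(), Leaf()]
        self.buffer = []                 # pending "insert" messages

    def insert(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= BUFFER_CAPACITY:
            self.flush()

    def flush(self):
        # Push all buffered messages down to the leaves at once.
        for key, value in self.buffer:
            child = self.children[0] if key < self.pivot else self.children[1]
            child.rows[key] = value
        self.buffer.clear()

    def lookup(self, key):
        # A search must also consult buffered messages on the way down.
        for k, v in reversed(self.buffer):
            if k == key:
                return v
        child = self.children[0] if key < self.pivot else self.children[1]
        return child.rows.get(key)
```

Note how `lookup` checks the buffer before the leaf: this is the same "collect and apply changes" step the search slides describe.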
21. Fractal tree: performance analysis
- the most recently used buffers are cached
- fewer I/Os compared with a B-tree, since there is no need to access the leaf on each insert
- more information about changes is carried per I/O
- schema changes are broadcast messages
22. Fractal tree: search
The same as for a B-tree, but collect and apply all buffered changes on the path to the target leaf
- the same number of I/Os as a B-tree search
- more CPU work for collecting and merging changes
- good for I/O-bound loads
23. Fractal tree: summary
If the data is big enough, i.e. most leaves do not fit in memory:
- the number of I/O’s for search is the same as for B-tree
- the number of I/O’s for sequential inserts is the same as for B-tree
- the number of I/O’s for random inserts is less than for B-tree
In this sense, fractal trees are optimal for random inserts.
27. Files: lock files
The ‘*lock_dont_delete_me*’ files are lock files that are created so that
multiple TokuFT applications do not simultaneously use the same
directories.
28. Files: file map
The ‘tokuft.directory’ file is a fractal tree that contains a map of
application object names to the fractal tree file that stores them. The
directory is used to implement transactional file operations by leveraging
the row locks that are grabbed by inserts and deletes.
30. Files: recovery log, rollback log
The ‘log*.tokulog*’ files are the TokuFT crash recovery log.
The ‘tokudb.rollback’ file is a block file that stores the rollback logs for all
live transactions.
31. Files: fractal trees files
The files named ‘*.tokudb’ are block files that store fractal trees.
33. Block files
- A block file is a file that stores a set of variable length blocks.
- A block file provides random access to any block given a block number.
- A block file allows new blocks to be allocated and in-use blocks to be freed.
34. Block files: blocks
- A block is a region in a file that stores data.
- A block number is used to identify a block. A block number is a 64-bit unsigned integer.
- Each block can have a different size.
36. Block files: block translation
- The block translation table (BTT) is a data structure that describes a set of blocks.
- The block translation maps a block number to the block’s offset within the file and its size. The BTT is just a giant array indexed by block number.
- The BTT is written to the file when the file is checkpointed.
- Each file header points to a BTT in the file.
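The BTT described above can be modeled as a flat array indexed by block number (a sketch of the mapping only, not TokuFT's on-disk encoding):

```python
# Simplified model of the block translation table: a flat array indexed
# by block number, each entry holding (offset, size) or None if free.
class BlockTranslationTable:
    def __init__(self):
        self.entries = []          # index = block number

    def allocate(self, offset, size):
        """Record a new block; return its block number."""
        self.entries.append((offset, size))
        return len(self.entries) - 1

    def translate(self, block_num):
        """Map a block number to the block's (offset, size) in the file."""
        return self.entries[block_num]

    def free(self, block_num):
        self.entries[block_num] = None
```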
37. Block files: file layout
- Two headers at offsets 0 and 4K
- Each stamped with its own LSN
- Each with its own BTT info (offset, size)
- Sequence of variable length blocks
- BTT is just a variable length block + checksum
- Blocks are aligned to 4096-byte boundaries, so there can be gaps
- There are several block allocation strategies that can be used. The default is first fit, which finds the free region with the lowest file offset that can hold a block of the given size.
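The first-fit strategy can be sketched as a scan over the allocated regions (a simplified model; `first_fit` and its arguments are illustrative, not TokuFT's API):

```python
ALIGNMENT = 4096  # blocks are aligned to 4 KB boundaries

def round_up(offset):
    """Round an offset up to the next alignment boundary."""
    return -(-offset // ALIGNMENT) * ALIGNMENT

def first_fit(allocated, size):
    """Return the lowest aligned offset where `size` bytes fit between
    the already-allocated (offset, size) regions, or past the last one.
    A sketch of the default first-fit strategy described above."""
    cursor = 0
    for off, sz in sorted(allocated):
        if round_up(cursor) + size <= off:
            return round_up(cursor)      # fits in the gap before `off`
        cursor = max(cursor, off + sz)   # skip past this region
    return round_up(cursor)              # append at the end
```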
38. Block files: fragmentation
- Fragmentation is caused by the mismatch between block alignment and variable length blocks. With 1 MB blocks and 4 KB block alignment, the fragmentation overhead is about 0.4%.
- Fragmentation is also caused by freed blocks not making the space
immediately available to the file system. Two possible remedies are to
use sparse files with file system hole punching or to periodically move
blocks to the beginning of a file and truncate the file.
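The 0.4% figure is the worst case of the alignment padding, which a couple of lines of arithmetic confirm:

```python
BLOCK_SIZE = 1 << 20   # 1 MB blocks
ALIGNMENT = 4096       # 4 KB alignment

# Worst case: each block is followed by up to ALIGNMENT - 1 bytes of
# padding before the next aligned offset.
worst_case_overhead = (ALIGNMENT - 1) / BLOCK_SIZE
print(f"{worst_case_overhead:.2%}")  # about 0.4%
```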
40. Fractal tree storage
- TokuFT uses block files to store fractal trees.
- TokuFT stores one fractal tree in one block file.
- Each node in a fractal tree is stored in its own block.
- The root block number identifies the root block of the tree.
- Each node is labeled with its height in the tree. Leaf nodes have height
0. The parent of a node has height = height of the node + 1.
41. Leaf node
- A leaf node consists of a node header, a directory of the basement node offsets and sizes, a sequence of N-1 pivots, and a sequence of N basement nodes.
- The basement node directory is used to support point queries to leaf
entries in a specific basement node.
- The intent of a basement node is to allow a point query to only need to
read a basement node from disk rather than the entire leaf node.
- Each basement node consists of a sequence of leaf entries.
42. Fractal tree storage: example
[Diagram: a block file storing a fractal tree. The tree header holds the root node number (#3) and metadata. Non-leaf #3 (height 2) has children #4 and #5; non-leaf #4 has height 1; non-leaf #5 (height 1) has children #6, etc. Leaf #6 (height 0) holds message buffer #2 and basement node #37, which contains leaf entry #979 (key, txn record[]).]
Node size = target size for uncompressed nodes; fanout = number of children; basement node size = target size for uncompressed basement nodes.
43. Fractal tree parameters
The fractal tree has the following parameters that are stored with its
metadata.
- Node size (the default target is 4 MB)
- Basement node size
- Fanout (the default target is 16 children)
- Compression
45. Cachetable: purpose
- The purpose of the cache table is to control the memory residency of a
set of objects that are stored in cache files.
- The cache table has an upper bound on the total memory used to store
these objects.
- The goal is to keep hot objects in memory, maximizing application throughput by minimizing I/O operations.
- The cache table must also write dirty objects to the cache files when
the objects are removed, evicted, or checkpointed.
- The cache table uses a clock algorithm to select objects for eviction.
46. Cachetable: structure
- A cache table manages a set of cache files and a set of cached memory
objects that are stored in the cache files.
- The cache table stores the set of cache files in a linked list.
- The cache table stores the set of memory objects in a big hash table.
- The cache table manages a set of background threads that are used to
perform compute and I/O intensive work.
47. Cachetable: background threads
- the evictor, which flushes memory objects from the cache
- the checkpointer, which does the begin- and end-checkpoint work
- the cleaner, which flushes buffered fractal tree messages
48. Evictor: purpose
The cache table maintains a cache of memory objects. Since big data does
not fit in memory, only a subset of the data can be in memory. When the
cache table memory limits are reached, some of the cache pairs must be
evicted. The purpose of evictions is to keep control of the memory
footprint of the cache table AND minimize I/O operations by keeping hot
objects in memory and kicking cold objects out of memory.
50. Evictor: memory control
- Evictions are not needed when the current size < the low watermark.
- Evictions are needed when the current size > the low hysteresis.
- Client threads sleep when the current size > the high watermark.
- Client threads wake up when the current size < the high hysteresis.
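The four thresholds above form two hysteresis bands, one driving the evictor and one throttling client threads. A sketch (the numeric values are made up for illustration; the real thresholds derive from the configured cache size):

```python
# Illustrative thresholds (bytes), ordered as on the slide:
# low watermark < low hysteresis < high hysteresis < high watermark.
LOW_WATERMARK = 800
LOW_HYSTERESIS = 900
HIGH_HYSTERESIS = 1000
HIGH_WATERMARK = 1100

def evictions_needed(current_size, evicting):
    """Evictor band: start evicting above the low hysteresis, keep
    going until the size drops back below the low watermark."""
    if current_size > LOW_HYSTERESIS:
        return True
    if current_size < LOW_WATERMARK:
        return False
    return evicting  # inside the band: keep the previous decision

def clients_must_sleep(current_size, sleeping):
    """Client band: sleep above the high watermark, wake up once the
    size drops below the high hysteresis."""
    if current_size > HIGH_WATERMARK:
        return True
    if current_size < HIGH_HYSTERESIS:
        return False
    return sleeping
```

The point of the gap between each watermark and its hysteresis threshold is to avoid flapping: a size hovering near one boundary does not toggle the decision on every check.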
51. Evictor: clock algorithm
- a saturating counter is incremented on each touch
- the evictor iterates over cachetable pairs until the cachetable size drops below some limit
- if a pair is locked, it is skipped
- otherwise its counter is decremented
- when the counter reaches 0, the pair is selected as a victim
- partial eviction can be performed on any node regardless of its counter value if it has clean partitions that use a lot of space and cache pressure is high
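The sweep described above can be sketched as follows (a simplified single-pass model; the saturation limit is an assumed value, and partial eviction is omitted):

```python
# Sketch of the clock algorithm: each cached pair carries a saturating
# counter that is bumped on access; the evictor sweeps the pairs,
# decrementing counters, and evicts a pair whose counter hits 0.
MAX_COUNT = 3  # saturation limit (assumed value)

class Pair:
    def __init__(self, name):
        self.name = name
        self.count = 1
        self.locked = False

    def touch(self):
        """Called on access: bump the counter, saturating at MAX_COUNT."""
        self.count = min(self.count + 1, MAX_COUNT)

def run_clock(pairs):
    """One sweep over the pairs: return the first victim found, or None."""
    for pair in pairs:
        if pair.locked:
            continue            # skip pairs that are in use
        pair.count -= 1
        if pair.count <= 0:
            return pair          # victim selected for eviction
    return None
```

Recently touched pairs survive more sweeps than cold ones, which is how the clock approximates LRU without tracking exact access order.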
57. Checkpoints: purpose
- The purpose of a checkpoint is to make a durable snapshot of a set of
open fractal tree files, a set of live and prepared transactions, and a set
of dirty blocks in the cache table.
- A checkpoint contains a list of all of the cache files and a list of all of
the live transactions. These lists allow recovery to restore the state of
the cache files and transactions prior to replaying the recovery log.
- A checkpoint must also write all of the dirty nodes and update the
cache file with a snapshot of the fractal tree block table and the LSN of
the checkpoint.
58. Checkpoint logic
A TokuFT checkpoint has a begin phase and an end phase.
- Write lock the checkpoint safe lock. This serializes checkpoints.
- Write lock the multi-operation (MO) lock. This serializes checkpoints
with transactions and files so they can be marked for checkpoint and
logged.
- Run begin checkpoint logic.
- Unlock MO lock.
- Run end checkpoint logic.
- Unlock the checkpoint safe lock.
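The lock ordering above can be sketched with two locks (the begin/end phases are passed in as callables; their contents are the begin- and end-checkpoint logic the following slides describe):

```python
import threading

checkpoint_safe_lock = threading.Lock()   # serializes checkpoints
multi_operation_lock = threading.Lock()   # serializes with txns/files (MO lock)

def checkpoint(begin_logic, end_logic):
    """Run one checkpoint with the locking protocol described above."""
    with checkpoint_safe_lock:            # one checkpoint at a time
        with multi_operation_lock:        # blocks commits/aborts briefly
            begin_logic()                 # must be fast: MO lock is held
        end_logic()                       # the slow part runs without the MO lock
```

The key property is that only the begin phase holds the MO lock, so transaction commits and aborts are blocked only for the short marking step, not for the long I/O of the end phase.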
59. Begin checkpoint
- Pin all of the open cache files.
- Write the checkpoint begin log entry to the recovery log.
- Write fassociate log entries for all open cache files to the recovery log.
- Write xstillopen log entries for all live transactions to the recovery log.
- Mark all cache table pairs for checkpoint.
- Call the begin checkpoint on all cache files in the checkpoint.
The time for the begin checkpoint MUST be fast since the MO lock is held
which blocks out transaction commits/aborts.
60. End checkpoint
- Checkpoint all cache table pairs that are dirty and marked for checkpoint. This writes the dirty data and updates the fractal tree’s checkpoint block translation table in the cache file.
- Checkpoint all cache files that are marked for checkpoint. This writes the file header and the checkpoint block translation table.
- Write the end checkpoint log entry to the recovery log.
61. Cleaner
The purpose of the cleaner is to flush messages down the fractal trees without affecting the I/O amortization of fractal trees too much and without consuming too many system resources.
63. Recovery and rollback logs
- The purpose of rollback logging is to efficiently capture transaction
changes so that these changes can be either committed or rolled back.
- The purpose of recovery logging is to restore the state of the database
to some point in time without missing transactionally committed
changes up to that time.
64. Recovery log
- The recovery log contains those changes to the database that occurred
since the last checkpoint.
- The recovery algorithm replays the logged changes since the last checkpoint against the last checkpointed version of the database.
- This restores the database to the state that existed when it crashed, without losing any changes made by committed transactions.
65. Recovery log files
- The recovery log is a sequence of files.
- The recovery log file names match ‘logN.tokulogM’, where N is a
monotonically increasing sequence number and M is the TokuFT
version number.
- Recovery log events are appended to the end of the newest log file (the
one with the largest sequence number).
- Recovery log files are 100MB in size. When completely written, a new
log file with the next sequence number is created.
- Old recovery log files are automatically removed when their largest
LSN is smaller than the last checkpoint LSN.
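The naming scheme can be sketched with a small parser (the 12-digit zero padding in `next_log_name` is an assumption for illustration, not a documented width):

```python
import re

# 'logN.tokulogM': N is the sequence number, M is the TokuFT version.
LOG_NAME = re.compile(r"^log(\d+)\.tokulog(\d+)$")

def parse_log_name(name):
    """Split a recovery log file name into (sequence number, version)."""
    m = LOG_NAME.match(name)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

def next_log_name(name):
    """Name of the log file that follows `name` in the sequence."""
    seq, version = parse_log_name(name)
    return f"log{seq + 1:012d}.tokulog{version}"
```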
66. Recovery Log Group Commit
- Fsyncs are SLOW.
- Fsyncs are used to make the recovery log persistent.
- How to increase throughput beyond the fsync limit?
- Group commit writes MANY log events from multiple client threads together and fsyncs the log ONCE.
- The group commit algorithm uses a double log buffer and some
synchronization locks to elect one thread to do the fsync and
coordinate with the other threads.
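The double-buffer idea can be sketched single-threaded (this omits the synchronization locks and the thread-election step; it only shows how one fsync covers a whole batch):

```python
# Sketch of group commit with a double buffer: log events accumulate in
# an in-memory buffer, and one fsync makes the whole batch durable, so
# many commits share the cost of a single fsync.
class GroupCommitLog:
    def __init__(self):
        self.in_buffer = []     # filled by client threads
        self.out_buffer = []    # being written and fsynced
        self.fsyncs = 0

    def append(self, event):
        self.in_buffer.append(event)

    def commit(self):
        """Swap the buffers, then write and fsync the filled one once."""
        self.in_buffer, self.out_buffer = [], self.in_buffer
        # ... write self.out_buffer to the log file here ...
        self.fsyncs += 1        # ONE fsync covers the whole batch
        return len(self.out_buffer)
```

With the swap, client threads can keep appending to the fresh `in_buffer` while the filled buffer is being written, which is the point of using two buffers rather than one.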
67. Fractal Tree Snapshots and Recovery
- Each fractal tree file contains two snapshots.
- Each snapshot is labeled with a checkpoint LSN which is its version
number.
- Recovery opens the snapshot version with the largest checkpoint LSN
that is less than or equal to the checkpoint LSN from the recovery log.
68. Rollback Log Location
- Each transaction has its own rollback log.
- Each transaction’s rollback log is a sequence of blocks in the file called
‘tokudb.rollback’.
- Small transactions will seldom have their rollback log written to this
file. The transaction’s rollback log will remain in memory if the
transaction retires between checkpoints AND if the rollback log is
small.
69. Checkpointing the Rollback Log
- The rollback log is stored in blocks in the ‘tokudb.rollback’ file.
- These blocks are cached in the cache table.
- A checkpoint of the cache table will write the dirty blocks to the file.
71. MVCC purpose
- Implement different transaction isolation levels
- Reduce the number of locks in the system
72. MVCC implementation
- The lock tree ensures that if a transaction T_i modifies a leaf entry, then no other transaction modifies the same leaf entry until T_i either commits or aborts
- when a transaction T_i modifies a key, the leaf entry stores T_i, T_i's parent, T_i's grandparent, and so on, all the way to T_i's oldest ancestor
- at the bottom of this stack is another stack of committed values from previous versions
- each transaction keeps a list of the transactions that were live when it started; this list is called the ‘live list’
73. MVCC implementation
[Diagram: the leaf entry's value stack over time. The committed value written by T_1_W sits at the bottom; while T_3_W runs, provisional placeholders and values for T_3_W and its nested transactions T_3_1 and T_3_2 sit above it.]
Transaction timing:
T_1_W: insert into t values (1, 10)
T_2_R: select * from t where K=1
T_3_W: begin
       update t set v = v+10 where K=1
       update t set v = v+10 where K=1
       commit
T_4_R: select * from t where K=1
T_5_R: select * from t where K=1
74. MVCC: Rule for a transaction reading an element
- First look at the provisional stack. If the value associated with the
innermost transaction passes the test defined below, return it.
- Otherwise, move on to the most recently committed value. For each
committed value, if the transaction passes the test defined below,
return it, otherwise continue moving down the stack and testing older
committed values.
75. MVCC: Rule for a transaction reading an element
The Rule for deciding whether to return a value from the provisional stack:
- if the provisional stack's root transaction is the same as the root of the
transaction doing the read, return the value
- if the provisional stack's root transaction is less than or equal to the LSN of the read transaction, and is not in the read transaction's live list, return the value
- otherwise, do not return a value
76. MVCC: Rule for a transaction reading an element
The rule for deciding whether to return a committed value:
- if committed value's transaction is less than or equal to the LSN of the
read transaction, and is not in the read transaction's live list, return the
value
- otherwise, do not return a value
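The rules on these slides can be combined into one read routine (a sketch over simplified leaf-entry data; field names like `root_txid` and `live_list` are illustrative, and `txid` stands in for the transaction ID/LSN comparison):

```python
# Sketch of the read rule: try the provisional value first, then walk
# down the committed stack. Committed entries are (txid, value), newest
# first; the provisional entry is (root_txid, value).
def visible(writer_txid, reader):
    """A value is visible if its writer committed before the reader
    started: txid <= reader's, and not in the reader's live list."""
    return (writer_txid <= reader["txid"]
            and writer_txid not in reader["live_list"])

def read(leaf_entry, reader):
    prov = leaf_entry.get("provisional")
    if prov is not None:
        root_txid, value = prov
        # Our own provisional write, or a writer that committed before us.
        if root_txid == reader["root_txid"] or visible(root_txid, reader):
            return value
    for txid, value in leaf_entry["committed"]:   # newest first
        if visible(txid, reader):
            return value
    return None
```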
77. MVCC: Promotion and garbage collection
- if the root of the current transaction is not the same as the root of the transaction in the provisional stack, the provisional stack is promoted into the committed stack
- garbage collection removes values that are no longer needed from the leaf entry