6. When B-Tree is good
• When the data size doesn’t exceed available memory
• When the application is mostly performing read (SELECT)
operations, or when read performance is more important
than write performance
7. When B-Tree is not good
• As soon as the data size exceeds available memory,
performance drops rapidly
• Choosing flash-based storage helps performance, but only
to a certain extent: in the long run, memory limits still cause
performance to suffer
8. To summarize
• The B-tree was designed for optimal data retrieval
performance, not for data updates (insert, delete, update)
• This shortcoming created a need for data structures that
provide better write performance.
9. Cases when B-Tree is not optimal
• accepting and storing event logs
• storing measurements from a high-frequency sensor
• tracking user clicks, and so on
• For such cases, two new data structures were created: log-
structured merge (LSM) trees and Fractal Trees®.
13. Fractal Trees
• Invented around 2007
• Commercialized by Tokutek as the TokuDB storage engine
• 2015: became part of Percona
14. Fractal Tree
• Writes are delayed: each change is sent as a message
• Multiple delayed writes are combined into a single IO
• => SELECTs have more work to do:
• they must walk through all pending messages
16. Fractal tree benefits
• Tables that have a lot of indexes (preferably non-unique
indexes)
• Tables with a heavy write workload
• Systems with slow storage
• Saving space when storage is fast but expensive
17. From idea to reality
• Need concurrency-control mechanisms
• Need crash safety
• Need transactions, logging and recovery
• Need to support multithreading
• Need to integrate with the MySQL API layer
• Not everything is perfect yet
24. Shape your tree (settings per TABLE)
• tokudb_block_size (default 4MiB)
• size of a node in memory (on disk the node is stored compressed)
• tokudb_read_block_size (default 64KiB)
• size of a basement node: the minimal read unit, and also the block size
for compression
• Trade-off: a smaller tokudb_read_block_size is better for point reads, but
leads to more random IO
• tokudb_fanout (default 16): the maximum number of children
per non-leaf node (number of pivot keys = tokudb_fanout - 1); see the sketch below
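For illustration, a minimal sketch of shaping a new table. The table name and columns are hypothetical; as I understand it, these are session variables whose values are baked into the fractal tree when the table is created:
SET SESSION tokudb_read_block_size = 32768;  -- 32KiB basement nodes: faster point reads, more random IO
SET SESSION tokudb_fanout = 32;              -- wider non-leaf nodes: 31 pivot keys each
CREATE TABLE metrics (
  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  v DOUBLE
) ENGINE=TokuDB;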
25. Recommendations
tokudb_block_size:
4MiB is a good block size for spinning disks.
For SSDs a smaller block size may be beneficial; I often use 1MiB (see the sketch below).
In theory 64-128KiB should be even better, but TokuDB does not
handle such sizes properly (performance bug: linear search for a free
block in fragmented storage)
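A sketch of the SSD recommendation above (hypothetical table; tokudb_block_size is a session variable, applied at table creation):
SET SESSION tokudb_block_size = 1048576;  -- 1MiB instead of the 4MiB default
CREATE TABLE events_ssd (
  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  payload BLOB
) ENGINE=TokuDB;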
30. FT properties
• Writes are delayed for as long as possible =>
• N random writes are amortized into one single big write
• This may become a serious liability: a huge number of messages not yet merged into
leaf nodes
• A SELECT will have to traverse all these messages
• Especially bad for point SELECT queries
• Remember: PRIMARY KEY and UNIQUE KEY constraints REQUIRE a
HIDDEN POINT SELECT lookup
• UNIQUE KEY: a performance killer for TokuDB
• non-sequential PRIMARY KEY: a performance killer for TokuDB (see the sketch below)
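Hypothetical schemas illustrating the two "performance killers" above. Every INSERT into t_slow must do hidden point lookups to check the UNIQUE constraint and to place the row at a random primary-key position; t_fast, with a sequential PK and a non-unique index, can absorb writes as buffered messages without reading leaves first:
CREATE TABLE t_slow (
  uuid CHAR(36) NOT NULL PRIMARY KEY,  -- non-sequential PK
  email VARCHAR(255),
  UNIQUE KEY uk_email (email)          -- forces a hidden point SELECT
) ENGINE=TokuDB;

CREATE TABLE t_fast (
  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- sequential PK
  email VARCHAR(255),
  KEY k_email (email)                  -- non-unique: a blind write is enough
) ENGINE=TokuDB;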
31. Implication of slow selects
• Unique keys: background uniqueness checks cause implicit reads
• Foreign keys: background referential checks (not supported in
TokuDB anyway)
• SELECT by secondary index: requires two lookups (the index,
then the primary key)
32. Covering indexes
• SELECT user_name
FROM users
WHERE user_email='sherlock@holmes.guru'
• Instead of INDEX (user_email) =>
• INDEX (user_email, user_name), so the query is answered from the index alone (DDL below)
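The same covering index as DDL, against the slide's users table; with it, the query above skips the second (primary key) lookup:
ALTER TABLE users ADD INDEX idx_email_name (user_email, user_name);

EXPLAIN SELECT user_name
FROM users
WHERE user_email = 'sherlock@holmes.guru';
-- expect "Using index" in the Extra column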
33. When to use Fractal Tree?
• Tables with many indexes (better if not UNIQUE) and intensive
writes into those tables
• Slow storage
• Saving space on fast, expensive storage
• Less write amplification (good for SSD health)
• Cloud instances are often a good fit: storage is either slow, or
expensive when fast
39. Eviction
• tokudb_cache_size: the amount of memory TokuDB allocates for
nodes in memory
• TokuDB’s term for this cache is the CACHETABLE; see its status variables:
• show global status like '%CACHETABLE%';
• Eviction: a background process that keeps memory consumption <=
tokudb_cache_size
• It starts only when size_of(nodes_in_memory) > tokudb_cache_size
• So TokuDB will temporarily use more memory than tokudb_cache_size
• User threads are stalled if used memory > tokudb_cache_size * 1.2
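To compare the configured budget with actual usage, a hedged sketch using only names from this slide (exact status variable names vary by version):
SELECT @@global.tokudb_cache_size AS cache_budget_bytes;
SHOW GLOBAL STATUS LIKE '%CACHETABLE_SIZE%';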
40. Eviction algorithm
The CACHETABLE uses a GCLOCK algorithm (not LRU) to manage nodes in memory.
The eviction algorithm in simple steps:
• If size_of(nodes_in_memory) > tokudb_cache_size:
find a victim to remove from memory;
the node with the smallest access count is removed (evicted);
if the node is DIRTY, it is sent to a background process to be written to disk
(Tokudb_CACHETABLE_SIZE_WRITING: the size of nodes in the background write
queue)
• Potential memory consumption is therefore tokudb_cache_size +
Tokudb_CACHETABLE_SIZE_WRITING (see the query below)
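A hedged monitoring query for that bound; on 5.6-era servers the status table lives in information_schema (5.7 moved it to performance_schema.global_status):
SELECT @@global.tokudb_cache_size
     + (SELECT VARIABLE_VALUE
        FROM information_schema.GLOBAL_STATUS
        WHERE VARIABLE_NAME = 'TOKUDB_CACHETABLE_SIZE_WRITING')
       AS potential_memory_bytes;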
41. Partial eviction
• For non-leaf, non-dirty nodes the Evictor may choose to perform
a partial eviction
• Partial eviction has two stages:
• compress a part of the node
• if it is still unused, remove it from memory
• Variables that control this (query below):
• tokudb_enable_partial_eviction
• tokudb_compress_buffers_before_eviction
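A quick way to check how partial eviction is configured (the pattern matches both variables above; whether they can be changed at runtime is version-dependent):
SHOW GLOBAL VARIABLES LIKE 'tokudb%evict%';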
44. TokuDB Compression
• Only uncompressed data is stored in memory (except for the
compressed parts of partially evicted non-leaf nodes)
• It can be beneficial to use the OS page cache as a secondary cache
for compressed nodes; for this:
• tokudb_directio=OFF
• use cgroups to limit the total memory usage of the mysqld process
45. Checkpointing
• Checkpointing is the periodic process that brings data files in sync with the
transactional (redo) log files
• show global status like '%CHECKPOINT%';
• In TokuDB checkpointing is time-based; in InnoDB it is based on log file size.
• InnoDB checkpointing is fuzzy; a TokuDB checkpoint starts on a timer and runs until it is done.
• Checkpointing interval in TokuDB (example below):
• tokudb_checkpointing_period=N sec
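For example, shortening the interval from the 60-second default; to my knowledge tokudb_checkpointing_period is a dynamic global variable, but verify on your build:
SET GLOBAL tokudb_checkpointing_period = 30;
SHOW GLOBAL STATUS LIKE '%CHECKPOINT%';  -- watch the checkpoint counters move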
48. Checkpoint algorithm
• START CHECKPOINT:
• begin_checkpoint; ← all transactions are stalled
• mark all nodes in memory as PENDING;
• end_begin_checkpoint;
• Checkpoint thread: goes through all PENDING nodes; if a node is dirty, it is written to disk
• User threads: if a user query hits a PENDING node, the node is CLONED and put into the background
checkpoint thread pool
• By default the checkpoint thread pool size (number of threads) = CPU cores / 4
• That is 4 threads on a 16-core server
• In a CPU-bound workload this takes 25% of the CPU power away from user threads!
• Variable: tokudb_checkpoint_pool_threads=N (see the sketch below)
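A sketch of giving checkpointing more workers on a large box, so user queries stall on PENDING nodes less often. Whether this variable can be set at runtime is an assumption; it may require my.cnf and a restart:
SET GLOBAL tokudb_checkpoint_pool_threads = 8;  -- e.g. on a 32-core server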