6. When B-Tree is good
• When the data size doesn’t exceed available memory
• When the application is mostly performing read (SELECT)
operations, or when read performance is more important
than write performance
7. When B-Tree is not good
• As soon as the data size exceeds available memory,
performance drops rapidly
• Choosing flash-based storage helps performance, but only
to a certain extent: in the long run, memory limits still cause
performance to suffer
8. To summarize
• The B-tree was designed for optimal data retrieval
performance, not for data updates (insert, delete, update)
• This shortcoming created a need for data structures that
provide better write performance.
9. Cases when B-Tree is not optimal
• accepting and storing event logs
• storing measurements from a high-frequency sensor
• tracking user clicks, and so on
• For such cases, two new data structures were created: log-
structured merge (LSM) trees and Fractal Trees®.
13. Fractal Trees
• Invented around 2007
• Commercialized by Tokutek as the TokuDB storage engine
• 2015: became part of Percona
14. Fractal Tree
• Writes are delayed: each change is sent as a message
• Multiple delayed writes are combined into a single IO
• => SELECTs have more work to do:
• they must walk through all pending messages
16. Fractal tree benefits
• Tables that have a lot of indexes (preferably non-unique
indexes)
• Tables with a heavy write workload
• Systems with slow storage
• Saving space when storage is fast but expensive
17. From idea to reality
• Need concurrency-control mechanisms
• Need crash safety
• Need transactions, logging and recovery
• Need to support multithreading
• Need to integrate with the MySQL API layer
• Not everything is perfect yet
24. Shape your tree (settings per TABLE)
• tokudb_block_size (default 4MiB)
• size of a node in memory (on disk the node is stored compressed)
• tokudb_read_block_size (default 64KiB)
• size of a basement node: the minimal read unit, and also the block size
for compression
• Trade-off: a smaller tokudb_read_block_size is better for point reads, but
leads to more random IO
• tokudb_fanout (default 16): the maximum number of children
per non-leaf node (number of pivot keys = tokudb_fanout - 1); see the sketch below
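For illustration, a minimal sketch of shaping a new table. The table name and columns are hypothetical; as I understand it, these are session variables whose values are baked into the fractal tree when the table is created:
SET SESSION tokudb_read_block_size = 32768;  -- 32KiB basement nodes: faster point reads, more random IO
SET SESSION tokudb_fanout = 32;              -- wider non-leaf nodes: 31 pivot keys each
CREATE TABLE metrics (
  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  v DOUBLE
) ENGINE=TokuDB;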
25. Recommendations
tokudb_block_size:
4MiB is a good block size for spinning disks.
For SSDs a smaller block size may be beneficial; I often use 1MiB (see the sketch below).
In theory 64-128KiB should be even better, but TokuDB does not
handle such sizes properly (performance bug: linear search for a free
block in fragmented storage)
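A sketch of the SSD recommendation above (hypothetical table; tokudb_block_size is a session variable, applied at table creation):
SET SESSION tokudb_block_size = 1048576;  -- 1MiB instead of the 4MiB default
CREATE TABLE events_ssd (
  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  payload BLOB
) ENGINE=TokuDB;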
30. FT properties
• Writes are delayed for as long as possible =>
• N random writes are amortized into one single big write
• This may become a serious liability: a huge number of messages not yet merged into
leaf nodes
• A SELECT will have to traverse all these messages
• Especially bad for point SELECT queries
• Remember: PRIMARY KEY and UNIQUE KEY constraints REQUIRE a
HIDDEN POINT SELECT lookup
• UNIQUE KEY: a performance killer for TokuDB
• non-sequential PRIMARY KEY: a performance killer for TokuDB (see the sketch below)
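Hypothetical schemas illustrating the two "performance killers" above. Every INSERT into t_slow must do hidden point lookups to check the UNIQUE constraint and to place the row at a random primary-key position; t_fast, with a sequential PK and a non-unique index, can absorb writes as buffered messages without reading leaves first:
CREATE TABLE t_slow (
  uuid CHAR(36) NOT NULL PRIMARY KEY,  -- non-sequential PK
  email VARCHAR(255),
  UNIQUE KEY uk_email (email)          -- forces a hidden point SELECT
) ENGINE=TokuDB;

CREATE TABLE t_fast (
  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- sequential PK
  email VARCHAR(255),
  KEY k_email (email)                  -- non-unique: a blind write is enough
) ENGINE=TokuDB;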
31. Implication of slow selects
• Unique keys: background uniqueness checks cause implicit reads
• Foreign keys: background referential checks (not supported in
TokuDB anyway)
• SELECT by secondary index: requires two lookups (the index,
then the primary key)
32. Covering indexes
• SELECT user_name
FROM users
WHERE user_email='sherlock@holmes.guru'
• Instead of INDEX (user_email) =>
• INDEX (user_email, user_name), so the query is answered from the index alone (DDL below)
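The same covering index as DDL, against the slide's users table; with it, the query above skips the second (primary key) lookup:
ALTER TABLE users ADD INDEX idx_email_name (user_email, user_name);

EXPLAIN SELECT user_name
FROM users
WHERE user_email = 'sherlock@holmes.guru';
-- expect "Using index" in the Extra column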
33. When to use Fractal Tree?
• Tables with many indexes (better if not UNIQUE) and intensive
writes into those tables
• Slow storage
• Saving space on fast, expensive storage
• Less write amplification (good for SSD health)
• Cloud instances are often a good fit: storage is either slow, or
expensive when fast
39. Eviction
• tokudb_cache_size: the amount of memory TokuDB allocates for
nodes in memory
• TokuDB’s term for this cache is the CACHETABLE; see its status variables:
• show global status like '%CACHETABLE%';
• Eviction: a background process that keeps memory consumption <=
tokudb_cache_size
• It starts only when size_of(nodes_in_memory) > tokudb_cache_size
• So TokuDB will temporarily use more memory than tokudb_cache_size
• User threads are stalled if used memory > tokudb_cache_size * 1.2
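To compare the configured budget with actual usage, a hedged sketch using only names from this slide (exact status variable names vary by version):
SELECT @@global.tokudb_cache_size AS cache_budget_bytes;
SHOW GLOBAL STATUS LIKE '%CACHETABLE_SIZE%';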
40. Eviction algorithm
The CACHETABLE uses a GCLOCK algorithm (not LRU) to manage nodes in memory.
The eviction algorithm in simple steps:
• If size_of(nodes_in_memory) > tokudb_cache_size:
find a victim to remove from memory;
the node with the smallest access count is removed (evicted);
if the node is DIRTY, it is sent to a background process to be written to disk
(Tokudb_CACHETABLE_SIZE_WRITING: the size of nodes in the background write
queue)
• Potential memory consumption is therefore tokudb_cache_size +
Tokudb_CACHETABLE_SIZE_WRITING (see the query below)
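A hedged monitoring query for that bound; on 5.6-era servers the status table lives in information_schema (5.7 moved it to performance_schema.global_status):
SELECT @@global.tokudb_cache_size
     + (SELECT VARIABLE_VALUE
        FROM information_schema.GLOBAL_STATUS
        WHERE VARIABLE_NAME = 'TOKUDB_CACHETABLE_SIZE_WRITING')
       AS potential_memory_bytes;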
41. Partial eviction
• For non-leaf, non-dirty nodes the Evictor may choose to perform
a partial eviction
• Partial eviction has two stages:
• compress a part of the node
• if it is still unused, remove it from memory
• Variables that control this (query below):
• tokudb_enable_partial_eviction
• tokudb_compress_buffers_before_eviction
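A quick way to check how partial eviction is configured (the pattern matches both variables above; whether they can be changed at runtime is version-dependent):
SHOW GLOBAL VARIABLES LIKE 'tokudb%evict%';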
44. TokuDB Compression
• Only uncompressed data is stored in memory (except for the
compressed parts of partially evicted non-leaf nodes)
• It can be beneficial to use the OS page cache as a secondary cache
for compressed nodes; for this:
• tokudb_directio=OFF
• use cgroups to limit the total memory usage of the mysqld process
45. Checkpointing
• Checkpointing is the periodic process that brings data files in sync with the
transactional (redo) log files
• show global status like '%CHECKPOINT%';
• In TokuDB checkpointing is time-based; in InnoDB it is based on log file size.
• InnoDB checkpointing is fuzzy; a TokuDB checkpoint starts on a timer and runs until it is done.
• Checkpointing interval in TokuDB (example below):
• tokudb_checkpointing_period=N sec
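For example, shortening the interval from the 60-second default; to my knowledge tokudb_checkpointing_period is a dynamic global variable, but verify on your build:
SET GLOBAL tokudb_checkpointing_period = 30;
SHOW GLOBAL STATUS LIKE '%CHECKPOINT%';  -- watch the checkpoint counters move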
48. Checkpoint algorithm
• START CHECKPOINT:
• begin_checkpoint; ← all transactions are stalled
• mark all nodes in memory as PENDING;
• end_begin_checkpoint;
• Checkpoint thread: goes through all PENDING nodes; if a node is dirty, it is written to disk
• User threads: if a user query hits a PENDING node, the node is CLONED and put into the background
checkpoint thread pool
• By default the checkpoint thread pool size (number of threads) = CPU cores / 4
• That is 4 threads on a 16-core server
• In a CPU-bound workload this takes 25% of the CPU power away from user threads!
• Variable: tokudb_checkpoint_pool_threads=N (see the sketch below)
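A sketch of giving checkpointing more workers on a large box, so user queries stall on PENDING nodes less often. Whether this variable can be set at runtime is an assumption; it may require my.cnf and a restart:
SET GLOBAL tokudb_checkpoint_pool_threads = 8;  -- e.g. on a 32-core server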