Percona Live 2015
September 21-23, 2015 | Mövenpick Hotel | Amsterdam
TokuDB internals
Percona team, Vlad Lesin, Sveta Smirnova
Slides plan
Introduction to Fractal Trees and TokuDB
Files
Block files
Fractal trees storage
Cachetable
Recovery and rollback logs
MVCC
Some interesting features
3
Problem
- RAM with fast access but small size M
- disk with slow access but large size
- the data as a whole does not fit in RAM
- data is transferred in blocks of size B
- performance is bounded by the number of blocks transferred (CPU costs are ignored)
- assume all block accesses have the same cost
The goal is to minimize the number of block transfers.
4
DAM - Disk Access Machine model
5
[Diagram: DAM model — RAM of size M and a disk, with data transferred between them in blocks of size B]
B-tree
- each node consists of pivots
- a node has a fixed size B, so fetching its pivots as a group saves I/Os
- most leaves are on disk
- inserting into a leaf requires an additional I/O if that leaf is not in memory
6
[Diagram: B-tree with fanout B and height log_B(N)]
B-tree: search
- fast if the leaf is already in memory
- log_B(N) I/Os in the worst case (a worked example follows this slide)
- one I/O to read the leaf
7
[Diagram: B-tree with fanout B and height log_B(N)]
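For a sense of scale (illustrative numbers, not from the slides): with N = 10^9 rows and a fanout of B = 1024, log_B(N) = log(10^9) / log(1024) ≈ 3, so an entirely uncached point search costs roughly 3-4 I/Os, and only about one I/O when everything except the leaf level is cached.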
B-tree: fast sequential insert
- most of the nodes are cached
- disk I/O is sequential: one disk I/O per leaf, and each leaf contains many rows
8
[Diagram: B-tree with the upper levels in memory; all insertions go into the same (rightmost) leaf node]
B-tree: slow for random inserts
- most leaves are not cached
- most insertions require random I/Os
9
[Diagram: B-tree with only the upper levels in memory; random insertions land on uncached leaves]
B-tree: random inserts buffering
The idea is to buffer inserts and merge them when necessary or when the system is idle.
- reduces I/Os, since several changes to the same node can be written at once
- can slow down reads
- performance is bad under heavy load when the buffer is full
- leaves still have to be read when the buffered changes are applied
10
B-tree: pros and cons
- good for sequential inserts
- random inserts can cause heavy I/O load due to cache misses
- for large enough data most of the leaves are not in cache, so random inserts perform badly
- random insert speed degrades as the tree grows
11
Fractal tree: the idea
- a fractal tree is the same as a B-tree, but with a message buffer in each node
- buffers contain messages
- each message describes a data change
- messages are pushed down when a buffer is full (or when a node merge/split is required); a minimal code sketch of this follows the next slide
12
Fractal tree: the illustration
13
[Diagram: a fractal tree node containing a message buffer that holds messages]
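The buffering idea can be sketched in a few lines of C++. This is a deliberately naive illustration under assumed names (Message, Node, flush_buffer); it is not TokuFT's actual node format or API.

#include <cstddef>
#include <map>
#include <memory>
#include <vector>

// A hypothetical message: one buffered change to a key.
struct Message { long key; long value; };

struct Node {
    bool leaf = false;
    std::vector<long> pivots;                     // pivots separating children
    std::vector<std::unique_ptr<Node>> children;  // empty for leaves
    std::vector<Message> buffer;                  // buffered messages (non-leaf only)
    std::map<long, long> rows;                    // key -> value (leaf only)

    static constexpr std::size_t kBufferLimit = 4;

    // An insert only touches this node's buffer; a full buffer is pushed one level down.
    void insert(const Message& m) {
        if (leaf) { rows[m.key] = m.value; return; }
        buffer.push_back(m);
        if (buffer.size() >= kBufferLimit) flush_buffer();
    }

    // Push every buffered message into the buffer (or rows) of the right child.
    void flush_buffer() {
        for (const Message& m : buffer) child_for(m.key)->insert(m);
        buffer.clear();
    }

    Node* child_for(long key) {
        std::size_t i = 0;
        while (i < pivots.size() && key >= pivots[i]) ++i;
        return children[i].get();
    }
};

A point read on such a tree has to merge the buffered messages it finds on the root-to-leaf path with the leaf contents, which is the extra CPU cost mentioned on the search slide.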
Fractal tree: messages push down
14-20
[Animation: messages 1 through 5 arrive in the root node's buffer one by one; when the buffer fills, the messages are pushed down into the buffers of the appropriate children]
Fractal tree: performance analysis
- the most recently used buffers are cached
- fewer I/Os than a B-tree, since there is no need to access a leaf on every insert
- more information about changes is carried per I/O
- schema changes are broadcast messages
21
Fractal tree: search
The same as a B-tree search, but all buffered changes along the path are collected and applied to the target leaf
- the same number of I/Os as a B-tree search
- more CPU work to collect and merge the changes
- good for I/O-bound workloads
22
Fractal tree: summary
If the data is big enough, i.e. most leaves do not fit in memory,
- the number of I/Os for searches is the same as for a B-tree
- the number of I/Os for sequential inserts is the same as for a B-tree
- the number of I/Os for random inserts is lower than for a B-tree
In short, fractal trees are particularly well suited to random inserts.
23
iiBench
24
Slides plan
Introduction to Fractal Trees and TokuDB
Files
Block files
Fractal trees storage
Cachetable
Recovery and rollback logs
MVCC
Some interesting features
25
Files
- Lock files
- File map
- Environment
- Crash recovery log
- Transaction rollback log
- Fractal tree files
26
Files: lock files
The ‘*lock_dont_delete_me*’ files are lock files that are created so that
multiple TokuFT applications do not simultaneously use the same
directories.
27
Files: file map
The ‘tokuft.directory’ file is a fractal tree that contains a map of
application object names to the fractal tree file that stores them. The
directory is used to implement transactional file operations by leveraging
the row locks that are grabbed by inserts and deletes.
28
Files: environment
The ‘tokudb.environment’ file is a fractal tree file that contains data used
for upgrade.
29
Files: recovery log, rollback log
The ‘log*.tokulog*’ files are the TokuFT crash recovery log.
The ‘tokudb.rollback’ file is a block file that stores the rollback logs for all
live transactions.
30
Files: fractal trees files
The files named ‘*.tokudb’ are block files that store fractal trees.
31
Slides plan
Introduction to Fractal Trees and TokuDB
Files
Block files
Fractal trees storage
Cachetable
Recovery and rollback logs
MVCC
Some interesting features
32
Block files
- A block file is a file that stores a set of variable length blocks.
- A block file provides random access to any block given a block number.
- A block file allows new blocks to be allocated and in-use blocks to be freed.
33
Block files: blocks
- A block is a region in a file that stores data.
- A block number is used to identify a block. A block number is a 64-bit unsigned integer.
- Each block can have a different size.
34
Block files: file layout
35
[Diagram: block file layout. Header 0 at offset 0 and header 1 at offset 4K, each stamped with its own LSN and BTT info; the rest of the file is a sequence of variable length blocks (#42, #7, #177) plus the two BTTs, with each header pointing at its BTT offset. Example entry: BTT 1[block #7] = {offset #7, size #7}]
Block files: block translation
- The block translation table (BTT) is a data structure that maps a set of blocks.
- The block translation table maps a block number to the block's offset within the file and its size. The BTT is just a giant array indexed by block number (a minimal sketch follows this slide).
- The BTT is written to the file when the file is checkpointed.
- Each file header points to a BTT in the file.
36
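A minimal C++ sketch of what such a block translation table might look like; the names (Translation, BlockTranslationTable) are invented for illustration and this is not TokuFT's actual structure.

#include <cstdint>
#include <vector>

// One translation entry: where a block lives in the file and how big it is.
struct Translation {
    uint64_t offset = 0;
    uint64_t size = 0;
    bool in_use = false;
};

// "A giant array indexed by block number."
class BlockTranslationTable {
public:
    // Record (or update) the location of a block.
    void set(uint64_t block_num, uint64_t offset, uint64_t size) {
        if (block_num >= table_.size()) table_.resize(block_num + 1);
        table_[block_num] = {offset, size, true};
    }

    // Random access to any block given its block number.
    const Translation& get(uint64_t block_num) const { return table_.at(block_num); }

    // Freeing a block just marks its translation entry as unused.
    void free_block(uint64_t block_num) { table_.at(block_num).in_use = false; }

private:
    std::vector<Translation> table_;
};

Serializing this array (plus a checksum) as an ordinary variable length block is what a checkpoint writes out, and the file header records where that block lives.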
Block files: file layout
- Two headers at offsets 0 and 4K
- Each stamped with its own LSN
- Each with its own BTT info (offset, size)
- Sequence of variable length blocks
- BTT is just a variable length block + checksum
- Blocks are aligned % 4096, so there can be gaps
- There are several block allocation strategies that can be used. The default is first fit: it finds the free region of a given size with the lowest file offset in the file (a first-fit sketch follows this slide).
37
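As an illustration of the first-fit strategy only (not TokuFT's allocator), here is a small C++ sketch that keeps free regions sorted by offset and hands out the lowest-offset region that fits, rounded up to the 4096-byte alignment mentioned above.

#include <cstdint>
#include <map>
#include <optional>

// Free regions of the block file, keyed by offset -> size.
class FirstFitAllocator {
public:
    void add_free_region(uint64_t offset, uint64_t size) { free_[offset] = size; }

    // First fit: return the free region with the lowest file offset that is big enough.
    std::optional<uint64_t> allocate(uint64_t size) {
        size = align_up(size);  // blocks are aligned % 4096, so there can be gaps
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if (it->second < size) continue;
            uint64_t offset = it->first;
            uint64_t leftover = it->second - size;
            free_.erase(it);
            if (leftover > 0) free_[offset + size] = leftover;  // keep the unused tail free
            return offset;
        }
        return std::nullopt;  // no free region is big enough; the file would have to grow
    }

private:
    static uint64_t align_up(uint64_t n) { return (n + 4095) & ~uint64_t{4095}; }
    std::map<uint64_t, uint64_t> free_;
};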
Block files: fragmentation
- Fragmentation is caused by the mismatch between block alignment and variable length blocks. With 1 MB blocks and 4 KB block alignment, the fragmentation overhead is about 0.4%.
- Fragmentation is also caused by freed blocks not making the space
immediately available to the file system. Two possible remedies are to
use sparse files with file system hole punching or to periodically move
blocks to the beginning of a file and truncate the file.
38
Slides plan
Introduction to Fractal Trees and TokuDB
Files
Block files
Fractal trees storage
Cachetable
Recovery and rollback logs
MVCC
Some interesting features
39
Fractal tree storage
- TokuFT uses block files to store fractal trees.
- TokuFT stores one fractal tree in one block file.
- Each node in a fractal tree is stored in its own block.
- The root block number identifies the root block of the tree.
- Each node is labeled with its height in the tree. Leaf nodes have height
0. The parent of a node has height = height of the node + 1.
40
Leaf node
- A leaf node consists of basement nodes.
- Each leaf node consists of a node header, a directory of the basement node offsets and sizes, a sequence of N-1 pivots, and a sequence of N basement nodes.
- The basement node directory is used to support point queries to leaf entries in a specific basement node.
- The intent of a basement node is to allow a point query to read only a basement node from disk rather than the entire leaf node.
- Each basement node consists of a sequence of leaf entries (a layout sketch follows this slide).
41
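A rough C++ sketch of the leaf node layout just described; the field names are invented, and the real on-disk format (compression, checksums, serialization) is omitted.

#include <cstdint>
#include <string>
#include <vector>

// One row record inside a leaf entry (transaction information omitted).
struct LeafEntry {
    std::string key;
    std::string value;
};

// A basement node: the unit a point query actually reads from disk.
struct BasementNode {
    std::vector<LeafEntry> entries;  // sorted by key
};

// Directory entry: where each basement node sits inside the leaf node block.
struct BasementDirEntry {
    uint32_t offset;
    uint32_t size;
};

// A leaf node: header fields + directory + N-1 pivots + N basement nodes.
struct LeafNode {
    uint32_t height = 0;                        // leaves have height 0
    std::vector<BasementDirEntry> directory;    // N entries
    std::vector<std::string> pivots;            // N-1 pivots between basement nodes
    std::vector<BasementNode> basement_nodes;   // N basement nodes

    // Pick the basement node that may contain `key` (the point query path).
    std::size_t basement_index_for(const std::string& key) const {
        std::size_t i = 0;
        while (i < pivots.size() && key >= pivots[i]) ++i;
        return i;
    }
};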
Fractal tree storage: example
42
[Diagram: tree header (root node = #3, metadata); non-leaf #3 with height=2 and children #4, #5; non-leaf #4 with height=1; non-leaf #5 with height=1 and children #6, etc.; leaf #6 with height=0 containing msg buffer #2 and basement node #37, which holds leaf entry #979 (key, txn record[]). Legend: node size = target size for uncompressed nodes; fanout = number of children; basement node size = target size of uncompressed basement nodes]
Fractal tree parameters
The fractal tree has the following parameters, which are stored with its metadata:
- Node size (the default target is 4 MB)
- Basement node size
- Fanout (the default target is 16 children)
- Compression
43
Slides plan
Introduction to Fractal Trees and TokuDB
Files
Block files
Fractal trees storage
Cachetable
Recovery and rollback logs
MVCC
Some interesting features
44
Cachetable: purpose
- The purpose of the cache table is to control the memory residency of a
set of objects that are stored in cache files.
- The cache table has an upper bound on the total memory used to store
these objects.
- The optimization is to keep the hot objects in memory to maximize app
throughput by minimizing I/O operations.
- The cache table must also write dirty objects to the cache files when
the objects are removed, evicted, or checkpointed.
- The cache table uses a clock algorithm to select objects for eviction.
45
Cachetable: structure
- A cache table manages a set of cache files and a set of cached memory
objects that are stored in the cache files.
- The cache table stores the set of cache files in a linked list.
- The cache table stores the set of memory objects in a big hash table.
- The cache table manages a set of background threads that are used to
perform compute and I/O intensive work.
46
Cachetable: background threads
- evictor: flushes memory objects from the cache.
- checkpointer: does the begin and end of checkpoint work.
- cleaner: flushes buffered fractal tree messages down the tree.
47
Evictor: purpose
The cache table maintains a cache of memory objects. Since big data does
not fit in memory, only a subset of the data can be in memory. When the
cache table memory limits are reached, some of the cache pairs must be
evicted. The purpose of evictions is to keep control of the memory
footprint of the cache table AND minimize I/O operations by keeping hot
objects in memory and kicking cold objects out of memory.
48
Evictor: memory limits
49
evictions stop < low size watermark evictions happen > low size hysteresis
low
size
watermark
0 low
size
hysteresis
high
size
hysteresis
high
size
watermark
size limit client threads sleep > high size
watermark
Evictor: memory control
- Evictions are not needed when the current size < low size watermark.
- Evictions are needed when the current size > low size hysteresis.
- Client threads sleep when the current size > high size watermark.
- Client threads wake up when the current size < high size hysteresis (these decisions are sketched below).
50
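The hysteresis behaviour can be summarized in a couple of C++ helper functions. The threshold names mirror the slide; the logic itself is a simplified illustration, not TokuFT's code.

#include <cstdint>

// Thresholds from the slide, in order:
// 0 < low_watermark < low_hysteresis < high_hysteresis < high_watermark <= size limit
struct EvictorLimits {
    uint64_t low_watermark;
    uint64_t low_hysteresis;
    uint64_t high_hysteresis;
    uint64_t high_watermark;
};

// The evictor starts evicting once the size rises above the low hysteresis
// and stops only when the size drops back below the low watermark.
bool eviction_needed(uint64_t size, bool currently_evicting, const EvictorLimits& l) {
    if (size > l.low_hysteresis) return true;
    if (size < l.low_watermark) return false;
    return currently_evicting;  // in between: keep doing whatever we were doing
}

// Client threads sleep above the high watermark and wake below the high hysteresis.
bool clients_must_sleep(uint64_t size, bool currently_sleeping, const EvictorLimits& l) {
    if (size > l.high_watermark) return true;
    if (size < l.high_hysteresis) return false;
    return currently_sleeping;
}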
Evictor: clock algorithm
- a saturating counter is increased each time a pair is touched
- the evictor iterates over cachetable pairs until the cachetable size reaches some target limit
- if a pair is locked, it is skipped
- otherwise its counter is decreased
- when the counter reaches 0, the pair is selected as a victim (the sweep is sketched after this slide)
- partial eviction can be done on any node, regardless of its counter value, if it has clean partitions that use a lot of space and there is high cache pressure
51
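A compact C++ sketch of the clock ("second chance") sweep described above; CachePair and its fields are invented for illustration, and locking details, partial eviction and the actual write-back I/O are left out.

#include <cstdint>
#include <vector>

struct CachePair {
    uint64_t size = 0;
    int counter = 0;       // saturating counter, bumped on every touch
    bool locked = false;   // pinned by a client thread
    bool evicted = false;
};

constexpr int kMaxCounter = 16;  // saturation limit, chosen arbitrarily here

// Called by client threads on a cache hit.
void touch(CachePair& p) {
    if (p.counter < kMaxCounter) ++p.counter;
}

// One evictor pass: sweep the pairs, decrementing counters, and evict
// pairs whose counter has reached 0, until the cache is small enough.
void evictor_sweep(std::vector<CachePair>& pairs, uint64_t& current_size,
                   uint64_t target_size) {
    for (CachePair& p : pairs) {
        if (current_size <= target_size) break;
        if (p.locked || p.evicted) continue;   // skip pinned pairs
        if (p.counter > 0) {
            --p.counter;                       // give it another chance
        } else {
            p.evicted = true;                  // victim: write back if dirty, then free
            current_size -= p.size;
        }
    }
}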
Evictor: clock algorithm
52-56
[Animation: three cache pairs with counters 10, 5 and 1; a client touch increases the middle counter from 5 to 6; the evictor then sweeps the pairs, decreasing counters (10 to 9, 6 to 5) and evicting the last pair once its counter reaches 0]
Checkpoints: purpose
- The purpose of a checkpoint is to make a durable snapshot of a set of
open fractal tree files, a set of live and prepared transactions, and a set
of dirty blocks in the cache table.
- A checkpoint contains a list of all of the cache files and a list of all of
the live transactions. These lists allow recovery to restore the state of
the cache files and transactions prior to replaying the recovery log.
- A checkpoint must also write all of the dirty nodes and update the
cache file with a snapshot of the fractal tree block table and the LSN of
the checkpoint.
57
Checkpoint logic
A TokuFT checkpoint has a begin phase and an end phase.
- Write lock the checkpoint safe lock. This serializes checkpoints.
- Write lock the multi-operation (MO) lock. This serializes checkpoints
with transactions and files so they can be marked for checkpoint and
logged.
- Run begin checkpoint logic.
- Unlock MO lock.
- Run end checkpoint logic.
- Unlock the checkpoint safe lock.
58
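The lock ordering can be sketched schematically in C++. The lock and function names are placeholders rather than TokuFT's real identifiers, and the real locks are reader-writer locks ("write lock" on the slide); plain mutexes are used here to keep the sketch short.

#include <mutex>

// Placeholder locks: the checkpoint-safe lock serializes checkpoints with each
// other; the multi-operation (MO) lock serializes the begin phase with
// transaction commits/aborts and file operations.
std::mutex checkpoint_safe_lock;
std::mutex multi_operation_lock;

void begin_checkpoint() { /* log begin entry, fassociate/xstillopen entries, mark dirty pairs */ }
void end_checkpoint()   { /* write marked dirty pairs and file headers, log end entry */ }

void run_checkpoint() {
    std::lock_guard<std::mutex> cs(checkpoint_safe_lock);   // one checkpoint at a time
    {
        // The MO lock is held only for the (fast) begin phase.
        std::lock_guard<std::mutex> mo(multi_operation_lock);
        begin_checkpoint();
    }                       // MO lock released: commits/aborts can proceed again
    end_checkpoint();       // the slow part: actually write out the dirty data
}                           // checkpoint-safe lock released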
Begin checkpoint
- Pin all of the open cache files.
- Write the checkpoint begin log entry to the recovery log.
- Write fassociate log entries for all open cache files to the recovery log.
- Write xstillopen log entries for all live transactions to the recovery log.
- Mark all cache table pairs for checkpoint.
- Call the begin checkpoint on all cache files in the checkpoint.
The begin checkpoint phase MUST be fast since the MO lock is held, which blocks transaction commits/aborts.
59
End checkpoint
- Checkpoint all cache table pairs that are dirty and marked for checkpoint. This writes the dirty data and updates the fractal tree's checkpoint block translation table in the cache file.
- Checkpoint all cache files that are marked for checkpoint. This writes the file header and the checkpoint block translation table.
- Write the end checkpoint log entry to the recovery log.
60
Cleaner
The purpose of the cleaner is to flush messages down fractal trees without affecting the I/O amortization of fractal trees too much and without consuming too many system resources.
61
Slides plan
Introduction to Fractal Trees and TokuDB
Files
Block files
Fractal trees storage
Cachetable
Recovery and rollback logs
MVCC
Some interesting features
62
Recovery and rollback logs
- The purpose of rollback logging is to efficiently capture transaction
changes so that these changes can be either committed or rolled back.
- The purpose of recovery logging is to restore the state of the database
to some point in time without missing transactionally committed
changes up to that time.
63
Recovery log
- The recovery log contains those changes to the database that occurred
since the last checkpoint.
- The recovery algorithm executes those changes in the log since the last
checkpoint against the last checkpointed version of the database.
- This restores the state of the database to the state that existed when
the database crashed without losing any changes by committed
transactions.
64
Recovery log files
- The recovery log is a sequence of files.
- The recovery log file names match ‘logN.tokulogM’, where N is a
monotonically increasing sequence number and M is the TokuFT
version number.
- Recovery log events are appended to the end of the newest log file (the
one with the largest sequence number).
- Recovery log files are 100MB in size. When completely written, a new
log file with the next sequence number is created.
- Old recovery log files are automatically removed when their largest
LSN is smaller than the last checkpoint LSN.
65
Recovery Log Group Commit
- Fsyncs are SLOW.
- Fsyncs are used to make the recovery log persistent.
- How to increase throughput beyond the fsync limit?
- Group commit writes MANY log events from multiple client threads together and fsyncs the log ONCE.
- The group commit algorithm uses a double log buffer and some synchronization locks to elect one thread to do the fsync and coordinate with the other threads (sketched after this slide).
66
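A simplified C++ sketch of double-buffered group commit in the spirit of the description above: many threads append to an in-memory buffer and one elected thread swaps the buffers and performs a single fsync on behalf of all of them. The names and structure are illustrative, not TokuFT's implementation.

#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <string>
#include <vector>

class GroupCommitLog {
public:
    // Called by client threads at commit time: append the log event and wait
    // until some thread has fsync'd a buffer that contains it.
    void commit(const std::string& event) {
        std::unique_lock<std::mutex> lk(mu_);
        uint64_t my_batch = current_batch_;
        in_buffer_.push_back(event);

        while (durable_batch_ < my_batch) {
            if (!flush_in_progress_) {
                // This thread is elected to flush the current batch.
                flush_in_progress_ = true;
                std::swap(in_buffer_, out_buffer_);
                uint64_t flushing = current_batch_++;
                lk.unlock();
                write_and_fsync(out_buffer_);        // ONE fsync for MANY commits
                lk.lock();
                out_buffer_.clear();
                durable_batch_ = flushing;
                flush_in_progress_ = false;
                cv_.notify_all();
            } else {
                // Someone else is flushing: wait until our batch is durable
                // or a new flusher is needed.
                cv_.wait(lk, [&] {
                    return durable_batch_ >= my_batch || !flush_in_progress_;
                });
            }
        }
    }

private:
    void write_and_fsync(const std::vector<std::string>& /*events*/) {
        // Append the events to the log file and call fsync() once (omitted here).
    }

    std::mutex mu_;
    std::condition_variable cv_;
    std::vector<std::string> in_buffer_, out_buffer_;
    bool flush_in_progress_ = false;
    uint64_t current_batch_ = 1;
    uint64_t durable_batch_ = 0;   // highest batch known to be on disk
};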
Fractal Tree Snapshots and Recovery
- Each fractal tree file contains two snapshots.
- Each snapshot is labeled with a checkpoint LSN which is its version
number.
- Recovery opens the snapshot version with the largest checkpoint LSN
that is less than or equal to the checkpoint LSN from the recovery log.
67
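That selection rule is simple enough to show directly; the Snapshot struct and function below are hypothetical helpers, not TokuFT code.

#include <cstdint>
#include <optional>
#include <vector>

struct Snapshot {
    uint64_t checkpoint_lsn;   // the version label of this snapshot
    uint64_t header_offset;    // where its header lives in the file (e.g. 0 or 4K)
};

// Open the snapshot with the largest checkpoint LSN that is still
// <= the checkpoint LSN recorded in the recovery log.
std::optional<Snapshot> choose_snapshot(const std::vector<Snapshot>& snapshots,
                                        uint64_t recovery_checkpoint_lsn) {
    std::optional<Snapshot> best;
    for (const Snapshot& s : snapshots) {
        if (s.checkpoint_lsn > recovery_checkpoint_lsn) continue;
        if (!best || s.checkpoint_lsn > best->checkpoint_lsn) best = s;
    }
    return best;
}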
Rollback Log Location
- Each transaction has its own rollback log.
- Each transaction’s rollback log is a sequence of blocks in the file called
‘tokudb.rollback’.
- Small transactions will seldom have their rollback log written to this
file. The transaction’s rollback log will remain in memory if the
transaction retires between checkpoints AND if the rollback log is
small.
68
Checkpointing the Rollback Log
- The rollback log is stored in blocks in the ‘tokudb.rollback’ file.
- These blocks are cached in the cache table.
- A checkpoint of the cache table will write the dirty blocks to the file.
69
Slides plan
Introduction to Fractal Trees and TokuDB
Files
Block files
Fractal trees storage
Cachetable
Recovery and rollback logs
MVCC
Some interesting features
70
MVCC purpose
- Implement different transaction isolation levels
- Reduce the number of locks in the system
71
MVCC implementation
- The lock tree ensures that if a transaction T_i modifies a leaf entry, then no other transaction modifies the same leaf entry until T_i either commits or aborts
- when a transaction T_i modifies a key, the leaf entry stores T_i, T_i's parent, T_i's grandparent, and so on all the way up to T_i's oldest ancestor (the provisional stack)
- at the bottom of this stack is another stack of committed values from previous versions
- each transaction carries a list of the transactions that were running when it started; this list is called the 'live list' (a data-structure sketch follows this slide)
72
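A simplified C++ sketch of the stacks described above; the field names are invented, nested-transaction placeholders are glossed over, and the real leaf entry encoding is considerably more compact.

#include <cstdint>
#include <string>
#include <vector>

using TxnId = uint64_t;

// A committed version of the row: who committed it and the value.
struct CommittedValue {
    TxnId txn;
    std::string value;
};

// One provisional (uncommitted) entry: the modifying transaction and,
// for nested transactions, either a placeholder or a value.
struct ProvisionalValue {
    TxnId txn;
    std::string value;        // empty for a placeholder
};

// An MVCC leaf entry for one key.
struct LeafEntryMvcc {
    std::string key;
    // Bottom: committed values from previous versions (newest first).
    std::vector<CommittedValue> committed_stack;
    // Top: the chain from the oldest ancestor of the writing transaction
    // down to the innermost nested transaction.
    std::vector<ProvisionalValue> provisional_stack;
};

// Each transaction keeps the set of transactions live at its start.
struct Transaction {
    TxnId id;
    std::vector<TxnId> live_list;  // transactions running when this one began
};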
MVCC implementation
73
[Diagram: transaction timeline and leaf entry value stack for the example below. T_3_W is a parent transaction with nested statements T_3_1 and T_3_2; the leaf entry holds the committed T_1_W value at the bottom, with provisional placeholders and values for T_3_W, T_3_1 and T_3_2 stacked above it.
T_1_W: insert into t values (1, 10)
T_2_R: select * from t where K=1
T_3_W: begin; update t set v = v+10 where K=1; update t set v = v+10 where K=1; commit
T_4_R: select * from t where K=1
T_5_R: select * from t where K=1]
MVCC: Rule for a transaction reading an element
- First look at the provisional stack. If the value associated with the
innermost transaction passes the test defined below, return it.
- Otherwise, move on to the most recently committed value. For each
committed value, if the transaction passes the test defined below,
return it, otherwise continue moving down the stack and testing older
committed values.
74
MVCC: Rule for a transaction reading an element
The rule for deciding whether to return a value from the provisional stack:
- if the provisional stack's root transaction is the same as the root of the transaction doing the read, return the value
- if the provisional stack's root transaction is less than or equal to the LSN of the read transaction, and is not in the read transaction's live list, return the value
- otherwise, do not return a value
75
MVCC: Rule for a transaction reading an element
The rule for deciding whether to return a committed value (both rules are sketched in code after this slide):
- if the committed value's transaction is less than or equal to the LSN of the read transaction, and is not in the read transaction's live list, return the value
- otherwise, do not return a value
76
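Both rules can be sketched together in C++. The structures are stripped-down versions of the earlier leaf entry sketch, and comparing transaction IDs against the reader's ID stands in for the slides' LSN comparison; treat this as an illustration, not TokuFT's algorithm.

#include <algorithm>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

using TxnId = uint64_t;

struct Version { TxnId txn; std::string value; };

struct LeafEntryMvcc {
    std::vector<Version> provisional_stack;  // index 0 = root (oldest ancestor), back = innermost
    std::vector<Version> committed_stack;    // index 0 = most recently committed
};

struct Reader {
    TxnId root_id;                 // root of the reading transaction
    TxnId id;                      // the reading transaction itself
    std::vector<TxnId> live_list;  // transactions live when the reader started
};

static bool in_live_list(const Reader& r, TxnId t) {
    return std::find(r.live_list.begin(), r.live_list.end(), t) != r.live_list.end();
}

// Rule for the provisional stack: visible if written by the reader's own root,
// or by a transaction that is old enough and not concurrent with the reader.
static bool provisional_visible(const Reader& r, TxnId writer_root) {
    if (writer_root == r.root_id) return true;
    return writer_root <= r.id && !in_live_list(r, writer_root);
}

// Rule for committed values: old enough and not concurrent with the reader.
static bool committed_visible(const Reader& r, TxnId committer) {
    return committer <= r.id && !in_live_list(r, committer);
}

// Overall read: try the provisional stack first, then walk the committed stack.
std::optional<std::string> read_value(const LeafEntryMvcc& le, const Reader& r) {
    if (!le.provisional_stack.empty() &&
        provisional_visible(r, le.provisional_stack.front().txn)) {
        return le.provisional_stack.back().value;   // innermost provisional value
    }
    for (const Version& v : le.committed_stack) {   // newest to oldest
        if (committed_visible(r, v.txn)) return v.value;
    }
    return std::nullopt;
}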
MVCC: Promotion and garbage collection
- if the root of the current transaction is not the same as the root of the transaction in the provisional stack, then the provisional stack is promoted into the committed stack
- garbage collection removes values that are no longer needed from the leaf entry
77
Slides plan
Introduction to Fractal Trees and TokuDB
Files
Block files
Fractal trees storage
Cachetable
Recovery and rollback logs
MVCC
Some interesting features
78
Some interesting features
- Multiple clustered indexes
- Hot indexing
- Transactional file operations
79
Questions
80
