MySQL Buffer Management

1/31/2017 1
MySQL Buffer Management
Mijin Ahn
meeeeejin@gmail.com

Contents
1/31/2017 2
• Overview
• Buffer Pool
• Buffer Read
• LRU Replacement
• Flusher
• Doublewrite Buffer
• Synchronization

Overview
1/31/2017 4
• Buffer Pool
– Considering memory
hierarchy
– Caching frequently
accessed data into
DRAM like a cache
memory in CPU
– Exploit locality

Overview
• InnoDB Architecture
1/31/2017 5
Handler API
Transaction (trx)
Logging &
Crash
Recovery
(log & recv)
Mini
Transaction
(mtr)
Lock
(lck)
Cursor (cur)
Row (row)
B-tree (btr)
Page (page)
Buffer Manager (buf)
Free space / File Management (fsp / fil)
IO

Overview
• Buffer Manager
– Buffer Pool (buf0buf.cc) : buffer pool manager
– Buffer Read (buf0read.cc) : read buffer
– LRU (buf0lru.cc) : buffer replacement
1/31/2017 6

Overview
• Buffer Manager
– Flusher (buf0flu.cc) : dirty page writer & background flusher
– Doublewrite (buf0dblwr.cc) : doublewrite buffer
1/31/2017 7

Lists of Buffer Blocks
• Free list
– Contains free page frames
• LRU list
– Contains all the blocks holding a file page
• Flush list
– Contains the blocks holding file pages that have been modified in the
memory but not written to disk yet
1/31/2017 9
Database
Buffer
TailHead D D D
Main LRU list
Free list
D
Flush listD D

Buffer Pool Mutex
• The buffer buf_pool contains a single mutex
– buf_pool->mutex: protects all the control data structures of the
buf_pool
• The buf_pool->mutex is a hot-spot in main memory
– Causing a lot of memory bus traffic on multiprocessor systems when
processors alternatively access the mutex
– A solution to reduce mutex contention
• To create a separate lock for the page hash table
1/31/2017 10

Buffer Pool Struct
• include/buf0buf.h: buf_pool_t
1/31/2017 11

Buffer Pool Struct
1/31/2017 12

Buffer Pool Struct
1/31/2017 13

Buffer Pool Struct
1/31/2017 14

Buffer Pool Init
• buf/buf0buf.cc: buf_pool_init()
1/31/2017 15
...
Buffer pool init per instance

Buffer Pool Init
• buf/buf0buf.cc: buf_pool_init_instance()
1/31/2017 16
Create mutex for buffer pool

Buffer Pool Init
1/31/2017 17
Initialize buffer chunk

Buffer Pool Init
1/31/2017 18
Create page hash table

Buffer Chunk
• Total buffer pool size
= x * innodb_buffer_pool_instances * innodb_buffer_pool_chunk_size
1/31/2017 19

Buffer Chunk Struct
• include/buf0buf.ic: buf_chunk_t
1/31/2017 20
[0]
[…]
[N]
[0] [1] [3][2]
[…] […] [N][…]
blocks
frames
Buffer chunk mem

• buf/buf0buf.cc: buf_chunk_init()
Buffer Pool Init
1/31/2017 21
Allocate chunk mem
(blocks + frames)

Buffer Pool Init
1/31/2017 22
Allocate control blocks
Allocate frames
(Page size is aligned)

Buffer Pool Init
1/31/2017 23
Add all blocks to free list
Initialize control block

Buffer Control Block (BCB)
• The control block contains
– Read-write lock
– Buffer fix count
• Which is incremented when a thread wants a file page to be fixed in a buffer
frame
• The buffer fix operation does not lock the contents of the frame
– Page frame
– File pages
• Put to a hash table according to the file address of the page
1/31/2017 24

Buffer Control Block (BCB)
1/31/2017 25
Page
Frame ptr
Mutex
RW lock
Lock hash value
…
buf_block_t
Table space ID
Page offset
Buffer fix count
IO fix
State
Hash
List (LRU, free, flush)
…
buf_page_t

• include/buf0buf.h: buf_block_t
BCB Struct
1/31/2017 26

• include/buf0buf.h: buf_block_t
BCB Struct
1/31/2017 27

• include/buf0buf.h: buf_page_t
BCB Struct
1/31/2017 28
Page identification

BCB Struct
1/31/2017 29
...
...

BCB Struct
1/31/2017 30

• buf/buf0buf.cc: buf_block_init()
BCB Init
1/31/2017 31
Set data frame
... Create block mutex & rw lock

BUFFER READ
PART 1 : READ A PAGE
1/31/2017 32

MySQL Buffer Manager: Read
Database
on Flash SSD
Database
Buffer
3. Read a page
Tail
1. Search free list
Head D D D
Main LRU List
Free list
Dirty Page Set
D D
Scan LRU List from tail
Double Write Buffer
2. Flush Dirty Pages
D

Buffer Read
• Read a page (buf0rea.cc)
– Find a certain page in the buffer pool using hash table
– If it is not in the buffer pool, then read a block from the storage
• Allocate a free block for read (include buffer block)
• Two cases
– Buffer pool has free blocks
– Buffer pool doesn’t have a free block
• Read a page
1/31/2017 34

Buffer Read
• buf/buf0buf.cc: buf_page_get_gen()
1/31/2017 36
Get the buffer pool ptr using space & offset
** 2 important things **
1) ID of a page is (space, offset) of the page
2) Buffer pool – page mapping is mapped

Buffer Read
• include/buf0buf.ic: buf_pool_get()
1/31/2017 37
Make a fold number

Buffer Read
• Why fold?
– They want to put pages together in the same buffer pool if
it is the same extents for read ahead
1/31/2017 38
Page 0
Page 1
…
Page 63
Page 64
Page 65
…
Page 127
Fold 0 Fold 1
…

Buffer Read
1/31/2017 39
Get page hash lock before
searching in the hash table
Set shared lock on hash table

Buffer Read
1/31/2017 40
...
...
Find a page in
the hash table
Page doesn’t exist in buffer pool
Read the page from the storage
retry
success

Buffer Read
1/31/2017 41
Fail

Buffer Read
1/31/2017 42
If it failed to read target page,
go to the first part of the loop

Buffer Read
• buf/buf0rea.cc: buf_read_page()
1/31/2017 43

Buffer Read
• buf/buf0rea.cc: buf_read_page_low()
1/31/2017 44

Buffer Read
1/31/2017 45
Allocate buffer block for read

Buffer Read
• buf/buf0buf.cc: buf_page_init_for_read()
1/31/2017 46

Buffer Read
1/31/2017 47
Get free block: see this later

Buffer Read
1/31/2017 48
Initialize buffer page for current read

Buffer Read
• buf/buf0buf.cc: buf_page_init()
1/31/2017 49

Buffer Read
1/31/2017 50

Buffer Read
1/31/2017 51
Insert a page into hash table

Buffer Read
1/31/2017 52
Add current block to LRU list
: see this later
Set io fix to BUF_IO_READ

Buffer Read
1/31/2017 53
Increase pending read count;
How many buffer read were
requested but not finished

Buffer Read
• We allocate a free buffer and control block
• And the block was inserted into hash table and LRU list of
corresponding buffer pool
• Now, we need to read the real content of the target page from
the storage
1/31/2017 54

Buffer Read
1/31/2017 55
Read a page from storage
: see this later

Buffer Read
1/31/2017 56
Complete the read request

Buffer Read
• buf/buf0buf.cc: buf_page_io_complete()
1/31/2017 57
Get io type (In this case, BUF_IO_READ)

Buffer Read
1/31/2017 58
Get io type (In this case, BUF_IO_READ)
...
Page corruption check based
on checksum in the page

Buffer Read
1/31/2017 59
Set io fix to BUF_IO_NONE

Buffer Read
1/31/2017 60
Decrease pending read count

BUFFER READ
PART 2 : AFTER GOT BLOCK
1/31/2017 61

Buffer Read
1/31/2017 62
...
Set access time

Buffer Read
1/31/2017 63
Do read ahead process
(default=false)

LRU REPLACEMENT
PART 1 : ADD BLOCK
1/31/2017 64

LRU Add Block
1/31/2017 65
Add current block to LRU list

LRU Add Block
• buf/buf0lru.cc: buf_LRU_add_block()
1/31/2017 66

LRU Add Block
• buf/buf0lru.cc: buf_LRU_add_block_low()
1/31/2017 67
If list is too small, then put
current block to first of the list

LRU Add Block
• buf/buf0lru.cc: buf_LRU_add_block_low()
1/31/2017 68
Else, insert current block to
after LRU_old pointer

LRU REPLACEMENT
PART 2 : GET FREE BLOCK
1/31/2017 69

LRU Get Free Block
• This function is called from a user thread when it needs a
clean block to read in a page
– Note that we only ever get a block from the free list
– Even when we flush a page or find a page in LRU scan we put it to
free list to be used
1/31/2017 70

LRU Get Free Block
• iteration 0:
– get a block from free list, success: done
– if there is an LRU flush batch in progress:
• wait for batch to end: retry free list
– if buf_pool->try_LRU_scan is set
• scan LRU up to srv_LRU_scan_depth to find a clean block
• the above will put the block on free list
• success: retry the free list
– flush one dirty page from tail of LRU to disk (= single page flush)
• the above will put the block on free list
• success: retry the free list
1/31/2017 71

LRU Get Free Block
• iteration 1:
– same as iteration 0 except:
• scan whole LRU list
• iteration > 1:
– same as iteration 1 but sleep 100ms
1/31/2017 72

LRU Get Free Block
• buf/buf0lru.cc: buf_LRU_get_free_block()
1/31/2017 73
Get buffer pool mutex
Get a free block from the free list

LRU Get Free Block
1/31/2017 74
Getting a free block succeeded

LRU Get Free Block
1/31/2017 75
If already background flushed
started, wait for it to end

LRU Get Free Block
1/31/2017 76
Find a victim page to replace
and make a free block

LRU Get Free Block
• buf/buf0lru.cc: buf_LRU_scan_and_free_block()
1/31/2017 77

LRU Get Free Block
• buf/buf0lru.cc: buf_LRU_scan_and_free_block()
1/31/2017 78

LRU Get Free Block
• buf/buf0lru.cc: buf_LRU_free_from_common_LRU_list()
1/31/2017 79
Try to free it

LRU Get Free Block
• buf/buf0lru.cc: buf_LRU_free_page()
1/31/2017 80
Get hash lock & block mutex

LRU Get Free Block
1/31/2017 81
If current page is dirty and not
flushed to disk yet, exit

– After func_exit, we are on clean case!
LRU Get Free Block
1/31/2017 82

...
...
...
• buf/buf0lru.cc: buf_LRU_block_remove_hashed()
LRU Get Free Block
1/31/2017 83

LRU Get Free Block
1/31/2017 84
...
If we have free block(s),
go to loop

LRU Get Free Block
1/31/2017 85
If we failed to make a free block, do a single page flush

FLUSH A PAGE
PART 1 : SINGLE PAGE FLUSH
1/31/2017 86

Flush a Page
• Flushing a page by
– A background flusher  batch flush (LRU & flush list)
– A single page flush  in LRU_get_free_block()
• Background flusher
– Regularly check system status (per 1000ms)
– Flush all buffer pool instances in a batch manner
1/31/2017 87

Single Page Flush
1/31/2017 88
buf_LRU_get_
free_block()
buf_flush_single
_page_from_LRU()
buf_flush_
ready_for_
flush() buf_flush
_page()
buf_flush_write
_block_low()
buf_dblwr_write
_single_page()
fil_io()
fil_flush()
buf_page_io_
complete()

• buf/buf0flu.cc: buf_flush_single_page_from_LRU()
Single Page Flush
1/31/2017 89
Full scan

• buf/buf0flu.cc: buf_flush_single_page_from_LRU()
Single Page Flush
1/31/2017 90
Check whether we can flush
current block and ready for flush
Try to flush it;
write to disk

...
• buf/buf0flu.cc: buf_flush_ready_for_flush()
Single Page Flush
1/31/2017 91
If the page is already flushed
or doing IO, return false

...
• buf/buf0flu.cc: buf_flush_page()
– Writes a flushable page from the buffer pool to a file
Single Page Flush
1/31/2017 92
Get lock

...
• buf/buf0flu.cc: buf_flush_page()
Single Page Flush
1/31/2017 93
Set fix and flush type

...
...
• buf/buf0flu.cc: buf_flush_write_block_low()
Single Page Flush
1/31/2017 94
Flush log (transaction log – WAL)

Single Page Flush
1/31/2017 95
Doublewrite off case
Write the page to dwb, then write to datafile;
See this in dwb part

Single Page Flush
1/31/2017 96
Sync buffered write to disk;
call fsync by fil_flush()

FLUSH A PAGE
PART 2 : BATCH FLUSH
1/31/2017 97

Batch Flush
• Background flusher (= page cleaner thread)
– Independent thread for flushing a dirty pages from buffer pools to
storage
– Regularly (per 1000ms) do flush from LRU tail or
– Do flush by dirty page percent (configurable)
• Thread definition
– buf/buf0flu.cc: DECLARE_THREAD(buf_flush_page_cleaner_thread)
1/31/2017 98

...
• buf/buf0flu.cc: DECLARE_THREAD(buf_flush_page_cleaner_thread)
Background Flusher
1/31/2017 99
Run until shutdown

• buf/buf0flu.cc: DECLARE_THREAD(buf_flush_page_cleaner_thread)
Background Flusher
1/31/2017 100
Nothing has
been changed
Something has
been changed!

• buf/buf0flu.cc: buf_flush_LRU_tail()
• Clears up tail of the LRU lists:
– Put replaceable pages at the tail of LRU to the free list
– Flush dirty pages at the tail of LRU to the disk
• srv_LRU_scan_depth
– Scan each buffer pool at this amount
– Configuable: innodb_LRU_scan_depth
LRU List Batch Flush
1/31/2017 101

1/31/2017 102
buf_flush_
LRU_tail()
buf_flush
_LRU()
buf_flush
_batch()
buf_do_LRU
_batch()
buf_flush
_LRU_list
_batch()
buf_LRU_free
_page()
fil_io()
fil_flush()
buf_page_io_
complete()
buf_flush_p
age_and_try
_neighbors()
buf_flush
_page()
buf_flush_write
_block_low()
buf_dblwr
_add_to_
batch()
buf_dblwr_
flush_
buffered_
writes()

1/31/2017 103
Per buffer pool instance

1/31/2017 104
Chunk size = 100

• buf/buf0flu.cc: buf_flush_LRU()
1/31/2017 105
...
Batch LRU flush

• buf/buf0flu.cc: buf_flush_batch()
1/31/2017 106
Do LRU batch

• buf/buf0flu.cc: buf_do_LRU_batch()
1/31/2017 107

• buf/buf0flu.cc: buf_flush_LRU_list_batch()
1/31/2017 108
Get the last page
from LRU

1/31/2017 109

• buf/buf0flu.cc: buf_flush_ready_for_replace()
1/31/2017 110
Check whether current
page is clean page or not

1/31/2017 111
It there is any replaceable page,
free the page

1/31/2017 112
Else, try to flush neighbor pages

• buf/buf0flu.cc: buf_flush_page_and_try_neighbors()
1/31/2017 113
Flush page, but no sync

1/31/2017 114
Add the page to the dwb buffer;
See this in dwb part

• Now, victim pages are gathered for replacement
• We need to flush them to disk
• We can do this by calling buf_flush_common()
1/31/2017 115

• buf/buf0flu.cc: buf_flush_LRU()
1/31/2017 116
...

• buf/buf0flu.cc: buf_flush_common()
1/31/2017 117
• flush all pages we gathered so far
• write the pages to dwb area first
• then issue it to datafile
; See this later

DOUBLEWRITE BUFFER
PART 1 : ARCHITECTURE
1/31/2017 118

Double Write Buffer
• To avoid torn page (partial page) written problem
• Write dirty pages to special storage area in system tablespace
priori to write database file
1/31/2017 119
Database
on Flash SSD
Database
Buffer
TailHead D D D
Main LRU List
Free list
Dirty Page Set
D D
Scan LRU List from tail
D
Double Write Buffer
ibdata1-space0
datafiles

DWB Architecture
1/31/2017 120

• include/buf0dblwr.cc: buf_dblwr_t
DWB Struct
1/31/2017 121

• include/buf0dblwr.cc: buf_dblwr_t
DWB Struct
1/31/2017 122

DOUBLEWRITE BUFFER
PART 2 : SINGLE PAGE FLUSH
1/31/2017 123

Single Page Flush
1/31/2017 124

...
• buf/buf0dblwr.cc: buf_dblwr_write_single_page()
Single Page Flush
1/31/2017 125
# of slots for single page flush
= 2 * DOUBLEWRITE_BLOCK_SIZE – BATCH_SIZE
= 128 – 120
= 8

Single Page Flush
1/31/2017 126
If all slots are reserved,
wait until current
dblwr done
Find a free slot

Single Page Flush
1/31/2017 127
Write to datafile
(synchronous)
Sync system tablespace
for dwb area in disk
Write block to
dwb area in disk
(synchronous)

...
• buf/buf0dblwr.cc: buf_dblwr_write_block_to_datafile()
Single Page Flush
1/31/2017 128
Issue write operation
to datafile
In single page flush,
sync = true

Single Page Flush
1/31/2017 129
Sync buffered write to disk;
call fsync by fil_flush()

Single Page Flush
1/31/2017 130
Get io type (In this case, BUF_IO_WRITE)

Single Page Flush
1/31/2017 131
Set io fix to BUF_IO_NONE

Single Page Flush
1/31/2017 132

Single Page Flush
• buf/buf0flu.cc: buf_flush_write_complete()
1/31/2017 133
Remove the block
from the flush list
…

Single Page Flush
• buf/buf0dblwr.cc: buf_dblwr_update()
1/31/2017 134
Free the dwb slot
of the target page

Single Page Flush
• Single page flush is performed in the context of the query
thread itself
• Single page flush mode iterates over the LRU list of a buffer
pool instance, while holding the buffer pool mutex
• It might have trouble in getting a free doublewrite buffer slot
(total 8 slots)
• In result, it makes the overall performance worse
1/31/2017 135

DOUBLEWRITE BUFFER
PART 3 : BATCH FLUSH
1/31/2017 136

Batch Flush
1/31/2017 137

• buf/buf0dblwr.cc: buf_dblwr_add_to_batch()
Batch Flush
1/31/2017 138
If another batch is already
running, wait until done

Batch Flush
1/31/2017 139
If all slots for batch flush
in dwb buffer is reserved,
flush dwb buffer

Batch Flush
1/31/2017 140
After flushing, copy current
block to buf_dblwr->first_free

• buf/buf0dblwr.cc: buf_dblwr_flush_buffered_writes()
Batch Flush
1/31/2017 141

• buf/buf0dblwr.cc: buf_dblwr_sync_datafiles()
Batch Flush
1/31/2017 142
 fil_flush()
 os_file_flush()
 os_file_fsync()
 fsync()

Batch Flush
1/31/2017 143
Change batch running status
Exit mutex
Nobody won’t be
here except me!

Batch Flush
1/31/2017 144
Issue write op for block1
(synchronous)
If current write uses only block1,
then flush

Batch Flush
1/31/2017 145
Issue write op for block2
(synchronous)

...
Batch Flush
1/31/2017 146
Flush (fsync) system table space
Write all blocks to datafile
(asynchronous)

• After submitting aio requests,
– fil_aio_wait()
– buf_page_io_complete()
– buf_flush_write_complete()
– buf_dblwr_update()
Batch Flush
1/31/2017 147

• buf/buf0dblwr.cc: buf_dblwr_update()
Batch Flush
1/31/2017 148
Flush datafile
Reset dwb

InnoDB Synchronization
• InnoDB implements its own mutexes & RW-locks for buffer
management
• Latch in InnoDB
– A lightweight structure used by InnoDB to implement a lock
– Typically held for a brief time (milliseconds or microseconds)
– A general term that includes both mutexes (for exclusive access) and
rw-locks (for shared access)
1/31/2017 150

Mutex in InnoDB
• The low-level object to represent and enforce exclusive-
access locks to internal in-memory data structures
• Once the lock is acquired, any other process, thread, and so
on is prevented from acquiring the same lock
1/31/2017 151

Mutex in InnoDB
1/31/2017 152
• Example code in InnoDB
– buf/buf0buf.cc: buf_wait_for_read()
Get current IO fix of the block

Mutex in InnoDB
1/31/2017 153
• Example code in InnoDB
– buf/buf0flu.cc: buf_flush_batch()
Flush the pages in
the buffer pool

RW-lock in InnoDB
• The low-level object to represent and enforce shared-access
locks to internal in-memory data structures
• RW-lock includes three types of locks
– S-locks (shared locks)
– X-locks (exclusive locks)
– SX-locks (shared-exclusive locks)
1/31/2017 154
S SX X
S Compatible Compatible Conflict
SX Compatible Conflict Conflict
X Conflict Conflict Conflict

RW-lock in InnoDB
• S-lock (Shared-lock)
– provides read access to a common resource
• X-lock (eXclusive-lock)
– provides write access to a common resource
– while not permitting inconsistent reads by other threads
• SX-lock (Shared-eXclusive lock)
– provides write access to a common resource
– while permitting inconsistent reads by other threads
– introduced in MySQL 5.7 to optimize concurrency and improve
scalability for read-write workloads.
1/31/2017 155

RW-lock in InnoDB
1/31/2017 156
• Example code in InnoDB (S-lock)
– buf/buf0buf.cc: buf_page_get_gen()
…
Search hash table

RW-lock in InnoDB
1/31/2017 157
• Example code in InnoDB (X-lock)
– buf/buf0buf.cc: buf_page_init_for_read()
…
Insert a page into
the hash table

MySQL Buffer Management

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to MySQL Buffer Management

Similar to MySQL Buffer Management (14)

Recently uploaded

Recently uploaded (20)

MySQL Buffer Management