3. The InnoDB Engine
Introduction to InnoDB
• Currently the default in MySQL (as of 5.5)
• Referential/Structural Integrity
• Consistent data
• Transactional
4. InnoDB Anatomy – ACID Compliance
ATOMICITY
InnoDB is atomic in that its transactions have only two possible outcomes: complete fully, or fail completely.
• SUCCESS: All changes committed.
• FAILURE: All changes rolled back.
[Diagram: COMMIT leads to "Changes Applied"; ROLLBACK leads to "Unchanged Data"]
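The two outcomes can be seen in a small runnable sketch. It uses Python's built-in sqlite3 module as a stand-in for InnoDB, an assumption made only so the example runs without a MySQL server. The accounts table and the make_accounts, transfer, and balances helpers are hypothetical names.

```python
import sqlite3

def make_accounts():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    """Move amount between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on normal exit, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        return True   # SUCCESS: all changes committed
    except ValueError:
        return False  # FAILURE: all changes rolled back

def balances(conn):
    """Return the balances ordered by account id."""
    return [row[0] for row in conn.execute("SELECT balance FROM accounts ORDER BY id")]
```

A transfer that would overdraw the source account leaves both balances exactly as they were, which is the "Unchanged Data" branch of the diagram.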
5. Introduction to InnoDB Structure
CONSISTENCY
• Data stays consistent before, during, and after a transaction.
• No conflict of “versions”
• Successful transactions end with a commit. InnoDB is optimistic: it expects transactions to commit rather than roll back.
[Diagram: Valid state → Work performed → Still a valid state]
6. Introduction to InnoDB Structure
ISOLATION
• Concurrent transactions are kept from interfering with each other.
• Adjustable level of isolation.*
• Row-level locking
* The isolation level is set with the transaction-isolation configuration option.
7. Introduction to InnoDB Structure
DURABILITY
• Atomic transactions keep data durable.
• Changes are permanent once committed.
• Doublewrite buffer helps to recover from crashes that occur during page writes.
8. InnoDB Anatomy
The Goal
Explore the InnoDB structure within the file system, and at its lower levels, to find out how it can affect database operations.
19. InnoDB Anatomy – System Tablespace
How can you use this?
Separate your Undo Logs (5.6+ only)
innodb_undo_logs; innodb_undo_tablespaces; innodb_undo_directory
I/O to the undo logs is random, rather than sequential as in some other areas of InnoDB. Because of this, it makes sense to move your undo log tablespaces out of the system tablespace and onto a disk that handles random reads and writes more effectively, such as an SSD.
Use the Information Schema, or innodb_table_monitor, to view the Data Dictionary table data.
In 5.6, the information_schema database contains INNODB_SYS* tables that allow you to view data dictionary information directly in MySQL. Alternatively, you can create a table called “innodb_table_monitor” to dump the data dictionary into the MySQL error log.
20. InnoDB Anatomy – System Tablespace
How can you use this?
Do you need the Doublewrite buffer?
innodb_doublewrite
With the Doublewrite buffer enabled, there is a 5-10% impact on I/O. If you operate on a transactional file system, you can disable it to avoid this impact.
Customizing Change Buffering for your Workload
innodb_change_buffering
Change buffering, by default in 5.5+, encompasses insert, delete-marking, and purge buffering. If your workload consists almost entirely of one type of change, it can make sense to limit buffering to that type only.
21. InnoDB in Memory and on Disk
[Diagram: InnoDB components in memory and on disk]
Memory: Buffer Pool, Insert Buffer, Log Buffer, Additional Memory
Disk: System Tablespace, Doublewrite Buffer, Transaction Log Files, Insert Buffer, Undo Logs, Rollback Segment, Data Dictionary, Tablespace Files
Other labels in the diagram: Undo Buffering, Indexing, Thread Processing
22. InnoDB Anatomy – Pages
Page Headers/Trailers

FIL Header (38)
Name                              Byte Length  Offset  Description
FIL_PAGE_SPACE_OR_CHKSUM          4            0       Page checksum (held the space ID before 4.0.14)
FIL_PAGE_OFFSET                   4            4       Page Number
FIL_PAGE_PREV                     4            8       Previous Page (in key order)
FIL_PAGE_NEXT                     4            12      Next Page (in key order)
FIL_PAGE_LSN                      8            16      LSN of page’s latest log record
FIL_PAGE_TYPE                     2            24      Page Type
FIL_PAGE_FILE_FLUSH_LSN           8            26      Flushed-up-to LSN (only in space ID 0, page 0)
FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID  4            34      Space ID (in old versions: latest archived log number, only in space ID 0, page 0)

FIL Trailer (8)
Name                              Byte Length  Offset  Description
FIL_PAGE_END_LSN_OLD_CHKSUM       8            16376   First 4 bytes: old-style checksum; last 4 bytes: low 32 bits of FIL_PAGE_LSN

storage/innobase/include/fil0fil.h
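The header and trailer layout can be read with fixed-width, big-endian unpacking. The following is a minimal sketch, assuming the default 16384-byte page size. It also assumes the MySQL 4.1+ meaning of the fields, where offset 0 holds the page checksum and offset 34 holds the space ID. The names FilHeader, FilTrailer, and parse_fil are hypothetical.

```python
import struct
from collections import namedtuple

PAGE_SIZE = 16384                          # default page size (assumption)
FIL_HEADER = struct.Struct(">IIIIQHQI")    # 4+4+4+4+8+2+8+4 = 38 bytes, big-endian
FIL_TRAILER = struct.Struct(">II")         # 8 bytes
FIL_TRAILER_OFFSET = PAGE_SIZE - FIL_TRAILER.size  # 16376

FilHeader = namedtuple(
    "FilHeader",
    "checksum page_number prev_page next_page lsn page_type flush_lsn space_id")
FilTrailer = namedtuple("FilTrailer", "old_checksum lsn_low32")

def parse_fil(page: bytes):
    """Return (FilHeader, FilTrailer) decoded from one full page image."""
    if len(page) != PAGE_SIZE:
        raise ValueError("expected exactly one %d-byte page" % PAGE_SIZE)
    header = FilHeader(*FIL_HEADER.unpack_from(page, 0))
    trailer = FilTrailer(*FIL_TRAILER.unpack_from(page, FIL_TRAILER_OFFSET))
    return header, trailer
```

On an intact page, the trailer's lsn_low32 should equal the low 32 bits of the header's lsn, which gives a quick check for a torn page.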
23. InnoDB Anatomy - Demonstration
How can you use this?
Changing values directly
At the byte level, these values can be changed directly in many situations to “trick” InnoDB in one way or another. One good example is getting around a page checksum failure: you can change the stored checksum to match the calculated checksum, which avoids the crash and often gives you enough access to recover your records.
InnoDB: Page checksum 2047964429, prior-to-4.0.14-form checksum 4196043695
InnoDB: stored checksum 1873408413, prior-to-4.0.14-form stored checksum 1946395024
# printf '%X\n' 2047964429; printf '%X\n' 4196043695
7A11750D    (primary calculated checksum)
FA1A8BAF    (“old-style” calculated checksum)
# expr 16384 \* 6
98304       (starting byte offset for example page 6)
Writing the primary calculated checksum over the stored value of page 6 (first 4 bytes of the FIL header):
# printf '\x7A\x11\x75\x0D' | dd of=table.ibd bs=1 seek=98304 count=4 conv=notrunc
Writing the “old-style” calculated checksum over the stored value of page 6 (first 4 bytes of the FIL trailer, at 98304 + 16376 = 114680):
# printf '\xFA\x1A\x8B\xAF' | dd of=table.ibd bs=1 seek=114680 count=4 conv=notrunc
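The same byte-level patch can be sketched in Python. This assumes the default 16384-byte page size, and that the primary checksum sits in the first 4 bytes of the page while the old-style checksum sits in the first 4 bytes of the FIL trailer at offset 16376. The function name patch_stored_checksums is hypothetical.

```python
import struct

PAGE_SIZE = 16384            # default page size (assumption)
FIL_TRAILER_OFFSET = 16376   # start of the 8-byte FIL trailer within a page

def patch_stored_checksums(path, page_no, new_checksum, old_style_checksum):
    """Overwrite both stored checksums of one page with the calculated values.

    Intended for a copy of the .ibd file taken while the server is stopped.
    Returns the (header_offset, trailer_offset) that were written.
    """
    header_offset = PAGE_SIZE * page_no
    trailer_offset = header_offset + FIL_TRAILER_OFFSET
    with open(path, "r+b") as f:  # r+b: modify in place, never truncate
        f.seek(header_offset)
        f.write(struct.pack(">I", new_checksum))        # 4 bytes, big-endian
        f.seek(trailer_offset)
        f.write(struct.pack(">I", old_style_checksum))  # 4 bytes, big-endian
    return header_offset, trailer_offset
```

For the values in the error log above, the call would be patch_stored_checksums("table.ibd", 6, 2047964429, 4196043695).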
24. InnoDB Anatomy – Redo Logs
Structure
• Stored in 2 files by default (ib_logfile0/1)
• Treated as single file
• Circular buffer
• Log blocks are 512 bytes
• Each block contains checkpoint data
[Diagram: “The Redo Logs” (ib_logfile0, ib_logfile1) shown as one “Logical Log File” made of repeating log blocks; each LOG BLOCK is Header (12), Log Records, Trailer (4)]
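The block layout leaves 512 - 12 - 4 = 496 bytes per block for log records. A minimal sketch of slicing raw bytes into blocks follows. It assumes the input begins on a block boundary and does not interpret the header or trailer fields. The name split_log_blocks is hypothetical.

```python
LOG_BLOCK_SIZE = 512     # bytes per log block
LOG_BLOCK_HEADER = 12    # header bytes at the start of each block
LOG_BLOCK_TRAILER = 4    # trailer bytes at the end of each block

def split_log_blocks(data: bytes):
    """Yield (header, records, trailer) for each complete 512-byte block."""
    records_end = LOG_BLOCK_SIZE - LOG_BLOCK_TRAILER
    for start in range(0, len(data) - LOG_BLOCK_SIZE + 1, LOG_BLOCK_SIZE):
        block = data[start:start + LOG_BLOCK_SIZE]
        yield (block[:LOG_BLOCK_HEADER],
               block[LOG_BLOCK_HEADER:records_end],
               block[records_end:])
```

Any trailing bytes that do not fill a whole block are ignored rather than returned as a short block.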
25. InnoDB Anatomy – Redo Logs
How can you use this?
Optimized log file size
innodb_log_file_size
A larger size means less checkpoint flushing is required, reducing I/O impact. Balance this against the longer crash recovery time that larger logs bring (less of an issue in 5.6).
General Formula: (LSN 60 seconds later - Current LSN) * 60 / 1024 / 1024 = MB of redo written per hour
Optimized log buffer size
innodb_log_buffer_size
The log buffer lets a running transaction move forward without writing its log records to disk before commit. A larger buffer lets larger transactions run without requiring log writes to disk before the commit is performed.
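The general formula can be worked through in a few lines. This is a sketch: the two LSN values are assumed to be the "Log sequence number" readings from SHOW ENGINE INNODB STATUS taken 60 seconds apart. Sizing the log group to hold about one hour of redo, split evenly across the files in the group, is a rule of thumb assumed here, not something the formula itself states. Both function names are hypothetical.

```python
def redo_mb_per_hour(lsn_start, lsn_after_60s):
    """Scale 60 seconds of redo growth (in bytes) to megabytes per hour."""
    return (lsn_after_60s - lsn_start) * 60 / 1024 / 1024

def log_file_size_mb(mb_per_hour, files_in_group=2):
    """Split roughly one hour of redo across the files in the log group (assumption)."""
    return mb_per_hour / files_in_group
```

For example, if the LSN advances by 2,097,152 bytes in a minute, that is 120 MB of redo per hour, or 60 MB per file with the default two log files. Sampling during peak write load gives a more conservative estimate.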
26. InnoDB Anatomy – Index Pages
INDEX Pages - B+Tree Structure
• Efficient method of storing data on disk in a tree format.
• Actual records are stored in leaf pages (level 0).
• The root page sits at the top of the tree structure.
• Non-leaf pages contain only pointers to pages one level below them (other non-leaf pages or leaf pages).
[Diagram: Root at level 2, Non-Leaf pages at level 1, Leaf pages at level 0]
28. InnoDB Anatomy – Index Pages
INDEX Pages
• Records are not physically stored in key order
• User data “grows down”
• Page directory “grows up”
Page layout, top to bottom:
FIL Header (38)
INDEX Header (36)
FSEG Header (20)
System Records (26)
User Data …
EMPTY
… Page Directory
FIL Trailer (8)
29. InnoDB Anatomy – Index Pages
INDEX Page Header (after FIL Header)
Name              Byte Length  Offset   Description
PAGE_N_DIR_SLOTS  2            38 + 0   Number of Slots in Page Directory
PAGE_HEAP_TOP     2            38 + 2   Pointer to Record Heap Top
PAGE_N_HEAP       2            38 + 4   Number of Records in Heap
PAGE_FREE         2            38 + 6   Pointer to start of page’s free-record list
PAGE_GARBAGE      2            38 + 8   Number of bytes in “deleted” records
PAGE_LAST_INSERT  2            38 + 10  Pointer to last inserted record, or NULL if this has been reset, e.g. by a delete
PAGE_DIRECTION    2            38 + 12  Last Insert direction: PAGE_LEFT, PAGE_RIGHT …
PAGE_N_DIRECTION  2            38 + 14  Consecutive inserts in the same direction
PAGE_N_RECS       2            38 + 16  Number of user records on the page
PAGE_MAX_TRX_ID   8            38 + 18  Highest ID of transaction that may have modified a record on the page
PAGE_LEVEL        2            38 + 26  Level of node in index tree
PAGE_INDEX_ID     8            38 + 28  Index ID that page belongs to
storage/innobase/include/page0page.h
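The INDEX header can be decoded in one pass. This is a minimal sketch, assuming big-endian fields and reading PAGE_INDEX_ID as 8 bytes, which is what brings the header to its 36-byte total. The function name parse_index_header and the lower-case field names are hypothetical.

```python
import struct

FIL_HEADER_SIZE = 38
# Nine 2-byte fields, PAGE_MAX_TRX_ID (8), PAGE_LEVEL (2), PAGE_INDEX_ID (8) = 36 bytes.
INDEX_HEADER = struct.Struct(">9HQHQ")
INDEX_FIELDS = (
    "n_dir_slots", "heap_top", "n_heap", "free", "garbage", "last_insert",
    "direction", "n_direction", "n_recs", "max_trx_id", "level", "index_id",
)

def parse_index_header(page: bytes) -> dict:
    """Decode the INDEX header that follows the FIL header of an INDEX page."""
    values = INDEX_HEADER.unpack_from(page, FIL_HEADER_SIZE)
    return dict(zip(INDEX_FIELDS, values))
```

A page whose decoded level is 0 is a leaf page. Any higher value is a non-leaf page, with the root holding the highest level in the tree.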
30. InnoDB Anatomy – Demonstration
Determining page level on an INDEX page
Working directory: /var/lib/mysql/testdb/
First, find your page’s start byte (page 3 here):
# expr 16384 \* 3
49152
The offset of the PAGE_LEVEL value is 26 bytes after the FIL Header (38):
# expr 49152 + 38 + 26
49216
The byte-length is 2:
# xxd -ps -s 49216 -l 2 customer.ibd
0001
Page Level: 1
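The same lookup can be done without shell arithmetic. This is a sketch, assuming the default 16384-byte page size and a big-endian 2-byte PAGE_LEVEL located 26 bytes after the 38-byte FIL header. The function name read_page_level is hypothetical.

```python
import struct

def read_page_level(path, page_no, page_size=16384):
    """Return (byte_offset, level) for PAGE_LEVEL of the given INDEX page."""
    offset = page_size * page_no + 38 + 26   # page start + FIL header + field offset
    with open(path, "rb") as f:
        f.seek(offset)
        raw = f.read(2)
    if len(raw) != 2:
        raise ValueError("file is too short to contain that page")
    (level,) = struct.unpack(">H", raw)
    return offset, level
```

For the page in the demonstration, read_page_level("customer.ibd", 3) would read the two bytes at offset 49216.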
32. Conclusion
Sources and Thanks
• Jeremy Cole
  • http://blog.jcole.us/
  • https://github.com/jeremycole/
• Percona
  • http://www.percona.com/files/percona-live/justin-innodb-internals.pdf
• MySQL Internals Documentation & Source
  • http://dev.mysql.com/doc/internals/en/innodb.html
  • https://launchpad.net/mysql
Editor's Notes
- Referential integrity = ensuring validity via adherence to constraints and restrictions
Consistent data = enforced via checksum matching, by default.
Transactional = grouping series of operations into a single, logical, atomic unit of work.
Core InnoDB Concepts after this
Transaction/Redo terms interchangeable when referring to the ib_logfiles.
Tablespaces: Divide the data; InnoDB’s way of holding data for individual tables
Pages: 16K data sections
Extents: Units of allocation, stores groups of data (pages). Extents hold up to 64 pages each.
Segments: Divisions of data within the tablespaces
Inodes: Contain attributes and pointers to other sections of data.
Tablespace: At least 3 initial header pages, each holding the standard page structure. Headers contain values about what to expect from the tablespace.
Page:
Segment: Acts as a division of the tablespace; a logical group of extents.
Additional log files can be used and/or relocated.
Currently only one “group” supported
Reliance on the log file allows InnoDB to delay flushes/writes to disk.
Undo logs are composed of rollback segments (128), each able to support 1023 transactions; in total they can support up to 128K concurrent transactions (128 × 1023 = 130,944), increased from the pre-5.5 value of just a single segment of 1023 transactions.
Undo logs can be split off to be handled elsewhere, such as on an SSD, for optimal performance.
ZFS is an example of a transactional file system; makes sure writes are atomic
Log buffer flushed once per second
MySQL 5.6 adjusts for log file size changes performed while offline.
Page structure does not apply here (blocks of 512 bytes instead).
Total size of logical log file can be up to 512GB (file size * files in group)
Allows for a consistent, determinable amount of reads to access any record in an index.
Root page is a “starting point” for accessing the tree, tells it where to look and how far it will need to go (root page’s level indicates how far down it takes to get to 0, hence the bottom->top numbering)
Tree can be as small as a single root page, or as big as millions of pages in a multi-level structure
Everything is an index in InnoDB
This is how all user records are stored
Not stored physically in order, but use pointers
In a single 4-level index tree, you have the potential for 814 billion rows/25.9TiB of data.
“Non-leaf Nodes” also referred to as “Internal Nodes”
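The 4-level figure can be reproduced with a short calculation. The per-page record counts below (1203 pointers per non-leaf page, 468 rows per leaf page) are assumptions chosen because they reproduce the quoted numbers; they are not stated in these notes, and real counts depend on key and row size. The function name tree_capacity is hypothetical.

```python
PAGE_SIZE = 16384             # 16K pages
RECORDS_PER_NON_LEAF = 1203   # assumed child pointers per non-leaf page
RECORDS_PER_LEAF = 468        # assumed user rows per leaf page

def tree_capacity(levels):
    """Return (leaf_pages, rows, leaf_data_tib) for a full tree of the given height."""
    leaf_pages = RECORDS_PER_NON_LEAF ** (levels - 1)  # fan-out at each non-leaf level
    rows = leaf_pages * RECORDS_PER_LEAF
    leaf_data_tib = leaf_pages * PAGE_SIZE / 2**40     # leaf pages only
    return leaf_pages, rows, leaf_data_tib
```

With these assumptions, tree_capacity(4) gives about 814.8 billion rows in roughly 25.9 TiB of leaf pages, and tree_capacity(1) is a single root page that is also the only leaf.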