<Insert Picture Here>

The InnoDB Storage Engine for MySQL

Morgan Tocker, MySQL Community Manager

http://www.tocker.ca/

Safe Harbor Statement
The	
  following	
  is	
  intended	
  to	
  outline	
  our	
  general	
  product	
  direction.	
  
It	
  is	
  intended	
  for	
  information	
  purposes	
  only,	
  and	
  may	
  not	
  be	
  
incorporated	
  into	
  any	
  contract.	
  It	
  is	
  not	
  a	
  commitment	
  to	
  deliver	
  
any	
  material,	
  code,	
  or	
  functionality,	
  and	
  should	
  not	
  be	
  relied	
  upon	
  
in	
  making	
  purchasing	
  decisions.	
  



The	
  development,	
  release,	
  and	
  timing	
  of	
  any	
  features	
  or	
  
functionality	
  described	
  for	
  Oracle’s	
  products	
  remains	
  at	
  the	
  sole	
  
discretion	
  of	
  Oracle.
4 Years of MySQL Innovation
MySQL Cluster 7.3

MySQL Workbench 6.0

MySQL Migration Wizard
<Insert Picture Here>

MySQL 5.5

MySQL 5.6

Windows installer & Tools MySQL 5.7 M y S Q L
Cluster
MySQL
MySQL Enterprise Monitor 2.3 & e r
M a n a g 3.0
Applier for
Hadoop

MySQL Enterprise Backup
Security
MySQL Utilities
MySQL Workbench 5.2 & 6.0
Scalability
MySQL Cluster 7.2
HA
MySQL Enterprise
MySQL Cluster 7.1
O r a c l e C e r t i f i c a t i o n s Audit
Hello and Welcome!
• I will be talking about InnoDB’s internal behaviour.
• Not talking (much) about MySQL.
• Aim of this talk is to give you X-ray vision.
• i.e. not so many direct takeaways, but one day it
will help you debug a problem.
Prerequisites

Copyright	
  ©	
  2012	
  Oracle	
  and/or	
  its	
  affiliates.	
  All	
  rights	
  reserved.
MySQL Architecture
IO Performance
L1 cache reference
Branch mispredict
L2 cache reference
Mutex lock/unlock
Main memory reference
Compress 1K bytes with Zippy
Send 2K bytes over 1 Gbps network
Read 1 MB sequentially from memory
Round trip within same datacenter
Disk seek
Read 1 MB sequentially from disk
Send packet CA->Netherlands->CA

0.5 ns!
5 ns!
7 ns!
25 ns!
100 ns!
3,000 ns!
20,000 ns!
250,000 ns!
500,000 ns!
10,000,000 ns!
20,000,000 ns!
150,000,000 ns

See: http://www.linux-mag.com/cache/7589/1.html and Google http://
www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
IO Performance (cont.)
• 5-10ms per disk IO.
• Maybe 50us for a high end SSD.
• Still not “memory speed”.
Buffered IO
• Operating Systems compensate well already.
• Reads are cached with free memory.
• Writes don’t happen instantly.
• A step is introduced to rewrite and merge.
Block 9, 10, 1, 4, 200, 5.

Block 1, 4, 5, 9, 10, 200
fsync

Synopsis
#include <unistd.h>
int fsync(int fd);
int fdatasync(int fd);

!

Description
fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by
the file descriptor fd to the disk device (or other permanent storage device) where that file resides. The call blocks
until the device reports that the transfer has completed. It also flushes metadata information associated with the file
(see stat(2)).
Basic	
  Operation

Copyright	
  ©	
  2012	
  Oracle	
  and/or	
  its	
  affiliates.	
  All	
  rights	
  reserved.
Buffer Pool

Data Dictionary Cache

Buffer Pool Flush List

Adaptive Hash Indexes

Log Group
ibdata1
space 0

Storage

Buffer Pool LRU

Page Cache

Additional Mem Pool

Log Buffer

Transaction
System

Caching

InnoDB High Level Overview

IBUF_HEADER
IBUF_TREE
TRX_SYS
FIRST_RSEG
DICT_HDR

iblogfile0

iblogfile1

Doublewrite Buffer

Data Dict.

Block 1 (64 pages)
Block 2 (64 pages)

SYS_TABLES
SYS_COLUMNS
SYS_INDEXES
SYS_FIELDS

iblogfile2

Tables with
file_per_table
A.ibd
B.ibd
C.ibd

In Memory
On Disk
Query (pages not in buffer pool)
mysqld

SELECT * FROM a
WHERE id = 10;

Buffer Pool

Buffer Pool LRU

Adaptive Hash Indexes

Data Dictionary Cache

Buffer Pool Flush List

Page Cache

Log Group
ibdata1
space 0

Storage

InnoDB
Additional Mem Pool

Log Buffer

Transaction
System

Caching

Not Found

IBUF_HEADER
IBUF_TREE
TRX_SYS
FIRST_RSEG
DICT_HDR

iblogfile0

iblogfile1

Doublewrite Buffer

Data Dict.

Block 1 (64 pages)
Block 2 (64 pages)

SYS_TABLES
SYS_COLUMNS
SYS_INDEXES
SYS_FIELDS

iblogfile2

Tables with
file_per_table
A.ibd
B.ibd
C.ibd
Query (pages in buffer pool)
mysqld

SELECT * FROM a
WHERE id = 10;

Buffer Pool

Data Dictionary Cache

Buffer Pool Flush List

Adaptive Hash Indexes

Log Group
ibdata1
space 0

Storage

Buffer Pool LRU

Page Cache

Additional Mem Pool

Log Buffer

Transaction
System

Caching

InnoDB

IBUF_HEADER
IBUF_TREE
TRX_SYS
FIRST_RSEG
DICT_HDR

iblogfile0

iblogfile1

Doublewrite Buffer

Data Dict.

Block 1 (64 pages)
Block 2 (64 pages)

SYS_TABLES
SYS_COLUMNS
SYS_INDEXES
SYS_FIELDS

iblogfile2

Tables with
file_per_table
A.ibd
B.ibd
C.ibd
Update Query in a Transaction (simplified)
mysqld

UPDATE a SET col1 =
‘new’ WHERE id = 10;
commit;

Buffer Pool

Data Dictionary Cache

Buffer Pool Flush List

Adaptive Hash Indexes

Log Group
ibdata1
space 0

Storage

Buffer Pool LRU

Page Cache

Additional Mem Pool

Log Buffer

Transaction
System

Caching

InnoDB

IBUF_HEADER
IBUF_TREE
TRX_SYS
FIRST_RSEG
DICT_HDR

iblogfile0

iblogfile1

Doublewrite Buffer

Data Dict.

Block 1 (64 pages)
Block 2 (64 pages)

SYS_TABLES
SYS_COLUMNS
SYS_INDEXES
SYS_FIELDS

iblogfile2

Tables with
file_per_table
A.ibd
B.ibd
C.ibd
Log files
• Provide recovery.
• Only written to in regular operation.
• Read only required if there is a crash.
• Are rewritten over-and-over again.
• Think of it like a tank tread.
Log files (cont.)
• Are an optimization!
• 512B aligned sequential writes.
• Tablespace writes are 16KiB random writes.
• Tablespace writes to same pages in close time
window can be merged.
• Just need a large enough log file.
Checkpoint (Background Activity)
mysqld

(nothing)

Buffer Pool

Data Dictionary Cache

Buffer Pool Flush List

Adaptive Hash Indexes

Log Group
ibdata1
space 0

Storage

Buffer Pool LRU

Page Cache

Additional Mem Pool

Log Buffer

Transaction
System

Caching

InnoDB

IBUF_HEADER
IBUF_TREE
TRX_SYS
FIRST_RSEG
DICT_HDR

iblogfile0

iblogfile1

Doublewrite Buffer

Data Dict.

Block 1 (64 pages)
Block 2 (64 pages)

SYS_TABLES
SYS_COLUMNS
SYS_INDEXES
SYS_FIELDS

iblogfile2

Tables with
file_per_table
A.ibd
B.ibd
C.ibd
FAQ
• Q: What do we write to the log - is it committed data
only, or can we write uncommitted data as well?
• A: Both.
FAQ
• Q: How do you unapply transactions?
• A: UNDO space.

Think of it like a hidden table internally stored in
ibdata1.
Update Query (More Accurate*)
mysqld

UPDATE a SET col1 =
‘new’ WHERE id = 10;
commit;

Page Cache

Buffer Pool

Data Dictionary Cache

Buffer Pool Flush List

Adaptive Hash Indexes

Additional Mem Pool

Redirect older version
Update any indexes
Modify row in place
of row to undo space
to hold both versions.
Log Group

ibdata1
space 0

Storage

Buffer Pool LRU

Page Cache

Log Buffer

Transaction
System

Caching

InnoDB

IBUF_HEADER
IBUF_TREE
TRX_SYS
FIRST_RSEG
DICT_HDR

iblogfile0

iblogfile1

Doublewrite Buffer

Data Dict.

Block 1 (64 pages)
Block 2 (64 pages)

SYS_TABLES
SYS_COLUMNS
SYS_INDEXES
SYS_FIELDS

iblogfile2

Tables with
file_per_table
A.ibd
B.ibd
C.ibd
Update Query (cont.)
• Background purge process is able to clean old rows
from UNDO as soon as oldest transaction advances
forward.
Summarized Performance
Characteristics
• Log Files:
• Are short sequential writes.
• They permit InnoDB to delay tablespace writes enabling more merging/optimization.
• Buffer Pool:
• “In memory version of the tablespace”.
• Loading/unloading via modified LRU algorithm.
Index Structure
• Indexes and “data” in InnoDB are B+Trees.
• Clustered Index design means that data itself is
stored in an index.
Index Structure (cont.)

Empty root
Infimum
Level 0
Root

Page 3
Next
Record

Supremum
Index Structure (cont.)

Insert: 1
Infimum
Level 0
Root

Page size is 16KB
B* is 2KB

Page 3

1
B*

Supremum
Index Structure (cont.)

Insert: 1 to 7
I
Level 0
Root

S

Page 3

1
B*

2
B*

3
B*

4
B*

5
B*

6
B*

7
B*
Index Structure (cont.)
Insert: 8
Infimum

Page 3

Level 1
Root

1
4

I
Level 0
Leaf

Allocate new page and link in root
Move records to new page
Split new page

Supremum

S

Page 4

1
B*

2
B*

3
B*

4
B*

5
B*

6
B*

7
B*
Index Structure (cont.)
Insert: 8 (Cont.)
Infimum
Level 1
Root

1
4

I
Level 0
Leaf

S

Page 4

1
B*

Split at the middle of original page

2
B*

3
B*

Supremum

Page 3

4
5

I

S

Page 5

4
B*

5
B*

6
B*

7
B*
Index Structure (cont.)
Insert: 9 and 10
Infimum
Level 1
Root

1
4

I
Level 0
Leaf

S

Page 4

1
B*

2
B*

3
B*

Supremum

Page 3

4
5

I

S

Page 5

4
B*

5
B*

6
B*

7
B*

8
B*

9
B*

10
B*
Index Structure (cont.)
Insert: 11
Infimum
Level 1
Root

4
5

1
4

I
Level 0
Leaf

Page 3

S

Page 4

I

Supremum
11
6

S

Page 5

I

S
Page 6

1
B*

2
B*

3
B*

Insert leads to a split at the insertion point

4
B*

5
B*

6
B*

7
B*

8
B*

9
B*

10
B*

11
B*
Index Structure
Level 2
Root

Infimum

Level 1
Internal

Page 6

0
A

1
B

S

Page 4

≥0
→
6

I
Level 0
Leaf

Next
Record

≥0
→
4

I

S

Supremum

Page 3

≥4
→
5

I

Next Page

≥2
→
7

≥4
→
8

Prev Page

I

Page 7

2
C

3
D

S

I

Page 8

4
E

5
F

S

Page 5

S

≥6
→
9

I

Page 9

6
G

7
H

S
Page Format

0
38

FIL Header (38)
Other headers and page data,
depending on page type.
Total usable space: 16,338 bytes.

16376
16384

FIL Trailer (8)
Row Format

N-5

N-4

N-2
N
N+k
N+k+6
N+k+13
N+k+13+j

Variable field lengths (1-2 bytes per var. field)
Info Flags (4 bits)
Number of Records Owned (4 bits)
Order (13 bits)
Record Type (3 bits)
Next Record Offset (2)
Cluster Key Fields (k)
Transaction ID (6)
Roll Pointer (7)
Non-Key Fields (j)
Conclusion
• Page is basic unit of storage.
• Default is 16KiB
• Rows of variable length.
Two more useful features
• Adaptive hash - Partial hash index to accelerate
secondary key lookups.
• Change buffering - when non-unique indexes are
not in memory, changes can be temporarily buffered
until they are.
Query (by secondary key)
mysqld

SELECT * FROM a
WHERE b_key = 10;

Buffer Pool

Data Dictionary Cache

Buffer Pool Flush List

Adaptive Hash Indexes

Log Group
ibdata1
space 0

Storage

Buffer Pool LRU

Page Cache

Additional Mem Pool

Log Buffer

Transaction
System

Caching

InnoDB

IBUF_HEADER
IBUF_TREE
TRX_SYS
FIRST_RSEG
DICT_HDR

iblogfile0

iblogfile1

Doublewrite Buffer

Data Dict.

Block 1 (64 pages)
Block 2 (64 pages)

SYS_TABLES
SYS_COLUMNS
SYS_INDEXES
SYS_FIELDS

iblogfile2

Tables with
file_per_table
A.ibd
B.ibd
C.ibd
Update Query (large table)
mysqld

UPDATE a SET col1 =
‘new’ WHERE id = 10;
commit;

Buffer Pool

Caching

InnoDB
Buffer Pool LRU

Adaptive Hash Indexes

Data Dictionary Cache

Buffer Pool Flush List

Page Cache

Additional Mem Pool

Not Required
Log Group
ibdata1
space 0

Storage

Transaction
System

Log Buffer

IBUF_HEADER
IBUF_TREE
TRX_SYS
FIRST_RSEG
DICT_HDR

iblogfile0

iblogfile1

Doublewrite Buffer

Data Dict.

Block 1 (64 pages)
Block 2 (64 pages)

SYS_TABLES
SYS_COLUMNS
SYS_INDEXES
SYS_FIELDS

iblogfile2

Tables with
file_per_table
A.ibd
B.ibd
C.ibd
New	
  Features

Copyright	
  ©	
  2012	
  Oracle	
  and/or	
  its	
  affiliates.	
  All	
  rights	
  reserved.
MySQL 5.5+
•
•
•
•
•
•
•
•
•

IO Scalability
Async IO
Multiple Buffer Pools
Adaptive Flushing
Scan Resistant LRU
Compressed Pages
CPU Scalability
Improved Atomics
Spin Loops with PAUSE
MySQL 5.6+
• LRU Dump and
Restore
• Improved Group
Commit
• Fulltext Search
• Fast Read-only
Transactions
• Memcached Interface
• Information Schema
metadata tables

•
•
•
•

Persistent Statistics
Variable Page Size
Online DDL
Transportable
Tablespace
• Transactional
Replication

Using InnoDB
MySQL 5.7+
• Faster Temporary Tables
• Index Lock Contention Reduction
• More Online DDL
• Extend VARCHAR
• Rename Index
• Improved Read-Only Transactions
• Improved CPU Scalability
Configuration

Copyright	
  ©	
  2012	
  Oracle	
  and/or	
  its	
  affiliates.	
  All	
  rights	
  reserved.
The Top 3
1. innodb-buffer-pool-size
2. innodb-log-file-size
3. innodb_flush_log_at_trx_commit
innodb-buffer-pool-size
• Really only one major buffer/cache settings to set.
• Responsible for all pages types (data, indexes, undo,
insert buffer..)
innodb-buffer-pool-size (cont.)
• Recommendation is 50-80% of RAM.
• Default is 128M of RAM.
• Please allow 5-10% on top for other meta data to
grow.
innodb-log-file-size
• Log files are on disk, but this contributes to how
many unflushed (dirty) pages you can hold in
memory.
• In theory larger log files = longer crash recovery.
• In MySQL 5.5 -2G max.
• In MySQL 5.6 - 4G is usually safe.
• Early versions should be much smaller.
• Default is 48M Log Files.
innodb_flush_log_at_trx_commit
• Default is full ACID Compliance (=1)
• Can be set to 0/2 if you do not mind some data loss.
innodb_flush_log_at_trx_commit
• 0 = Log buffer written + synced once per second.
Nothing done at commit.
• 1 = Log buffer written + synced once per second +
written and synced on commit.
• 2 = Log buffer written + synced once per second +
written (not synced) on commit.
!
2 is a slightly safer version of 0.
Requires about 5-10% of buffer pool
size as overhead (not directly
configurable).

Basic Configuration

Buffer Pool

InnoDB
Buffer Pool LRU

Adaptive Hash Indexes

Additional Mem Pool

innodb_log_file_size. Typical
values 256M+. Default of 2 files
(innodb_log_files_in_group).
Log Group

iblogfile0

ibdata1
space 0

innodb_log_buffer_size. Typical
Doublewrite Buffer
values 1-8M. Flushing
IBUF_HEADER
IBUF_TREE
Block 1 (64 pages)
behaviour influenced by
TRX_SYS
Block 2 (64 pages)
FIRST_RSEG
innodb_flush_log_at_trx_commit.
DICT_HDR
Storage

Data Dictionary Cache

Buffer Pool Flush List

Page Cache

Log Buffer

Transaction
System

Caching

Innodb_buffer_pool_size.
Recommendation is 50-80% RAM.

iblogfile1

iblogfile2

Tables with
file_per_table

Data Dict.
SYS_TABLES
SYS_COLUMNS
SYS_INDEXES
SYS_FIELDS

A.ibd
B.ibd
C.ibd

innodb_file_per_table
(Default: ON in 5.6+)
The ~Top 10
1. innodb-buffer-pool-size
2. innodb-log-file-size
3. innodb-log-buffer-size
4. innodb_flush_log_at_trx_commit
5. innodb_flush_method
6. innodb_flush_neighbors
7. innodb_io_capacity, innodb_io_capacity_max,
innodb_lru_scan_depth
8. innodb-buffer-pool-instances
9. innodb_read_io_threads and innodb_write_io_threads
Less-likely to need configuration
•
•
•
•
•

innodb_thread_concurrency
innodb_concurrency_tickets
innodb_max_pct_dirty_pages
innodb_use_native_aio (always on)
innodb_old_blocks_time (5.6 default: 1000)
Deprecated Settings
• Typically “remove on sight” from config files:
• innodb_additional_mempool_size
• innodb_use_sys_malloc
Credits
• InnoDB Architecture Diagrams via https://github.com/
jeremycole/innodb_diagrams
• Available under (3-clause) BSD license

Copyright (c) 2013, Twitter, Inc.

Copyright (c) 2013, Jeremy Cole <jeremy@jcole.us>

Copyright (c) 2013, Davi Arnaut
<darnaut@gmail.com>
The InnoDB Storage Engine for MySQL

The InnoDB Storage Engine for MySQL

  • 1.
    <Insert Picture Here> TheInnoDB Storage Engine for MySQL
 Morgan Tocker, MySQL Community Manager
 http://www.tocker.ca/

  • 2.
    Safe Harbor Statement The  following  is  intended  to  outline  our  general  product  direction.   It  is  intended  for  information  purposes  only,  and  may  not  be   incorporated  into  any  contract.  It  is  not  a  commitment  to  deliver   any  material,  code,  or  functionality,  and  should  not  be  relied  upon   in  making  purchasing  decisions.   
 The  development,  release,  and  timing  of  any  features  or   functionality  described  for  Oracle’s  products  remains  at  the  sole   discretion  of  Oracle.
  • 3.
    4 Years ofMySQL Innovation MySQL Cluster 7.3 MySQL Workbench 6.0 MySQL Migration Wizard <Insert Picture Here> MySQL 5.5 MySQL 5.6 Windows installer & Tools MySQL 5.7 M y S Q L Cluster MySQL MySQL Enterprise Monitor 2.3 & e r M a n a g 3.0 Applier for Hadoop MySQL Enterprise Backup Security MySQL Utilities MySQL Workbench 5.2 & 6.0 Scalability MySQL Cluster 7.2 HA MySQL Enterprise MySQL Cluster 7.1 O r a c l e C e r t i f i c a t i o n s Audit
  • 4.
    Hello and Welcome! •I will be talking about InnoDB’s internal behaviour. • Not talking (much) about MySQL. • Aim of this talk is to give you X-ray vision. • i.e. not so many direct takeaways, but one day it will help you debug a problem.
  • 5.
    Prerequisites Copyright  ©  2012  Oracle  and/or  its  affiliates.  All  rights  reserved.
  • 6.
  • 7.
    IO Performance L1 cachereference Branch mispredict L2 cache reference Mutex lock/unlock Main memory reference Compress 1K bytes with Zippy Send 2K bytes over 1 Gbps network Read 1 MB sequentially from memory Round trip within same datacenter Disk seek Read 1 MB sequentially from disk Send packet CA->Netherlands->CA 0.5 ns! 5 ns! 7 ns! 25 ns! 100 ns! 3,000 ns! 20,000 ns! 250,000 ns! 500,000 ns! 10,000,000 ns! 20,000,000 ns! 150,000,000 ns See: http://www.linux-mag.com/cache/7589/1.html and Google http:// www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
  • 8.
    IO Performance (cont.) •5-10ms per disk IO. • Maybe 50us for a high end SSD. • Still not “memory speed”.
  • 9.
    Buffered IO • OperatingSystems compensate well already. • Reads are cached with free memory. • Writes don’t happen instantly. • A step is introduced to rewrite and merge. Block 9, 10, 1, 4, 200, 5. Block 1, 4, 5, 9, 10, 200
  • 10.
    fsync Synopsis #include <unistd.h> int fsync(intfd); int fdatasync(int fd); ! Description fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) where that file resides. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)).
  • 11.
    Basic  Operation Copyright  ©  2012  Oracle  and/or  its  affiliates.  All  rights  reserved.
  • 12.
    Buffer Pool Data DictionaryCache Buffer Pool Flush List Adaptive Hash Indexes Log Group ibdata1 space 0 Storage Buffer Pool LRU Page Cache Additional Mem Pool Log Buffer Transaction System Caching InnoDB High Level Overview IBUF_HEADER IBUF_TREE TRX_SYS FIRST_RSEG DICT_HDR iblogfile0 iblogfile1 Doublewrite Buffer Data Dict. Block 1 (64 pages) Block 2 (64 pages) SYS_TABLES SYS_COLUMNS SYS_INDEXES SYS_FIELDS iblogfile2 Tables with file_per_table A.ibd B.ibd C.ibd In Memory On Disk
  • 13.
    Query (pages notin buffer pool) mysqld SELECT * FROM a WHERE id = 10; Buffer Pool Buffer Pool LRU Adaptive Hash Indexes Data Dictionary Cache Buffer Pool Flush List Page Cache Log Group ibdata1 space 0 Storage InnoDB Additional Mem Pool Log Buffer Transaction System Caching Not Found IBUF_HEADER IBUF_TREE TRX_SYS FIRST_RSEG DICT_HDR iblogfile0 iblogfile1 Doublewrite Buffer Data Dict. Block 1 (64 pages) Block 2 (64 pages) SYS_TABLES SYS_COLUMNS SYS_INDEXES SYS_FIELDS iblogfile2 Tables with file_per_table A.ibd B.ibd C.ibd
  • 14.
    Query (pages inbuffer pool) mysqld SELECT * FROM a WHERE id = 10; Buffer Pool Data Dictionary Cache Buffer Pool Flush List Adaptive Hash Indexes Log Group ibdata1 space 0 Storage Buffer Pool LRU Page Cache Additional Mem Pool Log Buffer Transaction System Caching InnoDB IBUF_HEADER IBUF_TREE TRX_SYS FIRST_RSEG DICT_HDR iblogfile0 iblogfile1 Doublewrite Buffer Data Dict. Block 1 (64 pages) Block 2 (64 pages) SYS_TABLES SYS_COLUMNS SYS_INDEXES SYS_FIELDS iblogfile2 Tables with file_per_table A.ibd B.ibd C.ibd
  • 15.
    Update Query ina Transaction (simplified) mysqld UPDATE a SET col1 = ‘new’ WHERE id = 10; commit; Buffer Pool Data Dictionary Cache Buffer Pool Flush List Adaptive Hash Indexes Log Group ibdata1 space 0 Storage Buffer Pool LRU Page Cache Additional Mem Pool Log Buffer Transaction System Caching InnoDB IBUF_HEADER IBUF_TREE TRX_SYS FIRST_RSEG DICT_HDR iblogfile0 iblogfile1 Doublewrite Buffer Data Dict. Block 1 (64 pages) Block 2 (64 pages) SYS_TABLES SYS_COLUMNS SYS_INDEXES SYS_FIELDS iblogfile2 Tables with file_per_table A.ibd B.ibd C.ibd
  • 16.
    Log files • Providerecovery. • Only written to in regular operation. • Read only required if there is a crash. • Are rewritten over-and-over again. • Think of it like a tank tread.
  • 17.
    Log files (cont.) •Are an optimization! • 512B aligned sequential writes. • Tablespace writes are 16KiB random writes. • Tablespace writes to same pages in close time window can be merged. • Just need a large enough log file.
  • 18.
    Checkpoint (Background Activity) mysqld (nothing) BufferPool Data Dictionary Cache Buffer Pool Flush List Adaptive Hash Indexes Log Group ibdata1 space 0 Storage Buffer Pool LRU Page Cache Additional Mem Pool Log Buffer Transaction System Caching InnoDB IBUF_HEADER IBUF_TREE TRX_SYS FIRST_RSEG DICT_HDR iblogfile0 iblogfile1 Doublewrite Buffer Data Dict. Block 1 (64 pages) Block 2 (64 pages) SYS_TABLES SYS_COLUMNS SYS_INDEXES SYS_FIELDS iblogfile2 Tables with file_per_table A.ibd B.ibd C.ibd
  • 19.
    FAQ • Q: Whatdo we write to the log - is it committed data only, or can we write uncommitted data as well? • A: Both.
  • 20.
    FAQ • Q: Howdo you unapply transactions? • A: UNDO space.
 Think of it like a hidden table internally stored in ibdata1.
  • 21.
    Update Query (MoreAccurate*) mysqld UPDATE a SET col1 = ‘new’ WHERE id = 10; commit; Page Cache Buffer Pool Data Dictionary Cache Buffer Pool Flush List Adaptive Hash Indexes Additional Mem Pool Redirect older version Update any indexes Modify row in place of row to undo space to hold both versions. Log Group ibdata1 space 0 Storage Buffer Pool LRU Page Cache Log Buffer Transaction System Caching InnoDB IBUF_HEADER IBUF_TREE TRX_SYS FIRST_RSEG DICT_HDR iblogfile0 iblogfile1 Doublewrite Buffer Data Dict. Block 1 (64 pages) Block 2 (64 pages) SYS_TABLES SYS_COLUMNS SYS_INDEXES SYS_FIELDS iblogfile2 Tables with file_per_table A.ibd B.ibd C.ibd
  • 22.
    Update Query (cont.) •Background purge process is able to clean old rows from UNDO as soon as oldest transaction advances forward.
  • 23.
    Summarized Performance Characteristics • LogFiles: • Are short sequential writes. • They permit InnoDB to delay tablespace writes enabling more merging/optimization. • Buffer Pool: • “In memory version of the tablespace”. • Loading/unloading via modified LRU algorithm.
  • 24.
    Index Structure • Indexesand “data” in InnoDB are B+Trees. • Clustered Index design means that data itself is stored in an index.
  • 25.
    Index Structure (cont.) Emptyroot Infimum Level 0 Root Page 3 Next Record Supremum
  • 26.
    Index Structure (cont.) Insert:1 Infimum Level 0 Root Page size is 16KB B* is 2KB Page 3 1 B* Supremum
  • 27.
    Index Structure (cont.) Insert:1 to 7 I Level 0 Root S Page 3 1 B* 2 B* 3 B* 4 B* 5 B* 6 B* 7 B*
  • 28.
    Index Structure (cont.) Insert:8 Infimum Page 3 Level 1 Root 1 4 I Level 0 Leaf Allocate new page and link in root Move records to new page Split new page Supremum S Page 4 1 B* 2 B* 3 B* 4 B* 5 B* 6 B* 7 B*
  • 29.
    Index Structure (cont.) Insert:8 (Cont.) Infimum Level 1 Root 1 4 I Level 0 Leaf S Page 4 1 B* Split at the middle of original page 2 B* 3 B* Supremum Page 3 4 5 I S Page 5 4 B* 5 B* 6 B* 7 B*
  • 30.
    Index Structure (cont.) Insert:9 and 10 Infimum Level 1 Root 1 4 I Level 0 Leaf S Page 4 1 B* 2 B* 3 B* Supremum Page 3 4 5 I S Page 5 4 B* 5 B* 6 B* 7 B* 8 B* 9 B* 10 B*
  • 31.
    Index Structure (cont.) Insert:11 Infimum Level 1 Root 4 5 1 4 I Level 0 Leaf Page 3 S Page 4 I Supremum 11 6 S Page 5 I S Page 6 1 B* 2 B* 3 B* Insert leads to a split at the insertion point 4 B* 5 B* 6 B* 7 B* 8 B* 9 B* 10 B* 11 B*
  • 32.
    Index Structure Level 2 Root Infimum Level1 Internal Page 6 0 A 1 B S Page 4 ≥0 → 6 I Level 0 Leaf Next Record ≥0 → 4 I S Supremum Page 3 ≥4 → 5 I Next Page ≥2 → 7 ≥4 → 8 Prev Page I Page 7 2 C 3 D S I Page 8 4 E 5 F S Page 5 S ≥6 → 9 I Page 9 6 G 7 H S
  • 33.
    Page Format 0 38 FIL Header(38) Other headers and page data, depending on page type. Total usable space: 16,338 bytes. 16376 16384 FIL Trailer (8)
  • 34.
    Row Format N-5 N-4 N-2 N N+k N+k+6 N+k+13 N+k+13+j Variable fieldlengths (1-2 bytes per var. field) Info Flags (4 bits) Number of Records Owned (4 bits) Order (13 bits) Record Type (3 bits) Next Record Offset (2) Cluster Key Fields (k) Transaction ID (6) Roll Pointer (7) Non-Key Fields (j)
  • 35.
    Conclusion • Page isbasic unit of storage. • Default is 16KiB • Rows of variable length.
  • 36.
    Two more usefulfeatures • Adaptive hash - Partial hash index to accelerate secondary key lookups. • Change buffering - when non-unique indexes are not in memory, changes can be temporarily buffered until they are.
  • 37.
    Query (by secondarykey) mysqld SELECT * FROM a WHERE b_key = 10; Buffer Pool Data Dictionary Cache Buffer Pool Flush List Adaptive Hash Indexes Log Group ibdata1 space 0 Storage Buffer Pool LRU Page Cache Additional Mem Pool Log Buffer Transaction System Caching InnoDB IBUF_HEADER IBUF_TREE TRX_SYS FIRST_RSEG DICT_HDR iblogfile0 iblogfile1 Doublewrite Buffer Data Dict. Block 1 (64 pages) Block 2 (64 pages) SYS_TABLES SYS_COLUMNS SYS_INDEXES SYS_FIELDS iblogfile2 Tables with file_per_table A.ibd B.ibd C.ibd
  • 38.
    Update Query (largetable) mysqld UPDATE a SET col1 = ‘new’ WHERE id = 10; commit; Buffer Pool Caching InnoDB Buffer Pool LRU Adaptive Hash Indexes Data Dictionary Cache Buffer Pool Flush List Page Cache Additional Mem Pool Not Required Log Group ibdata1 space 0 Storage Transaction System Log Buffer IBUF_HEADER IBUF_TREE TRX_SYS FIRST_RSEG DICT_HDR iblogfile0 iblogfile1 Doublewrite Buffer Data Dict. Block 1 (64 pages) Block 2 (64 pages) SYS_TABLES SYS_COLUMNS SYS_INDEXES SYS_FIELDS iblogfile2 Tables with file_per_table A.ibd B.ibd C.ibd
  • 39.
    New  Features Copyright  ©  2012  Oracle  and/or  its  affiliates.  All  rights  reserved.
  • 40.
    MySQL 5.5+ • • • • • • • • • IO Scalability AsyncIO Multiple Buffer Pools Adaptive Flushing Scan Resistant LRU Compressed Pages CPU Scalability Improved Atomics Spin Loops with PAUSE
  • 41.
    MySQL 5.6+ • LRUDump and Restore • Improved Group Commit • Fulltext Search • Fast Read-only Transactions • Memcached Interface • Information Schema metadata tables • • • • Persistent Statistics Variable Page Size Online DDL Transportable Tablespace • Transactional Replication
 Using InnoDB
  • 42.
    MySQL 5.7+ • FasterTemporary Tables • Index Lock Contention Reduction • More Online DDL • Extend VARCHAR • Rename Index • Improved Read-Only Transactions • Improved CPU Scalability
  • 43.
    Configuration Copyright  ©  2012  Oracle  and/or  its  affiliates.  All  rights  reserved.
  • 44.
    The Top 3 1.innodb-buffer-pool-size 2. innodb-log-file-size 3. innodb_flush_log_at_trx_commit
  • 45.
    innodb-buffer-pool-size • Really onlyone major buffer/cache settings to set. • Responsible for all pages types (data, indexes, undo, insert buffer..)
  • 46.
    innodb-buffer-pool-size (cont.) • Recommendationis 50-80% of RAM. • Default is 128M of RAM. • Please allow 5-10% on top for other meta data to grow.
  • 47.
    innodb-log-file-size • Log filesare on disk, but this contributes to how many unflushed (dirty) pages you can hold in memory. • In theory larger log files = longer crash recovery. • In MySQL 5.5 -2G max. • In MySQL 5.6 - 4G is usually safe. • Early versions should be much smaller. • Default is 48M Log Files.
  • 48.
    innodb_flush_log_at_trx_commit • Default isfull ACID Compliance (=1) • Can be set to 0/2 if you do not mind some data loss.
  • 49.
    innodb_flush_log_at_trx_commit • 0 =Log buffer written + synced once per second. Nothing done at commit. • 1 = Log buffer written + synced once per second + written and synced on commit. • 2 = Log buffer written + synced once per second + written (not synced) on commit. ! 2 is a slightly safer version of 0.
  • 50.
    Requires about 5-10%of buffer pool size as overhead (not directly configurable). Basic Configuration Buffer Pool InnoDB Buffer Pool LRU Adaptive Hash Indexes Additional Mem Pool innodb_log_file_size. Typical values 256M+. Default of 2 files (innodb_log_files_in_group). Log Group iblogfile0 ibdata1 space 0 innodb_log_buffer_size. Typical Doublewrite Buffer values 1-8M. Flushing IBUF_HEADER IBUF_TREE Block 1 (64 pages) behaviour influenced by TRX_SYS Block 2 (64 pages) FIRST_RSEG innodb_flush_log_at_trx_commit. DICT_HDR Storage Data Dictionary Cache Buffer Pool Flush List Page Cache Log Buffer Transaction System Caching Innodb_buffer_pool_size. Recommendation is 50-80% RAM. iblogfile1 iblogfile2 Tables with file_per_table Data Dict. SYS_TABLES SYS_COLUMNS SYS_INDEXES SYS_FIELDS A.ibd B.ibd C.ibd innodb_file_per_table (Default: ON in 5.6+)
  • 51.
    The ~Top 10 1.innodb-buffer-pool-size 2. innodb-log-file-size 3. innodb-log-buffer-size 4. innodb_flush_log_at_trx_commit 5. innodb_flush_method 6. innodb_flush_neighbors 7. innodb_io_capacity, innodb_io_capacity_max, innodb_lru_scan_depth 8. innodb-buffer-pool-instances 9. innodb_read_io_threads and innodb_write_io_threads
  • 52.
    Less-likely to needconfiguration • • • • • innodb_thread_concurrency innodb_concurrency_tickets innodb_max_pct_dirty_pages innodb_use_native_aio (always on) innodb_old_blocks_time (5.6 default: 1000)
  • 53.
    Deprecated Settings • Typically“remove on sight” from config files: • innodb_additional_mempool_size • innodb_use_sys_malloc
  • 54.
    Credits • InnoDB ArchitectureDiagrams via https://github.com/ jeremycole/innodb_diagrams • Available under (3-clause) BSD license
 Copyright (c) 2013, Twitter, Inc.
 Copyright (c) 2013, Jeremy Cole <jeremy@jcole.us>
 Copyright (c) 2013, Davi Arnaut <darnaut@gmail.com>