InnoDB: Architecture of a Transactional Storage Engine (Konstantin Osipov)
Published in Technology
  • Everybody knows Heikki. Introduce myself.
  • The first 32 pages of a segment are allocated individually (from the fragmented extent space); after that, full 64-page extents are allocated. Using multiple tablespaces can be beneficial to users who want to move specific tables to separate physical disks, or who wish to restore backups of single tables quickly without interrupting the use of the remaining InnoDB tables.
  • Also overflow page pointers. Whether any columns are stored off-page depends on the page size and the total size of the row. When the row is too long to fit entirely within the page of the clustered index, InnoDB will choose the longest columns for off-page storage until the row fits on the clustered index page.
  • COMPACT mode always stores a prefix of up to 768 bytes of such columns in the clustered index page. The new DYNAMIC mode stores long columns entirely “off-page”, with only a 20-byte pointer in the clustered index page. If a row does not fit in the clustered index page, some long BLOB or VARCHAR column(s) may be stored on an overflow page. The REDUNDANT format is available to retain compatibility with older versions of MySQL.
  • If you do not define a PRIMARY KEY for your table, InnoDB uses the first UNIQUE index that has only NOT NULL columns as the clustered index. If there is no such index in the table, InnoDB internally generates a clustered index where the rows are ordered by the row ID that InnoDB assigns to the rows in such a table. The row ID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order. All InnoDB indexes are B-trees where the index records are stored in the leaf pages of the tree. The default size of an index page is 16KB. When new records are inserted, InnoDB tries to leave 1/16 of the page free for future insertions and updates of the index records.
  • If you specify the autoextend option for the last data file, InnoDB extends the data file if it runs out of free space in the tablespace. The increment is 8MB at a time by default. It can be modified by changing the innodb_autoextend_increment system variable. You can use raw disk partitions as data files in the shared tablespace. Remember that only the last data file in the innodb_data_file_path can be specified as auto-extending.
  • Note: InnoDB always needs the shared tablespace because it puts its internal data dictionary and undo logs there, to support recovery and unique features.
  • Physiological logging: a combination of physical (disk address) and logical (field content) logging. Logging is used for durability at crash recovery.
    • InnoDB keeps two logs, the redo log and the undo log. The redo log is for re-doing data changes that had not been written to disk when a crash occurred. The undo log is primarily for removing data changes that had been written to disk when a crash occurred, but should not have been written, because they were for uncommitted transactions. The undo log is inside the tablespace.
    • The "insert" section of the undo log is needed only for transaction rollback and can be discarded at COMMIT time. The "update/delete" section of the undo log is also useful for consistent reads, and can be discarded when InnoDB has ended all transactions that might need the undo log records to reconstruct earlier versions of rows.
    • An undo log record's contents are: primary key value (not a page number or physical address), old transaction ID (of the transaction that updated the row), and the changes (only old values).
    • COMMIT will write the contents of the log buffer to disk, and put undo log records in a history list. ROLLBACK will delete undo log records that are no longer needed. PURGE (an internal operation that occurs outside user control) will remove no-longer-necessary undo log records and, for data records that have been marked for deletion and are no longer necessary for consistent read, will remove the records.
  • There is one redo log for the entire workspace; it consists of multiple files and is circular. The file header includes the last successful checkpoint. A redo log record's contents are: space id, page number (4 bytes = page number within tablespace), offset of change within page (2 bytes), log record type (insert, update, delete, "fill space with blanks", etc.), and the changes on that page (only after images, not before images).
  • Primary goal: make it faster at run time. Foreground threads come from the MySQL server. Memory: log buffer (redo records); buffer pool (data pages, index pages, undo records, adaptive hash indexes, table of lock info); and additional memory pool (cached data dictionary, open table handles).
  • Typically two segments for each index (non-leaf index segment and leaf index segment). One segment for very small indexes only. Each secondary index has its own pair of segments.


  • 1.
      InnoDB: Architecture of a Transactional Storage Engine
      Highload++, October 2010
    • Konstantin Osipov, kostja@sun.com
  • 2.
      What this talk is about
    • Repetition is the mother of learning.
    • 3. A developer's (not a DBA's) point of view on:
      • InnoDB Database Files и их содержимое
      • 4. InnoDB threads
      • 5. InnoDB data structures and algorithms
  • 6.
      InnoDB Database Files
      ibdata files
      System tablespace
      internal data dictionary
      MySQL Data Directory
      InnoDB tables
      .ibd files
      .frm files
      undo logs
      insert buffer
  • 7.
      InnoDB Tablespaces
      an extent = 64 pages
      Trx id
      Field 1
      Roll pointer
      Field pointers
      Field 2
      Field n
      Leaf node segment
      Rollback segment
      Non-leaf node segment
  • 8.
      InnoDB Pages
      A page consists of: a page header, a page trailer, and a page body (rows or other contents).
      Page header
      Page trailer
      row offset array
  • 9.
      InnoDB Rows
      Record hdr Trx ID Roll ptr Fld ptrs overflow-page ptr .. Field values
      overflow page
      COMPACT format
      overflow page
      DYNAMIC format
      20 bytes
  • 10.
      InnoDB Indexes - Primary
    • Data rows are stored in the B-tree leaf nodes of a clustered index
      • B-tree is organized by primary key or non-null unique key of table, if defined; else, an internal column with 6-byte ROW_ID is added.
      Primary Index
      xxx - nnn
      001 - 275
      276 – 500
      clustered (primary key) index
      501 - 630
      631 - 768
      769 - 800
      801 - 949
      950 - xxx
      001 – 500
      801 – nnn
      500 – 800
      PK values 001 - nnn
      Key values 501-630 + data for corresponding rows
  • 11.
      InnoDB Tablespaces
    • A tablespace consists of multiple files and/or raw disk partitions, each specified as file_name:file_size[:autoextend[:max:max_file_size]]
    • 12. A file/partition is a collection of segments.
    • 13. A segment consists of fixed-length pages.
    • 14. The page size is always 16KB in uncompressed tablespaces, and 1KB-16KB in compressed tablespaces (for both data and index).
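The data file specification syntax above can be illustrated with a small parser. This is only a sketch in Python, not InnoDB's own parsing code: the names `DataFileSpec`, `parse_size`, and `parse_data_file_spec` are hypothetical, a bare number is treated as bytes (an assumption of this sketch), and raw-partition keywords are not handled.

```python
from dataclasses import dataclass
from typing import Optional

# Size suffixes accepted by this sketch.
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a size such as '10M' to bytes (bare numbers are assumed to be bytes)."""
    text = text.strip().upper()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


@dataclass
class DataFileSpec:
    file_name: str
    file_size: int
    autoextend: bool = False
    max_file_size: Optional[int] = None


def parse_data_file_spec(spec: str) -> DataFileSpec:
    """Parse file_name:file_size[:autoextend[:max:max_file_size]]."""
    parts = [p.strip() for p in spec.split(":")]
    if len(parts) < 2:
        raise ValueError("expected at least file_name:file_size")
    result = DataFileSpec(parts[0], parse_size(parts[1]))
    rest = parts[2:]
    if rest:
        if rest[0] != "autoextend":
            raise ValueError("unexpected attribute: " + rest[0])
        result.autoextend = True
        rest = rest[1:]
    if rest:
        if len(rest) != 2 or rest[0] != "max":
            raise ValueError("expected max:max_file_size after autoextend")
        result.max_file_size = parse_size(rest[1])
    return result
```

For example, `parse_data_file_spec("ibdata1:10M:autoextend:max:500M")` yields a 10 MB auto-extending file capped at 500 MB.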
  • 15.
      System Tablespace
    • Internal Data Dictionary
    • 16. Undo
    • 17. Insert Buffer
    • 18. Doublewrite Buffer
    • 19. MySQL Replication Info
  • 20.
      InnoDB Logging
      Rollback segments
      Log Buffer
      Buffer Pool
      redo log
      Log File #1
      Log File #2
      log thread
      write thread
      log files
      ibdata files
  • 21.
      InnoDB Redo Log
      Redo log structure:
      Space id PageNo OpCode Data
      end of log
      min LSN
      start of log
      last checkpoint
  • 22. Redo Logging
    • The redo log remembers EVERY operation on any page in the database
    • 23. Redo log record format: <space id, page no, operation code, data>
    • 24. An example of a redo log record:
    • 25. <0, 1234, insert, after record at offset 5444, '(25, 'heikki', …)'>
    • 26. 'Physiological' logging, per-page
  • 27. Redo Logging (continued)
    • Physiological means that the log record is per page and it codes the page operation in a concise way:
    • 28. - 'Reorganize page 1234'
    • 29. - 'Delete all records on page 1234 after position 6543'
    • 30. - ...
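The per-page record format <space id, page no, operation code, data> can be sketched as a toy replay loop. This is an illustration under loud assumptions, not InnoDB's recovery code: `RedoRecord`, `apply_redo`, and `recover` are hypothetical names, a page is modeled as a Python list of rows, positions are list indexes rather than byte offsets, and checkpoints and LSN checks are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedoRecord:
    space_id: int
    page_no: int
    op: str      # only 'insert' and 'delete_after' exist in this sketch
    data: tuple  # operation-specific payload


def apply_redo(pages, rec):
    """Re-apply one logged page operation to the in-memory page image."""
    page = pages.setdefault((rec.space_id, rec.page_no), [])
    if rec.op == "insert":
        position, row = rec.data
        page.insert(position, row)
    elif rec.op == "delete_after":
        (position,) = rec.data
        del page[position:]  # 'delete all records on the page after position'
    else:
        raise ValueError("unknown operation: " + rec.op)


def recover(pages, log):
    """Replay the log in order; every record touches exactly one page."""
    for rec in log:
        apply_redo(pages, rec)
    return pages


log = [
    RedoRecord(0, 1234, "insert", (0, (25, "heikki"))),
    RedoRecord(0, 1234, "insert", (1, (26, "monty"))),
    RedoRecord(0, 1234, "delete_after", (1,)),
]
pages = recover({}, log)
```

Each record names a single page and a compact operation on it, which is the "physiological" idea: physical addressing to the page, logical description of the change within it.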
  • 31.
      InnoDB Architecture: Runtime Model
      InnoDB Code
      Buffer Pool
      Background threads
      Log buffer
      buffer pool
      Misc buffer
  • 41.
      InnoDB Architecture
      File Space Manager
      Handler API Embedded InnoDB API
      Cursor / Row
  • 42. InnoDB Transactions
    • Atomic – all changes are either committed as a group, or all are rolled back as a group
    • Consistent – transactions operate on a consistent view of the data, leaving the data in a consistent state (by transaction’s end)
    • Isolated – each transaction “thinks” it is running by itself; effects of other transactions are invisible until it commits
    • Durable – once committed, all changes persist, even if there are system failures
  • 43. InnoDB Consistent Reads
    • Queries see a snapshot of the data consistent with the other data they read
    • 44. By default, InnoDB uses “consistent read” for queries like this
      SQL Query: SELECT a FROM t WHERE b = 2;
      (Diagram: the query reads the “old” version of a row that another transaction holds X-locked)
    • Normal undo is used to generate consistent data for the query to see
      • No extra overhead: undo info is required anyway to roll back uncommitted transactions
    • No need to set locks, as history cannot change
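How undo information can serve a consistent read may be sketched as a walk along a chain of row versions. This is a simplified model, not InnoDB's implementation: `RowVersion` and `consistent_read` are hypothetical names, and the snapshot is modeled as a plain set of transaction ids whose changes are visible, whereas a real read view is more involved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    value: str
    trx_id: int                          # transaction that wrote this version
    prev: Optional["RowVersion"] = None  # plays the role of the roll pointer


def consistent_read(newest, visible_trx_ids):
    """Return the newest version written by a transaction the snapshot can see."""
    version = newest
    while version is not None:
        if version.trx_id in visible_trx_ids:
            return version.value
        version = version.prev  # step back to the older version kept in undo
    return None                 # the row did not exist for this snapshot


old = RowVersion("old", trx_id=10)
new = RowVersion("new", trx_id=20, prev=old)  # written by a later transaction
```

A reader whose snapshot only includes transaction 10 gets the older version without taking any lock, even while transaction 20 still holds an X lock on the row.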
  • 45.
      InnoDB Indexes - Secondary
    • Secondary index B-tree leaf nodes contain, for each key value, the primary keys of the corresponding rows, which are used to access the clustered index to obtain the data
      (Diagram: a clustered (primary key) index over PK values 001 - nnn, with B-tree leaf nodes containing data, and two secondary indexes over key values A - Z, with B-tree leaf nodes containing PKs)
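The two-step lookup through a secondary index can be sketched as follows. This is a toy model rather than a B-tree: the `Table` class and its column names are hypothetical, the clustered index is a dict keyed by primary key, and the secondary index is a sorted list of (key, primary key) pairs.

```python
import bisect


class Table:
    def __init__(self):
        self.clustered = {}  # primary key -> full row (the leaf level holds data)
        self.by_name = []    # sorted (name, primary key) pairs: no row data here

    def insert(self, pk, name, city):
        self.clustered[pk] = {"id": pk, "name": name, "city": city}
        bisect.insort(self.by_name, (name, pk))

    def find_by_name(self, name):
        """Search the secondary index, then fetch each row via its primary key."""
        pos = bisect.bisect_left(self.by_name, (name,))
        rows = []
        while pos < len(self.by_name) and self.by_name[pos][0] == name:
            pk = self.by_name[pos][1]
            rows.append(self.clustered[pk])  # second lookup, in the clustered index
            pos += 1
        return rows


t = Table()
t.insert(3, "Ken", "Oulu")
t.insert(1, "Heikki", "Helsinki")
t.insert(2, "Monty", "Espoo")
t.insert(4, "Ken", "Turku")
```

Every secondary-index hit costs an extra clustered-index access, which is why the secondary leaf nodes only need to carry primary keys.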
  • 46. Insert Buffering
    • Defers writes to secondary indexes on INSERTs
    • 47. Unique to InnoDB, saves random writes, improves insert speed
    • 48. Performance benefit of insert buffering:
      • mysqlha.blogspot.com/2008/12/innodb-insert-performance.html
      • 49. as much as 7.2x faster than the theoretical rate of inserts in a "normal" DBMS
      (Diagram: clustered index with B-tree leaf nodes containing data; secondary indexes with B-tree leaf nodes containing PKs)
  • 50. Shared and Exclusive Locks
    • Q: If user A has shared row locks in table T, how does InnoDB know not to let user B set an exclusive X lock on table T?
    • A: 'Intention locks'. Before setting a shared lock on a row in T, user A sets an 'intention lock' IS on table T. Similarly, before setting an exclusive X lock on a row, a user must set an IX lock on the table.
    • IS is compatible with S, but not with X. Thus, if a user has row share locks, no other user can lock the table in X mode
      • IX is not compatible with S lock on T
      • 51. if a user has IX lock on table T, no other user can take S lock on T
      • 52. if a user has an S lock on table T, no other user can take X locks on rows in T
      (Diagram: user A holds S locks on rows under an IS lock on the table; user B asks for a lock)
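The table-level compatibility rules for S, X, IS, and IX can be written down as a small matrix. The sketch below is illustrative only; `compatible` and `can_grant` are hypothetical helper names, not InnoDB functions.

```python
# For each lock already held on a table, the set of lock modes
# another transaction may still be granted on the same table.
_COMPATIBLE_WITH = {
    "IS": {"IS", "IX", "S"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "X": set(),
}


def compatible(held, requested):
    """True if 'requested' can coexist with a 'held' lock of another transaction."""
    return requested in _COMPATIBLE_WITH[held]


def can_grant(held_locks, requested):
    """A request is granted only if it is compatible with every lock held by others."""
    return all(compatible(held, requested) for held in held_locks)
```

With user A holding IS on the table, `can_grant(["IS"], "X")` is false, so user B cannot lock the whole table in X mode while A holds shared row locks.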
  • 53. Phantoms vs. Consistency
    PHANTOM: A row that appears in a second query that was not in the first
    Example: foreign key check in application code
      Check that there are no children with parent id = 10:
      SQL Query:
      SELECT * FROM child WHERE parent_id = 10 FOR UPDATE;
      DELETE FROM parent WHERE id = 10;
    • If the SELECT returns 0 rows, then the user thinks he can delete the parent
    • But before the first user COMMITs, another user inserts a child with parent_id = 10 …
      (Diagram: Parent table with rows 21, 102, 5; Child table with rows 77, 12, 45, 157)
  • 54. Statement-Based Replication Relies on No Phantoms
      1. User A deletes a row: BEGIN; DELETE FROM t WHERE a = 10;
      2. Before A commits, user B inserts the same row: BEGIN; INSERT INTO t VALUES (10);
      3. User B commits before A: COMMIT;
      4. User A commits: COMMIT;
      5. The MySQL binlog contains B's transaction before A's
    • We do not know if A deleted the row that B inserted!
    • The slave may get out of sync with the master!
      (Diagram: MySQL binlog holding A's DELETE and B's INSERT)
  • 55. Row-Based Replication Needs Less Locking
      1. User A deletes a row: BEGIN; DELETE FROM t WHERE a = 10;
      2. Before A commits, user B inserts the same row: BEGIN; INSERT INTO t VALUES (10);
      3. User B commits before A: COMMIT;
      4. User A commits: COMMIT;
      5. The MySQL binlog contains B's transaction before A's
    • Row-based binlog contains information about each individual row that was deleted or inserted => phantoms are no longer a problem!
      (Diagram: MySQL binlog holding A's DELETE and B's INSERT)
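The five-step scenario can be replayed in a toy simulation to see why the two binlog formats differ. This sketch assumes the interleaving is allowed to happen, that is, no gap lock stops user B. The table is modeled as a dict of row id to value, and all function names and event tuples are hypothetical.

```python
def delete_where(table, value):
    """Statement semantics: remove every row holding 'value'; return the removed ids."""
    removed = [row_id for row_id, v in table.items() if v == value]
    for row_id in removed:
        del table[row_id]
    return removed


def simulate_master():
    table = {1: 10}                         # one existing row with a = 10
    deleted_by_a = delete_where(table, 10)  # 1. A: DELETE FROM t WHERE a = 10
    table[2] = 10                           # 2. B: INSERT INTO t VALUES (10)
    # 3-5. B commits first, then A, so the binlog lists B's change before A's.
    statement_binlog = [("insert", 10), ("delete_where", 10)]
    row_binlog = [("insert_row", 2, 10)]
    row_binlog += [("delete_row", row_id) for row_id in deleted_by_a]
    return table, statement_binlog, row_binlog


def replay(table, binlog):
    """Apply a binlog to a slave copy that starts from the same data as the master."""
    for event in binlog:
        kind = event[0]
        if kind == "insert":
            table[max(table, default=0) + 1] = event[1]
        elif kind == "delete_where":
            delete_where(table, event[1])
        elif kind == "insert_row":
            table[event[1]] = event[2]
        elif kind == "delete_row":
            table.pop(event[1], None)
    return table


master, statement_binlog, row_binlog = simulate_master()
statement_slave = replay({1: 10}, statement_binlog)
row_slave = replay({1: 10}, row_binlog)
```

In this model the master ends with B's row, the statement-based slave ends with an empty table because the replayed DELETE also removes B's row, and the row-based slave matches the master.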
  • 56. InnoDB Avoids Phantoms Through 'Gap Locking'
    • Every SELECT, UPDATE, DELETE in InnoDB uses an index to find the rows to return or operate on: the index is searched, or scanned
    • To avoid phantoms, we lock not only the index records we scan, but also the 'gaps' between them
      • No other user can insert new records in the gaps
      • If the query scans the rows between Heikki and Ken, we also lock the 'gap' between those records, so that other users cannot insert 'Jeffrey' in the gap
      (Diagram: an alphabetical index on people's names, David, Heikki, Ken, Monty, with the range of the search/scan marked)
  • 57. Types of Gap Locking in InnoDB
    • Record-only lock: locks just the key and not the gap
    • Gap lock: locks just the gap before the key
    • Next-key lock: locks the key & the gap before the key
    • Insert-intention gap lock: held when waiting to insert into a gap
    • InnoDB minimizes gap locking by using record-only locks in UNIQUE searches, for example: UPDATE t SET a = a + 1 WHERE primary_key_value = 100;
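The name-index example can be sketched with a sorted list, where scanning a range locks the records it visits and the gaps between them. This is a deliberately simplified model: `GapLockedIndex` is a hypothetical class, only the gaps between scanned records are locked (real next-key locks also cover the gap before the first scanned record), and lock ownership is not tracked.

```python
import bisect


class GapLockedIndex:
    def __init__(self, keys):
        self.keys = sorted(keys)
        self.locked_records = set()
        self.locked_gaps = set()  # each gap is identified by (left key, right key)

    def lock_range(self, low, high):
        """Lock every record in [low, high] and every gap between those records."""
        lo = bisect.bisect_left(self.keys, low)
        hi = bisect.bisect_right(self.keys, high)
        scanned = self.keys[lo:hi]
        self.locked_records.update(scanned)
        self.locked_gaps.update(zip(scanned, scanned[1:]))

    def can_insert(self, key):
        """An insert is blocked if the gap it would land in is locked."""
        pos = bisect.bisect_left(self.keys, key)
        left = self.keys[pos - 1] if pos > 0 else None
        right = self.keys[pos] if pos < len(self.keys) else None
        return (left, right) not in self.locked_gaps


index = GapLockedIndex(["David", "Heikki", "Ken", "Monty"])
index.lock_range("Heikki", "Ken")
```

After the scan, `index.can_insert("Jeffrey")` is false because 'Jeffrey' falls between Heikki and Ken, while a name that sorts between Ken and Monty can still be inserted.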
  • 58. Transaction Isolation Levels
    • REPEATABLE READ
      • All SELECTs after the 1st consistent read SELECT in a transaction use the same “snapshot”
      • 59. UPDATE, DELETE use next-key locking
      • 60. This is the default level
    • SERIALIZABLE
      • All plain SELECTs execute as if they used LOCK IN SHARE MODE
      • 61. No 'consistent' reads; all SELECTs return the very latest state of the database
      • 62. Downside: lots of locking, lots of deadlocks.
    • READ COMMITTED
      • Each SELECT uses its own “snapshot”
      • 64. Data is “up to date”, but multiple SELECTs may be inconsistent with one another
      • 65. In V5.1, most gap locking is removed with this level, but you MUST use row-based logging/replication
      • 66. Fewer gap locks mean fewer deadlocks
      • 67. UNIQUE KEY checks in secondary indexes and some FOREIGN KEY checks still need to set gap locks
        • Gaps must be locked to prevent inserting child rows after the parent row is deleted
      • Many users will move to this isolation level in V5.1 and later
      • 68. Use innodb_locks_unsafe_for_binlog to remove gap locking in MySQL-5.0 and earlier
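The difference between one snapshot per transaction and one snapshot per SELECT, which corresponds to the REPEATABLE READ and READ COMMITTED levels, can be sketched as below. The `Database` and `Transaction` classes are hypothetical, and the database is reduced to a single committed value for illustration.

```python
class Database:
    def __init__(self, value):
        self.committed = value  # latest committed state

    def commit(self, value):
        self.committed = value


class Transaction:
    def __init__(self, db, isolation):
        self.db = db
        self.isolation = isolation
        self._snapshot = None
        self._has_snapshot = False

    def select(self):
        if self.isolation == "READ COMMITTED":
            # Each SELECT takes a fresh snapshot of the committed state.
            return self.db.committed
        # REPEATABLE READ: the first SELECT fixes the snapshot for the transaction.
        if not self._has_snapshot:
            self._snapshot = self.db.committed
            self._has_snapshot = True
        return self._snapshot


db = Database("v1")
rr = Transaction(db, "REPEATABLE READ")
rc = Transaction(db, "READ COMMITTED")
first = (rr.select(), rc.select())
db.commit("v2")  # another transaction commits a change in between
second = (rr.select(), rc.select())
```

In this model the REPEATABLE READ transaction keeps seeing the state from its first SELECT, while the READ COMMITTED transaction sees the newly committed value on its second SELECT.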
  • 69. Deadlock Detection & Rollback
    • InnoDB automatically detects a deadlock when it finds a cycle in the “waits-for” graph of transactions
      (Diagram: waits-for graph over transactions A, B, C, D)
    • Given a deadlock, InnoDB chooses the transaction that modified the fewest rows as the victim, and rolls it back
    • Note: InnoDB cannot detect deadlocks that span MySQL storage engines
      • Set innodb_lock_wait_timeout in my.cnf, to break deadlocks via timeout (default 50 sec)
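Cycle detection in a waits-for graph, followed by picking the transaction that modified the fewest rows, might look like the sketch below. It is a plain depth-first search written for illustration, not InnoDB's detector; `find_cycle`, `choose_victim`, and the sample graph and row counts are all made up.

```python
def find_cycle(waits_for):
    """Return a list of transactions forming a cycle, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for nxt in waits_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:  # back edge: nxt is on the current path
                return stack[stack.index(nxt):]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(waits_for):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


def choose_victim(cycle, rows_modified):
    """Pick the transaction in the cycle that modified the fewest rows."""
    return min(cycle, key=lambda trx: rows_modified[trx])


# Hypothetical graph: A waits for B, B for C, C for A; D waits for A.
waits_for = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}
rows_modified = {"A": 5, "B": 1, "C": 7, "D": 2}
cycle = find_cycle(waits_for)
victim = choose_victim(cycle, rows_modified)
```

Transaction D waits on the cycle but is not part of it, so it is never a candidate victim in this sketch.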