
InnoDB: Architecture of a Transactional Storage Engine (Konstantin Osipov)


  1. 1. <ul>InnoDB: Architecture of a Transactional Storage Engine </ul><ul>Highload++, October 2010 <li>Konstantin Osipov, kostja@sun.com </li></ul>
  2. 2. <ul>What this talk is about </ul><ul><li>Repetition is the mother of learning.
  3. 3. A developer's (not a DBA's) view of: </li><ul><li>InnoDB database files and their contents
  4. 4. InnoDB threads
  5. 5. InnoDB data structures and algorithms </li></ul></ul>
  6. 6. <ul>InnoDB Database Files </ul><ul>ibdata files </ul><ul>System tablespace </ul><ul>internal data dictionary </ul><ul>MySQL Data Directory </ul><ul>InnoDB tables </ul><ul>OR </ul><ul>innodb_file_per_table </ul><ul>.ibd files </ul><ul>.frm files </ul><ul>undo logs </ul><ul>insert buffer </ul>
  7. 7. <ul>InnoDB Tablespaces </ul><ul>Figure: a tablespace holds segments (leaf node segment, non-leaf node segment, rollback segment); a segment holds extents; an extent = 64 pages; a page holds rows; a row holds Trx id, Roll pointer, Field pointers, Field 1, Field 2, … Field n </ul>
  8. 8. <ul>InnoDB Pages </ul><ul>A page consists of: a page header, a page trailer, and a page body (rows or other contents). </ul><ul>Figure: page header, rows, row offset array, page trailer </ul>
  9. 9. <ul>InnoDB Rows </ul><ul>Record hdr Trx ID Roll ptr Fld ptrs overflow-page ptr .. Field values </ul><ul>prefix(768B) </ul><ul>… </ul><ul>… </ul><ul>overflow page </ul><ul>COMPACT format </ul><ul>overflow page </ul><ul>… </ul><ul>… </ul><ul>DYNAMIC format </ul><ul>20 bytes </ul>
  10. 10. <ul>InnoDB Indexes - Primary </ul><ul><li>Data rows are stored in the B-tree leaf nodes of a clustered index </li></ul><ul><ul><li>B-tree is organized by primary key or non-null unique key of table, if defined; else, an internal column with 6-byte ROW_ID is added. </li></ul></ul><ul>… </ul><ul>… </ul><ul>Primary Index </ul><ul>xxx - nnn </ul><ul>001 - 275 </ul><ul>276 – 500 </ul><ul>clustered (primary key) index </ul><ul>501 - 630 </ul><ul>631 - 768 </ul><ul>769 - 800 </ul><ul>801 - 949 </ul><ul>950 - xxx </ul><ul>001 – 500 </ul><ul>801 – nnn </ul><ul>500 – 800 </ul><ul>PK values 001 - nnn </ul><ul>Key values 501-630 + data for corresponding rows </ul>
  11. 11. <ul>InnoDB Tablespaces </ul><ul><li>A tablespace consists of multiple files and/or raw disk partitions. file_name : file_size [:autoextend[:max: max_file_size ]]
  12. 12. A file/partition is a collection of segments.
  13. 13. A segment consists of fixed-length pages.
  14. 14. The page size is always 16KB in uncompressed tablespaces, and 1KB-16KB in compressed tablespaces (for both data and index). </li></ul>
  15. 15. <ul>System Tablespace </ul><ul><li>Internal Data Dictionary
  16. 16. Undo
  17. 17. Insert Buffer
  18. 18. Doublewrite Buffer
  19. 19. MySQL Replication Info </li></ul>
  20. 20. <ul>InnoDB Logging </ul><ul>DATA </ul><ul>Rollback segments </ul><ul>Log Buffer </ul><ul>Buffer Pool </ul><ul>redo log </ul><ul>rollback </ul><ul>Log File #1 </ul><ul>Log File #2 </ul><ul>log thread </ul><ul>write thread </ul><ul>log files </ul><ul>ibdata files </ul>
  21. 21. <ul>InnoDB Redo Log </ul><ul>Redo log structure: </ul><ul>Space id PageNo OpCode Data </ul><ul>end of log </ul><ul>min LSN </ul><ul>start of log </ul><ul>last checkpoint </ul>
  22. 22. Redo Logging <ul><li>The redo log remembers EVERY operation on any page in the database
  23. 23. Redo log record format: <space id, page no, operation code, data>
  24. 24. An example of a redo log record:
  25. 25. <0, 1234, insert, after record at offset 5444, '(25, 'heikki', …)'>
  26. 26. 'Physiological' logging, per-page </li></ul>
  27. 27. Redo Logging (continued) <ul><li>Physiological means that the log record is per page and it codes the page operation in a concise way:
  28. 28. - 'Reorganize page 1234'
  29. 29. - 'Delete all records on page 1234 after position 6543'
  30. 30. - ... </li></ul>
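A minimal sketch of per-page redo records, assuming a toy in-memory model rather than InnoDB's actual on-disk format. The `RedoRecord` class, the operation names `insert` and `delete_after`, and the list-based "pages" are all hypothetical; only the record shape (space id, page no, operation code, data) comes from the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RedoRecord:
    space_id: int   # tablespace the page belongs to
    page_no: int    # page number inside that tablespace
    op: str         # operation code (hypothetical names, see apply_redo)
    data: tuple     # operation-specific payload


# Hypothetical page store: (space_id, page_no) -> ordered list of rows.
Pages = Dict[Tuple[int, int], List[tuple]]


def apply_redo(pages: Pages, rec: RedoRecord) -> None:
    """Re-apply one logged page operation to the page it names."""
    page = pages.setdefault((rec.space_id, rec.page_no), [])
    if rec.op == "insert":
        position, row = rec.data
        page.insert(position, row)
    elif rec.op == "delete_after":
        (position,) = rec.data
        del page[position:]
    else:
        raise ValueError(f"unknown operation code {rec.op!r}")


def replay(log: List[RedoRecord]) -> Pages:
    """Rebuild page contents by replaying the log from start to end."""
    pages: Pages = {}
    for rec in log:
        apply_redo(pages, rec)
    return pages
```

Each record touches exactly one page and names a logical operation inside it, which is the "physiological" property described above.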
  31. 31. <ul>InnoDB Architecture: Runtime Model </ul><ul>InnoDB Code </ul><ul>Memory </ul><ul>Threads: </ul><ul><li>master
  32. 32. read io
  33. 33. write io
  34. 34. ibuf io
  35. 35. log io
  36. 36. lock timeout
  37. 37. monitor </li></ul><ul>Buffer Pool </ul><ul><li>data
  38. 38. index
  39. 39. undo
  40. 40. adaptive hash index </li></ul><ul>Background threads </ul><ul>Files </ul><ul>Log buffer </ul><ul>buffer pool </ul><ul>Misc buffer </ul>
  41. 41. <ul>InnoDB Architecture </ul><ul>Applications </ul><ul>IO </ul><ul>Buffer </ul><ul>File Space Manager </ul><ul>Transaction </ul><ul>Handler API Embedded InnoDB API </ul><ul>Cursor / Row </ul><ul>Mini-transaction </ul><ul>Lock </ul><ul>B-tree </ul><ul>Page </ul><ul>Server </ul>
  42. 42. InnoDB Transactions <ul><li>Atomic – all changes are either committed as a group, or all are rolled back as a group </li></ul><ul><li>Consistent – transactions operate on a consistent view of the data, leaving the data in a consistent state (by transaction’s end) </li></ul><ul><li>Isolated – each transaction “thinks” it is running by itself – effects of other transactions are invisible until it commits </li></ul><ul><li>Durable – once committed, all changes persist, even if there are system failures </li></ul>
  43. 43. InnoDB Consistent Reads <ul><li>Queries see a snapshot of the data that is consistent with the other data they read
  44. 44. By default, InnoDB uses a “consistent read” for queries like this: SELECT a FROM t WHERE b = 2; </li></ul><ul><li>Figure: the SQL query reads the “old” version of a row that is X-locked by another transaction </li></ul><ul><li>Normal undo information is used to generate the consistent data for the query to see </li></ul><ul><ul><li>No extra overhead: the undo information is required anyway to roll back uncommitted transactions </li></ul></ul><ul><li>No need to set locks, as history cannot change </li></ul>
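A simplified sketch of how a consistent read can pick an older row version without taking locks. It assumes a toy model: the `Version` and `Row` classes and the set-based snapshot are hypothetical, and InnoDB itself reaches older versions through undo records and decides visibility from transaction ids in a more elaborate way.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Version:
    trx_id: int             # transaction that wrote this version
    value: Optional[str]    # column value as of that write


@dataclass
class Row:
    # Newest version first; older entries stand in for undo history.
    versions: List[Version] = field(default_factory=list)

    def write(self, trx_id: int, value: Optional[str]) -> None:
        self.versions.insert(0, Version(trx_id, value))


def consistent_read(row: Row, visible_trx: Set[int]) -> Optional[str]:
    """Return the newest version written by a transaction the snapshot can see."""
    for version in row.versions:
        if version.trx_id in visible_trx:
            return version.value
    return None  # the row did not exist for this snapshot
```

A reader whose snapshot was taken before transaction 2 wrote the row keeps seeing the value written by transaction 1, whether or not transaction 2 still holds its X lock.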
  45. 45. <ul>InnoDB Indexes - Secondary </ul><ul><li>Secondary index B-tree leaf nodes contain, for each key value, the primary keys of the corresponding rows, which are used to access the clustered index to obtain the data </li></ul><ul>Figure: two secondary indexes (key values A–Z, B-tree leaf nodes containing PKs) pointing into the clustered (primary key) index (PK values 001 - nnn, B-tree leaf nodes containing data) </ul>
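The two-step lookup through a secondary index can be sketched as follows. This is an illustration only: a dict stands in for the clustered index and a sorted list of (key, PK) pairs stands in for the secondary index B-tree; `build_secondary` and `lookup` are hypothetical helper names.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple


def build_secondary(clustered: Dict[int, dict], column: str) -> List[Tuple[str, int]]:
    """Secondary index entries are (key value, primary key), kept sorted."""
    return sorted((row[column], pk) for pk, row in clustered.items())


def lookup(secondary: List[Tuple[str, int]],
           clustered: Dict[int, dict],
           key: str) -> List[dict]:
    """Find matching PKs in the secondary index, then fetch rows by PK."""
    rows = []
    i = bisect_left(secondary, (key,))  # first entry whose key is >= key
    while i < len(secondary) and secondary[i][0] == key:
        pk = secondary[i][1]
        rows.append(clustered[pk])      # second lookup, in the clustered index
        i += 1
    return rows
```

Every secondary-index hit costs an extra clustered-index lookup, which is why the secondary leaf nodes carry the primary key.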
  46. 46. Insert Buffering <ul><li>Defers writes to secondary indexes on INSERTs
  47. 47. Unique to InnoDB, saves random writes, improves insert speed
  48. 48. Performance benefit of insert buffering: </li></ul><ul><ul><li>mysqlha.blogspot.com/2008/12/innodb-insert-performance.html
  49. 49. as much as 7.2x faster than the theoretical rate of inserts in a &quot;normal&quot; DBMS </li></ul></ul>
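A rough sketch of the idea behind insert buffering, assuming a toy page cache. The `InsertBuffer` class and its fields are hypothetical and do not mirror InnoDB's internal structures; the point is only that a change to a secondary index page that is not in memory is remembered and merged later, instead of forcing a random read at insert time.

```python
from bisect import insort
from collections import defaultdict
from typing import Dict, List, Tuple

Entry = Tuple[str, int]  # (secondary key value, primary key)


class InsertBuffer:
    def __init__(self, disk_pages: Dict[int, List[Entry]]):
        self.disk = disk_pages              # page_no -> sorted entries "on disk"
        self.cached: Dict[int, List[Entry]] = {}   # pages already in memory
        self.pending = defaultdict(list)    # deferred inserts per page
        self.random_reads = 0               # counts simulated disk reads

    def insert(self, page_no: int, entry: Entry) -> None:
        if page_no in self.cached:
            insort(self.cached[page_no], entry)   # page is in memory: apply now
        else:
            self.pending[page_no].append(entry)   # defer: no disk read needed

    def read_page(self, page_no: int) -> List[Entry]:
        """Load a page and merge any buffered inserts into it."""
        if page_no not in self.cached:
            self.random_reads += 1
            page = list(self.disk.get(page_no, []))
            for entry in self.pending.pop(page_no, []):
                insort(page, entry)
            self.cached[page_no] = page
        return self.cached[page_no]
```

Several inserts into the same cold page cost one read when the page is eventually needed, rather than one read per insert.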
  50. 50. Shared and Exclusive Locks <ul><li>Q: If user A has shared row locks in table T, how does InnoDB know not to let user B set an exclusive X lock on table T? </li><li>A: 'Intention locks'. Before setting a shared lock on a row in T, user A sets an 'intention lock' IS on table T. Similarly, before setting an exclusive X lock on a row, a user must set an IX lock on the table. </li><li>IS is compatible with S, but not with X. Thus, if a user has shared row locks, no other user can lock the table in X mode. </li></ul><ul><ul><li>IX is not compatible with an S lock on T
  51. 51. if a user has an IX lock on table T, no other user can take an S lock on T
  52. 52. if a user has an S lock on table T, no other user can take X locks on rows in T </li></ul></ul>
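The table-level rules above amount to a small compatibility matrix. A minimal sketch, assuming the standard IS/IX/S/X compatibility table; the `COMPATIBLE` mapping and `can_grant` helper are names made up for this illustration.

```python
from typing import Dict, Iterable, Set

# For each lock mode already held by another user: the modes that may
# still be granted on the same table.
COMPATIBLE: Dict[str, Set[str]] = {
    "IS": {"IS", "IX", "S"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "X": set(),
}


def can_grant(requested: str, held_by_others: Iterable[str]) -> bool:
    """A table lock is granted only if it is compatible with every held lock."""
    return all(requested in COMPATIBLE[held] for held in held_by_others)
```

For example, `can_grant("X", ["IS"])` is false: user A's IS lock, taken before its shared row locks, is what stops user B from locking the whole table in X mode.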
  53. 53. Phantoms vs. Consistency <ul><li>PHANTOM: a row that appears in a second query that was not in the first </li><li>Example: a foreign key check in application code. Check that there are no children with parent id = 10: SELECT * FROM child WHERE parent_id = 10 FOR UPDATE; </li><li>If the SELECT returns 0 rows, then the user thinks he can delete the parent: DELETE FROM parent WHERE id = 10; </li><li>But before the first user COMMITs, another user INSERTs a child with parent_id = 10 … INCONSISTENCY! </li></ul>
  54. 54. Statement-Based Replication Relies on No Phantoms <ul><li>1. User A deletes a row: BEGIN; DELETE FROM t WHERE a = 10; </li><li>2. Before A commits, user B inserts the same row: BEGIN; INSERT INTO t VALUES (10); </li><li>3. User B commits before A: COMMIT; </li><li>4. User A commits: COMMIT; </li><li>5. The MySQL binlog contains B's transaction before A's (B's INSERT, then A's DELETE). We do not know if A deleted the row that B inserted! The slave may get out of sync with the master! </li></ul>
  55. 55. Row-Based Replication Needs Less Locking <ul><li>1. User A deletes a row: BEGIN; DELETE FROM t WHERE a = 10; </li><li>2. Before A commits, user B inserts the same row: BEGIN; INSERT INTO t VALUES (10); </li><li>3. User B commits before A: COMMIT; </li><li>4. User A commits: COMMIT; </li><li>5. The MySQL binlog contains B's transaction before A's (B's INSERT, then A's DELETE). The row-based binlog contains information about each individual row that was deleted or inserted => phantoms are no longer a problem! </li></ul>
  56. 56. InnoDB Avoids Phantoms Through 'Gap Locking' <ul><li>Every SELECT, UPDATE, DELETE in InnoDB uses an index to find the rows to return or operate on: the index is searched, or scanned </li></ul><ul><li>To avoid phantoms, we lock not only the index records we scan, but also the 'gaps' between them, so that no other user can insert new records in the gaps </li></ul><ul><li>Example: an alphabetical index on people's names contains David, Heikki, Ken, Monty. If the query scans the rows between Heikki and Ken, we also lock the 'gap' between those records, so that other users cannot insert 'Jeffrey' in the gap </li></ul>
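The name-index example can be sketched in a few lines. This is a deliberate simplification of the slide's picture: `scan_locks` and `insert_blocked` are hypothetical helpers, and only the gaps between scanned records are locked here, whereas real next-key locking may also cover gaps adjacent to the scanned range.

```python
from bisect import bisect_left, bisect_right
from typing import List, Set, Tuple


def scan_locks(index: List[str], lo: str, hi: str) -> Tuple[Set[str], List[Tuple[str, str]]]:
    """Lock the index records in [lo, hi] and the gaps between them."""
    scanned = index[bisect_left(index, lo):bisect_right(index, hi)]
    gaps = list(zip(scanned, scanned[1:]))  # (left record, right record) pairs
    return set(scanned), gaps


def insert_blocked(key: str, gaps: List[Tuple[str, str]]) -> bool:
    """An insert must wait if the new key would land inside a locked gap."""
    return any(left < key < right for left, right in gaps)
```

With the index David, Heikki, Ken, Monty and a scan from Heikki to Ken, an insert of 'Jeffrey' is blocked while an insert of 'Larry' is not.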
  57. 57. Types of Gap Locking in InnoDB <ul><li>Record-only lock: locks just the key and not the gap </li><li>Gap lock: locks just the gap before the key </li><li>Next-key lock: locks the key and the gap before the key </li><li>Insert-intention gap lock: held when waiting to insert into a gap </li><li>InnoDB minimizes gap locking by using record-only locks in UNIQUE searches, for example: UPDATE t SET a = a + 1 WHERE primary_key_value = 100; </li></ul>
  58. 58. Transaction Isolation Levels: SET {SESSION | GLOBAL} TRANSACTION ISOLATION LEVEL <level>; REPEATABLE READ <ul><li>All SELECTs after the first consistent-read SELECT in a transaction use the same “snapshot”
  59. 59. UPDATE, DELETE use next-key locking
  60. 60. This is the default level </li></ul>SERIALIZABLE <ul><li>All plain SELECTs execute as if they used LOCK IN SHARE MODE
  61. 61. No 'consistent' reads; all SELECTs return the very latest state of the database
  62. 62. Downside: lots of locking, lots of deadlocks. </li></ul>
  63. 63. Transaction Isolation Levels READ COMMITTED SET {SESSION | GLOBAL} TRANSACTION ISOLATION LEVEL <level>; <ul><li>Each SELECT uses its own “snapshot”
  64. 64. Data is “up to date”, but multiple SELECTs may be inconsistent with one another
  65. 65. In MySQL 5.1, most gap locking is removed at this level, but you MUST use row-based logging/replication
  66. 66. Fewer gap locks mean fewer deadlocks
  67. 67. UNIQUE KEY checks in secondary indexes and some FOREIGN KEY checks still need to set gap locks </li></ul><ul><ul><li>Gaps must be locked to prevent inserting child rows after the parent row is deleted </li></ul></ul><ul><li>Many users will move to this isolation level in MySQL 5.1 and later
  68. 68. Use innodb_locks_unsafe_for_binlog to remove gap locking in MySQL-5.0 and earlier </li></ul>
  69. 69. Deadlock Detection & Rollback <ul><li>InnoDB automatically detects a deadlock when it finds a cycle in the “waits-for” graph of transactions (figure: waits-for graph over transactions A, B, C, D) </li></ul><ul><li>Given a deadlock, InnoDB chooses the transaction that modified the fewest rows as the victim, and rolls it back </li></ul><ul><li>Note: InnoDB cannot detect deadlocks that span MySQL storage engines </li></ul><ul><ul><li>Set innodb_lock_wait_timeout in my.cnf to break such deadlocks via timeout (default 50 sec) </li></ul></ul>
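Cycle detection in a waits-for graph, followed by victim selection, can be sketched as below. This is an illustration under assumptions, not InnoDB's implementation: the graph is a plain dict, and `find_cycle`, `choose_victim`, and the `rows_modified` counts are hypothetical.

```python
from typing import Dict, List, Optional, Set


def find_cycle(waits_for: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Depth-first search for a cycle; returns the transactions on it, or None."""
    done: Set[str] = set()  # nodes fully explored without finding a cycle

    def dfs(node: str, path: List[str]) -> Optional[List[str]]:
        if node in path:
            return path[path.index(node):]   # walked back onto our own path
        if node in done:
            return None
        for blocker in sorted(waits_for.get(node, ())):
            cycle = dfs(blocker, path + [node])
            if cycle:
                return cycle
        done.add(node)
        return None

    for start in sorted(waits_for):
        cycle = dfs(start, [])
        if cycle:
            return cycle
    return None


def choose_victim(cycle: List[str], rows_modified: Dict[str, int]) -> str:
    """Pick the transaction on the cycle that modified the fewest rows."""
    return min(cycle, key=lambda trx: rows_modified[trx])
```

Transaction D below waits on the cycle but is not part of it, so it is never considered as a victim: with A waiting for B, B for C, C for A, and D for A, the cycle is A, B, C.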
