Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Faster, better, stronger: The new InnoDB

For MariaDB Enterprise Server 10.5, the default transactional storage engine, InnoDB, has been significantly rewritten to improve the performance of writes and backups. Next, we removed a number of parameters to reduce unnecessary complexity, not only in terms of configuration but of the code itself. And finally, we improved crash recovery thanks to better consistency checks and we reduced memory consumption and file I/O thanks to an all new log record format.

In this session, we’ll walk through all of the improvements to InnoDB, and dive deep into the implementation to explain how these improvements help everything from configuration and performance to reliability and recovery.

  • Be the first to comment

Faster, better, stronger: The new InnoDB

  1. 1. Faster, better, stronger: The new InnoDB Marko Mäkelä Lead Developer InnoDB MariaDB Corporation 1
  2. 2. Cleanup of Configuration Parameters ● We deprecate & hard-wire a number of parameters, including the following: ○ innodb_buffer_pool_instances=1, innodb_page_cleaners=1 ○ innodb_log_files_in_group=1 (only ib_logfile0) ○ innodb_undo_logs=128 (the “rollback segment” component of DB_ROLL_PTR) ● Enterprise Server allows more changes while the server is running: ○ SET GLOBAL innodb_log_file_size=…; ○ SET GLOBAL innodb_purge_threads=…; MARIADB SERVER 10.5 2
  3. 3. Improvements to the Redo Log ● More compact, extensible redo log record format and faster backup&recovery ○ We now avoid writes of freed pages after DROP (or rebuild) operations. ○ The doublewrite buffer is not used for newly (re)initialized pages. ○ The physical format is easy to parse, thanks to explicitly encoded lengths. ○ Optimized memory management on recovery (or mariabackup --prepare). ● An improved group commit reduces contention and improves scalability ● Compile-time option for persistent memory (such as Intel® Optane™ DC) MARIADB SERVER 10.5 3
  4. 4. Cleanup of Background Tasks ● Background tasks are run in a thread pool ○ No more merges of buffered changes to secondary indexes in the background ○ In the Enterprise Server, you can SET GLOBAL innodb_purge_threads=…; ● An InnoDB internal rw-lock was all but replaced by metadata locks (MDL) ○ We must ensure that the table not be dropped during an operation. ○ This used to be covered by dict_operation_lock covering any InnoDB table! ○ Acquiring only MDL on the table name ought to improve scalability. MARIADB SERVER 10.5 4
  5. 5. Improvements to the Redo Log and Recovery … while still guaranteeing Atomicity, Consistency, Isolation, Durability 5
  6. 6. A Comparison to the ISO OSI Model MARIADB SERVER Low Layers in the OSI Model ● Transport: Retransmission, flow control (TCP/IP) ● Network: IP, ICMP, UDP, BGP, DNS, … (router/switch) ● Data link: Packet framing, checksums ● Physical: Ethernet (CSMA/CD), WLAN (CSMA/CA), … ● Transaction: ACID, MVCC ● Mini-transaction: Atomic, Durable multi-page changes with checksums & recovery ● File system: ext4, XFS, ZFS, NFS, … ● Storage: HDD, SSD, PMEM, … InnoDB Engine in MariaDB Server 6
  7. 7. Write Dependencies and ACID ● Log sequence number (LSN) totally orders the output of mini-transactions. ○ The mini-transaction’s atomic change to one or multiple pages is durable if all log up to the end LSN has been written. ● Undo log pages implement ACID transactions (implicit locks, rollback, MVCC) ○ A user transaction COMMIT is durable if the undo page change is durable. ● Write-ahead logging: Must write log before changed pages, at least up to the FIL_PAGE_LSN of the changed page that is about to be written ● Log checkpoint: write all changed pages older than the checkpoint LSN ● Recovery will have to process log from the checkpoint LSN to last durable LSN MARIADB SERVER 10.5 7
  8. 8. Mini-Transactions: Latches and Log MARIADB SERVER 10.5 8 Mini-Transaction Memo: Locks or Buffer-Fixes dict_index_t:: lock covers internal (non-leaf) pages fil_space_t:: latch: allocating or freeing pages Log: Page Changes Data Files FIL_PAGE_LSN Flush (after log written) ib_logfile0Log Buffer log_sys.buf Write ahead (of page flush) to log Buffer pool page buf_page_t:: oldest_modification commit A mini-transaction commit stores the log position (LSN) to each changed page. Recovery will apply log if its LSN is newer than the FIL_PAGE_LSN. Log position (LSN)
  9. 9. (Almost) Physical Redo Logging ● Original goal: Replace all physio-logical log records with purely physical ones ○ “write N bytes to offset O”, “copy N bytes from offset O to P”. ○ Simple and fast recovery, with no possibility of validation. ● Setback: Any increase of redo log record volume severely hurts performance! ○ Must avoid writing log for updating numerous index page header or footer fields. ● Solution: Special logical records UNDO_APPEND, INSERT, DELETE ○ For these records, recovery will validate the page contents. MARIADB SERVER 10.5 9
  10. 10. Fewer Writes and Reads of Data Pages ● Page (re)initialization will write an INIT_PAGE record ○ The page will be recovered based on log records only (without reading the page! ○ Page flushing can safely skip the doublewrite buffer. ● Freeing a page will write a FREE_PAGE record ○ Freed pages will not be written back, nor read by crash recovery! ○ Only if scrubbing is enabled, we must discard the garbage contents. MARIADB SERVER 10.5 10
  11. 11. Stronger, Faster Backup and Recovery ● Recovery (and mariabackup) must parse and buffer all log records ● Recovery (and mariabackup --prepare) must process all log that was durably written since the last completed log checkpoint LSN ● In MariaDB Server 10.5, thanks to the new log record format makes this faster: ○ Explicitly encoded lengths simplify parsing. ○ Less copying, because a record can never exceed innodb_page_size. ● The recovery of logical INSERT, DELETE includes validation of page contents ○ Corrupted data can be detected more reliably. MARIADB SERVER 10.5 11
  12. 12. Faster InnoDB Redo Log Writes ● Vladislav Vaintroub introduced group_commit_lock for more efficient synchronization of redo log writing and flushing. ○ The goal was to reduce CPU consumption on log_write_up_to(), to reduce spurious wakeups, and improve the throughput in write-intensive benchmarks. ○ Vlad’s extensive benchmarks highlighted (again) that performance is very sensitive to any increase of the redo log volume. This was why the logical UNDO_APPEND, INSERT, DELETE records were added. ● Sergey Vojtovich and Eugene Kosov wrote an optonal libpmem interface to improve performance on Intel® Optane™ DC Persistent Memory Module. MARIADB SERVER 10.5 12
  13. 13. Other InnoDB Changes in MariaDB Server 10.5 Better performance by cleaning up maintenance debt 13
  14. 14. More Predictable Change Buffer ● InnoDB aims to avoid read-before-write when it needs to modify a secondary index B-tree leaf page that is not in the buffer pool. ○ Insert, delete-mark and purge (delete) operations can be written to a change buffer in the system tablespace, to be merged to the final location later. ● MariaDB Server 10.5 no longer merges buffered changes in the background ○ Change buffer merges can no longer cause hard-to-predict I/O spikes. ○ A corrupted index can only cause trouble when it is being accessed. ● This was joint work with Thirunarayanan Balathandayuthapani. MARIADB SERVER 10.5 14
  15. 15. Thread Pool for Background Tasks ● Vladislav Vaintroub converted most InnoDB internal threads into tasks ○ The internal thread pool will dynamically create or destroy threads based on the number of active tasks. ○ Vlad also refactored the InnoDB I/O threads and the asynchronous I/O. ● It is easier to configure the maximum number of tasks of a kind than to configure the number of threads. ○ In the Enterprise Server, you can SET GLOBAL innodb_purge_threads=…; MARIADB SERVER 10.5 15
  16. 16. InnoDB data dictionary cleanup ● Thirunarayanan Balathandayuthapani extended the use of metadata locks (MDL) ○ Background operations must ensure that the table not be dropped. ○ This used to be covered by dict_operation_lock (or dict_sys.latch), which covers any InnoDB table! ○ It suffices to acquire MDL on the table name. ● In a future release, we hope to remove dict_sys.latch altogether, and to replace internal transactional table locks with MDL. MARIADB SERVER 10.5 16
  17. 17. The Buffer Pool is Single Again ● In 2010, the buffer pool was partitioned in order to reduce contention. ○ But many scalability fixes had been implemented since then! ○ Extensive benchmarks by Axel Schwenke showed that a single buffer pool generally performs best; in one case, 4 buffer pool instances was better. ○ With a single buffer pool, it is easier to employ std::atomic data members. ● We removed some data members from buf_page_t and buf_block_t altogether, including buf_block_t::mutex. ● Page lookups will no longer acquire buf_pool.mutex at all. MARIADB SERVER 10.5 17
  18. 18. A Few Words on Quality How we Test the InnoDB Storage Engine at MariaDB 18
  19. 19. The Backbone of InnoDB Testing at MariaDB ● A large number of assertions (in debug builds) form a ‘specification’ ○ A regression test (mtr) is run on CI systems. ○ Random Query Generator (RQG) and grammar simplifier ■ Many tweaks by Matthias Leich, especially to test crash recovery ● AddressSanitizer (ASAN) with custom memory poisoning ● MemorySanitizer (MSAN), UndefinedBehaviorSanitizer (UBSAN) ● Performance tests are run on non-debug builds to prevent regressions MARIADB SERVER 10.5 19
  20. 20. Repeatable Execution Traces of Failures ● by the Mozilla Foundation records an execution trace that can be used for deterministic debugging with rr replay ○ Breakpoints and watchpoints will work and can catch data races! ○ Much smaller than core dumps, even though all intermediate states are included. ● Even the most nondeterministic bugs become tractable and fixable. ○ Recovery bugs: need a trace of the killed server and the recovering server. ○ We recently found and fixed several elusive 10-year-old bugs. ● Best of all, rr record can be combined with ASAN and RQG. MARIADB SERVER 10.5 20
  21. 21. Performance Evaluation I/O Bound Performance on a 2×16-CPU System 21
  22. 22. OLTP Read-Write (14 reads, 4 writes / trx) ● Disabling the innodb_adaptive_hash_index helps in this case ● 10.4.13 is faster than 10.4.12 ● 10.5 is clearly better MARIADB SERVER 10.5 22
  23. 23. OLTP Point Selects and Read-Only (14/trx) MARIADB SERVER 10.5 23
  24. 24. Summary of the Benchmarks ● MariaDB Server 10.5 significantly improves write performance ○ Thanks to reduced contention on buffer pool and redo log mutexes. ● Some improvement in read workloads until reaching the number of CPUs ● Read workloads show degradation after reaching the number of CPUs ○ We have identified some bottlenecks, such as statistics counters. ○ This is work in progress. MARIADB SERVER 10.5 24
  25. 25. The Way Ahead 25
  26. 26. Limitations in Current File Formats ● Redo log: 512-byte block size causes copying and mutex contention ○ Block framing forces log records to be split or padded ○ A mutex must be held while copying, padding, encrypting, computing checksums ● Secondary indexes are missing a per-record transaction ID ○ MVCC, purge, and checks for implicit locks could be much simpler and faster ● DB_ROLL_PTR and the undo log format limit us to 128 rollback segments ○ Cannot possibly scale beyond 128 concurrent write transactions MARIADB SERVER 10.5 26
  27. 27. Future Ideas for Durability and Backup ● Flash-friendly redo log file format, with a separate checkpoint log file ○ Arbitrary block size to minimize write amplification: No padding on /dev/pmem! ○ Encrypt log records and compute checksums while not holding any mutex! ○ mariabackup --prepare could become optional. No need for .delta files! ● Asynchronous COMMIT: send OK packet on write completion ○ Execute next statement(s) without waiting for COMMIT (Idea: Vladislav Vaintroub) ● Complete the InnoDB recovery in the background, while allowing connections. MARIADB SERVER 10.5 27
  28. 28. Conclusion ● MariaDB Server 10.5 makes better use of the hardware resources ○ Useless or harmful parameters were removed, others made dynamic. ○ Performance and scalability were improved for various types of workloads. ● Performance must never come at the cost of reliability or compatibility ○ Our stress tests are based on some formal methods and state-of-the-art tools. ○ We also test in-place upgrades of existing data files. ● Watch out for more improvements in future releases MARIADB SERVER 10.5 28