Faster, better,
stronger: The new
InnoDB
Marko Mäkelä
Lead Developer InnoDB
MariaDB Corporation
1
Cleanup of Configuration Parameters
● We deprecate & hard-wire a number of parameters, including the following:
○ innodb_buffer_pool_instances=1, innodb_page_cleaners=1
○ innodb_log_files_in_group=1 (only ib_logfile0)
○ innodb_undo_logs=128 (the “rollback segment” component of DB_ROLL_PTR)
● Enterprise Server allows more changes while the server is running:
○ SET GLOBAL innodb_log_file_size=…;
○ SET GLOBAL innodb_purge_threads=…;
MARIADB SERVER 10.5
2
Improvements to the Redo Log
● More compact, extensible redo log record format and faster backup&recovery
○ We now avoid writes of freed pages after DROP (or rebuild) operations.
○ The doublewrite buffer is not used for newly (re)initialized pages.
○ The physical format is easy to parse, thanks to explicitly encoded lengths.
○ Optimized memory management on recovery (or mariabackup --prepare).
● An improved group commit reduces contention and improves scalability
● Compile-time option for persistent memory (such as Intel® Optane™ DC)
MARIADB SERVER 10.5
3
Cleanup of Background Tasks
● Background tasks are run in a thread pool
○ No more merges of buffered changes to secondary indexes in the background
○ In the Enterprise Server, you can SET GLOBAL innodb_purge_threads=…;
● An InnoDB internal rw-lock was all but replaced by metadata locks (MDL)
○ We must ensure that the table not be dropped during an operation.
○ This used to be covered by dict_operation_lock covering any InnoDB table!
○ Acquiring only MDL on the table name ought to improve scalability.
MARIADB SERVER 10.5
4
Improvements to the
Redo Log and Recovery
… while still guaranteeing Atomicity, Consistency, Isolation, Durability
5
A Comparison to the ISO OSI Model
MARIADB SERVER
Low Layers in the OSI Model
● Transport: Retransmission,
flow control (TCP/IP)
● Network: IP, ICMP, UDP, BGP,
DNS, … (router/switch)
● Data link: Packet framing,
checksums
● Physical: Ethernet (CSMA/CD),
WLAN (CSMA/CA), …
● Transaction: ACID, MVCC
● Mini-transaction: Atomic,
Durable multi-page changes
with checksums & recovery
● File system: ext4, XFS, ZFS,
NFS, …
● Storage: HDD, SSD, PMEM, …
InnoDB Engine in MariaDB Server
6
Write Dependencies and ACID
● Log sequence number (LSN) totally orders the output of mini-transactions.
○ The mini-transaction’s atomic change to one or multiple pages is durable if all log up
to the end LSN has been written.
● Undo log pages implement ACID transactions (implicit locks, rollback, MVCC)
○ A user transaction COMMIT is durable if the undo page change is durable.
● Write-ahead logging: Must write log before changed pages, at least up to the
FIL_PAGE_LSN of the changed page that is about to be written
● Log checkpoint: write all changed pages older than the checkpoint LSN
● Recovery will have to process log from the checkpoint LSN to last durable LSN
MARIADB SERVER 10.5
7
Mini-Transactions: Latches and Log
MARIADB SERVER 10.5
8
Mini-Transaction
Memo:
Locks or
Buffer-Fixes
dict_index_t::
lock covers internal
(non-leaf) pages
fil_space_t::
latch: allocating or
freeing pages
Log:
Page Changes
Data Files
FIL_PAGE_LSN
Flush (after log written)
ib_logfile0Log Buffer
log_sys.buf
Write ahead (of page flush) to log
Buffer pool page
buf_page_t::
oldest_modification
commit
A mini-transaction commit stores
the log position (LSN) to each
changed page.
Recovery will apply log if its LSN is
newer than the FIL_PAGE_LSN.
Log position (LSN)
(Almost) Physical Redo Logging
● Original goal: Replace all physio-logical log records with purely physical ones
○ “write N bytes to offset O”, “copy N bytes from offset O to P”.
○ Simple and fast recovery, with no possibility of validation.
● Setback: Any increase of redo log record volume severely hurts performance!
○ Must avoid writing log for updating numerous index page header or footer fields.
● Solution: Special logical records UNDO_APPEND, INSERT, DELETE
○ For these records, recovery will validate the page contents.
MARIADB SERVER 10.5
9
Fewer Writes and Reads of Data Pages
● Page (re)initialization will write an INIT_PAGE record
○ The page will be recovered based on log records only (without reading the page!
○ Page flushing can safely skip the doublewrite buffer.
● Freeing a page will write a FREE_PAGE record
○ Freed pages will not be written back, nor read by crash recovery!
○ Only if scrubbing is enabled, we must discard the garbage contents.
MARIADB SERVER 10.5
10
Stronger, Faster Backup and Recovery
● Recovery (and mariabackup) must parse and buffer all log records
● Recovery (and mariabackup --prepare) must process all log that was
durably written since the last completed log checkpoint LSN
● In MariaDB Server 10.5, thanks to the new log record format makes this faster:
○ Explicitly encoded lengths simplify parsing.
○ Less copying, because a record can never exceed innodb_page_size.
● The recovery of logical INSERT, DELETE includes validation of page contents
○ Corrupted data can be detected more reliably.
MARIADB SERVER 10.5
11
Faster InnoDB Redo Log Writes
● Vladislav Vaintroub introduced group_commit_lock for more efficient
synchronization of redo log writing and flushing.
○ The goal was to reduce CPU consumption on log_write_up_to(), to reduce
spurious wakeups, and improve the throughput in write-intensive benchmarks.
○ Vlad’s extensive benchmarks highlighted (again) that performance is very sensitive to
any increase of the redo log volume. This was why the logical UNDO_APPEND,
INSERT, DELETE records were added.
● Sergey Vojtovich and Eugene Kosov wrote an optonal libpmem interface to
improve performance on Intel® Optane™ DC Persistent Memory Module.
MARIADB SERVER 10.5
12
Other InnoDB Changes
in MariaDB Server 10.5
Better performance by cleaning up maintenance debt
13
More Predictable Change Buffer
● InnoDB aims to avoid read-before-write when it needs to modify a secondary
index B-tree leaf page that is not in the buffer pool.
○ Insert, delete-mark and purge (delete) operations can be written to a change buffer in
the system tablespace, to be merged to the final location later.
● MariaDB Server 10.5 no longer merges buffered changes in the background
○ Change buffer merges can no longer cause hard-to-predict I/O spikes.
○ A corrupted index can only cause trouble when it is being accessed.
● This was joint work with Thirunarayanan Balathandayuthapani.
MARIADB SERVER 10.5
14
Thread Pool for Background Tasks
● Vladislav Vaintroub converted most InnoDB internal threads into tasks
○ The internal thread pool will dynamically create or destroy threads based on the
number of active tasks.
○ Vlad also refactored the InnoDB I/O threads and the asynchronous I/O.
● It is easier to configure the maximum number of tasks of a kind than to
configure the number of threads.
○ In the Enterprise Server, you can SET GLOBAL innodb_purge_threads=…;
MARIADB SERVER 10.5
15
InnoDB data dictionary cleanup
● Thirunarayanan Balathandayuthapani extended the use of metadata locks
(MDL)
○ Background operations must ensure that the table not be dropped.
○ This used to be covered by dict_operation_lock (or dict_sys.latch), which
covers any InnoDB table!
○ It suffices to acquire MDL on the table name.
● In a future release, we hope to remove dict_sys.latch altogether, and to
replace internal transactional table locks with MDL.
MARIADB SERVER 10.5
16
The Buffer Pool is Single Again
● In 2010, the buffer pool was partitioned in order to reduce contention.
○ But many scalability fixes had been implemented since then!
○ Extensive benchmarks by Axel Schwenke showed that a single buffer pool generally
performs best; in one case, 4 buffer pool instances was better.
○ With a single buffer pool, it is easier to employ std::atomic data members.
● We removed some data members from buf_page_t and buf_block_t
altogether, including buf_block_t::mutex.
● Page lookups will no longer acquire buf_pool.mutex at all.
MARIADB SERVER 10.5
17
A Few Words on
Quality
How we Test the InnoDB Storage Engine at MariaDB
18
The Backbone of InnoDB Testing at MariaDB
● A large number of assertions (in debug builds) form a ‘specification’
○ A regression test (mtr) is run on CI systems.
○ Random Query Generator (RQG) and grammar simplifier
■ Many tweaks by Matthias Leich, especially to test crash recovery
● AddressSanitizer (ASAN) with custom memory poisoning
● MemorySanitizer (MSAN), UndefinedBehaviorSanitizer (UBSAN)
● Performance tests are run on non-debug builds to prevent regressions
MARIADB SERVER 10.5
19
Repeatable Execution Traces of Failures
● https://rr-project.org by the Mozilla Foundation records an execution trace that
can be used for deterministic debugging with rr replay
○ Breakpoints and watchpoints will work and can catch data races!
○ Much smaller than core dumps, even though all intermediate states are included.
● Even the most nondeterministic bugs become tractable and fixable.
○ Recovery bugs: need a trace of the killed server and the recovering server.
○ We recently found and fixed several elusive 10-year-old bugs.
● Best of all, rr record can be combined with ASAN and RQG.
MARIADB SERVER 10.5
20
Performance Evaluation
I/O Bound Performance on a 2×16-CPU System
21
OLTP Read-Write (14 reads, 4 writes / trx)
● Disabling the
innodb_adaptive_hash_index
helps in this case
● 10.4.13 is faster than 10.4.12
● 10.5 is clearly better
MARIADB SERVER 10.5
22
OLTP Point Selects and Read-Only (14/trx)
MARIADB SERVER 10.5
23
Summary of the Benchmarks
● MariaDB Server 10.5 significantly improves write performance
○ Thanks to reduced contention on buffer pool and redo log mutexes.
● Some improvement in read workloads until reaching the number of CPUs
● Read workloads show degradation after reaching the number of CPUs
○ We have identified some bottlenecks, such as statistics counters.
○ This is work in progress.
MARIADB SERVER 10.5
24
The Way Ahead
25
Limitations in Current File Formats
● Redo log: 512-byte block size causes copying and mutex contention
○ Block framing forces log records to be split or padded
○ A mutex must be held while copying, padding, encrypting, computing checksums
● Secondary indexes are missing a per-record transaction ID
○ MVCC, purge, and checks for implicit locks could be much simpler and faster
● DB_ROLL_PTR and the undo log format limit us to 128 rollback segments
○ Cannot possibly scale beyond 128 concurrent write transactions
MARIADB SERVER 10.5
26
Future Ideas for Durability and Backup
● Flash-friendly redo log file format, with a separate checkpoint log file
○ Arbitrary block size to minimize write amplification: No padding on /dev/pmem!
○ Encrypt log records and compute checksums while not holding any mutex!
○ mariabackup --prepare could become optional. No need for .delta files!
● Asynchronous COMMIT: send OK packet on write completion
○ Execute next statement(s) without waiting for COMMIT (Idea: Vladislav Vaintroub)
● Complete the InnoDB recovery in the background, while allowing connections.
MARIADB SERVER 10.5
27
Conclusion
● MariaDB Server 10.5 makes better use of the hardware resources
○ Useless or harmful parameters were removed, others made dynamic.
○ Performance and scalability were improved for various types of workloads.
● Performance must never come at the cost of reliability or compatibility
○ Our stress tests are based on some formal methods and state-of-the-art tools.
○ We also test in-place upgrades of existing data files.
● Watch out for more improvements in future releases
MARIADB SERVER 10.5
28

Faster, better, stronger: The new InnoDB

  • 1.
    Faster, better, stronger: Thenew InnoDB Marko Mäkelä Lead Developer InnoDB MariaDB Corporation 1
  • 2.
    Cleanup of ConfigurationParameters ● We deprecate & hard-wire a number of parameters, including the following: ○ innodb_buffer_pool_instances=1, innodb_page_cleaners=1 ○ innodb_log_files_in_group=1 (only ib_logfile0) ○ innodb_undo_logs=128 (the “rollback segment” component of DB_ROLL_PTR) ● Enterprise Server allows more changes while the server is running: ○ SET GLOBAL innodb_log_file_size=…; ○ SET GLOBAL innodb_purge_threads=…; MARIADB SERVER 10.5 2
  • 3.
    Improvements to theRedo Log ● More compact, extensible redo log record format and faster backup&recovery ○ We now avoid writes of freed pages after DROP (or rebuild) operations. ○ The doublewrite buffer is not used for newly (re)initialized pages. ○ The physical format is easy to parse, thanks to explicitly encoded lengths. ○ Optimized memory management on recovery (or mariabackup --prepare). ● An improved group commit reduces contention and improves scalability ● Compile-time option for persistent memory (such as Intel® Optane™ DC) MARIADB SERVER 10.5 3
  • 4.
    Cleanup of BackgroundTasks ● Background tasks are run in a thread pool ○ No more merges of buffered changes to secondary indexes in the background ○ In the Enterprise Server, you can SET GLOBAL innodb_purge_threads=…; ● An InnoDB internal rw-lock was all but replaced by metadata locks (MDL) ○ We must ensure that the table not be dropped during an operation. ○ This used to be covered by dict_operation_lock covering any InnoDB table! ○ Acquiring only MDL on the table name ought to improve scalability. MARIADB SERVER 10.5 4
  • 5.
    Improvements to the RedoLog and Recovery … while still guaranteeing Atomicity, Consistency, Isolation, Durability 5
  • 6.
    A Comparison tothe ISO OSI Model MARIADB SERVER Low Layers in the OSI Model ● Transport: Retransmission, flow control (TCP/IP) ● Network: IP, ICMP, UDP, BGP, DNS, … (router/switch) ● Data link: Packet framing, checksums ● Physical: Ethernet (CSMA/CD), WLAN (CSMA/CA), … ● Transaction: ACID, MVCC ● Mini-transaction: Atomic, Durable multi-page changes with checksums & recovery ● File system: ext4, XFS, ZFS, NFS, … ● Storage: HDD, SSD, PMEM, … InnoDB Engine in MariaDB Server 6
  • 7.
    Write Dependencies andACID ● Log sequence number (LSN) totally orders the output of mini-transactions. ○ The mini-transaction’s atomic change to one or multiple pages is durable if all log up to the end LSN has been written. ● Undo log pages implement ACID transactions (implicit locks, rollback, MVCC) ○ A user transaction COMMIT is durable if the undo page change is durable. ● Write-ahead logging: Must write log before changed pages, at least up to the FIL_PAGE_LSN of the changed page that is about to be written ● Log checkpoint: write all changed pages older than the checkpoint LSN ● Recovery will have to process log from the checkpoint LSN to last durable LSN MARIADB SERVER 10.5 7
  • 8.
    Mini-Transactions: Latches andLog MARIADB SERVER 10.5 8 Mini-Transaction Memo: Locks or Buffer-Fixes dict_index_t:: lock covers internal (non-leaf) pages fil_space_t:: latch: allocating or freeing pages Log: Page Changes Data Files FIL_PAGE_LSN Flush (after log written) ib_logfile0Log Buffer log_sys.buf Write ahead (of page flush) to log Buffer pool page buf_page_t:: oldest_modification commit A mini-transaction commit stores the log position (LSN) to each changed page. Recovery will apply log if its LSN is newer than the FIL_PAGE_LSN. Log position (LSN)
  • 9.
    (Almost) Physical RedoLogging ● Original goal: Replace all physio-logical log records with purely physical ones ○ “write N bytes to offset O”, “copy N bytes from offset O to P”. ○ Simple and fast recovery, with no possibility of validation. ● Setback: Any increase of redo log record volume severely hurts performance! ○ Must avoid writing log for updating numerous index page header or footer fields. ● Solution: Special logical records UNDO_APPEND, INSERT, DELETE ○ For these records, recovery will validate the page contents. MARIADB SERVER 10.5 9
  • 10.
    Fewer Writes andReads of Data Pages ● Page (re)initialization will write an INIT_PAGE record ○ The page will be recovered based on log records only (without reading the page! ○ Page flushing can safely skip the doublewrite buffer. ● Freeing a page will write a FREE_PAGE record ○ Freed pages will not be written back, nor read by crash recovery! ○ Only if scrubbing is enabled, we must discard the garbage contents. MARIADB SERVER 10.5 10
  • 11.
    Stronger, Faster Backupand Recovery ● Recovery (and mariabackup) must parse and buffer all log records ● Recovery (and mariabackup --prepare) must process all log that was durably written since the last completed log checkpoint LSN ● In MariaDB Server 10.5, thanks to the new log record format makes this faster: ○ Explicitly encoded lengths simplify parsing. ○ Less copying, because a record can never exceed innodb_page_size. ● The recovery of logical INSERT, DELETE includes validation of page contents ○ Corrupted data can be detected more reliably. MARIADB SERVER 10.5 11
  • 12.
    Faster InnoDB RedoLog Writes ● Vladislav Vaintroub introduced group_commit_lock for more efficient synchronization of redo log writing and flushing. ○ The goal was to reduce CPU consumption on log_write_up_to(), to reduce spurious wakeups, and improve the throughput in write-intensive benchmarks. ○ Vlad’s extensive benchmarks highlighted (again) that performance is very sensitive to any increase of the redo log volume. This was why the logical UNDO_APPEND, INSERT, DELETE records were added. ● Sergey Vojtovich and Eugene Kosov wrote an optonal libpmem interface to improve performance on Intel® Optane™ DC Persistent Memory Module. MARIADB SERVER 10.5 12
  • 13.
    Other InnoDB Changes inMariaDB Server 10.5 Better performance by cleaning up maintenance debt 13
  • 14.
    More Predictable ChangeBuffer ● InnoDB aims to avoid read-before-write when it needs to modify a secondary index B-tree leaf page that is not in the buffer pool. ○ Insert, delete-mark and purge (delete) operations can be written to a change buffer in the system tablespace, to be merged to the final location later. ● MariaDB Server 10.5 no longer merges buffered changes in the background ○ Change buffer merges can no longer cause hard-to-predict I/O spikes. ○ A corrupted index can only cause trouble when it is being accessed. ● This was joint work with Thirunarayanan Balathandayuthapani. MARIADB SERVER 10.5 14
  • 15.
    Thread Pool forBackground Tasks ● Vladislav Vaintroub converted most InnoDB internal threads into tasks ○ The internal thread pool will dynamically create or destroy threads based on the number of active tasks. ○ Vlad also refactored the InnoDB I/O threads and the asynchronous I/O. ● It is easier to configure the maximum number of tasks of a kind than to configure the number of threads. ○ In the Enterprise Server, you can SET GLOBAL innodb_purge_threads=…; MARIADB SERVER 10.5 15
  • 16.
    InnoDB data dictionarycleanup ● Thirunarayanan Balathandayuthapani extended the use of metadata locks (MDL) ○ Background operations must ensure that the table not be dropped. ○ This used to be covered by dict_operation_lock (or dict_sys.latch), which covers any InnoDB table! ○ It suffices to acquire MDL on the table name. ● In a future release, we hope to remove dict_sys.latch altogether, and to replace internal transactional table locks with MDL. MARIADB SERVER 10.5 16
  • 17.
    The Buffer Poolis Single Again ● In 2010, the buffer pool was partitioned in order to reduce contention. ○ But many scalability fixes had been implemented since then! ○ Extensive benchmarks by Axel Schwenke showed that a single buffer pool generally performs best; in one case, 4 buffer pool instances was better. ○ With a single buffer pool, it is easier to employ std::atomic data members. ● We removed some data members from buf_page_t and buf_block_t altogether, including buf_block_t::mutex. ● Page lookups will no longer acquire buf_pool.mutex at all. MARIADB SERVER 10.5 17
  • 18.
    A Few Wordson Quality How we Test the InnoDB Storage Engine at MariaDB 18
  • 19.
    The Backbone ofInnoDB Testing at MariaDB ● A large number of assertions (in debug builds) form a ‘specification’ ○ A regression test (mtr) is run on CI systems. ○ Random Query Generator (RQG) and grammar simplifier ■ Many tweaks by Matthias Leich, especially to test crash recovery ● AddressSanitizer (ASAN) with custom memory poisoning ● MemorySanitizer (MSAN), UndefinedBehaviorSanitizer (UBSAN) ● Performance tests are run on non-debug builds to prevent regressions MARIADB SERVER 10.5 19
  • 20.
    Repeatable Execution Tracesof Failures ● https://rr-project.org by the Mozilla Foundation records an execution trace that can be used for deterministic debugging with rr replay ○ Breakpoints and watchpoints will work and can catch data races! ○ Much smaller than core dumps, even though all intermediate states are included. ● Even the most nondeterministic bugs become tractable and fixable. ○ Recovery bugs: need a trace of the killed server and the recovering server. ○ We recently found and fixed several elusive 10-year-old bugs. ● Best of all, rr record can be combined with ASAN and RQG. MARIADB SERVER 10.5 20
  • 21.
    Performance Evaluation I/O BoundPerformance on a 2×16-CPU System 21
  • 22.
    OLTP Read-Write (14reads, 4 writes / trx) ● Disabling the innodb_adaptive_hash_index helps in this case ● 10.4.13 is faster than 10.4.12 ● 10.5 is clearly better MARIADB SERVER 10.5 22
  • 23.
    OLTP Point Selectsand Read-Only (14/trx) MARIADB SERVER 10.5 23
  • 24.
    Summary of theBenchmarks ● MariaDB Server 10.5 significantly improves write performance ○ Thanks to reduced contention on buffer pool and redo log mutexes. ● Some improvement in read workloads until reaching the number of CPUs ● Read workloads show degradation after reaching the number of CPUs ○ We have identified some bottlenecks, such as statistics counters. ○ This is work in progress. MARIADB SERVER 10.5 24
  • 25.
  • 26.
    Limitations in CurrentFile Formats ● Redo log: 512-byte block size causes copying and mutex contention ○ Block framing forces log records to be split or padded ○ A mutex must be held while copying, padding, encrypting, computing checksums ● Secondary indexes are missing a per-record transaction ID ○ MVCC, purge, and checks for implicit locks could be much simpler and faster ● DB_ROLL_PTR and the undo log format limit us to 128 rollback segments ○ Cannot possibly scale beyond 128 concurrent write transactions MARIADB SERVER 10.5 26
  • 27.
    Future Ideas forDurability and Backup ● Flash-friendly redo log file format, with a separate checkpoint log file ○ Arbitrary block size to minimize write amplification: No padding on /dev/pmem! ○ Encrypt log records and compute checksums while not holding any mutex! ○ mariabackup --prepare could become optional. No need for .delta files! ● Asynchronous COMMIT: send OK packet on write completion ○ Execute next statement(s) without waiting for COMMIT (Idea: Vladislav Vaintroub) ● Complete the InnoDB recovery in the background, while allowing connections. MARIADB SERVER 10.5 27
  • 28.
    Conclusion ● MariaDB Server10.5 makes better use of the hardware resources ○ Useless or harmful parameters were removed, others made dynamic. ○ Performance and scalability were improved for various types of workloads. ● Performance must never come at the cost of reliability or compatibility ○ Our stress tests are based on some formal methods and state-of-the-art tools. ○ We also test in-place upgrades of existing data files. ● Watch out for more improvements in future releases MARIADB SERVER 10.5 28