Faster, better, stronger: The new InnoDB

Faster, better,
stronger: The new
InnoDB
Marko Mäkelä
Lead Developer InnoDB
MariaDB Corporation
1

Cleanup of Configuration Parameters
● We deprecate & hard-wire a number of parameters, including the following:
○ innodb_buffer_pool_instances=1, innodb_page_cleaners=1
○ innodb_log_files_in_group=1 (only ib_logfile0)
○ innodb_undo_logs=128 (the “rollback segment” component of DB_ROLL_PTR)
● Enterprise Server allows more changes while the server is running:
○ SET GLOBAL innodb_log_file_size=…;
○ SET GLOBAL innodb_purge_threads=…;
MARIADB SERVER 10.5
2

Improvements to the Redo Log
● More compact, extensible redo log record format and faster backup&recovery
○ We now avoid writes of freed pages after DROP (or rebuild) operations.
○ The doublewrite buffer is not used for newly (re)initialized pages.
○ The physical format is easy to parse, thanks to explicitly encoded lengths.
○ Optimized memory management on recovery (or mariabackup --prepare).
● An improved group commit reduces contention and improves scalability
● Compile-time option for persistent memory (such as Intel® Optane™ DC)
MARIADB SERVER 10.5
3

Cleanup of Background Tasks
● Background tasks are run in a thread pool
○ No more merges of buffered changes to secondary indexes in the background
○ In the Enterprise Server, you can SET GLOBAL innodb_purge_threads=…;
● An InnoDB internal rw-lock was all but replaced by metadata locks (MDL)
○ We must ensure that the table not be dropped during an operation.
○ This used to be covered by dict_operation_lock covering any InnoDB table!
○ Acquiring only MDL on the table name ought to improve scalability.
MARIADB SERVER 10.5
4

Improvements to the
Redo Log and Recovery
… while still guaranteeing Atomicity, Consistency, Isolation, Durability
5

A Comparison to the ISO OSI Model
MARIADB SERVER
Low Layers in the OSI Model
● Transport: Retransmission,
flow control (TCP/IP)
● Network: IP, ICMP, UDP, BGP,
DNS, … (router/switch)
● Data link: Packet framing,
checksums
● Physical: Ethernet (CSMA/CD),
WLAN (CSMA/CA), …
● Transaction: ACID, MVCC
● Mini-transaction: Atomic,
Durable multi-page changes
with checksums & recovery
● File system: ext4, XFS, ZFS,
NFS, …
● Storage: HDD, SSD, PMEM, …
InnoDB Engine in MariaDB Server
6

Write Dependencies and ACID
● Log sequence number (LSN) totally orders the output of mini-transactions.
○ The mini-transaction’s atomic change to one or multiple pages is durable if all log up
to the end LSN has been written.
● Undo log pages implement ACID transactions (implicit locks, rollback, MVCC)
○ A user transaction COMMIT is durable if the undo page change is durable.
● Write-ahead logging: Must write log before changed pages, at least up to the
FIL_PAGE_LSN of the changed page that is about to be written
● Log checkpoint: write all changed pages older than the checkpoint LSN
● Recovery will have to process log from the checkpoint LSN to last durable LSN
MARIADB SERVER 10.5
7

Mini-Transactions: Latches and Log
MARIADB SERVER 10.5
8
Mini-Transaction
Memo:
Locks or
Buffer-Fixes
dict_index_t::
lock covers internal
(non-leaf) pages
fil_space_t::
latch: allocating or
freeing pages
Log:
Page Changes
Data Files
FIL_PAGE_LSN
Flush (after log written)
ib_logfile0Log Buffer
log_sys.buf
Write ahead (of page flush) to log
Buffer pool page
buf_page_t::
oldest_modification
commit
A mini-transaction commit stores
the log position (LSN) to each
changed page.
Recovery will apply log if its LSN is
newer than the FIL_PAGE_LSN.
Log position (LSN)

(Almost) Physical Redo Logging
● Original goal: Replace all physio-logical log records with purely physical ones
○ “write N bytes to offset O”, “copy N bytes from offset O to P”.
○ Simple and fast recovery, with no possibility of validation.
● Setback: Any increase of redo log record volume severely hurts performance!
○ Must avoid writing log for updating numerous index page header or footer fields.
● Solution: Special logical records UNDO_APPEND, INSERT, DELETE
○ For these records, recovery will validate the page contents.
MARIADB SERVER 10.5
9

Fewer Writes and Reads of Data Pages
● Page (re)initialization will write an INIT_PAGE record
○ The page will be recovered based on log records only (without reading the page!
○ Page flushing can safely skip the doublewrite buffer.
● Freeing a page will write a FREE_PAGE record
○ Freed pages will not be written back, nor read by crash recovery!
○ Only if scrubbing is enabled, we must discard the garbage contents.
MARIADB SERVER 10.5
10

Stronger, Faster Backup and Recovery
● Recovery (and mariabackup) must parse and buffer all log records
● Recovery (and mariabackup --prepare) must process all log that was
durably written since the last completed log checkpoint LSN
● In MariaDB Server 10.5, thanks to the new log record format makes this faster:
○ Explicitly encoded lengths simplify parsing.
○ Less copying, because a record can never exceed innodb_page_size.
● The recovery of logical INSERT, DELETE includes validation of page contents
○ Corrupted data can be detected more reliably.
MARIADB SERVER 10.5
11

Faster InnoDB Redo Log Writes
● Vladislav Vaintroub introduced group_commit_lock for more efficient
synchronization of redo log writing and flushing.
○ The goal was to reduce CPU consumption on log_write_up_to(), to reduce
spurious wakeups, and improve the throughput in write-intensive benchmarks.
○ Vlad’s extensive benchmarks highlighted (again) that performance is very sensitive to
any increase of the redo log volume. This was why the logical UNDO_APPEND,
INSERT, DELETE records were added.
● Sergey Vojtovich and Eugene Kosov wrote an optonal libpmem interface to
improve performance on Intel® Optane™ DC Persistent Memory Module.
MARIADB SERVER 10.5
12

Other InnoDB Changes
in MariaDB Server 10.5
Better performance by cleaning up maintenance debt
13

More Predictable Change Buffer
● InnoDB aims to avoid read-before-write when it needs to modify a secondary
index B-tree leaf page that is not in the buffer pool.
○ Insert, delete-mark and purge (delete) operations can be written to a change buffer in
the system tablespace, to be merged to the final location later.
● MariaDB Server 10.5 no longer merges buffered changes in the background
○ Change buffer merges can no longer cause hard-to-predict I/O spikes.
○ A corrupted index can only cause trouble when it is being accessed.
● This was joint work with Thirunarayanan Balathandayuthapani.
MARIADB SERVER 10.5
14

Thread Pool for Background Tasks
● Vladislav Vaintroub converted most InnoDB internal threads into tasks
○ The internal thread pool will dynamically create or destroy threads based on the
number of active tasks.
○ Vlad also refactored the InnoDB I/O threads and the asynchronous I/O.
● It is easier to configure the maximum number of tasks of a kind than to
configure the number of threads.
○ In the Enterprise Server, you can SET GLOBAL innodb_purge_threads=…;
MARIADB SERVER 10.5
15

InnoDB data dictionary cleanup
● Thirunarayanan Balathandayuthapani extended the use of metadata locks
(MDL)
○ Background operations must ensure that the table not be dropped.
○ This used to be covered by dict_operation_lock (or dict_sys.latch), which
covers any InnoDB table!
○ It suffices to acquire MDL on the table name.
● In a future release, we hope to remove dict_sys.latch altogether, and to
replace internal transactional table locks with MDL.
MARIADB SERVER 10.5
16

The Buffer Pool is Single Again
● In 2010, the buffer pool was partitioned in order to reduce contention.
○ But many scalability fixes had been implemented since then!
○ Extensive benchmarks by Axel Schwenke showed that a single buffer pool generally
performs best; in one case, 4 buffer pool instances was better.
○ With a single buffer pool, it is easier to employ std::atomic data members.
● We removed some data members from buf_page_t and buf_block_t
altogether, including buf_block_t::mutex.
● Page lookups will no longer acquire buf_pool.mutex at all.
MARIADB SERVER 10.5
17

A Few Words on
Quality
How we Test the InnoDB Storage Engine at MariaDB
18

The Backbone of InnoDB Testing at MariaDB
● A large number of assertions (in debug builds) form a ‘specification’
○ A regression test (mtr) is run on CI systems.
○ Random Query Generator (RQG) and grammar simplifier
■ Many tweaks by Matthias Leich, especially to test crash recovery
● AddressSanitizer (ASAN) with custom memory poisoning
● MemorySanitizer (MSAN), UndefinedBehaviorSanitizer (UBSAN)
● Performance tests are run on non-debug builds to prevent regressions
MARIADB SERVER 10.5
19

Repeatable Execution Traces of Failures
● https://rr-project.org by the Mozilla Foundation records an execution trace that
can be used for deterministic debugging with rr replay
○ Breakpoints and watchpoints will work and can catch data races!
○ Much smaller than core dumps, even though all intermediate states are included.
● Even the most nondeterministic bugs become tractable and fixable.
○ Recovery bugs: need a trace of the killed server and the recovering server.
○ We recently found and fixed several elusive 10-year-old bugs.
● Best of all, rr record can be combined with ASAN and RQG.
MARIADB SERVER 10.5
20

Performance Evaluation
I/O Bound Performance on a 2×16-CPU System
21

OLTP Read-Write (14 reads, 4 writes / trx)
● Disabling the
innodb_adaptive_hash_index
helps in this case
● 10.4.13 is faster than 10.4.12
● 10.5 is clearly better
MARIADB SERVER 10.5
22

OLTP Point Selects and Read-Only (14/trx)
MARIADB SERVER 10.5
23

Summary of the Benchmarks
● MariaDB Server 10.5 significantly improves write performance
○ Thanks to reduced contention on buffer pool and redo log mutexes.
● Some improvement in read workloads until reaching the number of CPUs
● Read workloads show degradation after reaching the number of CPUs
○ We have identified some bottlenecks, such as statistics counters.
○ This is work in progress.
MARIADB SERVER 10.5
24

Limitations in Current File Formats
● Redo log: 512-byte block size causes copying and mutex contention
○ Block framing forces log records to be split or padded
○ A mutex must be held while copying, padding, encrypting, computing checksums
● Secondary indexes are missing a per-record transaction ID
○ MVCC, purge, and checks for implicit locks could be much simpler and faster
● DB_ROLL_PTR and the undo log format limit us to 128 rollback segments
○ Cannot possibly scale beyond 128 concurrent write transactions
MARIADB SERVER 10.5
26

Future Ideas for Durability and Backup
● Flash-friendly redo log file format, with a separate checkpoint log file
○ Arbitrary block size to minimize write amplification: No padding on /dev/pmem!
○ Encrypt log records and compute checksums while not holding any mutex!
○ mariabackup --prepare could become optional. No need for .delta files!
● Asynchronous COMMIT: send OK packet on write completion
○ Execute next statement(s) without waiting for COMMIT (Idea: Vladislav Vaintroub)
● Complete the InnoDB recovery in the background, while allowing connections.
MARIADB SERVER 10.5
27

Conclusion
● MariaDB Server 10.5 makes better use of the hardware resources
○ Useless or harmful parameters were removed, others made dynamic.
○ Performance and scalability were improved for various types of workloads.
● Performance must never come at the cost of reliability or compatibility
○ Our stress tests are based on some formal methods and state-of-the-art tools.
○ We also test in-place upgrades of existing data files.
● Watch out for more improvements in future releases
MARIADB SERVER 10.5
28

Faster, better, stronger: The new InnoDB

More Related Content

What's hot

Similar to Faster, better, stronger: The new InnoDB

More from MariaDB plc

Recently uploaded

Faster, better, stronger: The new InnoDB