Redo log

New design for mini transactions and redo log (MySQL 8 / InnoDB), optimized for workloads with high concurrency.

Published in: Technology

  1. 1. Safe Harbor Slide
     The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, timing, and pricing of any features or functionality described for Oracle's products may change and remains at the sole discretion of Oracle Corporation.
  2. 2. Mini transactions
     When are they used? InnoDB stores all data in pages (16 kB by default), and every change to these pages goes through a mini transaction, so mini transactions are used extremely often. A single user transaction consists of multiple mini transactions, and committing the transaction itself requires a new mini transaction (which modifies undo log pages).
     What are they for?
       • Allow atomic changes to multiple pages
       • Postpone writes of re-modified pages to disk
       • Write only a log of the changes applied to pages
  3. 3. Mini transaction commit in MySQL 5.7
     (1) Reserve place and space in the redo log
     (2) Write log records to the log buffer
     (3) Mark modified pages, add them to flush lists and release latches
     ACQUIRE / DO WORK / RELEASE
     Mutex exchange: log_sys → log_flush_order caused a performance issue when the first thread started to wait for the log_flush_order mutex while still holding the log_sys mutex.
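
The mutex exchange in the 5.7 commit path can be illustrated with a minimal sketch. It assumes two plain locks named after the mutexes on the slide; `mtr_commit_57`, the list-based log buffer and the flush list are hypothetical stand-ins, not InnoDB code.

```python
import threading

log_sys_mutex = threading.Lock()          # guards LSN reservation and the log buffer
log_flush_order_mutex = threading.Lock()  # guards the order of the flush list

next_lsn = 0
log_buffer = []   # (start_lsn, payload) records
flush_list = []   # start LSNs of dirtied pages, kept in LSN order


def mtr_commit_57(payload: bytes) -> int:
    """Commit one mini transaction following the three steps on the slide."""
    global next_lsn
    log_sys_mutex.acquire()
    start_lsn = next_lsn                     # (1) reserve place in the LSN sequence
    next_lsn += len(payload)
    log_buffer.append((start_lsn, payload))  # (2) write records to the log buffer
    # Mutex exchange: take log_flush_order BEFORE releasing log_sys, so the
    # flush list is filled in LSN order. A thread that blocks here stalls all
    # other committers, because it still holds log_sys.
    log_flush_order_mutex.acquire()
    log_sys_mutex.release()
    flush_list.append(start_lsn)             # (3) add dirty pages to the flush list
    log_flush_order_mutex.release()
    return start_lsn


threads = [threading.Thread(target=mtr_commit_57, args=(b"x" * 10,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The flush list ends up strictly ordered by LSN, which is exactly what the exchange buys, at the cost of serializing every committer behind whoever waits for the second mutex.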
  4. 4. New design in 8.0.5+
     (1) Reserve place and space in the redo log
     (2a) Write log records to the log buffer
     (2b) Report written
     (3a) Mark modified pages, add them to flush lists and release latches
     (3b) Report done
  5. 5. Comparison of mtr_commit in 5.7 vs in 8.0.5
  6. 6. Tracking concurrent operations
     1. The LSN sequence defines the timeline for recovery.
     2. The stages of a mini transaction commit are executed concurrently, and threads may interleave.
     3. Threads concurrently report finished operations to a new lock-free data structure.
     4. For each stage, the data structure tracks the LSN up to which all operations have been reported as finished.
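
A minimal sketch of the bookkeeping behind point 4 follows. The slides describe a lock-free structure; this version is deliberately single-threaded and uses a plain dict, and the names `DoneTracker`, `tail` and `report_done` are assumptions made for illustration only.

```python
class DoneTracker:
    """Tracks the LSN up to which every operation has been reported finished."""

    def __init__(self) -> None:
        self.tail = 0      # all operations with LSN below tail are finished
        self._links = {}   # start_lsn -> end_lsn of ranges finished out of order

    def report_done(self, start_lsn: int, end_lsn: int) -> None:
        self._links[start_lsn] = end_lsn
        # Advance over every finished range that now touches the tail.
        while self.tail in self._links:
            self.tail = self._links.pop(self.tail)


tracker = DoneTracker()
tracker.report_done(100, 150)   # finished early, but [0, 100) is still pending
after_gap = tracker.tail
tracker.report_done(0, 100)     # gap closed: the tail jumps over both ranges
after_fill = tracker.tail
```

One such tracker per stage (written, done) would give the per-stage LSN that the slide mentions.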
  7. 7. Tracking concurrent operations
     Diagram labels: All past tasks done; Pending tasks (in progress); Limited window for pending operations (L); Wait (unlikely)
     1. The window of pending operations is limited to L bytes of the LSN sequence (1 MB).
     2. Before adding a dirty page to a flush list, wait until its oldest_lsn fits in the current window.
     3. This guarantees that checkpoint_lsn can be written at oldest_lsn - L.
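
The guarantee in point 3 can be checked with a small worked example. It assumes the 1 MB window means 1024 * 1024 bytes and uses a made-up flush list; `checkpoint_lsn_bound` is a hypothetical helper, not an InnoDB function.

```python
L = 1024 * 1024  # window size in bytes of the LSN sequence (slide's 1 MB, taken as 1 MiB)


def checkpoint_lsn_bound(flush_list_oldest_lsns):
    """Return an LSN that is <= oldest_lsn of every page in the flush list.

    Only the earliest-added page is inspected. Pages added after it may have
    a smaller oldest_lsn, but by at most L, because each page had to fit in
    the window before it was added.
    """
    head_oldest_lsn = flush_list_oldest_lsns[0]
    return max(head_oldest_lsn - L, 0)


# Hypothetical flush list in insertion order; the order is relaxed, not sorted.
flush_list = [5_000_000, 4_200_000, 4_900_000, 6_000_000]
bound = checkpoint_lsn_bound(flush_list)
```

Here the head page has oldest_lsn 5,000,000, so the bound is 5,000,000 - 1,048,576 = 3,951,424, which is below the true minimum of 4,200,000 without scanning the list.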
  8. 8. Relaxed order of pages in flush list
  9. 9. Generalized algorithm (extracted)
     /* Create a new "light task" */
     your_start_time = time_sequence.next_time(planned_time_interval);
     /* Wait until it's permitted to start the execution (unlikely to wait). */
     tasks_done.wait_until_in_current_window(your_start_time);
     /* Do your work */
     foo();
     /* Report it's done. */
     tasks_done.report_task_done(your_start_time, your_start_time + planned_time_interval);
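
The generalized algorithm can be made runnable as a rough sketch. The method names follow the slide; the classes `TimeSequence` and `TasksDone`, the window size of 30 and the use of locks and a condition variable (instead of the lock-free structure the talk describes) are assumptions chosen to keep the example short.

```python
import threading


class TimeSequence:
    """Hands out non-overlapping intervals of a shared, monotonic time line."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._next = 0

    def next_time(self, interval: int) -> int:
        with self._lock:
            start = self._next
            self._next += interval
            return start


class TasksDone:
    def __init__(self, window: int) -> None:
        self._cond = threading.Condition()
        self._window = window
        self.tail = 0        # every task that starts before tail is done
        self._links = {}     # start -> end of tasks finished out of order

    def wait_until_in_current_window(self, start: int) -> None:
        with self._cond:
            self._cond.wait_for(lambda: start < self.tail + self._window)

    def report_task_done(self, start: int, end: int) -> None:
        with self._cond:
            self._links[start] = end
            while self.tail in self._links:
                self.tail = self._links.pop(self.tail)
            self._cond.notify_all()


time_sequence = TimeSequence()
tasks_done = TasksDone(window=30)
results = []
results_lock = threading.Lock()


def light_task(planned_time_interval: int) -> None:
    your_start_time = time_sequence.next_time(planned_time_interval)
    tasks_done.wait_until_in_current_window(your_start_time)
    with results_lock:                      # stands in for foo()
        results.append(your_start_time)
    tasks_done.report_task_done(your_start_time, your_start_time + planned_time_interval)


threads = [threading.Thread(target=light_task, args=(10,)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The task whose start time equals the tail never waits, so the tail always advances and waiting tasks are eventually admitted.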
  10. 10. Sharded RW-latch for mtr_commit
      Diagram labels: 1, 2, 3, S
      This step exists only to provide an option to "stop the world", which is very uncommon.
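
One way to picture a sharded RW-latch is sketched below, assuming that committing threads take a shared latch on a single shard and that "stop the world" takes an exclusive latch on every shard. The slide does not spell this out, so the classes and method names here are hypothetical.

```python
import threading


class _Shard:
    """One small reader/writer latch."""

    def __init__(self) -> None:
        self.cond = threading.Condition()
        self.readers = 0
        self.writer = False

    def lock_shared(self) -> None:
        with self.cond:
            self.cond.wait_for(lambda: not self.writer)
            self.readers += 1

    def unlock_shared(self) -> None:
        with self.cond:
            self.readers -= 1
            self.cond.notify_all()

    def lock_exclusive(self) -> None:
        with self.cond:
            self.cond.wait_for(lambda: not self.writer and self.readers == 0)
            self.writer = True

    def unlock_exclusive(self) -> None:
        with self.cond:
            self.writer = False
            self.cond.notify_all()


class ShardedRWLatch:
    def __init__(self, n_shards: int = 4) -> None:
        self.shards = [_Shard() for _ in range(n_shards)]

    def s_lock(self) -> int:
        """Common path: share-latch only the shard picked for this thread."""
        idx = threading.get_ident() % len(self.shards)
        self.shards[idx].lock_shared()
        return idx

    def s_unlock(self, idx: int) -> None:
        self.shards[idx].unlock_shared()

    def x_lock(self) -> None:
        """Rare path ("stop the world"): latch every shard, always in the same order."""
        for shard in self.shards:
            shard.lock_exclusive()

    def x_unlock(self) -> None:
        for shard in self.shards:
            shard.unlock_exclusive()


latch = ShardedRWLatch()
shard_idx = latch.s_lock()
readers_while_held = latch.shards[shard_idx].readers
latch.s_unlock(shard_idx)
latch.x_lock()
world_stopped = all(s.writer for s in latch.shards)
latch.x_unlock()
```

The design trades a more expensive exclusive path for a shared path that touches only one shard, which fits a step that is described as very uncommon.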
  11. 11. Redo threads
      New strategy for writing to disk:
      1. The sooner the log is written, the sooner a transaction's commit can finish.
      2. We keep an eager loop of writes to the OS buffer.
      3. We keep an eager loop of fsyncs.
      However:
      4. We avoid rewriting log blocks: we write only full log blocks unless none is ready.
      5. We preserve the write-ahead strategy to avoid the read-on-write issue.
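
Point 4 can be sketched as a small range calculation. The example assumes 512-byte log blocks (a value not stated on the slides) and treats the LSN as a plain byte offset, ignoring block headers; `next_write_range` is a hypothetical helper.

```python
LOG_BLOCK_SIZE = 512  # bytes; assumed block size, not given on the slides


def next_write_range(written_lsn: int, ready_lsn: int):
    """Pick the next [start, end) range to write, given data ready up to ready_lsn."""
    if ready_lsn <= written_lsn:
        return None                                # nothing new to write
    last_full_block_end = (ready_lsn // LOG_BLOCK_SIZE) * LOG_BLOCK_SIZE
    if last_full_block_end > written_lsn:
        return (written_lsn, last_full_block_end)  # stop at the last complete block
    return (written_lsn, ready_lsn)                # no full block ready: write the partial one


full_blocks_only = next_write_range(0, 1300)
partial_fallback = next_write_range(1024, 1300)
```

With 1300 bytes ready, the first call stops at 1024 so the incomplete third block is not written and later rewritten; only when no complete block is available does the partial block go out.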
  12. 12. Waiting for redo written / flushed
      New strategy to wait for redo written / flushed:
        • Select a finer-grained event (in 5.7 there was only one event for this)
        • Granularity is adjusted to the expected granularity of writes (per log block)
        • Optionally use spin delay first (if the CPU is not busy)
        • Users waiting in a block for which a write started while it was only partially filled can experience false wake-ups.
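
The per-block events and the false wake-up in the last bullet can be illustrated as follows. The block size of 512 bytes, the number of event slots and both helper functions are assumptions for the sketch; only the idea of one event per log block comes from the slide.

```python
LOG_BLOCK_SIZE = 512  # bytes; assumed
N_EVENTS = 8          # hypothetical number of event slots


def event_slot(lsn: int) -> int:
    """Slot a thread sleeps on while waiting for the log to be written up to lsn."""
    return ((lsn - 1) // LOG_BLOCK_SIZE) % N_EVENTS


def slots_to_wake(old_written_lsn: int, new_written_lsn: int) -> set:
    """Slots of all blocks touched by a write that advanced the written LSN."""
    first_block = old_written_lsn // LOG_BLOCK_SIZE
    last_block = (new_written_lsn - 1) // LOG_BLOCK_SIZE
    return {block % N_EVENTS for block in range(first_block, last_block + 1)}


# A thread waits for LSN 700; the writer then writes block 1 only partially (512 -> 600).
waiter_lsn = 700
woken = slots_to_wake(512, 600)
false_wakeup = event_slot(waiter_lsn) in woken and 600 < waiter_lsn
# A thread waiting in block 0 sleeps on a different slot and is not disturbed.
other_waiter_disturbed = event_slot(300) in woken
```

A woken thread therefore has to re-check the written LSN and go back to sleep if its own LSN has not been reached yet.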
  13. 13. Waking up waiting threads
  14. 14. Consuming unused CPU to improve TPS
      1. CPU usage is monitored so that spin delay is not used when the server is almost idle, nor when there is not enough CPU power left for useful work.
      2. The average time between consecutive requests to write or flush redo is monitored to detect situations in which requests are infrequent and spin delay is not required. In such cases sleeps also start with a higher timeout. This avoids wasting CPU when the log threads do not need to be so eager.
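
The two monitoring rules can be combined into a decision function along these lines. Every threshold and timeout below is a made-up placeholder, and `plan_wait` and `WaitPlan` are hypothetical names; the slides give no concrete values.

```python
from dataclasses import dataclass

# Hypothetical tuning values, chosen only for illustration.
IDLE_BELOW_PCT = 5.0         # server considered almost idle below this CPU usage
BUSY_ABOVE_PCT = 90.0        # no spare CPU for spinning above this usage
SPARSE_INTERVAL_US = 1000.0  # requests further apart than this count as infrequent
SHORT_TIMEOUT_US = 10
LONG_TIMEOUT_US = 1000


@dataclass(frozen=True)
class WaitPlan:
    use_spin: bool
    sleep_timeout_us: int


def plan_wait(cpu_usage_pct: float, avg_request_interval_us: float) -> WaitPlan:
    """Decide whether a log thread should spin and how long its sleeps should be."""
    if avg_request_interval_us > SPARSE_INTERVAL_US:
        # Rule 2: requests are rare, so skip spinning and sleep longer.
        return WaitPlan(use_spin=False, sleep_timeout_us=LONG_TIMEOUT_US)
    if cpu_usage_pct < IDLE_BELOW_PCT or cpu_usage_pct > BUSY_ABOVE_PCT:
        # Rule 1: almost idle, or no CPU to spare for spinning.
        return WaitPlan(use_spin=False, sleep_timeout_us=SHORT_TIMEOUT_US)
    return WaitPlan(use_spin=True, sleep_timeout_us=SHORT_TIMEOUT_US)


moderate_load = plan_wait(cpu_usage_pct=50.0, avg_request_interval_us=20.0)
saturated = plan_wait(cpu_usage_pct=97.0, avg_request_interval_us=20.0)
rare_requests = plan_wait(cpu_usage_pct=50.0, avg_request_interval_us=5000.0)
```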
  15. 15. Changes to redo log incoming soon
      1. A dedicated solution (similar to 5.7) for low-concurrency workloads, to avoid the need for spinning and consuming CPU while still delivering top TPS for that number of connections.
      2. Changes to the redo format: dynamic resize of the redo log on disk, no more wrapping within a single file; checkpoints stored within each log file; logfile0 is no longer special.
  16. 16. Thank You
      Paweł Olchawa, Senior Software Developer, Oracle / MySQL / InnoDB
