SlideShare a Scribd company logo
1 of 24
Download to read offline
Life of a dirty page
  InnoDB disk IO
    Mark Callaghan
URLs

 slides at http://tinyurl.com/dzshl2
 v3 Google patch at code.google.com
 many similar features in Percona binaries
 soon to be many similar features in MySQL 5.4
Putting things in context

How to make a great OLTP engine
   Go back in time about 15 years
   Start with the Gray & Reuter text
   Develop on a typical PC with one disk and one CPU
 
Then servers with many cores and 10k IOPs arrive.
 
I will focus on the design and implementation of official
InnoDB.
Questions

 Should InnoDB switch to 64kb pages?
    The 5 Minute Rule suggests it should
    Nobody has published performance results for this
 Can InnoDB effectively use 10,000 IOPs?
    Yes for read-intensive workloads
    Not yet for write-intensive workloads
High-level architecture

 Some IO done directly in user threads (mostly reads)
 Background threads perform simulated AIO for
    Log writes
    Dirty page writes
    Prefetch reads
 Extent is 64 16kb pages
 Adjacent requests are merged
 Rate limits based on capacity of single-disk server
Medium-level architecture

Background thread tasks (srv_master_thread)
   Once per second
   Once per 10 seconds
   Only when idle
Tasks to do once per second
 Force transaction log to disk
 Make space available in tx log buffer (maybe)
 Fuzzy checkpoint (maybe)
 Read up to 5 pages to merge insert buffer records using
 sync IO. Skipped unless server is somewhat idle.
 Write 100 dirty pages from the buffer pool using async
 IO. Select pages from the flush list. Skipped when there
 are too many dirty pages. Restart the task loop
 immediately if this is done.
Tasks to do once per 10 seconds

 Write 100 or 10 dirty pages from the buffer pool using
 async IO while selecting pages from the flush list.
 Skipped unless the server is somewhat idle.
 Read up to 5 pages to merge insert buffer records using
 sync IO
 Force transaction log to disk
 Remove deleted rows 
 Fuzzy checkpoint
Tasks to do when the server is idle

  Remove deleted rows
  Read up to 20 pages to merge insert buffer records
  using sync IO
  Write 100 dirty pages from the buffer pool using sync
  IO. Select pages from the flush list.
  Fuzzy checkpoint 
  Repeat if server still idle
Processing IO requests (os0file.c)
Request arrays
   256 slots per array
   Arrays for insert buffer, tx log, page writes, prefetch
   One thread does all of the IO per array using sync IO
 
Choosing the next request
   Requests older than 2 seconds get priority
   Otherwise choose the request with the lowest file
   offset when considering the low 32 bits of the offset
   This requires 2 passes of the request array
Processing IO requests (2)

Request merging
   Merge requests for adjacent blocks. Alas, this does one
   pass on the array per adjacent block merged.
   Stop trying to merge when there is no pending request
   for the next block or 64 requests have been merged.
   Allocate a temporary buffer for the multi-block IO
   when requests have been merged.
   Use one read/write system call. This is a large IO
   operation when merging is used. Wait for entire request
   to complete.
Processing IO requests (3)

Request completion
   Remove the request for the first block in the merged
   request (or only block) from the IO array.
   Completion for other requests is done on the next
   iteration of the request handling loop.
Page write protocol
Protect against partial page writes by writing each dirty
page first to a serial log before updating it in place. This
can be disabled for copy-on-write filesystems.
 
Important functions are
   buf_flush_post_to_doublewrite_buf
   buf_flush_batch
   buf_flush_buffered_writes
Page write protocol (2)
  First write page to in-memory doublewrite log
  When doublewrite log is full
     Write doublewrite log to a file (many pages, 1 big IO).
     This file is the doublewrite buffer.
     Force writes for doublewrite buffer to disk.
     Submit async IO requests to update pages just
     logged in the doublewrite buffer (1 write per page).
     Wait for writes to complete.
     Force writes for dirty pages to disk.
Prefetch for random access

 buf_read_ahead_random
 There are 64 blocks per extent
 Before requesting a block read:
    Count number of blocks in extent that were recently
    read based on position in buffer pool LRU list.
    If more than 13 were recently read, prefetch others
Prefetch for sequential access

 buf_read_ahead_linear
 May be used when
    Accessing first or last page in extent
    Many (24) pages in extent have been accessed
    Access pattern was sequential
 Checks are made prior to issuing a read request.
 When used, issue read requests for extent that contains
 that page that follows or precedes this extent.
Insert buffer (ibuf0ibuf.c)

 Persistent and index-organized on (space id, page#)
 Allows changes to secondary index leaf blocks to be
 deferred when the block is not in the buffer pool.
 Used for insert and the insert portion of an update
 Used until the insert buffer is half of the buffer pool
 Fully resident in the buffer pool
Merging insert buffer records

 ibuf_contract_ext
 Choose random page from the insert buffer
 Open a cursor on a random record on that page
 Read insert buffer entries from that cursor to find at
 most 8 pages that should be fetched.
 Issue async IO requests
 When async read IO completes, callback is done to apply
 deferred changes
 (Usually) wait for IO to complete
Buffer pool (buf0buf.h)

 Protected by one mutex
 Contains
    Flush list - ordered by oldest modification LSN
    LRU list
       ordered by approximate access time
       I have not found code to guard against wiping out
       LRU list and buffer pool on large table scan
    Hash table for pages in buffer pool
More on writing dirty pages

 Most pages to be written selected from end of flush list
    min(modification LSN from any page) must be greater
    than the min LSN from the current log files. Pages
    flushed to enforce this.
    If a page is written from the LRU list it is moved to
    the free list on IO completion.
 When a page is selected, other dirty pages in the same
 extent are also selected.
Problems (PA == patch available)

  Rate limits for background IO are based on a single-disk
  server (PA)
  Only one thread per IO request array (PA)
  Insert buffer merge rate limit is much to small (PA)
  Dirty page writing rate limit is much too small (PA)
  Request arrays are too small for high IOPs servers (PA)
  Request array iteration uses too much CPU (PA)
  BUF_READ_AHEAD_AREA  uses too much CPU
  SHOW STATUS and SHOW INNODB STATUS need
  more statistics (PA)
Problems (PA == patch available)

  InnoDB not good at enforcing max dirty pages (PA)
  No option for enforcing max insert buffer size (PA)
  Limited visibility into utility of prefetching
  Prefetch requests for sequential access are submitted
  late
  When IO requests are merged, must wait for large IO
  to complete before any blocks can be used
IO performance problems

 Insert buffer useless when full, but not emptied quickly
 Updates and inserts are much slower without it
 Run the insert benchmark (iibench.py)
    2 disk server, SW RAID 0, XFS
    1 GB InnoDB buffer pool, O_DIRECT
    Start with 50M rows in the table, restart test
    Eventually unpatched InnoDB does 25 inserts/sec
    versus 200 inserts/sec for the v3 Google patch
    Result has been repeated on larger servers
IO Performance problems

More Related Content

What's hot

Inno db internals innodb file formats and source code structure
Inno db internals innodb file formats and source code structureInno db internals innodb file formats and source code structure
Inno db internals innodb file formats and source code structurezhaolinjnu
 
Harrison fisk masteringinnodb-diagnostics
Harrison fisk masteringinnodb-diagnosticsHarrison fisk masteringinnodb-diagnostics
Harrison fisk masteringinnodb-diagnosticsguest8212a5
 
Percona Server 5.7: Key Performance Algorithms
Percona Server 5.7: Key Performance AlgorithmsPercona Server 5.7: Key Performance Algorithms
Percona Server 5.7: Key Performance AlgorithmsLaurynas Biveinis
 
TokuDB internals / Лесин Владислав (Percona)
TokuDB internals / Лесин Владислав (Percona)TokuDB internals / Лесин Владислав (Percona)
TokuDB internals / Лесин Владислав (Percona)Ontico
 
Internal representation of files ppt
Internal representation of files pptInternal representation of files ppt
Internal representation of files pptAbhaysinh Surve
 
C++ Memory Management
C++ Memory ManagementC++ Memory Management
C++ Memory ManagementRahul Jamwal
 
SQL Server 2014 In-Memory OLTP
SQL Server 2014 In-Memory OLTPSQL Server 2014 In-Memory OLTP
SQL Server 2014 In-Memory OLTPTony Rogerson
 
Inno Db Internals Inno Db File Formats And Source Code Structure
Inno Db Internals Inno Db File Formats And Source Code StructureInno Db Internals Inno Db File Formats And Source Code Structure
Inno Db Internals Inno Db File Formats And Source Code StructureMySQLConference
 
Toro DB- Open-source, MongoDB-compatible database, built on top of PostgreSQL
Toro DB- Open-source, MongoDB-compatible database,  built on top of PostgreSQLToro DB- Open-source, MongoDB-compatible database,  built on top of PostgreSQL
Toro DB- Open-source, MongoDB-compatible database, built on top of PostgreSQLInMobi Technology
 
TokuDB - What You Need to Know
TokuDB - What You Need to KnowTokuDB - What You Need to Know
TokuDB - What You Need to KnowJervin Real
 
MySQL Buffer Management
MySQL Buffer ManagementMySQL Buffer Management
MySQL Buffer ManagementMIJIN AN
 
Nextgen virtualization.pptx
Nextgen virtualization.pptxNextgen virtualization.pptx
Nextgen virtualization.pptxRyoichi Kida
 
Thrashing allocation frames.43
Thrashing allocation frames.43Thrashing allocation frames.43
Thrashing allocation frames.43myrajendra
 

What's hot (17)

Inno db internals innodb file formats and source code structure
Inno db internals innodb file formats and source code structureInno db internals innodb file formats and source code structure
Inno db internals innodb file formats and source code structure
 
Harrison fisk masteringinnodb-diagnostics
Harrison fisk masteringinnodb-diagnosticsHarrison fisk masteringinnodb-diagnostics
Harrison fisk masteringinnodb-diagnostics
 
Percona Server 5.7: Key Performance Algorithms
Percona Server 5.7: Key Performance AlgorithmsPercona Server 5.7: Key Performance Algorithms
Percona Server 5.7: Key Performance Algorithms
 
TokuDB internals / Лесин Владислав (Percona)
TokuDB internals / Лесин Владислав (Percona)TokuDB internals / Лесин Владислав (Percona)
TokuDB internals / Лесин Владислав (Percona)
 
Internal representation of files ppt
Internal representation of files pptInternal representation of files ppt
Internal representation of files ppt
 
C++ Memory Management
C++ Memory ManagementC++ Memory Management
C++ Memory Management
 
Percona Server 8.0
Percona Server 8.0Percona Server 8.0
Percona Server 8.0
 
SQL Server 2014 In-Memory OLTP
SQL Server 2014 In-Memory OLTPSQL Server 2014 In-Memory OLTP
SQL Server 2014 In-Memory OLTP
 
Inno Db Internals Inno Db File Formats And Source Code Structure
Inno Db Internals Inno Db File Formats And Source Code StructureInno Db Internals Inno Db File Formats And Source Code Structure
Inno Db Internals Inno Db File Formats And Source Code Structure
 
Toro DB- Open-source, MongoDB-compatible database, built on top of PostgreSQL
Toro DB- Open-source, MongoDB-compatible database,  built on top of PostgreSQLToro DB- Open-source, MongoDB-compatible database,  built on top of PostgreSQL
Toro DB- Open-source, MongoDB-compatible database, built on top of PostgreSQL
 
TokuDB - What You Need to Know
TokuDB - What You Need to KnowTokuDB - What You Need to Know
TokuDB - What You Need to Know
 
MySQL Buffer Management
MySQL Buffer ManagementMySQL Buffer Management
MySQL Buffer Management
 
Nextgen virtualization.pptx
Nextgen virtualization.pptxNextgen virtualization.pptx
Nextgen virtualization.pptx
 
Vmfs
VmfsVmfs
Vmfs
 
Thrashing allocation frames.43
Thrashing allocation frames.43Thrashing allocation frames.43
Thrashing allocation frames.43
 
Demand paging
Demand pagingDemand paging
Demand paging
 
.NET RDF APIS
.NET RDF APIS.NET RDF APIS
.NET RDF APIS
 

Similar to Life Of A Dirty Page Inno Db Disk Io

Similar to Life Of A Dirty Page Inno Db Disk Io (20)

OSCh10
OSCh10OSCh10
OSCh10
 
Ch10 OS
Ch10 OSCh10 OS
Ch10 OS
 
OS_Ch10
OS_Ch10OS_Ch10
OS_Ch10
 
My sql innovation work -innosql
My sql innovation work -innosqlMy sql innovation work -innosql
My sql innovation work -innosql
 
Mca ii os u-4 memory management
Mca  ii  os u-4 memory managementMca  ii  os u-4 memory management
Mca ii os u-4 memory management
 
Chapter 9 - Virtual Memory
Chapter 9 - Virtual MemoryChapter 9 - Virtual Memory
Chapter 9 - Virtual Memory
 
Virtual memory
Virtual memory Virtual memory
Virtual memory
 
CH09.pdf
CH09.pdfCH09.pdf
CH09.pdf
 
Ch09
Ch09Ch09
Ch09
 
Linux Memory
Linux MemoryLinux Memory
Linux Memory
 
ikh311-06
ikh311-06ikh311-06
ikh311-06
 
9 virtual memory management
9 virtual memory management9 virtual memory management
9 virtual memory management
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
 
Redo log
Redo logRedo log
Redo log
 
Operating System
Operating SystemOperating System
Operating System
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013
 
MySQL 5.5&5.6 new features summary
MySQL 5.5&5.6 new features summaryMySQL 5.5&5.6 new features summary
MySQL 5.5&5.6 new features summary
 
Page replacement
Page replacementPage replacement
Page replacement
 
Virtual Memory Management Part - II.pdf
Virtual Memory Management Part - II.pdfVirtual Memory Management Part - II.pdf
Virtual Memory Management Part - II.pdf
 
Virtual Memory.pdf
Virtual Memory.pdfVirtual Memory.pdf
Virtual Memory.pdf
 

More from Sky Jian

Oracle my sql-or-nosql
Oracle my sql-or-nosqlOracle my sql-or-nosql
Oracle my sql-or-nosqlSky Jian
 
MySQL Tuning For CPU Bottleneck
MySQL Tuning For CPU BottleneckMySQL Tuning For CPU Bottleneck
MySQL Tuning For CPU BottleneckSky Jian
 
MySQL性能调优最佳实践
MySQL性能调优最佳实践MySQL性能调优最佳实践
MySQL性能调优最佳实践Sky Jian
 
浅谈 MySQL 性能调优
浅谈 MySQL 性能调优浅谈 MySQL 性能调优
浅谈 MySQL 性能调优Sky Jian
 
浅谈数据库优化
浅谈数据库优化浅谈数据库优化
浅谈数据库优化Sky Jian
 
基于MySQL开放复制协议的同步扩展
基于MySQL开放复制协议的同步扩展基于MySQL开放复制协议的同步扩展
基于MySQL开放复制协议的同步扩展Sky Jian
 
基于 MySQL 的B2C电商系统前端数据层架构
基于 MySQL 的B2C电商系统前端数据层架构基于 MySQL 的B2C电商系统前端数据层架构
基于 MySQL 的B2C电商系统前端数据层架构Sky Jian
 
高可用可扩展数据层 - MySQL架构实践
高可用可扩展数据层 - MySQL架构实践高可用可扩展数据层 - MySQL架构实践
高可用可扩展数据层 - MySQL架构实践Sky Jian
 
MySQL Explain
MySQL Explain MySQL Explain
MySQL Explain Sky Jian
 
My sql cluster 基础
My sql cluster   基础My sql cluster   基础
My sql cluster 基础Sky Jian
 
高可用可扩展数据库架构方案探讨
高可用可扩展数据库架构方案探讨高可用可扩展数据库架构方案探讨
高可用可扩展数据库架构方案探讨Sky Jian
 
My Sql Performance In A Cloud
My Sql Performance In A CloudMy Sql Performance In A Cloud
My Sql Performance In A CloudSky Jian
 
Oracle Data Buffer Cache
Oracle Data Buffer CacheOracle Data Buffer Cache
Oracle Data Buffer CacheSky Jian
 

More from Sky Jian (13)

Oracle my sql-or-nosql
Oracle my sql-or-nosqlOracle my sql-or-nosql
Oracle my sql-or-nosql
 
MySQL Tuning For CPU Bottleneck
MySQL Tuning For CPU BottleneckMySQL Tuning For CPU Bottleneck
MySQL Tuning For CPU Bottleneck
 
MySQL性能调优最佳实践
MySQL性能调优最佳实践MySQL性能调优最佳实践
MySQL性能调优最佳实践
 
浅谈 MySQL 性能调优
浅谈 MySQL 性能调优浅谈 MySQL 性能调优
浅谈 MySQL 性能调优
 
浅谈数据库优化
浅谈数据库优化浅谈数据库优化
浅谈数据库优化
 
基于MySQL开放复制协议的同步扩展
基于MySQL开放复制协议的同步扩展基于MySQL开放复制协议的同步扩展
基于MySQL开放复制协议的同步扩展
 
基于 MySQL 的B2C电商系统前端数据层架构
基于 MySQL 的B2C电商系统前端数据层架构基于 MySQL 的B2C电商系统前端数据层架构
基于 MySQL 的B2C电商系统前端数据层架构
 
高可用可扩展数据层 - MySQL架构实践
高可用可扩展数据层 - MySQL架构实践高可用可扩展数据层 - MySQL架构实践
高可用可扩展数据层 - MySQL架构实践
 
MySQL Explain
MySQL Explain MySQL Explain
MySQL Explain
 
My sql cluster 基础
My sql cluster   基础My sql cluster   基础
My sql cluster 基础
 
高可用可扩展数据库架构方案探讨
高可用可扩展数据库架构方案探讨高可用可扩展数据库架构方案探讨
高可用可扩展数据库架构方案探讨
 
My Sql Performance In A Cloud
My Sql Performance In A CloudMy Sql Performance In A Cloud
My Sql Performance In A Cloud
 
Oracle Data Buffer Cache
Oracle Data Buffer CacheOracle Data Buffer Cache
Oracle Data Buffer Cache
 

Recently uploaded

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 

Life Of A Dirty Page Inno Db Disk Io

  • 1. Life of a dirty page InnoDB disk IO Mark Callaghan
  • 2. URLs slides at http://tinyurl.com/dzshl2 v3 Google patch at code.google.com many similar features in Percona binaries soon to be many similar features in MySQL 5.4
  • 3. Putting things in context How to make a great OLTP engine Go back in time about 15 years Start with the Gray & Reuter text Develop on a typical PC with one disk and one CPU   Then servers with many cores and 10k IOPs arrive.   I will focus on the design and implementation of official InnoDB.
  • 4. Questions Should InnoDB switch to 64kb pages? The 5 Minute Rule suggests it should Nobody has published performance results for this Can InnoDB effectively use 10,000 IOPs? Yes for read-intensive workloads Not yet for write-intensive workloads
  • 5. High-level architecture Some IO done directly in user threads (mostly reads) Background threads perform simulated AIO for Log writes Dirty page writes Prefetch reads Extent is 64 16kb pages Adjacent requests are merged Rate limits based on capacity of single-disk server
  • 6. Medium-level architecture Background thread tasks (srv_master_thread) Once per second Once per 10 seconds Only when idle
  • 7. Tasks to do once per second Force transaction log to disk Make space available in tx log buffer (maybe) Fuzzy checkpoint (maybe) Read up to 5 pages to merge insert buffer records using sync IO. Skipped unless server is somewhat idle. Write 100 dirty pages from the buffer pool using async IO. Select pages from the flush list. Skipped when there are too many dirty pages. Restart the task loop immediately if this is done.
  • 8. Tasks to do once per 10 seconds Write 100 or 10 dirty pages from the buffer pool using async IO while selecting pages from the flush list. Skipped unless the server is somewhat idle. Read up to 5 pages to merge insert buffer records using sync IO Force transaction log to disk Remove deleted rows  Fuzzy checkpoint
  • 9. Tasks to do when the server is idle Remove deleted rows Read up to 20 pages to merge insert buffer records using sync IO Write 100 dirty pages from the buffer pool using sync IO. Select pages from the flush list. Fuzzy checkpoint  Repeat if server still idle
  • 10. Processing IO requests (os0file.c) Request arrays 256 slots per array Arrays for insert buffer, tx log, page writes, prefetch One thread does all of the IO per array using sync IO   Choosing the next request Requests older than 2 seconds get priority Otherwise choose the request with the lowest file offset when considering the low 32 bits of the offset This requires 2 passes of the request array
  • 11. Processing IO requests (2) Request merging Merge requests for adjacent blocks. Alas, this does one pass on the array per adjacent block merged. Stop trying to merge when there is no pending request for the next block or 64 requests have been merged. Allocate a temporary buffer for the multi-block IO when requests have been merged. Use one read/write system call. This is a large IO operation when merging is used. Wait for entire request to complete.
  • 12. Processing IO requests (3) Request completion Remove the request for the first block in the merged request (or only block) from the IO array. Completion for other requests is done on the next iteration of the request handling loop.
  • 13. Page write protocol Protect against partial page writes by writing each dirty page first to a serial log before updating it in place. This can be disabled for copy-on-write filesystems.   Important functions are buf_flush_post_to_doublewrite_buf buf_flush_batch buf_flush_buffered_writes
  • 14. Page write protocol (2) First write page to in-memory doublewrite log When doublewrite log is full Write doublewrite log to a file (many pages, 1 big IO). This file is the doublewrite buffer. Force writes for doublewrite buffer to disk. Submit async IO requests to update pages just logged in the doublewrite buffer (1 write per page). Wait for writes to complete. Force writes for dirty pages to disk.
  • 15. Prefetch for random access buf_read_ahead_random There are 64 blocks per extent Before requesting a block read: Count number of blocks in extent that were recently read based on position in buffer pool LRU list. If more than 13 were recently read, prefetch others
  • 16. Prefetch for sequential access buf_read_ahead_linear May be used when Accessing first or last page in extent Many (24) pages in extent have been accessed Access pattern was sequential Checks are made prior to issuing a read request. When used, issue read requests for extent that contains that page that follows or precedes this extent.
  • 17. Insert buffer (ibuf0ibuf.c) Persistent and index-organized on (space id, page#) Allows changes to secondary index leaf blocks to be deferred when the block is not in the buffer pool. Used for insert and the insert portion of an update Used until the insert buffer is half of the buffer pool Fully resident in the buffer pool
  • 18. Merging insert buffer records ibuf_contract_ext Choose random page from the insert buffer Open a cursor on a random record on that page Read insert buffer entries from that cursor to find at most 8 pages that should be fetched. Issue async IO requests When async read IO completes, callback is done to apply deferred changes (Usually) wait for IO to complete
  • 19. Buffer pool (buf0buf.h) Protected by one mutex Contains Flush list - ordered by oldest modification LSN LRU list ordered by approximate access time I have not found code to guard against wiping out LRU list and buffer pool on large table scan Hash table for pages in buffer pool
  • 20. More on writing dirty pages Most pages to be written selected from end of flush list min(modification LSN from any page) must be greater than the min LSN from the current log files. Pages flushed to enforce this. If a page is written from the LRU list it is moved to the free list on IO completion. When a page is selected, other dirty pages in the same extent are also selected.
  • 21. Problems (PA == patch available) Rate limits for background IO are based on a single-disk server (PA) Only one thread per IO request array (PA) Insert buffer merge rate limit is much to small (PA) Dirty page writing rate limit is much too small (PA) Request arrays are too small for high IOPs servers (PA) Request array iteration uses too much CPU (PA) BUF_READ_AHEAD_AREA  uses too much CPU SHOW STATUS and SHOW INNODB STATUS need more statistics (PA)
  • 22. Problems (PA == patch available) InnoDB not good at enforcing max dirty pages (PA) No option for enforcing max insert buffer size (PA) Limited visibility into utility of prefetching Prefetch requests for sequential access are submitted late When IO requests are merged, must wait for large IO to complete before any blocks can be used
  • 23. IO performance problems Insert buffer useless when full, but not emptied quickly Updates and inserts are much slower without it Run the insert benchmark (iibench.py) 2 disk server, SW RAID 0, XFS 1 GB InnoDB buffer pool, O_DIRECT Start with 50M rows in the table, restart test Eventually unpatched InnoDB does 25 inserts/sec versus 200 inserts/sec for the v3 Google patch Result has been repeated on larger servers