Andrea Righi - andrea@betterlinux.com
Understand and optimize Linux I/O
Agenda
● Overview
● I/O Monitoring
● I/O Tuning
● Reliability
● Q/A
Overview
File I/O in Linux
READ vs WRITE
● READ
● synchronous: the process must wait for the READ to
complete before it can continue
● cached pages are easy to reclaim
● WRITE
● asynchronous: the process doesn't need to wait for the
WRITE to complete before it continues
● cached pages are hard to reclaim (require I/O)
SYNC vs ASYNC
● SYNC I/O READ: the kernel queues a read operation for the data and returns
only after the entire block of data has been read; meanwhile the process is in
the waiting-for-I/O state (D)
● SYNC I/O WRITE: the kernel queues a write operation for the data and
returns only after the entire block of data has been written; meanwhile the
process is in the waiting-for-I/O state (D)
● ASYNC I/O READ: the process repeatedly calls read() with the size of the data
remaining, until the entire block is read (use select()/poll() to determine
when some data is available)
● ASYNC I/O WRITE: the kernel updates the corresponding pages in the page
cache and marks them dirty; control then quickly returns to the
process, which can continue to run; the data is flushed later from a
different context in more optimal ways (e.g., sequential instead of seeky blocks)
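The read loop described for the asynchronous case can be sketched in C as follows. This is a minimal sketch, not code from the talk: read_full() and set_nonblocking() are hypothetical helper names, and the descriptor is assumed to be a pipe, socket or similar, since regular files always report as readable to select()/poll().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Put a descriptor in non-blocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Read up to len bytes from a non-blocking descriptor, calling read() with
 * the size of the data remaining and waiting in poll() when nothing is
 * available. Returns the number of bytes read (less than len only at end of
 * file), or -1 on error with errno set.
 */
ssize_t read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t done = 0;

    assert(buf != NULL);
    while (done < len) {
        ssize_t n = read(fd, p + done, len - done);
        if (n > 0) {
            done += (size_t)n;                  /* partial read: keep going */
        } else if (n == 0) {
            break;                              /* end of file */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            if (poll(&pfd, 1, -1) == -1 && errno != EINTR)
                return -1;                      /* poll() itself failed */
        } else if (errno != EINTR) {
            return -1;                          /* real read error */
        }
    }
    return (ssize_t)done;
}
```

While the process sits in poll() it is free to watch other descriptors as well, which is the practical difference from the blocking (SYNC) read.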
Block I/O subsystem
(simplified view)
● Processes submit I/O
requests to request queues
● The block I/O layer saves
the context of the process
that submits the request
● Requests can be merged
and reordered by the I/O
scheduler
● Minimize disk seeks,
optimize performance,
provide fairness among
processes
Plug / unplug
● When I/O is queued to a device, that device enters a
plugged state
● I/O isn't immediately dispatched to the low-level device
driver
● When a process is going to wait on the I/O to finish, the
device is unplugged
● This allows merging of sequential requests (writing and
reading bigger chunks of data saves re-writes of
the same hardware blocks and improves I/O throughput)
Flash memory
● Limited amount of erase cycles
● Flash memory blocks have to be explicitly
erased before they can be written to
● Writes decrease flash memory lifetime
● Wear leveling: logical mapping to distribute
writes evenly among the available physical
blocks
I/O Monitoring
iostat
● Information about request queues associated
with specific block devices
● Number of blocks read/written, average I/O wait
time, disk utilization %, ...
● It does not provide detailed per-I/O
information (pid? uid? ...)
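Per-device figures of this kind come from kernel counters such as those in /proc/diskstats. The following is a minimal sketch of parsing one line of that file and deriving a utilization percentage from two samples; struct disk_stats, parse_diskstats_line() and disk_util_percent() are hypothetical names, and only the first 13 fields of the line are read.

```c
#include <assert.h>
#include <stdio.h>

/* Subset of the per-device counters found in one /proc/diskstats line. */
struct disk_stats {
    char name[32];
    unsigned long long reads, sectors_read, ms_reading;
    unsigned long long writes, sectors_written, ms_writing;
    unsigned long long in_flight, ms_io;
};

/* Parse one /proc/diskstats line; returns 0 on success, -1 if fields are missing. */
int parse_diskstats_line(const char *line, struct disk_stats *st)
{
    unsigned int major, minor;
    unsigned long long reads_merged, writes_merged;
    int n;

    assert(line != NULL && st != NULL);
    n = sscanf(line,
               "%u %u %31s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
               &major, &minor, st->name,
               &st->reads, &reads_merged, &st->sectors_read, &st->ms_reading,
               &st->writes, &writes_merged, &st->sectors_written, &st->ms_writing,
               &st->in_flight, &st->ms_io);
    return n == 13 ? 0 : -1;
}

/* Share of the sampling interval during which the device had I/O in progress. */
double disk_util_percent(const struct disk_stats *prev,
                         const struct disk_stats *cur, double interval_ms)
{
    if (interval_ms <= 0.0)
        return 0.0;
    return 100.0 * (double)(cur->ms_io - prev->ms_io) / interval_ms;
}
```

Note that nothing in these counters identifies the process or user behind a request, which is the limitation listed above.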
iotop
● top-like I/O monitoring tool
● Disk read, write, I/O wait time percentage
● Still does not provide enough information on a
per-I/O basis:
● per-block-device statistics are missing
● no statistics about the nature of each request
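A related source of per-process totals is /proc/<pid>/io, which is not necessarily how iotop collects its data. The sketch below extracts the read_bytes and write_bytes counters from the text of that file; struct task_io and parse_proc_io() are hypothetical names. As with iotop, the result says nothing about which block device was involved.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Bytes a task caused to be read from / written to the storage layer. */
struct task_io {
    unsigned long long read_bytes;
    unsigned long long write_bytes;
};

/*
 * Scan the contents of /proc/<pid>/io line by line.
 * Returns 0 if both counters were found, -1 otherwise.
 */
int parse_proc_io(const char *text, struct task_io *io)
{
    const char *line = text;
    int found = 0;

    assert(text != NULL && io != NULL);
    while (line != NULL && *line != '\0') {
        if (sscanf(line, "read_bytes: %llu", &io->read_bytes) == 1)
            found |= 1;
        else if (sscanf(line, "write_bytes: %llu", &io->write_bytes) == 1)
            found |= 2;
        line = strchr(line, '\n');      /* advance to the next line, if any */
        if (line != NULL)
            line++;
    }
    return found == 3 ? 0 : -1;
}
```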
blktrace
● Low-overhead monitoring tool
● detailed per user / cgroup / thread and block
device statistics
● allows tracing events for specific operations
performed on each I/O entering the block I/O
layer
blktrace events
● Request queue entry allocated
● Sleep during request queue allocation
● Request queue insertion
● Front/back merge
● Re-queue of a request
● Request issued to underlying block device
● Request queue plug/unplug
● I/O remap (DM / MD)
● I/O split/bounce operation
● Request completed
● ...
blktrace operations
● RWBS
● 'R' - read
● 'W' - write
● 'D' - discard
● 'B' - barrier
● 'A' - ahead
● 'S' - synchronous
● 'M' - meta-data
● 'N' - No data
static void fill_rwbs(char *rwbs, struct blk_io_trace *t)
{
    int i = 0;

    /* first character: the kind of operation */
    if (t->action & BLK_TC_DISCARD) rwbs[i++] = 'D';
    else if (t->action & BLK_TC_WRITE) rwbs[i++] = 'W';
    else if (t->bytes) rwbs[i++] = 'R';
    else rwbs[i++] = 'N';
    /* optional modifiers appended after it */
    if (t->action & BLK_TC_AHEAD) rwbs[i++] = 'A';
    if (t->action & BLK_TC_BARRIER) rwbs[i++] = 'B';
    if (t->action & BLK_TC_SYNC) rwbs[i++] = 'S';
    if (t->action & BLK_TC_META) rwbs[i++] = 'M';
    rwbs[i] = '\0';    /* terminate the string */
}
blktrace actions
● Actions
● C -- complete
● D -- issued
● I -- inserted
● Q -- queued
● B -- bounced
● M -- back merge
● F -- front merge
● G -- get request
● S -- sleep
● P -- plug
● U -- unplug
● T -- unplug due to timer
● X -- split
● A -- remap
● m -- message
blktrace output
# btrace /dev/sda
...
8,0 1 26 0.054596889 228 Q WS 237891152 + 8 [jbd2/sda3-8]
8,0 1 27 0.054597204 228 M WS 237891152 + 8 [jbd2/sda3-8]
8,0 1 28 0.054597816 228 A WS 237891160 + 8 <- (8,3) 230983256
8,0 1 29 0.054598137 228 Q WS 237891160 + 8 [jbd2/sda3-8]
8,0 1 30 0.054598457 228 M WS 237891160 + 8 [jbd2/sda3-8]
8,0 1 31 0.054599094 228 A WS 237891168 + 8 <- (8,3) 230983264
8,0 1 32 0.054599399 228 Q WS 237891168 + 8 [jbd2/sda3-8]
8,0 1 33 0.054599725 228 M WS 237891168 + 8 [jbd2/sda3-8]
Fields: device (major,minor), CPU, sequence number, timestamp, PID, action, RWBS, start block + number of blocks, [process name]
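To post-process this output, each line can be split into its fields. The sketch below handles lines that end with a bracketed process name (such as the Q and M lines above) and rejects other layouts, for example the remap (A) lines; struct blk_event and parse_btrace_line() are hypothetical names, not part of blktrace.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed line of btrace/blkparse default output. */
struct blk_event {
    unsigned int major, minor, cpu;
    unsigned long seq;
    double timestamp;               /* seconds since the trace started */
    int pid;
    char action[4];                 /* e.g. "Q", "M", "D", "C" */
    char rwbs[8];                   /* e.g. "WS" */
    unsigned long long sector;      /* start block */
    unsigned int nblocks;           /* number of blocks */
    char comm[32];                  /* process name from the brackets */
};

/* Returns 0 when all 11 fields were parsed, -1 for any other line layout. */
int parse_btrace_line(const char *line, struct blk_event *ev)
{
    int n;

    assert(line != NULL && ev != NULL);
    n = sscanf(line, "%u,%u %u %lu %lf %d %3s %7s %llu + %u [%31[^]]]",
               &ev->major, &ev->minor, &ev->cpu, &ev->seq, &ev->timestamp,
               &ev->pid, ev->action, ev->rwbs,
               &ev->sector, &ev->nblocks, ev->comm);
    return n == 11 ? 0 : -1;
}
```

Assuming the usual 512-byte sector unit, the "+ 8" in the sample lines corresponds to 4 KiB per request.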
I/O Tuning
Dirty pages writeback
● Writeback is the process of writing pages back to
persistent storage
● Dirty pages (grep Dirty /proc/meminfo)
● balance_dirty_pages() slows down tasks that are creating
more dirty pages than the system can handle
● direct reclaim (bad I/O pattern)
● pause
● IO-less dirty throttling (>= 3.2)
● pdflush vs per backing device writeback (>= 2.6.32)
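The amount of dirty memory can also be read programmatically rather than with grep. A minimal sketch, assuming the text of /proc/meminfo has already been read into a buffer; meminfo_kb() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up a "Key:   value kB" line in /proc/meminfo-style text.
 * Returns the value in kB, or -1 if the key is not present.
 */
long meminfo_kb(const char *text, const char *key)
{
    const char *line = text;
    size_t klen;

    assert(text != NULL && key != NULL);
    klen = strlen(key);
    while (line != NULL && *line != '\0') {
        /* the ':' check keeps "Writeback" from matching "WritebackTmp" */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long kb;
            return sscanf(line + klen + 1, "%ld", &kb) == 1 ? kb : -1;
        }
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return -1;
}
```

Sampling meminfo_kb(text, "Dirty") and meminfo_kb(text, "Writeback") periodically shows whether writeback keeps up with the tasks producing dirty pages.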
Background vs direct cleaning
● From Documentation/sysctl/vm.txt:
● Background cleaning (kernel flusher threads):
– /proc/sys/vm/dirty_background_ratio
– /proc/sys/vm/dirty_background_bytes
● Direct cleaning (normal tasks generating disk
writes):
– /proc/sys/vm/dirty_ratio
– /proc/sys/vm/dirty_bytes
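Each threshold has a ratio form and a bytes form. According to Documentation/sysctl/vm.txt only one of the two is in effect at a time, and the other reads back as 0. A rough sketch of how the effective limit follows from the pair; dirty_limit_bytes() is a hypothetical helper, and dirtyable_bytes is an assumed stand-in for the memory the kernel applies the ratio to (free plus reclaimable pages), which the kernel computes with more care than this.

```c
#include <assert.h>

/*
 * Approximate threshold, in bytes, at which background cleaning
 * (dirty_background_*) or direct cleaning (dirty_*) starts.
 *   ratio_percent   value of the *_ratio sysctl
 *   limit_bytes     value of the *_bytes sysctl (0 when the ratio is in use)
 *   dirtyable_bytes assumed amount of memory the ratio applies to
 */
unsigned long long dirty_limit_bytes(unsigned int ratio_percent,
                                     unsigned long long limit_bytes,
                                     unsigned long long dirtyable_bytes)
{
    assert(ratio_percent <= 100);
    if (limit_bytes != 0)
        return limit_bytes;                     /* the bytes form wins */
    return dirtyable_bytes / 100 * ratio_percent;
}
```

With 8 GiB of dirtyable memory, a ratio of 10 means roughly 819 MiB of dirty pages may accumulate before the threshold is reached, which is why the bytes form is often easier to reason about on large machines.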
Flusher thread tuning
● /proc/sys/vm/dirty_writeback_centisecs
● Wake up kernel flusher threads every
dirty_writeback_centisecs
● /proc/sys/vm/dirty_expire_centisecs
● Define when dirty data is old enough to be eligible
for writeout by kernel flusher threads
Swap I/O
● /proc/sys/vm/swappiness
● anonymous vs file LRU scanning ratio
– high value: aggressive swap
– low value: aggressive file pages reclaim
Filesystem I/O
● ext3: data=journal | ordered | writeback
● journal: meta-data + data committed in the journal
● ordered: data committed before its meta-data
● writeback: meta-data and data committed out-of-order
● ext4: delayed allocation
● block allocation deferred until background writeback
● improve chances of using contiguous blocks (threads writing at
different offsets simultaneously)
● xfs, ext4, zfs, …
● zero-length file problem:
– open-write-close-rename without fsync() can leave an empty file after a crash
Filesystem I/O tuning
● noatime, nodiratime:
● do not update inode access times
● relatime:
● access time is updated only if the previous access time was
earlier than the current modify or change time (doesn't
break applications like mutt that need to know if a file
has been read since the last time it was modified)
● commit=N
● sync data and meta-data every N seconds (default = 5s)
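To check which of these options a mounted filesystem actually uses, the mount table can be inspected with the <mntent.h> functions. A minimal sketch, assuming a glibc-style <mntent.h> and a mount table such as /proc/self/mounts; mount_skips_atime(), mount_commit_seconds() and count_noatime_mounts() are hypothetical helper names.

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>

/* Non-zero if the entry was mounted with noatime. */
int mount_skips_atime(const struct mntent *m)
{
    assert(m != NULL);
    return hasmntopt(m, "noatime") != NULL;
}

/* Value of commit=N in seconds, or -1 if the option is not listed. */
int mount_commit_seconds(const struct mntent *m)
{
    const char *opt;
    int secs;

    assert(m != NULL);
    opt = hasmntopt(m, "commit");
    if (opt != NULL && sscanf(opt, "commit=%d", &secs) == 1)
        return secs;
    return -1;
}

/* Count the noatime entries in a mount table file; -1 if it cannot be opened. */
int count_noatime_mounts(const char *mtab_path)
{
    FILE *tab = setmntent(mtab_path, "r");
    struct mntent *m;
    int count = 0;

    if (tab == NULL)
        return -1;
    while ((m = getmntent(tab)) != NULL)
        count += mount_skips_atime(m);
    endmntent(tab);
    return count;
}
```

A return value of -1 from mount_commit_seconds() only means the option was not given explicitly, in which case the filesystem default applies.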
I/O tuning at different layers
● Applications
● LD_PRELOAD
● VM
● caching
● Filesystem
● mount options / filesystem tuning
● Block device
● caching
Reliability
I/O data flow
● Application to library buffer
● fwrite(), fprintf(), etc.
● Library to OS buffer
● write()
● OS buffer to disk
● paged out, periodic flush (5 sec usually)
● fsync(), fdatasync(), sync(), sync_file_range()
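The last two stages of this flow can be sketched with plain system calls: write() moves the data into the OS buffer and fsync() asks the kernel to push it to the disk. This is a minimal sketch; write_all() and write_durably() are hypothetical helper names, and the 0644 mode is an arbitrary choice.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* write() may accept fewer bytes than requested, so loop until all are taken. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted: retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Library buffer -> OS buffer with write(), OS buffer -> disk with fsync(). */
int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd == -1)
        return -1;
    if (write_all(fd, buf, len) == -1 || fsync(fd) == -1) {
        int saved = errno;              /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return close(fd);
}
```

Without the fsync() call the data would still reach the disk eventually, but only when the kernel flushes it on its own schedule.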
Simple use case
● User hits “Save” in Word Processor
● Expects that data to be on disk when saved
● If power goes out
● The last saved version of my data is there
● If there isn't an explicit save, some recent version of
my data should be okay
Buggy implementation
struct wp_doc {
    char *document;
    size_t len;
};
struct wp_doc d;
...
FILE *f;
f = fopen("document.txt", "w");
fwrite(d.document, d.len, 1, f);
fclose(f);
Bugs
● No error checking
● fopen (did we open the file?)
● fwrite (did we write the entire file?)
● Crash in the middle of fwrite()
● document corrupted
● No sync
● fclose() does not imply fsync()!
Reliable implementation
struct wp_doc {
    char *document;
    size_t len;
};
struct wp_doc d;
...
FILE *f;
size_t n;
int err;
/* temp file: the old document.txt stays intact until rename() */
f = fopen(".document.txt", "w");
if (!f) return errno;                                   /* error checking */
n = fwrite(d.document, d.len, 1, f);
if (n != 1) { err = errno; fclose(f); return err; }
/* flush libc buffer */
if (fflush(f) != 0) { err = errno; fclose(f); return err; }
/* sync to disk before rename */
if (fsync(fileno(f)) == -1) { err = errno; fclose(f); return err; }
if (fclose(f) != 0) return errno;
if (rename(".document.txt", "document.txt") == -1) return errno;
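The same sequence can be wrapped in a self-contained function. This is a sketch under stated assumptions rather than the talk's code: save_document() and abort_save() are hypothetical names, the caller passes both paths, and removing the temporary file on failure is an addition the slide does not show.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Close the stream, remove the temporary file and hand back the error code. */
int abort_save(FILE *f, const char *tmp_path, int err)
{
    fclose(f);
    remove(tmp_path);
    return err;
}

/*
 * Write data to tmp_path, make it durable, then rename it over final_path.
 * Returns 0 on success or an errno value on failure; final_path is left
 * untouched unless every earlier step succeeded.
 */
int save_document(const char *tmp_path, const char *final_path,
                  const char *data, size_t len)
{
    FILE *f;
    int err;

    assert(tmp_path != NULL && final_path != NULL && data != NULL);
    f = fopen(tmp_path, "w");                       /* temp file */
    if (f == NULL)
        return errno;
    if (len > 0 && fwrite(data, len, 1, f) != 1)    /* error checking */
        return abort_save(f, tmp_path, errno ? errno : EIO);
    if (fflush(f) != 0)                             /* flush libc buffer */
        return abort_save(f, tmp_path, errno);
    if (fsync(fileno(f)) == -1)                     /* sync to disk */
        return abort_save(f, tmp_path, errno);
    if (fclose(f) != 0) {
        err = errno;
        remove(tmp_path);
        return err;
    }
    if (rename(tmp_path, final_path) == -1) {       /* only now replace */
        err = errno;
        remove(tmp_path);
        return err;
    }
    return 0;
}
```

A call such as save_document(".document.txt", "document.txt", d.document, d.len) leaves either the previous version or the complete new version under the final name.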
References
● Block I/O layer tracing - blktrace:
http://www.mimuw.edu.pl/~lichota/09-10/Optymalizacja-open-source/Materialy/10%20-%20Dysk/gelato_ICE06apr_blktrace_brunelle_hp.pdf
● Eat my data:
http://www.flamingspork.com/talks/2007/06/eat_my_data.odp
● fsync() problems with Firefox:
http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/
● Linux documentation
● Documentation/sysctl/vm.txt
● Documentation/laptops/laptop-mode.txt
Q/A
● You're very welcome!
● Twitter
● @arighi
● #bem2014

More Related Content

What's hot

Let's trace Linux Lernel with KGDB @ COSCUP 2021
Let's trace Linux Lernel with KGDB @ COSCUP 2021Let's trace Linux Lernel with KGDB @ COSCUP 2021
Let's trace Linux Lernel with KGDB @ COSCUP 2021
Jian-Hong Pan
 
Much Faster Networking
Much Faster NetworkingMuch Faster Networking
Much Faster Networking
C4Media
 
Linux kernel memory allocators
Linux kernel memory allocatorsLinux kernel memory allocators
Linux kernel memory allocators
Hao-Ran Liu
 
Introduction to open_sbi
Introduction to open_sbiIntroduction to open_sbi
Introduction to open_sbi
Nylon
 
introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack
monad bobo
 
DPDK & Layer 4 Packet Processing
DPDK & Layer 4 Packet ProcessingDPDK & Layer 4 Packet Processing
DPDK & Layer 4 Packet Processing
Michelle Holley
 
Linux Locking Mechanisms
Linux Locking MechanismsLinux Locking Mechanisms
Linux Locking Mechanisms
Kernel TLV
 
Browsing Linux Kernel Source
Browsing Linux Kernel SourceBrowsing Linux Kernel Source
Browsing Linux Kernel Source
Motaz Saad
 
Arm device tree and linux device drivers
Arm device tree and linux device driversArm device tree and linux device drivers
Arm device tree and linux device drivers
Houcheng Lin
 
Hands-on ethernet driver
Hands-on ethernet driverHands-on ethernet driver
Hands-on ethernet driver
SUSE Labs Taipei
 
U Boot or Universal Bootloader
U Boot or Universal BootloaderU Boot or Universal Bootloader
U Boot or Universal Bootloader
Satpal Parmar
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
ScyllaDB
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
Vipin Varghese
 
Linux Initialization Process (1)
Linux Initialization Process (1)Linux Initialization Process (1)
Linux Initialization Process (1)
shimosawa
 
FD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingFD.IO Vector Packet Processing
FD.IO Vector Packet Processing
Kernel TLV
 
Linux Kernel Booting Process (1) - For NLKB
Linux Kernel Booting Process (1) - For NLKBLinux Kernel Booting Process (1) - For NLKB
Linux Kernel Booting Process (1) - For NLKB
shimosawa
 
Linux Network Stack
Linux Network StackLinux Network Stack
Linux Network Stack
Adrien Mahieux
 
Linux Internals - Part I
Linux Internals - Part ILinux Internals - Part I
ucOS
ucOSucOS
BusyBox for Embedded Linux
BusyBox for Embedded LinuxBusyBox for Embedded Linux
BusyBox for Embedded Linux
Emertxe Information Technologies Pvt Ltd
 

What's hot (20)

Let's trace Linux Lernel with KGDB @ COSCUP 2021
Let's trace Linux Lernel with KGDB @ COSCUP 2021Let's trace Linux Lernel with KGDB @ COSCUP 2021
Let's trace Linux Lernel with KGDB @ COSCUP 2021
 
Much Faster Networking
Much Faster NetworkingMuch Faster Networking
Much Faster Networking
 
Linux kernel memory allocators
Linux kernel memory allocatorsLinux kernel memory allocators
Linux kernel memory allocators
 
Introduction to open_sbi
Introduction to open_sbiIntroduction to open_sbi
Introduction to open_sbi
 
introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack
 
DPDK & Layer 4 Packet Processing
DPDK & Layer 4 Packet ProcessingDPDK & Layer 4 Packet Processing
DPDK & Layer 4 Packet Processing
 
Linux Locking Mechanisms
Linux Locking MechanismsLinux Locking Mechanisms
Linux Locking Mechanisms
 
Browsing Linux Kernel Source
Browsing Linux Kernel SourceBrowsing Linux Kernel Source
Browsing Linux Kernel Source
 
Arm device tree and linux device drivers
Arm device tree and linux device driversArm device tree and linux device drivers
Arm device tree and linux device drivers
 
Hands-on ethernet driver
Hands-on ethernet driverHands-on ethernet driver
Hands-on ethernet driver
 
U Boot or Universal Bootloader
U Boot or Universal BootloaderU Boot or Universal Bootloader
U Boot or Universal Bootloader
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
 
Linux Initialization Process (1)
Linux Initialization Process (1)Linux Initialization Process (1)
Linux Initialization Process (1)
 
FD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingFD.IO Vector Packet Processing
FD.IO Vector Packet Processing
 
Linux Kernel Booting Process (1) - For NLKB
Linux Kernel Booting Process (1) - For NLKBLinux Kernel Booting Process (1) - For NLKB
Linux Kernel Booting Process (1) - For NLKB
 
Linux Network Stack
Linux Network StackLinux Network Stack
Linux Network Stack
 
Linux Internals - Part I
Linux Internals - Part ILinux Internals - Part I
Linux Internals - Part I
 
ucOS
ucOSucOS
ucOS
 
BusyBox for Embedded Linux
BusyBox for Embedded LinuxBusyBox for Embedded Linux
BusyBox for Embedded Linux
 

Viewers also liked

Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...
Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...
Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...
Anne Nicolas
 
Kernel I/O Subsystem
Kernel I/O SubsystemKernel I/O Subsystem
Kernel I/O Subsystem
Sushil Ale
 
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecks
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecksKernel Recipes 2015: Solving the Linux storage scalability bottlenecks
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecks
Anne Nicolas
 
Kernel I/O subsystem
Kernel I/O subsystemKernel I/O subsystem
Kernel I/O subsystem
AtiKa Bhatti
 
High Performance Storage Devices in the Linux Kernel
High Performance Storage Devices in the Linux KernelHigh Performance Storage Devices in the Linux Kernel
High Performance Storage Devices in the Linux Kernel
Kernel TLV
 
Chapter 13 - I/O Systems
Chapter 13 - I/O SystemsChapter 13 - I/O Systems
Chapter 13 - I/O Systems
Wayne Jones Jnr
 
Eat my data
Eat my dataEat my data
Eat my dataPeng Zuo
 
Block I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktraceBlock I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktrace
Babak Farrokhi
 
What every data programmer needs to know about disks
What every data programmer needs to know about disksWhat every data programmer needs to know about disks
What every data programmer needs to know about disks
iammutex
 
Local file systems update
Local file systems updateLocal file systems update
Local file systems update
Lukáš Czerner
 
Linux System-R.D.Sivakumar
Linux System-R.D.SivakumarLinux System-R.D.Sivakumar
Linux System-R.D.Sivakumar
Sivakumar R D .
 
VM and IO Topics in Linux
VM and IO Topics in LinuxVM and IO Topics in Linux
VM and IO Topics in Linux
cucufrog
 
Using cgroups in docker container
Using cgroups in docker containerUsing cgroups in docker container
Using cgroups in docker container
Vinay Jindal
 
Recent advances in the Linux kernel resource management
Recent advances in the Linux kernel resource managementRecent advances in the Linux kernel resource management
Recent advances in the Linux kernel resource management
OpenVZ
 
Ext4 filesystem(1)
Ext4 filesystem(1)Ext4 filesystem(1)
Ext4 filesystem(1)
Yoshihiro Yunomae
 
What's missing from upstream kernel containers? - Kir Kolyshkin, Sergey Bronn...
What's missing from upstream kernel containers? - Kir Kolyshkin, Sergey Bronn...What's missing from upstream kernel containers? - Kir Kolyshkin, Sergey Bronn...
What's missing from upstream kernel containers? - Kir Kolyshkin, Sergey Bronn...
OpenVZ
 
First steps on CentOs7
First steps on CentOs7First steps on CentOs7
First steps on CentOs7
Marc Cortinas Val
 
4. linux file systems
4. linux file systems4. linux file systems
4. linux file systems
Marian Marinov
 
DMA Survival Guide
DMA Survival GuideDMA Survival Guide
DMA Survival Guide
Kernel TLV
 

Viewers also liked (20)

Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...
Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...
Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...
 
Kernel I/O Subsystem
Kernel I/O SubsystemKernel I/O Subsystem
Kernel I/O Subsystem
 
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecks
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecksKernel Recipes 2015: Solving the Linux storage scalability bottlenecks
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecks
 
Kernel I/O subsystem
Kernel I/O subsystemKernel I/O subsystem
Kernel I/O subsystem
 
High Performance Storage Devices in the Linux Kernel
High Performance Storage Devices in the Linux KernelHigh Performance Storage Devices in the Linux Kernel
High Performance Storage Devices in the Linux Kernel
 
Chapter 13 - I/O Systems
Chapter 13 - I/O SystemsChapter 13 - I/O Systems
Chapter 13 - I/O Systems
 
Eat my data
Eat my dataEat my data
Eat my data
 
Block I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktraceBlock I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktrace
 
What every data programmer needs to know about disks
What every data programmer needs to know about disksWhat every data programmer needs to know about disks
What every data programmer needs to know about disks
 
Local file systems update
Local file systems updateLocal file systems update
Local file systems update
 
Linux System-R.D.Sivakumar
Linux System-R.D.SivakumarLinux System-R.D.Sivakumar
Linux System-R.D.Sivakumar
 
VM and IO Topics in Linux
VM and IO Topics in LinuxVM and IO Topics in Linux
VM and IO Topics in Linux
 
Using cgroups in docker container
Using cgroups in docker containerUsing cgroups in docker container
Using cgroups in docker container
 
Recent advances in the Linux kernel resource management
Recent advances in the Linux kernel resource managementRecent advances in the Linux kernel resource management
Recent advances in the Linux kernel resource management
 
Ext4 filesystem(1)
Ext4 filesystem(1)Ext4 filesystem(1)
Ext4 filesystem(1)
 
What's missing from upstream kernel containers? - Kir Kolyshkin, Sergey Bronn...
What's missing from upstream kernel containers? - Kir Kolyshkin, Sergey Bronn...What's missing from upstream kernel containers? - Kir Kolyshkin, Sergey Bronn...
What's missing from upstream kernel containers? - Kir Kolyshkin, Sergey Bronn...
 
Tuning Linux for MongoDB
Tuning Linux for MongoDBTuning Linux for MongoDB
Tuning Linux for MongoDB
 
First steps on CentOs7
First steps on CentOs7First steps on CentOs7
First steps on CentOs7
 
4. linux file systems
4. linux file systems4. linux file systems
4. linux file systems
 
DMA Survival Guide
DMA Survival GuideDMA Survival Guide
DMA Survival Guide
 

Similar to Understand and optimize Linux I/O

Dfrws eu 2014 rekall workshop
Dfrws eu 2014 rekall workshopDfrws eu 2014 rekall workshop
Dfrws eu 2014 rekall workshopTamas K Lengyel
 
E bpf and dynamic tracing for mariadb db as (mariadb day during fosdem 2020)
E bpf and dynamic tracing for mariadb db as (mariadb day during fosdem 2020)E bpf and dynamic tracing for mariadb db as (mariadb day during fosdem 2020)
E bpf and dynamic tracing for mariadb db as (mariadb day during fosdem 2020)
Valeriy Kravchuk
 
Proper Care and Feeding of a MySQL Database for Busy Linux Administrators
Proper Care and Feeding of a MySQL Database for Busy Linux AdministratorsProper Care and Feeding of a MySQL Database for Busy Linux Administrators
Proper Care and Feeding of a MySQL Database for Busy Linux Administrators
Dave Stokes
 
The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...
The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...
The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...
Dave Stokes
 
Operating Systems: Revision
Operating Systems: RevisionOperating Systems: Revision
Operating Systems: Revision
Damian T. Gordon
 
Pen Testing Development
Pen Testing DevelopmentPen Testing Development
Pen Testing Development
CTruncer
 
Caching in (DevoxxUK 2013)
Caching in (DevoxxUK 2013)Caching in (DevoxxUK 2013)
Caching in (DevoxxUK 2013)
RichardWarburton
 
Backups
BackupsBackups
Backups
Payal Singh
 
Troubleshooting MySQL from a MySQL Developer Perspective
Troubleshooting MySQL from a MySQL Developer PerspectiveTroubleshooting MySQL from a MySQL Developer Perspective
Troubleshooting MySQL from a MySQL Developer Perspective
Marcelo Altmann
 
Sql server performance tuning and optimization
Sql server performance tuning and optimizationSql server performance tuning and optimization
Sql server performance tuning and optimization
Manish Rawat
 
PL22 - Backup and Restore Performance.pptx
PL22 - Backup and Restore Performance.pptxPL22 - Backup and Restore Performance.pptx
PL22 - Backup and Restore Performance.pptx
Vinicius M Grippa
 
HKG15-409: ARM Hibernation enablement on SoCs - a case study
HKG15-409: ARM Hibernation enablement on SoCs - a case studyHKG15-409: ARM Hibernation enablement on SoCs - a case study
HKG15-409: ARM Hibernation enablement on SoCs - a case study
Linaro
 
Advanced MySql Data-at-Rest Encryption in Percona Server
Advanced MySql Data-at-Rest Encryption in Percona ServerAdvanced MySql Data-at-Rest Encryption in Percona Server
Advanced MySql Data-at-Rest Encryption in Percona Server
Severalnines
 
Threads and processes
Threads and processesThreads and processes
Threads and processes
Fungirayiini Chiweshe Mushaninga
 
New Jersey Red Hat Users Group Presentation: Provisioning anywhere
New Jersey Red Hat Users Group Presentation: Provisioning anywhereNew Jersey Red Hat Users Group Presentation: Provisioning anywhere
New Jersey Red Hat Users Group Presentation: Provisioning anywhere
Rodrique Heron
 
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json  postgre-sql vs. mongodbPGConf APAC 2018 - High performance json  postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC
 
Introduction to Docker (as presented at December 2013 Global Hackathon)
Introduction to Docker (as presented at December 2013 Global Hackathon)Introduction to Docker (as presented at December 2013 Global Hackathon)
Introduction to Docker (as presented at December 2013 Global Hackathon)
Jérôme Petazzoni
 
Dynamic tracing of MariaDB on Linux - problems and solutions (MariaDB Server ...
Dynamic tracing of MariaDB on Linux - problems and solutions (MariaDB Server ...Dynamic tracing of MariaDB on Linux - problems and solutions (MariaDB Server ...
Dynamic tracing of MariaDB on Linux - problems and solutions (MariaDB Server ...
Valeriy Kravchuk
 

Similar to Understand and optimize Linux I/O (20)

Dfrws eu 2014 rekall workshop
Dfrws eu 2014 rekall workshopDfrws eu 2014 rekall workshop
Dfrws eu 2014 rekall workshop
 
E bpf and dynamic tracing for mariadb db as (mariadb day during fosdem 2020)
E bpf and dynamic tracing for mariadb db as (mariadb day during fosdem 2020)E bpf and dynamic tracing for mariadb db as (mariadb day during fosdem 2020)
E bpf and dynamic tracing for mariadb db as (mariadb day during fosdem 2020)
 
Proper Care and Feeding of a MySQL Database for Busy Linux Administrators
Proper Care and Feeding of a MySQL Database for Busy Linux AdministratorsProper Care and Feeding of a MySQL Database for Busy Linux Administrators
Proper Care and Feeding of a MySQL Database for Busy Linux Administrators
 
The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...
The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...
The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...
 
The Accidental DBA
The Accidental DBAThe Accidental DBA
The Accidental DBA
 
Measuring Firebird Disk I/O
Measuring Firebird Disk I/OMeasuring Firebird Disk I/O
Measuring Firebird Disk I/O
 
Operating Systems: Revision
Operating Systems: RevisionOperating Systems: Revision
Operating Systems: Revision
 
Pen Testing Development
Pen Testing DevelopmentPen Testing Development
Pen Testing Development
 
Caching in (DevoxxUK 2013)
Caching in (DevoxxUK 2013)Caching in (DevoxxUK 2013)
Caching in (DevoxxUK 2013)
 
Backups
BackupsBackups
Backups
 
Troubleshooting MySQL from a MySQL Developer Perspective
Troubleshooting MySQL from a MySQL Developer PerspectiveTroubleshooting MySQL from a MySQL Developer Perspective
Troubleshooting MySQL from a MySQL Developer Perspective
 
Sql server performance tuning and optimization
Sql server performance tuning and optimizationSql server performance tuning and optimization
Sql server performance tuning and optimization
 
PL22 - Backup and Restore Performance.pptx
PL22 - Backup and Restore Performance.pptxPL22 - Backup and Restore Performance.pptx
PL22 - Backup and Restore Performance.pptx
 
HKG15-409: ARM Hibernation enablement on SoCs - a case study
HKG15-409: ARM Hibernation enablement on SoCs - a case studyHKG15-409: ARM Hibernation enablement on SoCs - a case study
HKG15-409: ARM Hibernation enablement on SoCs - a case study
 
Advanced MySql Data-at-Rest Encryption in Percona Server
Advanced MySql Data-at-Rest Encryption in Percona ServerAdvanced MySql Data-at-Rest Encryption in Percona Server
Advanced MySql Data-at-Rest Encryption in Percona Server
 
Threads and processes
Threads and processesThreads and processes
Threads and processes
 
New Jersey Red Hat Users Group Presentation: Provisioning anywhere
New Jersey Red Hat Users Group Presentation: Provisioning anywhereNew Jersey Red Hat Users Group Presentation: Provisioning anywhere
New Jersey Red Hat Users Group Presentation: Provisioning anywhere
 
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json  postgre-sql vs. mongodbPGConf APAC 2018 - High performance json  postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
 
Introduction to Docker (as presented at December 2013 Global Hackathon)
Introduction to Docker (as presented at December 2013 Global Hackathon)Introduction to Docker (as presented at December 2013 Global Hackathon)
Introduction to Docker (as presented at December 2013 Global Hackathon)
 
Dynamic tracing of MariaDB on Linux - problems and solutions (MariaDB Server ...
Dynamic tracing of MariaDB on Linux - problems and solutions (MariaDB Server ...Dynamic tracing of MariaDB on Linux - problems and solutions (MariaDB Server ...
Dynamic tracing of MariaDB on Linux - problems and solutions (MariaDB Server ...
 

More from Andrea Righi

Eco-friendly Linux kernel development
Eco-friendly Linux kernel developmentEco-friendly Linux kernel development
Eco-friendly Linux kernel development
Andrea Righi
 
Linux kernel bug hunting
Linux kernel bug huntingLinux kernel bug hunting
Linux kernel bug hunting
Andrea Righi
 
Kernel bug hunting
Kernel bug huntingKernel bug hunting
Kernel bug hunting
Andrea Righi
 
Spying on the Linux kernel for fun and profit
Spying on the Linux kernel for fun and profitSpying on the Linux kernel for fun and profit
Spying on the Linux kernel for fun and profit
Andrea Righi
 
Linux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloudLinux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloud
Andrea Righi
 

More from Andrea Righi (7)

Eco-friendly Linux kernel development
Eco-friendly Linux kernel developmentEco-friendly Linux kernel development
Eco-friendly Linux kernel development
 
Linux kernel bug hunting
Linux kernel bug huntingLinux kernel bug hunting
Linux kernel bug hunting
 
Kernel bug hunting
Kernel bug huntingKernel bug hunting
Kernel bug hunting
 
Spying on the Linux kernel for fun and profit
Spying on the Linux kernel for fun and profitSpying on the Linux kernel for fun and profit
Spying on the Linux kernel for fun and profit
 
Linux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloudLinux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloud
 
Debugging linux
Debugging linuxDebugging linux
Debugging linux
 
Linux boot-time
Linux boot-timeLinux boot-time
Linux boot-time
 



Understand and optimize Linux I/O

  • 1. Andrea Righi - andrea@betterlinux.com: Understand and optimize Linux I/O
  • 2. Agenda
    ● Overview
    ● I/O Monitoring
    ● I/O Tuning
    ● Reliability
    ● Q/A
  • 3. Overview
  • 4. File I/O in Linux
  • 5. READ vs WRITE
    ● READ
      ● synchronous: the CPU has to wait for the READ to complete before it can continue
      ● cached pages are easy to reclaim
    ● WRITE
      ● asynchronous: the CPU does not have to wait for the WRITE to complete before it can continue
      ● cached pages are hard to reclaim (they require I/O)
  • 6. SYNC vs ASYNC
    ● SYNC I/O READ: the kernel queues a read operation for the data and returns only after the entire block of data has been read back; meanwhile the process is in the "waiting for I/O" state (D)
    ● SYNC I/O WRITE: the kernel queues a write operation for the data and returns only after the entire block of data has been written; meanwhile the process is in the "waiting for I/O" state
    ● ASYNC I/O READ: the process repeatedly calls read() with the size of the remaining data until the entire block has been read (use select()/poll() to determine when some data is available)
    ● ASYNC I/O WRITE: the kernel updates the corresponding pages in the page cache and marks them dirty; control then returns quickly to the process, which can continue to run; the data is flushed later from a different context in a more optimal way (e.g., sequential rather than seeky blocks)
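The ASYNC I/O READ pattern from slide 6 can be sketched in C roughly as follows. This is a minimal sketch, not code from the talk: `read_full_nonblock()` is a hypothetical helper name, and it assumes a descriptor for which poll() is meaningful (pipe, socket, character device).

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/*
 * read_full_nonblock() is a hypothetical helper, not part of the slides.
 * It switches `fd` to non-blocking mode, waits in poll() until some data
 * is available, and calls read() again with the size of the data that is
 * still missing, until `len` bytes have arrived or EOF is reached.
 * Returns the number of bytes read (less than `len` on EOF), -1 on error.
 */
ssize_t read_full_nonblock(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t done = 0;
    int flags;

    assert(buf != NULL);

    flags = fcntl(fd, F_GETFL);
    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;

    while (done < len) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        ssize_t n;

        /* Sleep until the descriptor becomes readable (or hangs up). */
        if (poll(&pfd, 1, -1) == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }

        n = read(fd, p + done, len - done);
        if (n > 0)
            done += (size_t)n;      /* partial read: ask for the rest */
        else if (n == 0)
            break;                  /* EOF */
        else if (errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
            return -1;
    }
    return (ssize_t)done;
}
```

For regular files poll() always reports the descriptor as readable, so this loop is mainly useful for pipes, sockets and devices.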
  • 7. Block I/O subsystem (simplified view)
    ● Processes submit I/O requests to request queues
    ● The block I/O layer saves the context of the process that submits the request
    ● Requests can be merged and reordered by the I/O scheduler
    ● Minimize disk seeks, optimize performance, provide fairness among processes
  • 8. Plug / unplug
    ● When I/O is queued to a device, that device enters a plugged state
    ● I/O isn't immediately dispatched to the low-level device driver
    ● When a process is going to wait for the I/O to finish, the device is unplugged
    ● Plugging allows sequential requests to be merged (writing and reading bigger chunks of data avoids re-writing the same hardware blocks and improves I/O throughput)
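To illustrate why holding requests back while the device is plugged helps, here is a toy model of merging contiguous requests. It is only a sketch under simplifying assumptions: `struct toy_request` and `toy_merge_requests()` are hypothetical names, the input is assumed to be sorted by start sector, and the limits and checks that the real block layer applies are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified request: `count` sectors starting at `sector`. */
struct toy_request {
    unsigned long long sector;
    unsigned int count;
};

/*
 * Fold each request into the previous one when it starts exactly where
 * the previous one ends (a "back merge"). `reqs` must already be sorted
 * by start sector. The array is compacted in place and the number of
 * remaining requests is returned.
 */
size_t toy_merge_requests(struct toy_request *reqs, size_t n)
{
    size_t out = 0;

    assert(reqs != NULL || n == 0);
    if (n == 0)
        return 0;

    for (size_t i = 1; i < n; i++) {
        if (reqs[out].sector + reqs[out].count == reqs[i].sector)
            reqs[out].count += reqs[i].count;   /* contiguous: merge */
        else
            reqs[++out] = reqs[i];              /* gap: start a new request */
    }
    return out + 1;
}
```

Three queued 8-sector writes at consecutive offsets would leave this function as a single 24-sector request, which is the effect plugging is meant to enable.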
  • 9. Flash memory
    ● Limited number of erase cycles
    ● Flash memory blocks have to be explicitly erased before they can be written to
    ● Writes decrease flash memory lifetime
    ● Wear leveling: logical mapping to distribute writes evenly among the available physical blocks
  • 10. I/O Monitoring
  • 11. iostat
    ● Information about the request queues associated with specific block devices
    ● Number of blocks read/written, average I/O wait time, disk utilization %, ...
    ● It does not provide detailed per-I/O information (pid? uid? ...)
  • 12. iotop
    ● top-like I/O monitoring tool
    ● Disk read, write, I/O wait time percentage
    ● Still does not provide enough information on a per-I/O basis:
      ● per-block-device statistics are missing
      ● no statistics about the nature of each request
  • 13. blktrace
    ● Low-overhead monitoring tool
    ● detailed statistics per user / cgroup / thread and per block device
    ● allows tracing events for specific operations performed on each I/O entering the block I/O layer
  • 14. blktrace events
    ● Request queue entry allocated
    ● Sleep during request queue allocation
    ● Request queue insertion
    ● Front/back merge
    ● Re-queue of a request
    ● Request issued to underlying block device
    ● Request queue plug/unplug
    ● I/O remap (DM / MD)
    ● I/O split/bounce operation
    ● Request completed
    ● ...
  • 15. blktrace operations
    ● RWBS
      ● 'R' - read
      ● 'W' - write
      ● 'D' - discard
      ● 'B' - barrier
      ● 'A' - ahead
      ● 'S' - synchronous
      ● 'M' - meta-data
      ● 'N' - no data

        /* Build the RWBS string for a trace record from its action bits. */
        static void fill_rwbs(char *rwbs, struct blk_io_trace *t)
        {
            int i = 0;

            /* First character: the kind of operation. */
            if (t->action & BLK_TC_DISCARD)
                rwbs[i++] = 'D';
            else if (t->action & BLK_TC_WRITE)
                rwbs[i++] = 'W';
            else if (t->bytes)
                rwbs[i++] = 'R';
            else
                rwbs[i++] = 'N';

            /* Optional modifiers. */
            if (t->action & BLK_TC_AHEAD)
                rwbs[i++] = 'A';
            if (t->action & BLK_TC_BARRIER)
                rwbs[i++] = 'B';
            if (t->action & BLK_TC_SYNC)
                rwbs[i++] = 'S';
            if (t->action & BLK_TC_META)
                rwbs[i++] = 'M';

            rwbs[i] = '\0';
        }
  • 16. blktrace actions
    ● Actions
      ● C -- complete
      ● D -- issued
      ● I -- inserted
      ● Q -- queued
      ● B -- bounced
      ● M -- back merge
      ● F -- front merge
      ● G -- get request
      ● S -- sleep
      ● P -- plug
      ● U -- unplug
      ● T -- unplug due to timer
      ● X -- split
      ● A -- remap
      ● m -- message
  • 17. blktrace output
        # btrace /dev/sda
        ...
        8,0 1 26 0.054596889 228 Q WS 237891152 + 8 [jbd2/sda3-8]
        8,0 1 27 0.054597204 228 M WS 237891152 + 8 [jbd2/sda3-8]
        8,0 1 28 0.054597816 228 A WS 237891160 + 8 <- (8,3) 230983256
        8,0 1 29 0.054598137 228 Q WS 237891160 + 8 [jbd2/sda3-8]
        8,0 1 30 0.054598457 228 M WS 237891160 + 8 [jbd2/sda3-8]
        8,0 1 31 0.054599094 228 A WS 237891168 + 8 <- (8,3) 230983264
        8,0 1 32 0.054599399 228 Q WS 237891168 + 8 [jbd2/sda3-8]
        8,0 1 33 0.054599725 228 M WS 237891168 + 8 [jbd2/sda3-8]
    Fields: device (major,minor), CPU, sequence number, timestamp, PID, action, RWBS, start block + number of blocks, process name
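The field layout of the btrace output on slide 17 can be decoded with a single sscanf() call. This is a sketch that assumes the default output format shown on the slide; `struct btrace_event` and `parse_btrace_line()` are hypothetical names, not part of the blktrace tools.

```c
#include <assert.h>
#include <stdio.h>

/* One decoded line of btrace output (hypothetical struct for this sketch). */
struct btrace_event {
    unsigned int major, minor;      /* device */
    unsigned int cpu;
    unsigned long long seq;         /* sequence number */
    double timestamp;
    int pid;
    char action[4];                 /* e.g. "Q", "M", "A" */
    char rwbs[9];                   /* e.g. "WS" */
    unsigned long long sector;      /* start block */
    unsigned int nblocks;           /* number of blocks */
    char comm[32];                  /* process name, empty if absent */
};

/*
 * Parse one line such as
 *   8,0 1 26 0.054596889 228 Q WS 237891152 + 8 [jbd2/sda3-8]
 * Returns 0 on success, -1 if the fixed fields could not be read.
 * Lines that end differently (e.g. remap lines with "<- (8,3) ...")
 * are accepted with an empty process name.
 */
int parse_btrace_line(const char *line, struct btrace_event *ev)
{
    int n;

    assert(line != NULL && ev != NULL);
    ev->comm[0] = '\0';

    n = sscanf(line, "%u,%u %u %llu %lf %d %3s %8s %llu + %u [%31[^]]]",
               &ev->major, &ev->minor, &ev->cpu, &ev->seq, &ev->timestamp,
               &ev->pid, ev->action, ev->rwbs, &ev->sector, &ev->nblocks,
               ev->comm);

    /* The first ten fields are mandatory; the process name is optional. */
    return n >= 10 ? 0 : -1;
}
```

Feeding each line of output to `parse_btrace_line()` gives a structure that can be filtered by action or RWBS, for example to count only the 'M' (back merge) events.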
  • 18. I/O Tuning
  • 19. Dirty pages writeback
    ● Writeback is the process of writing pages back to persistent storage
    ● Dirty pages (grep Dirty /proc/meminfo)
    ● balance_dirty_pages() slows down tasks that create more dirty pages than the system can handle
      ● direct reclaim (bad I/O pattern)
      ● pause
      ● IO-less dirty throttling (>= 3.2)
    ● pdflush vs per-backing-device writeback (>= 2.6.32)
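The `grep Dirty /proc/meminfo` check from slide 19 can also be done from a program. The sketch below assumes the usual `Dirty:  <value> kB` line format; `meminfo_dirty_kb()` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Scan a meminfo-style stream for the "Dirty:" line and return its
 * value in kB, or -1 if the line is not found.
 */
long meminfo_dirty_kb(FILE *f)
{
    char line[256];
    long kb;

    assert(f != NULL);

    while (fgets(line, sizeof(line), f)) {
        /* Matches only lines that start with "Dirty:". */
        if (sscanf(line, "Dirty: %ld kB", &kb) == 1)
            return kb;
    }
    return -1;
}
```

A caller would typically open `/proc/meminfo` with fopen(), pass the stream to `meminfo_dirty_kb()`, and sample the value periodically while a write-heavy workload runs.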
  • 20. Background vs direct cleaning
    ● From Documentation/sysctl/vm.txt:
      ● Background cleaning (kernel flusher threads):
        – /proc/sys/vm/dirty_background_ratio
        – /proc/sys/vm/dirty_background_bytes
      ● Direct cleaning (normal tasks generating disk writes):
        – /proc/sys/vm/dirty_ratio
        – /proc/sys/vm/dirty_bytes
  • 21. Flusher thread tuning
    ● /proc/sys/vm/dirty_writeback_centisecs
      ● Wake up the kernel flusher threads every dirty_writeback_centisecs (hundredths of a second)
    ● /proc/sys/vm/dirty_expire_centisecs
      ● Define when dirty data is old enough to be eligible for writeout by the kernel flusher threads
  • 22. Swap I/O
    ● /proc/sys/vm/swappiness
      ● anonymous vs file LRU scanning ratio
        – high value: aggressive swap
        – low value: aggressive file pages reclaim
  • 23. Filesystem I/O
    ● ext3: data=journal | ordered | writeback
      ● journal: meta-data and data are both committed to the journal
      ● ordered: data is committed before its meta-data
      ● writeback: meta-data and data are committed out of order
    ● ext4: delayed allocation
      ● block allocation is deferred until background writeback
      ● improves the chances of using contiguous blocks (threads writing at different offsets simultaneously)
    ● xfs, ext4, zfs, …
      ● zero-length file problem:
        – open-write-close-rename
  • 24. Filesystem I/O tuning
    ● noatime, nodiratime:
      ● do not update inode access times
    ● relatime:
      ● the access time is updated only if the previous access time was earlier than the current modify or change time (this doesn't break applications like mutt that need to know whether a file has been read since it was last modified)
    ● commit=N
      ● sync data and meta-data every N seconds (default = 5s)
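The relatime rule on slide 24 boils down to a comparison of three timestamps. The sketch below implements only the rule as stated on the slide; `struct toy_times` and `relatime_needs_update()` are hypothetical names, and any additional conditions the kernel applies are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* The three inode timestamps involved (hypothetical struct for this sketch). */
struct toy_times {
    time_t atime;   /* last access */
    time_t mtime;   /* last data modification */
    time_t ctime;   /* last status change */
};

/*
 * Slide rule: update the access time only if the previous access time
 * is earlier than the current modify or change time.
 */
bool relatime_needs_update(const struct toy_times *t)
{
    assert(t != NULL);
    return t->atime < t->mtime || t->atime < t->ctime;
}
```

With this rule a file that is read many times after its last modification triggers at most one access-time write, while a mail reader can still detect "modified since last read" by comparing atime and mtime.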
  • 25. I/O tuning at different layers
    ● Applications
      ● LD_PRELOAD
    ● VM
      ● caching
    ● Filesystem
      ● mount options / filesystem tuning
    ● Block device
      ● caching
  • 26. Reliability
  • 27. I/O data flow
    ● Application to library buffer
      ● fwrite(), fprintf(), etc.
    ● Library to OS buffer
      ● write()
    ● OS buffer to disk
      ● paged out, periodic flush (usually every 5 seconds)
      ● fsync(), fdatasync(), sync(), sync_file_range()
  • 28. Simple use case
    ● User hits "Save" in a word processor
      ● Expects the data to be on disk once it is saved
    ● If the power goes out
      ● The last saved version of my data is there
      ● If there wasn't an explicit save, some recent version of my data should be okay
  • 29. Buggy implementation
        struct wp_doc {
            char *document;
            size_t len;
        };
        struct wp_doc d;
        ...
        FILE *f;
        f = fopen("document.txt", "w");
        fwrite(d.document, d.len, 1, f);
        fclose(f);
  • 30. Bugs
    ● No error checking
      ● fopen (did we open the file?)
      ● fwrite (did we write the entire file?)
    ● Crash in the middle of fwrite()
      ● document corrupted
    ● No sync
      ● close does not imply sync()!
  • 31. Reliable implementation
        struct wp_doc {
            char *document;
            size_t len;
        };
        struct wp_doc d;
        ...
        FILE *f;
        size_t len;

        /* temp file: write to a hidden file, never over the old version */
        f = fopen(".document.txt", "w");
        if (!f)
            return errno;
        /* error checking: exactly one item of d.len bytes must be written */
        len = fwrite(d.document, d.len, 1, f);
        if (len != 1) { fclose(f); return errno; }
        /* flush libc buffer */
        if (fflush(f) != 0) { fclose(f); return errno; }
        /* sync to disk before rename */
        if (fsync(fileno(f)) == -1) { fclose(f); return errno; }
        fclose(f);
        rename(".document.txt", "document.txt");
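The fragment on slide 31 can be wrapped into a complete function. The following is a sketch of one possible arrangement, not the author's code: `save_file_atomic()` and its parameters are hypothetical, and unlike the slide it also checks fclose() and rename() and removes the temporary file on failure.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write `len` bytes from `data` to `tmp_path`, push them through the
 * libc buffer (fflush) and the page cache (fsync), then rename the
 * temporary file over `path`. Returns 0 on success or an errno value.
 * `tmp_path` is assumed to be on the same filesystem as `path`.
 */
int save_file_atomic(const char *path, const char *tmp_path,
                     const void *data, size_t len)
{
    FILE *f;
    int err;

    assert(path != NULL && tmp_path != NULL);
    assert(data != NULL || len == 0);

    f = fopen(tmp_path, "w");
    if (!f)
        return errno;

    if (len > 0 && fwrite(data, len, 1, f) != 1)
        goto fail;                      /* short write */
    if (fflush(f) != 0)
        goto fail;                      /* libc buffer -> OS buffer */
    if (fsync(fileno(f)) == -1)
        goto fail;                      /* OS buffer -> disk */

    if (fclose(f) != 0) {
        err = errno;
        remove(tmp_path);
        return err ? err : EIO;
    }
    /* Only now replace the old version. */
    if (rename(tmp_path, path) == -1) {
        err = errno;
        remove(tmp_path);
        return err ? err : EIO;
    }
    return 0;

fail:
    err = errno;                        /* save before cleanup clobbers it */
    fclose(f);
    remove(tmp_path);
    return err ? err : EIO;
}
```

A call such as `save_file_atomic("document.txt", ".document.txt", d.document, d.len)` mirrors the slide. A stricter variant would also fsync() the containing directory after rename() so that the new directory entry itself is durable.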
  • 32. References
    ● Block I/O layer tracing - blktrace: http://www.mimuw.edu.pl/~lichota/09-10/Optymalizacja-open-source/Materialy/10%20-%20Dysk/gelato_ICE06apr_blktrace_brunelle_hp.pdf
    ● Eat my data: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp
    ● fsync() problems with Firefox: http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/
    ● Linux documentation
      ● Documentation/sysctl/vm.txt
      ● Documentation/laptops/laptop-mode.txt
  • 33. Q/A
    ● You're very welcome!
    ● Twitter
      ● @arighi
      ● #bem2014