SlideShare a Scribd company logo
1 of 42
Download to read offline
The Linux Block Layer
Built for Fast Storage
Light up your cloud!
Sagi Grimberg KernelTLV
27/6/18
1
First off, Happy 1’st birthday Roni!
2
Who am I?
• Co-founder and Principal Architect @ Lightbits Labs
• LightBits Labs is a stealth-mode startup pushing the software and hardware technology
boundaries in cloud-scale storage.
• We are looking for excellent people who enjoy a challenge for a variety of positions, including
both software and hardware. More information on our website at
http://www.lightbitslabs.com/#join or talk with me after the talk.
• Active contributor to Linux I/O and RDMA stack
• I am the maintainer of the iSCSI over RDMA (iSER) drivers
• I co-maintain the Linux NVMe subsystem
• Used to work for Mellanox on Storage, Networking, RDMA and
pretty much everything in between…
3
Where were we 10 years ago
• Only rotating storage devices exist.
• Devices were limited to hundreds of IOPs
• Devices access latency was in the milliseconds ballpark
• The Linux block layer was sufficient to handle these devices
• High performance applications found clever ways to avoid storage
access as much as possible
4
What happened? (hint: HW)
• Flash SSDs started appearing in the DataCenter
• IOPs went from Hundreds to Hundreds of thousands to Millions
• Latency went from Milliseconds to Microseconds
• Fast Interfaces evolved: PCIe (NVMe)
• Processors core count increased a lot!
• And NUMA...
5
I/O Stack
6
What are the issues?
• Existing I/O stack had a lot of data sharing
• Between different applications (running on different cores)
• Between submission and completion
• Locking for synchronization
• Zero NUMA awareness
• All stack heuristics and optimizations centered around slow
storage
• The result is very bad (even negative) scaling, spending lots of CPU
cycles and much much higher latencies.
7
I/O Stack - Little deeper
8
I/O Stack - Little deeper
9
Hmmm...
- Request are serialized
- Placed for staging
- Retrieved by the drivers
⇒ Lots of shared state!
I/O Stack - Performance
10
I/O Stack - Performance
11
Workaround: Bypass the the request layer
12
Problems with bypass:
● Give up flow control
● Give up error handling
● Give up statistics
● Give up tagging and indexing
● Give up I/O deadlines
● Give up I/O scheduling
● Crazy code duplication -
mistakes are copied because
people get stuff wrong...
Most importantly, this is not the
Linux design approach!
Enter Block Multiqueue
• The old stack does not consist of “one serialization point”
• The stack needed a complete re-write from ground up
• What do we do:
• Go look at the networking stack which solved the exact same issue 10+
years ago.
• But build from scratch for storage devices
13
Block Multiqueue - Goals
• Linear Scaling with CPU cores
• Split shared state between applications and
submission/completion
• Careful locality awareness: Cachelines, NUMA
• Pre-allocate resources as much as possible
• Provide full helper functionality - ease of implementation
• Support all existing HW
• Become THE queueing mode, not a “3’rd one”
14
Block Multiqueue - Architecture
15
Block Multiqueue - Features
• Efficient tagging
• Locality of submissions and completions
• Extremely aware to minimize cache pollutions
• Smart error handling - minimum intrusion to the hot path
• Smart cpu <-> queue mappings
• Clean API
• Easy conversion (usually just cleanup old cruft)
16
Block Multiqueue - I/O Flow
17
Block Multiqueue - Completions
18
● Applications are usually “cpu-sticky”
● If I/O completion comes on the
“correct” cpu, complete it
● Else, IPI to the “correct” cpu
Block Multiqueue - Tagging
19
• Almost every modern HW supports queueing
• Tags are used to identify individual I/Os in the presence of
out-of-order completions
• Tags are limited by capabilities of the HW, driver needs to flow
control
Block Multiqueue - Tagging
20
• PerCPU Cacheline aware scalable bitmaps
• Efficient at near-exhaustion
• Rolling wake-ups
• Maps 1x1 with HW usage - no driver specific tagging
Block Multiqueue - Pre-allocations
21
• Eliminate hot path allocations
• Allocate all the requests memory at initialization time
• Tag and request allocations are combined (no two step allocation)
• No driver per-request allocation
• Driver context and SG lists are placed in “slack space” behind the request
Block Multiqueue - Performance
22
Test-Case:
- null_blk driver
- fio
- 4K sync random read
- Dual socket system
Block Multiqueue - perf profiling
23
• Locking time is drastically reduced
• FIO reports much less “system time”
• Average and tail latencies are much lower and consistent
Next on the Agenda: SCSI, NVMe and friends
• NVMe started as a bypass driver - converted to blk-mq
• mtip32xx (Micron)
• virtio_blk, xen
• rbd (ceph)
• loop
• more...
• SCSI midlayer was a bigger project..
24
SCSI multiqueue
• Needed the concept of “shared tag sets”
• Tags are now a property of the HBA and not the storage device
• Needed a chunking of scatter-gather lists
• SCSI HBAs support huge sg lists, two much to allocate up front
• Needed “Head of queue” insertion
• For SCSI complex error handling
• Removed the “big scsi host_busy lock”
• reduced the huge contention on the scsi target “busy” atomic
• Needed Partial completion support
• Needed BIDI support (yukk..)
• Hardened the stack a lot with lots of user bug reports.
25
Block multiqueue - MSI(X) based queue mapping
26
● Motivation: Eliminate the IPI case
● Expose MSI(X) vector affinity
mappings to the block layer
● Map the HW context mappings via
the underlying device IRQ mappings
● Offer MSI(X) allocation and correct
affinity spreading via the PCI
subsystem
● Take advantage in pci based drivers
(nvme, rdma, fc, hpsa, etc..)
But wait, what about I/O schedulers?
• What we didn’t mention was that block multiqueue lacked a
proper I/O scheduler for approximately 3 years!
• A fundamental part of the I/O stack functionality is scheduling
• To optimize I/O sequentiality - Elevator algorithm
• Prevent write vs. read starvation (i.e. deadline scheduler)
• Fairness enforcement (i.e. CFQ)
• One can argue that I/O scheduling was designed for rotating media
• Optimized for reducing actuator seek time
NOT NECESSARILY TRUE - Flash can benefit scheduling!
27
Start from ground up: WriteBack Throttling
• Linux since the dawn of times sucked at buffered I/O
• Writes are naturally buffered and committed to disk in the
background
• Needs to have little or no impact on foreground activity
• What was needed:
• Plumb I/O stats for submitted reads and writes
• Track average latency in window granularity and what is currently enqueued
• Scale queue depth accordingly
• Prefer reads over non-directIO writes
28
WriteBack Throttling - Performance
29
Before... After...
Now we are ready for I/O schedules - MQ-Deadline
• Added I/O interception of requests for building schedulers on top
• First MQ conversion was for deadline scheduler
• Pretty easy and straightforward
• Just delay writes FIFO until deadline hits
• Reads FIFO are pass-through
• All percpu context - tradeoff?
• Remember: I/O scheduler can hurt synthetic workloads, but impact on
real life workloads.
30
Next: Kyber I/O Scheduler
• Targeted for fast multi-queue devices
• Lightweight
• Prefers reads over writes
• All I/Os are split into two queues (reads and writes)
• Reads are typically preferred
• Writes are throttled but not to a point of starvation
• The key is to keep submission queues short to guarantee latency
targets
• Kyber tracks I/O latency stats and adjust queue size accordingly
• Aware of flash background operations.
31
Next: BFQ I/O Scheduler
• Budget fair queueing scheduler
• A lot heavier
• Maintain Per-Process I/O budget
• Maintain bunch of Per-Process heuristics
• Yields the “best” I/O to queue at any given time
• A better fit for slower storage, especially rotating media and cheap &
deep SSDs.
32
But wait #2: What about Ultra-low latency devices
• New media is emerging with Ultra low latency (1-2 us)
• 3D-Xpoint
• Z-NAND
• Even with block MQ, the Linux I/O stack still has issues providing these
latencies
• It starts with IRQ (interrupt handling)
• If I/O is so fast, we might want to poll for completion and avoid paying the
cost of MSI(X) interrupt
33
Interrupt based I/O completion model
34
Polling based I/O completion model
35
IRQ vs. Polling
36
• Polling can remove the extra context switch from the completion
handling
So we should support polling!
37
• Add selective polling syscall interface:
• Use preadv2/pwritev2 with flag IOCB_HIGHPRI
• Saves roughly 25% of added latency
But what about CPU% - can we go hybrid?
38
• Yes!
• We have all the statistics framework in place, let’s use it for hybrid polling!
• Wake up poller after ½ of the mean latency.
Hybrid polling - Performance
39
Hybrid polling - Adjust to I/O size
40
• Block layer sees I/Os of different sizes.
• Some are 4k, some are 256K and some or 1-2MB
• We need to consider that when tracking stats for Polling considerations
• Simple solution: Bucketize stats...
• 0-4k
• 4-16k
• 16k-64k
• >64k
• Now Hybrid polling has good QoS!
To Conclude
41
• Lots of interesting stuff happening in Linux
• Linux belongs to everyone, Get involved!
• We always welcome patches and bug reports :)
42
LIGHT UP YOUR CLOUD!

More Related Content

What's hot

Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing GuideJose De La Rosa
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)Brendan Gregg
 
Kernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at NetflixKernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at NetflixBrendan Gregg
 
Implementation &amp; Comparison Of Rdma Over Ethernet
Implementation &amp; Comparison Of Rdma Over EthernetImplementation &amp; Comparison Of Rdma Over Ethernet
Implementation &amp; Comparison Of Rdma Over EthernetJames Wernicke
 
BPF - in-kernel virtual machine
BPF - in-kernel virtual machineBPF - in-kernel virtual machine
BPF - in-kernel virtual machineAlexei Starovoitov
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsBrendan Gregg
 
Ceph RBD Update - June 2021
Ceph RBD Update - June 2021Ceph RBD Update - June 2021
Ceph RBD Update - June 2021Ceph Community
 
SFO15-TR9: PSCI, ACPI (and UEFI to boot)
SFO15-TR9: PSCI, ACPI (and UEFI to boot)SFO15-TR9: PSCI, ACPI (and UEFI to boot)
SFO15-TR9: PSCI, ACPI (and UEFI to boot)Linaro
 
Linux Initialization Process (2)
Linux Initialization Process (2)Linux Initialization Process (2)
Linux Initialization Process (2)shimosawa
 
Kernel Recipes 2019 - Faster IO through io_uring
Kernel Recipes 2019 - Faster IO through io_uringKernel Recipes 2019 - Faster IO through io_uring
Kernel Recipes 2019 - Faster IO through io_uringAnne Nicolas
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDBSage Weil
 
malloc & vmalloc in Linux
malloc & vmalloc in Linuxmalloc & vmalloc in Linux
malloc & vmalloc in LinuxAdrian Huang
 
Kernel_Crash_Dump_Analysis
Kernel_Crash_Dump_AnalysisKernel_Crash_Dump_Analysis
Kernel_Crash_Dump_AnalysisBuland Singh
 
Receive side scaling (RSS) with eBPF in QEMU and virtio-net
Receive side scaling (RSS) with eBPF in QEMU and virtio-netReceive side scaling (RSS) with eBPF in QEMU and virtio-net
Receive side scaling (RSS) with eBPF in QEMU and virtio-netYan Vugenfirer
 
Linux Kernel MMC Storage driver Overview
Linux Kernel MMC Storage driver OverviewLinux Kernel MMC Storage driver Overview
Linux Kernel MMC Storage driver OverviewRajKumar Rampelli
 
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...The Linux Foundation
 
Seastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for CephSeastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for CephScyllaDB
 

What's hot (20)

Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing Guide
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)
 
Kernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at NetflixKernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at Netflix
 
Implementation &amp; Comparison Of Rdma Over Ethernet
Implementation &amp; Comparison Of Rdma Over EthernetImplementation &amp; Comparison Of Rdma Over Ethernet
Implementation &amp; Comparison Of Rdma Over Ethernet
 
BPF - in-kernel virtual machine
BPF - in-kernel virtual machineBPF - in-kernel virtual machine
BPF - in-kernel virtual machine
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets
 
Ceph RBD Update - June 2021
Ceph RBD Update - June 2021Ceph RBD Update - June 2021
Ceph RBD Update - June 2021
 
Linux Huge Pages
Linux Huge PagesLinux Huge Pages
Linux Huge Pages
 
SFO15-TR9: PSCI, ACPI (and UEFI to boot)
SFO15-TR9: PSCI, ACPI (and UEFI to boot)SFO15-TR9: PSCI, ACPI (and UEFI to boot)
SFO15-TR9: PSCI, ACPI (and UEFI to boot)
 
Linux Initialization Process (2)
Linux Initialization Process (2)Linux Initialization Process (2)
Linux Initialization Process (2)
 
Qemu Pcie
Qemu PcieQemu Pcie
Qemu Pcie
 
Kernel Recipes 2019 - Faster IO through io_uring
Kernel Recipes 2019 - Faster IO through io_uringKernel Recipes 2019 - Faster IO through io_uring
Kernel Recipes 2019 - Faster IO through io_uring
 
Interrupts on xv6
Interrupts on xv6Interrupts on xv6
Interrupts on xv6
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDB
 
malloc & vmalloc in Linux
malloc & vmalloc in Linuxmalloc & vmalloc in Linux
malloc & vmalloc in Linux
 
Kernel_Crash_Dump_Analysis
Kernel_Crash_Dump_AnalysisKernel_Crash_Dump_Analysis
Kernel_Crash_Dump_Analysis
 
Receive side scaling (RSS) with eBPF in QEMU and virtio-net
Receive side scaling (RSS) with eBPF in QEMU and virtio-netReceive side scaling (RSS) with eBPF in QEMU and virtio-net
Receive side scaling (RSS) with eBPF in QEMU and virtio-net
 
Linux Kernel MMC Storage driver Overview
Linux Kernel MMC Storage driver OverviewLinux Kernel MMC Storage driver Overview
Linux Kernel MMC Storage driver Overview
 
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
 
Seastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for CephSeastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for Ceph
 

Similar to The Linux Block Layer - Built for Fast Storage

Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications OpenEBS
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsHPCC Systems
 
Porting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_GriffinPorting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_GriffinPeter Griffin
 
System On Chip (SOC)
System On Chip (SOC)System On Chip (SOC)
System On Chip (SOC)Shivam Gupta
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHungWei Chiu
 
Intel® hyper threading technology
Intel® hyper threading technologyIntel® hyper threading technology
Intel® hyper threading technologyAmirali Sharifian
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyserAlex Moskvin
 
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecks
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecksKernel Recipes 2015: Solving the Linux storage scalability bottlenecks
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecksAnne Nicolas
 
Container Attached Storage (CAS) with OpenEBS - Berlin Kubernetes Meetup - Ma...
Container Attached Storage (CAS) with OpenEBS - Berlin Kubernetes Meetup - Ma...Container Attached Storage (CAS) with OpenEBS - Berlin Kubernetes Meetup - Ma...
Container Attached Storage (CAS) with OpenEBS - Berlin Kubernetes Meetup - Ma...OpenEBS
 
LMAX Disruptor - High Performance Inter-Thread Messaging Library
LMAX Disruptor - High Performance Inter-Thread Messaging LibraryLMAX Disruptor - High Performance Inter-Thread Messaging Library
LMAX Disruptor - High Performance Inter-Thread Messaging LibrarySebastian Andrasoni
 
RISC Vs CISC Computer architecture and design
RISC Vs CISC Computer architecture and designRISC Vs CISC Computer architecture and design
RISC Vs CISC Computer architecture and designyousefzahdeh
 
Spil Storage Platform (Erlang) @ EUG-NL
Spil Storage Platform (Erlang) @ EUG-NLSpil Storage Platform (Erlang) @ EUG-NL
Spil Storage Platform (Erlang) @ EUG-NLThijs Terlouw
 
4.1 Introduction 145• In this section, we first take a gander at a.pdf
4.1 Introduction 145• In this section, we first take a gander at a.pdf4.1 Introduction 145• In this section, we first take a gander at a.pdf
4.1 Introduction 145• In this section, we first take a gander at a.pdfarpowersarps
 
The Quest for the Perfect API
The Quest for the Perfect APIThe Quest for the Perfect API
The Quest for the Perfect APImicrokerneldude
 
Tuning Linux for your database FLOSSUK 2016
Tuning Linux for your database FLOSSUK 2016Tuning Linux for your database FLOSSUK 2016
Tuning Linux for your database FLOSSUK 2016Colin Charles
 
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageWebinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageMayaData Inc
 

Similar to The Linux Block Layer - Built for Fast Storage (20)

Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
 
SoC FPGA Technology
SoC FPGA TechnologySoC FPGA Technology
SoC FPGA Technology
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
 
CPU Caches
CPU CachesCPU Caches
CPU Caches
 
Porting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_GriffinPorting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_Griffin
 
OpenCAPI Technology Ecosystem
OpenCAPI Technology EcosystemOpenCAPI Technology Ecosystem
OpenCAPI Technology Ecosystem
 
System On Chip (SOC)
System On Chip (SOC)System On Chip (SOC)
System On Chip (SOC)
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User Group
 
Intel® hyper threading technology
Intel® hyper threading technologyIntel® hyper threading technology
Intel® hyper threading technology
 
Micro controller & Micro processor
Micro controller & Micro processorMicro controller & Micro processor
Micro controller & Micro processor
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyser
 
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecks
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecksKernel Recipes 2015: Solving the Linux storage scalability bottlenecks
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecks
 
Container Attached Storage (CAS) with OpenEBS - Berlin Kubernetes Meetup - Ma...
Container Attached Storage (CAS) with OpenEBS - Berlin Kubernetes Meetup - Ma...Container Attached Storage (CAS) with OpenEBS - Berlin Kubernetes Meetup - Ma...
Container Attached Storage (CAS) with OpenEBS - Berlin Kubernetes Meetup - Ma...
 
LMAX Disruptor - High Performance Inter-Thread Messaging Library
LMAX Disruptor - High Performance Inter-Thread Messaging LibraryLMAX Disruptor - High Performance Inter-Thread Messaging Library
LMAX Disruptor - High Performance Inter-Thread Messaging Library
 
RISC Vs CISC Computer architecture and design
RISC Vs CISC Computer architecture and designRISC Vs CISC Computer architecture and design
RISC Vs CISC Computer architecture and design
 
Spil Storage Platform (Erlang) @ EUG-NL
Spil Storage Platform (Erlang) @ EUG-NLSpil Storage Platform (Erlang) @ EUG-NL
Spil Storage Platform (Erlang) @ EUG-NL
 
4.1 Introduction 145• In this section, we first take a gander at a.pdf
4.1 Introduction 145• In this section, we first take a gander at a.pdf4.1 Introduction 145• In this section, we first take a gander at a.pdf
4.1 Introduction 145• In this section, we first take a gander at a.pdf
 
The Quest for the Perfect API
The Quest for the Perfect APIThe Quest for the Perfect API
The Quest for the Perfect API
 
Tuning Linux for your database FLOSSUK 2016
Tuning Linux for your database FLOSSUK 2016Tuning Linux for your database FLOSSUK 2016
Tuning Linux for your database FLOSSUK 2016
 
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageWebinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
 

More from Kernel TLV

Building Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCBuilding Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCKernel TLV
 
SGX Trusted Execution Environment
SGX Trusted Execution EnvironmentSGX Trusted Execution Environment
SGX Trusted Execution EnvironmentKernel TLV
 
Kernel Proc Connector and Containers
Kernel Proc Connector and ContainersKernel Proc Connector and Containers
Kernel Proc Connector and ContainersKernel TLV
 
Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545Kernel TLV
 
Present Absence of Linux Filesystem Security
Present Absence of Linux Filesystem SecurityPresent Absence of Linux Filesystem Security
Present Absence of Linux Filesystem SecurityKernel TLV
 
OpenWrt From Top to Bottom
OpenWrt From Top to BottomOpenWrt From Top to Bottom
OpenWrt From Top to BottomKernel TLV
 
Make Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance ToolsMake Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance ToolsKernel TLV
 
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...Kernel TLV
 
File Systems: Why, How and Where
File Systems: Why, How and WhereFile Systems: Why, How and Where
File Systems: Why, How and WhereKernel TLV
 
netfilter and iptables
netfilter and iptablesnetfilter and iptables
netfilter and iptablesKernel TLV
 
KernelTLV Speaker Guidelines
KernelTLV Speaker GuidelinesKernelTLV Speaker Guidelines
KernelTLV Speaker GuidelinesKernel TLV
 
Userfaultfd: Current Features, Limitations and Future Development
Userfaultfd: Current Features, Limitations and Future DevelopmentUserfaultfd: Current Features, Limitations and Future Development
Userfaultfd: Current Features, Limitations and Future DevelopmentKernel TLV
 
Linux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use CasesLinux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use CasesKernel TLV
 
FD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingFD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingKernel TLV
 
WiFi and the Beast
WiFi and the BeastWiFi and the Beast
WiFi and the BeastKernel TLV
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDKKernel TLV
 
FreeBSD and Drivers
FreeBSD and DriversFreeBSD and Drivers
FreeBSD and DriversKernel TLV
 
Specializing the Data Path - Hooking into the Linux Network Stack
Specializing the Data Path - Hooking into the Linux Network StackSpecializing the Data Path - Hooking into the Linux Network Stack
Specializing the Data Path - Hooking into the Linux Network StackKernel TLV
 

More from Kernel TLV (20)

DPDK In Depth
DPDK In DepthDPDK In Depth
DPDK In Depth
 
Building Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCBuilding Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCC
 
SGX Trusted Execution Environment
SGX Trusted Execution EnvironmentSGX Trusted Execution Environment
SGX Trusted Execution Environment
 
Fun with FUSE
Fun with FUSEFun with FUSE
Fun with FUSE
 
Kernel Proc Connector and Containers
Kernel Proc Connector and ContainersKernel Proc Connector and Containers
Kernel Proc Connector and Containers
 
Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545
 
Present Absence of Linux Filesystem Security
Present Absence of Linux Filesystem SecurityPresent Absence of Linux Filesystem Security
Present Absence of Linux Filesystem Security
 
OpenWrt From Top to Bottom
OpenWrt From Top to BottomOpenWrt From Top to Bottom
OpenWrt From Top to Bottom
 
Make Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance ToolsMake Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance Tools
 
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
 
File Systems: Why, How and Where
File Systems: Why, How and WhereFile Systems: Why, How and Where
File Systems: Why, How and Where
 
netfilter and iptables
netfilter and iptablesnetfilter and iptables
netfilter and iptables
 
KernelTLV Speaker Guidelines
KernelTLV Speaker GuidelinesKernelTLV Speaker Guidelines
KernelTLV Speaker Guidelines
 
Userfaultfd: Current Features, Limitations and Future Development
Userfaultfd: Current Features, Limitations and Future DevelopmentUserfaultfd: Current Features, Limitations and Future Development
Userfaultfd: Current Features, Limitations and Future Development
 
Linux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use CasesLinux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use Cases
 
FD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingFD.IO Vector Packet Processing
FD.IO Vector Packet Processing
 
WiFi and the Beast
WiFi and the BeastWiFi and the Beast
WiFi and the Beast
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
 
FreeBSD and Drivers
FreeBSD and DriversFreeBSD and Drivers
FreeBSD and Drivers
 
Specializing the Data Path - Hooking into the Linux Network Stack
Specializing the Data Path - Hooking into the Linux Network StackSpecializing the Data Path - Hooking into the Linux Network Stack
Specializing the Data Path - Hooking into the Linux Network Stack
 

Recently uploaded

React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 

Recently uploaded (20)

React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 

The Linux Block Layer - Built for Fast Storage

  • 1. The Linux Block Layer Built for Fast Storage Light up your cloud! Sagi Grimberg KernelTLV 27/6/18 1
  • 2. First off, Happy 1’st birthday Roni! 2
  • 3. Who am I? • Co-founder and Principal Architect @ Lightbits Labs • LightBits Labs is a stealth-mode startup pushing the software and hardware technology boundaries in cloud-scale storage. • We are looking for excellent people who enjoy a challenge for a variety of positions, including both software and hardware. More information on our website at http://www.lightbitslabs.com/#join or talk with me after the talk. • Active contributor to Linux I/O and RDMA stack • I am the maintainer of the iSCSI over RDMA (iSER) drivers • I co-maintain the Linux NVMe subsystem • Used to work for Mellanox on Storage, Networking, RDMA and pretty much everything in between… 3
  • 4. Where were we 10 years ago • Only rotating storage devices exist. • Devices were limited to hundreds of IOPs • Devices access latency was in the milliseconds ballpark • The Linux block layer was sufficient to handle these devices • High performance applications found clever ways to avoid storage access as much as possible 4
  • 5. What happened? (hint: HW) • Flash SSDs started appearing in the DataCenter • IOPs went from Hundreds to Hundreds of thousands to Millions • Latency went from Milliseconds to Microseconds • Fast Interfaces evolved: PCIe (NVMe) • Processors core count increased a lot! • And NUMA... 5
  • 7. What are the issues? • Existing I/O stack had a lot of data sharing • Between different applications (running on different cores) • Between submission and completion • Locking for synchronization • Zero NUMA awareness • All stack heuristics and optimizations centered around slow storage • The result is very bad (even negative) scaling, spending lots of CPU cycles and much much higher latencies. 7
  • 8. I/O Stack - Little deeper 8
  • 9. I/O Stack - Little deeper 9 Hmmm... - Request are serialized - Placed for staging - Retrieved by the drivers ⇒ Lots of shared state!
  • 10. I/O Stack - Performance 10
  • 11. I/O Stack - Performance 11
  • 12. Workaround: Bypass the the request layer 12 Problems with bypass: ● Give up flow control ● Give up error handling ● Give up statistics ● Give up tagging and indexing ● Give up I/O deadlines ● Give up I/O scheduling ● Crazy code duplication - mistakes are copied because people get stuff wrong... Most importantly, this is not the Linux design approach!
  • 13. Enter Block Multiqueue • The old stack does not consist of “one serialization point” • The stack needed a complete re-write from ground up • What do we do: • Go look at the networking stack which solved the exact same issue 10+ years ago. • But build from scratch for storage devices 13
  • 14. Block Multiqueue - Goals • Linear Scaling with CPU cores • Split shared state between applications and submission/completion • Careful locality awareness: Cachelines, NUMA • Pre-allocate resources as much as possible • Provide full helper functionality - ease of implementation • Support all existing HW • Become THE queueing mode, not a “3’rd one” 14
  • 15. Block Multiqueue - Architecture 15
  • 16. Block Multiqueue - Features • Efficient tagging • Locality of submissions and completions • Extremely aware to minimize cache pollutions • Smart error handling - minimum intrusion to the hot path • Smart cpu <-> queue mappings • Clean API • Easy conversion (usually just cleanup old cruft) 16
  • 17. Block Multiqueue - I/O Flow 17
  • 18. Block Multiqueue - Completions 18 ● Applications are usually “cpu-sticky” ● If I/O completion comes on the “correct” cpu, complete it ● Else, IPI to the “correct” cpu
  • 19. Block Multiqueue - Tagging 19 • Almost every modern HW supports queueing • Tags are used to identify individual I/Os in the presence of out-of-order completions • Tags are limited by capabilities of the HW, driver needs to flow control
  • 20. Block Multiqueue - Tagging 20 • PerCPU Cacheline aware scalable bitmaps • Efficient at near-exhaustion • Rolling wake-ups • Maps 1x1 with HW usage - no driver specific tagging
  • 21. Block Multiqueue - Pre-allocations 21 • Eliminate hot path allocations • Allocate all the requests memory at initialization time • Tag and request allocations are combined (no two step allocation) • No driver per-request allocation • Driver context and SG lists are placed in “slack space” behind the request
  • 22. Block Multiqueue - Performance 22 Test-Case: - null_blk driver - fio - 4K sync random read - Dual socket system
  • 23. Block Multiqueue - perf profiling 23 • Locking time is drastically reduced • FIO reports much less “system time” • Average and tail latencies are much lower and consistent
  • 24. Next on the Agenda: SCSI, NVMe and friends • NVMe started as a bypass driver - converted to blk-mq • mtip32xx (Micron) • virtio_blk, xen • rbd (ceph) • loop • more... • SCSI midlayer was a bigger project.. 24
  • 25. SCSI multiqueue • Needed the concept of “shared tag sets” • Tags are now a property of the HBA and not the storage device • Needed a chunking of scatter-gather lists • SCSI HBAs support huge sg lists, two much to allocate up front • Needed “Head of queue” insertion • For SCSI complex error handling • Removed the “big scsi host_busy lock” • reduced the huge contention on the scsi target “busy” atomic • Needed Partial completion support • Needed BIDI support (yukk..) • Hardened the stack a lot with lots of user bug reports. 25
  • 26. Block multiqueue - MSI(X) based queue mapping 26 ● Motivation: Eliminate the IPI case ● Expose MSI(X) vector affinity mappings to the block layer ● Map the HW context mappings via the underlying device IRQ mappings ● Offer MSI(X) allocation and correct affinity spreading via the PCI subsystem ● Take advantage in pci based drivers (nvme, rdma, fc, hpsa, etc..)
  • 27. But wait, what about I/O schedulers? • What we didn’t mention was that block multiqueue lacked a proper I/O scheduler for approximately 3 years! • A fundamental part of the I/O stack functionality is scheduling • To optimize I/O sequentiality - Elevator algorithm • Prevent write vs. read starvation (i.e. deadline scheduler) • Fairness enforcement (i.e. CFQ) • One can argue that I/O scheduling was designed for rotating media • Optimized for reducing actuator seek time NOT NECESSARILY TRUE - Flash can benefit scheduling! 27
  • 28. Start from ground up: WriteBack Throttling • Linux since the dawn of times sucked at buffered I/O • Writes are naturally buffered and committed to disk in the background • Needs to have little or no impact on foreground activity • What was needed: • Plumb I/O stats for submitted reads and writes • Track average latency in window granularity and what is currently enqueued • Scale queue depth accordingly • Prefer reads over non-directIO writes 28
  • 29. WriteBack Throttling - Performance 29 Before... After...
  • 30. Now we are ready for I/O schedules - MQ-Deadline • Added I/O interception of requests for building schedulers on top • First MQ conversion was for deadline scheduler • Pretty easy and straightforward • Just delay writes FIFO until deadline hits • Reads FIFO are pass-through • All percpu context - tradeoff? • Remember: I/O scheduler can hurt synthetic workloads, but impact on real life workloads. 30
  • 31. Next: Kyber I/O Scheduler • Targeted for fast multi-queue devices • Lightweight • Prefers reads over writes • All I/Os are split into two queues (reads and writes) • Reads are typically preferred • Writes are throttled but not to a point of starvation • The key is to keep submission queues short to guarantee latency targets • Kyber tracks I/O latency stats and adjust queue size accordingly • Aware of flash background operations. 31
  • 32. Next: BFQ I/O Scheduler • Budget fair queueing scheduler • A lot heavier • Maintain Per-Process I/O budget • Maintain bunch of Per-Process heuristics • Yields the “best” I/O to queue at any given time • A better fit for slower storage, especially rotating media and cheap & deep SSDs. 32
  • 33. But wait #2: What about Ultra-low latency devices • New media is emerging with Ultra low latency (1-2 us) • 3D-Xpoint • Z-NAND • Even with block MQ, the Linux I/O stack still has issues providing these latencies • It starts with IRQ (interrupt handling) • If I/O is so fast, we might want to poll for completion and avoid paying the cost of MSI(X) interrupt 33
  • 34. Interrupt based I/O completion model 34
  • 35. Polling based I/O completion model 35
  • 36. IRQ vs. Polling 36 • Polling can remove the extra context switch from the completion handling
  • 37. So we should support polling! 37 • Add selective polling syscall interface: • Use preadv2/pwritev2 with flag IOCB_HIGHPRI • Saves roughly 25% of added latency
  • 38. But what about CPU% - can we go hybrid? 38 • Yes! • We have all the statistics framework in place, let’s use it for hybrid polling! • Wake up poller after ½ of the mean latency.
  • 39. Hybrid polling - Performance 39
  • 40. Hybrid polling - Adjust to I/O size 40 • Block layer sees I/Os of different sizes. • Some are 4k, some are 256K and some or 1-2MB • We need to consider that when tracking stats for Polling considerations • Simple solution: Bucketize stats... • 0-4k • 4-16k • 16k-64k • >64k • Now Hybrid polling has good QoS!
  • 41. To Conclude 41 • Lots of interesting stuff happening in Linux • Linux belongs to everyone, Get involved! • We always welcome patches and bug reports :)