The Linux Block Layer
Built for Fast Storage
Light up your cloud!
Sagi Grimberg KernelTLV
27/6/18
First off, Happy 1st birthday Roni!
Who am I?
• Co-founder and Principal Architect @ Lightbits Labs
• LightBits Labs is a stealth-mode startup pushing the software and hardware technology
boundaries in cloud-scale storage.
• We are looking for excellent people who enjoy a challenge for a variety of positions, including
both software and hardware. More information on our website at
http://www.lightbitslabs.com/#join or talk with me after the talk.
• Active contributor to Linux I/O and RDMA stack
• I am the maintainer of the iSCSI over RDMA (iSER) drivers
• I co-maintain the Linux NVMe subsystem
• Used to work for Mellanox on Storage, Networking, RDMA and
pretty much everything in between…
Where were we 10 years ago
• Only rotating storage devices existed
• Devices were limited to hundreds of IOPS
• Device access latency was in the millisecond ballpark
• The Linux block layer was sufficient to handle these devices
• High performance applications found clever ways to avoid storage
access as much as possible
What happened? (hint: HW)
• Flash SSDs started appearing in the DataCenter
• IOPS went from hundreds to hundreds of thousands to millions
• Latency went from Milliseconds to Microseconds
• Fast Interfaces evolved: PCIe (NVMe)
• Processor core counts increased a lot!
• And NUMA...
I/O Stack
What are the issues?
• Existing I/O stack had a lot of data sharing
• Between different applications (running on different cores)
• Between submission and completion
• Locking for synchronization
• Zero NUMA awareness
• All stack heuristics and optimizations centered around slow
storage
• The result is very bad (even negative) scaling, spending lots of CPU
cycles and much much higher latencies.
I/O Stack - Little deeper
I/O Stack - Little deeper
Hmmm...
- Requests are serialized
- Placed for staging
- Retrieved by the drivers
⇒ Lots of shared state!
I/O Stack - Performance
I/O Stack - Performance
Workaround: Bypass the request layer
Problems with bypass:
● Give up flow control
● Give up error handling
● Give up statistics
● Give up tagging and indexing
● Give up I/O deadlines
● Give up I/O scheduling
● Crazy code duplication -
mistakes are copied because
people get stuff wrong...
Most importantly, this is not the
Linux design approach!
Enter Block Multiqueue
• The old stack did not have just “one serialization point” that could be fixed
• The stack needed a complete re-write from ground up
• What do we do:
• Go look at the networking stack which solved the exact same issue 10+
years ago.
• But build from scratch for storage devices
Block Multiqueue - Goals
• Linear Scaling with CPU cores
• Split shared state between applications and
submission/completion
• Careful locality awareness: Cachelines, NUMA
• Pre-allocate resources as much as possible
• Provide full helper functionality - ease of implementation
• Support all existing HW
• Become THE queueing mode, not a “3rd one”
Block Multiqueue - Architecture
Block Multiqueue - Features
• Efficient tagging
• Locality of submissions and completions
• Extremely careful to minimize cache pollution
• Smart error handling - minimum intrusion to the hot path
• Smart cpu <-> queue mappings
• Clean API
• Easy conversion (usually just cleanup old cruft)
Block Multiqueue - I/O Flow
Block Multiqueue - Completions
● Applications are usually “cpu-sticky”
● If I/O completion comes on the
“correct” cpu, complete it
● Else, IPI to the “correct” cpu
Block Multiqueue - Tagging
• Almost every modern HW supports queueing
• Tags are used to identify individual I/Os in the presence of
out-of-order completions
• Tags are limited by the capabilities of the HW, so the driver needs to apply flow
control
Block Multiqueue - Tagging
• PerCPU Cacheline aware scalable bitmaps
• Efficient at near-exhaustion
• Rolling wake-ups
• Maps 1:1 to HW tag usage - no driver-specific tagging
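To make the tagging idea concrete, here is a minimal sketch, assuming a plain Python list in place of the kernel's cacheline-aware bitmap. `TagAllocator`, `get` and `put` are hypothetical names for illustration only, not the blk-mq API, and the rolling wake-ups are not modeled. Each CPU keeps its own search hint so that different CPUs tend to touch different parts of the tag space.

```python
class TagAllocator:
    """Toy tag allocator with a per-CPU search hint (illustrative only)."""

    def __init__(self, depth, num_cpus):
        self.depth = depth
        self.busy = [False] * depth
        # Assumption: spread the initial hints so CPUs start in different regions.
        self.hint = [(cpu * depth) // num_cpus for cpu in range(num_cpus)]

    def get(self, cpu):
        """Return a free tag, or None when all tags are in flight."""
        start = self.hint[cpu]
        for i in range(self.depth):
            tag = (start + i) % self.depth
            if not self.busy[tag]:
                self.busy[tag] = True
                self.hint[cpu] = (tag + 1) % self.depth
                return tag
        return None  # exhausted: the caller has to wait (flow control)

    def put(self, tag, cpu):
        """Release a tag and remember it as this CPU's next starting point."""
        self.busy[tag] = False
        self.hint[cpu] = tag
```

With a depth of 4 and 2 CPUs, CPU 0 starts searching at tag 0 and CPU 1 at tag 2, so they only collide once the tag space is nearly exhausted.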
Block Multiqueue - Pre-allocations
• Eliminate hot path allocations
• Allocate all the request memory at initialization time
• Tag and request allocations are combined (no two step allocation)
• No driver per-request allocation
• Driver context and SG lists are placed in “slack space” behind the request
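The pre-allocation scheme can be sketched roughly as follows. This is a simplified model under stated assumptions: `Request`, `RequestPool` and the `driver_pdu` field are hypothetical names, and a `bytearray` stands in for the "slack space" behind each request. The point is that the tag doubles as the index of an already-allocated request, so the hot path never allocates.

```python
class Request:
    """A request allocated once at init time; driver_pdu models the slack space."""

    def __init__(self, tag, pdu_size):
        self.tag = tag
        self.sector = 0
        self.nbytes = 0
        self.driver_pdu = bytearray(pdu_size)  # per-request driver context


class RequestPool:
    """All requests are created up front; alloc() only hands out a free tag."""

    def __init__(self, depth, pdu_size):
        self.requests = [Request(tag, pdu_size) for tag in range(depth)]
        self.free = list(range(depth - 1, -1, -1))  # stack of free tags

    def alloc(self, sector, nbytes):
        if not self.free:
            return None  # no tag available, hence no request either
        rq = self.requests[self.free.pop()]  # tag and request in one step
        rq.sector, rq.nbytes = sector, nbytes
        return rq

    def release(self, rq):
        self.free.append(rq.tag)
```

Releasing a request and allocating again returns the very same object, which is what keeps allocations out of the submission path.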
Block Multiqueue - Performance
Test-Case:
- null_blk driver
- fio
- 4K sync random read
- Dual socket system
Block Multiqueue - perf profiling
• Locking time is drastically reduced
• FIO reports much less “system time”
• Average and tail latencies are much lower and more consistent
Next on the Agenda: SCSI, NVMe and friends
• NVMe started as a bypass driver - converted to blk-mq
• mtip32xx (Micron)
• virtio_blk, xen
• rbd (ceph)
• loop
• more...
• SCSI midlayer was a bigger project..
SCSI multiqueue
• Needed the concept of “shared tag sets”
• Tags are now a property of the HBA and not the storage device
• Needed a chunking of scatter-gather lists
• SCSI HBAs support huge sg lists, too much to allocate up front
• Needed “Head of queue” insertion
• For SCSI's complex error handling
• Removed the “big scsi host_busy lock”
• reduced the huge contention on the scsi target “busy” atomic
• Needed Partial completion support
• Needed BIDI support (yukk..)
• Hardened the stack a lot with lots of user bug reports.
Block multiqueue - MSI(X) based queue mapping
● Motivation: Eliminate the IPI case
● Expose MSI(X) vector affinity
mappings to the block layer
● Map the HW context mappings via
the underlying device IRQ mappings
● Offer MSI(X) allocation and correct
affinity spreading via the PCI
subsystem
● Take advantage in pci based drivers
(nvme, rdma, fc, hpsa, etc..)
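A rough sketch of why matching the queue mapping to the IRQ affinity removes the IPI case. The spreading policy below is a toy one (contiguous CPU groups per queue), not the kernel's actual algorithm, and `spread_cpus`, `irq_affinity` and `needs_ipi` are hypothetical helper names.

```python
def spread_cpus(num_cpus, num_queues):
    """Map each CPU to a hardware queue using contiguous groups (toy policy)."""
    return [cpu * num_queues // num_cpus for cpu in range(num_cpus)]


def irq_affinity(cpu_to_queue, num_queues):
    """Derive each queue's interrupt affinity (list of CPUs) from the same map."""
    masks = [[] for _ in range(num_queues)]
    for cpu, queue in enumerate(cpu_to_queue):
        masks[queue].append(cpu)
    return masks


def needs_ipi(submit_cpu, irq_cpu):
    """A completion is redirected only if the IRQ lands on another CPU."""
    return submit_cpu != irq_cpu
```

When there is one queue per CPU, the map is the identity: every completion interrupt fires on the CPU that submitted the I/O, so no IPI is needed. With fewer queues than CPUs, the interrupt at least lands inside the group of CPUs that share the queue.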
But wait, what about I/O schedulers?
• What we didn’t mention was that block multiqueue lacked a
proper I/O scheduler for approximately 3 years!
• A fundamental part of the I/O stack functionality is scheduling
• To optimize I/O sequentiality - Elevator algorithm
• Prevent write vs. read starvation (e.g. deadline scheduler)
• Fairness enforcement (e.g. CFQ)
• One can argue that I/O scheduling was designed for rotating media
• Optimized for reducing actuator seek time
NOT NECESSARILY TRUE - Flash can benefit from scheduling!
Start from ground up: WriteBack Throttling
• Since the dawn of time, Linux has sucked at buffered I/O
• Writes are naturally buffered and committed to disk in the
background
• Needs to have little or no impact on foreground activity
• What was needed:
• Plumb I/O stats for submitted reads and writes
• Track average latency in window granularity and what is currently enqueued
• Scale queue depth accordingly
• Prefer reads over non-directIO writes
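The control loop described above might look roughly like this. It is a simplified sketch, not the kernel's implementation: the halve-on-miss and grow-by-one policy, the `WritebackThrottle` name and its parameters are all assumptions chosen for illustration.

```python
class WritebackThrottle:
    """Toy feedback loop: shrink the background write depth when reads get slow."""

    def __init__(self, target_read_lat_us, max_depth):
        self.target = target_read_lat_us
        self.max_depth = max_depth
        self.depth = max_depth   # how many background writes may be in flight
        self.window = []         # read latencies seen in the current window

    def record_read(self, lat_us):
        self.window.append(lat_us)

    def end_window(self):
        """Close the monitoring window and return the new write queue depth."""
        if self.window:
            avg = sum(self.window) / len(self.window)
            if avg > self.target:
                self.depth = max(1, self.depth // 2)   # back off quickly
            else:
                self.depth = min(self.max_depth, self.depth + 1)  # recover slowly
            self.window.clear()
        return self.depth
```

Backing off fast and recovering slowly is one common choice for this kind of loop; it favors foreground read latency over background write throughput.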
WriteBack Throttling - Performance
Before... After...
Now we are ready for I/O schedulers - MQ-Deadline
• Added I/O interception of requests for building schedulers on top
• First MQ conversion was for deadline scheduler
• Pretty easy and straightforward
• Just delay the writes FIFO until the deadline hits
• The reads FIFO is pass-through
• All percpu context - tradeoff?
• Remember: an I/O scheduler can hurt synthetic workloads, but what counts is the impact on
real-life workloads.
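The behavior described on this slide can be sketched as two FIFOs, assuming a single queue and abstract time units. `ToyDeadline` and its methods are hypothetical names; the real scheduler also sorts and batches requests, which is left out here.

```python
from collections import deque


class ToyDeadline:
    """Reads pass through in FIFO order; writes wait until their deadline."""

    def __init__(self, write_expire):
        self.write_expire = write_expire
        self.reads = deque()
        self.writes = deque()  # entries are (deadline, request)

    def insert(self, op, req, now):
        if op == "read":
            self.reads.append(req)
        else:
            self.writes.append((now + self.write_expire, req))

    def dispatch(self, now):
        """Pick the next request to send to the driver, or None if idle."""
        # An expired write goes first, so writes cannot starve.
        if self.writes and self.writes[0][0] <= now:
            return self.writes.popleft()[1]
        if self.reads:
            return self.reads.popleft()
        # Nothing to read: use the idle slot for the oldest write.
        if self.writes:
            return self.writes.popleft()[1]
        return None
```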
Next: Kyber I/O Scheduler
• Targeted for fast multi-queue devices
• Lightweight
• Prefers reads over writes
• All I/Os are split into two queues (reads and writes)
• Reads are typically preferred
• Writes are throttled but not to a point of starvation
• The key is to keep submission queues short to guarantee latency
targets
• Kyber tracks I/O latency stats and adjusts queue sizes accordingly
• Aware of flash background operations.
Next: BFQ I/O Scheduler
• Budget fair queueing scheduler
• A lot heavier
• Maintains a per-process I/O budget
• Maintains a bunch of per-process heuristics
• Yields the “best” I/O to queue at any given time
• A better fit for slower storage, especially rotating media and cheap &
deep SSDs.
But wait #2: What about Ultra-low latency devices
• New media is emerging with Ultra low latency (1-2 us)
• 3D-Xpoint
• Z-NAND
• Even with block MQ, the Linux I/O stack still has issues providing these
latencies
• It starts with IRQ (interrupt handling)
• If I/O is so fast, we might want to poll for completion and avoid paying the
cost of an MSI(X) interrupt
Interrupt based I/O completion model
Polling based I/O completion model
IRQ vs. Polling
• Polling can remove the extra context switch from the completion
handling
So we should support polling!
• Add selective polling syscall interface:
• Use preadv2/pwritev2 with the RWF_HIPRI flag (IOCB_HIPRI inside the kernel)
• Saves roughly 25% of added latency
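From user space, a high-priority read could be requested along these lines. This sketch uses Python's `os.preadv`, which wraps preadv2 on Linux and exposes the flag as `os.RWF_HIPRI`; `read_hipri` is a hypothetical helper name. Polling typically only takes effect for O_DIRECT I/O on a device whose queue has polling enabled, so on an ordinary buffered file the call simply behaves like a normal read.

```python
import os


def read_hipri(fd, length, offset):
    """Read `length` bytes at `offset`, asking the kernel for a polled completion."""
    buf = bytearray(length)
    flags = getattr(os, "RWF_HIPRI", 0)  # 0 where the flag is not defined
    try:
        n = os.preadv(fd, [buf], offset, flags)
    except (AttributeError, OSError):
        # preadv or the flag is unavailable here: fall back to a plain pread.
        return os.pread(fd, length, offset)
    return bytes(buf[:n])
```

The fallback keeps the helper usable on systems without preadv2 support; it does not poll in that case.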
But what about CPU% - can we go hybrid?
• Yes!
• We have all the statistics framework in place, let’s use it for hybrid polling!
• Wake up poller after ½ of the mean latency.
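The CPU trade-off can be illustrated with a small model, assuming the poller notices a completion immediately once it is spinning. `poll_cost_us` is a hypothetical helper and the numbers are made up for illustration, not measurements.

```python
def poll_cost_us(actual_us, mean_us, hybrid):
    """Return (cpu_busy_us, seen_at_us) for a single I/O.

    Classic polling spins from submission until completion.
    Hybrid polling first sleeps for half the mean latency, then spins.
    """
    sleep_us = mean_us / 2 if hybrid else 0.0
    seen_at = max(actual_us, sleep_us)  # cannot notice completion while asleep
    busy = seen_at - sleep_us           # time spent spinning
    return busy, seen_at
```

For an I/O that takes exactly the mean of 10 us, hybrid polling spins for 5 us instead of 10 us with no added latency. An I/O that completes after 3 us is only noticed at 5 us, which is why the sleep time has to track the latency statistics closely.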
Hybrid polling - Performance
Hybrid polling - Adjust to I/O size
• Block layer sees I/Os of different sizes.
• Some are 4k, some are 256K and some are 1-2MB
• We need to consider that when tracking stats for Polling considerations
• Simple solution: Bucketize stats...
• 0-4k
• 4-16k
• 16k-64k
• >64k
• Now Hybrid polling has good QoS!
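The bucketing can be sketched as follows, using the four size classes listed above. `bucket_index`, `LatencyStats` and `sleep_hint_us` are hypothetical names, and the half-of-mean sleep is taken from the hybrid polling slide; the kernel's actual bucket layout may differ.

```python
BUCKET_LIMITS = [4096, 16384, 65536]  # upper bounds in bytes; last bucket is >64k


def bucket_index(nbytes):
    """Return the index of the size bucket an I/O of nbytes falls into."""
    for i, limit in enumerate(BUCKET_LIMITS):
        if nbytes <= limit:
            return i
    return len(BUCKET_LIMITS)


class LatencyStats:
    """Per-bucket mean latency, so small I/Os are not judged by large ones."""

    def __init__(self):
        n = len(BUCKET_LIMITS) + 1
        self.total = [0.0] * n
        self.count = [0] * n

    def record(self, nbytes, lat_us):
        i = bucket_index(nbytes)
        self.total[i] += lat_us
        self.count[i] += 1

    def sleep_hint_us(self, nbytes):
        """Half the mean latency of this size class, or 0 with no samples yet."""
        i = bucket_index(nbytes)
        if self.count[i] == 0:
            return 0.0
        return self.total[i] / self.count[i] / 2
```

Without buckets, a few 1-2MB I/Os would inflate the mean and make the poller oversleep on 4k I/Os.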
To Conclude
• Lots of interesting stuff happening in Linux
• Linux belongs to everyone, Get involved!
• We always welcome patches and bug reports :)
LIGHT UP YOUR CLOUD!

Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profile
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024
 

The Linux Block Layer - Built for Fast Storage

  • 1. The Linux Block Layer: Built for Fast Storage. Light up your cloud! Sagi Grimberg, KernelTLV, 27/6/18
  • 2. First off, Happy 1st birthday Roni!
  • 3. Who am I? • Co-founder and Principal Architect @ Lightbits Labs • LightBits Labs is a stealth-mode startup pushing the software and hardware technology boundaries in cloud-scale storage. • We are looking for excellent people who enjoy a challenge for a variety of positions, including both software and hardware. More information on our website at http://www.lightbitslabs.com/#join or talk with me after the talk. • Active contributor to Linux I/O and RDMA stack • I am the maintainer of the iSCSI over RDMA (iSER) drivers • I co-maintain the Linux NVMe subsystem • Used to work for Mellanox on Storage, Networking, RDMA and pretty much everything in between…
  • 4. Where were we 10 years ago? • Only rotating storage devices existed • Devices were limited to hundreds of IOPS • Device access latency was in the milliseconds ballpark • The Linux block layer was sufficient to handle these devices • High performance applications found clever ways to avoid storage access as much as possible
  • 5. What happened? (hint: HW) • Flash SSDs started appearing in the DataCenter • IOPS went from Hundreds to Hundreds of thousands to Millions • Latency went from Milliseconds to Microseconds • Fast Interfaces evolved: PCIe (NVMe) • Processor core counts increased a lot! • And NUMA...
  • 7. What are the issues? • Existing I/O stack had a lot of data sharing • Between different applications (running on different cores) • Between submission and completion • Locking for synchronization • Zero NUMA awareness • All stack heuristics and optimizations centered around slow storage • The result is very bad (even negative) scaling, spending lots of CPU cycles and much much higher latencies.
  • 8. I/O Stack - Little deeper
  • 9. I/O Stack - Little deeper. Hmmm... - Requests are serialized - Placed for staging - Retrieved by the drivers ⇒ Lots of shared state!
  • 10. I/O Stack - Performance
  • 11. I/O Stack - Performance
  • 12. Workaround: Bypass the request layer. Problems with bypass: ● Give up flow control ● Give up error handling ● Give up statistics ● Give up tagging and indexing ● Give up I/O deadlines ● Give up I/O scheduling ● Crazy code duplication - mistakes are copied because people get stuff wrong... Most importantly, this is not the Linux design approach!
  • 13. Enter Block Multiqueue • The old stack does not consist of “one serialization point” • The stack needed a complete rewrite from the ground up • What do we do: • Go look at the networking stack, which solved the exact same issue 10+ years ago • But build from scratch for storage devices
  • 14. Block Multiqueue - Goals • Linear Scaling with CPU cores • Split shared state between applications and submission/completion • Careful locality awareness: Cachelines, NUMA • Pre-allocate resources as much as possible • Provide full helper functionality - ease of implementation • Support all existing HW • Become THE queueing mode, not a “3rd one”
  • 15. Block Multiqueue - Architecture
  • 16. Block Multiqueue - Features • Efficient tagging • Locality of submissions and completions • Extremely careful to minimize cache pollution • Smart error handling - minimum intrusion to the hot path • Smart CPU <-> queue mappings • Clean API • Easy conversion (usually just cleaning up old cruft)
  • 17. Block Multiqueue - I/O Flow
  • 18. Block Multiqueue - Completions ● Applications are usually “cpu-sticky” ● If the I/O completion arrives on the “correct” CPU, complete it there ● Else, send an IPI to the “correct” CPU
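The completion rule on slide 18 can be sketched in a few lines. This is only a user-space illustration, not kernel code: the `Request` class, the `complete` function, and the `run_local` / `send_ipi` callbacks are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    tag: int
    submit_cpu: int  # CPU that issued the I/O (hypothetical field)


def complete(req, irq_cpu, run_local, send_ipi):
    """Complete on the submitting CPU; otherwise redirect with an IPI."""
    if irq_cpu == req.submit_cpu:
        run_local(req)                 # cache-hot path: finish right here
        return "local"
    send_ipi(req.submit_cpu, req)      # bounce to the CPU that owns the context
    return "ipi"


log = []
local = lambda req: log.append(("local", req.tag))
ipi = lambda cpu, req: log.append(("ipi", cpu, req.tag))

complete(Request(tag=1, submit_cpu=3), 3, local, ipi)  # interrupt on the same CPU
complete(Request(tag=2, submit_cpu=3), 5, local, ipi)  # interrupt on another CPU
```

The IPI branch is the cost that the MSI(X) based queue mapping on slide 26 tries to eliminate.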
  • 19. Block Multiqueue - Tagging • Almost all modern HW supports queueing • Tags are used to identify individual I/Os in the presence of out-of-order completions • Tags are limited by the capabilities of the HW; the driver needs to apply flow control
  • 20. Block Multiqueue - Tagging • Per-CPU, cacheline-aware scalable bitmaps • Efficient at near-exhaustion • Rolling wake-ups • Maps 1:1 with HW usage - no driver-specific tagging
  • 21. Block Multiqueue - Pre-allocations • Eliminate hot path allocations • Allocate all the request memory at initialization time • Tag and request allocations are combined (no two-step allocation) • No driver per-request allocation • Driver context and SG lists are placed in “slack space” behind the request
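To make the tagging and pre-allocation slides concrete, here is a toy model, assuming a bitmap split into small words and a request array allocated once up front. The class `TagSet`, the `WORD_BITS` value, and the `driver_pdu` field are made up for illustration; the kernel's scalable bitmap is considerably more elaborate (wake-up batching, per-CPU hints).

```python
class TagSet:
    """Toy tag allocator: a bitmap split into words plus preallocated requests."""

    WORD_BITS = 8  # assumption: small words so each could live on its own cacheline

    def __init__(self, depth):
        self.depth = depth
        nwords = (depth + self.WORD_BITS - 1) // self.WORD_BITS
        self.words = [0] * nwords
        # All request memory is allocated here, once. The tag is the index,
        # so getting a tag also gets the request (no second allocation).
        self.requests = [{"tag": t, "driver_pdu": bytearray(16)} for t in range(depth)]

    def get(self, cpu):
        """Return (tag, request), or None when all tags are in flight."""
        nwords = len(self.words)
        start = cpu % nwords  # different CPUs start searching in different words
        for i in range(nwords):
            w = (start + i) % nwords
            for bit in range(self.WORD_BITS):
                tag = w * self.WORD_BITS + bit
                if tag >= self.depth:
                    break
                if not self.words[w] & (1 << bit):
                    self.words[w] |= 1 << bit
                    return tag, self.requests[tag]
        return None  # caller must wait: this is the flow control point

    def put(self, tag):
        w, bit = divmod(tag, self.WORD_BITS)
        self.words[w] &= ~(1 << bit)


tags = TagSet(depth=16)
first = tags.get(cpu=0)   # searches word 0 first
second = tags.get(cpu=1)  # searches word 1 first, so it does not touch word 0
```

Returning `None` on exhaustion stands in for the sleeping and rolling wake-ups mentioned on slide 20.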
  • 22. Block Multiqueue - Performance. Test-Case: - null_blk driver - fio - 4K sync random read - Dual socket system
  • 23. Block Multiqueue - perf profiling • Locking time is drastically reduced • FIO reports much less “system time” • Average and tail latencies are much lower and consistent
  • 24. Next on the Agenda: SCSI, NVMe and friends • NVMe started as a bypass driver - converted to blk-mq • mtip32xx (Micron) • virtio_blk, xen • rbd (ceph) • loop • more... • SCSI midlayer was a bigger project..
  • 25. SCSI multiqueue • Needed the concept of “shared tag sets” • Tags are now a property of the HBA and not the storage device • Needed chunking of scatter-gather lists • SCSI HBAs support huge sg lists, too much to allocate up front • Needed “Head of queue” insertion • For complex SCSI error handling • Removed the “big scsi host_busy lock” • Reduced the huge contention on the scsi target “busy” atomic • Needed Partial completion support • Needed BIDI support (yukk..) • Hardened the stack a lot with lots of user bug reports.
  • 26. Block multiqueue - MSI(X) based queue mapping ● Motivation: Eliminate the IPI case ● Expose MSI(X) vector affinity mappings to the block layer ● Map the HW context mappings via the underlying device IRQ mappings ● Offer MSI(X) allocation and correct affinity spreading via the PCI subsystem ● Take advantage in PCI-based drivers (nvme, rdma, fc, hpsa, etc..)
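The idea on slide 26 is that the CPU-to-queue map is derived from where each queue's interrupt vector is allowed to fire. A rough sketch, assuming the affinity masks are already known: `map_queues` and its round-robin fallback for uncovered CPUs are assumptions of this example, not the kernel's exact behavior.

```python
def map_queues(irq_affinity, nr_cpus):
    """Build a cpu -> HW queue table from per-queue IRQ affinity masks.

    irq_affinity[q] is the set of CPUs that queue q's MSI(X) vector targets.
    """
    cpu_to_queue = [None] * nr_cpus
    for queue, cpus in enumerate(irq_affinity):
        for cpu in cpus:
            cpu_to_queue[cpu] = queue
    # Assumed fallback: spread CPUs that no vector covers across the queues.
    nr_queues = len(irq_affinity)
    for cpu in range(nr_cpus):
        if cpu_to_queue[cpu] is None:
            cpu_to_queue[cpu] = cpu % nr_queues
    return cpu_to_queue


# 8 CPUs, 4 vectors; CPU 7 is not covered by any vector in this made-up layout.
mapping = map_queues([{0, 1}, {2, 3}, {4, 5}, {6}], nr_cpus=8)
```

With such a table, a request submitted on CPU 2 goes to queue 1, whose interrupt also fires on CPU 2 or 3, so the completion usually lands near the submitter without an IPI.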
  • 27. But wait, what about I/O schedulers? • What we didn’t mention was that block multiqueue lacked a proper I/O scheduler for approximately 3 years! • A fundamental part of the I/O stack functionality is scheduling • To optimize I/O sequentiality - Elevator algorithm • Prevent write vs. read starvation (e.g. the deadline scheduler) • Fairness enforcement (e.g. CFQ) • One can argue that I/O scheduling was designed for rotating media • Optimized for reducing actuator seek time. NOT NECESSARILY TRUE - Flash can benefit from scheduling!
  • 28. Start from ground up: WriteBack Throttling • Linux since the dawn of time sucked at buffered I/O • Writes are naturally buffered and committed to disk in the background • Needs to have little or no impact on foreground activity • What was needed: • Plumb I/O stats for submitted reads and writes • Track average latency in window granularity and what is currently enqueued • Scale queue depth accordingly • Prefer reads over non-directIO writes
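The feedback loop on slide 28 (track latency per window, then scale the background write depth) might look roughly like this. It is a simplified sketch: the class name, the halve/double policy, and the use of a plain mean are assumptions chosen for brevity rather than the kernel's actual algorithm.

```python
class WritebackThrottle:
    """Toy latency-driven limiter for background (buffered) writes."""

    def __init__(self, target_read_lat_us, max_depth, min_depth=1):
        self.target = target_read_lat_us
        self.max_depth = max_depth
        self.min_depth = min_depth
        self.depth = max_depth  # how many background writes may be in flight
        self.window = []        # read latencies seen in the current window

    def record_read(self, latency_us):
        self.window.append(latency_us)

    def end_window(self):
        """Close the window and rescale the allowed write depth."""
        if self.window:
            mean = sum(self.window) / len(self.window)
            if mean > self.target:
                # Reads are suffering: back off writes (assumed policy: halve).
                self.depth = max(self.min_depth, self.depth // 2)
            else:
                # Reads are healthy: let writes ramp up again (assumed: double).
                self.depth = min(self.max_depth, self.depth * 2)
        self.window = []
        return self.depth


wbt = WritebackThrottle(target_read_lat_us=2000, max_depth=16)
wbt.record_read(5000)
wbt.record_read(7000)
depth_after_slow_window = wbt.end_window()
```

The point of the sketch is the shape of the loop: foreground read latency is the signal, background write depth is the knob.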
  • 29. WriteBack Throttling - Performance: Before... After...
  • 30. Now we are ready for I/O schedulers - MQ-Deadline • Added I/O interception of requests for building schedulers on top • First MQ conversion was for the deadline scheduler • Pretty easy and straightforward • Just delay writes FIFO until deadline hits • Reads FIFO are pass-through • All percpu context - tradeoff? • Remember: an I/O scheduler can hurt synthetic workloads, but what matters is its impact on real-life workloads.
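The policy described on slide 30 (reads pass through in FIFO order, writes wait until their deadline hits) can be illustrated with a small model. `ToyDeadline` is a hypothetical class that follows the slide's wording only; the real mq-deadline scheduler also sorts by sector and batches requests, which is left out here.

```python
from collections import deque


class ToyDeadline:
    """Reads are pass-through FIFO; writes are held until their deadline expires."""

    def __init__(self, write_expire):
        self.write_expire = write_expire
        self.reads = deque()
        self.writes = deque()  # entries are (deadline, request)

    def insert(self, req, is_write, now):
        if is_write:
            self.writes.append((now + self.write_expire, req))
        else:
            self.reads.append(req)

    def dispatch(self, now):
        # An expired write goes first so that writes cannot starve.
        if self.writes and self.writes[0][0] <= now:
            return self.writes.popleft()[1]
        if self.reads:
            return self.reads.popleft()
        # Nothing else to do: use the idle slot for the oldest write.
        if self.writes:
            return self.writes.popleft()[1]
        return None


sched = ToyDeadline(write_expire=5)
sched.insert("w1", is_write=True, now=0)
sched.insert("r1", is_write=False, now=0)
sched.insert("r2", is_write=False, now=1)
order = [sched.dispatch(now=1), sched.dispatch(now=6), sched.dispatch(now=6)]
```

At time 1 the write has not expired, so the read goes first; at time 6 the write's deadline (5) has passed, so it jumps ahead of the remaining read.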
  • 31. Next: Kyber I/O Scheduler • Targeted for fast multi-queue devices • Lightweight • Prefers reads over writes • All I/Os are split into two queues (reads and writes) • Reads are typically preferred • Writes are throttled but not to a point of starvation • The key is to keep submission queues short to guarantee latency targets • Kyber tracks I/O latency stats and adjusts queue size accordingly • Aware of flash background operations.
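A very reduced model of the behavior listed on slide 31: two queues, reads preferred, and a write depth that shrinks when read latency misses its target but never drops below one. `ToyKyber`, its single-step depth adjustment, and the per-completion feedback are assumptions of this sketch; the actual scheduler works from latency statistics rather than individual samples.

```python
class ToyKyber:
    """Two domains (read/write), each with a limit on in-flight requests."""

    def __init__(self, read_depth, write_depth, read_target_us):
        self.limit = {"read": read_depth, "write": write_depth}
        self.max_write = write_depth
        self.inflight = {"read": 0, "write": 0}
        self.queues = {"read": [], "write": []}
        self.read_target_us = read_target_us

    def insert(self, kind, req):
        self.queues[kind].append(req)

    def dispatch(self):
        for kind in ("read", "write"):  # reads are tried first
            if self.queues[kind] and self.inflight[kind] < self.limit[kind]:
                self.inflight[kind] += 1
                return kind, self.queues[kind].pop(0)
        return None  # keeps the device queue short instead of flooding it

    def complete(self, kind, latency_us):
        self.inflight[kind] -= 1
        if kind != "read":
            return
        if latency_us > self.read_target_us:
            # Reads are too slow: throttle writes, but never to zero.
            self.limit["write"] = max(1, self.limit["write"] - 1)
        else:
            self.limit["write"] = min(self.max_write, self.limit["write"] + 1)


k = ToyKyber(read_depth=2, write_depth=2, read_target_us=500)
k.insert("write", "w1")
for r in ("r1", "r2", "r3"):
    k.insert("read", r)
order = [k.dispatch() for _ in range(4)]
k.complete("read", latency_us=900)  # a slow read shrinks the write budget
```

Once the read domain is at its limit the write gets a turn, and the fourth dispatch returns nothing because both domains are either full or empty.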
  • 32. Next: BFQ I/O Scheduler • Budget fair queueing scheduler • A lot heavier • Maintains a per-process I/O budget • Maintains a bunch of per-process heuristics • Yields the “best” I/O to queue at any given time • A better fit for slower storage, especially rotating media and cheap & deep SSDs.
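On a running system the schedulers from slides 30 to 32 are selected per device through the sysfs file `/sys/block/<dev>/queue/scheduler`, which lists the available schedulers with the active one in square brackets. A small helper for reading it might look like this; the function names are made up for the example, and which schedulers appear depends on the kernel build.

```python
from pathlib import Path


def parse_schedulers(text):
    """Return (active, available) from a line like 'mq-deadline kyber [bfq] none'."""
    available, active = [], None
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def schedulers_for(dev):
    """Read the scheduler list for a block device such as 'nvme0n1' (Linux only)."""
    path = Path("/sys/block") / dev / "queue" / "scheduler"
    return parse_schedulers(path.read_text())


active, available = parse_schedulers("mq-deadline kyber [bfq] none")
```

Writing one of the listed names back to the same file (as root) switches the scheduler for that device.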
  • 33. But wait #2: What about Ultra-low latency devices? • New media is emerging with Ultra low latency (1-2 us) • 3D-Xpoint • Z-NAND • Even with block MQ, the Linux I/O stack still has issues providing these latencies • It starts with IRQ (interrupt handling) • If I/O is so fast, we might want to poll for completion and avoid paying the cost of an MSI(X) interrupt
  • 34. Interrupt based I/O completion model
  • 35. Polling based I/O completion model
  • 36. IRQ vs. Polling • Polling can remove the extra context switch from the completion handling
  • 37. So we should support polling! • Add a selective polling syscall interface: • Use preadv2/pwritev2 with the RWF_HIPRI flag (IOCB_HIPRI inside the kernel) • Saves roughly 25% of added latency
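From user space, the selective polling interface on slide 37 is reached through `preadv2`/`pwritev2` with the `RWF_HIPRI` flag, which Python exposes as `os.preadv(..., flags)` and `os.RWF_HIPRI` on Linux. The sketch below only shows how the flag is passed; the helper name `read_hipri` and the fallbacks are assumptions of the example. Actual polling additionally needs an `O_DIRECT` file descriptor on a device with polling enabled, which a temporary file like the one here does not provide.

```python
import os
import tempfile


def read_hipri(fd, length, offset):
    """Read `length` bytes at `offset`, asking for polled completion if possible."""
    buf = bytearray(length)
    flags = getattr(os, "RWF_HIPRI", 0)  # 0 when the platform lacks the flag
    if hasattr(os, "preadv"):
        try:
            n = os.preadv(fd, [buf], offset, flags)
            return bytes(buf[:n])
        except OSError:
            pass  # flag or call not supported here; fall back to a plain read
    return os.pread(fd, length, offset)


with tempfile.NamedTemporaryFile() as f:
    f.write(b"block layer")
    f.flush()
    data = read_hipri(f.fileno(), 5, 6)
```

On a buffered file the flag is simply a hint with no effect, so the call behaves like an ordinary positional read.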
  • 38. But what about CPU% - can we go hybrid? • Yes! • We have all the statistics framework in place, let’s use it for hybrid polling! • Wake up poller after ½ of the mean latency.
  • 39. Hybrid polling - Performance
  • 40. Hybrid polling - Adjust to I/O size • Block layer sees I/Os of different sizes. • Some are 4k, some are 256K and some are 1-2MB • We need to consider that when tracking stats for Polling considerations • Simple solution: Bucketize stats... • 0-4k • 4-16k • 16k-64k • >64k • Now Hybrid polling has good QoS!
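Slides 38 and 40 combine into a simple rule: keep a mean latency per I/O size bucket, sleep for half of that mean, then start polling. A minimal sketch under stated assumptions: the bucket boundaries follow the slide, while treating each upper bound as inclusive, the class name `HybridPollStats`, and returning 0 (poll immediately) for an empty bucket are choices made for this example.

```python
BUCKET_LIMITS = [4096, 16384, 65536]  # upper bounds in bytes; the last bucket is >64k


def bucket_for(size):
    """Map an I/O size in bytes to its stats bucket index."""
    for i, limit in enumerate(BUCKET_LIMITS):
        if size <= limit:
            return i
    return len(BUCKET_LIMITS)


class HybridPollStats:
    def __init__(self):
        n = len(BUCKET_LIMITS) + 1
        self.total_ns = [0] * n
        self.count = [0] * n

    def record(self, size, latency_ns):
        b = bucket_for(size)
        self.total_ns[b] += latency_ns
        self.count[b] += 1

    def sleep_ns(self, size):
        """Time to sleep before polling: half the mean latency of this size class."""
        b = bucket_for(size)
        if self.count[b] == 0:
            return 0  # no history yet: poll right away
        return (self.total_ns[b] // self.count[b]) // 2


stats = HybridPollStats()
stats.record(4096, 10_000)
stats.record(4096, 14_000)
stats.record(1 << 20, 400_000)
small_io_sleep = stats.sleep_ns(4096)     # based on the 0-4k bucket only
large_io_sleep = stats.sleep_ns(2 << 20)  # based on the >64k bucket only
```

Without the buckets, the slow 1 MB sample would inflate the mean and make 4k reads oversleep, which is the QoS problem the slide describes.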
  • 41. To Conclude • Lots of interesting stuff happening in Linux • Linux belongs to everyone, Get involved! • We always welcome patches and bug reports :)