KVM Storage Performance
with io_uring
SPONSORED BY:
CLDIN: CLouDINfra
● CLDIN builds and runs the infrastructure of Total Webhosting Solutions
● TWS is a European company with multiple hosting brands in
○ Netherlands
○ France
○ Spain
● We build our infrastructure with
○ Open source software
○ Apache CloudStack
○ Ceph
○ IPv6
CLDIN cloud deployment
● CloudStack
○ Locations
■ Netherlands: Amsterdam and Haarlem
■ Spain: Valencia
○ Advanced Networking
■ BGP+VXLAN+EVPN
○ Storage
■ Ceph (RBD)
■ TrueNAS Enterprise (ZFS HA)
● Numbers
○ ~10,000 Virtual Machines
○ ~200 physical hosts
○ ~100TB RAM
○ ~15PB storage
● Hypervisors (latest)
○ Dual AMD Epyc 64C
○ 1TB RAM
○ Dell R6525 or SuperMicro AS-1123US-TN10RT
History
● Bare metal with NVMe provides best performance
○ Lowest latency
○ Highest amount of IOps
○ But we want to run our workloads inside Virtual Machines!
● Virtual Machines
○ CPU and Memory performance has a small (~5%) overhead
○ Disk I/O has a much higher overhead
● KVM uses the QCOW2 format
○ Typically used with Local Storage and NFS as Primary Storage
● Virtio-blk is relatively slow and is the bottleneck
io_uring
● A new asynchronous I/O interface in the Linux kernel, used by Qemu for disk I/O between the host/hypervisor and the VM
● Provides lower latency and thus more IOps
○ Latency and IOps are directly related: lower latency allows more IOps
● Software requirements
○ Kernel >= 5.8
■ I tested with 5.13
○ Qemu >= 5.0
○ Libvirt >= 6.3
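A quick way to verify these requirements on the hypervisor (generic commands, not taken from the deck):

uname -r                        # kernel version, should be >= 5.8
qemu-system-x86_64 --version    # Qemu version, should be >= 5.0
libvirtd --version              # Libvirt version, should be >= 6.3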
QCOW2 vs RAW
● QCOW2 is the most flexible
○ Used by almost all cloud deployments
○ Local Storage and NFS Primary Storage use this format
○ Supports snapshots and cloning
● RAW is the fastest
○ Not used by many deployments
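For reference, qemu-img can report the format of an existing image and convert between the two (file names below are placeholders):

qemu-img info vm1-root.qcow2                                   # shows format, virtual size and allocation
qemu-img convert -f qcow2 -O raw vm1-root.qcow2 vm1-root.raw   # convert QCOW2 to RAW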
QCOW2 preallocation
By preallocating space within the QCOW2 disk image, performance can be increased.
As data is saved to the QCOW2 image, the physical space used by the image
increases. Growing the QCOW2 image takes time and thus decreases performance.
Preallocation modes (see the qemu-img examples after this list):
● preallocation=metadata - allocates the space required by the metadata but doesn’t allocate any space for the data. This is the quickest to provision but the slowest for guest writes.
● preallocation=falloc - allocates space for the metadata and data but marks the blocks as unallocated. This will provision slower than metadata but quicker than full. Guest write performance will be much quicker than metadata and similar to full.
● preallocation=full - allocates space for the metadata and data and will therefore consume all the physical space that you allocate (not sparse). All empty allocated space will be set to zero. This is the slowest to provision and will give similar guest write performance to falloc.
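A short sketch of how each mode is selected when creating an image with qemu-img (file name and size are examples, not taken from the deck):

qemu-img create -f qcow2 -o preallocation=metadata vm1-data.qcow2 100G   # quickest to provision
qemu-img create -f qcow2 -o preallocation=falloc   vm1-data.qcow2 100G   # reserves blocks via fallocate()
qemu-img create -f qcow2 -o preallocation=full     vm1-data.qcow2 100G   # writes out the full image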
Test setup
Hypervisor
● AMD Epyc 7351P 16C
● 256GB RAM
● Samsung PM983 NVMe
○ ext4 filesystem
○ No RAID
● Ubuntu 20.04
○ kernel 5.13 (HWE)
○ Qemu 5.0 (PPA)
○ Plain libvirt with manual XML file
Virtual Machine
● 16 Cores
● 64GB RAM
● Ubuntu 20.04 with kernel 5.13 (HWE)
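The deck does not list the exact benchmark commands; a minimal fio sketch for the 512-byte and 4k random-write runs inside the guest could look like the following (target file, queue depth, job count and runtime are assumptions):

fio --name=randwrite-512 --filename=/data/fio-test --size=10G --rw=randwrite --bs=512 \
    --ioengine=io_uring --direct=1 --iodepth=32 --numjobs=4 \
    --time_based --runtime=60 --group_reporting

fio --name=randwrite-4k --filename=/data/fio-test --size=10G --rw=randwrite --bs=4k \
    --ioengine=io_uring --direct=1 --iodepth=32 --numjobs=4 \
    --time_based --runtime=60 --group_reporting

(--ioengine=libaio also works if the guest's fio build lacks io_uring support.)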
Results: 512-byte writes
Results: 4k writes
Results found on the internet
I’m not able to get near bare-metal performance.
Further testing is needed!
CloudStack & io_uring
● io_uring supported
○ Since version 4.16
○ Enabled automatically if supported by Libvirt and Qemu
● Service Offerings support different provisioning types (example below)
○ Thin: preallocation=metadata
○ Sparse: preallocation=falloc
○ Fat: preallocation=full
● https://cloudstack.apache.org/api/apidocs-4.16/apis/createDiskOffering.html
○ provisioningtype = thin/sparse/fat
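As an illustration, a disk offering with sparse provisioning can be created via CloudMonkey (offering name, display text and size are placeholders):

cmk create diskoffering name=ssd-sparse-100g displaytext=100GB-sparse \
    disksize=100 provisioningtype=sparse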
Libvirt
<iothreads>16</iothreads>
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' io='io_uring'/>
  <source file='/var/lib/libvirt/images/vm1-data-2.qcow2'/>
  <backingStore/>
  <target dev='sdc' bus='scsi'/>
</disk>
<controller type='scsi' index='0' model='virtio-scsi'>
  <driver queues='16' iothread='16'/>
</controller>
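To apply something like the above by hand (the domain name vm1 is a placeholder), edit the definition and verify it:

virsh edit vm1                                    # add io='io_uring' to the disk <driver> element
virsh dumpxml vm1 | grep -E "io_uring|iothread"   # confirm the settings are active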
Conclusion
● Virtio-blk is the limiting factor currently
○ Local Storage should and can be much faster than it is right now
● 50% lower latency
● 2x performance increase with io_uring
● Other benchmarks suggest 80-90% of bare-metal performance
○ Still need to investigate why we don’t reach that performance
Looking forward
● Supported in CloudStack 4.16
● Ubuntu 22.04 LTS (Jammy) has all the right packages
○ Qemu 6.2
○ Libvirt 8.0
● More performance testing is welcome
● More real-life experiences are welcome
● Small enhancements to the Libvirt XML can be made
○ You can also manually make changes using the Libvirt Qemu hooks
○ https://libvirt.org/hooks.html
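A rough sketch of the hook mechanism (not from the deck): libvirt runs /etc/libvirt/hooks/qemu with the guest name and operation as arguments and passes the domain XML on stdin. Whether the hook may rewrite that XML depends on the libvirt version and operation, so check the documentation linked above. A minimal inspection hook:

#!/bin/sh
# /etc/libvirt/hooks/qemu - log the guest name, operation and domain XML
GUEST="$1"   # guest name
OP="$2"      # operation, e.g. prepare / start / stopped
SUBOP="$3"   # sub-operation, e.g. begin / end
LOG=/var/log/libvirt-qemu-hook.log
echo "$(date -Is) guest=$GUEST op=$OP subop=$SUBOP" >> "$LOG"
cat >> "$LOG"   # the domain XML arrives on stdin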
Questions?
@widodh
wido@denhollander.io
Please send feedback to users@cloudstack.apache.org
Appendix
Useful links
● https://www.jamescoyle.net/how-to/1810-qcow2-disk-images-and-performance
● https://blog.programster.org/qcow2-performance
● https://techpiezo.com/tech-insights/raw-vs-qcow2-disk-images-in-qemu-kvm/
