Achieving the Ultimate Performance
with KVM
Venko Moyankov
data://disrupted
2020
About me
● Solutions Architect @ StorPool
● Network and System administrator
● 20+ years in telecoms building and operating infrastructures
linkedin.com/in/venkomoyankov/
venko@storpool.com
About StorPool
● NVMe software-defined storage for VMs and containers
● Scale-out, HA, API-controlled
● Since 2011, in commercial production use since 2013
● Based in Sofia, Bulgaria
● Mostly virtual disks for KVM
● … and bare metal Linux hosts
● Also used with VMware, Hyper-V, XenServer
● Integrations into OpenStack/Cinder, Kubernetes Persistent Volumes, CloudStack, OpenNebula, OnApp
Why performance
● Better application performance -- e.g. time to load a page, time to rebuild, time to execute a specific query
● Happier customers (in cloud / multi-tenant environments)
● ROI, TCO - lower cost per delivered resource (per VM) through higher density
Agenda
● Hardware
● Compute - CPU & Memory
● Networking
● Storage
Usual optimization goal
- lowest cost per delivered resource
- fixed performance target
- calculate all costs - power, cooling, space, server, network, support/maintenance
Example: cost per VM with 4x dedicated 3 GHz cores and 16 GB RAM
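A back-of-the-envelope sketch of that calculation (all numbers below are hypothetical, for illustration only):
# hypothetical: dual-socket node with 48 dedicated 3 GHz cores, 192 GB RAM
# -> 48 cores / 4 = 12 VMs; RAM also fits (12 x 16 GB = 192 GB)
# 3-year cost incl. power, cooling, space, network, support: $18,000
# -> $18,000 / 12 VMs = $1,500 per VM over 3 years, i.e. ~$42 per VM-month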
Unusual
- Best single-thread performance I can get at any cost
- 5 GHz cores, yummy :)
Compute node hardware
Intel
lowest cost per core:
- Xeon Gold 5220R - 24 cores @ 2.6 GHz ($244/core)
lowest cost per 3GHz+ core:
- Xeon Gold 6240R - 24 cores @ 3.2 GHz ($276/core)
- Xeon Gold 6248R - 24 cores @ 3.6 GHz ($308/core)
lowest cost per GHz:
- Xeon Gold 6230R - 26 cores @ 2.1 GHz ($81/GHz)
Compute node hardware
AMD
- EPYC 7702P - 64 cores @ 2.0/3.35 GHz - lowest cost per core
- EPYC 7402P - 24 cores / 1S - low density
- EPYC 7742 - 64 cores @ 2.2/3.4 GHz x 2S - max density
- EPYC 7262 - 8 cores @ 3.4 GHz - max IO/cache per core, per $
Compute node hardware
Form factor
Compute node hardware
● firmware versions and BIOS settings
● Understand power management -- esp. C-states, P-states, HWP and “bias”
○ Different on AMD EPYC: "power-deterministic", "performance-deterministic"
● Think of rack-level optimization - how do we get the lowest total cost per delivered resource?
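On Linux hosts these knobs can be inspected and pinned with cpupower (a sketch; which idle/frequency controls are exposed depends on the platform and driver):
# show the active cpufreq driver, governor and idle states
cpupower frequency-info
cpupower idle-info
# force the performance governor on all cores
cpupower -c all frequency-set -g performance
# disable idle states with wakeup latency above 10 us (trades power for latency)
cpupower idle-set -D 10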
Agenda
● Hardware
● Compute - CPU & Memory
● Networking
● Storage
Tuning KVM
RHEL 7 Virtualization Tuning and Optimization Guide
https://pve.proxmox.com/wiki/Performance_Tweaks
https://events.static.linuxfound.org/sites/events/files/slides/CloudOpen2013_Khoa_Huynh_v3.pdf
http://www.linux-kvm.org/images/f/f9/2012-forum-virtio-blk-performance-improvement.pdf
http://www.slideshare.net/janghoonsim/kvm-performance-optimization-for-ubuntu
… but don’t trust everything you read. Perform your own benchmarking!
CPU and Memory
Recent Linux kernel, KVM and QEMU
… but beware of the bleeding edge
E.g. qemu-kvm-ev from RHEV (repackaged by CentOS)
tuned-adm virtual-host
tuned-adm virtual-guest
CPU
Typical
● (heavy) oversubscription, because VMs are mostly idling
● HT
● NUMA
● route IRQs of network and storage adapters to a core on the NUMA node they are on
Unusual
● CPU Pinning
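A sketch of both techniques (device name, IRQ number and core lists are illustrative):
# find the NUMA node of a NIC, then steer its IRQs to cores on that node
cat /sys/class/net/eth0/device/numa_node      # e.g. prints 1
grep eth0 /proc/interrupts                    # find the IRQ numbers
echo 24-31 > /proc/irq/123/smp_affinity_list  # cores 24-31 sit on node 1
# pin a VM's vCPUs to dedicated host cores (libvirt)
virsh vcpupin vm1 0 8
virsh vcpupin vm1 1 9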
Understanding oversubscription and congestion
Linux scheduler statistics: /proc/schedstat
(linux-stable/Documentation/scheduler/sched-stats.txt)
Next three are statistics describing scheduling latency:
7) sum of all time spent running by tasks on this processor (in ms)
8) sum of all time spent waiting to run by tasks on this processor (in ms)
9) # of tasks (not necessarily unique) given to the processor
* In nanoseconds, not ms.
20% CPU load with large wait time (bursty congestion) is possible
100% CPU load with no wait time is also possible
Measure CPU congestion!
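A minimal probe built on those counters (note: field 8 of the documentation is $9 in awk, because $1 holds the CPU name):
# two samples 10 s apart; print ms each CPU spent with runnable tasks
# waiting to run (values are in ns despite the older docs saying ms)
awk '/^cpu/ { print $1, $9 }' /proc/schedstat > /tmp/s0
sleep 10
awk '/^cpu/ { print $1, $9 }' /proc/schedstat > /tmp/s1
paste /tmp/s0 /tmp/s1 | awk '{ printf "%s waited %.1f ms\n", $1, ($4 - $2) / 1e6 }'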
Memory
Typical
● Dedicated RAM
● huge pages, THP
● NUMA
● use local-node memory if you can
Unusual
● Oversubscribed RAM
● balloon
● KSM (RAM dedup)
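A sketch of static huge pages for a guest (sizes illustrative; assumes libvirt):
# reserve 1 GiB huge pages at boot (kernel command line)
default_hugepagesz=1G hugepagesz=1G hugepages=64
# back the guest RAM with them (virsh edit <vm>)
<memoryBacking>
  <hugepages/>
</memoryBacking>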
Agenda
● Hardware
● Compute - CPU & Memory
● Networking
● Storage
Networking
Virtualized networking
● hardware emulation (rtl8139, e1000)
● paravirtualized drivers - virtio-net
regular virtio vs vhost-net vs vhost-user
Linux Bridge vs OVS in-kernel vs OVS-DPDK
Pass-through networking
SR-IOV (PCIe pass-through)
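For reference, a typical libvirt interface using virtio-net with the in-kernel vhost backend and multiqueue (bridge name and queue count illustrative):
<interface type='bridge'>
  <source bridge='br0'/>
  <model type='virtio'/>
  <driver name='vhost' queues='4'/>
</interface>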
virtio-net QEMU
● Multiple context switches:
1. virtio-net driver → KVM
2. KVM → qemu/virtio-net device
3. qemu → TAP device
4. qemu → KVM (notification)
5. KVM → virtio-net driver (interrupt)
● Much more efficient than emulated hardware
● shared memory with the qemu process
● a qemu thread processes the packets
virtio vhost-net
● Two context switches (optional):
1. virtio-net driver → KVM
2. KVM → virtio-net driver (interrupt)
● shared memory with the host kernel (vhost protocol)
● Allows Linux Bridge Zero Copy
● qemu / virtio-net device is on the control path only
● a kernel thread [vhost] processes the packets
virtio vhost-usr / OVS-DPDK
● No context switches
● shared memory between the guest and Open vSwitch (requires huge pages)
● Zero copy
● qemu / virtio-net device is on the control path only
● KVM is not in the path
● ovs-vswitchd processes the packets
● the poll-mode driver (PMD) takes one CPU core at 100%
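A sketch of wiring a guest to OVS-DPDK over vhost-user (socket path and names illustrative; see the OVS DPDK docs linked below):
# enable DPDK in OVS and create a userspace-datapath bridge
ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
ovs-vsctl add-port br0 vhu0 -- set Interface vhu0 \
    type=dpdkvhostuserclient options:vhost-server-path=/tmp/vhu0.sock
# guest side (libvirt; also needs hugepage-backed guest RAM)
<interface type='vhostuser'>
  <source type='unix' path='/tmp/vhu0.sock' mode='server'/>
  <model type='virtio'/>
</interface>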
PCI Passthrough
● No paravirtualized devices
● Direct access from the guest kernel to the PCI device
● Host, KVM and qemu are on neither the data path nor the control path
● NIC driver in the guest
● No virtual networking
● No live migration
● No filtering
● No control
● Shared devices via SR-IOV
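A sketch of the SR-IOV variant (device name, VF count and PCI address illustrative):
# create 4 virtual functions on the physical NIC
echo 4 > /sys/class/net/eth0/device/sriov_numvfs
# pass one VF to the guest (virsh edit <vm>; address taken from lspci)
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x3b' slot='0x02' function='0x1'/>
  </source>
</hostdev>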
Virtual Network Performance
All measurements are between two VMs on the same host
# ping -f -c 100000 vm2
(screenshots of per-process CPU usage during the flood ping omitted)
● virtio-net QEMU: CPU time shows up in the qemu process
● virtio vhost-net: in qemu and its [vhost] kernel thread
● virtio vhost-usr / OVS-DPDK: in qemu and ovs-vswitchd
Discussion
● Deep dive into Virtio-networking and vhost-net
https://www.redhat.com/en/blog/deep-dive-virtio-networking-and-vhost-net
● Open vSwitch DPDK support
https://docs.openvswitch.org/en/latest/topics/dpdk/
Agenda
● Hardware
● Compute - CPU & Memory
● Networking
● Storage
Storage - virtualization
Virtualized:
● live migration
● thin provisioning, snapshots, etc.
vs. Full bypass:
● only speed
Storage - virtualization
Virtualized
cache=none -- direct IO, bypass host buffer cache
io=native -- use Linux Native AIO, not POSIX AIO (threads)
virtio-blk vs virtio-scsi
virtio-scsi multiqueue
iothread
vs. Full bypass
SR-IOV for NVMe devices
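How these options typically look in a libvirt domain (a sketch; device paths and queue counts illustrative):
<iothreads>1</iothreads>
<!-- virtio-blk disk: direct IO, native AIO, dedicated iothread -->
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native' iothread='1'/>
  <source dev='/dev/vg0/vm1'/>
  <target dev='vda' bus='virtio'/>
</disk>
<!-- or virtio-scsi with multiqueue -->
<controller type='scsi' model='virtio-scsi'>
  <driver queues='4' iothread='1'/>
</controller>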
Storage - vhost
Virtualized, with qemu bypass (vhost)
before:
guest kernel -> host kernel -> qemu -> host kernel -> storage system
after:
guest kernel -> storage system
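With a generic vhost-user-blk backend the QEMU side looks roughly like this (a sketch; the socket path is hypothetical, and StorPool's own vhost integration may differ in detail):
# guest RAM must be shared memory so the vhost backend can map it
-object memory-backend-file,id=mem0,size=16G,mem-path=/dev/hugepages,share=on
-numa node,memdev=mem0
-chardev socket,id=blk0,path=/var/run/vhost-blk0.sock
-device vhost-user-blk-pci,chardev=blk0,num-queues=4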
(diagram: each storage node runs several storpool_server instances, each on
1 CPU thread with 2-4 GB RAM, in front of its NVMe SSDs; each KVM host runs
a storpool_block instance on 1 CPU thread; everything is connected over
dual 25GbE)
• Highly scalable and efficient architecture
• Scales up in each storage node & out with multiple nodes
Storage benchmarks
Beware: lots of snake oil out there!
● performance numbers from hardware configurations totally
unlike what you’d use in production
● synthetic tests with high iodepth - 10 nodes, 10 workloads * iodepth 256 each (because why not)
● testing with a ramdisk backend
● synthetic workloads that don't approximate the real world
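A more representative starting point is a small, VM-like fio job against the actual virtual disk (parameters illustrative; shape them after your real workload):
fio --name=vm-like --filename=/dev/vdb --direct=1 --ioengine=libaio \
    --rw=randrw --rwmixread=70 --bs=4k --iodepth=8 --numjobs=4 \
    --time_based --runtime=300 --group_reporting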
Latency vs. ops per second
(chart, built up over several slides: the latency curve is split into a
"best service" region, a "lowest cost per delivered resource" region and an
"only pain" region; benchmarks typically sit in a different region of the
curve than real load)
?
Follow StorPool Online
Venko Moyankov
venko@storpool.com
StorPool Storage
www.storpool.com
@storpool
Thank you!