KVM Performance Optimization

Paul Sim
Cloud Consultant
paul.sim@canonical.com

Index
● CPU & Memory
  ○ vCPU pinning
  ○ NUMA affinity
  ○ THP (Transparent Huge Page)
  ○ KSM (Kernel SamePage Merging) & virtio_balloon
● Networking
  ○ vhost_net
  ○ Interrupt handling
  ○ Large Segment Offload
● Block Device
  ○ I/O Scheduler
  ○ VM Cache mode
  ○ Asynchronous I/O

● CPU & Memory
Modern CPU cache architecture and latency (Intel Sandy Bridge)

  Cache level                       Size    Latency
  L1i / L1d (per core)              32K     4 cycles, 0.5 ns
  L2 or MLC (unified, per core)     256K    11 cycles, 7 ns
  L3 or LLC (shared)                -       14 ~ 38 cycles, 45 ns
  Local memory                      -       75 ~ 100 ns
  Remote memory (across QPI)        -       120 ~ 160 ns

● CPU & Memory - vCPU pinning
(Diagram: KVM guests 0-4 pinned across cores 0-3 of Node0 and Node1; each node has its own LLC and node memory.)
* Pin a specific vCPU to a specific physical CPU.
* vCPU pinning increases the CPU cache hit ratio.
1. Discover the CPU topology.
   - virsh capabilities
2. Pin a vCPU to a specific core.
   - vcpupin <domain> <vcpu-num> <cpu-num>
3. Print vCPU information.
   - vcpuinfo <domain>
* What about a multi-node virtual machine?
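
Putting the steps above together, a minimal sketch of pinning a two-vCPU guest (the domain name "guest1" and the CPU numbers are assumptions, not taken from the slides):

  # inspect the host topology (sockets, cores, threads, NUMA cells)
  virsh capabilities | grep -A3 '<topology'
  # pin vCPU 0 and 1 of guest1 to physical CPUs 2 and 3 (same node, shared LLC)
  virsh vcpupin guest1 0 2 --live --config
  virsh vcpupin guest1 1 3 --live --config
  # verify the pinning and the physical CPU each vCPU currently runs on
  virsh vcpuinfo guest1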

● CPU & Memory - vCPU pinning
* Two times faster memory access than without pinning.
  - Shorter is better on the scheduling and mutex tests.
  - Longer is better on the memory test.

● CPU & Memory - NUMA affinity
CPU architecture (Intel Sandy Bridge)
(Diagram: four sockets, CPU 0-3, linked by QPI. Each socket has its own cores and LLC, a memory controller with locally attached memory, and an I/O controller with its own PCI-E lanes.)

● CPU & Memory - NUMA affinity
janghoon@machine-04:~$ sudo numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 40 41 42 43 44 45 46 47 48 49
node 0 size: 24567 MB
node 0 free: 6010 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 50 51 52 53 54 55 56 57 58 59
node 1 size: 24576 MB
node 1 free: 8557 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29 60 61 62 63 64 65 66 67 68 69
node 2 size: 24567 MB
node 2 free: 6320 MB
node 3 cpus: 30 31 32 33 34 35 36 37 38 39 70 71 72 73 74 75 76 77 78 79
node 3 size: 24567 MB
node 3 free: 8204 MB
node distances:
node   0   1   2   3
  0:  10  16  16  22
  1:  16  10  22  16
  2:  16  22  10  16
  3:  22  16  16  10

* Remote memory is expensive!

● CPU & Memory - NUMA affinity
1. Determine where the pages of a VM are allocated.
   - cat /proc/<PID>/numa_maps
   - cat /sys/fs/cgroup/memory/sysdefault/libvirt/qemu/<KVM name>/memory.numa_stat
     total=244973 N0=118375 N1=126598
     file=81 N0=24 N1=57
     anon=244892 N0=118351 N1=126541
     unevictable=0 N0=0 N1=0
2. Change the memory policy mode.
   - cgset -r cpuset.mems=<Node> sysdefault/libvirt/qemu/<KVM name>/emulator/
3. Migrate pages to a specific node.
   - migratepages <PID> <from-node> <to-node>
   - cat /proc/<PID>/status

* Memory policy modes
  1) interleave : memory is allocated round-robin across the given nodes; when it cannot be allocated on the current interleave target, allocation falls back to other nodes.
  2) bind : only allocate memory from the given nodes; allocation fails when not enough memory is available on these nodes.
  3) preferred : preferably allocate memory on the given node, but fall back to other nodes if it cannot be allocated there.
* The "preferred" memory policy mode is not currently supported in cgroups.
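
Putting the steps above together, a hedged sketch of moving a guest onto node 0 (the domain name "guest1" and the node numbers are assumptions; virsh numatune needs a reasonably recent libvirt, and the <numatune> element in the domain XML is the declarative equivalent):

  # find the qemu process of the guest and see where its pages currently live
  PID=$(pgrep -f 'qemu.*guest1' | head -1)
  grep -o 'N[0-9]*=' /proc/$PID/numa_maps | sort | uniq -c   # mappings touching each node
  # restrict future allocations to node 0
  virsh numatune guest1 --mode strict --nodeset 0 --live
  # move the pages that are already on node 1 over to node 0
  migratepages $PID 1 0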

● CPU & Memory - NUMA affinity
- NUMA reclaim
(Diagram: Node0/Node1, each holding free memory, mapped page cache, unmapped page cache and anonymous pages; local reclaim frees the unmapped page cache of the local node.)
1. Check whether zone reclaim is enabled.
   - cat /proc/sys/vm/zone_reclaim_mode
     0 (default) : the kernel allocates the memory on a remote NUMA node where free memory is available.
     1 : the kernel reclaims unmapped page cache on the local NUMA node rather than immediately allocating the memory on a remote NUMA node.
* A virtual machine is known to cause zone reclaim when KSM (Kernel SamePage Merging) is enabled or when HugePages are enabled on the virtual machine side.
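
A minimal sketch of inspecting and changing the setting (the sysctl name is real; whether 0 or 1 is the right value depends on the workload, so the value below is only an example):

  # current mode
  cat /proc/sys/vm/zone_reclaim_mode
  # keep allocations falling back to remote nodes (a common choice for KVM hosts)
  sysctl -w vm.zone_reclaim_mode=0
  # persist across reboots
  echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf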

● CPU & Memory - THP (Transparent Huge Page)
- Paging levels in some 64-bit architectures

  Platform   Page size       Address bits used   Paging levels   Splitting
  Alpha      8KB             43                  3               10 + 10 + 10 + 13
  IA64       4KB             39                  3               9 + 9 + 9 + 12
  PPC64      4KB             41                  3               10 + 10 + 9 + 12
  x86_64     4KB (default)   48                  4               9 + 9 + 9 + 9 + 12
  x86_64     2MB (THP)       48                  3               9 + 9 + 9 + 21

● CPU & Memory - THP (Transparent Huge Page)
- Memory address translation in 64-bit Linux

A 48-bit linear (virtual) address is split into:
  Global Dir (9 bit) | Upper Dir (9 bit) | Middle Dir (9 bit) | Table (9 bit with 4KB pages, 0 bit with 2MB pages) | Offset (12 bit with 4KB pages, 21 bit with 2MB pages)

cr3 points to the Page Global Directory; the walk continues through the Page Upper Directory, the Page Middle Directory and the page table to reach the 4KB physical page. With 2MB pages the page-table level disappears, reducing the walk by one step.

● CPU & Memory - THP (Transparent Huge Page)
Paging hardware with TLB (Translation Lookaside Buffer)

* Translation Lookaside Buffer (TLB) : a cache that the memory-management hardware uses to improve virtual address translation speed.
* The TLB is itself a kind of cache memory in the CPU.
* Then, how can we increase the TLB hit ratio?
  - A TLB holds only around 8 ~ 1024 entries.
  - Decrease the number of pages. e.g. on a 32GB memory system:
      8,388,608 pages with 4KB page blocks
      16,384 pages with 2MB page blocks

● CPU & Memory - THP (Transparent Huge Page)
THP performance benchmark - MySQL 5.5 OLTP testing
* The guest machine also has to use HugePages for the best effect.

● CPU & Memory - THP (Transparent Huge Page)
1. Check the current THP configuration.
   - cat /sys/kernel/mm/transparent_hugepage/enabled
2. Configure the THP mode.
   - echo <mode> > /sys/kernel/mm/transparent_hugepage/enabled
3. Monitor huge page usage.
   - grep Huge /proc/meminfo
     janghoon@machine-04:~$ grep Huge /proc/meminfo
     AnonHugePages:    462848 kB
     HugePages_Total:       0
     HugePages_Free:        0
     HugePages_Rsvd:        0
     HugePages_Surp:        0
     Hugepagesize:       2048 kB
   - grep Huge /proc/<PID>/smaps
4. Adjust the parameters under /sys/kernel/mm/transparent_hugepage/khugepaged.
   - grep thp /proc/vmstat

* THP modes
  1) always : always use HugePages.
  2) madvise : use HugePages only in regions marked with madvise(MADV_HUGEPAGE). Default on Ubuntu precise.
  3) never : do not use HugePages.
* THP currently only works for anonymous memory mappings, but in the future it can expand over the pagecache layer, starting with tmpfs. - Linux kernel Documentation/vm/transhuge.txt
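
A consolidated sketch of the steps above (choosing "always" here is only an example; madvise or never may suit other workloads better):

  # the active mode is shown in brackets
  cat /sys/kernel/mm/transparent_hugepage/enabled
  # switch the running system to always
  echo always > /sys/kernel/mm/transparent_hugepage/enabled
  # watch anonymous huge page usage grow as guests touch memory
  grep AnonHugePages /proc/meminfo
  grep thp_fault_alloc /proc/vmstat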

● CPU & Memory - KSM & virtio_balloon
- KSM (Kernel SamePage Merging)
(Diagram: guests 0-2 register pages with madvise(MADV_MERGEABLE) and KSM merges the identical ones.)
1. A kernel feature used by KVM that shares identical memory pages between processes, allowing memory over-commit.
2. Only merges anonymous (private) pages.
3. Enable KSM
   - echo 1 > /sys/kernel/mm/ksm/run
4. Monitor KSM
   - files under /sys/kernel/mm/ksm/
5. For NUMA (in Linux 3.9)
   - /sys/kernel/mm/ksm/merge_across_nodes
     0 : merge only pages within the memory area of the same NUMA node
     1 : merge pages across nodes
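
A minimal sketch of enabling and watching KSM (the tuning values are examples only; higher scan rates merge more pages but cost more host CPU):

  # start the ksmd thread
  echo 1 > /sys/kernel/mm/ksm/run
  # scan more pages per wake-up and wake up more often
  echo 1000 > /sys/kernel/mm/ksm/pages_to_scan
  echo 50 > /sys/kernel/mm/ksm/sleep_millisecs
  # pages_sharing / pages_shared gives the average sharing ratio
  grep . /sys/kernel/mm/ksm/pages_shared /sys/kernel/mm/ksm/pages_sharing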

● CPU & Memory - KSM & virtio_balloon
- virtio_balloon
(Diagram: host memory vs. each VM's memory ceiling and currentMemory target; the balloon inside each guest grows or shrinks to return memory to the host.)
1. The hypervisor sends a request to the guest operating system to return some amount of memory back to the hypervisor.
2. The virtio_balloon driver in the guest operating system receives the request from the hypervisor.
3. The virtio_balloon driver inflates a balloon of memory inside the guest operating system.
4. The guest operating system returns the balloon of memory back to the hypervisor.
5. The hypervisor allocates the memory from the balloon elsewhere as needed.
6. If the memory from the balloon later becomes available, the hypervisor can return it to the guest operating system.
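
With libvirt, the balloon ceiling and the current target are the <memory> and <currentMemory> elements of the domain XML, and the target can be changed at run time. A hedged sketch (the domain name "guest1" and the sizes are assumptions):

  <memory>4194304</memory>                      <!-- maximum, in KiB (4 GB) -->
  <currentMemory>2097152</currentMemory>        <!-- balloon target, in KiB (2 GB) -->

  # shrink the running guest to 1.5 GB by inflating its balloon
  virsh setmem guest1 1572864
  # balloon statistics reported by the guest driver
  virsh dommemstat guest1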

● Networking - vhost-net
virtio vs. vhost-net
(Diagram: with plain virtio, the guest's virtio front-end talks to a virtio back-end inside QEMU, which context-switches into the host kernel to reach the TAP device and bridge in front of the physical NIC. With vhost-net, the virtqueues are serviced by the vhost_net module in the host kernel, which moves the data between the guest and the TAP/bridge directly, taking QEMU out of the data path.)
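
A minimal sketch of checking for vhost-net and asking libvirt to use it for a guest NIC (the bridge name br0 is an assumption; with a tap backend and the module loaded, recent QEMU/libvirt use vhost for virtio NICs automatically):

  # host side: load and verify the module
  modprobe vhost_net
  lsmod | grep vhost_net
  ls -l /dev/vhost-net

  <!-- domain XML: a virtio NIC on a bridge, explicitly requesting the vhost backend -->
  <interface type='bridge'>
    <source bridge='br0'/>
    <model type='virtio'/>
    <driver name='vhost'/>
  </interface>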

● Networking - vhost-net
SR-IOV and PCI passthrough
(Diagram: with PCI passthrough, a guest's vNIC is mapped through the I/O MMU directly onto an entire physical NIC, bypassing QEMU and the host kernel. With SR-IOV, one physical NIC exposes a PF and multiple VFs; each guest vNIC is mapped onto its own VF via the PCI manager and I/O MMU, while other guests can still use the normal bridge path.)
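
A hedged sketch of exposing a VF and handing it to a guest (the NIC name, VF count and PCI address are assumptions; the exact steps vary by driver and kernel version, and older drivers use a max_vfs module option instead of the sysfs knob):

  # create 4 virtual functions on the port behind eth0
  echo 4 > /sys/class/net/eth0/device/sriov_numvfs
  lspci | grep -i 'virtual function'

  <!-- domain XML: pass one VF straight through to the guest -->
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x03' slot='0x10' function='0x0'/>
    </source>
  </hostdev>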

● Networking - Interrupt handling
1. Multi-queue NIC
root@machine-04:~# cat /proc/interrupts | grep eth0
 71:       0       0       0       0       0       0       0       0   IR-PCI-MSI-edge   eth0
 72:       0       0     213  679058       0       0       0       0   IR-PCI-MSI-edge   eth0-TxRx-0
 73:       0       0       0       0       0       0       0       0   IR-PCI-MSI-edge   eth0-TxRx-1
 74:  562872       0       0       0       0       0       0       0   IR-PCI-MSI-edge   eth0-TxRx-2
 75:  561303       0       0       0       0       0       0       0   IR-PCI-MSI-edge   eth0-TxRx-3
 76:  583422       0       0       0       0       0       0       0   IR-PCI-MSI-edge   eth0-TxRx-4
 77:      21       5  563963       0       0       0       0       0   IR-PCI-MSI-edge   eth0-TxRx-5
 78:      23       0  548548       0       0       0       0       0   IR-PCI-MSI-edge   eth0-TxRx-6
 79:  567242       0       0       0       0       0       0       0   IR-PCI-MSI-edge   eth0-TxRx-7

- RPS (Receive Packet Steering) / XPS (Transmit Packet Steering)
  : a mechanism for intelligently selecting which receive/transmit queue to use when receiving/transmitting a packet on a multi-queue device. CPU affinity can be configured per device queue; disabled by default.
  : configure the CPU mask
    - echo 1 > /sys/class/net/<device>/queues/rx-<queue num>/rps_cpus
    - echo 16 > /sys/class/net/<device>/queues/tx-<queue num>/xps_cpus
  Around 20 ~ 100% improvement.
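
A minimal sketch of spreading receive and transmit steering over CPUs 0-7 for every queue of one NIC (the device name eth0 and the mask ff are assumptions; the mask is a bitmap of CPUs allowed to process that queue):

  for rxq in /sys/class/net/eth0/queues/rx-*; do
      echo ff > "$rxq/rps_cpus"
  done
  for txq in /sys/class/net/eth0/queues/tx-*; do
      echo ff > "$txq/xps_cpus"
  done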

● Networking - Interrupt handling
2. MSI (Message Signaled Interrupts), MSI-X
- Make sure MSI-X is enabled.
janghoon@machine-04:~$ sudo lspci -vs 01:00.1 | grep MSI
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
3. NAPI (New API)
- A feature in the Linux kernel that aims to improve the performance of high-speed networking and avoid interrupt storms.
- Ask your NIC vendor whether it is enabled by default.

Most modern NICs support multi-queue, MSI-X and NAPI. However, you may need to make sure these features are configured correctly and working properly.
http://en.wikipedia.org/wiki/Message_Signaled_Interrupts
http://en.wikipedia.org/wiki/New_API

● Networking - Interrupt handling
4. IRQ Affinity (pin interrupts to the local node)
(Diagram: two sockets, CPU 0 and CPU 1, linked by QPI; each has its own memory controller, memory, cores and an I/O controller with PCI-E lanes attached directly to that socket.)
Each node has PCI-E devices connected directly (on Intel Sandy Bridge). Which node is my 10G NIC connected to?

● Networking - Interrupt handling
4. IRQ Affinity (pin interrupts to the local node)
1) Stop the irqbalance service.
2) Determine the node that the NIC is connected to.
   - lspci -tv
   - lspci -vs <PCI device bus address>
   - dmidecode -t slot
   - cat /sys/devices/pci*/<bus address>/numa_node   (-1 : not detected)
   - cat /proc/irq/<interrupt number>/node
3) Pin interrupts from the NIC to a specific node.
   - echo f > /proc/irq/<interrupt number>/smp_affinity
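
An end-to-end sketch of the steps above (eth0 and the mask f are assumptions; the mask should only cover CPUs of the node the NIC is attached to):

  service irqbalance stop
  # which node owns the NIC?
  cat /sys/class/net/eth0/device/numa_node
  # pin every eth0 queue interrupt to CPUs 0-3 (mask 0xf) on that node
  for irq in $(grep eth0 /proc/interrupts | cut -d: -f1); do
      echo f > /proc/irq/$irq/smp_affinity
  done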

● Networking - Large Segment Offload
(Diagram: without LSO, the kernel splits an application's 4K write into MTU-sized segments and hands each header + data segment to the NIC. With LSO, the kernel passes the whole 4K buffer down with one header plus segmentation metadata, and the NIC builds the individual segments itself.)

janghoon@ubuntu-precise:~$ sudo ethtool -k eth0
Offload parameters for eth0:
tcp-segmentation-offload: on
udp-fragmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off

● Networking - Large Segment Offload
1. 100% throughput performance improvement.
2. GSO/GRO, TSO and UFO are enabled by default on Ubuntu 12.04 LTS (precise).
3. Jumbo frames are not very effective by comparison.
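
A minimal sketch of checking and toggling the offloads (eth0 is an assumption; LRO is often turned off when the host bridges or routes traffic, as below):

  ethtool -k eth0        # show current offload settings
  ethtool -K eth0 tso on
  ethtool -K eth0 gso on
  ethtool -K eth0 gro on
  ethtool -K eth0 lro off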

● Block device - I/O Scheduler
1. Noop : a basic FIFO queue.
2. Deadline : I/O requests are placed in a priority queue and are guaranteed to run within a certain time. Low latency.
3. CFQ (Completely Fair Queueing) : I/O requests are distributed to a number of per-process queues. The default I/O scheduler on Ubuntu.

janghoon@ubuntu-precise:~$ sudo cat /sys/block/sdb/queue/scheduler
noop deadline [cfq]
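
A minimal sketch of switching the scheduler for one disk (sdb and the choice of deadline are examples; noop or deadline are common on KVM hosts, leaving fair queueing to the guests):

  echo deadline > /sys/block/sdb/queue/scheduler
  cat /sys/block/sdb/queue/scheduler          # noop [deadline] cfq
  # to make it the default at boot, add elevator=deadline to
  # GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and run update-grub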

● Block device - VM Cache mode
Disable the host page cache:
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/mnt/VM_IMAGES/VM-test.img'/>
  <target dev='vda' bus='virtio'/>
  <alias name='virtio-disk0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>

(Diagram: the read/write path from the guest through the host page cache, the physical disk's cache and the platter, for each cache mode.)
  - writeback    : uses the host page cache and the disk write cache.
  - writethrough : uses the host page cache, but writes are flushed to the disk before completion.
  - none         : bypasses the host page cache; the disk write cache stays enabled.
  - directsync   : bypasses the host page cache and flushes writes to the disk before completion.

● Block device - Asynchronous I/O
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none' io='native'/>
  <source file='/dev/sda7'/>
  <target dev='vda' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>
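
The io attribute selects the QEMU AIO backend: io='native' uses Linux native AIO and is usually paired with cache='none', while io='threads' uses a userspace thread pool. A hedged usage sketch (the domain name "guest1" is an assumption):

  virsh edit guest1                             # add io='native' to the <driver> element shown above
  virsh shutdown guest1 && virsh start guest1   # restart the guest so QEMU picks up the new setting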