Transcript

  • 1. KVM Performance Optimization. Paul Sim, Cloud Consultant, paul.sim@canonical.com
  • 2. Index
      ● CPU & Memory
        ○ vCPU pinning
        ○ NUMA affinity
        ○ THP (Transparent Huge Page)
        ○ KSM (Kernel SamePage Merging) & virtio_balloon
      ● Networking
        ○ vhost_net
        ○ Interrupt handling
        ○ Large Segment Offload
      ● Block Device
        ○ I/O Scheduler
        ○ VM Cache mode
        ○ Asynchronous I/O
  • 3. CPU & Memory - Modern CPU cache architecture and latency (Intel Sandy Bridge)
      - L1i / L1d (per core, 32K each): 4 cycles, ~0.5 ns
      - L2 or MLC (per core, unified, 256K): 11 cycles, ~7 ns
      - L3 or LLC (shared): 14 ~ 38 cycles, ~45 ns
      - Local memory: 75 ~ 100 ns
      - Remote memory (across QPI): 120 ~ 160 ns
  • 4. CPU & Memory - vCPU pinning
      * Pin a specific vCPU to a specific physical CPU.
      * vCPU pinning increases the CPU cache hit ratio.
      1. Discover the CPU topology: virsh capabilities
      2. Pin a vCPU to a specific core: virsh vcpupin <domain> <vcpu-num> <cpu-num>
      3. Print vCPU information: virsh vcpuinfo <domain>
      * What about a multi-node virtual machine? (See the example commands below.)
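      A minimal sketch of these steps, assuming a hypothetical domain named vm01 and a host where CPUs 0-3 share node 0's LLC:
        $ virsh capabilities | grep -A3 '<topology'   # sockets/cores/threads layout
        $ virsh vcpupin vm01 0 0                      # pin vCPU 0 of vm01 to physical CPU 0
        $ virsh vcpupin vm01 1 1                      # pin vCPU 1 to physical CPU 1
        $ virsh vcpuinfo vm01                         # verify the CPU affinity of each vCPU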
  • 5. CPU & Memory - vCPU pinning: roughly two times faster memory access with pinning than without. (Shorter bars are better on the scheduling and mutex tests; longer bars are better on the memory test.)
  • 6. CPU & Memory - NUMA affinity: CPU architecture (Intel Sandy Bridge). Each socket (CPU 0 ~ CPU 3) has its own cores, shared LLC, memory controller with locally attached memory, and I/O controller with PCI-E lanes; the sockets are interconnected via QPI.
  • 7. CPU & Memory - NUMA affinity
      janghoon@machine-04:~$ sudo numactl --hardware
      available: 4 nodes (0-3)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 40 41 42 43 44 45 46 47 48 49
      node 0 size: 24567 MB
      node 0 free: 6010 MB
      node 1 cpus: 10 11 12 13 14 15 16 17 18 19 50 51 52 53 54 55 56 57 58 59
      node 1 size: 24576 MB
      node 1 free: 8557 MB
      node 2 cpus: 20 21 22 23 24 25 26 27 28 29 60 61 62 63 64 65 66 67 68 69
      node 2 size: 24567 MB
      node 2 free: 6320 MB
      node 3 cpus: 30 31 32 33 34 35 36 37 38 39 70 71 72 73 74 75 76 77 78 79
      node 3 size: 24567 MB
      node 3 free: 8204 MB
      node distances:
      node   0   1   2   3
        0:  10  16  16  22
        1:  16  10  22  16
        2:  16  22  10  16
        3:  22  16  16  10
      * Remote memory is expensive!
  • 8. CPU & Memory - NUMA affinity
      1. Determine where the pages of a VM are allocated:
         - cat /proc/<PID>/numa_maps
         - cat /sys/fs/cgroup/memory/sysdefault/libvirt/qemu/<KVM name>/memory.numa_stat
           total=244973 N0=118375 N1=126598
           file=81 N0=24 N1=57
           anon=244892 N0=118351 N1=126541
           unevictable=0 N0=0 N1=0
      2. Change the memory policy mode:
         - cgset -r cpuset.mems=<Node> sysdefault/libvirt/qemu/<KVM name>/emulator/
      3. Migrate pages to a specific node:
         - migratepages <PID> <from-node> <to-node>
         - cat /proc/<PID>/status
      * Memory policy modes
        1) interleave: memory is allocated round-robin across the nodes; when memory cannot be allocated on the current interleave target, fall back to other nodes.
        2) bind: only allocate memory from the given nodes; allocation fails when there is not enough memory available on these nodes.
        3) preferred: preferably allocate memory on the given node, but fall back to other nodes if memory cannot be allocated there.
      * The "preferred" memory policy mode is currently not supported with cgroups. (A worked example follows below.)
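      A rough walk-through of the steps above, assuming a hypothetical guest vm01 whose qemu process has PID 12345 and which should be confined to node 0:
        $ grep -o 'N[0-9]*=[0-9]*' /proc/12345/numa_maps | sort | head      # per-node page counts per mapping
        $ cat /sys/fs/cgroup/memory/sysdefault/libvirt/qemu/vm01/memory.numa_stat
        $ sudo cgset -r cpuset.mems=0 sysdefault/libvirt/qemu/vm01/emulator/   # future allocations from node 0 only
        $ sudo migratepages 12345 1 0                                       # move existing pages from node 1 to node 0
        $ grep Mems_allowed_list /proc/12345/status                         # confirm the allowed node list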
  • 9. CPU & Memory - NUMA affinity - NUMA reclaim
      1. Check whether zone reclaim is enabled:
         - cat /proc/sys/vm/zone_reclaim_mode
           0 (default): the kernel allocates the memory on a remote NUMA node where free memory is available.
           1: the kernel reclaims unmapped page cache on the local NUMA node rather than immediately allocating the memory on a remote NUMA node.
      * A virtual machine is known to cause zone reclaim in situations where KSM (Kernel SamePage Merging) is enabled on the host or hugepages are enabled on the guest side.
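      A quick check, and a temporary change if remote allocation is preferred over local reclaim (not persistent across reboots):
        $ cat /proc/sys/vm/zone_reclaim_mode                  # 0 = allocate remotely, 1 = reclaim locally
        $ echo 0 | sudo tee /proc/sys/vm/zone_reclaim_mode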
  • 10. CPU & Memory - THP (Transparent Huge Page) - Paging levels on some 64-bit architectures
      Platform | Page size     | Address bits used | Paging levels | Splitting
      Alpha    | 8KB           | 43                | 3             | 10 + 10 + 10 + 13
      IA64     | 4KB           | 39                | 3             | 9 + 9 + 9 + 12
      PPC64    | 4KB           | 41                | 3             | 10 + 10 + 9 + 12
      x86_64   | 4KB (default) | 48                | 4             | 9 + 9 + 9 + 9 + 12
      x86_64   | 2MB (THP)     | 48                | 3             | 9 + 9 + 9 + 21
  • 11. CPU & Memory - THP (Transparent Huge Page) - Memory address translation in 64-bit Linux: a 48-bit linear (virtual) address is split into Global Dir (9 bits), Upper Dir (9 bits), Middle Dir (9 bits), Table (9 bits with 4KB pages, 0 bits with 2MB pages) and Offset (12 bits with 4KB pages, 21 bits with 2MB pages). The walk starts at cr3 and goes Page Global Directory -> Page Upper Directory -> Page Middle Directory -> Page Table -> physical page; with 2MB pages the Page Table step is skipped, reducing the translation by one step.
  • 12. CPU & Memory - THP (Transparent Huge Page) - Paging hardware with TLB (Translation Lookaside Buffer)
      * TLB: a cache that the memory management hardware uses to improve virtual address translation speed.
      * The TLB is itself a kind of cache memory in the CPU.
      * So how can we increase the TLB hit ratio?
        - A TLB holds only about 8 ~ 1024 entries.
        - Decrease the number of pages: e.g. a 32GB memory system has 8,388,608 pages with 4KB page blocks but only 16,384 pages with 2MB page blocks.
  • 13. CPU & Memory - THP (Transparent Huge Page): THP performance benchmark - MySQL 5.5 OLTP test. * The guest machine also has to use hugepages for the best effect.
  • 14. CPU & Memory - THP (Transparent Huge Page)
      1. Check the current THP configuration:
         - cat /sys/kernel/mm/transparent_hugepage/enabled
      2. Configure the THP mode:
         - echo <mode> > /sys/kernel/mm/transparent_hugepage/enabled
      3. Monitor hugepage usage:
         - grep Huge /proc/meminfo
           janghoon@machine-04:~$ grep Huge /proc/meminfo
           AnonHugePages:    462848 kB
           HugePages_Total:       0
           HugePages_Free:        0
           HugePages_Rsvd:        0
           HugePages_Surp:        0
           Hugepagesize:       2048 kB
         - grep Huge /proc/<PID>/smaps
      4. Adjust the parameters under /sys/kernel/mm/transparent_hugepage/khugepaged
         - grep thp /proc/vmstat
      * THP modes
        1) always: always use hugepages.
        2) madvise: use hugepages only in regions marked with madvise(MADV_HUGEPAGE). Default on Ubuntu precise.
        3) never: do not use hugepages.
      * "Currently it only works for anonymous memory mappings but in the future it can expand over the pagecache layer starting with tmpfs." - Linux kernel Documentation/vm/transhuge.txt
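      A short sketch of switching THP to "always" on the host and watching it take effect (the sysfs paths are the standard ones from the slide):
        $ cat /sys/kernel/mm/transparent_hugepage/enabled     # e.g. always [madvise] never
        $ echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
        $ grep AnonHugePages /proc/meminfo                    # grows as khugepaged collapses pages
        $ grep thp_collapse_alloc /proc/vmstat                # khugepaged collapse counter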
  • 15. CPU & Memory - KSM & virtio_balloon - KSM (Kernel SamePage Merging)
      1. A kernel feature used by KVM to share identical memory pages between processes, allowing memory over-commitment.
      2. Only anonymous (private) pages marked MADV_MERGEABLE are merged.
      3. Enable KSM:
         - echo "1" > /sys/kernel/mm/ksm/run
      4. Monitor KSM:
         - files under /sys/kernel/mm/ksm/
      5. For NUMA (since Linux 3.9):
         - /sys/kernel/mm/ksm/merge_across_nodes
           0: merge only pages in the memory area of the same NUMA node
           1: merge pages across nodes
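      A minimal sketch for enabling KSM and estimating its saving (pages_sharing / pages_shared gives the average sharing ratio):
        $ echo 1 | sudo tee /sys/kernel/mm/ksm/run
        $ cat /sys/kernel/mm/ksm/pages_shared        # deduplicated pages kept in memory
        $ cat /sys/kernel/mm/ksm/pages_sharing       # guest pages mapped onto them
        $ cat /sys/kernel/mm/ksm/full_scans          # completed scans of all mergeable areas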
  • 16. CPU & Memory - KSM & virtio_balloon - virtio_balloon (currentMemory vs. memory in the domain definition)
      1. The hypervisor sends a request to the guest operating system to return some amount of memory back to the hypervisor.
      2. The virtio_balloon driver in the guest operating system receives the request from the hypervisor.
      3. The virtio_balloon driver inflates a balloon of memory inside the guest operating system.
      4. The guest operating system returns the balloon of memory back to the hypervisor.
      5. The hypervisor allocates the memory from the balloon elsewhere as needed.
      6. If the memory from the balloon later becomes available, the hypervisor can return the memory to the guest operating system.
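      With libvirt the balloon is driven through the current memory setting; a sketch, assuming a hypothetical domain vm01 whose <memory> ceiling is 4 GiB (virsh setmem takes KiB by default):
        $ virsh dommemstat vm01                      # current balloon size and guest memory stats
        $ virsh setmem vm01 2097152 --live           # shrink the guest to 2 GiB (inflate the balloon)
        $ virsh setmem vm01 4194304 --live           # give the memory back, up to the <memory> maximum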
  • 17. Networking - vhost-net. Diagram: with plain virtio, packets travel guest virtio front-end -> QEMU virtio back-end (user space) -> TAP -> bridge -> pNIC, with a context switch into QEMU on the way; with vhost-net, the virtqueue is serviced by the in-kernel vhost_net thread (DMA into the guest), so data moves guest -> vhost_net -> TAP -> bridge -> pNIC without bouncing through QEMU user space.
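      A quick way to confirm vhost-net is loaded and in use on the host (one vhost kernel thread appears per active virtio NIC):
        $ lsmod | grep vhost_net                     # module loaded
        $ ls -l /dev/vhost-net                       # char device QEMU opens
        $ ps -e | grep vhost-                        # vhost-<qemu pid> kernel threads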
  • 18. Networking - vhost-net. Diagram: SR-IOV and PCI pass-through. With PCI pass-through, a whole pNIC is handed to one guest through the I/O MMU, bypassing QEMU and the host bridge; with SR-IOV, the pNIC's PCI Manager exposes a PF plus multiple VFs, and each guest vNIC is mapped to its own VF directly.
  • 19. Networking - vhost-net
  • 20. Networking - Interrupt handling
      1. Multi-queue NIC
         root@machine-04:~# cat /proc/interrupts | grep eth0
         71:      0       0       0       0       0       0       0       0   IR-PCI-MSI-edge   eth0
         72:      0       0     213  679058       0       0       0       0   IR-PCI-MSI-edge   eth0-TxRx-0
         73:      0       0       0       0       0       0       0       0   IR-PCI-MSI-edge   eth0-TxRx-1
         74: 562872       0       0       0       0       0       0       0   IR-PCI-MSI-edge   eth0-TxRx-2
         75: 561303       0       0       0       0       0       0       0   IR-PCI-MSI-edge   eth0-TxRx-3
         76: 583422       0       0       0       0       0       0       0   IR-PCI-MSI-edge   eth0-TxRx-4
         77:     21       5  563963       0       0       0       0       0   IR-PCI-MSI-edge   eth0-TxRx-5
         78:     23       0  548548       0       0       0       0       0   IR-PCI-MSI-edge   eth0-TxRx-6
         79: 567242       0       0       0       0       0       0       0   IR-PCI-MSI-edge   eth0-TxRx-7
      - RPS (Receive Packet Steering) / XPS (Transmit Packet Steering): a mechanism for intelligently selecting which receive/transmit queue to use when receiving/transmitting a packet on a multi-queue device. It configures CPU affinity per device queue and is disabled by default.
      - Configure the CPU mask:
         - echo 1 > /sys/class/net/<device>/queues/rx-<queue num>/rps_cpus
         - echo 16 > /sys/class/net/<device>/queues/tx-<queue num>/xps_cpus
      - Around 20 ~ 100% improvement. (See the mask example below.)
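      A sketch of steering one RX queue to CPUs 0-3 and one TX queue to CPU 4; the masks are hexadecimal CPU bitmaps, and the queue numbers are illustrative:
        $ echo f | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus    # 0xf = CPUs 0-3
        $ echo 10 | sudo tee /sys/class/net/eth0/queues/tx-0/xps_cpus   # 0x10 = CPU 4
        $ cat /sys/class/net/eth0/queues/rx-0/rps_cpus                  # verify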
  • 21. Networking - Interrupt handling
      2. MSI (Message Signaled Interrupts), MSI-X
         - Make sure MSI-X is enabled:
           janghoon@machine-04:~$ sudo lspci -vs 01:00.1 | grep MSI
           Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
           Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
           Capabilities: [a0] Express Endpoint, MSI 00
      3. NAPI (New API)
         - A Linux kernel feature that aims to improve the performance of high-speed networking and avoid interrupt storms.
         - Ask your NIC vendor whether it is enabled by default.
      * Most modern NICs support multi-queue, MSI-X and NAPI; however, you may need to make sure these features are configured correctly and working properly.
      http://en.wikipedia.org/wiki/Message_Signaled_Interrupts
      http://en.wikipedia.org/wiki/New_API
  • 22. Networking - Interrupt handling. 4. IRQ affinity (pin interrupts to the local node). Diagram (Intel Sandy Bridge): each socket has its own memory controller and I/O controller, so PCI-E devices are attached directly to one node; the question is which node your 10G NIC is connected to.
  • 23. Networking - Interrupt handling
      4. IRQ affinity (pin interrupts to the local node), as sketched below
         1) Stop the irqbalance service.
         2) Determine the node that the NIC is connected to:
            - lspci -tv
            - lspci -vs <PCI device bus address>
            - dmidecode -t slot
            - cat /sys/devices/pci*/<bus address>/numa_node   (-1 : not detected)
            - cat /proc/irq/<interrupt number>/node
         3) Pin interrupts from the NIC to the specific node:
            - echo f > /proc/irq/<interrupt number>/smp_affinity
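      Putting it together for a hypothetical NIC at PCI address 0000:01:00.0 whose queues use IRQs 72-79 and which turns out to be local to node 0:
        $ sudo service irqbalance stop                        # or: sudo stop irqbalance (upstart)
        $ cat /sys/bus/pci/devices/0000:01:00.0/numa_node     # 0 -> local to node 0
        $ grep eth0 /proc/interrupts                          # find the IRQ numbers
        $ echo f | sudo tee /proc/irq/72/smp_affinity         # 0xf = CPUs 0-3 on node 0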
  • 24. Networking - Large Segment Offload
      Without LSO: the application writes e.g. 4K of data, the kernel splits it into MTU-sized segments and prepends a header to each, and the NIC transmits them one by one.
      With LSO: the kernel hands the NIC one large buffer plus header metadata, and the NIC performs the segmentation itself.
      janghoon@ubuntu-precise:~$ sudo ethtool -k eth0
      Offload parameters for eth0:
      tcp-segmentation-offload: on
      udp-fragmentation-offload: on
      generic-segmentation-offload: on
      generic-receive-offload: on
      large-receive-offload: off
  • 25. Networking - Large Segment Offload
      1. Around 100% throughput improvement.
      2. GSO/GRO, TSO and UFO are enabled by default on Ubuntu 12.04 LTS (precise).
      3. Jumbo frames do not add much on top of this.
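      Checking and toggling the offloads with ethtool (-k lists the settings, -K changes them; eth0 as in the slide):
        $ sudo ethtool -k eth0 | egrep 'segmentation|receive-offload'
        $ sudo ethtool -K eth0 tso on gso on gro on     # re-enable TSO/GSO/GRO if they were turned off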
  • 26. Block device - I/O Scheduler
      1. noop: a basic FIFO queue.
      2. deadline: I/O requests are placed in a priority queue and are guaranteed to be served within a certain time; low latency.
      3. cfq (Completely Fair Queueing): I/O requests are distributed to a number of per-process queues; the default I/O scheduler on Ubuntu.
      janghoon@ubuntu-precise:~$ sudo cat /sys/block/sdb/queue/scheduler
      noop deadline [cfq]
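      Switching the scheduler for a single disk at runtime (sdb as in the slide; the change does not persist across reboots):
        $ cat /sys/block/sdb/queue/scheduler               # noop deadline [cfq]
        $ echo deadline | sudo tee /sys/block/sdb/queue/scheduler
        $ cat /sys/block/sdb/queue/scheduler               # noop [deadline] cfq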
  • 27. Block device - VM Cache mode - disable the host page cache
      <disk type='file' device='disk'>
        <driver name='qemu' type='qcow2' cache='none'/>
        <source file='/mnt/VM_IMAGES/VM-test.img'/>
        <target dev='vda' bus='virtio'/>
        <alias name='virtio-disk0'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
      </disk>
      Cache modes (read/write paths through the host page cache, the disk cache and the physical disk):
        - writeback: uses the host page cache and the disk write cache.
        - writethrough: uses the host page cache, but writes are forced through to the physical disk.
        - none: bypasses the host page cache; only the disk write cache is used.
        - directsync: bypasses the host page cache and forces writes through to the physical disk.
  • 28. Block device - Asynchronous I/O
      <disk type='file' device='disk'>
        <driver name='qemu' type='qcow2' cache='none' io='native'/>
        <source file='/dev/sda7'/>
        <target dev='vda' bus='virtio'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
      </disk>