
Kernel Recipes 2017 - 20 years of Linux Virtual Memory - Andrea Arcangeli

Andrea will provide a short high-level view of the most notable milestones in the evolution of the Linux Virtual Memory over the years. He will then focus on the various Memory Management features such as Transparent Huge Pages (THP), automatic NUMA balancing and userfaultfd/postcopy live migration of Kernel Virtual Machines (KVM). Andrea will cover best practices, providing the audience with an understanding of when and how to leverage these features in their environments.

Andrea Arcangeli, Red Hat


  1. 1. Copyright © 2017 Red Hat Inc. 1 20 years of Linux Virtual Memory Red Hat, Inc. Andrea Arcangeli <aarcange at redhat.com> Kernel Recipes, Paris 29 Sep 2017
  2. 2. Copyright © 2017 Red Hat Inc. 2 Topics ● Milestones in the evolution of the Virtual Memory subsystem ● Virtual Memory latest innovations – Automatic NUMA balancing – THP developments – KSMscale – userfaultfd ● Postcopy live migration, etc.
  3. 3. Copyright © 2017 Red Hat Inc. 3 Virtual Memory (simplified) Diagram: virtual pages cost “nothing” and are practically unlimited on 64bit archs; physical pages cost money (this is the RAM); the arrows between them are the pagetables doing the virtual-to-physical mapping
  4. 4. Copyright © 2017 Red Hat Inc. 4 PageTables Diagram: the pgd → pud → pmd → pte tree, with 2^9 = 512 arrows at each level, not just the 2 drawn ● Common code and x86 pagetable format is a tree ● All pagetables are 4KB in size ● Total: grep PageTables /proc/meminfo ● (((2**9)**4)*4096)>>48 = 1 → 48bits → 256 TiB ● 5 levels in v4.13 → (48+9) bits → 57bits → 128 PiB – Selected at build time to avoid slowdown
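  To make the bit arithmetic above concrete, here is a minimal user-space sketch (illustrative arithmetic only, not kernel code; the sample address is arbitrary) of how a 48-bit x86-64 virtual address splits into the four 9-bit pgd/pud/pmd/pte indices plus the 12-bit in-page offset:

      #include <stdio.h>
      #include <stdint.h>

      /* Illustrative only: split a 48-bit virtual address into the
       * 4-level x86-64 page-table indices (9 bits each) plus the 12-bit
       * page offset, mirroring the pgd -> pud -> pmd -> pte walk above. */
      int main(void)
      {
          uint64_t vaddr = 0x00007f1234567abcULL;   /* arbitrary example */

          unsigned int offset = vaddr & 0xfff;          /* bits 0-11  */
          unsigned int pte    = (vaddr >> 12) & 0x1ff;  /* bits 12-20 */
          unsigned int pmd    = (vaddr >> 21) & 0x1ff;  /* bits 21-29 */
          unsigned int pud    = (vaddr >> 30) & 0x1ff;  /* bits 30-38 */
          unsigned int pgd    = (vaddr >> 39) & 0x1ff;  /* bits 39-47 */

          printf("pgd=%u pud=%u pmd=%u pte=%u offset=0x%03x\n",
                 pgd, pud, pmd, pte, offset);
          /* 4 levels x 9 bits + 12 bits = 48 bits -> 256 TiB of virtual
           * space; a 5th level adds 9 more bits -> 57 bits -> 128 PiB. */
          return 0;
      }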
  5. 5. Copyright © 2017 Red Hat Inc. 5 The Fabric of the Virtual Memory ● The fabric is all the data structures that connect to the hardware-constrained structures like pagetables and that collectively create all the software abstractions we're accustomed to – tasks, processes, virtual memory areas, mmap (glibc malloc) ... ● The fabric is the most black-and-white part of the Virtual Memory ● The algorithms doing the computations on those data structures are the Virtual Memory heuristics – They need to solve hard problems with no guaranteed perfect solution – e.g. when is the right time to start unmapping pages (swappiness) ● Some of the design didn't change: we still measure how hard it is to free memory while we're trying to free it ● All free memory is used as cache and we overcommit by default (though not excessively) – Android uses: echo 1 >/proc/sys/vm/overcommit_memory
  6. 6. Copyright © 2017 Red Hat Inc. 6 Physical page and struct page Diagram: each PAGE_SIZE (4KiB) physical page is described by a 64 byte struct page in the mem_map array covering physical RAM; 64/4096 ≈ 1.56% of the RAM
  7. 7. Copyright © 2017 Red Hat Inc. 7 MM & VMA ● mm_struct aka MM – Memory of the process ● Shared by threads ● vm_area_struct aka VMA – Virtual Memory Area ● Created and torn down by mmap and munmap ● Defines the virtual address space of an “MM”
  8. 8. Copyright © 2017 Red Hat Inc. 11 Original bitmap... Active and Inactive list LRU ● The active page LRU preserves the active memory working set – only the inactive LRU loses information as fast as use-once I/O goes – Introduced in 2001, it works well enough even with an arbitrary balance – The optimum active/inactive balancing algorithm was solved in 2012-2014 ● shadow radix tree nodes that detect re-faults
  9. 9. Copyright © 2017 Red Hat Inc. 13 Active and Inactive list LRU $ grep -i active /proc/meminfo Active: 3555744 kB Inactive: 2511156 kB Active(anon): 2286400 kB Inactive(anon): 1472540 kB Active(file): 1269344 kB Inactive(file): 1038616 kB
  10. 10. Copyright © 2017 Red Hat Inc. 14 Active LRU working-set detection ● From Johannes Weiner Diagram: faults enter at the head of the inactive list; reclaim takes pages from the inactive tail; pages demoted from the tail of the active list re-enter the inactive list, and re-faulted pages are promoted to the active list (H = Head, T = Tail). A second diagram splits the memory available to cache into an inactive part (pages a–i) and an active part (pages J–N).
  11. 11. Copyright © 2017 Red Hat Inc. 15 lru→inactive_age and radix tree shadow entries Diagram: per-inode/file radix_tree roots and nodes pointing down to the cached pages
  12. 12. Copyright © 2017 Red Hat Inc. 16 Reclaim saving inactive_age Diagram: on LRU reclaim (per-memcg LRU) the page in the radix_tree is replaced by an exceptional/shadow entry and inactive_age++ is recorded
  13. 13. Copyright © 2017 Red Hat Inc. 20 Object-based reverse mapping Diagram: anonymous physical pages map back to their VMAs through the anon_vma; filesystem pages map back through the inode (objrmap) prio_tree of VMAs; KSM pages map back through rmap items pointing at the VMAs
  14. 14. Copyright © 2017 Red Hat Inc. 21 Active & inactive + rmap Diagram: mem_map pages (page[0]..page[N]) linked into the inactive_lru (use-once pages to trash) and the active_lru (use-many pages), with rmap connecting each page back to the process virtual memory that maps it
  15. 15. Copyright © 2017 Red Hat Inc. 22 Many more LRUs ● Separated LRU for anon and file backed mappings ● Memcg (memory cgroups) introduced per- memcg LRUs ● Removal of unfreeable pages from LRUs – anonymous memory with no swap – mlocked memory ● Transparent Hugepages in the LRU increase scalability further (lru size decreased 512 times)
  16. 16. Copyright © 2017 Red Hat Inc. 23 Recent Virtual Memory trends ● Optimizing the workloads for you, without manual tuning – NUMA hard bindings (numactl) → Automatic NUMA Balancing – Hugetlbfs → Transparent Hugepage – Programs or Virtual Machines duplicating memory → KSM – Page pinning (RDMA/KVM shadow MMU) -> MMU notifier – Private device memory managed by hand and pinned → HMM/UVM (unified virtual memory) for GPU seamlessly computing in GPU memory ● The optimizations can be optionally disabled
  18. 18. Copyright © 2017 Red Hat Inc. 31 Automatic NUMA Balancing benchmark Intel SandyBridge (Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz) 2 Sockets – 32 Cores with Hyperthreads 256G Memory RHEV 3.6 Host bare metal – 3.10.0-327.el7 (RHEL7.2) VM guest – 3.10.0-324.el7 (RHEL7.2) VM – 32P , 160G (Optimized for Server) Storage – Violin 6616 – 16G Fibre Channel Oracle – 12C , 128G SGA Test – Running Oracle OLTP workload with increasing user count and measuring Trans / min for each run as a metric for comparison
  19. 19. Copyright © 2017 Red Hat Inc. 32 Chart: OLTP workload, 4 VMs with different NUMA options; Trans/min vs user count (10U–100U) for Auto NUMA off, Auto NUMA on with no pinning, and Auto NUMA on with NUMA pinning
  20. 20. Copyright © 2017 Red Hat Inc. 33 Automatic NUMA balancing configuration ● https://tinyurl.com/zupp9v3 https://access.redhat.com/ ● In RHEL7 Automatic NUMA balancing is enabled when: # numactl --hardware shows multiple nodes ● To disable automatic NUMA balancing: # echo 0 > /proc/sys/kernel/numa_balancing ● To enable automatic NUMA balancing: # echo 1 > /proc/sys/kernel/numa_balancing ● At boot: numa_balancing=enable|disable
  22. 22. Copyright © 2017 Red Hat Inc. 35 Hugepages ● Traditionally x86 hardware gave us 4KiB pages ● The more memory the bigger the overhead in managing 4KiB pages ● What if you had bigger pages? – 512 times bigger → 2MiB
  23. 23. Copyright © 2017 Red Hat Inc. 36 PageTables Diagram (repeated): the pgd → pud → pmd → pte tree, with 2^9 = 512 arrows at each level, not just the 2 drawn
  24. 24. Copyright © 2017 Red Hat Inc. 37 Benefit of hugepages ● Improve CPU performance – Enlarge TLB reach (essential for KVM) – Speed up TLB miss (essential for KVM) ● Need 3 accesses to memory instead of 4 to refill the TLB – Faster to allocate memory initially (minor) – Page colouring inside the hugepage (minor) – Higher scalability of the page LRUs ● Cons – clear_page/copy_page less cache friendly ● Clear faulting subpage last, included in v4.13 from Andi Kleen and Ying Huang – 28.3% increase in vm-scalability anon-w-seq – Higher memory footprint sometimes – Direct compaction takes time
  25. 25. Copyright © 2017 Red Hat Inc. 38 TLB miss cost: number of accesses to memory Chart: bare metal (novirt) vs virtualization, comparing host/guest THP off/on combinations (scale 0–25 memory accesses per TLB miss)
  26. 26. Copyright © 2017 Red Hat Inc. 39 Transparent Hugepage design ● How do we get the benefits of hugetlbfs without having to configure anything? ● Any Linux process will receive 2M pages – If the mmap region is 2M naturally aligned – If compaction succeeds in producing hugepages ● Entirely transparent to userland
  27. 27. Copyright © 2017 Red Hat Inc. 40 THP sysfs enabled ● /sys/kernel/mm/transparent_hugepage/enabled – [always] madvise never ● Always use THP if vma start/end permits – always [madvise] never ● Use THP only inside MADV_HUGEPAGE – Applies to khugepaged too – always madvise [never] ● Never use THP – khugepaged quits ● Default selected at build time
  28. 28. Copyright © 2017 Red Hat Inc. 41 THP defrag - compaction control ● /sys/kernel/mm/transparent_hugepage/defrag – [always] defer defer+madvise madvise never ● Always use direct compaction (ideal for long lived allocations) – always [defer] defer+madvise madvise never ● Defer compaction asynchronously (kswapd/kcompactd) – always defer [defer+madvise] madvise never ● Direct compaction in MADV_HUGEPAGE, async otherwise – always defer defer+madvise [madvise] never ● Use direct compaction only inside MADV_HUGEPAGE ● khugepaged enabled in the background – always defer defer+madvise madvise [never] ● Never use compaction, quit khugepaged too ● Disabling THP is excessive if only direct compaction is too expensive ● KVM uses MADV_HUGEPAGE – MADV_HUGEPAGE will still use direct compaction
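  As a concrete illustration of the enabled/defrag knobs above, here is a hedged user-space sketch (standard mmap/madvise/posix_memalign APIs; the sizes are arbitrary): it allocates a 2MiB-aligned anonymous region and marks it MADV_HUGEPAGE, so THP is used for it even when the sysfs settings are "madvise"; AnonHugePages in /proc/<pid>/smaps shows how much of the range actually got 2MiB pages.

      #define _GNU_SOURCE
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>

      #define HPAGE_SIZE (2UL * 1024 * 1024)   /* 2 MiB on x86-64 */

      int main(void)
      {
          size_t len = 64 * HPAGE_SIZE;        /* 128 MiB, arbitrary */
          void *buf;

          /* 2 MiB-aligned allocation so the VMA can be backed by THP */
          if (posix_memalign(&buf, HPAGE_SIZE, len)) {
              perror("posix_memalign");
              return 1;
          }

          /* Ask for THP even when enabled/defrag are set to "madvise" */
          if (madvise(buf, len, MADV_HUGEPAGE))
              perror("madvise(MADV_HUGEPAGE)");

          memset(buf, 0, len);   /* fault the memory in; THP allocated here */

          /* AnonHugePages in /proc/self/smaps shows the 2 MiB coverage */
          getchar();
          free(buf);
          return 0;
      }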
  29. 29. Copyright © 2017 Red Hat Inc. 42 Priming the VM ● To achieve maximum THP utilization you can run the two commands below (the buddyinfo columns are free blocks of order 0 to 10, i.e. 4k, 8k, 16k, 32k, 64k, ... up to 2M and 4M)
     # cat /proc/buddyinfo
     Node 0, zone DMA 0 0 1 1 1 1 1 1 0 1 3
     Node 0, zone DMA32 8357 4904 2670 472 126 82 8 1 0 3 5
     Node 0, zone Normal 5364 38097 608 63 11 1 1 55 0 0 0
     # echo 3 >/proc/sys/vm/drop_caches
     # cat /proc/buddyinfo
     Node 0, zone DMA 0 0 1 1 1 1 1 1 0 1 3
     Node 0, zone DMA32 5989 5445 5235 5022 4420 3583 2424 1331 471 119 25
     Node 0, zone Normal 71579 58338 36733 19523 6431 1572 559 282 115 29 2
     # echo 1 >/proc/sys/vm/compact_memory
     # cat /proc/buddyinfo
     Node 0, zone DMA 0 0 1 1 1 1 1 1 0 1 3
     Node 0, zone DMA32 1218 1161 1227 1240 1135 925 688 502 342 305 369
     Node 0, zone Normal 7479 5147 5124 4240 2768 1959 1391 814 389 270 149
  30. 30. Copyright © 2017 Red Hat Inc. 43 THP on tmpfs ● From Kirill A. Shutemov and Hugh Dickins – Merged in the v4.8 kernel ● # mount -o remount,huge=always none /dev/shm – Including small files (i.e. < 2MB in size) ● Very high memory usage on small files ● # mount -o remount,huge=never none /dev/shm – Current default ● # mount -o remount,huge=within_size none /dev/shm – Allocate THP if inode i_size can contain hugepages, or if there’s a madvise/fadvise hint ● Ideal for apps benefiting from THP on tmpfs! ● # mount -o remount,huge=advise none /dev/shm – Only if requested through madvise/fadvise
  31. 31. Copyright © 2017 Red Hat Inc. 44 THP on shmem ● shmem uses an internal tmpfs mount for: SysV SHM, memfd, MAP_SHARED of /dev/zero or MAP_ANONYMOUS, DRM objects, ashmem ● Internal mount is controlled by: $ cat /sys/kernel/mm/transparent_hugepage/shmem_enabled always within_size advise [never] deny force ● deny – Disable THP for all tmpfs mounts ● For debugging ● force – Always use THP for all tmpfs mounts ● For testing
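  Because memfd memory lives on that kernel-internal tmpfs mount, a small sketch like the following (standard memfd_create/ftruncate/mmap calls; sizes arbitrary) is governed by shmem_enabled rather than by the anonymous-THP "enabled" knob; ShmemHugePages in /proc/meminfo grows if the shmem THP policy allows it.

      #define _GNU_SOURCE
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/mman.h>

      #define HPAGE_SIZE (2UL * 1024 * 1024)

      int main(void)
      {
          size_t len = 16 * HPAGE_SIZE;   /* arbitrary, hugepage multiple */

          /* The memfd is backed by the internal tmpfs mount, so THP use is
           * decided by /sys/kernel/mm/transparent_hugepage/shmem_enabled
           * (or a force/deny override). */
          int fd = memfd_create("thp-demo", MFD_CLOEXEC);
          if (fd < 0 || ftruncate(fd, len)) {
              perror("memfd_create/ftruncate");
              return 1;
          }

          char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (p == MAP_FAILED) {
              perror("mmap");
              return 1;
          }

          /* Needed when shmem_enabled is set to "advise" */
          madvise(p, len, MADV_HUGEPAGE);

          memset(p, 0x55, len);   /* fault in; ShmemHugePages in
                                     /proc/meminfo grows if shmem THP is on */
          munmap(p, len);
          close(fd);
          return 0;
      }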
  33. 33. Copyright © 2017 Red Hat Inc. 46 KSMscale ● Virtual memory to dedup is practically unlimited (even on 32bit if you deduplicate across multiple processes) ● The same page content can be deduplicated an unlimited number of times Diagram: a PageKSM with its stable_node, whose rmap_items point to the virtual addresses (mm, addr1) and (mm, addr2) in the userland processes sharing it
  34. 34. Copyright © 2017 Red Hat Inc. 47 KSMscale $ cat /sys/kernel/mm/ksm/max_page_sharing 256 ● KSMscale limits the maximum deduplication for each physical PageKSM allocated – Limiting “compression” to x256 is enough – Higher maximum deduplication generates diminishing returns ● To alter the max page sharing: $ echo 2 > /sys/kernel/mm/ksm/run $ echo 512 > /sys/kernel/mm/ksm/max_page_sharing ● Included in v4.13
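  KSM only scans ranges an application has opted in with madvise(MADV_MERGEABLE). The hedged sketch below (standard madvise API; sizes arbitrary) fills such a range with identical pages so ksmd can merge them once /sys/kernel/mm/ksm/run is 1; pages_shared and pages_sharing under /sys/kernel/mm/ksm/ then reflect the max_page_sharing cap discussed above.

      #define _GNU_SOURCE
      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>

      int main(void)
      {
          size_t len = 256UL * 1024 * 1024;   /* 256 MiB of identical pages */

          char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (p == MAP_FAILED) {
              perror("mmap");
              return 1;
          }

          /* Opt this range in for ksmd scanning; nothing gets merged
           * unless /sys/kernel/mm/ksm/run is set to 1 */
          if (madvise(p, len, MADV_MERGEABLE))
              perror("madvise(MADV_MERGEABLE)");

          memset(p, 0xaa, len);   /* every page has identical content */

          /* Watch pages_shared/pages_sharing under /sys/kernel/mm/ksm/ ;
           * with max_page_sharing=256 each KSM page backs at most 256 copies */
          getchar();
          munmap(p, len);
          return 0;
      }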
  36. 36. Copyright © 2017 Red Hat Inc. 49 Why: Memory Externalization ● Memory externalization is about running a program with part (or all) of its memory residing on a remote node ● Memory is transferred from the memory node to the compute node on access ● Memory can be transferred from the compute node to the memory node if it's not frequently used during memory pressure Diagram: a compute node with local memory and a memory node with remote memory; userfaults pull pages to the compute node on access, and memory pressure pushes them back to the memory node
  37. 37. Copyright © 2017 Red Hat Inc. 50 Postcopy live migration ● Postcopy live migration is a form of memory externalization ● When the QEMU compute node (destination) faults on a missing page that resides in the memory node (source) the kernel has no way to fetch the page – Solution: let QEMU in userland handle the pagefault ● Partially funded by the Orbit European Union project Diagram: the QEMU destination (compute node, local memory) raises userfaults to pull pages from the QEMU source (memory node, remote memory) during postcopy live migration
  38. 38. Copyright © 2017 Red Hat Inc. 51 userfaultfd latency Chart: number of userfaults vs latency in milliseconds during postcopy live migration over 10Gbit (qemu 2.5+, RHEL7.2+, stressapptest running in the guest). Userfaults triggered on pages that were already in network-flight are instantaneous. The background transfer seeks to the last userfault address.
  39. 39. Copyright © 2017 Red Hat Inc. 52 KVM precopy live migration Chart: database TPM before, during and after precopy (10Gbit NIC, 120GiB guest, ~120sec to transfer the RAM over the network). Precopy never completes until the database benchmark completes.
  40. 40. Copyright © 2017 Red Hat Inc. 53 KVM postcopy live migration Chart: database TPM over time before, during and after postcopy; precopy runs from 5m to 7m, postcopy runs from 7m to about ~9m and is deterministic; TPM recovers as khugepaged collapses THPs on the destination.
     # virsh migrate .. --postcopy --timeout <sec> --timeout-postcopy
     # virsh migrate .. --postcopy --postcopy-after-precopy
  41. 41. Copyright © 2017 Red Hat Inc. 54 All available upstream ● Userfaultfd() syscall in Linux Kernel >= v4.3 ● Postcopy live migration in: – QEMU >= v2.5.0 ● Author: David Gilbert @ Red Hat Inc. – Postcopy in Libvirt >= 1.3.4 – OpenStack Nova >= Newton ● In production since RHEL 7.3
  42. 42. Copyright © 2017 Red Hat Inc. 55 Live migration total time Chart: total migration time in seconds, autoconverge vs postcopy
  43. 43. Copyright © 2017 Red Hat Inc. 56 Live migration max perceived downtime latency Chart: max UDP latency in seconds for precopy timeout, autoconverge and postcopy
  44. 44. Copyright © 2017 Red Hat Inc. 57 userfaultfd v4.13 features ● The current v4.13 upstream kernel supports: – Missing faults (i.e. missing pages EVENT_PAGEFAULT) ● Anonymous (UFFDIO_COPY, UFFDIO_ZEROPAGE, UFFDIO_WAKE) ● Hugetlbfs (UFFDIO_COPY, UFFDIO_WAKE) ● Shmem (UFFDIO_COPY, UFFDIO_ZEROPAGE, UFFDIO_WAKE) – UFFD_FEATURE_THREAD_ID (to collect vcpu statistics) – UFFD_FEATURE_SIGBUS (raises SIGBUS instead of userfault) – Non cooperative Events ● EVENT_REMOVE (MADV_REMOVE/DONTNEED/FREE) ● EVENT_UNMAP (munmap notification to stop background transfer) ● EVENT_REMAP (non-cooperative process called mremap) ● EVENT_FORK (create new uffd over fork()) ● UFFDIO_COPY returns -ESRCH after exit()
  45. 45. Copyright © 2017 Red Hat Inc. 58 userfaultfd in-progress features ● v4.13 rebased aa.git for anonymous memory supports: – UFFDIO_WRITEPROTECT – UFFDIO_COPY_MODE_WP ● Same as UFFDIO_COPY but wrprotected – UFFDIO_REMAP ● monitor can remove memory atomically ● Floating patchset for synchronous EVENT_REMOVE to allow multithreaded non cooperative manager
  46. 46. Copyright © 2017 Red Hat Inc. 59 struct uffd_msg /* read() structure */
     struct uffd_msg {
         __u8  event;
         __u8  reserved1;
         __u16 reserved2;
         __u32 reserved3;
         union {
             struct {
                 __u64 flags;
                 __u64 address;
                 union {
                     __u32 ptid;
                 } feat;
             } pagefault;
             struct {
                 __u32 ufd;
             } fork;
             struct {
                 __u64 from;
                 __u64 to;
                 __u64 len;
             } remap;
             struct {
                 __u64 start;
                 __u64 end;
             } remove;
             struct {
                 /* unused reserved fields */
                 __u64 reserved1;
                 __u64 reserved2;
                 __u64 reserved3;
             } reserved;
         } arg;
     } __packed;
     Slide annotations: sizeof(struct uffd_msg) is identical on 32bit and 64bit (ABI enforcement); the reserved fields read as zeros and can be extended with UFFD_FEATURE flags; event (UFFD_EVENT_*) tells which part of the union is valid; the default cooperative support tracks pagefaults (UFFD_EVENT_PAGEFAULT); the non cooperative support tracks MM syscalls (UFFD_EVENT_FORK, UFFD_EVENT_REMAP, UFFD_EVENT_REMOVE|UNMAP)
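  To show how these messages are consumed, here is a hedged, minimal sketch of the cooperative flow (closely following the pattern of the userfaultfd(2) man page; error handling mostly omitted; build with -lpthread): register an anonymous region for missing faults, read uffd_msg events from the file descriptor, and resolve each UFFD_EVENT_PAGEFAULT with UFFDIO_COPY, the way a postcopy monitor resolves faults with pages fetched from the source.

      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <poll.h>
      #include <pthread.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/ioctl.h>
      #include <sys/mman.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      #include <linux/userfaultfd.h>

      static long page_size;

      /* Fault handler thread: read events and resolve missing pages with
       * UFFDIO_COPY, like a postcopy monitor would after fetching the page
       * from the migration source. */
      static void *handler(void *arg)
      {
          int uffd = (int)(long)arg;
          char *page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          memset(page, 'X', page_size);   /* pretend this came off the network */

          for (;;) {
              struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
              struct uffd_msg msg;

              poll(&pollfd, 1, -1);
              if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
                  continue;
              if (msg.event != UFFD_EVENT_PAGEFAULT)
                  continue;

              struct uffdio_copy copy = {
                  .dst = msg.arg.pagefault.address & ~((__u64)page_size - 1),
                  .src = (unsigned long)page,
                  .len = page_size,
                  .mode = 0,
              };
              ioctl(uffd, UFFDIO_COPY, &copy);  /* fill the page, wake the faulter */
          }
          return NULL;
      }

      int main(void)
      {
          page_size = sysconf(_SC_PAGESIZE);
          size_t len = 16 * page_size;

          int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
          struct uffdio_api api = { .api = UFFD_API, .features = 0 };
          ioctl(uffd, UFFDIO_API, &api);

          char *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          struct uffdio_register reg = {
              .range = { .start = (unsigned long)area, .len = len },
              .mode = UFFDIO_REGISTER_MODE_MISSING,
          };
          ioctl(uffd, UFFDIO_REGISTER, &reg);

          pthread_t thr;
          pthread_create(&thr, NULL, handler, (void *)(long)uffd);

          /* Each first touch raises UFFD_EVENT_PAGEFAULT instead of mapping
           * a zero page; the handler thread resolves it with UFFDIO_COPY. */
          for (size_t i = 0; i < len; i += page_size)
              printf("area[%zu] = %c\n", i, area[i]);

          return 0;
      }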
  47. 47. Copyright © 2017 Red Hat Inc. 60 userfaultfd potential use cases ● Efficient snapshotting (drop fork()) ● JIT/to-native compilers (drop write bits) ● Add robustness to hugetlbfs/tmpfs ● Host enforcement for virtualization memory ballooning ● Distributed shared memory projects: at Berkeley and University of Colorado ● Tracking of anon pages written by other threads: at llnl.gov ● Obsoletes soft-dirty ● Obsoletes volatile pages (SIGBUS)
  48. 48. Copyright © 2017 Red Hat Inc. 61 Git userfaultfd branch ● https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/log/?h=userfault
  50. 50. Copyright © 2017 Red Hat Inc. 63 Virtual Memory evolution ● Amazing to see how much room for further innovation there was back then – Things constantly look pretty mature ● They may actually have been, considering my hardware back then was much less powerful and no more complex than my cellphone ● Unthinkable to maintain the current level of mission critical complexity by reinventing the wheel in a non Open Source way – Can perhaps still be done for a limited set of laptop and cellphone models, but for how long? ● Innovation in the Virtual Memory space is probably one of the many factors that contributed to Linux success and to the KVM lead in the OpenStack user base – KVM (unlike the preceding hypervisor designs) leverages the power of the Linux Virtual Memory in its entirety
