
Linux NUMA & Databases: Perils and Opportunities

This talk covers Linux NUMA (non-uniform memory access): how it works, the issues around it, and how databases can suffer from or benefit from it.

Published in: Engineering


1. Linux NUMA & Databases: Perils and Opportunities
2. NUMA reference architecture
3. What is NUMA
● Stands for Non-Uniform Memory Access
  ○ Non-uniform to whom?
  ○ The Von Neumann bottleneck
  ○ Cache-coherent NUMA (ccNUMA)
● How does it work?
  ○ Memory is placed local to the processes that use it.
  ○ Access to data is balanced across the available processors on multiple nodes.
● Large-memory installations are becoming the norm
  ○ The i2 series on AWS
  ○ Databases are the main consumers
● Constraints
  ○ Speed of light
  ○ Interconnect saturation
4. What is NUMA
● Constraints
  ○ Speed of light
    ■ Higher latency when accessing remote memory
  ○ Interconnect saturation
    ■ Observable via performance counters
● Slow, abundant memory
  ○ vs. fast, limited memory
● Cache coherence
  ○ Processor threads and cores share resources
    ■ Execution units (between HT threads)
    ■ Caches (between threads and cores)
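The speed-of-light constraint shows up directly in the ACPI SLIT distance matrix that `numactl --hardware` prints. Below is a small sketch that computes the worst-case remote-to-local distance ratio from such a matrix; the two-node sample values are hypothetical (10 is the conventional local baseline in SLIT tables), not taken from the talk.

```python
# SLIT-style node distance matrix, as reported by `numactl --hardware`.
# Distances are unitless; local access is normalized to 10.
# Sample values below are hypothetical.
distances = [
    [10, 21],  # node 0 -> node 0 (local), node 0 -> node 1 (remote)
    [21, 10],  # node 1 -> node 0 (remote), node 1 -> node 1 (local)
]

def remote_penalty(dist):
    """Worst-case remote/local distance ratio across all node pairs."""
    n = len(dist)
    local = min(dist[i][i] for i in range(n))
    remote = max(dist[i][j] for i in range(n) for j in range(n) if i != j)
    return remote / local

print(remote_penalty(distances))  # 21/10 -> remote access is ~2.1x the local distance
```

With these sample distances, remote memory is roughly twice as "far" as local memory, which is the latency asymmetry the slide refers to.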
5. Exotic cases
● Network cards
● PCIe storage
● NVRAM
● Nodes without memory
● Nodes without processors
● Unbalanced nodes
● Central/large memory
● big.LITTLE architecture
● GPUs
6. NUMA complications
● Unmovable memory
● KSM (kernel same-page merging)
● THP (transparent huge pages)
● Interrupt balancing and locality
7. Tools/libraries for NUMA
● Supported by Linux since 2.5
  ○ Symmetric and CPU/memory
● numactl
● hwloc / lstopo
● numad
● numatop
● libnuma
● numastat
● taskset
● KVM for simulation and testing
● perf
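numastat, listed above, reports per-node allocation counters such as numa_hit and numa_miss. The sketch below computes the off-node miss ratio from numastat-style output; the sample text and its numbers are made up for illustration.

```python
# Parse numastat-style output (sample text, not from a live system) and
# compute the fraction of allocations that missed their preferred node.
sample = """\
                           node0           node1
numa_hit                 1000000          900000
numa_miss                  50000           25000
numa_foreign               25000           50000
interleave_hit             14000           13000
local_node                990000          880000
other_node                 60000           45000
"""

def miss_ratio(text):
    """numa_miss / (numa_hit + numa_miss), summed over all nodes."""
    stats = {}
    for line in text.splitlines()[1:]:          # skip the node header row
        parts = line.split()
        stats[parts[0]] = [int(v) for v in parts[1:]]
    hits = sum(stats["numa_hit"])
    misses = sum(stats["numa_miss"])
    return misses / (hits + misses)

print(f"{miss_ratio(sample):.3f}")  # 0.038
```

On a real host the same function can be fed the output of `numastat` directly; a rising miss ratio is one sign of the interconnect-saturation problem discussed earlier.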
8. Tools/libraries for NUMA
● KVM for simulation and testing
● Useful for testing databases:

  qemu-system-x86_64 -enable-kvm \
    -drive file=./debian-8.1-lxc-puppet.qcow2 \
    -net nic,macaddr=52:54:00:00:EE:03 -net vde \
    -smp sockets=2,cores=2,threads=2,maxcpus=16 \
    -numa node,nodeid=0,cpus=0-3 \
    -numa node,nodeid=1,cpus=4-7 \
    -numa node,nodeid=2,cpus=8-15 \
    -m 2G
9. Tunings and observables
● /proc/zoneinfo
  ○ sysctl vm.zone_reclaim_mode (/proc/sys/vm/zone_reclaim_mode)
  ○ /proc/sys/vm/min_unmapped_ratio
● /proc/meminfo
● /proc/vmstat
● ftrace
● cgroup hierarchy
  ○ memory controller
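Each zone in /proc/zoneinfo carries min/low/high watermarks that drive reclaim: once free pages fall below "low", kswapd starts reclaiming. A sketch that parses a zoneinfo-style excerpt and flags zones under pressure; the excerpt and its numbers are a hypothetical sample, not captured from a real host.

```python
import re

# Extract per-zone free-page counts and reclaim watermarks from
# /proc/zoneinfo-style text. On a real host, feed it
# open("/proc/zoneinfo").read() instead of this sample.
sample = """\
Node 0, zone   Normal
  pages free     201000
        min      11365
        low      14206
        high     17047
Node 1, zone   Normal
  pages free     13000
        min      11365
        low      14206
        high     17047
"""

def watermarks(text):
    zones, current = {}, None
    for line in text.splitlines():
        m = re.match(r"Node (\d+), zone\s+(\w+)", line)
        if m:
            current = (int(m.group(1)), m.group(2))
            zones[current] = {}
            continue
        m = re.match(r"\s+pages free\s+(\d+)", line)
        if m:
            zones[current]["free"] = int(m.group(1))
            continue
        m = re.match(r"\s+(min|low|high)\s+(\d+)", line)
        if m:
            zones[current][m.group(1)] = int(m.group(2))
    return zones

z = watermarks(sample)
# A zone whose free pages dropped below "low" will see reclaim activity.
print([(node, w["free"] < w["low"]) for (node, _), w in z.items()])  # [(0, False), (1, True)]
```

In this sample, node 1's Normal zone is below its low watermark while node 0 has plenty of free pages: exactly the kind of per-node imbalance the later reclaim slides discuss.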
10. Tunings and observables
● ACPI
  ○ SLIT and SRAT tables
● Per process:
  ○ /proc/<pid>/numa_maps
  ○ /proc/<pid>/sched
● Automatic NUMA balancing
  ○ CONFIG_NUMA_BALANCING in /proc/config.gz
● get_mempolicy(2), mbind(2), migrate_pages(2), move_pages(2), set_mempolicy(2), sched_getaffinity(2)
● libnuma(3)
  ○ Higher abstraction, e.g. numa_set_localalloc()
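/proc/&lt;pid&gt;/numa_maps lists, for every mapping, its memory policy and an N&lt;node&gt;=&lt;pages&gt; breakdown of where its pages actually live. A small parser for one such entry; the line below is a hypothetical example, not captured from a real process.

```python
# Summarize one /proc/<pid>/numa_maps entry: the policy, and how many
# pages of the mapping sit on each node (the N<id>=<pages> fields).
# Hypothetical sample line:
line = "7f2a00000000 interleave:0-1 anon=2048 dirty=2048 N0=1024 N1=1024 kernelpagesize_kB=4"

def parse_numa_maps_line(line):
    fields = line.split()
    addr, policy = fields[0], fields[1]
    # Keep only N<digit>=<count> fields, e.g. "N0=1024" -> {"0": 1024}.
    pages = {f[1:f.index("=")]: int(f[f.index("=") + 1:])
             for f in fields[2:] if f.startswith("N") and f[1].isdigit()}
    return addr, policy, pages

addr, policy, pages = parse_numa_maps_line(line)
print(policy, pages)  # interleave:0-1 {'0': 1024, '1': 1024}
```

Here the mapping is interleaved over nodes 0-1 and its pages are split evenly, which is what a well-behaved MPOL_INTERLEAVE heap should look like.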
11. NUMA statistics
12. NUMA statistics
13. AutoNUMA
● CPU follows memory
  ○ Reschedule tasks on the same nodes as their memory
● Memory follows CPU
  ○ Migrate memory pages to the same nodes as their tasks/threads
● Heuristics
  ○ Fault statistics
  ○ Task grouping
  ○ Multi-resource optimization: cache, CPU, memory, starvation
    ■ Avoid thrashing
● Only CPU and memory?
  ○ For other resources, use manual pinning!
14. NUMA policies
● MPOL_DEFAULT
● MPOL_BIND
● MPOL_INTERLEAVE
  ○ Like memory striping in hardware
● MPOL_PREFERRED
● MPOL_MF_MOVE | MPOL_MF_MOVE_ALL
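These MPOL_* policies are applied per thread with set_mempolicy(2). Below is a hedged ctypes sketch of the raw syscall; the syscall number 238 assumes x86_64, the MPOL_* values mirror &lt;numaif.h&gt;, and real code should normally go through libnuma (e.g. numa_set_membind()) instead.

```python
import ctypes
import errno
import platform

# Illustrative sketch only: raw set_mempolicy(2) via ctypes, no libnuma.
SYS_set_mempolicy = 238                      # x86_64 syscall number only
MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, MPOL_INTERLEAVE = 0, 1, 2, 3

def set_mempolicy(mode, nodemask=None):
    """Set the calling thread's NUMA policy; False if unsupported here."""
    if platform.system() != "Linux" or platform.machine() != "x86_64":
        return False                          # syscall number would be wrong
    libc = ctypes.CDLL(None, use_errno=True)
    if nodemask is None:                      # e.g. MPOL_DEFAULT resets the policy
        ret = libc.syscall(SYS_set_mempolicy, mode, None, 0)
    else:
        mask = ctypes.c_ulong(nodemask)       # bit i set => node i allowed
        ret = libc.syscall(SYS_set_mempolicy, mode,
                           ctypes.byref(mask), ctypes.sizeof(mask) * 8)
    if ret != 0 and ctypes.get_errno() == errno.ENOSYS:
        return False                          # kernel built without CONFIG_NUMA
    return ret == 0

# Reset to the default (local-allocation) policy.
print(set_mempolicy(MPOL_DEFAULT))
```

A database thread pool could call this (or libnuma's wrappers) once per worker, e.g. `set_mempolicy(MPOL_INTERLEAVE, nodemask=0b11)` to interleave a global heap over nodes 0 and 1.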
15. Databases
● Most databases support multiple cores and NUMA.
  ○ MAP_ANONYMOUS and O_DIRECT are common
● Most default to interleaving to avoid zone-imbalance issues
  ○ Effects of imbalance
    ■ Swapping due to reclaim
    ■ OOM
  ○ Downsides to interleaving
  ○ MySQL, Cassandra, et al.
● Pattern of accesses
  ○ Cause of imbalance
● Duality of application vs. OS responsibilities
16. Reclaim
● Swappiness
  ○ Anonymous vs. file-backed pages
● Zone reclaim
  ○ A single process can span multiple zones
  ○ Imbalance arises without any strategy
  ○ Watermarks
  ○ Databases suffer the most
    ■ They carry a lot of state!
  ○ Types of reclaim
● Imbalance
  ○ Why does this happen?
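These reclaim behaviours are steered by a handful of sysctls; a hypothetical fragment is sketched below. The values are illustrative only, not tuning recommendations, and the right settings depend entirely on the workload.

```
# /etc/sysctl.d/90-numa-db.conf -- illustrative values, not recommendations
vm.zone_reclaim_mode = 0     # allocate off-node rather than reclaiming the local zone first
vm.swappiness = 1            # strongly prefer reclaiming file-backed pages over swapping anon
vm.min_unmapped_ratio = 1    # zone reclaim only when more than this % of a zone is unmapped file pages
```

Disabling zone reclaim (mode 0) is the common choice for databases: paying remote-access latency is usually cheaper than stalling on local reclaim.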
17. Access-pattern optimizations
● Thread pool
  ○ Reuse of threads with longer lifetimes
  ○ Explicit or implicit binding
    ■ numa_set_localalloc() / numa_set_preferred()
    ■ sched_setaffinity()
    ■ CONFIG_NO_HZ and latency
● Global heaps: buffer pool, JVM
  ○ Allocation by proxy
  ○ mbind() with MPOL_BIND
  ○ MAP_POPULATE (why? the first-touch policy)
  ○ numa_set_preferred()
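The explicit-bind step above can be sketched with the stdlib's os.sched_setaffinity(), reading a node's CPU list from sysfs. The sysfs path is the standard Linux layout; the fallback branch (returning the current affinity when no NUMA topology is exposed) is an assumption added here so the sketch degrades gracefully.

```python
import os

# Pin the current process to the CPUs of one NUMA node, mirroring the
# explicit-bind pattern from the slide. Node->CPU mapping comes from sysfs.
def pin_to_node(node):
    path = f"/sys/devices/system/node/node{node}/cpulist"
    try:
        with open(path) as f:
            cpulist = f.read().strip()        # e.g. "0-3,8-11"
    except FileNotFoundError:
        return os.sched_getaffinity(0)        # no NUMA topology exposed; leave as-is
    cpus = set()
    for part in cpulist.split(","):           # expand ranges like "0-3"
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    os.sched_setaffinity(0, cpus)             # the explicit bind
    return cpus

print(sorted(pin_to_node(0)))
```

A thread-pool worker would typically pair this with numa_set_localalloc() (via libnuma) so that both its CPU time and its allocations stay on the chosen node.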
18. Access patterns (contd.)
● Split pools
  ○ Independent pools of memory in a database, e.g. multiple buffer pool instances
● Multiple instances
  ○ Mostly for simple databases
    ■ Redis
  ○ Containers
● Hybrid
  ○ Linux kernel: boot and init
  ○ MySQL / InnoDB
    ■ MPOL_LOCAL for threads
    ■ MPOL_INTERLEAVE for global heaps
● Task grouping
19. Credits!
● http://queue.acm.org/detail.cfm?id=2513149
● http://www.linux-kvm.org/images/7/75/01x07b-NumaAutobalancing.pdf
● http://events.linuxfoundation.org/sites/events/files/slides/Normal%20and%20Exotic%20use%20cases%20for%20NUMA%20features.pdf
● https://en.wikipedia.org/wiki/Non-uniform_memory_access
● https://lihz1990.gitbooks.io/transoflptg/content/02.%E7%9B%91%E6%8E%A7%E5%92%8C%E5%8E%8B%E6%B5%8B%E5%B7%A5%E5%85%B7/sample-output-of-the-numastat-command.png
