
NUMA and Java Databases


Talk on the state of NUMA support in Java databases such as Cassandra, how it can be improved or ameliorated, and how it compares with traditional storage engines.

Published in: Engineering

  1. NUMA & Java Databases
     Should we worry?
     Raghavendra Prabhu
     me@rdprabhu.com
     @randomsurfer
  2. NUMA Reference architecture
  3. What is NUMA
     ● Stands for Non-Uniform Memory Access
       ○ Non-uniform to whom?
       ○ Von Neumann bottleneck.
       ○ Cache-coherent NUMA
     ● How does it work
       ○ Memory is placed local to the processes.
       ○ Balancing access to data over the available processors on multiple nodes.
     ● Large memory installations are becoming the norm
       ○ The i2 series on AWS.
       ○ Databases are the main consumers.
     ● Constraints
       ○ Speed of light
       ○ Interconnect saturation
  4. What is NUMA
     ● Constraints
       ○ Speed of light
         ■ Higher latency of accessing remote memory.
       ○ Interconnect saturation
         ■ Performance counters.
     ● Slow abundant memory
       ○ Fast limited memory
     ● Cache coherence
       ○ Processor threads and cores share resources
         ■ Execution units (between HT threads)
         ■ Cache (between threads and cores)
  5. Exotic cases
     ● Network cards
     ● PCIe storage
     ● NVRAM
     ● Nodes without memory
     ● Nodes without processors
     ● Unbalanced
     ● Central/Large memory
     ● big.LITTLE architecture
     ● GPU
  6. NUMA statistics
  7. Tools/libraries for NUMA
     ● Supported by Linux since 2.5
       ○ Symmetric and CPU/Memory
     ● numactl
     ● hwloc / lstopo
     ● numad
     ● numatop
     ● libnuma
     ● numastat
     ● taskset
     ● KVM for simulation and testing
     ● perf
  8. Tools/libraries for NUMA
     ● KVM for simulation and testing
     ● Useful for testing databases.
     qemu-system-x86_64 -enable-kvm \
       -drive file=./debian-8.1-lxc-puppet.qcow2 \
       -net nic,macaddr=52:54:00:00:EE:03 -net vde \
       -smp sockets=2,cores=2,threads=2,maxcpus=16 \
       -numa node,nodeid=0,cpus=0-3 \
       -numa node,nodeid=1,cpus=4-7 \
       -numa node,nodeid=2,cpus=8-15 \
       -m 2G
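To check that a guest really sees the node layout requested on the qemu command line, one option is to read each node's CPU list from sysfs. The sketch below assumes the usual Linux layout, /sys/devices/system/node/node<N>/cpulist, where each file holds a string such as "0-3" or "0-3,8-11". CpuList is an illustrative name, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

final class CpuList {
    /** Expands a Linux cpulist string such as "0-3,8-11" into individual CPU ids. */
    static List<Integer> parse(String cpulist) {
        List<Integer> cpus = new ArrayList<>();
        String trimmed = cpulist.trim();
        if (trimmed.isEmpty()) {
            return cpus; // a node without processors has an empty cpulist
        }
        for (String part : trimmed.split(",")) {
            // "4" is a single CPU, "0-3" is an inclusive range.
            String[] bounds = part.split("-");
            int first = Integer.parseInt(bounds[0].trim());
            int last = Integer.parseInt(bounds[bounds.length - 1].trim());
            for (int cpu = first; cpu <= last; cpu++) {
                cpus.add(cpu);
            }
        }
        return cpus;
    }
}
```

Inside the guest, feeding the contents of node2's cpulist file to CpuList.parse should give CPUs 8 to 15 if the topology was applied as requested.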
  9. NUMA Policies
     ● MPOL_DEFAULT
     ● MPOL_BIND
     ● MPOL_INTERLEAVE
       ○ Memory striping in hardware
     ● MPOL_PREFERRED
     ● MPOL_MF_MOVE | MPOL_MF_MOVE_ALL (mbind() flags for moving existing pages)
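The difference between the policies is easiest to see as "which node backs page i". The following is a toy model only, not the kernel's algorithm: it ignores node distances, treats "has free memory" as a boolean per node, and falls back to the lowest-numbered free node. PolicyModel and all its names are made up for illustration.

```java
final class PolicyModel {
    enum Policy { DEFAULT, BIND, INTERLEAVE, PREFERRED }

    /**
     * Returns the node that would back page number pageIndex in this toy model,
     * or -1 if the allocation fails. localNode is the node of the CPU touching
     * the page, nodes is the policy's node set, hasFree[n] tells whether node n
     * still has free memory.
     */
    static int place(Policy policy, int pageIndex, int localNode, int[] nodes, boolean[] hasFree) {
        switch (policy) {
            case INTERLEAVE:
                // Round-robin over the node set by page index (free memory is ignored here).
                return nodes[pageIndex % nodes.length];
            case BIND:
                // Only the listed nodes are allowed; fail rather than spill elsewhere.
                return firstFree(nodes, hasFree);
            case PREFERRED:
                // Try the preferred node first, then any other node.
                return hasFree[nodes[0]] ? nodes[0] : firstFree(allNodes(hasFree.length), hasFree);
            default:
                // Local allocation: the node of the touching CPU, then any other node.
                return hasFree[localNode] ? localNode : firstFree(allNodes(hasFree.length), hasFree);
        }
    }

    private static int firstFree(int[] candidates, boolean[] hasFree) {
        for (int node : candidates) {
            if (hasFree[node]) {
                return node;
            }
        }
        return -1;
    }

    private static int[] allNodes(int count) {
        int[] all = new int[count];
        for (int i = 0; i < count; i++) {
            all[i] = i;
        }
        return all;
    }
}
```

In this model, interleaving spreads consecutive pages evenly over the node set regardless of who touches them, which is why it behaves like striping.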
  10. JVM GC spaces
     ● Concepts
       ○ Weak Generational Hypothesis:
         ■ Most objects soon become unreachable.
         ■ References from old objects to young objects only exist in small numbers.
         ■ Objects that do survive usually live for a (very) long time.
       ○ Garbage Collection Roots
       ○ Mark and
         ■ Copy
         ■ Compact
         ■ Sweep
       ○ Minor and Major GC
       ○ Stop-the-World
  11. GC graphs
  12. JVM GC spaces
     ● Generations:
       ○ Young Generation
         ■ Eden space
           ● Mutable Space.
           ● Thread Local Allocation Buffer.
           ● Mark and Copy.
         ■ Survivor spaces (S0 and S1).
       ○ Old/Tenured Generation
       ○ Permanent Generation
         ■ Replaced by the native Metaspace in Java 8
     ● Cross-generation links.
     ● Card-marking
  13. Garbage collectors
     Located in hotspot/src/share/vm/gc_implementation
     ● Serial
     ● Parallel
       ○ Only GC which is fully NUMA aware.
     ● ParNew
     ● Concurrent Mark and Sweep (CMS)
     ● Garbage First (G1)
     ● Official Oracle documentation is notoriously bad!
       ○ Code and comments are the (only) documentation (sadly).
         ■ Try searching for ‘NUMAPageScanRate’ - you find a page from 2008 with links to sun.com and Solaris examples.
  14. GC Options
  15. NUMA options
     ● UseNUMA
     ● UseNUMAInterleaving
     ● ForceNUMA
     ● NUMAStats
     ● ParallelGC only
       ○ NUMAChunkResizeWeight
       ○ NUMASpaceResizeRate
       ○ UseAdaptiveNUMAChunkSizing
       ○ NUMAPageScanRate
     Defined in hotspot/src/share/vm/runtime/globals.hpp and used in hotspot/src/os/linux/vm
  16. NUMA and Collectors
     ● -XX:+UseNUMA -XX:+UseNUMAInterleaving: All GC spaces.
       ○ Independent of GC choices.
       ○ NUMA interleaved allocation. (numactl --interleave)
     ● ParallelGC (in addition to above)
       ○ Supports all exotic NUMA options.
       ○ Eden MutableSpace (even without NUMA)
         ■ Pretouching the pages.
       ○ Eden MutableNUMASpace (with above NUMA options)
         ■ Space split into LG (locality group) chunks.
           ● Adaptive Resizing.
         ■ Does thread-local NUMA allocation.
           ● Allocations are performed in the chunk corresponding to the thread's home locality.
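A quick way to see whether a running JVM was started with these switches is to look at its input arguments. The sketch below uses the standard RuntimeMXBean.getInputArguments() call; NumaFlags is an illustrative name, and the helper assumes that when a flag is given twice the last occurrence wins. It only sees flags passed explicitly, so it says nothing about defaults the JVM picks on its own.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class NumaFlags {
    /** Returns true if the boolean -XX flag is switched on in the given JVM arguments. */
    static boolean isEnabled(List<String> jvmArgs, String flag) {
        boolean enabled = false;
        for (String arg : jvmArgs) {
            if (arg.equals("-XX:+" + flag)) {
                enabled = true;
            } else if (arg.equals("-XX:-" + flag)) {
                enabled = false; // assumption: a later switch overrides an earlier one
            }
        }
        return enabled;
    }

    /** Arguments the current JVM was launched with (excluding the main class and its arguments). */
    static List<String> currentJvmArgs() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }
}
```

For example, NumaFlags.isEnabled(NumaFlags.currentJvmArgs(), "UseNUMA") can be logged at startup next to the chosen collector.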
  17. Cassandra
     ● JVM options are supported through an environment variable.
     ● Cassandra’s ‘supported’ NUMA is through numactl in the shell wrapper.
       ○ This interleaves ‘everything’.
       ○ When you have numactl (hammer), everything looks like a (binary?) nail.
     ● Cassandra memory model
       ○ JVM GC spaces.
       ○ OHC - off heap cache: https://github.com/snazy/ohc
         ■ Written specifically for Cassandra 2.x
       ○ MemoryUtil.java
         ■ com.sun.jna.Native - Native.malloc
         ■ sun.nio.ch.DirectBuffer
         ■ sun.misc.Unsafe - unsafe.allocateMemory
         ■ java.nio.ByteBuffer - ByteBuffer.allocateDirect
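Of the allocation routes listed above, ByteBuffer.allocateDirect is the only one that needs no internal or third-party API, so it makes the smallest possible illustration of memory that lives outside the GC spaces. This is a minimal sketch, not Cassandra's code; OffHeapDemo is an illustrative name.

```java
import java.nio.ByteBuffer;

final class OffHeapDemo {
    /** Allocates size bytes outside the Java heap and stores value at offset 0. */
    static ByteBuffer allocateAndWrite(int size, long value) {
        // The backing memory is native, so the JVM's NUMA-aware heap layout does not apply to it.
        ByteBuffer buffer = ByteBuffer.allocateDirect(size);
        buffer.putLong(0, value);
        return buffer;
    }
}
```

Because the JVM's GC-space NUMA options do not cover such buffers, where their pages land is left to the OS memory policy of the process, which is what numactl changes.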
  18. Cassandra off-heap
     ● Why off-heap
       ○ Reduce GC pressure
       ○ Access patterns
       ○ Lack of support for primitives such as O_DIRECT. (https://bugs.openjdk.java.net/browse/JDK-8164900)
       ○ Lack of NUMA support in newer GCs.
         ■ (JEP 157: G1 GC: NUMA-Aware Allocation http://openjdk.java.net/jeps/157)
     ● Off-heap caches are used for:
       ○ Row cache
       ○ Key cache
       ○ Counter cache
     ● 2.x onwards, actually better with 2.2.
  19. Cassandra off-heap
     ● Cache Providers:
       ○ SerializingCache
         ■ Issues with serialization and CPU usage.
       ○ OHCP - org.caffinitas.ohc.OHCacheBuilder - 2.2 onwards
         ■ “OHC shall provide a good performance on both commodity hardware and big systems using non-uniform-memory-architectures.”
         ■ sun.misc.Unsafe: unsafe.allocateMemory
         ■ Linked: For larger entries
           ● Malloc and fragmentation
         ■ Chunked: For smaller entries
  20. NUMA issues
     ● numactl --interleave:
       ○ Thread-local native allocations - Bad [X]
         ■ Tons of them throughout the code which bypass the JVM.
       ○ JVM’s Eden space will also be interleaved - Bad [X]
     ● JVM’s options only:
       ○ Native allocations will be local.
       ○ Large off-heap allocations can suffer.
     ● numactl + JVM
         ■ NUMA-aware GC (Parallel)
           ● Best possible combination (without invasive code changes in Cassandra).
           ● JVM’s memory options will override numactl.
           ● But, ParallelGC is not comparable to the new ones (G1).
  21. Interpretation
     ● Low off-heap usage
       ○ Use the JVM NUMA options. Don’t interleave with numactl, it is a hammer.
     ● High off-heap usage (like Cassandra)
       ○ Just go with the flow, and do numactl.
         ■ -XX:+AlwaysPreTouch? (MAP_POPULATE)
       ○ Cost-benefit analysis.
     ● ParallelGC is too old (and bad for latency) - don’t use it just for NUMA.
       ○ Well-implemented NUMA can easily pique anyone’s geeky senses. :)
       ○ Ask Cassandra or Oracle to add NUMA support to G1 ;)
     ● In newer kernels (e.g. Ubuntu Xenial), one can try AutoNUMA.
       ○ Completely managed by the kernel based on access patterns.
       ○ Has caveats but one can always benchmark and see. :)
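The first two bullets amount to a small decision rule, sketched here as a launch-command builder. This is only an illustration of the slide's advice: LaunchCommand, the heavyOffHeap switch, and the main class name passed in are hypothetical, and whether off-heap usage counts as "high" still needs the cost-benefit analysis mentioned above. The flags themselves (numactl --interleave=all, -XX:+UseNUMA) are real.

```java
import java.util.ArrayList;
import java.util.List;

final class LaunchCommand {
    /** Builds a java command line following the "low vs. high off-heap usage" advice. */
    static List<String> build(boolean heavyOffHeap, String mainClass) {
        List<String> command = new ArrayList<>();
        if (heavyOffHeap) {
            // High off-heap usage: interleave the whole process with numactl.
            command.add("numactl");
            command.add("--interleave=all");
            command.add("java");
        } else {
            // Low off-heap usage: leave native allocations local, let the JVM place its heap.
            command.add("java");
            command.add("-XX:+UseNUMA");
        }
        command.add(mainClass);
        return command;
    }
}
```

The resulting list can be handed to java.lang.ProcessBuilder, which makes it easy to benchmark both variants from the same harness.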
  22. Interpretation
     ● JVM is (still) not good with native primitives such as O_DIRECT or NUMA (there is a jnuma which is not that well maintained).
       ○ Many database authors write their own off-JVM implementations for these. (There are so many Java databases these days.)
       ○ Some also do things like this.
       ○ MySQL (InnoDB) can (and does) take advantage of these for good performance.
         ■ InnoDB was in Cassandra’s place about two years ago, till fixes landed.
           ● How InnoDB does it.
       ○ Maybe ScyllaDB in future. ;)
  23. Wishlist for Cassandra
     ● Use whatever GC fits best. (G1?)
         ■ Ask for NUMA support in this.
     ● Use the JVM NUMA options when supported.
         ■ Having NUMA support for Eden spaces will help a lot.
     ● Don’t use numactl.
       ○ Let all native allocations be local (OS default).
       ○ Use jnuma (or equivalent, it is just a JNI wrapper) for OHCP and other large non-local caches.
         ■ Use NUMA interleaving here.
         ■ This requires Cassandra or OHCP code to be changed.
           ● Changing OHCP code is easier.
     ● Benchmark
       ○ ??
       ○ Profit!
  24. AutoNUMA
     ● Introduced in the 3.x kernel series (automatic NUMA balancing was merged in Linux 3.8)
     ● CPU follows memory
       ○ Reschedule tasks on the same nodes as their memory
     ● Memory follows CPU
       ○ Copy memory pages to the same nodes as the tasks/threads
     ● Heuristics
       ○ Fault statistics
       ○ Task grouping
       ○ Multi-resource optimization - cache, cpu, memory, starvation
         ■ Avoid thrashing
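The "fault statistics" heuristic can be pictured as: count, per node, how often a task touched memory there, and prefer the node with the most hits. The sketch below is a deliberately tiny illustration of that idea, not the kernel's implementation, which also weighs task groups, load and thrashing. PreferredNode is an illustrative name.

```java
final class PreferredNode {
    /**
     * Returns the index of the node with the most recorded faults
     * (lowest index wins ties), or -1 if no faults were recorded.
     */
    static int pick(long[] faultsPerNode) {
        int best = -1;
        long bestFaults = 0;
        for (int node = 0; node < faultsPerNode.length; node++) {
            if (faultsPerNode[node] > bestFaults) {
                best = node;
                bestFaults = faultsPerNode[node];
            }
        }
        return best;
    }
}
```

Once a preferred node is known, "CPU follows memory" means scheduling the task there, while "memory follows CPU" means migrating the pages that were touched from elsewhere.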
  25. Tunings and observables
     ● /proc/zoneinfo
       ○ sysctl vm.zone_reclaim_mode OR /proc/sys/vm/zone_reclaim_mode
       ○ /proc/sys/vm/min_unmapped_ratio
     ● /proc/meminfo
     ● /proc/vmstat
     ● ftrace / perf
     ● Cgroup hierarchy
       ○ Memory
     ● Per process:
       ○ /proc/<pid>/numa_maps
       ○ /proc/<pid>/sched
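Of these, /proc/<pid>/numa_maps is the most direct way to see where a JVM's memory actually ended up: each mapping line carries N<node>=<pages> counters. A minimal sketch of summing them per node follows; NumaMaps is an illustrative name, and the counts are in pages of whatever page size the mapping uses, so treat the totals as approximate when huge pages are involved.

```java
import java.util.Map;
import java.util.TreeMap;

final class NumaMaps {
    /** Adds the N<node>=<pages> counters found on one numa_maps line to totals. */
    static void accumulate(String line, Map<Integer, Long> totals) {
        for (String token : line.trim().split("\\s+")) {
            if (token.matches("N\\d+=\\d+")) {
                int eq = token.indexOf('=');
                int node = Integer.parseInt(token.substring(1, eq));
                long pages = Long.parseLong(token.substring(eq + 1));
                totals.merge(node, pages, Long::sum);
            }
        }
    }

    /** Sums pages per node over all lines of a numa_maps file. */
    static Map<Integer, Long> pagesPerNode(Iterable<String> lines) {
        Map<Integer, Long> totals = new TreeMap<>();
        for (String line : lines) {
            accumulate(line, totals);
        }
        return totals;
    }
}
```

On Linux, NumaMaps.pagesPerNode(java.nio.file.Files.readAllLines(java.nio.file.Paths.get("/proc/self/numa_maps"))) gives the distribution for the current JVM; comparing it with and without numactl --interleave shows the effect of the policy.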
  26. NUMA statistics
  27. Further
     ● http://frankdenneman.nl/2016/07/07/numa-deep-dive-part-1-uma-numa/
     ● http://queue.acm.org/detail.cfm?id=2852078
     ● https://plumbr.eu/java-garbage-collection-handbook
     ● http://mechanical-sympathy.blogspot.in/2013/07/java-garbage-collection-distilled.html
  28. Credits!
     ● http://queue.acm.org/detail.cfm?id=2513149
     ● www.linux-kvm.org/images/7/75/01x07b-NumaAutobalancing.pdf
     ● http://events.linuxfoundation.org/sites/events/files/slides/Normal%20and%20Exotic%20use%20cases%20for%20NUMA%20features.pdf
     ● https://en.wikipedia.org/wiki/Non-uniform_memory_access
     ● https://lihz1990.gitbooks.io/transoflptg/content/02.%E7%9B%91%E6%8E%A7%E5%92%8C%E5%8E%8B%E6%B5%8B%E5%B7%A5%E5%85%B7/sample-output-of-the-numastat-command.png
