Notes on NUMA architecture

Intel Software Conference 2014 Brazil
May 2014
Leonardo Borges

  1. Title slide: Notes on NUMA architecture (Intel Software Conference 2014 Brazil, May 2014, Leonardo Borges)
  2. Non-Uniform Memory Access (NUMA)
     - FSB (front-side bus) architecture: all memory in one location
     - Starting with Nehalem: memory located in multiple places
     - Latency to memory depends on location
     - Local memory: highest bandwidth, lowest latency
     - Remote memory: higher latency
     (Diagram: Socket 0 and Socket 1 connected by QPI)
     - Ensure software is NUMA-optimized for best performance
  3. Non-Uniform Memory Access (NUMA): locality matters
     - Remote memory access latency is roughly 1.7x that of local memory
     - Local memory bandwidth can be up to 2x greater than remote
     (Diagram: Node 0 = CPU0 with local DRAM, Node 1 = CPU1 with local DRAM, connected by
      Intel® QPI, showing a local and a remote memory access. Intel® QPI = Intel® QuickPath
      Interconnect)
     - BIOS options:
       - NUMA mode (NUMA enabled): first half of the memory space on Node 0, second half on
         Node 1; should be the default on Nehalem
       - Non-NUMA (NUMA disabled): even/odd cache lines assigned to Nodes 0/1 (line interleaving)
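     The ratios above (about 1.7x latency, up to 2x bandwidth) are rough figures that vary by
     platform. As a minimal sketch, not part of the slides, of how to inspect the topology at
     runtime on Linux with libnuma (link with -lnuma): numa_distance() reports the firmware's
     relative node distances, where 10 means local.

         #include <stdio.h>
         #include <numa.h>              /* libnuma; link with -lnuma */

         int main(void)
         {
             if (numa_available() < 0) {
                 fprintf(stderr, "NUMA not available on this system\n");
                 return 1;
             }

             int nodes = numa_num_configured_nodes();
             printf("configured NUMA nodes: %d\n", nodes);

             /* Distance matrix: 10 = local node, larger values = more remote. */
             for (int from = 0; from < nodes; from++) {
                 for (int to = 0; to < nodes; to++)
                     printf("%4d", numa_distance(from, to));
                 printf("\n");
             }
             return 0;
         }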
  4. Local Memory Access Example
     - CPU0 requests cache line X, which is not present in any CPU0 cache
     - Step 1:
       - CPU0 requests the data from its DRAM
       - CPU0 snoops CPU1 to check whether the data is present there
     - Step 2:
       - The DRAM returns the data
       - CPU1 returns its snoop response
     - Local memory latency is the maximum of the two response latencies
     - Nehalem is optimized to keep these key latencies close to each other
     (Diagram: CPU0 and CPU1, each with its own DRAM, connected by QPI)
  5. Remote Memory Access Example
     - CPU0 requests cache line X, which is not present in any CPU0 cache
       - CPU0 requests the data from CPU1
       - The request is sent over QPI to CPU1
       - CPU1's integrated memory controller (IMC) issues the request to its DRAM
       - CPU1 snoops its internal caches
       - The data is returned to CPU0 over QPI
     - Remote memory latency is therefore a function of having a low-latency interconnect
     (Diagram: CPU0 and CPU1, each with its own DRAM, connected by QPI)
  6. Non-Uniform Memory Access and Parallel Execution
     - Process-parallel execution:
       - NUMA friendly: data belongs only to the owning process
       - E.g. MPI
       - Affinity pinning maximizes local memory access
       - Standard for HPC
     - Shared-memory threading:
       - More problematic: the same thread may require data from multiple NUMA nodes
       - E.g. OpenMP, TBB, explicit threading
       - OS-scheduled thread migration can aggravate the situation
       - NUMA and non-NUMA modes should be compared
  7. Operating System Differences
     - Operating systems allocate data differently
     - Linux*:
       - malloc reserves the memory
       - The physical page is assigned when the data is first touched ("first touch")
       - Note: many HPC codes initialize memory with a single 'master' thread!
       - Several extensions are available via numactl and libnuma, for example:
         numactl --interleave=all /bin/program
         numactl --cpunodebind=1 --membind=1 /bin/program
         numactl --hardware
         numa_run_on_node(3)   // run the calling thread on node 3
     - Microsoft Windows*:
       - malloc assigns the physical page on allocation
       - This default allocation policy is not NUMA friendly
       - Microsoft Windows has NUMA-friendly APIs:
         - VirtualAlloc reserves memory (like malloc on Linux*)
         - Physical pages are assigned at first use
     - For more details:
       http://kernel.org/pub/linux/kernel/people/christoph/pmig/numamemory.pdf
       http://msdn.microsoft.com/en-us/library/aa363804.aspx
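     To make the libnuma calls above concrete, here is a minimal sketch, not from the slides:
     it assumes a machine with at least two nodes, node 1 and the 64 MB size are arbitrary
     placeholders, and it must be linked with -lnuma. It runs the calling thread on a chosen
     node and allocates memory on that same node.

         #include <stdio.h>
         #include <numa.h>              /* libnuma; link with -lnuma */

         int main(void)
         {
             if (numa_available() < 0) {
                 fprintf(stderr, "NUMA not available on this system\n");
                 return 1;
             }

             int node = 1;                              /* placeholder: assumes node 1 exists */
             numa_run_on_node(node);                    /* run the calling thread on that node */

             size_t size = 64UL * 1024 * 1024;          /* placeholder size: 64 MB */
             double *a = numa_alloc_onnode(size, node); /* pages come from the chosen node */
             if (a == NULL) {
                 perror("numa_alloc_onnode");
                 return 1;
             }

             for (size_t i = 0; i < size / sizeof(double); i++)
                 a[i] = 0.0;                            /* touch the pages so they are populated */

             numa_free(a, size);
             return 0;
         }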
  8. Other Ways to Set Process Affinity
     - taskset: sets or retrieves the CPU affinity of a process
     - Intel MPI: use the I_MPI_PIN and I_MPI_PIN_PROCESSOR_LIST environment variables
     - KMP_AFFINITY with the Intel Compilers' OpenMP runtime:
       - compact: binds OpenMP thread n+1 as close as possible to OpenMP thread n
       - scatter: distributes the threads evenly across the entire system; scatter is the
         opposite of compact
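     Besides these tools and environment variables, affinity can also be set from inside the
     program. The following is a hedged sketch, Linux-specific and not part of the slides,
     that pins the calling thread to one logical CPU with sched_setaffinity (CPU 0 here is an
     arbitrary placeholder):

         #define _GNU_SOURCE
         #include <sched.h>
         #include <stdio.h>

         /* Pin the calling thread to a single logical CPU; returns 0 on success. */
         static int pin_to_cpu(int cpu)
         {
             cpu_set_t mask;
             CPU_ZERO(&mask);
             CPU_SET(cpu, &mask);
             return sched_setaffinity(0, sizeof(mask), &mask);  /* pid 0 = calling thread */
         }

         int main(void)
         {
             if (pin_to_cpu(0) != 0) {      /* placeholder: assumes logical CPU 0 is usable */
                 perror("sched_setaffinity");
                 return 1;
             }
             printf("pinned to CPU 0\n");
             return 0;
         }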
  9. NUMA Application-Level Tuning: Shared-Memory Threading Example (TRIAD)
     - The time-consuming hotspot "TRIAD" (e.g. from the STREAM benchmark) parallelized with
       OpenMP:

         main() {
             ...
             #pragma omp parallel
             {
                 // Parallelized TRIAD loop
                 #pragma omp for private(j)
                 for (j = 0; j < N; j++)
                     a[j] = b[j] + scalar*c[j];
             } // end omp parallel
             ...
         } // end main

     - Parallelizing hotspots may not be sufficient for NUMA
  10. NUMA Shared-Memory Threading Example (Linux*)
      - KMP_AFFINITY=compact,0,verbose: environment variable that pins thread affinity
      - Each thread initializes its own part of the data, which pins those pages to local
        memory (first touch); the same thread that initialized the data then uses it:

          main() {
              ...
              #pragma omp parallel
              {
                  // Each thread initializes its data, pinning the pages to local memory
                  #pragma omp for private(i)
                  for (i = 0; i < N; i++) {
                      a[i] = 10.0; b[i] = 10.0; c[i] = 10.0;
                  }
                  ...
                  // Parallelized TRIAD loop: the same thread that initialized the data uses it
                  #pragma omp for private(j)
                  for (j = 0; j < N; j++)
                      a[j] = b[j] + scalar*c[j];
              } // end omp parallel
              ...
          } // end main
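      Combining the two slides, a self-contained version of this first-touch pattern might
      look like the sketch below. It is not from the slides: the array length N and the
      scalar are placeholders, it should be compiled with an OpenMP flag such as -fopenmp or
      -qopenmp, and an affinity setting such as KMP_AFFINITY or OMP_PROC_BIND should be set
      before running so threads stay where they first touched their pages.

          #include <stdio.h>
          #include <stdlib.h>

          #define N (64 * 1024 * 1024)    /* placeholder array length */

          int main(void)
          {
              double *a = malloc(N * sizeof(double));
              double *b = malloc(N * sizeof(double));
              double *c = malloc(N * sizeof(double));
              double scalar = 3.0;        /* placeholder scalar */
              long i, j;

              if (!a || !b || !c) { perror("malloc"); return 1; }

              #pragma omp parallel
              {
                  /* First touch: each thread initializes the pages it will later use,
                     so the OS places them on that thread's local NUMA node. */
                  #pragma omp for private(i)
                  for (i = 0; i < N; i++) {
                      a[i] = 10.0; b[i] = 10.0; c[i] = 10.0;
                  }

                  /* TRIAD: with the default static scheduling, the thread that initialized
                     a given range of j now computes on it, so accesses stay local. */
                  #pragma omp for private(j)
                  for (j = 0; j < N; j++)
                      a[j] = b[j] + scalar * c[j];
              }

              printf("a[0] = %f\n", a[0]);
              free(a); free(b); free(c);
              return 0;
          }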
  11. NUMA Optimization Summary
      - NUMA adds complexity to software parallelization and optimization
      - Optimize for latency and for bandwidth:
        - In most cases the goal is to minimize latency
        - Use local memory
        - Keep memory near the thread that accesses it
        - Keep each thread near the memory it uses
      - Rely on quality middleware for CPU affinitization, e.g. the Intel Compiler OpenMP or
        MPI environment variables
      - Application-level tuning may be required to minimize the effects of the NUMA
        first-touch policy
