Performance Metrics to Manage Memory SC.

Performance Metrics to Manage
Memory on Supercomputers
Andrès RUBIO PROAÑO
Post-doctoral Researcher, High Performance Big Data Research Team, RIKEN R-CCS,
21/12/2023

My Background
⚫ 2015 Electronics Engineering Bachelor's Degree at Escuela Politécnica Nacional, Ecuador.
⚫ CEDIA CEPRA Projects
⚫ 2018 Computer Engineering Master’s Degree at Universitat Autònoma de Barcelona, Spain.
⚫ 2021 PhD in Computer Science at Université de Bordeaux, France.
⚫ Doctoral contract with Inria for 3 years, with Intel funding.
⚫ 2021-present Postdoctoral Researcher at RIKEN R-CCS, Japan.

Outline
1. Background
⚫ The big question
⚫ Contribution
2. Complex Memory Spaces
⚫ Memory characterisation and memory attributes
⚫ API
⚫ Attribute Values
3. Preparing HPC applications for complex HMS
⚫ Benchmarking as an allocation criterion
⚫ Profiling as an allocation criterion
⚫ Static code analysis as an allocation criterion
4. Summary

INTRODUCTION:
From Real World to HPC Applications
Real World Problem
• Meshes, sparse matrices, etc.
• Need to be optimally partitioned
Surrogate Model
• Molecular dynamics
• Oil and gas exploration
• Climate change forecasting
• Discovery of new drugs for diseases
Computer Simulation Model
• AI, DL, ML workloads

From Homogeneity to Heterogeneity:
Processing

From Homogeneity to Heterogeneity:
Memory/Storage System
Homogeneous hierarchy:
• Register (CPU): ~0.1 ns
• Cache (Level 1, Level 2, Level 3): ~1-50 ns
• Memory (DRAM): ~80-100 ns
• Storage (HDD): ~10 ms
Heterogeneous hierarchy (capacity and latency increase from top to bottom):
• Register (CPU): ~0.1 ns
• Cache (Level 1, Level 2, Level 3): ~1-50 ns
• In-package memory (HBM)
• Volatile memory (DRAM): ~80-100 ns
• Disaggregated memory, CXL.mem (HBM, DRAM, NVDIMM): ~170-250 ns*
• Persistent memory (NVDIMM): ~350 ns-1 µs
• Solid state media (NVMe SSD, SATA SSD): ~10-100 µs
• Mechanical media (HDD): ~10 ms
• Sequential media (tape): ~100 ms
* Improvement in bandwidth
Tiers are attached through the processor package, the memory bus, or the I/O bus.

Heterogeneous Memory Systems
• 2MK HMS (CPU with NVM + DRAM): Xeon Cascade Lake, Xeon Ice Lake, Sapphire Rapids
• 2MK HMS (CPU with DRAM + HBM): KNL, Sapphire Rapids
• 3MK HMS (CPU with NVM + DRAM + HBM): next generations? CXL.mem?
• 2MK HMS (CPU with DRAM + HBM): K-AB21, Rhea chips
• 2MK HMS (CPU with NVM + DRAM): POWER10
• 2MK HMS (CPU with DRAM + HBM): POWER10
• 3MK HMS (CPU with NVM + DRAM + HBM): POWER10

Where to allocate?
Application (N buffers in total) -> Allocation Requests -> Memory System (NVM, DRAM, HBM)
• Allocating on a Homogeneous Memory System?
• Allocating on a NUMA Memory System (NUMA 0, NUMA 1)
• Allocating on a Heterogeneous Memory System

Contribution
1. How to expose heterogeneous memory systems to applications/runtime?
2. Where to allocate memory buffers?
3. Developing for Heterogeneous Memory Systems without having access to the hardware?
4. How to manage HMS in batch schedulers?
5. Power consumption metric on HMS
Goglin, B., & Rubio Proaño, A. (2019, August). Opportunities for Partitioning Non-volatile Memory DIMMs Between Co-scheduled Jobs on HPC Nodes. In European Conference on Parallel Processing (pp. 82-94).
León, E. A., Goglin, B., & Rubio Proaño, A. (2019, September). M&MMs: navigating complex memory spaces with hwloc. In Proceedings of the International Symposium on Memory Systems (pp. 149-155).
Rubio Proaño, A. (2020, June). Exposer les caractéristiques des architectures à mémoires hétérogènes aux applications parallèles. In COMPAS 2020-Conférence francophone d'informatique en Parallélisme, Architecture et Système.
Goglin, B., & Rubio Proaño, A. (2022, May). Using Performance Attributes for Managing Heterogeneous Memory in HPC Applications. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (pp. 890-899). IEEE.
Foyer, C., Goglin, B., & Rubio Proaño, A. (2023). A survey of software techniques to emulate heterogeneous memory systems in high-performance computing. Parallel Computing, 103023.
Rubio Proaño, A., & Sato, K. (2023, December). Understanding Power Consumption Metric on Heterogeneous Memory Systems. In The 29th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2023) (in proceedings).

Memory Attributes
• Attributes compared across HBM, DRAM and NVM: High Bandwidth, High Capacity, Low Power Consumption, Low Latency
Attributes Discovery
• Capacity, Locality: natively supported
• Bandwidth, Latency: on most platforms
• R/W Bandwidth, R/W Latency: on some platforms
• Reliability, Persistence, Endurance, Power Consumption: under investigation

Hwloc:
Apple Mac Mini with M1 hybrid processor
4 E-cores on top (energy efficient), 4 P-cores below (performance, with bigger caches).
The machine has 16GB of memory but most of it is given to the GPU (as shown in the OpenCL device)

Hwloc:
2x Xeon CascadeLake 6230 with NVDIMMs as separate NUMA nodes

Hwloc:
2x Xeon SapphireRapids Max 9460
Processors are configured in SubNUMA-Cluster mode, hence showing 4 DRAM NUMA nodes and 4 HBMs
in each package.

API functions for managing memory attributes
⚫ Get the array of memory targets that are local to a given initiator:
hwloc_get_local_numanode_objs(topology, initiator, &nr, targets, flags)
⚫ Get the best memory target (and its value) for the given initiator and attribute:
hwloc_memattr_get_best_target(topology, attribute, initiator, flags, &best_target, &target_value)
⚫ Get the value of an attribute for the given memory target and initiator:
hwloc_memattr_get_value(topology, attribute, target, initiator, flags, &value)
⚫ Add a custom memory attribute (e.g. STREAM-triad kernel):
hwloc_memattr_register(topology, name, flags, &attribute)

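E.g. Allocate on the best target for an existing attribute
#include <hwloc.h>
/* Initialise topology */
hwloc_topology_t topology;
hwloc_topology_init(&topology);
hwloc_topology_load(topology);
[...]
/* Allocate size bytes on the best target for the attribute, as seen from the initiator */
void *alloc_on_best_target(hwloc_topology_t topology, struct hwloc_location *initiator,
                           hwloc_memattr_id_t attribute, size_t size)
{
    hwloc_obj_t best_target;
    if (hwloc_memattr_get_best_target(topology, attribute, initiator, 0, &best_target, NULL) < 0)
        return NULL; /* no target reported for this attribute */
    return hwloc_alloc_membind(topology, size, best_target->nodeset,
                               HWLOC_MEMBIND_BIND, HWLOC_MEMBIND_BYNODESET);
}
[...]
/* Allocate 1 MiB on the highest-bandwidth memory near a given core */
struct hwloc_location initiator;
initiator.type = HWLOC_LOCATION_TYPE_CPUSET;
initiator.location.cpuset = core->cpuset;
void *buffer = alloc_on_best_target(topology, &initiator, HWLOC_MEMATTR_ID_BANDWIDTH, 1024*1024);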

hwloc_memattr_get_value(…)
• Application: sends Allocation Requests to the Heterogeneous Allocator
• Heterogeneous Allocator: gets Memory Targets and Attributes from hwloc / API Extension
• hwloc / API Extension: gets Performance Information from ACPI HMAT or from Benchmarking

ACPI HMAT
$ lstopo --memattrs
Memory attribute #0 name ’Capacity’
NUMANode L#0 = 99786076160
NUMANode L#1 = 101468516352
NUMANode L#2 = 796716433408
NUMANode L#3 = 99883061248
NUMANode L#4 = 101428244480
NUMANode L#5 = 798863917056
Memory attribute #2 name ’Bandwidth’
NUMANode L#0 = 131072 from Group0 L#0
NUMANode L#2 = 78644 from Package L#0
Memory attribute #3 name ’Latency’
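
The listing above reports one value per memory target. Here is a minimal sketch, in plain C with no hwloc calls, of how a "best target" could be picked from such values: the highest value wins for bandwidth-like attributes and the lowest for latency-like ones. The names `target_value`, `pick_best_target` and `bandwidth_values` are hypothetical and only serve the illustration. The two bandwidth values are the ones shown in the listing, and the sketch ignores which initiator each value is reported from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: one attribute value reported for one NUMA node. */
struct target_value {
    unsigned numa_index;   /* logical index, e.g. 0 for "NUMANode L#0" */
    uint64_t value;        /* attribute value as printed in the listing */
};

/* Return the entry with the best value: the highest when higher_is_better
 * is non-zero (e.g. bandwidth), the lowest otherwise (e.g. latency). */
const struct target_value *
pick_best_target(const struct target_value *values, size_t count,
                 int higher_is_better)
{
    assert(values != NULL && count > 0);
    const struct target_value *best = &values[0];
    for (size_t i = 1; i < count; i++) {
        int better = higher_is_better ? values[i].value > best->value
                                      : values[i].value < best->value;
        if (better)
            best = &values[i];
    }
    return best;
}

/* Bandwidth values taken from the listing above. */
static const struct target_value bandwidth_values[] = {
    { 0, 131072 },
    { 2, 78644 },
};
```

Calling `pick_best_target(bandwidth_values, 2, 1)` selects the entry for NUMANode L#0. In a real application the equivalent decision would be delegated to hwloc rather than re-implemented.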

Summary
⚫ Expected:
⚫ Obtain memory attribute information from HMAT (simpler)
⚫ Vendors adopt HMAT and populate it with reliable information
⚫ No need to spend time benchmarking
⚫ Currently:
⚫ Benchmarking is the safest way to get attribute values
⚫ HMAT has only appeared in the ACPI tables of generic platforms since 2021

Contribution
1. How to expose heterogeneous memory systems to applications/runtime?
2. What criteria determine where to allocate memory buffers?
3. Developing for Heterogeneous Memory Systems without having access to the hardware?
4. How to manage HMS in batch schedulers?
5. Power consumption metric on HMS

Strategy Framework -> High Productivity
• Application: sends Allocation Requests to the Heterogeneous Allocator
• Heterogeneous Allocator: gets Memory Targets and Attributes (measured performance, hardware performance information) from hwloc / API Extension
• hwloc / API Extension: gets Memory Identifiers from the Hardware
• Allocation Criteria (determine sensitivity to memory metrics): Profiling, Benchmarking, Static code analysis

Benchmarking (Bandwidth): Stream-Triad
3MK HMS: NVM, DRAM, HBM
Application: N buffers in total, 3^N binding tests
Buffer target node (0 → DRAM, 1 → NVM) and best rate in GB/s:
A B C Triad
0 0 0 74.97
0 0 1 51.88
0 1 0 55.59
0 1 1 38.32
1 0 0 9.92
1 0 1 9.05
1 1 0 9.16
1 1 1 8.50
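
The table above can be turned into a per-buffer sensitivity estimate. Here is a minimal sketch, assuming a simple metric that is not taken from the slides: for each buffer, the mean Triad rate over the tests where it sits on DRAM minus the mean rate over the tests where it sits on NVM. The names `triad_rate`, `buffer_sensitivity` and `most_sensitive_buffer` are hypothetical. The rates are the ones in the table.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BUFFERS 3
#define NUM_TESTS   8   /* 2^3 placements of 3 buffers on 2 memory kinds */

/* Best Triad rate in GB/s, in table order. The test index is A*4 + B*2 + C,
 * where each of A, B, C is the target node of that buffer (0 DRAM, 1 NVM). */
static const double triad_rate[NUM_TESTS] = {
    74.97, 51.88, 55.59, 38.32, 9.92, 9.05, 9.16, 8.50
};

/* Mean rate with the buffer on DRAM minus mean rate with it on NVM.
 * buffer: 0 for A, 1 for B, 2 for C. */
double buffer_sensitivity(const double *rates, int buffer)
{
    assert(rates != NULL && buffer >= 0 && buffer < NUM_BUFFERS);
    int bit = NUM_BUFFERS - 1 - buffer;   /* bit of the index for this buffer */
    double on_dram = 0.0, on_nvm = 0.0;
    for (int i = 0; i < NUM_TESTS; i++) {
        if ((i >> bit) & 1)
            on_nvm += rates[i];
        else
            on_dram += rates[i];
    }
    return (on_dram - on_nvm) / (NUM_TESTS / 2);
}

/* Index of the buffer whose placement changes the rate the most. */
int most_sensitive_buffer(const double *rates)
{
    int best = 0;
    for (int b = 1; b < NUM_BUFFERS; b++)
        if (buffer_sensitivity(rates, b) > buffer_sensitivity(rates, best))
            best = b;
    return best;
}
```

With the rates from the table, buffer A comes out as the most sensitive to placement, followed by C and then B, which matches the large drop visible in every row where A is on NVM.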

Benchmarking
⚫ Gives a general idea of the application's sensitivity
⚫ Hard to evaluate when every buffer is considered separately

Profiling: VTune
• Graph 500
• Memory Access Analysis tools:
• DRAM Bound: latency issues
• Persistent Memory Bound: latency issues
• DRAM Bandwidth Bound: bandwidth
• Persistent Memory Bandwidth Bound: bandwidth

Profiling: VTune
Allocating in Local DDR (Bandwidth: 35.33 GB/s, 0 GB/s)
Memory Object   Loads            Stores          LLC Miss Count   Average Latency (cycles)
xmalloc         12,258,444,411   1,205,069,645   580,459,141      61
xoff            39,030,512       0               0                8
xadj            10,311,019       0               0                9
Allocating in Local NVDIMM (Bandwidth: 8.573 GB/s, 0 GB/s)
Memory Object   Loads            Stores          LLC Miss Count   Average Latency (cycles)
xmalloc         31,972,081,700   1,220,455,472   562,889,063      131
xoff            92,622,498       0               0                7
xadj            20,176,351       0               0                9
xmalloc Memory Object in utils.c
[...]
#include <omp.h>
#endif
#include "utils.h"

void* xmalloc(size_t n) {
  void* p = malloc(n);
  if (!p) {
    fprintf(stderr, "Out of memory trying to allocate %zu byte(s)\n", n);
    abort();
  }
  return p;
}

void* xcalloc(size_t n, size_t k) {
  void* p = calloc(n, k);
  if (!p) {
    fprintf(stderr, "Out of memory trying to allocate %zu byte(s)\n", n);
[...]

Profiling: PCM-power
[Figure: memory power (Watts) versus number of threads for FT.A, with series for local DRAM, local NVM and remote DRAM]

Profiling
⚫ Perform an analysis of the execution
⚫ Kind of memory used
⚫ Most relevant buffers
⚫ Analyse the related source code line.
⚫ Identify memory related issues:
⚫ Bottlenecks, hot spots, etc.
⚫ Fewer runs but analysis could be more difficult

Power Consumption
⚫ To rank memory targets for power-constrained scenarios, or for situations that require a balance between performance and power consumption, we need to understand and characterise power consumption within a Heterogeneous Memory System (HMS). For that we need to follow a strategy.
• Domains: Package Domain, Cores Domain, Memory Domain (DRAM, NVDIMM)
• Counters: MSR_ENERGY_STATUS, MSR_DRAM_ENERGY_STATUS
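
Energy-status counters such as MSR_DRAM_ENERGY_STATUS are cumulative, so a power figure comes from the difference between two readings. Here is a minimal sketch, assuming the raw counter is a 32-bit value that wraps around and that the energy unit (joules per count) has already been obtained for the platform. The function names are hypothetical, and the code does not read any MSR itself. How the slides separate DRAM from NVDIMM consumption is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Energy in joules between two raw readings of an energy-status counter.
 * Unsigned 32-bit subtraction is modulo 2^32, so a single wrap-around
 * between the two readings is handled. */
double rapl_energy_joules(uint32_t before, uint32_t after,
                          double joules_per_count)
{
    assert(joules_per_count > 0.0);
    uint32_t delta = after - before;
    return (double)delta * joules_per_count;
}

/* Average power in watts over an interval of the given length in seconds. */
double rapl_average_power_watts(uint32_t before, uint32_t after,
                                double joules_per_count, double seconds)
{
    assert(seconds > 0.0);
    return rapl_energy_joules(before, after, joules_per_count) / seconds;
}
```

The sampling interval has to be short enough that the counter wraps at most once between readings, otherwise the computed energy is too low.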

Summary
⚫ Manage the complexity of HMS through an hwloc extension.
⚫ Presented a strategy that allows HPC applications to detect affinities for certain kinds of memory and allocate their buffers in the right place.
⚫ The presented strategy framework targets high productivity (for inexperienced developers) and better utilisation of the memory system.
⚫ Manage performance counters in a manner that allows us to differentiate the power consumption of different types of memory.

Future Work
⚫ Validate and extend our work on emerging platforms.
⚫ Static code analysis for making allocation decisions.
⚫ Extend our allocation policies to handle more application requirements.

Thanks
andres.rubioproano@riken.jp

RIKEN is Hiring
Recruitment: Our team is hiring student interns/Postdoc/Researchers
- Student interns: - Postdoc/Researchers:
