1. Performance Metrics to Manage Memory on Supercomputers
Andrès RUBIO PROAÑO
Post-doctoral Researcher, High Performance Big Data Research Team, RIKEN R-CCS
21/12/2023
2. My Background
⚫ 2015: Bachelor's degree in Electronics Engineering, Escuela Politécnica Nacional, Ecuador.
⚫ CEDIA CEPRA projects.
⚫ 2018: Master's degree in Computer Engineering, Universitat Autònoma de Barcelona, Spain.
⚫ 2021: PhD in Computer Science, Université de Bordeaux, France.
⚫ Three-year PhD contract with Inria, funded by Intel.
⚫ 2021-present: Postdoctoral Researcher at RIKEN R-CCS, Japan.
3. Outline
1. Background
⚫ The big question
⚫ Contribution
2. Complex Memory Spaces
⚫ Memory characterisation and memory attributes
⚫ API
⚫ Attribute Values
3. Preparing HPC applications for complex HMS
⚫ Benchmarking as an allocation criterion
⚫ Profiling as an allocation criterion
⚫ Static code analysis as an allocation criterion
4. Summary
4. Introduction: From Real World to HPC Applications
Real World Problem
• Meshes, sparse matrices, etc.
• Need to be optimally partitioned
Surrogate Model
• Molecular dynamics
• Oil and gas exploration
• Forecast climate change
• Discover new drugs for diseases
Computer Simulation Model
• AI, DL, ML workloads
6. From Homogeneity to Heterogeneity: Memory/Storage System
Homogeneous memory/storage system (capacity and latency increase from top to bottom):
⚫ CPU registers: ~0.1 ns
⚫ Cache (Level 1, Level 2, Level 3): ~1-50 ns
⚫ Memory (DRAM): ~80-100 ns
⚫ Storage (HDD): ~10 ms
Heterogeneous memory/storage system (capacity and latency increase from top to bottom):
⚫ CPU registers: ~0.1 ns
⚫ Cache (Level 1, Level 2, Level 3): ~1-50 ns
⚫ In-package memory (HBM)
⚫ Volatile memory (DRAM): ~80-100 ns
⚫ Disaggregated memory, CXL.mem (HBM, DRAM, NVDIMM): ~170-250 ns*
⚫ Persistent memory (NVDIMM): ~350 ns-1 µs
⚫ Solid state media (NVMe SSD, SATA SSD): ~10-100 µs
⚫ Mechanical media (HDD): ~10 ms
⚫ Sequential media (tape): ~100 ms
*Improvement in bandwidth
[Diagram labels: Processor Package, Memory Bus, I/O Bus]
7. Heterogeneous Memory Systems
⚫ 2MK HMS (CPU with NVM + DRAM): Xeon Cascade Lake, Xeon Ice Lake, Sapphire Rapids
⚫ 2MK HMS (CPU with DRAM + HBM): KNL, Sapphire Rapids
⚫ 3MK HMS (CPU with NVM + DRAM + HBM): next generations? CXL.mem?
10. Where to allocate?
[Figure: an application with N buffers in total sends allocation requests to the memory system. Three cases are shown:]
⚫ Allocating on a homogeneous memory system?
⚫ Allocating on a NUMA memory system (NUMA 0, NUMA 1)
⚫ Allocating on a heterogeneous memory system (NVM, DRAM, HBM)
11. Contribution
1. How to expose heterogeneous memory systems to applications/runtime?
2. Where to allocate memory buffers?
3. Developing for Heterogeneous Memory Systems without having access to the hardware?
4. How to manage HMS in batch schedulers.
5. Power consumption metric on HMS
Goglin, B., & Rubio Proaño, A. (2019, August). Opportunities for Partitioning Non-volatile Memory DIMMs Between Co-scheduled Jobs on HPC Nodes. In European Conference on Parallel Processing (pp. 82-94).
León, E. A., Goglin, B., & Rubio Proaño, A. (2019, September). M&MMs: navigating complex memory spaces with hwloc. In Proceedings of the International Symposium on Memory Systems (pp. 149-155).
Rubio Proaño, A. (2020, June). Exposer les caractéristiques des architectures à mémoires hétérogènes aux applications parallèles. In COMPAS 2020 - Conférence francophone d'informatique en Parallélisme, Architecture et Système.
Goglin, B., & Rubio Proaño, A. (2022, May). Using Performance Attributes for Managing Heterogeneous Memory in HPC Applications. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (pp. 890-899). IEEE.
Foyer, C., Goglin, B., & Rubio Proaño, A. (2023). A survey of software techniques to emulate heterogeneous memory systems in high-performance computing. Parallel Computing, 103023.
Rubio Proaño, A., & Sato, K. (2023, December). Understanding Power Consumption Metric on Heterogeneous Memory Systems. In The 29th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2023) (in proceedings).
13. Memory Attributes
[Figure: memory attributes (high bandwidth, high capacity, low power consumption, low latency) and memory kinds (HBM, DRAM, NVM)]
Attributes Discovery
Attribute                                                Availability
Capacity, Locality                                       Natively supported
Bandwidth, Latency                                       On most platforms
R/W Bandwidth, R/W Latency                               On some platforms
Reliability, Persistence, Endurance, Power Consumption   Under investigation
14. hwloc: Apple Mac Mini with M1 hybrid processor
4 E-cores on top (energy efficient), 4 P-cores below (performance, with bigger caches).
The machine has 16GB of memory, but most of it is given to the GPU (as shown in the OpenCL device).
16. hwloc: 2x Xeon Sapphire Rapids Max 9460
Processors are configured in Sub-NUMA Cluster mode, hence showing 4 DRAM NUMA nodes and 4 HBMs in each package.
17. API functions for managing memory attributes
⚫ Get the array of memory targets that are local to a given initiator:
hwloc_get_local_numanode_objs(topology, &initiator, &nr, targets, flags)
⚫ Get the best memory target (and its value) for the given initiator and attribute:
hwloc_memattr_get_best_target(topology, attribute, &initiator, flags, &best_target, &target_value)
⚫ Get the value of an attribute for the given memory target and initiator:
hwloc_memattr_get_value(topology, attribute, target, &initiator, flags, &value)
⚫ Add a custom memory attribute (e.g. STREAM-triad kernel), then set its values:
hwloc_memattr_register(topology, name, flags, &attribute)
hwloc_memattr_set_value(topology, attribute, target, &initiator, flags, value)
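18. E.g. Allocate on the best target for an existing attribute
#include <hwloc.h>

/* Initialise topology */
hwloc_topology_t topology;
hwloc_topology_init(&topology);
hwloc_topology_load(topology);
[...]
/* Allocate size bytes on the best target for attribute, as seen from cpuset */
void *alloc_on_best_target(hwloc_topology_t topology, hwloc_cpuset_t cpuset,
                           hwloc_memattr_id_t attribute, size_t size)
{
    struct hwloc_location initiator;
    hwloc_obj_t best_target;

    initiator.type = HWLOC_LOCATION_TYPE_CPUSET;
    initiator.location.cpuset = cpuset;
    /* flags must be 0; NULL means the attribute value itself is not needed */
    if (hwloc_memattr_get_best_target(topology, attribute, &initiator, 0, &best_target, NULL) < 0)
        return NULL; /* no target found for this attribute */
    return hwloc_alloc_membind(topology, size, best_target->nodeset,
                               HWLOC_MEMBIND_BIND, HWLOC_MEMBIND_BYNODESET);
}
[...]
/* Allocating 1MB on best bandwidth memory near a given core */
void *buffer = alloc_on_best_target(topology, core->cpuset, HWLOC_MEMATTR_ID_BANDWIDTH, 1024*1024);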
22. Summary
⚫ Expected:
⚫ Obtain memory attribute information from the ACPI HMAT (simpler)
⚫ Vendors start using HMAT and fill it with reliable information
⚫ No need to spend time benchmarking
⚫ Currently:
⚫ Benchmarking is the safest way to get attribute values
⚫ HMAT has only been appearing in the ACPI tables of generic platforms since 2021
25. Strategy Framework -> High Productivity
[Figure: layered stack: Application, Heterogeneous Allocator, hwloc / API Extension, Hardware]
⚫ Allocation requests
⚫ Memory targets and attributes
• Measured performance
• Hardware performance information
⚫ Memory identifiers
⚫ Determine sensitivity to memory metrics
⚫ Allocation criteria: profiling, benchmarking, static code analysis
26. Benchmarking (Bandwidth): Stream-Triad
[Figure: an application with N buffers in total on a 3MK HMS (NVM, DRAM, HBM), leading to 3N binding tests]
Buffer target node    Best rate in GB/s
A  B  C               Triad
0  0  0               74.97
0  0  1               51.88
0  1  0               55.59
0  1  1               38.32
1  0  0               9.92
1  0  1               9.05
1  1  0               9.16
1  1  1               8.50
0 → DRAM
1 → NVM
27. Benchmarking
⚫ Gives a general idea of the application's sensitivity
⚫ Hard to evaluate when all buffers are taken into account separately
31. Profiling
⚫ Perform an analysis of the execution
⚫ Kind of memory used
⚫ Most relevant buffers
⚫ Analyse the related source code lines.
⚫ Identify memory-related issues:
⚫ Bottlenecks, hot spots, etc.
⚫ Fewer runs, but the analysis can be more difficult
33. Power Consumption
⚫ We need to understand and/or characterise power consumption within a Heterogeneous Memory System (HMS) in order to rank its memory targets.
⚫ Such a ranking lets applications run in power-constrained scenarios, or in situations that require a balance between performance and power consumption.
⚫ Doing so requires a strategy.
[Figure: energy domains (package domain, cores domain, memory domain covering DRAM and NVDIMM) and the counters MSR_ENERGY_STATUS and MSR_DRAM_ENERGY_STATUS]
36. Summary
⚫ Managed the complexity of HMS through an hwloc extension.
⚫ Presented a strategy that allows HPC applications to detect affinities for certain kinds of memory and to allocate their buffers in the right place.
⚫ The presented strategy framework targets high productivity (for non-expert developers) and better utilisation of the memory system.
⚫ Managed performance counters in a way that allows us to differentiate the power consumption of different types of memory.
37. Future Work
⚫ Validate and extend our work on emerging platforms.
⚫ Static code analysis for making allocation decisions.
⚫ Extend our allocation policies to handle more application requirements.