Performance Metrics to Manage
Memory on Supercomputers
Andrès RUBIO PROAÑO
Post-doctoral Researcher, High Performance Big Data Research Team, RIKEN R-CCS,
21/12/2023
My Background
⚫ 2015: Bachelor's Degree in Electronics Engineering at Escuela Politécnica Nacional, Ecuador.
⚫ CEDIA CEPRA Projects
⚫ 2018: Master's Degree in Computer Engineering at Universitat Autònoma de Barcelona, Spain.
⚫ 2021: PhD in Computer Science at Université de Bordeaux, France.
⚫ Three-year PhD contract with Inria, funded by Intel.
⚫ 2021-present: Postdoctoral Researcher at RIKEN R-CCS, Japan.
Outline
1. Background
⚫ The big question
⚫ Contribution
2. Complex Memory Spaces
⚫ Memory characterisation and memory attributes
⚫ API
⚫ Attribute Values
3. Preparing HPC applications for complex HMS
⚫ Benchmarking as an allocation criterion
⚫ Profiling as an allocation criterion
⚫ Static code analysis as an allocation criterion
4. Summary
INTRODUCTION:
From Real World to HPC Applications
Real World Problem
• Meshes, sparse matrices, etc.
• Need to be optimally partitioned
Surrogate Model
• Molecular Dynamics
• Oil and gas exploration
• Forecast climate change
• Discover new drugs for diseases
Computer Simulation Model
• AI, DL, ML workloads
From Homogeneity to Heterogeneity:
Processing
From Homogeneity to Heterogeneity:
Memory/Storage System
Homogeneous hierarchy:
• Register (CPU): ~0.1 ns
• Cache (Level 1, Level 2, Level 3): ~1-50 ns
• Memory (DRAM): ~80-100 ns
• Storage (HDD): ~10 ms
Heterogeneous hierarchy (capacity and latency increase down the list):
• Register (CPU): ~0.1 ns
• Cache (Level 1, Level 2, Level 3): ~1-50 ns
• In-package memory: HBM
• Volatile memory (DRAM): ~80-100 ns
• Disaggregated memory, CXL.mem (HBM, DRAM, NVDIMM): ~170-250 ns*
• Persistent memory (NVDIMM): ~350 ns-1 µs
• Solid state media (NVMe SSD, SATA SSD): ~10-100 µs
• Mechanical media (HDD): ~10 ms
• Sequential media (tape): ~100 ms
* Improvement in bandwidth
Labels on the figure: Processor Package, Memory Bus, I/O Bus
Heterogeneous Memory Systems
⚫ 2MK HMS (CPU with NVM + DRAM): Xeon Cascade Lake, Xeon Ice Lake, Sapphire Rapids
⚫ 2MK HMS (CPU with DRAM + HBM): KNL, Sapphire Rapids
⚫ 3MK HMS (CPU with NVM + DRAM + HBM): next generations? CXL.mem?
Heterogeneous Memory Systems
⚫ 2MK HMS (CPU with DRAM + HBM): K-AB21, Rhea chips
Heterogeneous Memory Systems
⚫ 2MK HMS (CPU with NVM + DRAM): POWER10
⚫ 2MK HMS (CPU with DRAM + HBM): POWER10
⚫ 3MK HMS (CPU with NVM + DRAM + HBM): POWER10
Where to allocate?
⚫ Application: N buffers in total, each one an allocation request
⚫ Memory system: NVM, DRAM, HBM
Allocating on a Homogeneous Memory System?
Allocating on a NUMA Memory System (NUMA 0, NUMA 1)
Allocating on a Heterogeneous Memory System
Contribution
1. How to expose heterogeneous memory systems to applications/runtime?
2. Where to allocate memory buffers?
3. Developing for Heterogeneous Memory Systems without having access to the hardware?
4. How to manage HMS in batch schedulers?
5. Power consumption metric on HMS
Goglin, B., & Rubio Proaño, A. (2019, August). Opportunities for Partitioning Non-volatile Memory DIMMs Between Co-scheduled Jobs on HPC Nodes. In European Conference on Parallel Processing (pp. 82-94).
León, E. A., Goglin, B., & Rubio Proaño, A. (2019, September). M&MMs: navigating complex memory spaces with hwloc. In Proceedings of the International Symposium on Memory Systems (pp. 149-155).
Rubio Proaño, A. (2020, June). Exposer les caractéristiques des architectures à mémoires hétérogènes aux applications parallèles. In COMPAS 2020-Conférence francophone d'informatique en Parallélisme, Architecture et Système.
Goglin, B., & Rubio Proaño, A. (2022, May). Using Performance Attributes for Managing Heterogeneous Memory in HPC Applications. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (pp. 890-899). IEEE.
Foyer, C., Goglin, B., & Rubio Proaño, A. (2023). A survey of software techniques to emulate heterogeneous memory systems in high-performance computing. Parallel Computing, 103023.
Rubio Proaño, A., & Sato, K. (2023, December). Understanding Power Consumption Metric on Heterogeneous Memory Systems. In The 29th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2023) (in proceedings).
Memory Attributes
⚫ High Bandwidth
⚫ High Capacity
⚫ Low Power Consumption
⚫ Low Latency
Memory kinds: HBM, DRAM, NVM
Attributes Discovery
⚫ Capacity, Locality: natively supported
⚫ Bandwidth, Latency: on most platforms
⚫ R/W Bandwidth, R/W Latency: on some platforms
⚫ Reliability, Persistence, Endurance, Power Consumption: under investigation
Hwloc:
Apple Mac Mini with M1 hybrid processor
4 E-cores on top (energy efficient), 4 P-cores below (performance, with bigger caches).
The machine has 16GB of memory but most of it is given to the GPU (as shown in the OpenCL device)
Hwloc:
2x Xeon CascadeLake 6230 with NVDIMMs as separate NUMA nodes
Hwloc:
2x Xeon SapphireRapids Max 9460
Processors are configured in SubNUMA-Cluster mode, hence showing 4 DRAM NUMA nodes and 4 HBMs
in each package.
API functions for managing memory attributes
⚫ Get the array of memory targets that are local to a given initiator:
hwloc_get_local_numanode_objs(topology, initiator, &nr, targets, flags)
⚫ Get the best memory target (and its value) for the given initiator and attribute:
hwloc_memattr_get_best_target(topology, attribute, initiator, flags, &best_target, &value)
⚫ Get the value of an attribute for the given memory target and initiator:
hwloc_memattr_get_value(topology, attribute, target, initiator, flags, &value)
⚫ Register a custom memory attribute (e.g. one measured with the STREAM-triad kernel):
hwloc_memattr_register(topology, name, flags, &attribute)
E.g. Allocate on the best target for an existing attribute
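/* Initialise topology */
hwloc_topology_t topology;
hwloc_topology_init(&topology);
hwloc_topology_load(topology);
[...]
/* Allocating function: returns NULL if no best target is found or the allocation fails */
void *alloc_on_best_target(hwloc_topology_t topology, struct hwloc_location *initiator,
                           hwloc_memattr_id_t attribute, size_t size)
{
    hwloc_obj_t best_target;
    /* flags is 0; the value of the best target is not needed here */
    if (hwloc_memattr_get_best_target(topology, attribute, initiator, 0, &best_target, NULL) < 0)
        return NULL;
    return hwloc_alloc_membind(topology, size, best_target->nodeset,
                               HWLOC_MEMBIND_BIND, HWLOC_MEMBIND_BYNODESET);
}
[...]
/* Allocating 1MB on the best-bandwidth memory near a given core */
struct hwloc_location initiator;
initiator.type = HWLOC_LOCATION_TYPE_CPUSET;
initiator.location.cpuset = core->cpuset;
void *buffer = alloc_on_best_target(topology, &initiator, HWLOC_MEMATTR_ID_BANDWIDTH, 1024*1024);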
[Diagram: software stack]
⚫ Layers: Application, Heterogeneous Allocator, hwloc / API Extension
⚫ Exchanged between layers: Allocation Requests, Memory Targets and Attributes (hwloc_memattr_get_value(…))
⚫ Sources of Performance Information: ACPI HMAT, Benchmarking
ACPI HMAT
$ lstopo --memattrs
Memory attribute #0 name ’Capacity’
NUMANode L#0 = 99786076160
NUMANode L#1 = 101468516352
NUMANode L#2 = 796716433408
NUMANode L#3 = 99883061248
NUMANode L#4 = 101428244480
NUMANode L#5 = 798863917056
Memory attribute #2 name ’Bandwidth’
NUMANode L#0 = 131072 from Group0 L#0
NUMANode L#1 = 131072 from Group0 L#1
NUMANode L#2 = 78644 from Package L#0
NUMANode L#3 = 131072 from Group0 L#2
NUMANode L#4 = 131072 from Group0 L#3
NUMANode L#5 = 78644 from Package L#1
Memory attribute #3 name ’Latency’
NUMANode L#0 = 26 from Group0 L#0
NUMANode L#1 = 26 from Group0 L#1
NUMANode L#2 = 77 from Package L#0
NUMANode L#3 = 26 from Group0 L#2
NUMANode L#4 = 26 from Group0 L#3
NUMANode L#5 = 77 from Package L#1
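As a minimal sketch of how such attribute values can drive a choice of target, the following C code picks the best node for one attribute from a small table. The struct, enum and function names are illustrative assumptions and are not hwloc types; the numbers are copied from the lstopo --memattrs output above for the three nodes of package 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One memory target with its attribute values as printed by lstopo.
 * Illustrative type, not part of hwloc. */
struct mem_target {
    unsigned numa_index;   /* logical NUMA node index (L#) */
    uint64_t capacity;
    uint64_t bandwidth;
    uint64_t latency;
};

enum attr { ATTR_CAPACITY, ATTR_BANDWIDTH, ATTR_LATENCY };

static uint64_t attr_value(const struct mem_target *t, enum attr a)
{
    switch (a) {
    case ATTR_CAPACITY:  return t->capacity;
    case ATTR_BANDWIDTH: return t->bandwidth;
    default:             return t->latency;
    }
}

/* Returns the index in targets[] of the best target for attribute a.
 * Capacity and bandwidth: higher is better. Latency: lower is better.
 * Ties keep the earliest entry. */
size_t best_target(const struct mem_target *targets, size_t n, enum attr a)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        uint64_t v = attr_value(&targets[i], a);
        uint64_t b = attr_value(&targets[best], a);
        int better = (a == ATTR_LATENCY) ? (v < b) : (v > b);
        if (better)
            best = i;
    }
    return best;
}

/* Values copied from the lstopo output above (NUMANode L#0 to L#2). */
const struct mem_target package0[] = {
    { 0,  99786076160ULL, 131072, 26 },
    { 1, 101468516352ULL, 131072, 26 },
    { 2, 796716433408ULL,  78644, 77 },
};
```

With this table, the capacity criterion selects node L#2, while the bandwidth and latency criteria select node L#0. The three attributes do not agree on a single best target.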
Benchmarking
Summary
⚫ Expected:
⚫ Obtain memory attribute information from HMAT (simpler)
⚫ Vendors start using HMAT and fill it with reliable information
⚫ No need to spend time benchmarking
⚫ Currently:
⚫ Benchmarking is the safest way to get attribute values
⚫ HMAT has only been appearing in the ACPI tables of generic platforms since 2021
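To illustrate what benchmarking an attribute value can involve, here is a minimal single-threaded sketch of a STREAM-style triad measurement in C. The function names are assumptions made for this example, and the buffers come from plain malloc; a real measurement would bind each buffer to the memory target under test and would normally run multi-threaded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* STREAM-style triad: a[i] = b[i] + scalar * c[i]. */
void triad(double *a, const double *b, const double *c, double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}

/* Bytes moved by one triad pass: two arrays read and one written. */
double triad_bytes(size_t n)
{
    return 3.0 * (double)n * (double)sizeof(double);
}

/* Converts a byte count and an elapsed time into GB/s (10^9 bytes/s). */
double rate_gbs(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}

static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Best rate over `repeats` passes, or -1.0 if allocation fails,
 * the timer is too coarse, or the result check fails. */
double best_triad_gbs(size_t n, int repeats)
{
    assert(n > 0);
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    double best = -1.0;
    if (a && b && c) {
        for (size_t i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }
        for (int r = 0; r < repeats; r++) {
            double t0 = now_seconds();
            triad(a, b, c, 3.0, n);
            double dt = now_seconds() - t0;
            if (dt > 0.0 && rate_gbs(triad_bytes(n), dt) > best)
                best = rate_gbs(triad_bytes(n), dt);
        }
        if (a[n - 1] != 7.0)        /* 1.0 + 3.0 * 2.0 */
            best = -1.0;
    }
    free(a); free(b); free(c);
    return best;
}
```

Reporting the best rate over several passes follows the STREAM convention, which is also how the "Best Rate" column of a triad run is usually read.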
Strategy Framework –> High Productivity
⚫ Layers: Application, Heterogeneous Allocator, hwloc / API Extension, Hardware
⚫ Exchanged between layers: Allocation Requests, Memory Targets and Attributes (Measured Performance, Hardware Performance Information), Memory Identifiers
⚫ Allocation Criteria: determine sensitivity to memory metrics through Profiling, Benchmarking and Static code analysis
Benchmarking (Bandwidth): Stream-Triad
⚫ 3MK HMS: NVM, DRAM, HBM
⚫ Application: N buffers in total
⚫ 3^N binding tests

Buffer target node (0 → DRAM, 1 → NVM) and best Triad rate in GB/s:
A  B  C  Triad
0  0  0  74.97
0  0  1  51.88
0  1  0  55.59
0  1  1  38.32
1  0  0   9.92
1  0  1   9.05
1  1  0   9.16
1  1  1   8.50
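The number of binding tests grows as (memory kinds)^(buffers): 8 rows for the two kinds and three buffers of the table, 27 on a system with three kinds. A minimal C sketch of enumerating those placements follows; the function names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Number of distinct placements of n_buffers buffers over n_kinds
 * memory kinds: n_kinds raised to the power n_buffers. */
unsigned long binding_count(unsigned n_kinds, unsigned n_buffers)
{
    unsigned long count = 1;
    for (unsigned i = 0; i < n_buffers; i++)
        count *= n_kinds;
    return count;
}

/* Decodes test number `index` into one memory kind per buffer.
 * Buffer 0 is the most significant digit, so with 2 kinds and
 * 3 buffers index 1 gives 0 0 1 and index 4 gives 1 0 0, which is
 * the row order of the table. */
void decode_binding(unsigned long index, unsigned n_kinds,
                    unsigned n_buffers, unsigned *kinds)
{
    assert(n_kinds > 0);
    for (unsigned i = n_buffers; i-- > 0; ) {
        kinds[i] = (unsigned)(index % n_kinds);
        index /= n_kinds;
    }
}
```

A benchmark driver would loop over index from 0 to binding_count(...) - 1, decode each placement, bind the buffers accordingly and record the measured rate.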
Benchmarking
⚫ Gives a general idea of the application's sensitivity
⚫ Hard to evaluate when all buffers are taken into account separately
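One way to condense the eight triad results into a per-buffer figure is to average the slowdown observed when a single buffer moves from DRAM to NVM. This is a sketch of one possible sensitivity measure, chosen for illustration and not necessarily the one used in the talk; the rates are copied from the Stream-Triad table.

```c
#include <assert.h>
#include <stddef.h>

#define N_BUFFERS 3
#define N_TESTS   (1u << N_BUFFERS)   /* 2 memory kinds: 0 = DRAM, 1 = NVM */

/* Triad best rates in GB/s from the table, indexed by the binding
 * read as a binary number ABC (A is the most significant bit). */
const double triad_rate[N_TESTS] = {
    74.97, 51.88, 55.59, 38.32, 9.92, 9.05, 9.16, 8.50
};

/* Mean slowdown factor when `buffer` (0 = A, 1 = B, 2 = C) moves from
 * DRAM to NVM, averaged over all placements of the other buffers. */
double mean_slowdown(const double *rate, unsigned buffer)
{
    assert(buffer < N_BUFFERS);
    unsigned bit = 1u << (N_BUFFERS - 1 - buffer);
    double sum = 0.0;
    unsigned pairs = 0;
    for (unsigned t = 0; t < N_TESTS; t++) {
        if (t & bit)
            continue;   /* start only from placements with the buffer in DRAM */
        sum += rate[t] / rate[t | bit];
        pairs++;
    }
    return sum / pairs;
}
```

On these numbers the factor is roughly 6 for buffer A and between 1.2 and 1.3 for buffers B and C, which suggests that A is the buffer to keep in DRAM.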
Profiling: VTune
• Graph 500
• Memory Access Analysis tools:
• DRAM Bound – latency issues
• Persistent Memory Bound – latency issues
• DRAM Bandwidth Bound – bandwidth
• Persistent Memory Bandwidth Bound – bandwidth
Profiling: VTune
Allocating in Local DDR
Bandwidth: 35.33 GB/s, 0 GB/s
Memory Object   Loads            Stores          LLC Miss Count   Average Latency (cycles)
xmalloc         12,258,444,411   1,205,069,645   580,459,141      61
xoff            39,030,512       0               0                8
xadj            10,311,019       0               0                9
Allocating in Local NVDIMM
Bandwidth: 8.573 GB/s, 0 GB/s
Memory Object   Loads            Stores          LLC Miss Count   Average Latency (cycles)
xmalloc         31,972,081,700   1,220,455,472   562,889.063      131
xoff            92,622,498       0               0                7
xadj            20,176,351       0               0                9
xmalloc Memory Object in utils.c
[...]
#include <omp.h>
#endif
#include "utils.h"

void* xmalloc(size_t n) {
  void* p = malloc(n);
  if (!p) {
    fprintf(stderr, "Out of memory trying to allocate %zu byte(s)\n", n);
    abort();
  }
  return p;
}

void* xcalloc(size_t n, size_t k) {
  void* p = calloc(n, k);
  if (!p) {
    fprintf(stderr, "Out of memory trying to allocate %zu byte(s)\n", n);
[...]
Profiling: PCM-power
[Figure: memory power (Watts) versus #threads for FT.A, with series Local DRAM, Local NVM and Remote DRAM]
Profiling
⚫ Perform an analysis of the execution
⚫ Kind of memory used
⚫ Most relevant buffers
⚫ Analyse the related source code lines
⚫ Identify memory-related issues:
⚫ Bottlenecks, hot spots, etc.
⚫ Fewer runs, but the analysis can be more difficult
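Finding the most relevant buffers in a profile can be as simple as sorting the memory objects by their last-level-cache miss count. The sketch below assumes that ordering as the relevance criterion (with load count as a tie-breaker); both the criterion and the type and function names are assumptions for illustration, not something a profiler prescribes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One row of a memory-object table such as the VTune ones.
 * Illustrative type, not a VTune data structure. */
struct mem_object {
    const char *name;
    uint64_t loads;
    uint64_t stores;
    uint64_t llc_misses;
    unsigned avg_latency_cycles;
};

/* Orders objects by decreasing LLC miss count, then decreasing loads. */
static int by_relevance(const void *pa, const void *pb)
{
    const struct mem_object *a = pa;
    const struct mem_object *b = pb;
    if (a->llc_misses != b->llc_misses)
        return a->llc_misses < b->llc_misses ? 1 : -1;
    if (a->loads != b->loads)
        return a->loads < b->loads ? 1 : -1;
    return 0;
}

/* Sorts objs[] in place so that the most relevant object comes first. */
void rank_objects(struct mem_object *objs, size_t n)
{
    qsort(objs, n, sizeof *objs, by_relevance);
}
```

Applied to the Local DDR table, this puts xmalloc first, followed by xoff and xadj, so the source line to inspect is the malloc call inside xmalloc.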
Power Consumption
⚫ We need to understand and characterise power consumption within a Heterogeneous Memory System (HMS) in order to rank memory targets. Such a ranking lets applications run in power-constrained scenarios or in situations that require a balance between performance and power consumption. This calls for a strategy.
[Diagram: energy counters and power domains]
⚫ Counters: MSR_DRAM_ENERGY_STATUS, MSR_ENERGY_STATUS
⚫ Domains: Package Domain, Cores Domain, Memory Domain (DRAM, NVDIMM)
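Energy status registers of this kind hold a running 32-bit energy count, so power has to be derived from two readings. The following C sketch shows that conversion under stated assumptions: the raw values are already read (for example through a tool or driver, not shown), the energy unit comes from bits 12:8 of MSR_RAPL_POWER_UNIT, and the function names are made up for this example. Some server processors document a different, fixed unit for the DRAM domain, which is why the unit is passed as a parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Joules per counter increment, from the energy-status-unit field
 * (bits 12:8) of MSR_RAPL_POWER_UNIT: 1 / 2^ESU. */
double rapl_energy_unit(uint64_t power_unit_msr)
{
    unsigned esu = (unsigned)((power_unit_msr >> 8) & 0x1f);
    return 1.0 / (double)(1ull << esu);
}

/* Energy in joules between two raw readings of a 32-bit energy
 * status counter. Unsigned subtraction absorbs one wrap-around. */
double rapl_delta_joules(uint32_t before, uint32_t after, double unit)
{
    return (double)(uint32_t)(after - before) * unit;
}

/* Average power in watts over the sampling interval. */
double rapl_average_watts(uint32_t before, uint32_t after,
                          double unit, double seconds)
{
    assert(seconds > 0.0);
    return rapl_delta_joules(before, after, unit) / seconds;
}
```

The sampling interval has to be short enough that the counter wraps at most once between two readings, otherwise the computed energy is too low.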
Power Consumption
Power Consumption
Summary
⚫ Manage the complexity of HMS through an hwloc extension.
⚫ Presented a strategy that allows HPC applications to detect affinities for certain kinds of memory and allocate their buffers in the right place.
⚫ The presented strategy framework targets high productivity (for non-experienced developers) and better utilisation of the memory system.
⚫ Manage performance counters in a way that allows us to differentiate the power consumption of different types of memory.
Future Work
⚫ Validate and extend our work on emerging platforms.
⚫ Static Code Analysis for taking allocation decisions.
⚫ Extend our allocation policies to handle more application requirements.
Thanks
andres.rubioproano@riken.jp
RIKEN is Hiring
Recruitment: Our team is hiring student interns/Postdoc/Researchers
- Student interns: - Postdoc/Researchers:

Performence Metrics to Manage Memory SC.

  • 1. Performance Metrics to Manage Memory on Supercomputers Andrès RUBIO PROAÑO Post-doctoral Researcher, High Performance Big Data Research Team, RIKEN R-CCS, 21/12/2023
  • 2. My Background ⚫ 2015 Electronics Engineering Bachelor Degree at Escuela Politécnica Nacional, Ecuador. ⚫ CEDIA CEPRA Projects ⚫ 2018 Computer Engineering Master’s Degree at Universitat Autònoma de Barcelona, Spain. ⚫ 2021 PhD in Computer Science at Université de Bordeaux, France ⚫ PhD Contractual with Inria during 3 years with Intel Funding. ⚫ 2021-currently Postdoctoral Researcher at Riken R-CCS, Japan.
  • 3. Outline 1. Background ⚫ The big question ⚫ Contribution 2. Complex Memory Spaces ⚫ Memory characterisation and memory attributes ⚫ API ⚫ Attribute Values 3. Preparing HPC applications to complex HMS ⚫ Benchmarking as an allocation criteria ⚫ Profiling as an allocation criteria ⚫ Static code Analysis as an allocation criteria 4. Summary
  • 4. INTRODUCTION: From Real World to HPC Applications Real World Problem • Meshes, Sparse, Matrices, etc. • Need to be optimally partit ioned Surrogate Model • Molecular Dynamics • Oil and gas exploration • Forecast climate changes • Discover new drogs for deseas es Computer Simulation Model • AI, DL, ML workloads
  • 5. From Homogeneity to Heterogeneity: Processing
  • 6. From Homogeneity to Heterogeneity: Memory/Storage System Register CPU ~0.1ns Level 1 Cache Level 2 Cache Level 3 Cache CACHE ~1-50ns VOLATILE MEMORY DRAM ~80-100ns DISSAGREGATED MEMORY CXL.mem (HBM, DRAM, NVDIMM) ~170-250ns* MECHANICAL MEDIA HDD ~10ms SEQUENTIAL MEDIA TAPE ~100ms SOLID STATE MEDIA NVMe SSD SATA SSD ~10-100µs HBM IN PACKAGE MEMORY *Improvement in Bandwidth Memory Bus Capacity/Latency Increasing PERSISTENT MEMORY NVDIMM ~350ns-1µs I/O Bus Processor Package Memory Bus I/O Bus Register CPU ~0.1ns Level 1 Cache Level 2 Cache Level 3 Cache CACHE ~1-50ns STORAGE HDD ~80-100ns DRAM MEMORY ~10ms
  • 7. Heterogeneous Memory Systems NVM DRAM CPU 2MK HMS Xeon Cascade Lake Xeon Icelake Sapphire Rapids DRAM HBM 2MK HMS CPU KNL Sapphire Rapids NVM DRAM HBM 3MK HMS CPU next generations? CXL.mem?
  • 8. Heterogeneous Memory Systems DRAM HBM 2MK HMS K-AB21 Rhea chips CPU
  • 9. Heterogeneous Memory Systems NVM DRAM CPU 2MK HMS POWER10 DRAM HBM 2MK HMS CPU POWER10 NVM DRAM HBM 3MK HMS CPU POWER10
  • 10. Where to allocate? Application Buffer Buffer Buffer N x total Allocation Requests Memory System NVM DRAM HBM Allocating on a Homogenous Memory System? NUMA 0 NUMA 1 Allocating on a NUMA Memory System Allocating on a Heterogeneous Memory System
  • 11. Contribution 1. How to expose heterogeneous memory systems to applications/runtime? 2. Where to allocate memory buffers? 3. Developping for Heterogeneous Memory Systems without having access to the hardware? 4. How to manage HMS in batch schedulers. 5. Power consumption metric on HMS Goglin, B., & Rubio Proaño, A. (2019, August). Opportunities for Partitioning Non-volatile Memory DIMMs Between Co-scheduled Jobs on HPC Nodes. I n European Conference on Parallel Processing (pp. 82-94). León, E. A., Goglin, B., & Rubio Proaño, A.. (2019, September). M&MMs: n avigating complex memory spaces with hwloc. In Proceedings of the Interna tional Symposium on Memory Systems (pp. 149-155). Rubio Proaño, A. (2020, June). Exposer les caractéristiques des architecture s à mémoires hétérogènes aux applications parallèles. In COMPAS 2020-C onférence francophone d'informatique en Parallélisme, Architecture et Syst ème. Goglin, B., & Rubio Proaño, A. (2022, May). Using Performance Attributes for Managing Heterogeneous Memory in HPC Applications. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (pp. 890-899). IEEE. Foyer, C., Goglin, B., & Rubio Proaño, A. (2023). A survey of software techniques to emulate heterogeneous memory systems in high-performance computing. Parallel Computing, 103023. Rubio Proaño, A & Sato ,K. (2023, December). Understanding Power Consumption M etric on Heterogeneous Memory Systems In The 29th IEEE International Conference o n Parallel and Distributed Systems (ICPADS 2023) (in proceedings).
  • 12. Outline 1. Background ⚫ The big question ⚫ Contribution 2. Complex Memory Spaces ⚫ Memory characterisation and memory attributes ⚫ API ⚫ Attribute Values 3. Preparing HPC applications to complex HMS ⚫ Benchmarking as an allocation criteria ⚫ Profiling as an allocation criteria ⚫ Static code Analysis as an allocation criteria 4. Summary
  • 13. 1 2 3 Memory Attributes High Bandwidth High Capacity Low Power Consumption Low Latency HBM DRAM NVM Attributes Discovery Capacity, Locality Native supported Bandwidth, Latency On most platforms R/W Bandwidth, R/W Latency On some platforms Reliability, Persistence, Endurance, Power Consumption Under investigation
  • 14. Hwloc: Apple Mac Mini with M1 hybrid processor 4 E-cores on top (energy efficient), 4 P-cores below (performance, with bigger caches). The machine has 16GB of memory but most of it is given to the GPU (as shown in the OpenCL device)
  • 15. Hwloc: 2x Xeon CascadeLake 6230 with NVDIMMs as separate NUMA nodes
  • 16. Hwloc: 2x Xeon SapphireRapids Max 9460 Processors are configured in SubNUMA-Cluster mode, hence showing 4 DRAM NUMA nodes and 4 HBMs in each package.
  • 17. API functions for manage memory attributes ⚫ Get the array of memory targets that are local to a given initiator: hwloc_get_local_numanode_objs(topology, initiator, &nr, &targets) ⚫ Get the best memory target (and its value) for the given initiator and attribute: hwloc_memattr_get_best_target(topology, attribute, initiator, &best target, &target value) ⚫ Get the value of an attribute for the given memory target and initiator: hwloc_memattr_get_value(topology, attribute, target, initiator, &value) ⚫ Add a custom memory attribute: (e.g. STREAM-triad kernel) hwloc_memattr_register(topology, attribute, name, &value)
  • 18. E.g. Allocate on the best target for an existing attribute /* Initialise Topology */ hwloc_topology_init(&topology); hwloc_topology_load(topology); [...] /* Allocating function */ void * alloc_on_best_target(topology, initiator, attribute, size); { hwloc_memattr_get_best_target(topology, attribute, initiator, &best_target, NULL); return hwloc_alloc_membind(topology, size, best_target->nodeset, BIND); } [...] /* Allocating 1MB on best bandwidth memory near a given core */ void * buffer = alloc_on_best_target(topology, core->cpuset, HWLOC_MEMATTR_ID_BANDWIDTH, 1024*1024);
  • 19. hwloc_memattr_get_value(…) [Figure labels: Application; Heterogeneous Allocator; hwloc / API Extension; ACPI HMAT; Benchmarking. Arrow labels: Allocation Requests; Memory Targets and Attributes; Performance Information]
  • 20. ACPI HMAT
        $ lstopo --memattrs
        Memory attribute #0 name 'Capacity'
          NUMANode L#0 = 99786076160
          NUMANode L#1 = 101468516352
          NUMANode L#2 = 796716433408
          NUMANode L#3 = 99883061248
          NUMANode L#4 = 101428244480
          NUMANode L#5 = 798863917056
        Memory attribute #2 name 'Bandwidth'
          NUMANode L#0 = 131072 from Group0 L#0
          NUMANode L#1 = 131072 from Group0 L#1
          NUMANode L#2 = 78644 from Package L#0
          NUMANode L#3 = 131072 from Group0 L#2
          NUMANode L#4 = 131072 from Group0 L#3
          NUMANode L#5 = 78644 from Package L#1
        Memory attribute #3 name 'Latency'
          NUMANode L#0 = 26 from Group0 L#0
          NUMANode L#1 = 26 from Group0 L#1
          NUMANode L#2 = 77 from Package L#0
          NUMANode L#3 = 26 from Group0 L#2
          NUMANode L#4 = 26 from Group0 L#3
          NUMANode L#5 = 77 from Package L#1
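The kind of selection a "best target" query makes over values like these can be illustrated without hwloc. The following is a minimal plain-C sketch, not hwloc code: `struct mem_target`, `enum mem_attr`, `best_target`, and `example_targets` are hypothetical names introduced for illustration. It assumes that a higher value is better for Capacity and Bandwidth and a lower value is better for Latency, and it uses the two targets that the output above reports for initiators Group0 L#0 and Package L#0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one memory target; this is not an hwloc type. */
struct mem_target {
    unsigned numa_index;   /* logical NUMA node index (L#) */
    uint64_t capacity;     /* 'Capacity' value as printed by lstopo */
    uint64_t bandwidth;    /* 'Bandwidth' value as printed by lstopo */
    uint64_t latency;      /* 'Latency' value as printed by lstopo */
};

enum mem_attr { ATTR_CAPACITY, ATTR_BANDWIDTH, ATTR_LATENCY };

/* Read one attribute of a target. */
static uint64_t attr_value(const struct mem_target *t, enum mem_attr attr)
{
    switch (attr) {
    case ATTR_CAPACITY:  return t->capacity;
    case ATTR_BANDWIDTH: return t->bandwidth;
    default:             return t->latency;
    }
}

/* Return the best target for an attribute: the lowest value for latency,
 * the highest value for capacity and bandwidth. Ties keep the first entry. */
const struct mem_target *best_target(const struct mem_target *targets,
                                     size_t count, enum mem_attr attr)
{
    assert(targets != NULL && count > 0);
    const struct mem_target *best = &targets[0];
    for (size_t i = 1; i < count; i++) {
        uint64_t v = attr_value(&targets[i], attr);
        uint64_t b = attr_value(best, attr);
        int better = (attr == ATTR_LATENCY) ? (v < b) : (v > b);
        if (better)
            best = &targets[i];
    }
    return best;
}

/* Values copied from the lstopo output for NUMANode L#0 and L#2. */
const struct mem_target example_targets[] = {
    { 0,  99786076160ULL, 131072, 26 },
    { 2, 796716433408ULL,  78644, 77 },
};
```

With these two entries, the sketch picks NUMANode L#0 for Bandwidth and Latency and NUMANode L#2 for Capacity, which shows why the attribute passed to the allocator changes where a buffer lands.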
  • 22. Summary ⚫ Expected: ⚫ Obtain memory attribute information from HMAT (simpler) ⚫ Vendors start using HMAT and populate it with reliable information ⚫ No need to spend time benchmarking ⚫ Currently: ⚫ Benchmarking is the safest way to get attribute values ⚫ HMAT has only appeared in the ACPI tables of generic platforms since 2021
  • 23. Contribution 1. How to expose heterogeneous memory systems to applications/runtimes? 2. Which criterion decides where to allocate memory buffers? 3. How to develop for Heterogeneous Memory Systems without having access to the hardware? 4. How to manage HMS in batch schedulers? 5. Power consumption metric on HMS. Goglin, B., & Rubio Proaño, A. (2019, August). Opportunities for Partitioning Non-volatile Memory DIMMs Between Co-scheduled Jobs on HPC Nodes. In European Conference on Parallel Processing (pp. 82-94). León, E. A., Goglin, B., & Rubio Proaño, A. (2019, September). M&MMs: navigating complex memory spaces with hwloc. In Proceedings of the International Symposium on Memory Systems (pp. 149-155). Rubio Proaño, A. (2020, June). Exposer les caractéristiques des architectures à mémoires hétérogènes aux applications parallèles. In COMPAS 2020-Conférence francophone d'informatique en Parallélisme, Architecture et Système. Goglin, B., & Rubio Proaño, A. (2022, May). Using Performance Attributes for Managing Heterogeneous Memory in HPC Applications. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (pp. 890-899). IEEE. Foyer, C., Goglin, B., & Rubio Proaño, A. (2023). A survey of software techniques to emulate heterogeneous memory systems in high-performance computing. Parallel Computing, 103023.
  • 25. Strategy Framework → High Productivity [Figure labels: Application; Heterogeneous Allocator; hwloc / API Extension; Hardware. Arrow labels: Allocation Requests; Memory Targets and Attributes (Measured Performance, Hardware Performance Information); Memory Identifiers. Allocation Criteria: determine sensitivity to memory metrics through Profiling, Benchmarking, Static code analysis]
  • 26. Benchmarking (Bandwidth): Stream-Triad [Figure: 3MK HMS (NVM, DRAM, HBM); an application with N buffers gives 3^N binding tests in total]
        Buffer target node (0 → DRAM, 1 → NVM) and best Triad rate in GB/s:
        A  B  C  Triad
        0  0  0  74.97
        0  0  1  51.88
        0  1  0  55.59
        0  1  1  38.32
        1  0  0   9.92
        1  0  1   9.05
        1  1  0   9.16
        1  1  1   8.50
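The exhaustive binding test on this slide can be sketched as an enumeration of every buffer-to-target assignment around the Triad kernel. The code below is an illustrative sketch under stated assumptions, not the benchmark used in the talk: `binding_test_count`, `decode_binding`, and `triad` are hypothetical helper names, and the actual binding of each buffer to a NUMA node (for instance through hwloc) and the timing are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Number of binding tests for n_buffers buffers and n_kinds memory kinds:
 * n_kinds raised to the power n_buffers. */
unsigned long binding_test_count(unsigned n_kinds, unsigned n_buffers)
{
    unsigned long count = 1;
    for (unsigned i = 0; i < n_buffers; i++)
        count *= n_kinds;
    return count;
}

/* Decode test number `index` into one target per buffer.
 * The first buffer is the most significant digit, so with two kinds the
 * tests come out in the order 000, 001, 010, ... used in the table. */
void decode_binding(unsigned long index, unsigned n_kinds,
                    unsigned n_buffers, unsigned *targets)
{
    assert(n_kinds > 0 && targets != NULL);
    for (unsigned i = n_buffers; i-- > 0; ) {
        targets[i] = (unsigned)(index % n_kinds);
        index /= n_kinds;
    }
}

/* STREAM Triad kernel: a[i] = b[i] + scalar * c[i]. */
void triad(double *a, const double *b, const double *c,
           double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}
```

A driver would loop `index` from 0 to `binding_test_count(kinds, buffers) - 1`, decode the assignment, bind each buffer accordingly, and time `triad`. The count grows exponentially with the number of buffers, which is why this approach becomes hard to apply when every buffer is treated separately.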
  • 27. Benchmarking ⚫ Gives a general idea of the application's sensitivity ⚫ Hard to evaluate when all buffers are taken into account separately
  • 28. Profiling: VTune • Graph 500 • Memory Access Analysis tools: • DRAM Bound: latency issues • Persistent Memory Bound: latency issues • DRAM Bandwidth Bound: bandwidth • Persistent Memory Bandwidth Bound: bandwidth
  • 29. Profiling: VTune
        Allocating in Local DDR (Bandwidth: 35.33 GB/s, 0 GB/s)
        Memory Object  Loads           Stores         LLC Miss Count  Average Latency (cycles)
        xmalloc        12,258,444,411  1,205,069,645  580,459,141     61
        xoff           39,030,512      0              0               8
        xadj           10,311,019      0              0               9
        Allocating in Local NVDIMM (Bandwidth: 8.573 GB/s, 0 GB/s)
        Memory Object  Loads           Stores         LLC Miss Count  Average Latency (cycles)
        xmalloc        31,972,081,700  1,220,455,472  562,889,063     131
        xoff           92,622,498      0              0               7
        xadj           20,176,351      0              0               9
        xmalloc Memory Object in utils.c (excerpt):
        #include <omp.h>
        #endif
        #include "utils.h"

        void* xmalloc(size_t n) {
          void* p = malloc(n);
          if (!p) {
            fprintf(stderr, "Out of memory trying to allocate %zu byte(s)\n", n);
            abort();
          }
          return p;
        }

        void* xcalloc(size_t n, size_t k) {
          void* p = calloc(n, k);
          if (!p) {
            fprintf(stderr, "Out of memory trying to allocate %zu byte(s)\n", n);
        [...]
  • 30. Profiling: PCM-power [Plot: Memory Power (Watts) versus #threads for FT.A; series: Local DRAM, Local NVM, Remote DRAM]
  • 31. Profiling ⚫ Perform an analysis of the execution ⚫ Kind of memory used ⚫ Most relevant buffers ⚫ Analyse the related source code line. ⚫ Identify memory related issues: ⚫ Bottlenecks, hot spots, etc. ⚫ Fewer runs, but the analysis could be more difficult
  • 33. Power Consumption ⚫ We need to understand and/or characterise the power consumption within a Heterogeneous Memory System (HMS) in order to rank memory targets, so that applications can be used in power-constrained scenarios or in situations that require a balance between performance and power consumption. For that, we need to follow a strategy. [Figure labels: MSR_DRAM_ENERGY_STATUS; Package Domain; Cores Domain; Memory Domain; DRAM; NVDIMM; MSR_ENERGY_STATUS]
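Turning two samples of an energy-status register into an average power figure can be sketched as follows. This is a hedged illustration, not the method used in the talk: it assumes the registers named on this slide behave like Intel RAPL counters, that is, a wrapping 32-bit energy count expressed in units of 1/2^ESU joules, with the ESU exponent read separately from the platform. The function names are hypothetical, and reading the MSRs themselves, as well as separating NVDIMM power from DRAM power, is not covered.

```c
#include <assert.h>
#include <stdint.h>

/* Energy counted between two readings of a 32-bit energy-status counter,
 * in raw counter units. Unsigned subtraction handles a single wraparound. */
uint32_t energy_delta_units(uint32_t before, uint32_t after)
{
    return after - before;
}

/* Convert raw units to joules, assuming one unit is 1 / 2^esu joules,
 * where esu is the energy-status-unit exponent read from the platform. */
double energy_joules(uint32_t units, unsigned esu)
{
    assert(esu < 32);
    return (double)units / (double)(1UL << esu);
}

/* Average power in watts over the sampling interval. */
double average_power_watts(uint32_t before, uint32_t after,
                           unsigned esu, double seconds)
{
    assert(seconds > 0.0);
    return energy_joules(energy_delta_units(before, after), esu) / seconds;
}
```

The sampling interval has to be short enough that the counter wraps at most once between two readings; otherwise the computed energy is too low.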
  • 36. Summary ⚫ Manage the complexity of HMS through an hwloc extension. ⚫ Presented a strategy that allows HPC applications to detect affinities for certain kinds of memory and allocate their buffers in the right place. ⚫ The presented strategy framework targets high productivity (for non-experienced developers) and better utilisation of the memory system. ⚫ Manage performance counters in a manner that allows us to differentiate the power consumption of different types of memory.
  • 37. Future Work ⚫ Validate and extend our work on emerging platforms. ⚫ Static code analysis for making allocation decisions. ⚫ Extend our allocation policies to handle more application requirements.
  • 39. RIKEN is Hiring Recruitment: Our team is hiring student interns/Postdoc/Researchers - Student interns: - Postdoc/Researchers: