Paper presented at the 2nd International Workshop on Deployment and Use of Accelerators (DUAC). Co-located with the 51st International Conference on Parallel Processing (ICPP). August 29, 2022 (virtual event). More information at: https://duac2022.wordpress.com/
Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps
1. REGALE
Optimizing Hardware Resource Partitioning and Job
Allocations on Modern GPUs under Power Caps
Eishi Arima1, Minjoon Kang1, Issa Saba1,
Josef Weidendorfer2, Carsten Trinitis1, Martin Schulz1,2
1 Technische Universität München (TUM)
2 Leibniz-Rechenzentrum (LRZ)
DUAC@ICPP'22 Aug 29, 2022
2. Executive Summary
Our focus: Co-optimizing resource partitioning, job allocations, and power
capping on a real GPU chip
• NVIDIA MIG (Multi-Instance GPU) feature: partition a chip at the granularity of GPC
Key observations:
1. Scalability within a chip depends highly on compute/memory intensity, types of
operations (e.g., FP64, FP16, or INT8), and power cap setup
2. Memory partitioning option also matters (private or shared LLC/HBM)
Optimization: HW setup & job mapping optimization for a given app pair as
a function of app characteristics using a predictive performance model
• Problem definition, systematic solution, and statistical modeling (linear regression)
Evaluation results: Near optimal for different objective functions, i.e.,
maximizing throughput or energy-efficiency
3. Outline
• Background
• Observations
• Proposed optimizations:
• Overall workflow, problem formulation, and performance modeling
• Evaluation setup and experimental results
• Conclusion and future directions
4. Outline
• Background
• Observations
• Proposed optimizations:
• Overall workflow, problem formulation, and performance modeling
• Evaluation setup and experimental results
• Conclusion and future directions
5. Trends in Top-class Supercomputers
• GPUs are now commonly
used in top-class
supercomputers
• About 160 in the Top500
list (as of Jun 2022)
• Recent supercomputers
consume an enormous
amount of power
• Can be over 20MW
[Figure: Top500 list (as of Jun 2022), highlighting GPU-equipped systems and systems with over 20 MW power consumption]
We target power and resource management for
GPU-based HPC systems
6. Hierarchical Power/Resource Management
• In the literature on power-aware HPC, power has been managed in a
hierarchical manner
• The central governor distributes power budgets to nodes/jobs; the node manager
sets power caps on node components using the given power budget
• Co-scheduling, i.e., co-locating multiple apps at the same node, is
promising to mitigate the waste of resources
• E.g., mixing compute- and memory-intensive apps
[Figure: a central power governor distributes node-level power caps to compute nodes; within each node, component-level power caps are set for CPU/GPU/DRAM, and co-scheduled jobs (Job1/Job2) share a GPU (our target)]
We focus on power capping, resource
partitioning, and job allocations on a GPU
7. Resource Partitioning and Power Capping in
Modern GPUs: NVIDIA A100 as an Example
[Figure: A100 chip architecture with 8 GPCs; each GPC contains multiple SMs]
• Power capping is applicable at the granularity of
chip, e.g., nvidia-smi -pl 200
• MIG (Multi-Instance GPU) enables us to partition
a GPU at the granularity of GPC to co-locate apps
• A GPC (Graphics Processing Cluster) consists of 10s
of SMs (Streaming Multiprocessors), each of which
includes Tensor Cores, FPUs, ALUs, LD/ST units, etc.
[Figure: chip-level power cap and GPC-granularity partitioning on the A100; the A100 SM architecture includes Tensor Cores, FP64 units, FP32 units, and INT units]
We consider these different types of computational
units as well as memory bandwidth utilizations to
optimize the partitioning, power cap, and job allocs
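The slide's `nvidia-smi -pl 200` example can be extended to the full setup sequence. Below is a sketch, assuming an A100 40GB with administrator rights; the `4g.20gb`/`3g.20gb` profile names are the A100 MIG profiles matching the 4-GPC/3-GPC split used later in this deck. This is a hardware-dependent configuration fragment, not output we can verify here.

```shell
# Set a chip-level power cap of 200 W on GPU 0 (requires root).
nvidia-smi -i 0 -pl 200

# Enable MIG mode on GPU 0 (may require a GPU reset to take effect).
nvidia-smi -i 0 -mig 1

# Partition the 7 usable GPCs into a 4-GPC and a 3-GPC GPU instance,
# and create a default compute instance (-C) inside each.
nvidia-smi mig -i 0 -cgi 4g.20gb,3g.20gb -C

# List the resulting GPU instances.
nvidia-smi mig -i 0 -lgi
```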
8. Outline
• Background
• Observations
• Proposed optimizations:
• Overall workflow, problem formulation, and performance modeling
• Evaluation setup and experimental results
• Conclusion and future directions
9. Our Platform & Preliminary Evaluation
Objective: quantify which aspects we need to take
into account in our optimization
Experiments:
1. Solo-run scalability analysis, i.e., performance
as a function of # of GPCs
2. Co-run throughput analysis for different app
maps & memory partitioning options
• Mem partitioning: shared or private L2$/HBM
• 1 out of 8 GPCs must be disabled to use the MIG
feature; the 7 GPCs are partitioned into 4 GPCs/3 GPCs
[Figure: our evaluation platform and the two memory partitioning options. Shared L2$/HBM: one GPU instance (GI1) with two compute instances (CI1/CI2) running App1/App2 on 4/3 GPCs, sharing all L2$/HBM; one GPC disabled. Private L2$/HBM: two GPU instances (GI1/GI2), each with its own L2$/HBM slices; one GPC disabled]
10. Solo-run Scalability Analysis w/o Power Cap
• kmeans: low compute/BW resource utilization (Rodinia)
• stream: memory bound
• dgemm: compute bound (w/o Tensor Core)
• hgemm: compute bound (w/ Tensor Core)
[Figure: solo-run performance while scaling the # of GPCs (GI1/CI1) with the private and the shared L2$/HBM options. kmeans: neither the GPC allocation nor the memory option matters. stream: the memory option matters, and full compute resources are not needed. dgemm/hgemm: scale very well, and the memory option does not matter]
11. Solo-run Scalability Analysis w/ Power Cap
• Scale down the power cap from 250W (TDP) to 150W
• Scale # of GPCs for the shared memory partitioning option
• Power cap affects the scalability the most for Tensor-Core-intensive workloads
• Tensor Core is a power hungry module
We need to take the heterogeneity in compute resources into
account – they can have different performance-power features!
12. Co-run Throughput Analysis (Power Cap: 250[W])
Co-run multiple applications while
changing the allocation of GPCs as well as the
memory partitioning options
[Figure: the shared and the private L2$/HBM partitioning options, with App1 on CI1 (4 GPCs) and App2 on CI2 (3 GPCs); one GPC disabled]
• Allocate more GPCs to the compute-intensive
app (Tensor Core intensive in this case)!
• Use the shared option so that the memory-intensive
app can fully access the bandwidth resources!
Using the private option can be the best: (1) it is interference-free;
(2) non-memory-bound apps may not need the full BW
13. Summary of the Preliminary Experiments
• GPC allocation decision – scalability matters
• Compute-bound apps scale well and need more GPCs
• The power capping can affect the scalability esp. when using Tensor Cores intensively
• Memory-bound apps do not need full GPCs esp. when the shared option is used
• Memory partitioning decision
• If a memory-bound app is scheduled, the shared option is preferred so that it can fully
access the bandwidth resources
• If no memory-bound app is scheduled, the private option can be better as it can
mitigate the shared resource contention
• Requirements for the optimization:
1. Consider app characteristics incl. memory intensity, compute intensity (per different
compute module, e.g., FP64, Tensor Core, etc.), etc.
2. Performance estimation should cover both scalability and interference aspects
14. Outline
• Background
• Observations
• Proposed optimizations:
• Overall workflow, problem formulation, and performance modeling
• Evaluation setup and experimental results
• Conclusion and future directions
15. Workflow Overview
• Consists of offline and
online parts
• Offline: determine the
model coefficients by
using a benchmark set
• Online: optimize the HW
setup for a given app pair
• Profile-based application
characterization
• Model-based performance
estimation
16. Problem Formulation
• Two policies: maximize throughput (left) or energy efficiency (right)
1. For a given power cap (P) and a given set of applications to co-schedule, optimize
partitioning state (S) under a fairness constraint (controlled by α)
2. For a given set of applications to co-schedule, optimize partitioning state (S) and
power cap (P) under a fairness constraint (controlled by α)
• Co-schedule two apps in this paper
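The two policies above amount to an exhaustive search over the small discrete space of partitioning states (and, for the second policy, power caps). A minimal sketch, with a hypothetical `rperf(app, S, P)` standing in for the predictive performance model described later; all names and the toy numbers are illustrative, not the paper's implementation:

```python
from itertools import product

def optimize_throughput(pair, states, P, rperf, alpha):
    """Policy 1: for a given power cap P, choose the partitioning state S
    maximizing WeightedSpeedup under the fairness constraint (> alpha)."""
    feasible = [S for S in states
                if all(rperf(a, S, P) > alpha for a in pair)]
    return max(feasible, key=lambda S: sum(rperf(a, S, P) for a in pair))

def optimize_energy(pair, states, caps, rperf, alpha):
    """Policy 2: choose both S and P, maximizing throughput per watt."""
    feasible = [(S, P) for S, P in product(states, caps)
                if all(rperf(a, S, P) > alpha for a in pair)]
    return max(feasible,
               key=lambda sp: sum(rperf(a, *sp) for a in pair) / sp[1])

# Toy stand-in for the performance model: app "A" scales with its own
# GPC count, app "B" with the remaining GPCs; power helps sublinearly.
def rperf(app, S, P):
    gpcs = S[0] if app == "A" else S[1]
    return {"A": 0.12, "B": 0.20}[app] * gpcs * (P / 250) ** 0.5

states = [(4, 3), (3, 4)]  # GPC splits of the 7 usable GPCs
best_S = optimize_throughput(("A", "B"), states, 250, rperf, alpha=0.3)
best_S_P = optimize_energy(("A", "B"), states, [150, 250], rperf, alpha=0.3)
```

With two apps and a handful of (S, P) combinations, brute force is cheap; the real cost sits in the offline model training, not in this search.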
17. Metrics
• Throughput:
• WeightedSpeedup = sum of relative performance (RPerfAppi) across co-scheduled
applications
• RPerfAppi(S,P): relative performance of the ith app normalized to its exclusive solo-run
performance without resource partitioning or power capping
• This is a function of the resource partitioning state (S) and the power capping (P)
• Fairness (> α):
• Limit slowdown within a certain level for all the co-located applications
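The two metrics are straightforward to compute from measured times. A small sketch with hypothetical numbers (the function names and timings are illustrative): with App1 at 1.0 s solo vs. 1.6 s co-located and App2 at 2.0 s solo vs. 2.5 s co-located, WeightedSpeedup is 0.625 + 0.8 = 1.425.

```python
def relative_perf(solo_time, corun_time):
    # RPerfAppi: co-run performance normalized to the app's exclusive
    # solo run (no partitioning, no power cap); times per iteration.
    return solo_time / corun_time

def weighted_speedup(rperfs):
    # Throughput metric: sum of relative performances over the pair.
    return sum(rperfs)

def satisfies_fairness(rperfs, alpha):
    # Fairness: every co-located app must keep RPerf above alpha.
    return min(rperfs) > alpha

# Hypothetical measurements for the two co-scheduled apps.
rperfs = [relative_perf(1.0, 1.6), relative_perf(2.0, 2.5)]
```

Note that a WeightedSpeedup above 1.0 alone does not guarantee fairness: one app can dominate while the other starves, which is exactly what the alpha constraint rules out.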
18. Performance Modeling
• A simple model using linear regression:
• Coefficients (C, D) are functions of HW setups & job alloc (S, P)
• Variables are based on the collected performance counters (FAppi) and
transformed by using functions H & J
[Equation: RPerf is modeled as a sum of a scalability term (transformed counters of the app itself) and an interference term (transformed counters of the co-scheduled app)]
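The slide shows the model only as an annotated equation, so the following is a sketch under stated assumptions: one linear model per (S, P) setup, a feature vector holding transformed counters of the app itself (scalability) and of its co-runner (interference), and ordinary least squares for the fit. The transformation functions H and J are stood in for by an identity-plus-bias mapping; the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(self_counters, corun_counters):
    # Stand-in for the H and J transformations from the slides:
    # raw counters of the app and its co-runner, plus a bias term.
    return np.concatenate([self_counters, corun_counters, [1.0]])

# Synthetic training set: 50 co-run samples, 3 counters per app.
X = np.stack([features(rng.random(3), rng.random(3)) for _ in range(50)])
true_w = np.array([0.5, 0.3, 0.1, -0.2, -0.1, -0.05, 0.4])
y = X @ true_w  # noiseless targets, so the fit should recover true_w

# Least-squares fit of the coefficient vector for this (S, P) setup.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(self_counters, corun_counters):
    # Estimated RPerf for a new app pair under the same (S, P).
    return features(self_counters, corun_counters) @ w
```

In the deck's workflow this fit happens offline, once per hardware setup and job allocation; online, only the cheap `predict` call is needed per candidate (S, P).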
19. Outline
• Background
• Observations
• Proposed optimizations:
• Overall workflow, problem formulation, and performance modeling
• Evaluation setup and experimental results
• Conclusion and future directions
20. Evaluation Setups
• Workloads:
• Rodinia, NVIDIA Cutlass library for GEMMs, streaming, random access
• Classify them into 4: TI (Tensor core Intensive); CI (Compute Intensive w/o tensor);
MI (Memory Intensive); US (UnScalable)
• Create pairs of classes, and randomly select an app for each class in each pair
• Methodologies:
• Profile applications using the Nsight Compute framework
• Train coefficients for all possible HW setups & job allocations (|S|×|P| combinations)
• Obtain estimated performance and choose the best job allocation for the two problems
21. Model Accuracy (P=250W)
• Good throughput/fairness accuracy
for any partitioning state (S)
• Observed similar trends for different
power caps (P)
22. Throughput-oriented Optimization
• Optimize partitioning state (S) for a given power cap (P) & a given job pair
• Close to the best for almost all the workloads under P=230 [W] (left)
• Close to the best for the geometric mean even when we scale the power cap (right)
[Figures: per-pair results under power cap P = 230 W (left); geometric mean while scaling the power cap (right)]
23. Energy-efficiency-oriented Optimization
[Figures: per-pair results under α = 0.2 (left); geometric mean while scaling α (right)]
• Optimize partitioning state (S) & power cap (P) for a given job pair
• Close to the best for almost all the workloads under α = 0.2 (left)
• Close to the best for the geometric mean even when we scale α (right)
24. Outline
• Background
• Observations
• Proposed optimizations:
• Overall workflow, problem formulation, and performance modeling
• Evaluation setup and experimental results
• Conclusion and future directions
25. Conclusion and Future Directions
• Conclusion:
• We targeted the optimization of resource partitioning, job allocations, and
power capping at the same time on a real GPU chip
• We provided some insights on scalability and resource sharing options useful
for decisions on resource partitioning, job allocations, and power capping
• We proposed a systematic approach to optimize the above knobs
• We quantified the benefits of our approach – close to the optimal
• Future directions:
• Include scheduling decisions, i.e., co-run pair selections from a job queue
• Include more HW components, e.g., CPUs, DRAMs, multi GPUs, etc.
• Integrate with state-of-the-art scheduling software frameworks, e.g., Slurm
26. Acknowledgement
This work has received funding under the European Commission’s
EuroHPC and H2020 programmes under grant agreement no. 956560
and was supported by the NVIDIA Academic Hardware Grant Program.
REGALE