REGALE
Optimizing Hardware Resource Partitioning and Job
Allocations on Modern GPUs under Power Caps
Eishi Arima1, Minjoon Kang1, Issa Saba1,
Josef Weidendorfer2, Carsten Trinitis1, Martin Schulz12
1 Technische Universität München (TUM)
2 Leibniz-Rechenzentrum (LRZ)
DUAC@ICPP'22 Aug 29, 2022
Executive Summary
Our focus: Co-optimizing resource partitioning, job allocations, and power
capping on a real GPU chip
• NVIDIA MIG (Multi-Instance GPU) feature: partition a chip at the granularity of GPC
Key observations:
1. Scalability within a chip depends highly on compute/memory intensity, types of
operations (e.g., FP64, FP16, or INT8), and power cap setup
2. Memory partitioning option also matters (private or shared LLC/HBM)
Optimization: HW setup & job mapping optimization for a given app pair as
a function of app characteristics using a predictive performance model
• Problem definition, systematic solution, and statistical modeling (linear regression)
Evaluation results: Near optimal for different objective functions, i.e.,
maximizing throughput or energy-efficiency
Outline
• Background
• Observations
• Proposed optimizations:
• Overall workflow, problem formulation, and performance modeling
• Evaluation setup and experimental results
• Conclusion and future directions
Trends in Top-class Supercomputers
• GPUs are now commonly
used in top-class
supercomputers
• About 160 in the Top500
list (as of Jun 2022)
• Recent supercomputers
consume an enormous
amount of power
• Can be over 20MW
[Figure: Top500 list (as of Jun 2022), annotated with the GPU-equipped systems and the machines consuming over 20 MW]
We target power and resource management for GPU-based HPC systems
Hierarchical Power/Resource Management
• In the literature on power-aware HPC, power has been managed in
a hierarchical manner
• The central governor distributes power budgets to nodes/jobs; the node manager
sets power caps for node components using the given power budget
• Co-scheduling, i.e., co-locating multiple apps on the same node, is
promising for mitigating the waste of resources
• E.g., mixing compute- and memory-intensive apps
[Figure: a central power governor assigns node-level power caps to the compute nodes; each node manager sets component-level power caps for CPU/GPU/DRAM; our target is the co-located jobs (Job1/Job2) on the GPU]
We focus on power capping, resource
partitioning, and job allocations on a GPU
Resource Partitioning and Power Capping in
Modern GPUs: NVIDIA A100 as an Example
[Figure: A100 chip architecture – 8 GPCs, each built from SMs]
• Power capping is applicable at the granularity of the whole
chip, e.g., nvidia-smi -pl 200
• MIG (Multi-Instance GPU) enables us to partition
a GPU at the granularity of GPCs to co-locate apps
• A GPC (Graphics Processing Cluster) consists of tens
of SMs (Streaming Multiprocessors), each of which
includes Tensor Cores, FPUs, ALUs, LD/ST units, etc.
[Figure: the power cap applies to the whole chip, while MIG partitions it; A100 SM architecture with Tensor Cores, FP64 units, FP32 units, and INT units]
We consider these different types of computational
units, as well as memory bandwidth utilization, to
optimize the partitioning, power cap, and job allocations
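These chip-level knobs map directly to nvidia-smi invocations. A minimal sketch of how a node manager might assemble them, assuming device index 0 and A100-40GB MIG profile names such as 4g.20gb/3g.20gb (profile names and IDs vary across GPUs and driver versions; this is not the paper's tooling):

```python
def setup_commands(power_cap_w, gi_profiles, device=0):
    """Build the nvidia-smi commands for one node-level configuration step:
    apply a chip-wide power cap, enable MIG, and create the GPU instances.

    gi_profiles: MIG GPU-instance profile names, e.g. ["4g.20gb", "3g.20gb"]
    for a 4-GPC/3-GPC split on an A100-40GB (names vary by GPU/driver)."""
    return [
        f"nvidia-smi -i {device} -pl {power_cap_w}",  # chip-level power cap
        f"nvidia-smi -i {device} -mig 1",             # enable MIG mode
        # create GPU instances and a default compute instance in each (-C)
        f"nvidia-smi mig -i {device} -cgi {','.join(gi_profiles)} -C",
    ]

for cmd in setup_commands(200, ["4g.20gb", "3g.20gb"]):
    print(cmd)
```

Enabling MIG mode typically requires root privileges and, on some systems, a GPU reset before the instances can be created.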
Our Platform & Preliminary Evaluation
Objective: quantify which aspects we need to take
into account in our optimization
Experiments:
1. Solo-run scalability analysis, i.e., performance
as a function of # of GPCs
2. Co-run throughput analysis for different app
maps & memory partitioning options
• Mem partitioning: shared or private L2$/HBM
• 1 out of the 8 GPCs must be disabled to use the MIG
feature; the remaining 7 GPCs are partitioned into 4 GPCs/3 GPCs
[Figures: our evaluation platform; shared L2$/HBM option – one GPU instance (GI1) with two compute instances (CI1/CI2 for App1/App2) sharing all L2$/HBM slices; private L2$/HBM option – two GPU instances (GI1/GI2), each with its own L2$/HBM slices; one GPC is disabled in both cases]
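The configuration space implied by this setup can be enumerated directly. A small sketch, assuming the 4/3 GPC split shown on the slide (the choice of which app gets which side), the two memory options, and a few illustrative chip-level power caps:

```python
from itertools import product

# One of the 8 GPCs is disabled for MIG; the remaining 7 are split 4/3
# between the two co-scheduled apps (which app gets which side is a choice),
# and the L2$/HBM slices are either shared or private.
gpc_splits = [(4, 3), (3, 4)]      # (GPCs for App1, GPCs for App2)
mem_options = ["shared", "private"]
power_caps = [150, 200, 250]       # illustrative chip-level caps in watts

states = list(product(gpc_splits, mem_options))
print(len(states), "partitioning states")
print(len(states) * len(power_caps), "(S, P) combinations")
```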
Solo-run Scalability Analysis w/o Power Cap
kmeans (Rodinia): low compute/BW resource utilization
stream: memory bound
dgemm: compute bound (w/o Tensor Cores)
hgemm: compute bound (w/ Tensor Cores)
[Figure: solo-run performance as a function of the # of GPCs, scaled with either private or shared L2$/HBM slices (unused GPCs disabled)]
• kmeans: neither the GPC allocation nor the memory option matters
• stream: the memory option matters, and it does not need the full compute resources
• dgemm/hgemm: scale very well; the memory option does not matter
Solo-run Scalability Analysis w/ Power Cap
• Scale down the power cap from 250W (TDP) to 150W
• Scale the # of GPCs for the shared memory partitioning option
• The power cap affects the scalability the most for Tensor-Core-intensive workloads
• The Tensor Core is a power-hungry module
We need to take the heterogeneity in compute resources into
account – they can have different performance-power characteristics!
Co-run Throughput Analysis (Power Cap: 250[W])
Co-run multiple applications while
changing alloc of GPCs as well as the
memory partitioning options
• Allocate more GPCs to the compute-intensive app (Tensor-Core-intensive in this case)!
• Use the shared option so that the memory-intensive app can fully access the BW resources!
Using the private option can be the best: (1) it is interference-free; (2) non-memory-bound apps may not need the full BW
Summary of the Preliminary Experiments
• GPC allocation decision – scalability matters
• Compute-bound apps scale well and need more GPCs
• Power capping can affect the scalability, esp. when Tensor Cores are used intensively
• Memory-bound apps do not need all the GPCs, esp. when the shared option is used
• Memory partitioning decision
• If a memory-bound app is scheduled, the shared option is preferred so that it can fully access the BW resources
• If no memory-bound app is scheduled, the private option can be better as it mitigates the shared resource contention
• Requirements for the optimization:
1. Consider app characteristics incl. memory intensity and compute intensity (per compute module, e.g., FP64, Tensor Core, etc.)
2. Performance estimation should cover both scalability and interference aspects
Workflow Overview
• Consists of offline and
online parts
• Offline: determine the
model coefficients by
using a benchmark set
• Online: optimize the HW
setup for a given app pair
• Profile-based application
characterization
• Model-based performance
estimation
Problem Formulation
• Two policies: maximize throughput (left) or energy efficiency (right)
1. For a given power cap (P) and a given set of applications to co-schedule, optimize
partitioning state (S) under a fairness constraint (controlled by α)
2. For a given set of applications to co-schedule, optimize partitioning state (S) and
power cap (P) under a fairness constraint (controlled by α)
• Co-schedule two apps in this paper
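Because only two apps are co-scheduled and the (S, P) space is small, both problems can be solved by exhaustive search over predicted performance. A sketch under assumed interfaces (`rperf` stands in for the paper's performance model; the function names are hypothetical):

```python
def optimize(states, power_caps, rperf, alpha, fixed_power=None):
    """Exhaustive search over the partitioning states S (and power caps P)
    for a two-app co-run. rperf(i, S, P) is a stand-in for the predicted
    relative performance of app i under setup (S, P).

    Problem 1 (throughput for a given cap): pass fixed_power.
    Problem 2: leave fixed_power as None to search over power_caps too;
    the energy-efficiency policy would replace the score line with an
    efficiency objective (e.g. dividing by the power cap)."""
    caps = [fixed_power] if fixed_power is not None else power_caps
    best, best_score = None, float("-inf")
    for S in states:
        for P in caps:
            perfs = [rperf(i, S, P) for i in (0, 1)]  # the two co-run apps
            if min(perfs) <= alpha:                   # fairness constraint
                continue
            score = sum(perfs)                        # weighted speedup
            if score > best_score:
                best, best_score = (S, P), score
    return best, best_score
```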
Metrics
• Throughput:
• WeightedSpeedup = sum of the relative performance (RPerfAppi) across co-scheduled
applications
• RPerfAppi(S,P): relative performance of the ith app, normalized to its exclusive solo-run
performance without resource partitioning or power capping
• This is a function of the resource partitioning state (S) and the power cap (P)
• Fairness (> α):
• Limit the slowdown to within a certain level for all the co-located applications
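These two metrics are simple enough to state directly in code. A sketch, where the fairness reading (each app must retain more than an α fraction of its solo performance) is our interpretation of the constraint above:

```python
def weighted_speedup(rperfs):
    """Throughput metric: sum of the relative performances (RPerfAppi) of the
    co-scheduled apps. For a two-app co-run, 2.0 would mean no slowdown at all."""
    return sum(rperfs)

def is_fair(rperfs, alpha):
    """Fairness constraint: every co-located app must retain more than an
    alpha fraction of its exclusive solo-run performance."""
    return min(rperfs) > alpha

# Hypothetical values: App1 retains 80% of its solo performance, App2 60%
rperfs = [0.8, 0.6]
print(round(weighted_speedup(rperfs), 2))  # 1.4
print(is_fair(rperfs, alpha=0.5))          # True
```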
Performance Modeling
• A simple model using linear regression:
[Equation: RPerfAppi(S, P) is modeled as the sum of a scalability term and an interference term]
• Coefficients (C, D) are functions of the HW setup & job allocation (S, P)
• Variables are based on the collected performance counters (FAppi) and
transformed by using functions H & J
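A minimal sketch of such a model: for each hardware setup (S, P), fit coefficients by least squares, with the app's own metric driving the scalability term and the co-runner's metric driving the interference term. The single-feature-per-term design and the synthetic fit below are illustrative assumptions, not the paper's exact feature set:

```python
import numpy as np

def fit_model(F_self, F_other, rperf_measured):
    """For one hardware setup (S, P), fit the model coefficients by least
    squares. The app's own metric feeds the scalability term; the co-runner's
    metric feeds the interference term (one feature each, for brevity)."""
    X = np.column_stack([
        np.ones(len(rperf_measured)),  # intercept
        F_self,                        # scalability term input
        F_other,                       # interference term input
    ])
    coef, *_ = np.linalg.lstsq(X, rperf_measured, rcond=None)
    return coef

def predict(coef, f_self, f_other):
    return float(coef @ np.array([1.0, f_self, f_other]))

# Synthetic check: data generated as 1.0 + 0.5*f_self - 0.2*f_other
F_self = np.array([0.1, 0.5, 0.9, 0.2, 0.7, 0.4])
F_other = np.array([0.3, 0.2, 0.8, 0.9, 0.1, 0.5])
coef = fit_model(F_self, F_other, 1.0 + 0.5 * F_self - 0.2 * F_other)
print(np.round(coef, 3))  # approximately [1.0, 0.5, -0.2]
```

In the actual workflow, one such model would be trained offline per (S, P) combination over the benchmark set.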
Evaluation Setups
• Workloads:
• Rodinia, NVIDIA Cutlass library for GEMMs, streaming, random access
• Classify them into 4: TI (Tensor core Intensive); CI (Compute Intensive w/o tensor);
MI (Memory Intensive); US (UnScalable)
• Create pairs of classes, and randomly select an app for each class in each pair
• Methodologies:
• Profile applications using the Nsight Compute framework
• Train coefficients for all the possible HW setups & job allocations (|S|×|P|)
• Obtain the estimated performance and choose the best job allocation for the two problems
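The four-class scheme above can be sketched as a simple rule over profiled utilization ratios. The metric names and the 0.5 thresholds are hypothetical, chosen only to match the class definitions, not the paper's exact classification rule:

```python
def classify(profile):
    """Assign an application to one of the four classes used in the evaluation,
    based on profiled utilization ratios (0..1). Metric names and thresholds
    are illustrative assumptions."""
    if profile["tensor_util"] > 0.5:
        return "TI"  # Tensor core Intensive
    if profile["fp_util"] > 0.5:
        return "CI"  # Compute Intensive w/o tensor cores
    if profile["dram_util"] > 0.5:
        return "MI"  # Memory Intensive
    return "US"      # UnScalable: low utilization on all resources

print(classify({"tensor_util": 0.9, "fp_util": 0.4, "dram_util": 0.2}))  # TI
print(classify({"tensor_util": 0.0, "fp_util": 0.1, "dram_util": 0.8}))  # MI
```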
Model Accuracy (P=250W)
• Good throughput/fairness accuracy
for any partitioning state (S)
• Observed similar trends for different
power caps (P)
Throughput-oriented Optimization
• Optimize partitioning state (S) for a given power cap (P) & a given job pair
• Close to the best for almost all the workloads under P=230 [W] (left)
• Close to the best for the geometric mean even when we scale the power cap (right)
Energy-efficiency-oriented Optimization
• Optimize the partitioning state (S) & the power cap (P) for a given job pair
• Close to the best for almost all the workloads under α = 0.2 (left)
• Close to the best for the geometric mean even when we scale α (right)
Conclusion and Future Directions
• Conclusion:
• We targeted the optimization of resource partitioning, job allocations, and
power capping at the same time on a real GPU chip
• We provided some insights on scalability and resource sharing options useful
for decisions on resource partitioning, job allocations, and power capping
• We proposed a systematic approach to optimize the above knobs
• We quantified the benefits of our approach – close to the optimal
• Future directions:
• Include scheduling decisions, i.e., co-run pair selections from a job queue
• Include more HW components, e.g., CPUs, DRAM, multiple GPUs, etc.
• Integrate with state-of-the-art scheduling software frameworks, e.g., Slurm
Acknowledgement
This work has received funding under the European Commission’s
EuroHPC and H2020 programmes under grant agreement no. 956560
and was supported by the NVIDIA Academic Hardware Grant Program.
More Related Content

What's hot

What's hot (20)

JUDCon London 2011 - Bin packing with drools planner by example
JUDCon London 2011 - Bin packing with drools planner by exampleJUDCon London 2011 - Bin packing with drools planner by example
JUDCon London 2011 - Bin packing with drools planner by example
 
Advanced backup methods (Postgres@CERN)
Advanced backup methods (Postgres@CERN)Advanced backup methods (Postgres@CERN)
Advanced backup methods (Postgres@CERN)
 
MapReduce/YARNの仕組みを知る
MapReduce/YARNの仕組みを知るMapReduce/YARNの仕組みを知る
MapReduce/YARNの仕組みを知る
 
Oracle Database Vaultのご紹介
Oracle Database Vaultのご紹介Oracle Database Vaultのご紹介
Oracle Database Vaultのご紹介
 
ストリーム処理プラットフォームにおけるKafka導入事例 #kafkajp
ストリーム処理プラットフォームにおけるKafka導入事例 #kafkajpストリーム処理プラットフォームにおけるKafka導入事例 #kafkajp
ストリーム処理プラットフォームにおけるKafka導入事例 #kafkajp
 
[Oracle Cloud Days Tokyo 2015] Oracle Database 12c最新情報 ~Maximum Availability ...
[Oracle Cloud Days Tokyo 2015] Oracle Database 12c最新情報 ~Maximum Availability ...[Oracle Cloud Days Tokyo 2015] Oracle Database 12c最新情報 ~Maximum Availability ...
[Oracle Cloud Days Tokyo 2015] Oracle Database 12c最新情報 ~Maximum Availability ...
 
MySQL Database Architectures - 2022-08
MySQL Database Architectures - 2022-08MySQL Database Architectures - 2022-08
MySQL Database Architectures - 2022-08
 
[오픈소스컨설팅]클라우드기반U2L마이그레이션 전략 및 고려사항
[오픈소스컨설팅]클라우드기반U2L마이그레이션 전략 및 고려사항[오픈소스컨설팅]클라우드기반U2L마이그레이션 전략 및 고려사항
[오픈소스컨설팅]클라우드기반U2L마이그레이션 전략 및 고려사항
 
10年効く分散ファイルシステム技術 GlusterFS & Red Hat Storage
10年効く分散ファイルシステム技術 GlusterFS & Red Hat Storage10年効く分散ファイルシステム技術 GlusterFS & Red Hat Storage
10年効く分散ファイルシステム技術 GlusterFS & Red Hat Storage
 
Zero Data Loss Recovery Applianceのご紹介
Zero Data Loss Recovery Applianceのご紹介Zero Data Loss Recovery Applianceのご紹介
Zero Data Loss Recovery Applianceのご紹介
 
TripleOの光と闇
TripleOの光と闇TripleOの光と闇
TripleOの光と闇
 
OCI Database Management 설정 방법
OCI Database Management 설정 방법OCI Database Management 설정 방법
OCI Database Management 설정 방법
 
Dpdk pmd
Dpdk pmdDpdk pmd
Dpdk pmd
 
MySQL8.0_performance_schema.pptx
MySQL8.0_performance_schema.pptxMySQL8.0_performance_schema.pptx
MySQL8.0_performance_schema.pptx
 
Kubernetes にこれから入るかもしれない注目機能!(2022年11月版) / TechFeed Experts Night #7 〜 コンテナ技術を語る
Kubernetes にこれから入るかもしれない注目機能!(2022年11月版) / TechFeed Experts Night #7 〜 コンテナ技術を語るKubernetes にこれから入るかもしれない注目機能!(2022年11月版) / TechFeed Experts Night #7 〜 コンテナ技術を語る
Kubernetes にこれから入るかもしれない注目機能!(2022年11月版) / TechFeed Experts Night #7 〜 コンテナ技術を語る
 
さいきんの InnoDB Adaptive Flushing (仮)
さいきんの InnoDB Adaptive Flushing (仮)さいきんの InnoDB Adaptive Flushing (仮)
さいきんの InnoDB Adaptive Flushing (仮)
 
NEDIA_SNIA_CXL_講演資料.pdf
NEDIA_SNIA_CXL_講演資料.pdfNEDIA_SNIA_CXL_講演資料.pdf
NEDIA_SNIA_CXL_講演資料.pdf
 
SQL Server/SQL Database の新機能のお話し
SQL Server/SQL Database の新機能のお話しSQL Server/SQL Database の新機能のお話し
SQL Server/SQL Database の新機能のお話し
 
しばちょう先生による特別講義! RMANバックアップの運用と高速化チューニング
しばちょう先生による特別講義! RMANバックアップの運用と高速化チューニングしばちょう先生による特別講義! RMANバックアップの運用と高速化チューニング
しばちょう先生による特別講義! RMANバックアップの運用と高速化チューニング
 
分散ストレージソフトウェアCeph・アーキテクチャー概要
分散ストレージソフトウェアCeph・アーキテクチャー概要分散ストレージソフトウェアCeph・アーキテクチャー概要
分散ストレージソフトウェアCeph・アーキテクチャー概要
 

Similar to Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps

FYP1 Progress Report (final)
FYP1 Progress Report (final)FYP1 Progress Report (final)
FYP1 Progress Report (final)
waqas khan
 
CUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce ClusterCUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce Cluster
airbots
 
BGPC: Energy-Efficient Parallel Computing Considering Both Computational and ...
BGPC: Energy-Efficient Parallel Computing Considering Both Computational and ...BGPC: Energy-Efficient Parallel Computing Considering Both Computational and ...
BGPC: Energy-Efficient Parallel Computing Considering Both Computational and ...
Tarik Reza Toha
 
Task allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed systemTask allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed system
Deepak Shankar
 
Auto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningAuto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine Learning
Databricks
 

Similar to Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps (20)

GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production ScaleGPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
 
Graphics processing unit ppt
Graphics processing unit pptGraphics processing unit ppt
Graphics processing unit ppt
 
FYP1 Progress Report (final)
FYP1 Progress Report (final)FYP1 Progress Report (final)
FYP1 Progress Report (final)
 
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
Architecture exploration of recent GPUs to analyze the efficiency of hardware...Architecture exploration of recent GPUs to analyze the efficiency of hardware...
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDL
 
GPU accelerated Large Scale Analytics
GPU accelerated Large Scale AnalyticsGPU accelerated Large Scale Analytics
GPU accelerated Large Scale Analytics
 
Exploring performance and energy consumption differences between recent Intel...
Exploring performance and energy consumption differences between recent Intel...Exploring performance and energy consumption differences between recent Intel...
Exploring performance and energy consumption differences between recent Intel...
 
IEEE 2014 JAVA CLOUD COMPUTING PROJECTS Dynamic heterogeneity aware resource ...
IEEE 2014 JAVA CLOUD COMPUTING PROJECTS Dynamic heterogeneity aware resource ...IEEE 2014 JAVA CLOUD COMPUTING PROJECTS Dynamic heterogeneity aware resource ...
IEEE 2014 JAVA CLOUD COMPUTING PROJECTS Dynamic heterogeneity aware resource ...
 
CUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce ClusterCUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce Cluster
 
Automatic Energy-based Scheduling
Automatic Energy-based SchedulingAutomatic Energy-based Scheduling
Automatic Energy-based Scheduling
 
Sustainable Development using Green Programming
Sustainable Development using Green ProgrammingSustainable Development using Green Programming
Sustainable Development using Green Programming
 
Performance and power comparisons between nvidia and ati gpus
Performance and power comparisons between nvidia and ati gpusPerformance and power comparisons between nvidia and ati gpus
Performance and power comparisons between nvidia and ati gpus
 
Maha an energy efficient malleable hardware accelerator for data intensive a...
Maha  an energy efficient malleable hardware accelerator for data intensive a...Maha  an energy efficient malleable hardware accelerator for data intensive a...
Maha an energy efficient malleable hardware accelerator for data intensive a...
 
Aged Data Center Infrastructure.pptx
Aged Data Center Infrastructure.pptxAged Data Center Infrastructure.pptx
Aged Data Center Infrastructure.pptx
 
BGPC: Energy-Efficient Parallel Computing Considering Both Computational and ...
BGPC: Energy-Efficient Parallel Computing Considering Both Computational and ...BGPC: Energy-Efficient Parallel Computing Considering Both Computational and ...
BGPC: Energy-Efficient Parallel Computing Considering Both Computational and ...
 
Task allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed systemTask allocation on many core-multi processor distributed system
Task allocation on many core-multi processor distributed system
 
Auto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningAuto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine Learning
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
 
A Study on Atomics-based Integer Sum Reduction in HIP on AMD GPU
A Study on Atomics-based Integer Sum Reduction in HIP on AMD GPUA Study on Atomics-based Integer Sum Reduction in HIP on AMD GPU
A Study on Atomics-based Integer Sum Reduction in HIP on AMD GPU
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 

Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps

  • 1. REGALE Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps Eishi Arima1, Minjoon Kang1, Issa Saba1, Josef Weidendorfer2, Carsten Trinitis1, Martin Schulz12 1 Technische Universität München (TUM) 2 Leibniz-Rechenzentrum (LRZ) DUAC@ICPP'22 Aug 29, 2022
  • 2. Executive Summary Our focus: Co-optimizing resource partitioning, job allocations, and power capping on a real GPU chip • NVIDIA MIG (Multi-Instance GPU) feature: partition a chip at the granularity of GPC Key observations: 1. Scalability within a chip depends highly on compute/memory intensity, types of operations (e.g., FP64, FP16, or INT8), and power cap setup 2. Memory partitioning option also matters (private or shared LLC/HBM) Optimization: HW setup & job mapping optimization for a given app pair as a function of app characteristics using a predictive performance model • Problem definition, systematic solution, and statistical modeling (linear regression) Evaluation results: Near optimal for different objective functions, i.e., maximizing throughput or energy-efficiency DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 2
  • 3. Outline • Background • Observations • Proposed optimizations: • Overall workflow, problem formulation, and performance modeling • Evaluation setup and experimental results • Conclusion and future directions DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 3
  • 4. Outline • Background • Observations • Proposed optimizations: • Overall workflow, problem formulation, and performance modeling • Evaluation setup and experimental results • Conclusion and future directions DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 4
  • 5. Trends in Top-class Supercomputers • GPUs are now commonly used in top-class supercomputers • About 160 in the Top500 list (as of Jun 2022) • Recent supercomputers consume an enormous amount of power • Can be over 20MW DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 5 Over 20MW Power Consumption GPUs Top500 List (as of Jun 2022) We target power and resource management for GPU-based HPC systems
  • 6. . . . . . Node-level Power Cap Hierarchical Power/Resource Management • In the literature of power-aware HPC studies, power has been managed in a hierarchical manner • The central governor distibutes power budgets to nodes/jobs; The node manager sets power caps to node components using the given power budget • Co-scheduling, i.e., co-locating multiple apps at the same node, is promising to mitigate the waste of resources • E.g., mixing compute- and memory-intensive apps DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 6 Central Power Governor . . . . . Compute Nodes CPU GPU DRAM Component-level power cap . . . . . Our Target Job1 Job2 Job1 Job2 Job1 Job2 We focus on power capping, resource partitioning, and job allocations on a GPU
• 7. Resource Partitioning and Power Capping in Modern GPUs: NVIDIA A100 as an Example • Power capping is applicable at the granularity of the chip, e.g., nvidia-smi -pl 200 • MIG (Multi-Instance GPU) enables us to partition a GPU at the granularity of a GPC to co-locate apps • A GPC (Graphics Processing Cluster) consists of 10s of SMs (Streaming Multiprocessors), each of which includes Tensor Cores, FP64/FP32 units, INT ALUs, LD/ST units, etc. • We consider these different types of computational units as well as memory bandwidth utilization to optimize the partitioning, power cap, and job allocations DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 7
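The partitioning and capping knobs above define the search space the later optimization sweeps. A minimal sketch of enumerating that space, assuming for illustration that any 1..6 / remainder split of the 7 usable GPCs is considered (real MIG only exposes a few fixed GPU-instance slice sizes, so the feasible set on hardware is smaller):

```python
TOTAL_GPCS = 7  # on A100, 1 of the 8 GPCs is disabled when MIG is enabled
MEM_OPTIONS = ("shared", "private")  # L2$/HBM partitioning options

def partition_states():
    """Enumerate candidate HW setups S for a two-app co-schedule:
    (GPCs for app1, GPCs for app2, memory partitioning option)."""
    states = []
    for g1 in range(1, TOTAL_GPCS):   # app1 gets 1..6 GPCs
        g2 = TOTAL_GPCS - g1          # app2 gets the remaining GPCs
        for mem in MEM_OPTIONS:
            states.append((g1, g2, mem))
    return states

states = partition_states()  # 6 splits x 2 memory options = 12 candidates
```

The 4 GPCs / 3 GPCs split used in the preliminary experiments is one element of this set.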
  • 8. Outline • Background • Observations • Proposed optimizations: • Overall workflow, problem formulation, and performance modeling • Evaluation setup and experimental results • Conclusion and future directions DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 8
• 9. Our Platform & Preliminary Evaluation • Objective: quantify which aspects we need to take into account in our optimization • Experiments: 1. Solo-run scalability analysis, i.e., performance as a function of # of GPCs 2. Co-run throughput analysis for different app mappings & memory partitioning options • Memory partitioning: shared or private L2$/HBM • 1 of the 8 GPCs must be disabled to use the MIG feature; the remaining 7 GPCs are partitioned into 4 GPCs / 3 GPCs DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 9
• 10. Solo-run Scalability Analysis w/o Power Cap • Benchmarks: kmeans (Rodinia): low compute/BW resource utilization; stream: memory bound; dgemm: compute bound (w/o Tensor Core); hgemm: compute bound (w/ Tensor Core) • Scale the # of GPCs with either private or shared L2$/HBM • Findings: kmeans: neither the GPC allocation nor the memory option matters; stream: the memory option matters, and it does not need the full compute resources; dgemm/hgemm: scale very well, and the memory option does not matter DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 10
• 11. Solo-run Scalability Analysis w/ Power Cap • Scale down the power cap from 250W (TDP) to 150W • Scale the # of GPCs for the shared memory partitioning option • The power cap affects the scalability the most for Tensor-Core-intensive workloads • The Tensor Core is a power-hungry module • We need to take the heterogeneity of compute resources into account – they can have different performance-power characteristics! DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 11
• 12. Co-run Throughput Analysis (Power Cap: 250[W]) • Co-run multiple applications while changing the allocation of GPCs as well as the memory partitioning options (shared or private L2$/HBM) • Allocate more GPCs to the compute-intensive app (Tensor Core intensive in this case)! • Use the shared option so that the memory-intensive app can fully access the BW resources! • Using the private option can be the best: (1) it is interference free; (2) non-memory-bound apps may not need the full BW DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 12
• 13. Summary of the Preliminary Experiments • GPC allocation decision – scalability matters • Compute-bound apps scale well and need more GPCs • Power capping can affect the scalability, esp. when Tensor Cores are used intensively • Memory-bound apps do not need the full set of GPCs, esp. when the shared option is used • Memory partitioning decision • If a memory-bound app is scheduled, the shared option is preferred so that it can fully access the BW resources • If no memory-bound app is scheduled, the private option can be better as it mitigates shared-resource contention • Requirements for the optimization: 1. Consider app characteristics incl. memory intensity and compute intensity (per compute module, e.g., FP64, Tensor Core, etc.) 2. Performance estimation should cover both scalability and interference aspects DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 13
  • 14. Outline • Background • Observations • Proposed optimizations: • Overall workflow, problem formulation, and performance modeling • Evaluation setup and experimental results • Conclusion and future directions DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 14
  • 15. Workflow Overview • Consists of offline and online parts • Offline: determine the model coefficients by using a benchmark set • Online: optimize the HW setup for a given app pair • Profile-based application characterization • Model-based performance estimation DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 15
  • 16. Problem Formulation • Two policies: maximize throughput (left) or energy efficiency (right) 1. For a given power cap (P) and a given set of applications to co-schedule, optimize partitioning state (S) under a fairness constraint (controlled by α) 2. For a given set of applications to co-schedule, optimize partitioning state (S) and power cap (P) under a fairness constraint (controlled by α) • Co-schedule two apps in this paper DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 16
• 17. Metrics • Throughput: WeightedSpeedup = sum of the relative performance (RPerfAppi) across co-scheduled applications • RPerfAppi(S,P): relative performance of the ith app normalized to its exclusive solo-run performance without resource partitioning or power capping • This is a function of the resource partitioning state (S) and the power cap (P) • Fairness (> α): limit the slowdown to within a certain level for all co-located applications DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 17
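Putting the metrics and the throughput policy together, a minimal brute-force sketch (the rperf_model lookup and the setup/value table below are hypothetical stand-ins for the predictive model of the next slide, not measured numbers):

```python
def weighted_speedup(rperf):
    # Sum of each app's performance relative to its exclusive solo run
    return sum(rperf)

def fair(rperf, alpha):
    # Every co-located app must keep more than alpha of its solo performance
    return min(rperf) > alpha

def best_throughput_setup(setups, rperf_model, power_cap, alpha=0.2):
    """Policy 1: for a fixed power cap P, pick the partitioning state S
    maximizing WeightedSpeedup subject to the fairness constraint."""
    best, best_val = None, float("-inf")
    for s in setups:
        rperf = rperf_model(s, power_cap)  # predicted (RPerf_app1, RPerf_app2)
        if fair(rperf, alpha) and weighted_speedup(rperf) > best_val:
            best, best_val = s, weighted_speedup(rperf)
    return best, best_val

# Toy table standing in for model predictions at P = 230 W
table = {("4/3", 230): (0.80, 0.55), ("3/4", 230): (0.60, 0.70),
         ("5/2", 230): (0.90, 0.15)}  # the last one violates fairness
setup, ws = best_throughput_setup(["4/3", "3/4", "5/2"],
                                  lambda s, p: table[(s, p)], 230)
```

With α = 0.2, the "5/2" split is filtered out despite its high single-app speedup, and "4/3" wins with a weighted speedup of 1.35.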
  • 18. Performance Modeling • A simple model using linear regression: DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 18 • Coefficients (C, D) are functions of HW setups & job alloc (S, P) • Variables are based on the collected performance counters (FAppi) and transformed by using functions H & J Scalability Interference
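A minimal sketch of such a regression, assuming the profiled counters have already been transformed by H and J, with one coefficient set (C, D) fit per (S, P) combination via ordinary least squares (the paper's exact feature set and transforms are not reproduced here):

```python
import numpy as np

def fit_perf_model(x_self, x_co, y):
    """Fit RPerf ~ C*H(F_app) + D*J(F_co-app) + bias for one (S, P) setup.
    x_self, x_co: transformed counters of the app and its co-runner,
    y: measured relative performance across the training pairs."""
    A = np.hstack([x_self, x_co, np.ones((len(y), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, f_self, f_co):
    return float(np.concatenate([f_self, f_co, [1.0]]) @ coef)

# Synthetic check: recover a known linear relation from noiseless data
rng = np.random.default_rng(0)
Xs, Xc = rng.random((50, 2)), rng.random((50, 2))
y = 0.9 * Xs[:, 0] - 0.3 * Xc[:, 1] + 0.1  # ground-truth coefficients
coef = fit_perf_model(Xs, Xc, y)
```

The C terms capture the scalability component (the app's own counters) and the D terms the interference component (the co-runner's counters), mirroring the split on the slide.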
  • 19. Outline • Background • Observations • Proposed optimizations: • Overall workflow, problem formulation, and performance modeling • Evaluation setup and experimental results • Conclusion and future directions DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 19
• 20. Evaluation Setups • Workloads: • Rodinia, NVIDIA Cutlass library for GEMMs, streaming, random access • Classify them into 4 classes: TI (Tensor core Intensive), CI (Compute Intensive w/o tensor), MI (Memory Intensive), US (UnScalable) • Create pairs of classes, and randomly select an app for each class in each pair • Methodologies: • Profile applications using the NSight Compute framework • Train coefficients for all possible HW setups & job allocations (|S|×|P|) • Obtain the estimated performance and choose the best setup/job allocation for the two problems DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 20
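The class assignment can be sketched as simple thresholding on profiled utilizations. The counter names and the 0.5 cutoff below are assumptions for illustration, not the paper's exact NSight Compute metrics:

```python
def classify(profile, threshold=0.5):
    """Map an app profile (utilization fractions in [0, 1]) to a class:
    TI (Tensor core Intensive), CI (Compute Intensive w/o tensor),
    MI (Memory Intensive), US (UnScalable)."""
    if profile["tensor_util"] >= threshold:
        return "TI"
    if profile["fp_util"] >= threshold:
        return "CI"
    if profile["dram_bw_util"] >= threshold:
        return "MI"
    return "US"  # neither compute units nor memory bandwidth well utilized

# Hypothetical profiles matching the earlier solo-run observations
hgemm  = {"tensor_util": 0.9, "fp_util": 0.2, "dram_bw_util": 0.3}
stream = {"tensor_util": 0.0, "fp_util": 0.1, "dram_bw_util": 0.9}
```

An app like kmeans, with low utilization everywhere, would fall through to US, consistent with the "low compute/BW resource utilization" observation.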
  • 21. Model Accuracy (P=250W) • Good throughput/fairness accuracy for any partitioning state (S) • Observed similar trends for different power caps (P) DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 21
• 22. Throughput-oriented Optimization • Optimize the partitioning state (S) for a given power cap (P) & a given job pair • Close to the best for almost all the workloads under P = 230[W] (left) • Close to the best for the geometric mean even when we scale the power cap (right) DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 22
• 23. Energy-efficiency-oriented Optimization • Optimize the partitioning state (S) & power cap (P) for a given job pair • Close to the best for almost all the workloads under α = 0.2 (left) • Close to the best for the geometric mean even when we scale α (right) DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 23
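The second policy extends the search over the power cap P as well. A hedged sketch using weighted speedup per watt as the efficiency proxy (the paper's exact energy-efficiency objective may differ; the prediction table is a toy stand-in for the model):

```python
def best_energy_eff_setup(setups, power_caps, rperf_model, alpha=0.2):
    """Policy 2: jointly pick the partitioning state S and power cap P
    that maximize energy efficiency under the fairness bound (> alpha)."""
    best, best_val = None, float("-inf")
    for s in setups:
        for p in power_caps:
            rperf = rperf_model(s, p)
            if min(rperf) <= alpha:  # fairness constraint violated
                continue
            eff = sum(rperf) / p     # weighted speedup per watt
            if eff > best_val:
                best, best_val = (s, p), eff
    return best, best_val

# Toy predictions: a lower cap costs some performance but can win on efficiency
preds = {("4/3", 250): (0.80, 0.55), ("4/3", 150): (0.70, 0.50),
         ("3/4", 250): (0.60, 0.70), ("3/4", 150): (0.55, 0.60)}
choice, eff = best_energy_eff_setup(["4/3", "3/4"], [250, 150],
                                    lambda s, p: preds[(s, p)])
```

Here the 150 W cap wins: it gives up a little throughput but more than proportionally reduces power, which is exactly the trade-off this policy exploits.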
  • 24. Outline • Background • Observations • Proposed optimizations: • Overall workflow, problem formulation, and performance modeling • Evaluation setup and experimental results • Conclusion and future directions DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 24
• 25. Conclusion and Future Directions • Conclusion: • We targeted the simultaneous optimization of resource partitioning, job allocations, and power capping on a real GPU chip • We provided insights on scalability and resource sharing options useful for decisions on resource partitioning, job allocations, and power capping • We proposed a systematic approach to optimize the above knobs • We quantified the benefits of our approach – close to the optimal • Future directions: • Include scheduling decisions, i.e., co-run pair selections from a job queue • Include more HW components, e.g., CPUs, DRAMs, multi-GPU setups, etc. • Integrate with state-of-the-art scheduling software frameworks, e.g., Slurm DUAC@ICPP'22 Aug 29, 2022 Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 25
  • 26. Acknowledgement This work has received funding under the European Commission’s EuroHPC and H2020 programmes under grant agreement no. 956560 and was supported by the NVIDIA Academic Hardware Grant Program. DUAC@ICPP'22 Aug 29, 2022 REGALE Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps 26