P R O F I L I N G P Y T O R C H F O R
E F F I C I E N C Y & S U S T A I N A B I L I T Y
N O V 1 7 , 2 0 2 1
G E E T A C H A U H A N
P Y T O R C H P A R T N E R E N G I N E E R I N G
M E T A A I
A G E N D A
0 1
G P U P E R F O R M A N C E T U N I N G
0 2
P Y T O R C H P R O F I L E R
0 3
T I M E L I N E T R A C I N G
0 4
O P T I M I Z A T I O N E X A M P L E S
0 5
F U T U R E : S U S T A I N A B L E A I
GPU Performance Tuning
CPU
Optimized for single-thread performance
- Majority of chip area is control logic & caches
Complex and deep out-of-order pipelines
- Extract instruction-level parallelism
The brain
- Job is to keep the accelerator busy

GPU
Optimized for throughput of data-parallel problems
- Majority of chip area is functional units
Simple, relatively slow in-order pipelines
- Achieves much higher total throughput
Accelerator attached via PCIe
- An order of magnitude faster, but off to the side
A DIFFERENT MENTAL MODEL REQUIRED
G P U P E R F O R M A N C E T U N I N G
Composed of Streaming
Multiprocessors (SMs)
Volta V100: 80 SMs
Ampere A100: 108 SMs
DGX A100 with 8 GPUs:
864 SMs vs 128 CPU cores
NVIDIA Volta V100 GPU
G P U P E R F O R M A N C E T U N I N G
G P U P E R F O R M A N C E T U N I N G
64x FP32 units
64x INT, 32x FP64, 32x LD/ST
8x Tensor Cores
5120 (6912 ON A100) FP32 EXECUTION UNITS PER GPU
Streaming Multiprocessor
• Excessive CPU/GPU interactions – e.g. a for-loop launching GPU operations (see the sketch after this list)
- Dominated by launch overheads
• Short GPU kernel durations – e.g. small inputs
- Need enough data to feed tens of thousands of threads
• CPU overheads and I/O bottlenecks starving the GPU
- Small operations on the CPU can quickly become dominant
• Framework inefficiencies
- E.g. unnecessary copies and hidden CPU-side overheads
VISIBILITY IS KEY
G P U P E R F O R M A N C E T U N I N G
Common Pitfalls
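To make the first pitfall concrete, here is a minimal sketch (the tensor shapes and the relu/scale operation are made up purely for illustration, and a CUDA device is assumed) contrasting many tiny kernel launches from a Python loop with one batched launch:

import torch

assert torch.cuda.is_available()
x = torch.randn(10000, 128, device="cuda")

# Anti-pattern: one (or more) small kernel launches per row,
# so total time is dominated by launch overhead, not compute
out_rows = []
for i in range(x.shape[0]):
    out_rows.append(torch.relu(x[i]) * 2.0)
out_loop = torch.stack(out_rows)

# Preferred: a single batched launch that gives the SMs enough work
out_batched = torch.relu(x) * 2.0

assert torch.allclose(out_loop, out_batched)

Profiling both variants under torch.profiler makes the difference visible: the loop shows thousands of short kernels with CPU-side gaps between them, the batched version a handful of long ones.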
PyTorch Profiler
W i t h I n t e g r a t e d G P U P r o f i l i n g L i b r a r y
CONTRIBUTED BY MICROSOFT & FACEBOOK
• PyTorch and GPU level information
• Automatic bottleneck detection
• Actionable performance recommendations
• Data Scientist friendly lifecycle and tools
• TensorBoard plugin - Chrome trace visualization
• OSS Kineto library - built on CUPTI
• Easy-to-use Python API
• VS Code integration
[Architecture diagram: inside the PyTorch process, Python events and aten/CPU operators are captured by the PyTorch Profiler, while libkineto uses libCUPTI to collect CUDA activities from GPU 1..n via the NVIDIA driver and OS; queued GPU ops plus CPU operator and CUDA activity traces flow to the TensorBoard Profiler plugin.]
T H E P Y T O R C H P R O F I L E R
https://pytorch.org/tutorials/recipes/recipes/profiler.html
import torch
import torchvision.models as models
import torch.profiler as profiler

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

with profiler.profile(record_shapes=True) as prof:
    with profiler.record_function("model_inference"):
        model(inputs)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
T H E P Y T O R C H P R O F I L E R
Profiling API: Base Usage
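Because record_shapes=True is set in the example above, the aggregated results can also be grouped by operator input shape, which helps spot the small-input kernels discussed earlier; a small follow-on sketch reusing the prof object from the snippet above:

# group statistics by (operator, input shapes); requires record_shapes=True
print(prof.key_averages(group_by_input_shape=True)
      .table(sort_by="cpu_time_total", row_limit=10))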
T H E P Y T O R C H P R O F I L E R
Profiling API: TensorBoard Plugin

import torch
import torchvision.models as models
import torch.profiler as profiler

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

with profiler.profile(
    record_shapes=True,
    on_trace_ready=torch.profiler.tensorboard_trace_handler('results')
) as prof:
    model(inputs)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
• When to trigger
• How many steps to profile
• Which activities to profile
• Results callable handler
• Extra metadata, e.g. shapes, stacks, memory
• Output options, e.g. Chrome tracing, TensorBoard (see the sketch below)
T H E P Y T O R C H P R O F I L E R
Advanced
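The options listed above map onto arguments of torch.profiler.profile; a minimal sketch of a stepped profiling run (the schedule values, batch size and log directory are arbitrary choices for illustration):

import torch
import torchvision.models as models
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

model = models.resnet18().cuda()
inputs = torch.randn(32, 3, 224, 224, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],    # which activities to profile
    schedule=schedule(wait=1, warmup=1, active=3, repeat=2),     # when to trigger / how many steps
    on_trace_ready=tensorboard_trace_handler("./log/resnet18"),  # results callable handler
    record_shapes=True,                                          # extra metadata
    profile_memory=True,
    with_stack=True,
) as prof:
    for _ in range(10):
        model(inputs)
        prof.step()  # tell the profiler a step finished so the schedule advances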
T H E P Y T O R C H P R O F I L E R
D I S T R I B U T E D T R A I N I N G V I E W
V S C O D E D A T A W R A N G L E R
Timeline Tracing
T I M E L I N E T R A C E S : C P U + G P U A C T I V I T I E S
T I M E L I N E T R A C I N G
Chrome Trace Viewer: CPU and GPU timelines
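A trace like this can be produced directly from the profiler; a minimal sketch (CUDA device assumed, output file name arbitrary) that writes a JSON timeline viewable in the Chrome trace viewer at chrome://tracing:

import torch
import torchvision.models as models
import torch.profiler as profiler

model = models.resnet18().cuda()
inputs = torch.randn(5, 3, 224, 224, device="cuda")

with profiler.profile(
    activities=[profiler.ProfilerActivity.CPU, profiler.ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    model(inputs)

# writes a chrome://tracing-compatible timeline with CPU and GPU lanes
prof.export_chrome_trace("resnet18_trace.json")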
• Can be left in permanently, with no performance overhead
T I M E L I N E T R A C I N G
T I M E L I N E T R A C I N G
See how CPU and GPU ops are connected
nvidia-smi shows 86% utilization, but only a fraction of SMs are actually used by these kernels!
T I M E L I N E T R A C I N G
Inspect stats for individual activities
Looks much better after increasing input sizes
T I M E L I N E T R A C I N G
Inspect stats for individual activities
Trace Analysis
E x a m p l e s f r o m M e t a w o r k l o a d s
Thanks to Lei Tian, Natalia Gimelshein, Lingyi Liu, Feng Shi & Zhicheng Yan for the examples
Issue:
1. Large periods of GPU inactivity
2. Trace does not show why
Solution:
1. Use record_function to reveal
bottlenecks on CPU
2. Parallelize CPU operations
3. Overlap CPU and GPU operations
# fragment from a larger data-preparation loop; emb, k, i, records and input_df
# come from the surrounding code
from torch.profiler import record_function

temp = ""
num_substr = len(emb[k])
with record_function("## join_string {} ##".format(num_substr)):
    temp = ",".join(str(x) for x in emb[k])  # string concatenation
with record_function("## append_record_in_else ##"):
    records.append(f"{input_df.id[i + k]}\t{temp}\n")  # list append
T R A C E A N A L Y S I S
Anti-pattern: Long GPU idle time
A F T E R
def on_step(self, task) -> None:
    ...
    with torch.no_grad():
        torch._foreach_mul_(
            self.ema_model_state_list, self.decay)
        torch._foreach_add_(
            self.ema_model_state_list,
            self.param_list,
            alpha=(1 - self.decay))
First issue:
• Exponential moving avg hook function has a loop – CPU bottleneck
• Can rewrite using torch._foreach ops – loop now on GPU
EMA HOOK 100X FASTER
ITERATION TIME: 860MS -> 770MS
T R A C E A N A L Y S I S
Anti-pattern: Excessive CPU/GPU interactions
B E F O R E
def on_step(self, task) -> None:
    ...
    with torch.no_grad():
        it = model_state_iterator(task.base_model)
        # iterate on every name & param
        for name, param in it:
            s = self.state.ema_model_state
            s[name] = self.decay * s[name] + \
                (1 - self.decay) * param.to(device=self.device)
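For readers unfamiliar with the torch._foreach_* ops used in the AFTER version, a minimal standalone sketch (small CPU tensors, purely illustrative) showing that they compute the same EMA update as the Python loop, but with one call covering the whole list of tensors:

import torch

params = [torch.randn(4) for _ in range(3)]   # dummy "parameters"
ema = [torch.randn(4) for _ in range(3)]      # dummy EMA shadow state
decay = 0.99

# loop version: one small op (and, on GPU, one kernel launch) per tensor
ema_loop = [decay * e + (1 - decay) * p for e, p in zip(ema, params)]

# _foreach version: each call operates on the whole list at once
torch._foreach_mul_(ema, decay)
torch._foreach_add_(ema, params, alpha=1 - decay)

for a, b in zip(ema, ema_loop):
    assert torch.allclose(a, b)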
Second issue:
• Optimizer step uses a naïve implementation of RMSProp
• PyTorch provides an optimized multi-tensor version – using torch._foreach
• Switch to optimized version!
OPTIMIZER 12X FASTER
ITERATION TIME: 770MS -> 600MS
T R A C E A N A L Y S I S
Anti-pattern: Excessive CPU/GPU interactions
B E F O R E
def prepare(self, param_groups):
    self.optimizer = RMSpropTFV2Optimizer(
        param_groups,
        …
A F T E R
import torch.optim._multi_tensor as optim_mt

def prepare(self, param_groups):
    self.optimizer = optim_mt.RMSprop(
        param_groups,
        …
Third issue:
• Forward & backward pass dominated by SyncBatchNorm
• 84x SyncBatchNorm in fwd pass
• 3x ncclAllGather per SyncBatchNorm
• Another 2x ncclAllReduce per SyncBatchNorm in bwd pass
T R A C E A N A L Y S I S
Anti-pattern: Excessive CPU/GPU interactions
FORWARD PASS 1.5X FASTER
BACKWARD PASS 1.3X FASTER
ITERATION TIME: 600MS -> 450MS
2.2ms
1.7ms
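One way to quantify a pattern like this from a collected profile is to aggregate the events and look for the NCCL collectives; a sketch, assuming prof is a torch.profiler.profile run with CUDA activities enabled and that the event attribute names below match your PyTorch version:

events = prof.key_averages()

# top GPU-time consumers
print(events.table(sort_by="cuda_time_total", row_limit=15))

# how often NCCL collectives were launched and how much GPU time they took
for evt in events:
    if "nccl" in evt.key.lower():
        print(evt.key, "count:", evt.count, "cuda_time_total (us):", evt.cuda_time_total)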
T R A C E A N A L Y S I S
BERT PERFORMANCE OPTIMIZATION CASE STUDY
• From 2.4 req/s to 1,400+ req/s
• CPU inference
• torch.set_num_threads(1)
• Intel IPEX
• Quantization
• GPU inference on 1 T4 GPU
• model.half()
• DistilBERT
• Increase batch size
• Do not overpad
• Faster Transformer
Configuration                                 Throughput     P99
BERT, unoptimized, bs=1                       70.67 seq/s    20.44 ms
BERT, model.half(), bs=8                      359 seq/s      23.58 ms
DistilBERT, model.half(), bs=16               689 seq/s      22.8 ms
BERT, Faster Transformer                      885 seq/s      19.83 ms
DistilBERT, no padding, model.half(), bs=32   1423 seq/s     19.7 ms
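A minimal sketch of the GPU-side steps from the case study above (half precision, a larger batch, and padding only to the longest sequence in the batch); the Hugging Face transformers library and the distilbert-base-uncased checkpoint are assumptions, not named in the slides:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased").half().eval().to("cuda")

batch = ["a short request", "a second, somewhat longer request to classify"] * 16  # bs=32

# padding="longest" pads only to the longest sequence in this batch ("do not overpad")
enc = tokenizer(batch, padding="longest", truncation=True, return_tensors="pt").to("cuda")

with torch.inference_mode():
    hidden = model(**enc).last_hidden_state  # downstream task head omitted in this sketch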
FUTURE
S u s t a i n a b l e A I
A I M O D E L G R O W T H
M O D E L D E P L O Y M E N T P H A S E S – P O W E R C O N S U M P T I O N
• Platform level caching – 6.7x improvements
• GPU Acceleration – unlocks 10.1x energy efficiency
• Algorithmic Optimizations – 10x improvements
O P T I M I Z A T I O N S F O R C A R B O N F O O T P R I N T O F L M
1. Data Utilization Efficiency:
Data Scaling & Sampling, Data perishability
2. Experimentation and Training Efficiency:
NAS, HPO, Multi-Objective Optimizations,
Resource Efficient Architectures
3. Efficient Environment Scalable Infrastructure:
Carbon efficient scheduling, On-device Learning, …
4. Develop easy to adopt Telemetry:
Measure and publish,
Carbon impact statement & model cards
S U S T A I N A B I L I T Y M I N D S E T
https://arxiv.org/pdf/2111.00364.pdf
Source: https://docs.cohere.ai/environmental-impact
• What’s new in PyTorch Profiler 1.9: https://pytorch.org/blog/pytorch-profiler-1.9-released/
• Introducing PyTorch Profiler:
https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/
• Profiler: https://pytorch.org/docs/stable/profiler.html
• Profiler Recipes: https://pytorch.org/tutorials/recipes/recipes/profiler.html
• VSCode TensorBoard support: https://devblogs.microsoft.com/python/python-in-visual-studio-code-february-2021-release/
• PyTorch Profiler Talk – PROFILING PYTORCH MODELS FOR NVIDIA GPUS:
https://gtc21.event.nvidia.com/media/Profiling%20PyTorch%20Models%20for%20NVIDIA%20GPUs%20%5BS31644%5D/1_nuwnw731
• Optimizing PyTorch Performance batch size with PyTorch Profiler: https://opendatascience.com/optimizing-pytorch-performance-batch-size-with-pytorch-profiler/
• Kubeflow PyTorch Samples: https://github.com/kubeflow/pipelines/tree/master/samples/contrib/pytorch-samples
• PyTorch Lightning Profiler example: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/basic_examples/profiler_example.py
• Sustainable AI Paper: https://arxiv.org/pdf/2111.00364.pdf
• Cohere.ai Environmental Impact model cards: https://docs.cohere.ai/environmental-impact
R E F E R E N C E S
Questions?
Contact:
Email: gchauhan@fb.com
LinkedIn: https://www.linkedin.com/in/geetachauhan/
Thank You