P R O F I L I N G P Y T O R C H F O R
E F F I C I E N C Y & S U S T A I N A B I L I T Y
N O V 1 7 , 2 0 2 1
G E E T A C H A U H A N
P Y T O R C H P A R T N E R E N G I N E E R I N G
M E T A A I
A G E N D A
0 1
G P U P E R F O R M A N C E T U N I N G
0 2
P Y T O R C H P R O F I L E R
0 3
T I M E L I N E T R A C I N G
0 4
O P T I M I Z A T I O N E X A M P L E S
0 5
F U T U R E : S U S T A I N A B L E A I
GPU Performance Tuning
CPU
Optimized for single-thread performance
- Majority of chip area is control logic & caches
Complex and deep out-of-order pipelines
- Extract instruction-level parallelism
The brain
- Job is to keep the accelerator busy

GPU
Optimized for throughput of data-parallel problems
- Majority of chip area is functional units
Simple, relatively slow in-order pipelines
- Achieves much higher total throughput
Accelerator attached via PCIe
- An order of magnitude faster, but off to the side
A DIFFERENT MENTAL MODEL REQUIRED
G P U P E R F O R M A N C E T U N I N G
Composed of Streaming
Multiprocessors (SMs)
Volta V100: 80 SMs
Ampere A100: 108 SMs
DGX A100 with 8 GPUs:
864 SMs vs 128 CPU cores
NVIDIA Volta V100 GPU
G P U P E R F O R M A N C E T U N I N G
G P U P E R F O R M A N C E T U N I N G
64x FP32 units
64x INT, 32x FP64, 32x LD/ST
8x Tensor Cores
5120 (6912 ON A100) FP32 EXECUTION UNITS PER GPU
Streaming Multiprocessor
• Excessive CPU/GPU interactions – e.g. a for-loop launching GPU operations (see the sketch after this list)
- Dominated by launch overheads
• Short GPU kernel durations – e.g. small inputs
- Need enough data to feed tens of thousands of threads
• CPU overheads and I/O bottlenecks starving the GPU
- Small operations on the CPU can quickly become dominant
• Framework inefficiencies
- E.g. unnecessary copies and hidden CPU-side overheads
VISIBILITY IS KEY
G P U P E R F O R M A N C E T U N I N G
Common Pitfalls
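To make the first pitfall concrete, here is a minimal sketch (the tensor shapes and the relu/scale operation are made up purely for illustration, and a CUDA device is assumed) contrasting many tiny kernel launches from a Python loop with one batched launch:

import torch

assert torch.cuda.is_available()
x = torch.randn(10000, 128, device="cuda")

# Anti-pattern: one (or more) small kernel launches per row,
# so total time is dominated by launch overhead, not compute
out_rows = []
for i in range(x.shape[0]):
    out_rows.append(torch.relu(x[i]) * 2.0)
out_loop = torch.stack(out_rows)

# Preferred: a single batched launch that gives the SMs enough work
out_batched = torch.relu(x) * 2.0

assert torch.allclose(out_loop, out_batched)

Profiling both variants under torch.profiler makes the difference visible: the loop shows thousands of short kernels with CPU-side gaps between them, the batched version a handful of long ones.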
PyTorch Profiler
W i t h I n t e g r a t e d G P U P r o f i l i n g L i b r a r y
CONTRIBUTED BY MICROSOFT & FACEBOOK
• PyTorch and GPU level information
• Automatic bottleneck detection
• Actionable performance recommendations
• Data Scientist friendly lifecycle and tools
• TensorBoard plugin - Chrome trace visualization
• OSS Kineto library - built on CUPTI
• Easy-to-use Python API
• VS Code integration
[Architecture diagram: inside the PyTorch process, Python events and aten/CPU operators are captured by the PyTorch Profiler, while libkineto uses libCUPTI to collect CUDA activities from GPU 1..n via the NVIDIA driver and OS; queued GPU ops plus CPU operator and CUDA activity traces flow to the TensorBoard Profiler plugin.]
T H E P Y T O R C H P R O F I L E R
https://pytorch.org/tutorials/recipes/recipes/profiler.html
import torch
import torchvision.models as models
import torch.profiler as profiler

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

with profiler.profile(record_shapes=True) as prof:
    with profiler.record_function("model_inference"):
        model(inputs)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
T H E P Y T O R C H P R O F I L E R
Profiling API: Base Usage
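Because record_shapes=True is set in the example above, the aggregated results can also be grouped by operator input shape, which helps spot the small-input kernels discussed earlier; a small follow-on sketch reusing the prof object from the snippet above:

# group statistics by (operator, input shapes); requires record_shapes=True
print(prof.key_averages(group_by_input_shape=True)
      .table(sort_by="cpu_time_total", row_limit=10))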
T H E P Y T O R C H P R O F I L E R
Profiling API: TensorBoard Plugin

import torch
import torchvision.models as models
import torch.profiler as profiler

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

with profiler.profile(
    record_shapes=True,
    on_trace_ready=torch.profiler.tensorboard_trace_handler('results')
) as prof:
    model(inputs)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
• When to trigger
• How many steps to profile
• Which activities to profile
• Results callable handler
• Extra metadata, e.g. shapes, stacks, memory
• Output options, e.g. Chrome tracing, TensorBoard (see the sketch below)
T H E P Y T O R C H P R O F I L E R
Advanced
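The options listed above map onto arguments of torch.profiler.profile; a minimal sketch of a stepped profiling run (the schedule values, batch size and log directory are arbitrary choices for illustration):

import torch
import torchvision.models as models
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

model = models.resnet18().cuda()
inputs = torch.randn(32, 3, 224, 224, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],    # which activities to profile
    schedule=schedule(wait=1, warmup=1, active=3, repeat=2),     # when to trigger / how many steps
    on_trace_ready=tensorboard_trace_handler("./log/resnet18"),  # results callable handler
    record_shapes=True,                                          # extra metadata
    profile_memory=True,
    with_stack=True,
) as prof:
    for _ in range(10):
        model(inputs)
        prof.step()  # tell the profiler a step finished so the schedule advances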
T H E P Y T O R C H P R O F I L E R
D I S T R I B U T E D T R A I N I N G V I E W
V S C O D E D A T A W R A N G L E R
Timeline Tracing
T I M E L I N E T R A C E S : C P U + G P U A C T I V I T I E S
T I M E L I N E T R A C I N G
Chrome Trace Viewer: CPU and GPU timelines
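A trace like this can be produced directly from the profiler; a minimal sketch (CUDA device assumed, output file name arbitrary) that writes a JSON timeline viewable in the Chrome trace viewer at chrome://tracing:

import torch
import torchvision.models as models
import torch.profiler as profiler

model = models.resnet18().cuda()
inputs = torch.randn(5, 3, 224, 224, device="cuda")

with profiler.profile(
    activities=[profiler.ProfilerActivity.CPU, profiler.ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    model(inputs)

# writes a chrome://tracing-compatible timeline with CPU and GPU lanes
prof.export_chrome_trace("resnet18_trace.json")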
• Can be left in permanently, with no performance overhead
T I M E L I N E T R A C I N G
T I M E L I N E T R A C I N G
See how CPU and GPU ops are connected
nvidia-smi shows 86% utilization, but only a fraction of SMs are actually used by these kernels!
T I M E L I N E T R A C I N G
Inspect stats for individual activities
Looks much better after increasing input sizes
T I M E L I N E T R A C I N G
Inspect stats for individual activities
Trace Analysis
E x a m p l e s f r o m M e t a w o r k l o a d s
Thanks to Lei Tian, Natalia Gimelshein, Lingyi Liu, Feng Shi & Zhicheng Yan for the examples
Issue:
1. Large periods of GPU inactivity
2. Trace does not show why
Solution:
1. Use record_function to reveal
bottlenecks on CPU
2. Parallelize CPU operations
3. Overlap CPU and GPU operations
# fragment from a larger data-preparation loop; emb, k, i, records and input_df
# come from the surrounding code
from torch.profiler import record_function

temp = ""
num_substr = len(emb[k])
with record_function("## join_string {} ##".format(num_substr)):
    temp = ",".join(str(x) for x in emb[k])  # string concatenation
with record_function("## append_record_in_else ##"):
    records.append(f"{input_df.id[i + k]}\t{temp}\n")  # list append
T R A C E A N A L Y S I S
Anti-pattern: Long GPU idle time
A F T E R
def on_step(self, task) -> None:
    ...
    with torch.no_grad():
        torch._foreach_mul_(
            self.ema_model_state_list, self.decay)
        torch._foreach_add_(
            self.ema_model_state_list,
            self.param_list,
            alpha=(1 - self.decay))
First issue:
• Exponential moving avg hook function has a loop – CPU bottleneck
• Can rewrite using torch._foreach ops – loop now on GPU
EMA HOOK 100X FASTER
ITERATION TIME: 860MS -> 770MS
T R A C E A N A L Y S I S
Anti-pattern: Excessive CPU/GPU interactions
B E F O R E
def on_step(self, task) -> None:
    ...
    with torch.no_grad():
        it = model_state_iterator(task.base_model)
        # iterate on every name & param
        for name, param in it:
            s = self.state.ema_model_state
            s[name] = self.decay * s[name] + \
                (1 - self.decay) * param.to(device=self.device)
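For readers unfamiliar with the torch._foreach_* ops used in the AFTER version, a minimal standalone sketch (small CPU tensors, purely illustrative) showing that they compute the same EMA update as the Python loop, but with one call covering the whole list of tensors:

import torch

params = [torch.randn(4) for _ in range(3)]   # dummy "parameters"
ema = [torch.randn(4) for _ in range(3)]      # dummy EMA shadow state
decay = 0.99

# loop version: one small op (and, on GPU, one kernel launch) per tensor
ema_loop = [decay * e + (1 - decay) * p for e, p in zip(ema, params)]

# _foreach version: each call operates on the whole list at once
torch._foreach_mul_(ema, decay)
torch._foreach_add_(ema, params, alpha=1 - decay)

for a, b in zip(ema, ema_loop):
    assert torch.allclose(a, b)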
Second issue:
• Optimizer step uses a naïve implementation of RMSProp
• PyTorch provides an optimized multi-tensor version – using torch._foreach
• Switch to optimized version!
OPTIMIZER 12X FASTER
ITERATION TIME: 770MS -> 600MS
T R A C E A N A L Y S I S
Anti-pattern: Excessive CPU/GPU interactions
B E F O R E
def prepare(self, param_groups):
    self.optimizer = RMSpropTFV2Optimizer(
        param_groups,
        …
A F T E R
import torch.optim._multi_tensor as optim_mt

def prepare(self, param_groups):
    self.optimizer = optim_mt.RMSprop(
        param_groups,
        …
Third issue:
• Forward & backward pass dominated by SyncBatchNorm
• 84x SyncBatchNorm in fwd pass
• 3x ncclAllGather per SyncBatchNorm
• Another 2x ncclAllReduce per SyncBatchNorm in bwd pass
T R A C E A N A L Y S I S
Anti-pattern: Excessive CPU/GPU interactions
FORWARD PASS 1.5X FASTER
BACKWARD PASS 1.3X FASTER
ITERATION TIME: 600MS -> 450MS
2.2ms
1.7ms
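One way to quantify a pattern like this from a collected profile is to aggregate the events and look for the NCCL collectives; a sketch, assuming prof is a torch.profiler.profile run with CUDA activities enabled and that the event attribute names below match your PyTorch version:

events = prof.key_averages()

# top GPU-time consumers
print(events.table(sort_by="cuda_time_total", row_limit=15))

# how often NCCL collectives were launched and how much GPU time they took
for evt in events:
    if "nccl" in evt.key.lower():
        print(evt.key, "count:", evt.count, "cuda_time_total (us):", evt.cuda_time_total)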
T R A C E A N A L Y S I S
BERT PERFORMANCE OPTIMIZATION CASE STUDY
• From 2.4 req/s to 1,400+ req/s
• CPU inference
• torch.set_num_threads(1)
• Intel IPEX
• Quantization
• GPU inference on 1 T4 GPU
• model.half()
• DistilBERT
• Increase batch size
• Do not overpad
• Faster Transformer
Configuration                                 Throughput     P99
BERT, unoptimized, bs=1                       70.67 seq/s    20.44 ms
BERT, model.half(), bs=8                      359 seq/s      23.58 ms
DistilBERT, model.half(), bs=16               689 seq/s      22.8 ms
BERT, Faster Transformer                      885 seq/s      19.83 ms
DistilBERT, no padding, model.half(), bs=32   1423 seq/s     19.7 ms
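A minimal sketch of the GPU-side steps from the case study above (half precision, a larger batch, and padding only to the longest sequence in the batch); the Hugging Face transformers library and the distilbert-base-uncased checkpoint are assumptions, not named in the slides:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased").half().eval().to("cuda")

batch = ["a short request", "a second, somewhat longer request to classify"] * 16  # bs=32

# padding="longest" pads only to the longest sequence in this batch ("do not overpad")
enc = tokenizer(batch, padding="longest", truncation=True, return_tensors="pt").to("cuda")

with torch.inference_mode():
    hidden = model(**enc).last_hidden_state  # downstream task head omitted in this sketch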
FUTURE
S u s t a i n a b l e A I
A I M O D E L G R O W T H
M O D E L D E P L O Y M E N T P H A S E S – P O W E R C O N S U M P T I O N
• Platform level caching – 6.7x improvements
• GPU Acceleration – unlocks 10.1x energy efficiency
• Algorithmic Optimizations – 10x improvements
O P T I M I Z A T I O N S F O R C A R B O N F O O T P R I N T O F L M
1. Data Utilization Efficiency:
Data Scaling & Sampling, Data perishability
2. Experimentation and Training Efficiency:
NAS, HPO, Multi-Objective Optimizations,
Resource Efficient Architectures
3. Efficient Environment Scalable Infrastructure:
Carbon efficient scheduling, On-device Learning, …
4. Develop easy to adopt Telemetry:
Measure and publish,
Carbon impact statement & model cards
S U S T A I N A B I L I T Y M I N D S E T
https://arxiv.org/pdf/2111.00364.pdf
Source: https://docs.cohere.ai/environmental-impact
• What’s new in PyTorch Profiler 1.9: https://pytorch.org/blog/pytorch-profiler-1.9-released/
• Introducing PyTorch Profiler:
https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/
• Profiler: https://pytorch.org/docs/stable/profiler.html
• Profiler Recipes: https://pytorch.org/tutorials/recipes/recipes/profiler.html
• VSCode TensorBoard support: https://devblogs.microsoft.com/python/python-in-visual-studio-code-february-2021-release/
• PyTorch Profiler Talk – PROFILING PYTORCH MODELS FOR NVIDIA GPUS:
https://gtc21.event.nvidia.com/media/Profiling%20PyTorch%20Models%20for%20NVIDIA%20GPUs%20%5BS31644%5D/1_nuwnw731
• Optimizing PyTorch Performance batch size with PyTorch Profiler: https://opendatascience.com/optimizing-pytorch-performance-batch-size-with-pytorch-profiler/
• Kubeflow PyTorch Samples: https://github.com/kubeflow/pipelines/tree/master/samples/contrib/pytorch-samples
• PyTorch Lightning Profiler example: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/basic_examples/profiler_example.py
• Sustainable AI Paper: https://arxiv.org/pdf/2111.00364.pdf
• Cohere.ai Environmental Impact model cards: https://docs.cohere.ai/environmental-impact
R E F E R E N C E S
Questions?
Contact:
Email: gchauhan@fb.com
LinkedIn: https://www.linkedin.com/in/geetachauhan/
Thank You