This document summarizes PyTorch's approach to distributed data parallel training. It discusses the challenges of mathematical equivalence, a non-intrusive API, and high performance. It describes how PyTorch uses allreduce primitives and gradient bucketing to average gradients across processes. It evaluates the performance of distributed data parallel training with PyTorch using NCCL and Gloo backends on ResNet50 and BERT models.
HPC in PyTorch
1. PyTorch Distributed: Experiences on Accelerating Data Parallel Training
Shen Li†, Yanli Zhao†, Rohan Varma†, Omkar Salpekar†,
Pieter Noordhuis, Teng Li†, Adam Paszke‡,
Jeff Smith†, Brian Vaughan†, Pritam Damania†, Soumith Chintala†
{shenli, yanlizhao, rvarm1, osalpekar}@fb.com,
pcnoordhuis@gmail.com, tengli@fb.com, adam.paszke@gmail.com,
{jeffksmith, bvaughan, pritam.damania, soumith}@fb.com
†Facebook AI  ‡University of Warsaw
A presentation by Rëza Habibi
4. Introduction
To provide a general distributed data parallel package, the challenges are
three-fold:
• Mathematical equivalence
• Non-intrusive and interceptive API
• High Performance
6. Data Parallelism
• DataParallel for single-process, multi-thread data parallel training
using multiple GPUs on the same machine.
• DistributedDataParallel (DDP) for multi-process data parallel
training across GPUs and machines.
• RPC for general distributed model parallel training.
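The core idea behind data parallel training can be sketched in plain Python (no PyTorch needed for the illustration): each worker computes a gradient on its own data shard, the gradients are averaged, and every replica applies the identical update, which keeps the result mathematically equivalent to training on one large batch. All names here (`local_gradient`, `data_parallel_step`) are hypothetical, not PyTorch APIs.

```python
# Conceptual sketch of data-parallel training: a linear model w*x fit
# with mean squared error, split across two simulated workers.

def local_gradient(weight, shard):
    # Gradient of 0.5 * (w*x - y)^2, averaged over this worker's shard.
    return sum((weight * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(weight, shards, lr=0.1):
    grads = [local_gradient(weight, s) for s in shards]  # per-worker backward
    avg = sum(grads) / len(grads)                        # allreduce-style average
    return weight - lr * avg                             # identical update on every replica

shards = [[(1.0, 2.0)], [(2.0, 4.0)]]  # two workers, one sample each
w = data_parallel_step(0.0, shards)
# Same result as a single worker seeing both samples at once.
```

Because every replica sees the same averaged gradient, model parameters stay bit-for-bit synchronized across processes without any extra coordination.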
7. AllReduce
• AllReduce is the primitive communication API used by
DistributedDataParallel to compute gradient summation across all
processes.
• It is supported by multiple communication libraries, including NCCL,
Gloo, and MPI.
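The AllReduce semantics DDP relies on can be illustrated with a minimal pure-Python sketch (real backends such as NCCL and Gloo use far more efficient ring or tree algorithms; the function name `all_reduce` here is just for illustration, not a library API):

```python
def all_reduce(buffers):
    # Sum the i-th element across all processes' buffers, then hand the
    # reduced result back to every process, so each one ends up holding
    # the same summed "tensor" -- the contract AllReduce provides.
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 processes, 2-element gradients
reduced = all_reduce(grads)
# Every process now holds [9.0, 12.0]; dividing by the world size (3)
# would yield the averaged gradient used for the parameter update.
```

In DDP the summed gradients are divided by the number of processes, so every replica applies the same averaged gradient.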
9. What are the requirements for the API?
• Non-intrusive:
The API must be non-intrusive to applications.
• Interceptive:
The API needs to allow the implementation to
intercept various signals and trigger appropriate
algorithms promptly.
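The "interceptive" requirement can be illustrated with a toy hook mechanism (a simplified stand-in for autograd hooks; the `Param` class and its methods are hypothetical, not PyTorch APIs): the framework fires a signal when each gradient is ready, and a DDP-like wrapper registers callbacks on those signals so communication can start promptly, without the application changing its training loop.

```python
# Toy hook mechanism: registering a callback intercepts the
# "gradient ready" signal for each parameter, mimicking how a DDP-style
# wrapper hooks into the backward pass non-intrusively.

class Param:
    def __init__(self, name):
        self.name = name
        self.hooks = []

    def register_hook(self, fn):
        # Non-intrusive: the application's model code never changes.
        self.hooks.append(fn)

    def gradient_ready(self, grad):
        # Signal fired by the (simulated) autograd engine.
        for fn in self.hooks:
            fn(self.name, grad)

fired = []
params = [Param("w1"), Param("w2")]
for p in params:
    # A DDP-like wrapper would launch communication here instead.
    p.register_hook(lambda name, g: fired.append((name, g)))

params[0].gradient_ready(0.5)
params[1].gradient_ready(-1.0)
# fired now records each gradient the moment it became available.
```

Firing per-parameter signals is what lets the real implementation overlap gradient communication with the remainder of the backward pass.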
16. CONCLUSION
This paper explained the design and implementation of the
distributed data parallel module in PyTorch v1.5, and
presented performance evaluations of the NCCL and Gloo backends
using ResNet50 and BERT models.