AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta

Wanchao Liang
PyTorch, Meta
Composable PyTorch Distributed with PT2

Large Scale Training Challenges
The LLM evolution poses significant challenges to distributed training, especially on large
number of GPUs
● Model Size (~100x each
generation):
○ GPT-2: 1.5B
○ GPT-3: 175B
○ LLaMa: 65B
● Data Size:
○ GPT-2: 40GB
○ GPT-3: 570GB
○ LLaMa: 1.4T tokens
○ LLaMa2: 2T tokens
● Number of GPUs
○ LLaMa: 2k GPUs
○ # GPUs required for LLM
training surging
LLMs parameters count growth throughout years. Source:
https://huggingface.co/blog/large-language-models

● Different parallelisms needed to enable larger model training (i.e. 3-D
Parallel, etc.)
○ Data Parallel (Sharded Data Parallel/Hybrid Data Parallel)
○ Tensor Parallel/Sequence Parallel
○ Pipeline Parallel
● Solutions are built independently without considering composability
● Complicated process_groups/devices management for 2-D, 3-D
Parallelisms
● Distributed state_dict save/load becomes messier as we compose
different parallelisms together
● Computation/Communication optimizations are hand tuned within each
parallelism
Why we need composable Distributed Training

Composable PyTorch Distributed
DeviceMesh: The higher level abstraction
that manages ProcessGroups
cuda:0 cuda:1
cuda:0 cuda:1
Host 0
Host 1
2 host with 2 GPUs each, represented
as a 2-D mesh [[0, 1], [2, 3]]

cuda:0 cuda:1
cuda:0 cuda:1
Host 0
Host 1
as a 2-D mesh [[0, 1], [2, 3]]
Why useful?
Distributed setups before:

cuda:0 cuda:1
cuda:0 cuda:1
Host 0
Host 1
as a 2-D mesh [[0, 1], [2, 3]]
Why useful?
Now:

DTensor
A fundamental distributed tensor abstraction that performs tensor level sharding computation
● Provides uniform Tensor Sharding
Layout to represents different
parallelisms state_dict
● Perform Sharded computation
easily in SPMD style.
● The backend of PyTorch native
Tensor Parallel APIs

DTensor
A fundamental distributed tensor abstraction that performs tensor level sharding computation
● Provides uniform Tensor Sharding
Layout to represents different
parallelisms state_dict
● Perform Sharded computation
easily in SPMD style.
● The backend of PyTorch native
Tensor Parallel APIs
Placement Types
● Shard(tensor_dim): shard on
tensor dimension on a device
mesh dimension
● Replicate: replicate on a device
mesh dimension
6 8 10 12
6, 8,
10, 12
6, 8
10, 12
6, 8
10, 12
6, 8
10, 12
6, 8 6,8
10, 12 10, 12
[Shard(0)]
Data: [6, 8, 10, 12]
[Replicate()]
DeviceMesh: ([0, 1, 2, 3]) DeviceMesh: ([[0, 1], [2, 3]])
[Shard(0), Replicate()]

● Tensor Parallel (TP) + Fully Sharded
Data Parallel (FSDP) is one popular
way for LLM training
● Composing TP and FSDP together
scales up model training efficiently
Challenges:
● Tensor Parallel usually very intrusive
to the model code, diverges model
code with training code, results in
maintenance burden
● Compose Tensor Parallel and FSDP
together exposes challenges to
checkpoint save/load
Introducing native PyTorch 2-D Parallel API

● Within-host FSDP,
cross-host DDP
● Similar 2-D setup to TP +
FSDP
Introducing HSDP
(Hybrid Sharded Data Parallel)
DeviceMesh composes 1-D, 2-D, … N-D Parallelism in an easy to use way!
And, it allows parallelisms to generate uniform state_dict by using DTensor,
enables efficient checkpoint save/load, resharding

● Distributed Checkpoint (beta in 2.1)
can save/load 2-D Parallel
workloads seamlessly
● Supports efficient sharded
state_dict save/load without writing
redundant copies
● Supports checkpoint load
resharding to a different world size
● Supports save in one type of
parallelism, load in another type of
parallelism (i.e. 2-D parallel to 1-D
parallel)
Distributed Checkpoint + 2-D Parallel

PT-2(D): Compile the PT-D Parallelisms
Torch.compile supports many PT-D Parallelism
solutions as of today’s nightly build:
● DDP
● FSDP
● Tensor Parallel
● Tensor Parallel + FSDP
Torch.compile enabling:
● Computation fusion with TorchInductor
● Captures collectives in Tensor Parallel to
allow better compute/communication
overlap in both forward and backward
Torch.compile
work out of box!
(nightly build)

(features available in nightly build)
Thank You

AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta

Recommended

Recommended

More Related Content

Similar to AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta

Similar to AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta (20)

More from Alluxio, Inc.

More from Alluxio, Inc. (20)

Recently uploaded

Recently uploaded (20)

AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta