Fine-tuning a large language model without your own supercomputer
Sylvain Gugger, Research Engineer at Hugging Face
A bit about myself
Transformers, a bit of history
• June 2017: Introduction of the Transformer architecture
• June 2018: GPT model (OpenAI)
• October 2018: BERT model (Google)
• February 2019: GPT-2 model (OpenAI)
• October 2019: DistilBERT model (Hugging Face)
• October 2019: T5 model (Google)
• October 2019: BART model (Facebook)
• May 2020: GPT-3 model (OpenAI)
Pretraining a Transformers model is long and expensive
Transfer learning has become the go-to technology in NLP
Large language model: semi-supervised training on lots of data
Task-specific model: supervised fine-tuning of the large model on your task
Examples of language model pretraining objectives
My name is Sylvain .
Guess the next word in the sentence (GPT)
My [mask] is Sylvain . → name
Guess some masked words in the sentence (BERT)
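Both objectives reduce to predicting a word from the vocabulary given some context. A toy illustration (not a real model, the vocabulary and logits are made up):

```python
import torch
import torch.nn.functional as F

# Toy vocabulary; a real model would have tens of thousands of tokens.
vocab = ["My", "name", "is", "Sylvain", "."]

# Pretend these are the logits a model produced over the vocabulary
# for the position to fill ("name" has the highest score here).
logits = torch.tensor([0.1, 4.2, 0.3, 0.2, 0.1])
probs = F.softmax(logits, dim=-1)

# Causal LM (GPT): the context is everything to the LEFT of the position.
# Masked LM (BERT): the context is both sides; the position holds [mask].
prediction = vocab[int(probs.argmax())]
print(prediction)  # "name"
```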
Converting a pretrained model for transfer learning
Pretraining: Tensor<int>(bs, sl) → Embeddings → Tensor<float>(bs, sl, nh) → Model → Tensor<float>(bs, sl, nh) → Decoder → Tensor<float>(bs, sl, vs)
Transfer learning: Tensor<int>(bs, sl) → Embeddings → Tensor<float>(bs, sl, nh) → Model → Tensor<float>(bs, sl, nh) → New head → Tensor<float>(bs, sl, d)
(bs: batch size, sl: sequence length, nh: hidden size, vs: vocabulary size, d: task output size)
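A minimal sketch of swapping the pretraining head for a task head; the sizes and the tiny "body" model are made up for illustration:

```python
import torch
import torch.nn as nn

# Made-up sizes: nh = hidden size, vs = vocab size, d = number of labels.
bs, sl, nh, vs, d = 2, 8, 16, 100, 3

body = nn.Sequential(          # stands in for Embeddings + Model
    nn.Embedding(vs, nh),
    nn.Linear(nh, nh),
)
pretraining_head = nn.Linear(nh, vs)   # "Decoder": predicts the vocabulary
task_head = nn.Linear(nh, d)           # "New head": predicts d classes

input_ids = torch.randint(0, vs, (bs, sl))   # Tensor<int>(bs, sl)
hidden = body(input_ids)                     # Tensor<float>(bs, sl, nh)
print(pretraining_head(hidden).shape)        # torch.Size([2, 8, 100])
print(task_head(hidden).shape)               # torch.Size([2, 8, 3])
```

The body keeps its pretrained weights; only the new head starts from random initialization.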
Practical examples
 Text classification (open in Colab)
 Summarization (open in Colab)
 See more examples on the 🤗 Transformers repo:
 scripts
 notebooks
Transformers are big models… and they keep getting bigger!
So we need to use special techniques to fine-tune them.
Here are some techniques that can help:
• Reduce the batch size
• Gradient accumulation
• Gradient checkpointing
• ZeRO-offload
• ZeRO-DP / Sharded DDP
• Model parallelism
• Pipeline parallelism
Fine-tuning a big model without any care will often fail with a CUDA out-of-memory error.
Schematically, a training loop looks like this.
Input data → Forward pass → Loss → Backward pass → Gradients → Optimizer step
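The four steps of the diagram, written as a minimal PyTorch loop (toy model and random data, for illustration only):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for inputs, targets in [(torch.randn(4, 10), torch.randint(0, 2, (4,)))]:
    outputs = model(inputs)            # forward pass
    loss = loss_fn(outputs, targets)   # loss
    loss.backward()                    # backward pass -> gradients
    optimizer.step()                   # optimizer step
    optimizer.zero_grad()
```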
What takes GPU memory during training?
Forward pass → Loss → Backward pass → Gradients → Optimizer step
• Model weights
• Intermediate activations
• Gradients
• Optimizer state
• Gradients of intermediate activations
Reducing the
batch size does
not reduce the
size of all those
objects.
Same size as the model (independent of batch size or sequence length):
• Model
• Gradients of the model
• Optimizer state (2x for Adam)
Dependent on batch size or sequence length:
• Intermediate activations
• Gradient of intermediate activations
In the text classification example, the DistilBERT model takes 255 MB of memory.
• Intermediate activations + their gradients: 3,625 MB (scale with batch size)
• Model + gradients + optimizer state (2x model for Adam): 1,020 MB (4 x 255 MB)
Gradient accumulation
Train at batch size 32 while using the memory footprint of batch size 4
with 8 steps of gradient accumulation
total_batch_size = dataloader_batch_size x gradient_accumulation_steps
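A sketch of the technique: accumulate gradients over 8 small batches of 4 before each optimizer step, so the update matches a batch of 32 while only batch-size-4 activations are ever in memory (toy model and random data):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
accumulation_steps = 8

small_batches = [
    (torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(16)
]
for step, (inputs, targets) in enumerate(small_batches):
    loss = loss_fn(model(inputs), targets)
    # Divide so the accumulated gradient matches the big-batch average.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```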
Gradient checkpointing
Don’t store all intermediate activations during the forward pass.
Recompute the missing ones as needed during the backward pass.
Traditional forward/backward
Forward/backward with a checkpoint
GIFs from Yaroslav Bulatov: original article
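A minimal sketch with `torch.utils.checkpoint` (assuming a recent PyTorch that accepts the `use_reentrant` argument): the checkpointed layer's activations are not stored in the forward pass, and its forward runs a second time during the backward pass.

```python
import torch
from torch.utils.checkpoint import checkpoint

layer1 = torch.nn.Linear(10, 10)
layer2 = torch.nn.Linear(10, 10)

x = torch.randn(4, 10, requires_grad=True)
h = checkpoint(layer1, x, use_reentrant=False)  # activations recomputed later
out = layer2(h).sum()
out.backward()  # layer1's forward runs again here to rebuild activations
print(x.grad.shape)  # gradients still flow: torch.Size([4, 10])
```

You trade compute for memory: roughly one extra forward pass per checkpointed segment.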
Training in FP16
Up until now, everything was stored in FP32 floats.
Using FP16 floats would take half the space in memory!
Problems:
• small numbers become zero much quicker
• large numbers become infinity much quicker
Training fully in FP16 diverges.
Mixed precision training
Input data → Forward pass (FP16) → Loss (FP32, scaled) → Backward pass (FP16) → Gradients (FP16 → FP32) → Optimizer step (FP32)
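The diagram can be sketched with PyTorch's automatic mixed precision (assuming a PyTorch version with `torch.autocast` and `GradScaler`); on a machine without CUDA this degrades to a plain FP32 loop:

```python
import torch
import torch.nn as nn

use_amp = torch.cuda.is_available()
device = "cuda" if use_amp else "cpu"

model = nn.Linear(10, 2).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # scales the loss

inputs = torch.randn(4, 10, device=device)
targets = torch.randint(0, 2, (4,), device=device)

with torch.autocast(device_type=device, enabled=use_amp):  # FP16 forward
    loss = nn.functional.cross_entropy(model(inputs), targets)

scaler.scale(loss).backward()   # backward on the scaled loss
scaler.step(optimizer)          # unscales gradients, FP32 optimizer step
scaler.update()                 # adjusts the scale factor for next step
optimizer.zero_grad()
```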
Mixed precision impact on memory
• Activations: FP16 (x 0.5)
• Model: FP16 + FP32 (x 1.5)
• Gradients: FP32
• Optimizer state: FP32
Mixed precision training requires two copies of the model.
Mixed precision doesn’t necessarily use less memory
Chart: GPU memory used in fine-tuning BERT on MRPC (activations: x 0.5; model: + 0.5 x model)
ZeRO-offload makes mixed precision memory efficient
GPU: activations (FP16), model (FP16)
CPU: model (FP32), optimizer state
With ZeRO-offload, only the forward and backward passes are done on the GPU.
The optimizer step is done on the CPU.
GPU: Forward pass → Loss → Backward pass → Gradients
CPU: Optimizer step
Allows you to fine-tune t5-3b on a single 24GB RTX 3090 card!
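A sketch of the kind of DeepSpeed configuration this corresponds to: ZeRO stage 2 with the optimizer state offloaded to the CPU, plus FP16 training (key names follow the DeepSpeed config schema; the `"auto"` values are placeholders the 🤗 Trainer integration fills in, and exact values depend on your setup):

```json
{
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" }
  },
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto"
}
```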
Sharded Data Parallelism (ZeRO-DP stage 2)
A different way of dealing with the gradients and optimizer states
is to shard them across the GPUs.
Model parallelism (vertical)
If you have several GPUs, a natural idea is to try to split your model
and put different layers on different devices.
This is pretty inefficient however: only one GPU is active at a time while the others sit idle.
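A minimal sketch of vertical model parallelism: different layers live on different devices, and activations are moved between them (falls back to CPU-only when fewer than two GPUs are available):

```python
import torch
import torch.nn as nn

two_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0") if two_gpus else torch.device("cpu")
dev1 = torch.device("cuda:1") if two_gpus else torch.device("cpu")

layer1 = nn.Linear(10, 10).to(dev0)
layer2 = nn.Linear(10, 2).to(dev1)

x = torch.randn(4, 10, device=dev0)
h = layer1(x)              # runs on dev0 (dev1 is idle)
out = layer2(h.to(dev1))   # activations copied to dev1 (dev0 is idle)
print(out.shape)  # torch.Size([4, 2])
```

The idleness visible in the comments is exactly what pipeline parallelism addresses.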
Pipeline parallelism
To be more efficient you have to execute the forward passes
of several mini-batches in parallel.
What exists now and where to find it.
Trainer PyTorch Fairscale Deepspeed SageMaker
Gradient accumulation ✅ ✅ ✅ ✅ ✅
Gradient checkpointing ✅ ✅ ✅ ✅ ✅
Mixed precision ✅ ✅ ✅ ✅ ✅
ZeRO-offload ✅ ✅ ✅
Sharded DDP/ZeRO 2 ✅ ✅ ✅
Model Parallelism ✅
Pipeline Parallelism ✅ ✅ ✅ ✅
ZeRO-DP stage 3 ✅ ✅
More resources
 Deep learning for coders with fastai and PyTorch
 Free notebook version of the book
 Trainer documentation
 Example notebooks using Transformers
 Example scripts using Transformers
 Blog post by Stas Bekman on ZeRO/sharded DDP
 ZeRO paper
 Pipeline Parallelism in PyTorch