Accelerated Training of Transformer Models

Kaarthik Sivashanmugam – Principal Engineering Manager
Sherlock Huang – Principal Engineer
Azure AI - Frameworks

Agenda
▪ ONNX Runtime for Training
  ▪ Introduction
  ▪ Integration with training frameworks
▪ Acceleration & Native Capabilities
  ▪ Memory usage and execution optimizations
  ▪ Mixed precision training, distributed training parallelism modes, gradient checkpointing, AdaSum, DeepSpeed ZeRO
▪ Training Recipes & Perf Results
  ▪ Pretraining and finetuning: BERT, GPT-2, Turing
▪ Demo: ONNX Runtime Training in Azure Databricks

ONNX: an open and interoperable format for ML models

ONNX Spec
▪ ONNX IR (intermediate representation)
▪ ONNX Operator schema
  ▪ Operation type
  ▪ Attributes
  ▪ Inputs/outputs
  ▪ Shape inference function
▪ https://onnx.ai/
▪ https://github.com/onnx/onnx/blob/master/docs/Operators.md

Example operator node: Gemm
  Inputs:     A (batch x 128) = X, B (128 x 256) = weight, C (256) = bias
  Outputs:    Y (batch x 256)
  Attributes: alpha: 0.7, beta: 0.5
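
The Gemm node on this slide computes Y = alpha * (A x B) + beta * C, with the bias C broadcast across the batch dimension (this is the ONNX Gemm definition with no transposes). A minimal pure-Python sketch of that arithmetic follows. The tiny shapes (batch of 2, 3 inputs, 2 outputs) and the function name `gemm` are illustrative assumptions, not the 128 x 256 shapes from the slide and not ONNX Runtime code.

```python
def gemm(A, B, C, alpha=1.0, beta=1.0):
    """Y = alpha * (A @ B) + beta * C, with the 1-D bias C broadcast over rows."""
    rows, inner, cols = len(A), len(B), len(B[0])
    assert all(len(row) == inner for row in A), "A columns must match B rows"
    assert len(C) == cols, "bias length must match output columns"
    return [
        [alpha * sum(A[i][k] * B[k][j] for k in range(inner)) + beta * C[j]
         for j in range(cols)]
        for i in range(rows)
    ]

# Hypothetical small inputs standing in for X, weight and bias.
A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]      # (batch=2 x 3)
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # (3 x 2)
C = [10.0, 20.0]                            # (2)

# Same attribute values as on the slide.
Y = gemm(A, B, C, alpha=0.7, beta=0.5)      # (batch=2 x 2)
```

The output shape follows the same rule as the slide: (batch x inner) times (inner x cols) plus a bias of length cols gives (batch x cols).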

ONNX Model
▪ Graph composed of computational nodes
▪ Built-in and custom operators

ONNX Runtime (ORT)
▪ Cross-platform accelerator for training and inferencing
▪ Core part of the ML stack at Microsoft for innovations from the company and industry
ORT Training
▪ Adopted by first-party (1P) and third-party (3P) workloads for acceleration
▪ Current focus on large transformer models (based on demand and acceleration needs)
▪ Extensible and supports PyTorch, Keras/TensorFlow, …

Training & ORT Acceleration
[Diagram] Training steps: Define Model; Get Data Batch; Compute Loss; Compute Gradients & Update Weights; Evaluate; Checkpoint (Train Loop)
[Diagram] Acceleration scope: Create ORTTrainer using the model; ORTTrainer.train_step()

ORT in PyTorch

PyTorch

import torch

# Model definition
class NeuralNet(torch.nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        ...
    def forward(self, x):
        ...

model = NeuralNet(input_size=784, hidden_size=500, num_classes=10)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

# Training loop (x, y: one batch of inputs and targets)
for t in range(1000):
    # forward
    y_pred = model(x)
    loss = criterion(y_pred, y)
    # reset gradient buffer
    optimizer.zero_grad()
    # backward
    loss.backward()
    # weight update
    optimizer.step()

PyTorch + ONNX Runtime backend

import torch
from onnxruntime.training import ORTTrainer, optim

# Model definition
class NeuralNet(torch.nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        ...
    def forward(self, x):
        ...

model = NeuralNet(input_size=784, hidden_size=500, num_classes=10)
criterion = torch.nn.functional.cross_entropy

# Names and shapes of the model inputs and outputs; the loss output is flagged True
model_description = {
    'inputs': [('data', ['in', 'batch_size']),
               ('target', ['label_x_batch_size'])],
    'outputs': [('loss', [], True),
                ('output', ['out', 'batch_size'])]
}
optimizer_config = optim.AdamConfig(lr=learning_rate)
trainer = ORTTrainer(model, model_description, optimizer_config, criterion)

# Training loop (x, y: one batch of inputs and targets)
for t in range(1000):
    # forward + backward + weight update
    loss, y_pred = trainer.train_step(x, y)

ORT Frontend Adapters
[Diagram] Two frontend adapters feed the ORT TrainingSession Python API, which sits on top of ONNX Runtime:
▪ PyTorch Script: PyTorch ORTTrainer, To ONNX, GPU buffer
▪ TF/Keras Script: TF ORTTrainer, To ONNX, GPU buffer

Acceleration & Native Capabilities

Contributors to ORT Acceleration
▪ Optimal Gradient Graph
  ▪ Memory and compute optimized using global knowledge of data dependencies
▪ CUDA Kernel Optimizations
  ▪ Op fusion
  ▪ Reimplemented cuDNN kernels
  ▪ Removed redundant computation
▪ Graph Optimizations
  ▪ Static graph optimization techniques like constant folding and redundant node elimination
▪ Memory Efficiency
  ▪ Static graph used for preallocation of memory for weights and gradients
  ▪ Memory reuse
▪ Other Training Capabilities
  ▪ Mixed precision training
  ▪ Distributed training parallelism modes
  ▪ Gradient checkpointing
  ▪ AdaSum
  ▪ DeepSpeed ZeRO
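
To make the static graph optimizations named on this slide concrete, here is a toy sketch of constant folding and redundant node elimination on a tiny expression graph. It is not ONNX Runtime's implementation. The graph encoding (a list of `(output, op, inputs)` tuples in topological order), the two-operator table and the function name `optimize` are all assumptions made for illustration.

```python
import operator

# Hypothetical operator table for the toy graph.
OPS = {"Add": operator.add, "Mul": operator.mul}

def optimize(nodes, constants):
    """Fold constant nodes and drop duplicate nodes.

    nodes: list of (output, op, (in1, in2)) in topological order.
    Returns (remaining_nodes, constants, renamed_outputs).
    """
    constants = dict(constants)
    seen = {}      # (op, inputs) -> output name of the first equivalent node
    renamed = {}   # eliminated output -> surviving output
    remaining = []
    for out, op, inputs in nodes:
        # Point inputs at surviving nodes if their producer was eliminated.
        inputs = tuple(renamed.get(name, name) for name in inputs)
        if all(name in constants for name in inputs):
            # Constant folding: evaluate once, ahead of training.
            constants[out] = OPS[op](*(constants[name] for name in inputs))
        elif (op, inputs) in seen:
            # Redundant node elimination: reuse the earlier identical node.
            renamed[out] = seen[(op, inputs)]
        else:
            seen[(op, inputs)] = out
            remaining.append((out, op, inputs))
    return remaining, constants, renamed

nodes = [
    ("scale", "Mul", ("two", "three")),  # both inputs constant
    ("a", "Mul", ("x", "scale")),
    ("b", "Mul", ("x", "scale")),        # duplicate of "a"
    ("y", "Add", ("a", "b")),
]
remaining, folded, renamed = optimize(nodes, {"two": 2.0, "three": 3.0})
```

Four nodes shrink to two: `scale` becomes a precomputed constant and `b` is replaced by `a`. A real optimizer would also have to handle graph outputs that get renamed, which this sketch ignores.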

Native Capabilities in ORT
▪ Mixed Precision Training
  ▪ 16-bit and 32-bit FP types to make training faster and use less memory
▪ Distributed Training Modes
  ▪ Parallelism modes: Data, Horizontal and Pipeline
▪ Gradient Accumulation
  ▪ Computed gradients are accumulated into a gradient buffer using partial execution of the graph, repeated for N steps
  ▪ Averaged gradients are used in the optimizer for weight updates
▪ Gradient Checkpoint
  ▪ Stashed activations often dominate memory consumption in training
  ▪ Recompute discarded activations when needed
  ▪ Trade-off between memory usage and computation cost
▪ AdaSum
  ▪ Combines gradients in a novel way to improve convergence
  ▪ Model converges faster
▪ DeepSpeed ZeRO (Zero Redundancy Optimizer)
  ▪ Optimizer State Partitioning
  ▪ Gradient Partitioning
  ▪ Parameter Partitioning
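
The gradient accumulation item on this slide can be sketched in a few lines of plain Python. This is an illustration of the idea only, not ORT code. The one-parameter model y = w * x, the mean-squared-error loss, plain SGD and the helper names `grad_mse` and `accumulated_step` are all assumptions.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_step(w, micro_batches, lr):
    """One optimizer step using gradients accumulated over N micro-batches."""
    grad_buffer = 0.0
    for xs, ys in micro_batches:
        # Forward + backward only; no weight update inside this loop.
        grad_buffer += grad_mse(w, xs, ys)
    avg_grad = grad_buffer / len(micro_batches)
    # A single weight update with the averaged gradient (plain SGD here).
    return w - lr * avg_grad

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# N = 2 equally sized micro-batches covering the same four examples.
micro_batches = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
w_accum = accumulated_step(0.0, micro_batches, lr=0.01)

# Reference: one step on the full batch at once.
w_full = 0.0 - 0.01 * grad_mse(0.0, xs, ys)
```

With equally sized micro-batches the averaged accumulated gradient equals the full-batch gradient, so both paths produce the same updated weight while only one micro-batch needs to be held in memory at a time.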

Code Sample & Training Recipes

BERT Pretraining using ORT
https://github.com/microsoft/onnxruntime-training-examples/

Training Recipes
▪ BERT Pretraining
  ▪ NVIDIA's implementation of BERT pretraining accelerated using ORT
  ▪ https://github.com/microsoft/onnxruntime-training-examples/tree/master/nvidia-bert
▪ GPT-2 Finetuning
  ▪ Finetuning of Hugging Face GPT-2 model
  ▪ https://github.com/microsoft/onnxruntime-training-examples/tree/master/huggingface-gpt2
▪ Turing Finetuning
  ▪ Finetuning of Microsoft Turing model for abstractive text summarization, sentiment analysis and suggested reply scenarios
  ▪ https://github.com/microsoft/Turing-NLR (private preview)

Performance Improvement Results

BERT Pretraining in 4xDGX-2

Metric                        PyTorch 1.5 with NGC 20.03-py3   PyTorch 1.5 with ONNX Runtime   % Gain with ONNX Runtime
Phase 1 Throughput (ex/sec)   11522.1                          12826.2                         11.32%
Phase 2 Throughput (ex/sec)   2150.0                           2464.1                          14.61%
Phase 1 time (hours)          11.12                            9.99                            10.16%
Phase 2 time (hours)          6.62                             5.77                            12.84%
Total time (hours)            17.74                            15.76                           11.16%

PyTorch w/ ORT can train with 2x the local batch size as PyTorch w/o ORT (global batch size was kept the same for comparison)
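
The slide does not state how the "% Gain" column is derived. The short check below assumes throughput gain is the relative increase over the baseline and time gain is the relative reduction from the baseline; under that assumption it reproduces all five percentages from the table. The function and variable names are illustrative.

```python
def throughput_gain(baseline, ort):
    """Relative throughput increase over the baseline, in percent."""
    return (ort - baseline) / baseline * 100.0

def time_reduction(baseline, ort):
    """Relative training-time reduction from the baseline, in percent."""
    return (baseline - ort) / baseline * 100.0

# (baseline PyTorch, PyTorch + ONNX Runtime) pairs copied from the table.
throughput = {"phase1": (11522.1, 12826.2), "phase2": (2150.0, 2464.1)}
hours = {"phase1": (11.12, 9.99), "phase2": (6.62, 5.77), "total": (17.74, 15.76)}

gains = {
    "phase1_throughput": round(throughput_gain(*throughput["phase1"]), 2),
    "phase2_throughput": round(throughput_gain(*throughput["phase2"]), 2),
    "phase1_time": round(time_reduction(*hours["phase1"]), 2),
    "phase2_time": round(time_reduction(*hours["phase2"]), 2),
    "total_time": round(time_reduction(*hours["total"]), 2),
}
```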

Perf Improvement with ORT

Model (Scenario)             # Params   Perf improvement w/ ORT
Turing* (pretraining)        340M       1.4x
Turing* (pretraining)        350M       1.2x
RoBERTa XL (pretraining)     500M       3x
RoBERTa XL (finetuning)      500M       1.2x
RoBERTa XXL (pretraining)    1B         7x
GPT-2 M (pretraining)        345M       1.2x

* https://msturing.org/

Demo: ONNX Runtime Training in Azure Databricks
https://github.com/skaarthik/onnxruntime-training-databricks

Summary
▪ Optimize and accelerate model training using ONNX Runtime (ORT)
▪ ORT is used in training very large models used in various Microsoft products/services
▪ https://github.com/microsoft/onnxruntime
▪ https://github.com/microsoft/onnxruntime-training-examples

Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
