SlideShare a Scribd company logo
Accelerated Training of
Transformer Models
Kaarthik Sivashanmugam – Principal Engineering Manager
Sherlock Huang – Principal Engineer
Azure AI - Frameworks
Agenda
ONNX Runtime for Training
Introduction
Integration with training frameworks
Acceleration & Native Capabilities
Memory usage and execution optimizations
Mixed precision training, Distributed training parallelism
modes, Gradient checkpointing, AdaSum, DeepSpeed
ZeRO
Training Recipes & Perf Results
Pretraining and finetuning: BERT, GPT-2, Turing
Demo: ONNX Runtime Training in Azure Databricks
Intro: ONNX, ONNX Runtime
ONNX: an open and interoperable format for ML models
ONNX IR (intermediate representation)
ONNX Operator schema
Operation type
Attributes
Inputs/outputs
Shape inference function
https://onnx.ai/
https://github.com/onnx/onnx/blob/master/docs/Operators.md
Y
weight
(128 x 256)
(128 x 256)
(batch x 256)
X
(batch x 128)
bias
(256)
(256)
Inputs
A (batch x 128)
B (128 x 256)
C (256)
Outputs
Y (batch x 256)
Attributes
alpha: 0.7
beta: 0.5
Gemm
ONNX Spec
Graph composed of computational
nodes
Built-in and custom operators
ONNX Model
ONNX Runtime (ORT)
Cross-platform accelerator for training and inferencing
Core part of ML stack at Microsoft for innovations from the company
and industry
ORT Training
Adopted by 1P and 3P workloads for acceleration
Current focus on large transformer models (based on demand and acceleration needs)
Extensible and supports PyTorch, Keras/Tensorflow, …
ONNX Runtime for Training
Training & ORT Acceleration
Define Model
Get Data Batch
Compute Loss
Compute Gradients
& Update Weights
Evaluate
Train
Loop
Acceleration scope
Create ORTTrainer
using the model
ORTTrainer.train_step()
Checkpoint
import torch
from onnxruntime.training import ORTTrainer, optim
# Model definition
class NeuralNet(torch.nn.Module):
def __init__(self, input_size, hidden_size, num_classes):
...
def forward(self, x):
...
model = NeuralNet(input_size=784, hidden_size=500, num_classes=10)
criterion = torch.nn.Functional.cross_entropy
model_description =
{'inputs': [('data', ['in', 'batch_size']),
('target', ['label_x_batch_size'])],
'outputs’: [('loss', [], True),
('output', ['out', 'batch_size’])]
}
optimizer_config = optim.AdamConfig(lr=learning_rate)
trainer = ORTTrainer(model, model_description, optimizer_config,
optimizer configuration, criterion)
# Training Loop
for t in range(1000):
# forward + backward + weight update
loss, y_pred = trainer.train_step(x, y)
ORT in PyTorch
PyTorch PyTorch + ONNX Runtime backend
import torch
# Model definition
class NeuralNet(torch.nn.Module):
def __init__(self, input_size, hidden_size, num_classes):
...
def forward(self, x):
...
model = NeuralNet(input_size=784, hidden_size=500, num_classes=10)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
# Training Loop
for t in range(1000):
# forward
y_pred = model(x)
loss = criterion(y_pred, y)
# reset gradient buffer
optimizer.zero_grad()
# backward
loss.backward()
# weight update
optimizer.step()
ONNXRuntime
ORT TrainingSession Python API
PyTorch Script
PyTorch
ORTTrainer
To ONNX
GPU
buffer
TF/Keras Script
TF
ORTTrainer
To ONNX
GPU
buffer
ORT Frontend Adapters
Acceleration & Native Capabilities
Contributors to ORT Acceleration
Optimal
Gradient Graph
CUDA Kernel
Optimizations
Graph
Optimizations
Memory
Efficiency
Other Training
Capabilities
Static graph optimization
techniques like constant
folding, redundant node
elimination
Memory and compute
optimized using global
knowledge of data
dependencies
Static graph used for
preallocation of memory
for weights and gradients
Memory reuse
Op fusion
Reimplemented cuDNN
kernels
Removed redundant
computation
Mixed precision training
Distributed training
parallelism modes
Gradient checkpointing
AdaSum
DeepSpeed ZeRO
Native Capabilities in ORT
Distributed
Training
Modes
Gradient
Checkpoint
Mixed
Precision
Training
Gradient
Accumulation
AdaSum
16-bit and 32-bit FP types to
make training faster and use
less memory
Parallelism modes: Data,
Horizontal and Pipeline
Computed gradients are
accumulated into gradient buffer
using partial execution of graph
repeated for N steps
Averaged gradients are used in
optimizer for weight updates
Stashed activations often
dominate memory consumption
in training
Recompute discarded
activations when needed.
Trade off between memory
usage vs. computation cost.
Combines gradients in a novel
way to improve convergence
Model converges faster
DeepSpeed
ZeRO
Redundancy
Optimizer
Optimizer State Partitioning
Gradient Partitioning
Parameter Partitioning
Code Sample & Training Recipes
BERT Pretraining using ORT
https://github.com/microsoft/onnxruntime-training-examples/
Training Recipes
▪ BERT Pretraining
▪ Nvidia’s implementation of BERT pretraining accelerated using ORT
▪ https://github.com/microsoft/onnxruntime-training-examples/tree/master/nvidia-bert
▪ GPT-2 Finetuning
▪ Finetuning of Hugging Face GPT-2 model
▪ https://github.com/microsoft/onnxruntime-training-examples/tree/master/huggingface-gpt2
▪ Turing Finetuning
▪ Finetuning of Microsoft Turing model for abstractive text summarization, sentiment analysis and suggested reply scenarios
▪ https://github.com/microsoft/Turing-NLR (private preview)
Performance Improvement Results
BERT Pretraining in 4xDGX-2
PyTorch 1.5 with
NGC 20.03-py3
PyTorch 1.5 with
ONNX Runtime
% Gain with
ONNX Runtime
Phase 1 Throughput (ex/sec) 11522.1 12826.2 11.32%
Phase 2 Throughput (ex/sec) 2150.0 2464.1 14.61%
Phase 1 time (hours) 11.12 9.99 10.16%
Phase 2 time (hours) 6.62 5.77 12.84%
Total time (hours) 17.74 15.76 11.16%
PyTorch w/ ORT can train with 2x the local batch size as PyTorch w/o ORT
(global batch size was kept the same for comparison)
Perf Improvement with ORT
Model (Scenario)/# Params Perf improvement w/ ORT
Turing* (pretraining)/340M 1.4x
Turing* (pretraining)/350M 1.2x
RoBERTa XL (pretraining)/500M 3x
RoBERTa XL (finetuning)/500M 1.2x
RoBERTa XXL (pretraining)/1B 7x
GPT-2 M(pretraining)/345M 1.2x
* https://msturing.org/
Demo: ONNX Runtime Training in Azure Databricks
https://github.com/skaarthik/onnxruntime-training-databricks
Summary
▪ Optimize and accelerate model
training using ONNX Runtime (ORT)
▪ ORT is used in training very large
models used in various Microsoft
products/services
▪ https://github.com/microsoft/onnxruntime
▪ https://github.com/microsoft/onnxruntime-
training-examples
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.

More Related Content

What's hot

Active Retrieval Augmented Generation.pdf
Active Retrieval Augmented Generation.pdfActive Retrieval Augmented Generation.pdf
Active Retrieval Augmented Generation.pdf
Po-Chuan Chen
 
How to fine-tune and develop your own large language model.pptx
How to fine-tune and develop your own large language model.pptxHow to fine-tune and develop your own large language model.pptx
How to fine-tune and develop your own large language model.pptx
Knoldus Inc.
 
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdfRetrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Po-Chuan Chen
 
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
David Talby
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERT
Suman Debnath
 
Best Practice on using Azure OpenAI Service
Best Practice on using Azure OpenAI ServiceBest Practice on using Azure OpenAI Service
Best Practice on using Azure OpenAI Service
Kumton Suttiraksiri
 
OpenAI API crash course
OpenAI API crash courseOpenAI API crash course
OpenAI API crash course
Dimitrios Platis
 
Google Vertex AI
Google Vertex AIGoogle Vertex AI
Google Vertex AI
VikasBisoi
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers
Arvind Devaraj
 
LLMs Bootcamp
LLMs BootcampLLMs Bootcamp
LLMs Bootcamp
Fiza987241
 
The current state of generative AI
The current state of generative AIThe current state of generative AI
The current state of generative AI
Benjaminlapid1
 
And then there were ... Large Language Models
And then there were ... Large Language ModelsAnd then there were ... Large Language Models
And then there were ... Large Language Models
Leon Dohmen
 
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Mihai Criveti
 
An introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging FaceAn introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging Face
Julien SIMON
 
Introduction to LLMs
Introduction to LLMsIntroduction to LLMs
Introduction to LLMs
Loic Merckel
 
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML EngineersIntro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Daniel Zivkovic
 
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudVertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Márton Kodok
 
Generative Models for General Audiences
Generative Models for General AudiencesGenerative Models for General Audiences
Generative Models for General Audiences
Sangwoo Mo
 
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Fwdays
 
Large Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfLarge Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdf
David Rostcheck
 

What's hot (20)

Active Retrieval Augmented Generation.pdf
Active Retrieval Augmented Generation.pdfActive Retrieval Augmented Generation.pdf
Active Retrieval Augmented Generation.pdf
 
How to fine-tune and develop your own large language model.pptx
How to fine-tune and develop your own large language model.pptxHow to fine-tune and develop your own large language model.pptx
How to fine-tune and develop your own large language model.pptx
 
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdfRetrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
 
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERT
 
Best Practice on using Azure OpenAI Service
Best Practice on using Azure OpenAI ServiceBest Practice on using Azure OpenAI Service
Best Practice on using Azure OpenAI Service
 
OpenAI API crash course
OpenAI API crash courseOpenAI API crash course
OpenAI API crash course
 
Google Vertex AI
Google Vertex AIGoogle Vertex AI
Google Vertex AI
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers
 
LLMs Bootcamp
LLMs BootcampLLMs Bootcamp
LLMs Bootcamp
 
The current state of generative AI
The current state of generative AIThe current state of generative AI
The current state of generative AI
 
And then there were ... Large Language Models
And then there were ... Large Language ModelsAnd then there were ... Large Language Models
And then there were ... Large Language Models
 
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
 
An introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging FaceAn introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging Face
 
Introduction to LLMs
Introduction to LLMsIntroduction to LLMs
Introduction to LLMs
 
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML EngineersIntro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
 
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudVertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
 
Generative Models for General Audiences
Generative Models for General AudiencesGenerative Models for General Audiences
Generative Models for General Audiences
 
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
 
Large Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfLarge Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdf
 

Similar to Accelerated Training of Transformer Models

A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
Databricks
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Chetan Khatri
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
Chetan Khatri
 
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
Universitat Politècnica de Catalunya
 
Performance and how to measure it - ProgSCon London 2016
Performance and how to measure it - ProgSCon London 2016Performance and how to measure it - ProgSCon London 2016
Performance and how to measure it - ProgSCon London 2016
Matt Warren
 
Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent ...
Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent ...Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent ...
Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent ...
Amazon Web Services
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
Gabriel Moreira
 
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Databricks
 
Entenda de onde vem toda a potência do Intel® Xeon Phi™
Entenda de onde vem toda a potência do Intel® Xeon Phi™ Entenda de onde vem toda a potência do Intel® Xeon Phi™
Entenda de onde vem toda a potência do Intel® Xeon Phi™
Intel Software Brasil
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Gabriel Moreira
 
Build, train, and deploy machine learning models at scale
Build, train, and deploy machine learning models at scaleBuild, train, and deploy machine learning models at scale
Build, train, and deploy machine learning models at scale
Amazon Web Services
 
Profiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsProfiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systems
Jack (Jaegeun) Han
 
Anirudh Koul. 30 Golden Rules of Deep Learning Performance
Anirudh Koul. 30 Golden Rules of Deep Learning PerformanceAnirudh Koul. 30 Golden Rules of Deep Learning Performance
Anirudh Koul. 30 Golden Rules of Deep Learning Performance
Lviv Startup Club
 
JMI Techtalk: 한재근 - How to use GPU for developing AI
JMI Techtalk: 한재근 - How to use GPU for developing AIJMI Techtalk: 한재근 - How to use GPU for developing AI
JMI Techtalk: 한재근 - How to use GPU for developing AI
Lablup Inc.
 
Workshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov
Workshop "Can my .NET application use less CPU / RAM?", Yevhen TatarynovWorkshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov
Workshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov
Fwdays
 
AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...
AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...
AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...
GeeksLab Odessa
 
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDistributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Databricks
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
Databricks
 
C3 w2
C3 w2C3 w2
Towards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei DiaoTowards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei Diao
Databricks
 

Similar to Accelerated Training of Transformer Models (20)

A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
 
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
 
Performance and how to measure it - ProgSCon London 2016
Performance and how to measure it - ProgSCon London 2016Performance and how to measure it - ProgSCon London 2016
Performance and how to measure it - ProgSCon London 2016
 
Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent ...
Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent ...Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent ...
Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent ...
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
 
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
 
Entenda de onde vem toda a potência do Intel® Xeon Phi™
Entenda de onde vem toda a potência do Intel® Xeon Phi™ Entenda de onde vem toda a potência do Intel® Xeon Phi™
Entenda de onde vem toda a potência do Intel® Xeon Phi™
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
 
Build, train, and deploy machine learning models at scale
Build, train, and deploy machine learning models at scaleBuild, train, and deploy machine learning models at scale
Build, train, and deploy machine learning models at scale
 
Profiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsProfiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systems
 
Anirudh Koul. 30 Golden Rules of Deep Learning Performance
Anirudh Koul. 30 Golden Rules of Deep Learning PerformanceAnirudh Koul. 30 Golden Rules of Deep Learning Performance
Anirudh Koul. 30 Golden Rules of Deep Learning Performance
 
JMI Techtalk: 한재근 - How to use GPU for developing AI
JMI Techtalk: 한재근 - How to use GPU for developing AIJMI Techtalk: 한재근 - How to use GPU for developing AI
JMI Techtalk: 한재근 - How to use GPU for developing AI
 
Workshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov
Workshop "Can my .NET application use less CPU / RAM?", Yevhen TatarynovWorkshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov
Workshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov
 
AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...
AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...
AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...
 
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDistributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
 
C3 w2
C3 w2C3 w2
C3 w2
 
Towards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei DiaoTowards a Unified Data Analytics Optimizer with Yanlei Diao
Towards a Unified Data Analytics Optimizer with Yanlei Diao
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
eddie19851
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 

Recently uploaded (20)

一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 

Accelerated Training of Transformer Models

  • 1. Accelerated Training of Transformer Models Kaarthik Sivashanmugam – Principal Engineering Manager Sherlock Huang – Principal Engineer Azure AI - Frameworks
  • 2. Agenda ONNX Runtime for Training Introduction Integration with training frameworks Acceleration & Native Capabilities Memory usage and execution optimizations Mixed precision training, Distributed training parallelism modes, Gradient checkpointing, AdaSum, DeepSpeed ZeRO Training Recipes & Perf Results Pretraining and finetuning: BERT, GPT-2, Turing Demo: ONNX Runtime Training in Azure Databricks
  • 4. ONNX: an open and interoperable format for ML models
  • 5. ONNX IR (intermediate representation) ONNX Operator schema Operation type Attributes Inputs/outputs Shape inference function https://onnx.ai/ https://github.com/onnx/onnx/blob/master/docs/Operators.md Y weight (128 x 256) (128 x 256) (batch x 256) X (batch x 128) bias (256) (256) Inputs A (batch x 128) B (128 x 256) C (256) Outputs Y (batch x 256) Attributes alpha: 0.7 beta: 0.5 Gemm ONNX Spec
  • 6. Graph composed of computational nodes Built-in and custom operators ONNX Model
  • 7. ONNX Runtime (ORT) Cross-platform accelerator for training and inferencing Core part of ML stack at Microsoft for innovations from the company and industry ORT Training Adopted by 1P and 3P workloads for acceleration Current focus on large transformer models (based on demand and acceleration needs) Extensible and supports PyTorch, Keras/Tensorflow, …
  • 8. ONNX Runtime for Training
  • 9. Training & ORT Acceleration Define Model Get Data Batch Compute Loss Compute Gradients & Update Weights Evaluate Train Loop Acceleration scope Create ORTTrainer using the model ORTTrainer.train_step() Checkpoint
  • 10. import torch from onnxruntime.training import ORTTrainer, optim # Model definition class NeuralNet(torch.nn.Module): def __init__(self, input_size, hidden_size, num_classes): ... def forward(self, x): ... model = NeuralNet(input_size=784, hidden_size=500, num_classes=10) criterion = torch.nn.Functional.cross_entropy model_description = {'inputs': [('data', ['in', 'batch_size']), ('target', ['label_x_batch_size'])], 'outputs’: [('loss', [], True), ('output', ['out', 'batch_size’])] } optimizer_config = optim.AdamConfig(lr=learning_rate) trainer = ORTTrainer(model, model_description, optimizer_config, optimizer configuration, criterion) # Training Loop for t in range(1000): # forward + backward + weight update loss, y_pred = trainer.train_step(x, y) ORT in PyTorch PyTorch PyTorch + ONNX Runtime backend import torch # Model definition class NeuralNet(torch.nn.Module): def __init__(self, input_size, hidden_size, num_classes): ... def forward(self, x): ... model = NeuralNet(input_size=784, hidden_size=500, num_classes=10) criterion = torch.nn.MSELoss() optimizer = torch.optim.SGD(model.parameters(), lr=1e-4) # Training Loop for t in range(1000): # forward y_pred = model(x) loss = criterion(y_pred, y) # reset gradient buffer optimizer.zero_grad() # backward loss.backward() # weight update optimizer.step()
  • 11. ONNXRuntime ORT TrainingSession Python API PyTorch Script PyTorch ORTTrainer To ONNX GPU buffer TF/Keras Script TF ORTTrainer To ONNX GPU buffer ORT Frontend Adapters
  • 12. Acceleration & Native Capabilities
  • 13. Contributors to ORT Acceleration Optimal Gradient Graph CUDA Kernel Optimizations Graph Optimizations Memory Efficiency Other Training Capabilities Static graph optimization techniques like constant folding, redundant node elimination Memory and compute optimized using global knowledge of data dependencies Static graph used for preallocation of memory for weights and gradients Memory reuse Op fusion Reimplemented cuDNN kernels Removed redundant computation Mixed precision training Distributed training parallelism modes Gradient checkpointing AdaSum DeepSpeed ZeRO
  • 14. Native Capabilities in ORT Distributed Training Modes Gradient Checkpoint Mixed Precision Training Gradient Accumulation AdaSum 16-bit and 32-bit FP types to make training faster and use less memory Parallelism modes: Data, Horizontal and Pipeline Computed gradients are accumulated into gradient buffer using partial execution of graph repeated for N steps Averaged gradients are used in optimizer for weight updates Stashed activations often dominate memory consumption in training Recompute discarded activations when needed. Trade off between memory usage vs. computation cost. Combines gradients in a novel way to improve convergence Model converges faster DeepSpeed ZeRO Redundancy Optimizer Optimizer State Partitioning Gradient Partitioning Parameter Partitioning
  • 15. Code Sample & Training Recipes
  • 16. BERT Pretraining using ORT https://github.com/microsoft/onnxruntime-training-examples/
  • 17. Training Recipes ▪ BERT Pretraining ▪ Nvidia’s implementation of BERT pretraining accelerated using ORT ▪ https://github.com/microsoft/onnxruntime-training-examples/tree/master/nvidia-bert ▪ GPT-2 Finetuning ▪ Finetuning of Hugging Face GPT-2 model ▪ https://github.com/microsoft/onnxruntime-training-examples/tree/master/huggingface-gpt2 ▪ Turing Finetuning ▪ Finetuning of Microsoft Turing model for abstractive text summarization, sentiment analysis and suggested reply scenarios ▪ https://github.com/microsoft/Turing-NLR (private preview)
  • 19. BERT Pretraining in 4xDGX-2 PyTorch 1.5 with NGC 20.03-py3 PyTorch 1.5 with ONNX Runtime % Gain with ONNX Runtime Phase 1 Throughput (ex/sec) 11522.1 12826.2 11.32% Phase 2 Throughput (ex/sec) 2150.0 2464.1 14.61% Phase 1 time (hours) 11.12 9.99 10.16% Phase 2 time (hours) 6.62 5.77 12.84% Total time (hours) 17.74 15.76 11.16% PyTorch w/ ORT can train with 2x the local batch size as PyTorch w/o ORT (global batch size was kept the same for comparison)
  • 20. Perf Improvement with ORT Model (Scenario)/# Params Perf improvement w/ ORT Turing* (pretraining)/340M 1.4x Turing* (pretraining)/350M 1.2x RoBERTa XL (pretraining)/500M 3x RoBERTa XL (finetuning)/500M 1.2x RoBERTa XXL (pretraining)/1B 7x GPT-2 M(pretraining)/345M 1.2x * https://msturing.org/
  • 21. Demo: ONNX Runtime Training in Azure Databricks https://github.com/skaarthik/onnxruntime-training-databricks
  • 22. Summary ▪ Optimize and accelerate model training using ONNX Runtime (ORT) ▪ ORT is used in training very large models used in various Microsoft products/services ▪ https://github.com/microsoft/onnxruntime ▪ https://github.com/microsoft/onnxruntime- training-examples
  • 23. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.