FlashNeuron: SSD-Enabled Large-Batch Training
of Very Deep Neural Networks
Speaker: Po-Chuan Chen
Abstract
• Deep neural networks (DNNs) are widely used in AI applications
• Problem → memory capacity wall :
• The limited DRAM capacity of a training platform such as a GPU often becomes the
limiting factor on DNN size and batch size
• Solution → FlashNeuron
1. The first DNN training system using an NVMe SSD
2. Introduces an offloading scheduler (minimal interference with CPU processes)
★ Increases the batch size and improves training throughput (by up to 37.8%)
FlashNeuron
1. It uses a high-performance SSD as a backing store (despite its lower bandwidth than DRAM)
2. It minimizes bandwidth waste while overlapping transfers with GPU computation
3. It introduces an offloading scheduler that selects a set of tensors to offload
Evaluated with four state-of-the-art NVIDIA V100 GPUs with 16GB DRAM on PyTorch
to train DNN models.
DNN training
• Uses a large amount of labeled data to learn the optimal network parameters.
 Mini-batch stochastic gradient descent (SGD) algorithm
A training process → multiple epochs
A single epoch → multiple iterations, each over a single partition of the dataset (a mini-batch)
Forward pass :
Computes the error for the given input batch
Backward pass :
Back-propagates errors and updates network weights
The error (also called loss) of the model is computed by comparing the last
layer’s output with the correct output.
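For reference, a minimal PyTorch sketch of the iteration structure described above; the model, optimizer, and data loader here are illustrative placeholders, not FlashNeuron's code:

```python
import torch
import torch.nn as nn

# Illustrative model and optimizer (not from the paper)
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

def train_one_epoch(loader):
    for inputs, labels in loader:          # each (inputs, labels) pair is one mini-batch
        inputs, labels = inputs.cuda(), labels.cuda()
        outputs = model(inputs)            # forward pass: compute predictions
        loss = criterion(outputs, labels)  # loss: compare last layer's output with labels
        optimizer.zero_grad()
        loss.backward()                    # backward pass: back-propagate errors
        optimizer.step()                   # update network weights
```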
Overcoming GPU memory capacity wall
A popular approach to overcoming the GPU memory capacity bottleneck is to buffer data
in the host CPU memory.
For example, running data augmentation on CPU at every iteration of DNN training is
a common practice to prevent overfitting to the training data set.
A typical data augmentation pipeline consists of image loading, decoding, and a
sequence of geometric transformations, requiring high memory bandwidth.
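As an illustration only (the dataset path, transforms, and worker count are hypothetical), a typical CPU-side augmentation pipeline in torchvision, whose load/decode/transform stages consume CPU memory bandwidth at every iteration:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),   # geometric transformation
    transforms.RandomHorizontalFlip(),   # geometric transformation
    transforms.ToTensor(),
])
# Images are loaded and decoded on the CPU at every iteration; num_workers
# controls how many CPU worker processes run this bandwidth-hungry pipeline.
dataset = datasets.ImageFolder("/path/to/train", transform=augment)
loader = DataLoader(dataset, batch_size=256, num_workers=8, pin_memory=True)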
Solution
• Propose a new solution that buffers inputs and intermediate data (tensors) to SSDs.
Specifically, we leverage direct peer-to-peer communication between the GPU and
NVMe SSD devices so that data buffering does not consume either host CPU cycles or
memory bandwidth. This buffering-on-SSD approach complements the popular
buffering-on-memory approach, hence improving overall resource utilization over
various use cases.
FlashNeuron Design
 Offloading scheduler :
Based on multiple factors (tensor size, tensor transfer times …)
Identifies a set of tensors to offload.
 Memory manager :
To minimize performance overheads from offloading, it orchestrates data transfers
between the GPU memory and SSDs using peer-to-peer direct storage access.
[ 1 ] Memory manager
 Tensor allocation :
1. A tensor is first created during forward propagation
2. An offloaded tensor is prefetched from the SSD back to GPU memory during the backward pass
 Tensor deallocation :
1. A tensor has been completely offloaded from memory to the SSD
2. A tensor is no longer used by any layer during the iteration
Fragmentation
 One crucial issue in tensor allocation and deallocation is fragmentation.
• Solution :
To avoid this issue, we allocate memory-resident (i.e., not offloaded) tensors from
the lowest end of the memory address space and allocate ephemeral (i.e.,
offloaded) tensors from the highest end of the memory address space
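A minimal sketch of this two-ended policy (our own simplification of the memory manager, assuming offloaded tensors are freed in reverse allocation order):

```python
class TwoEndedAllocator:
    """Resident tensors grow from the lowest address; ephemeral (offloaded)
    tensors grow from the highest, so their frees never leave holes between
    long-lived allocations."""

    def __init__(self, capacity):
        self.low = 0              # next free offset for memory-resident tensors
        self.high = capacity      # one past the last free offset for ephemeral tensors

    def alloc_resident(self, size):
        if self.low + size > self.high:
            raise MemoryError("out of GPU memory")
        offset, self.low = self.low, self.low + size
        return offset

    def alloc_ephemeral(self, size):
        if self.high - size < self.low:
            raise MemoryError("out of GPU memory")
        self.high -= size
        return self.high

    def free_ephemeral(self, size):
        # assumption: ephemeral tensors are released in reverse allocation order
        self.high += size
```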
Offload / Prefetch
The memory manager interacts with peer-to-peer direct storage access to perform
offloading and prefetching.
 Offload
Forward pass : an offloading request for a tensor is initiated once the next layer has used it
At the end of execution : the memory manager checks whether the offloading request
has completed, and if so the tensor is deallocated from GPU memory.
Offload / Prefetch
 Prefetch
Backward pass : the memory manager issues prefetch requests to the SSD.
Whenever those tensors are used and freed : the memory manager eagerly
prefetches additional tensors into the available memory space, while reserving
enough memory to run the largest layer.
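A hedged sketch of how offloading can overlap with layer computation. FlashNeuron transfers data to an NVMe SSD through P2P-DSA; here the overlap is only approximated with an asynchronous copy to pinned host memory on a side CUDA stream:

```python
import torch

compute_stream = torch.cuda.current_stream()
copy_stream = torch.cuda.Stream()

def offload_async(t):
    # Start copying tensor t off the GPU on the side stream; the copy then
    # overlaps with the computation of later layers on the compute stream.
    dst = torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
    copy_stream.wait_stream(compute_stream)   # copy only after t has been produced
    with torch.cuda.stream(copy_stream):
        dst.copy_(t, non_blocking=True)
    return dst

def wait_for_offloads():
    # Before deallocating the GPU copies, make sure the async copies completed.
    torch.cuda.current_stream().wait_stream(copy_stream)
```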
Compressed sparse row (CSR) compression and decompression
When the compression ratio estimated during the profiling iteration is greater than
one, the memory manager applies CSR compression.
However, CSR compression is only applied to the output tensors of ReLU layers, whose
outputs have high sparsity.
To improve the compression ratio, FlashNeuron applies a slightly different CSR format:
the vector storing the column index of each nonzero element is replaced with a set of
bit-vectors, where each bit-vector marks the nonzero elements of one row.
Compared with standard CSR, this representation saves 8 bits (the column index)
per nonzero element of the matrix
and adds 1 bit per element of the matrix.
Since this trade-off is beneficial whenever more than one-eighth of all the
elements are nonzero (the typical case for input and intermediate tensors)
→ it improves the compression ratio
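To illustrate the arithmetic (our own example, assuming the 8-bit column index stated above), a small sketch comparing the bookkeeping size of the two formats:

```python
import numpy as np

def csr_index_size_bits(mat, index_bits=8):
    # Size of the nonzero bookkeeping only (the stored values are the same in
    # both formats): classic CSR keeps one column index per nonzero element,
    # the bitmap variant keeps one presence bit per matrix element instead.
    nnz = int(np.count_nonzero(mat))
    classic = index_bits * nnz
    bitmap = mat.size
    return classic, bitmap

# ReLU outputs are sparse but usually well above 1/8 density, e.g. ~50% nonzero:
relu_out = np.maximum(np.random.randn(512, 512), 0)
classic, bitmap = csr_index_size_bits(relu_out)
print(classic, bitmap)   # bitmap < classic whenever nnz / size > 1/8
```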
Use of half-precision floating point for offloaded tensors
To further reduce the traffic between the GPU and the SSD, the memory manager
exploits the fact that neural networks can tolerate a certain level of precision loss
without significantly degrading the final model accuracy.
The FP32 tensor is used during forward propagation, and the stashed FP16
version of the same tensor is only used during the backward pass.
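A minimal sketch of the FP16 stashing idea (function names are ours, not FlashNeuron's API):

```python
import torch

def stash_for_offload(t_fp32: torch.Tensor) -> torch.Tensor:
    # Only the half-precision copy is offloaded (halving GPU-SSD traffic);
    # the forward pass keeps using the original FP32 tensor until it is freed.
    return t_fp32.half()

def restore_for_backward(t_fp16: torch.Tensor) -> torch.Tensor:
    # The backward pass consumes the upcast (lossy) FP32 version of the stash.
    return t_fp16.float()
```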
[ 2 ] Offload scheduler
The offloading scheduler takes a DNN as input and derives an optimal tensor offloading schedule. It has two goals:
 It should offload enough tensors so that the GPU can correctly run at the target batch
size without triggering an out-of-memory error.
 It should avoid excessive data transfers from the GPU to the SSD (and vice versa), thus
minimizing the increase of the iteration time induced by tensor offloading.
How it works
The scheduler first performs a profiling iteration, in which all tensors buffered during the
forward pass are offloaded to the SSD so that the system can run with a large
target batch size without causing an out-of-memory error.
The profiling iteration collects:
1. the size of each buffered tensor
2. the time it takes to offload each buffered tensor
3. the expected compression ratio of each tensor
4. the execution time of the forward and backward passes
5. the total size of the other memory-resident objects (weights, temporary workspace)
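One plausible way to represent this profile (field names are illustrative, not the paper's):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TensorProfile:
    size_bytes: int            # 1. size of the buffered tensor
    offload_time_s: float      # 2. time to offload it to the SSD
    compression_ratio: float   # 3. expected ratio with CSR/FP16 (>1 means compressible)

@dataclass
class IterationProfile:
    tensors: List[TensorProfile]   # one entry per buffered tensor, in creation order
    fwd_time_s: float              # 4. forward-pass execution time
    bwd_time_s: float              #    backward-pass execution time
    resident_bytes: int            # 5. weights + temporary workspace
```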
Phase 1 : Linear Tensor Selection
• Iteratively select a certain number of tensors from the beginning until the total size of
the unselected tensors plus the total size of the other memory-resident objects (i.e.,
weights and temporary workspace) becomes smaller than the total GPU memory size.
(unselected tensors + the other memory-resident objects < GPU memory size)
Check
The scheduler checks whether the total data transfer time, which is computed by
summing up the individual tensor offloading times, is smaller than the total execution
time of all layers in the forward pass.
(sum of individual tensor offloading times < total forward-pass execution time?)
Yes → The scheduler adopts this schedule and stops because it can fully overlap
tensor offloading with layer computation via scheduling.
No → The offloading scheduler enters the second phase.
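A minimal sketch of Phase 1 plus this check, assuming the illustrative IterationProfile structure above:

```python
def phase1(profile, gpu_mem_bytes):
    tensors = profile.tensors
    selected = []
    unselected_bytes = sum(t.size_bytes for t in tensors)
    for i, t in enumerate(tensors):        # pick tensors from the beginning...
        if unselected_bytes + profile.resident_bytes < gpu_mem_bytes:
            break                          # ...until what remains fits in GPU memory
        selected.append(i)
        unselected_bytes -= t.size_bytes
    # Check: can offloading hide entirely behind the forward pass?
    transfer_time = sum(tensors[i].offload_time_s for i in selected)
    fully_overlapped = transfer_time < profile.fwd_time_s
    return selected, fully_overlapped      # if not overlapped, fall through to Phase 2
```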
Phase 2: Compression-aware tensor selection
This phase runs only when a satisfactory schedule is not found in the
first phase.
The scheduler replaces some of the already selected tensors with compression-
friendly tensors expected to have high compression ratios under CSR and
FP16 conversion.
Finding a satisfactory schedule
 The scheduler excludes the last uncompressible tensor selected in Phase 1.
It replaces it with one or more tensors having the highest expected compression ratios
among the tensors not yet selected, such that the size of the newly selected
tensors exceeds the size of the excluded tensor.
 It then recomputes the expected total data transfer time.
If this total transfer time is less than the forward pass’s total execution time
→ the scheduler stops.
Otherwise it repeats this process
→ until the condition is satisfied or no compression-friendly tensors remain
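A hedged sketch of the Phase 2 loop on the same illustrative structures; the real scheduler's accounting (for example, how compression shortens a tensor's transfer time) may differ:

```python
def phase2(profile, selected):
    tensors = profile.tensors

    def transfer_time(sel):
        # Assumption: a compressible tensor transfers faster in proportion
        # to its expected compression ratio.
        return sum(tensors[i].offload_time_s / max(tensors[i].compression_ratio, 1.0)
                   for i in sel)

    # Unselected tensors, most compression-friendly first.
    candidates = sorted((i for i in range(len(tensors)) if i not in selected),
                        key=lambda i: tensors[i].compression_ratio, reverse=True)
    while transfer_time(selected) >= profile.fwd_time_s:
        # Drop the last uncompressible tensor picked in Phase 1 ...
        victim = next((i for i in reversed(selected)
                       if tensors[i].compression_ratio <= 1.0), None)
        if victim is None or not candidates:
            break                          # no satisfactory schedule found
        selected.remove(victim)
        # ... and cover at least its size with compression-friendly tensors.
        need = tensors[victim].size_bytes
        while need > 0 and candidates:
            pick = candidates.pop(0)
            selected.append(pick)
            need -= tensors[pick].size_bytes
    return selected
```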
[ 3 ] Peer-to-Peer Direct Storage Access
• Peer-to-peer direct storage access (P2P-DSA) builds on GDRCopy and SPDK to
transfer tensors between GPU memory and NVMe SSDs.
 GDRCopy :
A fast GPU memory copy library based on NVIDIA GPUDirect, a technology that
exposes GPU memory for direct access by other PCIe peripherals.
 Intel SPDK :
It exposes a block-level I/O interface directly to the user-space software.
How prefetching differs from offloading
The reverse-path (prefetching data from the SSD to the GPU memory) is performed
similarly except :
1. LBAs are read from the metadata table instead of being allocated
2. Read commands are issued instead of write commands
In both offloading and prefetching, most data accesses are sequential accesses,
which are more advantageous than random accesses in throughput and endurance.
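A sketch of the bookkeeping this implies (our own simplification of P2P-DSA's metadata table, not its actual code): writes receive freshly allocated, strictly sequential LBAs; prefetches look those LBAs up and issue reads instead:

```python
class LBATable:
    def __init__(self, block_bytes=4096):
        self.block_bytes = block_bytes
        self.next_lba = 0          # sequential allocation keeps the WAF near 1
        self.table = {}            # tensor id -> (start LBA, number of blocks)

    def allocate_for_offload(self, tensor_id, nbytes):
        nblocks = -(-nbytes // self.block_bytes)       # ceiling division
        start, self.next_lba = self.next_lba, self.next_lba + nblocks
        self.table[tensor_id] = (start, nblocks)
        return start, nblocks      # issue NVMe write commands for this LBA range

    def lookup_for_prefetch(self, tensor_id):
        return self.table[tensor_id]   # issue NVMe read commands for this LBA range
```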
SSD lifespan can be increased
• Because P2P-DSA performs only sequential writes, it sustains a write amplification
factor (WAF) of (nearly) one, which maximizes endurance.
• Furthermore, if P2P-DSA uses emerging low-latency SSDs, such as Intel Optane SSD
and Samsung Z-SSD, endurance can be further improved by a factor of 5× to 10×.
Evaluation
• The team chose four state-of-the-art models and scaled them up to represent future
DNN models with very deep structures:
Overview
The training throughput indeed peaks at the batch size marked with the
arrow. As the batch size increases further, the throughput degrades because
the cost of tensor offloading outweighs the benefits of
the increased batch size.
In other words, when the batch size is too large, the limited bandwidth
between the GPU and the SSD becomes the bottleneck and offsets the
higher utilization benefits.
Performance evaluation
1. Source of Performance Improvement
2. Maximum Batch Size
3. Half-precision Training
4. I/O Pattern
5. Cost Efficiency
Source of performance improvement
• Dotted line : the best throughput that can be achieved by the baseline.
• Arrow : the maximum batch size for which the proposed offloading scheduler is able
to find an effective schedule. Note that tensor compression does not improve the
performance of BERT-XLarge and HBMP because those models
do not use ReLU layers.
Normalized per-layer throughput
Maximum batch size
FlashNeuron uses a 2.09× larger batch size to maximize the training throughput
and increases the maximum runnable batch size by a factor of 13.4×.
Half-precision training
Unfortunately, when the tensor is already represented in FP16, FlashNeuron does
not benefit from converting the offloaded tensors to FP16.
However, the experiments still demonstrate that FlashNeuron can deliver extra
speedup as well as an increase in the per-GPU batch size.
Other evaluation
• I/O Pattern :
• Cost Efficiency : More expensive
Case study
The superior performance isolation of FlashNeuron can enable
consolidation of CPU applications and DNN training jobs.
The case study demonstrates that even memory-intensive CPU workloads can be
effectively co-located with DNN training when using FlashNeuron (SSD).
For I/O-intensive workloads, one can opt to use FlashNeuron (Memory)
to avoid I/O contention.
Co-locating Bandwidth-Intensive Tasks on CPU
 Throughput of DNN Training on GPU
Buffering-on-memory can potentially achieve higher throughput than buffering-on-SSD because
the write bandwidth of CPU DRAM is higher than that of the SSDs.
However, the DNN training throughput with buffering-on-memory can be heavily affected
by the characteristics of co-running CPU processes due to memory bandwidth contention between CPU
and GPU processes.
• By controlling the number of data augmentation threads, we make the CPU process
consume a certain portion of the CPU memory bandwidth.
 Execution Time of Data Augmentation Task on CPU
1. No offloading represents the case where the GPU runs DNN training with no tensor
offloading to either host memory or SSDs.
2. FlashNeuron (SSD) uses only a minimal amount of host memory bandwidth
(mostly for the PyTorch application code) and thus incurs a slowdown comparable to
No offloading.
3. FlashNeuron (Memory) consumes a large amount of the host CPU memory bandwidth.
Co-locating Latency-Critical Tasks on CPU
• We run BERT-as-a-service on the CPU, which takes user-provided sentences as input and
invokes BERT to return their embeddings, while concurrently running BERT training on the
GPU.
Related work
 Augmented GPU Memory for DNN Training
 Data Transfer Methods between GPU and Storage Devices
 Reducing Memory Footprint of DNN Models
Conclusion
FlashNeuron enables large-batch training of very deep and wide neural networks
of today and the future to achieve high training throughput.
Furthermore, it flexibly supports both SSD and memory offloading to provide
excellent performance isolation between GPU training jobs and a wide range of
co-located CPU workloads, including memory- and I/O-intensive ones.