Deploying Pretrained Models on Edge IoT Devices
1. Accelerated Inference of a Pretrained Model on Edge IoT Devices
• Introduction
• Challenges
• Solutions
• Limitations
• Proposed Solution
• Results of Experimentation
• Conclusion
2. Introduction
• This work addresses one of the most
complex problems in the AI domain
• It explains the challenges/problems
involved
• It surveys the work done by the research
community to address these challenges
• It identifies the limitations and gaps in
the existing solutions
3. Challenges in deploying pretrained
models
• Hardware architectural differences between the processor the model was trained on
and the edge device
• Smaller memory and less computational capacity
• Poor energy efficiency
• No source code available for the pretrained model
• No knowledge of how it was trained or of its hyperparameters
• Need to keep accuracy as close as possible to that of the pretrained model
[Diagram: a pretrained AI model to be deployed on a resource-constrained edge device]
6. Disadvantages
If one of the nodes fails, the whole system collapses
The devices used are small microcontrollers, Raspberry Pis, and MCUs, which have
less computational capacity (fewer cores, lower clock speed)
Inference speed cannot match that of the pretrained model
Inference takes more time and hence consumes more energy (power)
As a result, the system has poor energy efficiency and a larger carbon footprint
7. Edge-Cloud Co-Operation
• Disadvantages
- Although the cloud has high computational capacity, there is always a
delay in the exchange of data between edge and cloud
- The data rate is not constant, and the delay grows when the public network is
congested
- Data security is threatened because data is exchanged over a public network
[Diagram: edge device connected to a remote cloud over a public IP network]
8. Deploying the Model on an FPGA
Advantages
• Able to deploy the model and improve inference speed
Disadvantages
• Supports only the specific AI models for which the FPGA is
designed
• Cannot deploy other AI models
[Diagram: pretrained model → convert to FPGA-specific format → run on FPGA;
alternative path: deploy the model on the GPU cores]
9. Proposed Solution
Reduce model size by reducing the precision (bit width) of the weights and biases
Make the network simpler by reducing the number of layers in the CNN/DNN
Run the model in parallel on hundreds of GPU cores
Accelerate inference using parallel execution, CUDA Graphs, and batch
processing
Improve processor occupancy of the model using CUDA computing
Achieve energy efficiency by making use of the DLA core
10. Proposed
Solution
The pretrained model size is reduced using
the following optimization techniques
• Using FP16 or INT8 precision bits instead of
FP32
• Using layer fusion
The model is optimized for inference
acceleration using
• CUDA computing
• CUDA Graphs
• Batch processing
The model is optimized to achieve energy
efficiency using
• The DLA core
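The layer-fusion idea can be illustrated with a toy example: folding a batch-normalization layer into the preceding linear layer yields a single operation with identical outputs. This is an illustrative NumPy sketch under made-up shapes and values, not TensorRT's actual fusion implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "layers": a linear layer followed by batch normalization (inference mode).
W = rng.standard_normal((4, 8)).astype(np.float32)   # linear weights
b = rng.standard_normal(4).astype(np.float32)        # linear bias
gamma = rng.standard_normal(4).astype(np.float32)    # BN scale
beta = rng.standard_normal(4).astype(np.float32)     # BN shift
mean = rng.standard_normal(4).astype(np.float32)     # BN running mean
var = rng.random(4).astype(np.float32) + 0.5         # BN running variance
eps = 1e-5

def unfused(x):
    """Two separate ops: linear, then batch norm."""
    y = W @ x + b
    return gamma * (y - mean) / np.sqrt(var + eps) + beta

# Fold the BN parameters into the linear weights and bias,
# producing one fused layer that computes the same function.
scale = gamma / np.sqrt(var + eps)
W_fused = scale[:, None] * W
b_fused = scale * (b - mean) + beta

def fused(x):
    """One op doing the work of both layers."""
    return W_fused @ x + b_fused

x = rng.standard_normal(8).astype(np.float32)
print(np.allclose(unfused(x), fused(x), atol=1e-5))  # → True
```

Fusing removes one memory round-trip per eliminated layer, which is where the inference speedup comes from.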
11. How the Model Size Is Reduced
• The size of any CNN or DNN model depends on the
number of layers, the parameters in each layer, and the size of the weights and
biases in each layer; the default size is 32-bit floating point (FP32)
• Reduce the precision (bit width) of the weights and biases: what if we reduce
FP32 to 16-bit floating point (FP16) or 8-bit integer (INT8)?
• Fuse CNN/DNN layers together to make the network simpler
[Diagram: 32 bits → 16 bits → 8 bits]
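The memory saving from reducing precision can be sketched in NumPy. This is an illustrative example; the symmetric single-scale INT8 scheme shown here is one common quantization choice, not necessarily the one TensorRT applies internally:

```python
import numpy as np

# A toy weight tensor of one million parameters, stored in the default FP32.
w_fp32 = np.random.default_rng(0).standard_normal(1_000_000).astype(np.float32)
print(w_fp32.nbytes)   # 4000000 bytes

# FP16: half the memory, usually a negligible accuracy loss for inference.
w_fp16 = w_fp32.astype(np.float16)
print(w_fp16.nbytes)   # 2000000 bytes

# INT8: a quarter of the memory. A simple symmetric scheme maps the
# FP32 range onto [-127, 127] with a single scale factor.
scale = np.abs(w_fp32).max() / 127.0
w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)
print(w_int8.nbytes)   # 1000000 bytes

# Dequantize to check the round-trip error introduced by INT8:
# it is bounded by half a quantization step.
w_restored = w_int8.astype(np.float32) * scale
print(np.abs(w_fp32 - w_restored).max() <= scale / 2 + 1e-6)  # → True
```

The same reduction applies to every layer's weights and biases, which is why switching from FP32 to FP16 or INT8 shrinks the whole model by 2× or 4×.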
12. Proposed Solution (recap)
13. Create TensorRT Builder
• From the Builder, create the TensorRT Parser and Config components
• TensorRT Parser: imports the input pretrained model into the TensorRT Network
• TensorRT Config: holds the optimization input parameters (FP32, FP16, INT8,
CUDA Graph, layer fusion, DLA core)
• From the Network and Config, create the TensorRT Engine
• The TensorRT Engine then runs inference on the GPU
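The build steps above can be sketched with the TensorRT Python API (TensorRT 8.x). This is a hedged outline rather than a tested script: it requires TensorRT and a DLA-equipped device such as a Jetson Xavier, and `model.onnx` / `model.engine` are placeholder file names:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# 1. Create the TensorRT Builder.
builder = trt.Builder(logger)

# 2. From the Builder, create the Network, Parser, and Config components.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
config = builder.create_builder_config()

# 3. Parser: import the input pretrained model into the TensorRT Network.
with open("model.onnx", "rb") as f:  # placeholder model file
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

# 4. Config: set the optimization input parameters from the slide.
config.set_flag(trt.BuilderFlag.FP16)             # reduced-precision weights
config.default_device_type = trt.DeviceType.DLA   # offload to the DLA core
config.DLA_core = 0
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)     # layers DLA can't run fall back to GPU

# 5. From the Network and Config, build the TensorRT Engine.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)  # deserialize later with trt.Runtime for inference
```

Layer fusion is applied automatically during the engine build, so it needs no explicit flag here.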
14. NVIDIA Jetson Xavier Family GPU
• 6 CPU cores
• 6 streaming multiprocessors
• 384 GPU (CUDA) cores
• 40 Tensor Cores
• 1 DLA (Deep Learning Accelerator)
• Compute clock rate: 1.109 GHz
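A couple of back-of-the-envelope numbers follow from the figures above (counting one fused multiply-add as two floating-point operations, the usual convention for peak-throughput estimates):

```python
# Quick arithmetic on the Jetson Xavier figures quoted above.
cuda_cores = 384
sms = 6
clock_ghz = 1.109

cores_per_sm = cuda_cores // sms
print(cores_per_sm)  # 64 CUDA cores per streaming multiprocessor

# Rough peak FP32 throughput: cores x clock x 2 (an FMA counts as
# two floating-point operations per cycle).
peak_gflops = cuda_cores * clock_ghz * 2
print(round(peak_gflops, 1))  # ~851.7 GFLOPS
```

These peaks are theoretical upper bounds; sustained inference throughput depends on occupancy and memory bandwidth, which is why the proposed solution targets occupancy explicitly.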