Accelerated Inferencing of a Pretrained Model in Edge IoT Devices
• Introduction
• Challenges
• Solutions
• Limitations
• Proposed Solution
• Results of Experimentation
• Conclusion
Introduction
• This work addresses one of the most complex problems in the AI domain: deploying a pretrained model on a resource-constrained edge device
• Explains the challenges involved
• Reviews the work done by the research community to address these challenges
• Identifies the limitations and gaps in the existing solutions
Challenges in deploying pretrained models
• Architectural differences between the processor the pretrained model was built for and the processor of the edge device
• Smaller memory and less computation capacity on the edge device
• Poor energy efficiency
• No source code available for the pretrained model
• No knowledge of how the model was trained or of its hyperparameters
• Need to keep accuracy as close as possible to that of the pretrained model
Pretrained AI model → to be deployed on a resource-constrained edge device
Present research work (cluster of heterogeneous devices)
Literature work 1
[Diagram: Input test data (AI workload) → Microcontroller: partition AI workload → MCU, MCU, Raspberry Pi → Combine partial outputs and display results]
Disadvantages
• If one of the nodes fails, the whole system collapses
• The devices used are small microcontrollers (MCUs) and Raspberry Pi boards, which have fewer cores, lower clock speeds, and less computational capacity
• Inference speed cannot match the speed of the pretrained model
• Inference takes more time and therefore consumes more energy (power)
• As a result, energy efficiency is poor, which leads to higher carbon emissions
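As a rough illustration of the partitioned scheme and its single-point-of-failure weakness, the sketch below splits the rows of one dense layer across three worker nodes and concatenates the partial outputs. The `Node` class, its `alive` flag, the node names, and the toy weights are assumptions made for illustration; they are not taken from the cited work.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical worker device (an MCU or a Raspberry Pi)."""
    name: str
    alive: bool = True

    def partial_matvec(self, rows, x):
        # Each node computes the dot products for its own slice of rows.
        if not self.alive:
            raise RuntimeError(f"node {self.name} is down")
        return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def partition(rows, n_parts):
    """Split rows into n_parts contiguous chunks of near-equal size."""
    base, extra = divmod(len(rows), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        chunks.append(rows[start:start + size])
        start += size
    return chunks


def distributed_matvec(nodes, weights, x):
    """Partition the workload, run each part on a node, combine the outputs."""
    out = []
    for node, chunk in zip(nodes, partition(weights, len(nodes))):
        out.extend(node.partial_matvec(chunk, x))
    return out


weights = [[1, 0], [0, 1], [1, 1], [2, -1]]  # toy 4x2 layer
x = [3, 4]
nodes = [Node("mcu0"), Node("mcu1"), Node("rpi")]
result = distributed_matvec(nodes, weights, x)

# With no redundancy, losing a single node aborts the whole inference.
nodes[1].alive = False
try:
    distributed_matvec(nodes, weights, x)
    failed = False
except RuntimeError:
    failed = True
```

Because every output row is owned by exactly one node, there is nothing to fall back on when that node is unavailable.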
Edge-Cloud Co-Operation
• Disadvantages
- Although the cloud has high computational capacity, there is always a delay in exchanging data between edge and cloud
- The data rate is not constant, and the delay grows when the public network is congested
- Data is exposed to security threats because it is exchanged over a public network
[Diagram: Edge Device <-> Public IP Network <-> Remote Cloud]
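A back-of-the-envelope latency model makes the first two disadvantages concrete: offloading only helps when the time saved on compute exceeds the time spent on the network. The function name and every number below are illustrative assumptions, not measurements from this work.

```python
def cloud_latency_ms(payload_kb, bandwidth_kbps, rtt_ms, cloud_compute_ms):
    """Network round trip + upload time + cloud compute, in milliseconds."""
    # KB -> kilobits (x8), then kilobits / (kilobits per second) -> seconds -> ms
    transfer_ms = payload_kb * 8 * 1000 / bandwidth_kbps
    return rtt_ms + transfer_ms + cloud_compute_ms


# Assumed scenario: 100 KB input, 40 ms round trip, 5 ms of cloud compute.
normal = cloud_latency_ms(100, bandwidth_kbps=8000, rtt_ms=40, cloud_compute_ms=5)
congested = cloud_latency_ms(100, bandwidth_kbps=800, rtt_ms=40, cloud_compute_ms=5)

print(normal)     # 145.0
print(congested)  # 1045.0
```

In this toy scenario a tenfold drop in bandwidth raises end-to-end latency from 145 ms to 1045 ms even though the cloud compute time is unchanged.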
Deploying model on FPGA
Advantages
• The model can be deployed and inference performance improved
Disadvantages
• Supports only the specific AI models for which the FPGA is designed
• Other AI models cannot be deployed
Deploying model on the GPU cores
[Diagram: Pretrained model → Convert to FPGA-specific format → Run on FPGA]
Proposed Solution
• Reduce model size by lowering the bit precision of the weights and biases
• Simplify the network by reducing the number of layers in the CNN/DNN
• Run the model in parallel on the hundreds of cores of the GPU
• Accelerate inference using parallel execution, CUDA Graph, and batch processing
• Achieve high processor occupancy for the model using CUDA computing
• Achieve energy efficiency by using the DLA core
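To show how the layer count can drop without changing the output, here is a minimal sketch of one common form of fusion: folding a per-channel scale and shift (the form batch normalization takes at inference time) into the preceding linear layer. This is an illustration of the general idea only; the function names and values are assumptions, and the fusions actually applied to the model in this work are not specified here.

```python
def linear(x, w, b):
    """Dense layer: y = W x + b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def scale_shift(y, gamma, beta):
    """Per-channel affine layer: z = gamma * y + beta."""
    return [g * yi + bt for g, yi, bt in zip(gamma, y, beta)]


def fuse(w, b, gamma, beta):
    """Fold the scale/shift into the dense layer: W' = gamma * W, b' = gamma * b + beta."""
    fused_w = [[g * wi for wi in row] for g, row in zip(gamma, w)]
    fused_b = [g * bi + bt for g, bi, bt in zip(gamma, b, beta)]
    return fused_w, fused_b


w = [[1.0, 2.0], [3.0, 4.0]]
b = [0.5, -0.5]
gamma = [2.0, 0.5]
beta = [1.0, 0.0]
x = [1.0, -1.0]

two_layers = scale_shift(linear(x, w, b), gamma, beta)  # two layers at run time
fused_w, fused_b = fuse(w, b, gamma, beta)
one_layer = linear(x, fused_w, fused_b)                 # one layer, same result
```

The fused layer is computed once ahead of time, so inference runs one layer instead of two.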
Proposed Solution
The size of the pretrained model is reduced using the following optimization techniques:
• Precision of FP16 or INT8 instead of FP32
• Layer fusion
The model is optimized for inference acceleration using:
• CUDA computing
• CUDA Graph
• Batch processing
The model is optimized for energy efficiency using:
• DLA core
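The move from FP32 to INT8 can be illustrated with a minimal symmetric quantization sketch: each weight is mapped to an integer in [-127, 127] using a single scale factor, then mapped back. This is a simplified illustration under stated assumptions (one scale per tensor, toy weights); production toolchains normally choose scales through a calibration step that is not shown here.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map [-max_abs, max_abs] to [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floating-point values from the integers."""
    return [qi * scale for qi in q]


weights = [0.50, -1.27, 0.013, 1.00]  # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error of any in-range value is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now needs 1 byte instead of 4, at the cost of an error bounded by half the scale, which is why accuracy has to be rechecked after quantization.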
How the model size is reduced
• The size of any CNN or DNN model depends on the number of layers, the number of parameters in each layer, and the size of the weights and biases in each layer; the default size is 32-bit floating point (FP32)
• Reduce the bit precision of the weights and biases
• Fuse CNN/DNN layers together to make the network simpler
• What if we reduce the precision to 16-bit floating point (FP16) or 8-bit integer (INT8)?
32 bits → 16 bits → 8 bits
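The dependence of model size on parameter count and precision is simple arithmetic: total parameters multiplied by bytes per parameter. The sketch below applies it to a made-up network; the layer parameter counts are hypothetical and the estimate ignores any file-format overhead.

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}


def model_size_kb(layer_param_counts, precision):
    """Estimated weight storage in KB for the given precision."""
    total_params = sum(layer_param_counts)
    return total_params * BYTES_PER_PARAM[precision] / 1024


layers = [1_000, 50_000, 200_000]  # hypothetical parameters per layer
sizes = {p: model_size_kb(layers, p) for p in BYTES_PER_PARAM}

for precision, kb in sizes.items():
    print(f"{precision}: {kb:.1f} KB")
```

Halving the bits per parameter halves the weight storage, so FP16 is expected to be about 2x smaller than FP32 and INT8 about 4x smaller, before any overhead.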
Create TensorRT Builder
→ From the Builder, create the TensorRT Parser and Config components
• TensorRT Parser: import the input pretrained model → TensorRT Network
• TensorRT Config: optimization input parameters (FP32, FP16, INT8, CUDA Graph, Layer Fusion, DLA Core)
→ Create TensorRT Engine (input: Network and Config)
→ TensorRT Engine for inference on GPU
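The builder, parser, config, and engine steps above can be sketched with the TensorRT Python API. This is a minimal sketch, assuming the TensorRT 8.x Python bindings and a pretrained model exported to ONNX; the slides do not state the model format, and the function name `build_engine` and its parameters are hypothetical.

```python
def build_engine(onnx_path, use_fp16=True, dla_core=None):
    """Build a serialized TensorRT engine from an ONNX file (TensorRT 8.x assumed)."""
    import tensorrt as trt  # imported lazily so this file loads without TensorRT

    # 1. Create the builder and an empty explicit-batch network.
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)

    # 2. Parser: import the pretrained model into the network.
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed: " + "; ".join(errors))

    # 3. Config: choose the optimization parameters.
    config = builder.create_builder_config()
    if use_fp16:
        config.set_flag(trt.BuilderFlag.FP16)
    # INT8 is not enabled here: it additionally needs calibration data.
    if dla_core is not None:
        # DLA runs reduced-precision layers; unsupported layers fall back to the GPU.
        config.default_device_type = trt.DeviceType.DLA
        config.DLA_core = dla_core
        config.set_flag(trt.BuilderFlag.GPU_FALLBACK)

    # 4. Build the engine from the network and the config.
    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("engine build failed")
    return serialized
```

A call such as `build_engine("model.onnx", use_fp16=True, dla_core=0)` (the file name is a placeholder) would return a serialized engine that can be written to disk and loaded for inference on the device.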
NVIDIA Jetson Xavier Family GPU
• 6 CPU cores
• 384 GPU (CUDA) cores
• 40 Tensor Cores
• 1 DLA (Deep Learning Accelerator)
• 6 streaming multiprocessors
• Clock frequency (compute clock rate): 1.109 GHz
[Diagram: execution timeline across CPU, CPU memory, GPU, and GPU memory]
• Transfer contents (matrices) from CPU (Host) to GPU (Device) memory: GPU idle
• Perform parallel execution in GPU: CPU idle
• Transfer contents (resultant matrix) from GPU (Device) to CPU (Host) memory: GPU idle
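A toy cost model shows why batch processing helps with this copy-compute-copy pattern: fixed per-launch costs are paid once per batch rather than once per input. The model and all numbers are illustrative assumptions; in particular it assumes kernel time stays constant as the batch grows, which only holds while the GPU still has idle cores.

```python
def serial_time_us(n_inputs, batch_size, h2d_us, kernel_us, d2h_us, launch_us):
    """Total time when every batch runs copy-in, compute, copy-out in sequence."""
    n_batches = -(-n_inputs // batch_size)  # ceiling division
    per_batch = (
        h2d_us * batch_size    # host-to-device copy, scales with batch size
        + kernel_us            # GPU execution, assumed constant per batch
        + d2h_us * batch_size  # device-to-host copy, scales with batch size
        + launch_us            # fixed CPU-side launch overhead per batch
    )
    return n_batches * per_batch


# Assumed costs in microseconds for 256 inputs.
costs = dict(h2d_us=10, kernel_us=2000, d2h_us=10, launch_us=500)
one_by_one = serial_time_us(256, batch_size=1, **costs)
batched = serial_time_us(256, batch_size=32, **costs)

print(one_by_one)  # 645120
print(batched)     # 25120
```

Under these assumptions the copy time is the same in both cases (5120 us in total); the saving comes entirely from running 8 launches instead of 256.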
Results from the experiment
Model size (Kbs) by batch size (BS)
Model      BS=1    BS=32   BS=64   BS=128  BS=256
CPU_FP32   55831   -       -       -       -
GPU_FP32   1715    1823    1823    1771    1768
GPU_FP16   877     919     917     919     917
GPU_INT8   487     533     537     532     538
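The BS = 1 figures can be turned into reduction factors with a few lines of arithmetic. The numbers are copied from the results table; the helper name `reduction_factor` is introduced here for illustration.

```python
# Model sizes at batch size 1, as reported in the results table.
sizes_bs1 = {"CPU_FP32": 55831, "GPU_FP32": 1715, "GPU_FP16": 877, "GPU_INT8": 487}


def reduction_factor(baseline, optimized):
    """How many times smaller the optimized model is than the baseline."""
    return round(sizes_bs1[baseline] / sizes_bs1[optimized], 2)


fp16_vs_fp32 = reduction_factor("GPU_FP32", "GPU_FP16")
int8_vs_fp32 = reduction_factor("GPU_FP32", "GPU_INT8")
gpu_vs_cpu = reduction_factor("CPU_FP32", "GPU_FP32")

print(fp16_vs_fp32)  # 1.96
print(int8_vs_fp32)  # 3.52
print(gpu_vs_cpu)    # 32.55
```

The FP16 engine is about 1.96x smaller than the FP32 engine and the INT8 engine about 3.52x smaller, close to the 2x and 4x expected from the bytes-per-parameter ratio alone.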
