Enabling Machine Learning on the Edge
using SRAM Conserving Efficient Neural
Networks Execution Approach
Bharath Sudharsan
Edge Computing - Hardware View
▪ MCUs and small CPUs: BLE beacons, smart
bulbs, smart plugs, TV remotes, fitness bands
▪ SBCs: Raspberry Pis, BeagleBones, NVIDIA
Jetsons, Latte Pandas, Intel NUCs, Google Coral
▪ GPU accelerated: AWS Snowball, Digi gateways,
Dell Edge Gateways for IoT, HPE Edgeline
Edge computing hardware, from highly resource-constrained to high-resource (left to right): an ARM Cortex-M0 MCU-based BLE beacon, powerful-CPU + basic-GPU SBCs (single-board computers), and an edge gateway with GPUs and SSDs
▪ Why MCUs? Roughly 50 billion MCU chips were shipped
in 2020 (market estimates), far exceeding shipments of other
chips such as GPUs and CPUs (only about 100 million units sold)
▪ No file system
▪ No Python support; only C and C++
▪ Only a few hundred MHz of clock speed, with small
SRAM and Flash memories
▪ No multiple cores or parallel execution units
▪ No inbuilt hardware accelerators such as an
APU (Accelerated Processing Unit), KPU
(convolution-operation accelerator), FPU
(floating-point unit), or FFT (Fourier-transform
accelerator)
Edge Computing - Hardware Spec
Specification of popular MCU boards
Neural Networks for Edge Computing
▪ Running NNs on MCUs is ultra-low-power machine learning, also known as TinyML
✓ Audio: always-on inference. Industrial telemetry: monitoring and detecting anomalies in motor-bearing vibrations.
Image: object counting, visual wake words. Physiological/behavioral: activity recognition using IMU or EMG signals
Who is practicing TinyML
NNs on MCUs - Prevailing Challenges
SRAM Overflow Challenge
▪ When deploying maximally deep-compressed
NN models
✓ They can exceed the device's memory capacity
by an SRAM margin of just a few bytes
✓ No additional optimization can be applied, since
they are already maximally optimized
▪ The core algorithm of the proposed SRAM-optimized
approach makes such maximally compressed NNs
consume less SRAM during execution
IoT Device Failure Challenge
▪ Cases are reported of devices failing while running large NNs
✓ Due to overheating, rapid battery wear, and run-
time stalling
✓ The prime reason for such failures is exhaustion
of the device's SRAM
▪ The proposed Tensor Memory Mapping (TMM) method
accurately computes and visualizes the tensor memory
requirement of each operator in the NN computation
graph during execution
Efficient NN Execution - Proposed Design
▪ The trained model's size and its peak SRAM must be greatly reduced because of the limited Flash and SRAM
capacity of IoT devices and TinyML hardware
▪ Our approach reduces the peak SRAM consumed by NNs, speeds up inference, and saves energy during model
execution. The upcoming slides present the parts of the proposed approach
i. The Tensor Memory Mapping (TMM) method
ii. Loading fewer tensors and re-using them
iii. The cheapest NN graph execution sequence
iv. Finally, the core algorithm that combines the above three parts and presents the complete approach as an
implementable algorithm
Tensor Memory Mapping (TMM)
▪ Before deployment, the execution memory requirement of a model is often unknown or calculated with low
accuracy, i.e., the calculations can deviate by a few MB
▪ When the model is targeted at smartphones or edge GPUs, these few-MB deviations cause no issues. But when
the target is an MCU, the low-accuracy calculation causes run-time memory overflows and/or prevents
flashing the model onto IoT devices due to SRAM peaks
▪ TMM accurately computes and visualizes the tensor memory requirement of each operator
in any NN model's computation graph. We use this high-accuracy calculation method in the core algorithm
design of our efficient NN execution approach
Tensor Memory Mapping (TMM)
▪ TMM is suitable for any pre-trained model, e.g., NASNet, Tiny-
YOLO, SqueezeNet
✓ Computes the total required SRAM, i.e., the space required to
store input tensors + output tensors + other tensors, then
exports a detailed report in CSV format (a minimal sketch of
this computation follows below)
✓ Can also produce images that show the tensor memory
requirement of each operator
✓ For example, when we feed Inception V1, which contains 84
graph nodes/operators, to TMM, it produces the figure shown
(for brevity, only operators 0 - 29) along with the
detailed CSV report
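The per-operator accounting can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes the model graph has already been parsed into OpRecord entries, and the OpRecord structure, tensor_bytes helper, and tmm_report function are all hypothetical names.

import csv
from dataclasses import dataclass
import numpy as np

@dataclass
class OpRecord:
    index: int            # position of the operator in the execution sequence
    name: str             # e.g. "CONV_2D"
    input_shapes: list    # shapes of tensors this operator reads
    output_shapes: list   # shapes of tensors this operator writes
    dtype: str            # e.g. "uint8" for a quantized model

def tensor_bytes(shape, dtype):
    # Bytes needed to keep one tensor of this shape/dtype resident in SRAM
    return int(np.prod(shape)) * np.dtype(dtype).itemsize

def tmm_report(ops, csv_path="tmm_report.csv"):
    # Per-operator SRAM need = input + output tensors resident together;
    # the maximum over all operators is the model's peak SRAM
    rows = []
    for op in ops:
        kb = sum(tensor_bytes(s, op.dtype)
                 for s in op.input_shapes + op.output_shapes) / 1024.0
        rows.append((op.index, op.name, kb))
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["op_index", "op_name", "sram_kb"])
        writer.writerows(rows)
    return max(r[2] for r in rows)   # peak SRAM in KB

Under this style of accounting, the peak over quantized DenseNet's default sequence is the 8429.568 KB figure reported later in the SRAM Usage slide.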
Load Fewer Tensors, also Re-use
▪ Intermediate tensors are stored in SRAM during
NN graph execution
▪ At the first branching point, the default
model-execution software loads the two
tensors, then executes all of the
branched operators
▪ This method of loading many tensors and
executing many operators leads to the most
critical SRAM overflow issues, especially in
scenarios where multiple branches
emerge from one branching point
A part of the COCO SSD MobileNet graph with its branched operators
Load Fewer Tensors, also Re-use
▪ As explained, in traditional NN graph execution, multiple tensors of various sizes are loaded into the buffer.
Such bulk loading is what causes the peak memory usage. In contrast, our approach (sketched below)
✓ Executes many operators while loading only a minimum number of tensors. This part of our approach also
conserves SRAM by re-using tensors
✓ Identifies and stores a particular set of tensors in the buffer (buffers are created within SRAM) and first
executes the branch of the NN graph whose operators are compatible with the stored tensors
✓ Then, in the next iteration, loads the tensors that are suitable as input for the operators of the
next branch and performs the execution. After each iteration, the buffers are reclaimed
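A minimal sketch of this branch-at-a-time schedule, under assumed interfaces: branches is an ordered list of (input_tensor_ids, operators) pairs, fetch loads a tensor into SRAM, and run_op executes one operator against the buffered tensors. All three names are illustrative, not the authors' API.

def execute_branchwise(branches, fetch, run_op):
    results = {}
    for input_ids, ops in branches:
        # Load only the tensors this branch needs, instead of every
        # tensor live at the branching point
        buffer = {tid: fetch(tid) for tid in input_ids}
        # Run every operator compatible with the buffered tensors
        for op in ops:
            results[op] = run_op(op, buffer)
        # Reclaim the SRAM buffers before moving to the next branch
        buffer.clear()
    return results

The peak SRAM at any instant is then bounded by one branch's inputs rather than by all tensors emerging from the branching point.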
Cheapest NN Graph Execution Sequence
▪ Does changing the execution sequence of operators still produce a valid schedule?
✓ The computation graphs of NN models are DAGs. We proved that in a DAG, executing the available nodes in any
topological order results in a valid execution sequence (see the sketch below)
✓ Branched computation graphs give the NN model-execution software the freedom to alter the
execution order/sequence of the operators
✓ So the proposed approach achieves its memory conservation goal by intelligently selecting the
execution branch that consumes the least SRAM when executed (reducing the peak memory consumption)
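The claim can be made concrete with a small enumerator: at every step, any node whose predecessors have all executed is safe to run next, so every interleaving of such "ready" nodes is a legal schedule. This is a minimal illustration; the function and argument names are ours, not the paper's.

def all_topological_orders(succ):
    # succ: dict mapping each node to its list of successors (a DAG);
    # sinks must appear as keys with an empty list
    indeg = {u: 0 for u in succ}
    for u in succ:
        for v in succ[u]:
            indeg[v] += 1
    order, placed = [], set()

    def extend():
        ready = [u for u in succ if indeg[u] == 0 and u not in placed]
        if not ready:
            if len(order) == len(succ):   # all nodes scheduled
                yield list(order)
            return
        for u in ready:                   # any ready node is a valid next step
            order.append(u); placed.add(u)
            for v in succ[u]:
                indeg[v] -= 1
            yield from extend()
            for v in succ[u]:             # backtrack and try the next choice
                indeg[v] += 1
            order.pop(); placed.remove(u)

    yield from extend()

For the diamond graph {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []} this yields ['a', 'b', 'c', 'd'] and ['a', 'c', 'b', 'd']; both are valid executions, but they can differ in peak SRAM.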
Core Algorithm
▪ Discovers multiple topological orders of the nodes in a computation graph, using the methods presented on
the previous slides inside its loops
✓ Analyzes the complete network by running through each branch and finally discovering
the cheapest graph execution path/sequence (a simplified sketch follows below)
✓ The time the algorithm takes to produce its results is governed by the complexity Tc = O(|O| · 2^|O|), where |O|
is the total operator count
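A simplified sketch of the search, assuming the all_topological_orders enumerator above and a TMM-style scorer peak_sram(order) that returns the peak SRAM (in KB) of a given sequence; both names are ours. The exhaustive enumeration over orders is what gives the Tc = O(|O| · 2^|O|) bound.

def cheapest_execution_sequence(succ, peak_sram):
    # Score every valid execution order and keep the one whose
    # peak SRAM (computed TMM-style) is lowest
    best_order, best_peak = None, float("inf")
    for order in all_topological_orders(succ):
        peak = peak_sram(order)
        if peak < best_peak:
            best_order, best_peak = order, peak
    return best_order, best_peak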
Experimental Evaluation
▪ We downloaded popular pre-trained TensorFlow Lite models (.tflite format)
✓ 9 models, ranging from image classification to text detection
✓ They contain hundreds of operators, so the complexity faced by our algorithm is high
✓ Evaluation was conducted on a standard laptop with an Intel(R) Core(TM) i7-5500 CPU @ 2.40 GHz
▪ We execute each model first using its default execution sequence, then using the order-altered sequence
(a hypothetical timing harness is sketched below)
✓ We tabulate the corresponding peak SRAM usage, unit inference time, and energy consumed
✓ The next slides present the benefits achieved by optimizing model graph execution with
the proposed approach
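For instance, the unit inference time of a .tflite model on the host CPU can be measured with the standard tf.lite.Interpreter API, as in this hypothetical harness; the function name, run count, and random input are ours, since the deck does not show its measurement script.

import time
import numpy as np
import tensorflow as tf

def mean_inference_ms(model_path, runs=50):
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    # Feed random data of the model's expected shape and dtype
    x = np.random.random_sample(inp["shape"]).astype(inp["dtype"])
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()                         # warm-up run
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.invoke()
    return (time.perf_counter() - start) * 1000.0 / runs

Running it once on the default model and once on the order-altered model gives the per-model time deltas tabulated on the next slides.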
SRAM Usage
▪ We take quantized DenseNet with its default execution
sequence and feed it to TMM. From the computed memory
requirement of each operator in the default graph, the 24th
operator showed a peak SRAM usage of 8429.568 KB
▪ Next, after applying the proposed algorithm to DenseNet, the
resulting memory-friendly graph execution sequence, when
evaluated by TMM, showed a peak memory of only
5221.264 KB (peak reduced by 38.06%)
▪ We plot the peak SRAM reduction percentage calculated this
way for all models (A - I)
▪ The maximum peak SRAM reduction is 38.06% for DenseNet
and the least is 1.61% for DeepLabv3 (the DenseNet figure is
worked out below)
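The plotted percentage follows directly from the two TMM peaks; for DenseNet:

\[
\frac{8429.568 - 5221.264}{8429.568} \times 100\% = \frac{3208.304}{8429.568} \times 100\% \approx 38.06\%
\]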
Inference Time and Energy Consumption
▪ We execute each model (A - I) first with its default execution
sequence, then with the memory-peak-reduced sequence
produced by the proposed approach
✓ DenseNet shows the maximum inference time reduction
of 4.9 ms, and EAST the least at 0.28 ms
✓ 4 - 84 mJ less energy is consumed per unit inference,
since executing a model with the optimized sequence is
0.28 - 4.9 ms faster than with the default sequence (see
the relation worked out below)
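The two figures are consistent through E = P · t. The power draws below are back-calculated from the reported numbers (they are not stated in the deck) and fall in the range expected of a laptop-class CPU under load:

\[
\Delta E = P \cdot \Delta t: \qquad 17.1\,\mathrm{W} \times 4.9\,\mathrm{ms} \approx 84\,\mathrm{mJ}, \qquad 14.3\,\mathrm{W} \times 0.28\,\mathrm{ms} \approx 4\,\mathrm{mJ}
\]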
Conclusion
▪ We presented an approach to efficiently execute (with reduced SRAM usage) deeply optimized (maximally
compressed) ML models on resource-constrained devices. As shown by the evaluations, when users apply the
presented approach, they can
✓ Execute large, high-quality models on their IoT devices/products without needing to upgrade the
hardware, or alter the model architecture and re-train to produce a smaller model
✓ Let devices control real-world applications by making timely predictions/decisions
✓ Let devices perform high-accuracy offline analytics without affecting the operating time of battery-
powered devices
Contact: Bharath Sudharsan
Email: bharath.sudharsan@insight-centre.org
www.confirm.ie