Working with Lightweight AI models – TinyML,
TensorFlow Lite, MobileNet
Dinesh Yuvaraj
CTO, CubeAISolutions Tech Pvt Ltd.
Bangalore.
CONTENTS 01 TinyML Landscape
02 Model Design
03 Compression Toolkit
04 TensorFlow Lite Path
05 MobileNet in Practice
06 Next Steps
01
TinyML Landscape
What Are Lightweight AI Models?
Definition
Lightweight AI models are efficient and compact machine learning models designed to deliver intelligent results while
using less memory, computation, and power.
Why Lightweight Models?
Traditional AI models are powerful but often require:
• High computational resources
• Large memory and storage
• Expensive infrastructure
Lightweight models address these challenges by being
faster, smaller, and more practical for real-world use.
Key Characteristics
• Low memory footprint
• Faster inference time
• Reduced power consumption
• Optimized for edge and limited-resource environments
• Cost-effective deployment
Where Are They Used?
• Mobile and IoT devices
• Educational and academic systems
• Real-time applications
• On-device AI and edge computing
• Low-cost cloud environments
Why It Matters
• Enables AI adoption without heavy infrastructure
• Suitable for teaching, research, and student projects
• Bridges theory with practical, deployable AI systems
TinyML
What is TinyML?
Definition
TinyML (Tiny Machine Learning) is the practice of running machine learning models directly on microcontrollers and very
small devices with extremely limited memory, power, and compute resources.
Key Idea
Instead of sending data to the cloud, TinyML processes data locally on the device, enabling real-time intelligence with low
power consumption.
Key Characteristics
• Runs on microcontrollers (MCUs)
• Requires very small memory (KBs, not GBs)
• Operates with low power (battery-powered devices)
• No operating system or internet required
• Uses lightweight AI models
What TinyML Brings to the Edge
01 Low Power Inference
TinyML enables inference on microcontrollers at milliwatt levels, significantly reducing power consumption compared to traditional cloud-based solutions and making it ideal for battery-powered devices.
02 Latency Reduction
By eliminating the need for cloud connectivity, TinyML reduces latency, ensuring faster response times for applications requiring real-time processing, such as anomaly detection and keyword spotting.
03 Cost and Privacy Benefits
TinyML cuts bill of materials (BOM) costs by leveraging low-cost microcontrollers and enhances privacy by keeping data processing local, without sending sensitive information to the cloud.
Resource Wall vs Cloud ML
Microcontroller Constraints
• TinyML runs on microcontrollers, not powerful servers
• Very limited SRAM (64–512 KB)
• Limited flash memory (around 1 MB)
• Low-frequency CPUs (~100 MHz)
• Often no floating-point hardware
• Designed for low power and low cost
• Large AI models cannot run directly on these devices
Model Optimization
• Models must be redesigned to fit hardware limits
• Model size reduction to fit flash memory
• RAM usage optimization for inference
• Reduction of MAC (multiply-accumulate) operations
• Use of 8-bit quantization instead of floating point
• Balance between accuracy and real-time performance
• Enables AI models to run efficiently on microcontrollers
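As a quick worked example of the budget math: a model with 1 million parameters stored as FP32 needs about 4 MB of flash for the weights alone, roughly four times what a typical 1 MB microcontroller offers. Quantizing the same weights to INT8 (1 byte each) brings them down to about 1 MB, and structured pruning that removes half the channels roughly halves that again, which is why size reduction, quantization, and MAC reduction are applied together rather than in isolation.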
02
Model Design
Picking Micro-Architectures
01 Preferred Architectures
Start with lightweight architectures like MobileNetV3-Small, SqueezeNet, or custom DSC-NN models to ensure efficient performance on resource-constrained devices.
02 Depthwise Separable Convolutions
Favor depthwise separable convolutions, which significantly reduce the number of parameters while maintaining high accuracy, making models more efficient for TinyML applications (see the sketch after this list).
03 Inverted Residuals
Incorporate inverted residuals to improve model efficiency and accuracy, ensuring that the model can run effectively on low-power microcontrollers.
04 Squeeze-Excitation Blocks
Use squeeze-excitation blocks to enhance feature representation with minimal additional parameters, maintaining accuracy within 2% of larger models.
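The depthwise separable block is simple to express in Keras. Below is a minimal, illustrative sketch (layer widths, input shape, and class count are placeholder values, not from the slides): a 3×3 depthwise convolution followed by a 1×1 pointwise convolution, each with batch normalization and the hardware-friendly ReLU6 activation.

```python
import tensorflow as tf

def ds_conv_block(x, filters, stride=1):
    """Depthwise separable convolution: 3x3 depthwise + 1x1 pointwise."""
    x = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding="same",
                                        use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU(max_value=6.0)(x)        # ReLU6: quantization-friendly
    x = tf.keras.layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU(max_value=6.0)(x)

# Toy model: input shape and widths are illustrative placeholders.
inputs = tf.keras.Input(shape=(96, 96, 1))
x = tf.keras.layers.Conv2D(8, 3, strides=2, padding="same", use_bias=False)(inputs)
x = ds_conv_block(x, 16)
x = ds_conv_block(x, 32, stride=2)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
```

For a 3×3 kernel, this factorization needs roughly 8–9× fewer multiply-accumulates and parameters than a standard convolution with the same input and output channels, which is where the efficiency gain comes from.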
Design for Deployment First
Early Embedding of Constraints
•Deployment limitations are embedded during model design
•Models are built to work with:
•INT8-only operations
•Small activation memory (< 32 KB)
•Hardware-friendly activations (ReLU6)
•Softmax-free or simplified output layers
•Ensures smooth and efficient on-device execution
What “Design for Deployment First” Means
• AI models are designed with the target hardware in mind
• Constraints such as memory, compute, power, and latency are
considered early
• Avoids redesigning or retraining models at the final stage
Why This Approach Is Important
• Prevents deployment failures
• Reduces optimization effort later
• Improves performance on edge and TinyML devices
• Makes models practical, efficient, and scalable
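One practical way to apply this mindset is to check the converted model against the constraints before any firmware work begins. The sketch below is a rough host-side sanity check, assuming an INT8 model file named model_int8.tflite (a placeholder name): it verifies that the model's inputs and outputs are INT8 and prints a crude upper bound on tensor memory. Note that this sum also counts weight tensors, and the real on-device arena is smaller because TFLite Micro reuses buffers.

```python
import numpy as np
import tensorflow as tf

# Placeholder file name; substitute your converted model.
interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
assert inp["dtype"] == np.int8 and out["dtype"] == np.int8, \
    "model is not INT8 at the I/O boundary"

# Crude upper bound: total bytes across all tensors (weights included).
total_bytes = sum(
    int(np.prod(t["shape"])) * np.dtype(t["dtype"]).itemsize
    for t in interpreter.get_tensor_details()
    if len(t["shape"]) > 0
)
print(f"Upper bound on tensor memory: {total_bytes / 1024:.1f} KB")
```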
03
Compression Toolkit
Quantization to INT8
What is Quantization to INT8?
• Quantization converts 32-bit floating-point (FP32) model values into 8-bit integers (INT8).
• This makes AI models smaller, faster, and more efficient for deployment on edge devices.
Post-Training Static Quantization
• Quantization is applied after the model is fully trained
• Both weights and activations are converted to INT8
• Uses scale and zero-point to map floating values to integers
• Reduces model size by 4×
• Improves inference speed by 2–3×
• Ideal for TinyML and on-device inference
Calibration for Accuracy
• Uses representative sample data to calibrate the model
• Helps determine correct scaling ranges
• Minimizes accuracy loss after quantization
• Ensures INT8 model performance is close to FP32 model
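A minimal post-training static quantization sketch with the TensorFlow Lite converter is shown below. The saved-model path, input shape, and the random calibration samples are placeholders; in practice the representative dataset should yield a few hundred real, preprocessed training samples so the converter can pick per-tensor scale and zero-point values.

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Placeholder calibration data; replace with real preprocessed samples.
    for _ in range(100):
        yield [np.random.rand(1, 96, 96, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Full-integer quantization: INT8 weights, activations, inputs, and outputs.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```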
Pruning & Sparsity Gains
Magnitude Pruning
Remove 50–90% of weights by zeroing the smallest absolute values and retraining, achieving significant model compression without substantial accuracy loss (see the sketch after this list).
Structured Channel Pruning
Apply structured channel pruning to map cleanly to CMSIS-NN kernels, yielding 2× flash reduction and a 30% cycle cut, enhancing model efficiency.
Accuracy Preservation
Maintain model accuracy within 1% of the original by carefully balancing pruning and retraining strategies, ensuring robust performance on-device.
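One common way to implement magnitude pruning is the TensorFlow Model Optimization Toolkit, sketched below. Here base_model, train_ds, and the schedule's step counts are placeholders for your own model, dataset, and training length; the schedule ramps sparsity from 50% to 90% while fine-tuning recovers accuracy.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# base_model and train_ds are placeholders (a Keras model and a tf.data dataset).
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.50, final_sparsity=0.90,
    begin_step=0, end_step=10_000)
pruned = tfmot.sparsity.keras.prune_low_magnitude(base_model,
                                                  pruning_schedule=schedule)
pruned.compile(optimizer="adam",
               loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])

# Fine-tune while the smallest-magnitude weights are progressively zeroed.
pruned.fit(train_ds, epochs=5,
           callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers before export; the zeros stay in the weights.
final_model = tfmot.sparsity.keras.strip_pruning(pruned)
```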
04
TensorFlow Lite Path
TFLite Micro Workflow
01 Export SavedModel
• Start with a trained TensorFlow model
• Export it in the SavedModel format
• Ensures compatibility with the TensorFlow Lite converter
• Acts as the bridge between training (PC/cloud) and deployment (device)
• Prepares the model for conversion and deployment
02 INT8 Quantization
• Apply INT8 quantization during model conversion
• Converts FP32 weights and activations to 8-bit integers
• Reduces memory usage
• Speeds up inference
• Optimized for low-power devices
• Makes the model lightweight and efficient
03 Micro Interpreter
• Uses the TensorFlow Lite Micro interpreter
• Converts the model into a C array (see the sketch after this list)
• Maps tensors to a fixed RAM memory arena
• Integrates CMSIS-NN or custom kernels for optimized execution
• Enables the model to run without an OS on microcontrollers
04 Bare-Metal Deployment
• Compile the model directly into the firmware
• Use compiler optimizations like -O3
• Enable hardware support (e.g., FPU flags if available)
• Ensure the entire model fits within ~100 KB of memory
• Runs on bare metal (no operating system)
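The "convert the model into a C array" step is often done with xxd -i; an equivalent sketch in Python is shown below. File and variable names are placeholders, and on-device code typically declares the array with an alignment attribute (e.g., alignas(8)) so the flatbuffer is properly aligned in flash.

```python
def tflite_to_c_array(tflite_path, out_path, var_name="g_model"):
    """Write a .tflite flatbuffer as a C byte array plus a length constant."""
    data = open(tflite_path, "rb").read()
    lines = [f"const unsigned char {var_name}[] = {{"]
    for i in range(0, len(data), 12):
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in data[i:i + 12]) + ",")
    lines.append("};")
    lines.append(f"const unsigned int {var_name}_len = {len(data)};")
    with open(out_path, "w") as f:
        f.write("\n".join(lines) + "\n")

# Placeholder file names.
tflite_to_c_array("model_int8.tflite", "model_data.cc")
```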
What is the TensorFlow Lite Path?
Definition
The TensorFlow Lite Path describes the end-to-end journey of an AI model from training to deployment on resource-
constrained devices such as mobile phones, embedded systems, and microcontrollers.
TensorFlow Lite
Why TensorFlow Lite?
• Full TensorFlow models are too heavy for edge devices
• TensorFlow Lite provides a lightweight runtime
• Optimized for low latency, low memory, and low power usage
Typical TensorFlow Lite Path
1. Model Training: train the model using TensorFlow (FP32)
2. Model Optimization: apply quantization, pruning, or other optimizations
3. TFLite Conversion: convert the model to .tflite format
4. Deployment: run on mobile, edge, or embedded devices using the TFLite Interpreter, or TFLite Micro for microcontrollers
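To illustrate the deployment end of this path, here is a minimal host-side sketch that loads a converted model with the TFLite Interpreter and runs one inference. The file name and the random input are placeholders; the quantize/dequantize steps assume a fully INT8 model as produced earlier.

```python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")  # placeholder path
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Quantize a float input using the model's input scale and zero point.
scale, zero_point = inp["quantization"]
x = np.random.rand(*inp["shape"]).astype(np.float32)  # stand-in for real input
x_q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)

interpreter.set_tensor(inp["index"], x_q)
interpreter.invoke()

# Dequantize the INT8 output back to floats for readability.
y_q = interpreter.get_tensor(out["index"])
out_scale, out_zero = out["quantization"]
print((y_q.astype(np.float32) - out_zero) * out_scale)
```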
Memory & Latency Tuning
Tensor Lifetime Logging
Utilize the RecordingMicroInterpreter to log tensor lifetimes, enabling efficient buffer reuse via pointer arithmetic and optimizing memory usage.
Cycle Count Profiling
Profile cycle counts with ETM traces to ensure inference times remain under 50 ms for 80 MHz Cortex-M4 processors, maintaining real-time performance.
05
MobileNet in Practice
What is MobileNet?
Definition
MobileNet is a lightweight deep learning model designed specifically for mobile, edge, and embedded devices where
memory, compute power, and energy are limited.
Why MobileNet Was Introduced
• Traditional CNNs are too large and computationally expensive
• Mobile and edge devices require fast and efficient models
• MobileNet achieves high accuracy with much lower computation
Key Features
• Small model size
• Low latency
• Low power consumption
• Easily quantizable to INT8
• Ideal for real-time on-device inference
Shrinking MobileNetV3
01 Initial Configuration
Start with a width multiplier of 0.35, removing squeeze-excitation from early layers and replacing h-swish with ReLU6 to reduce model complexity.
02 Quantization
Quantize the model to INT8, achieving a compact size of 125 k parameters and 250 kB of flash, suitable for deployment on low-power microcontrollers.
03 Performance
Achieve 85% top-1 accuracy on CIFAR-10, with inference times of 32 ms on an STM32H7 at 400 MHz, demonstrating efficient on-device performance.
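A starting point close to this configuration can be built directly from Keras, as sketched below. The input shape and class count are placeholders, and the minimalistic flag is only an approximation of the recipe above: it removes squeeze-excitation and hard-swish throughout the network (not just the early layers), leaving plain ReLU activations. No pretrained weights are available at this width, so training starts from scratch.

```python
import tensorflow as tf

# Illustrative shrunken MobileNetV3-Small; input shape and classes are placeholders.
model = tf.keras.applications.MobileNetV3Small(
    input_shape=(96, 96, 3),
    alpha=0.35,          # width multiplier: 0.35x channels throughout
    minimalistic=True,   # drops squeeze-excitation and hard-swish
    include_top=True,
    weights=None,        # no pretrained weights at this alpha
    classes=10)
model.summary()          # inspect parameter count before quantization
```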
Real-World Deployment Tips
On-Device Validation
Validate the model on-device using noisy sensor data to ensure robust performance in real-world conditions, accounting for environmental variations.
Preprocessing
Implement sliding-window preprocessing in DMA buffers to efficiently handle continuous data streams, ensuring seamless input for the model (see the sketch after this list).
Heap Safety
Add watchdog resets to protect against heap corruption, ensuring system stability and reliability during long-term operation.
Secure OTA Updates
Enable secure over-the-air (OTA) updates using signed binary diffs, allowing for easy model refinement and deployment without physical access.
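The windowing logic behind the preprocessing tip can be prototyped on the host before it is ported to DMA-driven firmware. The sketch below is a simple Python version with placeholder window and hop sizes; on the device, the same framing would be fed by double-buffered DMA transfers rather than a Python generator.

```python
import numpy as np

def sliding_windows(signal, window=256, hop=128):
    """Yield overlapping fixed-size frames from a continuous 1-D stream."""
    for start in range(0, len(signal) - window + 1, hop):
        yield signal[start:start + window]

stream = np.random.randn(2048).astype(np.float32)  # stand-in for sensor samples
for frame in sliding_windows(stream):
    features = frame  # e.g., compute spectrogram/MFCC features here
    # ...run one inference per frame...
```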
06
Next Steps
Roadmap to Production
Automation
Automate the compression pipeline using tools like Edge Impulse or SHYFT to streamline the development process and ensure consistent model optimization.
Power Benchmarking
Benchmark power consumption using tools like Joulescope to ensure the model meets energy-efficiency requirements for battery-powered devices.
Continuous Improvement
Track TinyMLPerf benchmarks and incorporate on-device federated learning for personalization, ensuring continual efficiency gains as silicon technology evolves.
THANK YOU