SlideShare a Scribd company logo
1 of 32
Download to read offline
Understanding, Selecting, and
Optimizing Object Detectors for
Edge Applications
Md Nasir Uddin Laskar
Staff Machine Learning Engineer
Walmart Global Tech
Highlights
2
Introduction
Evolution of OD
models
Two-stage
models: R-CNNs
One-stage
models: YOLO,
SSD
Transformer-
based: DETR
OD for edge
devices
OD future
directions
https://www.youtube.com/embed/1_SiUOYUoOI
Object Detection: Introduction
3
[ Bochkovskiy A. et al ]
Input: A single image
(typically RGB)
Output: A set of
detected objects as class
label and bounding box
Objects: From a set of
classes. Person, things,
even Texts
Object Detection: Task
4
Object Detection: Applications
5
[ researchleap.com/ | psimagazine.co.uk/ ]
[ Sang-gil Lee et al, MICCAI 2018 ]
[ learn.arcgis.com/ | vectorstock.com/ ]
• Multiple Outputs
• Image can have variable number of objects from
various classes
• Can also have high overlap between objects in the
image
• Multiple Types of Outputs
• Need to output what (class label) and where
(bounding box)
• High Resolution Images
• Classification works at 224x224. Higher resolution
is needed for detection.
Object Detection: Challenges
6
[ image credit Bochkovskiy A. ]
Object Detection: Evolution of Models
7
. . .
Traditional
One-stage
Two-stage
• Viola Jones
• HOG
• DPM
Deep Learning Transformers
• Fast R-CNN
• Faster R-CNN
• FPN
• R-CNN
• YOLO
• SSD
• RetinaNet
• YOLOv2
• CornerNet
• CenterNet
• EfficientDet
• YOLOv5
• DeTR, Swin
Object Detection: Simple Approach
8
CNN Model
Classification head: What
Detection head: Where
Correct label: Bird
Bbox: (x’, y’, w’, h’)
4
0
9
6
Class Scores
Bird: 0.90
Cat: 0.05
Dog: 0.01
…
Bounding
Box
(x, y, w, h)
Softmax
Loss
L2 Loss
Weighted
Sum
Multitask Loss
[ Bird picture: https://pixabay.com/ ]
• Question: What is the problem with this setup?
It cannot detect if the image has multiple objects.
• Use selective search to identify a manageable number of object region candidates (region of interest or
RoI).
• Extracts CNN features from each region independently for classification.
R-CNN Class of Models
9
[ Girshick et al, CVPR 2014 ]
1. Propose category-independent RoIs by selective search
2. Warp region candidates to a fixed size as required by CNN, e.g. 224x224
3. Generate potential bounding boxes, and then run a classifier on these proposed boxes, e.g. SVM
4. Refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other
objects in the scene
R-CNN Steps in Detail
10
[ Girshick et al, CVPR 2014 ]
Summary
Input
image
Conv
Net
Conv
Net
Conv
Net
Class
Class
Class
Warped image
regions (224x224)
Regions of
Interest (RoI)
from a proposal
method (~2k)
Forward each
region through
ConvNet
Bbox
Bbox
Bbox
F
d
c
im
“Slow” R-CNN: Run
CNN independently
for each region
“B
ne
Ale
Re
Re
Int
fro
me
R-CNN: Impacts / Limitations
11
Cannot be
trained end-
to-end
Requires 100s
of GB of
storage space
Not suitable to
run real-time
applications
Selective
search is not
optimized for
object detection
Pioneered the
CNN for object
detection
Sets the stage
to evolve the
field
• 30K citations
• 4K papers with
title "R-CNN" 1
[ 1Google Scholar advanced search. allintitle:"R-CNN" ]
• Run a single CNN on the entire image. Get RoIs from the image features instead of the image itself.
• Share computations across all ROIs rather than doing calculations for each proposal independently.
• Does not need to cache extracted features in the disk. The architecture is trained end-to-end with a
multi-task loss.
Fast R-CNN
12
[ Paper: Girshick , ICCV 2015 ]
[ Image: https://www.mathworks.com/ ]
Summary
Input
image
Conv
Net
Conv
Net
Conv
Net
Class
Class
Class
Warped image
regions (224x224)
Regions of
Interest (RoI)
from a proposal
method (~2k)
Forward each
region through
ConvNet
Bbox
Bbox
Bbox
Fast R-CNN: Apply
differentiable
cropping to shared
image features
“Slow” R-CNN: Run
CNN independently
for each region
F
C
w
ConvNet
Input image
Run whole image
through ConvNet
Image features
Crop + Resize features
Per-Region Network
“Backbone”
network:
AlexNet, VGG,
ResNet, etc
Regions of
Interest (RoIs)
from a proposal
method
CNN
CNN
CNN
Bbox
Class
Bbox
Class
Bbox
Class
Category and box
transform per region
• Nearly cost-free region proposals using Region Proposal Network (RPN), that shares convolutional features with the
detection network.
• The convolutional computations are shared across the RPN and the Fast R-CNN, effectively reducing the computation time.
Faster R-CNN
13
• Introduced multi-scale anchor boxes to detect objects of various sizes.
[ Ren et al, NeurIPS 2015 ]
Slow, Fast, and Faster R-CNN
14
Summary
Input
image
Conv
Net
Conv
Net
Conv
Net
Class
Class
Class
Warped image
regions (224x224)
Regions of
Interest (RoI)
from a proposal
method (~2k)
Forward each
region through
ConvNet
Bbox
Bbox
Bbox
Fast R-CNN: Apply
differentiable
cropping to shared
image features
“Slow” R-CNN: Run
CNN independently
for each region
Faster R-CNN:
Compute proposals
with CNN
ConvNet
Input image
Run whole image
through ConvNet
Image features
Crop + Resize features
Per-Region Network
“Backbone”
network:
AlexNet, VGG,
ResNet, etc
Regions of
Interest (RoIs)
from a proposal
method
CNN
CNN
CNN
Bbox
Class
Bbox
Class
Bbox
Class
Category and box
transform per region
Si
Fu
de
Run CNN independently for
each region
Summary
Input
image
Conv
Net
Conv
Net
Conv
Net
Class
Class
Class
Warped image
regions (224x224)
Regions of
Interest (RoI)
from a proposal
method (~2k)
Forward each
region through
ConvNet
Bbox
Bbox
Bbox
Fast R-CNN: Apply
differentiable
cropping to shared
image features
“Slow” R-CNN: Run
CNN independently
for each region
Faster R-CNN:
Compute proposals
with CNN
ConvNet
Input image
Run whole image
through ConvNet
Image features
Crop + Resize features
Per-Region Network
“Backbone”
network:
AlexNet, VGG,
ResNet, etc
Regions of
Interest (RoIs)
from a proposal
method
CNN
CNN
CNN
Bbox
Class
Bbox
Class
Bbox
Class
Category and box
transform per region
Summary
Input
image
Conv
Net
Conv
Net
Conv
Net
Class
Class
Class
Warped image
regions (224x224)
Regions of
Interest (RoI)
from a proposal
method (~2k)
Forward each
region through
ConvNet
Bbox
Bbox
Bbox
Fast R-CNN: Apply
differentiable
cropping to shared
image features
“Slow” R-CNN: Run
CNN independently
for each region
Faster R-CNN:
Compute proposa
with CNN
ConvNet
Input image
Run whole image
through ConvNet
Image features
Crop + Resize features
Per-Region Network
“Backbone”
network:
AlexNet, VGG,
ResNet, etc
Regions of
Interest (RoIs)
from a proposal
method
CNN
CNN
CNN
Bbox
Class
Bbox
Class
Bbox
Class
Category and box
transform per region
[ image credit: Justin Johnson, University of Michigan]
Differentiable cropping to shared
image features
Compute region proposals with
CNNs
• Use pyramidal feature hierarchy for efficient detection of objects of various sizes.
• Model Architecture: Backbone model (VGG) and SSD head. SSD head outputs the bounding box
and object classes.
• Large fine-grained feature maps (lower level) at are good at capturing small objects and small
coarse-grained feature maps detect large objects well (higher level).
Single Shot Detector: SSD
15
[ Wei Liu et al, ECCV 2016 ]
Conv
Net
Class
Warped image
regions (224x224)
Regions of
Interest (RoI)
from a proposal
method (~2k)
Forward each
region through
ConvNet
Bbox
Fast R-CNN: Apply
differentiable
cropping to shared
image features
-CNN: Run
pendently
egion
Faster R-CNN:
Compute proposals
with CNN
ConvNet
Input image
Run whole image
through ConvNet
Image features
Crop + Resize features
Per-Region Network
“Backbone”
network:
AlexNet, VGG,
ResNet, etc
Regions of
Interest (RoIs)
from a proposal
method
CNN
CNN
CNN
Bbox
Class
Bbox
Class
Bbox
Class
Category and box
transform per region
Single Stage
• Eliminate RPN. Use grid cells technique to detect object of various sizes.
• Predicts offset of predefined anchor (default) boxes for every location of the feature map.
• The anchor boxes on different levels are rescaled so that one feature map is only responsible for
objects at one particular scale.
SSD: Steps
16
[ Wei Liu et al, ECCV 2016 ]
• Cat (Small Object) is captured by the 8x8 feature
map (lower level).
• Dog (Large Object) can only be detected in the 4x4
feature map (higher level)
Fine-grained Coarse-grained
• One of the first attempts to build a fast, real-time object detector.
• YOLO Frames the object detection as a single regression problem, straight from image pixels to
bounding box and class probabilities. Hence, YOLO, You Only Look Once.
• The final prediction of shape S × S × (5B + C) is produced by two fully connected layers over the whole
conv feature map.
YOLO Class of Models
17
[ Paper: Redmond et al, CVPR 2016. ]
[ image: https://lilianweng.github.io/ ]
YOLO: Steps and Limitations
18
• Split the image into SxS cells. Each cell predicts
• The location of bounding boxes as (x, y, w, h), a confidence score, and a probability of
object class
• Final prediction is S × S × (5B + C). For PASCAL VOC S=7, B=2, C=20. That is why the final map
is 7x7x30
[ Redmond et al, CVPR 2016. ]
• Cannot detect group of small objects.
Maximum B (here, 2) objects per cell
• Irregular shaped objects
YOLOv2 and Beyond
19
• Light-weight base model,
DarkNet-19
• BatchNorm on conv layers
• Conv layers to predict anchor
boxes
• Direct location prediction
YOLOv2
• Latest in the series
YOLOv8
• Logistic regression for
confidence scores
• Multiple independent classifiers
instead of one softmax
• Skip-layer concatenation
YOLOv3
[ Redmond et al, CVPR 2017 ]
. . .
• DETR frames the object detection task as an image-to-set problem. Given an image, the model predicts
an unordered set of all the objects present.
• Existing methods have number of components that make them complicated.
Transformer-based Detectors: DETR
20
[ Carion N, Massa F et al ]
RPN
• Directly predicts the final set of
detections in parallel
• During training, bipartite matching
uniquely assigns predictions with
ground truth boxes.
• Predictions with no match yield a “no
object” class prediction.
Transformer-based Detectors: DETR
21
[ Carion N, Massa F et al ]
• Slow convergence, 5x slower than
Faster R-CNN
• Poor detection on small objects
• Task of detecting objects from a video, such as in autonomous driving scenario
• Challenges
• Appearance deterioration
• Changes of video frames, e.g., motion blur, part occlusion, camera re-focous, rare poses etc.
• Aggregate temporal cues from different frames. Two-step baseline models (Faster R-CNN, R-FCN)
• Box-level. Post-processing of temporal information.
• Feature-level. Improve features of the current frame by aggregating that of adjacent frames.
Object Detection in Video
22
• Recent. Use one-step models such as YOLO / DETR to build end-to-end detectors.
• Precision measures how accurate are the predictions of the detector, aka, percentage of
correct predictions.
• Recall measures how good the object detector can detect all the positives.
• IoU measures the overlap between GT and predicted boundaries.
Evaluation Metrics
23
intersection
IoU =
poor good
poor
[ Bird picture: https://pixabay.com/ ]
• Average Precision (AP) computes the mean precision value for recall value over 0 to 1.
union
1. Run the detector for all test images
2. For each category: for each detection
1. Compute the AP, which is area under PR curve
2. Plot a point on PR curve if IoU > 0.5
3. mAP = average of AP for each category
4. COCO mAP : average AP for IoU from 0.5 to 0.95
with a step size of 0.05.
Mean Average Precision (mAP)
24
• Speed of the detection is usually quantified with FPS
0.99 0.95 0.90 0.50 0.10
All Bird detections sorted by scores
All GT Bird boxes
IoU > 0.5
Benchmark Analysis
25
Object Detection at the Edge: Considerations / Tradeoffs
26
• CPU / GPU / NPU
• Real-time applications
• High resolution images
Compute / Speed
• Some edge devices do
not support NMS
Post-Process
• Model size / #Params
• RAM / Flash
• Imbalanced memory
distribution in first conv
layers
Memory
Considerations
and
Challenges
Tradeoffs
• Single-stage models
have lower mAP
Accuracy
• Higher precision models
usually have lower FPS
FPS
• Design New Model
• Design new model architecture that runs on your target device and train it [Not Recommended]
• Smaller version of an existing model and train it, such as FOMO, MCUNetV2
• Transfer Learning
• Fine-tune an existing model on your custom data. For example, TF Detection Model Zoo.
• Pick a model that works best for your use-case and target hardware.
• Pre-training Optimizations
• Quantization-aware training of existing models
• Post-training Optimizations
• Model pruning / quantization
• Hardware specific optimizations: TFLite / TensorRT / ONNX / similar
Object Detection at the Edge: Develop and Optimize
27
Object Detection at the Edge: Example
28
• MobileNetV2 base model
• Patch-by-patch inference to solve
imbalanced memory distribution
• Receptive Field redistribution to
reduce computation overhead
MCUNetV2
[ MCUNetV2: Lin et all, NeurIPS 2021 ]
On Pascal VOC
• 68.3% (+16.9) with 438kB SRAM
• 64.6% (+13.2) with 247kB SRAM
• Only 7 FPS
• Not tested on high resolution
images
Object Detection: What is Next?
29
• Accuracy of two-stage
• Speed of one-stage
Fastest R-CNN
• Particularly critical for
autonomous driving
3D Obj Detection
• Training at the edge
devices
• Adapt to data drifts
On-device Training
• Efficient detection in
video
• Has so many real-world
applications
Detection in Video
• More algorithms/
models
• Compatibility towards
edge devices
Transformers
• Object detection applications and challenges
• Evolution of object detection systems
• Some of the popular object detection models
• Considerations and tradeoffs of object detection for edge applications
• Optimizing object detection systems for edge devices
Conclusion
30
31
Questions / Discussions
• Off the shelf object detection models:
• TensorFlow OD model Zoo
• TensorFlow Mobile Optimized Detectors
• Detectron 2: object detection using PyTorch and model zoo
• Object detection training datasets
• Pascal VOC dataset
• MS COCO Dataset
• Object detection training frameworks
• TensorFlow Lite , Example object detection for mobile devices
• PyTorch example object detection using pre-trained models
• Get hands-on
• Train YOLOv4 using Google Colab
Resources
32

More Related Content

Similar to “Understanding, Selecting and Optimizing Object Detectors for Edge Applications,” a Presentation from Walmart Global Tech

Sem 2 Presentation
Sem 2 PresentationSem 2 Presentation
Sem 2 PresentationShalom Cohen
 
Week5-Faster R-CNN.pptx
Week5-Faster R-CNN.pptxWeek5-Faster R-CNN.pptx
Week5-Faster R-CNN.pptxfahmi324663
 
MLIP - Chapter 5 - Detection, Segmentation, Captioning
MLIP - Chapter 5 - Detection, Segmentation, CaptioningMLIP - Chapter 5 - Detection, Segmentation, Captioning
MLIP - Chapter 5 - Detection, Segmentation, CaptioningCharles Deledalle
 
150807 Fast R-CNN
150807 Fast R-CNN150807 Fast R-CNN
150807 Fast R-CNNJunho Cho
 
week12_cv_intro_and_r-cnn.pdf
week12_cv_intro_and_r-cnn.pdfweek12_cv_intro_and_r-cnn.pdf
week12_cv_intro_and_r-cnn.pdfYuChianWu
 
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Sergey Karayev
 
Computer Vision Landscape : Present and Future
Computer Vision Landscape : Present and FutureComputer Vision Landscape : Present and Future
Computer Vision Landscape : Present and FutureSanghamitra Deb
 
Review: You Only Look One-level Feature
Review: You Only Look One-level FeatureReview: You Only Look One-level Feature
Review: You Only Look One-level FeatureDongmin Choi
 
Image Segmentation Using Deep Learning : A survey
Image Segmentation Using Deep Learning : A surveyImage Segmentation Using Deep Learning : A survey
Image Segmentation Using Deep Learning : A surveyNUPUR YADAV
 
Recent Progress on Object Detection_20170331
Recent Progress on Object Detection_20170331Recent Progress on Object Detection_20170331
Recent Progress on Object Detection_20170331Jihong Kang
 
Deep learning based object detection
Deep learning based object detectionDeep learning based object detection
Deep learning based object detectionMonicaDommaraju
 
fusion of Camera and lidar for autonomous driving II
fusion of Camera and lidar for autonomous driving IIfusion of Camera and lidar for autonomous driving II
fusion of Camera and lidar for autonomous driving IIYu Huang
 
You only look once (YOLO) : unified real time object detection
You only look once (YOLO) : unified real time object detectionYou only look once (YOLO) : unified real time object detection
You only look once (YOLO) : unified real time object detectionEntrepreneur / Startup
 
物件偵測與辨識技術
物件偵測與辨識技術物件偵測與辨識技術
物件偵測與辨識技術CHENHuiMei
 
Object Detection Using R-CNN Deep Learning Framework
Object Detection Using R-CNN Deep Learning FrameworkObject Detection Using R-CNN Deep Learning Framework
Object Detection Using R-CNN Deep Learning FrameworkNader Karimi
 
Auro tripathy - Localizing with CNNs
Auro tripathy -  Localizing with CNNsAuro tripathy -  Localizing with CNNs
Auro tripathy - Localizing with CNNsAuro Tripathy
 
#10 pydata warsaw object detection with dn ns
#10   pydata warsaw object detection with dn ns#10   pydata warsaw object detection with dn ns
#10 pydata warsaw object detection with dn nsAndrew Brozek
 

Similar to “Understanding, Selecting and Optimizing Object Detectors for Edge Applications,” a Presentation from Walmart Global Tech (20)

object-detection.pptx
object-detection.pptxobject-detection.pptx
object-detection.pptx
 
Sem 2 Presentation
Sem 2 PresentationSem 2 Presentation
Sem 2 Presentation
 
Week5-Faster R-CNN.pptx
Week5-Faster R-CNN.pptxWeek5-Faster R-CNN.pptx
Week5-Faster R-CNN.pptx
 
MLIP - Chapter 5 - Detection, Segmentation, Captioning
MLIP - Chapter 5 - Detection, Segmentation, CaptioningMLIP - Chapter 5 - Detection, Segmentation, Captioning
MLIP - Chapter 5 - Detection, Segmentation, Captioning
 
150807 Fast R-CNN
150807 Fast R-CNN150807 Fast R-CNN
150807 Fast R-CNN
 
Deep Learning for Computer Vision: Object Detection (UPC 2016)
Deep Learning for Computer Vision: Object Detection (UPC 2016)Deep Learning for Computer Vision: Object Detection (UPC 2016)
Deep Learning for Computer Vision: Object Detection (UPC 2016)
 
week12_cv_intro_and_r-cnn.pdf
week12_cv_intro_and_r-cnn.pdfweek12_cv_intro_and_r-cnn.pdf
week12_cv_intro_and_r-cnn.pdf
 
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
 
Computer Vision Landscape : Present and Future
Computer Vision Landscape : Present and FutureComputer Vision Landscape : Present and Future
Computer Vision Landscape : Present and Future
 
Review: You Only Look One-level Feature
Review: You Only Look One-level FeatureReview: You Only Look One-level Feature
Review: You Only Look One-level Feature
 
Image Segmentation Using Deep Learning : A survey
Image Segmentation Using Deep Learning : A surveyImage Segmentation Using Deep Learning : A survey
Image Segmentation Using Deep Learning : A survey
 
Recent Progress on Object Detection_20170331
Recent Progress on Object Detection_20170331Recent Progress on Object Detection_20170331
Recent Progress on Object Detection_20170331
 
D3L4-objects.pdf
D3L4-objects.pdfD3L4-objects.pdf
D3L4-objects.pdf
 
Deep learning based object detection
Deep learning based object detectionDeep learning based object detection
Deep learning based object detection
 
fusion of Camera and lidar for autonomous driving II
fusion of Camera and lidar for autonomous driving IIfusion of Camera and lidar for autonomous driving II
fusion of Camera and lidar for autonomous driving II
 
You only look once (YOLO) : unified real time object detection
You only look once (YOLO) : unified real time object detectionYou only look once (YOLO) : unified real time object detection
You only look once (YOLO) : unified real time object detection
 
物件偵測與辨識技術
物件偵測與辨識技術物件偵測與辨識技術
物件偵測與辨識技術
 
Object Detection Using R-CNN Deep Learning Framework
Object Detection Using R-CNN Deep Learning FrameworkObject Detection Using R-CNN Deep Learning Framework
Object Detection Using R-CNN Deep Learning Framework
 
Auro tripathy - Localizing with CNNs
Auro tripathy -  Localizing with CNNsAuro tripathy -  Localizing with CNNs
Auro tripathy - Localizing with CNNs
 
#10 pydata warsaw object detection with dn ns
#10   pydata warsaw object detection with dn ns#10   pydata warsaw object detection with dn ns
#10 pydata warsaw object detection with dn ns
 

More from Edge AI and Vision Alliance

“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...Edge AI and Vision Alliance
 
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...Edge AI and Vision Alliance
 
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...Edge AI and Vision Alliance
 
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...Edge AI and Vision Alliance
 
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...Edge AI and Vision Alliance
 
“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...Edge AI and Vision Alliance
 
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsightsEdge AI and Vision Alliance
 
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...Edge AI and Vision Alliance
 
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...Edge AI and Vision Alliance
 
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...Edge AI and Vision Alliance
 
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...Edge AI and Vision Alliance
 
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...Edge AI and Vision Alliance
 
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...Edge AI and Vision Alliance
 
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...Edge AI and Vision Alliance
 
“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from SamsaraEdge AI and Vision Alliance
 
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...Edge AI and Vision Alliance
 
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...Edge AI and Vision Alliance
 
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...Edge AI and Vision Alliance
 
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...Edge AI and Vision Alliance
 
“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...
“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...
“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...Edge AI and Vision Alliance
 

More from Edge AI and Vision Alliance (20)

“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
 
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
 
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
 
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
 
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
 
“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...
 
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
 
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
 
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
 
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
 
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
 
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
 
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
 
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
 
“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara
 
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
 
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
 
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
 
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
 
“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...
“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...
“Tracking and Fusing Diverse Risk Factors to Drive a SAFER Future,” a Present...
 

Recently uploaded

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Recently uploaded (20)

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

“Understanding, Selecting and Optimizing Object Detectors for Edge Applications,” a Presentation from Walmart Global Tech

  • 1. Understanding, Selecting, and Optimizing Object Detectors for Edge Applications Md Nasir Uddin Laskar Staff Machine Learning Engineer Walmart Global Tech
  • 2. Highlights 2 Introduction Evolution of OD models Two-stage models: R-CNNs One-stage models: YOLO, SSD Transformer- based: DETR OD for edge devices OD future directions
  • 4. Input: A single image (typically RGB) Output: A set of detected objects as class label and bounding box Objects: From a set of classes. Person, things, even Texts Object Detection: Task 4
  • 5. Object Detection: Applications 5 [ researchleap.com/ | psimagazine.co.uk/ ] [ Sang-gil Lee et al, MICCAI 2018 ] [ learn.arcgis.com/ | vectorstock.com/ ]
  • 6. • Multiple Outputs • Image can have variable number of objects from various classes • Can also have high overlap between objects in the image • Multiple Types of Outputs • Need to output what (class label) and where (bounding box) • High Resolution Images • Classification works at 224x224. Higher resolution is needed for detection. Object Detection: Challenges 6 [ image credit Bochkovskiy A. ]
  • 7. Object Detection: Evolution of Models 7 . . . Traditional One-stage Two-stage • Viola Jones • HOG • DPM Deep Learning Transformers • Fast R-CNN • Faster R-CNN • FPN • R-CNN • YOLO • SSD • RetinaNet • YOLOv2 • CornerNet • CenterNet • EfficientDet • YOLOv5 • DeTR, Swin
  • 8. Object Detection: Simple Approach 8 CNN Model Classification head: What Detection head: Where Correct label: Bird Bbox: (x’, y’, w’, h’) 4 0 9 6 Class Scores Bird: 0.90 Cat: 0.05 Dog: 0.01 … Bounding Box (x, y, w, h) Softmax Loss L2 Loss Weighted Sum Multitask Loss [ Bird picture: https://pixabay.com/ ] • Question: What is the problem with this setup? It cannot detect if the image has multiple objects.
  • 9. • Use selective search to identify a manageable number of object region candidates (region of interest or RoI). • Extracts CNN features from each region independently for classification. R-CNN Class of Models 9 [ Girshick et al, CVPR 2014 ]
  • 10. 1. Propose category-independent RoIs by selective search 2. Warp region candidates to a fixed size as required by CNN, e.g. 224x224 3. Generate potential bounding boxes, and then run a classifier on these proposed boxes, e.g. SVM 4. Refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene R-CNN Steps in Detail 10 [ Girshick et al, CVPR 2014 ] Summary Input image Conv Net Conv Net Conv Net Class Class Class Warped image regions (224x224) Regions of Interest (RoI) from a proposal method (~2k) Forward each region through ConvNet Bbox Bbox Bbox F d c im “Slow” R-CNN: Run CNN independently for each region “B ne Ale Re Re Int fro me
  • 11. R-CNN: Impacts / Limitations 11 Cannot be trained end- to-end Requires 100s of GB of storage space Not suitable to run real-time applications Selective search is not optimized for object detection Pioneered the CNN for object detection Sets the stage to evolve the field • 30K citations • 4K papers with title "R-CNN" 1 [ 1Google Scholar advanced search. allintitle:"R-CNN" ]
  • 12. • Run a single CNN on the entire image. Get RoIs from the image features instead of the image itself. • Share computations across all ROIs rather than doing calculations for each proposal independently. • Does not need to cache extracted features in the disk. The architecture is trained end-to-end with a multi-task loss. Fast R-CNN 12 [ Paper: Girshick , ICCV 2015 ] [ Image: https://www.mathworks.com/ ] Summary Input image Conv Net Conv Net Conv Net Class Class Class Warped image regions (224x224) Regions of Interest (RoI) from a proposal method (~2k) Forward each region through ConvNet Bbox Bbox Bbox Fast R-CNN: Apply differentiable cropping to shared image features “Slow” R-CNN: Run CNN independently for each region F C w ConvNet Input image Run whole image through ConvNet Image features Crop + Resize features Per-Region Network “Backbone” network: AlexNet, VGG, ResNet, etc Regions of Interest (RoIs) from a proposal method CNN CNN CNN Bbox Class Bbox Class Bbox Class Category and box transform per region
  • 13. • Nearly cost-free region proposals using Region Proposal Network (RPN), that shares convolutional features with the detection network. • The convolutional computations are shared across the RPN and the Fast R-CNN, effectively reducing the computation time. Faster R-CNN 13 • Introduced multi-scale anchor boxes to detect objects of various sizes. [ Ren et al, NeurIPS 2015 ]
  • 14. Slow, Fast, and Faster R-CNN 14 Summary Input image Conv Net Conv Net Conv Net Class Class Class Warped image regions (224x224) Regions of Interest (RoI) from a proposal method (~2k) Forward each region through ConvNet Bbox Bbox Bbox Fast R-CNN: Apply differentiable cropping to shared image features “Slow” R-CNN: Run CNN independently for each region Faster R-CNN: Compute proposals with CNN ConvNet Input image Run whole image through ConvNet Image features Crop + Resize features Per-Region Network “Backbone” network: AlexNet, VGG, ResNet, etc Regions of Interest (RoIs) from a proposal method CNN CNN CNN Bbox Class Bbox Class Bbox Class Category and box transform per region Si Fu de Run CNN independently for each region Summary Input image Conv Net Conv Net Conv Net Class Class Class Warped image regions (224x224) Regions of Interest (RoI) from a proposal method (~2k) Forward each region through ConvNet Bbox Bbox Bbox Fast R-CNN: Apply differentiable cropping to shared image features “Slow” R-CNN: Run CNN independently for each region Faster R-CNN: Compute proposals with CNN ConvNet Input image Run whole image through ConvNet Image features Crop + Resize features Per-Region Network “Backbone” network: AlexNet, VGG, ResNet, etc Regions of Interest (RoIs) from a proposal method CNN CNN CNN Bbox Class Bbox Class Bbox Class Category and box transform per region Summary Input image Conv Net Conv Net Conv Net Class Class Class Warped image regions (224x224) Regions of Interest (RoI) from a proposal method (~2k) Forward each region through ConvNet Bbox Bbox Bbox Fast R-CNN: Apply differentiable cropping to shared image features “Slow” R-CNN: Run CNN independently for each region Faster R-CNN: Compute proposa with CNN ConvNet Input image Run whole image through ConvNet Image features Crop + Resize features Per-Region Network “Backbone” network: AlexNet, VGG, ResNet, etc Regions of Interest (RoIs) from a proposal method CNN CNN CNN Bbox Class Bbox Class Bbox Class Category and box transform per region [ image credit: Justin Johnson, University of Michigan] Differentiable cropping to shared image features Compute region proposals with CNNs
  • 15. • Use pyramidal feature hierarchy for efficient detection of objects of various sizes. • Model Architecture: Backbone model (VGG) and SSD head. SSD head outputs the bounding box and object classes. • Large fine-grained feature maps (lower level) at are good at capturing small objects and small coarse-grained feature maps detect large objects well (higher level). Single Shot Detector: SSD 15 [ Wei Liu et al, ECCV 2016 ] Conv Net Class Warped image regions (224x224) Regions of Interest (RoI) from a proposal method (~2k) Forward each region through ConvNet Bbox Fast R-CNN: Apply differentiable cropping to shared image features -CNN: Run pendently egion Faster R-CNN: Compute proposals with CNN ConvNet Input image Run whole image through ConvNet Image features Crop + Resize features Per-Region Network “Backbone” network: AlexNet, VGG, ResNet, etc Regions of Interest (RoIs) from a proposal method CNN CNN CNN Bbox Class Bbox Class Bbox Class Category and box transform per region Single Stage
  • 16. • Eliminate RPN. Use grid cells technique to detect object of various sizes. • Predicts offset of predefined anchor (default) boxes for every location of the feature map. • The anchor boxes on different levels are rescaled so that one feature map is only responsible for objects at one particular scale. SSD: Steps 16 [ Wei Liu et al, ECCV 2016 ] • Cat (Small Object) is captured by the 8x8 feature map (lower level). • Dog (Large Object) can only be detected in the 4x4 feature map (higher level) Fine-grained Coarse-grained
  • 17. • One of the first attempts to build a fast, real-time object detector. • YOLO Frames the object detection as a single regression problem, straight from image pixels to bounding box and class probabilities. Hence, YOLO, You Only Look Once. • The final prediction of shape S × S × (5B + C) is produced by two fully connected layers over the whole conv feature map. YOLO Class of Models 17 [ Paper: Redmond et al, CVPR 2016. ] [ image: https://lilianweng.github.io/ ]
  • 18. YOLO: Steps and Limitations 18 • Split the image into SxS cells. Each cell predicts • The location of bounding boxes as (x, y, w, h), a confidence score, and a probability of object class • Final prediction is S × S × (5B + C). For PASCAL VOC S=7, B=2, C=20. That is why the final map is 7x7x30 [ Redmond et al, CVPR 2016. ] • Cannot detect group of small objects. Maximum B (here, 2) objects per cell • Irregular shaped objects
  • 19. YOLOv2 and Beyond 19 • Light-weight base model, DarkNet-19 • BatchNorm on conv layers • Conv layers to predict anchor boxes • Direct location prediction YOLOv2 • Latest in the series YOLOv8 • Logistic regression for confidence scores • Multiple independent classifiers instead of one softmax • Skip-layer concatenation YOLOv3 [ Redmond et al, CVPR 2017 ] . . .
  • 20. • DETR frames the object detection task as an image-to-set problem. Given an image, the model predicts an unordered set of all the objects present. • Existing methods have number of components that make them complicated. Transformer-based Detectors: DETR 20 [ Carion N, Massa F et al ] RPN
  • 21. • Directly predicts the final set of detections in parallel • During training, bipartite matching uniquely assigns predictions with ground truth boxes. • Predictions with no match yield a “no object” class prediction. Transformer-based Detectors: DETR 21 [ Carion N, Massa F et al ] • Slow convergence, 5x slower than Faster R-CNN • Poor detection on small objects
  • 22. • Task of detecting objects from a video, such as in autonomous driving scenario • Challenges • Appearance deterioration • Changes of video frames, e.g., motion blur, part occlusion, camera re-focous, rare poses etc. • Aggregate temporal cues from different frames. Two-step baseline models (Faster R-CNN, R-FCN) • Box-level. Post-processing of temporal information. • Feature-level. Improve features of the current frame by aggregating that of adjacent frames. Object Detection in Video 22 • Recent. Use one-step models such as YOLO / DETR to build end-to-end detectors.
  • 23. • Precision measures how accurate are the predictions of the detector, aka, percentage of correct predictions. • Recall measures how good the object detector can detect all the positives. • IoU measures the overlap between GT and predicted boundaries. Evaluation Metrics 23 intersection IoU = poor good poor [ Bird picture: https://pixabay.com/ ] • Average Precision (AP) computes the mean precision value for recall value over 0 to 1. union
  • 24. 1. Run the detector for all test images 2. For each category: for each detection 1. Compute the AP, which is area under PR curve 2. Plot a point on PR curve if IoU > 0.5 3. mAP = average of AP for each category 4. COCO mAP : average AP for IoU from 0.5 to 0.95 with a step size of 0.05. Mean Average Precision (mAP) 24 • Speed of the detection is usually quantified with FPS 0.99 0.95 0.90 0.50 0.10 All Bird detections sorted by scores All GT Bird boxes IoU > 0.5
  • 26. Object Detection at the Edge: Considerations / Tradeoffs 26 • CPU / GPU / NPU • Real-time applications • High resolution images Compute / Speed • Some edge devices do not support NMS Post-Process • Model size / #Params • RAM / Flash • Imbalanced memory distribution in first conv layers Memory Considerations and Challenges Tradeoffs • Single-stage models have lower mAP Accuracy • Higher precision models usually have lower FPS FPS
  • 27. • Design New Model • Design new model architecture that runs on your target device and train it [Not Recommended] • Smaller version of an existing model and train it, such as FOMO, MCUNetV2 • Transfer Learning • Fine-tune an existing model on your custom data. For example, TF Detection Model Zoo. • Pick a model that works best for your use-case and target hardware. • Pre-training Optimizations • Quantization-aware training of existing models • Post-training Optimizations • Model pruning / quantization • Hardware specific optimizations: TFLite / TensorRT / ONNX / similar Object Detection at the Edge: Develop and Optimize 27
  • 28. Object Detection at the Edge: Example 28 • MobileNetV2 base model • Patch-by-patch inference to solve imbalanced memory distribution • Receptive Field redistribution to reduce computation overhead MCUNetV2 [ MCUNetV2: Lin et all, NeurIPS 2021 ] On Pascal VOC • 68.3% (+16.9) with 438kB SRAM • 64.6% (+13.2) with 247kB SRAM • Only 7 FPS • Not tested on high resolution images
  • 29. Object Detection: What is Next? 29 • Accuracy of two-stage • Speed of one-stage Fastest R-CNN • Particularly critical for autonomous driving 3D Obj Detection • Training at the edge devices • Adapt to data drifts On-device Training • Efficient detection in video • Has so many real-world applications Detection in Video • More algorithms/ models • Compatibility towards edge devices Transformers
  • 30. • Object detection applications and challenges • Evolution of object detection systems • Some of the popular object detection models • Considerations and tradeoffs of object detection for edge applications • Optimizing object detection systems for edge devices Conclusion 30
  • 32. • Off the shelf object detection models: • TensorFlow OD model Zoo • TensorFlow Mobile Optimized Detectors • Detectron 2: object detection using PyTorch and model zoo • Object detection training datasets • Pascal VOC dataset • MS COCO Dataset • Object detection training frameworks • TensorFlow Lite , Example object detection for mobile devices • PyTorch example object detection using pre-trained models • Get hands-on • Train YOLOv4 using Google Colab Resources 32