YOLO releases are one-stage object detection models that predict bounding boxes and class probabilities in an image using a single neural network. YOLO v1 divides the image into a grid and predicts bounding boxes and confidence scores for each grid cell. YOLO v2 improves on v1 with anchor boxes, batch normalization, and a Darknet-19 backbone network. YOLO v3 uses a Darknet-53 backbone, multi-scale feature maps, and a logistic classifier to achieve better accuracy. The YOLO models aim to perform real-time object detection with high accuracy while remaining fast and unified end-to-end models.
3. Object Detection Problem
• Image classification is the task of taking an input image and
outputting a class (a cat, a dog, etc.) or the probabilities of the classes
that best describe the image. For humans, this task of recognition is
one of the first skills we learn.
• Object localization is the task of predicting the object in an image as
well as its boundaries. The aim is to locate the object in the image.
4. Object Detection Problem
Object detection tries to find out all the objects and their boundaries.
[Figure: three panels. Classification: CAT. Classification + Localization: CAT. Object Detection: CAT, DOG, DUCK.]
7. Deep Learning Object Detection Methods
A naive approach to the object detection problem would be to take
different regions of interest from the image and use a CNN to classify
the presence of the object within each region.
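To make the cost of the naive approach concrete, the sketch below enumerates candidate regions with a sliding window. The image size, window shapes, and stride are hypothetical values chosen only for illustration; each yielded region would need its own CNN forward pass.

```python
def sliding_windows(img_w, img_h, win_sizes, stride):
    """Yield (x, y, w, h) candidate regions that fit inside the image."""
    for w, h in win_sizes:
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield (x, y, w, h)


# Hypothetical settings: a 448x448 image, three window shapes, stride 32.
windows = list(
    sliding_windows(448, 448, [(64, 64), (128, 128), (256, 128)], stride=32)
)
num_windows = len(windows)
print(num_windows, "regions to classify for a single image")
```

Even this coarse setup produces hundreds of regions per image, and finer strides or more window shapes increase the count quickly.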
8. Deep Learning Object Detection Methods:
Two-stage detector
The detection happens in two stages:
1. First, the model proposes a set of regions of interest, using
selective search or a region proposal network.
2. Then a classifier processes only the region candidates.
10. Fast R-CNN
The regions are extracted not from the image, but from the feature map
generated by a CNN.
11. Faster R-CNN
Selective search is a slow and
time-consuming process.
Faster R-CNN uses a separate neural
network to generate proposals.
Training and testing are faster than
with R-CNN and Fast R-CNN.
12. Deep Learning Object Detection Methods:
One-stage detector
In a one-stage detector there is no intermediate task (region
proposals).
A backbone network is used to extract features from the image; it is
usually pre-trained as an image classifier.
A grid is used to predict a fixed number of bounding boxes.
13. You Only Look Once (YOLO)
The basic idea is to divide the image into a grid with a fixed number of cells.
There are three versions of YOLO:
• YOLO v1: Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi, 2015.
• YOLO v2, YOLO9000: Joseph Redmon and Ali Farhadi, 2016.
• YOLO v3: Joseph Redmon and Ali Farhadi, 2018.
14. YOLO v1
• Divide the input image into an S × S grid.
• Each grid cell predicts B bounding boxes.
• Each bounding box has:
• Confidence = $\Pr(\text{Object}) \cdot \mathrm{IOU}^{\text{truth}}_{\text{pred}}$.
• $x, y, w, h$: $(x, y)$ is the bounding box center, $w$ the width, $h$ the height.
• C class probabilities.
• Prediction = S × S × (B ∗ 5 + C)
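As a rough illustration of the quantities above, the sketch below computes the size of the prediction tensor and the IOU used in the confidence target. The helper names `iou` and `prediction_shape` are made up for this example; the values S = 7, B = 2, C = 20 are the ones reported in the YOLO v1 paper for PASCAL VOC.

```python
def iou(box_a, box_b):
    """IOU of two boxes given as (x_center, y_center, width, height)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def prediction_shape(S, B, C):
    """Each cell predicts B boxes (x, y, w, h, confidence) plus C class scores."""
    return (S, S, B * 5 + C)


shape = prediction_shape(S=7, B=2, C=20)

# Confidence target for a box in a cell that contains an object:
# Pr(Object) = 1, so the target equals the IOU with the ground truth.
pred, truth = (1.0, 1.0, 2.0, 2.0), (2.0, 1.0, 2.0, 2.0)
confidence_target = 1.0 * iou(pred, truth)
```

For a cell with no object, Pr(Object) = 0 and the confidence target is zero regardless of the predicted box.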
16. YOLO v1 : Cost Function
Classification Loss
Localization Loss
Confidence Loss
17. YOLO v1 : Pros & Cons
PROS
• Fast.
• Predictions are made from one single network.
• Can be trained end-to-end to improve accuracy.
CONS
• Spatial constraints on bounding box predictions.
• Struggles with small objects that appear in groups.
• Struggles to generalize to objects in new or unusual aspect ratios or
configurations.
19. YOLO v2: Anchor Box and Dimension Cluster
YOLO v1 predicts bounding box coordinates directly, using fully connected
layers on top of the convolutional feature extractor. Faster R-CNN
uses a region proposal network to predict offsets and confidences for
anchor boxes.
YOLO v2 uses anchor boxes. Instead of hand-picked priors, k-means clustering
is run on the training set bounding boxes to find better priors.
Distance measure independent of the size of the box:
$d(\text{box}, \text{centroid}) = 1 - \mathrm{IOU}(\text{box}, \text{centroid})$
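The clustering step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes boxes are given as (width, height) pairs compared as if they shared the same center, and the function names, random initialization, and toy data are all hypothetical.

```python
import numpy as np


def iou_wh(boxes, centroids):
    """Pairwise IOU between (w, h) pairs, assuming a shared box center."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0])
             * np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centroids[:, 0] * centroids[:, 1]
    return inter / (area_b[:, None] + area_c[None, :] - inter)


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """k-means on box dimensions with distance d = 1 - IOU(box, centroid)."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        dist = 1.0 - iou_wh(boxes, centroids)
        assign = dist.argmin(axis=1)
        # Recompute each centroid as the mean of its boxes; keep empty clusters.
        new = np.array([
            boxes[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids


# Toy data: three small boxes and three large, wide boxes.
toy_boxes = np.array(
    [[10, 12], [12, 10], [11, 11], [100, 50], [98, 52], [102, 48]], dtype=float
)
anchors = kmeans_anchors(toy_boxes, k=2)
```

With a Euclidean distance, large boxes would contribute larger errors than small ones; the IOU-based distance avoids this, which is the motivation stated above.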
23. YOLO v2: Darknet-19
Backbone network with 19
convolutional layers.
1x1 filters compress the
feature map.
Batch normalization stabilizes
training and helps regularize the model.
A passthrough layer is added so the
model can use fine-grained features
from earlier layers.
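One way to picture the passthrough layer is as a space-to-depth rearrangement followed by a concatenation. The sketch below uses the shapes given in the YOLO9000 paper (a 26 × 26 × 512 map becomes 13 × 13 × 2048). The function name and the channel ordering are assumptions for illustration; Darknet's own layer may order channels differently.

```python
import numpy as np


def passthrough(features, stride=2):
    """Stack each stride x stride spatial block into channels (H, W, C layout)."""
    h, w, c = features.shape
    assert h % stride == 0 and w % stride == 0
    x = features.reshape(h // stride, stride, w // stride, stride, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group the two block axes together
    return x.reshape(h // stride, w // stride, stride * stride * c)


fine = np.zeros((26, 26, 512), dtype=np.float32)     # earlier, higher-resolution map
coarse = np.zeros((13, 13, 1024), dtype=np.float32)  # final low-resolution map
merged = np.concatenate([passthrough(fine), coarse], axis=-1)
```

No values are discarded: the finer map is only reshaped so that it matches the 13 × 13 spatial size before concatenation.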
24. YOLO 9000
YOLO v2 is trained separately for classification and detection.
YOLO9000 proposes a method to jointly train the network on both
tasks.
A new hierarchical dataset is created from COCO and ImageNet, based
on the concepts of synonyms and hyponyms.
Selective search: segmentation and merging
The CNN produces a 4096-dimensional feature vector (it acts as a feature extractor).
An SVM classifies each region.
The algorithm also predicts four offset values to increase the precision of the bounding box.
Problems: 2000 region proposals per image is a huge number; not real-time.
An RoI pooling layer resizes each region to a fixed size before the fully connected layers.
Problem: the choice of regions is still a bottleneck.
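The loss is classification loss + localization loss + confidence loss, each computed as a sum-squared error.
$\mathbb{1}_{i}^{obj}$ = 1 if an object appears in cell i, 0 otherwise.
$\mathbb{1}_{ij}^{obj}$ = 1 if bounding box j in cell i is responsible for detecting the object, 0 otherwise.
$\lambda_{coord}$ increases the weight of the loss on the bounding box coordinates.
$\mathbb{1}_{ij}^{noobj}$ is the complement of $\mathbb{1}_{ij}^{obj}$ (1 if the box is responsible for no object, 0 otherwise); it is used to limit the error from the background.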
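For reference, the three parts of the loss described in these notes can be written out as in the YOLO v1 paper. The first two lines are the localization loss, the next two the confidence loss, and the last the classification loss. Hatted symbols are ground-truth targets, $C_i$ is the box confidence, $p_i(c)$ the class probability, and the indicator functions are the ones defined in the notes above; the paper sets $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The square roots on width and height reduce the penalty gap between small and large boxes, so that the same absolute error matters more for a small box.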
YOLO v2 predicts location coordinates relative to the location of the grid cell.
This bounds the ground truth to fall between 0 and 1. We use a logistic activation to constrain the network’s predictions to fall in this range.
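The note above can be illustrated with the box decoding equations from the YOLO9000 paper: $b_x = \sigma(t_x) + c_x$, $b_y = \sigma(t_y) + c_y$, $b_w = p_w e^{t_w}$, $b_h = p_h e^{t_h}$. The sketch below is a plain-Python rendering of those equations; the function names are made up for this example, and all quantities are assumed to be in grid-cell units.

```python
import math


def sigmoid(t):
    """Logistic activation, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Turn raw network outputs into a box.

    (c_x, c_y) is the top-left corner of the grid cell and
    (p_w, p_h) is the anchor (prior) size.
    """
    b_x = sigmoid(t_x) + c_x   # center stays inside the cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)  # anchor width scaled by a positive factor
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


# Zero outputs give a box centered in cell (3, 4) with exactly the anchor size.
box = decode_box(0.0, 0.0, 0.0, 0.0, c_x=3, c_y=4, p_w=2.0, p_h=1.5)
```

Because the sigmoid output lies between 0 and 1, the predicted center cannot leave its grid cell, whatever the raw value of `t_x` or `t_y`.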
Dual IOU thresholds, as in Faster R-CNN: IOU > 0.7 is a positive example, IOU in [0.3, 0.7] is ignored, IOU < 0.3 is a negative example.
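The dual-threshold rule in the note above can be sketched as a small function. The name `assign_label` and the string labels are hypothetical; `max_iou` is assumed to be the highest IOU between a prediction and any ground-truth box.

```python
def assign_label(max_iou, pos_thresh=0.7, neg_thresh=0.3):
    """Label a prediction from its best IOU with the ground truth."""
    if max_iou > pos_thresh:
        return "positive"
    if max_iou < neg_thresh:
        return "negative"
    return "ignored"  # between the two thresholds: no loss contribution


labels = [assign_label(v) for v in (0.8, 0.5, 0.1)]
```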
Focal loss: RetinaNet.