YOLO releases
Gianmaria Perillo
Data Scientist, Sferanet
perillo@sferaspa.com
What objects are where?
Object Detection Problem
• Image classification is the task of taking an input image and
outputting a class (a cat, a dog, etc.) or a probability for each of the
classes that best describes the image. For humans, recognition is
one of the first skills we learn.
• Object localization is the task of predicting the object in an image as
well as its boundaries. The aim is to locate the object in the image.
Object Detection Problem
Object detection tries to find all the objects in an image and their boundaries.
Classification: CAT
Classification + Localization: CAT
Object Detection: CAT, DOG, DUCK
Object Detection Milestones
Traditional Detection Methods
• Feature extraction: Haar, HOG, SIFT …
• Feature selection: PCA, ICA …
• Feature Matching
• Classification: SVM, Logistic Regression, Nearest Neighbor …
Deep Learning Object Detection Methods
A naive approach to the object detection problem would be to take
different regions of interest from the image and use a CNN to classify
the presence of the object within each region.
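The editor's notes call this naive approach a sliding window. A minimal sketch of how the candidate regions could be generated, assuming square windows of a single size and a fixed stride (both parameters are illustrative choices, not values from the slides):

```python
from typing import Iterator, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def sliding_windows(img_w: int, img_h: int, win: int, stride: int) -> Iterator[Box]:
    """Yield square candidate regions, left to right and top to bottom."""
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            yield (x, y, x + win, y + win)


# Each window would then be cropped and passed to a CNN classifier (not shown).
windows = list(sliding_windows(img_w=8, img_h=6, win=4, stride=2))
```

Even on this tiny 8 × 6 example a single window size produces 6 regions. Real images need many sizes and positions, which is why the approach does not scale.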
Deep Learning Object Detection Methods:
Two-stage detector
The detection happens in two stages:
1. First, the model proposes a set of regions of interest using selective
search or a region proposal network.
2. Then a classifier processes only the region candidates.
Region-CNN (R-CNN)
R-CNN uses selective search to extract about 2000 region proposals from the image.
Fast R-CNN
The regions are extracted not from the image but from the feature map
generated by a CNN.
Faster R-CNN
Selective search is a slow and
time-consuming process.
Faster R-CNN uses a separate network to
generate the region proposals.
Training and testing are faster than
in R-CNN and Fast R-CNN.
Deep Learning Object Detection Methods:
One-stage detector
In a one-stage detector there is no intermediate task (region
proposals).
A backbone network, usually pre-trained as an image classifier, is used
to extract features from the image.
A grid is used to predict a fixed number of bounding boxes.
You Only Look Once (YOLO)
The basic idea is to divide the image into a grid with a fixed number of cells.
There are three versions of YOLO:
• YOLO v1: Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi, 2015.
• YOLO v2, YOLO9000: Joseph Redmon and Ali Farhadi, 2016.
• YOLO v3: Joseph Redmon and Ali Farhadi, 2018.
YOLO v1
• Divide the input image into an S × S grid.
• Each grid cell predicts B bounding boxes.
• Each bounding box predicts 5 values:
  • Confidence = $\Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}$
  • $(x, y, w, h)$: $(x, y)$ is the bounding box center, $w$ the width, $h$ the height
• Each grid cell also predicts C class probabilities.
• Prediction = S × S × (B ∗ 5 + C)
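A minimal sketch of the two quantities on this slide: the IOU that appears in the confidence score and the size of the prediction tensor. The corner-based box format and the function names are assumptions made for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def prediction_shape(s: int, b: int, c: int) -> Tuple[int, int, int]:
    """Each cell holds B boxes of 5 values (x, y, w, h, confidence) plus C class scores."""
    return (s, s, b * 5 + c)
```

For example, with S = 7, B = 2 and C = 20 (the setting used in the YOLO v1 paper for PASCAL VOC), `prediction_shape(7, 2, 20)` gives a 7 × 7 × 30 tensor.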
YOLO v1: Network Architecture
YOLO v1 : Cost Function
Classification Loss
Localization Loss
Confidence Loss
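A sketch of the full cost function, following the notation of the YOLO v1 paper. The symbols $\lambda_{noobj}$, $\hat{C}_i$ and $\hat{p}_i(c)$ are not on the slide and are taken from that paper; hatted quantities are the ground-truth targets. All three terms are sum-squared errors.

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  & + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
           + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right]
      % localization loss (two lines above)
      \\
  & + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2
    + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}
      \left( C_i - \hat{C}_i \right)^2
      % confidence loss
      \\
  & + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}}
      \left( p_i(c) - \hat{p}_i(c) \right)^2
      % classification loss
\end{aligned}
```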
YOLO v1 : Pros & Cons
PROS
• Fast.
• Predictions are made from one
single network.
• Can be trained end-to-end to
improve accuracy.
CONS
• Strong spatial constraints on bounding
box predictions.
• Struggles with small objects that appear in
groups.
• Struggles to generalize to objects in new or
unusual aspect ratios or
configurations.
YOLO v2
• Batch Normalization
• Anchor-Box
• Dimension Clusters
• Direct location prediction
• Fine-Grained Features
• Darknet-19
• Hierarchical classification
YOLO v2: Anchor Box and Dimension Cluster
YOLO v1 predicts bounding box coordinates directly, using fully connected
layers on top of the convolutional feature extractor. Faster R-CNN
uses a separate network to predict offsets and confidences for anchor
boxes.
YOLO v2 uses anchor boxes. Instead of hand-picked priors, k-means is run
on the training set bounding boxes to find better priors.
The distance measure is independent of the size of the box:
$d(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid})$
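A minimal sketch of dimension clustering with the 1 − IOU distance. It assumes boxes are reduced to (width, height) pairs compared as if they shared the same center, that each centroid is updated to the mean shape of its cluster, and that the initial centroids are supplied by the caller. The slides do not specify any of these details.

```python
from typing import List, Tuple

WH = Tuple[float, float]  # (width, height)


def iou_wh(a: WH, b: WH) -> float:
    """IOU of two boxes sharing the same center, so only width and height matter."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_priors(boxes: List[WH], centroids: List[WH], iters: int = 20) -> List[WH]:
    """Cluster box shapes using d(box, centroid) = 1 - IOU(box, centroid)."""
    for _ in range(iters):
        clusters: List[List[WH]] = [[] for _ in centroids]
        for box in boxes:
            # Assign the box to the centroid with the smallest 1 - IOU distance.
            nearest = min(range(len(centroids)),
                          key=lambda k: 1.0 - iou_wh(box, centroids[k]))
            clusters[nearest].append(box)
        # Move each centroid to the mean shape of its cluster (unchanged if empty).
        centroids = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids


boxes = [(10.0, 12.0), (12.0, 10.0), (100.0, 40.0), (90.0, 50.0)]
priors = kmeans_priors(boxes, centroids=[boxes[0], boxes[2]])
```

With a Euclidean distance, large boxes would contribute larger errors than small ones. The IOU-based distance avoids this because it depends only on relative overlap.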
YOLO v2: Direct Location Prediction
The network predicts 5 bounding boxes at each cell in the output
feature map. The network predicts 5 coordinates for each bounding
box: $t_x, t_y, t_w, t_h, t_o$.
$b_x = \sigma(t_x) + c_x$
$b_y = \sigma(t_y) + c_y$
$b_w = p_w e^{t_w}$
$b_h = p_h e^{t_h}$
Here $(c_x, c_y)$ is the offset of the cell from the top-left corner of the image, and $p_w$, $p_h$ are the
width and height of the bounding box prior.
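A minimal sketch of how the four equations above decode a raw prediction into a box. It assumes all quantities are expressed in grid-cell units; the function name `decode_box` is illustrative.

```python
import math
from typing import Tuple


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x: float, t_y: float, t_w: float, t_h: float,
               c_x: float, c_y: float, p_w: float, p_h: float
               ) -> Tuple[float, float, float, float]:
    """Map raw outputs (t_x, t_y, t_w, t_h) to a box (b_x, b_y, b_w, b_h)."""
    b_x = sigmoid(t_x) + c_x   # center stays inside the cell: c_x < b_x < c_x + 1
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)  # prior width scaled by a positive factor
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


# All-zero outputs give the cell center and the unmodified prior size.
box = decode_box(0.0, 0.0, 0.0, 0.0, c_x=3, c_y=5, p_w=2.0, p_h=4.0)
```

Because the sigmoid keeps the center offset between 0 and 1, a prediction can never move its box out of the cell that produced it.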
YOLO v2: Darknet-19
Backbone network with 19
convolutional layers.
1x1 filters compress the
feature map.
Batch normalization stabilizes
training and reduces overfitting.
A passthrough layer is added so the
model can use fine-grained features
from earlier layers.
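A minimal sketch of the idea behind the passthrough layer: neighbouring positions of a higher-resolution feature map are stacked into channels so that it can be concatenated with a lower-resolution map. The nested-list layout and the channel ordering are assumptions made for illustration and may differ from the Darknet implementation.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [row][col][channel]


def space_to_depth(fmap: FeatureMap, block: int = 2) -> FeatureMap:
    """Stack each block x block spatial neighbourhood into the channel dimension."""
    h, w = len(fmap), len(fmap[0])
    out: FeatureMap = []
    for i in range(0, h, block):
        row = []
        for j in range(0, w, block):
            cell: List[float] = []
            for di in range(block):
                for dj in range(block):
                    cell.extend(fmap[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out


# A 4 x 4 map with 1 channel becomes a 2 x 2 map with 4 channels.
fmap = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
stacked = space_to_depth(fmap)
```

In the YOLO v2 paper the same operation turns a 26 × 26 × 512 feature map into 13 × 13 × 2048 before concatenation.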
YOLO 9000
YOLO v2 is trained separately for classification and detection.
YOLO9000 proposes a method to jointly train the network on both
tasks.
A new hierarchical dataset is created from COCO and ImageNet, based
on the concepts of synonyms and hyponyms.
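A minimal sketch of hierarchical classification. It assumes, as in the YOLO9000 paper, that the network predicts a conditional probability for each node given its parent, and that the absolute probability of a label is the product along the path to the root. The tiny tree and the probability values are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical mini-hierarchy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "physical object": None,
    "animal": "physical object",
    "dog": "animal",
    "terrier": "dog",
}

# Hypothetical conditional probabilities Pr(node | parent).
COND_PROB: Dict[str, float] = {
    "physical object": 1.0,
    "animal": 0.5,
    "dog": 0.5,
    "terrier": 0.25,
}


def absolute_prob(node: str) -> float:
    """Multiply conditional probabilities along the path from the node up to the root."""
    prob = 1.0
    current: Optional[str] = node
    while current is not None:
        prob *= COND_PROB[current]
        current = PARENT[current]
    return prob
```

A detection image labelled only "dog" can then supervise the "dog" node and its ancestors without forcing a choice between breeds.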
YOLO v2 : Pros & Cons
PROS
• Faster and more accurate.
• Can detect small objects.
• Joint detection and classification
training.
• Hierarchical classification.
CONS
• Requires pre-processing to compute the priors.
• Thresholds are set experimentally.
YOLO v3
• Darknet-53
• Multi-scale features
• Residual blocks
• Logistic classifier
• Multi-label classification
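A minimal sketch of multi-label classification with independent logistic classifiers. Each class gets its own sigmoid score instead of competing in a softmax, so overlapping labels can be predicted together. The class names, logits and the 0.5 threshold are hypothetical.

```python
import math
from typing import Dict, List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_predict(logits: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Score every class independently and keep each class above the threshold."""
    return [name for name, z in logits.items() if sigmoid(z) > threshold]


labels = multilabel_predict({"person": 2.0, "woman": 1.0, "car": -3.0})
```

Here both "person" and "woman" are kept, which a softmax over mutually exclusive classes could not express.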
YOLO v3
PROS
• More accurate.
• Multi-scale features.
• Multi-label approach.
ATTEMPTS
• Dual IOU thresholds.
• Focal loss (RetinaNet).
• Linear activation.
Conclusion

Editor's Notes

  • #8 sliding window
  • #10 Selective search: segmentation and merging. The CNN acts as a feature extractor and produces a 4096-dimensional feature vector, which is classified by an SVM. The algorithm also predicts four offset values to increase the precision of the bounding box. Problems: 2000 regions is a huge number, so it is not real-time.
  • #11 A pooling layer resizes each box to a fixed size before the fully connected layers. Problem: the choice of regions is still a bottleneck.
  • #13 Remove last layers and output feature map
  • #15 Girshick -> R-CNN
  • #16 24 convolutional layers: 20 pre-trained + 4 added. Input resolution is doubled.
  • #17 Classification loss + localization loss + confidence loss, all computed as sum-squared error. $\mathbb{1}_{i}^{obj}$ = 1 if an object appears in cell i, 0 otherwise. $\mathbb{1}_{ij}^{obj}$ = 1 if bounding box j in cell i is responsible for detecting the object, 0 otherwise. $\lambda_{coord}$ increases the weight of the loss on the bounding box coordinates. $\mathbb{1}_{ij}^{noobj}$ is the complement of $\mathbb{1}_{ij}^{obj}$ (1 if there is no object, 0 otherwise) and limits the error on the background.
  • #22 predict location coordinates relative to the location of the grid cell This bounds the ground truth to fall between 0 and 1. We use a logistic activation to constrain the network’s predictions to fall in this range
  • #31 Dual IOU thresholds as in Faster R-CNN: IOU > 0.7 is a positive, [0.3, 0.7] is ignored, < 0.3 is a negative. Focal loss comes from RetinaNet.
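
A minimal sketch of the dual IOU threshold rule from note #31, using the 0.7 and 0.3 values given there. The function name and the string labels are illustrative.

```python
def label_prediction(best_iou: float, pos: float = 0.7, neg: float = 0.3) -> str:
    """Label a prediction by its best overlap with any ground-truth box."""
    if best_iou > pos:
        return "positive"
    if best_iou < neg:
        return "negative"
    # Overlaps in [neg, pos] contribute nothing to the loss.
    return "ignored"
```

For example, `label_prediction(0.5)` falls in the ignored band, while `label_prediction(0.1)` is treated as background.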