Object detection with Deep Learning
Matthew Opala
AGENDA
Region proposals based models
Regression models
Localization & detection
Localization & detection
Computer Vision tasks
Single object: Classification, Classification & Localization
Multiple objects: Object detection, Instance segmentation
Credit: http://vision.stanford.edu/teaching/cs231n/slides/2016/winter1516_lecture8.pdf
Classification & Localization
Classification:
◦ Input: image
◦ Output: class label
◦ Evaluation: accuracy
Localization:
◦ Input: image
◦ Output: Box(x, y, w, h)
◦ Evaluation: IoU
CAT (x, y, w, h)
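The IoU measure used to evaluate localization can be sketched in a few lines. This is a minimal illustration, assuming boxes are given as (x, y, w, h) with (x, y) the top-left corner; the slides do not say which corner convention is used, and the function name `iou` is only illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x, y, w, h).

    Assumption: (x, y) is the top-left corner of the box.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# Two 2 x 2 boxes shifted by one unit: overlap area 1, union area 7.
print(iou((0, 0, 2, 2), (1, 1, 2, 2)))
```

A prediction is usually counted as correct when its IoU with the ground-truth box exceeds a chosen threshold.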
Object detection
◦ Many objects of different classes in an image
◦ Needs variable-size output
Detection as regression vs. Detection as classification
Regression: ConvNet -> Final conv feature maps -> Classification head + Regression head
Classification: Region proposals -> Crop & warp -> ConvNet -> Final conv feature maps -> Classifier
Object detection models
R-CNN
Region proposals - selective search
Credit: Uijlings et al, “Selective search for Object Recognition”, IJCV 2013
RCNN - model
Input Image -> Selective search -> Regions of Interest (RoI) -> Warped image regions -> ConvNet -> SVM + Bbox regressors
RCNN - training
◦ Train a classification model
◦ Fine-tune it for detection
◦ Extract features
◦ Train a binary SVM for each class
◦ Train a linear regression model for
each class
RCNN - disadvantages
◦ Complex training pipeline
◦ Slow at test time - 50s per image
Fast R-CNN
Input Image -> ConvNet -> feature map
Selective search -> RoI projection onto the feature map -> RoI pooling -> FC layers -> Softmax + Bbox regressors
RoI Pooling
0.81 0.40 0.70 0.32 0.50 0.20 0.40 0.50
0.19 0.20 0.16 0.31 0.30 0.90 0.10 0.32
0.37 0.31 0.29 0.53 0.78 0.81 0.22 0.24
0.80 0.38 0.36 0.89 0.24 0.23 0.95 0.88
0.66 0.71 0.42 0.11 0.11 0.08 0.23 0.50
0.65 0.51 0.32 0.14 0.74 0.70 0.05 0.32
0.90 0.70 0.57 0.53 0.64 0.23 0.02 0.19
0.22 0.45 0.64 0.58 0.52 0.58 0.32 0.89
RoI Pooling, output size 2 x 2, region of interest 7 x 5
Result:
0.80 0.95
0.90 0.74
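The pooling step above can be reproduced with a short sketch. It assumes the 7 x 5 region of interest covers columns 0-6 and rows 3-7 (0-indexed) of the 8 x 8 feature map; the slide text does not state the placement, but this one reproduces the 2 x 2 result shown. The function name `roi_max_pool` and its parameters are illustrative, not taken from any library.

```python
feature_map = [
    [0.81, 0.40, 0.70, 0.32, 0.50, 0.20, 0.40, 0.50],
    [0.19, 0.20, 0.16, 0.31, 0.30, 0.90, 0.10, 0.32],
    [0.37, 0.31, 0.29, 0.53, 0.78, 0.81, 0.22, 0.24],
    [0.80, 0.38, 0.36, 0.89, 0.24, 0.23, 0.95, 0.88],
    [0.66, 0.71, 0.42, 0.11, 0.11, 0.08, 0.23, 0.50],
    [0.65, 0.51, 0.32, 0.14, 0.74, 0.70, 0.05, 0.32],
    [0.90, 0.70, 0.57, 0.53, 0.64, 0.23, 0.02, 0.19],
    [0.22, 0.45, 0.64, 0.58, 0.52, 0.58, 0.32, 0.89],
]


def roi_max_pool(fmap, top, left, height, width, out_h, out_w):
    """Max-pool one region of interest into a fixed out_h x out_w grid."""
    pooled = []
    for i in range(out_h):
        # Bin edges are floored, so bins may differ in size by one cell.
        r0 = top + (i * height) // out_h
        r1 = top + ((i + 1) * height) // out_h
        row = []
        for j in range(out_w):
            c0 = left + (j * width) // out_w
            c1 = left + ((j + 1) * width) // out_w
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Assumed RoI placement: rows 3-7, columns 0-6 (height 5, width 7).
pooled = roi_max_pool(feature_map, top=3, left=0, height=5, width=7, out_h=2, out_w=2)
print(pooled)
```

Whatever the size of the region, the output grid has a fixed shape, which is what lets the FC layers that follow accept arbitrary proposals.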
Fast R-CNN advantages
◦ Much simpler training
◦ Faster - 2s per image
Faster R-CNN
Input Image -> ConvNet -> Feature Map -> Region proposal network -> Region proposals
Feature Map + Region proposals -> RoI pooling -> FC layers -> Softmax + Bbox regressors
Faster R-CNN
◦ Fast enough for many applications:
140 ms per image
YOLO
Even Faster-RCNN is too slow for real-time
Model          Time/img      FPS    Pascal 2007 mAP
RCNN           20 s/img      0.05   0.66
Fast-RCNN      2 s/img       0.5    0.7
Faster-RCNN    140 ms/img    7      0.732
YOLO v1        22 ms/img     45     0.63
Fast YOLO v1   6.45 ms/img   155    0.53
Credit: https://pjreddie.com/darknet/yolo
Distance travelled by a car at 50 km/h while one image is processed:
278 m - RCNN
28 m - Fast-RCNN
1.95 m - Faster-RCNN
0.3 m - YOLO
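The distances above follow from the per-image times in the comparison table. A minimal sketch of the arithmetic, assuming a constant speed of 50 km/h and the times 20 s, 2 s, 140 ms and 22 ms; the names below are illustrative.

```python
SPEED_KMH = 50.0

# Per-image processing times in seconds, taken from the comparison table.
SECONDS_PER_IMAGE = {
    "RCNN": 20.0,
    "Fast-RCNN": 2.0,
    "Faster-RCNN": 0.140,
    "YOLO": 0.022,
}


def distance_travelled(speed_kmh, seconds):
    """Metres covered at a constant speed during the given time."""
    metres_per_second = speed_kmh * 1000.0 / 3600.0
    return metres_per_second * seconds


for model, seconds in SECONDS_PER_IMAGE.items():
    print(f"{model}: {distance_travelled(SPEED_KMH, seconds):.2f} m")
```

For Faster-RCNN this arithmetic gives roughly 1.94 m, so the figure on the slide appears to be rounded slightly differently.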
Split image into S x S grid
Each cell predicts boxes (x, y, w, h) and confidences P(object)
Each cell predicts class probability conditioned on object, e.g. P(Car | object)
Classes in the example: Car, Bicycle, Dog, Dining table
At test time we combine the box and class predictions
After NMS and thresholding
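The test-time steps above (combine box confidence with class probability, threshold, then apply non-maximum suppression) can be sketched as follows. This is a greedy NMS under stated assumptions: boxes are (x, y, w, h) with a top-left corner, and the thresholds 0.2 and 0.5 are illustrative values that do not come from the slides.

```python
def iou(a, b):
    """IoU of two (x, y, w, h) boxes; (x, y) is assumed to be the top-left corner."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(detections, score_threshold=0.2, iou_threshold=0.5):
    """Greedy non-maximum suppression over (box, score) pairs of one class."""
    candidates = sorted(
        (d for d in detections if d[1] >= score_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Class score = P(object) * P(class | object), both made-up example values.
raw = [
    ((10, 10, 50, 50), 0.9, 0.8),
    ((12, 12, 50, 50), 0.6, 0.8),     # near-duplicate of the first box
    ((200, 200, 40, 40), 0.7, 0.9),
    ((300, 300, 30, 30), 0.1, 0.5),   # falls below the score threshold
]
detections = [(box, p_obj * p_cls) for box, p_obj, p_cls in raw]
kept = nms(detections)
print([box for box, _ in kept])
```

In practice the suppression is run separately for each class, so overlapping objects of different classes do not suppress each other.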
Model
◦ Image divided into S x S grid
◦ Within each grid cell predict:
▫ B boxes (4 coordinates + confidence)
▫ C class scores
◦ Regression from image to S x S x (5 * B + C) tensor
Credit: Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, arXiv 2015
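To make the output size concrete, here is a small sketch of the S x S x (5 * B + C) layout. The values S = 7, B = 2, C = 20 are the ones reported in the YOLO paper for Pascal VOC and are not stated on the slide; the ordering of values inside a cell (boxes first, then class scores) is an assumption made for illustration.

```python
def yolo_output_shape(S, B, C):
    """Shape of the tensor regressed from the image."""
    return (S, S, 5 * B + C)


def split_cell(cell, B, C):
    """Split one cell's vector into B boxes (x, y, w, h, confidence) and C class scores.

    Assumed layout: all boxes first, class scores last.
    """
    boxes = [tuple(cell[5 * b:5 * b + 5]) for b in range(B)]
    class_scores = list(cell[5 * B:5 * B + C])
    return boxes, class_scores


S, B, C = 7, 2, 20  # assumption: Pascal VOC setting from the YOLO paper
shape = yolo_output_shape(S, B, C)
total_outputs = shape[0] * shape[1] * shape[2]
print(shape, total_outputs)

cell = [0.0] * shape[2]
boxes, class_scores = split_cell(cell, B, C)
print(len(boxes), len(class_scores))
```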
During training we match each example to the correct cell
Dog = 1
Cat = 0
Bicycle = 0
...
Adjust cell’s class probabilities
Find predicted bounding box with highest IoU
Increase its confidence
Decrease confidence of other boxes
Decrease confidence of boxes in cells without ground truth detection
Training details
● Pretrain Extraction Net on ImageNet (24 conv layers)
● SGD with decreasing learning rate
● Extensive data augmentation
● Leaky ReLUs
● Increase the loss weight for bounding box coordinate predictions and decrease it for the confidence of boxes that don’t contain objects
● Predict the square root of width and height instead of the raw values
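The effect of the square-root trick in the last point can be shown numerically. This sketch assumes widths are normalised to the image size (values between 0 and 1); the helper names and the example widths are illustrative.

```python
import math


def squared_error(pred, true):
    """Plain squared error on the raw value."""
    return (pred - true) ** 2


def sqrt_squared_error(pred, true):
    """Squared error on the square roots, as in the training detail above."""
    return (math.sqrt(pred) - math.sqrt(true)) ** 2


# The same absolute width error of 0.05 on a small and on a large box.
small_raw = squared_error(0.15, 0.10)
large_raw = squared_error(0.85, 0.80)
small_sqrt = sqrt_squared_error(0.15, 0.10)
large_sqrt = sqrt_squared_error(0.85, 0.80)

print(small_raw, large_raw)    # practically identical
print(small_sqrt, large_sqrt)  # the small box is penalised more
```

With raw values both errors cost the same, although a deviation of 0.05 matters far more for a small box; with square roots the small box receives the larger penalty.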
YOLO v2
YOLO drawbacks
◦ YOLO makes a significant number
of localization errors in comparison
to Faster-RCNN
◦ Low recall in comparison to region
proposal based methods
YOLO v2
◦ Batch normalization
◦ High resolution classifier
◦ Convolutional anchor boxes
◦ K-means for choosing box priors
◦ Fine-grained features
◦ Multi-scale training
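The K-means step for choosing box priors can be sketched as below. The distance 1 - IoU between a box and a centroid follows the YOLO9000 paper; the deterministic initialisation (first k boxes), the mean update, and the toy (width, height) data are assumptions made to keep the example small and reproducible.

```python
def wh_iou(a, b):
    """IoU of two (w, h) boxes that share the same top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_priors(boxes, k, iterations=20):
    """Cluster (w, h) pairs using d(box, centroid) = 1 - IoU(box, centroid)."""
    centroids = list(boxes[:k])  # assumption: deterministic initialisation
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Smallest 1 - IoU distance is the same as largest IoU.
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_centroids.append((mean_w, mean_h))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


# Toy ground-truth box sizes in pixels (made up for illustration).
boxes = [(10, 20), (100, 50), (12, 22), (11, 19), (110, 55), (95, 48)]
priors = kmeans_priors(boxes, k=2)
print(priors)
```

Using IoU instead of Euclidean distance keeps large boxes from dominating the clustering, since the error no longer grows with box size.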
mAP and speed on VOC 2007
Credit: Redmon, Farhadi: “YOLO9000: Better, Faster, Stronger”, arXiv 2017
YOLO 9000 - WordTree
YOLO 9000: Hierarchical Classification
◦ Train Darknet-19 on WordTree
◦ Propagate ground truth labels up
the tree
◦ Perform multiple softmax over
co-hyponyms
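The hierarchical classification above can be illustrated on a toy tree. The tree, the node names and the scores below are hypothetical and far smaller than the real WordTree; the sketch only shows the mechanics: one softmax per group of co-hyponyms (siblings), absolute probabilities as products of conditionals along the path to the root, and labels propagated upwards.

```python
import math

# Hypothetical toy hierarchy: child -> parent (the root has no parent).
PARENT = {
    "object": None,
    "animal": "object",
    "vehicle": "object",
    "dog": "animal",
    "cat": "animal",
    "terrier": "dog",
    "husky": "dog",
}


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def conditional_probs(logits):
    """Apply a separate softmax over each group of siblings (co-hyponyms)."""
    groups = {}
    for node, parent in PARENT.items():
        if parent is not None:
            groups.setdefault(parent, []).append(node)
    cond = {}
    for siblings in groups.values():
        cond.update(zip(siblings, softmax([logits[n] for n in siblings])))
    return cond


def absolute_prob(node, cond):
    """Multiply conditional probabilities along the path up to the root."""
    p = 1.0
    while PARENT[node] is not None:
        p *= cond[node]
        node = PARENT[node]
    return p


def propagate_label(node):
    """A ground-truth label also marks every ancestor as positive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


logits = {"animal": 2.0, "vehicle": 0.0, "dog": 1.0, "cat": 1.0, "terrier": 0.0, "husky": 0.0}
cond = conditional_probs(logits)
print(absolute_prob("terrier", cond))
print(propagate_label("terrier"))
```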
YOLO 9000 - Joint Classification and Detection training
◦ COCO detection + top 9000 classes from
ImageNet
◦ On detection image, backpropagate loss as
normal
◦ On classification image, only
backpropagate loss at or above the
corresponding level of label
◦ ImageNet shares 44 categories with COCO
◦ Generalizes quite well to new animals (tiger 0.61 AP, fox 0.52 AP)
◦ Fails on clothing e.g. “sunglasses”
YOLO 9000: Visualizations
Single Shot Multibox Detector
SSD - YOLO architecture comparison
Credit: Liu et al., “SSD: Single Shot MultiBox Detector”, arXiv 2016.
SSD detection
◦ A detection is described by four parameters (cx, cy, w, h) and a class category
◦ Each detector outputs a single value, so we need #classes + 4 detectors for a single detection
Different “classes” of detection
Aspect ratios: 2:1, 1:2, 1:1
Default boxes and aspect ratios
For each conv layer that is input to detection
there are:
(classes + 4) x #default boxes x m x n outputs
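The output count per detection layer is easy to check numerically. The layer sizes below (a 38 x 38 map with 4 default boxes and a 5 x 5 map with 6 default boxes, 21 classes including background) are illustrative assumptions, not values from the slides.

```python
def ssd_layer_outputs(num_classes, num_default_boxes, m, n):
    """Number of outputs for one m x n feature map feeding the detector."""
    return (num_classes + 4) * num_default_boxes * m * n


# Assumed example layers: (classes, default boxes per location, m, n).
layers = [
    (21, 4, 38, 38),
    (21, 6, 5, 5),
]
for num_classes, boxes_per_location, m, n in layers:
    print(m, n, ssd_layer_outputs(num_classes, boxes_per_location, m, n))
```

The large early feature maps dominate the total, which is why most of the detector outputs correspond to small default boxes.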
SSD Training
◦ Ground truth data needs to be assigned to
specific outputs in the fixed set of detector
outputs
◦ For each GT box we choose the default box with the best Jaccard overlap
◦ Hard negative mining
◦ Data augmentation
◦ Loss: cross-entropy + Smooth L1
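Two pieces of the training recipe above can be sketched briefly: matching each ground-truth box to the default box with the best Jaccard overlap, and the Smooth L1 term of the loss. Boxes are (cx, cy, w, h) in normalised image coordinates; the default boxes and the ground-truth box are made-up values, and only the best-overlap rule named on the slide is implemented.

```python
def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes given as (cx, cy, w, h)."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def match_ground_truth(gt_boxes, default_boxes):
    """For each GT box, return the index of the default box with the best overlap."""
    return [
        max(range(len(default_boxes)), key=lambda i: jaccard(gt, default_boxes[i]))
        for gt in gt_boxes
    ]


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for large errors."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


# Hypothetical default boxes and one ground-truth box.
default_boxes = [(0.25, 0.25, 0.5, 0.5), (0.75, 0.75, 0.5, 0.5), (0.5, 0.5, 1.0, 0.5)]
gt_boxes = [(0.3, 0.3, 0.4, 0.4)]
matches = match_ground_truth(gt_boxes, default_boxes)
print(matches)
print(smooth_l1(0.5), smooth_l1(2.0))
```

The linear tail of Smooth L1 keeps badly mislocalised boxes from producing very large gradients, while the quadratic part behaves like L2 for small errors.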
Deconvolutional Single Shot Detector
Credit: Fu et al., “DSSD: Deconvolutional Single Shot Detector”, arXiv 2017.
Models comparison - according to DSSD paper
Model           Network      Pascal 2007 mAP
Faster-RCNN     ResNet-101   0.764
R-FCN           ResNet-101   0.805
SSD-300         VGG-16       0.775
SSD-513         ResNet-101   0.806
YOLO v2 - 544   Darknet-19   0.786
DSSD-513        ResNet-101   0.815
Recap
◦ Detection as regression or
detection as classification
◦ Static-image detectors are already fast enough to work even on video
◦ Fast YOLO is the fastest detector in this comparison
◦ State-of-the-art:
▫ ResNet-101 + SSD + deconvolutions
Thanks!
Q&A
You can contact us at:
matthew.opala@craftinity.com

Codetecon #KRK 3 - Object detection with Deep Learning
