OBJECT DETECTION
OBJECT CLASSIFICATION
OBJECT LOCALIZATION
 Before the deep learning era, hand-crafted features like
HOG and feature pyramids were used pervasively to
capture localization signals in an image.
 However, those methods usually can’t extend to generic
object detection well, so most of the applications are
limited to face or pedestrian detections.
 With the power of deep learning, we can train a network
to learn which features to capture, as well as what
coordinates to predict for an object.
2013: OVERFEAT
 OverFeat: Integrated Recognition, Localization and
Detection using Convolutional Networks
 Inspired by the early success of AlexNet in the 2012 ImageNet competition, where
CNN-based feature extraction defeated all hand-crafted feature extractors,
OverFeat quickly introduced CNN back into the object detection area as well.
 The idea is very straightforward: if we can classify one image using CNN, what
about greedily scrolling through the whole image with different sizes of windows,
and trying to regress and classify them one by one using a CNN?
 This leverages the power of CNN for feature extraction and classification, and also
bypasses the hard region proposal problem by using pre-defined sliding windows.
 Also, since a nearby convolution kernel can share part of the computation result, it
is not necessary to compute convolutions for the overlapping area, hence reducing
cost a lot.
 OverFeat is a pioneer among one-stage object detectors. It tried to combine feature
extraction, location regression, and region classification in the same CNN.
 Unfortunately, such a one-stage approach also suffers from relatively poorer
accuracy due to less prior knowledge used.
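The sliding-window idea described above can be sketched in a few lines. This is only an illustration under simple assumptions: the function name `sliding_windows`, the square windows, and the fixed stride are hypothetical choices, and OverFeat itself evaluates the windows densely through convolutions rather than with an explicit loop like this.

```python
def sliding_windows(image_w, image_h, window_sizes, stride):
    """Yield (x, y, w, h) for every square window that fits inside the image."""
    for size in window_sizes:
        for y in range(0, image_h - size + 1, stride):
            for x in range(0, image_w - size + 1, stride):
                yield (x, y, size, size)


# Each window would be passed to the CNN for classification and box regression.
windows = list(sliding_windows(8, 8, window_sizes=[4, 8], stride=2))
```

Neighbouring windows overlap heavily, which is why sharing convolution results between them saves so much computation.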
 Also proposed in 2013, R-CNN is a bit late compared with OverFeat.
 However, this region-based approach eventually led to a big wave of
object detection research with its two-stage framework, i.e., a region
proposal stage followed by a region classification and refinement stage.
2013: R-CNN
 R-CNN first extracts potential regions of interest from an input
image by using a technique called selective search.
 Selective search doesn’t really try to understand the
foreground object, instead, it groups similar pixels by relying
on a heuristic: similar pixels usually belong to the same object.
 Therefore, the results of selective search have a very high
probability to contain something meaningful.
 Next, R-CNN warps these region proposals into fixed-size
images with some padding, and feeds these images into the
second stage of the network for more fine-grained recognition.
 Unlike those old methods using selective search, R-CNN
replaced HOG with a CNN to extract features from all region
proposals in its second stage.
 Region proposal from selective search highly depends on the
similarity assumption, so it can only provide a rough estimate
of location.
 To further improve localization accuracy, R-CNN borrowed an
idea from “Deep Neural Networks for Object Detection” (aka
DetectorNet), and introduced an additional bounding box
regression to predict the center coordinates, width and height
of a box. This regressor is widely used in later object
detectors.
 However, a two-stage detector like R-CNN suffers from two big
issues: 1) It’s not fully convolutional, because selective search
is not E2E trainable. 2) The region proposal stage is usually very
slow compared with one-stage detectors like OverFeat,
and running the CNN on each region proposal separately makes it even
slower.
 Later, we will see how R-CNN evolved over time to address
these two issues.
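The bounding box regression mentioned above (center coordinates, width and height) is commonly expressed as offsets relative to the proposal box. The sketch below uses the usual R-CNN style parameterization; the function names `encode` and `decode` and the (cx, cy, w, h) tuple layout are assumptions made for this example.

```python
import math


def encode(proposal, target):
    """Regression targets (tx, ty, tw, th) that move `proposal` onto `target`.

    Both boxes are (cx, cy, w, h).
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = target
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))


def decode(proposal, deltas):
    """Apply predicted offsets to a proposal to get the refined box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = deltas
    return (px + tx * pw, py + ty * ph,
            pw * math.exp(tw), ph * math.exp(th))


proposal = (50.0, 50.0, 20.0, 40.0)
target = (55.0, 48.0, 30.0, 30.0)
refined = decode(proposal, encode(proposal, target))
```

Normalizing the shift by the proposal size and taking the log of the size ratio keeps the targets in a similar range for small and large boxes.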
2015: FAST R-CNN
 A quick follow-up for R-CNN is to reduce the duplicate
convolution over multiple region proposals.
 Since these region proposals all come from one image,
it’s natural to improve R-CNN by running the CNN over the
entire image once and sharing the computation among
many region proposals.
 However, different region proposals have different sizes,
which also results in different output feature map sizes if
we are using the same CNN feature extractor.
 These feature maps with various sizes will prevent us
from using fully connected layers for further classification
and regression because the FC layer only works with a
fixed size input.
 Fortunately, a paper called “Spatial Pyramid Pooling in Deep
Convolutional Networks for Visual Recognition” has already solved
the dynamic scale issue for FC layers.
 In SPPNet, a feature pyramid pooling is introduced between
convolution layers and FC layers to create a bag-of-words style of the
feature vector.
 This vector has a fixed size and encodes features from different
scales, so our convolution layers can now take any size of images as
input without worrying about the incompatibility of the FC layer.
 Inspired by this, Fast R-CNN proposed a similar layer called the ROI
Pooling layer.
 This pooling layer downsamples feature maps with different sizes into
a fixed-size vector. By doing so, we can now use the same FC layers
for classification and box regression, no matter how large or small the
ROI is.
 With a shared feature extractor and the scale-invariant ROI pooling
layer, Fast R-CNN can reach a similar localization accuracy while
training 10~20x faster and running inference 100~200x faster.
 The near real-time inference and an easier E2E training protocol for the
detection part make Fast R-CNN a popular choice in the industry as
well.
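To make the ROI Pooling idea above concrete, here is a minimal single-channel sketch, assuming the ROI is already given in feature-map cells. The function name `roi_max_pool` and the integer bin-edge rounding are illustrative choices, not the exact scheme used by any particular implementation.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool a rectangular region of a 2-D feature map into out_h x out_w bins.

    feature_map: list of rows; roi: (x0, y0, x1, y1) with x1/y1 exclusive.
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Integer bin edges; every bin covers at least one cell.
        ys = y0 + (i * roi_h) // out_h
        ye = max(ys + 1, y0 + ((i + 1) * roi_h + out_h - 1) // out_h)
        row = []
        for j in range(out_w):
            xs = x0 + (j * roi_w) // out_w
            xe = max(xs + 1, x0 + ((j + 1) * roi_w + out_w - 1) // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# A 4x6 feature map whose value at (row, col) is row * 6 + col.
fmap = [[r * 6 + c for c in range(6)] for r in range(4)]
big = roi_max_pool(fmap, (0, 0, 6, 4), 2, 2)
small = roi_max_pool(fmap, (1, 1, 4, 3), 2, 2)
```

Both ROIs have different sizes but produce a 2x2 output, which is what lets the same FC layers follow.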
 This dense prediction over the entire image can cause
trouble in computation cost, so YOLO took the bottleneck
structure from GoogLeNet to avoid this issue.
 Another problem of YOLO is that two objects might fall into
the same coarse grid cell, so it doesn’t work well with small
objects such as a flock of birds.
 Despite lower accuracy, YOLO’s straightforward design
and real-time inference ability made one-stage object
detection popular again in research, and also a go-to
solution for the industry.
2015: FASTER R-CNN
 Faster R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks
 As we introduced above, in early 2015, Ross Girshick proposed
an improved version of R-CNN called Fast R-CNN by using a
shared feature extractor for proposed regions.
 Just a few months later, Ross and his team came back with
another improvement again.
 This new network Faster R-CNN is not only faster than previous
versions but also marks a milestone for object detection with a
deep learning method.
 With Fast R-CNN, the only non-convolutional piece of the network is the selective
search region proposal.
 As of 2015, researchers started to realize that the deep neural network is so
magical, that it can learn anything given enough data.
 So, is it possible to also train a neural network to propose regions, instead of
relying on heuristic and hand-crafted approaches like selective search?
 Faster R-CNN followed this direction and thinking, and successfully created the
Region Proposal Network (RPN).
 Put simply, RPN is a CNN that takes an image as input and outputs a set of
rectangular object proposals, each with an objectness score.
 The paper used VGG originally but other backbone networks such as ResNet
became more widespread later.
 To generate region proposals, a 3×3 sliding window is applied over the CNN
feature map output to generate 2 scores (foreground and background) and 4
coordinates for each anchor box at each location.
 In practice, this sliding window is implemented with a 3×3 convolution followed by
two sibling 1×1 convolutions (one for scores, one for coordinates).
 Although the sliding window has a fixed size, our objects may
appear on different scales.
 Therefore, Faster R-CNN introduced a technique called anchor
box.
 Anchor boxes are pre-defined prior boxes with different aspect
ratios and sizes but share the same central location.
 In Faster R-CNN there are k=9 anchors for each sliding window
location, covering 3 aspect ratios at each of 3 scales.
 These repeated anchor boxes over different scales bring nice
translation-invariance and scale-invariance features to the
network while sharing outputs of the same feature map.
 Note that the bounding box regression will be computed from
these anchor boxes instead of the whole image.
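The k=9 anchors per location can be generated as below. This is a sketch: the helper name `make_anchors` is hypothetical, and the default scales (128, 256, 512 pixels) and ratios (1:2, 1:1, 2:1) follow the values reported in the Faster R-CNN paper, with each scale keeping the anchor area at roughly scale squared.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x0, y0, x1, y1) centred at (cx, cy).

    ratio is height / width; every anchor of a given scale has area scale ** 2.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(300, 300)
```

The same nine shapes are reused at every sliding window position, so only the center changes from location to location.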
 So far, we discussed the new Region Proposal Network to replace the
old selective search region proposal.
 To make the final detection, Faster R-CNN uses the same detection
head from Fast R-CNN to do classification and fine-grained
localization.
 Fast R-CNN also uses a shared CNN feature extractor. Now that RPN
itself is also a feature extraction CNN, we can just share it with
the detection head, as in the diagram above.
 This sharing design does bring some trouble though. If we train RPN
and the Fast R-CNN detector together, we will treat RPN proposals as a
constant input of ROI pooling, and inevitably ignore the gradients of
RPN’s bounding box proposals.
 One workaround is called alternating training, where you train RPN and
Fast R-CNN in turns.
 And later in a paper “Instance-aware semantic segmentation via multi-
task network cascades”, we can see that the ROI pooling layer can
also be made differentiable w.r.t. the proposed box coordinates.
2015: YOLO V1
 You Only Look Once: Unified, Real-Time Object Detection
 While the R-CNN series started a big hype over two-stage
object detection in the research community, its complicated
implementation brought many headaches for engineers who
maintain it.
 Does object detection need to be so cumbersome?
 If we are willing to sacrifice a bit of accuracy, can we trade for
much faster speed?
 With these questions, Joseph Redmon submitted a network
called YOLO to arxiv.org only four days after Faster R-CNN’s
submission.
 It finally brought popularity back to one-stage object detection
two years after OverFeat’s debut.
 Unlike R-CNN, YOLO decided to tackle region proposal and region
classification together in the same CNN.
 In other words, it treats object detection as a regression problem, instead
of a classification problem relying on region proposals.
 The general idea is to split the input into an SxS grid and have each cell
directly regress the bounding box location and the confidence score if the
object center falls into that cell.
 Because objects may have different sizes, there will be more than one
bounding box regressor per cell.
 During training, the regressor with the highest IOU will be assigned to
compare with the ground-truth label, so regressors at the same location
will learn to handle different scales over time.
 In the meantime, each cell will also predict C class probabilities,
conditioned on the grid cell containing an object (high confidence score).
 This approach is later described as dense predictions because YOLO tried
to predict classes and bounding boxes for all possible locations in an
image.
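The grid assignment described above can be sketched as follows. The function names are hypothetical; the defaults S=7, B=2 and C=20 are the values used in the YOLO v1 paper for PASCAL VOC, where each box predicts x, y, w, h and a confidence score.

```python
def responsible_cell(cx, cy, image_w, image_h, S=7):
    """Grid cell (row, col) that contains the object centre, given in pixels."""
    col = min(int(cx / image_w * S), S - 1)
    row = min(int(cy / image_h * S), S - 1)
    return row, col


def output_size(S=7, B=2, C=20):
    """Length of the prediction vector: S x S cells, B boxes of 5 values, C classes."""
    return S * S * (B * 5 + C)


cell = responsible_cell(224, 100, image_w=448, image_h=448)
```

Only the cell returned by `responsible_cell` is trained to predict that object, which is why two objects whose centers land in the same cell compete with each other.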
CNN MODEL THAT FORMS THE BACKBONE OF
YOLO
OBJECT LOCALIZATION
STEPS
 1. YOLO cuts an image into squares.
 This makes it easier for YOLO to find objects in the image. It only needs
to look at one square at a time, instead of the entire image.
 2. For each square, YOLO guesses if there is an object in it and, if
so, what kind of object it is.
 It does this by using a deep learning model. The model has been trained
on a lot of images and labels. This means that the model knows how to
identify different types of objects in images.
 3. YOLO gets rid of any extra guesses.
 It does this by using a technique called non-maximum suppression. This
removes lower-confidence guesses that overlap heavily with a more confident guess. This
makes sure that YOLO only outputs one guess for each object in the
image.
 4. YOLO outputs the remaining guesses as rectangles and object
labels.
 A rectangle is a box that surrounds an object in an image. An object label
is a name for the type of object in the box.
 In other words, YOLO outputs a box and a name for each object that it
finds in the image.
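Step 3 above, non-maximum suppression, can be sketched as a greedy loop over boxes sorted by confidence. The helper names `iou` and `nms`, the (x0, y0, x1, y1) box layout, and the 0.5 threshold are assumptions for this example rather than details taken from the slides.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, most confident first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.68, so it is dropped, while the distant third box survives.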
FOR MULTIPLE OBJECTS
NON MAX SUPPRESSION
ANCHOR BOXES
CNN WITH TWO ANCHOR BOXES
2015: SSD
 SSD: Single Shot MultiBox Detector
 YOLO v1 demonstrated the potential of one-stage detection, but the
performance gap from two-stage detection was still noticeable.
 In YOLO v1, multiple objects could be assigned to the same grid
cell.
 This was a big challenge when detecting small objects, and became
a critical problem to solve in order to improve a one-stage detector’s
performance to be on par with two-stage detectors.
 SSD is such a challenger and attacks this problem from three angles.
KEY FEATURES OF SSD
 Single Shot: Unlike some traditional object detection
models that use a two-stage approach (first proposing
regions of interest and then classifying those regions),
SSD performs object detection in a single pass through
the network. It directly predicts the presence of objects
and their bounding box coordinates in a single shot,
making it faster and more efficient.
 MultiBox: SSD uses a set of default bounding boxes
(anchor boxes) of different scales and aspect ratios at
multiple locations in the input image. These default
boxes serve as prior knowledge about where objects are
likely to appear. SSD predicts adjustments to these
default boxes to locate objects accurately.
 Multi-Scale Detection: SSD operates on multiple
feature maps with different resolutions, allowing it to
detect objects of various sizes. Predictions are
made at different scales to capture objects at
varying levels of granularity.
 Class Scores: SSD not only predicts the bounding
box coordinates but also assigns class scores to
each default box, indicating the likelihood of an
object belonging to a specific category (e.g., car,
pedestrian, bicycle).
KEY CONCEPTS OF SSD
 Default Bounding Boxes (Anchor Boxes): SSD
uses a predefined set of default bounding boxes,
also known as anchor boxes. These boxes come in
various scales and aspect ratios, providing prior
knowledge about where objects are likely to be
located in the image. SSD predicts adjustments to
these default boxes to localize objects accurately.
 Multi-Scale Feature Maps: SSD operates on
multiple feature maps at different resolutions.
These feature maps are obtained by applying
convolutional layers to the input image at various
stages. Using feature maps at several scales
allows SSD to detect objects of different sizes.
 Multi-Scale Predictions: For each default
bounding box, SSD makes predictions at multiple
feature map layers with different resolutions. This
enables the model to capture objects at various
scales. These predictions include class scores for
different object categories and offsets for adjusting
the default boxes to match the objects’ positions.
 Aspect Ratio Handling: SSD uses separate
predictors (convolutional filters) for different aspect
ratios of bounding boxes. This allows it to adapt to
objects with varying shapes and aspect ratios.
 Base Network (Truncated for Classification):
 SSD begins with a standard CNN architecture, which is
typically used for high-quality image classification tasks.
However, in SSD, this base network is truncated before
any classification layers. The base network is
responsible for extracting essential features from the
input image.
 Multi-Scale Feature Maps: Additional convolutional layers are added to
the truncated base network. These layers progressively reduce the
spatial dimensions while increasing the number of channels (feature
channels). This design allows SSD to produce feature maps at multiple
scales. Each scale’s feature map is suitable for detecting objects of
different sizes.
 Default Bounding Boxes (Anchor Boxes): SSD associates a
predefined set of default bounding boxes (anchor boxes) with each
feature map cell. These default boxes have various scales and aspect
ratios. The placement of default boxes relative to their corresponding cell
is fixed and follows a convolutional grid pattern. For each feature map
cell, SSD predicts the offsets necessary to adjust these default boxes to
fit objects and the class scores indicating the presence of specific object
categories.
 Aspect Ratios and Multiple Feature Maps: SSD employs default boxes
with different aspect ratios and uses them across multiple feature maps
at various resolutions. This approach efficiently captures a range of
possible object shapes and sizes. Unlike other models, SSD doesn’t rely
on an intermediate fully connected layer for predictions but uses
convolutional filters directly.
GRID CELL
 Instead of using sliding window, SSD divides the image using
a grid and has each grid cell be responsible for detecting
objects in that region of the image. Detecting objects simply
means predicting the class and location of an object within that
region. If no object is present, we consider it as the
background class and the location is ignored. For instance, we
could use a 4x4 grid in the example below. Each grid cell is
able to output the position and shape of the object it contains.
ANCHOR BOX
 Each grid cell in SSD can be assigned with multiple
anchor/prior boxes. These anchor boxes are pre-defined
and each one is responsible for a size and shape within a
grid cell. For example, the swimming pool in the image
below corresponds to the taller anchor box while the
building corresponds to the wider box.
 SSD uses a matching phase while training, to
match the appropriate anchor box with the
bounding boxes of each ground truth object within
an image.
 Essentially, the anchor box with the highest degree
of overlap with an object is responsible for
predicting that object’s class and its location.
 This property is used for training the network and
for predicting the detected objects and their
locations once the network has been trained. In
practice, each anchor box is specified by an aspect
ratio and a zoom level.
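The matching phase described above can be sketched as picking, for each ground truth box, the anchor with the highest overlap. The names `iou` and `match_anchors` are hypothetical, and this sketch covers only the best-overlap rule stated in the text.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_anchors(anchors, gt_boxes):
    """Map each ground truth index to the index of its best-overlapping anchor."""
    matches = {}
    for g, gt in enumerate(gt_boxes):
        overlaps = [iou(anchor, gt) for anchor in anchors]
        matches[g] = max(range(len(anchors)), key=lambda i: overlaps[i])
    return matches


anchors = [(0, 0, 4, 8), (0, 0, 8, 4)]    # one tall anchor, one wide anchor
gt_boxes = [(0, 0, 4, 7), (0, 0, 9, 4)]   # a tall object and a wide object
matches = match_anchors(anchors, gt_boxes)
```

The tall object is matched to the tall anchor and the wide object to the wide one, mirroring the swimming pool and building example.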
ASPECT RATIO
 Not all objects are square in shape. Some are longer and
some are wider, by varying degrees. The SSD architecture
allows pre-defined aspect ratios of the anchor boxes to
account for this. The ratios parameter can be used to specify
the different aspect ratios of the anchor boxes associated with
each grid cell at each zoom/scale level.
 Zoom level
 It is not necessary for the anchor boxes to have the same size
as the grid cell. We might be interested in finding smaller or
larger objects within a grid cell. The zooms parameter is used
to specify how much the anchor boxes need to be scaled up
or down with respect to each grid cell. Just like what we have
seen in the anchor box example, the building is
generally larger than the swimming pool.
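How the grid, the ratios and the zooms combine into a full anchor set can be sketched as below. The function name `grid_anchors` and the default values are illustrative assumptions, not the settings of any specific SSD implementation; boxes are (cx, cy, w, h) in image coordinates normalized to [0, 1].

```python
def grid_anchors(grid=4, zooms=(0.7, 1.0, 1.3),
                 ratios=((1.0, 1.0), (1.0, 0.5), (0.5, 1.0))):
    """Generate one anchor per (cell, zoom, ratio) combination."""
    cell = 1.0 / grid
    anchors = []
    for row in range(grid):
        for col in range(grid):
            # Every anchor of a cell shares the cell centre.
            cx, cy = (col + 0.5) * cell, (row + 0.5) * cell
            for zoom in zooms:
                for rw, rh in ratios:
                    anchors.append((cx, cy, cell * zoom * rw, cell * zoom * rh))
    return anchors


anchors = grid_anchors()
```

With a 4x4 grid, 3 zooms and 3 ratios this yields 4 * 4 * 3 * 3 = 144 anchors, each predicting its own offsets and class scores.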
2016: FPN
 Feature Pyramid Networks for Object Detection
 With the launch of Faster-RCNN, YOLO, and SSD in 2015, it seems like
the general structure of an object detector had been determined.
 Researchers started to look at improving individual parts of these
networks.
 Feature Pyramid Networks is an attempt to improve the detection head by
using features from different layers to form a feature pyramid.
 This feature pyramid idea isn’t very novel in computer vision research.
 Back when features were still manually designed, the feature pyramid was
already a very effective way to recognize patterns at different scales.
 However, how to share the feature pyramid between RPN and the region-
based detector was yet to be determined.
 First, to rebuild RPN with an FPN structure like the diagram
above, we need to have a region proposal running on multiple
different scales of feature output.
 Also, we only need 3 anchors with different aspect ratios per
location now because objects with different sizes will be handled
by different levels of the feature pyramid.
 Next, to use an FPN structure in the Fast R-CNN detector, we
also need to adapt it to detect on multiple scales of feature maps
as well.
 Since region proposals might have different scales too, we
should use them in the corresponding level of FPN as well.
 In short, if Faster R-CNN is a pair of RPN and region-based
detector running on one scale, FPN converts it into multiple
parallel branches running on different scales and collects the
final results from all branches in the end.
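Assigning each region proposal to its corresponding pyramid level can be sketched with the rule from the FPN paper, k = floor(k0 + log2(sqrt(w * h) / 224)), where k0 = 4 and 224 is the canonical ImageNet size. The function name `fpn_level` and the clamping to levels 2 through 5 are assumptions for this example.

```python
import math


def fpn_level(w, h, k0=4, k_min=2, k_max=5, canonical=224):
    """Pyramid level for an ROI of width w and height h (in input pixels)."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    # Clamp to the levels that actually exist in the pyramid.
    return max(k_min, min(k_max, k))


levels = [fpn_level(s, s) for s in (32, 112, 224, 1000)]
```

Small proposals are routed to the finer, higher-resolution levels, and large proposals to the coarser ones.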
2016: YOLO V2
 The initial version of YOLO suffers from many
shortcomings: predictions based on a coarse grid brought
lower localization accuracy, and two scale-agnostic regressors
per grid cell also made it difficult to recognize small, densely packed
objects.
 YOLO v2 added Batch Normalization layers from a paper
called “Batch Normalization: Accelerating Deep Network
Training by Reducing Internal Covariate Shift”.
 Just like SSD, YOLO v2 also introduced Faster R-CNN’s idea of
anchor boxes for bounding box regression.
 Also, anchor sizes are determined by a K-means clustering of
the target dataset to better align with object shapes.
 A new backbone network called Darknet is used for feature
extraction. This is inspired by “Network in Network” and
GoogLeNet’s bottleneck structure.
 To improve the detection of small objects, YOLO v2 added a
passthrough layer to merge features from an early layer. This
part can be seen as a simplified version of SSD.
 YOLO v2 also experimented with a version that’s trained on
a hierarchical dataset with 9000 classes, which also represents an
early trial of multi-label classification in an object detector.
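The K-means anchor selection mentioned in the list above can be sketched with the distance used in the YOLO v2 paper, d = 1 - IOU(box, centroid), so that large boxes do not dominate the clustering. The function names, the naive initialization from the first k boxes, and the fixed iteration count are simplifications made for this example.

```python
def wh_iou(a, b):
    """IoU of two (w, h) boxes aligned at the same corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(sizes, k, iters=20):
    """Cluster ground truth (w, h) pairs into k anchor sizes."""
    centroids = list(sizes[:k])  # naive deterministic initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for size in sizes:
            # Highest IoU is the same as smallest 1 - IoU distance.
            best = max(range(k), key=lambda c: wh_iou(size, centroids[c]))
            clusters[best].append(size)
        updated = []
        for c, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                updated.append((mean_w, mean_h))
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        centroids = updated
    return centroids


sizes = [(10, 10), (100, 100), (12, 9), (95, 110), (9, 11), (105, 98)]
anchor_sizes = kmeans_anchors(sizes, k=2)
```

On this toy data the two centroids settle on one small and one large anchor shape.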
2017: RETINANET
 To understand why one-stage detectors are usually not as good as two-
stage detectors, RetinaNet investigated the foreground-background
class imbalance issue from a one-stage detector’s dense predictions.
 RetinaNet invented a new loss function called Focal Loss to help the
network learn what’s important.
 Focal Loss multiplies the Cross-Entropy loss by a modulating factor (1 − p_t)^γ, where p_t is the
predicted probability of the true class and γ is what they call the focusing parameter. Naturally, as the confidence score becomes higher, the
loss value will become much lower than a normal Cross-Entropy.
 RetinaNet is composed of a ResNet backbone, an FPN detection neck to
channel features at different scales, and two subnets for classification
and box regression as the detection head.
 Similar to SSD and YOLO v2, RetinaNet uses anchor boxes to cover
targets of various scales and aspect ratios.
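The Focal Loss described above can be written for a single binary prediction as FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The sketch below uses gamma = 2 and alpha = 0.25, the defaults reported in the RetinaNet paper; the function names are chosen for this example.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for predicted foreground probability p and label y."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Cross-entropy scaled down by (1 - p_t) ** gamma for well-classified examples."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_ratio = focal_loss(0.9, 1) / cross_entropy(0.9, 1)
hard_ratio = focal_loss(0.1, 1) / cross_entropy(0.1, 1)
```

An easy example (p = 0.9) keeps only 0.25% of its cross-entropy loss, while a hard one (p = 0.1) keeps about 20%, so the many easy background anchors no longer dominate training.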
2018: YOLO V3
 YOLOv3: An Incremental Improvement
 Following YOLO v2’s tradition, YOLO v3 borrowed more
ideas from previous research and became an incredibly
powerful one-stage detector.
 YOLO v3 balanced the speed, accuracy, and
implementation complexity pretty well.
 And it got really popular in the industry because of its fast
speed and simple components.
 Simply put, YOLO v3’s success comes from its more
powerful backbone feature extractor and a RetinaNet-like
detection head with an FPN neck.
 The new backbone network Darknet-53 leveraged
ResNet’s skip connections to achieve an accuracy that’s
on par with ResNet-50 but much faster.
 Also, YOLO v3 ditched v2’s passthrough layer and fully
embraced FPN’s multi-scale prediction design.
 Since then, YOLO v3 finally reversed people’s impression
of its poor performance when dealing with small objects.
2019: OBJECTS AS POINTS
 Although the image classification area has become less active
recently, object detection research is still far from mature.
 In 2018, a paper called “CornerNet: Detecting Objects as
Paired Keypoints” provided a new perspective for detector
training.
 Since preparing anchor box targets is a quite cumbersome job,
is it really necessary to use them as a prior?
 This new trend of ditching anchor boxes is called “anchor-free”
object detection.
 Inspired by the use of heat-map in the Hourglass network
for human pose estimation, CornerNet uses a heat-map
generated by box corners to supervise the bounding box
regression.
 Objects As Points, aka CenterNet, took a step further. It uses heat-map peaks to
represent object centers, and the network will regress the box width and height
directly from these box centers.
 Essentially, CenterNet is using every pixel as grid cells. With a Gaussian distributed
heat-map, the training is also easier to converge compared with previous attempts
which tried to regress bounding box size directly.
 The elimination of anchor boxes also has another useful side effect. Previously, we
relied on IOU (such as > 0.7) between the anchor box and the ground truth box to
assign training targets.
 By doing so, a few neighboring anchors may all be assigned a positive target for
the same object. And the network will learn to predict multiple positive boxes for the
same object too.
 The common way to fix this issue is to use a technique called Non-maximum
Suppression (NMS). It’s a greedy algorithm to filter out boxes that are too close
together.
 Now that anchors are gone and we only have one peak per object in the heat-map,
there’s no need to use NMS any more.
 Since NMS is sometimes hard to implement and slow to run, getting rid of NMS is a
big benefit for the applications that run in various environments with limited
resources.
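The heat-map idea above can be sketched in two steps: render a Gaussian bump at each object center for training, then read detections back as local maxima. The function names, the fixed sigma, and the strict 3x3 neighborhood test are simplifying assumptions; CenterNet itself sizes the Gaussian per object and extracts peaks with a 3x3 max pooling.

```python
import math


def gaussian_heatmap(h, w, centers, sigma=1.0):
    """Heat-map with a Gaussian peak of value 1.0 at each (row, col) centre."""
    heat = [[0.0] * w for _ in range(h)]
    for cy, cx in centers:
        for y in range(h):
            for x in range(w):
                value = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
                heat[y][x] = max(heat[y][x], value)
    return heat


def find_peaks(heat, threshold=0.5):
    """Cells above the threshold that are larger than all of their neighbours."""
    h, w = len(heat), len(heat[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            value = heat[y][x]
            if value < threshold:
                continue
            neighbours = [heat[j][i]
                          for j in range(max(0, y - 1), min(h, y + 2))
                          for i in range(max(0, x - 1), min(w, x + 2))
                          if (j, i) != (y, x)]
            if all(value > n for n in neighbours):
                peaks.append((y, x))
    return peaks


heat = gaussian_heatmap(8, 8, centers=[(2, 2), (5, 6)])
peaks = find_peaks(heat)
```

Each object yields exactly one peak, so no separate NMS pass is needed to remove duplicates.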
2019: EFFICIENTDET
 EfficientDet: Scalable and Efficient Object Detection
 EfficientDet showed us some more exciting development in the object detection
area.
 FPN structure has been proved to be a powerful technique to improve the detection
network’s performance for objects at different scales.
 Famous detection networks such as RetinaNet and YOLO v3 all adopted an FPN
neck before box regression and classification.
 Later, NAS-FPN and PANet both demonstrated that a plain multi-layer FPN
structure may benefit from more design optimization.
 EfficientDet continued exploring in this direction and eventually created a new neck
called BiFPN.
 Basically, BiFPN features additional cross-layer connections to encourage feature
aggregation back and forth.
 To justify the efficiency part of the network, this BiFPN also removed some less
useful connections from the original PANet design.
 Another innovative improvement over the FPN structure is the weighted feature fusion.
BiFPN added additional learnable weights to feature aggregation so that the
network can learn the importance of different branches.
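The weighted fusion above can be sketched with the "fast normalized fusion" form from the EfficientDet paper, O = sum(w_i * I_i) / (eps + sum(w_j)), where a ReLU keeps every weight non-negative. Features are flattened to plain lists here for simplicity, and the function name is chosen for this example.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Blend equally sized feature lists using learnable scalar weights."""
    w = [max(0.0, wi) for wi in weights]  # ReLU keeps the weights non-negative
    total = sum(w) + eps
    return [sum(wi * f[i] for wi, f in zip(w, features)) / total
            for i in range(len(features[0]))]


fused = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], weights=[1.0, 3.0])
```

With weights 1 and 3 the output sits three quarters of the way toward the second input, and a weight that training pushes below zero simply switches its branch off.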
MORE LESS FAMOUS MODELS…
 2009: DPM
 Object Detection with Discriminatively Trained Part Based Models
 By matching many HOG features for each deformable part, DPM was one of the most efficient object
detection models before the deep learning era. Taking pedestrian detection as an example, it uses a star
structure to recognize the general person pattern first, and then recognizes parts with different sub-filters and
calculates an overall score. Even today, the idea of recognizing objects with deformable parts is still popular
after we switched from HOG features to CNN features.
 2012: Selective Search
 Selective Search for Object Recognition
 Like DPM, Selective Search is also not a product of the deep learning era. However, this method combined
so many classical computer vision approaches together, and was also used in the early R-CNN detector. The core
idea of selective search is inspired by semantic segmentation, where pixels are grouped by similarity. Selective
Search uses different criteria of similarity such as color space and SIFT-based texture to iteratively merge
similar areas together. These merged areas served as foreground predictions and were followed by an
SVM classifier for object recognition.
 2016: R-FCN
 R-FCN: Object Detection via Region-based Fully Convolutional Networks
 Faster R-CNN finally combined RPN and ROI feature extraction and improved the speed a lot. However, for
each region proposal, we still need fully connected layers to compute class and bounding box separately. If
we have 300 ROIs, we need to repeat this 300 times, and this is also the origin of the major
speed difference between one-stage and two-stage detectors. R-FCN borrowed the idea from FCN for
semantic segmentation, but instead of computing the class mask, R-FCN computes position-sensitive score
maps. These maps predict the probability of the appearance of the object at each location, and all locations
will vote (average) to decide the final class and bounding box. Besides, R-FCN also used atrous convolution
in its ResNet backbone, which was originally proposed in the DeepLab semantic segmentation network. To
understand what atrous convolution is, please see my previous article “Witnessing the Progression in
Semantic Segmentation: DeepLab Series from V1 to V3+”.
 2017: Soft-NMS
 Improving Object Detection With One Line of Code
 Non-maximum suppression (NMS) is widely used in anchor-based object detection
networks to reduce duplicate positive proposals that are close-by. More specifically,
NMS iteratively eliminates candidate boxes if they have a high IOU with a more
confident candidate box. This could lead to some unexpected behavior when two
objects with the same class are indeed very close to each other. Soft-NMS made a
small change: it only scales down the confidence score of the overlapped
candidate boxes with a parameter. This scaling parameter gives us more control
when tuning the localization performance, and also leads to better precision when
a high recall is also needed.
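The Soft-NMS change can be sketched with the linear decay variant from the paper, where an overlapping box has its score multiplied by (1 - IoU) instead of being removed. The function names, the (x0, y0, x1, y1) box layout, and the threshold values are assumptions for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, iou_threshold=0.3, score_threshold=0.001):
    """Linear Soft-NMS: return (index, rescored confidence) in selection order."""
    scores = list(scores)
    remaining = list(range(len(boxes)))
    selected = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        selected.append((best, scores[best]))
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            if overlap > iou_threshold:
                scores[i] *= 1.0 - overlap  # decay the score instead of deleting the box
        # Drop boxes whose score has decayed to almost nothing.
        remaining = [i for i in remaining if scores[i] >= score_threshold]
    return selected


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
result = soft_nms(boxes, [0.9, 0.8, 0.7])
```

The overlapping second box is kept, but its score falls from 0.8 to roughly 0.26, so it now ranks below the distant third box.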
 2017: Cascade R-CNN
 Cascade R-CNN: Delving into High Quality Object Detection
 While FPN explored how to design a better R-CNN neck to use backbone features,
Cascade R-CNN investigated a redesign of the R-CNN classification and regression
head. The underlying assumption is simple yet insightful: the higher the IOU criteria we
use when preparing positive targets, the fewer false positive predictions the network
will learn to make. However, we can’t simply increase such an IOU threshold from
the commonly used 0.5 to a more aggressive 0.7, because it could also lead to more
overwhelming negative examples during training. Cascade R-CNN’s solution is to
chain multiple detection heads together, each relying on the bounding box
proposals from the previous detection head. Only the first detection head will use
the original RPN proposals. This effectively simulates an increasing IOU threshold
for later heads.
 2017: Mask R-CNN
 Mask R-CNN
 Mask R-CNN is not a typical object detection network. It was designed to solve a challenging
instance segmentation task, i.e., creating a mask for each object in the scene. However, Mask
R-CNN showed a great extension to the Faster R-CNN framework, and also in turn inspired
object detection research. The main idea is to add a binary mask prediction branch after ROI
pooling along with the existing bounding box and classification branches. Besides, to address
the quantization error from the original ROI Pooling layer, Mask R-CNN also proposed a new
ROI Align layer that uses bilinear image resampling under the hood. Unsurprisingly, both multi-
task training (segmentation + detection) and the new ROI Align layer contribute to some
improvement over the bounding box benchmark.
 2018: PANet
 Path Aggregation Network for Instance Segmentation
 Instance segmentation has a close relationship with object detection, so often a new instance
segmentation network could also benefit object detection research indirectly. PANet aims at
boosting information flow in the FPN neck of Mask R-CNN by adding an additional bottom-up
path after the original top-down path. To visualize this change, we have a ↑↓ structure in the
original FPN neck, and PANet makes it more like a ↑↓↑ structure before pooling features from
multiple layers. Also, instead of having separate pooling for each feature layer, PANet added an
“adaptive feature pooling” layer after Mask R-CNN’s ROIAlign to merge (element-wise max or
sum) multi-scale features.
 2019: NAS-FPN
 NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection
 PANet’s success in adapting FPN structure drew attention from a group of NAS researchers.
They used a similar reinforcement learning method from the image classification network
NASNet and focused on searching the best combination of multiple merging cells. Here, a
merging cell is the basic building block of an FPN that merges any two input feature layers into
one output feature layer. The final results proved the idea that FPN could use further
optimization, but the complex computer-searched structure made it too difficult for humans to
understand.
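As a rough illustration of what a single merging cell does, the sketch below resizes two single-channel feature layers to a chosen output resolution and sums them. Nearest-neighbour resizing and a plain sum are simplifying assumptions, and `resize_nearest` and `merging_cell` are hypothetical names; the searched architecture chooses among several inputs, resolutions, and merge operations.

```python
def resize_nearest(feature, out_h, out_w):
    """Nearest-neighbour resize of a 2-D feature map (list of rows)."""
    h, w = len(feature), len(feature[0])
    return [[feature[i * h // out_h][j * w // out_w] for j in range(out_w)]
            for i in range(out_h)]

def merging_cell(feat_a, feat_b, out_h, out_w):
    """Merge two input feature layers into one output layer by summation."""
    a = resize_nearest(feat_a, out_h, out_w)
    b = resize_nearest(feat_b, out_h, out_w)
    return [[a[i][j] + b[i][j] for j in range(out_w)] for i in range(out_h)]
```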

Object Detection is a very powerful field.pptx

  • 4.  Before the deep learning era, hand-crafted features like HOG and feature pyramids were used pervasively to capture localization signals in an image.  However, those methods usually did not extend well to generic object detection, so most applications were limited to face or pedestrian detection.  With the power of deep learning, we can train a network to learn which features to capture, as well as what coordinates to predict for an object.
  • 5. 2013: OVERFEAT  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks
  • 6.  Inspired by the early success of AlexNet in the 2012 ImageNet competition, where CNN-based feature extraction defeated all hand-crafted feature extractors, OverFeat quickly introduced CNNs back into the object detection area as well.  The idea is very straightforward: if we can classify one image using a CNN, what about greedily scrolling through the whole image with windows of different sizes, and regressing and classifying them one by one using a CNN?  This leverages the power of CNNs for feature extraction and classification, and also bypasses the hard region proposal problem by using pre-defined sliding windows.  Also, since nearby convolution kernels can share part of the computation result, it is not necessary to recompute convolutions for the overlapping area, which reduces the cost considerably.  OverFeat is a pioneer among one-stage object detectors. It tried to combine feature extraction, location regression, and region classification in the same CNN.  Unfortunately, such a one-stage approach also suffers from relatively poorer accuracy because less prior knowledge is used.
  • 7. 2013: R-CNN  Also proposed in 2013, R-CNN is a bit late compared with OverFeat.  However, this region-based approach eventually led to a big wave of object detection research with its two-stage framework, i.e., a region proposal stage followed by a region classification and refinement stage.
  • 8.  R-CNN first extracts potential regions of interest from an input image by using a technique called selective search.  Selective search doesn’t really try to understand the foreground object; instead, it groups similar pixels by relying on a heuristic: similar pixels usually belong to the same object.  Therefore, the results of selective search have a very high probability of containing something meaningful.  Next, R-CNN warps these region proposals into fixed-size images with some padding, and feeds these images into the second stage of the network for more fine-grained recognition.  Unlike older methods using selective search, R-CNN replaced HOG with a CNN to extract features from all region proposals in its second stage.
  • 9.  Region proposals from selective search depend heavily on the similarity assumption, so they can only provide a rough estimate of location.  To further improve localization accuracy, R-CNN borrowed an idea from “Deep Neural Networks for Object Detection” (aka DetectorNet), and introduced an additional bounding box regression to predict the center coordinates, width and height of a box. This regressor is widely used in later object detectors.  However, a two-stage detector like R-CNN suffers from two big issues: 1) It’s not fully convolutional because selective search is not E2E trainable. 2) The region proposal stage is usually very slow compared with one-stage detectors like OverFeat, and running the CNN on each region proposal separately makes it even slower.  Later, we will see how R-CNN evolved over time to address these two issues.
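The bounding box regressor mentioned in slide 9 can be sketched with the commonly used center/size parameterization: center offsets are scaled by the proposal size and width/height are regressed in log space. Boxes are assumed to be `(cx, cy, w, h)` tuples, and `encode_box` and `decode_box` are hypothetical names.

```python
import math

def encode_box(proposal, target):
    """Regression targets that move `proposal` onto `target`."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = target
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))

def decode_box(proposal, deltas):
    """Apply predicted deltas to a proposal to get the refined box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = deltas
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))
```

Encoding a ground-truth box against a proposal and then decoding the result returns the original ground-truth box, which is what the regressor is trained to approximate.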
  • 11.  A quick follow-up for R-CNN is to reduce the duplicate convolution over multiple region proposals.  Since these region proposals all come from one image, it is natural to improve R-CNN by running the CNN over the entire image once and sharing the computation among many region proposals.  However, different region proposals have different sizes, which also results in different output feature map sizes if we are using the same CNN feature extractor.  These feature maps of various sizes prevent us from using fully connected layers for further classification and regression because an FC layer only works with a fixed-size input.
  • 12.  Fortunately, a paper called “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition” had already solved the dynamic scale issue for FC layers.  In SPPNet, a spatial pyramid pooling layer is introduced between the convolution layers and the FC layers to create a bag-of-words style feature vector.  This vector has a fixed size and encodes features from different scales, so our convolution layers can now take images of any size as input without worrying about the incompatibility of the FC layer.  Inspired by this, Fast R-CNN proposed a similar layer called the ROI Pooling layer.  This pooling layer downsamples feature maps of different sizes into a fixed-size vector. By doing so, we can now use the same FC layers for classification and box regression, no matter how large or small the ROI is.
  • 13.  With a shared feature extractor and the scale-invariant ROI pooling layer, Fast R-CNN can reach a similar localization accuracy with 10~20x faster training and 100~200x faster inference.  The near real-time inference and an easier E2E training protocol for the detection part make Fast R-CNN a popular choice in the industry as well.
  • 14.  This dense prediction over the entire image can cause trouble in computation cost, so YOLO took the bottleneck structure from GoogLeNet to avoid this issue.  Another problem of YOLO is that two objects might fall into the same coarse grid cell, so it doesn’t work well with small objects such as a flock of birds.  Despite lower accuracy, YOLO’s straightforward design and real-time inference ability made one-stage object detection popular again in research, and also a go-to solution for the industry.
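A minimal sketch of the ROI pooling idea from slide 12: whatever the size of the region, it is divided into a fixed number of bins and the maximum of each bin is kept. It assumes a single-channel feature map and an ROI already given in integer feature-map coordinates with an exclusive end; `roi_max_pool` is a hypothetical name.

```python
def roi_max_pool(feature, roi, out_h=2, out_w=2):
    """Max-pool the region roi = (y1, x1, y2, x2) into an out_h x out_w grid."""
    y1, x1, y2, x2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        ys = y1 + (i * h) // out_h
        ye = y1 + -(-((i + 1) * h) // out_h)  # ceiling division
        row = []
        for j in range(out_w):
            xs = x1 + (j * w) // out_w
            xe = x1 + -(-((j + 1) * w) // out_w)  # ceiling division
            row.append(max(feature[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled
```

A 4×4 region and a 3×3 region both come out as a 2×2 grid, so the same FC layers can consume either.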
  • 15. 2015: FASTER R-CNN  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks  As we introduced above, in early 2015, Ross Girshick proposed an improved version of R-CNN called Fast R-CNN by using a shared feature extractor for proposed regions.  Just a few months later, Ross and his team came back with another improvement.  This new network, Faster R-CNN, is not only faster than previous versions but also marks a milestone for object detection with deep learning.
  • 16.
  • 17.  With Fast R-CNN, the only non-convolutional piece of the network is the selective search region proposal.  As of 2015, researchers started to realize that a deep neural network can learn a remarkably wide range of tasks given enough data.  So, is it possible to also train a neural network to propose regions, instead of relying on a heuristic, hand-crafted approach like selective search?  Faster R-CNN followed this line of thinking, and successfully created the Region Proposal Network (RPN).  Simply put, RPN is a CNN that takes an image as input and outputs a set of rectangular object proposals, each with an objectness score.  The paper originally used VGG, but other backbone networks such as ResNet became more widespread later.  To generate region proposals, a 3×3 sliding window is applied over the CNN feature map output to generate 2 scores (foreground and background) and 4 coordinates at each location.  In practice, this sliding window is implemented with a 3×3 convolution kernel followed by 1×1 convolution kernels.
  • 18.  Although the sliding window has a fixed size, our objects may appear at different scales.  Therefore, Faster R-CNN introduced a technique called the anchor box.  Anchor boxes are pre-defined prior boxes with different aspect ratios and sizes that share the same central location.  In Faster R-CNN there are k=9 anchors for each sliding window location, covering 3 aspect ratios at each of 3 scales.  These repeated anchor boxes over different scales bring nice translation-invariance and scale-invariance properties to the network while sharing the outputs of the same feature map.  Note that the bounding box regression is computed relative to these anchor boxes instead of the whole image.
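The k=9 anchors at one sliding window location can be generated as in the sketch below. The scale and ratio values are the ones usually associated with Faster R-CNN and are assumptions here, as is the `(cx, cy, w, h)` box layout; `make_anchors` is a hypothetical name.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return one anchor per (scale, ratio) pair, all centred at (cx, cy).

    Each anchor keeps an area of scale**2 while its height/width ratio varies.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx, cy, w, h))
    return anchors
```

Calling `make_anchors(300, 200)` yields nine boxes that share the centre `(300, 200)` but differ in size and shape.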
  • 19.
  • 20.  So far, we discussed the new Region Proposal Network that replaces the old selective search region proposal.  To make the final detection, Faster R-CNN uses the same detection head as Fast R-CNN to do classification and fine-grained localization.  Fast R-CNN also uses a shared CNN feature extractor. Now that RPN itself is also a feature extraction CNN, we can just share it with the detection head like the diagram above.  This sharing design does bring some trouble, though. If we train RPN and the Fast R-CNN detector together, we will treat RPN proposals as a constant input of ROI pooling, and inevitably ignore the gradients of RPN’s bounding box proposals.  One workaround is called alternating training, where you train RPN and Fast R-CNN in turns.  And later, in the paper “Instance-aware semantic segmentation via multi-task network cascades”, we can see that the ROI pooling layer can also be made differentiable w.r.t. the proposed box coordinates.
  • 21.
  • 22. 2015: YOLO V1  You Only Look Once: Unified, Real-Time Object Detection  While the R-CNN series started a big hype over two-stage object detection in the research community, its complicated implementation brought many headaches for engineers who maintain it.  Does object detection need to be so cumbersome?  If we are willing to sacrifice a bit of accuracy, can we trade for much faster speed?  With these questions, Joseph Redmon submitted a network called YOLO to arxiv.org only four days after Faster R-CNN’s submission.  It finally brought popularity back to one-stage object detection two years after OverFeat’s debut.
  • 23.  Unlike R-CNN, YOLO decided to tackle region proposal and region classification together in the same CNN.  In other words, it treats object detection as a regression problem, instead of a classification problem relying on region proposals.  The general idea is to split the input into an SxS grid and have each cell directly regress the bounding box location and the confidence score if the object center falls into that cell.  Because objects may have different sizes, there will be more than one bounding box regressor per cell.  During training, the regressor with the highest IOU will be assigned to compare with the ground-truth label, so regressors at the same location will learn to handle different scales over time.  In the meantime, each cell will also predict C class probabilities, conditioned on the grid cell containing an object (high confidence score).  This approach was later described as dense prediction because YOLO tries to predict classes and bounding boxes for all possible locations in an image.
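The grid assignment in slide 23 can be sketched as follows. The defaults S=7, B=2 and C=20 are assumed to match the YOLO v1 setup on PASCAL VOC, and both function names are hypothetical.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the object centre."""
    col = min(int(cx / img_w * S), S - 1)  # clamp centres on the right edge
    row = min(int(cy / img_h * S), S - 1)  # clamp centres on the bottom edge
    return row, col

def output_size(S=7, B=2, C=20):
    """Number of predicted values: each cell has B boxes of
    (x, y, w, h, confidence) plus C class probabilities."""
    return S * S * (B * 5 + C)
```

With these defaults the network predicts a 7×7×30 tensor, and an object centred in the middle of a 448×448 image is assigned to cell (3, 3).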
  • 24. CNN MODEL THAT FORMS THE BACKBONE OF YOLO
  • 26.
  • 27.
  • 28. STEPS  1. YOLO cuts an image into squares.  This makes it easier for YOLO to find objects in the image, because each square is only responsible for objects centered in it.  2. For each square, YOLO guesses if there is an object in it and, if so, what kind of object it is.  It does this by using a deep learning model. The model has been trained on a lot of images and labels. This means that the model knows how to identify different types of objects in images.  3. YOLO gets rid of any extra guesses.  It does this by using a technique called non-maximum suppression. This removes lower-confidence guesses that overlap heavily with a higher-confidence guess. This makes sure that YOLO only outputs one guess for each object in the image.  4. YOLO outputs the remaining guesses as rectangles and object labels.  A rectangle is a box that surrounds an object in an image. An object label is a name for the type of object in the box.  This means that YOLO outputs a box and a name for each object that it finds in the image.
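Step 3 above, non-maximum suppression, can be sketched in a few lines. This is a greedy, single-class version assuming boxes are `(x1, y1, x2, y2)` tuples; the 0.5 threshold is only an illustrative default.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

Two heavily overlapping guesses for the same object collapse to the higher-scoring one, while a guess elsewhere in the image is left untouched.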
  • 30.
  • 31.
  • 34. CNN WITH TWO ANCHOR BOXES
  • 35. 2015: SSD  SSD: Single Shot MultiBox Detector  YOLO v1 demonstrated the potential of one-stage detection, but the performance gap from two-stage detection was still noticeable.  In YOLO v1, multiple objects could be assigned to the same grid cell.  This was a big challenge when detecting small objects, and became a critical problem to solve in order to bring a one-stage detector’s performance on par with two-stage detectors.  SSD is such a challenger and attacks this problem from three angles.
  • 36. KEY FEATURES OF SSD  Single Shot: Unlike some traditional object detection models that use a two-stage approach (first proposing regions of interest and then classifying those regions), SSD performs object detection in a single pass through the network. It directly predicts the presence of objects and their bounding box coordinates in a single shot, making it faster and more efficient.  MultiBox: SSD uses a set of default bounding boxes (anchor boxes) of different scales and aspect ratios at multiple locations in the input image. These default boxes serve as prior knowledge about where objects are likely to appear. SSD predicts adjustments to these default boxes to locate objects accurately.
  • 37. KEY FEATURES OF SSD  Multi-Scale Detection: SSD operates on multiple feature maps with different resolutions, allowing it to detect objects of various sizes. Predictions are made at different scales to capture objects at varying levels of granularity.  Class Scores: SSD not only predicts the bounding box coordinates but also assigns class scores to each default box, indicating the likelihood of an object belonging to a specific category (e.g., car, pedestrian, bicycle).
  • 38. KEY CONCEPTS OF SSD  Default Bounding Boxes (Anchor Boxes): SSD uses a predefined set of default bounding boxes, also known as anchor boxes. These boxes come in various scales and aspect ratios, providing prior knowledge about where objects are likely to be located in the image. SSD predicts adjustments to these default boxes to localize objects accurately.  Multi-Scale Feature Maps: SSD operates on multiple feature maps at different resolutions. These feature maps are obtained by applying convolutional layers to the input image at various stages. Using feature maps at several scales allows SSD to detect objects of different sizes.
  • 39.
  • 40. KEY CONCEPTS OF SSD  Multi-Scale Predictions: For each default bounding box, SSD makes predictions at multiple feature map layers with different resolutions. This enables the model to capture objects at various scales. These predictions include class scores for different object categories and offsets for adjusting the default boxes to match the objects’ positions.  Aspect Ratio Handling: SSD uses separate predictors (convolutional filters) for different aspect ratios of bounding boxes. This allows it to adapt to objects with varying shapes and aspect ratios.
  • 41.  Base Network (Truncated for Classification):  SSD begins with a standard CNN architecture, which is typically used for high-quality image classification tasks. However, in SSD, this base network is truncated before any classification layers. The base network is responsible for extracting essential features from the input image.
  • 42.  Multi-Scale Feature Maps: Additional convolutional layers are added to the truncated base network. These layers progressively reduce the spatial dimensions while increasing the number of channels (feature channels). This design allows SSD to produce feature maps at multiple scales. Each scale’s feature map is suitable for detecting objects of different sizes.  Default Bounding Boxes (Anchor Boxes): SSD associates a predefined set of default bounding boxes (anchor boxes) with each feature map cell. These default boxes have various scales and aspect ratios. The placement of default boxes relative to their corresponding cell is fixed and follows a convolutional grid pattern. For each feature map cell, SSD predicts the offsets necessary to adjust these default boxes to fit objects and the class scores indicating the presence of specific object categories.  Aspect Ratios and Multiple Feature Maps: SSD employs default boxes with different aspect ratios and uses them across multiple feature maps at various resolutions. This approach efficiently captures a range of possible object shapes and sizes. Unlike other models, SSD doesn’t rely on an intermediate fully connected layer for predictions but uses convolutional filters directly.
  • 43. GRID CELL  Instead of using a sliding window, SSD divides the image using a grid and has each grid cell be responsible for detecting objects in that region of the image. Detecting objects simply means predicting the class and location of an object within that region. If no object is present, we consider it as the background class and the location is ignored. For instance, we could use a 4x4 grid in the example below. Each grid cell is able to output the position and shape of the object it contains.
  • 44. ANCHOR BOX  Each grid cell in SSD can be assigned multiple anchor/prior boxes. These anchor boxes are pre-defined and each one is responsible for a size and shape within a grid cell. For example, the swimming pool in the image below corresponds to the taller anchor box while the building corresponds to the wider box.
  • 45.  SSD uses a matching phase while training, to match the appropriate anchor box with the bounding boxes of each ground truth object within an image.  Essentially, the anchor box with the highest degree of overlap with an object is responsible for predicting that object’s class and its location.  This property is used for training the network and for predicting the detected objects and their locations once the network has been trained. In practice, each anchor box is specified by an aspect ratio and a zoom level.
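The matching phase on this slide can be sketched in a few lines. This is a minimal illustration, not SSD's actual training code: the function names `iou` and `match_anchors`, the corner-format boxes `(x1, y1, x2, y2)`, and the toy coordinates are all assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_anchors(anchors, gt_boxes):
    """For each ground-truth box, return the index of the anchor with the highest IoU."""
    return [
        max(range(len(anchors)), key=lambda i: iou(anchors[i], gt))
        for gt in gt_boxes
    ]


# Hypothetical anchors in one grid cell: a tall one and a wide one.
anchors = [(0.0, 0.0, 2.0, 4.0), (0.0, 1.0, 4.0, 3.0)]
# Hypothetical ground truth: a tall object and a wide object.
gt_boxes = [(0.0, 0.0, 2.0, 3.5), (0.0, 1.0, 4.0, 2.8)]

matches = match_anchors(anchors, gt_boxes)
print(matches)  # the tall object goes to anchor 0, the wide one to anchor 1
```

The matched anchor then receives that object's class label and box offsets as its training target; anchors that match nothing are trained as background.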
  • 46. ASPECT RATIO  Not all objects are square in shape. Some are longer and some are wider, by varying degrees. The SSD architecture allows pre-defined aspect ratios of the anchor boxes to account for this. The ratios parameter can be used to specify the different aspect ratios of the anchor boxes associated with each grid cell at each zoom/scale level.  Zoom level  It is not necessary for the anchor boxes to have the same size as the grid cell. We might be interested in finding smaller or larger objects within a grid cell. The zooms parameter is used to specify how much the anchor boxes need to be scaled up or down with respect to each grid cell. Just like what we have seen in the anchor box example, the building is generally larger than the swimming pool.
  • 47. 2016: FPN  Feature Pyramid Networks for Object Detection  With the launch of Faster R-CNN, YOLO, and SSD in 2015, it seemed like the general structure of an object detector was determined.  Researchers started to look at improving each individual part of these networks.  Feature Pyramid Networks is an attempt to improve the detection head by using features from different layers to form a feature pyramid.  This feature pyramid idea isn’t very novel in computer vision research.  Back when features were still manually designed, the feature pyramid was already a very effective way to recognize patterns at different scales.  However, how to share the feature pyramid between the RPN and the region-based detector was still yet to be determined.
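To make the ratios and zooms parameters concrete, here is a rough sketch of how default boxes could be generated for one grid. The function name `default_boxes`, the normalized `(cx, cy, w, h)` format, and the specific zoom and ratio values are assumptions for illustration, not values taken from the SSD paper.

```python
import math


def default_boxes(grid_size, zooms, ratios):
    """Return (cx, cy, w, h) boxes in normalized [0, 1] image coordinates."""
    cell = 1.0 / grid_size
    boxes = []
    for row in range(grid_size):
        for col in range(grid_size):
            # Every box of a cell is centered on that cell.
            cx = (col + 0.5) * cell
            cy = (row + 0.5) * cell
            for zoom in zooms:
                for ratio in ratios:  # ratio = width / height
                    # Scaling by sqrt(ratio) keeps the box area fixed per zoom.
                    w = cell * zoom * math.sqrt(ratio)
                    h = cell * zoom / math.sqrt(ratio)
                    boxes.append((cx, cy, w, h))
    return boxes


boxes = default_boxes(grid_size=4, zooms=[0.7, 1.0, 1.3], ratios=[1.0, 2.0, 0.5])
print(len(boxes))  # 4 * 4 cells * 3 zooms * 3 ratios = 144
```

Each cell ends up with `len(zooms) * len(ratios)` boxes, so the network predicts that many sets of class scores and offsets per cell.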
  • 48.
  • 49.  First, to rebuild RPN with an FPN structure like the diagram above, we need to have a region proposal running on multiple different scales of feature output.  Also, we only need 3 anchors with different aspect ratios per location now because objects with different sizes will be handled by different levels of the feature pyramid.  Next, to use an FPN structure in the Fast R-CNN detector, we also need to adapt it to detect on multiple scales of feature maps.  Since region proposals might have different scales too, we should use them in the corresponding level of FPN as well.  In short, if Faster R-CNN is a pair of RPN and region-based detector running on one scale, FPN converts it into multiple parallel branches running on different scales and collects the final results from all branches in the end.
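One way to pick "the corresponding level of FPN" for a region proposal is the size-based rule from the FPN paper, k = floor(k0 + log2(sqrt(w * h) / 224)). The slide does not spell this rule out, so treat the sketch below as an assumption: the function name `fpn_level` is hypothetical, and the defaults (k0 = 4, canonical size 224, levels 2 to 5) follow the paper's commonly cited settings.

```python
import math


def fpn_level(width, height, k0=4, canonical_size=224, k_min=2, k_max=5):
    """Map a proposal of the given pixel size to a pyramid level index."""
    k = math.floor(k0 + math.log2(math.sqrt(width * height) / canonical_size))
    # Clamp so very small or very large proposals still land on a valid level.
    return max(k_min, min(k_max, k))


for size in (32, 112, 224, 900):
    print(size, fpn_level(size, size))
```

Small proposals are routed to the high-resolution levels and large proposals to the coarse ones, which is why a single anchor scale per level is enough.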
  • 50. 2016: YOLO V2  The initial version of YOLO suffers from many shortcomings: predictions based on a coarse grid brought lower localization accuracy, and two scale-agnostic regressors per grid cell made it difficult to recognize small, densely packed objects.  YOLO v2 added Batch Normalization layers from a paper called “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”.
  • 51.  Just like SSD, YOLO v2 also introduced Faster R-CNN’s idea of anchor boxes for bounding box regression.  Also, anchor sizes are determined by a K-means clustering of the target dataset to better align with object shapes.  A new backbone network called Darknet is used for feature extraction. This is inspired by “Network in Network” and GoogLeNet’s bottleneck structure.  To improve the detection of small objects, YOLO v2 added a passthrough layer to merge features from an early layer. This part can be seen as a simplified version of SSD.  YOLO v2 also experimented with a version that’s trained on a hierarchical dataset of 9000 classes, which also represents an early trial of multi-label classification in an object detector.
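The K-means step for anchor sizes can be sketched as follows. The slide only says K-means is used; choosing the centroid with the highest IoU (equivalently, the distance 1 - IoU) follows the YOLO v2 paper, while the function names, the naive "first k boxes" initialization, and the toy box sizes are assumptions for illustration.

```python
def wh_iou(a, b):
    """IoU of two (width, height) boxes that share the same center."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=20):
    """Cluster (width, height) pairs; nearest centroid = highest IoU."""
    centroids = list(boxes[:k])  # naive deterministic init, for illustration only
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        # Move each centroid to the mean size of its cluster (keep it if empty).
        centroids = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids


# Hypothetical ground-truth box sizes in pixels.
gt_sizes = [(10, 20), (100, 50), (12, 22), (90, 60), (11, 18), (110, 55)]
anchor_sizes = kmeans_anchors(gt_sizes, k=2)
print(anchor_sizes)
```

Using IoU instead of Euclidean distance keeps large boxes from dominating the clustering error.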
  • 52. 2017: RETINANET  To understand why one-stage detectors are usually not as good as two-stage detectors, RetinaNet investigated the foreground-background class imbalance issue from a one-stage detector’s dense predictions.  RetinaNet invented a new loss function called Focal Loss to help the network learn what’s important.  Focal Loss adds a modulating factor raised to the power γ (they call it the focusing parameter) to the Cross-Entropy loss. As the confidence score for the correct class becomes higher, the loss value becomes much lower than the normal Cross-Entropy.  RetinaNet is composed of a ResNet backbone, an FPN detection neck to channel features at different scales, and two subnets for classification and box regression as the detection head.  Similar to SSD and YOLO v2, RetinaNet uses anchor boxes to cover targets of various scales and aspect ratios.
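The effect of the focusing parameter on this slide can be checked numerically. The sketch below assumes the commonly cited form FL(p_t) = -(1 - p_t)^γ · log(p_t), where p_t is the predicted probability of the correct class, and omits the optional α class-balancing term; the function names are made up for the example.

```python
import math


def cross_entropy(p_t):
    """Standard cross-entropy for the probability assigned to the true class."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t)^gamma, so easy examples count less."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


for p_t in (0.1, 0.5, 0.9, 0.99):
    print(p_t, round(cross_entropy(p_t), 4), round(focal_loss(p_t), 6))
```

With γ = 0 the two losses coincide; with γ = 2 a well-classified example at p_t = 0.9 contributes only about 1% of its cross-entropy loss, which keeps the huge number of easy background anchors from swamping the gradient.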
  • 53. 2018: YOLO V3  YOLOv3: An Incremental Improvement  Following YOLO v2’s tradition, YOLO v3 borrowed more ideas from previous research and got an incredibly powerful one-stage detector.  YOLO v3 balanced the speed, accuracy, and implementation complexity pretty well.  And it got really popular in the industry because of its fast speed and simple components.
  • 54.  Simply put, YOLO v3’s success comes from its more powerful backbone feature extractor and a RetinaNet-like detection head with an FPN neck.  The new backbone network Darknet-53 leveraged ResNet’s skip connections to achieve an accuracy that’s on par with ResNet-50 but much faster.  Also, YOLO v3 ditched v2’s passthrough layer and fully embraced FPN’s multi-scale prediction design.  With this, YOLO v3 finally reversed people’s impression of YOLO’s poor performance when dealing with small objects.
  • 55. 2019: OBJECTS AS POINTS  Although the image classification area has become less active recently, object detection research is still far from mature.  In 2018, a paper called “CornerNet: Detecting Objects as Paired Keypoints” provided a new perspective for detector training.  Since preparing anchor box targets is quite a cumbersome job, is it really necessary to use them as a prior?  This new trend of ditching anchor boxes is called “anchor-free” object detection.
  • 56.  Inspired by the use of heat-map in the Hourglass network for human pose estimation, CornerNet uses a heat-map generated by box corners to supervise the bounding box regression.
  • 57.  Objects As Points, aka CenterNet, took a step further. It uses heat-map peaks to represent object centers, and the network will regress the box width and height directly from these box centers.  Essentially, CenterNet is using every pixel as a grid cell. With a Gaussian distributed heat-map, the training is also easier to converge compared with previous attempts which tried to regress bounding box size directly.  The elimination of anchor boxes also has another useful side effect. Previously, we relied on IOU (such as > 0.7) between the anchor box and the ground truth box to assign training targets.  By doing so, a few neighboring anchors may all be assigned a positive target for the same object. And the network will learn to predict multiple positive boxes for the same object too.  The common way to fix this issue is to use a technique called Non-maximum Suppression (NMS). It’s a greedy algorithm to filter out boxes that are too close together.  Now that anchors are gone and we only have one peak per object in the heat-map, there’s no need to use NMS any more.  Since NMS is sometimes hard to implement and slow to run, getting rid of NMS is a big benefit for the applications that run in various environments with limited resources.
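For reference, the greedy NMS step that CenterNet avoids can be sketched as below. This is a plain-Python illustration rather than an optimized implementation; the function names, the corner-format boxes, and the 0.5 threshold are assumptions for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes on one object plus one separate box (made-up values).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # the second box is suppressed by the first
```

The pairwise IoU checks are what make NMS costly when a detector emits thousands of candidate boxes.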
  • 58. 2019: EFFICIENTDET  EfficientDet: Scalable and Efficient Object Detection
  • 59.  EfficientDet showed us some more exciting development in the object detection area.  The FPN structure has proven to be a powerful technique to improve the detection network’s performance for objects at different scales.  Famous detection networks such as RetinaNet and YOLO v3 all adopted an FPN neck before box regression and classification.  Later, NAS-FPN and PANet both demonstrated that a plain multi-layer FPN structure may benefit from more design optimization.  EfficientDet continued exploring in this direction, and eventually created a new neck called BiFPN.  Basically, BiFPN features additional cross-layer connections to encourage feature aggregation back and forth.  To justify the efficiency part of the network, this BiFPN also removed some less useful connections from the original PANet design.  Another innovative improvement over the FPN structure is the weighted feature fusion. BiFPN adds learnable weights to feature aggregation so that the network can learn the importance of different branches.
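The weighted feature fusion can be illustrated with the "fast normalized fusion" variant described in the EfficientDet paper: each input gets a non-negative weight, and the weights are normalized by their sum. The slide does not name a specific variant, so this choice, the function name `fuse_features`, and the flat-list feature format are assumptions; in a real network the weights would be learned parameters and the features would be tensors of matching shape.

```python
def fuse_features(features, raw_weights, eps=1e-4):
    """Fuse same-sized feature vectors with normalized non-negative weights."""
    # ReLU keeps each weight non-negative before normalization.
    weights = [max(0.0, w) for w in raw_weights]
    total = sum(weights) + eps  # eps avoids division by zero
    length = len(features[0])
    return [
        sum(w * f[i] for w, f in zip(weights, features)) / total
        for i in range(length)
    ]


# Two hypothetical input branches, already resized to the same shape.
top_down = [1.0, 2.0]
lateral = [3.0, 4.0]
fused = fuse_features([top_down, lateral], raw_weights=[1.0, 3.0])
print(fused)  # close to [2.5, 3.5]: the lateral branch counts three times as much
```

Compared with a softmax over the weights, this normalization avoids the exponentials while still letting the network express which branch matters more.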
  • 60. MORE LESS FAMOUS MODELS…  2009: DPM  Object Detection with Discriminatively Trained Part Based Models  By matching many HOG features for each deformable part, DPM was one of the most efficient object detection models before the deep learning era. Taking pedestrian detection as an example, it uses a star structure to recognize the general person pattern first, and then recognizes parts with different sub-filters and calculates an overall score. Even today, the idea of recognizing objects with deformable parts is still popular after we switched from HOG features to CNN features.  2012: Selective Search  Selective Search for Object Recognition  Like DPM, Selective Search is also not a product of the deep learning era. However, this method combined many classical computer vision approaches, and was also used in the early R-CNN detector. The core idea of selective search is inspired by semantic segmentation where pixels are grouped by similarity. Selective Search uses different criteria of similarity such as color space and SIFT-based texture to iteratively merge similar areas together. These merged areas served as foreground predictions and were followed by an SVM classifier for object recognition.  2016: R-FCN  R-FCN: Object Detection via Region-based Fully Convolutional Networks  Faster R-CNN finally combined RPN and ROI feature extraction and improved the speed a lot. However, for each region proposal, we still need fully connected layers to compute class and bounding box separately. If we have 300 ROIs, we need to repeat this 300 times, and this is also the origin of the major speed difference between one-stage and two-stage detectors. R-FCN borrowed the idea from FCN for semantic segmentation, but instead of computing the class mask, R-FCN computes position-sensitive score maps. These maps predict the probability of the appearance of the object at each location, and all locations will vote (average) to decide the final class and bounding box. Besides, R-FCN also used atrous convolution in its ResNet backbone, which was originally proposed in the DeepLab semantic segmentation network. To understand what atrous convolution is, please see my previous article “Witnessing the Progression in Semantic Segmentation: DeepLab Series from V1 to V3+”.
  • 61.  2017: Soft-NMS  Improving Object Detection With One Line of Code  Non-maximum suppression (NMS) is widely used in anchor-based object detection networks to reduce duplicate positive proposals that are close by. More specifically, NMS iteratively eliminates candidate boxes if they have a high IOU with a more confident candidate box. This could lead to some unexpected behavior when two objects of the same class are indeed very close to each other. Soft-NMS made a small change: instead of eliminating the overlapping candidate boxes, it only scales down their confidence scores with a parameter. This scaling parameter gives us more control when tuning the localization performance, and also leads to better precision when a high recall is needed.  2017: Cascade R-CNN  Cascade R-CNN: Delving into High Quality Object Detection  While FPN explored how to design a better R-CNN neck to use backbone features, Cascade R-CNN investigated a redesign of the R-CNN classification and regression head. The underlying assumption is simple yet insightful: the higher the IOU criterion we use when preparing positive targets, the fewer false positive predictions the network will learn to make. However, we can’t simply increase the IOU threshold from the commonly used 0.5 to a more aggressive 0.7, because it could also lead to more overwhelming negative examples during training. Cascade R-CNN’s solution is to chain multiple detection heads together, each relying on the bounding box proposals from the previous detection head. Only the first detection head uses the original RPN proposals. This effectively simulates an increasing IOU threshold for later heads.
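The Soft-NMS rescoring described on this slide can be sketched as follows. The Gaussian decay `score *= exp(-iou² / sigma)` is one of the two variants in the Soft-NMS paper; picking it here, along with the function names, `sigma=0.5`, and the score cutoff, is an assumption for illustration.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Return (kept indices in selection order, rescored confidences)."""
    scores = list(scores)  # work on a copy
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_threshold:
            break  # everything left scores even lower
        keep.append(best)
        for i in remaining:
            # Decay the score instead of discarding the overlapping box.
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
    return keep, scores


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept, new_scores = soft_nms(boxes, scores)
print(kept, new_scores)
```

Unlike hard NMS, the overlapping second box survives with a reduced score, so a genuinely separate nearby object of the same class can still be recovered by lowering the final confidence cutoff.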
  • 62.  2017: Mask R-CNN  Mask R-CNN  Mask R-CNN is not a typical object detection network. It was designed to solve a challenging instance segmentation task, i.e., creating a mask for each object in the scene. However, Mask R-CNN showed a great extension to the Faster R-CNN framework, and also in turn inspired object detection research. The main idea is to add a binary mask prediction branch after ROI pooling along with the existing bounding box and classification branches. Besides, to address the quantization error from the original ROI Pooling layer, Mask R-CNN also proposed a new ROI Align layer that uses bilinear image resampling under the hood. Unsurprisingly, both multi-task training (segmentation + detection) and the new ROI Align layer contribute to some improvement over the bounding box benchmark.  2018: PANet  Path Aggregation Network for Instance Segmentation  Instance segmentation has a close relationship with object detection, so often a new instance segmentation network could also benefit object detection research indirectly. PANet aims at boosting information flow in the FPN neck of Mask R-CNN by adding an additional bottom-up path after the original top-down path. To visualize this change, we have a ↑↓ structure in the original FPN neck, and PANet makes it more like a ↑↓↑ structure before pooling features from multiple layers. Also, instead of having separate pooling for each feature layer, PANet added an “adaptive feature pooling” layer after Mask R-CNN’s ROIAlign to merge (element-wise max or sum) multi-scale features.  2019: NAS-FPN  NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection  PANet’s success in adapting the FPN structure drew attention from a group of NAS researchers. They used a similar reinforcement learning method from the image classification network NASNet and focused on searching for the best combination of multiple merging cells. Here, a merging cell is the basic building block of an FPN that merges any two input feature layers into one output feature layer. The final results proved the idea that FPN could use further optimization, but the complex computer-searched structure made it too difficult for humans to understand.