Object Detection is a very powerful field.pptx

 Before the deep learning era, hand-crafted features like
HOG and feature pyramids were used pervasively to
capture localization signals in an image.
 However, those methods usually don’t extend well to generic
object detection, so most of the applications were
limited to face or pedestrian detection.
 With the power of deep learning, we can train a network
to learn which features to capture, as well as what
coordinates to predict for an object.

2013: OVERFEAT
 OverFeat: Integrated Recognition, Localization and
Detection using Convolutional Networks

 Inspired by the early success of AlexNet in the 2012 ImageNet competition, where
CNN-based feature extraction defeated all hand-crafted feature extractors,
OverFeat quickly brought CNNs into the object detection area as well.
 The idea is very straightforward: if we can classify one image using a CNN, what
about greedily scrolling through the whole image with windows of different sizes,
and trying to regress and classify them one by one using a CNN?
 This leverages the power of CNNs for feature extraction and classification, and also
bypasses the hard region proposal problem by using pre-defined sliding windows.
 Also, since neighboring windows overlap, their convolution results can be shared, so it
is not necessary to recompute convolutions for the overlapping area, which reduces
the cost considerably.
 OverFeat is a pioneer of the one-stage object detector. It tried to combine feature
extraction, location regression, and region classification in the same CNN.
 Unfortunately, such a one-stage approach also suffers from relatively poorer
accuracy because it uses less prior knowledge.
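
The multi-scale sliding-window idea above can be sketched in a few lines. This is only an illustration of how the candidate windows are enumerated, not OverFeat's actual implementation (which shares convolution results between windows); the function name, window sizes, and stride are assumptions made for the example.

```python
def sliding_windows(img_w, img_h, window_sizes, stride):
    """Yield square windows (x1, y1, x2, y2) for every size and position."""
    for size in window_sizes:
        # Slide the window over the image; positions that would run past
        # the image border are skipped.
        for y in range(0, img_h - size + 1, stride):
            for x in range(0, img_w - size + 1, stride):
                yield (x, y, x + size, y + size)


# Hypothetical 64x64 image scanned with two window sizes.
boxes = list(sliding_windows(64, 64, window_sizes=[32, 64], stride=16))
print(len(boxes))  # 9 windows of size 32 plus 1 window of size 64
```

Each window would then be classified and regressed by the CNN, which is why the number of windows directly drives the cost.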

2013: R-CNN
 Also proposed in 2013, R-CNN came slightly later than OverFeat.
 However, this region-based approach eventually led to a big wave of
object detection research with its two-stage framework, i.e., a region
proposal stage followed by a region classification and refinement stage.

 R-CNN first extracts potential regions of interest from an input
image by using a technique called selective search.
 Selective search doesn’t really try to understand the
foreground object. Instead, it groups similar pixels by relying
on a heuristic: similar pixels usually belong to the same object.
 Therefore, the results of selective search have a very high
probability of containing something meaningful.
 Next, R-CNN warps these region proposals into fixed-size
images with some padding, and feeds these images into the
second stage of the network for more fine-grained recognition.
 Unlike earlier methods that used selective search, R-CNN
replaced HOG with a CNN to extract features from all region
proposals in its second stage.

 Region proposals from selective search depend heavily on the
similarity assumption, so they can only provide a rough estimate
of location.
 To further improve localization accuracy, R-CNN borrowed an
idea from “Deep Neural Networks for Object Detection” (aka
DetectorNet), and introduced an additional bounding box
regression to predict the center coordinates, width and height
of a box. This regressor was widely adopted by later object
detectors.
 However, a two-stage detector like R-CNN suffers from two big
issues: 1) It is not end-to-end (E2E) trainable because selective search
is a separate, non-learned step. 2) The region proposal stage is usually very
slow compared with one-stage detectors like OverFeat,
and running the CNN on each region proposal separately makes it even
slower.
 Later, we will see how R-CNN evolved over time to address
these two issues.
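
The bounding box regression mentioned above predicts offsets relative to a proposal rather than absolute coordinates. A minimal sketch of the commonly used parameterization (center shifts scaled by the proposal size, log-scaled width and height) is shown below; the function names `encode` and `decode` and the example boxes are assumptions made for illustration.

```python
import math


def encode(proposal, gt):
    """Regression targets (tx, ty, tw, th) for boxes given as (cx, cy, w, h)."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))


def decode(proposal, t):
    """Apply predicted offsets t = (tx, ty, tw, th) to a proposal box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return (px + tx * pw, py + ty * ph,
            pw * math.exp(tw), ph * math.exp(th))


proposal = (50.0, 50.0, 20.0, 40.0)      # hypothetical region proposal
ground_truth = (55.0, 48.0, 30.0, 36.0)  # hypothetical labeled box
targets = encode(proposal, ground_truth)
print(targets)
print(decode(proposal, targets))  # recovers the ground-truth box
```

Normalizing by the proposal size keeps the targets in a similar numeric range for small and large boxes.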

 A quick follow-up to R-CNN is to reduce the duplicate
convolution over multiple region proposals.
 Since these region proposals all come from one image,
it’s natural to improve R-CNN by running the CNN over the
entire image once and sharing the computation among
many region proposals.
 However, different region proposals have different sizes,
which also results in different output feature map sizes if
we are using the same CNN feature extractor.
 These feature maps with various sizes prevent us
from using fully connected (FC) layers for further classification
and regression because an FC layer only works with a
fixed-size input.

 Fortunately, a paper called “Spatial Pyramid Pooling in Deep
Convolutional Networks for Visual Recognition” had already solved
the dynamic scale issue for FC layers.
 In SPPNet, a spatial pyramid pooling layer is introduced between the
convolution layers and FC layers to create a bag-of-words style
feature vector.
 This vector has a fixed size and encodes features from different
scales, so our convolution layers can now take images of any size as
input without worrying about the incompatibility of the FC layer.
 Inspired by this, Fast R-CNN proposed a similar layer called the ROI
Pooling layer.
 This pooling layer downsamples feature maps with different sizes into
a fixed-size vector. By doing so, we can now use the same FC layers
for classification and box regression, no matter how large or small the
ROI is.
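
To make the ROI pooling idea concrete, here is a minimal pure-Python sketch that max-pools an arbitrary rectangular region of a single-channel feature map into a fixed output grid. The function name and the bin-boundary rounding are assumptions for illustration; real implementations work on multi-channel tensors and may round differently.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool region roi=(x1, y1, x2, y2) of a 2D list into out_h x out_w.

    The ROI is given in feature-map cells, with x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Floor the start and ceil the end so every bin has at least one cell.
        ys = y1 + (i * roi_h) // out_h
        ye = y1 + ((i + 1) * roi_h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            xs = x1 + (j * roi_w) // out_w
            xe = x1 + ((j + 1) * roi_w + out_w - 1) // out_w
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Hypothetical 4x6 feature map whose value at (y, x) is y * 6 + x.
fmap = [[y * 6 + x for x in range(6)] for y in range(4)]
print(roi_max_pool(fmap, (0, 0, 6, 4), 2, 2))  # [[8, 11], [20, 23]]
print(roi_max_pool(fmap, (1, 1, 4, 3), 2, 2))  # [[8, 9], [14, 15]]
```

Both ROIs have different sizes, yet both produce a 2x2 output, which is what lets the same FC layers follow.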

 With a shared feature extractor and the scale-invariant ROI pooling
layer, Fast R-CNN can reach a similar localization accuracy while offering
10~20x faster training and 100~200x faster inference.
 The near real-time inference and an easier E2E training protocol for the
detection part made Fast R-CNN a popular choice in the industry as
well.

 YOLO’s dense prediction over the entire image can be
costly to compute, so YOLO borrowed the bottleneck
structure from GoogLeNet to reduce this cost.
 Another problem of YOLO is that two objects might fall into
the same coarse grid cell, so it doesn’t work well with small,
clustered objects such as a flock of birds.
 Despite lower accuracy, YOLO’s straightforward design
and real-time inference ability made one-stage object
detection popular again in research, and also a go-to
solution for the industry.

2015: FASTER R-CNN
 Faster R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks
 As we introduced above, in early 2015, Ross Girshick proposed
an improved version of R-CNN called Fast R-CNN by using a
shared feature extractor for proposed regions.
 Just a few months later, Ross and his team came back with
another improvement.
 This new network, Faster R-CNN, is not only faster than previous
versions but also marks a milestone for deep-learning-based
object detection.

 With Fast R-CNN, the only non-convolutional piece of the network is the selective
search region proposal.
 As of 2015, researchers had started to realize that deep neural networks are so
powerful that they can learn almost anything given enough data.
 So, is it possible to also train a neural network to propose regions, instead of
relying on a heuristic, hand-crafted approach like selective search?
 Faster R-CNN followed this line of thinking and successfully created the
Region Proposal Network (RPN).
 Simply put, RPN is a CNN that takes an image as input and outputs a set of
rectangular object proposals, each with an objectness score.
 The paper originally used VGG, but other backbone networks such as ResNet
became more widespread later.
 To generate region proposals, a 3×3 sliding window is applied over the CNN
feature map output to generate 2 scores (foreground and background) and 4
coordinates per candidate box at each location.
 In practice, this sliding window is implemented as a 3×3 convolution followed by
1×1 convolutions.

 Although the sliding window has a fixed size, our objects may
appear at different scales.
 Therefore, Faster R-CNN introduced a technique called anchor
boxes.
 Anchor boxes are pre-defined prior boxes with different aspect
ratios and sizes that share the same central location.
 In Faster R-CNN there are k=9 anchors for each sliding window
location, which cover 3 scales with 3 aspect ratios each.
 These repeated anchor boxes over different scales bring nice
translation-invariance and scale-invariance properties to the
network while sharing outputs of the same feature map.
 Note that the bounding box regression is computed relative to
these anchor boxes instead of the whole image.
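
The k=9 anchors per location can be generated as in the sketch below. The default scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) follow the settings reported in the Faster R-CNN paper; the function name and the corner-format output are assumptions made for this example.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors (x1, y1, x2, y2) centered at (cx, cy).

    Each anchor has area scale**2 and height/width equal to ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(300, 300)
print(len(anchors))  # 9
print(anchors[1])    # the square 128x128 anchor: (236.0, 236.0, 364.0, 364.0)
```

The same nine shapes are reused at every sliding-window position, so only the center changes from location to location.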

 So far, we have discussed the new Region Proposal Network that replaces the
old selective search region proposal.
 To make the final detection, Faster R-CNN uses the same detection
head as Fast R-CNN for classification and fine-grained
localization.
 Fast R-CNN also uses a shared CNN feature extractor. Now that RPN
itself is also a feature extraction CNN, we can simply share it with the
detection head, as in the diagram above.
 This sharing design does bring some trouble, though. If we train RPN
and the Fast R-CNN detector together, we treat RPN proposals as a
constant input of ROI pooling, and inevitably ignore the gradients with respect to
RPN’s bounding box proposals.
 One workaround is called alternating training, where you train RPN and
Fast R-CNN in turns.
 Later, the paper “Instance-aware Semantic Segmentation via Multi-task
Network Cascades” showed that the ROI pooling layer can
also be made differentiable w.r.t. the proposed box coordinates.

2015: YOLO V1
 You Only Look Once: Unified, Real-Time Object Detection
 While the R-CNN series started a big hype over two-stage
object detection in the research community, its complicated
implementation brought many headaches for engineers who
maintain it.
 Does object detection need to be so cumbersome?
 If we are willing to sacrifice a bit of accuracy, can we trade for
much faster speed?
 With these questions, Joseph Redmon submitted a network
called YOLO to arxiv.org only four days after Faster R-CNN’s
submission.
 It finally brought popularity back to one-stage object detection
two years after OverFeat’s debut.

 Unlike R-CNN, YOLO decided to tackle region proposal and region
classification together in the same CNN.
 In other words, it treats object detection as a regression problem, instead
of a classification problem relying on region proposals.
 The general idea is to split the input into an SxS grid and have each cell
directly regress the bounding box location and the confidence score if the
object center falls into that cell.
 Because objects may have different sizes, there is more than one
bounding box regressor per cell.
 During training, the regressor whose prediction has the highest IOU with the
ground-truth box is made responsible for that object, so regressors at the same location
learn to handle different scales over time.
 In the meantime, each cell will also predict C class probabilities,
conditioned on the grid cell containing an object (high confidence score).
 This approach is later described as dense predictions because YOLO tried
to predict classes and bounding boxes for all possible locations in an
image.
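
A small sketch may help to picture the grid formulation above. With B box regressors per cell (each predicting 4 coordinates plus 1 confidence score) and C class probabilities, the output has shape S x S x (B*5 + C). The defaults S=7, B=2, C=20 are the PASCAL VOC settings from the YOLO v1 paper; the function names are assumptions for illustration.

```python
def yolo_output_shape(S=7, B=2, C=20):
    """Shape of the dense prediction tensor: one vector per grid cell."""
    return (S, S, B * 5 + C)


def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Grid cell (row, col) that contains the object center (cx, cy) in pixels."""
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


print(yolo_output_shape())                   # (7, 7, 30)
print(responsible_cell(224, 100, 448, 448))  # (1, 3)
```

Only the cell returned by `responsible_cell` is trained to predict that object, which is also why two object centers falling into one cell are hard to separate.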

CNN MODEL THAT FORMS THE BACKBONE OF
YOLO

STEPS
 1. YOLO cuts an image into squares.
 This makes it easier for YOLO to find objects in the image. It only needs
to look at one square at a time, instead of the entire image.
 2. For each square, YOLO guesses if there is an object in it and, if
so, what kind of object it is.
 It does this by using a deep learning model. The model has been trained
on a lot of images and labels. This means that the model knows how to
identify different types of objects in images.
 3. YOLO gets rid of any extra guesses.
 It does this by using a technique called non-maximum suppression. This
removes any guess that overlaps heavily with a more confident guess. This
makes sure that YOLO only outputs one guess for each object in the
image.
 4. YOLO outputs the remaining guesses as rectangles and object
labels.
 A rectangle is a box that surrounds an object in an image. An object label
is a name for the type of object in the box.
 In other words, YOLO outputs a box and a name for each object that it
finds in the image.
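
Step 3 above (non-maximum suppression) can be sketched as a short greedy loop. This is a minimal illustration in plain Python, assuming boxes in (x1, y1, x2, y2) format and an IOU threshold of 0.5; the function names and example boxes are made up for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes that survive, most confident first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box duplicates the first
```

In a multi-class detector this is usually run separately for each class.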

2015: SSD
 SSD: Single Shot MultiBox Detector
 YOLO v1 demonstrated the potential of one-stage detection, but the
performance gap from two-stage detection was still noticeable.
 In YOLO v1, multiple objects could be assigned to the same grid
cell.
 This was a big challenge when detecting small objects, and became
a critical problem to solve in order to bring a one-stage detector’s
performance on par with two-stage detectors.
 SSD is such a challenger and attacks this problem from three angles.

KEY FEATURES OF SSD
 Single Shot: Unlike some traditional object detection
models that use a two-stage approach (first proposing
regions of interest and then classifying those regions),
SSD performs object detection in a single pass through
the network. It directly predicts the presence of objects
and their bounding box coordinates in a single shot,
making it faster and more efficient.
 MultiBox: SSD uses a set of default bounding boxes
(anchor boxes) of different scales and aspect ratios at
multiple locations in the input image. These default
boxes serve as prior knowledge about where objects are
likely to appear. SSD predicts adjustments to these
default boxes to locate objects accurately.

 Multi-Scale Detection: SSD operates on multiple
feature maps with different resolutions, allowing it to
detect objects of various sizes. Predictions are
made at different scales to capture objects at
varying levels of granularity.
 Class Scores: SSD not only predicts the bounding
box coordinates but also assigns class scores to
each default box, indicating the likelihood of an
object belonging to a specific category (e.g., car,
pedestrian, bicycle).

KEY CONCEPTS OF SSD
 Default Bounding Boxes (Anchor Boxes): SSD
uses a predefined set of default bounding boxes,
also known as anchor boxes. These boxes come in
various scales and aspect ratios, providing prior
knowledge about where objects are likely to be
located in the image. SSD predicts adjustments to
these default boxes to localize objects accurately.
 Multi-Scale Feature Maps: SSD operates on
multiple feature maps at different resolutions.
These feature maps are obtained by applying
convolutional layers to the input image at various
stages. Using feature maps at several scales
allows SSD to detect objects of different sizes.

 Multi-Scale Predictions: SSD makes predictions for the
default bounding boxes at multiple
feature map layers with different resolutions. This
enables the model to capture objects at various
scales. These predictions include class scores for
different object categories and offsets for adjusting
the default boxes to match the objects’ positions.
 Aspect Ratio Handling: SSD uses separate
predictors (convolutional filters) for different aspect
ratios of bounding boxes. This allows it to adapt to
objects with varying shapes and aspect ratios.

 Base Network (Truncated for Classification):
 SSD begins with a standard CNN architecture, which is
typically used for high-quality image classification tasks.
However, in SSD, this base network is truncated before
any classification layers. The base network is
responsible for extracting essential features from the
input image.

 Multi-Scale Feature Maps: Additional convolutional layers are added to
the truncated base network. These layers progressively reduce the
spatial dimensions while increasing the number of channels (feature
channels). This design allows SSD to produce feature maps at multiple
scales. Each scale’s feature map is suitable for detecting objects of
different sizes.
 Default Bounding Boxes (Anchor Boxes): SSD associates a
predefined set of default bounding boxes (anchor boxes) with each
feature map cell. These default boxes have various scales and aspect
ratios. The placement of default boxes relative to their corresponding cell
is fixed and follows a convolutional grid pattern. For each feature map
cell, SSD predicts the offsets necessary to adjust these default boxes to
fit objects and the class scores indicating the presence of specific object
categories.
 Aspect Ratios and Multiple Feature Maps: SSD employs default boxes
with different aspect ratios and uses them across multiple feature maps
at various resolutions. This approach efficiently captures a range of
possible object shapes and sizes. Unlike other models, SSD doesn’t rely
on an intermediate fully connected layer for predictions but uses
convolutional filters directly.

GRID CELL
 Instead of using a sliding window, SSD divides the image using
a grid and has each grid cell be responsible for detecting
objects in that region of the image. Detecting objects simply
means predicting the class and location of an object within that
region. If no object is present, we consider it as the
background class and the location is ignored. For instance, we
could use a 4x4 grid in the example below. Each grid cell is
able to output the position and shape of the object it contains.

ANCHOR BOX
 Each grid cell in SSD can be assigned multiple
anchor/prior boxes. These anchor boxes are pre-defined
and each one is responsible for a size and shape within a
grid cell. For example, the swimming pool in the image
below corresponds to the taller anchor box while the
building corresponds to the wider box.

 SSD uses a matching phase while training, to
match the appropriate anchor box with the
bounding boxes of each ground truth object within
an image.
 Essentially, the anchor box with the highest degree
of overlap with an object is responsible for
predicting that object’s class and its location.
 This property is used for training the network and
for predicting the detected objects and their
locations once the network has been trained. In
practice, each anchor box is specified by an aspect
ratio and a zoom level.

ASPECT RATIO
 Not all objects are square in shape. Some are longer and
some are wider, by varying degrees. The SSD architecture
allows pre-defined aspect ratios of the anchor boxes to
account for this. The ratios parameter can be used to specify
the different aspect ratios of the anchor boxes associated with
each grid cell at each zoom/scale level.

ZOOM LEVEL
 It is not necessary for the anchor boxes to have the same size
as the grid cell. We might be interested in finding smaller or
larger objects within a grid cell. The zooms parameter is used
to specify how much the anchor boxes need to be scaled up
or down with respect to each grid cell. Just like what we have
seen in the anchor box example, the building is
generally larger than the swimming pool.
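
The interplay between grid cells, aspect ratios, and zoom levels can be sketched as follows. This is a hypothetical helper, not the API of any particular SSD library: the function name, the (width factor, height factor) convention for ratios, and the default values are all assumptions for illustration.

```python
def cell_anchors(row, col, grid_size,
                 zooms=(0.7, 1.0, 1.3),
                 ratios=((1.0, 1.0), (1.0, 0.5), (0.5, 1.0))):
    """Anchors (cx, cy, w, h) in normalized [0, 1] coordinates for one cell.

    Each ratio is a (width factor, height factor) pair applied to the
    cell size, and each zoom scales the whole box up or down.
    """
    cell = 1.0 / grid_size
    cx, cy = (col + 0.5) * cell, (row + 0.5) * cell
    return [(cx, cy, cell * z * rw, cell * z * rh)
            for z in zooms
            for (rw, rh) in ratios]


anchors = cell_anchors(0, 0, grid_size=4)
print(len(anchors))  # 3 zooms x 3 ratios = 9 anchors for this cell
print(anchors[3])    # zoom 1.0, square ratio: (0.125, 0.125, 0.25, 0.25)
```

With a 4x4 grid this gives 16 x 9 = 144 anchors in total, all sharing the same small set of shapes.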

2016: FPN
 Feature Pyramid Networks for Object Detection
 With the launch of Faster R-CNN, YOLO, and SSD in 2015, it seemed like
the general structure of an object detector had been determined.
 Researchers started to look at improving individual parts of these
networks.
 Feature Pyramid Networks is an attempt to improve the detection head by
using features from different layers to form a feature pyramid.
 This feature pyramid idea isn’t novel in computer vision research.
 Back when features were still manually designed, the feature pyramid was
already a very effective way to recognize patterns at different scales.
 However, how to share the feature pyramid between RPN and the region-based
detector was still to be determined.

 First, to rebuild RPN with an FPN structure like the diagram
above, we need to have a region proposal running on multiple
different scales of feature output.
 Also, we only need 3 anchors with different aspect ratios per
location now because objects with different sizes will be handled
by different levels of the feature pyramid.
 Next, to use an FPN structure in the Fast R-CNN detector, we
also need to adapt it to detect on multiple scales of feature maps.
 Since region proposals might have different scales too, we
should use them in the corresponding level of FPN as well.
 In short, if Faster R-CNN is a pair of RPN and region-based
detector running on one scale, FPN converts it into multiple
parallel branches running on different scales and collects the
final results from all branches in the end.
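
Routing each region proposal to "the corresponding level of FPN" can be done with the assignment rule from the FPN paper, k = floor(k0 + log2(sqrt(w*h) / 224)), where 224 is the canonical ImageNet pre-training size and k0 = 4. The sketch below implements that rule; the function name and the clamping to levels 2 through 5 are assumptions made for this example.

```python
import math


def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Pyramid level for an ROI of width w and height h (in pixels)."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    # Clamp to the levels that actually exist in the pyramid.
    return max(k_min, min(k_max, k))


print(fpn_level(224, 224))  # 4
print(fpn_level(112, 112))  # 3
print(fpn_level(32, 32))    # 2 (clamped)
print(fpn_level(800, 800))  # 5
```

Smaller proposals land on finer, higher-resolution levels, and larger proposals on coarser ones.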

2016: YOLO V2
 The initial version of YOLO suffered from many
shortcomings: predictions based on a coarse grid brought
lower localization accuracy, and two scale-agnostic regressors
per grid cell made it difficult to recognize small, densely packed
objects.
 YOLO v2 added Batch Normalization layers from a paper
called “Batch Normalization: Accelerating Deep Network
Training by Reducing Internal Covariate Shift”.

 Just like SSD, YOLO v2 also adopted Faster R-CNN’s idea of
anchor boxes for bounding box regression.
 Also, anchor sizes are determined by a K-means clustering of
the target dataset to better align with object shapes.
 A new backbone network called Darknet is used for feature
extraction. This is inspired by “Network in Network” and
GoogLeNet’s bottleneck structure.
 To improve the detection of small objects, YOLO v2 added a
passthrough layer to merge features from an early layer. This
part can be seen as a simplified version of SSD.
 YOLO v2 also experimented with a version that’s trained on
a hierarchical dataset of 9000 classes, which also represents an
early trial of multi-label classification in an object detector.
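
The K-means anchor selection mentioned above clusters ground-truth box sizes using d = 1 - IOU as the distance, so large and small boxes are treated fairly. Below is a minimal sketch under stated assumptions: boxes are (width, height) pairs compared as if they shared a corner, and the centroids are naively initialized from the first k boxes to keep the example deterministic. The function names and sample data are hypothetical.

```python
def wh_iou(a, b):
    """IOU of two (w, h) boxes that share the same top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=20):
    """Cluster (w, h) pairs with distance 1 - IOU; returns k anchor sizes."""
    centroids = list(boxes[:k])  # naive init, an assumption of this sketch
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IOU is the same as lowest 1 - IOU distance.
            best = max(range(k), key=lambda c: wh_iou(box, centroids[c]))
            clusters[best].append(box)
        centroids = [
            (sum(b[0] for b in cl) / len(cl), sum(b[1] for b in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids


sizes = [(10, 10), (100, 100), (12, 11), (95, 110), (11, 12), (105, 98)]
print(kmeans_anchors(sizes, k=2))  # one small anchor, one large anchor
```

A real run would use all labeled boxes of the dataset and a better initialization.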

2017: RETINANET
 To understand why one-stage detectors are usually not as good as two-stage
detectors, RetinaNet investigated the foreground-background
class imbalance issue that arises from a one-stage detector’s dense predictions.
 RetinaNet introduced a new loss function called Focal Loss to help the
network learn what’s important.
 Focal Loss multiplies the Cross-Entropy loss by a modulating factor (1 - p)^γ,
where p is the predicted probability of the true class and γ is what the authors
call the focusing parameter. As the confidence score becomes higher, the
loss value becomes much lower than a normal Cross-Entropy.
 RetinaNet is composed of a ResNet backbone, an FPN detection neck to
channel features at different scales, and two subnets for classification
and box regression as the detection head.
 Similar to SSD and YOLO v2, RetinaNet uses anchor boxes to cover
targets of various scales and aspect ratios.
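
The Focal Loss described above can be written out for a single binary prediction as FL = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the probability assigned to the true class. The sketch below uses gamma = 2 and alpha = 0.25, the defaults reported in the RetinaNet paper; the function name and the scalar, non-vectorized form are assumptions for illustration.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for foreground probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy, confident example is down-weighted heavily compared with
# plain cross-entropy (gamma=0, alpha=1), a hard example much less so.
print(focal_loss(0.9, 1), focal_loss(0.9, 1, gamma=0.0, alpha=1.0))
print(focal_loss(0.1, 1), focal_loss(0.1, 1, gamma=0.0, alpha=1.0))
```

Because most anchors are easy background, this down-weighting keeps them from dominating the total loss.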

2018: YOLO V3
 YOLOv3: An Incremental Improvement
 Following YOLO v2’s tradition, YOLO v3 borrowed more
ideas from previous research and became an incredibly
powerful one-stage detector.
 YOLO v3 balanced speed, accuracy, and
implementation complexity pretty well.
 It became really popular in the industry because of its fast
speed and simple components.

 Simply put, YOLO v3’s success comes from its more
powerful backbone feature extractor and a RetinaNet-like
detection head with an FPN neck.
 The new backbone network Darknet-53 leveraged
ResNet’s skip connections to achieve an accuracy that’s
on par with ResNet-50 while being much faster.
 Also, YOLO v3 ditched v2’s passthrough layers and fully
embraced FPN’s multi-scale prediction design.
 With this, YOLO v3 finally reversed people’s impression
of its poor performance when dealing with small objects.

2019: OBJECTS AS POINTS
 Although the image classification area has become less active
recently, object detection research is still far from mature.
 In 2018, a paper called “CornerNet: Detecting Objects as
Paired Keypoints” provided a new perspective for detector
training.
 Since preparing anchor box targets is quite a cumbersome job,
is it really necessary to use them as a prior?
 This new trend of ditching anchor boxes is called “anchor-free”
object detection.

 Inspired by the use of heat-map in the Hourglass network
for human pose estimation, CornerNet uses a heat-map
generated by box corners to supervise the bounding box
regression.

 Objects As Points, aka CenterNet, took a step further. It uses heat-map peaks to
represent object centers, and the network regresses the box width and height
directly from these box centers.
 Essentially, CenterNet uses every pixel as a grid cell. With a Gaussian-distributed
heat-map, the training also converges more easily than previous attempts
which tried to regress bounding box size directly.
 The elimination of anchor boxes also has another useful side effect. Previously, we
relied on IOU (such as > 0.7) between the anchor box and the ground truth box to
assign training targets.
 By doing so, a few neighboring anchors may all get assigned a positive target for
the same object, and the network will learn to predict multiple positive boxes for the
same object too.
 The common way to fix this issue is to use a technique called Non-maximum
Suppression (NMS). It’s a greedy algorithm that filters out boxes that are too close
together.
 Now that anchors are gone and we only have one peak per object in the heat-map,
there’s no need to use NMS any more.
 Since NMS is sometimes hard to implement and slow to run, getting rid of NMS is a
big benefit for applications that run in various environments with limited
resources.
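
Reading object centers off the heat-map replaces NMS with a simple local-maximum test: a cell counts as a center if its score is at least as high as all of its 8 neighbors. The sketch below shows that test on a tiny single-class heat-map; the function name, the threshold value, and the sample data are assumptions for illustration.

```python
def heatmap_peaks(heatmap, threshold=0.3):
    """Return (row, col, score) for cells that are 3x3 local maxima."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            v = heatmap[r][c]
            if v < threshold:
                continue
            # Collect the 3x3 neighborhood, clipped at the borders.
            neighbors = [heatmap[rr][cc]
                         for rr in range(max(0, r - 1), min(h, r + 2))
                         for cc in range(max(0, c - 1), min(w, c + 2))]
            if v >= max(neighbors):
                peaks.append((r, c, v))
    return peaks


heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.0],
    [0.1, 0.3, 0.2, 0.6],
    [0.0, 0.0, 0.5, 0.4],
]
print(heatmap_peaks(heatmap))  # [(1, 1, 0.9), (2, 3, 0.6)]
```

For each peak, the width and height would then be read from the regression output at the same location.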

2019: EFFICIENTDET
 EfficientDet: Scalable and Efficient Object Detection

 EfficientDet showed us some more exciting developments in the object detection
area.
 The FPN structure has proved to be a powerful technique to improve the detection
network’s performance for objects at different scales.
 Famous detection networks such as RetinaNet and YOLO v3 all adopted an FPN
neck before box regression and classification.
 Later, NAS-FPN and PANet both demonstrated that a plain multi-layer FPN
structure may benefit from more design optimization.
 EfficientDet continued exploring in this direction, and eventually created a new neck
called BiFPN.
 Basically, BiFPN features additional cross-layer connections to encourage feature
aggregation back and forth.
 To justify the efficiency part of the network, BiFPN also removed some less
useful connections from the original PANet design.
 Another innovative improvement over the FPN structure is weighted feature fusion.
BiFPN added learnable weights to feature aggregation so that the
network can learn the importance of different branches.
OTHER, LESS FAMOUS MODELS…
 2009: DPM
 Object Detection with Discriminatively Trained Part Based Models
 By matching many HOG features for each deformable part, DPM was one of the most efficient object
detection models before the deep learning era. Taking pedestrian detection as an example, it uses a star
structure to recognize the general person pattern first, and then recognizes parts with different sub-filters and
calculates an overall score. Even today, the idea of recognizing objects with deformable parts is still popular
after the switch from HOG features to CNN features.
 2012: Selective Search
 Selective Search for Object Recognition
 Like DPM, Selective Search is also not a product of the deep learning era. However, this method combined
many classical computer vision approaches, and was also used in the early R-CNN detector. The core
idea of selective search is inspired by semantic segmentation, where pixels are grouped by similarity. Selective
Search uses different criteria of similarity such as color space and SIFT-based texture to iteratively merge
similar areas together. These merged areas served as foreground predictions and were followed by an
SVM classifier for object recognition.
 2016: R-FCN
 R-FCN: Object Detection via Region-based Fully Convolutional Networks
 Faster R-CNN finally combined RPN and ROI feature extraction and improved the speed a lot. However, for
each region proposal, we still need fully connected layers to compute class and bounding box separately. If
we have 300 ROIs, we need to repeat this 300 times, and this is also the origin of the major
speed difference between one-stage and two-stage detectors. R-FCN borrowed the idea from FCN for
semantic segmentation, but instead of computing the class mask, R-FCN computes position-sensitive score
maps. These maps predict the probability of the appearance of the object at each location, and all locations
vote (average) to decide the final class and bounding box. Besides, R-FCN also used atrous convolution
in its ResNet backbone, which was originally proposed in the DeepLab semantic segmentation network. To
understand what atrous convolution is, please see my previous article “Witnessing the Progression in
Semantic Segmentation: DeepLab Series from V1 to V3+”.

 2017: Soft-NMS
 Improving Object Detection With One Line of Code
 Non-maximum suppression (NMS) is widely used in anchor-based object detection
networks to reduce duplicate positive proposals that are close by. More specifically,
NMS iteratively eliminates candidate boxes if they have a high IOU with a more
confident candidate box. This can lead to unexpected behavior when two
objects of the same class are indeed very close to each other. Soft-NMS made a
small change: instead of eliminating the overlapped candidate boxes, it only scales down
their confidence scores with a parameter. This scaling parameter gives us more control
when tuning the localization performance, and also leads to better precision when
a high recall is also needed.
 2017: Cascade R-CNN
 Cascade R-CNN: Delving into High Quality Object Detection
 While FPN explored how to design a better R-CNN neck to use backbone features,
Cascade R-CNN investigated a redesign of the R-CNN classification and regression
head. The underlying assumption is simple yet insightful: the higher the IOU criterion we
use when preparing positive targets, the fewer false positive predictions the network
will learn to make. However, we can’t simply increase this IOU threshold from the
commonly used 0.5 to a more aggressive 0.7, because it could also lead to
overwhelmingly more negative examples during training. Cascade R-CNN’s solution is to
chain multiple detection heads together, each relying on the bounding box
proposals from the previous detection head. Only the first detection head uses
the original RPN proposals. This effectively simulates an increasing IOU threshold
for later heads.
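
The Soft-NMS rescoring described above can be sketched as follows. This version uses the Gaussian penalty, score *= exp(-IOU^2 / sigma), one of the two variants in the Soft-NMS paper; the function names, the default sigma of 0.5, the pruning threshold, and the sample boxes are assumptions made for this example.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS; returns a list of (index, rescored confidence)."""
    scores = list(scores)
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        kept.append((best, scores[best]))
        # Decay the scores of overlapping boxes instead of deleting them.
        for i in remaining:
            scores[i] *= math.exp(-iou(boxes[best], boxes[i]) ** 2 / sigma)
        remaining = [i for i in remaining if scores[i] >= score_threshold]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
print(soft_nms(boxes, [0.9, 0.8, 0.7]))
```

Unlike hard NMS, the heavily overlapping second box survives with a reduced score, so a later score threshold decides whether it is kept.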

 2017: Mask R-CNN
 Mask R-CNN
 Mask R-CNN is not a typical object detection network. It was designed to solve a challenging
instance segmentation task, i.e., creating a mask for each object in the scene. However, Mask
R-CNN proved to be a great extension of the Faster R-CNN framework, and in turn inspired
object detection research. The main idea is to add a binary mask prediction branch after ROI
pooling along with the existing bounding box and classification branches. Besides, to address
the quantization error from the original ROI Pooling layer, Mask R-CNN also proposed a new
ROI Align layer that uses bilinear image resampling under the hood. Unsurprisingly, both multi-task
training (segmentation + detection) and the new ROI Align layer contribute to some
improvement over the bounding box benchmark.
 2018: PANet
 Path Aggregation Network for Instance Segmentation
 Instance segmentation has a close relationship with object detection, so a new instance
segmentation network can often benefit object detection research indirectly. PANet aims at
boosting information flow in the FPN neck of Mask R-CNN by adding a bottom-up
path after the original top-down path. To visualize this change, we have a ↑↓ structure in the
original FPN neck, and PANet makes it more like a ↑↓↑ structure before pooling features from
multiple layers. Also, instead of having separate pooling for each feature layer, PANet added an
“adaptive feature pooling” layer after Mask R-CNN’s ROIAlign to merge (element-wise max or
sum) multi-scale features.
 2019: NAS-FPN
 NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection
 PANet’s success in adapting the FPN structure drew attention from a group of NAS researchers.
They used a reinforcement learning method similar to that of the image classification network
NASNet and focused on searching for the best combination of multiple merging cells. Here, a
merging cell is the basic building block of an FPN that merges any two input feature layers into
one output feature layer. The final results proved the idea that FPN could use further
optimization, but the complex computer-searched structure made it too difficult for humans to
understand.
