R-FCN (Region-based Fully Convolutional Network) is a two-stage object detection network that addresses the dilemma between translation invariance, which helps classification, and translation variance, which localization requires. It classifies each RoI using position-sensitive score maps and position-sensitive RoI pooling. The score maps are produced by a convolutional layer, and each one is specialized for a different relative location within an object. Position-sensitive RoI pooling pools each RoI bin only from the score map that corresponds to that bin. Bounding box regression uses the same position-sensitive scheme. R-FCN reaches accuracy competitive with Faster R-CNN while running faster, because no per-RoI conv or fully connected layers follow RoI pooling, so nearly all computation is shared across the whole image.
3. Introduction
● Two-stage object detection networks have two subnetworks
○ Shared fully convolutional subnetwork independent of RoIs
○ RoI-wise subnetwork that does not share computation
● RoI pooling layer is unnaturally inserted between conv layers to address the translation invariance vs variance dilemma
○ Sacrifices training and testing efficiency since it introduces a considerable number of region-wise layers -> each RoI goes through its own classification layers
6. R-FCN vs Faster R-CNN
● Faster R-CNN: conv layer applied to each RoI after RoI pooling
● R-FCN: NO conv layer after RoI pooling
7. Position-sensitive score maps
● Attach a convolutional layer on top of the feature map to produce k^2(C+1) position-sensitive score maps
● For each of the C+1 categories (C object classes + background), k^2 score maps are produced
○ each score map is specialized for one relative location of an object (top-left, top-middle, ..., bottom-right) in a k x k grid
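The channel count and grouping described above can be illustrated with a minimal NumPy sketch. It assumes the added convolutional layer is a 1x1 convolution (written here as an einsum) and that channels are laid out bin-major, i.e. channel = (i * k + j) * (C + 1) + c. The function names and the layout are illustrative assumptions, not something the slides specify.

```python
import numpy as np


def score_map_channels(k: int, num_classes: int) -> int:
    """Number of position-sensitive score maps: k^2 * (C + 1)."""
    return k * k * (num_classes + 1)


def make_score_maps(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Apply an (assumed) 1x1 convolution.

    features: (D, H, W) shared feature map
    weights:  (k^2 (C+1), D) one weight vector per score map
    returns:  (k^2 (C+1), H, W) position-sensitive score maps
    """
    return np.einsum("od,dhw->ohw", weights, features)


def group_score_maps(score_maps: np.ndarray, k: int, num_classes: int) -> np.ndarray:
    """Reshape (k^2 (C+1), H, W) into (k, k, C+1, H, W).

    Assumed layout (hypothetical): channel = (i * k + j) * (C + 1) + c,
    so grouped[i, j, c] is the map for bin (i, j) and category c.
    """
    channels, height, width = score_maps.shape
    if channels != score_map_channels(k, num_classes):
        raise ValueError("channel count must equal k^2 (C + 1)")
    return score_maps.reshape(k, k, num_classes + 1, height, width)
```

For example, with k = 3 and C = 20 object classes, `score_map_channels(3, 20)` gives 3 * 3 * 21 = 189 score maps.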
9. Position-sensitive RoI Pooling
● Each RoI rectangle is divided into k x k bins
○ For a w x h RoI, each bin has a size of approximately w/k x h/k
● For each (i, j)th bin, position-sensitive RoI pooling pools only over the (i,j)th score map
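[Figure: RoI divided into k x k bins (k = 3), indexed 0-2 along each axis; example coordinate (123, 245)]
● Pooled response of the (i, j)th bin for category c:
r_c(i, j) = (1/n) Σ_{(x, y) ∈ bin(i, j)} z_{i,j,c}(x + x_0, y + y_0)
○ z_{i,j,c}: one score map out of k^2(C+1) score maps
○ (x_0, y_0): top-left corner of an RoI
○ n: # of pixels in the bin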
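The pooling step on this slide can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the RoI is given as integer (x_0, y_0, w, h) in feature-map pixels, i indexes bins horizontally and j vertically, bin edges use floor/ceil rounding, and channels follow a hypothetical bin-major layout, channel = (i * k + j) * (C + 1) + c.

```python
import math

import numpy as np


def ps_roi_pool(score_maps: np.ndarray, roi, k: int, num_classes: int) -> np.ndarray:
    """Position-sensitive RoI average pooling (illustrative sketch).

    score_maps: (k^2 (C+1), H, W)
    roi:        (x0, y0, w, h), integers in feature-map pixels
    returns:    (k, k, C+1) array; entry [i, j, c] is the mean of score map
                z_{i,j,c} over the pixels of bin (i, j)
    """
    x0, y0, w, h = roi
    channels, height, width = score_maps.shape
    if channels != k * k * (num_classes + 1):
        raise ValueError("channel count must equal k^2 (C + 1)")
    if w < 1 or h < 1 or x0 < 0 or y0 < 0 or x0 + w > width or y0 + h > height:
        raise ValueError("RoI must be non-empty and lie inside the score maps")

    # Assumed layout: grouped[i, j, c] is the score map for bin (i, j), category c.
    grouped = score_maps.reshape(k, k, num_classes + 1, height, width)
    pooled = np.empty((k, k, num_classes + 1), dtype=float)
    for i in range(k):  # horizontal bin index
        x_start = x0 + math.floor(i * w / k)
        x_end = x0 + math.ceil((i + 1) * w / k)
        for j in range(k):  # vertical bin index
            y_start = y0 + math.floor(j * h / k)
            y_end = y0 + math.ceil((j + 1) * h / k)
            # Pool only over the (i, j)th score map of every category.
            region = grouped[i, j, :, y_start:y_end, x_start:x_end]
            pooled[i, j] = region.mean(axis=(1, 2))  # sum over the bin / n
    return pooled
```

Because each bin reads only its own score map, the loop does no learned computation per RoI; everything learned lives in the shared score maps.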
14. Bounding box regression
● Aside from the k^2(C+1)-d conv layer, a sibling 4k^2-d conv layer for bounding box regression is appended
○ position-sensitive RoI pooling on its output produces a 4k^2-d vector for each RoI
● Then, the 4k^2-d vector is aggregated into a 4-d vector by average voting.
● 4-d vector parameterizes (t_x, t_y, t_w, t_h)
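The average voting step can be sketched as below. The layout of the 4k^2-d vector (four offsets stored contiguously per bin) is an assumption, and so is the decoding step: the slides do not spell out how (t_x, t_y, t_w, t_h) is applied, so `decode_box` uses the R-CNN-style parameterization (center shifts scaled by the RoI size, log-space width and height) purely as an illustration. Both function names are hypothetical.

```python
import numpy as np


def vote_box_offsets(pooled: np.ndarray) -> np.ndarray:
    """Average a 4k^2-d pooled output into one 4-d vector (t_x, t_y, t_w, t_h).

    pooled: shape (k, k, 4) or flat (4 k^2,), assuming the four offsets of
            each bin are stored contiguously.
    """
    return np.asarray(pooled, dtype=float).reshape(-1, 4).mean(axis=0)


def decode_box(t: np.ndarray, roi) -> np.ndarray:
    """Apply (t_x, t_y, t_w, t_h) to an RoI given as (x_center, y_center, w, h).

    Assumption: R-CNN-style parameterization, not stated on the slides.
    """
    t_x, t_y, t_w, t_h = t
    x_c, y_c, w, h = roi
    return np.array([
        x_c + t_x * w,    # shift center by a fraction of the RoI width
        y_c + t_y * h,    # shift center by a fraction of the RoI height
        w * np.exp(t_w),  # rescale width in log space
        h * np.exp(t_h),  # rescale height in log space
    ])
```

With all t values at zero, `decode_box` returns the RoI unchanged, which is a quick sanity check on the parameterization.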