MARGINALIZATION (Different learners in Marginalized Group
Mmpaper draft10
1. Bidirectional Information exchange
Subtitle as needed (paper subtitle)
Authors Name/s per 1st Affiliation (Author)
line 1 (of Affiliation): dept. name of organization
line 2-name of organization, acronyms acceptable
line 3-City, Country
line 4-e-mail address if desired
Authors Name/s per 2nd Affiliation (Author)
line 1 (of Affiliation): dept. name of organization
line 2-name of organization, acronyms acceptable
line 3-City, Country
line 4-e-mail address if desired
Abstract— It is essential to collect and analyze
environmental information surrounding the vehicle in an
autonomous driving environment. To do this, it is
necessary to analyze real-time information such as
location, orientation and size of objects. In particular,
handling of occlusion or truncation helps to ensure safe
driving by providing exact information about the
surroundings of the vehicle. In this paper, we propose a
new method to generate object proposals around a vehicle
using projected LiDAR information and to perform object
detection more accurately by exploiting a 2D image based
classifier. This method generates a proposal by filtering
the LiDAR information into a 2D edge with a simple but
strong affinity. Then classify the objects in the proposal
through R-FCN and map the class labels classified at the
edge points used to make the proposal. This class label
combines with the orientation information of the edge in
3D space to complete the 3D box of the object. This
compensates for the disadvantage of the CCD-based CNN
classifier which shows difficulties to obtain spatial
information and solves such problems as occlusion. We
compared our method to state-of-the-art results on the
KITTI Dataset and showed good results in terms of speed
and accuracy (especially IOU with ground-truth).
Keywords—LiDAR; CCD; Proposal generation; Object
Detection; R-FCN
I. INTRODUCTION
In recent years, CNN had many contributions from fields such
as object classification and object detection. Especially, as the
autonomous driving becomes an issue, deep network
researches for object recognition and behavior analysis in the
driving environment are actively being carried out. The most
recent deep network-based object classification researches are
divided into two categories : two-stage type and one-stage
type. In two-stage type, 'proposal generator' and 'classifier'
scheme are implemented as an independent component, but in
one-stage type these are performed together.
In two-stage type, R-CNN[1] shows good performance in
object classification by evaluating the 'scores' of features that
exist in that region via 'region'(?) and classifying the proposed
regions into feature maps. This method significantly reduces
the computational cost compared to the sliding window
method, and thus has advantages in terms of speed. Faster R-
CNN [2] devised the 'RPN (Region Proposal Network)' that
generates the proposal through feature sharing, taking
advantage of the fact that the convolution layer itself contains
enough information to represent the object. HyperNet [3]
replaces the features used in the Faster R-CNN by pooling
several times to compensate for the fact that the FC5 layer
information is insufficient for small objects. It creates a hyper
feature set and used it to match objects of various sizes. There
is, however, a dilemma for translation invariants and variants
between linked object detection and classifiers. To solve this
dilemma, R-FCN [4] divides the ROI into k × k lattices and
finds objects in the center of ROI through local voting.
In one-stage type, bounding prediction and class
probability calculation are performed together like YOLO[5].
In YOLO, the class probability map and the bounding box
confidence map in the grid are generated based on the grid
without generating a separate proposal, and the class of the
bounding box is classified immediately, greatly simplifying
the overall calculation process. Fast YOLO [ ] simplifies the
network a little bit and shows a computation speed of 155 fps.
In a similar approach, a single shot detector (SSD)[6] was
introduced to evaluate box and confidence through step
features in the convolution layer like HyperNet.
Even though the existing methods are good at classifying
occlusion and truncated objects, but is vulnerable to the
processing of the obscured part of the object.
However, for autonomous navigation, additional
information is needed in detecting and classifying objects
well. The most important information is occlusion and
truncation among objects. These two factors are important
factors in understanding the actual position and arrangement
of nearby objects in the driving environment. Of course, the
existing set of methods is also good at classifying occlusion
and truncated objects, but is vulnerable to the processing of
the obscured part of the object.
In both types, a bounding box is created based on the
characteristics of the unobscured part of the obscured object,
and the object is classified through a part of the object
represented in the box. So, although the class of the object can
be found, there is a problem that the size of the actual object in
2. the
hidden
area
can
not be
estimated.
Another problem is obtaining spatial information such as the
3D location and orientation of the detected object, and the
actual size of the object. In order to grasp the traffic flow
around the vehicle prior to autonomous driving, it is very
difficult to estimate the 3D spatial information using only one
accurate RGB image. Therefore, using 3D depth information
is the most intuitive method. 3DOP[7] generates Depth map
using stereo image, and by projecting each Pixel of RGB
Image into 3D space. It defines the relationship between each
pixel as an MRF energy function through several properties,
classifies the object by Linear SVM. However, the
sophisticated stereo-based depth map generation used in
3DOP consumes a lot of computational cost. Therefore, there
is a method to obtain depth in addition to stereo, among which
laser sensor information such as LiDAR is used. Laser sensors
can collect 3D spatial information very quickly. Among these
studies, vote3D [8] used LiDAR directly learning in 3D space
through 3D Voxel. 3DVP classifies LiDAR pointcloud into
3D Voxel and applies 2D Alignment with 3D CAD Model. In
addition, there are studies in which LiDAR points are
projected onto 2D to generate 2D depth maps. But, LiDAR
information has sparse characteristics basically and it is
difficult to learn steadily because it shows non-uniform results
depending on the surface of environment or object. Based on
the structure of RPN + classifier of Faster R-CNN,
SubCNN[9] adds subcategory information to both sides and
Figure 1. System overview. The upper network is characterized by scoring through voting for the area of k x k with R-
FCN. In this study, ROI of this network is generated by filtering LiDAR pointcloud. Through this ROI, the object is
classified by voting and the classified class label is mapped again to the point to expand it in 3D space. This creates a final
bounding box for each object.
3. corrects the area according to the size and direction
information of subclass. Particularly in 'Car' category,
3DVP[10] was applied to identify the 3D location and
orientation of the vehicle and perform 2D and 3D
segmentation. However, this method can only be applied to
the rigid body model, and many additional samples are
required depending on each sub category class. In addition,
manual sorting for alignment is required for all of them, and
the sparse LiDAR pointcloud has the disadvantage that it is
difficult to construct a sufficient voxel for long distance
objects. And the computation speed is not fast enough and is
not suitable for autonomous driving environments where real-
time processing is essential. Therefore, this study proposes
fast and accurate 2D and 3D object detection and classification
method suitable for autonomous driving environment by
adding CCD image and LiDAR sensor information. The
proposed method uses a LiDAR Point as 2D data that
combines all the points of the z-axis in one plane instead of
the 3D voxel, which guarantees a much higher density than the
conventional one. We filter them and group edges to find
edges with high affinity and generate 2D proposals. The
generated proposal is used as the ROI of the R-FCN network
and the bounding box of the object in 3D space is created by
mapping the classification result to the constituent edge point.
Finally, if you project it onto 2D again, you can expand the
box to the hidden region that does not appear in the 2D image.
Our benefits through this study:
- This method replaces the existing proposal generator and
improves the speed by mixing the simplified LiDAR sensor
information with the powerful CCD-based CNN architecture.
- It is possible to acquire 3-D spatial information about the
object around the vehicle at the same time by only the
information generated in the process of making proposal
without additional process.
- This method expands the bounding box of the occluded or
truncated region to extend the box closer to the actual size of
the object.
4. To do this, we show how to perform a mutual projection
between 2D and 3D points using a simple method in Section
II-A. Section II-B uses this to detect objects in the CCD-based
state-of-the-art classifier R-FCN network. Section II-C
expands the box based on orientation information obtained in
II-A and orientation information obtained in 3D space through
the detected class label to determine the optimal bounding box
for the actual class. Section-III quantitatively verifies the
performance of the Average precision(AP) in the object and
tracking sequences in the KITTI Dataset and compares it with
the previous studies. Also, confirm the expansion result for the
obscured area through the picture.
II. BIDIRECTIONAL INFORMATION EXCHANGE
A. Proposal Generation with LiDAR
We propose 'bidirectional projection filtering' using 2D -
3D information as an intersection through a calibration matrix
between CCD image and 3D point cloud as shown in Figure 2.
Through this method, it is possible to generate an object
proposal, estimate an obscured area, and obtain the actual size
and 3D spatial information of the detected object based on the
CCD image. In this section, we describe how to remove point
information that is not needed for proposal generation and
exchange information between 2D and 3D space through
projection. As shown in figure 1- (b), we remove the ground to
separate object points and eliminate points outside the unused
CCD angle. Next, project the z-axis on one side to obtain a
dense 2D object shape. Then, the error points are filtered and
the remaining edges are grouped based on the affinity between
the neighboring edges, and a proposal is generated in units of
edge groups generated.
Figure 2. Point filtering through 2D projection of LiDAR.
(a) original image, (b) Lidar point cloud with ground
removed, (c) removal of points outside the angle of view
and reduce dimensions to increase density (d) 2D
projection and filtering of points, (e) 3D point cloud before
filtering, and (F) 2D point cloud after filtering.
Ground Plane removal - In the LiDAR environment, the
simplest way to remove the ground is to remove all of the
LiDAR points below the corresponding height by considering
the ground of the vehicle equipped with the sensor as ground
plane. In this paper, we propose a surface detection method
considering the curvature of the road surface because the road
is infinitely flat and there is no obstacle. Although there are
many methods, this study uses the fact that the objects in the
driving environment have a volume including a certain height.
Through the voxel grid of 20x20x10 cm, the height of the
points constituting the cell is obtained for all XY cells in which
the point exists.
(1)
(2)
When each coordinate of the XY plane is i and j, each grid
cell is represented by Cij, and if the point set belonging to Cij
is PSij. In equation (1), HPSij is the height of each cells.
Ground G(i,j) in equation (2) is the grid with zero height area.
Figure 2-(b) shows the point cloud in which the ground area is
removed in this way.
Outside point elimination – It is necessary to further
simplify the point set with ground removed. First, the
LiDAR Point is multiplied with the calibration matrix to
remove all points outside the x-axis range of the CCD image
and then process the remaining points only. The first reason for
doing a CCD projection is because you can easily remove
LiDAR points outside the angle of view. Second, since the
reference of the horizontal axis search changes from radian to
pixel through projection, it is possible to perform filtering on
the horizontal axis in a very simple manner. Figure 2-(c) shows
that the range of points to be processed is greatly reduced.
Z axis elimination - Sparse LiDAR points after the ground is
removed are configured in a two-dimensional plane by
projecting all the points on the z-axis to increase the density
and at the same time for fast computation. Since the height that
can be reached is 50cm or more, PS
z
ij = -0.9 is set for all point
set PSij in order to locate the upper plane at 50cm above the
ground where the radar sensor of the vehicle is installed.
However, as shown in Figure 2-(b), since the 3D point of the
existing 3D space is pressed in one plane, the density of the
point is increased, but there is a portion that looks like noise
due to the depth difference.
Edge filtering in CCD - The use of 2D LiDAR point clouds
by CCD projection has several advantages. In 3D space,
LiDAR spreads radially with zero point and obtains distance
information. In order to extract the shortest line of an object
from a shooting point, the distance must be calculated for all
the angles that are continuous values. However, if we project
this onto the CCD, the discrete value of the pixel becomes the
reference domain as shown in Figure 2-(d). As shown in the
figure 3, duplicate points can easily be removed according to
height or depth in one pixel column in pixel level. In addition,
since the error pixel can be easily discriminated through the
gradient operation proportional to the distance between
5. neighboring pixels, it is possible to generate the edges closest
to the photographed vehicle in each column of the x axis.
(3)
In equation (3), Cli denotes all x-axis columns containing at
least one projected LiDAR Point and hPsi denotes the
minimum height of each Cli. This gives hPsi for all i and
removes noise through a one-dimensional median filter[11].
At this time, not only the height of the point selected by the
median filter but also the index of the corresponding point are
copied together to remove the noise point in the 3D space as
well. Figure 2-(f) shows a three-dimensional point cloud
arranged through this.
Segmentation by edge affinity – The point map determined
through the above process is segmented by two conditions.
First, we group the consecutive edges by the edge grouping
method introduced in Edgeboxes[12]. In this case, the edge
magnitude mp is used to inverse the Euclidean distance
between adjacent points Pi and Pj of two neighboring edge
groups as shown in equation (4).
(4)
(5)
The affinity score a(si,sj) is obtained for all edge groups Si, and
the edge is segmented based on the boundary of the group
whose score changes rapidly. (4) is a formula for obtaining the
affinity score between neighboring edge groups. PGz
i and PGz
j,
which are the ground heights, and PGz1.5
i and PGz1.5
j, which are
1.5m high from the ground, are projected on the CCD image,
respectively, for pi and pj located at the boundary of each edge
set. Create a bounding box by connecting horizontally to the
adjacent boundary.
Proposal generation based on actual size – We create a
bounding box between segmented boundaries. At this time, the
bottom value of each boundary point is determined through the
ground position obtained by the ground removal process. The
height of the bounding box is 1.5m which covers all three
classes that are mainly detected in the driving environment and
it is projected on CCD image to generate proposal in CCD
image. The yellow line in Figure 4 represents the projected
LiDAR edge, and the orange line represents the boundary
divided by the affinity of the LiDAR edge. The green box at
the bottom of Figure 4 creates a box with a height of 1.5m
between the boundaries based on this.
Figure 4. A box proposal created based on edge group
boundaries divided by edge affinity.
B. Classification with R-FCN
The generated proposal classifies objects by combining them
with R-FCN showing state-of-the-art performance. Figure 5-(b)
shows the R-FCN implemented through ResNets [13] 101. R-
FCN solves the object translation invariance and variance
dilemma between detection and classification by dividing
Figure 3. The process by which objects are detected by bidirectional projection filtering. (a) original image, (b) 2D LiDAR
projection, (c) 2D gradient-based edge cleanup, (d) Proposal creation using edge affinity, and classification result by R-
FCN. (e) 3D bounding box expansion by LiDAR projection(Red Boxes). , (f) final result.
6. ROI is divided into n x n grid and feature set is trained through
each grid cell. If a box proposal is determined according to this
learning method, a high voting score is returned if the object is
sufficiently wrapped. By using this feature, it is possible to
organize the surrounding box based on the voting score of the
redundant box due to noise of LiDAR.
Figure 5. Detailed structure of proposed method. (a) 3D to
2D Projection Filtering (b) R-FCN (c) 2D to 3D Projection
Filtering. (d) Result
C. 3D projection and box extension
This time, we assign the class label to the edge group inside the
edge boundary of the box classified by R-FCN. Each labeled
edge is projected in the 3D space through the index given in
Figure 5-(c) and the orientation of the object is estimated
through the X-Y axis pole among the points constituting the
edge. Once the orientation is determined, the 3D box is
expanded to match the class specific size and aspect ratio. In
this case, the edge to be used as the base of the expansion can
be determined according to the positional relationship between
the edge group and the height difference of hPi and hPj at the
segmentation boundary line surrounding the edge in 2D space.
Using this, we extend the box around the edge that contains the
edge of the object. Figure 6 shows the process of expanding the
box through the labeled edges and integrating the overlapping
boxes and edges.
III. EXPERIMENT RESULTS
This study was conducted through caffe[15] framework and
used NVIDIA TITAN X GPU. The two networks we used for
evaluation, Faster R-CNN and R-FCN, were originally studied
and evaluated in the PASCAL VOC Dataset, but this study
was conducted through KITTI Dataset because LiDAR
information is required together. We also compared the
original R-FCN method using Selective search [14] and the
results of this study with the KITTI object dataset by using R-
FCN learning model learned in PASCAL VOC for objective
comparison. The metrics used in the evaluation are the
standard mean average precision and the mean intersection-
over-union.
A. Experiments on KITTI
We first training through the KITTI Object Dataset and
Tracking sequence, which is a typical dataset that can use
CCD information and LiDAR information among public
datasets that can use R-FCN at present. In training phase, four
classes of 'Car', 'Pedestrian', 'cyclist', and 'background' were
combined and learned in four classes. Since the proposed
method does not require additional learning for LiDAR, there
is no need to change the architecture of R-FCN to learn CCD
image. However, as the type of class changes, the volume of
the feature set has changed. But the valid range of LiDAR
Sensor data given in KITTI Dataset is about 50m. It was not
possible to evaluate the object of hard difficulty which is not
wide enough to cover the entire CCD image and belong to the
distance of 50m or more.
Table 1 compares the object detection rates of existing state-
of-the-art methods and our research by measuring the mAP by
difficulty for the three classes. We measured the precision of
the box with the ground truth area and IOU of 50% or more,
and found good results for 'Car' and 'Pedestrian'. However, in
the case of 'cyclist', LiDAR points were not uniformly
distributed according to the spoke shape of the bicycle wheel,
so that a box proposal of a sufficient size could not be
generated, resulting in a relatively low value.
Car Pedestrian Cyclist
Method E M E M E M
Regionlet[15] 84.75 76.45 73.14 61.15 70.41 58.72
3DVP[10] 87.46 75.77 - - - -
SubCat[16] 84.14 75.46 - - - -
SDP [17] 90.33 83.53 77.74 64.19 74.08 61.31
Ours 95.41 88.54 81.78 65.71 72.11 60.85
Table 1. The results of the KITTI dataset of our study are
compared with state-of-the-art methods. Hard difficulty
was excluded from comparison because of the limit of
measurement distance of LiDAR sensor of about 50m.
B. Experiments on PASCAL VOC & KITTI
In fact, this study is close to the proposal generation method,
so the classifier R-FCN itself is used without any
modification. Therefore, the measurement of mAP at KITTI is
Figure 6. Expanded box by determined edge orientation
and corner detection. (a) shows the object box classified by
R-FCN, and (b) is a figure that is extended to 2D according
to the actual size of the class through 2D to 3D projection
and then projected to 2D.
7. not enough to analyze the performance of our study because it
is more influenced by the performance of our R-FCN.
Therefore, we compared the results with R-FCN using Faster
R-CNN or Selective search using RPN to objectively measure
performance as a proposal creation and result correction tool.
In this comparison, we used the caffe model learned from
PASCAL VOC 07 + 12 published in the original paper as it is
to prevent problems that may occur during re-learning of two
networks through KITTI Dataset. Table 2 compares the
accuracy and computation time of two other studies with this
study. Table 3 shows the AP variation with IOU ratio to
ground-truth. As shown in the table, the overlap ratio is higher
than that of conventional RPN or SS. This is because the box
extension through the class label reduces the error by re-
expanding the box closer to the actual object size. We have
confirmed that, as long as the object belongs to the LiDAR
sensor range, this study expresses the object as a bounding box
closer to the ground-truth than the existing research. Although
the box extension is performed only for the vehicle category
due to the problem of orientation acquisition, as shown in
Zhang et al.[14]'s experiment. But, if the aspect ratio and size
of the object such as pedestrian can be generalized, it is
possible to expand the additional category.
Training
Data
Test
Data
mAP
(%)
test time
(sec/img)
RPN+Faster R-
CNN
07+12 KITTI 75.7 0.37
RPN+R-FCN 07+12 KITTI 77.4 0.20
SS+R-FCN 07+12 KITTI 80.4 2.21
Ours +R-FCN 07+12 KITTI 82.4 0.17
Table 2. Comparison result of detection rate in KITTI
Object and Tracking dataset. To reduce errors, we used
the original pre-training model.
Training
Data
Test
Data
AP@
0.5
AP@
0.7
AP@
0.9
RPN+R-FCN 07+12 KITTI(car) 84.8 77.4 55.2
SS+R-FCN 07+12 KITTI(car) 86.3 80.4 58.4
Ours 07+12 KITTI(car) 89.7 82.4 80.1
Table 3. The AP change according to the IOU rate in the
Car category. The proposed method shows a high overlap
ratio with respect to the ground truth compared to the
RGB feature based proposal generator such as selective
search.
IV. CONCLUSION
We propose a simple, but strong method for autonomous
driving through this study. We have shown better results than
state-of-the-art methods by combining CCD-based classifier
and LiDAR information effectively through the proposed BPF.
Especially, 3D space information of a partially obscured object
is grasped in real time, and the detection area is extended based
on this, so that the result shows a high IOU rate with the
groundtruth. However, there are still problems to be solved,
such as how to handle objects that appear on the CCD but are
outside the LiDAR range, or to expand on both sides of the
object. In addition, LiDAR sensing range is shorter than the
visible range of the CCD, so it is necessary to compensate for
CCD objects. We will address these issues through additional
research.
REFERENCES
[1] Girshick, Ross, et al. "Rich feature hierarchies for accurate object
detection and semantic segmentation." Proceedings of the IEEE
conference on computer vision and pattern recognition. 2014.
[2] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object
detection with region proposal networks." Advances in neural
information processing systems. 2015.
[3] Kong, Tao, et al. "HyperNet: Towards Accurate Region Proposal
Generation and Joint Object Detection." arXiv preprint
arXiv:1604.00600 (2016).
[4] Dai, Jifeng, et al. "R-FCN: Object Detection via Region-based Fully
Convolutional Networks." arXiv preprint arXiv:1605.06409 (2016).
[5] Redmon, Joseph, et al. "You only look once: Unified, real-time object
detection." arXiv preprint arXiv:1506.02640 (2015).
[6] Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." arXiv preprint
arXiv:1512.02325 (2015).
[7] Chen, Xiaozhi, et al. "3d object proposals for accurate object class
detection." Advances in Neural Information Processing Systems. 2015.
[8] Wang, Dominic Zeng, and Ingmar Posner. "Voting for voting in online
point cloud object detection." Proceedings of the Robotics: Science and
Systems, Rome, Italy 1317 (2015).
[9] Xiang, Yu, et al. "Subcategory-aware Convolutional Neural Networks
for Object Proposals and Detection." arXiv preprint arXiv:1604.04693
(2016).
[10] Xiang, Yu, et al. "Data-driven 3d voxel patterns for object category
recognition." 2015 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR). IEEE, 2015.
[11] T. Huang, G. Yang, and G. Tang, "A fast two-dimensional median
filtering algorithm", IEEE Trans. Acoust., Speech, Signal Processing,
vol. 27, no. 1, pp. 13–18, 1979.
[12] Zitnick, C. Lawrence, and Piotr Dollár. "Edge boxes: Locating object
proposals from edges." European Conference on Computer Vision.
Springer International Publishing, 2014.
[13] He, Kaiming, et al. "Deep residual learning for image recognition."
arXiv preprint arXiv:1512.03385 (2015).
[14] Uijlings, Jasper RR, et al. "Selective search for object recognition."
International journal of computer vision 104.2 (2013): 154-171.
[15] Wang, Xiaoyu, et al. "Regionlets for generic object detection." IEEE
transactions on pattern analysis and machine intelligence 37.10 (2015):
2071-2084.
[16] Ohn-Bar, Eshed, and Mohan Manubhai Trivedi. "Learning to detect
vehicles by clustering appearance patterns." IEEE Transactions on
Intelligent Transportation Systems 16.5 (2015): 2511-2521.
[17] Yang, Fan, Wongun Choi, and Yuanqing Lin. "Exploit all the layers:
Fast and accurate cnn object detector with scale dependent pooling and
cascaded rejection classifiers." Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 2016.
[18] Zhang, Liliang, et al. "Is Faster R-CNN Doing Well for Pedestrian
Detection?." European Conference on Computer Vision. Springer
International Publishing, 2016.