Learning RGB-D Salient Object Detection using background enclosure, depth contrast, and top-down features
1. IN THE NAME OF GOD
Islamic Azad University of Qazvin
Department of Electrical, Computer and Information Technology
Presenter: Benyamin Moadab
Supervisor: Dr. Vahid Rostami
May - June 2021
6. Introduction
• In human visual saliency, top-down and bottom-up information are combined as a basis of visual attention.
• The use of top-down information and the exploration of CNNs for RGB-D salient object detection remain limited.
• Saliency can be used in the context of many computer vision problems, such as compression, object detection, and visual tracking.
8. In this section, we introduce our approach to RGB-D salient object detection.
We combine the strengths of previous approaches: high-level and low-level feature based deep CNN RGB salient object detection (top-down and bottom-up methods), a mid-level depth feature, and low-level depth features.
[Figure: proposed architecture]
This paper makes three major contributions.
10. RGB-D saliency detection system
• The features used to identify salient objects, each introduced in detail in this section, are:
1) BED Feature
2) Low-level Depth Features
3) RGB low and high level saliency from ELD-Net
4) Nonlinear combination of depth features
5) Concatenation of Color and Depth Features
11. 1. BED Feature
✓ BED (Background Enclosure Distribution) is inspired by LBE (Local Background Enclosure).
✓ Whereas LBE is hand-crafted, BED relies on learning.
✓ The BED feature captures the enclosure distribution properties of a patch.
❖ How the BED feature works:
1) Descriptive explanation
2) Mathematical form:
$f(P,t) = \dfrac{\theta_{a1} + \theta_{a2}}{2\pi}$ and $g(P,t) = \dfrac{\theta_{a3}}{2\pi}$
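To make the two quantities concrete, here is a minimal NumPy sketch of the angular statistics behind $f$ and $g$ for a single patch: rays are cast around the patch centre, a direction counts as "background" if a sufficiently deeper pixel lies along it, $f$ is the total background angular coverage, and $g$ is the largest angular gap. The ray-casting discretization, the parameters, and all names here are illustrative; the paper computes BED over super-pixels and a set of depth thresholds.

```python
import numpy as np

def bed_statistics(depth, cy, cx, t, radius=15, n_angles=72):
    """Illustrative sketch of the angular statistics f(P,t) and g(P,t)."""
    h, w = depth.shape
    centre_depth = depth[cy, cx]
    angles = np.linspace(0.0, 2 * np.pi, n_angles, endpoint=False)
    is_background = np.zeros(n_angles, dtype=bool)

    # Mark each direction as background if any pixel along the ray
    # is deeper than the patch depth by more than threshold t.
    for i, a in enumerate(angles):
        for r in range(1, radius + 1):
            y = int(round(cy + r * np.sin(a)))
            x = int(round(cx + r * np.cos(a)))
            if 0 <= y < h and 0 <= x < w and depth[y, x] > centre_depth + t:
                is_background[i] = True
                break

    # f: fraction of the full 2*pi angle covered by background directions
    # (the sum theta_a1 + theta_a2 + ... divided by 2*pi).
    f = is_background.mean()

    # g: largest contiguous non-background gap (theta_a3 / 2*pi),
    # computed circularly by unwrapping the angle array twice.
    runs = np.concatenate([~is_background, ~is_background])
    longest = current = 0
    for flag in runs:
        current = current + 1 if flag else 0
        longest = max(longest, current)
    g = min(longest, n_angles) / n_angles
    return f, g

# Example on a random depth map (values in [0, 255]):
depth_map = np.random.rand(64, 64) * 255
f, g = bed_statistics(depth_map, cy=32, cx=32, t=10.0)
```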
12. 2. Low-level Depth Features
✓ Depth contrast
✓ Depth histogram distance
✓ SLIC algorithm
✓ Focused super-pixel
✓ Procedure (a sketch follows this list):
1) Calculate the average depth value of each super-pixel.
2) Subdivide the image into grid cells.
3) Subtract the average depth value of each grid cell from the average depth value of the focused super-pixel (depth contrast).
4) Calculate the difference between the depth histogram of the focused super-pixel and that of each grid cell (depth histogram distance).
5) To build these histograms, divide the entire range of depth values into 8 intervals.
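Below is a minimal NumPy sketch of the two low-level cues for one focused super-pixel. The grid size, the L1 histogram distance, and all names are illustrative assumptions; the slide only specifies the 8 depth intervals.

```python
import numpy as np

def low_level_depth_features(depth, sp_mask, grid=4, n_bins=8):
    """depth: 2-D array of normalized depth values in [0, 255].
    sp_mask: boolean mask of the focused super-pixel."""
    h, w = depth.shape
    bins = np.linspace(0, 255, n_bins + 1)  # 8 equal depth intervals

    sp_vals = depth[sp_mask]
    sp_mean = sp_vals.mean()
    sp_hist, _ = np.histogram(sp_vals, bins=bins)
    sp_hist = sp_hist / max(sp_hist.sum(), 1)

    contrasts, hist_dists = [], []
    for gy in range(grid):
        for gx in range(grid):
            cell = depth[gy * h // grid:(gy + 1) * h // grid,
                         gx * w // grid:(gx + 1) * w // grid]
            cell_hist, _ = np.histogram(cell, bins=bins)
            cell_hist = cell_hist / max(cell_hist.sum(), 1)
            contrasts.append(sp_mean - cell.mean())               # depth contrast
            hist_dists.append(np.abs(sp_hist - cell_hist).sum())  # histogram distance
    return np.array(contrasts), np.array(hist_dists)
```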
13. 3. RGB low and high level saliency from ELD-Net
✓ To represent high-level and low-level features for RGB, we make use of the extended version of ELD-Net.
▪ The original ELD-Net is built on VGG-Net.
▪ The extended version uses GoogleNet instead.
14. 4. Nonlinear combination of depth features
✓ As described in the previous sections, the low-level depth features and the BED feature must be combined to detect salient objects.
✓ In order to combine these features effectively, we use three convolution layers to create the depth feature outputs (a sketch follows below).
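A minimal PyTorch sketch of such a three-layer convolutional combiner is shown below. The channel counts, kernel sizes, and ReLU activations are illustrative assumptions; the paper's exact Caffe configuration may differ.

```python
import torch
import torch.nn as nn

class DepthFeatureCombiner(nn.Module):
    """Three convolution layers that fuse stacked depth feature maps
    (BED + low-level depth features) into one depth saliency map."""
    def __init__(self, in_channels):
        super().__init__()
        self.combine = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),  # 1 x 20 x 20 depth output
        )

    def forward(self, depth_features):
        return self.combine(depth_features)

# Example: 10 stacked depth feature maps on the 20 x 20 super-pixel grid
# (the channel count is a placeholder, not taken from the paper).
features = torch.randn(1, 10, 20, 20)
depth_out = DepthFeatureCombiner(10)(features)  # -> shape (1, 1, 20, 20)
```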
15. 5. Concatenation of Color and Depth Features
1) We make use of the pretrained Caffe model of ELD-Net.
2) The calculated 1×20×20 color feature layer is concatenated with the depth feature outputs.
3) We then connect the 1×20×20 + 1×20×20 concatenated output features with two fully connected layers and calculate the saliency score for the focused super-pixel.
4) We calculate the cross entropy loss for a softmax classifier to evaluate the outputs; in its standard form, $E = -\sum_j y_j \log p_j$, where $p$ is the softmax output and $y$ the ground-truth label (a sketch of this head follows below).
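A minimal PyTorch sketch of this concatenation-and-scoring head is shown below. The two 1×20×20 inputs match the slide; the hidden width of the fully connected layers and the two-class (salient / non-salient) formulation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SaliencyHead(nn.Module):
    """Concatenate color and depth feature maps, then score the focused
    super-pixel with two fully connected layers."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * 20 * 20, 300),  # first fully connected layer
            nn.ReLU(inplace=True),
            nn.Linear(300, 2),            # salient vs. non-salient logits
        )

    def forward(self, color_feat, depth_feat):
        x = torch.cat([color_feat, depth_feat], dim=1)  # (N, 2, 20, 20)
        return self.fc(x.flatten(1))

head = SaliencyHead()
logits = head(torch.randn(4, 1, 20, 20), torch.randn(4, 1, 20, 20))
labels = torch.randint(0, 2, (4,))            # ground-truth saliency labels
loss = nn.CrossEntropyLoss()(logits, labels)  # softmax cross entropy
```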
16. 5.1. Preprocessing on depth and color images
✓ Synchronize the scale of depth values with color values.
✓ Normalize the depth values to the same 0 to 255 scale.
• Depth values in RGBD1000 are represented with greater bit depth and so require normalization.
• On NJUDS2000 the depth values are already in the 0-255 range, and so are not modified.
✓ After normalization, we resize the color and depth images to 324×324.
➢ [Figures: examples of depth normalization and resizing] A sketch of these steps follows below.
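A minimal OpenCV sketch of the preprocessing, under the assumption that min-max scaling is used for the normalization (the slide does not specify the method or library):

```python
import numpy as np
import cv2

def preprocess(color, depth, size=(324, 324)):
    """Normalize depth to 0-255 and resize both modalities to 324 x 324."""
    depth = depth.astype(np.float32)
    if depth.max() > 255:  # e.g. RGBD1000's higher-bit-depth depth maps
        span = max(float(depth.max() - depth.min()), 1e-6)
        depth = 255.0 * (depth - depth.min()) / span
    # NJUDS2000 depth maps are already in [0, 255] and pass through unchanged.
    color = cv2.resize(color, size, interpolation=cv2.INTER_LINEAR)
    depth = cv2.resize(depth, size, interpolation=cv2.INTER_LINEAR)
    return color, depth.astype(np.uint8)
```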
17. Superpixel Segmentation
✓ We use gSLICr, the GPU version of SLIC, to segment the images into super-pixels.
✓ We divide each image into approximately 18 × 18 super-pixels (a CPU stand-in sketch follows below).
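For illustration, scikit-image's CPU SLIC can stand in for gSLICr; an 18 × 18 grid corresponds to roughly 324 segments. The compactness value is an assumption, not taken from the paper.

```python
import numpy as np
from skimage.segmentation import slic

image = np.random.rand(324, 324, 3)  # placeholder RGB image in [0, 1]
segments = slic(image, n_segments=18 * 18, compactness=10.0, start_label=0)
print(segments.shape, segments.max() + 1)  # label map and segment count
```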
21. Discussion
Table 1. Comparing average F-measure scores with other state-of-the-art saliency methods on two datasets.
Table 2. Replacing the super-pixel histogram with mean depth improves results for NJUDS2000, where the depth data is noisy.
Table 3. Comparing scores with different input features on RGBD1000. Note that LD means Low-level Depth Features.
Table 4. Comparing scores with different input features on NJUDS2000. Note that LD means Low-level Depth Features.
24. Conclusion
✓ As the results show, the proposed CNN-based method, combining background enclosure, depth contrast, and top-down and bottom-up information, achieved the best results in comparison with the other available methods.
25. Performance improvements (my suggestion)
▪ Because the architectures used for high-level feature extraction, VGG-Net and GoogleNet, have their disadvantages, it is recommended to try other architectures such as YOLO or R-CNN to improve speed and overall performance.
Any questions?