Using deep learning techniques, we train a ResNet-based CNN for single-image depth estimation, combine candidate depth maps through discrete Fourier domain analysis, and present results along with an explanation of the loss function and code snippets.
Single-Image Depth Estimation Using Frequency Domain Analysis and Deep Learning
1. Single-Image Depth Estimation Based on Fourier Domain Analysis
Ahan M R | Ishaant Agarwal | Karthik K | Srijan Nikhar
BITS PILANI K K BIRLA GOA CAMPUS
2. ● The depth map of an image is used for 3-D reconstruction, human pose estimation, and scene recognition.
● Most mainstream estimation algorithms use 3-D stereo images or motion
sequences to calculate the depth map of a scene.
● With the rapid development of deep learning, various attempts have been made to use convolutional neural networks (CNNs) for single-image depth estimation.
3. Main contributions of the paper
● Designs a ResNet-based depth estimation network, built on a convolutional neural network with shortcut (skip) layers.
● Proposes a new depth-balanced Euclidean (DBE) loss that enables more reliable training of the network on both shallow and deep depths.
● This is one of the few works to perform Fourier analysis of the single-image depth estimation problem. It proposes an accurate and reliable scheme to combine multiple depth candidates in the frequency domain.
4. Proposed Algorithm
The proposed algorithm involves four major steps.
1. Training
● Define the loss function.
● Train the weights of the ResNet-152 CNN.
2. Candidate Generation
● Divide images into sub-images using different cropping ratios.
● Combine images from the four corners to form an estimated depth map.
3. Candidate Combination Using Fourier Analysis
● Use Fourier analysis to combine candidates of different cropping ratios and improve overall accuracy.
4. Testing
● Test the CNN on images of a particular dataset and estimate the accuracy.
5. Training Stage
Structure of Res-Net 152
1. ResNet is used as the CNN architecture for depth estimation. A 34-layer model outperforms an 18-layer model, since the deeper network learns more advanced filters and parameters.
2. The skip connections in ResNet solve the vanishing-gradient problem observed in plain deep networks such as VGG16 and AlexNet. We chose this model particularly for the advantages these skip connections provide.
3. All the extracted intermediate features are concatenated and fed into a fully connected layer of 1000 neurons to obtain the estimated depth map.
4. The intermediate shortcut connections also help preserve the context of pixels close to a pixel p in the feedforward network. The formulation reduces to f(x) + x, where the residual f(x) from the nth layer is added to the activation x carried over from the (n−2)th layer, as given as the baseline in the paper.
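The skip connection f(x) + x described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the two convolutional layers inside a real ResNet block are replaced here by a single affine map with a ReLU so the example stays self-contained.

```python
import numpy as np

def conv_block(x, w, b):
    # Stand-in for the convolutional layers inside a ResNet block:
    # a simple affine map followed by ReLU.
    return np.maximum(w @ x + b, 0.0)

def residual_block(x, w, b):
    # Skip connection: the block learns a residual f(x), and the input x
    # is added back unchanged, i.e. output = f(x) + x.
    return conv_block(x, w, b) + x

x = np.array([1.0, -2.0, 3.0])
w = np.eye(3) * 0.5
b = np.zeros(3)
y = residual_block(x, w, b)
# f(x) = ReLU(0.5 * x) = [0.5, 0, 1.5], so y = f(x) + x = [1.5, -2.0, 4.5]
```

Because the identity path bypasses the learned layers, gradients flow directly to earlier layers during backpropagation, which is what mitigates vanishing gradients in very deep networks.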
6. Training Stage
Euclidean Loss Function
1. The loss function usually used in CNN algorithms is the Euclidean loss function.
2. This type of loss function poses a problem for depth estimation: an error at large depth values has a greater impact on the weight parameters than an error of proportionate magnitude at small depth values, as can be seen from the weight-update equation.
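The loss equation did not survive the slide export. The standard per-pixel Euclidean loss presumably intended, with $\hat{d}_i$ the estimated and $d_i$ the ground-truth depth at pixel $i$ over $N$ pixels, is:

```latex
\ell_{\mathrm{euc}} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{d}_i - d_i \right)^2,
\qquad
\frac{\partial \ell_{\mathrm{euc}}}{\partial \hat{d}_i} = \frac{2}{N} \left( \hat{d}_i - d_i \right)
```

The gradient shows the bias: for a fixed relative error $\varepsilon$, the absolute error $|\hat{d}_i - d_i| = \varepsilon\, d_i$ grows with depth, so distant pixels dominate the weight updates.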
7. Training Stage
Improved Loss function (DBE)
1. The paper provides an alternative loss function that eliminates the depth bias present in the simple Euclidean loss function.
2. A quadratic function g(d) is used to balance the loss function: when the value of a2 is negative, the impact of large depth values diminishes, and we get a more reliable, balanced loss function.
3. The values of a1 and a2 can be tuned to give the most optimal results. This form of loss function is known as the depth-balanced Euclidean (DBE) loss.
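One quadratic form consistent with the slide's description (the exact coefficients are given in the paper [1]) is:

```latex
g(d) = a_1 d + a_2 d^2 \quad (a_2 < 0),
\qquad
\ell_{\mathrm{DBE}} = \frac{1}{N} \sum_{i=1}^{N} \left( g(\hat{d}_i) - g(d_i) \right)^2
```

Differentiating, the gradient with respect to $\hat{d}_i$ carries a factor $g'(\hat{d}_i) = a_1 + 2 a_2 \hat{d}_i$, which shrinks as $\hat{d}_i$ grows when $a_2 < 0$; this is exactly the attenuation of large-depth errors that removes the bias of the plain Euclidean loss.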
8. Candidate Generation
● After the training process, we generate depth-map candidates for different cropping ratios.
● The cropping ratio is defined as the ratio of the size of a sub-image to the size of the original image.
● The cropped images from the four corners are passed through the trained CNN and are combined to form depth-map candidates corresponding to different cropping ratios.
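The corner-cropping step above can be sketched as follows. This is a hypothetical helper, not the paper's code: the trained CNN is replaced by an identity mapping, so with overlapping corner crops the averaged candidate simply reproduces the input.

```python
import numpy as np

def corner_crops(image, ratio):
    # Crop four corner sub-images whose side lengths are `ratio`
    # times the original image's dimensions.
    h, w = image.shape[:2]
    ch, cw = int(h * ratio), int(w * ratio)
    return {
        "tl": image[:ch, :cw],
        "tr": image[:ch, w - cw:],
        "bl": image[h - ch:, :cw],
        "br": image[h - ch:, w - cw:],
    }

def combine_corners(depths, out_shape):
    # Paste per-corner depth predictions back into the full frame,
    # averaging wherever the crops overlap.
    acc = np.zeros(out_shape)
    cnt = np.zeros(out_shape)
    h, w = out_shape
    ch, cw = depths["tl"].shape
    slices = {
        "tl": (slice(0, ch), slice(0, cw)),
        "tr": (slice(0, ch), slice(w - cw, w)),
        "bl": (slice(h - ch, h), slice(0, cw)),
        "br": (slice(h - ch, h), slice(w - cw, w)),
    }
    for key, sl in slices.items():
        acc[sl] += depths[key]
        cnt[sl] += 1
    cnt[cnt == 0] = 1  # avoid division by zero in uncovered regions
    return acc / cnt

img = np.random.rand(8, 8)
crops = corner_crops(img, 0.75)
# Identity "network" stands in for the trained CNN in this sketch,
# so the combined candidate equals the original image.
candidate = combine_corners(crops, img.shape)
```

In the real pipeline, each corner crop would be resized, passed through the trained ResNet, and the predicted depths pasted back; repeating this per cropping ratio yields one candidate per ratio.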
9. Candidate Combination using Fourier Analysis.
● Local depth variations captured by the depth-map candidates with small cropping ratios correspond to high-frequency components.
● Overall depth features captured by the depth-map candidates with large cropping ratios correspond to low-frequency components.
● We combine these two complementary kinds of features using the Fourier transforms of the individual candidates.
Notation: f̂m(k) is the estimated kth Fourier component of the mth candidate; f̂(k) is the vector collecting these components across all cropping ratios; f(k) is the kth component of the Fourier transform of the true depth map.
10. Candidate Combination Using Fourier Analysis.
We find the weight factors by minimizing the mean squared error between the combined estimate built from f̂(k) and the true component f(k). The weight matrix Wk that achieves the smallest MSE is the required solution.
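For each frequency component k, minimizing the MSE over training images is an ordinary least-squares problem. The sketch below uses synthetic complex-valued data in place of real training spectra; names like `F_hat` and `w_k` are illustrative, not from the paper's code.

```python
import numpy as np

# Hypothetical training data for a single frequency component k:
# over n training images, the kth Fourier coefficient of each of the
# M candidates (rows of F_hat) and the true coefficient (f_true).
rng = np.random.default_rng(0)
n, M = 200, 3
F_hat = rng.normal(size=(n, M)) + 1j * rng.normal(size=(n, M))
w_true = np.array([0.5, 0.3, 0.2])
f_true = F_hat @ w_true  # noiseless, so least squares recovers w_true

# Least-squares weight vector w_k minimizing the mean squared error
# between the combined estimate F_hat @ w_k and f_true.
w_k, *_ = np.linalg.lstsq(F_hat, f_true, rcond=None)
```

Solving this independently for every k yields the per-frequency weights used to fuse the candidate spectra.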
11. Implementation & Testing.
● As explained previously, we select a1 and a2 such that a1 is large and a2 is negative, to extract deeper depths from the image; this modification of the basic Euclidean loss helps estimate shallow and deep depths more reliably.
● The DBE loss is used as the objective function. After each epoch we reduce the root-mean-squared (RMS) error of the network, and we generate the output depth map after 50-70 epochs.
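The per-epoch RMS error mentioned above is straightforward to compute; a minimal numpy version (hypothetical helper, not the project's code):

```python
import numpy as np

def rmse(pred, target):
    # Root-mean-squared error between predicted and
    # ground-truth depth maps.
    return np.sqrt(np.mean((pred - target) ** 2))

d_true = np.array([[1.0, 2.0], [3.0, 4.0]])
d_pred = d_true + 1.0
err = rmse(d_pred, d_true)  # a constant offset of 1 gives RMSE = 1.0
```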
12. Implementation & Testing.
● The implementation consists of three parts: the depth estimation network, the depth-balanced Euclidean loss, and the Fourier-domain combination model.
● Depth Estimation Network: based on ResNet-152, we modify the last 19 ResNet blocks to extract intermediate features. All extracted features are concatenated and fed into a fully connected layer to estimate the depth map. [1]
● We import the ResNet-152 architecture from the Keras or PyTorch framework and train the weights.
13. Implementation & Testing.
● We convert the depth-map candidates into the frequency domain using the 2-D DFT, and combine the candidates of different cropping ratios in the frequency domain to obtain the depth map generated by the candidates.
● We implemented these functions using the built-in rfft and irfft functions from the PyTorch library, and hence extract and process the features in the frequency domain.
● This part is a crucial step of our project, since it particularly differentiates our approach from other depth estimation methods: we generate cropped sub-images at different ratios and combine their frequency-domain maps.
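The slides use PyTorch's rfft/irfft (an older API name; current PyTorch exposes `torch.fft.rfft2`/`irfft2`). The numpy sketch below shows the same combination idea with one simplification: a single scalar weight per candidate instead of the per-frequency weights w_k, so equal weights reduce to a pixel-wise average by linearity of the DFT.

```python
import numpy as np

def combine_candidates(candidates, weights):
    # Transform each depth-map candidate with the real 2-D FFT,
    # form a weighted sum of the spectra, and invert the transform.
    spectra = [np.fft.rfft2(c) for c in candidates]
    combined = sum(w * s for w, s in zip(weights, spectra))
    shape = candidates[0].shape
    return np.fft.irfft2(combined, s=shape)

a = np.random.rand(8, 8)
b = np.random.rand(8, 8)
# Equal weights reduce to the pixel-wise average, since the DFT is linear.
out = combine_candidates([a, b], [0.5, 0.5])
```

The full method instead applies a learned weight to each frequency bin, so low frequencies can favor large-ratio candidates and high frequencies the small-ratio ones.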
14. Figure: a. input image; b. depth map estimated after 15 epochs; c. depth map estimated after 25 epochs.
15. Figure: a. input image; b. depth map estimated after 15 epochs, capturing much larger depths and multiple objects.
17. Real-life applications and Further improvements.
● Object localization and detection using depth maps can localize objects much more precisely than the bounding-box predictions used by traditional algorithms. The approach would be as fast as semantic segmentation methods while segmenting based on the depth estimate, so the results are more accurate. [3]
● Planetary exploration would also find depth estimation very useful: most satellites capture only a single image and must extract as much information from it as possible. With depth estimation maps, we can help rovers and other space vehicles land safely by determining the geographical topology of different planets and moons.
18. References
1. Jae-Han Lee, Minhyeok Heo, Kyung-Rae Kim, and Chang-Su Kim, "Single-Image Depth Estimation Based on Fourier Domain Analysis," IEEE, 2018.
2. Clément Godard, "Unsupervised Monocular Depth Estimation with Left-Right Consistency," IEEE, 2017.
3. Xiao Lin, "Depth Estimation and Semantic Segmentation from a Single RGB Image Using Hybrid CNNs," 2019.