2. • Images of natural scenes contain adjacent pixels that are
statistically highly correlated
• This correlation effect makes network training challenging
because adjacent pixels contain redundant information
• Hubel and Wiesel found that a visual correlation-removal process exists in animal brains
https://www.researchgate.net/figure/Fig-In-the-classic-neuroscience-experiment-Hubel-and-Wiesel-discovered-a-cats-visual_fig1_335707980
Introduction
3. • Network deconvolution
- a decorrelation method to remove both the pixel-wise and
channel-wise correlation at each layer of the network
Introduction
A correlated signal: $b = k * x = Kx$ ($k$: the kernel, $K$: the corresponding convolution matrix)
Removing the correlation effects via: $x = K^{-1} b$ (see the sketch below)
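A minimal sketch of this idea in 1D, assuming a circular (circulant) convolution so that $K$ is square and invertible; the kernel and signal here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal(n)        # the original, uncorrelated signal
k = np.array([0.2, 0.6, 0.2])     # a smoothing kernel that correlates x

# Build the circulant convolution matrix K so that b = k * x = K x
# (circular boundary conditions keep K square and invertible here).
K = np.zeros((n, n))
for i in range(n):
    for j, kv in enumerate(k):
        K[i, (i + j - 1) % n] = kv

b = K @ x                         # the correlated signal
x_rec = np.linalg.solve(K, b)     # remove the correlation: x = K^{-1} b
assert np.allclose(x_rec, x)
```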
4. • Contributions
- Introduce network deconvolution, a decorrelation method to remove both the pixel-wise and channel-wise correlation at each layer of the network.
- Deconvolution can replace batch normalization and yields better model training.
- The transform is optimal from an optimization point of view.
- Deconvolution reduces redundancy in the data, leading to sparse representations.
Introduction
5. Motivations
A linear regression problem with $L_2$ loss:
$\mathrm{Loss}_{L_2} = \frac{1}{2}\|y - \hat{y}\|^2 = \frac{1}{2}\|Xw - \hat{y}\|^2$
($X$: the inputs, $w$: an unknown weight matrix)
One iteration of gradient descent:
$w_{new} = w_{old} - \alpha \frac{1}{N}(X^t X w_{old} - X^t \hat{y})$
($\alpha$: the learning rate)
6. Motivations
An optimal solution (gradient is zero):
$\frac{\partial \mathrm{Loss}_{L_2}}{\partial w} = X^t(Xw - \hat{y}) = 0 \;\Rightarrow\; w = (X^t X)^{-1} X^t \hat{y}$ (Eq. 2)
One iteration of gradient descent:
$w_{new} = w_{old} - \alpha \frac{1}{N}(X^t X w_{old} - X^t \hat{y})$ (Eq. 3)
Proposition 1. Gradient descent converges to the optimal solution in one iteration if $\frac{1}{N} X^t X = I$.
(With $\alpha = 1$, Eq. 3 reduces to $w_{new} = \frac{1}{N} X^t \hat{y} = (X^t X)^{-1} X^t \hat{y}$, which is exactly the optimal solution of Eq. 2; a numerical check follows below.)
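A quick numerical check of Proposition 1 (a minimal sketch with synthetic data; orthonormalizing via QR is just one convenient way to construct features that satisfy $\frac{1}{N} X^t X = I$):

```python
import numpy as np

rng = np.random.default_rng(1)
N, F = 100, 5

# Construct X with (1/N) X^t X = I: orthonormal columns scaled by sqrt(N).
Q, _ = np.linalg.qr(rng.standard_normal((N, F)))
X = np.sqrt(N) * Q
y_hat = rng.standard_normal(N)    # the regression targets

w_opt = np.linalg.solve(X.T @ X, X.T @ y_hat)   # Eq. 2, the optimal solution

# One gradient descent step (Eq. 3) with alpha = 1 from an arbitrary start:
w_old = rng.standard_normal(F)
w_new = w_old - 1.0 / N * (X.T @ X @ w_old - X.T @ y_hat)
assert np.allclose(w_new, w_opt)  # converged in a single iteration
```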
7. Motivations
Proposition 1. Gradient descent converges to the optimal solution in one iteration if $\frac{1}{N} X^t X = I$.
• $\frac{1}{N} X^t X$ is the covariance matrix of the features
- the features should be standardized and uncorrelated with each other
- the more correlated the features are (i.e., the further $\frac{1}{N} X^t X$ is from $I$), the slower the convergence
- the solution to this problem: correct the gradient by a change of coordinates so that in the new space we have $\frac{1}{N} X^t X = I$ (see the sketch below)
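A minimal sketch of that change of coordinates (Cholesky-based whitening is used here for convenience; any transform $W$ with $W^t \mathrm{Cov}\, W = I$ would do, and the paper's deconvolution uses $\mathrm{Cov}^{-1/2}$):

```python
import numpy as np

rng = np.random.default_rng(2)
N, F = 1000, 4

# Correlated features: mix independent columns with a random matrix.
X = rng.standard_normal((N, F)) @ rng.standard_normal((F, F))
X -= X.mean(axis=0)                  # center the features

cov = X.T @ X / N                    # (1/N) X^t X, far from the identity
L = np.linalg.cholesky(cov)          # cov = L L^t
W = np.linalg.inv(L).T               # whitening matrix: W^t cov W = I

X_new = X @ W                        # the change of coordinates
assert np.allclose(X_new.T @ X_new / N, np.eye(F))
# In the new space, gradient descent converges in one step (Proposition 1).
```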
8. The Deconvolution Operation
[Figure] Representing standard conv filtering (x * kernel) as a large matrix multiplication: the data matrix created by im2col, multiplied by the flattened 2D kernel (see the sketch below).
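A minimal im2col sketch for a single-channel 2D input with stride 1 and no padding (names and shapes are illustrative):

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold every kh x kw patch of a 2D array into one row of a matrix."""
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((out_h * out_w, kh * kw))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

rng = np.random.default_rng(3)
x = rng.standard_normal((5, 5))
kernel = rng.standard_normal((3, 3))

# Standard conv filtering (cross-correlation, as in CNNs) becomes one
# matrix multiplication: each output pixel is a patch row times the kernel.
y = im2col(x, 3, 3) @ kernel.ravel()   # 9 patches x 9 weights -> 9 outputs
```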
9. The Deconvolution Operation
• Given a data matrix $X_{N\times F}$, the covariance matrix is $\mathrm{Cov} = \frac{1}{N}(X - \mu)^t (X - \mu)$
• Let $D = \mathrm{Cov}^{-\frac{1}{2}}$ and multiply the centered vectors $(X - \mu)$ by it
Then, the covariance of the transformed matrix $(X - \mu)\cdot D$ is
$\frac{1}{N} D^t (X - \mu)^t (X - \mu) D = \mathrm{Cov}^{-0.5}\cdot\mathrm{Cov}\cdot\mathrm{Cov}^{-0.5} = I$
• That is, the pixel-wise and channel-wise correlation is removed by multiplying by $D$ (see the sketch below)
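A minimal sketch of computing $D$, assuming the inverse square root is taken via an eigendecomposition with a small $\epsilon$ added for numerical stability (the reviews below mention this $\epsilon$ from the paper's Algorithm 1; the paper's actual algorithm is more efficient):

```python
import numpy as np

def deconv_matrix(X, eps=1e-5):
    """Return (mu, D) with D = Cov^{-1/2}, so that (X - mu) @ D is whitened."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / X.shape[0]
    # Inverse square root via eigendecomposition; eps guards tiny eigenvalues.
    vals, vecs = np.linalg.eigh(cov)
    D = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return mu, D

rng = np.random.default_rng(4)
X = rng.standard_normal((2000, 6)) @ rng.standard_normal((6, 6))
mu, D = deconv_matrix(X)
Xw = (X - mu) @ D
# The covariance of the transformed matrix is (approximately) the identity.
assert np.allclose(Xw.T @ Xw / X.shape[0], np.eye(6), atol=1e-2)
```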
10. The Deconvolution Operation
• By the associativity of matrix multiplication, $y = X\cdot D\cdot w = X\cdot(D\cdot w)$
• Therefore, the deconvolution can be carried out implicitly by changing the model parameters ($w \leftarrow D\cdot w$)
• Once training is finished, freeze $D$ to be the running average
• This change of parameters makes the network run faster at test time (see the sketch below)
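A minimal sketch of that parameter folding (the $D$ here is a random stand-in for the frozen running-average $\mathrm{Cov}^{-1/2}$; associativity holds for any fixed $D$):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((64, 6))       # a batch of (already centered) inputs
w = rng.standard_normal((6, 3))        # the layer weights
D = rng.standard_normal((6, 6))        # stand-in for the frozen Cov^{-1/2}

# Training-time view: explicitly whiten the inputs, then apply the weights.
y_explicit = (X @ D) @ w
# Test-time view: fold D into the weights once, skip whitening per batch.
w_folded = D @ w
y_folded = X @ w_folded
assert np.allclose(y_explicit, y_folded)   # X (D w) == (X D) w
```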
17. Conclusion
• Network deconvolution is likely the correct way of training convolutional networks.
• The deconvolution filters resemble the center-surround
structures in animal neurons.
18. • ICLR 2020 Official Blind Review #1 (Rating 8)
- This paper proposes an operation for removing the pixel-wise and channel-wise correlations of input features
- The approach has a well-grounded, neurologically inspired motivation
- It achieves good performance compared with batch normalization
- Providing CPU timings is much appreciated
- The computational cost of im2col with a large kernel (7x7) is extremely high, but this is not shown in the paper
- The argument about sparse representations is not convincing, because showing only two learning curves is not enough
Reviews
19. • ICLR 2020 Official Blind Review #2 (Rating 8)
- Network deconvolution is a generalization of batch normalization that not only whitens per channel, but also removes correlations between channels and across spatial locations.
- What about the dependence on batch size? (see Reply below)
- Are the results sensitive to the epsilon in Algorithm 1 (used in computing the deconvolution matrix)?
- How does this method interact with regularization methods?
Reply: "…, our method works best for batch sizes 128/256 on the CIFAR10/ImageNet datasets. But it also works well with relatively small/large batch sizes. …. When the batch size is tiny, for example 4, we also need to reduce the learning rate to 0.01 to avoid the negative effects of noisy samples - this correction is necessary with batch normalization as well and is not unique to network deconvolution."
Reviews
20. • ICLR 2020 Official Blind Review #3 (Rating 6)
- The concept of the paper is pretty simple and straightforward: basically, it removes the correlation present in the input data, specifically in the case of convolution.
- What about a PCA transformation?
Reply: PCA has a number of issues for whitening:
1) finding the principal axes is slow
2) the transform is not well-defined if several axes have the same variance
Reviews