2. • Images of natural scenes contain adjacent pixels that are
statistically highly correlated
• This correlation effect makes network training challenging
because adjacent pixels contain redundant information
• Hubel and Wiesel found that a visual correlation-removal process exists in animal brains
https://www.researchgate.net/figure/Fig-In-the-classic-neuroscience-experiment-Hubel-and-Wiesel-discovered-a-cats-visual_fig1_335707980
Introduction
3. • Network deconvolution
- a decorrelation method to remove both the pixel-wise and
channel-wise correlation at each layer of the network
Introduction
A correlated signal: $b = k * x = Kx$ ($k$: the kernel, $K$: the corresponding convolution matrix)
Removing the correlation effects via: $x = K^{-1} b$ (see the sketch below)
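A minimal sketch of this idea in 1D, assuming a circular (circulant) convolution so that $K$ is square and invertible; the kernel and signal here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal(n)        # the original, uncorrelated signal
k = np.array([0.2, 0.6, 0.2])     # a smoothing kernel that correlates x

# Build the circulant convolution matrix K so that b = k * x = K x
# (circular boundary conditions keep K square and invertible here).
K = np.zeros((n, n))
for i in range(n):
    for j, kv in enumerate(k):
        K[i, (i + j - 1) % n] = kv

b = K @ x                         # the correlated signal
x_rec = np.linalg.solve(K, b)     # remove the correlation: x = K^{-1} b
assert np.allclose(x_rec, x)
```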
4. • Contributions
- Introduce network deconvolution, a decorrelation method to remove both the pixel-wise and channel-wise correlation at each layer of the network.
- Deconvolution can replace batch normalization and yields better model training.
- The transform is optimal from an optimization point of view.
- Deconvolution reduces redundancy in the data, leading to sparse representations.
Introduction
5. Motivations
A linear regression problem with $L_2$ loss:
$\mathrm{Loss}_{L_2} = \frac{1}{2}\|y - \hat{y}\|^2 = \frac{1}{2}\|Xw - \hat{y}\|^2$
($X$: the inputs, $w$: an unknown weight matrix)
One iteration of gradient descent:
$w_{new} = w_{old} - \alpha \frac{1}{N}(X^t X w_{old} - X^t \hat{y})$
($\alpha$: the learning rate)
6. Motivations
An optimal solution (gradient is zero):
$\frac{\partial \mathrm{Loss}_{L_2}}{\partial w} = X^t(Xw - \hat{y}) = 0 \;\Rightarrow\; w = (X^t X)^{-1} X^t \hat{y}$ (Eq. 2)
One iteration of gradient descent:
$w_{new} = w_{old} - \alpha \frac{1}{N}(X^t X w_{old} - X^t \hat{y})$ (Eq. 3)
Proposition 1. Gradient descent converges to the optimal solution in one iteration if $\frac{1}{N} X^t X = I$.
(With $\alpha = 1$, Eq. 3 reduces to $w_{new} = \frac{1}{N} X^t \hat{y} = (X^t X)^{-1} X^t \hat{y}$, which is exactly the optimal solution of Eq. 2; a numerical check follows below.)
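A quick numerical check of Proposition 1 (a minimal sketch with synthetic data; orthonormalizing via QR is just one convenient way to construct features that satisfy $\frac{1}{N} X^t X = I$):

```python
import numpy as np

rng = np.random.default_rng(1)
N, F = 100, 5

# Construct X with (1/N) X^t X = I: orthonormal columns scaled by sqrt(N).
Q, _ = np.linalg.qr(rng.standard_normal((N, F)))
X = np.sqrt(N) * Q
y_hat = rng.standard_normal(N)    # the regression targets

w_opt = np.linalg.solve(X.T @ X, X.T @ y_hat)   # Eq. 2, the optimal solution

# One gradient descent step (Eq. 3) with alpha = 1 from an arbitrary start:
w_old = rng.standard_normal(F)
w_new = w_old - 1.0 / N * (X.T @ X @ w_old - X.T @ y_hat)
assert np.allclose(w_new, w_opt)  # converged in a single iteration
```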
7. Motivations
Proposition 1. Gradient descent converges to the optimal solution in one iteration if $\frac{1}{N} X^t X = I$.
• $\frac{1}{N} X^t X$ is the covariance matrix of the features
- the features should be standardized and uncorrelated with each other
- the more correlated the features are (i.e., the further $\frac{1}{N} X^t X$ is from $I$), the slower the convergence
- the solution to this problem: correct the gradient by a change of coordinates so that in the new space we have $\frac{1}{N} X^t X = I$ (see the sketch below)
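A minimal sketch of that change of coordinates (Cholesky-based whitening is used here for convenience; any transform $W$ with $W^t \mathrm{Cov}\, W = I$ would do, and the paper's deconvolution uses $\mathrm{Cov}^{-1/2}$):

```python
import numpy as np

rng = np.random.default_rng(2)
N, F = 1000, 4

# Correlated features: mix independent columns with a random matrix.
X = rng.standard_normal((N, F)) @ rng.standard_normal((F, F))
X -= X.mean(axis=0)                  # center the features

cov = X.T @ X / N                    # (1/N) X^t X, far from the identity
L = np.linalg.cholesky(cov)          # cov = L L^t
W = np.linalg.inv(L).T               # whitening matrix: W^t cov W = I

X_new = X @ W                        # the change of coordinates
assert np.allclose(X_new.T @ X_new / N, np.eye(F))
# In the new space, gradient descent converges in one step (Proposition 1).
```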
8. The Deconvolution Operation
[Figure] Representing standard conv filtering (x * kernel) as a large matrix multiplication: the data matrix created by im2col, multiplied by the flattened 2D kernel (see the sketch below).
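A minimal im2col sketch for a single-channel 2D input with stride 1 and no padding (names and shapes are illustrative):

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold every kh x kw patch of a 2D array into one row of a matrix."""
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((out_h * out_w, kh * kw))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

rng = np.random.default_rng(3)
x = rng.standard_normal((5, 5))
kernel = rng.standard_normal((3, 3))

# Standard conv filtering (cross-correlation, as in CNNs) becomes one
# matrix multiplication: each output pixel is a patch row times the kernel.
y = im2col(x, 3, 3) @ kernel.ravel()   # 9 patches x 9 weights -> 9 outputs
```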
9. The Deconvolution Operation
• Given a data matrix $X_{N\times F}$, the covariance matrix is $\mathrm{Cov} = \frac{1}{N}(X - \mu)^t (X - \mu)$
• Let $D = \mathrm{Cov}^{-\frac{1}{2}}$ and multiply the centered vectors $(X - \mu)$ by it
Then, the covariance of the transformed matrix $(X - \mu)\cdot D$ is
$\frac{1}{N} D^t (X - \mu)^t (X - \mu) D = \mathrm{Cov}^{-0.5}\cdot\mathrm{Cov}\cdot\mathrm{Cov}^{-0.5} = I$
• That is, the pixel-wise and channel-wise correlation is removed by multiplying by $D$ (see the sketch below)
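A minimal sketch of computing $D$, assuming the inverse square root is taken via an eigendecomposition with a small $\epsilon$ added for numerical stability (the reviews below mention this $\epsilon$ from the paper's Algorithm 1; the paper's actual algorithm is more efficient):

```python
import numpy as np

def deconv_matrix(X, eps=1e-5):
    """Return (mu, D) with D = Cov^{-1/2}, so that (X - mu) @ D is whitened."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / X.shape[0]
    # Inverse square root via eigendecomposition; eps guards tiny eigenvalues.
    vals, vecs = np.linalg.eigh(cov)
    D = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return mu, D

rng = np.random.default_rng(4)
X = rng.standard_normal((2000, 6)) @ rng.standard_normal((6, 6))
mu, D = deconv_matrix(X)
Xw = (X - mu) @ D
# The covariance of the transformed matrix is (approximately) the identity.
assert np.allclose(Xw.T @ Xw / X.shape[0], np.eye(6), atol=1e-2)
```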
10. The Deconvolution Operation
• By the associativity of matrix multiplication, $y = X\cdot D\cdot w = X\cdot(D\cdot w)$
• Therefore, the deconvolution can be carried out implicitly by changing the model parameters ($w \leftarrow D\cdot w$)
• Once training is finished, freeze $D$ to be the running average
• This change of parameters makes the network run faster at test time (see the sketch below)
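A minimal sketch of that parameter folding (the $D$ here is a random stand-in for the frozen running-average $\mathrm{Cov}^{-1/2}$; associativity holds for any fixed $D$):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((64, 6))       # a batch of (already centered) inputs
w = rng.standard_normal((6, 3))        # the layer weights
D = rng.standard_normal((6, 6))        # stand-in for the frozen Cov^{-1/2}

# Training-time view: explicitly whiten the inputs, then apply the weights.
y_explicit = (X @ D) @ w
# Test-time view: fold D into the weights once, skip whitening per batch.
w_folded = D @ w
y_folded = X @ w_folded
assert np.allclose(y_explicit, y_folded)   # X (D w) == (X D) w
```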
17. Conclusion
• Network deconvolution is likely the correct way of training convolutional networks.
• The deconvolution filters resemble the center-surround
structures in animal neurons.
18. • ICLR 2020 Official Blind Review #1 (Rating 8)
- This paper proposes an operation for removing the pixel-wise and channel-wise correlations of input features
- The approach has a well-grounded, neurologically inspired motivation
- It achieves good performance compared with batch normalization
- Providing CPU timings is much appreciated
- The computational cost of im2col with a large kernel (7x7) is extremely high, but this is not shown in the paper
- The argument about sparse representations is not convincing, because showing only two learning curves is not enough
Reviews
19. • ICLR 2020 Official Blind Review #2 (Rating 8)
- Network deconvolution is a generalization of batch normalization that not only whitens per channel, but also removes correlations between channels and across spatial locations.
- What about the dependence on batch size? (see Reply below)
- Are the results sensitive to the epsilon in Algorithm 1 (used in computing the deconvolution matrix)?
- How does this method interact with regularization methods?
Reply: "…, our method works best for batch sizes 128/256 on the CIFAR10/ImageNet datasets. But it also works well with relatively small/large batch sizes. …. When the batch size is tiny, for example 4, we also need to reduce the learning rate to 0.01 to avoid the negative effects of noisy samples - this correction is necessary with batch normalization as well and is not unique to network deconvolution."
Reviews
20. • ICLR 2020 Official Blind Review #3 (Rating 6)
- The concept of the paper is pretty simple and straightforward: basically, it removes the correlation present in the input data, specifically in the case of convolution.
- What about a PCA transformation?
Reply: PCA has a number of issues for whitening:
1) finding the principal axes is slow
2) the transform is not well-defined if several axes have the same variance
Reviews