SHAKE-SHAKE & SHAKE-DROP
Ruijie Quan 2018/07/08
− Helps deep learning practitioners faced with an overfitting problem.
− The idea is to replace the standard summation of parallel branches with a
stochastic affine combination in a multi-branch network.
Motivation: Data Augmentation Techniques
(Figure: data augmentation applied to input images vs. to internal representations)
• Adding noise to the gradient during training helps the training and generalization of complicated neural networks.
• Shake-Shake regularization can be seen as an extension of this concept, where gradient noise is replaced by a form of gradient augmentation.

ResNet with 2 residual branches (see the equations below):
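As given in the Shake-Shake paper, the standard summation of the two residual branches and its stochastic affine replacement are:

```latex
% Standard 2-branch residual block:
x_{i+1} = x_i + \mathcal{F}(x_i, \mathcal{W}_i^{(1)}) + \mathcal{F}(x_i, \mathcal{W}_i^{(2)})

% Shake-Shake, forward training pass, with \alpha_i \sim U(0, 1):
x_{i+1} = x_i + \alpha_i\,\mathcal{F}(x_i, \mathcal{W}_i^{(1)}) + (1 - \alpha_i)\,\mathcal{F}(x_i, \mathcal{W}_i^{(2)})

% Test time: both branches weighted by 0.5:
x_{i+1} = x_i + 0.5\,\mathcal{F}(x_i, \mathcal{W}_i^{(1)}) + 0.5\,\mathcal{F}(x_i, \mathcal{W}_i^{(2)})
```

During the backward pass an independent coefficient β_i ∈ [0, 1] replaces α_i; the Shake/Even/Keep settings below specify how these coefficients are chosen for each pass.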
Shake: all scaling coefficients are overwritten with new random numbers before the pass.
Even: all scaling coefficients are set to 0.5 before the pass.
Keep: we keep, for the backward pass, the scaling coefficients used during the forward pass.
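A minimal PyTorch sketch of the Shake-Shake rule (my own illustration, not the authors' released code; the per-image coefficient shape and the helper names `ShakeShake` and `shake_shake_block` are assumptions): a fresh α is drawn before the forward pass ("Shake"), an independent β before the backward pass ("Shake" again), and both branches are weighted by 0.5 at test time ("Even").

```python
import torch
from torch.autograd import Function

class ShakeShake(Function):
    @staticmethod
    def forward(ctx, branch1, branch2, training=True):
        ctx.training = training
        if training:
            # "Shake": a new random coefficient per image before the forward pass
            alpha = torch.rand(branch1.size(0), 1, 1, 1, device=branch1.device)
        else:
            # "Even": deterministic 0.5 at test time
            alpha = branch1.new_full((1,), 0.5)
        return alpha * branch1 + (1.0 - alpha) * branch2

    @staticmethod
    def backward(ctx, grad_output):
        if ctx.training:
            # "Shake" again: an independent coefficient for the backward pass
            # ("Keep" would instead reuse the forward coefficient here)
            beta = torch.rand(grad_output.size(0), 1, 1, 1, device=grad_output.device)
        else:
            beta = grad_output.new_full((1,), 0.5)
        return beta * grad_output, (1.0 - beta) * grad_output, None


def shake_shake_block(x, branch1, branch2, training=True):
    # branch1 / branch2 are the two residual branches (e.g. Conv-BN-ReLU stacks);
    # the block output is the identity shortcut plus their shaken combination.
    return x + ShakeShake.apply(branch1(x), branch2(x), training)
```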
CORRELATION BETWEEN RESIDUAL BRANCHES
• Question: is the correlation between the 2 residual branches increased or decreased by the regularization?
• The summation at the end of the residual blocks forces an alignment of the layers on the left and right residual branches.
• The correlation between the output tensors of the 2 residual branches seems to be reduced
by the regularization: the regularization forces the branches to learn something different.
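One way to measure this (a rough sketch of the idea only; the paper's exact protocol gathers per-block means, variances, and covariances over the test set): flatten the two branch output tensors and compute their Pearson correlation.

```python
import torch

def branch_correlation(out1: torch.Tensor, out2: torch.Tensor) -> float:
    # Flatten both residual-branch outputs and compute their Pearson correlation.
    x = out1.detach().flatten().float()
    y = out2.detach().flatten().float()
    x = x - x.mean()
    y = y - y.mean()
    return (x @ y / (x.norm() * y.norm() + 1e-12)).item()
```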
SHAKE-SHAKE REGULARIZATION (DRAWBACKS)
(1) Shake-Shake can be applied only to multi-branch architectures (i.e., ResNeXt).
(2) Shake-Shake is not memory efficient.

SIMILAR REGULARIZATION TO SHAKE-SHAKE ON 1-BRANCH NETWORK ARCHITECTURES
• Goal: a disturbance similar to Shake-Shake on a single residual branch.
• Not trivial to realize: a naive perturbation of a single branch is too strong.

STABILIZING LEARNING WITH THE INTRODUCTION OF THE RESDROP MECHANISM
• Shake-Drop disturbs learning more strongly by multiplying even a negative factor to the output of a convolutional layer in the forward training pass (the formulas are given below).
• The learning process is stabilized by employing ResDrop in a different usage from the usual one.
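For reference, the Shake-Drop update (as given in the ShakeDrop paper, with the notation lightly simplified) combines a ResDrop-style Bernoulli gate b_l ~ Bernoulli(p_l) with Shake-Shake-style random scaling, α ~ U[-1, 1] on the forward pass and an independent β ~ U[0, 1] on the backward pass:

```latex
% Training, forward pass:
G(x) = x + (b_l + \alpha - b_l\,\alpha)\,\mathcal{F}(x)

% Training, backward pass: the gradient of \mathcal{F}(x) is scaled by
(b_l + \beta - b_l\,\beta)

% Test time: the expected scaling is used
G(x) = x + \mathbb{E}\!\left[b_l + \alpha - b_l\,\alpha\right]\mathcal{F}(x)
```

When b_l = 1 the block behaves like a plain residual block; when b_l = 0 the branch output is multiplied by α ∈ [-1, 1], i.e. possibly by a negative factor, which is the strong perturbation that the ResDrop-style gate keeps under control.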