Batch normalization presentation
1. Batch Normalization
Accelerating Deep Network Training by
Reducing Internal Covariate Shift
By: Seraj Alhamidi
Instructor: Associate Prof. Mohammed Alhanjouri
June 2019
2. About the paper
Authors:
Sergey Ioffe, Google Inc., sioffe@google.com
Christian Szegedy, Google Inc., szegedy@google.com
Presented at:
The 32nd International Conference on Machine Learning (ICML 2015)
Publishers:
Google AI: https://ai.google/research/pubs/pub43442
Journal of Machine Learning Research: http://jmlr.org/proceedings/papers/v37/ioffe15.pdf
Cornell University (arXiv): https://arxiv.org/abs/1502.03167
A paper with over 6,000 citations (ICML 2015).
4. Introduction
ILSVRC Competition in 2015
• The ImageNet Large Scale Visual Recognition Challenge (ILSVRC), sponsored
by Google and Facebook
• ImageNet is a dataset of over 15 million labelled high-resolution images in
around 22,000 categories, used for classification and localization tasks.
• ILSVRC uses a subset of ImageNet with 1,000 categories.
• On 6 Feb 2015, Microsoft proposed PReLU-Net, which has a 4.94% error
rate and surpasses the human error rate of 5.1%.
• Five days later, on 11 Feb 2015, Google proposed BN-Inception, which has a 4.8%
error rate.
• BN-Inception reaches the previous best accuracy in about 7% of the training
steps needed to reach the same accuracy without batch normalization.
5. Issues with Training Deep Neural Networks
Vanishing Gradient
• Saturating nonlinearities (like tanh or sigmoid) cannot be used for deep
networks.
• As an example, consider the sigmoid function and its derivative: when the input to
the sigmoid becomes very large or very small, the derivative becomes
close to zero.
[Figure: the sigmoid function and its derivative]
Backpropagation update rule:
• w(t+1) = w(t) − α · ∂L/∂w
• L = 0.5 (t − y)²,  where t is the target and y is the output of the activation
• y_j = σ(x_j),  where σ is the activation and x_j is the input to the activation
• x_j = w_{i,j} · y_i + w_{i+1,j} · y_{i+1} + ⋯  (the y terms come from the previous layer l−1)
• ∂L/∂w = (∂L/∂y) · (∂y/∂x) · (∂x/∂w) ≡ (∂L/∂y) · σ′(x_j) · (∂x/∂w)
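A minimal NumPy sketch (illustrative, not from the slides) showing how the sigmoid derivative σ′(x) = σ(x)(1 − σ(x)) collapses toward zero for large |x|; this is the ∂y/∂x term that shrinks the gradient as it is multiplied through many layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # sigma'(x) = sigma(x) * (1 - sigma(x))

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  sigma'(x) = {sigmoid_derivative(x):.6f}")
# x =   0.0  sigma'(x) = 0.250000
# x =   2.0  sigma'(x) = 0.104994
# x =   5.0  sigma'(x) = 0.006648
# x =  10.0  sigma'(x) = 0.000045
```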
6. Issues with Training Deep Neural Networks
Vanishing Gradient
[Figure: sigmoid function with restricted inputs]
Rectified linear units: f(x) = x⁺ = max(0, x)
Some ways around this are to use:
• Batch normalization layers, which can also resolve the issue
• Nonlinearities like rectified linear units (ReLU), which do not saturate
• Smaller learning rates
• Careful weight initialization
7. Issues with Training Deep Neural Networks
Internal Covariate Shift
• Covariate: the features of the input data
• Covariate shift: the change in the distribution of the inputs to the layers in the middle of a
deep neural network is referred to by the technical name "internal covariate shift".
Neural networks learn efficiently when the distribution that is fed to the layers of the network is somewhat:
• Zero-centered
• Constant through time and data
• In other words, the distribution of the data being fed to the layers should not vary too much across
the mini-batches fed to the network.
8. Issues with Training Deep Neural Networks
Internal Covariate shift in deep NN
[Figure: the same layer's input distribution at iteration i, iteration i+1, and iteration i+2]
Every time there is a new relation (distribution), especially in the deep layers and
at the beginning of training.
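A hypothetical sketch of what this drift looks like in numbers (the toy weights, the fixed mini-batch, and the noise used as a stand-in for gradient updates are all illustrative assumptions, not the slides' code): as the first layer's parameters change, the distribution of the inputs seen by the second layer changes with them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.02, size=(784, 100))  # weights of layer 1 (keep changing during training)
b1 = np.zeros(100)
x = rng.normal(0.0, 1.0, size=(64, 784))     # one fixed mini-batch of raw inputs

for iteration in range(3):
    h = sigmoid(x @ W1 + b1)                 # layer-1 output = the input seen by layer 2
    print(f"iteration {iteration}: neuron-0 input to layer 2: "
          f"mean = {h[:, 0].mean():.3f}, std = {h[:, 0].std():.3f}")
    # stand-in for a gradient step: as W1 and b1 move, layer 2's input distribution moves too
    W1 += rng.normal(0.0, 0.02, size=W1.shape)
    b1 += rng.normal(0.0, 1.0, size=b1.shape)
```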
9. Issues with Training Deep Neural Networks
Take one of the internal layers.
Assume its input x has the distribution given
below, and suppose the function learned by the layer is
represented by the dashed line.
Now suppose that, after the gradient
update, the distribution of x
changes to something like the one shown:
the loss for this mini-batch is higher than
the previous loss.
10. Issues with Training Deep Neural Networks
• Every time, we force layer (l) to learn a new mapping (design a new perceptron)
• because we feed it a new distribution every time.
• In the deep layers we may get a butterfly effect: even though the change at the
first layers is small, it makes the network unstable.
11. Batch Normalization
• Batch Normalization attempts to normalize a batch of inputs before
they are fed to a non-linear activation unit (like ReLU, sigmoid, etc.) during
training,
• so that the input to the activation function across each training batch has a
mean of 0 and a variance of 1.
• Applying batch normalization to the activation σ(Wx + b) results
in σ(BN(Wx + b)), where BN is the batch normalizing transform.
12. Batch Normalization
To make each dimension unit Gaussian, we apply:
x̂(k) = (x(k) − E[x(k)]) / √(Var[x(k)])
where E[x(k)] and Var[x(k)] are respectively the mean and variance of the k-th
feature over a batch. Then we transform x̂(k) as:
y(k) = γ(k) · x̂(k) + β(k)
where γ and β are the learnable parameters of the so-called batch
normalization layer, and x(k) denotes the k-th feature of the samples in one mini-batch.
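A minimal NumPy sketch of this transform at training time (the function name and the toy batch are mine; ε is the small constant added for numerical stability, as in the paper):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: one mini-batch of activations, shape (m, features)."""
    mu = x.mean(axis=0)                     # per-feature mini-batch mean  E[x(k)]
    var = x.var(axis=0)                     # per-feature mini-batch variance  Var[x(k)]
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    y = gamma * x_hat + beta                # scale and shift with learnable gamma, beta
    return y, mu, var

x = np.random.randn(32, 100) * 3.0 + 5.0    # a batch that is far from zero mean / unit variance
gamma, beta = np.ones(100), np.zeros(100)
y, mu, var = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])   # per-feature mean ~0 and std ~1
```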
18. Batch Normalization during test time
At test time we normalize using the population statistics rather than the mini-batch statistics.
Effectively, we process mini-batches of size m and use their statistics to compute:
E[x(k)] = E_B[μ_B]
Var[x(k)] = (m / (m − 1)) · E_B[σ_B²]
Alternatively, we can use an exponential moving average to estimate the mean and variance to be
used during test time; we estimate the running averages of the mean and variance as:
μ_running = α · μ_running + (1 − α) · μ_B
σ²_running = α · σ²_running + (1 − α) · σ_B²
where α is a constant smoothing factor between 0 and 1 that represents the degree
of dependence on the previous observations.
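A sketch of the test-time side under the same assumptions: running estimates of μ and σ² are accumulated during training with the smoothing factor α, and at inference those fixed estimates replace the per-batch statistics (variable and function names are mine):

```python
import numpy as np

alpha = 0.9                                    # smoothing factor: degree of dependence on the past
mu_running = np.zeros(100)
var_running = np.ones(100)

def update_running_stats(mu_b, var_b):
    """Called once per mini-batch during training."""
    global mu_running, var_running
    mu_running = alpha * mu_running + (1 - alpha) * mu_b
    var_running = alpha * var_running + (1 - alpha) * var_b

def batch_norm_test(x, gamma, beta, eps=1e-5):
    """At test time the fixed population estimates replace the batch statistics."""
    x_hat = (x - mu_running) / np.sqrt(var_running + eps)
    return gamma * x_hat + beta

gamma, beta = np.ones(100), np.zeros(100)
for _ in range(200):                           # simulate training mini-batches
    batch = np.random.randn(32, 100) * 3.0 + 5.0
    update_running_stats(batch.mean(axis=0), batch.var(axis=0))

test_x = np.random.randn(8, 100) * 3.0 + 5.0
print(batch_norm_test(test_x, gamma, beta).mean())   # close to 0 once the estimates have converged
```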
19. Ablation Study
MNIST dataset
28×28 binary images as input; 3 fully connected (FC) hidden layers with 100 activations each;
the last hidden layer is followed by an output layer with 10 activations, since there are 10 digits;
the loss is the cross-entropy loss.
The BN network is much more stable.
20. Ablation Study
ImageNet with 1,000 categories on GoogLeNet/Inception (2014),
weighing 138 GB for the training images, 6.3 GB for the validation images, and 13 GB for
the testing images.
CNN architectures tested
22. Some Notes: cross-entropy loss function
We use the cross-entropy loss function.
neural network (1)
Computed | targets | correct?
------------------------------------------------
0.3 0.3 0.4 | 0 0 1 (democrat) | yes
0.3 0.4 0.3 | 0 1 0 (republican) | yes
0.1 0.2 0.7 | 1 0 0 (other) | no
neural network (2)
Computed | targets | correct?
------------------------------------------------
0.1 0.2 0.7 | 0 0 1 (democrat) | yes
0.1 0.7 0.2 | 0 1 0 (republican) | yes
0.3 0.4 0.3 | 1 0 0 (other) | no
cross-entropy error for the first training item:
−( (ln(0.3) × 0) + (ln(0.3) × 0) + (ln(0.4) × 1) ) = −ln(0.4)
average cross-entropy error (ACE) for the first network:
−(ln(0.4) + ln(0.4) + ln(0.1)) / 3 = 1.38
average cross-entropy error (ACE) for the second network:
−(ln(0.7) + ln(0.7) + ln(0.3)) / 3 = 0.64
mean squared error for the first item:
(0.3 − 0)² + (0.3 − 0)² + (0.4 − 1)² = 0.54
the MSE for the first neural network is:
(0.54 + 0.54 + 1.34) / 3 = 0.81
the MSE for the second, better, network is:
(0.14 + 0.14 + 0.74) / 3 = 0.34
(1.38 − 0.64 = 0.74) > (0.81 − 0.34 = 0.47)
The ln() function in cross-entropy takes into account how close a prediction is, so cross-entropy
separates the better network from the worse one more sharply than MSE does.
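The numbers above can be checked with a few lines of Python (the prediction and target values are copied from the tables; the helper names are mine):

```python
import numpy as np

targets = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
net1    = [[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]]
net2    = [[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]]

def ace(preds, targets):
    # average cross-entropy: -mean over items of sum_k t_k * ln(p_k)
    return -np.mean([np.sum(np.array(t) * np.log(p)) for p, t in zip(preds, targets)])

def mse(preds, targets):
    # squared error per item, averaged over the items
    return np.mean([np.sum((np.array(p) - np.array(t)) ** 2) for p, t in zip(preds, targets)])

print(ace(net1, targets), ace(net2, targets))   # ~1.38 vs ~0.64
print(mse(net1, targets), mse(net2, targets))   # ~0.81 vs ~0.34
```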
23. Some Notes: Convolutional Neural Network (CNN)
• It is used for image classification tasks
• It was developed between 1988 and 1993 at Bell Labs
• It was the first convolutional network that could recognize handwritten digits
27. Some Notes: Convolutional Neural Network (CNN)
2D filters
28. Some Notes: Convolutional Neural Network (CNN)
3D filters
29. Pooling Layer (Sub-sampling or Down-sampling)
• Reduces the size of the feature maps by applying a function such as the average or the maximum
(hence called down-sampling)
• Makes the extracted features more robust by making them more invariant to scale and
orientation changes
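A small NumPy sketch of 2×2 max-pooling with stride 2 (an illustrative implementation, not code from the slides):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Down-sample a 2D feature map by taking the max of each 2x2 block."""
    h, w = feature_map.shape
    cropped = feature_map[:h - h % 2, :w - w % 2]            # crop to an even size
    return cropped.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 3, 2],
               [2, 6, 1, 1]])
print(max_pool_2x2(fm))
# [[4 5]
#  [6 3]]
```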
30. ReLU Layer
Remove all the black elements from the feature map, keeping only those carrying a positive value
(the grey and white colours):
Output = max(0, Input)
This introduces non-linearity into our ConvNet.
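In code, a ReLU layer is just an element-wise maximum with zero, which zeroes out the negative ("black") values of a feature map (a minimal sketch):

```python
import numpy as np

feature_map = np.array([[-2.0, 0.5, -0.1],
                        [ 1.3, -0.7, 2.2]])
relu_out = np.maximum(0.0, feature_map)   # Output = max(0, Input), applied element-wise
print(relu_out)
# [[0.  0.5 0. ]
#  [1.3 0.  2.2]]
```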
33. MY WORK: MNIST on Google Colab
• Inputs = 28*28 = 784
• Layers 1 & 2 = 100 nodes each | Layer 3 = 10 nodes
• All activations are sigmoid
• Cross-entropy loss function
The train and test sets are already split in TensorFlow.
[Figure: the distribution over time of the inputs to the sigmoid function of the first five
neurons in the second layer. Batch normalization has a visible and significant effect of
removing variance/noise from these inputs.]
Final accuracy: 99%
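A sketch of this network in Keras (layer sizes, sigmoid activations, batch normalization before the nonlinearity, and cross-entropy loss follow the slide; the optimizer, batch size, epoch count, and the softmax output layer are my assumptions):

```python
import tensorflow as tf

# MNIST comes pre-split into train and test sets in TensorFlow
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784) / 255.0
x_test = x_test.reshape(-1, 784) / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Dense(100),
    tf.keras.layers.BatchNormalization(),       # normalize before the sigmoid nonlinearity
    tf.keras.layers.Activation("sigmoid"),
    tf.keras.layers.Dense(100),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("sigmoid"),
    tf.keras.layers.Dense(10, activation="softmax"),   # 10 output nodes, one per digit
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",   # cross-entropy loss
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=64, epochs=5,
          validation_data=(x_test, y_test))
```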
34. MY WORK: Caltech dataset
With BN: #epochs = 150, LR = 1×10⁻³
Without BN: #epochs = 150, LR = 1×10⁻³
Without BN: #epochs = 250, LR = 1×10⁻³
final acc: 90.54% | final acc: 94.44% | final acc: 96.04%
We use ten (10) classes from the Caltech dataset instead of the ImageNet
dataset because of its huge size.
The task is to classify the input image.
Split: 1/3 train, 1/3 validation, 1/3 test.
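A possible way to obtain the 1/3 / 1/3 / 1/3 split with scikit-learn (the placeholder arrays stand in for the ten Caltech classes; the slide does not show the actual loading or splitting code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# placeholder data standing in for the 10 selected Caltech classes
X = np.random.rand(300, 224, 224, 3)
y = np.repeat(np.arange(10), 30)

# split off 1/3 for training, then split the remaining 2/3 in half: 1/3 validation, 1/3 test
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=2/3, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 100 100 100
```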