Batch Normalization
Accelerating Deep Network Training by
Reducing Internal Covariate Shift
By: Seraj Alhamidi
Instructor: Associate Prof. Mohammed Alhanjouri
June 2019
About the paper
Sergey Ioffe
Google Inc.,
sioffe@google.com
Christian Szegedy
Google Inc.,
szegedy@google.com
Authors
The 32nd International Conference on Machine
Learning (2015)
presented
publishers
https://ai.google/research/pubs/pub43442
Journal of Machine Learning Research
http://jmlr.org/proceedings/papers/v37/ioffe15.pdf
Cornell university
https://arxiv.org/abs/1502.03167
Paper with over 6,000 citations, presented at ICML 2015
Outlines
οƒ˜Introduction
οƒ˜Issues with Training Deep Neural Networks
οƒ˜Batch Normalization
οƒ˜Ablation Study
οƒ˜Comparison with the State of the art Approaches
οƒ˜Some notes
οƒ˜My work
Introduction
ILSVRC Competition in 2015
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC), sponsored
by Google and Facebook
ImageNet is a dataset of over 15 million labelled high-resolution images in
around 22,000 categories, used for classification and localization tasks
ILSVRC uses a subset of ImageNet with 1000 categories.
On 6 Feb 2015, Microsoft proposed PReLU-Net, with a 4.94% error rate,
surpassing the human error rate of 5.1%
Five days later, on 11 Feb 2015, Google proposed BN-Inception, with a 4.8%
error rate.
BN-Inception also reaches the previous best accuracy in only 7% of the training
time needed to reach that accuracy without BN
Issues with Training Deep Neural Networks
Vanishing Gradient
Saturating nonlinearities (like tanh or sigmoid) cannot easily be used for deep
networks.
As an example, consider the sigmoid function and its derivative: when the input
to the sigmoid becomes very large or very small, the derivative becomes
close to zero.
(Figure: the sigmoid function and its derivative.) Backpropagation algorithm update rule:
 𝑀 πœ… + 1 = 𝑀 πœ… βˆ’ 𝛼
πœ•πΏ
πœ•π‘€
 𝐿 = 0.5 𝑑 βˆ’ π‘Ž 2 𝑑 ∢ π‘‘π‘Žπ‘Ÿπ‘”π‘’π‘‘ , π‘Ž ∢ π‘œπ‘’π‘‘π‘π‘’π‘‘ π‘œπ‘“ π‘Žπ‘π‘‘π‘–π‘£π‘Žπ‘‘π‘–π‘œπ‘›
 π‘Ž 𝑙
= 𝜎 π‘₯ 𝑙
𝜎 ∢ π‘Žπ‘π‘‘π‘–π‘£π‘Žπ‘‘π‘–π‘œπ‘› 𝑧 ∢ 𝑖𝑛𝑝𝑒𝑑 π‘‘π‘œ π‘Žπ‘π‘‘π‘–π‘£π‘Žπ‘‘π‘–π‘œπ‘›
 π‘₯ 𝑙 = 𝑀𝑖,𝑗 βˆ— π‘Ž π‘™βˆ’1 + 𝑀𝑖+1 ,𝑗 βˆ— π‘Ž π‘™βˆ’1 + β‹―

πœ•πΏ
πœ•π‘€
=
πœ•πΏ
πœ•π‘Ž
.
πœ•π‘Ž
πœ•π‘₯
.
πœ•π‘₯
πœ•π‘€
≑
πœ•πΏ
πœ•π‘Ž
. 𝜎 π‘₯ 𝑙 .
πœ•π‘₯
πœ•π‘€
Issues with Training Deep Neural Networks
Vanishing Gradient
(Figures: sigmoid function with restricted inputs; rectified linear unit $f(x) = x^{+} = \max(0, x)$.)
Some ways around this are:
 batch normalization layers, which can also resolve the issue
 nonlinearities like the Rectified Linear Unit (ReLU), which do not saturate
 smaller learning rates
 careful weight initialization
Issues with Training Deep Neural Networks
Internal Covariate shift
 Covariate – the features of the input data
 Covariate Shift – the change in the distribution of the inputs to layers in the middle of a
deep neural network is referred to by the technical name β€œinternal covariate shift”.
Neural networks learn efficiently when the distribution fed to their layers is roughly:
Zero-centered
Constant through time and data
That is, the distribution of the data being fed to the layers should not vary too much across
the mini-batches fed to the network.
Issues with Training Deep Neural Networks
Internal Covariate shift in deep NN
(Figure: the layer's input distribution at iterations i, i+1, and i+2.)
Every iteration the layer sees a new
relation (distribution), especially in deep layers and
at the beginning of training.
Issues with Training Deep Neural Networks
Take one layer from the internal layers.
Assume its input x has the distribution given
below, and suppose the function learned by the layer is
represented by the dashed line.
Now suppose that, after the gradient
update, the distribution of x
changes to something different:
the loss for this mini-batch is higher than
the previous loss.
Issues with Training Deep Neural Networks
 Every update effectively forces layer (l) to fit a new mapping
 because we give it a new distribution every time
 In deep layers this can act like a butterfly effect: even though the change in π’˜ at the
first layers is small, it shifts the deeper distributions and makes the network unstable
Batch Normalization
 Batch Normalization normalizes a batch of inputs before
they are fed to a non-linear activation unit (like ReLU, sigmoid, etc.) during
training
 so that the input to the activation function across each training batch has a
mean of 0 and a variance of 1
 applying batch normalization to the activation Οƒ(Wx + b) results
in Οƒ(BN(Wx + b)), where BN is the batch normalizing transform
Batch Normalization
To make each dimension unit Gaussian, we apply:
$\hat{x}^{(k)} = \dfrac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$
where $E[x^{(k)}]$ and $Var[x^{(k)}]$ are respectively the mean and variance of the $k$-th
feature over a batch. Then we transform $\hat{x}^{(k)}$ as:
$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$
where $\gamma$ and $\beta$ are the learnable parameters of the so-called batch
normalization layer, and $k$ indexes the feature (dimension) within a mini-batch
Batch Normalization
Transformation of inputs
Forward Propagation through the Batch Normalization layer
So far we have shown the normalization of multiple samples of just one feature.
Input: values of $x$ over a mini-batch $B = \{x_1 \dots x_m\}$; parameters to be learned $\gamma, \beta$
Output: $y_i = BN_{\gamma,\beta}(x_i)$
Flow of computation through Batch Normalization layer
$\mu_B = \dfrac{1}{m}\sum_{i=1}^{m} x_i$
$\sigma_B^2 = \dfrac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$
$\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
$y_i = \gamma \hat{x}_i + \beta = BN_{\gamma,\beta}(x_i)$
$\epsilon$ is a small value (e.g. $1 \times 10^{-8}$) that prevents division by zero
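A minimal NumPy sketch of this forward computation over a mini-batch of shape (m, n_features); the function name batchnorm_forward and the example values are my own, and Ξ΅ uses the small constant mentioned above.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-8):
    """Batch-normalize x of shape (m, n_features), feature by feature."""
    mu = x.mean(axis=0)                      # mu_B, per feature
    var = x.var(axis=0)                      # sigma_B^2, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized activations
    y = gamma * x_hat + beta                 # scale and shift
    cache = (x, x_hat, mu, var, gamma, eps)  # kept for the backward pass
    return y, cache

# Example: a mini-batch of 4 samples with 2 features
x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
gamma, beta = np.ones(2), np.zeros(2)
y, _ = batchnorm_forward(x, gamma, beta)
print(y.mean(axis=0), y.var(axis=0))   # ~0 mean and ~1 variance per feature
```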
Forward Propagation through the Batch Normalization layer
(Figure: the same computation with TWO features.)
THE MAGIC
Imagine that the network learns that the optimum that minimizes
the cost is to cancel the BN effect!
Forward Propagation through the Batch Normalization layer
$\beta = E[x] = \mu_B$
$\gamma = \sqrt{Var[x]} = \sqrt{\sigma_B^2 + \epsilon}$
$\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
$y_i = \gamma \hat{x}_i + \beta = \sqrt{\sigma_B^2 + \epsilon}\cdot\dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \mu_B = x_i$
Identity transform
$\gamma, \beta$ adapted by SGD
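A short standalone check of this cancellation case (my own sketch): if SGD drives Ξ³ to √(Οƒ_BΒ² + Ξ΅) and Ξ² to ΞΌ_B, the BN layer reproduces its input exactly.

```python
import numpy as np

x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
eps = 1e-8
mu, var = x.mean(axis=0), x.var(axis=0)

gamma = np.sqrt(var + eps)   # gamma = sqrt(Var[x])
beta = mu                    # beta  = E[x]

x_hat = (x - mu) / np.sqrt(var + eps)
y = gamma * x_hat + beta     # = x_i : the identity transform
print(np.allclose(y, x))     # True -> BN can be "undone" if that is optimal
```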
Backpropagation through Batch Normalization layer
𝒴𝑖 = 𝛾 π‘₯ + 𝛽 = 𝐡𝑁𝛾,𝛽(π‘₯𝑖)
𝝏𝑳
ππ“¨π’Š
𝝏𝑳
𝝏 π’™π’Š
πœ•L
πœ•Ξ³i
𝝏𝑳
𝝏𝜷
𝝏𝑳
𝝏𝝈 𝑩
πŸππ‘³
𝝏𝝁 𝑩
𝝏𝑳
𝝏𝒙 π’Š
SGD
𝜷 π’Œ + 𝟏 = 𝜷 𝜿 βˆ’ 𝜢
𝝏𝑳
𝝏𝜷
𝜸 π’Œ + 𝟏 = 𝜸 𝜿 βˆ’ 𝜢
𝝏𝑳
𝝏𝜸
𝝁 𝑩 =
𝟏
π’Ž
π’Š=𝟏
π’Ž
π’™π’Š
𝝈 𝑩
𝟐
=
𝟏
π’Ž
π’Š=𝟏
π’Ž
π’™π’Š βˆ’ 𝝁 𝑩
𝟐
π’™π’Š=
π’™π’Š βˆ’ 𝝁 𝑩
𝝈 𝑩
𝟐
+ 𝝐
π“¨π’Š= 𝜸 𝒙 + 𝜷 = 𝑩𝑡 𝜸,𝜷(π’™π’Š)
Batch Normalization during test time
At test time we use the population statistics rather than the mini-batch statistics. Effectively, we process mini-
batches of size $m$ during training and use their statistics to compute:
$E[x^{(k)}] = E_B[\mu_B]$
$Var[x^{(k)}] = \dfrac{m}{m-1}\, E_B[\sigma_B^2]$
Alternatively, we can use an exponential moving average to estimate the mean and variance to be
used during test time; we estimate the running average of mean and variance as:
$\mu_{running} = \alpha\cdot\mu_{running} + (1-\alpha)\cdot\mu_B$
$\sigma_{running}^2 = \alpha\cdot\sigma_{running}^2 + (1-\alpha)\cdot\sigma_B^2$
where $\alpha$ is a constant smoothing factor between 0 and 1 that represents the degree
of dependence on the previous observations.
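A small sketch of how these running statistics could be tracked during training and reused at inference (my own illustration; the class name and defaults are assumptions, and Ξ± plays the role of the smoothing factor above, often called momentum in frameworks).

```python
import numpy as np

class RunningStats:
    """Exponential moving average of batch mean/variance for BN inference."""
    def __init__(self, n_features, alpha=0.9):
        self.alpha = alpha
        self.mu = np.zeros(n_features)
        self.var = np.ones(n_features)

    def update(self, batch):          # call once per training mini-batch
        mu_b = batch.mean(axis=0)
        var_b = batch.var(axis=0)
        self.mu = self.alpha * self.mu + (1 - self.alpha) * mu_b
        self.var = self.alpha * self.var + (1 - self.alpha) * var_b

    def normalize(self, x, gamma, beta, eps=1e-8):   # used at test time
        x_hat = (x - self.mu) / np.sqrt(self.var + eps)
        return gamma * x_hat + beta
```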
Ablation Study
MNIST dataset
28Γ—28 binary image as input, 3 fully connected (FC) hidden layers with 100 activations each,
the last hidden layer followed by 10 activations since there are 10 digits; the loss is the
cross-entropy loss.
The BN network is much more stable
Ablation Study
ImageNet with 1000 categories on GoogLeNet/Inception (2014),
weighing 138 GB for the training images, 6.3 GB for the validation images, and 13 GB for
the testing images
CNN architectures tested
Comparison with the State of the art Approaches
Some Notes : cross-entropy loss function
We use cross-entropy loss function
neural network (1)
Computed | targets | correct?
------------------------------------------------
0.3 0.3 0.4 | 0 0 1 (democrat) | yes
0.3 0.4 0.3 | 0 1 0 (republican) | yes
0.1 0.2 0.7 | 1 0 0 (other) | no
neural network (2)
Computed | targets | correct?
------------------------------------------------
0.1 0.2 0.7 | 0 0 1 (democrat) | yes
0.1 0.7 0.2 | 0 1 0 (republican) | yes
0.3 0.4 0.3 | 1 0 0 (other) | no
cross-entropy error for the first training item of network (1)
βˆ’( (ln(0.3) βˆ— 0) + (ln(0.3) βˆ— 0) + (ln(0.4) βˆ— 1) ) = βˆ’ln(0.4)
average cross-entropy error (ACE) for network (1)
βˆ’(ln(0.4) + ln(0.4) + ln(0.1)) / 3 = 1.38
average cross-entropy error (ACE) for network (2)
βˆ’(ln(0.7) + ln(0.7) + ln(0.3)) / 3 = 0.64
mean squared error for the first item
(0.3 βˆ’ 0)^2 + (0.3 βˆ’ 0)^2 + (0.4 βˆ’ 1)^2 = 0.54
the MSE for the first neural network is
(0.54 + 0.54 + 1.34) / 3 = 0.81
The MSE for the second, better, network is
(0.14 + 0.14 + 0.74) / 3 = 0.34
Cross-entropy separates the two networks more clearly than MSE: (1.38 βˆ’ 0.64 = 0.74) > (0.81 βˆ’ 0.34 = 0.47)
The ln() function in cross-entropy takes into account the closeness of a prediction
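The slide's arithmetic can be reproduced with a few lines of Python (my own check, using the computed/target tables above):

```python
import numpy as np

computed_1 = np.array([[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]])
computed_2 = np.array([[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]])
targets    = np.array([[0, 0, 1],       [0, 1, 0],       [1, 0, 0]])

def ace(pred, t):   # average cross-entropy error over the three items
    return -np.mean(np.sum(t * np.log(pred), axis=1))

def mse(pred, t):   # per-item squared error, averaged over the three items
    return np.mean(np.sum((pred - t) ** 2, axis=1))

print(ace(computed_1, targets), ace(computed_2, targets))   # ~1.38 and ~0.64
print(mse(computed_1, targets), mse(computed_2, targets))   # ~0.81 and ~0.34
```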
Some Notes : Convolutional Neural Network (CNN)
 CNNs are used for the task of image classification
 The first convolutional network that could recognize handwritten digits was
developed between 1988 and 1993 at Bell Labs
Some Notes : Convolutional Neural Network (CNN)
Convolution Layer
(Conv Layer)
Pooling Layer ReLU Layer
Fully Connected
Layer (Flatten)
Some Notes : Convolutional Neural Network (CNN)
Convolution Layer (Conv Layer)
convolution works by sliding a window across the input
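A bare-bones sketch of that sliding-window idea for a single 2D filter (my own illustration; no padding or stride handling, and the kernel is a hypothetical vertical-edge filter):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` and return the valid output
    (cross-correlation, as commonly used in CNN layers)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + kH, j:j + kW]     # current receptive field
            out[i, j] = np.sum(window * kernel)    # element-wise product + sum
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)     # simple vertical-edge kernel
print(conv2d(image, edge_filter))                   # 3x3 feature map
```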
Some Notes : Convolutional Neural Network (CNN)
2D filters
Some Notes : Convolutional Neural Network (CNN)
3D filters
Pooling Layer (Sub-sampling or Down-sampling)
 reduce the size of feature maps by applying a function such as the average or the maximum
(hence called down-sampling)
 make the extracted features more robust by making them more invariant to scale and
orientation changes.
ReLU Layer
Removes all the black (negative) elements from the feature map, keeping only those carrying a positive value
(the grey and white colours)
Output = Max(zero, Input)
Its purpose is to introduce non-linearity into our ConvNet
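A small NumPy sketch of these two operations, 2Γ—2 max-pooling followed by element-wise ReLU (my own illustration; real ConvNets often apply ReLU before pooling, the order here just follows the slides):

```python
import numpy as np

def max_pool2x2(feature_map):
    """Down-sample by taking the maximum of each non-overlapping 2x2 block."""
    H, W = feature_map.shape
    fm = feature_map[:H - H % 2, :W - W % 2]            # crop to even size
    return fm.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def relu(x):
    """Keep positive values, set negative ('black') values to zero."""
    return np.maximum(0.0, x)

fm = np.array([[-1.0,  2.0,  0.5, -3.0],
               [ 4.0, -0.5,  1.0,  2.5],
               [-2.0,  0.0,  3.0, -1.0],
               [ 1.5,  2.0, -4.0,  0.5]])
print(relu(max_pool2x2(fm)))   # 2x2 map with only non-negative entries
```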
Fully Connected Layer (Flatten)
http://scs.ryerson.ca/~aharley/vis/conv/flat.html
MY WORK οƒ  MNIST on google colab
οƒ˜Inputs = 28*28 = 784
οƒ˜Layer 1&2 = 100 nodes | Layer 3 = 10 nodes
οƒ˜All Activations are sigmoid
οƒ˜Cross-entropy loss function
The train and test sets are already split in TensorFlow.
The figure shows the distribution over time of the inputs to the sigmoid function of the first five
neurons in the second layer. Batch normalization has a visible and significant effect of
removing variance/noise in these inputs.
Final accuracy: 99%
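A hedged Keras sketch of the network described above (784 inputs, two hidden layers of 100 nodes, 10 outputs, sigmoid activations, cross-entropy loss, BN inserted before each hidden activation). The layer sizes come from the slides; the optimizer, epoch count, and the softmax output layer (used so the cross-entropy is well defined) are my assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

def build_model(use_bn=True):
    model = models.Sequential([layers.Input(shape=(784,))])
    for units in (100, 100):                        # two hidden layers of 100 nodes
        model.add(layers.Dense(units))
        if use_bn:
            model.add(layers.BatchNormalization())  # normalize the pre-activations
        model.add(layers.Activation("sigmoid"))
    model.add(layers.Dense(10, activation="softmax"))  # 10 digit classes
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

build_model(use_bn=True).fit(x_train, y_train, epochs=5,
                             validation_data=(x_test, y_test))
```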
MY WORK οƒ  Caltech dataset
We use ten (10) classes from the Caltech dataset instead of the ImageNet dataset because of its
huge size; the task is to classify the input image, with a 1/3, 1/3, 1/3 split into train, validation,
and test sets.
Three runs were compared (see figure):
With BN: #epochs = 150, LR = 1Γ—10⁻³
Without BN: #epochs = 150, LR = 1Γ—10⁻³
Without BN: #epochs = 250, LR = 1Γ—10⁻³
Final accuracies across the runs: 90.54%, 94.44%, 96.04%
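A hedged sketch of how a small CNN with and without BN could be set up for the 10-class Caltech experiment. The slides only give the learning rate and epoch counts; the architecture, image size, and directory path below are assumptions, not the actual network used:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(num_classes=10, use_bn=True, input_shape=(128, 128, 3)):
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters in (32, 64, 128):                   # three assumed conv blocks
        model.add(layers.Conv2D(filters, 3, padding="same"))
        if use_bn:
            model.add(layers.BatchNormalization())
        model.add(layers.Activation("relu"))
        model.add(layers.MaxPooling2D(2))
    model.add(layers.Flatten())
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),  # LR = 1e-3 as on the slide
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Hypothetical directory with 10 Caltech classes, split 1/3 train / 1/3 val / 1/3 test:
# train_ds = tf.keras.utils.image_dataset_from_directory("caltech10/train",
#                                                        image_size=(128, 128))
model = build_cnn(use_bn=True)
model.summary()
```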
Thanks for listening!