Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
A PAPER PRESENTATION ON
PUBLISHED IN: International Conference on Machine Learning (ICML)
PUBLISHED ON: July 2015
DATE: 27 DEC, 2024
2.
PRESENTED BY
MD. JAMIL HASAN
ID: 212902029
Dept. of CSE
Green University of Bangladesh
MD. SAIFUL ISLAM RIMON
ID: 213002039
Dept. of CSE
Green University of Bangladesh
MAHJABIN RAHMAN
ID: 213002259
Dept. of CSE
Green University of Bangladesh
3.
A SPECIAL THANKS TO
DR. MUHAMMAD ABUL HASAN
Chairperson
Dept. of ADS
Green University of Bangladesh
Challenges in Training Deep Neural Networks
Key issue: Internal Covariate Shift
The input distribution to a layer changes during training as the parameters of preceding layers update.
Impact on Training
• Requires careful initialization and small learning rates.
• Nonlinearities (e.g., sigmoid) saturate, so gradients vanish and convergence slows.
• The effect is amplified as network depth increases.
7.
Challenges in Training Deep Neural Networks
Optimization Objective: Gradient Descent for a sub-network
• The network optimizes its parameters to minimize the loss:
  Θ = arg min_Θ (1/N) Σᵢ ℓ(xᵢ, Θ)
• Where N is the size of the training dataset and ℓ is the loss function.
• Treating a sub-network on its own, a mini-batch gradient descent step is:
  Θ₂ ← Θ₂ − (α/m) Σᵢ ∂F₂(xᵢ, Θ₂)/∂Θ₂
• Here,
• α is the learning rate
• m is the mini-batch size
• F₂ is the stand-alone sub-network
• x is its input
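To make the update rule concrete, here is a minimal sketch of one mini-batch SGD step (my own illustration, not code from the paper); `grad_loss` is a hypothetical per-example gradient function.

```python
import numpy as np

def sgd_step(theta, minibatch_x, grad_loss, alpha=0.01):
    """One SGD step: theta <- theta - (alpha/m) * sum_i dl(x_i, theta)/dtheta.

    theta       : parameter vector (np.ndarray)
    minibatch_x : array of m training examples
    grad_loss   : hypothetical callable returning the gradient for one example
    alpha       : learning rate
    """
    m = len(minibatch_x)
    grad = sum(grad_loss(x_i, theta) for x_i in minibatch_x) / m
    return theta - alpha * grad
```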
8.
The Problem: Internal Covariate Shift
What Causes Internal Covariate Shift?
• Layers need to adapt to changing input distributions during training.
• Example: In a sub-network with loss ℓ = F₂(F₁(u, Θ₁), Θ₂), changes in Θ₁ shift the distribution of x = F₁(u, Θ₁), the input to F₂.
• Gradients of ℓ vanish if inputs saturate nonlinearities (e.g., the sigmoid activation).
• This slows down training and compounds optimization challenges.
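A quick numeric check (my own sketch, not part of the slides) shows how sharply the sigmoid gradient g′(x) = g(x)(1 − g(x)) shrinks as the input grows:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.0, 2.0, 5.0, 10.0])
g = sigmoid(x)
print(g * (1.0 - g))   # ~[0.25, 0.105, 0.0066, 0.000045] -- the gradient vanishes
```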
9.
Batch Normalization: A Solution to Internal Covariate Shift
• Goal: Stabilize input distributions to accelerate training.
• Batch Normalization:
• Normalizes inputs across a mini-batch.
• Reduces dependency on careful initialization and mitigates
vanishing gradients.
Concept:
• Normalization fixes input mean and variance across batches, improving
gradient flow and optimization stability.
Normalization via Mini-Batch Statistics
• Batch Normalization (BN) normalizes layer inputs using statistics computed over each mini-batch:
  x̂ = (x − E[x]) / √(Var[x] + ϵ)
• Here,
• x: Input to the layer.
• E[x]: Mean of the mini-batch.
• Var[x]: Variance of the mini-batch.
• ϵ: Small constant added for numerical stability.
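A minimal NumPy sketch of this step (an illustration assuming inputs arrive as a mini-batch of shape (m, num_features), not the paper's code):

```python
import numpy as np

def normalize(x, eps=1e-5):
    """x_hat = (x - E[x]) / sqrt(Var[x] + eps), per feature over the mini-batch."""
    mean = x.mean(axis=0)          # E[x]
    var = x.var(axis=0)            # Var[x]
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(32, 100) * 3.0 + 5.0   # inputs with arbitrary mean/variance
x_hat = normalize(x)
print(x_hat.mean(), x_hat.std())           # close to 0 and 1
```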
12.
Training and Inference with Batch-Normalized Networks
Normalization: Each layer's inputs are normalized using mini-batch statistics.
Key Algorithm:
Training Phase:
• Compute the mini-batch mean μᵦ and variance σᵦ².
• Normalize the input: x̂ = (x − μᵦ) / √(σᵦ² + ϵ)
• Scale and shift: y = BNγ,β(x) = γx̂ + β
• Where γ and β are learnable parameters.
Inference Phase:
• Uses running averages of μ and σ² computed during training for consistent performance on unseen data.
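The two phases fit naturally into a single layer object. Below is an illustrative sketch (not the paper's implementation; the momentum value for the running averages is an assumption):

```python
import numpy as np

class BatchNorm:
    """Sketch of 1-D batch normalization with learnable gamma/beta and running stats."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(num_features)     # learnable scale
        self.beta = np.zeros(num_features)     # learnable shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.eps, self.momentum = eps, momentum

    def __call__(self, x, training=True):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)     # mini-batch statistics
            # Keep running averages for the inference phase
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)      # normalize
        return self.gamma * x_hat + self.beta           # scale and shift: y = BN(x)
```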
13.
Batch-Normalized Convolutional Networks
Integration with Convolutional Layers:
• BN can be applied after convolutional layers to stabilize feature distributions (see the sketch after this list).
Impact on Performance:
• Achieves higher accuracy with fewer training steps.
• Facilitates training of deeper architectures by mitigating vanishing/exploding gradients.
Key Benefits:
• Reduces sensitivity to weight initialization.
• Enhances performance of deep networks.
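For convolutional layers, normalization is done jointly over the batch and spatial dimensions so each feature map (channel) gets a single γ, β pair. A short sketch, assuming NCHW-shaped feature maps (my own illustration):

```python
import numpy as np

def batchnorm_conv(x, gamma, beta, eps=1e-5):
    """Batch-normalize conv feature maps with one (gamma, beta) per channel.

    x           : feature maps of shape (N, C, H, W)
    gamma, beta : per-channel scale/shift of shape (C,)
    """
    mean = x.mean(axis=(0, 2, 3), keepdims=True)   # statistics per channel
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]
```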
14.
Algorithm for Training a Batch-Normalized Network
Input:
● A network N with trainable parameters Θ.
● A subset of activations {xᵏ} across layers.
Output:
● A batch-normalized network for inference, N_BN^inf.
Steps:
• Create a Training Version of the Network:
• Transform the network N into a batch-normalized version N_BN^tr by:
• Adding a normalization layer yᵏ = BNγᵏ,βᵏ(xᵏ) to each layer's output xᵏ.
• Modifying each layer that takes xᵏ as input to use yᵏ instead.
• Train the Network:
• Optimize the parameters Θ, along with the BN-specific parameters γᵏ and βᵏ, to minimize the loss function.
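A structural sketch of this transformation (my own illustration): each layer's output xᵏ is routed through BN before the nonlinearity, and γᵏ, βᵏ are optimized together with the weights.

```python
import numpy as np

def bn(x, gamma, beta, eps=1e-5):
    # Normalize over the mini-batch, then scale and shift
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    return gamma * x_hat + beta

def forward(u, layers):
    """Forward pass of a batch-normalized MLP.

    layers: list of dicts with keys 'W', 'gamma', 'beta'
            (gamma/beta are the BN parameters trained alongside W)
    """
    x = u
    for layer in layers:
        x = x @ layer['W']                         # x^k (the bias is subsumed by beta)
        x = bn(x, layer['gamma'], layer['beta'])   # y^k = BN_{gamma^k, beta^k}(x^k)
        x = np.maximum(x, 0.0)                     # the nonlinearity consumes y^k
    return x
```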
15.
Algorithm for Training a Batch-Normalized Network (continued)
Input:
● A network N with trainable parameters Θ.
● A subset of activations {xᵏ} across layers.
Output:
● A batch-normalized network for inference, N_BN^inf.
• Prepare for Inference:
• Freeze the learned parameters γᵏ, βᵏ and compute global statistics for normalization:
• Mean (E[x]): Averaged over all mini-batches.
• Variance (Var[x]): Adjusted using a factor m/(m−1) to account for small-sample bias.
• Replace Batch Normalization in N_BN^inf:
• Use the pre-computed global statistics during inference:
  y = γ · (x − E[x]) / √(Var[x] + ϵ) + β
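Since the statistics are frozen, BN reduces to a fixed affine transform at inference time. A small sketch of this fold (illustrative; it assumes the per-mini-batch means and variances were collected during training):

```python
import numpy as np

def inference_bn(batch_means, batch_vars, m, gamma, beta, eps=1e-5):
    """Return (scale, shift) so that inference-time BN is simply y = scale * x + shift.

    batch_means, batch_vars : per-mini-batch statistics, shape (num_batches, features)
    m                       : mini-batch size used during training
    """
    mean = batch_means.mean(axis=0)                  # E[x] over all mini-batches
    var = (m / (m - 1.0)) * batch_vars.mean(axis=0)  # unbiased variance via m/(m-1)
    scale = gamma / np.sqrt(var + eps)
    shift = beta - scale * mean
    return scale, shift
```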
16.
Batch Normalization and Higher Learning Rates
Higher Learning Rates:
BN allows larger learning rates without divergence.
Resilience to Parameter Scale:
BN decouples parameter scale from gradient propagation, allowing aggressive optimization.
The Equation:
BN(Wu) = BN((aW)u)
• Where:
• W: Layer parameters.
• u: Input to the layer.
• a: A scalar that scales the parameters.
Overall Benefits:
Accelerates convergence and reduces the need for other regularization techniques (e.g., Dropout).
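The identity is easy to verify numerically; here is a quick sketch (my own check, using plain normalization with γ = 1, β = 0 for clarity):

```python
import numpy as np

def bn(x, eps=1e-8):
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(64, 10))     # mini-batch of inputs
W = rng.normal(size=(10, 5))      # layer weights
a = 7.3                           # arbitrary scalar

print(np.allclose(bn(u @ W), bn(u @ (a * W))))   # True: BN(Wu) == BN((aW)u)
```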
17.
SECTION 04 & 05
Batch Normalization Experiments and Conclusion
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
18.
Batch Normalization Experiments
• Internal Covariate Shift
• BN Normalization Mechanism
• Impact on Training Speed
• Activations Over Time
#5 Today, I’ll discuss a key challenge in training deep neural networks — Internal Covariate Shift — and the solution proposed in the paper: Batch Normalization. Let’s dive in.
#6 Slide 1 highlights the main problem. During training, the input distribution to each layer changes as the parameters of previous layers are updated. This is known as Internal Covariate Shift. Why is this a problem? It slows down training by requiring small learning rates and careful initialization. Nonlinearities, like the sigmoid activation, saturate when inputs are large, making training even slower.
#7 Mathematically, the network optimizes its parameters Θ to minimize the loss function: Θ = arg min_Θ (1/N) Σᵢ ℓ(xᵢ, Θ).
However, changing inputs make gradient updates unstable, slowing convergence.
For example, in gradient descent, the sub-network optimization can be expressed as: Θ₂ ← Θ₂ − (α/m) Σᵢ ∂F₂(xᵢ, Θ₂)/∂Θ₂.
#8 Now, let's understand why Internal Covariate Shift happens. Consider a sub-network where the loss is defined as: ℓ = F₂(F₁(u, Θ₁), Θ₂).
Here, F₁ and F₂ are transformations, and Θ₁, Θ₂ are the parameters to be optimized. The output x = F₁(u, Θ₁) acts as the input to F₂. When Θ₁ changes, the input x to the second layer, F₂, also changes.
The challenge is that if the input x shifts too much during training, the gradients become unstable. This is further amplified when using activation functions like the sigmoid g(x) = 1/(1 + exp(−x)), which saturates for large inputs. The gradient of the sigmoid activation is g′(x) = g(x)(1 − g(x)).
As |x| → ∞, the gradient g′(x) approaches zero, leading to vanishing gradients and slowing down training.
#9 To address the issue of Internal Covariate Shift, the authors propose a solution called Batch Normalization. The goal of Batch Normalization is to stabilize the input distributions to each layer during training.
………………..