Kalman Normalization: Normalizing Internal Representations Across Network Layers (NIPS 2018)

Guangrun Wang, Sun Yat-sen University (wanggrun@mail2.sysu.edu.cn)
Jiefeng Peng, Sun Yat-sen University (jiefengpeng@gmail.com)
Ping Luo, The Chinese University of Hong Kong (pluo.lhi@gmail.com)
Xinjiang Wang, SenseTime Group Ltd.
Liang Lin, Sun Yat-sen University (linliang@ieee.org)

Presented by Ali Albawi, 8/13/2019
• Batch Normalization (BN) has successfully improved the training of deep neural networks (DNNs) with mini-batches.
• However, the effectiveness of BN diminishes in the micro-batch scenario (e.g., fewer than 4 samples in a mini-batch).
• The statistics estimated from a mini-batch are unreliable when the number of samples is insufficient (see the sketch after this list).
• This limits the use of BN for training larger models on segmentation, detection, and video-related problems, which require small batches constrained by memory consumption.
• Kalman Normalization (KN) is proposed to solve this problem.
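To make the unreliability concrete, here is a minimal numerical sketch (not from the paper; the Gaussian population, batch sizes, and trial count are illustrative assumptions) showing how the batch mean and variance estimates degrade as the batch shrinks:

```python
# Minimal sketch: compare how well batch statistics estimate the population
# statistics for a typical batch vs. a micro-batch.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=2.0, scale=3.0, size=10_000)  # stand-in "activations"

def batch_stats_error(batch_size, trials=1000):
    """Average absolute error of batch mean/variance vs. the population values."""
    mean_err, var_err = 0.0, 0.0
    for _ in range(trials):
        batch = rng.choice(population, size=batch_size, replace=False)
        mean_err += abs(batch.mean() - population.mean())
        var_err += abs(batch.var() - population.var())
    return mean_err / trials, var_err / trials

for bs in (32, 4, 2):
    m_err, v_err = batch_stats_error(bs)
    print(f"batch size {bs:2d}: mean error ~ {m_err:.3f}, variance error ~ {v_err:.3f}")
```

The errors grow quickly below a batch size of roughly 4, which is exactly the regime where BN's normalization becomes noisy.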
Outline
• Introduction
• K.F.P (Kalman filtering process)
• KN (Kalman Normalization)
• Experiments
• Conclusions
• Kalman Normalization (KN) improves and accelerates the training of DNNs, particularly in the context of micro-batches.
• Unlike BN, KN treats all the layers in a network as a whole system.
• It estimates the statistics of a certain layer by considering the distributions of all its preceding layers.
• The computation of KN increases by only 0.015 compared with BN, a marginal additional cost.
• Kalman filtering [1-5] is also known as linear quadratic estimation (LQE).
• The Kalman filter is a recursive, state-space-model-based estimation algorithm.
• The filter is named after Rudolph E. Kalman, who in 1960 published his famous paper describing a recursive solution to the discrete-data linear filtering problem (Kalman, 1960).
• The algorithm was originally developed for one-dimensional, real-valued signals associated with linear systems, assuming the system is corrupted by additive white Gaussian noise.
• The Kalman filter addresses the general problem of estimating the state x ∈ ℝⁿ of a discrete-time controlled process governed by a linear difference equation (reconstructed below).
• The Kalman filter estimates a process using a form of feedback control: it estimates the process state at some time and then obtains feedback in the form of (noisy) measurements.
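The slide refers to, but does not spell out, the linear difference equation and measurement model; the standard textbook formulation (an assumption about what the slide displayed) is:

```latex
% Standard discrete-time Kalman filter model (textbook form).
\begin{aligned}
x_k &= A x_{k-1} + B u_{k-1} + w_{k-1} && \text{(process model)} \\
z_k &= H x_k + v_k                     && \text{(measurement model)} \\
w_k &\sim \mathcal{N}(0, Q), \qquad v_k \sim \mathcal{N}(0, R)
\end{aligned}
```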
• Assume that the true values of the hidden neurons in the k-th layer are represented by the variable x_k.
• The values in the previous layer are x_{k-1}.
• The standard Kalman filtering process is then applied across layers (a reconstruction is sketched below):
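A hedged reconstruction of that process, with the layer index k playing the role of the time step and H = I; the exact transition matrices, noise terms, and gain used in the paper may differ:

```latex
% Prediction (from layer k-1 to layer k) and update (with the observed
% statistics z_k of the current layer); a sketch of the standard recursion.
\begin{aligned}
\hat{x}_{k|k-1} &= A_k\, \hat{x}_{k-1|k-1}                       && \text{(predict)} \\
P_{k|k-1}       &= A_k\, P_{k-1|k-1}\, A_k^{\top} + Q_k          \\
K_k             &= P_{k|k-1}\,\big(P_{k|k-1} + R_k\big)^{-1}     && \text{(gain, with } H = I\text{)} \\
\hat{x}_{k|k}   &= \hat{x}_{k|k-1} + K_k\,\big(z_k - \hat{x}_{k|k-1}\big) && \text{(update)} \\
P_{k|k}         &= (I - K_k)\, P_{k|k-1}
\end{aligned}
```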
Kalman Normalization
• KN performs two steps in an iterative way (a minimal sketch follows the list):
 KN estimates the statistics of the current layer conditioned on the estimations of the previous layer.
 These estimations are then combined with the observed batch sample means and variances calculated within the mini-batch.
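As a rough illustration of these two steps, here is a minimal NumPy sketch; the identity transition, the fixed gain of 0.5, the function names, and the toy data are simplifying assumptions, since the paper learns the transition and combination weights during training.

```python
# A minimal sketch of the two-step idea behind KN (per-channel scalars).
import numpy as np

def kn_estimate(prev_mean, prev_var, batch_mean, batch_var, gain=0.5):
    """Fuse the estimate propagated from the previous layer with the
    statistics observed in the current mini-batch."""
    # Step 1: predict the current layer's statistics from the previous layer
    # (identity transition assumed for simplicity).
    pred_mean, pred_var = prev_mean, prev_var
    # Step 2: update with the observed batch statistics using a gain in [0, 1].
    mean = pred_mean + gain * (batch_mean - pred_mean)
    var = pred_var + gain * (batch_var - pred_var)
    return mean, var

def kn_normalize(x, mean, var, eps=1e-5):
    """Normalize activations x of shape (N, C) with the fused statistics."""
    return (x - mean) / np.sqrt(var + eps)

# Usage: walk through layers, carrying the fused estimate forward.
rng = np.random.default_rng(0)
mean_est, var_est = 0.0, 1.0                      # initial estimate for the first layer
for k in range(3):                                # three toy "layers"
    x = rng.normal(loc=k, scale=2.0, size=(4, 8)) # micro-batch of 4 samples, 8 channels
    mean_est, var_est = kn_estimate(mean_est, var_est, x.mean(0), x.var(0))
    x_hat = kn_normalize(x, mean_est, var_est)
```

Even with a micro-batch, the fused estimate is steadier than the raw batch statistics because it carries information forward from the preceding layers.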
• Unlike the classical Kalman filtering problem, in the BN setting the desired quantities to estimate include not just the variances but also the means, μ_{k|k}.
• The means can be obtained easily, thanks to the Kalman filter's linearity, by taking the expectation on both sides of the state-transition equation.
• This yields the mean-propagation rule sketched below.
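Under the linear transition model x_k = A_k x_{k-1} + u_k sketched earlier, taking expectations on both sides gives the rule below; the fused update uses the observed mini-batch mean z̄_k and a gain q_k as stand-ins for the paper's combination weight (a reconstruction, not the paper's exact notation):

```latex
% Mean propagation from taking expectations of the transition model,
% followed by fusion with the observed mini-batch mean.
\begin{aligned}
\mu_{k|k-1} &= \mathbb{E}[x_k] = A_k\, \mu_{k-1|k-1} + \mathbb{E}[u_k] \\
\mu_{k|k}   &= \mu_{k|k-1} + q_k\,\big(\bar{z}_k - \mu_{k|k-1}\big)
\end{aligned}
```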
• Similar to BN, KN also maintains moving-average statistics to approximate the population statistics during training, and employs them at inference time (a sketch of this bookkeeping follows).
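A minimal sketch of this moving-average bookkeeping; the momentum value and function names are illustrative assumptions, mirroring how standard BN implementations maintain running statistics:

```python
# Running (moving-average) statistics, updated once per training iteration
# and used in place of batch statistics at inference time.
import numpy as np

momentum = 0.9  # illustrative value; frameworks typically use 0.9-0.99

def update_running_stats(running_mean, running_var, batch_mean, batch_var):
    """Exponential moving average of the per-channel statistics."""
    new_mean = momentum * running_mean + (1.0 - momentum) * batch_mean
    new_var = momentum * running_var + (1.0 - momentum) * batch_var
    return new_mean, new_var

def normalize_inference(x, running_mean, running_var, eps=1e-5):
    """At inference, normalize with the stored running statistics."""
    return (x - running_mean) / np.sqrt(running_var + eps)
```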
• KN has two unique characteristics that distinguish it from BN:
 It offers a better estimation of the distribution. In contrast to existing normalization methods, the depth information is explicitly exploited in KN.
 It offers a more stable estimation as learning proceeds, since the information flow from the prior state to the current state becomes more stable.
Kalman Normalization Property
(figures omitted)
Comparison with shortcuts in ResNet
• A shortcut connection also incorporates information from previous layers.
• KN has two characteristics that distinguish it from a shortcut connection:
 KN provides a better statistical estimation.
 In a shortcut connection, the information of the previous and current layers is simply summed; no distribution estimation is performed.
• In theory, KN can also be applied to shortcut connections: once the entire feature map is available, its mean and variance can easily be obtained.
Training with Typical Batch (Batch Size = 32)
• A noteworthy finding is that, compared to BN in typical-batch training, KN keeps a competitive advantage (76.8% vs 76.4% on ResNet50), while GN is at a disadvantage (75.9% vs 76.4%). This may be attributed to the optimization efficiency of BN, upon which KN (i.e., BKN) is built.
Extra Parameters.
• KN introduces only 0.1% extra parameters, which is negligible.

Computation Complexity.
(figures omitted)
Training with Micro Batch (Batch Size = 4 & 1)
Batch Size of 4.
• The baseline with the typical batch size (i.e., 32) is used for comparison, and there are three major observations:
 Replacing BN with KN yields an improvement.
 Under this setting, the validation accuracy of all normalization methods is lower than the baseline normalized over a batch size of 32 (76.8 vs 76.1 for KN).
 Training converges more slowly.
• This indicates that the micro-batch training problem is better addressed by KN than by BN.
• Interestingly, there is a gap between using different kinds of statistic estimation at inference time:
• Population statistics (moving mean/variance) used to normalize the layer's inputs during inference.
• Batch sample (online) statistics used for normalization during inference.
• Using online statistics weakens the superiority of GN over BN.
• What about 1-example-batch training?

Batch Size of 1.
• In 1-example-batch training, no information is ever communicated between any two examples.
• As a result, the moving averages struggle to represent the population statistics.
COCO 2017 Object Detection and Segmentation
(figures omitted)
Analysis on CIFAR10, CIFAR100, and SVHN
• First, both BN and GN benefit from the Kalman Normalization mechanism. For example, BKN has a gain of 1.5% over BN, verifying the effectiveness of BKN (i.e., KN).
• Second, the gain of BKN over BN is larger than that of GKN over GN (1.5% vs 0.4%). This may be attributed to the optimization efficiency of BN.
• Third, although GN has gains over BN on ImageNet in micro-batch training, it shows no gain on CIFAR10.
Other Ablation Studies
Evidence of more accurate statistic estimations
• First, direct evidence: the gaps between the batch sample variance and the moving variance are visualized (figure omitted).
• Second, indirect evidence (figure omitted).
Comparison with BN variants.
(figures omitted)
Conclusions
• This paper presented a novel normalization method, called Kalman Normalization (KN).
• Unlike previous methods that normalize each hidden layer independently, KN treats the entire network as a whole.
• KN can be naturally combined with other existing normalization methods to obtain gains.
• Extensive experiments suggest that KN is capable of strengthening several state-of-the-art neural networks by improving their training stability and convergence speed.
• KN can handle training with mini-batches of very small sizes.
Editor's Notes
1-6. First, by leveraging the messages from the previous layers, the estimation of the statistics is more stable in KN, making training converge faster, especially in the early stage. Second, this procedure also reduces the internal covariate shift, leading to discriminative representation learning and hence improving classification accuracy.
7. The horizontal axis represents values of different batches, while the vertical axis represents neurons of different channels.