EE5180: Introduction to Machine Learning - Final Presentation
Powerpropagation: A sparsity inducing
weight reparameterisation
Group 5
Saurabh Chawla (EE21M524)
Mandeep Chaudhary (EE21M523)
Nirmal Mundra (EE21M527)
Rahul Alok Sharma (EE21M519)
Introduction to Neural Networks
• A neural network consists of a large number of highly interconnected processing
elements (neurons) working together to learn from experience.
• Each neuron is connected to other neurons by means of directed communication
links, each with an associated weight. The weights represent the information used
by the network to solve a problem.
Advantages:
• It can model non-linear systems
• The ability to learn allows the network to adapt to changes in
the surrounding environment
• The distributed nature of the NN gives it fault-tolerant capabilities
Powerpropagation
• Powerpropagation is a new weight reparameterisation for neural networks
that leads to inherently sparse models.
• During training, the model can be regularly pruned or sparsified to reduce the
computational burden, eliminating parameters that do not play a vital role in the
functional behaviour of the model.
• During training, we raise the weights of the network to the α-th power (preserving
the sign). As a result, the magnitude of the weight appears in the gradient (chain
rule), encouraging "rich get richer" dynamics (a minimal sketch follows).
Powerpropagation cont.
• The sparse solution is encoded with reduced capacity as a result of the learning
process itself, without any explicit force imposing frugality (i.e. no masking,
regularisation, etc.).
• The focus of the reparameterisation is on sparse representations rather than on
improving convergence.
Notation:
θ = original weight
Φ = reparameterised weight
Ψ = link function, with θ = Ψ(Φ) = Φ|Φ|^(α-1)
α = raising parameter
Gradient of the reparameterised loss L(·, Ψ(Φ)) w.r.t. Φ:
∇_Φ L = ∇_θ L · α|Φ|^(α-1)   (element-wise, with θ = Ψ(Φ))
Properties:
1. α = 1 recovers the baseline, i.e. standard gradient descent.
2. Zero is a critical point of every weight for α > 1.
3. In addition, zero is surrounded by a plateau, so weights are less likely to
change sign.
These properties can be verified numerically, as in the sketch below.
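The sketch below checks the properties with autograd. It uses a trivial surrogate loss so that the upstream gradient dL/dθ is a known constant; it is an illustration, not part of the method:

```python
import torch

def grad_wrt_phi(phi_value, alpha, upstream=1.0):
    """Gradient of a loss L(w) w.r.t. phi, where w = phi * |phi|**(alpha - 1)
    and dL/dw = upstream (a stand-in for the backpropagated gradient)."""
    phi = torch.tensor(phi_value, requires_grad=True)
    w = phi * phi.abs().pow(alpha - 1.0)
    loss = upstream * w          # surrogate loss with dL/dw = upstream
    loss.backward()
    return phi.grad.item()

print(grad_wrt_phi(0.5, alpha=1.0))  # 1.0: alpha = 1 recovers plain gradient descent
print(grad_wrt_phi(0.5, alpha=2.0))  # 1.0 = upstream * alpha * |phi|**(alpha-1) = 2 * 0.5
print(grad_wrt_phi(0.0, alpha=2.0))  # 0.0: zero is a critical point for alpha > 1
```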
Effect on Weight Distribution
Figure: weight distributions under (a) standard learning and (b) Powerpropagation.
• With α = 1.0, the weights follow the standard distribution, spread symmetrically
about the centre (shown in blue in the figure).
• With α = 1.5, most of the weight density is concentrated near the centre (shown
in green).
• Parameters with larger magnitudes are allowed to adapt faster in order to
represent the features required to solve the task, while smaller-magnitude
parameters are restricted, making it more likely that they will be irrelevant to the
learned solution (see the numeric illustration below).
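A toy calculation makes the "rich get richer" effect concrete: under the same upstream gradient and learning rate, a large parameter receives a much larger effective update than a small one. The numbers below are arbitrary illustrative values, not taken from the paper:

```python
import torch

alpha, lr, upstream = 2.0, 0.1, 1.0   # illustrative values only

for phi0 in (0.05, 1.0):              # one small and one large parameter
    phi = torch.tensor(phi0, requires_grad=True)
    w = phi * phi.abs().pow(alpha - 1.0)
    (upstream * w).backward()
    step = lr * phi.grad.item()       # SGD step applied to phi
    print(f"phi={phi0:4.2f}  grad={phi.grad.item():4.2f}  step={step:5.3f}")

# phi=0.05 -> grad=0.10, step=0.010   (small weights barely move)
# phi=1.00 -> grad=2.00, step=0.200   (large weights adapt much faster)
```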
Hyperparameter Selection for MNIST/Fashion MNIST
MNIST: training set of 60,000 images and test set of 10,000 images. Each image is
28 x 28 pixels with the digit centred.
Train batch size: 60
Training steps: 50,000
Learning rate: 0.1
α (Powerpropagation parameter) tested with: 1, 2, 3, 4 and 5.
A sketch of this sweep follows.
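Below is one way the sweep might be wired up, reusing the illustrative PowerPropLinear layer from earlier. The network shape (784-300-10) and the data pipeline are assumptions for illustration, not the presentation's exact setup:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

config = dict(batch_size=60, training_steps=50_000, lr=0.1, alphas=[1, 2, 3, 4, 5])

train_ds = datasets.MNIST("data", train=True, download=True,
                          transform=transforms.ToTensor())
loader = DataLoader(train_ds, batch_size=config["batch_size"], shuffle=True)

for alpha in config["alphas"]:
    model = torch.nn.Sequential(           # stand-in architecture; layers are
        PowerPropLinear(784, 300, alpha),   # the reparameterised ones sketched above
        torch.nn.ReLU(),
        PowerPropLinear(300, 10, alpha),
    )
    opt = torch.optim.SGD(model.parameters(), lr=config["lr"])
    step = 0
    while step < config["training_steps"]:
        for x, y in loader:
            loss = torch.nn.functional.cross_entropy(model(x.flatten(1)), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
            if step >= config["training_steps"]:
                break
```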
Results (with MNIST and Fashion MNIST Datasets)

(a) MNIST
Remaining Weights | Raising Parameter (α) | Accuracy
10%               | Baseline              | 70.00%
10%               | 3                     | 96.77%

(b) Fashion MNIST
Remaining Weights | Raising Parameter (α) | Accuracy
10%               | Baseline              | 50.28%
10%               | 3                     | 78.48%
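The 10% rows correspond to keeping only the largest-magnitude 10% of weights after training. A hedged sketch of such one-shot global magnitude pruning is below; with Powerpropagation the ranking can equivalently be done on the trained parameters Φ, since |θ| = |Φ|^α is monotone in |Φ|. The helper name and the global-threshold logic are simplifications, not the authors' code:

```python
import torch

@torch.no_grad()
def prune_to_fraction(model, keep_fraction=0.10):
    """Zero out all but the largest-magnitude `keep_fraction` of weights,
    measured globally over the model's weight matrices."""
    weights = torch.cat([p.abs().flatten()
                         for p in model.parameters() if p.dim() > 1])
    k = max(1, int(keep_fraction * weights.numel()))
    threshold = torch.topk(weights, k).values.min()
    for p in model.parameters():
        if p.dim() > 1:
            p.mul_((p.abs() >= threshold).float())
```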
Results (with CIFAR-10 Dataset)
Implementation variations
• Powerpropagation is general, intuitive, cheap and straightforward to implement,
and can readily be combined with various other techniques. It can be used in two
different settings:
1. Combining Powerpropagation with a traditional weight-pruning technique, e.g.
one-shot pruning or iterative pruning.
2. Combining it with recent state-of-the-art sparse-to-sparse algorithms, e.g.
TopKAST (a simplified sketch of this setting follows).
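A much-simplified illustration of setting 2: use only the top-k weights by magnitude in the forward pass while still letting gradients reach the masked weights. Real TopKAST differs in important details (separate forward/backward sparsity levels, periodic mask refresh), so treat this only as a sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKLinear(nn.Module):
    """Linear layer that uses only the top-k fraction of weights (by magnitude)
    in the forward pass; gradients still flow to all weights (straight-through)."""
    def __init__(self, in_features, out_features, keep_fraction=0.1):
        super().__init__()
        self.keep_fraction = keep_fraction
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        k = max(1, int(self.keep_fraction * self.weight.numel()))
        threshold = torch.topk(self.weight.abs().flatten(), k).values.min()
        mask = (self.weight.abs() >= threshold).float()
        # Straight-through: forward uses the masked weight, but the gradient
        # w.r.t. self.weight behaves as if the mask were not there.
        w = self.weight + (mask * self.weight - self.weight).detach()
        return F.linear(x, w, self.bias)
```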
Continual learning
• Continual learning means sequential learning of tasks without forgetting.
• Powerpropagation for continual learning builds on the class of methods that
implement gradient sparsity, i.e. catastrophic forgetting is overcome by masking
the gradients of parameters found to constitute the solution of previous tasks.
• E.g. PackNet (a simplified gradient-masking sketch follows).
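A hedged sketch of the gradient-masking idea behind PackNet-style methods: once a task is solved, the parameters that form its solution are frozen by zeroing their gradients while later tasks are trained. The function and mask names are illustrative:

```python
import torch

def apply_packnet_masks(model, frozen_masks):
    """Zero the gradients of parameters reserved by previous tasks.
    `frozen_masks` maps parameter name -> bool tensor (True = frozen)."""
    for name, p in model.named_parameters():
        if p.grad is not None and name in frozen_masks:
            p.grad.mul_((~frozen_masks[name]).float())

# usage inside a training step for a new task:
#   loss.backward()
#   apply_packnet_masks(model, frozen_masks)
#   optimizer.step()
```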
Problems with PackNet
• The number of tasks must be known beforehand.
• By reserving a fixed fraction of weights for each task, no distinction is made in
terms of task difficulty or relatedness to previous data.
• These two problems are addressed by Efficient PackNet.
How the PackNet problems are resolved
• Resource allocation: a simple search over a range of sparsity rates is performed,
terminated once the model's performance falls below the minimum accepted
target or once the minimal accepted sparsity is reached (see the sketch below).
• The mask for a given task is chosen from among all network parameters,
including those used by previous tasks, which encourages the reuse of existing
parameters.
• As the backward pass becomes sparse, the method becomes more
computationally efficient as the number of tasks grows.
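A sketch of the resource-allocation search described above: try increasingly aggressive sparsity rates for the current task and stop once accuracy falls below the accepted target. The evaluation and pruning helpers (eval_fn, prune_fn) are assumed to exist and are named only for illustration:

```python
def choose_sparsity(model, eval_fn, prune_fn,
                    rates=(0.5, 0.7, 0.8, 0.9, 0.95), min_accuracy=0.95):
    """Return the most aggressive sparsity rate that still meets the target.

    eval_fn(model) -> accuracy on the current task        (assumed helper)
    prune_fn(model, rate) -> copy pruned to `rate` sparsity (assumed helper)
    """
    chosen = None
    for rate in rates:                      # from least to most sparse
        candidate = prune_fn(model, rate)
        if eval_fn(candidate) < min_accuracy:
            break                           # performance fell below the target
        chosen = rate
    return chosen
```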
Related Work
• Modern approaches to sparsity in deep learning are categorised as dense-to-sparse
and sparse-to-sparse methods.
• Dense-to-sparse algorithms instantiate a dense network that is sparsified over the
course of training, e.g. one-shot pruning.
• Sparse-to-sparse algorithms maintain a constant sparsity throughout training, e.g.
single-shot pruning.
Thanks