This is the 197th paper review from PR12, the TensorFlow Korea paper-reading group.
(Only 3 papers remain until our season-2 goal of 200.)
The paper I presented this time is "One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers" from FAIR (Facebook AI Research).
How great would it be if a single ticket could win first prize in every lottery?
Conventional network pruning methods fine-tune the pruned network starting from the weights it learned before pruning.
If the pruned network is instead randomly reinitialized and trained from scratch, its performance tends to suffer.
Last year's Lottery Ticket Hypothesis paper from MIT showed how such a pruned network can be initialized to reach high accuracy (by resetting the surviving weights to their original initialization values), and named this initialization the lottery's "winning ticket".
But would this winning ticket also work well on a different dataset or with a different optimizer?
For example, could a winning ticket found on CIFAR-10 also act as a winning ticket on ImageNet?
This paper answers these questions through experiments and offers several insights about initialization.
For the details, please see the presentation video!
Video link: https://youtu.be/YmTNpF2OOjA
Slides link: https://www.slideshare.net/JinwonLee9/pr197-one-ticket-to-win-them-all-generalizing-lottery-ticket-initializations-across-datasets-and-optimizers
Paper link: https://arxiv.org/abs/1906.02773
PR-197: One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers
1. One Ticket to Win Them All:
Generalizing Lottery Ticket Initializations across Datasets and Optimizers
Ari S. Morcos, et al., “One ticket to win them all: generalizing lottery ticket initializations
across datasets and optimizers”
29th September, 2019
PR12 Paper Review
JinWon Lee
Samsung Electronics
2. Related PR12
• PR-180: The Lottery Ticket Hypothesis: Finding Sparse, Trainable
Neural Networks: https://youtu.be/dkNmYu610r8
• PR-072: Deep Compression: https://youtu.be/9mFZmpIbMDs
3. Lottery Ticket Hypothesis
• Neural networks can be pruned by more than 90% without harming
accuracy.
• But if we train directly (from random initialization) on the pruned
structure, the performance always decreases.
• The initialization of overparameterized neural networks contains
much smaller sub-network initializations, which, when trained in
isolation, reach similar performance as the full network.
Winning Tickets
f(x; W) reaches accuracy a in t iterations.
f(x; m ⊙ W) reaches accuracy a′ in t′ iterations,
with mask m ∈ {0, 1}^|W|,
where ‖m‖₀ ≪ |W|, a′ ≥ a, t′ ≤ t.
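The masking notation above can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' code; the choice of a median-magnitude threshold is my own, just to produce a 50%-sparse mask.

```python
import numpy as np

# Toy illustration of the winning-ticket notation: a binary mask m,
# the same shape as W, zeroes out the pruned weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))                       # dense initialization W
median = np.median(np.abs(W))
m = (np.abs(W) > median).astype(W.dtype)          # keep the top-half magnitudes
W_ticket = m * W                                  # m ⊙ W: the sparse sub-network
sparsity = 1.0 - m.mean()                         # fraction of weights pruned
```

Every surviving entry of `W_ticket` equals the corresponding entry of `W`; pruned entries are exactly zero.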
4. How to Find the Winning Tickets?
1. Randomly initialize a neural network f(x; θ₀) (where θ₀ ~ 𝒟₀).
2. Train the network for j iterations, arriving at parameters θⱼ.
3. Prune p% of the parameters in θⱼ, creating a mask m.
4. Reset the remaining parameters to their values in θ₀, creating the
winning ticket f(x; m ⊙ θ₀).
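The four steps above can be sketched as follows. This is my own illustrative code, not the authors' implementation; `init_fn` and `train_fn` are hypothetical stand-ins for the real initialization and training loops, and the weights are flattened into one vector for simplicity.

```python
import numpy as np

def find_winning_ticket(init_fn, train_fn, prune_fraction=0.2):
    """Sketch of the four-step winning-ticket procedure (illustrative)."""
    theta_0 = init_fn()                           # 1. randomly initialize
    theta_j = train_fn(theta_0)                   # 2. train for j iterations
    k = int(prune_fraction * theta_0.size)        # 3. prune p% by magnitude
    mask = np.ones_like(theta_0)
    mask[np.argsort(np.abs(theta_j))[:k]] = 0.0   #    zero the k smallest
    return mask * theta_0, mask                   # 4. reset survivors to theta_0
```

With a trivial identity "training" step, the weights with the smallest magnitudes are the ones masked out, and the surviving weights keep their initial values.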
5. Implications from Winning (Lottery) Tickets
• The most commonly used initialization schemes are sub-optimal and
have significant room for improvement.
• Over-parameterization is not necessary during the course of training,
but rather over-parameterization is merely necessary to find a “good”
initialization of an appropriately parameterized network.
6. Limitations of Winning Tickets
• The process to find winning tickets is computationally expensive.
• It remains unclear whether the properties of winning tickets (high
performance) are specific to the precise combination of architecture,
optimizer, and dataset, or whether winning tickets contain inductive
biases which improve training more generically.
If the same winning ticket generalizes across training conditions and datasets, it
unlocks the possibility of generating a small number of such winning tickets
and reusing them across datasets.
7. Introduction
• The generality of winning ticket initializations remains unclear.
• The authors attempt to answer this question by generating winning
tickets for one training configuration (optimizer and dataset) and
evaluating their performance on another configuration.
8. Related Work
• The lottery ticket hypothesis was first postulated and examined in
smaller models and datasets.
• This hypothesis was later analyzed in large models and datasets, leading
to the development of late resetting.
9. Related Work – Lottery Ticket Hypothesis Has
Been Challenged
• If randomly initialized sub-networks with structured pruning are
scaled appropriately and trained for long enough, they can match
performance from winning tickets.
• Other work examined the lottery ticket hypothesis in the context of
large-scale models and was unable to find successful winning tickets.
However, they did not use iterative pruning and late resetting.
10. Related Work – Transfer Learning
• Training models on one dataset, pruning, and then fine-tuning on
another dataset can often result in high performance on the transfer
dataset.
• However, those studies investigate the transfer of learned
representations, whereas this work analyzes the transfer of initializations
across datasets.
11. Approach – Iterative Pruning
• Pruning a large fraction of weights in one step (“one-shot pruning”)
will often prune weights which were actually important.
• By only pruning a small fraction of weights on each iteration,
iterative pruning helps to de-noise the pruning process, and produces
substantially better pruned models and winning tickets.
• For this work, magnitude pruning with an iterative pruning rate of
20% of remaining parameters was used.
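One round of this iterative scheme might look like the following sketch. It is illustrative only (the real implementation operates on actual network layers), but it shows the key point: each round removes 20% of the *remaining* weights, not 20% of the original total.

```python
import numpy as np

def prune_round(weights, mask, rate=0.2):
    """One iteration of magnitude pruning: zero out `rate` of the
    still-unmasked weights with the smallest magnitudes."""
    remaining = np.flatnonzero(mask)                  # indices still alive
    k = int(round(rate * remaining.size))             # 20% of the remainder
    smallest = remaining[np.argsort(np.abs(weights[remaining]))[:k]]
    new_mask = mask.copy()
    new_mask[smallest] = 0.0
    return new_mask

# After n rounds at a 20% rate, roughly 0.8**n of the weights remain.
```

Between rounds, the network is retrained and (with late resetting) rewound, which is what makes the procedure expensive.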
12. Approach – Late Resetting
• In the initial investigation of lottery tickets, winning ticket weights
were reset to their values at the beginning of training (training
iteration 0), and learning rate warmup was found to be necessary.
• Simply resetting the winning ticket weights to their values at training
iteration k(small number) consistently produces better winning
tickets and removes the need for learning rate warmup.
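A minimal sketch of late resetting, under my own naming (`step_fn` is a hypothetical stand-in for one optimizer update): snapshot the weights at a small iteration k during the first training run, then build tickets from θₖ instead of θ₀.

```python
import numpy as np

def train_with_snapshot(theta_0, step_fn, iters, k):
    """Train for `iters` steps, keeping a copy of the weights at
    iteration k for late resetting."""
    theta, theta_k = theta_0.copy(), None
    for t in range(1, iters + 1):
        theta = step_fn(theta)
        if t == k:
            theta_k = theta.copy()       # late-resetting snapshot
    return theta, theta_k

def late_reset_ticket(mask, theta_k):
    return mask * theta_k                # reset survivors to theta_k, not theta_0
```

With a toy update that adds 1.0 each step, the snapshot at k = 2 differs from both the initialization and the final weights, which is exactly the point of resetting "late".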
13. Approach – Global Pruning
• In local pruning, weights are pruned within each layer separately,
such that every layer will have the same fraction of pruned
parameters.
• In global pruning, all layers are pooled together prior to pruning,
allowing the pruning fraction to vary across layers.
• Consistently, the authors observed that global pruning leads to higher
performance than local pruning.
The first layer of VGG19 contains only 1792 parameters, so pruning this layer
at a rate of 99% would result in only 18 parameters remaining, significantly
harming the expressivity of the network.
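The difference between the two schemes comes down to where the magnitude threshold is computed. A sketch under my own naming (not the authors' code):

```python
import numpy as np

def local_masks(layers, frac):
    """Local pruning: each layer prunes `frac` of its own weights."""
    masks = []
    for W in layers:
        k = int(frac * W.size)
        thresh = np.sort(np.abs(W).ravel())[k]    # per-layer threshold
        masks.append((np.abs(W) >= thresh).astype(W.dtype))
    return masks

def global_masks(layers, frac):
    """Global pruning: one threshold over all layers pooled together,
    so the effective pruning rate can vary across layers."""
    pooled = np.sort(np.concatenate([np.abs(W).ravel() for W in layers]))
    thresh = pooled[int(frac * pooled.size)]      # single global threshold
    return [(np.abs(W) >= thresh).astype(W.dtype) for W in layers]
```

In a toy example with one small-magnitude layer and one large-magnitude layer, local pruning removes the same fraction from each, while global pruning concentrates all of its pruning in the small-magnitude layer.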
15. Approach – Random Masks
• In previous work, the structure of the pruning mask was maintained
for the random ticket initializations, with only the values of the
weights themselves randomized.
• However, the structure of this mask contains a substantial amount of
information and requires foreknowledge of the winning ticket
initialization.
• For this work, they consider random tickets to have both randomly
drawn weight values (from the initialization distribution) and
randomly permuted masks.
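The random-ticket control described above can be sketched as follows (illustrative; `init_sampler` is a hypothetical stand-in for the network's initialization distribution). Permuting the mask preserves its overall sparsity but destroys its structure.

```python
import numpy as np

def random_ticket(winning_mask, init_sampler, rng):
    """Random ticket: freshly drawn weight values AND a randomly
    permuted mask (same sparsity, structure destroyed)."""
    mask = rng.permutation(winning_mask.ravel()).reshape(winning_mask.shape)
    theta = init_sampler(winning_mask.shape)      # fresh draw from init dist
    return mask * theta, mask
```

The permuted mask always keeps exactly as many weights as the winning mask did, so any performance gap between the two isolates the value of the mask's structure and the paired initial values.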
16. Models
• A modified form of VGG19
Fully-connected layers were removed and replaced with an average pool
layer.
• ResNet50
17. Transferring Winning Tickets across Datasets and
Optimizers
• They generate winning tickets in one training configuration (“source”)
and evaluate performance in a different configuration (“target”).
• For many transfers, this requires changing the output layer size,
since datasets have different numbers of output classes.
19. Results – Transfer across Datasets
• Six different natural-image datasets were used:
Fashion-MNIST, SVHN, CIFAR-10, CIFAR-100, ImageNet, Places365
These datasets vary across a number of axes, including grayscale vs. color,
input size, number of output classes, and training set size.
20. Results – Transfer across Datasets
• A substantial fraction of the inductive bias provided by winning tickets
is dataset-independent.
• Winning tickets generated on larger, more complex datasets
generalized substantially better than those generated on small
datasets. Interestingly, the effect was not merely a result of training
set size, but also appeared to be impacted by the number of classes.
• Winning ticket transfer success was roughly similar across VGG19
and ResNet50 models, with some differences.
25. Results – Transfer across Optimizers
• It is possible that winning tickets are also specific to the particular
optimizer that is used.
• They found that, as with transfer across datasets, transferred
winning tickets for VGG models achieved similar performance as
those generated using the source optimizer.
27. Results – Randomized Masks
• The winning ticket outperformed all random tickets.
• The layerwise statistics derived from training the over-parameterized
model are very informative.
28. Caveats and Next Steps
• Remaining key limitations
Generating these winning tickets via iterative pruning is very slow. This issue
is especially prevalent given the observation that larger datasets produce
more generic winning tickets.
They have only evaluated the transfer of winning tickets across datasets
within the same domain (natural images) and task (object classification).
While they found that transferred winning tickets often achieved roughly
similar performance to those generated on the same dataset, there was
often a small, but noticeable gap between the transfer ticket and the same
dataset ticket.
They only evaluated situations where the network topology is fixed in both
the source and target training configurations.
A critical question remains unanswered: what makes winning tickets special?