This is the 197th paper review from PR12, the TensorFlow Korea paper-reading group.
(Only 3 papers remain until our season-2 goal of 200.)
The paper I presented this time is "One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers" from FAIR (Facebook AI Research).
How great would it be if a single ticket could win first prize in every lottery?
Conventional network pruning methods fine-tune the pruned network starting from the weights it learned before pruning.
If the pruned network is instead randomly reinitialized and trained from scratch, its performance tends to suffer.
Last year's Lottery Ticket Hypothesis paper from MIT showed how such a pruned network can be initialized to reach high accuracy (by resetting the surviving weights to their original initialization values), and named this initialization the lottery's "winning ticket".
But would this winning ticket also work well on a different dataset or with a different optimizer?
For example, could a winning ticket found on CIFAR-10 also act as a winning ticket on ImageNet?
This paper answers these questions through experiments and offers several insights about initialization.
For the details, please see the presentation video!
Video link: https://youtu.be/YmTNpF2OOjA
Slides link: https://www.slideshare.net/JinwonLee9/pr197-one-ticket-to-win-them-all-generalizing-lottery-ticket-initializations-across-datasets-and-optimizers
Paper link: https://arxiv.org/abs/1906.02773
PR-197: One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers
1. One Ticket to Win Them All:
Generalizing Lottery Ticket Initializations across Datasets and Optimizers
Ari S. Morcos, et al., “One ticket to win them all: generalizing lottery ticket initializations
across datasets and optimizers”
29th September, 2019
PR12 Paper Review
JinWon Lee
Samsung Electronics
2. Related PR12
• PR-180: The Lottery Ticket Hypothesis: Finding Sparse, Trainable
Neural Networks: https://youtu.be/dkNmYu610r8
• PR-072: Deep Compression: https://youtu.be/9mFZmpIbMDs
3. Lottery Ticket Hypothesis
• Neural networks can be pruned by more than 90% without harming
accuracy.
• But if we train directly (from random initialization) on the pruned
structure, the performance always decreases.
• The initialization of overparameterized neural networks contains
much smaller sub-network initializations, which, when trained in
isolation, reach similar performance as the full network.
Winning Tickets
f(x; W) reaches accuracy a in t iterations.
f(x; m ⊙ W) reaches accuracy a′ in t′ iterations,
with mask m ∈ {0, 1}^|W|,
where ‖m‖₀ ≪ |W|, a′ ≥ a, t′ ≤ t.
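The masking notation above can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' code; the choice of a median-magnitude threshold is my own, just to produce a 50%-sparse mask.

```python
import numpy as np

# Toy illustration of the winning-ticket notation: a binary mask m,
# the same shape as W, zeroes out the pruned weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))                       # dense initialization W
median = np.median(np.abs(W))
m = (np.abs(W) > median).astype(W.dtype)          # keep the top-half magnitudes
W_ticket = m * W                                  # m ⊙ W: the sparse sub-network
sparsity = 1.0 - m.mean()                         # fraction of weights pruned
```

Every surviving entry of `W_ticket` equals the corresponding entry of `W`; pruned entries are exactly zero.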
4. How to Find the Winning Tickets?
1. Randomly initialize a neural network f(x; θ₀) (where θ₀ ~ 𝒟₀).
2. Train the network for j iterations, arriving at parameters θⱼ.
3. Prune p% of the parameters in θⱼ, creating a mask m.
4. Reset the remaining parameters to their values in θ₀, creating the
winning ticket f(x; m ⊙ θ₀).
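The four steps above can be sketched as follows. This is my own illustrative code, not the authors' implementation; `init_fn` and `train_fn` are hypothetical stand-ins for the real initialization and training loops, and the weights are flattened into one vector for simplicity.

```python
import numpy as np

def find_winning_ticket(init_fn, train_fn, prune_fraction=0.2):
    """Sketch of the four-step winning-ticket procedure (illustrative)."""
    theta_0 = init_fn()                           # 1. randomly initialize
    theta_j = train_fn(theta_0)                   # 2. train for j iterations
    k = int(prune_fraction * theta_0.size)        # 3. prune p% by magnitude
    mask = np.ones_like(theta_0)
    mask[np.argsort(np.abs(theta_j))[:k]] = 0.0   #    zero the k smallest
    return mask * theta_0, mask                   # 4. reset survivors to theta_0
```

With a trivial identity "training" step, the weights with the smallest magnitudes are the ones masked out, and the surviving weights keep their initial values.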
5. Implications from Winning (Lottery) Tickets
• The most commonly used initialization schemes are sub-optimal and
have significant room for improvement.
• Over-parameterization is not necessary during the course of training,
but rather over-parameterization is merely necessary to find a “good”
initialization of an appropriately parameterized network.
6. Limitations of Winning Tickets
• The process to find winning tickets is computationally expensive.
• It remains unclear whether the properties of winning tickets (high
performance) are specific to the precise combination of architecture,
optimizer, and dataset, or whether winning tickets contain inductive
biases which improve training more generically.
If the same winning ticket generalizes across training conditions and datasets, it
unlocks the possibility of generating a small number of such winning tickets
and reusing them across datasets.
7. Introduction
• The generality of winning ticket initializations remains unclear.
• The authors attempt to answer this question by generating winning
tickets for one training configuration (optimizer and dataset) and
evaluating their performance on another configuration.
8. Related Work
• The lottery ticket hypothesis was first postulated and examined in
smaller models and datasets.
• This hypothesis was later analyzed in large models and datasets, leading
to the development of late resetting.
9. Related Work – Lottery Ticket Hypothesis Has
Been Challenged
• If randomly initialized sub-networks with structured pruning are
scaled appropriately and trained for long enough, they can match
performance from winning tickets.
• Other work examined the lottery ticket hypothesis in the context of
large-scale models and was unable to find successful winning tickets.
However, they did not use iterative pruning and late resetting.
10. Related Work – Transfer Learning
• Training models on one dataset, pruning, and then fine-tuning on
another dataset can often result in high performance on the transfer
dataset.
• However, those studies investigate the transfer of learned
representations, whereas this work analyzes the transfer of initializations
across datasets.
11. Approach – Iterative Pruning
• Pruning a large fraction of weights in one step (“one-shot pruning”)
will often prune weights which were actually important.
• By only pruning a small fraction of weights on each iteration,
iterative pruning helps to de-noise the pruning process, and produces
substantially better pruned models and winning tickets.
• For this work, magnitude pruning with an iterative pruning rate of
20% of remaining parameters was used.
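One round of this iterative scheme might look like the following sketch. It is illustrative only (the real implementation operates on actual network layers), but it shows the key point: each round removes 20% of the *remaining* weights, not 20% of the original total.

```python
import numpy as np

def prune_round(weights, mask, rate=0.2):
    """One iteration of magnitude pruning: zero out `rate` of the
    still-unmasked weights with the smallest magnitudes."""
    remaining = np.flatnonzero(mask)                  # indices still alive
    k = int(round(rate * remaining.size))             # 20% of the remainder
    smallest = remaining[np.argsort(np.abs(weights[remaining]))[:k]]
    new_mask = mask.copy()
    new_mask[smallest] = 0.0
    return new_mask

# After n rounds at a 20% rate, roughly 0.8**n of the weights remain.
```

Between rounds, the network is retrained and (with late resetting) rewound, which is what makes the procedure expensive.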
12. Approach – Late Resetting
• In the initial investigation of lottery tickets, winning ticket weights
were reset to their values at the beginning of training (training
iteration 0), and learning rate warmup was found to be necessary.
• Simply resetting the winning ticket weights to their values at training
iteration k(small number) consistently produces better winning
tickets and removes the need for learning rate warmup.
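A minimal sketch of late resetting, under my own naming (`step_fn` is a hypothetical stand-in for one optimizer update): snapshot the weights at a small iteration k during the first training run, then build tickets from θₖ instead of θ₀.

```python
import numpy as np

def train_with_snapshot(theta_0, step_fn, iters, k):
    """Train for `iters` steps, keeping a copy of the weights at
    iteration k for late resetting."""
    theta, theta_k = theta_0.copy(), None
    for t in range(1, iters + 1):
        theta = step_fn(theta)
        if t == k:
            theta_k = theta.copy()       # late-resetting snapshot
    return theta, theta_k

def late_reset_ticket(mask, theta_k):
    return mask * theta_k                # reset survivors to theta_k, not theta_0
```

With a toy update that adds 1.0 each step, the snapshot at k = 2 differs from both the initialization and the final weights, which is exactly the point of resetting "late".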
13. Approach – Global Pruning
• In local pruning, weights are pruned within each layer separately,
such that every layer will have the same fraction of pruned
parameters.
• In global pruning, all layers are pooled together prior to pruning,
allowing the pruning fraction to vary across layers.
• Consistently, the authors observed that global pruning leads to higher
performance than local pruning.
The first layer of VGG19 contains only 1792 parameters, so pruning this layer
at a rate of 99% would result in only 18 parameters remaining, significantly
harming the expressivity of the network.
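The difference between the two schemes comes down to where the magnitude threshold is computed. A sketch under my own naming (not the authors' code):

```python
import numpy as np

def local_masks(layers, frac):
    """Local pruning: each layer prunes `frac` of its own weights."""
    masks = []
    for W in layers:
        k = int(frac * W.size)
        thresh = np.sort(np.abs(W).ravel())[k]    # per-layer threshold
        masks.append((np.abs(W) >= thresh).astype(W.dtype))
    return masks

def global_masks(layers, frac):
    """Global pruning: one threshold over all layers pooled together,
    so the effective pruning rate can vary across layers."""
    pooled = np.sort(np.concatenate([np.abs(W).ravel() for W in layers]))
    thresh = pooled[int(frac * pooled.size)]      # single global threshold
    return [(np.abs(W) >= thresh).astype(W.dtype) for W in layers]
```

In a toy example with one small-magnitude layer and one large-magnitude layer, local pruning removes the same fraction from each, while global pruning concentrates all of its pruning in the small-magnitude layer.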
15. Approach – Random Masks
• In previous work, the structure of the pruning mask was maintained
for the random ticket initializations, with only the values of the
weights themselves randomized.
• However, the structure of this mask contains a substantial amount of
information and requires foreknowledge of the winning ticket
initialization.
• For this work, they consider random tickets to have both randomly
drawn weight values (from the initialization distribution) and
randomly permuted masks.
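The random-ticket control described above can be sketched as follows (illustrative; `init_sampler` is a hypothetical stand-in for the network's initialization distribution). Permuting the mask preserves its overall sparsity but destroys its structure.

```python
import numpy as np

def random_ticket(winning_mask, init_sampler, rng):
    """Random ticket: freshly drawn weight values AND a randomly
    permuted mask (same sparsity, structure destroyed)."""
    mask = rng.permutation(winning_mask.ravel()).reshape(winning_mask.shape)
    theta = init_sampler(winning_mask.shape)      # fresh draw from init dist
    return mask * theta, mask
```

The permuted mask always keeps exactly as many weights as the winning mask did, so any performance gap between the two isolates the value of the mask's structure and the paired initial values.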
16. Models
• A modified form of VGG19
Fully-connected layers were removed and replaced with an average pool
layer.
• ResNet50
17. Transferring Winning Tickets across Datasets and
Optimizers
• They generate winning tickets in one training configuration (“source”)
and evaluate performance in a different configuration (“target”).
• For many transfers, this requires changing the output layer size,
since datasets have different numbers of output classes.
19. Results – Transfer across Datasets
• Six different natural-image datasets were used:
Fashion-MNIST, SVHN, CIFAR-10, CIFAR-100, ImageNet, Places365
These datasets vary across a number of axes, including grayscale vs. color,
input size, number of output classes, and training set size.
20. Results – Transfer across Datasets
• A substantial fraction of the inductive bias provided by winning tickets
is dataset-independent.
• Winning tickets generated on larger, more complex datasets
generalized substantially better than those generated on small
datasets. Interestingly, the effect was not merely a result of training
set size, but also appeared to be impacted by the number of classes.
• Winning ticket transfer success was roughly similar across VGG19
and ResNet50 models, with some differences.
25. Results – Transfer across Optimizers
• It is possible that winning tickets are also specific to the particular
optimizer that is used.
• They found that, as with transfer across datasets, transferred
winning tickets for VGG models achieved similar performance as
those generated using the source optimizer.
27. Results – Randomized Masks
• The winning ticket outperformed all random tickets.
• The layerwise statistics derived from training the over-parameterized
model are very informative.
28. Caveats and Next Steps
• Remaining key limitations
Generating these winning tickets via iterative pruning is very slow. This issue
is especially prevalent given the observation that larger datasets produce
more generic winning tickets.
They have only evaluated the transfer of winning tickets across datasets
within the same domain (natural images) and task (object classification).
While they found that transferred winning tickets often achieved roughly
similar performance to those generated on the same dataset, there was
often a small, but noticeable gap between the transfer ticket and the same
dataset ticket.
They only evaluated situations where the network topology is fixed in both
the source and target training configurations.
A critical question remains unanswered: what makes winning tickets special?