One Ticket to Win Them All:
Generalizing Lottery Ticket Initializations across Datasets and Optimizers
Ari S. Morcos, et al., “One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers”
29th September, 2019
PR12 Paper Review
JinWon Lee
Samsung Electronics
Related PR12
• PR-180: The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks: https://youtu.be/dkNmYu610r8
• PR-072: Deep Compression: https://youtu.be/9mFZmpIbMDs
Lottery Ticket Hypothesis
• Neural networks can be pruned by more than 90% without harming accuracy.
• But if we train directly (from random initialization) on the pruned structure, performance always decreases.
• The initialization of an overparameterized neural network contains much smaller sub-network initializations which, when trained in isolation, reach similar performance to the full network.
 Winning Tickets
 f(x; W) reaches accuracy a in t iterations
 f(x; m⊙W) reaches accuracy a’ in t’ iterations, with mask m ∈ {0, 1}
 where |m| ≪ |W|, a’ ≥ a, t’ ≤ t
How to Find the Winning Tickets?
1. Randomly initialize a neural network f(x; θ0) (where θ0 ~ D0).
2. Train the neural network for j iterations, arriving at parameters θj.
3. Prune p% of the parameters in θj, creating a mask m.
4. Reset the remaining parameters to their values in θ0, creating the winning ticket f(x; m⊙θ0).
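The four steps above can be sketched with NumPy; `train` is a hypothetical stand-in for the j training iterations, and the weights are a flat array:

```python
import numpy as np

def find_winning_ticket(theta0, train, prune_frac=0.2):
    """One round of magnitude pruning with weight rewinding (steps 1-4).

    theta0     -- initial weights, drawn from the init distribution D0
    train      -- hypothetical function mapping weights to trained weights theta_j
    prune_frac -- fraction p of parameters pruned by smallest magnitude
    """
    theta_j = train(theta0)                          # step 2: train for j iterations
    k = int(prune_frac * theta_j.size)               # number of weights to prune
    threshold = np.sort(np.abs(theta_j))[k]          # k-th smallest magnitude
    mask = (np.abs(theta_j) >= threshold).astype(theta0.dtype)   # step 3
    return mask, mask * theta0                       # step 4: rewind survivors to theta0
```

In the full method this round is repeated iteratively, re-training the masked network each time rather than pruning everything in one shot.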
Implications from Winning (Lottery) Tickets
• The most commonly used initialization schemes are sub-optimal and have significant room for improvement.
• Over-parameterization is not necessary during the course of training; rather, it is merely necessary to find a “good” initialization of an appropriately parameterized network.
Limitations of Winning Tickets
• The process of finding winning tickets is computationally expensive.
• It remains unclear whether the properties of winning tickets (high performance) are specific to the precise combination of architecture, optimizer, and dataset, or whether winning tickets contain inductive biases which improve training more generically.
 If the same winning ticket generalizes across training conditions and datasets, it unlocks the possibility of generating a small number of such winning tickets and reusing them across datasets.
Introduction
• The generality of winning ticket initializations remains unclear.
• The authors attempt to answer this question by generating winning
tickets for one training configuration (optimizer and dataset) and
evaluating their performance on another configuration.
Related Work
• The lottery ticket hypothesis was first postulated and examined in smaller models and datasets.
• The hypothesis was then analyzed in large models and datasets, leading to the development of late resetting.
Related Work – The Lottery Ticket Hypothesis Has Been Challenged
• If randomly initialized sub-networks with structured pruning are scaled appropriately and trained for long enough, they can match the performance of winning tickets.
• Another study examined the lottery ticket hypothesis in the context of large-scale models and was unable to find successful winning tickets.
 However, they did not use iterative pruning and late resetting.
Related Work – Transfer Learning
• Training models on one dataset, pruning, and then fine-tuning on another dataset can often result in high performance on the transfer dataset.
• However, these studies investigate the transfer of learned representations, whereas this work analyzes the transfer of initializations across datasets.
Approach – Iterative Pruning
• Pruning a large fraction of weights in one step (“one-shot pruning”)
will often prune weights which were actually important.
• By only pruning a small fraction of weights on each iteration,
iterative pruning helps to de-noise the pruning process, and produces
substantially better pruned models and winning tickets.
• For this work, magnitude pruning with an iterative pruning rate of
20% of remaining parameters was used.
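Because each round removes 20% of the *remaining* parameters, the surviving fraction after k rounds is (1 − 0.2)^k. A quick check of how many rounds are needed to reach the ~90% sparsity mentioned earlier:

```python
def remaining_fraction(k, rate=0.2):
    """Fraction of weights still unpruned after k iterative-pruning rounds."""
    return (1 - rate) ** k

# Leaving fewer than 10% of the weights (>90% sparsity) takes 11 rounds at 20%:
rounds_needed = next(k for k in range(1, 100) if remaining_fraction(k) < 0.10)
```

Each round requires re-training the network, which is why the procedure is so expensive.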
Approach – Late Resetting
• In the initial investigation of lottery tickets, winning ticket weights
were reset to their values at the beginning of training (training
iteration 0), and learning rate warmup was found to be necessary.
• Simply resetting the winning ticket weights to their values at training iteration k (a small number) consistently produces better winning tickets and removes the need for learning rate warmup.
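A minimal sketch of the change, assuming `checkpoints` is a hypothetical list of weight snapshots saved during training (`checkpoints[0]` being the weights at iteration 0):

```python
import numpy as np

def rewind(mask, checkpoints, k=0):
    """Rewind surviving weights to their snapshot at training iteration k.

    k=0 reproduces the original lottery-ticket reset; late resetting simply
    uses a small k > 0 instead.
    """
    return mask * checkpoints[k]
```

With late resetting the rest of the procedure is unchanged; only the rewind target moves from iteration 0 to iteration k.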
Approach – Global Pruning
• In local pruning, weights are pruned within each layer separately,
such that every layer will have the same fraction of pruned
parameters.
• In global pruning, all layers are pooled together prior to pruning,
allowing the pruning fraction to vary across layers.
• Consistently, the authors observed that global pruning leads to higher performance than local pruning.
 For example, the first layer of VGG19 contains only 1,792 parameters, so pruning this layer at a rate of 99% would leave only 18 parameters, significantly harming the expressivity of the network.
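The difference between the two schemes comes down to where the magnitude threshold is computed; a sketch over a list of per-layer weight arrays:

```python
import numpy as np

def local_masks(layers, frac=0.2):
    """Local pruning: each layer prunes `frac` of its own weights."""
    masks = []
    for w in layers:
        thr = np.sort(np.abs(w).ravel())[int(frac * w.size)]
        masks.append(np.abs(w) >= thr)
    return masks

def global_masks(layers, frac=0.2):
    """Global pruning: one threshold over all layers pooled together,
    so the pruned fraction can vary from layer to layer."""
    pooled = np.concatenate([np.abs(w).ravel() for w in layers])
    thr = np.sort(pooled)[int(frac * pooled.size)]
    return [np.abs(w) >= thr for w in layers]
```

Under global pruning a tiny layer (like VGG19’s first layer) can be left nearly intact while larger layers absorb more of the pruning budget.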
Global Pruning vs Layerwise Pruning
Approach – Random Masks
• In previous work, the structure of the pruning mask was maintained
for the random ticket initializations, with only the values of the
weights themselves randomized.
• However, the structure of this mask contains a substantial amount of information and requires foreknowledge of the winning ticket initialization.
• For this work, they consider random tickets to have both randomly
drawn weight values (from the initialization distribution) and
randomly permuted masks.
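This stricter random-ticket control can be sketched as follows; `init_sampler` is a hypothetical draw from the network’s initialization distribution:

```python
import numpy as np

def random_ticket(mask, init_sampler, seed=0):
    """Random ticket: fresh weights from the init distribution AND a randomly
    permuted mask -- the sparsity level survives, the mask structure does not."""
    rng = np.random.default_rng(seed)
    permuted = rng.permutation(mask.ravel()).reshape(mask.shape)
    return permuted * init_sampler(mask.shape)
```

Because only a permutation is applied, the number of surviving weights is identical to the winning ticket’s, isolating the contribution of mask structure.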
Models
• A modified form of VGG19
 The fully-connected layers were removed and replaced with an average pooling layer.
• ResNet50
Transferring Winning Tickets across Datasets and Optimizers
• They generate winning tickets in one training configuration (“source”) and evaluate performance in a different configuration (“target”).
• For many transfers, this requires changing the output layer size, since datasets have different numbers of output classes.
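Resizing the head might look like this sketch, where the ticket is a hypothetical dict of masked layer weights and only the final layer is re-created (freshly initialized, sized for the target class count):

```python
import numpy as np

def adapt_output_layer(ticket, n_features, n_target_classes, seed=0):
    """Replace the ticket's output layer with a fresh one sized for the target
    dataset; all other masked layers transfer unchanged."""
    rng = np.random.default_rng(seed)
    adapted = dict(ticket)                 # shallow copy: backbone layers shared
    adapted["head"] = rng.standard_normal((n_features, n_target_classes)) * 0.01
    return adapted
```

The dict keys (`"head"`, `"conv1"` in the test) are illustrative names, not the paper’s actual parameter naming.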
Results – Transfer within the Same Data Distribution
Results – Transfer across Datasets
• Six different natural image datasets were used
 Fashion-MNIST, SVHN, CIFAR-10, CIFAR-100, ImageNet, Places365
 These datasets vary along a number of axes, including grayscale vs. color, input size, number of output classes, and training set size.
Results – Transfer across Datasets
• A substantial fraction of the inductive bias provided by winning tickets is dataset-independent.
• Winning tickets generated on larger, more complex datasets generalized substantially better than those generated on small datasets. Interestingly, this effect was not merely a result of training set size, but also appeared to be impacted by the number of classes.
• Winning ticket transfer success was roughly similar across VGG19 and ResNet50 models, with some differences.
Results – Transfer across Optimizers
• It is possible that winning tickets are also specific to the particular optimizer that is used.
• They found that, as with transfer across datasets, transferred winning tickets for VGG models achieved similar performance to those generated using the source optimizer.
Results – Randomized Masks
• The winning ticket outperformed all random tickets.
• The layerwise statistics derived from training the over-parameterized model are very informative.
Caveats and Next Steps
• Remaining key limitations
 Generating these winning tickets via iterative pruning is very slow. This issue is especially relevant given the observation that larger datasets produce more generic winning tickets.
 They have only evaluated the transfer of winning tickets across datasets within the same domain (natural images) and task (object classification).
 While transferred winning tickets often achieved roughly similar performance to those generated on the same dataset, there was often a small but noticeable gap between the transfer ticket and the same-dataset ticket.
 They only evaluated situations where the network topology is fixed in both the source and target training configurations.
 A critical question remains unanswered: what makes winning tickets special?
Thank you

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 

PR-197: One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers

  • 1. One Ticket to Win Them All: Generalizing Lottery Ticket Initialization across Datasets and Optimizers
    Ari S. Morcos, et al., “One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers”
    29th September, 2019 | PR12 Paper Review | JinWon Lee, Samsung Electronics
  • 2. Related PR12
    • PR-180: The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks: https://youtu.be/dkNmYu610r8
    • PR-072: Deep Compression: https://youtu.be/9mFZmpIbMDs
  • 3. Lottery Ticket Hypothesis
    • Neural networks can be pruned by more than 90% without harming accuracy.
    • But if we train directly (from random initialization) on the pruned structure, performance always decreases.
    • The initialization of overparameterized neural networks contains much smaller sub-network initializations which, when trained in isolation, reach similar performance as the full network.
    ➢ Winning Tickets
      ➢ f(x; W) reaches accuracy a in t iterations
      ➢ f(x; m⊙W) reaches accuracy a′ in t′ iterations, with mask m ∈ {0, 1}^|W|
      ➢ where |m| << |W|, a′ ≥ a, t′ ≤ t
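The winning-ticket notation above can be sketched numerically: a binary mask m applied elementwise to the weights W. This is a minimal NumPy illustration; the layer shape and keep-fraction are hypothetical, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense weight matrix W of an over-parameterized layer (shape is illustrative)
W = rng.normal(size=(256, 256))

# Binary mask m in {0, 1}^|W| keeping roughly 10% of the weights
m = (rng.random(W.shape) < 0.10).astype(W.dtype)

# The sub-network f(x; m ⊙ W) trains only the surviving weights
W_ticket = m * W

sparsity = 1.0 - m.mean()  # fraction of weights pruned, ~0.9
```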
  • 4. How to Find the Winning Tickets?
    1. Randomly initialize a neural network f(x; θ0) (where θ0 ~ D0).
    2. Train the neural network for j iterations, arriving at parameters θj.
    3. Prune p% of the parameters in θj, creating a mask m.
    4. Reset the remaining parameters to their values in θ0, creating the winning ticket f(x; m⊙θ0).
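The four steps above can be sketched in NumPy; `train()` here is a hypothetical stand-in for the real SGD loop, used only to make the procedure concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

def train(weights, mask, iterations):
    """Hypothetical stand-in for j iterations of SGD: perturbs only unmasked weights."""
    return weights + mask * rng.normal(scale=0.01, size=weights.shape) * iterations

# 1. Randomly initialize theta_0
theta_0 = rng.normal(size=(64, 64))
mask = np.ones_like(theta_0)

# 2. Train for j iterations, arriving at theta_j
theta_j = train(theta_0, mask, iterations=10)

# 3. Prune the p% smallest-magnitude parameters of theta_j, creating mask m
p = 0.20
threshold = np.quantile(np.abs(theta_j), p)
mask = (np.abs(theta_j) >= threshold).astype(theta_0.dtype)

# 4. Reset survivors to their initial values: the winning ticket m ⊙ theta_0
winning_ticket = mask * theta_0
```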
  • 5. Implications from Winning (Lottery) Tickets
    • The most commonly used initialization schemes are sub-optimal and have significant room for improvement.
    • Over-parameterization is not necessary during the course of training; rather, it is merely needed to find a “good” initialization of an appropriately parameterized network.
  • 6. Limitations of Winning Tickets
    • The process to find winning tickets is computationally expensive.
    • It remains unclear whether the properties of winning tickets (high performance) are specific to the precise combination of architecture, optimizer, and dataset, or whether winning tickets contain inductive biases which improve training more generically.
    ➢ If the same winning ticket generalizes across training conditions and datasets, it unlocks the possibility of generating a small number of such winning tickets and reusing them across datasets.
  • 7. Introduction
    • The generality of winning ticket initializations remains unclear.
    • The authors attempt to answer this question by generating winning tickets for one training configuration (optimizer and dataset) and evaluating their performance on another configuration.
  • 8. Related Work
    • The lottery ticket hypothesis was first postulated and examined in smaller models and datasets.
    • The hypothesis was then analyzed in large models and datasets, leading to the development of late resetting.
  • 9. Related Work – The Lottery Ticket Hypothesis Has Been Challenged
    • If randomly initialized sub-networks with structured pruning are scaled appropriately and trained for long enough, they can match the performance of winning tickets.
    • Others examined the lottery ticket hypothesis in the context of large-scale models and were unable to find successful winning tickets.
    ➢ They did not use iterative pruning and late resetting.
  • 10. Related Work – Transfer Learning
    • Training models on one dataset, pruning, and then fine-tuning on another dataset can often result in high performance on the transfer dataset.
    • However, these studies investigate the transfer of learned representations, whereas this work analyzes the transfer of initializations across datasets.
  • 11. Approach – Iterative Pruning
    • Pruning a large fraction of weights in one step (“one-shot pruning”) will often prune weights which were actually important.
    • By only pruning a small fraction of weights on each iteration, iterative pruning helps to de-noise the pruning process and produces substantially better pruned models and winning tickets.
    • For this work, magnitude pruning with an iterative pruning rate of 20% of remaining parameters was used.
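With a 20% iterative rate, the fraction of weights remaining after k rounds is 0.8^k, so reaching roughly 99% sparsity takes about 21 train-prune-reset cycles, which is where much of the expense comes from. A quick check:

```python
import math

iterative_rate = 0.20    # prune 20% of the remaining weights per round
target_remaining = 0.01  # keep 1% of weights, i.e. ~99% sparsity

# Remaining fraction after k rounds is (1 - rate)^k; find the first k at or below target
rounds = math.ceil(math.log(target_remaining) / math.log(1 - iterative_rate))
print(rounds)  # 21
```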
  • 12. Approach – Late Resetting
    • In the initial investigation of lottery tickets, winning ticket weights were reset to their values at the beginning of training (training iteration 0), and learning rate warmup was found to be necessary.
    • Simply resetting the winning ticket weights to their values at training iteration k (a small number) consistently produces better winning tickets and removes the need for learning rate warmup.
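Late resetting changes only the reset step of the procedure: survivors go back to their values at iteration k rather than iteration 0. A NumPy sketch with hypothetical checkpoint arrays standing in for saved training states:

```python
import numpy as np

rng = np.random.default_rng(1)

theta_0 = rng.normal(size=(64, 64))                             # weights at initialization
theta_k = theta_0 + rng.normal(scale=0.01, size=theta_0.shape)  # checkpoint after k early iterations
theta_j = theta_k + rng.normal(scale=0.10, size=theta_0.shape)  # weights at the end of training

# Magnitude mask keeping the top 20% of final weights
mask = (np.abs(theta_j) >= np.quantile(np.abs(theta_j), 0.8)).astype(theta_0.dtype)

original_ticket = mask * theta_0    # reset to iteration 0 (needed LR warmup)
late_reset_ticket = mask * theta_k  # late resetting: reset to iteration k
```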
  • 13. Approach – Global Pruning
    • In local pruning, weights are pruned within each layer separately, such that every layer will have the same fraction of pruned parameters.
    • In global pruning, all layers are pooled together prior to pruning, allowing the pruning fraction to vary across layers.
    • Consistently, global pruning was observed to lead to higher performance than local pruning.
    ➢ The first layer of VGG19 contains only 1792 parameters, so pruning this layer at a rate of 99% would leave only 18 parameters, significantly harming the expressivity of the network.
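The two schemes differ only in where the magnitude threshold is computed. A toy two-layer sketch: the sizes mimic VGG19's small first layer, and the differing weight scales are an illustrative assumption, not a measurement:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical flattened weights: a tiny first layer and a large later layer
layers = {
    "conv1": rng.normal(scale=1.0, size=1792),
    "conv2": rng.normal(scale=0.5, size=100_000),
}
p = 0.99  # overall fraction of weights to prune

# Local (layerwise) pruning: every layer keeps the same fraction -> conv1 keeps ~18 weights
local_masks = {name: np.abs(w) >= np.quantile(np.abs(w), p) for name, w in layers.items()}

# Global pruning: one threshold over the pooled weights; per-layer rates may differ
pooled = np.concatenate([np.abs(w) for w in layers.values()])
threshold = np.quantile(pooled, p)
global_masks = {name: np.abs(w) >= threshold for name, w in layers.items()}
```

Under the global threshold the small first layer can retain far more than 1% of its weights, avoiding the 18-parameter bottleneck described above.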
  • 14. Global Pruning vs Layerwise Pruning
  • 15. Approach – Random Masks
    • In previous work, the structure of the pruning mask was maintained for the random ticket initializations, with only the values of the weights themselves randomized.
    • However, the structure of this mask contains a substantial amount of information and requires foreknowledge of the winning ticket initialization.
    • For this work, random tickets have both randomly drawn weight values (from the initialization distribution) and randomly permuted masks.
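The stricter random-ticket baseline can be sketched as redrawing the weights and permuting the mask; the vector size and keep-fraction below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Mask structure discovered by iterative pruning (here just simulated)
winning_mask = (rng.random(1000) < 0.10).astype(np.float64)
theta_0 = rng.normal(size=1000)

# Earlier work's random ticket: keep the winning mask, redraw only the weights
redrawn = rng.normal(size=1000)
random_ticket_old = winning_mask * redrawn

# This work's random ticket: additionally permute the mask, destroying its structure
permuted_mask = rng.permutation(winning_mask)
random_ticket_new = permuted_mask * redrawn
```

The permutation preserves the overall sparsity level, so any performance gap isolates the value of the mask's structure rather than its density.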
  • 16. Models
    • A modified form of VGG19
    ➢ Fully-connected layers were removed and replaced with an average pool layer.
    • ResNet50
  • 17. Transferring Winning Tickets across Datasets and Optimizers
    • They generate winning tickets in one training configuration (“source”) and evaluate performance in a different configuration (“target”).
    • For many transfers, this requires changing the output layer size, since datasets have different numbers of output classes.
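Since the ticket's mask covers the shared backbone, transfer can be sketched as reusing the masked backbone initialization while rebuilding the output layer for the target class count. All sizes and the single-layer "backbone" below are hypothetical simplifications:

```python
import numpy as np

rng = np.random.default_rng(0)

features = 512
source_classes, target_classes = 100, 10  # e.g. a CIFAR-100 ticket applied to CIFAR-10

# Source ticket: mask and initialization for the backbone (one layer here for brevity)
backbone_mask = (rng.random((features, features)) < 0.10).astype(np.float64)
backbone_init = rng.normal(size=(features, features))

# Transfer: reuse the backbone ticket unchanged...
transferred_backbone = backbone_mask * backbone_init

# ...but re-create the (unpruned) output layer to match the target dataset
target_head = rng.normal(scale=1.0 / np.sqrt(features), size=(features, target_classes))
```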
  • 18. Results – Transfer within the Same Data Distribution
  • 19. Results – Transfer across Datasets
    • Six different natural image datasets were used:
    ➢ Fashion-MNIST, SVHN, CIFAR-10, CIFAR-100, ImageNet, Places365
    ➢ These datasets vary across a number of axes, including grayscale vs. color, input size, number of output classes, and training set size.
  • 20. Results – Transfer across Datasets
    • A substantial fraction of the inductive bias provided by winning tickets is dataset-independent.
    • Winning tickets generated on larger, more complex datasets generalized substantially better than those generated on small datasets. Interestingly, the effect was not merely a result of training set size, but also appeared to be impacted by the number of classes.
    • Winning ticket transfer success was roughly similar across VGG19 and ResNet50 models, with some differences.
  • 25. Results – Transfer across Optimizers
    • It is possible that winning tickets are also specific to the particular optimizer that is used.
    • They found that, as with transfer across datasets, transferred winning tickets for VGG models achieved similar performance as those generated using the source optimizer.
  • 27. Results – Randomized Masks
    • The winning ticket outperformed all random tickets.
    • The layerwise statistics derived from training the over-parameterized model are very informative.
  • 28. Caveats and Next Steps
    • Remaining key limitations
    ➢ Generating these winning tickets via iterative pruning is very slow. This issue is especially prevalent given the observation that larger datasets produce more generic winning tickets.
    ➢ They have only evaluated the transfer of winning tickets across datasets within the same domain (natural images) and task (object classification).
    ➢ While transferred winning tickets often achieved roughly similar performance to those generated on the same dataset, there was often a small but noticeable gap between the transfer ticket and the same-dataset ticket.
    ➢ They only evaluated situations where the network topology is fixed in both the source and target training configurations.
    ➢ A critical question remains unanswered: what makes winning tickets special?