Notable-top-5% (ICLR 2023)
PR-445: Token Merging: Your ViT But Faster
주성훈, VUNO Inc.
2023. 6. 18.
1. Research Background
Background
•Vision Transformers (ViTs), introduced by Dosovitskiy et al. (2020), have rapidly advanced the field of computer vision.
•Vision has since been dominated by domain-specific transformer hybrids like
Swin (Liu et al., 2021; Dong et al., 2022) using vision-specific attention,
MViT (Fan et al., 2021; Li et al., 2022) using vision-specific pooling,
or LeViT (Graham et al., 2021) using vision-specific conv modules.
•The reason for this trend is simple: efficiency.
•Adding vision-specific inductive biases enables transformer hybrids to perform
better with less compute.
Yet, vanilla ViTs still have many desirable qualities
•ViTs consist of simple matrix multiplications, making them fast in practice relative to their FLOP count.
•They support powerful self-supervised pre-training techniques such as MAE (He et al., 2022).
•Given their lack of assumptions about the data, they can be applied with little or no change across many modalities (Feichtenhofer et al., 2022; Huang et al., 2022).
•They scale well with massive amounts of data (Zhai et al., 2021; Singh et al., 2022), recently obtaining up to 90.94% top-1 on ImageNet (Wortsman et al., 2022).
Problem Settings
A promising subfield of ViTs: Token pruning
•Due to the input-agnostic nature of transformers, tokens can be pruned at runtime to enable a faster model (Rao et al., 2021; Yin et al., 2022; Meng et al., 2022; Liang et al., 2022; Kong et al., 2022).
An approach more efficient than token pruning is needed.
•Disadvantages:
•the information loss from pruning limits how many tokens can reasonably be reduced
•current methods require re-training the model to be effective (some with extra parameters)
•most pruning papers apply a mask during training rather than removing tokens, which negates the speed-up during training
•several methods prune different numbers of tokens depending on the input content, making batched inference infeasible
Token Merging (ToMe): combine tokens, rather than prune them.
Combining tokens
•Kong et al. (2022) and Liang et al. (2022): moved from token pruning toward token merging
•TokenLearner (Ryoo et al., 2021): uses an MLP to reduce the number of tokens.
•Token Pooling (Marin et al., 2021) is the most similar to our token merging, but uses a slow k-means-based approach.
Until now, no approach has been successful in offering a reasonable speed-accuracy trade-off
when combining tokens without training.
2. Methods
Token merging strategy
•Gradually merge tokens in each block
In each block of the transformer, we merge tokens to reduce the token count by r per layer. Over the L blocks in the network, we gradually merge rL tokens, regardless of the image's content.
•Objective: insert a token merging module between the attention and MLP of each block in an existing ViT.
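For scale: in ViT-L (L = 24 blocks; 197 tokens at 224×224 with 16×16 patches), r = 8 merges 8·24 = 192 tokens over the network. Below is a minimal PyTorch sketch of where the merge step sits inside a block. The module structure is an assumption for illustration, not the authors' implementation; in particular it reuses the pre-attention normalized tokens as a stand-in for the attention keys, and bipartite_soft_matching is sketched under the matching comparison below.

import torch
import torch.nn as nn

class ToMeBlock(nn.Module):
    # A ViT block with a token merging step between attention and MLP.
    # Illustrative sketch only: no size tracking or proportional attention.
    def __init__(self, dim: int, num_heads: int, r: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.r = r  # tokens removed by this block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)   # standard self-attention
        x = x + attn_out
        # Merge r tokens, using key-like features to measure similarity.
        x = bipartite_soft_matching(x, metric=h, r=self.r)
        return x + self.mlp(self.norm2(x))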
Token merging strategy
•Token similarity - cosine similarity between the keys of each token
•the keys (K) already summarize the information contained in each token for use in dot product similarity.
•Thus, we use a dot product similarity metric (e.g., cosine similarity) between the keys of each token to determine which contain similar information.
(a) Feature Choice. The K matrix accurately summarizes the information within tokens.
(b) Distance Function. Cosine similarity is the best choice for speed and accuracy.
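A minimal PyTorch sketch of this similarity computation, assuming keys k of shape (batch, tokens, dim), e.g., already averaged over attention heads (names and shapes are illustrative):

import torch

def key_cosine_similarity(k: torch.Tensor) -> torch.Tensor:
    # k: (B, N, D) attention keys -> (B, N, N) pairwise similarities.
    k = k / k.norm(dim=-1, keepdim=True)   # L2-normalize each key
    return k @ k.transpose(-2, -1)         # dot product of unit vectors = cosine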
Comparing Matching Algorithms
Table 2: Matching Algorithm. Different matching algorithms with the same settings as Tab. 1. Our bipartite algorithm is almost as fast as randomly pruning tokens, while retaining high accuracy.
Matching is more suited for this setup than pruning or clustering.
•Pruning: fast, but removing 98% of the tokens causes information loss.
•k-means: can merge tokens that are not similar.
•Greedy matching: repeatedly merges the most similar pair of tokens, r times. Matching accuracy is high, but it gets slower as r grows.
•Bipartite matching: constant runtime regardless of r, while maintaining accuracy.
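The following PyTorch sketch shows one way the bipartite matching step could look. It is a simplified, assumption-laden illustration rather than the official code: it uses plain averaging instead of size-weighted averaging, does not protect the class token, does not restore token order, and assumes an even token count with r ≤ N/2.

import torch

def bipartite_soft_matching(x: torch.Tensor, metric: torch.Tensor,
                            r: int) -> torch.Tensor:
    # x:      (B, N, C) token features to merge.
    # metric: (B, N, D) features used for similarity (e.g., attention keys).
    # Returns (B, N - r, C).
    bsz, n, c = x.shape
    metric = metric / metric.norm(dim=-1, keepdim=True)
    a, b = metric[:, ::2], metric[:, 1::2]      # alternate tokens into two sets
    scores = a @ b.transpose(-2, -1)            # (B, N/2, N/2) cosine similarities

    # Each set-A token proposes its single best match in set B.
    node_max, node_idx = scores.max(dim=-1)
    edge_order = node_max.argsort(dim=-1, descending=True)
    merged_idx = edge_order[:, :r]              # the r most similar A-tokens
    kept_idx = edge_order[:, r:]                # A-tokens that survive

    src, dst = x[:, ::2], x[:, 1::2]
    src_merge = src.gather(1, merged_idx.unsqueeze(-1).expand(-1, -1, c))
    dst_idx = node_idx.gather(1, merged_idx).unsqueeze(-1).expand(-1, -1, c)
    # Fold each merged A-token into its matched B-token by averaging.
    dst = dst.scatter_reduce(1, dst_idx, src_merge, reduce="mean")

    src_kept = src.gather(1, kept_idx.unsqueeze(-1).expand(-1, -1, c))
    return torch.cat([src_kept, dst], dim=1)

Because matching is one parallel argmax plus a top-r selection rather than r sequential merge steps, runtime is essentially independent of r, which matches the table's observation.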
Proportional attention
(d) Combining Method. Averaging tokens weighted by their size, s (see Eq. 1), ensures consistency.
•Eq. 1 (proportional attention): A = softmax(QKᵀ/√d + log s),
•where s is a row vector containing the size of each token (the number of patches the token represents). This performs the same operation as if you had s copies of the key.
•We also need to weight tokens by s any time they would be aggregated, such as when merging tokens together.
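A minimal sketch of Eq. 1 as restated above, assuming size has shape (batch, tokens) and q, k, v are per-head tensors (names and shapes are illustrative):

import math
import torch

def proportional_attention(q: torch.Tensor, k: torch.Tensor,
                           v: torch.Tensor, size: torch.Tensor) -> torch.Tensor:
    # q, k, v: (B, H, N, D); size: (B, N) = patches represented per token.
    logits = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    logits = logits + size.log()[:, None, None, :]   # + log s over the key axis
    return logits.softmax(dim=-1) @ v                # each key counts 'size' times

When two tokens merge, their features would accordingly be combined with a size-weighted average and their sizes summed; the matching sketch above omits this for brevity.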
3. Experimental Results
Main contributions
• We introduce a technique to increase the throughput and real-world training speed of ViT models, both with and without training.
• We thoroughly ablate our design choices.
• We perform extensive experiments on images with several ViT models and compare to the state-of-the-art in architecture design and token pruning methods (Sec. 4.3).
• We then repeat these experiments for both video (Sec. 5) and audio (Sec. 6) and find that ToMe works well across modalities.
ToMe in Supervised Models
•SWAG (Singh et al., 2022): pretrained on 3.6B images
•AugReg (Steiner et al., 2022): pretrained with various data augmentation and regularization techniques
•Large models are deeper and thus allow for more gradual
change in features, which lessens the impact of merging.
ToMe in Self-Supervised Models (MAE)
•Our token merging method allows roughly 2× faster epochs with negligible accuracy drops for large models.
•This suggests that one could train even bigger models with token merging than was possible before.
Selecting a Merging Schedule
Figure 2: Token Merging Schedule. Our default constant merging schedule is close to optimal when compared to 15k randomly sampled merging schedules on an AugReg ViT-B/16. A decreasing schedule works well at throughputs up to ∼3×.
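As a purely illustrative sketch (not the paper's code), the two schedules compared here can be written as follows; the decreasing variant is assumed to go linearly from about 2r down to 0, which merges the same rL total before rounding:

def constant_schedule(r: int, num_layers: int) -> list[int]:
    # Default ToMe schedule: merge r tokens in every block (rL in total).
    return [r] * num_layers

def decreasing_schedule(r: int, num_layers: int) -> list[int]:
    # Merge more tokens early, fewer later; sums to r * num_layers before rounding.
    raw = [2 * r * (num_layers - i) / (num_layers + 1)
           for i in range(num_layers)]
    return [round(v) for v in raw]

# e.g., decreasing_schedule(8, 24) starts near 15 merges per block and ends near 1.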
Comparison to other works - state-of-the-art models trained on ImageNet-1k without extra data
•ToMe improves the throughput of ViT models such that ViT-L
and ViT-H become comparable in speed to models of a lower
tier, without scaling the number of features.
- baseline models trained by us without ToMe
Comparison to other works - Token Pruning
•Train speed: ToMe at r = 13 trains 1.5× faster than the baseline DeiT-S
•If we take an off-the-shelf AugReg ViT-S model and apply the same merging schedule, we can match the performance of the DeiT models without training.
•DeiT: Data-efficient Image Transformers (PR-297)
Visualization
•Token merging has some similarity to part segmentation
Visualization
•Multiple instances of the same class are merged together.
•Different parts become different merged tokens.
Performance on video
•Baseline: Spatiotemporal MAE (Feichtenhofer et al., Masked autoencoders as spatiotemporal learners. NeurIPS, 2022)
•Two settings: simply applying our method off-the-shelf without training, and applying our method during MAE fine-tuning with the default training recipe
Performance on AudioSet
AudioSet-2M
• A collection of audio clips gathered from YouTube videos
• Over 2 million audio samples, each clip 10 seconds in length
• 527 categories: musical instruments, natural sounds, human activities, environmental sounds
•Baseline: Audio MAE (Po-Yao Huang et al., Masked autoencoders that listen. NeurIPS, 2022)
4. Conclusions
• This paper introduced Token Merging (ToMe) to increase the throughput of ViT models by
gradually merging tokens.
• ToMe naturally exploits redundancy in the input, allowing its use for any modality with
redundancy.
• This paper contained extensive experiments on images, video, and audio, obtaining speeds
and accuracies competitive with the state-of-the-art in each case.
• The paper focuses on classification, but our visualizations show potential on tasks like segmentation.
• ToMe works well on large models across domains and cuts down training time and memory
usage, meaning it could be a core component of training huge models.
Thank you.