Visual Prompt Tuning (VPT), Parameter-efficient fine-tuning
Papers presented so far: https://github.com/Lilcob/-DL_PaperReadingMeeting
Presentation slides: https://www.slideshare.net/taeseonryu/mplug
Hello, this is the Deep Learning Paper Reading Group! The paper we are introducing today is 'Visual Prompt Tuning for Transformers with Frozen Weights'.
The paper introduces Visual Prompt Tuning (VPT), an efficient and effective alternative to full fine-tuning for adapting large-scale Transformer models to vision tasks. VPT introduces only a small number of trainable parameters in the input space while keeping the model backbone frozen.
With this method, VPT achieves substantial performance gains over other parameter-efficient tuning protocols and, in many cases, even surpasses full fine-tuning while reducing per-task storage costs, as the experiments show.
The paper addresses the challenge of adapting large pre-trained Transformers to downstream tasks both effectively and efficiently, and shows that performance on a wide range of vision tasks can be improved in a far more economical way.
For today's review, 조경진 from the image processing team kindly prepared a detailed walkthrough. Thank you in advance for your interest!
https://youtu.be/bVOk-hSYyZw
2. Contents
1. Introduction
2. Related Work
3. Methods
4. Experiments
5. Conclusion
3. ❖ Adapting large foundation models pre-trained on massive data
https://arxiv.org/abs/1512.04150
Adapting large models to downstream tasks presents its own challenges.
• The most obvious adaptation strategy is full end-to-end fine-tuning of the pre-trained model on the task at hand.
• However, this strategy requires storing and deploying a separate copy of the backbone parameters for every single task.
• This is expensive and often infeasible, especially for modern Transformer-based architectures, which are significantly larger than their convolutional network counterparts, e.g., ViT-Huge (632M parameters) vs. ResNet-50 (25M parameters); a separate fp32 copy of ViT-Huge alone costs roughly 632M × 4 B ≈ 2.5 GB per task.
What is the best way to adapt large pre-trained Transformers to downstream tasks in terms of effectiveness and efficiency?
4. ❖ Adapting to new tasks
(a): A popular approach is to fine-tune only a subset of the parameters, such as the classifier head or the bias terms.
(b): Instead of altering or fine-tuning the pre-trained Transformer itself, the authors modify the input to the Transformer. Drawing inspiration from recent advances in prompting in NLP, they propose a new, simple, and efficient method to adapt Transformer models for downstream vision tasks.
5. ❖ Post-training in large language models
https://arxiv.org/abs/1512.04150
Given their superior performance and much larger scale compared to ConvNets, how to efficiently adapt Transformers to different
vision tasks remains an important open problem. Our proposed VPT provides a promising path forward.
1) Transfer learning
Side tuning, bias tuning
2) Adapter
Extra lightweight modules inside each Transformer layer
3) Prompting
Originally refers to prepending a language instruction to the input text so that a pre-trained LM can “understand” the task.
6. ❖ Adapter
https://qdata.github.io/deep2Read//deep2reproduce/2019Fall//T11_Schoch_Stephaniesns2gr_Parameter-Efficient_Transfer.pdf
Adapters insert extra lightweight modules inside each Transformer layer; a minimal sketch follows below.
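As a reference, here is a minimal PyTorch sketch of such an adapter module, in the spirit of Houlsby et al.'s design. The dimensions, the GELU activation, and the zero-initialized up-projection are illustrative assumptions, not details taken from the slides:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """A minimal adapter sketch: a bottleneck MLP with a residual
    connection, inserted inside each (otherwise frozen) Transformer layer."""

    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project down to bottleneck
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # project back up
        nn.init.zeros_(self.up.weight)           # start as a near-identity so
        nn.init.zeros_(self.up.bias)             # frozen features pass through

    def forward(self, x):
        # residual connection: the adapter only learns a small correction
        return x + self.up(self.act(self.down(x)))
```

Because only these bottleneck layers are trained, the new parameters per layer number roughly 2 × dim × bottleneck, a small fraction of a full Transformer block.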
7. ❖ Prompting
Originally, prompting refers to prepending a language instruction to the input text so that a pre-trained LM can “understand” the task.
Prompt templates fall into two categories, depending on whether they can be interpreted literally by humans:
Discrete Prompts (a.k.a. hard prompts)
• Search for the optimal combination of tokens from the vocabulary to fill the prompt template, e.g., appending “It was [MASK].” to a review for sentiment classification.
• Although the result is human-readable and understandable, searching in a discrete space makes it difficult to achieve performance as good as searching in a continuous space.
Continuous Prompts (a.k.a. soft prompts)
• The prompt does not need to be natural language that humans can understand.
• Special tokens (or virtual tokens) are created for the prompt and optimized in continuous space.
https://mobile.twitter.com/joeddav/status/1390731869319217158
8. ❖ Continuous prompting
Special tokens (or virtual tokens) are created for the prompt and optimized in continuous space, as in the sketch below.
https://arxiv.org/pdf/2103.10385.pdf
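A minimal sketch of this idea, assuming the language model exposes its input embeddings; the tensor shapes, initialization, and module name are illustrative, not taken from a specific library:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """A minimal soft-prompt sketch: learnable virtual-token embeddings
    are prepended to the embedded input; the LM weights are untouched."""

    def __init__(self, n_virtual=20, embed_dim=768):
        super().__init__()
        # virtual tokens live directly in embedding space, not in the vocab
        self.virtual = nn.Parameter(torch.randn(n_virtual, embed_dim) * 0.02)

    def forward(self, input_embeds):
        # input_embeds: [batch, seq_len, embed_dim] from the LM's embedding layer
        batch = input_embeds.shape[0]
        prompts = self.virtual.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompts, input_embeds], dim=1)
```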
9. ❖ Visual Prompt Tuning (VPT)
VPT injects a small number of learnable parameters into the Transformer's input space and keeps the backbone frozen during the downstream training stage.
For a plain ViT with $N$ layers, an input image is divided into $m$ fixed-sized patches $I_j \in \mathbb{R}^{3 \times h \times w}$, $j \in \mathbb{N}$, $1 \le j \le m$. Each patch is embedded into a $d$-dimensional latent space, and the collection of image patch embeddings $E_i = \{ e_i^j \in \mathbb{R}^d \mid j \in \mathbb{N}, 1 \le j \le m \}$ serves as input to the $(i{+}1)$-th Transformer layer $L_{i+1}$. Together with an extra learnable classification token ([CLS]), whose embedding at $L_{i+1}$'s input space is denoted $x_i \in \mathbb{R}^d$, the whole ViT is formulated as:

$$[x_i, E_i] = L_i([x_{i-1}, E_{i-1}]), \quad i = 1, 2, \ldots, N$$
$$y = \mathrm{Head}(x_N)$$
10. ❖ Visual Prompt Tuning (VPT)
Given a pre-trained Transformer model, the authors introduce a set of $p$ continuous embeddings of dimension $d$ (i.e., prompts) in the input space after the Embed layer.
Only the task-specific prompts are updated during fine-tuning, while the Transformer backbone is kept frozen.
Depending on the number of Transformer layers involved, the approach has two variants, VPT-Shallow and VPT-Deep.
[Figure: VPT-Shallow vs. VPT-Deep; in the original figure, colors distinguish learnable from frozen parameters.]
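For reference, the two variants can be written out following the paper's notation, where $P = \{ p^k \in \mathbb{R}^d \mid k \in \mathbb{N}, 1 \le k \le p \}$ is the set of $p$ learnable prompts and $Z_i$ denotes the features computed by the $i$-th layer at the prompt positions:

VPT-Shallow (prompts inserted only into the first layer's input):
$$[x_1, Z_1, E_1] = L_1([x_0, P, E_0])$$
$$[x_i, Z_i, E_i] = L_i([x_{i-1}, Z_{i-1}, E_{i-1}]), \quad i = 2, \ldots, N$$
$$y = \mathrm{Head}(x_N)$$

VPT-Deep (a fresh set of prompts $P_{i-1}$ introduced at every layer's input):
$$[x_i, \_, E_i] = L_i([x_{i-1}, P_{i-1}, E_{i-1}]), \quad i = 1, 2, \ldots, N$$
$$y = \mathrm{Head}(x_N)$$

And a minimal PyTorch sketch of VPT-Shallow, assuming a hypothetical backbone object exposing `embed` (producing the [CLS] token followed by patch embeddings) and `encoder` (the stack of frozen Transformer layers); only the prompts and the head receive gradients:

```python
import torch
import torch.nn as nn

class VPTShallow(nn.Module):
    """A minimal VPT-Shallow sketch: p learnable prompt embeddings are
    prepended between [CLS] and the patch embeddings; every backbone
    weight stays frozen and only prompts + head are trained."""

    def __init__(self, backbone, num_prompts=10, embed_dim=768, num_classes=100):
        super().__init__()
        self.backbone = backbone
        for param in self.backbone.parameters():
            param.requires_grad = False           # freeze the whole backbone
        # P: the only new parameters introduced in the input space
        self.prompts = nn.Parameter(torch.empty(1, num_prompts, embed_dim))
        nn.init.xavier_uniform_(self.prompts)
        self.head = nn.Linear(embed_dim, num_classes)  # task-specific head

    def forward(self, images):
        # `embed` and `encoder` are assumed methods of the hypothetical backbone:
        # embed -> [B, 1+m, d] with [CLS] first; encoder runs the frozen layers.
        tokens = self.backbone.embed(images)
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        prompts = self.prompts.expand(images.shape[0], -1, -1)
        sequence = torch.cat([cls_tok, prompts, patches], dim=1)  # [x0, P, E0]
        out = self.backbone.encoder(sequence)
        return self.head(out[:, 0])               # classify from [CLS] output
```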
12. ❖ Wide range of downstream recognition tasks
The authors compare both variants of VPT with other commonly used fine-tuning protocols for pre-trained backbones:
(a) Full: update all backbone parameters
(b) Classification head: Linear, Partial-k, MLP-k
(c) Subset of parameters: Sidetune, Bias, Adapter (a minimal bias-tuning sketch follows this list)
Datasets for downstream tasks:
(a) FGVC (Fine-Grained Visual Classification): CUB-200-2011, NABirds, Oxford Flowers, Stanford Dogs, Stanford Cars
(b) VTAB-1k (19 diverse tasks): Natural, Specialized, Structured
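As referenced above, a minimal sketch of the Bias protocol (BitFit-style); the `head` parameter-name prefix is an assumption about how the task head is registered in the model:

```python
import torch.nn as nn

def make_bias_tunable(model: nn.Module, head_prefix: str = "head") -> nn.Module:
    """Freeze every backbone weight, leaving only bias terms and the
    task-specific head trainable (the 'Bias' fine-tuning protocol)."""
    for name, param in model.named_parameters():
        # bias terms and the (hypothetical) task head stay trainable
        param.requires_grad = name.endswith("bias") or name.startswith(head_prefix)
    return model
```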
13. ❖ Various dataset comparison
Results of fine-tuning a pre-trained ViT-B/16, averaged across 4 diverse downstream task groups, comparing VPT to the 7 other tuning protocols.
14. ❖ Prompt location, length, depth
18. ❖ Prompt learning in the vision domain
• The authors present Visual Prompt Tuning, a new parameter-efficient approach for leveraging large vision Transformer models on a wide range of downstream tasks.
• VPT introduces task-specific learnable prompts in the input space, keeping the pre-trained backbone fixed.
• The authors show that VPT can surpass other fine-tuning protocols (often including full fine-tuning) while dramatically reducing the storage cost.