Members: 조경진, 김병현, 김현진, 이희재, 안종식, 강인하
Team: Image Processing Team (이미지 처리팀)
2023.04.09
Visual Prompt Tuning
Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim
https://arxiv.org/pdf/2203.12119
Contents
1. Introduction
2. Related Work
3. Methods
4. Experiments
5. Conclusion
❖ Adapting large foundation models pre-trained on massive data
https://arxiv.org/abs/1512.04150
Adapting large models to downstream tasks presents its own challenges.
• The most obvious adaptation strategy is full fine-tuning of the pre-trained model on the task at hand, end-to-end.
• However, this strategy requires one to store and deploy a separate copy of the backbone parameters for every single task.
• This is an expensive and often infeasible proposition, especially for modern Transformer-based architectures, which are significantly larger than their convolutional neural network counterparts, e.g., ViT-Huge (632M parameters) vs. ResNet-50 (25M parameters).
What is the best way to adapt large pre-trained Transformers to downstream tasks in terms of effectiveness and efficiency?
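As a back-of-the-envelope illustration of the storage argument (the backbone size comes from the slide; the per-task prompt-plus-head size is an assumption made only for this sketch):

```python
# Rough storage comparison (fp32, 4 bytes per parameter).
# The ViT-Huge parameter count is from the slide; the per-task extra
# parameter count below is an illustrative assumption, not a paper number.
def storage_gb(num_params: float, copies: int) -> float:
    return num_params * copies * 4 / 1e9

VIT_HUGE = 632e6        # ViT-Huge backbone parameters
PER_TASK_EXTRA = 1e6    # assumed prompts + classification head per task

num_tasks = 20
full_ft = storage_gb(VIT_HUGE, num_tasks)                                   # one backbone copy per task
vpt_like = storage_gb(VIT_HUGE, 1) + storage_gb(PER_TASK_EXTRA, num_tasks)  # one shared frozen backbone

print(f"Full fine-tuning: {full_ft:7.1f} GB for {num_tasks} tasks")
print(f"Prompt tuning:    {vpt_like:7.1f} GB for {num_tasks} tasks")
```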
❖ Adapting to new tasks
(a): A popular approach is to fine-tune only a subset of the parameters, such as the classifier head or the bias terms.
(b): Instead of altering or fine-tuning the pre-trained Transformer itself, the authors modify the input to the Transformer. Drawing inspiration from recent advances in prompting in NLP, they propose a new, simple, and efficient method to adapt Transformer models to downstream vision tasks.
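A minimal PyTorch-style sketch of option (a), freezing the backbone and unfreezing only the classifier head or the bias terms. Parameter names assume a timm-style ViT with a `head` module and are illustrative, not the paper's code; option (b) is sketched in the prompting slides that follow.

```python
import torch.nn as nn

def select_trainable(model: nn.Module, mode: str = "head"):
    """Freeze everything, then unfreeze only the chosen subset of parameters.

    mode="head": train the classification head only (linear probing).
    mode="bias": train every bias term plus the head (bias tuning).
    """
    for name, p in model.named_parameters():
        if mode == "head":
            p.requires_grad = name.startswith("head.")
        elif mode == "bias":
            p.requires_grad = name.endswith(".bias") or name.startswith("head.")
        else:
            raise ValueError(f"unknown mode: {mode}")
    # Return the names of the parameters that will receive gradients.
    return [n for n, p in model.named_parameters() if p.requires_grad]
```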
❖ Post-training in large language models
https://arxiv.org/abs/1512.04150
Given their superior performance and much larger scale compared to ConvNets, how to efficiently adapt Transformers to different vision tasks remains an important open problem. The proposed VPT provides a promising path forward.
1) Transfer learning: side tuning, bias tuning
2) Adapter: extra lightweight modules inserted inside each Transformer layer
3) Prompting: originally refers to prepending a language instruction to the input text so that a pre-trained LM can "understand" the task.
❖ Adapter
https://qdata.github.io/deep2Read//deep2reproduce/2019Fall//T11_Schoch_Stephaniesns2gr_Parameter-Efficient_Transfer.pdf
Extra lightweight modules inside each Transformer layer
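A minimal sketch of such an adapter block, assuming the common bottleneck design (down-projection, nonlinearity, up-projection, residual connection); the dimensions are illustrative:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight bottleneck module inserted after a Transformer sub-layer."""
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as (near) identity for stable training
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone path intact.
        return x + self.up(self.act(self.down(x)))
```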
❖ Prompting
Prompting originally refers to prepending a language instruction to the input text so that a pre-trained LM can "understand" the task.
Prompt templates are divided by whether they can be interpreted literally by humans:
Discrete prompts (a.k.a. hard prompts)
• Search for the optimal combination of tokens from the vocabulary to fill the prompt template.
• Although hard prompts are human-readable and understandable, searching in a discrete space makes it harder to achieve good performance than searching in a continuous space.
Continuous prompts (a.k.a. soft prompts)
• The prompt does not need to be natural language that humans can understand.
• Special tokens (or virtual tokens) are created for the prompt and optimized in continuous space.
https://mobile.twitter.com/joeddav/status/1390731869319217158
❖ Continuous prompting
Special tokens (or virtual tokens) are created for the prompt and optimized in continuous space.
https://arxiv.org/pdf/2103.10385.pdf
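A minimal sketch of the soft-prompt idea for a frozen language model: the only trainable parameters are a few virtual-token embeddings prepended to the embedded input. The dimensions and names are illustrative assumptions, not tied to any specific LM.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """p learnable virtual-token embeddings prepended to the embedded input sequence."""
    def __init__(self, num_virtual_tokens: int = 20, embed_dim: int = 768):
        super().__init__()
        self.embeddings = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.shape[0]
        prompt = self.embeddings.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# During tuning, the LM itself stays frozen; only SoftPrompt.embeddings
# (and, if present, a task head) receive gradient updates.
```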
❖ Visual-Prompt Tuning (VPT)
Introduction Related Work Methods Experiments Conclusions
VPT injects a small number of learnable parameters into Transformer’s input space and keeps the backbone frozen during
the downstream training stage.
For a plain ViT with $N$ layers, an input image is divided into $m$ fixed-sized patches $\{I_j \in \mathbb{R}^{3 \times h \times w} \mid j \in \mathbb{N}, 1 \le j \le m\}$, where $h$ and $w$ are the patch height and width. Each patch is embedded into a $d$-dimensional latent space, and the collection of image patch embeddings $E_i = \{e_i^j \in \mathbb{R}^d \mid j \in \mathbb{N}, 1 \le j \le m\}$ serves as input to the $(i{+}1)$-th Transformer layer $L_{i+1}$. Together with an extra learnable classification token ([CLS]), the whole ViT is formulated as:
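The formulation itself appears as an image on the slide; in the paper's notation it is:

```latex
\begin{aligned}
[\,x_i,\; E_i\,] &= L_i\big([\,x_{i-1},\; E_{i-1}\,]\big), \qquad i = 1, 2, \dots, N \\
y &= \mathrm{Head}(x_N)
\end{aligned}
```

where $x_i \in \mathbb{R}^d$ denotes the [CLS] embedding at the input space of $L_{i+1}$, and $\mathrm{Head}$ maps the final [CLS] embedding to a predicted class distribution.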
❖ Visual-Prompt Tuning (VPT)
Given a pre-trained Transformer model, the authors introduce a set of $p$ continuous embeddings of dimension $d$ (i.e., prompts) in the input space, after the Embed layer.
Only the task-specific prompts are updated during fine-tuning, while the Transformer backbone is kept frozen.
Depending on the number of Transformer layers that receive prompts, the approach has two variants, VPT-Shallow and VPT-Deep.
(In the paper's figure, different colors distinguish learnable from frozen parameters.)
VPT-Shallow
VPT-Deep
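The formulations for the two variants (shown as images on the slide) are, in the paper's notation, where $P$ denotes a collection of $p$ learnable prompts and $Z_i$ the features computed by the $i$-th layer at the prompt positions:

```latex
% VPT-Shallow: prompts are inserted only into the first layer's input
\begin{aligned}
[\,x_1,\; Z_1,\; E_1\,] &= L_1\big([\,x_0,\; P,\; E_0\,]\big) \\
[\,x_i,\; Z_i,\; E_i\,] &= L_i\big([\,x_{i-1},\; Z_{i-1},\; E_{i-1}\,]\big), \qquad i = 2, \dots, N \\
y &= \mathrm{Head}(x_N)
\end{aligned}

% VPT-Deep: a fresh set of prompts P_{i-1} is introduced at every layer's input
\begin{aligned}
[\,x_i,\; \_\;,\; E_i\,] &= L_i\big([\,x_{i-1},\; P_{i-1},\; E_{i-1}\,]\big), \qquad i = 1, 2, \dots, N \\
y &= \mathrm{Head}(x_N)
\end{aligned}
```

Only the prompts (and the task head) are learned; every layer $L_i$ stays frozen.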
Question?
❖ Wide range of downstream recognition tasks
Compare both variants of VPT with other commonly used fine-tuning protocols, all applied to the same pre-trained backbones:
(a) Full: update all backbone parameters
(b) Classification head: Linear, Partial-k, MLP-k (a minimal sketch of Partial-k follows the lists)
(c) Subset of parameters: Sidetune, Bias, Adapter

Datasets for downstream tasks:
(a) FGVC (Fine-Grained Visual Classification): CUB-200-2011, NABirds, Oxford Flowers, Stanford Dogs, Stanford Cars
(b) VTAB-1k (19 diverse tasks): Natural, Specialized, Structured
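As referenced above, a minimal sketch of the Partial-k protocol: freeze everything, then unfreeze the last k Transformer blocks and the head. It assumes a timm-style ViT whose blocks live in `model.blocks`; the attribute names are assumptions for illustration only.

```python
import torch.nn as nn

def setup_partial_k(model: nn.Module, k: int = 1):
    """Partial-k: fine-tune only the last k Transformer blocks plus the task head."""
    for p in model.parameters():
        p.requires_grad = False                 # freeze the entire backbone first
    for block in model.blocks[-k:]:             # assumes blocks are stored in `model.blocks`
        for p in block.parameters():
            p.requires_grad = True
    for p in model.head.parameters():           # the task head is always trained
        p.requires_grad = True
    return [n for n, p in model.named_parameters() if p.requires_grad]
```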
❖ Various dataset comparison
Results of fine-tuning a pre-trained ViT-B/16, averaged across 4 diverse downstream task groups, comparing VPT to the 7 other tuning protocols.
❖ Prompt location, length, depth
❖ Final output
❖ Test of statistical significance
❖ Manifold visualization
❖ Prompt learning in the vision domain
• The authors present Visual Prompt Tuning, a new parameter-efficient approach to leveraging large vision Transformer models for a wide range of downstream tasks.
• VPT introduces task-specific learnable prompts in the input space, keeping the pre-trained backbone fixed.
• The authors show that VPT can surpass other fine-tuning protocols (often including full fine-tuning) while dramatically reducing the per-task storage cost.
Question?
Thank you for your attention.
