Members: 조경진, 김병현, 김현진, 이희재, 안종식, 강인하
Team: Image Processing Team (이미지 처리팀)
2023.04.09
Visual Prompt Tuning
Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim
https://arxiv.org/pdf/2203.12119
Contents
1. Introduction
2. Related Work
3. Methods
4. Experiments
5. Conclusion
❖ Adapting large foundation models pre-trained on massive data
https://arxiv.org/abs/1512.04150
Adapting large models to downstream tasks presents its own challenges.
• The most obvious adaptation strategy is full fine-tuning of the pre-trained model on the task at hand, end-to-end.
• However, this strategy requires one to store and deploy a separate copy of the backbone parameters for every single task.
• This is an expensive and often infeasible proposition, especially for modern Transformer-based architectures, which are significantly larger than their convolutional neural network counterparts, e.g., ViT-Huge (632M parameters) vs. ResNet-50 (25M parameters).
What is the best way to adapt large pre-trained Transformers to downstream tasks in terms of effectiveness and efficiency?
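As a back-of-the-envelope illustration of the storage argument (the backbone size comes from the slide; the per-task prompt-plus-head size is an assumption made only for this sketch):

```python
# Rough storage comparison (fp32, 4 bytes per parameter).
# The ViT-Huge parameter count is from the slide; the per-task extra
# parameter count below is an illustrative assumption, not a paper number.
def storage_gb(num_params: float, copies: int) -> float:
    return num_params * copies * 4 / 1e9

VIT_HUGE = 632e6        # ViT-Huge backbone parameters
PER_TASK_EXTRA = 1e6    # assumed prompts + classification head per task

num_tasks = 20
full_ft = storage_gb(VIT_HUGE, num_tasks)                                   # one backbone copy per task
vpt_like = storage_gb(VIT_HUGE, 1) + storage_gb(PER_TASK_EXTRA, num_tasks)  # one shared frozen backbone

print(f"Full fine-tuning: {full_ft:7.1f} GB for {num_tasks} tasks")
print(f"Prompt tuning:    {vpt_like:7.1f} GB for {num_tasks} tasks")
```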
❖ Adapting to new tasks
(a): A popular approach is to fine-tune only a subset of the parameters, such as the classifier head or the bias terms.
(b): Instead of altering or fine-tuning the pre-trained Transformer itself, the authors modify the input to the Transformer. Drawing inspiration from recent advances in prompting in NLP, they propose a new, simple, and efficient method to adapt Transformer models to downstream vision tasks.
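A minimal PyTorch-style sketch of option (a), freezing the backbone and unfreezing only the classifier head or the bias terms. Parameter names assume a timm-style ViT with a `head` module and are illustrative, not the paper's code; option (b) is sketched in the prompting slides that follow.

```python
import torch.nn as nn

def select_trainable(model: nn.Module, mode: str = "head"):
    """Freeze everything, then unfreeze only the chosen subset of parameters.

    mode="head": train the classification head only (linear probing).
    mode="bias": train every bias term plus the head (bias tuning).
    """
    for name, p in model.named_parameters():
        if mode == "head":
            p.requires_grad = name.startswith("head.")
        elif mode == "bias":
            p.requires_grad = name.endswith(".bias") or name.startswith("head.")
        else:
            raise ValueError(f"unknown mode: {mode}")
    # Return the names of the parameters that will receive gradients.
    return [n for n, p in model.named_parameters() if p.requires_grad]
```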
❖ Post-training in large language models
https://arxiv.org/abs/1512.04150
Given their superior performance and much larger scale compared to ConvNets, how to efficiently adapt Transformers to different vision tasks remains an important open problem. The proposed VPT provides a promising path forward.
1) Transfer learning: side tuning, bias tuning
2) Adapter: extra lightweight modules inserted inside each Transformer layer
3) Prompting: originally refers to prepending a language instruction to the input text so that a pre-trained LM can "understand" the task.
❖ Adapter
https://qdata.github.io/deep2Read//deep2reproduce/2019Fall//T11_Schoch_Stephaniesns2gr_Parameter-Efficient_Transfer.pdf
Extra lightweight modules inside each Transformer layer
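A minimal sketch of such an adapter block, assuming the common bottleneck design (down-projection, nonlinearity, up-projection, residual connection); the dimensions are illustrative:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight bottleneck module inserted after a Transformer sub-layer."""
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as (near) identity for stable training
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone path intact.
        return x + self.up(self.act(self.down(x)))
```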
❖ Prompting
Prompting originally refers to prepending a language instruction to the input text so that a pre-trained LM can "understand" the task.
Prompt templates are divided by whether they can be interpreted literally by humans:
Discrete prompts (a.k.a. hard prompts)
• Search for the optimal combination of tokens from the vocabulary to fill the prompt template.
• Although hard prompts are human-readable and understandable, searching in a discrete space makes it harder to achieve good performance than searching in a continuous space.
Continuous prompts (a.k.a. soft prompts)
• The prompt does not need to be natural language that humans can understand.
• Special tokens (or virtual tokens) are created for the prompt and optimized in continuous space.
https://mobile.twitter.com/joeddav/status/1390731869319217158
❖ Continuous prompting
Special tokens (or virtual tokens) are created for the prompt and optimized in continuous space.
https://arxiv.org/pdf/2103.10385.pdf
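A minimal sketch of the soft-prompt idea for a frozen language model: the only trainable parameters are a few virtual-token embeddings prepended to the embedded input. The dimensions and names are illustrative assumptions, not tied to any specific LM.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """p learnable virtual-token embeddings prepended to the embedded input sequence."""
    def __init__(self, num_virtual_tokens: int = 20, embed_dim: int = 768):
        super().__init__()
        self.embeddings = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.shape[0]
        prompt = self.embeddings.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# During tuning, the LM itself stays frozen; only SoftPrompt.embeddings
# (and, if present, a task head) receive gradient updates.
```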
❖ Visual-Prompt Tuning (VPT)
Introduction Related Work Methods Experiments Conclusions
VPT injects a small number of learnable parameters into Transformer’s input space and keeps the backbone frozen during
the downstream training stage.
For a plain ViT with $N$ layers, an input image is divided into $m$ fixed-sized patches $\{I_j \in \mathbb{R}^{3 \times h \times w} \mid j \in \mathbb{N}, 1 \le j \le m\}$, where $h$ and $w$ are the patch height and width. Each patch is embedded into a $d$-dimensional latent space, and the collection of image patch embeddings $E_i = \{e_i^j \in \mathbb{R}^d \mid j \in \mathbb{N}, 1 \le j \le m\}$ serves as input to the $(i{+}1)$-th Transformer layer $L_{i+1}$. Together with an extra learnable classification token ([CLS]), the whole ViT is formulated as:
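The formulation itself appears as an image on the slide; in the paper's notation it is:

```latex
\begin{aligned}
[\,x_i,\; E_i\,] &= L_i\big([\,x_{i-1},\; E_{i-1}\,]\big), \qquad i = 1, 2, \dots, N \\
y &= \mathrm{Head}(x_N)
\end{aligned}
```

where $x_i \in \mathbb{R}^d$ denotes the [CLS] embedding at the input space of $L_{i+1}$, and $\mathrm{Head}$ maps the final [CLS] embedding to a predicted class distribution.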
❖ Visual-Prompt Tuning (VPT)
Given a pre-trained Transformer model, the authors introduce a set of $p$ continuous embeddings of dimension $d$ (i.e., prompts) in the input space, after the Embed layer.
Only the task-specific prompts are updated during fine-tuning, while the Transformer backbone is kept frozen.
Depending on the number of Transformer layers that receive prompts, the approach has two variants, VPT-Shallow and VPT-Deep.
(In the paper's figure, different colors distinguish learnable from frozen parameters.)
VPT-Shallow
VPT-Deep
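The formulations for the two variants (shown as images on the slide) are, in the paper's notation, where $P$ denotes a collection of $p$ learnable prompts and $Z_i$ the features computed by the $i$-th layer at the prompt positions:

```latex
% VPT-Shallow: prompts are inserted only into the first layer's input
\begin{aligned}
[\,x_1,\; Z_1,\; E_1\,] &= L_1\big([\,x_0,\; P,\; E_0\,]\big) \\
[\,x_i,\; Z_i,\; E_i\,] &= L_i\big([\,x_{i-1},\; Z_{i-1},\; E_{i-1}\,]\big), \qquad i = 2, \dots, N \\
y &= \mathrm{Head}(x_N)
\end{aligned}

% VPT-Deep: a fresh set of prompts P_{i-1} is introduced at every layer's input
\begin{aligned}
[\,x_i,\; \_\;,\; E_i\,] &= L_i\big([\,x_{i-1},\; P_{i-1},\; E_{i-1}\,]\big), \qquad i = 1, 2, \dots, N \\
y &= \mathrm{Head}(x_N)
\end{aligned}
```

Only the prompts (and the task head) are learned; every layer $L_i$ stays frozen.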
Question?
❖ Wide range of downstream recognition tasks
Compare both variants of VPT with other commonly used fine-tuning protocols, all applied to the same pre-trained backbones:
(a) Full: update all backbone parameters
(b) Classification head: Linear, Partial-k, MLP-k (a minimal sketch of Partial-k follows the lists)
(c) Subset of parameters: Sidetune, Bias, Adapter

Datasets for downstream tasks:
(a) FGVC (Fine-Grained Visual Classification): CUB-200-2011, NABirds, Oxford Flowers, Stanford Dogs, Stanford Cars
(b) VTAB-1k (19 diverse tasks): Natural, Specialized, Structured
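As referenced above, a minimal sketch of the Partial-k protocol: freeze everything, then unfreeze the last k Transformer blocks and the head. It assumes a timm-style ViT whose blocks live in `model.blocks`; the attribute names are assumptions for illustration only.

```python
import torch.nn as nn

def setup_partial_k(model: nn.Module, k: int = 1):
    """Partial-k: fine-tune only the last k Transformer blocks plus the task head."""
    for p in model.parameters():
        p.requires_grad = False                 # freeze the entire backbone first
    for block in model.blocks[-k:]:             # assumes blocks are stored in `model.blocks`
        for p in block.parameters():
            p.requires_grad = True
    for p in model.head.parameters():           # the task head is always trained
        p.requires_grad = True
    return [n for n, p in model.named_parameters() if p.requires_grad]
```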
❖ Various dataset comparison
Results of fine-tuning a pre-trained ViT-B/16, averaged across 4 diverse downstream task groups, comparing VPT to the 7 other tuning protocols.
❖ Prompt location, length, depth
❖ Final output
❖ Test of statistical significance
❖ Manifold visualization
❖ Prompt learning in the vision domain
• The authors present Visual Prompt Tuning, a new parameter-efficient approach to leveraging large vision Transformer models for a wide range of downstream tasks.
• VPT introduces task-specific learnable prompts in the input space, keeping the pre-trained backbone fixed.
• The authors show that VPT can surpass other fine-tuning protocols (often including full fine-tuning) while dramatically reducing the per-task storage cost.
Question?
Thank you for your attention.
