Visual Prompt Tuning

taeseonryu · Jun 14, 2023

About This Presentation

Visual Prompt Tuning (VPT), parameter-efficient fine-tuning

Papers presented so far: https://github.com/Lilcob/-DL_PaperReadingMeeting
Presentation slides: https://www.slideshare.net/taeseonryu/mplug

Hello, this is the Deep Learning Paper Reading Group! The paper we are introducing today is 'Visual...


Slide Content

Members: 조경진, 김병현, 김현진, 이희재, 안종식, 강인하
Team: Image Processing Team
2023.04.09
Visual Prompt Tuning
Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim
https://arxiv.org/pdf/2203.12119

Contents
1. Introduction
2. Related Work
3. Methods
4. Experiments
5. Conclusion

❖ Adapting large foundation models pre-trained on massive data
https://arxiv.org/abs/1512.04150
Adapting large models to downstream tasks presents its own challenges.
• The most obvious adaptation strategy is full fine-tuning of the pre-trained model on the task at hand, end-to-end.
• However, this strategy requires one to store and deploy a separate copy of the backbone parameters for every single task.
• This is an expensive and often infeasible proposition, especially for modern Transformer-based architectures, which are significantly larger than their convolutional neural network counterparts, e.g., ViT-Huge (632M parameters) vs. ResNet-50 (25M parameters).
What is the best way to adapt large pre-trained Transformers to downstream tasks in terms of effectiveness and efficiency?

❖ Adapting to new tasks
(a): A popular approach is to fine-tune only a subset of the parameters, such as the classifier head or the bias terms.
(b): Instead of altering or fine-tuning the pre-trained Transformer itself, the authors modify the input to the Transformer. Drawing inspiration from recent advances in prompting in NLP, they propose a new, simple, and efficient method to adapt Transformer models for downstream vision tasks.

❖ Post-training in large language models
https://arxiv.org/abs/1512.04150
Transformer
Given their superior performance and much larger scale compared to ConvNets, how to efficiently adapt Transformers to different vision tasks remains an important open problem. Our proposed VPT provides a promising path forward.
1) Transfer learning: side tuning, bias tuning
2) Adapter: extra lightweight modules inside each Transformer layer
3) Prompting: originally refers to prepending a language instruction to the input text so that a pre-trained LM can “understand” the task.

❖ Adapter
https://qdata.github.io/deep2Read//deep2reproduce/2019Fall//T11_Schoch_Stephaniesns2gr_Parameter-Efficient_Transfer.pdf
Extra lightweight modules inside each Transformer layer
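For intuition, a bottleneck adapter can be sketched as a small down-project/up-project MLP with a residual connection, inserted alongside a frozen Transformer sub-layer. This is a minimal PyTorch sketch, not the exact module from the adapter papers; the hidden sizes and placement are assumptions.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Minimal bottleneck-adapter sketch: down-project, non-linearity,
    up-project, plus a residual connection. Sizes are illustrative."""
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_model). Only these few parameters are trained;
        # the surrounding Transformer layer stays frozen.
        return x + self.up(self.act(self.down(x)))
```

During fine-tuning, only the adapters (and the task head) receive gradients, which is what makes the approach parameter-efficient.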

❖ Prompting
Originally refers to prepending language instruction to the input text so that a pre-trained LM can “understand” the task.
Prompt templates (categorized depending on whether they can be interpreted literally by humans):
Discrete prompts (a.k.a. hard prompts)
• Search for the optimal combination of tokens from the vocabulary for the prompt template.
• Although such prompts are human-readable and understandable, it is difficult to achieve good performance when searching in a discrete space compared to searching in a continuous space.
Continuous prompts (a.k.a. soft prompts)
• The prompt does not need to be natural language that humans can understand.
• Special tokens (or virtual tokens) are created for the prompt and optimized in continuous space.
https://mobile.twitter.com/joeddav/status/1390731869319217158

❖ Continuous prompting
Special tokens (or virtual tokens) are created for the prompt and optimized in continuous space.
https://arxiv.org/pdf/2103.10385.pdf
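To make this concrete, here is a minimal PyTorch sketch of a soft prompt: a matrix of learnable virtual-token embeddings prepended to the (frozen) model's input token embeddings. The shapes and names are illustrative assumptions, not any specific library's API.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable virtual-token embeddings prepended to input embeddings."""
    def __init__(self, num_virtual_tokens: int = 20, d_model: int = 768):
        super().__init__()
        # Optimized directly in continuous space; never mapped back to real vocabulary tokens.
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, d_model) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, d_model) from a frozen language model's embedding layer
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)  # (batch, p + seq_len, d_model)
```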

❖ Visual-Prompt Tuning (VPT)
VPT injects a small number of learnable parameters into Transformer's input space and keeps the backbone frozen during the downstream training stage.
For a plain ViT with N layers, an input image is divided into m fixed-sized patches I_j ∈ ℝ^(3×h×w), j ∈ ℕ, 1 ≤ j ≤ m. Each patch is embedded into a d-dimensional latent space, and E_i = {e_i^j ∈ ℝ^d | j ∈ ℕ, 1 ≤ j ≤ m} denotes the collection of image patch embeddings that form the input to the (i+1)-th Transformer layer L_(i+1).
Together with an extra learnable classification token ([CLS]), the whole ViT is formulated as:
[x_i, E_i] = L_i([x_(i-1), E_(i-1)]),  i = 1, 2, ..., N
y = Head(x_N),
where x_i denotes [CLS]'s embedding at layer L_(i+1)'s input space.

❖ Visual-Prompt Tuning (VPT)
Given a pre-trained Transformer model, the authors introduce a set of p continuous embeddings of dimension d (i.e., prompts) in the input space after the Embed layer.
Only the task-specific prompts are updated during fine-tuning, while the Transformer backbone is kept frozen.
Depending on the number of Transformer layers involved, the approach has two variants, VPT-Shallow and VPT-Deep.
(In the accompanying figure, the two colors indicate learnable and frozen parameters, respectively.)
VPT-Shallow
VPT-Deep
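A minimal PyTorch sketch of VPT-Shallow, assuming a hypothetical `vit_backbone` that exposes an `embed()` step and a list of Transformer `layers` (this is an illustration, not the authors' released code): prompts are prepended once after the Embed layer, the backbone is frozen, and only the prompts and the head are trained.

```python
import torch
import torch.nn as nn

class VPTShallowSketch(nn.Module):
    """VPT-Shallow sketch: prepend p learnable prompts to the patch embeddings
    of a frozen ViT, then train only the prompts and the classification head."""
    def __init__(self, vit_backbone: nn.Module, num_prompts: int = 50,
                 d_model: int = 768, num_classes: int = 100):
        super().__init__()
        self.backbone = vit_backbone            # assumed to expose .embed() and .layers
        for p in self.backbone.parameters():    # keep the backbone frozen
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.zeros(num_prompts, d_model))
        nn.init.uniform_(self.prompts, -0.1, 0.1)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.backbone.embed(images)         # (batch, 1 + m, d), [CLS] first (assumed)
        b = x.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        # Sequence becomes [CLS], prompts, patch embeddings
        x = torch.cat([x[:, :1], prompts, x[:, 1:]], dim=1)
        for layer in self.backbone.layers:      # frozen Transformer layers
            x = layer(x)
        return self.head(x[:, 0])               # classify from [CLS]
```

VPT-Deep follows the same recipe but introduces a fresh set of learnable prompts at every layer's input instead of only at the first layer.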

Question?

❖ Wide range of downstream recognition tasks
Compare both variants of VPT with other commonly used fine-tuning protocols on pre-trained backbones:
(a) Full: update all backbone parameters
(b) Classification head: linear, partial-k, MLP-k
(c) Subset of parameters: Sidetune, bias, adapter (a minimal freezing sketch follows below)
Datasets for downstream tasks:
(a) FGVC (Fine-Grained Visual Classification): CUB-200-2011, NABirds, Oxford Flowers, Stanford Dogs, Stanford Cars
(b) VTAB-1k (19 diverse tasks): Natural, Specialized, Structured
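For reference, here is a minimal sketch of how the "linear" and "bias" protocols restrict training compared to full fine-tuning, assuming a hypothetical model layout where the classifier lives in a submodule named `head`:

```python
import torch.nn as nn

def set_trainable(model: nn.Module, protocol: str = "linear") -> None:
    """Toggle requires_grad to mimic common fine-tuning protocols."""
    for name, param in model.named_parameters():
        if protocol == "full":
            param.requires_grad = True                       # update everything
        elif protocol == "linear":
            param.requires_grad = name.startswith("head")    # classifier head only
        elif protocol == "bias":
            # BitFit-style: bias terms everywhere, plus the head
            param.requires_grad = name.endswith(".bias") or name.startswith("head")
        else:
            raise ValueError(f"unknown protocol: {protocol}")
```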

❖ Various dataset comparison
The results of fine-tuning a pre-trained ViT-B/16, averaged across 4 diverse downstream task groups, comparing VPT to the 7 other tuning protocols.

❖ Prompt location, length, depth

❖ Final output

❖ Test of statistical significance

❖ Manifold visualization

❖ Prompt learning in the vision domain
• The authors present Visual Prompt Tuning, a new parameter-efficient approach for leveraging large vision Transformer models for a wide range of downstream tasks.
• VPT introduces task-specific learnable prompts in the input space, keeping the pre-trained backbone fixed.
• The authors show that VPT can surpass other fine-tuning protocols (often including full fine-tuning) while dramatically reducing the storage cost.

Question?

Thank you for your attention.