HyperPrompt:
Prompt-based Task-Conditioning of Transformers
Yun He, Huaixiu Steven Zheng, Yi Tay et al.
National Yang Ming Chiao Tung University, Hsinchu
Speaker: Po-Chuan Chen
March 7, 2023
Table of contents
1 Abstract
2 Introduction
3 Problem Statement
4 Methods
5 Experiments
6 Related Work
7 Conclusion
Abstract
This paper proposes HyperPrompt, a novel architecture for
prompt-based task-conditioning of self-attention in Transformers.
HyperPrompt allows the network to learn task-specific feature maps,
where the hyper-prompts serve as task-global memories for the queries
to attend to, while enabling flexible information sharing
among tasks.
HyperPrompt is competitive against strong multi-task learning
baselines with as few as 0.14% additional task-conditioning
parameters, achieving strong parameter and computational
efficiency.
Introduction
Prompts are lightly tuned, allowing the model to be trained quickly,
since the main body of the pretrained model is kept frozen.
HyperPrompt is a natural but novel extension of Prompt-Tuning to
multi-task learning (MTL) for language.
Hyper-prompts are injected into the keys and values in the
self-attention module, reminiscent of memory-augmented
Transformers.
This mitigates the cost of having prompts pass through the standard
FFN layers in Transformers, and the prompts serve as additional
task-specific memory tokens for queries to attend to.
Also, by introducing task-aware and layer-aware HyperNetworks,
HyperPrompt gains the necessary flexibility and expressiveness.
Meanwhile, it remains very parameter- and computation-efficient
and friendly to multi-task scaling.
Contributions
1 A HyperPrompt Transformer architecture with learnable
hyper-prompts for multi-task fine-tuning, with great parameter
and computational efficiency.
2 HyperNetworks as prompt generators, injecting hyper-prompts
into the self-attention module as global task memory tokens.
3 HyperPrompt outperforms state-of-the-art parameter-efficient T5
models using Prompt-Tuning or adapters on well-established
benchmarks such as SuperGLUE and GLUE, across all explored
model sizes.
Problem Statement
The goal is to design task-conditioned parameterizations of
Transformer models that achieve greater parameter and computational
efficiency, as well as Pareto efficiency, for multi-task learning.
Some related topics:
Multi-task learning
Transformer-based pre-trained LMs
Multi-task learning
A set of tasks $\{\mathcal{D}_\tau\} = \{x_\tau^{(n)}, y_\tau^{(n)}\}_{n=1}^{N_\tau}$ indicates the corresponding
training set of the $\tau$-th task with $N_\tau$ samples.
To tackle such a multi-task learning problem with a pre-trained
Transformer model $f_\theta(\cdot)$, we minimize the objective function
$$\mathcal{L}(\theta) = \sum_{\tau=1}^{T} \sum_{n=1}^{N_\tau} C\left(f_\theta\left(x_\tau^{(n)}\right), y_\tau^{(n)}\right)$$
where $C(\cdot, \cdot)$ is typically the cross-entropy loss and $f_\theta(x_\tau^{(n)})$ is the
output for training sample $x_\tau^{(n)}$.
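To make the double sum concrete, here is a minimal sketch of this objective in PyTorch; `model` and `task_batches` are hypothetical stand-ins, not the paper's code.

```python
import torch
import torch.nn.functional as F

def multitask_loss(model, task_batches):
    """Compute L(theta): sum of C(f_theta(x), y) over tasks and samples.

    task_batches: list over tasks tau of (inputs, targets) batches,
    mirroring the double sum over tau = 1..T and n = 1..N_tau.
    """
    total = torch.tensor(0.0)
    for inputs, targets in task_batches:   # outer sum over tasks tau
        logits = model(inputs)             # f_theta(x_tau^(n)) for the batch
        total = total + F.cross_entropy(   # C(., .) is cross-entropy
            logits, targets, reduction="sum"  # inner sum over samples n
        )
    return total
```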
Transformer-based pre-trained LMs
For example, T5 and BART are unified text-to-text frameworks where
all tasks share the same encoder-decoder architecture.
$\{\{x_\tau^{(n)}\}_{n=1}^{N_\tau}\}_{\tau=1}^{T}$ are fed into the same encoder;
$\{\{\hat{y}_\tau^{(n)}\}_{n=1}^{N_\tau}\}_{\tau=1}^{T}$ are generated by the same decoder.
For such universal modules, multi-task learning simply corresponds to
mixing task datasets, and there are no task-specific classification or
regression networks per task as in encoder-only models.
To improve performance, the authors introduce a set of task-conditioned
parameters $\{\delta_\tau\}_{\tau=1}^{T}$ into $f_\theta(\cdot)$,
such that the objective function is updated to
$$\mathcal{L}\left(\theta, \{\delta_\tau\}_{\tau=1}^{T}\right) = \sum_{\tau=1}^{T} \sum_{n=1}^{N_\tau} C\left(f_{\theta,\delta_\tau}\left(x_\tau^{(n)}\right), y_\tau^{(n)}\right)$$
During training, both $\theta$ and $\{\delta_\tau\}_{\tau=1}^{T}$ are updated via back-propagation,
because the authors observe a large performance drop on SuperGLUE when
the backbone model $\theta$ is frozen and only the task-conditioned parameters are
tuned.
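As a toy illustration of task-conditioned parameters, the sketch below attaches one small learnable tensor per task to a shared backbone; the per-task bias is a hypothetical stand-in for the hyper-prompt machinery introduced later.

```python
import torch
import torch.nn as nn

class TaskConditionedModel(nn.Module):
    """Backbone parameters theta plus per-task parameters delta_tau."""
    def __init__(self, backbone: nn.Module, num_tasks: int, d: int):
        super().__init__()
        self.backbone = backbone
        # delta_tau: one learnable vector per task (illustrative only).
        self.delta = nn.ParameterList(
            [nn.Parameter(torch.zeros(d)) for _ in range(num_tasks)]
        )

    def forward(self, x: torch.Tensor, task: int) -> torch.Tensor:
        # Both theta (backbone) and delta_tau receive gradients.
        return self.backbone(x) + self.delta[task]
```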
Methods
Prompt-Based Task-Conditioned Transformer
HyperPrompt
HyperPrompt-Global
Parameter Efficiency of HyperPrompt
Prompt-Based Task-Conditioned Transformer
To learn the task-specific information for the $\tau$-th task, we have $l$
trainable $d$-dimensional vectors as the hyper-prompts for the key and
the value, denoted as $P_{\tau,k} \in \mathbb{R}^{l \times h \times d_h}$ and $P_{\tau,v} \in \mathbb{R}^{l \times h \times d_h}$ respectively.
Then, the hyper-prompts are concatenated with the original key and
value:
$$K'_\tau = \mathrm{concat}(P_{\tau,k}, K_\tau) \quad (1)$$
$$V'_\tau = \mathrm{concat}(P_{\tau,v}, V_\tau) \quad (2)$$
where the new key $K'_\tau \in \mathbb{R}^{(l+L) \times h \times d_h}$ and value $V'_\tau \in \mathbb{R}^{(l+L) \times h \times d_h}$ are used to
compute the multi-head self-attention.
The hyper-prompts benefit Transformers for multitask learning in two
ways:
The prompt for the key, $P_{\tau,k}$, directly interacts (via matrix
multiplication) with the original query $Q_\tau$, allowing tokens to
acquire task-specific semantics.
The prompt for the value, $P_{\tau,v}$, serves as task-specific memory
for multihead attention to retrieve information from.
After that, the multi-head self-attention can be computed:
$$O_\tau = \mathrm{Attention}\left(Q_\tau, K'_\tau, V'_\tau\right) = \mathrm{softmax}\left(Q_\tau K'^{\top}_\tau\right) V'_\tau$$
where $O_\tau \in \mathbb{R}^{L \times d}$ is the output of multihead attention.
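A single-head toy example of equations (1)-(2) and the task-conditioned attention, with the multi-head reshaping omitted; the $1/\sqrt{d}$ scaling is assumed from standard attention, and all shapes and names are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

L, l, d = 8, 4, 16                          # sequence length, prompt length, model dim
Q = torch.randn(L, d)                       # queries from the input tokens
K, V = torch.randn(L, d), torch.randn(L, d)
P_k, P_v = torch.randn(l, d), torch.randn(l, d)  # hyper-prompts of task tau

K_prime = torch.cat([P_k, K], dim=0)        # eq. (1): shape (l+L, d)
V_prime = torch.cat([P_v, V], dim=0)        # eq. (2): shape (l+L, d)

scores = Q @ K_prime.T / d ** 0.5           # queries attend to prompts + tokens
O = F.softmax(scores, dim=-1) @ V_prime     # O_tau: shape (L, d)
print(O.shape)                              # torch.Size([8, 16])
```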
How to obtain the prompts for the m-th Transformer block?
Global Prompts: initialize a set of global prompts $\{P_\tau\}_{\tau=1}^{T}$,
where $P_\tau \in \mathbb{R}^{l \times d}$ is a trainable matrix that learns the task-specific
information of the $\tau$-th task, $d$ is the model dimension, and $l$ is the
length of the prompt.
Local HyperNetworks: apply two local HyperNetworks $h^m_k$
and $h^m_v$ to transform the global prompt $P_\tau$ into layer-specific and
task-specific prompts:
$$P^m_{\tau,k} = h^m_k(P_\tau) = U^m_k\, \mathrm{ReLU}\left(D^m_k(P_\tau)\right) \quad (3)$$
$$P^m_{\tau,v} = h^m_v(P_\tau) = U^m_v\, \mathrm{ReLU}\left(D^m_v(P_\tau)\right) \quad (4)$$
where $D^m_{k/v}$ and $U^m_{k/v}$ are down-projection and up-projection matrices,
as sketched below.
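A minimal sketch of a local HyperNetwork $h^m_{k/v}$ as a down-projection / ReLU / up-projection bottleneck; the module and variable names, and the bottleneck width `b`, are assumptions.

```python
import torch
import torch.nn as nn

class LocalHyperNetwork(nn.Module):
    """h^m: maps the global prompt P_tau (l x d) to a layer- and
    task-specific prompt via U^m ReLU(D^m(P_tau)), as in eqs. (3)-(4)."""
    def __init__(self, d: int, b: int):
        super().__init__()
        self.down = nn.Linear(d, b, bias=False)  # D^m: d -> b
        self.up = nn.Linear(b, d, bias=False)    # U^m: b -> d

    def forward(self, P: torch.Tensor) -> torch.Tensor:
        return self.up(torch.relu(self.down(P)))

h_k = LocalHyperNetwork(d=512, b=64)  # one such network per block for keys
P_tau = torch.randn(6, 512)           # global prompt with l = 6
P_m_tau_k = h_k(P_tau)                # hyper-prompt for the keys, shape (6, 512)
```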
HyperPrompt-Share: all tasks share the same two local HyperNetworks,
defined by the down-projection matrices $D^m_k$ and $D^m_v$ and the
up-projection matrices $U^m_k$ and $U^m_v$.
HyperPrompt-Sep: each task has its own local HyperNetworks
$h^m_{\tau,k}(P_\tau)$ and $h^m_{\tau,v}(P_\tau)$:
$$P^m_{\tau,k} = h^m_{\tau,k}(P_\tau) = U^m_{\tau,k}\, \mathrm{ReLU}\left(D^m_{\tau,k}(P_\tau)\right) \quad (5)$$
$$P^m_{\tau,v} = h^m_{\tau,v}(P_\tau) = U^m_{\tau,v}\, \mathrm{ReLU}\left(D^m_{\tau,v}(P_\tau)\right) \quad (6)$$
HyperPrompt-Global
It can flexibly share information and knowledge among tasks and
blocks while maintaining a low parameter cost.
Layer-Aware Task Embedding:
$$I^m_\tau = h_t(k_\tau, z^m)$$
where $h_t$ is an MLP consisting of two feed-forward layers and a ReLU
non-linearity, $k_\tau \in \mathbb{R}^{t'}$ denotes the task embedding, and
$z^m \in \mathbb{R}^{t'}$ is the layer embedding.
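A possible reading of the layer-aware task embedding in code: feed the task embedding $k_\tau$ and layer embedding $z^m$ through a two-layer MLP; concatenating the two embeddings (rather than, e.g., adding them) is an assumption.

```python
import torch
import torch.nn as nn

t_prime, t = 32, 64                     # raw / final embedding dims (illustrative)

h_t = nn.Sequential(                    # two feed-forward layers with a ReLU
    nn.Linear(2 * t_prime, t),
    nn.ReLU(),
    nn.Linear(t, t),
)

k_tau = torch.randn(t_prime)            # task embedding of task tau
z_m = torch.randn(t_prime)              # layer embedding of block m
I_m_tau = h_t(torch.cat([k_tau, z_m]))  # layer-aware task embedding I^m_tau
```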
Global HyperNetworks
$H_k(\cdot)$ generates the weight matrices for the key hyper-prompts,
and $H_v(\cdot)$ generates those for the value hyper-prompts:
$$\left(U^m_{\tau,k}, D^m_{\tau,k}\right) = H_k\left(I^m_\tau\right) = \left(W^{U_k}, W^{D_k}\right) I^m_\tau \quad (7)$$
$$\left(U^m_{\tau,v}, D^m_{\tau,v}\right) = H_v\left(I^m_\tau\right) = \left(W^{U_v}, W^{D_v}\right) I^m_\tau \quad (8)$$
After $U^m_{\tau,k/v}$ and $D^m_{\tau,k/v}$ are generated, the method projects the
global prompts $P_\tau$ into hyper-prompts $P^m_{\tau,k/v}$ following (5)
and (6), and then computes the task-conditioned attention scores
as in Figure 2(a).
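A sketch of equations (7)-(8): the global HyperNetwork is a linear map from $I^m_\tau$ to the flattened projection matrices, which are then used as in (5)-(6); the reshaping details and all names are assumptions.

```python
import torch
import torch.nn as nn

d, b, t = 512, 64, 64                       # model dim, bottleneck, embedding dim

class GlobalHyperNetwork(nn.Module):
    """H(.): generates (U^m_tau, D^m_tau) from the layer-aware task
    embedding, so weights are shared across all tasks and blocks."""
    def __init__(self):
        super().__init__()
        self.W_U = nn.Linear(t, b * d, bias=False)  # W^U, flattened output
        self.W_D = nn.Linear(t, d * b, bias=False)  # W^D, flattened output

    def forward(self, I: torch.Tensor):
        U = self.W_U(I).view(b, d)          # up-projection U^m_tau
        D = self.W_D(I).view(d, b)          # down-projection D^m_tau
        return U, D

H_k = GlobalHyperNetwork()
I_m_tau = torch.randn(t)                    # from the layer-aware task embedding
U_k, D_k = H_k(I_m_tau)
P_tau = torch.randn(6, d)                   # global prompt, l = 6
P_m_tau_k = torch.relu(P_tau @ D_k) @ U_k   # eq. (5)-style projection, (6, d)
```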
Parameter Efficiency of HyperPrompt
The total number of additional parameters from HyperPrompt-Global is
$$dlT + 4(bdt) + Tt' + Mt' + (2t' + t)e$$
where $T$ is the total number of tasks, $M$ is the number of Transformer
blocks, $b$ is the bottleneck dimension of the weight matrices of the
local HyperNetworks, $t'$/$t$ is the dimension of the raw/final
layer-aware task embedding, and $e$ is the hidden dimension of $h_{k/v}$.
Therefore, the space complexity is $O(d(lT + 4bt))$, or further
simplified, $O(bdt)$.
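Plugging in illustrative numbers shows why the overhead is small; the dimension values below are hypothetical, not the paper's actual settings.

```python
# Hypothetical settings: T tasks, prompt length l, model dim d, bottleneck b,
# raw/final task-embedding dims t_prime/t, hidden dim e, M Transformer blocks.
T, l, d, b, t_prime, t, e, M = 8, 6, 512, 64, 32, 64, 64, 12

extra = d * l * T + 4 * (b * d * t) + T * t_prime + M * t_prime + (2 * t_prime + t) * e
print(f"additional parameters: {extra:,}")  # 8,422,016 -- dominated by 4*b*d*t
# For scale: a T5-Base backbone has roughly 220M parameters.
```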
Experiments
Experimental Setup
Key Results
Tuning all vs Task-Conditioned Parameters
Computational Efficiency
Ablation Study
Peeking into Hyper-Prompts
Impact of Hyper-Prompt Length
Encoder vs Decoder
Experimental Setup
Datasets: GLUE and SuperGLUE
Model: T5 (text-to-text Transformer)
Evaluation:
Testing the ability of transfer learning among the co-trained tasks.
Baselines:
1 MTL-Prompt and HyperPrompt-Share/Sep/Global with vanilla
T5 models
2 Vanilla Adapter
3 HyperFormer++
4 Prompt-Tuning
Key Results
[Results tables comparing HyperPrompt with the baselines on GLUE and SuperGLUE; not recoverable from this extraction.]
Tuning all vs Task-Conditioned Parameters
The experiments show that tuning only the task-conditioned
parameters is not enough to achieve results competitive with full model
training for multi-task learning; hence, the rest of the experiments are
conducted by tuning all model parameters.
Computational Efficiency
The reported results show the computational efficiency of HyperPrompt
for both training and inference.
Ablation Study
HyperPrompt-Global can adjust the degree of information sharing for
better multi-task generalization.
Peeking into Hyper-Prompts
In the top graph, the network places less attention mass on
hyper-prompts in the lower layers and gradually increases the attention
mass toward higher layers.
The bottom graph shows that injecting hyper-prompts
encourages a more diverse attention distribution, which appears to be
beneficial to model generalization.
Impact of Hyper-Prompt Length
As shown in the figure, $l = 6$ is best for the decoder, and $l = 16$ for the
encoder with $l = 6$ fixed for the decoder.
Encoder vs Decoder
Based on this experiment, the authors add task-conditioned parameters
to the decoder for their SuperGLUE experiments.
Related Work
1 Prompt-Tuning
1 Discrete prompting words
2 Soft prompts
2 Adapter-Tuning
3 Multi-task Natural Language Understanding
Conclusion
This paper proposes a novel architecture for prompt-based
task-conditioning of self-attention in Transformers.
The hyper-prompts are generated by a HyperNetwork to enable
flexible information sharing among tasks while remaining efficient in
parameters and computation.