LLaMA-Adapter
LLaMA-Adapter: Efficient Fine-tuning of
Language Models with Zero-init Attention
Renrui Zhang, Jiaming Han, Chris Liu et al.
Speaker: Po-Chuan Chen
Jul 25, 2023
1 / 32
LLaMA-Adapter
Table of contents
1 Abstract
2 Introduction
3 Related Work
4 LLaMA-Adapter
5 Experiment
6 Conclusion
7 Reflection
2 / 32
LLaMA-Adapter
Abstract
Abstract
This paper proposes LLaMA-Adapter¹, a lightweight adaption method to
efficiently fine-tune LLaMA into an instruction-following model.
Specifically, they adopt a set of learnable adaption prompts and
prepend them to the word tokens at the higher transformer layers.
Then, a zero-initialized attention mechanism with zero gating is
proposed, which adaptively injects the new instructional cues into
LLaMA while effectively preserving its pre-trained knowledge.
¹ https://github.com/OpenGVLab/LLaMA-Adapter
3 / 32
LLaMA-Adapter
Abstract
Abstract
LLaMA-Adapter can generate high-quality responses, comparable to
Alpaca [6] with its fully fine-tuned 7B parameters.
It can also be simply extended to multi-modal instructions for
learning an image-conditioned LLaMA model, which achieves superior
reasoning performance on the ScienceQA and COCO Caption benchmarks.
4 / 32
LLaMA-Adapter
Introduction
Table of contents
1 Abstract
2 Introduction
3 Related Work
4 LLaMA-Adapter
5 Experiment
6 Conclusion
7 Reflection
5 / 32
LLaMA-Adapter
Introduction
Introduction
Large-scale Language Models (LLMs) have attracted widespread
attention in both academia and industry. However, many of them are
impeded by closed-source restrictions and high development costs.
To alleviate this, Stanford Alpaca proposes to fine-tune an LLM, i.e.,
LLaMA [8], into an instruction-following model, which is affordable
and replicable.
Alpaca fine-tunes the entire 7B parameters of LLaMA, producing an
exceptional instruction model that performs similarly to GPT-3.5.
However, fine-tuning LLaMA this way is still time-consuming and
computation-intensive.
6 / 32
LLaMA-Adapter
Introduction
Contribution
Figure 1: Characteristics of LLaMA-Adapter.
7 / 32
LLaMA-Adapter
Related Work
Related Work
Instruction-Following Language Models.
1 FLAN
2 InstructGPT
3 GPT-3.5 / GPT-4
4 Stanford Alpaca
5 Alpaca-LoRA [7]
Parameter-Efficient Fine-Tuning (PEFT).
1 Adapters
2 Low-Rank Adaptation (LoRA)
3 Prompt tuning
8 / 32
LLaMA-Adapter
LLaMA-Adapter
Table of contents
1 Abstract
2 Introduction
3 Related Work
4 LLaMA-Adapter
Learnable Adaption Prompts
Zero-initialized Attention
Multi-modal Reasoning
Zero-initialized Attention for other Large Models
5 Experiment
9 / 32
LLaMA-Adapter
LLaMA-Adapter
Learnable Adaption Prompts
Learnable Adaption Prompts
For the learnable adaption prompts used in instruction-following
fine-tuning, they use 52K instruction-output pairs [9] and a
pre-trained LLaMA with an $N$-layer transformer.
The prompts for the topmost $L$ transformer layers are defined as
$\{P_l\}_{l=1}^{L}$, where $P_l \in \mathbb{R}^{K \times C}$, with $K$ denoting the prompt
length for each layer and $C$ the feature dimension of LLaMA's
transformer.
Since they want to tune the language representations with higher-level
semantics, $L \leq N$.
10 / 32
LLaMA-Adapter
LLaMA-Adapter
Learnable Adaption Prompts
Learnable Adaption Prompts (cont.)
The learnable adaption prompt is concatenated with the word tokens
$T_l \in \mathbb{R}^{M \times C}$ along the token dimension as a prefix, formulated as

$[P_l; T_l] \in \mathbb{R}^{(K+M) \times C}$  (1)

where $M$ is the length of the word tokens.
In this way, the instruction knowledge learned within $P_l$ can
effectively guide $T_l$ to generate the subsequent contextual response
via the attention layers in the transformer block.
11 / 32
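Below is a minimal PyTorch sketch of Equation (1): a learnable $K \times C$
prompt prepended to the word tokens of one of the topmost layers. The module
name `AdaptionPrompt` and the toy shapes are illustrative assumptions, not the
authors' implementation.

```python
import torch
import torch.nn as nn

class AdaptionPrompt(nn.Module):
    """Learnable K x C prompt prepended to one layer's word tokens."""
    def __init__(self, prompt_len: int, dim: int):
        super().__init__()
        # P_l in R^{K x C}, learned during fine-tuning
        self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, M, C) word tokens T_l of this layer
        prompt = self.prompt.unsqueeze(0).expand(tokens.size(0), -1, -1)
        # Equation (1): [P_l; T_l] in R^{(K+M) x C}
        return torch.cat([prompt, tokens], dim=1)

# Toy usage: K = 10 prompt tokens, M = 32 word tokens, C = 4096.
adapter = AdaptionPrompt(prompt_len=10, dim=4096)
print(adapter(torch.randn(2, 32, 4096)).shape)  # torch.Size([2, 42, 4096])
```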
LLaMA-Adapter
LLaMA-Adapter
Zero-initialized Attention
Zero-initialized Attention
They found that if the adaption prompts are randomly initialized, they
may harm the stability and effectiveness of fine-tuning.
Therefore, they modify the vanilla attention mechanisms at the last $L$
transformer layers to be zero-initialized attention.
12 / 32
LLaMA-Adapter
LLaMA-Adapter
Zero-initialized Attention
Figure 2: Details of LLaMA-Adapter.
13 / 32
LLaMA-Adapter
LLaMA-Adapter
Zero-initialized Attention
Zero-initialized Attention (cont.)
In the attention mechanism, several linear projection layers are first
applied to transform the input tokens into queries, keys, and values:

$Q_l = \mathrm{Linear_q}(t_l)$,  (2)
$K_l = \mathrm{Linear_k}([P_l; T_l; t_l])$,  (3)
$V_l = \mathrm{Linear_v}([P_l; T_l; t_l])$,  (4)

where $t_l \in \mathbb{R}^{1 \times C}$ denotes the current token at the $l$-th layer.
Then, the attention scores of $Q_l$ and $K_l$ before the softmax function
are calculated as

$S_l = Q_l K_l^T / \sqrt{C} \in \mathbb{R}^{1 \times (K+M+1)}$.  (5)
14 / 32
LLaMA-Adapter
LLaMA-Adapter
Zero-initialized Attention
Zero-initialized Attention (cont.)
Meanwhile, $S_l$ can be reformulated into two components as

$S_l = [S_l^K; S_l^{M+1}]^T$,  (6)

where $S_l^K \in \mathbb{R}^{K}$ and $S_l^{M+1} \in \mathbb{R}^{(M+1) \times 1}$ are the attention scores of
the $K$ adaption prompts and the $M+1$ word tokens, respectively.
To this end, they adopt a learnable gating factor, denoted as $g_l$, to
adaptively control the importance of $S_l^K$ in the attention.
Therefore, they independently apply the softmax function to the two
components in Equation (6), and multiply the first term by $g_l$,
formulated as

$S_l^g = [\mathrm{softmax}(S_l^K) \cdot g_l;\ \mathrm{softmax}(S_l^{M+1})]^T$.  (7)
15 / 32
LLaMA-Adapter
LLaMA-Adapter
Zero-initialized Attention
Zero-initialized Attention (cont.)
Finally, they calculate the output of the $l$-th attention layer with a
linear projection layer as

$t_l^o = \mathrm{Linear_o}(S_l^g V_l) \in \mathbb{R}^{1 \times C}$.  (8)
16 / 32
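Putting Equations (2)-(8) together, the following is a minimal single-head
sketch of zero-initialized attention for one generation step. The class name
`ZeroInitAttention`, the single-head simplification, and the toy shapes are
assumptions for illustration; the official repository differs in detail.

```python
import math
import torch
import torch.nn as nn

class ZeroInitAttention(nn.Module):
    """Single-head sketch of zero-initialized attention (Eqs. 2-8)."""
    def __init__(self, dim: int, prompt_len: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.o = nn.Linear(dim, dim)
        self.prompt_len = prompt_len              # K
        self.gate = nn.Parameter(torch.zeros(1))  # g_l, zero-initialized

    def forward(self, prompt, tokens, cur):
        # prompt: (K, C) adaption prompt P_l
        # tokens: (M, C) word tokens T_l
        # cur:    (1, C) current token t_l
        C = cur.size(-1)
        q = self.q(cur)                                  # Eq. (2)
        kv_in = torch.cat([prompt, tokens, cur], dim=0)  # [P_l; T_l; t_l]
        k, v = self.k(kv_in), self.v(kv_in)              # Eqs. (3)-(4)
        s = q @ k.t() / math.sqrt(C)                     # Eq. (5): (1, K+M+1)
        s_prompt, s_word = s[:, :self.prompt_len], s[:, self.prompt_len:]
        # Eq. (7): separate softmaxes; the prompt branch is scaled by the gate
        attn = torch.cat([s_prompt.softmax(-1) * self.gate,
                          s_word.softmax(-1)], dim=-1)
        return self.o(attn @ v)                          # Eq. (8): (1, C)

# With the gate at zero, the output equals vanilla attention over the word
# tokens, so the pre-trained behavior is preserved at the start of training.
layer = ZeroInitAttention(dim=64, prompt_len=10)
out = layer(torch.randn(10, 64), torch.randn(32, 64), torch.randn(1, 64))
print(out.shape)  # torch.Size([1, 64])
```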
LLaMA-Adapter
LLaMA-Adapter
Multi-modal Reasoning
Multi-modal Reasoning
Apart from textual instructions, LLaMA-Adapter is capable of
answering a question based on inputs from other modalities, which
augments the language model with rich cross-modal information.
Figure 3: Multi-modal Reasoning of LLaMA-Adapter.
17 / 32
LLaMA-Adapter
LLaMA-Adapter
Multi-modal Reasoning
Multi-modal Reasoning (cont.)
For an input image as the visual context, they first leverage a
pre-trained visual encoder, e.g., CLIP [5], to extract its multi-scale
global features, denoted as $\{I_m\}_{m=1}^{M}$, where $I_m \in \mathbb{R}^{1 \times C_m}$ and $M$
denotes the number of scales.
These features are then aggregated by a learnable projection network,
formulated as

$I_p = \mathrm{Projection}(\mathrm{Concat}(\{I_m\}_{m=1}^{M}))$,  (9)

where $I_p \in \mathbb{R}^{1 \times C}$ is regarded as the overall image token with the
same feature dimension as the adaption prompts.
18 / 32
LLaMA-Adapter
LLaMA-Adapter
Multi-modal Reasoning
Multi-modal Reasoning (cont.)
They then repeat $I_p$ $K$ times and element-wisely add it onto the
$K$-length adaption prompts at all $L$ inserted transformer layers. For
the $l$-th layer, they denote the acquired multi-modal prompt as

$P_l^v = P_l + \mathrm{Repeat}(I_p) \in \mathbb{R}^{K \times C}$,  (10)

where $P_l^v$ denotes the adaption prompt incorporating visual
information from the given image context.
19 / 32
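A minimal sketch of Equations (9) and (10), assuming the multi-scale CLIP
features have already been extracted. The single `nn.Linear` standing in for
the projection network and the toy shapes are assumptions, not the paper's
exact architecture.

```python
import torch
import torch.nn as nn

def visual_prompt(prompts, image_feats, projection):
    """Eqs. (9)-(10): fold a global image token into the adaption prompts.

    prompts:     (L, K, C) adaption prompts P_l of the L inserted layers
    image_feats: list of (1, C_m) multi-scale global features {I_m}
    projection:  learnable network mapping the concatenated features to (1, C)
    """
    i_p = projection(torch.cat(image_feats, dim=-1))  # Eq. (9): (1, C)
    # Eq. (10): repeat I_p over the K prompt positions and add element-wise
    return prompts + i_p.unsqueeze(0)                 # broadcasts over L and K

# Toy usage with two feature scales (C_1 = 512, C_2 = 768) and C = 4096.
L, K, C = 8, 10, 4096
prompts = torch.randn(L, K, C)
feats = [torch.randn(1, 512), torch.randn(1, 768)]
projection = nn.Linear(512 + 768, C)  # stand-in for the projection network
print(visual_prompt(prompts, feats, projection).shape)  # torch.Size([8, 10, 4096])
```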
LLaMA-Adapter
LLaMA-Adapter
Zero-initialized Attention for other Large Models
Zero-initialized Attention for other Large Models
Here, the vision model is ViT [1] and the language model is RoBERTa [4].
Vision Models. They insert the adaption prompts as a prefix into
the topmost L transformer layers of ViT, and modify the attention
operations to be zero-initialized at all inserted layers.
Language Models. They implement the zero-initialized
attention on top of P-tuning v2 [3], a prompt tuning method for
efficiently adapting large language models. Likewise, they only
enable the prompt tokens in P-tuning v2 and their zero gating
factors to be learnable during fine-tuning.
20 / 32
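For other large models the recipe is the same parameter-efficiency trick:
keep the pre-trained backbone frozen and train only the inserted prompts and
their zero-initialized gates. A minimal sketch, assuming the parameter naming
from the snippets above (not the official code):

```python
# Freeze a backbone (e.g., a ViT or RoBERTa whose top L blocks carry
# adaption prompts and zero-init gates) except for those new parameters.
def mark_trainable(model):
    for name, param in model.named_parameters():
        param.requires_grad = ("prompt" in name) or ("gate" in name)
    n = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {n}")
```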
LLaMA-Adapter
Experiment
Table of contents
1 Abstract
2 Introduction
3 Related Work
4 LLaMA-Adapter
5 Experiment
6 Conclusion
7 Reflection
21 / 32
LLaMA-Adapter
Experiment
Experiment
Instruction-following Evaluation
Multi-modal Evaluation
Ablation Study
Zero-initialized Attention for other Large Models
22 / 32
LLaMA-Adapter
Experiment
Figure 4: Instruction-following Comparison.
23 / 32
LLaMA-Adapter
Experiment
Table 1: Question Answering Accuracy (%) on ScienceQA’s test set.
24 / 32
LLaMA-Adapter
Experiment
Ablation Study
The ablation study focuses on the number of inserted layers,
zero-initialized attention, and robustness to over-fitting.
Table 2: Inserted Layers (left) and Zero-initialized Attention (right)
Table 3: Robustness to Over-fitting
25 / 32
LLaMA-Adapter
Experiment
Zero-initialized Attention for other Large Models
Table 4: Vision (left) / Language (right) Model Fine-tuning
This demonstrates the superiority of zero-initialized attention on
traditional vision and language tasks compared to existing fine-tuning
methods.
26 / 32
LLaMA-Adapter
Conclusion
Table of contents
1 Abstract
2 Introduction
3 Related Work
4 LLaMA-Adapter
5 Experiment
6 Conclusion
7 Reflection
27 / 32
LLaMA-Adapter
Conclusion
Conclusion
In this paper, they propose LLaMA-Adapter, an efficient adaption
method for training instruction-following models.
Also, they introduce zero-initialized attention with a gating
mechanism, which adaptively incorporates instructional signals while
preserving the pre-trained knowledge in LLaMA.
LLaMA-Adapter can be generalized to image conditions for
multi-modal reasoning, as well as to traditional vision and language
tasks. Their zero-initialized attention also attains favorable
fine-tuning performance, which indicates strong generalization capacity.
28 / 32
LLaMA-Adapter
Reflection
Reflection
This work presents parameter-efficient tuning of LLaMA: a way to
add additional parameters as prefixes to fine-tune the language model.

$[\mathrm{Softmax}(QK_1^T),\ \alpha \cdot \mathrm{Softmax}(QK_2^T)]\,[V_1^T, V_2^T]^T$  (11)

In their follow-up paper, LLaMA-Adapter V2 [2], they instead add the
parameter inside the scaled dot-product attention. The two
formulations are equivalent, which means LLaMA-Adapter can be
implemented in a flexible way.

$\mathrm{Softmax}(QK_1^T)\,V_1 + \alpha \cdot \mathrm{Softmax}(QK_2^T)\,V_2$  (12)
29 / 32
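The equivalence of Equations (11) and (12) is just the block structure of the
matrix product; a quick numerical check with arbitrarily chosen shapes (the
32/10/64 sizes are illustrative assumptions):

```python
import torch

torch.manual_seed(0)
q = torch.randn(1, 64)
k1, v1 = torch.randn(32, 64), torch.randn(32, 64)  # word-token branch
k2, v2 = torch.randn(10, 64), torch.randn(10, 64)  # prompt branch
alpha = 0.3

# Eq. (11): concatenated attention weights times the stacked values
weights = torch.cat([(q @ k1.t()).softmax(-1),
                     alpha * (q @ k2.t()).softmax(-1)], dim=-1)
out_11 = weights @ torch.cat([v1, v2], dim=0)

# Eq. (12): two separate attention terms, the prompt branch scaled by alpha
out_12 = (q @ k1.t()).softmax(-1) @ v1 + alpha * (q @ k2.t()).softmax(-1) @ v2

print(torch.allclose(out_11, out_12, atol=1e-6))  # True
```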
LLaMA-Adapter
Reflection
References I
[1] Alexey Dosovitskiy et al. An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale. 2021. arXiv:
2010.11929 [cs.CV].
[2] Peng Gao et al. LLaMA-Adapter V2: Parameter-Efficient Visual
Instruction Model. 2023. arXiv: 2304.15010 [cs.CV].
[3] Xiao Liu et al. P-Tuning v2: Prompt Tuning Can Be Comparable
to Fine-tuning Universally Across Scales and Tasks. 2022. arXiv:
2110.07602 [cs.CL].
[4] Yinhan Liu et al. RoBERTa: A Robustly Optimized BERT
Pretraining Approach. 2019. arXiv: 1907.11692 [cs.CL].
30 / 32
LLaMA-Adapter
Reflection
References II
[5] Alec Radford et al. “Learning Transferable Visual Models From
Natural Language Supervision”. In: Proceedings of the 38th
International Conference on Machine Learning. Ed. by
Marina Meila and Tong Zhang. Vol. 139. Proceedings of
Machine Learning Research. PMLR, July 2021, pp. 8748–8763.
url: https:
//proceedings.mlr.press/v139/radford21a.html.
[6] Rohan Taori et al. Stanford Alpaca: An Instruction-following
LLaMA model.
https://github.com/tatsu-lab/stanford_alpaca. 2023.
[7] tloen. Alpaca-LoRA.
https://github.com/tloen/alpaca-lora. 2023.
31 / 32
LLaMA-Adapter
Reflection
References III
[8] Hugo Touvron et al. LLaMA: Open and Efficient Foundation
Language Models. 2023. arXiv: 2302.13971 [cs.CL].
[9] Yizhong Wang et al. Self-Instruct: Aligning Language Models with
Self-Generated Instructions. 2022. arXiv: 2212.10560 [cs.CL].
32 / 32
