PaLM: Scaling Language Modeling with Pathways
Chowdhery, Aakanksha, et al. arXiv preprint arXiv:2204.02311
2023. 02. 19
허정원, 조해창, 박산희
Contents
• Introduction
• Model Architecture
• Training
• Evaluation
• Discussions
1. Introduction
Recent large language models and their parameter counts: GPT-3 (175B), GLaM (1.2T), LaMDA (137B), Gopher (280B), Megatron-Turing NLG (530B).
1. Introduction
(1) Scaling the size of the models in both depth and width
(2) Increasing the number of tokens the model is trained on
(3) Training on cleaner datasets from more diverse sources
(4) Increasing model capacity without increasing compute through sparsely activated modules
PaLM: a 540B-parameter model trained on 780B tokens, achieved through the use of Pathways.
The key takeaways
• Efficient scaling: first large-scale use of Pathways, training a single model across 6144 TPU v4 chips spanning two pods
• Continued improvements from scaling: state-of-the-art few-shot results on a wide range of tasks, with no sign of plateauing
• Breakthrough capabilities: combined with chain-of-thought prompting, strong performance on multi-step reasoning tasks
• Discontinuous improvements: on some tasks, scaling from 62B to 540B produces jumps far above the log-linear trend
• Multilingual understanding: strong non-English performance despite a modest fraction of non-English training data
• Bias and toxicity: analysis of distributional bias and toxicity in model outputs
Model Architecture
2. Model Architecture
• SwiGLU activation
• Parallel layers
• Multi-query attention
• RoPE embeddings
• Shared input-output embeddings
• No biases
• 256k SentencePiece vocabulary
• SwiGLU Activation
SwiGLU(x) = Swish(xW) ⊗ xV, where Swish(z) = z · sigmoid(βz)
SwiGLU gives an improvement in quality in compute-equivalent experiments.
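A minimal NumPy sketch of the SwiGLU gating above; the weight shapes, β = 1, and the toy dimensions are illustrative assumptions, not PaLM's actual configuration.

import numpy as np

def swish(z, beta=1.0):
    # Swish(z) = z * sigmoid(beta * z)
    return z / (1.0 + np.exp(-beta * z))

def swiglu(x, W, V):
    # SwiGLU(x) = Swish(xW) ⊗ xV  (element-wise product of the two projections)
    return swish(x @ W) * (x @ V)

# Toy usage: one token embedding of width 8 projected to a hidden width of 16.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))
W = rng.normal(size=(8, 16))
V = rng.normal(size=(8, 16))
print(swiglu(x, W, V).shape)  # (1, 16)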
• Parallel Layers
The parallel formulation results in roughly 15% faster
training speed at large scales, since the MLP and
Attention input matrix multiplications can be fused.
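In the paper, the standard serialized block y = x + MLP(LayerNorm(x + Attention(LayerNorm(x)))) is replaced by the parallel form y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x)). Below is a sketch of the two block structures; attention_fn and mlp_fn are stand-in callables and layer_norm is simplified, so this is illustrative rather than the PaLM implementation.

import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def serial_block(x, attention_fn, mlp_fn):
    # Standard: y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))
    x = x + attention_fn(layer_norm(x))
    return x + mlp_fn(layer_norm(x))

def parallel_block(x, attention_fn, mlp_fn):
    # Parallel: y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))
    h = layer_norm(x)  # one shared LayerNorm lets the two input matmuls be fused
    return x + attention_fn(h) + mlp_fn(h)

# Toy usage with stand-in sublayers:
x = np.ones((4, 8))
print(parallel_block(x, np.tanh, np.tanh).shape)  # (4, 8)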
• Multi-Query Attention
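In multi-query attention the key and value projections are shared across all heads while each head keeps its own query projection, which shrinks the key/value cache and speeds up autoregressive decoding. A hedged NumPy sketch with assumed shapes and no causal masking; not the PaLM code.

import numpy as np

def multi_query_attention(x, Wq, Wk, Wv, n_heads):
    # x: [seq, d_model]; Wq: [d_model, n_heads * d_head]; Wk, Wv: [d_model, d_head]
    seq, _ = x.shape
    d_head = Wk.shape[1]
    q = (x @ Wq).reshape(seq, n_heads, d_head)   # per-head queries
    k = x @ Wk                                   # single shared key head
    v = x @ Wv                                   # single shared value head
    scores = np.einsum('shd,td->hst', q, k) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = np.einsum('hst,td->shd', weights, v)   # every head attends over the same k/v
    return out.reshape(seq, n_heads * d_head)

# Toy usage: 5 tokens, d_model = 16, 4 heads of size 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
out = multi_query_attention(x, rng.normal(size=(16, 32)),
                            rng.normal(size=(16, 8)), rng.normal(size=(16, 8)), n_heads=4)
print(out.shape)  # (5, 32)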
• RoPE Embeddings
Earlier relative-position schemes add trainable position vectors to the key and value projections:
f_q(x_m) := W_q x_m
f_k(x_n, n) := W_k(x_n + p̃_r^k)
f_v(x_n, n) := W_v(x_n + p̃_r^v)
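RoPE instead rotates each pair of query/key dimensions by an angle proportional to the token's position, so the query-key dot product depends only on the relative distance between tokens. A NumPy sketch following the RoPE convention (base 10000, adjacent-dimension pairing); illustrative, not the PaLM implementation.

import numpy as np

def apply_rope(x, positions, base=10000.0):
    # x: [seq, d] with d even; rotate each (even, odd) dimension pair by
    # angle theta_i * position, where theta_i = base**(-2i/d).
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)      # [d/2]
    angles = positions[:, None] * inv_freq[None, :]   # [seq, d/2]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Toy usage: 4 tokens with 8-dimensional queries. Shifting both query and key
# positions by the same offset leaves their dot product unchanged.
q = np.random.default_rng(0).normal(size=(4, 8))
print(apply_rope(q, np.arange(4)).shape)  # (4, 8)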
• Vocabulary
A SentencePiece vocabulary with 256k tokens, which was chosen
to support the large number of languages in the training corpus
without excess tokenization.
The vocabulary is completely lossless and reversible.
2. Model Architecture
•
•
• cost savings
•
•
•
•
2.1 Model Scale Hyperparameters
Model Architecture
Training
3 Training Dataset
4 Training Infrastructure
4.1 Training Efficiency
5 Training Setup
•
•
•
•
•
•
•
•
5 Training Setup
•
5 Training Setup
•
5 Training Setup
•
5 Training Setup
•
5 Training Setup
•
5 Training Setup
•
•
•
5.1 Training Instability
Training
Evaluation
6.1 English NLP tasks
6.2 BIG-bench
6.3 Reasoning
6.4 Code Tasks
6.5 Translation
•
•
•
6.6 Multilingual Natural Language Generation
•
•
•
•
6.7 Multilingual Question Answering
6.8 Analysis
Discussions
7 Memorization
•
•
•
8 Dataset Contamination
9 Exploring Explanations
•
•
•
10 Representational Bias Analysis
13 Open Questions in Scaling
14 Conclusion
•
•
•
Q & A
