How Computers Understood Humans
ideas existed at least since the 1700s
but not enough compute and computer science
How to instruct a computer to perform tasks?
How to represent knowledge in computers?
How to generate the answers?
by his contrivance, the most ignorant person, at a reasonable charge, and with
a little bodily labour, might write books in philosophy, poetry, politics, laws,
mathematics, and theology, without the least assistance from genius or study.
... He then commanded six-and-thirty of the lads, to read the several lines
softly, as they appeared upon the frame
(Gulliver's Travels by Jonathan Swift, 1726, making fun of Ramon Llull, 1232)
Prompt as an Interface
2001: A Space Odyssey HAL 9000
input textual instructions, e.g. explain a riddle
based on its knowledge, the computer generates the answer text
How To Represent Knowledge
library ~> textual documents in a database
search by list of words (query) ~1970s, find topics ~1980
counting word occurrences at the document level into sparse matrices
methods: TF*IDF, latent semantic analysis
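A minimal sketch of this document-level counting idea (illustrative only; the toy documents and helper names are mine, not from the slides):

```python
# Hand-rolled TF*IDF sketch: count word occurrences per document and
# down-weight words that appear in many documents, producing a sparse
# per-document map of weights.
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]
tokenized = [d.split() for d in docs]

# document frequency: in how many documents does each word occur?
df = Counter(w for doc in tokenized for w in set(doc))

def tf_idf(doc):
    """Return a sparse {word: weight} map for one document."""
    tf = Counter(doc)
    n_docs = len(tokenized)
    return {
        w: (count / len(doc)) * math.log(n_docs / df[w])
        for w, count in tf.items()
        if df[w] < n_docs  # words occurring in every document get weight 0 anyway
    }

for doc in tokenized:
    print(tf_idf(doc))
```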
Non-Contextual Word Vectors
document -> sentence or small running window of ~10 words
a vector is a point in a multidimensional space - an array of numbers
each of 10k words gets one general vector in a 300-dimensional space
each vector has to fit in "only" 300 dimensions - much less than 10k words
global (non-)contextual word vectors - no disambiguation of fruit (flowering) vs fruit (food)
Word2vec: Word To a Global Vector
word2vec (Mikolov 2013): the ~10 surrounding word vectors sum close to the middle word's vector
GloVe (Pennington 2014): counts co-occurrences in a 10-word window
words appearing in similar contexts are close in the 300-dimensional space
still no disambiguation - a word string should be just a name, not an id!
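A hedged sketch of how such global vectors are obtained in practice, using gensim as an assumed library and a toy corpus (real models are trained on billions of tokens):

```python
# Train global (non-contextual) word vectors with gensim's word2vec.
from gensim.models import Word2Vec

sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "cats and dogs are pets".split(),
]

model = Word2Vec(
    sentences,
    vector_size=300,   # one global 300-dimensional vector per word
    window=10,         # ~10-word context window, as on the slide
    min_count=1,
    sg=0,              # CBOW: predict the middle word from the surrounding words
)

vec = model.wv["cat"]                # the single, non-contextual vector for "cat"
print(vec.shape)                     # (300,)
print(model.wv.most_similar("cat"))  # words used in similar contexts end up nearby
```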
Transformer: Contextual Word Vectors
word meaning based on context of 100s of words.
recurrent neural networks (LSTM, GRU) - sequential with memory
transformer architecture (Vaswani 2017) - calculates on the entire input sequence
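For intuition, a minimal single-head scaled dot-product self-attention in NumPy (a toy illustration of "calculates on the entire input sequence", not the actual transformer or PaLM code):

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model). Every position attends to the entire input."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])            # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the whole sequence
    return weights @ v                                 # one context-dependent vector per word

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16
x = rng.normal(size=(seq_len, d_model))                # stand-in for 6 word embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)             # (6, 16)
```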
Knowledge Graph's Nodes Are Disambiguated
knowledge graph, e.g. Wikidata: each node is specific - fruit (flowering) vs fruit (food)
imperfect tradeoff between a database and training data samples
Wikipedia and the internet are between a knowledge graph and a set of documents
random walk ~ valid "sentences", link prediction ~ generating text
Big Transformer Models
generate by predicting the continuation of the input text
~$10M transformers trained on large amounts of text from the internet (2022)
solve a wide variety of naturally described problems, sometimes with human-level performance
examples: PaLM (2022), RETRO (2021), GPT-3, ...
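The "predict the continuation" loop can be sketched as follows; `model`, `tokenizer`, and `eos_token_id` are hypothetical placeholders, not a specific library API:

```python
# Greedy autoregressive generation: repeatedly ask the model for the most
# likely next token and append it to the running sequence.
def generate(model, tokenizer, prompt, max_new_tokens=50):
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        logits = model(tokens)  # scores for every possible next token
        next_token = max(range(len(logits)), key=lambda t: logits[t])  # greedy pick
        tokens.append(next_token)
        if next_token == tokenizer.eos_token_id:
            break
    return tokenizer.decode(tokens)
```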
Google's Pathways Language Model and Chain-of-Thought
training task: Given this text, predict the next word (token).
PaLM is the biggest dense transformer (540B) and likely the most expensive (~$10M) as of early 2022
highly efficient training on 6k chips (TPU v4) across 2 clusters (Pods)
improvements from scaling continue in language understanding (few-shot)
disproportionate gains at a certain scale, e.g. reasoning: 62B to 540B vs 8B to 62B
breakthrough performance:
outperforming the average human on grade-school logic and math (BIG-bench)
outperforms specialized and fine-tuned models on multistep reasoning
chain-of-thought prompting simulates inner monologue
PaLM's Size
PaLM has 540B parameters, ~3x bigger than GPT-3's 175B parameters
2x smaller than the sparse ~1T Switch Transformer, where only parts of the model are activated at a time
the human brain has ~100T connections
likely the most expensive model: ~$10M (2.5 yottaFLOPs) vs GPT-3's ~$5M
PaLM and GPT-3 are fascinating, but likely not economical now
Zero-Shot vs Few-Shot Prompting vs Fine-Tuning
prompting: instructing via addition of textual context
zero-shot: task described, but demonstrations not given
few-shot: task described and random demonstrations provided
fine-tuning: model parameters are updated with correct answers
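To make the distinction concrete, here are illustrative prompt strings (the translation task is a common textbook example, not taken from the PaLM paper):

```python
# Zero-shot: only the task description, no demonstrations.
zero_shot = (
    "Translate English to French.\n"
    "English: cheese\n"
    "French:"
)

# Few-shot: task description plus a handful of demonstrations in the prompt.
few_shot = (
    "Translate English to French.\n"
    "English: sea otter\nFrench: loutre de mer\n"
    "English: cheese\n"
    "French:"
)

# Fine-tuning, by contrast, does not touch the prompt at all:
# the model's parameters are updated on (input, correct answer) pairs.
```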
PaLM's Breakthrough Capabilities on BIG-bench
BIG-bench: 150+ tasks (a task contains samples) like reasoning, math, QA, translation, programming
58 tasks have results available for other models, often multiple-choice questions
PaLM is SOTA: outperforming the human average and other models (GPT-3, Gopher, Chinchilla)
still underperforms the average human on many tasks
certain capabilities emerge only once a certain scale is reached (not log-linear)
English proverbs and logical sequence jump between 62B and 540B far more than between 8B and 62B
Chain-of-Thought Prompting
inserts reasoning steps for each demonstration in the prompt
multi-step arithmetic:
grade-school level math problems
difficult to convert to equations
commonsense logical reasoning:
strong world knowledge
chaining logical inferences
thoughts are useful for interpretation
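For instance, a chain-of-thought prompt looks like the sketch below; the demonstration follows the style popularized by the chain-of-thought paper, but the exact wording here is illustrative:

```python
# Each demonstration includes the reasoning steps before the answer,
# so the model imitates that "inner monologue" for the new question.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
    "\n"
    "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. "
    "How many apples do they have?\n"
    "A:"  # the model is expected to produce reasoning steps, then the answer
)
```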
PaLM Chain-of-Thought Prompting Results
reasoning tasks datasets: GSM8K, SVAMP, MAWPS, AQuA,
CommonsenseQA, StrategyQA
humans were allowed calculators and search engines too
the model triggers a calculator by generating special annotation text (+~5% accuracy)
on GSM8K: PaLM + chain-of-thought is competitive with GPT-3 + fine-tuning + calculator + verifier,
and outperforms it with a calculator, but how would GPT-3 + chain-of-thought compare?
generating explicit inference chains drastically increases the quality of the predictions
Explanation Using Chain-of-Thought Prompting
right reasons for the right answer?
interpretability may be a desired output
2-shot prompting with greedy decoding (deterministic)
not in the paper: I added GPT-3 results with zero temperature
average airplane speed: 460 – 575 mph
Input: Jennifer looked out her window and sees a really cool cloud below her.
She unbuckles her seatbelt and heads to the bathroom. Is Jennifer probably
traveling more than 300 miles per hour relative to the earth?
PaLM: 300 miles per hour is about 480 km/h. This is about the speed of a
commercial airplane. Clouds are usually below airplanes, so Jennifer is
probably on an airplane. The answer is "yes"
GPT-3: No, because an airplane typically travels around 500-600 miles per
hour, and the bathroom is usually located in the back of the plane.
PaLM on Code Tasks
achieves results similar to OpenAI's specialized Codex 12B with 3% of the code training data
after fine-tuning, outperforms OpenAI's Davinci Codex
PaLM on Other Tasks
generally SOTA in few-shot comparisons
few-shot is not competitive with fine-tuned models
fine-tuned PaLM is only competitive with specialized fine-tuned variants
translation: without fine-tuning outperforms GPT-3, and outperforms some supervised baselines (de-en, ro-en)
summarization: fine-tuned results are competitive, few-shot largely underperforms the fine-tuned
multilingual question answering: fine-tuned results are competitive, few-shot largely underperforms the fine-tuned
PaLM Architecture:
standard decoder-only transformer (attending only to the past, similar to GPT-3)
modified feed-forward layer (MLP): SwiGLU feed-forward instead of ReLU
standard FFN: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
SwiGLU: FFN_SwiGLU(x) := (Swish(xW_1) ⊗ xV)W_2
uses GLU: gated linear unit - a sigmoid-controlled output
uses the swish activation: swish(x) = x(1 + exp(-x))^(-1)
~1% better in a compute-equivalent setup
parallel attention and feed-forward layer (MLP) from GPT-J:
instead of sequential it is additive: y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))
15% speedup for a small degradation
multi-query attention:
key and value projections are shared per block, with different query projections per head
speeds up autoregressive decoding
RoPE embeddings:
want relative position information in the query-key dot-product
use a multiplicative rotation matrix mixing pairwise neighboring dimensions
improves performance on long sequences
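A hedged NumPy sketch of the SwiGLU feed-forward and the parallel attention + MLP formulation above; the shapes, weight names, and the dummy identity attention are my own illustrative choices, not the paper's code:

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))              # x * sigmoid(x)

def ffn_relu(x, W1, b1, W2, b2):
    """Standard transformer MLP: max(0, x W1 + b1) W2 + b2 (for comparison)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def ffn_swiglu(x, W, V, W2):
    """SwiGLU MLP: (Swish(x W) ⊗ x V) W2 - a gated, bias-free variant."""
    return (swish(x @ W) * (x @ V)) @ W2

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def parallel_block(x, attention, mlp):
    """GPT-J-style parallel formulation: y = x + MLP(LN(x)) + Attention(LN(x))."""
    h = layer_norm(x)
    return x + mlp(h) + attention(h)

# toy usage with random weights and a dummy (identity) attention
rng = np.random.default_rng(0)
d, d_ff = 8, 32
x = rng.normal(size=(4, d))
W, V, W2 = rng.normal(size=(d, d_ff)), rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d))
y = parallel_block(x, attention=lambda h: h, mlp=lambda h: ffn_swiglu(h, W, V, W2))
print(y.shape)  # (4, 8)
```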
PaLM Training Dataset
780B tokens of high-quality text (~100B human days), (MassiveText 5T tokens, GPT-3 300B tokens, MT-NLG 339B)
social media 50%, webpages 27%, books 13%, Wikipedia 4%, code 5%, news 1%
based on the datasets used for LaMDA and GLaM
private non-reproducible dataset, while MT-NLG's 339B is reproducible but non-hosted
PaLM Training Requirements
~17 TB of RAM, 2.5 yottaFLOPs (yotta = 10^24) needed for training
2 TPU v4 Pod clusters connected via the data center network
the mind of PaLM is shattered across many chips (cheaper, replaceable, cooling)
each Pod ~1 exaflop/s: 768 hosts, 3072 TPU v4 chips
but how to split the work and communicate?
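These numbers can be sanity-checked with rough arithmetic (mine, not the paper's), using the ~50% efficiency figure from the "Training Efficiency" slide below:

```python
# Back-of-envelope wall-clock estimate for PaLM training.
total_flops = 2.5e24        # 2.5 yottaFLOPs of training compute
peak_per_pod = 1e18         # ~1 exaFLOP/s per TPU v4 Pod
pods = 2
utilization = 0.5           # roughly half of theoretical peak is achieved

seconds = total_flops / (peak_per_pod * pods * utilization)
print(seconds / 86400)      # ~29 days of wall-clock training
```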
Parallel Training of Large Scale Models
parallel computing trades off compute ("time"), memory ("space"),
communication throughput (no cool name)
data parallelism - batches are divided between workers (see the toy sketch after this list)
tensor model parallelism - splits model layers, i.e. the transformer block, into attention heads and feed-forward parts
pipeline parallelism (e.g. Megatron-Turing NLG (MT-NLG) by Microsoft and NVIDIA):
splits the computation DAG into stages, e.g. layers
stages exchange forward and backward propagation information (micro-batches)
step-by-step passing causes "bubbles" - idling
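As a toy sketch of the first strategy, data parallelism, assuming a tiny linear model (the other strategies shard the model itself and need more machinery than a few lines can show):

```python
# Each worker computes gradients on its own slice of the batch, the gradients
# are averaged (the "all-reduce" step), and every worker applies the same update.
# Pure-Python sketch; real systems do this on accelerators with collective ops.
import numpy as np

def grad(w, x_batch, y_batch):
    """Gradient of mean squared error for a linear model y = x @ w."""
    pred = x_batch @ w
    return 2 * x_batch.T @ (pred - y_batch) / len(x_batch)

rng = np.random.default_rng(0)
w = rng.normal(size=(3,))
x, y = rng.normal(size=(8, 3)), rng.normal(size=(8,))

n_workers = 2
x_shards = np.array_split(x, n_workers)      # batch divided between workers
y_shards = np.array_split(y, n_workers)

local_grads = [grad(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
avg_grad = np.mean(local_grads, axis=0)      # all-reduce: average the gradients
w -= 0.01 * avg_grad                         # identical update on every worker
```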
PaLM Training Infrastructure
PaLM uses pipeline-free 2D parallelism
data parallel across 2 clusters (2 TPU v4 Pods)
each cluster (Pod) has full model copy
model partitioned into 12 parts
data partitioned into 256 parts
768 hosts connected to 3k chips which are interconnected
each batch triggers an update exchange between the clusters to keep the two model copies identical
each host exchanges 1.3GB with its counterpart
Training Efficiency
observed throughput relative to the theoretical maximum of a system
the price of parallelizing PaLM is ~50% (roughly half of the theoretical maximum throughput is achieved)
in PaLM's case, throughput is measured in tokens per second
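A rough sketch of how such an efficiency number relates to tokens per second, under the common approximation that one training token costs about 6 × parameters FLOPs (this rule of thumb is my assumption, not a figure from the slides):

```python
params = 540e9                    # PaLM parameter count
flops_per_token = 6 * params      # approx. forward + backward FLOPs per token
peak_flops = 2e18                 # 2 Pods at ~1 exaFLOP/s each

tokens_per_second = 0.5 * peak_flops / flops_per_token   # at ~50% of peak
print(tokens_per_second)          # ~3.1e5 tokens/s

# consistency check: 780B training tokens * ~6 * 540B FLOPs/token ~= 2.5e24 FLOPs,
# matching the 2.5 yottaFLOPs training-compute figure above
print(780e9 * flops_per_token)    # ~2.5e24
```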
More Posts Like This




SRU++ Model Speeds Up Transformer with Simple Recurrent Unit
DeepMind's RETRO Transformer Model
Appendix
GPT-3 Explanations Full Prompts
Logical Inference 1
Best of 2, temperature 0.7, otherwise defaults:
We can infer that Jennifer is on an airplane because she is looking out a
window and sees a cloud below her. We can also infer that she is traveling
at a high speed because she is on an airplane. The average cruising speed
of an airplane is about 550 miles per hour, so Jennifer is probably traveling
more than 300 miles per hour relative to the earth. The answer is "yes".
We can infer that Jennifer is in an airplane because she is looking out a
window. We can also infer that she is high up in the sky because she can
see a cloud "below" her. Clouds are usually found at altitudes above 3,000
feet. Finally, we know that airplanes typically travel around 500 miles per
hour. So, Jennifer is probably traveling more than 300 miles per hour, but
we cannot be certain. The answer is "yes".

Full prompt:

I will explain how to make logical inferences:

(1) John wakes to the sound of his children giggling outside the door. They bust in and say "Happy Father's Day!" with a
Answer: We can infer that John is a father because of "his children". We can infer that the current year is 2022. Barack
(2) Sam sees a piano in the subway stations. Laughing, his friends prod him to play, saying "Remember how amazing your r
Answer: Juilliard is a top school for music, so someone who majored in piano at Juilliard would be an expert pianist. "C
(3) Jennifer looked out her window and sees a really cool cloud below her. She unbuckles her seatbelt and heads to the b
Answer:
Greedy:
No, because an airplane typically travels around 500-600 miles per hour,
and the bathroom is usually located in the back of the plane.
Logical Inference 2
Best of 2, temperature 0.7 otherwise defaults:
It is not likely that Shelley will be near the Pacific Ocean this weekend.
Shelley is from Virginia, which is on the East Coast of the United States.
The city with the famous fish market is most likely Seattle, which is also on
the West Coast. Shelley is visiting, so she is not from there. She is going
home next Tuesday, so she will likely be going back to Virginia this
weekend.
It is not likely that Shelley will be near the Pacific Ocean this weekend.
Shelley is from Virginia, which is on the East Coast of the United States.
The city with the famous fish market is most likely Seattle, which is also on
the West Coast. Shelley is visiting, so she is not from there. She is going
home next Tuesday, so she will likely be going back to Virginia this
weekend.

Full prompt:

I will explain how to make logical inferences:

(1) John wakes to the sound of his children giggling outside the door. They bust in and say "Happy Father's Day!" with a
Answer: We can infer that John is a father because of "his children". We can infer that the current year is 2022. Barack
(2) Sam sees a piano in the subway stations. Laughing, his friends prod him to play, saying "Remember how amazing your r
Answer: Juilliard is a top school for music, so someone who majored in piano at Juilliard would be an expert pianist. "C
(3) Shelley is from Virginia, but is visiting that city with that famous market where they throw the fish. Going home ne
Question: Is it likely that Shelley will be near the Pacific Ocean this weekend?

Answer:
Greedy:
It is not likely that Shelley will be near the Pacific Ocean this weekend.
Shelley is from Virginia, which is on the East Coast of the United States.
The city with the famous fish market is most likely Seattle, which is also on
the West Coast. Shelley is visiting, so she is not from there. She is going
home next Tuesday, so she will likely be going back to Virginia this
weekend.
Jokes
Best of 2, temperature 0.7, otherwise defaults:
This joke is about how annoying mothers can be. The speaker is trying to
visit their family, but their mother has just informed them that their step-
dad's poetry reading is on the same night. So now the speaker has to fly in
on the next day.
This joke is about how annoying it can be to have to deal with family
obligations. The person telling the joke is trying to visit their family on
April 6th, but their mom tells them that their step dad's poetry reading is
that night. So the person has to fly in on April 7th instead.
This joke is about how annoying and inconsiderate step-parents can be.
The punchline is that the person is flying in a day later just to avoid their
step-dad's poetry reading.

Full prompt:

I will explain these jokes:

(1) The problem with kleptomaniacs is that they always take things literally.

Explanation: This joke is wordplay. Someone who "takes things literally" is someone who doesn't fully understand social
But the definition of kleptomania is someone who literally takes things.

(2) Always borrow money from a pessimist. They’ll never expect it back.

Explanation: Most people expect you to pay them back when you borrow money, however a pessimist is someone who always as
(3) I was going to fly to visit my family on April 6th. My mom said "Oh great, your step dad's poetry reading is that ni
Explanation:
Greedy:
This joke is about how annoying and intrusive in-laws can be. The person is
trying to visit their family, but their mother says that their step-dad's
poetry reading is that night. So the person is now flying in the next day to
avoid it.
