How Computers Understood Humans
ideas existed at least since the 1700s
but not enough compute and computer science
How to instruct a computer to perform tasks?
How to represent knowledge in computers?
How to generate the answers?
by his contrivance, the most ignorant person, at a reasonable charge, and with
a little bodily labour, might write books in philosophy, poetry, politics, laws,
mathematics, and theology, without the least assistance from genius or study.
... He then commanded six-and-thirty of the lads, to read the several lines
softly, as they appeared upon the frame
(Gulliver's Travels by Jonathan Swift, 1726, making fun of Ramon Llull, 1232)
Prompt as an Interface
2001: A Space Odyssey HAL 9000
input textual instructions, e.g. explain a riddle
based on its knowledge, the computer generates the answer text
How To Represent Knowledge
library ~> textual documents in a database
search by list of words (query) ~1970s, find topics ~1980
counting word occurrences at the document level into sparse matrices
methods: TF*IDF, latent semantic analysis
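A minimal sketch of this document-level counting idea (illustrative only; the toy documents and helper names are mine, not from the slides):

```python
# Hand-rolled TF*IDF sketch: count word occurrences per document and
# down-weight words that appear in many documents, producing a sparse
# per-document map of weights.
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]
tokenized = [d.split() for d in docs]

# document frequency: in how many documents does each word occur?
df = Counter(w for doc in tokenized for w in set(doc))

def tf_idf(doc):
    """Return a sparse {word: weight} map for one document."""
    tf = Counter(doc)
    n_docs = len(tokenized)
    return {
        w: (count / len(doc)) * math.log(n_docs / df[w])
        for w, count in tf.items()
        if df[w] < n_docs  # words occurring in every document get weight 0 anyway
    }

for doc in tokenized:
    print(tf_idf(doc))
```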
Non-Contextual Word Vectors
document -> sentence or small running window of ~10 words
a vector is a point in a multidimensional space - an array of numbers
each of 10k words gets one general vector in a 300-dimensional space
each vector has to fit in "only" 300 dimensions - much less than 10k words
global (non-)contextual word vectors - no disambiguation of fruit (flowering) vs fruit (food)
Word2vec: Word To a Global Vector
word2vec (Mikolov 2013): the ~10 surrounding word vectors sum close to the middle word's vector
GloVe (Pennington 2014): counts co-occurrences in a 10-word window
words appearing in similar contexts are close in the 300-dimensional space
still no disambiguation - a word string should be just a name, not an id!
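A hedged sketch of how such global vectors are obtained in practice, using gensim as an assumed library and a toy corpus (real models are trained on billions of tokens):

```python
# Train global (non-contextual) word vectors with gensim's word2vec.
from gensim.models import Word2Vec

sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "cats and dogs are pets".split(),
]

model = Word2Vec(
    sentences,
    vector_size=300,   # one global 300-dimensional vector per word
    window=10,         # ~10-word context window, as on the slide
    min_count=1,
    sg=0,              # CBOW: predict the middle word from the surrounding words
)

vec = model.wv["cat"]                # the single, non-contextual vector for "cat"
print(vec.shape)                     # (300,)
print(model.wv.most_similar("cat"))  # words used in similar contexts end up nearby
```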
Transformer: Contextual Word Vectors
word meaning based on context of 100s of words.
recurrent neural networks (LSTM, GRU) - sequential with memory
transformer architecture (Vaswani 2017) - calculates on the entire input sequence
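For intuition, a minimal single-head scaled dot-product self-attention in NumPy (a toy illustration of "calculates on the entire input sequence", not the actual transformer or PaLM code):

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model). Every position attends to the entire input."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])            # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the whole sequence
    return weights @ v                                 # one context-dependent vector per word

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16
x = rng.normal(size=(seq_len, d_model))                # stand-in for 6 word embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)             # (6, 16)
```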
Knowledge Graph's Nodes Are Disambiguated
knowledge graph, e.g. Wikidata: each node is specific - fruit (flowering) vs fruit (food)
imperfect tradeoff between a database and training data samples
Wikipedia and the internet are between a knowledge graph and a set of documents
random walk ~ valid "sentences", link prediction ~ generating text
Big Transformer Models
generate by predicting the continuation of the input text
~$10M transformers trained on large amounts of text from the internet (2022)
solve a wide variety of naturally described problems, sometimes with human-level performance
examples: PaLM (2022), RETRO (2021), GPT-3, ...
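The "predict the continuation" loop can be sketched as follows; `model`, `tokenizer`, and `eos_token_id` are hypothetical placeholders, not a specific library API:

```python
# Greedy autoregressive generation: repeatedly ask the model for the most
# likely next token and append it to the running sequence.
def generate(model, tokenizer, prompt, max_new_tokens=50):
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        logits = model(tokens)  # scores for every possible next token
        next_token = max(range(len(logits)), key=lambda t: logits[t])  # greedy pick
        tokens.append(next_token)
        if next_token == tokenizer.eos_token_id:
            break
    return tokenizer.decode(tokens)
```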
Google's Pathways Language Model and Chain-of-Thought
training task: Given this text, predict the next word (token).
PaLM is the biggest dense transformer (540B) and likely the most expensive (~$10M) as of early 2022
highly efficient training on 6k chips (TPU v4) across 2 clusters (Pods)
improvements from scaling continue in language understanding (few-shot)
disproportionate gains at a certain scale, e.g. reasoning: 62B to 540B vs 8B to 62B
breakthrough performance:
outperforming the average human on grade-school logic and math (BIG-bench)
outperforms specialized and fine-tuned models on multistep reasoning
chain-of-thought prompting simulates inner monologue
PaLM's Size
PaLM has 540B parameters, ~3x bigger than GPT-3's 175B parameters
2x smaller than the sparse ~1T Switch Transformer, where only parts of the model are activated at a time
the human brain has ~100T connections
likely the most expensive model: ~$10M (2.5 yottaFLOPs) vs GPT-3's ~$5M
PaLM and GPT-3 are fascinating, but likely not economical now
Zero-Shot vs Few-Shot Prompting vs Fine-Tuning
prompting: instructing via addition of textual context
zero-shot: task described, but demonstrations not given
few-shot: task described and random demonstrations provided
fine-tuning: model parameters are updated with correct answers
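To make the distinction concrete, here are illustrative prompt strings (the translation task is a common textbook example, not taken from the PaLM paper):

```python
# Zero-shot: only the task description, no demonstrations.
zero_shot = (
    "Translate English to French.\n"
    "English: cheese\n"
    "French:"
)

# Few-shot: task description plus a handful of demonstrations in the prompt.
few_shot = (
    "Translate English to French.\n"
    "English: sea otter\nFrench: loutre de mer\n"
    "English: cheese\n"
    "French:"
)

# Fine-tuning, by contrast, does not touch the prompt at all:
# the model's parameters are updated on (input, correct answer) pairs.
```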
PaLM's Breakthrough Capabilities on BIG-bench
BIG-bench: 150+ tasks (a task contains samples) like reasoning, math, QA, translation, programming
58 tasks have results available for other models, often multiple-choice questions
PaLM is SOTA: outperforming the human average and other models (GPT-3, Gopher, Chinchilla)
still underperforms the average human on many tasks
certain capabilities emerge only once a certain scale is reached (not log-linear)
English proverbs and logical sequence jump between 62B and 540B far more than between 8B and 62B
Chain-of-Thought Prompting
inserts reasoning steps for each demonstration in the prompt
multi-step arithmetic:
grade-school level math problems
difficult to convert to equations
commonsense logical reasoning:
strong world knowledge
chaining logical inferences
thoughts are useful for interpretation
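For instance, a chain-of-thought prompt looks like the sketch below; the demonstration follows the style popularized by the chain-of-thought paper, but the exact wording here is illustrative:

```python
# Each demonstration includes the reasoning steps before the answer,
# so the model imitates that "inner monologue" for the new question.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
    "\n"
    "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. "
    "How many apples do they have?\n"
    "A:"  # the model is expected to produce reasoning steps, then the answer
)
```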
PaLM Chain-of-Thought Prompting Results
reasoning tasks datasets: GSM8K, SVAMP, MAWPS, AQuA,
CommonsenseQA, StrategyQA
humans were allowed calculators and search engines too
the model triggers a calculator by generating special annotation text (+~5% accuracy)
on GSM8K: PaLM + chain-of-thought is competitive with GPT-3 + fine-tuning + calculator + verifier,
and outperforms it with a calculator, but how would GPT-3 + chain-of-thought compare?
generating explicit inference chains drastically increases the quality of the predictions
Explanation Using Chain-of-Thought Prompting
right reasons for the right answer?
interpretability may be a desired output
2-shot prompting with greedy decoding (deterministic)
not in the paper: I added GPT-3 results with zero temperature
average airplane speed: 460 – 575 mph
Input: Jennifer looked out her window and sees a really cool cloud below her.
She unbuckles her seatbelt and heads to the bathroom. Is Jennifer probably
traveling more than 300 miles per hour relative to the earth?
PaLM: 300 miles per hour is about 480 km/h. This is about the speed of a
commercial airplane. Clouds are usually below airplanes, so Jennifer is
probably on an airplane. The answer is "yes"
GPT-3: No, because an airplane typically travels around 500-600 miles per
hour, and the bathroom is usually located in the back of the plane.
PaLM on Code Tasks
achieves results similar to OpenAI's specialized Codex 12B with 3% of the code training data
after fine-tuning, outperforms OpenAI's Davinci Codex
PaLM on Other Tasks
generally SOTA in few-shot comparisons
few-shot is not competitive with fine-tuned models
fine-tuned PaLM is only competitive with specialized fine-tuned variants
translation: without fine-tuning outperforms GPT-3, and outperforms some supervised baselines (de-en, ro-en)
summarization: fine-tuned results are competitive, few-shot largely underperforms the fine-tuned
multilingual question answering: fine-tuned results are competitive, few-shot largely underperforms the fine-tuned
PaLM Architecture:
standard decoder-only transformer (attending only to the past, similar to GPT-3)
modified feed-forward layer (MLP): SwiGLU feed-forward instead of ReLU
standard FFN: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
SwiGLU: FFN_SwiGLU(x) := (Swish(xW_1) ⊗ xV)W_2
uses GLU: gated linear unit - a sigmoid-controlled output
uses the swish activation: swish(x) = x(1 + exp(-x))^(-1)
~1% better in a compute-equivalent setup
parallel attention and feed-forward layer (MLP) from GPT-J:
instead of sequential it is additive: y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))
15% speedup for a small degradation
multi-query attention:
key and value projections are shared per block, with different query projections per head
speeds up autoregressive decoding
RoPE embeddings:
want relative position information in the query-key dot-product
use a multiplicative rotation matrix mixing pairwise neighboring dimensions
improves performance on long sequences
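A hedged NumPy sketch of the SwiGLU feed-forward and the parallel attention + MLP formulation above; the shapes, weight names, and the dummy identity attention are my own illustrative choices, not the paper's code:

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))              # x * sigmoid(x)

def ffn_relu(x, W1, b1, W2, b2):
    """Standard transformer MLP: max(0, x W1 + b1) W2 + b2 (for comparison)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def ffn_swiglu(x, W, V, W2):
    """SwiGLU MLP: (Swish(x W) ⊗ x V) W2 - a gated, bias-free variant."""
    return (swish(x @ W) * (x @ V)) @ W2

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def parallel_block(x, attention, mlp):
    """GPT-J-style parallel formulation: y = x + MLP(LN(x)) + Attention(LN(x))."""
    h = layer_norm(x)
    return x + mlp(h) + attention(h)

# toy usage with random weights and a dummy (identity) attention
rng = np.random.default_rng(0)
d, d_ff = 8, 32
x = rng.normal(size=(4, d))
W, V, W2 = rng.normal(size=(d, d_ff)), rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d))
y = parallel_block(x, attention=lambda h: h, mlp=lambda h: ffn_swiglu(h, W, V, W2))
print(y.shape)  # (4, 8)
```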
PaLM Training Dataset
780B tokens of high-quality text (~100B human days), (MassiveText 5T tokens, GPT-3 300B tokens, MT-NLG 339B)
social media 50%, webpages 27%, books 13%, Wikipedia 4%, code 5%, news 1%
based on the datasets used for LaMDA and GLaM
private non-reproducible dataset, while MT-NLG's 339B is reproducible but non-hosted
PaLM Training Requirements
~17 TB of RAM, 2.5 yottaFLOPs (yotta = 10^24) needed for training
2 TPU v4 Pod clusters connected via the data center network
the mind of PaLM is shattered across many chips (cheaper, replaceable, cooling)
each Pod ~1 exaflop/s: 768 hosts, 3072 TPU v4 chips
but how to split the work and communicate?
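These numbers can be sanity-checked with rough arithmetic (mine, not the paper's), using the ~50% efficiency figure from the "Training Efficiency" slide below:

```python
# Back-of-envelope wall-clock estimate for PaLM training.
total_flops = 2.5e24        # 2.5 yottaFLOPs of training compute
peak_per_pod = 1e18         # ~1 exaFLOP/s per TPU v4 Pod
pods = 2
utilization = 0.5           # roughly half of theoretical peak is achieved

seconds = total_flops / (peak_per_pod * pods * utilization)
print(seconds / 86400)      # ~29 days of wall-clock training
```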
Parallel Training of Large Scale Models
parallel computing trades off compute ("time"), memory ("space"),
communication throughput (no cool name)
data parallelism - batches are divided between workers (see the toy sketch after this list)
tensor model parallelism - splits model layers, i.e. the transformer block, into attention heads and feed-forward parts
pipeline parallelism (e.g. Megatron-Turing NLG (MT-NLG) by Microsoft and NVIDIA):
splits the computation DAG into stages, e.g. layers
stages exchange forward and backward propagation information (micro-batches)
step-by-step passing causes "bubbles" - idling
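As a toy sketch of the first strategy, data parallelism, assuming a tiny linear model (the other strategies shard the model itself and need more machinery than a few lines can show):

```python
# Each worker computes gradients on its own slice of the batch, the gradients
# are averaged (the "all-reduce" step), and every worker applies the same update.
# Pure-Python sketch; real systems do this on accelerators with collective ops.
import numpy as np

def grad(w, x_batch, y_batch):
    """Gradient of mean squared error for a linear model y = x @ w."""
    pred = x_batch @ w
    return 2 * x_batch.T @ (pred - y_batch) / len(x_batch)

rng = np.random.default_rng(0)
w = rng.normal(size=(3,))
x, y = rng.normal(size=(8, 3)), rng.normal(size=(8,))

n_workers = 2
x_shards = np.array_split(x, n_workers)      # batch divided between workers
y_shards = np.array_split(y, n_workers)

local_grads = [grad(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
avg_grad = np.mean(local_grads, axis=0)      # all-reduce: average the gradients
w -= 0.01 * avg_grad                         # identical update on every worker
```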
PaLM Training Infrastructure
PaLM uses pipeline-free 2D parallelism
data parallel across 2 clusters (2 TPU v4 Pods)
each cluster (Pod) has full model copy
model partitioned into 12 parts
data partitioned into 256 parts
768 hosts connected to 3k chips which are interconnected
each batch triggers an update exchange between the clusters to keep the two model copies identical
each host exchanges 1.3GB with its counterpart
Training Efficiency
observed throughput relative to the theoretical maximum of a system
the price of parallelizing PaLM is ~50% (roughly half of the theoretical maximum throughput is achieved)
in PaLM's case, throughput is measured in tokens per second
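A rough sketch of how such an efficiency number relates to tokens per second, under the common approximation that one training token costs about 6 × parameters FLOPs (this rule of thumb is my assumption, not a figure from the slides):

```python
params = 540e9                    # PaLM parameter count
flops_per_token = 6 * params      # approx. forward + backward FLOPs per token
peak_flops = 2e18                 # 2 Pods at ~1 exaFLOP/s each

tokens_per_second = 0.5 * peak_flops / flops_per_token   # at ~50% of peak
print(tokens_per_second)          # ~3.1e5 tokens/s

# consistency check: 780B training tokens * ~6 * 540B FLOPs/token ~= 2.5e24 FLOPs,
# matching the 2.5 yottaFLOPs training-compute figure above
print(780e9 * flops_per_token)    # ~2.5e24
```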
More Posts Like This




SRU++ Model Speeds Up Transformer with Simple Recurrent Unit
DeepMind's RETRO Transformer Model
Appendix
GPT-3 Explanations Full Prompts
Logical Inference 1
Best of 2, temperature 0.7, otherwise defaults:
We can infer that Jennifer is on an airplane because she is looking out a
window and sees a cloud below her. We can also infer that she is traveling
at a high speed because she is on an airplane. The average cruising speed
of an airplane is about 550 miles per hour, so Jennifer is probably traveling
more than 300 miles per hour relative to the earth. The answer is "yes".
We can infer that Jennifer is in an airplane because she is looking out a
window. We can also infer that she is high up in the sky because she can
see a cloud "below" her. Clouds are usually found at altitudes above 3,000
feet. Finally, we know that airplanes typically travel around 500 miles per
hour. So, Jennifer is probably traveling more than 300 miles per hour, but
we cannot be certain. The answer is "yes".

Full prompt:

I will explain how to make logical inferences:

(1) John wakes to the sound of his children giggling outside the door. They bust in and say "Happy Father's Day!" with a
Answer: We can infer that John is a father because of "his children". We can infer that the current year is 2022. Barack
(2) Sam sees a piano in the subway stations. Laughing, his friends prod him to play, saying "Remember how amazing your r
Answer: Juilliard is a top school for music, so someone who majored in piano at Juilliard would be an expert pianist. "C
(3) Jennifer looked out her window and sees a really cool cloud below her. She unbuckles her seatbelt and heads to the b
Answer:
Greedy:
No, because an airplane typically travels around 500-600 miles per hour,
and the bathroom is usually located in the back of the plane.
Logical Inference 2
Best of 2, temperature 0.7 otherwise defaults:
It is not likely that Shelley will be near the Pacific Ocean this weekend.
Shelley is from Virginia, which is on the East Coast of the United States.
The city with the famous fish market is most likely Seattle, which is also on
the West Coast. Shelley is visiting, so she is not from there. She is going
home next Tuesday, so she will likely be going back to Virginia this
weekend.
It is not likely that Shelley will be near the Pacific Ocean this weekend.
Shelley is from Virginia, which is on the East Coast of the United States.
The city with the famous fish market is most likely Seattle, which is also on
the West Coast. Shelley is visiting, so she is not from there. She is going
home next Tuesday, so she will likely be going back to Virginia this
weekend.

Full prompt:

I will explain how to make logical inferences:

(1) John wakes to the sound of his children giggling outside the door. They bust in and say "Happy Father's Day!" with a
Answer: We can infer that John is a father because of "his children". We can infer that the current year is 2022. Barack
(2) Sam sees a piano in the subway stations. Laughing, his friends prod him to play, saying "Remember how amazing your r
Answer: Juilliard is a top school for music, so someone who majored in piano at Juilliard would be an expert pianist. "C
(3) Shelley is from Virginia, but is visiting that city with that famous market where they throw the fish. Going home ne
Question: Is it likely that Shelley will be near the Pacific Ocean this weekend?

Answer:
Greedy:
It is not likely that Shelley will be near the Pacific Ocean this weekend.
Shelley is from Virginia, which is on the East Coast of the United States.
The city with the famous fish market is most likely Seattle, which is also on
the West Coast. Shelley is visiting, so she is not from there. She is going
home next Tuesday, so she will likely be going back to Virginia this
weekend.
Jokes
Best of 2, temperature 0.7, otherwise defaults:
This joke is about how annoying mothers can be. The speaker is trying to
visit their family, but their mother has just informed them that their step-
dad's poetry reading is on the same night. So now the speaker has to fly in
on the next day.
This joke is about how annoying it can be to have to deal with family
obligations. The person telling the joke is trying to visit their family on
April 6th, but their mom tells them that their step dad's poetry reading is
that night. So the person has to fly in on April 7th instead.
This joke is about how annoying and inconsiderate step-parents can be.
The punchline is that the person is flying in a day later just to avoid their
step-dad's poetry reading.

Full prompt:

I will explain these jokes:

(1) The problem with kleptomaniacs is that they always take things literally.

Explanation: This joke is wordplay. Someone who "takes things literally" is someone who doesn't fully understand social
But the definition of kleptomania is someone who literally takes things.

(2) Always borrow money from a pessimist. They’ll never expect it back.

Explanation: Most people expect you to pay them back when you borrow money, however a pessimist is someone who always as
(3) I was going to fly to visit my family on April 6th. My mom said "Oh great, your step dad's poetry reading is that ni
Explanation:
Greedy:
This joke is about how annoying and intrusive in-laws can be. The person is
trying to visit their family, but their mother says that their step-dad's
poetry reading is that night. So the person is now flying in the next day to
avoid it.
