PAKDD2023_Tutorial_T2 (Overview, Part 1, and Part 2)

[introductory]
A Gentle Introduction to Technologies Behind
Language Models
and Recent Achievement in ChatGPT
2023.05.25 (Thu)
PAKDD 2023 Tutorial 2
Jun Suzuki
Tohoku University
Kyosuke Nishida
NTT
Human Informatics Laboratories
Naoaki Okazaki
Tokyo Institute of Technology
● [Part 1, Part 2] https://www.fai.cds.tohoku.ac.jp/research/activities/
● [Part 3, Part 4] https://speakerdeck.com/kyoun/pakdd2023-tutorial
● [Part 5] https://speakerdeck.com/chokkan/efforts-for-responsible-llms-pakdd-2023-tutorial-2

1
PAKDD-2023 Tutorial 2: A Gentle Introduction to Technologies Behind Language Models and Recent Achievement in ChatGPT / 2023.05.25 (Thu)
Schedule
● Overview [This part]
● Part 1: [introductory] Language models (LMs)
● Part 2: Large language models (LLMs)
● Part 3: Technologies underlying ChatGPT-like LLMs
● Part 4: Recent achievements in ChatGPT-like LLMs
● Part 5: Efforts for Responsible LLMs
● Q&A
10 min
20 min
20 min
20 min
20 min
(Short Break: 10 min)
(Short Break: 10 min)
20 min
20 min

2
Abstract
● Language models (LMs) have a long history in natural language processing
(NLP) research. Their usage was mainly a text generation module (or
calculating likelihood of word sequences) in machine translation and speech
recognition systems, used together with translation or acoustic models. After
the current neural era, LMs take an essential role in the NLP field. In fact, LMs
are integrated into any models/systems to tackle almost all the NLP tasks and
provide state-of-the-art performance on conventional NLP benchmarks. The
usage of LMs is considered to be shifting to more like a world model of
languages or a general-purpose feature generator of any language-related
tasks. More recently, the public sometimes treats LMs like ChatGPT, and its
successor GPT-4, as general-purpose AI after starting an online service in the
public domain.
● This tutorial will first introduce some introductory topics we should know when
discussing the recent advances in LMs like ChatGPT. We will then briefly
introduce the technologies behind ChatGPT-like LMs. Additionally, we also
provide ChatGPT’s social impacts discussed recently in public.

3
Aims of This Tutorial
● This tutorial aims to introduce necessary factual pieces of knowledge for
discussing the latest LMs to researchers outside of the NLP field.
● In addition, another goal is for audiences to understand the strengths and
weaknesses of LMs from the viewpoint of LM users, and help their future
studies by learning the knowledge from this tutorial.
● Our audiences require no prior knowledge of LMs.

4
Brief Overview
● This tutorial will begin with some introductory topics that we
should know when discussing recent advances in LMs like
ChatGPT.
[Part 1, Part 2]
● https://www.fai.cds.tohoku.ac.jp/research/activities/
● We will then briefly introduce the technologies behind
ChatGPT-like LMs and their achievements.
[Part 3, Part 4]
● https://speakerdeck.com/kyoun/pakdd2023-tutorial
● Finally, we will also provide a topic of ChatGPT-like LMs,
Responsible LLMs, discussed recently in general public.
[Part 5]
● https://speakerdeck.com/chokkan/efforts-for-responsible-llms-pakdd-2023-tutorial-2

5
Contents (1/5)
● Part 1: [introductory] Neural Language models (LMs)
◆ Traditional Definition of LMs
◆ Typical Base Model Architecture: Transformer
◆ Three Major Model Types: Encoder, Decoder, Encoder-decoder
◆ Universal Features
◆ Pre-training & Fine-tuning Scheme
◆ Multi-task Learner

6
Contents (2/5)
● Part 2: Large Language Models (LLMs)
◆ LMs’ Scaling Laws (Parameter size ver.)
◆ LLMs: GPT-3
◆ Prompt Engineering
◆ Achievement and Remaining Issues

7
Contents (3/5)
● Part 3: Technologies underlying ChatGPT-like LLMs
◆ Codex
◆ InstructGPT and ChatGPT
◆ GPT-4 and ChatGPT plugins
◆ LLaMA
◆ Alpaca and Vicuna
◆ MPT
◆ LLaVA

8
Contents (4/5)
● Part 4: Recent achievements in ChatGPT-like LLMs
◆ Performance in Natural Language Processing benchmarks and exams
◆ Interesting results in Vision-and-Language Understanding evaluations
◆ Open-source applications powered by LLMs
◆ Remaining Issues

9
Contents (5/5)
● Part 5: Efforts for Responsible LLMs
◆ High-level overview of potential harms of LLMs
◆ Efforts for reducing potential harms in GPT-4 and PaLM 2

[introductory]
Language Models
2023.05.25 (Thu)
Jun Suzuki
Tohoku University
Kyosuke Nishida
NTT
Naoaki Okazaki
PART 1

1
Part 1: [introductory]
Neural Language models (LMs)
We only discuss neural LMs
=> In this talk, “LM” means “neural LM”
(unless otherwise specified)

2
Selected LMs Discussed in This Talk
l aa
① Model nick name
② Date of first published or
announced in public
③ URL for the announcement
e.g., paper or blog
① ELMo ② 2018.02
③ arxiv.org/abs/1802.05365
① GPT-1 ② 2018.06
③ openai.com/research/language-
unsupervised
① GPT-2 ② 2019.02
③ openai.com/blog/better-
language-models/
① GPT-3 ② 2020.05
* 1 https://www.microsoft.com/en-
us/research/blog/using-deepspeed-
and-megatron-to-train-megatron-
turing-nlg-530b-the-worlds-largest-
and-most-powerful-generative-
language-model/
① BLOOM ② 2022.01
③ *2
① Chinchilla ② 2022.03
① PaLM ② 2022.04
③ storage.googleapis.com/pathways-
language-model/PaLM-paper.pdff
*2
https://github.com/bigscience-
workshop/bigscience/tree/mas
ter/train/tr11-176B-ml
① RoBERTa ② 2019.07
① T5 ② 2019.10
① Megatron-turing NLG
② 2021.10 ③ *1
① PaLM2 ② 2023.05
③ https://ai.google/static/documents/palm2techreport.pdf
① GPT-4 ② 2023.03
③cdn.openai.com/papers/gpt-4.pdf
2018 2019 2020 2021 2022 2023 2024
① FLAN ② 2021.09
① BERT ② 2018.10
① T0 ② 2021.10
① LLaMA ② 2023.02
③research.facebook.com/publications/llama-
open-and-efficient-foundation-language-
models/
① OPT ② 2022.05

3
l aa
① ELMo ② 2018.02
① GPT-1 ② 2018.06
unsupervised
① GPT-2 ② 2019.02
language-models/
① GPT-3 ② 2020.05
language-model/
① BLOOM ② 2022.01
③ *2
① PaLM ② 2022.04
*2
① RoBERTa ② 2019.07
① T5 ② 2019.10
② 2021.10 ③ *1
① PaLM2 ② 2023.05
① GPT-4 ② 2023.03
2018 2019 2020 2021 2022 2023 2024
① FLAN ② 2021.09
① BERT ② 2018.10
① T0 ② 2021.10
① LLaMA ② 2023.02
models/
① OPT ② 2022.05
① Model nick name
announced in public
e.g., paper or blog

4
Contents of Part 1
l Part 1: [introductory] Neural Language Models (LMs)
u ① Traditional Definition of LMs
u ② Typical Base Model Architecture: Transformer
u ③ Three Major Model Types: Encoder, Decoder, Encoder-decoder
u ④ Universal Features
u ⑤ Pre-training & Fine-tuning Scheme
u ⑥ Multi-task Learner

5
① Traditional Definition of Language Model (LM)
Language Model = Probabilistic model
u Modeling probabilities for all next words (probability distribution)
in vocabulary, given a context
： Context, prefix text, before j-th word
=
J+1
Y
j=1
P✓(yj | Y<j)
✓(Y ) =
J+1
Y
j=1
P✓(yj | Y<j)
： Target word, j-th word
ocean is ̲̲
??
Y ) =
J+1
Y
j=1
P✓(yj | Y<j)
<latexit sha1_base64="vBx2H+BCUKXtn7gWNGtLQQoCzKw=">AAAC2nichVG/axRBFP6y0SQmMTljI9gshkgkcLwNiCIGDm3U6pJ4uUg2Lrt7k2Qu+4vduYNz2cZOLRUsrBQsxN4uVRr/gRQBOytRqwg2Fr7dPRATTd4yO9/75n1vvuE5kScTRbQ/oA2eOj00PHJmdGz87MRk5dzUShJ2Ylc03NAL41XHToQnA9FQUnliNYqF7TueaDrbt/PzZlfEiQyD+6oXiXXf3gzkhnRtxZRVadat1HT81FRbQtlZNpsnD7Ir+oJuRnHYepjemzMyK20vGJn+r9peZrV105ctvVRa6c02y63KNFWpCP0oMPpgunb3+afacHO8HlY+wEQLIVx04EMggGLswUbC3xoMECLm1pEyFzOSxblAhlHWdrhKcIXN7Db/Nzlb67MB53nPpFC7fIvHK2aljhnao3d0QB/pPX2hX//tlRY9ci893p1SKyJr8umF5Z8nqnzeFbb+qI71rLCB64VXyd6jgslf4Zb67qOXB8s3lmbSy/SGvrL/17RPu/yCoPvDfbsoll4d48dhLxmPxzg8jKNgZb5qXK3SIs/pFsoYwUVcwixP4xpquIM6Gtx9B5/xDd81U3usPdGelaXaQF9zHn+F9uI3/lK11A==</latexit>
P✓(Y ) =
J+1
Y
j=1
P✓(yj | Y<j)
the
,
dark
bleu
white
today
fine
excellent
good
scary
…
Vocabulary Probability
…
P✓(Y ) =
J+1
Y
j=1
P✓(yj | Y<j)
Conditional Probability
??
dark ocean is ̲̲
Y ) =
J+1
Y
j=1
P✓(yj | Y<j)
P✓(Y ) =
J+1
Y
j=1
P✓(yj | Y<j)

6
Example of Traditional LMs: 𝑛-gram LM
l Example: Google 𝑛-gram
https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
serve as the incoming 92
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
serve as the industry 607
serve as the info 42
serve as the informal 102
serve as the information 838
serve as the informational 41
serve as the initial 5331
serve as the initiating 125
serve as the initiation 63
serve as the initiator 81
serve as the …………
Counts of consecutive 𝑛 words
P✓(Y ) =
J+1
Y
j=1
P✓(yj | Y<j)
P✓(Y ) =
J+1
Y
j=1
P✓(yj | Y<j)
Context Target word Count Probability
Total 187491
0.00049
0.00423
0.00119
0.00038
0.00064
0.00024
0.00059
0.00021
0.00125
0.00028
0.00324
0.00022
0.00054
0.00447
0.00022
0.02843
0.00067
0.00034
0.00043

7
Neural LM
l Fit probability distribution over a sequence of words
given a context with a (deep) neural network
...
serve
serve
as the
as
input
the
Output
Input
1 2 3 4 5
Time step
Deep
Neural
Network
Context
… serve as the input …
Input text (training data) Correct
Answer
Vector
Current
estimated
probability
distribution
incoming 0.03 0
independent 0.12 0
index 0.04 0
indication 0.01 0
...
input 0.11 1
Initial 0.21 0
... ... ...
Vocabulary
Parameter
update
direction
Target
word
...

8
Typical usages of LMs (1/2)
l If we have an LM, we can ... ??
1, Evaluate likelihood of given texts
Example: Statistical machine translation (SMT) system
HI YWUY1 KKIUH WKUYN LO WUNI
Input text
Translation
Model
Language Model
(LM)
(^ ^) ♪( ´▽｀) ##!
Output text
&’# ()#’(“
OO)” LOW)
(^ ^) ♪( ´▽｀)
##! ##!
OOooOO
(^~^)
21.5 x
145.2 x
3.2 x
111.2 x
...
72.4 x
9.2 x
...
Translation
score
21.5
35.2
21.2
11.2
...
32.4
119.2
...
LM
score
(intermediate)
Candidates

9
2, Generate texts
l Estimate next words one-by-one
auto-regressively
l For greedy search, compute
P✓(Y ) =
J+1
Y
j=1
P✓(yj | Y<j)
argmax
P✓(Y ) =
J+1
Y
j=1
P✓(yj | Y<j)

10
2, Generate texts
l Estimate next words one-by-one
auto-regressively
l For greedy search, compute
Estimation: 𝑒
We have never met before , right ? Nice to meet you !
Neural LM (e.g., Generative pre-trained transformer: GPT)
A
this
that
…
meet
have
you
…
Nice
…
to
…
too
,
.
A
this
that
…
meet
have
you
…
Nice
…
to
…
too
,
.
A
this
that
…
meet
have
you
…
Nice
…
to
…
too
,
.
A
this
that
…
meet
have
you
…
Nice
…
to
…
too
,
.
Example: Generating text
using neural LM Nice to
Nice to
meet
meet
you
you
, too ...
, ...
Vocabulary Vocabulary Vocabulary Vocabulary
P✓(Y ) =
J+1
Y
j=1
P✓(yj | Y<j)
argmax
P✓(Y ) =
J+1
Y
j=1
P✓(yj | Y<j)

11
l Rough comparisons between general abilities of 𝑛-gram
LM and Neural LM
Why using Neural LM
Computational
cost
Longer context
Unseen context
𝑛-gram LM Neural LM
😩 😋
😩
😋
😩 😋
Performance
(in terms of perplexity) 😩 😋

12
...
serve
serve
as the
as
input
the
Output
Input
1 2 3 4 5
Time step
Deep
Neural
Network
Context
… serve as the input …
Input text (training data) Correct
Answer
Vector
Current
estimated
probability
distribution
incoming 0.03 0
independent 0.12 0
index 0.04 0
indication 0.01 0
...
input 0.11 1
Initial 0.21 0
... ... ...
Vocabulary
Parameter
update
direction
Target
word
...
② Typical Base Model Architecture
Transformer
u Surprisingly, all the famous LMs developed recently have chosen
Transformers as their base model

13
Transformer (1/2)
l 2017: Published First draft paper
u Developed primarily for machine translation tasks
From: Ashish Vaswani [view email]
[v1] Mon, 12 Jun 2017 17:57:34 UTC
[v2] Mon, 19 Jun 2017 16:49:45 UTC
[v3] Tue, 20 Jun 2017 05:20:02 UTC
[v4] Fri, 30 Jun 2017 17:29:30 UTC
[v5] Wed, 6 Dec 2017 03:30:32 UTC

14
Transformer (2/2)
l 2017: Published First draft paper
l 2018: Selected as base model architecture of
BERT (and GPT)
u => fundamental model for language
l 2023 [Current]: used as base model architecture for
(almost) all famous LMs
l e.g., GPT-3 (ChatGPT), GPT-4, PaLM (PaLM2), OPT, LLaMa, Bloom, ...
(GPT: Generative Pre-trained Transformer)
(BERT: Bidirectional Encoder Representations from Transformers)

15
[FYI] Transformer for Image Processing
l Used for image processing tasks
u 2020: Vision Transformer (ViT)
u Competitive to the conventional CNNs
Split given image
into parts according
to the mesh
https://openreview.net/forum?id=YicbFdNTTy

16
③ Three Major Model Types
l Encoder type (Masked LMs)
l e.g., BERT, RoBERTa
l Bidirectional
l Decoder type (Causal LMs)
l e.g., GPT
l Unidirectional (left-to-right)
l Encoder-decoder type
l e.g., T5
l Encoder: bidirectional,
Decoder: unidirectional
Not discuss deeply about Encoder type in this talk
Reason: This type becomes less important in these days,
in terms of the context of “Generative AI”

17
④ LMs as Universal Features
l Typical usages of LMs
=> If we have an LM, we can ... ??
u 1, Evaluate likelihood
of given texts
u 2, Generate texts
u 3, Use LMs as universal features
l Novel usage of LMs
l Ability of neural networks has enabled it
l All LMs explicitly/implicitly take this approach
Identical to
traditional usages
like n-gram LMs
Pioneer of
this usage

18
LMs as Universal Features
l Neural LM => Implicitly encodes/captures many
linguistic aspects such as
l Such learned linguistic aspects can help a lot for many
NLP tasks
LM
Training
l Distribution of word occurrences given contexts
l Semantically similar/dissimilar expressions
l Syntactic/semantic structural information

19
Large-scale dataset
from the Web
e.g., news articles,
wikipedia, arXiv,
GitHub
⑤ Pre-training / Fine-tuning Scheme
l Two stage training scheme of LMs
u 1, Pre-training
l Train LMs from extremely many raw texts
obtained from the web
u 2, Fine-tuning
l Train LMs from human annotated data
u Human annotated data: relatively small
Pre-Training
Language
Model
Human-annotated data
Fine-tuning
First step
Second step

20
Viewpoint from Data Sparsity Problem
l Pre-training / Fine-tuning scheme
u Variant of Transfer learning
u Requires less human annotated data to achieve reasonable task
performance
l Free from having to prepare a large amount of human
annotation data for every NLP task
u Relatively expensive and time-consuming to increase human
annotation data
l May solves (or alleviates) an essential problem (called
data sparsity problem) in the NLP community that has
remained unsolved for a long time

21
⑥ LMs as multitask Learner
l Tackle to solve any NLP tasks by a single model
u Possible by casting in a unified ”text-to-text” generation task
Copied from
https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
Example: T5

22
LMs as Multitask Learner
+ Pre-training / Fine-tuning Scheme
l Single LM can solve many NLP tasks
Large-scale dataset
from the Web
e.g., news articles,
wikipedia, arXiv, GitHub
Fact checking
○○ is good
is bad
Sentiment Analysis
Machine Translation
Hello! こんにちは
本日のニュース
・株価 ○○
・通常国会、、
・東京都、、
Text summarization
Question Answering
Language
Model
MT data QA data SA data
”text-to-text” format
Pre-Training
Fine-tuning
First step
Second step
T5 (and partially GPT-2): pioneers of this approach
=> First trial toward artificial general intelligence in LM literature

23
l Summary/Take Home Message: Part 1

24
Summary/Take Home Message: Part1 (1/2)
l ① Language Model: Probabilistic Model
u Modeling probabilities for all possible next words (probability
distribution) in vocabulary, given a context
l ② Base Model Architecture: Transformer
u Developed mainly for neural machine translation (NMT)
l ③ Model Types
u Encoder-type: (discard discussions of this type in this talk)
u Decoder-type:
u Encoder-decoder-type:

25
Summary/Take Home Message: Part1 (2/2)
l ④ Universal Features
u Neural LM: Implicitly encodes/captures many linguistic aspects
l E.g., ELMo, BERT, GPT-2, RoBERTa, ...
l ⑤ Pre-training & Fine-tuning Scheme
u Valiant of transfer learning (from Pre-trained LMs)
u Free from having to prepare a large amount of human annotation
data for every NLP task
u Alleviate data sparsity problem (long-time problem) in NLP
l ⑥ Multi-task Learner (towards artificial general
intelligence)
u Tackle any NLP tasks in a single model
u Toward artificial general intelligence

[introductory]
Language Models
2023.05.25 (Thu)
Jun Suzuki
Tohoku University
Kyosuke Nishida
NTT
Naoaki Okazaki
PART 2

1
Part 2: Large language models (LLMs)

3
l aa
① Model nick name
announced in public
e.g., paper or blog
① ELMo ② 2018.02
① GPT-1 ② 2018.06
unsupervised
① GPT-2 ② 2019.02
language-models/
① GPT-3 ② 2020.05
language-model/
① BLOOM ② 2022.01
③ *2
① PaLM ② 2022.04
*2
① RoBERTa ② 2019.07
① T5 ② 2019.10
② 2021.10 ③ *1
① PaLM2 ② 2023.05
① GPT-4 ② 2023.03
2018 2019 2020 2021 2022 2023 2024
① FLAN ② 2021.09
① BERT ② 2018.10
① T0 ② 2021.10
① LLaMA ② 2023.02
models/
① OPT ② 2022.05

4
Contents of Part 2
l Part 2: Large Language Models (LLMs)
u ① LMsʼ Scaling Laws (Parameter size ver.)
u ② LLMs: GPT-3
u ③ Prompt Engineering
u ④ Achievement: LLMs and Prompts
u ⑤ Remaining Issues: LLMs and Prompts

5
① LMsʼ Scaling Laws (Parameter size)
l General tendency:
More parameters => Better performance !?
[Kaplan+, 2020] Scaling Laws for Neural Language Models, arXiv:2001.08361
N : number of parameters (w/o embedding vectors) [logarithmic scale]
Sm
aller is
better

6
LMsʼ Scaling Laws: Parameter Size
PaLM 2
GPT-4 (?)
LLaMA
OPT
PaLM
Chinchilla
BLOOM
Megatron Turing-
NLG
GPT-3
T5
RoBERTa
GPT-2
BERT
GPT-1
ELMO
0.1
1
10
100
1000
01/2018
04/2018
07/2018
10/2018
01/2019
04/2019
07/2019
10/2019
01/2020
04/2020
07/2020
10/2020
01/2021
04/2021
07/2021
10/2021
01/2022
04/2022
07/2022
10/2022
01/2023
04/2023
07/2023
10/2023
Model
Size
(in
billions
of
parameters)
[logarithmic
scale
Roughly x10 per year

7
② LLMs: GPT-3
l Introducing several new concepts
u 1, Larger model: First >100B parameter LMs
u 2, Potential on prompt engineering (in-context learning)
l Few-shot learning
l Instruction
https://arxiv.org/abs/2005.14165
PaLM 2
GPT-4 (?)
LLaMA
OPT
PaLM
Chinchilla
BLOOM
Megatron Turing-
NLG
GPT-3
T5
RoBERTa
GPT-2
BERT
GPT-1
ELMO
0.1
1
10
100
1000
01/2018
04/2018
07/2018
10/2018
01/2019
04/2019
07/2019
10/2019
01/2020
04/2020
07/2020
10/2020
01/2021
04/2021
07/2021
10/2021
01/2022
04/2022
07/2022
10/2022
01/2023
04/2023
07/2023
10/2023
Model
Size
(in
billions
of
parameters)
[logarithmic
scale
Large jump-up in model size
(note: y-axis is log-scale)

8
GPT-3: Impact of using Large Neural Models
l By scaling up the model parameters,
LMs enable to gain the ... :
u 1, Capacity to memorize much of the training data in its entirety
u 2, Ability to produce very fluent sentences, like a human writer
u 3, Solvable patterns for simple calculations, reasoning,
argumentative thinking, common sense, etc., at the elementary
level
( Some might say, "gain the ability to solve/think simple
calculations ..." but not in this talk )

9
Generate Just Like Human Writers (?)
l Trial to generate fake news
Figure 3.14: The GPT-3 generated news article that humans had the greatest difficulty
distinguishing from a human written article (accuracy: 12%).
Title: United Methodists Agree to Historic Split
Subtitle: Those who oppose gay marriage will form their own denomination
Article: After two days of intense debate, the United Methodist Church has agreed to a historic split
- one that is expected to end in the creation of a new denomination, one that will be "theologically
and socially conservative," according to The Washington Post. The majority of delegates attending
the church's annual General Conference in May voted to strengthen a ban on the ordination of
LGBTQ clergy and to write new rules that will "discipline" clergy who officiate at same-sex weddings.
But those who opposed these measures have a new plan: They say they will form a separate
denomination by 2020, calling their church the Christian Methodist denomination.
The Post notes that the denomination, which claims 12.5 million members, was in the early 20th
century the "largest Protestant denomination in the U.S.,” but that it has been shrinking in recent
decades. The new split will be the second in the church's history. The first occurred in 1968, when
roughly 10 percent of the denomination left to form the Evangelical United Brethren Church. The
Post notes that the proposed split "comes at a critical time for the church, which has been losing
members for years," which has been "pushed toward the brink of a schism over the role of LGBTQ
people in the church." Gay marriage is not the only issue that has divided the church. In 2016, the
denomination was split over ordination of transgender clergy, with the North Pacific regional
conference voting to ban them from serving as clergy, and the South Pacific regional conference
voting to allow them.
Given Title, Subtitle and prefix
(prompt) word “Article:”

10
Generate texts just like human writers?
l Several reports that attest to GPT-3's ability to
generate fluent sentences that can potentially fool us
https://www.technologyreview.com/2020/08/14/1006
780/ai-gpt-3-fake-blog-reached-top-of-hacker-news/
l A student who studies computer science at the UC
Berkeley, develops a fake blog site using GPT-3.
l One of the fake blog posts reaches #1 on Hacker News.
https://metastable.org/gpt-3.html
l Someone posted on a Reddit message
board for nearly a week using GPT-3.
l The posting by GPT-3 was eventually
stopped by the developer, but many
readers were unaware that the posts were
AI-generated.

11
Effectiveness of LLMs
l Task performance on several NLP tasks https://arxiv.org/abs/2005.14165

12
③ Prompt Engineering
l Model size becomes larger
=> fine-tuning cost also becomes larger
l Consider reducing the computational cost
=> Method without additional training from Pre-trained
LLMs
l Alternative approach of pre-training / fine-tuning
scheme

13
What is Prompt?
l Starting text given to LMs to generate subsequent text.
Example: standard text completion case
Sensoji is Language Model the oldest temple in Tokyo ...
Prompt
Equivalent to the “context”
(prefix text) in LMs
Why ? In GPT-2 & GPT-3 papers
Example: Machine translation experiment
Context format
English sentence = Japanese sentence
Hello. = こんにちは。
It is snowing. = 雪が降っている
Today it is sunny. = 今日は晴天です。
Yesterday it rained. =
Prompt
Starting
text
Confusing, but currently,
we usually call the entire
starting text (prefix text) as
“prompt“, not the last
sentence, since we set
several different types of
texts in the prompt.
This looks like “prompt” of
shell in terminal ??

14
More Information in Context
l Conventional LMs
u Generate sentences following context words
=> LMsʼ basic task: text completion
l LM + Prompt
u Generate sentences by giving a sort of instruction sentences on
how we expect LMs to solve a target task as context words
l e.g., 1, Examples (exemplars),
2, (Human) Instruction for explaining the target task
u Elicit the implicit capabilities of the LLM to solve the given task,
without changing (no-learn) the model parameters
l Prerequisite: already learned through pre-training
( => can't resolve the task that hasnʼt been learned in general )
u Use context to "control" generation

15
Example of Prompt: Few-shot (1/3)
l Equivalent to the “context” (prefix text)
u But, not just a standard prefix text, but more for ordering or
instruction for what we want LMs to generate a sentence
Estimation: 𝑒 Sensoji Temple ” <EOS>
Question: 𝑞
What is the oldest temple in Tokyo ? The answer is “ Sensoji Temple ”
instruction
Example: Question answering
Generative Pre-trained Transformer (GPT)
Q: What is the oldest temple in Tokyo? A: Sensoji / Sensoji Temple

16
Example of Prompt: Few-shot (2/3)
l Setting where the model is given a few demonstrations
(examples/exemplars) of the task at inference time
u No weight updates
https://arxiv.org/abs/2005.14165

17
Effectiveness of Few-show Learning
l Task performance on several NLP tasks https://arxiv.org/abs/2005.14165

18
Example of Prompt: Chain-of-thought (3/3)
l Scaling up model size alone has not proved sufficient
high performance on challenging tasks
u Arithmetic, Commonsense, Symbolic reasoning
Chain-of-thought (CoT) prompting improves performance
Example
(One-shot)
Question

19
Example of Prompt: Chain-of-thought (3/3)
l Examples of CoT prompting for challenging tasks

20
Effectiveness of CoT Prompting
l Performance of CoT prompting for challenging tasks

21
④ Achievement and Remaining Issues
l Achievement
u LLMs
l Very fluent like human-made sentences
u Prompt engineering
l Solve many language tasks without
1, parameter update
2, preparing the task specific supervised data

22
Remaining Issues
l Prompt: not consistently effective
u Limited availability, dependent on pre-training data
l Effective prompt: somewhat artificial (or non-intuitive to humans)
u LM tasks: basically text-completion tasks
l Sometimes generates
unreliable (not based on facts),
biased (discriminatory or socially criticized),
inconsistent (opposed responses in the same context), or
harmful (information that should not publicly shared) texts
Instruction tuning and chat tuning (Part 3)

23
l Summary/Take Home Message: Part 2

24
Summary/Take Home Message: Part 2 (1/2)
l ① LMsʼ Scaling laws
u Larger is better ?
l ② LLMs: GPT-3
u Very fluent as human writer
l ③ Prompt engineering (in-context learning)
u Background: discard fine-tuning cost
l Unnecessary of the additional cost for train large model
u Few-shot examples, chain-of-thought, instruction, ...

25
Summary/Take Home Message: Part 2 (1/2)
l ④ Achievement and Remaining Issues: LLMs and
Prompts
u Achievement
l Very fluent as human writer
l Solve many language tasks without preparing the task specific
supervised data
u Remaining Issues
l Prompts can be somewhat artificial
l How we prevent to generate unreliable, biased, inconsistent, or
harmful texts

PAKDD2023_Tutorial_T2 (Overview, Part 1, and Part 2)

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to PAKDD2023_Tutorial_T2 (Overview, Part 1, and Part 2)

Similar to PAKDD2023_Tutorial_T2 (Overview, Part 1, and Part 2) (20)

Recently uploaded

Recently uploaded (20)

PAKDD2023_Tutorial_T2 (Overview, Part 1, and Part 2)