As many organizations bundle large language models (LLMs) into their products, they face the problem of rigorous model selection. This talk gives a data-centric understanding of how LLMs are built and evaluated. We will discuss the limitations of current models and pay special attention to the available evaluation protocols. How do we distinguish good models from the rest? Which tasks and datasets should we try or avoid? How do we incorporate feedback from our users? We will present guidelines that attendees can use in their future experiments.
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models
1. Design and Evaluation of Large Language Models
DSC Europe 2023
Dr. Dmitry Ustalov
JetBrains Grazie
2. About Me
Dr. Dmitry Ustalov
Senior Machine Learning Engineer
JetBrains Grazie
Belgrade, Serbia
Research Interests: Natural Language Processing,
Human-in-the-Loop, Evaluation
● https://github.com/dustalov
● https://www.linkedin.com/in/ustalov/
5. “general methods that leverage
computation are ultimately the most
effective, and by a large margin”
Richard S. Sutton (2019)
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
6. Computation seems to always win. Instead of designing a task-specific model, one can get great results with a black-box statistical one.
7. Compute-Intensive NLP
More generic methods outperformed the more specific ones, driven by more annotated data and more compute power.
● Machine Learning Models (early 1990s)
● Word2Vec (early 2010s)
● Transformer (late 2010s)
● Large Language Models (early 2020s)
12. Transformers and LLMs
Originally proposed by Vaswani et al. (2017).
For LLMs, decoder-style models with autoregressive
decoding won due to the scalability of the training process
(GPT, LaMDA, OPT, etc.)
Models varied in the number of parameters
and amount of data used during pre-training
(before InstructGPT took off in 2022).
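The autoregressive decoding mentioned above can be sketched in a few lines: at each step the model predicts the next token given the prefix, and the prediction is appended to the prefix. The dictionary below is a toy stand-in for a Transformer decoder, not a real model.

```python
# Toy illustration of autoregressive (left-to-right) decoding.
# TOY_MODEL maps the last token to the most likely next token,
# standing in for the argmax over a decoder's next-token scores.

TOY_MODEL = {
    "<s>": "the", "the": "cat", "cat": "sat", "sat": "</s>",
}

def greedy_decode(model, max_steps=10):
    tokens = ["<s>"]
    for _ in range(max_steps):
        next_token = model[tokens[-1]]  # greedy choice at each step
        if next_token == "</s>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start-of-sequence marker

print(greedy_decode(TOY_MODEL))  # ['the', 'cat', 'sat']
```

The same loop structure underlies real decoder-style LLMs; only the next-token function (a full Transformer forward pass with sampling or beam search) differs.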
15. Training Process
The devil is in the details:
● Pre-training on a large text corpus
● Supervised fine-tuning (SFT) on an instruction corpus
● Alignment using human or AI preference data
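The alignment step can be made concrete with the Direct Preference Optimization (DPO) objective, one of the methods mentioned later on the timeline. Below is a minimal sketch of the per-pair loss; the log-probability values in the example are hypothetical, not from any real model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    The policy is rewarded for increasing the log-probability margin of the
    chosen response over the rejected one, relative to a frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

# Hypothetical log-probabilities: the policy already prefers the chosen response,
# so the loss falls below log(2), the value at a zero margin.
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0)
```

In practice this loss is averaged over a batch of preference pairs and minimized with gradient descent, with no reward model or reinforcement learning loop required.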
16. Popular Instruction-Tuned LLMs
A non-exhaustive selection
[Timeline figure spanning from ancient times through Q1 '23 to Q4 '23 and beyond: InstructGPT, Alpaca, Vicuna, Dolly, Claude, Llama 2, Mistral Instruct, OpenChat, Zephyr. Legend: Reinforcement Learning from Human Feedback, Direct Preference Optimization, Other w/o Labeling, Only SFT.]
17. InstructGPT Use Cases
The distribution in Ouyang et al. (2022) follows the one in GPT-3 usage logs.
Does your application match them?
Use Case Fraction
Generation 45.6%
Open QA 12.4%
Brainstorming 11.2%
Chat 8.4%
Rewrite 6.6%
Summarization 4.2%
Classification 3.5%
Other 3.5%
Closed QA 2.6%
Extract 1.9%
18. How do we address
the data and evaluation problems?
28. Human Feedback: OpenAI Summarize
A classical annotation approach used in Stiennon et al. (2020): given two responses to the same prompt, assign a single preference score and provide an explanation.
29. AI Feedback: UltraFeedback
Cui et al. (2023) sampled instructions according to pre-defined guidelines to build a preference dataset that resembles how humans would judge the generations.
30. Human Preference Datasets
A non-exhaustive selection
[Chart of dataset sizes in thousands of comparisons, on a scale from 10 to 1000: WebGPT, OpenAI Summarize, Anthropic Helpful, Anthropic Harmless, Stanford SHP, Stack Exchange, UltraFeedback. Legend: AI Generated, Human Annotated, Crawled.]
33. Limitations of Human Feedback
Expertise: can you properly review responses in narrow domains like medicine or physics?
Complexity: how much time do you need to review a source code repository of 100K lines of code?
36. Addressing the Limitations
Automated Verification: can the expert annotator be replaced by a computer program?
Simulations: can the final result be checked in a simulated environment?
End-to-End Evaluation: use human insight only for final decisions.
37. The scale of annotation is 10K+ prompts
for supervised fine-tuning
and 100K+ for preferences.
44. Evaluation of Large Language Models
Multi-Task Benchmarks: static datasets of challenging problems with ground-truth data.
Leaderboards: dashboards featuring model outputs evaluated by humans or machines.
Online Evaluation: user feedback in downstream applications (not covered here).
45. Multi-Task Benchmarks
A non-exhaustive selection
Dataset # of Tasks
Massive Multitask Language Understanding (Hendrycks et al., 2021) 57
EleutherAI Language Model Evaluation Harness (Gao et al., 2021) 200+
Beyond the Imitation Game Benchmark (Srivastava et al., 2023) 200+
AGIEval (Zhong et al., 2023) 20
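At their core, such benchmarks score model outputs against ground truth. A minimal sketch of exact-match accuracy over a tiny, hypothetical static dataset:

```python
# Exact-match scoring against a static benchmark with ground-truth answers.

def normalize(text):
    # Lowercase and collapse whitespace so trivial formatting differences
    # do not count as errors.
    return " ".join(text.lower().strip().split())

def exact_match_accuracy(predictions, references):
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

dataset = [("What is the capital of France?", "Paris"),
           ("2 + 2 = ?", "4")]
predictions = ["paris", "5"]  # hypothetical model outputs
references = [answer for _, answer in dataset]
print(exact_match_accuracy(predictions, references))  # 0.5
```

Real harnesses such as those in the table add per-task prompt templates, few-shot formatting, and task-specific metrics (log-likelihood, F1, etc.), but the contract is the same: fixed inputs, fixed references, an automatic score.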
48. Coping with Evaluation Data Leaks
Proprietary Data: use data that will surely not be part of the training corpus of any LLM.
Latest Data: use data that are not part of any training corpus yet (and refresh your dataset frequently).
56. LLMFAO
Large Language Model Feedback Analysis and Optimization
● Pick a small yet comprehensive set of prompts and produce responses by multiple models.
● Sample pairs of the most dissimilar responses by different models for each prompt.
● Perform pairwise annotation with carefully selected annotators and then obtain the final ranking.
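The pair-sampling step above can be sketched as follows. LLMFAO's actual dissimilarity measure is not specified here, so this uses a simple character-level ratio from Python's `difflib` as a stand-in; the model names in the test data are hypothetical.

```python
import difflib
from itertools import combinations

def most_dissimilar_pairs(responses, top_k=2):
    """For one prompt, pick the pairs of model responses that differ the most.

    responses: dict mapping model name -> response text.
    Returns up to top_k (model, model) pairs, most dissimilar first.
    """
    scored = []
    for (m1, r1), (m2, r2) in combinations(responses.items(), 2):
        similarity = difflib.SequenceMatcher(None, r1, r2).ratio()
        scored.append((similarity, m1, m2))
    scored.sort()  # lowest similarity (most dissimilar) first
    return [(m1, m2) for _, m1, m2 in scored[:top_k]]

pairs = most_dissimilar_pairs({
    "a": "hello world",
    "b": "hello world!",
    "c": "a totally unrelated reply",
})
```

Sampling dissimilar pairs concentrates the annotation budget on comparisons that are actually informative, instead of asking humans to rank near-identical outputs.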
57. Setup
LLMFAO had 13 non-coding prompts with generations by 59 different models from the llmonitor.com dataset, e.g., "Argue for and against the use of kubernetes in the style of a haiku."
2,139 dissimilar pairs were labeled by 124 annotators, resulting in 8,931 judgements. Additionally, GPT-4 and GPT-3.5 Turbo Instruct judgements were obtained.
58. Annotation Interface
Prompt: Tell a joke about going on vacation.
System A: Why don't some people go on vacation? Because they're afraid to leave their troubles behind.
System B: I'm going on vacation, but I'm not sure if I should take my laptop. It's such a pain in the neck to lug it around with me.
User: Which output do you like better? System A / System B / Tie
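Since each pair is shown to several annotators, the repeated judgements ("A", "B", or "Tie") must be collapsed into one outcome per pair before ranking. A simple majority-vote sketch (the record format is hypothetical, not LLMFAO's actual schema):

```python
from collections import Counter

def majority_outcome(judgements):
    """Collapse repeated judgements for one pair into a single outcome.

    An exact split between the top options is treated as a tie.
    """
    counts = Counter(judgements)
    winner, top = counts.most_common(1)[0]
    if list(counts.values()).count(top) > 1:
        return "Tie"
    return winner

print(majority_outcome(["A", "A", "B"]))     # A
print(majority_outcome(["A", "B"]))          # Tie
print(majority_outcome(["Tie", "Tie", "A"])) # Tie
```

Ranking models like Bradley-Terry can also consume the raw judgements directly, weighting each comparison by its count, so majority voting is one of several reasonable aggregation choices.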
62. Rankings and Tools
Pairwise comparisons were transformed into ranked lists of models using the Bradley-Terry (1952) model, which estimates latent item scores.
● Leaderboard: https://dustalov.github.io/llmfao/
● Pair2Rank, aggregation and analysis tool: https://dustalov-pair2rank.hf.space/
● Code, data, and guidelines: https://github.com/dustalov/llmfao
● A more complete description: https://evalovernite.substack.com/p/llmfao-human-ranking
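A minimal sketch of Bradley-Terry score estimation from pairwise win counts, using the classic iterative (minorization-maximization) update. Ties and the exact fitting procedure used in LLMFAO are omitted, and the model names and win counts below are hypothetical.

```python
def bradley_terry(models, wins, iterations=100):
    """wins: dict (winner, loser) -> count. Returns model -> latent score."""
    scores = {m: 1.0 for m in models}
    for _ in range(iterations):
        new_scores = {}
        for m in models:
            w = sum(c for (a, b), c in wins.items() if a == m)  # total wins of m
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                n = wins.get((m, other), 0) + wins.get((other, m), 0)
                if n:
                    denom += n / (scores[m] + scores[other])
            new_scores[m] = w / denom if denom else scores[m]
        # Normalize so the scores stay on a comparable scale across iterations.
        total = sum(new_scores.values())
        scores = {m: s * len(models) / total for m, s in new_scores.items()}
    return scores

wins = {("gpt-4", "llama"): 8, ("llama", "gpt-4"): 2,
        ("gpt-4", "alpaca"): 9, ("alpaca", "gpt-4"): 1,
        ("llama", "alpaca"): 7, ("alpaca", "llama"): 3}
scores = bradley_terry(["gpt-4", "llama", "alpaca"], wins)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['gpt-4', 'llama', 'alpaca']
```

Sorting models by their estimated scores yields the leaderboard; the Pair2Rank tool linked above packages this kind of aggregation behind an interface.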