As many organizations bundle large language models (LLMs) into their products, they face the problem of rigorous model selection. This talk gives a data-centric understanding of how LLMs are built and evaluated. We will discuss the limitations of current models and pay special attention to the available evaluation protocols. How do we distinguish good models from the rest? Which tasks and datasets should we try or avoid? How do we incorporate feedback from our users? We will present guidelines that attendees can use in their future experiments.
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models
1. Design and Evaluation of Large Language Models
DSC Europe 2023
Dr. Dmitry Ustalov
JetBrains Grazie
2. About Me
Dr. Dmitry Ustalov
Senior Machine Learning Engineer
JetBrains Grazie
Belgrade, Serbia
Research Interests: Natural Language Processing,
Human-in-the-Loop, Evaluation
● https://github.com/dustalov
● https://www.linkedin.com/in/ustalov/
5. “general methods that leverage
computation are ultimately the most
effective, and by a large margin”
Richard S. Sutton (2019)
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
6. Computation seems to always win. Instead of designing a task-specific model, one can get great results with a black-box statistical one.
7. Compute-Intensive NLP
More generic methods outperformed the more specific ones, driven by more annotated data and more compute power.
● Machine Learning Models (early 1990s)
● Word2Vec (early 2010s)
● Transformer (late 2010s)
● Large Language Models (early 2020s)
12. Transformers and LLMs
Originally proposed by Vaswani et al. (2017).
For LLMs, decoder-style models with autoregressive
decoding won due to the scalability of the training process
(GPT, LaMDA, OPT, etc.)
Models varied in the number of parameters
and amount of data used during pre-training
(before InstructGPT took off in 2022).
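The autoregressive decoding mentioned above can be sketched in a few lines: at each step the model predicts the next token given the prefix, and the prediction is appended to the prefix. The dictionary below is a toy stand-in for a Transformer decoder, not a real model.

```python
# Toy illustration of autoregressive (left-to-right) decoding.
# TOY_MODEL maps the last token to the most likely next token,
# standing in for the argmax over a decoder's next-token scores.

TOY_MODEL = {
    "<s>": "the", "the": "cat", "cat": "sat", "sat": "</s>",
}

def greedy_decode(model, max_steps=10):
    tokens = ["<s>"]
    for _ in range(max_steps):
        next_token = model[tokens[-1]]  # greedy choice at each step
        if next_token == "</s>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start-of-sequence marker

print(greedy_decode(TOY_MODEL))  # ['the', 'cat', 'sat']
```

The same loop structure underlies real decoder-style LLMs; only the next-token function (a full Transformer forward pass with sampling or beam search) differs.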
15. Training Process
The devil is in the details:
● Pre-training on a large text corpus
● Supervised fine-tuning (SFT) on an instruction corpus
● Alignment using human or AI preference data
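The alignment step can be made concrete with the Direct Preference Optimization (DPO) objective, one of the methods mentioned later on the timeline. Below is a minimal sketch of the per-pair loss; the log-probability values in the example are hypothetical, not from any real model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    The policy is rewarded for increasing the log-probability margin of the
    chosen response over the rejected one, relative to a frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

# Hypothetical log-probabilities: the policy already prefers the chosen response,
# so the loss falls below log(2), the value at a zero margin.
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0)
```

In practice this loss is averaged over a batch of preference pairs and minimized with gradient descent, with no reward model or reinforcement learning loop required.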
16. Popular Instruction-Tuned LLMs
A non-exhaustive selection
[Timeline figure spanning from ancient times through Q1 '23 to Q4 '23 and beyond: InstructGPT, Alpaca, Vicuna, Dolly, Claude, Llama 2, Mistral Instruct, OpenChat, Zephyr. Legend: Reinforcement Learning from Human Feedback, Direct Preference Optimization, Other w/o Labeling, Only SFT.]
17. InstructGPT Use Cases
The distribution in Ouyang et al. (2022) follows the one in GPT-3 usage logs.
Does your application match them?
Use Case Fraction
Generation 45.6%
Open QA 12.4%
Brainstorming 11.2%
Chat 8.4%
Rewrite 6.6%
Summarization 4.2%
Classification 3.5%
Other 3.5%
Closed QA 2.6%
Extract 1.9%
18. How do we address
the data and evaluation problems?
28. Human Feedback: OpenAI Summarize
A classical annotation approach used in Stiennon et al. (2020): given two responses to the same prompt, assign a single preference score and provide an explanation.
29. AI Feedback: UltraFeedback
Cui et al. (2023) sampled instructions according to pre-defined guidelines to build a preference dataset that resembles how humans would judge the generations.
30. Human Preference Datasets
A non-exhaustive selection
[Chart of dataset sizes in thousands of comparisons, on a scale from 10 to 1000: WebGPT, OpenAI Summarize, Anthropic Helpful, Anthropic Harmless, Stanford SHP, Stack Exchange, UltraFeedback. Legend: AI Generated, Human Annotated, Crawled.]
33. Limitations of Human Feedback
Expertise: can you properly review responses in narrow domains like medicine or physics?
Complexity: how much time do you need to review a source code repository of 100K lines of code?
36. Addressing the Limitations
Automated Verification: can the expert annotator be replaced by a computer program?
Simulations: can the final result be checked in a simulated environment?
End-to-End Evaluation: use human insight only for final decisions.
37. The scale of annotation is 10K+ prompts
for supervised fine-tuning
and 100K+ for preferences.
44. Evaluation of Large Language Models
Multi-Task Benchmarks: static datasets of challenging problems with ground-truth data.
Leaderboards: dashboards featuring model outputs evaluated by humans or machines.
Online Evaluation: user feedback in downstream applications (not covered here).
45. Multi-Task Benchmarks
A non-exhaustive selection
Dataset # of Tasks
Massive Multitask Language Understanding (Hendrycks et al., 2021) 57
EleutherAI Language Model Evaluation Harness (Gao et al., 2021) 200+
Beyond the Imitation Game Benchmark (Srivastava et al., 2023) 200+
AGIEval (Zhong et al., 2023) 20
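At their core, such benchmarks score model outputs against ground truth. A minimal sketch of exact-match accuracy over a tiny, hypothetical static dataset:

```python
# Exact-match scoring against a static benchmark with ground-truth answers.

def normalize(text):
    # Lowercase and collapse whitespace so trivial formatting differences
    # do not count as errors.
    return " ".join(text.lower().strip().split())

def exact_match_accuracy(predictions, references):
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

dataset = [("What is the capital of France?", "Paris"),
           ("2 + 2 = ?", "4")]
predictions = ["paris", "5"]  # hypothetical model outputs
references = [answer for _, answer in dataset]
print(exact_match_accuracy(predictions, references))  # 0.5
```

Real harnesses such as those in the table add per-task prompt templates, few-shot formatting, and task-specific metrics (log-likelihood, F1, etc.), but the contract is the same: fixed inputs, fixed references, an automatic score.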
48. Coping with Evaluation Data Leaks
Proprietary Data: use data that will surely not be part of the training corpus of any LLM.
Latest Data: use data that are not part of any training corpus yet (and refresh your dataset frequently).
56. LLMFAO
Large Language Model Feedback Analysis and Optimization
● Pick a small yet comprehensive set of prompts and produce responses by multiple models.
● Sample pairs of the most dissimilar responses by different models for each prompt.
● Perform pairwise annotation with carefully selected annotators and then obtain the final ranking.
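The pair-sampling step above can be sketched as follows. LLMFAO's actual dissimilarity measure is not specified here, so this uses a simple character-level ratio from Python's `difflib` as a stand-in; the model names in the test data are hypothetical.

```python
import difflib
from itertools import combinations

def most_dissimilar_pairs(responses, top_k=2):
    """For one prompt, pick the pairs of model responses that differ the most.

    responses: dict mapping model name -> response text.
    Returns up to top_k (model, model) pairs, most dissimilar first.
    """
    scored = []
    for (m1, r1), (m2, r2) in combinations(responses.items(), 2):
        similarity = difflib.SequenceMatcher(None, r1, r2).ratio()
        scored.append((similarity, m1, m2))
    scored.sort()  # lowest similarity (most dissimilar) first
    return [(m1, m2) for _, m1, m2 in scored[:top_k]]

pairs = most_dissimilar_pairs({
    "a": "hello world",
    "b": "hello world!",
    "c": "a totally unrelated reply",
})
```

Sampling dissimilar pairs concentrates the annotation budget on comparisons that are actually informative, instead of asking humans to rank near-identical outputs.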
57. Setup
LLMFAO had 13 non-coding prompts with generations by 59 different models from the llmonitor.com dataset, e.g., "Argue for and against the use of kubernetes in the style of a haiku."
2,139 dissimilar pairs were labeled by 124 annotators, resulting in 8,931 judgements. Additionally, GPT-4 and GPT-3.5 Turbo Instruct judgements were obtained.
58. Annotation Interface
Prompt: Tell a joke about going on vacation.
System A: Why don't some people go on vacation? Because they're afraid to leave their troubles behind.
System B: I'm going on vacation, but I'm not sure if I should take my laptop. It's such a pain in the neck to lug it around with me.
User: Which output do you like better? System A / System B / Tie
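Since each pair is shown to several annotators, the repeated judgements ("A", "B", or "Tie") must be collapsed into one outcome per pair before ranking. A simple majority-vote sketch (the record format is hypothetical, not LLMFAO's actual schema):

```python
from collections import Counter

def majority_outcome(judgements):
    """Collapse repeated judgements for one pair into a single outcome.

    An exact split between the top options is treated as a tie.
    """
    counts = Counter(judgements)
    winner, top = counts.most_common(1)[0]
    if list(counts.values()).count(top) > 1:
        return "Tie"
    return winner

print(majority_outcome(["A", "A", "B"]))     # A
print(majority_outcome(["A", "B"]))          # Tie
print(majority_outcome(["Tie", "Tie", "A"])) # Tie
```

Ranking models like Bradley-Terry can also consume the raw judgements directly, weighting each comparison by its count, so majority voting is one of several reasonable aggregation choices.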
62. Rankings and Tools
Pairwise comparisons were transformed into ranked lists of models using the Bradley-Terry (1952) model, which estimates latent item scores.
● Leaderboard: https://dustalov.github.io/llmfao/
● Pair2Rank, aggregation and analysis tool: https://dustalov-pair2rank.hf.space/
● Code, data, and guidelines: https://github.com/dustalov/llmfao
● A more complete description: https://evalovernite.substack.com/p/llmfao-human-ranking
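A minimal sketch of Bradley-Terry score estimation from pairwise win counts, using the classic iterative (minorization-maximization) update. Ties and the exact fitting procedure used in LLMFAO are omitted, and the model names and win counts below are hypothetical.

```python
def bradley_terry(models, wins, iterations=100):
    """wins: dict (winner, loser) -> count. Returns model -> latent score."""
    scores = {m: 1.0 for m in models}
    for _ in range(iterations):
        new_scores = {}
        for m in models:
            w = sum(c for (a, b), c in wins.items() if a == m)  # total wins of m
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                n = wins.get((m, other), 0) + wins.get((other, m), 0)
                if n:
                    denom += n / (scores[m] + scores[other])
            new_scores[m] = w / denom if denom else scores[m]
        # Normalize so the scores stay on a comparable scale across iterations.
        total = sum(new_scores.values())
        scores = {m: s * len(models) / total for m, s in new_scores.items()}
    return scores

wins = {("gpt-4", "llama"): 8, ("llama", "gpt-4"): 2,
        ("gpt-4", "alpaca"): 9, ("alpaca", "gpt-4"): 1,
        ("llama", "alpaca"): 7, ("alpaca", "llama"): 3}
scores = bradley_terry(["gpt-4", "llama", "alpaca"], wins)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['gpt-4', 'llama', 'alpaca']
```

Sorting models by their estimated scores yields the leaderboard; the Pair2Rank tool linked above packages this kind of aggregation behind an interface.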