Design and Evaluation of Large Language Models
DSC Europe 2023
Dr. Dmitry Ustalov
JetBrains Grazie
About Me
Dr. Dmitry Ustalov
Senior Machine Learning Engineer
JetBrains Grazie
Belgrade, Serbia
Research Interests: Natural Language Processing,
Human-in-the-Loop, Evaluation
● https://github.com/dustalov
● https://www.linkedin.com/in/ustalov/
Outline
1. Introduction
2. Design of LLMs
3. Data Problem
4. Evaluation Problem
5. Conclusion

Introduction

“general methods that leverage computation are ultimately the most effective, and by a large margin”
Richard S. Sutton (2019)
http://www.incompleteideas.net/IncIdeas/BitterLesson.html

Computation seems to always win. Instead of designing a task-specific model, one can get great results with a black-box statistical one.

Compute-Intensive NLP
More generic methods outperformed the more specific ones
● Machine Learning Models (early 1990s)
● Word2Vec (early 2010s)
● Transformer (late 2010s)
● Large Language Models (early 2020s)
[Figure labels: More Annotated Data; More Compute Power]

Problem 1: Data Bottleneck
We need more and more high-quality data.

Problem 2: Evaluation Methodology
Assessing multi-task models is hard.

Let's first take a look at how people build LLMs nowadays.

Design of LLMs
How do people build them?

Transformers and LLMs
The Transformer was originally proposed by Vaswani et al. (2017).
For LLMs, decoder-style models with autoregressive decoding won due to the scalability of the training process (GPT, LaMDA, OPT, etc.).
Models varied in the number of parameters and the amount of data used during pre-training (before InstructGPT took off in 2022).
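
Autoregressive decoding means the model emits one token at a time, each conditioned on everything generated so far. The loop below is a minimal sketch of that idea, assuming a hypothetical `next_token` callable that stands in for a real decoder-only model; the lookup table is a toy, not a language model.

```python
from typing import Callable, List


def generate(next_token: Callable[[List[str]], str],
             prompt: List[str],
             max_new_tokens: int = 8,
             eos: str = "<eos>") -> List[str]:
    """Greedy autoregressive loop: feed the growing sequence back into the model."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # The model sees the prompt plus all previously generated tokens.
        token = next_token(tokens)
        if token == eos:
            break
        tokens.append(token)
    return tokens


# Toy stand-in for a model (hypothetical): a lookup on the last token only.
TOY_TABLE = {"the": "devil", "devil": "is", "is": "in", "in": "details", "details": "<eos>"}


def toy_next_token(tokens: List[str]) -> str:
    return TOY_TABLE.get(tokens[-1], "<eos>")


print(generate(toy_next_token, ["the"]))
```

A real model would return a probability distribution over the vocabulary at each step; the greedy choice here is only the simplest decoding strategy.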

Training Process
The devil is in the details
1. Pre-training on a large text corpus
2. Supervised fine-tuning (SFT) on an instruction corpus
3. Alignment using human or AI preference data

Popular Instruction-Tuned LLMs
A non-exhaustive selection
[Figure: timeline with the marks Ancient Times, Q1 '23, Q2 '23, Q3 '23, Q4 '23, and +∞, showing InstructGPT, Alpaca, Vicuna, Claude, Dolly, Llama 2, Mistral Instruct, OpenChat, and Zephyr. Legend: Reinforcement Learning from Human Feedback; Direct Preference Optimization; Other w/o Labeling; Only SFT]

InstructGPT Use Cases
The distribution in Ouyang et al. (2022) follows the one in GPT-3 usage logs.
Does your application match them?

Use Case        Fraction
Generation      45.6%
Open QA         12.4%
Brainstorming   11.2%
Chat            8.4%
Rewrite         6.6%
Summarization   4.2%
Classification  3.5%
Other           3.5%
Closed QA       2.6%
Extract         1.9%

How do we address the data and evaluation problems?

Data Problem
We need more and more high-quality data.

Superficial Alignment Hypothesis
LLMs already know everything, just show them the format! (Zhou et al., 2023)

Alignment is Necessary
Superior writing abilities of LLMs are fundamentally driven by RLHF. (Touvron et al., 2023)

Who Brings the Data?
Pre-Trained Models, Web Crawling, Labeled Data

For supervised fine-tuning, we need instructions with responses.
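
An instruction corpus for SFT boils down to instruction-response pairs serialized into training text. The sketch below shows one record and a made-up prompt template; the class name, the field names, and the template are assumptions for illustration, and real projects each define their own format.

```python
from dataclasses import dataclass


@dataclass
class SFTExample:
    instruction: str
    response: str


# Hypothetical template for illustration only; not the format of any specific model.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"


def to_training_text(example: SFTExample) -> str:
    """Serialize one instruction-response pair into a single training string."""
    return TEMPLATE.format(instruction=example.instruction.strip(),
                           response=example.response.strip())


example = SFTExample(
    instruction="Tell a joke about going on vacation.",
    response="Why don't some people go on vacation? "
             "Because they're afraid to leave their troubles behind.",
)
print(to_training_text(example))
```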

Supervised Fine-Tuning Datasets
A non-exhaustive selection
[Figure: datasets placed along the axis "Dataset Size (thousands of instructions)" with the marks 1, 10, 100, 1000, 10000, and +∞: OpenAssistant, LIMA, InstructGPT, Llama 2, ShareGPT (Vicuna, OpenChat), Alpaca, Anthropic RLAIF (Claude), WizardLM, UltraChat (Zephyr). Legend: AI Generated; Expert-Annotated; Crowdsourced]

For alignment, we need to obtain human preferences.

Human Feedback: OpenAI Summarize
A classical annotation approach used in Stiennon et al. (2020). Given two responses to the same prompt, the annotator assigns a single score and provides an explanation.

AI Feedback: UltraFeedback
Cui et al. (2023) sampled instructions according to pre-defined guidelines to build a preference dataset resembling how humans would judge the generations.
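
Whether the feedback comes from humans or from a model, it can be stored as comparisons between two responses to the same prompt. The sketch below assumes a hypothetical record layout with a `winner` field of "A", "B", or "tie", and turns comparisons into (prompt, chosen, rejected) triples while skipping ties. It does not reproduce the schema of any dataset named above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Comparison:
    prompt: str
    response_a: str
    response_b: str
    winner: str  # "A", "B", or "tie" (hypothetical encoding)


def to_preference_pair(c: Comparison) -> Optional[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected), or None when there is no clear preference."""
    if c.winner == "A":
        return (c.prompt, c.response_a, c.response_b)
    if c.winner == "B":
        return (c.prompt, c.response_b, c.response_a)
    return None


def build_dataset(comparisons: List[Comparison]) -> List[Tuple[str, str, str]]:
    """Keep only the comparisons that express a preference."""
    pairs = (to_preference_pair(c) for c in comparisons)
    return [p for p in pairs if p is not None]


data = [
    Comparison("Summarize the post.", "short summary", "long summary", "A"),
    Comparison("Summarize the post.", "draft one", "draft two", "tie"),
]
print(build_dataset(data))
```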

Human Preference Datasets
A non-exhaustive selection
More recent works tend to avoid human labeling.
[Figure: datasets placed along the axis "Dataset Size (thousands of comparisons)" with the marks 10, 50, 100, 1000, and +∞: Stack Exchange, WebGPT, Anthropic Harmless, UltraFeedback, OpenAI Summarize, Stanford SHP, Anthropic Helpful. Legend: AI Generated; Human Annotated; Crawled]

Limitations of Human Feedback
Expertise: Can you properly review responses in narrow domains like medicine or physics?
Complexity: How much time do you need to review a source code repository of 100K lines of code?

Addressing the Limitations
Automated Verification: Can the expert annotator be replaced by a computer program?
Simulations: Can the final result be checked in a simulated environment?
End-to-End Evaluation: Use human insight only for final decisions.
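
One way to read "Automated Verification": when the model's output is code, a program can check it by running test cases instead of asking an expert to read it. The sketch below assumes a hypothetical task in which the model must produce a function named `solution`. Running `exec` on untrusted model output is unsafe outside a sandbox, so treat this as an illustration only.

```python
from typing import Any, List, Tuple


def verify(candidate_source: str, test_cases: List[Tuple[tuple, Any]]) -> bool:
    """Run model-written code against known input/output pairs."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # unsafe for untrusted code; sandbox in practice
        solution = namespace["solution"]
        return all(solution(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, a missing function, and runtime failures all count as a fail.
        return False


tests = [((2, 3), 5), ((-1, 1), 0)]
good = "def solution(a, b):\n    return a + b\n"
bad = "def solution(a, b):\n    return a - b\n"
print(verify(good, tests), verify(bad, tests))
```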

The scale of annotation is 10K+ prompts for supervised fine-tuning and 100K+ for preferences.

Leverage synthetic data to avoid difficult data labeling.

If labeling is necessary, focus on smaller datasets of higher quality.

Evaluation Problem
Assessing multi-task models is hard.

https://twitter.com/_jasonwei/status/1707102665321365793

Evaluation of Large Language Models
Multi-Task Benchmarks: Static datasets of challenging problems with ground truth data.
Leaderboards: Dashboards featuring model outputs evaluated by humans or machines.
Online Evaluation: User feedback in downstream applications (not covered here).

Multi-Task Benchmarks
A non-exhaustive selection

Dataset                                                             # of Tasks
Massive Multitask Language Understanding (Hendrycks et al., 2021)   57
EleutherAI Language Model Evaluation Harness (Gao et al., 2021)     200+
Beyond the Imitation Game Benchmark (Srivastava et al., 2023)       200+
AGIEval (Zhong et al., 2023)                                        20

Are open-source benchmarks just training data?

Coping with Evaluation Data Leaks
Proprietary Data: Use data that will surely not be part of the training corpus of any LLM.
Latest Data: Use data that are not part of any training corpus yet (and refresh your dataset frequently).
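
The "latest data" idea can be sketched as a date filter: keep only the evaluation items published after the training cutoff of the models under test. The field names and the cutoff date below are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class EvalItem:
    question: str
    created: date  # when the item was first published (hypothetical field)


def fresh_items(items: List[EvalItem], training_cutoff: date) -> List[EvalItem]:
    """Keep items published strictly after the cutoff, which makes leakage less likely."""
    return [item for item in items if item.created > training_cutoff]


items = [
    EvalItem("old question", date(2021, 5, 1)),
    EvalItem("new question", date(2023, 11, 1)),
]
cutoff = date(2023, 4, 30)  # hypothetical training cutoff
print([item.question for item in fresh_items(items, cutoff)])
```

Since cutoffs move with every new model release, the filter has to be re-run and the dataset refreshed regularly.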

Leaderboards
Model-Based Evaluation: A trusted LLM evaluates the contestants. Example: LMSYS Leaderboard
Crowdsourced Evaluation: Unpaid volunteers evaluate the model outputs. Example: OpenAssistant, LMSYS
Managed Evaluation: Paid experts or carefully selected annotators evaluate the models. Example: Hugging Face H4
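
Model-based evaluation can be sketched as a pairwise judge. The example below assumes a hypothetical `judge` callable that takes a text prompt and returns "A", "B", or "tie"; no real LLM API is called. It asks the judge twice with the responses swapped and keeps a verdict only when both orders agree. That is one common guard against position bias, not something the slides prescribe.

```python
from typing import Callable

PROMPT = (
    "Prompt: {prompt}\n\n"
    "Response A: {a}\n\n"
    "Response B: {b}\n\n"
    "Which response is better? Answer with A, B, or tie."
)


def judge_pair(judge: Callable[[str], str], prompt: str, first: str, second: str) -> str:
    """Return "first", "second", or "tie" after asking the judge in both orders."""
    forward = judge(PROMPT.format(prompt=prompt, a=first, b=second))
    backward = judge(PROMPT.format(prompt=prompt, a=second, b=first))
    if forward == "A" and backward == "B":
        return "first"
    if forward == "B" and backward == "A":
        return "second"
    return "tie"  # disagreement between the two orders is treated as a tie


# Toy judge (hypothetical): prefers the longer response, regardless of position.
def longer_is_better(text: str) -> str:
    a = text.split("Response A: ")[1].split("\n\nResponse B: ")[0]
    b = text.split("Response B: ")[1].split("\n\nWhich response")[0]
    return "A" if len(a) > len(b) else "B" if len(b) > len(a) else "tie"


print(judge_pair(longer_is_better, "Tell a joke.", "short", "a much longer answer"))
```

A judge that always answers "A" would produce a tie for every pair here, which is exactly the kind of bias the swap is meant to expose.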

https://chat.lmsys.org/

I tried making one myself (almost) from scratch.

LLMFAO
Large Language Model Feedback Analysis and Optimization
1. Pick a small yet comprehensive set of prompts and produce responses by multiple models.
2. Sample pairs of the most dissimilar responses by different models for each prompt.
3. Perform pairwise annotation with carefully selected annotators and then obtain the final ranking.
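
The pair sampling step can be sketched as follows. The dissimilarity measure is an assumption (Jaccard distance over lowercased word sets), since the slides do not say which one was used. The function returns the `k` most dissimilar pairs of responses from different models for a single prompt; the model names and responses are made up.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A & B| / |A | B| over lowercased word sets; 0.0 if both are empty."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


def most_dissimilar_pairs(responses: Dict[str, str], k: int) -> List[Tuple[str, str]]:
    """responses maps a model name to its response for one prompt."""
    pairs = combinations(sorted(responses), 2)  # each pair involves two different models
    ranked = sorted(
        pairs,
        key=lambda p: jaccard_distance(responses[p[0]], responses[p[1]]),
        reverse=True,
    )
    return ranked[:k]


responses = {
    "model-a": "pods scale with ease",
    "model-b": "pods scale with ease",
    "model-c": "yaml files never end",
}
print(most_dissimilar_pairs(responses, k=2))
```

Identical responses from two models get a distance of zero and fall to the bottom of the ranking, so annotators do not spend time on comparisons that cannot separate the models.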

Setup
“Argue for and against the use of kubernetes in the style of a haiku.”
LLMFAO had 13 non-coding prompts with generations by 59 different models from the llmonitor.com dataset.
2,139 dissimilar pairs were labeled by 124 annotators, resulting in 8,931 judgements.
Additionally, GPT-4 and GPT-3.5 Turbo Instruct judgements were obtained.

Annotation Interface
Prompt: Tell a joke about going on vacation.
System A: Why don't some people go on vacation? Because they're afraid to leave their troubles behind.
System B: I'm going on vacation, but I'm not sure if I should take my laptop. It's such a pain in the neck to lug it around with me.
Which output do you like better? System A / System B / Tie

Spearman's ρ correlation between humans and GPT-4: 0.730
Spearman's ρ correlation between humans and GPT-3.5 Turbo Instruct: 0.716
Spearman's ρ correlation between GPT-4 and GPT-3.5 Turbo Instruct: 0.716
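
Spearman's ρ is the Pearson correlation computed on ranks. Below is a minimal pure-Python sketch that gives tied values their average rank; the example scores are made up and are not the LLMFAO data.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    return pearson(ranks(x), ranks(y))


human = [0.9, 0.7, 0.5, 0.3]  # made-up model scores derived from human judgements
model = [0.8, 0.9, 0.4, 0.2]  # made-up scores for the same models from an LLM judge
print(round(spearman(human, model), 3))
```

In practice a library routine such as `scipy.stats.spearmanr` does the same job; the hand-written version only makes the definition explicit.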

Rankings and Tools
Pairwise comparisons were transformed into ranked lists of models using the Bradley-Terry (1952) model, which estimates latent item scores.
● Leaderboard: https://dustalov.github.io/llmfao/
● Pair2Rank, aggregation and analysis tool: https://dustalov-pair2rank.hf.space/
● Code, data, and guidelines: https://github.com/dustalov/llmfao
● A more complete description: https://evalovernite.substack.com/p/llmfao-human-ranking
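
A minimal sketch of fitting Bradley-Terry scores with the classic minorization-maximization (MM) update. Ties are ignored, which is a simplification, and the win counts are made up. This is not necessarily how Pair2Rank implements the model.

```python
from typing import Dict, List, Tuple


def bradley_terry(wins: Dict[Tuple[str, str], int], iterations: int = 200) -> Dict[str, float]:
    """wins[(i, j)] is the number of times i beat j. Returns scores that sum to 1."""
    items = sorted({name for pair in wins for name in pair})
    scores = {item: 1.0 / len(items) for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denominator = 0.0
            for j in items:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                denominator += games / (scores[i] + scores[j])
            # MM update: wins of i divided by its expected number of "exposures".
            updated[i] = total_wins / denominator if denominator > 0 else scores[i]
        norm = sum(updated.values())
        scores = {item: value / norm for item, value in updated.items()}
    return scores


def ranking(scores: Dict[str, float]) -> List[str]:
    """Items ordered from the highest to the lowest latent score."""
    return sorted(scores, key=scores.get, reverse=True)


wins = {
    ("A", "B"): 8, ("B", "A"): 2,
    ("B", "C"): 7, ("C", "B"): 3,
    ("A", "C"): 9, ("C", "A"): 1,
}
print(ranking(bradley_terry(wins)))
```

Under this model the probability that item i beats item j is score_i / (score_i + score_j), so sorting by score yields the final ranked list.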

Conclusion
Today's LLMs are gradually giving up the use of annotated data for training.
Evaluation has never been as important as today.
Rely on your expertise, your data, and your downstream application.

Dr. Dmitry Ustalov,
Senior Machine Learning Engineer
JetBrains Grazie
Belgrade, Serbia
dmitry.ustalov@jetbrains.com
Hvala! (Thank you!)

More Related Content

Similar to [DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models

Synergy of Human and Artificial Intelligence in Software Engineering
Synergy of Human and Artificial Intelligence in Software EngineeringSynergy of Human and Artificial Intelligence in Software Engineering
Synergy of Human and Artificial Intelligence in Software EngineeringTao Xie
 
Lessons learned from building practical deep learning systems
Lessons learned from building practical deep learning systemsLessons learned from building practical deep learning systems
Lessons learned from building practical deep learning systemsXavier Amatriain
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersYoung Seok Kim
 
Testing survey by_directions
Testing survey by_directionsTesting survey by_directions
Testing survey by_directionsTao He
 
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERING
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERINGEVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERING
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERINGIJwest
 
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERING
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERINGEVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERING
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERINGdannyijwest
 
It Does What You Say, Not What You Mean: Lessons From A Decade of Program Repair
It Does What You Say, Not What You Mean: Lessons From A Decade of Program RepairIt Does What You Say, Not What You Mean: Lessons From A Decade of Program Repair
It Does What You Say, Not What You Mean: Lessons From A Decade of Program RepairClaire Le Goues
 
IRJET- Semantic Question Matching
IRJET- Semantic Question MatchingIRJET- Semantic Question Matching
IRJET- Semantic Question MatchingIRJET Journal
 
Certification Study Group - NLP & Recommendation Systems on GCP Session 5
Certification Study Group - NLP & Recommendation Systems on GCP Session 5Certification Study Group - NLP & Recommendation Systems on GCP Session 5
Certification Study Group - NLP & Recommendation Systems on GCP Session 5gdgsurrey
 
Everything you need to know about AutoML
Everything you need to know about AutoMLEverything you need to know about AutoML
Everything you need to know about AutoMLArpitha Gurumurthy
 
A comprehensive guide to prompt engineering.pdf
A comprehensive guide to prompt engineering.pdfA comprehensive guide to prompt engineering.pdf
A comprehensive guide to prompt engineering.pdfJamieDornan2
 
The importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsThe importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsFrancesca Lazzeri, PhD
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018HJ van Veen
 
Byron Galbraith, Chief Data Scientist, Talla, at MLconf SEA 2017
Byron Galbraith, Chief Data Scientist, Talla, at MLconf SEA 2017 Byron Galbraith, Chief Data Scientist, Talla, at MLconf SEA 2017
Byron Galbraith, Chief Data Scientist, Talla, at MLconf SEA 2017 MLconf
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning CCG
 
Codex AI.pdf
Codex AI.pdfCodex AI.pdf
Codex AI.pdfepetitjr
 
Model evaluation in the land of deep learning
Model evaluation in the land of deep learningModel evaluation in the land of deep learning
Model evaluation in the land of deep learningPramit Choudhary
 
B4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningB4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningHoa Le
 

Similar to [DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models (20)

Synergy of Human and Artificial Intelligence in Software Engineering
Synergy of Human and Artificial Intelligence in Software EngineeringSynergy of Human and Artificial Intelligence in Software Engineering
Synergy of Human and Artificial Intelligence in Software Engineering
 
Lessons learned from building practical deep learning systems
Lessons learned from building practical deep learning systemsLessons learned from building practical deep learning systems
Lessons learned from building practical deep learning systems
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
Testing survey by_directions
Testing survey by_directionsTesting survey by_directions
Testing survey by_directions
 
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERING
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERINGEVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERING
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERING
 
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERING
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERINGEVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERING
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERING
 
It Does What You Say, Not What You Mean: Lessons From A Decade of Program Repair
It Does What You Say, Not What You Mean: Lessons From A Decade of Program RepairIt Does What You Say, Not What You Mean: Lessons From A Decade of Program Repair
It Does What You Say, Not What You Mean: Lessons From A Decade of Program Repair
 
IRJET- Semantic Question Matching
IRJET- Semantic Question MatchingIRJET- Semantic Question Matching
IRJET- Semantic Question Matching
 
Certification Study Group - NLP & Recommendation Systems on GCP Session 5
Certification Study Group - NLP & Recommendation Systems on GCP Session 5Certification Study Group - NLP & Recommendation Systems on GCP Session 5
Certification Study Group - NLP & Recommendation Systems on GCP Session 5
 
Everything you need to know about AutoML
Everything you need to know about AutoMLEverything you need to know about AutoML
Everything you need to know about AutoML
 
Foutse_Khomh.pptx
Foutse_Khomh.pptxFoutse_Khomh.pptx
Foutse_Khomh.pptx
 
A comprehensive guide to prompt engineering.pdf
A comprehensive guide to prompt engineering.pdfA comprehensive guide to prompt engineering.pdf
A comprehensive guide to prompt engineering.pdf
 
The importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsThe importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systems
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
 
Byron Galbraith, Chief Data Scientist, Talla, at MLconf SEA 2017
Byron Galbraith, Chief Data Scientist, Talla, at MLconf SEA 2017 Byron Galbraith, Chief Data Scientist, Talla, at MLconf SEA 2017
Byron Galbraith, Chief Data Scientist, Talla, at MLconf SEA 2017
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning
 
2014 toronto-torbug
2014 toronto-torbug2014 toronto-torbug
2014 toronto-torbug
 
Codex AI.pdf
Codex AI.pdfCodex AI.pdf
Codex AI.pdf
 
Model evaluation in the land of deep learning
Model evaluation in the land of deep learningModel evaluation in the land of deep learning
Model evaluation in the land of deep learning
 
B4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningB4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearning
 

More from DataScienceConferenc1

[DSC MENA 24] Mostafa_Essa_-_Ai_and_cloud.pdf
[DSC MENA 24] Mostafa_Essa_-_Ai_and_cloud.pdf[DSC MENA 24] Mostafa_Essa_-_Ai_and_cloud.pdf
[DSC MENA 24] Mostafa_Essa_-_Ai_and_cloud.pdfDataScienceConferenc1
 
[DSC MENA 24] Yasser_El_Bendary - How NLP & LLMs model can excel in comprehen...
[DSC MENA 24] Yasser_El_Bendary - How NLP & LLMs model can excel in comprehen...[DSC MENA 24] Yasser_El_Bendary - How NLP & LLMs model can excel in comprehen...
[DSC MENA 24] Yasser_El_Bendary - How NLP & LLMs model can excel in comprehen...DataScienceConferenc1
 
[DSC MENA 24] Medhat_Kandil - Empowering Egypt's AI & Biotechnology Scenes.pdf
[DSC MENA 24] Medhat_Kandil - Empowering Egypt's AI & Biotechnology Scenes.pdf[DSC MENA 24] Medhat_Kandil - Empowering Egypt's AI & Biotechnology Scenes.pdf
[DSC MENA 24] Medhat_Kandil - Empowering Egypt's AI & Biotechnology Scenes.pdfDataScienceConferenc1
 
[DSC MENA 24] Youssef_Kamal - Data governance and quality.pdf
[DSC MENA 24] Youssef_Kamal - Data governance and quality.pdf[DSC MENA 24] Youssef_Kamal - Data governance and quality.pdf
[DSC MENA 24] Youssef_Kamal - Data governance and quality.pdfDataScienceConferenc1
 
[DSC MENA 24] Abdelrahman_Ghallab_-_Data_Product_mgmt.pdf
[DSC MENA 24] Abdelrahman_Ghallab_-_Data_Product_mgmt.pdf[DSC MENA 24] Abdelrahman_Ghallab_-_Data_Product_mgmt.pdf
[DSC MENA 24] Abdelrahman_Ghallab_-_Data_Product_mgmt.pdfDataScienceConferenc1
 
[DSC MENA 24] Asmaa_Eltaher_-_Innovation_Beyond_Brainstorming.pptx
[DSC MENA 24] Asmaa_Eltaher_-_Innovation_Beyond_Brainstorming.pptx[DSC MENA 24] Asmaa_Eltaher_-_Innovation_Beyond_Brainstorming.pptx
[DSC MENA 24] Asmaa_Eltaher_-_Innovation_Beyond_Brainstorming.pptxDataScienceConferenc1
 
[DSC MENA 24] Muhammad_Ezzat_-_Sustianable_Growth_Empowerment.pdf
[DSC MENA 24] Muhammad_Ezzat_-_Sustianable_Growth_Empowerment.pdf[DSC MENA 24] Muhammad_Ezzat_-_Sustianable_Growth_Empowerment.pdf
[DSC MENA 24] Muhammad_Ezzat_-_Sustianable_Growth_Empowerment.pdfDataScienceConferenc1
 
[DSC MENA 24] Basma_Rady_-_Building_a_Data_Driven_Culture_in_Your_Organizatio...
[DSC MENA 24] Basma_Rady_-_Building_a_Data_Driven_Culture_in_Your_Organizatio...[DSC MENA 24] Basma_Rady_-_Building_a_Data_Driven_Culture_in_Your_Organizatio...
[DSC MENA 24] Basma_Rady_-_Building_a_Data_Driven_Culture_in_Your_Organizatio...DataScienceConferenc1
 
[DSC MENA 24] Ahmed_Muselhy_-_Unveiling-the-Secrets-of-AI-in-Hiring.pdf
[DSC MENA 24] Ahmed_Muselhy_-_Unveiling-the-Secrets-of-AI-in-Hiring.pdf[DSC MENA 24] Ahmed_Muselhy_-_Unveiling-the-Secrets-of-AI-in-Hiring.pdf
[DSC MENA 24] Ahmed_Muselhy_-_Unveiling-the-Secrets-of-AI-in-Hiring.pdfDataScienceConferenc1
 
[DSC MENA 24] Ziad_Diab_-_Data-Driven_Disruption_-_The_Role_of_Data_Strategy_...
[DSC MENA 24] Ziad_Diab_-_Data-Driven_Disruption_-_The_Role_of_Data_Strategy_...[DSC MENA 24] Ziad_Diab_-_Data-Driven_Disruption_-_The_Role_of_Data_Strategy_...
[DSC MENA 24] Ziad_Diab_-_Data-Driven_Disruption_-_The_Role_of_Data_Strategy_...DataScienceConferenc1
 
[DSC MENA 24] Mohammad_Essam_- Leveraging Scene Graphs for Generative AI and ...
[DSC MENA 24] Mohammad_Essam_- Leveraging Scene Graphs for Generative AI and ...[DSC MENA 24] Mohammad_Essam_- Leveraging Scene Graphs for Generative AI and ...
[DSC MENA 24] Mohammad_Essam_- Leveraging Scene Graphs for Generative AI and ...DataScienceConferenc1
 
[DSC MENA 24] Ahmed_Fahmy - Navigating the Future.pdf
[DSC MENA 24] Ahmed_Fahmy - Navigating the Future.pdf[DSC MENA 24] Ahmed_Fahmy - Navigating the Future.pdf
[DSC MENA 24] Ahmed_Fahmy - Navigating the Future.pdfDataScienceConferenc1
 
[DSC MENA 24] Hany_Saad_Gheit_-_Azure_OpenAI_service.pptx
[DSC MENA 24] Hany_Saad_Gheit_-_Azure_OpenAI_service.pptx[DSC MENA 24] Hany_Saad_Gheit_-_Azure_OpenAI_service.pptx
[DSC MENA 24] Hany_Saad_Gheit_-_Azure_OpenAI_service.pptxDataScienceConferenc1
 
[DSC MENA 24] Nezar_El_Kady_-_From_Turing_to_Transformers__Navigating_the_AI_...
[DSC MENA 24] Nezar_El_Kady_-_From_Turing_to_Transformers__Navigating_the_AI_...[DSC MENA 24] Nezar_El_Kady_-_From_Turing_to_Transformers__Navigating_the_AI_...
[DSC MENA 24] Nezar_El_Kady_-_From_Turing_to_Transformers__Navigating_the_AI_...DataScienceConferenc1
 
[DSC MENA 24] Amira_Abdelaziz_-_AI_in_Financial_Services.pptx
[DSC MENA 24] Amira_Abdelaziz_-_AI_in_Financial_Services.pptx[DSC MENA 24] Amira_Abdelaziz_-_AI_in_Financial_Services.pptx
[DSC MENA 24] Amira_Abdelaziz_-_AI_in_Financial_Services.pptxDataScienceConferenc1
 
[DSC MENA 24] Omar_Ossama - My Journey from the Field of Oil & Gas, to the Ex...
[DSC MENA 24] Omar_Ossama - My Journey from the Field of Oil & Gas, to the Ex...[DSC MENA 24] Omar_Ossama - My Journey from the Field of Oil & Gas, to the Ex...
[DSC MENA 24] Omar_Ossama - My Journey from the Field of Oil & Gas, to the Ex...DataScienceConferenc1
 
[DSC MENA 24] Ramy_Agieb_-_Advancements_in_Artificial_Intelligence_for_Cybers...
[DSC MENA 24] Ramy_Agieb_-_Advancements_in_Artificial_Intelligence_for_Cybers...[DSC MENA 24] Ramy_Agieb_-_Advancements_in_Artificial_Intelligence_for_Cybers...
[DSC MENA 24] Ramy_Agieb_-_Advancements_in_Artificial_Intelligence_for_Cybers...DataScienceConferenc1
 
[DSC MENA 24] Sohaila_Diab_-_Lets_Talk_Gen_AI_Presentation.pptx
[DSC MENA 24] Sohaila_Diab_-_Lets_Talk_Gen_AI_Presentation.pptx[DSC MENA 24] Sohaila_Diab_-_Lets_Talk_Gen_AI_Presentation.pptx
[DSC MENA 24] Sohaila_Diab_-_Lets_Talk_Gen_AI_Presentation.pptxDataScienceConferenc1
 
[DSC MENA 24] Amal_Elgammal_-_QUALITOP_presentation.pptx
[DSC MENA 24] Amal_Elgammal_-_QUALITOP_presentation.pptx[DSC MENA 24] Amal_Elgammal_-_QUALITOP_presentation.pptx
[DSC MENA 24] Amal_Elgammal_-_QUALITOP_presentation.pptxDataScienceConferenc1
 
[DSC MENA 24] Abdelrahman_Sleem_-_AI_For_Marketing_DSC.pdf
[DSC MENA 24] Abdelrahman_Sleem_-_AI_For_Marketing_DSC.pdf[DSC MENA 24] Abdelrahman_Sleem_-_AI_For_Marketing_DSC.pdf
[DSC MENA 24] Abdelrahman_Sleem_-_AI_For_Marketing_DSC.pdfDataScienceConferenc1
 

More from DataScienceConferenc1 (20)

[DSC MENA 24] Mostafa_Essa_-_Ai_and_cloud.pdf
[DSC MENA 24] Mostafa_Essa_-_Ai_and_cloud.pdf[DSC MENA 24] Mostafa_Essa_-_Ai_and_cloud.pdf
[DSC MENA 24] Mostafa_Essa_-_Ai_and_cloud.pdf
 
[DSC MENA 24] Yasser_El_Bendary - How NLP & LLMs model can excel in comprehen...
[DSC MENA 24] Yasser_El_Bendary - How NLP & LLMs model can excel in comprehen...[DSC MENA 24] Yasser_El_Bendary - How NLP & LLMs model can excel in comprehen...
[DSC MENA 24] Yasser_El_Bendary - How NLP & LLMs model can excel in comprehen...
 
[DSC MENA 24] Medhat_Kandil - Empowering Egypt's AI & Biotechnology Scenes.pdf
[DSC MENA 24] Medhat_Kandil - Empowering Egypt's AI & Biotechnology Scenes.pdf[DSC MENA 24] Medhat_Kandil - Empowering Egypt's AI & Biotechnology Scenes.pdf
[DSC MENA 24] Medhat_Kandil - Empowering Egypt's AI & Biotechnology Scenes.pdf
 
[DSC MENA 24] Youssef_Kamal - Data governance and quality.pdf
[DSC MENA 24] Youssef_Kamal - Data governance and quality.pdf[DSC MENA 24] Youssef_Kamal - Data governance and quality.pdf
[DSC MENA 24] Youssef_Kamal - Data governance and quality.pdf
 
[DSC MENA 24] Abdelrahman_Ghallab_-_Data_Product_mgmt.pdf
[DSC MENA 24] Abdelrahman_Ghallab_-_Data_Product_mgmt.pdf[DSC MENA 24] Abdelrahman_Ghallab_-_Data_Product_mgmt.pdf
[DSC MENA 24] Abdelrahman_Ghallab_-_Data_Product_mgmt.pdf
 
[DSC MENA 24] Asmaa_Eltaher_-_Innovation_Beyond_Brainstorming.pptx
[DSC MENA 24] Asmaa_Eltaher_-_Innovation_Beyond_Brainstorming.pptx[DSC MENA 24] Asmaa_Eltaher_-_Innovation_Beyond_Brainstorming.pptx
[DSC MENA 24] Asmaa_Eltaher_-_Innovation_Beyond_Brainstorming.pptx
 
[DSC MENA 24] Muhammad_Ezzat_-_Sustianable_Growth_Empowerment.pdf
[DSC MENA 24] Muhammad_Ezzat_-_Sustianable_Growth_Empowerment.pdf[DSC MENA 24] Muhammad_Ezzat_-_Sustianable_Growth_Empowerment.pdf
[DSC MENA 24] Muhammad_Ezzat_-_Sustianable_Growth_Empowerment.pdf
 
[DSC MENA 24] Basma_Rady_-_Building_a_Data_Driven_Culture_in_Your_Organizatio...
[DSC MENA 24] Basma_Rady_-_Building_a_Data_Driven_Culture_in_Your_Organizatio...[DSC MENA 24] Basma_Rady_-_Building_a_Data_Driven_Culture_in_Your_Organizatio...
[DSC MENA 24] Basma_Rady_-_Building_a_Data_Driven_Culture_in_Your_Organizatio...
 
[DSC MENA 24] Ahmed_Muselhy_-_Unveiling-the-Secrets-of-AI-in-Hiring.pdf
[DSC MENA 24] Ahmed_Muselhy_-_Unveiling-the-Secrets-of-AI-in-Hiring.pdf[DSC MENA 24] Ahmed_Muselhy_-_Unveiling-the-Secrets-of-AI-in-Hiring.pdf
[DSC MENA 24] Ahmed_Muselhy_-_Unveiling-the-Secrets-of-AI-in-Hiring.pdf
 
[DSC MENA 24] Ziad_Diab_-_Data-Driven_Disruption_-_The_Role_of_Data_Strategy_...
[DSC MENA 24] Ziad_Diab_-_Data-Driven_Disruption_-_The_Role_of_Data_Strategy_...[DSC MENA 24] Ziad_Diab_-_Data-Driven_Disruption_-_The_Role_of_Data_Strategy_...
[DSC MENA 24] Ziad_Diab_-_Data-Driven_Disruption_-_The_Role_of_Data_Strategy_...
 
[DSC MENA 24] Mohammad_Essam_- Leveraging Scene Graphs for Generative AI and ...
[DSC MENA 24] Mohammad_Essam_- Leveraging Scene Graphs for Generative AI and ...[DSC MENA 24] Mohammad_Essam_- Leveraging Scene Graphs for Generative AI and ...
[DSC MENA 24] Mohammad_Essam_- Leveraging Scene Graphs for Generative AI and ...
 
[DSC MENA 24] Ahmed_Fahmy - Navigating the Future.pdf
[DSC MENA 24] Ahmed_Fahmy - Navigating the Future.pdf[DSC MENA 24] Ahmed_Fahmy - Navigating the Future.pdf
[DSC MENA 24] Ahmed_Fahmy - Navigating the Future.pdf
 
[DSC MENA 24] Hany_Saad_Gheit_-_Azure_OpenAI_service.pptx
[DSC MENA 24] Hany_Saad_Gheit_-_Azure_OpenAI_service.pptx[DSC MENA 24] Hany_Saad_Gheit_-_Azure_OpenAI_service.pptx
[DSC MENA 24] Hany_Saad_Gheit_-_Azure_OpenAI_service.pptx
 
[DSC MENA 24] Nezar_El_Kady_-_From_Turing_to_Transformers__Navigating_the_AI_...
[DSC MENA 24] Nezar_El_Kady_-_From_Turing_to_Transformers__Navigating_the_AI_...[DSC MENA 24] Nezar_El_Kady_-_From_Turing_to_Transformers__Navigating_the_AI_...
[DSC MENA 24] Nezar_El_Kady_-_From_Turing_to_Transformers__Navigating_the_AI_...
 
[DSC MENA 24] Amira_Abdelaziz_-_AI_in_Financial_Services.pptx
[DSC MENA 24] Amira_Abdelaziz_-_AI_in_Financial_Services.pptx[DSC MENA 24] Amira_Abdelaziz_-_AI_in_Financial_Services.pptx
[DSC MENA 24] Amira_Abdelaziz_-_AI_in_Financial_Services.pptx
 
[DSC MENA 24] Omar_Ossama - My Journey from the Field of Oil & Gas, to the Ex...
[DSC MENA 24] Omar_Ossama - My Journey from the Field of Oil & Gas, to the Ex...[DSC MENA 24] Omar_Ossama - My Journey from the Field of Oil & Gas, to the Ex...
[DSC MENA 24] Omar_Ossama - My Journey from the Field of Oil & Gas, to the Ex...
 
[DSC MENA 24] Ramy_Agieb_-_Advancements_in_Artificial_Intelligence_for_Cybers...
[DSC MENA 24] Ramy_Agieb_-_Advancements_in_Artificial_Intelligence_for_Cybers...[DSC MENA 24] Ramy_Agieb_-_Advancements_in_Artificial_Intelligence_for_Cybers...
[DSC MENA 24] Ramy_Agieb_-_Advancements_in_Artificial_Intelligence_for_Cybers...
 
[DSC MENA 24] Sohaila_Diab_-_Lets_Talk_Gen_AI_Presentation.pptx
[DSC MENA 24] Sohaila_Diab_-_Lets_Talk_Gen_AI_Presentation.pptx[DSC MENA 24] Sohaila_Diab_-_Lets_Talk_Gen_AI_Presentation.pptx
[DSC MENA 24] Sohaila_Diab_-_Lets_Talk_Gen_AI_Presentation.pptx
 
[DSC MENA 24] Amal_Elgammal_-_QUALITOP_presentation.pptx
[DSC MENA 24] Amal_Elgammal_-_QUALITOP_presentation.pptx[DSC MENA 24] Amal_Elgammal_-_QUALITOP_presentation.pptx
[DSC MENA 24] Amal_Elgammal_-_QUALITOP_presentation.pptx
 
[DSC MENA 24] Abdelrahman_Sleem_-_AI_For_Marketing_DSC.pdf
[DSC MENA 24] Abdelrahman_Sleem_-_AI_For_Marketing_DSC.pdf[DSC MENA 24] Abdelrahman_Sleem_-_AI_For_Marketing_DSC.pdf
[DSC MENA 24] Abdelrahman_Sleem_-_AI_For_Marketing_DSC.pdf
 


[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models

  • 1. Design and Evaluation of Large Language Models DSC Europe 2023 Dr. Dmitry Ustalov JetBrains Grazie
  • 2. About Me. Dr. Dmitry Ustalov, Senior Machine Learning Engineer, JetBrains Grazie, Belgrade, Serbia. Research Interests: Natural Language Processing, Human-in-the-Loop, Evaluation ● https://github.com/dustalov ● https://www.linkedin.com/in/ustalov/
  • 3. Outline: 1. Introduction 2. Design of LLMs 3. Data Problem 4. Evaluation Problem 5. Conclusion
  • 5. “general methods that leverage computation are ultimately the most effective, and by a large margin” Richard S. Sutton (2019) http://www.incompleteideas.net/IncIdeas/BitterLesson.html
  • 6. Computation seems to always win. Instead of designing a task-specific model, one can get great results with a black-box statistical one.
  • 7. Compute-Intensive NLP: More generic methods outperformed the more specific ones. ● Machine Learning Models (early 1990s) ● Word2Vec (early 2010s) ● Transformer (late 2010s) ● Large Language Models (early 2020s). More Annotated Data, More Compute Power.
  • 8. Problem 1: Data Bottleneck. We need more and more high-quality data.
  • 9. Problem 2: Evaluation Methodology. Assessing multi-task models is hard.
  • 10. Let's first take a look at how people build LLMs nowadays.
  • 11. Design of LLMs: How do people build them?
  • 12. Transformers and LLMs. The Transformer was originally proposed by Vaswani et al. (2017). For LLMs, decoder-style models with autoregressive decoding won due to the scalability of the training process (GPT, LaMDA, OPT, etc.). Models varied in the number of parameters and the amount of data used during pre-training (before InstructGPT took off in 2022).
  • 13. Training Process: The devil is in the details. Pre-training on a large text corpus.
  • 14. Training Process: The devil is in the details. Pre-training on a large text corpus; supervised fine-tuning (SFT) on an instruction corpus.
  • 15. Training Process: The devil is in the details. Pre-training on a large text corpus; supervised fine-tuning (SFT) on an instruction corpus; alignment using human or AI preference data.
  • 16. Popular Instruction-Tuned LLMs (a non-exhaustive selection). [Figure: timeline running from "Ancient Times" through Q1, Q2, Q3, and Q4 '23 to +∞, showing InstructGPT, Alpaca, Vicuna, Claude, Dolly, Llama 2, Mistral Instruct, OpenChat, and Zephyr. Legend: Reinforcement Learning from Human Feedback, Direct Preference Optimization, Other w/o Labeling, Only SFT.]
  • 17. InstructGPT Use Cases: Does your application match them? The distribution in Ouyang et al. (2022) follows the one in GPT-3 usage logs. Use case (fraction): Generation (45.6%), Open QA (12.4%), Brainstorming (11.2%), Chat (8.4%), Rewrite (6.6%), Summarization (4.2%), Classification (3.5%), Other (3.5%), Closed QA (2.6%), Extract (1.9%).
  • 18. How do we address the data and evaluation problems?
  • 19. Data Problem: We need more and more high-quality data.
  • 20. Superficial Alignment Hypothesis: LLMs already know everything, just show them the format! (Zhou et al., 2023)
  • 21. Alignment is Necessary: Superior writing abilities of LLMs are fundamentally driven by RLHF. (Touvron et al., 2023)
  • 22. Who Brings the Data? Pre-Trained Models.
  • 23. Who Brings the Data? Pre-Trained Models, Web Crawling.
  • 24. Who Brings the Data? Pre-Trained Models, Web Crawling, Labeled Data.
  • 25. For supervised fine-tuning, we need instructions with responses.
  • 26. Supervised Fine-Tuning Datasets (a non-exhaustive selection). [Figure: datasets placed on a logarithmic axis "Dataset Size (thousands of instructions)" with ticks 1, 10, 100, 1000, 10000, and +∞: OpenAssistant, LIMA, InstructGPT, Llama 2, ShareGPT (Vicuna, OpenChat), Alpaca, Anthropic RLAIF (Claude), WizardLM, UltraChat (Zephyr). Legend: AI Generated, Expert-Annotated, Crowdsourced.]
  • 27. For alignment, we need to obtain human preferences.
  • 28. Human Feedback: OpenAI Summarize. A classical annotation approach used in Stiennon et al. (2020): given two responses to the same prompt, assign a single score and provide an explanation.
  • 29. AI Feedback: UltraFeedback. Cui et al. (2023) sampled instructions according to pre-defined guidelines to build a preference dataset resembling how humans would judge the generations.
  • 30. Human Preference Datasets (a non-exhaustive selection). [Figure: datasets placed on an axis "Dataset Size (thousands of comparisons)" with ticks 10, 50, 100, 1000, and +∞: Stack Exchange, WebGPT, Anthropic Harmless, UltraFeedback, OpenAI Summarize, Stanford SHP, Anthropic Helpful. Legend: AI Generated, Human Annotated, Crawled.]
  • 31. More recent works tend to avoid human labeling.
  • 32. Limitations of Human Feedback. Expertise: Can you properly review responses in narrow domains like medicine or physics?
  • 33. Limitations of Human Feedback. Expertise: Can you properly review responses in narrow domains like medicine or physics? Complexity: How much time do you need to review a source code repository of 100K lines of code?
  • 34. Addressing the Limitations. Simulations: Can the final result be checked in a simulated environment?
  • 35. Addressing the Limitations. Simulations: Can the final result be checked in a simulated environment? Automated Verification: Can the expert annotator be replaced by a computer program?
  • 36. Addressing the Limitations. Simulations: Can the final result be checked in a simulated environment? Automated Verification: Can the expert annotator be replaced by a computer program? End-to-End Evaluation: Use human insight only for final decisions.
  • 37. The scale of annotation is 10K+ prompts for supervised fine-tuning and 100K+ for preferences.
  • 38. Leverage synthetic data to avoid difficult data labeling.
  • 39. If labeling is necessary, focus on smaller datasets of higher quality.
  • 42. Evaluation of Large Language Models. Multi-Task Benchmarks: Static datasets of challenging problems with ground truth data.
  • 43. Evaluation of Large Language Models. Multi-Task Benchmarks: Static datasets of challenging problems with ground truth data. Leaderboards: Dashboards featuring model outputs evaluated by humans or machines.
  • 44. Evaluation of Large Language Models. Multi-Task Benchmarks: Static datasets of challenging problems with ground truth data. Leaderboards: Dashboards featuring model outputs evaluated by humans or machines. Online Evaluation: User feedback in downstream applications (not covered here).
  • 45. Multi-Task Benchmarks (a non-exhaustive selection). Dataset (# of tasks): Massive Multitask Language Understanding (Hendrycks et al., 2021): 57; EleutherAI Language Model Evaluation Harness (Gao et al., 2021): 200+; Beyond the Imitation Game Benchmark (Srivastava et al., 2023): 200+; AGIEval (Zhong et al., 2023): 20.
  • 46. Are open-source benchmarks just training data?
  • 47. Coping with Evaluation Data Leaks. Proprietary Data: Use data that will certainly not be part of the training corpus of any LLM.
  • 48. Coping with Evaluation Data Leaks. Proprietary Data: Use data that will certainly not be part of the training corpus of any LLM. Latest Data: Use data that are not part of any training corpus yet (and refresh your dataset frequently).
  • 49. Leaderboards. Model-Based Evaluation: A trusted LLM evaluates the contestants. Example: LMSYS Leaderboard.
  • 50. Leaderboards. Model-Based Evaluation: A trusted LLM evaluates the contestants. Example: LMSYS Leaderboard. Crowdsourced Evaluation: Unpaid volunteers evaluate the model outputs. Example: OpenAssistant, LMSYS.
  • 51. Leaderboards. Model-Based Evaluation: A trusted LLM evaluates the contestants. Example: LMSYS Leaderboard. Crowdsourced Evaluation: Unpaid volunteers evaluate the model outputs. Example: OpenAssistant, LMSYS. Managed Evaluation: Paid experts or carefully selected annotators evaluate the models. Example: Hugging Face H4.
  • 53. I tried making one myself (almost) from scratch.
  • 54. LLMFAO: Large Language Model Feedback Analysis and Optimization. Pick a small yet comprehensive set of prompts and produce responses by multiple models.
  • 55. LLMFAO: Large Language Model Feedback Analysis and Optimization. Pick a small yet comprehensive set of prompts and produce responses by multiple models. Sample pairs of the most dissimilar responses by different models for each prompt.
  • 56. LLMFAO: Large Language Model Feedback Analysis and Optimization. Pick a small yet comprehensive set of prompts and produce responses by multiple models. Sample pairs of the most dissimilar responses by different models for each prompt. Perform pairwise annotation with carefully selected annotators and then obtain the final ranking.
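
The pair-sampling step on the slides above can be sketched in a few lines. The slides do not say how dissimilarity was measured, so this sketch assumes a word-level Jaccard distance; the function names `jaccard_distance` and `most_dissimilar_pairs` and the sample responses are hypothetical and are not taken from the LLMFAO code.

```python
from itertools import combinations


def jaccard_distance(first: str, second: str) -> float:
    """Word-level Jaccard distance (an assumed stand-in for the real measure)."""
    first_tokens = set(first.lower().split())
    second_tokens = set(second.lower().split())
    if not first_tokens and not second_tokens:
        return 0.0
    overlap = len(first_tokens & second_tokens)
    return 1.0 - overlap / len(first_tokens | second_tokens)


def most_dissimilar_pairs(responses: dict, k: int = 1) -> list:
    """Return the k model pairs whose responses to one prompt differ the most.

    `responses` maps a model name to that model's response for a single prompt.
    """
    scored = [
        (jaccard_distance(responses[a], responses[b]), a, b)
        for a, b in combinations(sorted(responses), 2)
    ]
    # Largest distance first; ties are broken by model name for determinism.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(a, b) for _, a, b in scored[:k]]


# Hypothetical responses of three models to the same prompt.
example = {
    "model-1": "the cat sat",
    "model-2": "the cat sat down",
    "model-3": "dogs bark loudly",
}
pairs = most_dissimilar_pairs(example, k=2)
```

Selecting dissimilar pairs spends the annotation budget on comparisons where annotators can actually tell the outputs apart; any other text distance could be swapped in for the Jaccard distance assumed here.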
  • 57. Setup. “Argue for and against the use of kubernetes in the style of a haiku.” LLMFAO had 13 non-coding prompts with generations by 59 different models from the llmonitor.com dataset. 2,139 dissimilar pairs were labeled by 124 annotators, resulting in 8,931 judgements. Additionally, GPT-4 and GPT-3.5 Turbo Instruct judgements were obtained.
  • 58. Annotation Interface. User prompt: Tell a joke about going on vacation. System A: Why don't some people go on vacation? Because they're afraid to leave their troubles behind. System B: I'm going on vacation, but I'm not sure if I should take my laptop. It's such a pain in the neck to lug it around with me. Which output do you like better? System A / System B / Tie
  • 60. .716, .730: Spearman's ρ correlation between humans and GPT-3.5 Turbo Instruct; Spearman's ρ correlation between humans and GPT-4.
  • 61. .716, .716, .730: Spearman's ρ correlation between GPT-4 and GPT-3.5 Turbo Instruct; Spearman's ρ correlation between humans and GPT-3.5 Turbo Instruct; Spearman's ρ correlation between humans and GPT-4.
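
For readers unfamiliar with the statistic on the two slides above: Spearman's ρ is the Pearson correlation computed on ranks instead of raw values. The following is a minimal pure-Python sketch; the helper names and the sample scores are made up for illustration and are not the LLMFAO data.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda index: values[index])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for index in order[start:end + 1]:
            ranks[index] = shared_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / (variance_x * variance_y) ** 0.5


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up scores that two judges assign to the same five models.
human_scores = [0.91, 0.75, 0.60, 0.42, 0.10]
model_scores = [0.80, 0.85, 0.40, 0.55, 0.05]
rho = spearman_rho(human_scores, model_scores)
```

Because only ranks matter, two judges can disagree on the scale of their scores and still reach ρ = 1 as long as they order the models identically.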
  • 62. Rankings and Tools. Pairwise comparisons were transformed into ranked lists of models using the Bradley-Terry (1952) algorithm that estimates latent item scores. ● Leaderboard: https://dustalov.github.io/llmfao/ ● Pair2Rank, aggregation and analysis tool: https://dustalov-pair2rank.hf.space/ ● Code, data, and guidelines: https://github.com/dustalov/llmfao ● A more complete description: https://evalovernite.substack.com/p/llmfao-human-ranking
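
To make the Bradley-Terry step more concrete, here is a minimal sketch that estimates latent scores from (winner, loser) pairs with the standard minorization-maximization update. It is not the Pair2Rank implementation: ties are ignored, the function name `bradley_terry` and the model names are hypothetical, and the judgements are invented for the example.

```python
from collections import Counter


def bradley_terry(comparisons, iterations=200, tolerance=1e-10):
    """Estimate Bradley-Terry scores from a list of (winner, loser) pairs.

    Assumes winner != loser in every pair and ignores ties.
    Returns a dict mapping each item to a score; the scores sum to 1.
    """
    wins = Counter()
    pair_counts = Counter()
    for winner, loser in comparisons:
        wins[winner] += 1
        wins[loser] += 0  # register the loser even if it never wins
        pair_counts[frozenset((winner, loser))] += 1

    items = sorted(wins)
    scores = {item: 1.0 / len(items) for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            # Sum over opponents j of (number of i-vs-j comparisons) / (p_i + p_j).
            denominator = sum(
                pair_counts[frozenset((i, j))] / (scores[i] + scores[j])
                for j in items
                if j != i and frozenset((i, j)) in pair_counts
            )
            updated[i] = wins[i] / denominator
        total = sum(updated.values())
        updated = {item: score / total for item, score in updated.items()}
        change = max(abs(updated[item] - scores[item]) for item in items)
        scores = updated
        if change < tolerance:
            break
    return scores


# Invented judgements: each tuple is (preferred model, other model).
judgements = (
    [("model-a", "model-b")] * 3 + [("model-b", "model-a")]
    + [("model-b", "model-c")] * 3 + [("model-c", "model-b")]
    + [("model-a", "model-c")] * 3 + [("model-c", "model-a")]
)
scores = bradley_terry(judgements)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sorting the models by their estimated score gives the ranked list. A model that never wins ends up with a score of zero in this sketch, so practical tools usually add some smoothing or handle ties explicitly.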
  • 65. Today's LLMs are gradually giving up the use of annotated data for training.
  • 66. Evaluation has never been as important as today.
  • 67. Rely on your expertise, your data, and your downstream application.
  • 68. Dr. Dmitry Ustalov, Senior Machine Learning Engineer, JetBrains Grazie, Belgrade, Serbia. dmitry.ustalov@jetbrains.com Hvala! (Thank you!)