Benchmarking Large Language Models on Skills & Domains

This slide deck summarizes a benchmark comparison of several large language models across skillsets and domains. On this benchmark, GPT-4 scored highest on 10 of the 12 skills in the results table, including logical robustness, logical correctness, logical efficiency, factuality, and commonsense; the exceptions are readability and harmlessness, where GPT-3.5 scored highest. The table lists each model's score per skillset, grouped into open-sourced, proprietary, and oracle models. The source is a non-reviewed arXiv preprint and its accompanying GitHub repository; the deck itself is released under a Creative Commons license (CC BY 4.0).

Benchmark Comparison of
Large Language Models

On this particular test,
GPT-4 performed the best...

Tested models, tested skillsets, tested domains
Models
GPT-4
GPT-3.5
LLaMA2
Bard
Claude
Vicuna
Alpaca
WizardLM
Tulu
Skillsets
Logical Robustness
Logical Correctness
Logical Efficiency
Factuality
Commonsense Understanding
Comprehension
Insightfulness
Completeness
Metacognition
Readability
Conciseness
Harmlessness
Domains
Language
Culture
Health
History
Natural Science
Math
Social Science
Technology
Coding
Humanities

Results of comparisons II
                     Open-sourced              Proprietary             Oracle
Skillset             Vicuna  Alpaca  LLaMA2    GPT-3.5  Bard   Claude  GPT-4
Logical Robustness   2.29    2.04    2.65      4.00     3.51   3.59    4.25
Logical Correctness  2.61    2.41    2.96      3.83     3.52   3.68    4.25
Logical Efficiency   2.87    2.44    3.09      4.29     3.82   4.13    4.54
Factuality           3.38    2.87    3.60      3.91     3.76   3.89    4.23
Commonsense          3.49    3.13    3.77      4.13     4.02   4.09    4.50
Comprehension        3.55    2.91    3.73      3.97     3.84   4.13    4.34
Insightfulness       3.03    2.35    3.57      3.28     3.43   3.46    3.80
Completeness         3.46    2.62    3.92      3.80     3.92   4.17    4.26
Metacognition        3.69    2.13    3.98      3.74     3.34   3.92    4.33
Readability          4.65    4.43    4.74      4.86     4.68   4.82    4.85
Conciseness          4.36    4.43    3.95      4.57     3.69   4.56    4.69
Harmlessness         4.91    4.26    4.94      4.97     4.79   4.91    4.85
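
To make the "GPT-4 performed the best" reading of these scores easier to check, here is a minimal Python sketch that finds the top-scoring model for each skillset and computes an unweighted mean per model. The scores are copied from the table above. The helper names (best_model_per_skill, mean_score_per_model) and the unweighted mean are assumptions made for illustration; they are not an aggregate reported by the FLASK paper or its repository.

```python
from statistics import mean

# Column order follows the "Results of comparisons II" table.
MODELS = ["Vicuna", "Alpaca", "LLaMA2", "GPT-3.5", "Bard", "Claude", "GPT-4"]

# Scores copied from the table, one row per skillset, in MODELS order.
SCORES = {
    "Logical Robustness":  [2.29, 2.04, 2.65, 4.00, 3.51, 3.59, 4.25],
    "Logical Correctness": [2.61, 2.41, 2.96, 3.83, 3.52, 3.68, 4.25],
    "Logical Efficiency":  [2.87, 2.44, 3.09, 4.29, 3.82, 4.13, 4.54],
    "Factuality":          [3.38, 2.87, 3.60, 3.91, 3.76, 3.89, 4.23],
    "Commonsense":         [3.49, 3.13, 3.77, 4.13, 4.02, 4.09, 4.50],
    "Comprehension":       [3.55, 2.91, 3.73, 3.97, 3.84, 4.13, 4.34],
    "Insightfulness":      [3.03, 2.35, 3.57, 3.28, 3.43, 3.46, 3.80],
    "Completeness":        [3.46, 2.62, 3.92, 3.80, 3.92, 4.17, 4.26],
    "Metacognition":       [3.69, 2.13, 3.98, 3.74, 3.34, 3.92, 4.33],
    "Readability":         [4.65, 4.43, 4.74, 4.86, 4.68, 4.82, 4.85],
    "Conciseness":         [4.36, 4.43, 3.95, 4.57, 3.69, 4.56, 4.69],
    "Harmlessness":        [4.91, 4.26, 4.94, 4.97, 4.79, 4.91, 4.85],
}


def best_model_per_skill(scores):
    """Map each skillset to the model with the highest score.

    If two models tie for the maximum, the earlier column wins.
    """
    return {skill: MODELS[row.index(max(row))] for skill, row in scores.items()}


def mean_score_per_model(scores):
    """Map each model to its unweighted mean over all skillsets.

    This aggregate is an illustrative assumption, not a metric from the paper.
    """
    columns = zip(*scores.values())  # transpose rows into per-model columns
    return {model: mean(col) for model, col in zip(MODELS, columns)}


if __name__ == "__main__":
    for skill, model in best_model_per_skill(SCORES).items():
        print(f"{skill:<20} {model}")
    for model, avg in mean_score_per_model(SCORES).items():
        print(f"{model:<8} {avg:.2f}")
```

Run on the table above, the per-skill winners show GPT-4 leading on every skillset except Readability and Harmlessness, where GPT-3.5 has the highest score. The unweighted mean treats all skillsets as equally important, which is a simplification.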

Source
Submitted (non-reviewed) paper
Ye, Seonghyeon, et al. "FLASK: Fine-grained Language Model Evaluation based on
Alignment Skill Sets." arXiv preprint arXiv:2307.10928 (2023).
Web-sources
https://github.com/kaistAI/FLASK

3. ... but let’s see how...

5. Results of comparisons I

8. CC BY 4.0, Matej Varga
