Evaluating Retrieval-Augmented Generation - Webinar
Stefan Webb
Developer Advocate, Zilliz
stefan.webb@zilliz.com
https://www.linkedin.com/in/stefan-webb
https://x.com/stefan_webb
Webinar: Building a Principled RAG Pipeline | Host
CONTENTS
01 Introduction
02 Evaluating Foundation Models
03 Challenges and Limitations
04 Open Source Evaluation Frameworks
01 Introduction
Why is Semantic Search Important?
An estimated 90% of newly generated data in 2025 will be unstructured data; the remaining 10% is other (structured) data.
Data Source: The Digitization of the World, IDC
RAG Refresher
RAG Pipeline Options
Credit: https://github.com/langchain-ai/rag-from-scratch
RAG Pipeline Options
Credit: Gao et al., 2024
How to Make RAG Design Choices?
● Measure effects
● Consider alternatives
● Update and repeat
How??
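One concrete way to close this loop, sketched below: score a small question set against each candidate pipeline configuration and keep the best one. The build_pipeline() and score_answer() helpers are hypothetical placeholders for your own retrieval stack and metric, not part of any particular library; only the structure of the loop is the point.

```python
# A minimal sketch of the measure-compare-iterate loop, assuming hypothetical
# helpers build_pipeline() and score_answer() that you implement for your stack.

questions = ["What is a vector database?", "How does HNSW work?"]
ground_truths = ["...", "..."]  # reference answers for your eval set

configs = [
    {"chunk_size": 256, "top_k": 3},
    {"chunk_size": 512, "top_k": 5},
]

results = {}
for cfg in configs:
    pipeline = build_pipeline(**cfg)              # hypothetical: index docs, wire up retriever + LLM
    scores = []
    for q, gt in zip(questions, ground_truths):
        answer = pipeline.answer(q)               # hypothetical: retrieve context, generate answer
        scores.append(score_answer(answer, gt))   # hypothetical: e.g. exact match or LLM-as-a-judge
    results[str(cfg)] = sum(scores) / len(scores)

# Pick the configuration with the best average score, then refine and repeat.
best = max(results, key=results.get)
print(results, "->", best)
```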
Scope of Webinar
Not covered today:
● Deeper discussion of limitations of LLM-as-a-Judge
● Further / alternative methods
● Evaluating output: agents, multimodal models, online
● Evaluating latency
● Adversarial attacks

Covered today:
● Fundamentals of LLM-as-a-Judge
● Evaluating output: LLMs, RAG, offline
● Frameworks: LM Harness, RAGAS
02 Evaluating Foundation Models
How to Measure “Performance”?
● Distinction between evaluating on a task vs. evaluating the output in itself (“introspection”)
● Comparing the answer to a ground truth vs. comparing the output to the retrieved context
● Evaluating retrieval vs. evaluating LLM output
● With ground truth and without ground truth
● Human evaluation is the “gold standard” but doesn't scale well
● Surprisingly, (strong) LLMs can evaluate LLMs when there is no ground truth (a minimal sketch follows)
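As a minimal illustration of that last bullet, the sketch below asks a strong model to grade a RAG answer for faithfulness to its retrieved context. It assumes the OpenAI Python client (v1.x); the model name and the 1-5 rubric are illustrative choices, not a fixed standard.

```python
# A minimal LLM-as-a-judge sketch using the OpenAI Python client (v1.x API).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_faithfulness(question: str, context: str, answer: str) -> str:
    prompt = (
        "You are grading a RAG answer.\n"
        f"Question: {question}\n"
        f"Retrieved context: {context}\n"
        f"Answer: {answer}\n\n"
        "On a scale of 1-5, how factually consistent is the answer with the context? "
        "Reply with the number followed by a one-sentence justification."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

print(judge_faithfulness(
    "Who created Milvus?",
    "Milvus is an open-source vector database originally developed by Zilliz.",
    "Milvus was created by Zilliz.",
))
```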
Task-Based Evaluation
● Knowledge-based (sample item shown below)
○ MMLU
○ HellaSwag
○ ARC
● Instruction following
○ Flan
○ Self-instruct
○ NaturalInstructions
● Conversational
○ CoQA
○ MMDialog
○ OpenAssistant
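For a feel of what a knowledge-based benchmark item looks like, the snippet below loads one MMLU question from the Hugging Face Hub. The "cais/mmlu" repo id and its field names are assumptions based on the public dataset card; adjust them if your copy differs.

```python
# Inspect a single MMLU item with the Hugging Face datasets library.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "abstract_algebra", split="test")
sample = mmlu[0]
print(sample["question"])   # the prompt
print(sample["choices"])    # four answer options
print(sample["answer"])     # index of the correct option
```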
“Introspection”-based Evaluation
● Generation-based
○ Faithfulness / groundedness
■ “Measures the factual consistency of the generated answer against the given context. If any claims are made in the answer that cannot be deduced from context, then these will be penalized.”
○ Answer Relevancy
■ “Refers to the degree to which a response directly addresses and is appropriate for a given question or context.”
● Retrieval-based
○ Context Relevance
■ “Measures how relevant retrieved contexts are to the question. Ideally, the context should only contain information necessary to answer the question.”
○ Context Recall / retrieval-based metrics
■ Measures the recall of the retrieved context using the annotated answer as ground truth.
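These metrics map closely onto what libraries such as RAGAS compute. The sketch below shows roughly how that evaluation call looks; the import paths and column names follow older RAGAS releases and may differ in the version you install (RAGAS itself is covered in a follow-up webinar).

```python
# A hedged sketch of computing introspection-style RAG metrics with RAGAS.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

data = Dataset.from_dict({
    "question":     ["What is Milvus?"],
    "answer":       ["Milvus is an open-source vector database."],
    "contexts":     [["Milvus is an open-source vector database built for GenAI applications."]],
    "ground_truth": ["Milvus is an open-source vector database."],
})

# Each metric is computed by an LLM judge under the hood; scores fall between 0 and 1.
scores = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(scores)
```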
03 Challenges and Limitations
Position Bias
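One simple way to quantify position bias: ask a pairwise judge the same comparison with the answers in both orders and count how often its verdict flips. In the sketch below, judge_prefers() is a hypothetical wrapper around your judge LLM that returns "A" or "B".

```python
# Estimate position bias by re-judging each pair with the answer order swapped.
def position_bias_rate(pairs) -> float:
    flips = 0
    for question, answer_1, answer_2 in pairs:
        first = judge_prefers(question, a=answer_1, b=answer_2)   # hypothetical judge call
        second = judge_prefers(question, a=answer_2, b=answer_1)  # same pair, swapped order
        # A consistent judge picks the same underlying answer both times:
        # "A" in the first pass should correspond to "B" in the second pass.
        if (first == "A") != (second == "B"):
            flips += 1
    return flips / len(pairs)
```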
Verbosity Bias
Can Answer But Can't Judge
Chain-of-Thought (CoT) Failure Mode
Data Quality
As in other parts of GenAI, the main challenge is producing the right dataset!
Fine-tuned Judge Models
● https://huggingface.co/nuclia/REMi-v0
● https://huggingface.co/grounded-ai
● https://huggingface.co/prometheus-eval
● https://huggingface.co/flowaicom/Flow-Judge-v0.1
● https://huggingface.co/facebook/Self-taught-evaluator-llama3.1-70B
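As a rough sketch, any of these judge checkpoints can be pulled down with transformers; Flow-Judge is used as the example below, and its exact prompt/rubric template is an assumption here, so check the model card before relying on the scores.

```python
# Hedged sketch: load a fine-tuned judge model and generate an evaluation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "flowaicom/Flow-Judge-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Placeholder prompt; the real template expected by the model is on its model card.
prompt = "Evaluate the following answer for faithfulness to the given context...\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```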
04 Open-Source Evaluation Frameworks
LM Evaluation Harness
● https://github.com/EleutherAI/lm-evaluation-harness
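A hedged example of running a benchmark through the harness's Python entry point (pip install lm-eval) is shown below; the argument names follow the project README at the time of writing and may change between releases.

```python
# Run a task-based benchmark with EleutherAI's lm-evaluation-harness.
# CLI equivalent: lm_eval --model hf --model_args pretrained=EleutherAI/pythia-160m --tasks hellaswag
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                     # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m", # small model, for illustration only
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"]["hellaswag"])              # accuracy and normalized accuracy
```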
RAGAS
Covered in a follow-up webinar!
ARES, HuggingFace Lighteval, etc.
Covered in a follow-up webinar!
About Milvus
Milvus is an open-source vector database for GenAI projects. Pip-install it on your laptop, plug it into popular AI dev tools, and push to production with a single line of code.
30K GitHub Stars · 66M Docker Pulls · 400 Contributors · 2.7K Forks
● Easy Setup: pip-install to start coding in a notebook within seconds (see the quickstart sketch below)
● Integration: plug into OpenAI, LangChain, LlamaIndex, and many more
● Reusable Code: write once, and deploy into the production environment with one line of code
● Feature-rich: dense & sparse embeddings, filtering, reranking, and beyond
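For the "Easy Setup" point, here is a minimal Milvus Lite quickstart sketch (pip install pymilvus); the collection name, dimension, and dummy vectors are placeholders for illustration.

```python
# Minimal local Milvus Lite quickstart using the pymilvus client.
from pymilvus import MilvusClient

client = MilvusClient("milvus_demo.db")          # local, file-backed Milvus Lite instance
client.create_collection(collection_name="demo", dimension=4)

client.insert(collection_name="demo", data=[
    {"id": 0, "vector": [0.1, 0.2, 0.3, 0.4], "text": "Milvus is a vector database."},
    {"id": 1, "vector": [0.4, 0.3, 0.2, 0.1], "text": "RAG retrieves context before generating."},
])

hits = client.search(
    collection_name="demo",
    data=[[0.1, 0.2, 0.3, 0.4]],                 # query vector
    limit=1,
    output_fields=["text"],
)
print(hits)
```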
Milvus Users
github.com/milvus-io/milvus
zilliz.com/learn/generative-ai
Summary
● There are many methods, models, etc.
● We need quantitative evaluation for principled design
● Tasks can evaluate LLMs when there's a ground truth
● LLMs can evaluate LLMs when there's no ground truth (and also when there is one)
● There are several excellent open-source evaluation frameworks, although it is a relatively nascent area
THANK YOU
LET’S STAY CONNECTED!
Stefan Webb
Developer Advocate, Zilliz
https://milvus.io/discord
https://github.com/milvus-io/milvus
https://x.com/milvusio
https://www.linkedin.com/company/the-milvus-project
Become a Speaker!
Interested in speaking at and/or sponsoring a Zilliz Unstructured Data Meetup? Fill out this form.
🎤🎤🎤
Join us at our next meetup!
lu.ma/unstructured-data-meetup
● Nov 13, South Bay, Sports Basement Sunnyvale: “Evaluating Retrieval-Augmented Generation”
● Nov 19, San Francisco, GitHub Office: “Structured Output from Unstructured Input: Part 2”
