1 | © Copyright 11/17/23 Zilliz
Speaker
Christy Bergman
Developer Advocate, Zilliz
https://www.linkedin.com/in/christybergman/
https://github.com/milvus-io/milvus
discord: https://discord.gg/FjCMmaJng6
2 | © Copyright 11/17/23 Zilliz
Image source: https://thedataquarry.com/posts/vector-db-1/
3 | © Copyright 11/17/23 Zilliz
3 Pillars of Generative AI
4 | © Copyright 11/17/23 Zilliz
3 Pillars of Generative AI
5 | © Copyright 11/17/23 Zilliz
Opportunities in Unstructured Data
6 | © Copyright 11/17/23 Zilliz
7 | © Copyright 11/17/23 Zilliz
THANK YOU
We need your stars!
https://github.com/milvus-io/milvus
💬 Join our discord: https://discord.gg/FjCMmaJng6
8 | © Copyright 11/17/23 Zilliz
AGENDA
01 AI Hallucinations and RAG
02 4 Challenges
03
04 RAG Evaluation Methods
Demo
9 | © Copyright 11/17/23 Zilliz
01 AI Hallucinations and RAG
10 | © Copyright 11/17/23 Zilliz
Example AI Hallucination
gemini
wikipedia
11 | © Copyright 11/17/23 Zilliz
Example AI Hallucination
gemini
wikipedia
hallucinated answer
12 | © Copyright 11/17/23 Zilliz
Why do models hallucinate?
• The reason LLMs hallucinate is because …
• They are trained on sequences of words (tokens)
Sample Data
The hamster cabinet …
!!@#%# …
Monkey eats shark …
trees in the moons…
13 | © Copyright 11/17/23 Zilliz
Where do Vectors Come From?
Unstructured Data -> Pre-trained Deep Learning Models (embeddings here) -> Vectors -> Vector Database
14 | © Copyright 11/17/23 Zilliz
Where do Vectors Come From?
Unstructured Data Vectors
15 | © Copyright 11/17/23 Zilliz
Semantic Similarity
Image from Sutor et al
Woman = [0.3, 0.4]
Queen = [0.3, 0.9]
King = [0.5, 0.7]
Man = [0.5, 0.2]
Queen - Woman + Man = King
  Queen = [0.3, 0.9]
- Woman = [0.3, 0.4]
        = [0.0, 0.5]
+ Man   = [0.5, 0.2]
= King  = [0.5, 0.7]
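The arithmetic on this slide can be checked directly. The sketch below uses only the toy 2-d vectors from the slide; the helper names (`add`, `sub`, `nearest`) are hypothetical, and real embeddings have hundreds of dimensions and satisfy such analogies only approximately.

```python
import math

# Toy 2-d vectors copied from the slide.
vectors = {
    "Woman": [0.3, 0.4],
    "Queen": [0.3, 0.9],
    "King": [0.5, 0.7],
    "Man": [0.5, 0.2],
}

def sub(a, b):
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]

def add(a, b):
    """Element-wise a + b."""
    return [x + y for x, y in zip(a, b)]

def nearest(target, vocab):
    """Return the word whose vector has the smallest Euclidean distance to target."""
    return min(vocab, key=lambda word: math.dist(target, vocab[word]))

# Queen - Woman + Man
result = add(sub(vectors["Queen"], vectors["Woman"]), vectors["Man"])
closest = nearest(result, vectors)
print(closest)
```

Semantic search in a vector database follows the same idea: the query is turned into a vector and the stored vectors closest to it are returned.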
16 | © Copyright 11/17/23 Zilliz
Retrieval Augmented Generation (RAG)
Your Data
Embedding Model
Vector Database
Question
Question + Context
Search
Gen AI Model
Reliable Answers
What is the default AUTOINDEX distance metric in Milvus?
The default AUTOINDEX distance metric in Milvus is L2.
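The RAG flow on this slide (embed, search, then send question plus context to the model) can be sketched in a few lines. This is a minimal illustration, not the demo code: the 2-d vectors are made up, `search` and `build_prompt` are hypothetical helpers, an in-memory list stands in for the vector database, and the call to the generative model is left out.

```python
import math

# Hypothetical toy index: (chunk text, made-up 2-d embedding).
# In a real system an embedding model produces the vectors and
# a vector database stores and searches them.
INDEX = [
    ("The default AUTOINDEX distance metric in Milvus is L2.", [0.9, 0.1]),
    ("Milvus is an open-source vector database.", [0.2, 0.8]),
    ("Chunks need more context.", [0.5, 0.5]),
]

def search(query_vector, index, top_k=1):
    """Return the top_k chunk texts closest to the query by L2 distance."""
    ranked = sorted(index, key=lambda item: math.dist(query_vector, item[1]))
    return [text for text, _ in ranked[:top_k]]

def build_prompt(question, contexts):
    """Combine the question with retrieved context (the 'Question + Context' box)."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )

question = "What is the default AUTOINDEX distance metric in Milvus?"
question_vector = [0.85, 0.15]  # assumption: stands in for the embedded question
contexts = search(question_vector, INDEX, top_k=1)
prompt = build_prompt(question, contexts)
print(prompt)
```

The prompt, rather than the bare question, is what would be sent to the Gen AI model.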
17 | © Copyright 11/17/23 Zilliz
Pain Point #3: Chunking
Conversation Data
Documentation Data
Lecture or Q/A Data
18 | © Copyright 11/17/23 Zilliz
Pain Point #3: Chunking
Conversation Data: add conversation memory
Documentation Data
Question Answer Data: use Q&A tuple formatting
19 | © Copyright 11/17/23 Zilliz
Pain Point #3: Chunks need more context
Naive Chunks
Tesla Roadster
Chunk #1: 2018 Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem
Chunk #2: 2023 Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem
20 | © Copyright 11/17/23 Zilliz
Pain Point #3: Chunks need more context
Naive Chunks
Tesla Roadster
2018 Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem
2023 Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem
Better Chunks (each chunk carries the title 2-levels above and the title 1-level above)
Tesla Roadster 2018 Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem
Tesla Roadster 2023 Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem
Tools: HTMLHeaderTextSplitter, ParentDocumentRetriever, HierarchicalNodeParser, AutoMergingRetriever
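The difference between naive and better chunks can be illustrated without any library. The sketch below is a hand-rolled stand-in, not the API of HTMLHeaderTextSplitter or HierarchicalNodeParser: the `doc` structure and both function names are hypothetical, and the body text is the placeholder from the slide.

```python
# Hypothetical parsed document: a page title with two year sections.
doc = {
    "title": "Tesla Roadster",
    "sections": [
        {"title": "2018", "text": "Lorem ipsum dolor sit amet"},
        {"title": "2023", "text": "Lorem ipsum dolor sit amet"},
    ],
}

def naive_chunks(document):
    """One chunk per section; the page title above the section is lost."""
    return [f'{s["title"]}\n{s["text"]}' for s in document["sections"]]

def contextual_chunks(document):
    """One chunk per section, prefixed with the titles above it."""
    return [
        f'{document["title"]} {s["title"]}\n{s["text"]}'
        for s in document["sections"]
    ]

naive = naive_chunks(doc)
better = contextual_chunks(doc)
print(naive[0])
print(better[0])
```

With the naive chunks, a query about the "Tesla Roadster" has nothing in the chunk text to match; with the parent titles prepended, each chunk can be retrieved and read on its own.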
21 | © Copyright 11/17/23 Zilliz
Pain Point #3: Chunks need more context
Naive Chunks
Better Chunks
22 | © Copyright 11/17/23 Zilliz
04 RAG Evaluation Methods
23 | © Copyright 11/17/23 Zilliz
Foundation Model Evals vs Production System Evals
Your RAG system
Arena Elo score
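For readers unfamiliar with Elo scores, the sketch below shows the classic Elo update after one head-to-head comparison between two models. It is an illustration of the standard formula only; the function names and the K-factor of 32 are assumptions, and a public leaderboard may compute its ratings differently.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, score_a, k=32):
    """Return new (rating_a, rating_b) after one match.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    k=32 is an assumed K-factor, not taken from the slides.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Two models start level; model A wins one human-judged comparison.
new_a, new_b = elo_update(1000, 1000, score_a=1.0)
print(new_a, new_b)
```

Such a score ranks foundation models against each other on generic prompts, which is why it says little about how a specific RAG system performs on its own data.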
24 | © Copyright 11/17/23 Zilliz
RAG Evaluation Methods
https://arxiv.org/pdf/2306.05685.pdf
GPT-4 favors itself with a 10% higher win rate; Claude-v1 favors itself with a 25% higher win rate
Open-weight Prometheus-eval aligns with human judgments up to 85% as of May 2024.
25 | © Copyright 11/17/23 Zilliz
Known Problems with LLM-as-Judge
https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG
GPT-4 is not a good judge of comprehensiveness
GPT-4 matches human judgements on Correctness & Readability
26 | © Copyright 11/17/23 Zilliz
Known Problems with LLM-as-Judge
https://arxiv.org/pdf/2305.17926
AI scores max/min higher
Humans score medians higher
27 | © Copyright 11/17/23 Zilliz
RAG Evaluation Methods
https://github.com/explodinggradients/ragas
faithfulness
context_precision
context_recall
Query
Context
answer_relevancy
Ground Truth
Answer
answer_correctness
answer_similarity
Response
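To make one of these metrics concrete, the sketch below computes a context precision score from hand-supplied relevance labels, following the precision@k-weighted formula described in the Ragas documentation. It is a simplified stand-in rather than the Ragas implementation: in Ragas an LLM judges whether each retrieved chunk is relevant, whereas here the 0/1 labels and the function name are assumptions.

```python
def context_precision(relevance):
    """Mean of precision@k over the ranks k that hold a relevant chunk.

    relevance: list of 0/1 labels for the retrieved chunks, in ranked order.
    Returns 0.0 when no retrieved chunk is relevant.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this rank
    return total / hits if hits else 0.0

# Relevant chunks at ranks 1 and 3, an irrelevant chunk at rank 2.
score_mixed = context_precision([1, 0, 1])
# Same chunks, but the relevant ones are ranked first.
score_ranked = context_precision([1, 1, 0])
print(score_mixed, score_ranked)
```

The score rewards retrievers that place relevant chunks near the top, which is why a change in chunking strategy alone can move it.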
28 | © Copyright 11/17/23 Zilliz
05 Demo RAG Eval
29 | © Copyright 11/17/23 Zilliz
RETRIEVAL +46%, GENERATION +6%
####################################################
# Avg Context Precision htmlsplitter score = 0.67 (46% improvement)
# Avg Context Precision simple score = 0.46
####################################################
####################################################
# Avg mistralai mixtral_8x7b_instruct score = 0.7031 (6% improvement over gpt-3.5-turbo)
# Avg llama3_70b_anyscale_chat score = 0.6888
# Avg llama3_70b_groq_instruct score = 0.6867
# Avg llama_3_70b_octoai_instruct score = 0.6863
# Avg llama_3_8b_ollama_instruct score = 0.6783
# Avg openai gpt-3.5-turbo score = 0.665
####################################################
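The headline percentages are relative improvements over the baseline scores. The short check below recomputes them from the averages printed above; the function name is hypothetical.

```python
def pct_improvement(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score, in percent."""
    return (new_score - baseline_score) / baseline_score * 100

# Retrieval: htmlsplitter vs simple chunking (average context precision).
retrieval_gain = pct_improvement(0.67, 0.46)
# Generation: mixtral_8x7b_instruct vs gpt-3.5-turbo (average score).
generation_gain = pct_improvement(0.7031, 0.665)

print(f"retrieval +{retrieval_gain:.0f}%, generation +{generation_gain:.0f}%")
```

On these numbers, changing the chunking strategy moved the retrieval metric far more than swapping the generating LLM moved the generation metric.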

Retrieval Augmented Generation Evaluation with Ragas
