Introduction to Open Source RAG and RAG Evaluation

1 | © Copyright 11/17/23 Zilliz
Speaker
Christy Bergman
Developer Advocate, Zilliz
christy.bergman@zilliz.com
https://www.linkedin.com/in/christybergman/
https://github.com/milvus-io/milvus
discord: https://discord.gg/FjCMmaJng6

Image source: https://thedataquarry.com/posts/vector-db-1/

27K+ GitHub Stars
25M+ Downloads
250+ Contributors
2,600+ Forks
Milvus is an open-source vector database for GenAI projects. Pip-install on your
laptop, plug into popular AI dev tools, and push to production with a single line of
code.
Easy Setup: Pip-install to start coding in a notebook within seconds.
Reusable Code: Write once, and deploy with one line of code into the production environment.
Integration: Plug into OpenAI, LangChain, LlamaIndex, and many more.
Feature-rich: Dense & sparse embeddings, filtering, reranking, and beyond.

Zilliz Cloud is a fully-managed vector
database built on top of OSS Milvus
Open Source: Stable Milvus versions are continuously deployed to Zilliz Cloud
Flexible & Secure Deployment: Enterprise features for production-ready
Cardinal Search Engine & Use Case Optimized Compute: Milvus completely re-engineered to be optimized
Pipelines, Connectors, Model Library: A streamlined unstructured data platform

Milvus
Open Source Self-Managed
Milvus Discord
Join our community
github.com/milvus-io/milvus
Getting Started with Vector Databases
milvus.io/discord

AGENDA
01 AI Hallucinations and RAG
02 4 Challenges
03 Demo RAG
04 RAG Evaluation Methods
05 Demo Eval

01
AI Hallucinations
and RAG

Example AI Hallucination
gemini
wikipedia
hallucinated
answer

Why do models hallucinate?
• LLMs hallucinate because …
• They are trained on sequences of words (tokens)
Sample Data
The hamster cabinet …
!!@#%# …
Monkey eats shark …
trees in the moons…

Where do Vectors Come From?
Unstructured Data
Pre-trained Deep Learning Models
Vectors (embeddings)
Vector Database

Unstructured Data Vectors
Embedding
model
Generator
Model
or LLM

Semantic Similarity
Image from Sutor et al
Woman = [0.3, 0.4]
Queen = [0.3, 0.9]
King = [0.5, 0.7]
Man = [0.5, 0.2]
Queen - Woman + Man = King
  Queen = [0.3, 0.9]
- Woman = [0.3, 0.4]
        = [0.0, 0.5]
+ Man   = [0.5, 0.2]
  King  = [0.5, 0.7]
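The vector arithmetic on this slide can be checked with a few lines of Python. This is only a sketch using the slide's 2-dimensional toy vectors; real embeddings have hundreds or thousands of dimensions, and the helper names (`add`, `sub`, `nearest_word`) are made up for illustration.

```python
import math

# Toy 2-d "embeddings" copied from the slide.
vectors = {
    "Woman": [0.3, 0.4],
    "Queen": [0.3, 0.9],
    "King": [0.5, 0.7],
    "Man": [0.5, 0.2],
}

def sub(a, b):
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]

def add(a, b):
    """Element-wise a + b."""
    return [x + y for x, y in zip(a, b)]

def nearest_word(target, vocab):
    """Return the word whose vector has the smallest Euclidean distance to target."""
    return min(vocab, key=lambda word: math.dist(vocab[word], target))

# Queen - Woman + Man
result = add(sub(vectors["Queen"], vectors["Woman"]), vectors["Man"])
closest = nearest_word(result, vectors)
print([round(x, 2) for x in result], closest)
```

The nearest stored vector to the result is the one labeled King, which is the relationship the slide illustrates.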

Retrieval Augmented Generation (RAG)
Your Data
Embedding Model
Vector Database
Question
Question + Context
Search
Gen AI Model
Reliable Answers
What is the default AUTOINDEX distance metric in Milvus Client?
The default AUTOINDEX distance metric in Milvus Client is L2.
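The flow in this diagram (embed your data, search with the question, then send question plus context to the model) can be sketched in plain Python. Everything here is a stand-in chosen for illustration: `embed` is a toy bag-of-words counter rather than a real embedding model, the in-memory list replaces the vector database, and no Gen AI model is called; the sketch stops at building the prompt.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    cleaned = text.lower().replace("?", "").replace(".", "").replace(",", "")
    return Counter(cleaned.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# "Your Data": a few sentences taken from these slides.
docs = [
    "The default AUTOINDEX distance metric in Milvus Client is L2.",
    "HNSW organizes vectors in a hierarchical, multi-layered graph.",
    "DiskANN is an on-disk index for data too large to fit in memory.",
]

# Stand-in for the vector database: (text, vector) pairs held in memory.
index = [(doc, embed(doc)) for doc in docs]

def retrieve(question, top_k=1):
    """Search: rank stored docs by similarity to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

def build_prompt(question, top_k=1):
    """Question + Context: the text that would be sent to the Gen AI model."""
    context = "\n".join(retrieve(question, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

question = "What is the default AUTOINDEX distance metric in Milvus Client?"
prompt = build_prompt(question)
print(prompt)
```

In a real system the toy `embed` and the list would be replaced by an embedding model and a vector database, and `prompt` would be passed to an LLM.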

02
4 Challenges and
Lessons Learned

Pain Point #1: Choosing an Embedding Model
https://huggingface.co/spaces/mteb/leaderboard

Pain Point #1: Choosing an Embedding Model
Creator | Model                  | Embedding Dim | Context Length | Use Case / Tasks                     | Open Source | MTEB Score
OpenAI  | text-embedding-3-small | 512-1536      | 8K             | Real-time multilingual text chatbots | No          | 62 (1536), 62 (512)
OpenAI  | text-embedding-3-large | 256-3072      | 8K             | Real-time multilingual text chatbots | No          | 65 (3072), 62 (256)
Matryoshka Representation Learning:
https://arxiv.org/pdf/2205.13147v4.pdf
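The dimension ranges in the table (for example 512-1536) reflect that these embeddings can be shortened. A minimal sketch of the idea, assuming an embedding trained with Matryoshka Representation Learning so that the leading dimensions carry most of the information: keep the first `dim` values and re-normalize to unit length. The function name and the sample vector are hypothetical, and this is not how any particular API performs the shortening internally.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

# Made-up 6-d embedding, shortened to 2 dimensions.
full = [0.6, 0.8, 0.05, -0.02, 0.01, 0.03]
short = truncate_embedding(full, 2)
print(short)
```

Shorter vectors cost less to store and search, at the price of some accuracy, which is the trade-off the MTEB Score column shows.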

Pain Point #2: Choosing an Index
https://milvus.io/docs/index.md

● In-memory
○ Floating point dense
■ FLAT - The FLAT index is an exhaustive, brute-force approach that compares the query vector
against every single vector in the dataset to find the nearest neighbors. Suitable for small
datasets where perfect accuracy is required and search latency is not a concern.
■ IVF_FLAT - The IVF_FLAT (Inverted File FLAT) index is a quantization-based index that
divides the vector space into clusters. During indexing, vectors are assigned to the nearest
cluster centroid, and during search, only the vectors within the clusters closest to the query
vector are compared.
■ HNSW - HNSW organizes vectors in a hierarchical, multi-layered graph, so search
complexity is logarithmic. The basic idea is to separate nearest neighbours into layers in the
graph, where the top layer is the sparsest and the lowest layer contains every vector. Search is
performed from top to bottom.
○ Floating point sparse - SPLADE, BGE-M3
○ Binary
● On-disk - DiskANN, when your data is too large to fit in memory
● Hardware-optimized - GPU CAGRA, ARM
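The difference between FLAT and IVF_FLAT described above can be sketched in plain Python. This is a toy illustration, not Milvus code: the centroids are hand-picked here (a real index would typically learn them with k-means), the data is made up, and the `nprobe` parameter simply mirrors the idea of how many clusters to scan.

```python
import math

def flat_search(vectors, query, top_k):
    """FLAT: compare the query against every vector."""
    order = sorted(range(len(vectors)), key=lambda i: math.dist(vectors[i], query))
    return order[:top_k]

def build_ivf(vectors, centroids):
    """IVF indexing: assign each vector to its nearest centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for i, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))
        buckets[nearest].append(i)
    return buckets

def ivf_search(vectors, centroids, buckets, query, top_k, nprobe=1):
    """IVF search: scan only the `nprobe` clusters closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in buckets[c]]
    return sorted(candidates, key=lambda i: math.dist(vectors[i], query))[:top_k]

vectors = [[0.1, 0.1], [0.2, 0.0], [0.0, 0.2], [0.9, 0.9], [1.0, 0.8], [0.8, 1.0]]
centroids = [[0.1, 0.1], [0.9, 0.9]]  # hand-picked for the example
buckets = build_ivf(vectors, centroids)

query = [0.88, 0.95]
flat_hits = flat_search(vectors, query, top_k=2)
ivf_hits = ivf_search(vectors, centroids, buckets, query, top_k=2, nprobe=1)
print(flat_hits, ivf_hits)
```

Here both searches return the same neighbors, but the IVF search compared the query against only half of the vectors. When a true neighbor sits in a cluster that is not probed, IVF can miss it, which is the accuracy-for-speed trade-off.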

IVF-Flat
HNSW
https://arxiv.org/abs/1603.09320

Conversation
Data
Documentation
Data
Lecture or Q/A
Data
Pain Point #3: Chunking

Pain Point #3: Chunking
Conversation Data: add conversation memory
Documentation Data
Question Answer Data: use Q&A pair formatting

Pain Point #3: Chunks need more context
Tesla Roadster
2018
Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tem
2023
sed do eiusmod tem
Chunk #1
Chunk #2
Naive Chunks

Pain Point #3: Chunks need more context
Naive Chunks
Tesla Roadster
2018
sed do eiusmod tem
2023
sed do eiusmod tem
Better Chunks (each chunk is prefixed with the title 2 levels above and the title 1 level above)
Tesla Roadster 2018
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem
Tesla Roadster 2023
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem
LangChain: HTMLHeaderTextSplitter, ParentDocumentRetriever
LlamaIndex: HierarchicalNodeParser, AutoMergingRetriever
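The "better chunks" idea, carrying the enclosing titles into every chunk, can be sketched with a small hand-rolled function. This is an illustration only and not the implementation of HTMLHeaderTextSplitter or HierarchicalNodeParser; the `(tag, text)` input format and the function name are assumptions made for the example.

```python
def chunk_with_titles(blocks):
    """Turn (tag, text) blocks into chunks prefixed with their enclosing titles.

    Tags "h1", "h2", ... are headers; any other tag is body text.
    """
    titles = {}   # header level -> current title at that level
    chunks = []
    for tag, text in blocks:
        if tag.startswith("h") and tag[1:].isdigit():
            level = int(tag[1:])
            titles[level] = text
            # A new header invalidates any deeper-level titles.
            for deeper in [lvl for lvl in titles if lvl > level]:
                del titles[deeper]
        else:
            prefix = " ".join(titles[lvl] for lvl in sorted(titles))
            chunks.append(f"{prefix}\n{text}" if prefix else text)
    return chunks

body = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tem"
document = [
    ("h1", "Tesla Roadster"),
    ("h2", "2018"),
    ("p", body),
    ("h2", "2023"),
    ("p", body),
]

chunks = chunk_with_titles(document)
for chunk in chunks:
    print(chunk)
    print("---")
```

Without the prefix the two chunks are textually identical, so a retriever cannot tell the 2018 model from the 2023 one.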

Example

Pain Point #4: Keyword or Semantic Search?
Keyword search
Good for:
● Exact product name
● Jargon words
Examples:
● Product name = “2022 RF GT 6MT”
Semantic search
Good for:
● Similar meaning but maybe not exact
Examples:
● Similar image search
● Related wiki articles

Pain Point #4: Keyword or Semantic Search?
Dense Vector
Sparse Vector
TF-IDF
BM25
SPLADE
Lucene WAND pruning
BGE-M3
Top10 Top5
Final top_k
Prompt & Question
Improved context
Best of both worlds!
● Reranked Keyword AND Semantic top_k
● Put reranked into the Prompt Context
Keyword Search
Semantic Search
Rerankers: linear combination, cross-encoder, neural reranker
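Of the reranking options on this slide, the linear combination is the simplest to sketch. The example below assumes min-max normalization so that keyword scores (such as BM25) and semantic scores (such as cosine similarity) are on a comparable scale; the weight `alpha`, the document IDs, and the scores are all made up for illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rerank(keyword_scores, semantic_scores, alpha=0.5, top_k=3):
    """Linear combination: alpha * keyword + (1 - alpha) * semantic."""
    kw, sem = min_max(keyword_scores), min_max(semantic_scores)
    combined = {
        doc: alpha * kw.get(doc, 0.0) + (1 - alpha) * sem.get(doc, 0.0)
        for doc in set(kw) | set(sem)
    }
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(combined, key=lambda doc: (-combined[doc], doc))[:top_k]

keyword_scores = {"doc_a": 12.0, "doc_b": 7.0, "doc_c": 2.0}     # e.g. BM25
semantic_scores = {"doc_b": 0.91, "doc_c": 0.85, "doc_d": 0.40}  # e.g. cosine

final_top_k = hybrid_rerank(keyword_scores, semantic_scores, alpha=0.5, top_k=3)
print(final_top_k)
```

A document that appears in both candidate lists, like `doc_b` here, tends to rise to the top; the final top_k is what goes into the prompt context.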

Rerankers - when are they computed?
- Plain cosine similarity is called no interaction. This is dense-embedding “semantic
search”.
- BERT was an early-interaction model, meaning the relationship between question and docs is
pre-computed as part of the embedding model, offline.
- Cross-encoders are ML-model late interaction, calculated at query time. They are too
computation-heavy to run in real time except on a small top_k, to reduce it to a smaller top_2.
Cross-encoder reranking adds a classifier over (Q, A) pairs.
- ColBERT v2 is neural-model late interaction whose document embeddings are calculated offline,
before the user asks their question! ~2% increased accuracy, but requires storing extra embeddings.
- Cohere’s rerank-3 claims ~26% improvement over sparse only and 6% over dense.
- Jina.ai Reranker claims ~20% improvement over sparse only.
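To make the ColBERT-style late interaction concrete, here is a minimal sketch of MaxSim scoring as described in the ColBERT papers: each query token embedding is matched with its best-scoring document token embedding, and those maxima are summed. The 2-dimensional token embeddings and document names are made up; a real model would produce one vector per token with many more dimensions.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best match among the document tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Made-up token embeddings. Document token embeddings can be stored ahead of
# time; only the query tokens need to be embedded when the question arrives.
query = [[1.0, 0.0], [0.0, 1.0]]
stored_docs = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],   # covers both query tokens
    "doc_b": [[1.0, 0.0], [1.0, 0.0]],   # covers only the first query token
}

scores = {name: maxsim_score(query, tokens) for name, tokens in stored_docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

The extra storage mentioned on the slide comes from keeping one vector per document token instead of one vector per document.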

BERT vs ColBERT
BERT: SPLADE, BGE-M3
Query Top_k candidates
Final
top_k
https://arxiv.org/pdf/2112.01488.pdf

ColBERT v2 Reranker

Slide from Tengyu Ma, April 2024
talk at Unstructured Data
(+add Milvus metadata filtering)
Metadata
filtering (hash)

BGE M3-Embedding
● “Multi-vec” - Multi-vector retrieval uses fine-grained interactions between the query's and the passage's embeddings to compute the relevance score. Re-ranks the top-200 dense candidates for efficient processing.
● “Dense+Sparse” - Retrieve the top-1000 candidates with the dense and sparse methods, then re-rank using the sum of the two scores.
● “All” - Re-rank based on the sum of all three scores.
…
Multi-lingual retrieval performance on the MIRACL dev set (measured by nDCG@10).
https://arxiv.org/pdf/2402.03216

https://chat.lmsys.org/?leaderboard
chart by @maximelabonne

Mixtral 8x22B-Instruct-v0.1 with Anyscale Endpoints
https://console.anyscale.com/v2/playground

Question: What do the parameters for HNSW mean?
Prompt
GPT-3.5-turbo
Anyscale endpoints
Mixtral-8x22B-Instruct-v0.1

Is RAG dead?
2023 Lost-in-the-middle
2024 Needle-in-a-haystack experiments
https://github.com/gkamradt/LLMTest_NeedleInAHaystack

Is RAG dead?
Needle in haystack experiments
Slide from Lance Martin, Langchain
https://blog.langchain.dev/multi-needle-in-a-haystack/

03 Demo Custom RAG

04
RAG Evaluation
Methods

Retrieval Augmented Generation (RAG)
Your Data
Embedding Model
Vector Database
Question
Question + Context
Search
Gen AI Model
Reliable Answers
What is the default AUTOINDEX distance metric in Milvus?
The default AUTOINDEX distance metric in Milvus is L2.

Model Evals vs Production System Evals
Your RAG system
Arena Elo score

RAG Evaluation Methods
GPT-4 favors itself with a 10% higher win rate; Claude-v1 favors itself with a 25% higher win rate.
Open-weight Prometheus-eval aligns with human judgments up to 85% as of May 2024.

Known Problems with LLM-as-Judge
https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG
GPT-4 is not a good judge of comprehensiveness
GPT-4 matches human judgements on correctness & readability

Known Problems with LLM-as-Judge
AI scores max/min higher
Humans score medians higher

RAG Evaluation Methods
https://github.com/explodinggradients/ragas
faithfulness
context_precision
context_recall
Query
Context
answer_relevancy
Ground Truth
Answer
answer_correctness
answer_similarity
Response
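To give a feel for what the context metrics measure, here is a deliberately simplified sketch. It is not the Ragas implementation: Ragas computes these metrics with LLM judgments over the query, context, response, and ground truth, whereas this example assumes you already know which document IDs are relevant and just counts overlaps. The function names and IDs are hypothetical.

```python
def simple_context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for doc in retrieved_ids if doc in relevant_ids)
    return hits / len(retrieved_ids)

def simple_context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant chunks that were retrieved."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc in relevant_ids if doc in retrieved_ids)
    return hits / len(relevant_ids)

retrieved = ["doc_1", "doc_4", "doc_2", "doc_5"]   # what the retriever returned
relevant = {"doc_1", "doc_2", "doc_3"}             # hand-labeled ground truth

precision = simple_context_precision(retrieved, relevant)
recall = simple_context_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The two context metrics evaluate the retrieval step; the answer-side metrics on the slide (faithfulness, answer_relevancy, answer_correctness, answer_similarity) evaluate the generated response and are not covered by this sketch.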

05 Demo RAG Eval

THANK YOU
We need your stars!
https://github.com/milvus-io/milvus
💬 Join our discord: https://discord.gg/FjCMmaJng6

Open Source Zilliz Architecture
