Nour and Maria present the work they did at Tweag, Modus Create's innovation arm, where the GenAI team developed an evaluation framework for Retrieval-Augmented Generation (RAG) systems. RAG systems provide an easy, low-cost way to extend the knowledge of Large Language Models (LLMs), but measuring their performance is not an easy task.
The presentation will review existing evaluation frameworks, ranging from those based on the traditional ML approach of using ground-truth datasets, including Tweag's, to those that use LLMs to compute evaluation metrics.
It will also delve into the practical implementation of Tweag's chatbot over two distinct document datasets and provide insights on chunking, embedding, and how open-source and commercial LLMs compare.
4. Retrieval Augmented Generation
● LLMs have a knowledge cutoff
● Fine-tuning is costly
● Adding relevant context to the prompt is cheap and easy
● Find relevant context with semantic search
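A minimal sketch of that context injection, assuming nothing about Tweag's actual prompt (the function name and wording here are ours):

```python
# Hypothetical sketch: prepend retrieved chunks to the user question so the
# LLM answers from fresh context rather than from its training data alone.
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```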
5. Semantic search
● Vectorizing a document base:
○ Chunking
○ Embedding/Vectorizing
○ Indexing
● Finding documents similar to a query:
○ Vectorize the query
○ Find the closest vectors
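A hedged sketch of this pipeline, using sentence-transformers for embeddings and brute-force cosine similarity in place of a real vector index; the model name and chunk size are illustrative assumptions, not the settings used in the talk:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, not the talk's

def chunk(text: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking; real systems often split on sentences or headings.
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(documents: list[str]) -> tuple[list[str], np.ndarray]:
    # Chunk every document, then embed each chunk into a vector.
    chunks = [c for doc in documents for c in chunk(doc)]
    vectors = model.encode(chunks, normalize_embeddings=True)
    return chunks, vectors

def search(query: str, chunks: list[str], vectors: np.ndarray, k: int = 3) -> list[str]:
    # Vectorize the query and return the k chunks with the closest vectors.
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q  # cosine similarity (embeddings are normalized)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```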
8. The GenAI team at Tweag has been applying the Retrieval-Augmented Generation (RAG) paradigm, together with commercial and open-source LLMs, to perform intelligent search and suggestion over a collection of Confluence and Bazel documents. The LLM processing can be carried out within a virtual private cloud (AWS in this case), so that no information is shared with third parties.
10. Experimenting vs "eyeballing"
- No benchmark: no guarantee that an introduced change did not degrade performance on other questions.
- No experiment tracking: likely none of the intermediate states was committed or properly tracked.
- No evaluation metrics: we cannot numerically compare the current RAG state to any other possible state.
- No solution space: what alternatives are we exploring?
13. Benchmark
● Benchmark over the document database:
○ Questions
○ Pairs of (question, answer)
○ Pairs of (question, relevant_documents)
● Not easy: queries must be representative and varied
(Diagram: human-generated vs. LLM-generated benchmarks)
● Can be automated with LLMs:
○ Generate questions over documents
○ Reformulate questions
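An illustrative sketch of that automation; the API client, model name, and prompt wording are our assumptions, not the setup used in the talk. Each document chunk is turned into a (question, relevant_document) pair:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_question(chunk: str, model: str = "gpt-4o-mini") -> str:
    # Ask the LLM for a question that the given chunk answers.
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": "Write one factual question that the following text answers. "
                       f"Return only the question.\n\n{chunk}",
        }],
    )
    return response.choices[0].message.content.strip()

# benchmark = [(generate_question(c), c) for c in chunks]  # chunks: your document chunks
```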
15. Evaluation metrics
● Information Retrieval metrics (traditional ML)
○ Labeled dataset
○ Evaluate recall and precision at k
● LLM-based evaluation
○ Context relevance
■ ratio of relevant to total sentences in the retrieved documents
○ Context recall
■ fraction of the ground-truth answer supported by the retrieved documents
(Diagram: information retrieval metrics vs. LLM-based RAG metrics)
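A minimal sketch of the information retrieval metrics over a labeled benchmark of (question, relevant_documents) pairs; the function name is ours:

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    # precision@k: fraction of the top-k retrieved documents that are relevant.
    # recall@k: fraction of all relevant documents found in the top k.
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Average both metrics over the whole benchmark, e.g.:
# scores = [precision_recall_at_k(search(q, chunks, vectors), rel, k=3)
#           for q, rel in benchmark]
```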
20. Takeaways
● You need to evaluate your system, no eyeballing!
● Many frameworks and tools: check our blog posts for an introduction.
https://www.tweag.io/group/genai/