Nour and Maria present the work they did at Tweag, Modus Create's innovation arm, where the GenAI team developed an evaluation framework for Retrieval-Augmented Generation (RAG) systems. RAG systems provide an easy and low-cost way to extend the knowledge of Large Language Models (LLMs), but measuring their performance is not an easy task. The presentation will review existing evaluation frameworks, ranging from those based on the traditional ML approach of using ground-truth datasets, including Tweag's, to those that use LLMs to compute evaluation metrics. It will also delve into the practical implementation of Tweag's chatbot over two distinct document datasets and provide insights on chunking, embedding, and how open-source and commercial LLMs compare.