1. Question Answer Pair Auto Generation:
Reliability and Consistency Assessment for Large Language Models Applications
- Jyotirmoy Sundi
2. Evaluation/Testing of LLM Apps is a Pain
● Manual Testing
○ Done by a small set of internal users, or outsourced to external testers
○ Driven by bugs found in prod: developers update prompts, extraction, LLM chaining, reasoning, etc.
○ Time-consuming and error-prone, leading to inaccurate results.
● Coverage
○ 100% coverage of a large corpus is hard; it would require an army of manual testers
● High Variability in Output
○ Output varies with user inputs, prompts, and providers like OpenAI, Cohere, Claude, Google PaLM, etc.
● Reliability & Hallucinations
○ Handling different chat contexts and intents
○ Handling abrupt topic switches: when the user suddenly asks about a new concept, the topics/intents from previous chat messages become useless
● Privacy
○ Redaction/anonymization/masking of PII data before sending it to an LLM provider endpoint
○ Adherence to updated GDPR/CCPA/EU government policies
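The PII point above can be sketched in a few lines. This is a minimal, hypothetical masking pass (the patterns and placeholder labels are illustrative assumptions, not Datacraft's actual implementation, and regex alone is not production-grade PII detection):

```python
import re

# Hypothetical PII patterns; real systems combine many detectors (NER, checksums, etc.)
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Mask each PII match with a typed placeholder like [EMAIL] before the
    text is sent to an external LLM provider endpoint."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact jane@example.com, SSN 123-45-6789."))
# → Contact [EMAIL], SSN [SSN].
```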
3. Advantages of Datacraft
● Automated question/answer data generation
○ Reduced manual testing
● Improved coverage
○ Not 100%, but much higher coverage is achievable depending on your budget
● Reduced bias
○ Consistent set of ground truths to rank LLM endpoints
● Enhanced testing efficiency
○ With a ground truth dataset evaluate at scale and quickly
● Consistency
○ Test consistency across any changes in chaining, prompts, RAG, or provider updates
● Ranking RAG responses systematically
○ Test multiple LLM apps based on prompts/providers/rag techniques to choose a winner before rolling out to customers
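One simple way to rank candidate LLM apps against a shared ground-truth QA set is token-overlap F1 (SQuAD-style). The function names and the scoring choice below are illustrative assumptions, not Datacraft's actual ranking method:

```python
from collections import Counter

def f1_overlap(pred: str, truth: str) -> float:
    """Token-overlap F1 between a predicted and a ground-truth answer."""
    p, t = pred.lower().split(), truth.lower().split()
    common = sum((Counter(p) & Counter(t)).values())  # multiset intersection
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(t)
    return 2 * precision * recall / (precision + recall)

def rank_apps(ground_truth: dict, app_answers: dict) -> list:
    """Average F1 per app over the shared QA set; highest score first."""
    scores = {
        app: sum(f1_overlap(ans, ground_truth[q]) for q, ans in answers.items()) / len(answers)
        for app, answers in app_answers.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In practice you would swap F1 for an embedding-based or LLM-judged similarity, but the ranking loop stays the same: fixed ground truth, varying prompts/providers/RAG setups.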
4. Overview of Datacraft
● Stratified sampling
○ A method for selecting samples from a diverse population by dividing it into subgroups (strata) based on specific characteristics
○ Ensures that our QA dataset accurately represents various types of questions, with high coverage across the corpus
● Verified QA prompts for various scenarios like blogs, READMEs, text files, and catalogs
○ Curated Prompts are verified and tested to ensure they are effective in generating QA datasets
○ Addition of more prompts is easy
● Generation with context injection of sampled data & selected prompts from each strata
○ Using Language Models, we create questions and answers based on the sampled data.
○ Context Injection: We inject relevant context from the sampled documents into the generated QA pairs
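The slide's pipeline (stratify, sample, pick a verified prompt per stratum, inject context) can be sketched as follows. All names, prompt templates, and the per-stratum sample budget are hypothetical stand-ins, not Datacraft's actual API:

```python
import random
from collections import defaultdict

def stratified_sample(docs, strata_key, per_stratum, seed=0):
    """Group documents into strata by a key (e.g. document type) and sample
    each group, so every subgroup of the corpus is represented."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[strata_key(doc)].append(doc)
    return {
        name: rng.sample(group, min(per_stratum, len(group)))
        for name, group in strata.items()
    }

# Hypothetical verified prompt templates, one per stratum.
PROMPTS = {
    "blog": "Write a question and answer grounded only in this blog excerpt:\n{context}",
    "readme": "Write a question and answer about this README section:\n{context}",
}

def build_generation_requests(samples):
    """Context injection: pair each sampled document's text with its
    stratum's prompt, ready to send to an LLM for QA generation."""
    return [
        PROMPTS.get(name, PROMPTS["blog"]).format(context=doc["text"])
        for name, group in samples.items()
        for doc in group
    ]
```

The seed keeps sampling reproducible, so the same ground-truth QA set can be regenerated when re-ranking providers later.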
5. Use Cases
● Question answering on any data source
○ PDFs / CSV / JSON / Text files
○ README.md
○ Online Blogs
■ Imagine reading a blog with a seamlessly integrated, personalized Q&A section, thoughtfully designed for easy navigation and comprehension of the content; this could increase inquiry, engagement, conversions, signups, etc.
○ API/SDK Docs
○ Databases
○ Commerce Catalogs
● Synthetic Dataset generation for any custom model development
○ AI/ML training of tabular or text data
○ Generate synthetic data for named entity recognition (NER) models, whether for a company's custom/private entities or for common ones like names, credit card numbers, and SSNs
■ Help in data redaction/anonymization
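A common way to produce such synthetic NER data is template filling: slot fake entity values into sentence templates and record their character spans as labels. The templates and value lists below are illustrative assumptions (a real setup would use a much larger pool, e.g. via a faker library):

```python
import random
import re

# Hypothetical templates and fake values; spans double as redaction targets.
TEMPLATES = ["My name is {NAME} and my SSN is {SSN}."]
FAKE_VALUES = {
    "NAME": ["Alice Chen", "Ravi Patel"],
    "SSN": ["123-45-6789", "987-65-4321"],
}

def generate_example(rng: random.Random):
    """Return (sentence, [(start, end, label)]) for NER training."""
    template = rng.choice(TEMPLATES)
    text, spans = "", []
    # Split keeps the {LABEL} slots as separate tokens via the capture group.
    for part in re.split(r"(\{[A-Z]+\})", template):
        if part.startswith("{"):
            label = part[1:-1]
            value = rng.choice(FAKE_VALUES[label])
            spans.append((len(text), len(text) + len(value), label))
            text += value
        else:
            text += part
    return text, spans
```

Because each generated span is known exactly, the same data can train the redaction/anonymization models mentioned above.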