Gen AI meetup
Technology
You said Large Language Model?
• Generative deep learning models for
understanding and generating text, images
and other modalities
• A special kind: Transformers
• “Attention Is All You Need”, Vaswani et al.,
2017 (https://arxiv.org/abs/1706.03762)
• Transformers analyse chunks of data, called
“tokens”, and learn to predict the next token
in a sequence
• The prediction is a probability distribution
over the vocabulary (see the sketch below)
• A model that generalizes: one single model
to address several use cases
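A minimal sketch of that next-token prediction, assuming the Hugging Face `transformers` library and the small "gpt2" checkpoint (both purely illustrative choices):

```python
# Minimal sketch: the model outputs one logit per vocabulary entry;
# softmax turns the last position's logits into next-token probabilities.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Attention is all you", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # (batch, seq_len, vocab_size)

probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over ~50k tokens
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}  p={p.item():.3f}")
```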
Focus on Language Models
Build the model - Training
What does it take?
• Foundational models
• Datasets
LLMs are trained with techniques that require huge text-based datasets, e.g.
“The Pile”: 880+ GB (Wikipedia, YouTube subtitles, GitHub, …)
“RedPajama”: 5+ TB (Wikipedia, StackExchange, arXiv, …)
Choosing and curating the training datasets is the secret sauce!
• Computing Power
Transformer-based models have a limitation: the attention mechanism has
quadratic complexity, so doubling the sequence length quadruples the work
(see the sketch below)
Computationally intensive for long sequences
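A minimal NumPy sketch of where that quadratic cost comes from: single-head scaled dot-product attention builds an n × n score matrix (the sizes below are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    # The score matrix is (n, n): every token attends to every other
    # token, so compute and memory grow quadratically with length n.
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

n, d = 1024, 64                      # sequence length, head dimension
Q, K, V = (np.random.randn(n, d) for _ in range(3))
print(attention(Q, K, V).shape)      # (1024, 64); scores were (1024, 1024)
```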
Common patterns
• Context
The size of the input given to the model:
this size is limited!
• Prompt
The question / the task, enriched with a
‘pre-prompt’ (system instructions)
• Zero-shot / Few-shot, …
Whether or not to give samples of the
expected answers
• Temperature
How “imaginative” the model is allowed
to be (see the sketch below)
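A minimal sketch tying these patterns together, assuming the `openai` Python SDK; the model name and the few-shot example pair are illustrative only:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",   # illustrative model name
    temperature=0.2,       # low temperature -> less "imaginative" output
    messages=[
        # 'pre-prompt': frames the task before the real question
        {"role": "system", "content": "Answer with a single city name."},
        # few-shot: one sample Q/A pair showing the expected format
        {"role": "user", "content": "Which city hosts the European Parliament?"},
        {"role": "assistant", "content": "Strasbourg"},
        # the actual question; everything above counts against the context size
        {"role": "user", "content": "Which city hosts the Cannes film festival?"},
    ],
)
print(response.choices[0].message.content)  # expected: "Cannes"
```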
Use the model - Inference
Which Model ?
Criteria to take into account for a use case
• Open Source vs Commercial
• Best of breed
• Versioning & lifecycle
• Cost efficiency vs overkill -> Size
• Accuracy
At the heart of the machine
• On Premises
• Compute: GPUs choice / VRAM size / Model
quantization
• NVIDIA T4 = 16 GB / ~$1,100
• NVIDIA A100 = 80 GB / ~$8,000
• Scalability : concurrent users, context size
• Online vs batch
• On Cloud
• Which one ? Cost, diversity and availability
• Pricing model: 1M tokens adds up very fast!
1 token ≈ 4 characters ≈ 0.75 words
(see the cost sketch below)
• Sovereignty, data privacy
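A minimal sketch of a back-of-the-envelope cost estimate, assuming the `tiktoken` tokenizer; the per-token price is hypothetical:

```python
import tiktoken

PRICE_PER_1M_INPUT_TOKENS = 0.50  # USD, hypothetical price
enc = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the attached incident report in three bullet points."
n_tokens = len(enc.encode(prompt))
cost = n_tokens / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS
print(f"{n_tokens} tokens -> ~${cost:.6f} for this prompt alone")
# Multiply by requests/day, then add output tokens: 1M tokens goes fast.
```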
Infrastructure
Real-world usage
Aka your search engine 2.0
A very common use case:
“Retrieval-Augmented Generation”
RAG - 101
Search & Summarize in 4 Steps
Step 1 - Document loading
• Documents are loaded from data
connectors
• They are split into chunks (see the
sketch below)
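A minimal sketch of loading and chunking, assuming plain-text files in a hypothetical corpus/ folder; the chunk size and overlap are illustrative:

```python
from pathlib import Path

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Overlapping windows keep sentences that straddle a boundary
    # visible in both neighbouring chunks.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

docs = [p.read_text(encoding="utf-8") for p in Path("corpus").glob("*.txt")]
chunks = [c for doc in docs for c in chunk_text(doc)]
print(f"{len(docs)} documents -> {len(chunks)} chunks")
```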
Step 2 - Embeddings
• Chunks are 'transformed' into
vectors (arrays of numbers)
✓ This is the embedding process,
using a pre-trained model
✓ Hundreds (even thousands!) of
dimensions are needed to
represent the space of all words
• Vectors are stored in a dedicated
database (a vector database); see
the sketch below
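A minimal sketch of Step 2, assuming the `sentence-transformers` library; "all-MiniLM-L6-v2" is one illustrative pre-trained embedding model:

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(chunks)  # `chunks` from the Step 1 sketch
print(vectors.shape)               # (n_chunks, 384): 384 dimensions each
# In production the vectors go to a vector database (FAISS, pgvector,
# a managed service, ...) rather than staying in memory.
```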
Step 3 - Retrieval
• Previous steps were preparatory
work, now comes the live part
• The question is vectorized as well,
and used as the input for a
similarity search
• The most relevant chunks are
retrieved, i.e. those whose vectors
are closest to the question's vector
(see the sketch below)
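A minimal sketch of Step 3: a brute-force cosine-similarity search in NumPy, reusing `embedder`, `vectors` and `chunks` from the previous sketches; a vector database does the same thing at scale:

```python
import numpy as np

def top_k(question: str, k: int = 3) -> list[str]:
    q = embedder.encode([question])[0]    # vectorize the question
    sims = vectors @ q / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(q)
    )                                     # cosine similarity per chunk
    best = np.argsort(sims)[::-1][:k]     # indices of the k closest vectors
    return [chunks[i] for i in best]

context_chunks = top_k("What is our refund policy?")
```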
Step 4 - Generation
• Retrieved chunks are used to feed
the LLM prompt context
• The question is added to the prompt
• The LLM reads the prompt and
generates a natural-language
answer (see the sketch below)
• During this inference step, the
model requires a lot of GPU power!
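A minimal sketch of Step 4, reusing the hypothetical `client` from the earlier patterns sketch and `context_chunks` from Step 3; the model name remains illustrative:

```python
question = "What is our refund policy?"
prompt = (
    "Answer the question using ONLY the context below.\n\n"
    "Context:\n" + "\n---\n".join(context_chunks)
    + f"\n\nQuestion: {question}"
)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.0,  # factual Q&A: no creativity wanted
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```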
RAG engineering
Lots of moving parts to reach performance!
• Data pipeline: flow / batch, data policy & criticality, deduplication,
data cleaning, attachments (images, PDF), PII / anonymization
• Chunking strategy
• Embedding model: size, language, tokenizer
• Vector DB choice: cloud / local, vector dimensions & reduction
• Retrieval config (top_k, similarity), re-ranking, MMR score
• RAG techniques (Corrective, Self-reflective, RAG-Fusion, HyDE)
• Chat memory
• Model config (temperature, top_k, top_p)
• Model evaluation / drift (BLEU/ROUGE, precision, recall, F1 score,
Ragas, TruLens, human feedback)
• Prompt engineering
• Guardrails (hallucinations, NSFW, …)
• Model comparison / Vertex SxS
• Performance (TTFT, TPS, …)
• PII / anonymization (again)
• UI integration
• LLMOps / MLOps
• Cost efficiency
Fine-tuning?
OpenAI’s strategy
Demo time!
