www.bentoml.com
Unstructured Data Meetup
Infrastructure Challenges
in Scaling RAG with
Custom AI models
Chaoyu Yang
Founder/CEO of BentoML
Jun 3, 2024
Agenda
• Custom AI models in RAG systems - why you should care about leveraging your data and custom models for improving RAG performance
• Deploying custom model inference APIs - learn best practices in serving your fine-tuned text embedding model or LLM as inference APIs for RAG
• Advanced inference patterns for RAG - running multi-model inference pipelines as online inference APIs and batch offline inference jobs for RAG
Simple RAG System
[Diagram] Unstructured and structured data → chunks → text embedding model → Vector DB (embeddings); user query → retrieved chunks → large language model → response generation
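The retrieval step in the diagram above can be sketched end to end. Below is a minimal, illustrative version using toy 2-dimensional vectors and brute-force cosine similarity; in a real system the embeddings come from a text embedding model and the search runs inside a vector DB, so all names and data here are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, index, top_k=2):
    """Return the top_k chunk texts most similar to the query vector.
    index: list of (chunk_text, embedding) pairs."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:top_k]]

# Toy index: in practice these vectors come from an embedding model.
index = [
    ("refund policy", [0.9, 0.1]),
    ("shipping times", [0.1, 0.9]),
    ("returns and refunds", [0.8, 0.2]),
]
top_chunks = retrieve([1.0, 0.0], index)
```

The retrieved chunks would then be stuffed into the LLM prompt for response generation.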
Production RAG Challenges
Retrieval Performance
• Recall: Not all chunks that are relevant to the user query are retrieved.
• Precision: Not all chunks retrieved are relevant to the user query.
• Data Ingestion: Handling complex documents and unstructured data
Response Synthesis
• Safeguarding: Is the user query toxic or offensive?
• Context Accuracy: Retrieved chunks lacking necessary context or containing misaligned context
Evaluate RAG responses
• Synthetic evaluation dataset: Use LLMs to bootstrap evaluation dataset
• LLMs as evaluators: Use LLMs to evaluate end-to-end RAG performance
Why fine-tune your Text Embedding Model?
The “default” text-embedding-ada-002 is ranked 57th on the MTEB leaderboard for English
Fine-tuning optimizes embedding representations over your specific dataset
Best performing LLMs are still mostly proprietary models
Fine-tuned OSS models offer comparable performance on specific tasks
Why host your own LLM?
Key questions to ask:
• What level of control do you need for security & data privacy?
• What are your latency and SLA requirements?
• What specific capabilities do you need from the LLM?
• What’s the cost of running the LLM at scale?
• What’s the total cost of ownership for hosting and maintaining custom LLMs?
Document Processing & Understanding
• Document Layout Analysis (LayoutLM)
• Table detection, structure recognition and analysis (Table Transformer, TATR)
• OCR, optical character recognition (EasyOCR, Tesseract)
• Visual Document QA (LayoutLM v3, Donut)
• Fine-tuning on your specific documents
And many more…
Context-Aware Chunking
And global-concept-aware chunking

Metadata Extraction for Improved Retrieval Accuracy
And additional context for response synthesis
Text Chunk: “I recently purchased the product and I'm extremely satisfied with it! …”
Metadata: { datetime: “2024-04-11”, product: “..”, user_id: “..”, sentiment: “positive”, summary: “..”, topics: [..] }
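Extracted metadata lets the retriever pre-filter candidate chunks before (or after) vector search. A minimal sketch of the idea; the `Chunk` type and field names are illustrative, not from any particular framework:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def filter_chunks(chunks, **required):
    """Keep only chunks whose extracted metadata matches all required key/value pairs."""
    return [c for c in chunks if all(c.metadata.get(k) == v for k, v in required.items())]

chunks = [
    Chunk("I'm extremely satisfied with it!", {"sentiment": "positive", "product": "widget"}),
    Chunk("It broke after a week.", {"sentiment": "negative", "product": "widget"}),
]
positives = filter_chunks(chunks, sentiment="positive")
```

In production, the same metadata fields would typically be stored alongside the embeddings and filtered inside the vector DB query itself.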
Reranker Model
Fine-tuned with your dataset, a reranker generally performs ~10-30% better
Building
Inference APIs
For Custom Models
From Inference Script to Serving Endpoint
So you’ve got a fine-tuned embedding model ready
From Inference Script to Serving Endpoint
Make simple things simple: build inference APIs for your fine-tuned text embedding model
Serving Optimizations: Dynamic Batching
Dynamically forming small batches, breaking down large batches, auto-tuning batch size
Dynamic batching typically brings up to 3x faster response time and ~200% improved throughput for embedding serving
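The core mechanic of dynamic batching is simple: a worker drains a request queue into a batch until either a size cap or a wait deadline is hit. Below is a minimal, stdlib-only sketch of that loop, not BentoML's actual implementation (which also auto-tunes batch size and splits oversized batches); all names are illustrative.

```python
import queue
import threading
import time

def batch_worker(requests, handle_batch, max_batch=8, max_wait_ms=5.0):
    """Pull requests off a queue, forming a batch until either max_batch
    items are collected or max_wait_ms has elapsed since the first item."""
    while True:
        first = requests.get()
        if first is None:           # shutdown sentinel
            return
        batch = [first]
        deadline = time.monotonic() + max_wait_ms / 1000.0
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                item = requests.get(timeout=remaining)
            except queue.Empty:
                break
            if item is None:        # flush the current batch, then stop
                handle_batch(batch)
                return
            batch.append(item)
        handle_batch(batch)

# Demo: 5 requests get grouped into batches of at most 3.
batches = []
q = queue.Queue()
worker = threading.Thread(target=batch_worker, args=(q, batches.append),
                          kwargs={"max_batch": 3, "max_wait_ms": 50.0})
worker.start()
for i in range(5):
    q.put(f"req-{i}")
q.put(None)
worker.join()
```

In a real embedding service, `handle_batch` would run one forward pass over the whole batch and route each result back to its caller.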
Deployment & Serving Infrastructure
External Queue, Auto-scaling, Instance Selection, Traffic control, Concurrency Control, and more
Serving Optimization
• Continuous Batching, KV-caching
• Paged Attention, Flash Attention
• Speculative Decoding
• Operator fusion
• Quantization
• Output Streaming
Important Metrics
• Time to first token (TTFT)
• Time per output token (TPOT)
• End-to-end Latency
• Throughput
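All four metrics fall out directly from per-token timestamps. A small sketch of the arithmetic; the function and key names are illustrative, not from any particular serving framework:

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT, TPOT, end-to-end latency, and throughput from
    per-token arrival timestamps (all times in seconds)."""
    ttft = token_times[0] - request_start          # time to first token
    e2e = token_times[-1] - request_start          # end-to-end latency
    # Time per output token, measured over the decode phase after the first token.
    tpot = (e2e - ttft) / (len(token_times) - 1) if len(token_times) > 1 else 0.0
    throughput = len(token_times) / e2e            # output tokens per second
    return {"ttft_s": ttft, "tpot_s": tpot, "e2e_s": e2e, "throughput_tok_s": throughput}

# Example: 4 tokens, first arriving 200 ms after the request, then one every 50 ms.
m = latency_metrics(0.0, [0.2, 0.25, 0.30, 0.35])
```

TTFT dominates perceived responsiveness for streaming UIs, while TPOT governs how fast the rest of the answer appears.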
Self-hosting LLMs
Recommendations
• The field of LLM inference backends is rapidly evolving and heavily researched
• Choosing the right backend depends on your workload type, optimization target, quantization method and model type
• Developer experience can be a significant factor given the complexity of model compilation and integrating fine-tuned models
Self-hosting LLMs
Backend comparison: LMDeploy vs. TensorRT-LLM vs. vLLM vs. MLC-LLM vs. TGI

Quantization Support
• LMDeploy: Yes. Supports 4-bit AWQ and 8-bit quantization options; also supports 4-bit KV quantization.
• TensorRT-LLM: Partially. Quantization via modelopt, but quantized data types are not implemented for all models.
• vLLM: Not fully supported as of now. Models must be quantized through AutoAWQ or found pre-quantized on HF; performance is under-optimized.
• MLC-LLM: Yes. Supports 3-bit and 4-bit group quantization options; AWQ quantization support is still experimental.
• TGI: Offers AWQ, GPTQ and bits-and-bytes quantization.

Supported Model Architectures
• LMDeploy: About 20 models supported by the TurboMind engine.
• TensorRT-LLM: 30+ models supported.
• vLLM: 30+ models supported.
• MLC-LLM: 20+ models supported; does not include some models like Cohere Command-R, Arctic, etc.
• TGI: 20+ models supported.

Hardware Limitation
• LMDeploy: Only optimized for Nvidia CUDA.
• TensorRT-LLM: Only supports Nvidia CUDA.
• vLLM: Nvidia CUDA, AMD ROCm, AWS Neuron, CPU.
• MLC-LLM: Nvidia CUDA, AMD ROCm, Metal, Android, iOS, WebGPU.
• TGI: Nvidia CUDA, AMD ROCm, Intel Gaudi, AWS Inferentia.
Self-hosting LLMs
Scaling LLM Inference Services
Autoscaling: request-based metrics vs. resource-utilization metrics
GPU Utilization
• A “fully utilized” GPU can often handle a lot more traffic in LLM serving
• Reflects usage only after resources have been consumed, resulting in conservative scale-up behavior that doesn’t match demand
QPS
• For LLM APIs, the cost of each request is not uniform
• Hard to configure the right QPS target
Concurrency
• Accurately reflects load and desired replica count
• Easy to configure based on GPU count and max batch size
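Concurrency-based scaling reduces to simple arithmetic: target replicas equal in-flight requests divided by per-replica capacity, where capacity derives from GPU count and max batch size. A sketch with illustrative parameter names (this is not any autoscaler's actual API):

```python
import math

def desired_replicas(in_flight, max_concurrency_per_replica,
                     min_replicas=1, max_replicas=10):
    """Concurrency-based autoscaling target.
    in_flight: current number of requests being processed or queued.
    max_concurrency_per_replica: e.g. GPUs per replica * max batch size."""
    want = math.ceil(in_flight / max_concurrency_per_replica)
    return max(min_replicas, min(max_replicas, want))
```

For example, with a per-replica capacity of 32 concurrent requests, 100 in-flight requests scale to 4 replicas, and idle traffic falls back to the configured minimum.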
Most files in a container image are never used. Stream-loading the container image based on the files actually requested can drastically speed up container download and startup time, from minutes to seconds.
Scaling LLM Inference Services
Cold start optimization for large container images with many unused files
Cold start optimization for loading large model weight files
GenAI inference requires specialized infrastructure, such as streaming model loading and efficient caching, to accelerate this process.
Scaling LLM Inference Services
bentoml.com/blog/scaling-ai-model-deployment
Auto-scaling based on traffic, request queue, and resource utilization
Scaling LLM Inference Services
Advanced
Inference Patterns
For RAG systems
Apply “Small” Language Models
Example: toxic query detection with GPT-3.5 vs. a fine-tuned text classification model

GPT-3.5, prompted: “Determine if a user query is toxic. Reply yes or no. Here are some examples: {EXAMPLES} {User Query} Response:” → Yes / No
• Assuming 400 tokens per input sequence
• Cost: $200 per 1 million sequences
• Latency: 8 seconds

BertForSequenceClassification, input: “{User Query}” → Yes / No
• Cost: $0.07 per 1 million sequences, a ~2800x improvement
• Latency: 50 milliseconds, a 160x improvement
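The cost gap is straightforward arithmetic over per-token pricing. The sketch below reproduces the slide's figures under an assumed API price of about $0.50 per 1M input tokens (roughly GPT-3.5 Turbo's input rate at the time; the exact price is my assumption, not stated on the slide):

```python
def cost_per_million_sequences(tokens_per_sequence, usd_per_million_tokens):
    """USD to process 1M input sequences at a per-token API price.
    1M sequences * tokens/seq * (price / 1M tokens) = tokens/seq * price."""
    return tokens_per_sequence * usd_per_million_tokens

# Assumed ~$0.50 per 1M input tokens at 400 tokens/sequence.
llm_cost = cost_per_million_sequences(400, 0.50)
classifier_cost = 0.07   # slide's figure for the self-hosted BERT classifier
improvement = llm_cost / classifier_cost
```

At these assumed prices the LLM route costs $200 per million sequences, and the fine-tuned classifier's $0.07 works out to roughly a 2850x reduction, matching the slide's ~2800x claim.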
Apply “Small” Language Models
Cold start and scaling with huge container and model files
[Diagram] Distributed Bento deployment: user request → /generate → LLMRouter → toxic classifier and Mistral LLM
Document Processing Pipelines
Layout analysis, Table extraction, Image understanding and OCR for RAG data ingestion pipeline
Long-running inference tasks
Async task submission is ideal for large PDF file ingestion tasks
Large scale Batch Inference Jobs
• Bring compute (models) to your data via BentoCloud BYOC
• Custom deployment and easy integration via Snowflake External Functions or Spark UDFs
• Right-size GPU clusters based on your actual workload, leveraging the same fast auto-scaling infrastructure as real-time inference
Ingest and index documents right from your cloud data warehouses
github.com/bentoml/rag-tutorials
• Define the entire RAG service’s components within one Python file
• Compile to one versioned unit for evaluation and deployment
• Each model inference component can be assigned to different GPU shapes and scaled independently
• Model serving and inference best practices and optimizations baked in
• Auto-generated production monitoring dashboard
github.com/bentoml/rag-tutorials
RAG-as-a-service
Fully private RAG deployment in your own
infrastructure
Production RAG System Components
• LLMs
• Reranker
• Text Embedding
• Layout Analysis
• Chunking
• Classification for routing and tool use
• OCR
• Summarization
• Visual Reasoning

AI-powered Tools
• Model Selection
• Text Embedding
• Synthetic Data Generation
• LLM-Based Evaluators
• Summarization
• Entity Extraction
Summary
• Production RAG often involves running multiple open-source or custom models for better performance and data privacy protection.
• “State-of-the-art AI results are increasingly obtained by compound systems with multiple components, not just monolithic models.” - Compound AI Systems (Matei et al., 2024)
• Scaling RAG with multiple components, each with different computation and scaling needs, requires specialized infrastructure for scaling the inference workload.
Inference Platform
For Fast-Moving
AI Teams
github.com/bentoml/BentoML
BentoML
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
 
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python CodebaseEuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
 
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptxUse Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
 
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSECHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
 
Salesforce AI & Einstein Copilot Workshop
Salesforce AI & Einstein Copilot WorkshopSalesforce AI & Einstein Copilot Workshop
Salesforce AI & Einstein Copilot Workshop
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
 
IPLOOK Remote-Sensing Satellite Solution
IPLOOK Remote-Sensing Satellite SolutionIPLOOK Remote-Sensing Satellite Solution
IPLOOK Remote-Sensing Satellite Solution
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
 
How to build a generative AI solution A step-by-step guide (2).pdf
How to build a generative AI solution A step-by-step guide (2).pdfHow to build a generative AI solution A step-by-step guide (2).pdf
How to build a generative AI solution A step-by-step guide (2).pdf
 
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
 
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
 
Opencast Summit 2024 — Opencast @ University of Münster
Opencast Summit 2024 — Opencast @ University of MünsterOpencast Summit 2024 — Opencast @ University of Münster
Opencast Summit 2024 — Opencast @ University of Münster
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
 
“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...
“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...
“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...
 
WhatsApp Spy Online Trackers and Monitoring Apps
WhatsApp Spy Online Trackers and Monitoring AppsWhatsApp Spy Online Trackers and Monitoring Apps
WhatsApp Spy Online Trackers and Monitoring Apps
 
High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...
High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...
High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...
 
WPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide DeckWPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide Deck
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
 
Choose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presenceChoose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presence
 

Infrastructure Challenges in Scaling RAG with Custom AI models

Document Processing & Understanding
• Document Layout Analysis (LayoutLM)
• Table detection, structure recognition, and analysis (Table Transformer, TATR)
• OCR, optical character recognition (EasyOCR, Tesseract)
• Visual Document QA (LayoutLMv3, Donut)
• Fine-tuning on your specific documents
And many more…
Context-aware chunking and global-concept-aware chunking.
Metadata extraction for improved retrieval accuracy, and for additional context during response synthesis:
• Text chunk: “I recently purchased the product and I'm extremely satisfied with it! …”
• Metadata: { datetime: ‘2024-04-11’, product: “..”, user_id: “..”, sentiment: “positive”, summary: “..”, topics: [..] }
A reranker model fine-tuned on your dataset generally performs ~10-30% better.
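The metadata-enriched chunk above can be sketched as a small data structure: each chunk carries extracted fields that can serve as retrieval filters and as extra context at generation time. The `extractor` here is a stand-in for any model call (e.g. an LLM prompted to return structured fields); the stub and field names are illustrative, not a specific API.

```python
from dataclasses import dataclass, field

@dataclass
class EnrichedChunk:
    # A text chunk paired with extracted metadata, as on the slide above.
    text: str
    metadata: dict = field(default_factory=dict)

def enrich_chunk(text: str, extractor) -> EnrichedChunk:
    """Attach extractor-produced metadata (sentiment, topics, summary, ...)
    so the chunk can be filtered on at retrieval time and supply extra
    context during response synthesis."""
    return EnrichedChunk(text=text, metadata=extractor(text))

# Trivial stub extractor for illustration; in practice this would be an
# LLM or classifier returning structured fields.
chunk = enrich_chunk(
    "I recently purchased the product and I'm extremely satisfied with it!",
    lambda t: {"sentiment": "positive", "topics": ["product review"]},
)
```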
From Inference Script to Serving Endpoint
So you’ve got a fine-tuned embedding model ready.
From Inference Script to Serving Endpoint
Simple things simple: build inference APIs for your fine-tuned text embedding model.
Serving Optimizations: Dynamic Batching
Dynamically forming small batches, breaking down large batches, and auto-tuning batch size.
Dynamic batching typically brings up to 3x faster response times and ~200% higher throughput for embedding serving.
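The batch-forming step described above can be sketched in a few lines: incoming requests of uneven sizes are flattened, then regrouped so every batch sent to the model stays within a size limit. This is only the grouping logic; a real serving framework (e.g. BentoML's adaptive batching) additionally waits up to a small timeout for more requests and auto-tunes the batch size from observed latency.

```python
def form_batches(pending, max_batch_size=8):
    """Greedy batch formation: merge small requests and split large ones
    so each batch has at most `max_batch_size` items."""
    # Flatten all pending request payloads into one stream of items.
    items = [x for req in pending for x in req]
    # Re-chunk the stream into model-sized batches.
    return [items[i:i + max_batch_size]
            for i in range(0, len(items), max_batch_size)]

# Three requests of sizes 3, 1, and 9 become two batches of 8 and 5.
batches = form_batches([[1, 2, 3], [4], [5, 6, 7, 8, 9, 10, 11, 12, 13]])
```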
Deployment & Serving Infrastructure
External queue, auto-scaling, instance selection, traffic control, concurrency control, and more.
Self-hosting LLMs
Serving optimizations:
• Continuous batching, KV caching
• PagedAttention, FlashAttention
• Speculative decoding
• Operator fusion
• Quantization
• Output streaming
Important metrics:
• Time to first token (TTFT)
• Time per output token (TPOT)
• End-to-end latency
• Throughput
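The metrics listed above follow directly from three timestamps of a streaming request. A minimal sketch of the bookkeeping (TPOT is conventionally measured over the tokens after the first, since TTFT already accounts for the first token):

```python
def streaming_metrics(request_t, first_token_t, done_t, n_output_tokens):
    """Compute TTFT, TPOT, end-to-end latency, and throughput from
    request timestamps (in seconds) of one streaming LLM call."""
    ttft = first_token_t - request_t                     # time to first token
    tpot = (done_t - first_token_t) / max(n_output_tokens - 1, 1)
    e2e = done_t - request_t                             # end-to-end latency
    return {
        "ttft_s": ttft,
        "tpot_s": tpot,
        "e2e_s": e2e,
        "throughput_tok_per_s": n_output_tokens / e2e,
    }

# Request at t=0, first token at 0.5s, 101 tokens finished at 10.5s.
m = streaming_metrics(0.0, 0.5, 10.5, 101)
```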
Self-hosting LLMs: Recommendations
• The field of LLM inference backends is rapidly evolving and heavily researched.
• Choosing the right backend depends on your workload type, optimization target, quantization method, and model type.
• Developer experience can be a significant factor, given the complexity of model compilation and of integrating fine-tuned models.
Self-hosting LLMs: Inference Backend Comparison

LMDeploy
• Quantization support: Yes. Supports 4-bit AWQ and 8-bit quantization options, as well as 4-bit KV cache quantization.
• Supported model architectures: About 20 models supported by the TurboMind engine.
• Hardware limitation: Optimized for Nvidia CUDA only.

TensorRT-LLM
• Quantization support: Partial. Quantization via modelopt; note that quantized data types are not implemented for all models.
• Supported model architectures: 30+ models supported.
• Hardware limitation: Nvidia CUDA only.

vLLM
• Quantization support: Not fully supported as of now. You need to quantize the model through AutoAWQ or find pre-quantized models on HF; performance is under-optimized.
• Supported model architectures: 30+ models supported.
• Hardware limitation: Nvidia CUDA, AMD ROCm, AWS Neuron, CPU.

MLC-LLM
• Quantization support: Yes. Supports 3-bit and 4-bit group quantization options; AWQ quantization support is still experimental.
• Supported model architectures: 20+ models supported; does not include some models such as Cohere Command R and Arctic.
• Hardware limitation: Nvidia CUDA, AMD ROCm, Metal, Android, iOS, WebGPU.

TGI
• Quantization support: Offers AWQ, GPTQ, and bitsandbytes quantization.
• Supported model architectures: 20+ models supported.
• Hardware limitation: Nvidia CUDA, AMD ROCm, Intel Gaudi, AWS Inferentia.
Scaling LLM Inference Services
Autoscaling: request-based metrics vs. resource-utilization metrics
GPU utilization
• A “fully utilized” GPU can often handle a lot more traffic in LLM serving.
• Utilization reflects usage only after resources have been consumed, resulting in conservative scale-up behavior that doesn’t match demand.
QPS
• For LLM APIs, the cost of each request is not uniform.
• Hard to configure the right QPS target.
Concurrency
• Accurately reflects load and desired replica count.
• Easy to configure based on GPU count and max batch size.
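The concurrency-based policy above reduces to simple arithmetic: size each replica for a target number of in-flight requests (e.g. GPU count times max batch size), then scale the replica count to cover current in-flight load. A minimal sketch, with bounds added for illustration:

```python
import math

def desired_replicas(in_flight_requests, target_concurrency_per_replica,
                     min_replicas=1, max_replicas=10):
    """Concurrency-based autoscaling: replica count tracks in-flight
    requests directly, instead of lagging GPU-utilization signals."""
    want = math.ceil(in_flight_requests / target_concurrency_per_replica)
    # Clamp to the configured scaling bounds.
    return max(min_replicas, min(want, max_replicas))

# With 25 requests in flight and 8 concurrent requests per replica,
# 4 replicas are needed.
replicas = desired_replicas(25, target_concurrency_per_replica=8)
```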
Scaling LLM Inference Services
Cold start optimization for large container images with many unused files
Most files in a container image are never used. Stream-loading the container image based on the requested files can drastically speed up container download and startup time, from minutes to seconds.
Scaling LLM Inference Services
Cold start optimization for loading large model weight files
GenAI inference requires specialized infrastructure, such as streaming model loading and efficient caching, to accelerate this process.
Scaling LLM Inference Services
Auto-scaling based on traffic, request queue, and resource utilization
bentoml.com/blog/scaling-ai-model-deployment
Apply “Small” Language Models
Example: toxic query detection with GPT-3.5 vs. a fine-tuned text classification model
GPT-3.5, prompted with “Determine if a user query is toxic. Reply yes or no. Here’re some examples: {EXAMPLES} {User Query}” → Yes / No
• Assuming 400 tokens per input sequence
• Cost: $200 per 1 million sequences
• Latency: 8 seconds
BertForSequenceClassification, given {User Query} → Yes / No
• Cost: $0.07 per 1 million sequences, a ~2800x improvement
• Latency: 50 milliseconds, a 160x improvement
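The improvement factors on this slide are back-of-envelope ratios of the quoted cost and latency figures; the arithmetic can be checked directly (default values are the slide's numbers, assuming the "50" latency figure is milliseconds):

```python
def improvement(gpt_cost_per_m=200.0, clf_cost_per_m=0.07,
                gpt_latency_s=8.0, clf_latency_s=0.050):
    """Cost and latency ratios of prompting GPT-3.5 vs. running a
    fine-tuned BERT classifier for toxicity detection."""
    return gpt_cost_per_m / clf_cost_per_m, gpt_latency_s / clf_latency_s

cost_x, latency_x = improvement()
# 200 / 0.07 is roughly 2857x on cost and 8 / 0.05 is 160x on latency,
# matching the ~2800x and 160x figures above.
```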
Apply “Small” Language Models
Cold start and scaling with huge container and model files
Example: a user request flows through an LLM router and a toxic classifier before reaching the Mistral LLM /generate endpoint, running as a distributed Bento deployment.
Document Processing Pipelines
Layout analysis, table extraction, image understanding, and OCR for the RAG data ingestion pipeline.
Long-Running Inference Tasks
Async task submission, ideal for large PDF file ingestion tasks.
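The async-task pattern means the client gets a task id back immediately and polls for the result, instead of holding an HTTP connection open for the minutes a large PDF ingestion can take. A minimal in-process sketch of the submit/status/result lifecycle (a real deployment would back this with a durable queue; BentoML exposes a similar pattern as task endpoints):

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

class TaskQueue:
    """Minimal async-task sketch: submit returns a task id at once;
    the caller polls status and fetches the result later."""
    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._tasks = {}

    def submit(self, fn, *args):
        task_id = str(uuid.uuid4())
        self._tasks[task_id] = self._pool.submit(fn, *args)
        return task_id  # returned immediately, before fn finishes

    def status(self, task_id):
        return "done" if self._tasks[task_id].done() else "running"

    def result(self, task_id):
        # Blocks until the task completes (a poll-then-fetch client
        # would call status() first).
        return self._tasks[task_id].result()

q = TaskQueue()
tid = q.submit(lambda path: f"ingested {path}", "report.pdf")
```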
Large-Scale Batch Inference Jobs
Ingest and index documents right from your cloud data warehouses
• Bring compute (models) to your data via BentoCloud BYOC
• Custom deployment and easy integration via Snowflake External Functions or Spark UDFs
• Right-size GPU clusters based on your actual workload, leveraging the same fast auto-scaling infrastructure as real-time inference
github.com/bentoml/rag-tutorials
RAG-as-a-Service
Fully private RAG deployment in your own infrastructure
• Define the entire RAG service’s components within one Python file
• Compile to one versioned unit for evaluation and deployment
• Each model inference component can be assigned to a different GPU shape and scale independently
• Model serving and inference best practices and optimizations baked in
• Auto-generated production monitoring dashboard
github.com/bentoml/rag-tutorials
Production RAG System Components
LLMs, reranker, text embedding, layout analysis, chunking, classification for routing and tool use, OCR, summarization, visual reasoning, AI-powered tools, model selection, synthetic data generation, LLM-based evaluators, entity extraction.
Summary
• Production RAG often involves running multiple open-source or custom models for better performance and data privacy protection.
• “State-of-the-art AI results are increasingly obtained by compound systems with multiple components, not just monolithic models.” - Compound AI Systems (Zaharia et al., 2024)
• Scaling RAG with multiple components, each with different computation and scaling needs, requires specialized infrastructure for the inference workload.
BentoML
Inference Platform for Fast-Moving AI Teams
github.com/bentoml/BentoML