The document discusses challenges and best practices in scaling retrieval-augmented generation (RAG) systems built on custom AI models. It covers deploying inference APIs, optimizing model performance, and managing structured and unstructured data. It emphasizes fine-tuning models for specific tasks and the specialized infrastructure needed to handle inference workloads efficiently.
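To make the retrieve-then-generate pattern behind RAG concrete, here is a minimal sketch. It is a hypothetical toy, not the document's implementation: retrieval is naive keyword overlap and `generate` is a stand-in for a call to a fine-tuned model's inference API; a production system would use embedding-based retrieval over a vector index.

```python
def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query; return the top k.

    Toy stand-in for semantic (embedding-based) retrieval.
    """
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]


def generate(query, context):
    """Stand-in for an inference-API call to a fine-tuned model.

    A real system would send the query plus retrieved context as a prompt.
    """
    return f"Answer to {query!r} grounded in {len(context)} retrieved passage(s)."


docs = [
    "Inference APIs serve fine-tuned models behind autoscaling endpoints.",
    "Vector stores index unstructured data for semantic retrieval.",
    "Batching requests improves GPU utilization during inference.",
]
question = "How do inference APIs scale?"
context = retrieve(question, docs)
print(generate(question, context))
```

The design point the sketch illustrates is the separation of concerns the document's topics map onto: retrieval over structured/unstructured data on one side, and model inference behind a dedicated serving layer on the other.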