Use case patterns for LLM Apps: What to look for
M Waleed Kadous
Chief Scientist, Anyscale
K1st
Oct 11, 2023
- Certain “use case patterns” we see
- Equip you to spot LLM opportunities in your organization
- Patterns (in rough order of difficulty)
- Summarization
- The RAG Family
- Knowledge Base Question Answering
- Document Question Answering
- Talk to your data
- Talk to your system
- In-context assistance family
- Co-creator
- Diagnostician
- Bonus material:
- Thoughts on Dr Nguyen’s presentation
- Waleed’s Hard Won Heuristics
Key points
- Company behind the Open Source project Ray
- Widely used, scalable AI platform adopted by many
companies
- What scalable means:
- Distributed: Up to 4,000 nodes, 16,000 GPUs
- Efficient: Keep costs down by efficient resource mgmt
- Reliable: Fault tolerant, highly available
- Widely used by GenAI companies e.g. OpenAI, Cohere
- ChatGPT trained using Ray
Who is Anyscale? Why should you listen to us?
We provide LLMs as a service (Llama models)
We use LLMs to make our products better
We help our customers deploy LLMs on Ray and on the
managed version of Ray (Anyscale Platform)
What’s our experience with LLMs?
Anyscale Endpoints
LLMs served via API
LLMs fine-tuned via API
Anyscale Endpoints
LLM Serving Price (per million tokens):
- Llama2 70B / CodeLlama 34B: $1.00
- Llama2 13B: $0.25
- Llama2 7B: $0.15
endpoints.anyscale.com
Anyscale Endpoints
Cost efficiency touches every layer of the stack
Anyscale Endpoints
Single GPU optimizations
Multi-GPU modeling
Inference server
Autoscaling
Multi-region, multi-cloud
$1 / million
tokens
(Llama-2 70B)
End-to-end LLM privacy, customization and control
Anyscale Endpoints
LLMs served via API
LLMs fine-tuned via API
Serve your LLMs from your Cloud
Fine-tune & customize in your Cloud
Anyscale Private
Endpoints
How all the pieces fit together
AI app serving & routing
Model training & continuous tuning
Python-native Workspaces
GPU/CPU optimizations
Multi-Cloud, auto-scaling
Anyscale AI Platform
Anyscale Endpoints
LLMs served via API
LLMs fine-tuned via API
Ray AI Libraries Ray Core
Ray Open Source
Serve your LLMs from your Cloud
Fine-tune & customize in your Cloud
Anyscale Private
Endpoints
Summarization
LLMs are very good at summarizing
When GPT-3 came out, it outperformed existing engineered
solutions
Easy: prompt is
- Please summarize this into x bullet points
- Stick to the facts in the document
- Leave out irrelevant parts
- [Optional] Particularly focus on topics A, B and C
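The prompt recipe above can be sketched as a small helper. This is a minimal sketch: the function name and defaults are illustrative, and the resulting string would be sent to whatever chat-style LLM API you use.

```python
def build_summary_prompt(document: str, bullets: int = 3, topics=None) -> str:
    """Assemble the summarization prompt from the recipe on the slide."""
    lines = [
        f"Please summarize this into {bullets} bullet points.",
        "Stick to the facts in the document.",
        "Leave out irrelevant parts.",
    ]
    if topics:  # the optional "focus on topics A, B and C" line
        lines.append("Particularly focus on topics: " + ", ".join(topics))
    return "\n".join(lines) + "\n\n" + document

# Example usage (the document text is made up):
prompt = build_summary_prompt("Q3 revenue grew 12%.", bullets=2, topics=["revenue"])
```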
Summarization
- Summarize:
- Research papers
- Product updates
- Business contracts
- Latest industry news
- Legislative changes
- Quality control reports
Practical examples
Anyscale Customer using Summarization: Merlin
Merlin
“We use Anyscale Endpoints to power
consumer-facing services that have
reach to millions of users … Anyscale
Endpoints gives us 5x-8x cost
advantages over alternatives, making
it easy for us to make Merlin even more
powerful while staying affordable for
millions of users.”
Watch out for cost!
Summarization: Lesson 1
30x!
Summary ranking is an established method in the literature.
“insiders say the row brought simmering
tensions between the starkly contrasting
pair -- both rivals for miliband's ear --
to a head.”
A: insiders say the row brought tensions between
the contrasting pair.
B: insiders say the row brought simmering tensions
between miliband's ear.
Example of comparable quality: Factuality eval
For the summarization task, Llama 2 70b is about as good
as GPT-4 (on factuality)
Dropping to GPT-3.5-Turbo doesn’t work: there is a significant
drop in quality
Llama 2 70b costs 30x less
Cheaper not always worse
Summarization: Lesson 3
One issue is context window size
Most LLMs can take 4000-8000 tokens (3000-6000 words)
as input
2 solutions
- Long Context Window LLMs (e.g. Claude 2: 75,000 words)
- Split-and-merge approach:
- LangChain et al. already have chains to do this
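The split-and-merge approach can be sketched in a few lines: chunk the input so each piece fits the context window, summarize each chunk, then summarize the concatenated partial summaries. Here `summarize` is a stand-in for a real LLM call, and the word budget is illustrative.

```python
def split_into_chunks(text: str, max_words: int = 3000) -> list[str]:
    """Split text into pieces that each fit the model's context window."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def split_and_merge_summary(text: str, summarize, max_words: int = 3000) -> str:
    """Map-reduce summarization: summarize chunks, then merge the summaries."""
    chunks = split_into_chunks(text, max_words)
    if len(chunks) == 1:                         # fits in one call
        return summarize(chunks[0])
    partials = [summarize(c) for c in chunks]    # "map" step
    return summarize("\n".join(partials))        # "merge" step
```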
The Retrieval Augmented
Generation Family
Retrieval Augmented Generation
Solves 2 problems:
- How do I add knowledge to an LLM that’s already trained
without retraining the whole thing?
- How do I stop the LLM from simply making stuff up
(hallucination)?
Basic approach: use a secondary source (e.g. vector
database) to augment the prompt with context
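The basic RAG loop above can be sketched as: retrieve the most relevant chunks from a secondary source, then build a prompt that constrains the LLM to that context. The retriever here is a toy word-overlap scorer standing in for a real vector database; the prompt wording is illustrative.

```python
def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the query (vector-DB stand-in)."""
    q = set(query.lower().split())
    ranked = sorted(chunks,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_rag_prompt(query: str, chunks: list[str]) -> str:
    """Augment the prompt with retrieved context to curb hallucination."""
    context = "\n".join(retrieve(query, chunks))
    return ("Answer using ONLY the context below. "
            "If the answer is not there, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")
```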
Source × Timing:
- Text, pre-indexed: Knowledge Base QA
- Text, real-time: Document QA
- Data, pre-indexed: Talk to data
- Data, real-time: Talk to system
- Knowledge Base Question Answering
- Source: existing documentation (e.g. wikis, intranets,
slack records, etc)
- Document Question Answering
- Source: a new document
- Talk to data
- Source: an existing SQL, CSV or similar
- Talk to system
- Source: a live engine or source of data
Four flavors
Pre-index stage: Build the index of “chunks” of text
This is a mini search engine that provides snippets
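The pre-index stage can be sketched as a tiny inverted index over text chunks, which is enough to return snippet candidates for a query. This is a toy stand-in: real systems would use embeddings plus a vector store rather than exact word matching, and the chunk size is illustrative.

```python
from collections import defaultdict

def build_index(docs: dict[str, str], chunk_words: int = 50):
    """Split each doc into chunks and map word -> set of chunk locations."""
    index = defaultdict(set)
    chunks = {}
    for doc_id, text in docs.items():
        words = text.split()
        for n, i in enumerate(range(0, len(words), chunk_words)):
            chunk = " ".join(words[i:i + chunk_words])
            chunks[(doc_id, n)] = chunk
            for w in set(chunk.lower().split()):
                index[w].add((doc_id, n))
    return index, chunks

def snippets(query: str, index, chunks, k: int = 2) -> list[str]:
    """Return the top-k chunks by number of matching query words."""
    hits = defaultdict(int)
    for w in query.lower().split():
        for loc in index.get(w, ()):
            hits[loc] += 1
    best = sorted(hits, key=hits.get, reverse=True)[:k]
    return [chunks[loc] for loc in best]
```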
- Customer support
- Internal company knowledge chatbot
- Sales search: “Who is Customer X?”
- Technical documentation
Knowledge base example applications
Erik Brynjolfsson: Professor here at Stanford
- Introduced a RAG-based customer support system
- 14% increase in resolved customer issues per hour
- 35% increase for lowest skilled worker
“Generative AI at work”, Brynjolfsson et al,
https://www.nber.org/system/files/working_papers/w31161/w31161.pdf
Real Measured Results
- Easy incremental step if you already have an existing knowledge base
- Real challenge is not the synthesis stage, but building a good search
engine
- GIGO: Garbage in, Garbage out.
- If the retrieved results are garbage, LLMs won’t fix it
- Example startup in this space: glean.com
- You don’t need GPT-4 for synthesis. Llama 2 70b or GPT-3.5-Turbo is
good enough.
Knowledge base QA: Lessons
- Example:
- Upload a 20,000 word contract.
- Ask: Does this contract give us any rights if the
customer files Chapter 11?
- Main difference: have not seen document before
- Blog post in preparation that looks at 3 approaches:
- Index it in real-time
- Use Large Context Window and shove it in (Claude 2)
- Divide into paragraphs then scatter-gather
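The divide-then-scatter-gather approach above can be sketched as: ask the question of each paragraph ("scatter"), drop paragraphs that yield nothing, then combine the partial answers into a final one ("gather"). `ask` is a stand-in for an LLM call that returns an answer or the sentinel "NOT FOUND"; the sentinel convention is an assumption of this sketch.

```python
def scatter_gather_qa(document: str, question: str, ask) -> str:
    """Answer a question about an unseen document paragraph by paragraph."""
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    partial = [ask(p, question) for p in paragraphs]        # scatter
    relevant = [a for a in partial if a != "NOT FOUND"]
    if not relevant:
        return "NOT FOUND"
    return ask("\n".join(relevant), question)               # gather
```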
Document Question Answering
- Example:
- You have a database of sales numbers
- You ask a natural language query:
- “Which salesperson in the East Coast has seen the
greatest monthly sales?”
- Usual approach
- Translate natural language to SQL or similar
- Note: You have to be really careful with SQL from an LLM
– could contain injection attacks.
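One guardrail for the injection risk above is to accept only a single read-only SELECT before executing anything the LLM generates (and to run it as a read-only database user). This is a conservative sketch, not a complete SQL-injection defense; it will also reject some legitimate queries that merely mention a forbidden word.

```python
import re

def is_safe_select(sql: str) -> bool:
    """Allow only a single statement that starts with SELECT and
    contains no write/DDL keywords."""
    stmt = sql.strip().rstrip(";")
    if ";" in stmt:                                  # multiple statements
        return False
    if not re.match(r"(?i)^\s*select\b", stmt):      # must be a SELECT
        return False
    forbidden = re.compile(
        r"(?i)\b(insert|update|delete|drop|alter|grant|attach|pragma)\b")
    return not forbidden.search(stmt)
```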
Talk to data
A small fine-tuned open source model
can outperform the best available general model
in some cases
The Power of Fine-tuning in Cost Reduction
Anyscale Endpoints - fine-tuning
Superior task-specific performance at 1/300th the cost of GPT-4.
[Chart: task accuracy — base Llama-2-7B: 3%, GPT-4: 78%, fine-tuned Llama-2-7B: 86%]
- Similar to talk to data but instead of a database talk to a
live system
- Example (Wireless Network):
- Q: “Any area seeing wifi congestion?”
- A: “Yes, floor 7 is. I see that there are a large number of
visitors trying to use the guest network.”
Talk to system
- Define functions for querying your system
- E.g. get_congestion_status(),
get_network_usage_type()
- Translate your queries into those functions
Basic approach
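The basic approach above can be sketched as a function registry plus a routing step. `get_congestion_status` and `get_network_usage_type` come from the slide but their bodies here are made-up stubs, and the keyword router stands in for the LLM's query-to-function translation (in practice you would use the model's function-calling support).

```python
def get_congestion_status() -> str:
    """Stub for a live query against the wireless network system."""
    return "floor 7 congested"

def get_network_usage_type() -> str:
    """Stub for a live query about who is using the network."""
    return "many guests on guest network"

# Registry of system-query functions the LLM may choose between.
FUNCTIONS = {
    "congestion": get_congestion_status,
    "usage": get_network_usage_type,
}

def route(query: str) -> str:
    """Map a natural-language query to a registered function (LLM stand-in)."""
    q = query.lower()
    for keyword, fn in FUNCTIONS.items():
        if keyword in q:
            return fn()
    return "no matching function"
```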
This is easy, but we only know of one company doing it …
In-context assistance
- Tools that help you get your job done while you are working on it
- Automatically analyze the content
- Example:
- Smart code completion
- Looks at surrounding code, environment etc
In-context assistance
- “Autocomplete on steroids” for software developers
- 95 developers, randomized controlled trial.
Copilot users 55% faster
- 96% of Copilot users faster on repetitive tasks
- 74% said it allowed them to focus on more satisfying
work
source: https://github.blog/2022-09-07-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/
Github Copilot
- Our internal diagnostic tool
- Anyscale has an IDE – including Jupyter notebooks
Anyscale Doctor
- Took a lot of trial and error to build
- Built on top of a RAG system for additional analysis
- In the end, not that complicated
- But lots of experimentation
- Now going to be deployed in the product
Our experiences
Anyscale Doctor
[Pipeline diagram: User Input → Summarize (Llama 2 70B) → Categorize (Llama 2 70B) → Dependency Error / Python Error / Infra Error, each routed to CodeLlama and CodeLlama-QA models]
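The shape of this pipeline, and the "one LLM call does one job" rule below, can be sketched as separate summarize, categorize, and per-category handler steps. The `summarize`, `categorize`, and handler callables are stand-ins for real model calls; the category names are illustrative.

```python
def diagnose(user_input: str, summarize, categorize, handlers) -> str:
    """Two single-purpose LLM calls, then route to a specialist handler."""
    summary = summarize(user_input)      # call 1: summarize only
    category = categorize(summary)       # call 2: classify only
    handler = handlers.get(category)
    if handler is None:
        return f"unknown category: {category}"
    return handler(summary)              # specialist model per category
```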
1. Prototype with GPT-4 (or Claude if you need big context windows).
If GPT-4 doesn’t work, nothing else is likely to.
2. One LLM call does one job. Don’t ask an LLM to summarize and
classify. Do two LLM calls: one to summarize, one to classify.
3. Llama 2 70b can be useful as a “day to day” LLM if you remember
Rule 2. GPT-4 is less sensitive to dual tasks.
4. Fine tuning is for form, not facts. RAG is for facts.
5. If you can, avoid self-hosting. It’s more difficult than it looks (e.g.
dealing with traffic peaks cost effectively), especially for multi-GPU
LLMs like Llama 2 70b. If you have to, use RayLLM.
Bonus: Waleed’s Hard-won Heuristics
- Certain “use case patterns” we see
- Equip you to spot LLM opportunities in your organization
- Patterns (in rough order of difficulty)
- Summarization
- The RAG Family
- Knowledge Base Question Answering
- Document Question Answering
- Talk to your data
- Talk to your system
- In-context assistance family
- Co-creator
- Diagnostician
Key points
Thank You!
Endpoints: endpoints.anyscale.com
RayLLM: github.com/ray-project/ray-llm
Details: anyscale.com/blog
Numbers: llm-numbers.ray.io
Ray: ray.io
Anyscale: anyscale.com
Me: mwk@anyscale.com
