Use case patterns for LLM Apps: What to look for
M Waleed Kadous
Chief Scientist, Anyscale
K1st
Oct 11, 2023
- Certain “use case patterns” we see
- Equip you to spot LLM opportunities in your organization
- Patterns (in rough order of difficulty)
- Summarization
- The RAG Family
- Knowledge Base Question Answering
- Document Question Answering
- Talk to your data
- Talk to your system
- In-context assistance family
- Co-creator
- Diagnostician
- Bonus material:
- Thoughts on Dr Nguyen’s presentation
- Waleed’s Hard Won Heuristics
Key points
- Company behind the Open Source project Ray
- Widely used, scalable AI platform adopted by many
companies
- What scalable means:
- Distributed: Up to 4,000 nodes, 16,000 GPUs
- Efficient: Keep costs down by efficient resource mgmt
- Reliable: Fault tolerant, highly available
- Widely used by GenAI companies e.g. OpenAI, Cohere
- ChatGPT trained using Ray
Who is Anyscale? Why should you listen to us?
We provide LLMs as a service (Llama models)
We use LLMs to make our products better
We help our customers deploy LLMs on Ray and on the
managed version of Ray (Anyscale Platform)
What’s our experience with LLMs?
Anyscale Endpoints
LLMs served via API
LLMs fine-tuned via API
Anyscale Endpoints
LLM Serving Price (per million tokens):
- Llama2 70B / CodeLlama 34B: $1.00
- Llama2 13B: $0.25
- Llama2 7B: $0.15
endpoints.anyscale.com
Anyscale Endpoints
Cost efficiency touches every layer of the stack
Anyscale Endpoints
Single GPU optimizations
Multi-GPU modeling
Inference server
Autoscaling
Multi-region, multi-cloud
$1 / million
tokens
(Llama-2 70B)
End-to-end LLM privacy, customization and control
Anyscale Endpoints
LLMs served via API
LLMs fine-tuned via API
Serve your LLMs from your Cloud
Fine-tune & customize in your Cloud
Anyscale Private
Endpoints
How all the pieces fit together
AI app serving & routing
Model training & continuous tuning
Python-native Workspaces
GPU/CPU optimizations
Multi-Cloud, auto-scaling
Anyscale AI Platform
Anyscale Endpoints
LLMs served via API
LLMs fine-tuned via API
Ray AI Libraries Ray Core
Ray Open Source
Serve your LLMs from your Cloud
Fine-tune & customize in your Cloud
Anyscale Private
Endpoints
Summarization
LLMs are very good at summarizing
When GPT-3 came out, it outperformed existing engineered
solutions
Easy: prompt is
- Please summarize this into x bullet points
- Stick to the facts in the document
- Leave out irrelevant parts
- [Optional] Particularly focus on topics A, B and C
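The prompt recipe above can be sketched as a small helper. This is a minimal sketch: the function name and defaults are illustrative, and the resulting string would be sent to whatever chat-style LLM API you use.

```python
def build_summary_prompt(document: str, bullets: int = 3, topics=None) -> str:
    """Assemble the summarization prompt from the recipe on the slide."""
    lines = [
        f"Please summarize this into {bullets} bullet points.",
        "Stick to the facts in the document.",
        "Leave out irrelevant parts.",
    ]
    if topics:  # the optional "focus on topics A, B and C" line
        lines.append("Particularly focus on topics: " + ", ".join(topics))
    return "\n".join(lines) + "\n\n" + document

# Example usage (the document text is made up):
prompt = build_summary_prompt("Q3 revenue grew 12%.", bullets=2, topics=["revenue"])
```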
Summarization
- Summarize:
- Research papers
- Product updates
- Business contracts
- Latest industry news
- Legislative changes
- Quality control reports
Practical examples
Anyscale Customer using Summarization: Merlin
Merlin
“We use Anyscale Endpoints to power
consumer-facing services that have
reach to millions of users … Anyscale
Endpoints gives us 5x-8x cost
advantages over alternatives, making
it easy for us to make Merlin even more
powerful while staying affordable for
millions of users.”
Watch out for cost!
Summarization: Lesson 1
30x!
Summary ranking is an established method in the literature.
“insiders say the row brought simmering
tensions between the starkly contrasting
pair -- both rivals for miliband's ear --
to a head.”
A: insiders say the row brought tensions between
the contrasting pair.
B: insiders say the row brought simmering tensions
between miliband's ear.
Example of comparable quality: Factuality eval
For the summarization task, Llama 2 70b is about as good
as GPT-4 (on factuality)
Dropping to GPT-3.5-Turbo doesn’t work: there is a significant
drop in quality
Llama 2 70b costs 30x less
Cheaper not always worse
Summarization: Lesson 3
One issue is context window size
Most LLMs can take 4000-8000 tokens (3000-6000 words)
as input
2 solutions
- Long Context Window LLMs (e.g. Claude 2: 75,000 words)
- Split-and-merge approach:
- LangChain et al. already have chains to do this
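The split-and-merge approach can be sketched in a few lines: chunk the input so each piece fits the context window, summarize each chunk, then summarize the concatenated partial summaries. Here `summarize` is a stand-in for a real LLM call, and the word budget is illustrative.

```python
def split_into_chunks(text: str, max_words: int = 3000) -> list[str]:
    """Split text into pieces that each fit the model's context window."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def split_and_merge_summary(text: str, summarize, max_words: int = 3000) -> str:
    """Map-reduce summarization: summarize chunks, then merge the summaries."""
    chunks = split_into_chunks(text, max_words)
    if len(chunks) == 1:                         # fits in one call
        return summarize(chunks[0])
    partials = [summarize(c) for c in chunks]    # "map" step
    return summarize("\n".join(partials))        # "merge" step
```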
The Retrieval Augmented
Generation Family
Retrieval Augmented Generation
Solves 2 problems:
- How do I add knowledge to an LLM that’s already trained
without retraining the whole thing?
- How do I stop the LLM from simply making stuff up
(hallucination)?
Basic approach: use a secondary source (e.g. vector
database) to augment the prompt with context
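The basic RAG loop above can be sketched as: retrieve the most relevant chunks from a secondary source, then build a prompt that constrains the LLM to that context. The retriever here is a toy word-overlap scorer standing in for a real vector database; the prompt wording is illustrative.

```python
def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the query (vector-DB stand-in)."""
    q = set(query.lower().split())
    ranked = sorted(chunks,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_rag_prompt(query: str, chunks: list[str]) -> str:
    """Augment the prompt with retrieved context to curb hallucination."""
    context = "\n".join(retrieve(query, chunks))
    return ("Answer using ONLY the context below. "
            "If the answer is not there, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")
```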
Source × Timing:
- Text, pre-indexed: Knowledge Base QA
- Text, real-time: Document QA
- Data, pre-indexed: Talk to data
- Data, real-time: Talk to system
- Knowledge Base Question Answering
- Source: existing documentation (e.g. wikis, intranets,
slack records, etc)
- Document Question Answering
- Source: a new document
- Talk to data
- Source: an existing SQL, CSV or similar
- Talk to system
- Source: a live engine or source of data
Four flavors
Pre-index stage: Build the index of “chunks” of text
This is a mini search engine that provides snippets
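The pre-index stage can be sketched as a tiny inverted index over text chunks, which is enough to return snippet candidates for a query. This is a toy stand-in: real systems would use embeddings plus a vector store rather than exact word matching, and the chunk size is illustrative.

```python
from collections import defaultdict

def build_index(docs: dict[str, str], chunk_words: int = 50):
    """Split each doc into chunks and map word -> set of chunk locations."""
    index = defaultdict(set)
    chunks = {}
    for doc_id, text in docs.items():
        words = text.split()
        for n, i in enumerate(range(0, len(words), chunk_words)):
            chunk = " ".join(words[i:i + chunk_words])
            chunks[(doc_id, n)] = chunk
            for w in set(chunk.lower().split()):
                index[w].add((doc_id, n))
    return index, chunks

def snippets(query: str, index, chunks, k: int = 2) -> list[str]:
    """Return the top-k chunks by number of matching query words."""
    hits = defaultdict(int)
    for w in query.lower().split():
        for loc in index.get(w, ()):
            hits[loc] += 1
    best = sorted(hits, key=hits.get, reverse=True)[:k]
    return [chunks[loc] for loc in best]
```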
- Customer support
- Internal company knowledge chatbot
- Sales search: “Who is Customer X?”
- Technical documentation
Knowledge base example applications
Erik Brynjolfsson: Professor here at Stanford
- Introduced a RAG-based customer support system
- 14% increase in resolved customer issues per hour
- 35% increase for lowest skilled worker
“Generative AI at work”, Brynjolfsson et al,
https://www.nber.org/system/files/working_papers/w31161/w31161.pdf
Real Measured Results
- Easy incremental step if you already have an existing knowledge base
- Real challenge is not the synthesis stage, but building a good search
engine
- GIGO: Garbage in, Garbage out.
- If the retrieved results are garbage, LLMs won’t fix it
- Example startup in this space: glean.com
- You don’t need GPT-4 for synthesis. Llama 2 70b or GPT-3.5-Turbo is
good enough.
Knowledge base QA: Lessons
- Example:
- Upload a 20,000 word contract.
- Ask: Does this contract give us any rights if the
customer files Chapter 11?
- Main difference: have not seen document before
- Blog post in preparation that looks at 3 approaches:
- Index it in real-time
- Use Large Context Window and shove it in (Claude 2)
- Divide into paragraphs then scatter-gather
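The divide-then-scatter-gather approach above can be sketched as: ask the question of each paragraph ("scatter"), drop paragraphs that yield nothing, then combine the partial answers into a final one ("gather"). `ask` is a stand-in for an LLM call that returns an answer or the sentinel "NOT FOUND"; the sentinel convention is an assumption of this sketch.

```python
def scatter_gather_qa(document: str, question: str, ask) -> str:
    """Answer a question about an unseen document paragraph by paragraph."""
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    partial = [ask(p, question) for p in paragraphs]        # scatter
    relevant = [a for a in partial if a != "NOT FOUND"]
    if not relevant:
        return "NOT FOUND"
    return ask("\n".join(relevant), question)               # gather
```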
Document Question Answering
- Example:
- You have a database of sales numbers
- You ask a natural language query:
- “Which salesperson in the East Coast has seen the
greatest monthly sales?”
- Usual approach
- Translate natural language to SQL or similar
- Note: You have to be really careful with SQL from an LLM
– could contain injection attacks.
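One guardrail for the injection risk above is to accept only a single read-only SELECT before executing anything the LLM generates (and to run it as a read-only database user). This is a conservative sketch, not a complete SQL-injection defense; it will also reject some legitimate queries that merely mention a forbidden word.

```python
import re

def is_safe_select(sql: str) -> bool:
    """Allow only a single statement that starts with SELECT and
    contains no write/DDL keywords."""
    stmt = sql.strip().rstrip(";")
    if ";" in stmt:                                  # multiple statements
        return False
    if not re.match(r"(?i)^\s*select\b", stmt):      # must be a SELECT
        return False
    forbidden = re.compile(
        r"(?i)\b(insert|update|delete|drop|alter|grant|attach|pragma)\b")
    return not forbidden.search(stmt)
```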
Talk to data
A small fine-tuned open source model
can outperform the best available general model
in some cases
The Power of Fine-tuning in Cost Reduction
Anyscale Endpoints - fine-tuning
Superior task-specific performance at 1/300th the cost of GPT-4.
[Chart: task accuracy — base Llama-2-7B: 3%, GPT-4: 78%, fine-tuned Llama-2-7B: 86%]
- Similar to talk to data but instead of a database talk to a
live system
- Example (Wireless Network):
- Q: “Any area seeing wifi congestion?”
- A: “Yes, floor 7 is. I see that there are a large number of
visitors trying to use the guest network.”
Talk to system
- Define functions for querying your system
- E.g. get_congestion_status(),
get_network_usage_type()
- Translate your queries into those functions
Basic approach
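The basic approach above can be sketched as a function registry plus a routing step. `get_congestion_status` and `get_network_usage_type` come from the slide but their bodies here are made-up stubs, and the keyword router stands in for the LLM's query-to-function translation (in practice you would use the model's function-calling support).

```python
def get_congestion_status() -> str:
    """Stub for a live query against the wireless network system."""
    return "floor 7 congested"

def get_network_usage_type() -> str:
    """Stub for a live query about who is using the network."""
    return "many guests on guest network"

# Registry of system-query functions the LLM may choose between.
FUNCTIONS = {
    "congestion": get_congestion_status,
    "usage": get_network_usage_type,
}

def route(query: str) -> str:
    """Map a natural-language query to a registered function (LLM stand-in)."""
    q = query.lower()
    for keyword, fn in FUNCTIONS.items():
        if keyword in q:
            return fn()
    return "no matching function"
```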
This is easy, but we only know of one company doing it …
In-context assistance
- Tools that help you get your job done while you are working on it
- Automatically analyze the content
- Example:
- Smart code completion
- Looks at surrounding code, environment etc
In-context assistance
- “Autocomplete on steroids” for software developers
- 95 developers, randomized controlled trial.
Copilot users 55% faster
- 96% of Copilot users faster on repetitive tasks
- 74% said it allowed them to focus on more satisfying
work
source: https://github.blog/2022-09-07-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/
Github Copilot
- Our internal diagnostic tool
- Anyscale has an IDE – including Jupyter notebooks
Anyscale Doctor
- Took a lot of trial and error to build
- Built on top of a RAG system for additional analysis
- In the end, not that complicated
- But lots of experimentation
- Now going to be deployed in the product
Our experiences
Anyscale Doctor
[Pipeline diagram: User Input → Summarize (Llama 2 70B) → Categorize (Llama 2 70B) → Dependency Error / Python Error / Infra Error, each routed to CodeLlama and CodeLlama-QA models]
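The shape of this pipeline, and the "one LLM call does one job" rule below, can be sketched as separate summarize, categorize, and per-category handler steps. The `summarize`, `categorize`, and handler callables are stand-ins for real model calls; the category names are illustrative.

```python
def diagnose(user_input: str, summarize, categorize, handlers) -> str:
    """Two single-purpose LLM calls, then route to a specialist handler."""
    summary = summarize(user_input)      # call 1: summarize only
    category = categorize(summary)       # call 2: classify only
    handler = handlers.get(category)
    if handler is None:
        return f"unknown category: {category}"
    return handler(summary)              # specialist model per category
```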
1. Prototype with GPT-4 (or Claude if you need big context windows).
If GPT-4 doesn’t work, nothing else is likely to.
2. One LLM call does one job. Don’t ask an LLM to summarize and
classify. Do two LLM calls: one to summarize, one to classify.
3. Llama 2 70b can be useful as a “day to day” LLM if you remember
Rule 2. GPT-4 is less sensitive to dual tasks.
4. Fine tuning is for form, not facts. RAG is for facts.
5. If you can, avoid self-hosting. It’s more difficult than it looks (e.g.
dealing with traffic peaks cost effectively), especially for multi-GPU
LLMs like Llama 2 70b. If you have to, use RayLLM.
Bonus: Waleed’s Hard-won Heuristics
- Certain “use case patterns” we see
- Equip you to spot LLM opportunities in your organization
- Patterns (in rough order of difficulty)
- Summarization
- The RAG Family
- Knowledge Base Question Answering
- Document Question Answering
- Talk to your data
- Talk to your system
- In-context assistance family
- Co-creator
- Diagnostician
Key points
Thank You!
Endpoints: endpoints.anyscale.com
RayLLM: github.com/ray-project/ray-llm
Details: anyscale.com/blog
Numbers: llm-numbers.ray.io
Ray: ray.io
Anyscale: anyscale.com
Me: mwk@anyscale.com
