Open Source LLMs:
Viable for Production or
a Low-Quality Toy?
M Waleed Kadous
Chief Scientist, Anyscale
What we’ll cover
- Proprietary vs Open LLMs
- Examples of people using Open LLMs in production
- Why people use Open LLMs (with supporting experiments)
- Cost
- Deployment Flexibility
- Fine-tuning options
- Where Open LLMs are lagging
- Quality
- Instruction following
- Missing features
- Function Templates
- Big context windows
2
Summary
Open Models are viable in production – people are using them already
It is often possible to get close to commercial LLM quality
Small fine-tuned models outperform giant general models (sometimes)
It is often radically cheaper (e.g. 30x)
Usually takes a bit of extra work e.g. prompt tuning, post-processing
OS Models still missing key features (but being worked on)
3
Being used already!
endpoints.anyscale.com – right now, use an open LLM in 2 minutes
4 models:
- Llama 2 7B, 13B, 70B
- Code Llama 34B Instruct
$0.15 per million tokens to $1 per million tokens
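As a concrete illustration, here is a minimal sketch of calling one of these models. It assumes Anyscale Endpoints exposes an OpenAI-compatible chat API; the base URL and model identifier below are illustrative assumptions, not details taken from the slides.

# Minimal sketch: querying an open LLM through an OpenAI-compatible endpoint.
# Base URL and model name are assumptions for illustration.
import openai

openai.api_base = "https://api.endpoints.anyscale.com/v1"  # assumed endpoint
openai.api_key = "YOUR_ANYSCALE_API_KEY"

response = openai.ChatCompletion.create(
    model="meta-llama/Llama-2-70b-chat-hf",  # assumed model identifier
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize Ray Serve in two sentences."},
    ],
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])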
Some quotes from our customers
4
5
Merlin
“We use Anyscale Endpoints to power
consumer-facing services that have
reach to millions of users … Anyscale
Endpoints gives us 5x-8x cost
advantages over alternatives, making
it easy for us to make Merlin even more
powerful while staying affordable for
millions of users.”
Some quotes from our customers
Realchar.ai
“Realchar.ai is about delivering
immersive, realistic experiences for our
users, not fighting infrastructure or
upgrading open source models.
Endpoints made it possible for us to
introduce new services in hours, instead
of weeks, and for a fraction of the cost of
proprietary services. It also enables us
to seamlessly personalize user
experiences at scale.”
We are using Open LLMs: docs.ray.io
6
Endless possibilities for AI innovation.
Anyscale AI Platform: AI app serving & routing, model training & continuous tuning,
Python-native Workspaces, GPU/CPU optimizations, multi-cloud, auto-scaling
Anyscale Endpoints: LLMs served via API, LLMs fine-tuned via API
Anyscale Private Endpoints: serve your LLMs from your cloud, fine-tune & customize in your cloud
Ray Open Source: Ray AI Libraries, Ray Core
Your options for LLMs
Proprietary
OpenAI, Anthropic, Cohere
Managed Open Source
Anyscale Endpoints, Hugging Face, etc
Self Hosted
Run and maintain your own Open Source models
- Won’t dive into today, more details: walee.dk/selfhost
- TL;DR: Doable but harder than it looks (and maybe more expensive)
- Aviary: easy serving of LLMs using Ray Serve.
8
The Most Popular “Open” Models
Llama 2 (99% open)
Released in July
3 sizes: 7B, 13B, 70B
Permissive licence
- Can be used commercially
- Can’t be used to train other models
Code Llama (99% open)
Released in August
Specifically for generating code
3 sizes: 7B, 13B, 34B
3 “tunes”: Base, Python and Instruct
9
Falcon (90% open)
In June, released 7B, 40B
In September, released 180B model
Need a license for managed hosting
Very Dynamic Space
No LLM has stayed “most popular” for more than 2 months
Keep an eye on this!
Direct comparisons
Open vs Proprietary
Comparing quality: Factuality eval
Summary ranking established in the literature:
“insiders say the row brought simmering tensions between the starkly contrasting pair -- both rivals for miliband's ear -- to a head.”
A: insiders say the row brought tensions between the contrasting pair.
B: insiders say the row brought simmering tensions between miliband's ear.
11
12
Comparing Cost: Summarization
GPT-4 is expensive – 30x the cost of Llama 2 70B for similar performance
30x!
13
Can mean the difference between a product being viable or not
Ray Assistant numbers (approx), worked through in the sketch below:
2,000 tokens in, 500 tokens out, 1,000 questions/day
GPT-4: ~10c per question, ~$35,000/year (VP approval?)
Llama 2 70B: ~0.25c per question, ~$900/year (credit card?)
30x is radically cheaper
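A quick back-of-the-envelope check of the numbers above. The Llama 2 70B price (~$1 per million tokens) comes from the Endpoints slide; the GPT-4 prices are an assumption based on the circa-2023 list prices ($0.03 / $0.06 per 1K input/output tokens).

# Back-of-the-envelope yearly cost for Ray Assistant-style traffic.
TOKENS_IN, TOKENS_OUT, QUESTIONS_PER_DAY = 2000, 500, 1000

def yearly_cost(price_in_per_1k, price_out_per_1k):
    per_question = (TOKENS_IN / 1000) * price_in_per_1k + (TOKENS_OUT / 1000) * price_out_per_1k
    return per_question, per_question * QUESTIONS_PER_DAY * 365

gpt4 = yearly_cost(0.03, 0.06)     # assumed GPT-4 list prices -> ~$0.09/question, ~$33k/year
llama = yearly_cost(0.001, 0.001)  # ~$1 per million tokens   -> ~$0.0025/question, ~$900/year

print(f"GPT-4:       ${gpt4[0]:.3f}/question, ${gpt4[1]:,.0f}/year")
print(f"Llama 2 70B: ${llama[0]:.4f}/question, ${llama[1]:,.0f}/year")
print(f"Ratio: ~{gpt4[1] / llama[1]:.0f}x")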
A small fine-tuned open source model
can outperform the best available general model
in some cases
The Power of Fine-tuning
Natural Language to SQL (accuracy)
- Llama-2-7B, general: 3%
- Llama-2-7B, fine-tuned: 86%
- GPT-4 (~1.4T?), general: 78%
Fine tuning is for form, not facts
17
18
What do you do for facts?
Retrieval Augmented Generation
Vector DB does a lot of the heavy lifting
LLM mostly just has to synthesize the context
A much easier problem
Open LLMs like Llama 2 70B work well – we don’t see as big a difference vs GPT-4
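To make the division of labor concrete, here is a hedged sketch of the RAG pattern described above: retrieval does the heavy lifting, the LLM only synthesizes from the retrieved context. The retriever here is a toy keyword-overlap lookup standing in for a real vector DB, and the endpoint and model names are the same assumptions as in the earlier sketch.

# Minimal RAG sketch: retrieve context, then ask the LLM to synthesize from it.
import openai

openai.api_base = "https://api.endpoints.anyscale.com/v1"  # assumed
openai.api_key = "YOUR_ANYSCALE_API_KEY"

DOCS = [
    "Ray Serve is a scalable model serving library built on Ray.",
    "Anyscale Endpoints serves open LLMs such as Llama 2 via an API.",
]

def retrieve(query, k=2):
    # Stand-in for a vector DB lookup: rank docs by naive word overlap.
    words = set(query.lower().split())
    return sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))[:k]

def answer(query):
    context = "\n".join(retrieve(query))
    resp = openai.ChatCompletion.create(
        model="meta-llama/Llama-2-70b-chat-hf",  # assumed model identifier
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp["choices"][0]["message"]["content"]

print(answer("What is Ray Serve?"))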
19
Open model challenges
- Quality
- Instruction following
- Function Templates
- Large Context Windows
No. The Right tool for the Right job
High End Proprietary APIs (esp GPT-4 and Claude 2)
are the best quality:
- Better logical & analogical reasoning
- Better “general knowledge”
- More refined answers
Open LLMs are “good enough” for (blog post forthcoming):
- Summarization
- Generation stage of RAG
Quality
Hybrids make a lot of sense
For evaluations, we still use GPT-4:
“Is answer A better or answer B better?”
We still send ~5% of queries to GPT-4 for Ray Assistant
(costs 150% more: $900 → $2250)
We still use proprietary LLMs
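One way such a hybrid could be wired up is sketched below. The escalation rule (fall back to GPT-4 when the open model abstains or the query looks hard) is purely an illustrative assumption; the slides only say that roughly 5% of Ray Assistant queries go to GPT-4.

# Hedged sketch of a hybrid: default to the open model, escalate a small fraction.
def query_open_llm(prompt: str) -> str:
    # placeholder: in practice, call Llama 2 70B (e.g. via the earlier Endpoints sketch)
    return "I don't know."

def query_gpt4(prompt: str) -> str:
    # placeholder: in practice, call the OpenAI API
    return "GPT-4 answer"

def needs_escalation(prompt: str, draft_answer: str) -> bool:
    # Illustrative heuristic, not the rule used by Ray Assistant.
    return "i don't know" in draft_answer.lower() or len(prompt) > 4000

def answer(prompt: str) -> str:
    draft = query_open_llm(prompt)
    if needs_escalation(prompt, draft):
        return query_gpt4(prompt)  # the ~5% of traffic in the Ray Assistant case
    return draft

print(answer("A hard multi-step reasoning question"))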
Proprietary LLMs are better at following instructions; open LLMs tend not to be as refined
Hypothesis: OS LLMs don’t have the same level of human
review and RLHF.
Instruction following
What we asked for: Please give an A or a B.
What we got from GPT-4:
A
What we got from Llama 2 70b:
‘The correct answer is A: those who receive centrelink payments made up half of radio rental's
income last year. Explanation: Summary A accurately summarizes the article sentence by
mentioning that those who receive centrelink payments made up half of radio rental's income
last year. It maintains the same meaning and information as the original sentence. On the other
hand, Summary B is inconsistent with the article sentence. It suggests that the ABC's report
only mentioned that those who receive centrelink payments made up radio rental's income last
year, which is not entirely accurate. The article sentence explicitly states that the ABC reported
that those who receive centrelink payments made up half of radio rental's income last year.
Therefore, Summary A is the better choice’.
Example of instruction following
Easily fixed by another LLM
System prompt
You are a helpful assistant that carefully follows instructions. You
provide only answers, no explanations.
User prompt
Determine if the following text says whether the answer is A, B or other.
Only output a single word, either: A, B or other
Text: {query}
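In code, this “fix it with another LLM” step is just a second, cheap call that collapses the verbose answer down to A, B or other, using the prompts above. As before, the endpoint and model identifier are assumptions for illustration.

# Sketch of the post-processing call that extracts the final A/B/other label.
import openai

openai.api_base = "https://api.endpoints.anyscale.com/v1"  # assumed
openai.api_key = "YOUR_ANYSCALE_API_KEY"

SYSTEM = ("You are a helpful assistant that carefully follows instructions. "
          "You provide only answers, no explanations.")
USER_TEMPLATE = ("Determine if the following text says whether the answer is A, B or other. "
                 "Only output a single word, either: A, B or other\n"
                 "Text: {query}")

def extract_choice(verbose_answer: str) -> str:
    resp = openai.ChatCompletion.create(
        model="meta-llama/Llama-2-70b-chat-hf",  # assumed model identifier
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": USER_TEMPLATE.format(query=verbose_answer)},
        ],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"].strip()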
26
Function Templates
Convert the text below into one that calls a Python function.
The function is find_flights(departure_city, arrival_city, time, date, class)
Convert to the appropriate city code using another function
city_code(str) that returns the city code for a given city.
“Hi. I'd like to book a flight to SF from Boston on Wednesday 20
September in the evening. Business class.”
27
Llama 13B output:
find_flights(Boston,
San_Francisco,
“2023-09-20”,
“18:00”,
“business”)
Does this parse?
- No, the first two arguments are bare identifiers – they should be quoted strings
- Didn’t use city_code function
- Decided 6pm was evening
28
Vs OpenAI’s strictly defined templates
"functions": [{
"name": "find_flights",
"description": "template to find flights.",
"parameters": {
"type": "object",
"properties": {
"from_city_code": {
"type": "string",
"description": "Three letter code for the city"
}, ...
29
vs Proprietary (OpenAI)
find_flights(city_code(“Boston”),
city_code(“San Francisco”),
“2023-09-20”,
“evening”,
“business”)
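For reference, a hedged sketch of how such a template is used with OpenAI’s function-calling API (2023-era openai library). The schema mirrors the snippet two slides back; the fields beyond from_city_code and the model version are illustrative assumptions.

# Sketch: function calling with a strictly defined template.
import json
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"

functions = [{
    "name": "find_flights",
    "description": "template to find flights.",
    "parameters": {
        "type": "object",
        "properties": {
            "from_city_code": {"type": "string",
                               "description": "Three letter code for the city"},
            "to_city_code": {"type": "string",
                             "description": "Three letter code for the city"},
            "date": {"type": "string"},
            "time_of_day": {"type": "string"},
            "travel_class": {"type": "string"},
        },
    },
}]

resp = openai.ChatCompletion.create(
    model="gpt-4-0613",  # a model version that supports function calling
    messages=[{"role": "user",
               "content": "Book a flight to SF from Boston on Wednesday 20 "
                          "September in the evening. Business class."}],
    functions=functions,
    function_call="auto",
)
call = resp["choices"][0]["message"]["function_call"]
print(call["name"], json.loads(call["arguments"]))  # structured, parseable output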
30
Large context windows
Bigger context windows are useful for retrieval augmented generation
From Ray Assistant Blog:
Increasing our number of chunks improves our retrieval and quality scores. We
had to stop testing at 7 chunks since Llama-2-70b's maximum context length is
4096 tokens. This is a compelling reason to invest in extending context size.
31
Current status
Anthropic: 100K context window
GPT-4: 32K context window (8K by default)
Llama 2: 4K context window
CodeLlama: 16K context window
OSS
- Actively being worked on (e.g. RoPE)
- Larger context windows also need more GPU resources
- GPT-4 charges 2x for 32K context (vs 8K)
32
Status of Open LLM Weaknesses
Quality
- Larger and larger open models (180B now largest)
- Will likely be a moving target (e.g. Google’s Gemini)
Instruction following
- RLHF is pretty expensive and hard to do – may have to live with this
Expanded context window is actively being developed
- RoPE, YaRN, Hyena
Function templates being actively worked on
- Guidance, JSONFormer, LMQL
33
Best place to run Open LLMs?
endpoints.anyscale.com – right now, use an Open LLM in 2 minutes
4 models:
- Llama 2 7B, Llama 2 13B, Llama 2 70B
- Code Llama 34B Instruct
$0.15 per million tokens to $1 per million tokens
Fine-tuning in Preview – super easy
34
One more thing …
$50 credit for Anyscale Endpoints if you sign up today
35
Summary
Open Models are viable in production – people are using them already
It is often possible to get close to proprietary LLM quality
Small fine-tuned models outperform giant general models (sometimes)
Use RAG for factual information
Open models are often radically cheaper (e.g. 30x)
Usually takes a bit of extra work e.g. prompt tuning, post-processing
Open Models still missing key features (but being worked on)
36
Thank you.
mwk@anyscale.com