Mattingly "AI and Prompt Design: LLMs with Text Classification and Open Source"

Prompt Design
LLMs with Text Classification and Open Source

Goals
1. GPT-4o
2. Multimodal LLMs
3. Vector Databases and Semantic Search
4. What is Text Classification?
5. How is it useful?
6. Traditional Approaches
7. LLMs and Text Classification
8. Open Source LLMs

GPT-4o
A New Model
● Pricing: GPT-4o is 50% cheaper than
GPT-4 Turbo, coming in at $5/M input
and $15/M output tokens.
● Rate limits: GPT-4o’s rate limits are 5x
higher than GPT-4 Turbo—up to 10
million tokens per minute.
● Speed: GPT-4o is 2x as fast as GPT-4
Turbo.
● Vision: GPT-4o outperforms GPT-4
Turbo on vision-related evals.
● Multilingual: GPT-4o has improved
support for non-English languages over
GPT-4 Turbo.
● GPT-4o currently has a context window
of 128k and has a knowledge cut-off
date of October 2023.
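
The per-token prices above can be turned into a rough budget. This is a minimal sketch using only the two prices quoted on this slide; the workload (10,000 texts at about 200 input and 5 output tokens each) and the function name `estimate_cost` are assumptions for illustration, not measured figures.

```python
# Prices quoted on the slide, in US dollars per one million tokens.
INPUT_PRICE_PER_M = 5.00
OUTPUT_PRICE_PER_M = 15.00


def estimate_cost(input_tokens, output_tokens):
    """Return the estimated cost in dollars for a batch of requests."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


# Hypothetical workload: classify 10,000 texts,
# roughly 200 input tokens and 5 output tokens per text.
total = estimate_cost(10_000 * 200, 10_000 * 5)
print(f"${total:.2f}")  # $10.75
```

Input tokens dominate here because a classification reply is only a short label.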

GPT-4o
A New Model
● Released This week
● Purely Multimodal
● Exceptionally fast (low latency)
● Cheaper
● Available via the API and Chat

GPT-4o
Multimodal
“GPT-4o is OpenAI's new flagship
model that can reason across
audio, vision, and text in real
time.” - OpenAI’s Docs

GPT-4o
Multimodal
Text, Audio, and Video are all
vectorized by the same model and
treated the same way. In other
words, a text that describes a
beach would be very similar in
vector space to an image of a
beach.

Vector Databases and Semantic Search

Representing
Texts
Digitally
Embeddings
● The apple is in the tree.
○ 1-[0.01234, -0.23456, 0.87654,
0.45678, -0.56123, 0.65432,
0.12345, -0.77123, 0.08456,
0.34567, ...]
○ 2-different vector
○ 1-[0.01234, -0.23456, 0.87654,
0.45678, -0.56123, 0.65432,
0.12345, -0.77123, 0.08456,
0.34567, ...]
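
To make the idea concrete, here is a toy sketch of turning a sentence into a fixed-length vector. It is not a learned embedding model: `toy_embed` is a hypothetical stand-in that uses the hashing trick, so it only illustrates the shape of the data and the fact that the same text always yields the same vector.

```python
import hashlib
import math


def toy_embed(text, dim=8):
    """Map text to a fixed-length unit vector with the hashing trick.

    Assumption: this is a stand-in for a real embedding model, so the
    numbers carry no semantic meaning.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        token = token.strip(".,!?")
        # hashlib is deterministic across runs, unlike the built-in hash().
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


a = toy_embed("The apple is in the tree.")
b = toy_embed("The apple is in the tree.")
print(a == b)   # True: identical text gives an identical vector
print(len(a))   # 8: every text maps to the same number of dimensions
```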

Vector
Database
What is it?
● It is a database built to
store vectors.
● Similar vectors sit closer
together in vector space.

Vector
Database
How do we use a vector
database?
● We populate a vector database
by using a machine learning
model to vectorize the data
and sending the resulting
vectors to the database.
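
The populate step can be sketched as follows. `ToyVectorStore` and `letter_counts` are hypothetical names: the store is a plain in-memory list, and the "model" is a placeholder that counts vowels. A real setup would swap in an embedding model and an actual vector database.

```python
class ToyVectorStore:
    """In-memory stand-in for a vector database (illustration only)."""

    def __init__(self, embed):
        self.embed = embed  # any function mapping text -> list of floats
        self.rows = []      # each row is (vector, original_text)

    def add(self, text):
        # Vectorize the text, then store the vector next to the original.
        self.rows.append((self.embed(text), text))


def letter_counts(text):
    """Placeholder 'model': counts of a, e, i, o, u in the text."""
    lowered = text.lower()
    return [float(lowered.count(ch)) for ch in "aeiou"]


store = ToyVectorStore(embed=letter_counts)
for doc in ["The apple is in the tree.", "Limited time offer"]:
    store.add(doc)

print(len(store.rows))  # 2
print(store.rows[0])    # ([1.0, 5.0, 2.0, 0.0, 0.0], 'The apple is in the tree.')
```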

Vector
Database
Why use a vector database?
● Vector databases store vector
data so that users can query it
and find similar items based on
vector-level similarity, rather
than explicit human-defined
similarity.
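
A query by vector-level similarity can be sketched with cosine similarity. The three-dimensional vectors and labels below are made up for illustration; real embeddings have hundreds or thousands of dimensions, and production systems typically use approximate nearest-neighbour indexes rather than this exhaustive scan.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(query, rows, k=2):
    """Return the k labels whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query, vec), label) for label, vec in rows.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored[:k]]


# Hypothetical stored vectors.
rows = {
    "beach photo caption": [0.9, 0.1, 0.0],
    "ocean travel guide": [0.8, 0.3, 0.1],
    "tax form instructions": [0.0, 0.2, 0.9],
}

print(nearest([1.0, 0.2, 0.0], rows, k=2))
# ['beach photo caption', 'ocean travel guide']
```

No keyword is shared between the query and the stored items; the ranking comes entirely from the vectors.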

Vector
Database
What is it?
● A vector database holds
numerous vectors or
embeddings of data.
Sometimes, the database will
also store the original data
alongside these vectors.

Vector Database
Stacks
What is available to us?
● Python, Annoy, Streamlit
○ Cheap, easy to deploy, great for
smaller datasets, but requires a
little bit of knowledge to build from
scratch
○ Best for smaller databases (under
10,000 records)
● Python, txtai
○ Cheap and easy to use, more
resource intensive but easy to
deploy
○ Allows for easy interpretability (via
highlighting)

Text
Classification
Overview
Assign a text to a specific category
or categories.
Categories == labels.

Text
Classification
Emails
"Congratulations! You've won a
$1,000 Walmart gift card. Click here
to claim your prize."
"Limited time offer: Buy one get one
free on all items in our store."
"Dear customer, your account has
been temporarily suspended.
Please update your information to
restore access."

Text
Classification
Sentiment
"I love this product! It works exactly
as described."
"The product arrived late and was
damaged. Very disappointed."
"It's okay, not great but not terrible
either."
"Excellent service and quick
delivery. Highly recommend!"

Text
Classification
Types
Binary Classification
Multiclass Classification
Multilabel Classification
Hierarchical Classification

Text
Classification
Binary Classification
Classifies text into one of two
categories.
Spam detection in emails, where
emails are classified as either
"spam" or "not spam."
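
With an LLM, binary classification can be framed as a prompt plus a strict parse of the reply. This sketch makes no API call; `build_prompt` and `parse_label` are hypothetical helpers, and the prompt wording is one possible phrasing rather than a recommended template.

```python
LABELS = ("spam", "not spam")


def build_prompt(email_text):
    """Build a prompt that asks for exactly one of the two labels."""
    return (
        "Classify the following email as exactly one of: spam, not spam.\n"
        "Reply with the label only.\n\n"
        f"Email: {email_text}"
    )


def parse_label(reply):
    """Normalize the model's reply and reject anything outside the label set."""
    cleaned = reply.strip().lower().rstrip(".")
    if cleaned not in LABELS:
        raise ValueError(f"Unexpected label: {reply!r}")
    return cleaned


prompt = build_prompt("Congratulations! You've won a $1,000 Walmart gift card.")
print(prompt)
print(parse_label(" Spam.\n"))  # spam
```

Rejecting unexpected replies keeps free-text model output from leaking into a pipeline that expects only two values.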

Text
Classification
Multiclass Classification
Classifies text into one of three or
more categories.
Sentiment analysis with categories
such as "positive," "negative," and
"neutral."
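
In the multiclass case exactly one label wins. A minimal sketch, assuming some classifier has already produced one score per label (the scores below are invented for illustration):

```python
def pick_label(scores):
    """Return the single label with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical scores for: "The product arrived late and was damaged."
scores = {"positive": 0.15, "negative": 0.70, "neutral": 0.15}
print(pick_label(scores))  # negative
```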

Text
Classification
Multilabel Classification
Assigns one or more labels to
a single text instance, where each
label represents a different
category.
News categorization where an
article can belong to multiple
categories such as "politics,"
"economy," and "health."
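
Multilabel differs from multiclass in that each label is decided independently. A minimal sketch, assuming per-label scores from some classifier; the scores, the extra "sports" label, and the 0.5 threshold are all illustrative assumptions:

```python
def pick_labels(scores, threshold=0.5):
    """Return every label whose score meets the threshold, sorted by name."""
    return sorted(label for label, score in scores.items() if score >= threshold)


# Hypothetical scores for a single news article.
scores = {"politics": 0.81, "economy": 0.64, "health": 0.58, "sports": 0.03}
print(pick_labels(scores))  # ['economy', 'health', 'politics']
```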

Text
Classification
Hierarchical Classification
Classifies text into a hierarchy of
categories, where categories are
structured in a tree-like hierarchy.
Document classification in a library
where documents are classified into
categories like "science," "arts,"
"technology," with subcategories
under each (e.g., "science" can
have "physics," "chemistry,"
"biology").
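
The library example can be represented as a small tree. This sketch uses only the categories named on the slide; subcategories for "arts" and "technology" are not listed there, so they are left empty. The helper names are hypothetical.

```python
# Top-level category -> subcategories named on the slide.
TAXONOMY = {
    "science": ["physics", "chemistry", "biology"],
    "arts": [],        # subcategories not listed on the slide
    "technology": [],  # subcategories not listed on the slide
}


def parent_of(subcategory):
    """Return the top-level category that contains the subcategory, or None."""
    for parent, children in TAXONOMY.items():
        if subcategory in children:
            return parent
    return None


def full_path(subcategory):
    """Return 'parent/subcategory', rejecting labels outside the hierarchy."""
    parent = parent_of(subcategory)
    if parent is None:
        raise ValueError(f"Unknown subcategory: {subcategory!r}")
    return f"{parent}/{subcategory}"


print(full_path("physics"))  # science/physics
```

Predicting the leaf and deriving the parent guarantees that the two levels never contradict each other.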

Open Source
ML
Overview
Open source machine learning, like
open source software (OSS), is
driven by the public. It has several
components: open source datasets,
open source machine learning
models, and open source
applications.
The best resource: Hugging Face

Open Source
ML
Datasets
● Datasets for training
task-specific models
○ NER
○ Text Classification
○ Image Classification
○ Object Detection
● Datasets for training language
models
○ Unannotated collections of texts
● Dataset Cards
○ Task
○ Language
○ Biases

Open Source
ML
Models
● Trained Machine Learning
Models for specific Tasks
○ NER
○ Text Classification
○ Image Classification
○ Object Detection
○ ASR
○ HTR
○ OCR
● Trained machine learning
language models (including
LLMs)
● Model Cards
○ Task
○ Language
○ Biases

Open Source
ML
Benefits and Limitations
● Benefits
○ Open, meaning they are freely
available to use (though
sometimes with commercial
limitations)
○ Publicly Critiqued
○ Understanding of the Data
● Limitations
○ Closed models are better in many
cases (BUT!!! That gap is closing).
