Embeddings
Presented by
Featured Speaker
Jocelyn Matthews
Head of Community, Pinecone
jocelyn@pinecone.io
What are embeddings?
Embeddings are numerical representations that capture the essential
features and relationships of discrete objects, like words or documents,
in a continuous vector space.
Embeddings:
● Are dynamic and context-sensitive
● Capture the essence of the data they represent
● Are influenced by the context in which they are used
● Are adaptable, which makes them powerful
Humans think in sensations, words, and ideas.
Computers think in numbers.
You don’t need to memorize this now
Vector: a list of numbers that tell us about something
Vector space: an environment in which vectors exist
Semantics: the study of meaning communicated through language
Vectors
A vector is a mathematical structure
with a size and a direction. For
example, we can think of the vector
as a point in space, with the
“direction” being an arrow from
(0,0,0) to that point in the vector
space.
Vectors
As developers, it might be easier to
think of a vector as an array
containing numerical values. For
example:
vector = [0, -2, ..., 4]
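A minimal sketch of the "size and direction" idea, using only the standard library (the three-element vector here is made up for illustration):

```python
import math

# A vector is just an ordered list of numbers.
vector = [0.0, -2.0, 4.0]

# Its "size" (magnitude) is the Euclidean length of the arrow
# from the origin (0, 0, 0) to the point it describes.
magnitude = math.sqrt(sum(x * x for x in vector))

print(magnitude)  # sqrt(0 + 4 + 16) = sqrt(20) ≈ 4.472
```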
Vectors
When we look at a bunch of vectors
in one space, we can say that some
are closer to one another, while
others are far apart. Some vectors
can seem to cluster together, while
others could be sparsely distributed
in the space.
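Closeness in a vector space is usually measured with cosine similarity. A toy sketch with hand-made 3-D vectors (real embeddings come from a model and have hundreds of dimensions; these values are invented for the example):

```python
import math

def cosine_similarity(a, b):
    # Higher value = vectors point in more similar directions (max 1.0).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — hand-made for illustration, not from a real model:
coffee   = [0.9, 0.1, 0.0]
caffeine = [0.8, 0.2, 0.1]
galaxy   = [0.0, 0.1, 0.9]

print(cosine_similarity(coffee, caffeine))  # close together
print(cosine_similarity(coffee, galaxy))    # far apart
```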
An example you can bank on
🏦 Where is the Bank of England?
🌱 Where is the grassy bank?
🛩️ How does a plane bank?
🐝 “the bees decided to have a mutiny against their queen”
🐝 “flying stinging insects rebelled in opposition to the
matriarch”
Polysemy and homonyms
Embeddings visualized
Owning the concepts
Word arithmetic
king – man + woman = queen
Image, Peter Sutor, “Metaconcepts: Isolating Context in Word Embeddings”
“Distributed Representations of Words and Phrases and their Compositionality”
“adding the vectors associated with the words king
and woman while subtracting man is equal to the
vector associated with queen. This describes a gender
relationship.”
– MIT Technology Review, 2015
Word arithmetic
Paris - France + Poland = Warsaw
“In this case, the vector difference between Paris and
France captures the concept of capital city.”
– MIT Technology Review, 2015
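The word arithmetic above can be sketched with tiny hand-crafted vectors. Here dimension 0 loosely stands for "royalty" and dimension 1 for "gender"; real word2vec vectors have hundreds of learned dimensions, so this is a cartoon, not how a real model stores meaning:

```python
import math

# Toy 2-D word vectors, invented so the analogy works out exactly.
words = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.5, 0.0],
}

# king - man + woman, computed component-wise.
result = [k - m + w for k, m, w in zip(words["king"], words["man"], words["woman"])]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Nearest remaining word to the result vector should be "queen".
nearest = min(
    (w for w in words if w not in {"king", "man", "woman"}),
    key=lambda w: distance(words[w], result),
)
print(nearest)  # queen
```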
Proximity
Together and apart
Coffee
Hospital
Music
Restaurant
School
Cup
Caffeine
Morning
Galaxy
Dinosaur
Doctor
Patient
Surgery
Volcano
Unicorn
Song
Melody
Instrument
Asteroid
Bacteria
Food
Menu
Waiter
Nebula
Dragon
Teacher
Classroom
Student
Volcano
Spaceship
Exam
Dimensionality!
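Real embeddings live in hundreds of dimensions, so visualizing clusters like the ones above requires projecting down to 2-D or 3-D. A common approach is PCA; here is a sketch using NumPy's SVD on random stand-in vectors (a real model would produce meaningful values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are 10 word embeddings with 100 dimensions each
# (random here, purely to show the shape of the computation).
embeddings = rng.standard_normal((10, 100))

# PCA via SVD: center the data, then keep the top 2 principal directions
# so the points can be plotted on a flat screen.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ vt[:2].T

print(projected.shape)  # (10, 2)
```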
Green is to blue
green blue
As orange is to…yep!
green
orange
blue
red
What’s The Fallacy?
Why "Green : Blue :: Orange : Red" is Imperfect as a Teaching Tool
• Simplicity of relationships (linear vs nuanced)
• Lack of context (how are the words used?)
• Dimensionality (3D vs 100s of D)
• Oversimplification
Life is Like a Box of…
(Or, “Check the Vectors”)
Check the vectors
The distance between red and orange
is remarkably similar to the distance between blue and green…
But when we tested this to verify it, we got
interesting results that reveal the model's
"understanding" of the relationship.
Here is what the code actually yields:
# Find a term that has the same distance and direction that blue
# has from green, but starting from red
target_distance = distance_green_blue
target_direction = direction_green_blue

# Define a list of terms to compare
terms = ["red", "orange", "yellow", "green", "blue", "purple", "pink", "black", "white", "gray"]

# Get the embedding for each term
term_embeddings = {term: get_embedding(term) for term in terms}

# Find the term with the closest distance and same direction as the target
closest_term = None
closest_distance = float('inf')
start_term = "red"
start_embedding = get_embedding(start_term)

for term, embedding in term_embeddings.items():
    if term == start_term:
        continue
    distance, direction = cosine_distance_and_direction(start_embedding, embedding)
    if direction == target_direction and abs(distance - target_distance) < closest_distance:
        closest_distance = abs(distance - target_distance)
        closest_term = term

closest_term, closest_distance
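The snippet relies on two helpers that are not shown on the slide: `get_embedding` would call an embedding model (for example a hosted embeddings API), and `cosine_distance_and_direction` is not a standard library function. One plausible reconstruction, assuming "direction" means the sign of the dot product (a hypothetical choice; the slide does not specify):

```python
import numpy as np

def cosine_distance_and_direction(a, b):
    # Hypothetical reconstruction — the original helper is not shown.
    # "Distance" is the usual cosine distance (1 - cosine similarity);
    # "direction" is taken here as the sign of the dot product, one
    # simple discrete notion of whether the vectors roughly agree.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cos, np.sign(cos)

# Toy check with made-up vectors:
dist, direction = cosine_distance_and_direction([1, 0], [0.8, 0.6])
print(round(dist, 2), direction)  # 0.2 1.0
```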
('purple', np.float64(0.006596347059928065))
Purple
Why not 'orange'?
The result ('purple', np.float64(0.006596347059928065))
shows that, in this model's embedding space, "purple"
is the closest match to the target distance and direction
from "green" to "blue" when starting from "red" —
closer than "orange". That reflects the specific contexts
and relationships the model captured during training,
not the tidy color-wheel analogy we expected.
Embeddings
TL;DR
What are embeddings?
Embeddings are numerical representations that capture the essential
features and relationships of discrete objects, like words or documents,
in a continuous vector space.
The most important thing to understand
Embeddings are numerical representations of data that:
capture semantic meaning
and
allow for efficient comparison of similarity.
Key points about embeddings
1. They can represent various data types, not just text.
2. They are high-dimensional — hundreds of dimensions, not three.
3. Context sensitivity affects interpretation and application.
Applications of embeddings include:
- Semantic search
- Question-answering applications
- Image search
- Audio search
- Recommender systems
- Anomaly detection
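Semantic search, the first application above, reduces to ranking documents by similarity to a query vector. A toy sketch with hand-made 3-D vectors standing in for real embedding-model output (the document titles and values are invented for the example):

```python
import numpy as np

# Toy corpus: each "document" maps to a made-up embedding.
docs = {
    "espresso brewing guide":   np.array([0.9, 0.1, 0.0]),
    "hospital visiting hours":  np.array([0.0, 0.9, 0.1]),
    "guitar chord chart":       np.array([0.1, 0.0, 0.9]),
}
query = np.array([0.8, 0.2, 0.1])  # stands in for embedding "how to make coffee"

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents by similarity to the query, most similar first.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # espresso brewing guide
```

A vector database does the same ranking at scale, with approximate nearest-neighbor indexes instead of a full sort.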
“Generate your own embeddings”
(Inference API)
Sample app
Legal Semantic Search
Sample app
Shop the Look
© 2024 Pinecone – All rights reserved
1. Questions?
#hallwaytrack
2. Recording?
YouTube!
3. Slides?
Ask me
Thank you!
jocelyn@pinecone.io

apidays Paris 2024 - Embeddings: Core Concepts for Developers, Jocelyn Matthews, Pinecone
