Embeddings
Presented by
Featured Speaker
Jocelyn Matthews
Head of Community, Pinecone
jocelyn@pinecone.io
What are embeddings?
Embeddings are numerical representations that capture the essential
features and relationships of discrete objects, like words or documents,
in a continuous vector space.
Embeddings:
● Are dynamic and context-sensitive
● Capture the essence of the data they represent
● Are influenced by the context in which they are used
● Are adaptable, which makes them powerful
Humans think in sensations, words, and ideas.
Computers think in numbers.
You don’t need to memorize this now
Vector: a list of numbers that tell us about something
Vector space: an environment in which vectors exist
Semantics: the study of meaning communicated through language
Vectors
A vector is a mathematical structure
with a size and a direction. For
example, we can think of the vector
as a point in space, with the
“direction” being an arrow from
(0,0,0) to that point in the vector
space.
Vectors
As developers, it might be easier to
think of a vector as an array
containing numerical values. For
example:
vector = [0, -2, ..., 4]
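A minimal sketch of the "size and direction" idea, using only the standard library (the three-element vector here is made up for illustration):

```python
import math

# A vector is just an ordered list of numbers.
vector = [0.0, -2.0, 4.0]

# Its "size" (magnitude) is the Euclidean length of the arrow
# from the origin (0, 0, 0) to the point it describes.
magnitude = math.sqrt(sum(x * x for x in vector))

print(magnitude)  # sqrt(0 + 4 + 16) = sqrt(20) ≈ 4.472
```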
Vectors
When we look at a bunch of vectors
in one space, we can say that some
are closer to one another, while
others are far apart. Some vectors
can seem to cluster together, while
others could be sparsely distributed
in the space.
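Closeness in a vector space is usually measured with cosine similarity. A toy sketch with hand-made 3-D vectors (real embeddings come from a model and have hundreds of dimensions; these values are invented for the example):

```python
import math

def cosine_similarity(a, b):
    # Higher value = vectors point in more similar directions (max 1.0).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — hand-made for illustration, not from a real model:
coffee   = [0.9, 0.1, 0.0]
caffeine = [0.8, 0.2, 0.1]
galaxy   = [0.0, 0.1, 0.9]

print(cosine_similarity(coffee, caffeine))  # close together
print(cosine_similarity(coffee, galaxy))    # far apart
```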
An example you can bank on
🏦 Where is the Bank of England?
🌱 Where is the grassy bank?
🛩️ How does a plane bank?
🐝 “the bees decided to have a mutiny against their queen”
🐝 “flying stinging insects rebelled in opposition to the
matriarch”
Polysemy and homonyms
Embeddings visualized
Owning the concepts
Word arithmetic
king – man + woman = queen
Image, Peter Sutor, “Metaconcepts: Isolating Context in Word Embeddings”
“Distributed Representations of Words and Phrases and their Compositionality”
“adding the vectors associated with the words king
and woman while subtracting man is equal to the
vector associated with queen. This describes a gender
relationship.”
– MIT Technology Review, 2015
Word arithmetic
Paris - France + Poland = Warsaw
“In this case, the vector difference between Paris and
France captures the concept of capital city.”
– MIT Technology Review, 2015
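The word arithmetic above can be sketched with tiny hand-crafted vectors. Here dimension 0 loosely stands for "royalty" and dimension 1 for "gender"; real word2vec vectors have hundreds of learned dimensions, so this is a cartoon, not how a real model stores meaning:

```python
import math

# Toy 2-D word vectors, invented so the analogy works out exactly.
words = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.5, 0.0],
}

# king - man + woman, computed component-wise.
result = [k - m + w for k, m, w in zip(words["king"], words["man"], words["woman"])]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Nearest remaining word to the result vector should be "queen".
nearest = min(
    (w for w in words if w not in {"king", "man", "woman"}),
    key=lambda w: distance(words[w], result),
)
print(nearest)  # queen
```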
Proximity
Together and apart
Coffee
Hospital
Music
Restaurant
School
Cup
Caffeine
Morning
Galaxy
Dinosaur
Doctor
Patient
Surgery
Volcano
Unicorn
Song
Melody
Instrument
Asteroid
Bacteria
Food
Menu
Waiter
Nebula
Dragon
Teacher
Classroom
Student
Volcano
Spaceship
Exam
Dimensionality!
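Real embeddings live in hundreds of dimensions, so visualizing clusters like the ones above requires projecting down to 2-D or 3-D. A common approach is PCA; here is a sketch using NumPy's SVD on random stand-in vectors (a real model would produce meaningful values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are 10 word embeddings with 100 dimensions each
# (random here, purely to show the shape of the computation).
embeddings = rng.standard_normal((10, 100))

# PCA via SVD: center the data, then keep the top 2 principal directions
# so the points can be plotted on a flat screen.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ vt[:2].T

print(projected.shape)  # (10, 2)
```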
Green is to blue
green blue
As orange is to…yep!
green
orange
blue
red
What’s The Fallacy?
Why "Green : Blue :: Orange : Red" is Imperfect as a Teaching Tool
• Simplicity of relationships (linear vs nuanced)
• Lack of context (how are the words used?)
• Dimensionality (3D vs 100s of D)
• Oversimplification
Life is Like a Box of…
(Or, “Check the Vectors”)
Check the vectors
The distance between red and orange
is remarkably similar to the distance between blue and green…
But when we tested this to verify it, we got
interesting results that reveal the model's
"understanding" of the relationship.
Here is what the code actually yields:
# Find a term that has the same distance and direction that blue
# has from green, but starting from red
target_distance = distance_green_blue
target_direction = direction_green_blue

# Define a list of terms to compare
terms = ["red", "orange", "yellow", "green", "blue", "purple", "pink", "black", "white", "gray"]

# Get the embedding for each term
term_embeddings = {term: get_embedding(term) for term in terms}

# Find the term with the closest distance and same direction as the target
closest_term = None
closest_distance = float('inf')
start_term = "red"
start_embedding = get_embedding(start_term)

for term, embedding in term_embeddings.items():
    if term == start_term:
        continue
    distance, direction = cosine_distance_and_direction(start_embedding, embedding)
    if direction == target_direction and abs(distance - target_distance) < closest_distance:
        closest_distance = abs(distance - target_distance)
        closest_term = term

closest_term, closest_distance
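The snippet relies on two helpers that are not shown on the slide: `get_embedding` would call an embedding model (for example a hosted embeddings API), and `cosine_distance_and_direction` is not a standard library function. One plausible reconstruction, assuming "direction" means the sign of the dot product (a hypothetical choice; the slide does not specify):

```python
import numpy as np

def cosine_distance_and_direction(a, b):
    # Hypothetical reconstruction — the original helper is not shown.
    # "Distance" is the usual cosine distance (1 - cosine similarity);
    # "direction" is taken here as the sign of the dot product, one
    # simple discrete notion of whether the vectors roughly agree.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cos, np.sign(cos)

# Toy check with made-up vectors:
dist, direction = cosine_distance_and_direction([1, 0], [0.8, 0.6])
print(round(dist, 2), direction)  # 0.2 1.0
```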
('purple', np.float64(0.006596347059928065))
Purple
Why not 'orange'?
The result ('purple', np.float64(0.006596347059928065))
shows that, in this model's embedding space, "purple"
is the closest match to the target distance and direction
from "green" to "blue" when starting from "red" —
closer than "orange". That reflects the specific contexts
and relationships the model captured during training,
not the tidy color-wheel analogy we expected.
Embeddings
TL;DR
What are embeddings?
Embeddings are numerical representations that capture the essential
features and relationships of discrete objects, like words or documents,
in a continuous vector space.
The most important thing to understand
Embeddings are numerical representations of data that:
capture semantic meaning
and
allow for efficient comparison of similarity.
Key points about embeddings
1. They can represent various data types, not just text.
2. They are high-dimensional — hundreds of dimensions, not three.
3. Context sensitivity affects interpretation and application.
Applications of embeddings include:
- Semantic search
- Question-answering applications
- Image search
- Audio search
- Recommender systems
- Anomaly detection
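Semantic search, the first application above, reduces to ranking documents by similarity to a query vector. A toy sketch with hand-made 3-D vectors standing in for real embedding-model output (the document titles and values are invented for the example):

```python
import numpy as np

# Toy corpus: each "document" maps to a made-up embedding.
docs = {
    "espresso brewing guide":   np.array([0.9, 0.1, 0.0]),
    "hospital visiting hours":  np.array([0.0, 0.9, 0.1]),
    "guitar chord chart":       np.array([0.1, 0.0, 0.9]),
}
query = np.array([0.8, 0.2, 0.1])  # stands in for embedding "how to make coffee"

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents by similarity to the query, most similar first.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # espresso brewing guide
```

A vector database does the same ranking at scale, with approximate nearest-neighbor indexes instead of a full sort.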
“Generate your own embeddings”
(Inference API)
Sample app
Legal Semantic Search
Sample app
Shop the Look
© 2024 Pinecone – All rights reserved
1. Questions?
#hallwaytrack
2. Recording?
YouTube!
3. Slides?
Ask me
Thank you!
jocelyn@pinecone.io

apidays Paris 2024 - Embeddings: Core Concepts for Developers, Jocelyn Matthews, Pinecone
