This presentation was provided by William Mattingly of the Smithsonian Institution for the seventh session of NISO's 2023 Training Series on Text and Data Mining. Session seven, "Vector Databases and Semantic Searching," was held on Thursday, November 30, 2023.
7. Paris sails are fun.
(Presume this is a bad transcription of audio)
8. NER
Overview
● Classify individual spans, or
sequence of tokens, in a text
● Types of Classification
○ Hard Classification
○ Soft Classification
● Types of Methods
○ Machine Learning
○ Rules-Based
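As a rough illustration of the rules-based approach, the sketch below tags spans using a hand-written lookup list and one regular expression. The gazetteer entries, the year pattern, and the function name rule_based_ner are all assumptions made for this example, not part of the presentation.

```python
import re

# Hypothetical gazetteer: known entity strings mapped to labels.
GAZETTEER = {
    "Paris": "GPE",
    "William Mattingly": "PERSON",
}

# Assumed rule: any four-digit year from 1000 to 2099 is labeled DATE.
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def rule_based_ner(text):
    """Return (start, end, surface_text, label) spans found by the rules."""
    spans = []
    # Rule 1: exact matches against the gazetteer.
    for phrase, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(phrase) + r"\b", text):
            spans.append((match.start(), match.end(), match.group(), label))
    # Rule 2: four-digit years.
    for match in YEAR_PATTERN.finditer(text):
        spans.append((match.start(), match.end(), match.group(), "DATE"))
    # Sort by position so spans come back in reading order.
    return sorted(spans)


entities = rule_based_ner("William Mattingly visited Paris in 2023.")
```

Each rule makes a hard classification: a span either matches and receives exactly one label, or it is not tagged at all.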
9. NER
Labels
● Locations
○ LOC - Location
○ GPE - Geopolitical Entity
● PERSON
● NORP - Nationalities, religious,
or political groups
● TIME
● DATE
● EVENT
● PRODUCT
● FAC - Buildings, airports,
highways, bridges, etc.
17. Vector
Database
How do we use a vector
database?
● We populate a vector database
by using a machine learning
model to vectorize the data and
then sending the resulting
vectors to the database.
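The populate step can be sketched as follows. This is only a toy: toy_embed is a hashing trick standing in for a real machine learning model, and InMemoryVectorStore is a hypothetical stand-in for a real vector database. Both names, and the 16-dimension size, are assumptions for illustration.

```python
import hashlib
import math

DIMENSIONS = 16  # assumed toy size; real models typically use hundreds of dimensions


def toy_embed(text):
    """Stand-in for a machine learning model: hash each word into a fixed-size vector."""
    vector = [0.0] * DIMENSIONS
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vector[digest[0] % DIMENSIONS] += 1.0
    # Normalize to unit length so vectors are comparable.
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


class InMemoryVectorStore:
    """Hypothetical minimal 'database': a list of records held in memory."""

    def __init__(self):
        self.records = []

    def add(self, record_id, text):
        # Store the vector and, optionally, the original data alongside it.
        self.records.append({"id": record_id, "vector": toy_embed(text), "text": text})


store = InMemoryVectorStore()
for i, doc in enumerate(["a letter from Paris", "a map of Rome", "notes on sailing"]):
    store.add(i, doc)
```

A real deployment would swap toy_embed for a trained embedding model and the list for a database client, but the flow (vectorize, then send) stays the same.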
19. Vector
Database
Why use a vector database?
● Vector databases store vector
data in a way that lets users
query it and find similar items
based on vector-level
similarity, rather than explicit
human-defined similarity.
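One common way to measure vector-level similarity is cosine similarity. The sketch below assumes that measure and uses three hand-written, 3-dimensional vectors purely for illustration; real embeddings come from a model and are much larger.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example only.
VECTORS = {
    "ship": [0.9, 0.1, 0.0],
    "boat": [0.8, 0.2, 0.1],
    "tax form": [0.0, 0.1, 0.9],
}


def most_similar(query_vector, vectors, top_k=2):
    """Return the keys of the top_k stored vectors closest to the query."""
    scored = [(cosine_similarity(query_vector, v), key) for key, v in vectors.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:top_k]]


results = most_similar([1.0, 0.0, 0.0], VECTORS)
```

Nothing in the stored data says that "ship" and "boat" are related; the ranking falls out of the vectors alone.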
20. Vector
Database
What is it?
● A vector database holds
numerous vectors or
embeddings of data.
Sometimes, the database will
also store the original data
alongside these vectors.
23. Vector Database
Stacks
What is available to us?
● Python, Annoy, Streamlit
○ Cheap, easy to deploy, great for
smaller datasets, but requires a
little bit of knowledge to build from
scratch
○ Best for smaller databases (under
10,000 items)
● Python, txtAI
○ Cheap and easy to use, more
resource intensive but easy to
deploy
○ Allows for easy interpretability (via
highlighting)
24. Vector Database
Stacks
What is available to us?
● Python/JavaScript and
Weaviate
○ Open-source
○ Can be run locally, on a server,
or via Weaviate's paid hosting
○ API is easy to use and easy to
set up
26. Multi-Modal
What is it?
● Multi-modal data mining is
when we use one type of data
to find data of a different type.
● We could use text to find
images (which do not have
metadata or descriptions) or
images to find text.
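The text-to-image case can be sketched as below, assuming a model has already placed both text and images in the same vector space. The file names and 2-dimensional vectors are invented for illustration; no real model is called here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical image embeddings; the images themselves carry no metadata.
IMAGE_VECTORS = {
    "photo_001.jpg": [0.9, 0.1],
    "photo_002.jpg": [0.1, 0.9],
}

# Hypothetical embedding of a text query, assumed to come from the same model.
TEXT_QUERY_VECTOR = [0.8, 0.2]

# Pick the image whose vector lies closest to the text query's vector.
best_image = max(
    IMAGE_VECTORS,
    key=lambda name: cosine_similarity(TEXT_QUERY_VECTOR, IMAGE_VECTORS[name]),
)
```

Because both data types share one space, the search is the same as for text-to-text; only the model producing the vectors changes.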
32. RAG
What is it?
● RAG (retrieval-augmented
generation) allows you to combine
the strengths of large language
models (LLMs) with vector
databases
● It reduces the chance that an
LLM will hallucinate (generate
false information)
● It uses a vector database to
find material relevant to a query
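A minimal sketch of the retrieval half of RAG follows: find the stored passages closest to the query, then place them in the prompt sent to the LLM. The documents, their 3-dimensional vectors, the prompt wording, and the function names are all assumptions for this example, and the LLM call itself is left out.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical contents of a vector database: text plus a made-up embedding.
DOCUMENTS = [
    {"text": "The archive holds letters written between 1850 and 1900.", "vector": [0.9, 0.1, 0.1]},
    {"text": "The reading room is open on weekdays.", "vector": [0.1, 0.9, 0.2]},
    {"text": "Shipping logs record voyages to Paris.", "vector": [0.2, 0.1, 0.9]},
]


def retrieve(query_vector, documents, top_k=1):
    """Return the text of the top_k documents most similar to the query."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, doc["vector"]),
        reverse=True,
    )
    return [doc["text"] for doc in ranked[:top_k]]


def build_prompt(question, passages):
    """Combine retrieved passages and the question into one prompt string."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


question = "When is the reading room open?"
query_vector = [0.1, 0.9, 0.1]  # hypothetical embedding of the question
prompt = build_prompt(question, retrieve(query_vector, DOCUMENTS))
# The prompt would then be passed to an LLM of your choice.
```

Grounding the prompt in retrieved passages is what limits hallucination: the model is asked to answer from supplied material rather than from memory alone.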