Mattingly "Text and Data Mining: Searching Vectors"

National Information Standards Organization (NISO)

This presentation was provided by William Mattingly of the Smithsonian Institution, for the seventh session of NISO's 2023 Training Series on Text and Data Mining. Session seven, "Vector Databases and Semantic Searching" was held on Thursday, November 30, 2023.

Goals
1. Named Entity Recognition - Deeper Dive
2. Semantic Searching as a Concept
3. Vector Databases
4. Semantic Searching
5. Multi-Modal Data Mining
6. Retrieval-Augmented Generation (RAG)

Jim did not like the store.
It did not have chocolate.
He could not find anyone to help him.

Paris sails are fun.
(Presume this is a bad transcription of audio)

NER
Overview
● Classify individual spans, or
sequence of tokens, in a text
● Types of Classification
○ Hard Classification
○ Soft Classification
● Types of Methods
○ Machine Learning
○ Rules-Based

NER
Labels
● Locations
○ LOC - Location
○ GPE - Geopolitical Entity
● PERSON
● NORP - Nationalities, religious,
or political groups
● TIME
● DATE
● EVENT
● PRODUCT
● FAC - Buildings, airports,
highways, bridges, etc.

Representing
Texts
Digitally
● Bag-of-Words
● Embeddings

Representing
Texts
Digitally
Bag-of-Words
● The apple is in the tree.
○ 1-the
○ 2-apple
○ 3-is
○ 4-in
○ 1-the
○ 5-tree
● [1, 2, 3, 4, 1, 5]
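
The integer encoding on this slide can be reproduced with a short sketch. It assumes a simple lowercase, letters-only tokenizer and assigns ids in order of first appearance; the function names `tokenize` and `encode` are illustrative and not from the presentation. A word count is shown alongside, since a bag-of-words representation is often stored as counts rather than as an ordered id list.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def encode(text):
    """Map each token to an integer id, assigned in order of first appearance."""
    vocab = {}
    ids = []
    for token in tokenize(text):
        if token not in vocab:
            vocab[token] = len(vocab) + 1  # ids start at 1, as on the slide
        ids.append(vocab[token])
    return ids, vocab

ids, vocab = encode("The apple is in the tree.")
print(ids)  # [1, 2, 3, 4, 1, 5]

# The same tokens as word counts: "the" appears twice.
counts = Counter(tokenize("The apple is in the tree."))
print(counts["the"])  # 2
```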

Representing
Texts
Digitally
Embeddings
● The apple is in the tree.
○ 1-[0.01234, -0.23456, 0.87654,
0.45678, -0.56123, 0.65432,
0.12345, -0.77123, 0.08456,
0.34567, ...]
○ 2-different vector
○ 3-different vector
○ 4-different vector
○ 1-[0.01234, -0.23456, 0.87654,
0.45678, -0.56123, 0.65432,
0.12345, -0.77123, 0.08456,
0.34567, ...]
○ 5-different vector
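
A minimal sketch of the idea on this slide: each token maps to a vector, and the same token ("the") maps to the same vector both times. The vectors below are derived from a hash purely for illustration; they are not learned embeddings and carry no meaning. A real pipeline would obtain them from a trained model, and contextual models may even give the same word different vectors in different sentences.

```python
import hashlib

DIM = 10  # the slide shows ten components before the ellipsis

def toy_embedding(token, dim=DIM):
    """Derive a fixed pseudo-vector from a token's hash (NOT a learned embedding)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    # Scale each byte from 0..255 into the range -1.0..1.0.
    return [round(b / 127.5 - 1.0, 5) for b in digest[:dim]]

tokens = ["the", "apple", "is", "in", "the", "tree"]
vectors = [toy_embedding(t) for t in tokens]

print(vectors[0] == vectors[4])  # True: both positions hold "the"
print(vectors[0] == vectors[1])  # False: "the" and "apple" differ
```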

Vector
Database
What is it?
● It holds vectors in a database
as storage.
● Similar vectors are stored
closer together.
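
One common way to measure how close two vectors are is cosine similarity. The sketch below assumes that measure (the slides do not name one), and the three-dimensional vectors for `apple`, `pear`, and `truck` are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors: apple and pear point in similar directions, truck does not.
apple = [0.9, 0.1, 0.0]
pear = [0.8, 0.2, 0.1]
truck = [0.0, 0.2, 0.9]

print(cosine_similarity(apple, pear) > cosine_similarity(apple, truck))  # True
```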

Vector
Database
How do we use a vector
database?
● We populate a vector database
by using a machine learning
model to vectorize data and
sending the resulting vectors
to the database.
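
The populate-then-query workflow can be sketched in plain Python. Everything here is a stand-in: `vectorize` counts words from a small hypothetical vocabulary instead of calling a machine learning model, and `ToyVectorStore` is an in-memory list rather than a real vector database. The store keeps the original text next to each vector so that a query can return readable results.

```python
import math
import re
from collections import Counter

VOCAB = ["store", "chocolate", "help", "apple", "tree"]  # hypothetical toy vocabulary

def vectorize(text):
    """Stand-in for a machine learning model: count vocabulary words in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [float(counts[word]) for word in VOCAB]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # all-zero vectors match nothing

class ToyVectorStore:
    """In-memory list of (vector, original_text) pairs with brute-force search."""

    def __init__(self):
        self.rows = []

    def add(self, text):
        self.rows.append((vectorize(text), text))

    def query(self, text, k=1):
        q = vectorize(text)
        ranked = sorted(self.rows, key=lambda row: cosine_similarity(q, row[0]), reverse=True)
        return [original for _, original in ranked[:k]]

store = ToyVectorStore()
for sentence in [
    "Jim did not like the store.",
    "It did not have chocolate.",
    "The apple is in the tree.",
]:
    store.add(sentence)

print(store.query("Where is the apple?"))  # ['The apple is in the tree.']
```

The brute-force scan compares the query against every stored vector, which is fine for a handful of rows; dedicated vector databases use indexes to avoid that cost at scale.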

Vector
Database
Why use a vector database?
● Vector databases store vector
data so that users can query it
and find matches based on
vector-level similarity, rather
than explicit human-defined
similarity.

Vector
Database
What is it?
● A vector database holds
numerous vectors or
embeddings of data.
Sometimes, the database will
also store the original data
alongside these vectors.

Vector Database
Stacks
What is available to us?
● Python, Annoy, Streamlit
○ Cheap, easy to deploy, great for
smaller datasets, but requires a
little bit of knowledge to build from
scratch
○ Best for smaller databases (under
10,000 items)
● Python, txtai
○ Cheap and easy to use, more
resource intensive but easy to
deploy
○ Allows for easy interpretability (via
highlighting)

Vector Database
Stacks
What is available to us?
● Python/JavaScript and
Weaviate
○ Open-source
○ Can be done locally, on a server,
or via the Weaviate paid-hosting
○ API is easy to use and easy to
set up

Multi-Modal
What is it?
● Multi-modal data mining is
when we use one type of data
to find data of a different type.
● We could use text to find
images (which do not have
metadata or descriptions) or
images to find text.
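
A minimal sketch of text-to-image search, assuming a model that embeds text and images into the same vector space (CLIP is one well-known example of such a model). No model is called here: the file names and all vector values are invented for illustration, and only the nearest-neighbor step is shown.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical image embeddings; the images themselves have no metadata or descriptions.
image_vectors = {
    "img_001.jpg": [0.9, 0.1, 0.0],
    "img_002.jpg": [0.1, 0.9, 0.1],
    "img_003.jpg": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a text query, assumed to come from the same model.
text_query_vector = [0.1, 0.8, 0.2]

# The image whose vector is most similar to the text vector is the best match.
best = max(
    image_vectors,
    key=lambda name: cosine_similarity(text_query_vector, image_vectors[name]),
)
print(best)  # img_002.jpg
```

Swapping the roles (an image vector as the query, text vectors in the collection) gives image-to-text search with the same code.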

RAG
What is it?
● RAG lets you combine the
strengths of large language
models (LLMs) with vector
databases
● It limits the chances for an LLM
to hallucinate (generate fake
information)
● It uses a vector database to
find material relevant to a query
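
The retrieve-then-generate flow can be sketched as follows. This is an illustration under stated assumptions, not the presenter's implementation: `vectorize` is a toy word-count stand-in for an embedding model, `DOCUMENTS` is a plain list standing in for a vector database, and `generate` is a hypothetical placeholder where an LLM call would go.

```python
import math
import re
from collections import Counter

DOCUMENTS = [
    "Jim did not like the store.",
    "It did not have chocolate.",
    "The apple is in the tree.",
]

def vectorize(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=1):
    """Return the k documents most similar to the query."""
    q = vectorize(query)
    ranked = sorted(DOCUMENTS, key=lambda d: cosine_similarity(q, vectorize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, k=1):
    """Put the retrieved material in front of the question."""
    context = "\n".join(retrieve(query, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def generate(prompt):
    """Hypothetical placeholder for an LLM call; it only echoes the prompt."""
    return prompt

print(generate(build_prompt("Where is the apple?")))
```

Because the model is asked to answer from the retrieved context rather than from memory alone, it has less room to invent details.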

1. Text and Data Mining Searching Vectors

2. Goals 1. Named Entity Recognition - Deeper Dive 2. Semantic Searching as a Concept 3. Vector Databases 4. Semantic Searching 5. Multi-Modal Data Mining 6. Retrieval-Augmented Generation (RAG)

3. Named Entity Recognition (NER)

4. Jim did not like the store. It did not have chocolate. He could not find anyone to help him.

5. Paris is a lovely city.

6. Paris enjoys flying to Asia.

7. Paris sails are fun. (Presume this is a bad transcription of audio)

8. NER Overview ● Classify individual spans, or sequence of tokens, in a text ● Types of Classification ○ Hard Classification ○ Soft Classification ● Types of Methods ○ Machine Learning ○ Rules-Based

9. NER Labels ● Locations ○ LOC - Location ○ GPE - Geopolitical Entity ● PERSON ● NORP - Nationalities, religious, or political groups ● TIME ● DATE ● EVENT ● PRODUCT ● FAC - Buildings, airports, highways, bridges, etc.

10. Brief Recap on Vectors or Embeddings

11. Representing Texts Digitally ● Bag-of-Words ● Embeddings

12. Representing Texts Digitally Bag-of-Words ● The apple is in the tree. ○ 1-the ○ 2-apple ○ 3-is ○ 4-in ○ 1-the ○ 5-tree ● [1, 2, 3, 4, 1, 5]

13. Representing Texts Digitally Embeddings ● The apple is in the tree. ○ 1-[0.01234, -0.23456, 0.87654, 0.45678, -0.56123, 0.65432, 0.12345, -0.77123, 0.08456, 0.34567, ...] ○ 2-different vector ○ 3-different vector ○ 4-different vector ○ 1-[0.01234, -0.23456, 0.87654, 0.45678, -0.56123, 0.65432, 0.12345, -0.77123, 0.08456, 0.34567, ...] ○ 5-different vector

14. Vector Databases

15. Vector Database What is it? ● It holds vectors in a database as storage. ● Similar vectors are stored closer together.

16.

17. Vector Database How do we use a vector database? ● We populate a vector database by using a machine learning model to vectorize data and sending the resulting vectors to the database.

18. Vector Database Why use a vector database?

19. Vector Database Why use a vector database? ● Vector databases store vector data so that users can query it and find matches based on vector-level similarity, rather than explicit human-defined similarity.

20. Vector Database What is it? ● A vector database holds numerous vectors or embeddings of data. Sometimes, the database will also store the original data alongside these vectors.

21. Vector Database Stacks

22. Vector Database Stacks

23. Vector Database Stacks What is available to us? ● Python, Annoy, Streamlit ○ Cheap, easy to deploy, great for smaller datasets, but requires a little bit of knowledge to build from scratch ○ Best for smaller databases (under 10,000 items) ● Python, txtai ○ Cheap and easy to use, more resource intensive but easy to deploy ○ Allows for easy interpretability (via highlighting)

24. Vector Database Stacks What is available to us? ● Python/JavaScript and Weaviate ○ Open-source ○ Can be done locally, on a server, or via the Weaviate paid-hosting ○ API is easy to use and easy to set up

25. Multi-Modal Mining

26. Multi-Modal What is it? ● Multi-modal data mining is when we use one type of data to find data of a different type. ● We could use text to find images (which do not have metadata or descriptions) or images to find text.

27. Multi-Modal How does it work?

28. Retrieval-Augmented Generation

29. How tall is Wookie?

30.

31. How tall is Wookie?

32. RAG What is it? ● RAG lets you combine the strengths of large language models (LLMs) with vector databases ● It limits the chances for an LLM to hallucinate (generate fake information) ● It uses a vector database to find material relevant to a query
