This talk demonstrates how to use word2vec models in a Postgres database to facilitate semantic search of job posts. Attendees will learn to structure models for usage in a relational database.
5. Why work in a relational database?
- Bring algorithms to the data
- E.g. machine learning in data warehouses
- Access to other DB features:
- Geospatial search
- Full text search
- Other DB types: trees, ranges, json
- Work with existing tuning options:
- Materialized tables
- Control where data is stored
6. Architectural Alternatives
- One or more large databases
- Could test/run locally in containers
- Sharded systems
- DB architected around many separate parts
- E.g. AWS S3 + Athena + Glue + Lambdas
8. Vectors
- Need to store in a DB
- Word2vec - Google News: 300-dimensional vectors for 3 million words and phrases.
- Image vectors (e.g. SIFT)
- Audio
- One large matrix
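The Google News figures above pin down the size of that matrix: 3 million rows of 300 float32 dimensions. A quick back-of-the-envelope check (this is the raw matrix only, before vocabulary strings or file overhead):

```python
# Raw storage for the Google News word2vec matrix:
# 3 million words/phrases x 300 dimensions x 4 bytes per float32.
rows = 3_000_000
dims = 300
bytes_per_float = 4  # float32

raw_bytes = rows * dims * bytes_per_float
print(raw_bytes)                    # 3600000000
print(f"{raw_bytes / 1e9:.1f} GB")  # 3.6 GB
```

This lines up with the ~3.6 GB on-disk model size reported on the final slide; the remainder is the vocabulary itself.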
15. Why Word Vectors?
- Give access to “meaning”, rather than tokens:
- Search on similar words, concepts
- Or dis-similar words, concepts
- Averaging terms in documents allows you to compare meaning
- E.g. re-ranking top search results for “aboutness” or meaning diversity
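The averaging idea can be sketched with toy vectors. These 3-dimensional vectors and their values are invented for illustration (real word2vec vectors are 300-dimensional); the point is that averaging term vectors yields a document vector whose cosine similarity tracks "aboutness":

```python
import numpy as np

# Toy "word vectors" -- values are made up for illustration.
vectors = {
    "javascript": np.array([0.90, 0.10, 0.00]),
    "js":         np.array([0.85, 0.15, 0.05]),
    "node":       np.array([0.80, 0.20, 0.10]),
    "plumbing":   np.array([0.00, 0.10, 0.90]),
}

def doc_vector(tokens):
    """Average the vectors of a document's tokens into one vector."""
    return np.mean([vectors[t] for t in tokens], axis=0)

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

resume = doc_vector(["javascript", "node"])
job_a  = doc_vector(["js"])          # semantically related
job_b  = doc_vector(["plumbing"])    # unrelated

# The related posting scores higher than the unrelated one.
print(cosine_sim(resume, job_a) > cosine_sim(resume, job_b))  # True
```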
16. Design
Take my resume, tokenize it
Average term vectors
Find a large list of related terms (e.g., javascript -> js, node, css)
Find matching postings
Take each posting, tokenize it
Average terms in the entire posting
Compute the cosine distance between the resume and posting
Sort
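The pipeline above can be sketched end to end in Python. The vector table, `tokenize`, and the example postings are all invented stand-ins for the real model and data; only the shape of the computation (tokenize, average, cosine distance, sort) follows the slides:

```python
import numpy as np

# Toy word vectors standing in for a word2vec model (values invented).
VECTORS = {
    "python":   np.array([1.0, 0.0]),
    "postgres": np.array([0.9, 0.3]),
    "sql":      np.array([0.8, 0.4]),
    "sales":    np.array([0.0, 1.0]),
}

def tokenize(text):
    """Lowercase whitespace tokenization; keep only in-vocabulary terms."""
    return [t for t in text.lower().split() if t in VECTORS]

def average(tokens):
    return np.mean([VECTORS[t] for t in tokens], axis=0)

def cosine_distance(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

resume_vec = average(tokenize("Python Postgres"))

postings = {
    "https://example.com/job/1": "SQL Postgres",
    "https://example.com/job/2": "sales sales",
}

# Rank postings by cosine distance to the resume (smaller = more similar).
ranked = sorted(
    postings,
    key=lambda url: cosine_distance(resume_vec, average(tokenize(postings[url]))),
)
print(ranked[0])  # the database-flavoured posting ranks first
```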
17. Average terms in a resume
CREATE TABLE resumes_average AS
SELECT
  resumes.person_name,
  tokenize(resume) AS word_averages
FROM resumes
18. Average terms in a job
CREATE TABLE job_averages AS
SELECT
  url,
  tokenize(terms) AS word_averages
FROM jobs
23. “Inverted File System with Asymmetric Distance Calculation”
- Locality sensitive hashing
- Store near-ish vectors together
- Distance can be between hashes, between hash and vector
- Choosing search performance vs. accuracy
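The "store near-ish vectors together" idea can be illustrated with the simplest locality-sensitive hash: random hyperplanes, where each hash bit records which side of a random hyperplane the vector falls on. This is a generic random-projection LSH sketch, not the specific IVFADC scheme from the slide title:

```python
import numpy as np

rng = np.random.default_rng(0)
DIMS, BITS = 8, 16

# One random hyperplane per hash bit (random-projection LSH).
planes = rng.standard_normal((BITS, DIMS))

def lsh_hash(v):
    """Sign of each projection becomes one bit of the hash."""
    return tuple(bool(b) for b in (planes @ v) > 0)

def hamming(h1, h2):
    """Distance computed between hashes, never touching raw vectors."""
    return sum(b1 != b2 for b1, b2 in zip(h1, h2))

v    = rng.standard_normal(DIMS)
near = v + 0.01 * rng.standard_normal(DIMS)  # small perturbation of v
far  = rng.standard_normal(DIMS)             # unrelated vector

# Nearby vectors agree on most bits; hash distance tracks angular distance.
print(hamming(lsh_hash(v), lsh_hash(near)), hamming(lsh_hash(v), lsh_hash(far)))
```

More hash bits buy accuracy at the cost of bucket sparsity, which is the performance-versus-accuracy trade-off the slide mentions.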
25. Issues
- Doesn’t consider term or concept frequency
- Doesn’t show us jobs that are the next steps
- Doesn’t consider how old a job on the resume is
30. Improvements
● Tune TF*IDF implementation (not currently in Postgres)
● Search / cap results repeatedly. Can handle:
○ Aboutness / not-aboutness
○ Result diversity
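Since the slide notes TF*IDF is not currently implemented in Postgres here, a minimal pure-Python sketch of the weighting (invented three-document corpus); in the search design above, each term's vector would be multiplied by its weight before averaging:

```python
import math
from collections import Counter

docs = [
    ["postgres", "sql", "sql"],
    ["sales", "postgres"],
    ["sales", "marketing"],
]

def tf_idf(doc, corpus):
    """TF*IDF: terms frequent in this doc but rare in the corpus score highest."""
    counts = Counter(doc)
    n = len(corpus)
    weights = {}
    for term, count in counts.items():
        df = sum(1 for d in corpus if term in d)          # document frequency
        weights[term] = (count / len(doc)) * math.log(n / df)
    return weights

w = tf_idf(docs[0], docs)
# "sql" appears twice here and in no other document, so it outweighs
# "postgres", which also appears in a second document.
print(w["sql"] > w["postgres"])  # True
```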
31. Variations of Word2vec
FastText - incorporates subword (character n-gram) information alongside whole words
StarSpace - embeds sentences for semantic similarity and categorization
36. FAISS
- Library by Facebook for fast vector search
- Vectors stored in Voronoi cells
- Can be quantized
- Can use GPUs
- Offers Clustering, PCA
- E.g. nearest vectors:
D, I = index.search(xq, 5)  # distances and indices of 5 nearest neighbours
print(I[:5])                # neighbour ids for the first 5 queries
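The snippet above assumes an already-built FAISS index; for intuition, here is a brute-force NumPy equivalent of what an `IndexFlatL2`-style exact search computes (FAISS's flat L2 index returns squared L2 distances `D` and indices `I`). The array shapes are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(42)
xb = rng.standard_normal((100, 8)).astype("float32")  # database vectors
xq = rng.standard_normal((3, 8)).astype("float32")    # query vectors

def search(xb, xq, k):
    """Brute-force k-nearest-neighbour search by squared L2 distance,
    mirroring the (D, I) return shape of faiss.IndexFlatL2.search."""
    d2 = ((xq[:, None, :] - xb[None, :, :]) ** 2).sum(axis=2)
    I = np.argsort(d2, axis=1)[:, :k]            # k nearest ids per query
    D = np.take_along_axis(d2, I, axis=1)        # their distances, sorted
    return D, I

D, I = search(xb, xq, 5)
print(I.shape)  # (3, 5)
```

FAISS's quantized indexes trade this exactness for speed and memory, which is where the Voronoi cells and quantization above come in.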
39. Google for
Concatenated orientation histograms
Why did they use Euclidean distance?
restricted Boltzmann machine
Spectral hashing
Euclidean Locality-Sensitive Hashing
inverted file system with asymmetric distance calculation
40. How much disk space do these take?
Word2vec model: 3,644,258,522 bytes (~3.6 GB)
Google_vecs.txt: 10,766,478,818 bytes (~10.8 GB)
Quantized index:
IVSADC: