Word2vec in Postgres

Gary Sieling
Software Architect
IQVIA

Agenda
Trends in integrating ML and relational databases
Specific examples around word2vec and Postgres
Job search

Example:
Find jobs within driving distance of Philadelphia that pay more than I make,
requiring a skillset that roughly matches mine

Notes
- Geographical Location
- Pay > X
- Semantic similarity to current or previous roles

Why work in a relational database?
- Bring algorithms to the data
  - E.g. machine learning in data warehouses
- Access to other DB features:
  - Geospatial search
  - Full text search
  - Other DB types: trees, ranges, json
- Work with existing tuning options:
  - Materialized tables
  - Control where data is stored

Architectural Alternatives
- One or more large databases
- Could test/run locally in containers
- Sharded systems
- DB architected around many separate parts
- E.g. AWS S3 + Athena + Glue + Lambdas

Examples
- Apache MADlib
- BigQuery
- Microsoft SQL Server Machine Learning (R)

Vectors
- Need to store in a DB
- Word2vec - Google News: 300-dimensional vectors for 3 million words and phrases.
- Image vectors (e.g. SIFT)
- Audio
- One large matrix

Vectors
- HDF5 / numpy
- Dl4j - JSON
- Lucene - length limited, feature limited
- Postgres: Bytes, optimized (bytea)
- FAISS (standalone system)
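
A minimal sketch of the bytea option, assuming each vector is stored as packed little-endian 32-bit floats. The slides do not state the actual byte layout, and the helper names `vector_to_bytes` and `bytes_to_vector` are hypothetical.

```python
import struct

def vector_to_bytes(vec):
    """Pack a list of floats as little-endian float32 (4 bytes per dimension)."""
    return struct.pack(f"<{len(vec)}f", *vec)

def bytes_to_vector(blob):
    """Inverse of vector_to_bytes: unpack 4-byte floats back into a list."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))

# Values chosen to be exactly representable as float32.
vec = [0.5, -1.25, 3.0]
blob = vector_to_bytes(vec)
restored = bytes_to_vector(blob)
print(len(blob), restored)
```

In this layout a 300-dimensional vector takes 1,200 bytes per row.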

FAISS
“A library for efficient similarity search and clustering of dense vectors.”

Common Operations
- Reshape
- Cosine distance
- Sum all
- Nearest
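
The four operations above can be sketched in plain Python. This is an illustration on toy 2-dimensional vectors, not the code used for the talk; the function names and the sample words are assumptions.

```python
import math

def reshape(flat, dim):
    """Split one flat list into rows of `dim` values (one row per word)."""
    return [flat[i:i + dim] for i in range(0, len(flat), dim)]

def add_all(vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]

def cosine_distance(a, b):
    """1 - cos(angle between a and b); 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(query, named_vectors, k=1):
    """Brute-force search: the k names closest to `query` by cosine distance."""
    ranked = sorted(
        named_vectors,
        key=lambda name: cosine_distance(query, named_vectors[name]),
    )
    return ranked[:k]

# One flat "matrix" of three toy word vectors.
matrix = reshape([1.0, 0.0, 0.9, 0.1, 0.0, 1.0], dim=2)
words = dict(zip(["javascript", "js", "nursing"], matrix))
others = {w: v for w, v in words.items() if w != "javascript"}

print(add_all(matrix))
print(nearest(words["javascript"], others))
```

Brute-force search like this scans every vector, which is the cost that quantized indexes are meant to avoid.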

Vectors - Memory
- KeyedVectors
- LMDB (https://github.com/ThoughtRiver/lmdb-embeddings)
- Plasticity (https://github.com/plasticityai/magnitude)

Dataset
~30k jobs:
const indeed = require('indeed-scraper');
const queryOptions = {
  city: 'Seattle, WA',
  radius: '25',
  level: 'entry_level',
  jobType: 'fulltime',
  maxAge: '7',
  sort: 'date',
  limit: '100'
};

Goal
Semantic similarity between resume and jobs

Why Word Vectors?
- Give access to “meaning”, rather than tokens:
- Search on similar words, concepts
- Or dis-similar words, concepts
- Averaging terms in documents allows you to compare meaning
- E.g. re-ranking top search results for “aboutness” or meaning diversity

Design
Take my resume, tokenize it
Average term vectors
Find a large list of related terms (e.g., javascript -> js, node, css)
Find matching postings
Take each posting, tokenize it
Average terms in the entire posting
Compute the cosine distance between the resume and posting
Sort
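
A rough sketch of the tokenize, average, compare and sort steps, outside the database. The 3-dimensional vectors in `EMBEDDINGS` are made up for illustration (a real model would be loaded from disk), the related-terms expansion step is left out, and `split_tokens`, `average_vector` and `rank_postings` are hypothetical names rather than functions from postgres-word2vec.

```python
import math
import re

# Toy embeddings, invented for this example.
EMBEDDINGS = {
    "javascript": [0.9, 0.1, 0.0],
    "node": [0.8, 0.2, 0.0],
    "css": [0.7, 0.3, 0.1],
    "nurse": [0.0, 0.1, 0.9],
    "clinical": [0.1, 0.0, 0.8],
}

def split_tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def average_vector(text):
    """Average the vectors of known tokens; None when no token is known."""
    vectors = [EMBEDDINGS[t] for t in split_tokens(text) if t in EMBEDDINGS]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_postings(resume, postings):
    """Return (score, url) pairs, most similar posting first."""
    resume_avg = average_vector(resume)
    scored = []
    for url, text in postings.items():
        posting_avg = average_vector(text)
        if resume_avg is not None and posting_avg is not None:
            scored.append((cosine_similarity(resume_avg, posting_avg), url))
    return sorted(scored, reverse=True)

postings = {
    "job/1": "Clinical nurse, night shift",
    "job/2": "Node and CSS developer",
}
ranking = rank_postings("JavaScript and Node engineer", postings)
print([url for _, url in ranking])
```

Tokens missing from the model are skipped, so a posting made up entirely of unknown words is dropped rather than scored.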

Average terms in a resume
CREATE TABLE resumes_average AS
SELECT
  resumes.person_name,
  tokenize(resume) AS resume_avg
FROM resumes

Average terms in a job
CREATE TABLE job_averages AS
SELECT
  url,
  tokenize(terms) AS word_averages
FROM jobs

Comparison
SELECT
  job_averages.url,
  resumes_average.person_name,
  cosine_similarity_bytea(word_averages, resume_avg)
FROM job_averages, resumes_average
ORDER BY 3 DESC

Focus
I.e. “clinical” vs. “software”

Focus
SELECT
  term,
  cosine_similarity_bytea(tokenize(term), tokenize(resume))
FROM resumes
ORDER BY 2 DESC

Postgres-word2vec
- https://github.com/guenthermi/postgres-word2vec
- Library puts FAISS data into tables
- Exposes functions to SQL to process data
- Trade time for precision

“Inverted File System with Asymmetric Distance Computation” (IVFADC)
- Locality sensitive hashing
- Store near-ish vectors together
- Distance can be between hashes, between hash and vector
- Choosing search performance vs. accuracy

Distance
Product Quantization
- Snap to grid
- “Product Quantization for Nearest Neighbor search”
- Voronoi
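
A toy sketch of the "snap to grid" idea: split each vector into sub-vectors, replace each sub-vector with the index of its nearest centroid, and compute distances between an uncompressed query and the compressed codes (the asymmetric part). The centroids here are hand-picked for readability; in the cited paper they are learned with k-means. All function names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(vec, m):
    """Cut a vector into m equal-length sub-vectors."""
    size = len(vec) // m
    return [vec[i * size:(i + 1) * size] for i in range(m)]

def encode(vec, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    codes = []
    for sub, centroids in zip(split(vec, len(codebooks)), codebooks):
        best = min(
            range(len(centroids)),
            key=lambda i: squared_distance(sub, centroids[i]),
        )
        codes.append(best)
    return codes

def decode(codes, codebooks):
    """Rebuild the approximate vector from its centroid indexes."""
    vec = []
    for code, centroids in zip(codes, codebooks):
        vec.extend(centroids[code])
    return vec

def asymmetric_distance(query, codes, codebooks):
    """Squared distance between a raw query and a compressed vector."""
    subs = split(query, len(codebooks))
    return sum(
        squared_distance(sub, centroids[code])
        for sub, code, centroids in zip(subs, codes, codebooks)
    )

# Two sub-spaces, four hand-picked centroids each (assumed, not trained).
grid = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
codebooks = [grid, grid]

stored = [0.9, 0.1, 0.2, 0.8]
codes = encode(stored, codebooks)
print(codes, decode(codes, codebooks))
print(asymmetric_distance([0.5, 0.5, 0.5, 0.5], codes, codebooks))
```

The stored vector shrinks from four floats to two small integers, at the price of a distance that is only approximate.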

Issues
- Doesn’t consider term or concept frequency
- Doesn’t show us jobs that are the next steps
- Doesn’t consider how old a job in the resume was

TF-IDF
● Consider global term frequency to establish significance
● Not in Postgres FTS
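
For reference, a minimal sketch of the textbook weighting, with tf as the term count divided by document length and idf as log(N / df). This is an assumption about the general technique and may differ from the SQL used in this talk; `tf_idf` is a hypothetical helper.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return {doc_id: {term: score}} using tf * log(N / df)."""
    tokenized = {
        doc_id: text.lower().split() for doc_id, text in documents.items()
    }
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    n_docs = len(tokenized)

    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        scores[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores

docs = {
    "job/1": "postgres developer postgres",
    "job/2": "java developer",
}
scores = tf_idf(docs)
print(scores["job/1"])
```

A term that appears in every document, such as "developer" here, gets a score of zero, which is how global frequency establishes significance.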

TF-IDF
SELECT
  url,
  sum( ( rt.term_in_doc_count / gt.tf ) / log(gt.term_global_count / gt.df)) AS score
FROM global_terms gt,
  resume_terms rt
GROUP BY url, title
ORDER BY 2 DESC

Example query - TF*IDF
SELECT url,
  sum(
    ( cosine_similarity_bytea(tokenize(resume), resume_avg) ) -- “semantic similarity”
    * (term_in_doc_count / tf) / log(total_docs.cnt / df) -- tf*idf
  ) AS score
FROM tfidf JOIN resumes AS terms ON ...
GROUP BY url

Example output
"0.08" "https://www.indeed.com/rc/clk?jk=24f9c321eca15660&fccid=663350f2630dae21&vjs=3" "Engineering Intern"
"0.05" "https://www.indeed.com/rc/clk?jk=0e0488d24db733db&fccid=7a3824693ee1074b&vjs=3" "EE"
"0.049" "https://www.indeed.com/rc/clk?jk=906d782bc8b1189c&fccid=db9e6b2d86b4afad&vjs=3" "AWS Javascript Engineering Intern"
"0.049" "https://www.indeed.com/rc/clk?jk=3fb6094b98c26f6e&fccid=734cb5a01ee60f80&vjs=3" "Business Program Manager"
"0.049" "https://www.indeed.com/rc/clk?jk=e0bedc0323826646&fccid=113517153f849886&vjs=3" "CPS Investigation Worker Trainee"
"0.046" "https://www.indeed.com/rc/clk?jk=68f7cd8164eac80c&fccid=b795f294efb0ecd0&vjs=3" "CIVILIAN PAYROLL TECHNICIAN"

Improvements
● Tune TF*IDF implementation (not currently in Postgres)
● Search / cap results repeatedly. Can handle:
○ Aboutness / not-aboutness
○ Result diversity

Variations of Word2vec
FastText - incorporates letters with the words
StarSpace - Semantically similar sentences, categorization

References
- https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf
- https://stacks.stanford.edu/file/druid:yj296hj2790/Mak_Using_SIFT_features_for_small_to_mid_scale_image_database_search.pdf

Contact
gary.sieling@gmail.com
@garysieling
https://www.findlectures.com
Interests:
● Continuous Integration, Solr, Postgres, Word2vec, Data Warehousing, Scala
● Hiring, 1-1s, etc

Find documents in different Voronoi spots

FAISS
- Library by Facebook for fast vector search
- Vectors stored in Voronoi cells
- Can be quantized
- Can use GPUs
- Offers Clustering, PCA
- E.g. nearest vectors:
D, I = index.search(xq, 5)
print(I[:5])

Dataset
https://www.kaggle.com/madhab/jobposts/version/1#
https://catalog.data.gov/dataset/nyc-jobs-26c80
http://data-wake.opendata.arcgis.com/datasets/ral::current-job-postings
https://www.kaggle.com/c/job-recommendation
https://www.jobspikr.com/
https://opendata.stackexchange.com/questions/1907/a-dataset-of-resumes
Federal: https://public.enigma.com/datasets/o-net-occupations/f94323ab-b0de-

Similarity
Distance measures:
- Cosine
- Euclidean
- Manhattan distance
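
A small sketch comparing the three measures on one pair of toy vectors. It also checks a standard identity: once vectors are scaled to unit length, squared Euclidean distance equals twice the cosine distance, so the two measures rank neighbours the same way. Function names are assumptions.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

a, b = [3.0, 4.0], [4.0, 3.0]
print(euclidean(a, b), manhattan(a, b), cosine_distance(a, b))

# For unit vectors: |u - v|^2 = 2 - 2 * (u . v) = 2 * cosine_distance(u, v)
ua, ub = normalize(a), normalize(b)
print(euclidean(ua, ub) ** 2, 2 * cosine_distance(ua, ub))
```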

Google for
Concatenated orientation histograms
Why did they use Euclidean distance?
restricted Boltzmann machine
Spectral hashing
Euclidean Locality-Sensitive Hashing
inverted file system with asymmetric distance computation

How much disk space do these take?
Word2vec model: 3,644,258,522
Google_vecs.txt: 10,766,478,818
Quantized index:
IVFADC:

Query performance
Word2vec model:
Quantized index:
IVFADC:

Where to get datasets
Geospatial
Jobs
Resumes
NAICS codes

Geospatial search (jobs near me)

Hide jobs I really don’t want

Matching jobs to industries (NAICS)

Query for resumes (filling a position)

Supply / Demand for skills in my area

How focused is this resume?
Time from start - end
# of unrelated things (variance)

How likely am I to get someone with this skillset

When does “solr” mean “solr” the way I want
Saw something recent about this - starrr?

Effect of Male/Female names on search
http://www.nber.org/papers/w9873.pdf

Comparison to other technologies
Vespa.ai
ElasticSearch
Solr
