1. Exploring Direct Concept Search
Steve Rowe @steven_a_rowe
Senior Software Engineer, Lucidworks
Committer & PMC member, Lucene/Solr
2. Agenda
• Direct Concept Search
• Word Embedding
• Vector proximity = synonymy?
• Lucene/Solr Dimensional Points
• Lucene modifications
• Data processing
• Query Expansion and Search
• Conclusions
3. Direct Concept Search
• Basic idea: map both query and index terms into
representations in a conceptual space
• Improve recall by expanding queries with concepts
• Concepts can be represented as synonym sets (see WordNet)
• Synonymy is the dominant relation discoverable via proximity
in word embeddings
• Other relations are interesting, but not explored here, e.g.:
- meronymy/holonymy (part/whole)
- hypernymy/hyponymy (conceptually broader/narrower)
- king - man + woman ≈ queen (sketched in code below)
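To make the analogy relation concrete, here is a minimal Java sketch of the underlying vector arithmetic, assuming a hypothetical in-memory map from term to embedding vector; a nearest-neighbor search around the result should land on "queen".

    import java.util.Map;

    class Analogy {
        // Compute v(a) - v(b) + v(c), e.g. king - man + woman.
        // The term -> vector Map is a hypothetical stand-in for a real
        // embedding store.
        static float[] analogy(Map<String, float[]> vectors,
                               String a, String b, String c) {
            float[] va = vectors.get(a), vb = vectors.get(b), vc = vectors.get(c);
            float[] result = new float[va.length];
            for (int i = 0; i < va.length; i++) {
                result[i] = va[i] - vb[i] + vc[i];
            }
            return result; // nearest indexed vector should be "queen"
        }
    }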
5. Word Embedding
• You shall know a word by the company it keeps — John R. Firth
• Reduced-dimension word representation: each term is
represented as a vector of real numbers
• Producible via various methods including neural network
training
• Software: word2vec, GloVe, Gensim, Deeplearning4j
• word2vec supports two algorithms: continuous bag of words
(predict a word given context) and continuous skip-gram
(predict context given a word)
• The distance between two vectors is a relatedness predictor
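As a minimal sketch of one common relatedness measure over two embedding vectors, cosine similarity in plain Java (no assumptions beyond equal-length vectors):

    class VectorSimilarity {
        // Cosine similarity: near 1.0 for vectors pointing the same way,
        // near 0 for unrelated directions.
        static double cosine(float[] a, float[] b) {
            double dot = 0, normA = 0, normB = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }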
6. Word embedding proximity: synonymy?
Leeuwenberg et al., “A Minimally Supervised Approach for Synonym Extraction with Word Embeddings”, PBML 105, April 2016, 111-142
Interventions:
• Synonymy precision@1 increased by agreement between
multiple models trained with different parameters
• POS tags increased synonymy precision
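One way to operationalize the agreement idea, as a hedged sketch: keep only candidates that appear among the term's top-k neighbors in every model. The per-model neighbor maps are hypothetical placeholders for separately trained embeddings.

    import java.util.*;

    class ModelAgreement {
        // modelNeighbors: for each separately trained model, a map from a
        // term to its top-k nearest-neighbor terms. Retain only candidates
        // on which all models agree.
        static List<String> agreedNeighbors(
                List<Map<String, List<String>>> modelNeighbors, String term) {
            List<String> agreed = new ArrayList<>(
                modelNeighbors.get(0).getOrDefault(term, Collections.emptyList()));
            for (Map<String, List<String>> model :
                     modelNeighbors.subList(1, modelNeighbors.size())) {
                agreed.retainAll(model.getOrDefault(term, Collections.emptyList()));
            }
            return agreed;
        }
    }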
7. Word embedding proximity: synonymy?
Roy et al., “Using Word Embeddings for Automatic Query Expansion”, Neu-IR ’16 SIGIR Workshop on Neural Information Retrieval
Intervention: incremental KNN: iteratively re-rank each candidate by its
similarity to the more proximate candidates, and prune the least
similar (see the sketch below)
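A rough, illustrative sketch of the incremental-KNN idea, not the paper's exact formulation: re-score each candidate by combining its similarity to the query with its average similarity to the candidates ranked above it, then keep the strongest. cosine() is the earlier VectorSimilarity sketch.

    import java.util.*;

    class IncrementalKnn {
        // ranked: candidate terms with their vectors, in decreasing
        // similarity to the query vector. Re-score each candidate against
        // the candidates ranked above it (the "more proximate" ones) and
        // prune the weakest.
        static List<String> rerank(float[] query,
                                   LinkedHashMap<String, float[]> ranked,
                                   int keep) {
            List<String> terms = new ArrayList<>(ranked.keySet());
            Map<String, Double> score = new HashMap<>();
            for (int i = 0; i < terms.size(); i++) {
                float[] cand = ranked.get(terms.get(i));
                double s = VectorSimilarity.cosine(query, cand);
                for (int j = 0; j < i; j++) {
                    s += VectorSimilarity.cosine(ranked.get(terms.get(j)), cand)
                         / Math.max(1, i);
                }
                score.put(terms.get(i), s);
            }
            terms.sort(Comparator.comparingDouble(t -> -score.get(t)));
            return new ArrayList<>(terms.subList(0, Math.min(keep, terms.size())));
        }
    }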
8. Word embedding: polysemy
• All senses of a surface form are treated as if they are a single
concept, e.g. cat (animal, person, equipment manufacturer),
so the vector is trained over multiple unrelated contexts.
This can reduce the vector’s distinguishing power.
9. Lucene/Solr Dimensional Points
• Added in Lucene 6.0
• Replaced Trie* numerics
• Faster range search
• K nearest neighbor search
• 1-8 dimensions per value, 1-16 bytes per dimension
• Underlying data structure: Block k-
dimensional trees, aka bkd-trees
• Binary tree with multiple values in “leaf”
blocks (currently 512-1024 by default)
• Recursive split along the widest dimension until a
block contains fewer than the configured maximum number of values
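A small example of the stock dimensional-points API (Lucene 6.x): index a two-dimensional FloatPoint and search it with a box-shaped range query. The field name and values are illustrative.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FloatPoint;
    import org.apache.lucene.search.Query;

    class PointsExample {
        static Document makeDoc() {
            Document doc = new Document();
            doc.add(new FloatPoint("location", 41.8781f, -87.6298f)); // 2 dims
            return doc; // add via IndexWriter as usual
        }

        static Query boxQuery() {
            // Matches points inside the per-dimension [lower, upper] box.
            return FloatPoint.newRangeQuery("location",
                new float[] { 40f, -90f },   // lower corner
                new float[] { 43f, -85f });  // upper corner
        }
    }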
10. Lucene modifications: point dimensions++
• Lucene 6.6.1
• Dimensional points are limited to 8 dimensions
- Prior to Lucene 6.0, the limit was briefly 255 dimensions
• Increased limit to 300 dimensions
- For a few FloatPoints, 300-dimensional values seemed to work
- Failed with more points: negative-dimension exceptions
• Trial and error -> 127 dimensions works
LUCENE-6917
r1719562
12. Lucene modifications: FloatPointNearestNeighbor
• Adapted lucene/sandbox LatLonPoint.nearest() to
k-dimensional FloatPoints
• In some ways this was a simplification:
- LatLonPoint.nearest() calculates the distance in meters
between two points on a sphere in polar coordinates, and
requires special dateline handling
• FloatPointNearestNeighbor calculates the Euclidean distance
between two points using the Pythagorean formula:
d(a, b) = √( Σᵢ (aᵢ − bᵢ)² )
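The same computation in Java; comparing squared distances lets a nearest-neighbor search skip the square root when only the ordering matters:

    class EuclideanDistance {
        // Straight Pythagorean distance between two k-dimensional points.
        static double distance(float[] a, float[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }
    }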
13. Data preparation: English Wikipedia
• English Wikipedia 1/1/2016 dump
- ~5M articles
- 57GB uncompressed
• Converted to plaintext with Giuseppe Attardi’s Wikiextractor:
http://attardi.github.io/wikiextractor/ -> 11GB
• Sentence segmentation with OpenNLP’s pre-trained English model
• Tokenization using Lucene StandardTokenizer+LowercaseFilter (sketched below)
• Significant phrase identification using word2vec’s word2phrase
(twice, to produce up to 4-grams)
• Vocabulary size: 4.1M (2.6M phrases); corpus size: 1.7B words/phrases
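The tokenization step above, expressed as a Lucene 6.x Analyzer: a StandardTokenizer followed by a LowerCaseFilter.

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    class PlainTextAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();         // Unicode word segmentation
            TokenStream filtered = new LowerCaseFilter(source); // lowercase all terms
            return new TokenStreamComponents(source, filtered);
        }
    }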
15. Data preparation: indexing
• Created a side-car index in which each document holds a
Wikipedia term or phrase and its word-embedding vector
as a 127-dimension FloatPoint (document construction sketched below)
- Initially indexed on an SSD, but segment merging used
all free space (>100GB) for offline point sorting
- On a larger rotating disk, indexing took about 5 hours,
and disk usage peaked at about 60GB.
- Final index size: 2GB on disk
• Created full Wikipedia index
- Indexing took ~30 minutes
- 10GB disk used
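What one side-car document might look like, as a sketch: the surface term stored for retrieval, plus its embedding as a 127-dimension FloatPoint. Field names are illustrative, and stock Lucene caps points at 8 dimensions, so this requires the patched build described earlier.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FloatPoint;
    import org.apache.lucene.document.StringField;

    class SideCarDoc {
        // vector.length == 127 in this experiment
        static Document make(String term, float[] vector) {
            Document doc = new Document();
            doc.add(new StringField("term", term, Field.Store.YES)); // exact match + retrievable
            doc.add(new FloatPoint("vector", vector));               // indexed for KNN search
            return doc;
        }
    }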
16. Searching
• Adapted Lucene SearchFiles demo code to expand queries
• Query parser analyzer: StandardTokenizer+LowercaseFilter
+ShingleFilter
• Expand the query (both strategies sketched below):
- Look up vectors for all analyzed terms, sum them, perform a
KNN search (K=1) against the side-car index, then add the
resulting term to the final query
- For each analyzed term, perform a KNN search (K=2) against the
side-car index, then add the resulting terms to the final query
• Query the Wikipedia index with the expanded query
• This is not “direct concept search”!
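Both expansion strategies as a sketch; lookupVector() and nearestTerms() are hypothetical helpers over the side-car index (the latter wrapping the k-nearest-neighbor search).

    import java.util.ArrayList;
    import java.util.List;

    class QueryExpansion {
        static String expand(List<String> analyzedTerms) {
            List<String> expanded = new ArrayList<>(analyzedTerms);
            // 1) Sum all term vectors; add the single nearest term (K=1).
            float[] sum = null;
            for (String term : analyzedTerms) {
                float[] v = lookupVector(term);
                if (sum == null) {
                    sum = v.clone();
                } else {
                    for (int i = 0; i < sum.length; i++) sum[i] += v[i];
                }
            }
            expanded.addAll(nearestTerms(sum, 1));
            // 2) Add each term's two nearest neighbors (K=2).
            for (String term : analyzedTerms) {
                expanded.addAll(nearestTerms(lookupVector(term), 2));
            }
            return String.join(" ", expanded);
        }

        // Hypothetical helpers, to be backed by the side-car index:
        static float[] lookupVector(String term) {
            throw new UnsupportedOperationException();
        }
        static List<String> nearestTerms(float[] vector, int k) {
            throw new UnsupportedOperationException();
        }
    }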
18. Conclusions
• Building high-dimension points indexes with Lucene is slow
• Lucene/Solr should have KNN search over k-dimensional points
• Word embedding proximity isn’t entirely reliable for synonymy
• Faiss: a library for efficient similarity search
https://code.facebook.com/posts/1373769912645926/faiss-a-library-for-efficient-similarity-search/
- KNN search on high-dimension vectors on GPUs