1. Exploring Direct Concept Search
Steve Rowe @steven_a_rowe
Senior Software Engineer, Lucidworks
Committer & PMC member, Lucene/Solr
2. Agenda
• Direct Concept Search
• Word Embedding
• Vector proximity = synonymy?
• Lucene/Solr Dimensional Points
• Lucene modifications
• Data processing
• Query Expansion and Search
• Conclusions
3. Direct Concept Search
• Basic idea: map both query and index terms into
representations in a conceptual space
• Improve recall by expanding queries with concepts
• Concepts can be represented as synonym sets (see WordNet)
• Synonymy is the dominant relation discoverable via proximity
in word embeddings
• Other relations are interesting, but not explored here, e.g.:
- meronymy/holonymy (part/whole)
- hypernymy/hyponymy (conceptually broader/narrower)
- king - man + woman ≈ queen (sketched in code below)
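To make the analogy relation concrete, here is a minimal Java sketch of the underlying vector arithmetic, assuming a hypothetical in-memory map from term to embedding vector; a nearest-neighbor search around the result should land on "queen".

    import java.util.Map;

    class Analogy {
        // Compute v(a) - v(b) + v(c), e.g. king - man + woman.
        // The term -> vector Map is a hypothetical stand-in for a real
        // embedding store.
        static float[] analogy(Map<String, float[]> vectors,
                               String a, String b, String c) {
            float[] va = vectors.get(a), vb = vectors.get(b), vc = vectors.get(c);
            float[] result = new float[va.length];
            for (int i = 0; i < va.length; i++) {
                result[i] = va[i] - vb[i] + vc[i];
            }
            return result; // nearest indexed vector should be "queen"
        }
    }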
5. Word Embedding
• You shall know a word by the company it keeps — John R. Firth
• Reduced-dimension word representation: each term is
represented as a vector of real numbers
• Producible via various methods including neural network
training
• Software: word2vec, GloVe, Gensim, Deeplearning4j
• word2vec supports two algorithms: continuous bag of words
(predict a word given context) and continuous skip-gram
(predict context given a word)
• The distance between two vectors is a relatedness predictor
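As a minimal sketch of one common relatedness measure over two embedding vectors, cosine similarity in plain Java (no assumptions beyond equal-length vectors):

    class VectorSimilarity {
        // Cosine similarity: near 1.0 for vectors pointing the same way,
        // near 0 for unrelated directions.
        static double cosine(float[] a, float[] b) {
            double dot = 0, normA = 0, normB = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }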
6. Word embedding proximity: synonymy?
Leeuwenberg et al., “A Minimally Supervised Approach for Synonym Extraction with Word Embeddings”, PBML 105, April 2016, 111-142
Interventions:
• Synonymy precision@1 increased by agreement between
multiple models trained with different parameters
• POS tags increased synonymy precision
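One way to operationalize the agreement idea, as a hedged sketch: keep only candidates that appear among the term's top-k neighbors in every model. The per-model neighbor maps are hypothetical placeholders for separately trained embeddings.

    import java.util.*;

    class ModelAgreement {
        // modelNeighbors: for each separately trained model, a map from a
        // term to its top-k nearest-neighbor terms. Retain only candidates
        // on which all models agree.
        static List<String> agreedNeighbors(
                List<Map<String, List<String>>> modelNeighbors, String term) {
            List<String> agreed = new ArrayList<>(
                modelNeighbors.get(0).getOrDefault(term, Collections.emptyList()));
            for (Map<String, List<String>> model :
                     modelNeighbors.subList(1, modelNeighbors.size())) {
                agreed.retainAll(model.getOrDefault(term, Collections.emptyList()));
            }
            return agreed;
        }
    }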
7. Word embedding proximity: synonymy?
Roy et al., “Using Word Embeddings for Automatic Query Expansion”, Neu-IR ’16 SIGIR Workshop on Neural Information Retrieval
Intervention: incremental KNN: iteratively re-rank each candidate by its
similarity to the more proximate candidates, and prune the least
similar (see the sketch below)
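A rough, illustrative sketch of the incremental-KNN idea, not the paper's exact formulation: re-score each candidate by combining its similarity to the query with its average similarity to the candidates ranked above it, then keep the strongest. cosine() is the earlier VectorSimilarity sketch.

    import java.util.*;

    class IncrementalKnn {
        // ranked: candidate terms with their vectors, in decreasing
        // similarity to the query vector. Re-score each candidate against
        // the candidates ranked above it (the "more proximate" ones) and
        // prune the weakest.
        static List<String> rerank(float[] query,
                                   LinkedHashMap<String, float[]> ranked,
                                   int keep) {
            List<String> terms = new ArrayList<>(ranked.keySet());
            Map<String, Double> score = new HashMap<>();
            for (int i = 0; i < terms.size(); i++) {
                float[] cand = ranked.get(terms.get(i));
                double s = VectorSimilarity.cosine(query, cand);
                for (int j = 0; j < i; j++) {
                    s += VectorSimilarity.cosine(ranked.get(terms.get(j)), cand)
                         / Math.max(1, i);
                }
                score.put(terms.get(i), s);
            }
            terms.sort(Comparator.comparingDouble(t -> -score.get(t)));
            return new ArrayList<>(terms.subList(0, Math.min(keep, terms.size())));
        }
    }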
8. Word embedding: polysemy
• All senses of a surface form are treated as if they are a single
concept, e.g. cat (animal, person, equipment manufacturer),
so the vector is trained over multiple unrelated contexts.
This can reduce the vector’s distinguishing power.
9. Lucene/Solr Dimensional Points
• Added in Lucene 6.0
• Replaced Trie* numerics
• Faster range search
• K nearest neighbor search
• 1-8 dimensions per value, 1-16 bytes per dimension
• Underlying data structure: Block k-
dimensional trees, aka bkd-trees
• Binary tree with multiple values in “leaf”
blocks (currently 512-1024 by default)
• Recursive split along the widest dimension until a
block contains fewer than the configured maximum number of values
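A small example of the stock dimensional-points API (Lucene 6.x): index a two-dimensional FloatPoint and search it with a box-shaped range query. The field name and values are illustrative.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FloatPoint;
    import org.apache.lucene.search.Query;

    class PointsExample {
        static Document makeDoc() {
            Document doc = new Document();
            doc.add(new FloatPoint("location", 41.8781f, -87.6298f)); // 2 dims
            return doc; // add via IndexWriter as usual
        }

        static Query boxQuery() {
            // Matches points inside the per-dimension [lower, upper] box.
            return FloatPoint.newRangeQuery("location",
                new float[] { 40f, -90f },   // lower corner
                new float[] { 43f, -85f });  // upper corner
        }
    }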
10. Lucene modifications: point dimensions++
• Lucene 6.6.1
• Dimensional points are limited to 8 dimensions
- Prior to Lucene 6.0, the limit was briefly 255 dimensions
• Increased limit to 300 dimensions
- For a few FloatPoints, 300-dimensional values seemed to work
- Failed with more points: negative-dimension exceptions
• Trial and error -> 127 dimensions works
LUCENE-6917
r1719562
12. Lucene modifications: FloatPointNearestNeighbor
• Adapted lucene/sandbox LatLonPoint.nearest() to
k-dimensional FloatPoints
• In some ways this was a simplification:
- LatLonPoint.nearest() calculates the distance in meters
between two points on a sphere in polar coordinates, and
requires special dateline handling
• FloatPointNearestNeighbor calculates the Euclidean distance
between two points using the Pythagorean formula:
d(a, b) = √( Σᵢ (aᵢ − bᵢ)² )
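The same computation in Java; comparing squared distances lets a nearest-neighbor search skip the square root when only the ordering matters:

    class EuclideanDistance {
        // Straight Pythagorean distance between two k-dimensional points.
        static double distance(float[] a, float[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }
    }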
13. Data preparation: English Wikipedia
• English Wikipedia 1/1/2016 dump
- ~5M articles
- 57GB uncompressed
• Converted to plaintext with Giuseppe Attardi’s Wikiextractor:
http://attardi.github.io/wikiextractor/ -> 11GB
• Sentence segmentation with OpenNLP’s pre-trained English model
• Tokenization using Lucene StandardTokenizer+LowercaseFilter (sketched below)
• Significant phrase identification using word2vec’s word2phrase
(twice, to produce up to 4-grams)
• Vocabulary size: 4.1M (2.6M phrases); corpus size: 1.7B words/phrases
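The tokenization step above, expressed as a Lucene 6.x Analyzer: a StandardTokenizer followed by a LowerCaseFilter.

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    class PlainTextAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();         // Unicode word segmentation
            TokenStream filtered = new LowerCaseFilter(source); // lowercase all terms
            return new TokenStreamComponents(source, filtered);
        }
    }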
15. Data preparation: indexing
• Created a side-car index in which each document holds a
Wikipedia term or phrase and its word-embedding vector
as a 127-dimension FloatPoint (document construction sketched below)
- Initially indexed on an SSD, but segment merging used
all free space (>100GB) for offline point sorting
- On a larger rotating disk, indexing took about 5 hours,
and disk usage peaked at about 60GB.
- Final index size: 2GB on disk
• Created full Wikipedia index
- Indexing took ~30 minutes
- 10GB disk used
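What one side-car document might look like, as a sketch: the surface term stored for retrieval, plus its embedding as a 127-dimension FloatPoint. Field names are illustrative, and stock Lucene caps points at 8 dimensions, so this requires the patched build described earlier.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FloatPoint;
    import org.apache.lucene.document.StringField;

    class SideCarDoc {
        // vector.length == 127 in this experiment
        static Document make(String term, float[] vector) {
            Document doc = new Document();
            doc.add(new StringField("term", term, Field.Store.YES)); // exact match + retrievable
            doc.add(new FloatPoint("vector", vector));               // indexed for KNN search
            return doc;
        }
    }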
16. Searching
• Adapted Lucene SearchFiles demo code to expand queries
• Query parser analyzer: StandardTokenizer+LowercaseFilter
+ShingleFilter
• Expand the query (both strategies sketched below):
- Look up vectors for all analyzed terms, sum them, perform a
KNN search (K=1) against the side-car index, then add the
resulting term to the final query
- For each analyzed term, perform a KNN search (K=2) against the
side-car index, then add the resulting terms to the final query
• Query the Wikipedia index with the expanded query
• This is not “direct concept search”!
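Both expansion strategies as a sketch; lookupVector() and nearestTerms() are hypothetical helpers over the side-car index (the latter wrapping the k-nearest-neighbor search).

    import java.util.ArrayList;
    import java.util.List;

    class QueryExpansion {
        static String expand(List<String> analyzedTerms) {
            List<String> expanded = new ArrayList<>(analyzedTerms);
            // 1) Sum all term vectors; add the single nearest term (K=1).
            float[] sum = null;
            for (String term : analyzedTerms) {
                float[] v = lookupVector(term);
                if (sum == null) {
                    sum = v.clone();
                } else {
                    for (int i = 0; i < sum.length; i++) sum[i] += v[i];
                }
            }
            expanded.addAll(nearestTerms(sum, 1));
            // 2) Add each term's two nearest neighbors (K=2).
            for (String term : analyzedTerms) {
                expanded.addAll(nearestTerms(lookupVector(term), 2));
            }
            return String.join(" ", expanded);
        }

        // Hypothetical helpers, to be backed by the side-car index:
        static float[] lookupVector(String term) {
            throw new UnsupportedOperationException();
        }
        static List<String> nearestTerms(float[] vector, int k) {
            throw new UnsupportedOperationException();
        }
    }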
18. Conclusions
• Building high-dimension points indexes with Lucene is slow
• Lucene/Solr should have KNN search over k-dimensional points
• Word embedding proximity isn’t entirely reliable for synonymy
• Faiss: a library for efficient similarity search
https://code.facebook.com/posts/1373769912645926/faiss-a-library-for-efficient-similarity-search/
- KNN search on high-dimension vectors on GPUs