Approximate nearest neighbors & vector models
I’m Erik
• @fulhack
• Author of Annoy, Luigi
• Currently CTO of Better
• Previously 5 years at Spotify
What’s nearest neighbor(s)?
• Let’s say you have a bunch of points
Grab a bunch of points
5 nearest neighbors
20 nearest neighbors
100 nearest neighbors
…But what’s the point?
• Vector models are everywhere
• Lots of applications (language processing, recommender systems, computer vision)
MNIST example
• 28x28 = 784-dimensional dataset
• Define distance in terms of pixels:
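For example, one natural pixel-space metric is Euclidean distance over the flattened 784-dimensional pixel vectors; a minimal sketch (random arrays stand in for MNIST digits):

import numpy as np

# Stand-ins for two 28x28 MNIST digits (random pixels, just for illustration).
a = np.random.rand(28, 28)
b = np.random.rand(28, 28)

# Pixel-space distance: Euclidean distance between the flattened 784-dim vectors.
d = np.linalg.norm(a.ravel() - b.ravel())
print(d)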
MNIST neighbors
…Much better approach
1. Start with high dimensional data
2. Run dimensionality reduction to 10-1000 dims
3. Do stuff in the low-dimensional space (sketched below)
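A minimal sketch of these three steps, using PCA as one arbitrary choice of dimensionality reduction (the data here is random and just stands in for 784-dimensional MNIST vectors):

import numpy as np
from sklearn.decomposition import PCA

# 1. High-dimensional data (random stand-in for 784-dim MNIST vectors).
X = np.random.rand(10000, 784)

# 2. Reduce to a small number of dimensions (64 here; anywhere in 10-1000 works).
X_small = PCA(n_components=64).fit_transform(X)

# 3. Do nearest-neighbor search (or anything else) in the smaller space.
print(X_small.shape)  # (10000, 64)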
Deep learning for food
• Deep model trained on a GPU on 6M random pics downloaded from Yelp
[Network architecture diagram: stacks of 3x3 convolutions with 2x2 maxpooling
(156x156x32 → 154x154x32 → 152x152x32 → 76x76x64 → 74x74x64 → 72x72x64 →
36x36x128 → 34x34x128 → 32x32x128 → 16x16x256 → 14x14x256 → 12x12x256 →
6x6x512 → 4x4x512 → 2x2x512), then fully connected layers with dropout
(2048 → 2048), a 128-dimensional bottleneck layer, and a 1244-unit output]
Distance in smaller space
1. Run image through the network
2. Use the 128-dimensional bottleneck layer as an item vector
3. Use cosine distance in the reduced space
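A rough sketch of the idea; the vectors below are random stand-ins for the 128-dimensional bottleneck activations you would get by running each image through the network:

import numpy as np

def cosine_distance(u, v):
    # Cosine distance = 1 - cosine similarity.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Random stand-ins for the 128-dim bottleneck activations of each food pic.
bottleneck_vectors = np.random.randn(1000, 128)
query = np.random.randn(128)  # bottleneck vector of the query image

# Rank all items by cosine distance in the reduced space.
distances = [cosine_distance(query, v) for v in bottleneck_vectors]
print(np.argsort(distances)[:10])  # indices of the nearest food pics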
Nearest food pics
Vector methods for text
• TF-IDF (old) – no dimensionality reduction
• Latent Semantic Analysis (1988)
• Probabilistic Latent Semantic Analysis (2000)
• Semantic Hashing (2007)
• word2vec (2013), RNN, LSTM, …
Represent documents and/or words as f-dimensional vectors
[Figure: words (banana, apple, boat) plotted against two latent factors]
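For example, pretrained word2vec vectors (the same GoogleNews file queried later in this deck) can be loaded and inspected with gensim; the local file path is an assumption:

from gensim.models import KeyedVectors

# Load 300-dimensional pretrained word2vec vectors (path is an assumption).
kv = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

print(kv['banana'].shape)                 # each word is a 300-dim vector
print(kv.most_similar('banana', topn=5))  # nearest words by cosine similarity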
Vector methods for collaborative filtering
• Supervised methods: See everything from the Netflix Prize
• Unsupervised: Use NLP methods
CF vectors – examples

IPMF item-item:

$$P(i \to j) = \exp(b_j^T b_i)/Z_i = \frac{\exp(b_j^T b_i)}{\sum_k \exp(b_k^T b_i)}$$

Vectors:

$$p_{ui} = a_u^T b_i \qquad \mathrm{sim}_{ij} = \cos(b_i, b_j) = \frac{b_i^T b_j}{|b_i|\,|b_j|} \qquad O(f)$$
i                        j                        sim_ij
2pac                     2pac                     1.0
2pac                     Notorious B.I.G.         0.91
2pac                     Dr. Dre                  0.87
2pac                     Florence + the Machine   0.26
Florence + the Machine   Lana Del Rey             0.81
IPMF item-item MDS:

$$P(i \to j) = \exp(-|b_j - b_i|^2)/Z_i = \frac{\exp(-|b_j - b_i|^2)}{\sum_k \exp(-|b_k - b_i|^2)}$$

$$\mathrm{sim}_{ij} = -|b_j - b_i|^2$$
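A small sketch of the item-item cosine similarity defined above, with random vectors standing in for learned item factors b_i (real similarities like the ones in the table come from factors trained on play counts):

import numpy as np

f = 40  # number of latent factors
items = ['2pac', 'Notorious B.I.G.', 'Dr. Dre', 'Florence + the Machine']

# Random stand-ins for learned item vectors b_i.
b = {name: np.random.randn(f) for name in items}

def sim(u, v):
    # sim_ij = b_i^T b_j / (|b_i| |b_j|); O(f) per pair.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

for other in items[1:]:
    print('2pac', '<->', other, round(sim(b['2pac'], b[other]), 2))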
Training data: (u, i, count) tuples
Geospatial indexing
• Ping the world: https://github.com/erikbern/ping
• k-NN regression using Annoy
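A rough sketch of what k-NN regression with Annoy could look like here: map (lat, lng) to points on the unit sphere, index them, and estimate a value (e.g. ping latency) at a new location by averaging its nearest neighbors. All data and parameter choices below are made up for illustration.

import math
import random
from annoy import AnnoyIndex

def latlng_to_xyz(lat, lng):
    # Map (lat, lng) to a point on the unit sphere so Euclidean distance
    # approximates geographic proximity.
    lat, lng = math.radians(lat), math.radians(lng)
    return [math.cos(lat) * math.cos(lng),
            math.cos(lat) * math.sin(lng),
            math.sin(lat)]

# Made-up training data: (lat, lng, measured ping in ms).
data = [(random.uniform(-90, 90), random.uniform(-180, 180),
         random.uniform(10, 300)) for _ in range(10000)]

index = AnnoyIndex(3, 'euclidean')
for i, (lat, lng, _) in enumerate(data):
    index.add_item(i, latlng_to_xyz(lat, lng))
index.build(10)

def predict_ping(lat, lng, k=10):
    # k-NN regression: average the ping of the k nearest known locations.
    neighbors = index.get_nns_by_vector(latlng_to_xyz(lat, lng), k)
    return sum(data[i][2] for i in neighbors) / k

print(predict_ping(59.33, 18.07))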
Nearest neighbors the brute force way
• We can always do an exhaustive search to find the nearest neighbors
• Imagine MySQL doing a linear scan for every query…
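Brute force is trivially simple and exact, just slow; a minimal sketch:

import numpy as np

# n items, f dimensions (random data just for illustration).
X = np.random.randn(100_000, 128)
q = np.random.randn(128)

# Exhaustive search: distance to every item, then sort. Exact, but O(n * f)
# per query, which is what the timings on the next slide show.
dists = np.linalg.norm(X - q, axis=1)
print(np.argsort(dists)[:10])  # indices of the 10 nearest neighbors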
Using word2vec’s brute
force search
$ time echo -e "chinese river\nEXIT\n" | ./distance GoogleNews-vectors-negative300.bin

Qiantang_River      0.597229
Yangtse             0.587990
Yangtze_River       0.576738
lake                0.567611
rivers              0.567264
creek               0.567135
Mekong_river        0.550916
Xiangjiang_River    0.550451
Beas_river          0.549198
Minjiang_River      0.548721

real    2m34.346s
user    1m36.235s
sys     0m16.362s
Introducing Annoy
• https://github.com/spotify/annoy
• mmap-based ANN library
• Written in C++, with Python and R bindings
• 585 stars on GitHub
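A minimal usage sketch of the Python API (random vectors stand in for real embeddings; in the next slides the index holds word2vec vectors):

import random
from annoy import AnnoyIndex

f = 128  # vector dimensionality
index = AnnoyIndex(f, 'angular')  # 'angular' ~ cosine distance

# Add (id, vector) pairs.
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])

index.build(10)          # build 10 trees: more trees = better recall, more RAM
index.save('items.ann')  # the saved index is mmap-ed back in when loaded

# 10 approximate nearest neighbors of item 0, with distances.
print(index.get_nns_by_item(0, 10, include_distances=True))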
Using Annoy’s search
$ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 100000

Yangtse           0.907756
Yangtze_River     0.920067
rivers            0.930308
creek             0.930447
Mekong_river      0.947718
Huangpu_River     0.951850
Ganges            0.959261
Thu_Bon           0.960545
Yangtze           0.966199
Yangtze_river     0.978978

real    0m0.470s
user    0m0.285s
sys     0m0.162s
Using Annoy’s search
$ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 1000000

Qiantang_River      0.897519
Yangtse             0.907756
Yangtze_River       0.920067
lake                0.929934
rivers              0.930308
creek               0.930447
Mekong_river        0.947718
Xiangjiang_River    0.948208
Beas_river          0.949528
Minjiang_River      0.950031

real    0m2.013s
user    0m1.386s
sys     0m0.614s
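The trailing number in the two commands above appears to control how many nodes are inspected per query (Annoy exposes this as the search_k argument): searching more nodes is slower but recovers the same neighbors as the brute-force scan. A self-contained sketch of that knob, with random vectors and made-up sizes:

import random
from annoy import AnnoyIndex

f = 300
index = AnnoyIndex(f, 'angular')
for i in range(10000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])
index.build(50)

query = [random.gauss(0, 1) for _ in range(f)]

# Larger search_k inspects more nodes per query: slower, but higher recall.
rough = index.get_nns_by_vector(query, 10, search_k=100000)
accurate = index.get_nns_by_vector(query, 10, search_k=1000000)
print(rough)
print(accurate)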
(performance)
1. Building an Annoy index
Start with the point set
Split it in two halves
Split again
Again…
…more iterations later
Side note: making trees small
• Split until K items in each leaf (K~100)
• Takes O(n/K) memory instead of O(n)
Binary tree
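A simplified sketch of the tree-building idea (not Annoy's actual C++ implementation): pick two random points, split by the hyperplane equidistant between them, and recurse until each leaf holds at most K items.

import random
import numpy as np

K = 100  # stop splitting when a node holds at most K items

def build_tree(points, indices):
    # Leaf: small enough to scan exhaustively at query time.
    if len(indices) <= K:
        return {'leaf': indices}
    # Pick two random points and split by the hyperplane equidistant between
    # them (a simplified version of how Annoy chooses its splits).
    i, j = random.sample(indices, 2)
    normal = points[i] - points[j]
    offset = -np.dot(normal, (points[i] + points[j]) / 2.0)
    left = [k for k in indices if np.dot(normal, points[k]) + offset > 0]
    right = [k for k in indices if np.dot(normal, points[k]) + offset <= 0]
    if not left or not right:  # degenerate split: just cut the list in half
        left, right = indices[:len(indices) // 2], indices[len(indices) // 2:]
    return {'normal': normal, 'offset': offset,
            'left': build_tree(points, left),
            'right': build_tree(points, right)}

points = np.random.randn(10000, 2)
tree = build_tree(points, list(range(len(points))))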
2. Searching
Nearest neighbors
Searching the tree
Problemo
• The point that’s the closest isn’t necessarily in the same leaf of the binary tree
• Two points that are really close may end up on different sides of a split
• Solution: go to both sides of a split if it’s close
Trick 1: Priority queue
• Traverse the tree using a priority queue
• Sort by min(margin) for the path from the root
Trick 2: many trees
• Construct trees randomly many times
• Use the same priority queue to search all of them at the same time
heap + forest = best
• Since we use a priority queue, we will dive down the best splits with the biggest distance
• More trees always helps!
• Only constraint is more trees require more RAM
Annoy query structure
1. Use priority queue to search all trees until we’ve found k items
2. Take union and remove duplicates (a lot)
3. Compute distance for remaining items
4. Return the nearest n items
Find candidates
Take union of all leaves
Compute distances
Return nearest neighbors
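Putting the steps together, a self-contained sketch of the query procedure over a small forest (it reuses the same simplified splitter as the building sketch above; Annoy's real implementation is in C++ and differs in details):

import heapq
import random
import numpy as np

K = 16  # max items per leaf

def build_tree(points, indices):
    # Same simplified random-hyperplane splitter as in the building sketch.
    if len(indices) <= K:
        return {'leaf': indices}
    i, j = random.sample(indices, 2)
    normal = points[i] - points[j]
    offset = -np.dot(normal, (points[i] + points[j]) / 2.0)
    left = [k for k in indices if np.dot(normal, points[k]) + offset > 0]
    right = [k for k in indices if np.dot(normal, points[k]) + offset <= 0]
    if not left or not right:
        left, right = indices[:len(indices) // 2], indices[len(indices) // 2:]
    return {'normal': normal, 'offset': offset,
            'left': build_tree(points, left), 'right': build_tree(points, right)}

def query(forest, points, q, n, search_k):
    # 1. One priority queue over nodes from *all* trees, keyed by the smallest
    #    margin seen along the path from the root (negated for the min-heap).
    heap, tick = [], 0
    for root in forest:
        heapq.heappush(heap, (-np.inf, tick, root)); tick += 1
    candidates = []
    while heap and len(candidates) < search_k:
        neg_margin, _, node = heapq.heappop(heap)
        path_margin = -neg_margin
        if 'leaf' in node:
            candidates.extend(node['leaf'])
            continue
        margin = np.dot(node['normal'], q) + node['offset']
        near, far = (node['left'], node['right']) if margin > 0 else (node['right'], node['left'])
        heapq.heappush(heap, (-min(path_margin, abs(margin)), tick, near)); tick += 1
        heapq.heappush(heap, (-min(path_margin, -abs(margin)), tick, far)); tick += 1
    # 2. Take the union (removes duplicates across trees).
    candidates = set(candidates)
    # 3. Compute exact distances for the remaining items; 4. return nearest n.
    return sorted(candidates, key=lambda i: np.linalg.norm(points[i] - q))[:n]

points = np.random.randn(10000, 8)
forest = [build_tree(points, list(range(len(points)))) for _ in range(10)]
print(query(forest, points, points[0], n=10, search_k=1000))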
“Curse of dimensionality”
Are we screwed?
• Would be nice if the data has a much smaller “intrinsic dimension”!
Improving the algorithm
[Plot: queries per second vs. 1-NN accuracy; up = faster, right = more accurate]
• https://github.com/erikbern/ann-benchmarks
ann-benchmarks
perf/accuracy tradeoffs
[Plot: queries per second vs. 1-NN accuracy; searching more nodes and adding more trees trade speed for accuracy]
Things that work
• Smarter plane splitting
• Priority queue heuristics
• Search more nodes than number of results
• Align nodes closer together
Things that don’t work
• Use lower-precision arithmetic
• Priority queue by other heuristics (number of trees)
• Precompute vector norms
Things for the future
• Use an optimization scheme for tree building
• Add more distance functions (e.g. edit distance)
• Use a proper KV store as a backend (e.g. LMDB) to support incremental adds, out-of-core, arbitrary keys: https://github.com/Houzz/annoy2
Thanks!
• https://github.com/spotify/annoy
• https://github.com/erikbern/ann-benchmarks
• https://github.com/erikbern/ann-presentation
• erikbern.com
• @fulhack
Questions?
