Approximate nearest neighbors & vector models
I’m Erik
• @fulhack
• Author of Annoy, Luigi
• Currently CTO of Better
• Previously 5 years at Spotify
What’s nearest neighbor(s)?
• Let’s say you have a bunch of points
Grab a bunch of points
5 nearest neighbors
20 nearest neighbors
100 nearest neighbors
…But what’s the point?
• Vector models are everywhere
• Lots of applications (language processing, recommender systems, computer vision)
MNIST example
• 28x28 = 784-dimensional dataset
• Define distance in terms of pixels:
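For example, one natural pixel-space metric is Euclidean distance over the flattened 784-dimensional pixel vectors; a minimal sketch (random arrays stand in for MNIST digits):

import numpy as np

# Stand-ins for two 28x28 MNIST digits (random pixels, just for illustration).
a = np.random.rand(28, 28)
b = np.random.rand(28, 28)

# Pixel-space distance: Euclidean distance between the flattened 784-dim vectors.
d = np.linalg.norm(a.ravel() - b.ravel())
print(d)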
MNIST neighbors
…Much better approach
1. Start with high dimensional data
2. Run dimensionality reduction to 10-1000 dims
3. Do stuff in the low-dimensional space (sketched below)
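A minimal sketch of these three steps, using PCA as one arbitrary choice of dimensionality reduction (the data here is random and just stands in for 784-dimensional MNIST vectors):

import numpy as np
from sklearn.decomposition import PCA

# 1. High-dimensional data (random stand-in for 784-dim MNIST vectors).
X = np.random.rand(10000, 784)

# 2. Reduce to a small number of dimensions (64 here; anywhere in 10-1000 works).
X_small = PCA(n_components=64).fit_transform(X)

# 3. Do nearest-neighbor search (or anything else) in the smaller space.
print(X_small.shape)  # (10000, 64)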
Deep learning for food
• Deep model trained on a GPU on 6M random pics downloaded from Yelp
[Network architecture diagram: stacks of 3x3 convolutions with 2x2 maxpooling
(156x156x32 → 154x154x32 → 152x152x32 → 76x76x64 → 74x74x64 → 72x72x64 →
36x36x128 → 34x34x128 → 32x32x128 → 16x16x256 → 14x14x256 → 12x12x256 →
6x6x512 → 4x4x512 → 2x2x512), then fully connected layers with dropout
(2048 → 2048), a 128-dimensional bottleneck layer, and a 1244-unit output]
Distance in smaller space
1. Run image through the network
2. Use the 128-dimensional bottleneck layer as an item vector
3. Use cosine distance in the reduced space
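A rough sketch of the idea; the vectors below are random stand-ins for the 128-dimensional bottleneck activations you would get by running each image through the network:

import numpy as np

def cosine_distance(u, v):
    # Cosine distance = 1 - cosine similarity.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Random stand-ins for the 128-dim bottleneck activations of each food pic.
bottleneck_vectors = np.random.randn(1000, 128)
query = np.random.randn(128)  # bottleneck vector of the query image

# Rank all items by cosine distance in the reduced space.
distances = [cosine_distance(query, v) for v in bottleneck_vectors]
print(np.argsort(distances)[:10])  # indices of the nearest food pics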
Nearest food pics
Vector methods for text
• TF-IDF (old) – no dimensionality reduction
• Latent Semantic Analysis (1988)
• Probabilistic Latent Semantic Analysis (2000)
• Semantic Hashing (2007)
• word2vec (2013), RNN, LSTM, …
Represent documents and/or words as f-dimensional vectors
[Figure: words (banana, apple, boat) plotted against two latent factors]
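For example, pretrained word2vec vectors (the same GoogleNews file queried later in this deck) can be loaded and inspected with gensim; the local file path is an assumption:

from gensim.models import KeyedVectors

# Load 300-dimensional pretrained word2vec vectors (path is an assumption).
kv = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

print(kv['banana'].shape)                 # each word is a 300-dim vector
print(kv.most_similar('banana', topn=5))  # nearest words by cosine similarity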
Vector methods for collaborative filtering
• Supervised methods: See everything from the Netflix Prize
• Unsupervised: Use NLP methods
CF vectors – examples

IPMF item-item:

$$P(i \to j) = \exp(b_j^T b_i)/Z_i = \frac{\exp(b_j^T b_i)}{\sum_k \exp(b_k^T b_i)}$$

Vectors:

$$p_{ui} = a_u^T b_i \qquad \mathrm{sim}_{ij} = \cos(b_i, b_j) = \frac{b_i^T b_j}{|b_i|\,|b_j|} \qquad O(f)$$
i                        j                        sim_ij
2pac                     2pac                     1.0
2pac                     Notorious B.I.G.         0.91
2pac                     Dr. Dre                  0.87
2pac                     Florence + the Machine   0.26
Florence + the Machine   Lana Del Rey             0.81
IPMF item-item MDS:

$$P(i \to j) = \exp(-|b_j - b_i|^2)/Z_i = \frac{\exp(-|b_j - b_i|^2)}{\sum_k \exp(-|b_k - b_i|^2)}$$

$$\mathrm{sim}_{ij} = -|b_j - b_i|^2$$
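A small sketch of the item-item cosine similarity defined above, with random vectors standing in for learned item factors b_i (real similarities like the ones in the table come from factors trained on play counts):

import numpy as np

f = 40  # number of latent factors
items = ['2pac', 'Notorious B.I.G.', 'Dr. Dre', 'Florence + the Machine']

# Random stand-ins for learned item vectors b_i.
b = {name: np.random.randn(f) for name in items}

def sim(u, v):
    # sim_ij = b_i^T b_j / (|b_i| |b_j|); O(f) per pair.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

for other in items[1:]:
    print('2pac', '<->', other, round(sim(b['2pac'], b[other]), 2))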
Training data: (u, i, count) tuples
Geospatial indexing
• Ping the world: https://github.com/erikbern/ping
• k-NN regression using Annoy
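A rough sketch of what k-NN regression with Annoy could look like here: map (lat, lng) to points on the unit sphere, index them, and estimate a value (e.g. ping latency) at a new location by averaging its nearest neighbors. All data and parameter choices below are made up for illustration.

import math
import random
from annoy import AnnoyIndex

def latlng_to_xyz(lat, lng):
    # Map (lat, lng) to a point on the unit sphere so Euclidean distance
    # approximates geographic proximity.
    lat, lng = math.radians(lat), math.radians(lng)
    return [math.cos(lat) * math.cos(lng),
            math.cos(lat) * math.sin(lng),
            math.sin(lat)]

# Made-up training data: (lat, lng, measured ping in ms).
data = [(random.uniform(-90, 90), random.uniform(-180, 180),
         random.uniform(10, 300)) for _ in range(10000)]

index = AnnoyIndex(3, 'euclidean')
for i, (lat, lng, _) in enumerate(data):
    index.add_item(i, latlng_to_xyz(lat, lng))
index.build(10)

def predict_ping(lat, lng, k=10):
    # k-NN regression: average the ping of the k nearest known locations.
    neighbors = index.get_nns_by_vector(latlng_to_xyz(lat, lng), k)
    return sum(data[i][2] for i in neighbors) / k

print(predict_ping(59.33, 18.07))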
Nearest neighbors the brute force way
• We can always do an exhaustive search to find the nearest neighbors
• Imagine MySQL doing a linear scan for every query…
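Brute force is trivially simple and exact, just slow; a minimal sketch:

import numpy as np

# n items, f dimensions (random data just for illustration).
X = np.random.randn(100_000, 128)
q = np.random.randn(128)

# Exhaustive search: distance to every item, then sort. Exact, but O(n * f)
# per query, which is what the timings on the next slide show.
dists = np.linalg.norm(X - q, axis=1)
print(np.argsort(dists)[:10])  # indices of the 10 nearest neighbors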
Using word2vec’s brute
force search
$ time echo -e "chinese river\nEXIT\n" | ./distance GoogleNews-vectors-negative300.bin

Qiantang_River      0.597229
Yangtse             0.587990
Yangtze_River       0.576738
lake                0.567611
rivers              0.567264
creek               0.567135
Mekong_river        0.550916
Xiangjiang_River    0.550451
Beas_river          0.549198
Minjiang_River      0.548721

real    2m34.346s
user    1m36.235s
sys     0m16.362s
Introducing Annoy
• https://github.com/spotify/annoy
• mmap-based ANN library
• Written in C++, with Python and R bindings
• 585 stars on GitHub
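A minimal usage sketch of the Python API (random vectors stand in for real embeddings; in the next slides the index holds word2vec vectors):

import random
from annoy import AnnoyIndex

f = 128  # vector dimensionality
index = AnnoyIndex(f, 'angular')  # 'angular' ~ cosine distance

# Add (id, vector) pairs.
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])

index.build(10)          # build 10 trees: more trees = better recall, more RAM
index.save('items.ann')  # the saved index is mmap-ed back in when loaded

# 10 approximate nearest neighbors of item 0, with distances.
print(index.get_nns_by_item(0, 10, include_distances=True))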
Using Annoy’s search
$ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 100000

Yangtse           0.907756
Yangtze_River     0.920067
rivers            0.930308
creek             0.930447
Mekong_river      0.947718
Huangpu_River     0.951850
Ganges            0.959261
Thu_Bon           0.960545
Yangtze           0.966199
Yangtze_river     0.978978

real    0m0.470s
user    0m0.285s
sys     0m0.162s
Using Annoy’s search
$ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 1000000

Qiantang_River      0.897519
Yangtse             0.907756
Yangtze_River       0.920067
lake                0.929934
rivers              0.930308
creek               0.930447
Mekong_river        0.947718
Xiangjiang_River    0.948208
Beas_river          0.949528
Minjiang_River      0.950031

real    0m2.013s
user    0m1.386s
sys     0m0.614s
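The trailing number in the two commands above appears to control how many nodes are inspected per query (Annoy exposes this as the search_k argument): searching more nodes is slower but recovers the same neighbors as the brute-force scan. A self-contained sketch of that knob, with random vectors and made-up sizes:

import random
from annoy import AnnoyIndex

f = 300
index = AnnoyIndex(f, 'angular')
for i in range(10000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])
index.build(50)

query = [random.gauss(0, 1) for _ in range(f)]

# Larger search_k inspects more nodes per query: slower, but higher recall.
rough = index.get_nns_by_vector(query, 10, search_k=100000)
accurate = index.get_nns_by_vector(query, 10, search_k=1000000)
print(rough)
print(accurate)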
(performance)
1. Building an Annoy index
Start with the point set
Split it in two halves
Split again
Again…
…more iterations later
Side note: making trees small
• Split until K items in each leaf (K~100)
• Takes O(n/K) memory instead of O(n)
Binary tree
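A simplified sketch of the tree-building idea (not Annoy's actual C++ implementation): pick two random points, split by the hyperplane equidistant between them, and recurse until each leaf holds at most K items.

import random
import numpy as np

K = 100  # stop splitting when a node holds at most K items

def build_tree(points, indices):
    # Leaf: small enough to scan exhaustively at query time.
    if len(indices) <= K:
        return {'leaf': indices}
    # Pick two random points and split by the hyperplane equidistant between
    # them (a simplified version of how Annoy chooses its splits).
    i, j = random.sample(indices, 2)
    normal = points[i] - points[j]
    offset = -np.dot(normal, (points[i] + points[j]) / 2.0)
    left = [k for k in indices if np.dot(normal, points[k]) + offset > 0]
    right = [k for k in indices if np.dot(normal, points[k]) + offset <= 0]
    if not left or not right:  # degenerate split: just cut the list in half
        left, right = indices[:len(indices) // 2], indices[len(indices) // 2:]
    return {'normal': normal, 'offset': offset,
            'left': build_tree(points, left),
            'right': build_tree(points, right)}

points = np.random.randn(10000, 2)
tree = build_tree(points, list(range(len(points))))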
2. Searching
Nearest neighbors
Searching the tree
Problemo
• The point that’s the closest isn’t necessarily in the same leaf of the binary tree
• Two points that are really close may end up on different sides of a split
• Solution: go to both sides of a split if it’s close
Trick 1: Priority queue
• Traverse the tree using a priority queue
• Sort by min(margin) for the path from the root
Trick 2: many trees
• Construct trees randomly many times
• Use the same priority queue to search all of them at the same time
heap + forest = best
• Since we use a priority queue, we will dive down the best splits with the biggest distance
• More trees always helps!
• Only constraint is more trees require more RAM
Annoy query structure
1. Use priority queue to search all trees until we’ve found k items
2. Take union and remove duplicates (a lot)
3. Compute distance for remaining items
4. Return the nearest n items
Find candidates
Take union of all leaves
Compute distances
Return nearest neighbors
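Putting the steps together, a self-contained sketch of the query procedure over a small forest (it reuses the same simplified splitter as the building sketch above; Annoy's real implementation is in C++ and differs in details):

import heapq
import random
import numpy as np

K = 16  # max items per leaf

def build_tree(points, indices):
    # Same simplified random-hyperplane splitter as in the building sketch.
    if len(indices) <= K:
        return {'leaf': indices}
    i, j = random.sample(indices, 2)
    normal = points[i] - points[j]
    offset = -np.dot(normal, (points[i] + points[j]) / 2.0)
    left = [k for k in indices if np.dot(normal, points[k]) + offset > 0]
    right = [k for k in indices if np.dot(normal, points[k]) + offset <= 0]
    if not left or not right:
        left, right = indices[:len(indices) // 2], indices[len(indices) // 2:]
    return {'normal': normal, 'offset': offset,
            'left': build_tree(points, left), 'right': build_tree(points, right)}

def query(forest, points, q, n, search_k):
    # 1. One priority queue over nodes from *all* trees, keyed by the smallest
    #    margin seen along the path from the root (negated for the min-heap).
    heap, tick = [], 0
    for root in forest:
        heapq.heappush(heap, (-np.inf, tick, root)); tick += 1
    candidates = []
    while heap and len(candidates) < search_k:
        neg_margin, _, node = heapq.heappop(heap)
        path_margin = -neg_margin
        if 'leaf' in node:
            candidates.extend(node['leaf'])
            continue
        margin = np.dot(node['normal'], q) + node['offset']
        near, far = (node['left'], node['right']) if margin > 0 else (node['right'], node['left'])
        heapq.heappush(heap, (-min(path_margin, abs(margin)), tick, near)); tick += 1
        heapq.heappush(heap, (-min(path_margin, -abs(margin)), tick, far)); tick += 1
    # 2. Take the union (removes duplicates across trees).
    candidates = set(candidates)
    # 3. Compute exact distances for the remaining items; 4. return nearest n.
    return sorted(candidates, key=lambda i: np.linalg.norm(points[i] - q))[:n]

points = np.random.randn(10000, 8)
forest = [build_tree(points, list(range(len(points)))) for _ in range(10)]
print(query(forest, points, points[0], n=10, search_k=1000))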
“Curse of dimensionality”
Are we screwed?
• Would be nice if the data has a much smaller “intrinsic dimension”!
Improving the algorithm
[Plot: queries per second vs. 1-NN accuracy; up = faster, right = more accurate]
• https://github.com/erikbern/ann-benchmarks
ann-benchmarks
perf/accuracy tradeoffs
[Plot: queries per second vs. 1-NN accuracy; searching more nodes and adding more trees trade speed for accuracy]
Things that work
• Smarter plane splitting
• Priority queue heuristics
• Search more nodes than number of results
• Align nodes closer together
Things that don’t work
• Use lower-precision arithmetic
• Priority queue by other heuristics (number of trees)
• Precompute vector norms
Things for the future
• Use an optimization scheme for tree building
• Add more distance functions (e.g. edit distance)
• Use a proper KV store as a backend (e.g. LMDB) to support incremental adds, out-of-core, arbitrary keys: https://github.com/Houzz/annoy2
Thanks!
• https://github.com/spotify/annoy
• https://github.com/erikbern/ann-benchmarks
• https://github.com/erikbern/ann-presentation
• erikbern.com
• @fulhack
Questions?
